diff --git a/README.md b/README.md
index ef5bdc66ef03131318e1dde627e0224cca9137fd..3cdb6e478ddf4f18af7f81bb3e321510903beb9d 100644
--- a/README.md
+++ b/README.md
@@ -22,6 +22,10 @@ organization for the purposes of conducting machine learning and deep neural
 networks research.  The system is general enough to be applicable in a wide
 variety of other domains, as well.
 
+Keep up to date with release announcements and security updates by
+subscribing to
+[announce@tensorflow.org](https://groups.google.com/a/tensorflow.org/forum/#!forum/announce).
+
 ## Installation
 *See [Installing TensorFlow](https://www.tensorflow.org/get_started/os_setup.html) for instructions on how to install our release binaries or how to build from source.*
 
diff --git a/RELEASE.md b/RELEASE.md
index 6f54dee58f75c29a16545ba25de12fe059baf1eb..c63d9f20c9a842ceed97afc25690073d082c42cb 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -1,3 +1,63 @@
+# Release 1.7.0
+
+## Major Features And Improvements
+* Eager mode is moving out of contrib; try `tf.enable_eager_execution()`.
+* Graph rewrites emulating fixed-point quantization compatible with TensorFlow Lite, supported by new `tf.contrib.quantize` package.
+* Easily customize gradient computation with `tf.custom_gradient`.
+* [TensorBoard Debugger Plugin](https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md), the graphical user interface (GUI) of TensorFlow Debugger (tfdbg), is now in alpha.
+* Experimental support for reading a sqlite database as a `Dataset` with new `tf.contrib.data.SqlDataset`.
+* Distributed Mutex / CriticalSection added to `tf.contrib.framework.CriticalSection`.
+* Better text processing with `tf.regex_replace`.
+* Easy, efficient sequence input with `tf.contrib.data.bucket_by_sequence_length`.
+
+## Bug Fixes and Other Changes
+* Accelerated Linear Algebra (XLA):
+  * Add `MaxPoolGradGrad` support for XLA
+  * CSE pass from TensorFlow is now disabled in XLA.
+* `tf.data`:
+  * `tf.data.Dataset`
+    * Add support for building C++ Dataset op kernels as external libraries, using the `tf.load_op_library()` mechanism.
+    * `Dataset.list_files()` now shuffles its output by default.
+    * `Dataset.shuffle(..., seed=tf.constant(0, dtype=tf.int64))` now yields the same sequence of elements as `Dataset.shuffle(..., seed=0)`.
+  * Add `num_parallel_reads` argument to `tf.data.TFRecordDataset`.
+* `tf.contrib`:
+  * `tf.contrib.bayesflow.halton_sequence` now supports randomization.
+  * Add support for scalars in `tf.contrib.all_reduce`.
+  * Add `effective_sample_size` to `tf.contrib.bayesflow.mcmc_diagnostics`.
+  * Add `potential_scale_reduction` to `tf.contrib.bayesflow.mcmc_diagnostics`.
+  * Add `BatchNormalization`, `Kumaraswamy` bijectors.
+  * Deprecate `tf.contrib.learn`. Please check contrib/learn/README.md for instructions on how to convert existing code.
+  * `tf.contrib.data`
+    * Remove deprecated `tf.contrib.data.Dataset`, `tf.contrib.data.Iterator`, `tf.contrib.data.FixedLengthRecordDataset`, `tf.contrib.data.TextLineDataset`, and `tf.contrib.data.TFRecordDataset` classes.
+    * Add `bucket_by_sequence_length`, `sliding_window_batch`, and `make_batched_features_dataset`.
+  * Remove unmaintained `tf.contrib.ndlstm`. You can find it externally at https://github.com/tmbarchive/tfndlstm.
+  * Move most of `tf.contrib.bayesflow` to its own repo: `tfp`.
+* Other:
+  * `tf.py_func` now reports the full stack trace if an exception occurs.
+  * Integrate `TPUClusterResolver` with GKE's integration for Cloud TPUs.
+  * Add a library for statistical testing of samplers.
+  * Add helpers to stream data from the GCE VM to a Cloud TPU.
+  * Integrate ClusterResolvers with TPUEstimator.
+  * Unify metropolis_hastings interface with HMC kernel.
+  * Move LIBXSMM convolutions to a separate --define flag so that they are disabled by default.
+  * Fix `MomentumOptimizer` lambda.
+  * Reduce `tfp.layers` boilerplate via programmable docstrings.
+  * Add `auc_with_confidence_intervals`, a method for computing the AUC and confidence interval with linearithmic time complexity.
+  * `regression_head` now accepts a custom link function, so users can supply their own when `array_ops.identity` does not meet their needs.
+  * Fix `initialized_value` and `initial_value` behaviors for `ResourceVariables` created from `VariableDef` protos.
+  * Add `TensorSpec` to represent the specification of Tensors.
+  * Constant folding pass is now deterministic.
+  * Support `float16` `dtype` in `tf.linalg.*`.
+  * Add `tf.estimator.export.TensorServingInputReceiver` that allows `tf.estimator.Estimator.export_savedmodel` to pass raw tensors to model functions.
+
+## Thanks to our Contributors
+
+This release contains contributions from many people at Google, as well as:
+
+4d55397500, Abe, Alistair Low, Andy Kernahan, Appledore, Ben, Ben Barsdell, Boris Pfahringer, Brad Wannow, Brett Koonce, Carl Thomé, cclauss, Chengzhi Chen, Chris Drake, Christopher Yeh, Clayne Robison, Codrut Grosu, Daniel Trebbien, Danny Goodman, David Goodwin, David Norman, Deron Eriksson, Donggeon Lim, Donny Viszneki, DosLin, DylanDmitri, Francisco Guerrero, Fred Reiss, gdh1995, Giuseppe, Glenn Weidner, gracehoney, Guozhong Zhuang, Haichen "Hc" Li, Harald Husum, harumitsu.nobuta, Henry Spivey, hsm207, Jekyll Song, Jerome, Jiongyan Zhang, jjsjann123, John Sungjin Park, Johnson145, JoshVarty, Julian Wolff, Jun Wang, June-One, Kamil Sindi, Kb Sriram, Kdavis-Mozilla, Kenji, lazypanda1, Liang-Chi Hsieh, Loo Rong Jie, Mahesh Bhosale, MandarJKulkarni, ManHyuk, Marcus Ong, Marshal Hayes, Martin Pool, matthieudelaro, mdfaijul, mholzel, Michael Zhou, Ming Li, Minmin Sun, Myungjoo Ham, MyungsungKwak, Naman Kamra, Peng Yu, Penghao Cen, Phil, Raghuraman-K, resec, Rohin Mohanadas, Sandeep N Gupta, Scott Tseng, seaotterman, Seo Sanghyeon, Sergei Lebedev, Ted Chang, terrytangyuan, Tim H, tkunic, Tod, vihanjain, Yan Facai (颜发才), Yin Li, Yong Tang, Yukun Chen, Yusuke Yamada
+
+
+
 # Release 1.6.0
 
 ## Breaking Changes
diff --git a/tensorflow/SECURITY.md b/SECURITY.md
similarity index 97%
rename from tensorflow/SECURITY.md
rename to SECURITY.md
index fea24b273920885ba8a1ae96aafbf7710df46e1f..378e77696725e338e8289cda84dbc543303ae053 100644
--- a/tensorflow/SECURITY.md
+++ b/SECURITY.md
@@ -6,7 +6,7 @@ report vulnerabilities in TensorFlow.
 
 ## TensorFlow models are programs
 
-TensorFlow's runtime system interprets and executes programs. What machine 
+TensorFlow's runtime system interprets and executes programs. What machine
 learning practitioners term
 [**models**](https://developers.google.com/machine-learning/glossary/#model) are
 expressed as programs that TensorFlow executes.  TensorFlow programs are encoded
@@ -28,12 +28,12 @@ data you supply to TensorFlow to train a model, or to use a model to run
 inference on the data.
 
 **TensorFlow models are programs, and need to be treated as such from a security
-perspective.** 
+perspective.**
 
 ## Running untrusted models
 
 As a general rule: **Always** execute untrusted models inside a sandbox (e.g.,
-[nsjail](https://github.com/google/nsjail)). 
+[nsjail](https://github.com/google/nsjail)).
 
 There are several ways in which a model could become untrusted. Obviously, if an
 untrusted party supplies TensorFlow kernels, arbitrary code may be executed.
@@ -109,11 +109,11 @@ graphs known to the `ModelServer`. This means that an attacker may run
 graphs using untrusted inputs as described above, but they would not be able to
 execute arbitrary graphs. It is possible to safely expose a `ModelServer`
 directly to an untrusted network, **but only if the graphs it is configured to
-use have been carefully audited to be safe**. 
+use have been carefully audited to be safe**.
 
 Similar to best practices for other servers, we recommend running any
 `ModelServer` with appropriate privileges (i.e., using a separate user with
-reduced permisisons). In the spirit of defense in depth, we recommend
+reduced permissions). In the spirit of defense in depth, we recommend
 authenticating requests to any TensorFlow server connected to an untrusted
 network, as well as sandboxing the server to minimize the adverse effects of
 any breach.
@@ -129,11 +129,11 @@ with specially crafted inputs.
 ### What is a vulnerability?
 
 Given TensorFlow's flexibility, it is possible to specify computation graphs
-which exhibit unexpected or unwanted behaviors. The fact that TensorFlow models
+which exhibit unexpected or unwanted behavior. The fact that TensorFlow models
 can perform arbitrary computations means that they may read and write files,
 communicate via the network, produce deadlocks and infinite loops, or run out
 of memory. It is only when these behaviors are outside the specifications of the
-operations involved that such behavior is a vulnerability. 
+operations involved that such behavior is a vulnerability.
 
 A `FileWriter` writing a file is not unexpected behavior and therefore is not a
 vulnerability in TensorFlow. A `MatMul` allowing arbitrary binary code execution
@@ -168,7 +168,7 @@ below).
 
 Please use a descriptive subject line for your report email. After the initial
 reply to your report, the security team will endeavor to keep you informed of
-the progress being made towards a fix and announcement. 
+the progress being made towards a fix and announcement.
 
 If you believe that an existing (public) issue is security-related, please send
 an email to `security@tensorflow.org`. The email should include the issue ID and
diff --git a/WORKSPACE b/WORKSPACE
index 1e38a9a8cd754886fc5232531816b875de0879a3..11c5cdb2070e79b16540a39f13cab28608962340 100644
--- a/WORKSPACE
+++ b/WORKSPACE
@@ -14,6 +14,12 @@ load("@io_bazel_rules_closure//closure:defs.bzl", "closure_repositories")
 
 closure_repositories()
 
+# We must check the bazel version before trying to parse any other BUILD
+# files, in case the parsing of those build files depends on the bazel
+# version we require here.
+load("//tensorflow:version_check.bzl", "check_bazel_version_at_least")
+check_bazel_version_at_least("0.10.0")
+
 load("//tensorflow:workspace.bzl", "tf_workspace")
 
 # Uncomment and update the paths in these entries to build the Android demo.
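
Reviewer note on the hunk above: the required Bazel version rises to 0.10.0 here, and `configure.py` moves from 0.5.4 to 0.10.0 in the same change. A check of this kind has to compare version components as numbers, because a plain string comparison orders "0.10.0" before "0.5.4". The Python sketch below only illustrates that point. `parse_version` and `is_at_least` are hypothetical helpers written for this note; they are not the contents of `version_check.bzl` or of `check_bazel_version` in `configure.py`.

```python
def parse_version(version):
    """Turn a dotted version such as '0.10.0' into a tuple of ints.

    Hypothetical helper: anything after a '-' (for example the 'rc1' in
    '0.10.0-rc1') is dropped before parsing.
    """
    core = version.split('-')[0]
    return tuple(int(part) for part in core.split('.'))


def is_at_least(current, minimum):
    """Return True when `current` is the same as or newer than `minimum`."""
    return parse_version(current) >= parse_version(minimum)


if __name__ == '__main__':
    # String comparison looks at characters, so '0.10.0' sorts before '0.5.4'.
    print('0.10.0' >= '0.5.4')
    # Component-wise numeric comparison orders the two versions correctly.
    print(is_at_least('0.10.0', '0.5.4'))
```

Comparing tuples of integers keeps the logic short; it does not try to order pre-release suffixes, which a stricter check might need.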
diff --git a/configure.py b/configure.py
index 7d2e30cd8af53b74c9274e28c4b95bdd01e9d41a..6744082d5d55c3a039b7a4efa7a539e77185cabd 100644
--- a/configure.py
+++ b/configure.py
@@ -40,7 +40,7 @@ _DEFAULT_CUDA_PATH = '/usr/local/cuda'
 _DEFAULT_CUDA_PATH_LINUX = '/opt/cuda'
 _DEFAULT_CUDA_PATH_WIN = ('C:/Program Files/NVIDIA GPU Computing '
                           'Toolkit/CUDA/v%s' % _DEFAULT_CUDA_VERSION)
-_DEFAULT_TENSORRT_PATH_LINUX = '/usr/lib/x86_64-linux-gnu'
+_DEFAULT_TENSORRT_PATH_LINUX = '/usr/lib/%s-linux-gnu' % platform.machine()
 _TF_OPENCL_VERSION = '1.2'
 _DEFAULT_COMPUTECPP_TOOLKIT_PATH = '/usr/local/computecpp'
 _DEFAULT_TRISYCL_INCLUDE_DIR = '/usr/local/triSYCL/include'
@@ -250,7 +250,11 @@ def reset_tf_configure_bazelrc(workspace_path):
       if _TF_BAZELRC_FILENAME in l:
         continue
       f.write('%s\n' % l)
-    f.write('import %s\n' % _TF_BAZELRC)
+    if is_windows():
+      tf_bazelrc_path = _TF_BAZELRC.replace("\\", "/")
+    else:
+      tf_bazelrc_path = _TF_BAZELRC
+    f.write('import %s\n' % tf_bazelrc_path)
 
 
 def cleanup_makefile():
@@ -444,7 +448,7 @@ def check_bazel_version(min_version):
   if which('bazel') is None:
     print('Cannot find bazel. Please install bazel.')
     sys.exit(0)
-  curr_version = run_shell(['bazel', '--batch', 'version'])
+  curr_version = run_shell(['bazel', '--batch', '--bazelrc=/dev/null', 'version'])
 
   for line in curr_version.split('\n'):
     if 'Build label: ' in line:
@@ -498,7 +502,6 @@ def set_cc_opt_flags(environ_cp):
   write_to_bazelrc('build --copt=-DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK')
   write_to_bazelrc('build --host_copt=-DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK')
 
-
 def set_tf_cuda_clang(environ_cp):
   """set TF_CUDA_CLANG action_env.
 
@@ -520,7 +523,7 @@ def set_tf_cuda_clang(environ_cp):
 
 def set_tf_download_clang(environ_cp):
   """Set TF_DOWNLOAD_CLANG action_env."""
-  question = 'Do you want to download a fresh release of clang? (Experimental)'
+  question = 'Do you wish to download a fresh release of clang? (Experimental)'
   yes_reply = 'Clang will be downloaded and used to compile tensorflow.'
   no_reply = 'Clang will not be downloaded.'
   set_action_env_var(
@@ -1044,7 +1047,10 @@ def set_tf_tensorrt_install_path(environ_cp):
 
     for lib_file in possible_files:
       if is_compatible(lib_file, cuda_ver, cudnn_ver):
-        ver_str = nvinfer_pattern.search(lib_file).group(1)
+        matches = nvinfer_pattern.search(lib_file)
+        if len(matches.groups()) == 0:
+          continue
+        ver_str = matches.group(1)
         ver = convert_version_to_int(ver_str) if len(ver_str) else 0
         if ver > highest_ver[0]:
           highest_ver = [ver, ver_str, lib_file]
@@ -1373,7 +1379,7 @@ def main():
   # environment variables.
   environ_cp = dict(os.environ)
 
-  check_bazel_version('0.5.4')
+  check_bazel_version('0.10.0')
 
   reset_tf_configure_bazelrc(args.workspace)
   cleanup_makefile()
@@ -1390,6 +1396,9 @@ def main():
     environ_cp['TF_NEED_OPENCL'] = '0'
     environ_cp['TF_CUDA_CLANG'] = '0'
     environ_cp['TF_NEED_TENSORRT'] = '0'
+    # TODO(ibiryukov): Investigate using clang as a cpu or cuda compiler on
+    # Windows.
+    environ_cp['TF_DOWNLOAD_CLANG'] = '0'
 
   if is_macos():
     environ_cp['TF_NEED_JEMALLOC'] = '0'
@@ -1404,7 +1413,7 @@ def main():
   set_build_var(environ_cp, 'TF_NEED_S3', 'Amazon S3 File System',
                 'with_s3_support', True, 's3')
   set_build_var(environ_cp, 'TF_NEED_KAFKA', 'Apache Kafka Platform',
-                'with_kafka_support', False, 'kafka')
+                'with_kafka_support', True, 'kafka')
   set_build_var(environ_cp, 'TF_ENABLE_XLA', 'XLA JIT', 'with_xla_support',
                 False, 'xla')
   set_build_var(environ_cp, 'TF_NEED_GDR', 'GDR', 'with_gdr_support',
@@ -1437,16 +1446,8 @@ def main():
 
     set_tf_cuda_clang(environ_cp)
     if environ_cp.get('TF_CUDA_CLANG') == '1':
-      if not is_windows():
-        # Ask if we want to download clang release while building.
-        set_tf_download_clang(environ_cp)
-      else:
-        # We use bazel's generated crosstool on Windows and there is no
-        # way to provide downloaded toolchain for that yet.
-        # TODO(ibiryukov): Investigate using clang as a cuda compiler on
-        # Windows.
-        environ_cp['TF_DOWNLOAD_CLANG'] = '0'
-
+      # Ask whether we should download the clang toolchain.
+      set_tf_download_clang(environ_cp)
       if environ_cp.get('TF_DOWNLOAD_CLANG') != '1':
         # Set up which clang we should use as the cuda / host compiler.
         set_clang_cuda_compiler_path(environ_cp)
@@ -1456,6 +1457,13 @@ def main():
       if not is_windows():
         set_gcc_host_compiler_path(environ_cp)
     set_other_cuda_vars(environ_cp)
+  else:
+    # CUDA not required. Ask whether we should download the clang toolchain and
+    # use it for the CPU build.
+    set_tf_download_clang(environ_cp)
+    if environ_cp.get('TF_DOWNLOAD_CLANG') == '1':
+      write_to_bazelrc('build --config=download_clang')
+      write_to_bazelrc('test --config=download_clang')
 
   set_build_var(environ_cp, 'TF_NEED_MPI', 'MPI', 'with_mpi_support', False)
   if environ_cp.get('TF_NEED_MPI') == '1':
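
Reviewer note on the `reset_tf_configure_bazelrc` hunk in this file: on Windows, the path written after `import` now has its backslashes replaced with forward slashes. The diff does not state the reason, so the sketch below only reproduces the transformation itself. `bazelrc_import_line` is a hypothetical function, the example path is made up, and the `on_windows` flag stands in for the real `is_windows()` call so the snippet runs on any platform.

```python
def bazelrc_import_line(bazelrc_path, on_windows):
    """Build the 'import <path>' line written to the workspace .bazelrc.

    Hypothetical stand-in for the logic in reset_tf_configure_bazelrc();
    `on_windows` replaces the real is_windows() check.
    """
    if on_windows:
        # Rewrite the path so it contains forward slashes only.
        bazelrc_path = bazelrc_path.replace('\\', '/')
    return 'import %s\n' % bazelrc_path


if __name__ == '__main__':
    # Made-up Windows path, used purely for illustration.
    example = 'C:\\src\\tensorflow\\.tf_configure.bazelrc'
    print(bazelrc_import_line(example, on_windows=True), end='')
```

On other platforms the path passes through unchanged, which matches the `else` branch in the hunk.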
diff --git a/tensorflow/BUILD b/tensorflow/BUILD
index dc995d231d3e591771f801e28024a76610cdba26..bb9079bf5b04d85294d30c86fb5978676b5a42eb 100644
--- a/tensorflow/BUILD
+++ b/tensorflow/BUILD
@@ -240,6 +240,13 @@ config_setting(
     visibility = ["//visibility:public"],
 )
 
+config_setting(
+    name = "with_kafka_support_windows_override",
+    define_values = {"with_kafka_support": "true"},
+    values = {"cpu": "x64_windows"},
+    visibility = ["//visibility:public"],
+)
+
 config_setting(
     name = "with_gcp_support_android_override",
     define_values = {"with_gcp_support": "true"},
@@ -415,6 +422,17 @@ py_library(
     deps = ["//tensorflow/python"],
 )
 
+py_library(
+    name = "experimental_tensorflow_py",
+    srcs = ["experimental_api.py"],
+    srcs_version = "PY2AND3",
+    visibility = ["//tensorflow/tools/api/tests:__subpackages__"],
+    deps = [
+        "//tensorflow/python",
+        "//tensorflow/tools/api/generator:python_api",
+    ],
+)
+
 filegroup(
     name = "all_opensource_files",
     data = [
@@ -441,6 +459,7 @@ filegroup(
         "//tensorflow/compiler/xla:all_files",
         "//tensorflow/compiler/xla/client:all_files",
         "//tensorflow/compiler/xla/client/lib:all_files",
+        "//tensorflow/compiler/xla/client/xla_client:all_files",
         "//tensorflow/compiler/xla/legacy_flags:all_files",
         "//tensorflow/compiler/xla/python:all_files",
         "//tensorflow/compiler/xla/service:all_files",
@@ -598,6 +617,7 @@ filegroup(
         "//tensorflow/contrib/verbs:all_files",
         "//tensorflow/core:all_files",
         "//tensorflow/core/api_def:all_files",
+        "//tensorflow/core/common_runtime/eager:all_files",
         "//tensorflow/core/debug:all_files",
         "//tensorflow/core/distributed_runtime:all_files",
         "//tensorflow/core/distributed_runtime/rpc:all_files",
@@ -787,6 +807,7 @@ tf_cc_shared_object(
     }),
     deps = [
         "//tensorflow/c:c_api",
+        "//tensorflow/c:c_api_experimental",
         "//tensorflow/c:exported_symbols.lds",
         "//tensorflow/c:version_script.lds",
         "//tensorflow/c/eager:c_api",
diff --git a/tensorflow/c/BUILD b/tensorflow/c/BUILD
index 5dfb743681255d8c03e91ea43fd441d94fdee59d..4332f44e5da2f3d716b76168a1618a29e3847f54 100644
--- a/tensorflow/c/BUILD
+++ b/tensorflow/c/BUILD
@@ -17,7 +17,10 @@ load(
 
 filegroup(
     name = "headers",
-    srcs = ["c_api.h"],
+    srcs = [
+        "c_api.h",
+        "c_api_experimental.h",
+    ],
     visibility = ["//tensorflow:__subpackages__"],
 )
 
@@ -113,6 +116,10 @@ tf_cuda_library(
         ":c_api",
         ":c_api_internal",
         "//tensorflow/compiler/jit/legacy_flags:mark_for_compilation_pass_flags",
+        "//tensorflow/contrib/tpu:all_ops",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
     ],
 )
diff --git a/tensorflow/c/c_api.cc b/tensorflow/c/c_api.cc
index 85f1d1639b4d09f2de77d326481a86ec246270d0..18eeb2816807ec9986999cfc2c9a4c0f032683c0 100644
--- a/tensorflow/c/c_api.cc
+++ b/tensorflow/c/c_api.cc
@@ -30,6 +30,7 @@ limitations under the License.
 #endif
 #include "tensorflow/c/c_api_internal.h"
 #include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/common_runtime/eval_const_tensor.h"
 #include "tensorflow/core/common_runtime/shape_refiner.h"
 #include "tensorflow/core/framework/allocation_description.pb.h"
 #include "tensorflow/core/framework/log_memory.h"
@@ -62,6 +63,7 @@ limitations under the License.
 // brain namespace because we are defining 'extern "C"' functions.
 using tensorflow::AllocationDescription;
 using tensorflow::DataType;
+using tensorflow::ExtendSessionGraphHelper;
 using tensorflow::Graph;
 using tensorflow::GraphDef;
 using tensorflow::mutex_lock;
@@ -73,6 +75,7 @@ using tensorflow::NodeBuilder;
 using tensorflow::NodeDef;
 using tensorflow::OpDef;
 using tensorflow::OpRegistry;
+using tensorflow::OutputTensor;
 using tensorflow::PartialTensorShape;
 using tensorflow::RunMetadata;
 using tensorflow::RunOptions;
@@ -638,17 +641,17 @@ Status MessageToBuffer(const tensorflow::protobuf::Message& in,
 }
 
 void RecordMutation(TF_Graph* graph, const TF_Operation& op,
-                    const char* mutation_type)
-    EXCLUSIVE_LOCKS_REQUIRED(graph->mu) {
+                    const char* mutation_type) {
   // If any session has already run this node_id, mark this session as
   // unrunnable.
   for (auto it : graph->sessions) {
+    mutex_lock session_lock(it.first->mu);
     if (it.first->last_num_graph_nodes > op.node.id()) {
-      it.second = FailedPrecondition(
+      it.second = strings::StrCat(
           "Operation '", op.node.DebugString(), "' was changed by ",
           mutation_type,
-          " after it was run by a session. Nodes can be mutated "
-          "only before they are executed by a session. Either don't modify "
+          " after it was run by a session. This mutation will have no effect, "
+          "and will trigger an error in the future. Either don't modify "
           "nodes after running them or create a new session.");
     }
   }
@@ -708,6 +711,61 @@ void TF_GraphSetOutputHandleShapesAndTypes(TF_Graph* graph, TF_Output output,
 Status LoadLibrary(const char* library_filename, void** result,
                    const void** buf, size_t* len);
 
+// TODO(josh11b,mrry): Change Session to be able to use a Graph*
+// directly, instead of requiring us to serialize to a GraphDef and
+// call Session::Extend().
+bool ExtendSessionGraphHelper(TF_Session* session, TF_Status* status) {
+  if (session->graph != nullptr) {
+    // Take the graph lock before the session lock to avoid deadlock. This is
+    // safe since session->graph does not change.
+    session->graph->mu.lock();
+    mutex_lock session_lock(session->mu);
+    const Graph& graph = session->graph->graph;
+
+    const string& mutation_warning = session->graph->sessions[session];
+    if (!mutation_warning.empty()) {
+      // TODO(b/74949947): turn this back into an error status
+      LOG(WARNING) << mutation_warning;
+      session->graph->sessions[session].clear();
+    }
+
+    const auto num_nodes = graph.num_node_ids();
+    if (session->last_num_graph_nodes < num_nodes) {
+      status->status = tensorflow::ValidateNoCycles(session->graph->graph);
+      if (!status->status.ok()) {
+        session->graph->mu.unlock();
+        return false;
+      }
+
+      GraphDef graph_def;
+      *graph_def.mutable_versions() = graph.versions();
+      // Fill graph_def with nodes with ids in the range
+      // [session->last_num_graph_nodes, num_nodes), that is the nodes
+      // added since the last TF_SessionRun() call.
+      for (auto id = session->last_num_graph_nodes; id < num_nodes; ++id) {
+        Node* const node = graph.FindNodeId(id);
+        if (node != nullptr && node->IsOp()) {
+          NodeDef* const node_def = graph_def.add_node();
+          *node_def = node->def();
+        }
+      }
+      *graph_def.mutable_library() = graph.flib_def().ToProto();
+      session->graph->mu.unlock();
+      status->status = session->session->Extend(graph_def);
+      if (!status->status.ok()) {
+        // Contract is we always delete input_values[i].
+        return false;
+      }
+      // Note: session->session is not modified if Extend() fails, so
+      // we only set last_num_graph_nodes if it succeeds.
+      session->last_num_graph_nodes = num_nodes;
+    } else {
+      session->graph->mu.unlock();
+    }
+  }
+  return true;
+}
+
 }  // namespace tensorflow
 
 static void TF_Run_Setup(int noutputs, TF_Tensor** c_outputs,
@@ -2408,11 +2466,7 @@ void TF_AddGradients(TF_Graph* g, TF_Output* y, int ny, TF_Output* x, int nx,
 // TF_Session functions ----------------------------------------------
 
 TF_Session::TF_Session(tensorflow::Session* s, TF_Graph* g)
-    : session(s), graph(g), last_num_graph_nodes(0), device_mgr(nullptr) {
-  if (s->LocalDeviceManager(&device_mgr).ok()) {
-    devices = device_mgr->ListDevices();
-  }
-}
+    : session(s), graph(g), last_num_graph_nodes(0), extend_before_run(true) {}
 
 TF_Session* TF_NewSession(TF_Graph* graph, const TF_SessionOptions* opt,
                           TF_Status* status) {
@@ -2422,7 +2476,7 @@ TF_Session* TF_NewSession(TF_Graph* graph, const TF_SessionOptions* opt,
     TF_Session* new_session = new TF_Session(session, graph);
     if (graph != nullptr) {
       mutex_lock l(graph->mu);
-      graph->sessions[new_session] = Status::OK();
+      graph->sessions[new_session] = "";
     }
     return new_session;
   } else {
@@ -2488,7 +2542,7 @@ TF_Session* TF_LoadSessionFromSavedModel(
 
   TF_Session* session = new TF_Session(bundle.session.release(), graph);
 
-  graph->sessions[session] = Status::OK();
+  graph->sessions[session] = "";
   session->last_num_graph_nodes = graph->graph.num_node_ids();
   return session;
 #endif  // __ANDROID__
@@ -2512,58 +2566,6 @@ void TF_DeleteSession(TF_Session* s, TF_Status* status) {
   delete s;
 }
 
-// TODO(josh11b,mrry): Change Session to be able to use a Graph*
-// directly, instead of requiring us to serialize to a GraphDef and
-// call Session::Extend().
-static bool ExtendSessionGraphHelper(TF_Session* session, TF_Status* status) {
-  if (session->graph != nullptr) {
-    mutex_lock session_lock(session->mu);
-    session->graph->mu.lock();
-    const Graph& graph = session->graph->graph;
-
-    status->status = session->graph->sessions[session];
-    if (!status->status.ok()) {
-      session->graph->mu.unlock();
-      return false;
-    }
-
-    const auto num_nodes = graph.num_node_ids();
-    if (session->last_num_graph_nodes < num_nodes) {
-      status->status = tensorflow::ValidateNoCycles(session->graph->graph);
-      if (!status->status.ok()) {
-        session->graph->mu.unlock();
-        return false;
-      }
-
-      GraphDef graph_def;
-      *graph_def.mutable_versions() = graph.versions();
-      // Fill graph_def with nodes with ids in the range
-      // [session->last_num_graph_nodes, num_nodes), that is the nodes
-      // added since the last TF_SessionRun() call.
-      for (auto id = session->last_num_graph_nodes; id < num_nodes; ++id) {
-        Node* const node = graph.FindNodeId(id);
-        if (node != nullptr && node->IsOp()) {
-          NodeDef* const node_def = graph_def.add_node();
-          *node_def = node->def();
-        }
-      }
-      *graph_def.mutable_library() = graph.flib_def().ToProto();
-      session->graph->mu.unlock();
-      status->status = session->session->Extend(graph_def);
-      if (!status->status.ok()) {
-        // Contract is we always delete input_values[i].
-        return false;
-      }
-      // Note: session->session is not modified if Extend() fails, so
-      // we only set last_num_graph_nodes if it succeeds.
-      session->last_num_graph_nodes = num_nodes;
-    } else {
-      session->graph->mu.unlock();
-    }
-  }
-  return true;
-}
-
 void TF_SessionRun(TF_Session* session, const TF_Buffer* run_options,
                    const TF_Output* inputs, TF_Tensor* const* input_values,
                    int ninputs, const TF_Output* outputs,
@@ -2573,7 +2575,8 @@ void TF_SessionRun(TF_Session* session, const TF_Buffer* run_options,
   // TODO(josh11b,mrry): Change Session to be able to use a Graph*
   // directly, instead of requiring us to serialize to a GraphDef and
   // call Session::Extend().
-  if (!ExtendSessionGraphHelper(session, status)) {
+  if (session->extend_before_run &&
+      !ExtendSessionGraphHelper(session, status)) {
     return;
   }
 
@@ -2610,7 +2613,8 @@ void TF_SessionPRunSetup(TF_Session* session, const TF_Output* inputs,
                          const char** handle, TF_Status* status) {
   *handle = nullptr;
 
-  if (!ExtendSessionGraphHelper(session, status)) {
+  if (session->extend_before_run &&
+      !ExtendSessionGraphHelper(session, status)) {
     return;
   }
 
@@ -2653,7 +2657,8 @@ void TF_SessionPRun(TF_Session* session, const char* handle,
   // TODO(josh11b,mrry): Change Session to be able to use a Graph*
   // directly, instead of requiring us to serialize to a GraphDef and
   // call Session::Extend().
-  if (!ExtendSessionGraphHelper(session, status)) {
+  if (session->extend_before_run &&
+      !ExtendSessionGraphHelper(session, status)) {
     return;
   }
 
@@ -2682,6 +2687,24 @@ void TF_SessionPRun(TF_Session* session, const char* handle,
                 output_values, target_names, nullptr, status);
 }
 
+unsigned char TF_TryEvaluateConstant(TF_Graph* graph, TF_Output output,
+                                     TF_Tensor** result, TF_Status* status) {
+  *result = nullptr;
+  mutex_lock l(graph->mu);
+  OutputTensor tensor(&output.oper->node, output.index);
+  bool evaluated;
+  Tensor result_tensor;
+  status->status = EvaluateConstantTensor(
+      tensor, graph->refiner, *graph->graph.op_registry(),
+      graph->graph.versions().producer(), &evaluated, &result_tensor);
+  if (evaluated) {
+    DCHECK(status->status.ok());
+    *result = TF_TensorFromTensor(result_tensor, status);
+    if (!status->status.ok()) evaluated = false;
+  }
+  return evaluated;
+}
+
 TF_ApiDefMap* TF_NewApiDefMap(TF_Buffer* op_list_buffer, TF_Status* status) {
   tensorflow::OpList op_list;
   if (!op_list.ParseFromArray(op_list_buffer->data, op_list_buffer->length)) {
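
Reviewer note on the locking change in this file: `ExtendSessionGraphHelper` now takes the graph lock before the session lock, and `RecordMutation` locks each session's mutex while walking `graph->sessions`. Assuming the callers of `RecordMutation` still hold the graph lock, both paths acquire the two locks in the same order, which is the usual way to rule out a lock-order deadlock. The Python sketch below is an analogy only. The lock and function names are hypothetical, and it does not model the early release of the graph lock before `Session::Extend()`.

```python
import threading

graph_mu = threading.Lock()    # stands in for TF_Graph::mu
session_mu = threading.Lock()  # stands in for TF_Session::mu
events = []


def record_mutation():
    # Mutation path: graph lock first, then the session lock.
    with graph_mu:
        with session_mu:
            events.append('mutation')


def extend_session():
    # Run path: same order, so neither path can hold one lock while
    # waiting for the other in the opposite order.
    with graph_mu:
        with session_mu:
            events.append('extend')


def _repeat(fn, count):
    for _ in range(count):
        fn()


def run_both(iterations=1000):
    """Run both paths concurrently; return the number of recorded events."""
    del events[:]
    workers = [
        threading.Thread(target=_repeat, args=(record_mutation, iterations)),
        threading.Thread(target=_repeat, args=(extend_session, iterations)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return len(events)


if __name__ == '__main__':
    print(run_both())
```

If one of the two functions acquired `session_mu` first, the threads could each end up holding one lock and waiting for the other, which is the situation the reordering in `ExtendSessionGraphHelper` avoids.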
diff --git a/tensorflow/c/c_api.h b/tensorflow/c/c_api.h
index ad592ef70961ef427bfe9fd322a82bd64df7f9f1..b32f574628c4d1dc5c3bb3f1265a1b12adee28bc 100644
--- a/tensorflow/c/c_api.h
+++ b/tensorflow/c/c_api.h
@@ -1275,13 +1275,22 @@ TF_CAPI_EXPORT extern void TF_FunctionGetAttrValueProto(
 // Deleting a function does not remove it from any graphs it was copied to.
 TF_CAPI_EXPORT extern void TF_DeleteFunction(TF_Function* func);
 
+// Attempts to evaluate `output`. This will only be possible if `output` doesn't
+// depend on any graph inputs (this function is safe to call if this isn't the
+// case though).
+//
+// If the evaluation is successful, this function returns true and `output`s
+// value is returned in `result`. Otherwise returns false. An error status is
+// returned if something is wrong with the graph or input. Note that this may
+// return false even if no error status is set.
+TF_CAPI_EXPORT extern unsigned char TF_TryEvaluateConstant(TF_Graph* graph,
+                                                           TF_Output output,
+                                                           TF_Tensor** result,
+                                                           TF_Status* status);
+
 // TODO(josh11b): Register OpDef, available to all operations added
 // to this graph.
 
-// The following two may both benefit from a subgraph-definition API
-// that re-uses most of the graph-definition API.
-// TODO(andydavis): Add functions to a graph.
-
 // --------------------------------------------------------------------------
 // API for driving Graph execution.
 
diff --git a/tensorflow/c/c_api_experimental.cc b/tensorflow/c/c_api_experimental.cc
index be7f85a5bb06dce84579b109d506ded049042b50..29caf508e7aac9a273c8a27cdb1e849d721eb220 100644
--- a/tensorflow/c/c_api_experimental.cc
+++ b/tensorflow/c/c_api_experimental.cc
@@ -17,8 +17,13 @@ limitations under the License.
 
 #include "tensorflow/c/c_api_internal.h"
 #include "tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.h"
+#include "tensorflow/core/graph/graph.h"
+#include "tensorflow/core/graph/node_builder.h"
+#include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/protobuf/config.pb.h"
 
+using tensorflow::Status;
+
 void TF_EnableXLACompilation(TF_SessionOptions* options, unsigned char enable) {
   tensorflow::ConfigProto& config = options->options.config;
   auto* optimizer_options =
@@ -37,3 +42,64 @@ void TF_EnableXLACompilation(TF_SessionOptions* options, unsigned char enable) {
     optimizer_options->set_global_jit_level(tensorflow::OptimizerOptions::OFF);
   }
 }
+
+void TF_InitializeTPU(TF_Session* session, TF_Status* status) {
+  VLOG(1) << "Initializing TPU";
+  TF_Operation* config_op =
+      TF_GraphOperationByName(session->graph, "ConfigureDistributedTPU");
+  if (config_op == nullptr) {
+    status->status = tensorflow::errors::Internal(
+        "Unable to find node ConfigureDistributedTPU in the TF graph.");
+    return;
+  }
+
+  TF_Output config_node{config_op, 0};
+
+  TF_Tensor* dummy_output;
+  TF_SessionRun(session, /*run_options*/ nullptr,
+                // input related parameters
+                /*inputs*/ nullptr, /*input_values*/ nullptr, /*ninputs*/ 0,
+                // output related parameters
+                /*outputs*/ &config_node, /*output_values*/ &dummy_output,
+                /*noutputs*/ 1,
+                /*targets*/ nullptr, /*ntargets*/ 0,
+                /*run_metadata*/ nullptr, status);
+  if (status->status.ok()) {
+    TF_DeleteTensor(dummy_output);
+  }
+}
+
+void TF_ShutdownTPU(TF_Session* session, TF_Status* status) {
+  {
+    tensorflow::mutex_lock c(session->graph->mu);
+    VLOG(1) << "Shutting down TPU, with input graph: "
+            << session->graph->graph.ToGraphDefDebug().DebugString();
+  }
+
+  TF_Operation* shutdown_op =
+      TF_GraphOperationByName(session->graph, "ShutdownDistributedTPU");
+  if (shutdown_op == nullptr) {
+    status->status = tensorflow::errors::Internal(
+        "Unable to find node ShutdownDistributedTPU in the TF graph.");
+    return;
+  }
+
+  TF_SessionRun(session, /*run_options*/ nullptr,
+                // input related parameters
+                /*inputs*/ nullptr, /*input_values*/ nullptr, /*ninputs*/ 0,
+                // output related parameters
+                /*outputs*/ nullptr, /*output_values*/ nullptr,
+                /*noutputs*/ 0,
+                /*targets*/ &shutdown_op, /*ntargets*/ 1,
+                /*run_metadata*/ nullptr, status);
+}
+
+TF_CAPI_EXPORT extern const char* TF_GraphDebugString(TF_Graph* graph,
+                                                      size_t* len) {
+  tensorflow::mutex_lock c(graph->mu);
+  const auto& debug_str = graph->graph.ToGraphDefDebug().DebugString();
+  *len = debug_str.size();
+  char* ret = static_cast<char*>(malloc(*len + 1));
+  memcpy(ret, debug_str.c_str(), *len + 1);
+  return ret;
+}
diff --git a/tensorflow/c/c_api_experimental.h b/tensorflow/c/c_api_experimental.h
index 5a7b007e40aa199889b2d00b2bde5976c19e2966..f069398bbb8070a7503ec76a85a95a8425ac2e93 100644
--- a/tensorflow/c/c_api_experimental.h
+++ b/tensorflow/c/c_api_experimental.h
@@ -25,6 +25,7 @@ limitations under the License.
 // Experimental C API for TensorFlow.
 //
 // The API here is subject to changes in the future.
+// --------------------------------------------------------------------------
 
 // Macro to control visibility of exported symbols in the shared library (.so,
 // .dylib, .dll).
@@ -59,6 +60,33 @@ extern "C" {
 TF_CAPI_EXPORT extern void TF_EnableXLACompilation(TF_SessionOptions* options,
                                                    unsigned char enable);
 
+// Initializes the TPU system. Must be called exactly once before
+// TF_SessionRun() is called on a TPU graph.
+//
+// The session graph must contain a node named ConfigureDistributedTPU.
+// TODO(b/74774824): Improve the API for initializing the TPU system.
+TF_CAPI_EXPORT extern void TF_InitializeTPU(TF_Session* session,
+                                            TF_Status* status);
+
+// Shuts down the TPU system. For any `session` where TF_InitializeTPU() has
+// been successfully called, this call must be made exactly once before the
+// session is closed.
+// The session graph must contain a node named ShutdownDistributedTPU.
+TF_CAPI_EXPORT extern void TF_ShutdownTPU(TF_Session* session,
+                                          TF_Status* status);
+
+// Returns the graph content as a human-readable, NUL-terminated string and
+// stores its length (excluding the terminator) in `len`. The format may
+// change. The string is heap-allocated; the caller should free() it.
+TF_CAPI_EXPORT extern const char* TF_GraphDebugString(TF_Graph* graph,
+                                                      size_t* len);
+
+// Returns the graph content as a human-readable, NUL-terminated string and
+// stores its length (excluding the terminator) in `len`. The format may
+// change. The string is heap-allocated; the caller should free() it.
+TF_CAPI_EXPORT extern const char* TF_GraphDebugString(TF_Graph* graph,
+                                                      size_t* len);
+
 #ifdef __cplusplus
 } /* end extern "C" */
 #endif
diff --git a/tensorflow/c/c_api_function_test.cc b/tensorflow/c/c_api_function_test.cc
index 7ca50119eafe299b307f06c555aec1388e7e82e2..610274696f5940c063e68f2310cfd9cc1e0bd964 100644
--- a/tensorflow/c/c_api_function_test.cc
+++ b/tensorflow/c/c_api_function_test.cc
@@ -20,6 +20,7 @@ limitations under the License.
 #include "tensorflow/core/framework/op_def.pb.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/hash/hash.h"
+#include "tensorflow/core/lib/strings/proto_serialization.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
diff --git a/tensorflow/c/c_api_internal.h b/tensorflow/c/c_api_internal.h
index 91667056e0eeb224b4b8a034766f11a123cd1a03..95652a11378d6276b5ba6540a07baa15aa77cc1c 100644
--- a/tensorflow/c/c_api_internal.h
+++ b/tensorflow/c/c_api_internal.h
@@ -84,19 +84,20 @@ struct TF_Graph {
   std::unordered_map<tensorflow::string, tensorflow::Node*> name_map
       GUARDED_BY(mu);
 
-  // The keys of this map are all the active sessions using this graph.
-  // Each value is the current "runnability" status of the corresponding
-  // session. Under normal conditions all statuses are Status::OK(), but
-  // if some operation is mutated after it was run by a session (this
-  // is detected in RecordMutation function), that session is no longer
-  // safe to run. Its status will contain the error that will be returned
-  // to the user, should she try running this session.
+  // The keys of this map are all the active sessions using this graph. Each
+  // value records whether the graph has been mutated since the corresponding
+  // session was run (this is detected in the RecordMutation function).
+  // If the string is empty, no mutation has occurred. Otherwise the string is
+  // a description of the mutation suitable for returning to the user.
   //
   // Sessions are added to this map in TF_NewSession, and removed in
   // TF_DeleteSession.
   // TF_Graph may only / must be deleted when
   //   sessions.size() == 0 && delete_requested == true
-  tensorflow::gtl::FlatMap<TF_Session*, tensorflow::Status> sessions
+  //
+  // TODO(b/74949947): mutations currently trigger a warning instead of a bad
+  // status; this should be reverted when possible.
+  tensorflow::gtl::FlatMap<TF_Session*, tensorflow::string> sessions
       GUARDED_BY(mu);
   bool delete_requested GUARDED_BY(mu);  // set true by TF_DeleteGraph
 
@@ -124,15 +125,16 @@ struct TF_Session {
   TF_Session(tensorflow::Session* s, TF_Graph* g);
 
   tensorflow::Session* session;
-  TF_Graph* graph;
+  TF_Graph* const graph;
 
-  tensorflow::mutex mu;
+  tensorflow::mutex mu ACQUIRED_AFTER(TF_Graph::mu);
   int last_num_graph_nodes;
 
-  // NOTE(ashankar): Experimental fields to help keep the
-  // buffers of a TF_Tensor pinned in device memory.
-  const tensorflow::DeviceMgr* device_mgr;   // Owned by session.
-  std::vector<tensorflow::Device*> devices;  // Owned by device_mgr.
+  // If true, TF_SessionRun and similar methods will call
+  // ExtendSessionGraphHelper before running the graph (this is the default
+  // public behavior). Can be set to false if the caller needs to call
+  // ExtendSessionGraphHelper manually.
+  std::atomic<bool> extend_before_run;
 };
 
 struct TF_ImportGraphDefOptions {
@@ -210,7 +212,11 @@ void TF_GraphSetOutputHandleShapesAndTypes(TF_Graph* graph, TF_Output output,
                                            TF_Status* status);
 
 void RecordMutation(TF_Graph* graph, const TF_Operation& op,
-                    const char* mutation_type);
+                    const char* mutation_type)
+    EXCLUSIVE_LOCKS_REQUIRED(graph->mu);
+
+bool ExtendSessionGraphHelper(TF_Session* session, TF_Status* status)
+    LOCKS_EXCLUDED(session->graph->mu, session->mu);
 
 }  // end namespace tensorflow
 
diff --git a/tensorflow/c/c_test_util.cc b/tensorflow/c/c_test_util.cc
index 3db2852ce6560ba493d60ef54a110161c112d110..f3b28c1708129d39e451d927a89c0d10e2193b63 100644
--- a/tensorflow/c/c_test_util.cc
+++ b/tensorflow/c/c_test_util.cc
@@ -34,6 +34,10 @@ static void DoubleDeallocator(void* data, size_t, void* arg) {
   delete[] static_cast<double*>(data);
 }
 
+static void FloatDeallocator(void* data, size_t, void* arg) {
+  delete[] static_cast<float*>(data);
+}
+
 TF_Tensor* Int8Tensor(const int64_t* dims, int num_dims, const char* values) {
   int64_t num_values = 1;
   for (int i = 0; i < num_dims; ++i) {
@@ -78,21 +82,34 @@ TF_Tensor* DoubleTensor(double v) {
                       &DoubleDeallocator, nullptr);
 }
 
+TF_Tensor* FloatTensor(float v) {
+  const int num_bytes = sizeof(float);
+  float* values = new float[1];
+  values[0] = v;
+  return TF_NewTensor(TF_FLOAT, nullptr, 0, values, num_bytes,
+                      &FloatDeallocator, nullptr);
+}
+
 // All the *Helper methods are used as a workaround for the restrictions that
 // one cannot call ASSERT_* methods in non-void-returning functions (when
 // exceptions are disabled during compilation)
 void PlaceholderHelper(TF_Graph* graph, TF_Status* s, const char* name,
+                       TF_DataType dtype, const std::vector<int64_t>& dims,
                        TF_Operation** op) {
   TF_OperationDescription* desc = TF_NewOperation(graph, "Placeholder", name);
-  TF_SetAttrType(desc, "dtype", TF_INT32);
+  TF_SetAttrType(desc, "dtype", dtype);
+  if (!dims.empty()) {
+    TF_SetAttrShape(desc, "shape", dims.data(), dims.size());
+  }
   *op = TF_FinishOperation(desc, s);
   ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
   ASSERT_NE(*op, nullptr);
 }
 
-TF_Operation* Placeholder(TF_Graph* graph, TF_Status* s, const char* name) {
+TF_Operation* Placeholder(TF_Graph* graph, TF_Status* s, const char* name,
+                          TF_DataType dtype, const std::vector<int64_t>& dims) {
   TF_Operation* op;
-  PlaceholderHelper(graph, s, name, &op);
+  PlaceholderHelper(graph, s, name, dtype, dims, &op);
   return op;
 }
 
@@ -126,6 +143,12 @@ TF_Operation* ScalarConst(double v, TF_Graph* graph, TF_Status* s,
   return Const(tensor.get(), graph, s, name);
 }
 
+TF_Operation* ScalarConst(float v, TF_Graph* graph, TF_Status* s,
+                          const char* name) {
+  unique_tensor_ptr tensor(FloatTensor(v), TF_DeleteTensor);
+  return Const(tensor.get(), graph, s, name);
+}
+
 void AddOpHelper(TF_Operation* l, TF_Operation* r, TF_Graph* graph,
                  TF_Status* s, const char* name, TF_Operation** op,
                  bool check) {
diff --git a/tensorflow/c/c_test_util.h b/tensorflow/c/c_test_util.h
index 2a70177c724c569844a5d8ad42b99bed20209946..cd19cf8d624d9b914b61132f93d918b046cdbd30 100644
--- a/tensorflow/c/c_test_util.h
+++ b/tensorflow/c/c_test_util.h
@@ -44,8 +44,12 @@ TF_Tensor* Int32Tensor(int32_t v);
 
 TF_Tensor* DoubleTensor(double v);
 
+TF_Tensor* FloatTensor(float v);
+
 TF_Operation* Placeholder(TF_Graph* graph, TF_Status* s,
-                          const char* name = "feed");
+                          const char* name = "feed",
+                          TF_DataType dtype = TF_INT32,
+                          const std::vector<int64_t>& dims = {});
 
 TF_Operation* Const(TF_Tensor* t, TF_Graph* graph, TF_Status* s,
                     const char* name = "const");
@@ -56,6 +60,9 @@ TF_Operation* ScalarConst(int32_t v, TF_Graph* graph, TF_Status* s,
 TF_Operation* ScalarConst(double v, TF_Graph* graph, TF_Status* s,
                           const char* name = "scalar");
 
+TF_Operation* ScalarConst(float v, TF_Graph* graph, TF_Status* s,
+                          const char* name = "scalar");
+
 TF_Operation* Add(TF_Operation* l, TF_Operation* r, TF_Graph* graph,
                   TF_Status* s, const char* name = "add");
 
diff --git a/tensorflow/c/eager/BUILD b/tensorflow/c/eager/BUILD
index e55cb672e97e1403a3dd864c91c176426eb3f067..bea5a121b39f57cf73e0a2e7a19a03ecdc0f39c0 100644
--- a/tensorflow/c/eager/BUILD
+++ b/tensorflow/c/eager/BUILD
@@ -27,6 +27,10 @@ tf_cuda_library(
             ":runtime",
             "//tensorflow/c:c_api",
             "//tensorflow/c:c_api_internal",
+            "//tensorflow/core:core_cpu",
+            "//tensorflow/core/common_runtime/eager:context",
+            "//tensorflow/core/common_runtime/eager:eager_executor",
+            "//tensorflow/core/common_runtime/eager:kernel_and_device",
             "//tensorflow/core:core_cpu_internal",
             "//tensorflow/core:framework",
             "//tensorflow/core:framework_internal",
@@ -54,11 +58,16 @@ tf_cuda_library(
         ":runtime",
         "//tensorflow/c:c_api",
         "//tensorflow/c:c_api_internal",
+        "//tensorflow/core:core_cpu",
         "//tensorflow/core:core_cpu_lib",
         "//tensorflow/core:framework",
         "//tensorflow/core:framework_internal",
         "//tensorflow/core:framework_lite",
+        "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
+        "//tensorflow/core/common_runtime/eager:context",
+        "//tensorflow/core/common_runtime/eager:eager_executor",
+        "//tensorflow/core/common_runtime/eager:kernel_and_device",
     ],
 )
 
@@ -93,6 +102,7 @@ tf_cuda_library(
         "//conditions:default": [
             "//tensorflow/c:c_api",
             "//tensorflow/core:core_cpu",
+            "//tensorflow/core/common_runtime/eager:kernel_and_device",
             "//tensorflow/core:core_cpu_internal",
             "//tensorflow/core:framework",
             "//tensorflow/core:framework_internal",
diff --git a/tensorflow/c/eager/c_api.cc b/tensorflow/c/eager/c_api.cc
index c27a7129fa3bef1a1a3389fb1b7d606b94077184..2402a6d04482508d1ccfb6b49d7c10dcdc574557 100644
--- a/tensorflow/c/eager/c_api.cc
+++ b/tensorflow/c/eager/c_api.cc
@@ -31,8 +31,10 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/copy_tensor.h"
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/common_runtime/device_set.h"
 #include "tensorflow/core/common_runtime/function.h"
 #include "tensorflow/core/common_runtime/rendezvous_mgr.h"
+#include "tensorflow/core/framework/node_def_util.h"
 #include "tensorflow/core/framework/rendezvous.h"
 #include "tensorflow/core/framework/tensor_shape.pb.h"
 #include "tensorflow/core/framework/types.h"
@@ -40,6 +42,7 @@ limitations under the License.
 #include "tensorflow/core/lib/gtl/flatmap.h"
 #include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/lib/gtl/stl_util.h"
+#include "tensorflow/core/platform/env.h"
 #include "tensorflow/core/platform/mutex.h"
 #include "tensorflow/core/platform/thread_annotations.h"
 #include "tensorflow/core/public/version.h"
@@ -65,6 +68,7 @@ string DeviceName(const tensorflow::Device* d) {
 #ifdef TENSORFLOW_EAGER_USE_XLA
 std::atomic_int_fast64_t func_id_generator(0);
 #endif  // TENSORFLOW_EAGER_USE_XLA
+
 }  // namespace
 
 extern "C" {
@@ -76,130 +80,149 @@ void TFE_ContextOptionsSetConfig(TFE_ContextOptions* options, const void* proto,
   TF_SetConfig(&options->session_options, proto, proto_len, status);
 }
 
+void TFE_ContextOptionsSetAsync(TFE_ContextOptions* options,
+                                unsigned char async) {
+  options->async = async;
+}
 void TFE_ContextOptionsSetDevicePlacementPolicy(
     TFE_ContextOptions* options, TFE_ContextDevicePlacementPolicy policy) {
   options->policy = policy;
 }
 
+TF_CAPI_EXPORT extern void TFE_ContextSetAsyncForThread(TFE_Context* ctx,
+                                                        unsigned char async,
+                                                        TF_Status* status) {
+  status->status = ctx->context.SetAsyncForThread(async);
+}
+
 void TFE_DeleteContextOptions(TFE_ContextOptions* options) { delete options; }
 
 TFE_Context* TFE_NewContext(const TFE_ContextOptions* opts, TF_Status* status) {
-  TF_Graph* graph = TF_NewGraph();
-  TF_Session* session = TF_NewSession(graph, &opts->session_options, status);
-  if (status->status.ok()) {
-    if (session->device_mgr == nullptr || session->devices.empty()) {
-      status->status = tensorflow::errors::InvalidArgument(
-          "Provided TF_SessionOptions are not compatible with eager execution "
-          "(perhaps the TF_SessionOptions alluded to session execution in a "
-          "remote address space?)");
-    }
-  }
+  std::vector<tensorflow::Device*> devices;
+  status->status = tensorflow::DeviceFactory::AddDevices(
+      opts->session_options.options, "/job:localhost/replica:0/task:0",
+      &devices);
   if (!status->status.ok()) {
-    TF_DeleteGraph(graph);
     return nullptr;
   }
-
-  return new TFE_Context(*opts, session);
+  std::unique_ptr<tensorflow::DeviceMgr> device_mgr(
+      new tensorflow::DeviceMgr(devices));
+  tensorflow::Rendezvous* r =
+      new tensorflow::IntraProcessRendezvous(device_mgr.get());
+  return new TFE_Context(opts->session_options.options, opts->policy,
+                         opts->async, std::move(device_mgr), r);
 }
 
 void TFE_DeleteContext(TFE_Context* ctx, TF_Status* status) {
-  status->status = tensorflow::Status::OK();
-  {
-    tensorflow::mutex_lock ml(ctx->cache_mu);
-    tensorflow::gtl::STLDeleteValues(&ctx->kernel_cache);
-  }
-  TF_Graph* graph = ctx->session->graph;
-  TF_DeleteSession(ctx->session, status);
-  TF_DeleteGraph(graph);
-  ctx->rendezvous->Unref();
   delete ctx;
 }
 
 TF_DeviceList* TFE_ContextListDevices(TFE_Context* ctx, TF_Status* status) {
-  return TF_SessionListDevices(ctx->session, status);
+  TF_DeviceList* list = new TF_DeviceList;
+  ctx->context.device_mgr()->ListDeviceAttributes(&list->response);
+  return list;
 }
 
-void TFE_ContextClearCaches(TFE_Context* ctx) {
-  tensorflow::mutex_lock ml(ctx->cache_mu);
-  tensorflow::gtl::STLDeleteValues(&ctx->kernel_cache);
-}
+void TFE_ContextClearCaches(TFE_Context* ctx) { ctx->context.ClearCaches(); }
 
 void TFE_ContextSetThreadLocalDevicePlacementPolicy(
     TFE_Context* ctx, TFE_ContextDevicePlacementPolicy policy) {
-  tensorflow::mutex_lock ml(ctx->policy_map_mu);
-  ctx->thread_local_policies[std::this_thread::get_id()] = policy;
+  ctx->context.SetThreadLocalDevicePlacementPolicy(
+      static_cast<tensorflow::ContextDevicePlacementPolicy>(policy));
 }
 
+// Note: this function looks up a thread-local policy, so it should be called
+// in the appropriate client thread. In particular, in async mode, it may not
+// be safe to call this function from the async EagerExecutor threads.
 extern TFE_ContextDevicePlacementPolicy TFE_ContextGetDevicePlacementPolicy(
     TFE_Context* ctx) {
-  tensorflow::mutex_lock ml(ctx->policy_map_mu);
-  auto policy_map_it =
-      ctx->thread_local_policies.find(std::this_thread::get_id());
-  if (policy_map_it != ctx->thread_local_policies.end()) {
-    return policy_map_it->second;
-  }
-  return ctx->policy;
+  return static_cast<TFE_ContextDevicePlacementPolicy>(
+      ctx->context.GetDevicePlacementPolicy());
+}
+
+void TFE_ContextAsyncWait(TFE_Context* ctx, TF_Status* status) {
+  status->status = ctx->context.AsyncWait();
+}
+
+void TFE_ContextGetStatus(TFE_Context* ctx, TF_Status* status) {
+  status->status = ctx->context.GetStatus();
+}
+
+void TFE_ContextAsyncClearError(TFE_Context* ctx) {
+  ctx->context.ClearAsyncError();
 }
 
 TFE_TensorHandle* TFE_NewTensorHandle(TF_Tensor* t, TF_Status* status) {
   tensorflow::Tensor tensor;
   status->status = tensorflow::TF_TensorToTensor(t, &tensor);
   if (!status->status.ok()) return nullptr;
-  return new TFE_TensorHandle(tensor, nullptr);
+  return new TFE_TensorHandle(tensor, nullptr, nullptr);
 }
 
-void TFE_DeleteTensorHandle(TFE_TensorHandle* h) { delete h; }
+void TFE_DeleteTensorHandle(TFE_TensorHandle* h) {
+  DCHECK(h);
+  h->Unref();
+}
 
 TF_DataType TFE_TensorHandleDataType(TFE_TensorHandle* h) {
-  return static_cast<TF_DataType>(h->t.dtype());
+  return static_cast<TF_DataType>(h->dtype);
 }
 
 int TFE_TensorHandleNumDims(TFE_TensorHandle* h, TF_Status* status) {
-  status->status = tensorflow::Status::OK();
-  return h->t.dims();
+  const tensorflow::Tensor* t = nullptr;
+  status->status = h->Tensor(&t);
+  return t == nullptr ? 0 : t->dims();
 }
 
 int64_t TFE_TensorHandleDim(TFE_TensorHandle* h, int dim_index,
                             TF_Status* status) {
-  status->status = tensorflow::Status::OK();
-  return h->t.dim_size(dim_index);
+  const tensorflow::Tensor* t = nullptr;
+  status->status = h->Tensor(&t);
+  return t == nullptr ? 0 : t->dim_size(dim_index);
 }
 
 const char* TFE_TensorHandleDeviceName(TFE_TensorHandle* h, TF_Status* status) {
-  // TODO(apassos) this will be potentially incorrect in the distributed case as
-  // our local device will have a name which depends on the ClusterSpec and
-  // hence will require the context to resolve.
-  status->status = tensorflow::Status::OK();
-  return (h->d == nullptr) ? "/job:localhost/replica:0/task:0/device:CPU:0"
-                           : h->d->name().c_str();
+  tensorflow::Device* d = nullptr;
+  status->status = h->OpDevice(&d);
+  return (d == nullptr) ? "/job:localhost/replica:0/task:0/device:CPU:0"
+                        : d->name().c_str();
 }
 
 TF_Tensor* TFE_TensorHandleResolve(TFE_TensorHandle* h, TF_Status* status) {
-  if (!IsCPU(h->d)) {
+  // TODO(agarwal): move this implementation inside TFE_TensorHandle.
+  tensorflow::Device* d = nullptr;
+  tensorflow::Device* op_device = nullptr;
+  const tensorflow::Tensor* t = nullptr;
+  status->status = h->TensorAndDevice(&t, &d, &op_device);
+  if (!status->status.ok()) return nullptr;
+  if (!IsCPU(d)) {
     TF_SetStatus(status, TF_UNIMPLEMENTED,
                  tensorflow::strings::StrCat(
                      "TFE_TensorHandle can be resolved iff it is on CPU (this "
                      "handle is on ",
-                     h->d->name(),
+                     d->name(),
                      "). Consider using TFE_TensorHandleCopyToDevice to get a "
                      "copy of the tensor on CPU")
                      .c_str());
     return nullptr;
   }
-  return tensorflow::TF_TensorFromTensor(h->t, status);
+  return tensorflow::TF_TensorFromTensor(*t, status);
 }
+}  // extern "C"
 
-TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
-                                               TFE_Context* ctx,
-                                               const char* device_name,
-                                               TF_Status* status) {
-  tensorflow::Device* dstd = ctx->devices()[0];
-  if (device_name != nullptr && strlen(device_name) > 0) {
-    status->status = ctx->session->device_mgr->LookupDevice(device_name, &dstd);
-    if (!status->status.ok()) return nullptr;
-  }
+namespace {
 
-  tensorflow::Device* srcd = h->d == nullptr ? ctx->devices()[0] : h->d;
+tensorflow::Status TensorHandleCopyToDevice(TFE_TensorHandle* h,
+                                            TFE_Context* ctx,
+                                            tensorflow::Device* dstd,
+                                            TFE_TensorHandle** output) {
+  const tensorflow::Tensor* src = nullptr;
+  tensorflow::Device* srcd = nullptr;
+  // TODO(agarwal): src_opd is unused. Perhaps allow TensorAndDevice to accept
+  // nullptr.
+  tensorflow::Device* src_opd = nullptr;
+  TF_RETURN_IF_ERROR(h->TensorAndDevice(&src, &srcd, &src_opd));
+  if (srcd == nullptr) srcd = ctx->context.HostCPU();
   bool is_same_device =
       (srcd == dstd) || (DeviceName(srcd) == DeviceName(dstd));
   const bool dst_cpu = IsCPU(dstd);
@@ -208,18 +231,16 @@ TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
   // has device type XLA_CPU, and the other CPU.
   const bool both_on_cpu = src_cpu && dst_cpu;
   if (is_same_device || both_on_cpu) {
-    return new TFE_TensorHandle(h->t, dst_cpu ? nullptr : dstd);
+    dstd = dst_cpu ? nullptr : dstd;
+    *output = new TFE_TensorHandle(*src, dstd, dstd);
+    return tensorflow::Status::OK();
   }
-  tensorflow::Tensor* src = &(h->t);
   if (!dst_cpu && (src->dtype() != tensorflow::DT_VARIANT &&
                    !tensorflow::DataTypeCanUseMemcpy(src->dtype()))) {
-    TF_SetStatus(
-        status, TF_INVALID_ARGUMENT,
-        tensorflow::strings::StrCat("Can't copy Tensor with type ",
-                                    tensorflow::DataTypeString(src->dtype()),
-                                    " to device ", DeviceName(dstd), ".")
-            .c_str());
-    return nullptr;
+    return tensorflow::errors::InvalidArgument(
+        "Can't copy Tensor with type ",
+        tensorflow::DataTypeString(src->dtype()), " to device ",
+        DeviceName(dstd), ".");
   }
   tensorflow::AllocatorAttributes attr;
   if (src->dtype() == tensorflow::DT_VARIANT) {
@@ -227,7 +248,9 @@ TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
   }
   tensorflow::Tensor dst(dstd->GetAllocator(attr), src->dtype(), src->shape());
   if (src->shape().num_elements() == 0) {
-    return new TFE_TensorHandle(dst, dst_cpu ? nullptr : dstd);
+    dstd = dst_cpu ? nullptr : dstd;
+    *output = new TFE_TensorHandle(dst, dstd, dstd);
+    return tensorflow::Status::OK();
   }
   tensorflow::DeviceContext* src_device_context = nullptr;
   if (!src_cpu) {
@@ -244,20 +267,26 @@ TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
   // With that setup, Sync()ing across all 3 streams should be sufficient
   // but more than necessary (since it waits for operations that might have
   // nothing to do with this tensor to complete).
-  status->status = srcd->Sync();
+  TF_RETURN_IF_ERROR(srcd->Sync());
   tensorflow::Notification n;
+  tensorflow::Status status;
   tensorflow::CopyTensor::ViaDMA("copy", src_device_context, dst_device_context,
                                  srcd, dstd, tensorflow::AllocatorAttributes(),
                                  tensorflow::AllocatorAttributes(), src, &dst,
-                                 [status, &n](const tensorflow::Status& s) {
-                                   status->status = s;
+                                 [&status, &n](const tensorflow::Status& s) {
+                                   status = s;
                                    n.Notify();
                                  });
   n.WaitForNotification();
-  return (TF_GetCode(status) == TF_OK)
-             ? new TFE_TensorHandle(dst, dst_cpu ? nullptr : dstd)
-             : nullptr;
+  if (status.ok()) {
+    dstd = dst_cpu ? nullptr : dstd;
+    *output = new TFE_TensorHandle(dst, dstd, dstd);
+  }
+  return status;
 }
+}  // namespace
+
+extern "C" {
 
 TFE_Op* TFE_NewOp(TFE_Context* ctx, const char* op_or_function_name,
                   TF_Status* status) {
@@ -266,8 +295,7 @@ TFE_Op* TFE_NewOp(TFE_Context* ctx, const char* op_or_function_name,
   status->status = tensorflow::AttrTypeMapForOp(name, &types);
   if (status->status.ok()) return new TFE_Op(ctx, name, types);
   if (TF_GetCode(status) == TF_NOT_FOUND) {
-    tensorflow::mutex_lock l(ctx->functions_mu);
-    if (ctx->func_lib_def.Find(name) != nullptr) {
+    if (ctx->context.FindFunctionByName(name)) {
       status->status = tensorflow::Status::OK();
       return new TFE_Op(ctx, name, nullptr);
     }
@@ -280,16 +308,14 @@ void TFE_DeleteOp(TFE_Op* op) { delete op; }
 void TFE_OpSetDevice(TFE_Op* op, const char* device_name, TF_Status* status) {
   tensorflow::Device* d = nullptr;
   if (device_name != nullptr && strlen(device_name) > 0) {
-    status->status =
-        op->ctx->session->device_mgr->LookupDevice(device_name, &d);
-    if (!status->status.ok()) return;
+    status->status = op->ctx->context.FindDeviceByName(device_name, &d);
   }
   op->device = d;
 }
 
 const char* TFE_OpGetDevice(TFE_Op* op, TF_Status* status) {
   tensorflow::Device* device =
-      (op->device == nullptr) ? op->ctx->devices()[0] : op->device;
+      (op->device == nullptr) ? op->ctx->context.HostCPU() : op->device;
   return device->name().c_str();
 }
 
@@ -302,15 +328,19 @@ void TFE_OpSetXLACompilation(TFE_Op* op, unsigned char enable) {
 }
 
 void TFE_OpAddInput(TFE_Op* op, TFE_TensorHandle* h, TF_Status* status) {
-  // Questionable heuristic ...
-  // - If a device was explicitly set on the op, always use that.
-  // - If not, place on the first non-host device seen.
-  if (op->device == nullptr && !IsCPU(h->d)) {
-    op->device = h->d;
+  if (op->device == nullptr) {
+    // Questionable heuristic ...
+    // - If a device was explicitly set on the op, always use that.
+    // - If not, place on the first non-host device seen.
+    tensorflow::Device* d = nullptr;
+    // TODO(agarwal): This call may block if h is not ready. Avoid this if
+    // possible.
+    status->status = h->Device(&d);
+    if (!status->status.ok()) return;
+    if (!IsCPU(d)) op->device = d;
   }
-  if (!status->status.ok()) return;
-  op->inputs.push_back(h->t);
-  op->input_devices.push_back(h->d);
+  h->Ref();
+  op->inputs.push_back(h);
   op->attrs.NumInputs(op->inputs.size());
 }
 
@@ -472,14 +502,14 @@ void TFE_OpSetAttrFunctionList(TFE_Op* op, const char* attr_name,
                 tensorflow::gtl::ArraySlice<const tensorflow::NameAttrList>(
                     funcs.get(), num_values));
 }
+}  // extern "C"
 
 namespace {
 
 tensorflow::Status ValidateInputTypeAndPlacement(
     TFE_Context* ctx, tensorflow::Device* host_device,
     tensorflow::Device* op_device, TFE_Op* op,
-    const tensorflow::OpKernel* kernel,
-    std::vector<TFE_TensorHandle*>* copied_tensors) {
+    const tensorflow::OpKernel* kernel) {
   const tensorflow::MemoryTypeVector& memtypes = kernel->input_memory_types();
   if (memtypes.size() != op->inputs.size()) {
     return tensorflow::errors::InvalidArgument(
@@ -488,14 +518,17 @@ tensorflow::Status ValidateInputTypeAndPlacement(
   for (int i = 0; i < op->inputs.size(); ++i) {
     const tensorflow::Device* expected_device =
         memtypes[i] == tensorflow::HOST_MEMORY ? host_device : op_device;
+    TFE_TensorHandle* handle = op->inputs[i];
+    tensorflow::Device* handle_device = nullptr;
+    TF_RETURN_IF_ERROR(handle->Device(&handle_device));
     const tensorflow::Device* actual_device =
-        op->input_devices[i] == nullptr ? host_device : op->input_devices[i];
+        handle_device == nullptr ? host_device : handle_device;
     if (expected_device != actual_device) {
       switch (TFE_ContextGetDevicePlacementPolicy(ctx)) {
         case TFE_DEVICE_PLACEMENT_SILENT_FOR_INT32:
           // TODO(xpan): See if we could bubble python related error up
           // to python level.
-          if (op->inputs[i].dtype() == tensorflow::DT_INT32) {
+          if (handle->dtype == tensorflow::DT_INT32) {
             // Note: enabling silent copies of int32 tensors to match behavior
             // of graph mode.
             break;
@@ -526,35 +559,241 @@ tensorflow::Status ValidateInputTypeAndPlacement(
       }
       // We are only here if the policy is warn or silent copies, so we should
       // trigger a copy.
-      TFE_TensorHandle original{op->inputs[i], op->input_devices[i]};
       TF_Status* s = TF_NewStatus();
       TFE_TensorHandle* copied_tensor = TFE_TensorHandleCopyToDevice(
-          &original, ctx, expected_device->name().c_str(), s);
-      if (!s->status.ok()) {
-        tensorflow::Status status = s->status;
-        delete s;
+          handle, ctx, expected_device->name().c_str(), s);
+      tensorflow::Status status = s->status;
+      TF_DeleteStatus(s);
+      if (!status.ok()) {
+        if (copied_tensor != nullptr) copied_tensor->Unref();
         return tensorflow::errors::Internal(
             "Failed copying input tensor from ", actual_device->name(), " to ",
             expected_device->name(), " in order to run ", op->name, ": ",
             status.error_message());
       }
-      op->inputs[i] = copied_tensor->t;
-      copied_tensors->push_back(copied_tensor);
-      op->input_devices[i] = copied_tensor->d;
-      delete s;
+      handle->Unref();
+      handle = copied_tensor;
+      op->inputs[i] = copied_tensor;
     }
-    if (op->inputs[i].dtype() != kernel->input_type(i)) {
+    if (handle->dtype != kernel->input_type(i)) {
       return tensorflow::errors::InvalidArgument(
           "cannot compute ", op->name, " as input #", i,
           " was expected to be a ",
           tensorflow::DataTypeString(kernel->input_type(i)),
-          " tensor but is a ",
-          tensorflow::DataTypeString(op->inputs[i].dtype()), " tensor");
+          " tensor but is a ", tensorflow::DataTypeString(handle->dtype),
+          " tensor");
+    }
+  }
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Device* SelectDevice(const tensorflow::NodeDef& ndef,
+                                 TFE_Context* ctx, TF_Status* status) {
+  tensorflow::DeviceSet ds;
+  for (tensorflow::Device* d : *ctx->context.devices()) {
+    ds.AddDevice(d);
+  }
+  tensorflow::DeviceTypeVector final_devices;
+  status->status = tensorflow::SupportedDeviceTypesForNode(
+      ds.PrioritizedDeviceTypeList(), ndef, &final_devices);
+  if (!status->status.ok()) {
+    return nullptr;
+  }
+  if (final_devices.empty()) {
+    status->status = tensorflow::errors::Internal(
+        "Could not find valid device for node ", ndef.DebugString());
+    return nullptr;
+  }
+  for (tensorflow::Device* d : *ctx->context.devices()) {
+    if (d->device_type() == final_devices[0].type_string()) {
+      return d;
+    }
+  }
+  status->status = tensorflow::errors::Unknown(
+      "Could not find a device for node ", ndef.DebugString());
+  return nullptr;
+}
+
+tensorflow::Status Execute(
+    TFE_Context* ctx, tensorflow::Device* device,
+    const tensorflow::gtl::InlinedVector<TFE_TensorHandle*, 4>& op_inputs,
+    tensorflow::KernelAndDevice* kernel, tensorflow::NodeExecStats* maybe_stats,
+    TFE_TensorHandle** retvals, int num_retvals) {
+  if (!ctx->context.SoftPlacement() && device == nullptr) {
+    device = ctx->context.HostCPU();
+  }
+
+  if (device == nullptr) {
+    // TODO(apassos) debug how the assignment below might return a different
+    // device from the one requested above.
+    device = kernel->device();
+  }
+
+  std::vector<tensorflow::Tensor> outputs(1);
+  const tensorflow::MemoryTypeVector* output_memory_types = nullptr;
+  output_memory_types = &kernel->kernel()->output_memory_types();
+  std::vector<tensorflow::Tensor> inputs(op_inputs.size());
+  for (int i = 0; i < op_inputs.size(); ++i) {
+    const tensorflow::Tensor* input_tensor = nullptr;
+    TF_RETURN_IF_ERROR(op_inputs[i]->Tensor(&input_tensor));
+    inputs[i] = *input_tensor;
+  }
+  // WARNING: kernel->Run utilizes the FunctionLibraryRuntime
+  // (ctx->func_lib(device)), which in turn holds a pointer to func_lib_def.
+  // But knowledge of the implementation
+  // of FunctionLibraryRuntime tells us that func_lib_def is not accessed by
+  // FunctionLibraryRuntime::Run(), so there is no thread-safety concern here.
+  // This is quite subtle. Re-work things to make this better?  (Would it make
+  // sense for FunctionLibraryRuntime to ensure thread-safe access to
+  // FunctionLibraryDefinition?).  TODO(apassos) figure out how to record stats
+  // for ops which are a part of functions.
+  // TODO(agarwal): change Run to take vector of handles ?
+  TF_RETURN_IF_ERROR(kernel->Run(&inputs, &outputs, maybe_stats));
+  if (maybe_stats != nullptr) {
+    maybe_stats->set_op_end_rel_micros(tensorflow::Env::Default()->NowMicros() -
+                                       maybe_stats->all_start_micros());
+    tensorflow::mutex_lock ml(*ctx->context.MetadataMu());
+    if (ctx->context.ShouldStoreMetadata()) {
+      auto* step_stats = ctx->context.RunMetadataProto()->mutable_step_stats();
+      // Lazily initialize the RunMetadata with information about all devices if
+      // this is the first call.
+      while (step_stats->dev_stats_size() < ctx->context.devices()->size()) {
+        step_stats->add_dev_stats();
+      }
+      // Find the current device's index.
+      int device_idx = 0;
+      for (int i = 0; i < ctx->context.devices()->size(); ++i) {
+        if (ctx->context.devices()->at(i) == device) {
+          device_idx = i;
+          break;
+        }
+      }
+      // Populate the device stats for this device.
+      auto* dev_stats = step_stats->mutable_dev_stats(device_idx);
+      dev_stats->set_device(device->name());
+      *dev_stats->add_node_stats() = *maybe_stats;
+    }
+  }
+  DCHECK_EQ(num_retvals, outputs.size());
+  tensorflow::Device* op_device = IsCPU(device) ? nullptr : device;
+  for (int i = 0; i < num_retvals; ++i) {
+    tensorflow::Device* d = op_device;
+    if (d != nullptr && output_memory_types != nullptr &&
+        (*output_memory_types)[i] == tensorflow::HOST_MEMORY) {
+      d = nullptr;
+    }
+    if (retvals[i] == nullptr) {
+      retvals[i] = new TFE_TensorHandle(outputs[i], d, op_device);
+    } else {
+      retvals[i]->SetTensorAndDevice(outputs[i], d, op_device);
     }
   }
   return tensorflow::Status::OK();
 }
 
+// TODO(agarwal): move EagerExecutor and EagerNode related code to a separate
+// file.
+class ExecuteNode : public tensorflow::EagerNode {
+ public:
+  ExecuteNode(TFE_Op* op, tensorflow::KernelAndDevice* kernel,
+              tensorflow::NodeExecStats* maybe_stats,
+              const tensorflow::DataTypeVector& output_dtypes,
+              TFE_TensorHandle** retvals, int num_retvals)
+      : tensorflow::EagerNode(op->ctx->context.NextId()),
+        ctx_(op->ctx),
+        op_device_(op->device),
+        inputs_(op->inputs),
+        kernel_(kernel),
+        maybe_stats_(maybe_stats),
+        retvals_(num_retvals) {
+    for (auto handle : inputs_) {
+      handle->Ref();
+    }
+    TFE_Context* ctx = op->ctx;
+    for (int i = 0; i < num_retvals; ++i) {
+      TFE_TensorHandle* h = new TFE_TensorHandle(id, output_dtypes[i], ctx);
+      h->Ref();
+      retvals[i] = h;
+      retvals_[i] = h;
+    }
+  }
+
+  ~ExecuteNode() override {
+    for (auto handle : inputs_) {
+      handle->Unref();
+    }
+    for (auto handle : retvals_) {
+      handle->Unref();
+    }
+  }
+
+  tensorflow::Status Run() override {
+    const tensorflow::Status status =
+        Execute(ctx_, op_device_, inputs_, kernel_, maybe_stats_.get(),
+                retvals_.begin(), retvals_.size());
+    if (status.ok()) {
+      return status;
+    } else {
+      return tensorflow::Status(
+          status.code(),
+          tensorflow::strings::StrCat("Got error, \"", status.error_message(),
+                                      "\" while executing kernel ",
+                                      kernel_->kernel()->def().DebugString()));
+    }
+  }
+
+ private:
+  TFE_Context* ctx_;
+  tensorflow::Device* op_device_;
+  tensorflow::gtl::InlinedVector<TFE_TensorHandle*, 4> inputs_;
+  tensorflow::KernelAndDevice* kernel_;
+  std::unique_ptr<tensorflow::NodeExecStats> maybe_stats_;
+  tensorflow::gtl::InlinedVector<TFE_TensorHandle*, 2> retvals_;
+};
+
+class CopyToDeviceNode : public tensorflow::EagerNode {
+ public:
+  CopyToDeviceNode(TFE_TensorHandle* src, tensorflow::Device* dstd,
+                   TFE_Context* ctx)
+      : tensorflow::EagerNode(ctx->context.NextId()),
+        src_(src),
+        dstd_(dstd),
+        ctx_(ctx),
+        dst_(new TFE_TensorHandle(id, src_->dtype, ctx)) {
+    src_->Ref();
+    dst_->Ref();
+  }
+
+  ~CopyToDeviceNode() override {
+    src_->Unref();
+    dst_->Unref();
+  }
+
+  tensorflow::Status Run() override {
+    TFE_TensorHandle* temp = nullptr;
+    TF_RETURN_IF_ERROR(TensorHandleCopyToDevice(src_, ctx_, dstd_, &temp));
+    const tensorflow::Tensor* tensor = nullptr;
+    tensorflow::Device* device = nullptr;
+    tensorflow::Device* op_device = nullptr;
+    tensorflow::Status status =
+        temp->TensorAndDevice(&tensor, &device, &op_device);
+    // `temp` is a ready handle. So the following call should return OK.
+    TF_DCHECK_OK(status) << status.error_message();
+    DCHECK(tensor);
+    dst_->SetTensorAndDevice(*tensor, device, op_device);
+    temp->Unref();
+    return tensorflow::Status::OK();
+  }
+
+  TFE_TensorHandle* dst() { return dst_; }
+
+ private:
+  TFE_TensorHandle* src_;
+  tensorflow::Device* dstd_;
+  TFE_Context* ctx_;
+  TFE_TensorHandle* dst_;
+};
+
 #ifdef TENSORFLOW_EAGER_USE_XLA
 // Synthesizes and returns a wrapper function over `op`, which must be a
 // primitive op (e.g. matmul).
@@ -582,8 +821,7 @@ const tensorflow::FunctionDef* OpToFunction(
   TFE_Context* ctx = op->ctx;
   const tensorflow::OpRegistrationData* op_data;
   {
-    tensorflow::tf_shared_lock l(ctx->functions_mu);
-    status->status = ctx->func_lib_def.LookUp(op->name, &op_data);
+    status->status = ctx->context.FindFunctionOpData(op->name, &op_data);
     if (!status->status.ok()) {
       return nullptr;
     }
@@ -620,7 +858,7 @@ const tensorflow::FunctionDef* OpToFunction(
       (*op_input_to_func_input)[i] = const_index;
       func_input_arg = signature->mutable_input_arg(const_index++);
       const_input_types->push_back(
-          static_cast<TF_DataType>(op->inputs[i].dtype()));
+          static_cast<TF_DataType>(op->inputs[i]->dtype));
     } else if (op_input_arg.type() == tensorflow::DT_RESOURCE) {
       VLOG(1) << "For resource input, mapping op input " << i
               << " to func input " << resource_index;
@@ -632,11 +870,11 @@ const tensorflow::FunctionDef* OpToFunction(
       (*op_input_to_func_input)[i] = arg_index;
       func_input_arg = signature->mutable_input_arg(arg_index++);
       arg_input_types->push_back(
-          static_cast<TF_DataType>(op->inputs[i].dtype()));
+          static_cast<TF_DataType>(op->inputs[i]->dtype));
     }
 
     func_input_arg->set_name(op_input_arg.name());
-    func_input_arg->set_type(op->inputs[i].dtype());
+    func_input_arg->set_type(op->inputs[i]->dtype);
   }
   VLOG(1) << "Added OpDef Inputs: " << fdef.DebugString();
 
@@ -679,10 +917,9 @@ const tensorflow::FunctionDef* OpToFunction(
   }
   VLOG(1) << "Fixed Output names and all types: " << fdef.DebugString();
 
-  tensorflow::mutex_lock l(ctx->functions_mu);
-  status->status = ctx->func_lib_def.AddFunctionDef(fdef);
+  status->status = ctx->context.AddFunctionDef(fdef);
   if (!status->status.ok()) return nullptr;
-  const auto ret = ctx->func_lib_def.Find(signature->name());
+  const auto ret = ctx->context.FindFunctionDef(signature->name());
   DCHECK(ret != nullptr);
   return ret;
 }
@@ -701,8 +938,7 @@ std::unique_ptr<TFE_Op> BuildXlaLaunch(TFE_Op* op, TF_Status* status) {
 
   const tensorflow::FunctionDef* fdef;
   {
-    tensorflow::tf_shared_lock l(op->ctx->functions_mu);
-    fdef = op->ctx->func_lib_def.Find(op->name);
+    fdef = op->ctx->context.FindFunctionDef(op->name);
   }
   std::vector<TF_DataType> const_input_types;
   std::vector<TF_DataType> arg_input_types;
@@ -729,21 +965,16 @@ std::unique_ptr<TFE_Op> BuildXlaLaunch(TFE_Op* op, TF_Status* status) {
   // Since input param reordering may have occurred between `op` and `launch_op`
   // via `op_input_to_func_input`, adjust the actual inputs accordingly.
   launch_op->inputs = op->inputs;
-  launch_op->input_devices = op->input_devices;
+  for (TFE_TensorHandle* h : launch_op->inputs) {
+    h->Ref();
+  }
   if (!op_input_to_func_input.empty()) {
     DCHECK_EQ(op->inputs.size(), op_input_to_func_input.size());
-    if (!op->input_devices.empty()) {
-      DCHECK_EQ(op->input_devices.size(), op_input_to_func_input.size());
-    }
     for (int i = 0; i < op_input_to_func_input.size(); ++i) {
       VLOG(1) << "mapping op input " << i << " to func input "
               << op_input_to_func_input[i];
 
       launch_op->inputs[op_input_to_func_input[i]] = op->inputs[i];
-      if (!op->input_devices.empty()) {
-        launch_op->input_devices[op_input_to_func_input[i]] =
-            op->input_devices[i];
-      }
     }
   }
   launch_op->attrs.NumInputs(op->inputs.size());
@@ -776,15 +1007,18 @@ std::unique_ptr<TFE_Op> BuildXlaLaunch(TFE_Op* op, TF_Status* status) {
   return launch_op;
 }
 #endif  // TENSORFLOW_EAGER_USE_XLA
+
 }  // namespace
 
+extern "C" {
+
 void TFE_Execute(TFE_Op* op, TFE_TensorHandle** retvals, int* num_retvals,
                  TF_Status* status) {
   TFE_Context* ctx = op->ctx;
-  // TODO(ashankar): ASSUMPTION: ctx->devices()[0] is always CPU
-  tensorflow::Device* device =
-      (op->device == nullptr) ? ctx->devices()[0] : op->device;
-
+  status->status = ctx->context.GetStatus();
+  if (!status->status.ok()) {
+    return;
+  }
 #ifdef TENSORFLOW_EAGER_USE_XLA
   std::unique_ptr<TFE_Op> xla_launch_op;
   if (op->use_xla && op->name != "_XlaLaunch") {
@@ -795,49 +1029,99 @@ void TFE_Execute(TFE_Op* op, TFE_TensorHandle** retvals, int* num_retvals,
     op = xla_launch_op.get();
   }
 #endif  // TENSORFLOW_EAGER_USE_XLA
-
-  std::vector<tensorflow::Tensor> outputs(1);
-  const tensorflow::MemoryTypeVector* output_memory_types = nullptr;
-  tensorflow::Fprint128 cache_key = op->attrs.CacheKey(device->name());
-  tensorflow::KernelAndDevice* kernel;
-  {
-    tensorflow::tf_shared_lock l(ctx->cache_mu);
-    kernel = tensorflow::gtl::FindPtrOrNull(ctx->kernel_cache, cache_key);
+  // Ensure all resource-touching ops run on the device where the resource
+  // lives, regardless of anything else that has been specified. This matches
+  // the graph mode behavior.
+  for (int i = 0; i < op->inputs.size(); ++i) {
+    tensorflow::Device* input_op_device = nullptr;
+    status->status = op->inputs[i]->OpDevice(&input_op_device);
+    if (!status->status.ok()) return;
+    if (op->inputs[i]->dtype == tensorflow::DT_RESOURCE &&
+        input_op_device != op->device) {
+      tensorflow::Device* d =
+          input_op_device == nullptr ? ctx->context.HostCPU() : input_op_device;
+      VLOG(1) << "Changing device of operation " << op->name << " to "
+              << d->name() << " because input #" << i
+              << " is a resource in this device.";
+      op->device = d;
+    }
   }
+  tensorflow::Device* device = op->device;
+  if (!ctx->context.SoftPlacement() && device == nullptr) {
+    device = ctx->context.HostCPU();
+  }
+
+  tensorflow::Fprint128 cache_key =
+      op->attrs.CacheKey(device == nullptr ? "unspecified" : device->name());
+  tensorflow::KernelAndDevice* kernel = ctx->context.GetCachedKernel(cache_key);
   if (kernel == nullptr) {
     const tensorflow::NodeDef& ndef = op->attrs.BuildNodeDef();
-    if (ctx->log_device_placement) {
+    if (ctx->context.SoftPlacement() && device == nullptr) {
+      device = SelectDevice(ndef, ctx, status);
+      if (!status->status.ok()) {
+        return;
+      }
+    }
+    CHECK(device != nullptr);
+    if (ctx->context.LogDevicePlacement()) {
       LOG(INFO) << "Executing op " << ndef.op() << " in device "
                 << device->name();
     }
-    kernel = new tensorflow::KernelAndDevice(ctx->rendezvous);
+    kernel = new tensorflow::KernelAndDevice(ctx->context.GetRendezvous());
     // Knowledge of the implementation of Init (and in-turn
     // FunctionLibraryRuntime::CreateKernel) tells us that ctx->func_lib_def
     // will be accessed, so grab on to the lock.
-    // See WARNING comment below - would be nice to rework to avoid this
-    // subtlety.
-    tensorflow::tf_shared_lock l(ctx->functions_mu);
-    status->status =
-        tensorflow::KernelAndDevice::Init(ndef, ctx->func_lib(device), kernel);
+    // See WARNING comment in Execute (before kernel->Run) - would be nice to
+    // rework to avoid this subtlety.
+    tensorflow::tf_shared_lock l(*ctx->context.FunctionsMu());
+    status->status = tensorflow::KernelAndDevice::Init(
+        ndef, ctx->context.func_lib(device), kernel);
     if (!status->status.ok()) {
       delete kernel;
       return;
     }
-    tensorflow::mutex_lock ml(ctx->cache_mu);
-    tensorflow::gtl::InsertOrUpdate(&(ctx->kernel_cache), cache_key, kernel);
-  }
-  std::vector<TFE_TensorHandle*> copied_tensors;
-  status->status = ValidateInputTypeAndPlacement(
-      ctx, ctx->devices()[0], device, op, kernel->kernel(), &copied_tensors);
-  output_memory_types = &kernel->kernel()->output_memory_types();
-  if (!status->status.ok()) {
-    for (auto* t : copied_tensors) {
-      TFE_DeleteTensorHandle(t);
+    // Update output_dtypes inside `kernel`.
+    const tensorflow::OpDef* op_def = nullptr;
+    const tensorflow::FunctionDef* function_def =
+        ctx->context.FuncLibDef()->Find(ndef.op());
+    if (function_def != nullptr) {
+      op_def = &(function_def->signature());
     }
+    if (op_def == nullptr) {
+      status->status = OpDefForOp(ndef.op().c_str(), &op_def);
+      if (!status->status.ok()) {
+        return;
+      }
+    }
+    tensorflow::DataTypeVector input_dtypes;
+    status->status = InOutTypesForNode(ndef, *op_def, &input_dtypes,
+                                       kernel->mutable_output_dtypes());
+    if (!status->status.ok()) {
+      return;
+    }
+    ctx->context.AddKernelToCache(cache_key, kernel);
+  }
+  const tensorflow::DataTypeVector& output_dtypes = kernel->output_dtypes();
+  const int output_dtypes_size = output_dtypes.size();
+  if (output_dtypes_size > *num_retvals) {
+    TF_SetStatus(status, TF_INVALID_ARGUMENT,
+                 tensorflow::strings::StrCat("Expecting ", output_dtypes.size(),
+                                             " outputs, but *num_retvals is ",
+                                             *num_retvals)
+                     .c_str());
     return;
   }
+  *num_retvals = output_dtypes_size;
+  if (device == nullptr) {
+    // TODO(apassos) debug how the assignment below might return a different
+    // device from the one requested above.
+    device = kernel->device();
+  }
+  status->status = ValidateInputTypeAndPlacement(ctx, ctx->context.HostCPU(),
+                                                 device, op, kernel->kernel());
+  if (!status->status.ok()) return;
   std::unique_ptr<tensorflow::NodeExecStats> maybe_stats;
-  if (ctx->should_store_metadata.load()) {
+  if (ctx->context.ShouldStoreMetadata()) {
     maybe_stats.reset(new tensorflow::NodeExecStats);
     maybe_stats->set_node_name(op->name);
     maybe_stats->set_all_start_micros(tensorflow::Env::Default()->NowMicros());
@@ -845,53 +1129,52 @@ void TFE_Execute(TFE_Op* op, TFE_TensorHandle** retvals, int* num_retvals,
     maybe_stats->set_scheduled_micros(tensorflow::Env::Default()->NowMicros());
     // TODO(apassos) track referenced tensors
   }
-  // WARNING: kernel->Run utilizes the FunctionLibraryRuntime
-  // (ctx->func_lib(device)), which in turn holds a pointer to func_lib_def,
-  // which is GUARDED_BY(ctx->functions_mu). But knowledge of the implementation
-  // of FunctionLibraryRuntime tells us that func_lib_def is not accessed by
-  // FunctionLibraryRuntime::Run(), so there is no thread-safety concern here.
-  // This is quite subtle. Re-work things to make this better?  (Would it make
-  // sense for FunctionLibraryRuntime to ensure thread-safe access to
-  // FunctionLibraryDefinition?).  TODO(apassos) figure out how to record stats
-  // for ops which are a part of functions.
-  status->status = kernel->Run(&op->inputs, &outputs, maybe_stats.get());
-  for (auto* t : copied_tensors) {
-    TFE_DeleteTensorHandle(t);
-  }
-  if (!status->status.ok()) return;
-  if (maybe_stats != nullptr) {
-    maybe_stats->set_op_end_rel_micros(tensorflow::Env::Default()->NowMicros() -
-                                       maybe_stats->all_start_micros());
-    tensorflow::mutex_lock ml(ctx->metadata_mu);
-    if (ctx->should_store_metadata.load()) {
-      auto* step_stats = ctx->run_metadata.mutable_step_stats();
-      // Lazily initialize the RunMetadata with information about all devices if
-      // this is the first call.
-      while (step_stats->dev_stats_size() < ctx->devices().size()) {
-        step_stats->add_dev_stats();
-      }
-      // Find the current device's index.
-      int device_idx = 0;
-      for (int i = 0; i < ctx->devices().size(); ++i) {
-        if (ctx->devices()[i] == device) {
-          device_idx = i;
-          break;
-        }
-      }
-      // Populate the device stats for this device.
-      auto* dev_stats = step_stats->mutable_dev_stats(device_idx);
-      dev_stats->set_device(device->name());
-      *dev_stats->add_node_stats() = *maybe_stats;
+  if (ctx->context.Async()) {
+    // Note that for async mode, execution order will make sure that all
+    // input handles are ready before the op that consumes them is executed.
+    // TODO(agarwal): Consider executing "cheap" kernels inline for performance.
+    tensorflow::EagerNode* node =
+        new ExecuteNode(op, kernel, maybe_stats.release(), output_dtypes,
+                        retvals, *num_retvals);
+    ctx->context.ExecutorAdd(node);
+  } else {
+    // Execute checks whether retvals[i] is nullptr to decide whether it needs
+    // to allocate a new handle.
+    for (int i = 0; i < *num_retvals; ++i) {
+      retvals[i] = nullptr;
     }
+    status->status = Execute(op->ctx, op->device, op->inputs, kernel,
+                             maybe_stats.get(), retvals, *num_retvals);
   }
-  *num_retvals = std::min<int>(*num_retvals, outputs.size());
-  for (int i = 0; i < *num_retvals; ++i) {
-    tensorflow::Device* d = IsCPU(device) ? nullptr : device;
-    if (d != nullptr && output_memory_types != nullptr &&
-        (*output_memory_types)[i] == tensorflow::HOST_MEMORY) {
-      d = nullptr;
-    }
-    retvals[i] = new TFE_TensorHandle(outputs[i], d);
+}
+
+TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
+                                               TFE_Context* ctx,
+                                               const char* device_name,
+                                               TF_Status* status) {
+  status->status = ctx->context.GetStatus();
+  if (!status->status.ok()) {
+    return nullptr;
+  }
+  tensorflow::Device* dstd = ctx->context.HostCPU();
+  if (device_name != nullptr && strlen(device_name) > 0) {
+    status->status =
+        ctx->context.device_mgr()->LookupDevice(device_name, &dstd);
+    if (!status->status.ok()) return nullptr;
+  }
+  if (ctx->context.Async()) {
+    // Note that `h` may not be currently ready. However execution order will
+    // make sure that `h` is ready before the copy is actually done.
+    CopyToDeviceNode* node = new CopyToDeviceNode(h, dstd, ctx);
+    TFE_TensorHandle* output = node->dst();
+    // Note that calling Add makes `node` accessible by the EagerExecutor
+    // thread. So further accesses need to be thread-safe.
+    ctx->context.ExecutorAdd(node);
+    return output;
+  } else {
+    TFE_TensorHandle* output = nullptr;
+    status->status = TensorHandleCopyToDevice(h, ctx, dstd, &output);
+    return output;
   }
 }
 
@@ -904,46 +1187,189 @@ void TFE_ContextAddFunctionDef(TFE_Context* ctx,
         tensorflow::errors::InvalidArgument("Invalid FunctionDef proto");
     return;
   }
-  tensorflow::mutex_lock l(ctx->functions_mu);
-  status->status = ctx->func_lib_def.AddFunctionDef(function_def);
+  status->status = ctx->context.AddFunctionDef(function_def);
 }
 
 void TFE_ContextAddFunction(TFE_Context* ctx, TF_Function* function,
                             TF_Status* status) {
-  tensorflow::mutex_lock l(ctx->functions_mu);
-  status->status = ctx->func_lib_def.AddFunctionDef(function->fdef);
+  status->status = ctx->context.AddFunctionDef(function->fdef);
+}
+
+void TFE_ContextEnableRunMetadata(TFE_Context* ctx) {
+  ctx->context.SetShouldStoreMetadata(true);
+}
+
+void TFE_ContextDisableRunMetadata(TFE_Context* ctx) {
+  ctx->context.SetShouldStoreMetadata(false);
 }
 
 }  // extern "C"
 
 TFE_TensorHandle* TFE_NewTensorHandle(const tensorflow::Tensor& t) {
-  return new TFE_TensorHandle(t, nullptr);
+  return new TFE_TensorHandle(t, nullptr, nullptr);
 }
 
 const tensorflow::Tensor* TFE_TensorHandleUnderlyingTensorInHostMemory(
     TFE_TensorHandle* h, TF_Status* status) {
-  if (h->d != nullptr) {
+  tensorflow::Device* d = nullptr;
+  tensorflow::Device* op_device = nullptr;
+  const tensorflow::Tensor* t = nullptr;
+  status->status = h->TensorAndDevice(&t, &d, &op_device);
+  if (!status->status.ok()) return nullptr;
+  if (d != nullptr) {
     status->status = tensorflow::errors::FailedPrecondition(
         "TFE_TensorHandle is placed in device (not host) memory. Cannot return "
         "a tensorflow::Tensor");
     return nullptr;
   }
-  return &h->t;
+  return t;
 }
 
-void TFE_ContextEnableRunMetadata(TFE_Context* ctx) {
-  ctx->should_store_metadata.store(true);
+void TFE_ContextExportRunMetadata(TFE_Context* ctx, TF_Buffer* buf,
+                                  TF_Status* status) {
+  TFE_ContextAsyncWait(ctx, status);
+  if (!status->status.ok()) return;
+  tensorflow::mutex_lock ml(*ctx->context.MetadataMu());
+  status->status = MessageToBuffer(*ctx->context.RunMetadataProto(), buf);
+  ctx->context.RunMetadataProto()->Clear();
 }
 
-void TFE_ContextDisableRunMetadata(TFE_Context* ctx) {
-  tensorflow::mutex_lock ml(ctx->metadata_mu);
-  ctx->should_store_metadata.store(false);
-  ctx->run_metadata.Clear();
+namespace {
+TFE_Op* GetFunc(TFE_Context* ctx, const tensorflow::NameAttrList& func,
+                TF_Status* status) {
+  TFE_Op* func_op = TFE_NewOp(ctx, func.name().data(), status);
+  for (const auto& attr : func.attr()) {
+    if (TF_GetCode(status) != TF_OK) return nullptr;
+    SetOpAttrValueScalar(ctx, func_op, attr.second, attr.first.data(), status);
+    if (TF_GetCode(status) != TF_OK) return nullptr;
+  }
+  return func_op;
 }
+}  // namespace
 
-void TFE_ContextExportRunMetadata(TFE_Context* ctx, TF_Buffer* buf,
-                                  TF_Status* status) {
-  tensorflow::mutex_lock ml(ctx->metadata_mu);
-  status->status = MessageToBuffer(ctx->run_metadata, buf);
-  ctx->run_metadata.Clear();
+namespace tensorflow {
+void SetOpAttrValueScalar(TFE_Context* ctx, TFE_Op* op,
+                          const tensorflow::AttrValue& default_value,
+                          const char* attr_name, TF_Status* status) {
+  switch (default_value.value_case()) {
+    case tensorflow::AttrValue::kS:
+      TFE_OpSetAttrString(op, attr_name, default_value.s().data());
+      break;
+    case tensorflow::AttrValue::kI:
+      TFE_OpSetAttrInt(op, attr_name, static_cast<int64_t>(default_value.i()));
+      break;
+    case tensorflow::AttrValue::kF:
+      TFE_OpSetAttrFloat(op, attr_name, default_value.f());
+      break;
+    case tensorflow::AttrValue::kB:
+      TFE_OpSetAttrBool(op, attr_name, default_value.b());
+      break;
+    case tensorflow::AttrValue::kType:
+      TFE_OpSetAttrType(op, attr_name,
+                        static_cast<TF_DataType>(default_value.type()));
+      break;
+    case tensorflow::AttrValue::kShape: {
+      const auto& tensor_shape = default_value.shape();
+      if (tensor_shape.unknown_rank()) {
+        TFE_OpSetAttrShape(op, attr_name, nullptr, -1, status);
+      } else {
+        const auto num_dims = tensor_shape.dim_size();
+        std::unique_ptr<int64_t[]> dims(new int64_t[num_dims]);
+        for (int i = 0; i < num_dims; ++i) {
+          dims[i] = tensor_shape.dim(i).size();
+        }
+        TFE_OpSetAttrShape(op, attr_name, dims.get(), num_dims, status);
+      }
+    } break;
+    case tensorflow::AttrValue::kFunc: {
+      const auto func_op = GetFunc(ctx, default_value.func(), status);
+      if (TF_GetCode(status) != TF_OK) return;
+      // TODO(nareshmodi): TFE_OpSetAttrFunction and TFE_OpSetAttrFunctionList
+      // require TFE_Op* and just convert it internally to a NameAttrList, so
+      // consider adding an overload to the C API to make this case easier.
+      TFE_OpSetAttrFunction(op, attr_name, func_op);
+    } break;
+    case tensorflow::AttrValue::kList:
+      TF_FALLTHROUGH_INTENDED;
+    case tensorflow::AttrValue::kTensor:
+      TF_FALLTHROUGH_INTENDED;
+    case tensorflow::AttrValue::kPlaceholder:
+      TF_FALLTHROUGH_INTENDED;
+    case tensorflow::AttrValue::VALUE_NOT_SET:
+      TF_SetStatus(
+          status, TF_UNIMPLEMENTED,
+          tensorflow::strings::StrCat("Unable to set attr for default value: ",
+                                      default_value.DebugString())
+              .data());
+  }
+}
+}  // namespace tensorflow
+
+
+
+bool TFE_TensorHandle::IsReady() {
+  if (node_id == 0) return true;
+  tensorflow::mutex_lock l(ctx_mutex_);
+  return ctx_ == nullptr;
+}
+
+tensorflow::Status TFE_TensorHandle::WaitReady() {
+  if (node_id == 0) return tensorflow::Status::OK();
+  tensorflow::EagerExecutor* executor = nullptr;
+  {
+    tensorflow::mutex_lock l(ctx_mutex_);
+    if (ctx_ == nullptr) return tensorflow::Status::OK();
+    executor = ctx_->context.Executor();
+  }
+  return executor->WaitFor(node_id);
+}
+
+tensorflow::Status TFE_TensorHandle::Tensor(const tensorflow::Tensor** t) {
+  TF_RETURN_IF_ERROR(WaitReady());
+  DCHECK(IsReady());
+  *t = &tensor_;
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status TFE_TensorHandle::Device(tensorflow::Device** d) {
+  TF_RETURN_IF_ERROR(WaitReady());
+  DCHECK(IsReady());
+  *d = device_;
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status TFE_TensorHandle::OpDevice(tensorflow::Device** d) {
+  TF_RETURN_IF_ERROR(WaitReady());
+  DCHECK(IsReady());
+  *d = op_device_;
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status TFE_TensorHandle::TensorAndDevice(
+    const tensorflow::Tensor** tensor, tensorflow::Device** device,
+    tensorflow::Device** op_device) {
+  TF_RETURN_IF_ERROR(WaitReady());
+  DCHECK(IsReady());
+  *tensor = &tensor_;
+  *device = device_;
+  *op_device = op_device_;
+  return tensorflow::Status::OK();
+}
+
+void TFE_TensorHandle::SetTensorAndDevice(const tensorflow::Tensor& tensor,
+                                          tensorflow::Device* device,
+                                          tensorflow::Device* op_device) {
+  tensorflow::mutex_lock l(ctx_mutex_);
+  DCHECK(node_id > 0 && ctx_) << "SetTensorAndDevice should only be called "
+                              << "on non-ready handles.";
+  ctx_ = nullptr;
+  tensor_ = tensor;
+  device_ = device;
+  op_device_ = op_device;
+}
+
+TFE_Op::~TFE_Op() {
+  for (TFE_TensorHandle* h : inputs) {
+    h->Unref();
+  }
 }
diff --git a/tensorflow/c/eager/c_api.h b/tensorflow/c/eager/c_api.h
index 90cfb7500e26231052b7c942ba6d2aeeafab7dc9..a5029bf2115c7dac54d03b8bc6397bc63349c068 100644
--- a/tensorflow/c/eager/c_api.h
+++ b/tensorflow/c/eager/c_api.h
@@ -61,7 +61,8 @@ TF_CAPI_EXPORT extern void TFE_ContextOptionsSetConfig(
 // Controls how to act when we try to run an operation on a given device but
 // some input tensors are not on that device.
 typedef enum TFE_ContextDevicePlacementPolicy {
-  // Running operations with input tensors on the wrong device will fail.
+  // Running operations with input tensors on the wrong device will fail. When
+  // soft placement is enabled, this acts like TFE_DEVICE_PLACEMENT_SILENT.
   TFE_DEVICE_PLACEMENT_EXPLICIT = 0,
   // Copy the tensor to the right device but log a warning.
   TFE_DEVICE_PLACEMENT_WARN = 1,
@@ -69,10 +70,16 @@ typedef enum TFE_ContextDevicePlacementPolicy {
   // operation will be blocked till the copy completes.
   TFE_DEVICE_PLACEMENT_SILENT = 2,
   // Default placement policy which silently copies int32 tensors but not other
-  // dtypes.
+  // dtypes. When soft placement is enabled, this acts like
+  // TFE_DEVICE_PLACEMENT_SILENT.
   TFE_DEVICE_PLACEMENT_SILENT_FOR_INT32 = 3,
 } TFE_ContextDevicePlacementPolicy;
 
+// Sets the default execution mode (sync/async). Note that this can be
+// overridden per thread using TFE_ContextSetAsyncForThread.
+TF_CAPI_EXPORT extern void TFE_ContextOptionsSetAsync(TFE_ContextOptions*,
+                                                      unsigned char async);
+
 TF_CAPI_EXPORT extern void TFE_ContextOptionsSetDevicePlacementPolicy(
     TFE_ContextOptions*, TFE_ContextDevicePlacementPolicy);
 
@@ -108,6 +115,30 @@ TF_CAPI_EXPORT extern void TFE_ContextSetThreadLocalDevicePlacementPolicy(
 TF_CAPI_EXPORT extern TFE_ContextDevicePlacementPolicy
 TFE_ContextGetDevicePlacementPolicy(TFE_Context*);
 
+// Overrides the execution mode (sync/async) for the current thread.
+TF_CAPI_EXPORT extern void TFE_ContextSetAsyncForThread(TFE_Context*,
+                                                        unsigned char async,
+                                                        TF_Status* status);
+
+// Causes the calling thread to block till all ops dispatched in async mode
+// have been executed. Note that "execution" here refers to kernel execution /
+// scheduling of copies, etc. Similar to sync execution, it doesn't guarantee
+// that lower level device queues (like GPU streams) have been flushed.
+//
+// This call may not block for execution of ops enqueued concurrently with this
+// call.
+TF_CAPI_EXPORT extern void TFE_ContextAsyncWait(TFE_Context*,
+                                                TF_Status* status);
+
+// When an error happens, any pending operations are discarded and newly issued
+// ops return an error. This call clears the error state and re-enables
+// execution of newly issued ops.
+//
+// Note that outputs of discarded ops remain in a corrupt state and should not
+// be used for future calls.
+// TODO(agarwal): mark the affected handles and raise errors if they are used.
+TF_CAPI_EXPORT extern void TFE_ContextAsyncClearError(TFE_Context*);
+
 // A handle to a tensor on a device.
 //
 // Like a TF_Tensor, a TFE_TensorHandle refers to a tensor with a value, shape,
@@ -117,15 +148,21 @@ typedef struct TFE_TensorHandle TFE_TensorHandle;
 
 TF_CAPI_EXPORT extern TFE_TensorHandle* TFE_NewTensorHandle(TF_Tensor* t,
                                                             TF_Status* status);
+// Indicates that the caller will not be using `h` any more.
 TF_CAPI_EXPORT extern void TFE_DeleteTensorHandle(TFE_TensorHandle* h);
 TF_CAPI_EXPORT extern TF_DataType TFE_TensorHandleDataType(TFE_TensorHandle* h);
+// This function will block till the operation that produces `h` has completed.
 TF_CAPI_EXPORT extern int TFE_TensorHandleNumDims(TFE_TensorHandle* h,
                                                   TF_Status* status);
+// This function will block till the operation that produces `h` has completed.
 TF_CAPI_EXPORT extern int64_t TFE_TensorHandleDim(TFE_TensorHandle* h,
                                                   int dim_index,
                                                   TF_Status* status);
+// This function will block till the operation that produces `h` has completed.
 TF_CAPI_EXPORT extern const char* TFE_TensorHandleDeviceName(
     TFE_TensorHandle* h, TF_Status* status);
+
+// This function will block till the operation that produces `h` has completed.
 TF_CAPI_EXPORT extern TF_Tensor* TFE_TensorHandleResolve(TFE_TensorHandle* h,
                                                          TF_Status* status);
 
@@ -135,6 +172,9 @@ TF_CAPI_EXPORT extern TF_Tensor* TFE_TensorHandleResolve(TFE_TensorHandle* h,
 // that shares the underlying buffer. Otherwise, it currently requires at least
 // one of the source or destination devices to be CPU (i.e., for the source or
 // destination tensor to be placed in host memory).
+// If async execution is enabled, the copy may be enqueued and the call will
+// return a "non-ready" handle. Else, this function returns after the copy has
+// been done.
 TF_CAPI_EXPORT extern TFE_TensorHandle* TFE_TensorHandleCopyToDevice(
     TFE_TensorHandle* h, TFE_Context* ctx, const char* device_name,
     TF_Status* status);
@@ -155,6 +195,7 @@ typedef struct TFE_Op TFE_Op;
 TF_CAPI_EXPORT extern TFE_Op* TFE_NewOp(TFE_Context* ctx,
                                         const char* op_or_function_name,
                                         TF_Status* status);
+
 TF_CAPI_EXPORT extern void TFE_DeleteOp(TFE_Op* op);
 
 TF_CAPI_EXPORT extern void TFE_OpSetDevice(TFE_Op* op, const char* device_name,
@@ -240,13 +281,21 @@ TF_CAPI_EXPORT extern void TFE_OpSetAttrFunctionList(TFE_Op* op,
                                                      int num_values);
 
 // Execute the operation defined by 'op' and return handles to computed
-// tensors in 'retvals'.
+// tensors in `retvals`.
+//
+// 'retvals' must point to a pre-allocated array of TFE_TensorHandle* and
+// '*num_retvals' should be set to the size of this array. It is an error if
+// the size of 'retvals' is less than the number of outputs. This call sets
+// *num_retvals to the number of outputs.
 //
-// 'retvals' must point to a pre-allocated array of TFE_TensorHandle*
-// and '*num_retvals' should be set to the size of this array.
+// If async execution is enabled, the call may simply enqueue the execution
+// and return "non-ready" handles in `retvals`. Note that any handles contained
+// in 'op' should not be mutated till the kernel execution actually finishes.
 //
-// On return, 'num_retvals' will be set to the actual number of outputs
-// returned by the operation.
+// For sync execution, if any of the inputs to `op` are not ready, this call
+// will block till they become ready and then return when the kernel execution
+// is done.
+// TODO(agarwal): change num_retvals to int from int*.
 TF_CAPI_EXPORT extern void TFE_Execute(TFE_Op* op, TFE_TensorHandle** retvals,
                                        int* num_retvals, TF_Status* status);
 
@@ -272,6 +321,8 @@ TF_CAPI_EXPORT extern void TFE_ContextDisableRunMetadata(TFE_Context* ctx);
 // Populates the passed-in buffer with a serialized RunMetadata protocol buffer
 // containing any run metadata information accumulated so far and clears this
 // information.
+// If async mode is enabled, this call blocks till all currently pending ops are
+// done.
 TF_CAPI_EXPORT extern void TFE_ContextExportRunMetadata(TFE_Context* ctx,
                                                         TF_Buffer* buf,
                                                         TF_Status* status);
diff --git a/tensorflow/c/eager/c_api_internal.h b/tensorflow/c/eager/c_api_internal.h
index 3356054cd09939b24a1d942c0cced06136e33b85..5b29120b40b0ddbcee4522a73024fd7a439bd64e 100644
--- a/tensorflow/c/eager/c_api_internal.h
+++ b/tensorflow/c/eager/c_api_internal.h
@@ -19,7 +19,9 @@ limitations under the License.
 
 #include <algorithm>
 #include <cstddef>
+#include <map>
 #include <memory>
+#include <queue>
 #include <string>
 #include <thread>
 #include <vector>
@@ -28,86 +30,121 @@ limitations under the License.
 #include "tensorflow/c/c_api_internal.h"
 #include "tensorflow/c/eager/runtime.h"
 #include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/common_runtime/eager/context.h"
+#include "tensorflow/core/common_runtime/eager/eager_executor.h"
+#include "tensorflow/core/common_runtime/eager/kernel_and_device.h"
 #include "tensorflow/core/common_runtime/function.h"
 #include "tensorflow/core/common_runtime/rendezvous_mgr.h"
 #include "tensorflow/core/framework/rendezvous.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/lib/gtl/inlined_vector.h"
 #include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/lib/gtl/stl_util.h"
 #include "tensorflow/core/platform/mutex.h"
 #include "tensorflow/core/platform/thread_annotations.h"
 #include "tensorflow/core/public/version.h"
 
+
 struct TFE_ContextOptions {
   TF_SessionOptions session_options;
+  // true if async execution is enabled.
+  bool async = false;
   TFE_ContextDevicePlacementPolicy policy{
       TFE_DEVICE_PLACEMENT_SILENT_FOR_INT32};
 };
 
 struct TFE_Context {
-  explicit TFE_Context(const TFE_ContextOptions& opts, TF_Session* s)
-      : policy(opts.policy),
-        session(s),
-        rendezvous(new tensorflow::IntraProcessRendezvous(s->device_mgr)),
-        pflr(new tensorflow::ProcessFunctionLibraryRuntime(
-            session->device_mgr, opts.session_options.options.env,
-            TF_GRAPH_DEF_VERSION, &func_lib_def, {})),
-        log_device_placement(
-            opts.session_options.options.config.log_device_placement()) {}
-
-  const TFE_ContextDevicePlacementPolicy policy;
-
-  // Note: we cannot use C++11 thread_local here as there is no concept of a
-  // thread-local-object-local variable in C++11.
-  tensorflow::mutex policy_map_mu;
-  std::unordered_map<std::thread::id, TFE_ContextDevicePlacementPolicy>
-      thread_local_policies GUARDED_BY(policy_map_mu);
-
-  // TFE_Context is an extension of TF_Session. And TF_Session needs a TF_Graph.
-  TF_Session* const session;
-  tensorflow::Rendezvous* const rendezvous;
-
-  tensorflow::mutex functions_mu;
-  tensorflow::FunctionLibraryDefinition func_lib_def GUARDED_BY(functions_mu){
-      tensorflow::OpRegistry::Global(), {}};
-
-  // One FunctionLibraryRuntime per device.
-  // func_libs[i] is the FunctionLibraryRuntime corresponding to
-  // session->devices[i].
-  const std::unique_ptr<tensorflow::ProcessFunctionLibraryRuntime> pflr;
-
-  tensorflow::mutex cache_mu;
-  std::unordered_map<tensorflow::Fprint128, tensorflow::KernelAndDevice*,
-                     tensorflow::Fprint128Hasher>
-      kernel_cache GUARDED_BY(cache_mu);
-
-  tensorflow::FunctionLibraryRuntime* func_lib(tensorflow::Device* d) const {
-    return pflr->GetFLR(d->name());
+  explicit TFE_Context(const tensorflow::SessionOptions& opts,
+                       TFE_ContextDevicePlacementPolicy default_policy,
+                       bool async,
+                       std::unique_ptr<tensorflow::DeviceMgr> device_mgr,
+                       tensorflow::Rendezvous* rendezvous)
+      : context(opts,
+                static_cast<tensorflow::ContextDevicePlacementPolicy>(
+                    default_policy),
+                async, std::move(device_mgr), rendezvous) {}
+
+  tensorflow::EagerContext context;
+};
+
+struct TFE_TensorHandle : public tensorflow::core::RefCounted {
+ public:
+  TFE_TensorHandle(const tensorflow::Tensor& t, tensorflow::Device* d,
+                   tensorflow::Device* op_device)
+      : dtype(t.dtype()),
+        node_id(0),
+        tensor_(t),
+        device_(d),
+        op_device_(op_device),
+        ctx_(nullptr) {}
+
+  TFE_TensorHandle(tensorflow::uint64 node_id, tensorflow::DataType dtype,
+                   TFE_Context* ctx)
+      : dtype(dtype),
+        node_id(node_id),
+        tensor_(dtype),
+        device_(nullptr),
+        op_device_(nullptr),
+        ctx_(ctx) {
+    DCHECK_GT(node_id, 0);
   }
 
-  const std::vector<tensorflow::Device*>& devices() { return session->devices; }
+  ~TFE_TensorHandle() override {}
 
-  // Whether we should compute RunMetadata.
-  std::atomic<bool> should_store_metadata{false};
-  tensorflow::mutex metadata_mu;
-  tensorflow::RunMetadata run_metadata GUARDED_BY(metadata_mu);
+  tensorflow::Status Tensor(const tensorflow::Tensor** t);
 
-  const bool log_device_placement;
-};
+  tensorflow::Status Device(tensorflow::Device** d);
+
+  tensorflow::Status OpDevice(tensorflow::Device** d);
+
+  tensorflow::Status TensorAndDevice(const tensorflow::Tensor** tensor,
+                                     tensorflow::Device** device,
+                                     tensorflow::Device** op_device);
+
+  // Note that this can be called at most once, and only on non-ready handles,
+  // and makes them ready.
+  void SetTensorAndDevice(const tensorflow::Tensor& tensor,
+                          tensorflow::Device* device,
+                          tensorflow::Device* op_device);
+
+  // dtype for the handle. It must be the same as t.dtype() once the handle is
+  // ready.
+  const tensorflow::DataType dtype;
 
-struct TFE_TensorHandle {
-  TFE_TensorHandle(const tensorflow::Tensor& t, tensorflow::Device* d)
-      : t(t), d(d) {}
+ private:
+  // If the contents of the Tensor pointed to by this handle are yet to be
+  // computed by an EagerNode, this function will block till that computation is
+  // done and the handle is "ready".
+  tensorflow::Status WaitReady();
 
-  tensorflow::Tensor t;
-  // TODO(ashankar): d == nullptr iff local CPU
-  // This was expedient, but perhaps worth revisiting ('d' should always be a
-  // valid pointer?)
+  bool IsReady();
+
+  // Id for the EagerNode that will compute the value pointed to by this handle.
+  // If the value is 0, the handle is already ready, but not vice-versa.
+  const tensorflow::uint64 node_id;
+
+  tensorflow::Tensor tensor_;
+
+  // TODO(ashankar): device_ == nullptr iff local CPU
+  // This was expedient, but perhaps worth revisiting ('device_' should always
+  // be a valid pointer?)
   // This can be done if TFE_NewOp() and the TFE_TensorHandle constructors are
   // provided with the appropriate TFE_Context.
   //
-  // TODO(ashankar): Reference count TFE_Context to ensure that 'd' of a
+  // TODO(ashankar): Reference count TFE_Context to ensure that 'device_' of a
   // TFE_TensorHandle does not outlive the TFE_Context from which it came?
-  tensorflow::Device* d;
+  tensorflow::Device* device_;
+
+  // Device in which the op producing this tensor was executed. Equal to
+  // device_ for constant tensors.
+  tensorflow::Device* op_device_;
+
+  tensorflow::mutex ctx_mutex_;
+
+  // `ctx_` is only guaranteed to be set if the handle is not "ready". This is
+  // typically true when the handle was produced during async execution.
+  // The `ctx_` object is not owned and should outlive this handle.
+  TFE_Context* ctx_ GUARDED_BY(ctx_mutex_);
 };
 
 struct TFE_Op {
@@ -116,16 +153,24 @@ struct TFE_Op {
   TFE_Op(TFE_Context* ctx, const char* op, const tensorflow::AttrTypeMap* t)
       : ctx(ctx), name(op), attrs(op), attr_types(t), device(nullptr) {}
 
+  ~TFE_Op();
+
   bool const is_function() const { return attr_types == nullptr; }
 
   TFE_Context* ctx;  // Must outlive the TFE_Op.
   const tensorflow::string name;
   tensorflow::AttrBuilder attrs;
   const tensorflow::AttrTypeMap* attr_types;
-  std::vector<tensorflow::Tensor> inputs;
-  std::vector<tensorflow::Device*> input_devices;
+  tensorflow::gtl::InlinedVector<TFE_TensorHandle*, 4> inputs;
   tensorflow::Device* device;
   bool use_xla = false;
 };
 
+namespace tensorflow {
+// Set an AttrValue on the op. Doesn't handle the list types.
+void SetOpAttrValueScalar(TFE_Context* ctx, TFE_Op* op,
+                          const tensorflow::AttrValue& default_value,
+                          const char* attr_name, TF_Status* status);
+}  // namespace tensorflow
+
 #endif  // TENSORFLOW_C_EAGER_C_API_INTERNAL_H_
diff --git a/tensorflow/c/eager/c_api_test.cc b/tensorflow/c/eager/c_api_test.cc
index 00fb7e68d00dd2ef316bf89b8f253cf6c7c63f00..2268aba90d60b7b2f10e99f64fd7aa3ae719badb 100644
--- a/tensorflow/c/eager/c_api_test.cc
+++ b/tensorflow/c/eager/c_api_test.cc
@@ -29,6 +29,20 @@ using tensorflow::string;
 
 namespace {
 
+TFE_TensorHandle* DoubleTestMatrixTensorHandle() {
+  int64_t dims[] = {2, 2};
+  double data[] = {1.0, 2.0, 3.0, 4.0};
+  TF_Tensor* t = TF_AllocateTensor(
+      TF_DOUBLE, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
+  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
+  TF_Status* status = TF_NewStatus();
+  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeleteTensor(t);
+  TF_DeleteStatus(status);
+  return th;
+}
+
 TFE_TensorHandle* TestMatrixTensorHandle() {
   int64_t dims[] = {2, 2};
   float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
@@ -43,6 +57,20 @@ TFE_TensorHandle* TestMatrixTensorHandle() {
   return th;
 }
 
+TFE_TensorHandle* TestMatrixTensorHandle3X2() {
+  int64_t dims[] = {3, 2};
+  float data[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
+  TF_Tensor* t = TF_AllocateTensor(
+      TF_FLOAT, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
+  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
+  TF_Status* status = TF_NewStatus();
+  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeleteTensor(t);
+  TF_DeleteStatus(status);
+  return th;
+}
+
 TFE_Op* MatMulOp(TFE_Context* ctx, TFE_TensorHandle* a, TFE_TensorHandle* b) {
   TF_Status* status = TF_NewStatus();
 
@@ -139,10 +167,12 @@ void BM_InitOp(int iters) {
 }
 BENCHMARK(BM_InitOp);
 
-void BM_Execute(int iters) {
+void BM_Execute(int iters, int async) {
   tensorflow::testing::StopTiming();
+  tensorflow::testing::SetLabel(async ? "ExecuteAsync" : "Execute");
   TF_Status* status = TF_NewStatus();
   TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_Context* ctx = TFE_NewContext(opts, status);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteContextOptions(opts);
@@ -156,6 +186,9 @@ void BM_Execute(int iters) {
     TFE_Execute(matmul, &retvals[0], &num_retvals, status);
     CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   }
+  if (async) {
+    TFE_ContextAsyncWait(ctx, status);
+  }
   tensorflow::testing::StopTiming();
   TFE_DeleteOp(matmul);
   TFE_DeleteTensorHandle(m);
@@ -163,7 +196,7 @@ void BM_Execute(int iters) {
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
-BENCHMARK(BM_Execute);
+BENCHMARK(BM_Execute)->Arg(0)->Arg(1);
 
 TEST(CAPI, Context) {
   TF_Status* status = TF_NewStatus();
@@ -205,10 +238,11 @@ TEST(CAPI, TensorHandle) {
   TFE_DeleteTensorHandle(h);
 }
 
-TEST(CAPI, TensorHandleCopyBetweenDevices) {
+void TensorHandleCopyBetweenDevices(bool async) {
   std::unique_ptr<TF_Status, decltype(&TF_DeleteStatus)> status(
       TF_NewStatus(), TF_DeleteStatus);
   TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_Context* ctx = TFE_NewContext(opts, status.get());
   TFE_DeleteContextOptions(opts);
   ASSERT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
@@ -274,10 +308,56 @@ TEST(CAPI, TensorHandleCopyBetweenDevices) {
   EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
 }
 
-TEST(CAPI, TensorHandleCopyBetweenTwoGPUDevices) {
+TEST(CAPI, TensorHandleCopyBetweenDevices) {
+  TensorHandleCopyBetweenDevices(false);
+}
+
+TEST(CAPI, TensorHandleCopyBetweenDevicesAsync) {
+  TensorHandleCopyBetweenDevices(true);
+}
+
+void TensorHandleCopyBetweenDevicesError(bool async) {
   std::unique_ptr<TF_Status, decltype(&TF_DeleteStatus)> status(
       TF_NewStatus(), TF_DeleteStatus);
   TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
+  TFE_Context* ctx = TFE_NewContext(opts, status.get());
+  TFE_DeleteContextOptions(opts);
+  ASSERT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
+  TFE_TensorHandle* hcpu = TestMatrixTensorHandle();
+  const char* kErrorDevice = "NoSuchDevice:0";
+  TFE_TensorHandle* hdevice =
+      TFE_TensorHandleCopyToDevice(hcpu, ctx, kErrorDevice, status.get());
+  EXPECT_NE(TF_OK, TF_GetCode(status.get()));
+  const char* msg = "NoSuchDevice:0 unknown device";
+  EXPECT_TRUE(strstr(TF_Message(status.get()), msg) != nullptr)
+      << TF_Message(status.get());
+  TF_SetStatus(status.get(), TF_OK, "");
+  const char* kCPUDevice = "CPU:0";
+  TFE_TensorHandle* hcopy =
+      TFE_TensorHandleCopyToDevice(hcpu, ctx, kCPUDevice, status.get());
+  EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
+  TFE_ContextAsyncWait(ctx, status.get());
+  EXPECT_EQ(TF_OK, TF_GetCode(status.get()));
+  TFE_DeleteTensorHandle(hcopy);
+  TFE_DeleteTensorHandle(hcpu);
+  if (hdevice != nullptr) TFE_DeleteTensorHandle(hdevice);
+  TFE_DeleteContext(ctx, status.get());
+}
+
+TEST(CAPI, TensorHandleCopyBetweenDevicesError) {
+  TensorHandleCopyBetweenDevicesError(false);
+}
+
+TEST(CAPI, TensorHandleCopyBetweenDevicesErrorAsync) {
+  TensorHandleCopyBetweenDevicesError(true);
+}
+
+void TensorHandleCopyBetweenTwoGPUDevices(bool async) {
+  std::unique_ptr<TF_Status, decltype(&TF_DeleteStatus)> status(
+      TF_NewStatus(), TF_DeleteStatus);
+  TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_Context* ctx = TFE_NewContext(opts, status.get());
   TFE_DeleteContextOptions(opts);
   ASSERT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
@@ -332,11 +412,20 @@ TEST(CAPI, TensorHandleCopyBetweenTwoGPUDevices) {
   EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
 }
 
-TEST(CAPI, TensorHandleSilentCopy) {
+TEST(CAPI, TensorHandleCopyBetweenTwoGPUDevices) {
+  TensorHandleCopyBetweenTwoGPUDevices(false);
+}
+
+TEST(CAPI, TensorHandleCopyBetweenTwoGPUDevicesAsync) {
+  TensorHandleCopyBetweenTwoGPUDevices(true);
+}
+
+void TensorHandleSilentCopy(bool async) {
   std::unique_ptr<TF_Status, decltype(&TF_DeleteStatus)> status(
       TF_NewStatus(), TF_DeleteStatus);
   TFE_ContextOptions* opts = TFE_NewContextOptions();
   TFE_ContextOptionsSetDevicePlacementPolicy(opts, TFE_DEVICE_PLACEMENT_SILENT);
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_Context* ctx = TFE_NewContext(opts, status.get());
   TFE_DeleteContextOptions(opts);
   ASSERT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
@@ -366,14 +455,20 @@ TEST(CAPI, TensorHandleSilentCopy) {
 
   TF_DeleteTensor(t);
   TFE_DeleteTensorHandle(hcpu);
+  TFE_ContextAsyncWait(ctx, status.get());
+  EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
   TFE_DeleteContext(ctx, status.get());
   EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
 }
 
-TEST(CAPI, TensorHandleSilentCopyLocal) {
+TEST(CAPI, TensorHandleSilentCopy) { TensorHandleSilentCopy(false); }
+TEST(CAPI, TensorHandleSilentCopyAsync) { TensorHandleSilentCopy(true); }
+
+void TensorHandleSilentCopyLocal(bool async) {
   std::unique_ptr<TF_Status, decltype(&TF_DeleteStatus)> status(
       TF_NewStatus(), TF_DeleteStatus);
   TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_ContextOptionsSetDevicePlacementPolicy(opts,
                                              TFE_DEVICE_PLACEMENT_EXPLICIT);
   TFE_Context* ctx = TFE_NewContext(opts, status.get());
@@ -407,11 +502,17 @@ TEST(CAPI, TensorHandleSilentCopyLocal) {
 
   TF_DeleteTensor(t);
   TFE_DeleteTensorHandle(hcpu);
+  TFE_ContextAsyncWait(ctx, status.get());
+  EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
   TFE_DeleteContext(ctx, status.get());
   EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
 }
+TEST(CAPI, TensorHandleSilentCopyLocal) { TensorHandleSilentCopyLocal(false); }
+TEST(CAPI, TensorHandleSilentCopyLocalAsync) {
+  TensorHandleSilentCopyLocal(true);
+}
 
-TEST(CAPI, SetAndGetOpDevices) {
+void SetAndGetOpDevices(bool async) {
   TF_Status* status = TF_NewStatus();
   TFE_ContextOptions* opts = TFE_NewContextOptions();
   TFE_Context* ctx = TFE_NewContext(opts, status);
@@ -442,27 +543,28 @@ TEST(CAPI, SetAndGetOpDevices) {
   TF_DeleteStatus(status);
 }
 
-TEST(CAPI, Execute_MatMul_CPU) {
+void Execute_MatMul_CPU(bool async) {
   TF_Status* status = TF_NewStatus();
   TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_Context* ctx = TFE_NewContext(opts, status);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteContextOptions(opts);
 
   TFE_TensorHandle* m = TestMatrixTensorHandle();
   TFE_Op* matmul = MatMulOp(ctx, m, m);
-  TFE_TensorHandle* retvals[2] = {nullptr};
-  int num_retvals = 2;  // Should be reduced to 1 by the TFE_Execute call.
+  TFE_TensorHandle* retvals[2] = {nullptr, nullptr};
+  int num_retvals = 2;
   TFE_Execute(matmul, &retvals[0], &num_retvals, status);
+  EXPECT_EQ(1, num_retvals);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteOp(matmul);
   TFE_DeleteTensorHandle(m);
-  TFE_DeleteContext(ctx, status);
-  ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  ASSERT_EQ(1, num_retvals);
 
   TF_Tensor* t = TFE_TensorHandleResolve(retvals[0], status);
+  ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteTensorHandle(retvals[0]);
+  TFE_DeleteContext(ctx, status);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   float product[4] = {0};
   EXPECT_EQ(sizeof(product), TF_TensorByteSize(t));
@@ -474,7 +576,101 @@ TEST(CAPI, Execute_MatMul_CPU) {
   EXPECT_EQ(22, product[3]);
   TF_DeleteStatus(status);
 }
+TEST(CAPI, Execute_MatMul_CPU) { Execute_MatMul_CPU(false); }
+TEST(CAPI, Execute_MatMul_CPUAsync) { Execute_MatMul_CPU(true); }
+
+void Execute_MatMul_CPU_Runtime_Error(bool async) {
+  TF_Status* status = TF_NewStatus();
+  TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
+  TFE_Context* ctx = TFE_NewContext(opts, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_DeleteContextOptions(opts);
+
+  TFE_TensorHandle* m1 = TestMatrixTensorHandle();
+  TFE_TensorHandle* m2 = TestMatrixTensorHandle3X2();
+  TFE_Op* matmul = MatMulOp(ctx, m1, m2);
+  TFE_Op* matmul2 = MatMulOp(ctx, m1, m1);
+  TFE_TensorHandle* retvals[1] = {nullptr};
+  int num_retvals = 1;
+  TFE_Execute(matmul, &retvals[0], &num_retvals, status);
+  TFE_DeleteOp(matmul);
+  if (!async) {
+    EXPECT_NE(TF_OK, TF_GetCode(status));
+  } else {
+    TF_Tensor* t = TFE_TensorHandleResolve(retvals[0], status);
+    EXPECT_NE(TF_OK, TF_GetCode(status));
+    EXPECT_EQ(nullptr, t);
+    const char* msg = "Matrix size-incompatible: In[0]: [2,2], In[1]: [3,2]";
+    EXPECT_TRUE(strstr(TF_Message(status), msg) != nullptr)
+        << TF_Message(status);
+    // Since the error is not cleared, the following execution of a valid op
+    // will still fail.
+    TF_SetStatus(status, TF_OK, "");
+    TFE_DeleteTensorHandle(retvals[0]);
+    retvals[0] = nullptr;
+    TFE_Execute(matmul2, &retvals[0], &num_retvals, status);
+    EXPECT_NE(TF_OK, TF_GetCode(status));
+    TFE_ContextAsyncClearError(ctx);
+    TFE_ContextAsyncWait(ctx, status);
+    EXPECT_EQ(TF_OK, TF_GetCode(status));
+  }
+  // Following works in async mode since TFE_ContextAsyncClearError was called.
+  TF_SetStatus(status, TF_OK, "");
+  if (retvals[0] != nullptr) {
+    TFE_DeleteTensorHandle(retvals[0]);
+  }
+  retvals[0] = nullptr;
+  TFE_Execute(matmul2, &retvals[0], &num_retvals, status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status));
+  TF_Tensor* t = TFE_TensorHandleResolve(retvals[0], status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status));
+  TF_DeleteTensor(t);
+  TFE_DeleteOp(matmul2);
+  TFE_DeleteTensorHandle(m1);
+  TFE_DeleteTensorHandle(m2);
+  TFE_DeleteTensorHandle(retvals[0]);
+  TFE_DeleteContext(ctx, status);
+  TF_DeleteStatus(status);
+}
+TEST(CAPI, Execute_MatMul_CPU_Runtime_Error) {
+  Execute_MatMul_CPU_Runtime_Error(false);
+}
+TEST(CAPI, Execute_MatMul_CPU_Runtime_ErrorAsync) {
+  Execute_MatMul_CPU_Runtime_Error(true);
+}
+
+void Execute_MatMul_CPU_Type_Error(bool async) {
+  TF_Status* status = TF_NewStatus();
+  TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
+  TFE_Context* ctx = TFE_NewContext(opts, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_DeleteContextOptions(opts);
+
+  TFE_TensorHandle* m1 = TestMatrixTensorHandle();
+  TFE_TensorHandle* m2 = DoubleTestMatrixTensorHandle();
+  TFE_Op* matmul = MatMulOp(ctx, m1, m2);
+  TFE_TensorHandle* retvals[1] = {nullptr};
+  int num_retvals = 1;
+  TFE_Execute(matmul, &retvals[0], &num_retvals, status);
+  EXPECT_NE(TF_OK, TF_GetCode(status));
+  TFE_DeleteOp(matmul);
+  TFE_DeleteTensorHandle(m1);
+  TFE_DeleteTensorHandle(m2);
+  if (retvals[0] != nullptr) {
+    TFE_DeleteTensorHandle(retvals[0]);
+  }
+  TFE_DeleteContext(ctx, status);
+  TF_DeleteStatus(status);
+}
 
+TEST(CAPI, Execute_MatMul_CPU_Type_Error) {
+  Execute_MatMul_CPU_Type_Error(false);
+}
+TEST(CAPI, Execute_MatMul_CPU_Type_ErrorAsync) {
+  Execute_MatMul_CPU_Type_Error(true);
+}
 TEST(CAPI, Execute_Min_CPU) {
   TF_Status* status = TF_NewStatus();
   TFE_ContextOptions* opts = TFE_NewContextOptions();
@@ -485,8 +681,8 @@ TEST(CAPI, Execute_Min_CPU) {
   TFE_TensorHandle* input = TestMatrixTensorHandle();
   TFE_TensorHandle* axis = TestAxisTensorHandle();
   TFE_Op* minOp = MinOp(ctx, input, axis);
-  TFE_TensorHandle* retvals[2] = {nullptr};
-  int num_retvals = 2;  // Should be reduced to 1 by the TFE_Execute call.
+  TFE_TensorHandle* retvals[1] = {nullptr};
+  int num_retvals = 1;
   TFE_Execute(minOp, &retvals[0], &num_retvals, status);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteOp(minOp);
@@ -509,9 +705,10 @@ TEST(CAPI, Execute_Min_CPU) {
 }
 
 #ifdef TENSORFLOW_EAGER_USE_XLA
-TEST(CAPI, Execute_MatMul_XLA_CPU) {
+void Execute_MatMul_XLA_CPU(bool async) {
   TF_Status* status = TF_NewStatus();
   TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_Context* ctx = TFE_NewContext(opts, status);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteContextOptions(opts);
@@ -521,15 +718,14 @@ TEST(CAPI, Execute_MatMul_XLA_CPU) {
 
   TFE_OpSetXLACompilation(matmul, true);
 
-  TFE_TensorHandle* retvals[2] = {nullptr};
-  int num_retvals = 2;  // Should be reduced to 1 by the TFE_Execute call.
+  TFE_TensorHandle* retvals[1] = {nullptr};
+  int num_retvals = 1;
   TFE_Execute(matmul, &retvals[0], &num_retvals, status);
   // Running a primitive TF operator via XLA is not yet supported.
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
 
   TFE_DeleteOp(matmul);
   TFE_DeleteTensorHandle(m);
-  TFE_DeleteContext(ctx, status);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
 
   EXPECT_EQ(1, num_retvals);
@@ -545,13 +741,16 @@ TEST(CAPI, Execute_MatMul_XLA_CPU) {
   EXPECT_EQ(10, product[1]);
   EXPECT_EQ(15, product[2]);
   EXPECT_EQ(22, product[3]);
-
+  TFE_DeleteContext(ctx, status);
   TF_DeleteStatus(status);
 }
+TEST(CAPI, Execute_MatMul_XLA_CPU) { Execute_MatMul_XLA_CPU(false); }
+TEST(CAPI, Execute_MatMul_XLA_CPUAsync) { Execute_MatMul_XLA_CPU(true); }
 
-TEST(CAPI, Execute_Min_XLA_CPU) {
+void Execute_Min_XLA_CPU(bool async) {
   TF_Status* status = TF_NewStatus();
   TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_Context* ctx = TFE_NewContext(opts, status);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteContextOptions(opts);
@@ -562,14 +761,13 @@ TEST(CAPI, Execute_Min_XLA_CPU) {
 
   TFE_OpSetXLACompilation(minOp, true);
 
-  TFE_TensorHandle* retvals[2] = {nullptr};
-  int num_retvals = 2;  // Should be reduced to 1 by the TFE_Execute call.
+  TFE_TensorHandle* retvals[1] = {nullptr};
+  int num_retvals = 1;
   TFE_Execute(minOp, &retvals[0], &num_retvals, status);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteOp(minOp);
   TFE_DeleteTensorHandle(input);
   TFE_DeleteTensorHandle(axis);
-  TFE_DeleteContext(ctx, status);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   ASSERT_EQ(1, num_retvals);
 
@@ -582,13 +780,17 @@ TEST(CAPI, Execute_Min_XLA_CPU) {
   TF_DeleteTensor(t);
   EXPECT_EQ(1, output[0]);
   EXPECT_EQ(3, output[1]);
+  TFE_DeleteContext(ctx, status);
   TF_DeleteStatus(status);
 }
+TEST(CAPI, Execute_Min_XLA_CPU) { Execute_Min_XLA_CPU(false); }
+TEST(CAPI, Execute_Min_XLA_CPUAsync) { Execute_Min_XLA_CPU(true); }
 #endif  // TENSORFLOW_EAGER_USE_XLA
 
-TEST(CAPI, ExecuteWithTracing) {
+void ExecuteWithTracing(bool async) {
   TF_Status* status = TF_NewStatus();
   TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_Context* ctx = TFE_NewContext(opts, status);
   TFE_ContextEnableRunMetadata(ctx);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
@@ -596,8 +798,8 @@ TEST(CAPI, ExecuteWithTracing) {
 
   TFE_TensorHandle* m = TestMatrixTensorHandle();
   TFE_Op* matmul = MatMulOp(ctx, m, m);
-  TFE_TensorHandle* retvals[2] = {nullptr};
-  int num_retvals = 2;  // Should be reduced to 1 by the TFE_Execute call.
+  TFE_TensorHandle* retvals[1] = {nullptr};
+  int num_retvals = 1;
   TFE_Execute(matmul, &retvals[0], &num_retvals, status);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteOp(matmul);
@@ -609,12 +811,12 @@ TEST(CAPI, ExecuteWithTracing) {
   EXPECT_TRUE(
       rm.ParseFromString({reinterpret_cast<const char*>(b->data), b->length}));
   TF_DeleteBuffer(b);
-  TFE_DeleteContext(ctx, status);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   ASSERT_EQ(1, num_retvals);
 
   TF_Tensor* t = TFE_TensorHandleResolve(retvals[0], status);
   TFE_DeleteTensorHandle(retvals[0]);
+  TFE_DeleteContext(ctx, status);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   float product[4] = {0};
   EXPECT_EQ(sizeof(product), TF_TensorByteSize(t));
@@ -626,6 +828,8 @@ TEST(CAPI, ExecuteWithTracing) {
   EXPECT_EQ(22, product[3]);
   TF_DeleteStatus(status);
 }
+TEST(CAPI, ExecuteWithTracing) { ExecuteWithTracing(false); }
+TEST(CAPI, ExecuteWithTracingAsync) { ExecuteWithTracing(true); }
 
 TEST(CAPI, Function_ident_CPU) {
   // First create a simple identity function.
@@ -657,32 +861,37 @@ TEST(CAPI, Function_ident_CPU) {
   ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
   TF_DeleteFunction(fn);
 
-  TF_Tensor* t =
-      TF_AllocateTensor(TF_INT32, nullptr, 0, 1 * sizeof(tensorflow::int32));
-  *reinterpret_cast<tensorflow::int32*>(TF_TensorData(t)) = 42;
-  TFE_TensorHandle* h = TFE_NewTensorHandle(t, status);
-  ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
-  TF_DeleteTensor(t);
+  for (bool async : {false, true, false}) {
+    TFE_ContextSetAsyncForThread(ctx, static_cast<unsigned char>(async),
+                                 status);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK);
+    TF_Tensor* t =
+        TF_AllocateTensor(TF_INT32, nullptr, 0, 1 * sizeof(tensorflow::int32));
+    *reinterpret_cast<tensorflow::int32*>(TF_TensorData(t)) = 42;
+    TFE_TensorHandle* h = TFE_NewTensorHandle(t, status);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
+    TF_DeleteTensor(t);
 
-  TFE_Op* op = TFE_NewOp(ctx, "ident", status);
-  ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
-  TFE_OpAddInput(op, h, status);
-  ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
+    TFE_Op* op = TFE_NewOp(ctx, "ident", status);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
+    TFE_OpAddInput(op, h, status);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
 
-  std::vector<TFE_TensorHandle*> result;
-  result.push_back(nullptr);
-  int num_retvals = 1;
-  TFE_Execute(op, result.data(), &num_retvals, status);
-  TFE_DeleteOp(op);
-  ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
-  ASSERT_EQ(num_retvals, 1);
+    std::vector<TFE_TensorHandle*> result;
+    result.push_back(nullptr);
+    int num_retvals = 1;
+    TFE_Execute(op, result.data(), &num_retvals, status);
+    TFE_DeleteOp(op);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
+    ASSERT_EQ(num_retvals, 1);
 
-  TF_Tensor* r = TFE_TensorHandleResolve(result[0], status);
-  ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
-  EXPECT_EQ(*reinterpret_cast<tensorflow::int32*>(TF_TensorData(r)), 42);
-  TFE_DeleteTensorHandle(h);
-  TF_DeleteTensor(r);
-  TFE_DeleteTensorHandle(result[0]);
+    TF_Tensor* r = TFE_TensorHandleResolve(result[0], status);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
+    EXPECT_EQ(*reinterpret_cast<tensorflow::int32*>(TF_TensorData(r)), 42);
+    TFE_DeleteTensorHandle(h);
+    TF_DeleteTensor(r);
+    TFE_DeleteTensorHandle(result[0]);
+  }
   TFE_DeleteContext(ctx, status);
   ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
   TF_DeleteStatus(status);
@@ -719,35 +928,40 @@ TEST(CAPI, Function_ident_XLA_CPU) {
   ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
   TF_DeleteFunction(fn);
 
-  TF_Tensor* t =
-      TF_AllocateTensor(TF_INT32, nullptr, 0, 1 * sizeof(tensorflow::int32));
-  *reinterpret_cast<tensorflow::int32*>(TF_TensorData(t)) = 42;
-  TFE_TensorHandle* h = TFE_NewTensorHandle(t, status);
-  ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
-  TF_DeleteTensor(t);
+  for (bool async : {false, true, false}) {
+    TFE_ContextSetAsyncForThread(ctx, static_cast<unsigned char>(async),
+                                 status);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK);
+    TF_Tensor* t =
+        TF_AllocateTensor(TF_INT32, nullptr, 0, 1 * sizeof(tensorflow::int32));
+    *reinterpret_cast<tensorflow::int32*>(TF_TensorData(t)) = 42;
+    TFE_TensorHandle* h = TFE_NewTensorHandle(t, status);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
+    TF_DeleteTensor(t);
 
-  TFE_Op* op = TFE_NewOp(ctx, "ident", status);
-  ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
-  TFE_OpAddInput(op, h, status);
-  ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
+    TFE_Op* op = TFE_NewOp(ctx, "ident", status);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
+    TFE_OpAddInput(op, h, status);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
 
-  // Now run it via XLA.
-  TFE_OpSetXLACompilation(op, true);
+    // Now run it via XLA.
+    TFE_OpSetXLACompilation(op, true);
 
-  std::vector<TFE_TensorHandle*> result;
-  result.push_back(nullptr);
-  int num_retvals = 1;
-  TFE_Execute(op, result.data(), &num_retvals, status);
-  TFE_DeleteOp(op);
-  ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
-  ASSERT_EQ(num_retvals, 1);
+    std::vector<TFE_TensorHandle*> result;
+    result.push_back(nullptr);
+    int num_retvals = 1;
+    TFE_Execute(op, result.data(), &num_retvals, status);
+    TFE_DeleteOp(op);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
+    ASSERT_EQ(num_retvals, 1);
 
-  TF_Tensor* r = TFE_TensorHandleResolve(result[0], status);
-  ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
-  EXPECT_EQ(*reinterpret_cast<tensorflow::int32*>(TF_TensorData(r)), 42);
-  TFE_DeleteTensorHandle(h);
-  TF_DeleteTensor(r);
-  TFE_DeleteTensorHandle(result[0]);
+    TF_Tensor* r = TFE_TensorHandleResolve(result[0], status);
+    ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
+    EXPECT_EQ(*reinterpret_cast<tensorflow::int32*>(TF_TensorData(r)), 42);
+    TFE_DeleteTensorHandle(h);
+    TF_DeleteTensor(r);
+    TFE_DeleteTensorHandle(result[0]);
+  }
   TFE_DeleteContext(ctx, status);
   ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
   TF_DeleteStatus(status);
@@ -788,9 +1002,10 @@ string MatMulFunction() {
   return def.SerializeAsString();
 }
 
-TEST(CAPI, FunctionDefAndExecute) {
+void FunctionDefAndExecute(bool async) {
   TF_Status* status = TF_NewStatus();
   TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_Context* ctx = TFE_NewContext(opts, status);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteContextOptions(opts);
@@ -827,11 +1042,16 @@ TEST(CAPI, FunctionDefAndExecute) {
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
+TEST(CAPI, FunctionDefAndExecute) { FunctionDefAndExecute(false); }
+TEST(CAPI, FunctionDefAndExecuteAsync) { FunctionDefAndExecute(true); }
 
-void BM_ExecuteFunction(int iters) {
+void BM_ExecuteFunction(int iters, int async) {
   tensorflow::testing::StopTiming();
+  tensorflow::testing::SetLabel(async ? "ExecuteFunctionAsync"
+                                      : "ExecuteFunction");
   TF_Status* status = TF_NewStatus();
   TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(async));
   TFE_Context* ctx = TFE_NewContext(opts, status);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteContextOptions(opts);
@@ -853,6 +1073,9 @@ void BM_ExecuteFunction(int iters) {
     TFE_Execute(matmul, &retval[0], &num_retvals, status);
     CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   }
+  if (async) {
+    TFE_ContextAsyncWait(ctx, status);
+  }
   tensorflow::testing::StopTiming();
   TFE_DeleteTensorHandle(m);
   TFE_DeleteTensorHandle(retval[0]);
@@ -860,7 +1083,7 @@ void BM_ExecuteFunction(int iters) {
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
-BENCHMARK(BM_ExecuteFunction);
+BENCHMARK(BM_ExecuteFunction)->Arg(0)->Arg(1);
 
 TFE_TensorHandle* CreateVariable(TFE_Context* ctx, float value,
                                  TF_Status* status) {
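
Note on the `BM_ExecuteFunction` change above: in async mode `TFE_Execute` can return before the op has run, so the benchmark calls `TFE_ContextAsyncWait` before `StopTiming()`. The sketch below is a TensorFlow-free toy model of that ordering. `ToyDeferredExecutor`, `BenchmarkCounts` and `RunToyBenchmark` are hypothetical names, and the single-threaded deferred queue is an assumption made to keep the example deterministic. It is not how the eager runtime is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an async context: Enqueue() only records the
// work, and nothing runs until Wait() drains the queue.
class ToyDeferredExecutor {
 public:
  void Enqueue(std::function<void()> task) {
    pending_.push_back(std::move(task));
  }

  // Runs every queued task, in order. Plays the role that
  // TFE_ContextAsyncWait plays in the benchmark.
  void Wait() {
    for (auto& task : pending_) task();
    pending_.clear();
  }

  std::size_t num_pending() const { return pending_.size(); }

 private:
  std::vector<std::function<void()>> pending_;
};

struct BenchmarkCounts {
  int done_before_wait;
  int done_after_wait;
};

// Enqueues `iters` units of work and records how many had completed before
// and after the explicit wait.
BenchmarkCounts RunToyBenchmark(int iters) {
  assert(iters >= 0);
  ToyDeferredExecutor executor;
  int done = 0;
  for (int i = 0; i < iters; ++i) {
    executor.Enqueue([&done] { ++done; });
  }
  BenchmarkCounts counts;
  counts.done_before_wait = done;  // Stopping the timer here measures nothing.
  executor.Wait();
  counts.done_after_wait = done;   // All queued work is now accounted for.
  return counts;
}
```

Calling `RunToyBenchmark(5)` leaves `done_before_wait` at zero and `done_after_wait` at five, which is why the timer should stop only after the wait.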
diff --git a/tensorflow/c/eager/runtime.cc b/tensorflow/c/eager/runtime.cc
index f77a937f1ffc2d146224cb3191a5ca127daefc22..abe2793ce894ad07c252575c5d55d98342916eac 100644
--- a/tensorflow/c/eager/runtime.cc
+++ b/tensorflow/c/eager/runtime.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/c/eager/runtime.h"
 
 #include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/common_runtime/eager/kernel_and_device.h"
 #include "tensorflow/core/common_runtime/rendezvous_mgr.h"
 #include "tensorflow/core/framework/allocator.h"
 #include "tensorflow/core/framework/node_def.pb.h"
@@ -41,17 +42,26 @@ const uint32 kIsList = 1U << 31;
 
 }  // namespace
 
+Status OpDefForOp(const char* op_name, const OpDef** op_def) {
+  const OpRegistrationData* op_reg_data = nullptr;
+  Status s = OpRegistry::Global()->LookUp(op_name, &op_reg_data);
+  if (s.ok()) {
+    *op_def = &op_reg_data->op_def;
+  }
+  return s;
+}
+
 Status AttrTypeMapForOp(const char* op_name, const AttrTypeMap** out) {
   mutex_lock l(g_op_name_to_attr_type_map_lock);
   *out = gtl::FindPtrOrNull(*OpNameToAttrTypeMap(), op_name);
   if (*out != nullptr) return Status::OK();
-  const OpRegistrationData* op_reg_data = nullptr;
-  Status s = OpRegistry::Global()->LookUp(op_name, &op_reg_data);
+  const OpDef* op_def = nullptr;
+  Status s = OpDefForOp(op_name, &op_def);
   if (!s.ok()) return s;
   std::unique_ptr<AttrTypeMap> m(new AttrTypeMap);
   // TODO(agarwal): Avoid having to create this "registry" at runtime,
   // perhaps can be done at op registration time?
-  for (const auto& attr : op_reg_data->op_def.attr()) {
+  for (const auto& attr : op_def->attr()) {
     string type = attr.type();
     const bool is_list = (type.length() > 6 && type.compare(0, 4, "list") == 0);
     if (is_list) {
@@ -86,22 +96,6 @@ Status AttrTypeMapForOp(const char* op_name, const AttrTypeMap** out) {
   return Status::OK();
 }
 
-Status AttrTypeByName(const AttrTypeMap& m, const string& attr_name,
-                      TF_AttrType* out, unsigned char* is_list) {
-  auto* t = gtl::FindOrNull(m, attr_name);
-  if (t == nullptr) {
-    return errors::InvalidArgument("Attribute '", attr_name,
-                                   "' does not exist for this operation");
-  }
-  *out = static_cast<TF_AttrType>(*t & ~kIsList);
-  if (*t & kIsList) {
-    *is_list = 1;
-  } else {
-    *is_list = 0;
-  }
-  return Status::OK();
-}
-
 #define DEFINE_SET_ATTR(value_type, value_field)                             \
   template <>                                                                \
   AttrBuilder& AttrBuilder::Set(StringPiece attr_name, value_type&& value) { \
@@ -159,6 +153,22 @@ const NodeDef& AttrBuilder::BuildNodeDef() {
   return *node_def_;
 }
 
+Status AttrTypeByName(const AttrTypeMap& m, const string& attr_name,
+                      TF_AttrType* out, unsigned char* is_list) {
+  auto* t = gtl::FindOrNull(m, attr_name);
+  if (t == nullptr) {
+    return errors::InvalidArgument("Attribute '", attr_name,
+                                   "' does not exist for this operation");
+  }
+  *out = static_cast<TF_AttrType>(*t & ~kIsList);
+  if (*t & kIsList) {
+    *is_list = 1;
+  } else {
+    *is_list = 0;
+  }
+  return Status::OK();
+}
+
 namespace {
 inline tensorflow::Fprint128 FingerprintCat128(const tensorflow::Fprint128& a,
                                                const tensorflow::Fprint128& b) {
@@ -236,93 +246,4 @@ void AttrBuilder::MayBeInitializeNodeDef() {
   }
 }
 
-// static
-Status KernelAndDevice::InitOp(Device* device, const NodeDef& ndef,
-                               KernelAndDevice* out) {
-  OpKernel* k = nullptr;
-  Status s = CreateOpKernel(device->device_type().c_str(), device,
-                            device->GetAllocator(AllocatorAttributes()),
-                            nullptr, ndef, TF_GRAPH_DEF_VERSION, &k);
-  out->device_ = device;
-  out->kernel_.reset(k);
-  out->flib_ = nullptr;
-  return s;
-}
-
-// static
-Status KernelAndDevice::Init(const NodeDef& ndef, FunctionLibraryRuntime* flib,
-                             KernelAndDevice* out) {
-  OpKernel* k = nullptr;
-  Status s = flib->CreateKernel(ndef, &k);
-  out->device_ = flib->device();
-  out->kernel_.reset(k);
-  out->flib_ = flib;
-  return s;
-}
-
-Status KernelAndDevice::Run(std::vector<Tensor>* input_tensors,
-                            std::vector<Tensor>* output_tensors,
-                            NodeExecStats* stats) {
-  gtl::InlinedVector<TensorValue, 4> inputs;
-  for (Tensor& t : *input_tensors) {
-    inputs.push_back(TensorValue(&t));
-  }
-
-  std::vector<AllocatorAttributes> out_attrs(kernel_->num_outputs());
-  for (size_t i = 0; i < out_attrs.size(); ++i) {
-    out_attrs[i].set_on_host(kernel_->output_memory_types()[i] ==
-                             tensorflow::HOST_MEMORY);
-  }
-
-  OpKernelContext::Params params;
-  params.device = device_;
-  params.frame_iter = FrameAndIter(0, 0);
-  params.inputs = &inputs;
-  params.op_kernel = kernel_.get();
-  params.resource_manager = device_->resource_manager();
-  params.output_attr_array = gtl::vector_as_array(&out_attrs);
-  params.function_library = flib_;
-  params.slice_reader_cache = &slice_reader_cache_;
-  params.rendezvous = rendez_;
-  if (stats != nullptr) {
-    params.track_allocations = true;
-  }
-  // TODO(apassos): use a thread pool.
-  std::function<void(std::function<void()>)> runner =
-      [](std::function<void()> f) { f(); };
-  params.runner = &runner;
-
-  OpKernelContext context(&params);
-  device_->Compute(kernel_.get(), &context);
-  if (!context.status().ok()) return context.status();
-
-  output_tensors->clear();
-  for (int i = 0; i < context.num_outputs(); ++i) {
-    output_tensors->push_back(Tensor(*context.mutable_output(i)));
-  }
-  if (stats != nullptr) {
-    for (const auto& allocator_pair : context.wrapped_allocators()) {
-      AllocatorMemoryUsed* memory = stats->add_memory();
-      memory->set_allocator_name(allocator_pair.first->Name());
-      auto sizes = allocator_pair.second->GetSizes();
-      memory->set_total_bytes(std::get<0>(sizes));
-      memory->set_peak_bytes(std::get<1>(sizes));
-      memory->set_live_bytes(std::get<2>(sizes));
-
-      AllocatorStats allocator_stats;
-      allocator_pair.first->GetStats(&allocator_stats);
-      memory->set_allocator_bytes_in_use(allocator_stats.bytes_in_use);
-      allocator_pair.second->GetRecordsAndUnRef();
-    }
-    auto* ms = stats->mutable_memory_stats();
-    ms->set_temp_memory_size(context.temp_memory_allocated());
-    for (const auto& alloc_id : context.persistent_alloc_ids()) {
-      ms->mutable_persistent_tensor_alloc_ids()->Add(alloc_id);
-    }
-
-    ms->set_persistent_memory_size(context.persistent_memory_allocated());
-  }
-  return Status::OK();
-}
-
 }  // namespace tensorflow
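
Note on `AttrTypeByName` in the `runtime.cc` diff above: the map value packs two facts into one `uint32`. The high bit (`kIsList = 1U << 31`) marks a list-valued attribute and the remaining bits hold the attribute type. The sketch below reproduces that decoding without TensorFlow. `ToyAttrType`, `ToyAttrTypeMap`, `Encode`, `MakeToyMap` and `ToyAttrTypeByName` are hypothetical stand-ins, and a `bool` return replaces `tensorflow::Status`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical attribute kinds; the real code uses TF_AttrType.
enum ToyAttrType : uint32_t { kToyString = 0, kToyInt = 1, kToyFloat = 2 };

// Same bit as kIsList in runtime.cc.
constexpr uint32_t kIsList = 1U << 31;

using ToyAttrTypeMap = std::unordered_map<std::string, uint32_t>;

// Packs the type and the list flag into a single map value.
inline uint32_t Encode(ToyAttrType type, bool is_list) {
  return static_cast<uint32_t>(type) | (is_list ? kIsList : 0U);
}

// Example map with one list-valued and one scalar attribute.
inline ToyAttrTypeMap MakeToyMap() {
  ToyAttrTypeMap m;
  m["strides"] = Encode(kToyInt, /*is_list=*/true);
  m["padding"] = Encode(kToyString, /*is_list=*/false);
  return m;
}

// Same shape as AttrTypeByName: looks up `attr_name`, then splits the stored
// value into a type (flag bit masked off) and an is_list byte.
// Returns false when the attribute does not exist.
bool ToyAttrTypeByName(const ToyAttrTypeMap& m, const std::string& attr_name,
                       ToyAttrType* out, unsigned char* is_list) {
  assert(out != nullptr && is_list != nullptr);
  auto it = m.find(attr_name);
  if (it == m.end()) return false;
  *out = static_cast<ToyAttrType>(it->second & ~kIsList);
  *is_list = (it->second & kIsList) ? 1 : 0;
  return true;
}
```

Masking with `~kIsList` before the cast is what keeps the flag bit from leaking into the returned type.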
diff --git a/tensorflow/c/eager/runtime.h b/tensorflow/c/eager/runtime.h
index 4d20b5244a46fcde2eed0a429dced2a77b86aedd..929b1b8296faf61c11c68af06ffc4ca3770ae929 100644
--- a/tensorflow/c/eager/runtime.h
+++ b/tensorflow/c/eager/runtime.h
@@ -23,6 +23,7 @@ limitations under the License.
 
 #include "tensorflow/c/c_api.h"
 #include "tensorflow/core/common_runtime/device.h"
+#include "tensorflow/core/common_runtime/eager/kernel_and_device.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/types.h"
@@ -39,9 +40,16 @@ namespace tensorflow {
 // represent the TF_AttrType type of the values in the list.
 typedef std::unordered_map<string, uint32> AttrTypeMap;
 
+// Look up OpDef for `op_name`.
+Status OpDefForOp(const char* op_name, const OpDef** op_def);
+
 // Returns the AttrTypeMap for the TensorFlow operation named op_name.
 Status AttrTypeMapForOp(const char* op_name, const AttrTypeMap** out);
 
+// Looks for 'attr_name' in 'm' and sets 'out' and 'is_list'.
+Status AttrTypeByName(const AttrTypeMap& m, const string& attr_name,
+                      TF_AttrType* out, unsigned char* is_list);
+
 // Looks for 'attr_name' in 'm' and sets 'out' and 'is_list'.
 Status AttrTypeByName(const AttrTypeMap& m, const string& attr_name,
                       TF_AttrType* out, unsigned char* is_list);
@@ -146,47 +154,6 @@ template <>
 AttrBuilder& AttrBuilder::Set(StringPiece attr_name,
                               tensorflow::DataType&& value);
 
-// KernelAndDevice encapsulates an instantiated kernel and the device it is on.
-//
-// Also see:
-// https://www.tensorflow.org/code/tensorflow/core/common_runtime/kernel_benchmark_testlib.h
-// and
-// https://www.tensorflow.org/code/tensorflow/core/kernels/ops_testutil.h
-class KernelAndDevice {
- public:
-  // Populates 'out' with a kernel appropriate for 'ndef'.
-  //
-  // The provided FunctionLibraryRuntime MUST outlive all calls to
-  // Run() on the returned KernelAndDevice.
-  //
-  // TODO(ashankar): Figure out thread-safety concerns around
-  // FunctionLibraryRuntime (in particular, how the underlying
-  // FunctionLibraryDefinition might be mutated by another thread as new
-  // functions are registered with it).  Conservatively, thread-safe usage of
-  // the FunctionLibraryRuntime is pushed on to the caller (see locking in
-  // c_api.cc).
-  static Status Init(const NodeDef& ndef, FunctionLibraryRuntime* flib,
-                     KernelAndDevice* out);
-  // TODO(ashankar): Remove this
-  static Status InitOp(Device* device, const NodeDef& ndef,
-                       KernelAndDevice* out);
-
-  KernelAndDevice(tensorflow::Rendezvous* rendez)
-      : device_(nullptr), flib_(nullptr), rendez_(rendez) {}
-
-  // TODO(ashankar): Handle list-valued inputs.
-  Status Run(std::vector<Tensor>* inputs, std::vector<Tensor>* outputs,
-             NodeExecStats* stats);
-
-  const OpKernel* kernel() const { return kernel_.get(); }
-
- private:
-  std::unique_ptr<OpKernel> kernel_;
-  Device* device_;
-  FunctionLibraryRuntime* flib_;
-  checkpoint::TensorSliceReaderCacheWrapper slice_reader_cache_;
-  Rendezvous* rendez_;
-};
 
 }  // namespace tensorflow
 
diff --git a/tensorflow/c/eager/runtime_test.cc b/tensorflow/c/eager/runtime_test.cc
index 643153058ce3d6f0c88dd23a0dec4c6eff060319..27ebeb0508844ee1ee89e0733b66f6ed129b7757 100644
--- a/tensorflow/c/eager/runtime_test.cc
+++ b/tensorflow/c/eager/runtime_test.cc
@@ -33,27 +33,6 @@ limitations under the License.
 namespace tensorflow {
 namespace {
 
-class TestEnv {
- public:
-  TestEnv() : flib_def_(OpRegistry::Global(), {}) {
-    Device* device =
-        DeviceFactory::NewDevice("CPU", {}, "/job:a/replica:0/task:0");
-    device_mgr_.reset(new DeviceMgr({device}));
-    flib_runtime_ = NewFunctionLibraryRuntime(device_mgr_.get(), Env::Default(),
-                                              device, TF_GRAPH_DEF_VERSION,
-                                              &flib_def_, {}, nullptr);
-  }
-
-  FunctionLibraryRuntime* function_library_runtime() const {
-    return flib_runtime_.get();
-  }
-
- private:
-  FunctionLibraryDefinition flib_def_;
-  std::unique_ptr<DeviceMgr> device_mgr_;
-  std::unique_ptr<FunctionLibraryRuntime> flib_runtime_;
-};
-
 TEST(AttrTypeMap, Lookup) {
   const AttrTypeMap* m = nullptr;
   Status s = AttrTypeMapForOp("ThisOpCannotPossiblyExist", &m);
@@ -79,113 +58,5 @@ TEST(AttrTypeMap, Lookup) {
   EXPECT_NE(is_list, 0);
 }
 
-TEST(KernelAndDevice, Run) {
-  Tensor t(Input({{1.0f, 2.0f}, {3.0f, 4.0f}}).tensor());
-  std::vector<Tensor> inputs;
-  inputs.push_back(t);
-  inputs.push_back(t);
-  NodeDef ndef(AttrBuilder("MatMul")
-                   .Set("T", DT_FLOAT)
-                   .Set("transpose_a", false)
-                   .Set("transpose_b", false)
-                   .NumInputs(inputs.size())
-                   .BuildNodeDef());
-  TestEnv env;
-  KernelAndDevice kernel(nullptr);
-  Status s =
-      KernelAndDevice::Init(ndef, env.function_library_runtime(), &kernel);
-  ASSERT_TRUE(s.ok()) << s;
-  std::vector<Tensor> outputs;
-  s = kernel.Run(&inputs, &outputs, nullptr);
-  ASSERT_TRUE(s.ok()) << s;
-  ASSERT_EQ(1, outputs.size());
-  const Tensor& out = outputs[0];
-  EXPECT_EQ(7, out.matrix<float>()(0, 0));
-  EXPECT_EQ(10, out.matrix<float>()(0, 1));
-  EXPECT_EQ(15, out.matrix<float>()(1, 0));
-  EXPECT_EQ(22, out.matrix<float>()(1, 1));
-}
-
-void BM_CreateGraph(int iters) {
-  for (int i = 0; i < iters; ++i) {
-    Scope root = Scope::NewRootScope();
-    auto C = ops::Const(root, {{1.0, 2.0}, {3.0, 4.0}});
-    auto M = ops::MatMul(root, C, C);
-    TF_CHECK_OK(root.status());
-  }
-}
-BENCHMARK(BM_CreateGraph);
-
-void BM_RunGraph(int iters) {
-  tensorflow::testing::StopTiming();
-  Scope root = Scope::NewRootScope();
-  auto C = ops::Const(root, {{1.0, 2.0}, {3.0, 4.0}});
-  auto M = ops::MatMul(root, C, C);
-  SessionOptions opts;
-  opts.config.set_inter_op_parallelism_threads(1);
-  opts.config.set_intra_op_parallelism_threads(1);
-  ClientSession sess(root, opts);
-  std::vector<Tensor> outputs;
-  tensorflow::testing::StartTiming();
-  for (int i = 0; i < iters; ++i) {
-    outputs.clear();
-    TF_CHECK_OK(sess.Run({M}, &outputs));
-  }
-}
-BENCHMARK(BM_RunGraph);
-
-void BM_CreateAndDestroySession(int iters) {
-  tensorflow::testing::StopTiming();
-  Scope root = Scope::NewRootScope();
-  auto C = ops::Const(root, {{1.0, 2.0}, {3.0, 4.0}});
-  auto M = ops::MatMul(root, C, C);
-  tensorflow::testing::StartTiming();
-  for (int i = 0; i < iters; ++i) {
-    ClientSession sess(root);
-  }
-}
-BENCHMARK(BM_CreateAndDestroySession);
-
-void BM_KernelAndDeviceInit(int iters) {
-  tensorflow::testing::StopTiming();
-  NodeDef ndef(AttrBuilder("MatMul")
-                   .Set("T", DT_FLOAT)
-                   .Set("transpose_a", false)
-                   .Set("transpose_b", false)
-                   .NumInputs(2)
-                   .BuildNodeDef());
-  TestEnv env;
-  KernelAndDevice k(nullptr);
-  tensorflow::testing::StartTiming();
-  for (int i = 0; i < iters; ++i) {
-    TF_CHECK_OK(
-        KernelAndDevice::Init(ndef, env.function_library_runtime(), &k));
-  }
-}
-BENCHMARK(BM_KernelAndDeviceInit);
-
-void BM_KernelAndDeviceRun(int iters) {
-  tensorflow::testing::StopTiming();
-  Tensor t(Input({{1.0f, 2.0f}, {3.0f, 4.0f}}).tensor());
-  std::vector<Tensor> inputs;
-  inputs.push_back(t);
-  inputs.push_back(t);
-  std::vector<Tensor> outputs;
-  NodeDef ndef(AttrBuilder("MatMul")
-                   .Set("T", DT_FLOAT)
-                   .Set("transpose_a", false)
-                   .Set("transpose_b", false)
-                   .NumInputs(inputs.size())
-                   .BuildNodeDef());
-  TestEnv env;
-  KernelAndDevice kernel(nullptr);
-  TF_CHECK_OK(
-      KernelAndDevice::Init(ndef, env.function_library_runtime(), &kernel));
-  tensorflow::testing::StartTiming();
-  for (int i = 0; i < iters; ++i) {
-    TF_CHECK_OK(kernel.Run(&inputs, &outputs, nullptr));
-  }
-}
-BENCHMARK(BM_KernelAndDeviceRun);
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/c/eager/tape.h b/tensorflow/c/eager/tape.h
index bdb0815d6b68444ec1c89b835d563db20ce4d8a1..c7bd3bdafd787e5c72625b190ea8bf8b8264d22d 100644
--- a/tensorflow/c/eager/tape.h
+++ b/tensorflow/c/eager/tape.h
@@ -152,6 +152,8 @@ class GradientTape {
                          gtl::ArraySlice<Gradient*> output_gradients,
                          std::vector<Gradient*>* result);
 
+  bool IsPersistent() const { return persistent_; }
+
  private:
   TensorTape tensor_tape_;
   OpTape<BackwardFunction> op_tape_;
diff --git a/tensorflow/c/python_api.cc b/tensorflow/c/python_api.cc
index f553142d15f476ad2c1af68016a4254ed211b9b2..cd604538f1fa142c6fe6a76624c048baddaa52fb 100644
--- a/tensorflow/c/python_api.cc
+++ b/tensorflow/c/python_api.cc
@@ -104,4 +104,9 @@ void SetRequireShapeInferenceFns(TF_Graph* graph, bool require) {
   graph->refiner.set_require_shape_inference_fns(require);
 }
 
+void ExtendSession(TF_Session* session, TF_Status* status) {
+  ExtendSessionGraphHelper(session, status);
+  session->extend_before_run = false;
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/c/python_api.h b/tensorflow/c/python_api.h
index 542d70f42c2a5df8309a722b32d850dd249e496f..13b680b3a24afa2d285ea18207578aff4350f6d5 100644
--- a/tensorflow/c/python_api.h
+++ b/tensorflow/c/python_api.h
@@ -41,6 +41,16 @@ void RemoveAllControlInputs(TF_Graph* graph, TF_Operation* op);
 // error. The default is true.
 void SetRequireShapeInferenceFns(TF_Graph* graph, bool require);
 
+// Extends `session` with any new operations added to its associated graph.
+// Usually this happens automatically in TF_SessionRun. After this is called,
+// TF_SessionRun will no longer extend the session on every call.
+//
+// We expose this here to allow fine-grained synchronization in multi-threaded
+// workloads, which is required since the Python implementation depends on the
+// above mutation methods. This allows us to prevent modifications to nodes in
+// the graph after the session has been made aware of them.
+void ExtendSession(TF_Session* session, TF_Status* status);
+
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_C_PYTHON_API_H_
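
Note on `ExtendSession` in the two `python_api` diffs above: per the header comment, an explicit call extends the session once and then turns off the automatic extension that `TF_SessionRun` normally performs. The sketch below is a toy model of that flag, written only from what the diff shows. `ToyGraph`, `ToySession` and the `Toy*` functions are hypothetical. Reducing "extending" to copying an op count is an assumption for illustration.

```cpp
#include <cassert>

// Hypothetical graph: only tracks how many ops have been added.
struct ToyGraph {
  int num_ops = 0;
};

// Hypothetical session: remembers how many graph ops it has been told about.
struct ToySession {
  ToyGraph* graph = nullptr;
  int num_ops_seen = 0;
  bool extend_before_run = true;  // Default: extend on every run.
};

// Makes the session aware of every op currently in the graph.
void ToyExtend(ToySession* s) {
  assert(s != nullptr && s->graph != nullptr);
  s->num_ops_seen = s->graph->num_ops;
}

// Explicit extension: extend now, then stop extending automatically.
void ToyExtendSession(ToySession* s) {
  ToyExtend(s);
  s->extend_before_run = false;
}

// Returns the number of ops the session can see during this run.
int ToyRun(ToySession* s) {
  if (s->extend_before_run) ToyExtend(s);
  return s->num_ops_seen;
}

// Adds one op after the optional explicit extension and reports how many ops
// the next run sees.
int OpsVisibleAfterLateAdd(bool explicit_extend) {
  ToyGraph g;
  g.num_ops = 2;
  ToySession s;
  s.graph = &g;
  if (explicit_extend) ToyExtendSession(&s);
  g.num_ops = 3;  // An op is added after the (optional) explicit extension.
  return ToyRun(&s);
}
```

With the explicit call, the late op stays invisible until the caller extends again. That gives the caller control over when the session learns about new nodes, which is the fine-grained synchronization the header comment describes.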
diff --git a/tensorflow/cc/framework/cc_op_gen.cc b/tensorflow/cc/framework/cc_op_gen.cc
index a40ad1ffc3b262840e6ca0043139b1b61e04510d..d73121c7b701ec06c03836d1a765f4b35d88fe92 100644
--- a/tensorflow/cc/framework/cc_op_gen.cc
+++ b/tensorflow/cc/framework/cc_op_gen.cc
@@ -28,6 +28,7 @@ limitations under the License.
 #include "tensorflow/core/framework/types.pb_text.h"
 #include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/lib/gtl/stl_util.h"
+#include "tensorflow/core/lib/hash/hash.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/env.h"
@@ -697,7 +698,8 @@ string OpInfo::GetOpAttrStruct() const {
     attr_comment = MakeComment(attr_comment, "    ");
 
     strings::StrAppend(&setters, attr_comment);
-    strings::StrAppend(&setters, "    Attrs ", attr_func_def, " x) {\n");
+    strings::StrAppend(&setters, "    TF_MUST_USE_RESULT Attrs ", attr_func_def,
+                       " x) {\n");
     strings::StrAppend(&setters, "      Attrs ret = *this;\n");
     strings::StrAppend(&setters, "      ret.", api_def_attr.rename_to(),
                        "_ = x;\n");
diff --git a/tensorflow/cc/framework/while_gradients.cc b/tensorflow/cc/framework/while_gradients.cc
index 0734075fc6144d7c9f4fdb48c5e097faa58b8355..81870a0efa309ae6dbd5cc05a5dbe8c3e2d437c8 100644
--- a/tensorflow/cc/framework/while_gradients.cc
+++ b/tensorflow/cc/framework/while_gradients.cc
@@ -72,9 +72,9 @@ Status AddForwardLoopCounter(WhileContext* while_ctx, const Scope& scope,
   };
 
   // Body function that adds one to input.
-  BodyGraphBuilderFn body_fn = [while_ctx](const Scope& scope,
-                                           const std::vector<Output>& inputs,
-                                           std::vector<Output>* outputs) {
+  BodyGraphBuilderFn body_fn = [](const Scope& scope,
+                                  const std::vector<Output>& inputs,
+                                  std::vector<Output>* outputs) {
     DCHECK_EQ(inputs.size(), 1);
     outputs->emplace_back(ops::Add(scope, inputs[0], 1));
     return scope.status();
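
Note on the `TF_MUST_USE_RESULT` change in `cc_op_gen.cc` and the call-site rewrites in `nn_grad.cc`: the generated setter copies `*this`, modifies the copy, and appears to return it. A statement such as `input_attrs.DataFormat(data_format);` therefore leaves `input_attrs` unchanged. The sketch below shows the same pitfall in miniature. `ToyAttrs`, `BuildByDiscarding` and `BuildByChaining` are hypothetical, and C++17 `[[nodiscard]]` stands in for `TF_MUST_USE_RESULT`.

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature of a generated Attrs struct. Each setter returns a
// modified copy and never mutates the object it is called on.
struct ToyAttrs {
  [[nodiscard]] ToyAttrs DataFormat(const std::string& x) {
    ToyAttrs ret = *this;
    ret.data_format_ = x;
    return ret;
  }

  [[nodiscard]] ToyAttrs UseCudnnOnGpu(bool x) {
    ToyAttrs ret = *this;
    ret.use_cudnn_on_gpu_ = x;
    return ret;
  }

  std::string data_format_ = "NHWC";
  bool use_cudnn_on_gpu_ = true;
};

// Old call-site pattern: the setter result is dropped, so the defaults
// survive. The cast to void only silences the [[nodiscard]] warning so the
// demo compiles cleanly.
ToyAttrs BuildByDiscarding(const std::string& fmt) {
  ToyAttrs attrs;
  static_cast<void>(attrs.DataFormat(fmt));
  return attrs;
}

// New call-site pattern: chain the setters and use the returned value.
ToyAttrs BuildByChaining(const std::string& fmt, bool cudnn) {
  assert(!fmt.empty());
  return ToyAttrs().DataFormat(fmt).UseCudnnOnGpu(cudnn);
}
```

`BuildByDiscarding("NCHW")` still reports `"NHWC"`, while `BuildByChaining("NCHW", false)` carries both settings. Marking the setters as must-use lets the compiler flag the first pattern instead of letting it pass silently.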
diff --git a/tensorflow/cc/gradients/nn_grad.cc b/tensorflow/cc/gradients/nn_grad.cc
index 63a67f09f6f7c2b39da8cf082c2a36179014ac6f..0cb3132e94e381f672d69aefe4a199d2b590830c 100644
--- a/tensorflow/cc/gradients/nn_grad.cc
+++ b/tensorflow/cc/gradients/nn_grad.cc
@@ -48,8 +48,8 @@ Status SoftmaxGrad(const Scope& scope, const Operation& op,
 REGISTER_GRADIENT_OP("Softmax", SoftmaxGrad);
 
 Status LogSoftmaxGrad(const Scope& scope, const Operation& op,
-                   const std::vector<Output>& grad_inputs,
-                   std::vector<Output>* grad_outputs) {
+                      const std::vector<Output>& grad_inputs,
+                      std::vector<Output>* grad_outputs) {
   auto softmax = Exp(scope, op.output(0));
   auto sum = Sum(scope, grad_inputs[0], {1}, Sum::KeepDims(true));
   auto mul = Mul(scope, sum, softmax);
@@ -107,11 +107,10 @@ Status BiasAddGradHelper(const Scope& scope, const Operation& op,
                          const std::vector<Output>& grad_inputs,
                          std::vector<Output>* grad_outputs) {
   string data_format;
-  BiasAddGrad::Attrs input_attrs;
   TF_RETURN_IF_ERROR(
       GetNodeAttr(op.output(0).node()->attrs(), "data_format", &data_format));
-  input_attrs.DataFormat(data_format);
-  auto dx_1 = BiasAddGrad(scope, grad_inputs[0], input_attrs);
+  auto dx_1 =
+      BiasAddGrad(scope, grad_inputs[0], BiasAddGrad::DataFormat(data_format));
   grad_outputs->push_back(Identity(scope, grad_inputs[0]));
   grad_outputs->push_back(dx_1);
   return scope.status();
@@ -130,19 +129,16 @@ Status Conv2DGrad(const Scope& scope, const Operation& op,
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "strides", &strides));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "use_cudnn_on_gpu", &use_cudnn_on_gpu));
-  Conv2DBackpropInput::Attrs input_attrs;
-  input_attrs.DataFormat(data_format);
-  input_attrs.UseCudnnOnGpu(use_cudnn_on_gpu);
-  auto dx_1 = Conv2DBackpropInput(scope, Shape(scope, op.input(0)),
-                                  op.input(1), grad_inputs[0],
-                                  strides, padding, input_attrs);
+  auto dx_1 = Conv2DBackpropInput(scope, Shape(scope, op.input(0)), op.input(1),
+                                  grad_inputs[0], strides, padding,
+                                  Conv2DBackpropInput::DataFormat(data_format)
+                                      .UseCudnnOnGpu(use_cudnn_on_gpu));
   grad_outputs->push_back(dx_1);
-  Conv2DBackpropFilter::Attrs filter_attrs;
-  filter_attrs.DataFormat(data_format);
-  filter_attrs.UseCudnnOnGpu(use_cudnn_on_gpu);
-  auto dx_2 = Conv2DBackpropFilter(scope, op.input(0),
-                                   Shape(scope, op.input(1)), grad_inputs[0],
-                                   strides, padding, filter_attrs);
+  auto dx_2 =
+      Conv2DBackpropFilter(scope, op.input(0), Shape(scope, op.input(1)),
+                           grad_inputs[0], strides, padding,
+                           Conv2DBackpropFilter::DataFormat(data_format)
+                               .UseCudnnOnGpu(use_cudnn_on_gpu));
   grad_outputs->push_back(dx_2);
   return scope.status();
 }
@@ -160,13 +156,9 @@ Status MaxPoolGradHelper(const Scope& scope, const Operation& op,
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "ksize", &ksize));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "strides", &strides));
-  internal::MaxPoolGrad::Attrs grad_attrs;
-  grad_attrs.DataFormat(data_format);
-  auto dx = internal::MaxPoolGrad(scope, op.input(0),
-                                  op.output(0),
-                                  grad_inputs[0],
-                                  ksize, strides,
-                                  padding, grad_attrs);
+  auto dx = internal::MaxPoolGrad(
+      scope, op.input(0), op.output(0), grad_inputs[0], ksize, strides, padding,
+      internal::MaxPoolGrad::DataFormat(data_format));
   grad_outputs->push_back(dx);
   return scope.status();
 }
@@ -180,15 +172,9 @@ Status MaxPoolGradV2Helper(const Scope& scope, const Operation& op,
   auto attrs = op.output(0).node()->attrs();
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "data_format", &data_format));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
-  MaxPoolGradV2::Attrs grad_attrs;
-  grad_attrs.DataFormat(data_format);
-  auto dx = MaxPoolGradV2(scope, op.input(0),
-                          op.output(0),
-                          grad_inputs[0],
-                          op.input(1),
-                          op.input(2),
-                          padding,
-                          grad_attrs);
+  auto dx = MaxPoolGradV2(scope, op.input(0), op.output(0), grad_inputs[0],
+                          op.input(1), op.input(2), padding,
+                          MaxPoolGradV2::DataFormat(data_format));
   grad_outputs->push_back(dx);
   grad_outputs->push_back(NoGradient());
   grad_outputs->push_back(NoGradient());
@@ -209,9 +195,9 @@ Status MaxPool3DGradHelper(const Scope& scope, const Operation& op,
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "data_format", &data_format));
   MaxPool3DGrad::Attrs grad_attrs;
-  grad_attrs.DataFormat(data_format);
   auto dx = MaxPool3DGrad(scope, op.input(0), op.output(0), grad_inputs[0],
-                          ksize, strides, padding, grad_attrs);
+                          ksize, strides, padding,
+                          grad_attrs.DataFormat(data_format));
   grad_outputs->push_back(dx);
   return scope.status();
 }
@@ -230,10 +216,10 @@ Status AvgPoolGradHelper(const Scope& scope, const Operation& op,
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "data_format", &data_format));
   internal::AvgPoolGrad::Attrs grad_attrs;
-  grad_attrs.DataFormat(data_format);
   auto dx =
       internal::AvgPoolGrad(scope, Shape(scope, op.input(0)), grad_inputs[0],
-                            ksize, strides, padding, grad_attrs);
+                            ksize, strides, padding,
+                            grad_attrs.DataFormat(data_format));
   grad_outputs->push_back(dx);
   return scope.status();
 }
@@ -252,9 +238,9 @@ Status AvgPool3DGradHelper(const Scope& scope, const Operation& op,
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "data_format", &data_format));
   AvgPool3DGrad::Attrs grad_attrs;
-  grad_attrs.DataFormat(data_format);
   auto dx = AvgPool3DGrad(scope, Shape(scope, op.input(0)), grad_inputs[0],
-                          ksize, strides, padding, grad_attrs);
+                          ksize, strides, padding,
+                          grad_attrs.DataFormat(data_format));
   grad_outputs->push_back(dx);
   return scope.status();
 }
@@ -262,11 +248,8 @@ REGISTER_GRADIENT_OP("AvgPool3D", AvgPool3DGradHelper);
 
 Status LRNGradHelper(const Scope& scope, const Operation& op,
                      const std::vector<Output>& grad_inputs,
-                     std::vector<Output>* grad_outputs){
-  internal::LRNGrad::Attrs grad_attrs;
-
-  auto dx = internal::LRNGrad(scope, grad_inputs[0], op.input(0), op.output(0),
-                              grad_attrs);
+                     std::vector<Output>* grad_outputs) {
+  auto dx = internal::LRNGrad(scope, grad_inputs[0], op.input(0), op.output(0));
   grad_outputs->push_back(dx);
   return scope.status();
 }
diff --git a/tensorflow/cc/profiler/profiler.h b/tensorflow/cc/profiler/profiler.h
index 6077c45c5854fd5812ccb7c91522f93ed4e54883..64edbb5766c3604fbe0f15c2299843718381aa3f 100644
--- a/tensorflow/cc/profiler/profiler.h
+++ b/tensorflow/cc/profiler/profiler.h
@@ -61,18 +61,18 @@ class Profiler {
   /// Adds tracing information `run_meta` to profiler. A `run_meta` is
   /// generated by a TensorFlow session run call. `step` is the key
   /// to the `run_meta`. When calling ProfileXXX methods, caller can specify
-  /// `step` in `options` to seletively profile the corresponding `run_meta`.
+  /// `step` in `options` to selectively profile the corresponding `run_meta`.
   /// Multiple different `run_meta` can be keyed by the same `step` in order
   /// to group them together.
   void AddStep(int64 step, const RunMetadata& run_meta);
 
   /// Profiles the model by organizing nodes in graph structure.
-  /// Each node is an op and the nodes are contected by the op inputs/outputs.
+  /// Each node is an op and the nodes are connected by the op inputs/outputs.
   GraphNodeProto ProfileGraph(const Options& options);
 
   /// Profiles the model by organizing nodes in name scope structure.
   /// Each node is an op, and nodes are organized by the ops' name
-  /// scope, similar to a filesystem tree.
+  /// scope, similar to a file system tree.
   /// E.g. /foo is the root of operation /foo/matmul_1 and foo/conv_2.
   GraphNodeProto ProfileNameScope(const Options& options);
 
diff --git a/tensorflow/cc/tools/BUILD b/tensorflow/cc/tools/BUILD
index 97f66e79b8ad9f383b22f56e9385fc6d2080e1f8..f413a5cc52e9eb4bc393b8186f5b591681fa2e5e 100644
--- a/tensorflow/cc/tools/BUILD
+++ b/tensorflow/cc/tools/BUILD
@@ -32,6 +32,7 @@ tf_cc_test(
     deps = [
         ":freeze_saved_model",
         "//tensorflow/cc:cc_ops",
+        "//tensorflow/cc:resource_variable_ops",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework_internal",
         "//tensorflow/core:protos_all_cc",
diff --git a/tensorflow/cc/tools/freeze_saved_model.cc b/tensorflow/cc/tools/freeze_saved_model.cc
index ddf372cdef21e1b3892c9a03714478d5a5785517..4ddddcb5863c9ffb1e5367db750b0d2ffd29cd5e 100644
--- a/tensorflow/cc/tools/freeze_saved_model.cc
+++ b/tensorflow/cc/tools/freeze_saved_model.cc
@@ -75,16 +75,13 @@ void GetNodeNameToNodeDefMap(
 // variable nodes to convert.
 void GetReachableNodesAndVariables(
     GraphDef* graph_def, const std::unordered_set<string>& outputs,
+    const std::unordered_map<string, NodeDef*>& name_to_node_map,
     std::unordered_set<string>* reachable_node_names,
     std::unordered_set<string>* variable_node_names) {
   // TODO(suharshs): Add support for ResourceVariables.
   static const std::unordered_set<string>* kVariableTypes =
-      new std::unordered_set<string>({"Variable", "VariableV2"});
-  // name_to_node_map is needed to get the inputs from the NodeDef corresponding
-  // the a string node name. These inputs are used when doing our backwards
-  // traversal.
-  std::unordered_map<string, NodeDef*> name_to_node_map;
-  GetNodeNameToNodeDefMap(graph_def, &name_to_node_map);
+      new std::unordered_set<string>({"Variable", "VariableV2", "VarHandleOp"});
+
   std::queue<string> nodes_to_visit;
   for (const string& tensor_name : outputs) {
     // We need to strip off the tensor part to get the node name.
@@ -99,7 +96,7 @@ void GetReachableNodesAndVariables(
       continue;
     }
     reachable_node_names->insert(node_name);
-    NodeDef* node = name_to_node_map[node_name];
+    NodeDef* node = name_to_node_map.at(node_name);
     if (kVariableTypes->find(node->op()) != kVariableTypes->end()) {
       variable_node_names->insert(node->name());
     }
@@ -111,7 +108,9 @@ void GetReachableNodesAndVariables(
 
 // Gets a map from variable name to variable value.
 Status GetVariableNameToTensorMap(
-    Session* session, std::unordered_set<string> variable_names_set,
+    Session* session,
+    const std::unordered_map<string, NodeDef*>& name_to_node_map,
+    std::unordered_set<string> variable_names_set,
     std::unordered_map<string, Tensor>* variable_name_to_value_map) {
   if (variable_names_set.empty()) {
     return Status::OK();
@@ -120,8 +119,14 @@ Status GetVariableNameToTensorMap(
   std::vector<string> tensor_names;
   for (const string& node_name : variable_names_set) {
     variable_names.push_back(node_name);
-    // We need to run tensors, so append ":0".
-    tensor_names.push_back(node_name + ":0");
+    NodeDef* node_def = name_to_node_map.at(node_name);
+    if (node_def->op() == "VarHandleOp") {
+      // If this is a resource variable, we have to run the corresponding
+      // ReadVariableOp.
+      tensor_names.push_back(node_name + "/Read/ReadVariableOp:0");
+    } else {
+      tensor_names.push_back(node_name + ":0");
+    }
   }
   std::vector<Tensor> outputs;
   TF_RETURN_IF_ERROR(
@@ -143,6 +148,15 @@ void ConvertVariableToConstant(const NodeDef& variable_node,
       (*const_node->mutable_attr())["value"].mutable_tensor());
 }
 
+// Converts a ReadVariableOp NodeDef to an Identity NodeDef.
+void ConvertReadVariableOpToIdentity(const NodeDef& node,
+                                     NodeDef* identity_node) {
+  identity_node->set_name(node.name());
+  identity_node->set_op("Identity");
+  (*identity_node->mutable_attr())["T"] = node.attr().at("dtype");
+  identity_node->add_input(node.input(0));
+}
+
 // Freezes the subgraph of all nodes needed by `outputs`.
 Status FreezeGraphDef(const SavedModelBundle& saved_model_bundle,
                       const std::unordered_set<string>& outputs,
@@ -155,14 +169,19 @@ Status FreezeGraphDef(const SavedModelBundle& saved_model_bundle,
   if (graph_def.node_size() == 0) {
     return Status::OK();
   }
+  // name_to_node_map is needed to get the inputs from the NodeDef corresponding
+  // to a string node name. These inputs are used when doing our backwards
+  // traversal.
+  std::unordered_map<string, NodeDef*> name_to_node_map;
+  GetNodeNameToNodeDefMap(&graph_def, &name_to_node_map);
   std::unordered_set<string> reachable_node_names;
   std::unordered_set<string> variable_node_names;
-  GetReachableNodesAndVariables(&graph_def, outputs, &reachable_node_names,
-                                &variable_node_names);
+  GetReachableNodesAndVariables(&graph_def, outputs, name_to_node_map,
+                                &reachable_node_names, &variable_node_names);
   std::unordered_map<string, Tensor> variable_to_value_map;
-  TF_RETURN_IF_ERROR(
-      GetVariableNameToTensorMap(saved_model_bundle.session.get(),
-                                 variable_node_names, &variable_to_value_map));
+  TF_RETURN_IF_ERROR(GetVariableNameToTensorMap(
+      saved_model_bundle.session.get(), name_to_node_map, variable_node_names,
+      &variable_to_value_map));
   // We copy the nodes in the same order they were in the original graph_def.
   for (const NodeDef& node : graph_def.node()) {
     if (reachable_node_names.find(node.name()) == reachable_node_names.end()) {
@@ -171,6 +190,12 @@ Status FreezeGraphDef(const SavedModelBundle& saved_model_bundle,
     if (variable_node_names.find(node.name()) != variable_node_names.end()) {
       ConvertVariableToConstant(node, variable_to_value_map[node.name()],
                                 frozen_graph_def->add_node());
+    } else if (node.op() == "ReadVariableOp" &&
+               variable_node_names.find(node.input(0)) !=
+                   variable_node_names.end()) {
+      // If the node is a ReadVariableOp, its input VarHandleOp will be
+      // converted to a Constant, so the read itself must become an Identity.
+      ConvertReadVariableOpToIdentity(node, frozen_graph_def->add_node());
     } else {
       // If the node isn't a variable, just copy the node as-is.
       *frozen_graph_def->add_node() = node;
diff --git a/tensorflow/cc/tools/freeze_saved_model_test.cc b/tensorflow/cc/tools/freeze_saved_model_test.cc
index 52a81a50284aec36bba4e56a0232c886cb0cb6cf..cd35fd3b95deec669218cfa4f25fea2c3ac9e56e 100644
--- a/tensorflow/cc/tools/freeze_saved_model_test.cc
+++ b/tensorflow/cc/tools/freeze_saved_model_test.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/cc/tools/freeze_saved_model.h"
 
+#include "tensorflow/cc/ops/resource_variable_ops.h"
 #include "tensorflow/cc/ops/standard_ops.h"
 #include "tensorflow/core/framework/function_testlib.h"
 #include "tensorflow/core/framework/graph.pb.h"
@@ -113,6 +114,160 @@ class FreezeTest : public ::testing::Test {
 
     test::ExpectTensorEqual<float>(unfrozen_outputs[0], frozen_outputs[0]);
   }
+
+  void TestFreezeGraphWithoutDependentVariables(bool use_resource) {
+    // Test freezing a graph with variables that are not needed by the outputs
+    // in the SignatureDef. The resulting graph shouldn't be frozen, but
+    // non-dependent nodes should be pruned.
+    SavedModelBundle saved_model_bundle;
+    GraphDef graph_def;
+    Scope scope = Scope::NewRootScope();
+    Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
+    Output b = ops::Const(scope.WithOpName("b"), 10.0f, {});
+    Output c = ops::Mul(scope.WithOpName("c"), a, b);
+    if (use_resource) {
+      Output var =
+          ops::VarHandleOp(scope.WithOpName("var"), DataType::DT_FLOAT, {});
+      Output read_var = ops::ReadVariableOp(
+          scope.WithOpName("var/Read/ReadVariableOp"), var, DataType::DT_FLOAT);
+      auto assign = ops::AssignVariableOp(scope.WithOpName("assign"), var, a);
+    } else {
+      Output var =
+          ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
+      Output assign = ops::Assign(scope.WithOpName("assign"), var, a);
+    }
+
+    TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
+    // "c" isn't dependent on the variable, so nothing should be frozen.
+    TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
+        graph_def, {"c:0"}, "assign", &saved_model_bundle));
+
+    GraphDef frozen_graph_def;
+    std::unordered_set<string> inputs;
+    std::unordered_set<string> outputs;
+    TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def,
+                                  &inputs, &outputs));
+
+    GraphDef expected_graph_def;
+    Scope expected_scope = Scope::NewRootScope();
+    Output expected_a = ops::Const(expected_scope.WithOpName("a"), 10.0f, {});
+    Output expected_b = ops::Const(expected_scope.WithOpName("b"), 10.0f, {});
+    Output expected_c =
+        ops::Mul(expected_scope.WithOpName("c"), expected_a, expected_b);
+    TF_ASSERT_OK(expected_scope.ToGraphDef(&expected_graph_def));
+
+    GraphDefEqual(frozen_graph_def, expected_graph_def);
+
+    RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
+                                         frozen_graph_def, "c:0");
+  }
+
+  void TestFreezeGraphWithDependentVariables(bool use_resource) {
+    // Test freezing a graph with variables that are needed by outputs in the
+    // SignatureDef. The variables should be frozen.
+    SavedModelBundle saved_model_bundle;
+    GraphDef graph_def;
+    Scope scope = Scope::NewRootScope();
+    Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
+    Output read_var;
+    if (use_resource) {
+      Output var =
+          ops::VarHandleOp(scope.WithOpName("var"), DataType::DT_FLOAT, {});
+      read_var = ops::ReadVariableOp(
+          scope.WithOpName("var/Read/ReadVariableOp"), var, DataType::DT_FLOAT);
+      auto assign = ops::AssignVariableOp(scope.WithOpName("assign"), var, a);
+    } else {
+      read_var =
+          ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
+      Output assign = ops::Assign(scope.WithOpName("assign"), read_var, a);
+    }
+    Output c = ops::Mul(scope.WithOpName("c"), a, read_var);
+    TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
+    // "c" depends on the variable, so the variable should be frozen.
+    TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
+        graph_def, {"c:0"}, "assign", &saved_model_bundle));
+
+    GraphDef frozen_graph_def;
+    std::unordered_set<string> inputs;
+    std::unordered_set<string> outputs;
+    TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def,
+                                  &inputs, &outputs));
+
+    // If using normal variables there should be 3 nodes in the resulting
+    // graph_def. If using resource variables there should be 4 nodes in the
+    // resulting graph_def.
+    // In both cases, none should be variables.
+    size_t expected_nodes = use_resource ? 4 : 3;
+    EXPECT_EQ(frozen_graph_def.node_size(), expected_nodes);
+    for (const NodeDef& node : frozen_graph_def.node()) {
+      EXPECT_NE(node.op(), "Variable") << node.name();
+      EXPECT_NE(node.op(), "VariableV2") << node.name();
+      EXPECT_NE(node.op(), "VarHandleOp") << node.name();
+      EXPECT_NE(node.op(), "ReadVariableOp") << node.name();
+    }
+
+    RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
+                                         frozen_graph_def, "c:0");
+  }
+
+  void TestFreezeGraphWithAndWithoutDependentVariables(bool use_resource) {
+    // Test freezing a graph in which some variables are needed by the outputs
+    // in the SignatureDef and others are not. The resulting graph should
+    // freeze only the variables that the outputs depend on; the rest should be
+    // pruned.
+    SavedModelBundle saved_model_bundle;
+    GraphDef graph_def;
+    Scope scope = Scope::NewRootScope();
+    Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
+    Output read_var;
+
+    if (use_resource) {
+      Output var =
+          ops::VarHandleOp(scope.WithOpName("var"), DataType::DT_FLOAT, {});
+      read_var = ops::ReadVariableOp(
+          scope.WithOpName("var/Read/ReadVariableOp"), var, DataType::DT_FLOAT);
+      auto assign = ops::AssignVariableOp(scope.WithOpName("assign"), var, a);
+      Output var_1 =
+          ops::VarHandleOp(scope.WithOpName("var_1"), DataType::DT_FLOAT, {});
+      Output read_var_1 =
+          ops::ReadVariableOp(scope.WithOpName("var_1/Read/ReadVariableOp"),
+                              var_1, DataType::DT_FLOAT);
+      auto assign_1 =
+          ops::AssignVariableOp(scope.WithOpName("assign_1"), var_1, a);
+    } else {
+      read_var = ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
+      Output assign = ops::Assign(scope.WithOpName("assign"), read_var, a);
+      Output var_1 =
+          ops::Variable(scope.WithOpName("var_1"), {}, DataType::DT_FLOAT);
+      Output assign_1 = ops::Assign(scope.WithOpName("assign_1"), var_1, a);
+    }
+
+    Output c = ops::Mul(scope.WithOpName("c"), a, read_var);
+    TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
+    // "c" depends on "var" but not on "var_1", so only "var" should be frozen.
+    TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
+        graph_def, {"c:0"}, "assign", &saved_model_bundle));
+
+    GraphDef frozen_graph_def;
+    std::unordered_set<string> inputs;
+    std::unordered_set<string> outputs;
+    TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def,
+                                  &inputs, &outputs));
+
+    // There should be 3 nodes in the resulting graph_def (4 when using
+    // resource variables), and none should be variables.
+    size_t expected_nodes = use_resource ? 4 : 3;
+    EXPECT_EQ(frozen_graph_def.node_size(), expected_nodes);
+    for (const NodeDef& node : frozen_graph_def.node()) {
+      EXPECT_NE(node.op(), "Variable") << node.name();
+      EXPECT_NE(node.op(), "VariableV2") << node.name();
+      EXPECT_NE(node.op(), "VarHandleOp") << node.name();
+      EXPECT_NE(node.op(), "ReadVariableOp") << node.name();
+    }
+
+    RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
+                                         frozen_graph_def, "c:0");
+  }
 };
 
 TEST_F(FreezeTest, InputsAndOutputsSingleSignatureDef) {
@@ -196,111 +351,28 @@ TEST_F(FreezeTest, GraphDefWithNoVariables) {
   GraphDefEqual(frozen_graph_def, graph_def);
 }
 
-TEST_F(FreezeTest, GraphDefWithVariablesNotNeededByOutputs) {
-  // Test freezing a graph with variables that are not needed by the outputs in
-  // the SignatureDef. The resulting graph shouldn't be frozen, but
-  // non-dependent nodes should be pruned.
-  SavedModelBundle saved_model_bundle;
-  GraphDef graph_def;
-  Scope scope = Scope::NewRootScope();
-  Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
-  Output b = ops::Const(scope.WithOpName("b"), 10.0f, {});
-  Output c = ops::Mul(scope.WithOpName("c"), a, b);
-  Output var = ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
-  Output assign = ops::Assign(scope.WithOpName("assign"), var, a);
-  TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
-  // "c" isnt dependent on the variable, so nothing should be frozen.
-  TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
-      graph_def, {"c:0"}, assign.name(), &saved_model_bundle));
-
-  GraphDef frozen_graph_def;
-  std::unordered_set<string> inputs;
-  std::unordered_set<string> outputs;
-  TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def, &inputs,
-                                &outputs));
-
-  GraphDef expected_graph_def;
-  Scope expected_scope = Scope::NewRootScope();
-  Output expected_a = ops::Const(expected_scope.WithOpName("a"), 10.0f, {});
-  Output expected_b = ops::Const(expected_scope.WithOpName("b"), 10.0f, {});
-  Output expected_c =
-      ops::Mul(expected_scope.WithOpName("c"), expected_a, expected_b);
-  TF_ASSERT_OK(expected_scope.ToGraphDef(&expected_graph_def));
-
-  GraphDefEqual(frozen_graph_def, expected_graph_def);
-
-  RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
-                                       frozen_graph_def, "c:0");
+TEST_F(FreezeTest, GraphDefWithoutDependentVariables) {
+  TestFreezeGraphWithoutDependentVariables(false);
 }
 
-TEST_F(FreezeTest, GraphDefWithVariablesNeededByOutputs) {
-  // Test freezing a graph with variables that are needed by outputs in the
-  // SignatureDef. The variables should be frozen.
-  SavedModelBundle saved_model_bundle;
-  GraphDef graph_def;
-  Scope scope = Scope::NewRootScope();
-  Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
-  Output var = ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
-  Output c = ops::Mul(scope.WithOpName("c"), a, var);
-  Output assign = ops::Assign(scope.WithOpName("assign"), var, a);
-  TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
-  // "c" isnt dependent on the variable, so nothing should be frozen.
-  TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
-      graph_def, {"c:0"}, assign.name(), &saved_model_bundle));
-
-  GraphDef frozen_graph_def;
-  std::unordered_set<string> inputs;
-  std::unordered_set<string> outputs;
-  TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def, &inputs,
-                                &outputs));
-
-  // There should be 3 nodes in the resulting graph_def, and none should be
-  // variables.
-  EXPECT_EQ(frozen_graph_def.node_size(), 3);
-  for (const NodeDef& node : frozen_graph_def.node()) {
-    EXPECT_NE(node.op(), "Variable") << node.name();
-    EXPECT_NE(node.op(), "VariableV2") << node.name();
-  }
-
-  RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
-                                       frozen_graph_def, "c:0");
+TEST_F(FreezeTest, GraphDefWithoutDependentResourceVariables) {
+  TestFreezeGraphWithoutDependentVariables(true);
 }
 
-TEST_F(FreezeTest, GraphDefWithVariablesNeededAndNotNeededByOutputs) {
-  // Test freezing a graph with some variables that are needed and not needed by
-  // the outputs in the SignatureDef. The resulting graph should only freeze
-  // dependent variables.
-  SavedModelBundle saved_model_bundle;
-  GraphDef graph_def;
-  Scope scope = Scope::NewRootScope();
-  Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
-  Output var = ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
-  Output c = ops::Mul(scope.WithOpName("c"), a, var);
-  Output assign = ops::Assign(scope.WithOpName("assign"), var, a);
-  Output var_1 =
-      ops::Variable(scope.WithOpName("var_1"), {}, DataType::DT_FLOAT);
-  Output assign_1 = ops::Assign(scope.WithOpName("assign_1"), var, a);
-  TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
-  // "c" isnt dependent on the variable, so nothing should be frozen.
-  TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
-      graph_def, {"c:0"}, assign.name(), &saved_model_bundle));
+TEST_F(FreezeTest, GraphDefWithDependentVariables) {
+  TestFreezeGraphWithDependentVariables(false);
+}
 
-  GraphDef frozen_graph_def;
-  std::unordered_set<string> inputs;
-  std::unordered_set<string> outputs;
-  TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def, &inputs,
-                                &outputs));
+TEST_F(FreezeTest, GraphDefWithDependentResourceVariables) {
+  TestFreezeGraphWithDependentVariables(true);
+}
 
-  // There should be 3 nodes in the resulting graph_def, and none should be
-  // variables.
-  EXPECT_EQ(frozen_graph_def.node_size(), 3);
-  for (const NodeDef& node : frozen_graph_def.node()) {
-    EXPECT_NE(node.op(), "Variable") << node.name();
-    EXPECT_NE(node.op(), "VariableV2") << node.name();
-  }
+TEST_F(FreezeTest, GraphDefWithAndWithoutDependentVariables) {
+  TestFreezeGraphWithAndWithoutDependentVariables(false);
+}
 
-  RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
-                                       frozen_graph_def, "c:0");
+TEST_F(FreezeTest, GraphDefWithAndWithoutDependentResourceVariables) {
+  TestFreezeGraphWithAndWithoutDependentVariables(true);
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/aot/BUILD b/tensorflow/compiler/aot/BUILD
index 0900e87ebabd378e6237b77ca0ef01677c07c244..ffa2d088295375bbbcd2cdd9365982907f2bf480 100644
--- a/tensorflow/compiler/aot/BUILD
+++ b/tensorflow/compiler/aot/BUILD
@@ -72,6 +72,7 @@ cc_library(
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:framework_internal",
         "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
         "//tensorflow/core:protos_all_cc",
     ],
 )
diff --git a/tensorflow/compiler/aot/compile.cc b/tensorflow/compiler/aot/compile.cc
index c87f2b75dfa18ad5c3eda4bd6fcbcb3083ef73fd..7c833878818022c86fd3171ec9cef9fcd3217a24 100644
--- a/tensorflow/compiler/aot/compile.cc
+++ b/tensorflow/compiler/aot/compile.cc
@@ -32,6 +32,7 @@ limitations under the License.
 #include "tensorflow/core/framework/graph.pb.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/io/path.h"
+#include "tensorflow/core/lib/strings/proto_serialization.h"
 #include "tensorflow/core/platform/env.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/types.h"
diff --git a/tensorflow/compiler/aot/tfcompile.bzl b/tensorflow/compiler/aot/tfcompile.bzl
index 9dff1be09fede6f65f82c2f36d94be07e781949f..3a877c5337ff76193a7f27fb9681e5a9ca500961 100644
--- a/tensorflow/compiler/aot/tfcompile.bzl
+++ b/tensorflow/compiler/aot/tfcompile.bzl
@@ -132,7 +132,7 @@ def tf_library(name, graph, config,
   header_file = name + ".h"
   metadata_object_file = name + "_tfcompile_metadata.o"
   function_object_file = name + "_tfcompile_function.o"
-  ep = ("__" + PACKAGE_NAME + "__" + name).replace("/", "_")
+  ep = ("__" + native.package_name() + "__" + name).replace("/", "_")
   if type(tfcompile_flags) == type(""):
     flags = tfcompile_flags
   else:
diff --git a/tensorflow/compiler/jit/BUILD b/tensorflow/compiler/jit/BUILD
index af259e0564c885836cf0e49c8b29c6169f059c5a..8e505da6221b23b0130548405f12a61dcda100d7 100644
--- a/tensorflow/compiler/jit/BUILD
+++ b/tensorflow/compiler/jit/BUILD
@@ -29,7 +29,10 @@ load("@local_config_cuda//cuda:build_defs.bzl", "if_cuda_is_configured")
 # Target that bundles up the XLA CPU and GPU JIT devices.
 cc_library(
     name = "jit",
-    visibility = [":friends"],
+    visibility = [
+        ":friends",
+        "//learning/tfx:__subpackages__",
+    ],
     deps = [
         ":xla_cpu_device",
         ":xla_cpu_jit",
@@ -73,6 +76,7 @@ cc_library(
         ":jit_compilation_passes",
         ":xla_device",
         "//tensorflow/compiler/jit/kernels:xla_launch_op",
+        "//tensorflow/compiler/jit/legacy_flags:xla_device_flags",
         "//tensorflow/compiler/tf2xla:xla_compiler",
         "//tensorflow/compiler/tf2xla/kernels:xla_ops",
         "//tensorflow/compiler/xla/service:cpu_plugin",  # buildcleaner: keep
@@ -115,14 +119,31 @@ cc_library(
     alwayslink = 1,
 )
 
+cc_library(
+    name = "xla_tensor_info",
+    srcs = ["xla_tensor_info.cc"],
+    hdrs = ["xla_tensor_info.h"],
+    deps = [
+        ":common",
+        "//tensorflow/compiler/xla/service:shaped_buffer",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+    ],
+)
+
 cc_library(
     name = "xla_device",
     srcs = [
+        "xla_compile_on_demand_op.cc",
         "xla_device.cc",
         "xla_device_context.cc",
         "xla_device_ops.cc",
     ],
     hdrs = [
+        "xla_compile_on_demand_op.h",
         "xla_device.h",
         "xla_device_context.h",
         "xla_device_ops.h",
@@ -132,6 +153,8 @@ cc_library(
     deps = [
         ":common",
         ":jit_compilation_passes",
+        ":xla_launch_util",
+        ":xla_tensor_info",
         "//tensorflow/compiler/jit/ops:xla_ops",
         "//tensorflow/compiler/tf2xla:common",
         "//tensorflow/compiler/tf2xla:dump_graph",
@@ -171,6 +194,29 @@ cc_library(
     visibility = [":friends"],
 )
 
+cc_library(
+    name = "xla_launch_util",
+    srcs = ["xla_launch_util.cc"],
+    hdrs = ["xla_launch_util.h"],
+    deps = [
+        ":common",
+        ":xla_compilation_cache",
+        ":xla_tensor_info",
+        "//tensorflow/compiler/tf2xla:xla_compiler",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla/client:client_library",
+        "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/kernels:variable_ops",
+    ],
+)
+
 cc_library(
     name = "xla_compilation_cache",
     srcs = ["xla_compilation_cache.cc"],
@@ -205,6 +251,7 @@ cc_library(
     name = "graph_to_functiondef",
     srcs = ["graph_to_functiondef.cc"],
     hdrs = ["graph_to_functiondef.h"],
+    visibility = [":friends"],
     deps = [
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework",
@@ -301,6 +348,7 @@ tf_cc_test(
     deps = [
         ":common",
         ":compilation_passes",
+        ":graph_to_functiondef",
         "//tensorflow/cc:cc_ops",
         "//tensorflow/cc:cc_ops_internal",
         "//tensorflow/cc:function_ops",
diff --git a/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc b/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc
index 9c372a012789fc25ca0a711349c09ca62edc6754..7fc43fb26318335909d52d5bbd83ebf61f42a703 100644
--- a/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc
+++ b/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc
@@ -53,6 +53,8 @@ namespace tensorflow {
 const char* const kXlaCompiledKernelAttr = "_XlaCompiledKernel";
 const char* const kXlaNumConstantArgsAttr = "_XlaNumConstantArgs";
 const char* const kXlaNumResourceArgsAttr = "_XlaNumResourceArgs";
+const char* const kXlaHostTransferSequencerAttr =
+    "_xla_host_transfer_sequencer";
 
 namespace {
 
@@ -143,7 +145,7 @@ struct NodeSlot {
 // everything to use it.
 static const char* const kArgOp = "_Arg";
 static const char* const kRetValOp = "_Retval";
-static const char* const kHostComputeOp = "_XlaHostCompute";
+static const char* const kHostComputeOp = "XlaHostCompute";
 static const char* const kSendFromHostOp = "_XlaSendFromHost";
 static const char* const kRecvAtHostOp = "_XlaRecvAtHost";
 
@@ -328,12 +330,14 @@ class Encapsulator {
     Status MakeSequencingNode(const string& subgraph_name, Graph* graph_out);
 
     // If there is a sequencer node, adds a control edge from the sequencer to
-    // all the downstream nodes of call_node_outputs.
-    void ConnectSequencerToOutputs(Graph* graph_out);
+    // the call node.
+    void ConnectSequencerToCallNode(Graph* graph_out);
 
     Status AddShapeInferenceInfo(
+        const string& subgraph_name,
         const string& outside_compilation_subgraph_name,
-        const std::vector<TensorShapeProto>& shapes, GraphDef* inference_graph);
+        const std::vector<TensorShapeProto>& shapes, Graph* inference_graph,
+        FunctionLibraryDefinition* library);
 
     Status ReplaceFunctionDef(FunctionLibraryDefinition* library);
 
@@ -381,12 +385,24 @@ class Encapsulator {
       Node* send_from_host = nullptr;
     };
 
+    // Creates an outside_compilation subgraph for outside_compilation_id if
+    // none exists yet. Returns the (possibly newly created) subgraph for
+    // outside_compilation_id.
+    OutsideCompilationSubgraph* LookupOrCreateOutsideCompilationSubgraph(
+        const string& outside_compilation_id);
+
     // Builds a ParallelCheck op that compares the output of the original
     // subgraph with the encapsulated subgraph.
     Status BuildParallelCheckOp(
         const std::unordered_map<const Node*, Node*>& node_images,
         Graph* graph_out);
 
+    // Builds a placeholder node used to provide the key input to a RecvAtHost
+    // or SendFromHost node. This placeholder node will be removed by a later
+    // pass.
+    Status AddHostComputeKeyPlaceholder(OutsideCompilationSubgraph* oc_subgraph,
+                                        Graph* graph_out);
+
     // Builds a _RecvAtHost node producing all the inputs of an
     // outside_compilation subgraph and stores it in oc_subgraph.recv_at_host.
     Status AddRecvAtHostNode(const string& subgraph_name,
@@ -413,6 +429,14 @@ class Encapsulator {
     // NodeDef for the function call node.
     NodeDef call_node_def_;
 
+    // Name that is used for the call node. This may not be
+    // call_node_def_.name() if the client supplies a rewrite lambda.
+    string function_def_name_;
+
+    // Placeholder node simulating the host compute key in the output graph.
+    // Not owned.
+    Node* host_compute_key_placeholder_ = nullptr;
+
     // Function call node(s) in the output graph. Not owned.
     // If parallel_checking is enabled, 'call_node_inputs' is the function call
     // node to which inputs should be fed, and 'call_node_outputs' is the
@@ -551,7 +575,7 @@ class Encapsulator {
       const std::unordered_set<string>& recv_at_host_nodes, Node* send_node,
       FunctionLibraryDefinition* library,
       std::vector<TensorShapeProto>* static_shape_out,
-      std::unique_ptr<GraphDef>* graphdef_out);
+      std::unique_ptr<Graph>* graph_out);
 
   // Makes a copy of graph containing only nodes that are ancestors of at least
   // one node in send_from_host_nodes and store it in pruned_graph. On exit
@@ -712,39 +736,44 @@ Status Encapsulator::Subgraph::RecordResult(
   return Status::OK();
 }
 
-void Encapsulator::Subgraph::RecordOutsideCompilationInputOrControl(
-    const string& outside_compilation_id, const Edge* edge) {
+Encapsulator::Subgraph::OutsideCompilationSubgraph*
+Encapsulator::Subgraph::LookupOrCreateOutsideCompilationSubgraph(
+    const string& outside_compilation_id) {
   auto iter = outside_compilation_subgraphs_
                   .emplace(outside_compilation_id, OutsideCompilationSubgraph())
                   .first;
-  OutsideCompilationSubgraph& outside_subgraph = iter->second;
+  OutsideCompilationSubgraph* outside_subgraph = &iter->second;
+  return outside_subgraph;
+}
+
+void Encapsulator::Subgraph::RecordOutsideCompilationInputOrControl(
+    const string& outside_compilation_id, const Edge* edge) {
+  OutsideCompilationSubgraph* outside_subgraph =
+      LookupOrCreateOutsideCompilationSubgraph(outside_compilation_id);
   if (edge->IsControlEdge()) {
-    outside_subgraph.control_inputs.insert(edge->src());
+    outside_subgraph->control_inputs.insert(edge->src());
   } else {
-    int input_index = outside_subgraph.inputs.size();
-    outside_subgraph.inputs.emplace(NodeSlot(edge->src(), edge->src_output()),
-                                    input_index);
+    int input_index = outside_subgraph->inputs.size();
+    outside_subgraph->inputs.emplace(NodeSlot(edge->src(), edge->src_output()),
+                                     input_index);
   }
 }
 
 void Encapsulator::Subgraph::RecordOutsideCompilationOutputOrControl(
     const string& outside_compilation_id, const Edge* edge) {
-  auto subgraph_iter =
-      outside_compilation_subgraphs_
-          .emplace(outside_compilation_id, OutsideCompilationSubgraph())
-          .first;
-  OutsideCompilationSubgraph& outside_subgraph = subgraph_iter->second;
+  OutsideCompilationSubgraph* outside_subgraph =
+      LookupOrCreateOutsideCompilationSubgraph(outside_compilation_id);
   if (edge->IsControlEdge()) {
-    outside_subgraph.control_outputs.insert(edge->dst());
+    outside_subgraph->control_outputs.insert(edge->dst());
   } else {
     DataType dtype = edge->dst()->input_type(edge->dst_input());
     auto output_iter =
-        outside_subgraph.outputs_by_src
+        outside_subgraph->outputs_by_src
             .emplace(NodeSlot(edge->src(), edge->src_output(), dtype),
-                     outside_subgraph.outputs_by_src.size())
+                     outside_subgraph->outputs_by_src.size())
             .first;
     int output_index = output_iter->second;
-    outside_subgraph.outputs_by_dst[NodeSlot(edge->dst(), edge->dst_input())] =
+    outside_subgraph->outputs_by_dst[NodeSlot(edge->dst(), edge->dst_input())] =
         output_index;
   }
 }
@@ -842,25 +871,21 @@ Status Encapsulator::Subgraph::MakeSequencingNode(const string& subgraph_name,
     NodeDef seq_def;
     NodeDefBuilder builder(strings::StrCat(subgraph_name, "_sequencer"),
                            "NoOp");
+    builder.Attr(kXlaHostTransferSequencerAttr, subgraph_name);
+    builder.Device(device_);
     Status s = builder.Finalize(&seq_def);
     if (!s.ok()) return s;
 
     sequencer_ = graph_out->AddNode(seq_def, &s);
     if (!s.ok()) return s;
-    sequencer_->set_assigned_device_name(device_);
   }
   return Status::OK();
 }
 
-void Encapsulator::Subgraph::ConnectSequencerToOutputs(Graph* graph_out) {
+void Encapsulator::Subgraph::ConnectSequencerToCallNode(Graph* graph_out) {
   if (sequencer_ != nullptr) {
-    std::unordered_set<Node*> output_dependencies;
-    for (Node* node : call_node_outputs_->out_nodes()) {
-      output_dependencies.insert(node);
-    }
-    for (Node* node : output_dependencies) {
-      graph_out->AddControlEdge(sequencer_, node);
-    }
+    VLOG(2) << "ConnectSequencerToCallNode";
+    graph_out->AddControlEdge(sequencer_, call_node_inputs_);
   }
 }
 
@@ -906,6 +931,8 @@ Status Encapsulator::Subgraph::BuildFunctionDef(
     name = call_node_def_.op();
   }
 
+  function_def_name_ = name;
+
   FunctionDef fdef;
   TF_RETURN_IF_ERROR(GraphToFunctionDef(*graph_, name, &fdef));
 
@@ -924,8 +951,10 @@ Status Encapsulator::Subgraph::BuildFunctionDef(
 }
 
 Status Encapsulator::Subgraph::AddShapeInferenceInfo(
+    const string& subgraph_name,
     const string& outside_compilation_subgraph_name,
-    const std::vector<TensorShapeProto>& shapes, GraphDef* inference_graph) {
+    const std::vector<TensorShapeProto>& shapes, Graph* inference_graph,
+    FunctionLibraryDefinition* library) {
   OutsideCompilationSubgraph& oc_subgraph =
       outside_compilation_subgraphs_.at(outside_compilation_subgraph_name);
 
@@ -947,21 +976,22 @@ Status Encapsulator::Subgraph::AddShapeInferenceInfo(
     host_compute->AddAttr("shape_inference_graph", "");
     host_compute->AddAttr("shapes", shapes);
   } else {
-    string serialized_graph;
-    if (!inference_graph->SerializeToString(&serialized_graph)) {
-      return errors::Internal(
-          "Failed to serialize graph for outside compilation subgraph ",
-          oc_subgraph.host_compute_name);
-    }
-    host_compute->AddAttr("shape_inference_graph", serialized_graph);
+    string inference_graph_name =
+        strings::StrCat("_outside_compilation_shape_inference_", subgraph_name,
+                        "_", outside_compilation_subgraph_name);
+    FunctionDef fdef;
+    TF_RETURN_IF_ERROR(
+        GraphToFunctionDef(*inference_graph, inference_graph_name, &fdef));
+    host_compute->AddAttr("shape_inference_graph", inference_graph_name);
     host_compute->AddAttr("shapes", std::vector<TensorShapeProto>());
+    TF_RETURN_IF_ERROR(library->AddFunctionDef(fdef));
   }
   return Status::OK();
 }
 
 Status Encapsulator::Subgraph::ReplaceFunctionDef(
     FunctionLibraryDefinition* library) {
-  const string& name = call_node_def_.name();
+  const string& name = function_def_name_;
 
   FunctionDef fdef;
   TF_RETURN_IF_ERROR(GraphToFunctionDef(*graph_, name, &fdef));
@@ -1060,9 +1090,36 @@ Status Encapsulator::Subgraph::AddFunctionCallNode(
   return Status::OK();
 }
 
+Status Encapsulator::Subgraph::AddHostComputeKeyPlaceholder(
+    OutsideCompilationSubgraph* oc_subgraph, Graph* graph_out) {
+  TensorShapeProto shape_proto;
+  TensorShape shape({2});
+  shape.AsProto(&shape_proto);
+  GraphDefBuilder::Options options(graph_out, /*status=*/nullptr);
+  NodeDef key_def;
+  NodeDefBuilder builder(
+      strings::StrCat(call_node_def_.name(), "_key_placeholder"),
+      "Placeholder");
+  builder.Attr("dtype", DT_STRING);
+  builder.Attr("shape", shape_proto);
+  builder.Attr("_host_compute_call_node", call_node_def_.name());
+  Status s = builder.Finalize(&key_def);
+  if (!s.ok()) return s;
+
+  host_compute_key_placeholder_ = graph_out->AddNode(key_def, &s);
+  if (!s.ok()) return s;
+  host_compute_key_placeholder_->set_assigned_device_name(device_);
+
+  return Status::OK();
+}
+
 Status Encapsulator::Subgraph::AddRecvAtHostNode(
     const string& subgraph_name, const string& oc_subgraph_name,
     OutsideCompilationSubgraph* oc_subgraph, Graph* graph_out) {
+  if (host_compute_key_placeholder_ == nullptr) {
+    TF_RETURN_IF_ERROR(AddHostComputeKeyPlaceholder(oc_subgraph, graph_out));
+  }
+
   std::vector<DataType> dtypes(oc_subgraph->inputs.size(), DT_INVALID);
 
   for (const auto& input : oc_subgraph->inputs) {
@@ -1078,15 +1135,22 @@ Status Encapsulator::Subgraph::AddRecvAtHostNode(
   NodeDefBuilder builder(strings::StrCat("outside_compilation_", subgraph_name,
                                          "_", oc_subgraph_name, "_recv"),
                          kRecvAtHostOp);
+  // TODO(misard) When we add replication the device placement will have to be
+  // redone.
+  builder.Device(device_);
   builder.Attr("Toutputs", dtypes);
+  // TODO(misard) For now we only support TPU device 0.
+  builder.Attr("device_ordinal", 0);
   builder.Attr("key", strings::StrCat("host_compute_channel_", subgraph_name,
                                       "_", oc_subgraph_name));
+  builder.Input(host_compute_key_placeholder_->name(), 0, DT_STRING);
   Status s = builder.Finalize(&recv_def);
   if (!s.ok()) return s;
 
   oc_subgraph->recv_at_host = graph_out->AddNode(recv_def, &s);
   if (!s.ok()) return s;
-  oc_subgraph->recv_at_host->set_assigned_device_name(device_);
+  graph_out->AddEdge(host_compute_key_placeholder_, 0,
+                     oc_subgraph->recv_at_host, 0);
 
   // Add a control dependency forcing the RecvAtHost to run before the subgraph
   // completes. This has no effect on execution order but prevents the
@@ -1101,6 +1165,10 @@ Status Encapsulator::Subgraph::AddSendFromHostNode(
     const std::unordered_map<const Node*, Node*>& node_images,
     const string& subgraph_name, const string& oc_subgraph_name,
     OutsideCompilationSubgraph* oc_subgraph, Graph* graph_out) {
+  if (host_compute_key_placeholder_ == nullptr) {
+    TF_RETURN_IF_ERROR(AddHostComputeKeyPlaceholder(oc_subgraph, graph_out));
+  }
+
   std::vector<DataType> dtypes(oc_subgraph->outputs_by_src.size(), DT_INVALID);
   std::vector<NodeDefBuilder::NodeOut> inputs(
       oc_subgraph->outputs_by_src.size());
@@ -1120,16 +1188,23 @@ Status Encapsulator::Subgraph::AddSendFromHostNode(
   NodeDefBuilder builder(strings::StrCat("outside_compilation_", subgraph_name,
                                          "_", oc_subgraph_name, "_send"),
                          kSendFromHostOp);
+  // TODO(misard) When we add replication the device placement will have to be
+  // redone.
+  builder.Device(device_);
   builder.Attr("Tinputs", dtypes);
   builder.Attr("key", strings::StrCat("host_compute_channel_", subgraph_name,
                                       "_", oc_subgraph_name));
+  // TODO(misard) For now we only support TPU device 0.
+  builder.Attr("device_ordinal", 0);
   builder.Input(inputs);
+  builder.Input(host_compute_key_placeholder_->name(), 0, DT_STRING);
   Status s = builder.Finalize(&send_def);
   if (!s.ok()) return s;
 
   oc_subgraph->send_from_host = graph_out->AddNode(send_def, &s);
   if (!s.ok()) return s;
-  oc_subgraph->send_from_host->set_assigned_device_name(device_);
+  graph_out->AddEdge(host_compute_key_placeholder_, 0,
+                     oc_subgraph->send_from_host, inputs.size());
 
   // Add a control dependency forcing the SendFromHost to run before the
   // subgraph completes. This has no effect on execution order but prevents the
@@ -1611,7 +1686,7 @@ Status Encapsulator::AddEdgesToOutputGraph(
 
   for (auto& subgraph_entry : subgraphs_) {
     Subgraph& subgraph = subgraph_entry.second;
-    subgraph.ConnectSequencerToOutputs(graph_out);
+    subgraph.ConnectSequencerToCallNode(graph_out);
   }
 
   return Status::OK();
@@ -1690,7 +1765,7 @@ Status Encapsulator::DoStaticShapeInferenceForOutsideCompilationSend(
     const std::unordered_set<string>& recv_at_host_nodes, Node* send_node,
     FunctionLibraryDefinition* library,
     std::vector<TensorShapeProto>* static_shape_out,
-    std::unique_ptr<GraphDef>* graphdef_out) {
+    std::unique_ptr<Graph>* graph_out) {
   // Maps from nodes in graph_in to nodes in graph_out.
   //
   // When an edge has fully defined shape the source node in graph_in is
@@ -1707,9 +1782,11 @@ Status Encapsulator::DoStaticShapeInferenceForOutsideCompilationSend(
   std::unordered_map<Node*, Node*> dummy_node_images;
   std::unordered_map<Node*, Node*> copied_node_images;
 
-  std::unique_ptr<Graph> graph_out(new Graph(graph_in.op_registry()));
-  graph_out->set_versions(graph_in.versions());
-  static_shape_out->resize(send_node->num_inputs());
+  graph_out->reset(new Graph(graph_in.op_registry()));
+  (*graph_out)->set_versions(graph_in.versions());
+  // The final input to the send node is the dynamic key, which we don't include
+  // in the static shapes.
+  static_shape_out->resize(send_node->num_inputs() - 1);
 
   // We don't use the standard ReverseDFS because we want to cut off traversal
   // whenever we find an output with fully defined shape.
@@ -1728,7 +1805,7 @@ Status Encapsulator::DoStaticShapeInferenceForOutsideCompilationSend(
     if (w.leave) {
       TF_RETURN_IF_ERROR(CopyShapeInferenceNodeToGraph(
           n, send_node, dummy_node_images, library, &copied_node_images,
-          graph_out.get()));
+          graph_out->get()));
     } else {
       if (visited[n->id()]) continue;
       visited[n->id()] = true;
@@ -1750,14 +1827,23 @@ Status Encapsulator::DoStaticShapeInferenceForOutsideCompilationSend(
             // continue.
             TensorShapeProto proto;
             context->ShapeHandleToProto(shape, &proto);
-            dummy_node_images[src_node] = AddDummyShapedNode(
-                src_node->output_type(src_port), proto, graph_out.get());
-            if (n == send_node) {
+            if (dummy_node_images.find(src_node) == dummy_node_images.end()) {
+              dummy_node_images[src_node] = AddDummyShapedNode(
+                  src_node->output_type(src_port), proto, graph_out->get());
+            }
+            // The final input to the send node is the dynamic key, which we
+            // don't include in the static shapes.
+            if (n == send_node &&
+                in_edge->dst_input() < static_shape_out->size()) {
               (*static_shape_out)[in_edge->dst_input()] = proto;
             }
           } else {
+            has_parent_with_unknown_shape = true;
             if (!visited[src_node->id()]) {
-              has_parent_with_unknown_shape = true;
+              if (VLOG_IS_ON(2)) {
+                TensorShapeProto proto;
+                context->ShapeHandleToProto(shape, &proto);
+              }
               stack.push_back({src_node, false});
             }
           }
@@ -1768,7 +1854,7 @@ Status Encapsulator::DoStaticShapeInferenceForOutsideCompilationSend(
           // The shapes of all the inputs to send_node are statically known. We
           // won't have to do any inference at compile time so return now: the
           // shapes were stored in static_shape_out above.
-          graphdef_out->reset();
+          graph_out->reset();
           return Status::OK();
         } else {
           // Any shape that is being processed is either the original send node
@@ -1791,9 +1877,6 @@ Status Encapsulator::DoStaticShapeInferenceForOutsideCompilationSend(
     }
   }
 
-  graphdef_out->reset(new GraphDef());
-  graph_out->ToGraphDef(graphdef_out->get());
-
   return Status::OK();
 }
 
@@ -1910,14 +1993,20 @@ Status Encapsulator::GetShapeInfoForOutsideCompilationSends(
   TF_RETURN_IF_ERROR(MakeGraphForOutsideCompilationSends(
       *graph_out, &pruned_graph, &shape_refiner, &node_images, library));
 
+  if (VLOG_IS_ON(1)) {
+    dump_graph::DumpGraphToFile("pruned_graph_for_shape_inference",
+                                *pruned_graph, library);
+  }
+
   for (auto& subgraph_entry : subgraphs_) {
+    const string& subgraph_name = subgraph_entry.first;
     Subgraph& subgraph = subgraph_entry.second;
     // Find all the recv_at_host nodes in this subgraph.
     std::vector<string> outside_compilation_names;
     subgraph.GetOutsideCompilationSubgraphNames(&outside_compilation_names);
     std::unordered_set<string> recv_at_host_names;
-    for (const auto& name : outside_compilation_names) {
-      Node* recv_node = subgraph.GetRecvAtHostNode(name);
+    for (const auto& oc_name : outside_compilation_names) {
+      Node* recv_node = subgraph.GetRecvAtHostNode(oc_name);
       if (recv_node != nullptr) {
         recv_at_host_names.insert(recv_node->name());
       }
@@ -1926,26 +2015,30 @@ Status Encapsulator::GetShapeInfoForOutsideCompilationSends(
     // without knowing the shape of the recv_at_host nodes, and store the
     // result, along with enough information to complete the job at compile time
     // once the recv_at_host shapes are known.
-    for (const auto& name : outside_compilation_names) {
-      Node* send_node = subgraph.GetSendFromHostNode(name);
+    for (const auto& oc_name : outside_compilation_names) {
+      Node* send_node = subgraph.GetSendFromHostNode(oc_name);
       std::vector<TensorShapeProto> static_shape;
-      std::unique_ptr<GraphDef> graphdef;
+      std::unique_ptr<Graph> graph;
       if (send_node != nullptr) {
         TF_RETURN_IF_ERROR(DoStaticShapeInferenceForOutsideCompilationSend(
             *pruned_graph, shape_refiner, recv_at_host_names,
-            node_images[send_node], library, &static_shape, &graphdef));
-        if (graphdef == nullptr) {
+            node_images[send_node], library, &static_shape, &graph));
+        if (graph == nullptr) {
           VLOG(2) << "Send node  " << send_node->name() << " shapes";
           for (int i = 0; i < static_shape.size(); ++i) {
             VLOG(2) << static_shape[i].DebugString();
           }
         } else {
-          VLOG(2) << "Send node " << send_node->name() << " graph\n"
-                  << graphdef->DebugString();
+          if (VLOG_IS_ON(2)) {
+            GraphDef graphdef;
+            graph->ToGraphDef(&graphdef);
+            VLOG(2) << "Send node " << send_node->name() << " graph\n"
+                    << graphdef.DebugString();
+          }
         }
       }
-      TF_RETURN_IF_ERROR(
-          subgraph.AddShapeInferenceInfo(name, static_shape, graphdef.get()));
+      TF_RETURN_IF_ERROR(subgraph.AddShapeInferenceInfo(
+          subgraph_name, oc_name, static_shape, graph.get(), library));
     }
     if (!outside_compilation_names.empty()) {
       TF_RETURN_IF_ERROR(subgraph.ReplaceFunctionDef(library));
diff --git a/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc b/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc
index aed9cae0f1799c4524da8ee309344849798755d5..94481a1fde986b705764f6f0c6de14fb28002496 100644
--- a/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc
+++ b/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc
@@ -13,12 +13,14 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include <memory>
 #include <utility>
 
 #include "tensorflow/compiler/jit/encapsulate_subgraphs_pass.h"
 
 #include "tensorflow/cc/framework/ops.h"
 #include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/compiler/jit/graph_to_functiondef.h"
 #include "tensorflow/core/framework/function_testlib.h"
 #include "tensorflow/core/graph/graph_constructor.h"
 #include "tensorflow/core/graph/graph_def_builder.h"
@@ -29,6 +31,27 @@ limitations under the License.
 namespace tensorflow {
 namespace {
 
+const char* const kXlaHostTransferSequencerAttr =
+    "_xla_host_transfer_sequencer";
+
+Status AddGraphDefToFunctionLibrary(const GraphDefBuilder& graphdef_builder,
+                                    const string& name_suffix,
+                                    FunctionDefLibrary* library) {
+  GraphDef graphdef;
+  TF_RETURN_IF_ERROR(graphdef_builder.ToGraphDef(&graphdef));
+  std::unique_ptr<Graph> graph =
+      std::unique_ptr<Graph>(new Graph(OpRegistry::Global()));
+  GraphConstructorOptions opts;
+  opts.allow_internal_ops = true;
+  TF_RETURN_IF_ERROR(ConvertGraphDefToGraph(opts, graphdef, graph.get()));
+  FunctionDef* fdef = library->add_function();
+  TF_RETURN_IF_ERROR(GraphToFunctionDef(
+      *graph,
+      strings::StrCat("_outside_compilation_shape_inference_", name_suffix),
+      fdef));
+  return Status::OK();
+}
+
 template <class Tkey, class Tvalue>
 bool EqualProtoMap(const ::tensorflow::protobuf::Map<Tkey, Tvalue>& a,
                    const ::tensorflow::protobuf::Map<Tkey, Tvalue>& b,
@@ -112,23 +135,7 @@ bool EqualFunctionNodeDef(const NodeDef& a, const NodeDef& b,
       a.attr(), b.attr(), [](const string& s) { return s; },
       [](const AttrValue& v) { return v.DebugString(); },
       [](const string& key, const AttrValue& av, const AttrValue& bv) {
-        if (key == "shape_inference_graph") {
-          // Default serialization of GraphDef is unstable because maps don't
-          // serialize deterministically. Rather than go through the hoops to
-          // turn on deterministic serialization of this attr just for this
-          // test, add logic here to compare determinstically.
-          GraphDef ga;
-          if (!ga.ParseFromString(av.s())) {
-            return false;
-          }
-          GraphDef gb;
-          if (!gb.ParseFromString(bv.s())) {
-            return false;
-          }
-          return EqualGraphDef(ga, gb, nullptr);
-        } else {
-          return av.DebugString() == bv.DebugString();
-        }
+        return av.DebugString() == bv.DebugString();
       },
       strings::StrCat(diff_preamble, " attr mismatch for node ", a.name()),
       diff);
@@ -246,26 +253,32 @@ bool EqualFunctionDefLibrary(const FunctionDefLibrary& expected,
         << diff << "\nActual: " << actual.DebugString();          \
   } while (false)
 
-// TODO(misard): remove these fake registrations once there are real Ops to be
-// compiled.
-REGISTER_OP("_XlaHostCompute")
+// These dummy Op registrations are here because the real Op registrations live
+// in contrib and there can't be a dependence from this test to contrib.
+REGISTER_OP("XlaHostCompute")
     .Input("inputs: Tinputs")
     .Output("outputs: Toutputs")
     .Attr("Tinputs: list(type) >= 0")
     .Attr("Toutputs: list(type) >= 0")
     .Attr("key: string")
+    .Attr("shape_inference_graph: string = ''")
+    .Attr("shapes: list(shape) >= 0")
     .SetShapeFn(::tensorflow::shape_inference::UnknownShape);
 
 REGISTER_OP("_XlaSendFromHost")
-    .Input("input: Tinputs")
+    .Input("inputs: Tinputs")
+    .Input("dynamic_key: string")
     .Attr("Tinputs: list(type) >= 0")
     .Attr("key: string")
+    .Attr("device_ordinal: int")
     .SetShapeFn(::tensorflow::shape_inference::UnknownShape);
 
 REGISTER_OP("_XlaRecvAtHost")
-    .Output("output: Toutputs")
+    .Input("dynamic_key: string")
+    .Output("outputs: Toutputs")
     .Attr("Toutputs: list(type) >= 0")
     .Attr("key: string")
+    .Attr("device_ordinal: int")
     .SetShapeFn(::tensorflow::shape_inference::UnknownShape);
 
 REGISTER_OP("InputTest")
@@ -315,8 +328,13 @@ REGISTER_OP("AddNLikeTest")
     .SetIsCommutative()
     .SetIsAggregate();
 
-Node* NoOp(const GraphDefBuilder::Options& opts) {
-  return ops::SourceOp("NoOp", opts);
+Node* Sequencer(const GraphDefBuilder::Options& opts,
+                const string& call_node_name) {
+  if (opts.HaveError()) return nullptr;
+  NodeBuilder node_builder(opts.GetNameForOp("NoOp"), "NoOp",
+                           opts.op_registry());
+  return opts.WithAttr(kXlaHostTransferSequencerAttr, call_node_name)
+      .FinalizeBuilder(&node_builder);
 }
 
 Node* Input(const GraphDefBuilder::Options& opts) {
@@ -327,43 +345,71 @@ Node* InputShaped(const GraphDefBuilder::Options& opts) {
   return ops::SourceOp("InputTestShaped", opts);
 }
 
-Node* KnownShape(const gtl::ArraySlice<int>& shape,
-                 const GraphDefBuilder::Options& opts) {
+Node* KnownShapeBase(DataType dtype, const gtl::ArraySlice<int>& shape,
+                     const GraphDefBuilder::Options& opts) {
   if (opts.HaveError()) return nullptr;
   NodeBuilder node_builder(opts.GetNameForOp("Const"), "Const",
                            opts.op_registry());
   TensorProto value;
-  value.set_dtype(DT_FLOAT);
+  value.set_dtype(dtype);
   for (int dim : shape) {
     value.mutable_tensor_shape()->add_dim()->set_size(dim);
   }
   return opts.WithAttr("value", value)
-      .WithAttr("dtype", DT_FLOAT)
+      .WithAttr("dtype", dtype)
       .FinalizeBuilder(&node_builder);
 }
 
-Node* RecvAtHost(const string& key, const gtl::ArraySlice<DataType>& dtypes,
+Node* KnownShape(const gtl::ArraySlice<int>& shape,
+                 const GraphDefBuilder::Options& opts) {
+  return KnownShapeBase(DT_FLOAT, shape, opts);
+}
+
+Node* KeyPlaceholderShape(const GraphDefBuilder::Options& opts) {
+  return KnownShapeBase(DT_STRING, {2}, opts);
+}
+
+Node* KeyPlaceholder(const string& call_node,
+                     const GraphDefBuilder::Options& opts) {
+  if (opts.HaveError()) return nullptr;
+  NodeBuilder node_builder(opts.GetNameForOp("Placeholder"), "Placeholder",
+                           opts.op_registry());
+  TensorShapeProto shape;
+  shape.add_dim()->set_size(2);
+  return opts.WithAttr("shape", shape)
+      .WithAttr("dtype", DT_STRING)
+      .WithAttr("_host_compute_call_node", call_node)
+      .FinalizeBuilder(&node_builder);
+}
+
+Node* RecvAtHost(ops::NodeOut key_input, const string& key,
+                 const gtl::ArraySlice<DataType>& dtypes,
                  const GraphDefBuilder::Options& opts) {
   if (opts.HaveError()) return nullptr;
   NodeBuilder node_builder(opts.GetNameForOp("_XlaRecvAtHost"),
                            "_XlaRecvAtHost", opts.op_registry());
+  node_builder.Input(std::move(key_input));
   return opts.WithAttr("Toutputs", dtypes)
       .WithAttr("key", key)
+      .WithAttr("device_ordinal", 0)
       .FinalizeBuilder(&node_builder);
 }
 
-Node* SendFromHost(const string& key, const std::vector<ops::NodeOut>& inputs,
+Node* SendFromHost(ops::NodeOut key_input, const string& key,
+                   const std::vector<ops::NodeOut>& inputs,
                    const GraphDefBuilder::Options& opts) {
   if (opts.HaveError()) return nullptr;
   NodeBuilder node_builder(opts.GetNameForOp("_XlaSendFromHost"),
                            "_XlaSendFromHost", opts.op_registry());
   node_builder.Input(inputs);
+  node_builder.Input(std::move(key_input));
   std::vector<DataType> dtypes;
   for (const auto& node : inputs) {
     dtypes.push_back(node.dt);
   }
-  return opts.WithAttr("key", key)
-      .WithAttr("Tinputs", dtypes)
+  return opts.WithAttr("Tinputs", dtypes)
+      .WithAttr("key", key)
+      .WithAttr("device_ordinal", 0)
       .FinalizeBuilder(&node_builder);
 }
 
@@ -806,19 +852,20 @@ TEST(EncapsulateSubgraphsTest, OneFunctionOneOutside) {
   FunctionDefLibrary library_expected;
   GraphDef graphdef_expected;
 
-  string shape_string_expected;
   {
     GraphDefBuilder shape(GraphDefBuilder::kFailImmediately);
+    Node* key_constant =
+        KeyPlaceholderShape(shape.opts().WithName("KnownShape/_0"));
     Node* recv =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT, DT_FLOAT},
+        RecvAtHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                   {DT_FLOAT, DT_FLOAT},
                    shape.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = Binary(ops::NodeOut(recv, 0), ops::NodeOut(recv, 1),
                      shape.opts().WithName("E"));
-    SendFromHost("host_compute_channel_F1_O1", {e},
-                 shape.opts().WithName("outside_compilation_F1_O1_send"));
-    GraphDef shape_graph;
-    TF_EXPECT_OK(shape.ToGraphDef(&shape_graph));
-    EXPECT_TRUE(shape_graph.SerializeToString(&shape_string_expected));
+    SendFromHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                 {e}, shape.opts().WithName("outside_compilation_F1_O1_send"));
+    TF_EXPECT_OK(
+        AddGraphDefToFunctionLibrary(shape, "F1_O1", &library_expected));
   }
 
   *library_expected.add_function() = test::function::XTimesTwo();
@@ -833,12 +880,13 @@ TEST(EncapsulateSubgraphsTest, OneFunctionOneOutside) {
            {},
            {"outside_compilation_O1_host_compute"}},
           {{"outside_compilation_O1_host_compute"},
-           "_XlaHostCompute",
+           "XlaHostCompute",
            {"C:o:0", "c:o:0"},
            {{"Tinputs", gtl::ArraySlice<DataType>({DT_FLOAT, DT_FLOAT})},
             {"Toutputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
             {"key", "host_compute_channel_F1_O1"},
-            {"shape_inference_graph", shape_string_expected},
+            {"shape_inference_graph",
+             "_outside_compilation_shape_inference_F1_O1"},
             {"shapes", gtl::ArraySlice<DataType>({})}},
            {"c"}},
       },
@@ -851,24 +899,30 @@ TEST(EncapsulateSubgraphsTest, OneFunctionOneOutside) {
     Node* a = Input(b2.opts().WithName("A"));
     Node* b = Input(b2.opts().WithName("B"));
 
-    NodeBuilder node_builder("F1", "F1", lib_def.get());
-    node_builder.Input(a).Input(b);
-    Node* call = b2.opts().FinalizeBuilder(&node_builder);
-
+    Node* key_constant =
+        KeyPlaceholder("F1", b2.opts().WithName("F1_key_placeholder"));
     Node* recv =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT, DT_FLOAT},
+        RecvAtHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                   {DT_FLOAT, DT_FLOAT},
                    b2.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = Binary(ops::NodeOut(recv, 0), ops::NodeOut(recv, 1),
                      b2.opts().WithName("E").WithControlInputs({recv, b}));
-    Node* send = SendFromHost("host_compute_channel_F1_O1", {e},
+    Node* send = SendFromHost(ops::NodeOut(key_constant, 0),
+                              "host_compute_channel_F1_O1", {e},
                               b2.opts()
                                   .WithName("outside_compilation_F1_O1_send")
                                   .WithControlInput(e));
 
-    Node* s = NoOp(
-        b2.opts().WithName("F1_sequencer").WithControlInputs({recv, send}));
+    Node* s = Sequencer(
+        b2.opts().WithName("F1_sequencer").WithControlInputs({recv, send}),
+        "F1");
+
+    NodeBuilder node_builder("F1", "F1", lib_def.get());
+    node_builder.Input(a).Input(b);
+    Node* call =
+        b2.opts().WithControlInputs({s}).FinalizeBuilder(&node_builder);
 
-    Binary(a, call, b2.opts().WithName("G").WithControlInputs({s, e}));
+    Binary(a, call, b2.opts().WithName("G").WithControlInputs({e}));
     TF_EXPECT_OK(b2.ToGraphDef(&graphdef_expected));
   }
 
@@ -918,38 +972,41 @@ TEST(EncapsulateSubgraphsTest, OneFunctionTwoOutside) {
   FunctionDefLibrary library_expected;
   GraphDef graphdef_expected;
 
-  string shape_string_expected_1;
   {
     GraphDefBuilder shape1(GraphDefBuilder::kFailImmediately);
+    Node* key_constant =
+        KeyPlaceholderShape(shape1.opts().WithName("KnownShape/_0"));
     Node* recv =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT, DT_FLOAT},
+        RecvAtHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                   {DT_FLOAT, DT_FLOAT},
                    shape1.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = Binary(ops::NodeOut(recv, 0), ops::NodeOut(recv, 1),
                      shape1.opts().WithName("E"));
-    SendFromHost("host_compute_channel_F1_O1", {e},
-                 shape1.opts().WithName("outside_compilation_F1_O1_send"));
-    GraphDef shape1_graph;
-    TF_EXPECT_OK(shape1.ToGraphDef(&shape1_graph));
-    EXPECT_TRUE(shape1_graph.SerializeToString(&shape_string_expected_1));
+    SendFromHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                 {e}, shape1.opts().WithName("outside_compilation_F1_O1_send"));
+    TF_EXPECT_OK(
+        AddGraphDefToFunctionLibrary(shape1, "F1_O1", &library_expected));
   }
 
-  string shape_string_expected_2;
   {
     GraphDefBuilder shape2(GraphDefBuilder::kFailImmediately);
+    Node* key_constant =
+        KeyPlaceholderShape(shape2.opts().WithName("KnownShape/_0"));
     Node* recv1 =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT, DT_FLOAT},
+        RecvAtHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                   {DT_FLOAT, DT_FLOAT},
                    shape2.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = Binary(ops::NodeOut(recv1, 0), ops::NodeOut(recv1, 1),
                      shape2.opts().WithName("E"));
     Node* recv2 =
-        RecvAtHost("host_compute_channel_F1_O2", {DT_FLOAT, DT_FLOAT},
+        RecvAtHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O2",
+                   {DT_FLOAT, DT_FLOAT},
                    shape2.opts().WithName("outside_compilation_F1_O2_recv"));
     Node* h = Binary(ops::NodeOut(recv2, 0), e, shape2.opts().WithName("H"));
-    SendFromHost("host_compute_channel_F1_O2", {h},
-                 shape2.opts().WithName("outside_compilation_F1_O2_send"));
-    GraphDef shape2_graph;
-    TF_EXPECT_OK(shape2.ToGraphDef(&shape2_graph));
-    EXPECT_TRUE(shape2_graph.SerializeToString(&shape_string_expected_2));
+    SendFromHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O2",
+                 {h}, shape2.opts().WithName("outside_compilation_F1_O2_send"));
+    TF_EXPECT_OK(
+        AddGraphDefToFunctionLibrary(shape2, "F1_O2", &library_expected));
   }
 
   *library_expected.add_function() = FunctionDefHelper::Create(
@@ -966,21 +1023,23 @@ TEST(EncapsulateSubgraphsTest, OneFunctionTwoOutside) {
            {},
            {"outside_compilation_O1_host_compute"}},
           {{"outside_compilation_O2_host_compute"},
-           "_XlaHostCompute",
+           "XlaHostCompute",
            {"D:o:0", "F:o:0"},
            {{"Tinputs", gtl::ArraySlice<DataType>({DT_FLOAT, DT_FLOAT})},
             {"Toutputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
             {"key", "host_compute_channel_F1_O2"},
-            {"shape_inference_graph", shape_string_expected_2},
+            {"shape_inference_graph",
+             "_outside_compilation_shape_inference_F1_O2"},
             {"shapes", gtl::ArraySlice<DataType>({})}},
            {"F"}},
           {{"outside_compilation_O1_host_compute"},
-           "_XlaHostCompute",
+           "XlaHostCompute",
            {"C:o:0", "D:o:0"},
            {{"Tinputs", gtl::ArraySlice<DataType>({DT_FLOAT, DT_FLOAT})},
             {"Toutputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
             {"key", "host_compute_channel_F1_O1"},
-            {"shape_inference_graph", shape_string_expected_1},
+            {"shape_inference_graph",
+             "_outside_compilation_shape_inference_F1_O1"},
             {"shapes", gtl::ArraySlice<DataType>({})}},
            {"D"}},
       },
@@ -993,35 +1052,41 @@ TEST(EncapsulateSubgraphsTest, OneFunctionTwoOutside) {
     Node* a = Input(b2.opts().WithName("A"));
     Node* b = Input(b2.opts().WithName("B"));
 
-    NodeBuilder node_builder("F1", "F1", lib_def.get());
-    node_builder.Input(a).Input(b);
-    Node* call = b2.opts().FinalizeBuilder(&node_builder);
-
+    Node* key_constant =
+        KeyPlaceholder("F1", b2.opts().WithName("F1_key_placeholder"));
     Node* recv1 =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT, DT_FLOAT},
+        RecvAtHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                   {DT_FLOAT, DT_FLOAT},
                    b2.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = Binary(ops::NodeOut(recv1, 0), ops::NodeOut(recv1, 1),
                      b2.opts().WithName("E").WithControlInputs({recv1, b}));
-    Node* send1 = SendFromHost("host_compute_channel_F1_O1", {e},
+    Node* send1 = SendFromHost(ops::NodeOut(key_constant, 0),
+                               "host_compute_channel_F1_O1", {e},
                                b2.opts()
                                    .WithName("outside_compilation_F1_O1_send")
                                    .WithControlInput(e));
 
     Node* recv2 =
-        RecvAtHost("host_compute_channel_F1_O2", {DT_FLOAT, DT_FLOAT},
+        RecvAtHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O2",
+                   {DT_FLOAT, DT_FLOAT},
                    b2.opts().WithName("outside_compilation_F1_O2_recv"));
     Node* g = Binary(e, ops::NodeOut(recv2, 1),
                      b2.opts().WithName("G").WithControlInputs({recv2, e}));
     Node* h = Binary(ops::NodeOut(recv2, 0), e, b2.opts().WithName("H"));
-    Node* send2 =
-        SendFromHost("host_compute_channel_F1_O2", {h},
-                     b2.opts().WithName("outside_compilation_F1_O2_send"));
+    Node* send2 = SendFromHost(
+        ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O2", {h},
+        b2.opts().WithName("outside_compilation_F1_O2_send"));
 
-    Node* s = NoOp(b2.opts()
-                       .WithName("F1_sequencer")
-                       .WithControlInputs({recv1, send1, recv2, send2}));
+    Node* s = Sequencer(b2.opts()
+                            .WithName("F1_sequencer")
+                            .WithControlInputs({recv1, send1, recv2, send2}),
+                        "F1");
 
-    Binary(g, call, b2.opts().WithName("J").WithControlInput(s));
+    NodeBuilder node_builder("F1", "F1", lib_def.get());
+    node_builder.Input(a).Input(b);
+    Node* call = b2.opts().WithControlInput(s).FinalizeBuilder(&node_builder);
+
+    Binary(g, call, b2.opts().WithName("J"));
     TF_EXPECT_OK(b2.ToGraphDef(&graphdef_expected));
   }
 
@@ -1070,19 +1135,20 @@ TEST(EncapsulateSubgraphsTest, TwoFunctionsTwoOutside) {
   FunctionDefLibrary library_expected;
   GraphDef graphdef_expected;
 
-  string shape_string_expected;
   {
     GraphDefBuilder shape(GraphDefBuilder::kFailImmediately);
+    Node* key_constant =
+        KeyPlaceholderShape(shape.opts().WithName("KnownShape/_0"));
     Node* recv =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT, DT_FLOAT},
+        RecvAtHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                   {DT_FLOAT, DT_FLOAT},
                    shape.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = Binary(ops::NodeOut(recv, 0), ops::NodeOut(recv, 1),
                      shape.opts().WithName("E"));
-    SendFromHost("host_compute_channel_F1_O1", {e},
-                 shape.opts().WithName("outside_compilation_F1_O1_send"));
-    GraphDef shape_graph;
-    TF_EXPECT_OK(shape.ToGraphDef(&shape_graph));
-    EXPECT_TRUE(shape_graph.SerializeToString(&shape_string_expected));
+    SendFromHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                 {e}, shape.opts().WithName("outside_compilation_F1_O1_send"));
+    TF_EXPECT_OK(
+        AddGraphDefToFunctionLibrary(shape, "F1_O1", &library_expected));
   }
 
   TensorShapeProto shape_proto_expected;
@@ -1100,12 +1166,13 @@ TEST(EncapsulateSubgraphsTest, TwoFunctionsTwoOutside) {
            {},
            {"outside_compilation_O1_host_compute"}},
           {{"outside_compilation_O1_host_compute"},
-           "_XlaHostCompute",
+           "XlaHostCompute",
            {"C:o:0", "D:o:0"},
            {{"Tinputs", gtl::ArraySlice<DataType>({DT_FLOAT, DT_FLOAT})},
             {"Toutputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
             {"key", "host_compute_channel_F1_O1"},
-            {"shape_inference_graph", shape_string_expected},
+            {"shape_inference_graph",
+             "_outside_compilation_shape_inference_F1_O1"},
             {"shapes", gtl::ArraySlice<DataType>({})}},
            {"D"}},
       },
@@ -1120,7 +1187,7 @@ TEST(EncapsulateSubgraphsTest, TwoFunctionsTwoOutside) {
            "BinaryTest",
            {"f_0_arg", "outside_compilation_O1_host_compute:outputs:0"}},
           {{"outside_compilation_O1_host_compute"},
-           "_XlaHostCompute",
+           "XlaHostCompute",
            {"G:o:0"},
            {{"Tinputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
             {"Toutputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
@@ -1138,39 +1205,47 @@ TEST(EncapsulateSubgraphsTest, TwoFunctionsTwoOutside) {
     Node* a = InputShaped(b2.opts().WithName("A"));
     Node* b = InputShaped(b2.opts().WithName("B"));
 
+    Node* key_constant1 =
+        KeyPlaceholder("F1", b2.opts().WithName("F1_key_placeholder"));
     Node* recv1 =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT, DT_FLOAT},
+        RecvAtHost(ops::NodeOut(key_constant1, 0), "host_compute_channel_F1_O1",
+                   {DT_FLOAT, DT_FLOAT},
                    b2.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = Binary(ops::NodeOut(recv1, 0), ops::NodeOut(recv1, 1),
                      b2.opts().WithName("E").WithControlInputs({recv1, b}));
-    Node* send1 = SendFromHost("host_compute_channel_F1_O1", {e},
+    Node* send1 = SendFromHost(ops::NodeOut(key_constant1, 0),
+                               "host_compute_channel_F1_O1", {e},
                                b2.opts()
                                    .WithName("outside_compilation_F1_O1_send")
                                    .WithControlInput(e));
+    Node* s1 = Sequencer(
+        b2.opts().WithName("F1_sequencer").WithControlInputs({recv1, send1}),
+        "F1");
+
     NodeBuilder node_builder1("F1", "F1", lib_def.get());
     node_builder1.Input(a).Input(b);
-    Node* call1 = b2.opts().FinalizeBuilder(&node_builder1);
-    Node* s1 = NoOp(
-        b2.opts().WithName("F1_sequencer").WithControlInputs({recv1, send1}));
-
-    Node* recv2 =
-        RecvAtHost("host_compute_channel_F2_O1", {DT_FLOAT},
-                   b2.opts().WithName("outside_compilation_F2_O1_recv"));
-    Node* h = Binary(ops::NodeOut(call1, 1), recv2,
-                     b2.opts().WithName("H").WithControlInput(s1));
-    Node* send2 =
-        SendFromHost("host_compute_channel_F2_O1", {h},
-                     b2.opts().WithName("outside_compilation_F2_O1_send"));
-
+    Node* call1 =
+        b2.opts().WithControlInput(s1).FinalizeBuilder(&node_builder1);
+
+    Node* key_constant2 =
+        KeyPlaceholder("F2", b2.opts().WithName("F2_key_placeholder"));
+    Node* recv2 = RecvAtHost(
+        ops::NodeOut(key_constant2, 0), "host_compute_channel_F2_O1",
+        {DT_FLOAT}, b2.opts().WithName("outside_compilation_F2_O1_recv"));
+    Node* h = Binary(ops::NodeOut(call1, 1), recv2, b2.opts().WithName("H"));
+    Node* send2 = SendFromHost(
+        ops::NodeOut(key_constant2, 0), "host_compute_channel_F2_O1", {h},
+        b2.opts().WithName("outside_compilation_F2_O1_send"));
+
+    Node* s2 = Sequencer(
+        b2.opts().WithName("F2_sequencer").WithControlInputs({recv2, send2}),
+        "F2");
     NodeBuilder node_builder2("F2", "F2", lib_def.get());
     node_builder2.Input(e).Input(call1);
     Node* call2 = b2.opts()
-                      .WithControlInputs({s1, e, call1})
+                      .WithControlInputs({s2, e, call1})
                       .FinalizeBuilder(&node_builder2);
-    Node* s2 = NoOp(
-        b2.opts().WithName("F2_sequencer").WithControlInputs({recv2, send2}));
-    Binary(call2, ops::NodeOut(call2, 1),
-           b2.opts().WithName("J").WithControlInput(s2));
+    Binary(call2, ops::NodeOut(call2, 1), b2.opts().WithName("J"));
     TF_EXPECT_OK(b2.ToGraphDef(&graphdef_expected));
   }
 
@@ -1218,7 +1293,7 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationNoInputs) {
            "BinaryTest",
            {"D:o:0", "outside_compilation_O1_host_compute:outputs:0"}},
           {{"outside_compilation_O1_host_compute"},
-           "_XlaHostCompute",
+           "XlaHostCompute",
            {},
            {{"Tinputs", gtl::ArraySlice<DataType>({})},
             {"Toutputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
@@ -1237,15 +1312,19 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationNoInputs) {
     Node* b = Input(b2.opts().WithName("B"));
 
     Node* e = Unary(a, b2.opts().WithName("E"));
-    Node* send1 =
-        SendFromHost("host_compute_channel_F1_O1", {e},
-                     b2.opts().WithName("outside_compilation_F1_O1_send"));
+    Node* key_constant =
+        KeyPlaceholder("F1", b2.opts().WithName("F1_key_placeholder"));
+    Node* send1 = SendFromHost(
+        ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1", {e},
+        b2.opts().WithName("outside_compilation_F1_O1_send"));
+    Node* s1 = Sequencer(
+        b2.opts().WithName("F1_sequencer").WithControlInput(send1), "F1");
     NodeBuilder node_builder1("F1", "F1", lib_def.get());
     node_builder1.Input(a).Input(b);
-    Node* call1 = b2.opts().FinalizeBuilder(&node_builder1);
-    Node* s1 = NoOp(b2.opts().WithName("F1_sequencer").WithControlInput(send1));
+    Node* call1 =
+        b2.opts().WithControlInput(s1).FinalizeBuilder(&node_builder1);
 
-    Unary(call1, b2.opts().WithName("G").WithControlInput(s1));
+    Unary(call1, b2.opts().WithName("G"));
     TF_EXPECT_OK(b2.ToGraphDef(&graphdef_expected));
   }
 
@@ -1294,7 +1373,7 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationControlInput) {
            "BinaryTest",
            {"D:o:0", "outside_compilation_O1_host_compute:outputs:0"}},
           {{"outside_compilation_O1_host_compute"},
-           "_XlaHostCompute",
+           "XlaHostCompute",
            {},
            {{"Tinputs", gtl::ArraySlice<DataType>({})},
             {"Toutputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
@@ -1313,20 +1392,24 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationControlInput) {
     Node* a = InputShaped(b2.opts().WithName("A"));
     Node* b = Input(b2.opts().WithName("B"));
 
+    Node* key_constant =
+        KeyPlaceholder("F1", b2.opts().WithName("F1_key_placeholder"));
     Node* recv1 =
-        RecvAtHost("host_compute_channel_F1_O1", {},
-                   b2.opts().WithName("outside_compilation_F1_O1_recv"));
+        RecvAtHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                   {}, b2.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = Unary(a, b2.opts().WithName("E").WithControlInput(recv1));
-    Node* send1 =
-        SendFromHost("host_compute_channel_F1_O1", {e},
-                     b2.opts().WithName("outside_compilation_F1_O1_send"));
+    Node* send1 = SendFromHost(
+        ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1", {e},
+        b2.opts().WithName("outside_compilation_F1_O1_send"));
+    Node* s1 = Sequencer(
+        b2.opts().WithName("F1_sequencer").WithControlInputs({recv1, send1}),
+        "F1");
     NodeBuilder node_builder1("F1", "F1", lib_def.get());
     node_builder1.Input(a).Input(b);
-    Node* call1 = b2.opts().FinalizeBuilder(&node_builder1);
-    Node* s1 = NoOp(
-        b2.opts().WithName("F1_sequencer").WithControlInputs({recv1, send1}));
+    Node* call1 =
+        b2.opts().WithControlInput(s1).FinalizeBuilder(&node_builder1);
 
-    Unary(call1, b2.opts().WithName("G").WithControlInput(s1));
+    Unary(call1, b2.opts().WithName("G"));
     TF_EXPECT_OK(b2.ToGraphDef(&graphdef_expected));
   }
 
@@ -1368,7 +1451,7 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationNoOutputs) {
           {{"D"}, "BinaryTest", {"b_0_arg", "C:o:0"}},
           {{"F"}, "UnaryTest", {"D:o:0"}},
           {{"outside_compilation_O1_host_compute"},
-           "_XlaHostCompute",
+           "XlaHostCompute",
            {"D:o:0"},
            {{"Tinputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
             {"Toutputs", gtl::ArraySlice<DataType>({})},
@@ -1385,16 +1468,20 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationNoOutputs) {
     Node* a = Input(b2.opts().WithName("A"));
     Node* b = Input(b2.opts().WithName("B"));
 
-    Node* recv1 =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT},
-                   b2.opts().WithName("outside_compilation_F1_O1_recv"));
+    Node* key_constant =
+        KeyPlaceholder("F1", b2.opts().WithName("F1_key_placeholder"));
+    Node* recv1 = RecvAtHost(
+        ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1", {DT_FLOAT},
+        b2.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = Unary(recv1, b2.opts().WithName("E"));
+    Node* s1 = Sequencer(
+        b2.opts().WithName("F1_sequencer").WithControlInput(recv1), "F1");
     NodeBuilder node_builder1("F1", "F1", lib_def.get());
     node_builder1.Input(a).Input(b);
-    Node* call1 = b2.opts().FinalizeBuilder(&node_builder1);
-    Node* s1 = NoOp(b2.opts().WithName("F1_sequencer").WithControlInput(recv1));
+    Node* call1 =
+        b2.opts().WithControlInput(s1).FinalizeBuilder(&node_builder1);
 
-    Binary(e, call1, b2.opts().WithName("G").WithControlInput(s1));
+    Binary(e, call1, b2.opts().WithName("G"));
     TF_EXPECT_OK(b2.ToGraphDef(&graphdef_expected));
   }
 
@@ -1441,7 +1528,7 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationControlOutput) {
            {},
            {"outside_compilation_O1_host_compute"}},
           {{"outside_compilation_O1_host_compute"},
-           "_XlaHostCompute",
+           "XlaHostCompute",
            {"D:o:0"},
            {{"Tinputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
             {"Toutputs", gtl::ArraySlice<DataType>({})},
@@ -1458,21 +1545,26 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationControlOutput) {
     Node* a = Input(b2.opts().WithName("A"));
     Node* b = Input(b2.opts().WithName("B"));
 
-    Node* recv1 =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT},
-                   b2.opts().WithName("outside_compilation_F1_O1_recv"));
+    Node* key_constant =
+        KeyPlaceholder("F1", b2.opts().WithName("F1_key_placeholder"));
+    Node* recv1 = RecvAtHost(
+        ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1", {DT_FLOAT},
+        b2.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = Unary(recv1, b2.opts().WithName("E"));
-    Node* send1 = SendFromHost("host_compute_channel_F1_O1", {},
+    Node* send1 = SendFromHost(ops::NodeOut(key_constant, 0),
+                               "host_compute_channel_F1_O1", {},
                                b2.opts()
                                    .WithName("outside_compilation_F1_O1_send")
                                    .WithControlInput(e));
+    Node* s1 = Sequencer(
+        b2.opts().WithName("F1_sequencer").WithControlInputs({recv1, send1}),
+        "F1");
     NodeBuilder node_builder1("F1", "F1", lib_def.get());
     node_builder1.Input(a).Input(b);
-    Node* call1 = b2.opts().FinalizeBuilder(&node_builder1);
-    Node* s1 = NoOp(
-        b2.opts().WithName("F1_sequencer").WithControlInputs({recv1, send1}));
+    Node* call1 =
+        b2.opts().WithControlInput(s1).FinalizeBuilder(&node_builder1);
 
-    Binary(e, call1, b2.opts().WithName("G").WithControlInput(s1));
+    Binary(e, call1, b2.opts().WithName("G"));
     TF_EXPECT_OK(b2.ToGraphDef(&graphdef_expected));
   }
 
@@ -1569,19 +1661,19 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationShapeInference) {
   FunctionDefLibrary library_expected;
   GraphDef graphdef_expected;
 
-  string shape_string_expected;
   {
     GraphDefBuilder shape(GraphDefBuilder::kFailImmediately);
-    Node* known = KnownShape({2}, shape.opts().WithName("KnownShape/_0"));
-    Node* recv =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT},
-                   shape.opts().WithName("outside_compilation_F1_O1_recv"));
+    Node* key_constant =
+        KeyPlaceholderShape(shape.opts().WithName("KnownShape/_0"));
+    Node* known = KnownShape({2}, shape.opts().WithName("KnownShape/_1"));
+    Node* recv = RecvAtHost(
+        ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1", {DT_FLOAT},
+        shape.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = BinaryUnknownShape(known, recv, shape.opts().WithName("E"));
-    SendFromHost("host_compute_channel_F1_O1", {e},
-                 shape.opts().WithName("outside_compilation_F1_O1_send"));
-    GraphDef shape_graph;
-    TF_EXPECT_OK(shape.ToGraphDef(&shape_graph));
-    EXPECT_TRUE(shape_graph.SerializeToString(&shape_string_expected));
+    SendFromHost(ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1",
+                 {e}, shape.opts().WithName("outside_compilation_F1_O1_send"));
+    TF_EXPECT_OK(
+        AddGraphDefToFunctionLibrary(shape, "F1_O1", &library_expected));
   }
 
   *library_expected.add_function() = test::function::XTimesTwo();
@@ -1595,12 +1687,13 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationShapeInference) {
            {},
            {"outside_compilation_O1_host_compute"}},
           {{"outside_compilation_O1_host_compute"},
-           "_XlaHostCompute",
+           "XlaHostCompute",
            {"c:o:0"},
            {{"Tinputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
             {"Toutputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
             {"key", "host_compute_channel_F1_O1"},
-            {"shape_inference_graph", shape_string_expected},
+            {"shape_inference_graph",
+             "_outside_compilation_shape_inference_F1_O1"},
             {"shapes", gtl::ArraySlice<DataType>({})}},
            {"c"}},
       },
@@ -1614,26 +1707,30 @@ TEST(EncapsulateSubgraphsTest, OutsideCompilationShapeInference) {
     Node* b = Input(b2.opts().WithName("B"));
     Node* c = Unary(a, b2.opts().WithName("C"));
 
-    NodeBuilder node_builder("F1", "F1", lib_def.get());
-    node_builder.Input(b).Input(c);
-    Node* call =
-        b2.opts().WithControlInputs({c}).FinalizeBuilder(&node_builder);
-
-    Node* recv =
-        RecvAtHost("host_compute_channel_F1_O1", {DT_FLOAT},
-                   b2.opts().WithName("outside_compilation_F1_O1_recv"));
+    Node* key_constant =
+        KeyPlaceholder("F1", b2.opts().WithName("F1_key_placeholder"));
+    Node* recv = RecvAtHost(
+        ops::NodeOut(key_constant, 0), "host_compute_channel_F1_O1", {DT_FLOAT},
+        b2.opts().WithName("outside_compilation_F1_O1_recv"));
     Node* e = BinaryUnknownShape(
         c, ops::NodeOut(recv, 0),
         b2.opts().WithName("E").WithControlInputs({recv, b}));
-    Node* send = SendFromHost("host_compute_channel_F1_O1", {e},
+    Node* send = SendFromHost(ops::NodeOut(key_constant, 0),
+                              "host_compute_channel_F1_O1", {e},
                               b2.opts()
                                   .WithName("outside_compilation_F1_O1_send")
                                   .WithControlInput(e));
 
-    Node* s = NoOp(
-        b2.opts().WithName("F1_sequencer").WithControlInputs({recv, send}));
+    Node* s = Sequencer(
+        b2.opts().WithName("F1_sequencer").WithControlInputs({recv, send}),
+        "F1");
+
+    NodeBuilder node_builder("F1", "F1", lib_def.get());
+    node_builder.Input(b).Input(c);
+    Node* call =
+        b2.opts().WithControlInputs({s, c}).FinalizeBuilder(&node_builder);
 
-    Binary(a, call, b2.opts().WithName("G").WithControlInputs({s, e}));
+    Binary(a, call, b2.opts().WithName("G").WithControlInputs({e}));
     TF_EXPECT_OK(b2.ToGraphDef(&graphdef_expected));
   }
 
diff --git a/tensorflow/compiler/jit/kernels/BUILD b/tensorflow/compiler/jit/kernels/BUILD
index 9bea5663319c8a25249fdc265cee0191556a7c04..616a7f8f1541d3debff97a90bd390c76c665d196 100644
--- a/tensorflow/compiler/jit/kernels/BUILD
+++ b/tensorflow/compiler/jit/kernels/BUILD
@@ -14,6 +14,7 @@ cc_library(
         "//tensorflow/compiler/jit:common",
         "//tensorflow/compiler/jit:xla_compilation_cache",
         "//tensorflow/compiler/jit:xla_device",
+        "//tensorflow/compiler/jit:xla_launch_util",
         "//tensorflow/compiler/tf2xla:common",
         "//tensorflow/compiler/tf2xla:xla_compiler",
         "//tensorflow/compiler/xla:statusor",
diff --git a/tensorflow/compiler/jit/kernels/xla_launch_op.cc b/tensorflow/compiler/jit/kernels/xla_launch_op.cc
index 6353149e4afdf739fe44dd5c76502ef5d98b8477..8a8e8bb8df1a8d0a40af054e6713616745224cc8 100644
--- a/tensorflow/compiler/jit/kernels/xla_launch_op.cc
+++ b/tensorflow/compiler/jit/kernels/xla_launch_op.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/jit/defs.h"
 #include "tensorflow/compiler/jit/xla_device.h"
+#include "tensorflow/compiler/jit/xla_launch_util.h"
 #include "tensorflow/compiler/tf2xla/shape_util.h"
 #include "tensorflow/compiler/tf2xla/xla_compiler.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
@@ -40,111 +41,6 @@ namespace gpu = perftools::gputools;
 
 namespace tensorflow {
 
-// Adapter class that wraps a Tensorflow allocator as an XLA allocator.
-// Assumes that the Tensorflow allocator permits asynchronous deallocation:
-// see comment on `AllowsAsynchronousDeallocation()`.
-class XlaAllocator : public xla::DeviceMemoryAllocator {
- public:
-  XlaAllocator(const gpu::Platform* platform, OpKernelContext* op_context);
-  ~XlaAllocator() override;
-  xla::StatusOr<gpu::DeviceMemoryBase> Allocate(int device_ordinal, uint64 size,
-                                                bool retry_on_failure) override;
-  Status Deallocate(int device_ordinal, gpu::DeviceMemoryBase* mem) override;
-
-  // Register an Tensor (input or resource variable) with the allocator. If
-  // the operation returns an alias to one of its inputs, then the allocator
-  // needs to be able to handle it.
-  Status RegisterArgument(const Tensor* t);
-
-  // Makes 'tensor' a wrapper around the data buffer at 'ptr'. The buffer is
-  // interpreted as having data type 'dtype' and shape 'shape'.
-  Status MakeTensorFromBuffer(gpu::DeviceMemoryBase buffer, DataType dtype,
-                              const TensorShape& shape, Tensor* tensor) const;
-
-  // The Tensorflow BFC allocator used on GPU allows host-side deallocation
-  // before GPU execution takes place. Tensorflow uses the ordering of the main
-  // compute stream to enforce a happens-before relationship between a memory
-  // allocation and code that reuses the same memory. If Tensorflow adds
-  // support for multiple GPU streams or allocators with different ordering
-  // requirements, this code may need to change.
-  // (This attribute has no effect on CPU.)
-  bool AllowsAsynchronousDeallocation() const override { return true; }
-
- private:
-  OpKernelContext* const op_context_;
-
-  // Map from pointer address to the owning Tensor; used by
-  // MakeTensorFromBuffer. Also used to automatically release Tensors when the
-  // allocator is freed.
-  std::unordered_map<void*, Tensor> tensors_;
-};
-
-XlaAllocator::XlaAllocator(const gpu::Platform* platform,
-                           OpKernelContext* op_context)
-    : xla::DeviceMemoryAllocator(platform), op_context_(op_context) {}
-
-XlaAllocator::~XlaAllocator() = default;
-
-xla::StatusOr<gpu::DeviceMemoryBase> XlaAllocator::Allocate(
-    int device_ordinal, uint64 size, bool retry_on_failure) {
-  AllocatorAttributes allocator_attrs;
-  allocator_attrs.set_on_host(false);
-
-  AllocationAttributes allocation_attrs;
-  allocation_attrs.no_retry_on_failure = !retry_on_failure;
-
-  Tensor t;
-  Status status = op_context_->allocate_temp(
-      DT_UINT8, TensorShape({static_cast<int64>(size)}), &t, allocator_attrs,
-      allocation_attrs);
-  if (!status.ok()) {
-    VLOG(2) << "Allocation failed " << size;
-    return status;
-  }
-  void* data =
-      reinterpret_cast<void*>(const_cast<char*>(t.tensor_data().data()));
-  tensors_[data] = t;
-  return gpu::DeviceMemoryBase(data, size);
-}
-
-Status XlaAllocator::RegisterArgument(const Tensor* t) {
-  void* data =
-      reinterpret_cast<void*>(const_cast<char*>(t->tensor_data().data()));
-  tensors_[data] = *t;
-  return Status::OK();
-}
-
-Status XlaAllocator::Deallocate(int device_ordinal,
-                                gpu::DeviceMemoryBase* mem) {
-  if (mem->opaque() != nullptr) {
-    if (tensors_.erase(mem->opaque()) == 0) {
-      return tensorflow::errors::InvalidArgument("Unknown tensor address");
-    }
-  }
-  return Status::OK();
-}
-
-Status XlaAllocator::MakeTensorFromBuffer(gpu::DeviceMemoryBase buffer,
-                                          DataType dtype,
-                                          const TensorShape& shape,
-                                          Tensor* out_tensor) const {
-  void* ptr = const_cast<void*>(buffer.opaque());
-  auto it = tensors_.find(ptr);
-  if (it == tensors_.end()) {
-    return errors::InvalidArgument("Unknown tensor address");
-  }
-  const Tensor& tensor = it->second;
-
-  int64 output_size = DataTypeSize(dtype) * shape.num_elements();
-  if (tensor.TotalBytes() == output_size) {
-    out_tensor->UnsafeCopyFromInternal(tensor, dtype, shape);
-  } else {
-    Tensor slice = tensor.Slice(0, output_size);
-    out_tensor->UnsafeCopyFromInternal(slice, dtype, shape);
-  }
-  return Status::OK();
-}
-
 XlaLocalLaunchOp::XlaLocalLaunchOp(OpKernelConstruction* ctx)
     : OpKernel(ctx), device_type_(ctx->device_type()) {
   const NameAttrList* func;
@@ -196,23 +92,6 @@ Status XlaLocalLaunchOp::BuildCompilationCache(OpKernelContext* ctx,
   return Status::OK();
 }
 
-std::vector<OptionalTensor> SnapshotResourceVariables(OpKernelContext* ctx,
-                                                      int num_variables) {
-  std::vector<OptionalTensor> snapshot(num_variables);
-  int first_variable = ctx->num_inputs() - num_variables;
-  for (int i = 0; i < num_variables; ++i) {
-    Var* variable = nullptr;
-    ResourceHandle handle = HandleFromInput(ctx, first_variable + i);
-    if (LookupResource(ctx, handle, &variable).ok()) {
-      tf_shared_lock lock(*variable->mu());
-      snapshot[i].name = handle.name();
-      snapshot[i].present = true;
-      snapshot[i].value = *variable->tensor();
-    }
-  }
-  return snapshot;
-}
-
 void XlaLocalLaunchOp::Compute(OpKernelContext* ctx) {
   VLOG(1) << "XlaLocalLaunchOp::Compute "
           << Canonicalize(function_.name(), AttrSlice(&function_.attr()));
@@ -235,16 +114,22 @@ void XlaLocalLaunchOp::Compute(OpKernelContext* ctx) {
   // this is more obviously correct.)
   core::ScopedUnref cache_ref(cache);
 
+  const XlaDevice::Metadata* metadata;
+  Status s = XlaDevice::GetMetadata(ctx, &metadata);
+
+  XlaTensorInfoManager* tensor_info_manager = nullptr;
+  if (s.ok()) {
+    tensor_info_manager = &metadata->tensor_info_manager();
+  }
+
   // Get the platform_id_ for XLA_* devices.
   if (platform_id_ == nullptr) {
-    const XlaDevice::Metadata* metadata;
-    Status s = XlaDevice::GetMetadata(ctx, &metadata);
     if (s.ok()) {
       platform_id_ = metadata->platform()->id();
     }
   }
 
-  std::vector<OptionalTensor> variables =
+  std::map<int, OptionalTensor> variables =
       SnapshotResourceVariables(ctx, num_resource_args_);
 
   xla::LocalClient* client = static_cast<xla::LocalClient*>(cache->client());
@@ -263,49 +148,19 @@ void XlaLocalLaunchOp::Compute(OpKernelContext* ctx) {
   const XlaCompiler::CompilationResult* kernel;
   xla::LocalExecutable* executable;
 
-  OP_REQUIRES_OK(ctx, cache->Compile(options, function_, num_constant_args_,
+  std::map<int, Tensor> constant_args;
+  for (int i = 0; i < num_constant_args_; ++i) {
+    constant_args.insert({i, ctx->input(i)});
+  }
+  OP_REQUIRES_OK(ctx, cache->Compile(options, function_, constant_args,
                                      variables, ctx, &kernel, &executable,
                                      /*compile_options=*/nullptr));
 
   VLOG(1) << "Executing XLA Computation...";
 
-  std::unique_ptr<xla::ShapedBuffer> output;
-  // Build xla::ShapedBuffers that point directly to the Tensor buffers.
-  std::vector<std::unique_ptr<xla::ShapedBuffer>> arg_buffers;
-  arg_buffers.reserve(kernel->xla_input_shapes.size() + 1);
-  arg_buffers.resize(kernel->xla_input_shapes.size());
-  std::vector<xla::ShapedBuffer*> arg_ptrs(arg_buffers.size());
-
-  const int first_variable_arg = ctx->num_inputs() - num_resource_args_;
-  // Pass remaining parameters.
-  const Tensor* t;
-  for (int i = 0; i < kernel->xla_input_shapes.size(); ++i) {
-    int arg_num = kernel->input_mapping[i];
-    const xla::Shape& shape = kernel->xla_input_shapes[i];
-    if (arg_num >= first_variable_arg) {
-      t = &(variables[arg_num - first_variable_arg].value);
-    } else {
-      t = &(ctx->input(arg_num));
-    }
-
-    gpu::DeviceMemoryBase dmem = gpu::DeviceMemoryBase(
-        const_cast<char*>(t->tensor_data().data()), t->tensor_data().size());
-
-    const xla::Shape on_device_shape =
-        client->backend().transfer_manager()->HostShapeToDeviceShape(shape);
-    CHECK(xla::ShapeUtil::Equal(shape, on_device_shape))
-        << "On-device shape "
-        << xla::ShapeUtil::HumanStringWithLayout(on_device_shape)
-        << " not the same as on-host shape "
-        << xla::ShapeUtil::HumanStringWithLayout(shape);
-    arg_buffers[i] = xla::MakeUnique<xla::ShapedBuffer>(
-        /*on_host_shape=*/shape, /*on_device_shape=*/shape, client->platform(),
-        client->default_device_ordinal());
-    arg_buffers[i]->set_buffer(dmem, /*index=*/{});
-    arg_ptrs[i] = arg_buffers[i].get();
-
-    OP_REQUIRES_OK(ctx, xla_allocator.RegisterArgument(t));
-  }
+  XlaComputationLaunchContext launch_context(
+      num_resource_args_, client, &xla_allocator, tensor_info_manager);
+  launch_context.PopulateInputs(ctx, kernel, variables);
 
   // Execute the computation.
   VLOG(2) << "Executing computation.";
@@ -315,93 +170,13 @@ void XlaLocalLaunchOp::Compute(OpKernelContext* ctx) {
   run_options.set_intra_op_thread_pool(&ctx->eigen_cpu_device());
   Env* env = Env::Default();
   auto start_time = env->NowMicros();
-  auto run_result = executable->Run(arg_ptrs, run_options);
+  auto run_result = executable->Run(launch_context.arguments(), run_options);
   OP_REQUIRES(ctx, run_result.ok(), run_result.status());
 
-  output = run_result.ConsumeValueOrDie()->release();
   auto elapsed = env->NowMicros() - start_time;
   VLOG(2) << "Elapsed time: " << elapsed << "us";
 
-  // Computation output should always be a tuple.
-  if (VLOG_IS_ON(2)) {
-    VLOG(2) << "Result tuple shape: " << output->on_host_shape().DebugString();
-  }
-  CHECK_EQ(ctx->num_outputs(), kernel->outputs.size());
-
-  // Copy XLA results to the OpOutputList.
-  int output_num = 0;
-  for (int i = 0; i < ctx->num_outputs(); ++i) {
-    if (kernel->outputs[i].is_constant) {
-      // Output is a constant.
-      const Tensor& const_tensor = kernel->outputs[i].constant_value;
-      const size_t total_bytes = const_tensor.TotalBytes();
-      if (stream && total_bytes > 0) {
-        // Copy host -> device. (Empty tensors don't have backing buffers.)
-        VLOG(1) << "Constant output tensor on device";
-        Tensor* output_tensor;
-        TF_CHECK_OK(
-            ctx->allocate_output(i, const_tensor.shape(), &output_tensor));
-
-        const void* src_ptr = DMAHelper::base(&const_tensor);
-        void* dst_ptr = DMAHelper::base(output_tensor);
-        gpu::DeviceMemoryBase gpu_dst_ptr(dst_ptr, total_bytes);
-        stream->ThenMemcpy(&gpu_dst_ptr, src_ptr, total_bytes);
-      } else {
-        // No copy required.
-        ctx->set_output(i, const_tensor);
-      }
-    } else {
-      const TensorShape& shape = kernel->outputs[i].shape;
-      VLOG(2) << "Retval " << i << " shape " << shape.DebugString();
-
-      gpu::DeviceMemoryBase buffer = output->buffer({output_num});
-      Tensor output_tensor;
-      // Looks up the owning Tensor by buffer address.
-      OP_REQUIRES_OK(ctx, xla_allocator.MakeTensorFromBuffer(
-                              buffer, ctx->expected_output_dtype(i), shape,
-                              &output_tensor));
-      ctx->set_output(i, output_tensor);
-      ++output_num;
-    }
-
-    if (VLOG_IS_ON(3)) {
-      VLOG(3) << ctx->mutable_output(i)->DebugString();
-    }
-  }
-
-  // Apply variable updates, if any.
-  VLOG(2) << "Applying variable updates";
-  for (int i = 0; i < kernel->resource_updates.size(); ++i) {
-    const XlaCompiler::ResourceUpdate& write = kernel->resource_updates[i];
-    OP_REQUIRES(ctx,
-                write.input_index >= 0 && write.input_index < ctx->num_inputs(),
-                errors::Internal("Invalid input index for variable write."));
-
-    gpu::DeviceMemoryBase buffer = output->buffer({output_num});
-
-    Var* variable = nullptr;
-    // TODO(b/35625933): tensorflow::Var should contain a PersistentTensor, not
-    // a Tensor.
-    OP_REQUIRES_OK(ctx, LookupOrCreateResource<Var>(
-                            ctx, HandleFromInput(ctx, write.input_index),
-                            &variable, [this, ctx, &write](Var** ptr) {
-                              *ptr = new Var(write.type);
-                              return Status::OK();
-                            }));
-
-    core::ScopedUnref s(variable);
-
-    mutex_lock ml(*variable->mu());
-    OP_REQUIRES(ctx, variable->tensor()->dtype() == write.type,
-                errors::Internal("Mismatched type in variable write"));
-
-    // Looks up the owning Tensor by buffer address.
-    OP_REQUIRES_OK(
-        ctx, xla_allocator.MakeTensorFromBuffer(buffer, write.type, write.shape,
-                                                variable->tensor()));
-    ++output_num;
-  }
-
+  launch_context.PopulateOutputs(ctx, kernel, run_result.ConsumeValueOrDie());
   VLOG(1) << "Done";
 }
 
diff --git a/tensorflow/compiler/jit/kernels/xla_launch_op.h b/tensorflow/compiler/jit/kernels/xla_launch_op.h
index 47fd912b12abbbe876e933ab57f6f586fd299909..c6cc0986af0300c51283d432c671e92a1e4d8145 100644
--- a/tensorflow/compiler/jit/kernels/xla_launch_op.h
+++ b/tensorflow/compiler/jit/kernels/xla_launch_op.h
@@ -26,14 +26,6 @@ limitations under the License.
 
 namespace tensorflow {
 
-// Takes a snapshot of the values of resource variable arguments, which are
-// the last `num_variables` arguments. We snapshot tensors that back
-// resource variables since concurrent updates may modify the shape, and it is
-// important that the shapes used for compilation match the true shapes of the
-// buffers.
-std::vector<OptionalTensor> SnapshotResourceVariables(OpKernelContext* ctx,
-                                                      int num_variables);
-
 // XlaLocalLaunchOp is used to replace a region of the TensorFlow graph
 // which will be compiled and executed using XLA.  The XlaLocalLaunchOp is
 // responsible for handling interactions with the TensorFlow executor.
diff --git a/tensorflow/compiler/jit/legacy_flags/BUILD b/tensorflow/compiler/jit/legacy_flags/BUILD
index 4491dd6ac8f2b84f341162eb469cc8194f817c9a..9cd66fc13c9e0658fdf105d5d9d92f0320ddd179 100644
--- a/tensorflow/compiler/jit/legacy_flags/BUILD
+++ b/tensorflow/compiler/jit/legacy_flags/BUILD
@@ -52,6 +52,18 @@ cc_library(
         ],
 )
 
+cc_library(
+    name = "xla_device_flags",
+    srcs = ["xla_device_flags.cc"],
+    hdrs = ["xla_device_flags.h"],
+    deps =
+        [
+            "//tensorflow/compiler/xla/legacy_flags:parse_flags_from_env",
+            "//tensorflow/core:framework_internal",
+            "//tensorflow/core:lib",
+        ],
+)
+
 # -----------------------------------------------------------------------------
 
 filegroup(
diff --git a/tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.cc b/tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.cc
index 4bc209b7ecf499d82e7567f7eff12b17cefa9863..51384ac2fe6fa70c8a723097093a0a29e7ad2c6b 100644
--- a/tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.cc
+++ b/tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.cc
@@ -40,6 +40,7 @@ static void AllocateFlags() {
   flags->tf_xla_max_cluster_size = std::numeric_limits<int32>::max();
   flags->tf_xla_clustering_debug = false;
   flags->tf_xla_cpu_global_jit = false;
+  flags->tf_xla_clustering_fuel = std::numeric_limits<int64>::max();
   flag_list = new std::vector<Flag>(
       {Flag("tf_xla_auto_jit", &flags->tf_xla_auto_jit,
             "Control compilation of operators into XLA computations on CPU and "
@@ -55,7 +56,10 @@ static void AllocateFlags() {
        Flag("tf_xla_clustering_debug", &flags->tf_xla_clustering_debug,
             "Dump graphs during XLA compilation."),
        Flag("tf_xla_cpu_global_jit", &flags->tf_xla_cpu_global_jit,
-            "Enables global JIT compilation for CPU via SessionOptions.")});
+            "Enables global JIT compilation for CPU via SessionOptions."),
+       Flag("tf_xla_clustering_fuel", &flags->tf_xla_clustering_fuel,
+            "Places an artificial limit on the number of ops marked as "
+            "eligible for clustering.")});
   xla::legacy_flags::ParseFlagsFromEnv(*flag_list);
 }
 
diff --git a/tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.h b/tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.h
index e1ccd7ddb8706ca445b6811ca1fec369af7cd5d5..170b89c987f30f985f981d7835b4af455922594e 100644
--- a/tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.h
+++ b/tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.h
@@ -48,6 +48,9 @@ typedef struct {
   bool tf_xla_clustering_debug;   // Dump graphs during XLA compilation.
   bool tf_xla_cpu_global_jit;     // Enables global JIT compilation for CPU
                                   // via SessionOptions.
+  int64 tf_xla_clustering_fuel;   // "Compiler fuel" for clustering.  Only this
+                                  // many ops will be marked as eligible for
+                                  // clustering.
 } MarkForCompilationPassFlags;
 
 // Return a pointer to the MarkForCompilationPassFlags struct;
diff --git a/tensorflow/compiler/jit/legacy_flags/xla_device_flags.cc b/tensorflow/compiler/jit/legacy_flags/xla_device_flags.cc
new file mode 100644
index 0000000000000000000000000000000000000000..1bb2fce2dbad5bffce2e33b665b7222090d0855a
--- /dev/null
+++ b/tensorflow/compiler/jit/legacy_flags/xla_device_flags.cc
@@ -0,0 +1,56 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Legacy flags for the XLA bridge's xla_device module.
+
+#include <mutex>
+#include <vector>
+
+#include "tensorflow/compiler/jit/legacy_flags/xla_device_flags.h"
+#include "tensorflow/compiler/xla/legacy_flags/parse_flags_from_env.h"
+#include "tensorflow/core/platform/types.h"
+#include "tensorflow/core/util/command_line_flags.h"
+
+namespace tensorflow {
+namespace legacy_flags {
+
+// Pointers to the parsed value of the flags and flag descriptors, initialized
+// via flags_init.
+static XlaDeviceFlags* flags;
+static std::vector<Flag>* flag_list;
+static std::once_flag flags_init;
+
+// Allocate *flags.  Called via std::call_once(flags_init, ...).
+static void AllocateFlags() {
+  flags = new XlaDeviceFlags;
+  flags->tf_xla_compile_on_demand = false;
+  flag_list = new std::vector<Flag>({
+      Flag("tf_xla_compile_on_demand", &flags->tf_xla_compile_on_demand,
+           "Switch a device into 'on-demand' mode, where, instead of "
+           "autoclustering, ops are compiled one by one just-in-time."),
+  });
+  xla::legacy_flags::ParseFlagsFromEnv(*flag_list);
+}
+
+// Return a pointer to the XlaDeviceFlags struct;
+// repeated calls return the same pointer.
+// This should be called only after Flags::Parse() has returned.
+XlaDeviceFlags* GetXlaDeviceFlags() {
+  std::call_once(flags_init, &AllocateFlags);
+  return flags;
+}
+
+}  // namespace legacy_flags
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/legacy_flags/xla_device_flags.h b/tensorflow/compiler/jit/legacy_flags/xla_device_flags.h
new file mode 100644
index 0000000000000000000000000000000000000000..27b22121ac1e089bd5d5a494e1e3fb60b05bc76d
--- /dev/null
+++ b/tensorflow/compiler/jit/legacy_flags/xla_device_flags.h
@@ -0,0 +1,47 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_JIT_LEGACY_FLAGS_XLA_DEVICE_FLAGS_H_
+#define TENSORFLOW_COMPILER_JIT_LEGACY_FLAGS_XLA_DEVICE_FLAGS_H_
+
+// Legacy flags for the XLA bridge's xla_device module.
+
+#include <vector>
+
+#include "tensorflow/core/platform/types.h"
+#include "tensorflow/core/util/command_line_flags.h"
+
+namespace tensorflow {
+namespace legacy_flags {
+
+// The values of flags associated with the XLA bridge's
+// xla_device module.
+typedef struct {
+  // Switch the CPU device into "on-demand" mode, where, instead of
+  // autoclustering, ops are compiled one by one just-in-time.
+  // Enabling this mode by a legacy flag is a temporary mechanism. When this
+  // feature is battle-tested, we will switch this to be a session option.
+  bool tf_xla_compile_on_demand;
+} XlaDeviceFlags;
+
+// Return a pointer to the XlaDeviceFlags struct;
+// repeated calls return the same pointer.
+// This should be called only after Flags::Parse() has returned.
+XlaDeviceFlags* GetXlaDeviceFlags();
+
+}  // namespace legacy_flags
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_COMPILER_JIT_LEGACY_FLAGS_XLA_DEVICE_FLAGS_H_
diff --git a/tensorflow/compiler/jit/mark_for_compilation_pass.cc b/tensorflow/compiler/jit/mark_for_compilation_pass.cc
index a0211acbbe9eec77d30c7d14293650de8826f41c..57fb8d242208318a608b1f356bef7a8d39dbdc83 100644
--- a/tensorflow/compiler/jit/mark_for_compilation_pass.cc
+++ b/tensorflow/compiler/jit/mark_for_compilation_pass.cc
@@ -174,7 +174,9 @@ bool HasResourceInputOrOutput(const Node& node) {
 }
 
 struct NodeCompare {
-  bool operator()(const Node* a, const Node* b) { return a->id() < b->id(); }
+  bool operator()(const Node* a, const Node* b) const {
+    return a->id() < b->id();
+  }
 };
 using OrderedNodeSet = std::set<Node*, NodeCompare>;
 
@@ -189,7 +191,27 @@ Status FindCompilationCandidates(
   FunctionLibraryRuntime* lib_runtime =
       pflr->GetFLR(ProcessFunctionLibraryRuntime::kDefaultFLRDevice);
 
+  int64& fuel =
+      legacy_flags::GetMarkForCompilationPassFlags()->tf_xla_clustering_fuel;
+
+  // Iterate over nodes in sorted order so that compiler fuel is deterministic.
+  // We can't simply pass op_nodes().begin() and op_nodes().end() to the
+  // std::vector constructor because they're not proper iterators, with
+  // iterator_traits defined and so on.
+  std::vector<Node*> sorted_nodes;
   for (Node* node : graph.op_nodes()) {
+    sorted_nodes.push_back(node);
+  }
+  std::sort(sorted_nodes.begin(), sorted_nodes.end(), NodeCompare());
+
+  for (Node* node : sorted_nodes) {
+    VLOG(2) << "Fuel: " << fuel;
+    if (fuel <= 0) {
+      VLOG(2)
+          << "Hit fuel limit; not marking any remaining ops as clusterable.";
+      break;
+    }
+
     VLOG(2) << "FindCompilationCandidates(): Processing "
             << node->DebugString();
 
@@ -234,7 +256,9 @@ Status FindCompilationCandidates(
       continue;
     }
     candidates->insert(node);
+    --fuel;
   }
+  VLOG(2) << "candidates->size() = " << candidates->size();
   return Status::OK();
 }
 
diff --git a/tensorflow/compiler/jit/xla_compilation_cache.cc b/tensorflow/compiler/jit/xla_compilation_cache.cc
index 6d854a920eb0b4c01b09024ceaef5035e847d392..6430975335f5eef5b53c80213e6090ffd6166a91 100644
--- a/tensorflow/compiler/jit/xla_compilation_cache.cc
+++ b/tensorflow/compiler/jit/xla_compilation_cache.cc
@@ -92,38 +92,30 @@ uint64 XlaCompilationCache::Signature::Hash::operator()(
 }
 
 Status XlaCompilationCache::BuildSignature(
-    const NameAttrList& function, int num_constant_args,
-    const std::vector<OptionalTensor>& variable_args, OpKernelContext* ctx,
+    const NameAttrList& function, const std::map<int, Tensor>& constant_args,
+    const std::map<int, OptionalTensor>& variable_args, OpKernelContext* ctx,
     Signature* signature) {
   signature->name = Canonicalize(function.name(), AttrSlice(&function.attr()));
-  signature->arg_values.resize(num_constant_args);
-
-  signature->arg_types.reserve(ctx->num_inputs() - num_constant_args);
-
-  // Inputs are in the order: constants, non-constants, resource variables.
-  int input_num = 0;
-  // Use the values of compile time constants in the signature->
-  while (input_num < num_constant_args) {
-    signature->arg_values[input_num] = ctx->input(input_num);
-    ++input_num;
-  }
-  // Add the types and shapes of the remaining arguments.
-  while (input_num < ctx->num_inputs() - variable_args.size()) {
-    signature->arg_types.emplace_back(ctx->input_dtype(input_num),
-                                      ctx->input(input_num).shape());
-    ++input_num;
-  }
-  // For variable signatures, use the type and shape of the variable's
-  // current value.
-  for (const OptionalTensor& variable : variable_args) {
-    TF_RET_CHECK(input_num < ctx->num_inputs());
-    if (variable.present) {
-      signature->arg_types.emplace_back(variable.value.dtype(),
-                                        variable.value.shape());
+  signature->arg_values.reserve(constant_args.size());
+
+  signature->arg_types.reserve(ctx->num_inputs() - constant_args.size());
+
+  for (int i = 0; i < ctx->num_inputs(); ++i) {
+    if (constant_args.count(i) > 0) {
+      // Use the values of compile time constants in the signature.
+      signature->arg_values.push_back(constant_args.at(i));
+    } else if (variable_args.count(i) > 0) {
+      const OptionalTensor& variable = variable_args.at(i);
+      if (variable.present) {
+        signature->arg_types.emplace_back(variable.value.dtype(),
+                                          variable.value.shape());
+      } else {
+        signature->arg_types.emplace_back(DT_INVALID, TensorShape());
+      }
     } else {
-      signature->arg_types.emplace_back(DT_INVALID, TensorShape());
+      signature->arg_types.emplace_back(ctx->input_dtype(i),
+                                        ctx->input(i).shape());
     }
-    ++input_num;
   }
   return Status::OK();
 }
@@ -131,74 +123,58 @@ Status XlaCompilationCache::BuildSignature(
 namespace {
 
 // Builds a XlaCompiler::Argument vector from the arguments to the _XlaLaunch
-// op. The first `num_constant_args` arguments must be host-memory Tensors.
-Status BuildArguments(int num_constant_args,
-                      const std::vector<OptionalTensor>& variable_args,
+// op.
+Status BuildArguments(const std::map<int, Tensor>& constant_args,
+                      const std::map<int, OptionalTensor>& variable_args,
                       OpKernelContext* ctx,
                       std::vector<XlaCompiler::Argument>* args) {
   args->resize(ctx->num_inputs());
 
-  int input_num = 0;
-
-  // Handles compile-time constants.
-  TF_RET_CHECK(num_constant_args <= ctx->num_inputs());
-  while (input_num < num_constant_args) {
-    const Tensor& input = ctx->input(input_num);
-    TF_RET_CHECK(input.dtype() != DT_RESOURCE);
-    XlaCompiler::Argument& arg = (*args)[input_num];
-    arg.kind = XlaCompiler::Argument::kConstant;
-    arg.type = input.dtype();
-    arg.shape = input.shape();
-    arg.constant_value = input;
-    ++input_num;
-  }
-
-  // Handles the non-constant arguments.
-  int num_variable_args = variable_args.size();
-  int num_nonconst_args =
-      ctx->num_inputs() - num_variable_args - num_constant_args;
-  TF_RET_CHECK(num_nonconst_args >= 0);
-  while (input_num < num_constant_args + num_nonconst_args) {
-    const Tensor& input = ctx->input(input_num);
-    TF_RET_CHECK(input.dtype() != DT_RESOURCE);
+  for (int64 input_num = 0; input_num < ctx->num_inputs(); ++input_num) {
     XlaCompiler::Argument& arg = (*args)[input_num];
-    if (input.NumElements() > 0) {
-      arg.kind = XlaCompiler::Argument::kParameter;
-    } else {
+    if (constant_args.count(input_num) > 0) {
+      // Handles compile-time constants.
+      const Tensor& input = constant_args.at(input_num);
+      TF_RET_CHECK(input.dtype() != DT_RESOURCE);
       arg.kind = XlaCompiler::Argument::kConstant;
+      arg.type = input.dtype();
+      arg.shape = input.shape();
       arg.constant_value = input;
-    }
-    arg.type = input.dtype();
-    arg.shape = input.shape();
-    ++input_num;
-  }
-
-  // Handles resource variables.
-  TF_RET_CHECK(input_num + num_variable_args == ctx->num_inputs());
-  for (int variable_id = 0; variable_id < num_variable_args; ++variable_id) {
-    const Tensor& input = ctx->input(input_num);
-    TF_RET_CHECK(input.dtype() == DT_RESOURCE);
-
-    XlaCompiler::Argument& arg = (*args)[input_num];
-
-    arg.name = variable_args[variable_id].name;
-    arg.kind = XlaCompiler::Argument::kResource;
-    arg.resource_kind = XlaResource::kVariable;
-    if (variable_args[variable_id].present) {
-      const Tensor& value = variable_args[variable_id].value;
-      arg.type = value.dtype();
-      arg.shape = value.shape();
-      arg.initialized = true;
+    } else if (variable_args.count(input_num) == 0) {
+      // Handles the non-constant arguments.
+      const Tensor& input = ctx->input(input_num);
+      TF_RET_CHECK(input.dtype() != DT_RESOURCE);
+      if (input.NumElements() > 0) {
+        arg.kind = XlaCompiler::Argument::kParameter;
+      } else {
+        arg.kind = XlaCompiler::Argument::kConstant;
+        arg.constant_value = input;
+      }
+      arg.type = input.dtype();
+      arg.shape = input.shape();
     } else {
-      // The values of uninitialized variables are not passed as inputs, since
-      // they are meaningless. However, it is legal to assign to a resource
-      // variable for the first time inside the XLA computation, so we do permit
-      // uninitialized variables.
-      arg.initialized = false;
-      arg.type = DT_INVALID;
-      arg.shape = TensorShape();
+      // Handles resource variables.
+      const Tensor& input = ctx->input(input_num);
+      TF_RET_CHECK(input.dtype() == DT_RESOURCE);
+      const OptionalTensor& variable = variable_args.at(input_num);
+      arg.name = variable.name;
+      arg.kind = XlaCompiler::Argument::kResource;
+      arg.resource_kind = XlaResource::kVariable;
+      if (variable.present) {
+        const Tensor& value = variable.value;
+        arg.type = value.dtype();
+        arg.shape = value.shape();
+        arg.initialized = true;
+      } else {
+        // The values of uninitialized variables are not passed as inputs, since
+        // they are meaningless. However, it is legal to assign to a resource
+        // variable for the first time inside the XLA computation, so we do
+        // permit uninitialized variables.
+        arg.initialized = false;
+        arg.type = DT_INVALID;
+        arg.shape = TensorShape();
+      }
     }
-    ++input_num;
   }
 
   return Status::OK();
@@ -233,16 +209,43 @@ Status XlaCompilationCache::BuildExecutable(
 
 Status XlaCompilationCache::Compile(
     const XlaCompiler::Options& options, const NameAttrList& function,
-    int num_constant_args, const std::vector<OptionalTensor>& variable_args,
-    OpKernelContext* ctx,
+    const std::map<int, Tensor>& constant_args,
+    const std::map<int, OptionalTensor>& variable_args, OpKernelContext* ctx,
     const XlaCompiler::CompilationResult** compilation_result,
     xla::LocalExecutable** executable,
     const XlaCompiler::CompileOptions* compile_options) {
+  return CompileImpl(options, function, constant_args, variable_args, ctx,
+                     compilation_result, executable, compile_options, false);
+}
+
+Status XlaCompilationCache::CompileSingleOp(
+    const XlaCompiler::Options& options,
+    const std::map<int, Tensor>& constant_args,
+    const std::map<int, OptionalTensor>& variable_args, OpKernelContext* ctx,
+    const XlaCompiler::CompilationResult** compilation_result,
+    xla::LocalExecutable** executable,
+    const XlaCompiler::CompileOptions* compile_options) {
+  const NodeDef& def = ctx->op_kernel().def();
+  NameAttrList name;
+  name.set_name(def.op());
+  *name.mutable_attr() = def.attr();
+  return CompileImpl(options, name, constant_args, variable_args, ctx,
+                     compilation_result, executable, compile_options, true);
+}
+
+Status XlaCompilationCache::CompileImpl(
+    const XlaCompiler::Options& options, const NameAttrList& function,
+    const std::map<int, Tensor>& constant_args,
+    const std::map<int, OptionalTensor>& variable_args, OpKernelContext* ctx,
+    const XlaCompiler::CompilationResult** compilation_result,
+    xla::LocalExecutable** executable,
+    const XlaCompiler::CompileOptions* compile_options,
+    bool compile_single_op) {
   VLOG(1) << "XlaCompilationCache::Compile " << DebugString();
 
   if (VLOG_IS_ON(2)) {
     VLOG(2) << "num_inputs=" << ctx->num_inputs()
-            << " num_constant_args=" << num_constant_args
+            << " num_constant_args=" << constant_args.size()
             << " num_variable_args=" << variable_args.size();
     for (int i = 0; i < ctx->num_inputs(); i++) {
       TensorShape shape = ctx->input(i).shape();
@@ -250,10 +253,12 @@ Status XlaCompilationCache::Compile(
               << " present=" << ctx->has_input(i)
               << " shape=" << shape.DebugString();
     }
-    for (const OptionalTensor& variable : variable_args) {
+    for (auto& iterator : variable_args) {
+      const OptionalTensor& variable = iterator.second;
       VLOG(2) << "variable present=" << variable.present
               << " type=" << DataTypeString(variable.value.dtype())
-              << " shape=" << variable.value.shape().DebugString();
+              << " shape=" << variable.value.shape().DebugString()
+              << " TF arg=" << iterator.first;
     }
     VLOG(2) << "num_outputs = " << ctx->num_outputs();
     for (int i = 0; i < ctx->num_outputs(); i++) {
@@ -261,11 +266,12 @@ Status XlaCompilationCache::Compile(
     }
   }
 
-  TF_RET_CHECK(num_constant_args + variable_args.size() <= ctx->num_inputs());
+  TF_RET_CHECK(constant_args.size() + variable_args.size() <=
+               ctx->num_inputs());
 
   Signature signature;
-  TF_RETURN_IF_ERROR(BuildSignature(function, num_constant_args, variable_args,
-                                    ctx, &signature));
+  TF_RETURN_IF_ERROR(
+      BuildSignature(function, constant_args, variable_args, ctx, &signature));
 
   VLOG(2) << "Signature: " << SignatureDebugString(signature);
   // The outer lock protects the existence of the cache entry. It does not
@@ -292,13 +298,20 @@ Status XlaCompilationCache::Compile(
     // a long time.)
     std::vector<XlaCompiler::Argument> args;
     TF_RETURN_IF_ERROR(
-        BuildArguments(num_constant_args, variable_args, ctx, &args));
+        BuildArguments(constant_args, variable_args, ctx, &args));
 
     XlaCompiler compiler(options);
     entry->compiled = true;
-    entry->compilation_status = compiler.CompileFunction(
-        compile_options ? *compile_options : XlaCompiler::CompileOptions(),
-        function, args, &entry->compilation_result);
+
+    if (compile_single_op) {
+      entry->compilation_status = compiler.CompileSingleOp(
+          compile_options ? *compile_options : XlaCompiler::CompileOptions(),
+          signature.name, ctx, args, &entry->compilation_result);
+    } else {
+      entry->compilation_status = compiler.CompileFunction(
+          compile_options ? *compile_options : XlaCompiler::CompileOptions(),
+          function, args, &entry->compilation_result);
+    }
   }
   *compilation_result = &entry->compilation_result;
   if (entry->compilation_status.ok() && executable) {
diff --git a/tensorflow/compiler/jit/xla_compilation_cache.h b/tensorflow/compiler/jit/xla_compilation_cache.h
index 0858020716fcf4763e42dc0699ad22cfda756942..5c0c79b880c474969464f23b4485734c404cef07 100644
--- a/tensorflow/compiler/jit/xla_compilation_cache.h
+++ b/tensorflow/compiler/jit/xla_compilation_cache.h
@@ -52,8 +52,8 @@ class XlaCompilationCache : public ResourceBase {
   // Compiles a function into a XlaCompiler::CompilationResult that can be used
   // to execute an XLA Computation. Compilation results are cached.
   // `function` is the name of a Tensorflow function to compile.
-  // `num_constant_args` is the number of compile-time constant arguments to
-  // `function`. `variable_args` is a snapshot of the current values of the
+  // `constant_args` is a map of TensorFlow argument number to constant value.
+  // `variable_args` is a snapshot of the current values of the
   // resource variable arguments to `function`; uninitialized variables are
   // represented by an absent OptionalTensor.
   // The result of compilation is written to `*compilation_result`, which must
@@ -62,19 +62,40 @@ class XlaCompilationCache : public ResourceBase {
   // executable pointer may be null if the computation has no non-constant
   // outputs.
   Status Compile(const XlaCompiler::Options& options,
-                 const NameAttrList& function, int num_constant_args,
-                 const std::vector<OptionalTensor>& variable_args,
+                 const NameAttrList& function,
+                 const std::map<int, Tensor>& constant_args,
+                 const std::map<int, OptionalTensor>& variable_args,
                  OpKernelContext* ctx,
                  const XlaCompiler::CompilationResult** compilation_result,
                  xla::LocalExecutable** executable,
                  const XlaCompiler::CompileOptions* compile_options);
 
+  // As above, but calls XlaCompiler::CompileSingleOp instead of
+  // XlaCompiler::CompileFunction.
+  Status CompileSingleOp(
+      const XlaCompiler::Options& options,
+      const std::map<int, Tensor>& constant_args,
+      const std::map<int, OptionalTensor>& variable_args, OpKernelContext* ctx,
+      const XlaCompiler::CompilationResult** compilation_result,
+      xla::LocalExecutable** executable,
+      const XlaCompiler::CompileOptions* compile_options);
+
   xla::LocalClient* client() const { return client_; }
   const DeviceType& device_type() const { return device_type_; }
 
   string DebugString() override;
 
  private:
+  // Common implementation of Compile and CompileSingleOp.
+  Status CompileImpl(const XlaCompiler::Options& options,
+                     const NameAttrList& function,
+                     const std::map<int, Tensor>& constant_args,
+                     const std::map<int, OptionalTensor>& variable_args,
+                     OpKernelContext* ctx,
+                     const XlaCompiler::CompilationResult** compilation_result,
+                     xla::LocalExecutable** executable,
+                     const XlaCompiler::CompileOptions* compile_options,
+                     bool compile_single_op);
   // Takes `result` which has been compiled from a Tensorflow subgraph to a
   // XLA computation already, and generates an XLA LocalExecutable `executable`.
   Status BuildExecutable(const XlaCompiler::Options& options,
@@ -104,8 +125,9 @@ class XlaCompilationCache : public ResourceBase {
   static string SignatureDebugString(const Signature& sig);
 
   // Builds the signature for a compilation.
-  Status BuildSignature(const NameAttrList& function, int num_constant_args,
-                        const std::vector<OptionalTensor>& variable_args,
+  Status BuildSignature(const NameAttrList& function,
+                        const std::map<int, Tensor>& constant_args,
+                        const std::map<int, OptionalTensor>& variable_args,
                         OpKernelContext* ctx, Signature* signature);
 
   // The value associated with a cache entry.
diff --git a/tensorflow/compiler/jit/xla_compile_on_demand_op.cc b/tensorflow/compiler/jit/xla_compile_on_demand_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..915b9ce84ab8268ef4e652351bc981aa5bf7b10c
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_compile_on_demand_op.cc
@@ -0,0 +1,178 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Defines the XlaCompileOnDemandOp.
+
+#include "tensorflow/compiler/jit/xla_compile_on_demand_op.h"
+#include "tensorflow/compiler/jit/xla_device.h"
+#include "tensorflow/compiler/jit/xla_launch_util.h"
+#include "tensorflow/compiler/tf2xla/xla_compiler.h"
+#include "tensorflow/compiler/tf2xla/xla_op_registry.h"
+
+namespace tensorflow {
+
+namespace {
+std::map<int, OptionalTensor> GetVariables(OpKernelContext* ctx) {
+  std::map<int, OptionalTensor> variables;
+  for (int64 i = 0; i < ctx->num_inputs(); ++i) {
+    if (ctx->input(i).dtype() == DT_RESOURCE) {
+      Var* variable = nullptr;
+      ResourceHandle handle = HandleFromInput(ctx, i);
+      OptionalTensor& optional = variables[i];
+      optional.name = handle.name();
+      if (LookupResource(ctx, handle, &variable).ok()) {
+        tf_shared_lock lock(*variable->mu());
+        optional.present = true;
+        optional.value = *variable->tensor();
+      }
+    }
+  }
+  return variables;
+}
+}  // namespace
+
+Status XlaCompileOnDemandOp::Run(OpKernelContext* ctx,
+                                 const XlaDevice::Metadata& metadata,
+                                 const XlaCompiler::CompilationResult* result,
+                                 xla::LocalExecutable* executable) {
+  std::map<int, OptionalTensor> variables = GetVariables(ctx);
+  int64 num_resource_args = variables.size();
+
+  xla::LocalClient* client = metadata.client();
+  XlaTensorInfoManager* tensor_info_manager = &metadata.tensor_info_manager();
+
+  // Builds an XLA allocator for the device.
+  XlaAllocator xla_allocator(client->platform(), ctx);
+  XlaComputationLaunchContext launch_context(
+      num_resource_args, client, &xla_allocator, tensor_info_manager);
+
+  launch_context.PopulateInputs(ctx, result, variables);
+
+  perftools::gputools::Stream* stream =
+      ctx->op_device_context() ? ctx->op_device_context()->stream() : nullptr;
+  TF_RET_CHECK(stream);
+
+  VLOG(2) << "Executing computation.";
+  xla::ExecutableRunOptions run_options;
+  run_options.set_stream(stream);
+  run_options.set_allocator(&xla_allocator);
+  run_options.set_intra_op_thread_pool(&ctx->eigen_cpu_device());
+
+  auto run_result = executable->Run(launch_context.arguments(), run_options);
+  TF_RETURN_IF_ERROR(run_result.status());
+
+  launch_context.PopulateOutputs(ctx, result, run_result.ConsumeValueOrDie());
+  return Status::OK();
+}
+
+bool XlaCompileOnDemandOp::MustArgumentBeConstant(const OpKernel* op_kernel,
+                                                  int64 argument_idx) {
+  // TODO(jmolloy): This could be expensive, so memoize.
+  auto* constant_inputs = tensorflow::XlaOpRegistry::CompileTimeConstantInputs(
+      op_kernel->def().op());
+  CHECK(constant_inputs);
+  std::set<int64> constant_input_indices;
+  for (const auto& name : *constant_inputs) {
+    int start, stop;
+    TF_CHECK_OK(op_kernel->InputRange(name, &start, &stop));
+    for (int i = start; i < stop; ++i) {
+      constant_input_indices.insert(i);
+    }
+  }
+  return constant_input_indices.count(argument_idx) > 0;
+}
+
+bool XlaCompileOnDemandOp::ShouldArgumentBeConstant(const OpKernel* op_kernel,
+                                                    int64 argument_idx) {
+  // Right now we only create kConstant arguments when absolutely required, but
+  // there may be benefit in eagerly constant-folding a larger subset of
+  // arguments in the future.
+  return MustArgumentBeConstant(op_kernel, argument_idx);
+}
+
+Status XlaCompileOnDemandOp::Compile(
+    OpKernelContext* ctx, const XlaDevice::Metadata& metadata,
+    const XlaCompiler::CompilationResult** result,
+    xla::LocalExecutable** executable) {
+  XlaTensorInfoManager* tensor_info_manager = &metadata.tensor_info_manager();
+
+  std::map<int, Tensor> constant_arguments;
+  for (int64 i = 0; i < ctx->num_inputs(); ++i) {
+    const Tensor& device_tensor = ctx->input(i);
+    if (const XlaTensorInfo* tensor_info =
+            tensor_info_manager->GetTensorInfo(device_tensor)) {
+      if (tensor_info->has_host_tensor() &&
+          ShouldArgumentBeConstant(&ctx->op_kernel(), i)) {
+        constant_arguments[i] = tensor_info->host_tensor();
+      }
+    }
+    if (constant_arguments.count(i) == 0 &&
+        MustArgumentBeConstant(&ctx->op_kernel(), i)) {
+      // Slow path; the argument is not available as a host constant so we must
+      // fetch it synchronously.
+      Tensor host_tensor;
+      TF_RETURN_IF_ERROR(ctx->allocate_temp(
+          device_tensor.dtype(), device_tensor.shape(), &host_tensor));
+      Notification n;
+      ctx->op_device_context()->CopyDeviceTensorToCPU(
+          &device_tensor, "ConstantArgument",
+          reinterpret_cast<Device*>(ctx->device()), &host_tensor,
+          [&](Status status) { n.Notify(); });
+      n.WaitForNotification();
+      constant_arguments[i] = host_tensor;
+    }
+  }
+
+  // We store information about the JIT-compiled XLA computation
+  // in the ResourceMgr.
+  ResourceMgr* rm = ctx->resource_manager();
+  CHECK(rm);
+
+  XlaCompilationCache* cache;
+  TF_RETURN_IF_ERROR(rm->LookupOrCreate<XlaCompilationCache>(
+      rm->default_container(), "xla_cache", &cache,
+      [&](XlaCompilationCache** cache) {
+        *cache = new XlaCompilationCache(metadata.client(),
+                                         metadata.jit_device_type());
+        return Status::OK();
+      }));
+  // Hold the reference to the JIT during evaluation. (We could probably
+  // free it sooner because the ResourceMgr will retain a reference, but
+  // this is more obviously correct.)
+  core::ScopedUnref cache_ref(cache);
+
+  XlaCompiler::Options options;
+  DeviceType device_type = metadata.jit_device_type();
+  options.device_type = &device_type;
+  options.client = metadata.client();
+  options.flib_def =
+      new FunctionLibraryDefinition(OpRegistry::Global(), FunctionDefLibrary{});
+
+  std::map<int, OptionalTensor> variable_args = GetVariables(ctx);
+  return cache->CompileSingleOp(options, constant_arguments, variable_args, ctx,
+                                result, executable,
+                                /*compile_options=*/nullptr);
+}
+
+void XlaCompileOnDemandOp::Compute(OpKernelContext* ctx) {
+  const XlaCompiler::CompilationResult* result;
+  xla::LocalExecutable* executable;
+  const XlaDevice::Metadata* metadata;
+  OP_REQUIRES_OK(ctx, XlaDevice::GetMetadata(ctx, &metadata));
+  OP_REQUIRES_OK(ctx, Compile(ctx, *metadata, &result, &executable));
+  OP_REQUIRES_OK(ctx, Run(ctx, *metadata, result, executable));
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/xla_compile_on_demand_op.h b/tensorflow/compiler/jit/xla_compile_on_demand_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..23c6f3903f841a6c39104983c6f7f409757a7319
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_compile_on_demand_op.h
@@ -0,0 +1,56 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// The XlaCompileOnDemandOp is an OpKernel that, when its Compute method is
+// called, will generate an xla::Computation and run it asynchronously.
+
+#ifndef TENSORFLOW_COMPILER_JIT_XLA_COMPILE_ON_DEMAND_OP_H_
+#define TENSORFLOW_COMPILER_JIT_XLA_COMPILE_ON_DEMAND_OP_H_
+
+#include "tensorflow/compiler/jit/xla_device.h"
+#include "tensorflow/compiler/tf2xla/xla_compiler.h"
+#include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+
+// An OpKernel that compiles an op to an XLA computation and runs it. Unlike
+// _XlaLaunch this doesn't rely on any rewrites of the graphdef - it will run a
+// vanilla TensorFlow op as long as the bridge supports it.
+//
+// Importantly _XlaLaunch assumes all input and output tensors are on the host,
+// whereas XlaCompileOnDemandOp works with tensors in device memory.
+class XlaCompileOnDemandOp : public OpKernel {
+ public:
+  explicit XlaCompileOnDemandOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+  void Compute(OpKernelContext* ctx) override;
+
+ private:
+  XlaCompiler::Argument CreateCompilerArgument(OpKernelContext* ctx, int64 i);
+  bool ShouldArgumentBeConstant(const OpKernel* op_kernel, int64 argument_idx);
+  bool MustArgumentBeConstant(const OpKernel* op_kernel, int64 argument_idx);
+  Status Compile(OpKernelContext* ctx, const XlaDevice::Metadata& metadata,
+                 const XlaCompiler::CompilationResult** result,
+                 xla::LocalExecutable** executable);
+  Status Run(OpKernelContext* ctx, const XlaDevice::Metadata& metadata,
+             const XlaCompiler::CompilationResult* result,
+             xla::LocalExecutable* executable);
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_COMPILER_JIT_XLA_COMPILE_ON_DEMAND_OP_H_
diff --git a/tensorflow/compiler/jit/xla_cpu_device.cc b/tensorflow/compiler/jit/xla_cpu_device.cc
index e238252751e677eb947f6df03e3b2f2e948ffe19..d2dfdeea68129b536477aa75f66c9d267f5a9434 100644
--- a/tensorflow/compiler/jit/xla_cpu_device.cc
+++ b/tensorflow/compiler/jit/xla_cpu_device.cc
@@ -17,6 +17,8 @@ limitations under the License.
 // operators using XLA via the XLA "Host" (CPU) backend.
 
 #include "tensorflow/compiler/jit/kernels/xla_launch_op.h"
+#include "tensorflow/compiler/jit/legacy_flags/xla_device_flags.h"
+#include "tensorflow/compiler/jit/xla_compile_on_demand_op.h"
 #include "tensorflow/compiler/jit/xla_device.h"
 #include "tensorflow/compiler/jit/xla_device_ops.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
@@ -34,14 +36,24 @@ class XlaCpuDeviceFactory : public DeviceFactory {
 Status XlaCpuDeviceFactory::CreateDevices(const SessionOptions& options,
                                           const string& name_prefix,
                                           std::vector<Device*>* devices) {
+  legacy_flags::XlaDeviceFlags* flags = legacy_flags::GetXlaDeviceFlags();
+  bool compile_on_demand = flags->tf_xla_compile_on_demand;
+
+  XlaOpRegistry::DeviceRegistration registration;
+  registration.compilation_device_name = DEVICE_CPU_XLA_JIT;
+  registration.requires_compilation = !compile_on_demand;
+  registration.enable_jit_by_default = false;
+  registration.compile_resource_ops = true;
+
   static XlaDeviceOpRegistrations* registrations =
       RegisterXlaDeviceKernels(DEVICE_XLA_CPU, DEVICE_CPU_XLA_JIT);
   (void)registrations;
 
   std::unique_ptr<XlaDevice> device;
-  TF_RETURN_IF_ERROR(XlaDevice::Create(
-      "Host", DEVICE_XLA_CPU, 0, DEVICE_CPU_XLA_JIT, options, name_prefix,
-      /*register_device_for_compilation=*/true, &device));
+  TF_RETURN_IF_ERROR(XlaDevice::Create("Host", DEVICE_XLA_CPU, 0,
+                                       DEVICE_CPU_XLA_JIT, options, name_prefix,
+                                       registration,
+                                       /*transfer_as_literal=*/false, &device));
   devices->push_back(device.release());
   return Status::OK();
 }
diff --git a/tensorflow/compiler/jit/xla_device.cc b/tensorflow/compiler/jit/xla_device.cc
index d4d8fe1c1d575b4e35d624621cc709e3a16569d5..82048f5d78957dfeaf9656d332374ba86a5e920b 100644
--- a/tensorflow/compiler/jit/xla_device.cc
+++ b/tensorflow/compiler/jit/xla_device.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include <unordered_set>
 
 #include "tensorflow/compiler/jit/defs.h"
+#include "tensorflow/compiler/jit/xla_compile_on_demand_op.h"
 #include "tensorflow/compiler/jit/xla_device_context.h"
 #include "tensorflow/compiler/jit/xla_device_ops.h"
 #include "tensorflow/compiler/tf2xla/dump_graph.h"
@@ -108,21 +109,15 @@ XlaDeviceAllocator* XlaDeviceAllocatorState::GetOrCreateXlaDeviceAllocator(
 /* static */ Status XlaDevice::Create(
     const string& platform_name, const string& device_name, int device_ordinal,
     const string& jit_device_name, const SessionOptions& options,
-    const string& name_prefix, bool register_device_for_compilation,
-    std::unique_ptr<XlaDevice>* device) {
+    const string& name_prefix,
+    const XlaOpRegistry::DeviceRegistration& registration,
+    bool transfer_as_literal, std::unique_ptr<XlaDevice>* device) {
   VLOG(1) << "XlaDevice::Create " << platform_name << " " << device_name << ":"
           << device_ordinal;
 
-  if (register_device_for_compilation) {
-    // These are no-ops if they have already been done previously for
-    // this device_name/compilation_device_name pair.
-    XlaOpRegistry::DeviceRegistration registration;
-    registration.compilation_device_name = jit_device_name;
-    registration.requires_compilation = true;
-    registration.enable_jit_by_default = false;
-    registration.compile_resource_ops = true;
-    XlaOpRegistry::RegisterCompilationDevice(device_name, registration);
-  }
+  // These are no-ops if they have already been done previously for
+  // this device_name/compilation_device_name pair.
+  XlaOpRegistry::RegisterCompilationDevice(device_name, registration);
 
   auto platform = se::MultiPlatformManager::PlatformWithName(platform_name);
   if (!platform.ok()) {
@@ -137,15 +132,17 @@ XlaDeviceAllocator* XlaDeviceAllocatorState::GetOrCreateXlaDeviceAllocator(
 
   device->reset(new XlaDevice(options, attrs, device_ordinal,
                               DeviceType(jit_device_name),
-                              platform.ValueOrDie()));
+                              platform.ValueOrDie(), transfer_as_literal));
   return Status::OK();
 }
 
-XlaDevice::Metadata::Metadata(int device_ordinal, se::Platform* platform,
-                              const DeviceType& device_type)
+XlaDevice::Metadata::Metadata(
+    int device_ordinal, se::Platform* platform, const DeviceType& device_type,
+    std::unique_ptr<XlaTensorInfoManager>* tensor_info_manager)
     : device_ordinal_(device_ordinal),
       device_type_(device_type),
-      platform_(platform) {}
+      platform_(platform),
+      tensor_info_manager_(*tensor_info_manager) {}
 
 int XlaDevice::Metadata::device_ordinal() const { return device_ordinal_; }
 
@@ -160,6 +157,10 @@ const DeviceType& XlaDevice::Metadata::jit_device_type() const {
   return device_type_;
 }
 
+XlaTensorInfoManager& XlaDevice::Metadata::tensor_info_manager() const {
+  return *tensor_info_manager_;
+}
+
 /* static */ Status XlaDevice::GetMetadata(OpKernelContext* ctx,
                                            const Metadata** metadata) {
   XlaDevice* xla_device =
@@ -177,13 +178,19 @@ const DeviceType& XlaDevice::Metadata::jit_device_type() const {
 
 XlaDevice::XlaDevice(const SessionOptions& options,
                      const DeviceAttributes& attrs, int device_ordinal,
-                     const DeviceType& jit_device_name, se::Platform* platform)
+                     const DeviceType& jit_device_name, se::Platform* platform,
+                     bool transfer_as_literal)
     : LocalDevice(options, attrs),
-      xla_metadata_(device_ordinal, platform, jit_device_name),
+      xla_metadata_(
+          device_ordinal, platform, jit_device_name,
+          // Pass tensor_info_manager_ by reference as it is initialized lazily.
+          &tensor_info_manager_),
       device_ordinal_(device_ordinal),
       jit_device_name_(jit_device_name),
       xla_allocator_(nullptr),
-      platform_(platform) {}
+      platform_(platform),
+      tensor_info_manager_(nullptr),
+      transfer_as_literal_(transfer_as_literal) {}
 
 XlaDevice::~XlaDevice() {}
 
@@ -208,6 +215,7 @@ Allocator* XlaDevice::GetAllocator(AllocatorAttributes attr) {
     xla::Backend* backend = client()->mutable_backend();
     xla_allocator_ = XlaDeviceAllocatorState::GetOrCreateXlaDeviceAllocator(
         backend, device_ordinal_);
+    tensor_info_manager_.reset(new XlaTensorInfoManager(xla_allocator_));
   }
   return xla_allocator_;
 }
@@ -225,7 +233,11 @@ Status XlaDevice::FillContextMap(const Graph* graph,
   VLOG(1) << "XlaDevice::FillContextMap";
   device_context_map->resize(graph->num_node_ids());
   TF_ASSIGN_OR_RETURN(se::Stream * stream, GetStream());
-  auto ctx = new XlaDeviceContext(stream);
+  // Call GetAllocator for the side-effect of ensuring the allocator and
+  // XlaTensorInfoManager are created.
+  (void)GetAllocator({});
+  auto ctx = new XlaDeviceContext(stream, tensor_info_manager_.get(),
+                                  transfer_as_literal_);
   for (Node* n : graph->nodes()) {
     VLOG(2) << n->id() << " : " << n->type_string() << " : " << n->name();
     ctx->Ref();
@@ -273,7 +285,8 @@ Status XlaDevice::MakeTensorFromProto(const TensorProto& tensor_proto,
     Tensor copy(GetAllocator(alloc_attrs), parsed.dtype(), parsed.shape());
     Notification n;
     TF_ASSIGN_OR_RETURN(se::Stream * stream, GetStream());
-    XlaTransferManager manager(stream);
+    XlaTransferManager manager(stream, tensor_info_manager_.get(),
+                               transfer_as_literal_);
     manager.CopyCPUTensorToDevice(&parsed, this, &copy,
                                   [&n, &status](const Status& s) {
                                     status = s;
@@ -288,19 +301,23 @@ Status XlaDevice::MakeTensorFromProto(const TensorProto& tensor_proto,
 
 XlaDeviceOpRegistrations* RegisterXlaDeviceKernels(const char* device,
                                                    const char* jit_device) {
+  // Any op assigned to the device that isn't rewritten by the graph rewriter
+  // gets executed by an XlaCompileOnDemandOp, which compiles it and executes
+  // it just-in-time.
+  kernel_factory::OpKernelRegistrar::Factory factory =
+      [](OpKernelConstruction* context) -> OpKernel* {
+    return new XlaCompileOnDemandOp(context);
+  };
   XlaOpRegistry::RegisterCompilationKernels();
   XlaDeviceOpRegistrations* registrations = new XlaDeviceOpRegistrations;
-  auto dummy_factory = [](OpKernelConstruction* context) -> OpKernel* {
-    return new XlaDeviceDummyOp(context);
-  };
   for (const KernelDef* jit_def : XlaOpRegistry::DeviceKernels(
            jit_device,
            /*include_compilation_only_kernels=*/false)) {
     KernelDef* def = new KernelDef(*jit_def);
     def->set_device_type(device);
     registrations->op_kernel_registrars.emplace_back(
-        new kernel_factory::OpKernelRegistrar(def, "XlaDeviceDummyOp",
-                                              dummy_factory));
+        new kernel_factory::OpKernelRegistrar(def, "XlaCompileOnDemandOp",
+                                              factory));
   }
   return registrations;
 }
diff --git a/tensorflow/compiler/jit/xla_device.h b/tensorflow/compiler/jit/xla_device.h
index d2ec38293c429f04f088bf3726ba97eb4e4b0dba..9cd9167e523961c0ddd99fbc9ca9bdc20b9be7b5 100644
--- a/tensorflow/compiler/jit/xla_device.h
+++ b/tensorflow/compiler/jit/xla_device.h
@@ -26,6 +26,8 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_JIT_XLA_DEVICE_H_
 #define TENSORFLOW_COMPILER_JIT_XLA_DEVICE_H_
 
+#include "tensorflow/compiler/jit/xla_tensor_info.h"
+#include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/local_device.h"
@@ -48,7 +50,8 @@ class XlaDevice : public LocalDevice {
   class Metadata {
    public:
     Metadata(int device_ordinal, perftools::gputools::Platform* platform,
-             const DeviceType& device_type);
+             const DeviceType& device_type,
+             std::unique_ptr<XlaTensorInfoManager>* tensor_info_manager);
 
     // The index of the device on this host.
     int device_ordinal() const;
@@ -56,11 +59,13 @@ class XlaDevice : public LocalDevice {
     perftools::gputools::Platform* platform() const;
     xla::LocalClient* client() const;
     const DeviceType& jit_device_type() const;
+    XlaTensorInfoManager& tensor_info_manager() const;
 
    private:
     const int device_ordinal_;
     const DeviceType device_type_;
     perftools::gputools::Platform* platform_;  // Not owned.
+    std::unique_ptr<XlaTensorInfoManager>& tensor_info_manager_;
 
     TF_DISALLOW_COPY_AND_ASSIGN(Metadata);
   };
@@ -71,15 +76,20 @@ class XlaDevice : public LocalDevice {
   // Factory function. 'platform_name' is the name of the XLA platform.
   // 'device_name' is the name of the Tensorflow device to create.
   // 'jit_device_name' is the name of the corresponding JIT device.
+  // 'transfer_as_literal' is true if device<->host transfers must be done using
+  // XLA's TransferLiteral{To,From}Device interface. If false, we can use
+  // ThenMemcpy instead.
   static Status Create(const string& platform_name, const string& device_name,
                        int device_ordinal, const string& jit_device_name,
                        const SessionOptions& options, const string& name_prefix,
-                       bool register_device_for_compilation,
+                       const XlaOpRegistry::DeviceRegistration& registration,
+                       bool transfer_as_literal,
                        std::unique_ptr<XlaDevice>* device);
 
   XlaDevice(const SessionOptions& options, const DeviceAttributes& attrs,
             int device_ordinal, const DeviceType& jit_device_name,
-            ::perftools::gputools::Platform* platform);
+            ::perftools::gputools::Platform* platform,
+            bool transfer_as_literal);
   ~XlaDevice() override;
 
   Allocator* GetAllocator(AllocatorAttributes attr) override;
@@ -104,7 +114,7 @@ class XlaDevice : public LocalDevice {
   // Which hardware device in the client's platform this XlaDevice controls.
   const int device_ordinal_;
   // The name of the device that is used to compile Ops for this XlaDevice.
-  const DeviceType& jit_device_name_;
+  DeviceType jit_device_name_;
   // Memory allocator associated with this device.
   Allocator* xla_allocator_;                   // Not owned.
   ::perftools::gputools::Platform* platform_;  // Not owned.
@@ -113,9 +123,19 @@ class XlaDevice : public LocalDevice {
   // copying back and forth between CPU and the device, and
   // computations enqueued by XLA.
   xla::Backend::StreamPtr stream_;
+  // Manages sideband data about tensors, in particular the on-device shape tree
+  // if the tensor requires multiple device buffers to represent (for example,
+  // tuple shapes).
+  // This is a unique_ptr because XlaTensorInfoManager is non-copy-constructible
+  // and we need to initialize this lazily (as we also lazily initialize the
+  // underlying allocator).
+  std::unique_ptr<XlaTensorInfoManager> tensor_info_manager_;
+  // Must we use XLA's transfer manager for correct host<->device transfers? If
+  // false, we can use ThenMemcpy() instead.
+  bool transfer_as_literal_;
 };
 
-// Builds dummy OpKernel registrations on 'device' for the JIT operators
+// Builds OpKernel registrations on 'device' for the JIT operators
 // registered on 'jit_device'. Returns ownership of a XlaDeviceOpRegistrations
 // object that encapsulates the kernel registrations.
 struct XlaDeviceOpRegistrations {
diff --git a/tensorflow/compiler/jit/xla_device_context.cc b/tensorflow/compiler/jit/xla_device_context.cc
index c936222f32056e92efced82d5adb3a96c8041a17..88f7c15f0b74a8c99935647f75352e7dec4689fc 100644
--- a/tensorflow/compiler/jit/xla_device_context.cc
+++ b/tensorflow/compiler/jit/xla_device_context.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/jit/xla_device_context.h"
 
+#include "tensorflow/compiler/jit/xla_launch_util.h"
 #include "tensorflow/compiler/tf2xla/literal_util.h"
 #include "tensorflow/compiler/tf2xla/shape_util.h"
 #include "tensorflow/compiler/xla/util.h"
@@ -52,7 +53,12 @@ void XlaDeviceAllocator::DeallocateRaw(void* ptr) {
 
 void XlaDeviceAllocator::GetStats(AllocatorStats* stats) { stats->Clear(); }
 
-XlaTransferManager::XlaTransferManager(se::Stream* stream) : stream_(stream) {}
+XlaTransferManager::XlaTransferManager(
+    se::Stream* stream, XlaTensorInfoManager* tensor_info_manager,
+    bool transfer_as_literal)
+    : stream_(stream),
+      tensor_info_manager_(tensor_info_manager),
+      transfer_as_literal_(transfer_as_literal) {}
 
 void XlaTransferManager::CopyCPUTensorToDevice(const Tensor* cpu_tensor,
                                                Device* device,
@@ -72,15 +78,25 @@ void XlaTransferManager::CopyCPUTensorToDevice(const Tensor* cpu_tensor,
     se::DeviceMemoryBase dev_dst_ptr(dst_ptr, total_bytes);
 
     Status status;
-    stream_->ThenMemcpy(&dev_dst_ptr, src_ptr, total_bytes);
-    // TODO(hpucha): Make this asynchronous.
-    Status block_status = stream_->BlockHostUntilDone();
-    if (!block_status.ok()) {
-      status = xla::InternalError(
-          "Failed to complete data transfer on stream %p: %s", stream_,
-          block_status.error_message().c_str());
+    if (transfer_as_literal_) {
+      status = xla::Unimplemented(
+          "XlaTransferManager::CopyCPUTensorToDevice not implemented for "
+          "literals");
+    } else {
+      stream_->ThenMemcpy(&dev_dst_ptr, src_ptr, total_bytes);
+      // TODO(hpucha): Make this asynchronous.
+      Status block_status = stream_->BlockHostUntilDone();
+      if (!block_status.ok()) {
+        status = xla::InternalError(
+            "Failed to complete data transfer on stream %p: %s", stream_,
+            block_status.error_message().c_str());
+      }
     }
 
+    XlaTensorInfo* tensor_info =
+        tensor_info_manager_->GetOrCreateTensorInfo(*device_tensor);
+    tensor_info->set_host_tensor(*cpu_tensor);
+
     done(status);
     return;
   }
@@ -108,13 +124,19 @@ void XlaTransferManager::CopyDeviceTensorToCPU(const Tensor* device_tensor,
     void* dst_ptr = DMAHelper::base(cpu_tensor);
 
     Status status;
-    stream_->ThenMemcpy(dst_ptr, dev_src_ptr, total_bytes);
-    // TODO(hpucha): Make this asynchronous.
-    Status block_status = stream_->BlockHostUntilDone();
-    if (!block_status.ok()) {
-      status = xla::InternalError(
-          "Failed to complete data transfer on stream %p: %s", stream_,
-          block_status.error_message().c_str());
+    if (transfer_as_literal_) {
+      status = xla::Unimplemented(
+          "XlaTransferManager::CopyDeviceTensorToCPU not implemented for "
+          "literals");
+    } else {
+      stream_->ThenMemcpy(dst_ptr, dev_src_ptr, total_bytes);
+      // TODO(hpucha): Make this asynchronous.
+      Status block_status = stream_->BlockHostUntilDone();
+      if (!block_status.ok()) {
+        status = xla::InternalError(
+            "Failed to complete data transfer on stream %p: %s", stream_,
+            block_status.error_message().c_str());
+      }
     }
 
     done(status);
@@ -125,7 +147,10 @@ void XlaTransferManager::CopyDeviceTensorToCPU(const Tensor* device_tensor,
   done(Status::OK());
 }
 
-XlaDeviceContext::XlaDeviceContext(se::Stream* stream) : manager_(stream) {}
+XlaDeviceContext::XlaDeviceContext(se::Stream* stream,
+                                   XlaTensorInfoManager* tensor_info_manager,
+                                   bool transfer_as_literal)
+    : manager_(stream, tensor_info_manager, transfer_as_literal) {}
 
 void XlaDeviceContext::CopyCPUTensorToDevice(const Tensor* cpu_tensor,
                                              Device* device,
diff --git a/tensorflow/compiler/jit/xla_device_context.h b/tensorflow/compiler/jit/xla_device_context.h
index c4edcd474e48f791af9340c3cd6e4d031407bb68..df02f4eac482f385f8864476d11c5430971f00c8 100644
--- a/tensorflow/compiler/jit/xla_device_context.h
+++ b/tensorflow/compiler/jit/xla_device_context.h
@@ -18,6 +18,7 @@ limitations under the License.
 
 #include <memory>
 
+#include "tensorflow/compiler/jit/xla_tensor_info.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/core/framework/allocator.h"
@@ -49,7 +50,9 @@ class XlaDeviceAllocator : public Allocator {
 // Helper class for managing data transfers between host and XLA devices.
 class XlaTransferManager {
  public:
-  explicit XlaTransferManager(perftools::gputools::Stream* stream);
+  explicit XlaTransferManager(perftools::gputools::Stream* stream,
+                              XlaTensorInfoManager* tensor_info_manager,
+                              bool transfer_as_literal);
 
   void CopyCPUTensorToDevice(const Tensor* cpu_tensor, Device* device,
                              Tensor* device_tensor, StatusCallback done) const;
@@ -62,6 +65,10 @@ class XlaTransferManager {
   // Stream obtained from a Device, used to transfer tensors between
   // CPU and device.
   perftools::gputools::Stream* stream_;
+  // The tensor info manager, for access to sideband information about tensors.
+  XlaTensorInfoManager* tensor_info_manager_;
+  // True if we must use XLA's TransferManager for correct device transfers.
+  bool transfer_as_literal_;
 };
 
 // DeviceContext for operators assigned to XlaDevice devices. The
@@ -69,7 +76,9 @@ class XlaTransferManager {
 // wraps the methods in XlaTransferManager.
 class XlaDeviceContext : public DeviceContext {
  public:
-  explicit XlaDeviceContext(perftools::gputools::Stream* stream);
+  explicit XlaDeviceContext(perftools::gputools::Stream* stream,
+                            XlaTensorInfoManager* tensor_info_manager,
+                            bool transfer_as_literal);
 
   void CopyCPUTensorToDevice(const Tensor* cpu_tensor, Device* device,
                              Tensor* device_tensor,
diff --git a/tensorflow/compiler/jit/xla_gpu_device.cc b/tensorflow/compiler/jit/xla_gpu_device.cc
index 2326070358d67c0cf30ef17fab5c93862cd8932c..5a1db817745f56d6bcc26ff6fc441b7c902ee2b5 100644
--- a/tensorflow/compiler/jit/xla_gpu_device.cc
+++ b/tensorflow/compiler/jit/xla_gpu_device.cc
@@ -34,14 +34,21 @@ class XlaGpuDeviceFactory : public DeviceFactory {
 Status XlaGpuDeviceFactory::CreateDevices(const SessionOptions& options,
                                           const string& name_prefix,
                                           std::vector<Device*>* devices) {
+  XlaOpRegistry::DeviceRegistration registration;
+  registration.compilation_device_name = DEVICE_GPU_XLA_JIT;
+  registration.requires_compilation = true;
+  registration.enable_jit_by_default = false;
+  registration.compile_resource_ops = true;
+
   static XlaDeviceOpRegistrations* registrations =
       RegisterXlaDeviceKernels(DEVICE_XLA_GPU, DEVICE_GPU_XLA_JIT);
   (void)registrations;
 
   std::unique_ptr<XlaDevice> device;
-  Status status = XlaDevice::Create(
-      "CUDA", DEVICE_XLA_GPU, 0, DEVICE_GPU_XLA_JIT, options, name_prefix,
-      /*register_device_for_compilation=*/true, &device);
+  Status status =
+      XlaDevice::Create("CUDA", DEVICE_XLA_GPU, 0, DEVICE_GPU_XLA_JIT, options,
+                        name_prefix, registration,
+                        /*transfer_as_literal=*/false, &device);
   if (!status.ok()) {
     // Treat failures as non-fatal; there might not be a GPU in the machine.
     VLOG(1) << "Failed to create XLA_GPU device: " << status;
diff --git a/tensorflow/compiler/jit/xla_interpreter_device.cc b/tensorflow/compiler/jit/xla_interpreter_device.cc
index a329451b14a785b17913e3838a6571b62b422804..9e098c46f422b436c722bb909dc58930ab7c0ef6 100644
--- a/tensorflow/compiler/jit/xla_interpreter_device.cc
+++ b/tensorflow/compiler/jit/xla_interpreter_device.cc
@@ -41,10 +41,17 @@ Status XlaInterpreterDeviceFactory::CreateDevices(
       DEVICE_XLA_INTERPRETER, DEVICE_INTERPRETER_XLA_JIT);
   (void)registrations;
 
+  XlaOpRegistry::DeviceRegistration registration;
+  registration.compilation_device_name = DEVICE_INTERPRETER_XLA_JIT;
+  registration.requires_compilation = true;
+  registration.enable_jit_by_default = false;
+  registration.compile_resource_ops = true;
+
   std::unique_ptr<XlaDevice> device;
-  TF_RETURN_IF_ERROR(XlaDevice::Create(
-      "Interpreter", DEVICE_XLA_INTERPRETER, 0, DEVICE_INTERPRETER_XLA_JIT,
-      options, name_prefix, /*register_device_for_compilation=*/true, &device));
+  TF_RETURN_IF_ERROR(XlaDevice::Create("Interpreter", DEVICE_XLA_INTERPRETER, 0,
+                                       DEVICE_INTERPRETER_XLA_JIT, options,
+                                       name_prefix, registration,
+                                       /*transfer_as_literal=*/false, &device));
   devices->push_back(device.release());
   return Status::OK();
 }
diff --git a/tensorflow/compiler/jit/xla_launch_util.cc b/tensorflow/compiler/jit/xla_launch_util.cc
new file mode 100644
index 0000000000000000000000000000000000000000..bb7316c60c61f8755b6cdd575676fab343f26d11
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_launch_util.cc
@@ -0,0 +1,280 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/jit/xla_launch_util.h"
+
+#include "tensorflow/compiler/jit/defs.h"
+#include "tensorflow/compiler/tf2xla/xla_compiler.h"
+#include "tensorflow/compiler/xla/client/client_library.h"
+#include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/core/common_runtime/dma_helper.h"
+#include "tensorflow/core/common_runtime/function.h"
+#include "tensorflow/core/framework/allocator.h"
+#include "tensorflow/core/framework/node_def_util.h"
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/util/stream_executor_util.h"
+
+namespace gpu = perftools::gputools;
+
+namespace tensorflow {
+
+std::map<int, OptionalTensor> SnapshotResourceVariables(OpKernelContext* ctx,
+                                                        int num_variables) {
+  std::map<int, OptionalTensor> snapshot;
+  int first_variable = ctx->num_inputs() - num_variables;
+  for (int i = 0; i < num_variables; ++i) {
+    Var* variable = nullptr;
+    ResourceHandle handle = HandleFromInput(ctx, first_variable + i);
+    OptionalTensor& tensor = snapshot[first_variable + i];
+    if (LookupResource(ctx, handle, &variable).ok()) {
+      tf_shared_lock lock(*variable->mu());
+      tensor.name = handle.name();
+      tensor.present = true;
+      tensor.value = *variable->tensor();
+    }
+  }
+  return snapshot;
+}
+
+XlaAllocator::XlaAllocator(const gpu::Platform* platform,
+                           OpKernelContext* op_context)
+    : xla::DeviceMemoryAllocator(platform), op_context_(op_context) {}
+
+XlaAllocator::~XlaAllocator() { CHECK(allocated_.empty()); }
+
+xla::StatusOr<gpu::DeviceMemoryBase> XlaAllocator::Allocate(
+    int device_ordinal, uint64 size, bool retry_on_failure) {
+  void* data = op_context_->device()->GetAllocator({})->AllocateRaw(
+      Allocator::kAllocatorAlignment, size);
+  allocated_.insert(data);
+  return gpu::DeviceMemoryBase(data, size);
+}
+
+void XlaAllocator::Release(void* ptr) { allocated_.erase(ptr); }
+
+Status XlaAllocator::Deallocate(int device_ordinal,
+                                gpu::DeviceMemoryBase* mem) {
+  if (allocated_.count(mem->opaque())) {
+    op_context_->device()->GetAllocator({})->DeallocateRaw(mem->opaque());
+    allocated_.erase(mem->opaque());
+  }
+  return Status::OK();
+}
+
+namespace {
+// Return the 'index''th subtree of the given ShapedBuffer as a ShapedBuffer.
+xla::ShapedBuffer ExtractSubShapedBuffer(const xla::ShapedBuffer& shaped_buffer,
+                                         int index) {
+  xla::Shape on_host_shape = xla::ShapeUtil::GetTupleElementShape(
+      shaped_buffer.on_host_shape(), index);
+  xla::Shape on_device_shape = xla::ShapeUtil::GetTupleElementShape(
+      shaped_buffer.on_device_shape(), index);
+
+  xla::ShapedBuffer sub_shaped_buffer(on_host_shape, on_device_shape,
+                                      shaped_buffer.platform(),
+                                      shaped_buffer.device_ordinal());
+
+  auto& shape_tree = shaped_buffer.buffers();
+  auto& sub_shape_tree = sub_shaped_buffer.buffers();
+  sub_shape_tree.CopySubtreeFrom(shape_tree,
+                                 /*source_base_index=*/{index},
+                                 /*target_base_index=*/{});
+  return sub_shaped_buffer;
+}
+}  // namespace
+
+XlaComputationLaunchContext::XlaComputationLaunchContext(
+    int64 num_resource_args, xla::LocalClient* client,
+    XlaAllocator* xla_allocator, XlaTensorInfoManager* tensor_info_manager)
+    : num_resource_args_(num_resource_args),
+      client_(client),
+      xla_allocator_(xla_allocator),
+      tensor_info_manager_(tensor_info_manager) {}
+
+void XlaComputationLaunchContext::PopulateInputs(
+    OpKernelContext* ctx, const XlaCompiler::CompilationResult* kernel,
+    const std::map<int, OptionalTensor>& variables) {
+  // Build xla::ShapedBuffers that point directly to the Tensor buffers.
+  arg_buffers_.reserve(kernel->xla_input_shapes.size() + 1);
+  arg_buffers_.resize(kernel->xla_input_shapes.size());
+  arg_ptrs_ = std::vector<xla::ShapedBuffer*>(arg_buffers_.size());
+
+  // Pass each input: variables from the snapshot, all others from `ctx`.
+  const Tensor* t;
+  for (int i = 0; i < kernel->xla_input_shapes.size(); ++i) {
+    int arg_num = kernel->input_mapping[i];
+    const xla::Shape& shape = kernel->xla_input_shapes[i];
+    if (variables.count(arg_num)) {
+      t = &(variables.at(arg_num).value);
+      CHECK(t);
+    } else {
+      t = &(ctx->input(arg_num));
+    }
+
+    const xla::Shape on_device_shape =
+        client_->backend().transfer_manager()->HostShapeToDeviceShape(shape);
+    if (xla::ShapeUtil::IsTuple(on_device_shape)) {
+      CHECK(tensor_info_manager_);
+      const XlaTensorInfo* tensor_info =
+          tensor_info_manager_->GetTensorInfo(*t);
+      CHECK(tensor_info && tensor_info->has_shaped_buffer());
+      arg_ptrs_[i] =
+          const_cast<xla::ShapedBuffer*>(&tensor_info->shaped_buffer());
+    } else {
+      CHECK(xla::ShapeUtil::Equal(shape, on_device_shape))
+          << "On-device shape "
+          << xla::ShapeUtil::HumanStringWithLayout(on_device_shape)
+          << " not the same as on-host shape "
+          << xla::ShapeUtil::HumanStringWithLayout(shape);
+      gpu::DeviceMemoryBase dmem = gpu::DeviceMemoryBase(
+          const_cast<char*>(t->tensor_data().data()), t->tensor_data().size());
+      arg_buffers_[i] = xla::MakeUnique<xla::ShapedBuffer>(
+          /*on_host_shape=*/shape, /*on_device_shape=*/shape,
+          client_->platform(), client_->default_device_ordinal());
+      arg_buffers_[i]->set_buffer(dmem, /*index=*/{});
+      arg_ptrs_[i] = arg_buffers_[i].get();
+    }
+  }
+}
+
+void XlaComputationLaunchContext::PopulateOutputs(
+    OpKernelContext* ctx, const XlaCompiler::CompilationResult* kernel,
+    std::unique_ptr<xla::ScopedShapedBuffer> output) {
+  gpu::Stream* stream =
+      ctx->op_device_context() ? ctx->op_device_context()->stream() : nullptr;
+
+  // Computation output should always be a tuple.
+  if (VLOG_IS_ON(2)) {
+    VLOG(2) << "Result tuple shape: " << output->on_host_shape().DebugString();
+  }
+  CHECK_EQ(ctx->num_outputs(), kernel->outputs.size());
+
+  // Copy XLA results to the OpOutputList.
+  int output_num = 0;
+  for (int i = 0; i < ctx->num_outputs(); ++i) {
+    AllocatorAttributes alloc_attrs = ctx->output_alloc_attr(i);
+    Allocator* allocator = ctx->device()->GetAllocator({});
+    if (tensor_info_manager_ && !alloc_attrs.on_host()) {
+      allocator = tensor_info_manager_;
+    }
+    if (kernel->outputs[i].is_constant) {
+      // Output is a constant.
+      const Tensor& const_tensor = kernel->outputs[i].constant_value;
+      Tensor* output_tensor;
+      const size_t total_bytes = const_tensor.TotalBytes();
+      if (stream && total_bytes > 0) {
+        // Copy host -> device. (Empty tensors don't have backing buffers.)
+        VLOG(1) << "Constant output tensor on device";
+
+        TF_CHECK_OK(
+            ctx->allocate_output(i, const_tensor.shape(), &output_tensor));
+
+        const void* src_ptr = DMAHelper::base(&const_tensor);
+        void* dst_ptr = DMAHelper::base(output_tensor);
+        gpu::DeviceMemoryBase gpu_dst_ptr(dst_ptr, total_bytes);
+        // Memcpying asynchronously is safe for the GPU, but the CPU uses a
+        // shared allocator so hold a reference to the copied-to buffer until
+        // complete.
+        TensorReference ref(*output_tensor);
+        stream->ThenMemcpy(&gpu_dst_ptr, src_ptr, total_bytes);
+        stream->ThenDoHostCallback([ref] { ref.Unref(); });
+      } else {
+        // No copy required.
+        ctx->set_output(i, const_tensor);
+        output_tensor = ctx->mutable_output(i);
+      }
+      if (tensor_info_manager_) {
+        XlaTensorInfo* tensor_info =
+            tensor_info_manager_->GetOrCreateTensorInfo(*output_tensor);
+        tensor_info->set_host_tensor(const_tensor);
+      }
+    } else {
+      const TensorShape& shape = kernel->outputs[i].shape;
+      VLOG(2) << "Retval " << i << " shape " << shape.DebugString();
+
+      gpu::DeviceMemoryBase buffer = output->buffer({output_num});
+      Tensor output_tensor = XlaTensorBuffer::MakeTensor(
+          ctx->expected_output_dtype(i), shape, buffer, allocator);
+      xla_allocator_->Release(buffer.opaque());
+
+      xla::Shape output_shape = xla::ShapeUtil::GetTupleElementShape(
+          output->on_device_shape(), output_num);
+      if (xla::ShapeUtil::IsTuple(output_shape)) {
+        CHECK(tensor_info_manager_);
+        XlaTensorInfo* tensor_info =
+            tensor_info_manager_->GetOrCreateTensorInfo(output_tensor);
+        tensor_info->set_shaped_buffer(
+            ExtractSubShapedBuffer(*output, output_num));
+      }
+      ctx->set_output(i, output_tensor);
+      ++output_num;
+    }
+
+    if (VLOG_IS_ON(3)) {
+      VLOG(3) << ctx->mutable_output(i)->DebugString();
+    }
+  }
+
+  // Apply variable updates, if any.
+  VLOG(2) << "Applying variable updates";
+  for (int i = 0; i < kernel->resource_updates.size(); ++i) {
+    Allocator* allocator = ctx->device()->GetAllocator({});
+    if (tensor_info_manager_) {
+      allocator = tensor_info_manager_;
+    }
+    const XlaCompiler::ResourceUpdate& write = kernel->resource_updates[i];
+    OP_REQUIRES(ctx,
+                write.input_index >= 0 && write.input_index < ctx->num_inputs(),
+                errors::Internal("Invalid input index for variable write."));
+
+    gpu::DeviceMemoryBase buffer = output->buffer({output_num});
+
+    Var* variable = nullptr;
+    // TODO(b/35625933): tensorflow::Var should contain a PersistentTensor,
+    // not a Tensor.
+    OP_REQUIRES_OK(ctx, LookupOrCreateResource<Var>(
+                            ctx, HandleFromInput(ctx, write.input_index),
+                            &variable, [&write](Var** ptr) {
+                              *ptr = new Var(write.type);
+                              return Status::OK();
+                            }));
+
+    core::ScopedUnref s(variable);
+
+    mutex_lock ml(*variable->mu());
+    OP_REQUIRES(ctx, variable->tensor()->dtype() == write.type,
+                errors::Internal("Mismatched type in variable write"));
+    *variable->tensor() =
+        XlaTensorBuffer::MakeTensor(write.type, write.shape, buffer, allocator);
+    xla_allocator_->Release(buffer.opaque());
+
+    xla::Shape output_shape = xla::ShapeUtil::GetTupleElementShape(
+        output->on_device_shape(), output_num);
+    if (xla::ShapeUtil::IsTuple(output_shape)) {
+      CHECK(tensor_info_manager_);
+      XlaTensorInfo* tensor_info =
+          tensor_info_manager_->GetOrCreateTensorInfo(*variable->tensor());
+      tensor_info->set_shaped_buffer(
+          ExtractSubShapedBuffer(*output, output_num));
+    }
+    ++output_num;
+  }
+}
+
+}  // namespace tensorflow
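Review note on `PopulateOutputs` above: `output_num` only advances for non-constant outputs, and the resource updates then continue from wherever the outputs stopped. The indexing can be sketched in isolation as follows. This is a minimal illustration, not code from the patch; `TupleLayout` and `ComputeTupleLayout` are hypothetical names, and `-1` is an assumed marker for "constant, no tuple slot".

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper mirroring how PopulateOutputs walks the XLA result
// tuple: constants take no tuple slot, and resource updates follow the
// non-constant outputs.
struct TupleLayout {
  std::vector<int> output_index;  // Tuple index per TF output, -1 if constant.
  std::vector<int> update_index;  // Tuple index per resource update.
};

TupleLayout ComputeTupleLayout(const std::vector<bool>& output_is_constant,
                               int num_resource_updates) {
  TupleLayout layout;
  int next = 0;  // Plays the role of `output_num` in the patch.
  for (bool is_constant : output_is_constant) {
    layout.output_index.push_back(is_constant ? -1 : next++);
  }
  for (int i = 0; i < num_resource_updates; ++i) {
    layout.update_index.push_back(next++);
  }
  return layout;
}
```

For three outputs where the middle one is constant, plus two variable updates, the outputs map to tuple elements 0 and 1 and the updates to elements 2 and 3.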
diff --git a/tensorflow/compiler/jit/xla_launch_util.h b/tensorflow/compiler/jit/xla_launch_util.h
new file mode 100644
index 0000000000000000000000000000000000000000..8694f6ce58b72ca188bf831528db30daf93b905d
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_launch_util.h
@@ -0,0 +1,149 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Contains utilities for launching compiled XLA kernels for an OpKernelContext.
+
+#ifndef TENSORFLOW_COMPILER_JIT_XLA_LAUNCH_UTIL_H_
+#define TENSORFLOW_COMPILER_JIT_XLA_LAUNCH_UTIL_H_
+
+#include "tensorflow/compiler/jit/xla_compilation_cache.h"
+#include "tensorflow/compiler/jit/xla_tensor_info.h"
+#include "tensorflow/compiler/tf2xla/xla_compiler.h"
+#include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/core/framework/allocation_description.pb.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/kernels/variable_ops.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+class XlaAllocator;
+
+// Takes a snapshot of the values of resource variable arguments, which are
+// the last `num_variables` arguments. We snapshot tensors that back
+// resource variables since concurrent updates may modify the shape, and it is
+// important that the shapes used for compilation match the true shapes of the
+// buffers.
+//
+// Returns a map of TensorFlow argument index to resource variable.
+std::map<int, OptionalTensor> SnapshotResourceVariables(OpKernelContext* ctx,
+                                                        int num_variables);
+
+// Adapter class that wraps a TensorFlow allocator as an XLA allocator.
+// Assumes that the TensorFlow allocator permits asynchronous deallocation:
+// see comment on `AllowsAsynchronousDeallocation()`.
+class XlaAllocator : public xla::DeviceMemoryAllocator {
+ public:
+  XlaAllocator(const perftools::gputools::Platform* platform,
+               OpKernelContext* op_context);
+  ~XlaAllocator() override;
+  xla::StatusOr<perftools::gputools::DeviceMemoryBase> Allocate(
+      int device_ordinal, uint64 size, bool retry_on_failure) override;
+  Status Deallocate(int device_ordinal,
+                    perftools::gputools::DeviceMemoryBase* mem) override;
+
+  // Un-track 'ptr' - do not delete it on destruction.
+  void Release(void* ptr);
+
+  // The TensorFlow BFC allocator used on GPU allows host-side deallocation
+  // before GPU execution takes place. TensorFlow uses the ordering of the main
+  // compute stream to enforce a happens-before relationship between a memory
+  // allocation and code that reuses the same memory. If TensorFlow adds
+  // support for multiple GPU streams or allocators with different ordering
+  // requirements, this code may need to change.
+  // (This attribute has no effect on CPU.)
+  bool AllowsAsynchronousDeallocation() const override { return true; }
+
+ private:
+  OpKernelContext* const op_context_;
+  std::unordered_set<void*> allocated_;
+};
+
+// Helper class to perform the marshalling of TensorFlow inputs and outputs to
+// ShapedBuffers suitable for passing to an XLA computation.
+class XlaComputationLaunchContext {
+ public:
+  XlaComputationLaunchContext(int64 num_resource_args, xla::LocalClient* client,
+                              XlaAllocator* xla_allocator,
+                              XlaTensorInfoManager* tensor_info_manager);
+
+  // Add all inputs within `ctx` as XLA arguments (returned by arguments()).
+  // `variables` is a map from TensorFlow argument number to resource variable.
+  void PopulateInputs(OpKernelContext* ctx,
+                      const XlaCompiler::CompilationResult* kernel,
+                      const std::map<int, OptionalTensor>& variables);
+
+  // Given the XLA output in `output`, populate all outputs of `ctx`.
+  void PopulateOutputs(OpKernelContext* ctx,
+                       const XlaCompiler::CompilationResult* kernel,
+                       std::unique_ptr<xla::ScopedShapedBuffer> output);
+
+  // Return the argument list. Only valid after PopulateInputs() has been
+  // called.
+  const std::vector<xla::ShapedBuffer*>& arguments() const { return arg_ptrs_; }
+
+ private:
+  int64 num_resource_args_;
+  xla::LocalClient* client_;
+  XlaAllocator* xla_allocator_;
+  XlaTensorInfoManager* tensor_info_manager_;
+  std::vector<std::unique_ptr<xla::ShapedBuffer>> arg_buffers_;
+  std::vector<xla::ShapedBuffer*> arg_ptrs_;
+};
+
+// A simple TensorBuffer implementation that allows us to create Tensors that
+// take ownership of pre-allocated memory.
+class XlaTensorBuffer : public TensorBuffer {
+ public:
+  XlaTensorBuffer(const void* ptr, size_t expected_size, size_t actual_size,
+                  Allocator* allocator)
+      : expected_size_(expected_size),
+        actual_size_(actual_size),
+        allocator_(allocator) {
+    data_ = const_cast<void*>(ptr);
+  }
+
+  ~XlaTensorBuffer() override { allocator_->DeallocateRaw(data_); }
+
+  void* data() const override { return data_; }
+  size_t size() const override { return expected_size_; }
+
+  TensorBuffer* root_buffer() override { return this; }
+
+  void FillAllocationDescription(AllocationDescription* proto) const override {
+    proto->set_allocated_bytes(actual_size_);
+  }
+
+  static Tensor MakeTensor(DataType dtype, const TensorShape& shape,
+                           perftools::gputools::DeviceMemoryBase buffer,
+                           Allocator* allocator) {
+    size_t expected_size = shape.num_elements() * DataTypeSize(dtype);
+    auto* tensor_buffer = new XlaTensorBuffer(buffer.opaque(), expected_size,
+                                              buffer.size(), allocator);
+    Tensor t(dtype, shape, tensor_buffer);
+    tensor_buffer->Unref();
+    return t;
+  }
+
+ private:
+  void* data_;
+  size_t expected_size_;
+  size_t actual_size_;
+  Allocator* allocator_;
+};
+
+}  // namespace tensorflow
+
+#endif
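Review note on `XlaAllocator` above: the `allocated_` set together with `Release()` is an ownership hand-off. The allocator tracks every buffer it hands out, and a buffer that has been wrapped in a Tensor is un-tracked so that a later `Deallocate` leaves it alone. A minimal sketch of that pattern, assuming plain `malloc`/`free` in place of the device allocator, is below. `TrackingAllocator` is a hypothetical name, and unlike the patch (which CHECKs that nothing is still tracked at destruction) this sketch frees leftovers in its destructor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_set>

// Hypothetical stand-in for XlaAllocator: tracks every pointer it hands out
// until the caller either deallocates it or takes over ownership.
class TrackingAllocator {
 public:
  ~TrackingAllocator() {
    // Assumption of this sketch: free anything still tracked instead of
    // aborting as the CHECK in the patch would.
    for (void* p : allocated_) std::free(p);
  }

  void* Allocate(std::size_t size) {
    void* data = std::malloc(size);
    if (data != nullptr) allocated_.insert(data);
    return data;
  }

  // Stop tracking `ptr`; the caller now owns the memory.
  void Release(void* ptr) { allocated_.erase(ptr); }

  // Frees `ptr` only if it is still tracked. Returns true if it was freed.
  bool Deallocate(void* ptr) {
    if (allocated_.erase(ptr) == 0) return false;
    std::free(ptr);
    return true;
  }

  std::size_t tracked() const { return allocated_.size(); }

 private:
  std::unordered_set<void*> allocated_;
};
```

After `Release(p)`, a call to `Deallocate(p)` does nothing, which matches how the patch avoids freeing buffers that output Tensors already own.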
diff --git a/tensorflow/compiler/jit/xla_tensor_info.cc b/tensorflow/compiler/jit/xla_tensor_info.cc
new file mode 100644
index 0000000000000000000000000000000000000000..0ce18c27cbe1d46eb61f8000506396fedc509e9c
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_tensor_info.cc
@@ -0,0 +1,56 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/jit/xla_tensor_info.h"
+
+namespace tensorflow {
+
+const XlaTensorInfo* XlaTensorInfoManager::GetTensorInfo(
+    const void* device_ptr) const {
+  mutex_lock lock(lock_);
+  auto iterator = tensor_infos_.find(device_ptr);
+  return (iterator == tensor_infos_.end()) ? nullptr
+                                           : iterator->second.get();
+}
+
+XlaTensorInfo* XlaTensorInfoManager::GetOrCreateTensorInfo(
+    const void* device_ptr) {
+  mutex_lock lock(lock_);
+  auto iterator = tensor_infos_.find(device_ptr);
+  if (iterator != tensor_infos_.end()) {
+    return iterator->second.get();
+  }
+  auto iterator_and_inserted =
+      tensor_infos_.emplace(device_ptr, MakeUnique<XlaTensorInfo>());
+  CHECK(iterator_and_inserted.second);
+  return iterator_and_inserted.first->second.get();
+}
+
+const XlaTensorInfo* XlaTensorInfoManager::GetTensorInfo(const Tensor& tensor) {
+  return GetTensorInfo(tensor.tensor_data().data());
+}
+
+XlaTensorInfo* XlaTensorInfoManager::GetOrCreateTensorInfo(
+    const Tensor& tensor) {
+  return GetOrCreateTensorInfo(tensor.tensor_data().data());
+}
+
+void XlaTensorInfoManager::DeallocateRaw(void* ptr) {
+  wrapped()->DeallocateRaw(ptr);
+  mutex_lock lock(lock_);
+  tensor_infos_.erase(ptr);
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/xla_tensor_info.h b/tensorflow/compiler/jit/xla_tensor_info.h
new file mode 100644
index 0000000000000000000000000000000000000000..fbd6ad770fbf9b80829ca80f1a85704e3288a680
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_tensor_info.h
@@ -0,0 +1,101 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_JIT_XLA_TENSOR_INFO_H_
+#define TENSORFLOW_COMPILER_JIT_XLA_TENSOR_INFO_H_
+
+#include "tensorflow/compiler/xla/service/shaped_buffer.h"
+#include "tensorflow/core/framework/allocator.h"
+#include "tensorflow/core/framework/device_base.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/platform/mutex.h"
+
+namespace tensorflow {
+
+// Information about a tensor. The XlaTensorInfoManager can maintain one of
+// these per device Tensor.
+class XlaTensorInfo {
+ public:
+  XlaTensorInfo() {}
+
+  // Some Tensors can have complex on-device shapes, including tuple shapes. To
+  // manage the memory for these tensors a ShapedBuffer may be required.
+
+  // Return true if this TensorInfo contains a ShapedBuffer.
+  bool has_shaped_buffer() const { return shaped_buffer_ != nullptr; }
+  // Return the contained ShapedBuffer.
+  // REQUIRES: has_shaped_buffer()
+  const xla::ShapedBuffer& shaped_buffer() const { return *shaped_buffer_; }
+  // Mutates the TensorInfo to set the ShapedBuffer.
+  void set_shaped_buffer(xla::ShapedBuffer shaped_buffer) {
+    shaped_buffer_.reset(new xla::ShapedBuffer(std::move(shaped_buffer)));
+  }
+
+  // Some tensors on the device may have known values on the host. We use these
+  // in on-demand mode to avoid re-copying values from the device if we know the
+  // host value already.
+
+  // Return true if this TensorInfo contains a host tensor.
+  bool has_host_tensor() const { return host_tensor_ != nullptr; }
+  // Return the contained host tensor.
+  // REQUIRES: has_host_tensor()
+  const Tensor& host_tensor() const { return *host_tensor_; }
+  // Sets the contained host tensor.
+  void set_host_tensor(const Tensor& tensor) {
+    host_tensor_.reset(new Tensor(tensor));
+  }
+
+ private:
+  // The optional contained ShapedBuffer.
+  std::unique_ptr<xla::ShapedBuffer> shaped_buffer_;
+  // An optional host tensor value.
+  std::unique_ptr<Tensor> host_tensor_;
+};
+
+// Manages XlaTensorInfo objects. This class is also an Allocator, so that
+// XlaTensorInfo objects can be deleted when their Tensor is deallocated.
+class XlaTensorInfoManager : public AllocatorWrapper {
+ public:
+  // Creates a new XlaTensorInfoManager, delegating all DeallocateRaw calls to
+  // allocator.
+  XlaTensorInfoManager(Allocator* allocator) : AllocatorWrapper(allocator) {}
+
+  // Returns the XlaTensorInfo for the given device memory pointer or nullptr if
+  // none exists.
+  const XlaTensorInfo* GetTensorInfo(const void* device_ptr) const;
+  // Returns the XlaTensorInfo for the device memory pointer extracted from
+  // tensor or nullptr if none exists.
+  const XlaTensorInfo* GetTensorInfo(const Tensor& tensor);
+
+  // Returns the XlaTensorInfo for the device memory pointer extracted from
+  // tensor, creating one if necessary.
+  XlaTensorInfo* GetOrCreateTensorInfo(const Tensor& tensor);
+  // Returns the XlaTensorInfo for the given device memory pointer, creating one
+  // if necessary.
+  XlaTensorInfo* GetOrCreateTensorInfo(const void* device_ptr);
+
+  // Allocator interface
+  void DeallocateRaw(void* ptr) override;
+
+ private:
+  mutable mutex lock_;
+  // The managed tensor infos. The mapped value is a unique_ptr so that returned
+  // references are stable over rehashes.
+  std::unordered_map<const void*, std::unique_ptr<XlaTensorInfo>> tensor_infos_
+      GUARDED_BY(lock_);
+};
+}  // namespace tensorflow
+
+#endif
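Review note on `XlaTensorInfoManager` above: the core idea is a mutex-guarded side table keyed by device pointer, whose values are `unique_ptr`s so that a returned pointer stays valid when the map rehashes, and whose entries are erased when the buffer is deallocated. A minimal standalone sketch of that idea, using only the standard library, is below. `Info` and `InfoTable` are hypothetical names, and the allocator-wrapping part of the real class is omitted.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical payload standing in for XlaTensorInfo.
struct Info {
  std::string note;
};

// Side table keyed by buffer address. Values are heap-allocated so pointers
// handed to callers survive rehashing of the map.
class InfoTable {
 public:
  // Returns the entry for `ptr`, or nullptr if none exists.
  const Info* Get(const void* ptr) const {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = infos_.find(ptr);
    return it == infos_.end() ? nullptr : it->second.get();
  }

  // Returns the entry for `ptr`, creating an empty one if needed.
  Info* GetOrCreate(const void* ptr) {
    std::lock_guard<std::mutex> lock(mu_);
    std::unique_ptr<Info>& slot = infos_[ptr];
    if (!slot) slot.reset(new Info());
    return slot.get();
  }

  // Drops the entry for `ptr`; the analogue of the DeallocateRaw hook.
  void Erase(const void* ptr) {
    std::lock_guard<std::mutex> lock(mu_);
    infos_.erase(ptr);
  }

 private:
  mutable std::mutex mu_;
  std::unordered_map<const void*, std::unique_ptr<Info>> infos_;
};
```

Note that the lock only protects the map itself. As in the patch, a pointer returned from `GetOrCreate` is used outside the lock, so callers must not use it after the matching buffer has been deallocated.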
diff --git a/tensorflow/compiler/tests/BUILD b/tensorflow/compiler/tests/BUILD
index 782bf82d4149968d5e5fbfb93bbd4ff1dcd75494..bbb6089ea8594de7394f5a8dac015a88612fb1ea 100644
--- a/tensorflow/compiler/tests/BUILD
+++ b/tensorflow/compiler/tests/BUILD
@@ -86,7 +86,10 @@ tf_xla_py_test(
     # ArgMax needs CustomCall on CPU, which is not available in normal
     # (not precompiled) TensorFlow. The flag below excludes the CPU
     # backend.
-    disabled_backends = "cpu",
+    disabled_backends = [
+        "cpu",
+        "cpu_ondemand",
+    ],
     deps = [
         ":xla_test",
         "//tensorflow/python:array_ops",
@@ -98,7 +101,7 @@ tf_xla_py_test(
 
 tf_xla_py_test(
     name = "binary_ops_test",
-    size = "small",
+    size = "medium",
     srcs = ["binary_ops_test.py"],
     shard_count = 5,
     tags = [
@@ -315,6 +318,8 @@ tf_xla_py_test(
     name = "function_test",
     size = "small",
     srcs = ["function_test.py"],
+    # Functions are not implemented in the on-demand compilation model yet.
+    disabled_backends = "cpu_ondemand",
     deps = [
         ":xla_test",
         "//tensorflow/python:array_ops",
@@ -537,6 +542,7 @@ tf_xla_py_test(
     size = "medium",
     srcs = ["spacetobatch_op_test.py"],
     shard_count = 3,
+    tags = ["notsan"],
     deps = [
         ":xla_test",
         "//tensorflow/python:array_ops",
@@ -550,6 +556,8 @@ tf_xla_py_test(
     name = "stack_ops_test",
     size = "small",
     srcs = ["stack_ops_test.py"],
+    # Stack ops are not implemented in the on-demand compilation model yet.
+    disabled_backends = "cpu_ondemand",
     deps = [
         ":xla_test",
         "//tensorflow/python:array_ops",
@@ -576,6 +584,8 @@ tf_xla_py_test(
     name = "tensor_array_ops_test",
     size = "small",
     srcs = ["tensor_array_ops_test.py"],
+    # TensorArray ops are not implemented in the on-demand compilation model yet.
+    disabled_backends = "cpu_ondemand",
     deps = [
         ":xla_test",
         "//tensorflow/python:array_ops",
diff --git a/tensorflow/compiler/tests/binary_ops_test.py b/tensorflow/compiler/tests/binary_ops_test.py
index 30a6d3a74d64f90ad33062df6d1e16e3a575bd63..ba7b9bacd2b794c74409d517a9c05bfbb14a845f 100644
--- a/tensorflow/compiler/tests/binary_ops_test.py
+++ b/tensorflow/compiler/tests/binary_ops_test.py
@@ -71,7 +71,7 @@ class BinaryOpsTest(XLATestCase):
           expected=np.array([[[[False, True], [True, False]]]], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._real_div,
+          gen_math_ops.real_div,
           np.array([3, 3, -1.5, -8, 44], dtype=dtype),
           np.array([2, -2, 7, -4, 0], dtype=dtype),
           expected=np.array(
@@ -108,57 +108,57 @@ class BinaryOpsTest(XLATestCase):
               [0, np.pi / 4, np.pi / 2, np.pi * 3 / 4, np.pi], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._reciprocal_grad,
+          gen_math_ops.reciprocal_grad,
           np.array([4, -3, -2, 1], dtype=dtype),
           np.array([5, -6, 7, -8], dtype=dtype),
           expected=np.array([-80, 54, -28, 8], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._sigmoid_grad,
+          gen_math_ops.sigmoid_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array([-60, -36, -14, 0], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._rsqrt_grad,
+          gen_math_ops.rsqrt_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array([-160, -81, -28, -4], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._sqrt_grad,
+          gen_math_ops.sqrt_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array([0.625, 1, 1.75, 4], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._softplus_grad,
+          gen_nn_ops.softplus_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array(
               [3.97322869, 2.99258232, 1.99817801, 0.99966466], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._softsign_grad,
+          gen_nn_ops.softsign_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array(
               [0.11111111, 0.06122449, 0.03125, 0.01234568], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._tanh_grad,
+          gen_math_ops.tanh_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array([-75, -48, -21, 0], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._elu_grad,
+          gen_nn_ops.elu_grad,
           np.array([1, 2, 3, 4, 5, 6], dtype=dtype),
           np.array([-.6, -.4, -.2, 0, .2, .4], dtype=dtype),
           expected=np.array([0.4, 1.2, 2.4, 4, 5, 6], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._selu_grad,
+          gen_nn_ops.selu_grad,
           np.array([1, 2, 3, 4, 5, 6], dtype=dtype),
           np.array([-.6, -.4, -.2, .2, .4, .6], dtype=dtype),
           expected=np.array(
@@ -166,20 +166,20 @@ class BinaryOpsTest(XLATestCase):
                4.202803949422, 5.2535049367774, 6.30420592413], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._relu_grad,
+          gen_nn_ops.relu_grad,
           np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=dtype),
           np.array([0, 0, 0, 0, 0, 0.1, 0.3, 0.5, 0.7, 0.9], dtype=dtype),
           expected=np.array([0, 0, 0, 0, 0, 6, 7, 8, 9, 10], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._relu6_grad,
+          gen_nn_ops.relu6_grad,
           np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=dtype),
           np.array(
               [0, 0, 0, 0, 0, 0.1, 0.3, 0.5, 0.7, 0.9, 6.1, 10.0], dtype=dtype),
           expected=np.array([0, 0, 0, 0, 0, 6, 7, 8, 9, 10, 0, 0], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._softmax_cross_entropy_with_logits,
+          gen_nn_ops.softmax_cross_entropy_with_logits,
           np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=dtype),
           np.array([[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]], dtype=dtype),
           expected=[
@@ -191,7 +191,7 @@ class BinaryOpsTest(XLATestCase):
           equality_test=self.ListsAreClose)
 
       self._testBinary(
-          gen_nn_ops._sparse_softmax_cross_entropy_with_logits,
+          gen_nn_ops.sparse_softmax_cross_entropy_with_logits,
           np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8],
                     [0.9, 1.0, 1.1, 1.2]], dtype=dtype),
           np.array([2, 1, 7], dtype=np.int32),
@@ -207,7 +207,7 @@ class BinaryOpsTest(XLATestCase):
   def testIntOps(self):
     for dtype in self.int_types:
       self._testBinary(
-          gen_math_ops._truncate_div,
+          gen_math_ops.truncate_div,
           np.array([3, 3, -1, -9, -8], dtype=dtype),
           np.array([2, -2, 7, 2, -4], dtype=dtype),
           expected=np.array([1, -1, 0, -4, 2], dtype=dtype))
@@ -232,11 +232,16 @@ class BinaryOpsTest(XLATestCase):
           expected=np.right_shift(lhs, rhs))
 
       if dtype in [np.int8, np.int16, np.int32, np.int64]:
-        lhs = np.array([-1, -5, -3, -14], dtype=dtype)
-        rhs = np.array([5, 0, 1, 11], dtype=dtype)
-        self._testBinary(
-            bitwise_ops.right_shift, lhs, rhs,
-            expected=np.right_shift(lhs, rhs))
+        lhs = np.array([-1, -5, -3, -14, -2], dtype=dtype)
+        rhs = np.array([5, 0, 1, 11, 36], dtype=dtype)
+        # HLO has saturating shift behavior.
+        bits = np.ceil(
+            np.log(np.iinfo(dtype).max - np.iinfo(dtype).min) / np.log(2))
+        expected = [
+            np.right_shift(l, r) if r < bits else np.sign(l)
+            for l, r in zip(lhs, rhs)
+        ]
+        self._testBinary(bitwise_ops.right_shift, lhs, rhs, expected=expected)
 
   def testNumericOps(self):
     for dtype in self.numeric_types:
@@ -255,12 +260,18 @@ class BinaryOpsTest(XLATestCase):
           np.array([[1], [2]], dtype=dtype),
           dtype(7),
           expected=np.array([[8], [9]], dtype=dtype))
+      self._testBinary(
+          math_ops.add,
+          np.array([0xffffffff, 0xfffffffff, 1, 1], dtype=np.int64),
+          np.array([1, 1, 0xffffffff, 0xfffffffff], dtype=np.int64),
+          expected=np.array(
+              [1 << 32, 1 << 36, 1 << 32, 1 << 36], dtype=np.int64))
 
       self._testBinary(
           math_ops.subtract,
-          np.array([1, 2], dtype=dtype),
-          np.array([10, 20], dtype=dtype),
-          expected=np.array([-9, -18], dtype=dtype))
+          np.array([1, 2, 100], dtype=dtype),
+          np.array([10, 20, -1], dtype=dtype),
+          expected=np.array([-9, -18, 101], dtype=dtype))
       self._testBinary(
           math_ops.subtract,
           dtype(5),
@@ -369,7 +380,7 @@ class BinaryOpsTest(XLATestCase):
           expected=np.array([[[[False, True], [True, False]]]], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._real_div,
+          gen_math_ops.real_div,
           np.array([3, 3j, -1.5j, -8, 2 + 3j, 2 + 4j], dtype=dtype),
           np.array([2, -2, 7j, -4j, 4 - 6j, 1 + 2j], dtype=dtype),
           expected=np.array(
@@ -378,7 +389,7 @@ class BinaryOpsTest(XLATestCase):
 
       # Test inf/nan scenarios.
       self._testBinary(
-          gen_math_ops._real_div,
+          gen_math_ops.real_div,
           np.array([4 + 3j, 4, 3j, -4, -4j, 2 - 3j], dtype=dtype),
           np.array([0, 0, 0, 0, 0, 0], dtype=dtype),
           expected=np.array(
@@ -418,19 +429,19 @@ class BinaryOpsTest(XLATestCase):
       lhs = np.array([4 + 2j, -3 - 1j, 2j, 1], dtype=dtype)
       rhs = np.array([5, -6j, 7 - 3j, -8j], dtype=dtype)
       self._testBinary(
-          gen_math_ops._reciprocal_grad, lhs, rhs, expected=-rhs * lhs * lhs)
+          gen_math_ops.reciprocal_grad, lhs, rhs, expected=-rhs * lhs * lhs)
 
       self._testBinary(
-          gen_math_ops._sigmoid_grad, lhs, rhs, expected=rhs * lhs * (1 - lhs))
+          gen_math_ops.sigmoid_grad, lhs, rhs, expected=rhs * lhs * (1 - lhs))
 
       self._testBinary(
-          gen_math_ops._rsqrt_grad, lhs, rhs, expected=lhs**3 * rhs / -2)
+          gen_math_ops.rsqrt_grad, lhs, rhs, expected=lhs**3 * rhs / -2)
 
       self._testBinary(
-          gen_math_ops._sqrt_grad, lhs, rhs, expected=rhs / (2 * lhs))
+          gen_math_ops.sqrt_grad, lhs, rhs, expected=rhs / (2 * lhs))
 
       self._testBinary(
-          gen_math_ops._tanh_grad, lhs, rhs, expected=rhs * (1 - lhs * lhs))
+          gen_math_ops.tanh_grad, lhs, rhs, expected=rhs * (1 - lhs * lhs))
 
   def testComplexMath(self):
     for dtype in self.complex_types:
@@ -538,7 +549,7 @@ class BinaryOpsTest(XLATestCase):
 
     if dtype not in self.complex_types:  # floordiv unsupported for complex.
       self._testBinary(
-          gen_math_ops._floor_div,
+          gen_math_ops.floor_div,
           np.array([3, 3, -1, -9, -8], dtype=dtype),
           np.array([2, -2, 7, 2, -4], dtype=dtype),
           expected=np.array([1, -2, -1, -5, 2], dtype=dtype))
@@ -554,12 +565,12 @@ class BinaryOpsTest(XLATestCase):
   def _testRemainder(self, dtype):
     """Test cases for remainder operators."""
     self._testBinary(
-        gen_math_ops._floor_mod,
+        gen_math_ops.floor_mod,
         np.array([3, 3, -1, -8], dtype=dtype),
         np.array([2, -2, 7, -4], dtype=dtype),
         expected=np.array([1, -1, 6, 0], dtype=dtype))
     self._testBinary(
-        gen_math_ops._truncate_mod,
+        gen_math_ops.truncate_mod,
         np.array([3, 3, -1, -8], dtype=dtype),
         np.array([2, -2, 7, -4], dtype=dtype),
         expected=np.array([1, 1, -1, 0], dtype=dtype))
@@ -668,6 +679,11 @@ class BinaryOpsTest(XLATestCase):
           np.array([[10], [7], [2]], dtype=np.float32),
           np.float32(7),
           expected=np.array([[False], [False], [True]], dtype=np.bool))
+      self._testBinary(
+          less_op,
+          np.array([[10], [7], [2], [-1]], dtype=np.int64),
+          np.int64(7),
+          expected=np.array([[False], [False], [True], [True]], dtype=np.bool))
 
     for less_equal_op in [math_ops.less_equal, (lambda x, y: x <= y)]:
       self._testBinary(
@@ -686,6 +702,80 @@ class BinaryOpsTest(XLATestCase):
           np.float32(7),
           expected=np.array([[False], [True], [True]], dtype=np.bool))
 
+  def testS64Comparisons(self):
+    for op in [(lambda x, y: x < y), (lambda x, y: x <= y),
+               (lambda x, y: x >= y), (lambda x, y: x > y)]:
+      lhs = np.array(
+          [
+              np.int64(0x000000007FFFFFFF),
+              np.int64(0x000000007FFFFFFF),
+              np.int64(0x0000000080000000),
+              np.int64(0x0000000080000000),
+              np.int64(0x0000000080000001),
+              np.int64(0x00000000FFFF0000),
+              np.int64(0x00000000FFFF0000),
+              np.int64(0x00000000FFFFFFFE),
+              np.int64(0x00000000FFFFFFFF),
+              np.int64(0x00000000FFFFFFFF),
+              np.int64(0x0000000100000000),
+              np.int64(0x0000000200000002),
+              np.int64(0x0000000200000002),
+              np.int64(0x0000000200000002),
+              np.int64(0x0000000200000002),
+              np.int64(0x0000000200000002),
+              np.int64(0x0000000200000002),
+              np.int64(0x0000000200000002),
+              np.int64(0x0000000200000002),
+              np.int64(0x0000000200000002),
+              np.int64(-0x7FFFFFFF00000002),
+              np.int64(-0x7FFFFFFF00000002),
+              np.int64(-0x7FFFFFFF00000001),
+              np.int64(-0x7FFFFFFF00000001),
+              np.int64(-0x7FFFFFFF00000001),
+              np.int64(-0x7FFFFFFF00000001),
+              np.int64(0x7ffffffefff00010),
+              np.int64(0x7ffffffefff00010),
+              np.int64(-1),
+              np.int64(-1)
+          ],
+          dtype=np.int64)
+      rhs = np.array(
+          [
+              np.int64(0x000000007FFFFFFE),
+              np.int64(0x000000007FFFFFFF),
+              np.int64(0x000000007FFFFFFF),
+              np.int64(0x0000000080000000),
+              np.int64(0x0000000080000001),
+              np.int64(0x00000000FFFF0000),
+              np.int64(0x00000000FFFF0001),
+              np.int64(0x00000000FFFFFFFF),
+              np.int64(0x00000000FFFFFFFE),
+              np.int64(0x00000000FFFFFFFF),
+              np.int64(0x00000000FFFFFFFF),
+              np.int64(0x0000000100000001),
+              np.int64(0x0000000100000002),
+              np.int64(0x0000000100000003),
+              np.int64(0x0000000200000001),
+              np.int64(0x0000000200000002),
+              np.int64(0x0000000200000003),
+              np.int64(0x0000000300000001),
+              np.int64(0x0000000300000002),
+              np.int64(0x0000000300000003),
+              np.int64(0x00000000FFFFFFFF),
+              np.int64(-0x7FFFFFFF00000001),
+              np.int64(0x00000000FFFFFFFE),
+              np.int64(0x00000000FFFFFFFF),
+              np.int64(-0x7FFFFFFF00000002),
+              np.int64(-0x7FFFFFFF00000001),
+              np.int64(0x00000000FFFFFFFF),
+              np.int64(-0x7FFFFFFF00000001),
+              np.int64(-2),
+              np.int64(-1)
+          ],
+          dtype=np.int64)
+      expected = np.array([op(l, r) for l, r in zip(lhs, rhs)], dtype=np.bool)
+      self._testBinary(op, lhs, rhs, expected=expected)
+
   def testBroadcasting(self):
     """Tests broadcasting behavior of an operator."""
 
@@ -1045,6 +1135,20 @@ class BinaryOpsTest(XLATestCase):
             ],
             equality_test=self.ListsAreClose)
 
+      def splitvOp(x, y):  # pylint: disable=invalid-name
+        return array_ops.split(value=y, num_or_size_splits=[2, 3], axis=x)
+      for axis in [1, -1]:
+        self._testBinary(
+            splitvOp,
+            np.int32(axis),
+            np.array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]],
+                     dtype=dtype),
+            expected=[
+                np.array([[0, 1], [5, 6]], dtype=dtype),
+                np.array([[2, 3, 4], [7, 8, 9]], dtype=dtype),
+            ],
+            equality_test=self.ListsAreClose)
+
   def testTile(self):
     for dtype in self.numeric_types:
       self._testBinary(
diff --git a/tensorflow/compiler/tests/concat_ops_test.py b/tensorflow/compiler/tests/concat_ops_test.py
index 81734082d9aab86f8bc763681265ef64ef32bd31..f10973e19f1945515b776cf86349445ed7334629 100644
--- a/tensorflow/compiler/tests/concat_ops_test.py
+++ b/tensorflow/compiler/tests/concat_ops_test.py
@@ -301,7 +301,7 @@ class ConcatOffsetTest(XLATestCase):
         s0 = constant_op.constant([2, 3, 5], dtypes.int32)
         s1 = constant_op.constant([2, 7, 5], dtypes.int32)
         s2 = constant_op.constant([2, 20, 5], dtypes.int32)
-        off = gen_array_ops._concat_offset(cdim, [s0, s1, s2])
+        off = gen_array_ops.concat_offset(cdim, [s0, s1, s2])
         ans = sess.run(off)
         self.assertAllEqual(ans, [[0, 0, 0], [0, 3, 0], [0, 10, 0]])
 
diff --git a/tensorflow/compiler/tests/image_ops_test.py b/tensorflow/compiler/tests/image_ops_test.py
index 538fa8e8e570b83ed681ecc0501285520cabdecb..3bc41b7cfd72bec7572097f8c53eef314a4369f6 100644
--- a/tensorflow/compiler/tests/image_ops_test.py
+++ b/tensorflow/compiler/tests/image_ops_test.py
@@ -426,7 +426,7 @@ class ResizeBilinearTest(XLATestCase):
     with self.test_session() as sess, self.test_scope():
       dtype = dtype or np.float32
       grads = array_ops.placeholder(np.float32)
-      resized = gen_image_ops._resize_bilinear_grad(
+      resized = gen_image_ops.resize_bilinear_grad(
           grads,
           np.zeros([1, input_shape[0], input_shape[1], 1], dtype=dtype),
           align_corners=True)
diff --git a/tensorflow/compiler/tests/lrn_ops_test.py b/tensorflow/compiler/tests/lrn_ops_test.py
index 5d8d89224d4a778d84803811710bb095872e86b2..69bd8f7230d4394c45764d02a88fb0ec097c5756 100644
--- a/tensorflow/compiler/tests/lrn_ops_test.py
+++ b/tensorflow/compiler/tests/lrn_ops_test.py
@@ -115,11 +115,11 @@ class LRNTest(XLATestCase):
       out_image = constant_op.constant(out_image_vals, shape=shape)
       out_grads = constant_op.constant(out_grads_vals, shape=shape)
       with ops.device(CPU_DEVICE):
-        expected = gen_nn_ops._lrn_grad(out_grads, in_image, out_image,
-                                        depth_radius, bias, alpha, beta)
+        expected = gen_nn_ops.lrn_grad(out_grads, in_image, out_image,
+                                       depth_radius, bias, alpha, beta)
       with self.test_scope():
-        actual = gen_nn_ops._lrn_grad(out_grads, in_image, out_image,
-                                      depth_radius, bias, alpha, beta)
+        actual = gen_nn_ops.lrn_grad(out_grads, in_image, out_image,
+                                     depth_radius, bias, alpha, beta)
       expected_val = expected.eval()
       actual_val = actual.eval()
     self.assertAllClose(actual_val, expected_val, rtol=1e-3)
diff --git a/tensorflow/compiler/tests/pooling_ops_3d_test.py b/tensorflow/compiler/tests/pooling_ops_3d_test.py
index eb48fe555a0b182ea7983cbd8c3b217d56350408..4eed903963a34a253ea5c409782d9a89a97a4fdf 100644
--- a/tensorflow/compiler/tests/pooling_ops_3d_test.py
+++ b/tensorflow/compiler/tests/pooling_ops_3d_test.py
@@ -33,7 +33,7 @@ from tensorflow.python.platform import test
 # MaxPoolGrad.
 def _AvgPoolGrad(inputs, outputs, output_gradients, ksize, strides, padding):
   del outputs  # Unused by average-pooling gradients.
-  return gen_nn_ops._avg_pool3d_grad(
+  return gen_nn_ops.avg_pool3d_grad(
       inputs.get_shape().as_list(),
       output_gradients,
       ksize=ksize,
@@ -263,7 +263,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradValidPadding1_1_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[1, 3, 3, 3, 1],
         ksize=[1, 1, 1],
         strides=[1, 1, 1],
@@ -272,7 +272,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradValidPadding2_1_6_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 3, 3, 6, 3],
         ksize=[2, 2, 2],
         strides=[1, 1, 1],
@@ -281,7 +281,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradValidPadding2_1_7_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 3, 5, 7, 3],
         ksize=[2, 2, 2],
         strides=[1, 1, 1],
@@ -290,7 +290,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradValidPadding2_2_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 2, 2, 2, 3],
         ksize=[2, 2, 2],
         strides=[2, 2, 2],
@@ -299,7 +299,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradSamePadding1_1_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 3, 2, 4, 1],
         ksize=[1, 1, 1],
         strides=[1, 1, 1],
@@ -308,7 +308,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradSamePadding2_1_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 3, 2, 4, 1],
         ksize=[2, 2, 2],
         strides=[1, 1, 1],
@@ -317,7 +317,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradSamePadding2_2_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 5, 2, 4, 3],
         ksize=[2, 2, 2],
         strides=[2, 2, 2],
@@ -326,7 +326,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradSamePadding3_1_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[1, 3, 3, 7, 1],
         ksize=[3, 3, 3],
         strides=[1, 1, 1],
diff --git a/tensorflow/compiler/tests/pooling_ops_test.py b/tensorflow/compiler/tests/pooling_ops_test.py
index 7c19a99c4eb4be3ca34b3ce949216e557b0a681d..fe270af3d636c0824621f36360ce9e7d14d8fc91 100644
--- a/tensorflow/compiler/tests/pooling_ops_test.py
+++ b/tensorflow/compiler/tests/pooling_ops_test.py
@@ -292,8 +292,15 @@ class PoolGradTest(XLATestCase):
 
   CPU_DEVICE = "/job:localhost/replica:0/task:0/cpu:0"
 
-  def _VerifyOneTest(self, pool_func, pool_grad_func, input_sizes, ksize,
-                     strides, padding, data_format):
+  def _VerifyOneTest(self,
+                     pool_func,
+                     pool_grad_func,
+                     input_sizes,
+                     ksize,
+                     strides,
+                     padding,
+                     data_format,
+                     pool_grad_grad_func=None):
     """Verifies the output values of the pooling gradient function.
 
     Args:
@@ -304,9 +311,19 @@ class PoolGradTest(XLATestCase):
       strides: The stride dimensions
       padding: Padding type.
       data_format: The data format we use to run the pooling operation.
+      pool_grad_grad_func: Second-order gradient function, if available.
     """
     total_size = np.prod(input_sizes)
-    x = np.arange(1, total_size + 1, dtype=np.float32).reshape(input_sizes)
+    # TODO(b/73062247): MaxPoolGradGrad can confuse gradients when x is equally
+    # maximal at 16 bits. Switch to np.random.randn when resolved.
+    x = np.arange(1, total_size + 1, dtype=np.float32)
+    x *= (np.random.randint(2, size=total_size) * 2 - 1)  # Flip signs randomly
+    # Verify some specifically interesting values...
+    x[np.random.choice(total_size)] = np.inf
+    x[np.random.choice(total_size)] = -np.inf
+    # TODO(b/74222344): Fix nan handling for max pool grad.
+    # x[np.random.choice(total_size)] = np.nan
+    x = x.reshape(input_sizes)
     with self.test_session() as sess:
       # Use the forward pool function to compute some corresponding outputs
       # (needed for the CPU device, and we need the shape in both cases).
@@ -323,6 +340,8 @@ class PoolGradTest(XLATestCase):
       output_gradient_vals = np.arange(
           1, output_vals.size + 1, dtype=np.float32)
       output_gradient_vals = output_gradient_vals.reshape(output_vals.shape)
+      output_grad_grad_vals = np.arange(1, x.size + 1, dtype=np.float32)
+      output_grad_grad_vals = output_grad_grad_vals.reshape(x.shape)
 
       # Use the Tensorflow CPU pooling gradient to compute the expected input
       # gradients.
@@ -342,18 +361,36 @@ class PoolGradTest(XLATestCase):
             {inputs: x,
              output_gradients: output_gradient_vals})
 
+        output_grad_gradients = array_ops.placeholder(
+            dtypes.float32, shape=expected_input_gradient_vals.shape)
+        if pool_grad_grad_func is not None:
+          expected_grad_gradients = pool_grad_grad_func(
+              inputs,
+              outputs,
+              output_grad_gradients,
+              ksize=ksize,
+              strides=strides,
+              padding=padding,
+              data_format="NHWC")
+          expected_grad_gradients_vals = sess.run(expected_grad_gradients, {
+              inputs: x,
+              output_grad_gradients: output_grad_grad_vals
+          })
+
       # Run the gradient op on the XLA device
       with self.test_scope():
         outputs = array_ops.placeholder(dtypes.float32, shape=output_vals.shape)
         xla_inputs = inputs
         xla_outputs = outputs
         xla_output_gradients = output_gradients
+        xla_output_grad_gradients = output_grad_gradients
         xla_ksize = ksize
         xla_strides = strides
         if data_format == "NCHW":
           xla_inputs = NHWCToNCHW(inputs)
           xla_outputs = NHWCToNCHW(outputs)
           xla_output_gradients = NHWCToNCHW(output_gradients)
+          xla_output_grad_gradients = NHWCToNCHW(output_grad_gradients)
           xla_ksize = NHWCToNCHW(ksize)
           xla_strides = NHWCToNCHW(strides)
         actual_input_gradients = pool_grad_func(
@@ -366,22 +403,54 @@ class PoolGradTest(XLATestCase):
             data_format=data_format)
         if data_format == "NCHW":
           actual_input_gradients = NCHWToNHWC(actual_input_gradients)
-      actual = sess.run(actual_input_gradients, {
+        if pool_grad_grad_func is not None:
+          actual_grad_gradients = pool_grad_grad_func(
+              xla_inputs,
+              xla_outputs,
+              xla_output_grad_gradients,
+              ksize=xla_ksize,
+              strides=xla_strides,
+              padding=padding,
+              data_format=data_format)
+          if data_format == "NCHW":
+            actual_grad_gradients = NCHWToNHWC(actual_grad_gradients)
+      actual_input_gradients_vals = sess.run(actual_input_gradients, {
           inputs: x,
           outputs: output_vals,
           output_gradients: output_gradient_vals
       })
-
       # Compare the Tensorflow and XLA results.
       self.assertAllClose(
-          expected_input_gradient_vals.flatten(),
-          actual.flatten(),
+          expected_input_gradient_vals,
+          actual_input_gradients_vals,
           rtol=1e-4,
           atol=1e-6)
-      self.assertShapeEqual(actual, inputs)
-
-  def _VerifyValues(self, pool_func, pool_grad_func, input_sizes, ksize,
-                    strides, padding):
+      self.assertShapeEqual(actual_input_gradients_vals, inputs)
+
+      if pool_grad_grad_func is not None:
+        actual_grad_gradients_vals = sess.run(
+            actual_grad_gradients, {
+                inputs: x,
+                outputs: output_vals,
+                output_grad_gradients: output_grad_grad_vals
+            })
+
+        # Compare the Tensorflow and XLA results.
+        self.assertAllClose(
+            expected_grad_gradients_vals,
+            actual_grad_gradients_vals,
+            rtol=1e-4,
+            atol=1e-6)
+        self.assertShapeEqual(actual_grad_gradients_vals, outputs)
+
+  def _VerifyValues(self,
+                    pool_func,
+                    pool_grad_func,
+                    input_sizes,
+                    ksize,
+                    strides,
+                    padding,
+                    pool_grad_grad_func=None):
     """Verifies the output values of the pooling function.
 
     Args:
@@ -391,12 +460,20 @@ class PoolGradTest(XLATestCase):
       ksize: The kernel size dimensions
       strides: The stride dimensions
       padding: Padding type.
+      pool_grad_grad_func: Second-order gradient function, if available.
     """
     for data_format in GetTestConfigs():
-      self._VerifyOneTest(pool_func, pool_grad_func, input_sizes, ksize,
-                          strides, padding, data_format)
-
-  def _TestPooling(self, forward_op, backward_op):
+      self._VerifyOneTest(
+          pool_func,
+          pool_grad_func,
+          input_sizes,
+          ksize,
+          strides,
+          padding,
+          data_format,
+          pool_grad_grad_func=pool_grad_grad_func)
+
+  def _TestPooling(self, forward_op, backward_op, pool_grad_grad_func=None):
     # VALID padding
     self._VerifyValues(
         forward_op,
@@ -404,7 +481,8 @@ class PoolGradTest(XLATestCase):
         input_sizes=[1, 3, 3, 3],
         ksize=[1, 2, 2, 1],
         strides=[1, 2, 2, 1],
-        padding="VALID")
+        padding="VALID",
+        pool_grad_grad_func=pool_grad_grad_func)
 
     # SAME padding
     self._VerifyValues(
@@ -413,7 +491,8 @@ class PoolGradTest(XLATestCase):
         input_sizes=[1, 2, 3, 3],
         ksize=[1, 2, 2, 1],
         strides=[1, 2, 2, 1],
-        padding="SAME")
+        padding="SAME",
+        pool_grad_grad_func=pool_grad_grad_func)
 
     # SAME padding, non square window
     self._VerifyValues(
@@ -422,7 +501,8 @@ class PoolGradTest(XLATestCase):
         input_sizes=[1, 2, 2, 1],
         ksize=[1, 1, 2, 1],
         strides=[1, 1, 1, 1],
-        padding="SAME")
+        padding="SAME",
+        pool_grad_grad_func=pool_grad_grad_func)
 
     # VALID padding, uneven stride
     self._VerifyValues(
@@ -431,14 +511,16 @@ class PoolGradTest(XLATestCase):
         input_sizes=[1, 4, 4, 1],
         ksize=[1, 2, 2, 1],
         strides=[1, 1, 2, 1],
-        padding="VALID")
+        padding="VALID",
+        pool_grad_grad_func=pool_grad_grad_func)
     self._VerifyValues(
         forward_op,
         backward_op,
         input_sizes=[1, 4, 4, 1],
         ksize=[1, 2, 2, 1],
         strides=[1, 2, 1, 1],
-        padding="VALID")
+        padding="VALID",
+        pool_grad_grad_func=pool_grad_grad_func)
 
     # SAME padding, size 4 input
     self._VerifyValues(
@@ -447,7 +529,8 @@ class PoolGradTest(XLATestCase):
         input_sizes=[1, 4, 4, 4],
         ksize=[1, 2, 2, 1],
         strides=[1, 2, 2, 1],
-        padding="SAME")
+        padding="SAME",
+        pool_grad_grad_func=pool_grad_grad_func)
 
     # SAME padding, size 8 input
     self._VerifyValues(
@@ -456,10 +539,14 @@ class PoolGradTest(XLATestCase):
         input_sizes=[1, 8, 8, 8],
         ksize=[1, 3, 3, 1],
         strides=[1, 2, 2, 1],
-        padding="SAME")
+        padding="SAME",
+        pool_grad_grad_func=pool_grad_grad_func)
 
   def testMaxPool(self):
-    self._TestPooling(nn_ops.max_pool, gen_nn_ops._max_pool_grad)
+    self._TestPooling(
+        nn_ops.max_pool,
+        gen_nn_ops.max_pool_grad,
+        pool_grad_grad_func=gen_nn_ops.max_pool_grad_grad)
 
   def testAvgPool(self):
     # Wrapper around AvgPoolGrad that ignores extra arguments needed by
@@ -467,7 +554,7 @@ class PoolGradTest(XLATestCase):
     def AvgPoolGrad(inputs, outputs, output_gradients, ksize, strides, padding,
                     data_format):
       del outputs  # Unused by average-pooling gradients.
-      return gen_nn_ops._avg_pool_grad(
+      return gen_nn_ops.avg_pool_grad(
           inputs.get_shape().as_list(),
           output_gradients,
           ksize=ksize,
@@ -483,7 +570,7 @@ class PoolGradTest(XLATestCase):
   def testMaxPoolKernelSmallerThanStrideValid(self):
     self._VerifyValues(
         nn_ops.max_pool,
-        gen_nn_ops._max_pool_grad,
+        gen_nn_ops.max_pool_grad,
         input_sizes=[1, 7, 7, 1],
         ksize=[1, 2, 2, 1],
         strides=[1, 3, 3, 1],
@@ -492,7 +579,7 @@ class PoolGradTest(XLATestCase):
   def testMaxPoolKernelSmallerThanStrideSame(self):
     self._VerifyValues(
         nn_ops.max_pool,
-        gen_nn_ops._max_pool_grad,
+        gen_nn_ops.max_pool_grad,
         input_sizes=[1, 3, 3, 1],
         ksize=[1, 1, 1, 1],
         strides=[1, 2, 2, 1],
@@ -500,7 +587,7 @@ class PoolGradTest(XLATestCase):
 
     self._VerifyValues(
         nn_ops.max_pool,
-        gen_nn_ops._max_pool_grad,
+        gen_nn_ops.max_pool_grad,
         input_sizes=[1, 4, 4, 1],
         ksize=[1, 1, 1, 1],
         strides=[1, 2, 2, 1],
diff --git a/tensorflow/compiler/tests/randomized_tests.cc b/tensorflow/compiler/tests/randomized_tests.cc
index e72dd4eea9f127e1df96ab166103c4c16372adb6..e53efc3091d8935e745122af29abd7b8063b1d01 100644
--- a/tensorflow/compiler/tests/randomized_tests.cc
+++ b/tensorflow/compiler/tests/randomized_tests.cc
@@ -83,8 +83,8 @@ string LocalDeviceToFullDeviceName(const string& device) {
   return strings::StrCat("/job:localhost/replica:0/task:0/device:", device);
 }
 
-constexpr std::array<DataType, 4> kAllXlaTypes = {
-    {DT_INT32, DT_FLOAT, DT_BOOL, DT_COMPLEX64}};
+constexpr std::array<DataType, 5> kAllXlaTypes = {
+    {DT_INT32, DT_FLOAT, DT_BOOL, DT_COMPLEX64, DT_INT64}};
 
 // An OpTestBuilder is a graph builder class that takes as input an operator to
 // test, its inputs and attributes, and builds a graph that executes the
diff --git a/tensorflow/compiler/tests/reduce_ops_test.py b/tensorflow/compiler/tests/reduce_ops_test.py
index 965fdf684b973498d0b3c3cde17711cca7279705..2c084b04fa2f67ad0d86508109522d7bead206eb 100644
--- a/tensorflow/compiler/tests/reduce_ops_test.py
+++ b/tensorflow/compiler/tests/reduce_ops_test.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import functools
 import numpy as np
 
 from tensorflow.compiler.tests.xla_test import XLATestCase
@@ -30,8 +31,13 @@ from tensorflow.python.platform import googletest
 
 class ReduceOpsTest(XLATestCase):
 
-  def _testReduction(self, tf_reduce_fn, np_reduce_fn, dtype, test_inputs,
-                     rtol=1e-4, atol=1e-4):
+  def _testReduction(self,
+                     tf_reduce_fn,
+                     np_reduce_fn,
+                     dtype,
+                     test_inputs,
+                     rtol=1e-4,
+                     atol=1e-4):
     """Tests that the output of 'tf_reduce_fn' matches numpy's output."""
 
     for test_input in test_inputs:
@@ -41,16 +47,16 @@ class ReduceOpsTest(XLATestCase):
           index = array_ops.placeholder(dtypes.int32)
           out = tf_reduce_fn(a, index)
         result = sess.run(out, {a: test_input, index: [0]})
-        self.assertAllClose(result, np_reduce_fn(test_input, axis=0),
-                            rtol=rtol, atol=atol)
+        self.assertAllClose(
+            result, np_reduce_fn(test_input, axis=0), rtol=rtol, atol=atol)
 
         result = sess.run(out, {a: test_input, index: [1]})
-        self.assertAllClose(result, np_reduce_fn(test_input, axis=1),
-                            rtol=rtol, atol=atol)
+        self.assertAllClose(
+            result, np_reduce_fn(test_input, axis=1), rtol=rtol, atol=atol)
 
         result = sess.run(out, {a: test_input, index: [-1]})
-        self.assertAllClose(result, np_reduce_fn(test_input, axis=1),
-                            rtol=rtol, atol=atol)
+        self.assertAllClose(
+            result, np_reduce_fn(test_input, axis=1), rtol=rtol, atol=atol)
 
         with self.assertRaisesWithPredicateMatch(
             errors_impl.InvalidArgumentError, 'Invalid reduction dim'):
@@ -60,7 +66,7 @@ class ReduceOpsTest(XLATestCase):
             errors_impl.InvalidArgumentError, 'Invalid reduction dim'):
           sess.run(out, {a: test_input, index: [2]})
 
-  FLOAT_DATA = [
+  REAL_DATA = [
       np.zeros(shape=(2, 0)),
       np.zeros(shape=(0, 30)),
       np.arange(1, 7).reshape(2, 3),
@@ -74,7 +80,7 @@ class ReduceOpsTest(XLATestCase):
       np.arange(-14, -2, dtype=np.float32).view(np.complex64).reshape(2, 3),
       np.arange(-4, 8, dtype=np.float32).view(np.complex64).reshape(2, 3),
   ]
-  NONEMPTY_FLOAT_DATA = [x for x in FLOAT_DATA if np.size(x) > 0]
+  NONEMPTY_REAL_DATA = [x for x in REAL_DATA if np.size(x) > 0]
   NONEMPTY_COMPLEX_DATA = [x for x in COMPLEX_DATA if np.size(x) > 0]
   BOOL_DATA = [
       np.array([], dtype=np.bool).reshape(2, 0),
@@ -83,8 +89,7 @@ class ReduceOpsTest(XLATestCase):
   ]
 
   def testReduceSumF32(self):
-    self._testReduction(math_ops.reduce_sum, np.sum, np.float32,
-                        self.FLOAT_DATA)
+    self._testReduction(math_ops.reduce_sum, np.sum, np.float32, self.REAL_DATA)
 
   def testReduceSumC64(self):
     self._testReduction(math_ops.reduce_sum, np.sum, np.complex64,
@@ -92,7 +97,7 @@ class ReduceOpsTest(XLATestCase):
 
   def testReduceProdF32(self):
     self._testReduction(math_ops.reduce_prod, np.prod, np.float32,
-                        self.FLOAT_DATA)
+                        self.REAL_DATA)
 
   def testReduceProdC64(self):
     self._testReduction(math_ops.reduce_prod, np.prod, np.complex64,
@@ -100,31 +105,44 @@ class ReduceOpsTest(XLATestCase):
 
   def testReduceMin(self):
 
-    def reference_min(inp, axis):
+    def reference_min(dtype, inp, axis):
       """Wrapper around np.amin that returns +infinity for an empty input."""
       if inp.shape[axis] == 0:
-        return np.full(inp.shape[0:axis] + inp.shape[axis + 1:], float('inf'))
+        if np.issubdtype(dtype, np.floating):
+          return np.full(inp.shape[0:axis] + inp.shape[axis + 1:], float('inf'))
+        return np.full(inp.shape[0:axis] + inp.shape[axis + 1:],
+                       np.iinfo(dtype).max)
       return np.amin(inp, axis)
 
-    self._testReduction(math_ops.reduce_min, reference_min, np.float32,
-                        self.FLOAT_DATA)
+    for dtype in set(self.all_types).intersection(
+        [np.float32, np.int32, np.int64]):
+      self._testReduction(math_ops.reduce_min,
+                          functools.partial(reference_min, dtype), dtype,
+                          self.REAL_DATA)
 
   def testReduceMax(self):
 
-    def reference_max(inp, axis):
+    def reference_max(dtype, inp, axis):
       """Wrapper around np.amax that returns -infinity for an empty input."""
       if inp.shape[axis] == 0:
-        return np.full(inp.shape[0:axis] + inp.shape[axis + 1:], float('-inf'))
+        if np.issubdtype(dtype, np.floating):
+          return np.full(inp.shape[0:axis] + inp.shape[axis + 1:],
+                         float('-inf'))
+        return np.full(inp.shape[0:axis] + inp.shape[axis + 1:],
+                       np.iinfo(dtype).min)
       return np.amax(inp, axis)
 
-    self._testReduction(math_ops.reduce_max, reference_max, np.float32,
-                        self.FLOAT_DATA)
+    for dtype in set(self.all_types).intersection(
+        [np.float32, np.int32, np.int64]):
+      self._testReduction(math_ops.reduce_max,
+                          functools.partial(reference_max, dtype), dtype,
+                          self.REAL_DATA)
 
   def testReduceMeanF32(self):
     # TODO(phawkins): mean on XLA currently returns 0 instead of NaN when
     # reducing across zero inputs.
     self._testReduction(math_ops.reduce_mean, np.mean, np.float32,
-                        self.NONEMPTY_FLOAT_DATA)
+                        self.NONEMPTY_REAL_DATA)
 
   def testReduceMeanC64(self):
     self._testReduction(math_ops.reduce_mean, np.mean, np.complex64,
diff --git a/tensorflow/compiler/tests/spacetobatch_op_test.py b/tensorflow/compiler/tests/spacetobatch_op_test.py
index c013f4b50a4cf95be8028248c52b10b1c3be2bd3..92518aadc4bf5c601cfb4192c093799784b6aa72 100644
--- a/tensorflow/compiler/tests/spacetobatch_op_test.py
+++ b/tensorflow/compiler/tests/spacetobatch_op_test.py
@@ -75,11 +75,11 @@ class SpaceToBatchTest(XLATestCase):
       for dtype in self.float_types:
         # outputs = space_to_batch(inputs)
         placeholder = array_ops.placeholder(dtype)
-        x_tf = gen_array_ops._space_to_batch(
+        x_tf = gen_array_ops.space_to_batch(
             placeholder, paddings, block_size=block_size)
         self.assertAllEqual(sess.run(x_tf, {placeholder: inputs}), outputs)
         # inputs = batch_to_space(outputs)
-        x_tf = gen_array_ops._batch_to_space(
+        x_tf = gen_array_ops.batch_to_space(
             placeholder, paddings, block_size=block_size)
         self.assertAllEqual(sess.run(x_tf, {placeholder: outputs}), inputs)
 
diff --git a/tensorflow/compiler/tests/stack_ops_test.py b/tensorflow/compiler/tests/stack_ops_test.py
index 2b9c2279737ccee531d488d27ccdb0cafa1dc8fc..94342f9567ca71274609e63b0482d55637c98d51 100644
--- a/tensorflow/compiler/tests/stack_ops_test.py
+++ b/tensorflow/compiler/tests/stack_ops_test.py
@@ -34,33 +34,33 @@ class StackOpTest(XLATestCase):
     with self.test_session(), self.test_scope():
       size = array_ops.placeholder(dtypes.int32)
       v = array_ops.placeholder(dtypes.float32)
-      h = gen_data_flow_ops._stack_v2(size, dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, v)
+      h = gen_data_flow_ops.stack_v2(size, dtypes.float32, stack_name="foo")
+      c = gen_data_flow_ops.stack_push_v2(h, v)
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop_v2(h, dtypes.float32)
       self.assertAllClose([[4.0, 5.0]], c1.eval({size: 5, v: [[4.0, 5.0]]}))
 
   def testStackPushPopSwap(self):
     with self.test_session(), self.test_scope():
       a = np.arange(2000)
       x = array_ops.placeholder(dtypes.float32)
-      h = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, x, swap_memory=True)
+      h = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="foo")
+      c = gen_data_flow_ops.stack_push_v2(h, x, swap_memory=True)
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop_v2(h, dtypes.float32)
       self.assertAllClose(a, c1.eval({x: a}))
 
   def testMultiStack(self):
     with self.test_session(), self.test_scope():
       v = array_ops.placeholder(dtypes.float32)
-      h1 = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_push_v2(h1, v)
+      h1 = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="foo")
+      c1 = gen_data_flow_ops.stack_push_v2(h1, v)
       with ops.control_dependencies([c1]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h1, dtypes.float32)
-      h2 = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="bar")
-      c2 = gen_data_flow_ops._stack_push_v2(h2, 5.0)
+        c1 = gen_data_flow_ops.stack_pop_v2(h1, dtypes.float32)
+      h2 = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="bar")
+      c2 = gen_data_flow_ops.stack_push_v2(h2, 5.0)
       with ops.control_dependencies([c2]):
-        c2 = gen_data_flow_ops._stack_pop_v2(h2, dtypes.float32)
+        c2 = gen_data_flow_ops.stack_pop_v2(h2, dtypes.float32)
       r = c1 + c2
       self.assertAllClose(9.0, r.eval({v: 4.0}))
 
@@ -69,15 +69,15 @@ class StackOpTest(XLATestCase):
     with self.test_session() as sess, self.test_scope():
       v1 = array_ops.placeholder(dtypes.float32)
       v2 = array_ops.placeholder(dtypes.float32)
-      h1 = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="foo")
-      h2 = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="foo")
+      h1 = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="foo")
+      h2 = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="foo")
 
-      c1 = gen_data_flow_ops._stack_push_v2(h1, v1)
+      c1 = gen_data_flow_ops.stack_push_v2(h1, v1)
       with ops.control_dependencies([c1]):
-        c2 = gen_data_flow_ops._stack_push_v2(h2, v2)
+        c2 = gen_data_flow_ops.stack_push_v2(h2, v2)
       with ops.control_dependencies([c2]):
-        pop1 = gen_data_flow_ops._stack_pop_v2(h1, dtypes.float32)
-        pop2 = gen_data_flow_ops._stack_pop_v2(h2, dtypes.float32)
+        pop1 = gen_data_flow_ops.stack_pop_v2(h1, dtypes.float32)
+        pop2 = gen_data_flow_ops.stack_pop_v2(h2, dtypes.float32)
 
       out1, out2 = sess.run([pop1, pop2], {v1: 4.0, v2: 5.0})
       self.assertAllClose(out1, 4.0)
@@ -86,17 +86,17 @@ class StackOpTest(XLATestCase):
   def testCloseStack(self):
     with self.test_session() as sess, self.test_scope():
       size = array_ops.placeholder(dtypes.int32)
-      h = gen_data_flow_ops._stack_v2(size, dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_close_v2(h)
+      h = gen_data_flow_ops.stack_v2(size, dtypes.float32, stack_name="foo")
+      c1 = gen_data_flow_ops.stack_close_v2(h)
       sess.run(c1, {size: 5})
 
   def testPushCloseStack(self):
     with self.test_session() as sess, self.test_scope():
       v = array_ops.placeholder(dtypes.float32)
-      h = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, v)
+      h = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="foo")
+      c = gen_data_flow_ops.stack_push_v2(h, v)
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_close_v2(h)
+        c1 = gen_data_flow_ops.stack_close_v2(h)
       sess.run(c1, {v: [[4.0, 5.0]]})
 
 
diff --git a/tensorflow/compiler/tests/tensor_array_ops_test.py b/tensorflow/compiler/tests/tensor_array_ops_test.py
index a62925a1818da00cb0a9e82e1281db20fb38b208..7624d6e4b2e2ece6a61155743fc8b866f6903f32 100644
--- a/tensorflow/compiler/tests/tensor_array_ops_test.py
+++ b/tensorflow/compiler/tests/tensor_array_ops_test.py
@@ -338,7 +338,7 @@ class TensorArrayTest(xla_test.XLATestCase):
         w0 = ta.write(0, [[4.0, 5.0]])
 
         # Test reading wrong datatype.
-        r0_bad = gen_data_flow_ops._tensor_array_read_v3(
+        r0_bad = gen_data_flow_ops.tensor_array_read_v3(
             handle=w0.handle, index=0, dtype=dtype2, flow_in=w0.flow)
         with self.assertRaisesOpError("TensorArray dtype is "):
           r0_bad.eval()
diff --git a/tensorflow/compiler/tests/xla_test.py b/tensorflow/compiler/tests/xla_test.py
index 7e1f5c76ed65946363cc3c113ab1a9862f87b289..e924fe1e61454aefda622a5a46a0e483d26db5c1 100644
--- a/tensorflow/compiler/tests/xla_test.py
+++ b/tensorflow/compiler/tests/xla_test.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 import contextlib
+import os
 import random
 import re
 
@@ -44,6 +45,8 @@ flags.DEFINE_string('test_device', None,
 flags.DEFINE_string('types', None, 'Types to test. Comma-separated list.')
 flags.DEFINE_string('disabled_manifest', None,
                     'Path to a file with a list of tests that should not run.')
+flags.DEFINE_string('tf_xla_flags', None,
+                    'Value to set the TF_XLA_FLAGS environment variable to.')
 
 
 class XLATestCase(test.TestCase):
@@ -71,14 +74,14 @@ class XLATestCase(test.TestCase):
 
     self._all_types = set(
         [dtype.as_numpy_dtype for dtype in self._all_tf_types])
-    self.int_types = set([dtype.as_numpy_dtype for dtype in self.int_tf_types])
+    self._int_types = set([dtype.as_numpy_dtype for dtype in self.int_tf_types])
     self._float_types = set(
         [dtype.as_numpy_dtype for dtype in self._float_tf_types])
     self.complex_types = set([
         dtype.as_numpy_dtype for dtype in self.complex_tf_types
     ])
-    self._numeric_types = set(
-        self.int_types | self._float_types | self.complex_types)
+    self._numeric_types = set(self._int_types | self._float_types
+                              | self.complex_types)
 
     # Parse the manifest file, if any, into a regex identifying tests to
     # disable
@@ -97,6 +100,8 @@ class XLATestCase(test.TestCase):
       disabled_tests = []
       disabled_method_types = []
       for l in manifest_file.read().splitlines():
+        if not l:
+          continue
         entry = comments_re.sub('', l).strip().split(' ')
         if len(entry) == 1:
           disabled_tests.append(entry[0])
@@ -113,6 +118,9 @@ class XLATestCase(test.TestCase):
             for name in types])
       manifest_file.close()
 
+    if FLAGS.tf_xla_flags is not None:
+      os.environ['TF_XLA_FLAGS'] = FLAGS.tf_xla_flags
+
   @property
   def all_tf_types(self):
     name = '{}.{}'.format(type(self).__name__, self._testMethodName)
@@ -130,6 +138,11 @@ class XLATestCase(test.TestCase):
     name = '{}.{}'.format(type(self).__name__, self._testMethodName)
     return self._float_tf_types - self._method_types_filter.get(name, set())
 
+  @property
+  def int_types(self):
+    name = '{}.{}'.format(type(self).__name__, self._testMethodName)
+    return self._int_types - self._method_types_filter.get(name, set())
+
   @property
   def numeric_tf_types(self):
     name = '{}.{}'.format(type(self).__name__, self._testMethodName)
diff --git a/tensorflow/compiler/tf2xla/BUILD b/tensorflow/compiler/tf2xla/BUILD
index fb82c2601c432cee425a46a3b6dc2c55febeda87..eb20ca501c80b01c76198e1ad54173f1c601714d 100644
--- a/tensorflow/compiler/tf2xla/BUILD
+++ b/tensorflow/compiler/tf2xla/BUILD
@@ -58,6 +58,15 @@ xla_proto_library(
     ],
 )
 
+xla_proto_library(
+    name = "host_compute_metadata_proto",
+    srcs = ["host_compute_metadata.proto"],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/core:protos_all_cc",
+    ],
+)
+
 cc_library(
     name = "tf2xla",
     srcs = ["tf2xla.cc"],
@@ -149,6 +158,7 @@ cc_library(
         ":common",
         ":dump_graph",
         ":functionalize_control_flow",
+        ":host_compute_metadata_proto",
         ":sharding_util",
         ":tf2xla_util",
         "//tensorflow/compiler/tf2xla/lib:util",
diff --git a/tensorflow/compiler/tf2xla/const_analysis.cc b/tensorflow/compiler/tf2xla/const_analysis.cc
index 82923722c54d235716b9138d95a75a441df924ca..de1008803d69fefa415c7bdbe6c27a62e625b417 100644
--- a/tensorflow/compiler/tf2xla/const_analysis.cc
+++ b/tensorflow/compiler/tf2xla/const_analysis.cc
@@ -37,7 +37,7 @@ Status BackwardsConstAnalysis(const Graph& g,
   };
 
   Status status;
-  std::unordered_set<Node*> must_be_const;
+  std::unordered_set<const Node*> must_be_const;
   auto visit = [&status, &metadata_ops, &must_be_const,
                 compile_time_const_args](Node* node) {
     if (!status.ok()) return;
@@ -55,8 +55,10 @@ Status BackwardsConstAnalysis(const Graph& g,
         compile_time_const_args->at(index) = true;
         return;
       }
-      for (Node* pred : node->in_nodes()) {
-        must_be_const.insert(pred);
+      for (const Edge* pred : node->in_edges()) {
+        if (!pred->IsControlEdge()) {
+          must_be_const.insert(pred->src());
+        }
       }
       return;
     }
diff --git a/tensorflow/compiler/tf2xla/const_analysis_test.cc b/tensorflow/compiler/tf2xla/const_analysis_test.cc
index 9d125f8d499863cfaa0e26b5b633ca02914d1b7d..992b12c06db5efc0ae54284d0ea77017c1c79aca 100644
--- a/tensorflow/compiler/tf2xla/const_analysis_test.cc
+++ b/tensorflow/compiler/tf2xla/const_analysis_test.cc
@@ -79,5 +79,24 @@ TEST(ConstAnalysisTest, TopologicalOrder) {
   }
 }
 
+TEST(ConstAnalysisTest, DontFollowControlDependencies) {
+  Scope root = Scope::NewRootScope();
+
+  Output arg0 = ops::_Arg(root.WithOpName("Arg0"), DT_INT32, 0);
+  Output arg1 = ops::_Arg(root.WithOpName("Arg1"), DT_INT32, 1);
+  Output c1 =
+      ops::Const(root.WithOpName("c1").WithControlDependencies(arg0), 1, {1});
+  Output add = ops::Add(root, arg1, c1);
+  Output reshape = ops::Reshape(root, arg1, add);
+
+  Graph graph(OpRegistry::Global());
+  TF_ASSERT_OK(root.ToGraph(&graph));
+
+  std::vector<bool> const_args(2, false);
+  TF_ASSERT_OK(BackwardsConstAnalysis(graph, &const_args));
+
+  EXPECT_EQ(const_args, std::vector<bool>({false, true}));
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/g3doc/cpu_supported_ops.md b/tensorflow/compiler/tf2xla/g3doc/cpu_supported_ops.md
index 91351421bcacd26c41b5c9f98ea833730e4aef30..20179b67991d3d23d678cf1df2642e029ea037fd 100644
--- a/tensorflow/compiler/tf2xla/g3doc/cpu_supported_ops.md
+++ b/tensorflow/compiler/tf2xla/g3doc/cpu_supported_ops.md
@@ -3,6 +3,7 @@
 Operator                              | Type Constraint
 ------------------------------------- | ---------------
 `Abs`                                 | `T={double,float,int32,int64}`
+`Acos`                                | `T={complex64,double,float,int32,int64}`
 `Acosh`                               | `T={complex64,double,float}`
 `Add`                                 | `T={complex64,double,float,int32,int64}`
 `AddN`                                | `T={complex64,double,float,int32,int64,uint32,uint64}`
@@ -15,10 +16,12 @@ Operator                              | Type Constraint
 `ApproximateEqual`                    | `T={complex64,double,float,int32,int64,uint32,uint64}`
 `ArgMax`                              | `Tidx={int32,int64}`<br>`output_type={int32,int64}`<br>`T={float}`
 `ArgMin`                              | `Tidx={int32,int64}`<br>`output_type={int32,int64}`<br>`T={complex64,double,float,int32,int64,uint32,uint64}`
+`Asin`                                | `T={complex64,double,float,int32,int64}`
 `Asinh`                               | `T={complex64,double,float}`
 `AssignAddVariableOp`                 | `dtype={complex64,double,float,int32,int64,uint32,uint64}`
 `AssignSubVariableOp`                 | `dtype={complex64,double,float,int32,int64,uint32,uint64}`
 `AssignVariableOp`                    | `dtype={bool,complex64,double,float,int32,int64,uint32,uint64}`
+`Atan`                                | `T={complex64,double,float,int32,int64}`
 `Atan2`                               | `T={double,float}`
 `Atanh`                               | `T={complex64,double,float}`
 `AvgPool`                             | `T={double,float}`
@@ -75,6 +78,10 @@ Operator                              | Type Constraint
 `FFT`                                 |
 `FFT2D`                               |
 `FFT3D`                               |
+`FakeQuantWithMinMaxArgs`             |
+`FakeQuantWithMinMaxArgsGradient`     |
+`FakeQuantWithMinMaxVars`             |
+`FakeQuantWithMinMaxVarsGradient`     |
 `Fill`                                | `index_type={int32,int64}`<br>`T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Floor`                               | `T={double,float}`
 `FloorDiv`                            | `T={complex64,double,float,int32,int64}`
@@ -84,6 +91,7 @@ Operator                              | Type Constraint
 `FusedBatchNormGradV2`                | `U={float}`<br>`T={float}`
 `FusedBatchNormV2`                    | `U={float}`<br>`T={float}`
 `Gather`                              | `Tindices={int32,int64}`<br>`Tparams={bool,complex64,double,float,int32,int64,uint32,uint64}`
+`GatherNd`                            | `Tindices={int32,int64}`<br>`Tparams={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `GatherV2`                            | `Taxis={int32,int64}`<br>`Tindices={int32,int64}`<br>`Tparams={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Greater`                             | `T={double,float,int32,int64,uint32,uint64}`
 `GreaterEqual`                        | `T={double,float,int32,int64,uint32,uint64}`
@@ -117,14 +125,18 @@ Operator                              | Type Constraint
 `LogicalNot`                          |
 `LogicalOr`                           |
 `MatMul`                              | `T={complex64,double,float}`
+`MatrixBandPart`                      | `Tindex={int32,int64}`<br>`T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `MatrixDiag`                          | `T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `MatrixDiagPart`                      | `T={bool,complex64,double,float,int32,int64,uint32,uint64}`
+`MatrixSetDiag`                       | `T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `MatrixTriangularSolve`               | `T={complex64,double,float}`
 `Max`                                 | `Tidx={int32,int64}`<br>`T={complex64,double,float,int32,int64,uint32,uint64}`
 `MaxPool`                             | `T={double,float,int32,int64}`
 `MaxPool3D`                           | `T={float}`
 `MaxPool3DGrad`                       | `TInput={float}`<br>`T={float}`
 `MaxPoolGrad`                         | `T={double,float,int32,int64,uint32,uint64}`
+`MaxPoolGradGrad`                     | `T={float}`
+`MaxPoolGradGradV2`                   | `T={float}`
 `MaxPoolGradV2`                       | `T={double,float,int32,int64,uint32,uint64}`
 `MaxPoolV2`                           | `T={double,float,int32,int64}`
 `Maximum`                             | `T={double,float,int32,int64}`
@@ -186,6 +198,7 @@ Operator                              | Type Constraint
 `Round`                               | `T={complex64,double,float,int32,int64}`
 `Rsqrt`                               | `T={complex64,double,float}`
 `RsqrtGrad`                           | `T={complex64,double,float}`
+`ScatterNd`                           | `Tindices={int32,int64}`<br>`T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Select`                              | `T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Selu`                                | `T={double,float}`
 `SeluGrad`                            | `T={double,float}`
@@ -198,6 +211,7 @@ Operator                              | Type Constraint
 `Sinh`                                | `T={complex64,double,float}`
 `Size`                                | `out_type={int32,int64}`<br>`T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Slice`                               | `Index={int32,int64}`<br>`T={bool,complex64,double,float,int32,int64,uint32,uint64}`
+`Snapshot`                            | `T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Softmax`                             | `T={double,float}`
 `SoftmaxCrossEntropyWithLogits`       | `T={double,float}`
 `Softplus`                            | `T={double,float,int32,int64,uint32,uint64}`
diff --git a/tensorflow/compiler/tf2xla/g3doc/gpu_supported_ops.md b/tensorflow/compiler/tf2xla/g3doc/gpu_supported_ops.md
index b9bdb829d773825005a8921f48d28b6892d8f0cd..55f0538dba7c1941dfea88e0631cd299e51f76d0 100644
--- a/tensorflow/compiler/tf2xla/g3doc/gpu_supported_ops.md
+++ b/tensorflow/compiler/tf2xla/g3doc/gpu_supported_ops.md
@@ -3,6 +3,7 @@
 Operator                              | Type Constraint
 ------------------------------------- | ---------------
 `Abs`                                 | `T={double,float,int32,int64}`
+`Acos`                                | `T={complex64,double,float,int32,int64}`
 `Acosh`                               | `T={complex64,double,float}`
 `Add`                                 | `T={complex64,double,float,int32,int64}`
 `AddN`                                | `T={complex64,double,float,int32,int64,uint32,uint64}`
@@ -15,10 +16,12 @@ Operator                              | Type Constraint
 `ApproximateEqual`                    | `T={complex64,double,float,int32,int64,uint32,uint64}`
 `ArgMax`                              | `Tidx={int32,int64}`<br>`output_type={int32,int64}`<br>`T={complex64,double,float,int32,int64,uint32,uint64}`
 `ArgMin`                              | `Tidx={int32,int64}`<br>`output_type={int32,int64}`<br>`T={complex64,double,float,int32,int64,uint32,uint64}`
+`Asin`                                | `T={complex64,double,float,int32,int64}`
 `Asinh`                               | `T={complex64,double,float}`
 `AssignAddVariableOp`                 | `dtype={complex64,double,float,int32,int64,uint32,uint64}`
 `AssignSubVariableOp`                 | `dtype={complex64,double,float,int32,int64,uint32,uint64}`
 `AssignVariableOp`                    | `dtype={bool,complex64,double,float,int32,int64,uint32,uint64}`
+`Atan`                                | `T={complex64,double,float,int32,int64}`
 `Atan2`                               | `T={double,float}`
 `Atanh`                               | `T={complex64,double,float}`
 `AvgPool`                             | `T={double,float}`
@@ -75,6 +78,10 @@ Operator                              | Type Constraint
 `FFT`                                 |
 `FFT2D`                               |
 `FFT3D`                               |
+`FakeQuantWithMinMaxArgs`             |
+`FakeQuantWithMinMaxArgsGradient`     |
+`FakeQuantWithMinMaxVars`             |
+`FakeQuantWithMinMaxVarsGradient`     |
 `Fill`                                | `index_type={int32,int64}`<br>`T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Floor`                               | `T={double,float}`
 `FloorDiv`                            | `T={complex64,double,float,int32,int64}`
@@ -84,6 +91,7 @@ Operator                              | Type Constraint
 `FusedBatchNormGradV2`                | `U={float}`<br>`T={float}`
 `FusedBatchNormV2`                    | `U={float}`<br>`T={float}`
 `Gather`                              | `Tindices={int32,int64}`<br>`Tparams={bool,complex64,double,float,int32,int64,uint32,uint64}`
+`GatherNd`                            | `Tindices={int32,int64}`<br>`Tparams={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `GatherV2`                            | `Taxis={int32,int64}`<br>`Tindices={int32,int64}`<br>`Tparams={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Greater`                             | `T={double,float,int32,int64,uint32,uint64}`
 `GreaterEqual`                        | `T={double,float,int32,int64,uint32,uint64}`
@@ -117,14 +125,18 @@ Operator                              | Type Constraint
 `LogicalNot`                          |
 `LogicalOr`                           |
 `MatMul`                              | `T={complex64,double,float}`
+`MatrixBandPart`                      | `Tindex={int32,int64}`<br>`T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `MatrixDiag`                          | `T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `MatrixDiagPart`                      | `T={bool,complex64,double,float,int32,int64,uint32,uint64}`
+`MatrixSetDiag`                       | `T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `MatrixTriangularSolve`               | `T={complex64,double,float}`
 `Max`                                 | `Tidx={int32,int64}`<br>`T={complex64,double,float,int32,int64,uint32,uint64}`
 `MaxPool`                             | `T={double,float,int32,int64}`
 `MaxPool3D`                           | `T={float}`
 `MaxPool3DGrad`                       | `TInput={float}`<br>`T={float}`
 `MaxPoolGrad`                         | `T={double,float,int32,int64,uint32,uint64}`
+`MaxPoolGradGrad`                     | `T={float}`
+`MaxPoolGradGradV2`                   | `T={float}`
 `MaxPoolGradV2`                       | `T={double,float,int32,int64,uint32,uint64}`
 `MaxPoolV2`                           | `T={double,float,int32,int64}`
 `Maximum`                             | `T={double,float,int32,int64}`
@@ -183,6 +195,7 @@ Operator                              | Type Constraint
 `Round`                               | `T={complex64,double,float,int32,int64}`
 `Rsqrt`                               | `T={complex64,double,float}`
 `RsqrtGrad`                           | `T={complex64,double,float}`
+`ScatterNd`                           | `Tindices={int32,int64}`<br>`T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Select`                              | `T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Selu`                                | `T={double,float}`
 `SeluGrad`                            | `T={double,float}`
@@ -195,6 +208,7 @@ Operator                              | Type Constraint
 `Sinh`                                | `T={complex64,double,float}`
 `Size`                                | `out_type={int32,int64}`<br>`T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Slice`                               | `Index={int32,int64}`<br>`T={bool,complex64,double,float,int32,int64,uint32,uint64}`
+`Snapshot`                            | `T={bool,complex64,double,float,int32,int64,uint32,uint64}`
 `Softmax`                             | `T={double,float}`
 `SoftmaxCrossEntropyWithLogits`       | `T={double,float}`
 `Softplus`                            | `T={double,float,int32,int64,uint32,uint64}`
diff --git a/tensorflow/compiler/tf2xla/graph_compiler.cc b/tensorflow/compiler/tf2xla/graph_compiler.cc
index 058a1f2621c64a735bd9d9c9d0ae007f93aa4dea..b20c1ffc7d8956f3f5530ee63e9b711a26439be5 100644
--- a/tensorflow/compiler/tf2xla/graph_compiler.cc
+++ b/tensorflow/compiler/tf2xla/graph_compiler.cc
@@ -130,7 +130,7 @@ Status GraphCompiler::Compile() {
     // Set up inputs from outputs of previous nodes.
     for (auto* e : n->in_edges()) {
       if (e->IsControlEdge()) continue;
-      Node* src = e->src();
+      const Node* src = e->src();
       TF_RET_CHECK(src->id() < output_registry.size());
       const NodeOutputs& src_outputs = output_registry[src->id()];
 
diff --git a/tensorflow/compiler/tf2xla/host_compute_metadata.proto b/tensorflow/compiler/tf2xla/host_compute_metadata.proto
new file mode 100644
index 0000000000000000000000000000000000000000..43ab371a217e6c4521a160715104c96e3c8782c6
--- /dev/null
+++ b/tensorflow/compiler/tf2xla/host_compute_metadata.proto
@@ -0,0 +1,38 @@
+syntax = "proto3";
+
+package tensorflow.tf2xla;
+option cc_enable_arenas = true;
+option java_outer_classname = "Tf2XlaProtos";
+option java_multiple_files = true;
+option java_package = "org.tensorflow.tf2xla";
+
+import "tensorflow/core/framework/tensor_shape.proto";
+import "tensorflow/core/framework/types.proto";
+
+// TensorMetadata indicates the type and shape of a Tensor that is
+// part of a host compute transfer.
+message TensorMetadata {
+  DataType type = 1;
+  TensorShapeProto shape = 2;
+}
+
+// HostTransferMetadata describes a transfer either from host to device
+// or device to host. It has a key that is unique to the computation,
+// and metadata about the list of tensors being transferred.
+message HostTransferMetadata {
+  // The key used to identify this transfer.
+  string key = 1;
+
+  // For each Tensor being transferred, its type and shape.
+  repeated TensorMetadata metadata = 2;
+}
+
+// HostComputeMetadata describes all the sends and recvs
+// from all host compute transfer ops in a computation.
+message HostComputeMetadata {
+  // Metadata about each device_to_host transfer.
+  repeated HostTransferMetadata device_to_host = 1;
+
+  // Metadata about each host_to_device transfer.
+  repeated HostTransferMetadata host_to_device = 2;
+}
diff --git a/tensorflow/compiler/tf2xla/kernels/BUILD b/tensorflow/compiler/tf2xla/kernels/BUILD
index d2fa933cf9c085f92b2f442827a94d72938e4bb2..0bbfe86de389ff6063b1f9604003f35b41d28e3b 100644
--- a/tensorflow/compiler/tf2xla/kernels/BUILD
+++ b/tensorflow/compiler/tf2xla/kernels/BUILD
@@ -93,6 +93,7 @@ tf_kernel_library(
         "shape_util.h",
     ],
     deps = [
+        ":if_op",
         ":while_op",
         "//tensorflow/compiler/tf2xla:common",
         "//tensorflow/compiler/tf2xla:xla_compiler",
@@ -154,6 +155,22 @@ tf_kernel_library(
     ],
 )
 
+tf_kernel_library(
+    name = "if_op",
+    srcs = ["if_op.cc"],
+    hdrs = ["if_op.h"],
+    deps = [
+        "//tensorflow/compiler/tf2xla:common",
+        "//tensorflow/compiler/tf2xla:xla_compiler",
+        "//tensorflow/compiler/tf2xla/ops:functional_ops",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla/client:computation_builder",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+    ],
+)
+
 # Kernels that only work on CPU, because they use XLA custom calls.
 # Only link this when using the CPU backend for XLA.
 tf_kernel_library(
diff --git a/tensorflow/compiler/tf2xla/kernels/batch_norm_op.cc b/tensorflow/compiler/tf2xla/kernels/batch_norm_op.cc
index a249b1869f547f8e5aa725f9f5cf391b10429928..931175be1111ed5f70afbdf351ee53c59c1367de 100644
--- a/tensorflow/compiler/tf2xla/kernels/batch_norm_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/batch_norm_op.cc
@@ -118,30 +118,24 @@ class FusedBatchNormGradOp : public XlaOpKernel {
   }
 
   void Compile(XlaOpKernelContext* ctx) override {
-    xla::ComputationBuilder* b = ctx->builder();
-
-    auto grad_backprop = ctx->Input(0);
-    auto activations = ctx->Input(1);
-    auto scale = ctx->Input(2);
-    auto mean = ctx->Input(3);
-    auto var = ctx->Input(4);
-
-    TensorShape input_shape = ctx->InputShape(0);
-    int feature_index =
-        GetTensorFeatureDimIndex(input_shape.dims(), data_format_);
-
+    xla::ComputationBuilder* const b = ctx->builder();
     DataType input_dtype = ctx->input_type(0);
     DataType scale_dtype = ctx->input_type(2);
-    xla::PrimitiveType input_type;
-    OP_REQUIRES_OK(ctx, DataTypeToPrimitiveType(input_dtype, &input_type));
-    xla::PrimitiveType scale_type;
-    OP_REQUIRES_OK(ctx, DataTypeToPrimitiveType(scale_dtype, &scale_type));
 
     // TODO(b/69928690): support mixed precision in the XLA batch normalization
     // operators. For now, cast everything to the statistics type (which
     // may be more precise than the input type).
-    grad_backprop = b->ConvertElementType(grad_backprop, scale_type);
-    activations = b->ConvertElementType(activations, scale_type);
+    auto grad_backprop =
+        XlaHelpers::ConvertElementType(b, ctx->Input(0), scale_dtype);
+    auto activations =
+        XlaHelpers::ConvertElementType(b, ctx->Input(1), scale_dtype);
+    auto scale = ctx->Input(2);
+    auto mean = ctx->Input(3);
+    auto var = ctx->Input(4);
+
+    const int input_dims = ctx->InputShape(0).dims();
+    const int feature_index =
+        GetTensorFeatureDimIndex(input_dims, data_format_);
 
     xla::ComputationDataHandle x_backprop;
     xla::ComputationDataHandle scale_backprop;
@@ -156,7 +150,7 @@ class FusedBatchNormGradOp : public XlaOpKernel {
       offset_backprop = b->GetTupleElement(output, 2);
     } else {
       // Reduce over all dimensions except the feature dim.
-      std::vector<int64> reduction_dims(input_shape.dims() - 1);
+      std::vector<int64> reduction_dims(input_dims - 1);
       std::iota(reduction_dims.begin(), reduction_dims.begin() + feature_index,
                 0);
       std::iota(reduction_dims.begin() + feature_index, reduction_dims.end(),
@@ -165,9 +159,14 @@ class FusedBatchNormGradOp : public XlaOpKernel {
       // scale_backprop = y_backprop * ((x - pop_mean) * rsqrt(pop_var +
       // epsilon))
       // x_backprop = y_backprop * (scale * rsqrt(pop_var + epsilon))
-      offset_backprop =
-          b->Reduce(grad_backprop, XlaHelpers::Zero(b, scale_dtype),
-                    *ctx->GetOrCreateAdd(scale_dtype), reduction_dims);
+      const DataType accumulation_type =
+          XlaHelpers::SumAccumulationType(scale_dtype);
+      auto converted =
+          XlaHelpers::ConvertElementType(b, grad_backprop, accumulation_type);
+      auto reduce =
+          b->Reduce(converted, XlaHelpers::Zero(b, accumulation_type),
+                    *ctx->GetOrCreateAdd(accumulation_type), reduction_dims);
+      offset_backprop = XlaHelpers::ConvertElementType(b, reduce, scale_dtype);
 
       // scratch1 = rsqrt(pop_var + epsilon)
       auto neg_half = XlaHelpers::FloatLiteral(b, scale_dtype, -0.5);
@@ -175,17 +174,21 @@ class FusedBatchNormGradOp : public XlaOpKernel {
           b->Pow(b->Add(var, b->ConstantR0<float>(epsilon_)), neg_half);
 
       // scratch2 = sum(y_backprop * (x - mean))
-      auto scratch2 = b->Reduce(
-          b->Mul(grad_backprop, b->Sub(activations, mean, {feature_index})),
-          XlaHelpers::Zero(b, scale_dtype), *ctx->GetOrCreateAdd(scale_dtype),
-          reduction_dims);
+      auto mul =
+          b->Mul(grad_backprop, b->Sub(activations, mean, {feature_index}));
+      converted = XlaHelpers::ConvertElementType(b, mul, accumulation_type);
+      reduce =
+          b->Reduce(converted, XlaHelpers::Zero(b, accumulation_type),
+                    *ctx->GetOrCreateAdd(accumulation_type), reduction_dims);
+      auto scratch2 = XlaHelpers::ConvertElementType(b, reduce, scale_dtype);
 
       x_backprop =
           b->Mul(grad_backprop, b->Mul(scratch1, scale), {feature_index});
       scale_backprop = b->Mul(scratch1, scratch2);
     }
 
-    ctx->SetOutput(0, b->ConvertElementType(x_backprop, input_type));
+    ctx->SetOutput(0,
+                   XlaHelpers::ConvertElementType(b, x_backprop, input_dtype));
     ctx->SetOutput(1, scale_backprop);
     ctx->SetOutput(2, offset_backprop);
     ctx->SetConstantOutput(3, Tensor(scale_dtype, {}));
diff --git a/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc b/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc
index cbade79e85eed10ecb5ead7151ee778c86a0de37..569950c2dfaeb61028049a263a962dfa54a62e09 100644
--- a/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc
@@ -184,9 +184,7 @@ class BatchToSpaceOp : public XlaOpKernel {
  private:
   int block_size_;
 };
-REGISTER_XLA_OP(Name("BatchToSpace")
-                    .CompileTimeConstInput("crops")
-                    .CompileTimeConstInput("block_shape"),
+REGISTER_XLA_OP(Name("BatchToSpace").CompileTimeConstInput("crops"),
                 BatchToSpaceOp);
 
 }  // namespace
diff --git a/tensorflow/compiler/tf2xla/kernels/bias_ops.cc b/tensorflow/compiler/tf2xla/kernels/bias_ops.cc
index c667b4e3e326b776faba49387760abbd582fcc68..ed33b8ed2e823f313a9a7fe220390bc617288405 100644
--- a/tensorflow/compiler/tf2xla/kernels/bias_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/bias_ops.cc
@@ -103,10 +103,15 @@ class BiasAddGradOp : public XlaOpKernel {
     std::iota(reduce_dims.begin(), reduce_dims.begin() + feature_dim, 0);
     std::iota(reduce_dims.begin() + feature_dim, reduce_dims.end(),
               feature_dim + 1);
-    xla::ComputationDataHandle result = ctx->builder()->Reduce(
-        ctx->Input(0), XlaHelpers::Zero(ctx->builder(), input_type(0)),
-        *ctx->GetOrCreateAdd(input_type(0)), reduce_dims);
-    ctx->SetOutput(0, result);
+    xla::ComputationBuilder* const b = ctx->builder();
+    const DataType accumulation_type =
+        XlaHelpers::SumAccumulationType(input_type(0));
+    auto converted =
+        XlaHelpers::ConvertElementType(b, ctx->Input(0), accumulation_type);
+    auto reduce =
+        b->Reduce(converted, XlaHelpers::Zero(b, accumulation_type),
+                  *ctx->GetOrCreateAdd(accumulation_type), reduce_dims);
+    ctx->SetOutput(0, XlaHelpers::ConvertElementType(b, reduce, input_type(0)));
   }
 
  private:
diff --git a/tensorflow/compiler/tf2xla/kernels/conv_ops.cc b/tensorflow/compiler/tf2xla/kernels/conv_ops.cc
index 81cea6d376d02c956a5257c5475fe5c10b83deb9..c0ee0c9c2ea849a692bee70bba36d32335eed9b5 100644
--- a/tensorflow/compiler/tf2xla/kernels/conv_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/conv_ops.cc
@@ -58,7 +58,7 @@ xla::ComputationDataHandle CreateExpandedZero(
 
 // Create a mask for depthwise convolution that will make a normal convolution
 // produce the same results as a depthwise convolution. For a [2, 2, 3, 2]
-// depthwise filter this returns a [2, 2, 3, 6] tesnsor
+// depthwise filter this returns a [2, 2, 3, 6] tensor
 //   1 1 0 0 0 0   1 1 0 0 0 0
 //   0 0 1 1 0 0   0 0 1 1 0 0
 //   0 0 0 0 1 1   0 0 0 0 1 1
@@ -166,6 +166,10 @@ xla::ComputationDataHandle ContractFilterForDepthwiseBackprop(
       CreateExpandedFilterMask(filter_shape, builder), filter_backprop,
       CreateExpandedZero(filter_shape, dtype, builder));
   return builder->Reshape(
+      // This reduce does not need inputs to be converted with
+      // XlaHelpers::SumAccumulationType() since the ExpandedFilterMask with
+      // ExpandedZero guarantees that only one element is non zero, so there
+      // cannot be accumulated precision error.
       builder->Reduce(masked_expanded_filter, XlaHelpers::Zero(builder, dtype),
                       *ctx->GetOrCreateAdd(dtype),
                       {expanded_filter_shape.dims() - 2}),
diff --git a/tensorflow/compiler/tf2xla/kernels/fake_quantize_ops.cc b/tensorflow/compiler/tf2xla/kernels/fake_quantize_ops.cc
index 453a32c494b42e9922bc35fc526f3306530054fd..99470d70e709ddb5593c5eaae061bb897befc168 100644
--- a/tensorflow/compiler/tf2xla/kernels/fake_quantize_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/fake_quantize_ops.cc
@@ -247,6 +247,8 @@ class FakeQuantWithMinMaxVarsGradOp : public XlaOpKernel {
     const TensorShape gradient_shape = ctx->InputShape(0);
     xla::ComputationDataHandle input = ctx->Input(1);
     const DataType data_type = ctx->input_type(1);
+    const DataType accumulation_type =
+        XlaHelpers::SumAccumulationType(data_type);
     xla::ComputationDataHandle input_min = ctx->Input(2);
     xla::ComputationDataHandle input_max = ctx->Input(3);
 
@@ -265,15 +267,23 @@ class FakeQuantWithMinMaxVarsGradOp : public XlaOpKernel {
     ctx->SetOutput(0, output0);
 
     xla::ComputationDataHandle below_min = b->Lt(input, nudged_input_min);
+    xla::ComputationDataHandle select1 = b->Select(below_min, gradient, zeroes);
+    xla::ComputationDataHandle reduce1 = b->ReduceAll(
+        XlaHelpers::ConvertElementType(b, select1, accumulation_type),
+        XlaHelpers::Zero(b, accumulation_type),
+        *ctx->GetOrCreateAdd(accumulation_type));
     xla::ComputationDataHandle output1 =
-        b->ReduceAll(b->Select(below_min, gradient, zeroes), zero,
-                     *ctx->GetOrCreateAdd(data_type));
+        XlaHelpers::ConvertElementType(b, reduce1, data_type);
     ctx->SetOutput(1, output1);
 
     xla::ComputationDataHandle above_max = b->Gt(input, nudged_input_max);
+    xla::ComputationDataHandle select2 = b->Select(above_max, gradient, zeroes);
+    xla::ComputationDataHandle reduce2 = b->ReduceAll(
+        XlaHelpers::ConvertElementType(b, select2, accumulation_type),
+        XlaHelpers::Zero(b, accumulation_type),
+        *ctx->GetOrCreateAdd(accumulation_type));
     xla::ComputationDataHandle output2 =
-        b->ReduceAll(b->Select(above_max, gradient, zeroes), zero,
-                     *ctx->GetOrCreateAdd(data_type));
+        XlaHelpers::ConvertElementType(b, reduce2, data_type);
     ctx->SetOutput(2, output2);
   }
 
diff --git a/tensorflow/compiler/tf2xla/kernels/if_op.cc b/tensorflow/compiler/tf2xla/kernels/if_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..eefbe55c815d80a608bdf62d454a69d722adb158
--- /dev/null
+++ b/tensorflow/compiler/tf2xla/kernels/if_op.cc
@@ -0,0 +1,226 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/tf2xla/kernels/if_op.h"
+
+#include "tensorflow/compiler/tf2xla/shape_util.h"
+#include "tensorflow/compiler/tf2xla/xla_context.h"
+#include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
+#include "tensorflow/compiler/tf2xla/xla_op_registry.h"
+
+namespace tensorflow {
+
+XlaIfOp::XlaIfOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {
+  const NameAttrList* name_attr;
+  OP_REQUIRES_OK(ctx, ctx->GetAttr("then_branch", &name_attr));
+  then_branch_ = *name_attr;
+  OP_REQUIRES_OK(ctx, ctx->GetAttr("else_branch", &name_attr));
+  else_branch_ = *name_attr;
+
+  OP_REQUIRES_OK(ctx, ctx->GetAttr("Tcond", &cond_type_));
+  OP_REQUIRES_OK(ctx, ctx->GetAttr("Tin", &input_types_));
+  OP_REQUIRES_OK(ctx, ctx->GetAttr("Tout", &output_types_));
+}
+
+// TODO(b/35949885): There is duplication here with the handling of the
+// while_op. Refactor the common code out/rework.
+void XlaIfOp::Compile(XlaOpKernelContext* ctx) {
+  xla::ComputationBuilder* b = ctx->builder();
+
+  OP_REQUIRES(ctx, cond_type_ == DT_BOOL,
+              errors::InvalidArgument(
+                  "Condition argument must be a boolean for XLA compilation"));
+  OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(ctx->InputShape(0)),
+              errors::InvalidArgument(
+                  "Condition argument must be a scalar for XLA compilation"));
+
+  VLOG(1) << "Building If: " << input_types_.size() << " inputs";
+
+  std::vector<xla::ComputationDataHandle> inputs(input_types_.size());
+  std::vector<XlaCompiler::Argument> arguments(input_types_.size());
+  for (int i = 0; i < input_types_.size(); ++i) {
+    XlaCompiler::Argument& arg = arguments[i];
+    DataType type = ctx->input_type(i + 1);
+    if (type == DT_RESOURCE) {
+      XlaResource* resource;
+      OP_REQUIRES_OK(ctx, ctx->GetResourceInput(i + 1, &resource));
+
+      arg.initialized = resource->initialized();
+      arg.kind = XlaCompiler::Argument::kResource;
+      arg.resource_kind = resource->kind();
+      arg.name = resource->name();
+      OP_REQUIRES_OK(ctx, resource->Pack(&inputs[i], b));
+
+      arg.type = resource->type();
+      arg.shape = resource->shape();
+      OP_REQUIRES(ctx, arg.initialized,
+                  errors::Unimplemented("Uninitialized arguments: ", arg.name));
+      arg.tensor_array_size = resource->tensor_array_size();
+      for (const auto& gradient : resource->tensor_array_gradients()) {
+        arg.tensor_array_gradients.insert(gradient.first);
+      }
+      VLOG(2) << "Resource " << resource->name()
+              << " type: " << DataTypeString(arg.type)
+              << " shape: " << arg.shape.DebugString()
+              << " initialized: " << arg.initialized;
+    } else {
+      arg.kind = XlaCompiler::Argument::kParameter;
+      arg.type = input_types_[i];
+      arg.shape = ctx->InputShape(i + 1);
+      inputs[i] = ctx->Input(i + 1);
+      VLOG(2) << "Arg type: " << DataTypeString(arg.type)
+              << " shape: " << arg.shape.DebugString();
+    }
+  }
+
+  // Compile both branches of the conditional.
+  XlaCompiler::CompileOptions options;
+  options.use_tuple_arg = true;
+  options.resolve_compile_time_constants = false;
+  options.return_updated_values_for_all_resources = true;
+  options.is_entry_computation = false;
+  XlaCompiler* compiler = ctx->compiler();
+
+  XlaCompiler::CompilationResult then_result;
+  OP_REQUIRES_OK(ctx, compiler->CompileFunction(options, then_branch_,
+                                                arguments, &then_result));
+  XlaCompiler::CompilationResult else_result;
+  OP_REQUIRES_OK(ctx, compiler->CompileFunction(options, else_branch_,
+                                                arguments, &else_result));
+
+  for (XlaCompiler::CompilationResult* result : {&then_result, &else_result}) {
+    for (const XlaCompiler::ResourceUpdate& update : result->resource_updates) {
+      XlaResource* resource;
+      OP_REQUIRES_OK(ctx,
+                     ctx->GetResourceInput(update.input_index + 1, &resource));
+      XlaCompiler::Argument& arg = arguments[update.input_index];
+
+      // Add any TensorArray gradients touched by the then/else computation to
+      // the enclosing graph.
+      for (const string& grad_source : update.tensor_array_gradients_accessed) {
+        VLOG(5) << "TensorArray " << resource->name() << " accessed gradient "
+                << grad_source;
+        XlaResource* gradient;
+        OP_REQUIRES_OK(ctx, resource->GetOrCreateTensorArrayGradient(
+                                grad_source, b, &gradient));
+      }
+      // Add all of the TensorArray gradients to the argument. For simplicity,
+      // we always pass all known gradients.
+      for (const auto& gradient : resource->tensor_array_gradients()) {
+        arg.tensor_array_gradients.insert(gradient.first);
+      }
+    }
+  }
+
+  // Check that both branches have identical input shapes.
+  OP_REQUIRES(ctx, then_result.xla_input_shapes.size() == 1,
+              errors::FailedPrecondition("Expected one input shape"));
+  xla::Shape then_input_shape = then_result.xla_input_shapes[0];
+  OP_REQUIRES(ctx, xla::ShapeUtil::IsTuple(then_input_shape),
+              errors::FailedPrecondition("Expected tuple shape"));
+  OP_REQUIRES(ctx, else_result.xla_input_shapes.size() == 1,
+              errors::FailedPrecondition("Expected one input shape"));
+  xla::Shape else_input_shape = else_result.xla_input_shapes[0];
+  OP_REQUIRES(ctx, xla::ShapeUtil::IsTuple(else_input_shape),
+              errors::FailedPrecondition("Expected tuple shape"));
+  OP_REQUIRES(ctx,
+              xla::ShapeUtil::Compatible(then_input_shape, else_input_shape),
+              errors::InvalidArgument(
+                  "Input shapes of then and else branches do not match: ",
+                  xla::ShapeUtil::HumanString(then_input_shape), " vs. ",
+                  xla::ShapeUtil::HumanString(else_input_shape)));
+
+  // Check that both branches have identical output shapes.
+  OP_REQUIRES(
+      ctx,
+      xla::ShapeUtil::Compatible(then_result.xla_output_shape,
+                                 else_result.xla_output_shape),
+      errors::InvalidArgument(
+          "Output shapes of then and else branches do not match: ",
+          xla::ShapeUtil::HumanString(then_result.xla_output_shape), " vs. ",
+          xla::ShapeUtil::HumanString(else_result.xla_output_shape)));
+
+  VLOG(2) << "Input shape: " << xla::ShapeUtil::HumanString(then_input_shape);
+  VLOG(2) << "Output shape: "
+          << xla::ShapeUtil::HumanString(then_result.xla_output_shape);
+
+  // We set return_updated_values_for_all_resources=true and we pass the same
+  // arguments to both computations, so the resource update count must match.
+  OP_REQUIRES(ctx,
+              then_result.resource_updates.size() ==
+                  else_result.resource_updates.size(),
+              errors::FailedPrecondition(
+                  "Different number of resources in then and else branch"));
+  for (int i = 0; i < then_result.resource_updates.size(); ++i) {
+    const auto& lhs = then_result.resource_updates[i];
+    const auto& rhs = else_result.resource_updates[i];
+    bool equal = lhs.input_index == rhs.input_index && lhs.shape == rhs.shape &&
+                 lhs.tensor_array_gradients_accessed ==
+                     rhs.tensor_array_gradients_accessed;
+    OP_REQUIRES(
+        ctx, equal,
+        errors::FailedPrecondition(
+            "Mismatch in resource of then and else branch for resource ", i));
+  }
+
+  xla::ComputationDataHandle outputs =
+      b->Conditional(ctx->Input(0), b->Tuple(inputs), *then_result.computation,
+                     b->Tuple(inputs), *else_result.computation);
+  // Sets non-variable outputs.
+  for (int i = 0; i < output_types_.size(); ++i) {
+    if (ctx->input_type(i) != DT_RESOURCE) {
+      xla::ComputationDataHandle output_handle = b->GetTupleElement(outputs, i);
+      if (VLOG_IS_ON(2)) {
+        LOG(INFO) << "Setting output " << i;
+        auto shape_or = b->GetShape(output_handle);
+        if (shape_or.ok()) {
+          LOG(INFO) << "Shape for output " << i << ": "
+                    << xla::ShapeUtil::HumanString(*shape_or.ValueOrDie());
+        } else {
+          LOG(INFO) << "Shape unknown for output " << i;
+        }
+      }
+      ctx->SetOutput(i, output_handle);
+    }
+  }
+
+  // Updates the values of any resource variables modified by the conditional
+  // bodies.
+  for (XlaCompiler::CompilationResult* result : {&then_result, &else_result}) {
+    for (int i = 0; i < result->resource_updates.size(); ++i) {
+      const XlaCompiler::ResourceUpdate& update = result->resource_updates[i];
+      XlaResource* resource;
+      OP_REQUIRES_OK(ctx,
+                     ctx->GetResourceInput(update.input_index + 1, &resource));
+      if (update.modified) {
+        int pos = result->outputs.size() + i;
+        OP_REQUIRES_OK(ctx,
+                       resource->SetFromPack(
+                           arguments[update.input_index].tensor_array_gradients,
+                           b->GetTupleElement(outputs, pos), b));
+      }
+      VLOG(2) << "If variable: pos: " << update.input_index
+              << " name: " << resource->name()
+              << " modified: " << update.modified
+              << " type: " << DataTypeString(update.type)
+              << " shape: " << update.shape.DebugString();
+    }
+  }
+  VLOG(1) << "Done building If";
+}
+
+REGISTER_XLA_OP(Name("XlaIf").AllowResourceTypes(), XlaIfOp);
+
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/if_op.h b/tensorflow/compiler/tf2xla/kernels/if_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..f9bc98a198a72dcc0594e61971713bf890ce30b6
--- /dev/null
+++ b/tensorflow/compiler/tf2xla/kernels/if_op.h
@@ -0,0 +1,59 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_TF2XLA_KERNELS_IF_OP_H_
+#define TENSORFLOW_COMPILER_TF2XLA_KERNELS_IF_OP_H_
+
+#include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
+#include "tensorflow/core/framework/attr_value.pb.h"
+
+namespace tensorflow {
+
+// This TensorFlow op provides a functional conditional primitive.
+//
+// The then/else branches must agree on the number, types, and shapes of the
+// Tensors they return.
+//
+// Computations in then/else bodies may read from and write to resource
+// variables.
+// Resource variables may be passed as arguments to the then/else function's
+// bodies. The XlaCompiler converts resource variable arguments
+// into parameters to the XLA computation and moves them to the end of the
+// parameter list. Setting `return_updated_values_for_all_resources` in the
+// compile options ensures that all variables that appear in the input also
+// appear at the end of each body's output, so the output signatures of the
+// then/else bodies match.
+//
+// It is the user's responsibility to ensure that each non-variable _Arg matches
+// the corresponding _Retval.
+class XlaIfOp : public XlaOpKernel {
+ public:
+  explicit XlaIfOp(OpKernelConstruction* ctx);
+
+  void Compile(XlaOpKernelContext* ctx) override;
+
+ private:
+  TF_DISALLOW_COPY_AND_ASSIGN(XlaIfOp);
+
+  NameAttrList then_branch_;
+  NameAttrList else_branch_;
+  DataType cond_type_;
+  DataTypeVector input_types_;
+  DataTypeVector output_types_;
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_COMPILER_TF2XLA_KERNELS_IF_OP_H_
diff --git a/tensorflow/compiler/tf2xla/kernels/image_ops.cc b/tensorflow/compiler/tf2xla/kernels/image_ops.cc
index f22f384256a8ddd8c05de4a1322aba741dc4d7fd..5eeda79a935e8194a596d322b52add27846d378c 100644
--- a/tensorflow/compiler/tf2xla/kernels/image_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/image_ops.cc
@@ -180,9 +180,13 @@ class AdjustContrastOpV2 : public XlaOpKernel {
 
     DataType type = context->input_type(0);
 
-    auto output = b->Reduce(input, /*init_value=*/XlaHelpers::Zero(b, type),
-                            /*computation=*/*context->GetOrCreateAdd(type),
+    const DataType accumulation_type = XlaHelpers::SumAccumulationType(type);
+    auto converted =
+        XlaHelpers::ConvertElementType(b, input, accumulation_type);
+    auto reduce = b->Reduce(converted, XlaHelpers::Zero(b, accumulation_type),
+                            *context->GetOrCreateAdd(accumulation_type),
                             {height_dim, width_dim});
+    auto output = XlaHelpers::ConvertElementType(b, reduce, type);
     output = b->Div(output, XlaHelpers::FloatLiteral(b, type, height * width));
 
     std::vector<int64> broadcast_dims(input_shape.dims() - 2);
diff --git a/tensorflow/compiler/tf2xla/kernels/l2loss_op.cc b/tensorflow/compiler/tf2xla/kernels/l2loss_op.cc
index d096415087e47a73503a06526ab133ac34803c5d..c177f08d9c4687bb13b98a4328bb3960519799c4 100644
--- a/tensorflow/compiler/tf2xla/kernels/l2loss_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/l2loss_op.cc
@@ -29,21 +29,22 @@ class L2LossOp : public XlaOpKernel {
   explicit L2LossOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}
 
   void Compile(XlaOpKernelContext* ctx) override {
-    const TensorShape input_shape = ctx->InputShape(0);
+    std::vector<int64> dims(ctx->InputShape(0).dims());
+    std::iota(dims.begin(), dims.end(), 0);
 
     DataType dtype = ctx->input_type(0);
-    xla::ComputationBuilder* b = ctx->builder();
-
-    auto zero = XlaHelpers::Zero(b, dtype);
-    auto two = XlaHelpers::IntegerLiteral(b, dtype, 2);
-    const xla::Computation& add = *ctx->GetOrCreateAdd(dtype);
-
-    std::vector<int64> dims(input_shape.dims());
-    std::iota(dims.begin(), dims.end(), 0);
+    xla::ComputationBuilder* const b = ctx->builder();
 
     //  output = sum(t ** 2) / 2
-    auto x = ctx->Input(0);
-    ctx->SetOutput(0, b->Div(b->Reduce(b->Mul(x, x), zero, add, dims), two));
+    const DataType accumulation_type = XlaHelpers::SumAccumulationType(dtype);
+    auto t =
+        XlaHelpers::ConvertElementType(b, ctx->Input(0), accumulation_type);
+    auto square = b->Mul(t, t);
+    auto reduce = b->Reduce(square, XlaHelpers::Zero(b, accumulation_type),
+                            *ctx->GetOrCreateAdd(accumulation_type), dims);
+    auto deconverted = XlaHelpers::ConvertElementType(b, reduce, dtype);
+    auto two = XlaHelpers::IntegerLiteral(b, dtype, 2);
+    ctx->SetOutput(0, b->Div(deconverted, two));
   }
 };
 
diff --git a/tensorflow/compiler/tf2xla/kernels/lrn_ops.cc b/tensorflow/compiler/tf2xla/kernels/lrn_ops.cc
index 759d1a1a2d996d4f5deb1774be7014bb6de30f40..1cfee3070f384af0a7441a9c860c530dd1b42187 100644
--- a/tensorflow/compiler/tf2xla/kernels/lrn_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/lrn_ops.cc
@@ -47,12 +47,17 @@ class LRNOp : public XlaOpKernel {
 
     // We use a window of depth_radius_ * 2 + 1, to account for the current
     // element and a depth_radius_ on either side.
-    auto squared = builder->Mul(input, input);
-    auto sqr_sum = builder->ReduceWindow(
-        squared, XlaHelpers::Zero(builder, input_type(0)),
-        *ctx->GetOrCreateAdd(input_type(0)),
+    auto accumulation_type = XlaHelpers::SumAccumulationType(input_type(0));
+    auto converted =
+        XlaHelpers::ConvertElementType(builder, input, accumulation_type);
+    auto squared = builder->Mul(converted, converted);
+    auto reduce = builder->ReduceWindow(
+        squared, XlaHelpers::Zero(builder, accumulation_type),
+        *ctx->GetOrCreateAdd(accumulation_type),
         /* window_dimensions = */ {1, 1, 1, depth_radius_ * 2 + 1},
         /* window_strides = */ {1, 1, 1, 1}, xla::Padding::kSame);
+    auto sqr_sum =
+        XlaHelpers::ConvertElementType(builder, reduce, input_type(0));
 
     auto scale = builder->Pow(
         builder->Add(builder->ConstantR0<float>(bias_),
@@ -130,12 +135,17 @@ class LRNGradOp : public XlaOpKernel {
     //     dyi *= out_grads[j]
     //     grads[k] += dyi
 
-    auto squared = builder->Mul(in_image, in_image);
-    auto sqr_sum = builder->ReduceWindow(
-        squared, XlaHelpers::Zero(builder, input_type(0)),
-        *ctx->GetOrCreateAdd(input_type(0)),
+    auto accumulation_type = XlaHelpers::SumAccumulationType(input_type(0));
+    auto converted =
+        XlaHelpers::ConvertElementType(builder, in_image, accumulation_type);
+    auto squared = builder->Mul(converted, converted);
+    auto reduce = builder->ReduceWindow(
+        squared, XlaHelpers::Zero(builder, accumulation_type),
+        *ctx->GetOrCreateAdd(accumulation_type),
         /* window_dimensions = */ {1, 1, 1, depth_radius_ * 2 + 1},
         /* window_strides = */ {1, 1, 1, 1}, xla::Padding::kSame);
+    auto sqr_sum =
+        XlaHelpers::ConvertElementType(builder, reduce, input_type(0));
 
     auto norm =
         builder->Add(builder->ConstantR0<float>(bias_),
@@ -146,11 +156,15 @@ class LRNGradOp : public XlaOpKernel {
                      builder->Div(out_image, norm)),
         in_grads);
 
-    auto dy_reduced = builder->ReduceWindow(
-        dy, XlaHelpers::Zero(builder, input_type(0)),
-        *ctx->GetOrCreateAdd(input_type(0)),
+    auto converted_dy =
+        XlaHelpers::ConvertElementType(builder, dy, accumulation_type);
+    auto dy_reduce = builder->ReduceWindow(
+        converted_dy, XlaHelpers::Zero(builder, accumulation_type),
+        *ctx->GetOrCreateAdd(accumulation_type),
         /* window_dimensions = */ {1, 1, 1, depth_radius_ * 2 + 1},
         /* window_strides = */ {1, 1, 1, 1}, xla::Padding::kSame);
+    auto dy_reduced =
+        XlaHelpers::ConvertElementType(builder, dy_reduce, input_type(0));
 
     xla::ComputationDataHandle gradients = builder->Add(
         builder->Mul(in_image, dy_reduced),
diff --git a/tensorflow/compiler/tf2xla/kernels/pooling_ops.cc b/tensorflow/compiler/tf2xla/kernels/pooling_ops.cc
index d4fb5dd4e06c7c70591262c0d63a91c383a2a6e0..5f635dd1bc6122cfcac8163baafd95b13f157715 100644
--- a/tensorflow/compiler/tf2xla/kernels/pooling_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/pooling_ops.cc
@@ -35,8 +35,11 @@ namespace {
 // Superclass of pooling ops.
 class PoolingOp : public XlaOpKernel {
  public:
-  PoolingOp(OpKernelConstruction* ctx, int num_spatial_dims)
-      : XlaOpKernel(ctx), num_spatial_dims_(num_spatial_dims) {
+  PoolingOp(OpKernelConstruction* ctx, int num_spatial_dims,
+            const DataType reduction_type)
+      : XlaOpKernel(ctx),
+        num_spatial_dims_(num_spatial_dims),
+        reduction_type_(reduction_type) {
     if (ctx->num_inputs() == 1) {
       std::vector<int32> ksize_int;
       std::vector<int32> stride_int;
@@ -63,12 +66,10 @@ class PoolingOp : public XlaOpKernel {
   int num_dims() const { return num_spatial_dims_ + 2; }
 
   // Method that builds an initial value to use in reductions.
-  virtual xla::ComputationDataHandle InitValue(xla::ComputationBuilder* b,
-                                               DataType data_type) = 0;
+  virtual xla::ComputationDataHandle InitValue(xla::ComputationBuilder* b) = 0;
 
   // The reduction operation to apply to each window.
-  virtual const xla::Computation* Reduction(XlaOpKernelContext* ctx,
-                                            DataType dtype) = 0;
+  virtual const xla::Computation* Reduction(XlaOpKernelContext* ctx) = 0;
 
   // A post-processing operation to apply on the outputs of the ReduceWindow.
   virtual xla::ComputationDataHandle PostProcessOutput(
@@ -76,9 +77,6 @@ class PoolingOp : public XlaOpKernel {
       DataType dtype, const TensorShape& input_shape) = 0;
 
   void Compile(XlaOpKernelContext* ctx) override {
-    xla::ComputationDataHandle input = ctx->Input(0);
-    const TensorShape input_shape = ctx->InputShape(0);
-
     std::vector<int64> ksize = ksize_;
     std::vector<int64> stride = stride_;
     if (ctx->num_inputs() != 1) {
@@ -106,16 +104,20 @@ class PoolingOp : public XlaOpKernel {
       stride.clear();
       OP_REQUIRES_OK(ctx, ctx->ConstantInputAsIntVector(2, &stride));
     }
+    const TensorShape input_shape = ctx->InputShape(0);
     OP_REQUIRES(ctx, input_shape.dims() == num_dims(),
                 errors::InvalidArgument("Input to ", type_string(),
                                         " operator must have ", num_dims(),
                                         " dimensions"));
 
-    const DataType type = input_type(0);
-    xla::ComputationDataHandle pooled = ctx->builder()->ReduceWindow(
-        input, InitValue(ctx->builder(), type), *Reduction(ctx, type), ksize,
-        stride, padding_);
-    ctx->SetOutput(0, PostProcessOutput(ctx, pooled, type, input_shape));
+    xla::ComputationBuilder* const b = ctx->builder();
+    auto input =
+        XlaHelpers::ConvertElementType(b, ctx->Input(0), reduction_type_);
+    auto reduce = ctx->builder()->ReduceWindow(
+        input, InitValue(b), *Reduction(ctx), ksize, stride, padding_);
+    auto pooled = XlaHelpers::ConvertElementType(b, reduce, input_type(0));
+    ctx->SetOutput(0,
+                   PostProcessOutput(ctx, pooled, input_type(0), input_shape));
   }
 
  protected:
@@ -124,21 +126,21 @@ class PoolingOp : public XlaOpKernel {
   std::vector<int64> stride_;
   xla::Padding padding_;
   TensorFormat data_format_ = FORMAT_NHWC;
+  DataType reduction_type_;
 };
 
 class MaxPoolOp : public PoolingOp {
  public:
   MaxPoolOp(OpKernelConstruction* ctx, int num_spatial_dims)
-      : PoolingOp(ctx, /*num_spatial_dims=*/num_spatial_dims) {}
+      : PoolingOp(ctx, /*num_spatial_dims=*/num_spatial_dims,
+                  /*reduction_type=*/ctx->input_type(0)) {}
 
-  xla::ComputationDataHandle InitValue(xla::ComputationBuilder* b,
-                                       DataType data_type) override {
-    return XlaHelpers::MinValue(b, data_type);
+  xla::ComputationDataHandle InitValue(xla::ComputationBuilder* b) override {
+    return XlaHelpers::MinValue(b, reduction_type_);
   }
 
-  const xla::Computation* Reduction(XlaOpKernelContext* ctx,
-                                    DataType dtype) override {
-    return ctx->GetOrCreateMax(dtype);
+  const xla::Computation* Reduction(XlaOpKernelContext* ctx) override {
+    return ctx->GetOrCreateMax(reduction_type_);
   }
 
   xla::ComputationDataHandle PostProcessOutput(
@@ -209,15 +211,17 @@ static xla::ComputationDataHandle AvgPoolDivideByCount(
     }
 
     // Build a matrix of all 1s, with the same width/height as the input.
+    const DataType accumulation_type = XlaHelpers::SumAccumulationType(dtype);
     auto ones = ctx->builder()->Broadcast(
-        XlaHelpers::One(ctx->builder(), dtype), input_dim_sizes);
+        XlaHelpers::One(ctx->builder(), accumulation_type), input_dim_sizes);
 
     // Perform a ReduceWindow with the same window size, strides, and padding
     // to count the number of contributions to each result element.
-    auto counts = ctx->builder()->ReduceWindow(
-        ones, XlaHelpers::Zero(ctx->builder(), dtype),
-        *ctx->GetOrCreateAdd(dtype), window_ksize, window_stride,
+    auto reduce = ctx->builder()->ReduceWindow(
+        ones, XlaHelpers::Zero(ctx->builder(), accumulation_type),
+        *ctx->GetOrCreateAdd(accumulation_type), window_ksize, window_stride,
         xla::Padding::kSame);
+    auto counts = XlaHelpers::ConvertElementType(ctx->builder(), reduce, dtype);
 
     return ctx->builder()->Div(output, counts, window_dims);
   }
@@ -226,16 +230,16 @@ static xla::ComputationDataHandle AvgPoolDivideByCount(
 class AvgPoolOp : public PoolingOp {
  public:
   AvgPoolOp(OpKernelConstruction* ctx, int num_spatial_dims)
-      : PoolingOp(ctx, num_spatial_dims) {}
+      : PoolingOp(ctx, /*num_spatial_dims=*/num_spatial_dims,
+                  /*reduction_type=*/
+                  XlaHelpers::SumAccumulationType(ctx->input_type(0))) {}
 
-  xla::ComputationDataHandle InitValue(xla::ComputationBuilder* b,
-                                       DataType data_type) override {
-    return XlaHelpers::Zero(b, data_type);
+  xla::ComputationDataHandle InitValue(xla::ComputationBuilder* b) override {
+    return XlaHelpers::Zero(b, reduction_type_);
   }
 
-  const xla::Computation* Reduction(XlaOpKernelContext* ctx,
-                                    DataType dtype) override {
-    return ctx->GetOrCreateAdd(dtype);
+  const xla::Computation* Reduction(XlaOpKernelContext* ctx) override {
+    return ctx->GetOrCreateAdd(reduction_type_);
   }
 
   xla::ComputationDataHandle PostProcessOutput(
@@ -455,14 +459,12 @@ class AvgPoolGradOp : public XlaOpKernel {
                  gradients_shape, filter_shape, out_backprop_shape, stride_,
                  padding_, data_format_, &dims));
 
+    // The input gradients are computed by a convolution of the output gradients
+    // and the filter, with some appropriate padding. See the comment at the top
+    // of conv_grad_ops.h for details.
+    xla::ComputationBuilder* const b = ctx->builder();
     auto out_backprop = ctx->Input(1);
-
-    // The input gradients are computed by a convolution of the output
-    // gradients
-    // and the filter, with some appropriate padding. See the comment at
-    // the top of conv_grad_ops.h for details.
-    DataType dtype = input_type(1);
-
+    auto dtype = input_type(1);
     xla::Padding xla_padding =
         (padding_ == VALID) ? xla::Padding::kValid : xla::Padding::kSame;
 
@@ -483,17 +485,18 @@ class AvgPoolGradOp : public XlaOpKernel {
       padding->set_interior_padding(dims.spatial_dims[i].stride - 1);
     }
 
-    auto zero = XlaHelpers::Zero(ctx->builder(), dtype);
-    auto padded_gradients =
-        ctx->builder()->Pad(out_backprop_div, zero, padding_config);
+    auto zero = XlaHelpers::Zero(b, dtype);
+    auto padded_gradients = b->Pad(out_backprop_div, zero, padding_config);
 
     // in_backprop = padded_gradients <conv> ones
     std::vector<int64> ones(num_dims(), 1LL);
-    xla::ComputationDataHandle in_backprop = ctx->builder()->ReduceWindow(
-        padded_gradients, zero, *ctx->GetOrCreateAdd(dtype), ksize_,
+    auto accumulation_type = XlaHelpers::SumAccumulationType(dtype);
+    auto in_backprop = b->ReduceWindow(
+        XlaHelpers::ConvertElementType(b, padded_gradients, accumulation_type),
+        XlaHelpers::Zero(b, accumulation_type),
+        *ctx->GetOrCreateAdd(accumulation_type), ksize_,
         /* window_strides=*/ones, xla::Padding::kValid);
-
-    ctx->SetOutput(0, in_backprop);
+    ctx->SetOutput(0, XlaHelpers::ConvertElementType(b, in_backprop, dtype));
   }
 
  protected:
@@ -525,5 +528,172 @@ class AvgPool3DGradOp : public AvgPoolGradOp {
 REGISTER_XLA_OP(Name("AvgPool3DGrad").CompileTimeConstInput("orig_input_shape"),
                 AvgPool3DGradOp);
 
+class MaxPoolGradGradOp : public XlaOpKernel {
+ public:
+  MaxPoolGradGradOp(OpKernelConstruction* ctx, int num_spatial_dims)
+      : XlaOpKernel(ctx), num_spatial_dims_(num_spatial_dims) {
+    if (ctx->num_inputs() == 3) {
+      OP_REQUIRES_OK(ctx, ctx->GetAttr("ksize", &ksize_));
+      OP_REQUIRES_OK(ctx, ctx->GetAttr("strides", &stride_));
+    }
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("padding", &padding_));
+  }
+
+  int num_dims() const { return num_spatial_dims_ + 2; }
+
+  void Compile(XlaOpKernelContext* ctx) override {
+    if (ctx->num_inputs() != 3) {
+      OP_REQUIRES(
+          ctx, ctx->num_inputs() == 5,
+          errors::InvalidArgument("Must supply ksize and stride arguments."));
+      const TensorShape ksize_shape = ctx->InputShape(3);
+      // Validate input sizes.
+      OP_REQUIRES(ctx, TensorShapeUtils::IsVector(ksize_shape),
+                  errors::InvalidArgument("ksize must be a vector, not shape ",
+                                          ksize_shape.DebugString()));
+      OP_REQUIRES_OK(ctx, ctx->ConstantInputAsIntVector(3, &ksize_));
+
+      const TensorShape stride_shape = ctx->InputShape(4);
+      // Validate input sizes.
+      OP_REQUIRES(ctx, TensorShapeUtils::IsVector(stride_shape),
+                  errors::InvalidArgument("stride must be a vector, not shape ",
+                                          stride_shape.DebugString()));
+      OP_REQUIRES_OK(ctx, ctx->ConstantInputAsIntVector(4, &stride_));
+    }
+
+    OP_REQUIRES(ctx, ksize_.size() == num_dims(),
+                errors::InvalidArgument("Sliding window ksize field must "
+                                        "specify ",
+                                        num_dims(), " dimensions"));
+    OP_REQUIRES(ctx, stride_.size() == num_dims(),
+                errors::InvalidArgument("Sliding window strides field must "
+                                        "specify ",
+                                        num_dims(), " dimensions"));
+
+    const TensorShape tensor_in_shape = ctx->InputShape(0);
+    const TensorShape tensor_out_shape = ctx->InputShape(1);
+    const TensorShape out_backprop_shape = ctx->InputShape(2);
+
+    // For maxpooling, tensor_in should have num_dims() dimensions.
+    OP_REQUIRES(ctx, tensor_in_shape.dims() == num_dims(),
+                errors::InvalidArgument("tensor_in must be ", num_dims(),
+                                        "-dimensional"));
+    OP_REQUIRES(ctx, tensor_out_shape.dims() == num_dims(),
+                errors::InvalidArgument("tensor_out must be ", num_dims(),
+                                        "-dimensional"));
+    // For maxpooling, out_backprop should have num_dims() dimensions.
+    OP_REQUIRES(ctx, out_backprop_shape.dims() == num_dims(),
+                errors::InvalidArgument("out_backprop must be ", num_dims(),
+                                        "-dimensional"));
+
+    // What we want to compute:
+    // Given y = MaxPool(x), and xs_grad = MaxPoolGrad(x, y, ys_grad)
+    // MaxPoolGradGrad computes {ys_grad}_grad given x, y, and {xs_grad}_grad.
+    //
+    // In the regular TF op, this amounts to selecting for each window the
+    // incoming backprop value from xs_grad_grad that corresponds to the maximal
+    // value in the corresponding window of x.
+    //
+    // TODO(b/73062247): What we really want is a ReduceWindow with different
+    // arrays for index selection vs return value selection--a select-to-gather.
+    //
+    // Here, we implement a bitwise hack: we run two max poolings keyed on the
+    // hi 16 bits of input, packing the hi 16 bits of out_backprop into the lo
+    // 16 bits of one and the lo 16 bits into the other; we then glue the two
+    // results back together to get a full 32 bits of gradient.
+    //
+    // This could select the wrong backprop value for two x values that are
+    // equally maximal up to the first 16 bits, in which case we are taking the
+    // latter.
+    //
+    // Note that in principle we could use 32 separate maxpools to recover each
+    // of 32 bits of the gradient while preserving 31 bits of input for the max
+    // pooling criteria; here, we just truncate to the first 16 bits of input.
+
+    auto input = ctx->Input(0);
+    auto out_backprop = ctx->Input(2);
+
+    auto b = ctx->builder();
+
+    auto sixteen = b->ConstantR0<uint32>(16);
+    // in (f32) -> round to bf16 -> f32 for correct bitwidth -> 16-high-bit u32
+    auto in_hi = b->BitcastConvertType(
+        b->ConvertElementType(b->ConvertElementType(input, xla::BF16),
+                              xla::F32),
+        xla::U32);
+    auto bp_int = b->BitcastConvertType(out_backprop, xla::U32);
+    auto bp_hi = b->ShiftRightLogical(bp_int, sixteen);
+    auto bp_lo = b->ShiftRightLogical(b->ShiftLeft(bp_int, sixteen), sixteen);
+    auto in_hi_bp_hi = b->Add(in_hi, bp_hi);  // Want an unsigned add.
+    auto in_hi_bp_lo = b->Add(in_hi, bp_lo);  // Want an unsigned add.
+
+    auto init_value = XlaHelpers::MinValue(b, DT_FLOAT);
+    // We will reduce by taking the maximal value up to 16 bits (ignoring the lo
+    // 16 bits of packed-in hi/lo backprop value).
+    auto rb = b->CreateSubBuilder("GreaterOrEqOf_ByFirst16Bits");
+    {
+      // F32 parameters to satisfy lowering type restriction for reduce opcode.
+      const xla::Shape scalar = xla::ShapeUtil::MakeShape(xla::F32, {});
+      auto lhs = rb->Parameter(0, scalar, "lhs");
+      auto rhs = rb->Parameter(1, scalar, "rhs");
+      auto sixteen = rb->ConstantR0<int32>(16);
+      auto lhs_criteria = rb->ShiftLeft(
+          rb->ShiftRightLogical(rb->BitcastConvertType(lhs, xla::S32), sixteen),
+          sixteen);
+      auto rhs_criteria = rb->ShiftLeft(
+          rb->ShiftRightLogical(rb->BitcastConvertType(rhs, xla::S32), sixteen),
+          sixteen);
+      // Must use an F32 comparison, because S32 would not work for negatives.
+      rb->Select(rb->Ge(rb->BitcastConvertType(lhs_criteria, xla::F32),
+                        rb->BitcastConvertType(rhs_criteria, xla::F32)),
+                 lhs, rhs);
+    }
+    auto reduce = rb->BuildAndNoteError();
+    xla::Padding xla_padding =
+        (padding_ == VALID) ? xla::Padding::kValid : xla::Padding::kSame;
+    auto pooled_hi =
+        b->ReduceWindow(b->BitcastConvertType(in_hi_bp_hi, xla::F32),
+                        init_value, reduce, ksize_, stride_, xla_padding);
+    auto pooled_lo =
+        b->ReduceWindow(b->BitcastConvertType(in_hi_bp_lo, xla::F32),
+                        init_value, reduce, ksize_, stride_, xla_padding);
+    auto grads_hi =
+        b->ShiftLeft(b->BitcastConvertType(pooled_hi, xla::U32), sixteen);
+    auto grads_lo = b->ShiftRightLogical(
+        b->ShiftLeft(b->BitcastConvertType(pooled_lo, xla::U32), sixteen),
+        sixteen);
+    auto grads = b->Add(grads_hi, grads_lo);  // Want an unsigned add.
+
+    xla::PrimitiveType element_type;
+    OP_REQUIRES_OK(ctx, DataTypeToPrimitiveType(input_type(2), &element_type));
+    ctx->SetOutput(0, b->BitcastConvertType(grads, element_type));
+  }
+
+ protected:
+  const int num_spatial_dims_;
+  std::vector<int64> ksize_;
+  std::vector<int64> stride_;
+  Padding padding_;
+  TensorFormat data_format_ = FORMAT_NHWC;
+};
+
+class MaxPool2DGradGradOp : public MaxPoolGradGradOp {
+ public:
+  explicit MaxPool2DGradGradOp(OpKernelConstruction* ctx)
+      : MaxPoolGradGradOp(ctx, /*num_spatial_dims=*/2) {
+    string data_format;
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("data_format", &data_format));
+    OP_REQUIRES(ctx, FormatFromString(data_format, &data_format_),
+                errors::InvalidArgument("Invalid data format"));
+  }
+};
+REGISTER_XLA_OP(Name("MaxPoolGradGrad").TypeConstraint("T", DT_FLOAT),
+                MaxPool2DGradGradOp);
+REGISTER_XLA_OP(Name("MaxPoolGradGradV2")
+                    .TypeConstraint("T", DT_FLOAT)
+                    .CompileTimeConstInput("ksize")
+                    .CompileTimeConstInput("strides"),
+                MaxPool2DGradGradOp);
+
 }  // anonymous namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/reduction_ops.cc b/tensorflow/compiler/tf2xla/kernels/reduction_ops.cc
index 03b13b2924f4b81c1017804c91d5ffb81c44ea0b..812d258cd1677e18ef49952044126c76a2f55b19 100644
--- a/tensorflow/compiler/tf2xla/kernels/reduction_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/reduction_ops.cc
@@ -27,7 +27,13 @@ namespace {
 
 class SumOp : public XlaReductionOp {
  public:
-  explicit SumOp(OpKernelConstruction* ctx) : XlaReductionOp(ctx) {}
+  explicit SumOp(OpKernelConstruction* ctx)
+      : XlaReductionOp(ctx,
+                       XlaHelpers::SumAccumulationType(ctx->input_type(0))) {}
+  xla::ComputationDataHandle InitialValue(
+      xla::ComputationBuilder* builder) override {
+    return XlaHelpers::Zero(builder, reduction_type_);
+  }
   void BuildReducer(xla::ComputationBuilder* builder,
                     const xla::ComputationDataHandle& scalar_lhs,
                     const xla::ComputationDataHandle& scalar_rhs) override {
@@ -39,11 +45,13 @@ REGISTER_XLA_OP(Name("Sum").CompileTimeConstInput("reduction_indices"), SumOp);
 
 class ProdOp : public XlaReductionOp {
  public:
-  explicit ProdOp(OpKernelConstruction* ctx) : XlaReductionOp(ctx) {}
+  explicit ProdOp(OpKernelConstruction* ctx)
+      : XlaReductionOp(ctx,
+                       XlaHelpers::SumAccumulationType(ctx->input_type(0))) {}
 
   xla::ComputationDataHandle InitialValue(
       xla::ComputationBuilder* builder) override {
-    return XlaHelpers::One(builder, input_type(0));
+    return XlaHelpers::One(builder, reduction_type_);
   }
 
   void BuildReducer(xla::ComputationBuilder* builder,
@@ -58,13 +66,12 @@ REGISTER_XLA_OP(Name("Prod").CompileTimeConstInput("reduction_indices"),
 
 class MinOp : public XlaReductionOp {
  public:
-  explicit MinOp(OpKernelConstruction* ctx) : XlaReductionOp(ctx) {}
+  explicit MinOp(OpKernelConstruction* ctx)
+      : XlaReductionOp(ctx, ctx->input_type(0)) {}
 
   xla::ComputationDataHandle InitialValue(
       xla::ComputationBuilder* builder) override {
-    xla::PrimitiveType type;
-    TF_CHECK_OK(DataTypeToPrimitiveType(input_type(0), &type));
-    return builder->ConstantLiteral(xla::Literal::MaxValue(type));
+    return XlaHelpers::MaxValue(builder, reduction_type_);
   }
 
   void BuildReducer(xla::ComputationBuilder* builder,
@@ -78,13 +85,12 @@ REGISTER_XLA_OP(Name("Min").CompileTimeConstInput("reduction_indices"), MinOp);
 
 class MaxOp : public XlaReductionOp {
  public:
-  explicit MaxOp(OpKernelConstruction* ctx) : XlaReductionOp(ctx) {}
+  explicit MaxOp(OpKernelConstruction* ctx)
+      : XlaReductionOp(ctx, ctx->input_type(0)) {}
 
   xla::ComputationDataHandle InitialValue(
       xla::ComputationBuilder* builder) override {
-    xla::PrimitiveType type;
-    TF_CHECK_OK(DataTypeToPrimitiveType(input_type(0), &type));
-    return builder->ConstantLiteral(xla::Literal::MinValue(type));
+    return XlaHelpers::MinValue(builder, reduction_type_);
   }
 
   void BuildReducer(xla::ComputationBuilder* builder,
@@ -98,8 +104,14 @@ REGISTER_XLA_OP(Name("Max").CompileTimeConstInput("reduction_indices"), MaxOp);
 
 class MeanOp : public XlaReductionOp {
  public:
-  explicit MeanOp(OpKernelConstruction* ctx) : XlaReductionOp(ctx) {}
+  explicit MeanOp(OpKernelConstruction* ctx)
+      : XlaReductionOp(ctx,
+                       XlaHelpers::SumAccumulationType(ctx->input_type(0))) {}
 
+  xla::ComputationDataHandle InitialValue(
+      xla::ComputationBuilder* builder) override {
+    return XlaHelpers::Zero(builder, reduction_type_);
+  }
   void BuildReducer(xla::ComputationBuilder* builder,
                     const xla::ComputationDataHandle& scalar_lhs,
                     const xla::ComputationDataHandle& scalar_rhs) override {
@@ -121,7 +133,8 @@ REGISTER_XLA_OP(Name("Mean").CompileTimeConstInput("reduction_indices"),
 
 class AllOp : public XlaReductionOp {
  public:
-  explicit AllOp(OpKernelConstruction* ctx) : XlaReductionOp(ctx) {}
+  explicit AllOp(OpKernelConstruction* ctx)
+      : XlaReductionOp(ctx, ctx->input_type(0)) {}
 
   xla::ComputationDataHandle InitialValue(
       xla::ComputationBuilder* builder) override {
@@ -139,7 +152,8 @@ REGISTER_XLA_OP(Name("All").CompileTimeConstInput("reduction_indices"), AllOp);
 
 class AnyOp : public XlaReductionOp {
  public:
-  explicit AnyOp(OpKernelConstruction* ctx) : XlaReductionOp(ctx) {}
+  explicit AnyOp(OpKernelConstruction* ctx)
+      : XlaReductionOp(ctx, ctx->input_type(0)) {}
 
   xla::ComputationDataHandle InitialValue(
       xla::ComputationBuilder* builder) override {
diff --git a/tensorflow/compiler/tf2xla/kernels/reduction_ops.h b/tensorflow/compiler/tf2xla/kernels/reduction_ops.h
index 9aca6d8fedf92f176b3b7b40c5961d4a2e557a8a..f3181f0dadc2d3f45abb145e009e2663c10490f0 100644
--- a/tensorflow/compiler/tf2xla/kernels/reduction_ops.h
+++ b/tensorflow/compiler/tf2xla/kernels/reduction_ops.h
@@ -33,12 +33,12 @@ namespace tensorflow {
 // xla::ComputationBuilder.
 class XlaReductionOp : public XlaOpKernel {
  public:
-  explicit XlaReductionOp(OpKernelConstruction* ctx);
+  XlaReductionOp(OpKernelConstruction* ctx, DataType reduction_type);
   ~XlaReductionOp() override {}
 
-  // Return the base case for the reduction. Defaults to zero.
+  // Return the base case for the reduction.
   virtual xla::ComputationDataHandle InitialValue(
-      xla::ComputationBuilder* builder);
+      xla::ComputationBuilder* builder) = 0;
 
   // Implement the (scalar,scalar)->scalar lambda that should be
   // applied to each pair of elements to be reduced. The desired
@@ -63,6 +63,9 @@ class XlaReductionOp : public XlaOpKernel {
  private:
   // True if the number of dimensions should be maintained.
   bool keep_dims_;
+
+ protected:
+  DataType reduction_type_;
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/reduction_ops_common.cc b/tensorflow/compiler/tf2xla/kernels/reduction_ops_common.cc
index 4b5d09eb9fd4110cdc4221099ff55767e9132540..64fe765ae9a945c58ea60bc157b1520c83b0d8e7 100644
--- a/tensorflow/compiler/tf2xla/kernels/reduction_ops_common.cc
+++ b/tensorflow/compiler/tf2xla/kernels/reduction_ops_common.cc
@@ -24,19 +24,15 @@ limitations under the License.
 
 namespace tensorflow {
 
-XlaReductionOp::XlaReductionOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {
+XlaReductionOp::XlaReductionOp(OpKernelConstruction* ctx,
+                               DataType reduction_type)
+    : XlaOpKernel(ctx), reduction_type_(reduction_type) {
   const DataType dt = BaseType(input_type(0));
   OP_REQUIRES_OK(ctx, ctx->MatchSignature({dt, DT_INT32}, {dt}));
 
   OP_REQUIRES_OK(ctx, ctx->GetAttr("keep_dims", &keep_dims_));
 }
 
-// Return the base case for the reduction. Defaults to zero.
-xla::ComputationDataHandle XlaReductionOp::InitialValue(
-    xla::ComputationBuilder* builder) {
-  return XlaHelpers::Zero(builder, input_type(0));
-}
-
 // Unless BuildFinalizer is overridden the reduction has no
 // finalizer.
 xla::ComputationDataHandle XlaReductionOp::BuildFinalizer(
@@ -100,36 +96,26 @@ void XlaReductionOp::Compile(XlaOpKernelContext* ctx) {
 
   string desc = ctx->op_kernel().name();
 
-  // Call virtual method to get the initial value.
-  const xla::ComputationDataHandle initial = InitialValue(ctx->builder());
+  xla::ComputationBuilder* const b = ctx->builder();
   // Construct the builder for the reduction lambda.
-  xla::ComputationBuilder r(ctx->builder()->client(),
-                            strings::StrCat(desc, "-reduction"));
+  xla::ComputationBuilder r(b->client(), strings::StrCat(desc, "-reduction"));
   xla::PrimitiveType type;
-  TF_CHECK_OK(DataTypeToPrimitiveType(input_type(0), &type));
-  // Make two scalar parameters of the desired type for the lambda.
-  xla::ComputationDataHandle rx =
-      r.Parameter(0, xla::ShapeUtil::MakeShape(type, {}), "x");
-  xla::ComputationDataHandle ry =
-      r.Parameter(1, xla::ShapeUtil::MakeShape(type, {}), "y");
-
-  auto data = ctx->Input(0);
+  TF_CHECK_OK(DataTypeToPrimitiveType(reduction_type_, &type));
 
+  auto data = b->ConvertElementType(ctx->Input(0), type);
+  // Call virtual method to get the initial value.
+  auto initial = b->ConvertElementType(InitialValue(b), type);
+  // Make two scalar parameters of the desired type for the lambda.
+  auto rx = r.Parameter(0, xla::ShapeUtil::MakeShape(type, {}), "x");
+  auto ry = r.Parameter(1, xla::ShapeUtil::MakeShape(type, {}), "y");
   // Call virtual method to build the reduction lambda.
   BuildReducer(&r, rx, ry);
   xla::Computation reduction_computation = r.Build().ConsumeValueOrDie();
-  xla::ComputationDataHandle reduce =
-      ctx->builder()->Reduce(data, initial, reduction_computation, xla_axes);
 
-  xla::ComputationDataHandle finalized =
-      BuildFinalizer(ctx->builder(), reduce, num_elements_reduced);
-
-  xla::ComputationDataHandle result;
-  if (keep_dims_) {
-    result = ctx->builder()->Reshape(finalized, final_shape);
-  } else {
-    result = finalized;
-  }
+  auto reduce = b->Reduce(data, initial, reduction_computation, xla_axes);
+  auto deconverted = XlaHelpers::ConvertElementType(b, reduce, input_type(0));
+  auto finalized = BuildFinalizer(b, deconverted, num_elements_reduced);
+  auto result = keep_dims_ ? b->Reshape(finalized, final_shape) : finalized;
   ctx->SetOutput(0, result);
 }
 
diff --git a/tensorflow/compiler/tf2xla/kernels/scan_ops.cc b/tensorflow/compiler/tf2xla/kernels/scan_ops.cc
index ee4a94164c4a43828eb4feedbfa9d1a9e231ef8f..4cfa28a0ce3d7d1f24196ef6ef2775f840b2bcf1 100644
--- a/tensorflow/compiler/tf2xla/kernels/scan_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/scan_ops.cc
@@ -66,7 +66,7 @@ class ScanOp : public XlaOpKernel {
                                 -input_shape.dims(), ", ", input_shape.dims(),
                                 "), but got ", axis));
 
-    DataType dtype = ctx->input_type(0);
+    DataType dtype = XlaHelpers::SumAccumulationType(ctx->input_type(0));
 
     if (input_shape.num_elements() == 0) {
       // Exit early if there is nothing to compute.
@@ -91,7 +91,6 @@ class ScanOp : public XlaOpKernel {
       std::swap(padding[axis].first, padding[axis].second);
     }
 
-    xla::ComputationDataHandle input = ctx->Input(0);
     xla::ComputationDataHandle init;
     const xla::Computation* reducer;
     if (sum_) {
@@ -102,7 +101,10 @@ class ScanOp : public XlaOpKernel {
       reducer = ctx->GetOrCreateMul(dtype);
     }
     auto output = builder->ReduceWindowWithGeneralPadding(
-        ctx->Input(0), init, *reducer, window_dims, window_strides, padding);
+        XlaHelpers::ConvertElementType(builder, ctx->Input(0), dtype), init,
+        *reducer, window_dims, window_strides, padding);
+    output =
+        XlaHelpers::ConvertElementType(builder, output, ctx->input_type(0));
 
     // In exclusive mode, we have computed an extra element containing the sum
     // of all the input elements. Slice off this extra "last" element.
diff --git a/tensorflow/compiler/tf2xla/kernels/segment_reduction_ops.cc b/tensorflow/compiler/tf2xla/kernels/segment_reduction_ops.cc
index 80d6df6c48b0141734dcee1c2a3c413926931feb..498342a98881df0c6ff50007eacc1d5ef6196b57 100644
--- a/tensorflow/compiler/tf2xla/kernels/segment_reduction_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/segment_reduction_ops.cc
@@ -83,7 +83,9 @@ class UnsortedSegmentSum : public XlaOpKernel {
   DataType dtype_;
 };
 
-REGISTER_XLA_OP(Name("UnsortedSegmentSum"), UnsortedSegmentSum);
+REGISTER_XLA_OP(
+    Name("UnsortedSegmentSum").CompileTimeConstInput("num_segments"),
+    UnsortedSegmentSum);
 
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/softmax_op.cc b/tensorflow/compiler/tf2xla/kernels/softmax_op.cc
index 750a4c2dec8154f97f307978b3d8884271292279..aa47cb799f1f3d01f6fcb01ff9f2e410f7f0ac5a 100644
--- a/tensorflow/compiler/tf2xla/kernels/softmax_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/softmax_op.cc
@@ -42,9 +42,8 @@ class SoftmaxOp : public XlaOpKernel {
     const DataType type = input_type(0);
     auto logits = ctx->Input(0);
 
-    xla::ComputationBuilder* b = ctx->builder();
+    xla::ComputationBuilder* const b = ctx->builder();
     const xla::Computation& max_func = *ctx->GetOrCreateMax(type);
-    const xla::Computation& add_func = *ctx->GetOrCreateAdd(type);
 
     // Find the max in each batch, resulting in a tensor of shape [batch]
     auto logits_max =
@@ -52,21 +51,20 @@ class SoftmaxOp : public XlaOpKernel {
     // Subtract the max in batch b from every element in batch b. Broadcasts
     // along the batch dimension.
     auto shifted_logits = b->Sub(logits, logits_max, {kBatchDim});
-    xla::ComputationDataHandle softmax;
-    if (log_) {
-      // softmax = shifted_logits - log(sum(exp(shifted_logits)))
-      auto log_sum_exp =
-          b->Log(b->Reduce(b->Exp(shifted_logits), XlaHelpers::Zero(b, type),
-                           add_func, {kClassDim}));
-      softmax = b->Sub(shifted_logits, log_sum_exp, {kBatchDim});
-    } else {
-      // softmax = exp(shifted_logits) / sum(exp(shifted_logits))
-      auto exp_shifted = b->Exp(shifted_logits);
-      auto sum_exp = b->Reduce(exp_shifted, XlaHelpers::Zero(b, type), add_func,
-                               {kClassDim});
-      softmax = b->Div(exp_shifted, sum_exp, {kBatchDim});
-    }
-
+    auto exp_shifted = b->Exp(shifted_logits);
+    const DataType accumulation_type = XlaHelpers::SumAccumulationType(type);
+    auto converted =
+        XlaHelpers::ConvertElementType(b, exp_shifted, accumulation_type);
+    auto reduce =
+        b->Reduce(converted, XlaHelpers::Zero(b, accumulation_type),
+                  *ctx->GetOrCreateAdd(accumulation_type), {kClassDim});
+    auto sum = XlaHelpers::ConvertElementType(b, reduce, type);
+    auto softmax =
+        log_
+            // softmax = shifted_logits - log(sum(exp(shifted_logits)))
+            ? b->Sub(shifted_logits, b->Log(sum), {kBatchDim})
+            // softmax = exp(shifted_logits) / sum(exp(shifted_logits))
+            : b->Div(exp_shifted, sum, {kBatchDim});
     ctx->SetOutput(0, softmax);
   }
 
@@ -82,7 +80,6 @@ CrossEntropyWithLogits(XlaOpKernelContext* ctx, DataType type,
                        const xla::ComputationDataHandle& logits,
                        const xla::ComputationDataHandle& labels) {
   const xla::Computation& max_func = *ctx->GetOrCreateMax(type);
-  const xla::Computation& add_func = *ctx->GetOrCreateAdd(type);
 
   const int kBatchDim = 0;
   const int kClassDim = 1;
@@ -100,8 +97,12 @@ CrossEntropyWithLogits(XlaOpKernelContext* ctx, DataType type,
   auto exp_shifted_logits = b->Exp(shifted_logits);
 
   // sum_{class} (exp(logits - max_logits))
-  auto sum_exp = b->Reduce(exp_shifted_logits, XlaHelpers::Zero(b, type),
-                           add_func, {kClassDim});
+  const DataType accumulation_type = XlaHelpers::SumAccumulationType(type);
+  auto converted =
+      XlaHelpers::ConvertElementType(b, exp_shifted_logits, accumulation_type);
+  auto reduce = b->Reduce(converted, XlaHelpers::Zero(b, accumulation_type),
+                          *ctx->GetOrCreateAdd(accumulation_type), {kClassDim});
+  auto sum_exp = XlaHelpers::ConvertElementType(b, reduce, type);
 
   // log(sum(exp(logits - max_logits)))
   auto log_sum_exp = b->Log(sum_exp);
@@ -110,9 +111,13 @@ CrossEntropyWithLogits(XlaOpKernelContext* ctx, DataType type,
   //    ((logits - max_logits) - log(sum(exp(logits - max_logits)))))
   // along classes
   // (The subtraction broadcasts along the batch dimension.)
-  xla::ComputationDataHandle loss = b->Reduce(
-      b->Mul(b->Neg(labels), b->Sub(shifted_logits, log_sum_exp, {kBatchDim})),
-      XlaHelpers::Zero(b, type), add_func, {kClassDim});
+  auto sub = b->Sub(shifted_logits, log_sum_exp, {kBatchDim});
+  auto mul = b->Mul(b->Neg(labels), sub);
+  auto sum =
+      b->Reduce(XlaHelpers::ConvertElementType(b, mul, accumulation_type),
+                XlaHelpers::Zero(b, accumulation_type),
+                *ctx->GetOrCreateAdd(accumulation_type), {kClassDim});
+  auto loss = XlaHelpers::ConvertElementType(b, sum, type);
 
   // backprop: prob - labels, where
   //   prob = exp(logits - max_logits) / sum(exp(logits - max_logits))
diff --git a/tensorflow/compiler/tf2xla/kernels/split_op.cc b/tensorflow/compiler/tf2xla/kernels/split_op.cc
index 79c435c90a1f57250be90c2c2523bf3d7d231461..43c15e753805352875034dfd2c70a2a1ed9a4114 100644
--- a/tensorflow/compiler/tf2xla/kernels/split_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/split_op.cc
@@ -111,27 +111,24 @@ class SplitVOp : public XlaOpKernel {
 
   void Compile(XlaOpKernelContext* ctx) override {
     const int32 num_split = num_outputs();
+    const TensorShape input_shape = ctx->InputShape(0);
     const TensorShape index_shape = ctx->InputShape(2);
-    xla::Literal literal_index;
-    OP_REQUIRES_OK(ctx, ctx->ConstantInput(2, &literal_index));
 
-    int32 split_dim;
-    OP_REQUIRES(ctx, index_shape.dims() == 0,
-                errors::InvalidArgument("split_dim input to Split Op must be a "
-                                        "scalar"));
-    split_dim = literal_index.Get<int>({});
+    int64 split_dim_orig;
+    OP_REQUIRES_OK(ctx, ctx->ConstantInputAsIntScalar(2, &split_dim_orig));
+    int64 split_dim = split_dim_orig < 0 ? split_dim_orig + input_shape.dims()
+                                         : split_dim_orig;
+    OP_REQUIRES(ctx, 0 <= split_dim && split_dim < input_shape.dims(),
+                errors::InvalidArgument("-input rank(-", input_shape.dims(),
+                                        ") <= split_dim < input rank (",
+                                        input_shape.dims(), "), but got ",
+                                        split_dim_orig));
 
     xla::ComputationDataHandle input = ctx->Input(0);
-    const TensorShape input_shape = ctx->InputShape(0);
 
     OP_REQUIRES(ctx, input_shape.dims() > 0,
                 errors::InvalidArgument("Can't split a 0 dimensional input"));
 
-    OP_REQUIRES(
-        ctx, 0 <= split_dim && split_dim < input_shape.dims(),
-        errors::InvalidArgument("0 <= split_dim < number of input dimensions (",
-                                input_shape.dims(), "), but got ", split_dim));
-
     OP_REQUIRES(
         ctx, num_split > 0,
         errors::InvalidArgument(
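Review note on the `split_dim` handling in the SplitV hunk above: a negative axis is shifted by the input rank and then range-checked against `[0, rank)`. A minimal standalone sketch of that normalization, assuming a hypothetical helper name (`NormalizeAxis`) and a plain `bool` return in place of `OP_REQUIRES`:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of TensorFlow): maps an axis given in
// [-rank, rank) to the equivalent axis in [0, rank).
// Returns false and leaves *normalized untouched if the axis is out of range.
bool NormalizeAxis(std::int64_t axis, std::int64_t rank,
                   std::int64_t* normalized) {
  // Negative axes count from the end, so -1 refers to the last dimension.
  const std::int64_t shifted = axis < 0 ? axis + rank : axis;
  if (shifted < 0 || shifted >= rank) {
    return false;
  }
  *normalized = shifted;
  return true;
}
```

For a rank-3 input, axis `-1` resolves to `2`, while `3` and `-4` are rejected. Keeping the original value around, as the hunk does with `split_dim_orig`, lets the error message report what the caller actually passed.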
diff --git a/tensorflow/compiler/tf2xla/kernels/stateless_random_ops.cc b/tensorflow/compiler/tf2xla/kernels/stateless_random_ops.cc
index b10880de77e6b9811008076cd4a959c284e558d1..5bb773d97fc5ce90dabceeefd5c29d916597f5ff 100644
--- a/tensorflow/compiler/tf2xla/kernels/stateless_random_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/stateless_random_ops.cc
@@ -239,6 +239,7 @@ class StatelessRandomUniformOp : public XlaOpKernel {
 
 // TODO(phawkins): generalize to non-float, non-int32 seed types.
 REGISTER_XLA_OP(Name("StatelessRandomUniform")
+                    .CompileTimeConstInput("shape")
                     .TypeConstraint("dtype", DT_FLOAT)
                     .TypeConstraint("Tseed", DT_INT32),
                 StatelessRandomUniformOp);
@@ -272,6 +273,7 @@ class StatelessRandomNormalOp : public XlaOpKernel {
 
 // TODO(phawkins): generalize to non-float, non-int32 seed types.
 REGISTER_XLA_OP(Name("StatelessRandomNormal")
+                    .CompileTimeConstInput("shape")
                     .TypeConstraint("dtype", DT_FLOAT)
                     .TypeConstraint("Tseed", DT_INT32),
                 StatelessRandomNormalOp);
diff --git a/tensorflow/compiler/tf2xla/xla_compiler.cc b/tensorflow/compiler/tf2xla/xla_compiler.cc
index 5ec05c4121e059ad2b1307376766a41916fe61ae..86263d847ae02d50e70dafb0129b2664c522f2a3 100644
--- a/tensorflow/compiler/tf2xla/xla_compiler.cc
+++ b/tensorflow/compiler/tf2xla/xla_compiler.cc
@@ -600,6 +600,48 @@ Status XlaCompiler::BuildArguments(
   return Status::OK();
 }
 
+Status XlaCompiler::CompileSingleOp(
+    const XlaCompiler::CompileOptions& options, string const& name,
+    OpKernelContext* ctx, const std::vector<XlaCompiler::Argument>& args,
+    CompilationResult* result) {
+  // TODO(b/74182462): We implement this by creating a new dummy Graph including
+  // _Arg nodes, and let CompileGraph walk it. This could be optimized.
+  std::unique_ptr<Graph> graph(new Graph(OpRegistry::Global()));
+
+  Status status;
+  // First create the actual node we care about computing.
+  Node* main_node = graph->AddNode(ctx->op_kernel().def(), &status);
+  TF_RETURN_IF_ERROR(status);
+
+  // Create dummy _Arg nodes. Link these to `main_node` and also via a
+  // control dependency edge to the _SOURCE node.
+  for (int64 i = 0; i < ctx->num_inputs(); ++i) {
+    Node* node;
+    string name = strings::StrCat(ctx->op_kernel().name(), "_", i, "_arg");
+    Status status = NodeBuilder(name, "_Arg")
+                        .ControlInput(graph->source_node())
+                        .Attr("T", ctx->input_dtype(i))
+                        .Attr("index", i)
+                        .Finalize(graph.get(), &node);
+    TF_RETURN_IF_ERROR(status);
+    graph->AddEdge(node, 0, main_node, i);
+  }
+
+  // Similarly, create dummy _Retval nodes fed by `main_node`.
+  for (int64 i = 0; i < ctx->num_outputs(); ++i) {
+    Node* node;
+    string name = strings::StrCat(ctx->op_kernel().name(), "_", i, "_retval");
+    Status status = NodeBuilder(name, "_Retval")
+                        .Input(main_node, i)
+                        .Attr("T", ctx->expected_output_dtype(i))
+                        .Attr("index", i)
+                        .Finalize(graph.get(), &node);
+    TF_RETURN_IF_ERROR(status);
+  }
+
+  return CompileGraph(options, name, std::move(graph), args, result);
+}
+
 Status XlaCompiler::CompileGraph(const XlaCompiler::CompileOptions& options,
                                  string const& name,
                                  std::unique_ptr<Graph> graph,
@@ -674,6 +716,14 @@ Status XlaCompiler::CompileGraph(const XlaCompiler::CompileOptions& options,
   VLOG(2) << "XLA output shape: "
           << xla::ShapeUtil::HumanString(result->xla_output_shape);
 
+  // Copy the host transfer metadata to the result.
+  for (const auto& send : host_compute_sends_) {
+    *result->host_compute_metadata.add_device_to_host() = send.second;
+  }
+  for (const auto& recv : host_compute_recvs_) {
+    *result->host_compute_metadata.add_host_to_device() = recv.second;
+  }
+
   // Tensorflow expects a major-to-minor order of results.
   xla::LayoutUtil::SetToDefaultLayout(&result->xla_output_shape);
 
@@ -708,4 +758,59 @@ Status XlaCompiler::GetChannelHandle(const string& key,
   return Status::OK();
 }
 
+namespace {
+
+void SetTransfer(const string& key, gtl::ArraySlice<DataType> types,
+                 gtl::ArraySlice<TensorShape> shapes,
+                 tf2xla::HostTransferMetadata* transfer) {
+  transfer->set_key(key);
+  CHECK(types.size() == shapes.size());
+  for (int i = 0; i < types.size(); ++i) {
+    tf2xla::TensorMetadata* metadata = transfer->add_metadata();
+    metadata->set_type(types[i]);
+    shapes[i].AsProto(metadata->mutable_shape());
+  }
+}
+
+}  // namespace
+
+Status XlaCompiler::SetDeviceToHostMetadata(
+    const string& key, gtl::ArraySlice<DataType> types,
+    gtl::ArraySlice<TensorShape> shapes) {
+  if (host_compute_sends_.find(key) != host_compute_sends_.end()) {
+    return errors::InvalidArgument(
+        "Duplicate calls to SetDeviceToHostMetadata with key ", key);
+  }
+  tf2xla::HostTransferMetadata& transfer = host_compute_sends_[key];
+  SetTransfer(key, types, shapes, &transfer);
+  return Status::OK();
+}
+
+Status XlaCompiler::GetDeviceToHostShapes(
+    const string& key, std::vector<TensorShape>* shapes) const {
+  const auto iter = host_compute_sends_.find(key);
+  if (iter == host_compute_sends_.end()) {
+    return errors::InvalidArgument(
+        "No host compute send shapes registered for key ", key);
+  }
+  shapes->clear();
+  for (int i = 0; i < iter->second.metadata_size(); ++i) {
+    TensorShape shape(iter->second.metadata(i).shape());
+    shapes->push_back(shape);
+  }
+  return Status::OK();
+}
+
+Status XlaCompiler::SetHostToDeviceMetadata(
+    const string& key, gtl::ArraySlice<DataType> types,
+    gtl::ArraySlice<TensorShape> shapes) {
+  if (host_compute_recvs_.find(key) != host_compute_recvs_.end()) {
+    return errors::InvalidArgument(
+        "Duplicate calls to SetHostToDeviceMetadata with key ", key);
+  }
+  tf2xla::HostTransferMetadata& transfer = host_compute_recvs_[key];
+  SetTransfer(key, types, shapes, &transfer);
+  return Status::OK();
+}
+
 }  // namespace tensorflow
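Review note on `SetDeviceToHostMetadata` / `SetHostToDeviceMetadata` above: both reject a second registration for the same key. A minimal standalone sketch of that register-once pattern follows. `TransferRegistry`, `Register`, and `Lookup` are hypothetical names, and a `std::string` payload stands in for `tf2xla::HostTransferMetadata`. The detail worth checking is that the iterator returned by `find` is compared against `end()` of the same map it came from.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for one of the per-key host transfer maps.
class TransferRegistry {
 public:
  // Returns false, leaving the existing entry unchanged, if `key` is already
  // registered; otherwise stores `metadata` under `key` and returns true.
  bool Register(const std::string& key, const std::string& metadata) {
    if (entries_.find(key) != entries_.end()) {
      return false;
    }
    entries_[key] = metadata;
    return true;
  }

  // Returns nullptr when nothing was registered for `key`.
  const std::string* Lookup(const std::string& key) const {
    const auto iter = entries_.find(key);
    return iter == entries_.end() ? nullptr : &iter->second;
  }

 private:
  std::unordered_map<std::string, std::string> entries_;
};
```

Comparing an iterator from one container against `end()` of another is undefined behavior in C++, so each lookup should stay within a single map.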
diff --git a/tensorflow/compiler/tf2xla/xla_compiler.h b/tensorflow/compiler/tf2xla/xla_compiler.h
index c4449bc4be06daff856eff70c6d89be6ddbcf0ee..a6747bbe72e161b2ece55697825cce0e71145a5c 100644
--- a/tensorflow/compiler/tf2xla/xla_compiler.h
+++ b/tensorflow/compiler/tf2xla/xla_compiler.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_TF2XLA_XLA_COMPILER_H_
 #define TENSORFLOW_COMPILER_TF2XLA_XLA_COMPILER_H_
 
+#include "tensorflow/compiler/tf2xla/host_compute_metadata.pb.h"
 #include "tensorflow/compiler/tf2xla/xla_compilation_device.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/core/common_runtime/device.h"
@@ -216,6 +217,10 @@ class XlaCompiler {
     // containing both constant and non-constant results.
     std::vector<OutputDescription> outputs;
 
+    // TensorFlow shapes and types of sends/recvs from HostCompute Ops to their
+    // matching RecvAtHost/SendFromHost Ops in the outer graph.
+    tf2xla::HostComputeMetadata host_compute_metadata;
+
     // Resources whose values were updated by the computation, ordered
     // by return value position. Resource updates follow the non-constant
     // results in the outputs of XLA computation.
@@ -284,6 +289,14 @@ class XlaCompiler {
                       const std::vector<Argument>& args,
                       CompilationResult* result);
 
+  // Compiles a single Op, given by an OpKernelContext, into an
+  // xla::Computation. Similar to CompileFunction but takes a single Op as
+  // input.
+  Status CompileSingleOp(const CompileOptions& options, string const& name,
+                         OpKernelContext* ctx,
+                         const std::vector<Argument>& args,
+                         CompilationResult* result);
+
   // Returns the shape of the XLA parameter for an argument 'arg'.
   // See the class comment for more details about the argument passing
   // convention.
@@ -296,6 +309,22 @@ class XlaCompiler {
   // same XlaCompiler.
   Status GetChannelHandle(const string& key, xla::ChannelHandle* channel);
 
+  // Sets the shapes and types for the device to host transfer associated with
+  // 'key'.
+  Status SetDeviceToHostMetadata(const string& key,
+                                 gtl::ArraySlice<DataType> types,
+                                 gtl::ArraySlice<TensorShape> shapes);
+
+  // Gets the shapes for the device to host transfer associated with 'key'.
+  Status GetDeviceToHostShapes(const string& key,
+                               std::vector<TensorShape>* shapes) const;
+
+  // Sets the shapes and types for the host to device transfer associated with
+  // 'key'.
+  Status SetHostToDeviceMetadata(const string& key,
+                                 gtl::ArraySlice<DataType> types,
+                                 gtl::ArraySlice<TensorShape> shapes);
+
   const Options& options() const { return options_; }
   xla::Client* client() const { return options_.client; }
   FunctionLibraryRuntime* flib_runtime() const { return flib_runtime_; }
@@ -359,6 +388,9 @@ class XlaCompiler {
 
   std::unordered_map<string, xla::ChannelHandle> channels_;
 
+  std::unordered_map<string, tf2xla::HostTransferMetadata> host_compute_sends_;
+  std::unordered_map<string, tf2xla::HostTransferMetadata> host_compute_recvs_;
+
   TF_DISALLOW_COPY_AND_ASSIGN(XlaCompiler);
 };
 
diff --git a/tensorflow/compiler/tf2xla/xla_helpers.cc b/tensorflow/compiler/tf2xla/xla_helpers.cc
index f048662953e20b2a612271e2daeef6e370c4822a..3b0b2f06ebae4af918cbe6fb8a384004c1858998 100644
--- a/tensorflow/compiler/tf2xla/xla_helpers.cc
+++ b/tensorflow/compiler/tf2xla/xla_helpers.cc
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/computation_builder.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 
 namespace tensorflow {
@@ -273,4 +274,20 @@ Status XlaHelpers::OneHot(xla::ComputationBuilder* builder, int64 depth,
   return Status::OK();
 }
 
+DataType XlaHelpers::SumAccumulationType(const DataType& dtype) {
+  if (dtype == DT_BFLOAT16) {
+    return DT_FLOAT;
+  }
+  return dtype;
+}
+
+xla::ComputationDataHandle XlaHelpers::ConvertElementType(
+    xla::ComputationBuilder* const builder,
+    const xla::ComputationDataHandle& operand,
+    const DataType new_element_type) {
+  xla::PrimitiveType convert_to;
+  TF_CHECK_OK(DataTypeToPrimitiveType(new_element_type, &convert_to));
+  return builder->ConvertElementType(operand, convert_to);
+}
+
 }  // end namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/xla_helpers.h b/tensorflow/compiler/tf2xla/xla_helpers.h
index 2a027db4c839c917f3a7acd27184792d157356bf..68ab93b64a5fa87ad99e0f44d84f6473fc8bbebd 100644
--- a/tensorflow/compiler/tf2xla/xla_helpers.h
+++ b/tensorflow/compiler/tf2xla/xla_helpers.h
@@ -107,6 +107,18 @@ class XlaHelpers {
                        const xla::ComputationDataHandle& on_value,
                        const xla::ComputationDataHandle& off_value,
                        xla::ComputationDataHandle* one_hot);
+
+  // Certain DataTypes should use increased precision DataTypes when performing
+  // reductions.  This function remaps a given DataType to a higher precision
+  // DataType if needed.
+  static DataType SumAccumulationType(const DataType& dtype);
+
+  // A helper for creating a ConvertElementType xla op given a DataType rather
+  // than the xla::PrimitiveType.
+  static xla::ComputationDataHandle ConvertElementType(
+      xla::ComputationBuilder* const builder,
+      const xla::ComputationDataHandle& operand,
+      const DataType new_element_type);
 };
 
 }  // end namespace tensorflow
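Review note on `SumAccumulationType` above: the motivation for summing bfloat16 values in float can be illustrated without XLA. The sketch below emulates bfloat16 by keeping only the top 16 bits of a float. This is an assumption made for illustration: `TruncateToBfloat16` and the two sum functions are hypothetical names, and real conversions may round to nearest rather than truncate.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: emulates bfloat16 by truncating a float to its top
// 16 bits (sign, 8 exponent bits, 7 mantissa bits).
float TruncateToBfloat16(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  bits &= 0xFFFF0000u;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Adds `term` `count` times, rounding the running sum to bfloat16 after
// every addition (accumulation in the narrow type).
float SumInBfloat16(float term, int count) {
  float sum = 0.0f;
  for (int i = 0; i < count; ++i) {
    sum = TruncateToBfloat16(sum + TruncateToBfloat16(term));
  }
  return sum;
}

// Adds `term` `count` times in float and converts only the final result,
// which is the pattern the kernels in this change follow.
float SumInFloatThenConvert(float term, int count) {
  float sum = 0.0f;
  for (int i = 0; i < count; ++i) {
    sum += TruncateToBfloat16(term);
  }
  return TruncateToBfloat16(sum);
}
```

Under this emulation, adding `1.0f` a thousand times in the narrow type stalls at 256, because 257 is not representable with 8 significant bits, whereas accumulating in float and converting once yields 1000.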
diff --git a/tensorflow/compiler/xla/BUILD b/tensorflow/compiler/xla/BUILD
index c7cb69215fb051b7f87c3be3b0b419b9c1b8998c..cd13db4d300bb5bba21a734173b6afb9223539d8 100644
--- a/tensorflow/compiler/xla/BUILD
+++ b/tensorflow/compiler/xla/BUILD
@@ -52,6 +52,7 @@ xla_proto_library(
     visibility = ["//visibility:public"],
     deps = [
         ":xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_proto",
         "//tensorflow/compiler/xla/service:session_proto",
     ],
 )
diff --git a/tensorflow/compiler/xla/array.h b/tensorflow/compiler/xla/array.h
index 46ee4e64c9ae7ca111d9d04bedcb74ff02a42386..ea75ad32d5df7bbadd37e89de6144b264ab6d5d1 100644
--- a/tensorflow/compiler/xla/array.h
+++ b/tensorflow/compiler/xla/array.h
@@ -30,6 +30,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/status.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/lib/core/bits.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
@@ -121,10 +122,31 @@ class Array {
     CHECK(idx == num_elements());
   }
 
-  // Creates a 2D array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates a 1D array of a floating-point type (half, bfloat16, float,
+  // or double) from an initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
+                             std::is_same<T2, float>::value>::type>
+  Array(std::initializer_list<T2> values)
+      : Array(ToInt64Vector({values.size()})) {
+    int64 idx = 0;
+    for (const auto& it1 : values) {
+      values_[idx] = static_cast<T>(it1);
+      ++idx;
+    }
+    CHECK(idx == num_elements());
+  }
+
+  // Creates a 2D array of a floating-point type (half, bfloat16, float,
+  // or double) from an initializer list of float values.
+  template <typename T2, typename = typename std::enable_if<
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array(std::initializer_list<std::initializer_list<T2>> values)
       : Array(ToInt64Vector({values.size(), values.begin()->size()})) {
@@ -155,10 +177,13 @@ class Array {
     CHECK(idx == num_elements());
   }
 
-  // Creates a 3D array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates a 3D array of a floating-point type (half, bfloat16, float,
+  // or double) from an initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array(std::initializer_list<std::initializer_list<std::initializer_list<T2>>>
             values)
@@ -196,10 +221,13 @@ class Array {
     CHECK(idx == num_elements());
   }
 
-  // Creates a 4D array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates a 4D array of a floating-point type (half, bfloat16, float,
+  // or double) from an initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array(std::initializer_list<
         std::initializer_list<std::initializer_list<std::initializer_list<T2>>>>
diff --git a/tensorflow/compiler/xla/array2d.h b/tensorflow/compiler/xla/array2d.h
index d30e78ecde45cfcfcfdaac6c13c9d87ab5630c57..a17e81f44832f272fd93dce9f854042b4a84fde4 100644
--- a/tensorflow/compiler/xla/array2d.h
+++ b/tensorflow/compiler/xla/array2d.h
@@ -53,10 +53,13 @@ class Array2D : public Array<T> {
   Array2D(std::initializer_list<std::initializer_list<T>> values)
       : Array<T>(values) {}
 
-  // Creates an array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates an array of a floating-point type (half, bfloat16, float,
+  // or double) from the given nested initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array2D(std::initializer_list<std::initializer_list<T2>> values)
       : Array<T>(values) {}
@@ -100,14 +103,16 @@ std::unique_ptr<Array2D<NativeT>> MakeLinspaceArray2D(double from, double to,
                                                       int64 n1, int64 n2) {
   auto array = MakeUnique<Array2D<NativeT>>(n1, n2);
   int64 count = n1 * n2;
-  NativeT step = (count > 1) ? (to - from) / (count - 1) : 0.0f;
+  NativeT step =
+      static_cast<NativeT>((count > 1) ? (to - from) / (count - 1) : 0);
   auto set = [&array, n1, n2](int64 index, NativeT value) {
     (*array)(index / n2, index % n2) = value;
   };
   for (int64 i = 0; i < count - 1; ++i) {
-    set(i, static_cast<NativeT>(from + i * step));
+    set(i, (static_cast<NativeT>(from) +
+            static_cast<NativeT>(i) * static_cast<NativeT>(step)));
   }
-  set(count - 1, to);
+  set(count - 1, static_cast<NativeT>(to));
   return array;
 }
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/array3d.h b/tensorflow/compiler/xla/array3d.h
index e5eb235d45d160d486d1499db665ed14a8509043..0e9a0722ae43e1dc6ecddde9cbc3daf1db058840 100644
--- a/tensorflow/compiler/xla/array3d.h
+++ b/tensorflow/compiler/xla/array3d.h
@@ -57,10 +57,13 @@ class Array3D : public Array<T> {
               values)
       : Array<T>(values) {}
 
-  // Creates an array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates an array of a floating-point type (half, bfloat16, float,
+  // or double) from the given nested initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array3D(
       std::initializer_list<std::initializer_list<std::initializer_list<T2>>>
diff --git a/tensorflow/compiler/xla/array4d.h b/tensorflow/compiler/xla/array4d.h
index cff70e54bad0116bdd08674b626b3bf99dc89e1f..a75fffc605aa0df3e1e2eeb6d3129718cbbba0e4 100644
--- a/tensorflow/compiler/xla/array4d.h
+++ b/tensorflow/compiler/xla/array4d.h
@@ -82,10 +82,13 @@ class Array4D : public Array<T> {
               values)
       : Array<T>(values) {}
 
-  // Creates an array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates an array of a floating-point type (half, bfloat16, float,
+  // or double) from the given nested initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array4D(std::initializer_list<std::initializer_list<
               std::initializer_list<std::initializer_list<T2>>>>
diff --git a/tensorflow/compiler/xla/client/compile_only_client.cc b/tensorflow/compiler/xla/client/compile_only_client.cc
index c7e2c4367b89ca2112022fa40449ae3ebe28463e..59662c95ac15e7c23790c5b5ff5d75a694613aeb 100644
--- a/tensorflow/compiler/xla/client/compile_only_client.cc
+++ b/tensorflow/compiler/xla/client/compile_only_client.cc
@@ -39,16 +39,15 @@ CompileOnlyClient::CompileAheadOfTime(
   return compiler_service_->CompileAheadOfTime(service_instances, options);
 }
 
-int64 CompileOnlyClient::PointerSizeForTriple(
-    tensorflow::StringPiece target_triple) {
-  llvm::Triple triple(llvm::Triple::normalize(
-      llvm::StringRef(target_triple.data(), target_triple.size())));
-  if (triple.isArch64Bit()) {
+int64 CompileOnlyClient::PointerSizeForTriple(tensorflow::StringPiece triple) {
+  llvm::Triple llvm_triple(
+      llvm::Triple::normalize(llvm::StringRef(triple.data(), triple.size())));
+  if (llvm_triple.isArch64Bit()) {
     return 8;
-  } else if (triple.isArch32Bit()) {
+  } else if (llvm_triple.isArch32Bit()) {
     return 4;
   } else {
-    CHECK(triple.isArch16Bit());
+    CHECK(llvm_triple.isArch16Bit());
     return 2;
   }
 }
diff --git a/tensorflow/compiler/xla/client/computation_builder.cc b/tensorflow/compiler/xla/client/computation_builder.cc
index 2a6e02649d15bc9fd47a893c41f9c8a62ac076c6..39d02f0863f78d4094f2cc4805f534713fb7e929 100644
--- a/tensorflow/compiler/xla/client/computation_builder.cc
+++ b/tensorflow/compiler/xla/client/computation_builder.cc
@@ -408,7 +408,7 @@ ComputationDataHandle ComputationBuilder::Reshape(
 
 ComputationDataHandle ComputationBuilder::Collapse(
     const ComputationDataHandle& operand,
-    tensorflow::gtl::ArraySlice<int64> dims_to_collapse) {
+    tensorflow::gtl::ArraySlice<int64> dimensions) {
   if (!first_error_.ok()) {
     return ComputationDataHandle();
   }
@@ -416,8 +416,8 @@ ComputationDataHandle ComputationBuilder::Collapse(
   // Don't support out-of-order collapse here.
   // Checks that the collapsed dimensions are in order and consecutive.
   for (tensorflow::gtl::ArraySlice<int64>::size_type i = 1;
-       i < dims_to_collapse.size(); ++i) {
-    if (dims_to_collapse[i] - 1 != dims_to_collapse[i - 1]) {
+       i < dimensions.size(); ++i) {
+    if (dimensions[i] - 1 != dimensions[i - 1]) {
       NoteError(InvalidArgument(
           "Collapsed dimensions are not in order and consecutive."));
       return ComputationDataHandle();
@@ -434,9 +434,9 @@ ComputationDataHandle ComputationBuilder::Collapse(
 
   VLOG(3) << "original shape: " << ShapeUtil::HumanString(*original_shape);
   VLOG(3) << "dims to collapse: "
-          << tensorflow::str_util::Join(dims_to_collapse, ",");
+          << tensorflow::str_util::Join(dimensions, ",");
 
-  if (dims_to_collapse.size() <= 1) {
+  if (dimensions.size() <= 1) {
     // Not collapsing anything, trivially we can return the operand versus
     // enqueueing a trivial reshape.
     return operand;
@@ -444,7 +444,7 @@ ComputationDataHandle ComputationBuilder::Collapse(
 
   std::vector<int64> new_sizes;
   for (int i = 0; i < ShapeUtil::Rank(*original_shape); ++i) {
-    if (i <= dims_to_collapse.front() || i > dims_to_collapse.back()) {
+    if (i <= dimensions.front() || i > dimensions.back()) {
       new_sizes.push_back(original_shape->dimensions(i));
     } else {
       new_sizes.back() *= original_shape->dimensions(i);
@@ -753,13 +753,13 @@ ComputationDataHandle ComputationBuilder::Infeed(const Shape& shape,
 }
 
 void ComputationBuilder::Outfeed(const ComputationDataHandle& operand,
-                                 const Shape& shape,
+                                 const Shape& shape_with_layout,
                                  const string& outfeed_config) {
   OpRequest op_request;
   OutfeedRequest* request = op_request.mutable_outfeed_request();
   request->set_outfeed_config(outfeed_config);
   *request->mutable_operand() = operand;
-  *request->mutable_shape() = shape;
+  *request->mutable_shape() = shape_with_layout;
   RunOpAndNoteError(&op_request);
 }
 
@@ -868,6 +868,14 @@ ComputationDataHandle ComputationBuilder::Or(
   return BinaryOp(BINOP_OR, lhs, rhs, broadcast_dimensions);
 }
 
+// TODO(b/65209188): Create a dedicated lowering for Xor
+ComputationDataHandle ComputationBuilder::Xor(
+    const ComputationDataHandle& lhs, const ComputationDataHandle& rhs,
+    tensorflow::gtl::ArraySlice<int64> broadcast_dimensions) {
+  return Or(And(Not(lhs), rhs, broadcast_dimensions),
+            And(lhs, Not(rhs), broadcast_dimensions));
+}
+
 ComputationDataHandle ComputationBuilder::Not(
     const ComputationDataHandle& operand) {
   return UnaryOp(UNOP_NOT, operand);
@@ -1382,15 +1390,16 @@ ComputationDataHandle ComputationBuilder::BatchNormInference(
 
 ComputationDataHandle ComputationBuilder::BatchNormGrad(
     const ComputationDataHandle& operand, const ComputationDataHandle& scale,
-    const ComputationDataHandle& mean, const ComputationDataHandle& var,
+    const ComputationDataHandle& batch_mean,
+    const ComputationDataHandle& batch_var,
     const ComputationDataHandle& grad_output, float epsilon,
     int64 feature_index) {
   OpRequest op_request;
   BatchNormGradRequest* request = op_request.mutable_batch_norm_grad_request();
   *request->mutable_operand() = operand;
   *request->mutable_scale() = scale;
-  *request->mutable_mean() = mean;
-  *request->mutable_variance() = var;
+  *request->mutable_mean() = batch_mean;
+  *request->mutable_variance() = batch_var;
   *request->mutable_grad_output() = grad_output;
   request->set_epsilon(epsilon);
   request->set_feature_index(feature_index);
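
Review note on the `Xor` hunk above: the new `ComputationBuilder::Xor` has no dedicated lowering yet (see the TODO) and is composed as `Or(And(Not(lhs), rhs), And(lhs, Not(rhs)))`. The sketch below checks that identity on plain 32-bit integers. The free functions `And`, `Or`, `Not` and the helper `CheckXorLowering` are hypothetical stand-ins written for this note; they are not part of the XLA client API, and the sketch ignores `broadcast_dimensions`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the builder's And/Or/Not, operating on plain
// 32-bit integers instead of ComputationDataHandle values.
inline uint32_t And(uint32_t a, uint32_t b) { return a & b; }
inline uint32_t Or(uint32_t a, uint32_t b) { return a | b; }
inline uint32_t Not(uint32_t a) { return ~a; }

// Same composition as ComputationBuilder::Xor in the diff:
//   (~lhs & rhs) | (lhs & ~rhs)
inline uint32_t XorViaAndOrNot(uint32_t lhs, uint32_t rhs) {
  return Or(And(Not(lhs), rhs), And(lhs, Not(rhs)));
}

// Asserts that the composed form agrees with the native ^ operator on a
// small set of sample bit patterns.
inline void CheckXorLowering() {
  const uint32_t samples[] = {0u, 1u, 0x0Fu, 0xF0F0F0F0u, 0xFFFFFFFFu};
  for (uint32_t a : samples) {
    for (uint32_t b : samples) {
      assert(XorViaAndOrNot(a, b) == (a ^ b));
    }
  }
}
```

The composition costs two `Not`, two `And` and one `Or` per call, which is presumably why the TODO asks for a dedicated lowering.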
diff --git a/tensorflow/compiler/xla/client/computation_builder.h b/tensorflow/compiler/xla/client/computation_builder.h
index 377b6716399ea87b12bd0bd8a9486d4476e3cbf0..2141ebc2065a1a80d2fe820a7b6fe15434c89e28 100644
--- a/tensorflow/compiler/xla/client/computation_builder.h
+++ b/tensorflow/compiler/xla/client/computation_builder.h
@@ -512,6 +512,10 @@ class ComputationBuilder {
       const ComputationDataHandle& lhs, const ComputationDataHandle& rhs,
       tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
 
+  ComputationDataHandle Xor(
+      const ComputationDataHandle& lhs, const ComputationDataHandle& rhs,
+      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
   ComputationDataHandle Not(const ComputationDataHandle& operand);
 
   ComputationDataHandle ShiftLeft(
@@ -872,7 +876,7 @@ class ComputationBuilder {
                   Window* window);
 
   // Internal helper method that does the building for an arbitrary unary op.
-  ComputationDataHandle UnaryOp(UnaryOperation binop,
+  ComputationDataHandle UnaryOp(UnaryOperation unop,
                                 const ComputationDataHandle& operand);
 
   // Internal helper method that does the building for an arbitrary binary op.
diff --git a/tensorflow/compiler/xla/client/executable_build_options.cc b/tensorflow/compiler/xla/client/executable_build_options.cc
index 804e34f5e75ce2d153ac7627b94a543fda88e810..6e3c5cb484b8f1ef053fa287a4d462aeb886e530 100644
--- a/tensorflow/compiler/xla/client/executable_build_options.cc
+++ b/tensorflow/compiler/xla/client/executable_build_options.cc
@@ -76,4 +76,35 @@ ExecutableBuildOptions::generate_hlo_graph() const {
   return generate_hlo_graph_;
 }
 
+ExecutableBuildOptions& ExecutableBuildOptions::set_dump_optimized_hlo_proto_to(
+    tensorflow::StringPiece dirpath) {
+  dump_optimized_hlo_proto_to_ = dirpath.ToString();
+  return *this;
+}
+
+const tensorflow::gtl::optional<string>&
+ExecutableBuildOptions::dump_optimized_hlo_proto_to() const {
+  return dump_optimized_hlo_proto_to_;
+}
+
+ExecutableBuildOptions& ExecutableBuildOptions::set_dump_per_pass_hlo_proto_to(
+    tensorflow::StringPiece dirpath) {
+  dump_per_pass_hlo_proto_to_ = dirpath.ToString();
+  return *this;
+}
+
+const tensorflow::gtl::optional<string>&
+ExecutableBuildOptions::dump_per_pass_hlo_proto_to() const {
+  return dump_per_pass_hlo_proto_to_;
+}
+
+ExecutableBuildOptions& ExecutableBuildOptions::set_hlo_profile(bool enabled) {
+  hlo_profile_ = enabled;
+  return *this;
+}
+
+tensorflow::gtl::optional<bool> ExecutableBuildOptions::hlo_profile() const {
+  return hlo_profile_;
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/client/executable_build_options.h b/tensorflow/compiler/xla/client/executable_build_options.h
index 3a52dbac9adb155ad9a7d91a8102707f70fe2fbf..11f10983606fe02b1edb11a260edde8e5f9a726f 100644
--- a/tensorflow/compiler/xla/client/executable_build_options.h
+++ b/tensorflow/compiler/xla/client/executable_build_options.h
@@ -18,6 +18,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/gtl/optional.h"
 
 namespace xla {
@@ -57,15 +58,36 @@ class ExecutableBuildOptions {
   ExecutableBuildOptions& set_generate_hlo_graph(string regex);
   const tensorflow::gtl::optional<string>& generate_hlo_graph() const;
 
+  // If set, specifies a dirpath to dump the end-of-optimization-pipeline HLO
+  // protobuf to (as in DebugOptions).
+  ExecutableBuildOptions& set_dump_optimized_hlo_proto_to(
+      tensorflow::StringPiece dirpath);
+  const tensorflow::gtl::optional<string>& dump_optimized_hlo_proto_to() const;
+
+  // If set, specifies a dirpath to dump the per-pass-in-pipeline HLO protobufs
+  // to (as in DebugOptions).
+  ExecutableBuildOptions& set_dump_per_pass_hlo_proto_to(
+      tensorflow::StringPiece dirpath);
+  const tensorflow::gtl::optional<string>& dump_per_pass_hlo_proto_to() const;
+
+  // If true, specifies that we should record an HLO profile during execution
+  // and log it after execution (as in DebugOptions). If nullopt the default is
+  // used.
+  ExecutableBuildOptions& set_hlo_profile(bool enabled);
+  tensorflow::gtl::optional<bool> hlo_profile() const;
+
   // Returns a string representation of the build options, suitable for
   // debugging.
   string ToString() const;
 
  private:
+  tensorflow::gtl::optional<bool> hlo_profile_;
   int device_ordinal_ = -1;
   Shape result_layout_;
   bool result_layout_set_ = false;
   tensorflow::gtl::optional<string> generate_hlo_graph_;
+  tensorflow::gtl::optional<string> dump_optimized_hlo_proto_to_;
+  tensorflow::gtl::optional<string> dump_per_pass_hlo_proto_to_;
   DeviceMemoryAllocator* device_allocator_ = nullptr;
 };
 
diff --git a/tensorflow/compiler/xla/client/local_client.h b/tensorflow/compiler/xla/client/local_client.h
index b52a30f5a0b92e0094e6b0de3241c10a5a909cad..de0ed13c43f87966c272102b2e9af9ff3be63aea 100644
--- a/tensorflow/compiler/xla/client/local_client.h
+++ b/tensorflow/compiler/xla/client/local_client.h
@@ -69,7 +69,7 @@ class LocalExecutable {
   // of the computation.
   tensorflow::Status ValidateExecutionOptions(
       const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
-      const ExecutableRunOptions& options, const Backend& backend);
+      const ExecutableRunOptions& run_options, const Backend& backend);
 
   // Records the computation in a SessionModule proto with the arguments used to
   // invoke it, and the result. Enabled by flag: --tla_dump_executions_to.
diff --git a/tensorflow/compiler/xla/client/xla_client/BUILD b/tensorflow/compiler/xla/client/xla_client/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..b912889e2627aa01e5a7441e71e6bf002916ba5e
--- /dev/null
+++ b/tensorflow/compiler/xla/client/xla_client/BUILD
@@ -0,0 +1,77 @@
+# Description:
+#   The new XLA client libraries.
+#
+# This is NOT YET ready to use.
+
+licenses(["notice"])  # Apache 2.0
+
+package(default_visibility = [":friends"])
+
+package_group(
+    name = "friends",
+    includes = [
+        "//tensorflow/compiler/xla:friends",
+    ],
+)
+
+# Filegroup used to collect source files for dependency checking.
+filegroup(
+    name = "c_srcs",
+    data = glob([
+        "**/*.cc",
+        "**/*.h",
+    ]),
+)
+
+load("//tensorflow:tensorflow.bzl", "tf_cc_test")
+
+# TODO(b/74197823): Replace computation_builder with xla_builder.
+cc_library(
+    name = "xla_builder",
+    srcs = ["xla_builder.cc"],
+    hdrs = ["xla_builder.h"],
+    deps = [
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_proto",
+        "//tensorflow/compiler/xla/service:shape_inference",
+        "//tensorflow/core:lib",
+    ],
+)
+
+tf_cc_test(
+    name = "xla_builder_test",
+    srcs = ["xla_builder_test.cc"],
+    deps = [
+        ":xla_builder",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla:test_helpers",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/core:test",
+    ],
+)
+
+# -----------------------------------------------------------------------------
+
+filegroup(
+    name = "all_files",
+    srcs = glob(
+        ["**/*"],
+        exclude = [
+            "**/METADATA",
+            "**/OWNERS",
+        ],
+    ),
+    visibility = ["//tensorflow:__subpackages__"],
+)
diff --git a/tensorflow/compiler/xla/client/xla_client/xla_builder.cc b/tensorflow/compiler/xla/client/xla_client/xla_builder.cc
new file mode 100644
index 0000000000000000000000000000000000000000..82b61d4d517fe8394515727c3da54e955137eec7
--- /dev/null
+++ b/tensorflow/compiler/xla/client/xla_client/xla_builder.cc
@@ -0,0 +1,249 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+
+#include <string>
+#include <utility>
+
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/shape_inference.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/mutex.h"
+
+namespace xla {
+
+using tensorflow::strings::StrCat;
+
+namespace {
+
+int64 GetUniqueId() {
+  static tensorflow::mutex mu(tensorflow::LINKER_INITIALIZED);
+  static int64 built_counter = 0;
+  tensorflow::mutex_lock loc(mu);
+  const int64 id = built_counter++;
+  return id;
+}
+
+// Returns true if an instruction with the given opcode can be the root of the
+// computation.
+bool CanBeRoot(HloOpcode opcode) {
+  switch (opcode) {
+    case HloOpcode::kSend:
+    case HloOpcode::kOutfeed:
+    case HloOpcode::kTrace:
+      return false;
+    default:
+      return true;
+  }
+}
+
+}  // namespace
+
+StatusOr<Shape> XlaBuilder::GetShape(const XlaOp& op) const {
+  TF_ASSIGN_OR_RETURN(auto instr, LookUpInstruction(op));
+  return instr->shape();
+}
+
+StatusOr<Shape> XlaOp::GetShape() const {
+  TF_RET_CHECK(builder_ != nullptr);
+  return builder_->GetShape(*this);
+}
+
+XlaBuilder::XlaBuilder(const string& computation_name)
+    : name_(computation_name) {}
+
+XlaBuilder::~XlaBuilder() {}
+
+void XlaBuilder::NoteError(const Status& error) {
+  CHECK(!error.ok());
+  if (die_immediately_on_error_) {
+    LOG(FATAL) << "error building computation: " << error;
+  }
+
+  if (first_error_.ok()) {
+    first_error_ = error;
+    first_error_backtrace_.CreateCurrent(/*skip_count=*/1);
+  }
+}
+
+StatusOr<XlaComputation> XlaBuilder::Build() {
+  if (!first_error_.ok()) {
+    string backtrace;
+    first_error_backtrace_.Dump(tensorflow::DebugWriteToString, &backtrace);
+    return AppendStatus(first_error_, backtrace);
+  }
+
+  HloComputationProto entry;
+  ProgramShape* program_shape = entry.mutable_program_shape();
+
+  entry.set_name(name_);
+
+  // Not all instructions can be roots. Walk backwards from the last added
+  // instruction until a valid root is found.
+  entry.set_root_id(-1);
+  for (int64 i = instructions_.size() - 1; i >= 0; i--) {
+    TF_ASSIGN_OR_RETURN(HloOpcode opcode,
+                        StringToHloOpcode(instructions_[i].opcode()));
+    if (CanBeRoot(opcode)) {
+      entry.set_root_id(instructions_[i].id());
+      *program_shape->mutable_result() = instructions_[i].shape();
+      break;
+    }
+  }
+  if (entry.root_id() == -1) {
+    return FailedPrecondition("no root instruction was found");
+  }
+
+  // Check that the parameter numbers are continuous from 0, and add parameter
+  // shapes and names to the program shape.
+  const int64 param_count = parameter_numbers_.size();
+  for (int64 i = 0; i < param_count; i++) {
+    program_shape->add_parameters();
+    program_shape->add_parameter_names();
+  }
+  for (const HloInstructionProto& instr : instructions_) {
+    // Parameter number uniqueness is guaranteed in XlaBuilder::Parameter(). So
+    // to verify continuity, we just need to verify that every parameter is in
+    // the right range.
+    if (instr.opcode() == HloOpcodeString(HloOpcode::kParameter)) {
+      const int64 index = instr.parameter_number();
+      TF_RET_CHECK(index >= 0 && index < param_count)
+          << "invalid parameter number: " << index;
+      *program_shape->mutable_parameters(index) = instr.shape();
+      *program_shape->mutable_parameter_names(index) = instr.name();
+    }
+  }
+
+  for (auto& instruction : instructions_) {
+    entry.add_instructions()->Swap(&instruction);
+  }
+
+  const int64 id = GetUniqueId();
+  entry.set_id(id);
+  XlaComputation computation(id);
+  HloModuleProto* module = computation.mutable_proto();
+  module->set_name(entry.name());
+  module->set_id(entry.id());
+  module->set_entry_computation_name(entry.name());
+  module->set_entry_computation_id(entry.id());
+  *module->mutable_program_shape() = entry.program_shape();
+  for (auto& e : embedded_) {
+    module->add_computations()->Swap(&e.second);
+  }
+  module->add_computations()->Swap(&entry);
+
+  return std::move(computation);
+}
+
+XlaOp XlaBuilder::Add(const XlaOp& lhs, const XlaOp& rhs,
+                      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions) {
+  auto op = [&]() -> StatusOr<XlaOp> {
+    HloInstructionProto instr;
+    TF_ASSIGN_OR_RETURN(const Shape& lhs_shape, lhs.GetShape());
+    TF_ASSIGN_OR_RETURN(const Shape& rhs_shape, rhs.GetShape());
+    TF_ASSIGN_OR_RETURN(
+        *instr.mutable_shape(),
+        ShapeInference::InferBinaryOpShape(HloOpcode::kAdd, lhs_shape,
+                                           rhs_shape, broadcast_dimensions));
+    return AddInstruction(std::move(instr), HloOpcode::kAdd, {lhs, rhs});
+  };
+  return NoteErrorOrReturn(op());
+}
+
+XlaOp XlaBuilder::ConstantLiteral(const Literal& literal) {
+  HloInstructionProto instr;
+  *instr.mutable_shape() = literal.shape();
+  *instr.mutable_literal() = literal.ToProto();
+  return AddInstruction(std::move(instr), HloOpcode::kConstant);
+}
+
+XlaOp XlaBuilder::Call(const XlaComputation& computation,
+                       tensorflow::gtl::ArraySlice<XlaOp> operands) {
+  auto op = [&]() -> StatusOr<XlaOp> {
+    HloInstructionProto instr;
+    std::vector<const Shape*> operand_shape_ptrs;
+    std::vector<Shape> operand_shapes;
+    for (const auto& operand : operands) {
+      TF_ASSIGN_OR_RETURN(const Shape& shape, operand.GetShape());
+      operand_shapes.push_back(shape);
+    }
+    c_transform(operand_shapes, std::back_inserter(operand_shape_ptrs),
+                [](const Shape& shape) { return &shape; });
+    TF_ASSIGN_OR_RETURN(*instr.mutable_shape(),
+                        ShapeInference::InferCallShape(
+                            operand_shape_ptrs,
+                            /*to_apply=*/computation.GetProgramShape()));
+
+    // Add called computation.
+    instr.add_called_computation_ids(
+        computation.proto().entry_computation_id());
+    for (const HloComputationProto& e : computation.proto().computations()) {
+      embedded_.insert({e.id(), e});
+    }
+
+    return AddInstruction(std::move(instr), HloOpcode::kCall, operands);
+  };
+  return NoteErrorOrReturn(op());
+}
+
+XlaOp XlaBuilder::Parameter(int64 parameter_number, const Shape& shape,
+                            const string& name) {
+  auto op = [&]() -> StatusOr<XlaOp> {
+    HloInstructionProto instr;
+    if (parameter_numbers_.find(parameter_number) != parameter_numbers_.end()) {
+      return InvalidArgument("parameter %lld already registered",
+                             parameter_number);
+    }
+    parameter_numbers_.insert(parameter_number);
+    instr.set_parameter_number(parameter_number);
+    instr.set_name(name);
+    *instr.mutable_shape() = shape;
+    return AddInstruction(std::move(instr), HloOpcode::kParameter);
+  };
+  return NoteErrorOrReturn(op());
+}
+
+XlaOp XlaBuilder::AddInstruction(HloInstructionProto&& instr, HloOpcode opcode,
+                                 tensorflow::gtl::ArraySlice<XlaOp> operands) {
+  const int64 handle = instructions_.size();
+  instr.set_id(handle);
+  instr.set_opcode(HloOpcodeString(opcode));
+  if (instr.name().empty()) {
+    instr.set_name(StrCat(instr.opcode(), ".", handle));
+  } else {
+    // Append the handle to make sure the name is unique.
+    instr.set_name(StrCat(instr.name(), ".", handle));
+  }
+  for (const auto& operand : operands) {
+    instr.add_operand_ids(operand.handle());
+  }
+  instructions_.push_back(std::move(instr));
+
+  XlaOp op(handle, this);
+  return op;
+}
+
+StatusOr<const HloInstructionProto*> XlaBuilder::LookUpInstruction(
+    const XlaOp& op) const {
+  TF_RET_CHECK(op.builder_ == this);
+  if (op.handle() >= instructions_.size() || op.handle() < 0) {
+    return InvalidArgument("no XlaOp value %lld", op.handle());
+  }
+  return &instructions_[op.handle()];
+}
+
+}  // namespace xla
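
Review note on `XlaBuilder::Build()` in the file above: the root is chosen by walking backwards from the last enqueued instruction until one is found whose opcode may be a root (`kSend`, `kOutfeed` and `kTrace` may not). The following is a minimal standalone sketch of that scan, assuming a reduced, hypothetical `Opcode` enum and a hypothetical `FindRootIndex` helper; the real code works on `HloInstructionProto` entries and returns a `FailedPrecondition` status instead of -1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, reduced opcode set; the real HloOpcode enum is much larger.
enum class Opcode { kParameter, kConstant, kAdd, kSend, kOutfeed, kTrace };

// Mirrors CanBeRoot() from the diff: side-effecting ops cannot be roots.
inline bool CanBeRoot(Opcode opcode) {
  switch (opcode) {
    case Opcode::kSend:
    case Opcode::kOutfeed:
    case Opcode::kTrace:
      return false;
    default:
      return true;
  }
}

// Returns the index of the last instruction that may serve as the root,
// or -1 if no instruction qualifies (including the empty case).
inline int64_t FindRootIndex(const std::vector<Opcode>& instructions) {
  for (int64_t i = static_cast<int64_t>(instructions.size()) - 1; i >= 0;
       --i) {
    if (CanBeRoot(instructions[i])) return i;
  }
  return -1;
}
```

For example, for the sequence parameter, add, outfeed the scan skips the trailing outfeed and selects the add at index 1.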
diff --git a/tensorflow/compiler/xla/client/xla_client/xla_builder.h b/tensorflow/compiler/xla/client/xla_client/xla_builder.h
new file mode 100644
index 0000000000000000000000000000000000000000..f1d10ecdb968da556d79acd5e9f4e5623176aa68
--- /dev/null
+++ b/tensorflow/compiler/xla/client/xla_client/xla_builder.h
@@ -0,0 +1,218 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// TODO(b/74197823): Replace computation_builder.h with this file.
+//
+// This is NOT YET ready to use.
+
+#ifndef TENSORFLOW_COMPILER_XLA_CLIENT_XLA_CLIENT_XLA_BUILDER_H_
+#define TENSORFLOW_COMPILER_XLA_CLIENT_XLA_CLIENT_XLA_BUILDER_H_
+
+#include <map>
+#include <string>
+#include <utility>
+
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/hlo.pb.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
+#include "tensorflow/core/lib/gtl/flatset.h"
+#include "tensorflow/core/platform/macros.h"
+#include "tensorflow/core/platform/stacktrace.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+
+class XlaBuilder;
+
+// This represents an instruction that has been enqueued using the XlaBuilder.
+// This is used to pass to subsequent computations that depend upon the
+// instruction as an operand.
+//
+// TODO(b/74197823): Replace xla::ComputationDataHandle with this one.
+class XlaOp {
+ public:
+  StatusOr<Shape> GetShape() const;
+
+ private:
+  XlaOp() : handle_(0), builder_(nullptr) {}
+  XlaOp(int64 handle, XlaBuilder* builder)
+      : handle_(handle), builder_(builder) {}
+
+  int64 handle() const { return handle_; }
+  friend class XlaBuilder;
+
+  int64 handle_;
+  XlaBuilder* builder_;  // Not owned.
+};
+
+// The computation graph that the user builds up with the XlaBuilder.
+//
+// TODO(b/74197823): Replace xla::Computation with this one.
+class XlaComputation {
+ public:
+  XlaComputation(const XlaComputation&) = delete;
+  XlaComputation& operator=(const XlaComputation&) = delete;
+
+  XlaComputation(XlaComputation&& from) { *this = std::move(from); }
+
+  XlaComputation& operator=(XlaComputation&& from) {
+    proto_ = std::move(from.proto_);
+    unique_id_ = from.unique_id_;
+    return *this;
+  }
+
+  // Returns the "program shape" (parameter and return shapes) for this
+  // computation.
+  const ProgramShape& GetProgramShape() const { return proto_.program_shape(); }
+
+  const HloModuleProto& proto() const { return proto_; }
+
+ private:
+  // Creates a null Computation.
+  XlaComputation(const int64 unique_id) : unique_id_(unique_id) {}
+  HloModuleProto* mutable_proto() { return &proto_; }
+  friend class XlaBuilder;
+
+  int64 unique_id_;
+  HloModuleProto proto_;
+};
+
+// A convenient interface for building up computations.
+//
+// Thread-compatible.
+//
+// TODO(b/74197823): Replace xla::ComputationBuilder with this one.
+class XlaBuilder {
+ public:
+  // computation_name: name to use for the built computation.
+  XlaBuilder(const string& computation_name);
+
+  XlaBuilder(const XlaBuilder&) = delete;
+  XlaBuilder& operator=(const XlaBuilder&) = delete;
+
+  ~XlaBuilder();
+
+  // Returns the computation name.
+  const string& name() const { return name_; }
+
+  // Sets the builder to a mode where it will die immediately when an error is
+  // encountered, rather than producing it in a deferred fashion when Build() is
+  // called (which is the default).
+  void set_die_immediately_on_error(bool enabled) {
+    die_immediately_on_error_ = enabled;
+  }
+
+  // Enqueues an add instruction onto the computation.
+  XlaOp Add(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a call instruction onto the computation.
+  XlaOp Call(const XlaComputation& computation,
+             tensorflow::gtl::ArraySlice<XlaOp> operands);
+
+  // Enqueues a "retrieve parameter value" instruction for a parameter that was
+  // passed to the computation.
+  XlaOp Parameter(int64 parameter_number, const Shape& shape,
+                  const string& name);
+
+  // Enqueues a constant with the value of the given literal onto the
+  // computation.
+  XlaOp ConstantLiteral(const Literal& literal);
+
+  // Enqueues a constant onto the computation. Methods are templated on the
+  // native host type (NativeT) which corresponds to a specific XLA
+  // PrimitiveType as given in the following table:
+  //
+  //  Native Type   PrimitiveType
+  // -----------------------------
+  //   bool           PRED
+  //   int32          S32
+  //   int64          S64
+  //   uint32         U32
+  //   uint64         U64
+  //   float          F32
+  //   double         F64
+  //
+  // Note: not all primitive types defined in xla_data.proto have a
+  // corresponding native type yet.
+  template <typename NativeT>
+  XlaOp ConstantR0(NativeT value);
+
+  // Returns the shape of the given op.
+  StatusOr<Shape> GetShape(const XlaOp& op) const;
+
+  // Builds the computation with the requested operations, or returns a non-ok
+  // status.
+  StatusOr<XlaComputation> Build();
+
+ private:
+  XlaOp AddInstruction(HloInstructionProto&& instr, HloOpcode opcode,
+                       tensorflow::gtl::ArraySlice<XlaOp> operands = {});
+
+  // Notes that the error occurred by:
+  // * storing it internally and capturing a backtrace if it's the first error
+  //   (this deferred value will be produced on the call to Build())
+  // * dying if die_immediately_on_error_ is true
+  void NoteError(const Status& error);
+
+  XlaOp NoteErrorOrReturn(StatusOr<XlaOp>&& op) {
+    if (!op.ok()) {
+      NoteError(op.status());
+      return XlaOp();
+    }
+    return op.ConsumeValueOrDie();
+  }
+
+  StatusOr<const HloInstructionProto*> LookUpInstruction(const XlaOp& op) const;
+
+  string name_;  // Name to use for the built computation.
+
+  // The first error encountered while building the computation.
+  // This is OK until the first error is encountered.
+  Status first_error_;
+
+  // The saved stack trace from the point at which the first error occurred.
+  tensorflow::SavedStackTrace first_error_backtrace_;
+
+  // The instructions of this computation.
+  std::vector<HloInstructionProto> instructions_;
+
+  // The embedded computations used by this computation. Each computation was
+  // the entry computation of some XlaComputation, the key is the unique id of
+  // that XlaComputation.
+  std::map<int64, HloComputationProto> embedded_;
+
+  // The unique parameter numbers.
+  tensorflow::gtl::FlatSet<int64> parameter_numbers_;
+
+  // Mode bit that indicates whether to die when a first error is encountered.
+  bool die_immediately_on_error_ = false;
+};
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR0(NativeT value) {
+  return ConstantLiteral(*Literal::CreateR0<NativeT>(value));
+}
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_CLIENT_XLA_CLIENT_XLA_BUILDER_H_
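
Review note on the error handling declared in the header above: `NoteError` stores only the first error, op-enqueueing methods return a placeholder `XlaOp` on failure, and the stored error surfaces later from `Build()`. The sketch below illustrates that deferred-error pattern in isolation. `DeferredErrorBuilder`, `AddPositive`, the -1 sentinel handle and the string-based error are all assumptions made for this note; the real class uses `Status`, `StatusOr` and a saved stack trace.

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature of the XlaBuilder error path: errors raised while
// enqueueing ops are remembered and only reported when Build() is called.
class DeferredErrorBuilder {
 public:
  // Records only the first error; later errors are dropped.
  void NoteError(const std::string& message) {
    assert(!message.empty());
    if (first_error_.empty()) first_error_ = message;
  }

  // Enqueues a stand-in "op". Returns its handle, or the sentinel -1 after
  // noting an error when the argument is invalid.
  int AddPositive(int value) {
    if (value <= 0) {
      NoteError("value must be positive, got " + std::to_string(value));
      return -1;
    }
    return next_handle_++;
  }

  // Succeeds only if no error was noted; otherwise copies the first error
  // into *error and returns false.
  bool Build(std::string* error) const {
    if (!first_error_.empty()) {
      *error = first_error_;
      return false;
    }
    return true;
  }

 private:
  std::string first_error_;
  int next_handle_ = 0;
};
```

Deferring the report lets callers chain several builder calls without checking each result, at the cost of learning about a failure only at build time; `set_die_immediately_on_error` in the header exists for callers who prefer to fail at the faulting call.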
diff --git a/tensorflow/compiler/xla/client/xla_client/xla_builder_test.cc b/tensorflow/compiler/xla/client/xla_client/xla_builder_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..a400e4e78b044ae633a0135b0011d5267eacc115
--- /dev/null
+++ b/tensorflow/compiler/xla/client/xla_client/xla_builder_test.cc
@@ -0,0 +1,137 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+
+#include <string>
+
+#include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/test.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+
+namespace xla {
+
+namespace {
+
+namespace op = xla::testing::opcode_matchers;
+
+using ::testing::HasSubstr;
+
+// TODO(b/74197823): Move the tests to service/.
+class XlaBuilderTest : public ::testing::Test {
+ protected:
+  StatusOr<std::unique_ptr<HloModule>> BuildHloModule(XlaBuilder* b) {
+    TF_ASSIGN_OR_RETURN(XlaComputation computation, b->Build());
+    const HloModuleProto& proto = computation.proto();
+    TF_ASSIGN_OR_RETURN(const auto& config,
+                        HloModule::CreateModuleConfigFromProto(proto));
+    return HloModule::CreateFromProto(proto, config);
+  }
+
+  // Returns the name of the test currently being run.
+  string TestName() const {
+    return ::testing::UnitTest::GetInstance()->current_test_info()->name();
+  }
+};
+
+TEST_F(XlaBuilderTest, OnePlusTwo) {
+  XlaBuilder b(TestName());
+  b.Add(b.ConstantR0<float>(1.0), b.ConstantR0<float>(2.0));
+  TF_ASSERT_OK_AND_ASSIGN(auto module, BuildHloModule(&b));
+  auto root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Add(op::Constant(), op::Constant()));
+}
+
+TEST_F(XlaBuilderTest, ParamPlusConstant) {
+  XlaBuilder b(TestName());
+  auto x = b.Parameter(0, ShapeUtil::MakeShape(F32, {3, 5}), "x");
+  b.Add(x, b.ConstantR0<float>(1.0));
+  TF_ASSERT_OK_AND_ASSIGN(auto module, BuildHloModule(&b));
+  auto root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Add(op::Parameter(), op::Constant()));
+}
+
+TEST_F(XlaBuilderTest, ParamPlusParam) {
+  XlaBuilder b(TestName());
+  const auto& x_shape = ShapeUtil::MakeShape(S32, {2, 4, 6});
+  const auto& y_shape = ShapeUtil::MakeShape(S32, {2, 4});
+  auto x = b.Parameter(0, x_shape, "x");
+  auto y = b.Parameter(1, y_shape, "y");
+  auto add = b.Add(x, y, /*broadcast_dimensions=*/{0, 1});
+
+  TF_ASSERT_OK_AND_ASSIGN(auto add_shape, add.GetShape());
+  EXPECT_TRUE(ShapeUtil::Equal(add_shape, x_shape));
+
+  TF_ASSERT_OK_AND_ASSIGN(auto module, BuildHloModule(&b));
+  auto root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Add(op::Parameter(0), op::Parameter(1)));
+}
+
+TEST_F(XlaBuilderTest, XPlusX) {
+  XlaBuilder b(TestName());
+  auto x = b.Parameter(0, ShapeUtil::MakeShape(S32, {1, 3, 5, 7}), "x");
+  b.Add(x, x);
+  TF_ASSERT_OK_AND_ASSIGN(auto module, BuildHloModule(&b));
+  auto root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Add(op::Parameter(0), op::Parameter(0)));
+}
+
+TEST_F(XlaBuilderTest, ShapeInferenceError) {
+  XlaBuilder b(TestName());
+  auto x = b.Parameter(0, ShapeUtil::MakeShape(U32, {2, 4, 6}), "x");
+  auto y = b.Parameter(1, ShapeUtil::MakeShape(U32, {2, 4}), "y");
+  b.Add(x, y);
+  auto statusor = BuildHloModule(&b);
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(), HasSubstr("shape inference"));
+}
+
+TEST_F(XlaBuilderTest, ParameterAlreadyRegistered) {
+  XlaBuilder b_call("add");
+  b_call.Parameter(0, ShapeUtil::MakeShape(PRED, {}), "x");
+
+  XlaBuilder b(TestName());
+  auto x = b.Parameter(0, ShapeUtil::MakeShape(PRED, {}), "x");
+  auto y = b.Parameter(0, ShapeUtil::MakeShape(PRED, {}), "y");
+  b.Add(x, y);
+  auto statusor = BuildHloModule(&b);
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("parameter 0 already registered"));
+}
+
+TEST_F(XlaBuilderTest, Call) {
+  XlaBuilder b_call("the_only_to_apply");
+  auto p0 = b_call.Parameter(0, ShapeUtil::MakeShape(F32, {}), "p0");
+  auto p1 = b_call.Parameter(1, ShapeUtil::MakeShape(F32, {}), "p1");
+  b_call.Add(p0, p1);
+  TF_ASSERT_OK_AND_ASSIGN(auto call, b_call.Build());
+  XlaBuilder b(TestName());
+  auto x = b.Parameter(0, ShapeUtil::MakeShape(F32, {}), "x");
+  auto y = b.Parameter(1, ShapeUtil::MakeShape(F32, {}), "y");
+  auto one = b.ConstantR0<float>(1);
+  auto two = b.ConstantR0<float>(2);
+  b.Add(b.Call(call, {x, y}), b.Call(call, {one, two}));
+  TF_ASSERT_OK_AND_ASSIGN(auto module, BuildHloModule(&b));
+  auto root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Add(op::Call(op::Parameter(), op::Parameter()),
+                            op::Call(op::Constant(), op::Constant())));
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/legacy_flags/parse_flags_from_env_test.cc b/tensorflow/compiler/xla/legacy_flags/parse_flags_from_env_test.cc
index a3b4286f4c12bf39a44c63dd6e7d303a46a418c3..7b6ae311c1099dccb8dceb2f49743c1b185cd5ab 100644
--- a/tensorflow/compiler/xla/legacy_flags/parse_flags_from_env_test.cc
+++ b/tensorflow/compiler/xla/legacy_flags/parse_flags_from_env_test.cc
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
 #include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/subprocess.h"
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/util/command_line_flags.h"
diff --git a/tensorflow/compiler/xla/literal_util.cc b/tensorflow/compiler/xla/literal_util.cc
index 823da43b5ab2e9c8e80181efc993735877a2c363..214c2030cdc7e6fc474210dbf446e21305d2b146 100644
--- a/tensorflow/compiler/xla/literal_util.cc
+++ b/tensorflow/compiler/xla/literal_util.cc
@@ -223,7 +223,7 @@ Status Literal::CopySliceFromInternal(
     Literal::StrideConfig stride_config(src_literal.shape(), shape(),
                                         copy_size);
 
-    auto copy_proc = [&](const std::vector<int64>& indexes) {
+    auto copy_proc = [&](tensorflow::gtl::ArraySlice<int64> indexes) {
       // Map from multi-dimensional index, to source index.
       std::transform(indexes.begin(), indexes.end(), src_base.begin(),
                      src_indexes.begin(), std::plus<int64>());
@@ -248,6 +248,28 @@ Status Literal::CopySliceFromInternal(
   return Status::OK();
 }
 
+Status Literal::CopyElementFrom(const Literal& src_literal,
+                                tensorflow::gtl::ArraySlice<int64> src_index,
+                                tensorflow::gtl::ArraySlice<int64> dest_index) {
+  DCHECK_EQ(shape().element_type(), src_literal.shape().element_type());
+  const int64 src_linear_index = IndexUtil::MultidimensionalIndexToLinearIndex(
+      src_literal.shape(), src_index);
+  const int64 dest_linear_index =
+      IndexUtil::MultidimensionalIndexToLinearIndex(shape(), dest_index);
+  const int64 primitive_size =
+      ShapeUtil::ByteSizeOfPrimitiveType(shape().element_type());
+
+  char* dest_address =
+      static_cast<char*>(untyped_data()) + dest_linear_index * primitive_size;
+  const char* source_address =
+      static_cast<const char*>(src_literal.untyped_data()) +
+      src_linear_index * primitive_size;
+  if (dest_address != source_address) {
+    memcpy(dest_address, source_address, primitive_size);
+  }
+  return Status::OK();
+}
+
 std::vector<Literal> Literal::DecomposeTuple() {
   CHECK(ShapeUtil::IsTuple(shape()));
   std::vector<Literal> elements;
@@ -343,7 +365,7 @@ Status Literal::Piece::CopyFrom(const Literal::Piece& src) {
 #undef COPY_ELEMENTS
       default:
         return Unimplemented(
-            "Unhandled primitive type %s",
+            "Copying a Literal object with element type %s is not implemented.",
             PrimitiveType_Name(subshape().element_type()).c_str());
     }
   }
@@ -491,7 +513,10 @@ Status Literal::CopySliceFrom(const Literal& src_literal,
     default:
       break;
   }
-  return Unimplemented("Unhandled primitive type %d", shape().element_type());
+  return Unimplemented(
+      "Copying a slice from a Literal object with element type %d is not "
+      "implemented.",
+      shape().element_type());
 }
 
 /* static */ Literal Literal::Zero(PrimitiveType primitive_type) {
@@ -808,9 +833,10 @@ std::unique_ptr<Literal> Literal::Slice(
   DimensionVector result_dimensions;
   for (int64 dnum = 0; dnum < ShapeUtil::Rank(shape()); ++dnum) {
     CHECK_GE(start_indices[dnum], 0);
-    CHECK_LE(limit_indices[dnum], shape().dimensions(dnum));
+    CHECK_LE(limit_indices[dnum], shape().dimensions(dnum))
+        << "dnum = " << dnum;
     int64 dimension = limit_indices[dnum] - start_indices[dnum];
-    CHECK_GT(dimension, 0);
+    CHECK_GE(dimension, 0) << "dnum = " << dnum;
     result_dimensions.push_back(dimension);
   }
   const auto result_shape =
@@ -903,7 +929,7 @@ string Literal::GetAsString(tensorflow::gtl::ArraySlice<int64> multi_index,
     case U64:
       return StrCat(Get<uint64>(multi_index, shape_index));
     case F16:
-      return StrCat(Get<half>(multi_index, shape_index));
+      return StrCat(static_cast<float>(Get<half>(multi_index, shape_index)));
     case F32:
       return StrCat(Get<float>(multi_index, shape_index));
     case BF16:
@@ -953,7 +979,8 @@ string Literal::GetSparseElementAsString(int64 sparse_element_number,
       return StrCat(
           GetSparseElement<uint64>(sparse_element_number, shape_index));
     case F16:
-      return StrCat(GetSparseElement<half>(sparse_element_number, shape_index));
+      return StrCat(static_cast<float>(
+          GetSparseElement<half>(sparse_element_number, shape_index)));
     case F32:
       return StrCat(
           GetSparseElement<float>(sparse_element_number, shape_index));
@@ -997,6 +1024,36 @@ StatusOr<int64> Literal::GetIntegralAsS64(
   }
 }
 
+Status Literal::SetIntegralAsS64(tensorflow::gtl::ArraySlice<int64> multi_index,
+                                 int64 value) {
+  CHECK(LayoutUtil::IsDenseArray(shape()));
+  switch (shape().element_type()) {
+    case PRED:
+      Set<bool>(multi_index, value);
+      break;
+    case U8:
+      Set<uint8>(multi_index, value);
+      break;
+    case S32:
+      Set<int32>(multi_index, value);
+      break;
+    case S64:
+      Set<int64>(multi_index, value);
+      break;
+    case U32:
+      Set<uint32>(multi_index, value);
+      break;
+    case U64:
+      Set<uint64>(multi_index, value);
+      break;
+    default:
+      return FailedPrecondition(
+          "Array element type is not integral: %s",
+          PrimitiveType_Name(shape().element_type()).c_str());
+  }
+  return Status::OK();
+}
+
 tensorflow::gtl::ArraySlice<int64> Literal::GetSparseIndex(
     int64 sparse_element_number, const ShapeIndex& shape_index) const {
   const Piece& p = piece(shape_index);
@@ -1394,8 +1451,8 @@ StatusOr<std::unique_ptr<Literal>> ConvertIfDestTypeMatches(
       return ConvertToC64<primitive_src_type>(src_literal);
     // Other types are not yet supported.
     default:
-      return InvalidArgument(
-          "Unimplemented: Convert from type %s to type %s",
+      return Unimplemented(
+          "Converting from type %s to type %s is not implemented.",
           PrimitiveType_Name(src_literal.shape().element_type()).c_str(),
           PrimitiveType_Name(primitive_dest_type).c_str());
   }
@@ -1406,6 +1463,9 @@ StatusOr<std::unique_ptr<Literal>> ConvertIfDestTypeMatches(
 StatusOr<std::unique_ptr<Literal>> Literal::Convert(
     PrimitiveType primitive_dest_type) const {
   TF_RET_CHECK(ShapeUtil::IsArray(shape()));
+  if (shape().element_type() == primitive_dest_type) {
+    return CloneToUnique();
+  }
   switch (shape().element_type()) {
 #define CONVERT_IF_DEST_TYPE_MATCHES(type) \
   case (type):                             \
@@ -1424,10 +1484,29 @@ StatusOr<std::unique_ptr<Literal>> Literal::Convert(
 #undef CONVERT_IF_DEST_TYPE_MATCHES
       // Other types are not yet supported.
     default:
-      return InvalidArgument("Unimplemented: Convert from type %s to type %s",
-                             PrimitiveType_Name(shape().element_type()).c_str(),
-                             PrimitiveType_Name(primitive_dest_type).c_str());
+      return Unimplemented(
+          "Converting from type %s to type %s is not implemented.",
+          PrimitiveType_Name(shape().element_type()).c_str(),
+          PrimitiveType_Name(primitive_dest_type).c_str());
+  }
+}
+
+StatusOr<std::unique_ptr<Literal>> Literal::ConvertToShape(
+    const Shape& dest_shape) const {
+  if (!ShapeUtil::IsTuple(dest_shape)) {
+    return Convert(dest_shape.element_type());
   }
+  std::vector<Literal> elements;
+  for (int i = 0; i < ShapeUtil::TupleElementCount(shape()); ++i) {
+    auto element = LiteralView::Create(*this, {i});
+    TF_ASSIGN_OR_RETURN(
+        auto new_element,
+        element.ConvertToShape(ShapeUtil::GetSubshape(dest_shape, {i})));
+    elements.push_back(std::move(*new_element));
+  }
+  auto converted = MakeUnique<Literal>();
+  *converted = Literal::MoveIntoTuple(&elements);
+  return std::move(converted);
 }
 
 template <typename NativeT>
diff --git a/tensorflow/compiler/xla/literal_util.h b/tensorflow/compiler/xla/literal_util.h
index d5ae3fd72322fe243f0156dfbe236b6d62ab8c9d..e24f5285d9a14cf26216e4a16c6d1e516afc413f 100644
--- a/tensorflow/compiler/xla/literal_util.h
+++ b/tensorflow/compiler/xla/literal_util.h
@@ -262,6 +262,11 @@ class Literal {
                        tensorflow::gtl::ArraySlice<int64> dest_base,
                        tensorflow::gtl::ArraySlice<int64> copy_size);
 
+  // Copies one element from src_literal[src_index] to (*this)[dest_index].
+  Status CopyElementFrom(const Literal& src_literal,
+                         tensorflow::gtl::ArraySlice<int64> src_index,
+                         tensorflow::gtl::ArraySlice<int64> dest_index);
+
   // Returns a vector containing the tuple elements of this Literal as separate
   // Literals. This Literal must be tuple-shaped and can be a nested tuple. The
   // elements are moved into the new Literals; no data is copied. Upon return
@@ -333,6 +338,11 @@ class Literal {
   StatusOr<std::unique_ptr<Literal>> Convert(
       PrimitiveType primitive_dest_type) const;
 
+  // Converts this literal to the given shape. Returns an error if the
+  // conversion is not possible.
+  StatusOr<std::unique_ptr<Literal>> ConvertToShape(
+      const Shape& dest_shape) const;
+
   // Creates a scalar literal value zero of the given primitive type.
   static Literal Zero(PrimitiveType primitive_type);
 
@@ -469,6 +479,11 @@ class Literal {
   StatusOr<int64> GetIntegralAsS64(
       tensorflow::gtl::ArraySlice<int64> multi_index) const;
 
+  // As Set(), but truncates `value` to the literal element type before storing.
+  // This literal must be a dense array; non-integral types return an error.
+  Status SetIntegralAsS64(tensorflow::gtl::ArraySlice<int64> multi_index,
+                          int64 value);
+
   // Returns an identity matrix (rank 2) with the given row and column count.
   template <typename NativeT>
   static std::unique_ptr<Literal> MakeIdentityR2(int64 size);
@@ -1269,7 +1284,7 @@ Status Literal::Populate(const FnType& generator) {
     int64 minor_dimension_size =
         ShapeUtil::GetDimension(this_shape, stride_config.minor_dimension);
 
-    auto init_function = [&](const std::vector<int64>& indexes) {
+    auto init_function = [&](tensorflow::gtl::ArraySlice<int64> indexes) {
       const int64 index =
           IndexUtil::MultidimensionalIndexToLinearIndex(shape(), indexes);
       std::copy(indexes.begin(), indexes.end(), minor_scan_indexes.begin());
diff --git a/tensorflow/compiler/xla/literal_util_test.cc b/tensorflow/compiler/xla/literal_util_test.cc
index ee2f4fe87440428c7364fe2924003c5124f4eaa2..7627762074b6132655c58690a7fffbaf2717e279 100644
--- a/tensorflow/compiler/xla/literal_util_test.cc
+++ b/tensorflow/compiler/xla/literal_util_test.cc
@@ -30,6 +30,7 @@ limitations under the License.
 namespace xla {
 namespace {
 
+using tensorflow::gtl::ArraySlice;
 using ::testing::ElementsAre;
 using ::testing::HasSubstr;
 
@@ -214,11 +215,11 @@ TEST_F(LiteralUtilTest, CreateSparse) {
   std::vector<int64> expected_values = {8, 9, 7, 10};
 
   EXPECT_EQ(literal->sparse_indices()->data(),
-            tensorflow::gtl::ArraySlice<int64>(
-                expected_indices.data(), expected_indices.num_elements()));
-  EXPECT_EQ(tensorflow::gtl::ArraySlice<int64>(literal->data<int64>().data(),
-                                               expected_values.size()),
-            tensorflow::gtl::ArraySlice<int64>(expected_values));
+            ArraySlice<int64>(expected_indices.data(),
+                              expected_indices.num_elements()));
+  EXPECT_EQ(
+      ArraySlice<int64>(literal->data<int64>().data(), expected_values.size()),
+      ArraySlice<int64>(expected_values));
 }
 
 TEST_F(LiteralUtilTest, LiteralR4F32ProjectedStringifies) {
@@ -290,7 +291,7 @@ TEST_F(LiteralUtilTest, EachCellR2F32) {
   // clang-format on
   std::vector<std::tuple<int64, int64, string>> seen;
   literal->EachCellAsString(
-      [&seen](tensorflow::gtl::ArraySlice<int64> indices, const string& value) {
+      [&seen](ArraySlice<int64> indices, const string& value) {
         seen.emplace_back(indices[0], indices[1], value);
       });
 
@@ -622,11 +623,10 @@ TEST_F(LiteralUtilTest, TransposeR4) {
   // clang-format on
   auto reshape = original->Transpose(/*permutation=*/{2, 3, 0, 1});
 
-  reshape->EachCell<float>(
-      [&](tensorflow::gtl::ArraySlice<int64> indices, float value) {
-        EXPECT_EQ(value, original->Get<float>(
-                             {indices[2], indices[3], indices[0], indices[1]}));
-      });
+  reshape->EachCell<float>([&](ArraySlice<int64> indices, float value) {
+    EXPECT_EQ(value, original->Get<float>(
+                         {indices[2], indices[3], indices[0], indices[1]}));
+  });
 }
 
 TEST_F(LiteralUtilTest, TestR4RelayoutEquivalence) {
@@ -863,7 +863,7 @@ TEST_F(LiteralUtilTest, CopySliceFrom) {
     const int64 zero_base[] = {0, 0, 0, 0};
     const int64 step[] = {1, 1, 1, 1};
     uint32 seqnr = 0;
-    auto init_proc = [&](const std::vector<int64>& indexes) {
+    auto init_proc = [&](ArraySlice<int64> indexes) {
       source->Set(indexes, ++seqnr);
       return true;
     };
@@ -879,7 +879,7 @@ TEST_F(LiteralUtilTest, CopySliceFrom) {
     std::vector<int64> source_indexes(TF_ARRAYSIZE(dimensions), 0);
     std::vector<int64> blank_indexes(TF_ARRAYSIZE(dimensions), 0);
     bool matched = true;
-    auto check_proc = [&](const std::vector<int64>& indexes) {
+    auto check_proc = [&](ArraySlice<int64> indexes) {
       std::copy(indexes.begin(), indexes.end(), source_indexes.begin());
       std::transform(source_indexes.begin(), source_indexes.end(), src_base,
                      source_indexes.begin(), std::plus<int64>());
@@ -1067,7 +1067,7 @@ TEST_F(LiteralUtilTest, Populate) {
         primitive_util::NativeToPrimitiveType<uint32>(), data.dimensions,
         data.layout);
     auto literal = Literal::CreateFromShape(shape);
-    auto generator = [&](tensorflow::gtl::ArraySlice<int64> indexes) -> uint32 {
+    auto generator = [&](ArraySlice<int64> indexes) -> uint32 {
       // Offsets from linear index just to avoid R0 literals to be initialized
       // with zero.
       return IndexUtil::MultidimensionalIndexToLinearIndex(literal->shape(),
@@ -1079,7 +1079,7 @@ TEST_F(LiteralUtilTest, Populate) {
     std::vector<int64> zero_base(data.dimensions.size(), 0);
     std::vector<int64> step(data.dimensions.size(), 1);
     bool matched = true;
-    auto check_function = [&](const std::vector<int64>& indexes) {
+    auto check_function = [&](ArraySlice<int64> indexes) {
       auto value = literal->Get<uint32>(indexes);
       matched = matched && (value == generator(indexes));
       return matched;
@@ -1232,15 +1232,15 @@ TEST_F(LiteralUtilTest, ConvertIfTypesMatch) {
   EXPECT_EQ(*conv, *c64);
 
   EXPECT_EQ(s32->Convert(TUPLE).status().code(),
-            tensorflow::error::INVALID_ARGUMENT);
+            tensorflow::error::UNIMPLEMENTED);
   EXPECT_EQ(s32->Convert(S16).status().code(),
-            tensorflow::error::INVALID_ARGUMENT);
+            tensorflow::error::UNIMPLEMENTED);
   EXPECT_EQ(s32->Convert(U16).status().code(),
-            tensorflow::error::INVALID_ARGUMENT);
+            tensorflow::error::UNIMPLEMENTED);
   EXPECT_EQ(c64->Convert(F32).status().code(),
-            tensorflow::error::INVALID_ARGUMENT);
+            tensorflow::error::UNIMPLEMENTED);
   EXPECT_EQ(c64->Convert(S32).status().code(),
-            tensorflow::error::INVALID_ARGUMENT);
+            tensorflow::error::UNIMPLEMENTED);
 }
 
 TEST_F(LiteralUtilTest, CopyFromProto_Bool) {
@@ -1702,7 +1702,7 @@ TEST_F(LiteralUtilTest, GetSparseElementAsString) {
   ASSERT_EQ(Literal::CreateSparse<half>(dimensions, indices,
                                         {half{1.0}, half{2.0}, half{3.0}})
                 ->GetSparseElementAsString(1),
-            tensorflow::strings::StrCat(half{2.0}));
+            tensorflow::strings::StrCat(static_cast<float>(half{2.0})));
   ASSERT_EQ(
       Literal::CreateSparse<complex64>(
           dimensions, indices,
diff --git a/tensorflow/compiler/xla/python/local_computation_builder.i b/tensorflow/compiler/xla/python/local_computation_builder.i
index b5354131c94930b75ea66036ddb61ecd3993414f..8f231d1a12d92ecd93908771019c1440da6855e3 100644
--- a/tensorflow/compiler/xla/python/local_computation_builder.i
+++ b/tensorflow/compiler/xla/python/local_computation_builder.i
@@ -141,6 +141,33 @@ bool GetIntAttr(PyObject* o, const char* field, int64* result) {
   return true;
 }
 
+// Returns "ok"; true if there is no error, false if there was an error.
+bool HandleStringAttribute(PyObject* o,
+                           const char* attr_name,
+                           std::function<void(string s)> f) {
+  if (!PyObject_HasAttrString(o, attr_name)) {
+    return true;  // It's ok for the object to not have the attribute.
+  }
+  PyObject* attr = PyObject_GetAttrString(o, attr_name);
+  if (attr == nullptr) {
+    return false;  // An error occurred getting the attribute.
+  }
+  if (attr == Py_None) {
+    Py_DECREF(attr);
+    return true;  // The attribute is None, which we consider ok.
+  }
+  if (!PyString_Check(attr)) {
+    string message = tensorflow::strings::Printf("%s must be a string or None; got %s",
+        attr_name, numpy::PyObjectCppRepr(attr).c_str());
+    PyErr_SetString(PyExc_TypeError, message.c_str());
+    Py_DECREF(attr);
+    return false;  // Type error, not ok.
+  }
+  f(PyString_AsString(attr));
+  Py_DECREF(attr);
+  return true;  // Handled string attribute, ok!
+}
+
 }
 }
 %}
@@ -216,6 +243,7 @@ tensorflow::ImportNumpy();
         PyExc_RuntimeError, $1.ToString().c_str());
     return NULL;
   }
+  Py_INCREF(Py_None);
   $result = Py_None;
 }
 
@@ -819,16 +847,32 @@ tensorflow::ImportNumpy();
   if ($input == Py_None) {
     $1 = NULL;
   } else {
-    PyObject* o = PyObject_GetAttrString($input, "generate_hlo_graph");
-    if (!o) {
+    if (!HandleStringAttribute($input, "generate_hlo_graph", [&](string s) {
+      build_options.set_generate_hlo_graph(std::move(s));
+    })) {
+      return nullptr;
+    }
+    if (!HandleStringAttribute($input, "dump_optimized_hlo_proto_to", [&](string s) {
+      build_options.set_dump_optimized_hlo_proto_to(std::move(s));
+    })) {
+      return nullptr;
+    }
+    if (!HandleStringAttribute($input, "dump_per_pass_hlo_proto_to", [&](string s) {
+      build_options.set_dump_per_pass_hlo_proto_to(std::move(s));
+    })) {
+      return nullptr;
+    }
+
+    PyObject* o = PyObject_GetAttrString($input, "hlo_profile");
+    if (o == NULL) {
       return NULL;
     }
     if (o != Py_None) {
-      if (!PyString_Check(o)) {
-        PyErr_SetString(PyExc_TypeError, "ExecutableBuildOptions.generate_hlo_graph must be a string or None.");
+      if (!PyBool_Check(o)) {
+        PyErr_SetString(PyExc_TypeError, "ExecutableBuildOptions.hlo_profile must be a bool or None.");
         return NULL;
       }
-      build_options.set_generate_hlo_graph(PyString_AsString(o));
+      build_options.set_hlo_profile(o == Py_True);
     }
     Py_DECREF(o);
 
diff --git a/tensorflow/compiler/xla/python/numpy_bridge.cc b/tensorflow/compiler/xla/python/numpy_bridge.cc
index 3d87480728aab1d4ebbc71c6c7504d37cae5edaf..eec48479c929ab0823fef342fc284bfdc4b1f339 100644
--- a/tensorflow/compiler/xla/python/numpy_bridge.cc
+++ b/tensorflow/compiler/xla/python/numpy_bridge.cc
@@ -170,8 +170,7 @@ static string PyObjectCppStr(PyObject* o) {
   return ExtractStringAndDecref(s);
 }
 
-// Safely returns a repr of the given Python object o as a C++ string.
-static string PyObjectCppRepr(PyObject* o) {
+string PyObjectCppRepr(PyObject* o) {
   PyObject* r = PyObject_Repr(o);
   return ExtractStringAndDecref(r);
 }
diff --git a/tensorflow/compiler/xla/python/numpy_bridge.h b/tensorflow/compiler/xla/python/numpy_bridge.h
index adfcc3b8588dce01718bb19dea936bace483be4d..9656cb1c31c39dbe54293700c2765d0723255657 100644
--- a/tensorflow/compiler/xla/python/numpy_bridge.h
+++ b/tensorflow/compiler/xla/python/numpy_bridge.h
@@ -107,6 +107,9 @@ void CopyLiteralToNumpyArray(const Literal& literal, PyArrayObject* py_array) {
   std::copy(source.begin(), source.end(), dest);
 }
 
+// Safely returns a repr of the given Python object o as a C++ string.
+string PyObjectCppRepr(PyObject* o);
+
 // Workarounds for Python 2 and 3 interop
 
 PyObject* LongToPyIntOrPyLong(long x);  // NOLINT
diff --git a/tensorflow/compiler/xla/python/xla_client.py b/tensorflow/compiler/xla/python/xla_client.py
index 3b8ec851d5aa032ebbf4f6cfc7e12f5a03539cbd..e548d420f4614d3b3fff6034f9a174d553ebea66 100644
--- a/tensorflow/compiler/xla/python/xla_client.py
+++ b/tensorflow/compiler/xla/python/xla_client.py
@@ -30,9 +30,9 @@ from tensorflow.compiler.xla import xla_data_pb2
 from tensorflow.compiler.xla.python import pywrap_xla as c_api
 
 
-# Most functions are snake_case for consistency with other modules,
-# whereas method names of ComputationBuilder and LocalComputation are
-# CamelCase for consistency with XLA.
+# Most functions are snake_case for consistency with other modules, whereas
+# method names of ComputationBuilder and LocalComputation are CamelCase for
+# consistency with XLA.
 # pylint: disable=invalid-name
 
 
@@ -123,24 +123,34 @@ _BINARY_OPS = [
     'Pow',
 ]
 
+
 XLA_ELEMENT_TYPE_TO_DTYPE = {
-    xla_data_pb2.F32: np.dtype(np.float32),
-    xla_data_pb2.F64: np.dtype(np.float64),
-    xla_data_pb2.S32: np.dtype(np.int32),
-    xla_data_pb2.S64: np.dtype(np.int64),
-    xla_data_pb2.U32: np.dtype(np.uint32),
-    xla_data_pb2.U64: np.dtype(np.uint64),
-    xla_data_pb2.PRED: np.dtype(np.bool),
+    xla_data_pb2.PRED: np.dtype('bool'),
+    xla_data_pb2.S8: np.dtype('int8'),
+    xla_data_pb2.S16: np.dtype('int16'),
+    xla_data_pb2.S32: np.dtype('int32'),
+    xla_data_pb2.S64: np.dtype('int64'),
+    xla_data_pb2.U8: np.dtype('uint8'),
+    xla_data_pb2.U16: np.dtype('uint16'),
+    xla_data_pb2.U32: np.dtype('uint32'),
+    xla_data_pb2.U64: np.dtype('uint64'),
+    xla_data_pb2.F16: np.dtype('float16'),
+    xla_data_pb2.F32: np.dtype('float32'),
+    xla_data_pb2.F64: np.dtype('float64'),
+    xla_data_pb2.C64: np.dtype('complex64'),
     xla_data_pb2.TUPLE: np.dtype(np.object),
 }
 
 # Note the conversion on the key. Numpy has a known issue wherein dtype hashing
 # doesn't work as expected (https://github.com/numpy/numpy/issues/7242). Thus,
 # when keying by dtype in this dict, we use the string form of dtypes.
-DTYPE_TO_XLA_ELEMENT_TYPE = {
-    str(v): k
-    for k, v in XLA_ELEMENT_TYPE_TO_DTYPE.items()
-}
+DTYPE_TO_XLA_ELEMENT_TYPE = {str(dt): et
+                             for et, dt in XLA_ELEMENT_TYPE_TO_DTYPE.items()}
+
+
+def dtype_to_etype(dtype):
+  """Convenience function for reading DTYPE_TO_XLA_ELEMENT_TYPE."""
+  return DTYPE_TO_XLA_ELEMENT_TYPE[str(np.dtype(dtype))]
 
 
 class LocalBuffer(object):
@@ -310,6 +320,9 @@ class CompileOptions(object):
 
   def __init__(self):
     self.generate_hlo_graph = None
+    self.dump_optimized_hlo_proto_to = None
+    self.dump_per_pass_hlo_proto_to = None
+    self.hlo_profile = False
 
 
 def transfer_to_infeed(value, replica_number=None):
diff --git a/tensorflow/compiler/xla/reference_util.cc b/tensorflow/compiler/xla/reference_util.cc
index a9acdae380af5b7f9efb3d08302fc717108f5e40..ad3a28e11939d6259ebd75d544a950ba7abd741f 100644
--- a/tensorflow/compiler/xla/reference_util.cc
+++ b/tensorflow/compiler/xla/reference_util.cc
@@ -30,29 +30,23 @@ limitations under the License.
 
 namespace xla {
 
-/* static */ std::unique_ptr<Array2D<float>> ReferenceUtil::TransposeArray2D(
-    const Array2D<float>& operand) {
-  auto result = MakeUnique<Array2D<float>>(operand.width(), operand.height());
-  for (int64 w = 0; w < operand.width(); ++w) {
-    for (int64 h = 0; h < operand.height(); ++h) {
-      (*result)(w, h) = operand(h, w);
-    }
-  }
-
-  return result;
-}
-
-/* static */ std::unique_ptr<Array2D<float>> ReferenceUtil::MatmulArray2D(
-    const Array2D<float>& lhs, const Array2D<float>& rhs) {
+namespace {
+
+template <typename T>
+std::unique_ptr<Array2D<T>> MatmulArray2DImpl(
+    const Array2D<T>& lhs, const Array2D<T>& rhs,
+    const std::function<void(
+        const void* run_options_ptr, T* out, T* lhs, T* rhs, int64 m, int64 n,
+        int64 k, int32 transpose_lhs, int32 transpose_rhs)>& impl_fn) {
   CHECK_EQ(lhs.width(), rhs.height());
   int m = lhs.height();
   int n = rhs.width();
   int k = lhs.width();
-  auto result = MakeUnique<Array2D<float>>(m, n);
+  auto result = MakeUnique<Array2D<T>>(m, n);
   // Because Eigen is a header-oriented library, make sure that the Eigen code
   // is the same as the code used by the CPU backend (otherwise the linker will
   // randomly pick *some* definition).
-  __xla_cpu_runtime_EigenSingleThreadedMatMulF32(
+  impl_fn(
       /*run_options_ptr=*/nullptr, result->data(), rhs.data(), lhs.data(), n, m,
       k,
       /*transpose_lhs=*/0,
@@ -60,22 +54,24 @@ namespace xla {
   return result;
 }
 
+}  // namespace
+
+/* static */ std::unique_ptr<Array2D<Eigen::half>> ReferenceUtil::MatmulArray2D(
+    const Array2D<Eigen::half>& lhs, const Array2D<Eigen::half>& rhs) {
+  return MatmulArray2DImpl<Eigen::half>(
+      lhs, rhs, __xla_cpu_runtime_EigenSingleThreadedMatMulF16);
+}
+
+/* static */ std::unique_ptr<Array2D<float>> ReferenceUtil::MatmulArray2D(
+    const Array2D<float>& lhs, const Array2D<float>& rhs) {
+  return MatmulArray2DImpl<float>(
+      lhs, rhs, __xla_cpu_runtime_EigenSingleThreadedMatMulF32);
+}
+
 /* static */ std::unique_ptr<Array2D<double>> ReferenceUtil::MatmulArray2D(
     const Array2D<double>& lhs, const Array2D<double>& rhs) {
-  CHECK_EQ(lhs.width(), rhs.height());
-  int m = lhs.height();
-  int n = rhs.width();
-  int k = lhs.width();
-  auto result = MakeUnique<Array2D<double>>(m, n);
-  // Because Eigen is a header-oriented library, make sure that the Eigen code
-  // is the same as the code used by the CPU backend (otherwise the linker will
-  // randomly pick *some* definition).
-  __xla_cpu_runtime_EigenSingleThreadedMatMulF64(
-      /*run_options_ptr=*/nullptr, result->data(), rhs.data(), lhs.data(), n, m,
-      k,
-      /*transpose_lhs=*/0,
-      /*transpose_rhs=*/0);
-  return result;
+  return MatmulArray2DImpl<double>(
+      lhs, rhs, __xla_cpu_runtime_EigenSingleThreadedMatMulF64);
 }
 
 /* static */ std::unique_ptr<Array2D<double>> ReferenceUtil::Array2DF32ToF64(
@@ -188,18 +184,6 @@ ReferenceUtil::SeparableConvArray4D(const Array4D<float>& input,
   return tensorflow::MathUtil::CeilOfRatio(unpadded_width, stride);
 }
 
-/* static  */ std::unique_ptr<std::vector<float>>
-ReferenceUtil::ReduceWindow1DGeneric(
-    const tensorflow::gtl::ArraySlice<float>& operand, float init,
-    const std::function<float(float, float)>& reduce_func,
-    const tensorflow::gtl::ArraySlice<int64>& window,
-    const tensorflow::gtl::ArraySlice<int64>& stride, Padding padding) {
-  std::vector<int64> dim_lengths{static_cast<int64>(operand.size())};
-  return ReduceWindow1DGeneric(
-      operand, init, reduce_func, window, stride,
-      xla::MakePadding(dim_lengths, window, stride, padding));
-}
-
 /* static  */ std::unique_ptr<std::vector<float>>
 ReferenceUtil::ReduceWindow1DGeneric(
     const tensorflow::gtl::ArraySlice<float>& operand, float init,
@@ -239,23 +223,28 @@ ReferenceUtil::ReduceWindow1DAdd(
     const tensorflow::gtl::ArraySlice<int64>& window,
     const tensorflow::gtl::ArraySlice<int64>& stride, Padding padding) {
   const auto add_reduce = [](float arg1, float arg2) { return arg1 + arg2; };
-  return ReduceWindow1DGeneric(operand, init, add_reduce, window, stride,
-                               padding);
+  std::vector<int64> dim_lengths{static_cast<int64>(operand.size())};
+  return ReduceWindow1DGeneric(
+      operand, init, add_reduce, window, stride,
+      xla::MakePadding(dim_lengths, window, stride, padding));
 }
 
-/* static  */ std::unique_ptr<Array2D<float>> ReferenceUtil::ReduceWindow2DAdd(
+/* static */ std::unique_ptr<Array2D<float>>
+ReferenceUtil::ReduceWindow2DGeneric(
     const Array2D<float>& operand, float init,
+    const std::function<float(float, float)>& reduce_func,
     const tensorflow::gtl::ArraySlice<int64>& window,
-    const tensorflow::gtl::ArraySlice<int64>& stride, Padding padding) {
+    const tensorflow::gtl::ArraySlice<int64>& stride,
+    const tensorflow::gtl::ArraySlice<std::pair<int64, int64>>& padding) {
   std::vector<int64> dim_lengths{operand.height(), operand.width()};
-  auto padding_both = xla::MakePadding(dim_lengths, window, stride, padding);
 
   std::vector<int64> window_counts(window.size(), 0);
   std::vector<int64> pad_low(window.size(), 0);
   for (int64 i = 0; i < window.size(); ++i) {
+    int64 padded_width = padding[i].first + dim_lengths[i] + padding[i].second;
     window_counts[i] =
-        WindowCount(dim_lengths[i], window[i], stride[i], padding);
-    pad_low[i] = padding_both[i].first;
+        window_util::StridedBound(padded_width, window[i], stride[i]);
+    pad_low[i] = padding[i].first;
   }
   auto result = MakeUnique<Array2D<float>>(window_counts[0], window_counts[1]);
 
@@ -271,7 +260,7 @@ ReferenceUtil::ReduceWindow1DAdd(
           if (i0_base + i0_win >= 0 && i1_base + i1_win >= 0 &&
               i0_base + i0_win < operand.n1() &&
               i1_base + i1_win < operand.n2()) {
-            val += operand(i0_base + i0_win, i1_base + i1_win);
+            val = reduce_func(val, operand(i0_base + i0_win, i1_base + i1_win));
           }
         }
       }
@@ -281,6 +270,17 @@ ReferenceUtil::ReduceWindow1DAdd(
   return result;
 }
 
+/* static  */ std::unique_ptr<Array2D<float>> ReferenceUtil::ReduceWindow2DAdd(
+    const Array2D<float>& operand, float init,
+    const tensorflow::gtl::ArraySlice<int64>& window,
+    const tensorflow::gtl::ArraySlice<int64>& stride, Padding padding) {
+  const auto add_reduce = [](float arg1, float arg2) { return arg1 + arg2; };
+  std::vector<int64> dim_lengths{operand.height(), operand.width()};
+  return ReduceWindow2DGeneric(
+      operand, init, add_reduce, window, stride,
+      xla::MakePadding(dim_lengths, window, stride, padding));
+}
+
 /* static  */ std::unique_ptr<Array3D<float>> ReferenceUtil::ReduceWindow3DAdd(
     const Array3D<float>& operand, float init,
     const tensorflow::gtl::ArraySlice<int64>& window,
@@ -472,7 +472,7 @@ ReferenceUtil::SelectAndScatter4DGePlus(
                       i3_base + i3_win < operand.n4()) {
                     float tmp = operand(i0_base + i0_win, i1_base + i1_win,
                                         i2_base + i2_win, i3_base + i3_win);
-                    if (tmp >= val) {
+                    if (tmp > val) {
                       val = tmp;
                       scatter_0 = i0_base + i0_win;
                       scatter_1 = i1_base + i1_win;
diff --git a/tensorflow/compiler/xla/reference_util.h b/tensorflow/compiler/xla/reference_util.h
index 3ec96f2f38b8f91e1549419b60481327fa9bbd5f..28d6a8c3fe85fa4179bf2f41c82ad4eb93a045fe 100644
--- a/tensorflow/compiler/xla/reference_util.h
+++ b/tensorflow/compiler/xla/reference_util.h
@@ -39,10 +39,22 @@ namespace xla {
 class ReferenceUtil {
  public:
   // Returns the result of a transpose operation on the input matrix.
-  static std::unique_ptr<Array2D<float>> TransposeArray2D(
-      const Array2D<float>& operand);
+  template <typename T>
+  static std::unique_ptr<Array2D<T>> TransposeArray2D(
+      const Array2D<T>& operand) {
+    auto result = MakeUnique<Array2D<T>>(operand.width(), operand.height());
+    for (int64 w = 0; w < operand.width(); ++w) {
+      for (int64 h = 0; h < operand.height(); ++h) {
+        (*result)(w, h) = operand(h, w);
+      }
+    }
+
+    return result;
+  }
 
   // Returns the result of a matrix multiply `lhs x rhs`.
+  static std::unique_ptr<Array2D<Eigen::half>> MatmulArray2D(
+      const Array2D<Eigen::half>& lhs, const Array2D<Eigen::half>& rhs);
   static std::unique_ptr<Array2D<float>> MatmulArray2D(
       const Array2D<float>& lhs, const Array2D<float>& rhs);
   static std::unique_ptr<Array2D<double>> MatmulArray2D(
@@ -187,9 +199,10 @@ class ReferenceUtil {
       const tensorflow::gtl::ArraySlice<float>& operand, float init,
       const std::function<float(float, float)>& reduce_func,
       const tensorflow::gtl::ArraySlice<int64>& window,
-      const tensorflow::gtl::ArraySlice<int64>& stride, Padding padding);
-  static std::unique_ptr<std::vector<float>> ReduceWindow1DGeneric(
-      const tensorflow::gtl::ArraySlice<float>& operand, float init,
+      const tensorflow::gtl::ArraySlice<int64>& stride,
+      const tensorflow::gtl::ArraySlice<std::pair<int64, int64>>& padding);
+  static std::unique_ptr<Array2D<float>> ReduceWindow2DGeneric(
+      const Array2D<float>& operand, float init,
       const std::function<float(float, float)>& reduce_func,
       const tensorflow::gtl::ArraySlice<int64>& window,
       const tensorflow::gtl::ArraySlice<int64>& stride,
@@ -215,6 +228,7 @@ class ReferenceUtil {
 
   // Performs select and scatter with Greater Than or equal as the select, plus
   // as the scatter, and Same Padding.
+  // TODO(b/74533103) Switch tests to evaluator and remove this implementation.
   static std::unique_ptr<Array4D<float>> SelectAndScatter4DGePlus(
       const Array4D<float>& operand, const Array4D<float>& source, float init,
       const tensorflow::gtl::ArraySlice<int64>& window,
diff --git a/tensorflow/compiler/xla/service/BUILD b/tensorflow/compiler/xla/service/BUILD
index e6a6e54927b4752f6e7c8eca1fc0e84301ff0a58..d4d67872cfa5581b5e266581b72f38949d89e652 100644
--- a/tensorflow/compiler/xla/service/BUILD
+++ b/tensorflow/compiler/xla/service/BUILD
@@ -106,6 +106,7 @@ tf_cc_test(
         ":bfloat16_normalization",
         ":bfloat16_support",
         ":hlo",
+        ":hlo_verifier",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:test",
@@ -129,6 +130,7 @@ cc_library(
         ":hlo_dce",
         ":hlo_pass",
         ":tuple_simplifier",
+        "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_tree",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:util",
@@ -148,6 +150,7 @@ tf_cc_test(
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",  # fixdeps: keep
     ],
 )
@@ -986,6 +989,7 @@ tf_cc_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
     ],
 )
@@ -1062,6 +1066,38 @@ tf_cc_test(
     ],
 )
 
+cc_library(
+    name = "hlo_module_group_metadata",
+    srcs = ["hlo_module_group_metadata.cc"],
+    hdrs = ["hlo_module_group_metadata.h"],
+    deps = [
+        ":hlo",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:status",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
+    name = "hlo_module_group_util",
+    srcs = ["hlo_module_group_util.cc"],
+    hdrs = ["hlo_module_group_util.h"],
+    deps = [
+        ":hlo",
+        ":hlo_module_group_metadata",
+        ":hlo_reachability",
+        "//tensorflow/compiler/xla:status",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/core:lib",
+    ],
+)
+
 cc_library(
     name = "hlo_scheduling",
     srcs = ["hlo_scheduling.cc"],
@@ -1093,6 +1129,7 @@ tf_cc_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -1130,6 +1167,19 @@ tf_cc_test(
     ],
 )
 
+cc_library(
+    name = "hlo_creation_utils",
+    srcs = ["hlo_creation_utils.cc"],
+    hdrs = ["hlo_creation_utils.h"],
+    deps = [
+        ":hlo",
+        ":shape_inference",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:util",
+    ],
+)
+
 cc_library(
     name = "batchnorm_expander",
     srcs = ["batchnorm_expander.cc"],
@@ -1138,7 +1188,6 @@ cc_library(
         ":hlo",
         ":hlo_pass",
         ":hlo_query",
-        ":shape_inference",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
@@ -1150,6 +1199,20 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "gather_expander",
+    srcs = ["gather_expander.cc"],
+    hdrs = ["gather_expander.h"],
+    deps = [
+        ":hlo",
+        ":hlo_creation_utils",
+        ":hlo_pass",
+        ":while_util",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:util",
+    ],
+)
+
 tf_cc_test(
     name = "batchnorm_expander_test",
     size = "small",
@@ -1177,9 +1240,9 @@ cc_library(
     hdrs = ["algebraic_simplifier.h"],
     deps = [
         ":hlo",
+        ":hlo_creation_utils",
         ":hlo_pass",
         ":hlo_query",
-        ":shape_inference",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
@@ -1213,6 +1276,53 @@ tf_cc_test(
     ],
 )
 
+tf_cc_test(
+    name = "gather_expander_test",
+    srcs = ["gather_expander_test.cc"],
+    deps = [
+        ":gather_expander",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/tests:test_macros_header",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",  # fixdeps: keep
+        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
+    ],
+)
+
+cc_library(
+    name = "conditional_simplifier",
+    srcs = ["conditional_simplifier.cc"],
+    hdrs = ["conditional_simplifier.h"],
+    deps = [
+        ":call_inliner",
+        ":hlo",
+        ":hlo_pass",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/core:lib",
+    ],
+)
+
+tf_cc_test(
+    name = "conditional_simplifier_test",
+    srcs = ["conditional_simplifier_test.cc"],
+    deps = [
+        ":conditional_simplifier",
+        ":hlo",
+        ":hlo_matchers",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+    ],
+)
+
 cc_library(
     name = "while_loop_simplifier",
     srcs = ["while_loop_simplifier.cc"],
@@ -1235,6 +1345,7 @@ tf_cc_test(
         ":while_loop_simplifier",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
+        "//tensorflow/core:lib",
         "//tensorflow/core:test",
     ],
 )
@@ -2349,6 +2460,24 @@ cc_library(
         ":hlo",
         ":hlo_proto",
         "//tensorflow/compiler/xla:status",
+        "//tensorflow/compiler/xla:util",
+    ],
+)
+
+tf_cc_test(
+    name = "hlo_proto_util_test",
+    srcs = ["hlo_proto_util_test.cc"],
+    deps = [
+        ":hlo",
+        ":hlo_proto",
+        ":hlo_proto_util",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/core:lib",
     ],
 )
 
@@ -2447,7 +2576,9 @@ cc_library(
     deps = [
         ":call_inliner",
         ":hlo",
+        ":hlo_creation_utils",
         ":tuple_util",
+        "//tensorflow/core:lib",
     ],
 )
 
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier.cc b/tensorflow/compiler/xla/service/algebraic_simplifier.cc
index 5ddd8ec377690bdf47e6d54ae5d419416044a53c..971c2935c835f6b10d1e4a7e0df42e5a7785c4ec 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier.cc
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier.cc
@@ -26,10 +26,10 @@ limitations under the License.
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_creation_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/hlo_query.h"
-#include "tensorflow/compiler/xla/service/shape_inference.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -383,13 +383,9 @@ Status AlgebraicSimplifierVisitor::HandleAdd(HloInstruction* add) {
       !lhs->operand(0)->IsConstant() && lhs->operand(1)->IsConstant()) {
     auto* c1 = lhs->mutable_operand(1);
     auto* c2 = rhs;
-    TF_ASSIGN_OR_RETURN(
-        Shape sum_of_constants_shape,
-        ShapeInference::InferBinaryOpShape(HloOpcode::kAdd, c1, c2));
 
-    auto* sum_of_constants =
-        computation_->AddInstruction(HloInstruction::CreateBinary(
-            sum_of_constants_shape, HloOpcode::kAdd, c1, c2));
+    TF_ASSIGN_OR_RETURN(auto* sum_of_constants,
+                        MakeBinaryHlo(HloOpcode::kAdd, c1, c2));
     return ReplaceWithNewInstruction(
         add, HloInstruction::CreateBinary(add->shape(), HloOpcode::kAdd,
                                           lhs->mutable_operand(0),
@@ -640,32 +636,23 @@ Status AlgebraicSimplifierVisitor::HandleDivide(HloInstruction* divide) {
   // (A / B) / (C / D)  =>  (A / B)*(D / C) => (A * D) / (B * C)
   if (lhs->opcode() == HloOpcode::kDivide &&
       rhs->opcode() == HloOpcode::kDivide) {
-    TF_ASSIGN_OR_RETURN(
-        const Shape a_times_d_shape,
-        ShapeInference::InferBinaryOpShape(HloOpcode::kMultiply,
-                                           lhs->operand(0), rhs->operand(1)));
-    auto a_times_d = computation_->AddInstruction(HloInstruction::CreateBinary(
-        a_times_d_shape, HloOpcode::kMultiply, lhs->mutable_operand(0),
-        rhs->mutable_operand(1)));
-    TF_ASSIGN_OR_RETURN(
-        const Shape b_times_c_shape,
-        ShapeInference::InferBinaryOpShape(HloOpcode::kMultiply,
-                                           lhs->operand(1), rhs->operand(0)));
-    auto b_times_c = computation_->AddInstruction(HloInstruction::CreateBinary(
-        b_times_c_shape, HloOpcode::kMultiply, lhs->mutable_operand(1),
-        rhs->mutable_operand(0)));
-    return ReplaceWithNewInstruction(
-        divide, HloInstruction::CreateBinary(
-                    divide->shape(), HloOpcode::kDivide, a_times_d, b_times_c));
+    TF_ASSIGN_OR_RETURN(auto a_times_d, MakeBinaryHlo(HloOpcode::kMultiply,
+                                                      lhs->mutable_operand(0),
+                                                      rhs->mutable_operand(1)));
+    TF_ASSIGN_OR_RETURN(auto b_times_c, MakeBinaryHlo(HloOpcode::kMultiply,
+                                                      lhs->mutable_operand(1),
+                                                      rhs->mutable_operand(0)));
+    TF_ASSIGN_OR_RETURN(auto new_divide, MakeBinaryHlo(HloOpcode::kDivide,
+                                                       a_times_d, b_times_c));
+
+    return ReplaceInstruction(divide, new_divide);
   }
 
   // (A / B) / C => A / (B * C)
   if (lhs->opcode() == HloOpcode::kDivide) {
-    TF_ASSIGN_OR_RETURN(const Shape b_times_c_shape,
-                        ShapeInference::InferBinaryOpShape(
-                            HloOpcode::kMultiply, lhs->operand(1), rhs));
-    auto b_times_c = computation_->AddInstruction(HloInstruction::CreateBinary(
-        b_times_c_shape, HloOpcode::kMultiply, lhs->mutable_operand(1), rhs));
+    TF_ASSIGN_OR_RETURN(
+        auto b_times_c,
+        MakeBinaryHlo(HloOpcode::kMultiply, lhs->mutable_operand(1), rhs));
     return ReplaceWithNewInstruction(
         divide,
         HloInstruction::CreateBinary(divide->shape(), HloOpcode::kDivide,
@@ -674,11 +661,8 @@ Status AlgebraicSimplifierVisitor::HandleDivide(HloInstruction* divide) {
 
   // A / (B / C) => (A*C) / B
   if (rhs->opcode() == HloOpcode::kDivide) {
-    TF_ASSIGN_OR_RETURN(const Shape a_times_c_shape,
-                        ShapeInference::InferBinaryOpShape(
-                            HloOpcode::kMultiply, lhs, rhs->operand(1)));
-    auto a_times_c = computation_->AddInstruction(HloInstruction::CreateBinary(
-        a_times_c_shape, HloOpcode::kMultiply, lhs, rhs->mutable_operand(1)));
+    TF_ASSIGN_OR_RETURN(auto a_times_c, MakeBinaryHlo(HloOpcode::kMultiply, lhs,
+                                                      rhs->mutable_operand(1)));
     return ReplaceWithNewInstruction(
         divide,
         HloInstruction::CreateBinary(divide->shape(), HloOpcode::kDivide,
@@ -1311,17 +1295,14 @@ Status AlgebraicSimplifierVisitor::HandlePad(HloInstruction* pad) {
         padding_dimension->set_edge_padding_high(0);
       }
     }
-    TF_ASSIGN_OR_RETURN(Shape nonzero_pad_shape,
-                        ShapeInference::InferPadShape(pad->operand(0)->shape(),
-                                                      pad->operand(1)->shape(),
-                                                      nonzero_padding));
+
+    TF_ASSIGN_OR_RETURN(HloInstruction * nonzero_pad,
+                        MakePadHlo(pad->mutable_operand(0),
+                                   pad->mutable_operand(1), nonzero_padding));
     // Copy the layout from the original pad instructions. The new pad and the
     // slice instruction should all have the same layout.
-    TF_RETURN_IF_ERROR(
-        LayoutUtil::CopyLayoutBetweenShapes(pad->shape(), &nonzero_pad_shape));
-    HloInstruction* nonzero_pad = computation_->AddInstruction(
-        HloInstruction::CreatePad(nonzero_pad_shape, pad->mutable_operand(0),
-                                  pad->mutable_operand(1), nonzero_padding));
+    TF_RETURN_IF_ERROR(LayoutUtil::CopyLayoutBetweenShapes(
+        pad->shape(), nonzero_pad->mutable_shape()));
 
     // Second, construct the slice instruction to perform the negative padding.
     std::vector<int64> start_indices;
@@ -1334,7 +1315,7 @@ Status AlgebraicSimplifierVisitor::HandlePad(HloInstruction* pad) {
       if (padding_dimension.edge_padding_low() < 0) {
         start = -1 * padding_dimension.edge_padding_low();
       }
-      int64 end = nonzero_pad_shape.dimensions(i);
+      int64 end = nonzero_pad->shape().dimensions(i);
       if (padding_dimension.edge_padding_high() < 0) {
         end += padding_dimension.edge_padding_high();
       }
@@ -1343,16 +1324,14 @@ Status AlgebraicSimplifierVisitor::HandlePad(HloInstruction* pad) {
       strides.push_back(1);
     }
 
-    // Verify that the slice shape matches the pad shape.
     TF_ASSIGN_OR_RETURN(
-        Shape inferred_slice_shape,
-        ShapeInference::InferSliceShape(nonzero_pad_shape, start_indices,
-                                        end_indices, strides));
-    TF_RET_CHECK(ShapeUtil::Compatible(inferred_slice_shape, pad->shape()));
+        HloInstruction * slice,
+        MakeSliceHlo(nonzero_pad, start_indices, end_indices, strides));
 
-    std::unique_ptr<HloInstruction> slice = HloInstruction::CreateSlice(
-        pad->shape(), nonzero_pad, start_indices, end_indices, strides);
-    return ReplaceWithNewInstruction(pad, std::move(slice));
+    // Verify that the slice shape matches the pad shape.
+    TF_RET_CHECK(ShapeUtil::Compatible(slice->shape(), pad->shape()));
+
+    return ReplaceInstruction(pad, slice);
   }
 
   return Status::OK();
@@ -1625,6 +1604,14 @@ Status AlgebraicSimplifierVisitor::HandleDynamicUpdateSlice(
   if (IsAll(start_indices, 0) && SameShape(dynamic_update_slice, update)) {
     return ReplaceInstruction(dynamic_update_slice, update);
   }
+
+  // If any dimension of update is 0, elide the DynamicUpdateSlice.  This
+  // optimization becomes invalid should we later prefer to warn about
+  // out-of-bounds indices.
+  if (ShapeUtil::HasZeroElements(update->shape())) {
+    return ReplaceInstruction(dynamic_update_slice,
+                              dynamic_update_slice->mutable_operand(0));
+  }
   return Status::OK();
 }
 
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier.h b/tensorflow/compiler/xla/service/algebraic_simplifier.h
index 43315f5cdc7afbe79039420320f4a0d0535e11f1..f0590943be48a10281c8a676f16764ed9545ed8d 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier.h
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier.h
@@ -23,7 +23,7 @@ limitations under the License.
 
 namespace xla {
 
-// A pass which performs AlgebraicSimplications.
+// A pass which performs algebraic simplifications.
 class AlgebraicSimplifier : public HloPassInterface {
  public:
   // Given shapes 'from_shape' and 'to_shape', determines if it is valid to
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc b/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
index 667ae01993ebf0feeab89e0b5afaf7c7c8c99ab9..451294ef5d8367686d7fc22b7f5ebfde89d14d42 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
@@ -2800,6 +2800,29 @@ DotOfConcatTestSpec kDotOfConcatTestSpecs[] = {
     {/*m=*/1, /*k=*/16, /*n=*/1},   //
 };
 
+// Test that a DynamicUpdateSlice whose update operand has a zero-sized
+// dimension gets removed.
+TEST_F(AlgebraicSimplifierTest, DynamicUpdateSliceZeroUpdate) {
+  HloComputation::Builder builder(TestName());
+  const Shape dslice_shape = ShapeUtil::MakeShape(F32, {10});
+  HloInstruction* const operand = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, dslice_shape, "operand"));
+  const Shape update_shape = ShapeUtil::MakeShape(F32, {0});
+  HloInstruction* const update = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, update_shape, "update"));
+  HloInstruction* const start_indices = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR1<int>({0})));
+  builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
+      dslice_shape, operand, update, start_indices));
+  const HloComputation* const computation =
+      module().AddEntryComputation(builder.Build());
+
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_TRUE(simplifier.Run(&module()).ValueOrDie());
+  EXPECT_THAT(computation->root_instruction(), operand);
+}
+
 INSTANTIATE_TEST_CASE_P(DotOfConcatSimplificationTestInstantiation,
                         DotOfConcatSimplificationTest,
                         ::testing::ValuesIn(kDotOfConcatTestSpecs));
diff --git a/tensorflow/compiler/xla/service/allocation_tracker.cc b/tensorflow/compiler/xla/service/allocation_tracker.cc
index 4e80679c11dfdf7fdf8077a9f354139a4cab6803..4f819a743c48f30df8dde00ece72a0b4e1748802 100644
--- a/tensorflow/compiler/xla/service/allocation_tracker.cc
+++ b/tensorflow/compiler/xla/service/allocation_tracker.cc
@@ -34,40 +34,54 @@ StatusOr<GlobalDataHandle> AllocationTracker::Register(
     std::unique_ptr<ShapedBuffer> shaped_buffer, const string& tag) {
   tensorflow::mutex_lock lock(mutex_);
   VLOG(2) << "Register";
-  return RegisterInternal(std::move(shaped_buffer), tag);
+  std::vector<std::unique_ptr<ShapedBuffer>> replicated_buffers;
+  replicated_buffers.emplace_back(std::move(shaped_buffer));
+  return RegisterInternal(std::move(replicated_buffers), tag);
+}
+
+StatusOr<GlobalDataHandle> AllocationTracker::RegisterReplicatedBuffers(
+    std::vector<std::unique_ptr<ShapedBuffer>> replicated_buffers,
+    const string& tag) {
+  tensorflow::mutex_lock lock(mutex_);
+  VLOG(2) << "RegisterReplicatedBuffers";
+  return RegisterInternal(std::move(replicated_buffers), tag);
 }
 
 StatusOr<GlobalDataHandle> AllocationTracker::RegisterInternal(
-    std::unique_ptr<ShapedBuffer> shaped_buffer, const string& tag) {
+    std::vector<std::unique_ptr<ShapedBuffer>> replicated_buffers,
+    const string& tag) {
   VLOG(2) << "RegisterInternal("
-          << "tag: \"" << tag << "\" "
-          << "shaped_buffer: " << *shaped_buffer;
-  if (shaped_buffer->platform() != backend_->platform()) {
-    return InvalidArgument(
-        "AllocationTracker for platform %s cannot register buffer from "
-        "platform %s",
-        backend_->platform()->Name().c_str(),
-        shaped_buffer->platform()->Name().c_str());
+          << "tag: \"" << tag << "\" with " << replicated_buffers.size()
+          << " shaped_buffers.";
+  for (const auto& shaped_buffer : replicated_buffers) {
+    VLOG(2) << "shaped_buffer:" << *shaped_buffer;
+    if (shaped_buffer->platform() != backend_->platform()) {
+      return InvalidArgument(
+          "AllocationTracker for platform %s cannot register buffer from "
+          "platform %s",
+          backend_->platform()->Name().c_str(),
+          shaped_buffer->platform()->Name().c_str());
+    }
   }
 
   int64 handle = next_handle_++;
-  std::vector<ShapeIndex> shape_indices;
-  ShapeUtil::ForEachSubshape(shaped_buffer->on_device_shape(),
-                             [this, &shape_indices](const Shape& /*subshape*/,
-                                                    const ShapeIndex& index) {
-                               shape_indices.push_back(index);
-                             });
-  for (const ShapeIndex& index : shape_indices) {
-    AddAllocationOrIncrementRefCount(shaped_buffer->buffer(index),
-                                     shaped_buffer->device_ordinal());
+  for (auto& shaped_buffer : replicated_buffers) {
+    std::vector<ShapeIndex> shape_indices;
+    ShapeUtil::ForEachSubshape(shaped_buffer->on_device_shape(),
+                               [this, &shape_indices](const Shape& /*subshape*/,
+                                                      const ShapeIndex& index) {
+                                 shape_indices.push_back(index);
+                               });
+    for (const ShapeIndex& index : shape_indices) {
+      AddAllocationOrIncrementRefCount(shaped_buffer->buffer(index),
+                                       shaped_buffer->device_ordinal());
+    }
+    handle_to_shaped_buffers_[handle].emplace_back(std::move(shaped_buffer));
   }
+
   GlobalDataHandle result;
   result.set_handle(handle);
-
-  handle_to_shaped_buffer_[handle] = std::move(shaped_buffer);
-
   VLOG(2) << "handle: " << handle;
-
   return result;
 }
 
@@ -75,23 +89,35 @@ tensorflow::Status AllocationTracker::Unregister(const GlobalDataHandle& data) {
   tensorflow::mutex_lock lock(mutex_);
   VLOG(2) << "Unregister("
           << "handle: " << data.handle() << ")";
-  TF_ASSIGN_OR_RETURN(ShapedBuffer * shaped_buffer, ResolveInternal(data));
-  std::vector<ShapeIndex> shape_indices;
-  ShapeUtil::ForEachSubshape(shaped_buffer->on_device_shape(),
-                             [this, &shape_indices](const Shape& /*subshape*/,
-                                                    const ShapeIndex& index) {
-                               shape_indices.push_back(index);
-                             });
-  for (const ShapeIndex& index : shape_indices) {
-    TF_RETURN_IF_ERROR(DecrementRefCount(shaped_buffer->buffer(index),
-                                         shaped_buffer->device_ordinal()));
+  TF_ASSIGN_OR_RETURN(std::vector<const ShapedBuffer*> replicated_buffers,
+                      ResolveInternal(data));
+  for (const auto& shaped_buffer : replicated_buffers) {
+    std::vector<ShapeIndex> shape_indices;
+    ShapeUtil::ForEachSubshape(shaped_buffer->on_device_shape(),
+                               [this, &shape_indices](const Shape& /*subshape*/,
+                                                      const ShapeIndex& index) {
+                                 shape_indices.push_back(index);
+                               });
+    for (const ShapeIndex& index : shape_indices) {
+      TF_RETURN_IF_ERROR(DecrementRefCount(shaped_buffer->buffer(index),
+                                           shaped_buffer->device_ordinal()));
+    }
   }
+  return Reset(data);
+}
 
-  // Keep a nullptr as a tombstone for unregistered handles. This enables better
-  // error messages. That is, "handle has been deallocated" versus "handle does
-  // not exist".
-  handle_to_shaped_buffer_.at(data.handle()).reset();
-
+Status AllocationTracker::Reset(const GlobalDataHandle& data) {
+  // Keep a nullptr as a tombstone for unregistered handles. This enables
+  // better error messages. That is, "handle has been deallocated" versus
+  // "handle does not exist".
+  auto it = handle_to_shaped_buffers_.find(data.handle());
+  if (it == handle_to_shaped_buffers_.end()) {
+    return NotFound("no allocation record for global data handle: %lld",
+                    data.handle());
+  }
+  for (auto& shaped_buffer : it->second) {
+    shaped_buffer.reset();
+  }
   return tensorflow::Status::OK();
 }
 
@@ -99,7 +125,11 @@ StatusOr<std::vector<GlobalDataHandle>> AllocationTracker::DeconstructTuple(
     const GlobalDataHandle& data) {
   tensorflow::mutex_lock lock(mutex_);
 
-  TF_ASSIGN_OR_RETURN(ShapedBuffer * shaped_buffer, ResolveInternal(data));
+  TF_ASSIGN_OR_RETURN(std::vector<const ShapedBuffer*> replicated_buffers,
+                      ResolveInternal(data));
+  // We only need to care about replica id 0 here, since the GlobalDataHandle is
+  // the same for all buffers across replicas.
+  const ShapedBuffer* shaped_buffer = replicated_buffers[0];
   if (!ShapeUtil::IsTuple(shaped_buffer->on_host_shape())) {
     return InvalidArgument("global data handle %lld is not a tuple",
                            data.handle());
@@ -109,7 +139,7 @@ StatusOr<std::vector<GlobalDataHandle>> AllocationTracker::DeconstructTuple(
   TF_RET_CHECK(ShapeUtil::IsTuple(shaped_buffer->on_device_shape()));
 
   if (ShapeUtil::IsNestedTuple(shaped_buffer->on_device_shape())) {
-    return Unimplemented("deconstructing nested tuples not yet supported");
+    return Unimplemented("Deconstructing nested tuples is not implemented.");
   }
 
   std::vector<GlobalDataHandle> element_handles;
@@ -122,37 +152,55 @@ StatusOr<std::vector<GlobalDataHandle>> AllocationTracker::DeconstructTuple(
         shaped_buffer->platform(), shaped_buffer->device_ordinal());
     element_buffer->set_buffer(shaped_buffer->buffer(/*index=*/{i}),
                                /*index=*/{});
+    std::vector<std::unique_ptr<ShapedBuffer>> replicated_buffers;
+    replicated_buffers.emplace_back(std::move(element_buffer));
     TF_ASSIGN_OR_RETURN(
         GlobalDataHandle element_handle,
-        RegisterInternal(std::move(element_buffer), "deconstructed tuple"));
+        RegisterInternal(std::move(replicated_buffers), "deconstructed tuple"));
 
     element_handles.push_back(element_handle);
   }
   return std::move(element_handles);
 }
 
-StatusOr<const ShapedBuffer*> AllocationTracker::Resolve(
+StatusOr<std::vector<const ShapedBuffer*>> AllocationTracker::Resolve(
     const GlobalDataHandle& data) {
   tensorflow::mutex_lock lock(mutex_);
   return AllocationTracker::ResolveInternal(data);
 }
 
-StatusOr<ShapedBuffer*> AllocationTracker::ResolveInternal(
+StatusOr<const ShapedBuffer*> AllocationTracker::ResolveForReplica(
+    const GlobalDataHandle& data, int replica_id) {
+  tensorflow::mutex_lock lock(mutex_);
+  TF_ASSIGN_OR_RETURN(std::vector<const ShapedBuffer*> replicated_buffers,
+                      ResolveInternal(data));
+  if (replica_id >= replicated_buffers.size()) {
+    return InvalidArgument(
+        "Requesting buffer for replica %d, but found buffers only for %lu "
+        "replicas.",
+        replica_id, replicated_buffers.size());
+  }
+  return replicated_buffers[replica_id];
+}
+
+StatusOr<std::vector<const ShapedBuffer*>> AllocationTracker::ResolveInternal(
     const GlobalDataHandle& data) {
   VLOG(2) << "resolve:" << data.handle();
-  auto it = handle_to_shaped_buffer_.find(data.handle());
-  if (it == handle_to_shaped_buffer_.end()) {
+  auto it = handle_to_shaped_buffers_.find(data.handle());
+  if (it == handle_to_shaped_buffers_.end()) {
     return NotFound("no allocation record for global data handle: %lld",
                     data.handle());
   }
-  ShapedBuffer* shaped_buffer = it->second.get();
-
-  if (shaped_buffer == nullptr) {
-    return InvalidArgument("global data handle %lld was previously deallocated",
-                           data.handle());
+  std::vector<const ShapedBuffer*> replicated_buffers;
+  for (const auto& shaped_buffer : it->second) {
+    if (shaped_buffer == nullptr) {
+      return InvalidArgument(
+          "global data handle %lld was previously deallocated", data.handle());
+    }
+    replicated_buffers.push_back(shaped_buffer.get());
   }
 
-  return shaped_buffer;
+  return replicated_buffers;
 }
 
 void AllocationTracker::AddAllocationOrIncrementRefCount(
diff --git a/tensorflow/compiler/xla/service/allocation_tracker.h b/tensorflow/compiler/xla/service/allocation_tracker.h
index 807af8694972083d097604a67ee46d2f73d9545a..038aee8541b297d6f91fe2b3bce7455fd9a7084e 100644
--- a/tensorflow/compiler/xla/service/allocation_tracker.h
+++ b/tensorflow/compiler/xla/service/allocation_tracker.h
@@ -43,10 +43,17 @@ class AllocationTracker {
   AllocationTracker(Backend* backend) : backend_(backend), next_handle_(1) {}
 
   // Registers a shaped buffer of device memory, and returns a corresponding
-  // handle that can be used for talking to XLA clients.
+  // handle that can be used for talking to XLA clients. The given shaped buffer
+  // will be treated as the buffer corresponding to the only replica.
   StatusOr<GlobalDataHandle> Register(
       std::unique_ptr<ShapedBuffer> shaped_buffer, const string& tag);
 
+  // Registers a vector of shaped buffers of device memory, one per replica, and
+  // returns a corresponding handle that can be used for talking to XLA clients.
+  StatusOr<GlobalDataHandle> RegisterReplicatedBuffers(
+      std::vector<std::unique_ptr<ShapedBuffer>> replicated_buffers,
+      const string& tag);
+
   // Unregister the allocation for the given data handle.
   Status Unregister(const GlobalDataHandle& data);
 
@@ -54,9 +61,17 @@ class AllocationTracker {
   StatusOr<std::vector<GlobalDataHandle>> DeconstructTuple(
       const GlobalDataHandle& Data);
 
-  // Resolve a handle from an XLA client to a shaped buffer, or provide an error
-  // status to say whether it was not found (or found, but found deallocated).
-  StatusOr<const ShapedBuffer*> Resolve(const GlobalDataHandle& data);
+  // Resolve a handle from an XLA client to a vector of shaped buffers, one per
+  // replica, or provide an error status to say whether any of those buffers
+  // were not found (or found, but found deallocated).
+  StatusOr<std::vector<const ShapedBuffer*>> Resolve(
+      const GlobalDataHandle& data);
+
+  // Resolves a handle from an XLA client and replica id to a shaped buffer, or
+  // provides an error status to say whether it was not found (or found, but
+  // found deallocated).
+  StatusOr<const ShapedBuffer*> ResolveForReplica(const GlobalDataHandle& data,
+                                                  int replica_id);
 
  private:
   // Data structure encapsulating single memory allocation on the device.
@@ -74,13 +89,17 @@ class AllocationTracker {
 
   // Internal helper which resolves the given GlobalDataHandle to a
   // ShapedBuffer.
-  StatusOr<ShapedBuffer*> ResolveInternal(const GlobalDataHandle& data)
-      EXCLUSIVE_LOCKS_REQUIRED(mutex_);
+  StatusOr<std::vector<const ShapedBuffer*>> ResolveInternal(
+      const GlobalDataHandle& data) EXCLUSIVE_LOCKS_REQUIRED(mutex_);
 
-  // Internal helper which registers a shaped buffer.
+  // Internal helper which registers a vector of shaped buffers, one per
+  // replica.
   StatusOr<GlobalDataHandle> RegisterInternal(
-      std::unique_ptr<ShapedBuffer> shaped_buffer, const string& tag)
-      EXCLUSIVE_LOCKS_REQUIRED(mutex_);
+      std::vector<std::unique_ptr<ShapedBuffer>> replicated_buffers,
+      const string& tag) EXCLUSIVE_LOCKS_REQUIRED(mutex_);
+
+  // Resets the shaped buffers corresponding to the given handle.
+  Status Reset(const GlobalDataHandle& data) EXCLUSIVE_LOCKS_REQUIRED(mutex_);
 
   // Adds the given device address to the allocation tracker, or if it already
   // exists, then increment it's reference count.
@@ -111,9 +130,10 @@ class AllocationTracker {
   tensorflow::gtl::FlatMap<int, AllocationMap> opaque_to_allocation_map_
       GUARDED_BY(mutex_);
 
-  // A map from data handle to ShapedBuffer.
-  tensorflow::gtl::FlatMap<int64, std::unique_ptr<ShapedBuffer>>
-      handle_to_shaped_buffer_ GUARDED_BY(mutex_);
+  // A map from data handle to a vector of shaped buffers that represent the
+  // buffers for different replicas.
+  tensorflow::gtl::FlatMap<int64, std::vector<std::unique_ptr<ShapedBuffer>>>
+      handle_to_shaped_buffers_ GUARDED_BY(mutex_);
 
   TF_DISALLOW_COPY_AND_ASSIGN(AllocationTracker);
 };
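Review note: the header change above moves the tracker from one buffer per handle to one vector of buffers per handle, with one entry per replica. The sketch below illustrates that data-structure shape only. `ToyBuffer`, `ToyTracker`, and the bounds check in `ResolveForReplica` are hypothetical stand-ins and are not taken from the XLA sources; error reporting is reduced to empty vectors and null pointers instead of `StatusOr`.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a shaped buffer; it only carries a label.
struct ToyBuffer {
  std::string label;
};

// Toy tracker: each handle owns a vector of buffers, one per replica.
class ToyTracker {
 public:
  // Registers one buffer per replica and returns a fresh handle.
  int64_t RegisterReplicated(std::vector<std::unique_ptr<ToyBuffer>> buffers) {
    const int64_t handle = next_handle_++;
    buffers_[handle] = std::move(buffers);
    return handle;
  }

  // Single-buffer registration treats the buffer as the only replica.
  int64_t Register(std::unique_ptr<ToyBuffer> buffer) {
    std::vector<std::unique_ptr<ToyBuffer>> one;
    one.push_back(std::move(buffer));
    return RegisterReplicated(std::move(one));
  }

  // Returns non-owning pointers for all replicas. An empty vector means the
  // handle is unknown or some replica's buffer was already released.
  std::vector<const ToyBuffer*> Resolve(int64_t handle) const {
    std::vector<const ToyBuffer*> result;
    auto it = buffers_.find(handle);
    if (it == buffers_.end()) return result;
    for (const auto& buffer : it->second) {
      if (buffer == nullptr) return {};
      result.push_back(buffer.get());
    }
    return result;
  }

  // Returns the buffer of a single replica, or nullptr on any failure.
  // The range check here is an assumption made for this sketch.
  const ToyBuffer* ResolveForReplica(int64_t handle, int replica_id) const {
    const std::vector<const ToyBuffer*> all = Resolve(handle);
    if (replica_id < 0 || static_cast<std::size_t>(replica_id) >= all.size()) {
      return nullptr;
    }
    return all[replica_id];
  }

 private:
  int64_t next_handle_ = 1;
  std::unordered_map<int64_t, std::vector<std::unique_ptr<ToyBuffer>>> buffers_;
};
```

Keeping the single-buffer `Register` as a thin wrapper means existing callers are unaffected while replicated callers share the same storage path.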
diff --git a/tensorflow/compiler/xla/service/batchnorm_expander.cc b/tensorflow/compiler/xla/service/batchnorm_expander.cc
index 84c9db32932becd9b701929b392efa4998d03067..38086bd7e121847be6b6b69415cfe87814e7fc24 100644
--- a/tensorflow/compiler/xla/service/batchnorm_expander.cc
+++ b/tensorflow/compiler/xla/service/batchnorm_expander.cc
@@ -30,7 +30,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/hlo_query.h"
-#include "tensorflow/compiler/xla/service/shape_inference.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/types.h"
diff --git a/tensorflow/compiler/xla/service/bfloat16_conversion_folding.cc b/tensorflow/compiler/xla/service/bfloat16_conversion_folding.cc
index cde990e176ddb57a8e93ecc3c60260b2dbae32a8..08d0152e3cfcfcb7ae1e85f72c2f7dc856f5e8b3 100644
--- a/tensorflow/compiler/xla/service/bfloat16_conversion_folding.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_conversion_folding.cc
@@ -34,6 +34,9 @@ class BFloat16ConversionFoldingVisitor : public DfsHloVisitorWithDefault {
 
   Status DefaultAction(HloInstruction* hlo) override;
 
+  // Special handling for cross-replica-sum which can have a tuple output.
+  Status HandleCrossReplicaSum(HloInstruction* crs) override;
+
   static bool Run(HloComputation* computation,
                   const BFloat16Support* bfloat16_support) {
     BFloat16ConversionFoldingVisitor visitor(computation, bfloat16_support);
@@ -84,6 +87,25 @@ Status BFloat16ConversionFoldingVisitor::FoldOperandConversion(
   return Status::OK();
 }
 
+namespace {
+
+// Returns whether hlo has users and all users are conversions from F32 to BF16.
+bool AllUsersAreF32ToBF16Converts(const HloInstruction* hlo) {
+  if (hlo->user_count() == 0 || hlo->shape().element_type() != F32) {
+    return false;
+  }
+  for (const auto user : hlo->users()) {
+    if (user->opcode() == HloOpcode::kConvert &&
+        user->shape().element_type() == BF16) {
+      continue;
+    }
+    return false;
+  }
+  return true;
+}
+
+}  // namespace
+
 Status BFloat16ConversionFoldingVisitor::TryFoldBF16Conversions(
     HloInstruction* hlo) {
   std::vector<int64> bf16_to_f32_operands;
@@ -104,22 +126,9 @@ Status BFloat16ConversionFoldingVisitor::TryFoldBF16Conversions(
     }
   }
 
-  bool fold_output_conversion = hlo->user_count() > 0 &&
-                                hlo->shape().element_type() == F32 &&
-                                bfloat16_support_->SupportsBF16Output(*hlo) &&
-                                hlo != computation_->root_instruction();
-  if (fold_output_conversion) {
-    for (auto user : hlo->users()) {
-      if (user->opcode() == HloOpcode::kConvert &&
-          user->shape().element_type() == BF16) {
-        continue;
-      }
-      // We should not change the output type if any user is not a conversion
-      // from F32 to BF16.
-      fold_output_conversion = false;
-      break;
-    }
-  }
+  const bool fold_output_conversion =
+      AllUsersAreF32ToBF16Converts(hlo) &&
+      bfloat16_support_->SupportsBF16Output(*hlo);
 
   if (!bfloat16_support_->SupportsMixedPrecisions(*hlo)) {
     if (has_other_f32_operands ||
@@ -147,6 +156,10 @@ Status BFloat16ConversionFoldingVisitor::DefaultAction(HloInstruction* hlo) {
       hlo->opcode() == HloOpcode::kGetTupleElement ||  //
       hlo->opcode() == HloOpcode::kInfeed ||           //
       hlo->opcode() == HloOpcode::kOutfeed ||          //
+      hlo->opcode() == HloOpcode::kSend ||             //
+      hlo->opcode() == HloOpcode::kSendDone ||         //
+      hlo->opcode() == HloOpcode::kRecv ||             //
+      hlo->opcode() == HloOpcode::kRecvDone ||         //
       hlo->opcode() == HloOpcode::kConstant ||         //
       hlo->opcode() == HloOpcode::kParameter ||        //
       hlo->opcode() == HloOpcode::kFusion ||           //
@@ -167,6 +180,52 @@ Status BFloat16ConversionFoldingVisitor::DefaultAction(HloInstruction* hlo) {
   return TryFoldBF16Conversions(hlo);
 }
 
+Status BFloat16ConversionFoldingVisitor::HandleCrossReplicaSum(
+    HloInstruction* crs) {
+  if (!ShapeUtil::IsTuple(crs->shape()) ||
+      !bfloat16_support_->SupportsMixedPrecisions(*crs)) {
+    return DefaultAction(crs);
+  }
+
+  // First use DefaultAction() to handle the operands. It can't handle
+  // tuple-shaped output.
+  TF_RETURN_IF_ERROR(DefaultAction(crs));
+
+  // Then do per-tuple-element handling on the output.
+  std::vector<std::vector<HloInstruction*>> per_tuple_element_gtes(
+      crs->operand_count());
+  for (auto user : crs->users()) {
+    if (user->opcode() != HloOpcode::kGetTupleElement) {
+      return Status::OK();
+    }
+    per_tuple_element_gtes[user->tuple_index()].push_back(user);
+  }
+
+  for (int64 i = 0; i < crs->operand_count(); ++i) {
+    // Fold conversions only when all the get-tuple-elements' users are
+    // conversions from F32 to BF16.
+    auto all_gte_users_are_bf16_convert = [&per_tuple_element_gtes, i]() {
+      for (auto gte : per_tuple_element_gtes[i]) {
+        if (!AllUsersAreF32ToBF16Converts(gte)) {
+          return false;
+        }
+      }
+      return true;
+    };
+    if (!all_gte_users_are_bf16_convert()) {
+      continue;
+    }
+
+    ShapeUtil::GetMutableSubshape(crs->mutable_shape(), {i})
+        ->set_element_type(BF16);
+    for (auto gte : per_tuple_element_gtes[i]) {
+      TF_RETURN_IF_ERROR(FoldOutputConversions(gte));
+    }
+  }
+
+  return Status::OK();
+}
+
 StatusOr<bool> BFloat16ConversionFolding::Run(HloModule* module) {
   XLA_VLOG_LINES(
       2, "BFloat16ConversionFolding::Run(), before:\n" + module->ToString());
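Review note: the new helper `AllUsersAreF32ToBF16Converts` in the hunk above encodes a small predicate: an instruction qualifies only if it produces F32, has at least one user, and every user is a convert to BF16. The sketch below restates that predicate on a toy graph. `ToyInstr`, `Opcode`, and `Type` are hypothetical types invented for illustration and are not the XLA classes.

```cpp
#include <vector>

enum class Opcode { kConvert, kAdd };
enum class Type { kF32, kBF16 };

// Hypothetical miniature instruction: opcode, output type, and users.
struct ToyInstr {
  Opcode opcode;
  Type type;
  std::vector<const ToyInstr*> users;
};

// True only if `hlo` outputs F32, has at least one user, and every user is a
// convert whose output type is BF16.
bool AllUsersAreF32ToBF16Converts(const ToyInstr& hlo) {
  if (hlo.users.empty() || hlo.type != Type::kF32) {
    return false;
  }
  for (const ToyInstr* user : hlo.users) {
    if (user->opcode != Opcode::kConvert || user->type != Type::kBF16) {
      return false;
    }
  }
  return true;
}
```

Requiring at least one user also rules out a computation's root when nothing consumes it, so no separate root check appears in this toy version.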
diff --git a/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc b/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc
index cb37759439debf41a305ec7dccaa548e1bf234cd..28e71c2054f59ba4d5d096bf7d898161877bb42f 100644
--- a/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc
@@ -37,7 +37,8 @@ class TestBFloat16Support : public BFloat16Support {
     if (hlo.opcode() == HloOpcode::kAdd ||
         hlo.opcode() == HloOpcode::kSubtract ||
         hlo.opcode() == HloOpcode::kTuple ||
-        hlo.opcode() == HloOpcode::kGetTupleElement) {
+        hlo.opcode() == HloOpcode::kGetTupleElement ||
+        hlo.opcode() == HloOpcode::kCrossReplicaSum) {
       return true;
     }
     return false;
@@ -47,7 +48,8 @@ class TestBFloat16Support : public BFloat16Support {
     if (hlo.opcode() == HloOpcode::kAdd ||
         hlo.opcode() == HloOpcode::kSubtract ||
         hlo.opcode() == HloOpcode::kTuple ||
-        hlo.opcode() == HloOpcode::kGetTupleElement) {
+        hlo.opcode() == HloOpcode::kGetTupleElement ||
+        hlo.opcode() == HloOpcode::kCrossReplicaSum) {
       return true;
     }
     return false;
@@ -55,7 +57,8 @@ class TestBFloat16Support : public BFloat16Support {
 
   bool SupportsMixedPrecisions(const HloInstruction& hlo) const override {
     if (hlo.opcode() == HloOpcode::kAdd || hlo.opcode() == HloOpcode::kTuple ||
-        hlo.opcode() == HloOpcode::kGetTupleElement) {
+        hlo.opcode() == HloOpcode::kGetTupleElement ||
+        hlo.opcode() == HloOpcode::kCrossReplicaSum) {
       return true;
     }
     return false;
@@ -206,4 +209,46 @@ TEST_F(BFloat16ConversionFoldingTest, DoNotFoldTuple) {
   EXPECT_EQ(tuple->operand(1), convert0);
 }
 
+TEST_F(BFloat16ConversionFoldingTest, FoldCrossReplicaSumTupleOutput) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape f32_shape = ShapeUtil::MakeShape(F32, {2, 4});
+  Shape bf16_shape = ShapeUtil::MakeShape(BF16, {2, 4});
+
+  HloInstruction* a = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, bf16_shape, "a"));
+  HloInstruction* convert_a =
+      builder.AddInstruction(HloInstruction::CreateConvert(f32_shape, a));
+  HloInstruction* b = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, f32_shape, "b"));
+
+  HloInstruction* crs =
+      builder.AddInstruction(HloInstruction::CreateCrossReplicaSum(
+          ShapeUtil::MakeTupleShape({f32_shape, f32_shape}), {convert_a, b}));
+  HloInstruction* gte_a = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(f32_shape, crs, 0));
+  HloInstruction* gte_b = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(f32_shape, crs, 1));
+  HloInstruction* convert_gte_b =
+      builder.AddInstruction(HloInstruction::CreateConvert(bf16_shape, gte_b));
+  HloInstruction* tuple = builder.AddInstruction(
+      HloInstruction::CreateTuple({gte_a, convert_gte_b}));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(FoldConversions(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), tuple);
+  EXPECT_EQ(tuple->operand(0), gte_a);
+  EXPECT_EQ(tuple->operand(1), gte_b);
+  EXPECT_EQ(gte_a->shape().element_type(), F32);
+  EXPECT_EQ(gte_b->shape().element_type(), BF16);
+  EXPECT_EQ(crs->operand(0), a);
+  EXPECT_EQ(crs->operand(1), b);
+  EXPECT_EQ(a->shape().element_type(), BF16);
+  EXPECT_EQ(b->shape().element_type(), F32);
+  EXPECT_EQ(ShapeUtil::GetSubshape(crs->shape(), {0}).element_type(), F32);
+  EXPECT_EQ(ShapeUtil::GetSubshape(crs->shape(), {1}).element_type(), BF16);
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/bfloat16_normalization.cc b/tensorflow/compiler/xla/service/bfloat16_normalization.cc
index b032c040e8aff49f9e0fc1ff9a1c1e79ea4bb77f..14c54ddd135af024327f63418b410da1ed3c4fd4 100644
--- a/tensorflow/compiler/xla/service/bfloat16_normalization.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_normalization.cc
@@ -152,44 +152,64 @@ Status BFloat16NormalizationVisitor::HandleCrossReplicaSum(
 
   std::vector<PrimitiveType> operand_types(crs->operand_count());
   std::vector<PrimitiveType> output_types(crs->operand_count());
-  bool has_f32 = false;
-  bool has_bf16 = false;
-  bool has_bf16_output = false;
+  int64 f32_count = 0;
+  int64 bf16_count = 0;
+  bool has_unsupported_bf16_operand = false;
+  bool has_unsupported_bf16_output = false;
   for (int64 i = 0; i < crs->operand_count(); ++i) {
     operand_types[i] = crs->operand(i)->shape().element_type();
     output_types[i] = ShapeUtil::GetSubshape(crs->shape(), {i}).element_type();
-    if (operand_types[i] == F32 || output_types[i] == F32) {
-      has_f32 = true;
+    if (operand_types[i] == F32) {
+      f32_count += 1;
     } else if (operand_types[i] == BF16) {
-      has_bf16 = true;
+      bf16_count += 1;
+      if (!bfloat16_support_->SupportsBF16Operand(*crs, i)) {
+        has_unsupported_bf16_operand = true;
+      }
     }
-    if (output_types[i] == BF16) {
-      has_bf16 = true;
-      has_bf16_output = true;
+    if (output_types[i] == F32) {
+      f32_count += 1;
+    } else if (output_types[i] == BF16) {
+      bf16_count += 1;
+      if (!bfloat16_support_->SupportsBF16Output(*crs)) {
+        has_unsupported_bf16_output = true;
+      }
     }
   }
 
-  for (int64 i = 0; i < crs->operand_count(); ++i) {
+  if (bf16_count == 0) {
+    return Status::OK();
+  }
+
+  auto should_convert_operand = [&](int64 i) {
     if (operand_types[i] != BF16) {
-      continue;
+      return false;
     }
-    if (bfloat16_support_->SupportsBF16Operand(*crs, i) &&
-        (bfloat16_support_->SupportsMixedPrecisions(*crs) || !has_f32)) {
-      continue;
+    if (!bfloat16_support_->SupportsBF16Operand(*crs, i)) {
+      return true;
     }
-    TF_RETURN_IF_ERROR(InsertConvertBeforeOperand(crs, i, F32, computation_));
-    has_f32 = true;
-  }
+    if (bfloat16_support_->SupportsMixedPrecisions(*crs)) {
+      return false;
+    }
+    return has_unsupported_bf16_operand || has_unsupported_bf16_output ||
+           f32_count > 0;
+  };
 
-  if (!has_bf16_output) {
-    return Status::OK();
+  for (int64 i = 0; i < crs->operand_count(); ++i) {
+    if (should_convert_operand(i)) {
+      TF_RETURN_IF_ERROR(InsertConvertBeforeOperand(crs, i, F32, computation_));
+      f32_count += 1;
+      bf16_count -= 1;
+    }
   }
 
-  if (bfloat16_support_->SupportsBF16Output(*crs) &&
-      (bfloat16_support_->SupportsMixedPrecisions(*crs) || !has_f32)) {
+  if (!has_unsupported_bf16_output &&
+      (bfloat16_support_->SupportsMixedPrecisions(*crs) || f32_count == 0 ||
+       bf16_count == 0)) {
     return Status::OK();
   }
 
+  std::vector<HloInstruction*> materialized_users = crs->users();
   std::vector<HloInstruction*> output_elements(crs->operand_count());
   auto original_shape = crs->shape();
   for (int64 i = 0; i < crs->operand_count(); ++i) {
@@ -209,7 +229,6 @@ Status BFloat16NormalizationVisitor::HandleCrossReplicaSum(
   auto tuple = computation_->AddInstruction(
       HloInstruction::CreateTuple(output_elements));
 
-  std::vector<HloInstruction*> materialized_users = crs->users();
   // Use the crs' shape temporarily, in order to pass checks in
   // ReplaceUseWith.
   *tuple->mutable_shape() = crs->shape();
@@ -221,41 +240,37 @@ Status BFloat16NormalizationVisitor::HandleCrossReplicaSum(
 }
 
 Status BFloat16NormalizationVisitor::HandleInstruction(HloInstruction* hlo) {
-  std::vector<int64> bf16_operands;
-  std::vector<int64> f32_operands;
-  bool has_f32 = false;
-  bool has_bf16 = false;
+  int f32_count = 0;
+  int bf16_count = 0;
 
   for (int64 i = 0; i < hlo->operand_count(); ++i) {
     if (hlo->operand(i)->shape().element_type() == F32) {
-      f32_operands.push_back(i);
-      has_f32 = true;
+      f32_count += 1;
     } else if (hlo->operand(i)->shape().element_type() == BF16) {
-      bf16_operands.push_back(i);
-      has_bf16 = true;
+      bf16_count += 1;
     }
   }
 
   if (hlo->shape().element_type() == F32) {
-    has_f32 = true;
+    f32_count += 1;
   } else if (hlo->shape().element_type() == BF16) {
-    has_bf16 = true;
+    bf16_count += 1;
   }
 
   std::vector<HloComputation*> bf16_called_comps;
   for (auto* comp : hlo->called_computations()) {
     bool comp_has_bf16 = false;
     if (comp->root_instruction()->shape().element_type() == F32) {
-      has_f32 = true;
+      f32_count += 1;
     } else if (comp->root_instruction()->shape().element_type() == BF16) {
-      has_bf16 = true;
+      bf16_count += 1;
       comp_has_bf16 = true;
     }
     for (auto* param : comp->parameter_instructions()) {
       if (param->shape().element_type() == F32) {
-        has_f32 = true;
+        f32_count += 1;
       } else if (param->shape().element_type() == BF16) {
-        has_bf16 = true;
+        bf16_count += 1;
         comp_has_bf16 = true;
       }
     }
@@ -264,54 +279,69 @@ Status BFloat16NormalizationVisitor::HandleInstruction(HloInstruction* hlo) {
     }
   }
 
-  if (!bfloat16_support_->SupportsMixedPrecisions(*hlo) && has_bf16 &&
-      has_f32) {
-    // Resolve unsupported mixed precision.
-    //
-    // See if we can change everything to BF16.
-    if (hlo->called_computations().empty() &&
-        hlo->shape().element_type() == BF16) {
-      bool can_use_bf16 = true;
-      for (int i : f32_operands) {
-        if (bfloat16_support_->EffectiveOperandPrecisionIsOutputPrecision(*hlo,
-                                                                          i) &&
-            bfloat16_support_->SupportsBF16Operand(*hlo, i)) {
-          continue;
-        }
-        can_use_bf16 = false;
-        break;
-      }
-      if (can_use_bf16) {
-        for (int i : f32_operands) {
-          TF_RETURN_IF_ERROR(
-              InsertConvertBeforeOperand(hlo, i, BF16, computation_));
-        }
-        return Status::OK();
-      }
-    }
-    if (hlo->shape().element_type() == BF16) {
-      TF_RETURN_IF_ERROR(
-          ChangeOutputTypeThenInsertConvertBack(hlo, F32, computation_));
-    }
-    for (int i : bf16_operands) {
-      TF_RETURN_IF_ERROR(InsertConvertBeforeOperand(hlo, i, F32, computation_));
-    }
-    return ConvertCalledComputations(hlo, bf16_called_comps);
-  }
-
-  for (int i : bf16_operands) {
-    if (!bfloat16_support_->SupportsBF16Operand(*hlo, i)) {
+  // Resolve unsupported BF16 operands.
+  for (int i = 0; i < hlo->operand_count(); ++i) {
+    if (hlo->operand(i)->shape().element_type() == BF16 &&
+        !bfloat16_support_->SupportsBF16Operand(*hlo, i)) {
       TF_RETURN_IF_ERROR(InsertConvertBeforeOperand(hlo, i, F32, computation_));
+      bf16_count -= 1;
+      f32_count += 1;
     }
   }
 
+  // Resolve unsupported BF16 output.
   if (hlo->shape().element_type() == BF16 &&
       !bfloat16_support_->SupportsBF16Output(*hlo)) {
     TF_RETURN_IF_ERROR(
         ChangeOutputTypeThenInsertConvertBack(hlo, F32, computation_));
+    bf16_count -= 1;
+    f32_count += 1;
   }
 
-  return Status::OK();
+  // Resolve unsupported mixed precision after resolving unsupported BF16
+  // operands and output, because the numbers of BF16 operands/output and F32
+  // operands/output may have changed.
+  if (bfloat16_support_->SupportsMixedPrecisions(*hlo) || bf16_count == 0 ||
+      f32_count == 0) {
+    return Status::OK();
+  }
+  // See if we can change everything to BF16.
+  if (hlo->called_computations().empty() &&
+      hlo->shape().element_type() == BF16) {
+    bool can_use_bf16 = true;
+    for (int i = 0; i < hlo->operand_count(); ++i) {
+      if (hlo->operand(i)->shape().element_type() == BF16) {
+        continue;
+      }
+      if ((bfloat16_support_->EffectiveOperandPrecisionIsBF16(*hlo, i) ||
+           bfloat16_support_->EffectiveOperandPrecisionIsOutputPrecision(*hlo,
+                                                                         i)) &&
+          bfloat16_support_->SupportsBF16Operand(*hlo, i)) {
+        continue;
+      }
+      can_use_bf16 = false;
+      break;
+    }
+    if (can_use_bf16) {
+      for (int i = 0; i < hlo->operand_count(); ++i) {
+        if (hlo->operand(i)->shape().element_type() == F32) {
+          TF_RETURN_IF_ERROR(
+              InsertConvertBeforeOperand(hlo, i, BF16, computation_));
+        }
+      }
+      return Status::OK();
+    }
+  }
+  if (hlo->shape().element_type() == BF16) {
+    TF_RETURN_IF_ERROR(
+        ChangeOutputTypeThenInsertConvertBack(hlo, F32, computation_));
+  }
+  for (int i = 0; i < hlo->operand_count(); ++i) {
+    if (hlo->operand(i)->shape().element_type() == BF16) {
+      TF_RETURN_IF_ERROR(InsertConvertBeforeOperand(hlo, i, F32, computation_));
+    }
+  }
+  return ConvertCalledComputations(hlo, bf16_called_comps);
 }
 
 Status BFloat16NormalizationVisitor::DefaultAction(HloInstruction* hlo) {
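Review note: the rewritten `HandleInstruction` above resolves unsupported BF16 operands first, then an unsupported BF16 output, and only then checks for unsupported mixed precision using the updated counts. The sketch below models that ordering on plain values. `ToySupport`, `ToyOp`, and `NormalizeToy` are hypothetical names; called computations and the effective-operand-precision checks from the real pass are deliberately left out, so this is a simplified illustration and not the pass itself.

```cpp
#include <cstddef>
#include <vector>

enum class Type { kF32, kBF16 };

// Hypothetical stand-in for the backend support queries.
struct ToySupport {
  std::vector<bool> bf16_operand_ok;  // per operand: may it be BF16?
  bool bf16_output_ok;                // may the output be BF16?
  bool mixed_ok;                      // may F32 and BF16 be mixed?
};

struct ToyOp {
  std::vector<Type> operands;
  Type output;
};

ToyOp NormalizeToy(ToyOp op, const ToySupport& support) {
  int f32_count = 0;
  int bf16_count = 0;
  auto count = [&](Type t) {
    if (t == Type::kF32) {
      f32_count += 1;
    } else {
      bf16_count += 1;
    }
  };
  for (Type t : op.operands) count(t);
  count(op.output);

  // Step 1: widen BF16 operands that the op cannot accept.
  for (std::size_t i = 0; i < op.operands.size(); ++i) {
    if (op.operands[i] == Type::kBF16 && !support.bf16_operand_ok[i]) {
      op.operands[i] = Type::kF32;
      bf16_count -= 1;
      f32_count += 1;
    }
  }
  // Step 2: widen a BF16 output that the op cannot produce.
  if (op.output == Type::kBF16 && !support.bf16_output_ok) {
    op.output = Type::kF32;
    bf16_count -= 1;
    f32_count += 1;
  }
  // Step 3: with the updated counts, stop if no unsupported mix remains.
  if (support.mixed_ok || bf16_count == 0 || f32_count == 0) {
    return op;
  }
  // Try to make everything BF16 (simplified condition for this sketch).
  if (op.output == Type::kBF16) {
    bool can_use_bf16 = true;
    for (std::size_t i = 0; i < op.operands.size(); ++i) {
      if (op.operands[i] == Type::kF32 && !support.bf16_operand_ok[i]) {
        can_use_bf16 = false;
        break;
      }
    }
    if (can_use_bf16) {
      for (Type& t : op.operands) t = Type::kBF16;
      return op;
    }
  }
  // Otherwise fall back to F32 everywhere.
  op.output = Type::kF32;
  for (Type& t : op.operands) t = Type::kF32;
  return op;
}
```

With two BF16 operands, a BF16 output, BF16 allowed only on the first operand, and no mixed-precision support, the sketch ends with every type at F32, which corresponds to the situation exercised by the `DoNotAddUnsupportedMixedPrecision` test in this change.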
diff --git a/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc b/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
index 66c3085842c4afe7ffc4d5891883e4cce9389d45..1afaefd9df9c5771fb9e134ae9050f3abb00ea4a 100644
--- a/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_verifier.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
@@ -41,13 +42,17 @@ class TestBFloat16Support : public BFloat16Support {
         hlo.opcode() == HloOpcode::kGetTupleElement) {
       return true;
     }
+    if (hlo.opcode() == HloOpcode::kDot) {
+      // Test that only the first operand of kDot supports BF16.
+      return operand_index == 0;
+    }
     return false;
   }
 
   bool SupportsBF16Output(const HloInstruction& hlo) const override {
     if (hlo.opcode() == HloOpcode::kAdd || hlo.opcode() == HloOpcode::kReduce ||
         hlo.opcode() == HloOpcode::kSubtract ||
-        hlo.opcode() == HloOpcode::kTuple ||
+        hlo.opcode() == HloOpcode::kDot || hlo.opcode() == HloOpcode::kTuple ||
         hlo.opcode() == HloOpcode::kGetTupleElement) {
       return true;
     }
@@ -70,6 +75,10 @@ class BFloat16NormalizationTest : public HloTestBase {
     BFloat16Normalization normalization(&bfloat16_support_);
     StatusOr<bool> result = normalization.Run(module);
     EXPECT_IS_OK(result.status());
+
+    HloVerifier verifier(/*allow_mixed_precision=*/true);
+    EXPECT_IS_OK(verifier.Run(module).status());
+
     return result.ValueOrDie();
   }
 };
@@ -166,7 +175,7 @@ TEST_F(BFloat16NormalizationTest, ResolveUnsupportedMixedPrecisionReduce) {
   Shape f32_input_shape = ShapeUtil::MakeShape(F32, {2, 4});
   Shape f32_output_shape = ShapeUtil::MakeShape(F32, {4});
 
-  Shape bf16_scalar_shape = ShapeUtil::MakeShape(BF16, {2, 4});
+  Shape bf16_scalar_shape = ShapeUtil::MakeShape(BF16, {});
 
   auto reduce_comp_builder = HloComputation::Builder("reduce_comp");
   auto reduce_comp_param0 = reduce_comp_builder.AddInstruction(
@@ -245,4 +254,34 @@ TEST_F(BFloat16NormalizationTest, ResolveMixedPrecisionTupleCrossReplicaSum) {
   EXPECT_EQ(ShapeUtil::GetSubshape(crs->shape(), {1}).element_type(), F32);
 }
 
+// Tests that the normalization should not cause unsupported mixed precision due
+// to resolving an unsupported BF16 operand.
+TEST_F(BFloat16NormalizationTest, DoNotAddUnsupportedMixedPrecision) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape bf16_shape = ShapeUtil::MakeShape(BF16, {4, 4});
+
+  HloInstruction* a = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, bf16_shape, "a"));
+  HloInstruction* b = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, bf16_shape, "b"));
+
+  DotDimensionNumbers dot_dnums;
+  dot_dnums.add_lhs_contracting_dimensions(1);
+  dot_dnums.add_rhs_contracting_dimensions(0);
+  HloInstruction* dot = builder.AddInstruction(
+      HloInstruction::CreateDot(bf16_shape, a, b, dot_dnums));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(Normalize(module.get()));
+
+  EXPECT_EQ(computation->root_instruction()->opcode(), HloOpcode::kConvert);
+  EXPECT_EQ(dot->shape().element_type(), F32);
+  EXPECT_EQ(dot->operand(0)->shape().element_type(), F32);
+  EXPECT_EQ(dot->operand(0)->opcode(), HloOpcode::kConvert);
+  EXPECT_EQ(dot->operand(1)->shape().element_type(), F32);
+  EXPECT_EQ(dot->operand(1)->opcode(), HloOpcode::kConvert);
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/bfloat16_propagation.cc b/tensorflow/compiler/xla/service/bfloat16_propagation.cc
index 6145c690b911dd3c74d2677ceb840ae3b86d5309..7195c31d9c5918d554ae7d5e8e57c417f9315be4 100644
--- a/tensorflow/compiler/xla/service/bfloat16_propagation.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_propagation.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/bfloat16_propagation.h"
 
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/map_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_dce.h"
@@ -68,33 +69,53 @@ void BFloat16Propagation::DetermineAndMutateFusionComputationPrecision(
   for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
     DetermineAndMutateInstructionPrecision(*inst_it, /*skip_parameters=*/false);
   }
+  computations_visited_in_mutation_pass_.insert(
+      fusion->fused_instructions_computation());
 }
 
-void BFloat16Propagation::AdjustFusionParameters(HloInstruction* fusion) {
-  CHECK_EQ(fusion->fused_parameters().size(), fusion->operand_count());
-  for (int64 i = 0; i < fusion->operand_count(); ++i) {
-    auto parameter = fusion->fused_parameter(i);
-    ShapeUtil::ForEachMutableSubshape(
-        parameter->mutable_shape(),
-        [&](Shape* subshape, const ShapeIndex& index) {
-          if (!ShapeUtil::IsLeafIndex(parameter->shape(), index)) {
-            return;
-          }
-          PrimitiveType operand_type =
-              ShapeUtil::GetSubshape(fusion->operand(i)->shape(), index)
-                  .element_type();
-          if (subshape->element_type() == operand_type) {
-            return;
-          }
-          CHECK(operand_type == F32 || operand_type == BF16);
-          subshape->set_element_type(operand_type);
+void BFloat16Propagation::DetermineAndMutateWhileComputationsPrecision(
+    HloInstruction* while_hlo) {
+  CHECK_EQ(while_hlo->opcode(), HloOpcode::kWhile);
+
+  // We are depending on the while node itself having already been analyzed for
+  // whether it can output BF16 and this has been adjusted in the output shape,
+  // and now we're looking to update the body and condition computations to
+  // match the new output shape, as well as recursively process the whole while
+  // node even if the output shape was not modified.
+  HloComputation* body = while_hlo->while_body();
+  auto body_root = body->root_instruction();
+  HloComputation* condition = while_hlo->while_condition();
+
+  ShapeUtil::ForEachMutableSubshape(
+      body_root->mutable_shape(),
+      [this, while_hlo, body_root](Shape* subshape, const ShapeIndex& index) {
+        if (subshape->element_type() != F32) {
+          return;
+        }
+        if (ShapeUtil::GetSubshape(while_hlo->shape(), index).element_type() ==
+            BF16) {
+          subshape->set_element_type(BF16);
           changed_ = true;
-          VLOG(2) << "Fused parameter " << parameter->ToString()
+          VLOG(2) << "While body root " << body_root->ToString()
                   << " at shape index " << index
-                  << " adjusted to match operand in fusion "
-                  << fusion->ToString();
-        });
+                  << " changed to BF16 precision for while "
+                  << while_hlo->ToString();
+        }
+      });
+
+  auto body_insts = body->MakeInstructionPostOrder();
+  for (auto inst_it = body_insts.rbegin(); inst_it != body_insts.rend();
+       ++inst_it) {
+    DetermineAndMutateInstructionPrecision(*inst_it, /*skip_parameters=*/false);
+  }
+  computations_visited_in_mutation_pass_.insert(body);
+
+  auto condition_insts = condition->MakeInstructionPostOrder();
+  for (auto inst_it = condition_insts.rbegin();
+       inst_it != condition_insts.rend(); ++inst_it) {
+    DetermineAndMutateInstructionPrecision(*inst_it, /*skip_parameters=*/false);
   }
+  computations_visited_in_mutation_pass_.insert(condition);
 }
 
 bool BFloat16Propagation::AllUsersConsumeBF16(const HloInstruction& hlo,
@@ -108,14 +129,45 @@ bool BFloat16Propagation::AllUsersConsumeBF16(const HloInstruction& hlo,
       continue;
     }
     for (const HloUse& use : value->uses()) {
+      if (!ContainsKey(instructions_visited_in_mutation_pass_,
+                       use.instruction)) {
+        // We don't know yet whether use.instruction will consume BF16 since it
+        // hasn't been visited. Although we visit instructions in reverse
+        // topological order, this is still possible because there may be
+        // unvisited instructions that alias the same buffer. In this case, we
+        // aggressively skip this use, and if this causes inconsistency (e.g.,
+        // one use is in BF16 but another use is in F32), it will be resolved at
+        // the end of the BFloat16Propagation pass.
+        continue;
+      }
+      // Any visited user that can accept BF16 has already been updated if
+      // necessary, e.g., the output has been changed to BF16 if it propagates
+      // precision, or a called computation's parameters have been changed to
+      // BF16 for fusions or whiles.
       if (use.instruction->opcode() == HloOpcode::kFusion) {
-        auto fused_parameter =
+        const auto* fused_parameter =
             use.instruction->fused_parameter(use.operand_number);
         if (ShapeUtil::GetSubshape(fused_parameter->shape(), use.operand_index)
                 .element_type() != BF16) {
           return false;
         }
         continue;
+      } else if (use.instruction->opcode() == HloOpcode::kWhile) {
+        const auto* cond_parameter =
+            use.instruction->while_condition()->parameter_instruction(
+                use.operand_number);
+        if (ShapeUtil::GetSubshape(cond_parameter->shape(), use.operand_index)
+                .element_type() != BF16) {
+          return false;
+        }
+        const auto* body_parameter =
+            use.instruction->while_body()->parameter_instruction(
+                use.operand_number);
+        if (ShapeUtil::GetSubshape(body_parameter->shape(), use.operand_index)
+                .element_type() != BF16) {
+          return false;
+        }
+        continue;
       }
       if (bfloat16_support_->EffectiveOperandPrecisionIsBF16(
               *use.instruction, use.operand_number)) {
@@ -149,24 +201,40 @@ bool BFloat16Propagation::AllUsersConsumeBF16(const HloInstruction& hlo,
 
 void BFloat16Propagation::DetermineAndMutateInstructionPrecision(
     HloInstruction* hlo, bool skip_parameters) {
-  // We handle any fusion computation after the instruction is handled, because
-  // we need to know a fusion's output shape before propagating inside its fused
-  // computation.
-  auto cleaner = tensorflow::gtl::MakeCleanup([this, hlo] {
-    if (hlo->opcode() == HloOpcode::kFusion) {
-      DetermineAndMutateFusionComputationPrecision(hlo);
-    }
-  });
+  // We handle any fusion computation or while body/condition after the
+  // instruction is handled, because we need to know the output shape of a
+  // fusion or while before propagating inside its computations.
+  bool postpone_processing_called_computations = false;
+  auto cleaner = tensorflow::gtl::MakeCleanup(
+      [this, hlo, &postpone_processing_called_computations] {
+        if (!postpone_processing_called_computations) {
+          if (hlo->opcode() == HloOpcode::kFusion) {
+            DetermineAndMutateFusionComputationPrecision(hlo);
+          } else if (hlo->opcode() == HloOpcode::kWhile) {
+            DetermineAndMutateWhileComputationsPrecision(hlo);
+          }
+        }
+        instructions_visited_in_mutation_pass_.insert(hlo);
+      });
+
+  if (hlo->opcode() == HloOpcode::kWhile &&
+      (caller_counts_[hlo->while_condition()] > 1 ||
+       caller_counts_[hlo->while_body()] > 1)) {
+    postpone_processing_called_computations = true;
+    return;
+  }
 
   // Do not change precision for instructions related to entry and exit of a
   // computation, and control flow, because this pass might break the interfaces
   // or assumptions for them.
   if (hlo->opcode() == HloOpcode::kInfeed ||       //
       hlo->opcode() == HloOpcode::kOutfeed ||      //
-      hlo->opcode() == HloOpcode::kConstant ||     //
+      hlo->opcode() == HloOpcode::kSend ||         //
+      hlo->opcode() == HloOpcode::kSendDone ||     //
+      hlo->opcode() == HloOpcode::kRecv ||         //
+      hlo->opcode() == HloOpcode::kRecvDone ||     //
       hlo->opcode() == HloOpcode::kCustomCall ||   //
       hlo->opcode() == HloOpcode::kCall ||         //
-      hlo->opcode() == HloOpcode::kWhile ||        //
       hlo->opcode() == HloOpcode::kConditional ||  //
       (hlo->opcode() == HloOpcode::kParameter && skip_parameters)) {
     return;
@@ -231,60 +299,198 @@ bool BFloat16Propagation::InstructionIsCandidateForBF16Output(
   return true;
 }
 
-Status BFloat16Propagation::ResolveInconsistencyOfAliasingBuffers(
-    HloModule* module) {
-  std::list<HloComputation*> computations_topological_order =
-      module->MakeComputationPostOrder();
-  for (auto comp_it = computations_topological_order.rbegin();
-       comp_it != computations_topological_order.rend(); ++comp_it) {
-    auto insts = (*comp_it)->MakeInstructionPostOrder();
-    // Do the adjustment on each instruction in the computation in reverse
-    // topological order.
-    for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
-      auto hlo = *inst_it;
-      auto adjust_buffer = [this, hlo](Shape* subshape,
-                                       const ShapeIndex& index) {
-        if (subshape->element_type() != F32 &&
-            subshape->element_type() != BF16) {
-          return;
+void BFloat16Propagation::AdjustCalledComputationParameters(
+    HloInstruction* hlo) {
+  auto adjust_computation =
+      [this, hlo](HloComputation* computation,
+                  tensorflow::gtl::ArraySlice<HloInstruction*> operands) {
+        // Adjust parameters.
+        CHECK_EQ(operands.size(), computation->num_parameters());
+        for (int64 i = 0; i < operands.size(); ++i) {
+          auto parameter = computation->parameter_instruction(i);
+          ShapeUtil::ForEachMutableSubshape(
+              parameter->mutable_shape(),
+              [this, i, hlo, &operands, parameter](Shape* subshape,
+                                                   const ShapeIndex& index) {
+                if (!ShapeUtil::IsLeafIndex(parameter->shape(), index)) {
+                  return;
+                }
+                PrimitiveType operand_type =
+                    ShapeUtil::GetSubshape(operands[i]->shape(), index)
+                        .element_type();
+                if (subshape->element_type() == operand_type) {
+                  return;
+                }
+                CHECK(operand_type == F32 || operand_type == BF16);
+                subshape->set_element_type(operand_type);
+                changed_ = true;
+                VLOG(2) << "Called computation parameter "
+                        << parameter->ToString() << " at shape index " << index
+                        << " adjusted to match operand in HLO "
+                        << hlo->ToString();
+              });
         }
-        PrimitiveType type = BF16;
-        for (const auto* value : dataflow_->GetValueSet(hlo, index).values()) {
-          if (value->shape().element_type() == BF16) {
-            continue;
+      };
+
+  switch (hlo->opcode()) {
+    case HloOpcode::kFusion:
+      adjust_computation(hlo->fused_instructions_computation(),
+                         hlo->operands());
+      break;
+    case HloOpcode::kWhile:
+      adjust_computation(hlo->while_condition(), hlo->operands());
+      adjust_computation(hlo->while_body(), hlo->operands());
+      break;
+    default:
+      break;
+  }
+}
+
+void BFloat16Propagation::AdjustCalledComputationRoot(HloInstruction* hlo) {
+  auto adjust_computation = [this, hlo](HloComputation* computation,
+                                        const Shape& output_shape) {
+    // Adjust root.
+    HloInstruction* root = computation->root_instruction();
+    ShapeUtil::ForEachMutableSubshape(
+        root->mutable_shape(), [this, hlo, root, &output_shape](
+                                   Shape* subshape, const ShapeIndex& index) {
+          if (!ShapeUtil::IsLeafIndex(hlo->shape(), index)) {
+            return;
           }
-          CHECK_EQ(value->shape().element_type(), F32);
-          type = F32;
-          break;
-        }
-        // It's possible that a user has been changed from BF16 to F32
-        // during this final adjustment pass, so we need to check
-        // AllUsersConsumeBF16() again.
-        if (type == BF16 && !AllUsersConsumeBF16(*hlo, index)) {
-          type = F32;
-        }
-        if (type == F32) {
-          for (const auto* value :
-               dataflow_->GetValueSet(hlo, index).values()) {
-            // We rely on the fact that this adjustment works in reverse
-            // topological order. Adding the value to
-            // values_that_must_be_kept_as_f32_ will ensure the correctness
-            // of the adjustment for HLOs that will be processed later.
-            values_that_must_be_kept_as_f32_.insert(value);
+          const PrimitiveType output_type =
+              ShapeUtil::GetSubshape(output_shape, index).element_type();
+          if (subshape->element_type() == output_type) {
+            return;
           }
+          CHECK(output_type == F32 || output_type == BF16);
+          subshape->set_element_type(output_type);
+          // It's possible that output_type is F32, but the root instruction's
+          // type is BF16; e.g., a fusion node's output was changed to BF16
+          // initially but then adjusted back to F32, and the fusion computation
+          // is now being adjusted after the fusion node.
+          if (output_type == F32) {
+            for (const auto* value :
+                 dataflow_->GetValueSet(root, index).values()) {
+              // We rely on the fact that this adjustment works in reverse
+              // topological order so that the called computation will be
+              // processed later. Adding the value to
+              // values_that_must_be_kept_as_f32_ will ensure the
+              // correctness of the adjustment for HLOs that will be
+              // processed later.
+              values_that_must_be_kept_as_f32_.insert(value);
+            }
+          }
+          changed_ = true;
+          VLOG(2) << "Called computation root " << root->ToString()
+                  << " at shape index " << index
+                  << " adjusted to match output shape of " << hlo->ToString();
+        });
+  };
+
+  switch (hlo->opcode()) {
+    case HloOpcode::kFusion:
+      adjust_computation(hlo->fused_instructions_computation(), hlo->shape());
+      break;
+    case HloOpcode::kWhile:
+      adjust_computation(hlo->while_condition(), hlo->shape());
+      adjust_computation(hlo->while_body(), hlo->shape());
+      break;
+    default:
+      break;
+  }
+}
+
+bool BFloat16Propagation::ResolveInconsistencyOfAliasingBuffersHelper(
+    HloComputation* computation,
+    tensorflow::gtl::FlatSet<const HloComputation*>* visited_computations) {
+  bool parameter_changed = false;
+  auto insts = computation->MakeInstructionPostOrder();
+  // Do the adjustment on each instruction in the computation in reverse
+  // topological order.
+  for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
+    auto hlo = *inst_it;
+    auto adjust_hlo_output = [this, hlo, &parameter_changed](
+                                 Shape* subshape, const ShapeIndex& index) {
+      if (subshape->element_type() != F32 && subshape->element_type() != BF16) {
+        return;
+      }
+      PrimitiveType type = BF16;
+      for (const auto* value : dataflow_->GetValueSet(hlo, index).values()) {
+        if (value->shape().element_type() == BF16) {
+          continue;
+        }
+        CHECK_EQ(value->shape().element_type(), F32);
+        type = F32;
+        break;
+      }
+      // It's possible that a user has been changed from BF16 to F32
+      // during this final adjustment pass, so we need to check
+      // AllUsersConsumeBF16() again.
+      if (type == BF16 && !AllUsersConsumeBF16(*hlo, index)) {
+        type = F32;
+      }
+      if (type == F32) {
+        for (const auto* value : dataflow_->GetValueSet(hlo, index).values()) {
+          // We rely on the fact that this adjustment works in reverse
+          // topological order. Adding the value to
+          // values_that_must_be_kept_as_f32_ will ensure the correctness
+          // of the adjustment for HLOs that will be processed later.
+          values_that_must_be_kept_as_f32_.insert(value);
         }
+      }
+      if (type != subshape->element_type()) {
         subshape->set_element_type(type);
-      };
-      ShapeUtil::ForEachMutableSubshape(hlo->mutable_shape(), adjust_buffer);
-    }
-    // Now adjust parameters of fusions inside this computation.
-    for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
-      auto hlo = *inst_it;
-      if (hlo->opcode() == HloOpcode::kFusion) {
-        AdjustFusionParameters(hlo);
+        VLOG(2) << "HloInstruction output at shape index " << index
+                << " adjusted to " << *subshape << ": " << hlo->ToString();
+        if (hlo->opcode() == HloOpcode::kParameter) {
+          parameter_changed = true;
+        }
+      }
+    };
+    ShapeUtil::ForEachMutableSubshape(hlo->mutable_shape(), adjust_hlo_output);
+    AdjustCalledComputationRoot(hlo);
+    if (hlo->opcode() == HloOpcode::kWhile) {
+      // We need to run on the while body and condition repeatedly until a fixed
+      // point is reached, i.e., the parameters do not change any more. We may
+      // need more than one iteration because the while input and output alias
+      // each other, so changing one input parameter requires changing the
+      // corresponding output element and thus may transitively require changing
+      // another input parameter. A fixed point will be reached because the
+      // parameters can only be changed from BF16 to F32, not the other way
+      // around.
+      tensorflow::gtl::FlatSet<const HloComputation*> visited_in_while;
+      while (ResolveInconsistencyOfAliasingBuffersHelper(hlo->while_condition(),
+                                                         &visited_in_while) ||
+             ResolveInconsistencyOfAliasingBuffersHelper(hlo->while_body(),
+                                                         &visited_in_while)) {
+        visited_in_while.clear();
+        ShapeUtil::ForEachMutableSubshape(hlo->mutable_shape(),
+                                          adjust_hlo_output);
+        AdjustCalledComputationRoot(hlo);
       }
+      visited_computations->insert(visited_in_while.begin(),
+                                   visited_in_while.end());
     }
   }
+  // Now adjust parameters of called computations.
+  for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
+    AdjustCalledComputationParameters(*inst_it);
+  }
+  return parameter_changed;
+}
+
+Status BFloat16Propagation::ResolveInconsistencyOfAliasingBuffers(
+    HloModule* module) {
+  std::list<HloComputation*> computations_topological_order =
+      module->MakeComputationPostOrder();
+  tensorflow::gtl::FlatSet<const HloComputation*> resolved;
+  for (auto comp_it = computations_topological_order.rbegin();
+       comp_it != computations_topological_order.rend(); ++comp_it) {
+    if (ContainsKey(resolved, *comp_it)) {
+      continue;
+    }
+    ResolveInconsistencyOfAliasingBuffersHelper(*comp_it, &resolved);
+  }
 
   // We could have changed a fusion computation's root shape to have a different
   // precision than the fusion node's output, if the fusion root does not
@@ -382,15 +588,66 @@ Status BFloat16Propagation::ResolveInconsistencyOfAliasingBuffers(
       needs_tuple_simplifier |= ShapeUtil::IsTuple(hlo->shape());
     }
   }
+
+  // We may have converted some constants from F32 to BF16, so adjust the
+  // constant literals in such cases. We do this here instead of when the
+  // constant node's shape is changed because 1) the HloInstruction interface
+  // does not allow resetting the literal so we have to create a new kConstant
+  // instruction to replace the old one, which invalidates dataflow analysis,
+  // and 2) it's possible that a kConstant's output gets changed to BF16 at the
+  // beginning but later on adjusted back to F32, so converting literals here
+  // can avoid repeated conversions.
+  //
+  // TODO(b/73833576): Consider resetting literal in HloInstruction.
+  bool needs_dce = needs_tuple_simplifier;
+  for (auto computation : computations_topological_order) {
+    for (auto hlo : computation->MakeInstructionPostOrder()) {
+      if (hlo->opcode() != HloOpcode::kConstant) {
+        continue;
+      }
+      if (!ShapeUtil::Equal(hlo->literal().shape(), hlo->shape())) {
+        TF_ASSIGN_OR_RETURN(auto converted_literal,
+                            hlo->literal().ConvertToShape(hlo->shape()));
+        auto new_constant = computation->AddInstruction(
+            HloInstruction::CreateConstant(std::move(converted_literal)));
+        TF_RETURN_IF_ERROR(hlo->ReplaceAllUsesWith(new_constant));
+        needs_dce = true;
+      }
+    }
+  }
+
   if (needs_tuple_simplifier) {
     TupleSimplifier tuple_simplifier;
     TF_RETURN_IF_ERROR(tuple_simplifier.Run(module).status());
+  }
+  if (needs_dce) {
     HloDCE dce;
     TF_RETURN_IF_ERROR(dce.Run(module).status());
   }
   return Status::OK();
 }
 
+Status BFloat16Propagation::RemoveNoopConversions(HloModule* module) {
+  for (auto computation : module->computations()) {
+    for (auto hlo : computation->MakeInstructionPostOrder()) {
+      if (hlo->opcode() != HloOpcode::kConvert) {
+        continue;
+      }
+      auto source = hlo->mutable_operand(0);
+      if (!ShapeUtil::Equal(source->shape(), hlo->shape())) {
+        continue;
+      }
+      const bool is_root = hlo == computation->root_instruction();
+      TF_RETURN_IF_ERROR(hlo->ReplaceAllUsesWith(source));
+      if (is_root) {
+        computation->set_root_instruction(source);
+      }
+      TF_RETURN_IF_ERROR(computation->RemoveInstructionAndUnusedOperands(hlo));
+    }
+  }
+  return Status::OK();
+}
+
 // The algorithm first does a forward pass (parameters to root) to determine a
 // set of instructions to consider using bfloat16, then does a backward pass to
 // determine the precisions of those instructions according to the need of
@@ -441,6 +698,10 @@ StatusOr<bool> BFloat16Propagation::Run(HloModule* module) {
   // defining instruction's shape has changed. So we need to adjust the output
   // shapes of instructions according to the HLO values they refer to.
   TF_RETURN_IF_ERROR(ResolveInconsistencyOfAliasingBuffers(module));
+
+  // This pass could have turned an F32 -> BF16 conversion to a no-op (BF16 ->
+  // BF16), so we remove them now.
+  TF_RETURN_IF_ERROR(RemoveNoopConversions(module));
   return true;
 }
 
diff --git a/tensorflow/compiler/xla/service/bfloat16_propagation.h b/tensorflow/compiler/xla/service/bfloat16_propagation.h
index ccf77d7b4eb6bd7b76b1b6743bd724f42c141f08..1744e9db90aeff269daa91eb68a1d61bb0fc3035 100644
--- a/tensorflow/compiler/xla/service/bfloat16_propagation.h
+++ b/tensorflow/compiler/xla/service/bfloat16_propagation.h
@@ -38,7 +38,8 @@ namespace xla {
 // be bitwise identical to that without this pass; this is possible if the
 // backend already reduces precision to BF16 on some HLO instructions.
 //
-// This pass will not modify the signature of any non-fusion computation.
+// This pass will not modify the signature of a computation, unless it is a
+// fusion computation or its only caller is a while.
 //
 // !!! WARNING !!! This pass can introduce mixed precision in individual HLOs,
 // which has two issues:
@@ -92,8 +93,23 @@ class BFloat16Propagation : public HloPassInterface {
                                               bool skip_parameters);
 
   // Special handling in the mutation pass for fusion computations.
+  //
+  // Precondition: fusion->opcode() == kFusion
   void DetermineAndMutateFusionComputationPrecision(HloInstruction* fusion);
 
+  // Special handling in the mutation pass for while computations.
+  //
+  // Precondition: while_hlo->opcode() == kWhile
+  void DetermineAndMutateWhileComputationsPrecision(HloInstruction* while_hlo);
+
+  // The set of HloInstructions that have been visited in the mutation pass.
+  tensorflow::gtl::FlatSet<const HloInstruction*>
+      instructions_visited_in_mutation_pass_;
+
+  // The set of HloComputations that have been visited in the mutation pass.
+  tensorflow::gtl::FlatSet<const HloComputation*>
+      computations_visited_in_mutation_pass_;
+
   // ***************************
   // Functions called by the final inconsistency resolving pass.
 
@@ -102,9 +118,25 @@ class BFloat16Propagation : public HloPassInterface {
   // same precision.
   Status ResolveInconsistencyOfAliasingBuffers(HloModule* module);
 
-  // Makes the fusion parameters match the precision of the actual parameters
-  // passed to the fusion node.
-  void AdjustFusionParameters(HloInstruction* fusion);
+  // Resolves inconsistency of aliasing buffers for the given computation, and
+  // recursively runs on a while instruction's condition and body until a fixed
+  // point is reached.
+  bool ResolveInconsistencyOfAliasingBuffersHelper(
+      HloComputation* computation,
+      tensorflow::gtl::FlatSet<const HloComputation*>* visited_computations);
+
+  // Makes the parameters of called computations match how they are called by
+  // the given HLO.
+  void AdjustCalledComputationParameters(HloInstruction* hlo);
+
+  // Makes the root instructions of called computations match how they are used
+  // by the given HLO.
+  void AdjustCalledComputationRoot(HloInstruction* hlo);
+
+  // ***************************
+  // Removes no-op conversions (same source and target shapes) that can be
+  // produced by this pass.
+  Status RemoveNoopConversions(HloModule* module);
 
   // ***************************
   // Functions called and state used by two or more passes.
@@ -117,8 +149,10 @@ class BFloat16Propagation : public HloPassInterface {
   // The set of F32 HLO values that must be kept in F32.
   tensorflow::gtl::FlatSet<const HloValue*> values_that_must_be_kept_as_f32_;
 
-  // ***************************
-  // State used by both passes.
+  // Mapping from each HloComputation to the number of callers to it in the
+  // module. Populated at the beginning of this pass.
+  tensorflow::gtl::FlatMap<const HloComputation*, int64> caller_counts_;
+
   const BFloat16Support* bfloat16_support_;
   std::unique_ptr<HloDataflowAnalysis> dataflow_;
 
diff --git a/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc b/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc
index 2047e2053a1a819a2d534f34fc4ba2f8768dc861..88f83014164ff726a11e45e762b9c082cf12720d 100644
--- a/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc
@@ -23,6 +23,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 
 namespace xla {
@@ -121,6 +122,41 @@ TEST_F(BFloat16PropagationTest, PropagateThroughSelectButNotAdd) {
   EXPECT_FALSE(OutputsBF16(c));
 }
 
+// Tests that if a constant is converted to BF16 then its literal must also be
+// converted.
+TEST_F(BFloat16PropagationTest, ConvertConstantLiteral) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {4, 4});
+  Array2D<float> array_a(4, 4);
+  array_a.FillUnique(1.0f);
+  Array2D<float> array_b(4, 4);
+  array_b.FillUnique(10.0f);
+
+  HloInstruction* a = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateFromArray(array_a)));
+  HloInstruction* b = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateFromArray(array_b)));
+  HloInstruction* dot = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, a, b));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), dot);
+  EXPECT_TRUE(OutputsBF16(dot->operand(0)));
+  EXPECT_TRUE(OutputsBF16(dot->operand(1)));
+  EXPECT_EQ(dot->operand(0)->opcode(), HloOpcode::kConstant);
+  EXPECT_EQ(dot->operand(1)->opcode(), HloOpcode::kConstant);
+  LiteralTestUtil::ExpectEqual(
+      dot->operand(0)->literal(),
+      *LiteralTestUtil::ConvertF32ToBF16(*Literal::CreateFromArray(array_a)));
+  LiteralTestUtil::ExpectEqual(
+      dot->operand(1)->literal(),
+      *LiteralTestUtil::ConvertF32ToBF16(*Literal::CreateFromArray(array_b)));
+}
+
 // Tests that BF16 can be propagated through nested tuples.
 TEST_F(BFloat16PropagationTest, PropagateThroughTuples) {
   auto builder = HloComputation::Builder(TestName());
@@ -390,4 +426,235 @@ TEST_F(BFloat16PropagationTest, SelectOverTuples) {
   EXPECT_TRUE(OutputsBF16(xpose));
 }
 
+// Tests that BF16 is propagated properly through while computations.
+TEST_F(BFloat16PropagationTest, PropagateThroughWhile) {
+  auto module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {4, 4});
+
+  HloInstruction* param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "param0"));
+  HloInstruction* param1 = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, shape, "param1"));
+  HloInstruction* add0 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* add1 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* tuple =
+      builder.AddInstruction(HloInstruction::CreateTuple({add0, add1}));
+
+  auto builder_cond = HloComputation::Builder("cond");
+  auto cond_param = builder_cond.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple->shape(), "cond_param"));
+  auto cond_lhs = builder_cond.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond_param, 0));
+  auto cond_rhs = builder_cond.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond_param, 1));
+  // This add should prevent RHS from using BF16
+  auto cond_add_rhs = builder_cond.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, cond_rhs, cond_rhs));
+  auto cond_dot = builder_cond.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot, cond_lhs, cond_add_rhs));
+  builder_cond.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(PRED, {}), HloOpcode::kGt,
+      builder_cond.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond_dot, {0, 0}, {1, 1}, {1, 1})),
+      builder_cond.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond_dot, {1, 1}, {2, 2}, {1, 1}))));
+  auto cond = module->AddEmbeddedComputation(builder_cond.Build());
+
+  auto builder_body = HloComputation::Builder("body");
+  auto body_param = builder_body.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple->shape(), "body_param"));
+  auto body_lhs = builder_body.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, body_param, 0));
+  auto body_rhs = builder_body.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, body_param, 1));
+  auto body_dot = builder_body.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, body_lhs, body_rhs));
+  builder_body.AddInstruction(
+      HloInstruction::CreateTuple({body_dot, body_rhs}));
+  auto body = module->AddEmbeddedComputation(builder_body.Build());
+
+  auto while_hlo = builder.AddInstruction(
+      HloInstruction::CreateWhile(tuple->shape(), cond, body, tuple));
+
+  auto lhs = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, while_hlo, 0));
+  auto rhs = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, while_hlo, 1));
+  auto dot = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, lhs, rhs));
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), dot);
+  EXPECT_TRUE(OutputsBF16(lhs));
+  EXPECT_FALSE(OutputsBF16(rhs));
+  EXPECT_TRUE(OutputsBF16(body_dot));
+  EXPECT_TRUE(OutputsBF16(body_lhs));
+  EXPECT_FALSE(OutputsBF16(body_rhs));
+  EXPECT_TRUE(OutputsBF16(cond_lhs));
+  EXPECT_FALSE(OutputsBF16(cond_rhs));
+  EXPECT_TRUE(OutputsBF16(add0));
+  EXPECT_FALSE(OutputsBF16(add1));
+}
+
+// Tests that BF16 is not propagated through multiple whiles that invoke the
+// same computation as long as one while prevents the propagation.
+TEST_F(BFloat16PropagationTest, DoNotPropagateWhilesCallingSameComputation) {
+  auto module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {4, 4});
+
+  HloInstruction* param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "param0"));
+  HloInstruction* param1 = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, shape, "param1"));
+  HloInstruction* add0 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* add1 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* add2 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* add3 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* tuple0 =
+      builder.AddInstruction(HloInstruction::CreateTuple({add0, add1}));
+  HloInstruction* tuple1 =
+      builder.AddInstruction(HloInstruction::CreateTuple({add2, add3}));
+
+  // Condition computation for the first while.
+  auto builder_cond0 = HloComputation::Builder("cond0");
+  auto cond0_param = builder_cond0.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple0->shape(), "cond0_param"));
+  auto cond0_lhs = builder_cond0.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond0_param, 0));
+  auto cond0_rhs = builder_cond0.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond0_param, 1));
+  // This add should prevent RHS from using BF16
+  auto cond0_add_rhs =
+      builder_cond0.AddInstruction(HloInstruction::CreateBinary(
+          shape, HloOpcode::kAdd, cond0_rhs, cond0_rhs));
+  auto cond0_dot = builder_cond0.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot, cond0_lhs, cond0_add_rhs));
+  builder_cond0.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(PRED, {}), HloOpcode::kGt,
+      builder_cond0.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond0_dot, {0, 0}, {1, 1}, {1, 1})),
+      builder_cond0.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond0_dot, {1, 1}, {2, 2}, {1, 1}))));
+  auto cond0 = module->AddEmbeddedComputation(builder_cond0.Build());
+
+  // Condition computation for the second while.
+  auto builder_cond1 = HloComputation::Builder("cond1");
+  auto cond1_param = builder_cond1.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple1->shape(), "cond1_param"));
+  auto cond1_lhs = builder_cond1.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond1_param, 0));
+  auto cond1_rhs = builder_cond1.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond1_param, 1));
+  // This add should prevent LHS from using BF16
+  auto cond1_add_lhs =
+      builder_cond1.AddInstruction(HloInstruction::CreateBinary(
+          shape, HloOpcode::kAdd, cond1_lhs, cond1_lhs));
+  auto cond1_dot = builder_cond1.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot, cond1_add_lhs, cond1_rhs));
+  builder_cond1.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(PRED, {}), HloOpcode::kGt,
+      builder_cond1.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond1_dot, {0, 0}, {1, 1}, {1, 1})),
+      builder_cond1.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond1_dot, {1, 1}, {2, 2}, {1, 1}))));
+  auto cond1 = module->AddEmbeddedComputation(builder_cond1.Build());
+
+  // Body computation shared by both whiles.
+  auto builder_body = HloComputation::Builder("body");
+  auto body_param = builder_body.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple0->shape(), "body_param"));
+  auto body_lhs = builder_body.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, body_param, 0));
+  auto body_rhs = builder_body.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, body_param, 1));
+  auto body_dot = builder_body.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, body_lhs, body_rhs));
+  builder_body.AddInstruction(
+      HloInstruction::CreateTuple({body_dot, body_rhs}));
+  auto body = module->AddEmbeddedComputation(builder_body.Build());
+
+  auto while0 = builder.AddInstruction(
+      HloInstruction::CreateWhile(tuple0->shape(), cond0, body, tuple0));
+  auto while1 = builder.AddInstruction(
+      HloInstruction::CreateWhile(tuple1->shape(), cond1, body, tuple1));
+
+  auto lhs = builder.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot,
+      builder.AddInstruction(
+          HloInstruction::CreateGetTupleElement(shape, while0, 0)),
+      builder.AddInstruction(
+          HloInstruction::CreateGetTupleElement(shape, while0, 1))));
+  auto rhs = builder.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot,
+      builder.AddInstruction(
+          HloInstruction::CreateGetTupleElement(shape, while1, 0)),
+      builder.AddInstruction(
+          HloInstruction::CreateGetTupleElement(shape, while1, 1))));
+  auto dot = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, lhs, rhs));
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+  EXPECT_FALSE(OutputsBF16(body_dot));
+  EXPECT_FALSE(OutputsBF16(body_rhs));
+  EXPECT_FALSE(OutputsBF16(body_lhs));
+  EXPECT_FALSE(OutputsBF16(cond0_lhs));
+  EXPECT_FALSE(OutputsBF16(cond0_rhs));
+  EXPECT_FALSE(OutputsBF16(cond1_lhs));
+  EXPECT_FALSE(OutputsBF16(cond1_rhs));
+  EXPECT_TRUE(OutputsBF16(cond0_add_rhs));
+  EXPECT_TRUE(OutputsBF16(cond1_add_lhs));
+  EXPECT_EQ(computation->root_instruction(), dot);
+}
+
+// Tests that if this pass turns an F32 -> BF16 conversion into a no-op (BF16 ->
+// BF16 conversion), then it will remove that conversion.
+TEST_F(BFloat16PropagationTest, NoopConversionRemoved) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape f32_shape = ShapeUtil::MakeShape(F32, {4, 4});
+  Shape bf16_shape = ShapeUtil::MakeShape(BF16, {4, 4});
+
+  HloInstruction* param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, f32_shape, "param"));
+  HloInstruction* add0 = builder.AddInstruction(
+      HloInstruction::CreateBinary(f32_shape, HloOpcode::kAdd, param, param));
+  HloInstruction* add1 = builder.AddInstruction(
+      HloInstruction::CreateBinary(f32_shape, HloOpcode::kAdd, param, param));
+  HloInstruction* tuple =
+      builder.AddInstruction(HloInstruction::CreateTuple({add0, add1}));
+  HloInstruction* gte0 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(f32_shape, tuple, 0));
+  HloInstruction* gte1 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(f32_shape, tuple, 1));
+  HloInstruction* convert0 =
+      builder.AddInstruction(HloInstruction::CreateConvert(bf16_shape, gte0));
+  HloInstruction* convert1 =
+      builder.AddInstruction(HloInstruction::CreateConvert(bf16_shape, gte1));
+  HloInstruction* add2 = builder.AddInstruction(HloInstruction::CreateBinary(
+      bf16_shape, HloOpcode::kAdd, convert0, convert1));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), add2);
+  EXPECT_EQ(add2->operand(0), gte0);
+  EXPECT_EQ(add2->operand(1), gte1);
+  EXPECT_EQ(gte0->shape().element_type(), BF16);
+  EXPECT_EQ(gte1->shape().element_type(), BF16);
+  EXPECT_EQ(add0->shape().element_type(), BF16);
+  EXPECT_EQ(add1->shape().element_type(), BF16);
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/bfloat16_support.cc b/tensorflow/compiler/xla/service/bfloat16_support.cc
index 3fd9e24601f27633c8063e4574c7c4f91f30dcff..07b4b14b5ec1bdbc01345091105df69368b0b2fb 100644
--- a/tensorflow/compiler/xla/service/bfloat16_support.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_support.cc
@@ -79,6 +79,7 @@ bool BFloat16Support::EffectiveOperandPrecisionIsOutputPrecision(
     case HloOpcode::kBroadcast:
     case HloOpcode::kClamp:
     case HloOpcode::kConcatenate:
+    case HloOpcode::kConvert:
     case HloOpcode::kCopy:
     case HloOpcode::kGetTupleElement:
     case HloOpcode::kMaximum:
diff --git a/tensorflow/compiler/xla/service/buffer_assignment.cc b/tensorflow/compiler/xla/service/buffer_assignment.cc
index b1e693da9d5af4babe619b8796007f2da318f6a8..dbe45e932cdeed00e959355d5b3199d2e858148f 100644
--- a/tensorflow/compiler/xla/service/buffer_assignment.cc
+++ b/tensorflow/compiler/xla/service/buffer_assignment.cc
@@ -48,6 +48,183 @@ using ::tensorflow::strings::HumanReadableNumBytes;
 using ::tensorflow::strings::Printf;
 using ::tensorflow::strings::StrAppend;
 
+namespace {
+
+template <typename T>
+string ColocatedBufferSetsToString(const T& container, const char* title) {
+  string result;
+  StrAppend(&result, title, "\n");
+  for (const auto& it : container) {
+    StrAppend(&result, "\t", it->ToString(), "\n");
+  }
+  return result;
+}
+
+// Walk the call graph of the HLO module and place each computation into either
+// thread_local_computations or global_computations depending upon whether the
+// computation requires thread-local allocations or global allocations. The
+// elements in thread_local_computations and global_computations are in post
+// order (if computation A has an instruction which calls computation B, then A
+// will appear after B in the vector).
+Status GatherComputationsByAllocationType(
+    const HloModule* module,
+    std::vector<const HloComputation*>* thread_local_computations,
+    std::vector<const HloComputation*>* global_computations) {
+  // Create a worklist of computations paired with whether the allocation must
+  // be thread-local.
+  std::deque<std::pair<const HloComputation*, bool>> worklist;
+  worklist.push_back(std::make_pair(module->entry_computation(),
+                                    /*is_thread_local*/ false));
+
+  // Sets for quickly checking membership. Computations are returned in vectors
+  // for stable iteration.
+  FlatSet<const HloComputation*> thread_local_set;
+  FlatSet<const HloComputation*> global_set;
+
+  while (!worklist.empty()) {
+    auto worklist_front = worklist.front();
+    worklist.pop_front();
+    const HloComputation* computation = worklist_front.first;
+    bool is_thread_local = worklist_front.second;
+    bool in_thread_local_set = thread_local_set.count(computation) > 0;
+    bool in_global_set = global_set.count(computation) > 0;
+
+    // If the computation has already been added to the respective set, then
+    // nothing to do.
+    if ((is_thread_local && in_thread_local_set) ||
+        (!is_thread_local && in_global_set)) {
+      continue;
+    }
+
+    // If the computation has already been added to the other set, this is an
+    // error condition: the global call to the computation (e.g., while/call)
+    // may return a reference to one of the thread-local buffers to the calling
+    // computation, which becomes a dangling reference when the thread-local
+    // buffer is deallocated on return from the call.
+    if ((is_thread_local && in_global_set) ||
+        (!is_thread_local && in_thread_local_set)) {
+      return InvalidArgument(
+          "computation %s has conflicting allocation requirements (global "
+          "and thread-local)",
+          computation->name().c_str());
+    }
+
+    if (is_thread_local) {
+      thread_local_set.insert(computation);
+    } else {
+      global_set.insert(computation);
+    }
+
+    for (auto* instruction : computation->instructions()) {
+      for (HloComputation* subcomputation :
+           instruction->called_computations()) {
+        switch (instruction->opcode()) {
+          case HloOpcode::kCall:
+          case HloOpcode::kConditional:
+          case HloOpcode::kWhile:
+            // Call, conditional, and while must be called from a computation
+            // with global allocations as they may return references to buffers
+            // inside the called computation which cannot be thread-local.
+            if (is_thread_local) {
+              return InvalidArgument(
+                  "computation %s cannot contain call/while op because it "
+                  "requires thread-local buffer allocations",
+                  computation->name().c_str());
+            }
+            worklist.push_back(std::make_pair(subcomputation,
+                                              false));  // Not thread local.
+            break;
+          case HloOpcode::kMap:
+          case HloOpcode::kReduce:
+          case HloOpcode::kReduceWindow:
+          case HloOpcode::kSelectAndScatter:
+          case HloOpcode::kFusion:
+            // Map/reduce etc computations are always thread-local.
+            worklist.push_back(std::make_pair(subcomputation,
+                                              true));  // Thread local.
+            break;
+          default:
+            return InternalError(
+                "Unexpected calling opcode: %s",
+                HloOpcodeString(instruction->opcode()).c_str());
+        }
+      }
+    }
+  }
+
+  // Add the computations to the vectors in post order.
+  for (auto* computation : module->MakeComputationPostOrder()) {
+    if (thread_local_set.count(computation) > 0) {
+      thread_local_computations->push_back(computation);
+    } else if (global_set.count(computation) > 0) {
+      global_computations->push_back(computation);
+    }
+    // If the computation is not reachable from the entry computation, then it
+    // will not appear in either thread_local_set or global_set. We don't bother
+    // assigning buffers for these.
+  }
+  return Status::OK();
+}
+
+// Checks that points-to set of 'instruction' is unambiguous and distinct
+// (ensured by CopyInsertion), then adds the buffer from the points-to set at
+// 'index' to 'colocated_set'.
+const LogicalBuffer* AddBufferToColocatedSet(
+    const HloInstruction* instruction, const ShapeIndex& index,
+    const TuplePointsToAnalysis& points_to_analysis,
+    std::vector<const LogicalBuffer*>* colocated_set) {
+  // CopyInsertion ensures root points-to set is unambiguous and distinct.
+  const auto& points_to = points_to_analysis.GetPointsToSet(instruction);
+  DCHECK(!points_to.IsAmbiguous());
+  colocated_set->push_back(points_to.element(index)[0]);
+  return colocated_set->back();
+}
+
+// Given the interference map of a graph (the list of interfering node indices
+// for each node), perform graph coloring such that interfering nodes are
+// assigned to different colors. Returns the assigned color of the nodes, where
+// the colors are represented as integer values [0, color_count).
+std::vector<int64> ColorInterferenceGraph(
+    const std::vector<std::vector<int64>>& interference_map) {
+  const int64 node_count = interference_map.size();
+
+  // Sort the nodes such that we assign nodes with more interference first. This
+  // relies on the common heuristic of assigning the most constrained node
+  // first, but it would be good to investigate other ordering heuristics too.
+  std::vector<int64> nodes(node_count);
+  std::iota(nodes.begin(), nodes.end(), 0);
+  std::sort(nodes.begin(), nodes.end(),
+            [&interference_map](const int64 i, const int64 j) {
+              return interference_map[i].size() > interference_map[j].size();
+            });
+
+  const int64 kColorUnassigned = -1;
+  std::vector<int64> assigned_colors(node_count, kColorUnassigned);
+  for (int64 node : nodes) {
+    // Mark the colors that are already assigned to the neighbors.
+    std::vector<bool> available_colors(node_count, true);
+    for (int64 neighbor : interference_map[node]) {
+      int64 color = assigned_colors[neighbor];
+      if (color != kColorUnassigned) {
+        available_colors[color] = false;
+      }
+    }
+
+    // Find the lowest color that is not yet assigned to any neighbor.
+    int64 color = kColorUnassigned;
+    for (color = 0; color < available_colors.size(); ++color) {
+      if (available_colors[color]) {
+        break;
+      }
+    }
+    CHECK_NE(color, kColorUnassigned);
+    assigned_colors[node] = color;
+  }
+  return assigned_colors;
+}
+
+}  // namespace
+
 size_t BufferAllocation::Slice::Hasher::operator()(Slice s) const {
   uint64 h = std::hash<int64>()(s.index());
   h = tensorflow::Hash64Combine(h, std::hash<int64>()(s.offset()));
@@ -115,6 +292,112 @@ BufferAllocationProto BufferAllocation::ToProto() const {
   return proto;
 }
 
+std::pair<int64, std::vector<const LogicalBuffer*>>
+BufferAllocation::ComputePeakMemoryLogicalBuffers() const {
+  if (HeapTraces().empty()) {
+    // Just return the largest LogicalBuffer in the allocation.
+    const LogicalBuffer* largest_buffer = nullptr;
+    int64 largest_size = 0;
+    for (const auto& pair : assigned_buffers()) {
+      const LogicalBuffer* buffer = pair.first;
+      int64 size = pair.second.size;
+      if (largest_buffer == nullptr) {
+        largest_buffer = buffer;
+        largest_size = size;
+        continue;
+      }
+      // Tie-break with LogicalBuffer::Id so the return value is stable relative
+      // to changing addresses.
+      if (size > largest_size ||
+          ((size == largest_size) && (largest_buffer->id() > buffer->id()))) {
+        largest_buffer = buffer;
+        largest_size = size;
+      }
+    }
+    CHECK(largest_buffer != nullptr)
+        << "No logical buffers in allocation: " << ToString();
+    return {largest_size, {largest_buffer}};
+  }
+
+  // Create a map from LogicalBuffer::Id to LogicalBuffer* for the logical
+  // buffers in this allocation.
+  tensorflow::gtl::FlatMap<LogicalBuffer::Id, const LogicalBuffer*>
+      id_to_buffer;
+  tensorflow::gtl::FlatMap<const LogicalBuffer*, int64> buffer_sizes;
+  for (const auto& pair : assigned_buffers()) {
+    const LogicalBuffer* buffer = pair.first;
+    const OffsetSize& offset_size = pair.second;
+    id_to_buffer[buffer->id()] = buffer;
+    buffer_sizes[buffer] = offset_size.size;
+  }
+
+  // Returns how much the given event increases the total size of live
+  // buffers. Can be negative.
+  auto memory_delta = [this, &id_to_buffer, &buffer_sizes](
+                          const HeapSimulatorTrace::Event& event) -> int64 {
+    const LogicalBuffer* buffer = id_to_buffer.at(event.buffer_id());
+    const int64 buffer_size = buffer_sizes.at(buffer);
+    if (event.kind() == HeapSimulatorTrace::Event::ALLOC) {
+      return buffer_size;
+    } else if (event.kind() == HeapSimulatorTrace::Event::SHARE_WITH) {
+      // Sharing a buffer does not change the live set size for the purposes of
+      // the heap simulator. Even though the shared-with buffer may be smaller,
+      // the entire allocation remains live.
+      return 0;
+    } else if (event.kind() == HeapSimulatorTrace::Event::FREE) {
+      return -1 * buffer_size;
+    }
+    LOG(FATAL) << "Unknown event kind: " << event.kind();
+  };
+
+  int64 total_max_live_size = 0;
+  std::vector<const LogicalBuffer*> live_buffers_vector;
+  for (const HeapSimulatorTrace& heap_trace : HeapTraces()) {
+    // First compute the size of the maximal live set.
+    int64 max_live_size = 0;
+    int64 live_size = 0;
+    for (const auto& event : heap_trace.events()) {
+      live_size += memory_delta(event);
+      if (max_live_size < live_size) {
+        max_live_size = live_size;
+      }
+    }
+
+    // Next gather the set of logical buffers live at the earliest point of
+    // maximal live set size.
+    tensorflow::gtl::FlatSet<const LogicalBuffer*> live_buffers;
+    live_size = 0;
+    for (const auto& event : heap_trace.events()) {
+      const LogicalBuffer* buffer = id_to_buffer.at(event.buffer_id());
+      if (event.kind() == HeapSimulatorTrace::Event::ALLOC) {
+        InsertOrDie(&live_buffers, buffer);
+      } else if (event.kind() == HeapSimulatorTrace::Event::SHARE_WITH) {
+        // Nothing to do.
+      } else if (event.kind() == HeapSimulatorTrace::Event::FREE) {
+        CHECK(ContainsKey(live_buffers, buffer));
+        live_buffers.erase(buffer);
+      }
+
+      live_size += memory_delta(event);
+      if (live_size == max_live_size) {
+        break;
+      }
+    }
+    CHECK_EQ(live_size, max_live_size);
+    total_max_live_size += max_live_size;
+
+    live_buffers_vector.insert(live_buffers_vector.end(), live_buffers.begin(),
+                               live_buffers.end());
+  }
+
+  // Sort the live buffers by id so that the result is deterministic.
+  std::sort(live_buffers_vector.begin(), live_buffers_vector.end(),
+            [](const LogicalBuffer* a, const LogicalBuffer* b) {
+              return a->id() < b->id();
+            });
+  return {total_max_live_size, live_buffers_vector};
+}
+
 string BufferAllocation::ToString() const {
   string output;
   Appendf(&output, "allocation %lld: %p, size %lld", index_, this, size());
@@ -348,6 +631,7 @@ void BufferAssignment::AddAssignment(BufferAllocation* allocation,
 // Combines allocations of temporary buffers of the same color into one big
 // BufferAllocation.
 void BufferAssignment::CombineTempAllocations() {
+  VLOG(1) << "CombineTempAllocations()";
   FlatMap<LogicalBuffer::Color, BufferAllocation, LogicalBuffer::Color::Hasher>
       combined_allocation_map;
 
@@ -369,11 +653,16 @@ void BufferAssignment::CombineTempAllocations() {
       if (combined_it == combined_allocation_map.end()) {
         // We have found the first temp allocation of this color. Collect
         // the other temp allocations of the same color into it.
+        VLOG(1) << "Combined temp allocation for color " << color
+                << " is: " << temp_allocation;
         combined_allocation_map.emplace(color, temp_allocation);
         continue;
       }
 
       auto* combined_allocation = &combined_it->second;
+      VLOG(1) << "Combined allocation absorbing temp allocation: "
+              << temp_allocation;
+
       // Each temp allocation is placed end-to-end, accounting for alignment.
       // The offset of each buffer in the combined allocation is computed from
       // the base offset of the allocation.
@@ -387,6 +676,10 @@ void BufferAssignment::CombineTempAllocations() {
         const int64 size = buffer_offset_size.second.size;
         combined_allocation->AddAssignment(*buffer, base + offset, size);
       }
+      if (!temp_allocation.HeapTraces().empty()) {
+        CHECK_EQ(temp_allocation.HeapTraces().size(), 1);
+        combined_allocation->AddHeapTrace(temp_allocation.HeapTraces().front());
+      }
     }
     // Replace all existing temporary allocations with the new combined
     // allocations.
@@ -516,123 +809,13 @@ BufferAssignmentProto BufferAssignment::ToProto() const {
   for (const BufferAllocation& allocation : Allocations()) {
     BufferAllocationProto proto_allocation = allocation.ToProto();
     proto.add_buffer_allocations()->Swap(&proto_allocation);
-  }
-  for (const HeapSimulatorTrace& trace : heap_simulator_traces_) {
-    *proto.add_heap_simulator_traces() = trace;
-  }
-  return proto;
-}
-
-namespace {
-
-// Walk the call graph of the HLO module and place each computation into either
-// thread_local_computations or global_computations depending upon whether the
-// computation requires thread-local allocations or global allocations. The
-// elements in thread_local_computations and global_computations are in post
-// order (if computation A has an instruction which calls computation B, then A
-// will appear after B in the vector).
-Status GatherComputationsByAllocationType(
-    const HloModule* module,
-    std::vector<const HloComputation*>* thread_local_computations,
-    std::vector<const HloComputation*>* global_computations) {
-  // Create a worklist of computations paired with whether the allocation must
-  // be thread-local.
-  std::deque<std::pair<const HloComputation*, bool>> worklist;
-  worklist.push_back(std::make_pair(module->entry_computation(),
-                                    /*is_thread_local*/ false));
-
-  // Sets for quickly checking membership. Computations are returned in vectors
-  // for stable iteration.
-  FlatSet<const HloComputation*> thread_local_set;
-  FlatSet<const HloComputation*> global_set;
-
-  while (!worklist.empty()) {
-    auto worklist_front = worklist.front();
-    worklist.pop_front();
-    const HloComputation* computation = worklist_front.first;
-    bool is_thread_local = worklist_front.second;
-    bool in_thread_local_set = thread_local_set.count(computation) > 0;
-    bool in_global_set = global_set.count(computation) > 0;
-
-    // If the computation has already been added to the respective set, then
-    // nothing to do.
-    if ((is_thread_local && in_thread_local_set) ||
-        (!is_thread_local && in_global_set)) {
-      continue;
+    for (const HeapSimulatorTrace& heap_trace : allocation.HeapTraces()) {
+      *proto.add_heap_simulator_traces() = heap_trace;
     }
-
-    // If the computation has already been added to the other set this is an
-    // error condition because the global call to the computation (eg,
-    // while/call) may return a reference to one of the thread-local buffers to
-    // the calling computation which will become a dangling reference when the
-    // thread-local is deallocated with the call return.
-    if ((is_thread_local && in_global_set) ||
-        (!is_thread_local && in_thread_local_set)) {
-      return InvalidArgument(
-          "computation %s has conflicting allocation requirements (global "
-          "and thread-local)",
-          computation->name().c_str());
-    }
-
-    if (is_thread_local) {
-      thread_local_set.insert(computation);
-    } else {
-      global_set.insert(computation);
-    }
-
-    for (auto* instruction : computation->instructions()) {
-      for (HloComputation* subcomputation :
-           instruction->called_computations()) {
-        switch (instruction->opcode()) {
-          case HloOpcode::kCall:
-          case HloOpcode::kConditional:
-          case HloOpcode::kWhile:
-            // Call and while must be called from a computation with global
-            // allocations as they may return references to buffers inside the
-            // called computation which cannot be thread-local.
-            if (is_thread_local) {
-              return InvalidArgument(
-                  "computation %s cannot contain call/while op because it "
-                  "requires thread-local buffer allocations",
-                  computation->name().c_str());
-            }
-            worklist.push_back(std::make_pair(subcomputation,
-                                              false));  // Not thread local.
-            break;
-          case HloOpcode::kMap:
-          case HloOpcode::kReduce:
-          case HloOpcode::kReduceWindow:
-          case HloOpcode::kSelectAndScatter:
-          case HloOpcode::kFusion:
-            // Map/reduce etc computations are always thread-local.
-            worklist.push_back(std::make_pair(subcomputation,
-                                              true));  // Thread local.
-            break;
-          default:
-            return InternalError(
-                "Unexpected calling opcode: %s",
-                HloOpcodeString(instruction->opcode()).c_str());
-        }
-      }
-    }
-  }
-
-  // Add the computations to the vectors in post order.
-  for (auto* computation : module->MakeComputationPostOrder()) {
-    if (thread_local_set.count(computation) > 0) {
-      thread_local_computations->push_back(computation);
-    } else if (global_set.count(computation) > 0) {
-      global_computations->push_back(computation);
-    }
-    // If the computation is not reachable from the entry computation, then it
-    // will not appear in either thread_local_set or global_set. We don't bother
-    // assigning buffers for these.
   }
-  return Status::OK();
+  return proto;
 }
 
-}  // namespace
-
 /* static */
 StatusOr<std::unique_ptr<BufferAssignment>> BufferAssigner::Run(
     const HloModule* module, std::unique_ptr<HloOrdering> hlo_ordering,
@@ -1064,7 +1247,8 @@ void BufferAssigner::AssignBuffersFromHeapSimulator(
     assignment->AddAssignment(allocation, buffer, chunk.offset, chunk.size);
   }
 
-  assignment->heap_simulator_traces_.push_back(result.debug_trace);
+  VLOG(1) << "Ran heap simulation for allocation: " << allocation->ToString();
+  allocation->AddHeapTrace(result.debug_trace);
 }
 
 // Adds the 'colocated_set' of buffers to 'colocated_buffer_sets', maintaining
@@ -1085,7 +1269,8 @@ void BufferAssigner::AddSetToColocatedBufferSets(
   if (colocated_set.empty()) {
     return;
   }
-
+  VLOG(5) << ColocatedBufferSetsToString(colocated_set,
+                                         "Adding colocated buffer set");
   // Find existing sets that overlap with at least one buffer from the
   // colocated_set. The resulting 'overlap_set_indices' will have at most
   // colocated_buffer_sets->size() entries, and will be in increasing order.
@@ -1093,6 +1278,10 @@ void BufferAssigner::AddSetToColocatedBufferSets(
   for (size_t index = 0; index < colocated_buffer_sets->size(); ++index) {
     for (const LogicalBuffer* buffer : colocated_set) {
       if ((*colocated_buffer_sets)[index].count(buffer) > 0) {
+        VLOG(5) << "Found overlap with existing set on buffer "
+                << buffer->ToString() << "\n"
+                << ColocatedBufferSetsToString((*colocated_buffer_sets)[index],
+                                               "Overlapping set");
         overlap_set_indices.push_back(index);
         break;
       }
@@ -1104,6 +1293,7 @@ void BufferAssigner::AddSetToColocatedBufferSets(
     colocated_buffer_sets->emplace_back();
     colocated_buffer_sets->back().insert(colocated_set.begin(),
                                          colocated_set.end());
+    VLOG(5) << "No overlap found, new group created";
     return;
   }
 
@@ -1115,6 +1305,8 @@ void BufferAssigner::AddSetToColocatedBufferSets(
     first->insert(overlap_set.begin(), overlap_set.end());
   }
   first->insert(colocated_set.begin(), colocated_set.end());
+  VLOG(5) << ColocatedBufferSetsToString(
+      *first, "Result of the colocated buffer set merging");
 
   // Remove overlap sets that we just merged. The offset accounts for the fact
   // that as elements are erased, the indices need to be adjusted. Keep in mind
@@ -1125,67 +1317,6 @@ void BufferAssigner::AddSetToColocatedBufferSets(
   }
 }
 
-namespace {
-
-// Checks that points-to set of 'instruction' is unambiguous and distinct
-// (ensured by CopyInsertion), then adds the buffer from the points-to set at
-// 'index' to 'colocated_set'.
-const LogicalBuffer* AddBufferToColocatedSet(
-    const HloInstruction* instruction, const ShapeIndex& index,
-    const TuplePointsToAnalysis& points_to_analysis,
-    std::vector<const LogicalBuffer*>* colocated_set) {
-  // CopyInsertion ensures root points-to set is unambiguous and distinct.
-  const auto& points_to = points_to_analysis.GetPointsToSet(instruction);
-  DCHECK(!points_to.IsAmbiguous());
-  colocated_set->push_back(points_to.element(index)[0]);
-  return colocated_set->back();
-}
-
-// Given the interference map of a graph (the list of interfering node indices
-// for each node), perform graph coloring such that interfering nodes are
-// assigned to different colors. Returns the assigned color of the nodes, where
-// the colors are represented as integer values [0, color_count).
-std::vector<int64> ColorInterferenceGraph(
-    const std::vector<std::vector<int64>>& interference_map) {
-  const int64 node_count = interference_map.size();
-
-  // Sort the nodes such that we assign nodes with more interference first. This
-  // relies on the common heuristic of assigning the most constrained node
-  // first, but it would be good to investigate other ordering heuristics too.
-  std::vector<int64> nodes(node_count);
-  std::iota(nodes.begin(), nodes.end(), 0);
-  std::sort(nodes.begin(), nodes.end(),
-            [&interference_map](const int64 i, const int64 j) {
-              return interference_map[i].size() > interference_map[j].size();
-            });
-
-  const int64 kColorUnassigned = -1;
-  std::vector<int64> assigned_colors(node_count, kColorUnassigned);
-  for (int64 node : nodes) {
-    // Mark the colors that are already assigned to the neighbors.
-    std::vector<bool> available_colors(node_count, true);
-    for (int64 neighbor : interference_map[node]) {
-      int64 color = assigned_colors[neighbor];
-      if (color != kColorUnassigned) {
-        available_colors[color] = false;
-      }
-    }
-
-    // Find the color that is not yet assigned to the neighbors.
-    int64 color = kColorUnassigned;
-    for (color = 0; color < available_colors.size(); ++color) {
-      if (available_colors[color]) {
-        break;
-      }
-    }
-    CHECK_NE(color, kColorUnassigned);
-    assigned_colors[node] = color;
-  }
-  return assigned_colors;
-}
-
-}  // namespace
-
 std::vector<BufferAssigner::ColocatedBufferSet>
 BufferAssigner::MergeColocatedBufferSets(
     const std::vector<ColocatedBufferSet>& colocated_buffer_sets,
@@ -1208,26 +1339,35 @@ BufferAssigner::MergeColocatedBufferSets(
   auto cannot_merge_buffer_sets = [&colocated_buffer_sets, &buffer_liveness,
                                    &buffer_size,
                                    &is_entry_parameter](int64 i, int64 j) {
-    for (auto& buffer_a : colocated_buffer_sets[i]) {
-      for (auto& buffer_b : colocated_buffer_sets[j]) {
-        // Do not merge if the set includes live outs or entry parameters.
-        if ((buffer_liveness.MaybeLiveOut(*buffer_a) &&
-             is_entry_parameter(*buffer_b)) ||
-            (buffer_liveness.MaybeLiveOut(*buffer_b) &&
-             is_entry_parameter(*buffer_a))) {
+    // Do not merge if one of the sets includes live outs or entry parameters.
+    for (int64 key : {i, j}) {
+      for (auto& buffer : colocated_buffer_sets[key]) {
+        if (buffer_liveness.MaybeLiveOut(*buffer) ||
+            is_entry_parameter(*buffer)) {
           return true;
         }
-        // Do not merge if the buffers interfere with each other.
+      }
+    }
+
+    // Colocated sets satisfy the invariant that all buffers within a set have
+    // the same size. That means we need to check whether the size is the same
+    // between the two sets, but also that it's enough to look at just one
+    // buffer within each set.
+    if (buffer_size(**colocated_buffer_sets[i].begin()) !=
+        buffer_size(**colocated_buffer_sets[j].begin())) {
+      return true;
+    }
+
+    // Do not merge if some pair of buffers interferes with each other.
+    for (auto& buffer_a : colocated_buffer_sets[i]) {
+      for (auto& buffer_b : colocated_buffer_sets[j]) {
         if (buffer_a->id() != buffer_b->id() &&
             buffer_liveness.MayInterfere(*buffer_a, *buffer_b)) {
           return true;
         }
-        // Do not merge if the buffer sizes are different.
-        if (buffer_size(*buffer_a) != buffer_size(*buffer_b)) {
-          return true;
-        }
       }
     }
+
     return false;
   };
 
diff --git a/tensorflow/compiler/xla/service/buffer_assignment.h b/tensorflow/compiler/xla/service/buffer_assignment.h
index 6b7fd0014d103ef0617afcc5cb3f663554a01aa4..3086d0e2ca0026547134285b8ceb357390fc7ece 100644
--- a/tensorflow/compiler/xla/service/buffer_assignment.h
+++ b/tensorflow/compiler/xla/service/buffer_assignment.h
@@ -192,6 +192,37 @@ class BufferAllocation {
            !is_thread_local();
   }
 
+  // Add a heap trace which was used to assign slices to logical buffers in this
+  // allocation. A single BufferAllocation may include multiple heap traces
+  // in the case of the temporary block where there is a heap trace per
+  // computation.
+  void AddHeapTrace(const HeapSimulatorTrace& heap_trace) {
+    heap_traces_.push_back(heap_trace);
+  }
+
+  // Return the set of heap traces used to assign slices to logical buffers in
+  // this allocation.
+  const std::vector<HeapSimulatorTrace>& HeapTraces() const {
+    return heap_traces_;
+  }
+
+  // Compute and return the LogicalBuffers which are live at the point of peak
+  // memory usage for the given allocation. The point of peak memory usage is
+  // the point at which the total size of all live logical buffers is
+  // maximal. If peak memory is reached at multiple points, the set of logical
+  // buffers live at the earliest maximal point is returned. The vector is
+  // sorted by LogicalBuffer::Id.
+  //
+  // The return value is a pair of total size of the logical buffers at peak,
+  // and the buffers themselves.
+  std::pair<int64, std::vector<const LogicalBuffer*>>
+  ComputePeakMemoryLogicalBuffers() const;
+
+  // Get the number of bytes lost to fragmentation. This is equal to the
+  // difference between the size of the allocation and the size of the maximal
+  // live set.
+  int64 fragmentation_bytes() const { return fragmentation_bytes_; }
+
   bool operator==(const BufferAllocation& other) const {
     return index_ == other.index_;
   }
@@ -257,6 +288,9 @@ class BufferAllocation {
   // Mapping from the set of buffers assigned to this allocation to their
   // logical offsets and sizes.
   tensorflow::gtl::FlatMap<const LogicalBuffer*, OffsetSize> assigned_buffers_;
+
+  int64 fragmentation_bytes_ = 0;
+  std::vector<HeapSimulatorTrace> heap_traces_;
 };
 
 // Add stream operators for nicer output of CHECK/RET_CHECK failures.
@@ -441,7 +475,6 @@ class BufferAssignment {
   LogicalBuffer::AlignmentFunction color_alignment_;
 
   Stats stats_;
-  std::vector<HeapSimulatorTrace> heap_simulator_traces_;
 
   TF_DISALLOW_COPY_AND_ASSIGN(BufferAssignment);
 };
diff --git a/tensorflow/compiler/xla/service/buffer_assignment_test.cc b/tensorflow/compiler/xla/service/buffer_assignment_test.cc
index cd73654b8f666c4b96c000235cc3ad2cd0a46c17..513a8785bbd52b0a3bfa3642bbfc62b1035ffb17 100644
--- a/tensorflow/compiler/xla/service/buffer_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/buffer_assignment_test.cc
@@ -37,14 +37,16 @@ limitations under the License.
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/platform/macros.h"
 
 namespace xla {
-
 namespace {
 
+using ::testing::UnorderedElementsAre;
+
 // DFS visitor that collects the instructions referenced by a computation
 // without descending into nested computations, i.e., only from the operands.
 class InstructionListVisitor : public DfsHloVisitorWithDefault {
@@ -101,6 +103,22 @@ class BufferAssignmentTest : public HloTestBase {
         .ConsumeValueOrDie();
   }
 
+  std::unique_ptr<BufferAssignment> RunBufferAssignmentWithInstructionSequence(
+      HloModule* module,
+      tensorflow::gtl::ArraySlice<const HloInstruction*> instruction_sequence,
+      int64 alignment = 1) {
+    SequentialHloOrdering::HloModuleSequence module_sequence;
+    module_sequence[module->entry_computation()] =
+        std::vector<const HloInstruction*>(instruction_sequence.begin(),
+                                           instruction_sequence.end());
+    return BufferAssigner::Run(
+               module,
+               xla::MakeUnique<SequentialHloOrdering>(module, module_sequence),
+               backend().compiler()->BufferSizeBytesFunction(),
+               [alignment](LogicalBuffer::Color) { return alignment; })
+        .ConsumeValueOrDie();
+  }
+
   // Builds an x+1.0 computation to use in a Map.
   std::unique_ptr<HloComputation> BuildMapComputationPlus1(const string& name) {
     auto builder = HloComputation::Builder(name);
@@ -1370,7 +1388,7 @@ TEST_F(BufferAssignmentTest, AmbiguousBufferAsOutput) {
   auto element_slices = assignment->GetAllSlices(select, /*index=*/{0});
   EXPECT_EQ(2, element_slices.size());
   EXPECT_THAT(element_slices,
-              ::testing::UnorderedElementsAre(
+              UnorderedElementsAre(
                   assignment->GetUniqueSlice(tuple_param0, /*index=*/{0})
                       .ConsumeValueOrDie(),
                   assignment->GetUniqueSlice(tuple_param1, /*index=*/{0})
@@ -1473,6 +1491,98 @@ TEST_F(BufferAssignmentTest, OneTempAllocation) {
   }
 }
 
+TEST_F(BufferAssignmentTest, TrivialPeakBuffers) {
+  // paramscalar ------- (mul) -- (add) -- (sub)
+  //                     /        /        /
+  // param0[100] -------/        /        /
+  //                            /        /
+  // param1[100] --------------/--------/
+  auto builder = HloComputation::Builder(TestName());
+  auto paramscalar =
+      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, ""));
+  auto param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, f32vec100_, ""));
+  auto param1 = builder.AddInstruction(
+      HloInstruction::CreateParameter(2, f32vec100_, ""));
+  auto mul = builder.AddInstruction(HloInstruction::CreateBinary(
+      f32vec100_, HloOpcode::kMultiply, paramscalar, param0));
+  auto add = builder.AddInstruction(
+      HloInstruction::CreateBinary(f32vec100_, HloOpcode::kAdd, mul, param1));
+  builder.AddInstruction(HloInstruction::CreateBinary(
+      f32vec100_, HloOpcode::kSubtract, add, param1));
+  auto module = CreateNewModule();
+  module->AddEntryComputation(builder.Build());
+
+  auto buffers = RunBufferAssignment(module.get());
+
+  // Trivially, the set of peak memory logical buffer(s) of an allocation with a
+  // single logical buffer should be exactly the logical buffer in that
+  // allocation.
+  const BufferAllocation& mul_buffer = GetTopLevelAllocation(*buffers, mul);
+  int64 peak_size;
+  std::vector<const LogicalBuffer*> peak_buffers;
+
+  std::tie(peak_size, peak_buffers) =
+      mul_buffer.ComputePeakMemoryLogicalBuffers();
+  EXPECT_EQ(peak_size, ShapeUtil::ByteSizeOf(f32vec100_));
+  ASSERT_EQ(peak_buffers.size(), 1);
+  EXPECT_EQ(peak_buffers[0]->instruction(), mul);
+}
+
+TEST_F(BufferAssignmentTest, PeakBuffers) {
+  // Compute the peak liveness buffers of the following sequence:
+  //
+  //   %param = ...
+  //   %log = log(%param)
+  //   %rev = reverse(%log)
+  //   %neg = neg(%param)
+  //   %concat = concat(%rev, %neg)
+  //   ROOT %root = slice(%concat)
+  //
+  // In the temporary block, the set of live buffers at peak memory use should
+  // be {%rev, %neg, %concat}. This occurs right at the concat itself.
+  auto builder = HloComputation::Builder(TestName());
+  auto param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, f32vec100_, ""));
+  auto log = builder.AddInstruction(
+      HloInstruction::CreateUnary(f32vec100_, HloOpcode::kLog, param));
+  auto rev = builder.AddInstruction(
+      HloInstruction::CreateReverse(f32vec100_, log, {0}));
+  auto neg = builder.AddInstruction(
+      HloInstruction::CreateUnary(f32vec100_, HloOpcode::kNegate, param));
+  const Shape concat_shape = ShapeUtil::MakeShape(F32, {200});
+  auto concat = builder.AddInstruction(
+      HloInstruction::CreateConcatenate(concat_shape, {rev, neg}, 0));
+  // Make the root tiny so no interior nodes can share its buffer.
+  auto root = builder.AddInstruction(HloInstruction::CreateSlice(
+      ShapeUtil::MakeShape(F32, {1}), concat, {0}, {1}, {1}));
+
+  auto module = CreateNewModule();
+  module->AddEntryComputation(builder.Build());
+
+  auto buffers = RunBufferAssignmentWithInstructionSequence(
+      module.get(), {param, log, rev, neg, concat, root});
+
+  // The temporary buffer should hold the 4 interior instructions.
+  const BufferAllocation& buffer = GetTopLevelAllocation(*buffers, concat);
+  EXPECT_FALSE(buffer.IsInputOrOutput());
+  EXPECT_TRUE(buffer.IsPreallocatedTempBuffer());
+  ASSERT_EQ(buffer.assigned_buffers().size(), 4);
+
+  int64 peak_size;
+  std::vector<const LogicalBuffer*> peak_buffers;
+  std::tie(peak_size, peak_buffers) = buffer.ComputePeakMemoryLogicalBuffers();
+
+  // The peak live set should be concat and its inputs.
+  EXPECT_EQ(peak_size, ShapeUtil::ByteSizeOf(ShapeUtil::MakeShape(F32, {400})));
+  ASSERT_EQ(peak_buffers.size(), 3);
+  std::vector<const HloInstruction*> peak_instructions;
+  for (const LogicalBuffer* logical_buffer : peak_buffers) {
+    peak_instructions.push_back(logical_buffer->instruction());
+  }
+  EXPECT_THAT(peak_instructions, UnorderedElementsAre(rev, neg, concat));
+}
+
 class WhileBufferAssignmentTest : public HloTestBase {
  protected:
   std::unique_ptr<HloComputation> BuildWhileConditionComputation(
@@ -1587,6 +1697,81 @@ TEST_F(WhileBufferAssignmentTest, TwoForwardWhileLoops) {
             assignment->GetUniqueSlice(while1, {1}).ConsumeValueOrDie());
 }
 
+// Tests that two colocated buffer sets are not merged if an entry parameter
+// buffer belongs to either of the colocation sets (b/73267882).
+//
+// %param --> %while.0 --> %mul --> %while.1 --> %broadcast
+//
+// %while.0 body just forwards the init value, so the loop carried variable
+// remains the constant, whereas %while.1 changes the loop carried variable.
+TEST_F(WhileBufferAssignmentTest, ColocatedBufferWithEntryParameter) {
+  const Shape r0s32 = ShapeUtil::MakeShape(S32, {});
+
+  const char* module_str = R"(
+HloModule test_module
+
+%cond.v0 {
+  %param = s32[] parameter(0)
+  ROOT %constant = pred[] constant(true)
+}
+
+%cond.v1 {
+  %param.0 = s32[] parameter(0)
+  ROOT %constant.0 = pred[] constant(true)
+}
+
+%body.v0 {
+  ROOT %param.1 = s32[] parameter(0)
+}
+
+%body.v1 {
+  %param.2 = s32[] parameter(0)
+  ROOT add = s32[] add(%param.2, %param.2)
+}
+
+ENTRY %test_module {
+  %param.3 = s32[] parameter(0)
+  %while.0 = s32[] while(%param.3), condition=%cond.v0, body=%body.v0
+  %mul = s32[] multiply(%while.0, %while.0)
+  %while.1 = s32[] while(%mul), condition=%cond.v1, body=%body.v1
+  ROOT %bcast = s32[1024,1024]{1,0} broadcast(s32[] %while.1), dimensions={}
+})";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          tools::Parse(module_str));
+
+  // Run CopyInsertion and check that the graph constructed above doesn't need
+  // any copies inserted for BufferAssignment to run.
+  int64 instruction_count = module->instruction_count();
+  CopyInsertion copy_insertion;
+  ASSERT_IS_OK(copy_insertion.Run(module.get()).status());
+  ASSERT_EQ(instruction_count, module->instruction_count());
+
+  // Get the instructions in the module.
+  const HloInstruction* bcast = module->entry_computation()->root_instruction();
+  const HloInstruction* param =
+      module->entry_computation()->parameter_instruction(0);
+  ASSERT_EQ(bcast->opcode(), HloOpcode::kBroadcast);
+  const HloInstruction* while1 = bcast->operand(0);
+  ASSERT_EQ(while1->opcode(), HloOpcode::kWhile);
+  const HloInstruction* while0 = while1->operand(0)->operand(0);
+  ASSERT_EQ(while0->opcode(), HloOpcode::kWhile);
+
+  // Run buffer assignment.
+  auto assignment = RunBufferAssignment(module.get());
+  TF_ASSERT_OK_AND_ASSIGN(auto slice_param,
+                          assignment->GetUniqueSlice(param, {}));
+  TF_ASSERT_OK_AND_ASSIGN(auto slice_while0,
+                          assignment->GetUniqueSlice(while0, {}));
+  TF_ASSERT_OK_AND_ASSIGN(auto slice_while1,
+                          assignment->GetUniqueSlice(while1, {}));
+
+  // The parameter slice is part of the while0's colocation set (init value),
+  // but not merged into the while1's colocation set.
+  EXPECT_EQ(slice_param, slice_while0);
+  EXPECT_NE(slice_param, slice_while1);
+}
+
 // Tests that the colocated buffers for while instructions are properly assigned
 // during buffer assignment such that the result tuple elements are not assigned
 // to the same buffer.
diff --git a/tensorflow/compiler/xla/service/compile_only_service.cc b/tensorflow/compiler/xla/service/compile_only_service.cc
index dab73596e1639eed62151197048ee8d29570b20a..6664496ab6c603c35c7dce923fcf94c54d1ce714 100644
--- a/tensorflow/compiler/xla/service/compile_only_service.cc
+++ b/tensorflow/compiler/xla/service/compile_only_service.cc
@@ -72,8 +72,7 @@ CompileOnlyService::CompileAheadOfTime(
     VersionedComputationHandle versioned_handle =
         user_computation->GetVersionedHandle();
 
-    // TODO(b/63773457): Track DebugOptions in AotCompilationOptions.
-    DebugOptions debug_options = legacy_flags::GetDebugOptionsFromFlags();
+    const DebugOptions& debug_options = options.debug_options();
 
     // Dump computation proto state if flag is set.
     const string& directory_path = debug_options.xla_dump_computations_to();
diff --git a/tensorflow/compiler/xla/service/compiler.cc b/tensorflow/compiler/xla/service/compiler.cc
index e2e9d2a0c048fec6c6ffbeef1223ae0e6aef50d1..0392d4af48a040c4a648f7bf9bf21a62ce03a990 100644
--- a/tensorflow/compiler/xla/service/compiler.cc
+++ b/tensorflow/compiler/xla/service/compiler.cc
@@ -86,4 +86,7 @@ Compiler::GetPlatformCompilers() {
   return compilers->at(platform->id()).get();
 }
 
+AotCompilationOptions::AotCompilationOptions()
+    : debug_options_(legacy_flags::GetDebugOptionsFromFlags()) {}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/compiler.h b/tensorflow/compiler/xla/service/compiler.h
index 74fd24edf88d44b2dfdc87556b0af43987e69e08..33e19efc72c6d30ccd7e0b3a13f664a4f42208bf 100644
--- a/tensorflow/compiler/xla/service/compiler.h
+++ b/tensorflow/compiler/xla/service/compiler.h
@@ -79,11 +79,15 @@ class AotCompilationOptions {
     device_allocator_ = device_allocator;
   }
 
+  const DebugOptions& debug_options() const { return debug_options_; }
+  DebugOptions* mutable_debug_options() { return &debug_options_; }
+
  protected:
-  AotCompilationOptions() = default;
+  AotCompilationOptions();
 
  private:
   DeviceMemoryAllocator* device_allocator_ = nullptr;
+  DebugOptions debug_options_;
 };
 
 // Abstract compiler interface that is subclassed for compilation on a
diff --git a/tensorflow/compiler/xla/service/conditional_simplifier.cc b/tensorflow/compiler/xla/service/conditional_simplifier.cc
new file mode 100644
index 0000000000000000000000000000000000000000..f35de080853f7ec986565cb2df1050946ac3f244
--- /dev/null
+++ b/tensorflow/compiler/xla/service/conditional_simplifier.cc
@@ -0,0 +1,106 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/conditional_simplifier.h"
+
+#include <string>
+#include <utility>
+#include <vector>
+
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/call_inliner.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+
+namespace xla {
+
+// Tries to replace a conditional with a call operation of the corresponding
+// computation. If the given conditional has a constant predicate, tries to
+// replace it with a call to its true/false computation as appropriate and then
+// inline that computation.
+//
+// Returns true if it made a change to the graph.
+static StatusOr<bool> TryRemoveConditional(HloInstruction* conditional) {
+  CHECK_EQ(conditional->opcode(), HloOpcode::kConditional);
+  // Do not remove conditionals that contain side-effecting instructions or
+  // have control predecessors/successors in either true/false computation.
+  if (!conditional->parent()->IsRemovable(conditional) ||
+      conditional->HasSideEffect()) {
+    VLOG(2) << "Not attempting to remove conditional as it is not removable or "
+               "has side effect: "
+            << conditional->ToShortString();
+    return false;
+  }
+
+  if (conditional->operand(0)->opcode() != HloOpcode::kConstant) {
+    VLOG(2) << "Not attempting to remove conditional as its predicate is not a "
+               "compile-time constant: "
+            << conditional->ToShortString();
+    return false;
+  }
+
+  auto computation = conditional->parent();
+  HloInstruction* call_op;
+  if (conditional->operand(0)->literal().Get<bool>({})) {
+    call_op = computation->AddInstruction(HloInstruction::CreateCall(
+        conditional->shape(), {conditional->mutable_operand(1)},
+        conditional->true_computation()));
+  } else {
+    call_op = computation->AddInstruction(HloInstruction::CreateCall(
+        conditional->shape(), {conditional->mutable_operand(2)},
+        conditional->false_computation()));
+  }
+
+  TF_RETURN_IF_ERROR(computation->ReplaceInstruction(conditional, call_op));
+  TF_RETURN_IF_ERROR(CallInliner::Inline(call_op).status());
+
+  return true;
+}
+
+StatusOr<bool> ConditionalSimplifier::Run(HloModule* module) {
+  XLA_VLOG_LINES(
+      3, "ConditionalSimplifier::Run(), before:\n" + module->ToString());
+  bool changed = false;
+
+  // Gather all the conditional ops in our module. We do this ahead of time so
+  // we don't have to worry about mutating the lists of computations or
+  // instructions as we iterate.
+  std::vector<HloInstruction*> conditional_ops;
+  for (auto* comp : module->computations()) {
+    for (auto* instr : comp->instructions()) {
+      if (instr->opcode() == HloOpcode::kConditional) {
+        conditional_ops.push_back(instr);
+      }
+    }
+  }
+
+  for (HloInstruction* conditional_op : conditional_ops) {
+    TF_ASSIGN_OR_RETURN(bool result, TryRemoveConditional(conditional_op));
+    changed |= result;
+  }
+
+  XLA_VLOG_LINES(3,
+                 "ConditionalSimplifier::Run(), after:\n" + module->ToString());
+  return changed;
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/conditional_simplifier.h b/tensorflow/compiler/xla/service/conditional_simplifier.h
new file mode 100644
index 0000000000000000000000000000000000000000..063261e26d06e21a297e8e3c405898a17221b7ca
--- /dev/null
+++ b/tensorflow/compiler/xla/service/conditional_simplifier.h
@@ -0,0 +1,38 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_CONDITIONAL_SIMPLIFIER_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_CONDITIONAL_SIMPLIFIER_H_
+
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/hlo_pass_interface.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+
+namespace xla {
+
+// HLO pass that removes kConditional instructions with a constant predicate,
+// replacing them with their true or false computation as appropriate.
+class ConditionalSimplifier : public HloPassInterface {
+ public:
+  tensorflow::StringPiece name() const override {
+    return "simplify-conditional";
+  }
+  StatusOr<bool> Run(HloModule* module) override;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_CONDITIONAL_SIMPLIFIER_H_
diff --git a/tensorflow/compiler/xla/service/conditional_simplifier_test.cc b/tensorflow/compiler/xla/service/conditional_simplifier_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..868348547d9f5cbdc7576c7fc0697d72c3a3e557
--- /dev/null
+++ b/tensorflow/compiler/xla/service/conditional_simplifier_test.cc
@@ -0,0 +1,153 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/conditional_simplifier.h"
+
+#include <string>
+#include <utility>
+
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/test.h"
+#include "tensorflow/compiler/xla/tests/hlo_verified_test_base.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+namespace {
+
+namespace op = xla::testing::opcode_matchers;
+
+class ConditionalSimplifierTest : public HloVerifiedTestBase {
+ public:
+  // Makes a computation that contains a conditional with constant predicate.
+  HloComputation* MakeConditional(HloModule* module);
+};
+
+HloComputation* ConditionalSimplifierTest::MakeConditional(HloModule* module) {
+  HloComputation::Builder builder(TestName());
+
+  // true_computation returns param+1.
+  HloComputation* true_computation;
+  {
+    HloComputation::Builder true_computation_builder(TestName() +
+                                                     ".true_computation");
+    auto param =
+        true_computation_builder.AddInstruction(HloInstruction::CreateParameter(
+            0, ShapeUtil::MakeShape(S32, {}), "param"));
+    auto one = true_computation_builder.AddInstruction(
+        HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+
+    true_computation_builder.AddInstruction(HloInstruction::CreateBinary(
+        ShapeUtil::MakeShape(S32, {}), HloOpcode::kAdd, param, one));
+
+    true_computation =
+        module->AddEmbeddedComputation(true_computation_builder.Build());
+  }
+
+  // false_computation returns param+42.
+  HloComputation* false_computation;
+  {
+    HloComputation::Builder false_computation_builder(TestName() +
+                                                      ".false_computation");
+    auto param = false_computation_builder.AddInstruction(
+        HloInstruction::CreateParameter(0, ShapeUtil::MakeShape(S32, {}),
+                                        "param"));
+    auto forty_two = false_computation_builder.AddInstruction(
+        HloInstruction::CreateConstant(Literal::CreateR0<int32>(42)));
+
+    false_computation_builder.AddInstruction(HloInstruction::CreateBinary(
+        ShapeUtil::MakeShape(S32, {}), HloOpcode::kAdd, param, forty_two));
+    false_computation =
+        module->AddEmbeddedComputation(false_computation_builder.Build());
+  }
+
+  auto false_instrn = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+  auto false_param = builder.AddInstruction(HloInstruction::CreateParameter(
+      0, ShapeUtil::MakeShape(S32, {}), "false_param"));
+  auto one = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+
+  builder.AddInstruction(HloInstruction::CreateConditional(
+      ShapeUtil::MakeShape(S32, {}), false_instrn, one, true_computation,
+      false_param, false_computation));
+
+  return module->AddEntryComputation(builder.Build());
+}
+
+TEST_F(ConditionalSimplifierTest, ConditionalGetsInlined) {
+  HloComputation* computation = MakeConditional(&module());
+  ASSERT_TRUE(ConditionalSimplifier().Run(&module()).ValueOrDie());
+  EXPECT_THAT(computation->root_instruction(),
+              op::Add(op::Parameter(), op::Constant()));
+}
+
+TEST_F(ConditionalSimplifierTest, ConditionalWithControlDependency) {
+  HloComputation* computation = MakeConditional(&module());
+
+  auto* true_op = computation->AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+  TF_ASSERT_OK(
+      true_op->AddControlDependencyTo(computation->root_instruction()));
+
+  EXPECT_FALSE(ConditionalSimplifier().Run(&module()).ValueOrDie());
+}
+
+TEST_F(ConditionalSimplifierTest, NotRemovedIfContainsSend) {
+  HloComputation* computation = MakeConditional(&module());
+  auto* conditional = computation->root_instruction();
+  ASSERT_EQ(conditional->opcode(), HloOpcode::kConditional);
+
+  auto* true_computation = conditional->true_computation();
+  auto* send = true_computation->AddInstruction(HloInstruction::CreateSend(
+      true_computation->AddInstruction(
+          HloInstruction::CreateConstant(Literal::CreateR0<bool>(true))),
+      /*channel_id=*/0));
+  true_computation->AddInstruction(HloInstruction::CreateSendDone(send));
+  EXPECT_FALSE(ConditionalSimplifier().Run(&module()).ValueOrDie());
+}
+
+TEST_F(ConditionalSimplifierTest, NotRemovedIfContainsRecv) {
+  HloComputation* computation = MakeConditional(&module());
+  auto* conditional = computation->root_instruction();
+  ASSERT_EQ(conditional->opcode(), HloOpcode::kConditional);
+
+  auto* true_computation = conditional->true_computation();
+  auto* recv = true_computation->AddInstruction(HloInstruction::CreateRecv(
+      ShapeUtil::MakeShape(F32, {1}), /*channel_id=*/0));
+  true_computation->AddInstruction(HloInstruction::CreateRecvDone(recv));
+  EXPECT_FALSE(ConditionalSimplifier().Run(&module()).ValueOrDie());
+}
+
+TEST_F(ConditionalSimplifierTest, NotRemovedIfContainsNonRemovableInstruction) {
+  HloComputation* computation = MakeConditional(&module());
+  auto* conditional = computation->root_instruction();
+  ASSERT_EQ(conditional->opcode(), HloOpcode::kConditional);
+  auto* false_computation = conditional->false_computation();
+  false_computation->AddInstruction(
+      HloInstruction::CreateInfeed(ShapeUtil::MakeShape(F32, {1}), "config"));
+  EXPECT_FALSE(ConditionalSimplifier().Run(&module()).ValueOrDie());
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/copy_insertion.cc b/tensorflow/compiler/xla/service/copy_insertion.cc
index cc195879a6bb490a9b49ad962aa9326cb51d9b0a..40519ecc799c8f0343294ad88009820dbd8535e9 100644
--- a/tensorflow/compiler/xla/service/copy_insertion.cc
+++ b/tensorflow/compiler/xla/service/copy_insertion.cc
@@ -58,6 +58,46 @@ bool ValueIsReadOnly(const HloValue& value) {
   return IsConstantValue(value) || IsEntryParameterValue(value);
 }
 
+// Data structure describing the action which should be taken on parts of a
+// computation's buffers, with respect to adding special case copies.
+struct SpecialCaseCopyPolicy {
+  // If true, insert a copy if the same buffer is found at multiple indices
+  // within the output tuple.
+  bool copy_root_replicated_buffers = false;
+  // If true, insert a copy if a buffer coming from a constant or a parameter
+  // is found within the output tuple.
+  bool copy_parameters_and_constants = false;
+};
+
+SpecialCaseCopyPolicy GetSpecialCaseCopyPolicy(const CallGraphNode& node,
+                                               HloModule* module,
+                                               HloComputation* computation) {
+  SpecialCaseCopyPolicy policy;
+  if (computation == module->entry_computation()) {
+    policy.copy_parameters_and_constants = true;
+    policy.copy_root_replicated_buffers = true;
+  }
+  for (const CallSite& site : node.caller_callsites()) {
+    // The AddCopiesForConditional() already adds copies, but the copy remover
+    // removes them, so we re-add them by returning the policy here. But really
+    // the copy remover should not be removing them.
+    if (site.instruction()->opcode() == HloOpcode::kConditional) {
+      policy.copy_parameters_and_constants = true;
+      policy.copy_root_replicated_buffers = true;
+    }
+  }
+  return policy;
+}
+
+bool ShouldCopyRootValue(const HloValue& value,
+                         const SpecialCaseCopyPolicy& policy) {
+  if (policy.copy_parameters_and_constants) {
+    return IsConstantValue(value) ||
+           value.defining_instruction()->opcode() == HloOpcode::kParameter;
+  }
+  return false;
+}
+
 // Deep copy the given instructions 'from' and 'to' at the ShapeIndexes given in
 // 'indices_to_copy'. Add control edges from the respective kCopy instructions
 // in deep copy of 'from' to the respective kCopy instruction in the deep copy
@@ -282,6 +322,29 @@ Status AddCopiesForWhile(const HloAliasAnalysis& alias_analysis,
   return Status::OK();
 }
 
+// We add copies for all the indices of the true and false computation roots,
+// in order to resolve interference. We later rely on the CopyRemover to drop
+// the unnecessary ones.
+Status AddCopiesForConditional(const HloAliasAnalysis& alias_analysis,
+                               HloInstruction* conditional) {
+  VLOG(2) << "Adding copies for kConditional instruction "
+          << conditional->name();
+  TF_RET_CHECK(conditional->opcode() == HloOpcode::kConditional);
+
+  for (HloComputation* computation :
+       {conditional->true_computation(), conditional->false_computation()}) {
+    HloInstruction* root = computation->root_instruction();
+    std::vector<HloInstruction*> users = root->users();
+    TF_ASSIGN_OR_RETURN(HloInstruction * deep_copy,
+                        computation->DeepCopyInstruction(root));
+    for (HloInstruction* user : users) {
+      TF_RETURN_IF_ERROR(root->ReplaceUseWith(user, deep_copy));
+    }
+    computation->set_root_instruction(deep_copy);
+  }
+  return Status::OK();
+}
+
 // Removes any control dependencies to or from the given instruction.
 Status StripControlDependenciesFrom(HloInstruction* instruction) {
   while (!instruction->control_successors().empty()) {
@@ -309,6 +372,9 @@ Status AddCopiesToResolveInterference(HloModule* module) {
     for (HloInstruction* instruction : computation->instructions()) {
       if (instruction->opcode() == HloOpcode::kWhile) {
         TF_RETURN_IF_ERROR(AddCopiesForWhile(*alias_analysis, instruction));
+      } else if (instruction->opcode() == HloOpcode::kConditional) {
+        TF_RETURN_IF_ERROR(
+            AddCopiesForConditional(*alias_analysis, instruction));
       }
     }
   }
@@ -557,6 +623,7 @@ class CopyRemover {
 
       auto is_live_range_before = [this](const ValueNode& a,
                                          const ValueNode& b) {
+        VLOG(3) << "Checking live range of " << *a.value << " WRT " << *b.value;
         if (LiveRangeBefore(a, b)) {
           VLOG(2) << "  Live range of " << a.value->ToShortString()
                   << " is before " << b.value->ToShortString();
@@ -571,7 +638,7 @@ class CopyRemover {
       VLOG(3) << copy->name() << " copies value "
               << src->value->ToShortString();
       VLOG(3) << "Source buffer values: " << ValueListToString(src);
-      VLOG(3) << "Dest buffer values: " << ValueListToString(src);
+      VLOG(3) << "Dest buffer values: " << ValueListToString(dest);
 
       // A kCopy instruction copies an HLO value from a source buffer and
       // defines an HLO value in a destination buffer. Most generally, the
@@ -747,16 +814,16 @@ class CopyRemover {
     // updated as copies are removed.
     bool LiveRangeBefore(const ValueNode& a, const ValueNode& b) {
       if (a.uses.empty()) {
-        VLOG(2) << "Empty uses";
+        VLOG(2) << "Empty uses for " << *a.value;
         return ordering_.IsDefinedBefore(*a.value, *b.value);
       }
       for (const HloUse* use : a.uses) {
-        VLOG(2) << "use: " << *use;
-        VLOG(2) << "is before:" << *b.value;
+        VLOG(2) << "Checking use " << *use << " against " << *b.value;
         if (!ordering_.UseIsBeforeValueDefinition(*use, *b.value, dataflow_)) {
-          VLOG(2) << "Not before";
+          VLOG(2) << "Use " << *use << " is NOT before " << *b.value;
           return false;
         }
+        VLOG(2) << "Use " << *use << " is before " << *b.value;
       }
       return true;
     }
@@ -892,7 +959,6 @@ Status RemoveUnnecessaryCopies(
   CopyRemover copy_remover(*alias_analysis, ordering, module);
   XLA_VLOG_LINES(3, copy_remover.ToString());
 
-  tensorflow::gtl::FlatSet<int> existing_copies;
   for (HloComputation* computation : module->computations()) {
     for (HloInstruction* instruction : computation->instructions()) {
       if (instruction->opcode() == HloOpcode::kCopy &&
@@ -901,7 +967,6 @@ Status RemoveUnnecessaryCopies(
       }
     }
   }
-
   return Status::OK();
 }
 
@@ -921,7 +986,7 @@ Status AddSpecialCaseCopies(const CallGraph& call_graph, HloModule* module) {
 
   // Identify which shape indices of which instructions need to be copied. Store
   // these results in 'instructions_to_copy'.
-  std::unordered_map<HloInstruction*, ShapeTree<bool>> instructions_to_copy;
+  HloInstructionMap<ShapeTree<bool>> instructions_to_copy;
   auto add_index_to_copy = [&instructions_to_copy](HloInstruction* instruction,
                                                    const ShapeIndex& index) {
     auto it = instructions_to_copy.find(instruction);
@@ -957,7 +1022,8 @@ Status AddSpecialCaseCopies(const CallGraph& call_graph, HloModule* module) {
     }
     TF_RET_CHECK(node.context() == CallContext::kSequential);
 
-    const bool is_entry = computation == module->entry_computation();
+    SpecialCaseCopyPolicy policy =
+        GetSpecialCaseCopyPolicy(node, module, computation);
     HloInstruction* root = computation->root_instruction();
 
     // Mark nondistinct/ambiguous indices.
@@ -970,27 +1036,26 @@ Status AddSpecialCaseCopies(const CallGraph& call_graph, HloModule* module) {
           for (const HloBuffer* buffer : buffers_at_index) {
             buffer_seen_before |= !seen.insert(buffer).second;
           }
-          if (buffers_at_index.size() > 1 || (buffer_seen_before && is_entry)) {
-            VLOG(2) << "Index " << index << " of root of computation "
+          if (buffers_at_index.size() > 1 ||
+              (buffer_seen_before && policy.copy_root_replicated_buffers)) {
+            VLOG(2) << "Index " << index << " of computation "
                     << computation->name() << " (" << root->name()
                     << ") has ambiguous or non-distinct buffer. Copying.";
             add_index_to_copy(root, index);
           }
         });
 
-    // For entry instructions, mark any parameter or constant values.
-    if (is_entry) {
-      for (const auto& pair :
-           alias_analysis->dataflow_analysis().GetInstructionValueSet(root)) {
-        const ShapeIndex& index = pair.first;
-        const HloValueSet& value_set = pair.second;
-        for (const HloValue* value : value_set.values()) {
-          if (ValueIsReadOnly(*value)) {
-            VLOG(2) << "Root of entry computation (" << root->name()
-                    << ") has constant or entry parameter value at index "
-                    << index << ". Copying.";
-            add_index_to_copy(root, index);
-          }
+    for (const auto& pair :
+         alias_analysis->dataflow_analysis().GetInstructionValueSet(root)) {
+      const ShapeIndex& index = pair.first;
+      const HloValueSet& value_set = pair.second;
+      for (const HloValue* value : value_set.values()) {
+        if (ShouldCopyRootValue(*value, policy)) {
+          VLOG(2) << "Root (" << root->name() << ") of computation ("
+                  << computation->name()
+                  << ") has constant or parameter value at index " << index
+                  << ". Copying.";
+          add_index_to_copy(root, index);
         }
       }
     }
@@ -1012,7 +1077,6 @@ Status AddSpecialCaseCopies(const CallGraph& call_graph, HloModule* module) {
       instruction->parent()->set_root_instruction(deep_copy);
     }
   }
-
   return Status::OK();
 }
 
diff --git a/tensorflow/compiler/xla/service/cpu/BUILD b/tensorflow/compiler/xla/service/cpu/BUILD
index 32be0b0c968f2d24f460fc8377c458f2da282112..093db020c03aa3685005b4a42b26dd032fcf0001 100644
--- a/tensorflow/compiler/xla/service/cpu/BUILD
+++ b/tensorflow/compiler/xla/service/cpu/BUILD
@@ -105,9 +105,11 @@ cc_library(
         "//tensorflow/compiler/xla/service:buffer_assignment",
         "//tensorflow/compiler/xla/service:buffer_liveness",
         "//tensorflow/compiler/xla/service:call_inliner",
+        "//tensorflow/compiler/xla/service:conditional_simplifier",
         "//tensorflow/compiler/xla/service:dot_decomposer",
         "//tensorflow/compiler/xla/service:executable",
         "//tensorflow/compiler/xla/service:flatten_call_graph",
+        "//tensorflow/compiler/xla/service:gather_expander",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_constant_folding",
         "//tensorflow/compiler/xla/service:hlo_cse",
@@ -514,7 +516,6 @@ cc_library(
 
 cc_library(
     name = "runtime_matvec",
-    srcs = ["runtime_matvec.cc"],
     hdrs = ["runtime_matvec.h"],
     copts = runtime_copts(),
     deps = [
@@ -771,6 +772,31 @@ cc_library(
     ],
 )
 
+tf_cc_test(
+    name = "parallel_task_assignment_test",
+    srcs = ["parallel_task_assignment_test.cc"],
+    deps = [
+        ":cpu_executable",
+        ":parallel_task_assignment",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:shape_layout",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla:test_helpers",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:algebraic_simplifier",
+        "//tensorflow/compiler/xla/service:computation_layout",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
+        "//tensorflow/compiler/xla/tests:test_utils",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+    ],
+)
+
 cc_library(
     name = "cpu_options",
     srcs = ["cpu_options.cc"],
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc b/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
index 387806e24aad0d5f28cb104507ef6cc136ffd779..e43777c5e5e8afcf08e1e334c8847f6b94d0d047 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
@@ -47,6 +47,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/buffer_assignment.h"
 #include "tensorflow/compiler/xla/service/buffer_liveness.h"
 #include "tensorflow/compiler/xla/service/call_inliner.h"
+#include "tensorflow/compiler/xla/service/conditional_simplifier.h"
 #include "tensorflow/compiler/xla/service/cpu/compiler_functor.h"
 #include "tensorflow/compiler/xla/service/cpu/conv_canonicalization.h"
 #include "tensorflow/compiler/xla/service/cpu/cpu_copy_insertion.h"
@@ -66,6 +67,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/dot_decomposer.h"
 #include "tensorflow/compiler/xla/service/flatten_call_graph.h"
+#include "tensorflow/compiler/xla/service/gather_expander.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_constant_folding.h"
@@ -260,6 +262,7 @@ Status CpuCompiler::RunHloPasses(HloModule* module, bool is_aot_compile) {
         /*rewrite_inference_op=*/true,
         /*rewrite_grad_op=*/true,
         /*use_fusion=*/false);
+    pipeline.AddPass<GatherExpander>();
     pass.AddPass<AlgebraicSimplifier>(
         /*is_layout_sensitive=*/false,
         [](const Shape&, const Shape&) { return false; },
@@ -275,6 +278,7 @@ Status CpuCompiler::RunHloPasses(HloModule* module, bool is_aot_compile) {
     pass.AddPass<HloDCE>();
     pass.AddPass<ReshapeMover>();
     pass.AddPass<HloConstantFolding>();
+    pass.AddPass<ConditionalSimplifier>();
   }
   pipeline.AddPass<TransposeFolding>(
       [](const HloInstruction& dot,
@@ -314,7 +318,7 @@ Status CpuCompiler::RunHloPasses(HloModule* module, bool is_aot_compile) {
     // Note this is not run for AOT because it would bring in thread pool
     // and thread synchronization dependencies which would likely increase
     // binary size (and most AOT applications are single-threaded).
-    // TODO(29630486) Support multi-threaded AOT.
+    // TODO(b/29630486) Support multi-threaded AOT.
     pipeline.AddPass<ParallelTaskAssigner>(max_parallelism,
                                            ShapeSizeBytesFunction());
   }
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion.cc b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion.cc
index 482e04052d5a914eab0e5bff2c7a83f3b698052f..0fc5a746bbbc7685ff5d4647111a750e7d7b1c19 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion.cc
@@ -30,7 +30,6 @@ bool CanBeLoopFused(const HloInstruction& hlo) {
   // These are the only ones we fuse since we rely on effective elemental IR
   // generation.
   return hlo.IsElementwise() ||  //
-         hlo.opcode() == HloOpcode::kBitcast ||
          hlo.opcode() == HloOpcode::kBroadcast ||
          hlo.opcode() == HloOpcode::kConcatenate ||
          hlo.opcode() == HloOpcode::kDynamicSlice ||
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
index 595c3f55b321f47e2312b93e0c238c7637495d77..6ed1cd31b18f6360bdd7fd41bd5be2e657b310a5 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
@@ -77,7 +77,7 @@ TEST_F(InstructionFusionTest, DotOperationFusion_Basic_1) {
   EXPECT_THAT(computation->root_instruction(), op::Fusion());
 }
 
-TEST_F(InstructionFusionTest, DotOperationFusion_Bitcast) {
+TEST_F(InstructionFusionTest, DotOperationNoFusion_Bitcast) {
   HloComputation::Builder builder(TestName());
   HloInstruction* arg0 = builder.AddInstruction(HloInstruction::CreateParameter(
       0, ShapeUtil::MakeShape(F32, {2, 512, 2, 128}), "arg0"));
@@ -94,8 +94,7 @@ TEST_F(InstructionFusionTest, DotOperationFusion_Bitcast) {
   auto module = CreateNewModule();
   auto computation = module->AddEntryComputation(builder.Build());
   EXPECT_EQ(dot, computation->root_instruction());
-  EXPECT_TRUE(CpuInstructionFusion().Run(module.get()).ValueOrDie());
-  EXPECT_THAT(computation->root_instruction(), op::Fusion());
+  EXPECT_FALSE(CpuInstructionFusion().Run(module.get()).ValueOrDie());
 }
 
 TEST_F(InstructionFusionTest, DotOperationFusion_Reshape) {
@@ -244,35 +243,33 @@ class OpcodeFusionTest : public InstructionFusionTest {
   }
 };
 
-TEST_F(OpcodeFusionTest, Exponential_Bitcast_Negate) {
+TEST_F(OpcodeFusionTest, Exponential_Reshape_Negate) {
   HloComputation::Builder builder(TestName());
   Shape param_shape = ShapeUtil::MakeShape(F32, {1, 4});
   Shape result_shape = ShapeUtil::MakeShape(F32, {4});
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, param_shape, "param"));
-  // InstructionFusion::ShouldFuse() precludes fusing a bitcast whose operand
-  // is a parameter, so create an operand between the parameter and bitcast.
   HloInstruction* exp1 = builder.AddInstruction(
       HloInstruction::CreateUnary(param_shape, HloOpcode::kExp, param0));
-  HloInstruction* bitcast2 = builder.AddInstruction(
-      HloInstruction::CreateUnary(result_shape, HloOpcode::kBitcast, exp1));
+  HloInstruction* reshape2 =
+      builder.AddInstruction(HloInstruction::CreateReshape(result_shape, exp1));
   builder.AddInstruction(
-      HloInstruction::CreateUnary(result_shape, HloOpcode::kNegate, bitcast2));
+      HloInstruction::CreateUnary(result_shape, HloOpcode::kNegate, reshape2));
 
   auto module = CreateNewModule();
   module->AddEntryComputation(builder.Build());
 
   RunFusionAndCheckOpcodesWereFused(
-      module.get(), {HloOpcode::kNegate, HloOpcode::kBitcast, HloOpcode::kExp,
+      module.get(), {HloOpcode::kNegate, HloOpcode::kReshape, HloOpcode::kExp,
                      HloOpcode::kParameter});
 }
 
-TEST_F(OpcodeFusionTest, Broadcast_Bitcast_DynamicSlice_Tanh) {
+TEST_F(OpcodeFusionTest, Broadcast_Reshape_DynamicSlice_Tanh) {
   HloComputation::Builder builder(TestName());
   Shape param_shape = ShapeUtil::MakeShape(F32, {8});
   Shape starts_shape = ShapeUtil::MakeShape(F32, {2});
   Shape broadcast_shape = ShapeUtil::MakeShape(F32, {1, 8, 8});
-  Shape bitcast_shape = ShapeUtil::MakeShape(F32, {8, 8});
+  Shape reshape_shape = ShapeUtil::MakeShape(F32, {8, 8});
   Shape dynamic_slice_shape = ShapeUtil::MakeShape(F32, {4, 4});
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, param_shape, "param"));
@@ -280,11 +277,11 @@ TEST_F(OpcodeFusionTest, Broadcast_Bitcast_DynamicSlice_Tanh) {
       HloInstruction::CreateParameter(1, starts_shape, "starts"));
   HloInstruction* broadcast2 = builder.AddInstruction(
       HloInstruction::CreateBroadcast(broadcast_shape, param0, {1}));
-  HloInstruction* bitcast3 = builder.AddInstruction(HloInstruction::CreateUnary(
-      bitcast_shape, HloOpcode::kBitcast, broadcast2));
+  HloInstruction* reshape3 = builder.AddInstruction(
+      HloInstruction::CreateReshape(reshape_shape, broadcast2));
   HloInstruction* dynamic_slice4 =
       builder.AddInstruction(HloInstruction::CreateDynamicSlice(
-          dynamic_slice_shape, bitcast3, param1, {4, 4}));
+          dynamic_slice_shape, reshape3, param1, {4, 4}));
   builder.AddInstruction(HloInstruction::CreateUnary(
       dynamic_slice_shape, HloOpcode::kTanh, dynamic_slice4));
 
@@ -293,7 +290,7 @@ TEST_F(OpcodeFusionTest, Broadcast_Bitcast_DynamicSlice_Tanh) {
 
   RunFusionAndCheckOpcodesWereFused(
       module.get(),
-      {HloOpcode::kTanh, HloOpcode::kDynamicSlice, HloOpcode::kBitcast,
+      {HloOpcode::kTanh, HloOpcode::kDynamicSlice, HloOpcode::kReshape,
        HloOpcode::kBroadcast, HloOpcode::kParameter, HloOpcode::kParameter});
 }
 
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc b/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc
index 40ace963270e8cead47cc731cc326351178dff7d..9a3bd68c80c6e8bcdb231c63ba025d1f73619eb7 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc
@@ -31,6 +31,8 @@ XfeedManager* GetXfeedManager() {
   return manager;
 }
 
+extern const char* const kEigenMatMulF16SymbolName =
+    "__xla_cpu_runtime_EigenMatMulF16";
 extern const char* const kEigenMatMulF32SymbolName =
     "__xla_cpu_runtime_EigenMatMulF32";
 extern const char* const kEigenMatMulF64SymbolName =
@@ -40,6 +42,8 @@ extern const char* const kEigenConvF16SymbolName =
 extern const char* const kEigenConvF32SymbolName =
     "__xla_cpu_runtime_EigenConvF32";
 extern const char* const kEigenFftSymbolName = "__xla_cpu_runtime_EigenFft";
+extern const char* const kEigenSingleThreadedMatMulF16SymbolName =
+    "__xla_cpu_runtime_EigenSingleThreadedMatMulF16";
 extern const char* const kEigenSingleThreadedMatMulF32SymbolName =
     "__xla_cpu_runtime_EigenSingleThreadedMatMulF32";
 extern const char* const kEigenSingleThreadedMatMulF64SymbolName =
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_runtime.h b/tensorflow/compiler/xla/service/cpu/cpu_runtime.h
index 2141dfe1cedd6f9674acc348152574b4fd30895b..e61d6ea28b633398863357541e056ee887582f9c 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_runtime.h
+++ b/tensorflow/compiler/xla/service/cpu/cpu_runtime.h
@@ -41,11 +41,13 @@ namespace runtime {
 //    the actual symbol.
 // 2. When using ahead-of-time compilation, the linker can resolve the name
 //    because it is a symbol in the cpu_runtime library.
+extern const char* const kEigenMatMulF16SymbolName;
 extern const char* const kEigenMatMulF32SymbolName;
 extern const char* const kEigenMatMulF64SymbolName;
 extern const char* const kEigenConvF16SymbolName;
 extern const char* const kEigenConvF32SymbolName;
 extern const char* const kEigenFftSymbolName;
+extern const char* const kEigenSingleThreadedMatMulF16SymbolName;
 extern const char* const kEigenSingleThreadedMatMulF32SymbolName;
 extern const char* const kEigenSingleThreadedMatMulF64SymbolName;
 extern const char* const kEigenSingleThreadedConvF16SymbolName;
diff --git a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
index cfe7c9c3af0be109ac8a86753e880e2bcbceba41..8b1e20d79e90fcc32e985ffb855a1a10cdd2f2b9 100644
--- a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
+++ b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
@@ -715,6 +715,11 @@ tensorflow::Status DotOpEmitter::Emit() {
   // which performs the sum-of-products (the reduction loop) before storing
   // the result in the output buffer.
 
+  // This routine assumes that the dot operation is not in a parallelized
+  // enclosing computation.
+  CHECK(
+      dot_.parent()->root_instruction()->outer_dimension_partitions().empty());
+
   const Shape& lhs_shape = lhs_array_.GetShape();
   const Shape& rhs_shape = rhs_array_.GetShape();
 
@@ -919,6 +924,12 @@ tensorflow::Status DotOpEmitter::EmitCallToRuntime() {
   llvm::Type* float_type;
   const char* fn_name;
   switch (type) {
+    case F16:
+      fn_name = multi_threaded_eigen
+                    ? runtime::kEigenMatMulF16SymbolName
+                    : runtime::kEigenSingleThreadedMatMulF16SymbolName;
+      float_type = ir_builder_->getHalfTy();
+      break;
     case F32:
       fn_name = multi_threaded_eigen
                     ? runtime::kEigenMatMulF32SymbolName
@@ -1051,7 +1062,8 @@ static bool AreValidGemmShapes(const Shape& lhs_shape, const Shape& rhs_shape,
   // The inputs and the output must
   // 1) be matrices with no padding, and
   // 2) have an allowed element type.
-  return output_shape.element_type() == F32 &&
+  PrimitiveType output_primitive_type = output_shape.element_type();
+  return (output_primitive_type == F32 || output_primitive_type == F16) &&
          IsRank2WithNoPadding(lhs_shape) && IsRank2WithNoPadding(rhs_shape) &&
          IsRank2WithNoPadding(output_shape);
 }
diff --git a/tensorflow/compiler/xla/service/cpu/ir_emitter.cc b/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
index 4dffaee87f6b33933b58c8c58478eec918569197..3b8056d50500cac381a1c5ad6b05028476504a47 100644
--- a/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
@@ -2074,7 +2074,7 @@ Status IrEmitter::HandleFusion(HloInstruction* fusion) {
 
     TF_RETURN_IF_ERROR(ElementTypesSameAndSupported(
         /*instruction=*/*root, /*operands=*/{lhs, rhs},
-        /*supported_types=*/{F32}));
+        /*supported_types=*/{F16, F32}));
 
     llvm_ir::IrArray lhs_array(GetIrArrayFor(lhs));
     llvm_ir::IrArray rhs_array(GetIrArrayFor(rhs));
diff --git a/tensorflow/compiler/xla/service/cpu/parallel_task_assignment.cc b/tensorflow/compiler/xla/service/cpu/parallel_task_assignment.cc
index deb21bf4ef5895cfdbec5c2449b6ce7b306a7008..86e8be8461047a82184a18a0a489ed1c0d882fc9 100644
--- a/tensorflow/compiler/xla/service/cpu/parallel_task_assignment.cc
+++ b/tensorflow/compiler/xla/service/cpu/parallel_task_assignment.cc
@@ -71,7 +71,7 @@ class DefaultCostModel : public ParallelCostModel {
     if (flops_to_bytes_ratio <= 1.0) {
       // Limit max parallelism for I/O bound instructions by assuming a
       // sub-linear scaling function (fit based on empirical benchmark results).
-      // TODO(29630486) Develop system bandwidth model.
+      // TODO(b/29630486) Develop system bandwidth model.
       max_parallelism =
           std::ceil(std::sqrt(tensorflow::port::NumSchedulableCPUs()));
       // Use shape size instruction cost and L2 cache size min per-thread cost.
@@ -81,7 +81,7 @@ class DefaultCostModel : public ParallelCostModel {
       // Use max parallelism for compute bound instructions.
       max_parallelism = max_parallelism_;
       // Calculate the instruction cost in cycles.
-      // TODO(29630486) Improve on this linear cost model.
+      // TODO(b/29630486) Improve on this linear cost model.
       // Consider making 'min_cost_per_thread' be a function of the target
       // bandwidth limit for instructions with low arithmetic complexity.
       instruction_cost =
@@ -130,22 +130,21 @@ int64 ParallelTaskAssignment::GetTargetParallelTaskCount(
   // *) Emit custom loops (kSelectAndScatter, FusionKind::kTransposeDot).
   // *) Tuple-shaped.
   // TODO(b/27458679) Parallelize instructions which are skipped here.
-  if (instruction->opcode() == HloOpcode::kParameter ||
-      instruction->opcode() == HloOpcode::kConstant ||
-      instruction->opcode() == HloOpcode::kCall ||
-      instruction->opcode() == HloOpcode::kCustomCall ||
-      instruction->opcode() == HloOpcode::kSelectAndScatter ||
-      instruction->opcode() == HloOpcode::kGetTupleElement ||
-      instruction->opcode() == HloOpcode::kBitcast ||
-      instruction->opcode() == HloOpcode::kFft ||
-      (instruction->opcode() == HloOpcode::kConvolution &&
+  auto opcode = instruction->opcode();
+  if (opcode == HloOpcode::kParameter || opcode == HloOpcode::kConstant ||
+      opcode == HloOpcode::kCall || opcode == HloOpcode::kCustomCall ||
+      opcode == HloOpcode::kDot || opcode == HloOpcode::kSelectAndScatter ||
+      opcode == HloOpcode::kGetTupleElement || opcode == HloOpcode::kBitcast ||
+      opcode == HloOpcode::kFft ||
+      (opcode == HloOpcode::kConvolution &&
        PotentiallyImplementedAsEigenConvolution(*instruction)) ||
       PotentiallyImplementedAsEigenDot(*instruction) ||
-      (instruction->opcode() == HloOpcode::kFusion &&
+      (opcode == HloOpcode::kFusion &&
        instruction->fusion_kind() != HloInstruction::FusionKind::kLoop) ||
       ShapeUtil::IsTuple(instruction->shape())) {
     return 1;
   }
+
   // Consult 'cost_model_' to compute target parallel task count.
   return cost_model_->GetParallelTaskCount(instruction);
 }
diff --git a/tensorflow/compiler/xla/service/cpu/parallel_task_assignment_test.cc b/tensorflow/compiler/xla/service/cpu/parallel_task_assignment_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..90191221eb70187290131dbe6037d17244043e4a
--- /dev/null
+++ b/tensorflow/compiler/xla/service/cpu/parallel_task_assignment_test.cc
@@ -0,0 +1,84 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/cpu/parallel_task_assignment.h"
+#include "tensorflow/compiler/xla/service/cpu/cpu_executable.h"
+#include "tensorflow/compiler/xla/test.h"
+#include "tensorflow/compiler/xla/tests/hlo_verified_test_base.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+
+namespace xla {
+namespace {
+
+class ParallelTaskAssignmentTest : public HloVerifiedTestBase {
+ protected:
+  const HloCostAnalysis::ShapeSizeFunction shape_size_func_ =
+      cpu::CpuExecutable::ShapeSizeBytes;
+
+  // Use any value larger than 2 since we only test whether a module is
+  // parallelized or not.
+  const int max_parallelism_ = 10;
+};
+
+TEST_F(ParallelTaskAssignmentTest, DotOperationNotParallelized) {
+  const string hlo_string = R"(
+    HloModule TestTaskParallel_Dot
+    ENTRY Dot {
+      dot_lhs = f32[196614,2]{1,0} parameter(0)
+      dot_rhs = f32[2,1]{1,0} parameter(1)
+      ROOT dot = f32[196614,1]{1,0} dot(dot_lhs, dot_rhs),
+        lhs_contracting_dims={1}, rhs_contracting_dims={0}
+    }
+  )";
+
+  ParseAndVerifyModule(hlo_string);
+  TF_ASSERT_OK_AND_ASSIGN(bool changed, cpu::ParallelTaskAssigner(
+                                            max_parallelism_, shape_size_func_)
+                                            .Run(&module()));
+  EXPECT_FALSE(changed);
+}
+
+TEST_F(ParallelTaskAssignmentTest,
+       FusedComputationWithDotOperationNotParallelized) {
+  const string hlo_string = R"(
+    HloModule TestTaskParallel_DotNestedInFusedComp
+    fused_computation.0 {
+      parameter.0 = f32[196614,2]{1,0} parameter(0)
+      parameter.0.1 = f32[2,1]{1,0} parameter(1)
+      parameter.0.2 = f32[196614,1]{1,0} parameter(2)
+      dot.0 = f32[196614,1]{1,0} dot(parameter.0, parameter.0.1),
+        lhs_contracting_dims={1}, rhs_contracting_dims={0}
+      ROOT add.0 = f32[196614,1]{1,0} add(dot.0, parameter.0.2)
+
+    }
+    ENTRY DotNestedInFusedComp {
+      parameter = f32[196614,2]{1,0} parameter(0)
+      parameter.1 = f32[2,1]{1,0} parameter(1)
+      parameter.2 = f32[196614,1]{1,0} parameter(2)
+      ROOT fusion = f32[196614,1]{1,0} fusion(parameter, parameter.1,
+        parameter.2), kind=kOutput, calls=fused_computation.0
+    }
+  )";
+
+  ParseAndVerifyModule(hlo_string);
+  TF_ASSERT_OK_AND_ASSIGN(bool changed, cpu::ParallelTaskAssigner(
+                                            max_parallelism_, shape_size_func_)
+                                            .Run(&module()));
+  EXPECT_FALSE(changed);
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_conv2d.h b/tensorflow/compiler/xla/service/cpu/runtime_conv2d.h
index 39e20ed45639040110b99ddb52eb6f6dab26dfaa..7337c907f5c83d608641b7382e75902e6f6c05d4 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_conv2d.h
+++ b/tensorflow/compiler/xla/service/cpu/runtime_conv2d.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_CONV2D_H_
 #define TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_CONV2D_H_
 
+#include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/core/platform/types.h"
 
 extern "C" {
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc b/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc
index bff57d33ae23fbba8c664cbd18df77e4c35eb592..39b13183ff093611a42b3931d45f64eadb420622 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc
+++ b/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc
@@ -63,30 +63,41 @@ void MatMul(const void* run_options_ptr, T* out, T* lhs, T* rhs, int64 m,
   C.device(*run_options->intra_op_thread_pool()) = A.contract(B, dims);
 }
 
+template <typename T>
+void MatMulImpl(const void* run_options_ptr, T* out, T* lhs, T* rhs, int64 m,
+                int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
+  if (m == 1 || n == 1) {
+    // Despite being single threaded, this version of matrix * vector is faster.
+    xla::EigenMatVec<T>(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
+  } else {
+    MatMul<T>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
+              transpose_rhs);
+  }
+}
+
 }  // namespace
 
+void __xla_cpu_runtime_EigenMatMulF16(const void* run_options_ptr,
+                                      Eigen::half* out, Eigen::half* lhs,
+                                      Eigen::half* rhs, int64 m, int64 n,
+                                      int64 k, int32 transpose_lhs,
+                                      int32 transpose_rhs) {
+  MatMulImpl<Eigen::half>(run_options_ptr, out, lhs, rhs, m, n, k,
+                          transpose_lhs, transpose_rhs);
+}
+
 void __xla_cpu_runtime_EigenMatMulF32(const void* run_options_ptr, float* out,
                                       float* lhs, float* rhs, int64 m, int64 n,
                                       int64 k, int32 transpose_lhs,
                                       int32 transpose_rhs) {
-  if (m == 1 || n == 1) {
-    // Despite being single threaded, this version of matrix * vector is faster.
-    xla::EigenMatVecF32(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-  } else {
-    MatMul<float>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
-                  transpose_rhs);
-  }
+  MatMulImpl<float>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
+                    transpose_rhs);
 }
 
 void __xla_cpu_runtime_EigenMatMulF64(const void* run_options_ptr, double* out,
                                       double* lhs, double* rhs, int64 m,
                                       int64 n, int64 k, int32 transpose_lhs,
                                       int32 transpose_rhs) {
-  if (m == 1 || n == 1) {
-    // Despite being single threaded, this version of matrix * vector is faster.
-    xla::EigenMatVecF64(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-  } else {
-    MatMul<double>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
-                   transpose_rhs);
-  }
+  MatMulImpl<double>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
+                     transpose_rhs);
 }
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matmul.h b/tensorflow/compiler/xla/service/cpu/runtime_matmul.h
index fdb644651dd5d0fa0345580f52ed0fb051672285..d96fe3d58bd5ffbad347e3ede3534d1d47be697a 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_matmul.h
+++ b/tensorflow/compiler/xla/service/cpu/runtime_matmul.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_MATMUL_H_
 #define TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_MATMUL_H_
 
+#include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/core/platform/types.h"
 
 extern "C" {
@@ -25,6 +26,12 @@ extern "C" {
 // order. 'out' is a pointer to a buffer sufficiently large to hold the result
 // of the operation. Following standard nomenclature: lhs is m x k,
 // rhs is k x n, and out is m x n.
+extern void __xla_cpu_runtime_EigenMatMulF16(
+    const void* /* xla::ExecutableRunOptions* */ run_options_ptr,
+    Eigen::half* out, Eigen::half* lhs, Eigen::half* rhs, tensorflow::int64 m,
+    tensorflow::int64 n, tensorflow::int64 k, tensorflow::int32 transpose_lhs,
+    tensorflow::int32 transpose_rhs);
+
 extern void __xla_cpu_runtime_EigenMatMulF32(
     const void* /* xla::ExecutableRunOptions* */ run_options_ptr, float* out,
     float* lhs, float* rhs, tensorflow::int64 m, tensorflow::int64 n,
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matvec.cc b/tensorflow/compiler/xla/service/cpu/runtime_matvec.cc
deleted file mode 100644
index 435820cdd36e2a906d9dfbe2555f4c0df623c729..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/cpu/runtime_matvec.cc
+++ /dev/null
@@ -1,110 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include <algorithm>
-#include <cassert>
-
-#include "third_party/eigen3/Eigen/Core"
-#include "tensorflow/compiler/xla/service/cpu/runtime_matvec.h"
-
-using tensorflow::int32;
-using tensorflow::int64;
-
-namespace {
-
-// Does mat * x or mat^T * x.
-template <typename T>
-void MatVec(T* out_buf, T* mat_buf, T* x_buf, int64 rows, int64 cols,
-            int32 transpose) {
-  // Use an Eigen Matrix instead of a Tensor, as the GEMV from Matrix seems to
-  // be faster (b/30223679).  See also: the matmul op kernel in TensorFlow,
-  // which implements the same optimization.
-  using Matrix = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>;
-  using MatrixMap = Eigen::Map<Matrix>;
-
-  using Vector = Eigen::Matrix<T, Eigen::Dynamic, 1>;
-  using VectorMap = Eigen::Map<Vector>;
-
-  auto x = VectorMap(x_buf, cols);
-  auto out = VectorMap(out_buf, rows);
-
-  int64 mat_rows = rows;
-  int64 mat_cols = cols;
-
-  if (transpose) {
-    std::swap(mat_rows, mat_cols);
-  }
-
-  auto mat = MatrixMap(mat_buf, mat_rows, mat_cols);
-
-  if (transpose) {
-    out = mat.transpose() * x;
-  } else {
-    out = mat * x;
-  }
-}
-
-// Converts matmul-style args to matvec.
-template <typename T>
-void DispatchMatVec(T* out, T* lhs, T* rhs, int64 m, int64 n, int64 k,
-                    int32 transpose_lhs, int32 transpose_rhs) {
-  // If the input is in the form x * A, where x is the vector, then bring A back
-  // over to the left hand side.  We make use of the identity
-  //
-  //   (x * A)^T = A^T * x^T
-  //
-  // We do not need to take the transpose of x or of the result since taking
-  // the transpose of a vector does not change the memory layout.
-  const int64 cols = k;
-
-  T* mat;
-  T* vec;
-  int64 rows;
-  bool transpose_mat;
-
-  bool is_mat_vec = (n == 1);
-
-  if (is_mat_vec) {
-    mat = lhs;
-    vec = rhs;
-    rows = m;
-    transpose_mat = transpose_lhs;
-  } else {
-    mat = rhs;
-    vec = lhs;
-    rows = n;
-    transpose_mat = !transpose_rhs;
-  }
-
-  MatVec<T>(out, mat, vec, rows, cols, transpose_mat);
-}
-
-}  // namespace
-
-namespace xla {
-
-void EigenMatVecF32(float* out, float* lhs, float* rhs, int64 m, int64 n,
-                    int64 k, int32 transpose_lhs, int32 transpose_rhs) {
-  assert((m == 1 || n == 1) && "not a matrix-vector multiply");
-  DispatchMatVec<float>(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-}
-
-void EigenMatVecF64(double* out, double* lhs, double* rhs, int64 m, int64 n,
-                    int64 k, int32 transpose_lhs, int32 transpose_rhs) {
-  assert((m == 1 || n == 1) && "not a matrix-vector multiply");
-  DispatchMatVec<double>(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-}
-
-}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matvec.h b/tensorflow/compiler/xla/service/cpu/runtime_matvec.h
index 1bd8dfb377acc1f7cfbe9a92773f87f0ef25de3a..70eb98c54169824e220d9287753c0849362eade6 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_matvec.h
+++ b/tensorflow/compiler/xla/service/cpu/runtime_matvec.h
@@ -16,10 +16,86 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_MATVEC_H_
 #define TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_MATVEC_H_
 
+#include "third_party/eigen3/Eigen/Core"
+
 #include "tensorflow/core/platform/types.h"
 
 namespace xla {
 
+namespace detail {
+
+using tensorflow::int32;
+using tensorflow::int64;
+
+// Does mat * x or mat^T * x.
+template <typename T>
+void MatVec(T* out_buf, T* mat_buf, T* x_buf, int64 rows, int64 cols,
+            int32 transpose) {
+  // Use an Eigen Matrix instead of a Tensor, as the GEMV from Matrix seems to
+  // be faster (b/30223679).  See also: the matmul op kernel in TensorFlow,
+  // which implements the same optimization.
+  using Matrix = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>;
+  using MatrixMap = Eigen::Map<Matrix>;
+
+  using Vector = Eigen::Matrix<T, Eigen::Dynamic, 1>;
+  using VectorMap = Eigen::Map<Vector>;
+
+  auto x = VectorMap(x_buf, cols);
+  auto out = VectorMap(out_buf, rows);
+
+  int64 mat_rows = rows;
+  int64 mat_cols = cols;
+
+  if (transpose) {
+    std::swap(mat_rows, mat_cols);
+  }
+
+  auto mat = MatrixMap(mat_buf, mat_rows, mat_cols);
+
+  if (transpose) {
+    out = mat.transpose() * x;
+  } else {
+    out = mat * x;
+  }
+}
+
+// Converts matmul-style args to matvec.
+template <typename T>
+void DispatchMatVec(T* out, T* lhs, T* rhs, int64 m, int64 n, int64 k,
+                    int32 transpose_lhs, int32 transpose_rhs) {
+  // If the input is in the form x * A, where x is the vector, then bring A back
+  // over to the left hand side.  We make use of the identity
+  //
+  //   (x * A)^T = A^T * x^T
+  //
+  // We do not need to take the transpose of x or of the result since taking
+  // the transpose of a vector does not change the memory layout.
+  const int64 cols = k;
+
+  T* mat;
+  T* vec;
+  int64 rows;
+  bool transpose_mat;
+
+  bool is_mat_vec = (n == 1);
+
+  if (is_mat_vec) {
+    mat = lhs;
+    vec = rhs;
+    rows = m;
+    transpose_mat = transpose_lhs;
+  } else {
+    mat = rhs;
+    vec = lhs;
+    rows = n;
+    transpose_mat = !transpose_rhs;
+  }
+
+  MatVec<T>(out, mat, vec, rows, cols, transpose_mat);
+}
+
+}  // namespace detail
+
 // Performs a matrix-vector multiplication using Eigen. 'lhs' and 'rhs' are
 // pointers to buffers containing input matrices in column-major order. 'out' is
 // a pointer to a buffer sufficiently large to hold the result of the
@@ -30,15 +106,15 @@ namespace xla {
 //
 // TODO(b/64684907): Compare runtime performance of these functions with dot
 // simplification.
-void EigenMatVecF32(float* out, float* lhs, float* rhs, tensorflow::int64 m,
-                    tensorflow::int64 n, tensorflow::int64 k,
-                    tensorflow::int32 transpose_lhs,
-                    tensorflow::int32 transpose_rhs);
-
-void EigenMatVecF64(double* out, double* lhs, double* rhs, tensorflow::int64 m,
-                    tensorflow::int64 n, tensorflow::int64 k,
-                    tensorflow::int32 transpose_lhs,
-                    tensorflow::int32 transpose_rhs);
+template <typename T>
+void EigenMatVec(T* out, T* lhs, T* rhs, tensorflow::int64 m,
+                 tensorflow::int64 n, tensorflow::int64 k,
+                 tensorflow::int32 transpose_lhs,
+                 tensorflow::int32 transpose_rhs) {
+  assert((m == 1 || n == 1) && "not a matrix-vector multiply");
+  detail::DispatchMatVec<T>(out, lhs, rhs, m, n, k, transpose_lhs,
+                            transpose_rhs);
+}
 
 }  // namespace xla
 
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_conv2d.h b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_conv2d.h
index f216bd0152aa93b8753d881938c63a9cabea899b..44b201725b2c724f48c1a3f0373c41e76211e0c2 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_conv2d.h
+++ b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_conv2d.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_SINGLE_THREADED_CONV2D_H_
 #define TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_SINGLE_THREADED_CONV2D_H_
 
+#include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/core/platform/types.h"
 
 extern "C" {
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc
index ee8eb081556d60fcf6537b1036a4a5825c4c7bf6..17303e2f0d34e531a3a56aa147608b949e0f43ae 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc
+++ b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc
@@ -57,26 +57,38 @@ void MatMul(const void* run_options_ptr, T* out, T* lhs, T* rhs, int64 m,
   C = A.contract(B, dims);
 }
 
+template <typename T>
+void SingleThreadedMatMul(const void* run_options_ptr, T* out, T* lhs, T* rhs,
+                          int64 m, int64 n, int64 k, int32 transpose_lhs,
+                          int32 transpose_rhs) {
+  if (m == 1 || n == 1) {
+    xla::EigenMatVec<T>(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
+  } else {
+    MatMul<T>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
+              transpose_rhs);
+  }
+}
+
 }  // namespace
 
+void __xla_cpu_runtime_EigenSingleThreadedMatMulF16(
+    const void* run_options_ptr, Eigen::half* out, Eigen::half* lhs,
+    Eigen::half* rhs, int64 m, int64 n, int64 k, int32 transpose_lhs,
+    int32 transpose_rhs) {
+  SingleThreadedMatMul<Eigen::half>(run_options_ptr, out, lhs, rhs, m, n, k,
+                                    transpose_lhs, transpose_rhs);
+}
+
 void __xla_cpu_runtime_EigenSingleThreadedMatMulF32(
     const void* run_options_ptr, float* out, float* lhs, float* rhs, int64 m,
     int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
-  if (m == 1 || n == 1) {
-    xla::EigenMatVecF32(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-  } else {
-    MatMul<float>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
-                  transpose_rhs);
-  }
+  SingleThreadedMatMul<float>(run_options_ptr, out, lhs, rhs, m, n, k,
+                              transpose_lhs, transpose_rhs);
 }
 
 void __xla_cpu_runtime_EigenSingleThreadedMatMulF64(
     const void* run_options_ptr, double* out, double* lhs, double* rhs, int64 m,
     int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
-  if (m == 1 || n == 1) {
-    xla::EigenMatVecF64(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-  } else {
-    MatMul<double>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
-                   transpose_rhs);
-  }
+  SingleThreadedMatMul<double>(run_options_ptr, out, lhs, rhs, m, n, k,
+                               transpose_lhs, transpose_rhs);
 }
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.h b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.h
index 029eb9514287d8c69cde2cfb06e0d56e78d6f165..82a1fcce594fa5b04f4fe459870991863c32a91a 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.h
+++ b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_SINGLE_THREADED_MATMUL_H_
 #define TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_SINGLE_THREADED_MATMUL_H_
 
+#include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/core/platform/types.h"
 
 extern "C" {
@@ -25,6 +26,12 @@ extern "C" {
 // 'out' is a pointer to a buffer sufficiently large to hold the result of the
 // operation. Following standard nomenclature: lhs is m x k, rhs is k x n, and
 // out is m x n.
+extern void __xla_cpu_runtime_EigenSingleThreadedMatMulF16(
+    const void* /* xla::ExecutableRunOptions* */ run_options_ptr,
+    Eigen::half* out, Eigen::half* lhs, Eigen::half* rhs, tensorflow::int64 m,
+    tensorflow::int64 n, tensorflow::int64 k, tensorflow::int32 transpose_lhs,
+    tensorflow::int32 transpose_rhs);
+
 extern void __xla_cpu_runtime_EigenSingleThreadedMatMulF32(
     const void* /* xla::ExecutableRunOptions* */ run_options_ptr, float* out,
     float* lhs, float* rhs, tensorflow::int64 m, tensorflow::int64 n,
diff --git a/tensorflow/compiler/xla/service/cpu/simple_orc_jit.cc b/tensorflow/compiler/xla/service/cpu/simple_orc_jit.cc
index e8a375d63791cd9a94f77af4ef5e74d2cb7e4361..80c24eaccfc2a83f8f3f311d60860715668d0c08 100644
--- a/tensorflow/compiler/xla/service/cpu/simple_orc_jit.cc
+++ b/tensorflow/compiler/xla/service/cpu/simple_orc_jit.cc
@@ -181,10 +181,12 @@ bool RegisterKnownJITSymbols() {
   REGISTER_CPU_RUNTIME_SYMBOL(EigenConvF16);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenConvF32);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenFft);
+  REGISTER_CPU_RUNTIME_SYMBOL(EigenMatMulF16);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenMatMulF32);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenMatMulF64);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenSingleThreadedConvF16);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenSingleThreadedConvF32);
+  REGISTER_CPU_RUNTIME_SYMBOL(EigenSingleThreadedMatMulF16);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenSingleThreadedMatMulF32);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenSingleThreadedMatMulF64);
   REGISTER_CPU_RUNTIME_SYMBOL(ParallelForkJoin);
diff --git a/tensorflow/compiler/xla/service/elemental_ir_emitter.cc b/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
index c732974995f70d9ba1b46e18aa4cc2c6ab467182..b6a0903b0eeaa04d8bc1488378c148b2016c5d48 100644
--- a/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
@@ -1522,15 +1522,12 @@ llvm_ir::ElementGenerator ElementalIrEmitter::MakeElementGenerator(
     case HloOpcode::kBroadcast:
       return [this, hlo, &operand_to_generator](
                  const IrArray::Index& target_index) -> StatusOr<llvm::Value*> {
+        const HloInstruction* operand = hlo->operand(0);
         // The `dimensions` member of the broadcast instruction maps from
         // input dimensions to output dimensions.
-        const HloInstruction* operand = hlo->operand(0);
-        int64 rank = ShapeUtil::Rank(operand->shape());
-        IrArray::Index source_index(rank);
-        for (int64 i = 0; i < rank; ++i) {
-          source_index[i] = target_index[hlo->dimensions(i)];
-        }
-        return operand_to_generator.at(operand)(source_index);
+        return operand_to_generator.at(
+            operand)(target_index.SourceIndexOfBroadcast(
+            hlo->shape(), operand->shape(), hlo->dimensions(), ir_builder_));
       };
     case HloOpcode::kSlice:
       return [this, hlo, &operand_to_generator](
@@ -1722,6 +1719,14 @@ llvm_ir::ElementGenerator ElementalIrEmitter::MakeElementGenerator(
         SetToFirstInsertPoint(if_data.after_block, ir_builder_);
         return ir_builder_->CreateLoad(ret_value_addr);
       };
+    case HloOpcode::kBitcast:
+      CHECK_EQ(ShapeUtil::ElementsIn(hlo->shape()),
+               ShapeUtil::ElementsIn(hlo->operand(0)->shape()));
+      return [this, hlo, &operand_to_generator](const IrArray::Index& index) {
+        const HloInstruction* operand = hlo->operand(0);
+        return operand_to_generator.at(operand)(index.SourceIndexOfBitcast(
+            hlo->shape(), operand->shape(), ir_builder_));
+      };
     case HloOpcode::kReshape:
       CHECK_EQ(ShapeUtil::ElementsIn(hlo->shape()),
                ShapeUtil::ElementsIn(hlo->operand(0)->shape()));
diff --git a/tensorflow/compiler/xla/service/executable.cc b/tensorflow/compiler/xla/service/executable.cc
index 90481c7a88f90edea5399ee44aee2d2c77fc115f..be92b1629a2d8dae57b315751bd4f7f9ccddf171 100644
--- a/tensorflow/compiler/xla/service/executable.cc
+++ b/tensorflow/compiler/xla/service/executable.cc
@@ -21,6 +21,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/core/lib/hash/hash.h"
 #include "tensorflow/core/lib/io/path.h"
+#include "tensorflow/core/lib/strings/proto_serialization.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
 #include "tensorflow/core/platform/env.h"
 
diff --git a/tensorflow/compiler/xla/service/gather_expander.cc b/tensorflow/compiler/xla/service/gather_expander.cc
new file mode 100644
index 0000000000000000000000000000000000000000..221ff7900f398166c193c495848a2afcfd4edc81
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gather_expander.cc
@@ -0,0 +1,392 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <utility>
+
+#include "tensorflow/compiler/xla/service/gather_expander.h"
+#include "tensorflow/compiler/xla/service/hlo_creation_utils.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/while_util.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/compiler/xla/util.h"
+
+namespace xla {
+using tensorflow::gtl::ArraySlice;
+
+static StatusOr<HloInstruction*> TransposeIndexVectorDimToLast(
+    HloInstruction* gather_indices, int64 index_vector_dim) {
+  const Shape& gather_indices_shape = gather_indices->shape();
+  if (index_vector_dim == (gather_indices_shape.dimensions_size() - 1)) {
+    return gather_indices;
+  }
+  std::vector<int64> permutation;
+  permutation.reserve(gather_indices_shape.dimensions_size());
+  for (int64 i = 0, e = gather_indices_shape.dimensions_size(); i < e; i++) {
+    if (i != index_vector_dim) {
+      permutation.push_back(i);
+    }
+  }
+  permutation.push_back(index_vector_dim);
+  return MakeTransposeHlo(gather_indices, permutation);
+}
+
+// If the gather_indices holds scalar indices (i.e. gather_indices has rank N
+// and index_vector_dim is N) then reshape it to have a trailing degenerate
+// dimension.  This makes the code for slicing out the index vector more
+// uniform.
+static StatusOr<HloInstruction*> DeScalarizeGatherIndices(
+    HloInstruction* gather_indices, int64 index_vector_dim) {
+  const Shape& gather_indices_shape = gather_indices->shape();
+  if (index_vector_dim != gather_indices_shape.dimensions_size()) {
+    return gather_indices;
+  }
+
+  DCHECK_EQ(index_vector_dim, gather_indices_shape.dimensions_size());
+
+  std::vector<int64> result_shape_dims;
+  c_copy(gather_indices_shape.dimensions(),
+         std::back_inserter(result_shape_dims));
+  result_shape_dims.push_back(1);
+
+  return MakeReshapeHlo(result_shape_dims, gather_indices);
+}
+
+// Canonicalizes the gather_indices tensors so that we only have to deal with
+// some specific cases in the while loop that does the heavy lifting.
+//
+// See the "High Level Algorithm" section for a broader picture.
+static StatusOr<HloInstruction*> CanonicalizeGatherIndices(
+    HloInstruction* gather_indices, int64 index_vector_dim) {
+  // If gather_indices holds scalar indices, normalize it to hold index vectors
+  // of size 1.
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * descalarized_gather_indices,
+      DeScalarizeGatherIndices(gather_indices, index_vector_dim));
+
+  // Transpose the non-index-vector dimensions to the front.
+  TF_ASSIGN_OR_RETURN(HloInstruction * transposed_gather_indices,
+                      TransposeIndexVectorDimToLast(descalarized_gather_indices,
+                                                    index_vector_dim));
+
+  // If there is only one index (i.e. gather_indices has rank 1 and this gather
+  // is really just a dynamic slice) add a leading degenerate dimension for
+  // uniformity.  Otherwise create a "collapsed" leading dimension that subsumes
+  // all of the non-index-vector dimensions.
+  const Shape& shape = transposed_gather_indices->shape();
+  if (shape.dimensions_size() == 1) {
+    return ExpandFirstDimIntoNDims(transposed_gather_indices,
+                                   {1, shape.dimensions(0)});
+  } else {
+    return CollapseFirstNDims(transposed_gather_indices,
+                              shape.dimensions_size() - 1);
+  }
+}
+
+// Expands out or contracts away the gather dimensions in the accumulator
+// produced by the while loop.
+static StatusOr<HloInstruction*> AdjustGatherDimsInAccumulator(
+    const Shape& gather_indices_shape, HloInstruction* accumulator,
+    int64 index_vector_dim) {
+  std::vector<int64> output_gather_dim_bounds;
+  output_gather_dim_bounds.reserve(gather_indices_shape.dimensions_size());
+  for (int64 i = 0, e = gather_indices_shape.dimensions_size(); i < e; i++) {
+    if (i != index_vector_dim) {
+      output_gather_dim_bounds.push_back(gather_indices_shape.dimensions(i));
+    }
+  }
+
+  if (output_gather_dim_bounds.empty()) {
+    // If output_gather_dim_bounds is empty we must be lowering an (effectively)
+    // dynamic-slice.  In that case, there is a leading degenerate gather
+    // dimension that we added to make this special case play well with the
+    // general while loop, which we need to remove now.
+    CHECK_EQ(accumulator->shape().dimensions(0), 1);
+    ArraySlice<int64> reshaped_dim_sizes =
+        AsInt64Slice(accumulator->shape().dimensions());
+    reshaped_dim_sizes.remove_prefix(1);
+    return MakeReshapeHlo(reshaped_dim_sizes, accumulator);
+  }
+
+  return ExpandFirstDimIntoNDims(accumulator, output_gather_dim_bounds);
+}
+
+// Expand an index vector from the gather_indices tensor into a vector that can
+// be used to dynamic-slice out of the gather operand.
+static StatusOr<HloInstruction*> ExpandIndexVectorIntoOperandSpace(
+    HloInstruction* index_vector, const GatherDimensionNumbers& dim_numbers,
+    int64 operand_rank) {
+  HloComputation* computation = index_vector->parent();
+  const Shape& index_shape = index_vector->shape();
+  HloInstruction* zero =
+      computation->AddInstruction(HloInstruction::CreateConstant(
+          Literal::CreateFromDimensions(index_shape.element_type(), {1})));
+
+  // We extract out individual components from the smaller index and concatenate
+  // them (interspersing zeros as needed) into the larger index.
+  std::vector<HloInstruction*> expanded_index_components;
+
+  for (int i = 0; i < operand_rank; i++) {
+    int64 index_vector_dim_index =
+        FindIndex(dim_numbers.gather_dims_to_operand_dims(), i);
+    if (index_vector_dim_index !=
+        dim_numbers.gather_dims_to_operand_dims_size()) {
+      TF_ASSIGN_OR_RETURN(
+          HloInstruction * component_to_concat,
+          MakeSliceHlo(index_vector, /*start_indices=*/{index_vector_dim_index},
+                       /*limit_indices=*/{index_vector_dim_index + 1},
+                       /*strides=*/{1}));
+      expanded_index_components.push_back(component_to_concat);
+    } else {
+      expanded_index_components.push_back(zero);
+    }
+  }
+
+  return MakeConcatHlo(expanded_index_components, /*dimension=*/0);
+}
+
+// This generates the body of the while that implements the main data movement
+// behavior of gather using dynamic-slice and dynamic-update-slice.
+static StatusOr<std::vector<HloInstruction*>> GatherLoopBody(
+    const HloInstruction& gather, HloInstruction* induction_var,
+    const std::vector<HloInstruction*>& incoming_loop_state) {
+  CHECK_EQ(incoming_loop_state.size(), 3);
+  HloInstruction* const operand = incoming_loop_state[0];
+  HloInstruction* const gather_indices = incoming_loop_state[1];
+  HloInstruction* const output_accumulator = incoming_loop_state[2];
+
+  int64 index_vector_size = gather_indices->shape().dimensions(1);
+
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * induction_var_as_vector,
+      MakeBroadcastHlo(induction_var, /*broadcast_dimensions=*/{},
+                       /*result_shape_bounds=*/{1}));
+
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * index_into_gather_indices,
+      PadVectorWithZeros(induction_var_as_vector,
+                         /*zeros_to_prepend=*/0, /*zeros_to_append=*/1));
+
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * index_vector_2d,
+      MakeDynamicSliceHlo(gather_indices, index_into_gather_indices,
+                          {1, index_vector_size}));
+
+  TF_ASSIGN_OR_RETURN(HloInstruction * index_vector,
+                      ElideDegenerateDims(index_vector_2d, {0}));
+
+  TF_ASSIGN_OR_RETURN(HloInstruction * gathered_slice_start,
+                      ExpandIndexVectorIntoOperandSpace(
+                          index_vector, gather.gather_dimension_numbers(),
+                          operand->shape().dimensions_size()));
+
+  TF_ASSIGN_OR_RETURN(HloInstruction * gathered_slice,
+                      MakeDynamicSliceHlo(operand, gathered_slice_start,
+                                          gather.gather_window_bounds()));
+
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * gathered_slice_for_update,
+      ExpandFirstDimIntoNDims(gathered_slice,
+                              {1, gathered_slice->shape().dimensions(0)}));
+
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * index_vector_into_accumulator,
+      PadVectorWithZeros(
+          induction_var_as_vector, /*zeros_to_prepend=*/0,
+          /*zeros_to_append=*/gathered_slice->shape().dimensions_size()));
+
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * updated_accumulator,
+      MakeDynamicUpdateSliceHlo(output_accumulator, gathered_slice_for_update,
+                                index_vector_into_accumulator));
+
+  // New loop state -- only the accumulator has changed.  The
+  // WhileUtil::MakeCountedLoop function takes care of the induction variable
+  // and the while loop exit condition.
+  return StatusOr<std::vector<HloInstruction*>>{
+      {operand, gather_indices, updated_accumulator}};
+}
+
+static StatusOr<HloInstruction*> CreateGatherLoopAccumulatorInitValue(
+    HloComputation* computation, PrimitiveType element_type,
+    ArraySlice<int64> window_bounds, int64 gather_loop_trip_count) {
+  std::vector<int64> accumulator_state_shape_dims;
+  accumulator_state_shape_dims.reserve(1 + window_bounds.size());
+  accumulator_state_shape_dims.push_back(gather_loop_trip_count);
+  c_copy(window_bounds, std::back_inserter(accumulator_state_shape_dims));
+  return BroadcastZeros(computation, element_type,
+                        accumulator_state_shape_dims);
+}
+
+static StatusOr<HloInstruction*> ElideWindowDimsFromAccumulator(
+    HloInstruction* accumulator, const GatherDimensionNumbers& dim_numbers) {
+  std::vector<int64> dims_to_elide;
+  dims_to_elide.reserve(dim_numbers.elided_window_dims_size());
+  for (int64 elided_window_dim : dim_numbers.elided_window_dims()) {
+    dims_to_elide.push_back(elided_window_dim + 1);
+  }
+
+  return ElideDegenerateDims(accumulator, dims_to_elide);
+}
+
+// `accumulator` is almost the tensor the gather operation would have produced,
+// except that it has the dimensions in the wrong order -- the gather dimensions
+// are the major dimensions and the window dimensions are the minor dimensions.
+// Fix this up with a transpose.
+static StatusOr<HloInstruction*> PermuteGatherAndWindowDims(
+    HloInstruction* accumulator, ArraySlice<int64> output_window_dims,
+    int64 output_rank) {
+  std::vector<int64> permutation;
+  permutation.reserve(output_rank);
+
+  int64 gather_idx_counter = 0;
+  int64 window_idx_counter = output_rank - output_window_dims.size();
+  for (int64 i = 0; i < output_rank; i++) {
+    bool is_window_dim = c_binary_search(output_window_dims, i);
+    if (is_window_dim) {
+      permutation.push_back(window_idx_counter++);
+    } else {
+      permutation.push_back(gather_idx_counter++);
+    }
+  }
+
+  return MakeTransposeHlo(accumulator, permutation);
+}
+
+// High Level Algorithm
+//
+// We perform the following steps in sequence:
+//
+//  1. We canonicalize the gather_indices tensor such that it has rank
+//     2 (i.e. is a matrix) where each row is an index vector into the
+//     operand.
+//  2. We iterate over the set of indices in the canonicalized
+//     gather_indices tensor using a while loop, accumulating slices
+//     of the operand tensor into an accumulator using
+//     DynamicUpdateSlice.
+//  3. The accumulator result from the while loop from (2) is then
+//     reshaped to split out all the individual gather dimensions and
+//     then transposed to give the final result.
+//
+// As an example, if we started with the following operation:
+//
+//   HloModule TensorFlowGatherMultipleBatchDims
+//
+//   ENTRY main {
+//     operand = s32[3,3] parameter(0)
+//     indices = s32[2,2] parameter(1)
+//     ROOT gather = s32[2,3,2] gather(operand, indices),
+//         output_window_dims={1},
+//         elided_window_dims={1},
+//         gather_dims_to_operand_dims={1},
+//         index_vector_dim=2,
+//         window_bounds={3, 1}
+//   }
+//
+// We'd first reshape indices to s32[4,1], where each row is an index into
+// operand.  We'd then run a loop to slice 4 tensors of shape [3,1] out of
+// operand into an accumulator of shape [4,3,1].  We then elide the degenerate
+// window dim ([4,3]), reshape to [2,2,3], and finally transpose to [2,3,2].
+
+StatusOr<HloInstruction*> GatherExpander::ExpandGather(
+    HloInstruction* gather_instr) {
+  CHECK(!ShapeUtil::HasZeroElements(gather_instr->shape()));
+
+  HloComputation* computation = gather_instr->parent();
+  HloInstruction* operand = gather_instr->mutable_operand(0);
+  HloInstruction* gather_indices = gather_instr->mutable_operand(1);
+  const Shape& gather_indices_shape = gather_indices->shape();
+  const Shape& output_shape = gather_instr->shape();
+  int64 output_rank = output_shape.dimensions_size();
+
+  const GatherDimensionNumbers& dim_numbers =
+      gather_instr->gather_dimension_numbers();
+
+  int64 gather_loop_trip_count = 1;
+  for (int64 i = 0, e = gather_indices_shape.dimensions_size(); i < e; i++) {
+    if (i != dim_numbers.index_vector_dim()) {
+      gather_loop_trip_count *= gather_indices_shape.dimensions(i);
+    }
+  }
+
+  if (!IsInt32(gather_loop_trip_count)) {
+    return Unimplemented(
+        "Gather operations with more than 2147483647 gather indices are not "
+        "supported. This error occurred for %s.",
+        gather_instr->ToString().c_str());
+  }
+
+  TF_ASSIGN_OR_RETURN(HloInstruction * canonical_gather_indices,
+                      CanonicalizeGatherIndices(
+                          gather_indices, dim_numbers.index_vector_dim()));
+
+  CHECK_EQ(gather_loop_trip_count,
+           canonical_gather_indices->shape().dimensions(0));
+
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * accumulator_init,
+      CreateGatherLoopAccumulatorInitValue(
+          computation, output_shape.element_type(),
+          gather_instr->gather_window_bounds(), gather_loop_trip_count));
+
+  StatusOr<std::vector<HloInstruction*>> gather_loop_result_or_error =
+      WhileUtil::MakeCountedLoop(
+          computation, gather_loop_trip_count,
+          {operand, canonical_gather_indices, accumulator_init},
+          [&](HloInstruction* indvar,
+              const std::vector<HloInstruction*>& loop_state) {
+            return GatherLoopBody(*gather_instr, indvar, loop_state);
+          });
+
+  TF_ASSIGN_OR_RETURN(std::vector<HloInstruction*> gather_loop_result,
+                      gather_loop_result_or_error);
+
+  HloInstruction* accumulator_result = gather_loop_result.back();
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * accumulator_with_window_dims_elided,
+      ElideWindowDimsFromAccumulator(accumulator_result, dim_numbers));
+
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * accumulator_with_output_gather_dims_decanonicalized,
+      AdjustGatherDimsInAccumulator(gather_indices->shape(),
+                                    accumulator_with_window_dims_elided,
+                                    dim_numbers.index_vector_dim()));
+
+  return PermuteGatherAndWindowDims(
+      accumulator_with_output_gather_dims_decanonicalized,
+      AsInt64Slice(dim_numbers.output_window_dims()), output_rank);
+}
+
+StatusOr<bool> GatherExpander::Run(HloModule* module) {
+  auto is_nontrivial_gather = [](HloInstruction* inst) {
+    return inst->opcode() == HloOpcode::kGather &&
+           // Avoid expanding gather ops that produce zero-sized tensors;
+           // instead, punt these to ZeroSizedHloElimination.
+           !ShapeUtil::HasZeroElements(inst->shape());
+  };
+
+  std::vector<HloInstruction*> gather_instrs;
+  for (HloComputation* computation : module->MakeNonfusionComputations()) {
+    c_copy_if(computation->instructions(), std::back_inserter(gather_instrs),
+              is_nontrivial_gather);
+  }
+
+  for (HloInstruction* inst : gather_instrs) {
+    TF_ASSIGN_OR_RETURN(HloInstruction * expanded_root, ExpandGather(inst));
+    TF_RETURN_IF_ERROR(inst->parent()->ReplaceInstruction(inst, expanded_root));
+  }
+
+  return !gather_instrs.empty();
+}
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gather_expander.h b/tensorflow/compiler/xla/service/gather_expander.h
new file mode 100644
index 0000000000000000000000000000000000000000..c1fc8574da99fff223c7dbb570b4533f76905b9a
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gather_expander.h
@@ -0,0 +1,37 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GATHER_EXPANDER_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GATHER_EXPANDER_H_
+
+#include "tensorflow/compiler/xla/service/hlo_pass_interface.h"
+
+namespace xla {
+
+// This pass rewrites gather operations into (roughly) while loops of dynamic
+// slices.  This lets backends that don't support gather directly
+// nevertheless have a minimum level of support.
+class GatherExpander : public HloPassInterface {
+ public:
+  tensorflow::StringPiece name() const override { return "gather_expander"; }
+  StatusOr<bool> Run(HloModule* module) override;
+
+ private:
+  StatusOr<HloInstruction*> ExpandGather(HloInstruction* gather_instr);
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GATHER_EXPANDER_H_
diff --git a/tensorflow/compiler/xla/service/gather_expander_test.cc b/tensorflow/compiler/xla/service/gather_expander_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..ba41ee8428cbe7132103df24d552565a8dc2f9f6
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gather_expander_test.cc
@@ -0,0 +1,51 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gather_expander.h"
+#include "tensorflow/compiler/xla/test.h"
+#include "tensorflow/compiler/xla/tests/test_macros.h"
+#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
+
+namespace xla {
+namespace {
+TEST(GatherExpanderTest, ErrorStatusOnTooManyIndices) {
+  const string hlo_text = R"(
+HloModule TensorFlowGatherMultipleBatchDims
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2147483647,5] parameter(1)
+  ROOT gather = s32[2147483647,3,5] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={1},
+      gather_dims_to_operand_dims={1},
+      index_vector_dim=2,
+      window_bounds={3, 1}
+}
+)";
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          tools::Parse(hlo_text));
+
+  Status status = GatherExpander{}.Run(module.get()).status();
+  EXPECT_EQ(status.code(), tensorflow::error::UNIMPLEMENTED);
+
+  ASSERT_THAT(
+      status.error_message(),
+      ::testing::HasSubstr("Gather operations with more than 2147483647 gather "
+                           "indices are not supported."));
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/generic_transfer_manager.cc b/tensorflow/compiler/xla/service/generic_transfer_manager.cc
index 78dc0ad4fcd167c93f19d0c2b18ea72d666897ef..a99e2b7794a399047fb5a77a140bd333214e3f23 100644
--- a/tensorflow/compiler/xla/service/generic_transfer_manager.cc
+++ b/tensorflow/compiler/xla/service/generic_transfer_manager.cc
@@ -38,14 +38,7 @@ namespace xla {
 
 GenericTransferManager::GenericTransferManager(se::Platform::Id platform_id,
                                                size_t pointer_size)
-    : platform_id_(platform_id), pointer_size_(pointer_size) {
-  // We currently only support kHostPlatformId for CPU, kCudaPlatformId for
-  // GPU and kInterpreterPlatformId for Interpreter. Before supporting other
-  // platforms, we need to test this transfer manager on them.
-  CHECK(platform_id_ == se::host::kHostPlatformId ||
-        platform_id_ == se::interpreter::kInterpreterPlatformId ||
-        platform_id_ == se::cuda::kCudaPlatformId);
-}
+    : platform_id_(platform_id), pointer_size_(pointer_size) {}
 
 se::Platform::Id GenericTransferManager::PlatformId() const {
   return platform_id_;
diff --git a/tensorflow/compiler/xla/service/gpu/BUILD b/tensorflow/compiler/xla/service/gpu/BUILD
index 9da4fb97fa27a238fead74985cb481a9be1f4a65..93b2f2a4748932e50ce40e8a2f573af922dea8d1 100644
--- a/tensorflow/compiler/xla/service/gpu/BUILD
+++ b/tensorflow/compiler/xla/service/gpu/BUILD
@@ -241,6 +241,7 @@ cc_library(
         "gpu_executable.cc",
         "infeed_thunk.cc",
         "kernel_thunk.cc",
+        "memset_thunk.cc",
         "sequential_thunk.cc",
         "thunk_schedule.cc",
         "tuple_thunk.cc",
@@ -257,6 +258,7 @@ cc_library(
         "gpu_executable.h",
         "infeed_thunk.h",
         "kernel_thunk.h",
+        "memset_thunk.h",
         "sequential_thunk.h",
         "thunk.h",
         "thunk_schedule.h",
@@ -273,6 +275,7 @@ cc_library(
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:shape_tree",
         "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:status",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:types",
@@ -293,6 +296,7 @@ cc_library(
         "//tensorflow/core/platform/default/build_config:cudnn_plugin",
         "//tensorflow/core/platform/default/build_config:cufft_plugin",
         "//tensorflow/core/platform/default/build_config:stream_executor_cuda",  # build_cleaner: keep
+        "//tensorflow/stream_executor",
     ],
 )
 
@@ -397,6 +401,7 @@ tf_cc_test(
         "//tensorflow/compiler/xla/service:hlo_matchers",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -437,8 +442,10 @@ tf_cc_test(
         ":fusion_merger",
         ":instruction_fusion",
         "//tensorflow/compiler/xla:test_helpers",
+        "//tensorflow/compiler/xla/service:hlo_matchers",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -452,6 +459,7 @@ cc_library(
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:window_util",
         "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_creation_utils",
         "//tensorflow/compiler/xla/service:hlo_pass",
         "//tensorflow/compiler/xla/service:shape_inference",
     ],
@@ -510,9 +518,11 @@ cc_library(
         "//tensorflow/compiler/xla/service:buffer_assignment",
         "//tensorflow/compiler/xla/service:buffer_liveness",
         "//tensorflow/compiler/xla/service:call_inliner",
+        "//tensorflow/compiler/xla/service:conditional_simplifier",
         "//tensorflow/compiler/xla/service:dot_decomposer",
         "//tensorflow/compiler/xla/service:executable",
         "//tensorflow/compiler/xla/service:flatten_call_graph",
+        "//tensorflow/compiler/xla/service:gather_expander",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_constant_folding",
         "//tensorflow/compiler/xla/service:hlo_cse",
diff --git a/tensorflow/compiler/xla/service/gpu/fusion_merger.cc b/tensorflow/compiler/xla/service/gpu/fusion_merger.cc
index c137fbc97e29e24edb3603c611a5c8f093bc62a6..3cd30b754c3242f00c704de1afab2282ed827b41 100644
--- a/tensorflow/compiler/xla/service/gpu/fusion_merger.cc
+++ b/tensorflow/compiler/xla/service/gpu/fusion_merger.cc
@@ -45,6 +45,7 @@ void MaybeResolveTupleElements(HloInstruction* instruction,
 // Returns the bytes read by fusion parameter 'param', by returning the byte
 // size of 'param' shape (or the cumulative byte sizes of all leaf tuple
 // elements if 'param' is tuple-shaped).
+//
 // In the special case where all users of 'param' (or all users of a leaf
 // tuple element if 'param' is tuple-shaped) are Slice instructions, the size
 // of each slice instruction is accumulated instead, to give a more accurate
@@ -63,11 +64,10 @@ double CalculateBytesReadByFusionParameter(HloInstruction* param) {
   // Slice for a more accurate estimate of bytes read.
   double bytes = 0.0;
   for (auto& instruction : instructions) {
-    if (std::all_of(instruction->users().begin(), instruction->users().end(),
-                    [](const HloInstruction* instruction) {
-                      return instruction->opcode() == HloOpcode::kSlice ||
-                             instruction->opcode() == HloOpcode::kDynamicSlice;
-                    })) {
+    if (c_all_of(instruction->users(), [](const HloInstruction* instruction) {
+          return instruction->opcode() == HloOpcode::kSlice ||
+                 instruction->opcode() == HloOpcode::kDynamicSlice;
+        })) {
       // All users are slice: accumulate bytes of all user slice instructions.
       for (auto& user : instruction->users()) {
         bytes += ShapeUtil::ByteSizeOf(user->shape());
@@ -199,6 +199,7 @@ Status FusionInstructionMerger::HandleFusion(HloInstruction* fusion) {
   ++total_visited_;
   // Skip 'fusion' instruction if there are no users into which we can merge.
   if (fusion->users().empty()) {
+    VLOG(3) << "Not merging " << fusion->name() << ": Has no users.";
     ++num_fail_no_users_;
     return Status::OK();
   }
@@ -208,24 +209,27 @@ Status FusionInstructionMerger::HandleFusion(HloInstruction* fusion) {
   // Input fusion instructions need to be rooted at a particular HLO (e.g.
   // kReduce), so they shouldn't be further fused either.
   if (fusion->fusion_kind() != HloInstruction::FusionKind::kLoop) {
+    VLOG(3) << "Not merging " << fusion->name() << ": Is not loop fusion.";
     ++num_fail_not_loop_fusion_;
     return Status::OK();
   }
 
   // Skip multiple output fusion. It's not yet supported.
   if (fusion->IsMultiOutputFusion()) {
+    VLOG(3) << "Not merging " << fusion->name() << ": Is multi-output fusion.";
     ++num_fail_not_loop_fusion_;
     return Status::OK();
   }
   // Skip 'fusion' instruction if we cannot merge into all of its users.
   // Merging into all users enables the removal of 'fusion' from the
   // computation.
-  if (!std::all_of(fusion->users().begin(), fusion->users().end(),
-                   [](const HloInstruction* instruction) {
-                     return instruction->opcode() == HloOpcode::kFusion &&
-                            instruction->fusion_kind() ==
-                                HloInstruction::FusionKind::kLoop;
-                   })) {
+  if (!c_all_of(fusion->users(), [](const HloInstruction* user) {
+        return user->opcode() == HloOpcode::kFusion &&
+               (user->fusion_kind() == HloInstruction::FusionKind::kLoop ||
+                user->fusion_kind() == HloInstruction::FusionKind::kInput);
+      })) {
+    VLOG(3) << "Not merging " << fusion->name()
+            << ": Some of its users are not loop/input fusion kernels.";
     ++num_fail_merge_all_users_;
     return Status::OK();
   }
@@ -233,18 +237,17 @@ Status FusionInstructionMerger::HandleFusion(HloInstruction* fusion) {
   // Skip 'fusion' instruction if any of its fused instructions are expensive.
   // This is done to avoid the duplication of expensive instructions, which
   // would occur if 'fusion' were merged into multiple users.
+  //
   // If 'fusion' has just one user, then an earlier fusion pass chose not to
   // fuse this producer/comsumer pair (likely because of expensive instruction
   // re-use by the consumer), and so we honor that choice here as well.
-  if (!std::all_of(fusion->fused_instructions().begin(),
-                   fusion->fused_instructions().end(),
-                   [](const HloInstruction* instruction) {
-                     if (instruction->opcode() != HloOpcode::kParameter &&
-                         GpuInstructionFusion::IsExpensive(*instruction)) {
-                       return false;
-                     }
-                     return true;
-                   })) {
+  if (c_any_of(fusion->fused_instructions(),
+               [](const HloInstruction* instruction) {
+                 return instruction->opcode() != HloOpcode::kParameter &&
+                        GpuInstructionFusion::IsExpensive(*instruction);
+               })) {
+    VLOG(3) << "Not merging " << fusion->name()
+            << ": Contains one or more expensive instructions.";
     ++num_fail_expensive_fused_instruction_;
     return Status::OK();
   }
@@ -253,6 +256,8 @@ Status FusionInstructionMerger::HandleFusion(HloInstruction* fusion) {
   // exceeds the threshold value.
   if (CalculateFlopsToBytesRatio(fusion) >
       FusionMerger::GetThresholdFlopsToBytesRatio()) {
+    VLOG(3) << "Not merging " << fusion->name()
+            << ": flops-to-bytes ratio is not favorable.";
     ++num_fail_flops_to_byte_ratio_;
     return Status::OK();
   }
@@ -265,6 +270,9 @@ Status FusionInstructionMerger::HandleFusion(HloInstruction* fusion) {
   const double merged_to_current_bytes_ratio =
       merged_bytes_transferred / std::max(1.0, current_bytes_transferred);
   if (merged_to_current_bytes_ratio > 1.10) {
+    VLOG(3) << "Not merging " << fusion->name()
+            << ": merged-to-current-bytes-ratio of "
+            << merged_to_current_bytes_ratio << " is not favorable.";
     ++num_fail_net_bytes_transferred_ratio_;
     return Status::OK();
   }
diff --git a/tensorflow/compiler/xla/service/gpu/fusion_merger_test.cc b/tensorflow/compiler/xla/service/gpu/fusion_merger_test.cc
index deef5966b80d1b7f16e9982eed9ac5c7131e9d73..2217776c7d5a5f92c520d56222988f80401be9e4 100644
--- a/tensorflow/compiler/xla/service/gpu/fusion_merger_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/fusion_merger_test.cc
@@ -16,257 +16,21 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/gpu/fusion_merger.h"
 
 #include "tensorflow/compiler/xla/service/gpu/instruction_fusion.h"
+#include "tensorflow/compiler/xla/service/hlo_matchers.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace xla {
 namespace gpu {
 namespace {
 
-class FusionMergerTest : public HloTestBase {
- protected:
-  FusionMergerTest() : module_(CreateNewModule()) {}
-
-  // Builds the following computation:
-  //
-  //                 Param
-  //               /   |   \
-  //              /    |    \
-  //  OnesVec  GTE(0) GTE(1) GTE(2)
-  //       \   /         \   /
-  //        Add           Add  OnesVec
-  //         \           /  \  /
-  //           \      Add   Mul  OnesVec
-  //            \      |     |  /
-  //             \    Mul    Add
-  //              \    |    /
-  //               \   |   /
-  //                 Tuple
-  //
-  HloComputation* BuildComputation0() {
-    auto builder = HloComputation::Builder(TestName() + ".Computation0");
-    // Create param instruction to access computation state.
-    auto param = builder.AddInstruction(
-        HloInstruction::CreateParameter(0, tuple_shape3_, "param"));
-
-    // Create GetTupleElement instructions for each tuple element.
-    auto gte0 = builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(data_shape_, param, 0));
-    auto gte1 = builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(data_shape_, param, 1));
-    auto gte2 = builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(data_shape_, param, 2));
-
-    // Create const vector of ones to be used in element-wise computations.
-    auto one_vec = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({1.f, 1.f, 1.f, 1.f})));
-
-    // Create simple fusable computation for tuple element 0 (wont get merged).
-    auto out0 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kAdd, one_vec, gte0));
-
-    // Create fusable computation which is dependent on second and third tuple
-    // elements (will initially be fused on its own).
-    auto add1 = builder.AddInstruction(
-        HloInstruction::CreateBinary(data_shape_, HloOpcode::kAdd, gte1, gte2));
-
-    // Create two sub-computations, both of which are users of 'add1'.
-
-    // First sub-computation: out1 = Mul(Add(add1, one_vec), one_vec)
-    auto add2 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kAdd, add1, one_vec));
-    auto out1 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kMultiply, add2, one_vec));
-
-    // Second sub-computation: out2 = Add(Mul(add1, one_vec), one_vec)
-    auto mul0 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kMultiply, add1, one_vec));
-    auto out2 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kAdd, mul0, one_vec));
-
-    // Create output Tuple.
-    builder.AddInstruction(HloInstruction::CreateTuple({out0, out1, out2}));
-    return module_->AddEntryComputation(builder.Build());
-  }
-
-  // Builds the following computation:
-  //
-  //                 Param
-  //               /      \
-  //            GTE(0)   GTE(1)
-  //            | | \   /
-  //            | |  Mul
-  //             \  \ |
-  //              \  Mul
-  //               \ |
-  //      OnesVec   Mul  OnesVec
-  //             \  /  \ /
-  //     OnesVec  Add  Mul  OnesVec
-  //            \  |    |  /
-  //             Mul    Add
-  //               \    /
-  //                \  /
-  //                Tuple
-  //
-  HloComputation* BuildComputation1() {
-    auto builder = HloComputation::Builder(TestName() + ".Computation1");
-    Shape tuple_shape2_ = ShapeUtil::MakeTupleShape({data_shape_, data_shape_});
-    // Create param instruction to access computation state.
-    auto state = builder.AddInstruction(
-        HloInstruction::CreateParameter(0, tuple_shape2_, "state"));
-
-    // Create shared sub-computation (will initially be fused on its own).
-    auto gte0 = builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(data_shape_, state, 0));
-    auto gte1 = builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(data_shape_, state, 2));
-    // Calculate the flops we need to generate for this shared computation
-    // to exceed the threshold flops_to_bytes_ratio.
-    // Note that bytes transferred is multiplied by 3 because there are two
-    // operands and one output of size 'data_shape_'.
-    const int64 flops_needed = FusionMerger::GetThresholdFlopsToBytesRatio() *
-                               ShapeUtil::ByteSizeOf(data_shape_) * 3;
-    const int64 vec_elements = ShapeUtil::ElementsIn(data_shape_);
-    const int64 iters = (flops_needed + vec_elements - 1) / vec_elements;
-
-    auto mul0 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kMultiply, gte0, gte1));
-    for (int i = 0; i < iters; ++i) {
-      mul0 = builder.AddInstruction(HloInstruction::CreateBinary(
-          data_shape_, HloOpcode::kMultiply, gte0, mul0));
-    }
-
-    // Create two sub-computations, both of which are users of 'mul0'.
-    auto one_vec = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({1.f, 1.f, 1.f, 1.f})));
-
-    // First sub-computation: out0 = Mul(Add(mul0, one_vec), one_vec)
-    auto add0 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kAdd, mul0, one_vec));
-    auto out0 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kMultiply, add0, one_vec));
-
-    // Second sub-computation: out1 = Add(Mul(mul0, one_vec), one_vec)
-    auto mul1 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kMultiply, mul0, one_vec));
-    auto out1 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kAdd, mul1, one_vec));
-
-    // Create output Tuple.
-    builder.AddInstruction(HloInstruction::CreateTuple({out0, out1}));
-    return module_->AddEntryComputation(builder.Build());
-  }
-
-  // Builds the following computation:
-  //
-  //                Param
-  //             /   |   |  \
-  //            /    |   |   \
-  //           /     |   |    \
-  //      GTE(0) GTE(1) GTE(2) GTE(3)
-  //           \   /    /     /
-  //            Add    /     /
-  //              \   /     /
-  //               Add     /
-  //                 \    /
-  //                  \  /
-  //         OnesVec   Add  OnesVec
-  //                \  /  \ /
-  //        OnesVec  Add  Mul OnesVec
-  //              \  |    |  /
-  //               Mul    Add
-  //                 \    /
-  //                  \  /
-  //                  Tuple
-  //
-  HloComputation* BuildComputation2(bool add_extra_input) {
-    auto builder = HloComputation::Builder(TestName() + ".Computation2");
-    Shape state_shape = add_extra_input ? tuple_shape4_ : tuple_shape3_;
-    // Create param instruction to access computation state.
-    auto state = builder.AddInstruction(
-        HloInstruction::CreateParameter(0, state_shape, "state"));
-
-    // Create GetTupleElement instructions for each tuple element.
-    auto gte0 = builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(data_shape_, state, 0));
-    auto gte1 = builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(data_shape_, state, 1));
-    auto gte2 = builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(data_shape_, state, 2));
-
-    // Create shared fusable computation that reduces its operands.
-    auto reduce0 = builder.AddInstruction(
-        HloInstruction::CreateBinary(data_shape_, HloOpcode::kAdd, gte0, gte1));
-    auto reduce_out = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kAdd, reduce0, gte2));
-    if (add_extra_input) {
-      auto gte3 = builder.AddInstruction(
-          HloInstruction::CreateGetTupleElement(data_shape_, state, 3));
-      reduce_out = builder.AddInstruction(HloInstruction::CreateBinary(
-          data_shape_, HloOpcode::kAdd, reduce_out, gte3));
-    }
+namespace op = xla::testing::opcode_matchers;
 
-    // Create two fusable sub-computations which are dependent on shared
-    // computation 'reduce_out'.
-    auto one_vec = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({1.f, 1.f, 1.f, 1.f})));
-
-    // First sub-computation: out0 = Mul(Add(reduce_out, one_vec), one_vec)
-    auto add2 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kAdd, reduce_out, one_vec));
-    auto out0 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kMultiply, add2, one_vec));
-
-    // Second sub-computation: out1 = Add(Mul(reduce_out, one_vec), one_vec)
-    auto mul0 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kMultiply, reduce_out, one_vec));
-    auto out1 = builder.AddInstruction(HloInstruction::CreateBinary(
-        data_shape_, HloOpcode::kAdd, mul0, one_vec));
-
-    // Create output Tuple.
-    builder.AddInstruction(HloInstruction::CreateTuple({out0, out1}));
-    return module_->AddEntryComputation(builder.Build());
-  }
-
-  Shape data_shape_ = ShapeUtil::MakeShape(F32, {4});
-  Shape tuple_shape2_ = ShapeUtil::MakeTupleShape({data_shape_, data_shape_});
-  Shape tuple_shape3_ =
-      ShapeUtil::MakeTupleShape({data_shape_, data_shape_, data_shape_});
-  Shape tuple_shape4_ = ShapeUtil::MakeTupleShape(
-      {data_shape_, data_shape_, data_shape_, data_shape_});
-
-  std::unique_ptr<HloModule> module_;
-};
+class FusionMergerTest : public HloTestBase {};
 
 // Tests that we can merge a fusion instruction that is below threshold.
 //
-// Original computation:
-//
-//                 Param
-//                /  |  \
-//               /   |   \
-//  OnesVec  GTE(0) GTE(1) GTE(2)
-//       \   /         \   /
-//        Add           Add  OnesVec
-//         \           /  \  /
-//           \      Add   Mul  OnesVec
-//            \      |     |  /
-//             \    Mul    Add
-//              \    |    /
-//               \   |   /
-//                 Tuple
-//
-// Computation after fusion passes:
-//
-//                  Param
-//                 /     \
-//            Fusion3    Fusion2
-//               |       /     \
-//                \ Fusion0  Fusion1
-//                 \    |   /
-//                  \   |  /
-//                   Tuple
-//
 // Computation after fusion merger pass (Fusion2 is merged into Fusion0 and
 // Fusion1):
 //                   Param
@@ -276,19 +40,50 @@ class FusionMergerTest : public HloTestBase {
 //                   Tuple
 //
 TEST_F(FusionMergerTest, MergeSharedFusionInstruction) {
-  auto computation = BuildComputation0();
-  // Run standard fusion passes.
-  EXPECT_TRUE(GpuInstructionFusion(/*may_duplicate=*/false)
-                  .Run(module_.get())
-                  .ValueOrDie());
-  EXPECT_FALSE(GpuInstructionFusion(/*may_duplicate=*/true)
-                   .Run(module_.get())
-                   .ValueOrDie());
-  // Run fusion merger pass, which should merge the shared fusion instruction
-  // into its two users.
-  EXPECT_TRUE(FusionMerger().Run(module_.get()).ValueOrDie());
-
-  auto* root = computation->root_instruction();
+  auto module = tools::Parse(R"(
+HloModule MergeSharedFusionInstruction
+
+comp.3 {
+  constant.param_0 = f32[4]{0} parameter(0)
+  param.param_1.2 = (f32[4]{0}, f32[4]{0}, f32[4]{0}) parameter(1)
+  get-tuple-element.6 = f32[4]{0} get-tuple-element(param.param_1.2), index=0
+  ROOT add.7 = f32[4]{0} add(constant.param_0, get-tuple-element.6)
+}
+
+comp.2 {
+  param.param_1.1 = (f32[4]{0}, f32[4]{0}, f32[4]{0}) parameter(0)
+  get-tuple-element.4 = f32[4]{0} get-tuple-element(param.param_1.1), index=1
+  get-tuple-element.5 = f32[4]{0} get-tuple-element(param.param_1.1), index=2
+  ROOT add.6 = f32[4]{0} add(get-tuple-element.4, get-tuple-element.5)
+}
+
+comp.1 {
+  add.1.param_1.1 = f32[4]{0} parameter(1)
+  constant.param_1.3 = f32[4]{0} parameter(0)
+  add.5 = f32[4]{0} add(add.1.param_1.1, constant.param_1.3)
+  ROOT multiply.3 = f32[4]{0} multiply(add.5, constant.param_1.3)
+}
+
+comp {
+  add.1.param_1 = f32[4]{0} parameter(1)
+  constant.param_1.1 = f32[4]{0} parameter(0)
+  multiply.2 = f32[4]{0} multiply(add.1.param_1, constant.param_1.1)
+  ROOT add.4 = f32[4]{0} add(multiply.2, constant.param_1.1)
+}
+
+ENTRY MergeSharedFusionInstruction.Computation0 {
+  constant = f32[4]{0} constant({1, 1, 1, 1})
+  param = (f32[4]{0}, f32[4]{0}, f32[4]{0}) parameter(0)
+  fusion.3 = f32[4]{0} fusion(constant, param), kind=kLoop, calls=comp.3
+  fusion.4 = f32[4]{0} fusion(param), kind=kLoop, calls=comp.2
+  fusion.5 = f32[4]{0} fusion(constant, fusion.4), kind=kLoop, calls=comp.1
+  fusion.6 = f32[4]{0} fusion(constant, fusion.4), kind=kLoop, calls=comp
+  ROOT tuple = (f32[4]{0}, f32[4]{0}, f32[4]{0}) tuple(fusion.3, fusion.5, fusion.6)
+})")
+                    .ValueOrDie();
+  EXPECT_TRUE(FusionMerger().Run(module.get()).ValueOrDie());
+
+  auto* root = module->entry_computation()->root_instruction();
   EXPECT_EQ(HloOpcode::kTuple, root->opcode());
   // Check operand 0 (not merged). Should have 4 instructions.
   auto* operand0 = root->operand(0);
@@ -307,156 +102,188 @@ TEST_F(FusionMergerTest, MergeSharedFusionInstruction) {
 // Tests that we do not merge a fusion instruction that above flops to bytes
 // threshold.
 //
-// Original computation:
-//
-//                 Param
-//                /     \
-//            GTE(0)   GTE(1)
-//            | | \   /
-//            | |  Mul
-//             \  \ |
-//              \  Mul
-//               \ |
-//      OnesVec   Mul  OnesVec
-//             \  /  \ /
-//     OnesVec  Add  Mul  OnesVec
-//            \  |    |  /
-//             Mul    Add
-//               \    /
-//                \  /
-//                Tuple
-//
-// Computation after fusion passes and fusion merger pass (Fusion2 is not
-// merged because it exceeds the threshold flops to bytes ratio).
-//
-//                 Param
-//                   |
-//                Fusion2
-//                /     \
-//           Fusion0  Fusion1
-//                \    /
-//                 Tuple
-//
+// Fusion2 is not merged because it exceeds the threshold flops-to-bytes ratio.
 TEST_F(FusionMergerTest, FlopsToBytesRatioThresholdExceeded) {
-  BuildComputation1();
-  // Run standard fusion passes.
-  EXPECT_TRUE(GpuInstructionFusion(/*may_duplicate=*/false)
-                  .Run(module_.get())
-                  .ValueOrDie());
-  EXPECT_FALSE(GpuInstructionFusion(/*may_duplicate=*/true)
-                   .Run(module_.get())
-                   .ValueOrDie());
+  auto module = tools::Parse(R"(
+HloModule FlopsToBytesRatioThresholdExceeded
+
+comp.2 {
+  state.param_1.1 = (f32[4]{0}, f32[4]{0}) parameter(0)
+  get-tuple-element.3 = f32[4]{0} get-tuple-element(state.param_1.1), index=0
+  get-tuple-element.4 = f32[4]{0} get-tuple-element(state.param_1.1), index=2
+  multiply.29 = f32[4]{0} multiply(get-tuple-element.3, get-tuple-element.4)
+  multiply.30 = f32[4]{0} multiply(get-tuple-element.3, multiply.29)
+  multiply.31 = f32[4]{0} multiply(get-tuple-element.3, multiply.30)
+  multiply.32 = f32[4]{0} multiply(get-tuple-element.3, multiply.31)
+  multiply.33 = f32[4]{0} multiply(get-tuple-element.3, multiply.32)
+  multiply.34 = f32[4]{0} multiply(get-tuple-element.3, multiply.33)
+  multiply.35 = f32[4]{0} multiply(get-tuple-element.3, multiply.34)
+  multiply.36 = f32[4]{0} multiply(get-tuple-element.3, multiply.35)
+  multiply.37 = f32[4]{0} multiply(get-tuple-element.3, multiply.36)
+  multiply.38 = f32[4]{0} multiply(get-tuple-element.3, multiply.37)
+  multiply.39 = f32[4]{0} multiply(get-tuple-element.3, multiply.38)
+  multiply.40 = f32[4]{0} multiply(get-tuple-element.3, multiply.39)
+  ROOT multiply.41 = f32[4]{0} multiply(get-tuple-element.3, multiply.40)
+}
+
+comp.1 {
+  multiply.12.param_1.1 = f32[4]{0} parameter(1)
+  constant.param_1.3 = f32[4]{0} parameter(0)
+  add.3 = f32[4]{0} add(multiply.12.param_1.1, constant.param_1.3)
+  ROOT multiply.16 = f32[4]{0} multiply(add.3, constant.param_1.3)
+}
+
+comp {
+  multiply.12.param_1 = f32[4]{0} parameter(1)
+  constant.param_1.1 = f32[4]{0} parameter(0)
+  multiply.15 = f32[4]{0} multiply(multiply.12.param_1, constant.param_1.1)
+  ROOT add.2 = f32[4]{0} add(multiply.15, constant.param_1.1)
+}
+
+ENTRY FlopsToBytesRatioThresholdExceeded.Computation1 {
+  constant = f32[4]{0} constant({1, 1, 1, 1})
+  state = (f32[4]{0}, f32[4]{0}) parameter(0)
+  fusion.2 = f32[4]{0} fusion(state), kind=kLoop, calls=comp.2
+  fusion.3 = f32[4]{0} fusion(constant, fusion.2), kind=kLoop, calls=comp.1
+  fusion.4 = f32[4]{0} fusion(constant, fusion.2), kind=kLoop, calls=comp
+  ROOT tuple = (f32[4]{0}, f32[4]{0}) tuple(fusion.3, fusion.4)
+})")
+                    .ValueOrDie();
   // Run fusion merger pass, which should detect that the flops/bytes of the
   // shared fusion instruction exceeds the threshold ratio, and therefore
   // cannot be merged with other fusion instructions.
-  EXPECT_FALSE(FusionMerger().Run(module_.get()).ValueOrDie());
+  EXPECT_FALSE(FusionMerger().Run(module.get()).ValueOrDie());
 }
 
 // Tests that threshold for bytes transferred if merged is exceeded.
 //
-// Original computation:
-//
-//                Param
-//             /   |   |  \
-//            /    |   |   \
-//           /     |   |    \
-//      GTE(0) GTE(1) GTE(2) GTE(3)
-//           \   /    /     /
-//            Add    /     /
-//              \   /     /
-//               Add     /
-//                 \    /
-//                  \  /
-//         OnesVec   Add  OnesVec
-//                \  /  \ /
-//        OnesVec  Add  Mul OnesVec
-//              \  |    |  /
-//               Mul    Add
-//                 \    /
-//                  \  /
-//                  Tuple
-//
-// Computation after fusion passes and fusion merger pass. Fusion2 is not
-// merged because it exceeds the threshold bytes transferred. This is because
-// the bytes read by Fusion2 (when replicated if the instruction is merged
-// into Fusion0 and Fusion1) would exceed the bytes transferred threshold.
-//
-//                 Param
-//                   |
-//                Fusion2
-//                /     \
-//           Fusion0  Fusion1
-//                \    /
-//                 Tuple
-//
+// Fusion2 is not merged because merging would exceed the bytes-transferred
+// threshold: if Fusion2 were merged into Fusion0 and Fusion1, its reads from
+// Param would be replicated in both, and the total bytes read would exceed
+// the threshold.
 TEST_F(FusionMergerTest, BytesTransferredThresholdExeceeded) {
-  BuildComputation2(/*add_extra_input=*/true);
-  // Run standard fusion passes.
-  EXPECT_TRUE(GpuInstructionFusion(/*may_duplicate=*/false)
-                  .Run(module_.get())
-                  .ValueOrDie());
-  EXPECT_FALSE(GpuInstructionFusion(/*may_duplicate=*/true)
-                   .Run(module_.get())
-                   .ValueOrDie());
+  auto module = tools::Parse(R"(
+HloModule BytesTransferredThresholdExeceeded
+
+comp.2 {
+  state.param_1.1 = (f32[4]{0}, f32[4]{0}, f32[4]{0}, f32[4]{0}) parameter(0)
+  get-tuple-element.7 = f32[4]{0} get-tuple-element(state.param_1.1), index=0
+  get-tuple-element.8 = f32[4]{0} get-tuple-element(state.param_1.1), index=1
+  add.9 = f32[4]{0} add(get-tuple-element.7, get-tuple-element.8)
+  get-tuple-element.9 = f32[4]{0} get-tuple-element(state.param_1.1), index=2
+  add.10 = f32[4]{0} add(add.9, get-tuple-element.9)
+  get-tuple-element.10 = f32[4]{0} get-tuple-element(state.param_1.1), index=3
+  ROOT add.11 = f32[4]{0} add(add.10, get-tuple-element.10)
+}
+
+comp.1 {
+  add.2.param_1.1 = f32[4]{0} parameter(1)
+  constant.param_1.3 = f32[4]{0} parameter(0)
+  add.6 = f32[4]{0} add(add.2.param_1.1, constant.param_1.3)
+  ROOT multiply.3 = f32[4]{0} multiply(add.6, constant.param_1.3)
+}
+
+comp {
+  add.2.param_1 = f32[4]{0} parameter(1)
+  constant.param_1.1 = f32[4]{0} parameter(0)
+  multiply.2 = f32[4]{0} multiply(add.2.param_1, constant.param_1.1)
+  ROOT add.5 = f32[4]{0} add(multiply.2, constant.param_1.1)
+}
+
+ENTRY BytesTransferredThresholdExeceeded.Computation2 {
+  constant = f32[4]{0} constant({1, 1, 1, 1})
+  state = (f32[4]{0}, f32[4]{0}, f32[4]{0}, f32[4]{0}) parameter(0)
+  fusion.2 = f32[4]{0} fusion(state), kind=kLoop, calls=comp.2
+  fusion.3 = f32[4]{0} fusion(constant, fusion.2), kind=kLoop, calls=comp.1
+  fusion.4 = f32[4]{0} fusion(constant, fusion.2), kind=kLoop, calls=comp
+  ROOT tuple = (f32[4]{0}, f32[4]{0}) tuple(fusion.3, fusion.4)
+})")
+                    .ValueOrDie();
   // Run fusion merger pass, which should detect that the net bytes transferred
   // (if merged) would increase.
-  EXPECT_FALSE(FusionMerger().Run(module_.get()).ValueOrDie());
+  EXPECT_FALSE(FusionMerger().Run(module.get()).ValueOrDie());
 }
 
 // Tests that threshold for bytes transferred if merged is not exceeded.
 //
-// Original computation:
-//
-//               Param
-//             /   |  \
-//            /    |   \
-//           /     |    \
-//      GTE(0) GTE(1) GTE(2)
-//           \   /    /
-//            Add    /
-//              \   /
-//     OnesVec   Add  OnesVec
-//            \  /  \ /
-//   OnesVec  Add   Mul OnesVec
-//              \  /   \  /
-//               Mul    Add
-//                 \    /
-//                  \  /
-//                  Tuple
-//
-// Computation after fusion passes:
-//
-//                 Param
-//                   |
-//                Fusion2
-//                /     \
-//           Fusion0  Fusion1
-//                \    /
-//                 Tuple
-//
-// Computation after fusion merger pass (Fusion2 is merged into Fusion0 and
-// Fusion1, because bytes read from Param by Fusion2 is reduced for this test
-// which makes the merge operation into its operand below the bytes
-// transferred threshold.
-//
-//                   Param
-//                   /  \
-//             Fusion0  Fusion1
-//                   \    /
-//                   Tuple
-//
+// Fusion2 is merged into Fusion0 and Fusion1 because this test reduces the
+// bytes Fusion2 reads from Param, which keeps the bytes transferred after
+// merging below the threshold.
 TEST_F(FusionMergerTest, BytesTransferredThresholdNotExeceeded) {
-  BuildComputation2(/*add_extra_input=*/false);
-  // Run standard fusion passes.
-  EXPECT_TRUE(GpuInstructionFusion(/*may_duplicate=*/false)
-                  .Run(module_.get())
-                  .ValueOrDie());
-  EXPECT_FALSE(GpuInstructionFusion(/*may_duplicate=*/true)
-                   .Run(module_.get())
-                   .ValueOrDie());
+  auto module = tools::Parse(R"(
+HloModule BytesTransferredThresholdNotExeceeded
+
+comp.2 {
+  state.param_1.1 = (f32[4]{0}, f32[4]{0}, f32[4]{0}) parameter(0)
+  get-tuple-element.5 = f32[4]{0} get-tuple-element(state.param_1.1), index=0
+  get-tuple-element.6 = f32[4]{0} get-tuple-element(state.param_1.1), index=1
+  add.7 = f32[4]{0} add(get-tuple-element.5, get-tuple-element.6)
+  get-tuple-element.7 = f32[4]{0} get-tuple-element(state.param_1.1), index=2
+  ROOT add.8 = f32[4]{0} add(add.7, get-tuple-element.7)
+}
+
+comp.1 {
+  add.1.param_1.1 = f32[4]{0} parameter(1)
+  constant.param_1.3 = f32[4]{0} parameter(0)
+  add.5 = f32[4]{0} add(add.1.param_1.1, constant.param_1.3)
+  ROOT multiply.3 = f32[4]{0} multiply(add.5, constant.param_1.3)
+}
+
+comp {
+  add.1.param_1 = f32[4]{0} parameter(1)
+  constant.param_1.1 = f32[4]{0} parameter(0)
+  multiply.2 = f32[4]{0} multiply(add.1.param_1, constant.param_1.1)
+  ROOT add.4 = f32[4]{0} add(multiply.2, constant.param_1.1)
+}
+
+ENTRY BytesTransferredThresholdNotExeceeded.Computation2 {
+  constant = f32[4]{0} constant({1, 1, 1, 1})
+  state = (f32[4]{0}, f32[4]{0}, f32[4]{0}) parameter(0)
+  fusion.2 = f32[4]{0} fusion(state), kind=kLoop, calls=comp.2
+  fusion.3 = f32[4]{0} fusion(constant, fusion.2), kind=kLoop, calls=comp.1
+  fusion.4 = f32[4]{0} fusion(constant, fusion.2), kind=kLoop, calls=comp
+  ROOT tuple = (f32[4]{0}, f32[4]{0}) tuple(fusion.3, fusion.4)
+})")
+                    .ValueOrDie();
   // Run fusion merger pass, which should detect that the net bytes transferred
   // (if merged) would not increase.
-  EXPECT_TRUE(FusionMerger().Run(module_.get()).ValueOrDie());
+  EXPECT_TRUE(FusionMerger().Run(module.get()).ValueOrDie());
+}
+
+// Check that we're willing to merge f1_computation into f2_computation, even
+// though f2 is an input fusion node.
+TEST_F(FusionMergerTest, WillMergeIntoInputFusion) {
+  auto module = tools::Parse(R"(
+    HloModule m
+
+    f1_computation {
+      f1_p0 = f32[10]{0} parameter(0)
+      ROOT f1_root = f32[10]{0} add(f1_p0, f1_p0)
+    }
+
+    add_computation {
+      add_lhs = f32[] parameter(0)
+      add_rhs = f32[] parameter(1)
+      ROOT add_root = f32[] add(add_lhs, add_rhs)
+    }
+
+    f2_computation {
+      f2_p0 = f32[10]{0} parameter(0)
+      f2_mul = f32[10]{0} multiply(f2_p0, f2_p0)
+      f2_zero = f32[] constant(0)
+      ROOT f2_root = f32[] reduce(f2_mul, f2_zero), dimensions={0},
+             to_apply=add_computation
+    }
+
+    ENTRY entry {
+      p0 = f32[10]{0} parameter(0)
+      f1 = f32[10]{0} fusion(p0), kind=kLoop, calls=f1_computation
+      ROOT f2 = f32[] fusion(f1), kind=kInput, calls=f2_computation
+    })")
+                    .ValueOrDie();
+  EXPECT_TRUE(FusionMerger().Run(module.get()).ValueOrDie());
+  EXPECT_THAT(module->entry_computation()->root_instruction(),
+              op::Fusion(op::Parameter()));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc b/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
index ba482793e7632f0f423cc9da0dd9620bdf29c642..38668ff455a44c7ef99b57b750f1a3b18a90bd2c 100644
--- a/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
+++ b/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
@@ -49,7 +49,7 @@ struct MatrixDescriptor {
 // rhs_matrix, and stores the result to output_matrix.
 template <typename Element>
 bool DoGemm(MatrixDescriptor lhs_matrix, MatrixDescriptor rhs_matrix,
-            MatrixDescriptor output_matrix, se::Stream* stream) {
+            MatrixDescriptor output_matrix, double alpha, se::Stream* stream) {
   DCHECK(!output_matrix.transpose);
 
   se::DeviceMemory<Element> lhs_data(lhs_matrix.data);
@@ -65,7 +65,7 @@ bool DoGemm(MatrixDescriptor lhs_matrix, MatrixDescriptor rhs_matrix,
   return stream
       ->ThenBlasGemm(
           lhs_transpose, rhs_transpose, output_matrix.num_rows,
-          output_matrix.num_cols, /*size of reduce dim=*/k, /*alpha=*/1.0,
+          output_matrix.num_cols, /*size of reduce dim=*/k, /*alpha=*/alpha,
           lhs_data, /*leading dim of LHS=*/lhs_matrix.num_rows, rhs_data,
           /*leading dim of RHS=*/rhs_matrix.num_rows, /*beta=*/0.0,
           &output_data, /*leading dim of output=*/output_matrix.num_rows)
@@ -89,7 +89,7 @@ bool DoGemm(MatrixDescriptor lhs_matrix, MatrixDescriptor rhs_matrix,
 template <typename Element>
 bool DoGemmWithAlgorithm(MatrixDescriptor lhs_matrix,
                          MatrixDescriptor rhs_matrix,
-                         MatrixDescriptor output_matrix,
+                         MatrixDescriptor output_matrix, double alpha,
                          se::blas::ComputationType computation_type,
                          se::blas::AlgorithmType algorithm, se::Stream* stream,
                          se::blas::ProfileResult* output_profile_result) {
@@ -108,11 +108,13 @@ bool DoGemmWithAlgorithm(MatrixDescriptor lhs_matrix,
   return stream
       ->ThenBlasGemmWithAlgorithm(
           lhs_transpose, rhs_transpose, output_matrix.num_rows,
-          output_matrix.num_cols, /*size of reduce dim=*/k, /*alpha=*/1.0,
-          lhs_data, /*leading dim of LHS=*/lhs_matrix.num_rows, rhs_data,
-          /*leading dim of RHS=*/rhs_matrix.num_rows, /*beta=*/0.0,
-          &output_data, /*leading dim of output=*/output_matrix.num_rows,
-          computation_type, algorithm, output_profile_result)
+          output_matrix.num_cols, /*size of reduce dim=*/k,
+          /*alpha=*/static_cast<Element>(alpha), lhs_data,
+          /*leading dim of LHS=*/lhs_matrix.num_rows, rhs_data,
+          /*leading dim of RHS=*/rhs_matrix.num_rows,
+          /*beta=*/static_cast<Element>(0.0f), &output_data,
+          /*leading dim of output=*/output_matrix.num_rows, computation_type,
+          algorithm, output_profile_result)
       .ok();
 }
 
@@ -125,8 +127,8 @@ bool DoGemmWithAlgorithm(MatrixDescriptor lhs_matrix,
 template <typename Element>
 StatusOr<se::blas::AlgorithmType> DoGemmAutotune(
     MatrixDescriptor lhs_matrix, MatrixDescriptor rhs_matrix,
-    MatrixDescriptor output_matrix, se::blas::ComputationType computation_type,
-    se::Stream* stream) {
+    MatrixDescriptor output_matrix, double alpha,
+    se::blas::ComputationType computation_type, se::Stream* stream) {
   std::vector<se::blas::AlgorithmType> algorithms;
   CHECK(stream->parent()->GetBlasGemmAlgorithms(&algorithms));
 
@@ -138,8 +140,8 @@ StatusOr<se::blas::AlgorithmType> DoGemmAutotune(
     // non-null ProfileResult, DoGemmWithAlgorithm should always return true,
     // and the actual success-ness is returned in ProfileResult::is_valid.
     CHECK(DoGemmWithAlgorithm<Element>(lhs_matrix, rhs_matrix, output_matrix,
-                                       computation_type, algorithm, stream,
-                                       &profile_result));
+                                       alpha, computation_type, algorithm,
+                                       stream, &profile_result));
 
     if (profile_result.is_valid() && profile_result.elapsed_time_in_ms() <
                                          best_result.elapsed_time_in_ms()) {
@@ -161,6 +163,8 @@ StatusOr<se::blas::AlgorithmType> DoGemmAutotune(
 // DoGemm/DoGemmWithAlgorithm/DoGemmAutotune.
 auto GetGemmFn(PrimitiveType type) -> decltype(&DoGemm<float>) {
   switch (type) {
+    case F16:
+      return &DoGemm<Eigen::half>;
     case F32:
       return &DoGemm<float>;
     case F64:
@@ -172,6 +176,8 @@ auto GetGemmFn(PrimitiveType type) -> decltype(&DoGemm<float>) {
 auto GetGemmWithAlgorithmFn(PrimitiveType type)
     -> decltype(&DoGemmWithAlgorithm<float>) {
   switch (type) {
+    case F16:
+      return &DoGemmWithAlgorithm<Eigen::half>;
     case F32:
       return &DoGemmWithAlgorithm<float>;
     case F64:
@@ -182,6 +188,8 @@ auto GetGemmWithAlgorithmFn(PrimitiveType type)
 }
 auto GetGemmAutotuneFn(PrimitiveType type) -> decltype(&DoGemmAutotune<float>) {
   switch (type) {
+    case F16:
+      return &DoGemmAutotune<Eigen::half>;
     case F32:
       return &DoGemmAutotune<float>;
     case F64:
@@ -196,6 +204,10 @@ auto GetGemmAutotuneFn(PrimitiveType type) -> decltype(&DoGemmAutotune<float>) {
 // separately from the precision of the inputs and result.
 se::blas::ComputationType GetBlasComputationType(PrimitiveType type) {
   switch (type) {
+    case F16:
+      // Use F32 as computation type for F16 as we currently only implement the
+      // cuDNN pseudo half configuration for half precision.
+      return se::blas::ComputationType::kF32;
     case F32:
       return se::blas::ComputationType::kF32;
     case F64:
@@ -212,7 +224,8 @@ GemmThunk::GemmThunk(const BufferAllocation::Slice& lhs_buffer,
                      const BufferAllocation::Slice& output_buffer,
                      const Shape& lhs_shape, const Shape& rhs_shape,
                      const Shape& output_shape, bool transpose_lhs,
-                     bool transpose_rhs, const HloInstruction* hlo_instruction)
+                     bool transpose_rhs, double alpha,
+                     const HloInstruction* hlo_instruction)
     : Thunk(Kind::kGemm, hlo_instruction),
       lhs_buffer_(lhs_buffer),
       rhs_buffer_(rhs_buffer),
@@ -221,7 +234,8 @@ GemmThunk::GemmThunk(const BufferAllocation::Slice& lhs_buffer,
       rhs_shape_(rhs_shape),
       output_shape_(output_shape),
       transpose_lhs_(transpose_lhs),
-      transpose_rhs_(transpose_rhs) {}
+      transpose_rhs_(transpose_rhs),
+      alpha_(alpha) {}
 
 tensorflow::Status GemmThunk::ExecuteOnStream(
     const BufferAllocations& buffer_allocations, se::Stream* stream) {
@@ -290,7 +304,7 @@ tensorflow::Status GemmThunk::ExecuteOnStream(
     if (autotune_it == autotune_results_.end()) {
       StatusOr<se::blas::AlgorithmType> best_algorithm =
           GetGemmAutotuneFn(element_type)(lhs_matrix, rhs_matrix, output_matrix,
-                                          computation_type, stream);
+                                          alpha_, computation_type, stream);
       autotune_it =
           autotune_results_.insert({device_name, best_algorithm}).first;
 
@@ -311,12 +325,15 @@ tensorflow::Status GemmThunk::ExecuteOnStream(
       VLOG(2) << "Using algorithm " << algorithm
               << " chosen by autotuning on GemmThunk " << this;
       return GetGemmWithAlgorithmFn(element_type)(
-          lhs_matrix, rhs_matrix, output_matrix, computation_type, algorithm,
-          stream,
+          lhs_matrix, rhs_matrix, output_matrix, alpha_, computation_type,
+          algorithm, stream,
           /*output_profile_result=*/nullptr);
     }
+
+    // Autotune will fail when CUDA 8 and GPU sm_50 or older are used.
+    // Use the older Gemm API in this case.
     return GetGemmFn(element_type)(lhs_matrix, rhs_matrix, output_matrix,
-                                   stream);
+                                   alpha_, stream);
   };
 
   bool launch_ok;
diff --git a/tensorflow/compiler/xla/service/gpu/gemm_thunk.h b/tensorflow/compiler/xla/service/gpu/gemm_thunk.h
index 8c6a1f51a8a09ef78950dfe7e89994a3fe247f49..df3edcefef898d465cd5ddc53e5d06a966a31f88 100644
--- a/tensorflow/compiler/xla/service/gpu/gemm_thunk.h
+++ b/tensorflow/compiler/xla/service/gpu/gemm_thunk.h
@@ -34,15 +34,16 @@ namespace gpu {
 // This is thread-compatible.
 class GemmThunk : public Thunk {
  public:
-  // Constructs a thunk that computes "output = lhs <dot> rhs" using BLAS gemm.
-  // transpose_lhs and transpose_rhs indicate whether gemm should transpose the
-  // lhs and rhs operand. hlo_instruction is as in Thunk.
+  // Constructs a thunk that computes "output = (lhs <dot> rhs) * alpha" using
+  // BLAS gemm. transpose_lhs and transpose_rhs indicate whether gemm should
+  // transpose the lhs and rhs operands. hlo_instruction is as in Thunk. alpha
+  // is a constant scaling factor.
   GemmThunk(const BufferAllocation::Slice& lhs_buffer,
             const BufferAllocation::Slice& rhs_buffer,
             const BufferAllocation::Slice& output_buffer,
             const Shape& lhs_shape, const Shape& rhs_shape,
             const Shape& output_shape, bool transpose_lhs, bool transpose_rhs,
-            const HloInstruction* hlo_instruction);
+            double alpha, const HloInstruction* hlo_instruction);
 
   GemmThunk(const GemmThunk&) = delete;
   GemmThunk& operator=(const GemmThunk&) = delete;
@@ -72,6 +73,7 @@ class GemmThunk : public Thunk {
 
   const bool transpose_lhs_;
   const bool transpose_rhs_;
+  const double alpha_;
 
   // Maps device names (StreamExecutor::DeviceDescription::name()) to autotune
   // results.  The map's value is the best algorithm we've found for this thunk
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc b/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc
index 28ebd034ee0c89137f4e6eb417d8a37f4a00af7a..07be2a0cf90c326af6e41764e79950db546e43e4 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc
@@ -33,8 +33,10 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/buffer_assignment.h"
 #include "tensorflow/compiler/xla/service/buffer_liveness.h"
 #include "tensorflow/compiler/xla/service/call_inliner.h"
+#include "tensorflow/compiler/xla/service/conditional_simplifier.h"
 #include "tensorflow/compiler/xla/service/dot_decomposer.h"
 #include "tensorflow/compiler/xla/service/flatten_call_graph.h"
+#include "tensorflow/compiler/xla/service/gather_expander.h"
 #include "tensorflow/compiler/xla/service/gpu/cudnn_batchnorm_rewriter.h"
 #include "tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.h"
 #include "tensorflow/compiler/xla/service/gpu/cudnn_convolution_rewriter.h"
@@ -164,6 +166,9 @@ tensorflow::Status OptimizeHloModule(HloModule* hlo_module,
           /*rewrite_grad_op=*/true,
           /*use_fusion=*/false);
 
+      // Rewrite gather ops into smaller ones.
+      pass.AddPass<GatherExpander>();
+
       // BatchNormExpander can create zero-sized ops, so zero-sized HLO
       // elimination has to come after that pass.
       pipeline.AddPass<ZeroSizedHloElimination>();
@@ -176,6 +181,7 @@ tensorflow::Status OptimizeHloModule(HloModule* hlo_module,
       pass.AddPass<HloDCE>();
       pass.AddPass<ReshapeMover>();
       pass.AddPass<HloConstantFolding>();
+      pass.AddPass<ConditionalSimplifier>();
     }
 
     pipeline.AddPass<TransposeFolding>(
@@ -241,6 +247,22 @@ tensorflow::Status OptimizeHloModule(HloModule* hlo_module,
     TF_RETURN_IF_ERROR(pipeline.Run(hlo_module).status());
   }
 
+  {
+    HloPassPipeline pipeline("layout_assignment");
+    pipeline.AddPass<GpuLayoutAssignment>(
+        hlo_module->mutable_entry_computation_layout());
+
+    // The LayoutAssignment pass may leave behind kCopy instructions which are
+    // duplicate or NOPs, so remove them with algebraic simplification and CSE.
+    pipeline.AddPass<HloPassFix<AlgebraicSimplifier>>(
+        /*is_layout_sensitive=*/true,
+        /*valid_bitcast_callback=*/[](const Shape&, const Shape&) {
+          return true;
+        });
+    pipeline.AddPass<HloCSE>(/*is_layout_sensitive=*/true);
+    TF_RETURN_IF_ERROR(pipeline.Run(hlo_module).status());
+  }
+
   {
     HloPassFix<HloPassPipeline> fusion("fusion");
     fusion.AddInvariantChecker<HloVerifier>();
@@ -277,15 +299,6 @@ tensorflow::Status PrepareHloModuleForIrEmitting(HloModule* hlo_module) {
   HloPassPipeline pipeline("GPU-ir-emit-prepare");
   pipeline.AddInvariantChecker<HloVerifier>();
 
-  pipeline.AddPass<GpuLayoutAssignment>(
-      hlo_module->mutable_entry_computation_layout());
-
-  // The LayoutAssignment pass may leave behind kCopy instructions which are
-  // duplicate or NOPs, so remove them with algebraic simplification and CSE.
-  pipeline.AddPass<HloPassFix<AlgebraicSimplifier>>(
-      /*is_layout_sensitive=*/true,
-      [](const Shape&, const Shape&) { return true; });
-  pipeline.AddPass<HloCSE>(/*is_layout_sensitive=*/true);
   // Copy insertion should be performed immediately before IR emission to avoid
   // inserting unnecessary copies (later pass adds an instruction which
   // materializes the value) or missing a necessary copy (later pass removes an
@@ -658,6 +671,8 @@ StatusOr<std::unique_ptr<Executable>> GpuCompiler::RunBackend(
 
   if (module->config().hlo_profiling_enabled()) {
     HloCostAnalysis cost_analysis(ShapeSizeBytesFunction());
+    cost_analysis.set_bytes_per_second(
+        stream_exec->GetDeviceDescription().memory_bandwidth());
     TF_RETURN_IF_ERROR(module->entry_computation()->Accept(&cost_analysis));
     profile_index_map = MakeUnique<HloProfileIndexMap>(*module);
     profile_printer =
diff --git a/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc b/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc
index b5962f069bf499c913bd5479f263a7cb77c00555..85ecbe8fdb34700ca738b99ddd9ea615afc35da3 100644
--- a/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc
+++ b/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc
@@ -25,13 +25,19 @@ namespace gpu {
 namespace {
 
 bool IsFusile(const HloInstruction& hlo) {
+  // Don't fuse get-tuple-element on GPU: We can, but it's slower than not
+  // fusing.  We never generate kernels for unfused GTEs.  Instead, if an
+  // unfused GTE is an input to a kernel (including a fusion kernel), we
+  // compute the address of the GTE at the top of the kernel.  Often we know the
+  // address of the GTE result statically, so we can do this without chasing any
+  // pointers.
   return (hlo.IsElementwise() && hlo.operand_count() > 0) ||
+         hlo.opcode() == HloOpcode::kBitcast ||
          hlo.opcode() == HloOpcode::kBroadcast ||
          hlo.opcode() == HloOpcode::kConcatenate ||
          hlo.opcode() == HloOpcode::kDynamicSlice ||
          hlo.opcode() == HloOpcode::kDynamicUpdateSlice ||
          hlo.opcode() == HloOpcode::kFusion ||
-         hlo.opcode() == HloOpcode::kGetTupleElement ||
          hlo.opcode() == HloOpcode::kPad ||
          hlo.opcode() == HloOpcode::kReduce ||
          hlo.opcode() == HloOpcode::kReduceWindow ||
@@ -46,6 +52,34 @@ bool GpuInstructionFusion::ShouldFuse(HloInstruction* consumer,
                                       int64 operand_index) {
   HloInstruction* producer = consumer->mutable_operand(operand_index);
 
+  // Check if we can use output fusion for (A @ B) * alpha
+  if (producer->opcode() == HloOpcode::kDot) {
+    if (consumer->opcode() == HloOpcode::kMultiply) {
+      CHECK_EQ(consumer->operand_count(), 2);
+      int64 other_operand_index = 1 - operand_index;
+      const HloInstruction* alpha = consumer->operand(other_operand_index);
+      if (alpha->opcode() == HloOpcode::kConstant &&
+          ShapeUtil::IsScalar(alpha->shape())) {
+        return true;
+      }
+    }
+  }
+
+  // Only a transpose may be fused into an output fusion.
+  if (consumer->opcode() == HloOpcode::kFusion &&
+      consumer->fusion_kind() == HloInstruction::FusionKind::kOutput) {
+    if (producer->opcode() != HloOpcode::kTranspose) {
+      return false;
+    }
+    // Check that the transpose is the operand of a dot.
+    auto producer_operand_index = consumer->operand_index(producer);
+    auto fused_parameter = consumer->fused_parameter(producer_operand_index);
+    const std::vector<HloInstruction*>& fused_parameter_users =
+        fused_parameter->users();
+    return (fused_parameter_users.size() == 1 &&
+            fused_parameter_users[0]->opcode() == HloOpcode::kDot);
+  }
+
   // Output fusion is not currently supported on GPUs.
   if (producer->opcode() == HloOpcode::kFusion) {
     return false;
@@ -70,17 +104,6 @@ bool GpuInstructionFusion::ShouldFuse(HloInstruction* consumer,
     return false;
   }
 
-  // We may need to know original operand layout to emit input fusion, and so
-  // far, we merely use the layout of an operand of the fusion node, which means
-  // we must fuse only elementwise operations. This restriction should be lifted
-  // later if we need to fuse other operations, e.g. transpose, for performance.
-  if ((IsReductionToVector(*consumer) ||
-       (HloOpcode::kFusion == consumer->opcode() &&
-        HloInstruction::FusionKind::kInput == consumer->fusion_kind())) &&
-      !producer->IsElementwise()) {
-    return false;
-  }
-
   // Cost condition: not fuse (simple, expensive producers) and (consumers who
   // reuse operand elements).
   if (producer->opcode() != HloOpcode::kFusion &&
@@ -98,6 +121,9 @@ HloInstruction::FusionKind GpuInstructionFusion::ChooseKind(
   if (IsReductionToVector(*consumer)) {
     return HloInstruction::FusionKind::kInput;
   }
+  if (producer->opcode() == HloOpcode::kDot) {
+    return HloInstruction::FusionKind::kOutput;
+  }
   if (HloOpcode::kFusion == consumer->opcode()) {
     return consumer->fusion_kind();
   }
diff --git a/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc b/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc
index 2d6dad27a59978da6e4719afc50ebee5e641dde0..4b231c449f8f101127b4d30bfff20c69d8cef5c1 100644
--- a/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace op = xla::testing::opcode_matchers;
 
@@ -137,30 +138,119 @@ TEST_F(InstructionFusionTest, PotentialBitcastTransposeOfDotUnfused) {
                    .ValueOrDie());
 }
 
-TEST_F(InstructionFusionTest, GetTupleElementFused) {
-  HloComputation::Builder builder(TestName());
-  Shape data_shape = ShapeUtil::MakeShape(F32, {8});
-  Shape tuple_shape = ShapeUtil::MakeTupleShape({data_shape, data_shape});
-  auto param = builder.AddInstruction(
-      HloInstruction::CreateParameter(0, tuple_shape, "param"));
-  auto gte0 = builder.AddInstruction(
-      HloInstruction::CreateGetTupleElement(data_shape, param, 0));
-  auto gte1 = builder.AddInstruction(
-      HloInstruction::CreateGetTupleElement(data_shape, param, 1));
-  builder.AddInstruction(
-      HloInstruction::CreateBinary(data_shape, HloOpcode::kAdd, gte0, gte1));
-  auto module = CreateNewModule();
-  auto computation = module->AddEntryComputation(builder.Build());
+// Tests that broadcasts are fused into a fusion with a reduce root.
+TEST_F(InstructionFusionTest, BroadcastIntoReduce) {
+  auto module = tools::Parse(R"(
+    HloModule test_module
+
+    add {
+      lhs = f32[] parameter(0)
+      rhs = f32[] parameter(1)
+      ROOT add = f32[] add(lhs, rhs)
+    }
+
+    ENTRY BroadcastIntoReduce {
+      constant = f32[] constant(1)
+      broadcast = f32[16,16,16,16]{3,2,1,0} broadcast(constant), dimensions={}
+      constant.1 = f32[] constant(0)
+      ROOT reduce = f32[] reduce(broadcast, constant.1), dimensions={0,1,2,3},
+                                                         to_apply=add
+    })")
+                    .ValueOrDie();
+
+  EXPECT_TRUE(GpuInstructionFusion(/*may_duplicate=*/true)
+                  .Run(module.get())
+                  .ValueOrDie());
+
+  HloInstruction* root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Fusion());
+  EXPECT_THAT(root->fused_expression_root(),
+              op::Reduce(op::Broadcast(op::Parameter()), op::Parameter()));
+}
+
+TEST_F(InstructionFusionTest, BitcastIntoAdd) {
+  auto module = tools::Parse(R"(
+    HloModule test_module
+
+    ENTRY BitcastIntoAdd {
+      p0 = f32[4,1,1]{2,1,0} parameter(0)
+      p1 = f32[4,1]{1,0} parameter(1)
+      bitcast = f32[4,1]{1,0} bitcast(p0)
+      ROOT add = f32[4,1] add(bitcast, p1)
+    })")
+                    .ValueOrDie();
+
+  EXPECT_TRUE(GpuInstructionFusion(/*may_duplicate=*/true)
+                  .Run(module.get())
+                  .ValueOrDie());
+
+  HloInstruction* root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Fusion());
+  EXPECT_THAT(root->fused_expression_root(),
+              op::Add(op::Bitcast(op::Parameter()), op::Parameter()));
+}
+
+TEST_F(InstructionFusionTest, AddIntoBitcast) {
+  auto module = tools::Parse(R"(
+    HloModule test_module
+
+    ENTRY AddIntoBitcast {
+      p0 = f32[4,1,1]{2,1,0} parameter(0)
+      p1 = f32[4,1]{1,0} parameter(1)
+      add = f32[4,1] add(p0, p1)
+      ROOT bitcast = f32[4,1,1] bitcast(add)
+    })")
+                    .ValueOrDie();
+
+  EXPECT_TRUE(GpuInstructionFusion(/*may_duplicate=*/true)
+                  .Run(module.get())
+                  .ValueOrDie());
+
+  HloInstruction* root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Fusion());
+  EXPECT_THAT(root->fused_expression_root(),
+              op::Bitcast(op::Add(op::Parameter(), op::Parameter())));
+}
+
+TEST_F(InstructionFusionTest, DontFuseGTE) {
+  auto module = tools::Parse(R"(
+  HloModule test_module
+  ENTRY DontFuseGTE {
+    p0 = (f32[10], f32[10]) parameter(0)
+    gte0 = f32[10] get-tuple-element(p0), index=0
+    gte1 = f32[10] get-tuple-element(p0), index=1
+    ROOT add = f32[10] add(gte0, gte1)
+  })")
+                    .ValueOrDie();
+
+  EXPECT_FALSE(GpuInstructionFusion(/*may_duplicate=*/true)
+                   .Run(module.get())
+                   .ValueOrDie());
+}
+
+TEST_F(InstructionFusionTest, DotOutputFusion) {
+  auto module = tools::Parse(R"(
+  HloModule test_module
+  ENTRY OutputFusion {
+    constant = f32[] constant(3)
+    p0 = f32[4,3]{1,0} parameter(0)
+    p1 = f32[4,3]{1,0} parameter(1)
+    transpose = f32[3,4]{1,0} transpose(p1), dimensions={1, 0}
+    dot = f32[4,4]{1,0} dot(p0, transpose)
+    ROOT mul = f32[4,4] multiply(constant, dot)
+  })")
+                    .ValueOrDie();
+
   EXPECT_TRUE(GpuInstructionFusion(/*may_duplicate=*/true)
                   .Run(module.get())
                   .ValueOrDie());
-  HloInstruction* root = computation->root_instruction();
-  EXPECT_EQ(HloOpcode::kFusion, root->opcode());
-  HloInstruction* fused_root = root->fused_expression_root();
-  EXPECT_EQ(HloOpcode::kAdd, fused_root->opcode());
-  // Check that operands of 'fused_root' are GTE.
-  EXPECT_EQ(HloOpcode::kGetTupleElement, fused_root->operand(0)->opcode());
-  EXPECT_EQ(HloOpcode::kGetTupleElement, fused_root->operand(1)->opcode());
+
+  HloInstruction* root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Fusion());
+  EXPECT_THAT(
+      root->fused_expression_root(),
+      op::Multiply(op::Parameter(),
+                   op::Dot(op::Parameter(), op::Transpose(op::Parameter()))));
 }
 
 }  // namespace gpu
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc b/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
index 2f65edffea81db7dba1f8545f92b27ea622044e7..32413f975a40c1abc334b16e81097bb44f56a44a 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
@@ -49,8 +49,10 @@ bool AreValidGemmShapes(const Shape& lhs_shape, const Shape& rhs_shape,
   // The inputs and the output must
   // 1) be matrices with no padding and a non-zero number of elements,
   // 2) have an allowed element type.
-  bool type_is_allowed = (output_shape.element_type() == F32 ||
-                          output_shape.element_type() == F64);
+  PrimitiveType output_primitive_type = output_shape.element_type();
+  bool type_is_allowed =
+      (output_primitive_type == F16 || output_primitive_type == F32 ||
+       output_primitive_type == F64);
   return type_is_allowed && IsRank2WithNoPadding(lhs_shape) &&
          IsRank2WithNoPadding(rhs_shape) &&
          IsRank2WithNoPadding(output_shape) &&
@@ -87,6 +89,19 @@ bool ImplementedAsGemm(const HloInstruction& hlo) {
     return true;
   }
 
+  if (hlo.opcode() == HloOpcode::kFusion &&
+      hlo.fusion_kind() == HloInstruction::FusionKind::kOutput &&
+      hlo.fused_expression_root()->opcode() == HloOpcode::kMultiply) {
+    // Try to find the dot inside the output fusion node.
+    const HloInstruction* dot = hlo.fused_expression_root()->operand(0);
+    if (dot->opcode() != HloOpcode::kDot) {
+      dot = hlo.fused_expression_root()->operand(1);
+    }
+    if (dot->opcode() == HloOpcode::kDot) {
+      return ImplementedAsGemm(*dot);
+    }
+  }
+
   return false;
 }
 
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter.cc b/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
index a3df67a87344d6ece2ea9047321ad9542c13f8cf..1e0db2821a2c212d0f212ae94ab69231bc6053ea 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include <string>
 #include <unordered_map>
+#include <utility>
 
 #include "tensorflow/core/platform/logging.h"
 // IWYU pragma: no_include "llvm/IR/Intrinsics.gen.inc"
@@ -438,6 +439,32 @@ Status IrEmitter::HandleSelect(HloInstruction* select) {
   return IrEmitter::DefaultAction(select);
 }
 
+namespace {
+llvm::Value* Real(llvm::Value* x, llvm::IRBuilder<>* ir_builder) {
+  return ir_builder->CreateExtractValue(x, {0});
+}
+
+llvm::Value* Imag(llvm::Value* x, llvm::IRBuilder<>* ir_builder) {
+  return ir_builder->CreateExtractValue(x, {1});
+}
+
+std::pair<llvm::Value*, llvm::Value*> MultiplyComplex(
+    llvm::Value* lhs_value, llvm::Value* rhs_value,
+    llvm::IRBuilder<>* ir_builder) {
+  llvm::Value* lhs_real = Real(lhs_value, ir_builder);
+  llvm::Value* lhs_imag = Imag(lhs_value, ir_builder);
+  llvm::Value* rhs_real = Real(rhs_value, ir_builder);
+  llvm::Value* rhs_imag = Imag(rhs_value, ir_builder);
+  llvm::Value* real_result1 = ir_builder->CreateFMul(lhs_real, rhs_real);
+  llvm::Value* real_result2 = ir_builder->CreateFMul(lhs_imag, rhs_imag);
+  llvm::Value* real_result = ir_builder->CreateFSub(real_result1, real_result2);
+  llvm::Value* imag_result1 = ir_builder->CreateFMul(lhs_real, rhs_imag);
+  llvm::Value* imag_result2 = ir_builder->CreateFMul(lhs_imag, rhs_real);
+  llvm::Value* imag_result = ir_builder->CreateFAdd(imag_result1, imag_result2);
+  return {real_result, imag_result};
+}
+}  // namespace
+
 Status IrEmitter::HandleDot(HloInstruction* dot) {
   auto lhs_instruction = dot->operand(0);
   auto rhs_instruction = dot->operand(1);
@@ -456,21 +483,10 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
         rhs_array.EmitReadArrayElement(/*index=*/{}, &ir_builder_);
     llvm::Value* result;
     if (ShapeUtil::ElementIsComplex(lhs_shape)) {
-      auto real = [&](llvm::Value* x) {
-        return ir_builder_.CreateExtractValue(x, {0});
-      };
-      auto imag = [&](llvm::Value* x) {
-        return ir_builder_.CreateExtractValue(x, {1});
-      };
-      llvm::Value* real_result = ir_builder_.CreateFSub(
-          ir_builder_.CreateFMul(real(lhs_value), real(rhs_value)),
-          ir_builder_.CreateFMul(imag(lhs_value), imag(rhs_value)));
-      llvm::Value* imag_result = ir_builder_.CreateFAdd(
-          ir_builder_.CreateFMul(real(lhs_value), imag(rhs_value)),
-          ir_builder_.CreateFMul(imag(lhs_value), real(rhs_value)));
+      auto value = MultiplyComplex(lhs_value, rhs_value, &ir_builder_);
       result = llvm::ConstantAggregateZero::get(lhs_array.GetElementLlvmType());
-      result = ir_builder_.CreateInsertValue(result, real_result, {0});
-      result = ir_builder_.CreateInsertValue(result, imag_result, {1});
+      result = ir_builder_.CreateInsertValue(result, value.first, {0});
+      result = ir_builder_.CreateInsertValue(result, value.second, {1});
     } else {
       result = ir_builder_.CreateFMul(lhs_value, rhs_value);
     }
@@ -548,20 +564,13 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
   llvm::Value* accum = ir_builder_.CreateLoad(accum_address);
   llvm::Value* updated_accum;
   if (ShapeUtil::ElementIsComplex(lhs_shape)) {
-#define REAL(x) ir_builder_.CreateExtractValue(x, {0})
-#define IMAG(x) ir_builder_.CreateExtractValue(x, {1})
-    llvm::Value* product_real = ir_builder_.CreateFSub(
-        ir_builder_.CreateFMul(REAL(lhs_element), REAL(rhs_element)),
-        ir_builder_.CreateFMul(IMAG(lhs_element), IMAG(rhs_element)));
-    llvm::Value* product_imag = ir_builder_.CreateFAdd(
-        ir_builder_.CreateFMul(REAL(lhs_element), IMAG(rhs_element)),
-        ir_builder_.CreateFMul(IMAG(lhs_element), REAL(rhs_element)));
-    updated_accum = ir_builder_.CreateInsertValue(
-        accum, ir_builder_.CreateFAdd(REAL(accum), product_real), {0});
-    updated_accum = ir_builder_.CreateInsertValue(
-        updated_accum, ir_builder_.CreateFAdd(IMAG(accum), product_imag), {1});
-#undef IMAG
-#undef REAL
+    auto value = MultiplyComplex(lhs_element, rhs_element, &ir_builder_);
+    llvm::Value* accum_real = Real(accum, &ir_builder_);
+    llvm::Value* real_sum = ir_builder_.CreateFAdd(accum_real, value.first);
+    updated_accum = ir_builder_.CreateInsertValue(accum, real_sum, {0});
+    llvm::Value* accum_imag = Imag(accum, &ir_builder_);
+    llvm::Value* imag_sum = ir_builder_.CreateFAdd(accum_imag, value.second);
+    updated_accum = ir_builder_.CreateInsertValue(updated_accum, imag_sum, {1});
   } else {
     llvm::Value* product = ir_builder_.CreateFMul(lhs_element, rhs_element);
     updated_accum = ir_builder_.CreateFAdd(accum, product);
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
index 30c88c0a5d38f6ea3f94d3b47b7b69c7122bf6ac..199e6b787413c5e0fb1435c62f1fc3b83fc6eba3 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include <algorithm>
+#include <cstring>
 #include <memory>
 #include <string>
 #include <vector>
@@ -44,6 +46,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emitter_context.h"
 #include "tensorflow/compiler/xla/service/gpu/kernel_thunk.h"
+#include "tensorflow/compiler/xla/service/gpu/memset_thunk.h"
 #include "tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.h"
 #include "tensorflow/compiler/xla/service/gpu/partition_assignment.h"
 #include "tensorflow/compiler/xla/service/gpu/sequential_thunk.h"
@@ -498,12 +501,11 @@ Status IrEmitterUnnested::HandleFusion(HloInstruction* fusion) {
     switch (root->opcode()) {
       case HloOpcode::kReduce: {
         VLOG(3) << "Emitting fused reduction to vector: " << fusion->ToString();
+        TF_ASSIGN_OR_RETURN(std::unique_ptr<Thunk> initializer_thunk,
+                            BuildInitializerThunk(fusion));
         std::vector<std::unique_ptr<Thunk>> thunks;
-        thunks.emplace_back(BuildKernelThunk(fusion));
-        TF_RETURN_IF_ERROR(EmitInitializer(
-            fusion, static_cast<KernelThunk*>(thunks.back().get())));
-        bindings_.UnbindAllLocalIrValues();
-        thunks.emplace_back(BuildKernelThunk(fusion));
+        thunks.push_back(std::move(initializer_thunk));
+        thunks.push_back(BuildKernelThunk(fusion));
         thunk_sequence_->emplace_back(
             MakeUnique<SequentialThunk>(std::move(thunks), fusion));
         std::vector<llvm_ir::IrArray> parameter_arrays;
@@ -517,39 +519,6 @@ Status IrEmitterUnnested::HandleFusion(HloInstruction* fusion) {
         TF_RETURN_IF_ERROR(root->Accept(&fused_emitter));
 
         Shape input_shape = root->operand(0)->shape();
-        // EmitReductionToVector requires the input shape to have a layout, but
-        // fused instructions don't have one. So we determine its layout from
-        // the fusion's operands. The choice of the layout only affects
-        // performance but not correctness.
-        auto choose_input_layout = [](
-            tensorflow::gtl::ArraySlice<const HloInstruction*> operands,
-            Shape* input_shape) -> Status {
-          // Prefer the layout of an operand whose shape is compatible with
-          // input_shape.
-          for (const HloInstruction* operand : operands) {
-            if (ShapeUtil::Compatible(*input_shape, operand->shape())) {
-              return LayoutUtil::CopyLayoutBetweenShapes(operand->shape(),
-                                                         input_shape);
-            }
-          }
-          // If no operand has a compatible shape, prefer an operand that has
-          // the same rank at least.
-          for (const HloInstruction* operand : operands) {
-            if (ShapeUtil::Rank(*input_shape) ==
-                ShapeUtil::Rank(operand->shape())) {
-              // Do not use CopyLayoutBetweenShapes because input_shape and
-              // operand->shape() may be incompatible.
-              *input_shape->mutable_layout() = operand->shape().layout();
-              return Status::OK();
-            }
-          }
-          // When all the above fails, which is rare, set the default layout.
-          LayoutUtil::SetToDefaultLayout(input_shape);
-          return Status::OK();
-        };
-        TF_RETURN_IF_ERROR(
-            choose_input_layout(fusion->operands(), &input_shape));
-
         return EmitReductionToVector(
             root, input_shape, fused_emitter.GetGenerator(root->operand(0)),
             fused_emitter.GetGenerator(root->operand(1)), root->dimensions(),
@@ -1668,14 +1637,14 @@ Status IrEmitterUnnested::HandleReduce(HloInstruction* reduce) {
   if (IsReductionToVector(*reduce) &&
       // NVPTX backend can't do atomic cmpxchg any narrower than 32 bits
       32 <= primitive_util::BitWidth(reduce->shape().element_type())) {
+    TF_ASSIGN_OR_RETURN(std::unique_ptr<Thunk> initializer_thunk,
+                        BuildInitializerThunk(reduce));
     std::vector<std::unique_ptr<Thunk>> thunks;
-    thunks.emplace_back(BuildKernelThunk(reduce));
-    TF_RETURN_IF_ERROR(EmitInitializer(
-        reduce, static_cast<KernelThunk*>(thunks.back().get())));
-    bindings_.UnbindAllLocalIrValues();
-    thunks.emplace_back(BuildKernelThunk(reduce));
+    thunks.push_back(std::move(initializer_thunk));
+    thunks.push_back(BuildKernelThunk(reduce));
     thunk_sequence_->emplace_back(
         MakeUnique<SequentialThunk>(std::move(thunks), reduce));
+
     return EmitReductionToVector(
         reduce, input->shape(),
         [&](const llvm_ir::IrArray::Index& index) {
@@ -1739,16 +1708,13 @@ Status IrEmitterUnnested::HandleSelectAndScatter(
   CHECK_EQ(rank, ShapeUtil::Rank(source->shape()));
   CHECK_EQ(rank, window.dimensions_size());
 
-  {
-    std::vector<std::unique_ptr<Thunk>> thunks;
-    thunks.emplace_back(BuildKernelThunk(select_and_scatter));
-    TF_RETURN_IF_ERROR(EmitInitializer(
-        select_and_scatter, static_cast<KernelThunk*>(thunks.back().get())));
-    bindings_.UnbindAllLocalIrValues();
-    thunks.emplace_back(BuildKernelThunk(select_and_scatter));
-    thunk_sequence_->emplace_back(
-        MakeUnique<SequentialThunk>(std::move(thunks), select_and_scatter));
-  }
+  TF_ASSIGN_OR_RETURN(std::unique_ptr<Thunk> initializer_thunk,
+                      BuildInitializerThunk(select_and_scatter));
+  std::vector<std::unique_ptr<Thunk>> thunks;
+  thunks.push_back(std::move(initializer_thunk));
+  thunks.push_back(BuildKernelThunk(select_and_scatter));
+  thunk_sequence_->emplace_back(
+      MakeUnique<SequentialThunk>(std::move(thunks), select_and_scatter));
 
   // TODO(b/31410564): Implement dilation rate for select-and-scatter.
   if (window_util::HasDilation(window)) {
@@ -2013,11 +1979,22 @@ GetHloBufferSlices(const HloInstruction* hlo,
       }
     }
 
-    // If *that* didn't work, check whether instr is a GTE instruction.  If it
-    // is, see if we can get a buffer for its parent, and continue walking up
-    // parents until we find a defined buffer or we hit something that's not a
-    // GTE.
+    // If *that* didn't work, walk up any bitcasts that we might see.  These
+    // must appear before any GTE instructions, because it's illegal to bitcast
+    // to a tuple type.
     const HloInstruction* parent = instr;
+    while (parent->opcode() == HloOpcode::kBitcast) {
+      parent = parent->operand(0);
+
+      auto slice = GetKnownAtRuntimeSlice(parent, {}, buffer_assn);
+      if (slice.has_value()) {
+        return {{*slice, gte_indices}};
+      }
+    }
+
+    // Finally, check whether instr is a GTE instruction.  If it is, see if we
+    // can get a buffer for its parent, and continue walking up parents until we
+    // find a defined buffer or we hit something that's not a GTE.
     while (parent->opcode() == HloOpcode::kGetTupleElement) {
       gte_indices.push_front(parent->tuple_index());
       parent = parent->operand(0);
@@ -2069,7 +2046,7 @@ Status IrEmitterUnnested::HandleGather(HloInstruction* gather) {
   return Unimplemented("Gather is not implemented on GPUs.");
 }
 
-std::unique_ptr<Thunk> IrEmitterUnnested::BuildKernelThunk(
+std::unique_ptr<KernelThunk> IrEmitterUnnested::BuildKernelThunk(
     const HloInstruction* inst) {
   const BufferAssignment& buffer_assn =
       ir_emitter_context_->buffer_assignment();
@@ -2221,31 +2198,63 @@ std::unique_ptr<Thunk> IrEmitterUnnested::BuildGemmThunk(
         inst->shape(),              // The shape of the output.
         false,                      // Do not transpose LHS.
         false,                      // Do not transpose RHS.
+        1.0,                        // alpha.
         inst);
   }
 
   if (inst->opcode() == HloOpcode::kFusion) {
-    const HloInstruction* dot = inst->fused_expression_root();
-    DCHECK(dot->opcode() == HloOpcode::kDot);
-    const HloInstruction* lhs_parameter = StripTranspose(*dot->operand(0));
-    const HloInstruction* rhs_parameter = StripTranspose(*dot->operand(1));
-    DCHECK(lhs_parameter->opcode() == HloOpcode::kParameter &&
-           rhs_parameter->opcode() == HloOpcode::kParameter);
-    const HloInstruction* lhs =
-        inst->operand(lhs_parameter->parameter_number());
-    const HloInstruction* rhs =
-        inst->operand(rhs_parameter->parameter_number());
-
-    return MakeUnique<GemmThunk>(
-        GetAllocationSlice(*lhs),             // The buffer assigned to LHS.
-        GetAllocationSlice(*rhs),             // The buffer assigned to RHS.
-        GetAllocationSlice(*inst),            // The output buffer.
-        lhs->shape(),                         // The shape of LHS.
-        rhs->shape(),                         // The shape of RHS.
-        inst->shape(),                        // The shape of the output.
-        dot->operand(0)->IsRank2Transpose(),  // Transpose LHS.
-        dot->operand(1)->IsRank2Transpose(),  // Trasnpose RHS.
-        inst);
+    if (inst->fusion_kind() == HloInstruction::FusionKind::kOutput) {
+      const HloInstruction* mul = inst->fused_expression_root();
+      const HloInstruction* dot = mul->operand(0);
+      const HloInstruction* alpha = mul->operand(1);
+      if (dot->opcode() != HloOpcode::kDot) {
+        std::swap(dot, alpha);
+      }
+      DCHECK(dot->opcode() == HloOpcode::kDot);
+      const HloInstruction* lhs_parameter = StripTranspose(*dot->operand(0));
+      const HloInstruction* rhs_parameter = StripTranspose(*dot->operand(1));
+      DCHECK(lhs_parameter->opcode() == HloOpcode::kParameter &&
+             rhs_parameter->opcode() == HloOpcode::kParameter);
+      const HloInstruction* lhs =
+          inst->operand(lhs_parameter->parameter_number());
+      const HloInstruction* rhs =
+          inst->operand(rhs_parameter->parameter_number());
+
+      return MakeUnique<GemmThunk>(
+          GetAllocationSlice(*lhs),             // The buffer assigned to LHS.
+          GetAllocationSlice(*rhs),             // The buffer assigned to RHS.
+          GetAllocationSlice(*inst),            // The output buffer.
+          lhs->shape(),                         // The shape of LHS.
+          rhs->shape(),                         // The shape of RHS.
+          inst->shape(),                        // The shape of the output.
+          dot->operand(0)->IsRank2Transpose(),  // Transpose LHS.
+          dot->operand(1)->IsRank2Transpose(),  // Transpose RHS.
+          alpha->literal().Get<double>({0}),    // alpha.
+          inst);
+    } else {
+      const HloInstruction* dot = inst->fused_expression_root();
+      DCHECK(dot->opcode() == HloOpcode::kDot);
+      const HloInstruction* lhs_parameter = StripTranspose(*dot->operand(0));
+      const HloInstruction* rhs_parameter = StripTranspose(*dot->operand(1));
+      DCHECK(lhs_parameter->opcode() == HloOpcode::kParameter &&
+             rhs_parameter->opcode() == HloOpcode::kParameter);
+      const HloInstruction* lhs =
+          inst->operand(lhs_parameter->parameter_number());
+      const HloInstruction* rhs =
+          inst->operand(rhs_parameter->parameter_number());
+
+      return MakeUnique<GemmThunk>(
+          GetAllocationSlice(*lhs),             // The buffer assigned to LHS.
+          GetAllocationSlice(*rhs),             // The buffer assigned to RHS.
+          GetAllocationSlice(*inst),            // The output buffer.
+          lhs->shape(),                         // The shape of LHS.
+          rhs->shape(),                         // The shape of RHS.
+          inst->shape(),                        // The shape of the output.
+          dot->operand(0)->IsRank2Transpose(),  // Transpose LHS.
+          dot->operand(1)->IsRank2Transpose(),  // Transpose RHS.
+          1.0,                                  // alpha.
+          inst);
+    }
   }
 
   LOG(FATAL) << "Cannot build a GemmThunk for " << inst->ToString();
@@ -2261,37 +2270,87 @@ std::unique_ptr<Thunk> IrEmitterUnnested::BuildFftThunk(
                               /*output_shape=*/inst->shape(), inst);
 }
 
-Status IrEmitterUnnested::EmitInitializer(const HloInstruction* hlo,
-                                          KernelThunk* thunk) {
+StatusOr<std::unique_ptr<Thunk>> IrEmitterUnnested::BuildInitializerThunk(
+    const HloInstruction* hlo) {
   bool fused = HloOpcode::kFusion == hlo->opcode();
-
   const HloInstruction* inst = fused ? hlo->fused_expression_root() : hlo;
-  CHECK(inst->opcode() == HloOpcode::kSelectAndScatter ||
-        inst->opcode() == HloOpcode::kReduce);
-  const HloInstruction* init_value = nullptr;
-  switch (inst->opcode()) {
-    case HloOpcode::kSelectAndScatter:
-      init_value = inst->operand(2);
-      break;
-    case HloOpcode::kReduce:
-      init_value = inst->operand(1);
-      break;
-    default:
-      LOG(FATAL) << "Opcode " << inst->opcode()
-                 << " should not need an initializer.";
-  }
+  const HloInstruction* init_value = [&] {
+    switch (inst->opcode()) {
+      case HloOpcode::kSelectAndScatter:
+        return inst->operand(2);
+      case HloOpcode::kReduce:
+        return inst->operand(1);
+      default:
+        LOG(FATAL) << "Opcode " << inst->opcode()
+                   << " should not need an initializer.";
+    }
+  }();
 
   if (fused && init_value->opcode() == HloOpcode::kParameter) {
     init_value = hlo->operand(init_value->parameter_number());
   }
 
-  return EmitTargetElementLoopInThunk(
+  // In the common case, the initializer is a constant.  In this case, emit a
+  // device-memset call if we can.  Currently StreamExecutor only supports
+  // zeroing and 32-bit memsets.
+  if (init_value->IsConstant()) {
+    CHECK(ShapeUtil::IsScalar(init_value->shape()));
+    int64 num_bytes = ShapeUtil::ByteSizeOfElements(init_value->shape());
+    const auto& literal = init_value->literal();
+
+    // Are all the bytes of this scalar equal to 0?  If so, we can create a
+    // MemzeroThunk.
+    ArraySlice<uint8> literal_bytes(
+        reinterpret_cast<const uint8*>(literal.untyped_data()), num_bytes);
+    if (c_all_of(literal_bytes, [](uint8 byte) { return byte == 0; })) {
+      return {MakeUnique<MemzeroThunk>(GetAllocationSlice(*hlo), hlo)};
+    }
+
+    // If the literal is 8 or 16 bits wide, we can emit a 32-bit memset by
+    // repeating the literal 4 or 2 times, so long as the destination buffer is
+    // an even multiple of 32 bits long.
+    if ((num_bytes == 1 || num_bytes == 2) &&
+        ShapeUtil::ByteSizeOf(hlo->shape()) % 4 == 0) {
+      uint16 pattern16;
+      if (num_bytes == 1) {
+        uint8 b = literal_bytes.front();
+        pattern16 = uint16{b} | (uint16{b} << 8);
+      } else {
+        memcpy(&pattern16, literal_bytes.data(), sizeof(pattern16));
+      }
+      uint32 pattern32 = uint32{pattern16} | (uint32{pattern16} << 16);
+      return {MakeUnique<Memset32BitValueThunk>(pattern32,
+                                                GetAllocationSlice(*hlo), hlo)};
+    }
+
+    // If the literal is an even multiple of 32 bits wide, we can emit a 32-bit
+    // memset so long as all 32-bit words of the scalar are equal to each other.
+    if (num_bytes >= 4 && num_bytes % 4 == 0 &&
+        memcmp(literal_bytes.data(), literal_bytes.data() + 4,
+               literal_bytes.size() - 4) == 0) {
+      uint32 word;
+      memcpy(&word, literal_bytes.data(), sizeof(word));
+      return {MakeUnique<Memset32BitValueThunk>(word, GetAllocationSlice(*hlo),
+                                                hlo)};
+    }
+  }
+
+  // Otherwise fall back to our slow initializer code.
+  std::unique_ptr<KernelThunk> kernel_thunk = BuildKernelThunk(hlo);
+  TF_RETURN_IF_ERROR(EmitTargetElementLoopInThunk(
       *hlo,
       [=](const llvm_ir::IrArray::Index& index) {
         return GetIrArray(*init_value, *hlo)
             .EmitReadArrayElement(index, &ir_builder_);
       },
-      thunk);
+      kernel_thunk.get()));
+
+  // Clean up state left behind by emitting the loop above.  (This is normally
+  // done in IrEmitterUnnested::Postprocess().)
+  bindings_.UnbindAllLocalIrValues();
+
+  // Convert unique_ptr<KernelThunk> to StatusOr<unique_ptr<Thunk>>.
+  return {std::move(kernel_thunk)};
 }
 
 namespace {
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
index b83a2337e2decd9d4fba3d40fcf33f131fca8a3c..66c62e2d2de3ed1668271a21943dc73ed3d77651 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
@@ -148,13 +148,10 @@ class IrEmitterUnnested : public IrEmitter {
       tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
       HloComputation* reducer);
 
-  // Emits code to initialize buffer of `inst` in given `thunk`.
-  Status EmitInitializer(const HloInstruction* inst, KernelThunk* thunk);
-
   // Returns a KernelThunk that invokes the kernel emitted for `inst`. The
   // caller needs to make sure `inst` outlives the lifetime of the returned
   // Thunk object.
-  std::unique_ptr<Thunk> BuildKernelThunk(const HloInstruction* inst);
+  std::unique_ptr<KernelThunk> BuildKernelThunk(const HloInstruction* inst);
 
   // Returns a FftThunk that calls cuFFT to implement `inst`.
   std::unique_ptr<Thunk> BuildFftThunk(const HloInstruction* inst);
@@ -163,6 +160,11 @@ class IrEmitterUnnested : public IrEmitter {
   // to make sure `inst` outlives the lifetime of the returned Thunk object.
   std::unique_ptr<Thunk> BuildGemmThunk(const HloInstruction* inst);
 
+  // Returns a thunk that, given a reduce or select-and-scatter op, initializes
+  // its memory to the appropriate initial value.
+  StatusOr<std::unique_ptr<Thunk>> BuildInitializerThunk(
+      const HloInstruction* hlo);
+
   // Returns a thunk that calls host-to-device cuMemcpy to implement `inst`.
   std::unique_ptr<Thunk> BuildHostToDeviceCopyThunk(const HloInstruction* inst);
 
diff --git a/tensorflow/compiler/xla/service/gpu/memset_thunk.cc b/tensorflow/compiler/xla/service/gpu/memset_thunk.cc
new file mode 100644
index 0000000000000000000000000000000000000000..18e673542c5b47cb90d31a8eff62a5e4adb78d1d
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/memset_thunk.cc
@@ -0,0 +1,39 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/memset_thunk.h"
+#include "tensorflow/stream_executor/stream_executor.h"
+
+namespace xla {
+namespace gpu {
+
+namespace se = ::perftools::gputools;
+
+Status MemzeroThunk::ExecuteOnStream(
+    const BufferAllocations& buffer_allocations, se::Stream* stream) {
+  se::DeviceMemoryBase dest_data = buffer_allocations.GetDeviceAddress(dest_);
+  stream->ThenMemZero(&dest_data, dest_data.size());
+  return Status::OK();
+}
+
+Status Memset32BitValueThunk::ExecuteOnStream(
+    const BufferAllocations& buffer_allocations, se::Stream* stream) {
+  se::DeviceMemoryBase dest_data = buffer_allocations.GetDeviceAddress(dest_);
+  stream->ThenMemset32(&dest_data, value_, dest_data.size());
+  return Status::OK();
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/memset_thunk.h b/tensorflow/compiler/xla/service/gpu/memset_thunk.h
new file mode 100644
index 0000000000000000000000000000000000000000..b4bb74d1dd6dc9d09c5e4d439d57dfe8b57c2ed9
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/memset_thunk.h
@@ -0,0 +1,65 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_MEMSET_THUNK_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_MEMSET_THUNK_H_
+
+#include "tensorflow/compiler/xla/service/buffer_assignment.h"
+#include "tensorflow/compiler/xla/service/gpu/thunk.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/status.h"
+#include "tensorflow/stream_executor/stream_executor.h"
+
+// This file contains thunks that set a buffer's elements to a particular value.
+// This can be faster than emitting a kernel to set the elements.
+
+namespace xla {
+namespace gpu {
+
+// Thunk that zeroes out a given chunk of memory.
+class MemzeroThunk : public Thunk {
+ public:
+  explicit MemzeroThunk(const BufferAllocation::Slice& dest,
+                        const HloInstruction* hlo)
+      : Thunk(Kind::kMemzero, hlo), dest_(dest) {}
+
+  Status ExecuteOnStream(const BufferAllocations& buffer_allocations,
+                         perftools::gputools::Stream* stream) override;
+
+ private:
+  const BufferAllocation::Slice dest_;
+};
+
+// Thunk that sets a given chunk of memory to a particular 32-bit value.  The
+// destination chunk must have size divisible by 32 bits.
+class Memset32BitValueThunk : public Thunk {
+ public:
+  explicit Memset32BitValueThunk(uint32 value,
+                                 const BufferAllocation::Slice& dest,
+                                 const HloInstruction* hlo)
+      : Thunk(Kind::kMemset32BitValue, hlo), value_(value), dest_(dest) {}
+
+  Status ExecuteOnStream(const BufferAllocations& buffer_allocations,
+                         perftools::gputools::Stream* stream) override;
+
+ private:
+  uint32 value_;
+  const BufferAllocation::Slice dest_;
+};
+
+}  // namespace gpu
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_MEMSET_THUNK_H_
diff --git a/tensorflow/compiler/xla/service/gpu/pad_insertion.cc b/tensorflow/compiler/xla/service/gpu/pad_insertion.cc
index 25846dc6cd4633c7becb6e62d6bc9585348a6eac..7bda4e2fcd469bd430e5ef1846251c8504225383 100644
--- a/tensorflow/compiler/xla/service/gpu/pad_insertion.cc
+++ b/tensorflow/compiler/xla/service/gpu/pad_insertion.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
+#include "tensorflow/compiler/xla/service/hlo_creation_utils.h"
 #include "tensorflow/compiler/xla/service/shape_inference.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/window_util.h"
@@ -68,13 +69,7 @@ HloInstruction* MaybePaddedAndSlicedInput(
     HloInstruction* padding =
         computation->AddInstruction(HloInstruction::CreateConstant(
             MakeUnique<Literal>(Literal::Zero(element_type))));
-    input = computation->AddInstruction(HloInstruction::CreatePad(
-        ShapeInference::InferPadShape(
-            /*operand_shape=*/input->shape(),
-            /*padding_value_shape=*/ShapeUtil::MakeShape(element_type, {}),
-            padding_config)
-            .ConsumeValueOrDie(),
-        input, padding, padding_config));
+    input = MakePadHlo(input, padding, padding_config).ValueOrDie();
   }
 
   if (window_util::HasNegativePadding(conv_window)) {
@@ -97,11 +92,8 @@ HloInstruction* MaybePaddedAndSlicedInput(
           std::max<int64>(0LL, -conv_window.dimensions(i).padding_high());
     }
 
-    input = computation->AddInstruction(HloInstruction::CreateSlice(
-        ShapeInference::InferSliceShape(input->shape(), start_indices,
-                                        limit_indices, strides)
-            .ConsumeValueOrDie(),
-        input, start_indices, limit_indices, strides));
+    input =
+        MakeSliceHlo(input, start_indices, limit_indices, strides).ValueOrDie();
   }
 
   return input;
@@ -134,13 +126,7 @@ HloInstruction* MaybePaddedKernel(const Window& conv_window,
   HloInstruction* padding =
       computation->AddInstruction(HloInstruction::CreateConstant(
           MakeUnique<Literal>(Literal::Zero(element_type))));
-  return computation->AddInstruction(HloInstruction::CreatePad(
-      ShapeInference::InferPadShape(
-          /*operand_shape=*/kernel->shape(),
-          /*padding_value_shape=*/ShapeUtil::MakeShape(element_type, {}),
-          padding_config)
-          .ConsumeValueOrDie(),
-      kernel, padding, padding_config));
+  return MakePadHlo(kernel, padding, padding_config).ValueOrDie();
 }
 }  // namespace
 
@@ -252,11 +238,7 @@ bool PadInsertion::CanonicalizeBackwardFilterConvolution(
       computation->AddInstruction(HloInstruction::CreateConstant(
           MakeUnique<Literal>(Literal::Zero(input->shape().element_type()))));
   HloInstruction* padded_input =
-      computation->AddInstruction(HloInstruction::CreatePad(
-          ShapeInference::InferPadShape(input->shape(), padding->shape(),
-                                        input_padding_config)
-              .ConsumeValueOrDie(),
-          input, padding, input_padding_config));
+      MakePadHlo(input, padding, input_padding_config).ValueOrDie();
 
   // The shape of the backward_conv CustomCall is a tuple (conv_result,
   // scratch_buffer).  Extract out the shape of conv_result.
diff --git a/tensorflow/compiler/xla/service/gpu/thunk.h b/tensorflow/compiler/xla/service/gpu/thunk.h
index 2c3032d79be221e8cacb178ffb1817459b603cc0..9eea958d1214b131d49cb4e28f1944860408d3a8 100644
--- a/tensorflow/compiler/xla/service/gpu/thunk.h
+++ b/tensorflow/compiler/xla/service/gpu/thunk.h
@@ -51,6 +51,8 @@ class Thunk {
     kGemm,
     kInfeed,
     kKernel,
+    kMemset32BitValue,
+    kMemzero,
     kSequential,
     kTuple,
     kWhile,
diff --git a/tensorflow/compiler/xla/service/heap_simulator.cc b/tensorflow/compiler/xla/service/heap_simulator.cc
index a2d13c013c56059148ccd04dba2137a5b2badc42..3dd4c4a0794e5c41b877078c4e69c6c9584ce6c0 100644
--- a/tensorflow/compiler/xla/service/heap_simulator.cc
+++ b/tensorflow/compiler/xla/service/heap_simulator.cc
@@ -27,38 +27,6 @@ namespace xla {
 using tensorflow::gtl::FlatMap;
 using tensorflow::gtl::FlatSet;
 
-namespace {
-
-// Returns the set of buffers that may be sources of all operands of the given
-// instruction.  The returned buffers are guaranteed to have no duplicates, and
-// to be sorted in a deterministic order.
-std::vector<const LogicalBuffer*> UniqueOperandSourceBuffers(
-    const HloInstruction* instruction,
-    const TuplePointsToAnalysis& points_to_analysis) {
-  std::vector<const LogicalBuffer*> buffers;
-  for (const HloInstruction* operand : instruction->operands()) {
-    points_to_analysis.GetPointsToSet(operand).ForEachElement(
-        [&](const ShapeIndex& /*index*/,
-            const PointsToSet::BufferList& points_to) {
-          buffers.insert(buffers.end(), points_to.begin(), points_to.end());
-        });
-  }
-
-  // Sort and then remove duplicates from buffers.
-  std::sort(buffers.begin(), buffers.end(),
-            [](const LogicalBuffer* a, const LogicalBuffer* b) {
-              return a->id() < b->id();
-            });
-  buffers.erase(std::unique(buffers.begin(), buffers.end(),
-                            [](const LogicalBuffer* a, const LogicalBuffer* b) {
-                              return a->id() == b->id();
-                            }),
-                buffers.end());
-  return buffers;
-}
-
-}  // namespace
-
 /*static*/
 StatusOr<HeapSimulator::Result> HeapSimulator::Run(
     std::unique_ptr<HeapAlgorithm> algorithm, const HloModule& module,
@@ -93,6 +61,7 @@ Status HeapSimulator::RunComputation(
     const HloComputation& computation,
     const std::vector<const HloInstruction*>& instruction_sequence,
     const TuplePointsToAnalysis& points_to_analysis) {
+  VLOG(3) << "Computation:\n" << computation.ToString();
   // The goal here is to minimize memory usage, assuming the given sequential
   // ordering of instructions.  The strategy is to walk through the instruction
   // sequence, calling Alloc and Free on the underlying heap algorithm.  The
@@ -101,7 +70,51 @@ Status HeapSimulator::RunComputation(
   // 'live_buffers' tracks the liveness of each buffer that we assign, by
   // associating it with a set of HloInstructions that need to be visited.  When
   // the set becomes empty, the buffer is no longer used, and can be freed.
+  // 'used_buffers' is the reverse map - it tracks which buffers were used by an
+  // instruction, so that we can remove the instructions from a buffer's live
+  // set after they are visited.
   FlatMap<const LogicalBuffer*, FlatSet<const HloInstruction*>> live_buffers;
+  FlatMap<const HloInstruction*, FlatSet<const LogicalBuffer*>> used_buffers;
+  auto add_user_to_buffer = [this, &live_buffers, &used_buffers](
+                                const HloInstruction* user,
+                                const LogicalBuffer* buffer) {
+    if (!IgnoreBuffer(buffer)) {
+      VLOG(4) << "  Adding user " << user->name() << " to buffer "
+              << buffer->ToString();
+      live_buffers[buffer].insert(user);
+      used_buffers[user].insert(buffer);
+    }
+  };
+
+  // Initialize live_buffers for each buffer that we're going to assign.  The
+  // set of instructions that need to be visited contains all users of all
+  // aliases, that is, all users of all instructions that have the buffer
+  // contained in their points-to set.
+  for (const HloInstruction* instruction : instruction_sequence) {
+    const PointsToSet& points_to =
+        points_to_analysis.GetPointsToSet(instruction);
+    const PointsToSet::BufferSet& buffer_set = points_to.CreateFlattenedSet();
+    for (const HloInstruction* user : instruction->users()) {
+      if (user->opcode() != HloOpcode::kGetTupleElement) {
+        for (const LogicalBuffer* buffer : buffer_set) {
+          add_user_to_buffer(user, buffer);
+        }
+      } else {
+        // A GetTupleElement doesn't need to keep all of its operand's buffers
+        // alive. It only needs the buffers that relate to the element it's
+        // extracting, and the tuple it's extracting from, but not the buffers
+        // for the other elements.
+        for (const LogicalBuffer* buffer : points_to.element({})) {
+          add_user_to_buffer(user, buffer);
+        }
+        const PointsToSet& gte_points_to =
+            points_to_analysis.GetPointsToSet(user);
+        for (const LogicalBuffer* buffer : gte_points_to.CreateFlattenedSet()) {
+          add_user_to_buffer(user, buffer);
+        }
+      }
+    }
+  }
 
   const HloInstruction* root = computation.root_instruction();
   auto output_source_buffers =
@@ -114,34 +127,17 @@ Status HeapSimulator::RunComputation(
         buffers_defined_by_instruction =
             points_to_analysis.GetBuffersDefinedByInstruction(instruction);
 
-    // Initialize live_buffers for each buffer that we're going to assign.  The
-    // set of instructions that need to be visited contains all users of all
-    // aliases.  The alias itself is not necessary; if it has users, the users
-    // are necessarily scheduled after the alias.  And if it has no users, it is
-    // either a dead value or an output, both of which are handled below.
-    //
-    // We ignore control dependencies here. The reasoning is that the control
-    // dependencies have already been accounted for in the ordering of the given
-    // 'instruction_sequence', and should not otherwise artificially extend the
-    // lifetime of buffers that aren't already connected by a data dependency.
+    VLOG(3) << "Instruction: " << instruction->ToString();
+    for (const LogicalBuffer* buffer : buffers_defined_by_instruction) {
+      VLOG(4) << "  Defines: " << buffer->ToString()
+              << (IgnoreBuffer(buffer) ? " (Ignored)" : "");
+    }
+
     dead_buffers_to_free.clear();
     for (const LogicalBuffer* buffer : buffers_defined_by_instruction) {
       if (IgnoreBuffer(buffer)) {
         continue;
       }
-      FlatSet<const HloInstruction*>* live_set = nullptr;
-      for (const BufferAlias& alias :
-           points_to_analysis.GetBufferAliases(*buffer)) {
-        const std::vector<HloInstruction*>& users =
-            alias.instruction()->users();
-        if (!users.empty()) {
-          if (live_set == nullptr) {
-            live_set = &live_buffers[buffer];
-          }
-          live_set->insert(users.begin(), users.end());
-        }
-      }
-
       // Add a nullptr sentry to ensure entry parameters and output source
       // buffers are not freed until the very end.
       const bool entry_parameter =
@@ -165,11 +161,12 @@ Status HeapSimulator::RunComputation(
     // have no instructions left to visit are moved from live_buffers to
     // operand_buffers_to_free.
     operand_buffers_to_free.clear();
-    for (const LogicalBuffer* operand_buffer :
-         UniqueOperandSourceBuffers(instruction, points_to_analysis)) {
+    for (const LogicalBuffer* operand_buffer : used_buffers[instruction]) {
       if (IgnoreBuffer(operand_buffer)) {
         continue;
       }
+      VLOG(4) << "  Removing user " << instruction->name() << " from buffer "
+              << operand_buffer->ToString();
       auto it = live_buffers.find(operand_buffer);
       FlatSet<const HloInstruction*>* live_set = &it->second;
       live_set->erase(instruction);
@@ -178,6 +175,11 @@ Status HeapSimulator::RunComputation(
         operand_buffers_to_free.push_back(operand_buffer);
       }
     }
+    // Sort to get a deterministic iteration order.
+    std::sort(operand_buffers_to_free.begin(), operand_buffers_to_free.end(),
+              [](const LogicalBuffer* x, const LogicalBuffer* y) {
+                return x->id() < y->id();
+              });
 
     // Allocate buffers defined by this instruction.  This is the latest point
     // that we can allocate; right before the buffer is first used.  This must
@@ -203,6 +205,8 @@ Status HeapSimulator::RunComputation(
               CanShareOperandBufferWithUser(
                   operand_buffer->instruction(), operand_buffer->index(),
                   buffer->instruction(), buffer->index(), points_to_analysis)) {
+            VLOG(3) << "  Sharing: " << buffer->ToString() << " with "
+                    << operand_buffer->ToString();
             ShareBuffer(buffer, operand_buffer, instruction);
             shared = true;
             break;
@@ -211,6 +215,7 @@ Status HeapSimulator::RunComputation(
       }
 
       if (!shared) {
+        VLOG(3) << "  Allocating: " << buffer->ToString();
         Alloc(buffer, instruction);
       }
     }
@@ -244,20 +249,34 @@ Status HeapSimulator::RunComputation(
     // Free buffers that are no longer live.  This is the earliest point that we
     // can de-allocate; right after the last use of the buffer.
     for (const LogicalBuffer* buffer : dead_buffers_to_free) {
+      VLOG(3) << "  Freeing dead: " << buffer->ToString();
       Free(buffer, instruction);
     }
     for (const LogicalBuffer* buffer : operand_buffers_to_free) {
+      VLOG(3) << "  Freeing operand: " << buffer->ToString();
       Free(buffer, instruction);
     }
   }
 
   // Any remaining live buffers must be entry parameters or output source
-  // buffers, which had a nullptr sentry added.  Free them now.
+  // buffers, which had a nullptr sentry added.  Free them now, in a
+  // deterministic order.
+  std::vector<const LogicalBuffer*> to_free;
+  to_free.reserve(live_buffers.size());
   for (const auto& buffer_pending : live_buffers) {
     const LogicalBuffer* buffer = buffer_pending.first;
     const FlatSet<const HloInstruction*>& pending = buffer_pending.second;
     CHECK_EQ(pending.size(), 1) << *buffer;
     CHECK(*pending.begin() == nullptr) << *buffer;
+    to_free.push_back(buffer);
+  }
+
+  std::sort(to_free.begin(), to_free.end(),
+            [](const LogicalBuffer* x, const LogicalBuffer* y) {
+              return x->id() < y->id();
+            });
+  for (const LogicalBuffer* buffer : to_free) {
+    VLOG(3) << "Freeing pending: " << buffer->ToString();
     Free(buffer, root);
   }
 
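Reviewer note on heap_simulator.cc: the `live_buffers` / `used_buffers` bookkeeping introduced above can be sketched in isolation. This is a minimal model under stated assumptions, not the simulator itself: buffers and instructions are plain integers, `Use` and `BuffersFreedAfterEachInstruction` are hypothetical names, and aliasing, buffer sharing, and the nullptr sentry for parameters and outputs are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in: one record per (buffer, reading instruction) pair.
struct Use {
  int buffer_id;
  int user;  // index of the instruction that reads the buffer
};

// Walks instructions 0..num_instructions-1 in order and returns, for each
// instruction, the ids of the buffers whose last remaining user it is.
std::vector<std::vector<int>> BuffersFreedAfterEachInstruction(
    int num_instructions, const std::vector<Use>& uses) {
  assert(num_instructions >= 0);
  // live_buffers: buffer -> users that still need to be visited.
  // used_buffers: the reverse map, user -> buffers it keeps alive.
  std::map<int, std::set<int>> live_buffers;
  std::map<int, std::set<int>> used_buffers;
  for (const Use& use : uses) {
    live_buffers[use.buffer_id].insert(use.user);
    used_buffers[use.user].insert(use.buffer_id);
  }

  std::vector<std::vector<int>> freed(num_instructions);
  for (int instruction = 0; instruction < num_instructions; ++instruction) {
    auto used = used_buffers.find(instruction);
    if (used == used_buffers.end()) continue;
    for (int buffer_id : used->second) {
      std::set<int>& live_set = live_buffers[buffer_id];
      live_set.erase(instruction);
      if (live_set.empty()) {
        live_buffers.erase(buffer_id);  // live_set is not touched after this
        freed[instruction].push_back(buffer_id);
      }
    }
    // Redundant with std::set iteration order, but mirrors the explicit sort
    // the patch needs because its hash sets iterate in unspecified order.
    std::sort(freed[instruction].begin(), freed[instruction].end());
  }
  return freed;
}
```

For example, with uses `{10,1}, {10,2}, {11,1}, {12,3}`, buffer 11 is freed after instruction 1, buffer 10 after instruction 2, and buffer 12 after instruction 3.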
diff --git a/tensorflow/compiler/xla/service/heap_simulator_test.cc b/tensorflow/compiler/xla/service/heap_simulator_test.cc
index 387b649a731ebcbfd8307807469f39f22d192b06..688a271712ac243666ba4ff02932aa4f7f7ed21c 100644
--- a/tensorflow/compiler/xla/service/heap_simulator_test.cc
+++ b/tensorflow/compiler/xla/service/heap_simulator_test.cc
@@ -410,6 +410,56 @@ TEST_F(HeapSimulatorTest, MultiplyDotDotTuple) {
   });
 }
 
+TEST_F(HeapSimulatorTest, IndependentTupleElements) {
+  auto builder = HloComputation::Builder(TestName());
+  auto paramA = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, f32scalar_, "paramA"));
+  auto paramB = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, f32scalar_, "paramB"));
+  auto mul = builder.AddInstruction(HloInstruction::CreateBinary(
+      f32scalar_, HloOpcode::kMultiply, paramA, paramB));
+  auto add = builder.AddInstruction(HloInstruction::CreateBinary(
+      f32scalar_, HloOpcode::kAdd, paramA, paramB));
+  auto tuple = builder.AddInstruction(HloInstruction::CreateTuple({mul, add}));
+  auto element0 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(f32scalar_, tuple, 0));
+  auto broadcast = builder.AddInstruction(
+      HloInstruction::CreateBroadcast(f32vec4_, element0, {0}));
+  auto sub = builder.AddInstruction(HloInstruction::CreateBinary(
+      f32scalar_, HloOpcode::kSubtract, paramA, paramB));
+  auto element1 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(f32scalar_, tuple, 1));
+  auto output = builder.AddInstruction(
+      HloInstruction::CreateTuple({broadcast, sub, element1}));
+
+  HeapSimulatorTracker tracker(TestName(), builder.Build(),
+                               {paramA, paramB, mul, add, tuple, element0,
+                                broadcast, sub, element1, output});
+  tracker.ExpectCallSequence({
+      {kAlloc, tracker.BufferAt(paramA, {})},
+      {kAlloc, tracker.BufferAt(paramB, {})},
+      {kAlloc, tracker.BufferAt(mul, {})},
+      {kAlloc, tracker.BufferAt(add, {})},
+      {kAlloc, tracker.BufferAt(tuple, {})},
+      {kAlloc, tracker.BufferAt(broadcast, {})},
+      // The mul can be freed right after the broadcast happens, even though
+      // the other GetTupleElement is still alive.
+      {kFree, tracker.BufferAt(mul, {})},
+      {kAlloc, tracker.BufferAt(sub, {})},
+      // The temporary tuple is now dead.
+      {kFree, tracker.BufferAt(tuple, {})},
+      {kAlloc, tracker.BufferAt(output, {})},
+      // All params and outputs are freed at the end.
+      {kFree, tracker.BufferAt(paramA, {})},
+      {kFree, tracker.BufferAt(paramB, {})},
+      {kFree, tracker.BufferAt(add, {})},
+      {kFree, tracker.BufferAt(broadcast, {})},
+      {kFree, tracker.BufferAt(sub, {})},
+      {kFree, tracker.BufferAt(output, {})},
+      {kFinish, nullptr},
+  });
+}
+
 TEST_F(HeapSimulatorTest, WholeModule) {
   HeapSimulatorTracker tracker(TestName());
 
diff --git a/tensorflow/compiler/xla/service/hlo.proto b/tensorflow/compiler/xla/service/hlo.proto
index a43785b4a9701369ae315f67d4d64d03dc6c081d..406feadfd45840eeab6fe36cda8d4f3d17bd081b 100644
--- a/tensorflow/compiler/xla/service/hlo.proto
+++ b/tensorflow/compiler/xla/service/hlo.proto
@@ -13,13 +13,12 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-// DO NOT USE THESE PROTO MESSAGES FOR ANYTHING OTHER THAN DEBUGGING.
-//
-// Don't use these protos in the real compilation or execution codepaths. The
-// data format is meant for debugging only, and may change without notice.
+// This proto file defines messages which represent the HLO module. This is a
+// full-fidelity serialization of the C++ HLO constructs.
 //
 // Many of the protos below are simple 1-to-1 serializations of the
-// corresponding C++ classes.
+// corresponding C++ classes, e.g., HloModule, HloComputation, and
+// HloInstruction.
 //
 // FIELD NAMES ARE IMPORTANT
 //
@@ -38,16 +37,19 @@ option cc_enable_arenas = true;
 message HloInstructionProto {
   reserved 10;
   reserved "parameter_name";
+  reserved 12;
+  reserved "fused_instructions_computation";
+  reserved 4;
+  reserved "operand_names";
+  reserved 5;
+  reserved "control_predecessor_names";
+  reserved 6;
+  reserved "called_computation_names";
 
   string name = 1;
   string opcode = 2;
   xla.Shape shape = 3;
 
-  // TODO(b/67782397): Replace instruction names with HloInstruction ids.
-  repeated string operand_names = 4;
-  repeated string control_predecessor_names = 5;
-  repeated string called_computation_names = 6;
-
   xla.OpMetadata metadata = 7;
 
   // Literal, only present for kConstant.
@@ -58,7 +60,6 @@ message HloInstructionProto {
 
   // Fusion state, only present for kFusion.
   string fusion_kind = 11;
-  HloComputationProto fused_instructions_computation = 12;
 
   // Index for kGetTupleElement.
   int64 tuple_index = 13;
@@ -133,28 +134,51 @@ message HloInstructionProto {
   // Gather dimension numbers.
   xla.GatherDimensionNumbers gather_dimension_numbers = 33;
   repeated int64 gather_window_bounds = 34;
+
+  // The id of this instruction.
+  int64 id = 35;
+
+  repeated int64 operand_ids = 36;
+  repeated int64 control_predecessor_ids = 37;
+  repeated int64 called_computation_ids = 38;
 }
 
 // Serialization of HloComputation.
 message HloComputationProto {
+  reserved 3;
+  reserved "root_name";
+
   string name = 1;
 
   // The array of instructions is always in a valid dependency order, where
   // operands appear before their users.
   repeated HloInstructionProto instructions = 2;
 
-  // The name of the root of the computation.
-  string root_name = 3;
+  // The program shape (with layout) of this computation.
+  xla.ProgramShape program_shape = 4;
+
+  // The id of this computation.
+  int64 id = 5;
+
+  // The id of the root of the computation.
+  int64 root_id = 6;
 }
 
 // Serialization of HloModule.
 message HloModuleProto {
   string name = 1;
   string entry_computation_name = 2;
+  int64 entry_computation_id = 6;
 
   // The array of computations is always in a valid dependency order, where
   // callees appear before their callers.
   repeated HloComputationProto computations = 3;
+
+  // The program shape (with layout) of the entry computation.
+  xla.ProgramShape program_shape = 4;
+
+  // The id of this module.
+  int64 id = 5;
 }
 
 // Serialization of HloOrdering.
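Reviewer note on hlo.proto: replacing name-based references (`operand_names`, `root_name`) with id-based ones (`operand_ids`, `root_id`) means a reader of the proto resolves references through an id map. The sketch below illustrates that resolution under stated assumptions: `InstructionRecord` and `ValidateById` are hypothetical simplifications, not the generated proto classes or `HloComputation::CreateFromProto`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for HloInstructionProto.
struct InstructionRecord {
  std::int64_t id;
  std::string name;
  std::vector<std::int64_t> operand_ids;
};

// Returns true when ids are unique, every operand id refers to an earlier
// record (operands appear before their users), and root_id is present.
bool ValidateById(const std::vector<InstructionRecord>& records,
                  std::int64_t root_id) {
  std::unordered_map<std::int64_t, const InstructionRecord*> by_id;
  for (const InstructionRecord& record : records) {
    for (std::int64_t operand_id : record.operand_ids) {
      if (by_id.count(operand_id) == 0) return false;  // unknown or forward ref
    }
    if (!by_id.emplace(record.id, &record).second) return false;  // duplicate id
  }
  // Every emplace succeeded, so the map holds exactly one entry per record.
  assert(by_id.size() == records.size());
  return by_id.count(root_id) != 0;
}
```

Because lookups key on ids, this sketch never requires names to be unique; only ids must be.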
diff --git a/tensorflow/compiler/xla/service/hlo_alias_analysis.cc b/tensorflow/compiler/xla/service/hlo_alias_analysis.cc
index 30e32a46d7dd0923f738939c33407ac7484b5bbe..a88283ed9a6459b4fa9310e160b59c77d51f1027 100644
--- a/tensorflow/compiler/xla/service/hlo_alias_analysis.cc
+++ b/tensorflow/compiler/xla/service/hlo_alias_analysis.cc
@@ -171,24 +171,21 @@ class BufferValueMap {
     return value_to_buffer_number_.at(&value);
   }
 
-  // Compute and return a vector of buffers that the given value must be
-  // contained in due to HLO aliasing rules.
-  std::vector<BufferNumber> ComputeAliasedBuffers(const HloValue& value) {
+  void ComputeWhileAliasedBuffers(const HloValue& value,
+                                  std::vector<BufferNumber>* aliased_buffers) {
+    VLOG(3) << "Compute kWhile aliases";
     // Value is init of a while (use is while).
-    std::vector<BufferNumber> aliased_buffers;
     for (const HloUse& use : value.uses()) {
-      VLOG(2) << "use of value " << value.ToShortString() << ": " << use;
       if (use.instruction->opcode() == HloOpcode::kWhile) {
         // Determine the while value that this shares a buffer with.
         const HloValue& while_value =
             dataflow_.GetUniqueValueAt(use.instruction, use.operand_index);
-        aliased_buffers.push_back(GetBufferForValue(while_value));
+        aliased_buffers->push_back(GetBufferForValue(while_value));
         VLOG(3) << "  value is init value to a while; must share buffer with "
                    "while value "
                 << while_value.ToShortString();
       }
     }
-
     // Value is a parameter of a while body/condition.
     if (value.defining_instruction()->opcode() == HloOpcode::kParameter) {
       const HloComputation* computation =
@@ -205,11 +202,10 @@ class BufferValueMap {
           VLOG(3) << "  value is parameter value of the body or condition of a "
                      "while; must share buffer with while value "
                   << while_value.ToShortString();
-          aliased_buffers.push_back(GetBufferForValue(while_value));
+          aliased_buffers->push_back(GetBufferForValue(while_value));
         }
       }
     }
-
     // Value is the root of a while body.
     for (const HloPosition& position : value.positions()) {
       const HloComputation* computation = position.instruction->parent();
@@ -224,27 +220,71 @@ class BufferValueMap {
 
             const HloValue& while_value = dataflow_.GetUniqueValueAt(
                 callsite.instruction(), position.index);
-            VLOG(3) << "  value is root the body computation of a while; must "
-                       "share buffer with while value "
+            VLOG(3) << "  value @ " << position << " is root of "
+                    << callsite.instruction()->name()
+                    << "; body root and while value root must share buffer "
+                       "among them : "
                     << while_value.ToShortString();
-            aliased_buffers.push_back(GetBufferForValue(while_value));
+            aliased_buffers->push_back(GetBufferForValue(while_value));
           }
         }
       }
     }
-
     // Value is the output of the while instruction itself.
     if (value.defining_instruction()->opcode() == HloOpcode::kWhile) {
       VLOG(3) << "  value is output of a while instruction";
-      aliased_buffers.push_back(GetBufferForValue(value));
+      aliased_buffers->push_back(GetBufferForValue(value));
+    }
+  }
+
+  void ComputeConditionalAliasedBuffers(
+      const HloValue& value, std::vector<BufferNumber>* aliased_buffers) {
+    VLOG(3) << "Compute kConditional aliases";
+    // Aliases the buffers of the true/false computation roots with that of the
+    // conditional.
+    for (const HloPosition& position : value.positions()) {
+      const HloComputation* computation = position.instruction->parent();
+      const CallGraphNode& call_graph_node =
+          dataflow_.call_graph().GetNode(computation);
+      if (position.instruction == computation->root_instruction()) {
+        for (const CallSite& callsite : call_graph_node.caller_callsites()) {
+          if (callsite.instruction()->opcode() == HloOpcode::kConditional) {
+            // Call graph must have been flattened.
+            CHECK_EQ(call_graph_node.caller_callsites().size(), 1);
+
+            const HloValue& cond_value = dataflow_.GetUniqueValueAt(
+                callsite.instruction(), position.index);
+            VLOG(3)
+                << "  value @ " << position << " is root of "
+                << callsite.instruction()->name()
+                << "; true/false branch roots must share buffer among them : "
+                << cond_value.ToShortString();
+            aliased_buffers->push_back(GetBufferForValue(cond_value));
+          }
+        }
+      }
+    }
+    // Value is the output of the conditional instruction itself.
+    if (value.defining_instruction()->opcode() == HloOpcode::kConditional) {
+      VLOG(3) << "  value is output of a conditional instruction";
+      aliased_buffers->push_back(GetBufferForValue(value));
     }
+  }
 
+  // Compute and return a vector of buffers that the given value must be
+  // contained in due to HLO aliasing rules.
+  std::vector<BufferNumber> ComputeAliasedBuffers(const HloValue& value) {
+    for (const HloUse& use : value.uses()) {
+      VLOG(2) << "Use of value " << value.ToShortString() << ": " << use;
+    }
+    std::vector<BufferNumber> aliased_buffers;
+    ComputeWhileAliasedBuffers(value, &aliased_buffers);
+    ComputeConditionalAliasedBuffers(value, &aliased_buffers);
     // Uniquify aliased buffers.
     std::sort(aliased_buffers.begin(), aliased_buffers.end());
     aliased_buffers.erase(
         std::unique(aliased_buffers.begin(), aliased_buffers.end()),
         aliased_buffers.end());
-
     return aliased_buffers;
   }
 
diff --git a/tensorflow/compiler/xla/service/hlo_computation.cc b/tensorflow/compiler/xla/service/hlo_computation.cc
index 21e6b2ca730f6347af902097e6496826b861e8a3..6f983d0b950435d43fe3a1e0fe84902b51bfe249 100644
--- a/tensorflow/compiler/xla/service/hlo_computation.cc
+++ b/tensorflow/compiler/xla/service/hlo_computation.cc
@@ -65,6 +65,7 @@ HloComputation::HloComputation(
     std::vector<std::unique_ptr<HloInstruction>>* instructions,
     HloInstruction* root_instruction, HloInstruction* fusion_instruction)
     : name_(name),
+      unique_id_(-1),
       root_instruction_(root_instruction),
       fusion_instruction_(fusion_instruction) {
   param_instructions_.resize(parameter_count, nullptr);
@@ -101,7 +102,7 @@ HloInstruction* HloComputation::AddInstructionInternal(
     instruction->UniquifyName(&parent()->instruction_name_uniquer());
     instruction->SetUniqueId(parent()->NewUniqueInstructionId());
   }
-  Reparent(instruction.get());
+  instruction->set_parent(this);
   HloInstruction* pinst = instruction.get();
   instruction_iterators_[pinst] =
       instructions_.insert(instructions_.end(), std::move(instruction));
@@ -158,10 +159,6 @@ Status HloComputation::RemoveParameter(int64 param_no) {
   return Status::OK();
 }
 
-void HloComputation::Reparent(HloInstruction* instruction) {
-  instruction->set_parent(this);
-}
-
 bool HloComputation::IsRemovable(const HloInstruction* instruction) {
   // If the instruction has control predecessors or successors then we cannot
   // remove the instruction without violating ordering constraints (added, for
@@ -393,43 +390,46 @@ string HloComputation::ToString(const HloPrintOptions& options) const {
 
 HloComputationProto HloComputation::ToProto() const {
   HloComputationProto proto;
+  CHECK(unique_id_ != -1)
+      << "This computation does not have a valid id. Please make sure the "
+         "computation is inside a module before dumping it.";
+  proto.set_id(unique_id_);
   proto.set_name(name_);
   for (const HloInstruction* instruction : MakeInstructionPostOrder()) {
     HloInstructionProto instruction_proto = instruction->ToProto();
     proto.add_instructions()->Swap(&instruction_proto);
   }
-  proto.set_root_name(root_instruction()->name());
+  proto.set_root_id(root_instruction()->unique_id());
+  *proto.mutable_program_shape() = ComputeProgramShape();
   return proto;
 }
 
 /* static */ StatusOr<std::unique_ptr<HloComputation>>
 HloComputation::CreateFromProto(
     HloModule* module, const HloComputationProto& proto,
-    const tensorflow::gtl::FlatMap<string, HloComputation*>& computation_map,
-    const std::function<void(std::unique_ptr<HloComputation>)>&
-        add_fused_computation,
-    HloInstruction* fusion_instruction) {
+    const tensorflow::gtl::FlatMap<int64, HloComputation*>& computation_map) {
   std::vector<std::unique_ptr<HloInstruction>> instructions;
-  tensorflow::gtl::FlatMap<string, HloInstruction*> instruction_map;
+  tensorflow::gtl::FlatMap<int64, HloInstruction*> instruction_map;
   int64 parameter_count = 0;
   for (const HloInstructionProto& instruction_proto : proto.instructions()) {
-    TF_ASSIGN_OR_RETURN(std::unique_ptr<HloInstruction> instruction,
-                        HloInstruction::CreateFromProto(
-                            module, instruction_proto, instruction_map,
-                            computation_map, add_fused_computation));
+    TF_ASSIGN_OR_RETURN(
+        std::unique_ptr<HloInstruction> instruction,
+        HloInstruction::CreateFromProto(module, instruction_proto,
+                                        instruction_map, computation_map));
     if (instruction->opcode() == HloOpcode::kParameter) {
       parameter_count++;
     }
-    TF_RET_CHECK(!ContainsKey(instruction_map, instruction->name()));
-    instruction_map[instruction->name()] = instruction.get();
+    TF_RET_CHECK(!ContainsKey(instruction_map, instruction_proto.id()));
+    instruction_map[instruction_proto.id()] = instruction.get();
     instructions.push_back(std::move(instruction));
   }
 
-  TF_RET_CHECK(!proto.root_name().empty());
-  TF_RET_CHECK(ContainsKey(instruction_map, proto.root_name()));
-  HloInstruction* root = instruction_map.at(proto.root_name());
-  return WrapUnique(new HloComputation(
-      proto.name(), parameter_count, &instructions, root, fusion_instruction));
+  TF_RET_CHECK(proto.root_id() != -1);
+  TF_RET_CHECK(ContainsKey(instruction_map, proto.root_id()));
+  HloInstruction* root = instruction_map.at(proto.root_id());
+  return WrapUnique(new HloComputation(proto.name(), parameter_count,
+                                       &instructions, root,
+                                       /*fusion_instruction=*/nullptr));
 }
 
 void HloComputation::FuseInstructionsInto(
@@ -532,7 +532,6 @@ ProgramShape HloComputation::ComputeProgramShape() const {
   }
   *program_shape.mutable_result() = root_instruction_->shape();
 
-  LayoutUtil::ClearLayout(&program_shape);
   return program_shape;
 }
 
diff --git a/tensorflow/compiler/xla/service/hlo_computation.h b/tensorflow/compiler/xla/service/hlo_computation.h
index 39d864efcb70382b6f8e631d7e6e452ea6410104..9d3f6e9a2c2efd97681a22b6b0f6d929afc553de 100644
--- a/tensorflow/compiler/xla/service/hlo_computation.h
+++ b/tensorflow/compiler/xla/service/hlo_computation.h
@@ -160,20 +160,12 @@ class HloComputation {
   //   module: the module which will contain the computation. The newly created
   //     computation is *not* added to the module, however.
   //   proto: the proto to convert from.
-  //   computation_map: a map from computation name to HloComputation*. This map
+  //   computation_map: a map from computation id to HloComputation*. This map
   //     must contain all computations which the newly constructed computation
   //     calls.
-  //   add_fused_computation: A function to call to add a fused
-  //     computation. Used only when the instruction is a fusion instruction.
-  //   fusion_instruction: if non-null then the newly created computation will
-  //     be constructed as a fused computation with this instruction as its
-  //     fusion parent.
   static StatusOr<std::unique_ptr<HloComputation>> CreateFromProto(
       HloModule* module, const HloComputationProto& proto,
-      const tensorflow::gtl::FlatMap<string, HloComputation*>& computation_map,
-      const std::function<void(std::unique_ptr<HloComputation>)>&
-          add_fused_computation,
-      HloInstruction* fusion_instruction = nullptr);
+      const tensorflow::gtl::FlatMap<int64, HloComputation*>& computation_map);
 
   // Gets the instructions in this computation.
   //
@@ -248,7 +240,7 @@ class HloComputation {
       ShapeTree<HloInstruction*>* copies_added = nullptr);
 
   // Computes and returns the ProgramShape of this computation (shape of
-  // parameters and result without layout).
+  // parameters and result with layout).
   ProgramShape ComputeProgramShape() const;
 
   // Return whether `*this` and `other` are functionally equivalent.
@@ -342,6 +334,15 @@ class HloComputation {
     fusion_instruction_ = fusion_instruction;
   }
 
+  // The id of this computation should be unique within the module.
+  void SetUniqueId(int64 id) {
+    CHECK_EQ(unique_id_, -1);
+    CHECK_GE(id, 0);
+    unique_id_ = id;
+  }
+
+  int64 unique_id() const { return unique_id_; }
+
  private:
   explicit HloComputation(
       const string& name, int parameter_count,
@@ -352,10 +353,6 @@ class HloComputation {
   HloInstruction* AddInstructionInternal(
       std::unique_ptr<HloInstruction> instruction);
 
-  // Helper for setting the parent of instructions that are added to this
-  // computation.
-  void Reparent(HloInstruction* instruction);
-
   // Fuses HLOs in instructions_to_fuse into fusion_instruction.
   //
   // Pre-condition: fusion_instruction's opcode is kFusion.
@@ -373,6 +370,7 @@ class HloComputation {
   std::vector<HloInstruction*> CollectUnreachableRoots() const;
 
   string name_;
+  int64 unique_id_;
   HloInstruction* root_instruction_;
 
   // If this computation is a fusion computation, this field points to the
diff --git a/tensorflow/compiler/xla/service/hlo_constant_folding.cc b/tensorflow/compiler/xla/service/hlo_constant_folding.cc
index 53450991b6fad5b9651d9d23b55c908e6b68e5dd..35ecd4428d0dfde2de445ea34472d2c78148c6c9 100644
--- a/tensorflow/compiler/xla/service/hlo_constant_folding.cc
+++ b/tensorflow/compiler/xla/service/hlo_constant_folding.cc
@@ -35,7 +35,10 @@ limitations under the License.
 namespace xla {
 
 StatusOr<bool> HloConstantFolding::Run(HloModule* module) {
-  auto evaluator = MakeUnique<HloEvaluator>();
+  // Limit the evaluator to 0 loop iterations so that loops are not folded. This
+  // retains the behavior that predates while-loop support in HloEvaluator and
+  // may be revised.
+  auto evaluator = MakeUnique<HloEvaluator>(/*max_loop_iterations=*/0);
 
   XLA_VLOG_LINES(2,
                  "HloConstantFolding::Run(), before:\n" + module->ToString());
diff --git a/tensorflow/compiler/xla/service/hlo_creation_utils.cc b/tensorflow/compiler/xla/service/hlo_creation_utils.cc
new file mode 100644
index 0000000000000000000000000000000000000000..b186767ce792cd89ae77fe9a03b3a2ecf296b804
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_creation_utils.cc
@@ -0,0 +1,277 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_creation_utils.h"
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/service/shape_inference.h"
+#include "tensorflow/compiler/xla/util.h"
+
+namespace xla {
+using tensorflow::gtl::ArraySlice;
+using tensorflow::strings::StrCat;
+
+StatusOr<HloInstruction*> MakeBinaryHlo(HloOpcode opcode, HloInstruction* lhs,
+                                        HloInstruction* rhs) {
+  HloComputation* computation = lhs->parent();
+  CHECK_EQ(computation, rhs->parent());
+  TF_ASSIGN_OR_RETURN(Shape binary_op_shape,
+                      ShapeInference::InferBinaryOpShape(opcode, lhs, rhs));
+  return computation->AddInstruction(
+      HloInstruction::CreateBinary(binary_op_shape, opcode, lhs, rhs));
+}
+
+StatusOr<HloInstruction*> MakePadHlo(HloInstruction* operand,
+                                     HloInstruction* padding_value,
+                                     const PaddingConfig& padding_config) {
+  HloComputation* computation = operand->parent();
+  CHECK_EQ(computation, padding_value->parent());
+  TF_ASSIGN_OR_RETURN(
+      Shape pad_shape,
+      ShapeInference::InferPadShape(operand->shape(), padding_value->shape(),
+                                    padding_config));
+  return computation->AddInstruction(HloInstruction::CreatePad(
+      pad_shape, operand, padding_value, padding_config));
+}
+
+StatusOr<HloInstruction*> MakeSliceHlo(HloInstruction* operand,
+                                       ArraySlice<int64> start_indices,
+                                       ArraySlice<int64> limit_indices,
+                                       ArraySlice<int64> strides) {
+  HloComputation* computation = operand->parent();
+  TF_ASSIGN_OR_RETURN(Shape slice_shape, ShapeInference::InferSliceShape(
+                                             operand->shape(), start_indices,
+                                             limit_indices, strides));
+  return computation->AddInstruction(HloInstruction::CreateSlice(
+      slice_shape, operand, start_indices, limit_indices, strides));
+}
+
+StatusOr<HloInstruction*> MakeConvolveHlo(
+    HloInstruction* lhs, HloInstruction* rhs, const Window& window,
+    const ConvolutionDimensionNumbers& dimension_numbers) {
+  HloComputation* computation = lhs->parent();
+  CHECK_EQ(computation, rhs->parent());
+  TF_ASSIGN_OR_RETURN(Shape convolve_shape, ShapeInference::InferConvolveShape(
+                                                lhs->shape(), rhs->shape(),
+                                                window, dimension_numbers));
+  return computation->AddInstruction(HloInstruction::CreateConvolve(
+      convolve_shape, lhs, rhs, window, dimension_numbers));
+}
+
+StatusOr<HloInstruction*> MakeTransposeHlo(HloInstruction* operand,
+                                           ArraySlice<int64> dimensions) {
+  HloComputation* computation = operand->parent();
+  TF_ASSIGN_OR_RETURN(
+      Shape transpose_shape,
+      ShapeInference::InferTransposeShape(operand->shape(), dimensions));
+  return computation->AddInstruction(
+      HloInstruction::CreateTranspose(transpose_shape, operand, dimensions));
+}
+
+StatusOr<HloInstruction*> MakeReshapeHlo(const Shape& result_shape,
+                                         HloInstruction* operand) {
+  HloComputation* computation = operand->parent();
+  return computation->AddInstruction(
+      HloInstruction::CreateReshape(result_shape, operand));
+}
+
+StatusOr<HloInstruction*> MakeReshapeHlo(
+    ArraySlice<int64> result_shape_dim_bounds, HloInstruction* operand) {
+  Shape new_shape = ShapeUtil::MakeShape(operand->shape().element_type(),
+                                         result_shape_dim_bounds);
+  return MakeReshapeHlo(new_shape, operand);
+}
+
+StatusOr<HloInstruction*> MakeDynamicSliceHlo(HloInstruction* operand,
+                                              HloInstruction* start_indices,
+                                              ArraySlice<int64> slice_sizes) {
+  HloComputation* computation = operand->parent();
+  CHECK_EQ(computation, start_indices->parent());
+  TF_ASSIGN_OR_RETURN(
+      Shape dynamic_slice_shape,
+      ShapeInference::InferDynamicSliceShape(
+          operand->shape(), start_indices->shape(), slice_sizes));
+  return computation->AddInstruction(HloInstruction::CreateDynamicSlice(
+      dynamic_slice_shape, operand, start_indices, slice_sizes));
+}
+
+StatusOr<HloInstruction*> MakeDynamicUpdateSliceHlo(
+    HloInstruction* operand, HloInstruction* update,
+    HloInstruction* start_indices) {
+  HloComputation* computation = operand->parent();
+  CHECK_EQ(computation, update->parent());
+  CHECK_EQ(computation, start_indices->parent());
+  TF_ASSIGN_OR_RETURN(
+      Shape dynamic_update_slice_shape,
+      ShapeInference::InferDynamicUpdateSliceShape(
+          operand->shape(), update->shape(), start_indices->shape()));
+  return computation->AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
+      dynamic_update_slice_shape, operand, update, start_indices));
+}
+
+StatusOr<HloInstruction*> MakeBroadcastHlo(
+    HloInstruction* operand, ArraySlice<int64> broadcast_dimensions,
+    ArraySlice<int64> result_shape_bounds) {
+  HloComputation* computation = operand->parent();
+  Shape broadcast_shape = ShapeUtil::MakeShape(operand->shape().element_type(),
+                                               result_shape_bounds);
+
+  return computation->AddInstruction(HloInstruction::CreateBroadcast(
+      broadcast_shape, operand, broadcast_dimensions));
+}
+
+StatusOr<HloInstruction*> MakeGetTupleElementHlo(HloInstruction* operand,
+                                                 int64 index) {
+  HloComputation* computation = operand->parent();
+
+  TF_ASSIGN_OR_RETURN(
+      Shape gte_shape,
+      ShapeInference::InferGetTupleElementShape(operand->shape(), index));
+  return computation->AddInstruction(
+      HloInstruction::CreateGetTupleElement(gte_shape, operand, index));
+}
+
+StatusOr<HloInstruction*> MakeConcatHlo(ArraySlice<HloInstruction*> operands,
+                                        int64 dimension) {
+  CHECK_GT(operands.size(), 0);
+
+  HloComputation* computation = operands[0]->parent();
+  CHECK(c_all_of(operands, [&](HloInstruction* instr) {
+    return instr->parent() == computation;
+  }));
+
+  std::vector<const Shape*> operand_shapes;
+  c_transform(operands, std::back_inserter(operand_shapes),
+              [](HloInstruction* instr) { return &instr->shape(); });
+
+  TF_ASSIGN_OR_RETURN(Shape concat_shape, ShapeInference::InferConcatOpShape(
+                                              operand_shapes, dimension));
+  return computation->AddInstruction(
+      HloInstruction::CreateConcatenate(concat_shape, operands, dimension));
+}
+
+StatusOr<HloInstruction*> CollapseFirstNDims(HloInstruction* operand, int64 n) {
+  const Shape& operand_shape = operand->shape();
+  CHECK_GE(operand_shape.dimensions_size(), n);
+  int64 new_shape_leading_bound = 1;
+  for (int64 i = 0; i < n; i++) {
+    new_shape_leading_bound *= operand_shape.dimensions(i);
+  }
+
+  std::vector<int64> new_shape_dims;
+  new_shape_dims.reserve(operand_shape.dimensions_size() - n + 1);
+  new_shape_dims.push_back(new_shape_leading_bound);
+
+  std::copy(operand_shape.dimensions().begin() + n,
+            operand_shape.dimensions().end(),
+            std::back_inserter(new_shape_dims));
+
+  Shape output_shape =
+      ShapeUtil::MakeShape(operand_shape.element_type(), new_shape_dims);
+
+  return MakeReshapeHlo(output_shape, operand);
+}
+
+StatusOr<HloInstruction*> ExpandFirstDimIntoNDims(
+    HloInstruction* operand, ArraySlice<int64> expanded_dims) {
+  CHECK_GT(operand->shape().dimensions_size(), 0);
+  CHECK_EQ(operand->shape().dimensions(0), Product(expanded_dims));
+
+  std::vector<int64> expanded_shape_dim_bounds;
+  expanded_shape_dim_bounds.reserve(expanded_dims.size() +
+                                    operand->shape().dimensions_size() - 1);
+  c_copy(expanded_dims, std::back_inserter(expanded_shape_dim_bounds));
+  std::copy(operand->shape().dimensions().begin() + 1,
+            operand->shape().dimensions().end(),
+            std::back_inserter(expanded_shape_dim_bounds));
+  Shape new_shape = ShapeUtil::MakeShape(operand->shape().element_type(),
+                                         expanded_shape_dim_bounds);
+  return MakeReshapeHlo(new_shape, operand);
+}
+
+StatusOr<HloInstruction*> ElideDegenerateDims(HloInstruction* operand,
+                                              ArraySlice<int64> dims_to_elide) {
+  CHECK(c_is_sorted(dims_to_elide));
+
+  const Shape& input_shape = operand->shape();
+  // First accumulate the retained bounds in reverse; c_reverse fixes the order.
+  std::vector<int64> new_shape_dim_bounds;
+  new_shape_dim_bounds.reserve(input_shape.dimensions_size() -
+                               dims_to_elide.size());
+  int64 dims_to_elide_idx = dims_to_elide.size() - 1;
+  for (int64 i = input_shape.dimensions_size() - 1; i >= 0; i--) {
+    if (dims_to_elide_idx >= 0 && i == dims_to_elide[dims_to_elide_idx]) {
+      CHECK_EQ(input_shape.dimensions(i), 1);
+      dims_to_elide_idx--;
+    } else {
+      new_shape_dim_bounds.push_back(input_shape.dimensions(i));
+    }
+  }
+
+  c_reverse(new_shape_dim_bounds);
+  Shape output_shape =
+      ShapeUtil::MakeShape(input_shape.element_type(), new_shape_dim_bounds);
+  return MakeReshapeHlo(output_shape, operand);
+}
+
+StatusOr<HloInstruction*> PadVectorWithZeros(HloInstruction* operand,
+                                             int64 zeros_to_prepend,
+                                             int64 zeros_to_append) {
+  HloComputation* computation = operand->parent();
+  CHECK_EQ(operand->shape().dimensions_size(), 1);
+  PaddingConfig padding_config;
+  PaddingConfig::PaddingConfigDimension padding_config_dim;
+  padding_config_dim.set_edge_padding_low(zeros_to_prepend);
+  padding_config_dim.set_edge_padding_high(zeros_to_append);
+  *padding_config.add_dimensions() = padding_config_dim;
+
+  HloInstruction* zero =
+      computation->AddInstruction(HloInstruction::CreateConstant(
+          MakeUnique<Literal>(Literal::Zero(operand->shape().element_type()))));
+  return MakePadHlo(operand, zero, padding_config);
+}
+
+StatusOr<HloInstruction*> BroadcastZeros(
+    HloComputation* computation, PrimitiveType element_type,
+    ArraySlice<int64> broadcast_dimensions) {
+  HloInstruction* zero =
+      computation->AddInstruction(HloInstruction::CreateConstant(
+          MakeUnique<Literal>(Literal::Zero(element_type))));
+  return MakeBroadcastHlo(zero, /*broadcast_dimensions=*/{},
+                          /*result_shape_bounds=*/broadcast_dimensions);
+}
+
+StatusOr<std::unique_ptr<HloComputation>> CreateComputationWithSignature(
+    ArraySlice<const Shape*> domain, const Shape& range,
+    tensorflow::StringPiece name) {
+  HloComputation::Builder b(name.ToString());
+  int64 param_idx = 0;
+  for (const Shape* param_shape : domain) {
+    b.AddInstruction(HloInstruction::CreateParameter(
+        param_idx, *param_shape, StrCat("param.", param_idx)));
+    param_idx++;
+  }
+
+  // We can't change the root type of a computation once it is created so create
+  // a dummy root instruction to give the computation the right root shape.  In
+  // the future we may want to use a (recursive) broadcast here to avoid
+  // creating large constants.
+  b.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateFromShape(range)));
+
+  return b.Build();
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_creation_utils.h b/tensorflow/compiler/xla/service/hlo_creation_utils.h
new file mode 100644
index 0000000000000000000000000000000000000000..d99e32a737e6aaa2ff746cf6c00d4300cf62f4e1
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_creation_utils.h
@@ -0,0 +1,153 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_CREATION_UTILS_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_CREATION_UTILS_H_
+
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/statusor.h"
+
+namespace xla {
+
+// Some lightweight utilities intended to make HLO instruction creation more
+// ergonomic.  We don't have a complete set of helpers yet -- I expect we'll
+// expand this interface as needed on an ad-hoc basis.
+
+// Creates a binary HLO instruction and adds it to the computation containing
+// `lhs` and `rhs` (`lhs` and `rhs` must be in the same computation).
+StatusOr<HloInstruction*> MakeBinaryHlo(HloOpcode opcode, HloInstruction* lhs,
+                                        HloInstruction* rhs);
+
+// Creates a pad HLO instruction and adds it to the computation containing
+// `operand` and `padding_value` (`operand` and `padding_value` must be in the
+// same computation).
+StatusOr<HloInstruction*> MakePadHlo(HloInstruction* operand,
+                                     HloInstruction* padding_value,
+                                     const PaddingConfig& padding_config);
+
+// Creates a slice HLO instruction and adds it to the computation containing
+// `operand`.
+StatusOr<HloInstruction*> MakeSliceHlo(
+    HloInstruction* operand, tensorflow::gtl::ArraySlice<int64> start_indices,
+    tensorflow::gtl::ArraySlice<int64> limit_indices,
+    tensorflow::gtl::ArraySlice<int64> strides);
+
+// Creates a convolution HLO instruction and adds it to the computation
+// containing `lhs` and `rhs` (`lhs` and `rhs` must be in the same computation).
+StatusOr<HloInstruction*> MakeConvolveHlo(
+    HloInstruction* lhs, HloInstruction* rhs, const Window& window,
+    const ConvolutionDimensionNumbers& dimension_numbers);
+
+// Creates a transpose HLO instruction and adds it to the computation containing
+// `operand`.
+StatusOr<HloInstruction*> MakeTransposeHlo(
+    HloInstruction* operand, tensorflow::gtl::ArraySlice<int64> dimensions);
+
+// Creates a reshape HLO instruction and adds it to the computation containing
+// `operand`.
+StatusOr<HloInstruction*> MakeReshapeHlo(const Shape& result_shape,
+                                         HloInstruction* operand);
+
+StatusOr<HloInstruction*> MakeReshapeHlo(
+    tensorflow::gtl::ArraySlice<int64> result_shape_dim_bounds,
+    HloInstruction* operand);
+
+// Creates a dynamic-slice HLO instruction and adds it to the computation
+// containing `operand` and `start_indices` (`operand` and `start_indices` must
+// be in the same computation).
+StatusOr<HloInstruction*> MakeDynamicSliceHlo(
+    HloInstruction* operand, HloInstruction* start_indices,
+    tensorflow::gtl::ArraySlice<int64> slice_sizes);
+
+// Creates a dynamic-update-slice HLO instruction and adds it to the computation
+// containing `operand`, `update` and `start_indices` (`operand`, `update` and
+// `start_indices` must be in the same computation).
+StatusOr<HloInstruction*> MakeDynamicUpdateSliceHlo(
+    HloInstruction* operand, HloInstruction* update,
+    HloInstruction* start_indices);
+
+// Creates a broadcast HLO instruction and adds it to the computation containing
+// `operand`.
+StatusOr<HloInstruction*> MakeBroadcastHlo(
+    HloInstruction* operand,
+    tensorflow::gtl::ArraySlice<int64> broadcast_dimensions,
+    tensorflow::gtl::ArraySlice<int64> result_shape_bounds);
+
+// Creates a GetTupleElement HLO instruction and adds it to the computation
+// containing `operand`.
+StatusOr<HloInstruction*> MakeGetTupleElementHlo(HloInstruction* operand,
+                                                 int64 index);
+
+// Creates a Concatenate HLO instruction and adds it to the computation
+// containing `operands` (`operands` must be non-empty and every element must be
+// contained in the same computation).
+StatusOr<HloInstruction*> MakeConcatHlo(
+    tensorflow::gtl::ArraySlice<HloInstruction*> operands, int64 dimension);
+
+// -----------------------------------------------------------------------------
+// Some other miscellaneous helpers to generate common HLO patterns.  All of
+// these add all the instructions they generate into the computation containing
+// their operand(s).
+
+// Collapses (via reshape) the first N (logical) dimensions of `operand` into a
+// single leading dimension.  `operand` must have rank >= n.
+//
+// For instance if `operand` has shape f32[7,8,9] and n is 2 then the output is
+// the `operand` reshaped to [56,9].
+StatusOr<HloInstruction*> CollapseFirstNDims(HloInstruction* operand, int64 n);
+
+// Expands (via reshape) the first (logical) dimension of `operand` into a
+// sequence of `expanded_dims` dimensions.  `operand` must have rank at least 1
+// and the number of elements in its first dimension must be equal to the
+// product of `expanded_dims`.
+//
+// For instance if `operand` has shape f32[200,9,7] and expanded_dims is
+// {2,5,20} the result is `operand` reshaped to [2,5,20,9,7].
+StatusOr<HloInstruction*> ExpandFirstDimIntoNDims(
+    HloInstruction* operand, tensorflow::gtl::ArraySlice<int64> expanded_dims);
+
+// Elides (via reshape) a set of degenerate dimensions (dimensions containing
+// exactly one element), `dims_to_elide`, from `operand`.  Every dimension in
+// `dims_to_elide` must be a degenerate dimension.  `dims_to_elide` must be
+// sorted and not contain duplicates.
+//
+// For example if `operand` is of shape f32[19,1,20,1,7,1,9] and dims_to_elide
+// is {1,5} then the result is `operand` reshaped to [19,20,1,7,9].
+StatusOr<HloInstruction*> ElideDegenerateDims(
+    HloInstruction* operand, tensorflow::gtl::ArraySlice<int64> dims_to_elide);
+
+// Pads `operand` (which must have rank 1) with `zeros_to_prepend` zeros in the
+// front and `zeros_to_append` zeros in the back.
+StatusOr<HloInstruction*> PadVectorWithZeros(HloInstruction* operand,
+                                             int64 zeros_to_prepend,
+                                             int64 zeros_to_append);
+
+// Broadcasts a zero value of type `element_type` into a tensor with element
+// type `element_type` and dimension bounds `broadcast_dimensions`.  The
+// broadcast instruction is emitted into `computation`.
+StatusOr<HloInstruction*> BroadcastZeros(
+    HloComputation* computation, PrimitiveType element_type,
+    tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+
+// Creates an HLO computation that takes arguments of type `domain` and produces
+// a value of type `range`.
+StatusOr<std::unique_ptr<HloComputation>> CreateComputationWithSignature(
+    tensorflow::gtl::ArraySlice<const Shape*> domain, const Shape& range,
+    tensorflow::StringPiece name);
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_CREATION_UTILS_H_
diff --git a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
index 934e43ba4879628362009267c671ec4cb0d79c52..0c37a8d75f38dabaad886cc9d4adce8ab29ddf18 100644
--- a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
+++ b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
@@ -368,11 +368,11 @@ bool HloDataflowAnalysis::UpdateConditionalValueSet(
           conditional->true_computation()->root_instruction()),
       &GetInstructionValueSet(
           conditional->false_computation()->root_instruction())};
-  // A phi-node is not defined for a kConditional instruction even though it
-  // represents a join point. This is because the current approach is to define
-  // a phi-node only for kWhile to account for the dataflow through back-edges
-  // and deal with the ambiguity in other cases.
-  return GetInstructionValueSet(conditional).AssignUnionOf(inputs);
+  if (ssa_form_) {
+    return Phi(conditional, inputs);
+  } else {
+    return GetInstructionValueSet(conditional).AssignUnionOf(inputs);
+  }
 }
 
 bool HloDataflowAnalysis::UpdateCopyValueSet(HloInstruction* copy) {
diff --git a/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc b/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
index 7bf3a1a06045c79621d75b653bf42220705a69d4..07f69b8e1339fed636e4eb54791941b85e09fd17 100644
--- a/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
@@ -1602,11 +1602,17 @@ TEST_P(HloDataflowAnalysisTest, ConditionalWithIdentity) {
   EXPECT_THAT(analysis.GetValueDefinedAt(constant2).uses(),
               ElementsAre(HloUse{conditional, 2, {}}));
 
-  EXPECT_EQ(analysis.values().size(), 3);
-  EXPECT_FALSE(analysis.ValueIsDefinedAt(conditional));
-  EXPECT_THAT(HloValuesAt(conditional),
-              UnorderedElementsAre(analysis.GetValueDefinedAt(constant1),
-                                   analysis.GetValueDefinedAt(constant2)));
+  bool ssa_form = GetParam();
+  if (ssa_form) {
+    EXPECT_EQ(analysis.values().size(), 4);
+    EXPECT_TRUE(analysis.ValueIsDefinedAt(conditional));
+  } else {
+    EXPECT_EQ(analysis.values().size(), 3);
+    EXPECT_FALSE(analysis.ValueIsDefinedAt(conditional));
+    EXPECT_THAT(HloValuesAt(conditional),
+                UnorderedElementsAre(analysis.GetValueDefinedAt(constant1),
+                                     analysis.GetValueDefinedAt(constant2)));
+  }
 }
 
 TEST_P(HloDataflowAnalysisTest, ConditionalTakingTupleOperand) {
@@ -1713,11 +1719,17 @@ TEST_P(HloDataflowAnalysisTest, ConditionalTakingTupleOperand) {
                   HloUse{true_x, 0, {}}, HloUse{true_y, 0, {}},
                   HloUse{false_x, 0, {}}, HloUse{false_y, 0, {}}));
 
-  EXPECT_EQ(analysis.values().size(), 6);
-  EXPECT_FALSE(analysis.ValueIsDefinedAt(conditional));
-  EXPECT_THAT(HloValuesAt(conditional),
-              UnorderedElementsAre(analysis.GetValueDefinedAt(add),
-                                   analysis.GetValueDefinedAt(sub)));
+  bool ssa_form = GetParam();
+  if (ssa_form) {
+    EXPECT_EQ(analysis.values().size(), 7);
+    EXPECT_TRUE(analysis.ValueIsDefinedAt(conditional));
+  } else {
+    EXPECT_EQ(analysis.values().size(), 6);
+    EXPECT_FALSE(analysis.ValueIsDefinedAt(conditional));
+    EXPECT_THAT(HloValuesAt(conditional),
+                UnorderedElementsAre(analysis.GetValueDefinedAt(add),
+                                     analysis.GetValueDefinedAt(sub)));
+  }
 }
 
 TEST_P(HloDataflowAnalysisTest, NestedConditionals) {
@@ -1834,20 +1846,27 @@ TEST_P(HloDataflowAnalysisTest, NestedConditionals) {
   EXPECT_EQ(analysis.GetUniqueValueAt(false_operand_cond),
             analysis.GetValueDefinedAt(constant2));
 
-  EXPECT_EQ(analysis.values().size(), 9);
-  EXPECT_FALSE(analysis.ValueIsDefinedAt(inner_conditional));
-  EXPECT_FALSE(analysis.ValueIsDefinedAt(conditional));
-  EXPECT_THAT(
-      HloValuesAt(inner_conditional),
-      UnorderedElementsAre(
-          analysis.GetValueDefinedAt(computation1->root_instruction()),
-          analysis.GetValueDefinedAt(computation2->root_instruction())));
-  EXPECT_THAT(
-      HloValuesAt(conditional),
-      UnorderedElementsAre(
-          analysis.GetValueDefinedAt(computation1->root_instruction()),
-          analysis.GetValueDefinedAt(computation2->root_instruction()),
-          analysis.GetValueDefinedAt(computation3->root_instruction())));
+  bool ssa_form = GetParam();
+  if (ssa_form) {
+    EXPECT_EQ(analysis.values().size(), 11);
+    EXPECT_TRUE(analysis.ValueIsDefinedAt(inner_conditional));
+    EXPECT_TRUE(analysis.ValueIsDefinedAt(conditional));
+  } else {
+    EXPECT_EQ(analysis.values().size(), 9);
+    EXPECT_FALSE(analysis.ValueIsDefinedAt(inner_conditional));
+    EXPECT_FALSE(analysis.ValueIsDefinedAt(conditional));
+    EXPECT_THAT(
+        HloValuesAt(inner_conditional),
+        UnorderedElementsAre(
+            analysis.GetValueDefinedAt(computation1->root_instruction()),
+            analysis.GetValueDefinedAt(computation2->root_instruction())));
+    EXPECT_THAT(
+        HloValuesAt(conditional),
+        UnorderedElementsAre(
+            analysis.GetValueDefinedAt(computation1->root_instruction()),
+            analysis.GetValueDefinedAt(computation2->root_instruction()),
+            analysis.GetValueDefinedAt(computation3->root_instruction())));
+  }
 }
 
 INSTANTIATE_TEST_CASE_P(HloDataflowAnalysisInstantiation,
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator.cc b/tensorflow/compiler/xla/service/hlo_evaluator.cc
index 15ae53128aa5dfe706daa6d47dc6d842fd78e26c..693004d364114b1a25ce6b6791092665c861d13f 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator.cc
+++ b/tensorflow/compiler/xla/service/hlo_evaluator.cc
@@ -51,12 +51,22 @@ namespace xla {
 
 namespace {
 
+using tensorflow::gtl::ArraySlice;
+using tensorflow::gtl::FlatSet;
+using tensorflow::gtl::optional;
+
 template <typename T>
 struct is_complex_t : public std::false_type {};
 
 template <>
 struct is_complex_t<complex64> : public std::true_type {};
 
+template <typename T>
+struct is_complex64_t : public std::false_type {};
+
+template <>
+struct is_complex64_t<complex64> : public std::true_type {};
+
 template <typename OperandT>
 StatusOr<std::unique_ptr<Literal>> Compare(const Shape& shape, HloOpcode opcode,
                                            const Literal& lhs_literal,
@@ -99,11 +109,10 @@ StatusOr<std::unique_ptr<Literal>> Compare(const Shape& shape, HloOpcode opcode,
   }
 
   auto result = Literal::CreateFromShape(shape);
-  TF_RETURN_IF_ERROR(result->Populate<bool>(
-      [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
-        return compare_op(lhs_literal.Get<OperandT>(multi_index),
-                          rhs_literal.Get<OperandT>(multi_index));
-      }));
+  TF_RETURN_IF_ERROR(result->Populate<bool>([&](ArraySlice<int64> multi_index) {
+    return compare_op(lhs_literal.Get<OperandT>(multi_index),
+                      rhs_literal.Get<OperandT>(multi_index));
+  }));
 
   return std::move(result);
 }
@@ -130,11 +139,10 @@ StatusOr<std::unique_ptr<Literal>> Compare<complex64>(
   }
 
   auto result = Literal::CreateFromShape(shape);
-  TF_RETURN_IF_ERROR(result->Populate<bool>(
-      [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
-        return compare_op(lhs_literal.Get<complex64>(multi_index),
-                          rhs_literal.Get<complex64>(multi_index));
-      }));
+  TF_RETURN_IF_ERROR(result->Populate<bool>([&](ArraySlice<int64> multi_index) {
+    return compare_op(lhs_literal.Get<complex64>(multi_index),
+                      rhs_literal.Get<complex64>(multi_index));
+  }));
 
   return std::move(result);
 }
@@ -159,8 +167,8 @@ StatusOr<std::unique_ptr<Literal>> ElementWiseUnaryOpImpl(
 
   auto result = Literal::CreateFromShape(shape);
 
-  TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-      [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+  TF_RETURN_IF_ERROR(
+      result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
         return unary_op(operand_literal.Get<NativeT>(multi_index));
       }));
   return std::move(result);
@@ -172,7 +180,7 @@ StatusOr<std::unique_ptr<Literal>> ElementWiseUnaryOpImpl(
 // with the base index.
 void IterateThroughWindow(
     const Shape& window_shape, const Window& window, const Shape& base_shape,
-    const tensorflow::gtl::ArraySlice<int64>& window_count_index,
+    const ArraySlice<int64>& window_count_index,
     const std::function<void(const std::vector<int64>&)>& f) {
   const int64 rank = ShapeUtil::Rank(base_shape);
   DimensionVector window_index(rank);
@@ -248,17 +256,37 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
   template <
       typename NativeT,
-      typename std::enable_if<std::is_signed<NativeT>::value ||
-                              is_complex_t<NativeT>::value>::type* = nullptr>
+      typename std::enable_if<std::is_signed<NativeT>::value>::type* = nullptr>
   Status HandleAbs(HloInstruction* abs) {
     TF_ASSIGN_OR_RETURN(parent_->evaluated_[abs],
-                        ElementWiseUnaryOp(abs, [](ElementwiseT elem_operand) {
+                        ElementWiseUnaryOp(abs, [](NativeT elem_operand) {
                           return std::abs(elem_operand);
                         }));
     return Status::OK();
   }
 
+  template <
+      typename NativeT,
+      typename std::enable_if<is_complex64_t<NativeT>::value>::type* = nullptr>
+  Status HandleAbs(HloInstruction* abs) {
+    const Literal& operand_literal =
+        parent_->GetEvaluatedLiteralFor(abs->operand(0));
+    TF_ASSIGN_OR_RETURN(
+        parent_->evaluated_[abs],
+        (ElementWiseUnaryOpImpl<float, NativeT>(
+            abs, [](NativeT elem_operand) { return std::abs(elem_operand); },
+            operand_literal)));
+
+    return Status::OK();
+  }
+
   Status HandleAbs(HloInstruction* abs) override {
+    // If the operand is of C64 type, the return type of abs will be F32.
+    // However, ElementwiseT would still be the return type, F32, and thus
+    // specifying the ElementwiseT explicitly as C64 is needed below.
+    if (abs->operand(0)->shape().element_type() == C64) {
+      return HandleAbs<complex64>(abs);
+    }
     return HandleAbs<ElementwiseT>(abs);
   }
 
@@ -306,13 +334,12 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
                    operand_to_broadcast.shape().dimensions(i));
     }
 
-    return output->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
-          for (int64 i = 0; i < broadcast->dimensions().size(); ++i) {
-            broadcast_indices[i] = multi_index[broadcast->dimensions(i)];
-          }
-          return operand_to_broadcast.Get<ReturnT>(broadcast_indices);
-        });
+    return output->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
+      for (int64 i = 0; i < broadcast->dimensions().size(); ++i) {
+        broadcast_indices[i] = multi_index[broadcast->dimensions(i)];
+      }
+      return operand_to_broadcast.Get<ReturnT>(broadcast_indices);
+    });
   }
 
   template <
@@ -586,14 +613,25 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     return Status::OK();
   }
 
-  template <
-      typename NativeT,
-      typename std::enable_if<!is_complex_t<NativeT>::value>::type* = nullptr>
+  template <typename NativeT,
+            typename std::enable_if<std::is_integral<NativeT>::value>::type* =
+                nullptr>
+  Status HandleMaximum(HloInstruction* maximum) {
+    TF_ASSIGN_OR_RETURN(
+        parent_->evaluated_[maximum],
+        ElementWiseBinaryOp(maximum, [](ElementwiseT lhs, ElementwiseT rhs) {
+          return std::max(lhs, rhs);
+        }));
+    return Status::OK();
+  }
+
+  template <typename NativeT, typename std::enable_if<std::is_floating_point<
+                                  NativeT>::value>::type* = nullptr>
   Status HandleMaximum(HloInstruction* maximum) {
     TF_ASSIGN_OR_RETURN(
         parent_->evaluated_[maximum],
         ElementWiseBinaryOp(maximum, [](ElementwiseT lhs, ElementwiseT rhs) {
-          return std::fmax(lhs, rhs);
+          return ((lhs >= rhs) || std::isnan(lhs)) ? lhs : rhs;
         }));
     return Status::OK();
   }
@@ -609,18 +647,30 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     return HandleMaximum<ElementwiseT>(maximum);
   }
 
-  template <
-      typename NativeT,
-      typename std::enable_if<!is_complex_t<NativeT>::value>::type* = nullptr>
+  template <typename NativeT,
+            typename std::enable_if<std::is_integral<NativeT>::value>::type* =
+                nullptr>
   Status HandleMinimum(HloInstruction* minimum) {
     TF_ASSIGN_OR_RETURN(parent_->evaluated_[minimum],
                         ElementWiseBinaryOp(minimum, [](ElementwiseT lhs_el,
                                                         ElementwiseT rhs_el) {
-                          return std::fmin(lhs_el, rhs_el);
+                          return std::min(lhs_el, rhs_el);
                         }));
     return Status::OK();
   }
 
+  template <typename NativeT, typename std::enable_if<std::is_floating_point<
+                                  NativeT>::value>::type* = nullptr>
+  Status HandleMinimum(HloInstruction* minimum) {
+    TF_ASSIGN_OR_RETURN(
+        parent_->evaluated_[minimum],
+        ElementWiseBinaryOp(minimum, [](ElementwiseT lhs_el,
+                                        ElementwiseT rhs_el) {
+          return ((lhs_el <= rhs_el) || std::isnan(lhs_el)) ? lhs_el : rhs_el;
+        }));
+    return Status::OK();
+  }
+
   template <
       typename NativeT,
       typename std::enable_if<is_complex_t<NativeT>::value>::type* = nullptr>
@@ -825,7 +875,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
   Status HandleClamp(HloInstruction* clamp) {
     std::function<ElementwiseT(ElementwiseT, ElementwiseT, ElementwiseT)>
         clamp_op = [](ElementwiseT low, ElementwiseT value, ElementwiseT high) {
-          return std::fmax(low, std::fmin(value, high));
+          return std::fmin(high, std::fmax(value, low));
         };
     TF_ASSIGN_OR_RETURN(
         parent_->evaluated_[clamp],
@@ -846,6 +896,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
   }
 
   Status HandleSelect(HloInstruction* select) override {
+    CHECK(!ShapeUtil::IsScalar(select->operand(0)->shape()));
     CHECK(!ShapeUtil::IsTuple(select->shape()));
     std::function<ReturnT(bool, ReturnT, ReturnT)> select_op =
         [](bool pred, ReturnT on_true, ReturnT on_false) {
@@ -876,8 +927,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     const Literal& operand_literal = parent_->GetEvaluatedLiteralFor(operand);
     auto result = Literal::CreateFromShape(result_shape);
 
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> out_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> out_index) {
           std::vector<int64> from_index(out_index.begin(), out_index.end());
           for (const int64 dim : reverse_dimensions) {
             from_index[dim] = result_shape.dimensions(dim) - 1 - out_index[dim];
@@ -952,7 +1003,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     DimensionVector rhs_index(rhs_rank);
     DimensionVector rhs_spatial_index(dnums.kernel_spatial_dimensions_size());
 
-    auto func = [&](tensorflow::gtl::ArraySlice<int64> out_index) {
+    auto func = [&](ArraySlice<int64> out_index) {
       ElementwiseT result_val = static_cast<ElementwiseT>(0);
 
       std::fill(lhs_index.begin(), lhs_index.end(), 0);
@@ -1074,9 +1125,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     }
 
     std::vector<int64> rhs_non_batch_non_contracting_dims;
-    tensorflow::gtl::FlatSet<int64> batch_dims_set(
-        dnums.rhs_batch_dimensions().begin(),
-        dnums.rhs_batch_dimensions().end());
+    FlatSet<int64> batch_dims_set(dnums.rhs_batch_dimensions().begin(),
+                                  dnums.rhs_batch_dimensions().end());
     for (int64 i = 0; i < rhs_rank; i++) {
       if (i != rhs_contracting_dimension && batch_dims_set.count(i) == 0) {
         rhs_non_batch_non_contracting_dims.push_back(i);
@@ -1088,8 +1138,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     DimensionVector lhs_index(lhs_rank);
     DimensionVector rhs_index(rhs_rank);
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> result_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> result_index) {
           ElementwiseT result_val = static_cast<ElementwiseT>(0);
 
           // Find the corresponding non-contracting indices for lhs and rhs.
@@ -1183,9 +1233,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
         parent_->GetEvaluatedLiteralFor(pad->operand(1)).Get<ReturnT>({});
     auto result = Literal::CreateFromShape(pad->shape());
     TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&scalar](tensorflow::gtl::ArraySlice<int64> multi_index) {
-          return scalar;
-        }));
+        [&scalar](ArraySlice<int64> multi_index) { return scalar; }));
 
     const Literal& evaluated_operand =
         parent_->GetEvaluatedLiteralFor(pad->operand(0));
@@ -1198,7 +1246,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     // corresponding index of the resulting padded literal.
     const PaddingConfig& pad_config = pad->padding_config();
 
-    auto func = [&](const std::vector<int64>& input_index) {
+    auto func = [&](ArraySlice<int64> input_index) {
       for (auto i = 0; i < input_index.size(); ++i) {
         // Interior padding occurs logically before edge padding, so in the case
         // of negative edge padding elements are removed from the
@@ -1348,9 +1396,9 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     auto result = Literal::CreateFromShape(map->shape());
 
-    HloEvaluator embedded_evaluator;
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+    HloEvaluator embedded_evaluator(parent_->max_loop_iterations_);
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
           std::vector<std::unique_ptr<Literal>> arg_literals;
           arg_literals.reserve(operands.size());
 
@@ -1440,7 +1488,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
   Status HandleReduce(HloInstruction* reduce) override {
     auto arg = reduce->operand(0);
     auto init_value = reduce->operand(1);
-    tensorflow::gtl::ArraySlice<int64> dimensions(reduce->dimensions());
+    ArraySlice<int64> dimensions(reduce->dimensions());
     HloComputation* function = reduce->to_apply();
     TF_RET_CHECK(ShapeUtil::Rank(reduce->shape()) ==
                  ShapeUtil::Rank(arg->shape()) - dimensions.size());
@@ -1483,10 +1531,10 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
       }
     }
 
-    HloEvaluator embedded_evaluator;
+    HloEvaluator embedded_evaluator(parent_->max_loop_iterations_);
     // For each resulting dimension, calculate and assign computed value.
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
           ReturnT result_val = init_scalar;
 
           std::vector<int64> base(arg_dimensions.size());
@@ -1494,7 +1542,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
             base[result_to_arg_index[i]] = multi_index[i];
           }
 
-          auto func = [&](const std::vector<int64>& input_index) {
+          auto func = [&](ArraySlice<int64> input_index) {
             auto curr_val = arg_literal.Get<ReturnT>(input_index);
 
             // Evaluate computation with specified literal operands.
@@ -1540,9 +1588,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     // Initialize result array with the init value.
     TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> output_index) {
-          return init_scalar;
-        }));
+        [&](ArraySlice<int64> output_index) { return init_scalar; }));
 
     std::vector<int64> window_dimension_sizes;
     for (const auto& window_dimension : window.dimensions()) {
@@ -1559,7 +1605,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     int64 rank = ShapeUtil::Rank(operand_literal.shape());
 
-    HloEvaluator embedded_evaluator;
+    HloEvaluator embedded_evaluator(parent_->max_loop_iterations_);
     DimensionVector source_index(rank);
 
     std::fill(source_index.begin(), source_index.end(), 0);
@@ -1575,8 +1621,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
       // 2. Using the selected index, scatter value from `source` to result. We
       // do this by iterating through the window, and compare each index with
       // the selected index.
-      tensorflow::gtl::optional<ReturnT> selected_val;
-      tensorflow::gtl::optional<std::vector<int64>> selected_index;
+      optional<ReturnT> selected_val;
+      optional<std::vector<int64>> selected_index;
 
       IterateThroughWindow(
           window_shape, window, operand_literal.shape(), source_index,
@@ -1591,11 +1637,11 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
                 Literal::CreateR0<ReturnT>(*selected_val);
 
             const std::vector<const Literal*> args = {
-                curr_val_literal.get(), selected_val_literal.get()};
+                selected_val_literal.get(), curr_val_literal.get()};
             std::unique_ptr<Literal> computed_result =
                 embedded_evaluator.Evaluate<const Literal*>(*select, args)
                     .ConsumeValueOrDie();
-            bool selected = computed_result->Get<bool>({});
+            bool selected = !computed_result->Get<bool>({});
             if (selected) {
               selected_val = curr_val;
               selected_index = operand_index;
@@ -1670,10 +1716,10 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     DimensionVector window_index(window.dimensions_size());
     DimensionVector operand_index(ShapeUtil::Rank(operand_literal.shape()));
 
-    HloEvaluator embedded_evaluator;
+    HloEvaluator embedded_evaluator(parent_->max_loop_iterations_);
     // For each resulting dimension, calculate and assign computed value.
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> output_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> output_index) {
           ReturnT result_val = init_scalar;
 
           std::fill(window_index.begin(), window_index.end(), 0);
@@ -1723,7 +1769,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     const int64 rank = ShapeUtil::Rank(operand->shape());
     const Literal& operand_literal = parent_->GetEvaluatedLiteralFor(operand);
-    auto func = [&](tensorflow::gtl::ArraySlice<int64> out_index) {
+    auto func = [&](ArraySlice<int64> out_index) {
       DimensionVector operand_index(rank);
       for (int64 i = 0; i < rank; ++i) {
         operand_index[i] =
@@ -1904,8 +1950,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     std::vector<int64> operand_indices(start.size());
 
     auto result = Literal::CreateFromShape(result_shape);
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
           for (int64 i = 0; i < operand_indices.size(); ++i) {
             CHECK_GE(multi_index[i] + start[i], 0);
             // Mod is only used here to be consistent with the existing
@@ -1925,17 +1971,26 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
   StatusOr<std::unique_ptr<Literal>> DynamicUpdateSlice(
       const Literal& operand_literal, const Literal& update_literal,
       const Literal& start_indices_literal) {
-    auto start_indices_typed = start_indices_literal.data<IndexT>();
-    const std::vector<int64> start(start_indices_typed.begin(),
-                                   start_indices_typed.end());
-
     auto result = operand_literal.CloneToUnique();
-    std::vector<int64> result_index(ShapeUtil::Rank(result->shape()), 0);
+    auto start_indices_typed = start_indices_literal.data<IndexT>();
+    const auto rank = ShapeUtil::Rank(result->shape());
+    std::vector<int64> start(rank, 0);
+    for (int64 i = 0; i < rank; ++i) {
+      // All other implementations currently wrap around the index, so this
+      // should do so as well.
+      start[i] = (start_indices_typed[i] % result->shape().dimensions(i));
+      start[i] += (start[i] < 0) * result->shape().dimensions(i);
+    }
+    std::vector<int64> result_index(rank, 0);
 
-    auto func = [&](const std::vector<int64>& update_index) {
+    auto func = [&](ArraySlice<int64> update_index) {
       std::transform(update_index.begin(), update_index.end(), start.begin(),
                      result_index.begin(), std::plus<int64>());
-
+      // Same as above: wrap around only to match other implementations'
+      // semantics.
+      std::transform(result_index.begin(), result_index.end(),
+                     result->shape().dimensions().begin(), result_index.begin(),
+                     std::modulus<int64>());
       result->Set<ReturnT>(result_index,
                            update_literal.Get<ReturnT>(update_index));
       return true;
@@ -1988,8 +2043,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     auto result = Literal::CreateFromShape(shape);
 
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
           return ConvertBinaryFunction(binary_op)(
               lhs_literal.Get<ReturnT>(multi_index),
               rhs_literal.Get<ReturnT>(multi_index));
@@ -2026,8 +2081,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     auto result = Literal::CreateFromShape(shape);
 
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
           return ternary_op(lhs_literal.Get<LhsType>(multi_index),
                             rhs_literal.Get<RhsType>(multi_index),
                             ehs_literal.Get<EhsType>(multi_index));
@@ -2047,17 +2102,20 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
   HloEvaluator* parent_;
 };  // class HloEvaluator::TypedVisitor
 
-HloEvaluator::HloEvaluator() {
+HloEvaluator::HloEvaluator(int64 max_loop_iterations)
+    : max_loop_iterations_(max_loop_iterations) {
   typed_visitors_[PRED] = MakeUnique<TypedVisitor<bool>>(this);
   typed_visitors_[U8] = MakeUnique<TypedVisitor<uint8>>(this);
   typed_visitors_[U16] = MakeUnique<FunctionVisitor>([](HloInstruction*) {
-    return Unimplemented("HloEvaluator: unhandled primitive type: U16.");
+    return Unimplemented(
+        "HloEvaluator::TypedVisitor: unhandled primitive type: U16.");
   });
   typed_visitors_[U32] = MakeUnique<TypedVisitor<uint32>>(this);
   typed_visitors_[U64] = MakeUnique<TypedVisitor<uint64>>(this);
   typed_visitors_[S8] = MakeUnique<TypedVisitor<int8>>(this);
   typed_visitors_[S16] = MakeUnique<FunctionVisitor>([](HloInstruction*) {
-    return Unimplemented("HloEvaluator: unhandled primitive type: S16.");
+    return Unimplemented(
+        "HloEvaluator::TypedVisitor: unhandled primitive type: S16.");
   });
   typed_visitors_[S32] = MakeUnique<TypedVisitor<int32>>(this);
   typed_visitors_[S64] = MakeUnique<TypedVisitor<int64>>(this);
@@ -2071,18 +2129,20 @@ HloEvaluator::HloEvaluator() {
   // elementwise computations to be done in F32 and do BF16<->F32 conversion
   // around the input and the output of the computations.
   typed_visitors_[BF16] = MakeUnique<TypedVisitor<bfloat16, float>>(this);
+
   typed_visitors_[TUPLE] = MakeUnique<FunctionVisitor>([](HloInstruction*) {
-    return Unimplemented("HloEvaluator: unhandled primitive type: TUPLE.");
+    return Unimplemented(
+        "HloEvaluator::TypedVisitor: unhandled primitive type: TUPLE.");
   });
   typed_visitors_[OPAQUE] = MakeUnique<FunctionVisitor>([](HloInstruction*) {
-    return Unimplemented("HloEvaluator: unhandled primitive type: OPAQUE.");
+    return Unimplemented(
+        "HloEvaluator::TypedVisitor: unhandled primitive type: OPAQUE.");
   });
 }
 
 template <typename LiteralPtr>
 StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate(
-    const HloModule& module,
-    tensorflow::gtl::ArraySlice<LiteralPtr> arg_literals) {
+    const HloModule& module, ArraySlice<LiteralPtr> arg_literals) {
   XLA_VLOG_LINES(2, "HloEvaluator::Evaluate module:\n" + module.ToString());
 
   evaluated_.clear();
@@ -2099,8 +2159,8 @@ StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate(
 
 template <typename LiteralPtr>
 StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate(
-    const HloComputation& computation,
-    tensorflow::gtl::ArraySlice<LiteralPtr> arg_literals) {
+    const HloComputation& computation, ArraySlice<LiteralPtr> arg_literals) {
+  CHECK(computation.parent() != nullptr);
   XLA_VLOG_LINES(
       2, "HloEvaluator::Evaluate computation:\n" + computation.ToString());
 
@@ -2116,8 +2176,7 @@ StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate(
 
 template <typename LiteralPtr>
 StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate(
-    HloInstruction* instruction,
-    tensorflow::gtl::ArraySlice<LiteralPtr> arg_literals) {
+    HloInstruction* instruction, ArraySlice<LiteralPtr> arg_literals) {
   TF_RET_CHECK(hlo_query::AllOperandsAreParametersOrConstants(*instruction));
   TF_RETURN_IF_ERROR(ShapeUtil::ValidateShape(instruction->shape()));
 
@@ -2242,8 +2301,7 @@ Status HloEvaluator::HandleTranspose(HloInstruction* transpose) {
 }
 
 Status HloEvaluator::HandleConcatenate(HloInstruction* concatenate) {
-  tensorflow::gtl::ArraySlice<HloInstruction*> operands(
-      concatenate->operands());
+  ArraySlice<HloInstruction*> operands(concatenate->operands());
   // The result concatenate dimension is going to be the sum of all
   // concatenate dimensions of the operands taking part of the operation.
   const Shape& reference_shape = operands[0]->shape();
@@ -2415,6 +2473,349 @@ Status HloEvaluator::HandleTuple(HloInstruction* tuple) {
   return Status::OK();
 }
 
+// Returns a ShapeUtil::IndexIterationSpace that iterates over the output
+// gather dimensions while keeping the rest of the output dimensions clamped to
+// 0.
+ShapeUtil::IndexIterationSpace IterationSpaceForOutputGatherIndices(
+    const Shape& output_shape, const GatherDimensionNumbers& dim_numbers) {
+  int64 output_rank = output_shape.dimensions_size();
+  std::vector<int64> index_base(output_rank, 0);
+  std::vector<int64> index_count;
+  index_count.reserve(output_rank);
+  for (int64 i = 0; i < output_rank; i++) {
+    bool is_output_gather_dim =
+        !c_binary_search(dim_numbers.output_window_dims(), i);
+    index_count.push_back(is_output_gather_dim ? output_shape.dimensions(i)
+                                               : 1);
+  }
+
+  return {std::move(index_base), std::move(index_count),
+          std::vector<int64>(output_rank, 1)};
+}
+
+// Returns a ShapeUtil::IndexIterationSpace that iterates over the output window
+// dimensions while keeping the rest of the output dimensions clamped to 0.
+ShapeUtil::IndexIterationSpace IterationSpaceForOutputWindowIndices(
+    int64 output_rank, ArraySlice<int64> window_bounds,
+    const GatherDimensionNumbers& dim_numbers) {
+  std::vector<int64> index_base(output_rank, 0);
+  std::vector<int64> index_count(output_rank, 1);
+  int64 window_bounds_idx = 0;
+  for (int64 i = 0; i < output_rank; i++) {
+    bool is_output_window_dim =
+        c_binary_search(dim_numbers.output_window_dims(), i);
+    if (is_output_window_dim) {
+      while (c_binary_search(dim_numbers.elided_window_dims(),
+                             window_bounds_idx)) {
+        window_bounds_idx++;
+      }
+      index_count[i] = window_bounds[window_bounds_idx++];
+    }
+  }
+
+  return {std::move(index_base), std::move(index_count),
+          std::vector<int64>(output_rank, 1)};
+}
+
+// This functor computes the contribution of gather_indices to an input index
+// corresponding to an output index.  That is, given an output index I, it picks
+// out the gather output indices in I and uses them to look up a gather index,
+// G, from the gather indices tensor, and expands G into the input space
+// according to gather_dims_to_operand_dims.
+class OutputGatherIndexToInputIndex {
+ public:
+  // The constructor does some setup work that is amortized across all
+  // iterations.
+  explicit OutputGatherIndexToInputIndex(
+      const GatherDimensionNumbers* dim_numbers, const Shape& input_shape,
+      const Shape& output_shape, const Literal* gather_indices)
+      : dim_numbers_(*dim_numbers), gather_indices_(*gather_indices) {
+    for (int64 i = 0; i < output_shape.dimensions_size(); i++) {
+      output_dim_is_gather_dims_.push_back(
+          !c_binary_search(dim_numbers_.output_window_dims(), i));
+    }
+
+    for (int64 i = 0; i < input_shape.dimensions_size(); i++) {
+      int64 index_of_input_dim_in_index_vector =
+          std::distance(dim_numbers_.gather_dims_to_operand_dims().begin(),
+                        c_find(dim_numbers_.gather_dims_to_operand_dims(), i));
+      if (index_of_input_dim_in_index_vector ==
+          dim_numbers_.gather_dims_to_operand_dims_size()) {
+        input_dim_value_to_index_vector_.push_back(-1);
+      } else {
+        input_dim_value_to_index_vector_.push_back(
+            index_of_input_dim_in_index_vector);
+      }
+    }
+
+    index_vector_index_.resize(gather_indices_.shape().dimensions_size());
+    input_index_.resize(input_shape.dimensions_size());
+    int64 index_vector_size =
+        gather_indices_.shape().dimensions(dim_numbers_.index_vector_dim());
+    index_vector_.resize(index_vector_size);
+  }
+
+  // Returns the contribution of gather_indices to the input index corresponding
+  // to output_index.  See gather_inner_loop_body.
+  //
+  // This is conceptually a stateless transformation from output_index to the
+  // gather input index, but:
+  //
+  //  - Instead of allocating memory to represent the gather input index on
+  //    every invocation we reuse the same storage for the result
+  //    (input_index_), mutating it in place.
+  //  - Instead of allocating buffers for temporary values like
+  //    index_vector_index_ and index_vector_ on every invocation, we reuse the
+  //    same storage for all invocations.
+  //
+  // This returns an ArraySlice into memory owned by the class.
+  StatusOr<ArraySlice<int64>> operator()(ArraySlice<int64> output_index) {
+    PropagateOutputIndexGatherDimsToIndexVectorIndex(output_index);
+    TF_RETURN_IF_ERROR(FetchIndexVector());
+    PropagateIndexVectorToInputIndex();
+    return ArraySlice<int64>(input_index_);
+  }
+
+ private:
+  // Propagates the gather index dimensions from the output index into
+  // index_vector_index_ by mutating index_vector_index_ in place.  Does not
+  // update the dim_numbers.index_vector_dim() dimension -- that's the dimension
+  // we iterate over in FetchIndexVector.
+  void PropagateOutputIndexGatherDimsToIndexVectorIndex(
+      ArraySlice<int64> output_index) {
+    int64 index_vector_index_i = 0;
+    for (int64 i = 0, e = output_index.size(); i < e; i++) {
+      if (!output_dim_is_gather_dims_[i]) {
+        continue;
+      }
+
+      if (index_vector_index_i == dim_numbers_.index_vector_dim()) {
+        index_vector_index_i++;
+      }
+
+      index_vector_index_[index_vector_index_i++] = output_index[i];
+    }
+  }
+
+  // Populates index_vector_ by iterating over gather_indices_ according to
+  // index_vector_index_.
+  Status FetchIndexVector() {
+    int64 index_vector_dim = dim_numbers_.index_vector_dim();
+    for (int64 i = 0, e = index_vector_.size(); i < e; i++) {
+      index_vector_index_[index_vector_dim] = i;
+      TF_ASSIGN_OR_RETURN(index_vector_[i], gather_indices_.GetIntegralAsS64(
+                                                index_vector_index_));
+    }
+    return Status::OK();
+  }
+
+  // Populates input_index_.
+  void PropagateIndexVectorToInputIndex() {
+    for (int64 i = 0, e = input_index_.size(); i < e; i++) {
+      if (input_dim_value_to_index_vector_[i] != -1) {
+        input_index_[i] = index_vector_[input_dim_value_to_index_vector_[i]];
+      }
+
+      // If input_dim_value_to_index_vector_[i] == -1 then input_index_[i]
+      // remains 0, as set by the constructor.
+    }
+  }
+
+  // input_dim_value_to_index_vector_[i] tells us how to compute dimension i of
+  // the input index from the index vector.  See
+  // PropagateIndexVectorToInputIndex.
+  std::vector<int64> input_dim_value_to_index_vector_;
+
+  // output_dim_is_gather_dims_[i] is true iff output dimension i is a gather
+  // dimension.
+  std::vector<bool> output_dim_is_gather_dims_;
+
+  // The buffer into which we construct an index into gather_indices_ to fetch
+  // the index vector.
+  std::vector<int64> index_vector_index_;
+
+  // The index vector fetched from gather_indices_.
+  std::vector<int64> index_vector_;
+
+  // The result computed by this functor.  operator() returns an ArraySlice into
+  // this vector.
+  std::vector<int64> input_index_;
+
+  const GatherDimensionNumbers& dim_numbers_;
+  const Literal& gather_indices_;
+};
+
+// This functor computes the contribution of the window indices in an output
+// index to an input index.  That is, given an output index I it picks out the
+// output window indices in I and expands it into a window index into the input
+// shape.
+class OutputWindowIndexToInputIndex {
+ public:
+  // The constructor does some setup work that is amortized across all
+  // iterations.
+  explicit OutputWindowIndexToInputIndex(
+      const GatherDimensionNumbers& dim_numbers, const Shape& input_shape,
+      const Shape& output_shape) {
+    std::vector<int64> window_index_to_output_index;
+    int64 output_index_count = 0;
+    for (int64 i = 0; i < output_shape.dimensions_size(); i++) {
+      if (c_binary_search(dim_numbers.output_window_dims(), i)) {
+        window_index_to_output_index.push_back(output_index_count++);
+      } else {
+        output_index_count++;
+      }
+    }
+
+    int64 window_dim_count = 0;
+    for (int64 i = 0; i < input_shape.dimensions_size(); i++) {
+      if (c_binary_search(dim_numbers.elided_window_dims(), i)) {
+        input_dim_value_to_output_index_.push_back(-1);
+      } else {
+        input_dim_value_to_output_index_.push_back(
+            window_index_to_output_index[window_dim_count++]);
+      }
+    }
+
+    input_index_.resize(input_shape.dimensions_size());
+  }
+
+  // Returns the contribution of the window indices to the input index
+  // corresponding to output_index.  See gather_inner_loop_body.
+  //
+  // This is conceptually a stateless transformation from output_index to the
+  // window input index, but instead of allocating memory to represent the
+  // gather input index on every invocation we reuse the same storage for the
+  // result (input_index_), mutating it in place.
+  //
+  // This returns an ArraySlice into memory owned by the class.
+  StatusOr<ArraySlice<int64>> operator()(ArraySlice<int64> output_index) {
+    PropagateOutputIndexWindowDimsToInputIndex(output_index);
+    return ArraySlice<int64>(input_index_);
+  }
+
+ private:
+  // Propagates window dimensions from the output index to input_index_ by
+  // mutating input_index_ in place.
+  void PropagateOutputIndexWindowDimsToInputIndex(
+      ArraySlice<int64> output_index) {
+    for (int64 i = 0, e = input_index_.size(); i < e; i++) {
+      if (input_dim_value_to_output_index_[i] != -1) {
+        input_index_[i] = output_index[input_dim_value_to_output_index_[i]];
+      }
+
+      // If input_dim_value_to_output_index_[i] == -1 then input_index_[i]
+      // remains 0, as set by the constructor.
+    }
+  }
+
+  // input_dim_value_to_output_index_[i] tells us how to compute dimension i of
+  // the input index from the output index. See
+  // PropagateOutputIndexWindowDimsToInputIndex.
+  std::vector<int64> input_dim_value_to_output_index_;
+
+  // The result computed by this functor.  operator() returns an ArraySlice into
+  // this vector.
+  std::vector<int64> input_index_;
+};
+
+// Reshapes the gather indices input to have a trailing degenerate `1` dimension
+// if necessary.  Hands over the ownership of the newly created literal (if
+// there is one) to `reshaped_gather_indices`.
+static StatusOr<std::reference_wrapper<const Literal>> ReshapedGatherIndices(
+    int64 index_vector_dim, const Literal& gather_indices,
+    std::unique_ptr<Literal>* reshaped_gather_indices) {
+  if (gather_indices.shape().dimensions_size() != index_vector_dim) {
+    return std::cref(gather_indices);
+  }
+
+  std::vector<int64> new_shape(gather_indices.shape().dimensions().begin(),
+                               gather_indices.shape().dimensions().end());
+  new_shape.push_back(1);
+  TF_ASSIGN_OR_RETURN(*reshaped_gather_indices,
+                      gather_indices.Reshape(new_shape));
+  return std::cref(**reshaped_gather_indices);
+}
+
+Status HloEvaluator::HandleGather(HloInstruction* gather) {
+  std::unique_ptr<Literal> result = Literal::CreateFromShape(gather->shape());
+  const Shape& shape = gather->shape();
+  const GatherDimensionNumbers& dim_numbers =
+      gather->gather_dimension_numbers();
+  const Literal& operand = GetEvaluatedLiteralFor(gather->operand(0));
+  std::unique_ptr<Literal> reshaped_gather_indices;
+  TF_ASSIGN_OR_RETURN(
+      const Literal& gather_indices,
+      ReshapedGatherIndices(dim_numbers.index_vector_dim(),
+                            GetEvaluatedLiteralFor(gather->operand(1)),
+                            &reshaped_gather_indices));
+
+  // We iterate over the gather dimensions in the output shape in an outer loop
+  // nest, and iterate over the window dimensions in the output shape in an
+  // inner loop nest.
+
+  ShapeUtil::IndexIterationSpace gather_indices_iteration_space =
+      IterationSpaceForOutputGatherIndices(shape, dim_numbers);
+  ShapeUtil::IndexIterationSpace window_indices_iteration_space =
+      IterationSpaceForOutputWindowIndices(
+          shape.dimensions_size(), gather->gather_window_bounds(), dim_numbers);
+
+  // Scratch buffers that hold an index in the output shape and the
+  // corresponding index in the input shape.
+  std::vector<int64> input_index(operand.shape().dimensions_size());
+  std::vector<int64> output_index(gather->shape().dimensions_size());
+
+  OutputGatherIndexToInputIndex output_gather_index_to_input_index(
+      &gather->gather_dimension_numbers(), /*input_shape=*/operand.shape(),
+      /*output_shape=*/shape, &gather_indices);
+  OutputWindowIndexToInputIndex output_window_index_to_input_index(
+      gather->gather_dimension_numbers(), /*input_shape=*/operand.shape(),
+      /*output_shape=*/shape);
+
+  const Shape& operand_shape = operand.shape();
+
+  auto gather_inner_loop_body =
+      [&](ArraySlice<int64> output_window_index,
+          ArraySlice<int64> input_gather_index,
+          ArraySlice<int64> output_gather_index) -> StatusOr<bool> {
+    TF_ASSIGN_OR_RETURN(
+        ArraySlice<int64> input_window_index,
+        output_window_index_to_input_index(output_window_index));
+    for (int i = 0, e = output_index.size(); i < e; i++) {
+      output_index[i] = output_gather_index[i] + output_window_index[i];
+      DCHECK_LT(output_index[i], shape.dimensions(i));
+    }
+    for (int i = 0, e = input_index.size(); i < e; i++) {
+      // TODO(b/74360564): We should implement whatever out of bounds behavior
+      // we decide for dynamic-slice here as well.
+      input_index[i] = (input_gather_index[i] + input_window_index[i]) %
+                       operand_shape.dimensions(i);
+      if (input_index[i] < 0) {
+        input_index[i] += operand_shape.dimensions(i);
+      }
+    }
+    TF_RETURN_IF_ERROR(
+        result->CopyElementFrom(operand, input_index, output_index));
+    return true;
+  };
+
+  auto gather_outer_loop_body =
+      [&](ArraySlice<int64> output_gather_index) -> StatusOr<bool> {
+    TF_ASSIGN_OR_RETURN(
+        ArraySlice<int64> input_gather_index,
+        output_gather_index_to_input_index(output_gather_index));
+    TF_RETURN_IF_ERROR(ShapeUtil::ForEachIndexWithStatus(
+        shape, window_indices_iteration_space,
+        std::bind(gather_inner_loop_body, std::placeholders::_1,
+                  input_gather_index, output_gather_index)));
+    return true;
+  };
+
+  TF_RETURN_IF_ERROR(ShapeUtil::ForEachIndexWithStatus(
+      shape, gather_indices_iteration_space, gather_outer_loop_body));
+  evaluated_[gather] = std::move(result);
+  return Status::OK();
+}
+
 Status HloEvaluator::HandleGetTupleElement(HloInstruction* get_tuple_element) {
   const auto result_shape = get_tuple_element->shape();
   const int64 index = get_tuple_element->tuple_index();
@@ -2445,6 +2846,135 @@ Status HloEvaluator::HandleCopy(HloInstruction* copy) {
   return Status::OK();
 }
 
+Status HloEvaluator::HandleCall(HloInstruction* call) {
+  auto* computation = call->to_apply();
+  auto operands = call->operands();
+
+  std::vector<const Literal*> arg_literals;
+  arg_literals.reserve(operands.size());
+  for (auto operand : operands) {
+    const Literal& arg_literal = GetEvaluatedLiteralFor(operand);
+    arg_literals.push_back(&arg_literal);
+  }
+
+  HloEvaluator embedded_evaluator;
+  std::unique_ptr<Literal> result =
+      embedded_evaluator.Evaluate<const Literal*>(*computation, arg_literals)
+          .ConsumeValueOrDie();
+
+  evaluated_[call] = std::move(result);
+  return Status::OK();
+}
+
+Status HloEvaluator::HandleFusion(HloInstruction* fusion) {
+  // Attach cloned computation to an empty HLO module so the existing ones are
+  // not modified.
+  HloModule empty_hlo_module("EmptyModuleForFusion");
+  auto cloned_fused_computation =
+      fusion->fused_instructions_computation()->Clone(
+          /*suffix=*/"clone_with_layout", &empty_hlo_module);
+  for (auto* instruction : cloned_fused_computation->instructions()) {
+    LayoutUtil::SetToDefaultLayout(instruction->mutable_shape());
+  }
+  auto readded_computation =
+      empty_hlo_module.AddEntryComputation(std::move(cloned_fused_computation));
+
+  auto operands = fusion->operands();
+  std::vector<const Literal*> arg_literals;
+  arg_literals.reserve(operands.size());
+  for (auto operand : operands) {
+    const Literal& arg_literal = GetEvaluatedLiteralFor(operand);
+    arg_literals.push_back(&arg_literal);
+  }
+
+  HloEvaluator embedded_evaluator;
+  std::unique_ptr<Literal> result =
+      embedded_evaluator
+          .Evaluate<const Literal*>(*readded_computation, arg_literals)
+          .ConsumeValueOrDie();
+
+  evaluated_[fusion] = std::move(result);
+  return Status::OK();
+}
+
+Status HloEvaluator::HandleConditional(HloInstruction* conditional) {
+  const auto& pred = GetEvaluatedLiteralFor(conditional->operand(0));
+  const auto& true_computation_arg =
+      GetEvaluatedLiteralFor(conditional->operand(1));
+  const auto& false_computation_arg =
+      GetEvaluatedLiteralFor(conditional->operand(2));
+
+  auto* true_computation = conditional->true_computation();
+  auto* false_computation = conditional->false_computation();
+
+  auto result = Literal::CreateFromShape(conditional->shape());
+  HloEvaluator embedded_evaluator;
+  if (pred.Get<bool>({})) {
+    result = embedded_evaluator
+                 .Evaluate<const Literal*>(*true_computation,
+                                           {&true_computation_arg})
+                 .ConsumeValueOrDie();
+  } else {
+    result = embedded_evaluator
+                 .Evaluate<const Literal*>(*false_computation,
+                                           {&false_computation_arg})
+                 .ConsumeValueOrDie();
+  }
+
+  evaluated_[conditional] = std::move(result);
+  return Status::OK();
+}
+
+Status HloEvaluator::HandleSelect(HloInstruction* select) {
+  const auto& pred = GetEvaluatedLiteralFor(select->operand(0));
+  const auto& on_true = GetEvaluatedLiteralFor(select->operand(1));
+  const auto& on_false = GetEvaluatedLiteralFor(select->operand(2));
+
+  // If the predicate is a scalar, no element-wise selection is needed.
+  // This path also handles tuple-shaped outputs, which DefaultAction cannot,
+  // since it goes through the TypedVisitor, which doesn't handle tuples.
+  if (ShapeUtil::IsScalar(pred.shape())) {
+    if (pred.Get<bool>({})) {
+      evaluated_[select] = on_true.CloneToUnique();
+    } else {
+      evaluated_[select] = on_false.CloneToUnique();
+    }
+    return Status::OK();
+  }
+
+  return DefaultAction(select);
+}
+
+Status HloEvaluator::HandleWhile(HloInstruction* while_hlo) {
+  HloComputation* cond_comp = while_hlo->while_condition();
+  HloComputation* body_comp = while_hlo->while_body();
+  // Initialize the loop-carried value with the input to the While instruction.
+  auto lcv = GetEvaluatedLiteralFor(while_hlo->operand(0)).CloneToUnique();
+  bool keep_going = true;
+  int64 iteration_count = 0;
+  HloEvaluator cond_evaluator(max_loop_iterations_);
+  HloEvaluator loop_body_evaluator(max_loop_iterations_);
+  while (keep_going) {
+    if (max_loop_iterations_ >= 0 && iteration_count++ > max_loop_iterations_) {
+      return InvalidArgument("Loop %s exceeded loop iteration limit (%lld).",
+                             while_hlo->name().c_str(), max_loop_iterations_);
+    }
+    TF_ASSIGN_OR_RETURN(auto cond_val, cond_evaluator.Evaluate<Literal*>(
+                                           *cond_comp, {lcv.get()}));
+    keep_going = cond_val->GetFirstElement<bool>();
+    if (keep_going) {
+      TF_ASSIGN_OR_RETURN(auto body_val, loop_body_evaluator.Evaluate<Literal*>(
+                                             *body_comp, {lcv.get()}));
+      VLOG(3) << "Loop iteration result: " << body_val->ToString();
+      lcv = std::move(body_val);
+      cond_evaluator.ResetVisitStates();
+      loop_body_evaluator.ResetVisitStates();
+    }
+  }
+  evaluated_[while_hlo] = std::move(lcv);
+  return Status::OK();
+}
+
 Status HloEvaluator::Preprocess(HloInstruction* hlo) {
   VLOG(2) << "About to visit HLO: " << hlo->ToString();
   return Status::OK();
@@ -2458,28 +2988,27 @@ Status HloEvaluator::Postprocess(HloInstruction* hlo) {
 
 // Explicit instantiation of templatized Evaluate* methods.
 //
-template StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate<
-    const Literal*>(const HloModule& module,
-                    tensorflow::gtl::ArraySlice<const Literal*> arg_literals);
+template StatusOr<std::unique_ptr<Literal>>
+HloEvaluator::Evaluate<const Literal*>(const HloModule& module,
+                                       ArraySlice<const Literal*> arg_literals);
 template StatusOr<std::unique_ptr<Literal>>
 HloEvaluator::Evaluate<std::unique_ptr<Literal>>(
-    const HloModule& module,
-    tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> arg_literals);
+    const HloModule& module, ArraySlice<std::unique_ptr<Literal>> arg_literals);
 
-template StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate<
-    const Literal*>(const HloComputation& computation,
-                    tensorflow::gtl::ArraySlice<const Literal*> arg_literals);
+template StatusOr<std::unique_ptr<Literal>>
+HloEvaluator::Evaluate<const Literal*>(const HloComputation& computation,
+                                       ArraySlice<const Literal*> arg_literals);
 template StatusOr<std::unique_ptr<Literal>>
 HloEvaluator::Evaluate<std::unique_ptr<Literal>>(
     const HloComputation& computation,
-    tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> arg_literals);
+    ArraySlice<std::unique_ptr<Literal>> arg_literals);
 
-template StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate<
-    const Literal*>(HloInstruction* instruction,
-                    tensorflow::gtl::ArraySlice<const Literal*> arg_literals);
+template StatusOr<std::unique_ptr<Literal>>
+HloEvaluator::Evaluate<const Literal*>(HloInstruction* instruction,
+                                       ArraySlice<const Literal*> arg_literals);
 template StatusOr<std::unique_ptr<Literal>>
 HloEvaluator::Evaluate<std::unique_ptr<Literal>>(
     HloInstruction* instruction,
-    tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> arg_literals);
+    ArraySlice<std::unique_ptr<Literal>> arg_literals);
 
 }  // namespace xla
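Review note: the index wrap-around in `gather_inner_loop_body` (the `%` followed by the `< 0` fix-up) can be looked at in isolation. The sketch below is not part of the patch; `WrapIndex` is a hypothetical helper name, and it assumes a strictly positive bound.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in the patch): maps any signed index into
// [0, bound) using the same two steps as gather_inner_loop_body.
// Assumes bound > 0.
int64_t WrapIndex(int64_t index, int64_t bound) {
  // In C++ the result of % takes the sign of the dividend, so -1 % 3 == -1.
  int64_t wrapped = index % bound;
  if (wrapped < 0) {
    // Shift negative remainders back into the valid range.
    wrapped += bound;
  }
  return wrapped;
}
```

Without the fix-up, a negative gather index would produce a negative input index, which is why the patch adds the dimension back after taking the remainder.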
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator.h b/tensorflow/compiler/xla/service/hlo_evaluator.h
index 3b2b697e492a78a06a4e5ae6bf056ff8676f2ff5..c0dcee0c3e382f74de72a2b89f39e06f042e2b80 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator.h
+++ b/tensorflow/compiler/xla/service/hlo_evaluator.h
@@ -36,7 +36,10 @@ namespace xla {
 // This class is not thread-safe.
 class HloEvaluator : public DfsHloVisitorWithDefault {
  public:
-  HloEvaluator();
+  // Only evaluate up to max_loop_iterations per while-loop execution if
+  // specified.
+  explicit HloEvaluator(int64 max_loop_iterations = -1);
+
   // Evaluates an HLO module and an array of pointers to literals.
   // Returns the evaluated result as a literal if successful.
   // Precondition: The indices of arg_literals correspond to the parameter
@@ -149,10 +152,22 @@ class HloEvaluator : public DfsHloVisitorWithDefault {
 
   Status HandleTuple(HloInstruction* tuple) override;
 
+  Status HandleGather(HloInstruction* gather) override;
+
   Status HandleGetTupleElement(HloInstruction* get_tuple_element) override;
 
   Status HandleCopy(HloInstruction* copy) override;
 
+  Status HandleConditional(HloInstruction* conditional) override;
+
+  Status HandleCall(HloInstruction* call) override;
+
+  Status HandleFusion(HloInstruction* fusion) override;
+
+  Status HandleWhile(HloInstruction* while_hlo) override;
+
+  Status HandleSelect(HloInstruction* select) override;
+
  private:
   // Returns the already-evaluated literal result for the instruction.
   // A Constant instruction is considered evaluated and its literal will be
@@ -190,6 +205,9 @@ class HloEvaluator : public DfsHloVisitorWithDefault {
   // Must be cleared for each evaluation.
   std::vector<const Literal*> arg_literals_;
 
+  // Maximum number of loop iterations to execute; no limit if negative.
+  int64 max_loop_iterations_;
+
   TF_DISALLOW_COPY_AND_ASSIGN(HloEvaluator);
 };
 
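Review note: the iteration limit added to `HandleWhile` can be modelled without any XLA types. The sketch below is an assumption-laden stand-in, not the evaluator's code: `RunBoundedLoop`, its callback parameters, and `bodies_run` are hypothetical names, and the loop condition and body are plain callables rather than HLO computations. Only the limit check is copied from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical model of HandleWhile's control flow. Returns false if the
// iteration limit is hit and true if the loop ends because cond() returned
// false. *bodies_run reports how many times body() executed.
bool RunBoundedLoop(int64_t max_loop_iterations,
                    const std::function<bool()>& cond,
                    const std::function<void()>& body, int64_t* bodies_run) {
  int64_t iteration_count = 0;
  *bodies_run = 0;
  while (true) {
    // Same check as in the patch: a negative limit disables it entirely.
    if (max_loop_iterations >= 0 && iteration_count++ > max_loop_iterations) {
      return false;
    }
    if (!cond()) {
      return true;
    }
    body();
    ++*bodies_run;
  }
}
```

In this sketch, a limit of N lets the body run N + 1 times before the limit is reported, because the comparison is `>` against a post-incremented counter.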
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator_test.cc b/tensorflow/compiler/xla/service/hlo_evaluator_test.cc
index 97765d65909cee192f65069777f8f195081603b2..685cacd7f74c00789296dee16f0a6a94c35a4393 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_evaluator_test.cc
@@ -1729,6 +1729,207 @@ TEST_P(HloEvaluatorTest, EvaluateWithSubstitutionsWithConstantOperand) {
                                *result.ValueOrDie());
 }
 
+TEST_P(HloEvaluatorTest, EvaluateGather_TensorFlowGatherV1) {
+  const char* hlo_text = R"(
+HloModule TensorFlowGatherV1
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2] parameter(1)
+  ROOT gather = s32[2,3] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1, 3}
+}
+)";
+  ParseAndVerifyModule(hlo_text);
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+  LiteralTestUtil::ExpectEqual(
+      *Literal::CreateR2<int32>({{1, 2, 3}, {7, 8, 9}}),
+      *Evaluate({operand.get(), gather_indices.get()}));
+}
+
+TEST_P(HloEvaluatorTest, EvaluateGather_TensorFlowGatherV2) {
+  const char* hlo_text = R"(
+HloModule TensorFlowGatherV2
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2] parameter(1)
+  ROOT gather = s32[3,2] gather(operand, indices),
+      output_window_dims={0},
+      elided_window_dims={1},
+      gather_dims_to_operand_dims={1},
+      index_vector_dim=1,
+      window_bounds={3, 1}
+}
+)";
+  ParseAndVerifyModule(hlo_text);
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+  LiteralTestUtil::ExpectEqual(
+      *Literal::CreateR2<int32>({{1, 3}, {4, 6}, {7, 9}}),
+      *Evaluate({operand.get(), gather_indices.get()}));
+}
+
+TEST_P(HloEvaluatorTest, EvaluateGather_TensorFlowGatherMultipleBatchDims) {
+  const char* hlo_text = R"(
+HloModule TensorFlowGatherMultipleBatchDims
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2,2] parameter(1)
+  ROOT gather = s32[2,3,2] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={1},
+      gather_dims_to_operand_dims={1},
+      index_vector_dim=2,
+      window_bounds={3, 1}
+}
+)";
+  ParseAndVerifyModule(hlo_text);
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      Literal::CreateR2<int32>({{0, 2}, {2, 1}});
+  LiteralTestUtil::ExpectEqual(
+      *Literal::CreateR3<int32>(
+          {{{1, 3}, {4, 6}, {7, 9}}, {{3, 2}, {6, 5}, {9, 8}}}),
+      *Evaluate({operand.get(), gather_indices.get()}));
+}
+
+TEST_P(HloEvaluatorTest, EvaluateGather_TensorFlowGatherNd) {
+  const char* hlo_text = R"(
+HloModule TensorFlowGatherNd
+
+ENTRY main {
+  operand = s32[3,3,2] parameter(0)
+  indices = s32[2,2] parameter(1)
+  ROOT gather = s32[2,2] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0,1},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=1,
+      window_bounds={1,1,2}
+}
+)";
+  ParseAndVerifyModule(hlo_text);
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
+                                {{-4, 4}, {-5, 5}, {-6, 6}},  //
+                                {{-7, 7}, {-8, 8}, {-9, 9}}});
+  std::unique_ptr<Literal> gather_indices =
+      Literal::CreateR2<int32>({{0, 0}, {1, 0}});
+  LiteralTestUtil::ExpectEqual(
+      *Literal::CreateR2<int32>({{-1, 1}, {-4, 4}}),
+      *Evaluate({operand.get(), gather_indices.get()}));
+}
+
+TEST_P(HloEvaluatorTest,
+       EvaluateGather_TensorFlowGatherNdNonDefaultIndexVectorDim) {
+  const char* hlo_text = R"(
+HloModule TensorFlowGatherNd
+
+ENTRY main {
+  operand = s32[3,3,2] parameter(0)
+  indices = s32[2,2] parameter(1)
+  ROOT gather = s32[2,2] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0,1},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=0,
+      window_bounds={1,1,2}
+}
+)";
+  ParseAndVerifyModule(hlo_text);
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
+                                {{-4, 4}, {-5, 5}, {-6, 6}},  //
+                                {{-7, 7}, {-8, 8}, {-9, 9}}});
+  std::unique_ptr<Literal> gather_indices =
+      Literal::CreateR2<int32>({{0, 0}, {1, 0}});
+  LiteralTestUtil::ExpectEqual(
+      *Literal::CreateR2<int32>({{-2, 2}, {-1, 1}}),
+      *Evaluate({operand.get(), gather_indices.get()}));
+}
+
+TEST_P(HloEvaluatorTest, EvaluateGather_DynamicSlice) {
+  const char* hlo_text = R"(
+HloModule DynamicSlice
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2] parameter(1)
+  ROOT gather = s32[1,1] gather(operand, indices),
+      output_window_dims={0,1},
+      elided_window_dims={},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=0,
+      window_bounds={1,1}
+}
+)";
+  ParseAndVerifyModule(hlo_text);
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({1, 1});
+  LiteralTestUtil::ExpectEqual(
+      *Literal::CreateR2<int32>({{5}}),
+      *Evaluate({operand.get(), gather_indices.get()}));
+}
+
+TEST_P(HloEvaluatorTest, EvaluateGather_BatchDynamicSlice) {
+  const char* hlo_text = R"(
+HloModule BatchDynamicSlice
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2,2] parameter(1)
+  ROOT gather = s32[2,1,1] gather(operand, indices),
+      output_window_dims={1,2},
+      elided_window_dims={},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=0,
+      window_bounds={1,1}
+}
+)";
+  ParseAndVerifyModule(hlo_text);
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      Literal::CreateR2<int32>({{2, 1}, {1, 1}});
+  LiteralTestUtil::ExpectEqual(
+      *Literal::CreateR3<int32>({{{8}}, {{5}}}),
+      *Evaluate({operand.get(), gather_indices.get()}));
+}
+
+TEST_P(HloEvaluatorTest, EvaluateGather_ZeroDimBounds) {
+  const char* hlo_text = R"(
+HloModule TensorFlowGatherV1
+
+ENTRY main {
+  operand = s32[3,0] parameter(0)
+  indices = s32[2] parameter(1)
+  ROOT gather = s32[2,0] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1, 0}
+}
+)";
+  ParseAndVerifyModule(hlo_text);
+  std::unique_ptr<Literal> operand = Literal::CreateR2<int32>({{}, {}, {}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+  LiteralTestUtil::ExpectEqual(
+      *Literal::CreateR2<int32>({{}, {}}),
+      *Evaluate({operand.get(), gather_indices.get()}));
+}
+
 INSTANTIATE_TEST_CASE_P(HloEvaluatorTest_Instantiation, HloEvaluatorTest,
                         ::testing::ValuesIn(use_bf16_params));
 
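Review note: the expected literals in the `TensorFlowGatherV1` and `TensorFlowGatherV2` tests can be cross-checked against a plain reference. The sketch below is hypothetical and independent of the XLA API: `Matrix`, `GatherRows`, and `GatherColumns` are illustrative names, and every index is assumed to be in range.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::vector<int32_t>>;

// Reference for the V1 case: output row i is operand row indices[i].
Matrix GatherRows(const Matrix& operand, const std::vector<int32_t>& indices) {
  Matrix result;
  result.reserve(indices.size());
  for (int32_t row : indices) {
    result.push_back(operand.at(row));
  }
  return result;
}

// Reference for the V2 case: output column i is operand column indices[i].
Matrix GatherColumns(const Matrix& operand,
                     const std::vector<int32_t>& indices) {
  Matrix result;
  result.reserve(operand.size());
  for (const std::vector<int32_t>& row : operand) {
    std::vector<int32_t> picked;
    picked.reserve(indices.size());
    for (int32_t col : indices) {
      picked.push_back(row.at(col));
    }
    result.push_back(picked);
  }
  return result;
}
```

With the 3x3 operand and indices `{0, 2}` used in the tests, these two functions give the same values as the expected literals for the V1 and V2 cases respectively.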
diff --git a/tensorflow/compiler/xla/service/hlo_execution_profile.cc b/tensorflow/compiler/xla/service/hlo_execution_profile.cc
index f0df93b61d29c1535d8a89fbd65e669de5b43729..c3ccbf0f0c75b569b49652807dea52faebdccc31 100644
--- a/tensorflow/compiler/xla/service/hlo_execution_profile.cc
+++ b/tensorflow/compiler/xla/service/hlo_execution_profile.cc
@@ -111,8 +111,8 @@ HloExecutionProfile::HloExecutionProfile(
     : hlo_profile_printer_data_(*hlo_profile_printer_data),
       hlo_profile_index_map_(*hlo_profile_index_map),
       profile_counters_(
-          /*count*/ hlo_profile_index_map_.total_count(),
-          /*value*/ 0) {}
+          /*count=*/hlo_profile_index_map_.total_count(),
+          /*value=*/0) {}
 
 void HloExecutionProfile::SetCyclesTakenBy(const HloInstruction* hlo,
                                            uint64 cycles_taken) {
diff --git a/tensorflow/compiler/xla/service/hlo_graph_dumper.cc b/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
index 2861fec39ef0c92fdfbcee04584f9bd36d3cb4d8..1dc72355cf179e996caab4d6b52068dc99d02244 100644
--- a/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
+++ b/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
@@ -157,52 +157,60 @@ enum ColorScheme {
   kDashedBorder,
 };
 
+// Graphviz attributes/colors that make up a color scheme.
+struct NodeColors {
+  const char* style;
+  const char* fill_color;
+  const char* stroke_color;
+  const char* font_color;
+};
+
+NodeColors NodeColorsForScheme(ColorScheme color) {
+  switch (color) {
+    case kBlue:
+      return NodeColors{"filled", "#bbdefb", "#8aacc8", "black"};
+    case kBrown:
+      return NodeColors{"filled", "#bcaaa4", "#8c7b75", "black"};
+    case kDarkBlue:
+      return NodeColors{"filled", "#1565c0", "#003c8f", "white"};
+    case kDarkGreen:
+      return NodeColors{"filled", "#2e7d32", "#005005", "white"};
+    case kDarkRed:
+      return NodeColors{"filled", "#b71c1c", "#7f0000", "white"};
+    case kGray:
+      return NodeColors{"filled", "#cfd8dc", "#9ea7aa", "black"};
+    case kGreen:
+      return NodeColors{"filled", "#c8e6c9", "#97b498", "black"};
+    case kOrange:
+      return NodeColors{"filled", "#ffe0b2", "#cbae82", "black"};
+    case kPurple:
+      return NodeColors{"filled", "#e1bee7", "#af8eb5", "black"};
+    case kRed:
+      return NodeColors{"filled", "#ffcdd2", "#cb9ca1", "black"};
+    case kWhite:
+      return NodeColors{"filled", "white", "black", "black"};
+    case kYellow:
+      return NodeColors{"filled", "#fff9c4", "#cbc693", "black"};
+    case kDashedBorder:
+      // "filled,dashed" looks the same as "dashed", since we have a white
+      // background.  But we use "filled,dashed" so that when you hover over
+      // any part of the node (not just the text inside the node), our css
+      // :hover rule is triggered.
+      return NodeColors{"filled,dashed", "white", "#757575", "#757575"};
+  }
+}
+
 // Given a ColorScheme, returns an attribute string for a node of that color.
 // Sets the node's style and fill/stroke/text colors.
 //
 // Colors are from https://material.io/color.
 string NodeColorAttributes(ColorScheme color) {
-  using std::make_tuple;
-
-  const char *style, *fill_color, *stroke_color, *font_color;
-  std::tie(style, fill_color, stroke_color, font_color) = [color] {
-    switch (color) {
-      case kBlue:
-        return make_tuple("filled", "#bbdefb", "#8aacc8", "black");
-      case kBrown:
-        return make_tuple("filled", "#bcaaa4", "#8c7b75", "black");
-      case kDarkBlue:
-        return make_tuple("filled", "#1565c0", "#003c8f", "white");
-      case kDarkGreen:
-        return make_tuple("filled", "#2e7d32", "#005005", "white");
-      case kDarkRed:
-        return make_tuple("filled", "#b71c1c", "#7f0000", "white");
-      case kGray:
-        return make_tuple("filled", "#cfd8dc", "#9ea7aa", "black");
-      case kGreen:
-        return make_tuple("filled", "#c8e6c9", "#97b498", "black");
-      case kOrange:
-        return make_tuple("filled", "#ffe0b2", "#cbae82", "black");
-      case kPurple:
-        return make_tuple("filled", "#e1bee7", "#af8eb5", "black");
-      case kRed:
-        return make_tuple("filled", "#ffcdd2", "#cb9ca1", "black");
-      case kWhite:
-        return make_tuple("filled", "white", "black", "black");
-      case kYellow:
-        return make_tuple("filled", "#fff9c4", "#cbc693", "black");
-      case kDashedBorder:
-        // "filled,dashed" looks the same as "dashed", since we have a white
-        // background.  But we use "filled,dashed" so that when you hover over
-        // any part of the node (not just the text inside the node), our css
-        // :hover rule is triggered.
-        return make_tuple("filled,dashed", "white", "#757575", "#757575");
-    }
-  }();
+  NodeColors node_colors = NodeColorsForScheme(color);
 
   return Printf(
-      R"(style="%s", fontcolor="%s", color="%s", fillcolor="%s")", style,
-      font_color, stroke_color, fill_color);
+      R"(style="%s", fontcolor="%s", color="%s", fillcolor="%s")",
+      node_colors.style, node_colors.font_color, node_colors.stroke_color,
+      node_colors.fill_color);
 }
 
 // Replaces <> with &lt;&gt;, so that this string is safe(er) for use in a
@@ -604,11 +612,21 @@ tooltip = " ";
       StrAppend(&subcomp_label, "<br/>", extra_info);
     }
 
-    // Subcomputation's fill/stroke color is light/dark red/gray, depending on
-    // whether or not the subcomputation's fusion node is highlighted.
     bool highlight = filter_.Highlight(parent_instr);
-    const char* fillcolor = highlight ? "#ffcdd2" : "#f5f5f5";
-    const char* strokecolor = highlight ? "#b71c1c" : "#c2c2c2";
+    const char* fillcolor;
+    const char* strokecolor;
+    if (debug_options_.xla_hlo_graph_sharding_color() && !highlight) {
+      // Use the sharding color, if the node isn't highlighted.
+      NodeColors node_colors =
+          NodeColorsForScheme(GetInstructionColor(parent_instr));
+      fillcolor = node_colors.fill_color;
+      strokecolor = node_colors.stroke_color;
+    } else {
+      // Subcomputation's fill/stroke color is light/dark red/gray, depending on
+      // whether or not the subcomputation's fusion node is highlighted.
+      fillcolor = highlight ? "#ffcdd2" : "#f5f5f5";
+      strokecolor = highlight ? "#b71c1c" : "#c2c2c2";
+    }
     style =
         Printf(R"(style="rounded,filled,bold"; fillcolor="%s"; color="%s;")",
                fillcolor, strokecolor);
@@ -782,6 +800,14 @@ string HloDotDumper::GetInstructionNodeInlinedOperands(
   auto stringify_constant = [](const HloInstruction* constant) {
     const auto& shape = constant->shape();
 
+    // If the shape has a dimension of size zero, print it as e.g.
+    // "{} (f32[42, 0, 10])".  The alternative, calling Literal::ToString(),
+    // enumerates all of its empty dimensions (e.g.  "{ { {}, {} }, ..."), which
+    // is just noise.
+    if (ShapeUtil::HasZeroElements(shape)) {
+      return Printf("{} (%s)", ShapeUtil::HumanString(constant->shape()));
+    }
+
     // Print the literal value of constants with <= K elements.
     optional<int64> elem_count;
     if (!ShapeUtil::IsOpaque(shape) && !ShapeUtil::IsTuple(shape)) {
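Review note: the zero-element shortcut in `stringify_constant` can be illustrated on a plain list of dimensions. The sketch below is hypothetical: `HasZeroElements` here is a stand-in operating on a vector, not `ShapeUtil::HasZeroElements`, and `DescribeEmptyConstant` formats the shape the way the patch's comment writes it, which may differ from what `ShapeUtil::HumanString` actually prints.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in check: an array has zero elements exactly when some dimension
// is zero. A scalar (no dimensions) has one element.
bool HasZeroElements(const std::vector<int64_t>& dims) {
  for (int64_t d : dims) {
    if (d == 0) {
      return true;
    }
  }
  return false;
}

// Hypothetical formatter producing e.g. "{} (f32[42, 0, 10])".
// Only meaningful when HasZeroElements(dims) is true.
std::string DescribeEmptyConstant(const std::string& element_type,
                                  const std::vector<int64_t>& dims) {
  std::string shape = element_type + "[";
  for (std::size_t i = 0; i < dims.size(); ++i) {
    if (i > 0) {
      shape += ", ";
    }
    shape += std::to_string(dims[i]);
  }
  shape += "]";
  return "{} (" + shape + ")";
}
```

Printing the shape once keeps the graph label short, whereas enumerating every empty sub-array grows with the non-zero dimensions while conveying no values.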
diff --git a/tensorflow/compiler/xla/service/hlo_instruction.cc b/tensorflow/compiler/xla/service/hlo_instruction.cc
index b7dd055d7cd78eb759a2b24bcbbbc948159f9425..a2a2c1e615a7f2b226c712a75b1240b980fc8d3c 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction.cc
+++ b/tensorflow/compiler/xla/service/hlo_instruction.cc
@@ -37,6 +37,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/gtl/flatmap.h"
+#include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
@@ -51,24 +52,22 @@ using ::tensorflow::strings::StrCat;
 /* static */
 StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
     HloModule* module, const HloInstructionProto& proto,
-    const tensorflow::gtl::FlatMap<string, HloInstruction*>& instruction_map,
-    const tensorflow::gtl::FlatMap<string, HloComputation*>& computation_map,
-    const std::function<void(std::unique_ptr<HloComputation>)>&
-        add_fused_computation) {
+    const tensorflow::gtl::FlatMap<int64, HloInstruction*>& instruction_map,
+    const tensorflow::gtl::FlatMap<int64, HloComputation*>& computation_map) {
   TF_RET_CHECK(!proto.opcode().empty());
   TF_ASSIGN_OR_RETURN(HloOpcode opcode, StringToHloOpcode(proto.opcode()));
   TF_RET_CHECK(proto.has_shape());
 
   auto instruction = WrapUnique(new HloInstruction(opcode, proto.shape()));
-  for (const string& operand_name : proto.operand_names()) {
-    TF_RET_CHECK(ContainsKey(instruction_map, operand_name))
-        << "No instruction named " << operand_name;
-    instruction->AppendOperand(instruction_map.at(operand_name));
-  }
-  for (const string& predecessor_name : proto.control_predecessor_names()) {
-    TF_RET_CHECK(ContainsKey(instruction_map, predecessor_name))
-        << "No instruction named " << predecessor_name;
-    TF_RETURN_IF_ERROR(instruction_map.at(predecessor_name)
+  for (const int64 operand_id : proto.operand_ids()) {
+    TF_RET_CHECK(ContainsKey(instruction_map, operand_id))
+        << "No instruction with id " << operand_id;
+    instruction->AppendOperand(instruction_map.at(operand_id));
+  }
+  for (const int64 predecessor_id : proto.control_predecessor_ids()) {
+    TF_RET_CHECK(ContainsKey(instruction_map, predecessor_id))
+        << "No instruction with id " << predecessor_id;
+    TF_RETURN_IF_ERROR(instruction_map.at(predecessor_id)
                            ->AddControlDependencyTo(instruction.get()));
   }
 
@@ -76,23 +75,26 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
   // HloInstructionProto and do not appear as an HloComputationProto within the
   // HloModuleProto.
   if (instruction->opcode() == HloOpcode::kFusion) {
-    TF_RET_CHECK(proto.has_fused_instructions_computation());
     TF_RET_CHECK(!proto.fusion_kind().empty());
     TF_ASSIGN_OR_RETURN(instruction->fusion_kind_,
                         StringToFusionKind(proto.fusion_kind()));
-    TF_ASSIGN_OR_RETURN(std::unique_ptr<HloComputation> fused_computation,
-                        HloComputation::CreateFromProto(
-                            module, proto.fused_instructions_computation(),
-                            computation_map, add_fused_computation,
-                            /*fusion_instruction=*/instruction.get()));
-    instruction->called_computations_.push_back(fused_computation.get());
-    add_fused_computation(std::move(fused_computation));
+
+    // Find the fused computation and set its fusion instruction.
+    TF_RET_CHECK(proto.called_computation_ids_size() == 1)
+        << "Expected 1 called computation for fusion instruction, but saw "
+        << proto.called_computation_ids_size();
+    const int64 fusion_id = proto.called_computation_ids(0);
+    auto* fused_computation = FindPtrOrNull(computation_map, fusion_id);
+    TF_RET_CHECK(fused_computation != nullptr)
+        << "No fusion computation with id " << fusion_id;
+    fused_computation->SetFusionInstruction(instruction.get());
+    instruction->called_computations_.push_back(fused_computation);
   } else {
-    for (const string& computation_name : proto.called_computation_names()) {
-      TF_RET_CHECK(ContainsKey(computation_map, computation_name))
-          << "No computation named " << computation_name;
+    for (const int64 computation_id : proto.called_computation_ids()) {
+      TF_RET_CHECK(ContainsKey(computation_map, computation_id))
+          << "No computation with id " << computation_id;
       instruction->called_computations_.push_back(
-          computation_map.at(computation_name));
+          computation_map.at(computation_id));
     }
   }
 
@@ -182,6 +184,7 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
 /* static */ std::unique_ptr<HloInstruction>
 HloInstruction::CreateGetTupleElement(const Shape& shape,
                                       HloInstruction* operand, int64 index) {
+  CHECK(ShapeUtil::IsTuple(operand->shape()));
   auto instruction =
       WrapUnique(new HloInstruction(HloOpcode::kGetTupleElement, shape));
   instruction->tuple_index_ = index;
@@ -1172,7 +1175,8 @@ bool HloInstruction::HasSideEffect() const {
 /* static */ GatherDimensionNumbers HloInstruction::MakeGatherDimNumbers(
     tensorflow::gtl::ArraySlice<int64> output_window_dims,
     tensorflow::gtl::ArraySlice<int64> elided_window_dims,
-    tensorflow::gtl::ArraySlice<int64> gather_dims_to_operand_dims) {
+    tensorflow::gtl::ArraySlice<int64> gather_dims_to_operand_dims,
+    int64 index_vector_dim) {
   GatherDimensionNumbers gather_dim_numbers;
   for (int64 output_window_dim : output_window_dims) {
     gather_dim_numbers.add_output_window_dims(output_window_dim);
@@ -1184,6 +1188,7 @@ bool HloInstruction::HasSideEffect() const {
     gather_dim_numbers.add_gather_dims_to_operand_dims(gather_dim_to_input_dim);
   }
 
+  gather_dim_numbers.set_index_vector_dim(index_vector_dim);
   return gather_dim_numbers;
 }
 
@@ -2310,14 +2315,18 @@ string HloInstruction::ToShortString() const {
 
 HloInstructionProto HloInstruction::ToProto() const {
   HloInstructionProto proto;
+  CHECK(unique_id_ != -1)
+      << "This instruction does not have a valid id. Please make sure the "
+         "instruction is inside a module before dumping it.";
+  proto.set_id(unique_id_);
   proto.set_name(name_);
   proto.set_opcode(HloOpcodeString(opcode_));
   *proto.mutable_shape() = shape_;
   for (const HloInstruction* operand : operands_) {
-    *proto.add_operand_names() = operand->name();
+    proto.add_operand_ids(operand->unique_id());
   }
   for (const HloInstruction* control : control_predecessors_) {
-    *proto.add_control_predecessor_names() = control->name();
+    proto.add_control_predecessor_ids(control->unique_id());
   }
 
   *proto.mutable_metadata() = metadata_;
@@ -2327,11 +2336,11 @@ HloInstructionProto HloInstruction::ToProto() const {
   proto.set_parameter_number(parameter_number_);
   if (opcode() == HloOpcode::kFusion) {
     proto.set_fusion_kind(xla::ToString(fusion_kind()));
-    *proto.mutable_fused_instructions_computation() =
-        fused_instructions_computation()->ToProto();
+    proto.add_called_computation_ids(
+        fused_instructions_computation()->unique_id());
   } else {
     for (const HloComputation* computation : called_computations_) {
-      *proto.add_called_computation_names() = computation->name();
+      proto.add_called_computation_ids(computation->unique_id());
     }
   }
 
@@ -2680,8 +2689,10 @@ Status HloInstruction::Visit(DfsHloVisitorBase<HloInstructionPtr>* visitor) {
     case HloOpcode::kTrace:
       break;
   }
-  return Unimplemented("unhandled HloOpcode for DfsHloVisitor: %s",
-                       HloOpcodeString(opcode_).c_str());
+  return InternalError(
+      "Unhandled HloOpcode for DfsHloVisitor: %s. This should not happen - "
+      "please file a bug for XLA.",
+      HloOpcodeString(opcode_).c_str());
 }
 
 // Explicit instantiations.
@@ -3369,9 +3380,12 @@ string HloInstruction::GatherDimensionNumbersToString() const {
   string gather_dims_to_operand_dims = StrCat(
       "gather_dims_to_operand_dims={",
       Join(gather_dimension_numbers_->gather_dims_to_operand_dims(), ","), "}");
+  string index_vector_dim = StrCat(
+      "index_vector_dim=", gather_dimension_numbers_->index_vector_dim());
 
   return Join<std::initializer_list<string>>(
-      {output_window_dims, elided_window_dims, gather_dims_to_operand_dims},
+      {output_window_dims, elided_window_dims, gather_dims_to_operand_dims,
+       index_vector_dim},
       ", ");
 }
 
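Review note on the hlo_instruction.cc change above: `CreateFromProto` now resolves operands, control predecessors, and called computations through `int64` ids rather than names. The following is a minimal, self-contained sketch of that lookup pattern, not the XLA implementation. `NodeProto`, `Node`, and `BuildNodes` are hypothetical stand-ins for `HloInstructionProto`, `HloInstruction`, and the deserialization loop, and the sketch assumes the protos arrive in dependency order.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a serialized instruction: operands are
// referenced by id, so names do not have to be unique.
struct NodeProto {
  int64_t id;
  std::string name;
  std::vector<int64_t> operand_ids;
};

// Hypothetical stand-in for the in-memory instruction.
struct Node {
  int64_t id;
  std::string name;
  std::vector<const Node*> operands;
};

// Rebuilds nodes from protos listed in dependency order. Returns false if
// an id is duplicated or an operand id has not been seen yet.
bool BuildNodes(const std::vector<NodeProto>& protos,
                std::vector<std::unique_ptr<Node>>* out) {
  std::unordered_map<int64_t, const Node*> by_id;
  for (const NodeProto& proto : protos) {
    if (by_id.count(proto.id) > 0) return false;  // Duplicate id.
    std::unique_ptr<Node> node(new Node{proto.id, proto.name, {}});
    for (int64_t operand_id : proto.operand_ids) {
      auto it = by_id.find(operand_id);
      if (it == by_id.end()) return false;  // Unknown operand id.
      node->operands.push_back(it->second);
    }
    by_id[proto.id] = node.get();
    out->push_back(std::move(node));
  }
  return true;
}
```

Keying the map on ids means two nodes may share a name without making the lookup ambiguous, which a name-keyed map cannot tolerate.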
diff --git a/tensorflow/compiler/xla/service/hlo_instruction.h b/tensorflow/compiler/xla/service/hlo_instruction.h
index e4d22e5703811dc9b5f3ea3ee1ca85fd848f88b2..a94ba145df792ade9bb7ce3e9a31b56b2f460cd2 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction.h
+++ b/tensorflow/compiler/xla/service/hlo_instruction.h
@@ -179,20 +179,15 @@ class HloInstruction {
   //   module: the module which will contain the instruction. The newly created
   //     instruction is *not* added to the module or any computation, however.
   //   proto: the proto to convert from.
-  //   instruction_map: a map from instruction name to HloInstruction*. This map
+  //   instruction_map: a map from instruction id to HloInstruction*. This map
   //     must contain all operands of the newly constructed instruction.
-  //   computation_map: a map from computation name to HloComputation*. This map
+  //   computation_map: a map from computation id to HloComputation*. This map
   //     must contain all computations which the newly constructed instruction
   //     calls.
-  //   add_fused_computation: A function to call to add a fused
-  //     computation. Used (clearly) when the instruction is a fusion
-  //     instruction.
   static StatusOr<std::unique_ptr<HloInstruction>> CreateFromProto(
       HloModule* module, const HloInstructionProto& proto,
-      const tensorflow::gtl::FlatMap<string, HloInstruction*>& instruction_map,
-      const tensorflow::gtl::FlatMap<string, HloComputation*>& computation_map,
-      const std::function<void(std::unique_ptr<HloComputation>)>&
-          add_fused_computation);
+      const tensorflow::gtl::FlatMap<int64, HloInstruction*>& instruction_map,
+      const tensorflow::gtl::FlatMap<int64, HloComputation*>& computation_map);
 
   // Creates a parameter-retrieving instruction.
   static std::unique_ptr<HloInstruction> CreateParameter(int64 parameter_number,
@@ -502,7 +497,8 @@ class HloInstruction {
   static GatherDimensionNumbers MakeGatherDimNumbers(
       tensorflow::gtl::ArraySlice<int64> output_window_dims,
       tensorflow::gtl::ArraySlice<int64> elided_window_dims,
-      tensorflow::gtl::ArraySlice<int64> gather_dims_to_operand_dims);
+      tensorflow::gtl::ArraySlice<int64> gather_dims_to_operand_dims,
+      int64 index_vector_dim);
 
   // Returns the opcode for this instruction.
   HloOpcode opcode() const { return opcode_; }
diff --git a/tensorflow/compiler/xla/service/hlo_instruction_test.cc b/tensorflow/compiler/xla/service/hlo_instruction_test.cc
index 32d3ed272bd6b239918076999ecae6c1b3ded2fd..f2980d309d01fdf3b3e601bc260a0ad0895b3064 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_instruction_test.cc
@@ -1271,7 +1271,7 @@ TEST_F(HloInstructionTest, Stringification) {
             "true_computation=%TransposeDot, false_computation=%TransposeDot");
 }
 
-TEST_F(HloInstructionTest, StringifyGather) {
+TEST_F(HloInstructionTest, StringifyGather_0) {
   Shape input_tensor_shape = ShapeUtil::MakeShape(F32, {50, 49, 48, 47, 46});
   Shape gather_indices_tensor_shape =
       ShapeUtil::MakeShape(S64, {10, 9, 8, 7, 5});
@@ -1291,7 +1291,8 @@ TEST_F(HloInstructionTest, StringifyGather) {
           HloInstruction::MakeGatherDimNumbers(
               /*output_window_dims=*/{4, 5, 6, 7, 8},
               /*elided_window_dims=*/{},
-              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/4),
           /*window_bounds=*/{30, 29, 28, 27, 26}));
 
   HloModule module(TestName());
@@ -1303,7 +1304,43 @@ TEST_F(HloInstructionTest, StringifyGather) {
             "s64[10,9,8,7,5]{4,3,2,1,0} %gather_indices), "
             "output_window_dims={4,5,6,7,8}, elided_window_dims={}, "
             "gather_dims_to_operand_dims={0,1,2,3,4}, "
-            "window_bounds={30,29,28,27,26}");
+            "index_vector_dim=4, window_bounds={30,29,28,27,26}");
+}
+
+TEST_F(HloInstructionTest, StringifyGather_1) {
+  Shape input_tensor_shape = ShapeUtil::MakeShape(F32, {50, 49, 48, 47, 46});
+  Shape gather_indices_tensor_shape =
+      ShapeUtil::MakeShape(S64, {10, 9, 5, 7, 6});
+  Shape gather_result_shape =
+      ShapeUtil::MakeShape(F32, {10, 9, 7, 6, 30, 29, 28, 27, 26});
+
+  HloComputation::Builder builder("Gather");
+  HloInstruction* input = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, input_tensor_shape, "input_tensor"));
+  HloInstruction* gather_indices =
+      builder.AddInstruction(HloInstruction::CreateParameter(
+          1, gather_indices_tensor_shape, "gather_indices"));
+
+  HloInstruction* gather_instruction =
+      builder.AddInstruction(HloInstruction::CreateGather(
+          gather_result_shape, input, gather_indices,
+          HloInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/{4, 5, 6, 7, 8},
+              /*elided_window_dims=*/{},
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/2),
+          /*window_bounds=*/{30, 29, 28, 27, 26}));
+
+  HloModule module(TestName());
+  module.AddEntryComputation(builder.Build());
+
+  EXPECT_EQ(gather_instruction->ToString(),
+            "%gather = f32[10,9,7,6,30,29,28,27,26]{8,7,6,5,4,3,2,1,0} "
+            "gather(f32[50,49,48,47,46]{4,3,2,1,0} %input_tensor, "
+            "s64[10,9,5,7,6]{4,3,2,1,0} %gather_indices), "
+            "output_window_dims={4,5,6,7,8}, elided_window_dims={}, "
+            "gather_dims_to_operand_dims={0,1,2,3,4}, "
+            "index_vector_dim=2, window_bounds={30,29,28,27,26}");
 }
 
 }  // namespace
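Review note on the two StringifyGather tests above: they differ only in `index_vector_dim` and in the shape of the gather indices. The sketch below shows how the result shape used in `StringifyGather_1` can be derived. It is a simplified illustration and not the XLA shape-inference code. `GatherOutputDims` and `Contains` are hypothetical helpers, and the sketch assumes that the output consists of the indices dimensions with `index_vector_dim` removed (the batch dimensions), with the non-elided window bounds placed at the positions listed in `output_window_dims`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of XLA: true if `values` contains `v`.
static bool Contains(const std::vector<int64_t>& values, int64_t v) {
  return std::find(values.begin(), values.end(), v) != values.end();
}

// Hypothetical helper, not part of XLA: computes gather output dimensions
// under the assumptions stated in the note above. Inputs are assumed valid
// (distinct, in-range dimension numbers).
std::vector<int64_t> GatherOutputDims(
    const std::vector<int64_t>& indices_dims, int64_t index_vector_dim,
    const std::vector<int64_t>& output_window_dims,
    const std::vector<int64_t>& elided_window_dims,
    const std::vector<int64_t>& window_bounds) {
  // Batch dimensions: every indices dimension except index_vector_dim.
  std::vector<int64_t> batch;
  for (size_t i = 0; i < indices_dims.size(); ++i) {
    if (static_cast<int64_t>(i) != index_vector_dim) {
      batch.push_back(indices_dims[i]);
    }
  }
  // Window dimensions: every window bound that is not elided.
  std::vector<int64_t> window;
  for (size_t i = 0; i < window_bounds.size(); ++i) {
    if (!Contains(elided_window_dims, static_cast<int64_t>(i))) {
      window.push_back(window_bounds[i]);
    }
  }
  assert(output_window_dims.size() == window.size());

  // Interleave: positions named in output_window_dims take window bounds,
  // all remaining positions take batch dimensions, both in order.
  std::vector<int64_t> out;
  size_t b = 0;
  size_t w = 0;
  const size_t rank = batch.size() + window.size();
  for (size_t i = 0; i < rank; ++i) {
    if (Contains(output_window_dims, static_cast<int64_t>(i))) {
      out.push_back(window.at(w++));
    } else {
      out.push_back(batch.at(b++));
    }
  }
  return out;
}
```

With the inputs of `StringifyGather_1` (indices `{10, 9, 5, 7, 6}`, `index_vector_dim` 2), the sketch yields `{10, 9, 7, 6, 30, 29, 28, 27, 26}`, which matches the `gather_result_shape` declared in that test.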
diff --git a/tensorflow/compiler/xla/service/hlo_module.cc b/tensorflow/compiler/xla/service/hlo_module.cc
index cb2fe9f874012a51e1e6cbd1dd086dbb26994bde..2037764daeb34837f7af488350a7bf55bc59ebc5 100644
--- a/tensorflow/compiler/xla/service/hlo_module.cc
+++ b/tensorflow/compiler/xla/service/hlo_module.cc
@@ -83,6 +83,11 @@ HloComputation* HloModule::AddComputationInternal(
   for (auto* instruction : computation->instructions()) {
     instruction->SetUniqueId(NewUniqueInstructionId());
   }
+  // Assign this computation a unique id, reusing its root instruction's id.
+  CHECK_NE(computation->root_instruction()->unique_id(), -1)
+      << "Root has no valid id: " << computation->ToString();
+  computation->SetUniqueId(computation->root_instruction()->unique_id());
+
   computation->set_parent(this);
   computations_.push_back(std::move(computation));
   return computations_.back().get();
@@ -204,83 +209,29 @@ string HloModule::ToString(const HloPrintOptions& options) const {
 
 HloModuleProto HloModule::ToProto() const {
   HloModuleProto proto;
+  proto.set_id(unique_id_);
   proto.set_name(name_);
   proto.set_entry_computation_name(entry_computation_->name());
+  proto.set_entry_computation_id(entry_computation_->unique_id());
   for (const HloComputation* computation : MakeComputationPostOrder()) {
-    // Fusion computations are added when the fusion instructions are created by
-    // HloInstruction::CreateFromProto.
-    if (computation->IsFusionComputation()) {
-      continue;
-    }
     HloComputationProto computation_proto = computation->ToProto();
+    if (computation->name() == entry_computation_->name()) {
+      *proto.mutable_program_shape() = computation_proto.program_shape();
+    }
     proto.add_computations()->Swap(&computation_proto);
   }
   return proto;
 }
 
-namespace {
-
-// Construct a ProgramShape matching the shape of the parameters and root of the
-// given module's entry computation.
-StatusOr<ProgramShape> ProgramShapeFromProto(const HloModuleProto& module) {
-  const HloComputationProto* entry_computation = nullptr;
-  for (const HloComputationProto& computation : module.computations()) {
-    if (computation.name() == module.entry_computation_name()) {
-      entry_computation = &computation;
-      break;
-    }
-  }
-  TF_RET_CHECK(entry_computation != nullptr)
-      << "No computation with entry computation name"
-      << module.entry_computation_name();
-
-  tensorflow::gtl::FlatMap<int64, std::pair<string, const Shape*>> parameters;
-  const HloInstructionProto* root = nullptr;
-  for (const HloInstructionProto& instruction :
-       entry_computation->instructions()) {
-    if (instruction.name() == entry_computation->root_name()) {
-      TF_RET_CHECK(root == nullptr) << "Entry computation has more than "
-                                       "one instruction with (root) name "
-                                    << instruction.name();
-      root = &instruction;
-    }
-    if (instruction.opcode() == HloOpcodeString(HloOpcode::kParameter)) {
-      TF_RET_CHECK(!ContainsKey(parameters, instruction.parameter_number()))
-          << "Entry computation has more than one parameter instruction "
-             "with parameter number "
-          << instruction.parameter_number();
-      parameters[instruction.parameter_number()] = {instruction.name(),
-                                                    &instruction.shape()};
-    }
-  }
-  TF_RET_CHECK(root != nullptr)
-      << "Entry computation is missing root instruction named "
-      << entry_computation->root_name();
-
-  ProgramShape program_shape;
-  *program_shape.mutable_result() = root->shape();
-  for (int64 i = 0; i < parameters.size(); ++i) {
-    TF_RET_CHECK(ContainsKey(parameters, i))
-        << "Entry computation missing parameter number " << i;
-    const string& name = parameters.at(i).first;
-    const Shape& shape = *parameters.at(i).second;
-    *program_shape.add_parameters() = shape;
-    program_shape.add_parameter_names(name);
-  }
-
-  return std::move(program_shape);
-}
-
-}  // namespace
-
 /* static */
 StatusOr<std::unique_ptr<HloModule>> HloModule::CreateFromProto(
     const HloModuleProto& proto, const HloModuleConfig& module_config,
     const VersionedComputationHandle& entry_computation_handle) {
   // The ProgramShape in the passed in module config must match the shapes of
   // the entry parameters and root.
-  TF_ASSIGN_OR_RETURN(ProgramShape expected_program_shape,
-                      ProgramShapeFromProto(proto));
+  TF_RET_CHECK(proto.has_program_shape())
+      << "No program shape found in the proto";
+  const auto& expected_program_shape = proto.program_shape();
   TF_RET_CHECK(expected_program_shape.parameters_size() ==
                module_config.entry_computation_layout().parameter_count());
   for (int i = 0; i < expected_program_shape.parameters_size(); ++i) {
@@ -305,26 +256,20 @@ StatusOr<std::unique_ptr<HloModule>> HloModule::CreateFromProto(
   auto module = MakeUnique<HloModule>(proto.name(), entry_computation_handle,
                                       module_config);
 
-  tensorflow::gtl::FlatMap<string, HloComputation*> computation_map;
+  tensorflow::gtl::FlatMap<int64, HloComputation*> computation_map;
   for (const HloComputationProto& computation_proto : proto.computations()) {
-    TF_ASSIGN_OR_RETURN(
-        std::unique_ptr<HloComputation> computation,
-        HloComputation::CreateFromProto(
-            module.get(), computation_proto, computation_map,
-            /*add_fused_computation=*/
-            [&module](std::unique_ptr<HloComputation> fused_computation) {
-              module->AddComputationInternal(std::move(fused_computation),
-                                             /*is_entry=*/false,
-                                             /*uniquify_names=*/false);
-            }));
+    TF_ASSIGN_OR_RETURN(std::unique_ptr<HloComputation> computation,
+                        HloComputation::CreateFromProto(
+                            module.get(), computation_proto, computation_map));
     CHECK_NE(computation.get(), nullptr);
-    TF_RET_CHECK(!ContainsKey(computation_map, computation->name()));
-    string computation_name = computation->name();
+    int64 computation_id = computation_proto.id();
+    TF_RET_CHECK(computation_id != -1);
+    TF_RET_CHECK(!ContainsKey(computation_map, computation_id));
     // Don't uniquify names because we want names to be stable across
     // serialization and deserialization.
-    computation_map[computation_name] = module->AddComputationInternal(
+    computation_map[computation_id] = module->AddComputationInternal(
         std::move(computation),
-        /*is_entry=*/proto.entry_computation_name() == computation_name,
+        /*is_entry=*/proto.entry_computation_id() == computation_id,
         /*uniquify_names=*/false);
   }
   TF_RET_CHECK(module->entry_computation_ != nullptr);
@@ -334,10 +279,6 @@ StatusOr<std::unique_ptr<HloModule>> HloModule::CreateFromProto(
   tensorflow::gtl::FlatSet<string> computation_names;
   tensorflow::gtl::FlatSet<string> instruction_names;
   for (HloComputation* computation : module->computations()) {
-    if (computation->IsFusionComputation()) {
-      continue;
-    }
-
     TF_RET_CHECK(!ContainsKey(computation_names, computation->name()))
         << "Computation name is not unique: " << computation->name();
     computation_names.insert(computation->name());
@@ -354,8 +295,9 @@ StatusOr<std::unique_ptr<HloModule>> HloModule::CreateFromProto(
 /* static */
 StatusOr<HloModuleConfig> HloModule::CreateModuleConfigFromProto(
     const HloModuleProto& module) {
-  TF_ASSIGN_OR_RETURN(ProgramShape program_shape,
-                      ProgramShapeFromProto(module));
+  TF_RET_CHECK(module.has_program_shape())
+      << "No program shape found in the proto";
+  const auto& program_shape = module.program_shape();
 
   HloModuleConfig module_config(program_shape);
 
diff --git a/tensorflow/compiler/xla/service/hlo_module.h b/tensorflow/compiler/xla/service/hlo_module.h
index 06d92f94fd6f62162b22575e9cc341f2906cd0db..755bbd359f7b95e7f3f3cbee1b46df85908202c6 100644
--- a/tensorflow/compiler/xla/service/hlo_module.h
+++ b/tensorflow/compiler/xla/service/hlo_module.h
@@ -103,7 +103,7 @@ class HloModule {
     return config_.mutable_entry_computation_layout();
   }
 
-  ComputationLayout entry_computation_layout() const {
+  const ComputationLayout& entry_computation_layout() const {
     return config_.entry_computation_layout();
   }
 
@@ -187,11 +187,6 @@ class HloModule {
   // Returns a randomly generated uint64.
   uint64 RandomNew64() const;
 
-  // Returns the unique name for a computation in this module.
-  string GetUniqueCompuationName(const string& prefix) {
-    return computation_name_uniquer_.GetUniqueName(prefix);
-  }
-
   // Returns the NameUniquer for uniquing instruction names in this module.
   NameUniquer& instruction_name_uniquer() { return instruction_name_uniquer_; }
 
diff --git a/tensorflow/compiler/xla/service/hlo_module_config.cc b/tensorflow/compiler/xla/service/hlo_module_config.cc
index 822e2f1f53e5ee460b88c2241ecf7f6b91ef608b..4205b0402cb8b2c31141d65be652cd84c22e7262 100644
--- a/tensorflow/compiler/xla/service/hlo_module_config.cc
+++ b/tensorflow/compiler/xla/service/hlo_module_config.cc
@@ -40,7 +40,7 @@ void HloModuleConfig::SetDefaultComputationLayout(
 
 string HloModuleConfig::compilation_cache_key() const {
   string key =
-      tensorflow::strings::StrCat("profiling=", hlo_profiling_enabled_);
+      tensorflow::strings::StrCat("profiling=", hlo_profiling_enabled());
   StrAppend(&key, "::(");
   std::vector<string> params;
   for (const ShapeLayout& param_layout :
diff --git a/tensorflow/compiler/xla/service/hlo_module_config.h b/tensorflow/compiler/xla/service/hlo_module_config.h
index d3c1fae592bb465609ffbde2d0262e2600912e63..586a03d412681cacdd780f48e77baf4cd4c51415 100644
--- a/tensorflow/compiler/xla/service/hlo_module_config.h
+++ b/tensorflow/compiler/xla/service/hlo_module_config.h
@@ -63,9 +63,10 @@ class HloModuleConfig {
     return &(*entry_computation_layout_);
   }
 
-  // Sets/returns whether to enable HLO-level profiling.
-  bool hlo_profiling_enabled() const { return hlo_profiling_enabled_; }
-  void enable_hlo_profiling(bool enabled) { hlo_profiling_enabled_ = enabled; }
+  // Returns whether to enable HLO-level profiling.
+  bool hlo_profiling_enabled() const {
+    return debug_options_.xla_hlo_profile();
+  }
 
   // Sets/returns whether this is a "host module".  Host modules are used to
   // record the data- and control-flow dependencies of host side computation
@@ -110,9 +111,6 @@ class HloModuleConfig {
 
   tensorflow::gtl::optional<ComputationLayout> entry_computation_layout_;
 
-  // Whether to enable HLO-level profiling.
-  bool hlo_profiling_enabled_ = false;
-
   // Whether this is a 'host module'.
   bool is_host_module_ = false;
 
diff --git a/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc b/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc
new file mode 100644
index 0000000000000000000000000000000000000000..fa5dcb0b369d17c70c64c67b9f11640c93fb4278
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc
@@ -0,0 +1,350 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_module_group_metadata.h"
+
+#include <string>
+#include <utility>
+
+#include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+
+string HloModuleGroupMetadata::TrackedInstruction::ToString() const {
+  string repr =
+      (instruction_ != nullptr) ? instruction_->ToShortString() : "NULL";
+  switch (kind_) {
+    case ComputationKind::kInvalid:
+      repr += ":INVALID";
+      break;
+    case ComputationKind::kWhileCondition:
+      repr += ":WHILE_CONDITION";
+      break;
+    case ComputationKind::kWhileBody:
+      repr += ":WHILE_BODY";
+      break;
+    case ComputationKind::kConditionalTrue:
+      repr += ":CONDITIONAL_TRUE";
+      break;
+    case ComputationKind::kConditionalFalse:
+      repr += ":CONDITIONAL_FALSE";
+      break;
+  }
+  return repr;
+}
+
+/* static */ StatusOr<std::unique_ptr<HloModuleGroupMetadata>>
+HloModuleGroupMetadata::Build(const std::vector<HloModule*>& modules) {
+  auto metadata = MakeUnique<HloModuleGroupMetadata>(modules);
+  TF_RETURN_IF_ERROR(metadata->Build());
+  return std::move(metadata);
+}
+
+Status HloModuleGroupMetadata::Build() {
+  TF_RETURN_IF_ERROR(RecordInstructions());
+  TF_RETURN_IF_ERROR(VerifyChannelInstructions());
+
+  // Record all companion while instructions.
+  const auto visitor = [this](HloInstruction* hlo) -> Status {
+    // We only need to process if the instruction is within the computation
+    // of a companion instruction, like in the condition or body computation
+    // of a While.
+    const TrackedInstruction* tracked = GetTrackedInstruction(hlo->parent());
+    if (tracked == nullptr) {
+      return Status::OK();
+    }
+    // Add the parent computation of this channel instruction and its peer
+    // computation (both must be while computations) as companions.
+    if (IsChannelInstruction(hlo)) {
+      HloComputation* peer_computation = PeerComputation(hlo);
+      const TrackedInstruction* peer_tracked =
+          GetTrackedInstruction(peer_computation);
+      TF_RET_CHECK(peer_tracked != nullptr)
+          << "Peer instruction is not a possible companion";
+      TF_RET_CHECK(*tracked == *peer_tracked)
+          << "Peer instruction does not match the computation kind";
+      TF_RETURN_IF_ERROR(
+          AddCompanion(tracked->instruction(), peer_tracked->instruction()));
+    }
+
+    // Add the parents of companion instructions (they must all be the same
+    // kind of instruction, opcode-wise) as companions.
+    if (IsCompanionInstruction(hlo)) {
+      for (HloInstruction* companion : Companions(hlo)) {
+        const TrackedInstruction* companion_tracked =
+            GetTrackedInstruction(companion->parent());
+        TF_RET_CHECK(companion_tracked != nullptr);
+        TF_RET_CHECK(*tracked == *companion_tracked);
+        TF_RETURN_IF_ERROR(AddCompanion(tracked->instruction(),
+                                        companion_tracked->instruction()));
+      }
+    }
+    return Status::OK();
+  };
+
+  // Visit the computations in postorder so that the companion information grows
+  // from inner computations to outer ones.
+  for (HloModule* module : modules_) {
+    for (HloComputation* computation : module->MakeComputationPostOrder()) {
+      TF_RETURN_IF_ERROR(computation->Accept(visitor));
+    }
+  }
+  return Status::OK();
+}
+
+bool HloModuleGroupMetadata::IsChannelInstruction(
+    const HloInstruction* instruction) const {
+  switch (instruction->opcode()) {
+    case HloOpcode::kSend:
+    case HloOpcode::kRecv:
+    case HloOpcode::kSendDone:
+    case HloOpcode::kRecvDone:
+      return true;
+    default:
+      return false;
+  }
+}
+
+bool HloModuleGroupMetadata::IsCompanionInstruction(HloInstruction* hlo) const {
+  return companion_set_index_.count(hlo) > 0;
+}
+
+bool HloModuleGroupMetadata::InstructionCommunicates(
+    HloInstruction* hlo) const {
+  return IsChannelInstruction(hlo) || IsCompanionInstruction(hlo);
+}
+
+const HloModuleGroupMetadata::Channel& HloModuleGroupMetadata::GetChannel(
+    int64 channel_id) const {
+  CHECK(channel_id_map_.find(channel_id) != channel_id_map_.end());
+  return channels_[channel_id_map_.at(channel_id)];
+}
+
+HloComputation* HloModuleGroupMetadata::PeerComputation(
+    const HloInstruction* instruction) const {
+  CHECK(IsChannelInstruction(instruction));
+  const Channel& channel = GetChannel(instruction->channel_id());
+  switch (instruction->opcode()) {
+    case HloOpcode::kSend:
+    case HloOpcode::kSendDone:
+      return channel.recv->parent();
+    case HloOpcode::kRecv:
+    case HloOpcode::kRecvDone:
+      return channel.send->parent();
+    default:
+      LOG(FATAL) << "opcode not supported";
+  }
+}
+
+std::vector<HloModuleGroupMetadata::TrackedInstruction>
+HloModuleGroupMetadata::GetCompanionsPath(const HloInstruction* hlo) const {
+  std::vector<TrackedInstruction> path;
+  const HloComputation* parent = hlo->parent();
+  const TrackedInstruction* companion;
+  while ((companion = GetTrackedInstruction(parent)) != nullptr) {
+    parent = companion->instruction()->parent();
+    path.push_back(*companion);
+  }
+  return path;
+}
+
+bool HloModuleGroupMetadata::CheckCompanionPathsCompatibility(
+    const std::vector<TrackedInstruction>& path0,
+    const std::vector<TrackedInstruction>& path1) const {
+  if (path0.size() != path1.size()) {
+    VLOG(5) << "Companion path sizes do not match: " << path0.size()
+            << " != " << path1.size();
+    return false;
+  }
+  for (int64 i = 0; i < path0.size(); ++i) {
+    if (path0[i] != path1[i]) {
+      VLOG(5) << "Companion instructions at path index " << i
+              << " do not have the same opcode: " << path0[i].ToString()
+              << " vs " << path1[i].ToString();
+      return false;
+    }
+  }
+  return true;
+}
+
+int64 HloModuleGroupMetadata::GetModuleId(const HloModule* module) const {
+  for (int64 i = 0; i < modules_.size(); ++i) {
+    if (modules_[i] == module) {
+      return i;
+    }
+  }
+  LOG(FATAL) << "unknown module";
+}
+
+Status HloModuleGroupMetadata::RecordInstructions() {
+  const auto visitor = [this](HloInstruction* hlo) -> Status {
+    if (hlo->opcode() == HloOpcode::kWhile) {
+      tracked_instructions_[hlo->while_condition()] =
+          TrackedInstruction(hlo, ComputationKind::kWhileCondition);
+      tracked_instructions_[hlo->while_body()] =
+          TrackedInstruction(hlo, ComputationKind::kWhileBody);
+    } else if (hlo->opcode() == HloOpcode::kConditional) {
+      tracked_instructions_[hlo->true_computation()] =
+          TrackedInstruction(hlo, ComputationKind::kConditionalTrue);
+      tracked_instructions_[hlo->false_computation()] =
+          TrackedInstruction(hlo, ComputationKind::kConditionalFalse);
+    }
+    if (!IsChannelInstruction(hlo)) {
+      return Status::OK();
+    }
+
+    // Add a new channel if needed.
+    if (channel_id_map_.find(hlo->channel_id()) == channel_id_map_.end()) {
+      channels_.emplace_back();
+      channels_.back().id = hlo->channel_id();
+      channel_id_map_[hlo->channel_id()] = channels_.size() - 1;
+      max_channel_id_ = std::max(max_channel_id_, hlo->channel_id());
+    }
+    Channel& channel = channels_[channel_id_map_[hlo->channel_id()]];
+
+    if (hlo->opcode() == HloOpcode::kSend) {
+      TF_RET_CHECK(channel.send == nullptr)
+          << "channel id " << hlo->channel_id()
+          << " is used by multiple send instructions";
+      channel.send = hlo;
+    }
+    if (hlo->opcode() == HloOpcode::kRecv) {
+      TF_RET_CHECK(channel.recv == nullptr)
+          << "channel id " << hlo->channel_id()
+          << " is used by multiple recv instructions";
+      channel.recv = hlo;
+    }
+    if (hlo->opcode() == HloOpcode::kSendDone) {
+      TF_RET_CHECK(channel.send_done == nullptr)
+          << "channel id " << hlo->channel_id()
+          << " is used by multiple send-done instructions";
+      channel.send_done = hlo;
+    }
+    if (hlo->opcode() == HloOpcode::kRecvDone) {
+      TF_RET_CHECK(channel.recv_done == nullptr)
+          << "channel id " << hlo->channel_id()
+          << " is used by multiple recv-done instructions";
+      channel.recv_done = hlo;
+    }
+    return Status::OK();
+  };
+
+  for (HloModule* module : modules_) {
+    for (auto* computation : module->computations()) {
+      TF_RETURN_IF_ERROR(computation->Accept(visitor));
+    }
+  }
+  return Status::OK();
+}
+
+Status HloModuleGroupMetadata::AddCompanion(HloInstruction* instruction1,
+                                            HloInstruction* instruction2) {
+  TF_RET_CHECK(instruction1->opcode() == HloOpcode::kWhile ||
+               instruction1->opcode() == HloOpcode::kConditional);
+  VLOG(2) << "adding as companions: " << instruction1->ToString() << " and "
+          << instruction2->ToString();
+
+  if (!ContainsKey(companion_set_index_, instruction1) &&
+      !ContainsKey(companion_set_index_, instruction2)) {
+    companion_sets_.push_back(
+        absl::make_unique<std::unordered_set<HloInstruction*>>());
+    auto companion_set = companion_sets_.back().get();
+    companion_set->insert(instruction1);
+    companion_set->insert(instruction2);
+    companion_set_index_[instruction1] = companion_sets_.size() - 1;
+    companion_set_index_[instruction2] = companion_sets_.size() - 1;
+  } else if (!ContainsKey(companion_set_index_, instruction1)) {
+    companion_sets_[companion_set_index_[instruction2]]->insert(instruction1);
+    companion_set_index_[instruction1] = companion_set_index_[instruction2];
+  } else if (!ContainsKey(companion_set_index_, instruction2)) {
+    companion_sets_[companion_set_index_[instruction1]]->insert(instruction2);
+    companion_set_index_[instruction2] = companion_set_index_[instruction1];
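+  } else if (companion_set_index_[instruction1] !=
+             companion_set_index_[instruction2]) {
+    // Merge the companion set of instruction2 into that of instruction1.
+    companion_sets_[companion_set_index_[instruction1]]->insert(
+        Companions(instruction2).begin(), Companions(instruction2).end());
+    int64 index_to_remove = companion_set_index_[instruction2];
+    for (HloInstruction* hlo : Companions(instruction2)) {
+      companion_set_index_[hlo] = companion_set_index_[instruction1];
+    }
+    companion_sets_.erase(companion_sets_.begin() + index_to_remove);
+    // Erasing shifts every later set down by one slot, so the indices stored
+    // for those sets must be decremented to stay valid.
+    for (auto& entry : companion_set_index_) {
+      if (entry.second > index_to_remove) {
+        --entry.second;
+      }
+    }
+  }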
+  return Status::OK();
+}
+
+Status HloModuleGroupMetadata::VerifyChannelInstructions() {
+  for (const Channel& channel : channels_) {
+    if (channel.send == nullptr) {
+      return FailedPrecondition("missing send for id : %lld", channel.id);
+    }
+    if (channel.recv == nullptr) {
+      return FailedPrecondition("missing recv for id : %lld", channel.id);
+    }
+    if (channel.send_done == nullptr) {
+      return FailedPrecondition("missing send-done for id : %lld", channel.id);
+    }
+    if (channel.recv_done == nullptr) {
+      return FailedPrecondition("missing recv-done for id : %lld", channel.id);
+    }
+  }
+
+  // Check if the shapes match for each channel.
+  for (const Channel& channel : channels_) {
+    const Shape& send_shape = channel.send->operand(0)->shape();
+    const Shape& recv_shape = channel.recv_done->shape();
+    if (!ShapeUtil::Compatible(send_shape, recv_shape)) {
+      return FailedPrecondition("send/recv shapes do not match");
+    }
+  }
+
+  // Check if channel instructions are used only in allowed computations.
+  const auto allowed = [this](HloInstruction* hlo) {
+    HloComputation* computation = hlo->parent();
+    const HloModule* module = computation->parent();
+    if (module->entry_computation() == computation ||
+        tracked_instructions_.count(computation) > 0) {
+      return true;
+    }
+    return false;
+  };
+  for (const Channel& channel : channels_) {
+    if (!allowed(channel.send) || !allowed(channel.send_done) ||
+        !allowed(channel.recv) || !allowed(channel.recv_done)) {
+      return FailedPrecondition("channel is used in disallowed computation");
+    }
+  }
+  // Check if the nest levels match for each channel.
+  for (const Channel& channel : channels_) {
+    std::vector<TrackedInstruction> path = GetCompanionsPath(channel.send);
+    if (!CheckCompanionPathsCompatibility(
+            path, GetCompanionsPath(channel.send_done)) ||
+        !CheckCompanionPathsCompatibility(path,
+                                          GetCompanionsPath(channel.recv)) ||
+        !CheckCompanionPathsCompatibility(
+            path, GetCompanionsPath(channel.recv_done))) {
+      return FailedPrecondition(
+          "Nest companion paths do not match for channel %lld", channel.id);
+    }
+  }
+  return Status::OK();
+}
+
+}  // namespace xla
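
The merge branch of `AddCompanion` has to keep `companion_set_index_` consistent with the positions in `companion_sets_` once a set is removed from the vector. The following is a minimal standalone sketch of that bookkeeping, not the XLA implementation: instructions are modeled as plain `int`s, and `CompanionSets`, `sets_`, and `index_` are hypothetical names introduced only for illustration. It assumes C++14 (`std::make_unique`) and re-indexes the sets that shift down after the erase.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the companion bookkeeping; an "instruction" is
// just an int here.
class CompanionSets {
 public:
  // Records that `a` and `b` are companions, creating, extending, or merging
  // sets as needed.
  void AddCompanion(int a, int b) {
    const bool has_a = index_.count(a) > 0;
    const bool has_b = index_.count(b) > 0;
    if (!has_a && !has_b) {
      sets_.push_back(std::make_unique<std::unordered_set<int>>());
      sets_.back()->insert(a);
      sets_.back()->insert(b);
      const int64_t new_index = static_cast<int64_t>(sets_.size()) - 1;
      index_[a] = new_index;
      index_[b] = new_index;
    } else if (!has_a) {
      sets_[index_[b]]->insert(a);
      index_[a] = index_[b];
    } else if (!has_b) {
      sets_[index_[a]]->insert(b);
      index_[b] = index_[a];
    } else if (index_[a] != index_[b]) {
      const int64_t keep = index_[a];
      const int64_t remove = index_[b];
      // Move every member of the removed set into the kept set.
      for (int member : *sets_[remove]) {
        sets_[keep]->insert(member);
        index_[member] = keep;
      }
      sets_.erase(sets_.begin() + remove);
      // Sets stored after the erased slot moved down by one position.
      for (auto& entry : index_) {
        if (entry.second > remove) {
          --entry.second;
        }
      }
    }
  }

  // Returns the companion set containing `instruction`.
  const std::unordered_set<int>& Companions(int instruction) const {
    assert(index_.count(instruction) == 1);
    return *sets_[index_.at(instruction)];
  }

  std::size_t size() const { return sets_.size(); }

 private:
  std::vector<std::unique_ptr<std::unordered_set<int>>> sets_;
  std::unordered_map<int, int64_t> index_;
};
```

With three sets {1, 2}, {3, 4}, and {5, 6}, calling `AddCompanion(1, 3)` leaves two sets, and a lookup of 5 still resolves to {5, 6} because its stored index was decremented along with the shift.
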
diff --git a/tensorflow/compiler/xla/service/hlo_module_group_metadata.h b/tensorflow/compiler/xla/service/hlo_module_group_metadata.h
new file mode 100644
index 0000000000000000000000000000000000000000..c48a7ab0b59269474f7406ef24a249355528e085
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_module_group_metadata.h
@@ -0,0 +1,239 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_MODULE_GROUP_METADATA_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_MODULE_GROUP_METADATA_H_
+
+#include <memory>
+#include <set>
+#include <string>
+#include <unordered_set>
+#include <vector>
+
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/status.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/gtl/flatmap.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+
+// Class for bookkeeping the information on the given modules, in particular on
+// the interaction between computations.
+//
+// Companion instructions are one kind of information collected as we build the
+// metadata. For example, for each While instruction, companion instructions
+// refer to a set of While instructions in other computations that communicate
+// with each other.
+// In the example below with 3 modules, {While_0, While_2, While_5}, {While_1,
+// While_4}, {While_3, While_6} are companion sets.
+//
+// <Module 0>               <Module 1>                 <Module 2>
+// While_0() {              While_2() {                While_5() {
+//   While_1() { Send(0) }    While_3() { Send(1) }      While_6() { Recv(1) }
+// }                          While_4() { Recv(0) }
+//                          }
+//
+// Companion instructions are used to detect cycles in the graph and also for
+// global scheduling.
+class HloModuleGroupMetadata {
+ public:
+  // The kind of companion computation a given instruction can be within.
+  enum class ComputationKind {
+    kInvalid,
+    kWhileCondition,
+    kWhileBody,
+    kConditionalTrue,
+    kConditionalFalse,
+  };
+
+  // Tracks the instruction mapped to a given computation, and the computation
+  // kind.
+  // For example, the body computation of a while instruction generates a
+  // TrackedInstruction whose instruction is the while instruction and whose
+  // kind is ComputationKind::kWhileBody.
+  class TrackedInstruction {
+   public:
+    TrackedInstruction() = default;
+    TrackedInstruction(HloInstruction* instruction, ComputationKind kind)
+        : instruction_(instruction), kind_(kind) {}
+
+    bool operator==(const TrackedInstruction& rhs) const {
+      return instruction_->opcode() == rhs.instruction_->opcode() &&
+             kind_ == rhs.kind_;
+    }
+    bool operator!=(const TrackedInstruction& rhs) const {
+      return !operator==(rhs);
+    }
+
+    HloInstruction* instruction() const { return instruction_; }
+
+    string ToString() const;
+
+   private:
+    HloInstruction* instruction_ = nullptr;
+    ComputationKind kind_ = ComputationKind::kInvalid;
+  };
+
+  // Represents a channel and the 4 instructions that form the channel.
+  struct Channel {
+    int64 id = -1;
+    HloInstruction* send = nullptr;
+    HloInstruction* recv = nullptr;
+    HloInstruction* send_done = nullptr;
+    HloInstruction* recv_done = nullptr;
+  };
+
+  explicit HloModuleGroupMetadata(const std::vector<HloModule*>& modules)
+      : modules_(modules) {}
+
+  ~HloModuleGroupMetadata() = default;
+
+  // Build and return the metadata for the given modules.
+  static StatusOr<std::unique_ptr<HloModuleGroupMetadata>> Build(
+      const std::vector<HloModule*>& modules);
+
+  // Returns true if the instruction is one of the 4 channel instructions (Send,
+  // Recv, SendDone, RecvDone).
+  bool IsChannelInstruction(const HloInstruction* instruction) const;
+
+  // Returns true if the instruction is a companion instruction. See the class
+  // comment above on companion instructions.
+  bool IsCompanionInstruction(HloInstruction* hlo) const;
+
+  // Returns true if the instruction is either a channel instruction or a
+  // companion instruction.
+  bool InstructionCommunicates(HloInstruction* hlo) const;
+
+  // Returns the Channel instance for the given channel id.
+  const Channel& GetChannel(int64 channel_id) const;
+
+  // Returns the computation that contains the peer channel instructions for
+  // the given instruction.
+  //
+  // Precondition: IsChannelInstruction(instruction) is true.
+  HloComputation* PeerComputation(const HloInstruction* instruction) const;
+
+  // Returns the path of the nested companion instructions, in terms of HLO
+  // instructions. The path goes from inner to outer companions.
+  // The returned path does not include the input hlo instruction, in case it
+  // is a companion instruction.
+  std::vector<TrackedInstruction> GetCompanionsPath(
+      const HloInstruction* hlo) const;
+
+  // Checks whether two companion paths (as returned by the GetCompanionsPath()
+  // API) are compatible. The two paths are compatible if the sequence of
+  // opcodes, and the companion kinds, of the two paths matches.
+  bool CheckCompanionPathsCompatibility(
+      const std::vector<TrackedInstruction>& path0,
+      const std::vector<TrackedInstruction>& path1) const;
+
+  // Returns a unique integer id for the given module. The returned id is the
+  // index of the module in the module vector.
+  int64 GetModuleId(const HloModule* module) const;
+
+  // Returns the companion instructions for the given instruction.
+  //
+  // Precondition: IsCompanionInstruction(instruction) is true.
+  const std::unordered_set<HloInstruction*>& Companions(
+      HloInstruction* instruction) const {
+    CHECK_EQ(companion_set_index_.count(instruction), 1);
+    return companion_set(companion_set_index_.at(instruction));
+  }
+
+  // Returns the companion set at the given index.
+  const std::unordered_set<HloInstruction*>& companion_set(int64 index) const {
+    CHECK_LT(index, companion_sets_.size());
+    return *companion_sets_[index];
+  }
+
+  // Returns the companion set index of the given instruction.
+  int64 companion_set_index(HloInstruction* instruction) const {
+    return companion_set_index_.at(instruction);
+  }
+
+  // Returns the list of all companion sets in the HLO module group.
+  const std::vector<std::unique_ptr<std::unordered_set<HloInstruction*>>>&
+  companion_sets() const {
+    return companion_sets_;
+  }
+
+  // Returns all channels in the module group.
+  const std::vector<Channel>& channels() const { return channels_; }
+
+  // Returns the maximum channel id used in the module group.
+  int64 max_channel_id() const { return max_channel_id_; }
+
+ private:
+  Status Build();
+
+  // Records all channel instructions and While/Conditional instructions.
+  Status RecordInstructions();
+
+  // Verifies the given HloModules are well-formed and follow the specification,
+  // in particular with respect to using channel instructions.
+  //
+  // * Each channel has all 4 instructions (Send, Recv, SendDone, RecvDone).
+  // * The shapes of channel instructions match.
+  // * The nest levels of channel instructions match.
+  // * Channel instructions are used only in allowed computations: the entry
+  //   computation or a computation of a While or Conditional instruction.
+  //
+  // TODO(b/62064342): Currently, HloModuleGroupScheduler checks if there is a
+  // cycle in the graph, but it would be good to verify here.
+  Status VerifyChannelInstructions();
+
+  // Adds metadata that the given two instructions are companions.
+  Status AddCompanion(HloInstruction* instruction1,
+                      HloInstruction* instruction2);
+
+  // Retrieves a pointer to the stored TrackedInstruction associated with a
+  // tracked computation, or nullptr in case such computation is not tracked.
+  const TrackedInstruction* GetTrackedInstruction(
+      const HloComputation* computation) const {
+    auto it = tracked_instructions_.find(computation);
+    return it != tracked_instructions_.end() ? &it->second : nullptr;
+  }
+
+  // List of all companion instruction sets in the module group.
+  std::vector<std::unique_ptr<std::unordered_set<HloInstruction*>>>
+      companion_sets_;
+
+  // Map from each companion instruction to its index into companion_sets_.
+  tensorflow::gtl::FlatMap<HloInstruction*, int64> companion_set_index_;
+
+  // Map from a computation to the kWhile or kConditional instruction using it.
+  tensorflow::gtl::FlatMap<const HloComputation*, TrackedInstruction>
+      tracked_instructions_;
+
+  // All channels in the module.
+  std::vector<Channel> channels_;
+
+  // Map from channel ids to the index in channels_.
+  tensorflow::gtl::FlatMap<int64, int64> channel_id_map_;
+
+  // The maximum channel id used in the module group.
+  int64 max_channel_id_ = -1;
+
+  // The modules that this metadata was built from.
+  const std::vector<HloModule*>& modules_;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_MODULE_GROUP_METADATA_H_
diff --git a/tensorflow/compiler/xla/service/hlo_module_group_util.cc b/tensorflow/compiler/xla/service/hlo_module_group_util.cc
new file mode 100644
index 0000000000000000000000000000000000000000..289c96b0a7b90c5f8a122cd3fc327a5762099106
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_module_group_util.cc
@@ -0,0 +1,317 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_module_group_util.h"
+
+#include <algorithm>
+#include <list>
+#include <queue>
+#include <stack>
+#include <string>
+#include <utility>
+
+#include "absl/memory/memory.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_reachability.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+
+std::vector<HloInstruction*> HloModuleGroupUtil::GlobalPredecessors(
+    HloInstruction* instruction) {
+  std::vector<HloInstruction*> predecessors;
+
+  // Adds the predecessor to the list unless it is already present. For a
+  // companion instruction, adds every instruction in its companion set.
+  auto add_unique_predecessor = [&](HloInstruction* predecessor) {
+    if (std::find(predecessors.begin(), predecessors.end(), predecessor) !=
+        predecessors.end()) {
+      return;
+    }
+    if (!metadata_.IsCompanionInstruction(predecessor)) {
+      predecessors.push_back(predecessor);
+      return;
+    }
+    for (HloInstruction* companion : metadata_.Companions(predecessor)) {
+      predecessors.push_back(companion);
+    }
+  };
+
+  // If the given instruction is a companion instruction, we need to find the
+  // predecessors of all of its companion instructions.
+  std::vector<HloInstruction*> instruction_group;
+  if (metadata_.IsCompanionInstruction(instruction)) {
+    for (HloInstruction* companion : metadata_.Companions(instruction)) {
+      instruction_group.push_back(companion);
+    }
+  } else {
+    instruction_group.push_back(instruction);
+  }
+
+  for (HloInstruction* hlo : instruction_group) {
+    for (HloInstruction* operand : hlo->operands()) {
+      add_unique_predecessor(operand);
+    }
+    for (HloInstruction* control_predecessor : hlo->control_predecessors()) {
+      add_unique_predecessor(control_predecessor);
+    }
+  }
+  if (instruction->opcode() == HloOpcode::kRecvDone) {
+    // Send is a remote predecessor of RecvDone.
+    HloInstruction* send = metadata_.GetChannel(instruction->channel_id()).send;
+    add_unique_predecessor(send);
+  }
+  if (instruction->opcode() == HloOpcode::kSend) {
+    // Recv is a remote predecessor of Send.
+    HloInstruction* recv_done =
+        metadata_.GetChannel(instruction->channel_id()).recv_done;
+    CHECK(recv_done->opcode() == HloOpcode::kRecvDone);
+    CHECK_EQ(recv_done->operand_count(), 1);
+    HloInstruction* recv = recv_done->mutable_operand(0);
+    add_unique_predecessor(recv);
+  }
+  return predecessors;
+}
+
+std::vector<HloInstruction*> HloModuleGroupUtil::GlobalSuccessors(
+    HloInstruction* instruction) {
+  std::vector<HloInstruction*> successors;
+
+  // Adds the successor to the list unless it is already present. For a
+  // companion instruction, adds every instruction in its companion set.
+  auto add_unique_successor = [&](HloInstruction* successor) {
+    if (std::find(successors.begin(), successors.end(), successor) !=
+        successors.end()) {
+      return;
+    }
+    if (!metadata_.IsCompanionInstruction(successor)) {
+      successors.push_back(successor);
+      return;
+    }
+    for (HloInstruction* companion : metadata_.Companions(successor)) {
+      successors.push_back(companion);
+    }
+  };
+
+  // If the given instruction is a companion instruction, we need to find the
+  // successors of all of its companion instructions.
+  std::vector<HloInstruction*> instruction_group;
+  if (metadata_.IsCompanionInstruction(instruction)) {
+    for (HloInstruction* companion : metadata_.Companions(instruction)) {
+      instruction_group.push_back(companion);
+    }
+  } else {
+    instruction_group.push_back(instruction);
+  }
+
+  for (HloInstruction* hlo : instruction_group) {
+    for (HloInstruction* user : hlo->users()) {
+      add_unique_successor(user);
+    }
+    for (HloInstruction* control_successor : hlo->control_successors()) {
+      add_unique_successor(control_successor);
+    }
+  }
+  if (instruction->opcode() == HloOpcode::kRecv) {
+    // Send is a remote successor of Recv.
+    const HloInstruction* recv_done = instruction->users().front();
+    CHECK(recv_done->opcode() == HloOpcode::kRecvDone);
+    HloInstruction* send = metadata_.GetChannel(instruction->channel_id()).send;
+    add_unique_successor(send);
+  }
+  if (instruction->opcode() == HloOpcode::kSend) {
+    // RecvDone is a remote successor of Send.
+    HloInstruction* recv_done =
+        metadata_.GetChannel(instruction->channel_id()).recv_done;
+    add_unique_successor(recv_done);
+  }
+  return successors;
+}
+
+std::vector<HloInstruction*> HloModuleGroupUtil::RootInstructions(
+    tensorflow::gtl::ArraySlice<HloComputation*> computations) {
+  std::vector<HloInstruction*> roots;
+  for (HloComputation* computation : computations) {
+    for (HloInstruction* instruction : computation->instructions()) {
+      if (GlobalSuccessors(instruction).empty()) {
+        roots.push_back(instruction);
+      }
+    }
+  }
+  return roots;
+}
+
+Status HloModuleGroupUtil::VisitTopologicalOrder(
+    VisitStates* visit_state, const VisitFunction& visit_function,
+    HloInstruction* root) {
+  // Stack of HLO instructions visited in DFS order.
+  std::stack<HloInstruction*> stack;
+  stack.push(root);
+
+  while (!stack.empty()) {
+    HloInstruction* hlo = stack.top();
+
+    // Find the instruction group of the currently visited instruction. The
+    // instruction group represents all companion instructions of the
+    // current instruction, and are considered to be a single entity for the
+    // purpose of the traversal (i.e., they must always be in the same visit
+    // state).
+    std::vector<HloInstruction*> instruction_group;
+    if (metadata_.IsCompanionInstruction(hlo)) {
+      for (HloInstruction* companion : metadata_.Companions(hlo)) {
+        instruction_group.push_back(companion);
+      }
+    } else {
+      instruction_group.push_back(hlo);
+    }
+
+    if ((*visit_state)[hlo] == VisitState::kVisited) {
+      // All instructions in the group must be in the same state.
+      for (HloInstruction* instruction : instruction_group) {
+        TF_RET_CHECK((*visit_state)[instruction] == VisitState::kVisited);
+      }
+      stack.pop();
+      continue;
+    }
+
+    if ((*visit_state)[hlo] == VisitState::kVisiting) {
+      TF_RETURN_IF_ERROR(visit_function(hlo, instruction_group));
+
+      // Set the visit state of all instructions in the group to kVisited.
+      for (HloInstruction* instruction : instruction_group) {
+        TF_RET_CHECK((*visit_state)[instruction] == VisitState::kVisiting);
+        (*visit_state)[instruction] = VisitState::kVisited;
+      }
+      stack.pop();
+      continue;
+    }
+
+    // Set the visit state of all instructions in the group to kVisiting.
+    for (HloInstruction* instruction : instruction_group) {
+      TF_RET_CHECK((*visit_state)[instruction] == VisitState::kNotVisited)
+          << instruction->ToString();
+      (*visit_state)[instruction] = VisitState::kVisiting;
+    }
+
+    // For each instruction in the group, visit its predecessors (operands,
+    // control predecessors and remote predecessors).
+    for (HloInstruction* instruction : instruction_group) {
+      for (HloInstruction* predecessor : GlobalPredecessors(instruction)) {
+        // Visiting a node that is already being visited implies that there is
+        // a cycle. Generate an error with the list of instructions in the
+        // cycle.
+        if ((*visit_state)[predecessor] == VisitState::kVisiting) {
+          string cyclic_instructions;
+          for (const auto& state : *visit_state) {
+            if (state.second == VisitState::kVisiting) {
+              tensorflow::strings::StrAppend(&cyclic_instructions,
+                                             state.first->ToString(), "\n");
+            }
+          }
+          // TODO(b/64305524): Improve the error message to print out the
+          // instructions in a deterministic order that forms the cycle.
+          return FailedPrecondition(
+              "Cross-computation cycle detected via communicating nodes. The "
+              "cycle contains the node %s. The cycle is found among the "
+              "following nodes. Note that the order of the nodes is arbitrary "
+              "and that the list may include nodes that are not part of the "
+              "cycle.\n%s",
+              predecessor->ToString().c_str(), cyclic_instructions.c_str());
+        }
+        stack.push(predecessor);
+      }
+    }
+  }
+
+  return Status::OK();
+}
+
+Status HloModuleGroupUtil::VerifyComputations(
+    tensorflow::gtl::ArraySlice<HloComputation*> computations) {
+  auto visit_function =
+      [&](HloInstruction* instruction,
+          const std::vector<HloInstruction*>& instruction_group) {
+        return Status::OK();
+      };
+  int64 instructions_count = 0;
+  VisitStates visit_states;
+  for (HloComputation* computation : computations) {
+    // Visit all instructions, and not just from the root instruction of the
+    // computation. This allows us to detect dead cycles (i.e., cycles that
+    // are not reachable from the root) or to enforce an order for the
+    // communication instructions that are not reachable from any roots.
+    for (HloInstruction* instruction : computation->instructions()) {
+      TF_RETURN_IF_ERROR(
+          VisitTopologicalOrder(&visit_states, visit_function, instruction));
+    }
+    instructions_count += computation->instruction_count();
+  }
+
+  // Check if all instructions are visited and are in the visited state.
+  TF_RET_CHECK(visit_states.size() == instructions_count);
+  for (auto& state : visit_states) {
+    TF_RET_CHECK(state.second == VisitState::kVisited);
+  }
+
+  return Status::OK();
+}
+
+StatusOr<std::unique_ptr<HloReachabilityMap>>
+HloModuleGroupUtil::ComputeReachability(
+    tensorflow::gtl::ArraySlice<HloComputation*> computations) {
+  std::list<HloInstruction*> post_order;
+  auto visit_function =
+      [&](HloInstruction* instruction,
+          const std::vector<HloInstruction*>& instruction_group) {
+        post_order.insert(post_order.end(), instruction_group.begin(),
+                          instruction_group.end());
+        return Status::OK();
+      };
+  HloModuleGroupUtil::VisitStates visit_states;
+  for (HloInstruction* root : RootInstructions(computations)) {
+    TF_RETURN_IF_ERROR(
+        VisitTopologicalOrder(&visit_states, visit_function, root));
+  }
+  auto reachability = absl::make_unique<HloReachabilityMap>(post_order);
+  for (HloInstruction* hlo : post_order) {
+    reachability->SetReachabilityToUnion(GlobalPredecessors(hlo), hlo);
+  }
+  return std::move(reachability);
+}
+
+void HloModuleGroupUtil::UpdateReachabilityThroughInstruction(
+    HloInstruction* instruction, HloReachabilityMap* reachability_map) {
+  std::queue<HloInstruction*> worklist;
+  worklist.push(instruction);
+
+  while (!worklist.empty()) {
+    HloInstruction* item = worklist.front();
+    worklist.pop();
+    if (reachability_map->SetReachabilityToUnion(GlobalPredecessors(item),
+                                                 item)) {
+      for (HloInstruction* successor : GlobalSuccessors(item)) {
+        worklist.push(successor);
+      }
+    }
+  }
+}
+
+}  // namespace xla
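
`VisitTopologicalOrder` above is an iterative depth-first traversal driven by three visit states. The following is a minimal standalone sketch of that pattern under simplifying assumptions: nodes are plain `int`s, there are no companion groups, and the `Graph` alias and the `bool` return value are hypothetical simplifications rather than the XLA interface. A node is emitted only after all of its predecessors, and meeting a predecessor that is still in the `kVisiting` state is reported as a cycle.

```cpp
#include <cassert>
#include <stack>
#include <unordered_map>
#include <vector>

enum class VisitState { kNotVisited = 0, kVisiting, kVisited };

// Hypothetical graph representation: node id -> ids of its predecessors.
using Graph = std::unordered_map<int, std::vector<int>>;

// Appends every node reachable from `root` to `post_order`, predecessors
// first. Returns false if a cycle is found.
bool VisitTopologicalOrder(const Graph& predecessors, int root,
                           std::unordered_map<int, VisitState>* state,
                           std::vector<int>* post_order) {
  assert(state != nullptr && post_order != nullptr);
  std::stack<int> stack;
  stack.push(root);

  while (!stack.empty()) {
    const int node = stack.top();
    const VisitState node_state = (*state)[node];

    if (node_state == VisitState::kVisited) {
      // Already emitted through another path.
      stack.pop();
      continue;
    }
    if (node_state == VisitState::kVisiting) {
      // All predecessors have been handled, so the node can be emitted.
      post_order->push_back(node);
      (*state)[node] = VisitState::kVisited;
      stack.pop();
      continue;
    }

    // First time the node is seen: mark it and schedule its predecessors.
    (*state)[node] = VisitState::kVisiting;
    const auto it = predecessors.find(node);
    if (it == predecessors.end()) {
      continue;
    }
    for (int predecessor : it->second) {
      if ((*state)[predecessor] == VisitState::kVisiting) {
        return false;  // The predecessor is an ancestor on the current path.
      }
      stack.push(predecessor);
    }
  }
  return true;
}
```

The same `state` map can be reused across several roots, which is how a caller would cover nodes that are not reachable from a single root.
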
diff --git a/tensorflow/compiler/xla/service/hlo_module_group_util.h b/tensorflow/compiler/xla/service/hlo_module_group_util.h
new file mode 100644
index 0000000000000000000000000000000000000000..c25ca1aff50b288f3ac3885cbed53e7ba9768430
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_module_group_util.h
@@ -0,0 +1,117 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_MODULE_GROUP_UTIL_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_MODULE_GROUP_UTIL_H_
+
+#include <functional>
+#include <memory>
+#include <vector>
+
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module_group_metadata.h"
+#include "tensorflow/compiler/xla/service/hlo_reachability.h"
+#include "tensorflow/compiler/xla/status.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
+#include "tensorflow/core/lib/gtl/flatmap.h"
+
+namespace xla {
+
+// Collection of utilities for handling HloModuleGroups.
+class HloModuleGroupUtil {
+ public:
+  explicit HloModuleGroupUtil(const HloModuleGroupMetadata& metadata)
+      : metadata_(metadata) {}
+
+  // Returns all unique predecessors of the instruction. This includes:
+  // * predecessors in the same computation: operands and control predecessors
+  // * Recv is a predecessor of Send
+  // * Send is a predecessor of RecvDone
+  // * predecessors of companions (if the instruction is a companion)
+  // * predecessors' companions (for any predecessor that is a companion)
+  std::vector<HloInstruction*> GlobalPredecessors(HloInstruction* instruction);
+
+  // Returns all unique successors of the instruction. This includes:
+  // * successors in the same computation: users and control successors
+  // * Send is a successor of Recv
+  // * RecvDone is a successor of Send
+  // * successors of companions (if the instruction is a companion)
+  // * successors' companions (for any successor that is a companion)
+  std::vector<HloInstruction*> GlobalSuccessors(HloInstruction* instruction);
+
+  // Returns the root instructions of the computations.
+  std::vector<HloInstruction*> RootInstructions(
+      tensorflow::gtl::ArraySlice<HloComputation*> computations);
+
+  // Visit state of each instruction during DFS traversal.
+  enum VisitState {
+    kNotVisited = 0,
+    kVisiting,
+    kVisited,
+  };
+
+  // Function called on each instruction group during the DFS traversal. See the
+  // comment for VisitTopologicalOrder().
+  using VisitFunction = std::function<Status(
+      HloInstruction* hlo,
+      const std::vector<HloInstruction*>& instruction_group)>;
+
+  // Given the hlo instruction as the root, recursively visits all its
+  // predecessor instructions in DFS order to visit nodes in topological order.
+  //
+  // Note that the DFS traversal does not only visit nodes in the same
+  // computation (parent of the root instruction), but also visits nodes in
+  // different computations connected via communication instructions. During the
+  // traversal, companion While instructions (see the class comment in
+  // HloModuleGroupMetadata) are treated as a single instruction (called
+  // instruction group, which contains only a single instruction if the visiting
+  // node is not a companion while) -- visiting one of the instructions in the
+  // group effectively visits all other instructions in the group, and then all
+  // predecessor instructions of the group are visited.
+  //
+  // * visit_state: map from each instruction to its visit state.
+  // * visit_function: function called on each instruction group.
+  // * root: the root instruction of the traversal.
+  using VisitStates = tensorflow::gtl::FlatMap<HloInstruction*, VisitState>;
+  Status VisitTopologicalOrder(VisitStates* visit_state,
+                               const VisitFunction& visit_function,
+                               HloInstruction* root);
+
+  // Verifies that the computations are well-formed (e.g., no cycles).
+  Status VerifyComputations(
+      tensorflow::gtl::ArraySlice<HloComputation*> computations);
+
+  // The reachability utilities below resemble those in HloComputation, except
+  // that they can handle instructions across multiple computations.
+  //
+  // Creates the reachability map for the instructions in the computations.
+  StatusOr<std::unique_ptr<HloReachabilityMap>> ComputeReachability(
+      tensorflow::gtl::ArraySlice<HloComputation*> computations);
+
+  // Updates the reachability of the given instruction, taking the global
+  // predecessors and successors into account.
+  void UpdateReachabilityThroughInstruction(
+      HloInstruction* instruction, HloReachabilityMap* reachability_map);
+
+ private:
+  const HloModuleGroupMetadata& metadata_;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_MODULE_GROUP_UTIL_H_
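Reviewer sketch (not part of the patch): the three visit states declared above are the usual way to get a topological order and cycle detection out of one DFS. The example below is a minimal illustration under stated assumptions. It uses plain `int` node ids and a hypothetical `PredecessorMap` instead of `HloInstruction*` and `GlobalPredecessors()`, returns `bool` instead of `Status`, and ignores instruction groups entirely.

```cpp
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the types in the header above.
enum class VisitState { kNotVisited = 0, kVisiting, kVisited };
using PredecessorMap = std::unordered_map<int, std::vector<int>>;
using VisitStates = std::unordered_map<int, VisitState>;

// Calls `visit` on `node` only after all of its predecessors were visited.
// Returns false if a cycle is reachable from `node`.
bool VisitTopologicalOrder(int node, const PredecessorMap& predecessors,
                           VisitStates* states,
                           const std::function<void(int)>& visit) {
  // operator[] value-initializes missing entries to kNotVisited (0).
  const VisitState current = (*states)[node];
  if (current == VisitState::kVisited) return true;
  // Reaching a node that is still on the DFS stack means a cycle.
  if (current == VisitState::kVisiting) return false;

  (*states)[node] = VisitState::kVisiting;
  auto it = predecessors.find(node);
  if (it != predecessors.end()) {
    for (int predecessor : it->second) {
      if (!VisitTopologicalOrder(predecessor, predecessors, states, visit)) {
        return false;
      }
    }
  }
  (*states)[node] = VisitState::kVisited;
  visit(node);
  return true;
}
```

Passing the state map in by pointer, as the real signature does, lets a caller reuse it across several roots so that shared predecessors are visited only once.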
diff --git a/tensorflow/compiler/xla/service/hlo_ordering.cc b/tensorflow/compiler/xla/service/hlo_ordering.cc
index 1b24d8da9e832e6847cb6f405e15af3c455f695a..e89d94bede6c437ca1131a1b1b0098390d58c0d9 100644
--- a/tensorflow/compiler/xla/service/hlo_ordering.cc
+++ b/tensorflow/compiler/xla/service/hlo_ordering.cc
@@ -66,6 +66,28 @@ bool HloOrdering::ExecutesBefore(const HloInstruction* a,
     }
   }
 
+  // If the common ancestor is a conditional instruction, even though the true
+  // and false computations are not really ordered per se, we define the true
+  // computation to be ordered before the false one.
+  // This ensures that buffers can still be shared between the two computations
+  // as they are forced to have disjoint liveness.
+  if (a_ancestor == b_ancestor &&
+      a_ancestor->opcode() == HloOpcode::kConditional) {
+    const HloComputation* true_computation = a_ancestor->true_computation();
+    const HloComputation* false_computation = a_ancestor->false_computation();
+    if (call_graph_->InstructionIsNestedIn(a, true_computation) &&
+        call_graph_->InstructionIsNestedIn(b, false_computation)) {
+      return true;
+    }
+    // If 'b' is the conditional ancestor, and 'a' is within the true or false
+    // computations, 'a' executes before 'b'.
+    if (b == a_ancestor &&
+        (call_graph_->InstructionIsNestedIn(a, true_computation) ||
+         call_graph_->InstructionIsNestedIn(a, false_computation))) {
+      return true;
+    }
+  }
+
   return ExecutesBeforeInSameComputation(a_ancestor, b_ancestor);
 }
 
@@ -118,7 +140,18 @@ bool HloOrdering::IsDefinedBefore(const HloValue& a, const HloValue& b) const {
            b.defining_instruction()->while_condition()))) {
     return true;
   }
-
+  // If 'b' is a conditional phi and 'a' is in the true or false computation,
+  // then 'a' executes before 'b'.
+  if (b.is_phi() &&
+      b.defining_instruction()->opcode() == HloOpcode::kConditional &&
+      (call_graph_->InstructionIsNestedIn(
+           a.defining_instruction(),
+           b.defining_instruction()->true_computation()) ||
+       call_graph_->InstructionIsNestedIn(
+           a.defining_instruction(),
+           b.defining_instruction()->false_computation()))) {
+    return true;
+  }
   return ExecutesBefore(a.defining_instruction(), b.defining_instruction());
 }
 
@@ -212,18 +245,17 @@ bool HloOrdering::LiveRangeStrictlyBefore(
   VLOG(4) << "LiveRangeStrictlyBefore(a = " << a.ToShortString()
           << ", b = " << b.ToShortString() << ")";
   if (!IsDefinedBefore(a, b)) {
-    VLOG(4) << "a not defined before b";
+    VLOG(4) << a << " not defined before " << b;
     return false;
   }
-
   // All uses of 'a' must be before 'b' is defined.
   for (const HloUse& use : a.uses()) {
     if (!UseIsBeforeValueDefinition(use, b, dataflow)) {
-      VLOG(4) << "use of a (" << use << ") not before b is defined";
+      VLOG(4) << "use of " << a << " (" << use << ") not before " << b
+              << " is defined";
       return false;
     }
   }
-
   return true;
 }
 
diff --git a/tensorflow/compiler/xla/service/hlo_ordering_test.cc b/tensorflow/compiler/xla/service/hlo_ordering_test.cc
index a989fce63234cb860d08c48b02462e96bec879bc..37a7fbad97cea2f34798efecc2489e57d1374f35 100644
--- a/tensorflow/compiler/xla/service/hlo_ordering_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_ordering_test.cc
@@ -34,53 +34,6 @@ namespace {
 
 class HloOrderingTest : public HloTestBase {};
 
-TEST_F(HloOrderingTest, LastUseScheduledFirst) {
-  // Tests scheduling of the following HLO code:
-  //
-  //   %ab = abs(%param)
-  //   %exp = exp(%param)
-  //   %add = add(%ab, %exp)
-  //   %negate = negate(%exp)
-  //   %sub = subtract(%add, %negate)
-  //
-  // %add should be scheduled before %negate because %add is the last (and only)
-  // use of %ab. Scheduling %add first then frees up %ab's buffer.
-  const Shape vec = ShapeUtil::MakeShape(xla::F32, {42});
-  auto builder = HloComputation::Builder(TestName());
-  auto param =
-      builder.AddInstruction(HloInstruction::CreateParameter(0, vec, "param"));
-  auto ab = builder.AddInstruction(
-      HloInstruction::CreateUnary(vec, HloOpcode::kAbs, param));
-  auto exp = builder.AddInstruction(
-      HloInstruction::CreateUnary(vec, HloOpcode::kExp, param));
-
-  auto add = builder.AddInstruction(
-      HloInstruction::CreateBinary(vec, HloOpcode::kAdd, ab, exp));
-  auto negate = builder.AddInstruction(
-      HloInstruction::CreateUnary(vec, HloOpcode::kNegate, exp));
-  auto sub = builder.AddInstruction(
-      HloInstruction::CreateBinary(vec, HloOpcode::kSubtract, add, negate));
-
-  auto module = CreateNewModule();
-  module->AddEntryComputation(builder.Build());
-
-  TF_ASSERT_OK_AND_ASSIGN(
-      SequentialHloOrdering::HloModuleSequence sequence,
-      CreateMemoryMinimizingSequence(*module, [](const LogicalBuffer& buffer) {
-        return ShapeUtil::ByteSizeOf(buffer.shape());
-      }));
-  // Verify that all instructions are in the sequence.
-  EXPECT_EQ(module->entry_computation()->instruction_count(),
-            sequence.at(module->entry_computation()).size());
-
-  // The first instruction should be the parameter and the last the root "sub".
-  EXPECT_EQ(param, sequence.at(module->entry_computation()).front());
-  EXPECT_EQ(sub, sequence.at(module->entry_computation()).back());
-
-  SequentialHloOrdering ordering(module.get(), sequence);
-  EXPECT_TRUE(ordering.ExecutesBefore(add, negate));
-}
-
 TEST_F(HloOrderingTest, InstructionsInDifferentComputations) {
   // Tests the ordering of instructions in different computations using the
   // following HLO code:
@@ -362,5 +315,66 @@ ENTRY while.v11 {
   ordering.ToString();  // Shouldn't crash.
 }
 
+TEST_F(HloOrderingTest, ConditionalInstructionOrdering) {
+  const char* module_str = R"(
+HloModule test_conditional_module
+
+true_branch {
+  param.1 = (s32[], s32[]) parameter(0)
+  get-tuple-element.1 = s32[] get-tuple-element(param.1), index=0
+  get-tuple-element.2 = s32[] get-tuple-element(param.1), index=1
+  add.1 = s32[] add(get-tuple-element.1, get-tuple-element.2)
+  ROOT tuple.1 = (s32[], s32[]) tuple(add.1, get-tuple-element.1)
+}
+
+false_branch {
+  param.2 = (s32[], s32[]) parameter(0)
+  get-tuple-element.3 = s32[] get-tuple-element(param.2), index=0
+  get-tuple-element.4 = s32[] get-tuple-element(param.2), index=1
+  add.2 = s32[] add(get-tuple-element.3, get-tuple-element.4)
+  ROOT tuple.2 = (s32[], s32[]) tuple(add.2, get-tuple-element.4)
+}
+
+ENTRY root {
+  param.3 = (pred[], (s32[], s32[])) parameter(0)
+  pred.1 = pred[] get-tuple-element(param.3), index=0
+  cond_arg.1 = (s32[], s32[]) get-tuple-element(param.3), index=1
+  conditional = (s32[], s32[]) conditional(pred.1, cond_arg.1, cond_arg.1), true_computation=true_branch, false_computation=false_branch
+  cond_res.1 = s32[] get-tuple-element(conditional), index=0
+  cond_res.2 = s32[] get-tuple-element(conditional), index=1
+  add.3 = s32[] add(cond_res.1, cond_res.2)
+  ROOT result = (s32[], s32[], s32[]) tuple(add.3, cond_res.1, cond_res.2)
+})";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          tools::Parse(module_str));
+  TF_ASSERT_OK_AND_ASSIGN(auto dataflow,
+                          HloDataflowAnalysis::Run(*module, /*ssa_form=*/true));
+  DependencyHloOrdering ordering(module.get());
+
+  // Even though the true and false branches have no ordering, since they do not
+  // interfere (as they are mutually exclusive), we define the true computation
+  // to be before the false one.
+  // Similarly, any instruction in the true or false branches is considered
+  // before the conditional instruction. The roots are effectively "at the same
+  // time" WRT the conditional, but they are Phi-ed anyway.
+  HloInstruction* add_1 = FindInstruction(module.get(), "add.1");
+  HloInstruction* add_2 = FindInstruction(module.get(), "add.2");
+  HloInstruction* add_3 = FindInstruction(module.get(), "add.3");
+  HloInstruction* conditional = FindInstruction(module.get(), "conditional");
+  EXPECT_TRUE(ordering.IsDefinedBefore(dataflow->GetValueDefinedAt(add_1),
+                                       dataflow->GetValueDefinedAt(add_2)));
+  EXPECT_TRUE(
+      ordering.IsDefinedBefore(dataflow->GetValueDefinedAt(add_2),
+                               dataflow->GetValueDefinedAt(conditional)));
+  EXPECT_TRUE(
+      ordering.IsDefinedBefore(dataflow->GetValueDefinedAt(add_1),
+                               dataflow->GetValueDefinedAt(conditional)));
+  EXPECT_TRUE(ordering.IsDefinedBefore(dataflow->GetValueDefinedAt(add_1),
+                                       dataflow->GetValueDefinedAt(add_3)));
+  EXPECT_TRUE(ordering.IsDefinedBefore(dataflow->GetValueDefinedAt(add_2),
+                                       dataflow->GetValueDefinedAt(add_3)));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_proto_util.cc b/tensorflow/compiler/xla/service/hlo_proto_util.cc
index 78e6a101c10a1e812e3e2631d520139fd0bc425c..3460679558d185d1e022660d9a1d23176d0d96bf 100644
--- a/tensorflow/compiler/xla/service/hlo_proto_util.cc
+++ b/tensorflow/compiler/xla/service/hlo_proto_util.cc
@@ -15,6 +15,10 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_proto_util.h"
 
+#include <string>
+
+#include "tensorflow/compiler/xla/util.h"
+
 namespace xla {
 
 HloProto MakeHloProto(const HloModule& module,
@@ -35,4 +39,35 @@ HloProto MakeHloProto(const HloModule& module) {
   return proto;
 }
 
+StatusOr<std::vector<const Shape*>> EntryComputationParameterShapes(
+    const HloProto& hlo_proto) {
+  if (!hlo_proto.has_hlo_module()) {
+    return NotFound("HloProto missing HloModuleProto.");
+  }
+  if (!hlo_proto.hlo_module().has_program_shape()) {
+    return NotFound("HloProto missing program shape.");
+  }
+
+  std::vector<const Shape*> parameter_shapes;
+  const auto& program_shape = hlo_proto.hlo_module().program_shape();
+  for (const Shape& shape : program_shape.parameters()) {
+    parameter_shapes.push_back(&shape);
+  }
+  return parameter_shapes;
+}
+
+StatusOr<const Shape*> EntryComputationOutputShape(const HloProto& hlo_proto) {
+  if (!hlo_proto.has_hlo_module()) {
+    return NotFound("HloProto missing HloModuleProto.");
+  }
+  if (!hlo_proto.hlo_module().has_program_shape()) {
+    return NotFound("HloProto missing program shape.");
+  }
+  if (!hlo_proto.hlo_module().program_shape().has_result()) {
+    return NotFound("HloProto missing result in its program shape.");
+  }
+
+  return &hlo_proto.hlo_module().program_shape().result();
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_proto_util.h b/tensorflow/compiler/xla/service/hlo_proto_util.h
index 320288fdb9aa0810b306b1d78bd1ff4cfc366ed2..3d9c375cd5d26f92cf8316f78789daf4fc08c927 100644
--- a/tensorflow/compiler/xla/service/hlo_proto_util.h
+++ b/tensorflow/compiler/xla/service/hlo_proto_util.h
@@ -35,6 +35,15 @@ HloProto MakeHloProto(const HloModule& module,
 // will not be included in the output.
 HloProto MakeHloProto(const HloModule& module);
 
+// Returns the shapes of the parameters of the entry computation. Shape pointers
+// refer to shapes inside of the given HloProto.
+StatusOr<std::vector<const Shape*>> EntryComputationParameterShapes(
+    const HloProto& hlo_proto);
+
+// Returns the shape of the output of the entry computation. The shape pointer
+// refers to the output shape inside of the given HloProto.
+StatusOr<const Shape*> EntryComputationOutputShape(const HloProto& hlo_proto);
+
 }  // namespace xla
 
 #endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_PROTO_UTIL_H_
diff --git a/tensorflow/compiler/xla/service/hlo_proto_util_test.cc b/tensorflow/compiler/xla/service/hlo_proto_util_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..b9cca138703c8fa61aadf69dd7304a215a9f4be2
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_proto_util_test.cc
@@ -0,0 +1,53 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_proto_util.h"
+
+#include "tensorflow/compiler/xla/service/hlo.pb.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/test.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+
+namespace xla {
+namespace {
+
+class HloProtoUtilTest : public ::testing::Test {};
+
+TEST_F(HloProtoUtilTest, ParamsAndOutputShapeMissingModule) {
+  HloProto hlo_proto;
+
+  auto status = EntryComputationParameterShapes(hlo_proto).status();
+  ASSERT_FALSE(status.ok());
+  ASSERT_THAT(status.error_message(),
+              ::testing::HasSubstr("missing HloModuleProto"));
+}
+
+TEST_F(HloProtoUtilTest, MissingProgramShape) {
+  HloProto hlo_proto;
+  HloModuleProto* module = hlo_proto.mutable_hlo_module();
+  module->set_name("entry");
+
+  auto status = EntryComputationParameterShapes(hlo_proto).status();
+  ASSERT_FALSE(status.ok());
+  ASSERT_THAT(status.error_message(),
+              ::testing::HasSubstr("missing program shape"));
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_rematerialization.cc b/tensorflow/compiler/xla/service/hlo_rematerialization.cc
index 98b8d34be1f331aaeac94e952deeae1e76379861..b0632448933df4b7681a0704c58d697b5ec68a1f 100644
--- a/tensorflow/compiler/xla/service/hlo_rematerialization.cc
+++ b/tensorflow/compiler/xla/service/hlo_rematerialization.cc
@@ -1320,7 +1320,7 @@ StatusOr<bool> HloRematerialization::Run(
 /* static */ StatusOr<bool> HloRematerialization::RematerializeAndSchedule(
     const HloRematerialization::ShapeSizeFunction& size_function,
     int64 memory_limit_bytes, HloModule* hlo_module,
-    SchedulerAlgorithm scheduler_algorithm,
+    MemorySchedulerAlgorithm scheduler_algorithm,
     SequentialHloOrdering::HloModuleSequence* sequence,
     RematerializationSizes* sizes) {
   HloRematerialization remat(scheduler_algorithm, size_function);
diff --git a/tensorflow/compiler/xla/service/hlo_rematerialization.h b/tensorflow/compiler/xla/service/hlo_rematerialization.h
index 52553439033a3bcfa4b472f13f9cd4b1ecf5ed96..2ee2dd0571ae8c6604e4ca722351fd48a913bda5 100644
--- a/tensorflow/compiler/xla/service/hlo_rematerialization.h
+++ b/tensorflow/compiler/xla/service/hlo_rematerialization.h
@@ -66,12 +66,12 @@ class HloRematerialization {
   // code generation.
   static StatusOr<bool> RematerializeAndSchedule(
       const ShapeSizeFunction& size_function, int64 memory_limit_bytes,
-      HloModule* hlo_module, SchedulerAlgorithm scheduler_algorithm,
+      HloModule* hlo_module, MemorySchedulerAlgorithm scheduler_algorithm,
       SequentialHloOrdering::HloModuleSequence* sequence,
       RematerializationSizes* sizes = nullptr);
 
  protected:
-  HloRematerialization(SchedulerAlgorithm scheduler_algorithm,
+  HloRematerialization(MemorySchedulerAlgorithm scheduler_algorithm,
                        const ShapeSizeFunction& size_function)
       : scheduler_algorithm_(scheduler_algorithm),
         size_function_(size_function) {}
@@ -108,7 +108,7 @@ class HloRematerialization {
       const HloInstruction* instruction) const;
 
   // Selects an algorithm to use for HLO scheduling.
-  SchedulerAlgorithm scheduler_algorithm_;
+  MemorySchedulerAlgorithm scheduler_algorithm_;
 
   // Function which computes the size of the top-level buffer of a shape.
   const ShapeSizeFunction size_function_;
diff --git a/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc b/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc
index 1b7d26dde501a6a0955d62ea0938e0683a32d49d..83de54f3fa56ee660b79d8c366dbc0b52f9fde87 100644
--- a/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc
@@ -162,7 +162,7 @@ TEST_F(HloRematerializationTest, SingleComputation) {
                           HloRematerialization::RematerializeAndSchedule(
                               ByteSizeOf,
                               /*memory_limit_bytes=*/14 * 1024, module.get(),
-                              SchedulerAlgorithm::kAuto, &sequence));
+                              DefaultMemoryScheduler, &sequence));
   EXPECT_TRUE(changed);
 
   // Root should not have changed.
@@ -195,7 +195,7 @@ TEST_F(HloRematerializationTest, SingleComputationNoRematerialization) {
                           HloRematerialization::RematerializeAndSchedule(
                               ByteSizeOf,
                               /*memory_limit_bytes=*/20 * 1024, module.get(),
-                              SchedulerAlgorithm::kAuto, &sequence));
+                              DefaultMemoryScheduler, &sequence));
 
   // No instructions should have been materialized.
   EXPECT_FALSE(changed);
@@ -236,7 +236,7 @@ TEST_F(HloRematerializationTest, RematerializeAroundWhile) {
                           HloRematerialization::RematerializeAndSchedule(
                               ByteSizeOf,
                               /*memory_limit_bytes=*/17 * 1024, module.get(),
-                              SchedulerAlgorithm::kAuto, &sequence));
+                              DefaultMemoryScheduler, &sequence));
   EXPECT_TRUE(changed);
 
   // Only the entry computation should have a rematerialized instruction added.
@@ -272,7 +272,7 @@ TEST_F(HloRematerializationTest, RematerializeEntryAndWhileBody) {
                           HloRematerialization::RematerializeAndSchedule(
                               ByteSizeOf,
                               /*memory_limit_bytes=*/15 * 1024, module.get(),
-                              SchedulerAlgorithm::kAuto, &sequence));
+                              DefaultMemoryScheduler, &sequence));
   EXPECT_TRUE(changed);
 
   // Both computations should have a rematerialized instruction added.
@@ -314,7 +314,7 @@ TEST_F(HloRematerializationTest, RematerializeNestedComputations) {
                           HloRematerialization::RematerializeAndSchedule(
                               ByteSizeOf,
                               /*memory_limit_bytes=*/13 * 1024, module.get(),
-                              SchedulerAlgorithm::kAuto, &sequence));
+                              DefaultMemoryScheduler, &sequence));
   EXPECT_TRUE(changed);
 
   // All computations should have a rematerialized instruction added.
@@ -385,7 +385,7 @@ TEST_F(HloRematerializationTest, RngNotRematerialized) {
       bool changed, HloRematerialization::RematerializeAndSchedule(
                         ByteSizeOf,
                         /*memory_limit_bytes=*/4 * ByteSizeOf(vec1024_shape_),
-                        module.get(), SchedulerAlgorithm::kAuto, &sequence));
+                        module.get(), DefaultMemoryScheduler, &sequence));
   EXPECT_TRUE(changed);
   // The rng should not have been rematerialized.
   EXPECT_EQ(count_rngs(entry_computation), 1);
@@ -480,7 +480,7 @@ TEST_F(HloRematerializationTest, InstructionRematerializedMultipleTimes) {
                           HloRematerialization::RematerializeAndSchedule(
                               ByteSizeOf,
                               /*memory_limit_bytes=*/22 * 1024, module.get(),
-                              SchedulerAlgorithm::kAuto, &sequence));
+                              DefaultMemoryScheduler, &sequence));
   EXPECT_TRUE(changed);
 
   // The broadcast should have been rematerialized 3 times.
@@ -577,7 +577,7 @@ TEST_P(IndirectUseTest, IndirectUseNotRematerialized) {
                           HloRematerialization::RematerializeAndSchedule(
                               ByteSizeOf,
                               /*memory_limit_bytes=*/22 * 1024, module.get(),
-                              SchedulerAlgorithm::kAuto, &sequence));
+                              DefaultMemoryScheduler, &sequence));
   // Rematerialization should only occur if the rematerializable instruction has
   // no indirect uses.
   if (indirectly_used) {
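Reviewer sketch (not part of the patch): the tests above now pass `DefaultMemoryScheduler` where they used to pass `SchedulerAlgorithm::kAuto`, which suggests the scheduler is selected by passing a callable rather than an enum value. The example below illustrates that pattern under the assumption that `MemorySchedulerAlgorithm` behaves like a `std::function`; its definition is not shown in this section. All names here (`Sequence`, `Scheduler`, `DefaultScheduler`, `ReverseScheduler`, `CreateSequence`) are hypothetical stand-ins, and a computation is modeled as a plain list of ids.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-ins: a "computation" and a "sequence" are lists of ids.
using Sequence = std::vector<int>;
using Scheduler = std::function<Sequence(const std::vector<int>&)>;

// Stand-in for a default scheduler: keeps the given order.
Sequence DefaultScheduler(const std::vector<int>& computation) {
  return computation;
}

// Stand-in for an alternative scheduler: reverses the given order.
Sequence ReverseScheduler(const std::vector<int>& computation) {
  return Sequence(computation.rbegin(), computation.rend());
}

// Uses the caller-supplied scheduler when one is set and falls back to the
// default otherwise. An empty std::function converts to false.
Sequence CreateSequence(const std::vector<int>& computation,
                        const Scheduler& algorithm = Scheduler()) {
  if (algorithm) {
    return algorithm(computation);
  }
  return DefaultScheduler(computation);
}
```

Compared with an enum, a callable lets a backend plug in its own scheduler without changing the dispatch code.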
diff --git a/tensorflow/compiler/xla/service/hlo_runner.cc b/tensorflow/compiler/xla/service/hlo_runner.cc
index 41b079eb799d06321a31f7d7ae0630dc8d58c46b..e5b1c2efa3fc25d23531df298e125521c002dba1 100644
--- a/tensorflow/compiler/xla/service/hlo_runner.cc
+++ b/tensorflow/compiler/xla/service/hlo_runner.cc
@@ -110,7 +110,7 @@ HloRunner::HloRunner(se::Platform* platform) {
 
 HloRunner::~HloRunner() {}
 
-StatusOr<std::unique_ptr<Literal>> HloRunner::ExecuteInternal(
+StatusOr<std::unique_ptr<Literal>> HloRunner::Execute(
     std::unique_ptr<HloModule> module,
     const tensorflow::gtl::ArraySlice<Literal*> arguments,
     bool run_hlo_passes) {
@@ -158,8 +158,8 @@ StatusOr<std::unique_ptr<Literal>> HloRunner::ExecuteInternal(
 
   TF_ASSIGN_OR_RETURN(
       std::unique_ptr<ShapedBuffer> result,
-      executable->ExecuteOnStream(&service_run_options, argument_buffer_ptrs,
-                                  /*hlo_execution_profile=*/nullptr));
+      executable->ExecuteOnStreamWrapper(
+          &service_run_options, /*profile=*/nullptr, argument_buffer_ptrs));
 
   // Create a ScopedShapedBuffer of the result to manage deallocation. This will
   // deallocate all the device memory when it goes out of scope.
diff --git a/tensorflow/compiler/xla/service/hlo_runner.h b/tensorflow/compiler/xla/service/hlo_runner.h
index cbaebc68bee708090b8ccb2eae19b556c4d6d453..06ce22a5b9fc7b3d6c10857c84196094c0eed303 100644
--- a/tensorflow/compiler/xla/service/hlo_runner.h
+++ b/tensorflow/compiler/xla/service/hlo_runner.h
@@ -27,6 +27,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
@@ -64,17 +65,27 @@ class HloRunner {
       const std::string& filename, const DebugOptions& debug_options);
 
   // Executes the given module with given literals as input and returns the
-  // result as a Literal. The LiteralPtr type accepts Literal* or
-  // std::unique_ptr<Literal>.
+  // result as a Literal.
   //
   // If run_hlo_passes is false, the module will be executed without Hlo
   // optimization.
-  template <typename LiteralPtr>
   StatusOr<std::unique_ptr<Literal>> Execute(
       std::unique_ptr<HloModule> module,
-      const tensorflow::gtl::ArraySlice<LiteralPtr> arguments,
+      const tensorflow::gtl::ArraySlice<Literal*> arguments,
       bool run_hlo_passes = true);
 
+  StatusOr<std::unique_ptr<Literal>> Execute(
+      std::unique_ptr<HloModule> module,
+      const tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> arguments,
+      bool run_hlo_passes = true) {
+    // Construct a vector of plain pointers for the arguments.
+    std::vector<Literal*> argument_pointers;
+    c_transform(
+        arguments, std::back_inserter(argument_pointers),
+        [](const std::unique_ptr<Literal>& literal) { return literal.get(); });
+    return Execute(std::move(module), argument_pointers, run_hlo_passes);
+  }
+
   // If backend is not created in the constructor, creates and returns the
   // default backend. If creation fails, crashes the program.
   //
@@ -83,11 +94,6 @@ class HloRunner {
   Backend& backend();
 
  private:
-  StatusOr<std::unique_ptr<Literal>> ExecuteInternal(
-      std::unique_ptr<HloModule> module,
-      const tensorflow::gtl::ArraySlice<Literal*> arguments,
-      bool run_hlo_passes = true);
-
   struct EigenThreadPoolWrapper;
 
   std::unique_ptr<EigenThreadPoolWrapper> thread_pool_wrapper_;
@@ -95,19 +101,6 @@ class HloRunner {
   std::unique_ptr<Backend> backend_;
 };
 
-template <typename LiteralPtr>
-StatusOr<std::unique_ptr<Literal>> HloRunner::Execute(
-    std::unique_ptr<HloModule> module,
-    const tensorflow::gtl::ArraySlice<LiteralPtr> arguments,
-    bool run_hlo_passes) {
-  // Construct a vector of plain pointers for the arguments.
-  std::vector<Literal*> argument_pointers;
-  for (const auto& argument : arguments) {
-    argument_pointers.push_back(&*argument);
-  }
-  return ExecuteInternal(std::move(module), argument_pointers, run_hlo_passes);
-}
-
 }  // namespace xla
 
 #endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_RUNNER_H_
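Reviewer sketch (not part of the patch): the header change above replaces the `LiteralPtr` template with two plain overloads, where the owning-pointer overload builds a vector of raw pointers and forwards to the other. The example below shows that forwarding pattern in isolation. It assumes a hypothetical `Literal` struct and a hypothetical `SumLiterals` function, and it uses `std::vector` and `std::transform` in place of `ArraySlice` and `c_transform`.

```cpp
#include <algorithm>
#include <iterator>
#include <memory>
#include <vector>

// Hypothetical stand-in for xla::Literal.
struct Literal {
  int value;
};

// Primary overload: works on non-owning pointers.
int SumLiterals(const std::vector<Literal*>& arguments) {
  int sum = 0;
  for (const Literal* literal : arguments) {
    sum += literal->value;
  }
  return sum;
}

// Convenience overload: accepts owning pointers and forwards to the primary
// overload. The caller keeps ownership; only raw pointers are passed along.
int SumLiterals(const std::vector<std::unique_ptr<Literal>>& arguments) {
  std::vector<Literal*> argument_pointers;
  argument_pointers.reserve(arguments.size());
  std::transform(
      arguments.begin(), arguments.end(),
      std::back_inserter(argument_pointers),
      [](const std::unique_ptr<Literal>& literal) { return literal.get(); });
  return SumLiterals(argument_pointers);
}
```

Two named overloads keep the implementation in one non-template function, so it can live in the `.cc` file instead of the header.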
diff --git a/tensorflow/compiler/xla/service/hlo_scheduling.cc b/tensorflow/compiler/xla/service/hlo_scheduling.cc
index f6e33403f538bd8492b04c34d46a458f7f06cc06..1a767628f6e2d33df353366974fb866e89f0df5a 100644
--- a/tensorflow/compiler/xla/service/hlo_scheduling.cc
+++ b/tensorflow/compiler/xla/service/hlo_scheduling.cc
@@ -103,10 +103,11 @@ class ListScheduler {
     for (auto* instruction : computation.instructions()) {
       tensorflow::gtl::FlatSet<const LogicalBuffer*> instr_uses;
       for (auto* operand : instruction->operands()) {
-        for (const LogicalBuffer* buffer :
-             points_to_analysis.GetBuffersDefinedByInstruction(operand)) {
-          instr_uses.insert(buffer);
-        }
+        points_to_analysis.GetPointsToSet(operand).ForEachElement(
+            [&](const ShapeIndex& /*index*/,
+                const PointsToSet::BufferList& buffers) {
+              instr_uses.insert(buffers.begin(), buffers.end());
+            });
       }
       buffer_uses_[instruction] = std::vector<const LogicalBuffer*>(
           instr_uses.begin(), instr_uses.end());
@@ -339,7 +340,33 @@ int64 SumLogicalBufferSizes(
   return size;
 }
 
-StatusOr<std::vector<const HloInstruction*>> RunDFSMemoryScheduler(
+StatusOr<int64> MinimumMemoryForComputation(
+    const HloComputation& computation,
+    const std::vector<const HloInstruction*>& sequence,
+    const TuplePointsToAnalysis& points_to_analysis,
+    const LogicalBuffer::SizeFunction& size_function) {
+  TF_ASSIGN_OR_RETURN(
+      HeapSimulator::Result result,
+      HeapSimulator::Run(MakeUnique<NoFragmentationStatsHeap>(), computation,
+                         sequence, points_to_analysis, size_function));
+  return result.heap_size;
+}
+
+StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
+    const HloComputation& computation,
+    const TuplePointsToAnalysis& points_to_analysis,
+    const LogicalBuffer::SizeFunction& size_function,
+    const MemorySchedulerAlgorithm& algorithm) {
+  VLOG(2) << "Computation: " << computation.name();
+  if (algorithm) {
+    return algorithm(computation, points_to_analysis, size_function);
+  }
+  return DefaultMemoryScheduler(computation, points_to_analysis, size_function);
+}
+
+}  // namespace
+
+StatusOr<std::vector<const HloInstruction*>> DFSMemoryScheduler(
     const HloComputation& computation,
     const TuplePointsToAnalysis& points_to_analysis,
     const LogicalBuffer::SizeFunction& size_function) {
@@ -348,6 +375,7 @@ StatusOr<std::vector<const HloInstruction*>> RunDFSMemoryScheduler(
   // simply users-1 for each instruction.  By subtracting 1, we're saying that
   // instructions with no users or a single user don't count; instructions with
   // lots of fan-out will be visited earlier.
+  int64 cumulative_total_size = 0;
   tensorflow::gtl::FlatMap<const HloInstruction*, int64> extra_users;
   tensorflow::gtl::FlatMap<const HloInstruction*, int64> total_sizes;
   for (const HloInstruction* hlo : computation.MakeInstructionPostOrder()) {
@@ -357,14 +385,17 @@ StatusOr<std::vector<const HloInstruction*>> RunDFSMemoryScheduler(
       continue;
     }
     extra_users[hlo] = hlo->users().empty() ? 0 : hlo->users().size() - 1;
-    total_sizes[hlo] = SumLogicalBufferSizes(
+    int64 logical_buffer_size = SumLogicalBufferSizes(
         points_to_analysis.GetBuffersDefinedByInstruction(hlo), size_function);
+    total_sizes[hlo] = logical_buffer_size;
+    cumulative_total_size += logical_buffer_size;
     tensorflow::gtl::FlatSet<const HloInstruction*> unique_operands(
         hlo->operands().begin(), hlo->operands().end());
     for (const HloInstruction* operand : unique_operands) {
       extra_users[hlo] += extra_users[operand];
       total_sizes[hlo] += total_sizes[operand];
     }
+    total_sizes[hlo] = std::min(total_sizes[hlo], cumulative_total_size);
   }
   CHECK_EQ(extra_users.size(), computation.instruction_count());
   CHECK_EQ(total_sizes.size(), computation.instruction_count());
@@ -392,32 +423,17 @@ StatusOr<std::vector<const HloInstruction*>> RunDFSMemoryScheduler(
   return sequence;
 }
 
-StatusOr<int64> MinimumMemoryForComputation(
+StatusOr<std::vector<const HloInstruction*>> ListMemoryScheduler(
     const HloComputation& computation,
-    const std::vector<const HloInstruction*>& sequence,
     const TuplePointsToAnalysis& points_to_analysis,
     const LogicalBuffer::SizeFunction& size_function) {
-  TF_ASSIGN_OR_RETURN(
-      HeapSimulator::Result result,
-      HeapSimulator::Run(MakeUnique<NoFragmentationStatsHeap>(), computation,
-                         sequence, points_to_analysis, size_function));
-  return result.heap_size;
+  return ListScheduler::Run(computation, points_to_analysis, size_function);
 }
 
-StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
+StatusOr<std::vector<const HloInstruction*>> DefaultMemoryScheduler(
     const HloComputation& computation,
     const TuplePointsToAnalysis& points_to_analysis,
-    const LogicalBuffer::SizeFunction& size_function,
-    SchedulerAlgorithm algorithm) {
-  VLOG(2) << "Computation: " << computation.name();
-  if (algorithm == SchedulerAlgorithm::kListSchedule) {
-    return ListScheduler::Run(computation, points_to_analysis, size_function);
-  }
-  if (algorithm == SchedulerAlgorithm::kDfsSchedule) {
-    return RunDFSMemoryScheduler(computation, points_to_analysis,
-                                 size_function);
-  }
-
+    const LogicalBuffer::SizeFunction& size_function) {
   // We try both a list-scheduler based ordering and a DFS based ordering, and
   // choose whichever returns a lower min-memory, not accounting for
   // fragmentation.
@@ -427,7 +443,7 @@ StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
   // within the caller's context. But it's good enough for now.
   TF_ASSIGN_OR_RETURN(
       std::vector<const HloInstruction*> list_sequence,
-      ListScheduler::Run(computation, points_to_analysis, size_function));
+      ListMemoryScheduler(computation, points_to_analysis, size_function));
   TF_ASSIGN_OR_RETURN(
       const int64 list_memory,
       MinimumMemoryForComputation(computation, list_sequence,
@@ -436,7 +452,7 @@ StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
 
   TF_ASSIGN_OR_RETURN(
       std::vector<const HloInstruction*> dfs_sequence,
-      RunDFSMemoryScheduler(computation, points_to_analysis, size_function));
+      DFSMemoryScheduler(computation, points_to_analysis, size_function));
   TF_ASSIGN_OR_RETURN(
       const int64 dfs_memory,
       MinimumMemoryForComputation(computation, dfs_sequence, points_to_analysis,
@@ -454,12 +470,10 @@ StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
   }
 }
 
-}  // namespace
-
 StatusOr<SequentialHloOrdering::HloModuleSequence>
 CreateMemoryMinimizingSequence(const HloModule& module,
                                const LogicalBuffer::SizeFunction& size_function,
-                               SchedulerAlgorithm algorithm) {
+                               const MemorySchedulerAlgorithm& algorithm) {
   SequentialHloOrdering::HloModuleSequence sequence;
   TF_ASSIGN_OR_RETURN(std::unique_ptr<TuplePointsToAnalysis> points_to_analysis,
                       TuplePointsToAnalysis::Run(&module));
@@ -475,7 +489,7 @@ CreateMemoryMinimizingSequence(const HloModule& module,
 StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
     const HloComputation& computation,
     const LogicalBuffer::SizeFunction& size_function,
-    SchedulerAlgorithm algorithm) {
+    const MemorySchedulerAlgorithm& algorithm) {
   CHECK(!computation.IsFusionComputation());
   TF_ASSIGN_OR_RETURN(std::unique_ptr<TuplePointsToAnalysis> points_to_analysis,
                       TuplePointsToAnalysis::Run(computation.parent()));
diff --git a/tensorflow/compiler/xla/service/hlo_scheduling.h b/tensorflow/compiler/xla/service/hlo_scheduling.h
index 1d1eb1e064f75c2220b39e84b010e720a0c37880..068e68383deb170ded1c9b09a8b7ceb8c4c0ab4b 100644
--- a/tensorflow/compiler/xla/service/hlo_scheduling.h
+++ b/tensorflow/compiler/xla/service/hlo_scheduling.h
@@ -22,6 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_ordering.h"
 #include "tensorflow/compiler/xla/service/logical_buffer.h"
+#include "tensorflow/compiler/xla/service/tuple_points_to_analysis.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
 
@@ -33,28 +34,48 @@ StatusOr<int64> MinimumMemoryForSequence(
     const SequentialHloOrdering::HloModuleSequence& module_sequence,
     const LogicalBuffer::SizeFunction& size_function);
 
-enum class SchedulerAlgorithm {
-  kListSchedule,
-  kDfsSchedule,
+// A memory scheduler computes an execution sequence for the HLO instructions in
+// 'computation' that minimizes peak memory, given a points-to analysis result
+// that describes buffer aliasing, together with a target-specific size function
+// that maps a tensor's logical size to its padded size.
+typedef std::function<StatusOr<std::vector<const HloInstruction*>>(
+    const HloComputation&, const TuplePointsToAnalysis&,
+    const LogicalBuffer::SizeFunction&)>
+    MemorySchedulerAlgorithm;
 
-  // Selects the available scheduler algorithm that had the minimum memory in
-  // the resulting sequence (a la MinimumMemoryForSequence).
-  kAuto,
-};
+// List scheduler
+StatusOr<std::vector<const HloInstruction*>> ListMemoryScheduler(
+    const HloComputation& computation,
+    const TuplePointsToAnalysis& points_to_analysis,
+    const LogicalBuffer::SizeFunction& size_function);
+
+// DFS-order scheduler
+StatusOr<std::vector<const HloInstruction*>> DFSMemoryScheduler(
+    const HloComputation& computation,
+    const TuplePointsToAnalysis& points_to_analysis,
+    const LogicalBuffer::SizeFunction& size_function);
+
+// The default scheduling algorithm. Runs both the list scheduler
+// and the DFS scheduler, and chooses whichever returns a lower min-memory,
+// not accounting for fragmentation.
+StatusOr<std::vector<const HloInstruction*>> DefaultMemoryScheduler(
+    const HloComputation& computation,
+    const TuplePointsToAnalysis& points_to_analysis,
+    const LogicalBuffer::SizeFunction& size_function);
 
 // Returns an HloModuleSequence which seeks to minimize the memory required for
 // the computation. size_function is the function returning the number of bytes
 // required for a LogicalBuffer.
 StatusOr<SequentialHloOrdering::HloModuleSequence>
-CreateMemoryMinimizingSequence(
-    const HloModule& module, const LogicalBuffer::SizeFunction& size_function,
-    SchedulerAlgorithm algorithm = SchedulerAlgorithm::kAuto);
+CreateMemoryMinimizingSequence(const HloModule& module,
+                               const LogicalBuffer::SizeFunction& size_function,
+                               const MemorySchedulerAlgorithm& algorithm = {});
 
 // Overload of above that computes the sequence for a single computation.
 StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
     const HloComputation& computation,
     const LogicalBuffer::SizeFunction& size_function,
-    SchedulerAlgorithm algorithm = SchedulerAlgorithm::kAuto);
+    const MemorySchedulerAlgorithm& algorithm = {});
 
 }  // namespace xla
 
diff --git a/tensorflow/compiler/xla/service/hlo_scheduling_test.cc b/tensorflow/compiler/xla/service/hlo_scheduling_test.cc
index 7fb338e7042ce19ac9647e23719e738f3ef42c7c..74544c4a67a819d341056aba4cf6b321a5a86c0a 100644
--- a/tensorflow/compiler/xla/service/hlo_scheduling_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_scheduling_test.cc
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_ordering.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 
@@ -89,5 +90,105 @@ TEST_F(MinimumMemoryForSequenceTest, MultiComputation) {
             MinimumMemoryForSequence(module_sequence, size_fn).ValueOrDie());
 }
 
+class HloSchedulingTest : public HloTestBase {};
+
+TEST_F(HloSchedulingTest, LastUseScheduledFirst) {
+  // Tests scheduling of the following HLO code:
+  //
+  //   %ab = abs(%param)
+  //   %exp = exp(%param)
+  //   %add = add(%ab, %exp)
+  //   %negate = negate(%exp)
+  //   %sub = subtract(%add, %negate)
+  //
+  // %add should be scheduled before %negate because %add is the last (and only)
+  // use of %ab. Scheduling %add first then frees up %ab's buffer.
+  const Shape vec = ShapeUtil::MakeShape(xla::F32, {42});
+  auto builder = HloComputation::Builder(TestName());
+  auto param =
+      builder.AddInstruction(HloInstruction::CreateParameter(0, vec, "param"));
+  auto ab = builder.AddInstruction(
+      HloInstruction::CreateUnary(vec, HloOpcode::kAbs, param));
+  auto exp = builder.AddInstruction(
+      HloInstruction::CreateUnary(vec, HloOpcode::kExp, param));
+
+  auto add = builder.AddInstruction(
+      HloInstruction::CreateBinary(vec, HloOpcode::kAdd, ab, exp));
+  auto negate = builder.AddInstruction(
+      HloInstruction::CreateUnary(vec, HloOpcode::kNegate, exp));
+  auto sub = builder.AddInstruction(
+      HloInstruction::CreateBinary(vec, HloOpcode::kSubtract, add, negate));
+
+  auto module = CreateNewModule();
+  module->AddEntryComputation(builder.Build());
+
+  TF_ASSERT_OK_AND_ASSIGN(
+      SequentialHloOrdering::HloModuleSequence sequence,
+      CreateMemoryMinimizingSequence(*module, [](const LogicalBuffer& buffer) {
+        return ShapeUtil::ByteSizeOf(buffer.shape());
+      }));
+  // Verify that all instructions are in the sequence.
+  EXPECT_EQ(module->entry_computation()->instruction_count(),
+            sequence.at(module->entry_computation()).size());
+
+  // The first instruction should be the parameter and the last the root "sub".
+  EXPECT_EQ(param, sequence.at(module->entry_computation()).front());
+  EXPECT_EQ(sub, sequence.at(module->entry_computation()).back());
+
+  SequentialHloOrdering ordering(module.get(), sequence);
+  EXPECT_TRUE(ordering.ExecutesBefore(add, negate));
+}
+
+TEST_F(HloSchedulingTest, ListSchedulerHandlesAliasing) {
+  const char* module_str = R"(
+HloModule test_aliasing_module
+
+ENTRY root {
+  param = s32[1000] parameter(0)
+  p0 = s32[1000] copy(param)
+  p1 = s32[1000] copy(param)
+  t = (s32[1000], s32[1000]) tuple(p0, p1)
+  a = s32[1000] get-tuple-element(t), index=0
+  b = s32[1000] get-tuple-element(t), index=1
+  c = s32[1000] add(a, b)
+  d = s32[1000] add(c, b)
+  e = s32[1000] add(c, c)
+  f = s32[1000] add(e, e)
+  ROOT result = (s32[1000], s32[1000], s32[1000]) tuple(d, e, f)
+})";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          tools::Parse(module_str));
+
+  auto size_fn = [](const LogicalBuffer& buffer) {
+    return ShapeUtil::ByteSizeOf(buffer.shape(), /*pointer_size=*/8);
+  };
+  TF_ASSERT_OK_AND_ASSIGN(
+      SequentialHloOrdering::HloModuleSequence sequence,
+      CreateMemoryMinimizingSequence(*module, size_fn, ListMemoryScheduler));
+  // Verify that all instructions are in the sequence.
+  EXPECT_EQ(module->entry_computation()->instruction_count(),
+            sequence.at(module->entry_computation()).size());
+
+  std::unordered_map<string, const HloInstruction*> instructions_by_name;
+  for (const HloInstruction* instruction :
+       sequence.at(module->entry_computation())) {
+    instructions_by_name[instruction->name()] = instruction;
+  }
+
+  // The first instruction should be the parameter and the last the root.
+  EXPECT_EQ(instructions_by_name.at("param"),
+            sequence.at(module->entry_computation()).front());
+  EXPECT_EQ(instructions_by_name.at("result"),
+            sequence.at(module->entry_computation()).back());
+
+  // Instructions "d" and "e" will both be schedulable at the same time, but
+  // instruction "d" allows us to free the buffer of "p1", so the list scheduler
+  // should prefer it.
+  SequentialHloOrdering ordering(module.get(), sequence);
+  EXPECT_TRUE(ordering.ExecutesBefore(instructions_by_name.at("d"),
+                                      instructions_by_name.at("e")));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_sharding.cc b/tensorflow/compiler/xla/service/hlo_sharding.cc
index afe79c9f17befdcb2812c0a08b205f21b0715b19..aa9ff89e983aa5d35a18906afca1c6e8eeaefa06 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding.cc
+++ b/tensorflow/compiler/xla/service/hlo_sharding.cc
@@ -348,4 +348,30 @@ OpSharding HloSharding::ToProto() const {
   return result;
 }
 
+HloSharding HloSharding::TransformShardedTileShape(
+    const Shape& new_shape,
+    const std::function<int64(int64, int64)>& transform) const {
+  CHECK(!IsTuple());
+  if (IsTileMaximal()) {
+    return *this;
+  }
+  CHECK_EQ(ShapeUtil::Rank(new_shape), ShapeUtil::Rank(tile_shape()));
+  Shape new_tile_shape;
+  new_tile_shape.set_element_type(tile_shape().element_type());
+  for (int64 i = 0; i < ShapeUtil::Rank(new_shape); ++i) {
+    int64 dim;
+    if (tile_assignment().dim(i) == 1) {
+      dim = new_shape.dimensions(i);
+    } else if (transform) {
+      dim = transform(i, tile_shape().dimensions(i));
+    } else {
+      dim = tile_shape().dimensions(i);
+    }
+    new_tile_shape.add_dimensions(dim);
+  }
+  TF_CHECK_OK(
+      LayoutUtil::CopyLayoutBetweenShapes(tile_shape_, &new_tile_shape));
+  return HloSharding::Tile(new_tile_shape, tile_assignment());
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_sharding.h b/tensorflow/compiler/xla/service/hlo_sharding.h
index 7263198385cf0c84b1dac1e15177dcac99adaafb..38273236f9fd26dd6566f9c6c031ff9fb6fe3431 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding.h
+++ b/tensorflow/compiler/xla/service/hlo_sharding.h
@@ -173,7 +173,7 @@ class HloSharding {
 
   bool operator==(const HloSharding& other) const {
     return replicated_ == other.replicated_ && maximal_ == other.maximal_ &&
-           protobuf_util::ProtobufEquals(tile_shape_, other.tile_shape_) &&
+           ShapeUtil::Compatible(tile_shape_, other.tile_shape_) &&
            tile_assignment_ == other.tile_assignment_ &&
            tuple_elements_ == other.tuple_elements_;
   }
@@ -207,6 +207,26 @@ class HloSharding {
   // REQUIRES: !IsReplicated() && !IsTuple()
   const Array<int64>& tile_assignment() const { return tile_assignment_; }
 
+  // Returns the flattened list of all the leaf shardings in a tuple shape, by
+  // pre-order walk (ShapeTree iterator order).
+  // REQUIRES: IsTuple().
+  const std::vector<HloSharding>& tuple_elements() const {
+    return tuple_elements_;
+  }
+
+  // Returns a new sharding that can apply to the given new shape.
+  // If this sharding is tile-maximal, the returned sharding will be the same as
+  // this sharding. If this sharding is not tile-maximal, the returned
+  // sharding's tile size will differ:
+  //   - Non-sharded dimensions will be adapted to be the same as `new_shape`;
+  //     tile_dimension(i) = new_shape.dimensions(i);
+  //   - Sharded dimensions will be kept the same unless `transform` is supplied
+  //     in which case tile_dimension(i) = transform(i, tile_dimension(i));
+  // REQUIRES: !IsTuple().
+  HloSharding TransformShardedTileShape(
+      const Shape& new_shape,
+      const std::function<int64(int64, int64)>& transform = nullptr) const;
+
  private:
   HloSharding()
       : replicated_(true),
diff --git a/tensorflow/compiler/xla/service/hlo_sharding_test.cc b/tensorflow/compiler/xla/service/hlo_sharding_test.cc
index 0c7487b3ac77ff181d44dd55ebcf2608feaf02ea..07fc4687cc1c0518b3ab2a86c62464fc54082a01 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_sharding_test.cc
@@ -269,5 +269,18 @@ TEST_F(HloShardingTest, Hash) {
   }
 }
 
+TEST_F(HloShardingTest, TransformShardedTileShapeTest) {
+  HloSharding sharding =
+      HloSharding::Tile(ShapeUtil::MakeShape(F32, {3, 5, 7, 11}),
+                        Array4D<int64>({{{{0, 1}, {2, 3}}}}));
+  HloSharding result = sharding.TransformShardedTileShape(
+      ShapeUtil::MakeShape(F32, {13, 15, 17, 19}),
+      [](int64 dim, int64 /*value*/) { return dim * 111; });
+  HloSharding expected =
+      HloSharding::Tile(ShapeUtil::MakeShape(F32, {13, 15, 222, 333}),
+                        Array4D<int64>({{{{0, 1}, {2, 3}}}}));
+  EXPECT_EQ(result, expected);
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_verifier.cc b/tensorflow/compiler/xla/service/hlo_verifier.cc
index b1fd068115e1d104a11d880675ef84e07d6d5602..8c875698eb1992719d504d272ca338b05b60e36b 100644
--- a/tensorflow/compiler/xla/service/hlo_verifier.cc
+++ b/tensorflow/compiler/xla/service/hlo_verifier.cc
@@ -762,11 +762,14 @@ StatusOr<bool> HloVerifier::Run(HloModule* module) {
       } else if (instruction->opcode() == HloOpcode::kBroadcast) {
         // If you see this failure then someone has confused the difference
         // between the HLO broadcast op, and the UserComputation broadcast
-        // op.  See https://groups.google.com/forum/#!topic/xla-dev/9LqijHmTt_I
+        // op. See https://groups.google.com/forum/#!topic/xla-dev/9LqijHmTt_I
         // or ComputationLowerer::Visit()
         TF_RET_CHECK(instruction->dimensions().size() ==
                      ShapeUtil::Rank(instruction->operand(0)->shape()))
-            << "Broadcast HLO has invalid number of dimensions.";
+            << "Broadcast HLO (" << instruction->ToShortString()
+            << ") has invalid number of dimensions: "
+            << instruction->dimensions().size()
+            << " != " << ShapeUtil::Rank(instruction->operand(0)->shape());
       } else if (instruction->opcode() == HloOpcode::kWhile) {
         auto* while_cond = instruction->while_condition();
         auto* while_body = instruction->while_body();
diff --git a/tensorflow/compiler/xla/service/instruction_fusion.cc b/tensorflow/compiler/xla/service/instruction_fusion.cc
index f494748e17fc2d0de74dec67f7414d4791f76a07..d69ad80bdb4d2eab2d34228be026d7bc0b76efc0 100644
--- a/tensorflow/compiler/xla/service/instruction_fusion.cc
+++ b/tensorflow/compiler/xla/service/instruction_fusion.cc
@@ -302,7 +302,7 @@ StatusOr<bool> InstructionFusion::Run(HloModule* module) {
 
       // Consider each operand of this instruction for fusion into this
       // instruction. We want to consider the operands in a particular order to
-      // avoid created duplicate instruction clones in the fusion instruction.
+      // avoid creating duplicate instruction clones in the fusion instruction.
       // For example, consider the following expression:
       //
       //   A = ...
@@ -377,7 +377,7 @@ StatusOr<bool> InstructionFusion::Run(HloModule* module) {
         changed = true;
 
         if (operand->user_count() == 0) {
-          // Operand is now dead. Remove from post order by setting it's
+          // Operand is now dead. Remove from post order by setting its
           // location to nullptr.
           post_order[FindOrDie(post_order_index, operand)] = nullptr;
           post_order_index.erase(operand);
diff --git a/tensorflow/compiler/xla/service/interpreter/BUILD b/tensorflow/compiler/xla/service/interpreter/BUILD
index 0819ab3b90b2360c6b0b2afaa89f322afe566eb3..0db3863f2428cf0c9a66a928d54f774e39a18539 100644
--- a/tensorflow/compiler/xla/service/interpreter/BUILD
+++ b/tensorflow/compiler/xla/service/interpreter/BUILD
@@ -63,10 +63,7 @@ cc_library(
     name = "platform_id",
     srcs = ["platform_id.cc"],
     hdrs = ["platform_id.h"],
-    deps = [
-        "@nsync//:nsync_headers",
-        "//tensorflow/core:stream_executor_headers_lib",
-    ] + if_static(
+    deps = ["//tensorflow/core:stream_executor_headers_lib"] + if_static(
         ["@protobuf_archive//:protobuf"],
         ["@protobuf_archive//:protobuf_headers"],
     ),
diff --git a/tensorflow/compiler/xla/service/layout_assignment.cc b/tensorflow/compiler/xla/service/layout_assignment.cc
index 0668f66051ce96292c3c85bac7e649d89914106c..39f9120e552f014dd2759bff2892157402d9c47a 100644
--- a/tensorflow/compiler/xla/service/layout_assignment.cc
+++ b/tensorflow/compiler/xla/service/layout_assignment.cc
@@ -192,17 +192,34 @@ LayoutConstraints::LayoutConstraints(
   }
 }
 
+PointsToSet::BufferSet* LayoutConstraints::GetBufferSet(
+    const HloInstruction* instruction) const {
+  auto it = buffer_sets_cache_.find(instruction);
+  if (it != buffer_sets_cache_.end()) {
+    return it->second.get();
+  }
+  auto& buffer_set =
+      buffer_sets_cache_
+          .emplace(instruction, MakeUnique<PointsToSet::BufferSet>())
+          .first->second;
+  const auto& points_to_set = points_to_analysis_.GetPointsToSet(instruction);
+  points_to_set.ForEachElement(
+      [&buffer_set](const ShapeIndex& /*index*/,
+                    const PointsToSet::BufferList& buffers) {
+        buffer_set->insert(buffers.begin(), buffers.end());
+      });
+  return buffer_set.get();
+}
+
 bool LayoutConstraints::OperandBufferForwarded(
     const HloInstruction* instruction, int64 operand_no) const {
   // The operand is potentially forwarded if the intersection of points-to sets
   // of the operand and the instruction is non-empty.
-  auto output_buffers =
-      points_to_analysis_.GetPointsToSet(instruction).CreateFlattenedSet();
-  auto operand_buffers =
-      points_to_analysis_.GetPointsToSet(instruction->operand(operand_no))
-          .CreateFlattenedSet();
-  for (const LogicalBuffer* output_buffer : output_buffers) {
-    if (operand_buffers.count(output_buffer) > 0) {
+  PointsToSet::BufferSet* output_buffers = GetBufferSet(instruction);
+  PointsToSet::BufferSet* operand_buffers =
+      GetBufferSet(instruction->operand(operand_no));
+  for (const LogicalBuffer* output_buffer : *output_buffers) {
+    if (operand_buffers->count(output_buffer) > 0) {
       return true;
     }
   }
@@ -1544,6 +1561,13 @@ StatusOr<bool> LayoutAssignment::Run(HloModule* module) {
     // infeeds.  Clearing the layouts here avoids hiding potential bugs in the
     // layout assignment pass that may accidently use the existing layout.
     for (HloInstruction* instruction : computation->instructions()) {
+      if (instruction->opcode() == HloOpcode::kBitcast) {
+        // Bitcasts are inherently layout-sensitive, so a bitcast instruction
+        // present in the IR before layout assignment is a bug.
+        return InternalError(
+            "Unexpected bitcast operation seen during layout assignment: %s.",
+            instruction->ToString().c_str());
+      }
       if (instruction->opcode() != HloOpcode::kInfeed) {
         LayoutUtil::ClearLayout(instruction->mutable_shape());
       }
diff --git a/tensorflow/compiler/xla/service/layout_assignment.h b/tensorflow/compiler/xla/service/layout_assignment.h
index 29018584487cabfd740d7914625c2a50f552d6ff..680f88048a1f0cd5ede7991640003ef407d4facf 100644
--- a/tensorflow/compiler/xla/service/layout_assignment.h
+++ b/tensorflow/compiler/xla/service/layout_assignment.h
@@ -38,6 +38,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/gtl/flatmap.h"
 #include "tensorflow/core/platform/types.h"
 
 namespace xla {
@@ -199,6 +200,11 @@ class LayoutConstraints {
   string ToString() const;
 
  private:
+  // Returns the flattened buffer set for `instruction`, computing it on first
+  // use and caching it thereafter. Without the cache, the same set would be
+  // rebuilt many times for the same instruction, which is often slow.
+  PointsToSet::BufferSet* GetBufferSet(const HloInstruction* instruction) const;
+
   // The set of BufferLayoutConstraints applied to the computation.
   std::unordered_map<const LogicalBuffer*, BufferLayoutConstraint>
       buffer_constraints_;
@@ -221,6 +227,10 @@ class LayoutConstraints {
   // Array-shaped buffers which have not yet been constrained.
   std::set<LogicalBuffer::Id> unconstrained_buffer_ids_;
 
+  mutable tensorflow::gtl::FlatMap<const HloInstruction*,
+                                   std::unique_ptr<PointsToSet::BufferSet>>
+      buffer_sets_cache_;
+
   HloComputation* computation_;
 };
 
@@ -393,7 +403,6 @@ class LayoutAssignment : public HloPassInterface {
   Status CheckLayouts(HloModule* module);
 
   ComputationLayout* entry_computation_layout_;
-  ChannelLayoutConstraints* channel_layout_constraints_;
 
  protected:
   // Map containing the layouts of all computations assigned so
@@ -401,6 +410,7 @@ class LayoutAssignment : public HloPassInterface {
   // handled before their caller instructions so the layouts of caller
   // instructions can be set to match the computation.
   std::map<HloComputation*, ComputationLayout> computation_layouts_;
+  ChannelLayoutConstraints* channel_layout_constraints_;
 };
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/layout_assignment_test.cc b/tensorflow/compiler/xla/service/layout_assignment_test.cc
index 88e5caaf478bc99ecf93ab00ddba4637397b9d78..4b1c9bad41de8030cf14bc6d1c0db21b9c56c3bf 100644
--- a/tensorflow/compiler/xla/service/layout_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/layout_assignment_test.cc
@@ -590,6 +590,85 @@ TEST_F(LayoutAssignmentTest, TransposeToBitcastToUser) {
                                             transpose->shape(), {2, 3, 0, 1}));
 }
 
+// TransposeIsBitcast shouldn't be called without layout information.
+TEST_F(LayoutAssignmentTest, TransposeIsBitcastFail) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape input_shape = ShapeUtil::MakeShape(F32, {2, 2, 2});
+  Shape input_shape_with_layout(input_shape);
+  *input_shape_with_layout.mutable_layout() = LayoutUtil::MakeLayout({2, 1, 0});
+  auto param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, input_shape_with_layout, "param"));
+  auto hlo = builder.AddInstruction(
+      HloInstruction::CreateTranspose(input_shape, param, {0, 2, 1}));
+  // Clear the default layout assigned to the instruction.
+  LayoutUtil::ClearLayout(hlo->mutable_shape());
+  EXPECT_DEATH(ShapeUtil::TransposeIsBitcast(hlo->operand(0)->shape(),
+                                             hlo->shape(), hlo->dimensions()),
+               "LayoutUtil::HasLayout");
+}
+
+// ReshapeIsBitcast shouldn't be called without layout information.
+TEST_F(LayoutAssignmentTest, ReshapeIsBitcastFail) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape input_shape = ShapeUtil::MakeShape(F32, {2, 2, 2});
+  Shape input_shape_with_layout(input_shape);
+  *input_shape_with_layout.mutable_layout() = LayoutUtil::MakeLayout({2, 1, 0});
+  auto param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, input_shape_with_layout, "param"));
+  auto hlo =
+      builder.AddInstruction(HloInstruction::CreateReshape(input_shape, param));
+  // Clear the default layout assigned to the instruction.
+  LayoutUtil::ClearLayout(hlo->mutable_shape());
+  EXPECT_DEATH(
+      ShapeUtil::ReshapeIsBitcast(hlo->operand(0)->shape(), hlo->shape()),
+      "LayoutUtil::HasLayout");
+}
+
+// Check that the computation below doesn't crash the compiler.
+//
+// Within a fusion computation, only the parameters and result get assigned a
+// layout.  When we run the algebraic simplifier on this computation post layout
+// assignment, it should not call TransposeIsBitcast on the `transpose` node
+// inside the fusion computation, because TransposeIsBitcast checks that both
+// input_shape and output_shape have layouts.
+TEST_F(LayoutAssignmentTest, TransposeWithinFusionDoesNotCrash) {
+  const char* module_str = R"(
+    HloModule test_module
+
+    fused_computation {
+      param_1 = f32[2,2,2]{2,1,0} parameter(1)
+      transpose = f32[2,2,2]{2,1,0} transpose(param_1), dimensions={0,2,1}
+      reduce_1 = f32[] parameter(0)
+      broadcast_1 = f32[2,2,2]{2,1,0} broadcast(reduce_1), dimensions={}
+      ROOT divide_1 = f32[2,2,2]{2,1,0} divide(transpose, broadcast_1)
+    }
+
+    ENTRY entry_computation {
+      fusion.1 = f32[2,2,2]{2,1,0} parameter(1)
+      reduce.1 = f32[] parameter(0)
+      fusion.2 = f32[2,2,2]{2,1,0} fusion(reduce.1, fusion.1), kind=kLoop, calls=fused_computation
+      ROOT tuple.1 = (f32[2,2,2]{2,1,0}) tuple(fusion.2)
+    }
+  )";
+
+  auto module = tools::Parse(module_str).ValueOrDie();
+
+  module =
+      backend()
+          .compiler()
+          ->RunHloPasses(std::move(module), backend().default_stream_executor(),
+                         /*device_allocator=*/nullptr)
+          .ConsumeValueOrDie();
+
+  EXPECT_EQ(
+      ::tensorflow::Status::OK(),
+      backend()
+          .compiler()
+          ->RunBackend(std::move(module), backend().default_stream_executor(),
+                       /*device_allocator=*/nullptr)
+          .status());
+}
+
 // A GTE inside of a fusion node inherits the layout of its operand (which
 // should, if we keep following operands, eventually be a parameter).
 TEST_F(LayoutAssignmentTest, GTEInheritsLayoutFromOperand) {
@@ -717,5 +796,26 @@ TEST_F(LayoutAssignmentTest, ConditionalAsymmetricLayout) {
   EXPECT_THAT(false_result->opcode(), HloOpcode::kCopy);
 }
 
+TEST_F(LayoutAssignmentTest, InternalErrorOnBitcast) {
+  auto builder = HloComputation::Builder(TestName());
+  auto constant0 = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+          {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({0, 1}))));
+  builder.AddInstruction(HloInstruction::CreateUnary(
+      constant0->shape(), HloOpcode::kBitcast, constant0));
+  auto module = CreateNewModule();
+  module->AddEntryComputation(builder.Build());
+
+  ComputationLayout computation_layout(
+      module->entry_computation()->ComputeProgramShape());
+  LayoutAssignment layout_assignment(&computation_layout);
+  Status error_status = layout_assignment.Run(module.get()).status();
+  EXPECT_FALSE(error_status.ok());
+  EXPECT_THAT(
+      error_status.error_message(),
+      ::testing::HasSubstr(
+          "Unexpected bitcast operation seen during layout assignment"));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc b/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc
index 6384c7f46f5ebbedaeda232b40095611a5d738a4..3312a888443233139841ce7a5e3173f907605e1d 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc
@@ -29,18 +29,13 @@ limitations under the License.
 namespace xla {
 namespace llvm_ir {
 
-IrArray::Index::Index(llvm::Value* linear, const Shape& shape,
-                      llvm::IRBuilder<>* ir_builder)
-    : multidim_(ShapeUtil::Rank(shape)),
-      linear_(linear),
-      layout_(shape.layout()),
-      dims_(shape.dimensions().begin(), shape.dimensions().end()) {
-  CHECK(LayoutUtil::HasLayout(shape))
-      << "Shape " << ShapeUtil::HumanStringWithLayout(shape)
-      << " should have a layout.";
+static void Delinearize(std::vector<llvm::Value*>* multidim,
+                        llvm::Value* linear, const Shape& shape,
+                        llvm::IRBuilder<>* ir_builder) {
   int64 divisor = 1;
-  for (int64 i = 0; i < layout_.minor_to_major_size(); ++i) {
-    int64 dimension = layout_.minor_to_major(i);
+  const Layout& layout = shape.layout();
+  for (int64 i = 0; i < layout.minor_to_major_size(); ++i) {
+    int64 dimension = layout.minor_to_major(i);
     int64 size_of_current_dimension = shape.dimensions(dimension);
 
     // If i is not the last dimension, compute
@@ -54,16 +49,28 @@ IrArray::Index::Index(llvm::Value* linear, const Shape& shape,
     // memory lives in one big allocation, so cuda-memcheck can't detect
     // out-of-bounds accesses.
     auto* quot = ir_builder->CreateUDiv(linear, ir_builder->getInt64(divisor));
-    if (i < layout_.minor_to_major_size() - 1) {
-      multidim_[dimension] = ir_builder->CreateURem(
+    if (i < layout.minor_to_major_size() - 1) {
+      (*multidim)[dimension] = ir_builder->CreateURem(
           quot, ir_builder->getInt64(size_of_current_dimension));
     } else {
-      multidim_[dimension] = quot;
+      (*multidim)[dimension] = quot;
     }
     divisor *= size_of_current_dimension;
   }
 }
 
+IrArray::Index::Index(llvm::Value* linear, const Shape& shape,
+                      llvm::IRBuilder<>* ir_builder)
+    : multidim_(ShapeUtil::Rank(shape)),
+      linear_(linear),
+      layout_(shape.layout()),
+      dims_(shape.dimensions().begin(), shape.dimensions().end()) {
+  CHECK(LayoutUtil::HasLayout(shape))
+      << "Shape " << ShapeUtil::HumanStringWithLayout(shape)
+      << " should have a layout.";
+  Delinearize(&multidim_, linear, shape, ir_builder);
+}
+
 IrArray::Index::Index(tensorflow::gtl::ArraySlice<llvm::Value*> multidim,
                       llvm::Value* linear, const Shape& shape)
     : multidim_(multidim.begin(), multidim.end()),
@@ -83,7 +90,6 @@ IrArray::Index::Index(tensorflow::gtl::ArraySlice<llvm::Value*> multidim,
       dims_(shape.dimensions().begin(), shape.dimensions().end()) {
   CHECK_EQ(shape.dimensions_size(), multidim.size());
   CHECK(LayoutUtil::HasLayout(shape));
-  linear_ = Linearize(AsInt64Slice(shape.dimensions()), ir_builder);
 }
 
 IrArray::IrArray(llvm::Value* base_ptr, const Shape& shape)
@@ -106,16 +112,13 @@ IrArray::IrArray(llvm::Value* base_ptr, const Shape& shape)
   }
 }
 
-// Returns whether given linear index valid on given shape.
+// Returns whether the given linear index is valid on the given shape.
 bool IrArray::Index::LinearValidOnShape(const Shape& a) const {
-  auto b = ShapeUtil::MakeShape(PRED /* irrelevant */, dims_);
+  auto b = ShapeUtil::MakeShape(a.element_type(), dims_);
   *b.mutable_layout() = layout_;
   return linear_ != nullptr &&
-         ContainersEqual(
-             ShapeUtil::StripDegenerateDimensions(a).dimensions(),
-             ShapeUtil::StripDegenerateDimensions(b).dimensions()) &&
-         LayoutUtil::Equal(ShapeUtil::StripDegenerateDimensions(a).layout(),
-                           ShapeUtil::StripDegenerateDimensions(b).layout());
+         ShapeUtil::ElementsIn(a) == ShapeUtil::ElementsIn(b) &&
+         ShapeUtil::ReshapeIsBitcast(a, b);
 }
 
 IrArray::Index IrArray::Index::SourceIndexOfReshape(
@@ -160,7 +163,8 @@ IrArray::Index IrArray::Index::SourceIndexOfReshape(
     }
   }
 
-  if (linear() != nullptr &&
+  if (linear() != nullptr && LayoutUtil::HasLayout(input_shape) &&
+      LayoutUtil::HasLayout(output_shape) &&
       ShapeUtil::ReshapeIsBitcast(input_shape, output_shape)) {
     return Index(source_multidim_index, linear(), input_shape);
   }
@@ -195,13 +199,111 @@ IrArray::Index IrArray::Index::SourceIndexOfTranspose(
     llvm::IRBuilder<>* builder) const {
   std::vector<llvm::Value*> operand_multidim_index =
       Permute(dimension_mapping, multidim());
-  if (linear() != nullptr &&
+
+  if (linear() != nullptr && LayoutUtil::HasLayout(operand_shape) &&
+      LayoutUtil::HasLayout(shape) &&
       ShapeUtil::TransposeIsBitcast(operand_shape, shape, dimension_mapping)) {
     return Index(operand_multidim_index, linear(), operand_shape);
   }
+
   return Index(operand_multidim_index);
 }
 
+IrArray::Index IrArray::Index::SourceIndexOfBitcast(
+    const Shape& shape, const Shape& operand_shape,
+    llvm::IRBuilder<>* builder) const {
+  CHECK(LayoutUtil::HasLayout(shape) && LayoutUtil::HasLayout(operand_shape));
+  // In case the bitcast is just a reshape, we can use SourceIndexOfReshape()
+  // instead. This will reuse linear() if possible, so we don't have to build a
+  // new 'linear_index'.
+  if (ShapeUtil::ReshapeIsBitcast(operand_shape, shape)) {
+    return SourceIndexOfReshape(shape, operand_shape, builder);
+  }
+
+  // First linearize the index coming from the output of the bitcast. We want
+  // the physical index of the element in the buffer. This is like Linearize,
+  // but takes the layout into account.
+  int64 scale = 1;
+  llvm::Value* linear_index = builder->getInt64(0);
+  for (auto dimension : LayoutUtil::MinorToMajor(shape)) {
+    linear_index = builder->CreateAdd(
+        linear_index,
+        builder->CreateMul(multidim_[dimension], builder->getInt64(scale), "",
+                           /*HasNUW=*/true, /*HasNSW=*/true),
+        "", /*HasNUW=*/true, /*HasNSW=*/true);
+    scale *= shape.dimensions(dimension);
+  }
+
+  // Now delinearize it for the input of the bitcast.
+  std::vector<llvm::Value*> multi_index(operand_shape.dimensions_size());
+  Delinearize(&multi_index, linear_index, operand_shape, builder);
+
+  return Index(multi_index, linear_index, operand_shape);
+}
+
+IrArray::Index IrArray::Index::SourceIndexOfBroadcast(
+    const Shape& shape, const Shape& operand_shape,
+    tensorflow::gtl::ArraySlice<int64> dimension_mapping,
+    llvm::IRBuilder<>* builder) const {
+  int64 rank = ShapeUtil::Rank(operand_shape);
+  std::vector<llvm::Value*> source_index(rank);
+  for (int64 i = 0; i < rank; ++i) {
+    source_index[i] = multidim_[dimension_mapping[i]];
+  }
+  if (linear_ == nullptr || !LayoutUtil::HasLayout(operand_shape) ||
+      !LayoutUtil::HasLayout(shape)) {
+    return Index(source_index);
+  }
+  // High-level idea: we can reuse the linear index if the broadcasted
+  // dimensions are contiguous, and this part of the operation is a bitcast.
+  // The other dimensions can be masked out with a div and a mod operation.
+  std::vector<int64> logical_to_physical =
+      LayoutUtil::MakeLogicalToPhysical(shape.layout());
+  int64 output_rank = ShapeUtil::Rank(shape);
+  // The minimum physical dimension that is broadcasted.
+  int64 min_broadcasted_dimension = output_rank;
+  // The maximum physical dimension that is broadcasted.
+  int64 max_broadcasted_dimension = -1;
+  for (int64 i = 0; i < rank; ++i) {
+    int64 physical_dim = logical_to_physical[dimension_mapping[i]];
+    min_broadcasted_dimension =
+        std::min(min_broadcasted_dimension, physical_dim);
+    max_broadcasted_dimension =
+        std::max(max_broadcasted_dimension, physical_dim);
+  }
+  bool contiguous_broadcast_dimensions =
+      max_broadcasted_dimension - min_broadcasted_dimension == rank - 1;
+  if (!contiguous_broadcast_dimensions) {
+    return Index(source_index);
+  }
+  // Check if the mapped dimensions are a bitcast.
+  std::vector<int64> operand_logical_to_physical =
+      LayoutUtil::MakeLogicalToPhysical(operand_shape.layout());
+  for (int64 i = 0; i < rank; ++i) {
+    if (operand_logical_to_physical[i] !=
+        logical_to_physical[dimension_mapping[i]] - min_broadcasted_dimension) {
+      return Index(source_index);
+    }
+  }
+  llvm::Value* linear = linear_;
+  int64 divisor = 1;
+  for (int64 i = max_broadcasted_dimension + 1; i < output_rank; ++i) {
+    divisor *= shape.dimensions(LayoutUtil::Major(shape.layout(), i));
+  }
+  if (divisor > 1) {
+    linear = builder->CreateUDiv(linear, builder->getInt64(divisor));
+  }
+  if (min_broadcasted_dimension > 0) {
+    int64 mod = 1;
+    for (int64 i = min_broadcasted_dimension; i <= max_broadcasted_dimension;
+         ++i) {
+      mod *= shape.dimensions(LayoutUtil::Major(shape.layout(), i));
+    }
+    linear = builder->CreateURem(linear, builder->getInt64(mod));
+  }
+  return Index(source_index, linear, operand_shape);
+}
+
 llvm::Value* IrArray::Index::Linearize(
     tensorflow::gtl::ArraySlice<int64> dimensions,
     llvm::IRBuilder<>* builder) const {
diff --git a/tensorflow/compiler/xla/service/llvm_ir/ir_array.h b/tensorflow/compiler/xla/service/llvm_ir/ir_array.h
index 387d4629125cbb791840e943013188d14159908a..06cfb2a36c56c5fdece7140e469379f8394111fa 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/ir_array.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/ir_array.h
@@ -76,8 +76,7 @@ class IrArray {
           llvm::IRBuilder<>* ir_builder);
 
     // Constructs an index from the given multi-dimensional index and the shape
-    // that it indexes into. Also, computes the linear index according to
-    // "shape".
+    // that it indexes into.
     //
     // Precondition: "shape" has a layout.
     Index(tensorflow::gtl::ArraySlice<llvm::Value*> multidim,
@@ -134,6 +133,18 @@ class IrArray {
         tensorflow::gtl::ArraySlice<int64> dimension_mapping,
         llvm::IRBuilder<>* builder) const;
 
+    // Given that "this" is the target index of a bitcast from `operand_shape`
+    // to `shape`, returns the source index.
+    Index SourceIndexOfBitcast(const Shape& shape, const Shape& operand_shape,
+                               llvm::IRBuilder<>* builder) const;
+
+    // Given that "this" is the target index of a broadcast from `operand_shape`
+    // to `shape` with the given dimension mapping, returns the source index.
+    Index SourceIndexOfBroadcast(
+        const Shape& shape, const Shape& operand_shape,
+        tensorflow::gtl::ArraySlice<int64> dimension_mapping,
+        llvm::IRBuilder<>* builder) const;
+
     // Linearizes the index into the given shape, i.e. reshapes it to rank-1 and
     // returns the index into the sole dimension 0 of the new shape.
     llvm::Value* Linearize(tensorflow::gtl::ArraySlice<int64> dimensions,
diff --git a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
index 5c1866311d1ae1e0c33ab061ee326d86d647a908..2a282f3be79f847a6569416794d1a2a3fcd69148 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
@@ -106,8 +106,10 @@ llvm::Value* EmitFloatMax(llvm::Value* lhs_value, llvm::Value* rhs_value,
     auto cmp = ir_builder->CreateFCmpUGE(lhs_value, rhs_value);
     return ir_builder->CreateSelect(cmp, lhs_value, rhs_value);
   } else {
-    return EmitCallToIntrinsic(llvm::Intrinsic::maxnum, {lhs_value, rhs_value},
-                               {lhs_value->getType()}, ir_builder);
+    auto cmp_ge = ir_builder->CreateFCmpOGE(lhs_value, rhs_value);
+    auto lhs_is_nan = ir_builder->CreateFCmpUNE(lhs_value, lhs_value);
+    auto sel_lhs = ir_builder->CreateOr(cmp_ge, lhs_is_nan);
+    return ir_builder->CreateSelect(sel_lhs, lhs_value, rhs_value);
   }
 }
 
@@ -117,8 +119,10 @@ llvm::Value* EmitFloatMin(llvm::Value* lhs_value, llvm::Value* rhs_value,
     auto cmp = ir_builder->CreateFCmpULE(lhs_value, rhs_value);
     return ir_builder->CreateSelect(cmp, lhs_value, rhs_value);
   } else {
-    return EmitCallToIntrinsic(llvm::Intrinsic::minnum, {lhs_value, rhs_value},
-                               {lhs_value->getType()}, ir_builder);
+    auto cmp_le = ir_builder->CreateFCmpOLE(lhs_value, rhs_value);
+    auto lhs_is_nan = ir_builder->CreateFCmpUNE(lhs_value, lhs_value);
+    auto sel_lhs = ir_builder->CreateOr(cmp_le, lhs_is_nan);
+    return ir_builder->CreateSelect(sel_lhs, lhs_value, rhs_value);
   }
 }
 
diff --git a/tensorflow/compiler/xla/service/local_service.cc b/tensorflow/compiler/xla/service/local_service.cc
index 07f989d4faea199e812e54d2ae74d3ff9e7fa19a..5690a89909cd58f8f627e72a7341cb4731811b09 100644
--- a/tensorflow/compiler/xla/service/local_service.cc
+++ b/tensorflow/compiler/xla/service/local_service.cc
@@ -119,10 +119,24 @@ StatusOr<std::unique_ptr<Executable>> LocalService::CompileExecutable(
   }
 
   ExecutionOptions execution_options = CreateDefaultExecutionOptions();
+  if (build_options.hlo_profile().has_value()) {
+    execution_options.mutable_debug_options()->set_xla_hlo_profile(
+        *build_options.hlo_profile());
+  }
   if (build_options.generate_hlo_graph().has_value()) {
     execution_options.mutable_debug_options()->set_xla_generate_hlo_graph(
         build_options.generate_hlo_graph().value());
   }
+  if (build_options.dump_optimized_hlo_proto_to().has_value()) {
+    execution_options.mutable_debug_options()
+        ->set_xla_dump_optimized_hlo_proto_to(
+            build_options.dump_optimized_hlo_proto_to().value());
+  }
+  if (build_options.dump_per_pass_hlo_proto_to().has_value()) {
+    execution_options.mutable_debug_options()
+        ->set_xla_dump_per_pass_hlo_proto_to(
+            build_options.dump_per_pass_hlo_proto_to().value());
+  }
   if (build_options.result_layout() != nullptr) {
     *execution_options.mutable_shape_with_output_layout() =
         *build_options.result_layout();
diff --git a/tensorflow/compiler/xla/service/service.cc b/tensorflow/compiler/xla/service/service.cc
index 43d0f605985819afdaf2db2309a0bfb86f230fe3..0becc9d8f8ed22b2d7174b76ce775efec4b646f5 100644
--- a/tensorflow/compiler/xla/service/service.cc
+++ b/tensorflow/compiler/xla/service/service.cc
@@ -232,10 +232,14 @@ tensorflow::Status Service::ValidateResultShapeWithLayout(
   return ShapeUtil::ValidateShape(shape_with_layout);
 }
 
-StatusOr<std::vector<const ShapedBuffer*>> Service::ResolveAndValidateArguments(
+StatusOr<std::vector<std::vector<const ShapedBuffer*>>>
+Service::ResolveAndValidateArguments(
     tensorflow::gtl::ArraySlice<const GlobalDataHandle*> arguments,
-    int device_ordinal) {
-  std::vector<const ShapedBuffer*> shaped_buffers;
+    tensorflow::gtl::ArraySlice<perftools::gputools::StreamExecutor*>
+        stream_executors) {
+  CHECK_EQ(options_.number_of_replicas(), stream_executors.size());
+  std::vector<std::vector<const ShapedBuffer*>> replicated_arguments;
+  replicated_arguments.resize(options_.number_of_replicas());
   for (size_t i = 0; i < arguments.size(); ++i) {
     auto buffer_status = allocation_tracker_.Resolve(*arguments[i]);
     if (!buffer_status.ok()) {
@@ -243,22 +247,25 @@ StatusOr<std::vector<const ShapedBuffer*>> Service::ResolveAndValidateArguments(
                     StrCat(buffer_status.status().error_message(), ", ",
                            "failed to resolve allocation for parameter ", i));
     }
-    const ShapedBuffer* shaped_buffer = buffer_status.ValueOrDie();
-
-    // Verify allocation is same platform and device as the execution.
-    if (shaped_buffer->platform() != execute_backend_->platform() ||
-        shaped_buffer->device_ordinal() != device_ordinal) {
-      return InvalidArgument(
-          "argument %lu is on device %s:%d but computation will be executed "
-          "on device %s",
-          i, shaped_buffer->platform()->Name().c_str(),
-          shaped_buffer->device_ordinal(),
-          execute_backend_->device_name(device_ordinal).c_str());
+    auto replicated_buffers = buffer_status.ValueOrDie();
+    CHECK_EQ(options_.number_of_replicas(), replicated_buffers.size());
+    for (int replica = 0; replica < options_.number_of_replicas(); ++replica) {
+      const ShapedBuffer* shaped_buffer = replicated_buffers[replica];
+      int replica_device_ordinal = stream_executors[replica]->device_ordinal();
+      // Verify allocation is same platform and device as the execution.
+      if (shaped_buffer->platform() != execute_backend_->platform() ||
+          shaped_buffer->device_ordinal() != replica_device_ordinal) {
+        return InvalidArgument(
+            "argument %lu is on device %s:%d but computation will be executed "
+            "on device %s",
+            i, shaped_buffer->platform()->Name().c_str(),
+            shaped_buffer->device_ordinal(),
+            execute_backend_->device_name(replica_device_ordinal).c_str());
+      }
+      replicated_arguments[replica].push_back(shaped_buffer);
     }
-
-    shaped_buffers.push_back(shaped_buffer);
   }
-  return shaped_buffers;
+  return replicated_arguments;
 }
 
 StatusOr<std::unique_ptr<HloModuleConfig>> Service::CreateModuleConfig(
@@ -307,8 +314,6 @@ StatusOr<std::unique_ptr<HloModuleConfig>> Service::CreateModuleConfig(
   if (execution_options != nullptr) {
     config->set_seed(execution_options->seed());
     config->set_debug_options(execution_options->debug_options());
-    config->enable_hlo_profiling(
-        execution_options->debug_options().xla_hlo_profile());
   } else {
     config->set_debug_options(legacy_flags::GetDebugOptionsFromFlags());
   }
@@ -490,7 +495,8 @@ StatusOr<std::shared_ptr<Executable>> Service::BuildAndCacheExecutable(
 StatusOr<std::vector<GlobalDataHandle>>
 Service::ExecuteParallelAndRegisterResult(
     tensorflow::gtl::ArraySlice<Executable*> executables,
-    tensorflow::gtl::ArraySlice<std::vector<const ShapedBuffer*>> arguments,
+    tensorflow::gtl::ArraySlice<std::vector<std::vector<const ShapedBuffer*>>>
+        arguments,
     Backend* backend, tensorflow::gtl::ArraySlice<DeviceHandle> device_handles,
     tensorflow::gtl::ArraySlice<string> result_tags,
     ExecutionProfile* profile) {
@@ -513,6 +519,8 @@ Service::ExecuteParallelAndRegisterResult(
   for (int64 i = 0; i < executables.size(); i++) {
     // Stream executors for the replicas of the current computation.
     TF_ASSIGN_OR_RETURN(auto replicas, Replicas(*backend, device_handles[i]));
+    CHECK_EQ(replicas.size(), arguments[i].size());
+    std::vector<std::unique_ptr<ShapedBuffer>> result_buffers;
     for (int64 replica = 0; replica < replicas.size(); ++replica) {
       TF_ASSIGN_OR_RETURN(Pool<se::Stream>::SmartPtr stream,
                           backend->BorrowStream(replicas[replica]));
@@ -545,23 +553,20 @@ Service::ExecuteParallelAndRegisterResult(
                                               backend->StreamBorrower());
 
       // Asynchronously launch the computation.
-      TF_ASSIGN_OR_RETURN(
-          std::unique_ptr<ShapedBuffer> result,
-          executables[i]->ExecuteAsyncOnStream(&run_options, arguments[i]));
+      TF_ASSIGN_OR_RETURN(std::unique_ptr<ShapedBuffer> result,
+                          executables[i]->ExecuteAsyncOnStream(
+                              &run_options, arguments[i][replica]));
 
       if (replica == 0 && profile != nullptr) {
         streams.back()->ThenStopTimer(timers.back().get());
       }
 
-      // All replicas share the same device address for the result allocation,
-      // so only one of the replicas need to register the result handle.
-      if (replica == 0) {
-        TF_ASSIGN_OR_RETURN(
-            GlobalDataHandle handle,
-            allocation_tracker_.Register(std::move(result), result_tags[i]));
-        result_handles.push_back(handle);
-      }
+      result_buffers.emplace_back(std::move(result));
     }
+    TF_ASSIGN_OR_RETURN(GlobalDataHandle handle,
+                        allocation_tracker_.RegisterReplicatedBuffers(
+                            std::move(result_buffers), result_tags[i]));
+    result_handles.push_back(handle);
   }
 
   // Wait for all executions to complete.
@@ -627,9 +632,9 @@ Service::ExecuteParallelAndRegisterResult(
 
 StatusOr<GlobalDataHandle> Service::ExecuteAndRegisterResult(
     Executable* executable,
-    const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
-    Backend* backend, perftools::gputools::StreamExecutor* executor,
-    const string& result_tag, ExecutionProfile* profile) {
+    const tensorflow::gtl::ArraySlice<std::vector<const ShapedBuffer*>>
+        arguments,
+    Backend* backend, const string& result_tag, ExecutionProfile* profile) {
   // Set up streams.
   std::vector<Pool<se::Stream>::SmartPtr> streams;
 
@@ -662,21 +667,26 @@ StatusOr<GlobalDataHandle> Service::ExecuteAndRegisterResult(
                              backend->inter_op_thread_pool());
   }
 
-  std::unique_ptr<ShapedBuffer> result;
   if (options_.number_of_replicas() == 1) {
-    TF_ASSIGN_OR_RETURN(result, executable->ExecuteOnStreamWrapper(
-                                    &run_options[0], profile, arguments));
-  } else {
-    // TODO(b/69985541): Support profiling also on this path.
-    std::vector<tensorflow::gtl::ArraySlice<const ShapedBuffer*>>
-        repeated_arguments(options_.number_of_replicas(), arguments);
-
-    TF_ASSIGN_OR_RETURN(auto results, executable->ExecuteOnStreams(
-                                          run_options, repeated_arguments));
-    TF_RET_CHECK(!results.empty());
-    result = std::move(results[0]);
+    TF_ASSIGN_OR_RETURN(
+        auto result, executable->ExecuteOnStreamWrapper(&run_options[0],
+                                                        profile, arguments[0]));
+    return allocation_tracker_.Register(std::move(result), result_tag);
   }
-  return allocation_tracker_.Register(std::move(result), result_tag);
+
+  // TODO(b/69985541): Support profiling also on this path.
+
+  std::vector<tensorflow::gtl::ArraySlice<const ShapedBuffer*>>
+      replicated_arguments;
+  for (const auto& arg : arguments) {
+    replicated_arguments.emplace_back(arg);
+  }
+
+  TF_ASSIGN_OR_RETURN(auto results, executable->ExecuteOnStreams(
+                                        run_options, replicated_arguments));
+  TF_RET_CHECK(!results.empty());
+  return allocation_tracker_.RegisterReplicatedBuffers(std::move(results),
+                                                       result_tag);
 }
 
 tensorflow::Status Service::SetReturnValue(const SetReturnValueRequest* arg,
@@ -690,7 +700,7 @@ tensorflow::Status Service::ExecuteParallel(const ExecuteParallelRequest* arg,
                                             ExecuteParallelResponse* result) {
   VLOG(1) << "running execute-parallel request: " << arg->ShortDebugString();
 
-  std::vector<std::vector<const ShapedBuffer*>> all_arguments;
+  std::vector<std::vector<std::vector<const ShapedBuffer*>>> all_arguments;
   std::vector<std::vector<perftools::gputools::StreamExecutor*>> all_executors;
   std::vector<VersionedComputationHandle> versioned_handles;
   std::vector<std::unique_ptr<HloModuleConfig>> module_configs;
@@ -718,6 +728,14 @@ tensorflow::Status Service::ExecuteParallel(const ExecuteParallelRequest* arg,
       return FailedPrecondition(
           "device handles must be given to execute parallel computations");
     }
+    if (arg->requests_size() > 1 &&
+        execution_options.device_handles_size() > 1) {
+      return InvalidArgument(
+          "Parallel requests with multiple device handles are not supported. "
+          "Found %d parallel requests, with request %lld containing %d device "
+          "handles.",
+          arg->requests_size(), i, execution_options.device_handles_size());
+    }
     std::vector<perftools::gputools::StreamExecutor*> executors;
     for (const auto& device_handle : execution_options.device_handles()) {
       TF_ASSIGN_OR_RETURN(auto replicas,
@@ -747,22 +765,26 @@ tensorflow::Status Service::ExecuteParallel(const ExecuteParallelRequest* arg,
     // In the case of partitioned computations, assume all arguments go on the
     // zeroth core.
     TF_ASSIGN_OR_RETURN(
-        std::vector<const ShapedBuffer*> arguments,
-        ResolveAndValidateArguments(request.arguments(),
-                                    executors[0]->device_ordinal()));
+        auto replicas,
+        Replicas(*execute_backend_, execution_options.device_handles(0)));
+    TF_ASSIGN_OR_RETURN(
+        std::vector<std::vector<const ShapedBuffer*>> replicated_arguments,
+        ResolveAndValidateArguments(request.arguments(), replicas));
 
     // Create an HloModuleConfig object for the computation, given the shape of
-    // the program and the argument allocations.
+    // the program and the argument allocations. Here, we care only about the
+    // shapes of the arguments, so it is sufficient to use the arguments of
+    // replica 0.
     TF_ASSIGN_OR_RETURN(
         std::unique_ptr<HloModuleConfig> module_config,
-        CreateModuleConfig(*program_shape, arguments,
+        CreateModuleConfig(*program_shape, replicated_arguments.front(),
                            request.execution_options(), *user_computation));
     VLOG(3) << "ExecuteParallel created HloModuleConfig computation layout: "
             << module_config->entry_computation_layout().ToString();
 
     // Adds to the vectors to build and execute the computations after the loop.
-    all_arguments.push_back(arguments);
-    all_arguments.insert(all_arguments.end(), executors.size() - 1, {});
+    all_arguments.push_back(replicated_arguments);
+    all_arguments.insert(all_arguments.end(), executors.size() - 1, {{}});
     versioned_handles.push_back(versioned_handle);
     module_configs.push_back(std::move(module_config));
     computation_names.insert(computation_names.end(), executors.size(),
@@ -861,15 +883,18 @@ tensorflow::Status Service::Execute(const ExecuteRequest* arg,
       std::shared_ptr<const ProgramShape> program_shape,
       user_computation->ComputeProgramShape(versioned_handle.version));
 
+  TF_ASSIGN_OR_RETURN(auto replicas, Replicas(*execute_backend_,
+                                              SingleComputationDeviceHandle()));
   TF_ASSIGN_OR_RETURN(
-      std::vector<const ShapedBuffer*> arguments,
-      ResolveAndValidateArguments(arg->arguments(),
-                                  execute_backend_->default_device_ordinal()));
+      std::vector<std::vector<const ShapedBuffer*>> replicated_arguments,
+      ResolveAndValidateArguments(arg->arguments(), replicas));
 
+  // Since we care only about the shapes of the arguments, it is sufficient to
+  // use the arguments of replica 0.
   TF_ASSIGN_OR_RETURN(
       std::unique_ptr<HloModuleConfig> module_config,
-      CreateModuleConfig(*program_shape, arguments, arg->execution_options(),
-                         *user_computation));
+      CreateModuleConfig(*program_shape, replicated_arguments.front(),
+                         arg->execution_options(), *user_computation));
 
   VLOG(3) << "Execute created HloModuleConfig computation layout: "
           << module_config->entry_computation_layout().ToString();
@@ -885,20 +910,21 @@ tensorflow::Status Service::Execute(const ExecuteRequest* arg,
     executable->session_module()->set_execution_platform(
         execute_backend_->platform()->Name());
     TF_RETURN_IF_ERROR(RecordArguments(
-        arguments, execute_backend_->default_stream_executor(),
+        replicated_arguments.front(),
+        execute_backend_->default_stream_executor(),
         execute_backend_->transfer_manager(), executable->session_module()));
   }
 
   TF_ASSIGN_OR_RETURN(
       *result->mutable_output(),
       ExecuteAndRegisterResult(
-          executable.get(), arguments, execute_backend_.get(),
-          execute_backend_->default_stream_executor(),
+          executable.get(), replicated_arguments, execute_backend_.get(),
           "result of " + user_computation->name(), result->mutable_profile()));
 
   if (executable->dumping()) {
-    TF_ASSIGN_OR_RETURN(const ShapedBuffer* result_buffer,
-                        allocation_tracker_.Resolve(result->output()));
+    TF_ASSIGN_OR_RETURN(
+        const ShapedBuffer* result_buffer,
+        allocation_tracker_.ResolveForReplica(result->output(), 0));
     TF_RETURN_IF_ERROR(RecordResult(
         *result_buffer, execute_backend_->default_stream_executor(),
         execute_backend_->transfer_manager(), executable->session_module()));
@@ -909,6 +935,11 @@ tensorflow::Status Service::Execute(const ExecuteRequest* arg,
   return tensorflow::Status::OK();
 }
 
+tensorflow::Status Service::ExecuteGraph(const ExecuteGraphRequest* /*arg*/,
+                                         ExecuteResponse* /*result*/) {
+  return Unimplemented("execute-graph is not yet implemented");
+}
+
 tensorflow::Status Service::ExecuteAsync(const ExecuteAsyncRequest* arg,
                                          ExecuteAsyncResponse* result) {
   VLOG(1) << "running execute-async request: " << arg->ShortDebugString();
@@ -926,15 +957,17 @@ tensorflow::Status Service::ExecuteAsync(const ExecuteAsyncRequest* arg,
       std::shared_ptr<const ProgramShape> program_shape,
       user_computation->ComputeProgramShape(versioned_handle.version));
 
+  TF_ASSIGN_OR_RETURN(auto replicas, Replicas(*execute_backend_,
+                                              SingleComputationDeviceHandle()));
+  TF_RET_CHECK(!replicas.empty());
   TF_ASSIGN_OR_RETURN(
-      std::vector<const ShapedBuffer*> arguments,
-      ResolveAndValidateArguments(arg->arguments(),
-                                  execute_backend_->default_device_ordinal()));
+      std::vector<std::vector<const ShapedBuffer*>> replicated_arguments,
+      ResolveAndValidateArguments(arg->arguments(), replicas));
 
   TF_ASSIGN_OR_RETURN(
       std::unique_ptr<HloModuleConfig> module_config,
-      CreateModuleConfig(*program_shape, arguments, arg->execution_options(),
-                         *user_computation));
+      CreateModuleConfig(*program_shape, replicated_arguments.front(),
+                         arg->execution_options(), *user_computation));
 
   VLOG(3) << "ExecuteAsync created HloModuleConfig computation layout: "
           << module_config->entry_computation_layout().ToString();
@@ -947,21 +980,17 @@ tensorflow::Status Service::ExecuteAsync(const ExecuteAsyncRequest* arg,
           versioned_handle, std::move(module_config), execute_backend_.get(),
           execute_backend_->default_stream_executor(), &profile));
 
-  TF_ASSIGN_OR_RETURN(auto replicas, Replicas(*execute_backend_,
-                                              SingleComputationDeviceHandle()));
-  TF_RET_CHECK(!replicas.empty());
-
   // Set up streams.
   std::vector<Pool<se::Stream>::SmartPtr> streams;
-
   for (se::StreamExecutor* executor : replicas) {
     TF_ASSIGN_OR_RETURN(Pool<se::Stream>::SmartPtr stream,
                         execute_backend_->BorrowStream(executor));
     streams.push_back(std::move(stream));
   }
 
-  std::unique_ptr<ShapedBuffer> result_buffer;
-  for (const Pool<se::Stream>::SmartPtr& stream : streams) {
+  std::vector<std::unique_ptr<ShapedBuffer>> result_buffers;
+  for (size_t i = 0; i < streams.size(); ++i) {
+    const auto& stream = streams[i];
     ExecutableRunOptions options;
     options.set_stream(stream.get());
     options.set_allocator(execute_backend_->memory_allocator());
@@ -972,20 +1001,17 @@ tensorflow::Status Service::ExecuteAsync(const ExecuteAsyncRequest* arg,
     ServiceExecutableRunOptions service_options(
         options, execute_backend_->StreamBorrower());
 
-    TF_ASSIGN_OR_RETURN(
-        std::unique_ptr<ShapedBuffer> this_result_buffer,
-        executable->ExecuteAsyncOnStream(&service_options, arguments));
+    TF_ASSIGN_OR_RETURN(std::unique_ptr<ShapedBuffer> this_result_buffer,
+                        executable->ExecuteAsyncOnStream(
+                            &service_options, replicated_arguments[i]));
 
-    // Take the first result.
-    if (result_buffer == nullptr) {
-      result_buffer = std::move(this_result_buffer);
-    }
+    result_buffers.emplace_back(std::move(this_result_buffer));
   }
 
   TF_ASSIGN_OR_RETURN(
       GlobalDataHandle output,
-      allocation_tracker_.Register(std::move(result_buffer),
-                                   "result of " + user_computation->name()));
+      allocation_tracker_.RegisterReplicatedBuffers(
+          std::move(result_buffers), "result of " + user_computation->name()));
 
   *result->mutable_execution() = execution_tracker_.Register(
       execute_backend_.get(), std::move(streams), profile, output);
@@ -1013,7 +1039,7 @@ tensorflow::Status Service::WaitForExecution(const WaitForExecutionRequest* arg,
 tensorflow::Status Service::TransferToClient(const TransferToClientRequest* arg,
                                              TransferToClientResponse* result) {
   TF_ASSIGN_OR_RETURN(const ShapedBuffer* shaped_buffer,
-                      allocation_tracker_.Resolve(arg->data()));
+                      allocation_tracker_.ResolveForReplica(arg->data(), 0));
 
   const Shape* return_shape;
   if (arg->has_shape_with_layout()) {
@@ -1074,37 +1100,24 @@ tensorflow::Status Service::TransferToServer(const TransferToServerRequest* arg,
         replicas, Replicas(*execute_backend_, SingleComputationDeviceHandle()));
   }
 
-  // All memory allocation is done on the first replica. The allocations in all
-  // other replicas mirror the firsts'.
-  int master_device_ordinal = replicas[0]->device_ordinal();
-  TF_ASSIGN_OR_RETURN(
-      std::unique_ptr<ShapedBuffer> shaped_buffer,
-      execute_backend_->transfer_manager()->AllocateShapedBuffer(
-          shape, execute_backend_->memory_allocator(), master_device_ordinal));
-
-  // Transfer the data to the replicas.
+  // Allocate memory in each replica and transfer the data to all replicas.
+  std::vector<std::unique_ptr<ShapedBuffer>> replicated_buffers;
   for (se::StreamExecutor* executor : replicas) {
-    if (executor->device_ordinal() == master_device_ordinal) {
-      TF_RETURN_IF_ERROR(
-          execute_backend_->transfer_manager()->TransferLiteralToDevice(
-              executor, *literal, *shaped_buffer));
-    } else {
-      // The replica is not the master. Create an cloned shaped buffer with
-      // the replica's device ordinal. This is required because
-      // TransferLiteralToDevice verifies that the device ordinal of the shaped
-      // buffer matches that of the executor.
-      std::unique_ptr<ShapedBuffer> clone =
-          CloneShapedBufferOnDevice(*shaped_buffer, executor->device_ordinal());
-      TF_RETURN_IF_ERROR(
-          execute_backend_->transfer_manager()->TransferLiteralToDevice(
-              executor, *literal, *clone));
-    }
+    TF_ASSIGN_OR_RETURN(
+        std::unique_ptr<ShapedBuffer> shaped_buffer,
+        execute_backend_->transfer_manager()->AllocateShapedBuffer(
+            shape, execute_backend_->memory_allocator(),
+            executor->device_ordinal()));
+    TF_RETURN_IF_ERROR(
+        execute_backend_->transfer_manager()->TransferLiteralToDevice(
+            executor, *literal, *shaped_buffer));
+    replicated_buffers.emplace_back(std::move(shaped_buffer));
   }
-  TF_ASSIGN_OR_RETURN(
-      *result->mutable_data(),
-      allocation_tracker_.Register(std::move(shaped_buffer),
-                                   StrCat("TransferToServer literal of shape ",
-                                          ShapeUtil::HumanString(shape))));
+  TF_ASSIGN_OR_RETURN(*result->mutable_data(),
+                      allocation_tracker_.RegisterReplicatedBuffers(
+                          std::move(replicated_buffers),
+                          StrCat("TransferToServer literal of shape ",
+                                 ShapeUtil::HumanString(shape))));
 
   return tensorflow::Status::OK();
 }
@@ -1287,7 +1300,7 @@ tensorflow::Status Service::ComputeConstant(const ComputeConstantRequest* arg,
 tensorflow::Status Service::GetShape(const GetShapeRequest* arg,
                                      GetShapeResponse* result) {
   TF_ASSIGN_OR_RETURN(const ShapedBuffer* buffer,
-                      allocation_tracker_.Resolve(arg->data()));
+                      allocation_tracker_.ResolveForReplica(arg->data(), 0));
   *result->mutable_shape() = buffer->on_host_shape();
   return tensorflow::Status::OK();
 }
diff --git a/tensorflow/compiler/xla/service/service.h b/tensorflow/compiler/xla/service/service.h
index 6ce241971156599aaa25aea1b0caac0e1bd5379c..96352d9096e6aeeb33f84c7b6fc42c28820e5e84 100644
--- a/tensorflow/compiler/xla/service/service.h
+++ b/tensorflow/compiler/xla/service/service.h
@@ -112,6 +112,12 @@ class Service : public ServiceInterface {
   tensorflow::Status Execute(const ExecuteRequest* arg,
                              ExecuteResponse* result) override;
 
+  // Executes a computation with the provided global data passed as
+  // immutable arguments. The request contains the whole computation graph.
+  // Returns global data output and execution timing.
+  tensorflow::Status ExecuteGraph(const ExecuteGraphRequest* arg,
+                                  ExecuteResponse* result) override;
+
   // Executes one or more computations in parallel with the provided global data
   // passed as immutable arguments. Returns global data output for each
   // computation.
@@ -265,11 +271,14 @@ class Service : public ServiceInterface {
   static StatusOr<std::unique_ptr<Backend>> CreateComputeConstantBackend();
 
   // Resolves the given argument handles in the allocation tracker and returns
-  // the corresponding allocations. The function also verifies that each
-  // allocation matches the execution platform and device ordinal.
-  StatusOr<std::vector<const ShapedBuffer*>> ResolveAndValidateArguments(
+  // the corresponding allocations for every replica. The function also verifies
+  // that each allocation matches the execution platform and device ordinal of
+  // the corresponding replica.
+  StatusOr<std::vector<std::vector<const ShapedBuffer*>>>
+  ResolveAndValidateArguments(
       tensorflow::gtl::ArraySlice<const GlobalDataHandle*> arguments,
-      int device_ordinal);
+      tensorflow::gtl::ArraySlice<perftools::gputools::StreamExecutor*>
+          stream_executors);
 
   // Create a Hlo module config for the given program shape and arguments.
   // execution_options is optional; if not given a default is used.
@@ -314,16 +323,17 @@ class Service : public ServiceInterface {
   // ExecutionProfile object which will be filled in with profile data.
   StatusOr<GlobalDataHandle> ExecuteAndRegisterResult(
       Executable* executable,
-      const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
-      Backend* backend, perftools::gputools::StreamExecutor* executor,
-      const string& result_tag, ExecutionProfile* profile);
+      const tensorflow::gtl::ArraySlice<std::vector<const ShapedBuffer*>>
+          arguments,
+      Backend* backend, const string& result_tag, ExecutionProfile* profile);
 
   // Runs the given executables with the given arguments and register the result
   // from each executable in the allocation tracker. The handles of the result
   // from the tracker are returned.
   StatusOr<std::vector<GlobalDataHandle>> ExecuteParallelAndRegisterResult(
       tensorflow::gtl::ArraySlice<Executable*> executables,
-      tensorflow::gtl::ArraySlice<std::vector<const ShapedBuffer*>> arguments,
+      tensorflow::gtl::ArraySlice<std::vector<std::vector<const ShapedBuffer*>>>
+          arguments,
       Backend* backend,
       tensorflow::gtl::ArraySlice<DeviceHandle> device_handles,
       tensorflow::gtl::ArraySlice<string> result_tags,
diff --git a/tensorflow/compiler/xla/service/shape_inference.cc b/tensorflow/compiler/xla/service/shape_inference.cc
index c9692757b27980b10a5ca562223c3d0f6462d820..8c8bd6d73ad41db7d609ac91c7bdfc4703f364e1 100644
--- a/tensorflow/compiler/xla/service/shape_inference.cc
+++ b/tensorflow/compiler/xla/service/shape_inference.cc
@@ -169,11 +169,11 @@ bool AllUnique(tensorflow::gtl::ArraySlice<int64> slice) {
 tensorflow::Status ExpectNotTupleOrOpaque(const Shape& shape,
                                           tensorflow::StringPiece op_type) {
   if (ShapeUtil::IsTuple(shape)) {
-    return InvalidArgument("Expected non-tuple argument for %s. Got: %s",
+    return InvalidArgument("Expected non-tuple argument for %s, but got %s.",
                            op_type.ToString().c_str(),
                            ShapeUtil::HumanString(shape).c_str());
   } else if (ShapeUtil::IsOpaque(shape)) {
-    return InvalidArgument("Expected non-opaque argument for %s. Got: %s",
+    return InvalidArgument("Expected non-opaque argument for %s, but got %s.",
                            op_type.ToString().c_str(),
                            ShapeUtil::HumanString(shape).c_str());
   } else {
@@ -193,8 +193,10 @@ tensorflow::Status VerifyReducerShape(const ProgramShape& reducer_shape,
 
   const Shape& accumulator_shape = reducer_shape.result();
   if (ShapeUtil::Rank(accumulator_shape) != 0) {
-    return Unimplemented(
-        "Reduction function currently must have rank-0 result.");
+    return InvalidArgument(
+        "Reduction function must have rank 0 (rank %lld reduction function "
+        "given).",
+        ShapeUtil::Rank(accumulator_shape));
   }
 
   // Check that the accumulator can be passed in as the first argument.
@@ -235,8 +237,8 @@ tensorflow::Status VerifyReducerShape(const ProgramShape& reducer_shape,
   if (!ShapeUtil::CompatibleIgnoringFpPrecision(accumulator_shape,
                                                 reducer_shape.parameters(1))) {
     return InvalidArgument(
-        "Reduction function's second parameter shape currently must "
-        "match the result shape. Got %s vs %s",
+        "Reduction function's second parameter shape must "
+        "match the result shape, but got %s vs %s.",
         ShapeUtil::HumanString(reducer_shape.parameters(1)).c_str(),
         ShapeUtil::HumanString(accumulator_shape).c_str());
   }
@@ -258,29 +260,29 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
   for (int64 i = 0; i < window.dimensions_size(); ++i) {
     const auto& dim = window.dimensions(i);
     if (dim.size() <= 0) {
-      return InvalidArgument("Window has a non-positive dimension. Window: %s",
+      return InvalidArgument("Window %s has a non-positive dimension.",
                              window.DebugString().c_str());
     }
     if (dim.stride() <= 0) {
-      return InvalidArgument("Window has a non-positive stride. Window: %s",
+      return InvalidArgument("Window %s has a non-positive stride.",
                              window.DebugString().c_str());
     }
     if (!allow_negative_padding && dim.padding_low() < 0) {
-      return InvalidArgument("Window has a negative low padding. Window: %s",
+      return InvalidArgument("Window %s has a negative low padding.",
                              window.DebugString().c_str());
     }
     if (!allow_negative_padding && dim.padding_high() < 0) {
-      return InvalidArgument("Window has a negative high padding. Window: %s",
+      return InvalidArgument("Window %s has a negative high padding.",
                              window.DebugString().c_str());
     }
     if (dim.base_dilation() < 1) {
       return InvalidArgument(
-          "Window has a non-positive base area dilation factor. Window: %s",
+          "Window %s has a non-positive base area dilation factor.",
           window.DebugString().c_str());
     }
     if (dim.window_dilation() < 1) {
       return InvalidArgument(
-          "Window has a non-positive window dilation factor. Window: %s",
+          "Window %s has a non-positive window dilation factor.",
           window.DebugString().c_str());
     }
 
@@ -320,8 +322,8 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     case UNOP_CEIL:
       if (!ShapeUtil::ElementIsFloating(arg)) {
         return InvalidArgument(
-            "expected element type in shape to be floating for floor/ceil "
-            "operation; got %s",
+            "Expected element type in shape to be floating for floor/ceil "
+            "operation; got %s.",
             PrimitiveType_Name(arg.element_type()).c_str());
       }
       return arg;
@@ -333,8 +335,8 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
       if (!ShapeUtil::ElementIsFloating(arg) &&
           !ShapeUtil::ElementIsComplex(arg)) {
         return InvalidArgument(
-            "expected element type in shape to be floating or complex for "
-            "sin/cos/exp/log/tanh operation; got %s",
+            "Expected element type in shape to be floating or complex for "
+            "sin/cos/exp/log/tanh operation; got %s.",
             PrimitiveType_Name(arg.element_type()).c_str());
       }
       return arg;
@@ -342,8 +344,8 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     case UNOP_IMAG:
       if (!ShapeUtil::ElementIsComplex(arg)) {
         return InvalidArgument(
-            "expected element type in shape to be complex for real/imag "
-            "operation; got %s",
+            "Expected element type in shape to be complex for real/imag "
+            "operation; got %s.",
             PrimitiveType_Name(arg.element_type()).c_str());
       }
       return ShapeUtil::ChangeElementType(arg, F32);
@@ -363,8 +365,8 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
       if (arg.element_type() != PRED &&
           !primitive_util::IsIntegralType(arg.element_type())) {
         return InvalidArgument(
-            "expected pred or an integral element type in argument to not "
-            "operation; got %s",
+            "Expected pred or an integral element type in argument to Not "
+            "operation; got %s.",
             PrimitiveType_Name(arg.element_type()).c_str());
       }
       return arg;
@@ -372,8 +374,8 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     case UNOP_IS_FINITE:
       if (!ShapeUtil::ElementIsFloating(arg)) {
         return InvalidArgument(
-            "expected element type in shape to be floating point for IsFinite "
-            "operation; got %s",
+            "Expected element type in shape to be floating point for IsFinite "
+            "operation; got %s.",
             PrimitiveType_Name(arg.element_type()).c_str());
       }
       return ShapeUtil::ChangeElementType(arg, PRED);
@@ -389,10 +391,10 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     tensorflow::gtl::ArraySlice<const Shape*> arg_shapes,
     const int64 dimension) {
   if (arg_shapes.empty()) {
-    return InvalidArgument("Concatenate expects at least one argument");
+    return InvalidArgument("Concatenate expects at least one argument.");
   }
   if (dimension < 0 || dimension >= ShapeUtil::Rank(*arg_shapes[0])) {
-    return InvalidArgument("dimension to concatenate along out of bounds: %lld",
+    return InvalidArgument("Concatenate dimension out of bounds: %lld.",
                            dimension);
   }
   const Shape* arg_shape = nullptr;
@@ -408,14 +410,14 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     if (ShapeUtil::Rank(*arg_shape) != ShapeUtil::Rank(*shape)) {
       return InvalidArgument(
           "Cannot concatenate arrays with different ranks: %lld (%s) vs %lld "
-          "(%s)",
+          "(%s).",
           ShapeUtil::Rank(*arg_shape),
           ShapeUtil::HumanString(*arg_shape).c_str(), ShapeUtil::Rank(*shape),
           ShapeUtil::HumanString(*shape).c_str());
     }
     if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(*arg_shape, *shape)) {
       return InvalidArgument(
-          "cannot concatenate arrays with different element types: %s vs %s",
+          "Cannot concatenate arrays with different element types: %s vs %s.",
           PrimitiveType_Name(arg_shape->element_type()).c_str(),
           PrimitiveType_Name(shape->element_type()).c_str());
     }
@@ -428,9 +430,9 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
                      // concatenating.
         }
         return InvalidArgument(
-            "cannot concatenate arrays that differ in dimensions other than "
+            "Cannot concatenate arrays that differ in dimensions other than "
             "the one being concatenated (the other array dimensions must be "
-            "the same): %s vs %s in dimension %lld",
+            "the same): %s vs %s in dimension %lld.",
             ShapeUtil::HumanString(*arg_shape).c_str(),
             ShapeUtil::HumanString(*shape).c_str(), dimension);
       }
@@ -452,7 +454,7 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
   if (primitive_util::IsComplexType(old_element_type) &&
       !primitive_util::IsComplexType(new_element_type)) {
     return Unimplemented(
-        "Unsupported conversion from complex to real type: %s => %s",
+        "Conversion from complex to real type %s => %s is not implemented.",
         ShapeUtil::HumanString(operand_shape).c_str(),
         PrimitiveType_Name(new_element_type).c_str());
   }
@@ -461,7 +463,7 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     // future, by recursing into the tuple elements to check all sub-conversions
     // are valid. For now we just reject them, though.
     return InvalidArgument(
-        "cannot convert from or to tuple type; requested conversion: %s => %s",
+        "Convert does not allow tuples, so cannot convert from %s to %s.",
         ShapeUtil::HumanString(operand_shape).c_str(),
         PrimitiveType_Name(new_element_type).c_str());
   }
@@ -474,24 +476,23 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
   auto old_element_type = operand_shape.element_type();
   if (primitive_util::IsComplexType(old_element_type) !=
       primitive_util::IsComplexType(new_element_type)) {
-    return Unimplemented(
-        "Unsupported conversion between real and complex types: %s => %s",
-        ShapeUtil::HumanString(operand_shape).c_str(),
-        PrimitiveType_Name(new_element_type).c_str());
+    return InvalidArgument("Conversion between real and complex type %s => %s.",
+                           ShapeUtil::HumanString(operand_shape).c_str(),
+                           PrimitiveType_Name(new_element_type).c_str());
   }
   if (ShapeUtil::IsTuple(operand_shape) || new_element_type == TUPLE) {
     // Note: we may want to support tuple conversions via this operation in the
     // future, by recursing into the tuple elements to check all sub-conversions
     // are valid. For now we just reject them, though.
     return InvalidArgument(
-        "cannot convert from or to tuple type; requested conversion: %s => %s",
+        "Cannot convert from or to tuple type; requested conversion: %s => %s.",
         ShapeUtil::HumanString(operand_shape).c_str(),
         PrimitiveType_Name(new_element_type).c_str());
   }
   if (primitive_util::BitWidth(old_element_type) !=
       primitive_util::BitWidth(new_element_type)) {
     return InvalidArgument(
-        "cannot bitcast types with different bit-widths: %s => %s",
+        "Cannot bitcast types with different bit-widths: %s => %s.",
         PrimitiveType_Name(old_element_type).c_str(),
         PrimitiveType_Name(new_element_type).c_str());
   }
@@ -504,20 +505,20 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     const int mantissa_bits) {
   if (!ShapeUtil::ElementIsFloating(operand_shape)) {
     return InvalidArgument(
-        "expected element type in shape to be floating point for "
-        "ReducePrecision operation; got %s",
+        "Expected element type in shape to be floating point for "
+        "ReducePrecision operation; got %s.",
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
   if (exponent_bits < 1) {
     // One exponent bit is necessary to distinguish 0 from infinity.  Having
     // no exponent bits doesn't produce a sensible number, so we require at
     // least one.
-    return InvalidArgument("expected exponent_bits >= 1; got %d",
+    return InvalidArgument("Expected exponent_bits >= 1; got %d.",
                            exponent_bits);
   }
   if (mantissa_bits < 0) {
     // A number with no mantissa bits is still meaningful, however.
-    return InvalidArgument("expected non-negative mantissa_bits; got %d",
+    return InvalidArgument("Expected non-negative mantissa_bits; got %d.",
                            mantissa_bits);
   }
   return operand_shape;
@@ -528,23 +529,23 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     const PaddingConfig& padding_config) {
   if (ShapeUtil::IsTuple(operand_shape)) {
     return InvalidArgument(
-        "pad operation does not support tuple-shape operands");
+        "Pad operation does not support tuple-shape operands.");
   }
   if (!ShapeUtil::IsScalar(padding_value_shape)) {
     return InvalidArgument(
-        "pad operation does not support non-scalar padding values");
+        "Pad operation does not support non-scalar padding values.");
   }
   if (ShapeUtil::Rank(operand_shape) != padding_config.dimensions_size()) {
     return InvalidArgument(
         "The rank of the operand and the padding configuration do not match: "
-        "%s vs %s",
+        "%s vs %s.",
         ShapeUtil::HumanString(operand_shape).c_str(),
         padding_config.ShortDebugString().c_str());
   }
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(operand_shape,
                                                      padding_value_shape)) {
     return InvalidArgument(
-        "the element types of the operands to pad do not match");
+        "The element types of the operands to Pad do not match.");
   }
   std::vector<int64> dimensions(ShapeUtil::Rank(operand_shape));
   for (int64 i = 0; i < operand_shape.dimensions_size(); ++i) {
@@ -605,7 +606,7 @@ Status ValidateDotDimensionNumbers(
                      lhs_batch_dimensions) ||
       !dims_in_range(ShapeUtil::Rank(rhs), rhs_contracting_dimensions,
                      rhs_batch_dimensions)) {
-    return InvalidArgument("A dimension number is out of range in dot: %s",
+    return InvalidArgument("A dimension number is out of range in Dot: %s.",
                            dimension_numbers.DebugString().c_str());
   }
 
@@ -623,7 +624,7 @@ Status ValidateDotDimensionNumbers(
 
   if (!dims_unique(lhs_contracting_dimensions, lhs_batch_dimensions) ||
       !dims_unique(rhs_contracting_dimensions, rhs_batch_dimensions)) {
-    return InvalidArgument("A dimension number is not unique in dot: %s",
+    return InvalidArgument("A dimension number is not unique in Dot: %s.",
                            dimension_numbers.DebugString().c_str());
   }
 
@@ -641,8 +642,7 @@ Status ValidateDotDimensionNumbers(
       rhs_non_contracting_non_batch_dims < 0 ||
       rhs_non_contracting_non_batch_dims > 1) {
     return InvalidArgument(
-        "batch and contracting dimension number mismatch "
-        "with rank ");
+        "Batch and contracting dimension number mismatch with rank.");
   }
 
   // Check that batch dimension numbers are ordered before all others, and
@@ -654,7 +654,7 @@ Status ValidateDotDimensionNumbers(
       !std::equal(batch_dim_numbers.begin(), batch_dim_numbers.end(),
                   rhs_batch_dimensions.begin())) {
     return InvalidArgument(
-        "batch dimension numbers must precede non-batch dimensions and be"
+        "Batch dimension numbers must precede non-batch dimensions and be "
         "monotonically increasing.");
   }
 
@@ -671,22 +671,22 @@ Status ValidateDotDimensionNumbers(
 
   auto fail = [lhs, rhs](const string& addendum) -> Status {
     string message = tensorflow::strings::Printf(
-        "cannot infer shape for dot operation: %s <dot> %s",
+        "Cannot infer shape for dot operation: %s <dot> %s.",
         ShapeUtil::HumanString(lhs).c_str(),
         ShapeUtil::HumanString(rhs).c_str());
     if (!addendum.empty()) {
-      message += ": " + addendum;
+      message += " " + addendum;
     }
     return InvalidArgument("%s", message.c_str());
   };
 
   // Check if both element types are the same.
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(lhs, rhs)) {
-    return fail("element types do not match");
+    return fail("Element types do not match.");
   }
 
   if ((ShapeUtil::Rank(lhs) < 1) || (ShapeUtil::Rank(rhs) < 1)) {
-    return fail("dot only supports rank 1 or above.");
+    return fail("Dot only supports rank 1 or above.");
   }
 
   // Validate basic properties of dot dimension numbers.
@@ -696,7 +696,7 @@ Status ValidateDotDimensionNumbers(
   if (dimension_numbers.lhs_contracting_dimensions_size() !=
           dimension_numbers.rhs_contracting_dimensions_size() ||
       dimension_numbers.lhs_contracting_dimensions_size() != 1) {
-    return fail("must specify one contracting dimension for both lhs and rhs.");
+    return fail("Must specify one contracting dimension for both lhs and rhs.");
   }
 
   // Check that contracting dimension sizes match.
@@ -706,13 +706,13 @@ Status ValidateDotDimensionNumbers(
       dimension_numbers.rhs_contracting_dimensions(0);
   if (lhs.dimensions(lhs_contracting_dimension) !=
       rhs.dimensions(rhs_contracting_dimension)) {
-    return fail("contracting dimension sizes do not match.");
+    return fail("Contracting dimension sizes do not match.");
   }
 
   // Check that number of batch dimensions match.
   if (dimension_numbers.lhs_batch_dimensions_size() !=
       dimension_numbers.rhs_batch_dimensions_size()) {
-    return fail("must the same number of batch dimensions for lhs and rhs.");
+    return fail("Lhs and rhs must have the same number of batch dimensions.");
   }
 
   // Check that batch dimension numbers and sizes match.
@@ -721,7 +721,7 @@ Status ValidateDotDimensionNumbers(
             dimension_numbers.rhs_batch_dimensions(i) ||
         lhs.dimensions(dimension_numbers.lhs_batch_dimensions(i)) !=
             rhs.dimensions(dimension_numbers.rhs_batch_dimensions(i))) {
-      return fail("batch dimension numbers and sizes must match for lhs/rhs.");
+      return fail("Batch dimension numbers and sizes must match for lhs/rhs.");
     }
   }
 
@@ -770,10 +770,11 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     } else if (rhs.dimensions(i) == 1) {
       output_dimensions[i] = lhs.dimensions(i);
     } else {
-      return InvalidArgument("binary op %s with incompatible shapes: %s and %s",
-                             BinaryOperation_Name(operation).c_str(),
-                             ShapeUtil::HumanString(lhs).c_str(),
-                             ShapeUtil::HumanString(rhs).c_str());
+      return InvalidArgument(
+          "Binary op %s with incompatible shapes: %s and %s.",
+          BinaryOperation_Name(operation).c_str(),
+          ShapeUtil::HumanString(lhs).c_str(),
+          ShapeUtil::HumanString(rhs).c_str());
     }
   }
   return ShapeUtil::MakeShape(ShapeUtil::HigherPrecisionElementType(lhs, rhs),
@@ -788,15 +789,15 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     // Reject "magic" inference for binops on different shapes, requiring
     // the user to provide an explicit broadcast dimension in this case.
     // See b/25177275 for more details.
-    return InvalidArgument("automatic shape inference not supported: %s and %s",
+    return InvalidArgument("Automatic shape inference not supported: %s and %s",
                            ShapeUtil::HumanString(smaller_shape).c_str(),
                            ShapeUtil::HumanString(larger_shape).c_str());
   } else if (broadcast_dimensions.size() != ShapeUtil::Rank(smaller_shape)) {
     return InvalidArgument(
-        "size of broadcast_dimensions has to match lower-rank operand's "
+        "Size of broadcast_dimensions has to match lower-rank operand's "
         "rank; "
         " lower-rank operand's rank is %lld, size of broadcast_dimensions is "
-        "%zu",
+        "%zu.",
         ShapeUtil::Rank(smaller_shape), broadcast_dimensions.size());
   }
 
@@ -846,13 +847,13 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     int64 dimension_to_match = broadcast_dimensions.at(i);
     if (dimension_to_match < 0) {
       return InvalidArgument(
-          "broadcast dimension number (%lld) cannot be negative",
+          "Broadcast dimension number (%lld) cannot be negative.",
           dimension_to_match);
     }
     if (dimension_to_match >= larger_shape.dimensions_size()) {
       return InvalidArgument(
-          "broadcast dimension number (%lld) too large; higher-rank "
-          "operand has rank %d",
+          "Broadcast dimension number (%lld) too large; higher-rank "
+          "operand has rank %d.",
           dimension_to_match, larger_shape.dimensions_size());
     }
     int64 small_dimension_size = smaller_shape.dimensions(i);
@@ -863,7 +864,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     if (small_dimension_size != large_dimension_size &&
         small_dimension_size != 1 && large_dimension_size != 1) {
       return InvalidArgument(
-          "broadcast dimension %d mismatch: %lld != %lld; %s and %s", i,
+          "Broadcast dimension %d mismatch: %lld != %lld; %s and %s.", i,
           small_dimension_size, large_dimension_size,
           ShapeUtil::HumanString(smaller_shape).c_str(),
           ShapeUtil::HumanString(larger_shape).c_str());
@@ -872,7 +873,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     // order.
     if (i > 0 && broadcast_dimensions.at(i - 1) >= dimension_to_match) {
       return InvalidArgument(
-          "broadcast dimensions order is wrong: %lld comes after %lld",
+          "Broadcast dimensions order is wrong: %lld comes after %lld.",
           dimension_to_match, broadcast_dimensions.at(i - 1));
     }
 
@@ -892,7 +893,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(lhs, rhs)) {
     return InvalidArgument(
-        "binary op %s with different element types: %s and %s",
+        "Binary op %s with different element types: %s and %s.",
         BinaryOperation_Name(operation).c_str(),
         ShapeUtil::HumanString(lhs).c_str(),
         ShapeUtil::HumanString(rhs).c_str());
@@ -904,8 +905,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     if (!broadcast_dimensions.empty() &&
         broadcast_dimensions != identity_dims) {
       return InvalidArgument(
-          "broadcast dimensions field must either be not set or be the "
-          "identity on binary operations with operands of the same rank");
+          "Broadcast dimensions field must either be not set or be the "
+          "identity on binary operations with operands of the same rank.");
     }
   }
 
@@ -943,6 +944,13 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
                             rhs->shape(), /*broadcast_dimensions=*/{});
 }
 
+/* static */ StatusOr<Shape> ShapeInference::InferBinaryOpShape(
+    HloOpcode opcode, const Shape& lhs, const Shape& rhs,
+    tensorflow::gtl::ArraySlice<int64> broadcast_dimensions) {
+  return InferBinaryOpShape(OpcodeToBinaryOperation(opcode), lhs, rhs,
+                            broadcast_dimensions);
+}
+
 /* static */ StatusOr<Shape> ShapeInference::InferBinaryOpShape(
     BinaryOperation operation, const Shape& lhs, const Shape& rhs,
     tensorflow::gtl::ArraySlice<int64> broadcast_dimensions) {
@@ -979,8 +987,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     case BINOP_COMPLEX: {
       if (!ShapeUtil::ElementIsFloating(lhs)) {
         return InvalidArgument(
-            "expected element type in shape to be floating for complex compose "
-            "operation; got %s",
+            "Expected element type in shape to be floating for complex compose "
+            "operation; got %s.",
             PrimitiveType_Name(lhs.element_type()).c_str());
       }
       TF_ASSIGN_OR_RETURN(const Shape& shape,
@@ -989,7 +997,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       if (lhs.element_type() == F32 && rhs.element_type() == F32) {
         return ShapeUtil::ChangeElementType(shape, C64);
       } else {
-        return Unimplemented("complex component type not supported");
+        return Unimplemented("Complex component type is not implemented.");
       }
     }
     case BINOP_AND:
@@ -997,8 +1005,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       if (lhs.element_type() != PRED &&
           !primitive_util::IsIntegralType(lhs.element_type())) {
         return InvalidArgument(
-            "expected pred or integral type in argument to and/or operation; "
-            "got %s",
+            "Expected pred or integral type in argument to and/or operation; "
+            "got %s.",
             PrimitiveType_Name(lhs.element_type()).c_str());
       }
       return InferElementwiseBinaryOpShape(operation, lhs, rhs,
@@ -1016,7 +1024,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     }
     default:
       return Unimplemented(
-          "not yet implemented; infer binary op shape: %s; lhs: %s; rhs: %s",
+          "Binary op shape inference: %s; lhs: %s; rhs: %s is not implemented.",
           BinaryOperation_Name(operation).c_str(),
           lhs.ShortDebugString().c_str(), rhs.ShortDebugString().c_str());
   }
@@ -1041,7 +1049,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     case TRIOP_SELECT:
       return InferSelectShape(lhs, rhs, ehs);
     default:
-      return InvalidArgument("unknown operation %s",
+      return InvalidArgument("Unknown operation %s.",
                              TernaryOperation_Name(operation).c_str());
   }
 }
@@ -1072,7 +1080,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       return result;
     }
     default:
-      return InvalidArgument("unknown operation %s",
+      return InvalidArgument("Unknown operation %s.",
                              VariadicOperation_Name(operation).c_str());
   }
 }
@@ -1082,7 +1090,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const ProgramShape& to_apply,
     tensorflow::gtl::ArraySlice<int64> dimensions) {
   if (arg_shapes.empty()) {
-    return InvalidArgument("Map expects at least one argument");
+    return InvalidArgument("Map expects at least one argument.");
   }
 
   // All arguments must have the same shape.
@@ -1113,7 +1121,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     }
     return InvalidArgument(
         "Map operation requires all operands to have the same shape; got: "
-        "%s",
+        "%s.",
         Join(pieces, ", ").c_str());
   }
 
@@ -1122,7 +1130,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (dimensions.size() != arg_shape->dimensions_size()) {
     return InvalidArgument(
         "Map applied to a subset of dimensions currently not supported: "
-        "arg_dimension_size: %d, requested_map_dimensions_size: %zu",
+        "arg_dimension_size: %d, requested_map_dimensions_size: %zu.",
         arg_shape->dimensions_size(), dimensions.size());
   }
 
@@ -1130,7 +1138,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   for (int i = 0; i < dimensions.size(); ++i) {
     if (dimensions[i] != i) {
       return InvalidArgument(
-          "Map requires monotonically increasing dimension numbers, found: %s ",
+          "Map requires monotonically increasing dimension numbers; got: %s.",
           Join(dimensions, ", ").c_str());
     }
   }
@@ -1139,7 +1147,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (arg_shapes.size() != to_apply.parameters_size()) {
     return InvalidArgument(
         "Map applied function arity must match number of arguments; got: "
-        "arity: %d, arguments: %zu",
+        "arity: %d, arguments: %zu.",
         to_apply.parameters_size(), arg_shapes.size());
   }
 
@@ -1147,8 +1155,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   const Shape& output_shape = to_apply.result();
   if (!ShapeUtil::IsScalar(output_shape)) {
     return InvalidArgument(
-        "mapped computation's result has to be a scalar; "
-        "got: %s",
+        "Mapped computation's result has to be a scalar; got: %s.",
         ShapeUtil::HumanString(output_shape).c_str());
   }
 
@@ -1157,16 +1164,16 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
     if (!ShapeUtil::IsScalar(parameter_shape)) {
       return InvalidArgument(
-          "mapped computation's parameter has to be a scalar; "
-          "got parameter %d shape: %s",
+          "Mapped computation's parameter has to be a scalar; "
+          "got parameter %d shape: %s.",
           i, ShapeUtil::HumanString(parameter_shape).c_str());
     }
 
     if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(parameter_shape,
                                                        *arg_shape)) {
       return InvalidArgument(
-          "mapped computation's parameter type has to match argument element "
-          "type; got parameter %d shape: %s, argument shape: %s",
+          "Mapped computation's parameter type has to match argument element "
+          "type; got parameter %d shape: %s, argument shape: %s.",
           i, ShapeUtil::HumanString(parameter_shape).c_str(),
           ShapeUtil::HumanString(*arg_shape).c_str());
     }
@@ -1197,21 +1204,21 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Expected feature_index of batch-norm-training to be "
         "smaller than the rank of operand_shape; "
-        "got feature_index %lld, and rank %lld",
+        "got feature_index %lld, and rank %lld.",
         feature_index, ShapeUtil::Rank(operand_shape));
   }
 
   if (feature_index < 0) {
     return InvalidArgument(
         "Expected feature_index of batch-norm-training to "
-        "be a non-negative number, got %lld",
+        "be a non-negative number, got %lld.",
         feature_index);
   }
 
   if (ShapeUtil::Rank(operand_shape) < 1) {
     return InvalidArgument(
         "Expected the rank of operand to "
-        "batch-norm-training to be at least 1; got %lld",
+        "batch-norm-training to be at least 1; got %lld.",
         ShapeUtil::Rank(operand_shape));
   }
 
@@ -1232,7 +1239,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (!ShapeUtil::ElementIsFloating(operand_shape)) {
     return InvalidArgument(
         "The operand to batch-norm-training must have a floating point "
-        "element type, but the shape is %s",
+        "element type, but the shape is %s.",
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
 
@@ -1241,7 +1248,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-training, "
         "but the shape of offset factor is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(offset_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1251,7 +1258,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-training, "
         "but the shape of scale factor is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(scale_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1264,7 +1271,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The size of offset factor should be the same as feature count,"
         "but the size of offset factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(offset_shape, 0), feature_count);
   }
 
@@ -1272,7 +1279,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The size of scale factor should be the same as feature count,"
         "but the size of scale factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(scale_shape, 0), feature_count);
   }
 
@@ -1307,21 +1314,21 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Expected feature_index of batch-norm-inference to be "
         "smaller than the rank of operand_shape; "
-        "got feature_index %lld, and rank %lld",
+        "got feature_index %lld, and rank %lld.",
         feature_index, ShapeUtil::Rank(operand_shape));
   }
 
   if (feature_index < 0) {
     return InvalidArgument(
         "Expected feature_index of batch-norm-inference to "
-        "be a non-negative number, got %lld",
+        "be a non-negative number, got %lld.",
         feature_index);
   }
 
   if (ShapeUtil::Rank(operand_shape) < 1) {
     return InvalidArgument(
         "Expected the rank of operand to "
-        "batch-norm-inference to be at least 1; got %lld",
+        "batch-norm-inference to be at least 1; got %lld.",
         ShapeUtil::Rank(operand_shape));
   }
 
@@ -1342,7 +1349,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (!ShapeUtil::ElementIsFloating(operand_shape)) {
     return InvalidArgument(
         "The operand to batch-norm-inference must have a floating point "
-        "element type, but the shape is %s",
+        "element type, but the shape is %s.",
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
 
@@ -1352,7 +1359,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         "The inputs should have the same element type for "
         "batch-norm-inference, "
         "but the shape of offset factor is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(offset_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1363,7 +1370,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         "The inputs should have the same element type for "
         "batch-norm-inference, "
         "but the shape of scale factor is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(scale_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1374,7 +1381,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         "The inputs should have the same element type for "
         "batch-norm-inference, "
         "but the shape of mean is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(mean_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1385,7 +1392,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         "The inputs should have the same element type for "
         "batch-norm-inference, "
         "but the shape of variance is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(mean_shape.element_type()).c_str(),
         PrimitiveType_Name(variance_shape.element_type()).c_str());
   }
@@ -1398,7 +1405,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The size of offset factor should be the same as feature count,"
         "but the size of offset factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(offset_shape, 0), feature_count);
   }
 
@@ -1406,7 +1413,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The size of scale factor should be the same as feature count,"
         "but the size of scale factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(scale_shape, 0), feature_count);
   }
 
@@ -1414,7 +1421,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The size of mean should be the same as feature count,"
         "but the size of mean is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(mean_shape, 0), feature_count);
   }
 
@@ -1422,7 +1429,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The size of variance should be the same as feature count,"
         "but the size of variance is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(variance_shape, 0), feature_count);
   }
 
@@ -1455,7 +1462,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Expected feature_index of batch-norm-grad to be "
         "smaller than the rank of operand_shape; "
-        "got feature_index %lld, and rank %lld",
+        "got feature_index %lld, and rank %lld.",
         feature_index, ShapeUtil::Rank(operand_shape));
   }
 
@@ -1463,7 +1470,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Expected operand_shape of batch-norm-grad to have the same rank as"
         " output_grad_shape; got rank(oprand_shape) %lld, and"
-        " rank(output_grad_shape) %lld",
+        " rank(output_grad_shape) %lld.",
         ShapeUtil::Rank(operand_shape), ShapeUtil::Rank(output_grad_shape));
   }
 
@@ -1491,14 +1498,14 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (!ShapeUtil::ElementIsFloating(operand_shape)) {
     return InvalidArgument(
         "The operand to batch-norm-grad must have a floating point "
-        "element type, but the shape is %s",
+        "element type, but the shape is %s.",
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
 
   if (!ShapeUtil::ElementIsFloating(output_grad_shape)) {
     return InvalidArgument(
         "The output_grad to batch-norm-grad must have a floating point "
-        "element type, but the shape is %s",
+        "element type, but the shape is %s.",
         PrimitiveType_Name(output_grad_shape.element_type()).c_str());
   }
 
@@ -1507,7 +1514,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-grad, "
         "but the element type of output_grad is %s "
-        "and the element type of operand is %s",
+        "and the element type of operand is %s.",
         PrimitiveType_Name(output_grad_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1517,7 +1524,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-grad, "
         "but the element type of scale factor is %s "
-        "and the element type of operand is %s",
+        "and the element type of operand is %s.",
         PrimitiveType_Name(scale_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1527,7 +1534,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-grad, "
         "but the element type of mean is %s "
-        "and the element type of operand is %s",
+        "and the element type of operand is %s.",
         PrimitiveType_Name(mean_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1537,7 +1544,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-grad, "
         "but the element type of mean is %s "
-        "and the element type of operand is %s",
+        "and the element type of operand is %s.",
         PrimitiveType_Name(mean_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1551,7 +1558,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The size of mean should be the same as feature count,"
         "but the size of offset factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(mean_shape, 0), feature_count);
   }
 
@@ -1559,7 +1566,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The size of scale factor should be the same as feature count,"
         "but the size of scale factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(scale_shape, 0), feature_count);
   }
 
@@ -1567,7 +1574,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The size of variance should be the same as feature count,"
         "but the size of variance is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(var_shape, 0), feature_count);
   }
 
@@ -1578,7 +1585,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       return InvalidArgument(
           "The bounds of operand shape should be the same as output_grad's,"
           "but the bound of operand_shape at dimension %lld is %lld "
-          "and the bound of output_grad_shape is %lld",
+          "and the bound of output_grad_shape is %lld.",
           i, ShapeUtil::GetDimension(operand_shape, i),
           ShapeUtil::GetDimension(output_grad_shape, i));
     }
@@ -1596,7 +1603,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(lhs, rhs)) {
     return InvalidArgument(
-        "Convolution with different element types: %s and %s",
+        "Convolution with different element types: %s and %s.",
         ShapeUtil::HumanString(lhs).c_str(),
         ShapeUtil::HumanString(rhs).c_str());
   }
@@ -1612,21 +1619,19 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (window.dimensions_size() != num_spatial_dims) {
     return InvalidArgument(
         "Window must have same number of dimensions as dimension numbers.\n"
-        "Window: %s\nDimension numbers: %s",
+        "Window: %s\nDimension numbers: %s.",
         window.DebugString().c_str(), dnums.DebugString().c_str());
   }
 
   const int num_dims = num_spatial_dims + 2;
   if (ShapeUtil::Rank(lhs) != num_dims) {
     return InvalidArgument(
-        "The LHS argument to a convolution should have rank %d.\n"
-        "lhs: %s",
+        "The LHS argument to a convolution should have rank %d; lhs: %s.",
         num_dims, ShapeUtil::HumanString(lhs).c_str());
   }
   if (ShapeUtil::Rank(rhs) != num_dims) {
     return InvalidArgument(
-        "The RHS argument to a convolution should have rank %d.\n"
-        "lhs: %s",
+        "The RHS argument to a convolution should have rank %d; lhs: %s.",
         num_dims, ShapeUtil::HumanString(lhs).c_str());
   }
   TF_DCHECK_OK(ShapeUtil::ValidateShapeWithOptionalLayout(lhs));
@@ -1663,26 +1668,26 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       !std::all_of(window_dnums.begin(), window_dnums.end(), in_range) ||
       !std::all_of(output_dnums.begin(), output_dnums.end(), in_range)) {
     return InvalidArgument(
-        "A dimension number is out of range in convolution: %s",
+        "A dimension number is out of range in convolution: %s.",
         dnums.DebugString().c_str());
   }
 
   if (input_dnums != expected_dnums) {
     return InvalidArgument(
         "Input dimensions of convolution must contain each dimension exactly "
-        "once: %s",
+        "once: %s.",
         dnums.DebugString().c_str());
   }
   if (window_dnums != expected_dnums) {
     return InvalidArgument(
         "Window dimensions of convolution must contain each dimension exactly "
-        "once: %s",
+        "once: %s.",
         dnums.DebugString().c_str());
   }
   if (output_dnums != expected_dnums) {
     return InvalidArgument(
         "Output dimensions of convolution must contain each dimension exactly "
-        "once: %s",
+        "once: %s.",
         dnums.DebugString().c_str());
   }
 
@@ -1706,7 +1711,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Expected LHS feature dimension (value %lld) to match RHS "
         "input feature dimension (value %lld); got <conv>(%s, %s)\n"
-        "Dimension numbers: {%s}",
+        "Dimension numbers: {%s}.",
         input_features, kernel_input_features,
         ShapeUtil::HumanString(lhs).c_str(),
         ShapeUtil::HumanString(rhs).c_str(), dnums.DebugString().c_str());
@@ -1720,7 +1725,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         "Window dimensions do not match RHS shape:\n\t"
         "RHS shape: %s\n\t"
         "Window: {%s}\n\t"
-        "Dimension numbers: {%s}",
+        "Dimension numbers: {%s}.",
         ShapeUtil::HumanString(rhs).c_str(), window.ShortDebugString().c_str(),
         dnums.ShortDebugString().c_str());
   }
@@ -1748,8 +1753,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const tensorflow::gtl::ArraySlice<int64> fft_length) {
   const int64 fft_rank = fft_length.size();
   if (fft_rank < 1 || fft_rank > 3) {
-    return InvalidArgument("FFT only supports ranks 1-3, but got %lld",
-                           fft_rank);
+    return InvalidArgument("FFT only supports ranks 1-3; got %lld.", fft_rank);
   }
 #define RET_CHECK_RANK(x)                              \
   if (x.dimensions_size() < fft_rank) {                \
@@ -1762,7 +1766,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     case FFT:
     case IFFT:
       if (in.element_type() != C64) {
-        return InvalidArgument("%s requires C64 input type, found %s",
+        return InvalidArgument("%s requires C64 input type, found %s.",
                                FftType_Name(fft_type).c_str(),
                                PrimitiveType_Name(in.element_type()).c_str());
       }
@@ -1770,7 +1774,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       return in;
     case RFFT: {
       if (in.element_type() != F32) {
-        return InvalidArgument("RFFT requires F32 input type, found %s",
+        return InvalidArgument("RFFT requires F32 input type, found %s.",
                                PrimitiveType_Name(in.element_type()).c_str());
       }
       RET_CHECK_RANK(in);
@@ -1779,7 +1783,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
             fft_length[i]) {
           return InvalidArgument(
               "RFFT requires innermost dimensions match fft_length but "
-              "dimension %lld is %lld and should be %lld",
+              "dimension %lld is %lld and should be %lld.",
               in.dimensions_size() - fft_rank + i,
               in.dimensions(in.dimensions_size() - fft_rank + i),
               fft_length[i]);
@@ -1792,7 +1796,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     }
     case IRFFT: {
       if (in.element_type() != C64) {
-        return InvalidArgument("IRFFT requires C64 input type, found %s",
+        return InvalidArgument("IRFFT requires C64 input type, found %s.",
                                PrimitiveType_Name(in.element_type()).c_str());
       }
       RET_CHECK_RANK(in);
@@ -1802,7 +1806,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
             fft_length[i]) {
           return InvalidArgument(
               "IRFFT requires all but one innermost dimensions match "
-              "fft_length, but dimension %lld is %lld and should be %lld",
+              "fft_length, but dimension %lld is %lld and should be %lld.",
               in.dimensions_size() - fft_rank + i,
               in.dimensions(in.dimensions_size() - fft_rank + i),
               fft_length[i]);
@@ -1812,7 +1816,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
           fft_length[fft_rank - 1] / 2 + 1) {
         return InvalidArgument(
             "IRFFT requires innermost dimension matches fft_length/2+1, but "
-            "dimension %d is %lld and should be %lld",
+            "dimension %d is %lld and should be %lld.",
             in.dimensions_size() - 1, in.dimensions(in.dimensions_size() - 1),
             fft_length[fft_rank - 1] / 2 + 1);
       }
@@ -1850,8 +1854,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   for (int64 dimension : dimensions_to_reduce) {
     if (dimension >= ShapeUtil::Rank(arg) || dimension < 0) {
       return InvalidArgument(
-          "attempting to reduce out-of-bounds dimension %lld in shape %s",
-          dimension, ShapeUtil::HumanString(arg).c_str());
+          "Reducing out-of-bounds dimension %lld in shape %s.", dimension,
+          ShapeUtil::HumanString(arg).c_str());
     }
   }
   TF_RETURN_IF_ERROR(
@@ -1891,30 +1895,30 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   // Check if the select function has a proper shape of (T,T) -> PRED.
   if (select_shape.parameters_size() != 2) {
     return InvalidArgument(
-        "select function must take 2 parameters, but "
+        "Select function must take 2 parameters, but "
         "takes %d parameter(s).",
         select_shape.parameters_size());
   }
   const Shape& select_result_shape = select_shape.result();
   if (!ShapeUtil::Compatible(select_result_shape,
                              ShapeUtil::MakeShape(PRED, {}))) {
-    return Unimplemented("select function must have rank-0 PRED result.");
+    return InvalidArgument("Select function must have rank-0 PRED result.");
   }
   const Shape& operand_element_shape =
       ShapeUtil::MakeShape(operand_shape.element_type(), {});
   if (!ShapeUtil::CompatibleIgnoringFpPrecision(operand_element_shape,
                                                 select_shape.parameters(0))) {
     return InvalidArgument(
-        "select function's first parameter shape currently must "
-        "match the operand element shape. Got %s vs %s",
+        "Select function's first parameter shape currently must "
+        "match the operand element shape, but got %s vs %s.",
         ShapeUtil::HumanString(select_shape.parameters(0)).c_str(),
         ShapeUtil::HumanString(operand_element_shape).c_str());
   }
   if (!ShapeUtil::CompatibleIgnoringFpPrecision(operand_element_shape,
                                                 select_shape.parameters(1))) {
     return InvalidArgument(
-        "select function's second parameter shape currently must "
-        "match the operand element shape. Got %s vs %s",
+        "Select function's second parameter shape currently must "
+        "match the operand element shape, but got %s vs %s.",
         ShapeUtil::HumanString(select_shape.parameters(1)).c_str(),
         ShapeUtil::HumanString(operand_element_shape).c_str());
   }
@@ -1931,8 +1935,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (!ShapeUtil::CompatibleIgnoringFpPrecision(source_shape,
                                                 window_result_shape)) {
     return InvalidArgument(
-        "source shape does not match the shape of window-reduced operand: "
-        "source(%s), window-reduced operand(%s)",
+        "Source shape does not match the shape of window-reduced operand: "
+        "source(%s), window-reduced operand(%s).",
         ShapeUtil::HumanString(source_shape).c_str(),
         ShapeUtil::HumanString(window_result_shape).c_str());
   }
@@ -1946,7 +1950,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   auto error = [&](const string& message) {
     return InvalidArgument(
         "%s in slice operation; argument shape: %s; starts: {%s}; limits: "
-        "{%s}; strides: {%s}",
+        "{%s}; strides: {%s}.",
         message.c_str(), ShapeUtil::HumanString(arg).c_str(),
         Join(starts, ",").c_str(), Join(limits, ",").c_str(),
         Join(strides, ",").c_str());
@@ -1969,7 +1973,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (starts.size() != ShapeUtil::Rank(arg)) {
     return InvalidArgument(
-        "slice index count does not match argument rank: %zu vs %lld",
+        "Slice index count does not match argument rank: %zu vs %lld.",
         starts.size(), ShapeUtil::Rank(arg));
   }
 
@@ -1979,7 +1983,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     int64 limit_index = limits[dimension];
     int64 stride = strides[dimension];
     if (start_index < 0) {
-      return InvalidArgument("negative start index to slice: %lld",
+      return InvalidArgument("Negative start index to slice: %lld.",
                              start_index);
     }
     if (limit_index > arg.dimensions(dimension)) {
@@ -1999,7 +2003,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
                  limit_index, start_index));
     }
     if (stride <= 0) {
-      return InvalidArgument("stride (%lld) must be positive", stride);
+      return InvalidArgument("Stride (%lld) must be positive.", stride);
     }
     sizes.push_back((limit_index - start_index + stride - 1) / stride);
   }
@@ -2023,20 +2027,20 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (ShapeUtil::Rank(start_indices_shape) != 1) {
     return InvalidArgument(
-        "dynamic slice start indices of rank %lld must be rank1.",
+        "Dynamic slice start indices of rank %lld must be rank 1.",
         ShapeUtil::Rank(start_indices_shape));
   }
 
   if (!ShapeUtil::ElementIsIntegral(start_indices_shape)) {
     return InvalidArgument(
-        "dynamic slice start indices must be of integral type.");
+        "Dynamic slice start indices must be of integral type.");
   }
 
   const int64 start_num_dims = start_indices_shape.dimensions(0);
   if (ShapeUtil::Rank(operand_shape) != start_num_dims) {
     return InvalidArgument(
-        "dynamic slice start number of dimensions %lld (%s) must match rank "
-        "%lld of slice input (%s)",
+        "Dynamic slice start number of dimensions %lld (%s) must match rank "
+        "%lld of slice input (%s).",
         start_num_dims, ShapeUtil::HumanString(start_indices_shape).c_str(),
         ShapeUtil::Rank(operand_shape),
         ShapeUtil::HumanString(operand_shape).c_str());
@@ -2044,7 +2048,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (slice_sizes.size() != ShapeUtil::Rank(operand_shape)) {
     return InvalidArgument(
-        "dynamic slice index count does not match argument rank: %zu vs %lld",
+        "Dynamic slice index count does not match argument rank: %zu vs %lld.",
         slice_sizes.size(), ShapeUtil::Rank(operand_shape));
   }
 
@@ -2052,12 +2056,12 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const int64 input_dim_size = operand_shape.dimensions(dim);
     const int64 slice_dim_size = slice_sizes[dim];
     if (slice_dim_size < 0) {
-      return InvalidArgument("negative size index to dynamic slice: %lld",
+      return InvalidArgument("Negative size index to dynamic slice: %lld.",
                              slice_dim_size);
     }
     if (slice_dim_size > input_dim_size) {
       return InvalidArgument(
-          "slice dim size %lld greater than dynamic slice dimension: %lld",
+          "Slice dim size %lld greater than dynamic slice dimension: %lld.",
           slice_dim_size, input_dim_size);
     }
     VLOG(2) << tensorflow::strings::Printf("slice_sizes[%lld] = %lld", dim,
@@ -2086,20 +2090,20 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (ShapeUtil::Rank(start_indices_shape) != 1) {
     return InvalidArgument(
-        "dynamic update slice start indices of rank %lld must be rank1.",
+        "Dynamic update slice start indices of rank %lld must be rank 1.",
         ShapeUtil::Rank(start_indices_shape));
   }
 
   if (!ShapeUtil::ElementIsIntegral(start_indices_shape)) {
     return InvalidArgument(
-        "dynamic update slice start indices must be of integral type.");
+        "Dynamic update slice start indices must be of integral type.");
   }
 
   const int64 start_num_dims = start_indices_shape.dimensions(0);
   if (ShapeUtil::Rank(operand_shape) != start_num_dims) {
     return InvalidArgument(
-        "dynamic slice start number of dimensions %lld (%s) must match rank "
-        "%lld of slice input (%s)",
+        "Dynamic update slice start number of dimensions %lld (%s) must match "
+        "rank %lld of slice input (%s).",
         start_num_dims, ShapeUtil::HumanString(start_indices_shape).c_str(),
         ShapeUtil::Rank(operand_shape),
         ShapeUtil::HumanString(operand_shape).c_str());
@@ -2107,16 +2111,16 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (ShapeUtil::Rank(update_shape) != ShapeUtil::Rank(operand_shape)) {
     return InvalidArgument(
-        "dynamic update slice update rank does not match argument rank: "
-        "%lld vs %lld",
+        "Dynamic update slice update rank does not match argument rank: "
+        "%lld vs %lld.",
         ShapeUtil::Rank(update_shape), ShapeUtil::Rank(operand_shape));
   }
 
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(operand_shape,
                                                      update_shape)) {
     return InvalidArgument(
-        "dynamic update slice update element type does not match argument. "
-        "operand.element_type: %s vs update.element_type: %s",
+        "Dynamic update slice update element type does not match argument. "
+        "operand.element_type: %s vs update.element_type: %s.",
         PrimitiveType_Name(operand_shape.element_type()).c_str(),
         PrimitiveType_Name(update_shape.element_type()).c_str());
   }
@@ -2126,12 +2130,12 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const int64 update_dim_size = update_shape.dimensions(dim);
     if (update_dim_size < 0) {
       return InvalidArgument(
-          "size index %lld to dynamic update slice must be >= 0",
+          "Size index %lld to dynamic update slice must be >= 0.",
           update_dim_size);
     }
     if (update_dim_size > input_dim_size) {
       return InvalidArgument(
-          "update dim size %lld greater than dynamic slice dimension: %lld",
+          "Update dim size %lld greater than dynamic slice dimension: %lld.",
           update_dim_size, input_dim_size);
     }
     VLOG(2) << tensorflow::strings::Printf("update_sizes[%lld] = %lld", dim,
@@ -2151,7 +2155,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   for (int64 dimension : dimensions) {
     if (dimension >= ShapeUtil::Rank(operand_shape) || dimension < 0) {
       return InvalidArgument(
-          "one of the reverse dimensions (%lld) is out-of-bounds in shape %s",
+          "One of the reverse dimensions (%lld) is out-of-bounds in shape %s.",
           dimension, ShapeUtil::HumanString(operand_shape).c_str());
     }
   }
@@ -2162,14 +2166,14 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const Shape& arg, int64 index) {
   if (!ShapeUtil::IsTuple(arg)) {
     return InvalidArgument(
-        "cannot infer shape: attempting to index into non-tuple: %s",
+        "Cannot infer shape: attempting to index into non-tuple: %s.",
         ShapeUtil::HumanString(arg).c_str());
   }
 
   if (index >= arg.tuple_shapes_size()) {
     return InvalidArgument(
-        "cannot infer shape: attempt to index out of tuple bounds: %lld "
-        ">= %d in shape %s",
+        "Cannot infer shape: attempt to index out of tuple bounds: %lld "
+        ">= %d in shape %s.",
         index, arg.tuple_shapes_size(), ShapeUtil::HumanString(arg).c_str());
   }
 
@@ -2181,17 +2185,17 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const Shape& init) {
   // Check the number of parameters for given computations.
   if (condition.parameters_size() != 1) {
-    return InvalidArgument("condition must take 1 arguments; got %d",
+    return InvalidArgument("Condition must take 1 arguments; got %d.",
                            condition.parameters_size());
   }
   if (body.parameters_size() != 1) {
-    return InvalidArgument("body must take 1 arguments; got %d",
+    return InvalidArgument("Body must take 1 arguments; got %d.",
                            body.parameters_size());
   }
 
   auto shape_string = [&]() {
     return tensorflow::strings::Printf(
-        "condition: %s; body: %s; init: %s",
+        "Condition: %s; body: %s; init: %s.",
         ShapeUtil::HumanString(condition).c_str(),
         ShapeUtil::HumanString(body).c_str(),
         ShapeUtil::HumanString(init).c_str());
@@ -2199,15 +2203,15 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   // Check the shapes of computation parameters and return types.
   if (!ShapeUtil::ShapeIs(condition.result(), PRED, {})) {
-    return InvalidArgument("condition must return a boolean; got %s",
+    return InvalidArgument("Condition must return a boolean; got %s.",
                            shape_string().c_str());
   }
   if (!ShapeUtil::Compatible(body.result(), condition.parameters(0)) ||
       !ShapeUtil::Compatible(body.result(), body.parameters(0)) ||
       !ShapeUtil::Compatible(body.result(), init)) {
     return InvalidArgument(
-        "the parameter of condition and body, the result of the body, and init "
-        "must all have the same shape; got %s",
+        "The parameter of condition and body, the result of the body, and init "
+        "must all have the same shape; got %s.",
         shape_string().c_str());
   }
 
@@ -2219,7 +2223,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const Shape& false_operand, const ProgramShape& true_computation,
     const ProgramShape& false_computation) {
   if (!ShapeUtil::ShapeIs(predicate, PRED, {})) {
-    return InvalidArgument("predicate must be a boolean; got %s.",
+    return InvalidArgument("Predicate must be a boolean; got %s.",
                            ShapeUtil::HumanString(predicate).c_str());
   }
 
@@ -2302,8 +2306,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (ShapeUtil::ElementsIn(operand) != ShapeUtil::ElementsIn(inferred_shape)) {
     return InvalidArgument(
-        "reshape operation has mismatched element counts: from=%lld (%s) "
-        "to=%lld (%s)",
+        "Reshape operation has mismatched element counts: from=%lld (%s) "
+        "to=%lld (%s).",
         ShapeUtil::ElementsIn(operand), ShapeUtil::HumanString(operand).c_str(),
         ShapeUtil::ElementsIn(inferred_shape),
         ShapeUtil::HumanString(inferred_shape).c_str());
@@ -2351,7 +2355,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   TF_RETURN_IF_ERROR(ExpectNotTupleOrOpaque(max, "clamp max"));
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(min, operand) ||
       !ShapeUtil::SameElementTypeIgnoringFpPrecision(max, operand)) {
-    return InvalidArgument("clamp op with different operand types: %s, %s, %s",
+    return InvalidArgument("Clamp with different operand types: %s, %s, %s.",
                            ShapeUtil::HumanString(min).c_str(),
                            ShapeUtil::HumanString(operand).c_str(),
                            ShapeUtil::HumanString(max).c_str());
@@ -2372,7 +2376,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     }
   }
   return Unimplemented(
-      "not yet implemented: %s, %s <clamp> %s", min.ShortDebugString().c_str(),
+      "%s, %s <clamp> %s is not implemented.", min.ShortDebugString().c_str(),
       max.ShortDebugString().c_str(), operand.ShortDebugString().c_str());
 }
 
@@ -2391,25 +2395,26 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   }
   if (!compatible) {
     return InvalidArgument(
-        "operands to select must be the same shape; got %s and %s",
+        "Operands to select must be the same shape; got %s and %s.",
         ShapeUtil::HumanString(on_true).c_str(),
         ShapeUtil::HumanString(on_false).c_str());
   }
   if (pred.element_type() != PRED) {
     return InvalidArgument(
-        "select's pred operand must have PRED element type; got %s",
+        "Select's pred operand must have PRED element type; got %s.",
         ShapeUtil::HumanString(pred).c_str());
   }
-  if (ShapeUtil::SameDimensions(pred, on_true) || ShapeUtil::Rank(pred) == 0) {
+  if (ShapeUtil::CompatibleIgnoringElementType(pred, on_true) ||
+      ShapeUtil::Rank(pred) == 0) {
     // By this stage we know that pred's element type is PRED. Therefore, this
     // check restricts pred to be a PRED scalar, or a PRED array with the same
     // dimensions as on_true and on_false.
     return ShapeUtil::ChangeElementType(
         on_true, ShapeUtil::HigherPrecisionElementType(on_true, on_false));
   } else {
-    return Unimplemented(
-        "select operation with non-scalar predicate with dimensionality "
-        " different from the other operands: %s",
+    return InvalidArgument(
+        "Select operation with non-scalar predicate with dimensionality "
+        "different from the other operands: %s.",
         ShapeUtil::HumanString(pred).c_str());
   }
 }
@@ -2427,7 +2432,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Call applied function arity must match number of arguments; got: "
         "arity: %d, arguments: %zu; computation signature: %s; argument "
-        "shapes: [%s]",
+        "shapes: [%s].",
         to_apply.parameters_size(), arg_shapes.size(),
         computation_signature.c_str(), argument_shapes.c_str());
   }
@@ -2439,7 +2444,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     if (!ShapeUtil::Compatible(arg_shape, param_shape)) {
       return InvalidArgument(
           "Call parameter must match argument; got parameter %d shape: %s, "
-          "argument shape: %s",
+          "argument shape: %s.",
           i, ShapeUtil::HumanString(param_shape).c_str(),
           ShapeUtil::HumanString(arg_shape).c_str());
     }
@@ -2454,40 +2459,40 @@ static Status ValidateGatherDimensionNumbers(
     const GatherDimensionNumbers& dim_numbers) {
   if (!c_is_sorted(dim_numbers.output_window_dims())) {
     return InvalidArgument(
-        "Output window dimensions in gather op must be ascending; got: %s",
+        "Output window dimensions in gather op must be ascending; got: %s.",
         Join(dim_numbers.output_window_dims(), ", ").c_str());
   }
 
   if (c_adjacent_find(dim_numbers.output_window_dims()) !=
       dim_numbers.output_window_dims().end()) {
     return InvalidArgument(
-        "Output window dimensions in gather op must not repeat; got: %s",
+        "Output window dimensions in gather op must not repeat; got: %s.",
         Join(dim_numbers.output_window_dims(), ", ").c_str());
   }
 
   const int64 output_window_dim_count = dim_numbers.output_window_dims_size();
   const int64 output_shape_rank =
-      output_window_dim_count + gather_indices_shape.size();
+      output_window_dim_count + gather_indices_shape.size() - 1;
 
   for (int i = 0; i < dim_numbers.output_window_dims_size(); ++i) {
     int64 window_index = dim_numbers.output_window_dims(i);
     if (window_index < 0 || window_index >= output_shape_rank) {
       return InvalidArgument(
           "Window index %d in gather op is out of bounds; got %lld, but should "
-          "have been in"
-          "[0,%lld)",
+          "have been in [0,%lld).",
           i, window_index, output_shape_rank);
     }
   }
 
   if (dim_numbers.gather_dims_to_operand_dims_size() !=
-      gather_indices_shape.back()) {
+      gather_indices_shape[dim_numbers.index_vector_dim()]) {
     return InvalidArgument(
-        "There must be exactly as many elements in gather_dims_to_operand_dims "
-        "as there are elements in the last dimension of %%gather_indices; got: "
-        "%d, expected %lld",
+        "Gather op has %d elements in gather_dims_to_operand_dims and the "
+        "bound of dimension index_vector_dim=%lld of gather_indices is "
+        "%lld. These two numbers must be equal.",
         dim_numbers.gather_dims_to_operand_dims_size(),
-        gather_indices_shape.back());
+        dim_numbers.index_vector_dim(),
+        gather_indices_shape[dim_numbers.index_vector_dim()]);
   }
 
   for (int i = 0; i < dim_numbers.gather_dims_to_operand_dims_size(); i++) {
@@ -2496,7 +2501,7 @@ static Status ValidateGatherDimensionNumbers(
         gather_dim_to_input_dim >= input_shape.dimensions_size()) {
       return InvalidArgument(
           "Invalid gather_dims_to_operand_dims mapping; domain is [0, %d), "
-          "got: %d->%lld",
+          "got: %d->%lld.",
           input_shape.dimensions_size(), i, gather_dim_to_input_dim);
     }
   }
@@ -2511,7 +2516,7 @@ static Status ValidateGatherDimensionNumbers(
       sorted_gather_dims_to_operand_dims.end()) {
     return InvalidArgument(
         "Repeated dimensions are not allowed in gather_dims_to_operand_dims; "
-        "got: %s",
+        "got: %s.",
         Join(dim_numbers.gather_dims_to_operand_dims(), ", ").c_str());
   }
 
@@ -2519,7 +2524,7 @@ static Status ValidateGatherDimensionNumbers(
     if (elided_dim < 0 || elided_dim >= input_shape.dimensions_size()) {
       return InvalidArgument(
           "Invalid elided_window_dims set in gather op; valid range is [0, "
-          "%d), got: %lld",
+          "%d), got: %lld.",
           input_shape.dimensions_size(), elided_dim);
     }
   }
@@ -2534,7 +2539,7 @@ static Status ValidateGatherDimensionNumbers(
       dim_numbers.elided_window_dims().end()) {
     return InvalidArgument(
         "Repeated dimensions not allowed in elided_window_dims in gather op; "
-        "got: %s",
+        "got: %s.",
         Join(dim_numbers.elided_window_dims(), ", ").c_str());
   }
 
@@ -2550,24 +2555,33 @@ static Status ValidateGatherDimensionNumbers(
   TF_RETURN_IF_ERROR(ExpectNotTupleOrOpaque(
       gather_indices_shape, "gather indices operand of gather op"));
 
-  if (gather_indices_shape.dimensions_size() < 1) {
+  if (!ShapeUtil::ElementIsIntegral(gather_indices_shape)) {
     return InvalidArgument(
-        "Gather indices parameter must at least of rank 1; got %s",
+        "Gather indices parameter must be an integral tensor; got %s.",
         ShapeUtil::HumanString(gather_indices_shape).c_str());
   }
 
-  if (!ShapeUtil::ElementIsIntegral(gather_indices_shape)) {
+  // We implicitly reshape gather indices of shape P[A,B,C] to P[A,B,C,1] if
+  // index_vector_dim is rank(P).  The bounds of this expanded shape are
+  // stored in expanded_gather_indices_shape.
+
+  if (gather_indices_shape.dimensions_size() <
+          gather_dim_numbers.index_vector_dim() ||
+      gather_dim_numbers.index_vector_dim() < 0) {
     return InvalidArgument(
-        "Gather indices parameter must be an integral tensor; got %s",
-        ShapeUtil::HumanString(gather_indices_shape).c_str());
+        "Gather index leaf dimension must be within [0, rank(gather_indices) + "
+        "1). rank(gather_indices) is %d and gather index leaf dimension is "
+        "%lld.",
+        gather_indices_shape.dimensions_size(),
+        gather_dim_numbers.index_vector_dim());
   }
 
   std::vector<int64> expanded_gather_indices_shape;
-  // We implicitly reshape gather indices of shape P[N] to P[N,1].
   expanded_gather_indices_shape.reserve(gather_indices_shape.dimensions_size());
   c_copy(gather_indices_shape.dimensions(),
          std::back_inserter(expanded_gather_indices_shape));
-  if (expanded_gather_indices_shape.size() == 1) {
+  if (expanded_gather_indices_shape.size() ==
+      gather_dim_numbers.index_vector_dim()) {
     expanded_gather_indices_shape.push_back(1);
   }
 
@@ -2577,7 +2591,7 @@ static Status ValidateGatherDimensionNumbers(
   if (window_bounds.size() != input_shape.dimensions_size()) {
     return InvalidArgument(
         "Gather op must have one window bound for every input dimension; got: "
-        "len(window_bounds)=%lu, input_shape.rank=%d",
+        "len(window_bounds)=%lu, input_shape.rank=%d.",
         window_bounds.size(), input_shape.dimensions_size());
   }
 
@@ -2587,7 +2601,7 @@ static Status ValidateGatherDimensionNumbers(
     return InvalidArgument(
         "All components of the window index in a gather op must either be a "
         "output window index or explicitly elided; got len(window_bounds)=%lu, "
-        "output_window_bounds=%s, elided_window_bounds=%s",
+        "output_window_bounds=%s, elided_window_bounds=%s.",
         window_bounds.size(),
         Join(gather_dim_numbers.output_window_dims(), ",").c_str(),
         Join(gather_dim_numbers.elided_window_dims(), ",").c_str());
@@ -2600,7 +2614,7 @@ static Status ValidateGatherDimensionNumbers(
       return InvalidArgument(
           "Window bound at index %d in gather op is out of range, must be "
           "within "
-          "[0, %lld), got %lld",
+          "[0, %lld), got %lld.",
           i, corresponding_input_bound + 1, window_bound);
     }
   }
@@ -2609,7 +2623,7 @@ static Status ValidateGatherDimensionNumbers(
     if (window_bounds[gather_dim_numbers.elided_window_dims(i)] != 1) {
       return InvalidArgument(
           "Gather op can only elide window indices with bound 1, but bound is "
-          "%lld for index %lld at position %d",
+          "%lld for index %lld at position %d.",
           window_bounds[gather_dim_numbers.elided_window_dims(i)],
           gather_dim_numbers.elided_window_dims(i), i);
     }
@@ -2632,6 +2646,9 @@ static Status ValidateGatherDimensionNumbers(
       }
       current_bound = window_bounds[window_dims_seen++];
     } else {
+      if (gather_dims_seen == gather_dim_numbers.index_vector_dim()) {
+        gather_dims_seen++;
+      }
       current_bound = expanded_gather_indices_shape[gather_dims_seen++];
     }
 
diff --git a/tensorflow/compiler/xla/service/shape_inference.h b/tensorflow/compiler/xla/service/shape_inference.h
index 0d3045213db2230da3e18ffcb1a9923250560b64..085fdac60c6de161c457dff672175e82f4f4da51 100644
--- a/tensorflow/compiler/xla/service/shape_inference.h
+++ b/tensorflow/compiler/xla/service/shape_inference.h
@@ -56,6 +56,9 @@ class ShapeInference {
   static StatusOr<Shape> InferBinaryOpShape(
       BinaryOperation operation, const Shape& lhs, const Shape& rhs,
       tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  static StatusOr<Shape> InferBinaryOpShape(
+      HloOpcode opcode, const Shape& lhs, const Shape& rhs,
+      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
   static StatusOr<Shape> InferBinaryOpShape(HloOpcode opcode,
                                             const HloInstruction* lhs,
                                             const HloInstruction* rhs);
diff --git a/tensorflow/compiler/xla/service/shape_inference_test.cc b/tensorflow/compiler/xla/service/shape_inference_test.cc
index 7eb120843fd841d841048eeaefd895fde96d133c..0e61994a786b53a295ef9c9c2287b28fbf754d9b 100644
--- a/tensorflow/compiler/xla/service/shape_inference_test.cc
+++ b/tensorflow/compiler/xla/service/shape_inference_test.cc
@@ -135,7 +135,7 @@ TEST_F(ShapeInferenceTest, SelectBadShapes) {
       TernaryOperation::TRIOP_SELECT, pred_, matrix_64_48_, matrix_32_64_);
   ASSERT_FALSE(inferred_status_error1.ok());
   ASSERT_THAT(inferred_status_error1.status().error_message(),
-              HasSubstr("operands to select must be the same shape"));
+              HasSubstr("Operands to select must be the same shape"));
 
   auto inferred_status_error2 = ShapeInference::InferTernaryOpShape(
       TernaryOperation::TRIOP_SELECT, s32_, matrix_64_48_, matrix_64_48_);
@@ -340,7 +340,7 @@ TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSourceShape) {
       init_value_shape_, scatter_program_shape_);
   ASSERT_FALSE(inferred_status_fail.ok());
   ASSERT_THAT(inferred_status_fail.status().error_message(),
-              HasSubstr("source shape does not match"));
+              HasSubstr("Source shape does not match"));
 }
 
 TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape1) {
@@ -351,7 +351,7 @@ TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape1) {
       init_value_shape_, scatter_program_shape_);
   ASSERT_FALSE(inferred_status_fail.ok());
   ASSERT_THAT(inferred_status_fail.status().error_message(),
-              HasSubstr("select function must take 2 parameters"));
+              HasSubstr("Select function must take 2 parameters"));
 }
 
 TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape2) {
@@ -362,7 +362,7 @@ TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape2) {
       init_value_shape_, scatter_program_shape_);
   ASSERT_FALSE(inferred_status_fail.ok());
   ASSERT_THAT(inferred_status_fail.status().error_message(),
-              HasSubstr("select function must have rank-0 PRED"));
+              HasSubstr("Select function must have rank-0 PRED"));
 }
 
 TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape3) {
@@ -373,7 +373,7 @@ TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape3) {
       init_value_shape_, scatter_program_shape_);
   ASSERT_FALSE(inferred_status_fail.ok());
   ASSERT_THAT(inferred_status_fail.status().error_message(),
-              HasSubstr("select function's first parameter"));
+              HasSubstr("Select function's first parameter"));
 }
 
 TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape4) {
@@ -384,7 +384,7 @@ TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape4) {
       init_value_shape_, scatter_program_shape_);
   ASSERT_FALSE(inferred_status_fail.ok());
   ASSERT_THAT(inferred_status_fail.status().error_message(),
-              HasSubstr("select function's second parameter"));
+              HasSubstr("Select function's second parameter"));
 }
 
 TEST_F(ShapeInferenceTest, Convolve) {
@@ -906,7 +906,7 @@ TEST_F(ShapeInferenceTest, ScalarDotVector) {
       ShapeInference::InferDotOpShape(f32_, vector_32_, dot_dnums);
   ASSERT_FALSE(inferred_status.ok());
   ASSERT_THAT(inferred_status.status().error_message(),
-              HasSubstr("dot only supports rank"));
+              HasSubstr("Dot only supports rank"));
 }
 
 // 3D <dot> 2D: error
@@ -918,7 +918,7 @@ TEST_F(ShapeInferenceTest, DotWithRankHigherThanTwo) {
       ShapeUtil::MakeShape(F32, {32, 32, 32}), matrix_32_64_, dot_dnums);
   ASSERT_FALSE(inferred_status.ok());
   ASSERT_THAT(inferred_status.status().error_message(),
-              HasSubstr("batch and contracting dimension number mismatch"));
+              HasSubstr("Batch and contracting dimension number mismatch"));
 }
 
 // vector <dot> vector -> scalar
@@ -1024,7 +1024,7 @@ TEST_F(ShapeInferenceTest, DotWithTwoContractingDimsFails) {
       ShapeInference::InferDotOpShape(lhs_shape, rhs_shape, dot_dnums);
   ASSERT_FALSE(inferred_status.ok());
   ASSERT_THAT(inferred_status.status().error_message(),
-              HasSubstr("must specify one contracting dimension for both "
+              HasSubstr("Must specify one contracting dimension for both "
                         "lhs and rhs"));
 }
 
@@ -1044,7 +1044,7 @@ TEST_F(ShapeInferenceTest, DotWithMisatchedBatchDimSizesFails) {
       ShapeInference::InferDotOpShape(lhs_shape, rhs_shape, dot_dnums);
   ASSERT_FALSE(inferred_status.ok());
   ASSERT_THAT(inferred_status.status().error_message(),
-              HasSubstr("batch dimension numbers and sizes must match"));
+              HasSubstr("Batch dimension numbers and sizes must match"));
 }
 
 // BatchMatMul with different batch dimension numbers fails.
@@ -1063,7 +1063,7 @@ TEST_F(ShapeInferenceTest, DotWithMisatchedBatchDimNumbersFails) {
       ShapeInference::InferDotOpShape(lhs_shape, rhs_shape, dot_dnums);
   ASSERT_FALSE(inferred_status.ok());
   ASSERT_THAT(inferred_status.status().error_message(),
-              HasSubstr("batch dimension numbers must precede non-batch"));
+              HasSubstr("Batch dimension numbers must precede non-batch"));
 }
 
 // BatchMatMul with out-of-range dimension numbers fails.
@@ -1166,42 +1166,42 @@ TEST_F(ShapeInferenceTest, BinOpBroadcastBadDimension) {
       BinaryOperation::BINOP_ADD, tensor, vec8, {});
   ASSERT_FALSE(inferred_status_error1.ok());
   ASSERT_THAT(inferred_status_error1.status().error_message(),
-              HasSubstr("automatic"));
+              HasSubstr("Automatic"));
 
   // broadcast_dimension out of bounds for tensor's rank
   auto inferred_status_error2 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor, vec8, {3});
   ASSERT_FALSE(inferred_status_error2.ok());
   ASSERT_THAT(inferred_status_error2.status().error_message(),
-              ContainsRegex("broadcast dimension number .* too large"));
+              ContainsRegex("Broadcast dimension number .* too large"));
 
   // broadcast_dimension doesn't match corresponding dimension
   auto inferred_status_error3 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor, vec8, {0});
   ASSERT_FALSE(inferred_status_error3.ok());
   ASSERT_THAT(inferred_status_error3.status().error_message(),
-              HasSubstr("broadcast dimension 0 mismatch"));
+              HasSubstr("Broadcast dimension 0 mismatch"));
 
   // broadcast_dimensions list too long
   auto inferred_status_error4 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor, matrix8_4, {0, 1, 2});
   ASSERT_FALSE(inferred_status_error4.ok());
   ASSERT_THAT(inferred_status_error4.status().error_message(),
-              HasSubstr("size of broadcast_dimensions has to match"));
+              HasSubstr("broadcast_dimensions has to match"));
 
   // there's a dimension above the rank of the tensor
   auto inferred_status_error5 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor, matrix8_4, {3, 0});
   ASSERT_FALSE(inferred_status_error5.ok());
   ASSERT_THAT(inferred_status_error5.status().error_message(),
-              ContainsRegex("broadcast dimension number .* too large"));
+              ContainsRegex("dimension number .* too large"));
 
   // broadcasting dimensions don't match in this order
   auto inferred_status_error6 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor, matrix8_4, {2, 1});
   ASSERT_FALSE(inferred_status_error6.ok());
   ASSERT_THAT(inferred_status_error6.status().error_message(),
-              HasSubstr("broadcast dimension 0 mismatch"));
+              HasSubstr("dimension 0 mismatch"));
 
   // The following two tests make sure that broadcasting dimensions are listed
   // in a proper (strictly increasing) order, even if the lower-rank array
@@ -1210,13 +1210,13 @@ TEST_F(ShapeInferenceTest, BinOpBroadcastBadDimension) {
       BinaryOperation::BINOP_ADD, tensor8_8_8, matrix8_8, {0, 0});
   ASSERT_FALSE(inferred_status_error7.ok());
   ASSERT_THAT(inferred_status_error7.status().error_message(),
-              HasSubstr("broadcast dimensions order is wrong"));
+              HasSubstr("dimensions order is wrong"));
 
   auto inferred_status_error8 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor8_8_8, matrix8_8, {1, 0});
   ASSERT_FALSE(inferred_status_error8.ok());
   ASSERT_THAT(inferred_status_error8.status().error_message(),
-              HasSubstr("broadcast dimensions order is wrong"));
+              HasSubstr("dimensions order is wrong"));
 }
 
 // Tests for the while instruction with proper shapes.
@@ -1242,7 +1242,7 @@ TEST_F(ShapeInferenceTest, WhileWithBadShapes) {
       ShapeInference::InferWhileShape(bad_shape_1, body, result_shape);
   ASSERT_FALSE(inferred_status_error1.ok());
   ASSERT_THAT(inferred_status_error1.status().error_message(),
-              HasSubstr("condition must take 1 arguments"));
+              HasSubstr("Condition must take 1 arguments"));
 
   auto bad_shape_2 =
       ShapeUtil::MakeProgramShape({s32_, result_shape}, result_shape);
@@ -1250,14 +1250,14 @@ TEST_F(ShapeInferenceTest, WhileWithBadShapes) {
       ShapeInference::InferWhileShape(cond, bad_shape_2, result_shape);
   ASSERT_FALSE(inferred_status_error2.ok());
   ASSERT_THAT(inferred_status_error2.status().error_message(),
-              HasSubstr("body must take 1 arguments"));
+              HasSubstr("Body must take 1 arguments"));
 
   auto bad_shape_3 = ShapeUtil::MakeProgramShape({result_shape}, s32_);
   auto inferred_status_error3 =
       ShapeInference::InferWhileShape(bad_shape_3, body, result_shape);
   ASSERT_FALSE(inferred_status_error3.ok());
   ASSERT_THAT(inferred_status_error3.status().error_message(),
-              HasSubstr("condition must return a boolean"));
+              HasSubstr("Condition must return a boolean"));
 
   auto bad_shape_4 = ShapeUtil::MakeProgramShape({result_shape}, vector_32_);
   auto inferred_status_error4 =
@@ -1301,13 +1301,13 @@ TEST_F(ShapeInferenceTest, ConcatenateWithBadShapes) {
       ShapeInference::InferConcatOpShape({&vector_32_}, /*dimension=*/-1);
   ASSERT_FALSE(inferred_status_error2.ok());
   ASSERT_THAT(inferred_status_error2.status().error_message(),
-              HasSubstr("dimension to concatenate along out of bounds: -1"));
+              HasSubstr("dimension out of bounds: -1"));
 
   auto inferred_status_error3 =
       ShapeInference::InferConcatOpShape({&vector_32_}, /*dimension=*/1);
   ASSERT_FALSE(inferred_status_error3.ok());
   ASSERT_THAT(inferred_status_error3.status().error_message(),
-              HasSubstr("dimension to concatenate along out of bounds: 1"));
+              HasSubstr("dimension out of bounds: 1"));
 
   Shape tuple = ShapeUtil::MakeTupleShape({vector_32_});
   auto inferred_status_error4 = ShapeInference::InferConcatOpShape(
@@ -1315,21 +1315,20 @@ TEST_F(ShapeInferenceTest, ConcatenateWithBadShapes) {
   ASSERT_FALSE(inferred_status_error4.ok());
   ASSERT_THAT(
       inferred_status_error4.status().error_message(),
-      HasSubstr("Expected non-tuple argument for operand of concatenation."));
+      HasSubstr("Expected non-tuple argument for operand of concatenation"));
 
   const Shape vector_s32 = ShapeUtil::MakeShape(S32, {32});
   auto inferred_status_error5 = ShapeInference::InferConcatOpShape(
       {&vector_32_, &vector_s32}, /*dimension=*/0);
   ASSERT_FALSE(inferred_status_error5.ok());
-  ASSERT_THAT(
-      inferred_status_error5.status().error_message(),
-      HasSubstr("cannot concatenate arrays with different element types"));
+  ASSERT_THAT(inferred_status_error5.status().error_message(),
+              HasSubstr("concatenate arrays with different element types"));
 
   auto inferred_status_error6 = ShapeInference::InferConcatOpShape(
       {&matrix_32_48_, &matrix_32_64_}, /*dimension=*/0);
   ASSERT_FALSE(inferred_status_error6.ok());
   ASSERT_THAT(inferred_status_error6.status().error_message(),
-              HasSubstr("cannot concatenate arrays that differ in "
+              HasSubstr("concatenate arrays that differ in "
                         "dimensions other than the one being "
                         "concatenated"));
 }
@@ -1467,7 +1466,7 @@ TEST_F(ShapeInferenceTest, Conditional) {
       ShapeUtil::MakeProgramShape({vector_64_}, f32_));
   EXPECT_FALSE(inferred_status_error0.ok());
   EXPECT_THAT(inferred_status_error0.status().error_message(),
-              HasSubstr("predicate must be a boolean"));
+              HasSubstr("Predicate must be a boolean"));
 
   auto inferred_status_error1 = ShapeInference::InferConditionalShape(
       pred_, ShapeUtil::MakeTupleShape({f32_, vector_32_}), matrix_32_48_,
@@ -1530,11 +1529,17 @@ TEST_F(ShapeInferenceTest, BadSlice) {
 
 class GatherShapeInferenceTest : public ShapeInferenceTest {
  protected:
+  const Shape s64_scalar_ = ShapeUtil::MakeShape(S64, {});
+  const Shape s64_vector_5_ = ShapeUtil::MakeShape(S64, {5});
   const Shape s64_vector_32_ = ShapeUtil::MakeShape(S64, {32});
   const Shape s64_4d_tensor_10_9_8_7_1_ =
       ShapeUtil::MakeShape(S64, {10, 9, 8, 7, 1});
   const Shape s64_4d_tensor_10_9_8_7_5_ =
       ShapeUtil::MakeShape(S64, {10, 9, 8, 7, 5});
+  const Shape s64_4d_tensor_5_10_9_7_6_ =
+      ShapeUtil::MakeShape(S64, {5, 10, 9, 7, 6});
+  const Shape s64_4d_tensor_10_9_5_7_6_ =
+      ShapeUtil::MakeShape(S64, {10, 9, 5, 7, 6});
   const Shape f32_5d_tensor_50_49_48_47_46_ =
       ShapeUtil::MakeShape(F32, {50, 49, 48, 47, 46});
   const Shape tuple_shape_ = ShapeUtil::MakeTupleShape(
@@ -1548,7 +1553,8 @@ TEST_F(GatherShapeInferenceTest, TensorFlowGather) {
                                        HloInstruction::MakeGatherDimNumbers(
                                            /*output_window_dims=*/{0},
                                            /*elided_window_dims=*/{1},
-                                           /*gather_dims_to_operand_dims=*/{1}),
+                                           /*gather_dims_to_operand_dims=*/{1},
+                                           /*index_vector_dim=*/1),
                                        /*window_bounds=*/{64, 1}));
   EXPECT_TRUE(
       ShapeUtil::Equal(gather_shape, ShapeUtil::MakeShape(F32, {64, 32})))
@@ -1562,7 +1568,8 @@ TEST_F(GatherShapeInferenceTest, TensorFlowGatherV2) {
                                        HloInstruction::MakeGatherDimNumbers(
                                            /*output_window_dims=*/{1},
                                            /*elided_window_dims=*/{0},
-                                           /*gather_dims_to_operand_dims=*/{0}),
+                                           /*gather_dims_to_operand_dims=*/{0},
+                                           /*index_vector_dim=*/1),
                                        /*window_bounds=*/{1, 48}));
   EXPECT_TRUE(
       ShapeUtil::Equal(gather_shape, ShapeUtil::MakeShape(F32, {32, 48})))
@@ -1576,7 +1583,8 @@ TEST_F(GatherShapeInferenceTest, TensorFlowGatherNd) {
                                        HloInstruction::MakeGatherDimNumbers(
                                            /*output_window_dims=*/{4},
                                            /*elided_window_dims=*/{0},
-                                           /*gather_dims_to_operand_dims=*/{0}),
+                                           /*gather_dims_to_operand_dims=*/{0},
+                                           /*index_vector_dim=*/4),
                                        /*window_bounds=*/{1, 48}));
   EXPECT_TRUE(ShapeUtil::Equal(gather_shape,
                                ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 48})))
@@ -1591,7 +1599,8 @@ TEST_F(GatherShapeInferenceTest, TensorFlowBatchDynamicSlice) {
           HloInstruction::MakeGatherDimNumbers(
               /*output_window_dims=*/{4, 5, 6, 7, 8},
               /*elided_window_dims=*/{},
-              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/4),
           /*window_bounds=*/{30, 29, 28, 27, 26}));
   EXPECT_TRUE(ShapeUtil::Equal(
       gather_shape,
@@ -1599,12 +1608,85 @@ TEST_F(GatherShapeInferenceTest, TensorFlowBatchDynamicSlice) {
       << ShapeUtil::HumanString(gather_shape);
 }
 
+TEST_F(GatherShapeInferenceTest, NonDefaultGatherIndicesLeafDim_A) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape gather_shape,
+      ShapeInference::InferGatherShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_5_7_6_,
+          HloInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/{4, 5, 6, 7, 8},
+              /*elided_window_dims=*/{},
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/2),
+          /*window_bounds=*/{30, 29, 28, 27, 26}));
+
+  EXPECT_TRUE(ShapeUtil::Equal(
+      gather_shape,
+      ShapeUtil::MakeShape(F32, {10, 9, 7, 6, 30, 29, 28, 27, 26})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, NonDefaultGatherIndicesLeafDim_B) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape gather_shape,
+      ShapeInference::InferGatherShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_5_10_9_7_6_,
+          HloInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/{4, 5, 6, 7, 8},
+              /*elided_window_dims=*/{},
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/0),
+          /*window_bounds=*/{30, 29, 28, 27, 26}));
+
+  EXPECT_TRUE(ShapeUtil::Equal(
+      gather_shape,
+      ShapeUtil::MakeShape(F32, {10, 9, 7, 6, 30, 29, 28, 27, 26})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, NoOutputGatherDims) {
+  // This is equivalent to a dynamic slice.
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape gather_shape,
+      ShapeInference::InferGatherShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_vector_5_,
+          HloInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/{0, 1, 2, 3, 4},
+              /*elided_window_dims=*/{},
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/0),
+          /*window_bounds=*/{30, 29, 28, 27, 26}));
+
+  EXPECT_TRUE(ShapeUtil::Equal(gather_shape,
+                               ShapeUtil::MakeShape(F32, {30, 29, 28, 27, 26})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, ScalarGatherIndices) {
+  // The gather indices "tensor" is a scalar S here that's used to slice out
+  // [S,0,0,0,0]..[S,30,29,28,27] into a [30,29,28,27] shaped result.
+  TF_ASSERT_OK_AND_ASSIGN(Shape gather_shape,
+                          ShapeInference::InferGatherShape(
+                              f32_5d_tensor_50_49_48_47_46_, s64_scalar_,
+                              HloInstruction::MakeGatherDimNumbers(
+                                  /*output_window_dims=*/{0, 1, 2, 3},
+                                  /*elided_window_dims=*/{0},
+                                  /*gather_dims_to_operand_dims=*/{0},
+                                  /*index_vector_dim=*/0),
+                              /*window_bounds=*/{1, 30, 29, 28, 27}));
+
+  EXPECT_TRUE(ShapeUtil::Equal(gather_shape,
+                               ShapeUtil::MakeShape(F32, {30, 29, 28, 27})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
 TEST_F(GatherShapeInferenceTest, TupleShapedTensorInput) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       tuple_shape_, s64_vector_32_,
       HloInstruction::MakeGatherDimNumbers(/*output_window_dims=*/{0},
                                            /*elided_window_dims=*/{1},
-                                           /*gather_dims_to_operand_dims=*/{1}),
+                                           /*gather_dims_to_operand_dims=*/{1},
+                                           /*index_vector_dim=*/1),
       /*window_bounds=*/{64, 1});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1617,7 +1699,8 @@ TEST_F(GatherShapeInferenceTest, TupleShapedGatherIndicesInput) {
       s64_vector_32_, tuple_shape_,
       HloInstruction::MakeGatherDimNumbers(/*output_window_dims=*/{0},
                                            /*elided_window_dims=*/{1},
-                                           /*gather_dims_to_operand_dims=*/{1}),
+                                           /*gather_dims_to_operand_dims=*/{1},
+                                           /*index_vector_dim=*/0),
       /*window_bounds=*/{64, 1});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1625,25 +1708,13 @@ TEST_F(GatherShapeInferenceTest, TupleShapedGatherIndicesInput) {
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest, ScalarGatherIndicesInput) {
-  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
-      s64_vector_32_, s32_,
-      HloInstruction::MakeGatherDimNumbers(/*output_window_dims=*/{0},
-                                           /*elided_window_dims=*/{1},
-                                           /*gather_dims_to_operand_dims=*/{1}),
-      /*window_bounds=*/{64, 1});
-  ASSERT_FALSE(statusor.ok());
-  EXPECT_THAT(statusor.status().error_message(),
-              HasSubstr("Gather indices parameter must at least of rank 1"))
-      << statusor.status();
-}
-
 TEST_F(GatherShapeInferenceTest, FloatingPointGatherIndicesInput) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       s64_vector_32_, vector_32_,
       HloInstruction::MakeGatherDimNumbers(/*output_window_dims=*/{0},
                                            /*elided_window_dims=*/{1},
-                                           /*gather_dims_to_operand_dims=*/{1}),
+                                           /*gather_dims_to_operand_dims=*/{1},
+                                           /*index_vector_dim=*/0),
       /*window_bounds=*/{64, 1});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1658,7 +1729,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 8, 7},
           /*elided_window_dims=*/{},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 27, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(
@@ -1674,7 +1746,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 7},
           /*elided_window_dims=*/{},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 27, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(
@@ -1690,7 +1763,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 99, 100, 101},
           /*elided_window_dims=*/{},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 27, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1698,6 +1772,22 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_WindowIndexBarelyOutOfBounds) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 9},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Window index 4 in gather op is out of bounds"))
+      << statusor.status();
+}
+
 TEST_F(GatherShapeInferenceTest,
        InvalidGatherDimNumbers_MismatchingElidedWindowDims) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
@@ -1705,7 +1795,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{4},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 27, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(
@@ -1722,7 +1813,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{0, 1, 2, 3, 19},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 27, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1738,7 +1830,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{0, 1, 2, 3, 3},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 27, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(
@@ -1755,15 +1848,15 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 27, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(
       statusor.status().error_message(),
-      HasSubstr(
-          "There must be exactly as many elements in "
-          "gather_dims_to_operand_dims "
-          "as there are elements in the last dimension of %gather_indices"))
+      HasSubstr("Gather op has 4 elements in gather_dims_to_operand_dims and "
+                "the bound of dimension index_vector_dim=4 of "
+                "gather_indices is 5. These two numbers must be equal."))
       << statusor.status();
 }
 
@@ -1774,7 +1867,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 7}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 7},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 27, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(
@@ -1791,7 +1885,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 3}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 3},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 27, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(
@@ -1808,7 +1903,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{2, 1},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{1, 1, 28, 27, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1822,7 +1918,8 @@ TEST_F(GatherShapeInferenceTest, InvalidGatherDimNumbers_WindowBoundsTooLarge) {
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7},
           /*elided_window_dims=*/{2},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 1, 300, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1838,7 +1935,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 26});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(
@@ -1855,7 +1953,8 @@ TEST_F(GatherShapeInferenceTest,
       HloInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7},
           /*elided_window_dims=*/{1},
-          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4}),
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
       /*window_bounds=*/{30, 29, 28, 26, 20});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1864,5 +1963,22 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
+TEST_F(GatherShapeInferenceTest, OutOfBoundsGatherIndicesLeafDim) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_5_7_6_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 8},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/32),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Gather index leaf dimension must be within [0, "
+                        "rank(gather_indices) + 1)"))
+      << statusor.status();
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/user_computation.cc b/tensorflow/compiler/xla/service/user_computation.cc
index 06735e9442942f3c69d1cd679857fe22f2fa6756..0dca30a804005c6f536aca5b54af24eb08d4560b 100644
--- a/tensorflow/compiler/xla/service/user_computation.cc
+++ b/tensorflow/compiler/xla/service/user_computation.cc
@@ -3315,20 +3315,23 @@ void ComputationLowerer::Visit(
       HloInstruction* rhs = lookup_instruction(ternary_op_request.rhs());
       HloInstruction* ehs = lookup_instruction(ternary_op_request.ehs());
       auto hlo_opcode = TernaryOperationToHloOpcode(ternary_op_request.triop());
-
-      if (debug_options_.xla_eliminate_hlo_implicit_broadcast()) {
-        if (!ShapeUtil::SameDimensions(request.output_shape(), lhs->shape())) {
+      if (debug_options_.xla_eliminate_hlo_implicit_broadcast() &&
+          !ShapeUtil::IsTuple(request.output_shape())) {
+        if (!ShapeUtil::IsTuple(lhs->shape()) &&
+            !ShapeUtil::SameDimensions(request.output_shape(), lhs->shape())) {
           // lhs side is being implicitly broadcast. Change to explicit.
           lhs =
               ImplicitBroadcastToExplicitBroadcast(lhs, request.output_shape());
         }
 
-        if (!ShapeUtil::SameDimensions(request.output_shape(), rhs->shape())) {
+        if (!ShapeUtil::IsTuple(rhs->shape()) &&
+            !ShapeUtil::SameDimensions(request.output_shape(), rhs->shape())) {
           rhs =
               ImplicitBroadcastToExplicitBroadcast(rhs, request.output_shape());
         }
 
-        if (!ShapeUtil::SameDimensions(request.output_shape(), ehs->shape())) {
+        if (!ShapeUtil::IsTuple(ehs->shape()) &&
+            !ShapeUtil::SameDimensions(request.output_shape(), ehs->shape())) {
           ehs =
               ImplicitBroadcastToExplicitBroadcast(ehs, request.output_shape());
         }
diff --git a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc
index a5f9b01f011ce04f1114c74391a967c62f015221..3ef0cdff6751258e4489ce350deb0931fdf69ef9 100644
--- a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc
+++ b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc
@@ -106,20 +106,12 @@ static bool NotWorthHoistingIndividually(const HloInstruction& instruction) {
     case HloOpcode::kBitcast:
     case HloOpcode::kBroadcast:
     case HloOpcode::kConstant:
+    case HloOpcode::kReshape:
     case HloOpcode::kReverse:
     case HloOpcode::kSlice:
+    case HloOpcode::kTranspose:
     case HloOpcode::kTuple:
       return true;
-
-    case HloOpcode::kTranspose:
-      return ShapeUtil::TransposeIsBitcast(
-          /*input_shape=*/instruction.operand(0)->shape(),
-          /*output_shape=*/instruction.shape(), instruction.dimensions());
-
-    case HloOpcode::kReshape:
-      return ShapeUtil::ReshapeIsBitcast(
-          /*input_shape=*/instruction.operand(0)->shape(),
-          /*output_shape=*/instruction.shape());
   }
 }
 
diff --git a/tensorflow/compiler/xla/service/while_loop_simplifier.cc b/tensorflow/compiler/xla/service/while_loop_simplifier.cc
index 981de9b2200a9ae8938db21299580f510834d2f0..ec05a74e286c89dd8db5ae07580e461938d7c087 100644
--- a/tensorflow/compiler/xla/service/while_loop_simplifier.cc
+++ b/tensorflow/compiler/xla/service/while_loop_simplifier.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/while_loop_simplifier.h"
 #include "tensorflow/compiler/xla/service/call_inliner.h"
 #include "tensorflow/compiler/xla/service/hlo_evaluator.h"
+#include "tensorflow/core/lib/gtl/flatmap.h"
 #include "tensorflow/core/lib/gtl/optional.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
@@ -212,7 +213,7 @@ static optional<int64> GetLoopTripCount(HloInstruction* while_op) {
   // Now that we know the index of the induction variable, we can we can try to
   // compute how many times the loop executes.  Start by computing the induction
   // variable's initial value.
-  HloEvaluator evaluator;
+  HloEvaluator evaluator(/*max_loop_iterations=*/0);
   auto* while_init = while_op->mutable_operand(0);
   auto* indvar_init = while_init->mutable_operand(*indvar_tuple_idx);
   StatusOr<std::unique_ptr<Literal>> indvar_init_result =
@@ -605,6 +606,78 @@ static StatusOr<bool> TryRemoveWhileLoop(HloInstruction* while_op) {
   return false;
 }
 
+static StatusOr<bool> TryPropagateConstant(HloInstruction* while_op) {
+  auto while_init = while_op->operand(0);
+  if (while_init->opcode() != HloOpcode::kTuple) {
+    return false;
+  }
+
+  auto while_body = while_op->while_body();
+  auto while_body_root = while_body->root_instruction();
+  if (while_body_root->opcode() != HloOpcode::kTuple) {
+    return false;
+  }
+
+  auto while_body_param = while_body->parameter_instruction(0);
+  const HloInstruction::InstructionVector& root_operands =
+      while_body_root->operands();
+
+  // Find the loop invariant tuple elements with scalar constant init value and
+  // build a map from the tuple element index to the constant value. Limit this
+  // to scalar constant values because propagating array constants can regress
+  // performance by forcing us to copy constants.
+  tensorflow::gtl::FlatMap<int, const HloInstruction*> index_to_constant;
+  for (int i = 0; i < root_operands.size(); i++) {
+    HloInstruction* instr = root_operands[i];
+    if (instr->opcode() == HloOpcode::kGetTupleElement &&
+        instr->tuple_index() == i && instr->operand(0) == while_body_param &&
+        ShapeUtil::IsScalar(instr->shape())) {
+      auto tuple_element = while_init->operand(i);
+      if (tuple_element->IsConstant()) {
+        VLOG(3) << "Found loop invariant tuple element " << i << " "
+                << tuple_element->ToString();
+        index_to_constant[i] = tuple_element;
+      }
+    }
+  }
+
+  if (index_to_constant.empty()) {
+    return false;
+  }
+
+  // Replace the use of each constant tuple element in the loop_condition and
+  // loop_body with the corresponding constant value.
+  auto propagate_constant = [&](HloComputation* computation) -> StatusOr<bool> {
+    HloInstruction* param = computation->parameter_instruction(0);
+    bool changed = false;
+    for (auto instr : param->users()) {
+      // Since only a while-loop with a tuple result reaches here, we can safely
+      // assume that `param` is a tuple and the first operand of the
+      // GetTupleElement instruction is a use of `param`.
+      if (instr->opcode() == HloOpcode::kGetTupleElement) {
+        VLOG(3) << "tuple index " << instr->tuple_index() << " "
+                << instr->ToString();
+        auto iter = index_to_constant.find(instr->tuple_index());
+        if (iter != index_to_constant.end()) {
+          const HloInstruction* hlo_constant = (*iter).second;
+          VLOG(3) << "Replace use of " << instr->ToString() << " with "
+                  << hlo_constant->ToString();
+          TF_RETURN_IF_ERROR(instr->ReplaceAllUsesWith(
+              computation->AddInstruction(hlo_constant->Clone())));
+          changed = true;
+        }
+      }
+    }
+    return changed;
+  };
+
+  TF_ASSIGN_OR_RETURN(bool changed_cond,
+                      propagate_constant(while_op->while_condition()));
+  TF_ASSIGN_OR_RETURN(bool changed_body, propagate_constant(while_body));
+
+  return changed_cond || changed_body;
+}
+
 StatusOr<bool> WhileLoopSimplifier::Run(HloModule* module) {
   XLA_VLOG_LINES(3,
                  "WhileLoopSimplifier::Run(), before:\n" + module->ToString());
@@ -635,7 +708,11 @@ StatusOr<bool> WhileLoopSimplifier::Run(HloModule* module) {
       continue;
     }
 
-    StatusOr<bool> result = TryRemoveWhileLoop(while_op);
+    StatusOr<bool> result = TryPropagateConstant(while_op);
+    TF_RETURN_IF_ERROR(result.status());
+    changed |= result.ValueOrDie();
+
+    result = TryRemoveWhileLoop(while_op);
     TF_RETURN_IF_ERROR(result.status());
     if (result.ValueOrDie()) {
       changed = true;
diff --git a/tensorflow/compiler/xla/service/while_loop_simplifier_test.cc b/tensorflow/compiler/xla/service/while_loop_simplifier_test.cc
index c5183f8d3aee99696ed4114c3f7e451888222137..619e87caa5b6d0f6ec3c3b1489b0d4f50ef29963 100644
--- a/tensorflow/compiler/xla/service/while_loop_simplifier_test.cc
+++ b/tensorflow/compiler/xla/service/while_loop_simplifier_test.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/hlo_verified_test_base.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/lib/strings/str_util.h"
 
 namespace xla {
 namespace {
@@ -26,112 +27,140 @@ namespace {
 namespace op = xla::testing::opcode_matchers;
 
 class WhileLoopSimplifierTest : public HloVerifiedTestBase {
- public:
-  // Makes a computation that contains a loop that runs num_iters times.
-  HloComputation* MakeSimpleLoop(int num_iters, HloModule* module);
-
-  // Makes a computation which has one parameter, of the given shape, and always
-  // returns PRED[]{true}.  This is useful as a dummy loop condition.
-  HloComputation* MakeAlwaysTrueComputation(const Shape& param_shape,
-                                            HloModule* module);
+ protected:
+  // Makes an HloModule that contains a loop with `num_iters` iterations.
+  void MakeModuleWithSimpleLoop(int num_iters);
+
+  // Similar to MakeModuleWithSimpleLoop except that the loop bound is passed to
+  // the loop-condition through an element of a tuple which is the
+  // loop-condition parameter.
+  void MakeModuleWithSimpleLoopTupleElementLoopBound(int num_iters);
 };
 
-HloComputation* WhileLoopSimplifierTest::MakeSimpleLoop(int num_iters,
-                                                        HloModule* module) {
-  HloComputation::Builder builder(TestName());
-
-  auto loop_iter_init = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(42)));
-  auto loop_data_init = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({0, 1, 2})));
-  auto loop_init = builder.AddInstruction(
-      HloInstruction::CreateTuple({loop_iter_init, loop_data_init}));
-
-  HloComputation* condition;
-  {
-    HloComputation::Builder cond_builder(TestName() + ".condition");
-    auto loop_var = cond_builder.AddInstruction(
-        HloInstruction::CreateParameter(0, loop_init->shape(), "loop_var"));
-    auto loop_induction_var =
-        cond_builder.AddInstruction(HloInstruction::CreateGetTupleElement(
-            ShapeUtil::MakeShape(S32, {}), loop_var, 0));
-    auto limit = cond_builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR0<int32>(42 + num_iters)));
-    cond_builder.AddInstruction(HloInstruction::CreateBinary(
-        ShapeUtil::MakeShape(PRED, {}), HloOpcode::kLt, loop_induction_var,
-        limit));
-    condition = module->AddEmbeddedComputation(cond_builder.Build());
+void WhileLoopSimplifierTest::MakeModuleWithSimpleLoop(int num_iters) {
+  string hlo_string_template = R"(
+  HloModule SimpleLoop
+  SimpleLoop.body {
+    loop_var.1 = (s32[], s32[3]{0}) parameter(0)
+    get-tuple-element.1 = s32[] get-tuple-element(loop_var.1), index=0
+    constant.1 = s32[] constant(1)
+    add = s32[] add(get-tuple-element.1, constant.1)
+    get-tuple-element.2 = s32[3]{0} get-tuple-element(loop_var.1), index=1
+    multiply = s32[3]{0} multiply(get-tuple-element.2, get-tuple-element.2)
+    ROOT tuple = (s32[], s32[3]{0}) tuple(add, multiply)
   }
-
-  HloComputation* body;
-  {
-    HloComputation::Builder body_builder(TestName() + ".body");
-    auto loop_var = body_builder.AddInstruction(
-        HloInstruction::CreateParameter(0, loop_init->shape(), "loop_var"));
-    auto loop_induction_var =
-        body_builder.AddInstruction(HloInstruction::CreateGetTupleElement(
-            ShapeUtil::MakeShape(S32, {}), loop_var, 0));
-    auto new_loop_induction_var =
-        body_builder.AddInstruction(HloInstruction::CreateBinary(
-            loop_induction_var->shape(), HloOpcode::kAdd, loop_induction_var,
-            body_builder.AddInstruction(
-                HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)))));
-    auto loop_data =
-        body_builder.AddInstruction(HloInstruction::CreateGetTupleElement(
-            loop_data_init->shape(), loop_var, 1));
-    auto new_loop_data =
-        body_builder.AddInstruction(HloInstruction::CreateBinary(
-            loop_data_init->shape(), HloOpcode::kMultiply, loop_data,
-            loop_data));
-    body_builder.AddInstruction(
-        HloInstruction::CreateTuple({new_loop_induction_var, new_loop_data}));
-    body = module->AddEmbeddedComputation(body_builder.Build());
+  SimpleLoop.condition {
+    loop_var.2 = (s32[], s32[3]{0}) parameter(0)
+    get-tuple-element.3 = s32[] get-tuple-element(loop_var.2), index=0
+    constant.2 = s32[] constant({{LOOP_BOUND}})
+    ROOT less-than = pred[] less-than(get-tuple-element.3, constant.2)
   }
+  ENTRY SimpleLoop {
+    constant.3 = s32[] constant(42)
+    constant.4 = s32[3]{0} constant({0, 1, 2})
+    tuple.1 = (s32[], s32[3]{0}) tuple(constant.3, constant.4)
+    ROOT while = (s32[], s32[3]{0}) while(tuple.1), condition=
+      SimpleLoop.condition, body=SimpleLoop.body
+  }
+  )";
 
-  builder.AddInstruction(HloInstruction::CreateWhile(
-      loop_init->shape(), condition, body, loop_init));
-
-  return module->AddEntryComputation(builder.Build());
+  string hlo_string = tensorflow::str_util::StringReplace(
+      hlo_string_template, "{{LOOP_BOUND}}",
+      tensorflow::strings::StrCat(42 + num_iters),
+      /*replace_all=*/true);
+  ParseAndVerifyModule(hlo_string);
 }
 
-HloComputation* WhileLoopSimplifierTest::MakeAlwaysTrueComputation(
-    const Shape& param_shape, HloModule* module) {
-  HloComputation::Builder builder(TestName() + ".always_true");
-  builder.AddInstruction(
-      HloInstruction::CreateParameter(0, param_shape, "param"));
-  builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
-  return module->AddEmbeddedComputation(builder.Build());
+void WhileLoopSimplifierTest::MakeModuleWithSimpleLoopTupleElementLoopBound(
+    int num_iters) {
+  string hlo_string_template = R"(
+  HloModule SimpleLoopWithIndirectLoopBound
+  SimpleLoopWithIndirectLoopBound.body {
+    loop_var.1 = (s32[], s32[3]{0}, s32[]) parameter(0)
+    get-tuple-element.1 = s32[] get-tuple-element(loop_var.1), index=0
+    constant.1 = s32[] constant(1)
+    add = s32[] add(get-tuple-element.1, constant.1)
+    get-tuple-element.2 = s32[3]{0} get-tuple-element(loop_var.1), index=1
+    multiply = s32[3]{0} multiply(get-tuple-element.2, get-tuple-element.2)
+    limit = s32[] get-tuple-element(loop_var.1), index=2
+    ROOT tuple = (s32[], s32[3]{0}, s32[]) tuple(add, multiply, limit)
+  }
+  SimpleLoopWithIndirectLoopBound.condition {
+    loop_var.2 = (s32[], s32[3]{0}, s32[]) parameter(0)
+    get-tuple-element.3 = s32[] get-tuple-element(loop_var.2), index=0
+    get-tuple-element.4 = s32[] get-tuple-element(loop_var.2), index=2
+    ROOT less-than = pred[] less-than(get-tuple-element.3, get-tuple-element.4)
+  }
+  ENTRY SimpleLoopWithIndirectLoopBound {
+    constant.3 = s32[] constant(42)
+    constant.4 = s32[3]{0} constant({0, 1, 2})
+    constant.2 = s32[] constant({{LOOP_BOUND}})
+    tuple.1 = (s32[], s32[3]{0}, s32[]) tuple(constant.3, constant.4,
+      constant.2)
+    ROOT while = (s32[], s32[3]{0}, s32[]) while(tuple.1),
+      condition=SimpleLoopWithIndirectLoopBound.condition,
+      body=SimpleLoopWithIndirectLoopBound.body
+  }
+  )";
+
+  string hlo_string = tensorflow::str_util::StringReplace(
+      hlo_string_template, "{{LOOP_BOUND}}",
+      tensorflow::strings::StrCat(42 + num_iters),
+      /*replace_all=*/true);
+  ParseAndVerifyModule(hlo_string);
 }
 
-TEST_F(WhileLoopSimplifierTest, WhileLoopWithZeroIterations) {
-  HloComputation* computation = MakeSimpleLoop(/*num_iters=*/0, &module());
-  ASSERT_TRUE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
-  EXPECT_THAT(computation->root_instruction(),
+TEST_F(WhileLoopSimplifierTest, LoopWithZeroIterationSimplified) {
+  MakeModuleWithSimpleLoop(/*num_iters=*/0);
+  HloModule* the_module = &module();
+  ASSERT_TRUE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
+  EXPECT_THAT(the_module->entry_computation()->root_instruction(),
               op::Tuple(op::Constant(), op::Constant()));
 }
 
-TEST_F(WhileLoopSimplifierTest, WhileLoopWithOneIteration) {
-  HloComputation* computation = MakeSimpleLoop(/*num_iters=*/1, &module());
-  ASSERT_TRUE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
-  EXPECT_THAT(computation->root_instruction(),
+TEST_F(WhileLoopSimplifierTest,
+       LoopWithZeroIterationTupleElementLoopBoundSimplified) {
+  MakeModuleWithSimpleLoopTupleElementLoopBound(/*num_iters=*/0);
+  HloModule* the_module = &module();
+  ASSERT_TRUE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
+  EXPECT_THAT(the_module->entry_computation()->root_instruction(),
+              op::Tuple(op::Constant(), op::Constant(), op::Constant()));
+}
+
+TEST_F(WhileLoopSimplifierTest, LoopWithOneIterationSimplified) {
+  MakeModuleWithSimpleLoop(/*num_iters=*/1);
+  HloModule* the_module = &module();
+  ASSERT_TRUE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
+  EXPECT_THAT(the_module->entry_computation()->root_instruction(),
               op::Tuple(op::Add(), op::Multiply()));
 }
 
-TEST_F(WhileLoopSimplifierTest, WhileLoopWithTwoIterations) {
-  MakeSimpleLoop(/*num_iters=*/2, &module());
+TEST_F(WhileLoopSimplifierTest,
+       LoopWithOneIterationTupleElementLoopBoundSimplified) {
+  MakeModuleWithSimpleLoopTupleElementLoopBound(/*num_iters=*/1);
+  HloModule* the_module = &module();
+  ASSERT_TRUE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
+  EXPECT_THAT(the_module->entry_computation()->root_instruction(),
+              op::Tuple(op::Add(), op::Multiply(), op::Constant()));
+}
+
+TEST_F(WhileLoopSimplifierTest, LoopWithTwoIterationsNotSimplified) {
+  MakeModuleWithSimpleLoop(/*num_iters=*/2);
   EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
 }
 
-TEST_F(WhileLoopSimplifierTest, WhileLoopWithControlDependency) {
-  HloComputation* computation = MakeSimpleLoop(/*num_iters=*/1, &module());
+TEST_F(WhileLoopSimplifierTest,
+       LoopWithControlDependencySimplifiedDependencyPreserved) {
+  MakeModuleWithSimpleLoop(/*num_iters=*/1);
+  HloModule* the_module = &module();
+  HloComputation* computation = the_module->entry_computation();
   auto* while_op = computation->root_instruction();
   ASSERT_EQ(while_op->opcode(), HloOpcode::kWhile);
   auto* true_op = while_op->while_body()->AddInstruction(
       HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
   TF_ASSERT_OK(true_op->AddControlDependencyTo(
       while_op->while_body()->root_instruction()));
-  ASSERT_TRUE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
+  ASSERT_TRUE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
   EXPECT_THAT(computation->root_instruction()->control_predecessors(),
               ElementsAre(op::Constant()))
       << computation->ToString();
@@ -139,8 +168,10 @@ TEST_F(WhileLoopSimplifierTest, WhileLoopWithControlDependency) {
 
 // Loops that contain send/recv nodes can't be simplified; the loop structure
 // around send/recv nodes must be preserved.
-TEST_F(WhileLoopSimplifierTest, NotRemovedIfContainsSend) {
-  HloComputation* computation = MakeSimpleLoop(/*num_iters=*/1, &module());
+TEST_F(WhileLoopSimplifierTest, LoopWithSendNotSimplified) {
+  MakeModuleWithSimpleLoop(/*num_iters=*/1);
+  HloModule* the_module = &module();
+  HloComputation* computation = the_module->entry_computation();
   auto* while_op = computation->root_instruction();
   ASSERT_EQ(while_op->opcode(), HloOpcode::kWhile);
   auto* while_body = while_op->while_body();
@@ -149,11 +180,13 @@ TEST_F(WhileLoopSimplifierTest, NotRemovedIfContainsSend) {
           HloInstruction::CreateConstant(Literal::CreateR0<bool>(true))),
       /*channel_id=*/0));
   while_body->AddInstruction(HloInstruction::CreateSendDone(send));
-  EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
+  EXPECT_FALSE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
 }
 
-TEST_F(WhileLoopSimplifierTest, NotRemovedIfContainsRecv) {
-  HloComputation* computation = MakeSimpleLoop(/*num_iters=*/1, &module());
+TEST_F(WhileLoopSimplifierTest, LoopWithRecvNotSimplified) {
+  MakeModuleWithSimpleLoop(/*num_iters=*/1);
+  HloModule* the_module = &module();
+  HloComputation* computation = the_module->entry_computation();
   auto* while_op = computation->root_instruction();
   ASSERT_EQ(while_op->opcode(), HloOpcode::kWhile);
   auto* while_body = while_op->while_body();
@@ -161,247 +194,217 @@ TEST_F(WhileLoopSimplifierTest, NotRemovedIfContainsRecv) {
       HloInstruction::CreateRecv(ShapeUtil::MakeShape(F32, {1}),
                                  /*channel_id=*/0));
   while_body->AddInstruction(HloInstruction::CreateRecvDone(recv));
-  EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
+  EXPECT_FALSE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
 }
 
 // The limitation on not being able to simplify loops that contain infeeds (and
 // other non-removable instructions) isn't fundamental -- it just stems from the
 // fact that our infrastructure sees simplifying such a loop as tantamount to
 // removing the non-removable instruction.
-TEST_F(WhileLoopSimplifierTest, NotRemovedIfContainsNonRemovableInstruction) {
-  HloComputation* computation = MakeSimpleLoop(/*num_iters=*/1, &module());
+TEST_F(WhileLoopSimplifierTest, LoopWithInfeedNotSimplified) {
+  MakeModuleWithSimpleLoop(/*num_iters=*/1);
+  HloModule* the_module = &module();
+  HloComputation* computation = the_module->entry_computation();
   auto* while_op = computation->root_instruction();
   ASSERT_EQ(while_op->opcode(), HloOpcode::kWhile);
   auto* while_body = while_op->while_body();
   while_body->AddInstruction(
       HloInstruction::CreateInfeed(ShapeUtil::MakeShape(F32, {1}), "config"));
-  EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
+  EXPECT_FALSE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
 }
 
-// Check that we don't crash when given a loop whose shape is not a tuple.
-TEST_F(WhileLoopSimplifierTest, IgnoreNonTupleShapedLoop) {
-  HloComputation::Builder builder(TestName());
-  auto loop_init = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(42)));
-
-  HloComputation* condition;
-  {
-    HloComputation::Builder cond_builder(TestName() + ".condition");
-    auto param = cond_builder.AddInstruction(
-        HloInstruction::CreateParameter(0, loop_init->shape(), "loop_var"));
-    cond_builder.AddInstruction(HloInstruction::CreateBinary(
-        ShapeUtil::MakeShape(PRED, {}), HloOpcode::kLt, param,
-        cond_builder.AddInstruction(
-            HloInstruction::CreateConstant(Literal::CreateR0<int32>(100)))));
-    condition = module().AddEmbeddedComputation(cond_builder.Build());
+// A non-tuple-shaped loop should neither be simplified nor crash the compiler.
+TEST_F(WhileLoopSimplifierTest, NonTupleShapedLoopNotSimplified) {
+  const string hlo_string = R"(
+ HloModule NonTupleShapedLoop
+ NonTupleShapedLoop.body {
+   loop_var.1 = s32[] parameter(0)
+   constant.1 = s32[] constant(-1)
+   ROOT add = s32[] add(s32[] loop_var.1, s32[] constant.1)
+ }
+ NonTupleShapedLoop.condition {
+   loop_var = s32[] parameter(0)
+   constant = s32[] constant(100)
+   ROOT less-than = pred[] less-than(s32[] loop_var, s32[] constant)
+ }
+ ENTRY NonTupleShapedLoop {
+   constant.2 = s32[] constant(42)
+   ROOT while = s32[] while(s32[] constant.2),
+     condition=NonTupleShapedLoop.condition,
+     body=NonTupleShapedLoop.body
   }
+  )";
 
-  HloComputation* body;
-  {
-    HloComputation::Builder body_builder(TestName() + ".body");
-    auto param = body_builder.AddInstruction(
-        HloInstruction::CreateParameter(0, loop_init->shape(), "loop_var"));
-    body_builder.AddInstruction(HloInstruction::CreateBinary(
-        ShapeUtil::MakeShape(S32, {}), HloOpcode::kAdd, param,
-        body_builder.AddInstruction(
-            HloInstruction::CreateConstant(Literal::CreateR0<int32>(-1)))));
-    body = module().AddEmbeddedComputation(body_builder.Build());
-  }
-
-  builder.AddInstruction(HloInstruction::CreateWhile(
-      loop_init->shape(), condition, body, loop_init));
-
-  module().AddEntryComputation(builder.Build());
+  ParseAndVerifyModule(hlo_string);
   EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
 }
 
-// Construct a loop where we swap the tuple elements in each iteration.
-// Although the tuple elements aren't used in the loop, we don't eliminate them,
-// because the swapping side-effect is visible to users of the loop.
-TEST_F(WhileLoopSimplifierTest, SwapTupleIndices) {
-  HloComputation::Builder builder(TestName());
-  auto loop_init = builder.AddInstruction(HloInstruction::CreateTuple({
-      builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<int32>(0))),
-      builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<int32>(1))),
-  }));
-
-  HloComputation* condition =
-      MakeAlwaysTrueComputation(loop_init->shape(), &module());
-  HloComputation* body;
-  {
-    HloComputation::Builder body_builder(TestName() + ".body");
-    auto param = body_builder.AddInstruction(
-        HloInstruction::CreateParameter(0, loop_init->shape(), "loop_var"));
-    auto scalar_s32 = ShapeUtil::MakeShape(S32, {});
-    body_builder.AddInstruction(HloInstruction::CreateTuple({
-        body_builder.AddInstruction(
-            HloInstruction::CreateGetTupleElement(scalar_s32, param, 1)),
-        body_builder.AddInstruction(
-            HloInstruction::CreateGetTupleElement(scalar_s32, param, 0)),
-    }));
-    body = module().AddEmbeddedComputation(body_builder.Build());
+// A while loop that does nothing besides swapping tuple elements can't be
+// simplified, because the result of the swapping is visible to users of the
+// loop.
+TEST_F(WhileLoopSimplifierTest, LoopSwappingTupleElementsNotSimplified) {
+  const string hlo_string = R"(
+  HloModule SwappingTupleElements
+  SwappingTupleElements.body {
+    loop_var = (s32[], s32[]) parameter(0)
+    get-tuple-element = s32[] get-tuple-element((s32[], s32[]) loop_var),index=1
+    get-tuple-element.1 = s32[] get-tuple-element((s32[], s32[]) loop_var),
+      index=0
+    ROOT tuple = (s32[], s32[]) tuple(s32[] get-tuple-element,
+      s32[] get-tuple-element.1)
   }
+  SwappingTupleElements.always_true {
+   param = (s32[], s32[]) parameter(0)
+   ROOT constant = pred[] constant(true)
+  }
+  ENTRY SwappingTupleElements {
+   x = s32[] parameter(0)
+   y = s32[] parameter(1)
+   tuple.1 = (s32[], s32[]) tuple(s32[] x, s32[] y)
+   ROOT while = (s32[], s32[]) while((s32[], s32[]) tuple.1),
+     condition=SwappingTupleElements.always_true,
+     body=SwappingTupleElements.body
+  }
+  )";
 
-  builder.AddInstruction(HloInstruction::CreateWhile(
-      loop_init->shape(), condition, body, loop_init));
-
-  module().AddEntryComputation(builder.Build());
+  ParseAndVerifyModule(hlo_string);
   EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
 }
 
 // Construct a loop where we assign a constant to tuple element 0 in each
 // iteration.  We can't eliminate tuple element 0, even though we never use its
 // value.
-TEST_F(WhileLoopSimplifierTest, UnusedButModifiedTupleElement) {
-  HloComputation::Builder builder(TestName());
-  auto loop_init = builder.AddInstruction(
-      HloInstruction::CreateTuple({builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<int32>(0)))}));
-
-  HloComputation* condition =
-      MakeAlwaysTrueComputation(loop_init->shape(), &module());
-  HloComputation* body;
-  {
-    HloComputation::Builder body_builder(TestName() + ".body");
-    body_builder.AddInstruction(
-        HloInstruction::CreateParameter(0, loop_init->shape(), "loop_var"));
-    body_builder.AddInstruction(HloInstruction::CreateTuple({
-        body_builder.AddInstruction(
-            HloInstruction::CreateConstant(Literal::CreateR0<int32>(1))),
-    }));
-    body = module().AddEmbeddedComputation(body_builder.Build());
+TEST_F(WhileLoopSimplifierTest,
+       LoopWithUnusedButModifiedTupleElementNotSimplified) {
+  const string hlo_string = R"(
+  HloModule UnusedButModifiedTupleElement
+  UnusedButModifiedTupleElement.body {
+    loop_var = (s32[]) parameter(0)
+    constant.1 = s32[] constant(1)
+    ROOT tuple = (s32[]) tuple(s32[] constant.1)
   }
+  UnusedButModifiedTupleElement.always_true {
+    param = (s32[]) parameter(0)
+    ROOT constant = pred[] constant(true)
+  }
+  ENTRY UnusedButModifiedTupleElement {
+    constant.2 = s32[] constant(0)
+    tuple.1 = (s32[]) tuple(s32[] constant.2)
+    ROOT while = (s32[]) while((s32[]) tuple.1),
+      condition=UnusedButModifiedTupleElement.always_true,
+      body=UnusedButModifiedTupleElement.body
+  }
+  )";
 
-  builder.AddInstruction(HloInstruction::CreateWhile(
-      loop_init->shape(), condition, body, loop_init));
-
-  module().AddEntryComputation(builder.Build());
+  ParseAndVerifyModule(hlo_string);
   EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
 }
 
 // Nothing to simplify in a while loop whose tuple has 0 elements.
-TEST_F(WhileLoopSimplifierTest, EmptyTuple) {
-  HloComputation::Builder builder(TestName());
-  auto loop_init = builder.AddInstruction(HloInstruction::CreateTuple({}));
-
-  HloComputation* condition =
-      MakeAlwaysTrueComputation(loop_init->shape(), &module());
-  HloComputation* body;
-  {
-    HloComputation::Builder body_builder(TestName() + ".body");
-    body_builder.AddInstruction(
-        HloInstruction::CreateParameter(0, loop_init->shape(), "loop_var"));
-    body_builder.AddInstruction(HloInstruction::CreateTuple({}));
-    body = module().AddEmbeddedComputation(body_builder.Build());
+TEST_F(WhileLoopSimplifierTest, LoopWithEmptyTupleNotSimplified) {
+  const string hlo_string = R"(
+  HloModule EmptyTuple
+  EmptyTuple.body {
+    loop_var = () parameter(0)
+    ROOT tuple = () tuple()
+  }
+  EmptyTuple.always_true {
+   param = () parameter(0)
+   ROOT constant = pred[] constant(true)
+  }
+  ENTRY EmptyTuple {
+   tuple.1 = () tuple()
+   ROOT while = () while(() tuple.1), condition=EmptyTuple.always_true,
+     body=EmptyTuple.body
   }
+  )";
 
-  builder.AddInstruction(HloInstruction::CreateWhile(
-      loop_init->shape(), condition, body, loop_init));
-  module().AddEntryComputation(builder.Build());
+  ParseAndVerifyModule(hlo_string);
   EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
 }
 
 // While loop where one tuple element is used twice in the body, and thus can't
 // be simplified away.
-TEST_F(WhileLoopSimplifierTest, ElemUsedTwice) {
-  HloComputation::Builder builder(TestName());
-  auto loop_init = builder.AddInstruction(HloInstruction::CreateTuple({
-      builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<int32>(0))),
-      builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<int32>(1))),
-  }));
-
-  HloComputation* condition =
-      MakeAlwaysTrueComputation(loop_init->shape(), &module());
-
-  auto scalar_s32 = ShapeUtil::MakeShape(S32, {});
-  HloComputation* body;
-  {
-    HloComputation::Builder body_builder(TestName() + ".body");
-    auto* param = body_builder.AddInstruction(
-        HloInstruction::CreateParameter(0, loop_init->shape(), "param0"));
-    auto* gte0 = body_builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(scalar_s32, param, /*index=*/0));
-    // get0 is used twice in the loop body's tuple.
-    body_builder.AddInstruction(HloInstruction::CreateTuple({gte0, gte0}));
-    body = module().AddEmbeddedComputation(body_builder.Build());
+TEST_F(WhileLoopSimplifierTest, LoopWithElemUsedTwiceNotSimplified) {
+  const string hlo_string = R"(
+  HloModule ElemUsedTwice
+  ElemUsedTwice.body {
+    param0 = (s32[], s32[]) parameter(0)
+    get-tuple-element = s32[] get-tuple-element((s32[], s32[]) param0), index=0
+    ROOT tuple = (s32[], s32[]) tuple(s32[] get-tuple-element,
+      s32[] get-tuple-element)
+  }
+  ElemUsedTwice.always_true {
+    param = (s32[], s32[]) parameter(0)
+    ROOT constant = pred[] constant(true)
   }
+  ENTRY ElemUsedTwice {
+   x = s32[] parameter(0)
+   y = s32[] parameter(1)
+   tuple.1 = (s32[], s32[]) tuple(s32[] x, s32[] y)
+   ROOT while = (s32[], s32[]) while((s32[], s32[]) tuple.1),
+     condition=ElemUsedTwice.always_true, body=ElemUsedTwice.body
+  }
+  )";
 
-  builder.AddInstruction(HloInstruction::CreateWhile(
-      loop_init->shape(), condition, body, loop_init));
-  module().AddEntryComputation(builder.Build());
+  ParseAndVerifyModule(hlo_string);
   EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
 }
 
 // This while loop has three tuple elements.  Element 0 is unused and should be
 // removed. Element 1 is used by the loop body, and element 2 is used by the
 // loop condition; these two should stay.
-TEST_F(WhileLoopSimplifierTest, RemoveUnusedOperand) {
-  HloComputation::Builder builder(TestName());
-  auto loop_init = builder.AddInstruction(HloInstruction::CreateTuple({
-      builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<int32>(0))),
-      builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<int32>(0))),
-      builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<int32>(0))),
-  }));
-  auto loop_shape = loop_init->shape();
-  auto scalar_s32 = ShapeUtil::MakeShape(S32, {});
-
-  HloComputation* condition;
-  {
-    HloComputation::Builder cond_builder(TestName() + ".loop_condition");
-    auto param = cond_builder.AddInstruction(
-        HloInstruction::CreateParameter(0, loop_shape, "param0"));
-    cond_builder.AddInstruction(HloInstruction::CreateBinary(
-        ShapeUtil::MakeShape(PRED, {}), HloOpcode::kEq,
-        cond_builder.AddInstruction(
-            HloInstruction::CreateConstant(Literal::CreateR0<int32>(0))),
-        cond_builder.AddInstruction(HloInstruction::CreateGetTupleElement(
-            scalar_s32, param, /*index=*/2))));
-    condition = module().AddEmbeddedComputation(cond_builder.Build());
+TEST_F(WhileLoopSimplifierTest, RemoveUnusedLoopOperands) {
+  const string hlo_string = R"(
+  HloModule RemoveUnusedOperands
+  RemoveUnusedOperands.body {
+    loop_var = (s32[], s32[], s32[]) parameter(0)
+    get-tuple-element.1 = s32[] get-tuple-element((s32[], s32[],
+      s32[]) loop_var), index=0
+    get-tuple-element.2 = s32[] get-tuple-element((s32[], s32[],
+      s32[]) loop_var), index=1
+    constant.1 = s32[] constant(1)
+    add = s32[] add(s32[] get-tuple-element.2, s32[] constant.1)
+    get-tuple-element.3 = s32[] get-tuple-element((s32[], s32[], s32[])
+      loop_var), index=2
+    ROOT tuple = (s32[], s32[], s32[]) tuple(s32[] get-tuple-element.1,
+      s32[] add, s32[] get-tuple-element.3)
   }
-
-  HloComputation* body;
-  {
-    HloComputation::Builder body_builder(TestName() + ".body");
-    auto* param = body_builder.AddInstruction(
-        HloInstruction::CreateParameter(0, loop_shape, "loop_var"));
-
-    auto* tuple0 = body_builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(scalar_s32, param, /*index=*/0));
-    auto* tuple1 = body_builder.AddInstruction(HloInstruction::CreateBinary(
-        scalar_s32, HloOpcode::kAdd,
-        body_builder.AddInstruction(HloInstruction::CreateGetTupleElement(
-            scalar_s32, param, /*index=*/1)),
-        body_builder.AddInstruction(
-            HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)))));
-    auto* tuple2 = body_builder.AddInstruction(
-        HloInstruction::CreateGetTupleElement(scalar_s32, param, /*index=*/2));
-    body_builder.AddInstruction(
-        HloInstruction::CreateTuple({tuple0, tuple1, tuple2}));
-
-    body = module().AddEmbeddedComputation(body_builder.Build());
+  RemoveUnusedOperands.loop_condition {
+    constant.2 = s32[] constant(0)
+    param0 = (s32[], s32[], s32[]) parameter(0)
+    get-tuple-element = s32[] get-tuple-element((s32[], s32[], s32[]) param0),
+      index=2
+    ROOT equal-to = pred[] equal-to(s32[] constant.2, s32[] get-tuple-element)
   }
+  ENTRY RemoveUnusedOperands {
+    x = s32[] parameter(0)
+    constant.3 = s32[] constant(0)
+    y = s32[] parameter(1)
+    tuple.1 = (s32[], s32[], s32[]) tuple(s32[] x, s32[] constant.3,
+      s32[] y)
+    ROOT while = (s32[], s32[], s32[]) while((s32[], s32[], s32[]) tuple.1),
+      condition=RemoveUnusedOperands.loop_condition,
+      body=RemoveUnusedOperands.body
+  }
+  )";
+
+  ParseAndVerifyModule(hlo_string);
+  HloModule* the_module = &module();
+  EXPECT_TRUE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
+
+  // The original while instruction is still left in the module as a dead
+  // instruction, so find the while instruction whose name differs from
+  // "while"; that one is the new while instruction.
+  HloInstruction* new_while_op =
+      *std::find_if(the_module->entry_computation()->instructions().begin(),
+                    the_module->entry_computation()->instructions().end(),
+                    [&](const HloInstruction* instr) {
+                      return (instr->opcode() == HloOpcode::kWhile &&
+                              instr->name() != "while");
+                    });
 
-  auto* while_op = builder.AddInstruction(HloInstruction::CreateWhile(
-      loop_init->shape(), condition, body, loop_init));
-
-  module().AddEntryComputation(builder.Build());
-  EXPECT_TRUE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
-
-  // We leave most of the checking to HloVerifiedTestBase, which runs the
-  // verifier on module() at the end of this test.
-  HloInstruction* new_while_op = *std::find_if(
-      module().entry_computation()->instructions().begin(),
-      module().entry_computation()->instructions().end(),
-      [&](const HloInstruction* instr) {
-        return instr != while_op && instr->opcode() == HloOpcode::kWhile;
-      });
+  auto scalar_s32 = ShapeUtil::MakeShape(S32, {});
   EXPECT_TRUE(
       ShapeUtil::Equal(new_while_op->shape(),
                        ShapeUtil::MakeTupleShape({scalar_s32, scalar_s32})))
@@ -418,31 +421,91 @@ TEST_F(WhileLoopSimplifierTest, RemoveUnusedOperand) {
                      op::GetTupleElement(op::Parameter(0), /*tuple_index=*/1)));
 }
 
-TEST_F(WhileLoopSimplifierTest, BodyHasNonTupleRoot) {
-  auto scalar_s32 = ShapeUtil::MakeShape(S32, {});
-  Shape while_shape = ShapeUtil::MakeTupleShape({scalar_s32, scalar_s32});
-
-  HloComputation* while_body = [&]() {
-    HloComputation::Builder builder(TestName() + ".passthrough");
-    HloInstruction* param = builder.AddInstruction(
-        HloInstruction::CreateParameter(0, while_shape, "param"));
-    HloComputation* result = module().AddEmbeddedComputation(builder.Build());
-
-    result->AddInstruction(
-        HloInstruction::CreateGetTupleElement(scalar_s32, param, 1));
-    return result;
-  }();
-
-  HloComputation::Builder builder(TestName());
-  auto* init_value = builder.AddInstruction(
-      HloInstruction::CreateParameter(0, while_shape, "init_value"));
-  builder.AddInstruction(HloInstruction::CreateWhile(
-      while_shape, MakeAlwaysTrueComputation(while_shape, &module()),
-      while_body, init_value));
-  module().AddEntryComputation(builder.Build());
-  TF_ASSERT_OK_AND_ASSIGN(bool simplified_loop,
-                          WhileLoopSimplifier{}.Run(&module()));
-  EXPECT_FALSE(simplified_loop);
+TEST_F(WhileLoopSimplifierTest, LoopWithNonTupleBodyShapeNotSimplified) {
+  const string hlo_string = R"(
+  HloModule BodyHasNonTupleRoot
+  BodyHasNonTupleRoot.passthrough {
+    ROOT param = (s32[], s32[]) parameter(0)
+  }
+  BodyHasNonTupleRoot.always_true {
+    param.1 = (s32[], s32[]) parameter(0)
+    ROOT constant = pred[] constant(true)
+  }
+  ENTRY BodyHasNonTupleRoot {
+    init_value = (s32[], s32[]) parameter(0)
+    ROOT while = (s32[], s32[]) while((s32[], s32[]) init_value),
+      condition=BodyHasNonTupleRoot.always_true,
+      body=BodyHasNonTupleRoot.passthrough
+  }
+  )";
+
+  ParseAndVerifyModule(hlo_string);
+  EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
+}
+
+TEST_F(WhileLoopSimplifierTest,
+       LoopWithNonTupleBodyRootInstructionNotSimplified) {
+  const string hlo_string = R"(
+  HloModule SimpleLoop
+  SimpleLoop.body {
+    loop_var.1 = (s32[], s32[3]{0}) parameter(0)
+    get-tuple-element.1 = s32[] get-tuple-element(loop_var.1), index=0
+    constant.1 = s32[] constant(1)
+    add = s32[] add(get-tuple-element.1, constant.1)
+    get-tuple-element.2 = s32[3]{0} get-tuple-element(loop_var.1), index=1
+    multiply = s32[3]{0} multiply(get-tuple-element.2, get-tuple-element.2)
+    ROOT custom-call = (s32[], s32[3]{0}) custom-call(add, multiply),
+      custom_call_target="x"
+  }
+  SimpleLoop.condition {
+    loop_var.2 = (s32[], s32[3]{0}) parameter(0)
+    get-tuple-element.3 = s32[] get-tuple-element(loop_var.2), index=0
+    constant.2 = s32[] constant(44)
+    ROOT less-than = pred[] less-than(get-tuple-element.3, constant.2)
+  }
+  ENTRY SimpleLoop {
+    constant.3 = s32[] constant(42)
+    constant.4 = s32[3]{0} constant({0, 1, 2})
+    tuple.1 = (s32[], s32[3]{0}) tuple(constant.3, constant.4)
+    ROOT while = (s32[], s32[3]{0}) while(tuple.1), condition=
+      SimpleLoop.condition, body=SimpleLoop.body
+  }
+  )";
+
+  ParseAndVerifyModule(hlo_string);
+  EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
+}
+
+TEST_F(WhileLoopSimplifierTest, LoopWithArrayConstantNotSimplified) {
+  const string hlo_string = R"(
+  HloModule SimpleLoop
+  SimpleLoop.body {
+    loop_var.1 = (s32[], s32[3]{0}, s32[3]{0}) parameter(0)
+    get-tuple-element.1 = s32[] get-tuple-element(loop_var.1), index=0
+    constant.1 = s32[] constant(1)
+    add = s32[] add(get-tuple-element.1, constant.1)
+    get-tuple-element.2 = s32[3]{0} get-tuple-element(loop_var.1), index=1
+    get-tuple-element.3 = s32[3]{0} get-tuple-element(loop_var.1), index=2
+    add.2 = s32[3]{0} add(get-tuple-element.2, get-tuple-element.3)
+    ROOT tuple = (s32[], s32[3]{0}, s32[3]{0}) tuple(add, add.2, get-tuple-element.3)
+  }
+  SimpleLoop.condition {
+    loop_var.2 = (s32[], s32[3]{0}, s32[3]{0}) parameter(0)
+    get-tuple-element.4 = s32[] get-tuple-element(loop_var.2), index=0
+    constant.2 = s32[] constant(47)
+    ROOT less-than = pred[] less-than(get-tuple-element.4, constant.2)
+  }
+  ENTRY SimpleLoop {
+    constant.3 = s32[] constant(42)
+    constant.4 = s32[3]{0} constant({0, 1, 2})
+    tuple.1 = (s32[], s32[3]{0}, s32[3]{0}) tuple(constant.3, constant.4, constant.4)
+    ROOT while = (s32[], s32[3]{0}, s32[3]{0}) while(tuple.1), condition=
+      SimpleLoop.condition, body=SimpleLoop.body
+  }
+  )";
+
+  ParseAndVerifyModule(hlo_string);
+  EXPECT_FALSE(WhileLoopSimplifier().Run(&module()).ValueOrDie());
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/service/while_util.cc b/tensorflow/compiler/xla/service/while_util.cc
index e20b25e4a08a946f6b58575a4d4e557744f8035c..bd0794184328b7926543c4275b3b915f51e7b812 100644
--- a/tensorflow/compiler/xla/service/while_util.cc
+++ b/tensorflow/compiler/xla/service/while_util.cc
@@ -15,18 +15,21 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/while_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_creation_utils.h"
 #include "tensorflow/compiler/xla/service/tuple_util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
 
 namespace xla {
 
+using tensorflow::strings::StrCat;
+
 static StatusOr<HloComputation*> WidenWhileCondition(
     HloComputation* narrow_condition, const Shape& wide_shape) {
   const Shape& narrow_shape =
       narrow_condition->parameter_instruction(0)->shape();
 
   HloComputation* wide_while_cond = [&]() {
-    HloComputation::Builder builder(
-        tensorflow::strings::StrCat("wide.", narrow_condition->name()));
+    HloComputation::Builder builder(StrCat("wide.", narrow_condition->name()));
     builder.AddInstruction(
         HloInstruction::CreateParameter(0, wide_shape, "wide_param"));
 
@@ -57,8 +60,7 @@ WidenWhileBody(HloComputation* narrow_body, const Shape& wide_shape) {
   const Shape& narrow_shape = narrow_body->parameter_instruction(0)->shape();
 
   HloComputation* wide_while_body = [&]() {
-    HloComputation::Builder builder(
-        tensorflow::strings::StrCat("wide.", narrow_body->name()));
+    HloComputation::Builder builder(StrCat("wide.", narrow_body->name()));
     builder.AddInstruction(
         HloInstruction::CreateParameter(0, wide_shape, "wide_param"));
     return narrow_body->parent()->AddEmbeddedComputation(builder.Build());
@@ -137,4 +139,109 @@ WhileUtil::MakeInstructionsLiveIn(
 
   return std::move(result);
 }
+
+static StatusOr<std::unique_ptr<HloComputation>>
+MakeCountedLoopConditionComputation(const Shape& loop_state_shape,
+                                    int32 trip_count) {
+  Shape scalar_pred = ShapeUtil::MakeShape(PRED, {});
+
+  TF_ASSIGN_OR_RETURN(std::unique_ptr<HloComputation> cond_computation,
+                      CreateComputationWithSignature(
+                          {&loop_state_shape}, scalar_pred, "while_cond"));
+
+  HloInstruction* trip_count_constant = cond_computation->AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int32>(trip_count)));
+
+  HloInstruction* param = cond_computation->parameter_instruction(0);
+  TF_ASSIGN_OR_RETURN(HloInstruction * indvar,
+                      MakeGetTupleElementHlo(param, 0));
+
+  TF_ASSIGN_OR_RETURN(
+      HloInstruction * compare,
+      MakeBinaryHlo(HloOpcode::kLt, indvar, trip_count_constant));
+  cond_computation->set_root_instruction(compare);
+  return std::move(cond_computation);
+}
+
+static StatusOr<std::unique_ptr<HloComputation>> MakeCountedLoopBodyComputation(
+    const Shape& loop_state_shape,
+    const std::function<StatusOr<WhileUtil::LoopStateTy>(
+        HloInstruction*, const WhileUtil::LoopStateTy&)>& loop_body_generator) {
+  TF_ASSIGN_OR_RETURN(std::unique_ptr<HloComputation> body_computation,
+                      CreateComputationWithSignature(
+                          {&loop_state_shape}, loop_state_shape, "while_body"));
+  HloInstruction* one = body_computation->AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+  HloInstruction* param = body_computation->parameter_instruction(0);
+  TF_ASSIGN_OR_RETURN(HloInstruction * indvar,
+                      MakeGetTupleElementHlo(param, 0));
+  TF_ASSIGN_OR_RETURN(HloInstruction * next_indvar,
+                      MakeBinaryHlo(HloOpcode::kAdd, indvar, one));
+
+  std::vector<HloInstruction*> loop_body_generator_args;
+  for (int64 i = 1, e = loop_state_shape.tuple_shapes_size(); i < e; i++) {
+    TF_ASSIGN_OR_RETURN(HloInstruction * tuple_element,
+                        MakeGetTupleElementHlo(param, i));
+    loop_body_generator_args.push_back(tuple_element);
+  }
+  TF_ASSIGN_OR_RETURN(std::vector<HloInstruction*> next_state,
+                      loop_body_generator(indvar, loop_body_generator_args));
+  next_state.insert(next_state.begin(), next_indvar);
+  HloInstruction* next_state_tuple =
+      body_computation->AddInstruction(HloInstruction::CreateTuple(next_state));
+  body_computation->set_root_instruction(next_state_tuple);
+
+  return std::move(body_computation);
+}
+
+static StatusOr<HloInstruction*> MakeInitTupleFromInitValues(
+    HloComputation* computation, const WhileUtil::LoopStateTy& init_values) {
+  std::vector<HloInstruction*> init_values_with_indvar;
+  init_values_with_indvar.reserve(init_values.size() + 1);
+  HloInstruction* zero = computation->AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int32>(0)));
+  init_values_with_indvar.push_back(zero);
+  c_copy(init_values, std::back_inserter(init_values_with_indvar));
+  return computation->AddInstruction(
+      HloInstruction::CreateTuple(init_values_with_indvar));
+}
+
+static Shape MakeLoopStateShape(const WhileUtil::LoopStateTy& init_values) {
+  std::vector<Shape> loop_state_shape_components;
+  loop_state_shape_components.reserve(init_values.size() + 1);
+  loop_state_shape_components.push_back(ShapeUtil::MakeShape(S32, {}));
+  c_transform(init_values, std::back_inserter(loop_state_shape_components),
+              [](HloInstruction* instr) { return instr->shape(); });
+  return ShapeUtil::MakeTupleShape(loop_state_shape_components);
+}
+
+/*static*/ StatusOr<WhileUtil::LoopStateTy> WhileUtil::MakeCountedLoop(
+    HloComputation* computation, int32 trip_count,
+    const WhileUtil::LoopStateTy& init_values,
+    const WhileUtil::LoopBodyGeneratorTy& loop_body_generator) {
+  CHECK_GE(trip_count, 0);
+
+  Shape loop_state_shape = MakeLoopStateShape(init_values);
+  TF_ASSIGN_OR_RETURN(
+      std::unique_ptr<HloComputation> cond,
+      MakeCountedLoopConditionComputation(loop_state_shape, trip_count));
+  TF_ASSIGN_OR_RETURN(
+      std::unique_ptr<HloComputation> body,
+      MakeCountedLoopBodyComputation(loop_state_shape, loop_body_generator));
+  TF_ASSIGN_OR_RETURN(HloInstruction * init_tuple,
+                      MakeInitTupleFromInitValues(computation, init_values));
+  HloModule* module = computation->parent();
+  HloInstruction* while_instr =
+      computation->AddInstruction(HloInstruction::CreateWhile(
+          loop_state_shape, module->AddEmbeddedComputation(std::move(cond)),
+          module->AddEmbeddedComputation(std::move(body)), init_tuple));
+
+  std::vector<HloInstruction*> result;
+  for (int64 i = 0, e = init_values.size(); i < e; i++) {
+    TF_ASSIGN_OR_RETURN(HloInstruction * user_state,
+                        MakeGetTupleElementHlo(while_instr, i + 1));
+    result.push_back(user_state);
+  }
+  return result;
+}
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/while_util.h b/tensorflow/compiler/xla/service/while_util.h
index 3600b5a80d26e37fdb7d5173c3b8743734306390..1688d4674269c36c5b356f262dbd5d958572e101 100644
--- a/tensorflow/compiler/xla/service/while_util.h
+++ b/tensorflow/compiler/xla/service/while_util.h
@@ -52,6 +52,28 @@ class WhileUtil {
   static StatusOr<MakeInstructionsLiveInResult> MakeInstructionsLiveIn(
       HloInstruction* while_instr,
       tensorflow::gtl::ArraySlice<HloInstruction*> instructions);
+
+  using LoopStateTy = std::vector<HloInstruction*>;
+  using LoopBodyGeneratorTy = std::function<StatusOr<LoopStateTy>(
+      HloInstruction* /*induction_var*/,
+      const LoopStateTy& /*current_values*/)>;
+
+  // Creates a while loop in `computation` that runs for `trip_count`
+  // iterations.  The structure of the while loop is as follows, in pseudocode:
+  //
+  //  loop_state while_loop() {
+  //    indvar = 0;
+  //    loop_state = init_values
+  //    while (indvar < trip_count) {
+  //      loop_state = loop_body_generator(indvar, loop_state)
+  //      indvar++;
+  //    }
+  //    return loop_state;
+  //  }
+  static StatusOr<LoopStateTy> MakeCountedLoop(
+      HloComputation* computation, int32 trip_count,
+      const LoopStateTy& init_values,
+      const LoopBodyGeneratorTy& loop_body_generator);
 };
 }  // namespace xla
 
diff --git a/tensorflow/compiler/xla/service_interface.h b/tensorflow/compiler/xla/service_interface.h
index 809941d8fe1f63d66bf104e66eea66167a0f509d..d8235113dd800f7bab5ceb70272a598b9dcb1fbe 100644
--- a/tensorflow/compiler/xla/service_interface.h
+++ b/tensorflow/compiler/xla/service_interface.h
@@ -54,6 +54,9 @@ class ServiceInterface {
   virtual tensorflow::Status Execute(const ExecuteRequest* arg,
                                      ExecuteResponse* result) = 0;
 
+  virtual tensorflow::Status ExecuteGraph(const ExecuteGraphRequest* arg,
+                                          ExecuteResponse* result) = 0;
+
   virtual tensorflow::Status ExecuteParallel(
       const ExecuteParallelRequest* arg, ExecuteParallelResponse* result) = 0;
 
diff --git a/tensorflow/compiler/xla/shape_tree.h b/tensorflow/compiler/xla/shape_tree.h
index 280f02e88675381bd75108bfae0dd22c462ba718..ffaa40c2d673a2365342371ed8dab59565d1d08f 100644
--- a/tensorflow/compiler/xla/shape_tree.h
+++ b/tensorflow/compiler/xla/shape_tree.h
@@ -53,7 +53,7 @@ struct ShapeTreeNode {
   ShapeTreeNode(const ShapeTreeNode& other)
       : data(other.data), children(other.children.size()) {
     for (size_t i = 0; i < children.size(); ++i) {
-      children[i] = MakeUnique<ShapeTreeNode>(*other.children[i]);
+      children[i] = ::xla::MakeUnique<ShapeTreeNode>(*other.children[i]);
     }
   }
 
@@ -62,7 +62,7 @@ struct ShapeTreeNode {
       data = other.data;
       children.resize(other.children.size());
       for (size_t i = 0; i < children.size(); ++i) {
-        children[i] = MakeUnique<ShapeTreeNode>(*other.children[i]);
+        children[i] = ::xla::MakeUnique<ShapeTreeNode>(*other.children[i]);
       }
     }
     return *this;
@@ -445,7 +445,7 @@ class ShapeTreeIterator : public std::iterator<std::forward_iterator_tag,
     for (auto& node_and_index : stack_) {
       index.push_back(node_and_index.second);
     }
-    current_ = MakeUnique<value_type>(index, node_->data);
+    current_ = ::xla::MakeUnique<value_type>(index, node_->data);
     return *current_;
   }
 
@@ -492,7 +492,7 @@ void ShapeTree<T>::InitChildren(const Shape& shape, Node* node) {
 template <typename T>
 ShapeTree<T>::ShapeTree(Shape shape)
     : root_(),
-      shape_storage_(MakeUnique<Shape>(std::move(shape))),
+      shape_storage_(::xla::MakeUnique<Shape>(std::move(shape))),
       shape_(shape_storage_.get()) {
   // The shape_ field is just used to hold the structure of the shape.
   // It should not be relied upon to store layout information.
@@ -508,7 +508,7 @@ ShapeTree<T>::ShapeTree(const Shape* shape) : root_(), shape_(shape) {
 template <typename T>
 ShapeTree<T>::ShapeTree(Shape shape, const T& init_value)
     : root_(init_value),
-      shape_storage_(MakeUnique<Shape>(std::move(shape))),
+      shape_storage_(::xla::MakeUnique<Shape>(std::move(shape))),
       shape_(shape_storage_.get()) {
   // The shape_ field is just used to hold the structure of the shape.
   // It should not be relied upon to store layout information.
diff --git a/tensorflow/compiler/xla/shape_util.cc b/tensorflow/compiler/xla/shape_util.cc
index 604e0173e789348923316174873f58058eaf2815..4f604e6f7cb18c1aaf844967d54e3b0e07e54b34 100644
--- a/tensorflow/compiler/xla/shape_util.cc
+++ b/tensorflow/compiler/xla/shape_util.cc
@@ -609,6 +609,8 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
 
 /* static */ bool ShapeUtil::SameDimensions(const Shape& lhs,
                                             const Shape& rhs) {
+  CHECK(ShapeUtil::IsArray(lhs));
+  CHECK(ShapeUtil::IsArray(rhs));
   return ContainersEqual(lhs.dimensions(), rhs.dimensions());
 }
 
@@ -617,7 +619,10 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
     return rhs.element_type() == TUPLE &&
            ContainersEqual(lhs.tuple_shapes(), rhs.tuple_shapes(), Compatible);
   }
-  return SameDimensions(lhs, rhs) && SameElementType(lhs, rhs);
+  if (lhs.element_type() == OPAQUE) {
+    return rhs.element_type() == OPAQUE;
+  }
+  return SameElementType(lhs, rhs) && SameDimensions(lhs, rhs);
 }
 
 /* static */ bool ShapeUtil::CompatibleIgnoringElementType(const Shape& lhs,
@@ -627,7 +632,10 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
            ContainersEqual(lhs.tuple_shapes(), rhs.tuple_shapes(),
                            CompatibleIgnoringElementType);
   }
-  return SameDimensions(lhs, rhs);
+  if (lhs.element_type() == OPAQUE) {
+    return rhs.element_type() == OPAQUE;
+  }
+  return ShapeUtil::IsArray(rhs) && SameDimensions(lhs, rhs);
 }
 
 /* static */ bool ShapeUtil::CompatibleIgnoringFpPrecision(const Shape& lhs,
@@ -637,6 +645,9 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
            ContainersEqual(lhs.tuple_shapes(), rhs.tuple_shapes(),
                            CompatibleIgnoringFpPrecision);
   }
+  if (lhs.element_type() == OPAQUE) {
+    return rhs.element_type() == OPAQUE;
+  }
   if (SameElementTypeIgnoringFpPrecision(lhs, rhs)) {
     return CompatibleIgnoringElementType(lhs, rhs);
   }
@@ -1073,9 +1084,10 @@ ShapeUtil::DimensionsUnmodifiedByReshape(const Shape& input_shape,
 /* static */ bool ShapeUtil::TransposeIsBitcast(
     const Shape& input_shape, const Shape& output_shape,
     tensorflow::gtl::ArraySlice<int64> dimension_mapping) {
-  // Can't insert bitcasts without layout information.
-  if (!LayoutUtil::HasLayout(input_shape) &&
-      !LayoutUtil::HasLayout(output_shape)) {
+  CHECK(LayoutUtil::HasLayout(input_shape) &&
+        LayoutUtil::HasLayout(output_shape));
+
+  if (!SameElementType(input_shape, output_shape)) {
     return false;
   }
 
@@ -1106,9 +1118,10 @@ ShapeUtil::DimensionsUnmodifiedByReshape(const Shape& input_shape,
 
 /* static */ bool ShapeUtil::ReshapeIsBitcast(const Shape& input_shape,
                                               const Shape& output_shape) {
-  // Can't convert reshapes into bitcasts without layout information.
-  if (!LayoutUtil::HasLayout(input_shape) ||
-      !LayoutUtil::HasLayout(output_shape)) {
+  CHECK(LayoutUtil::HasLayout(input_shape) &&
+        LayoutUtil::HasLayout(output_shape));
+
+  if (!SameElementType(input_shape, output_shape)) {
     return false;
   }
 
diff --git a/tensorflow/compiler/xla/shape_util.h b/tensorflow/compiler/xla/shape_util.h
index 19b1aa93bd373ebd5f502d0dca56c9b31ab4fd7f..3e130a02e2ce853ee157e46afb9760f5ff5a5026 100644
--- a/tensorflow/compiler/xla/shape_util.h
+++ b/tensorflow/compiler/xla/shape_util.h
@@ -24,6 +24,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/layout_util.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
+#include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
@@ -208,6 +209,7 @@ class ShapeUtil {
 
   // Returns whether the LHS and RHS shapes have the same dimensions; note: does
   // not check element type.
+  // Precondition: IsArray(lhs) && IsArray(rhs)
   static bool SameDimensions(const Shape& lhs, const Shape& rhs);
 
   // Returns whether the lhs and rhs shapes have the same element type.
@@ -320,6 +322,15 @@ class ShapeUtil {
   static Shape MakeShape(PrimitiveType element_type,
                          tensorflow::gtl::ArraySlice<int64> dimensions);
 
+  // Creates a Shape with element type corresponding to T and the given
+  // dimensions.
+  template <typename T>
+  static Shape MakeShapeWithType(
+      tensorflow::gtl::ArraySlice<int64> dimensions) {
+    return ShapeUtil::MakeShape(primitive_util::NativeToPrimitiveType<T>(),
+                                dimensions);
+  }
+
   // Constructs a new shape with the given minor_to_major order in its Layout.
   // Returns a value shape such that shape.has_layout().
   static Shape MakeShapeWithLayout(
@@ -522,12 +533,16 @@ class ShapeUtil {
   // Returns whether a transpose from input_shape to output_shape with dimension
   // mapping "dimension_mapping" produces a result which is bit-wise identical
   // to its input and thus may be replaced with a bitcast.
+  //
+  // Precondition: Both input_shape and output_shape have explicit layouts.
   static bool TransposeIsBitcast(
       const Shape& input_shape, const Shape& output_shape,
       tensorflow::gtl::ArraySlice<int64> dimension_mapping);
 
   // Returns whether a reshape from "input_shape" to "output_shape" is a
   // bitcast.
+  //
+  // Precondition: Both input_shape and output_shape have explicit layouts.
   static bool ReshapeIsBitcast(const Shape& input_shape,
                                const Shape& output_shape);
 
@@ -560,16 +575,16 @@ class ShapeUtil {
   // The visitor_function visitor function should return true if it wants to
   // continue, or false otherwise.
   //
-  // visitor_function must be a callable of type bool(const std::vector<int64>&)
-  // or compatible.
+  // visitor_function must be a callable of type
+  // StatusOr<bool>(ArraySlice<int64>) or compatible.
   template <typename FnType>
-  static void ForEachIndex(const Shape& shape,
-                           tensorflow::gtl::ArraySlice<int64> base,
-                           tensorflow::gtl::ArraySlice<int64> count,
-                           tensorflow::gtl::ArraySlice<int64> incr,
-                           const FnType& visitor_function) {
+  static Status ForEachIndexWithStatus(const Shape& shape,
+                                       tensorflow::gtl::ArraySlice<int64> base,
+                                       tensorflow::gtl::ArraySlice<int64> count,
+                                       tensorflow::gtl::ArraySlice<int64> incr,
+                                       const FnType& visitor_function) {
     if (ShapeUtil::HasZeroElements(shape)) {
-      return;
+      return Status::OK();
     }
     CHECK_EQ(Rank(shape), base.size());
     CHECK_EQ(incr.size(), base.size());
@@ -579,7 +594,11 @@ class ShapeUtil {
     // once with the proper empty indexes.
     int64 n = -1;
     std::vector<int64> indexes(base.begin(), base.end());
-    while (n < rank && visitor_function(indexes)) {
+    while (n < rank) {
+      TF_ASSIGN_OR_RETURN(bool should_continue, visitor_function(indexes));
+      if (!should_continue) {
+        break;
+      }
       // Increments dimensions in minor to major order.
       for (n = 0; n < rank; ++n) {
         int64 dim = LayoutUtil::Minor(shape.layout(), n);
@@ -590,6 +609,37 @@ class ShapeUtil {
         indexes[dim] = base[dim];
       }
     }
+
+    return Status::OK();
+  }
+
+  // Iteration space for the ergonomic ForEachIndexWithStatus overload below.
+  struct IndexIterationSpace {
+    std::vector<int64> index_base;
+    std::vector<int64> index_count;
+    std::vector<int64> index_incr;
+  };
+
+  template <typename FnTy>
+  static Status ForEachIndexWithStatus(
+      const Shape& shape, const IndexIterationSpace& iteration_space,
+      FnTy&& function) {
+    return ShapeUtil::ForEachIndexWithStatus(
+        shape, iteration_space.index_base, iteration_space.index_count,
+        iteration_space.index_incr, std::forward<FnTy>(function));
+  }
+
+  template <typename FnType>
+  static void ForEachIndex(const Shape& shape,
+                           tensorflow::gtl::ArraySlice<int64> base,
+                           tensorflow::gtl::ArraySlice<int64> count,
+                           tensorflow::gtl::ArraySlice<int64> incr,
+                           const FnType& visitor_function) {
+    ForEachIndexWithStatus(shape, base, count, incr,
+                           [&](tensorflow::gtl::ArraySlice<int64> indices) {
+                             return StatusOr<bool>(visitor_function(indices));
+                           })
+        .IgnoreError();
   }
 
  private:
diff --git a/tensorflow/compiler/xla/shape_util_test.cc b/tensorflow/compiler/xla/shape_util_test.cc
index 4db97d45b20b86dc60531845c6e28a223203ff7f..424cfe37ea44d64884e08695fd1f49ca1970ca62 100644
--- a/tensorflow/compiler/xla/shape_util_test.cc
+++ b/tensorflow/compiler/xla/shape_util_test.cc
@@ -238,6 +238,18 @@ TEST(ShapeUtilTest, IncompatibleTuplesWithDifferentDimensions) {
   EXPECT_FALSE(ShapeUtil::Compatible(tuple1, tuple2));
 }
 
+TEST(ShapeUtilTest, IncompatibleScalarVsTuple) {
+  Shape shape1 = ShapeUtil::MakeShape(F32, {});
+  Shape shape2 = ShapeUtil::MakeTupleShape(
+      {ShapeUtil::MakeShape(F32, {3, 2}), ShapeUtil::MakeShape(U32, {})});
+  EXPECT_FALSE(ShapeUtil::Compatible(shape1, shape2));
+  EXPECT_FALSE(ShapeUtil::Compatible(shape2, shape1));
+  EXPECT_FALSE(ShapeUtil::CompatibleIgnoringElementType(shape1, shape2));
+  EXPECT_FALSE(ShapeUtil::CompatibleIgnoringElementType(shape2, shape1));
+  EXPECT_FALSE(ShapeUtil::CompatibleIgnoringFpPrecision(shape1, shape2));
+  EXPECT_FALSE(ShapeUtil::CompatibleIgnoringFpPrecision(shape2, shape1));
+}
+
 TEST(ShapeUtilTest, CompareShapesWithPaddedDimensionsMismatch) {
   Shape shape1 = ShapeUtil::MakeShape(F32, {20, 30});
   shape1.mutable_layout()->add_padded_dimensions(10);
@@ -573,10 +585,11 @@ TEST(ShapeUtilTest, ForEachIndex) {
     Shape shape = ShapeUtil::MakeShape(F32, data.dimensions);
     // Increments at every invocation.
     int invocations = 0;
-    auto increment_func = [&invocations](const std::vector<int64>& indexes) {
-      invocations++;
-      return true;
-    };
+    auto increment_func =
+        [&invocations](tensorflow::gtl::ArraySlice<int64> indexes) {
+          invocations++;
+          return true;
+        };
 
     std::vector<int64> zero_base(data.dimensions.size(), 0);
     std::vector<int64> step(data.dimensions.size(), 1);
@@ -588,6 +601,29 @@ TEST(ShapeUtilTest, ForEachIndex) {
   }
 }
 
+TEST(ShapeUtilTest, ForEachIndexWithStatus) {
+  Shape shape = ShapeUtil::MakeShape(F32, {10, 10});
+  // Increments at every invocation and returns an error on the fifth.
+  int invocations = 0;
+  auto increment_func =
+      [&invocations](
+          tensorflow::gtl::ArraySlice<int64> indexes) -> StatusOr<bool> {
+    if (++invocations == 5) {
+      return Unimplemented("Cannot increment beyond 5.");
+    }
+    return true;
+  };
+
+  Status error_status = ShapeUtil::ForEachIndexWithStatus(
+      shape, /*base=*/{0, 0}, /*count=*/{10, 10}, /*incr=*/{0, 1},
+      increment_func);
+
+  EXPECT_FALSE(error_status.ok());
+  EXPECT_THAT(error_status.error_message(),
+              ::testing::HasSubstr("Cannot increment beyond 5."));
+  EXPECT_EQ(invocations, 5);
+}
+
 TEST(ShapeUtilTest, DimensionsUnmodifiedByReshape_1x1x1x1_to_1x1x1) {
   // All output dimensions should be unmodified. One of the input dimensions is
   // modified because the input rank is larger by one.
diff --git a/tensorflow/compiler/xla/tests/BUILD b/tensorflow/compiler/xla/tests/BUILD
index 97abf217d7dc2e824b6bca3cb310f9cb847eb9ef..7fb7919674da4aae9a0b35fee949e069abd34c17 100644
--- a/tensorflow/compiler/xla/tests/BUILD
+++ b/tensorflow/compiler/xla/tests/BUILD
@@ -44,6 +44,7 @@ cc_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
     ],
+    alwayslink = True,
 )
 
 cc_library(
@@ -138,6 +139,7 @@ cc_library(
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_verifier",
+        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
     ],
@@ -334,6 +336,9 @@ xla_test(
 xla_test(
     name = "while_test",
     srcs = ["while_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
@@ -478,6 +483,7 @@ xla_test(
 xla_test(
     name = "conditional_test",
     srcs = ["conditional_test.cc"],
+    tags = ["enable_for_xla_interpreter"],
     deps = [
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:computation_builder",
@@ -494,6 +500,7 @@ xla_test(
 xla_test(
     name = "unary_op_test",
     srcs = ["unary_op_test.cc"],
+    tags = ["enable_for_xla_interpreter"],
     deps = [
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:computation_builder",
@@ -550,6 +557,9 @@ xla_test(
 xla_test(
     name = "deconstruct_tuple_test",
     srcs = ["deconstruct_tuple_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
@@ -662,6 +672,20 @@ xla_test(
     ],
 )
 
+xla_test(
+    name = "gather_operation_test",
+    srcs = ["gather_operation_test.cc"],
+    deps = [
+        ":client_library_test_base",
+        ":hlo_test_base",
+        "//tensorflow/compiler/xla:execution_options_util",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
+    ],
+)
+
 # Repeat dot_operation_runtime_test with single-threaded eigen.
 xla_test(
     name = "dot_operation_single_threaded_runtime_test",
@@ -942,6 +966,9 @@ xla_test(
     name = "dynamic_ops_test",
     timeout = "moderate",
     srcs = ["dynamic_ops_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:reference_util",
@@ -968,6 +995,9 @@ xla_test(
 xla_test(
     name = "tuple_test",
     srcs = ["tuple_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:literal_util",
@@ -979,6 +1009,7 @@ xla_test(
         "//tensorflow/compiler/xla/client:computation_builder",
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:test",
@@ -1143,6 +1174,9 @@ xla_test(
 xla_test(
     name = "call_test",
     srcs = ["call_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
@@ -1291,6 +1325,7 @@ xla_test(
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
+        "//tensorflow/compiler/xla/tests:test_utils",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
@@ -1646,6 +1681,9 @@ xla_test(
 xla_test(
     name = "fusion_test",
     srcs = ["fusion_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:literal_util",
diff --git a/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc b/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc
index 8b35259013200e96807446803c696451a8db80a9..6e21dda25d8e5151b31b8c2328253260595a94c4 100644
--- a/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc
+++ b/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc
@@ -1648,33 +1648,15 @@ XLA_TEST_F(ArrayElementwiseOpTest, SquareIn4DZeroElements) {
   ComputeAndCompareR4<float>(&builder, expected, {}, error_spec_);
 }
 
-// GPU backend emits nvvm intrinsic for fmin and fmax, whose semantics is NOT
-// such
-// * fmin(NaN, x) = x
-// * fmax(NaN, x) = x
-// so we only test NAN on CPU.
-//
-// TODO(b/28180546): Make this compile in a way that is consistent
-// among backends.
 XLA_TEST_F(ArrayElementwiseOpTest, MinF32s) {
   ComputationBuilder builder(client_, TestName());
-#if !defined(XLA_TEST_BACKEND_CPU)
-  auto lhs = builder.ConstantR1<float>({1.0f, 1.0f, 2.25f});
-  auto rhs = builder.ConstantR1<float>({2.0f, -5.0f, 1.0f});
-#else
   SetFastMathDisabled(true);
   auto lhs = builder.ConstantR1<float>({1.0f, 1.0f, 2.25f, NAN, 6.0f});
   auto rhs = builder.ConstantR1<float>({2.0f, -5.0f, 1.0f, 10.0f, NAN});
-#endif
   auto minimum = builder.Min(lhs, rhs);
 
-  ComputeAndCompareR1<float>(&builder,
-#if !defined(XLA_TEST_BACKEND_CPU)
-                             {1.0f, -5.0f, 1.0f},
-#else
-                             {1.0f, -5.0f, 1.0f, 10.0f, 6.0f},
-#endif
-                             {}, error_spec_);
+  ComputeAndCompareR1<float>(&builder, {1.0f, -5.0f, 1.0f, NAN, NAN}, {},
+                             error_spec_);
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, MinZeroElementF32s) {
@@ -1685,50 +1667,26 @@ XLA_TEST_F(ArrayElementwiseOpTest, MinZeroElementF32s) {
   ComputeAndCompareR1<float>(&builder, {}, {}, error_spec_);
 }
 
-// TODO(b/28180546): Make this compile in a way that is consistent
-// among backends. See comment on MinF32s test above.
 XLA_TEST_F(ArrayElementwiseOpTest, MinF64s) {
   ComputationBuilder builder(client_, TestName());
-#if !defined(XLA_TEST_BACKEND_CPU)
-  auto lhs = builder.ConstantR1<double>({1.0, 1.0, 2.25});
-  auto rhs = builder.ConstantR1<double>({2.0, -5.0, 1.0});
-#else
   SetFastMathDisabled(true);
   auto lhs = builder.ConstantR1<double>({1.0, 1.0, 2.25, NAN, 6.0});
   auto rhs = builder.ConstantR1<double>({2.0, -5.0, 1.0, 10.0, NAN});
-#endif
   auto minimum = builder.Min(lhs, rhs);
 
-  ComputeAndCompareR1<double>(&builder,
-#if !defined(XLA_TEST_BACKEND_CPU)
-                              {1.0, -5.0, 1.0},
-#else
-                              {1.0, -5.0, 1.0, 10.0, 6.0},
-#endif
-                              {}, error_spec_);
+  ComputeAndCompareR1<double>(&builder, {1.0, -5.0, 1.0, NAN, NAN}, {},
+                              error_spec_);
 }
 
-// TODO(b/28180546): Make this compile in a way that is consistent
-// among backends. See comment on MinF32s test above.
 XLA_TEST_F(ArrayElementwiseOpTest, MaxF32s) {
   ComputationBuilder builder(client_, TestName());
-#if !defined(XLA_TEST_BACKEND_CPU)
-  auto lhs = builder.ConstantR1<float>({1.0f, 1.0f, 2.25f});
-  auto rhs = builder.ConstantR1<float>({2.0f, -5.0f, 1.0f});
-#else
   SetFastMathDisabled(true);
   auto lhs = builder.ConstantR1<float>({1.0f, 1.0f, 2.25f, NAN, 6.0f});
   auto rhs = builder.ConstantR1<float>({2.0f, -5.0f, 1.0f, 10.0f, NAN});
-#endif
   auto maximum = builder.Max(lhs, rhs);
 
-  ComputeAndCompareR1<float>(&builder,
-#if !defined(XLA_TEST_BACKEND_CPU)
-                             {2.0f, 1.0f, 2.25f},
-#else
-                             {2.0f, 1.0f, 2.25f, 10.0f, 6.0f},
-#endif
-                             {}, error_spec_);
+  ComputeAndCompareR1<float>(&builder, {2.0f, 1.0f, 2.25f, NAN, NAN}, {},
+                             error_spec_);
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, MaxZeroElementF32s) {
@@ -1739,27 +1697,15 @@ XLA_TEST_F(ArrayElementwiseOpTest, MaxZeroElementF32s) {
   ComputeAndCompareR1<float>(&builder, {}, {}, error_spec_);
 }
 
-// TODO(b/28180546): Make this compile in a way that is consistent
-// among backends. See comment on MinF32s test above.
 XLA_TEST_F(ArrayElementwiseOpTest, MaxF64s) {
   ComputationBuilder builder(client_, TestName());
-#if !defined(XLA_TEST_BACKEND_CPU)
-  auto lhs = builder.ConstantR1<double>({1.0, 1.0, 2.25});
-  auto rhs = builder.ConstantR1<double>({2.0, -5.0, 1.0});
-#else
   SetFastMathDisabled(true);
   auto lhs = builder.ConstantR1<double>({1.0, 1.0, 2.25, NAN, 6.0});
   auto rhs = builder.ConstantR1<double>({2.0, -5.0, 1.0, 10.0, NAN});
-#endif
   auto maximum = builder.Max(lhs, rhs);
 
-  ComputeAndCompareR1<double>(&builder,
-#if !defined(XLA_TEST_BACKEND_CPU)
-                              {2.0, 1.0, 2.25},
-#else
-                              {2.0, 1.0, 2.25, 10.0, 6.0},
-#endif
-                              {}, error_spec_);
+  ComputeAndCompareR1<double>(&builder, {2.0, 1.0, 2.25, NAN, NAN}, {},
+                              error_spec_);
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, MaxS32s) {
diff --git a/tensorflow/compiler/xla/tests/broadcast_simple_test.cc b/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
index 03f5e08315bfed2bcb43ebb7098aaa0b97228605..97095f1cc427789845051a8fea24c95475286fe2 100644
--- a/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
+++ b/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
@@ -662,7 +662,7 @@ XLA_TEST_F(BroadcastSimpleTest, InvalidBinaryAndDegenerateBroadcasting) {
   auto result_status = Execute(&b, {});
   EXPECT_FALSE(result_status.ok());
   EXPECT_THAT(result_status.status().error_message(),
-              HasSubstr("broadcast dimension 0 mismatch"));
+              HasSubstr("dimension 0 mismatch"));
 }
 
 XLA_TEST_F(BroadcastSimpleTest, InvalidInDimensionBroadcasting) {
@@ -675,7 +675,7 @@ XLA_TEST_F(BroadcastSimpleTest, InvalidInDimensionBroadcasting) {
   auto result_status = Execute(&b, {});
   EXPECT_FALSE(result_status.ok());
   EXPECT_THAT(result_status.status().error_message(),
-              HasSubstr("binary op BINOP_ADD with incompatible shapes"));
+              HasSubstr("op BINOP_ADD with incompatible shapes"));
 }
 
 XLA_TEST_F(BroadcastSimpleTest, InvalidDegenerateBroadcasting) {
@@ -688,7 +688,7 @@ XLA_TEST_F(BroadcastSimpleTest, InvalidDegenerateBroadcasting) {
   auto result_status = Execute(&b, {});
   EXPECT_FALSE(result_status.ok());
   EXPECT_THAT(result_status.status().error_message(),
-              HasSubstr("binary op BINOP_ADD with incompatible shapes"));
+              HasSubstr("op BINOP_ADD with incompatible shapes"));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/tests/concat_test.cc b/tensorflow/compiler/xla/tests/concat_test.cc
index 1bcad5a3f37a37c9d482f3a5a899ac527666cca3..fb0e9c724a69b61801e6e0c2d07ef75b63a00465 100644
--- a/tensorflow/compiler/xla/tests/concat_test.cc
+++ b/tensorflow/compiler/xla/tests/concat_test.cc
@@ -75,7 +75,7 @@ XLA_TEST_F(ConcatTest, CannotConcatR0WithR0) {
   StatusOr<Computation> computation_status = builder.Build();
   ASSERT_FALSE(computation_status.ok());
   EXPECT_THAT(computation_status.status().ToString(),
-              HasSubstr("dimension to concatenate along out of bounds: 0"));
+              HasSubstr("out of bounds: 0"));
 }
 
 XLA_TEST_F(ConcatTest, Concat_R1_L0_With_R1_L0) {
diff --git a/tensorflow/compiler/xla/tests/conditional_test.cc b/tensorflow/compiler/xla/tests/conditional_test.cc
index bc821674820fb128823786d7149037fc59b22ab6..b917dee77b5400db8f2c0a6a86258fee64723d71 100644
--- a/tensorflow/compiler/xla/tests/conditional_test.cc
+++ b/tensorflow/compiler/xla/tests/conditional_test.cc
@@ -571,5 +571,56 @@ XLA_TEST_F(ConditionalOpTest, ShapeMismatch) {
                                    "only parameter of true_computation"));
 }
 
+XLA_TEST_F(ConditionalOpTest, SwappedInputsInSequentialConditionals) {
+  Shape tuple_shape = ShapeUtil::MakeTupleShape({r0f32_, r0f32_});
+  Computation swapper;
+  {
+    ComputationBuilder builder(client_, TestName() + ".swapper");
+    auto param0 = builder.Parameter(0, tuple_shape, "sp0");
+    auto x = builder.GetTupleElement(param0, 0);
+    auto y = builder.GetTupleElement(param0, 1);
+    builder.Tuple({y, x});
+    swapper = builder.Build().ConsumeValueOrDie();
+  }
+  Computation forwarder;
+  {
+    ComputationBuilder builder(client_, TestName() + ".forwarder");
+    auto param0 = builder.Parameter(0, tuple_shape, "fp0");
+    auto x = builder.GetTupleElement(param0, 0);
+    auto y = builder.GetTupleElement(param0, 1);
+    builder.Tuple({x, y});
+    forwarder = builder.Build().ConsumeValueOrDie();
+  }
+  Computation main;
+  {
+    ComputationBuilder builder(client_, TestName() + ".main");
+    auto param0 = builder.Parameter(0, tuple_shape, "mp0");
+    auto x = builder.GetTupleElement(param0, 0);
+    auto y = builder.GetTupleElement(param0, 1);
+    auto lt_pred = builder.Lt(x, y);
+    auto res = builder.Conditional(lt_pred, param0, forwarder, param0, swapper);
+    auto ge_pred = builder.Ge(x, y);
+    builder.Conditional(ge_pred, res, swapper, res, forwarder);
+    main = builder.Build().ConsumeValueOrDie();
+  }
+
+  auto test_swap = [&](float a, float b) {
+    ComputationBuilder builder(client_, TestName());
+    auto x = builder.ConstantR0<float>(a);
+    auto y = builder.ConstantR0<float>(b);
+    auto tuple_operand = builder.Tuple({x, y});
+    builder.Call(main, {tuple_operand});
+
+    ComputeAndCompareTuple(
+        &builder,
+        *Literal::MakeTuple({Literal::CreateR0<float>(a).get(),
+                             Literal::CreateR0<float>(b).get()}),
+        {}, error_spec_);
+  };
+
+  test_swap(3.11f, 9.4f);
+  test_swap(11.24f, 5.55f);
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/convert_test.cc b/tensorflow/compiler/xla/tests/convert_test.cc
index 59d6d7a4153be1b76ed8195a12a90cb103baa422..9a899b79141fbc35fabd8d2e5d4195fb589dd84c 100644
--- a/tensorflow/compiler/xla/tests/convert_test.cc
+++ b/tensorflow/compiler/xla/tests/convert_test.cc
@@ -26,6 +26,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/tests/test_macros.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/casts.h"
+#include "tensorflow/core/lib/math/math_util.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/platform/types.h"
@@ -177,6 +178,24 @@ XLA_TEST_F(ConvertTest, ConvertR1U32ToR1F32) {
   ComputeAndCompareR1<float>(&builder, expected, {arg_data.get()});
 }
 
+XLA_TEST_F(ConvertTest, ConvertR1F32ToR1U32) {
+  ComputationBuilder builder(client_, TestName());
+  std::vector<float> arg{0.0f,        1.0f,          16777216.0f,
+                         16777218.0f, 2147483647.0f, 4294967040.0f};
+  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<float>({arg});
+  auto arg_param = builder.Parameter(0, arg_literal->shape(), "arg_param");
+  std::unique_ptr<GlobalData> arg_data =
+      client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
+
+  builder.ConvertElementType(arg_param, U32);
+
+  std::vector<uint32> expected(arg.size());
+  for (int64 i = 0; i < arg.size(); ++i) {
+    expected[i] = static_cast<uint32>(arg[i]);
+  }
+  ComputeAndCompareR1<uint32>(&builder, expected, {arg_data.get()});
+}
+
 XLA_TEST_F(ConvertTest, ConvertR1U32ToR1S64) {
   ComputationBuilder builder(client_, TestName());
   std::vector<uint32> arg{0, 1, 0x1000, 0x7fffffff, 0x80000082, 0xFFFFFFFF};
@@ -366,5 +385,44 @@ XLA_TEST_F(ConvertTest, ConvertR1F32ToR1F16) {
 
   ComputeAndCompareR1<half>(&builder, expected_output, {dot_lhs_handle.get()});
 }
+
+XLA_TEST_F(ConvertTest, ConvertC64ToC64) {
+  ComputationBuilder builder(client_, TestName());
+  std::vector<complex64> x = {{42.0f, 64.0f}};
+  builder.ConvertElementType(builder.ConstantR1<complex64>(x), C64);
+  ComputeAndCompareR1<complex64>(&builder, x, {}, ErrorSpec(0.0001));
+}
+
+XLA_TEST_F(ConvertTest, ConvertS64S64) {
+  ComputationBuilder builder(client_, TestName());
+  std::vector<int64> x = {{-42, 64}};
+  builder.ConvertElementType(builder.ConstantR1<int64>(x), S64);
+  ComputeAndCompareR1<int64>(&builder, x, {});
+}
+
+XLA_TEST_F(ConvertTest, ConvertU64U64) {
+  ComputationBuilder builder(client_, TestName());
+  std::vector<uint64> x = {{42, 64}};
+  builder.ConvertElementType(builder.ConstantR1<uint64>(x), U64);
+  ComputeAndCompareR1<uint64>(&builder, x, {});
+}
+
+XLA_TEST_F(ConvertTest, ConvertU64S64) {
+  ComputationBuilder builder(client_, TestName());
+  std::vector<uint64> unsigned_x = {{42, UINT64_MAX}};
+  builder.ConvertElementType(builder.ConstantR1<uint64>(unsigned_x), S64);
+  std::vector<int64> signed_x = {{42, -1}};
+  ComputeAndCompareR1<int64>(&builder, signed_x, {});
+}
+
+XLA_TEST_F(ConvertTest, ConvertS64U64) {
+  ComputationBuilder builder(client_, TestName());
+  std::vector<int64> signed_x = {{42, -1, INT64_MIN}};
+  builder.ConvertElementType(builder.ConstantR1<int64>(signed_x), U64);
+  std::vector<uint64> unsigned_x = {
+      {42, UINT64_MAX, tensorflow::MathUtil::IPow<uint64>(2, 63)}};
+  ComputeAndCompareR1<uint64>(&builder, unsigned_x, {});
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/convolution_test.cc b/tensorflow/compiler/xla/tests/convolution_test.cc
index e2b5c91653fa6db5df86404c6c5f9158b0d484e1..72715398dea468d0000144759454c5f8d8673516 100644
--- a/tensorflow/compiler/xla/tests/convolution_test.cc
+++ b/tensorflow/compiler/xla/tests/convolution_test.cc
@@ -53,26 +53,12 @@ class ConvolutionTest : public ClientLibraryTestBase {
 #endif
 };
 
-#if (XLA_TEST_BACKEND_GPU || XLA_TEST_BACKEND_CPU)
-using TestTypes = ::testing::Types<float, Eigen::half>;
-#else
+#ifdef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
 using TestTypes = ::testing::Types<float>;
+#else
+using TestTypes = ::testing::Types<float, Eigen::half>;
 #endif
 
-template <typename T>
-Shape MakeShapeWrapper(tensorflow::gtl::ArraySlice<int64> dimensions);
-
-template <>
-Shape MakeShapeWrapper<float>(tensorflow::gtl::ArraySlice<int64> dimensions) {
-  return ShapeUtil::MakeShape(F32, dimensions);
-}
-
-template <>
-Shape MakeShapeWrapper<Eigen::half>(
-    tensorflow::gtl::ArraySlice<int64> dimensions) {
-  return ShapeUtil::MakeShape(F16, dimensions);
-}
-
 template <typename T>
 class ForwardPassConvolution_3x3x256_256_OutputZ_Iota : public ConvolutionTest {
  public:
@@ -121,8 +107,8 @@ class Convolve_1x1x1x2_1x1x1x2_Valid : public ConvolutionTest {
  public:
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
-    Shape input_shape = MakeShapeWrapper<T>({1, 1, 1, 2});
-    Shape filter_shape = MakeShapeWrapper<T>({1, 1, 1, 2});
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 1, 2});
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 1, 2});
     auto input = builder.Parameter(0, input_shape, "input");
     auto filter = builder.Parameter(1, filter_shape, "filter");
     auto conv = builder.Conv(input, filter, {1, 1}, Padding::kValid);
@@ -152,8 +138,8 @@ class Convolve_1x1x4x4_1x1x2x2_Valid : public ConvolutionTest {
  public:
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
-    Shape input_shape = MakeShapeWrapper<T>({1, 1, 4, 4});
-    Shape filter_shape = MakeShapeWrapper<T>({1, 1, 2, 2});
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 4, 4});
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 2, 2});
     auto input = builder.Parameter(0, input_shape, "input");
     auto filter = builder.Parameter(1, filter_shape, "filter");
     auto conv = builder.Conv(input, filter, {1, 1}, Padding::kValid);
@@ -186,8 +172,8 @@ class Convolve_1x1x4x4_1x1x2x2_Same : public ConvolutionTest {
  public:
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
-    Shape input_shape = MakeShapeWrapper<T>({1, 1, 4, 4});
-    Shape filter_shape = MakeShapeWrapper<T>({1, 1, 2, 2});
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 4, 4});
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 2, 2});
     auto input = builder.Parameter(0, input_shape, "input");
     auto filter = builder.Parameter(1, filter_shape, "filter");
     auto conv = builder.Conv(input, filter, {1, 1}, Padding::kSame);
@@ -222,8 +208,8 @@ class Convolve_1x1x4x4_1x1x3x3_Same : public ConvolutionTest {
  public:
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
-    Shape input_shape = MakeShapeWrapper<T>({1, 1, 4, 4});
-    Shape filter_shape = MakeShapeWrapper<T>({1, 1, 3, 3});
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 4, 4});
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 3, 3});
     auto input = builder.Parameter(0, input_shape, "input");
     auto filter = builder.Parameter(1, filter_shape, "filter");
     auto conv = builder.Conv(input, filter, {1, 1}, Padding::kSame);
@@ -280,8 +266,8 @@ class Convolve1D_1x2x5_1x2x2_WithRHSDilation : public ConvolutionTest {
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
     {
-      Shape input_shape = MakeShapeWrapper<T>({1, 2, 5});
-      Shape filter_shape = MakeShapeWrapper<T>({1, 2, 2});
+      Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 2, 5});
+      Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 2, 2});
       auto input = builder.Parameter(0, input_shape, "input");
       auto filter = builder.Parameter(1, filter_shape, "filter");
       // Convolution dimensions are bf0_oi0->bo0.
@@ -381,8 +367,8 @@ class Convolve1D_1x2x5_1x2x2_WithPadding : public ConvolutionTest {
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
     {
-      Shape input_shape = MakeShapeWrapper<T>({1, 2, 5});
-      Shape filter_shape = MakeShapeWrapper<T>({1, 2, 2});
+      Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 2, 5});
+      Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 2, 2});
       auto input = builder.Parameter(0, input_shape, "input");
       auto filter = builder.Parameter(1, filter_shape, "filter");
       // Convolution dimensions are bf0_oi0->bo0.
@@ -486,8 +472,8 @@ class Convolve2D_1x3x3x5_3x3x5x5_Valid : public ConvolutionTest {
     ComputationBuilder builder(client_, TestName());
     std::vector<int64> input_dims = {1, 3, 3, 5};
     std::vector<int64> filter_dims = {3, 3, 5, 3};
-    Shape input_shape = MakeShapeWrapper<T>(input_dims);
-    Shape filter_shape = MakeShapeWrapper<T>(filter_dims);
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>(input_dims);
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>(filter_dims);
     {
       auto input = builder.Parameter(0, input_shape, "input");
       auto filter = builder.Parameter(1, filter_shape, "filter");
@@ -611,8 +597,8 @@ class Convolve1D1WindowTestBase
                                      input_feature};
     std::vector<int64> filter_dims = {window_size, input_feature,
                                       output_feature};
-    Shape input_shape = MakeShapeWrapper<T>(input_dims);
-    Shape filter_shape = MakeShapeWrapper<T>(filter_dims);
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>(input_dims);
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>(filter_dims);
     {
       auto input = builder.Parameter(0, input_shape, "input");
       auto filter = builder.Parameter(1, filter_shape, "filter");
@@ -737,7 +723,7 @@ INSTANTIATE_TEST_CASE_P(
 );
 #endif
 
-TEST_F(ConvolutionTest, Convolve_bf16_1x1x1x2_1x1x1x2_Valid) {
+XLA_TEST_F(ConvolutionTest, Convolve_bf16_1x1x1x2_1x1x1x2_Valid) {
   ComputationBuilder builder(client_, TestName());
   Shape input_shape = ShapeUtil::MakeShape(BF16, {1, 1, 1, 2});
   Shape filter_shape = ShapeUtil::MakeShape(BF16, {1, 1, 1, 2});
diff --git a/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc b/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc
index 032c06cd3c9f872f57674d3d7b5adc201c91ea77..3ab0ea4ad48c00724d48e7d285ec024e10d5db31 100644
--- a/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc
+++ b/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc
@@ -195,7 +195,7 @@ XLA_TEST_F(DeconstructTupleTest, DeconstructNestedTuple) {
   auto result_status = client_->DeconstructTuple(*global_data);
   EXPECT_FALSE(result_status.ok());
   EXPECT_THAT(result_status.status().error_message(),
-              HasSubstr("deconstructing nested tuples not yet supported"));
+              HasSubstr("Deconstructing nested tuples is not implemented"));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/tests/dot_operation_test.cc b/tensorflow/compiler/xla/tests/dot_operation_test.cc
index 815962094ae476c4b15713ad2c1e4f1e0d140fd9..09b1dd283e4d026a2f0007240d88cd9ac38acb19 100644
--- a/tensorflow/compiler/xla/tests/dot_operation_test.cc
+++ b/tensorflow/compiler/xla/tests/dot_operation_test.cc
@@ -34,169 +34,194 @@ limitations under the License.
 namespace xla {
 namespace {
 
-// TODO(b/34468543): use GUnit typed tests when we can do all tests on all
-// backends.
 class DotOperationTest : public ClientLibraryTestBase {
  public:
   ErrorSpec error_spec_{0.0001, 1e-5};
-
- protected:
-  template <typename Element>
-  void TestOneElementVectorDot();
-  template <typename Element>
-  void TestVectorDot();
-  template <typename Element>
-  void TestSquareMatrixDot(bool lhs_row_major = false,
-                           bool rhs_row_major = false);
-  template <typename Element>
-  void TestNonsquareMatrixDot(bool lhs_row_major = false,
-                              bool rhs_row_major = false);
 };
 
-XLA_TEST_F(DotOperationTest, ZeroElementVectorDotF32) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR1<float>({});
-  auto rhs = builder.ConstantR1<float>({});
+#if defined(XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16) && \
+    defined(XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT64)
+using TypesF16F32 = ::testing::Types<float>;
+using TypesF16F32F64 = ::testing::Types<float>;
+using TypesF16F32F64CF64 = ::testing::Types<float>;
+#elif !defined(XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16) && \
+    !defined(XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT64)
+using TypesF16F32 = ::testing::Types<Eigen::half, float>;
+using TypesF16F32F64 = ::testing::Types<Eigen::half, float, double>;
+using TypesF16F32F64CF64 =
+    ::testing::Types<Eigen::half, float, double, complex64>;
+#else
+#error "Situation not handled yet"
+#endif
+
+template <typename T>
+class DotOperationTest_F16F32F64CF64 : public DotOperationTest {};
+TYPED_TEST_CASE(DotOperationTest_F16F32F64CF64, TypesF16F32F64CF64);
+
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, ZeroElementVectorDot) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+
+  auto lhs = builder.ConstantR1<T>({});
+  auto rhs = builder.ConstantR1<T>({});
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR0<float>(&builder, 0.0, {}, error_spec_);
+  this->template ComputeAndCompareR0<T>(&builder, static_cast<T>(0.0), {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, TrivialMatrixVectorDotF32) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR2<float>({{3.0, 4.0}});
-  auto rhs = builder.ConstantR1<float>({3.0, 4.0});
-  auto result = builder.Dot(lhs, rhs);
+template <typename T>
+class DotOperationTest_F16F32F64 : public DotOperationTest {};
+TYPED_TEST_CASE(DotOperationTest_F16F32F64, TypesF16F32F64);
 
-  ComputeAndCompareR1<float>(&builder, {25.0}, {}, error_spec_);
-}
-
-template <typename Element>
-void DotOperationTest::TestOneElementVectorDot() {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR1<Element>({2.0});
-  auto rhs = builder.ConstantR1<Element>({3.0});
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, TrivialMatrixVectorDot) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR2FromArray2D<T>({{3.0f, 4.0f}});
+  auto rhs = builder.ConstantFromArray<T>({3.0f, 4.0f});
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR0<Element>(&builder, 6.0, {}, error_spec_);
+  this->template ComputeAndCompareR1<T>(&builder, {static_cast<T>(25.0f)}, {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, OneElementVectorDotF32) {
-  TestOneElementVectorDot<float>();
-}
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, OneElementVectorDot) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR1<T>({static_cast<T>(2.0f)});
+  auto rhs = builder.ConstantR1<T>({static_cast<T>(3.0f)});
+  auto result = builder.Dot(lhs, rhs);
 
-XLA_TEST_F(DotOperationTest, OneElementVectorDotF64) {
-  TestOneElementVectorDot<double>();
+  this->template ComputeAndCompareR0<T>(&builder, static_cast<T>(6.0f), {},
+                                        this->error_spec_);
 }
 
-template <typename Element>
-void DotOperationTest::TestVectorDot() {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR1<Element>({1.0, 2.5, 42.0});
-  auto rhs = builder.ConstantR1<Element>({11.0, -1.0, 0.5});
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, VectorDot) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantFromArray<T>({1.0f, 2.5f, 42.0f});
+  auto rhs = builder.ConstantFromArray<T>({11.0f, -1.0f, 0.5f});
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR0<Element>(&builder, 29.5, {}, error_spec_);
+  this->template ComputeAndCompareR0<T>(&builder, static_cast<T>(29.5f), {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, VectorDotF32) { TestVectorDot<float>(); }
-
-XLA_TEST_F(DotOperationTest, VectorDotF64) { TestVectorDot<double>(); }
-
-namespace {
-
 std::vector<int64> MinorToMajorForIsRowMajor(bool row_major) {
   return {row_major ? 1 : 0, row_major ? 0 : 1};
 }
 
-}  // namespace
-
-XLA_TEST_F(DotOperationTest, Dot_0x2_2x0) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(0, 2));
-  auto rhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(2, 0));
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_0x2_2x0) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(0, 2));
+  auto rhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(2, 0));
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR2<float>(&builder, Array2D<float>(0, 0), {}, error_spec_);
+  this->template ComputeAndCompareR2<T>(&builder, Array2D<T>(0, 0), {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, Dot_0x2_2x3) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(0, 2));
-  auto rhs = builder.ConstantR2<float>({{7.0, 8.0, 9.0}, {42.0, 77.0, 101.0}});
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_0x2_2x3) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(0, 2));
+  auto rhs = builder.ConstantR2FromArray2D<T>(
+      {{7.0f, 8.0f, 9.0f}, {42.0f, 77.0f, 101.0f}});
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR2<float>(&builder, Array2D<float>(0, 3), {}, error_spec_);
+  this->template ComputeAndCompareR2<T>(&builder, Array2D<T>(0, 3), {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, Dot_3x2_2x0) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs =
-      builder.ConstantR2<float>({{7.0, 8.0}, {9.0, 42.0}, {77.0, 101.0}});
-  auto rhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(2, 0));
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_3x2_2x0) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR2FromArray2D<T>(
+      {{7.0f, 8.0f}, {9.0f, 42.0f}, {77.0f, 101.0f}});
+  auto rhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(2, 0));
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR2<float>(&builder, Array2D<float>(3, 0), {}, error_spec_);
+  this->template ComputeAndCompareR2<T>(&builder, Array2D<T>(3, 0), {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, Dot_2x0_0x2) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(2, 0));
-  auto rhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(0, 2));
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_2x0_0x2) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(2, 0));
+  auto rhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(0, 2));
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR2<float>(&builder, Array2D<float>(2, 2, 0.0f), {},
-                             error_spec_);
+  this->template ComputeAndCompareR2<T>(
+      &builder, Array2D<T>(2, 2, static_cast<T>(0.0f)), {}, this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, FusedDot) {
-  ComputationBuilder builder(client_, TestName());
-  auto param0 = builder.Parameter(0, ShapeUtil::MakeShape(F32, {2, 4}), "arg0");
-  auto param1 = builder.Parameter(1, ShapeUtil::MakeShape(F32, {4, 1}), "arg1");
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, FusedDot) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto param0 =
+      builder.Parameter(0, ShapeUtil::MakeShapeWithType<T>({2, 4}), "arg0");
+  auto param1 =
+      builder.Parameter(1, ShapeUtil::MakeShapeWithType<T>({4, 1}), "arg1");
   auto exp0 = builder.Exp(param0);
   auto result = builder.Dot(exp0, param1);
 
-  auto lhs_handle = client_
-                        ->TransferToServer(*Literal::CreateR2<float>(
-                            {{1.0, 2.0, 3.0, 4.0}, {-1.0, -2.0, -3.0, -4.0}}))
-                        .ConsumeValueOrDie();
-  auto rhs_handle = client_
-                        ->TransferToServer(*Literal::CreateR2<float>(
-                            {{1.0}, {2.0}, {3.0}, {4.0}}))
-                        .ConsumeValueOrDie();
-
-  ComputeAndCompareR2<float>(
-      &builder, Array2D<float>({{296.14560492846033}, {0.8611737683031964}}),
-      {lhs_handle.get(), rhs_handle.get()}, error_spec_);
-}
-
-template <typename Element>
-void DotOperationTest::TestSquareMatrixDot(bool lhs_row_major,
-                                           bool rhs_row_major) {
   auto lhs_handle =
-      client_
-          ->TransferToServer(*Literal::CreateR2WithLayout<Element>(
-              {{1.0, 2.0}, {3.0, -4.0}},
-              LayoutUtil::MakeLayout(MinorToMajorForIsRowMajor(lhs_row_major))))
-          .ConsumeValueOrDie();
-  auto rhs_handle =
-      client_
-          ->TransferToServer(*Literal::CreateR2WithLayout<Element>(
-              {{1.0, 6.0}, {7.0, -4.0}},
-              LayoutUtil::MakeLayout(MinorToMajorForIsRowMajor(rhs_row_major))))
+      this->client_
+          ->TransferToServer(*Literal::CreateR2FromArray2D<T>(
+              {{1.0f, 2.0f, 3.0f, 4.0f}, {-1.0f, -2.0f, -3.0f, -4.0f}}))
           .ConsumeValueOrDie();
+  auto rhs_handle = this->client_
+                        ->TransferToServer(*Literal::CreateR2FromArray2D<T>(
+                            {{1.0f}, {2.0f}, {3.0f}, {4.0f}}))
+                        .ConsumeValueOrDie();
 
-  ComputationBuilder builder(client_, TestName());
-  auto prim_type = primitive_util::NativeToPrimitiveType<Element>();
-  auto result = builder.Dot(
-      builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 2}), "lhs"),
-      builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {2, 2}), "rhs"));
+  if (std::is_same<Eigen::half, T>::value) {
+    this->error_spec_ = ErrorSpec{0.0001, 1e-3};
+  }
 
-  Array2D<Element> expected({{15.0, -2.0}, {-25.0, 34.0}});
-  ComputeAndCompareR2<Element>(
-      &builder, expected, {lhs_handle.get(), rhs_handle.get()}, error_spec_);
+  this->template ComputeAndCompareR2<T>(
+      &builder, Array2D<T>({{296.14560492846033f}, {0.8611737683031964f}}),
+      {lhs_handle.get(), rhs_handle.get()}, this->error_spec_);
 }
 
+template <typename T>
+class SquareMatrixDot : public DotOperationTest {
+ public:
+  void TestImpl(bool lhs_row_major, bool rhs_row_major) {
+    auto lhs_handle =
+        client_
+            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+                {{1.0f, 2.0f}, {3.0f, -4.0f}},
+                LayoutUtil::MakeLayout(
+                    MinorToMajorForIsRowMajor(lhs_row_major))))
+            .ConsumeValueOrDie();
+    auto rhs_handle =
+        client_
+            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+                {{1.0f, 6.0f}, {7.0f, -4.0f}},
+                LayoutUtil::MakeLayout(
+                    MinorToMajorForIsRowMajor(rhs_row_major))))
+            .ConsumeValueOrDie();
+    ComputationBuilder builder(client_, TestName());
+    auto prim_type = primitive_util::NativeToPrimitiveType<T>();
+    auto result = builder.Dot(
+        builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 2}), "lhs"),
+        builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {2, 2}), "rhs"));
+
+    Array2D<T> expected({{15.0f, -2.0f}, {-25.0f, 34.0f}});
+    ComputeAndCompareR2<T>(&builder, expected,
+                           {lhs_handle.get(), rhs_handle.get()}, error_spec_);
+  }
+};
+
+TYPED_TEST_CASE(SquareMatrixDot, TypesF16F32F64CF64);
+XLA_TYPED_TEST(SquareMatrixDot, TypesFF) { this->TestImpl(false, false); }
+XLA_TYPED_TEST(SquareMatrixDot, TypesFT) { this->TestImpl(false, true); }
+XLA_TYPED_TEST(SquareMatrixDot, TypesTF) { this->TestImpl(true, false); }
+XLA_TYPED_TEST(SquareMatrixDot, TypesTT) { this->TestImpl(true, true); }
+
 struct DotTestParam {
   int m;
   int k;
@@ -302,14 +327,13 @@ void ParametricDotTest::TestImpl() {
   if (param.has_addend) {
     args.push_back(addend_handle.get());
   }
-
-  ComputeAndCompareR2<NativeT>(&builder, *expected, args, ErrorSpec(0.3, 3e-3));
+  ErrorSpec error_spec(0.3, 3e-3);
+  if (std::is_same<Eigen::half, NativeT>::value) {
+    error_spec = ErrorSpec(0.3, 5e-3);
+  }
+  ComputeAndCompareR2<NativeT>(&builder, *expected, args, error_spec);
 }
 
-XLA_TEST_P(ParametricDotTest, TestF32) { TestImpl<float>(); }
-
-XLA_TEST_P(ParametricDotTest, TestF64) { TestImpl<double>(); }
-
 std::vector<DotTestParam> CreateDotTestParameters() {
   std::vector<DotTestParam> params;
 
@@ -331,6 +355,12 @@ std::vector<DotTestParam> CreateDotTestParameters() {
   return params;
 }
 
+#ifndef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
+XLA_TEST_P(ParametricDotTest, TestF16) { TestImpl<Eigen::half>(); }
+#endif
+XLA_TEST_P(ParametricDotTest, TestF32) { TestImpl<float>(); }
+XLA_TEST_P(ParametricDotTest, TestF64) { TestImpl<double>(); }
+
 INSTANTIATE_TEST_CASE_P(DotTests, ParametricDotTest,
                         ::testing::ValuesIn(CreateDotTestParameters()),
                         PrintDotTestParam);
@@ -343,14 +373,6 @@ class ParametricDotTestWithoutLayoutAssignment : public ParametricDotTest {
   }
 };
 
-XLA_TEST_P(ParametricDotTestWithoutLayoutAssignment, TestF32) {
-  TestImpl<float>();
-}
-
-XLA_TEST_P(ParametricDotTestWithoutLayoutAssignment, TestF64) {
-  TestImpl<double>();
-}
-
 std::vector<DotTestParam> CreateNoLayoutAssignmentDotTestParameters() {
   std::vector<DotTestParam> params;
 
@@ -407,110 +429,60 @@ std::vector<DotTestParam> CreateNoLayoutAssignmentDotTestParameters() {
   return params;
 }
 
-INSTANTIATE_TEST_CASE_P(
-    DotTests, ParametricDotTestWithoutLayoutAssignment,
-    ::testing::ValuesIn(CreateNoLayoutAssignmentDotTestParameters()),
-    PrintDotTestParam);
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotF32MinorToMajorFF) {
-  TestSquareMatrixDot<float>(false, false);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotF32MinorToMajorFT) {
-  TestSquareMatrixDot<float>(false, true);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotF32MinorToMajorTF) {
-  TestSquareMatrixDot<float>(true, false);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotF32MinorToMajorTT) {
-  TestSquareMatrixDot<float>(true, true);
+#ifndef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
+XLA_TEST_P(ParametricDotTestWithoutLayoutAssignment, TestF16) {
+  TestImpl<Eigen::half>();
 }
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotC64MinorToMajorFF) {
-  TestSquareMatrixDot<complex64>(false, false);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotC64MinorToMajorFT) {
-  TestSquareMatrixDot<complex64>(false, true);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotC64MinorToMajorTF) {
-  TestSquareMatrixDot<complex64>(true, false);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotC64MinorToMajorTT) {
-  TestSquareMatrixDot<complex64>(true, true);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotF64) {
-  TestSquareMatrixDot<double>();
-}
-
-template <typename Element>
-void DotOperationTest::TestNonsquareMatrixDot(bool lhs_row_major,
-                                              bool rhs_row_major) {
-  auto lhs_handle =
-      client_
-          ->TransferToServer(*Literal::CreateR2WithLayout<Element>(
-              {{1.0, 2.0, 3.0}, {3.0, -4.0, -1.0}},
-              LayoutUtil::MakeLayout(MinorToMajorForIsRowMajor(lhs_row_major))))
-          .ConsumeValueOrDie();
-  auto rhs_handle =
-      client_
-          ->TransferToServer(*Literal::CreateR2WithLayout<Element>(
-              {{1.0, 6.0}, {2.0, 3.0}, {7.0, -4.0}},
-              LayoutUtil::MakeLayout(MinorToMajorForIsRowMajor(rhs_row_major))))
-          .ConsumeValueOrDie();
-
-  ComputationBuilder builder(client_, TestName());
-  auto prim_type = primitive_util::NativeToPrimitiveType<Element>();
-  auto result = builder.Dot(
-      builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 3}), "lhs"),
-      builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {3, 2}), "rhs"));
-
-  Array2D<Element> expected({{26.0, 0.0}, {-12.0, 10.0}});
-
-  ComputeAndCompareR2<Element>(
-      &builder, expected, {lhs_handle.get(), rhs_handle.get()}, error_spec_);
-}
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotF32MajorToMinorFF) {
-  TestNonsquareMatrixDot<float>(false, false);
-}
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotF32MajorToMinorFT) {
-  TestNonsquareMatrixDot<float>(false, true);
-}
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotF32MajorToMinorTF) {
-  TestNonsquareMatrixDot<float>(true, false);
-}
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotF32MajorToMinorTT) {
-  TestNonsquareMatrixDot<float>(true, true);
-}
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotF64) {
-  TestNonsquareMatrixDot<double>();
+#endif
+XLA_TEST_P(ParametricDotTestWithoutLayoutAssignment, TestF32) {
+  TestImpl<float>();
 }
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotC64MajorToMinorFF) {
-  TestNonsquareMatrixDot<complex64>(false, false);
+XLA_TEST_P(ParametricDotTestWithoutLayoutAssignment, TestF64) {
+  TestImpl<double>();
 }
 
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotC64MajorToMinorFT) {
-  TestNonsquareMatrixDot<complex64>(false, true);
-}
+INSTANTIATE_TEST_CASE_P(
+    DotTests, ParametricDotTestWithoutLayoutAssignment,
+    ::testing::ValuesIn(CreateNoLayoutAssignmentDotTestParameters()),
+    PrintDotTestParam);
 
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotC64MajorToMinorTF) {
-  TestNonsquareMatrixDot<complex64>(true, false);
-}
+template <typename T>
+class NonsquareMatrixDot : public DotOperationTest {
+ public:
+  void TestImpl(bool lhs_row_major, bool rhs_row_major) {
+    auto lhs_handle =
+        client_
+            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+                {{1.0f, 2.0f, 3.0f}, {3.0f, -4.0f, -1.0f}},
+                LayoutUtil::MakeLayout(
+                    MinorToMajorForIsRowMajor(lhs_row_major))))
+            .ConsumeValueOrDie();
+    auto rhs_handle =
+        client_
+            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+                {{1.0f, 6.0f}, {2.0f, 3.0f}, {7.0f, -4.0f}},
+                LayoutUtil::MakeLayout(
+                    MinorToMajorForIsRowMajor(rhs_row_major))))
+            .ConsumeValueOrDie();
+
+    ComputationBuilder builder(client_, TestName());
+    auto prim_type = primitive_util::NativeToPrimitiveType<T>();
+    auto result = builder.Dot(
+        builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 3}), "lhs"),
+        builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {3, 2}), "rhs"));
+
+    Array2D<T> expected({{26.0f, 0.0f}, {-12.0f, 10.0f}});
+
+    ComputeAndCompareR2<T>(&builder, expected,
+                           {lhs_handle.get(), rhs_handle.get()}, error_spec_);
+  }
+};
 
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotC64MajorToMinorTT) {
-  TestNonsquareMatrixDot<complex64>(true, true);
-}
+TYPED_TEST_CASE(NonsquareMatrixDot, TypesF16F32F64CF64);
+XLA_TYPED_TEST(NonsquareMatrixDot, TestFF) { this->TestImpl(false, false); }
+XLA_TYPED_TEST(NonsquareMatrixDot, TestFT) { this->TestImpl(false, true); }
+XLA_TYPED_TEST(NonsquareMatrixDot, TestTF) { this->TestImpl(true, false); }
+XLA_TYPED_TEST(NonsquareMatrixDot, TestTT) { this->TestImpl(true, true); }
 
 XLA_TEST_F(DotOperationTest, MatrixVectorC64) {
   auto lhs_handle =
@@ -537,25 +509,35 @@ XLA_TEST_F(DotOperationTest, MatrixVectorC64) {
       &builder, expected, {lhs_handle.get(), rhs_handle.get()}, error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, ConcurrentMatMul) {
-  ComputationBuilder builder(client_, TestName());
-  auto matrix1 = builder.ConstantR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto matrix2 = builder.ConstantR2<float>({{5.0, 6.0}, {7.0, 8.0}});
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, ConcurrentMatMult) {
+  using T = TypeParam;
+
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto matrix1 = builder.ConstantR2FromArray2D<T>({{1.0f, 2.0f}, {3.0f, 4.0f}});
+  auto matrix2 = builder.ConstantR2FromArray2D<T>({{5.0f, 6.0f}, {7.0f, 8.0f}});
   auto matrix12 = builder.Dot(matrix1, matrix2);
   auto matrix21 = builder.Dot(matrix2, matrix1);
   builder.Add(matrix12, matrix21);
 
-  Array2D<float> expected({{42.0, 56.0}, {74.0, 96.0}});
-  ComputeAndCompareR2<float>(&builder, expected, {}, error_spec_);
+  Array2D<T> expected({{42.0f, 56.0f}, {74.0f, 96.0f}});
+  this->template ComputeAndCompareR2<T>(&builder, expected, {},
+                                        this->error_spec_);
 }
 
+template <typename T>
+class DotOperationTestForBatchMatMul : public DotOperationTest {};
+TYPED_TEST_CASE(DotOperationTestForBatchMatMul, TypesF16F32F64);
+
 // Regression test for b/32055648. The root of the graph is a kFusion of 4
 // bitcasts. Although bitcasts don't map to thunks, the root should still be
 // sync-dependent on bitcasts' operands.
-XLA_TEST_F(DotOperationTest, BatchMatMul) {
-  ComputationBuilder builder(client_, TestName());
-  auto x = builder.Parameter(0, ShapeUtil::MakeShape(F32, {2, 2, 2, 2}), "x");
-  auto y = builder.Parameter(1, ShapeUtil::MakeShape(F32, {2, 2, 2, 2}), "y");
+XLA_TYPED_TEST(DotOperationTestForBatchMatMul, Types) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto x =
+      builder.Parameter(0, ShapeUtil::MakeShapeWithType<T>({2, 2, 2, 2}), "x");
+  auto y =
+      builder.Parameter(1, ShapeUtil::MakeShapeWithType<T>({2, 2, 2, 2}), "y");
 
   auto x_flat = builder.Reshape(x, {0, 1, 2, 3}, {4, 2, 2});
   auto y_flat = builder.Reshape(y, {0, 1, 2, 3}, {4, 2, 2});
@@ -576,29 +558,42 @@ XLA_TEST_F(DotOperationTest, BatchMatMul) {
   auto out_flat = builder.ConcatInDim(out_slices, 0);
   builder.Reshape(out_flat, {0, 1, 2}, {2, 2, 2, 2});
 
-  auto x_data = client_
-                    ->TransferToServer(*Literal::CreateR4<float>(
-                        {{{{1000, 100}, {10, 1}}, {{2000, 200}, {20, 2}}},
-                         {{{3000, 300}, {30, 3}}, {{4000, 400}, {40, 4}}}}))
-                    .ConsumeValueOrDie();
-  auto y_data = client_
-                    ->TransferToServer(*Literal::CreateR4<float>(
-                        {{{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}},
-                         {{{11, 22}, {33, 44}}, {{55, 66}, {77, 88}}}}))
+  auto x_data = this->client_
+                    ->TransferToServer(*Literal::CreateR4FromArray4D<T>(
+                        {{{{1000.0f, 100.0f}, {10.0f, 1.0f}},
+                          {{2000.0f, 200.0f}, {20.0f, 2.0f}}},
+                         {{{3000.0f, 300.0f}, {30.0f, 3.0f}},
+                          {{4000.0f, 400.0f}, {40.0f, 4.0f}}}}))
                     .ConsumeValueOrDie();
+  auto y_data =
+      this->client_
+          ->TransferToServer(*Literal::CreateR4FromArray4D<T>(
+              {{{{1.0f, 2.0f}, {3.0f, 4.0f}}, {{5.0f, 6.0f}, {7.0f, 8.0f}}},
+               {{{11.0f, 22.0f}, {33.0f, 44.0f}},
+                {{55.0f, 66.0f}, {77.0f, 88.0f}}}}))
+          .ConsumeValueOrDie();
 
-  ComputeAndCompareR4<float>(
+  if (std::is_same<Eigen::half, T>::value) {
+    this->error_spec_ = ErrorSpec{0.0001, 1e-3};
+  }
+  this->template ComputeAndCompareR4<T>(
       &builder,
       /*expected=*/
-      {{{{1300, 2400}, {13, 24}}, {{11400, 13600}, {114, 136}}},
-       {{{42900, 79200}, {429, 792}}, {{250800, 299200}, {2508, 2992}}}},
-      {x_data.get(), y_data.get()}, error_spec_);
+      {{{{1300.0f, 2400.0f}, {13.0f, 24.0f}},
+        {{11400.0f, 13600.0f}, {114.0f, 136.0f}}},
+       {{{42900.0f, 79200.0f}, {429.0f, 792.0f}},
+        {{250800.0f, 299200.0f}, {2508.0f, 2992.0f}}}},
+      {x_data.get(), y_data.get()}, this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, GeneralMatMul) {
-  ComputationBuilder builder(client_, TestName());
-  auto x = builder.Parameter(0, ShapeUtil::MakeShape(F32, {2, 2, 2}), "x");
-  auto y = builder.Parameter(1, ShapeUtil::MakeShape(F32, {2, 2, 2}), "y");
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, GeneralMatMul) {
+  using T = TypeParam;
+
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto x =
+      builder.Parameter(0, ShapeUtil::MakeShapeWithType<T>({2, 2, 2}), "x");
+  auto y =
+      builder.Parameter(1, ShapeUtil::MakeShapeWithType<T>({2, 2, 2}), "y");
 
   DotDimensionNumbers dnums;
   dnums.add_lhs_contracting_dimensions(2);
@@ -608,31 +603,34 @@ XLA_TEST_F(DotOperationTest, GeneralMatMul) {
 
   auto out = builder.DotGeneral(x, y, dnums);
 
-  auto x_data = client_
-                    ->TransferToServer(*Literal::CreateR3<float>(
-                        {{{1.0, 2.0}, {3.0, 4.0}}, {{5.0, 6.0}, {7.0, 8.0}}}))
-                    .ConsumeValueOrDie();
+  auto x_data =
+      this->client_
+          ->TransferToServer(*Literal::CreateR3FromArray3D<T>(
+              {{{1.0f, 2.0f}, {3.0f, 4.0f}}, {{5.0f, 6.0f}, {7.0f, 8.0f}}}))
+          .ConsumeValueOrDie();
 
-  auto y_data = client_
-                    ->TransferToServer(*Literal::CreateR3<float>(
-                        {{{1.0, 0.0}, {0.0, 1.0}}, {{1.0, 0.0}, {0.0, 1.0}}}))
-                    .ConsumeValueOrDie();
+  auto y_data =
+      this->client_
+          ->TransferToServer(*Literal::CreateR3FromArray3D<T>(
+              {{{1.0f, 0.0f}, {0.0f, 1.0f}}, {{1.0f, 0.0f}, {0.0f, 1.0f}}}))
+          .ConsumeValueOrDie();
 
-  ComputeAndCompareR3<float>(
+  this->template ComputeAndCompareR3<T>(
       &builder,
       /*expected=*/
-      {{{1.0, 2.0}, {3.0, 4.0}}, {{5.0, 6.0}, {7.0, 8.0}}},
-      {x_data.get(), y_data.get()}, error_spec_);
+      {{{1.0f, 2.0f}, {3.0f, 4.0f}}, {{5.0f, 6.0f}, {7.0f, 8.0f}}},
+      {x_data.get(), y_data.get()}, this->error_spec_);
 }
 
-TEST_F(DotOperationTest, TransposeFolding) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, TransposeFolding) {
+  using T = TypeParam;
   for (bool transpose_lhs : {false, true}) {
     for (bool transpose_rhs : {false, true}) {
       for (bool row_major : {false, true}) {
-        std::unique_ptr<Array2D<float>> lhs(
-            new Array2D<float>({{1.0, 2.0, 3.0}, {3.0, -4.0, -1.0}}));
-        std::unique_ptr<Array2D<float>> rhs(
-            new Array2D<float>({{1.0, 6.0}, {2.0, 3.0}, {7.0, -4.0}}));
+        std::unique_ptr<Array2D<T>> lhs(
+            new Array2D<T>({{1.0f, 2.0f, 3.0f}, {3.0f, -4.0f, -1.0f}}));
+        std::unique_ptr<Array2D<T>> rhs(
+            new Array2D<T>({{1.0f, 6.0f}, {2.0f, 3.0f}, {7.0f, -4.0f}}));
 
         if (transpose_lhs) {
           lhs = ReferenceUtil::TransposeArray2D(*lhs);
@@ -641,22 +639,20 @@ TEST_F(DotOperationTest, TransposeFolding) {
           rhs = ReferenceUtil::TransposeArray2D(*rhs);
         }
         auto lhs_handle =
-            client_
-                ->TransferToServer(
-                    *Literal::CreateR2FromArray2DWithLayout<float>(
-                        *lhs, LayoutUtil::MakeLayout(
-                                  MinorToMajorForIsRowMajor(row_major))))
+            this->client_
+                ->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
+                    *lhs, LayoutUtil::MakeLayout(
+                              MinorToMajorForIsRowMajor(row_major))))
                 .ConsumeValueOrDie();
         auto rhs_handle =
-            client_
-                ->TransferToServer(
-                    *Literal::CreateR2FromArray2DWithLayout<float>(
-                        *rhs, LayoutUtil::MakeLayout(
-                                  MinorToMajorForIsRowMajor(row_major))))
+            this->client_
+                ->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
+                    *rhs, LayoutUtil::MakeLayout(
+                              MinorToMajorForIsRowMajor(row_major))))
                 .ConsumeValueOrDie();
 
-        ComputationBuilder builder(client_, TestName());
-        auto prim_type = primitive_util::NativeToPrimitiveType<float>();
+        ComputationBuilder builder(this->client_, this->TestName());
+        auto prim_type = primitive_util::NativeToPrimitiveType<T>();
         auto lhs_arg = builder.Parameter(
             0, ShapeUtil::MakeShape(prim_type, {lhs->height(), lhs->width()}),
             "lhs");
@@ -671,24 +667,27 @@ TEST_F(DotOperationTest, TransposeFolding) {
         }
         auto result = builder.Dot(lhs_arg, rhs_arg);
 
-        Array2D<float> expected({{26.0, 0.0}, {-12.0, 10.0}});
+        Array2D<T> expected({{26.0f, 0.0f}, {-12.0f, 10.0f}});
         VLOG(1) << "TestTransposeFolding " << transpose_lhs << " "
                 << transpose_rhs << " " << row_major;
-        ComputeAndCompareR2<float>(&builder, expected,
-                                   {lhs_handle.get(), rhs_handle.get()},
-                                   error_spec_);
+        this->template ComputeAndCompareR2<T>(
+            &builder, expected, {lhs_handle.get(), rhs_handle.get()},
+            this->error_spec_);
       }
     }
   }
 }
 
-TEST_F(DotOperationTest, DotOfConcatOptimizationWithConstLHS) {
-  auto prim_type = primitive_util::NativeToPrimitiveType<float>();
+XLA_TYPED_TEST(DotOperationTest_F16F32F64,
+               DotOfConcatOptimizationWithConstLHS) {
+  using T = TypeParam;
+  auto prim_type = primitive_util::NativeToPrimitiveType<T>();
 
-  std::unique_ptr<Array2D<float>> constant_lhs_array(new Array2D<float>(
-      {{1.0, 2.0, 3.0, 4.0, 5.0, 6.0}, {6.0, 5.0, 4.0, 3.0, 2.0, 1.0}}));
+  std::unique_ptr<Array2D<T>> constant_lhs_array(
+      new Array2D<T>({{1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f},
+                      {6.0f, 5.0f, 4.0f, 3.0f, 2.0f, 1.0f}}));
 
-  ComputationBuilder builder(client_, TestName());
+  ComputationBuilder builder(this->client_, this->TestName());
   auto lhs_constant = builder.ConstantR2FromArray2D(*constant_lhs_array);
   auto rhs_arg_0 = builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 2}),
                                      "rhs_arg_0");
@@ -699,78 +698,80 @@ TEST_F(DotOperationTest, DotOfConcatOptimizationWithConstLHS) {
   auto result = builder.Dot(
       lhs_constant, builder.ConcatInDim({rhs_arg_0, rhs_arg_1, rhs_arg_2}, 0));
 
-  std::unique_ptr<Array2D<float>> arg_0_value_array(
-      new Array2D<float>({{1.0, 2.0}, {3.0, 4.0}}));
-  std::unique_ptr<Array2D<float>> arg_1_value_array(
-      new Array2D<float>({{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}}));
-  std::unique_ptr<Array2D<float>> arg_2_value_array(
-      new Array2D<float>({{1.0, 2.0}}));
+  std::unique_ptr<Array2D<T>> arg_0_value_array(
+      new Array2D<T>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
+  std::unique_ptr<Array2D<T>> arg_1_value_array(
+      new Array2D<T>({{1.0f, 2.0f}, {3.0f, 4.0f}, {5.0f, 6.0f}}));
+  std::unique_ptr<Array2D<T>> arg_2_value_array(new Array2D<T>({{1.0f, 2.0f}}));
 
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_0_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_0_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_0_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_1_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_1_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_1_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_2_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_2_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_2_value_array)));
 
-  Array2D<float> expected({{53.0, 74.0}, {45.0, 66.0}});
-  ComputeAndCompareR2<float>(
+  Array2D<T> expected({{53.0f, 74.0f}, {45.0f, 66.0f}});
+  this->template ComputeAndCompareR2<T>(
       &builder, expected,
-      {arg_0_value.get(), arg_1_value.get(), arg_2_value.get()}, error_spec_);
-}
-
-TEST_F(DotOperationTest, DotOfConcatOptimizationWithConstRHS) {
-  auto prim_type = primitive_util::NativeToPrimitiveType<float>();
-
-  std::unique_ptr<Array2D<float>> constant_rhs_array(
-      new Array2D<float>({{1.0, 2.0},
-                          {3.0, 4.0},
-                          {5.0, 6.0},
-                          {6.0, 5.0},
-                          {4.0, 3.0},
-                          {2.0, 1.0}}));
-
-  ComputationBuilder builder(client_, TestName());
+      {arg_0_value.get(), arg_1_value.get(), arg_2_value.get()},
+      this->error_spec_);
+}
+
+XLA_TYPED_TEST(DotOperationTest_F16F32F64,
+               DotOfConcatOptimizationWithConstRHS) {
+  using T = TypeParam;
+  std::unique_ptr<Array2D<T>> constant_rhs_array(
+      new Array2D<T>({{1.0f, 2.0f},
+                      {3.0f, 4.0f},
+                      {5.0f, 6.0f},
+                      {6.0f, 5.0f},
+                      {4.0f, 3.0f},
+                      {2.0f, 1.0f}}));
+
+  ComputationBuilder builder(this->client_, this->TestName());
   auto rhs_constant = builder.ConstantR2FromArray2D(*constant_rhs_array);
-  auto lhs_arg_0 = builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 2}),
+  auto lhs_arg_0 = builder.Parameter(0, ShapeUtil::MakeShapeWithType<T>({2, 2}),
                                      "lhs_arg_0");
-  auto lhs_arg_1 = builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {2, 3}),
+  auto lhs_arg_1 = builder.Parameter(1, ShapeUtil::MakeShapeWithType<T>({2, 3}),
                                      "lhs_arg_1");
-  auto lhs_arg_2 = builder.Parameter(2, ShapeUtil::MakeShape(prim_type, {2, 1}),
+  auto lhs_arg_2 = builder.Parameter(2, ShapeUtil::MakeShapeWithType<T>({2, 1}),
                                      "lhs_arg_2");
   auto result = builder.Dot(
       builder.ConcatInDim({lhs_arg_0, lhs_arg_1, lhs_arg_2}, 1), rhs_constant);
 
-  std::unique_ptr<Array2D<float>> arg_0_value_array(
-      new Array2D<float>({{1.0, 2.0}, {3.0, 4.0}}));
-  std::unique_ptr<Array2D<float>> arg_1_value_array(
-      new Array2D<float>({{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}}));
-  std::unique_ptr<Array2D<float>> arg_2_value_array(
-      new Array2D<float>({{1.0}, {2.0}}));
+  std::unique_ptr<Array2D<T>> arg_0_value_array(
+      new Array2D<T>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
+  std::unique_ptr<Array2D<T>> arg_1_value_array(
+      new Array2D<T>({{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}}));
+  std::unique_ptr<Array2D<T>> arg_2_value_array(
+      new Array2D<T>({{1.0f}, {2.0f}}));
 
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_0_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_0_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_0_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_1_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_1_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_1_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_2_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_2_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_2_value_array)));
 
-  Array2D<float> expected({{38.0, 36.0}, {93.0, 91.0}});
-  ComputeAndCompareR2<float>(
+  Array2D<T> expected({{38.0f, 36.0f}, {93.0f, 91.0f}});
+  this->template ComputeAndCompareR2<T>(
       &builder, expected,
-      {arg_0_value.get(), arg_1_value.get(), arg_2_value.get()}, error_spec_);
+      {arg_0_value.get(), arg_1_value.get(), arg_2_value.get()},
+      this->error_spec_);
 }
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/dynamic_ops_test.cc b/tensorflow/compiler/xla/tests/dynamic_ops_test.cc
index 877dc7db0eec229a7119b3627f177a33ed0d971b..4f354e6aefe70a51c09be1c0ca151af2bb9f0a2c 100644
--- a/tensorflow/compiler/xla/tests/dynamic_ops_test.cc
+++ b/tensorflow/compiler/xla/tests/dynamic_ops_test.cc
@@ -206,19 +206,19 @@ XLA_TEST_F(DynamicSliceTest, Int32R1BF16) { TestR1<int32, bfloat16>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R1) { TestR1<int32, int32>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R1Wrap) { TestR1Wrap<int32, int32>(); }
 XLA_TEST_F(DynamicSliceTest, Int64R1) { TestR1<int64, float>(); }
-XLA_TEST_F(DynamicSliceTest, UInt64R1) { TestR1<uint64, double>(); }
+XLA_TEST_F(DynamicSliceTest, UInt64R1) { TestR1<uint64, float>(); }
 
 XLA_TEST_F(DynamicSliceTest, Int32R2BF16) { TestR2<int32, bfloat16>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R2) { TestR2<int32, int32>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R2Wrap) { TestR2Wrap<int32, int32>(); }
-XLA_TEST_F(DynamicSliceTest, Int64R2) { TestR2<int64, double>(); }
+XLA_TEST_F(DynamicSliceTest, Int64R2) { TestR2<int64, float>(); }
 XLA_TEST_F(DynamicSliceTest, UInt64R2) { TestR2<uint64, int32>(); }
 
 XLA_TEST_F(DynamicSliceTest, Int32R3BF16) { TestR3<int32, bfloat16>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R3) { TestR3<int32, float>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R3Wrap) { TestR3Wrap<int32, float>(); }
 XLA_TEST_F(DynamicSliceTest, Int64R3) { TestR3<int64, float>(); }
-XLA_TEST_F(DynamicSliceTest, UInt64R3) { TestR3<uint64, double>(); }
+XLA_TEST_F(DynamicSliceTest, UInt64R3) { TestR3<uint64, float>(); }
 
 XLA_TEST_F(DynamicSliceTest, Int32R1Pred) {
   // Slice at dimension start.
@@ -506,7 +506,7 @@ XLA_TEST_F(DynamicUpdateSliceTest, DISABLED_ON_CPU_PARALLEL(Int32R1BF16)) {
 }
 XLA_TEST_F(DynamicUpdateSliceTest, Int32R1) { TestR1<int32, float>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, Int64R1) { TestR1<int64, float>(); }
-XLA_TEST_F(DynamicUpdateSliceTest, UInt64R1) { TestR1<uint64, double>(); }
+XLA_TEST_F(DynamicUpdateSliceTest, UInt64R1) { TestR1<uint64, float>(); }
 
 // TODO(b/71820067): The CPU parallel backend failed for this on 2018-01-10.
 XLA_TEST_F(DynamicUpdateSliceTest, DISABLED_ON_CPU_PARALLEL(Int32R2BF16)) {
diff --git a/tensorflow/compiler/xla/tests/gather_operation_test.cc b/tensorflow/compiler/xla/tests/gather_operation_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..9db68ff7a6dcbd9204fb2b3a37734a9aaed35dfd
--- /dev/null
+++ b/tensorflow/compiler/xla/tests/gather_operation_test.cc
@@ -0,0 +1,461 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/execution_options_util.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/test.h"
+#include "tensorflow/compiler/xla/tests/client_library_test_base.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/tests/test_macros.h"
+#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
+
+// NB!  TODO(b/74360564): These tests do not check the values produced for
+// out-of-bounds indices, since that behavior hasn't been specified yet.
+
+namespace xla {
+namespace {
+
+using tensorflow::gtl::nullopt;
+
+class GatherOperationTest : public HloTestBase {
+ protected:
+  void RunTest(const string& hlo_text, Literal* operand,
+               Literal* gather_indices) {
+    RunTest(hlo_text, {operand, gather_indices});
+  }
+
+  void RunTest(const string& hlo_text,
+               tensorflow::gtl::ArraySlice<Literal*> args) {
+    HloModuleConfig config;
+    config.set_debug_options(GetDebugOptionsForTest());
+    TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                            tools::Parse(hlo_text, config));
+    EXPECT_TRUE(RunAndCompare(std::move(module), args, nullopt));
+  }
+};
+
+XLA_TEST_F(GatherOperationTest, TensorFlowGatherV1) {
+  const string hlo_text = R"(
+HloModule TensorFlowGatherV1
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2] parameter(1)
+  ROOT gather = s32[2,3] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1, 3}
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, TensorFlowGatherV2) {
+  const string hlo_text = R"(
+HloModule TensorFlowGatherV2
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2] parameter(1)
+  ROOT gather = s32[3,2] gather(operand, indices),
+      output_window_dims={0},
+      elided_window_dims={1},
+      gather_dims_to_operand_dims={1},
+      index_vector_dim=1,
+      window_bounds={3, 1}
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, TensorFlowGatherMultipleBatchDims) {
+  const string hlo_text = R"(
+HloModule TensorFlowGatherMultipleBatchDims
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2,2] parameter(1)
+  ROOT gather = s32[2,3,2] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={1},
+      gather_dims_to_operand_dims={1},
+      index_vector_dim=2,
+      window_bounds={3, 1}
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      Literal::CreateR2<int32>({{0, 2}, {2, 1}});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, TensorFlowGatherNdMultipleBatchDims_0) {
+  const string hlo_text = R"(
+HloModule TensorFlowGatherNdMultipleBatchDims
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2,2,2] parameter(1)
+  ROOT gather = s32[2,2] gather(operand, indices),
+      output_window_dims={},
+      elided_window_dims={0,1},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=2,
+      window_bounds={1, 1}
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      Literal::CreateR3<int32>({{{0, 2}, {2, 1}}, {{1, 2}, {2, 0}}});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, TensorFlowGatherNdMultipleBatchDims_1) {
+  const string hlo_text = R"(
+HloModule TensorFlowGatherNdMultipleBatchDims
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2,2,2] parameter(1)
+  ROOT gather = s32[2,1,1,2] gather(operand, indices),
+      output_window_dims={1,2},
+      elided_window_dims={},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=2,
+      window_bounds={1, 1}
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      Literal::CreateR3<int32>({{{0, 2}, {2, 1}}, {{1, 2}, {2, 0}}});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, TensorFlowGatherNd) {
+  const string hlo_text = R"(
+HloModule TensorFlowGatherNd
+
+ENTRY main {
+  operand = s32[3,3,2] parameter(0)
+  indices = s32[2,2] parameter(1)
+  ROOT gather = s32[2,2] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0,1},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=1,
+      window_bounds={1,1,2}
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
+                                {{-4, 4}, {-5, 5}, {-6, 6}},  //
+                                {{-7, 7}, {-8, 8}, {-9, 9}}});
+  std::unique_ptr<Literal> gather_indices =
+      Literal::CreateR2<int32>({{0, 0}, {1, 0}});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, TensorFlowGatherNdNonDefaultIndexVectorDim) {
+  const string hlo_text = R"(
+HloModule TensorFlowGatherNd
+
+ENTRY main {
+  operand = s32[3,3,2] parameter(0)
+  indices = s32[2,2] parameter(1)
+  ROOT gather = s32[2,2] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0,1},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=0,
+      window_bounds={1,1,2}
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
+                                {{-4, 4}, {-5, 5}, {-6, 6}},  //
+                                {{-7, 7}, {-8, 8}, {-9, 9}}});
+  std::unique_ptr<Literal> gather_indices =
+      Literal::CreateR2<int32>({{0, 0}, {1, 0}});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, DynamicSlice) {
+  const char* hlo_text = R"(
+HloModule DynamicSlice
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2] parameter(1)
+  ROOT gather = s32[1,1] gather(operand, indices),
+      output_window_dims={0,1},
+      elided_window_dims={},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=0,
+      window_bounds={1,1}
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({1, 1});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, BatchDynamicSlice) {
+  const string hlo_text = R"(
+HloModule BatchDynamicSlice
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[2,2] parameter(1)
+  ROOT gather = s32[2,1,1] gather(operand, indices),
+      output_window_dims={1,2},
+      elided_window_dims={},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=0,
+      window_bounds={1,1}
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      Literal::CreateR2<int32>({{2, 1}, {1, 1}});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, ZeroDimBounds) {
+  const char* hlo_text = R"(
+HloModule TensorFlowGatherV1
+
+ENTRY main {
+  operand = s32[3,0] parameter(0)
+  indices = s32[2] parameter(1)
+  ROOT gather = s32[2,0] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1, 0}
+}
+)";
+  std::unique_ptr<Literal> operand = Literal::CreateR2<int32>({{}, {}, {}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, OutOfBoundsIndex) {
+  // Out-of-bounds indices must not crash, and the in-range indices should
+  // produce the same values across all backends.
+  //
+  // TODO(b/74360564): Once we have well-defined semantics for OOB accesses,
+  // we should get rid of the mask and check that backends produce the same
+  // value for OOB indices too.
+
+  const string hlo_text = R"(
+HloModule BatchDynamicSlice
+
+ENTRY main {
+  operand = s32[3,3]{1,0} parameter(0)
+  indices = s32[6,2]{1,0} parameter(1)
+  gather = s32[6,1,1]{2,1,0} gather(operand, indices),
+      output_window_dims={1,2},
+      elided_window_dims={},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=1,
+      window_bounds={1,1}
+  gather_reshaped = s32[6]{0} reshape(gather)
+  in_bounds_mask = s32[6]{0} parameter(2)
+  ROOT result = s32[6]{0} multiply(gather_reshaped, in_bounds_mask)
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR2<int32>(
+      {{2, 7}, {2, 1}, {1, 1}, {5, 1}, {2147483647, 1}, {1, 2}});
+  std::unique_ptr<Literal> in_bounds_mask =
+      Literal::CreateR1<int32>({0, 1, 1, 0, 0, 1});
+
+  RunTest(hlo_text,
+          {operand.get(), gather_indices.get(), in_bounds_mask.get()});
+}
+
+XLA_TEST_F(GatherOperationTest, NegativeIndex) {
+  // Negative indices must not crash, and the indices in range should produce
+  // the same values across all backends.
+  //
+  // TODO(b/74360564): Once we have well-defined semantics for negative
+  // accesses, we should get rid of the mask and check that backends produce the
+  // same value for negative indices too.
+
+  const string hlo_text = R"(
+HloModule BatchDynamicSlice
+
+ENTRY main {
+  operand = s32[3,3]{1,0} parameter(0)
+  indices = s32[6,2]{1,0} parameter(1)
+  gather = s32[6,1,1]{2,1,0} gather(operand, indices),
+      output_window_dims={1,2},
+      elided_window_dims={},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=1,
+      window_bounds={1,1}
+  gather_reshaped = s32[6]{0} reshape(gather)
+  in_bounds_mask = s32[6]{0} parameter(2)
+  ROOT result = s32[6]{0} multiply(gather_reshaped, in_bounds_mask)
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR2<int32>(
+      {{2, -1}, {2, 1}, {1, 1}, {-500, 1}, {-2147483648, 1}, {1, 2}});
+  std::unique_ptr<Literal> in_bounds_mask =
+      Literal::CreateR1<int32>({0, 1, 1, 0, 0, 1});
+
+  RunTest(hlo_text,
+          {operand.get(), gather_indices.get(), in_bounds_mask.get()});
+}
+
+XLA_TEST_F(GatherOperationTest, OneScalarIndex) {
+  const char* hlo_text = R"(
+HloModule OneScalarIndex
+
+ENTRY main {
+  operand = s32[2,3,2]{2,1,0} parameter(0)
+  index = s32[] parameter(1)
+  ROOT gather = s32[1,3,2]{2,1,0} gather(operand, index),
+      output_window_dims={0,1,2},
+      elided_window_dims={},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=0,
+      window_bounds={1,3,2}
+}
+)";
+  std::unique_ptr<Literal> operand = Literal::CreateR3<int32>(
+      {{{1, 2}, {3, 4}, {5, 6}}, {{7, 8}, {9, 10}, {11, 12}}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR0<int32>(1);
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, ScalarResult) {
+  const char* hlo_text = R"(
+HloModule ScalarResult
+
+ENTRY main {
+  operand = s32[4]{0} parameter(0)
+  index = s32[] parameter(1)
+  ROOT gather = s32[] gather(operand, index),
+      output_window_dims={},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=0,
+      window_bounds={1}
+}
+)";
+  std::unique_ptr<Literal> operand = Literal::CreateR1<int32>({1, 2, 3, 4});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR0<int32>(1);
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, ZeroSizedResult) {
+  const string hlo_text = R"(
+HloModule ZeroSizedResult
+
+ENTRY main {
+  operand = s32[3,3] parameter(0)
+  indices = s32[0] parameter(1)
+  ROOT gather = s32[0,3] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1, 3}
+}
+)";
+  std::unique_ptr<Literal> operand =
+      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+class GatherClientLibraryTest : public ClientLibraryTestBase {};
+
+// TODO(b/30671675): Asynchronous execution on stream is not yet supported on
+// GPU and CPU_PARALLEL.
+XLA_TEST_F(GatherClientLibraryTest,
+           DISABLED_ON_CPU_PARALLEL(DISABLED_ON_GPU(Basic))) {
+  // We build the HLO below, but using the ComputationBuilder API.
+  //
+  // ENTRY main {
+  //   operand = s32[3,3] parameter(0)
+  //   indices = s32[2] parameter(1)
+  //   ROOT gather = s32[2,3] gather(operand, indices),
+  //       output_window_dims={1},
+  //       elided_window_dims={0},
+  //       gather_dims_to_operand_dims={0},
+  //       index_vector_dim=1,
+  //       window_bounds={1, 3}
+  // }
+
+  ComputationBuilder builder(client_, "gather_basic");
+
+  Shape operand_shape = ShapeUtil::MakeShape(S32, {3, 3});
+  Shape indices_shape = ShapeUtil::MakeShape(S32, {2});
+
+  auto operand = builder.Parameter(0, operand_shape, "operand");
+  auto indices = builder.Parameter(1, indices_shape, "indices");
+  GatherDimensionNumbers dim_numbers;
+  dim_numbers.add_output_window_dims(1);
+  dim_numbers.add_elided_window_dims(0);
+  dim_numbers.add_gather_dims_to_operand_dims(0);
+  dim_numbers.set_index_vector_dim(1);
+  builder.Gather(operand, indices, dim_numbers, {1, 3});
+
+  std::vector<int32> expected = {};
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<GlobalData> operand_arg,
+                          client_->TransferToServer(*Literal::CreateR2<int32>(
+                              {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}})));
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<GlobalData> indices_arg,
+      client_->TransferToServer(*Literal::CreateR1<int32>({0, 2})));
+  TF_ASSERT_OK_AND_ASSIGN(std::vector<xla::DeviceHandle> devices,
+                          client_->GetDeviceHandles(1));
+  xla::ExecutionOptions execution_options = CreateDefaultExecutionOptions();
+  *execution_options.add_device_handles() = devices[0];
+  TF_ASSERT_OK_AND_ASSIGN(Computation computation, builder.Build());
+  std::vector<xla::Client::ComputationInstance> computation_instances = {
+      {computation,
+       {operand_arg.get(), indices_arg.get()},
+       execution_options,
+       /*execution_profile=*/nullptr}};
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::vector<std::unique_ptr<xla::GlobalData>> result_data,
+      client_->ExecuteParallel(computation_instances));
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result_literal,
+                          client_->Transfer(*(result_data[0])));
+  LiteralTestUtil::ExpectEqual(
+      *result_literal, *Literal::CreateR2<int32>({{1, 2, 3}, {7, 8, 9}}));
+}
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/hlo_test_base.cc b/tensorflow/compiler/xla/tests/hlo_test_base.cc
index 6723c99edb945492abfbac159bed1959d551ec57..e574644dea7c1ba144ba87fbeb7f28cc52312e26 100644
--- a/tensorflow/compiler/xla/tests/hlo_test_base.cc
+++ b/tensorflow/compiler/xla/tests/hlo_test_base.cc
@@ -115,6 +115,13 @@ StatusOr<std::unique_ptr<Literal>> HloTestBase::Execute(
   return test_runner_.Execute(std::move(module), arguments);
 }
 
+StatusOr<std::unique_ptr<Literal>> HloTestBase::ExecuteNoHloPasses(
+    std::unique_ptr<HloModule> module,
+    tensorflow::gtl::ArraySlice<Literal*> arguments) {
+  return test_runner_.Execute(std::move(module), arguments,
+                              /*run_hlo_passes=*/false);
+}
+
 std::unique_ptr<Literal> HloTestBase::ExecuteAndTransfer(
     std::unique_ptr<HloModule> module,
     tensorflow::gtl::ArraySlice<Literal*> arguments) {
@@ -140,15 +147,10 @@ StatusOr<std::unique_ptr<HloModule>> HloTestBase::MakeReferenceModule(
   return std::move(reference_module);
 }
 
-template <typename LiteralPtr>
 StatusOr<::testing::AssertionResult> HloTestBase::RunAndCompareInternal(
-    std::unique_ptr<HloModule> module, const ArraySlice<LiteralPtr> arguments,
+    std::unique_ptr<HloModule> module, const ArraySlice<Literal*> arguments,
     const optional<ErrorSpec>& error, bool run_hlo_passes,
     const std::function<void(HloModule*)>& reference_preprocessor) {
-  static_assert(
-      std::is_same<Literal*, LiteralPtr>::value ||
-          std::is_same<std::unique_ptr<Literal>, LiteralPtr>::value,
-      "The LiteralPtr type only accepts Literal* or std::unique_ptr<Literal>.");
   TF_RETURN_IF_ERROR(
       VerifyHloModule(*test_runner_.backend().platform(), module.get()));
   TF_ASSIGN_OR_RETURN(auto reference_module,
@@ -165,9 +167,8 @@ StatusOr<::testing::AssertionResult> HloTestBase::RunAndCompareInternal(
                                       error);
 }
 
-template <typename LiteralPtr>
 ::testing::AssertionResult HloTestBase::RunAndCompare(
-    std::unique_ptr<HloModule> module, const ArraySlice<LiteralPtr> arguments,
+    std::unique_ptr<HloModule> module, const ArraySlice<Literal*> arguments,
     const optional<ErrorSpec>& error,
     const std::function<void(HloModule*)>& reference_preprocessor) {
   auto result =
@@ -179,9 +180,8 @@ template <typename LiteralPtr>
   return result.ValueOrDie();
 }
 
-template <typename LiteralPtr>
 ::testing::AssertionResult HloTestBase::RunAndCompareNoHloPasses(
-    std::unique_ptr<HloModule> module, const ArraySlice<LiteralPtr> arguments,
+    std::unique_ptr<HloModule> module, const ArraySlice<Literal*> arguments,
     const optional<ErrorSpec>& error,
     const std::function<void(HloModule*)>& reference_preprocessor) {
   auto result =
@@ -198,8 +198,14 @@ template <typename LiteralPtr>
     const std::function<void(HloModule*)>& reference_preprocessor) {
   const auto& fake_arguments =
       MakeFakeArguments(module.get()).ConsumeValueOrDie();
-  return RunAndCompare<std::unique_ptr<Literal>>(
-      std::move(module), fake_arguments, error, reference_preprocessor);
+
+  std::vector<Literal*> fake_argument_ptrs;
+  c_transform(
+      fake_arguments, std::back_inserter(fake_argument_ptrs),
+      [](const std::unique_ptr<Literal>& literal) { return literal.get(); });
+
+  return RunAndCompare(std::move(module), fake_argument_ptrs, error,
+                       reference_preprocessor);
 }
 
 ::testing::AssertionResult HloTestBase::RunAndCompareNoHloPasses(
@@ -207,8 +213,13 @@ template <typename LiteralPtr>
     const std::function<void(HloModule*)>& reference_preprocessor) {
   const auto& fake_arguments =
       MakeFakeArguments(module.get()).ConsumeValueOrDie();
-  return RunAndCompareNoHloPasses<std::unique_ptr<Literal>>(
-      std::move(module), fake_arguments, error, reference_preprocessor);
+  std::vector<Literal*> fake_argument_ptrs;
+  c_transform(
+      fake_arguments, std::back_inserter(fake_argument_ptrs),
+      [](const std::unique_ptr<Literal>& literal) { return literal.get(); });
+
+  return RunAndCompareNoHloPasses(std::move(module), fake_argument_ptrs, error,
+                                  reference_preprocessor);
 }
 
 ::testing::AssertionResult HloTestBase::RunAndCompare(
diff --git a/tensorflow/compiler/xla/tests/hlo_test_base.h b/tensorflow/compiler/xla/tests/hlo_test_base.h
index 413bb213fdcb1303f396308d13d9d0b96b47b71f..3e8e2360bb3a87e127920cd222803c0f7b9161f4 100644
--- a/tensorflow/compiler/xla/tests/hlo_test_base.h
+++ b/tensorflow/compiler/xla/tests/hlo_test_base.h
@@ -44,7 +44,7 @@ namespace xla {
 // enables, for one, explicitly building a graph of HLO instructions to run.
 //
 // This can also be used to write text/file-based test cases. Note that the test
-// target is responsible for linking the needed backends. A covenient way to do
+// target is responsible for linking the needed backends. A convenient way to do
 // this is to make it an xla_test: it will generate test targets linking with
 // the respective backends, which will be used as the test backend; the
 // interpreter backend is already linked with hlo_test_base so it will be the
@@ -98,14 +98,19 @@ class HloTestBase : public ::testing::Test {
       std::unique_ptr<HloModule> module,
       tensorflow::gtl::ArraySlice<Literal*> arguments);
 
+  // Same as above, except the module will be executed without running any HLO
+  // passes on it.
+  StatusOr<std::unique_ptr<Literal>> ExecuteNoHloPasses(
+      std::unique_ptr<HloModule> module,
+      tensorflow::gtl::ArraySlice<Literal*> arguments);
+
   std::unique_ptr<Literal> ExecuteAndTransfer(
       std::unique_ptr<HloModule> module,
       tensorflow::gtl::ArraySlice<Literal*> arguments);
 
   // Executes the given hlo module on two backends and compares results.
   //
-  // 'arguments': the input of the hlo module. The LiteralPtr type accepts
-  // Literal* or std::unique_ptr<Literal>.
+  // 'arguments': the input of the hlo module.
   //
   // 'error': if has value, expects the results to be near (within the error
   // bound). Otherwise, expects the results to be equal.
@@ -114,20 +119,18 @@ class HloTestBase : public ::testing::Test {
   // backend, but it might need to be tailored so that it is able to run on the
   // reference backend. Note that the program shape of the module must not be
   // modified.
-  template <typename LiteralPtr>
   ::testing::AssertionResult RunAndCompare(
       std::unique_ptr<HloModule> module,
-      const tensorflow::gtl::ArraySlice<LiteralPtr> arguments,
+      const tensorflow::gtl::ArraySlice<Literal*> arguments,
       const tensorflow::gtl::optional<ErrorSpec>& error,
       const std::function<void(HloModule*)>& reference_preprocessor = nullptr)
       TF_MUST_USE_RESULT;
 
   // Same as above, except that the module will be executed without Hlo
   // optimization.
-  template <typename LiteralPtr>
   ::testing::AssertionResult RunAndCompareNoHloPasses(
       std::unique_ptr<HloModule> module,
-      const tensorflow::gtl::ArraySlice<LiteralPtr> arguments,
+      const tensorflow::gtl::ArraySlice<Literal*> arguments,
       const tensorflow::gtl::optional<ErrorSpec>& error,
       const std::function<void(HloModule*)>& reference_preprocessor = nullptr)
       TF_MUST_USE_RESULT;
@@ -232,10 +235,9 @@ class HloTestBase : public ::testing::Test {
   // Runs the module on two platforms with or without running hlo passes and
   // compares the results. Returns whether the results are near or equal. If any
   // error happens before the results are computed, returns the error status.
-  template <typename LiteralPtr>
   StatusOr<::testing::AssertionResult> RunAndCompareInternal(
       std::unique_ptr<HloModule> module,
-      const tensorflow::gtl::ArraySlice<LiteralPtr> arguments,
+      const tensorflow::gtl::ArraySlice<Literal*> arguments,
       const tensorflow::gtl::optional<ErrorSpec>& error, bool run_hlo_passes,
       const std::function<void(HloModule*)>& reference_preprocessor);
 };
diff --git a/tensorflow/compiler/xla/tests/hlo_verified_test_base.cc b/tensorflow/compiler/xla/tests/hlo_verified_test_base.cc
index 506091ddd8d1d8e6519525bb7031f4e8b296b5fb..da4cf4ae0c31bc194cd2ec9b845df36afbde69b0 100644
--- a/tensorflow/compiler/xla/tests/hlo_verified_test_base.cc
+++ b/tensorflow/compiler/xla/tests/hlo_verified_test_base.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_verifier.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/test.h"
 
@@ -40,18 +41,22 @@ void HloVerifiedTestBase::TearDown() {
       << "TearDown called more than once; it should be called exactly once.";
   tear_down_called_ = true;
   if (module_) {
-    HloVerifier verifier;
-    xla::StatusOr<bool> mutated = verifier.Run(module_.get());
-    if (!mutated.ok()) {
-      ADD_FAILURE() << "HloVerifier failed: " << mutated.status();
-    } else {
-      EXPECT_FALSE(mutated.ValueOrDie())
-          << "HloVerifier should never mutate the HloModule";
-    }
+    VerifyModule();
   }
   HloTestBase::TearDown();
 }
 
+void HloVerifiedTestBase::VerifyModule() {
+  HloVerifier verifier;
+  xla::StatusOr<bool> mutated = verifier.Run(module_.get());
+  if (!mutated.ok()) {
+    ADD_FAILURE() << "HloVerifier failed: " << mutated.status();
+  } else {
+    EXPECT_FALSE(mutated.ValueOrDie())
+        << "HloVerifier should never mutate the HloModule";
+  }
+}
+
 HloModule& HloVerifiedTestBase::module() {
   if (!module_) {
     module_ = CreateNewModule();
@@ -59,4 +64,10 @@ HloModule& HloVerifiedTestBase::module() {
   return *module_;
 }
 
+void HloVerifiedTestBase::ParseAndVerifyModule(
+    tensorflow::StringPiece hlo_text) {
+  CHECK(!module_) << "ParseAndVerifyModule: test already has a module.";
+  TF_ASSERT_OK_AND_ASSIGN(module_, tools::Parse(hlo_text));
+  VerifyModule();
+}
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/hlo_verified_test_base.h b/tensorflow/compiler/xla/tests/hlo_verified_test_base.h
index 492688bf7d682cf991cb8c09399492a0437f651b..e5bb14a8839acbdef8fd2b79bb0f574c46ea3d40 100644
--- a/tensorflow/compiler/xla/tests/hlo_verified_test_base.h
+++ b/tensorflow/compiler/xla/tests/hlo_verified_test_base.h
@@ -44,6 +44,7 @@ class HloVerifiedTestBase : public HloTestBase {
   // Returns the default HloModule, lazily creating it if necessary via
   // HloTestBase::CreateNewModule().
   HloModule& module();
+  void ParseAndVerifyModule(tensorflow::StringPiece hlo_text);
 
   // Sets the shape-size function used during hlo verification. If this isn't
   // called, a default ShapeVerifier is used instead.
@@ -55,6 +56,7 @@ class HloVerifiedTestBase : public HloTestBase {
   std::unique_ptr<HloModule> module_;  // Lazily populated. Access via module().
   std::unique_ptr<ShapeVerifier> shape_verifier_;
   bool tear_down_called_ = false;
+  void VerifyModule();
 };
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/llvm_irgen_test_base.cc b/tensorflow/compiler/xla/tests/llvm_irgen_test_base.cc
index 99514baf23cafe61adc28a30dfdfe2691ab82d32..3023df47cda33f5d11abc921fd0355d48f761107 100644
--- a/tensorflow/compiler/xla/tests/llvm_irgen_test_base.cc
+++ b/tensorflow/compiler/xla/tests/llvm_irgen_test_base.cc
@@ -20,6 +20,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
 #include "tensorflow/compiler/xla/tests/filecheck.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
 
 namespace xla {
@@ -49,11 +50,11 @@ void LLVMIRGenTestBase::CompileAndVerifyIr(
     std::unique_ptr<HloModule> hlo_module, const string& pattern,
     bool match_optimized_ir) {
   SetIrHook(match_optimized_ir);
-  ASSERT_TRUE(CompileToExecutable(std::move(hlo_module)).ok());
+  TF_ASSERT_OK(CompileToExecutable(std::move(hlo_module)).status());
   ResetIrHook();
 
   StatusOr<bool> filecheck_result = RunFileCheck(ir_, pattern);
-  ASSERT_TRUE(filecheck_result.ok());
+  TF_ASSERT_OK(filecheck_result.status());
   EXPECT_TRUE(filecheck_result.ValueOrDie());
 }
 
diff --git a/tensorflow/compiler/xla/tests/map_test.cc b/tensorflow/compiler/xla/tests/map_test.cc
index 2b0f7e6e80c48435ca55432a2afa3b6d69162625..0cd812fd1b4bc69c34b70d3ca0fd0aa6cf57fa4c 100644
--- a/tensorflow/compiler/xla/tests/map_test.cc
+++ b/tensorflow/compiler/xla/tests/map_test.cc
@@ -531,7 +531,7 @@ TEST_F(MapTest, MapOperantionWithBuildError) {
   ASSERT_TRUE(!computation_status.ok());
   EXPECT_THAT(
       computation_status.status().ToString(),
-      ::testing::HasSubstr("error from: ErrorAdd: binary op BINOP_ADD with "
+      ::testing::HasSubstr("error from: ErrorAdd: Binary op BINOP_ADD with "
                            "different element types: f32[] and u16[]"));
 }
 
diff --git a/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc b/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc
index 6c86dd5b9ef673c9facffafa37e00a859ce82010..c42f71388baba73e08a361d817e41b03e03bf133 100644
--- a/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc
+++ b/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc
@@ -29,6 +29,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
+#include "tensorflow/compiler/xla/tests/test_macros.h"
+#include "tensorflow/compiler/xla/tests/test_utils.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
 #include "tensorflow/core/platform/logging.h"
@@ -38,258 +40,223 @@ limitations under the License.
 namespace xla {
 namespace {
 
-class MatOpsSimpleTest : public ClientLibraryTestBase {
- protected:
-  Computation BuildSum() {
-    // sum(x, y) = x + y
-    ComputationBuilder builder(client_, "sum");
-    auto x_value =
-        builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "x_value");
-    auto y_value =
-        builder.Parameter(1, ShapeUtil::MakeShape(F32, {}), "y_value");
-    builder.Add(x_value, y_value);
-    auto computation_status = builder.Build();
-    TF_CHECK_OK(computation_status.status());
-    return computation_status.ConsumeValueOrDie();
-  }
-
-  void TestLinspaceMax(int64 rows, int64 cols) {
-    float from = -128.0, to = 256.0;
-    std::unique_ptr<Array2D<float>> alhs =
-        MakeLinspaceArray2D(from, to, rows, cols);
-    auto arhs = MakeUnique<Array2D<float>>(rows, cols, 1.0);
-
-    ComputationBuilder builder(
-        client_,
-        tensorflow::strings::Printf("max_%lldx%lld_linspace", rows, cols));
-    auto lhs = builder.ConstantR2FromArray2D<float>(*alhs);
-    auto rhs = builder.ConstantR2FromArray2D<float>(*arhs);
-    auto max = builder.Max(lhs, rhs);
-
-    Array2D<float> aexpected(rows, cols);
-    for (int row = 0; row < rows; ++row) {
-      for (int col = 0; col < cols; ++col) {
-        aexpected(row, col) = std::max((*alhs)(row, col), (*arhs)(row, col));
-      }
-    }
-
-    ComputeAndCompareR2<float>(&builder, aexpected, {}, ErrorSpec(1e-6));
-  }
-};
-
-TEST_F(MatOpsSimpleTest, ExpTwoByTwoValues) {
-  ComputationBuilder builder(client_, "exp_2x2");
-  auto data = builder.ConstantR2<float>({
-      {1.0, 0.0},   // row 0
-      {-1.0, 0.5},  // row 1
+#ifdef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
+using TypesF16F32 = ::testing::Types<float>;
+#else
+using TypesF16F32 = ::testing::Types<Eigen::half, float>;
+#endif
+
+class MatOpsSimpleTest : public ClientLibraryTestBase {};
+
+template <typename T>
+class MatOpsSimpleTest_F16F32 : public MatOpsSimpleTest {};
+
+// TODO(bixia): This test for F16 failed on GPU 02-25-2018.
+#ifdef XLA_TEST_BACKEND_GPU
+TYPED_TEST_CASE(MatOpsSimpleTest_F16F32, ::testing::Types<float>);
+#else
+TYPED_TEST_CASE(MatOpsSimpleTest_F16F32, TypesF16F32);
+#endif
+
+XLA_TYPED_TEST(MatOpsSimpleTest_F16F32, ExpTwoByTwoValues) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, "exp_2x2");
+  auto data = builder.ConstantR2FromArray2D<T>({
+      {1.0f, 0.0f},   // row 0
+      {-1.0f, 0.5f},  // row 1
   });
   builder.Exp(data);
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR2<float>({{2.71828, 1.00000},    // row 0
-                                {0.36788, 1.64872}});  // row 1
+      Literal::CreateR2FromArray2D<T>({{2.71828f, 1.00000f},    // row 0
+                                       {0.36788f, 1.64872f}});  // row 1
 
-  ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-5));
+  this->template ComputeAndCompareLiteral(&builder, *expected, {},
+                                          ErrorSpec(1e-5));
 }
 
-TEST_F(MatOpsSimpleTest, MapTwoByTwo) {
+XLA_TYPED_TEST(MatOpsSimpleTest_F16F32, MapTwoByTwo) {
+  using T = TypeParam;
   Computation add_half;
   {
     // add_half(x) = x + 0.5
-    ComputationBuilder builder(client_, "add_half");
+    ComputationBuilder builder(this->client_, "add_half");
     auto x_value =
-        builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "x_value");
-    auto half = builder.ConstantR0<float>(0.5);
+        builder.Parameter(0, ShapeUtil::MakeShapeWithType<T>({}), "x_value");
+    auto half = builder.ConstantR0<T>(static_cast<T>(0.5));
     builder.Add(x_value, half);
     auto computation_status = builder.Build();
     ASSERT_IS_OK(computation_status.status());
     add_half = computation_status.ConsumeValueOrDie();
   }
 
-  ComputationBuilder builder(client_, "map_2x2");
-  auto data = builder.ConstantR2<float>({
-      {1.0, 0.0},   // row 0
-      {-1.0, 0.5},  // row 1
+  ComputationBuilder builder(this->client_, "map_2x2");
+  auto data = builder.ConstantR2FromArray2D<T>({
+      {1.0f, 0.0f},   // row 0
+      {-1.0f, 0.5f},  // row 1
   });
   auto map = builder.Map({data}, add_half, {0, 1});
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR2<float>({{1.5, 0.5},     // row 0
-                                {-0.5, 1.0}});  // row 1
-  ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-5));
+      Literal::CreateR2FromArray2D<T>({{1.5f, 0.5f},     // row 0
+                                       {-0.5f, 1.0f}});  // row 1
+  this->template ComputeAndCompareLiteral(&builder, *expected, {},
+                                          ErrorSpec(1e-5));
 }
 
-TEST_F(MatOpsSimpleTest, MaxTwoByTwoValues) {
-  ComputationBuilder builder(client_, "max_2x2");
-  auto lhs = builder.ConstantR2<float>({
-      {7.0, 2.0},   // row 0
-      {3.0, -4.0},  // row 1
+XLA_TYPED_TEST(MatOpsSimpleTest_F16F32, MaxTwoByTwoValues) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, "max_2x2");
+  auto lhs = builder.ConstantR2FromArray2D<T>({
+      {7.0f, 2.0f},   // row 0
+      {3.0f, -4.0f},  // row 1
   });
-  auto rhs = builder.ConstantR2<float>({
-      {5.0, 6.0},   // row 0
-      {1.0, -8.0},  // row 1
+  auto rhs = builder.ConstantR2FromArray2D<T>({
+      {5.0f, 6.0f},   // row 0
+      {1.0f, -8.0f},  // row 1
   });
   auto max = builder.Max(lhs, rhs);
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR2<float>({{7.0, 6.0},     // row 0
-                                {3.0, -4.0}});  // row 1
-  ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-6));
+      Literal::CreateR2FromArray2D<T>({{7.0f, 6.0f},     // row 0
+                                       {3.0f, -4.0f}});  // row 1
+  this->template ComputeAndCompareLiteral(&builder, *expected, {},
+                                          ErrorSpec(1e-6));
 }
 
-TEST_F(MatOpsSimpleTest, Max1x1Linspace) { TestLinspaceMax(1, 1); }
-
-TEST_F(MatOpsSimpleTest, Max2x2Linspace) { TestLinspaceMax(2, 2); }
-
-TEST_F(MatOpsSimpleTest, Max3x3Linspace) { TestLinspaceMax(3, 3); }
-
-TEST_F(MatOpsSimpleTest, Max4x4Linspace) { TestLinspaceMax(4, 4); }
-
-TEST_F(MatOpsSimpleTest, Max6x6Linspace) { TestLinspaceMax(6, 6); }
-
-TEST_F(MatOpsSimpleTest, Max8x8Linspace) { TestLinspaceMax(8, 8); }
-
-TEST_F(MatOpsSimpleTest, Max12x12Linspace) { TestLinspaceMax(12, 12); }
-
-TEST_F(MatOpsSimpleTest, Max16x16Linspace) { TestLinspaceMax(16, 16); }
+struct TestLinspaceMaxParam {
+  int64 rows;
+  int64 cols;
+};
 
-TEST_F(MatOpsSimpleTest, Max32x8Linspace) { TestLinspaceMax(32, 8); }
+class TestLinspaceMaxParametric
+    : public MatOpsSimpleTest,
+      public ::testing::WithParamInterface<TestLinspaceMaxParam> {
+ public:
+  template <typename T>
+  void TestImpl() {
+    TestLinspaceMaxParam param = GetParam();
+    int64 rows = param.rows;
+    int64 cols = param.cols;
+    float from = -128.0, to = 256.0;
+    std::unique_ptr<Array2D<T>> alhs =
+        MakeLinspaceArray2D<T>(from, to, rows, cols);
+    auto arhs = MakeUnique<Array2D<T>>(rows, cols, static_cast<T>(1.0f));
 
-TEST_F(MatOpsSimpleTest, Max64x8Linspace) { TestLinspaceMax(64, 8); }
+    ComputationBuilder builder(
+        client_,
+        tensorflow::strings::Printf("max_%lldx%lld_linspace", rows, cols));
+    auto lhs = builder.ConstantR2FromArray2D<T>(*alhs);
+    auto rhs = builder.ConstantR2FromArray2D<T>(*arhs);
+    auto max = builder.Max(lhs, rhs);
 
-class MatOpsDotAddTest
-    : public ClientLibraryTestBase,
-      public ::testing::WithParamInterface<std::tuple<bool, bool, bool>> {};
-
-TEST_P(MatOpsDotAddTest, Dot_Add_2x2_2x2) {
-  bool row_major = std::get<0>(GetParam());
-  bool add_lhs = std::get<1>(GetParam());
-  bool transpose = std::get<2>(GetParam());
-  Array2D<float> lhs({{1.0, 2.0}, {3.0, 4.0}});
-  Array2D<float> rhs({{10.0, 11.0}, {12.0, 13.0}});
-
-  auto minor_to_major = [](bool row_major) -> std::vector<int64> {
-    return {row_major ? 1 : 0, row_major ? 0 : 1};
-  };
-
-  auto prim_type = primitive_util::NativeToPrimitiveType<float>();
-  Shape lhs_shape =
-      ShapeUtil::MakeShape(prim_type, {lhs.height(), lhs.width()});
-  Shape rhs_shape =
-      ShapeUtil::MakeShape(prim_type, {rhs.height(), rhs.width()});
-
-  TF_ASSERT_OK_AND_ASSIGN(
-      auto lhs_handle,
-      client_->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<float>(
-          lhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
-  TF_ASSERT_OK_AND_ASSIGN(
-      auto rhs_handle,
-      client_->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<float>(
-          rhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
-
-  ComputationBuilder builder(client_, TestName());
-  auto lhs_arg = builder.Parameter(0, lhs_shape, "lhs");
-  auto lhs_mat_arg = lhs_arg;
-  if (transpose) {
-    lhs_mat_arg = builder.Transpose(lhs_mat_arg, {1, 0});
-  }
-  auto rhs_arg = builder.Parameter(1, rhs_shape, "rhs");
-  auto result = builder.Dot(lhs_mat_arg, rhs_arg);
-  Array2D<float> expected;
-  if (add_lhs) {
-    result = builder.Add(result, lhs_arg);
-    if (transpose) {
-      expected = Array2D<float>({{47, 52}, {71, 78}});
-    } else {
-      expected = Array2D<float>({{35, 39}, {81, 89}});
+    Array2D<T> expected(rows, cols);
+    for (int row = 0; row < rows; ++row) {
+      for (int col = 0; col < cols; ++col) {
+        expected(row, col) = std::max<T>((*alhs)(row, col), (*arhs)(row, col));
+      }
     }
-  } else {
-    result = builder.Add(result, rhs_arg);
-    if (transpose) {
-      expected = Array2D<float>({{56, 61}, {80, 87}});
-    } else {
-      expected = Array2D<float>({{44, 48}, {90, 98}});
+    ErrorSpec error_spec(1e-6);
+    if (std::is_same<Eigen::half, T>::value) {
+      error_spec = ErrorSpec(1e-6, 2e-4);
     }
+    ComputeAndCompareR2<T>(&builder, expected, {}, error_spec);
   }
+};
 
-  ComputeAndCompareR2<float>(&builder, expected,
-                             {lhs_handle.get(), rhs_handle.get()},
-                             ErrorSpec(1e-6));
+string PrintTestLinspaceMaxParam(
+    const ::testing::TestParamInfo<TestLinspaceMaxParam>& test_param) {
+  const TestLinspaceMaxParam& param = test_param.param;
+  return tensorflow::strings::StrCat(param.rows, "r", param.cols, "c");
 }
 
-INSTANTIATE_TEST_CASE_P(MatOpsDotAddTestInstances, MatOpsDotAddTest,
-                        ::testing::Combine(::testing::Bool(), ::testing::Bool(),
-                                           ::testing::Bool()));
+#ifndef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
+// TODO(bixia): This test failed on GPU 02-25-2018
+#ifdef XLA_TEST_BACKEND_CPU
+XLA_TEST_P(TestLinspaceMaxParametric, TestF16) { TestImpl<Eigen::half>(); }
+#endif
+#endif
+XLA_TEST_P(TestLinspaceMaxParametric, TestF32) { TestImpl<float>(); }
+
+INSTANTIATE_TEST_CASE_P(
+    TestLinspaceMax, TestLinspaceMaxParametric,
+    ::testing::Values(TestLinspaceMaxParam{1, 1}, TestLinspaceMaxParam{2, 2},
+                      TestLinspaceMaxParam{3, 3}, TestLinspaceMaxParam{4, 4},
+                      TestLinspaceMaxParam{6, 6}, TestLinspaceMaxParam{8, 8},
+                      TestLinspaceMaxParam{12, 12},
+                      TestLinspaceMaxParam{16, 16}, TestLinspaceMaxParam{32, 8},
+                      TestLinspaceMaxParam{64, 8}),
+    PrintTestLinspaceMaxParam);
 
-class MatOpsDotAddTest_bf16
+class MatOpsDotAddTest
     : public ClientLibraryTestBase,
-      public ::testing::WithParamInterface<std::tuple<bool, bool, bool>> {};
-
-TEST_P(MatOpsDotAddTest_bf16, Dot_Add_2x2_2x2) {
-  bool row_major = std::get<0>(GetParam());
-  bool add_lhs = std::get<1>(GetParam());
-  bool transpose = std::get<2>(GetParam());
-  Array2D<bfloat16> lhs(
-      {{bfloat16(1.0f), bfloat16(2.0f)}, {bfloat16(3.0), bfloat16(4.0)}});
-  Array2D<bfloat16> rhs(
-      {{bfloat16(10.0f), bfloat16(11.0f)}, {bfloat16(12.0f), bfloat16(13.0f)}});
-
-  auto minor_to_major = [](bool row_major) -> std::vector<int64> {
-    return {row_major ? 1 : 0, row_major ? 0 : 1};
-  };
-
-  auto prim_type = primitive_util::NativeToPrimitiveType<bfloat16>();
-  Shape lhs_shape =
-      ShapeUtil::MakeShape(prim_type, {lhs.height(), lhs.width()});
-  Shape rhs_shape =
-      ShapeUtil::MakeShape(prim_type, {rhs.height(), rhs.width()});
-
-  TF_ASSERT_OK_AND_ASSIGN(
-      auto lhs_handle,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2DWithLayout<bfloat16>(
-              lhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
-  TF_ASSERT_OK_AND_ASSIGN(
-      auto rhs_handle,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2DWithLayout<bfloat16>(
-              rhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
-
-  ComputationBuilder builder(client_, TestName());
-  auto lhs_arg = builder.Parameter(0, lhs_shape, "lhs");
-  auto lhs_mat_arg = lhs_arg;
-  if (transpose) {
-    lhs_mat_arg = builder.Transpose(lhs_mat_arg, {1, 0});
-  }
-  auto rhs_arg = builder.Parameter(1, rhs_shape, "rhs");
-  auto result = builder.Dot(lhs_mat_arg, rhs_arg);
-  Array2D<bfloat16> expected;
-  if (add_lhs) {
-    result = builder.Add(result, lhs_arg);
+      public ::testing::WithParamInterface<std::tuple<bool, bool, bool>> {
+ public:
+  template <typename T>
+  void TestImpl() {
+    bool row_major = std::get<0>(GetParam());
+    bool add_lhs = std::get<1>(GetParam());
+    bool transpose = std::get<2>(GetParam());
+    Array2D<T> lhs({{1.0f, 2.0f}, {3.0f, 4.0f}});
+    Array2D<T> rhs({{10.0f, 11.0f}, {12.0f, 13.0f}});
+
+    auto minor_to_major = [](bool row_major) -> std::vector<int64> {
+      return {row_major ? 1 : 0, row_major ? 0 : 1};
+    };
+
+    auto prim_type = primitive_util::NativeToPrimitiveType<T>();
+    Shape lhs_shape =
+        ShapeUtil::MakeShape(prim_type, {lhs.height(), lhs.width()});
+    Shape rhs_shape =
+        ShapeUtil::MakeShape(prim_type, {rhs.height(), rhs.width()});
+
+    TF_ASSERT_OK_AND_ASSIGN(
+        auto lhs_handle,
+        client_->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
+            lhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
+    TF_ASSERT_OK_AND_ASSIGN(
+        auto rhs_handle,
+        client_->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
+            rhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
+
+    ComputationBuilder builder(client_, TestName());
+    auto lhs_arg = builder.Parameter(0, lhs_shape, "lhs");
+    auto lhs_mat_arg = lhs_arg;
     if (transpose) {
-      expected = Array2D<bfloat16>(
-          {{bfloat16(47), bfloat16(52)}, {bfloat16(71), bfloat16(78)}});
-    } else {
-      expected = Array2D<bfloat16>(
-          {{bfloat16(35), bfloat16(39)}, {bfloat16(81), bfloat16(89)}});
+      lhs_mat_arg = builder.Transpose(lhs_mat_arg, {1, 0});
     }
-  } else {
-    result = builder.Add(result, rhs_arg);
-    if (transpose) {
-      expected = Array2D<bfloat16>(
-          {{bfloat16(56), bfloat16(61)}, {bfloat16(80), bfloat16(87)}});
+    auto rhs_arg = builder.Parameter(1, rhs_shape, "rhs");
+    auto result = builder.Dot(lhs_mat_arg, rhs_arg);
+    Array2D<T> expected;
+    if (add_lhs) {
+      result = builder.Add(result, lhs_arg);
+      if (transpose) {
+        expected = Array2D<T>({{47.0f, 52.0f}, {71.0f, 78.0f}});
+      } else {
+        expected = Array2D<T>({{35.0f, 39.0f}, {81.0f, 89.0f}});
+      }
     } else {
-      expected = Array2D<bfloat16>(
-          {{bfloat16(44), bfloat16(48)}, {bfloat16(90), bfloat16(98)}});
+      result = builder.Add(result, rhs_arg);
+      if (transpose) {
+        expected = Array2D<T>({{56.0f, 61.0f}, {80.0f, 87.0f}});
+      } else {
+        expected = Array2D<T>({{44.0f, 48.0f}, {90.0f, 98.0f}});
+      }
     }
+
+    ComputeAndCompareR2<T>(&builder, expected,
+                           {lhs_handle.get(), rhs_handle.get()},
+                           ErrorSpec(1e-6));
   }
+};
 
-  ComputeAndCompareR2<bfloat16>(&builder, expected,
-                                {lhs_handle.get(), rhs_handle.get()},
-                                ErrorSpec(1e-6));
-}
+XLA_TEST_P(MatOpsDotAddTest, Dot_Add_2x2_2x2BF16) { TestImpl<bfloat16>(); }
+#ifndef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
+XLA_TEST_P(MatOpsDotAddTest, Dot_Add_2x2_2x2F16) { TestImpl<Eigen::half>(); }
+#endif
+XLA_TEST_P(MatOpsDotAddTest, Dot_Add_2x2_2x2F32) { TestImpl<float>(); }
 
-INSTANTIATE_TEST_CASE_P(MatOpsDotAddTestInstances, MatOpsDotAddTest_bf16,
+INSTANTIATE_TEST_CASE_P(MatOpsDotAddTestInstances, MatOpsDotAddTest,
                         ::testing::Combine(::testing::Bool(), ::testing::Bool(),
                                            ::testing::Bool()));
 
diff --git a/tensorflow/compiler/xla/tests/reduce_test.cc b/tensorflow/compiler/xla/tests/reduce_test.cc
index 50d7b5074d201d2292cf90224ef4cd37efdbb8d3..3a097a01ab095b8a21a39f0d738a43c3d6a4d1d7 100644
--- a/tensorflow/compiler/xla/tests/reduce_test.cc
+++ b/tensorflow/compiler/xla/tests/reduce_test.cc
@@ -884,5 +884,47 @@ XLA_TEST_F(ReduceTest, ReduceOrPredR2_64x32_To_R1) {
   RunR2ToR1PredTest</*cols=32*/ 32>(/*and_reduce=false*/ false, /*rows=64*/ 64);
 }
 
+// Tests reductions with different initial values.  There's no test macro that
+// combines TYPED_TEST and TEST_P, so we have to do it manually.
+class ReduceInitializerTest : public ReduceTest {
+ protected:
+  template <typename T>
+  void DoTest(T initializer, int num_elems) {
+    ComputationBuilder builder(client_, TestName());
+    Computation max_fn = CreateScalarMaxComputation(
+        primitive_util::NativeToPrimitiveType<T>(), &builder);
+
+    auto init = builder.ConstantR0<T>(initializer);
+    std::vector<T> input_arr(num_elems, std::numeric_limits<T>::lowest());
+    auto input_literal = Literal::CreateR1<T>(input_arr);
+    auto input_data =
+        client_->TransferToServer(*input_literal).ConsumeValueOrDie();
+    builder.Reduce(builder.Parameter(0, input_literal->shape(), "input"), init,
+                   max_fn, {0});
+
+    ComputeAndCompareR0<T>(&builder, initializer, {input_data.get()});
+  }
+};
+
+XLA_TEST_F(ReduceInitializerTest, U8Small) { DoTest<uint8>(42, 2); }
+
+XLA_TEST_F(ReduceInitializerTest, U8BigPowerOf2) { DoTest<uint8>(42, 4096); }
+
+XLA_TEST_F(ReduceInitializerTest, U8InitializerBigNonPowerOf2) {
+  DoTest<uint8>(42, 4095);
+}
+
+XLA_TEST_F(ReduceInitializerTest, U64InitializerZero) {
+  DoTest<uint64>(0, 1024);
+}
+
+XLA_TEST_F(ReduceInitializerTest, U64InitializerOne) {
+  DoTest<uint64>(1, 1024);
+}
+
+XLA_TEST_F(ReduceInitializerTest, U64InitializerBigValue) {
+  DoTest<uint64>(1234556789123, 1024);
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/reduce_window_test.cc b/tensorflow/compiler/xla/tests/reduce_window_test.cc
index b11b64e40a582150d6adf29e915cd70b4bcb982b..9c317fe579394c5b7a1d599169f471d484950199 100644
--- a/tensorflow/compiler/xla/tests/reduce_window_test.cc
+++ b/tensorflow/compiler/xla/tests/reduce_window_test.cc
@@ -960,45 +960,76 @@ struct R2ReduceWindowTestData {
   int64 base_bounds[2];
   int64 window_bounds[2];
   int64 strides[2];
+  int64 pad_low[2];
+  int64 pad_high[2];
   int64 layout[2];
-  Padding padding;
   Reducer reducer;
 } kR2TestCases[] = {
     {/*base_bounds=*/{4, 18}, /*window_bounds=*/{2, 4},
-     /*strides=*/{1, 2}, /*layout=*/{0, 1},
-     /*padding=*/Padding::kSame, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{1, 2}, /*pad_low=*/{0, 1}, /*pad_high=*/{1, 1},
+     /*layout=*/{0, 1},
+     /*reducer=*/Reducer::kAdd},
     {/*base_bounds=*/{2, 5}, /*window_bounds=*/{2, 4},
-     /*strides=*/{1, 1}, /*layout=*/{0, 1},
-     /*padding=*/Padding::kSame, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{1, 1}, /*pad_low=*/{0, 1}, /*pad_high=*/{1, 2},
+     /*layout=*/{0, 1},
+     /*reducer=*/Reducer::kAdd},
     {/*base_bounds=*/{1, 3}, /*window_bounds=*/{2, 3},
-     /*strides=*/{1, 1}, /*layout=*/{0, 1},
-     /*padding=*/Padding::kSame, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{1, 1}, /*pad_low=*/{0, 1}, /*pad_high=*/{1, 1},
+     /*layout=*/{0, 1},
+     /*reducer=*/Reducer::kAdd},
     {/*base_bounds=*/{3, 129}, /*window_bounds=*/{1, 100},
-     /*strides=*/{2, 99}, /*layout=*/{0, 1},
-     /*padding=*/Padding::kSame, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{2, 99}, /*pad_low=*/{0, 0}, /*pad_high=*/{35, 35},
+     /*layout=*/{0, 1},
+     /*reducer=*/Reducer::kAdd},
+// TODO(b/74260408): This test last failed on GPU on 2018-03-08, likely due to a
+// ptxas bug.
+#ifndef XLA_TEST_BACKEND_GPU
     {/*base_bounds=*/{6, 152}, /*window_bounds=*/{2, 25},
-     /*strides=*/{5, 4}, /*layout=*/{0, 1},
-     /*padding=*/Padding::kSame, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{5, 4}, /*pad_low=*/{0, 1}, /*pad_high=*/{10, 11},
+     /*layout=*/{0, 1},
+     /*reducer=*/Reducer::kAdd},
+#endif
     {/*base_bounds=*/{6, 4}, /*window_bounds=*/{4, 2},
-     /*strides=*/{3, 3}, /*layout=*/{0, 1},
-     /*padding=*/Padding::kSame, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{3, 3}, /*pad_low=*/{0, 1}, /*pad_high=*/{0, 1},
+     /*layout=*/{0, 1},
+     /*reducer=*/Reducer::kAdd},
     {/*base_bounds=*/{5, 147}, /*window_bounds=*/{1, 36},
-     /*strides=*/{4, 5}, /*layout=*/{1, 0},
-     /*padding=*/Padding::kSame, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{4, 5}, /*pad_low=*/{0, 0}, /*pad_high=*/{17, 17},
+     /*layout=*/{1, 0},
+     /*reducer=*/Reducer::kAdd},
     {/*base_bounds=*/{4, 153}, /*window_bounds=*/{2, 93},
-     /*strides=*/{1, 1}, /*layout=*/{1, 0},
-     /*padding=*/Padding::kSame, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{1, 1}, /*pad_low=*/{0, 1}, /*pad_high=*/{46, 46},
+     /*layout=*/{1, 0},
+     /*reducer=*/Reducer::kAdd},
     // Regression test for a bug that appeared in Inception (b/34784899).
     {/*base_bounds=*/{28, 28}, /*window_bounds=*/{3, 3},
-     /*strides=*/{1, 1}, /*layout=*/{1, 0},
-     /*padding=*/Padding::kSame, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{1, 1}, /*pad_low=*/{1, 1}, /*pad_high=*/{1, 1},
+     /*layout=*/{1, 0},
+     /*reducer=*/Reducer::kAdd},
+    {/*base_bounds=*/{4, 4}, /*window_bounds=*/{2, 2},
+     /*strides=*/{1, 1}, /*pad_low=*/{0, 0}, /*pad_high=*/{0, 0},
+     /*layout=*/{1, 0},
+     /*reducer=*/Reducer::kAdd},
     // Regression test for a bug that appeared in Inception (b/34784899).
     {/*base_bounds=*/{4, 32}, /*window_bounds=*/{2, 2},
-     /*strides=*/{2, 2}, /*layout=*/{1, 0},
-     /*padding=*/Padding::kValid, /*reducer=*/Reducer::kAdd},
-    {/*base_bounds=*/{4, 4}, /*window_bounds=*/{2, 2},
-     /*strides=*/{1, 1}, /*layout=*/{1, 0},
-     /*padding=*/Padding::kValid, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{2, 2}, /*pad_low=*/{0, 0}, /*pad_high=*/{0, 0},
+     /*layout=*/{1, 0},
+     /*reducer=*/Reducer::kAdd},
+    // Regression test for b/73903312: bf16 lacks precision to store result of
+    // very large windows. Testing with a reasonable window larger than 128.
+    {/*base_bounds=*/{8, 130}, /*window_bounds=*/{1, 130},
+     /*strides=*/{1, 1}, /*pad_low=*/{0, 130}, /*pad_high=*/{0, 0},
+     /*layout=*/{1, 0},
+     /*reducer=*/Reducer::kAdd},
+// TODO(b/76025683): These tests fail on TPU.
+#if defined(XLA_TEST_BACKEND_CPU) || defined(XLA_TEST_BACKEND_GPU)
+    {/*base_bounds=*/{4096, 4096}, /*window_bounds=*/{1, 4},
+     /*strides=*/{1, 1024}, /*pad_low=*/{0, 0}, /*pad_high=*/{0, 0},
+     /*layout=*/{1, 0}, /*reducer=*/Reducer::kAdd},
+    {/*base_bounds=*/{8, 256}, /*window_bounds=*/{1, 4},
+     /*strides=*/{1, 64}, /*pad_low=*/{0, 0}, /*pad_high=*/{0, 0},
+     /*layout=*/{1, 0}, /*reducer=*/Reducer::kAdd},
+#endif
 };
 
 string R2ReduceWindowTestDataToString(
@@ -1008,10 +1039,11 @@ string R2ReduceWindowTestDataToString(
   string str = tensorflow::strings::StrCat(
       "base_bounds_", tensorflow::str_util::Join(param.base_bounds, "x"),  //
       "__window_bounds_",
-      tensorflow::str_util::Join(param.window_bounds, "x"),              //
-      "__strides_", tensorflow::str_util::Join(param.strides, "x"),      //
-      "__padding_", param.padding == Padding::kSame ? "same" : "valid",  //
-      "__layout_", param.layout[0], "_", param.layout[1],                //
+      tensorflow::str_util::Join(param.window_bounds, "x"),          //
+      "__strides_", tensorflow::str_util::Join(param.strides, "x"),  //
+      "__pad_low_", tensorflow::str_util::Join(param.pad_low, "x"),
+      "__pad_high_", tensorflow::str_util::Join(param.pad_high, "x"),
+      "__layout_", param.layout[0], "_", param.layout[1],  //
       "__reducer_", param.reducer == kAdd ? "add" : "max");
   if (::testing::get<1>(data.param)) {
     str = tensorflow::strings::StrCat(str, "_bfloat16");
@@ -1039,17 +1071,29 @@ class R2ReduceWindowTest : public ReduceWindowTestBase,
     ComputationDataHandle parameter;
     auto input_arg = CreateParameterAndTransferLiteral(0, *input_literal, "p0",
                                                        &b, &parameter);
+    std::vector<std::pair<int64, int64>> padding(2);
+    for (int i = 0; i < 2; ++i) {
+      padding[i] = {param.pad_low[i], param.pad_high[i]};
+    }
+    auto computation = param.reducer == kAdd
+                           ? CreateScalarAddComputation(FloatType(), &b)
+                           : CreateScalarMaxComputation(FloatType(), &b);
     auto init_value =
         CreateConstantFromLiteral(*Literal::CreateR0(kInitValue), &b);
-    b.ReduceWindow(/*operand=*/parameter,
-                   /*init_value=*/init_value,
-                   /*computation=*/CreateScalarAddComputation(FloatType(), &b),
-                   /*window_dimensions=*/param.window_bounds,
-                   /*window_strides=*/param.strides, /*padding=*/param.padding);
+    b.ReduceWindowWithGeneralPadding(
+        /*operand=*/parameter,
+        /*init_value=*/init_value,
+        /*computation=*/computation,
+        /*window_dimensions=*/param.window_bounds,
+        /*window_strides=*/param.strides, /*padding=*/padding);
 
-    auto expected = ReferenceUtil::ReduceWindow2DAdd(
-        /*operand=*/input, /*init=*/kInitValue, /*window=*/param.window_bounds,
-        /*stride=*/param.strides, /*padding=*/param.padding);
+    auto reduce_func = param.reducer == kAdd
+                           ? +[](float a, float b) { return a + b; }
+                           : +[](float a, float b) { return std::max(a, b); };
+    auto expected = ReferenceUtil::ReduceWindow2DGeneric(
+        /*operand=*/input, /*init=*/kInitValue, /*reduce_func=*/reduce_func,
+        /*window=*/param.window_bounds,
+        /*stride=*/param.strides, /*padding=*/padding);
 
     ComputeAndCompareLiteral(&b, *Literal::CreateFromArray(*expected),
                              {input_arg.get()}, DefaultErrorSpec());
@@ -1074,8 +1118,9 @@ XLA_TEST_P(R2ReduceWindowFailingCpuGpuBf16Test,
 
 const R2ReduceWindowTestData kR2FailingValuesCpuGpuBf16Test[] = {
     {/*base_bounds=*/{8, 128}, /*window_bounds=*/{8, 128},
-     /*strides=*/{1, 1}, /*layout=*/{1, 0},
-     /*padding=*/Padding::kValid, /*reducer=*/Reducer::kAdd},
+     /*strides=*/{1, 1}, /*pad_low=*/{0, 0}, /*pad_high=*/{0, 0},
+     /*layout=*/{1, 0},
+     /*reducer=*/Reducer::kAdd},
 };
 
 INSTANTIATE_TEST_CASE_P(
@@ -1315,5 +1360,41 @@ ENTRY R2Window {
   EXPECT_TRUE(RunAndCompare(hlo_string, ErrorSpec{0.001}));
 }
 
+TEST_F(ReduceWindowTextTest, R2EffectiveScalar) {
+  const string& hlo_string = R"(
+HloModule R2Window
+mul {
+  lhs = f32[] parameter(0)
+  rhs = f32[] parameter(1)
+  ROOT mul = f32[] multiply(lhs, rhs)
+}
+ENTRY R2Window {
+  operand = f32[1,1]{1,0} parameter(0)
+  negate = f32[1,1]{1,0} negate(operand)
+  constant = f32[] constant(1)
+  ROOT reduce-window = f32[1,1]{1,0} reduce-window(negate, constant), window={size=1x1 pad=0_0x0_0}, to_apply=mul
+}
+)";
+  EXPECT_TRUE(RunAndCompare(hlo_string, ErrorSpec{0.001}));
+}
+
+TEST_F(ReduceWindowTextTest, R3EffectiveScalar) {
+  const string& hlo_string = R"(
+HloModule R3Window
+mul {
+  lhs = f32[] parameter(0)
+  rhs = f32[] parameter(1)
+  ROOT mul = f32[] multiply(lhs, rhs)
+}
+ENTRY R3Window {
+  operand = f32[1,1,1]{2,1,0} parameter(0)
+  negate = f32[1,1,1]{2,1,0} negate(operand)
+  constant = f32[] constant(1)
+  ROOT reduce-window = f32[1,1,1]{2,1,0} reduce-window(negate, constant), window={size=1x1x1 pad=0_0x0_0x0_0}, to_apply=mul
+}
+)";
+  EXPECT_TRUE(RunAndCompare(hlo_string, ErrorSpec{0.001}));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/scalar_computations_test.cc b/tensorflow/compiler/xla/tests/scalar_computations_test.cc
index d7bda77e87f33938162f94dbee42b160906b4087..0c88bef69dfc522fef52422b0bd3a825fa173d44 100644
--- a/tensorflow/compiler/xla/tests/scalar_computations_test.cc
+++ b/tensorflow/compiler/xla/tests/scalar_computations_test.cc
@@ -860,6 +860,12 @@ XLA_TEST_F(ScalarComputationsTest, MinF32Below) {
   TestMinMax<float>(-100.1f, 3.1f, -100.1f, &ComputationBuilder::Min);
 }
 
+XLA_TEST_F(ScalarComputationsTest, MinPropagatesNan) {
+  SetFastMathDisabled(true);
+  TestMinMax<float>(NAN, 3.1f, NAN, &ComputationBuilder::Min);
+  TestMinMax<float>(-3.1f, NAN, NAN, &ComputationBuilder::Min);
+}
+
 XLA_TEST_F(ScalarComputationsTest, MaxF32Above) {
   TestMinMax<float>(10.1f, 3.1f, 10.1f, &ComputationBuilder::Max);
 }
@@ -868,6 +874,12 @@ XLA_TEST_F(ScalarComputationsTest, MaxF32Below) {
   TestMinMax<float>(-100.1f, 3.1f, 3.1f, &ComputationBuilder::Max);
 }
 
+XLA_TEST_F(ScalarComputationsTest, MaxPropagatesNan) {
+  SetFastMathDisabled(true);
+  TestMinMax<float>(NAN, 3.1f, NAN, &ComputationBuilder::Max);
+  TestMinMax<float>(-3.1f, NAN, NAN, &ComputationBuilder::Max);
+}
+
 XLA_TEST_F(ScalarComputationsTest, ComplicatedArithmeticExpressionF32) {
   // Compute the expression (1 * (3 - 1) * (7 + 0) - 4) / 20.
   ComputationBuilder b(client_, TestName());
diff --git a/tensorflow/compiler/xla/tests/select_and_scatter_test.cc b/tensorflow/compiler/xla/tests/select_and_scatter_test.cc
index 9ee94b8571e5fc8789b60501462986967ce909a0..d268fdcacebcb162bf61bc7dd4b208f4db6c4a5f 100644
--- a/tensorflow/compiler/xla/tests/select_and_scatter_test.cc
+++ b/tensorflow/compiler/xla/tests/select_and_scatter_test.cc
@@ -252,6 +252,21 @@ XLA_TEST_F(SelectAndScatterTest, R2S32) {
   ComputeAndCompareR2<int32>(&builder_, expected, {});
 }
 
+// Test for tie breaking rule in ge_f32_. When a tie is present, the operand
+// that has the lower lexicographical order (smaller index) should be chosen.
+XLA_TEST_F(SelectAndScatterTest, R2F32Tie) {
+  const auto operand = builder_.ConstantR2<float>(
+      {{0.f, 0.f, 0.f}, {0.f, 0.f, 0.f}, {0.f, 0.f, 0.f}});
+  const auto source = builder_.ConstantR2<float>(
+      {{1.0f, 2.0f, 3.0f}, {4.f, 5.0f, 6.0f}, {7.0f, 8.0f, 9.0f}});
+  Array2D<float> expected(
+      {{12.f, 9.f, 0.f}, {15.f, 9.f, 0.f}, {0.f, 0.f, 0.f}});
+  builder_.SelectAndScatter(operand, ge_f32_, /*window_dimensions=*/{3, 3},
+                            /*window_strides=*/{1, 1}, Padding::kSame, source,
+                            builder_.ConstantR0<float>(0.0f), add_f32_);
+  ComputeAndCompareR2<float>(&builder_, expected, {}, ErrorSpec(1e-7));
+}
+
 // Similar to SelectAndScatterTest.R2S32 but the input is transposed.
 XLA_TEST_F(SelectAndScatterTest, ReshapeR2S32) {
   const auto operand = builder_.ConstantR2<int32>(
diff --git a/tensorflow/compiler/xla/tests/test_macros.cc b/tensorflow/compiler/xla/tests/test_macros.cc
index 978a669bcab720bddec5c4bcd0144810ba3c8477..be35ec6c6ee4c015755622b2dc9bb92e23af7c85 100644
--- a/tensorflow/compiler/xla/tests/test_macros.cc
+++ b/tensorflow/compiler/xla/tests/test_macros.cc
@@ -21,6 +21,7 @@ limitations under the License.
 #include <unordered_map>
 
 #include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/regexp.h"
 
 namespace xla {
diff --git a/tensorflow/compiler/xla/tests/tuple_test.cc b/tensorflow/compiler/xla/tests/tuple_test.cc
index 2029312f94a14bc81706368b9ecfc2727fd9fe4c..fa60af4b6a7d4f249b28be14357b8cad9a42c783 100644
--- a/tensorflow/compiler/xla/tests/tuple_test.cc
+++ b/tensorflow/compiler/xla/tests/tuple_test.cc
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
@@ -514,5 +515,33 @@ XLA_TEST_F(TupleTest, ComplexTuples) {
                          error_spec_);
 }
 
+class TupleHloTest : public HloTestBase {};
+
+// Disabled on CPU parallel because that's broken and will be removed soon.
+// Disabled on the interpreter because bitcast doesn't exist on the interpreter.
+TEST_F(TupleHloTest,
+       DISABLED_ON_INTERPRETER(DISABLED_ON_CPU_PARALLEL(BitcastAfterGTE))) {
+  const char* testcase = R"(
+    HloModule m
+
+    ENTRY test {
+      name.1 = (f32[3]{0}) parameter(0)
+      get-tuple-element.1 = f32[3]{0} get-tuple-element(name.1), index=0
+      bitcast = f32[1,3]{1,0} bitcast(get-tuple-element.1)
+      copy = f32[1,3]{1,0} copy(bitcast)
+      ROOT tuple.4 = (f32[1,3]{1,0}) tuple(copy)
+    }
+  )";
+  auto module =
+      HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
+          .ValueOrDie();
+  auto param = Literal::MakeTupleOwned(Literal::CreateR1<float>({1, 2, 3}));
+  TF_ASSERT_OK_AND_ASSIGN(auto result,
+                          ExecuteNoHloPasses(std::move(module), {param.get()}));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *result,
+      *Literal::MakeTupleOwned(Literal::CreateR2<float>({{1, 2, 3}}))));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/while_test.cc b/tensorflow/compiler/xla/tests/while_test.cc
index 52157b837c383205f77a030ef98b2fd03a41aff5..33d457c70bac84c2da10e3cf9302c2c952cf1bc2 100644
--- a/tensorflow/compiler/xla/tests/while_test.cc
+++ b/tensorflow/compiler/xla/tests/while_test.cc
@@ -910,7 +910,7 @@ XLA_TEST_F(WhileTest, WhileWithDynamicUpdateSlice) {
 // Per backend the values generated can be different as the different backends
 // use different random number generators.
 // TODO(b/32240857): Extend test to verify outputs.
-TEST_F(WhileTest, WhileWithPrngScalarResult) {
+TEST_F(WhileTest, DISABLED_ON_INTERPRETER(WhileWithPrngScalarResult)) {
   auto v6s32 = ShapeUtil::MakeShape(S32, {6});
 
   // Create a computation for the condition: repeat for count iterations.
@@ -1166,7 +1166,7 @@ XLA_TEST_F(WhileTest, NestedWhileWithScalarResult) {
 // while (f(result).get<0>()) {
 //   result = result + 1;
 // }
-TEST_F(WhileTest, WhileWithCallInsideCondition) {
+TEST_F(WhileTest, DISABLED_ON_INTERPRETER(WhileWithCallInsideCondition)) {
   auto result_shape = ShapeUtil::MakeShape(S32, {});
 
   // Create a computation for the condition: repeat for 5 iterations.
diff --git a/tensorflow/compiler/xla/tests/xla_hlo_profile_test.cc b/tensorflow/compiler/xla/tests/xla_hlo_profile_test.cc
index 9ad2a1985331b80625dd0687ea052300bc99e440..24b9f37a8008b6f774634f2dbff9d3296ec0585b 100644
--- a/tensorflow/compiler/xla/tests/xla_hlo_profile_test.cc
+++ b/tensorflow/compiler/xla/tests/xla_hlo_profile_test.cc
@@ -144,7 +144,7 @@ void ExecuteAndFetchProfile(string* profile_output, LocalClient* client,
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<LocalExecutable> local_executable,
       client->Compile(computation, {&lhs_arg_shape, &rhs_arg_shape},
-                      ExecutableBuildOptions()));
+                      ExecutableBuildOptions().set_hlo_profile(true)));
 
   Executable* executable = local_executable->executable();
   HloExecutionProfile hlo_execution_profile(
diff --git a/tensorflow/compiler/xla/tools/BUILD b/tensorflow/compiler/xla/tools/BUILD
index 091fa0c3ec807a66449eca0bfbb141285b8eb532..2e55f609d17bf42e410f97c51c7b9c6c0e85576d 100644
--- a/tensorflow/compiler/xla/tools/BUILD
+++ b/tensorflow/compiler/xla/tools/BUILD
@@ -75,6 +75,7 @@ cc_library(
     name = "replay_computation_library",
     srcs = ["replay_computation.cc"],
     deps = [
+        "//tensorflow/compiler/xla:execution_options_util",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
diff --git a/tensorflow/compiler/xla/tools/parser/hlo_parser.cc b/tensorflow/compiler/xla/tools/parser/hlo_parser.cc
index cd2b843ad36013ae83818ecbc184fb823093f037..e60a5a4919f2207939821e787c3c59a08ff3ba4e 100644
--- a/tensorflow/compiler/xla/tools/parser/hlo_parser.cc
+++ b/tensorflow/compiler/xla/tools/parser/hlo_parser.cc
@@ -1049,9 +1049,40 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
           HloInstruction::CreateDot(shape, operands[0], operands[1], dnum));
       break;
     }
-    case HloOpcode::kGather:
-      // TODO(b/72710576): HLO parsing is not implemented for Gather.
-      return TokenError("HLO parsing is not implemented for Gather");
+    case HloOpcode::kGather: {
+      optional<std::vector<int64>> output_window_dims;
+      attrs["output_window_dims"] = {
+          /*required=*/true, AttrTy::kBracedInt64List, &output_window_dims};
+      optional<std::vector<int64>> elided_window_dims;
+      attrs["elided_window_dims"] = {
+          /*required=*/true, AttrTy::kBracedInt64List, &elided_window_dims};
+      optional<std::vector<int64>> gather_dims_to_operand_dims;
+      attrs["gather_dims_to_operand_dims"] = {/*required=*/true,
+                                              AttrTy::kBracedInt64List,
+                                              &gather_dims_to_operand_dims};
+      optional<int64> index_vector_dim;
+      attrs["index_vector_dim"] = {/*required=*/true, AttrTy::kInt64,
+                                   &index_vector_dim};
+      optional<std::vector<int64>> window_bounds;
+      attrs["window_bounds"] = {/*required=*/true, AttrTy::kBracedInt64List,
+                                &window_bounds};
+
+      if (!ParseOperands(&operands, /*expected_size=*/2) ||
+          !ParseAttributes(attrs)) {
+        return false;
+      }
+
+      GatherDimensionNumbers dim_numbers = HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/*output_window_dims,
+          /*elided_window_dims=*/*elided_window_dims,
+          /*gather_dims_to_operand_dims=*/*gather_dims_to_operand_dims,
+          /*index_vector_dim=*/*index_vector_dim);
+
+      instruction = builder->AddInstruction(HloInstruction::CreateGather(
+          shape, /*operand=*/operands[0], /*gather_indices=*/operands[1],
+          dim_numbers, *window_bounds));
+      break;
+    }
     case HloOpcode::kTrace:
       return TokenError(StrCat("parsing not yet implemented for op: ",
                                HloOpcodeString(opcode)));
diff --git a/tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc b/tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc
index b8c6b59204f897c7dc07b846370b5b776a19a808..863081d654390440aa6506bab4576b3cc5c1cbd1 100644
--- a/tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc
+++ b/tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc
@@ -716,6 +716,18 @@ ENTRY %sparse_f32_r1 () -> f32[9] {
   ROOT %foo = f32[9]sparse{10} constant(f32[9]{1: 2, 3: 4, 5: 6})
 }
 
+)"
+},
+{
+"gather",
+R"(HloModule StringifyGather
+
+ENTRY %Gather (input_tensor: f32[50,49,48,47,46], gather_indices: s64[10,9,8,7,5]) -> f32[10,9,8,7,30,29,28,27,26] {
+  %input_tensor = f32[50,49,48,47,46]{4,3,2,1,0} parameter(0)
+  %gather_indices = s64[10,9,8,7,5]{4,3,2,1,0} parameter(1)
+  ROOT %gather = f32[10,9,8,7,30,29,28,27,26]{8,7,6,5,4,3,2,1,0} gather(f32[50,49,48,47,46]{4,3,2,1,0} %input_tensor, s64[10,9,8,7,5]{4,3,2,1,0} %gather_indices), output_window_dims={4,5,6,7,8}, elided_window_dims={}, gather_dims_to_operand_dims={0,1,2,3,4}, index_vector_dim=4, window_bounds={30,29,28,27,26}
+}
+
 )"
 },
   });
@@ -860,6 +872,18 @@ ENTRY dot {
   ROOT dot = f32[2,3]{1,0} dot(a, b), lhs_batch_dims={0}, lhs_contracting_dims={1}, rhs_contracting_dims={0}
 }
 
+)"
+},
+{
+"gather",
+R"(HloModule gather
+
+ENTRY Gather {
+  input_tensor = f32[50,49,48,47,46]{4,3,2,1,0} parameter(0)
+  gather_indices = s64[10,9,8,7,5]{4,3,2,1,0} parameter(1)
+  ROOT gather = f32[10,9,8,7,30,29,28,27,26]{8,7,6,5,4,3,2,1,0} gather(input_tensor, gather_indices), output_window_dims={4,5,6,7,8}, elided_window_dims={}, gather_dims_to_operand_dims={0,1,2,3,4}, index_vector_dim=4, window_bounds={30,29,28,27,26}
+}
+
 )"
 },
   });
diff --git a/tensorflow/compiler/xla/tools/replay_computation.cc b/tensorflow/compiler/xla/tools/replay_computation.cc
index eda5effbb92db92c9317a956497a00c0ec15c27c..62a353ad09af009e4abf47664a5c5f7bd70a049e 100644
--- a/tensorflow/compiler/xla/tools/replay_computation.cc
+++ b/tensorflow/compiler/xla/tools/replay_computation.cc
@@ -40,6 +40,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/lib/testing.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/execution_options_util.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/session.pb.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -66,6 +67,7 @@ struct Options {
   bool use_fake_data = false;
   bool print_result = true;
   int num_runs = 1;
+  bool xla_hlo_profile_last_run = false;
 };
 
 // Invokes the given computation passing arbitrary data for every (unbound)
@@ -122,16 +124,21 @@ StatusOr<std::unique_ptr<Literal>> ReplayComputation(
   std::unique_ptr<Literal> result;
   for (int i = 0; i < opts.num_runs; ++i) {
     ExecutionProfile profile;
+    ExecutionOptions execution_options = CreateDefaultExecutionOptions();
+    if (opts.xla_hlo_profile_last_run && i == opts.num_runs - 1) {
+      execution_options.mutable_debug_options()->set_xla_hlo_profile(true);
+    }
+
     if (opts.print_result) {
-      TF_ASSIGN_OR_RETURN(result, client->ExecuteAndTransfer(
-                                      computation, execute_arguments,
-                                      /*execution_options=*/nullptr, &profile));
+      TF_ASSIGN_OR_RETURN(
+          result, client->ExecuteAndTransfer(computation, execute_arguments,
+                                             &execution_options, &profile));
     } else {
       // If we're not printing the result, execute the computation but don't
       // bother retrieving the result.  This can be a significant speedup.
       TF_RETURN_IF_ERROR(client
                              ->Execute(computation, execute_arguments,
-                                       /*execution_options=*/nullptr, &profile)
+                                       &execution_options, &profile)
                              .status());
     }
     LOG(INFO) << "Execution took "
@@ -191,6 +198,9 @@ int main(int argc, char** argv) {
                        "Number of times to run each computation"),
       tensorflow::Flag("fake_infeed_shape", &opts.fake_infeed_shape,
                        "Shape of fake data to construct for (infinite) infeed"),
+      tensorflow::Flag(
+          "xla_hlo_profile_last_run", &opts.xla_hlo_profile_last_run,
+          "Pass --xla_hlo_profile the last time we run the computation."),
   };
   xla::string usage = tensorflow::Flags::Usage(argv[0], flag_list);
   bool parse_ok = tensorflow::Flags::Parse(&argc, argv, flag_list);
diff --git a/tensorflow/compiler/xla/util.cc b/tensorflow/compiler/xla/util.cc
index 1f0c626bbb2d64ef4e67c9ec51485ae96ae73d04..dc4f7a1cb436183f5acfa360fb092795258b6a75 100644
--- a/tensorflow/compiler/xla/util.cc
+++ b/tensorflow/compiler/xla/util.cc
@@ -15,7 +15,6 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/util.h"
 
-#include <numeric>
 #include <stdarg.h>
 #include <numeric>
 
@@ -292,7 +291,8 @@ void LogLines(int sev, tensorflow::StringPiece text, const char* fname,
 }
 
 int64 Product(tensorflow::gtl::ArraySlice<int64> xs) {
-  return std::accumulate(xs.begin(), xs.end(), 1, std::multiplies<int64>());
+  return std::accumulate(xs.begin(), xs.end(), static_cast<int64>(1),
+                         std::multiplies<int64>());
 }
 
 std::vector<std::pair<int64, int64>> CommonFactors(
diff --git a/tensorflow/compiler/xla/util.h b/tensorflow/compiler/xla/util.h
index e14c8cefa1d16e0a749e7a2c022a24a1c5083b15..2da9f9ed6f40fcf5b2512f974519df0b355da10f 100644
--- a/tensorflow/compiler/xla/util.h
+++ b/tensorflow/compiler/xla/util.h
@@ -21,6 +21,7 @@ limitations under the License.
 
 #include <algorithm>
 #include <string>
+#include <type_traits>
 #include <vector>
 
 #include "tensorflow/compiler/xla/status.h"
@@ -427,30 +428,37 @@ std::vector<std::pair<int64, int64>> CommonFactors(
 string SanitizeFileName(string file_name);
 
 template <typename Container, typename Predicate>
-bool c_all_of(Container container, Predicate&& predicate) {
+bool c_all_of(const Container& container, Predicate&& predicate) {
   return std::all_of(std::begin(container), std::end(container),
                      std::forward<Predicate>(predicate));
 }
 
+template <typename Container, typename Predicate>
+bool c_any_of(const Container& container, Predicate&& predicate) {
+  return std::any_of(std::begin(container), std::end(container),
+                     std::forward<Predicate>(predicate));
+}
+
 template <typename InputContainer, typename OutputIterator,
           typename UnaryOperation>
-OutputIterator c_transform(InputContainer input_container,
+OutputIterator c_transform(const InputContainer& input_container,
                            OutputIterator output_iterator,
-                           UnaryOperation unary_op) {
+                           UnaryOperation&& unary_op) {
   return std::transform(std::begin(input_container), std::end(input_container),
-                        output_iterator, unary_op);
+                        output_iterator,
+                        std::forward<UnaryOperation>(unary_op));
 }
 
 template <class InputContainer, class OutputIterator, class UnaryPredicate>
-OutputIterator c_copy_if(InputContainer input_container,
+OutputIterator c_copy_if(const InputContainer& input_container,
                          OutputIterator output_iterator,
-                         UnaryPredicate predicate) {
+                         UnaryPredicate&& predicate) {
   return std::copy_if(std::begin(input_container), std::end(input_container),
-                      output_iterator, predicate);
+                      output_iterator, std::forward<UnaryPredicate>(predicate));
 }
 
 template <class InputContainer, class OutputIterator>
-OutputIterator c_copy(InputContainer input_container,
+OutputIterator c_copy(const InputContainer& input_container,
                       OutputIterator output_iterator) {
   return std::copy(std::begin(input_container), std::end(input_container),
                    output_iterator);
@@ -468,7 +476,7 @@ void c_sort(InputContainer& input_container, Comparator&& comparator) {
 }
 
 template <typename Sequence, typename T>
-bool c_binary_search(Sequence& sequence, T&& value) {
+bool c_binary_search(const Sequence& sequence, T&& value) {
   return std::binary_search(std::begin(sequence), std::end(sequence),
                             std::forward<T>(value));
 }
@@ -487,6 +495,39 @@ template <typename C, typename Pred>
 auto c_find_if(const C& c, Pred&& pred) -> decltype(std::begin(c)) {
   return std::find_if(std::begin(c), std::end(c), std::forward<Pred>(pred));
 }
+
+template <typename C, typename Value>
+auto c_find(const C& c, Value&& value) -> decltype(std::begin(c)) {
+  return std::find(std::begin(c), std::end(c), std::forward<Value>(value));
+}
+
+template <typename Sequence>
+void c_reverse(Sequence& sequence) {
+  std::reverse(std::begin(sequence), std::end(sequence));
+}
+
+template <typename Sequence, typename T, typename BinaryOp>
+typename std::decay<T>::type c_accumulate(const Sequence& sequence, T&& init,
+                                          BinaryOp&& binary_op) {
+  return std::accumulate(std::begin(sequence), std::end(sequence),
+                         std::forward<T>(init),
+                         std::forward<BinaryOp>(binary_op));
+}
+
+template <typename C, typename Value>
+int64 FindIndex(const C& c, Value&& value) {
+  auto it = c_find(c, std::forward<Value>(value));
+  return std::distance(c.begin(), it);
+}
+
+// Returns true if `x` fits in 32 bits.
+template <typename T>
+bool IsInt32(T x) {
+  // Following conversion rules: "the value is unchanged if it can be
+  // represented in the destination type (and bit-field width); otherwise, the
+  // value is implementation-defined."
+  return static_cast<int32>(x) == x;
+}
 }  // namespace xla
 
 #define XLA_LOG_LINES(SEV, STRING) \
diff --git a/tensorflow/compiler/xla/xla.proto b/tensorflow/compiler/xla/xla.proto
index 56162ab44e2e0e3e4478fe631888f243332dc1d8..edf1b07af82b5d43fe67c6efdabdb0a9b4b1edea 100644
--- a/tensorflow/compiler/xla/xla.proto
+++ b/tensorflow/compiler/xla/xla.proto
@@ -16,6 +16,7 @@ limitations under the License.
 syntax = "proto3";
 
 import "tensorflow/compiler/xla/xla_data.proto";
+import "tensorflow/compiler/xla/service/hlo.proto";
 import "tensorflow/compiler/xla/service/session.proto";
 
 package xla;
@@ -342,6 +343,14 @@ message ExecuteRequest {
   ExecutionOptions execution_options = 5;
 }
 
+message ExecuteGraphRequest {
+  HloModuleProto computation = 1;
+  repeated GlobalDataHandle arguments = 2;
+
+  // Options that affect how XLA compiles and runs code to service this request.
+  ExecutionOptions execution_options = 3;
+}
+
 message ExecuteParallelRequest {
   repeated ExecuteRequest requests = 1;
 }
diff --git a/tensorflow/compiler/xla/xla_data.proto b/tensorflow/compiler/xla/xla_data.proto
index 28620c3b86349281573eaf57d2838bee1488d838..1f16e6d25178fd9c10a30b0c500e090ee2e08117 100644
--- a/tensorflow/compiler/xla/xla_data.proto
+++ b/tensorflow/compiler/xla/xla_data.proto
@@ -418,6 +418,10 @@ message GatherDimensionNumbers {
   // transforms the gather index looked up from the gather_indices tensor into
   // the starting index in the input space.
   repeated int64 gather_dims_to_operand_dims = 3;
+
+  // The dimension in the gather_indices input that contains the starting
+  // indices.
+  int64 index_vector_dim = 4;
 }
 
 // Operation requests that are all collected as a tagged union with a oneof
diff --git a/tensorflow/contrib/BUILD b/tensorflow/contrib/BUILD
index bab37e8906e5c648acdc1556da7e5f4601776ff5..0a955db7daa7cd632c2b72ba9ffbc897366567f5 100644
--- a/tensorflow/contrib/BUILD
+++ b/tensorflow/contrib/BUILD
@@ -51,7 +51,6 @@ py_library(
         "//tensorflow/contrib/image:single_image_random_dot_stereograms_py",
         "//tensorflow/contrib/input_pipeline:input_pipeline_py",
         "//tensorflow/contrib/integrate:integrate_py",
-        "//tensorflow/contrib/kafka",
         "//tensorflow/contrib/keras",
         "//tensorflow/contrib/kernel_methods",
         "//tensorflow/contrib/kfac",
@@ -110,7 +109,13 @@ py_library(
         "//tensorflow/python:util",
     ] + if_mpi(["//tensorflow/contrib/mpi_collectives:mpi_collectives_py"]) + if_tensorrt([
         "//tensorflow/contrib/tensorrt:init_py",
-    ]),
+    ]) + select({
+        "//tensorflow:with_kafka_support_windows_override": [],
+        "//tensorflow:with_kafka_support": [
+            "//tensorflow/contrib/kafka",
+        ],
+        "//conditions:default": [],
+    }),
 )
 
 cc_library(
@@ -119,7 +124,6 @@ cc_library(
     deps = [
         "//tensorflow/contrib/boosted_trees:boosted_trees_kernels",
         "//tensorflow/contrib/coder:all_kernels",
-        "//tensorflow/contrib/cudnn_rnn:cudnn_rnn_kernels",
         "//tensorflow/contrib/data/kernels:dataset_kernels",
         "//tensorflow/contrib/factorization/kernels:all_kernels",
         "//tensorflow/contrib/input_pipeline:input_pipeline_ops_kernels",
@@ -133,7 +137,13 @@ cc_library(
         "//tensorflow/contrib/text:all_kernels",
     ] + if_mpi(["//tensorflow/contrib/mpi_collectives:mpi_collectives_py"]) + if_cuda([
         "//tensorflow/contrib/nccl:nccl_kernels",
-    ]),
+    ]) + select({
+        "//tensorflow:with_kafka_support_windows_override": [],
+        "//tensorflow:with_kafka_support": [
+            "//tensorflow/contrib/kafka:dataset_kernels",
+        ],
+        "//conditions:default": [],
+    }),
 )
 
 cc_library(
@@ -142,12 +152,10 @@ cc_library(
     deps = [
         "//tensorflow/contrib/boosted_trees:boosted_trees_ops_op_lib",
         "//tensorflow/contrib/coder:all_ops",
-        "//tensorflow/contrib/cudnn_rnn:cudnn_rnn_ops_op_lib",
         "//tensorflow/contrib/data:dataset_ops_op_lib",
         "//tensorflow/contrib/factorization:all_ops",
         "//tensorflow/contrib/framework:all_ops",
         "//tensorflow/contrib/input_pipeline:input_pipeline_ops_op_lib",
-        "//tensorflow/contrib/kafka:kafka_ops_op_lib",
         "//tensorflow/contrib/layers:sparse_feature_cross_op_op_lib",
         "//tensorflow/contrib/nccl:nccl_ops_op_lib",
         "//tensorflow/contrib/nearest_neighbor:nearest_neighbor_ops_op_lib",
@@ -158,7 +166,13 @@ cc_library(
         "//tensorflow/contrib/tensor_forest:tensor_forest_ops_op_lib",
         "//tensorflow/contrib/text:all_ops",
         "//tensorflow/contrib/tpu:all_ops",
-    ],
+    ] + select({
+        "//tensorflow:with_kafka_support_windows_override": [],
+        "//tensorflow:with_kafka_support": [
+            "//tensorflow/contrib/kafka:dataset_ops_op_lib",
+        ],
+        "//conditions:default": [],
+    }),
 )
 
 filegroup(
diff --git a/tensorflow/contrib/android/cmake/CMakeLists.txt b/tensorflow/contrib/android/cmake/CMakeLists.txt
index a115d1610e2334a6626f29674f3dd195e3a3c648..ecf1a103d2981f409a4598d762fb26100217f779 100644
--- a/tensorflow/contrib/android/cmake/CMakeLists.txt
+++ b/tensorflow/contrib/android/cmake/CMakeLists.txt
@@ -75,7 +75,6 @@ target_link_libraries(tensorflow_inference
 include_directories(
     ${PREBUILT_DIR}/proto
     ${PREBUILT_DIR}/protobuf/include
-    ${PREBUILT_DIR}/nsync/public
     ${TENSORFLOW_ROOT_DIR}/tensorflow/contrib/makefile/downloads/eigen
     ${TENSORFLOW_ROOT_DIR}
     ${CMAKE_CURRENT_SOURCE_DIR}/..)
diff --git a/tensorflow/contrib/bayesflow/BUILD b/tensorflow/contrib/bayesflow/BUILD
index 08b29fb6bcbb615d8875283f812f219ae591c6b4..c6feec68e0104ff33451bbb6fa7de51d13e0a43c 100644
--- a/tensorflow/contrib/bayesflow/BUILD
+++ b/tensorflow/contrib/bayesflow/BUILD
@@ -56,117 +56,6 @@ cuda_py_test(
     ],
 )
 
-cuda_py_test(
-    name = "csiszar_divergence_test",
-    size = "medium",
-    srcs = ["python/kernel_tests/csiszar_divergence_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//third_party/py/numpy",
-        "//tensorflow/contrib/distributions:distributions_py",
-        "//tensorflow/python/ops/distributions",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:linalg_ops",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:nn_ops",
-    ],
-    tags = [
-        "manual",  # b/64490288
-        "notap",
-    ],
-)
-
-cuda_py_test(
-    name = "custom_grad_test",
-    size = "small",
-    srcs = ["python/kernel_tests/custom_grad_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//third_party/py/numpy",
-        "//tensorflow/contrib/layers:layers_py",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:init_ops",
-        "//tensorflow/python:platform_test",
-        "//tensorflow/python:variable_scope",
-        "//tensorflow/python:variables",
-    ],
-)
-
-cuda_py_test(
-    name = "docstring_util_test",
-    size = "small",
-    srcs = ["python/kernel_tests/docstring_util_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//tensorflow/python:client_testlib",
-    ],
-)
-
-cuda_py_test(
-    name = "layers_conv_variational_test",
-    size = "small",
-    srcs = ["python/kernel_tests/layers_conv_variational_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//third_party/py/numpy",
-        "//tensorflow/contrib/distributions:distributions_py",
-        "//tensorflow/python/ops/distributions",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:linalg_ops",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:nn_ops",
-    ],
-)
-
-cuda_py_test(
-    name = "layers_dense_variational_test",
-    size = "small",
-    srcs = ["python/kernel_tests/layers_dense_variational_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//third_party/py/numpy",
-        "//tensorflow/contrib/distributions:distributions_py",
-        "//tensorflow/python/ops/distributions",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:linalg_ops",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:nn_ops",
-    ],
-)
-
-cuda_py_test(
-    name = "mcmc_diagnostics_test",
-    size = "small",
-    srcs = ["python/kernel_tests/mcmc_diagnostics_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//third_party/py/numpy",
-        "//tensorflow/python:spectral_ops_test_util",
-        "//tensorflow/contrib/distributions:distributions_py",
-        "//tensorflow/python/ops/distributions",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:platform_test",
-        "//tensorflow/python:random_seed",
-    ],
-)
-
 cuda_py_test(
     name = "monte_carlo_test",
     size = "small",
@@ -188,29 +77,9 @@ cuda_py_test(
     ],
 )
 
-cuda_py_test(
-    name = "halton_sequence_test",
-    size = "small",
-    srcs = ["python/kernel_tests/halton_sequence_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//third_party/py/numpy",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:platform_test",
-        "//tensorflow/python:random_ops",
-        "//tensorflow/python:variable_scope",
-        "//tensorflow/python:variables",
-    ],
-    tags = ["no_mac"],  # b/73192243
-)
-
 cuda_py_test(
     name = "hmc_test",
-    size = "medium",
+    size = "large",
     srcs = ["python/kernel_tests/hmc_test.py"],
     additional_deps = [
         ":bayesflow_py",
@@ -227,67 +96,7 @@ cuda_py_test(
         "//tensorflow/python:platform_test",
         "//tensorflow/python:random_seed",
     ],
-)
-
-cuda_py_test(
-    name = "sgld_optimizer_test",
-    size = "small",
-    srcs = ["python/kernel_tests/sgld_optimizer_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//third_party/py/numpy",
-        "//tensorflow/contrib/distributions:distributions_py",
-        "//tensorflow/contrib/layers:layers_py",
-        "//tensorflow/python/ops/distributions",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:platform_test",
-        "//tensorflow/python:random_seed",
-    ],
-    tags = ["notsan"],
-)
-
-cuda_py_test(
-    name = "variable_utils_test",
-    size = "small",
-    srcs = ["python/kernel_tests/variable_utils_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//third_party/py/numpy",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:platform_test",
-    ],
-)
-
-cuda_py_test(
-    name = "variational_sgd_optimizer_test",
-    size = "small",
-    srcs = ["python/kernel_tests/variational_sgd_optimizer_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//third_party/py/numpy",
-        "//tensorflow/contrib/distributions:distributions_py",
-        "//tensorflow/contrib/layers:layers_py",
-        "//tensorflow/python/ops/distributions",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:platform_test",
-        "//tensorflow/python:random_seed",
-    ],
-    tags = ["notsan"],
+    tags = ["nomsan"],
 )
 
 filegroup(
diff --git a/tensorflow/contrib/bayesflow/__init__.py b/tensorflow/contrib/bayesflow/__init__.py
index 528c4fbacd06c7b0defa0e32bd24a98b2bc07b64..f86820382682f79e85e6a92c7f63fa15bb8be1a3 100644
--- a/tensorflow/contrib/bayesflow/__init__.py
+++ b/tensorflow/contrib/bayesflow/__init__.py
@@ -21,35 +21,21 @@ from __future__ import division
 from __future__ import print_function
 
 # pylint: disable=unused-import,line-too-long
-from tensorflow.contrib.bayesflow.python.ops import csiszar_divergence
-from tensorflow.contrib.bayesflow.python.ops import custom_grad
-from tensorflow.contrib.bayesflow.python.ops import halton_sequence
 from tensorflow.contrib.bayesflow.python.ops import hmc
-from tensorflow.contrib.bayesflow.python.ops import layers
-from tensorflow.contrib.bayesflow.python.ops import mcmc_diagnostics
 from tensorflow.contrib.bayesflow.python.ops import metropolis_hastings
 from tensorflow.contrib.bayesflow.python.ops import monte_carlo
-from tensorflow.contrib.bayesflow.python.ops import optimizers
-from tensorflow.contrib.bayesflow.python.ops import variable_utils
 # pylint: enable=unused-import,line-too-long
 
 from tensorflow.python.util.all_util import remove_undocumented
 
 
 _allowed_symbols = [
-    'csiszar_divergence',
-    'custom_grad',
     'entropy',
-    'halton_sequence',
     'hmc',
-    'layers',
     'metropolis_hastings',
-    'mcmc_diagnostics',
     'monte_carlo',
-    'optimizers',
     'special_math',
     'stochastic_variables',
-    'variable_utils',
     'variational_inference',
 ]
 
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/csiszar_divergence_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/csiszar_divergence_test.py
deleted file mode 100644
index 2e94b7206de4f7c40c89f083f3bfa2a22bb7b917..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/csiszar_divergence_test.py
+++ /dev/null
@@ -1,1004 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for Csiszar Divergence Ops."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.contrib.bayesflow.python.ops import csiszar_divergence_impl
-from tensorflow.contrib.distributions.python.ops import mvn_diag as mvn_diag_lib
-from tensorflow.contrib.distributions.python.ops import mvn_full_covariance as mvn_full_lib
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import gradients_impl
-from tensorflow.python.ops import linalg_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import nn_ops
-from tensorflow.python.ops.distributions import kullback_leibler
-from tensorflow.python.ops.distributions import normal as normal_lib
-from tensorflow.python.platform import test
-
-
-cd = csiszar_divergence_impl
-
-
-def tridiag(d, diag_value, offdiag_value):
-  """d x d matrix with given value on diag, and one super/sub diag."""
-  diag_mat = linalg_ops.eye(d) * (diag_value - offdiag_value)
-  three_bands = array_ops.matrix_band_part(
-      array_ops.fill([d, d], offdiag_value), 1, 1)
-  return diag_mat + three_bands
-
-
-class AmariAlphaTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    for alpha in [-1., 0., 1., 2.]:
-      for normalized in [True, False]:
-        with self.test_session(graph=ops.Graph()):
-          self.assertAllClose(
-              cd.amari_alpha(0., alpha=alpha,
-                             self_normalized=normalized).eval(),
-              0.)
-
-  def test_correct_when_alpha0(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.amari_alpha(self._logu, alpha=0.).eval(),
-          -self._logu)
-
-      self.assertAllClose(
-          cd.amari_alpha(self._logu, alpha=0., self_normalized=True).eval(),
-          -self._logu + (self._u - 1.))
-
-  def test_correct_when_alpha1(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.amari_alpha(self._logu, alpha=1.).eval(),
-          self._u * self._logu)
-
-      self.assertAllClose(
-          cd.amari_alpha(self._logu, alpha=1., self_normalized=True).eval(),
-          self._u * self._logu - (self._u - 1.))
-
-  def test_correct_when_alpha_not_01(self):
-    for alpha in [-2, -1., -0.5, 0.5, 2.]:
-      with self.test_session(graph=ops.Graph()):
-        self.assertAllClose(
-            cd.amari_alpha(self._logu,
-                           alpha=alpha,
-                           self_normalized=False).eval(),
-            ((self._u**alpha - 1)) / (alpha * (alpha - 1.)))
-
-        self.assertAllClose(
-            cd.amari_alpha(self._logu,
-                           alpha=alpha,
-                           self_normalized=True).eval(),
-            ((self._u**alpha - 1.)
-             - alpha * (self._u - 1)) / (alpha * (alpha - 1.)))
-
-
-class KLReverseTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    for normalized in [True, False]:
-      with self.test_session(graph=ops.Graph()):
-        self.assertAllClose(
-            cd.kl_reverse(0., self_normalized=normalized).eval(),
-            0.)
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.kl_reverse(self._logu).eval(),
-          -self._logu)
-
-      self.assertAllClose(
-          cd.kl_reverse(self._logu, self_normalized=True).eval(),
-          -self._logu + (self._u - 1.))
-
-
-class KLForwardTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    for normalized in [True, False]:
-      with self.test_session(graph=ops.Graph()):
-        self.assertAllClose(
-            cd.kl_forward(0., self_normalized=normalized).eval(),
-            0.)
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.kl_forward(self._logu).eval(),
-          self._u * self._logu)
-
-      self.assertAllClose(
-          cd.kl_forward(self._logu, self_normalized=True).eval(),
-          self._u * self._logu - (self._u - 1.))
-
-
-class JensenShannonTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(cd.jensen_shannon(0.).eval(), np.log(0.25))
-
-  def test_symmetric(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.jensen_shannon(self._logu).eval(),
-          cd.symmetrized_csiszar_function(
-              self._logu, cd.jensen_shannon).eval())
-
-      self.assertAllClose(
-          cd.jensen_shannon(self._logu, self_normalized=True).eval(),
-          cd.symmetrized_csiszar_function(
-              self._logu,
-              lambda x: cd.jensen_shannon(x, self_normalized=True)).eval())
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.jensen_shannon(self._logu).eval(),
-          (self._u * self._logu
-           - (1 + self._u) * np.log1p(self._u)))
-
-      self.assertAllClose(
-          cd.jensen_shannon(self._logu, self_normalized=True).eval(),
-          (self._u * self._logu
-           - (1 + self._u) * np.log((1 + self._u) / 2)))
-
-
-class ArithmeticGeometricMeanTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(cd.arithmetic_geometric(0.).eval(), np.log(4))
-      self.assertAllClose(
-          cd.arithmetic_geometric(0., self_normalized=True).eval(), 0.)
-
-  def test_symmetric(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.arithmetic_geometric(self._logu).eval(),
-          cd.symmetrized_csiszar_function(
-              self._logu, cd.arithmetic_geometric).eval())
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.arithmetic_geometric(self._logu).eval(),
-          (1. + self._u) * np.log((1. + self._u) / np.sqrt(self._u)))
-
-      self.assertAllClose(
-          cd.arithmetic_geometric(self._logu, self_normalized=True).eval(),
-          (1. + self._u) * np.log(0.5 * (1. + self._u) / np.sqrt(self._u)))
-
-
-class TotalVariationTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(cd.total_variation(0.).eval(), 0.)
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.total_variation(self._logu).eval(),
-          0.5 * np.abs(self._u - 1))
-
-
-class PearsonTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(cd.pearson(0.).eval(), 0.)
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.pearson(self._logu).eval(),
-          np.square(self._u - 1))
-
-
-class SquaredHellingerTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(cd.squared_hellinger(0.).eval(), 0.)
-
-  def test_symmetric(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.squared_hellinger(self._logu).eval(),
-          cd.symmetrized_csiszar_function(
-              self._logu, cd.squared_hellinger).eval())
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.squared_hellinger(self._logu).eval(),
-          np.square(np.sqrt(self._u) - 1))
-
-
-class TriangularTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(cd.triangular(0.).eval(), 0.)
-
-  def test_symmetric(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.triangular(self._logu).eval(),
-          cd.symmetrized_csiszar_function(
-              self._logu, cd.triangular).eval())
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.triangular(self._logu).eval(),
-          np.square(self._u - 1) / (1 + self._u))
-
-
-class TPowerTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(cd.t_power(0., t=-0.1).eval(), 0.)
-      self.assertAllClose(cd.t_power(0., t=0.5).eval(), 0.)
-      self.assertAllClose(cd.t_power(0., t=1.1).eval(), 0.)
-      self.assertAllClose(
-          cd.t_power(0., t=-0.1, self_normalized=True).eval(), 0.)
-      self.assertAllClose(
-          cd.t_power(0., t=0.5, self_normalized=True).eval(), 0.)
-      self.assertAllClose(
-          cd.t_power(0., t=1.1, self_normalized=True).eval(), 0.)
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.t_power(self._logu, t=np.float64(-0.1)).eval(),
-          self._u ** -0.1 - 1.)
-      self.assertAllClose(
-          cd.t_power(self._logu, t=np.float64(0.5)).eval(),
-          -self._u ** 0.5 + 1.)
-      self.assertAllClose(
-          cd.t_power(self._logu, t=np.float64(1.1)).eval(),
-          self._u ** 1.1 - 1.)
-
-  def test_correct_self_normalized(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.t_power(self._logu, t=np.float64(-0.1),
-                     self_normalized=True).eval(),
-          self._u ** -0.1 - 1. + 0.1 * (self._u - 1.))
-      self.assertAllClose(
-          cd.t_power(self._logu, t=np.float64(0.5),
-                     self_normalized=True).eval(),
-          -self._u ** 0.5 + 1. + 0.5 * (self._u - 1.))
-      self.assertAllClose(
-          cd.t_power(self._logu, t=np.float64(1.1),
-                     self_normalized=True).eval(),
-          self._u ** 1.1 - 1. - 1.1 * (self._u - 1.))
-
-
-class Log1pAbsTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(cd.log1p_abs(0.).eval(), 0.)
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.log1p_abs(self._logu).eval(),
-          self._u**(np.sign(self._u - 1)) - 1)
-
-
-class JeffreysTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(cd.jeffreys(0.).eval(), 0.)
-
-  def test_symmetric(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.jeffreys(self._logu).eval(),
-          cd.symmetrized_csiszar_function(
-              self._logu, cd.jeffreys).eval())
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.jeffreys(self._logu).eval(),
-          0.5 * (self._u * self._logu - self._logu))
-
-
-class ChiSquareTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(cd.chi_square(0.).eval(), 0.)
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.chi_square(self._logu).eval(),
-          self._u**2 - 1)
-
-
-class ModifiedGanTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10, 100)
-    self._u = np.exp(self._logu)
-
-  def test_at_zero(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.modified_gan(0.).eval(), np.log(2))
-      self.assertAllClose(
-          cd.modified_gan(0., self_normalized=True).eval(), np.log(2))
-
-  def test_correct(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.modified_gan(self._logu).eval(),
-          np.log1p(self._u) - self._logu)
-
-      self.assertAllClose(
-          cd.modified_gan(self._logu, self_normalized=True).eval(),
-          np.log1p(self._u) - self._logu + 0.5 * (self._u - 1))
-
-
-class SymmetrizedCsiszarFunctionTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10., 100)
-    self._u = np.exp(self._logu)
-
-  def test_jensen_shannon(self):
-    with self.test_session():
-
-      # The following functions come from the claim made in the
-      # symmetrized_csiszar_function docstring.
-      def js1(logu):
-        return (-logu
-                - (1. + math_ops.exp(logu)) * (
-                    nn_ops.softplus(logu)))
-
-      def js2(logu):
-        return 2. * (math_ops.exp(logu) * (
-            logu - nn_ops.softplus(logu)))
-
-      self.assertAllClose(
-          cd.symmetrized_csiszar_function(self._logu, js1).eval(),
-          cd.jensen_shannon(self._logu).eval())
-
-      self.assertAllClose(
-          cd.symmetrized_csiszar_function(self._logu, js2).eval(),
-          cd.jensen_shannon(self._logu).eval())
-
-  def test_jeffreys(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.symmetrized_csiszar_function(self._logu, cd.kl_reverse).eval(),
-          cd.jeffreys(self._logu).eval())
-
-      self.assertAllClose(
-          cd.symmetrized_csiszar_function(self._logu, cd.kl_forward).eval(),
-          cd.jeffreys(self._logu).eval())
-
-
-class DualCsiszarFunctionTest(test.TestCase):
-
-  def setUp(self):
-    self._logu = np.linspace(-10., 10., 100)
-    self._u = np.exp(self._logu)
-
-  def test_kl_forward(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.dual_csiszar_function(self._logu, cd.kl_forward).eval(),
-          cd.kl_reverse(self._logu).eval())
-
-  def test_kl_reverse(self):
-    with self.test_session():
-      self.assertAllClose(
-          cd.dual_csiszar_function(self._logu, cd.kl_reverse).eval(),
-          cd.kl_forward(self._logu).eval())
-
-
-class MonteCarloCsiszarFDivergenceTest(test.TestCase):
-
-  def test_kl_forward(self):
-    with self.test_session() as sess:
-      q = normal_lib.Normal(
-          loc=np.ones(6),
-          scale=np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0]))
-
-      p = normal_lib.Normal(loc=q.loc + 0.1, scale=q.scale - 0.2)
-
-      approx_kl = cd.monte_carlo_csiszar_f_divergence(
-          f=cd.kl_forward,
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=int(1e5),
-          seed=1)
-
-      approx_kl_self_normalized = cd.monte_carlo_csiszar_f_divergence(
-          f=lambda logu: cd.kl_forward(logu, self_normalized=True),
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=int(1e5),
-          seed=1)
-
-      exact_kl = kullback_leibler.kl_divergence(p, q)
-
-      [approx_kl_, approx_kl_self_normalized_, exact_kl_] = sess.run([
-          approx_kl, approx_kl_self_normalized, exact_kl])
-
-      self.assertAllClose(approx_kl_, exact_kl_,
-                          rtol=0.08, atol=0.)
-
-      self.assertAllClose(approx_kl_self_normalized_, exact_kl_,
-                          rtol=0.02, atol=0.)
-
-  def test_kl_reverse(self):
-    with self.test_session() as sess:
-
-      q = normal_lib.Normal(
-          loc=np.ones(6),
-          scale=np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0]))
-
-      p = normal_lib.Normal(loc=q.loc + 0.1, scale=q.scale - 0.2)
-
-      approx_kl = cd.monte_carlo_csiszar_f_divergence(
-          f=cd.kl_reverse,
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=int(1e5),
-          seed=1)
-
-      approx_kl_self_normalized = cd.monte_carlo_csiszar_f_divergence(
-          f=lambda logu: cd.kl_reverse(logu, self_normalized=True),
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=int(1e5),
-          seed=1)
-
-      exact_kl = kullback_leibler.kl_divergence(q, p)
-
-      [approx_kl_, approx_kl_self_normalized_, exact_kl_] = sess.run([
-          approx_kl, approx_kl_self_normalized, exact_kl])
-
-      self.assertAllClose(approx_kl_, exact_kl_,
-                          rtol=0.07, atol=0.)
-
-      self.assertAllClose(approx_kl_self_normalized_, exact_kl_,
-                          rtol=0.02, atol=0.)
-
-  def test_kl_reverse_multidim(self):
-
-    with self.test_session() as sess:
-      d = 5  # Dimension
-
-      p = mvn_full_lib.MultivariateNormalFullCovariance(
-          covariance_matrix=tridiag(d, diag_value=1, offdiag_value=0.5))
-
-      q = mvn_diag_lib.MultivariateNormalDiag(scale_diag=[0.5]*d)
-
-      approx_kl = cd.monte_carlo_csiszar_f_divergence(
-          f=cd.kl_reverse,
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=int(1e5),
-          seed=1)
-
-      approx_kl_self_normalized = cd.monte_carlo_csiszar_f_divergence(
-          f=lambda logu: cd.kl_reverse(logu, self_normalized=True),
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=int(1e5),
-          seed=1)
-
-      exact_kl = kullback_leibler.kl_divergence(q, p)
-
-      [approx_kl_, approx_kl_self_normalized_, exact_kl_] = sess.run([
-          approx_kl, approx_kl_self_normalized, exact_kl])
-
-      self.assertAllClose(approx_kl_, exact_kl_,
-                          rtol=0.02, atol=0.)
-
-      self.assertAllClose(approx_kl_self_normalized_, exact_kl_,
-                          rtol=0.08, atol=0.)
-
-  def test_kl_forward_multidim(self):
-
-    with self.test_session() as sess:
-      d = 5  # Dimension
-
-      p = mvn_full_lib.MultivariateNormalFullCovariance(
-          covariance_matrix=tridiag(d, diag_value=1, offdiag_value=0.5))
-
-      # Variance is very high when approximating Forward KL, so we make
-      # scale_diag larger than in test_kl_reverse_multidim. This ensures q
-      # "covers" p and thus Var_q[p/q] is smaller.
-      q = mvn_diag_lib.MultivariateNormalDiag(scale_diag=[1.]*d)
-
-      approx_kl = cd.monte_carlo_csiszar_f_divergence(
-          f=cd.kl_forward,
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=int(1e5),
-          seed=1)
-
-      approx_kl_self_normalized = cd.monte_carlo_csiszar_f_divergence(
-          f=lambda logu: cd.kl_forward(logu, self_normalized=True),
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=int(1e5),
-          seed=1)
-
-      exact_kl = kullback_leibler.kl_divergence(p, q)
-
-      [approx_kl_, approx_kl_self_normalized_, exact_kl_] = sess.run([
-          approx_kl, approx_kl_self_normalized, exact_kl])
-
-      self.assertAllClose(approx_kl_, exact_kl_,
-                          rtol=0.06, atol=0.)
-
-      self.assertAllClose(approx_kl_self_normalized_, exact_kl_,
-                          rtol=0.05, atol=0.)
-
-  def test_score_trick(self):
-
-    with self.test_session() as sess:
-      d = 5  # Dimension
-      num_draws = int(1e5)
-      seed = 1
-
-      p = mvn_full_lib.MultivariateNormalFullCovariance(
-          covariance_matrix=tridiag(d, diag_value=1, offdiag_value=0.5))
-
-      # Variance is very high when approximating Forward KL, so we make
-      # scale_diag larger than in test_kl_reverse_multidim. This ensures q
-      # "covers" p and thus Var_q[p/q] is smaller.
-      s = array_ops.constant(1.)
-      q = mvn_diag_lib.MultivariateNormalDiag(
-          scale_diag=array_ops.tile([s], [d]))
-
-      approx_kl = cd.monte_carlo_csiszar_f_divergence(
-          f=cd.kl_reverse,
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=num_draws,
-          seed=seed)
-
-      approx_kl_self_normalized = cd.monte_carlo_csiszar_f_divergence(
-          f=lambda logu: cd.kl_reverse(logu, self_normalized=True),
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=num_draws,
-          seed=seed)
-
-      approx_kl_score_trick = cd.monte_carlo_csiszar_f_divergence(
-          f=cd.kl_reverse,
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=num_draws,
-          use_reparametrization=False,
-          seed=seed)
-
-      approx_kl_self_normalized_score_trick = (
-          cd.monte_carlo_csiszar_f_divergence(
-              f=lambda logu: cd.kl_reverse(logu, self_normalized=True),
-              p_log_prob=p.log_prob,
-              q=q,
-              num_draws=num_draws,
-              use_reparametrization=False,
-              seed=seed))
-
-      exact_kl = kullback_leibler.kl_divergence(q, p)
-
-      grad_sum = lambda fs: gradients_impl.gradients(fs, s)[0]
-
-      [
-          approx_kl_grad_,
-          approx_kl_self_normalized_grad_,
-          approx_kl_score_trick_grad_,
-          approx_kl_self_normalized_score_trick_grad_,
-          exact_kl_grad_,
-          approx_kl_,
-          approx_kl_self_normalized_,
-          approx_kl_score_trick_,
-          approx_kl_self_normalized_score_trick_,
-          exact_kl_,
-      ] = sess.run([
-          grad_sum(approx_kl),
-          grad_sum(approx_kl_self_normalized),
-          grad_sum(approx_kl_score_trick),
-          grad_sum(approx_kl_self_normalized_score_trick),
-          grad_sum(exact_kl),
-          approx_kl,
-          approx_kl_self_normalized,
-          approx_kl_score_trick,
-          approx_kl_self_normalized_score_trick,
-          exact_kl,
-      ])
-
-      # Test average divergence.
-      self.assertAllClose(approx_kl_, exact_kl_,
-                          rtol=0.02, atol=0.)
-
-      self.assertAllClose(approx_kl_self_normalized_, exact_kl_,
-                          rtol=0.08, atol=0.)
-
-      self.assertAllClose(approx_kl_score_trick_, exact_kl_,
-                          rtol=0.02, atol=0.)
-
-      self.assertAllClose(approx_kl_self_normalized_score_trick_, exact_kl_,
-                          rtol=0.08, atol=0.)
-
-      # Test average gradient-divergence.
-      self.assertAllClose(approx_kl_grad_, exact_kl_grad_,
-                          rtol=0.007, atol=0.)
-
-      self.assertAllClose(approx_kl_self_normalized_grad_, exact_kl_grad_,
-                          rtol=0.011, atol=0.)
-
-      self.assertAllClose(approx_kl_score_trick_grad_, exact_kl_grad_,
-                          rtol=0.018, atol=0.)
-
-      self.assertAllClose(
-          approx_kl_self_normalized_score_trick_grad_, exact_kl_grad_,
-          rtol=0.017, atol=0.)
-
-
-class CsiszarVIMCOTest(test.TestCase):
-
-  def _csiszar_vimco_helper(self, logu):
-    """Numpy implementation of `csiszar_vimco_helper`."""
-
-    # Since this is a naive/intuitive implementation, we compensate by using the
-    # highest precision we can.
-    logu = np.float128(logu)
-    n = logu.shape[0]
-    u = np.exp(logu)
-    loogeoavg_u = []  # Leave-one-out geometric-average of exp(logu).
-    for j in range(n):
-      loogeoavg_u.append(np.exp(np.mean(
-          [logu[i, ...] for i in range(n) if i != j],
-          axis=0)))
-    loogeoavg_u = np.array(loogeoavg_u)
-
-    loosum_u = []  # Leave-one-out sum of exp(logu).
-    for j in range(n):
-      loosum_u.append(np.sum(
-          [u[i, ...] for i in range(n) if i != j],
-          axis=0))
-    loosum_u = np.array(loosum_u)
-
-    # Natural log of the average u except each is swapped-out for its
-    # leave-`i`-th-out Geometric average.
-    log_sooavg_u = np.log(loosum_u + loogeoavg_u) - np.log(n)
-
-    log_avg_u = np.log(np.mean(u, axis=0))
-    return log_avg_u, log_sooavg_u
-
-  def _csiszar_vimco_helper_grad(self, logu, delta):
-    """Finite difference approximation of `grad(csiszar_vimco_helper, logu)`."""
-
-    # This code actually estimates the sum of the Jacobian because that's what
-    # TF's `gradients` does.
-    np_log_avg_u1, np_log_sooavg_u1 = self._csiszar_vimco_helper(
-        logu[..., None] + np.diag([delta]*len(logu)))
-    np_log_avg_u, np_log_sooavg_u = self._csiszar_vimco_helper(
-        logu[..., None])
-    return [
-        (np_log_avg_u1 - np_log_avg_u) / delta,
-        np.sum(np_log_sooavg_u1 - np_log_sooavg_u, axis=0) / delta,
-    ]
-
-  def test_vimco_helper_1(self):
-    """Tests that function calculation correctly handles batches."""
-
-    logu = np.linspace(-100., 100., 100).reshape([10, 2, 5])
-    with self.test_session() as sess:
-      np_log_avg_u, np_log_sooavg_u = self._csiszar_vimco_helper(logu)
-      [log_avg_u, log_sooavg_u] = sess.run(cd.csiszar_vimco_helper(logu))
-      self.assertAllClose(np_log_avg_u, log_avg_u,
-                          rtol=1e-8, atol=0.)
-      self.assertAllClose(np_log_sooavg_u, log_sooavg_u,
-                          rtol=1e-8, atol=0.)
-
-  def test_vimco_helper_2(self):
-    """Tests that function calculation correctly handles overflow."""
-
-    # Using 700 (rather than 1e3) since naive numpy version can't handle higher.
-    logu = np.float32([0., 700, -1, 1])
-    with self.test_session() as sess:
-      np_log_avg_u, np_log_sooavg_u = self._csiszar_vimco_helper(logu)
-      [log_avg_u, log_sooavg_u] = sess.run(cd.csiszar_vimco_helper(logu))
-      self.assertAllClose(np_log_avg_u, log_avg_u,
-                          rtol=1e-6, atol=0.)
-      self.assertAllClose(np_log_sooavg_u, log_sooavg_u,
-                          rtol=1e-5, atol=0.)
-
-  def test_vimco_helper_3(self):
-    """Tests that function calculation correctly handles underflow."""
-
-    logu = np.float32([0., -1000, -1, 1])
-    with self.test_session() as sess:
-      np_log_avg_u, np_log_sooavg_u = self._csiszar_vimco_helper(logu)
-      [log_avg_u, log_sooavg_u] = sess.run(cd.csiszar_vimco_helper(logu))
-      self.assertAllClose(np_log_avg_u, log_avg_u,
-                          rtol=1e-5, atol=0.)
-      self.assertAllClose(np_log_sooavg_u, log_sooavg_u,
-                          rtol=1e-4, atol=1e-15)
-
-  def test_vimco_helper_gradient_using_finite_difference_1(self):
-    """Tests that gradient calculation correctly handles batches."""
-
-    logu_ = np.linspace(-100., 100., 100).reshape([10, 2, 5])
-    with self.test_session() as sess:
-      logu = array_ops.constant(logu_)
-
-      grad = lambda flogu: gradients_impl.gradients(flogu, logu)[0]
-      log_avg_u, log_sooavg_u = cd.csiszar_vimco_helper(logu)
-
-      [
-          grad_log_avg_u,
-          grad_log_sooavg_u,
-      ] = sess.run([grad(log_avg_u), grad(log_sooavg_u)])
-
-      # We skip checking against finite-difference approximation since it
-      # doesn't support batches.
-
-      # Verify claim in docstring.
-      self.assertAllClose(
-          np.ones_like(grad_log_avg_u.sum(axis=0)),
-          grad_log_avg_u.sum(axis=0))
-      self.assertAllClose(
-          np.ones_like(grad_log_sooavg_u.mean(axis=0)),
-          grad_log_sooavg_u.mean(axis=0))
-
-  def test_vimco_helper_gradient_using_finite_difference_2(self):
-    """Tests that gradient calculation correctly handles overflow."""
-
-    delta = 1e-3
-    logu_ = np.float32([0., 1000, -1, 1])
-    with self.test_session() as sess:
-      logu = array_ops.constant(logu_)
-
-      [
-          np_grad_log_avg_u,
-          np_grad_log_sooavg_u,
-      ] = self._csiszar_vimco_helper_grad(logu_, delta)
-
-      grad = lambda flogu: gradients_impl.gradients(flogu, logu)[0]
-      log_avg_u, log_sooavg_u = cd.csiszar_vimco_helper(logu)
-
-      [
-          grad_log_avg_u,
-          grad_log_sooavg_u,
-      ] = sess.run([grad(log_avg_u), grad(log_sooavg_u)])
-
-      self.assertAllClose(np_grad_log_avg_u, grad_log_avg_u,
-                          rtol=delta, atol=0.)
-      self.assertAllClose(np_grad_log_sooavg_u, grad_log_sooavg_u,
-                          rtol=delta, atol=0.)
-      # Verify claim in docstring.
-      self.assertAllClose(
-          np.ones_like(grad_log_avg_u.sum(axis=0)),
-          grad_log_avg_u.sum(axis=0))
-      self.assertAllClose(
-          np.ones_like(grad_log_sooavg_u.mean(axis=0)),
-          grad_log_sooavg_u.mean(axis=0))
-
-  def test_vimco_helper_gradient_using_finite_difference_3(self):
-    """Tests that gradient calculation correctly handles underflow."""
-
-    delta = 1e-3
-    logu_ = np.float32([0., -1000, -1, 1])
-    with self.test_session() as sess:
-      logu = array_ops.constant(logu_)
-
-      [
-          np_grad_log_avg_u,
-          np_grad_log_sooavg_u,
-      ] = self._csiszar_vimco_helper_grad(logu_, delta)
-
-      grad = lambda flogu: gradients_impl.gradients(flogu, logu)[0]
-      log_avg_u, log_sooavg_u = cd.csiszar_vimco_helper(logu)
-
-      [
-          grad_log_avg_u,
-          grad_log_sooavg_u,
-      ] = sess.run([grad(log_avg_u), grad(log_sooavg_u)])
-
-      self.assertAllClose(np_grad_log_avg_u, grad_log_avg_u,
-                          rtol=delta, atol=0.)
-      self.assertAllClose(np_grad_log_sooavg_u, grad_log_sooavg_u,
-                          rtol=delta, atol=0.)
-      # Verify claim in docstring.
-      self.assertAllClose(
-          np.ones_like(grad_log_avg_u.sum(axis=0)),
-          grad_log_avg_u.sum(axis=0))
-      self.assertAllClose(
-          np.ones_like(grad_log_sooavg_u.mean(axis=0)),
-          grad_log_sooavg_u.mean(axis=0))
-
-  def test_vimco_and_gradient(self):
-
-    with self.test_session() as sess:
-      dims = 5  # Dimension
-      num_draws = int(20)
-      num_batch_draws = int(3)
-      seed = 1
-
-      f = lambda logu: cd.kl_reverse(logu, self_normalized=False)
-      np_f = lambda logu: -logu
-
-      p = mvn_full_lib.MultivariateNormalFullCovariance(
-          covariance_matrix=tridiag(dims, diag_value=1, offdiag_value=0.5))
-
-      # Variance is very high when approximating Forward KL, so we make
-      # scale_diag larger than in test_kl_reverse_multidim. This ensures q
-      # "covers" p and thus Var_q[p/q] is smaller.
-      s = array_ops.constant(1.)
-      q = mvn_diag_lib.MultivariateNormalDiag(
-          scale_diag=array_ops.tile([s], [dims]))
-
-      vimco = cd.csiszar_vimco(
-          f=f,
-          p_log_prob=p.log_prob,
-          q=q,
-          num_draws=num_draws,
-          num_batch_draws=num_batch_draws,
-          seed=seed)
-
-      x = q.sample(sample_shape=[num_draws, num_batch_draws],
-                   seed=seed)
-      x = array_ops.stop_gradient(x)
-      logu = p.log_prob(x) - q.log_prob(x)
-      f_log_sum_u = f(cd.csiszar_vimco_helper(logu)[0])
-
-      grad_sum = lambda fs: gradients_impl.gradients(fs, s)[0]
-
-      def jacobian(x):
-        # Warning: this function is slow and may not even finish if prod(shape)
-        # is larger than, say, 100.
-        shape = x.shape.as_list()
-        assert all(s is not None for s in shape)
-        x = array_ops.reshape(x, shape=[-1])
-        r = [grad_sum(x[i]) for i in range(np.prod(shape))]
-        return array_ops.reshape(array_ops.stack(r), shape=shape)
-
-      [
-          logu_,
-          jacobian_logqx_,
-          vimco_,
-          grad_vimco_,
-          f_log_sum_u_,
-          grad_mean_f_log_sum_u_,
-      ] = sess.run([
-          logu,
-          jacobian(q.log_prob(x)),
-          vimco,
-          grad_sum(vimco),
-          f_log_sum_u,
-          grad_sum(f_log_sum_u) / num_batch_draws,
-      ])
-
-      np_log_avg_u, np_log_sooavg_u = self._csiszar_vimco_helper(logu_)
-
-      # Test VIMCO loss is correct.
-      self.assertAllClose(np_f(np_log_avg_u).mean(axis=0), vimco_,
-                          rtol=1e-5, atol=0.)
-
-      # Test gradient of VIMCO loss is correct.
-      #
-      # To make this computation we'll inject two gradients from TF:
-      # - grad[mean(f(log(sum(p(x)/q(x)))))]
-      # - jacobian[log(q(x))].
-      #
-      # We now justify why using these (and only these) TF values for
-      # ground-truth does not undermine the completeness of this test.
-      #
-      # Regarding `grad_mean_f_log_sum_u_`, note that we validate the
-      # correctness of the zero-th order derivative (for each batch member).
-      # Since `cd.csiszar_vimco_helper` itself does not manipulate any gradient
-      # information, we can safely rely on TF.
-      self.assertAllClose(np_f(np_log_avg_u), f_log_sum_u_, rtol=1e-4, atol=0.)
-      #
-      # Regarding `jacobian_logqx_`, note that testing the gradient of
-      # `q.log_prob` is outside the scope of this unit-test thus we may safely
-      # use TF to find it.
-
-      # The `mean` is across batches and the `sum` is across iid samples.
-      np_grad_vimco = (
-          grad_mean_f_log_sum_u_
-          + np.mean(
-              np.sum(
-                  jacobian_logqx_ * (np_f(np_log_avg_u)
-                                     - np_f(np_log_sooavg_u)),
-                  axis=0),
-              axis=0))
-
-      self.assertAllClose(np_grad_vimco, grad_vimco_,
-                          rtol=1e-5, atol=0.)
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/custom_grad_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/custom_grad_test.py
deleted file mode 100644
index a95df31ac1fd9f5038abe779391ccba5f7fe408d..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/custom_grad_test.py
+++ /dev/null
@@ -1,157 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for Custom Gradient Ops."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.contrib.bayesflow.python.ops import custom_grad_impl
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import gradients_impl
-from tensorflow.python.ops import init_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import variable_scope
-from tensorflow.python.ops import variables
-from tensorflow.python.platform import test
-
-
-cg = custom_grad_impl
-
-
-class CustomGradientTest(test.TestCase):
-
-  def test_works_correctly(self):
-    with self.test_session() as sess:
-      f = lambda x: x**2 / 2
-      g = lambda x: (x - 1)**3 / 3
-      x_ = np.linspace(-100, 100, int(1e4)) + [0.]
-
-      x = constant_op.constant(x_)
-      fx = cg.custom_gradient(f(x), g(x), x)
-      gx = gradients_impl.gradients(fx, x)[0]
-      [fx_, gx_] = sess.run([fx, gx])
-
-      self.assertAllClose(f(x_), fx_)
-      self.assertAllClose(g(x_), gx_)
-
-  def test_works_correctly_both_f_g_zero(self):
-    with self.test_session() as sess:
-      f = lambda x: x**2 / 2
-      g = lambda x: x**3 / 3
-      x_ = np.linspace(-100, 100, int(1e4)) + [0.]
-
-      x = constant_op.constant(x_)
-      fx = cg.custom_gradient(f(x), g(x), x)
-      gx = gradients_impl.gradients(fx, x)[0]
-      [fx_, gx_] = sess.run([fx, gx])
-
-      self.assertAllClose(f(x_), fx_)
-      self.assertAllClose(g(x_), gx_)
-
-  def test_works_correctly_vector_of_vars(self):
-    with self.test_session() as sess:
-      x = variable_scope.get_variable(
-          name="x",
-          shape=[],
-          dtype=dtypes.float32,
-          initializer=init_ops.constant_initializer(2))
-      y = variable_scope.get_variable(
-          name="y",
-          shape=[],
-          dtype=dtypes.float32,
-          initializer=init_ops.constant_initializer(3))
-      sess.run([variables.global_variables_initializer()])
-
-      f = lambda z: z[0] * z[1]
-      g = lambda z: z[0]**2 * z[1]**2 / 2
-
-      z = array_ops.stack([x, y])
-      fz = cg.custom_gradient(f(z), g(z), z, axis=0)
-      gz = gradients_impl.gradients(fz, variables.trainable_variables())
-      [z_, fz_, gx_, gy_] = sess.run([z, fz, gz[0], gz[1]])
-
-      self.assertEqual(f(z_), fz_)
-      self.assertEqual(g(z_), gx_)
-      self.assertEqual(g(z_), gy_)
-
-  def test_works_correctly_side_vars(self):
-    with self.test_session() as sess:
-      x_ = np.float32(2.1)  # Adding extra tenth to force imprecision.
-      y_ = np.float32(3.1)
-      x = variable_scope.get_variable(
-          name="x",
-          shape=[],
-          dtype=dtypes.float32,
-          initializer=init_ops.constant_initializer(x_))
-      y = variable_scope.get_variable(
-          name="y",
-          shape=[],
-          dtype=dtypes.float32,
-          initializer=init_ops.constant_initializer(y_))
-      sess.run([variables.global_variables_initializer()])
-
-      f = lambda x: x * y
-      g = lambda z: math_ops.square(x) * y
-
-      fx = cg.custom_gradient(f(x), g(x), x)
-      gx = gradients_impl.gradients(fx, variables.trainable_variables())
-      [x_, fx_, gx_] = sess.run([x, fx, gx[0]])
-      gy_ = gx[1]
-
-      self.assertEqual(x_ * y_, fx_)
-      self.assertEqual(np.square(x_) * y_, gx_)
-      self.assertEqual(None, gy_)
-
-  def test_works_correctly_fx_gx_manually_stopped(self):
-    with self.test_session() as sess:
-      x_ = np.float32(2.1)  # Adding extra tenth to force imprecision.
-      y_ = np.float32(3.1)
-      x = variable_scope.get_variable(
-          name="x",
-          shape=[],
-          dtype=dtypes.float32,
-          initializer=init_ops.constant_initializer(x_))
-      y = variable_scope.get_variable(
-          name="y",
-          shape=[],
-          dtype=dtypes.float32,
-          initializer=init_ops.constant_initializer(y_))
-      sess.run([variables.global_variables_initializer()])
-
-      stop = array_ops.stop_gradient  # For readability.
-
-      # Basically we need to stop the `x` portion of `f`. And when we supply the
-      # arg to `custom_gradient` we need to stop the complement, i.e., the `y`
-      # part.
-      f = lambda x: stop(x) * y
-      g = lambda x: stop(math_ops.square(x)) * y
-      fx = cg.custom_gradient(f(x), g(x), x + stop(y),
-                              fx_gx_manually_stopped=True)
-
-      gx = gradients_impl.gradients(fx, variables.trainable_variables())
-      [x_, fx_, gx_, gy_] = sess.run([x, fx, gx[0], gx[1]])
-
-      self.assertEqual(x_ * y_, fx_)
-      self.assertEqual(np.square(x_) * y_, gx_)
-      self.assertEqual(x_, gy_)
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/docstring_util_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/docstring_util_test.py
deleted file mode 100644
index 8ed500b19d8dd72795758a2920119e3680576697..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/docstring_util_test.py
+++ /dev/null
@@ -1,87 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for docstring utilities."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.contrib.bayesflow.python.ops import docstring_util
-from tensorflow.python.platform import test
-
-
-class DocstringUtil(test.TestCase):
-
-  def _testFunction(self):
-    doc_args = """x: Input to return as output.
-  y: Baz."""
-    @docstring_util.expand_docstring(args=doc_args)
-    def foo(x):
-      # pylint: disable=g-doc-args
-      """Hello world.
-
-      Args:
-        @{args}
-
-      Returns:
-        x.
-      """
-      # pylint: enable=g-doc-args
-      return x
-
-    true_docstring = """Hello world.
-
-    Args:
-      x: Input to return as output.
-      y: Baz.
-
-    Returns:
-      x.
-    """
-    self.assertEqual(foo.__doc__, true_docstring)
-
-  def _testClassInit(self):
-    doc_args = """x: Input to return as output.
-  y: Baz."""
-
-    class Foo(object):
-
-      @docstring_util.expand_docstring(args=doc_args)
-      def __init__(self, x, y):
-        # pylint: disable=g-doc-args
-        """Hello world.
-
-        Args:
-          @{args}
-
-        Bar.
-        """
-        # pylint: enable=g-doc-args
-        pass
-
-    true_docstring = """Hello world.
-
-    Args:
-      x: Input to return as output.
-      y: Baz.
-
-    Bar.
-    """
-    self.assertEqual(Foo.__doc__, true_docstring)
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/halton_sequence_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/halton_sequence_test.py
deleted file mode 100644
index 0a85862abfd744a86b9a38e10dbb5b985d0a0e94..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/halton_sequence_test.py
+++ /dev/null
@@ -1,131 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for halton_sequence.py."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.contrib.bayesflow.python.ops import halton_sequence as halton
-from tensorflow.contrib.bayesflow.python.ops import monte_carlo_impl as monte_carlo_lib
-from tensorflow.python.framework import dtypes
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops.distributions import normal as normal_lib
-from tensorflow.python.platform import test
-
-
-mc = monte_carlo_lib
-
-
-class HaltonSequenceTest(test.TestCase):
-
-  def test_known_values_small_bases(self):
-    with self.test_session():
-      # The first five elements of the Halton sequence with base 2 and 3
-      expected = np.array(((1. / 2, 1. / 3),
-                           (1. / 4, 2. / 3),
-                           (3. / 4, 1. / 9),
-                           (1. / 8, 4. / 9),
-                           (5. / 8, 7. / 9)), dtype=np.float32)
-      sample = halton.sample(2, num_samples=5)
-      self.assertAllClose(expected, sample.eval(), rtol=1e-6)
-
-  def test_sample_indices(self):
-    with self.test_session():
-      dim = 5
-      indices = math_ops.range(10, dtype=dtypes.int32)
-      sample_direct = halton.sample(dim, num_samples=10)
-      sample_from_indices = halton.sample(dim, sample_indices=indices)
-      self.assertAllClose(sample_direct.eval(), sample_from_indices.eval(),
-                          rtol=1e-6)
-
-  def test_dtypes_works_correctly(self):
-    with self.test_session():
-      dim = 3
-      sample_float32 = halton.sample(dim, num_samples=10, dtype=dtypes.float32)
-      sample_float64 = halton.sample(dim, num_samples=10, dtype=dtypes.float64)
-      self.assertEqual(sample_float32.eval().dtype, np.float32)
-      self.assertEqual(sample_float64.eval().dtype, np.float64)
-
-  def test_normal_integral_mean_and_var_correctly_estimated(self):
-    n = int(1000)
-    # This test is almost identical to the similarly named test in
-    # monte_carlo_test.py. The only difference is that we use the Halton
-    # samples instead of the random samples to evaluate the expectations.
-    # MC with pseudo random numbers converges at the rate of 1/ Sqrt(N)
-    # (N=number of samples). For QMC in low dimensions, the expected convergence
-    # rate is ~ 1/N. Hence we should only need 1e3 samples as compared to the
-    # 1e6 samples used in the pseudo-random monte carlo.
-    with self.test_session():
-      mu_p = array_ops.constant([-1.0, 1.0], dtype=dtypes.float64)
-      mu_q = array_ops.constant([0.0, 0.0], dtype=dtypes.float64)
-      sigma_p = array_ops.constant([0.5, 0.5], dtype=dtypes.float64)
-      sigma_q = array_ops.constant([1.0, 1.0], dtype=dtypes.float64)
-      p = normal_lib.Normal(loc=mu_p, scale=sigma_p)
-      q = normal_lib.Normal(loc=mu_q, scale=sigma_q)
-
-      cdf_sample = halton.sample(2, num_samples=n, dtype=dtypes.float64)
-      q_sample = q.quantile(cdf_sample)
-
-      # Compute E_p[X].
-      e_x = mc.expectation_importance_sampler(
-          f=lambda x: x, log_p=p.log_prob, sampling_dist_q=q, z=q_sample,
-          seed=42)
-
-      # Compute E_p[X^2].
-      e_x2 = mc.expectation_importance_sampler(
-          f=math_ops.square, log_p=p.log_prob, sampling_dist_q=q, z=q_sample,
-          seed=42)
-
-      stddev = math_ops.sqrt(e_x2 - math_ops.square(e_x))
-      # Keep the tolerance levels the same as in monte_carlo_test.py.
-      self.assertEqual(p.batch_shape, e_x.get_shape())
-      self.assertAllClose(p.mean().eval(), e_x.eval(), rtol=0.01)
-      self.assertAllClose(p.stddev().eval(), stddev.eval(), rtol=0.02)
-
-  def test_docstring_example(self):
-    # Produce the first 1000 members of the Halton sequence in 3 dimensions.
-    num_samples = 1000
-    dim = 3
-    with self.test_session():
-      sample = halton.sample(dim, num_samples=num_samples)
-
-      # Evaluate the integral of x_1 * x_2^2 * x_3^3  over the three dimensional
-      # hypercube.
-      powers = math_ops.range(1.0, limit=dim + 1)
-      integral = math_ops.reduce_mean(
-          math_ops.reduce_prod(sample ** powers, axis=-1))
-      true_value = 1.0 / math_ops.reduce_prod(powers + 1.0)
-
-      # Produces a relative absolute error of 1.7%.
-      self.assertAllClose(integral.eval(), true_value.eval(), rtol=0.02)
-
-    # Now skip the first 1000 samples and recompute the integral with the next
-    # thousand samples. The sample_indices argument can be used to do this.
-
-      sample_indices = math_ops.range(start=1000, limit=1000 + num_samples,
-                                      dtype=dtypes.int32)
-      sample_leaped = halton.sample(dim, sample_indices=sample_indices)
-
-      integral_leaped = math_ops.reduce_mean(
-          math_ops.reduce_prod(sample_leaped ** powers, axis=-1))
-      self.assertAllClose(integral_leaped.eval(), true_value.eval(), rtol=0.001)
-
-
-if __name__ == '__main__':
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/hmc_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/hmc_test.py
index 819095a060b5f4cf18df6e7e4e4556e50ae44dd3..dabadfc7b6a3da8786e88d559fe2d05b44599ca0 100644
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/hmc_test.py
+++ b/tensorflow/contrib/bayesflow/python/kernel_tests/hmc_test.py
@@ -462,138 +462,6 @@ class HMCTest(test.TestCase):
   def testKernelLeavesTargetInvariant3(self):
     self._kernel_leaves_target_invariant_wrapper(3)
 
-  def _ais_gets_correct_log_normalizer(self, init, independent_chain_ndims,
-                                       sess, feed_dict=None):
-    counter = collections.Counter()
-
-    def proposal_log_prob(x):
-      counter["proposal_calls"] += 1
-      event_dims = math_ops.range(independent_chain_ndims, array_ops.rank(x))
-      return -0.5 * math_ops.reduce_sum(x**2. + np.log(2 * np.pi),
-                                        axis=event_dims)
-
-    def target_log_prob(x):
-      counter["target_calls"] += 1
-      event_dims = math_ops.range(independent_chain_ndims, array_ops.rank(x))
-      return self._log_gamma_log_prob(x, event_dims)
-
-    if feed_dict is None:
-      feed_dict = {}
-
-    num_steps = 200
-
-    _, ais_weights, _ = hmc.sample_annealed_importance_chain(
-        proposal_log_prob_fn=proposal_log_prob,
-        num_steps=num_steps,
-        target_log_prob_fn=target_log_prob,
-        step_size=0.5,
-        current_state=init,
-        num_leapfrog_steps=2,
-        seed=45)
-
-    # We have three calls because the calculation of `ais_weights` entails
-    # another call to the `convex_combined_log_prob_fn`. We could refactor
-    # things to avoid this, if needed (eg, b/72994218).
-    self.assertAllEqual(dict(target_calls=3, proposal_calls=3), counter)
-
-    event_shape = array_ops.shape(init)[independent_chain_ndims:]
-    event_size = math_ops.reduce_prod(event_shape)
-
-    log_true_normalizer = (
-        -self._shape_param * math_ops.log(self._rate_param)
-        + math_ops.lgamma(self._shape_param))
-    log_true_normalizer *= math_ops.cast(event_size, log_true_normalizer.dtype)
-
-    log_estimated_normalizer = (math_ops.reduce_logsumexp(ais_weights)
-                                - np.log(num_steps))
-
-    ratio_estimate_true = math_ops.exp(ais_weights - log_true_normalizer)
-    ais_weights_size = array_ops.size(ais_weights)
-    standard_error = math_ops.sqrt(
-        _reduce_variance(ratio_estimate_true)
-        / math_ops.cast(ais_weights_size, ratio_estimate_true.dtype))
-
-    [
-        ratio_estimate_true_,
-        log_true_normalizer_,
-        log_estimated_normalizer_,
-        standard_error_,
-        ais_weights_size_,
-        event_size_,
-    ] = sess.run([
-        ratio_estimate_true,
-        log_true_normalizer,
-        log_estimated_normalizer,
-        standard_error,
-        ais_weights_size,
-        event_size,
-    ], feed_dict)
-
-    logging_ops.vlog(1, "        log_true_normalizer: {}\n"
-                        "   log_estimated_normalizer: {}\n"
-                        "           ais_weights_size: {}\n"
-                        "                 event_size: {}\n".format(
-                            log_true_normalizer_,
-                            log_estimated_normalizer_,
-                            ais_weights_size_,
-                            event_size_))
-    self.assertNear(ratio_estimate_true_.mean(), 1., 4. * standard_error_)
-
-  def _ais_gets_correct_log_normalizer_wrapper(self, independent_chain_ndims):
-    """Tests that AIS yields reasonable estimates of normalizers."""
-    with self.test_session(graph=ops.Graph()) as sess:
-      x_ph = array_ops.placeholder(np.float32, name="x_ph")
-      initial_draws = np.random.normal(size=[30, 2, 1])
-      self._ais_gets_correct_log_normalizer(
-          x_ph,
-          independent_chain_ndims,
-          sess,
-          feed_dict={x_ph: initial_draws})
-
-  def testAIS1(self):
-    self._ais_gets_correct_log_normalizer_wrapper(1)
-
-  def testAIS2(self):
-    self._ais_gets_correct_log_normalizer_wrapper(2)
-
-  def testAIS3(self):
-    self._ais_gets_correct_log_normalizer_wrapper(3)
-
-  def testSampleAIChainSeedReproducibleWorksCorrectly(self):
-    with self.test_session(graph=ops.Graph()) as sess:
-      independent_chain_ndims = 1
-      x = np.random.rand(4, 3, 2)
-
-      def proposal_log_prob(x):
-        event_dims = math_ops.range(independent_chain_ndims, array_ops.rank(x))
-        return -0.5 * math_ops.reduce_sum(x**2. + np.log(2 * np.pi),
-                                          axis=event_dims)
-
-      def target_log_prob(x):
-        event_dims = math_ops.range(independent_chain_ndims, array_ops.rank(x))
-        return self._log_gamma_log_prob(x, event_dims)
-
-      ais_kwargs = dict(
-          proposal_log_prob_fn=proposal_log_prob,
-          num_steps=200,
-          target_log_prob_fn=target_log_prob,
-          step_size=0.5,
-          current_state=x,
-          num_leapfrog_steps=2,
-          seed=53)
-
-      _, ais_weights0, _ = hmc.sample_annealed_importance_chain(
-          **ais_kwargs)
-
-      _, ais_weights1, _ = hmc.sample_annealed_importance_chain(
-          **ais_kwargs)
-
-      [ais_weights0_, ais_weights1_] = sess.run([
-          ais_weights0, ais_weights1])
-
-      self.assertAllClose(ais_weights0_, ais_weights1_,
-                          atol=1e-5, rtol=1e-5)
-
   def testNanRejection(self):
     """Tests that an update that yields NaN potentials gets rejected.
 
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/layers_conv_variational_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/layers_conv_variational_test.py
deleted file mode 100644
index 750afb6654311fea30a1dc6b31b20aa3b4160ae2..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/layers_conv_variational_test.py
+++ /dev/null
@@ -1,521 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for convolutional Bayesian layers."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.contrib.bayesflow.python.ops import layers_conv_variational as prob_layers_lib
-from tensorflow.contrib.bayesflow.python.ops import layers_util as prob_layers_util
-from tensorflow.contrib.distributions.python.ops import independent as independent_lib
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_shape
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import nn
-from tensorflow.python.ops import nn_ops
-from tensorflow.python.ops import random_ops
-from tensorflow.python.ops.distributions import normal as normal_lib
-from tensorflow.python.ops.distributions import util as distribution_util
-from tensorflow.python.platform import test
-
-
-class Counter(object):
-  """Helper class to manage incrementing a counting `int`."""
-
-  def __init__(self):
-    self._value = -1
-
-  @property
-  def value(self):
-    return self._value
-
-  def __call__(self):
-    self._value += 1
-    return self._value
-
-
-class MockDistribution(independent_lib.Independent):
-  """Monitors layer calls to the underlying distribution."""
-
-  def __init__(self, result_sample, result_log_prob, loc=None, scale=None):
-    self.result_sample = result_sample
-    self.result_log_prob = result_log_prob
-    self.result_loc = loc
-    self.result_scale = scale
-    self.result_distribution = normal_lib.Normal(loc=0.0, scale=1.0)
-    if loc is not None and scale is not None:
-      self.result_distribution = normal_lib.Normal(loc=self.result_loc,
-                                                   scale=self.result_scale)
-    self.called_log_prob = Counter()
-    self.called_sample = Counter()
-    self.called_loc = Counter()
-    self.called_scale = Counter()
-
-  def log_prob(self, *args, **kwargs):
-    self.called_log_prob()
-    return self.result_log_prob
-
-  def sample(self, *args, **kwargs):
-    self.called_sample()
-    return self.result_sample
-
-  @property
-  def distribution(self):  # for dummy check on Independent(Normal)
-    return self.result_distribution
-
-  @property
-  def loc(self):
-    self.called_loc()
-    return self.result_loc
-
-  @property
-  def scale(self):
-    self.called_scale()
-    return self.result_scale
-
-
-class MockKLDivergence(object):
-  """Monitors layer calls to the divergence implementation."""
-
-  def __init__(self, result):
-    self.result = result
-    self.args = []
-    self.called = Counter()
-
-  def __call__(self, *args, **kwargs):
-    self.called()
-    self.args.append(args)
-    return self.result
-
-
-class ConvVariational(test.TestCase):
-
-  def _testKLPenaltyKernel(self, layer_class):
-    with self.test_session():
-      layer = layer_class(filters=2, kernel_size=3)
-      if layer_class in (prob_layers_lib.Conv1DReparameterization,
-                         prob_layers_lib.Conv1DFlipout):
-        inputs = random_ops.random_uniform([2, 3, 1], seed=1)
-      elif layer_class in (prob_layers_lib.Conv2DReparameterization,
-                           prob_layers_lib.Conv2DFlipout):
-        inputs = random_ops.random_uniform([2, 3, 3, 1], seed=1)
-      elif layer_class in (prob_layers_lib.Conv3DReparameterization,
-                           prob_layers_lib.Conv3DFlipout):
-        inputs = random_ops.random_uniform([2, 3, 3, 3, 1], seed=1)
-
-      # No keys.
-      losses = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
-      self.assertEqual(len(losses), 0)
-      self.assertListEqual(layer.losses, losses)
-
-      _ = layer(inputs)
-
-      # Yes keys.
-      losses = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
-      self.assertEqual(len(losses), 1)
-      self.assertListEqual(layer.losses, losses)
-
-  def _testKLPenaltyBoth(self, layer_class):
-    def _make_normal(dtype, *args):  # pylint: disable=unused-argument
-      return normal_lib.Normal(
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.))
-    with self.test_session():
-      layer = layer_class(
-          filters=2,
-          kernel_size=3,
-          bias_posterior_fn=prob_layers_util.default_mean_field_normal_fn(),
-          bias_prior_fn=_make_normal)
-      if layer_class in (prob_layers_lib.Conv1DReparameterization,
-                         prob_layers_lib.Conv1DFlipout):
-        inputs = random_ops.random_uniform([2, 3, 1], seed=1)
-      elif layer_class in (prob_layers_lib.Conv2DReparameterization,
-                           prob_layers_lib.Conv2DFlipout):
-        inputs = random_ops.random_uniform([2, 3, 3, 1], seed=1)
-      elif layer_class in (prob_layers_lib.Conv3DReparameterization,
-                           prob_layers_lib.Conv3DFlipout):
-        inputs = random_ops.random_uniform([2, 3, 3, 3, 1], seed=1)
-
-      # No keys.
-      losses = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
-      self.assertEqual(len(losses), 0)
-      self.assertListEqual(layer.losses, losses)
-
-      _ = layer(inputs)
-
-      # Yes keys.
-      losses = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
-      self.assertEqual(len(losses), 2)
-      self.assertListEqual(layer.losses, losses)
-
-  def _testConvSetUp(self, layer_class, batch_size, depth=None,
-                     height=None, width=None, channels=None, filters=None,
-                     **kwargs):
-    seed = Counter()
-    if layer_class in (prob_layers_lib.Conv1DReparameterization,
-                       prob_layers_lib.Conv1DFlipout):
-      inputs = random_ops.random_uniform(
-          [batch_size, width, channels], seed=seed())
-      kernel_size = (2,)
-    elif layer_class in (prob_layers_lib.Conv2DReparameterization,
-                         prob_layers_lib.Conv2DFlipout):
-      inputs = random_ops.random_uniform(
-          [batch_size, height, width, channels], seed=seed())
-      kernel_size = (2, 2)
-    elif layer_class in (prob_layers_lib.Conv3DReparameterization,
-                         prob_layers_lib.Conv3DFlipout):
-      inputs = random_ops.random_uniform(
-          [batch_size, depth, height, width, channels], seed=seed())
-      kernel_size = (2, 2, 2)
-
-    kernel_shape = kernel_size + (channels, filters)
-    kernel_posterior = MockDistribution(
-        loc=random_ops.random_uniform(kernel_shape, seed=seed()),
-        scale=random_ops.random_uniform(kernel_shape, seed=seed()),
-        result_log_prob=random_ops.random_uniform(kernel_shape, seed=seed()),
-        result_sample=random_ops.random_uniform(kernel_shape, seed=seed()))
-    kernel_prior = MockDistribution(
-        result_log_prob=random_ops.random_uniform(kernel_shape, seed=seed()),
-        result_sample=random_ops.random_uniform(kernel_shape, seed=seed()))
-    kernel_divergence = MockKLDivergence(
-        result=random_ops.random_uniform(kernel_shape, seed=seed()))
-
-    bias_size = (filters,)
-    bias_posterior = MockDistribution(
-        result_log_prob=random_ops.random_uniform(bias_size, seed=seed()),
-        result_sample=random_ops.random_uniform(bias_size, seed=seed()))
-    bias_prior = MockDistribution(
-        result_log_prob=random_ops.random_uniform(bias_size, seed=seed()),
-        result_sample=random_ops.random_uniform(bias_size, seed=seed()))
-    bias_divergence = MockKLDivergence(
-        result=random_ops.random_uniform(bias_size, seed=seed()))
-
-    layer = layer_class(
-        filters=filters,
-        kernel_size=kernel_size,
-        padding="SAME",
-        kernel_posterior_fn=lambda *args: kernel_posterior,
-        kernel_posterior_tensor_fn=lambda d: d.sample(seed=42),
-        kernel_prior_fn=lambda *args: kernel_prior,
-        kernel_divergence_fn=kernel_divergence,
-        bias_posterior_fn=lambda *args: bias_posterior,
-        bias_posterior_tensor_fn=lambda d: d.sample(seed=43),
-        bias_prior_fn=lambda *args: bias_prior,
-        bias_divergence_fn=bias_divergence,
-        **kwargs)
-
-    outputs = layer(inputs)
-
-    kl_penalty = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
-    return (kernel_posterior, kernel_prior, kernel_divergence,
-            bias_posterior, bias_prior, bias_divergence,
-            layer, inputs, outputs, kl_penalty, kernel_shape)
-
-  def _testConvReparameterization(self, layer_class):
-    batch_size, depth, height, width, channels, filters = 2, 4, 4, 4, 3, 5
-    with self.test_session() as sess:
-      (kernel_posterior, kernel_prior, kernel_divergence,
-       bias_posterior, bias_prior, bias_divergence, layer, inputs,
-       outputs, kl_penalty, kernel_shape) = self._testConvSetUp(
-           layer_class, batch_size,
-           depth=depth, height=height, width=width, channels=channels,
-           filters=filters)
-
-      convolution_op = nn_ops.Convolution(
-          tensor_shape.TensorShape(inputs.shape),
-          filter_shape=tensor_shape.TensorShape(kernel_shape),
-          padding="SAME")
-      expected_outputs = convolution_op(inputs, kernel_posterior.result_sample)
-      expected_outputs = nn.bias_add(expected_outputs,
-                                     bias_posterior.result_sample,
-                                     data_format="NHWC")
-
-      [
-          expected_outputs_, actual_outputs_,
-          expected_kernel_, actual_kernel_,
-          expected_kernel_divergence_, actual_kernel_divergence_,
-          expected_bias_, actual_bias_,
-          expected_bias_divergence_, actual_bias_divergence_,
-      ] = sess.run([
-          expected_outputs, outputs,
-          kernel_posterior.result_sample, layer.kernel_posterior_tensor,
-          kernel_divergence.result, kl_penalty[0],
-          bias_posterior.result_sample, layer.bias_posterior_tensor,
-          bias_divergence.result, kl_penalty[1],
-      ])
-
-      self.assertAllClose(
-          expected_kernel_, actual_kernel_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_bias_, actual_bias_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_outputs_, actual_outputs_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_kernel_divergence_, actual_kernel_divergence_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_bias_divergence_, actual_bias_divergence_,
-          rtol=1e-6, atol=0.)
-
-      self.assertAllEqual(
-          [[kernel_posterior.distribution,
-            kernel_prior.distribution,
-            kernel_posterior.result_sample]],
-          kernel_divergence.args)
-
-      self.assertAllEqual(
-          [[bias_posterior.distribution,
-            bias_prior.distribution,
-            bias_posterior.result_sample]],
-          bias_divergence.args)
-
-  def _testConvFlipout(self, layer_class):
-    batch_size, depth, height, width, channels, filters = 2, 4, 4, 4, 3, 5
-    with self.test_session() as sess:
-      (kernel_posterior, kernel_prior, kernel_divergence,
-       bias_posterior, bias_prior, bias_divergence, layer, inputs,
-       outputs, kl_penalty, kernel_shape) = self._testConvSetUp(
-           layer_class, batch_size,
-           depth=depth, height=height, width=width, channels=channels,
-           filters=filters, seed=44)
-
-      convolution_op = nn_ops.Convolution(
-          tensor_shape.TensorShape(inputs.shape),
-          filter_shape=tensor_shape.TensorShape(kernel_shape),
-          padding="SAME")
-
-      expected_kernel_posterior_affine = normal_lib.Normal(
-          loc=array_ops.zeros_like(kernel_posterior.result_loc),
-          scale=kernel_posterior.result_scale)
-      expected_kernel_posterior_affine_tensor = (
-          expected_kernel_posterior_affine.sample(seed=42))
-
-      expected_outputs = convolution_op(
-          inputs, kernel_posterior.distribution.loc)
-
-      input_shape = array_ops.shape(inputs)
-      output_shape = array_ops.shape(expected_outputs)
-      batch_shape = array_ops.expand_dims(input_shape[0], 0)
-      channels = input_shape[-1]
-      rank = len(inputs.get_shape()) - 2
-
-      sign_input = random_ops.random_uniform(
-          array_ops.concat([batch_shape,
-                            array_ops.expand_dims(channels, 0)], 0),
-          minval=0,
-          maxval=2,
-          dtype=dtypes.int32,
-          seed=layer.seed)
-      sign_input = math_ops.cast(2 * sign_input - 1, inputs.dtype)
-      sign_output = random_ops.random_uniform(
-          array_ops.concat([batch_shape,
-                            array_ops.expand_dims(filters, 0)], 0),
-          minval=0,
-          maxval=2,
-          dtype=dtypes.int32,
-          seed=distribution_util.gen_new_seed(
-              layer.seed, salt="conv_flipout"))
-      sign_output = math_ops.cast(2 * sign_output - 1, inputs.dtype)
-      for _ in range(rank):
-        sign_input = array_ops.expand_dims(sign_input, 1)  # 2D ex: (B, 1, 1, C)
-        sign_output = array_ops.expand_dims(sign_output, 1)
-
-      sign_input = array_ops.tile(  # tile for element-wise op broadcasting
-          sign_input,
-          [1] + [input_shape[i + 1] for i in range(rank)] + [1])
-      sign_output = array_ops.tile(
-          sign_output,
-          [1] + [output_shape[i + 1] for i in range(rank)] + [1])
-
-      perturbed_inputs = convolution_op(
-          inputs * sign_input, expected_kernel_posterior_affine_tensor)
-      perturbed_inputs *= sign_output
-
-      expected_outputs += perturbed_inputs
-      expected_outputs = nn.bias_add(expected_outputs,
-                                     bias_posterior.result_sample,
-                                     data_format="NHWC")
-
-      [
-          expected_outputs_, actual_outputs_,
-          expected_kernel_divergence_, actual_kernel_divergence_,
-          expected_bias_, actual_bias_,
-          expected_bias_divergence_, actual_bias_divergence_,
-      ] = sess.run([
-          expected_outputs, outputs,
-          kernel_divergence.result, kl_penalty[0],
-          bias_posterior.result_sample, layer.bias_posterior_tensor,
-          bias_divergence.result, kl_penalty[1],
-      ])
-
-      self.assertAllClose(
-          expected_bias_, actual_bias_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_outputs_, actual_outputs_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_kernel_divergence_, actual_kernel_divergence_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_bias_divergence_, actual_bias_divergence_,
-          rtol=1e-6, atol=0.)
-
-      self.assertAllEqual(
-          [[kernel_posterior.distribution, kernel_prior.distribution, None]],
-          kernel_divergence.args)
-
-      self.assertAllEqual(
-          [[bias_posterior.distribution,
-            bias_prior.distribution,
-            bias_posterior.result_sample]],
-          bias_divergence.args)
-
-  def _testRandomConvFlipout(self, layer_class):
-    batch_size, depth, height, width, channels, filters = 2, 4, 4, 4, 3, 5
-    with self.test_session() as sess:
-      seed = Counter()
-      if layer_class in (prob_layers_lib.Conv1DReparameterization,
-                         prob_layers_lib.Conv1DFlipout):
-        inputs = random_ops.random_uniform(
-            [batch_size, width, channels], seed=seed())
-        kernel_size = (2,)
-      elif layer_class in (prob_layers_lib.Conv2DReparameterization,
-                           prob_layers_lib.Conv2DFlipout):
-        inputs = random_ops.random_uniform(
-            [batch_size, height, width, channels], seed=seed())
-        kernel_size = (2, 2)
-      elif layer_class in (prob_layers_lib.Conv3DReparameterization,
-                           prob_layers_lib.Conv3DFlipout):
-        inputs = random_ops.random_uniform(
-            [batch_size, depth, height, width, channels], seed=seed())
-        kernel_size = (2, 2, 2)
-
-      kernel_shape = kernel_size + (channels, filters)
-      bias_size = (filters,)
-
-      kernel_posterior = MockDistribution(
-          loc=random_ops.random_uniform(
-              kernel_shape, seed=seed()),
-          scale=random_ops.random_uniform(
-              kernel_shape, seed=seed()),
-          result_log_prob=random_ops.random_uniform(
-              kernel_shape, seed=seed()),
-          result_sample=random_ops.random_uniform(
-              kernel_shape, seed=seed()))
-      bias_posterior = MockDistribution(
-          loc=random_ops.random_uniform(
-              bias_size, seed=seed()),
-          scale=random_ops.random_uniform(
-              bias_size, seed=seed()),
-          result_log_prob=random_ops.random_uniform(
-              bias_size, seed=seed()),
-          result_sample=random_ops.random_uniform(
-              bias_size, seed=seed()))
-      layer_one = layer_class(
-          filters=filters,
-          kernel_size=kernel_size,
-          padding="SAME",
-          kernel_posterior_fn=lambda *args: kernel_posterior,
-          kernel_posterior_tensor_fn=lambda d: d.sample(seed=42),
-          bias_posterior_fn=lambda *args: bias_posterior,
-          bias_posterior_tensor_fn=lambda d: d.sample(seed=43),
-          seed=44)
-      layer_two = layer_class(
-          filters=filters,
-          kernel_size=kernel_size,
-          padding="SAME",
-          kernel_posterior_fn=lambda *args: kernel_posterior,
-          kernel_posterior_tensor_fn=lambda d: d.sample(seed=42),
-          bias_posterior_fn=lambda *args: bias_posterior,
-          bias_posterior_tensor_fn=lambda d: d.sample(seed=43),
-          seed=45)
-
-      outputs_one = layer_one(inputs)
-      outputs_two = layer_two(inputs)
-
-      outputs_one_, outputs_two_ = sess.run([
-          outputs_one, outputs_two])
-
-      self.assertLess(np.sum(np.isclose(outputs_one_, outputs_two_)),
-                      np.prod(outputs_one_.shape))
-
-  def testKLPenaltyKernelConv1DReparameterization(self):
-    self._testKLPenaltyKernel(prob_layers_lib.Conv1DReparameterization)
-
-  def testKLPenaltyKernelConv2DReparameterization(self):
-    self._testKLPenaltyKernel(prob_layers_lib.Conv2DReparameterization)
-
-  def testKLPenaltyKernelConv3DReparameterization(self):
-    self._testKLPenaltyKernel(prob_layers_lib.Conv3DReparameterization)
-
-  def testKLPenaltyKernelConv1DFlipout(self):
-    self._testKLPenaltyKernel(prob_layers_lib.Conv1DFlipout)
-
-  def testKLPenaltyKernelConv2DFlipout(self):
-    self._testKLPenaltyKernel(prob_layers_lib.Conv2DFlipout)
-
-  def testKLPenaltyKernelConv3DFlipout(self):
-    self._testKLPenaltyKernel(prob_layers_lib.Conv3DFlipout)
-
-  def testKLPenaltyBothConv1DReparameterization(self):
-    self._testKLPenaltyBoth(prob_layers_lib.Conv1DReparameterization)
-
-  def testKLPenaltyBothConv2DReparameterization(self):
-    self._testKLPenaltyBoth(prob_layers_lib.Conv2DReparameterization)
-
-  def testKLPenaltyBothConv3DReparameterization(self):
-    self._testKLPenaltyBoth(prob_layers_lib.Conv3DReparameterization)
-
-  def testKLPenaltyBothConv1DFlipout(self):
-    self._testKLPenaltyBoth(prob_layers_lib.Conv1DFlipout)
-
-  def testKLPenaltyBothConv2DFlipout(self):
-    self._testKLPenaltyBoth(prob_layers_lib.Conv2DFlipout)
-
-  def testKLPenaltyBothConv3DFlipout(self):
-    self._testKLPenaltyBoth(prob_layers_lib.Conv3DFlipout)
-
-  def testConv1DReparameterization(self):
-    self._testConvReparameterization(prob_layers_lib.Conv1DReparameterization)
-
-  def testConv2DReparameterization(self):
-    self._testConvReparameterization(prob_layers_lib.Conv2DReparameterization)
-
-  def testConv3DReparameterization(self):
-    self._testConvReparameterization(prob_layers_lib.Conv3DReparameterization)
-
-  def testConv1DFlipout(self):
-    self._testConvFlipout(prob_layers_lib.Conv1DFlipout)
-
-  def testConv2DFlipout(self):
-    self._testConvFlipout(prob_layers_lib.Conv2DFlipout)
-
-  def testConv3DFlipout(self):
-    self._testConvFlipout(prob_layers_lib.Conv3DFlipout)
-
-  def testRandomConv1DFlipout(self):
-    self._testRandomConvFlipout(prob_layers_lib.Conv1DFlipout)
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/layers_dense_variational_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/layers_dense_variational_test.py
deleted file mode 100644
index 342f38ccec7ec74db1b393d6cdc22300205cc547..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/layers_dense_variational_test.py
+++ /dev/null
@@ -1,443 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for dense Bayesian layers."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.contrib.bayesflow.python.ops import layers_dense_variational as prob_layers_lib
-from tensorflow.contrib.bayesflow.python.ops import layers_util as prob_layers_util
-from tensorflow.contrib.distributions.python.ops import independent as independent_lib
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import random_ops
-from tensorflow.python.ops.distributions import normal as normal_lib
-from tensorflow.python.ops.distributions import util as distribution_util
-from tensorflow.python.platform import test
-
-
-class Counter(object):
-  """Helper class to manage incrementing a counting `int`."""
-
-  def __init__(self):
-    self._value = -1
-
-  @property
-  def value(self):
-    return self._value
-
-  def __call__(self):
-    self._value += 1
-    return self._value
-
-
-class MockDistribution(independent_lib.Independent):
-  """Monitors layer calls to the underlying distribution."""
-
-  def __init__(self, result_sample, result_log_prob, loc=None, scale=None):
-    self.result_sample = result_sample
-    self.result_log_prob = result_log_prob
-    self.result_loc = loc
-    self.result_scale = scale
-    self.result_distribution = normal_lib.Normal(loc=0.0, scale=1.0)
-    if loc is not None and scale is not None:
-      self.result_distribution = normal_lib.Normal(loc=self.result_loc,
-                                                   scale=self.result_scale)
-    self.called_log_prob = Counter()
-    self.called_sample = Counter()
-    self.called_loc = Counter()
-    self.called_scale = Counter()
-
-  def log_prob(self, *args, **kwargs):
-    self.called_log_prob()
-    return self.result_log_prob
-
-  def sample(self, *args, **kwargs):
-    self.called_sample()
-    return self.result_sample
-
-  @property
-  def distribution(self):  # for dummy check on Independent(Normal)
-    return self.result_distribution
-
-  @property
-  def loc(self):
-    self.called_loc()
-    return self.result_loc
-
-  @property
-  def scale(self):
-    self.called_scale()
-    return self.result_scale
-
-
-class MockKLDivergence(object):
-  """Monitors layer calls to the divergence implementation."""
-
-  def __init__(self, result):
-    self.result = result
-    self.args = []
-    self.called = Counter()
-
-  def __call__(self, *args, **kwargs):
-    self.called()
-    self.args.append(args)
-    return self.result
-
-
-class DenseVariational(test.TestCase):
-
-  def _testKLPenaltyKernel(self, layer_class):
-    with self.test_session():
-      layer = layer_class(units=2)
-      inputs = random_ops.random_uniform([2, 3], seed=1)
-
-      # No keys.
-      losses = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
-      self.assertEqual(len(losses), 0)
-      self.assertListEqual(layer.losses, losses)
-
-      _ = layer(inputs)
-
-      # Yes keys.
-      losses = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
-      self.assertEqual(len(losses), 1)
-      self.assertListEqual(layer.losses, losses)
-
-  def _testKLPenaltyBoth(self, layer_class):
-    def _make_normal(dtype, *args):  # pylint: disable=unused-argument
-      return normal_lib.Normal(
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.))
-    with self.test_session():
-      layer = layer_class(
-          units=2,
-          bias_posterior_fn=prob_layers_util.default_mean_field_normal_fn(),
-          bias_prior_fn=_make_normal)
-      inputs = random_ops.random_uniform([2, 3], seed=1)
-
-      # No keys.
-      losses = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
-      self.assertEqual(len(losses), 0)
-      self.assertListEqual(layer.losses, losses)
-
-      _ = layer(inputs)
-
-      # Yes keys.
-      losses = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
-      self.assertEqual(len(losses), 2)
-      self.assertListEqual(layer.losses, losses)
-
-  def _testDenseSetUp(self, layer_class, batch_size, in_size, out_size,
-                      **kwargs):
-    seed = Counter()
-    inputs = random_ops.random_uniform([batch_size, in_size], seed=seed())
-
-    kernel_size = [in_size, out_size]
-    kernel_posterior = MockDistribution(
-        loc=random_ops.random_uniform(kernel_size, seed=seed()),
-        scale=random_ops.random_uniform(kernel_size, seed=seed()),
-        result_log_prob=random_ops.random_uniform(kernel_size, seed=seed()),
-        result_sample=random_ops.random_uniform(kernel_size, seed=seed()))
-    kernel_prior = MockDistribution(
-        result_log_prob=random_ops.random_uniform(kernel_size, seed=seed()),
-        result_sample=random_ops.random_uniform(kernel_size, seed=seed()))
-    kernel_divergence = MockKLDivergence(
-        result=random_ops.random_uniform(kernel_size, seed=seed()))
-
-    bias_size = [out_size]
-    bias_posterior = MockDistribution(
-        result_log_prob=random_ops.random_uniform(bias_size, seed=seed()),
-        result_sample=random_ops.random_uniform(bias_size, seed=seed()))
-    bias_prior = MockDistribution(
-        result_log_prob=random_ops.random_uniform(bias_size, seed=seed()),
-        result_sample=random_ops.random_uniform(bias_size, seed=seed()))
-    bias_divergence = MockKLDivergence(
-        result=random_ops.random_uniform(bias_size, seed=seed()))
-
-    layer = layer_class(
-        units=out_size,
-        kernel_posterior_fn=lambda *args: kernel_posterior,
-        kernel_posterior_tensor_fn=lambda d: d.sample(seed=42),
-        kernel_prior_fn=lambda *args: kernel_prior,
-        kernel_divergence_fn=kernel_divergence,
-        bias_posterior_fn=lambda *args: bias_posterior,
-        bias_posterior_tensor_fn=lambda d: d.sample(seed=43),
-        bias_prior_fn=lambda *args: bias_prior,
-        bias_divergence_fn=bias_divergence,
-        **kwargs)
-
-    outputs = layer(inputs)
-
-    kl_penalty = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
-    return (kernel_posterior, kernel_prior, kernel_divergence,
-            bias_posterior, bias_prior, bias_divergence,
-            layer, inputs, outputs, kl_penalty)
-
-  def testKLPenaltyKernelReparameterization(self):
-    self._testKLPenaltyKernel(prob_layers_lib.DenseReparameterization)
-
-  def testKLPenaltyKernelLocalReparameterization(self):
-    self._testKLPenaltyKernel(prob_layers_lib.DenseLocalReparameterization)
-
-  def testKLPenaltyKernelFlipout(self):
-    self._testKLPenaltyKernel(prob_layers_lib.DenseFlipout)
-
-  def testKLPenaltyBothReparameterization(self):
-    self._testKLPenaltyBoth(prob_layers_lib.DenseReparameterization)
-
-  def testKLPenaltyBothLocalReparameterization(self):
-    self._testKLPenaltyBoth(prob_layers_lib.DenseLocalReparameterization)
-
-  def testKLPenaltyBothFlipout(self):
-    self._testKLPenaltyBoth(prob_layers_lib.DenseFlipout)
-
-  def testDenseReparameterization(self):
-    batch_size, in_size, out_size = 2, 3, 4
-    with self.test_session() as sess:
-      (kernel_posterior, kernel_prior, kernel_divergence,
-       bias_posterior, bias_prior, bias_divergence, layer, inputs,
-       outputs, kl_penalty) = self._testDenseSetUp(
-           prob_layers_lib.DenseReparameterization,
-           batch_size, in_size, out_size)
-
-      expected_outputs = (
-          math_ops.matmul(inputs, kernel_posterior.result_sample) +
-          bias_posterior.result_sample)
-
-      [
-          expected_outputs_, actual_outputs_,
-          expected_kernel_, actual_kernel_,
-          expected_kernel_divergence_, actual_kernel_divergence_,
-          expected_bias_, actual_bias_,
-          expected_bias_divergence_, actual_bias_divergence_,
-      ] = sess.run([
-          expected_outputs, outputs,
-          kernel_posterior.result_sample, layer.kernel_posterior_tensor,
-          kernel_divergence.result, kl_penalty[0],
-          bias_posterior.result_sample, layer.bias_posterior_tensor,
-          bias_divergence.result, kl_penalty[1],
-      ])
-
-      self.assertAllClose(
-          expected_kernel_, actual_kernel_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_bias_, actual_bias_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_outputs_, actual_outputs_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_kernel_divergence_, actual_kernel_divergence_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_bias_divergence_, actual_bias_divergence_,
-          rtol=1e-6, atol=0.)
-
-      self.assertAllEqual(
-          [[kernel_posterior.distribution,
-            kernel_prior.distribution,
-            kernel_posterior.result_sample]],
-          kernel_divergence.args)
-
-      self.assertAllEqual(
-          [[bias_posterior.distribution,
-            bias_prior.distribution,
-            bias_posterior.result_sample]],
-          bias_divergence.args)
-
-  def testDenseLocalReparameterization(self):
-    batch_size, in_size, out_size = 2, 3, 4
-    with self.test_session() as sess:
-      (kernel_posterior, kernel_prior, kernel_divergence,
-       bias_posterior, bias_prior, bias_divergence, layer, inputs,
-       outputs, kl_penalty) = self._testDenseSetUp(
-           prob_layers_lib.DenseLocalReparameterization,
-           batch_size, in_size, out_size)
-
-      expected_kernel_posterior_affine = normal_lib.Normal(
-          loc=math_ops.matmul(inputs, kernel_posterior.result_loc),
-          scale=math_ops.matmul(
-              inputs**2., kernel_posterior.result_scale**2)**0.5)
-      expected_kernel_posterior_affine_tensor = (
-          expected_kernel_posterior_affine.sample(seed=42))
-      expected_outputs = (expected_kernel_posterior_affine_tensor +
-                          bias_posterior.result_sample)
-
-      [
-          expected_outputs_, actual_outputs_,
-          expected_kernel_divergence_, actual_kernel_divergence_,
-          expected_bias_, actual_bias_,
-          expected_bias_divergence_, actual_bias_divergence_,
-      ] = sess.run([
-          expected_outputs, outputs,
-          kernel_divergence.result, kl_penalty[0],
-          bias_posterior.result_sample, layer.bias_posterior_tensor,
-          bias_divergence.result, kl_penalty[1],
-      ])
-
-      self.assertAllClose(
-          expected_bias_, actual_bias_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_outputs_, actual_outputs_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_kernel_divergence_, actual_kernel_divergence_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_bias_divergence_, actual_bias_divergence_,
-          rtol=1e-6, atol=0.)
-
-      self.assertAllEqual(
-          [[kernel_posterior.distribution,
-            kernel_prior.distribution,
-            None]],
-          kernel_divergence.args)
-
-      self.assertAllEqual(
-          [[bias_posterior.distribution,
-            bias_prior.distribution,
-            bias_posterior.result_sample]],
-          bias_divergence.args)
-
-  def testDenseFlipout(self):
-    batch_size, in_size, out_size = 2, 3, 4
-    with self.test_session() as sess:
-      (kernel_posterior, kernel_prior, kernel_divergence,
-       bias_posterior, bias_prior, bias_divergence, layer, inputs,
-       outputs, kl_penalty) = self._testDenseSetUp(
-           prob_layers_lib.DenseFlipout,
-           batch_size, in_size, out_size, seed=44)
-
-      expected_kernel_posterior_affine = normal_lib.Normal(
-          loc=array_ops.zeros_like(kernel_posterior.result_loc),
-          scale=kernel_posterior.result_scale)
-      expected_kernel_posterior_affine_tensor = (
-          expected_kernel_posterior_affine.sample(seed=42))
-
-      sign_input = random_ops.random_uniform(
-          [batch_size, in_size],
-          minval=0,
-          maxval=2,
-          dtype=dtypes.int32,
-          seed=layer.seed)
-      sign_input = math_ops.cast(2 * sign_input - 1, inputs.dtype)
-      sign_output = random_ops.random_uniform(
-          [batch_size, out_size],
-          minval=0,
-          maxval=2,
-          dtype=dtypes.int32,
-          seed=distribution_util.gen_new_seed(
-              layer.seed, salt="dense_flipout"))
-      sign_output = math_ops.cast(2 * sign_output - 1, inputs.dtype)
-      perturbed_inputs = math_ops.matmul(
-          inputs * sign_input, expected_kernel_posterior_affine_tensor)
-      perturbed_inputs *= sign_output
-
-      expected_outputs = math_ops.matmul(inputs, kernel_posterior.result_loc)
-      expected_outputs += perturbed_inputs
-      expected_outputs += bias_posterior.result_sample
-
-      [
-          expected_outputs_, actual_outputs_,
-          expected_kernel_divergence_, actual_kernel_divergence_,
-          expected_bias_, actual_bias_,
-          expected_bias_divergence_, actual_bias_divergence_,
-      ] = sess.run([
-          expected_outputs, outputs,
-          kernel_divergence.result, kl_penalty[0],
-          bias_posterior.result_sample, layer.bias_posterior_tensor,
-          bias_divergence.result, kl_penalty[1],
-      ])
-
-      self.assertAllClose(
-          expected_bias_, actual_bias_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_outputs_, actual_outputs_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_kernel_divergence_, actual_kernel_divergence_,
-          rtol=1e-6, atol=0.)
-      self.assertAllClose(
-          expected_bias_divergence_, actual_bias_divergence_,
-          rtol=1e-6, atol=0.)
-
-      self.assertAllEqual(
-          [[kernel_posterior.distribution, kernel_prior.distribution, None]],
-          kernel_divergence.args)
-
-      self.assertAllEqual(
-          [[bias_posterior.distribution,
-            bias_prior.distribution,
-            bias_posterior.result_sample]],
-          bias_divergence.args)
-
-  def testRandomDenseFlipout(self):
-    batch_size, in_size, out_size = 2, 3, 4
-    with self.test_session() as sess:
-      seed = Counter()
-      inputs = random_ops.random_uniform([batch_size, in_size], seed=seed())
-
-      kernel_posterior = MockDistribution(
-          loc=random_ops.random_uniform(
-              [in_size, out_size], seed=seed()),
-          scale=random_ops.random_uniform(
-              [in_size, out_size], seed=seed()),
-          result_log_prob=random_ops.random_uniform(
-              [in_size, out_size], seed=seed()),
-          result_sample=random_ops.random_uniform(
-              [in_size, out_size], seed=seed()))
-      bias_posterior = MockDistribution(
-          loc=random_ops.random_uniform(
-              [out_size], seed=seed()),
-          scale=random_ops.random_uniform(
-              [out_size], seed=seed()),
-          result_log_prob=random_ops.random_uniform(
-              [out_size], seed=seed()),
-          result_sample=random_ops.random_uniform(
-              [out_size], seed=seed()))
-      layer_one = prob_layers_lib.DenseFlipout(
-          units=out_size,
-          kernel_posterior_fn=lambda *args: kernel_posterior,
-          kernel_posterior_tensor_fn=lambda d: d.sample(seed=42),
-          bias_posterior_fn=lambda *args: bias_posterior,
-          bias_posterior_tensor_fn=lambda d: d.sample(seed=43),
-          seed=44)
-      layer_two = prob_layers_lib.DenseFlipout(
-          units=out_size,
-          kernel_posterior_fn=lambda *args: kernel_posterior,
-          kernel_posterior_tensor_fn=lambda d: d.sample(seed=42),
-          bias_posterior_fn=lambda *args: bias_posterior,
-          bias_posterior_tensor_fn=lambda d: d.sample(seed=43),
-          seed=45)
-
-      outputs_one = layer_one(inputs)
-      outputs_two = layer_two(inputs)
-
-      outputs_one_, outputs_two_ = sess.run([
-          outputs_one, outputs_two])
-
-      self.assertLess(np.sum(np.isclose(outputs_one_, outputs_two_)), out_size)
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/mcmc_diagnostics_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/mcmc_diagnostics_test.py
deleted file mode 100644
index 52e36e135d95c1ec919c710f35d59073c2134d05..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/mcmc_diagnostics_test.py
+++ /dev/null
@@ -1,445 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for MCMC diagnostic utilities."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.contrib.bayesflow.python.ops import mcmc_diagnostics_impl as mcmc_diagnostics
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import spectral_ops_test_util
-from tensorflow.python.platform import test
-
-rng = np.random.RandomState(42)
-
-
-class _EffectiveSampleSizeTest(object):
-
-  @property
-  def use_static_shape(self):
-    raise NotImplementedError(
-        "Subclass failed to implement `use_static_shape`.")
-
-  def _check_versus_expected_effective_sample_size(self,
-                                                   x_,
-                                                   expected_ess,
-                                                   sess,
-                                                   atol=1e-2,
-                                                   rtol=1e-2,
-                                                   filter_threshold=None,
-                                                   filter_beyond_lag=None):
-    x = array_ops.placeholder_with_default(
-        input=x_, shape=x_.shape if self.use_static_shape else None)
-    ess = mcmc_diagnostics.effective_sample_size(
-        x,
-        filter_threshold=filter_threshold,
-        filter_beyond_lag=filter_beyond_lag)
-    if self.use_static_shape:
-      self.assertAllEqual(x.shape[1:], ess.shape)
-
-    ess_ = sess.run(ess)
-
-    self.assertAllClose(
-        np.ones_like(ess_) * expected_ess, ess_, atol=atol, rtol=rtol)
-
-  def testIidRank1NormalHasFullEssMaxLags10(self):
-    # With a length 5000 iid normal sequence, and filter_beyond_lag = 10, we
-    # should have a good estimate of ESS, and it should be close to the full
-    # sequence length of 5000.
-    # The choice of filter_beyond_lag = 10 is a short cutoff, reasonable only
-    # since we know the correlation length should be zero right away.
-    with self.test_session() as sess:
-      with spectral_ops_test_util.fft_kernel_label_map():
-        self._check_versus_expected_effective_sample_size(
-            x_=rng.randn(5000).astype(np.float32),
-            expected_ess=5000,
-            sess=sess,
-            filter_beyond_lag=10,
-            filter_threshold=None,
-            rtol=0.3)
-
-  def testIidRank2NormalHasFullEssMaxLags10(self):
-    # See similar test for Rank1Normal for reasoning.
-    with self.test_session() as sess:
-      with spectral_ops_test_util.fft_kernel_label_map():
-        self._check_versus_expected_effective_sample_size(
-            x_=rng.randn(5000, 2).astype(np.float32),
-            expected_ess=5000,
-            sess=sess,
-            filter_beyond_lag=10,
-            filter_threshold=None,
-            rtol=0.3)
-
-  def testIidRank1NormalHasFullEssMaxLagThresholdZero(self):
-    # With a length 5000 iid normal sequence, and filter_threshold = 0,
-    # we should have a super-duper estimate of ESS, and it should be very close
-    # to the full sequence length of 5000.
-    # The choice of filter_threshold = 0 means we cut off as soon as the
-    # auto-corr is below zero.  This should happen very quickly, due to the fact
-    # that the theoretical auto-corr is [1, 0, 0,...]
-    with self.test_session() as sess:
-      with spectral_ops_test_util.fft_kernel_label_map():
-        self._check_versus_expected_effective_sample_size(
-            x_=rng.randn(5000).astype(np.float32),
-            expected_ess=5000,
-            sess=sess,
-            filter_beyond_lag=None,
-            filter_threshold=0.,
-            rtol=0.1)
-
-  def testIidRank2NormalHasFullEssMaxLagThresholdZero(self):
-    # See similar test for Rank1Normal for reasoning.
-    with self.test_session() as sess:
-      with spectral_ops_test_util.fft_kernel_label_map():
-        self._check_versus_expected_effective_sample_size(
-            x_=rng.randn(5000, 2).astype(np.float32),
-            expected_ess=5000,
-            sess=sess,
-            filter_beyond_lag=None,
-            filter_threshold=0.,
-            rtol=0.1)
-
-  def testLength10CorrelationHasEssOneTenthTotalLengthUsingMaxLags50(self):
-    # Create x_, such that
-    #   x_[i] = iid_x_[0], i = 0,...,9
-    #   x_[i] = iid_x_[1], i = 10,..., 19,
-    #   and so on.
-    iid_x_ = rng.randn(5000, 1).astype(np.float32)
-    x_ = (iid_x_ * np.ones((5000, 10)).astype(np.float32)).reshape((50000,))
-    with self.test_session() as sess:
-      with spectral_ops_test_util.fft_kernel_label_map():
-        self._check_versus_expected_effective_sample_size(
-            x_=x_,
-            expected_ess=50000 // 10,
-            sess=sess,
-            filter_beyond_lag=50,
-            filter_threshold=None,
-            rtol=0.2)
-
-  def testLength10CorrelationHasEssOneTenthTotalLengthUsingMaxLagsThresholdZero(
-      self):
-    # Create x_, such that
-    #   x_[i] = iid_x_[0], i = 0,...,9
-    #   x_[i] = iid_x_[1], i = 10,..., 19,
-    #   and so on.
-    iid_x_ = rng.randn(5000, 1).astype(np.float32)
-    x_ = (iid_x_ * np.ones((5000, 10)).astype(np.float32)).reshape((50000,))
-    with self.test_session() as sess:
-      with spectral_ops_test_util.fft_kernel_label_map():
-        self._check_versus_expected_effective_sample_size(
-            x_=x_,
-            expected_ess=50000 // 10,
-            sess=sess,
-            filter_beyond_lag=None,
-            filter_threshold=0.,
-            rtol=0.1)
-
-  def testListArgs(self):
-    # x_ has correlation length 10 ==> ESS = N / 10
-    # y_ has correlation length 1  ==> ESS = N
-    iid_x_ = rng.randn(5000, 1).astype(np.float32)
-    x_ = (iid_x_ * np.ones((5000, 10)).astype(np.float32)).reshape((50000,))
-    y_ = rng.randn(50000).astype(np.float32)
-    states = [x_, x_, y_, y_]
-    filter_threshold = [0., None, 0., None]
-    filter_beyond_lag = [None, 5, None, 5]
-
-    # See other tests for reasoning on tolerance.
-    with self.test_session() as sess:
-      with spectral_ops_test_util.fft_kernel_label_map():
-        ess = mcmc_diagnostics.effective_sample_size(
-            states,
-            filter_threshold=filter_threshold,
-            filter_beyond_lag=filter_beyond_lag)
-        ess_ = sess.run(ess)
-    self.assertAllEqual(4, len(ess_))
-
-    self.assertAllClose(50000 // 10, ess_[0], rtol=0.3)
-    self.assertAllClose(50000 // 10, ess_[1], rtol=0.3)
-    self.assertAllClose(50000, ess_[2], rtol=0.1)
-    self.assertAllClose(50000, ess_[3], rtol=0.1)
-
-  def testMaxLagsThresholdLessThanNeg1SameAsNone(self):
-    # Setting both means we filter out items R_k from the auto-correlation
-    # sequence if k > filter_beyond_lag OR k >= j where R_j < filter_threshold.
-
-    # x_ has correlation length 10.
-    iid_x_ = rng.randn(500, 1).astype(np.float32)
-    x_ = (iid_x_ * np.ones((500, 10)).astype(np.float32)).reshape((5000,))
-    with self.test_session() as sess:
-      with spectral_ops_test_util.fft_kernel_label_map():
-        x = array_ops.placeholder_with_default(
-            input=x_, shape=x_.shape if self.use_static_shape else None)
-
-        ess_none_none = mcmc_diagnostics.effective_sample_size(
-            x, filter_threshold=None, filter_beyond_lag=None)
-        ess_none_200 = mcmc_diagnostics.effective_sample_size(
-            x, filter_threshold=None, filter_beyond_lag=200)
-        ess_neg2_200 = mcmc_diagnostics.effective_sample_size(
-            x, filter_threshold=-2., filter_beyond_lag=200)
-        ess_neg2_none = mcmc_diagnostics.effective_sample_size(
-            x, filter_threshold=-2., filter_beyond_lag=None)
-        ess_none_none_, ess_none_200_, ess_neg2_200_, ess_neg2_none_ = sess.run(
-            [ess_none_none, ess_none_200, ess_neg2_200, ess_neg2_none])
-
-        # filter_threshold=-2 <==> filter_threshold=None.
-        self.assertAllClose(ess_none_none_, ess_neg2_none_)
-        self.assertAllClose(ess_none_200_, ess_neg2_200_)
-
-  def testMaxLagsArgsAddInAnOrManner(self):
-    # Setting both means we filter out items R_k from the auto-correlation
-    # sequence if k > filter_beyond_lag OR k >= j where R_j < filter_threshold.
-
-    # x_ has correlation length 10.
-    iid_x_ = rng.randn(500, 1).astype(np.float32)
-    x_ = (iid_x_ * np.ones((500, 10)).astype(np.float32)).reshape((5000,))
-    with self.test_session() as sess:
-      with spectral_ops_test_util.fft_kernel_label_map():
-        x = array_ops.placeholder_with_default(
-            input=x_, shape=x_.shape if self.use_static_shape else None)
-
-        ess_1_9 = mcmc_diagnostics.effective_sample_size(
-            x, filter_threshold=1., filter_beyond_lag=9)
-        ess_1_none = mcmc_diagnostics.effective_sample_size(
-            x, filter_threshold=1., filter_beyond_lag=None)
-        ess_none_9 = mcmc_diagnostics.effective_sample_size(
-            x, filter_threshold=1., filter_beyond_lag=9)
-        ess_1_9_, ess_1_none_, ess_none_9_ = sess.run(
-            [ess_1_9, ess_1_none, ess_none_9])
-
-        # Since R_k = 1 for k < 10, and R_k < 1 for k >= 10,
-        # filter_threshold = 1 <==> filter_beyond_lag = 9.
-        self.assertAllClose(ess_1_9_, ess_1_none_)
-        self.assertAllClose(ess_1_9_, ess_none_9_)
-
-
-class EffectiveSampleSizeStaticTest(test.TestCase, _EffectiveSampleSizeTest):
-
-  @property
-  def use_static_shape(self):
-    return True
-
-
-class EffectiveSampleSizeDynamicTest(test.TestCase, _EffectiveSampleSizeTest):
-
-  @property
-  def use_static_shape(self):
-    return False
-
-
-class _PotentialScaleReductionTest(object):
-
-  @property
-  def use_static_shape(self):
-    raise NotImplementedError(
-        "Subclass failed to implement `use_static_shape`.")
-
-  def testListOfStatesWhereFirstPassesSecondFails(self):
-    """Simple test showing API with two states.  Read first!"""
-    n_samples = 1000
-
-    # state_0 is two scalar chains taken from iid Normal(0, 1).  Will pass.
-    state_0 = rng.randn(n_samples, 2)
-
-    # state_1 is three 4-variate chains taken from Normal(0, 1) that have been
-    # shifted.  Since every chain is shifted, they are not the same, and the
-    # test should fail.
-    offset = np.array([1., -1., 2.]).reshape(3, 1)
-    state_1 = rng.randn(n_samples, 3, 4) + offset
-
-    rhat = mcmc_diagnostics.potential_scale_reduction(
-        chains_states=[state_0, state_1], independent_chain_ndims=1)
-
-    self.assertIsInstance(rhat, list)
-    with self.test_session() as sess:
-      rhat_0_, rhat_1_ = sess.run(rhat)
-
-    # r_hat_0 should be close to 1, meaning test is passed.
-    self.assertAllEqual((), rhat_0_.shape)
-    self.assertAllClose(1., rhat_0_, rtol=0.02)
-
-    # r_hat_1 should be greater than 1.2, meaning test has failed.
-    self.assertAllEqual((4,), rhat_1_.shape)
-    self.assertAllEqual(np.ones_like(rhat_1_).astype(bool), rhat_1_ > 1.2)
-
-  def check_results(self, state_, independent_chain_shape, should_pass):
-    sample_ndims = 1
-    independent_chain_ndims = len(independent_chain_shape)
-    with self.test_session():
-      state = array_ops.placeholder_with_default(
-          input=state_, shape=state_.shape if self.use_static_shape else None)
-
-      rhat = mcmc_diagnostics.potential_scale_reduction(
-          state, independent_chain_ndims=independent_chain_ndims)
-
-      if self.use_static_shape:
-        self.assertAllEqual(
-            state_.shape[sample_ndims + independent_chain_ndims:], rhat.shape)
-
-      rhat_ = rhat.eval()
-      if should_pass:
-        self.assertAllClose(np.ones_like(rhat_), rhat_, atol=0, rtol=0.02)
-      else:
-        self.assertAllEqual(np.ones_like(rhat_).astype(bool), rhat_ > 1.2)
-
-  def iid_normal_chains_should_pass_wrapper(self,
-                                            sample_shape,
-                                            independent_chain_shape,
-                                            other_shape,
-                                            dtype=np.float32):
-    """Check results with iid normal chains."""
-
-    state_shape = sample_shape + independent_chain_shape + other_shape
-    state_ = rng.randn(*state_shape).astype(dtype)
-
-    # The "other" dimensions do not have to be identical, just independent, so
-    # force them to not be identical.
-    if other_shape:
-      state_ *= rng.rand(*other_shape).astype(dtype)
-
-    self.check_results(state_, independent_chain_shape, should_pass=True)
-
-  def testPassingIIDNdimsAreIndependentOneOtherZero(self):
-    self.iid_normal_chains_should_pass_wrapper(
-        sample_shape=[10000], independent_chain_shape=[4], other_shape=[])
-
-  def testPassingIIDNdimsAreIndependentOneOtherOne(self):
-    self.iid_normal_chains_should_pass_wrapper(
-        sample_shape=[10000], independent_chain_shape=[3], other_shape=[7])
-
-  def testPassingIIDNdimsAreIndependentOneOtherTwo(self):
-    self.iid_normal_chains_should_pass_wrapper(
-        sample_shape=[10000], independent_chain_shape=[2], other_shape=[5, 7])
-
-  def testPassingIIDNdimsAreIndependentTwoOtherTwo64Bit(self):
-    self.iid_normal_chains_should_pass_wrapper(
-        sample_shape=[10000],
-        independent_chain_shape=[2, 3],
-        other_shape=[5, 7],
-        dtype=np.float64)
-
-  def offset_normal_chains_should_fail_wrapper(
-      self, sample_shape, independent_chain_shape, other_shape):
-    """Check results with normal chains that are offset from each other."""
-
-    state_shape = sample_shape + independent_chain_shape + other_shape
-    state_ = rng.randn(*state_shape)
-
-    # Add a significant offset to the different (formerly iid) chains.
-    offset = np.linspace(
-        0, 2, num=np.prod(independent_chain_shape)).reshape([1] * len(
-            sample_shape) + independent_chain_shape + [1] * len(other_shape))
-    state_ += offset
-
-    self.check_results(state_, independent_chain_shape, should_pass=False)
-
-  def testFailingOffsetNdimsAreSampleOneIndependentOneOtherOne(self):
-    self.offset_normal_chains_should_fail_wrapper(
-        sample_shape=[10000], independent_chain_shape=[2], other_shape=[5])
-
-
-class PotentialScaleReductionStaticTest(test.TestCase,
-                                        _PotentialScaleReductionTest):
-
-  @property
-  def use_static_shape(self):
-    return True
-
-  def testIndependentNdimsLessThanOneRaises(self):
-    with self.assertRaisesRegexp(ValueError, "independent_chain_ndims"):
-      mcmc_diagnostics.potential_scale_reduction(
-          rng.rand(2, 3, 4), independent_chain_ndims=0)
-
-
-class PotentialScaleReductionDynamicTest(test.TestCase,
-                                         _PotentialScaleReductionTest):
-
-  @property
-  def use_static_shape(self):
-    return False
-
-
-class _ReduceVarianceTest(object):
-
-  @property
-  def use_static_shape(self):
-    raise NotImplementedError(
-        "Subclass failed to implement `use_static_shape`.")
-
-  def check_versus_numpy(self, x_, axis, biased, keepdims):
-    with self.test_session():
-      x_ = np.asarray(x_)
-      x = array_ops.placeholder_with_default(
-          input=x_, shape=x_.shape if self.use_static_shape else None)
-      var = mcmc_diagnostics._reduce_variance(
-          x, axis=axis, biased=biased, keepdims=keepdims)
-      np_var = np.var(x_, axis=axis, ddof=0 if biased else 1, keepdims=keepdims)
-
-      if self.use_static_shape:
-        self.assertAllEqual(np_var.shape, var.shape)
-
-      var_ = var.eval()
-      # We will mask below, which changes shape, so check shape explicitly here.
-      self.assertAllEqual(np_var.shape, var_.shape)
-
-      # We get NaN when we divide by zero due to the size being the same as ddof
-      nan_mask = np.isnan(np_var)
-      if nan_mask.any():
-        self.assertTrue(np.isnan(var_[nan_mask]).all())
-      self.assertAllClose(np_var[~nan_mask], var_[~nan_mask], atol=0, rtol=0.02)
-
-  def testScalarBiasedTrue(self):
-    self.check_versus_numpy(x_=-1.234, axis=None, biased=True, keepdims=False)
-
-  def testScalarBiasedFalse(self):
-    # This should result in NaN.
-    self.check_versus_numpy(x_=-1.234, axis=None, biased=False, keepdims=False)
-
-  def testShape2x3x4AxisNoneBiasedTrueKeepdimsFalse(self):
-    self.check_versus_numpy(
-        x_=rng.randn(2, 3, 4), axis=None, biased=True, keepdims=False)
-
-  def testShape2x3x4Axis1BiasedTrueKeepdimsTrue(self):
-    self.check_versus_numpy(
-        x_=rng.randn(2, 3, 4), axis=1, biased=True, keepdims=True)
-
-  def testShape2x3x4x5Axis1BiasedTrueKeepdimsTrue(self):
-    self.check_versus_numpy(
-        x_=rng.randn(2, 3, 4, 5), axis=1, biased=True, keepdims=True)
-
-  def testShape2x3x4x5Axis1BiasedFalseKeepdimsFalse(self):
-    self.check_versus_numpy(
-        x_=rng.randn(2, 3, 4, 5), axis=1, biased=False, keepdims=False)
-
-
-class ReduceVarianceTestStaticShape(test.TestCase, _ReduceVarianceTest):
-
-  @property
-  def use_static_shape(self):
-    return True
-
-
-class ReduceVarianceTestDynamicShape(test.TestCase, _ReduceVarianceTest):
-
-  @property
-  def use_static_shape(self):
-    return False
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/sgld_optimizer_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/sgld_optimizer_test.py
deleted file mode 100644
index 756c25683bd4b0c8c77e9e28485ca2a85582999c..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/sgld_optimizer_test.py
+++ /dev/null
@@ -1,212 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Functional test for SGLDOptimizer."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-import math
-from tensorflow.contrib.bayesflow.python.ops.optimizers import SGLDOptimizer
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import variables
-from tensorflow.python.platform import test
-
-
-class SGLDOptimizerTest(test.TestCase):
-
-  def testBasic(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = variables.Variable([1.1, 2.1], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        decay_rate = 0.53
-        sgd_optimizer = SGLDOptimizer(3.0, preconditioner_decay_rate=decay_rate)
-        sgd_op = sgd_optimizer.apply_gradients(
-            zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([1.1, 2.1], var0.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-        # Validate updated params
-        grads_scaled = (0.5 * 0.1 / math.sqrt(decay_rate +
-                                              (1 - decay_rate) * 0.1**2 + 1e-8))
-        self.assertAllCloseAccordingToType(
-            [1.1 - 3.0 * grads_scaled, 2.1 - 3.0 * grads_scaled], var0.eval())
-        grads_scaled = (0.5 * 0.01 / math.sqrt(
-            decay_rate + (1 - decay_rate) * 0.01**2 + 1e-8))
-        self.assertAllCloseAccordingToType(
-            [3.0 - 3.0 * grads_scaled, 4.0 - 3.0 * grads_scaled], var1.eval())
-        self.assertAllCloseAccordingToType(1, sgd_optimizer._counter.eval())
-
-  def testBasicMultiInstance(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = variables.Variable([1.1, 2.1], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        vara = variables.Variable([1.1, 2.1], dtype=dtype)
-        varb = variables.Variable([3.0, 4.0], dtype=dtype)
-        gradsa = constant_op.constant([0.1, 0.1], dtype=dtype)
-        gradsb = constant_op.constant([0.01, 0.01], dtype=dtype)
-        decay_rate = 0.5
-        sgd_optimizer = SGLDOptimizer(3.0, preconditioner_decay_rate=decay_rate)
-        sgd_op = sgd_optimizer.apply_gradients(
-            zip([grads0, grads1], [var0, var1]))
-        sgd_optimizer2 = SGLDOptimizer(
-            3.0, preconditioner_decay_rate=decay_rate)
-        sgd_op2 = sgd_optimizer2.apply_gradients(
-            zip([gradsa, gradsb], [vara, varb]))
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([1.1, 2.1], var0.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
-        self.assertAllCloseAccordingToType([1.1, 2.1], vara.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], varb.eval())
-
-        # Run 1 step of sgd
-        sgd_op.run()
-        sgd_op2.run()
-        # Validate updated params
-        grads_scaled = (0.5 * 0.1 / math.sqrt(decay_rate +
-                                              (1 - decay_rate) * 0.1**2 + 1e-8))
-        self.assertAllCloseAccordingToType(
-            [1.1 - 3.0 * grads_scaled, 2.1 - 3.0 * grads_scaled], var0.eval())
-        self.assertAllCloseAccordingToType(
-            [1.1 - 3.0 * grads_scaled, 2.1 - 3.0 * grads_scaled], vara.eval())
-
-        grads_scaled = (0.5 * 0.01 / math.sqrt(
-            decay_rate + (1 - decay_rate) * 0.01**2 + 1e-8))
-        self.assertAllCloseAccordingToType(
-            [3.0 - 3.0 * grads_scaled, 4.0 - 3.0 * grads_scaled], var1.eval())
-        self.assertAllCloseAccordingToType(
-            [3.0 - 3.0 * grads_scaled, 4.0 - 3.0 * grads_scaled], varb.eval())
-        self.assertNotEqual(sgd_optimizer.variable_scope,
-                            sgd_optimizer2.variable_scope)
-        self.assertNotEqual(sgd_optimizer.variable_scope.name,
-                            sgd_optimizer2.variable_scope.name)
-        self.assertAllCloseAccordingToType(1, sgd_optimizer._counter.eval())
-        self.assertAllCloseAccordingToType(1, sgd_optimizer2._counter.eval())
-
-  def testTensorLearningRate(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = variables.Variable([1.1, 2.1], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        lrate = constant_op.constant(3.0)
-        decay_rate = 0.5
-        sgd_op = SGLDOptimizer(
-            lrate, preconditioner_decay_rate=constant_op.constant(
-                decay_rate)).apply_gradients(
-                    zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([1.1, 2.1], var0.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-        # Validate updated params
-        grads_scaled = (0.5 * 0.1 / math.sqrt(decay_rate +
-                                              (1 - decay_rate) * 0.1**2 + 1e-8))
-        self.assertAllCloseAccordingToType(
-            [1.1 - 3.0 * grads_scaled, 2.1 - 3.0 * grads_scaled], var0.eval())
-        grads_scaled = (0.5 * 0.01 / math.sqrt(
-            decay_rate + (1 - decay_rate) * 0.01**2 + 1e-8))
-        self.assertAllCloseAccordingToType(
-            [3.0 - 3.0 * grads_scaled, 4.0 - 3.0 * grads_scaled], var1.eval())
-
-  def testGradWrtRef(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        opt = SGLDOptimizer(3.0)
-        values = [1.0, 3.0]
-        vars_ = [variables.Variable([v], dtype=dtype) for v in values]
-        grads_and_vars = opt.compute_gradients(vars_[0] + vars_[1], vars_)
-        variables.global_variables_initializer().run()
-        for grad, _ in grads_and_vars:
-          self.assertAllCloseAccordingToType([1.0], grad.eval())
-
-  def testWithGlobalStep(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        global_step = variables.Variable(0, trainable=False)
-        var0 = variables.Variable([1.1, 2.1], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        decay_rate = 0.1
-        sgd_op = SGLDOptimizer(
-            3.0, preconditioner_decay_rate=decay_rate).apply_gradients(
-                zip([grads0, grads1], [var0, var1]), global_step=global_step)
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([1.1, 2.1], var0.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-
-        # Validate updated params and global_step
-        grads_scaled = (0.5 * 0.1 / math.sqrt(decay_rate +
-                                              (1 - decay_rate) * 0.1**2 + 1e-8))
-        self.assertAllCloseAccordingToType(
-            [1.1 - 3.0 * grads_scaled, 2.1 - 3.0 * grads_scaled], var0.eval())
-        grads_scaled = (0.5 * 0.01 / math.sqrt(
-            decay_rate + (1 - decay_rate) * 0.01**2 + 1e-8))
-        self.assertAllCloseAccordingToType(
-            [3.0 - 3.0 * grads_scaled, 4.0 - 3.0 * grads_scaled], var1.eval())
-        self.assertAllCloseAccordingToType(1, global_step.eval())
-
-  def testSparseBasic(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = variables.Variable([[1.1], [2.1]], dtype=dtype)
-        var1 = variables.Variable([[3.0], [4.0]], dtype=dtype)
-        grads0 = ops.IndexedSlices(
-            constant_op.constant([0.1], shape=[1, 1], dtype=dtype),
-            constant_op.constant([0]), constant_op.constant([2, 1]))
-        grads1 = ops.IndexedSlices(
-            constant_op.constant([0.01], shape=[1, 1], dtype=dtype),
-            constant_op.constant([1]), constant_op.constant([2, 1]))
-        decay_rate = 0.9
-        sgd_op = SGLDOptimizer(
-            3.0, preconditioner_decay_rate=decay_rate).apply_gradients(
-                zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([[1.1], [2.1]], var0.eval())
-        self.assertAllCloseAccordingToType([[3.0], [4.0]], var1.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-        # Validate updated params
-        grads_scaled = (0.5 * 0.1 / math.sqrt(decay_rate +
-                                              (1 - decay_rate) * 0.1**2 + 1e-8))
-        self.assertAllCloseAccordingToType([[1.1 - 3.0 * grads_scaled], [2.1]],
-                                           var0.eval())
-        grads_scaled = (0.5 * 0.01 / math.sqrt(
-            decay_rate + (1 - decay_rate) * 0.01**2 + 1e-8))
-        self.assertAllCloseAccordingToType(
-            [[3.0 - 3.0 * 0], [4.0 - 3.0 * grads_scaled]], var1.eval())
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/variable_utils_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/variable_utils_test.py
deleted file mode 100644
index f978cf86417dc5ff5412a3eee584330a266e0964..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/variable_utils_test.py
+++ /dev/null
@@ -1,135 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for utility functions related to managing `tf.Variable`s."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import warnings
-
-import numpy as np
-
-from tensorflow.contrib.bayesflow.python.ops import variable_utils
-
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import variable_scope as varscope_ops
-from tensorflow.python.ops import variables as variables_ops
-from tensorflow.python.platform import test
-
-
-def test_fn(x):
-  x = ops.convert_to_tensor(x, name="x")
-  dtype = x.dtype.as_numpy_dtype
-  s = x.shape.as_list()
-  z = varscope_ops.get_variable(
-      name="z",
-      dtype=dtype,
-      initializer=np.arange(np.prod(s)).reshape(s).astype(dtype))
-  y = varscope_ops.get_variable(
-      name="y",
-      dtype=dtype,
-      initializer=np.arange(np.prod(s)).reshape(s).astype(dtype)**2)
-  return x + y + z
-
-
-class _WrapCallableTest(object):
-
-  def testDefaultArgsWorkCorrectly(self):
-    with self.test_session():
-      x = constant_op.constant(self.dtype([0.1, 0.2]))
-      wrapped_fn, vars_args = variable_utils.externalize_variables_as_args(
-          test_fn, [x])
-
-      varscope_ops.get_variable_scope().reuse_variables()
-
-      result = wrapped_fn(self.dtype(2), [3, 4, 5], 0.5)
-
-      y_actual = varscope_ops.get_variable("y", dtype=self.dtype)
-      z_actual = varscope_ops.get_variable("z", dtype=self.dtype)
-
-      variables_ops.global_variables_initializer().run()
-      result_ = result.eval()
-
-      self.assertEqual(self.dtype, result_.dtype)
-      self.assertAllEqual([5.5, 6.5, 7.5], result_)
-      self.assertAllEqual([y_actual, z_actual], vars_args)
-
-  def testNonDefaultArgsWorkCorrectly(self):
-    with self.test_session():
-      x = constant_op.constant(self.dtype([0.1, 0.2]))
-
-      _ = test_fn(self.dtype([0., 0.]))   # Needed to create vars.
-      varscope_ops.get_variable_scope().reuse_variables()
-
-      y_actual = varscope_ops.get_variable("y", dtype=self.dtype)
-
-      wrapped_fn, vars_args = variable_utils.externalize_variables_as_args(
-          test_fn, [x], possible_ancestor_vars=[y_actual])
-
-      result = wrapped_fn(self.dtype([2, 3]), 0.5)  # x, y
-
-      variables_ops.global_variables_initializer().run()
-      result_ = result.eval()
-
-      self.assertEqual(self.dtype, result_.dtype)
-      self.assertAllEqual([2.5, 4.5], result_)
-      self.assertAllEqual([y_actual], vars_args)
-
-  def testWarnings(self):
-    with self.test_session():
-      x = constant_op.constant(self.dtype([0.1, 0.2]))
-      wrapped_fn, _ = variable_utils.externalize_variables_as_args(
-          test_fn, [x], possible_ancestor_vars=[])
-      varscope_ops.get_variable_scope().reuse_variables()
-      with warnings.catch_warnings(record=True) as w:
-        wrapped_fn(self.dtype(2))
-      w = sorted(w, key=lambda w: str(w.message))
-      self.assertEqual(2, len(w))
-      self.assertRegexpMatches(
-          str(w[0].message),
-          r"Variable .* 'y:0' .* not found in bypass dict.")
-      self.assertRegexpMatches(
-          str(w[1].message),
-          r"Variable .* 'z:0' .* not found in bypass dict.")
-
-  def testExceptions(self):
-    with self.test_session():
-      x = constant_op.constant(self.dtype([0.1, 0.2]))
-      wrapped_fn, _ = variable_utils.externalize_variables_as_args(
-          test_fn,
-          [x],
-          possible_ancestor_vars=[],
-          assert_variable_override=True)
-      varscope_ops.get_variable_scope().reuse_variables()
-      with self.assertRaisesRegexp(ValueError, r"not found"):
-        wrapped_fn(self.dtype(2))
-
-
-class WrapCallableTest16(test.TestCase, _WrapCallableTest):
-  dtype = np.float16
-
-
-class WrapCallableTest32(test.TestCase, _WrapCallableTest):
-  dtype = np.float32
-
-
-class WrapCallableTest64(test.TestCase, _WrapCallableTest):
-  dtype = np.float64
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/variational_sgd_optimizer_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/variational_sgd_optimizer_test.py
deleted file mode 100644
index 83c64dbe0fd586edcb784a5c09a4c133aaa99cff..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/variational_sgd_optimizer_test.py
+++ /dev/null
@@ -1,268 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Functional test for VariationalSGDOptimizer."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from tensorflow.contrib.bayesflow.python.ops.optimizers import VariationalSGDOptimizer
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import errors
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import variables
-from tensorflow.python.platform import test
-
-
-class VariationalSGDOptimizerTest(test.TestCase):
-
-  def testBasic(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = variables.Variable([1.1, 2.1], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        decay_rate = 0.53
-        sgd_op = VariationalSGDOptimizer(
-            1,
-            1,
-            preconditioner_decay_rate=decay_rate,
-            max_learning_rate=3.0,
-            burnin_max_learning_rate=3.0,
-            use_single_learning_rate=True).apply_gradients(
-                zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([1.1, 2.1], var0.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-        self.assertAllCloseAccordingToType([1.1 - 3.0 * 0.1, 2.1 - 3.0 * 0.1],
-                                           var0.eval())
-        self.assertAllCloseAccordingToType([3.0 - 3.0 * 0.01, 4.0 - 3.0 * 0.01],
-                                           var1.eval())
-
-  def testBasicMultiInstance(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = variables.Variable([1.1, 2.1], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        vara = variables.Variable([1.1, 2.1], dtype=dtype)
-        varb = variables.Variable([3.0, 4.0], dtype=dtype)
-        gradsa = constant_op.constant([0.1, 0.1], dtype=dtype)
-        gradsb = constant_op.constant([0.01, 0.01], dtype=dtype)
-        decay_rate = 0.5
-        batch_size = 2
-        total_num_examples = 10
-        optimizer = VariationalSGDOptimizer(
-            batch_size,
-            total_num_examples,
-            max_learning_rate=1.0,
-            burnin_max_learning_rate=3.0,
-            preconditioner_decay_rate=decay_rate)
-        sgd_op = optimizer.apply_gradients(
-            zip([grads0, grads1], [var0, var1]))
-        optimizer2 = VariationalSGDOptimizer(
-            batch_size,
-            total_num_examples,
-            max_learning_rate=1.0,
-            burnin_max_learning_rate=10.0,
-            burnin=0,
-            preconditioner_decay_rate=decay_rate)
-        sgd_op2 = optimizer2.apply_gradients(
-            zip([gradsa, gradsb], [vara, varb]))
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([1.1, 2.1], var0.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
-        self.assertAllCloseAccordingToType([1.1, 2.1], vara.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], varb.eval())
-
-        # Run 1 step of sgd
-        sgd_op.run()
-        sgd_op2.run()
-        # Validate updated params
-        self.assertAllCloseAccordingToType([1.1 - 3. * 0.1, 2.1 - 3. * 0.1],
-                                           var0.eval())
-        self.assertAllCloseAccordingToType([1.1 - 0.1, 2.1 - 0.1], vara.eval())
-
-        self.assertAllCloseAccordingToType([3.0 - 3. * 0.01, 4.0 - 3. * 0.01],
-                                           var1.eval())
-        self.assertAllCloseAccordingToType([3.0 - 0.01, 4.0 - 0.01],
-                                           varb.eval())
-        self.assertNotEqual(optimizer.variable_scope,
-                            optimizer2.variable_scope)
-        self.assertNotEqual(optimizer.variable_scope.name,
-                            optimizer2.variable_scope.name)
-        self.assertAllCloseAccordingToType(1, optimizer._counter.eval())
-        self.assertAllCloseAccordingToType(1, optimizer2._counter.eval())
-
-  def testTensorLearningRate(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = variables.Variable([1.1, 2.1], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        lrate = constant_op.constant(3.0)
-        decay_rate = 0.5
-        batch_size = 2
-        total_num_examples = 10
-        sgd_op = VariationalSGDOptimizer(
-            batch_size,
-            total_num_examples,
-            max_learning_rate=lrate,
-            burnin=0,
-            preconditioner_decay_rate=decay_rate).apply_gradients(
-                zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([1.1, 2.1], var0.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-        # Validate updated params
-        self.assertAllCloseAccordingToType([1.1 - 3.0 * 0.1, 2.1 - 3.0 * 0.1],
-                                           var0.eval())
-        self.assertAllCloseAccordingToType([3.0 - 3.0 * 0.01, 4.0 - 3.0 * 0.01],
-                                           var1.eval())
-
-  def testTensorDecayLearningRate(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = variables.Variable([1.1, 2.1], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        lrate = variables.Variable(3.0)
-        lrate_decay_op = lrate.assign_add(-3.)
-        decay_rate = 0.5
-        batch_size = 2
-        total_num_examples = 10
-        optimizer = VariationalSGDOptimizer(
-            batch_size,
-            total_num_examples,
-            max_learning_rate=lrate,
-            burnin=0,
-            preconditioner_decay_rate=decay_rate)
-        sgd_op = optimizer.apply_gradients(zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([1.1, 2.1], var0.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-        # Validate updated params
-        self.assertAllCloseAccordingToType([1.1 - 3.0 * 0.1, 2.1 - 3.0 * 0.1],
-                                           var0.eval())
-        self.assertAllCloseAccordingToType([3.0 - 3.0 * 0.01, 4.0 - 3.0 * 0.01],
-                                           var1.eval())
-        # Update learning rate to 0
-        lrate_decay_op.eval()
-        sgd_op.run()
-        # Validate params haven't changed
-        self.assertAllCloseAccordingToType([1.1 - 3.0 * 0.1, 2.1 - 3.0 * 0.1],
-                                           var0.eval())
-        self.assertAllCloseAccordingToType([3.0 - 3.0 * 0.01, 4.0 - 3.0 * 0.01],
-                                           var1.eval())
-        lrate_decay_op.eval()
-
-        with self.assertRaises(errors.InvalidArgumentError):
-          sgd_op.run()
-
-  def testGradWrtRef(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        opt = VariationalSGDOptimizer(1, 1, max_learning_rate=1.0)
-        values = [1.0, 3.0]
-        vars_ = [variables.Variable([v], dtype=dtype) for v in values]
-        grads_and_vars = opt.compute_gradients(vars_[0] + vars_[1], vars_)
-        variables.global_variables_initializer().run()
-        for grad, _ in grads_and_vars:
-          self.assertAllCloseAccordingToType([1.0], grad.eval())
-
-  def testWithGlobalStep(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        global_step = variables.Variable(0, trainable=False)
-        var0 = variables.Variable([1.1, 2.1], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        decay_rate = 0.1
-        batch_size = 2
-        total_num_examples = 10
-        sgd_optimizer = VariationalSGDOptimizer(
-            batch_size,
-            total_num_examples,
-            max_learning_rate=3.0,
-            burnin=0,
-            preconditioner_decay_rate=decay_rate)
-        sgd_op = sgd_optimizer.apply_gradients(
-            zip([grads0, grads1], [var0, var1]), global_step=global_step)
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([1.1, 2.1], var0.eval())
-        self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-
-        # Validate updated params and global_step
-        self.assertAllCloseAccordingToType([1.1 - 3.0 * 0.1, 2.1 - 3.0 * 0.1],
-                                           var0.eval())
-        self.assertAllCloseAccordingToType([3.0 - 3.0 * 0.01, 4.0 - 3.0 * 0.01],
-                                           var1.eval())
-        self.assertAllCloseAccordingToType(1, global_step.eval())
-        self.assertAllCloseAccordingToType(1, sgd_optimizer._counter.eval())
-
-  def testSparseBasic(self):
-    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = variables.Variable([[1.1], [2.1]], dtype=dtype)
-        var1 = variables.Variable([[3.0], [4.0]], dtype=dtype)
-        grads0 = ops.IndexedSlices(
-            constant_op.constant([0.1], shape=[1, 1], dtype=dtype),
-            constant_op.constant([0]), constant_op.constant([2, 1]))
-        grads1 = ops.IndexedSlices(
-            constant_op.constant([0.01], shape=[1, 1], dtype=dtype),
-            constant_op.constant([1]), constant_op.constant([2, 1]))
-        decay_rate = 0.1
-        batch_size = 2
-        total_num_examples = 10
-        sgd_op = VariationalSGDOptimizer(
-            batch_size,
-            total_num_examples,
-            max_learning_rate=3.0,
-            burnin=0,
-            preconditioner_decay_rate=decay_rate).apply_gradients(
-                zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([[1.1], [2.1]], var0.eval())
-        self.assertAllCloseAccordingToType([[3.0], [4.0]], var1.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-        # Validate updated params
-        self.assertAllCloseAccordingToType([[1.1 - 3.0 * 0.1], [2.1]],
-                                           var0.eval())
-        self.assertAllCloseAccordingToType(
-            [[3.0 - 3.0 * 0], [4.0 - 3.0 * 0.01]], var1.eval())
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/ops/csiszar_divergence.py b/tensorflow/contrib/bayesflow/python/ops/csiszar_divergence.py
deleted file mode 100644
index 9f7a95f138f7fd3e726f095dc16f41abb6182e17..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/csiszar_divergence.py
+++ /dev/null
@@ -1,51 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Csiszar f-Divergence and helpers.
-
-See ${python/contrib.bayesflow.csiszar_divergence}.
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-# go/tf-wildcard-import
-# pylint: disable=wildcard-import
-from tensorflow.contrib.bayesflow.python.ops.csiszar_divergence_impl import *
-# pylint: enable=wildcard-import
-from tensorflow.python.util.all_util import remove_undocumented
-
-_allowed_symbols = [
-    'amari_alpha',
-    'arithmetic_geometric',
-    'chi_square',
-    'csiszar_vimco',
-    'dual_csiszar_function',
-    'jeffreys',
-    'jensen_shannon',
-    'kl_forward',
-    'kl_reverse',
-    'log1p_abs',
-    'modified_gan',
-    'monte_carlo_csiszar_f_divergence',
-    'pearson',
-    'squared_hellinger',
-    'symmetrized_csiszar_function',
-    'total_variation',
-    't_power',
-    'triangular',
-]
-
-remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/bayesflow/python/ops/csiszar_divergence_impl.py b/tensorflow/contrib/bayesflow/python/ops/csiszar_divergence_impl.py
deleted file mode 100644
index 8efd59d6516924bea538717d45bb4ae303583421..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/csiszar_divergence_impl.py
+++ /dev/null
@@ -1,1105 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Csiszar f-Divergence and helpers.
-
-@@amari_alpha
-@@arithmetic_geometric
-@@chi_square
-@@csiszar_vimco
-@@dual_csiszar_function
-@@jeffreys
-@@jensen_shannon
-@@kl_forward
-@@kl_reverse
-@@log1p_abs
-@@modified_gan
-@@monte_carlo_csiszar_f_divergence
-@@pearson
-@@squared_hellinger
-@@symmetrized_csiszar_function
-@@total_variation
-@@triangular
-
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.contrib import framework as contrib_framework
-from tensorflow.contrib.bayesflow.python.ops import monte_carlo_impl as monte_carlo
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import nn_ops
-from tensorflow.python.ops.distributions import distribution
-from tensorflow.python.ops.distributions import util as distribution_util
-
-
-def amari_alpha(logu, alpha=1., self_normalized=False, name=None):
-  """The Amari-alpha Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  When `self_normalized = True`, the Amari-alpha Csiszar-function is:
-
-  ```none
-  f(u) = { -log(u) + (u - 1),     alpha = 0
-         { u log(u) - (u - 1),    alpha = 1
-         { [(u**alpha - 1) - alpha (u - 1)] / (alpha (alpha - 1)),    otherwise
-  ```
-
-  When `self_normalized = False` the `(u - 1)` terms are omitted.
-
-  Warning: when `alpha != 0` and/or `self_normalized = True` this function makes
-  non-log-space calculations and may therefore be numerically unstable for
-  `|logu| >> 0`.
-
-  For more information, see:
-    A. Cichocki and S. Amari. "Families of Alpha- Beta- and Gamma- Divergences:
-    Flexible and Robust Measures of Similarities." Entropy, vol. 12, no. 6, pp.
-    1532-1568, 2010.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    alpha: `float`-like Python scalar. (See the definition of `f(u)` above.)
-    self_normalized: Python `bool` indicating whether `f'(u=1)=0`. When
-      `f'(u=1)=0` the implied Csiszar f-Divergence remains non-negative even
-      when `p, q` are unnormalized measures.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    amari_alpha_of_u: `float`-like `Tensor` of the Csiszar-function evaluated
-      at `u = exp(logu)`.
-
-  Raises:
-    TypeError: if `alpha` is `None` or a `Tensor`.
-    TypeError: if `self_normalized` is `None` or a `Tensor`.
-  """
-  with ops.name_scope(name, "amari_alpha", [logu]):
-    if alpha is None or contrib_framework.is_tensor(alpha):
-      raise TypeError("`alpha` cannot be `None` or `Tensor` type.")
-    if self_normalized is None or contrib_framework.is_tensor(self_normalized):
-      raise TypeError("`self_normalized` cannot be `None` or `Tensor` type.")
-
-    logu = ops.convert_to_tensor(logu, name="logu")
-
-    if alpha == 0.:
-      f = -logu
-    elif alpha == 1.:
-      f = math_ops.exp(logu) * logu
-    else:
-      f = math_ops.expm1(alpha * logu) / (alpha * (alpha - 1.))
-
-    if not self_normalized:
-      return f
-
-    if alpha == 0.:
-      return f + math_ops.expm1(logu)
-    elif alpha == 1.:
-      return f - math_ops.expm1(logu)
-    else:
-      return f - math_ops.expm1(logu) / (alpha - 1.)
-
-
-def kl_reverse(logu, self_normalized=False, name=None):
-  """The reverse Kullback-Leibler Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  When `self_normalized = True`, the KL-reverse Csiszar-function is:
-
-  ```none
-  f(u) = -log(u) + (u - 1)
-  ```
-
-  When `self_normalized = False` the `(u - 1)` term is omitted.
-
-  Observe that as an f-Divergence, this Csiszar-function implies:
-
-  ```none
-  D_f[p, q] = KL[q, p]
-  ```
-
-  The KL is "reverse" because in maximum likelihood we think of minimizing `q`
-  as in `KL[p, q]`.
-
-  Warning: when `self_normalized = True` this function makes non-log-space
-  calculations and may therefore be numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    self_normalized: Python `bool` indicating whether `f'(u=1)=0`. When
-      `f'(u=1)=0` the implied Csiszar f-Divergence remains non-negative even
-      when `p, q` are unnormalized measures.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    kl_reverse_of_u: `float`-like `Tensor` of the Csiszar-function evaluated at
-      `u = exp(logu)`.
-
-  Raises:
-    TypeError: if `self_normalized` is `None` or a `Tensor`.
-  """
-
-  with ops.name_scope(name, "kl_reverse", [logu]):
-    return amari_alpha(logu, alpha=0., self_normalized=self_normalized)
-
-
-def kl_forward(logu, self_normalized=False, name=None):
-  """The forward Kullback-Leibler Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  When `self_normalized = True`, the KL-forward Csiszar-function is:
-
-  ```none
-  f(u) = u log(u) - (u - 1)
-  ```
-
-  When `self_normalized = False` the `(u - 1)` term is omitted.
-
-  Observe that as an f-Divergence, this Csiszar-function implies:
-
-  ```none
-  D_f[p, q] = KL[p, q]
-  ```
-
-  The KL is "forward" because in maximum likelihood we think of minimizing `q`
-  as in `KL[p, q]`.
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    self_normalized: Python `bool` indicating whether `f'(u=1)=0`. When
-      `f'(u=1)=0` the implied Csiszar f-Divergence remains non-negative even
-      when `p, q` are unnormalized measures.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    kl_forward_of_u: `float`-like `Tensor` of the Csiszar-function evaluated at
-      `u = exp(logu)`.
-
-  Raises:
-    TypeError: if `self_normalized` is `None` or a `Tensor`.
-  """
-
-  with ops.name_scope(name, "kl_forward", [logu]):
-    return amari_alpha(logu, alpha=1., self_normalized=self_normalized)
-
-
-def jensen_shannon(logu, self_normalized=False, name=None):
-  """The Jensen-Shannon Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  When `self_normalized = True`, the Jensen-Shannon Csiszar-function is:
-
-  ```none
-  f(u) = u log(u) - (1 + u) log(1 + u) + (u + 1) log(2)
-  ```
-
-  When `self_normalized = False` the `(u + 1) log(2)` term is omitted.
-
-  Observe that as an f-Divergence, this Csiszar-function implies:
-
-  ```none
-  D_f[p, q] = KL[p, m] + KL[q, m]
-  m(x) = 0.5 p(x) + 0.5 q(x)
-  ```
-
-  In a sense, this divergence is the "reverse" of the Arithmetic-Geometric
-  f-Divergence.
-
-  This Csiszar-function induces a symmetric f-Divergence, i.e.,
-  `D_f[p, q] = D_f[q, p]`.
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  For more information, see:
-    Lin, J. "Divergence measures based on the Shannon entropy." IEEE Trans.
-    Inf. Th., 37, 145-151, 1991.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    self_normalized: Python `bool` indicating whether `f'(u=1)=0`. When
-      `f'(u=1)=0` the implied Csiszar f-Divergence remains non-negative even
-      when `p, q` are unnormalized measures.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    jensen_shannon_of_u: `float`-like `Tensor` of the Csiszar-function
-      evaluated at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "jensen_shannon", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    npdt = logu.dtype.as_numpy_dtype
-    y = nn_ops.softplus(logu)
-    if self_normalized:
-      y -= np.log(2).astype(npdt)
-    return math_ops.exp(logu) * logu - (1. + math_ops.exp(logu)) * y
-
-
-def arithmetic_geometric(logu, self_normalized=False, name=None):
-  """The Arithmetic-Geometric Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  When `self_normalized = True` the Arithmetic-Geometric Csiszar-function is:
-
-  ```none
-  f(u) = (1 + u) log( (1 + u) / sqrt(u) ) - (1 + u) log(2)
-  ```
-
-  When `self_normalized = False` the `(1 + u) log(2)` term is omitted.
-
-  Observe that as an f-Divergence, this Csiszar-function implies:
-
-  ```none
-  D_f[p, q] = KL[m, p] + KL[m, q]
-  m(x) = 0.5 p(x) + 0.5 q(x)
-  ```
-
-  In a sense, this divergence is the "reverse" of the Jensen-Shannon
-  f-Divergence.
-
-  This Csiszar-function induces a symmetric f-Divergence, i.e.,
-  `D_f[p, q] = D_f[q, p]`.
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    self_normalized: Python `bool` indicating whether `f'(u=1)=0`. When
-      `f'(u=1)=0` the implied Csiszar f-Divergence remains non-negative even
-      when `p, q` are unnormalized measures.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    arithmetic_geometric_of_u: `float`-like `Tensor` of the
-      Csiszar-function evaluated at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "arithmetic_geometric", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    y = nn_ops.softplus(logu) - 0.5 * logu
-    if self_normalized:
-      y -= np.log(2.).astype(logu.dtype.as_numpy_dtype)
-    return (1. + math_ops.exp(logu)) * y
-
-
-def total_variation(logu, name=None):
-  """The Total Variation Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  The Total-Variation Csiszar-function is:
-
-  ```none
-  f(u) = 0.5 |u - 1|
-  ```
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    total_variation_of_u: `float`-like `Tensor` of the Csiszar-function
-      evaluated at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "total_variation", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    return 0.5 * math_ops.abs(math_ops.expm1(logu))
-
-
-def pearson(logu, name=None):
-  """The Pearson Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  The Pearson Csiszar-function is:
-
-  ```none
-  f(u) = (u - 1)**2
-  ```
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    pearson_of_u: `float`-like `Tensor` of the Csiszar-function evaluated at
-      `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "pearson", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    return math_ops.square(math_ops.expm1(logu))
-
-
-def squared_hellinger(logu, name=None):
-  """The Squared-Hellinger Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  The Squared-Hellinger Csiszar-function is:
-
-  ```none
-  f(u) = (sqrt(u) - 1)**2
-  ```
-
-  This Csiszar-function induces a symmetric f-Divergence, i.e.,
-  `D_f[p, q] = D_f[q, p]`.
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    squared_hellinger_of_u: `float`-like `Tensor` of the Csiszar-function
-      evaluated at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "squared_hellinger", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    return pearson(0.5 * logu)
-
-
-def triangular(logu, name=None):
-  """The Triangular Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  The Triangular Csiszar-function is:
-
-  ```none
-  f(u) = (u - 1)**2 / (1 + u)
-  ```
-
-  This Csiszar-function induces a symmetric f-Divergence, i.e.,
-  `D_f[p, q] = D_f[q, p]`.
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    triangular_of_u: `float`-like `Tensor` of the Csiszar-function evaluated
-      at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "triangular", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    return pearson(logu) / (1. + math_ops.exp(logu))
-
-
-def t_power(logu, t, self_normalized=False, name=None):
-  """The T-Power Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  When `self_normalized = True` the T-Power Csiszar-function is:
-
-  ```none
-  f(u) = s [ u**t - 1 - t(u - 1) ]
-  s = { -1   0 < t < 1
-      { +1   otherwise
-  ```
-
-  When `self_normalized = False` the `- t(u - 1)` term is omitted.
-
-  This is similar to the `amari_alpha` Csiszar-function, with the associated
-  divergence being the same up to factors depending only on `t`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    t:  `Tensor` of same `dtype` as `logu` and broadcastable shape.
-    self_normalized: Python `bool` indicating whether `f'(u=1)=0`.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    t_power_of_u: `float`-like `Tensor` of the Csiszar-function evaluated
-      at `u = exp(logu)`.
-  """
-  with ops.name_scope(name, "t_power", [logu, t]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    t = ops.convert_to_tensor(t, dtype=logu.dtype.base_dtype, name="t")
-    fu = math_ops.expm1(t * logu)
-    if self_normalized:
-      fu -= t * math_ops.expm1(logu)
-    fu *= array_ops.where(math_ops.logical_and(0. < t, t < 1.),
-                          -array_ops.ones_like(t),
-                          array_ops.ones_like(t))
-    return fu
-
-
-def log1p_abs(logu, name=None):
-  """The log1p-abs Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  The Log1p-Abs Csiszar-function is:
-
-  ```none
-  f(u) = u**(sign(u-1)) - 1
-  ```
-
-  This function is so-named because it was invented from the following recipe.
-  Choose a convex function g such that g(0)=0 and solve for f:
-
-  ```none
-  log(1 + f(u)) = g(log(u)).
-    <=>
-  f(u) = exp(g(log(u))) - 1
-  ```
-
-  That is, the graph is identically `g` when y-axis is `log1p`-domain and x-axis
-  is `log`-domain.
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    log1p_abs_of_u: `float`-like `Tensor` of the Csiszar-function evaluated
-      at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "log1p_abs", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    return math_ops.expm1(math_ops.abs(logu))
-
-
-def jeffreys(logu, name=None):
-  """The Jeffreys Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  The Jeffreys Csiszar-function is:
-
-  ```none
-  f(u) = 0.5 ( u log(u) - log(u) )
-       = 0.5 kl_forward + 0.5 kl_reverse
-       = symmetrized_csiszar_function(kl_reverse)
-       = symmetrized_csiszar_function(kl_forward)
-  ```
-
-  This Csiszar-function induces a symmetric f-Divergence, i.e.,
-  `D_f[p, q] = D_f[q, p]`.
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    jeffreys_of_u: `float`-like `Tensor` of the Csiszar-function evaluated
-      at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "jeffreys", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    return 0.5 * math_ops.expm1(logu) * logu
-
-
-def chi_square(logu, name=None):
-  """The chi-Square Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  The Chi-square Csiszar-function is:
-
-  ```none
-  f(u) = u**2 - 1
-  ```
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    chi_square_of_u: `float`-like `Tensor` of the Csiszar-function evaluated
-      at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "chi_square", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    return math_ops.expm1(2. * logu)
-
-
-def modified_gan(logu, self_normalized=False, name=None):
-  """The Modified-GAN Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  When `self_normalized = True` the modified-GAN (Generative/Adversarial
-  Network) Csiszar-function is:
-
-  ```none
-  f(u) = log(1 + u) - log(u) + 0.5 (u - 1)
-  ```
-
-  When `self_normalized = False` the `0.5 (u - 1)` is omitted.
-
-  The unmodified GAN Csiszar-function is identical to Jensen-Shannon (with
-  `self_normalized = False`).
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    self_normalized: Python `bool` indicating whether `f'(u=1)=0`. When
-      `f'(u=1)=0` the implied Csiszar f-Divergence remains non-negative even
-      when `p, q` are unnormalized measures.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    modified_gan_of_u: `float`-like `Tensor` of the Csiszar-function evaluated
-      at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "modified_gan", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    y = nn_ops.softplus(logu) - logu
-    if self_normalized:
-      y += 0.5 * math_ops.expm1(logu)
-    return y
-
-
-def dual_csiszar_function(logu, csiszar_function, name=None):
-  """Calculates the dual Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  The Csiszar-dual is defined as:
-
-  ```none
-  f^*(u) = u f(1 / u)
-  ```
-
-  where `f` is some other Csiszar-function.
-
-  For example, the dual of `kl_reverse` is `kl_forward`, i.e.,
-
-  ```none
-  f(u) = -log(u)
-  f^*(u) = u f(1 / u) = -u log(1 / u) = u log(u)
-  ```
-
-  The dual of the dual is the original function:
-
-  ```none
-  f^**(u) = {u f(1/u)}^*(u) = u (1/u) f(1/(1/u)) = f(u)
-  ```
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    csiszar_function: Python `callable` representing a Csiszar-function over
-      log-domain.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    dual_f_of_u: `float`-like `Tensor` of the result of calculating the dual of
-      `f` at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "dual_csiszar_function", [logu]):
-    return math_ops.exp(logu) * csiszar_function(-logu)
-
-
-def symmetrized_csiszar_function(logu, csiszar_function, name=None):
-  """Symmetrizes a Csiszar-function in log-space.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  The symmetrized Csiszar-function is defined as:
-
-  ```none
-  f_g(u) = 0.5 g(u) + 0.5 u g (1 / u)
-  ```
-
-  where `g` is some other Csiszar-function.
-
-  We say the function is "symmetrized" because:
-
-  ```none
-  D_{f_g}[p, q] = D_{f_g}[q, p]
-  ```
-
-  for all `p << >> q` (i.e., `support(p) = support(q)`).
-
-  There exist alternatives for symmetrizing a Csiszar-function. For example,
-
-  ```none
-  f_g(u) = max(g(u), g^*(u)),
-  ```
-
-  where `g^*` is the dual Csiszar-function, also implies a symmetric
-  f-Divergence.
-
-  Example:
-
-  When either of the following functions is symmetrized, we obtain the
-  Jensen-Shannon Csiszar-function, i.e.,
-
-  ```none
-  g(u) = -log(u) - (1 + u) log((1 + u) / 2) + u - 1
-  h(u) = log(4) + 2 u log(u / (1 + u))
-  ```
-
-  implies,
-
-  ```none
-  f_g(u) = f_h(u) = u log(u) - (1 + u) log((1 + u) / 2)
-         = jensen_shannon(log(u)).
-  ```
-
-  Warning: this function makes non-log-space calculations and may therefore be
-  numerically unstable for `|logu| >> 0`.
-
-  Args:
-    logu: `float`-like `Tensor` representing `log(u)` from above.
-    csiszar_function: Python `callable` representing a Csiszar-function over
-      log-domain.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    symmetrized_g_of_u: `float`-like `Tensor` of the result of applying the
-      symmetrization of `g` evaluated at `u = exp(logu)`.
-  """
-
-  with ops.name_scope(name, "symmetrized_csiszar_function", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-    return 0.5 * (csiszar_function(logu)
-                  + dual_csiszar_function(logu, csiszar_function))
-
-
-def monte_carlo_csiszar_f_divergence(
-    f,
-    p_log_prob,
-    q,
-    num_draws,
-    use_reparametrization=None,
-    seed=None,
-    name=None):
-  """Monte-Carlo approximation of the Csiszar f-Divergence.
-
-  A Csiszar-function is a member of,
-
-  ```none
-  F = { f:R_+ to R : f convex }.
-  ```
-
-  The Csiszar f-Divergence for Csiszar-function f is given by:
-
-  ```none
-  D_f[p(X), q(X)] := E_{q(X)}[ f( p(X) / q(X) ) ]
-                  ~= m**-1 sum_j^m f( p(x_j) / q(x_j) ),
-                             where x_j ~iid q(X)
-  ```
-
-  Tricks: Reparameterization and Score-Gradient
-
-  When q is "reparameterized", i.e., a diffeomorphic transformation of a
-  parameterless distribution (e.g.,
-  `Normal(Y; m, s) <=> Y = sX + m, X ~ Normal(0,1)`), we can swap gradient and
-  expectation, i.e.,
-  `grad[Avg{ s_i : i=1...n }] = Avg{ grad[s_i] : i=1...n }` where `S_n=Avg{s_i}`
-  and `s_i = f(x_i), x_i ~iid q(X)`.
-
-  However, if q is not reparameterized, TensorFlow's gradient will be incorrect
-  since the chain-rule stops at samples of unreparameterized distributions. In
-  this circumstance using the Score-Gradient trick results in an unbiased
-  gradient, i.e.,
-
-  ```none
-  grad[ E_q[f(X)] ]
-  = grad[ int dx q(x) f(x) ]
-  = int dx grad[ q(x) f(x) ]
-  = int dx [ q'(x) f(x) + q(x) f'(x) ]
-  = int dx q(x) [q'(x) / q(x) f(x) + f'(x) ]
-  = int dx q(x) grad[ f(x) q(x) / stop_grad[q(x)] ]
-  = E_q[ grad[ f(x) q(x) / stop_grad[q(x)] ] ]
-  ```
-
-  When `q.reparameterization_type == distribution.FULLY_REPARAMETERIZED` it is
-  usually preferable to set `use_reparametrization = True`.
-
-  Example Application:
-
-  The Csiszar f-Divergence is a useful framework for variational inference.
-  I.e., observe that,
-
-  ```none
-  f(p(x)) =  f( E_{q(Z | x)}[ p(x, Z) / q(Z | x) ] )
-          <= E_{q(Z | x)}[ f( p(x, Z) / q(Z | x) ) ]
-          := D_f[p(x, Z), q(Z | x)]
-  ```
-
-  The inequality follows from the fact that the "perspective" of `f`, i.e.,
-  `(s, t) |-> t f(s / t)`, is convex in `(s, t)` when `s/t in domain(f)` and
-  `t` is a real. Since the above framework includes the popular Evidence Lower
-  BOund (ELBO) as a special case, i.e., `f(u) = -log(u)`, we call this framework
-  "Evidence Divergence Bound Optimization" (EDBO).
-
-  Args:
-    f: Python `callable` representing a Csiszar-function in log-space, i.e.,
-      takes `p_log_prob(q_samples) - q.log_prob(q_samples)`.
-    p_log_prob: Python `callable` taking (a batch of) samples from `q` and
-      returning the natural-log of the probability under distribution `p`.
-      (In variational inference `p` is the joint distribution.)
-    q: `tf.Distribution`-like instance; must implement:
-      `reparameterization_type`, `sample(n, seed)`, and `log_prob(x)`.
-      (In variational inference `q` is the approximate posterior distribution.)
-    num_draws: Integer scalar number of draws used to approximate the
-      f-Divergence expectation.
-    use_reparametrization: Python `bool`. When `None` (the default),
-      automatically set to:
-      `q.reparameterization_type == distribution.FULLY_REPARAMETERIZED`.
-      When `True` uses the standard Monte-Carlo average. When `False` uses the
-      score-gradient trick. (See above for details.)  When `False`, consider
-      using `csiszar_vimco`.
-    seed: Python `int` seed for `q.sample`.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    monte_carlo_csiszar_f_divergence: `float`-like `Tensor` Monte Carlo
-      approximation of the Csiszar f-Divergence.
-
-  Raises:
-    ValueError: if `q` is not a reparameterized distribution and
-      `use_reparametrization = True`. A distribution `q` is said to be
-      "reparameterized" when its samples are generated by transforming the
-      samples of another distribution which does not depend on the
-      parameterization of `q`. This property ensures the gradient (with respect
-      to parameters) is valid.
-    TypeError: if `p_log_prob` is not a Python `callable`.
-  """
-  with ops.name_scope(name, "monte_carlo_csiszar_f_divergence", [num_draws]):
-    if use_reparametrization is None:
-      use_reparametrization = (q.reparameterization_type
-                               == distribution.FULLY_REPARAMETERIZED)
-    elif (use_reparametrization and
-          q.reparameterization_type != distribution.FULLY_REPARAMETERIZED):
-      # TODO(jvdillon): Consider only raising an exception if the gradient is
-      # requested.
-      raise ValueError(
-          "Distribution `q` must be reparameterized, i.e., a diffeomorphic "
-          "transformation of a parameterless distribution. (Otherwise this "
-          "function has a biased gradient.)")
-    if not callable(p_log_prob):
-      raise TypeError("`p_log_prob` must be a Python `callable` function.")
-    return monte_carlo.expectation(
-        f=lambda q_samples: f(p_log_prob(q_samples) - q.log_prob(q_samples)),
-        samples=q.sample(num_draws, seed=seed),
-        log_prob=q.log_prob,  # Only used if use_reparametrization=False.
-        use_reparametrization=use_reparametrization)
-
-
-def csiszar_vimco(f,
-                  p_log_prob,
-                  q,
-                  num_draws,
-                  num_batch_draws=1,
-                  seed=None,
-                  name=None):
-  """Use VIMCO to lower the variance of gradient[csiszar_function(log(Avg(u)))].
-
-  This function generalizes "Variational Inference for Monte Carlo Objectives"
-  (VIMCO), i.e., https://arxiv.org/abs/1602.06725, to Csiszar f-Divergences.
-
-  Note: if `q.reparameterization_type = distribution.FULLY_REPARAMETERIZED`,
-  consider using `monte_carlo_csiszar_f_divergence`.
-
-  The VIMCO loss is:
-
-  ```none
-  vimco = f(log Avg{u[i] : i=0,...,m-1})
-  where,
-    logu[i] = log(u[i]) = log( p(x, h[i]) / q(h[i] | x) )
-    h[i] iid~ q(H | x)
-  ```
-
-  Interestingly, the VIMCO gradient is not the naive gradient of `vimco`.
-  Rather, it is characterized by:
-
-  ```none
-  grad[vimco] - variance_reducing_term
-  where,
-    variance_reducing_term = Sum{ grad[log q(h[i] | x)] *
-                                    (vimco - f(log Avg{h[j;i] : j=0,...,m-1}))
-                                 : i=0, ..., m-1 }
-    h[j;i] = { u[j]                             j!=i
-             { GeometricAverage{ u[k] : k!=i}   j==i
-  ```
-
-  (We omitted `stop_gradient` for brevity. See implementation for more details.)
-
-  The `Avg{h[j;i] : j}` term is a kind of "swap-out average" where the `i`-th
-  element has been replaced by the leave-`i`-out Geometric-average.
-
-  This implementation prefers numerical precision over efficiency, i.e.,
-  `O(num_draws * num_batch_draws * prod(batch_shape) * prod(event_shape))`.
-  (The constant may be fairly large, perhaps around 12.)
-
-  Args:
-    f: Python `callable` representing a Csiszar-function in log-space.
-    p_log_prob: Python `callable` representing the natural-log of the
-      probability under distribution `p`. (In variational inference `p` is the
-      joint distribution.)
-    q: `tf.Distribution`-like instance; must implement: `sample(n, seed)`, and
-      `log_prob(x)`. (In variational inference `q` is the approximate posterior
-      distribution.)
-    num_draws: Integer scalar number of draws used to approximate the
-      f-Divergence expectation.
-    num_batch_draws: Integer scalar number of independent batches of
-      `num_draws` draws over which the VIMCO objective is averaged.
-    seed: Python `int` seed for `q.sample`.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    vimco: The Csiszar f-Divergence generalized VIMCO objective.
-
-  Raises:
-    ValueError: if `num_draws < 2`.
-  """
-  with ops.name_scope(name, "csiszar_vimco", [num_draws, num_batch_draws]):
-    if num_draws < 2:
-      raise ValueError("Must specify num_draws > 1.")
-    stop = array_ops.stop_gradient  # For readability.
-    x = stop(q.sample(sample_shape=[num_draws, num_batch_draws],
-                      seed=seed))
-    logqx = q.log_prob(x)
-    logu = p_log_prob(x) - logqx
-    f_log_avg_u, f_log_sooavg_u = [f(r) for r in csiszar_vimco_helper(logu)]
-    dotprod = math_ops.reduce_sum(
-        logqx * stop(f_log_avg_u - f_log_sooavg_u),
-        axis=0)  # Sum over iid samples.
-    # We now rewrite f_log_avg_u so that:
-    #   `grad[f_log_avg_u] := grad[f_log_avg_u + dotprod]`.
-    # To achieve this, we use a trick that
-    #   `f(x) - stop(f(x)) == zeros_like(f(x))`
-    # but its gradient is grad[f(x)].
-    # Note that IEEE754 specifies that `x - x == 0.` and `x + 0. == x`, hence
-    # this trick loses no precision. For more discussion regarding the relevant
-    # portions of the IEEE754 standard, see the StackOverflow question,
-    # "Is there a floating point value of x, for which x-x == 0 is false?"
-    # http://stackoverflow.com/q/2686644
-    f_log_avg_u += dotprod - stop(dotprod)  # Add zeros_like(dotprod).
-    return math_ops.reduce_mean(f_log_avg_u, axis=0)  # Avg over batches.
-
-
-def csiszar_vimco_helper(logu, name=None):
-  """Helper to `csiszar_vimco`; computes `log_avg_u`, `log_sooavg_u`.
-
-  `axis = 0` of `logu` is presumed to correspond to iid samples from `q`, i.e.,
-
-  ```none
-  logu[j] = log(u[j])
-  u[j] = p(x, h[j]) / q(h[j] | x)
-  h[j] iid~ q(H | x)
-  ```
-
-  Args:
-    logu: Floating-type `Tensor` representing `log(p(x, h) / q(h | x))`.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    log_avg_u: `logu.dtype` `Tensor` corresponding to the natural-log of the
-      average of `u`. The sum of the gradient of `log_avg_u` is `1`.
-    log_sooavg_u: `logu.dtype` `Tensor` characterized by the natural-log of the
-      average of `u` except that the average swaps-out `u[i]` for the
-      leave-`i`-out Geometric-average. The mean of the gradient of
-      `log_sooavg_u` is `1`. Mathematically `log_sooavg_u` is,
-      ```none
-      log_sooavg_u[i] = log(Avg{h[j ; i] : j=0, ..., m-1})
-      h[j ; i] = { u[j]                              j!=i
-                 { GeometricAverage{u[k] : k != i}   j==i
-      ```
-
-  """
-  with ops.name_scope(name, "csiszar_vimco_helper", [logu]):
-    logu = ops.convert_to_tensor(logu, name="logu")
-
-    n = logu.shape.with_rank_at_least(1)[0].value
-    if n is None:
-      n = array_ops.shape(logu)[0]
-      log_n = math_ops.log(math_ops.cast(n, dtype=logu.dtype))
-      nm1 = math_ops.cast(n - 1, dtype=logu.dtype)
-    else:
-      log_n = np.log(n).astype(logu.dtype.as_numpy_dtype)
-      nm1 = np.asarray(n - 1, dtype=logu.dtype.as_numpy_dtype)
-
-    # Throughout we reduce across axis=0 since this is presumed to be iid
-    # samples.
-
-    log_max_u = math_ops.reduce_max(logu, axis=0)
-    log_sum_u_minus_log_max_u = math_ops.reduce_logsumexp(
-        logu - log_max_u, axis=0)
-
-    # log_loosum_u[i] =
-    # = logsumexp(logu[j] : j != i)
-    # = log( exp(logsumexp(logu)) - exp(logu[i]) )
-    # = log( exp(logsumexp(logu - logu[i])) exp(logu[i])  - exp(logu[i]))
-    # = logu[i] + log(exp(logsumexp(logu - logu[i])) - 1)
-    # = logu[i] + log(exp(logsumexp(logu) - logu[i]) - 1)
-    # = logu[i] + softplus_inverse(logsumexp(logu) - logu[i])
-    d = log_sum_u_minus_log_max_u + (log_max_u - logu)
-    # We use `d != 0` rather than `d > 0.` because `d < 0.` should never
-    # happen; if it does we want to complain loudly (which `softplus_inverse`
-    # will).
-    d_ok = math_ops.not_equal(d, 0.)
-    safe_d = array_ops.where(d_ok, d, array_ops.ones_like(d))
-    d_ok_result = logu + distribution_util.softplus_inverse(safe_d)
-
-    inf = np.array(np.inf, dtype=logu.dtype.as_numpy_dtype)
-
-    # When not(d_ok) and is_positive_and_largest then we manually compute the
-    # log_loosum_u. (We can efficiently do this for any one point but not all,
-    # hence we still need the above calculation.) This is good because when
-    # this condition is met, we cannot use the above calculation; it's -inf.
-    # We now compute the log-leave-out-max-sum, replicate it to every
-    # point and make sure to select it only when we need to.
-    is_positive_and_largest = math_ops.logical_and(
-        logu > 0.,
-        math_ops.equal(logu, log_max_u[array_ops.newaxis, ...]))
-    log_lomsum_u = math_ops.reduce_logsumexp(
-        array_ops.where(is_positive_and_largest,
-                        array_ops.fill(array_ops.shape(logu), -inf),
-                        logu),
-        axis=0, keep_dims=True)
-    log_lomsum_u = array_ops.tile(
-        log_lomsum_u,
-        multiples=1 + array_ops.pad([n-1], [[0, array_ops.rank(logu)-1]]))
-
-    d_not_ok_result = array_ops.where(
-        is_positive_and_largest,
-        log_lomsum_u,
-        array_ops.fill(array_ops.shape(d), -inf))
-
-    log_loosum_u = array_ops.where(d_ok, d_ok_result, d_not_ok_result)
-
-    # The swap-one-out-sum ("soosum") is n different sums, each of which
-    # replaces the i-th item with the i-th-left-out average, i.e.,
-    # soo_sum_u[i] = [exp(logu) - exp(logu[i])] + exp(mean(logu[!=i]))
-    #              =  exp(log_loosum_u[i])      + exp(looavg_logu[i])
-    looavg_logu = (math_ops.reduce_sum(logu, axis=0) - logu) / nm1
-    log_soosum_u = math_ops.reduce_logsumexp(
-        array_ops.stack([log_loosum_u, looavg_logu]),
-        axis=0)
-
-    log_avg_u = log_sum_u_minus_log_max_u + log_max_u - log_n
-    log_sooavg_u = log_soosum_u - log_n
-
-    log_avg_u.set_shape(logu.shape.with_rank_at_least(1)[1:])
-    log_sooavg_u.set_shape(logu.shape)
-
-    return log_avg_u, log_sooavg_u
diff --git a/tensorflow/contrib/bayesflow/python/ops/custom_grad_impl.py b/tensorflow/contrib/bayesflow/python/ops/custom_grad_impl.py
deleted file mode 100644
index d44fe6529a7ff0da0c6747e193fdb98a272a8da3..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/custom_grad_impl.py
+++ /dev/null
@@ -1,110 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Functions for specifying custom gradients.
-
-@@custom_gradient
-
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import math_ops
-
-__all__ = [
-    "custom_gradient",
-]
-
-
-def custom_gradient(fx, gx, x, axis=(), fx_gx_manually_stopped=False,
-                    name=None):
-  """Enables specifying a custom gradient.
-
-  This function works by clever application of `stop_gradient`. I.e., observe
-  that:
-
-  ```none
-  h(x) = x * stop_gradient(g(x)) + stop_gradient(f(x) - x * g(x))
-  ```
-
-  is such that `h(x) = stop_gradient(f(x))` and `grad[h(x), x] =
-  stop_gradient(g(x))`.
-
-  In addition to scalar-domain/scalar-range functions, this function also
-  supports tensor-domain/scalar-range functions. However, in the latter case it
-  is necessary to reduce `x` to a scalar. This can be done by indicating the
-  `axis` over which `f` operates or by appropriately `reduce_sum`-ing `x`, prior
-  to calling this function.
-
-  Partial Custom Gradient:
-
-  Suppose `h(x) = htilde(x, y)`. Note that `dh/dx = stop(g(x))` but `dh/dy =
-  None`. This is because a `Tensor` cannot have only a portion of its gradient
-  stopped. To circumvent this issue, one must manually `stop_gradient` the
-  relevant portions of `f`, `g`. For example see the unit-test,
-  `test_works_correctly_fx_gx_manually_stopped`.
-
-  Args:
-    fx: `Tensor`. Output of function evaluated at `x`.
-    gx: `Tensor`. Gradient of function evaluated at `x`.
-    x: `Tensor`. Point of evaluation for `f, g`.
-    axis: 1D `int` `Tensor` representing dimensions of `x` which are the domain
-      of `f`. If `()` (the default), `f` is assumed scalar-domain/scalar-range.
-      If `None` `f` is assumed to render one scalar given all of `x`. Otherwise
-      `f` is assumed to output one scalar for each of `axis` dimensions of `x`.
-    fx_gx_manually_stopped: Python `bool` indicating that `fx`, `gx` manually
-      have `stop_gradient` applied.
-    name: Python `str` name prefixed to Ops created by this function.
-
-  Returns:
-    fx: Floating-type `Tensor` equal to `f(x)` but which has gradient
-      `stop_gradient(g(x))`.
-  """
-  with ops.name_scope(name, "custom_gradient", [fx, gx, x]):
-    fx = ops.convert_to_tensor(fx, name="fx")
-    # We don't want to bother eagerly computing `gx` since we may not even need
-    # it.
-    with ops.control_dependencies([fx]):
-      gx = ops.convert_to_tensor(gx, dtype=fx.dtype, name="gx")
-      gx = array_ops.identity(gx, name="gx")
-    # Proof of correctness:
-    #
-    #  f(x) = x * stop[gx] + stop[fx - x * gx]
-    #       = stop[fx]
-    #
-    #  g(x) = grad[fx]
-    #       = stop[gx] + grad[stop[fx - x * gx]]
-    #       = stop[gx] + 0
-    #
-    # Notice that when x is zero it still works:
-    # grad[x * stop(gx) + stop(fx - x * gx)] = 1 * stop[gx] + 0 = stop[gx]
-    #
-    # The proof is similar for the tensor-domain case, except that `x` is
-    # replaced by `reduce_sum(x)`.
-    sum_x = math_ops.reduce_sum(x, axis=axis, name="sum_x")
-    if not fx_gx_manually_stopped:
-      fx = array_ops.stop_gradient(fx)
-      gx = array_ops.stop_gradient(gx)
-    # IEEE754 ensures `(x-x)==0.` and that `0.*x==0.` so we make sure to write
-    # the code this way, rather than, e.g.,
-    # `sum_x * stop(gx) + stop(fx - sum_x * gx)`.
-    # For more discussion regarding the relevant portions of the IEEE754
-    # standard, see the StackOverflow question,
-    # "Is there a floating point value of x, for which x-x == 0 is false?"
-    # http://stackoverflow.com/q/2686644
-    return (sum_x - array_ops.stop_gradient(sum_x)) * gx + fx
diff --git a/tensorflow/contrib/bayesflow/python/ops/docstring_util.py b/tensorflow/contrib/bayesflow/python/ops/docstring_util.py
deleted file mode 100644
index 081f2d5a8bfd437fd173f63b4226fb7df6ca921c..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/docstring_util.py
+++ /dev/null
@@ -1,88 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Utilities for programmable docstrings.
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import re
-import six
-
-
-def expand_docstring(**kwargs):
-  """Decorator to programmatically expand the docstring.
-
-  Args:
-    **kwargs: Keyword arguments to set. For each key-value pair `k` and `v`,
-      the key is found as `@{k}` in the docstring and replaced with `v`.
-
-  Returns:
-    Decorated function.
-  """
-  def _fn_wrapped(fn):
-    """Original function with modified `__doc__` attribute."""
-    doc = _trim(fn.__doc__)
-    for k, v in six.iteritems(kwargs):
-      # Capture each @{k} reference to replace with v.
-      # We wrap the replacement in a function so no backslash escapes
-      # are processed.
-      pattern = r'@\{' + str(k) + r'\}'
-      doc = re.sub(pattern, lambda match: v, doc)  # pylint: disable=cell-var-from-loop
-    fn.__doc__ = doc
-    return fn
-  return _fn_wrapped
-
-
-def _trim(docstring):
-  """Trims docstring indentation.
-
-  In general, multi-line docstrings carry their level of indentation when
-  defined under a function or class method. This function standardizes
-  indentation levels by removing them. Taken from PEP 257 docs.
-
-  Args:
-    docstring: Python string to trim indentation.
-
-  Returns:
-    Trimmed docstring.
-  """
-  if not docstring:
-    return ''
-  # Convert tabs to spaces (following the normal Python rules)
-  # and split into a list of lines:
-  lines = docstring.expandtabs().splitlines()
-  # Determine minimum indentation (first line doesn't count):
-  indent = None
-  for line in lines[1:]:
-    stripped = line.lstrip()
-    if stripped:
-      if indent is None:
-        indent = len(line) - len(stripped)
-      else:
-        indent = min(indent, len(line) - len(stripped))
-  # Remove indentation (first line is special):
-  trimmed = [lines[0].strip()]
-  if indent is not None:
-    for line in lines[1:]:
-      trimmed.append(line[indent:].rstrip())
-  # Strip off trailing and leading blank lines:
-  while trimmed and not trimmed[-1]:
-    trimmed.pop()
-  while trimmed and not trimmed[0]:
-    trimmed.pop(0)
-  # Return a single string:
-  return '\n'.join(trimmed)
diff --git a/tensorflow/contrib/bayesflow/python/ops/halton_sequence_impl.py b/tensorflow/contrib/bayesflow/python/ops/halton_sequence_impl.py
deleted file mode 100644
index 8cabf18903b5f15002470acdfb8fdd3ec31a7413..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/halton_sequence_impl.py
+++ /dev/null
@@ -1,264 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Quasi Monte Carlo support: Halton sequence.
-
-@@sample
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import math_ops
-
-
-__all__ = [
-    'sample',
-]
-
-
-# The maximum dimension we support. This is limited by the number of primes
-# in the _PRIMES array.
-_MAX_DIMENSION = 1000
-
-
-def sample(dim, num_samples=None, sample_indices=None, dtype=None, name=None):
-  r"""Returns a sample from the `dim`-dimensional Halton sequence.
-
-  Warning: The sequence elements take values only between 0 and 1. Care must be
-  taken to appropriately transform the domain of a function if it differs from
-  the unit cube before evaluating integrals using Halton samples. It is also
-  important to remember that quasi-random numbers are not a replacement for
-  pseudo-random numbers in every context. Quasi random numbers are completely
-  deterministic and typically have significant negative autocorrelation (unless
-  randomized).
-
-  Computes the members of the low discrepancy Halton sequence in dimension
-  `dim`. The d-dimensional sequence takes values in the unit hypercube in d
-  dimensions. Currently, only dimensions up to 1000 are supported. The prime
-  base for the `k`-th axis is the `k`-th prime starting from 2. For example,
-  if dim = 3, then the bases will be [2, 3, 5] respectively and the first
-  element of the sequence will be: [0.5, 0.333, 0.2]. For a more complete
-  description of the Halton sequences see:
-  https://en.wikipedia.org/wiki/Halton_sequence. For low discrepancy sequences
-  and their applications see:
-  https://en.wikipedia.org/wiki/Low-discrepancy_sequence.
-
-  The user must supply either `num_samples` or `sample_indices` but not both.
-  The former is the number of samples to produce starting from the first
-  element. If `sample_indices` is given instead, the specified elements of
-  the sequence are generated. For example, `sample_indices=tf.range(10)` is
-  equivalent to specifying `num_samples=10`.
-
-  Example Use:
-
-  ```python
-  bf = tf.contrib.bayesflow
-
-  # Produce the first 1000 members of the Halton sequence in 3 dimensions.
-  num_samples = 1000
-  dim = 3
-  sample = bf.halton_sequence.sample(dim, num_samples=num_samples)
-
-  # Evaluate the integral of x_1 * x_2^2 * x_3^3  over the three dimensional
-  # hypercube.
-  powers = tf.range(1.0, limit=dim + 1)
-  integral = tf.reduce_mean(tf.reduce_prod(sample ** powers, axis=-1))
-  true_value = 1.0 / tf.reduce_prod(powers + 1.0)
-  with tf.Session() as session:
-    values = session.run((integral, true_value))
-
-  # Produces a relative absolute error of 1.7%.
-  print ("Estimated: %f, True Value: %f" % values)
-
-  # Now skip the first 1000 samples and recompute the integral with the next
-  # thousand samples. The sample_indices argument can be used to do this.
-
-
-  sample_indices = tf.range(start=1000, limit=1000 + num_samples,
-                            dtype=tf.int32)
-  sample_leaped = bf.halton_sequence.sample(dim, sample_indices=sample_indices)
-
-  integral_leaped = tf.reduce_mean(tf.reduce_prod(sample_leaped ** powers,
-                                                  axis=-1))
-  with tf.Session() as session:
-    values = session.run((integral_leaped, true_value))
-  # Now produces a relative absolute error of 0.05%.
-  print ("Leaped Estimated: %f, True Value: %f" % values)
-  ```
-
-  Args:
-    dim: Positive Python `int` representing each sample's `event_size`. Must
-      not be greater than 1000.
-    num_samples: (Optional) positive Python `int`. The number of samples to
-      generate. Either this parameter or sample_indices must be specified but
-      not both. If this parameter is None, then the behaviour is determined by
-      the `sample_indices`.
-    sample_indices: (Optional) `Tensor` of dtype int32 and rank 1. The elements
-      of the sequence to compute specified by their position in the sequence.
-      The entries index into the Halton sequence starting with 0 and hence,
-      must be whole numbers. For example, sample_indices=[0, 5, 6] will produce
-      the first, sixth and seventh elements of the sequence. If this parameter
-      is None, then the `num_samples` parameter must be specified which gives
-      the number of desired samples starting from the first sample.
-    dtype: (Optional) The dtype of the sample. One of `float32` or `float64`.
-      Default is `float32`.
-    name: (Optional) Python `str` describing ops managed by this function. If
-      not supplied the name of this function is used.
-
-  Returns:
-    halton_elements: Elements of the Halton sequence. `Tensor` of supplied dtype
-    and `shape` `[num_samples, dim]` if `num_samples` was specified or shape
-    `[s, dim]` where s is the size of `sample_indices` if `sample_indices`
-    were specified.
-
-  Raises:
-    ValueError: if both `sample_indices` and `num_samples` were specified or
-    if dimension `dim` is less than 1 or greater than 1000.
-  """
-  if dim < 1 or dim > _MAX_DIMENSION:
-    raise ValueError(
-        'Dimension must be between 1 and {}. Supplied {}'.format(_MAX_DIMENSION,
-                                                                 dim))
-  if (num_samples is None) == (sample_indices is None):
-    raise ValueError('Either `num_samples` or `sample_indices` must be'
-                     ' specified but not both.')
-
-  dtype = dtype or dtypes.float32
-  if not dtype.is_floating:
-    raise ValueError('dtype must be of `float`-type')
-
-  with ops.name_scope(name, 'sample', values=[sample_indices]):
-    # Here and in the following, the shape layout is as follows:
-    # [sample dimension, event dimension, coefficient dimension].
-    # The coefficient dimension is an intermediate axis which will hold the
-    # weights of the starting integer when expressed in the (prime) base for
-    # an event dimension.
-    indices = _get_indices(num_samples, sample_indices, dtype)
-    radixes = array_ops.constant(_PRIMES[0:dim], dtype=dtype, shape=[dim, 1])
-
-    max_sizes_by_axes = _base_expansion_size(math_ops.reduce_max(indices),
-                                             radixes)
-
-    max_size = math_ops.reduce_max(max_sizes_by_axes)
-
-    # The powers of the radixes that we will need. Note that there is a bit
-    # of an excess here. Suppose we need the place value coefficients of 7
-    # in base 2 and 3. For 2, we will have 3 digits but we only need 2 digits
-    # for base 3. However, we can only create rectangular tensors so we
-    # store both expansions in a [2, 3] tensor. This leads to the problem that
-    # we might end up attempting to raise large numbers to large powers. For
-    # example, base 2 expansion of 1024 has 11 digits. If we were in 10
-    # dimensions, then for the 10th prime (29) we would end up computing 29^10
-    # even though we don't need it. We avoid this by setting the exponents for
-    # each axis to 0 beyond the maximum value needed for that dimension.
-    exponents_by_axes = array_ops.tile([math_ops.range(max_size)], [dim, 1])
-    weight_mask = exponents_by_axes > max_sizes_by_axes
-    capped_exponents = array_ops.where(
-        weight_mask, array_ops.zeros_like(exponents_by_axes), exponents_by_axes)
-    weights = radixes ** capped_exponents
-    coeffs = math_ops.floor_div(indices, weights)
-    coeffs *= 1 - math_ops.cast(weight_mask, dtype)
-    coeffs = (coeffs % radixes) / radixes
-    return math_ops.reduce_sum(coeffs / weights, axis=-1)
-
-
-def _get_indices(n, sample_indices, dtype, name=None):
-  """Generates starting points for the Halton sequence procedure.
-
-  The k'th element of the sequence is generated starting from a positive integer
-  which must be distinct for each `k`. It is conventional to choose the starting
-  point as `k` itself (or `k+1` if k is zero based). This function generates
-  the starting integers for the required elements and reshapes the result for
-  later use.
-
-  Args:
-    n: Positive `int`. The number of samples to generate. If this
-      parameter is supplied, then `sample_indices` should be None.
-    sample_indices: `Tensor` of dtype int32 and rank 1. The entries
-      index into the Halton sequence starting with 0 and hence, must be whole
-      numbers. For example, sample_indices=[0, 5, 6] will produce the first,
-      sixth and seventh elements of the sequence. If this parameter is not None
-      then `n` must be None.
-    dtype: The dtype of the sample. One of `float32` or `float64`.
-      Default is `float32`.
-    name: Python `str` name which describes ops created by this function.
-
-  Returns:
-    indices: `Tensor` of dtype `dtype` and shape = `[n, 1, 1]`.
-  """
-  with ops.name_scope(name, 'get_indices', [n, sample_indices]):
-    if sample_indices is None:
-      sample_indices = math_ops.range(n, dtype=dtype)
-    else:
-      sample_indices = math_ops.cast(sample_indices, dtype)
-
-    # Shift the indices so they are 1 based.
-    indices = sample_indices + 1
-
-    # Reshape to make space for the event dimension and the place value
-    # coefficients.
-    return array_ops.reshape(indices, [-1, 1, 1])
-
-
-def _base_expansion_size(num, bases):
-  """Computes the number of terms in the place value expansion.
-
-  Let num = a0 + a1 b + a2 b^2 + ... ak b^k be the place value expansion of
-  `num` in base b (ak <> 0). This function computes and returns `k+1`, the
-  number of terms, for each base `b` specified in `bases`.
-
-  This can be inferred from the base `b` logarithm of `num` as follows:
-    $$k + 1 = Floor(log_b (num)) + 1  = Floor( log(num) / log(b)) + 1$$
-
-  Args:
-    num: Scalar `Tensor` of dtype either `float32` or `float64`. The number to
-      compute the base expansion size of.
-    bases: `Tensor` of the same dtype as num. The bases to compute the size
-      against.
-
-  Returns:
-    Tensor of same dtype and shape as `bases` containing the size of num when
-    written in that base.
-  """
-  return math_ops.floor(math_ops.log(num) / math_ops.log(bases)) + 1
-
-
-def _primes_less_than(n):
-  # Based on
-  # https://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
-  """Returns sorted array of primes such that `2 <= prime < n`."""
-  small_primes = np.array((2, 3, 5))
-  if n <= 6:
-    return small_primes[small_primes < n]
-  sieve = np.ones(n // 3 + (n % 6 == 2), dtype=np.bool)
-  sieve[0] = False
-  m = int(n ** 0.5) // 3 + 1
-  for i in range(m):
-    if not sieve[i]:
-      continue
-    k = 3 * i + 1 | 1
-    sieve[k ** 2 // 3::2 * k] = False
-    sieve[(k ** 2 + 4 * k - 2 * k * (i & 1)) // 3::2 * k] = False
-  return np.r_[2, 3, 3 * np.nonzero(sieve)[0] + 1 | 1]
-
-_PRIMES = _primes_less_than(7919+1)
-
-assert len(_PRIMES) == _MAX_DIMENSION
diff --git a/tensorflow/contrib/bayesflow/python/ops/hmc.py b/tensorflow/contrib/bayesflow/python/ops/hmc.py
index 7fd5652c5c3e085b23c05baef6e3a42b7a42e08f..c8a5a195d3d709ded7afd09287255deab2ac2f3c 100644
--- a/tensorflow/contrib/bayesflow/python/ops/hmc.py
+++ b/tensorflow/contrib/bayesflow/python/ops/hmc.py
@@ -24,7 +24,6 @@ from tensorflow.python.util import all_util
 
 _allowed_symbols = [
     "sample_chain",
-    "sample_annealed_importance_chain",
     "kernel",
 ]
 
diff --git a/tensorflow/contrib/bayesflow/python/ops/hmc_impl.py b/tensorflow/contrib/bayesflow/python/ops/hmc_impl.py
index 82693c2b7bcdbca9f6f4a1d799be5728bb5d36bf..66afcc749746ab5c04114e585c5f93a3f3354d86 100644
--- a/tensorflow/contrib/bayesflow/python/ops/hmc_impl.py
+++ b/tensorflow/contrib/bayesflow/python/ops/hmc_impl.py
@@ -15,7 +15,6 @@
 """Hamiltonian Monte Carlo, a gradient-based MCMC algorithm.
 
 @@sample_chain
-@@sample_annealed_importance_chain
 @@kernel
 """
 
@@ -38,7 +37,6 @@ from tensorflow.python.ops.distributions import util as distributions_util
 
 __all__ = [
     "sample_chain",
-    "sample_annealed_importance_chain",
     "kernel",
 ]
 
@@ -330,221 +328,6 @@ def sample_chain(
     return functional_ops.scan(**scan_kwargs)
 
 
-def sample_annealed_importance_chain(
-    proposal_log_prob_fn,
-    num_steps,
-    target_log_prob_fn,
-    current_state,
-    step_size,
-    num_leapfrog_steps,
-    seed=None,
-    name=None):
-  """Runs annealed importance sampling (AIS) to estimate normalizing constants.
-
-  This function uses Hamiltonian Monte Carlo to sample from a series of
-  distributions that slowly interpolates between an initial "proposal"
-  distribution:
-
-  `exp(proposal_log_prob_fn(x) - proposal_log_normalizer)`
-
-  and the target distribution:
-
-  `exp(target_log_prob_fn(x) - target_log_normalizer)`,
-
-  accumulating importance weights along the way. The product of these
-  importance weights gives an unbiased estimate of the ratio of the
-  normalizing constants of the initial distribution and the target
-  distribution:
-
-  `E[exp(ais_weights)] = exp(target_log_normalizer - proposal_log_normalizer)`.
-
-  Note: `proposal_log_prob_fn` and `target_log_prob_fn` are called exactly three
-  times (although this may be reduced to two times, in the future).
-
-  #### Examples:
-
-  ##### Estimate the normalizing constant of a log-gamma distribution.
-
-  ```python
-  tfd = tf.contrib.distributions
-
-  # Run 100 AIS chains in parallel
-  num_chains = 100
-  dims = 20
-  dtype = np.float32
-
-  proposal = tfd.MultivariateNormalDiag(
-     loc=tf.zeros([dims], dtype=dtype))
-
-  target = tfd.TransformedDistribution(
-    distribution=tfd.Gamma(concentration=dtype(2),
-                           rate=dtype(3)),
-    bijector=tfd.bijectors.Invert(tfd.bijectors.Exp()),
-    event_shape=[dims])
-
-  chains_state, ais_weights, kernels_results = (
-      hmc.sample_annealed_importance_chain(
-          proposal_log_prob_fn=proposal.log_prob,
-          num_steps=1000,
-          target_log_prob_fn=target.log_prob,
-          step_size=0.2,
-          current_state=proposal.sample(num_chains),
-          num_leapfrog_steps=2))
-
-  log_estimated_normalizer = (tf.reduce_logsumexp(ais_weights)
-                              - np.log(num_chains))
-  log_true_normalizer = tf.lgamma(2.) - 2. * tf.log(3.)
-  ```
-
-  ##### Estimate marginal likelihood of a Bayesian regression model.
-
-  ```python
-  tfd = tf.contrib.distributions
-
-  def make_prior(dims, dtype):
-    return tfd.MultivariateNormalDiag(
-        loc=tf.zeros(dims, dtype))
-
-  def make_likelihood(weights, x):
-    return tfd.MultivariateNormalDiag(
-        loc=tf.tensordot(weights, x, axes=[[0], [-1]]))
-
-  # Run 100 AIS chains in parallel
-  num_chains = 100
-  dims = 10
-  dtype = np.float32
-
-  # Make training data.
-  x = np.random.randn(num_chains, dims).astype(dtype)
-  true_weights = np.random.randn(dims).astype(dtype)
-  y = np.dot(x, true_weights) + np.random.randn(num_chains)
-
-  # Setup model.
-  prior = make_prior(dims, dtype)
-  def target_log_prob_fn(weights):
-    return prior.log_prob(weights) + make_likelihood(weights, x).log_prob(y)
-
-  proposal = tfd.MultivariateNormalDiag(
-      loc=tf.zeros(dims, dtype))
-
-  weight_samples, ais_weights, kernel_results = (
-      hmc.sample_annealed_importance_chain(
-        num_steps=1000,
-        proposal_log_prob_fn=proposal.log_prob,
-        target_log_prob_fn=target_log_prob_fn,
-        current_state=tf.zeros([num_chains, dims], dtype),
-        step_size=0.1,
-        num_leapfrog_steps=2))
-  log_normalizer_estimate = (tf.reduce_logsumexp(ais_weights)
-                             - np.log(num_chains))
-  ```
-
-  Args:
-    proposal_log_prob_fn: Python callable that returns the log density of the
-      initial distribution.
-    num_steps: Integer number of Markov chain updates to run. More
-      iterations means more expense, but smoother annealing between q
-      and p, which in turn means exponentially lower variance for the
-      normalizing constant estimator.
-    target_log_prob_fn: Python callable which takes an argument like
-      `current_state` (or `*current_state` if it's a list) and returns its
-      (possibly unnormalized) log-density under the target distribution.
-    current_state: `Tensor` or Python `list` of `Tensor`s representing the
-      current state(s) of the Markov chain(s). The first `r` dimensions index
-      independent chains, `r = tf.rank(target_log_prob_fn(*current_state))`.
-    step_size: `Tensor` or Python `list` of `Tensor`s representing the step size
-      for the leapfrog integrator. Must broadcast with the shape of
-      `current_state`. Larger step sizes lead to faster progress, but too-large
-      step sizes make rejection exponentially more likely. When possible, it's
-      often helpful to match per-variable step sizes to the standard deviations
-      of the target distribution in each variable.
-    num_leapfrog_steps: Integer number of steps to run the leapfrog integrator
-      for. Total progress per HMC step is roughly proportional to `step_size *
-      num_leapfrog_steps`.
-    seed: Python integer to seed the random number generator.
-    name: Python `str` name prefixed to Ops created by this function.
-      Default value: `None` (i.e., "hmc_sample_annealed_importance_chain").
-
-  Returns:
-    next_state: `Tensor` or Python list of `Tensor`s representing the
-      state(s) of the Markov chain(s) at the final iteration. Has same shape as
-      input `current_state`.
-    ais_weights: Tensor with the estimated weight(s). Has shape matching
-      `target_log_prob_fn(current_state)`.
-    kernel_results: `collections.namedtuple` of internal calculations used to
-      advance the chain.
-  """
-  def make_convex_combined_log_prob_fn(iter_):
-    def _fn(*args):
-      p = proposal_log_prob_fn(*args)
-      t = target_log_prob_fn(*args)
-      dtype = p.dtype.base_dtype
-      beta = (math_ops.cast(iter_ + 1, dtype)
-              / math_ops.cast(num_steps, dtype))
-      return (1. - beta) * p + beta * t
-    return _fn
-
-  with ops.name_scope(
-      name, "hmc_sample_annealed_importance_chain",
-      [num_steps, current_state, step_size, num_leapfrog_steps, seed]):
-    with ops.name_scope("initialize"):
-      [
-          current_state,
-          step_size,
-          current_log_prob,
-          current_grads_log_prob,
-      ] = _prepare_args(
-          make_convex_combined_log_prob_fn(iter_=0),
-          current_state,
-          step_size,
-          description="convex_combined_log_prob")
-      num_steps = ops.convert_to_tensor(
-          num_steps,
-          dtype=dtypes.int32,
-          name="num_steps")
-      num_leapfrog_steps = ops.convert_to_tensor(
-          num_leapfrog_steps,
-          dtype=dtypes.int32,
-          name="num_leapfrog_steps")
-    def _loop_body(iter_, ais_weights, current_state, kernel_results):
-      """Closure which implements `tf.while_loop` body."""
-      current_state_parts = (list(current_state)
-                             if _is_list_like(current_state)
-                             else [current_state])
-      # TODO(b/72994218): Consider refactoring things to avoid this unnecessary
-      # call.
-      ais_weights += ((target_log_prob_fn(*current_state_parts)
-                       - proposal_log_prob_fn(*current_state_parts))
-                      / math_ops.cast(num_steps, ais_weights.dtype))
-      return [iter_ + 1, ais_weights] + list(kernel(
-          make_convex_combined_log_prob_fn(iter_),
-          current_state,
-          step_size,
-          num_leapfrog_steps,
-          seed,
-          kernel_results.current_target_log_prob,
-          kernel_results.current_grads_target_log_prob))
-
-    while_loop_kwargs = dict(
-        cond=lambda iter_, *args: iter_ < num_steps,
-        body=_loop_body,
-        loop_vars=[
-            np.int32(0),  # iter_
-            array_ops.zeros_like(current_log_prob),  # ais_weights
-            current_state,
-            _make_dummy_kernel_results(current_state,
-                                       current_log_prob,
-                                       current_grads_log_prob),
-        ])
-    if seed is not None:
-      while_loop_kwargs["parallel_iterations"] = 1
-
-    [ais_weights, current_state, kernel_results] = control_flow_ops.while_loop(
-        **while_loop_kwargs)[1:]  # Lop-off "iter_".
-
-    return [current_state, ais_weights, kernel_results]
-
-
 def kernel(target_log_prob_fn,
            current_state,
            step_size,
diff --git a/tensorflow/contrib/bayesflow/python/ops/layers.py b/tensorflow/contrib/bayesflow/python/ops/layers.py
deleted file mode 100644
index a742b7c1aa593d6c08bf9d8d597c99c9fc4e7aed..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/layers.py
+++ /dev/null
@@ -1,67 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Probabilistic neural layers.
-
-See ${python/contrib.bayesflow.layers}.
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-# go/tf-wildcard-import
-# pylint: disable=wildcard-import
-from tensorflow.contrib.bayesflow.python.ops.layers_conv_variational import *
-from tensorflow.contrib.bayesflow.python.ops.layers_dense_variational import *
-from tensorflow.contrib.bayesflow.python.ops.layers_util import *
-# pylint: enable=wildcard-import
-from tensorflow.python.util.all_util import remove_undocumented
-
-_allowed_symbols = [
-    'Convolution1DReparameterization',
-    'Convolution2DReparameterization',
-    'Convolution3DReparameterization',
-    'Convolution1DFlipout',
-    'Convolution2DFlipout',
-    'Convolution3DFlipout',
-    'Conv1DReparameterization',
-    'Conv2DReparameterization',
-    'Conv3DReparameterization',
-    'Conv1DFlipout',
-    'Conv2DFlipout',
-    'Conv3DFlipout',
-    'convolution1d_reparameterization',
-    'convolution2d_reparameterization',
-    'convolution3d_reparameterization',
-    'convolution1d_flipout',
-    'convolution2d_flipout',
-    'convolution3d_flipout',
-    'conv1d_reparameterization',
-    'conv2d_reparameterization',
-    'conv3d_reparameterization',
-    'conv1d_flipout',
-    'conv2d_flipout',
-    'conv3d_flipout',
-    'DenseReparameterization',
-    'DenseLocalReparameterization',
-    'DenseFlipout',
-    'dense_reparameterization',
-    'dense_local_reparameterization',
-    'dense_flipout',
-    'default_loc_scale_fn',
-    'default_mean_field_normal_fn',
-]
-
-remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/bayesflow/python/ops/layers_conv_variational.py b/tensorflow/contrib/bayesflow/python/ops/layers_conv_variational.py
deleted file mode 100644
index cb80718f719ff31fb8ba5066170342fc69630780..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/layers_conv_variational.py
+++ /dev/null
@@ -1,2486 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Convolutional variational layer classes and their functional aliases.
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.contrib.bayesflow.python.ops import docstring_util
-from tensorflow.contrib.bayesflow.python.ops import layers_util
-from tensorflow.contrib.distributions.python.ops import independent as independent_lib
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_shape
-from tensorflow.python.layers import base as layers_lib
-from tensorflow.python.layers import utils
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import nn
-from tensorflow.python.ops import nn_ops
-from tensorflow.python.ops import standard_ops
-from tensorflow.python.ops.distributions import kullback_leibler as kl_lib
-from tensorflow.python.ops.distributions import normal as normal_lib
-from tensorflow.python.ops.distributions import util as distribution_util
-
-doc_args = """activation: Activation function. Set it to None to maintain a
-      linear activation.
-  activity_regularizer: Optional regularizer function for the output.
-  trainable: Boolean, if `True` also add variables to the graph collection
-    `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-  kernel_posterior_fn: Python `callable` which creates
-    `tf.distributions.Distribution` instance representing the surrogate
-    posterior of the `kernel` parameter. Default value:
-    `default_mean_field_normal_fn()`.
-  kernel_posterior_tensor_fn: Python `callable` which takes a
-    `tf.distributions.Distribution` instance and returns a representative
-    value. Default value: `lambda d: d.sample()`.
-  kernel_prior_fn: Python `callable` which creates `tf.distributions`
-    instance. See `default_mean_field_normal_fn` docstring for required
-    parameter signature.
-    Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-  kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-    distribution, prior distribution and random variate sample(s) from the
-    surrogate posterior and computes or approximates the KL divergence. The
-    distributions are `tf.distributions.Distribution`-like instances and the
-    sample is a `Tensor`.
-  bias_posterior_fn: Python `callable` which creates
-    `tf.distributions.Distribution` instance representing the surrogate
-    posterior of the `bias` parameter. Default value:
-    `default_mean_field_normal_fn(is_singular=True)` (which creates an
-    instance of `tf.distributions.Deterministic`).
-  bias_posterior_tensor_fn: Python `callable` which takes a
-    `tf.distributions.Distribution` instance and returns a representative
-    value. Default value: `lambda d: d.sample()`.
-  bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-    See `default_mean_field_normal_fn` docstring for required parameter
-    signature. Default value: `None` (no prior, no variational inference).
-  bias_divergence_fn: Python `callable` which takes the surrogate posterior
-    distribution, prior distribution and random variate sample(s) from the
-    surrogate posterior and computes or approximates the KL divergence. The
-    distributions are `tf.distributions.Distribution`-like instances and the
-    sample is a `Tensor`.
-  name: A string, the name of the layer."""
-
-
-class _ConvVariational(layers_lib.Layer):
-  """Abstract nD convolution layer (private, used as implementation base).
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    rank: Python integer, dimensionality of convolution.
-    filters: Python integer, dimensionality of the output space.
-    kernel_size: Size of the convolution window.
-    strides: Stride length of convolution.
-    padding: Python string describing padding approach.
-    data_format: Python string describing input data's dimensions.
-    dilation_rate: Dilation rate for an atrous convolution.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      rank,
-      filters,
-      kernel_size,
-      strides=1,
-      padding="valid",
-      data_format="channels_last",
-      dilation_rate=1,
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      rank: An integer, the rank of the convolution, e.g. "2" for 2D
-        convolution.
-      filters: Integer, the dimensionality of the output space (i.e. the number
-        of filters in the convolution).
-      kernel_size: An integer or tuple/list of n integers, specifying the
-        length of the convolution window.
-      strides: An integer or tuple/list of n integers,
-        specifying the stride length of the convolution.
-        Specifying any stride value != 1 is incompatible with specifying
-        any `dilation_rate` value != 1.
-      padding: One of `"valid"` or `"same"` (case-insensitive).
-      data_format: A string, one of `channels_last` (default) or
-        `channels_first`. The ordering of the dimensions in the inputs.
-        `channels_last` corresponds to inputs with shape `(batch, ...,
-        channels)` while `channels_first` corresponds to inputs with shape
-        `(batch, channels, ...)`.
-      dilation_rate: An integer or tuple/list of n integers, specifying
-        the dilation rate to use for dilated convolution.
-        Currently, specifying any `dilation_rate` value != 1 is
-        incompatible with specifying any `strides` value != 1.
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(_ConvVariational, self).__init__(
-        trainable=trainable,
-        name=name,
-        activity_regularizer=activity_regularizer,
-        **kwargs)
-    self.rank = rank
-    self.filters = filters
-    self.kernel_size = utils.normalize_tuple(kernel_size, rank, "kernel_size")
-    self.strides = utils.normalize_tuple(strides, rank, "strides")
-    self.padding = utils.normalize_padding(padding)
-    self.data_format = utils.normalize_data_format(data_format)
-    self.dilation_rate = utils.normalize_tuple(
-        dilation_rate, rank, "dilation_rate")
-    self.activation = activation
-    self.input_spec = layers_lib.InputSpec(ndim=self.rank + 2)
-    self.kernel_posterior_fn = kernel_posterior_fn
-    self.kernel_posterior_tensor_fn = kernel_posterior_tensor_fn
-    self.kernel_prior_fn = kernel_prior_fn
-    self.kernel_divergence_fn = kernel_divergence_fn
-    self.bias_posterior_fn = bias_posterior_fn
-    self.bias_posterior_tensor_fn = bias_posterior_tensor_fn
-    self.bias_prior_fn = bias_prior_fn
-    self.bias_divergence_fn = bias_divergence_fn
-
-  def build(self, input_shape):
-    input_shape = tensor_shape.TensorShape(input_shape)
-    if self.data_format == "channels_first":
-      channel_axis = 1
-    else:
-      channel_axis = -1
-    if input_shape[channel_axis].value is None:
-      raise ValueError("The channel dimension of the inputs "
-                       "should be defined. Found `None`.")
-    input_dim = input_shape[channel_axis].value
-    kernel_shape = self.kernel_size + (input_dim, self.filters)
-    dtype = dtypes.as_dtype(self.dtype)
-
-    # Must have a posterior kernel.
-    self.kernel_posterior = self.kernel_posterior_fn(
-        dtype, kernel_shape, "kernel_posterior",
-        self.trainable, self.add_variable)
-
-    if self.kernel_prior_fn is None:
-      self.kernel_prior = None
-    else:
-      self.kernel_prior = self.kernel_prior_fn(
-          dtype, kernel_shape, "kernel_prior",
-          self.trainable, self.add_variable)
-    self._built_kernel_divergence = False
-
-    if self.bias_posterior_fn is None:
-      self.bias_posterior = None
-    else:
-      self.bias_posterior = self.bias_posterior_fn(
-          dtype, (self.filters,), "bias_posterior",
-          self.trainable, self.add_variable)
-
-    if self.bias_prior_fn is None:
-      self.bias_prior = None
-    else:
-      self.bias_prior = self.bias_prior_fn(
-          dtype, (self.filters,), "bias_prior",
-          self.trainable, self.add_variable)
-    self._built_bias_divergence = False
-
-    self.input_spec = layers_lib.InputSpec(ndim=self.rank + 2,
-                                           axes={channel_axis: input_dim})
-    self._convolution_op = nn_ops.Convolution(
-        input_shape,
-        filter_shape=tensor_shape.TensorShape(kernel_shape),
-        dilation_rate=self.dilation_rate,
-        strides=self.strides,
-        padding=self.padding.upper(),
-        data_format=utils.convert_data_format(self.data_format,
-                                              self.rank + 2))
-
-    self.built = True
-
-  def call(self, inputs):
-    inputs = ops.convert_to_tensor(inputs, dtype=self.dtype)
-
-    outputs = self._apply_variational_kernel(inputs)
-    outputs = self._apply_variational_bias(outputs)
-    if self.activation is not None:
-      outputs = self.activation(outputs)
-    if not self._built_kernel_divergence:
-      kernel_posterior = self.kernel_posterior
-      kernel_prior = self.kernel_prior
-      if isinstance(self.kernel_posterior, independent_lib.Independent):
-        kernel_posterior = kernel_posterior.distribution
-      if isinstance(self.kernel_prior, independent_lib.Independent):
-        kernel_prior = kernel_prior.distribution
-      self._apply_divergence(self.kernel_divergence_fn,
-                             kernel_posterior,
-                             kernel_prior,
-                             self.kernel_posterior_tensor,
-                             name="divergence_kernel")
-      self._built_kernel_divergence = True
-    if not self._built_bias_divergence:
-      bias_posterior = self.bias_posterior
-      bias_prior = self.bias_prior
-      if isinstance(self.bias_posterior, independent_lib.Independent):
-        bias_posterior = bias_posterior.distribution
-      if isinstance(self.bias_prior, independent_lib.Independent):
-        bias_prior = bias_prior.distribution
-      self._apply_divergence(self.bias_divergence_fn,
-                             bias_posterior,
-                             bias_prior,
-                             self.bias_posterior_tensor,
-                             name="divergence_bias")
-      self._built_bias_divergence = True
-    return outputs
-
-  def _apply_variational_bias(self, inputs):
-    if self.bias_posterior is None:
-      self.bias_posterior_tensor = None
-      return inputs
-    self.bias_posterior_tensor = self.bias_posterior_tensor_fn(
-        self.bias_posterior)
-    outputs = inputs
-    if self.data_format == "channels_first":
-      if self.rank == 1:
-        # nn.bias_add does not accept a 1D input tensor.
-        bias = array_ops.reshape(self.bias_posterior_tensor,
-                                 (1, self.filters, 1))
-        outputs += bias
-      if self.rank == 2:
-        outputs = nn.bias_add(outputs,
-                              self.bias_posterior_tensor,
-                              data_format="NCHW")
-      if self.rank == 3:
-        # As of Mar 2017, direct addition is significantly slower than
-        # bias_add when computing gradients. To use bias_add, we collapse Z
-        # and Y into a single dimension to obtain a 4D input tensor.
-        outputs_shape = outputs.shape.as_list()
-        outputs_4d = array_ops.reshape(outputs,
-                                       [outputs_shape[0], outputs_shape[1],
-                                        outputs_shape[2] * outputs_shape[3],
-                                        outputs_shape[4]])
-        outputs_4d = nn.bias_add(outputs_4d,
-                                 self.bias_posterior_tensor,
-                                 data_format="NCHW")
-        outputs = array_ops.reshape(outputs_4d, outputs_shape)
-    else:
-      outputs = nn.bias_add(outputs,
-                            self.bias_posterior_tensor,
-                            data_format="NHWC")
-    return outputs
-
-  def _apply_divergence(self, divergence_fn, posterior, prior,
-                        posterior_tensor, name):
-    if (divergence_fn is None or
-        posterior is None or
-        prior is None):
-      divergence = None
-      return
-    divergence = standard_ops.identity(
-        divergence_fn(
-            posterior, prior, posterior_tensor),
-        name=name)
-    self.add_loss(divergence)
-
-  def _compute_output_shape(self, input_shape):
-    input_shape = tensor_shape.TensorShape(input_shape).as_list()
-    if self.data_format == "channels_last":
-      space = input_shape[1:-1]
-      new_space = []
-      for i in range(len(space)):
-        new_dim = utils.conv_output_length(
-            space[i],
-            self.kernel_size[i],
-            padding=self.padding,
-            stride=self.strides[i],
-            dilation=self.dilation_rate[i])
-        new_space.append(new_dim)
-      return tensor_shape.TensorShape([input_shape[0]] + new_space +
-                                      [self.filters])
-    else:
-      space = input_shape[2:]
-      new_space = []
-      for i in range(len(space)):
-        new_dim = utils.conv_output_length(
-            space[i],
-            self.kernel_size[i],
-            padding=self.padding,
-            stride=self.strides[i],
-            dilation=self.dilation_rate[i])
-        new_space.append(new_dim)
-      return tensor_shape.TensorShape([input_shape[0], self.filters] +
-                                      new_space)
-
-
-class _ConvReparameterization(_ConvVariational):
-  """Abstract nD convolution layer (private, used as implementation base).
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the reparameterization
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    rank: Python integer, dimensionality of convolution.
-    filters: Python integer, dimensionality of the output space.
-    kernel_size: Size of the convolution window.
-    strides: Stride length of convolution.
-    padding: Python string describing padding approach.
-    data_format: Python string describing input data's dimensions.
-    dilation_rate: Dilation rate for an atrous convolution.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-
-  [1]: "Auto-Encoding Variational Bayes."
-        Diederik P. Kingma, Max Welling.
-        International Conference on Learning Representations, 2014.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      rank,
-      filters,
-      kernel_size,
-      strides=1,
-      padding="valid",
-      data_format="channels_last",
-      dilation_rate=1,
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      rank: An integer, the rank of the convolution, e.g. "2" for 2D
-        convolution.
-      filters: Integer, the dimensionality of the output space (i.e. the number
-        of filters in the convolution).
-      kernel_size: An integer or tuple/list of n integers, specifying the
-        length of the convolution window.
-      strides: An integer or tuple/list of n integers,
-        specifying the stride length of the convolution.
-        Specifying any stride value != 1 is incompatible with specifying
-        any `dilation_rate` value != 1.
-      padding: One of `"valid"` or `"same"` (case-insensitive).
-      data_format: A string, one of `channels_last` (default) or
-        `channels_first`. The ordering of the dimensions in the inputs.
-        `channels_last` corresponds to inputs with shape `(batch, ...,
-        channels)` while `channels_first` corresponds to inputs with shape
-        `(batch, channels, ...)`.
-      dilation_rate: An integer or tuple/list of n integers, specifying
-        the dilation rate to use for dilated convolution.
-        Currently, specifying any `dilation_rate` value != 1 is
-        incompatible with specifying any `strides` value != 1.
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(_ConvReparameterization, self).__init__(
-        rank=rank,
-        filters=filters,
-        kernel_size=kernel_size,
-        strides=strides,
-        padding=padding,
-        data_format=data_format,
-        dilation_rate=dilation_rate,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        name=name, **kwargs)
-
-  def _apply_variational_kernel(self, inputs):
-    self.kernel_posterior_tensor = self.kernel_posterior_tensor_fn(
-        self.kernel_posterior)
-    self.kernel_posterior_affine = None
-    self.kernel_posterior_affine_tensor = None
-    outputs = self._convolution_op(inputs, self.kernel_posterior_tensor)
-    return outputs
-
-
-class Conv1DReparameterization(_ConvReparameterization):
-  """1D convolution layer (e.g. temporal convolution).
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the reparameterization
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    filters: Python integer, dimensionality of the output space.
-    kernel_size: Size of the convolution window.
-    strides: Stride length of convolution.
-    padding: Python string describing padding approach.
-    data_format: Python string describing input data's dimensions.
-    dilation_rate: Dilation rate for an atrous convolution.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 128, 1])
-  net = tfp.layers.Conv1DReparameterization(64,
-                                            kernel_size=5,
-                                            padding="SAME",
-                                            activation=tf.nn.relu)(net)
-  net = tf.reshape(net, [-1, 128 * 64])
-  logits = tfp.layers.DenseReparameterization(10)(net)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses reparameterization gradients to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms which are arguments to the layer.
-
-  [1]: "Auto-Encoding Variational Bayes."
-        Diederik P. Kingma, Max Welling.
-        International Conference on Learning Representations, 2014.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      filters,
-      kernel_size,
-      strides=1,
-      padding="valid",
-      data_format="channels_last",
-      dilation_rate=1,
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      filters: Integer, the dimensionality of the output space (i.e. the number
-        of filters in the convolution).
-      kernel_size: An integer or tuple/list of a single integer, specifying the
-        length of the 1D convolution window.
-      strides: An integer or tuple/list of a single integer,
-        specifying the stride length of the convolution.
-        Specifying any stride value != 1 is incompatible with specifying
-        any `dilation_rate` value != 1.
-      padding: One of `"valid"` or `"same"` (case-insensitive).
-      data_format: A string, one of `channels_last` (default) or
-        `channels_first`. The ordering of the dimensions in the inputs.
-        `channels_last` corresponds to inputs with shape `(batch, length,
-        channels)` while `channels_first` corresponds to inputs with shape
-        `(batch, channels, length)`.
-      dilation_rate: An integer or tuple/list of a single integer, specifying
-        the dilation rate to use for dilated convolution.
-        Currently, specifying any `dilation_rate` value != 1 is
-        incompatible with specifying any `strides` value != 1.
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(Conv1DReparameterization, self).__init__(
-        rank=1,
-        filters=filters,
-        kernel_size=kernel_size,
-        strides=strides,
-        padding=padding,
-        data_format=data_format,
-        dilation_rate=dilation_rate,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        name=name, **kwargs)
-
-
-@docstring_util.expand_docstring(args=doc_args)
-def conv1d_reparameterization(
-    inputs,
-    filters,
-    kernel_size,
-    strides=1,
-    padding="valid",
-    data_format="channels_last",
-    dilation_rate=1,
-    activation=None,
-    activity_regularizer=None,
-    trainable=True,
-    kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-    kernel_posterior_tensor_fn=lambda d: d.sample(),
-    kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-        loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-    kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-    bias_posterior_tensor_fn=lambda d: d.sample(),
-    bias_prior_fn=None,
-    bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    name=None,
-    reuse=None):
-  # pylint: disable=g-doc-args
-  """Functional interface for 1D convolution layer (e.g. temporal convolution).
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the reparameterization
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Args:
-    inputs: Tensor input.
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of a single integer, specifying the
-      length of the 1D convolution window.
-    strides: An integer or tuple/list of a single integer,
-      specifying the stride length of the convolution.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, length, channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, length)`.
-    dilation_rate: An integer or tuple/list of a single integer, specifying
-      the dilation rate to use for dilated convolution.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any `strides` value != 1.
-    @{args}
-    reuse: Boolean, whether to reuse the weights of a previous layer
-      by the same name.
-
-  Returns:
-    Output tensor.
-
-  Raises:
-    ValueError: if eager execution is enabled.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 128, 1])
-  net = tfp.layers.conv1d_reparameterization(net,
-                                             filters=64,
-                                             kernel_size=5,
-                                             padding="SAME",
-                                             activation=tf.nn.relu)
-  net = tf.reshape(net, [-1, 128 * 64])
-  logits = tfp.layers.dense_reparameterization(net, 10)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses reparameterization gradients to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. This objective is the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via the regularizer
-  terms that the layer takes as arguments.
-
-  [1]: "Auto-Encoding Variational Bayes."
-        Diederik P. Kingma, Max Welling.
-        International Conference on Learning Representations, 2014.
-  """
-  # pylint: enable=g-doc-args
-  layer = Conv1DReparameterization(
-      filters=filters,
-      kernel_size=kernel_size,
-      strides=strides,
-      padding=padding,
-      data_format=data_format,
-      dilation_rate=dilation_rate,
-      activation=activation,
-      activity_regularizer=activity_regularizer,
-      trainable=trainable,
-      kernel_posterior_fn=kernel_posterior_fn,
-      kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-      kernel_prior_fn=kernel_prior_fn,
-      kernel_divergence_fn=kernel_divergence_fn,
-      bias_posterior_fn=bias_posterior_fn,
-      bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-      bias_prior_fn=bias_prior_fn,
-      bias_divergence_fn=bias_divergence_fn,
-      name=name,
-      dtype=inputs.dtype.base_dtype,
-      _scope=name,
-      _reuse=reuse)
-  return layer.apply(inputs)
-
-
-class Conv2DReparameterization(_ConvReparameterization):
-  """2D convolution layer (e.g. spatial convolution over images).
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the reparameterization
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    filters: Python integer, dimensionality of the output space.
-    kernel_size: Size of the convolution window.
-    strides: Stride length of convolution.
-    padding: Python string describing padding approach.
-    data_format: Python string describing input data's dimensions.
-    dilation_rate: Dilation rate for an atrous convolution.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 32, 32, 3])
-  net = tfp.layers.Conv2DReparameterization(64,
-                                            kernel_size=5,
-                                            padding="SAME",
-                                            activation=tf.nn.relu)(net)
-  net = tf.layers.MaxPooling2D(pool_size=2,
-                               strides=2,
-                               padding="SAME")(net)
-  net = tf.reshape(net, [-1, 16 * 16 * 64])
-  logits = tfp.layers.DenseReparameterization(10)(net)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses reparameterization gradients to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. This objective is the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via the regularizer
-  terms that the layer takes as arguments.
-
-  [1]: "Auto-Encoding Variational Bayes."
-        Diederik P. Kingma, Max Welling.
-        International Conference on Learning Representations, 2014.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      filters,
-      kernel_size,
-      strides=(1, 1),
-      padding="valid",
-      data_format="channels_last",
-      dilation_rate=(1, 1),
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      filters: Integer, the dimensionality of the output space (i.e. the number
-        of filters in the convolution).
-      kernel_size: An integer or tuple/list of 2 integers, specifying the
-        height and width of the 2D convolution window.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-      strides: An integer or tuple/list of 2 integers,
-        specifying the strides of the convolution along the height and width.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-        Specifying any stride value != 1 is incompatible with specifying
-        any `dilation_rate` value != 1.
-      padding: One of `"valid"` or `"same"` (case-insensitive).
-      data_format: A string, one of `channels_last` (default) or
-        `channels_first`. The ordering of the dimensions in the inputs.
-        `channels_last` corresponds to inputs with shape `(batch, height,
-        width, channels)` while `channels_first` corresponds to inputs with
-        shape `(batch, channels, height, width)`.
-      dilation_rate: An integer or tuple/list of 2 integers, specifying
-        the dilation rate to use for dilated convolution.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-        Currently, specifying any `dilation_rate` value != 1 is
-        incompatible with specifying any stride value != 1.
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(Conv2DReparameterization, self).__init__(
-        rank=2,
-        filters=filters,
-        kernel_size=kernel_size,
-        strides=strides,
-        padding=padding,
-        data_format=data_format,
-        dilation_rate=dilation_rate,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        name=name, **kwargs)
-
-
-@docstring_util.expand_docstring(args=doc_args)
-def conv2d_reparameterization(
-    inputs,
-    filters,
-    kernel_size,
-    strides=(1, 1),
-    padding="valid",
-    data_format="channels_last",
-    dilation_rate=(1, 1),
-    activation=None,
-    activity_regularizer=None,
-    trainable=True,
-    kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-    kernel_posterior_tensor_fn=lambda d: d.sample(),
-    kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-        loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-    kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-    bias_posterior_tensor_fn=lambda d: d.sample(),
-    bias_prior_fn=None,
-    bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    name=None,
-    reuse=None):
-  # pylint: disable=g-doc-args
-  """Functional interface for the 2D convolution layer.
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the reparameterization
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Args:
-    inputs: Tensor input.
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of 2 integers, specifying the
-      height and width of the 2D convolution window.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-    strides: An integer or tuple/list of 2 integers,
-      specifying the strides of the convolution along the height and width.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, height, width, channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, height, width)`.
-    dilation_rate: An integer or tuple/list of 2 integers, specifying
-      the dilation rate to use for dilated convolution.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any stride value != 1.
-    @{args}
-    reuse: Boolean, whether to reuse the weights of a previous layer
-      by the same name.
-
-  Returns:
-    Output tensor.
-
-  Raises:
-    ValueError: if eager execution is enabled.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 32, 32, 3])
-  net = tfp.layers.conv2d_reparameterization(net,
-                                             filters=64,
-                                             kernel_size=5,
-                                             padding="SAME",
-                                             activation=tf.nn.relu)
-  net = tf.layers.max_pooling2d(net,
-                                pool_size=2,
-                                strides=2,
-                                padding="SAME")
-  net = tf.reshape(net, [-1, 16 * 16 * 64])
-  logits = tfp.layers.dense_reparameterization(net, 10)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses reparameterization gradients to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. This objective is the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via the regularizer
-  terms that the layer takes as arguments.
-
-  [1]: "Auto-Encoding Variational Bayes."
-        Diederik P. Kingma, Max Welling.
-        International Conference on Learning Representations, 2014.
-  """
-  # pylint: enable=g-doc-args
-  layer = Conv2DReparameterization(
-      filters=filters,
-      kernel_size=kernel_size,
-      strides=strides,
-      padding=padding,
-      data_format=data_format,
-      dilation_rate=dilation_rate,
-      activation=activation,
-      activity_regularizer=activity_regularizer,
-      trainable=trainable,
-      kernel_posterior_fn=kernel_posterior_fn,
-      kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-      kernel_prior_fn=kernel_prior_fn,
-      kernel_divergence_fn=kernel_divergence_fn,
-      bias_posterior_fn=bias_posterior_fn,
-      bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-      bias_prior_fn=bias_prior_fn,
-      bias_divergence_fn=bias_divergence_fn,
-      name=name,
-      dtype=inputs.dtype.base_dtype,
-      _scope=name,
-      _reuse=reuse)
-  return layer.apply(inputs)
-
-
-class Conv3DReparameterization(_ConvReparameterization):
-  """3D convolution layer (e.g. spatial convolution over volumes).
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the reparameterization
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    filters: Python integer, dimensionality of the output space.
-    kernel_size: Size of the convolution window.
-    strides: Stride length of convolution.
-    padding: Python string describing padding approach.
-    data_format: Python string describing input data's dimensions.
-    dilation_rate: Dilation rate for an atrous convolution.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 256, 32, 32, 3])
-  net = tfp.layers.Conv3DReparameterization(64,
-                                            kernel_size=5,
-                                            padding="SAME",
-                                            activation=tf.nn.relu)(net)
-  net = tf.layers.MaxPooling3D(pool_size=2,
-                               strides=2,
-                               padding="SAME")(net)
-  net = tf.reshape(net, [-1, 128 * 16 * 16 * 64])
-  logits = tfp.layers.DenseReparameterization(10)(net)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses reparameterization gradients to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. This objective is the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via the regularizer
-  terms that the layer takes as arguments.
-
-  [1]: "Auto-Encoding Variational Bayes."
-        Diederik P. Kingma, Max Welling.
-        International Conference on Learning Representations, 2014.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      filters,
-      kernel_size,
-      strides=(1, 1, 1),
-      padding="valid",
-      data_format="channels_last",
-      dilation_rate=(1, 1, 1),
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      filters: Integer, the dimensionality of the output space (i.e. the number
-        of filters in the convolution).
-      kernel_size: An integer or tuple/list of 3 integers, specifying the
-        depth, height and width of the 3D convolution window.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-      strides: An integer or tuple/list of 3 integers,
-        specifying the strides of the convolution along the depth,
-        height and width.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-        Specifying any stride value != 1 is incompatible with specifying
-        any `dilation_rate` value != 1.
-      padding: One of `"valid"` or `"same"` (case-insensitive).
-      data_format: A string, one of `channels_last` (default) or
-        `channels_first`. The ordering of the dimensions in the inputs.
-        `channels_last` corresponds to inputs with shape `(batch, depth,
-        height, width, channels)` while `channels_first` corresponds to inputs
-        with shape `(batch, channels, depth, height, width)`.
-      dilation_rate: An integer or tuple/list of 3 integers, specifying
-        the dilation rate to use for dilated convolution.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-        Currently, specifying any `dilation_rate` value != 1 is
-        incompatible with specifying any stride value != 1.
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(Conv3DReparameterization, self).__init__(
-        rank=3,
-        filters=filters,
-        kernel_size=kernel_size,
-        strides=strides,
-        padding=padding,
-        data_format=data_format,
-        dilation_rate=dilation_rate,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        name=name, **kwargs)
-
-
-@docstring_util.expand_docstring(args=doc_args)
-def conv3d_reparameterization(
-    inputs,
-    filters,
-    kernel_size,
-    strides=(1, 1, 1),
-    padding="valid",
-    data_format="channels_last",
-    dilation_rate=(1, 1, 1),
-    activation=None,
-    activity_regularizer=None,
-    trainable=True,
-    kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-    kernel_posterior_tensor_fn=lambda d: d.sample(),
-    kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-        loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-    kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-    bias_posterior_tensor_fn=lambda d: d.sample(),
-    bias_prior_fn=None,
-    bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    name=None,
-    reuse=None):
-  # pylint: disable=g-doc-args
-  """Functional interface for the 3D convolution layer.
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the reparameterization
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Args:
-    inputs: Tensor input.
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of 3 integers, specifying the
-      depth, height and width of the 3D convolution window.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-    strides: An integer or tuple/list of 3 integers,
-      specifying the strides of the convolution along the depth,
-      height and width.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, depth, height, width, channels)` while `channels_first`
-      corresponds to inputs with shape
-      `(batch, channels, depth, height, width)`.
-    dilation_rate: An integer or tuple/list of 3 integers, specifying
-      the dilation rate to use for dilated convolution.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any stride value != 1.
-    @{args}
-    reuse: Boolean, whether to reuse the weights of a previous layer
-      by the same name.
-
-  Returns:
-    Output tensor.
-
-  Raises:
-    ValueError: if eager execution is enabled.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 256, 32, 32, 3])
-  net = tfp.layers.conv3d_reparameterization(net,
-                                             filters=64,
-                                             kernel_size=5,
-                                             padding="SAME",
-                                             activation=tf.nn.relu)
-  net = tf.layers.max_pooling3d(net,
-                                pool_size=2,
-                                strides=2,
-                                padding="SAME")
-  net = tf.reshape(net, [-1, 128 * 16 * 16 * 64])
-  logits = tfp.layers.dense_reparameterization(net, 10)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses reparameterization gradients to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms that are arguments to the layer.
-
-  [1]: "Auto-Encoding Variational Bayes."
-        Diederik P. Kingma, Max Welling.
-        International Conference on Learning Representations, 2014.
-  """
-  # pylint: enable=g-doc-args
-  layer = Conv3DReparameterization(
-      filters=filters,
-      kernel_size=kernel_size,
-      strides=strides,
-      padding=padding,
-      data_format=data_format,
-      dilation_rate=dilation_rate,
-      activation=activation,
-      activity_regularizer=activity_regularizer,
-      trainable=trainable,
-      kernel_posterior_fn=kernel_posterior_fn,
-      kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-      kernel_prior_fn=kernel_prior_fn,
-      kernel_divergence_fn=kernel_divergence_fn,
-      bias_posterior_fn=bias_posterior_fn,
-      bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-      bias_prior_fn=bias_prior_fn,
-      bias_divergence_fn=bias_divergence_fn,
-      name=name,
-      dtype=inputs.dtype.base_dtype,
-      _scope=name,
-      _reuse=reuse)
-  return layer.apply(inputs)
-
-
-class _ConvFlipout(_ConvVariational):
-  """Abstract nD convolution layer (private, used as implementation base).
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the Flipout
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`. Flipout uses
-  roughly twice as many floating point operations as the
-  reparameterization estimator but has the advantage of significantly
-  lower variance.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    rank: Python integer, dimensionality of convolution.
-    filters: Python integer, dimensionality of the output space.
-    kernel_size: Size of the convolution window.
-    strides: Stride length of convolution.
-    padding: Python string describing padding approach.
-    data_format: Python string describing input data's dimensions.
-    dilation_rate: Dilation rate for an atrous convolution.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-    seed: Python integer, used to create random seeds.
-
-  [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
-        Mini-Batches."
-        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
-        International Conference on Learning Representations, 2018.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      rank,
-      filters,
-      kernel_size,
-      strides=1,
-      padding="valid",
-      data_format="channels_last",
-      dilation_rate=1,
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      seed=None,
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      rank: An integer, the rank of the convolution, e.g. "2" for 2D
-        convolution.
-      filters: Integer, the dimensionality of the output space (i.e. the number
-        of filters in the convolution).
-      kernel_size: An integer or tuple/list of n integers, specifying the
-        length of the convolution window.
-      strides: An integer or tuple/list of n integers,
-        specifying the stride length of the convolution.
-        Specifying any stride value != 1 is incompatible with specifying
-        any `dilation_rate` value != 1.
-      padding: One of `"valid"` or `"same"` (case-insensitive).
-      data_format: A string, one of `channels_last` (default) or
-        `channels_first`. The ordering of the dimensions in the inputs.
-        `channels_last` corresponds to inputs with shape `(batch, ...,
-        channels)` while `channels_first` corresponds to inputs with shape
-        `(batch, channels, ...)`.
-      dilation_rate: An integer or tuple/list of n integers, specifying
-        the dilation rate to use for dilated convolution.
-        Currently, specifying any `dilation_rate` value != 1 is
-        incompatible with specifying any `strides` value != 1.
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(_ConvFlipout, self).__init__(
-        rank=rank,
-        filters=filters,
-        kernel_size=kernel_size,
-        strides=strides,
-        padding=padding,
-        data_format=data_format,
-        dilation_rate=dilation_rate,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        name=name, **kwargs)
-    self.seed = seed
-
-  def _apply_variational_kernel(self, inputs):
-    if (not isinstance(self.kernel_posterior, independent_lib.Independent) or
-        not isinstance(self.kernel_posterior.distribution, normal_lib.Normal)):
-      raise TypeError(
-          "`{}` requires "
-          "`kernel_posterior_fn` produce an instance of "
-          "`tf.distributions.Independent(tf.distributions.Normal)` "
-          "(saw: \"{}\").".format(
-              type(self).__name__, self.kernel_posterior.name))
-    self.kernel_posterior_affine = normal_lib.Normal(
-        loc=array_ops.zeros_like(self.kernel_posterior.distribution.loc),
-        scale=self.kernel_posterior.distribution.scale)
-    self.kernel_posterior_affine_tensor = (
-        self.kernel_posterior_tensor_fn(self.kernel_posterior_affine))
-    self.kernel_posterior_tensor = None
-
-    outputs = self._convolution_op(
-        inputs, self.kernel_posterior.distribution.loc)
-
-    input_shape = array_ops.shape(inputs)
-    output_shape = array_ops.shape(outputs)
-    batch_shape = array_ops.expand_dims(input_shape[0], 0)
-    channels = input_shape[-1]
-
-    sign_input = layers_util.random_sign(
-        array_ops.concat([batch_shape,
-                          array_ops.expand_dims(channels, 0)], 0),
-        dtype=inputs.dtype,
-        seed=self.seed)
-    sign_output = layers_util.random_sign(
-        array_ops.concat([batch_shape,
-                          array_ops.expand_dims(self.filters, 0)], 0),
-        dtype=inputs.dtype,
-        seed=distribution_util.gen_new_seed(
-            self.seed, salt="conv_flipout"))
-    for _ in range(self.rank):
-      sign_input = array_ops.expand_dims(sign_input, 1)  # 2D ex: (B, 1, 1, C)
-      sign_output = array_ops.expand_dims(sign_output, 1)
-
-    sign_input = array_ops.tile(  # tile for element-wise op broadcasting
-        sign_input,
-        [1] + [input_shape[i + 1] for i in range(self.rank)] + [1])
-    sign_output = array_ops.tile(
-        sign_output,
-        [1] + [output_shape[i + 1] for i in range(self.rank)] + [1])
-
-    perturbed_inputs = self._convolution_op(
-        inputs * sign_input, self.kernel_posterior_affine_tensor) * sign_output
-
-    outputs += perturbed_inputs
-    return outputs
-
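The sign-flip trick in `_apply_variational_kernel` can be illustrated outside TensorFlow. The following is a minimal NumPy sketch, assuming a dense layer rather than a convolution to keep the shapes small. The names `flipout_dense` and `random_sign` are hypothetical stand-ins for this illustration and are not the library's API. It checks that adding `((x * sign_in) @ perturbation) * sign_out` to the mean term is the same as giving every example its own weight matrix `w_mean + perturbation * outer(sign_in[b], sign_out[b])`, while sampling only one perturbation for the whole batch.

```python
import numpy as np


def random_sign(shape, rng):
    """Draw an array of independent +1/-1 entries (NumPy stand-in)."""
    return rng.integers(0, 2, size=shape).astype(np.float64) * 2.0 - 1.0


def flipout_dense(x, w_mean, w_perturbation, rng):
    """Flipout-style forward pass for a dense layer (hypothetical helper).

    x:              (batch, in_dim) inputs.
    w_mean:         (in_dim, out_dim) posterior mean of the weights.
    w_perturbation: (in_dim, out_dim) one zero-mean sample shared by the batch.
    """
    batch, in_dim = x.shape
    out_dim = w_mean.shape[1]
    # One random sign vector per example on the input and on the output side.
    sign_in = random_sign((batch, in_dim), rng)
    sign_out = random_sign((batch, out_dim), rng)
    # Deterministic pass through the posterior mean.
    mean_term = x @ w_mean
    # Shared perturbation, decorrelated across examples by the sign flips.
    perturbed = ((x * sign_in) @ w_perturbation) * sign_out
    return mean_term + perturbed, sign_in, sign_out


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w_mean = rng.normal(size=(3, 2))
w_perturbation = 0.1 * rng.normal(size=(3, 2))

outputs, sign_in, sign_out = flipout_dense(x, w_mean, w_perturbation, rng)

# Reference: an explicit, separately perturbed weight matrix per example.
expected = np.stack([
    x[b] @ (w_mean + w_perturbation * np.outer(sign_in[b], sign_out[b]))
    for b in range(x.shape[0])
])
assert np.allclose(outputs, expected)
```

The second matrix product is where the roughly doubled floating point cost mentioned in the docstring comes from: one pass uses the mean weights and one uses the shared perturbation.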
-
-class Conv1DFlipout(_ConvFlipout):
-  """1D convolution layer (e.g. temporal convolution) with Flipout.
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the Flipout
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`. Flipout uses
-  roughly twice as many floating point operations as the
-  reparameterization estimator but has the advantage of significantly
-  lower variance.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    filters: Python integer, dimensionality of the output space.
-    kernel_size: Size of the convolution window.
-    strides: Stride length of convolution.
-    padding: Python string describing padding approach.
-    data_format: Python string describing input data's dimensions.
-    dilation_rate: Dilation rate for an atrous convolution.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-    seed: Python integer, used to create random seeds.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 128, 1])
-  net = tfp.layers.Conv1DFlipout(64,
-                                 kernel_size=5,
-                                 padding="SAME",
-                                 activation=tf.nn.relu)(net)
-  net = tf.reshape(net, [-1, 128 * 64])
-  logits = tfp.layers.DenseFlipout(10)(net)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses the Flipout gradient estimator to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms that are arguments to the layer.
-
-  [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
-        Mini-Batches."
-        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
-        International Conference on Learning Representations, 2018.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      filters,
-      kernel_size,
-      strides=1,
-      padding="valid",
-      data_format="channels_last",
-      dilation_rate=1,
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      seed=None,
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      filters: Integer, the dimensionality of the output space (i.e. the number
-        of filters in the convolution).
-      kernel_size: An integer or tuple/list of a single integer, specifying the
-        length of the 1D convolution window.
-      strides: An integer or tuple/list of a single integer,
-        specifying the stride length of the convolution.
-        Specifying any stride value != 1 is incompatible with specifying
-        any `dilation_rate` value != 1.
-      padding: One of `"valid"` or `"same"` (case-insensitive).
-      data_format: A string, one of `channels_last` (default) or
-        `channels_first`. The ordering of the dimensions in the inputs.
-        `channels_last` corresponds to inputs with shape `(batch, length,
-        channels)` while `channels_first` corresponds to inputs with shape
-        `(batch, channels, length)`.
-      dilation_rate: An integer or tuple/list of a single integer, specifying
-        the dilation rate to use for dilated convolution.
-        Currently, specifying any `dilation_rate` value != 1 is
-        incompatible with specifying any `strides` value != 1.
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(Conv1DFlipout, self).__init__(
-        rank=1,
-        filters=filters,
-        kernel_size=kernel_size,
-        strides=strides,
-        padding=padding,
-        data_format=data_format,
-        dilation_rate=dilation_rate,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        seed=seed,
-        name=name, **kwargs)
-
-
-@docstring_util.expand_docstring(args=doc_args)
-def conv1d_flipout(
-    inputs,
-    filters,
-    kernel_size,
-    strides=1,
-    padding="valid",
-    data_format="channels_last",
-    dilation_rate=1,
-    activation=None,
-    activity_regularizer=None,
-    trainable=True,
-    kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-    kernel_posterior_tensor_fn=lambda d: d.sample(),
-    kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-        loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-    kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-    bias_posterior_tensor_fn=lambda d: d.sample(),
-    bias_prior_fn=None,
-    bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    seed=None,
-    name=None,
-    reuse=None):
-  # pylint: disable=g-doc-args
-  """Functional interface for 1D convolution layer (e.g. temporal convolution).
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the Flipout
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`. Flipout uses
-  roughly twice as many floating point operations as the
-  reparameterization estimator but has the advantage of significantly
-  lower variance.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Args:
-    inputs: Tensor input.
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of a single integer, specifying the
-      length of the 1D convolution window.
-    strides: An integer or tuple/list of a single integer,
-      specifying the stride length of the convolution.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, length, channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, length)`.
-    dilation_rate: An integer or tuple/list of a single integer, specifying
-      the dilation rate to use for dilated convolution.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any `strides` value != 1.
-    @{args}
-    reuse: Boolean, whether to reuse the weights of a previous layer
-      by the same name.
-
-  Returns:
-    Output tensor.
-
-  Raises:
-    ValueError: if eager execution is enabled.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 128, 1])
-  net = tfp.layers.conv1d_flipout(net,
-                                  filters=64,
-                                  kernel_size=5,
-                                  padding="SAME",
-                                  activation=tf.nn.relu)
-  net = tf.reshape(net, [-1, 128 * 64])
-  logits = tfp.layers.dense_flipout(net, 10)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses the Flipout gradient estimator to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms that are arguments to the layer.
-
-  [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
-        Mini-Batches."
-        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
-        International Conference on Learning Representations, 2018.
-  """
-  # pylint: enable=g-doc-args
-  layer = Conv1DFlipout(
-      filters=filters,
-      kernel_size=kernel_size,
-      strides=strides,
-      padding=padding,
-      data_format=data_format,
-      dilation_rate=dilation_rate,
-      activation=activation,
-      activity_regularizer=activity_regularizer,
-      trainable=trainable,
-      kernel_posterior_fn=kernel_posterior_fn,
-      kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-      kernel_prior_fn=kernel_prior_fn,
-      kernel_divergence_fn=kernel_divergence_fn,
-      bias_posterior_fn=bias_posterior_fn,
-      bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-      bias_prior_fn=bias_prior_fn,
-      bias_divergence_fn=bias_divergence_fn,
-      seed=seed,
-      name=name,
-      dtype=inputs.dtype.base_dtype,
-      _scope=name,
-      _reuse=reuse)
-  return layer.apply(inputs)
-
-
-class Conv2DFlipout(_ConvFlipout):
-  """2D convolution layer (e.g. spatial convolution over images) with Flipout.
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the Flipout
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`. Flipout uses
-  roughly twice as many floating point operations as the
-  reparameterization estimator but has the advantage of significantly
-  lower variance.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    filters: Python integer, dimensionality of the output space.
-    kernel_size: Size of the convolution window.
-    strides: Stride length of convolution.
-    padding: Python string describing padding approach.
-    data_format: Python string describing input data's dimensions.
-    dilation_rate: Dilation rate for an atrous convolution.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-    seed: Python integer, used to create random seeds.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 32, 32, 3])
-  net = tfp.layers.Conv2DFlipout(64,
-                                 kernel_size=5,
-                                 padding="SAME",
-                                 activation=tf.nn.relu)(net)
-  net = tf.layers.MaxPooling2D(pool_size=2,
-                               strides=2,
-                               padding="SAME")(net)
-  net = tf.reshape(net, [-1, 16 * 16 * 64])
-  logits = tfp.layers.DenseFlipout(10)(net)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses the Flipout gradient estimator to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms that are arguments to the layer.
-
-  [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
-        Mini-Batches."
-        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
-        International Conference on Learning Representations, 2018.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      filters,
-      kernel_size,
-      strides=(1, 1),
-      padding="valid",
-      data_format="channels_last",
-      dilation_rate=(1, 1),
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      seed=None,
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      filters: Integer, the dimensionality of the output space (i.e. the number
-        of filters in the convolution).
-      kernel_size: An integer or tuple/list of 2 integers, specifying the
-        height and width of the 2D convolution window.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-      strides: An integer or tuple/list of 2 integers,
-        specifying the strides of the convolution along the height and width.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-        Specifying any stride value != 1 is incompatible with specifying
-        any `dilation_rate` value != 1.
-      padding: One of `"valid"` or `"same"` (case-insensitive).
-      data_format: A string, one of `channels_last` (default) or
-        `channels_first`. The ordering of the dimensions in the inputs.
-        `channels_last` corresponds to inputs with shape `(batch, height,
-        width, channels)` while `channels_first` corresponds to inputs with
-        shape `(batch, channels, height, width)`.
-      dilation_rate: An integer or tuple/list of 2 integers, specifying
-        the dilation rate to use for dilated convolution.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-        Currently, specifying any `dilation_rate` value != 1 is
-        incompatible with specifying any stride value != 1.
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(Conv2DFlipout, self).__init__(
-        rank=2,
-        filters=filters,
-        kernel_size=kernel_size,
-        strides=strides,
-        padding=padding,
-        data_format=data_format,
-        dilation_rate=dilation_rate,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        seed=seed,
-        name=name, **kwargs)
-
-
-@docstring_util.expand_docstring(args=doc_args)
-def conv2d_flipout(
-    inputs,
-    filters,
-    kernel_size,
-    strides=(1, 1),
-    padding="valid",
-    data_format="channels_last",
-    dilation_rate=(1, 1),
-    activation=None,
-    activity_regularizer=None,
-    trainable=True,
-    kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-    kernel_posterior_tensor_fn=lambda d: d.sample(),
-    kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-        loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-    kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-    bias_posterior_tensor_fn=lambda d: d.sample(),
-    bias_prior_fn=None,
-    bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    seed=None,
-    name=None,
-    reuse=None):
-  # pylint: disable=g-doc-args
-  """Functional interface for the 2D convolution layer with Flipout.
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the Flipout
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`. Flipout uses
-  roughly twice as many floating point operations as the
-  reparameterization estimator but has the advantage of significantly
-  lower variance.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Args:
-    inputs: Tensor input.
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of 2 integers, specifying the
-      height and width of the 2D convolution window.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-    strides: An integer or tuple/list of 2 integers,
-      specifying the strides of the convolution along the height and width.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, height, width, channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, height, width)`.
-    dilation_rate: An integer or tuple/list of 2 integers, specifying
-      the dilation rate to use for dilated convolution.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any stride value != 1.
-    @{args}
-    reuse: Boolean, whether to reuse the weights of a previous layer
-      by the same name.
-
-  Returns:
-    Output tensor.
-
-  Raises:
-    ValueError: if eager execution is enabled.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 32, 32, 3])
-  net = tfp.layers.conv2d_flipout(net,
-                                  filters=64,
-                                  kernel_size=5,
-                                  padding="SAME",
-                                  activation=tf.nn.relu)
-  net = tf.layers.max_pooling2d(net,
-                                pool_size=2,
-                                strides=2,
-                                padding="SAME")
-  net = tf.reshape(net, [-1, 16 * 16 * 64])
-  logits = tfp.layers.dense_flipout(net, 10)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses the Flipout gradient estimator to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. The loss is the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms that are arguments to the layer.
-
-  [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
-        Mini-Batches."
-        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
-        International Conference on Learning Representations, 2018.
-  """
-  # pylint: enable=g-doc-args
-  layer = Conv2DFlipout(
-      filters=filters,
-      kernel_size=kernel_size,
-      strides=strides,
-      padding=padding,
-      data_format=data_format,
-      dilation_rate=dilation_rate,
-      activation=activation,
-      activity_regularizer=activity_regularizer,
-      trainable=trainable,
-      kernel_posterior_fn=kernel_posterior_fn,
-      kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-      kernel_prior_fn=kernel_prior_fn,
-      kernel_divergence_fn=kernel_divergence_fn,
-      bias_posterior_fn=bias_posterior_fn,
-      bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-      bias_prior_fn=bias_prior_fn,
-      bias_divergence_fn=bias_divergence_fn,
-      seed=seed,
-      name=name,
-      dtype=inputs.dtype.base_dtype,
-      _scope=name,
-      _reuse=reuse)
-  return layer.apply(inputs)
-
-
-class Conv3DFlipout(_ConvFlipout):
-  """3D convolution layer (e.g. spatial convolution over volumes) with Flipout.
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the Flipout
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`. Flipout uses
-  roughly twice as many floating point operations as the
-  reparameterization estimator but has the advantage of significantly
-  lower variance.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    filters: Python integer, dimensionality of the output space.
-    kernel_size: Size of the convolution window.
-    strides: Stride length of convolution.
-    padding: Python string describing padding approach.
-    data_format: Python string describing input data's dimensions.
-    dilation_rate: Dilation rate for an atrous convolution.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-    seed: Python integer, used to create random seeds.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 256, 32, 32, 3])
-  net = tfp.layers.Conv3DFlipout(64,
-                                 kernel_size=5,
-                                 padding="SAME",
-                                 activation=tf.nn.relu)(net)
-  net = tf.layers.MaxPooling3D(pool_size=2,
-                               strides=2,
-                               padding="SAME")(net)
-  net = tf.reshape(net, [-1, 128 * 16 * 16 * 64])
-  logits = tfp.layers.DenseFlipout(10)(net)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses the Flipout gradient estimator to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. The loss is the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms that are arguments to the layer.
-
-  [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
-        Mini-Batches."
-        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
-        International Conference on Learning Representations, 2018.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      filters,
-      kernel_size,
-      strides=(1, 1, 1),
-      padding="valid",
-      data_format="channels_last",
-      dilation_rate=(1, 1, 1),
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      seed=None,
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      filters: Integer, the dimensionality of the output space (i.e. the number
-        of filters in the convolution).
-      kernel_size: An integer or tuple/list of 3 integers, specifying the
-        depth, height and width of the 3D convolution window.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-      strides: An integer or tuple/list of 3 integers,
-        specifying the strides of the convolution along the depth,
-        height and width.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-        Specifying any stride value != 1 is incompatible with specifying
-        any `dilation_rate` value != 1.
-      padding: One of `"valid"` or `"same"` (case-insensitive).
-      data_format: A string, one of `channels_last` (default) or
-        `channels_first`. The ordering of the dimensions in the inputs.
-        `channels_last` corresponds to inputs with shape `(batch, depth,
-        height, width, channels)` while `channels_first` corresponds to inputs
-        with shape `(batch, channels, depth, height, width)`.
-      dilation_rate: An integer or tuple/list of 3 integers, specifying
-        the dilation rate to use for dilated convolution.
-        Can be a single integer to specify the same value for
-        all spatial dimensions.
-        Currently, specifying any `dilation_rate` value != 1 is
-        incompatible with specifying any stride value != 1.
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(Conv3DFlipout, self).__init__(
-        rank=3,
-        filters=filters,
-        kernel_size=kernel_size,
-        strides=strides,
-        padding=padding,
-        data_format=data_format,
-        dilation_rate=dilation_rate,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        seed=seed,
-        name=name, **kwargs)
-
-
-@docstring_util.expand_docstring(args=doc_args)
-def conv3d_flipout(
-    inputs,
-    filters,
-    kernel_size,
-    strides=(1, 1, 1),
-    padding="valid",
-    data_format="channels_last",
-    dilation_rate=(1, 1, 1),
-    activation=None,
-    activity_regularizer=None,
-    trainable=True,
-    kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-    kernel_posterior_tensor_fn=lambda d: d.sample(),
-    kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-        loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-    kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-    bias_posterior_tensor_fn=lambda d: d.sample(),
-    bias_prior_fn=None,
-    bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    seed=None,
-    name=None,
-    reuse=None):
-  # pylint: disable=g-doc-args
-  """Functional interface for the 3D convolution layer.
-
-  This layer creates a convolution kernel that is convolved
-  (actually cross-correlated) with the layer input to produce a tensor of
-  outputs. It may also include a bias addition and activation function
-  on the outputs. It assumes the `kernel` and/or `bias` are drawn from
-  distributions.
-
-  By default, the layer implements a stochastic forward pass via
-  sampling from the kernel and bias posteriors,
-  ```none
-  outputs = f(inputs; kernel, bias), kernel, bias ~ posterior
-  ```
-  where f denotes the layer's calculation. It uses the Flipout
-  estimator [1], which performs a Monte Carlo approximation of the
-  distribution integrating over the `kernel` and `bias`. Flipout uses
-  roughly twice as many floating point operations as the
-  reparameterization estimator but has the advantage of significantly
-  lower variance.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Args:
-    inputs: Tensor input.
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of 3 integers, specifying the
-      depth, height and width of the 3D convolution window.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-    strides: An integer or tuple/list of 3 integers,
-      specifying the strides of the convolution along the depth,
-      height and width.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, depth, height, width, channels)` while `channels_first`
-      corresponds to inputs with shape
-      `(batch, channels, depth, height, width)`.
-    dilation_rate: An integer or tuple/list of 3 integers, specifying
-      the dilation rate to use for dilated convolution.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any stride value != 1.
-    @{args}
-    reuse: Boolean, whether to reuse the weights of a previous layer
-      by the same name.
-
-  Returns:
-    Output tensor.
-
-  Raises:
-    ValueError: if eager execution is enabled.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tf.reshape(features, [-1, 256, 32, 32, 3])
-  net = tfp.layers.conv3d_flipout(net,
-                                  filters=64,
-                                  kernel_size=5,
-                                  padding="SAME",
-                                  activation=tf.nn.relu)
-  net = tf.layers.max_pooling3d(net,
-                                pool_size=2,
-                                strides=2,
-                                padding="SAME")
-  net = tf.reshape(net, [-1, 128 * 16 * 16 * 64])
-  logits = tfp.layers.dense_flipout(net, 10)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses the Flipout gradient estimator to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. The loss is the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms that are arguments to the layer.
-
-  [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
-        Mini-Batches."
-        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
-        International Conference on Learning Representations, 2018.
-  """
-  # pylint: enable=g-doc-args
-  layer = Conv3DFlipout(
-      filters=filters,
-      kernel_size=kernel_size,
-      strides=strides,
-      padding=padding,
-      data_format=data_format,
-      dilation_rate=dilation_rate,
-      activation=activation,
-      activity_regularizer=activity_regularizer,
-      trainable=trainable,
-      kernel_posterior_fn=kernel_posterior_fn,
-      kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-      kernel_prior_fn=kernel_prior_fn,
-      kernel_divergence_fn=kernel_divergence_fn,
-      bias_posterior_fn=bias_posterior_fn,
-      bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-      bias_prior_fn=bias_prior_fn,
-      bias_divergence_fn=bias_divergence_fn,
-      seed=seed,
-      name=name,
-      dtype=inputs.dtype.base_dtype,
-      _scope=name,
-      _reuse=reuse)
-  return layer.apply(inputs)
-
-
-# Aliases
-
-Convolution1DReparameterization = Conv1DReparameterization
-Convolution2DReparameterization = Conv2DReparameterization
-Convolution3DReparameterization = Conv3DReparameterization
-convolution1d_reparameterization = conv1d_reparameterization
-convolution2d_reparameterization = conv2d_reparameterization
-convolution3d_reparameterization = conv3d_reparameterization
-Convolution1DFlipout = Conv1DFlipout
-Convolution2DFlipout = Conv2DFlipout
-Convolution3DFlipout = Conv3DFlipout
-convolution1d_flipout = conv1d_flipout
-convolution2d_flipout = conv2d_flipout
-convolution3d_flipout = conv3d_flipout
diff --git a/tensorflow/contrib/bayesflow/python/ops/layers_dense_variational.py b/tensorflow/contrib/bayesflow/python/ops/layers_dense_variational.py
deleted file mode 100644
index 1f1d8fda2a5db4db33a2b6e5d7f027c4b509011a..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/layers_dense_variational.py
+++ /dev/null
@@ -1,955 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Dense Bayesian layer using KL-divergence based variational inference.
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.contrib.bayesflow.python.ops import docstring_util
-from tensorflow.contrib.bayesflow.python.ops import layers_util
-from tensorflow.contrib.distributions.python.ops import independent as independent_lib
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_shape
-from tensorflow.python.layers import base as layers_lib
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import nn
-from tensorflow.python.ops import standard_ops
-from tensorflow.python.ops.distributions import kullback_leibler as kl_lib
-from tensorflow.python.ops.distributions import normal as normal_lib
-from tensorflow.python.ops.distributions import util as distribution_util
-
-
-doc_args = """units: Integer or Long, dimensionality of the output space.
-  activation: Activation function (`callable`). Set it to None to maintain a
-    linear activation.
-  activity_regularizer: Regularizer function for the output.
-  trainable: Boolean, if `True` also add variables to the graph collection
-    `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-  kernel_posterior_fn: Python `callable` which creates
-    `tf.distributions.Distribution` instance representing the surrogate
-    posterior of the `kernel` parameter. Default value:
-    `default_mean_field_normal_fn()`.
-  kernel_posterior_tensor_fn: Python `callable` which takes a
-    `tf.distributions.Distribution` instance and returns a representative
-    value. Default value: `lambda d: d.sample()`.
-  kernel_prior_fn: Python `callable` which creates `tf.distributions`
-    instance. See `default_mean_field_normal_fn` docstring for required
-    parameter signature.
-    Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-  kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-    distribution, prior distribution and random variate sample(s) from the
-    surrogate posterior and computes or approximates the KL divergence. The
-    distributions are `tf.distributions.Distribution`-like instances and the
-    sample is a `Tensor`.
-  bias_posterior_fn: Python `callable` which creates
-    `tf.distributions.Distribution` instance representing the surrogate
-    posterior of the `bias` parameter. Default value:
-    `default_mean_field_normal_fn(is_singular=True)` (which creates an
-    instance of `tf.distributions.Deterministic`).
-  bias_posterior_tensor_fn: Python `callable` which takes a
-    `tf.distributions.Distribution` instance and returns a representative
-    value. Default value: `lambda d: d.sample()`.
-  bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-    See `default_mean_field_normal_fn` docstring for required parameter
-    signature. Default value: `None` (no prior, no variational inference).
-  bias_divergence_fn: Python `callable` which takes the surrogate posterior
-    distribution, prior distribution and random variate sample(s) from the
-    surrogate posterior and computes or approximates the KL divergence. The
-    distributions are `tf.distributions.Distribution`-like instances and the
-    sample is a `Tensor`.
-  seed: Python scalar `int` which initializes the random number
-    generator. Default value: `None` (i.e., use global seed).
-  name: Python `str`, the name of the layer. Layers with the same name will
-    share `tf.Variable`s, but to avoid mistakes we require `reuse=True` in
-    such cases.
-  reuse: Python `bool`, whether to reuse the `tf.Variable`s of a previous
-    layer by the same name."""
-
-
-class _DenseVariational(layers_lib.Layer):
-  """Abstract densely-connected class (private, used as implementation base).
-
-  This layer implements the Bayesian variational inference analogue to
-  a dense layer by assuming the `kernel` and/or the `bias` are drawn
-  from distributions. By default, the layer implements a stochastic
-  forward pass via sampling from the kernel and bias posteriors,
-
-  ```none
-  kernel, bias ~ posterior
-  outputs = activation(matmul(inputs, kernel) + bias)
-  ```
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    units: Python integer, dimensionality of the output space.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      units,
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(_DenseVariational, self).__init__(
-        trainable=trainable,
-        name=name,
-        activity_regularizer=activity_regularizer,
-        **kwargs)
-    self.units = units
-    self.activation = activation
-    self.input_spec = layers_lib.InputSpec(min_ndim=2)
-    self.kernel_posterior_fn = kernel_posterior_fn
-    self.kernel_posterior_tensor_fn = kernel_posterior_tensor_fn
-    self.kernel_prior_fn = kernel_prior_fn
-    self.kernel_divergence_fn = kernel_divergence_fn
-    self.bias_posterior_fn = bias_posterior_fn
-    self.bias_posterior_tensor_fn = bias_posterior_tensor_fn
-    self.bias_prior_fn = bias_prior_fn
-    self.bias_divergence_fn = bias_divergence_fn
-
-  def build(self, input_shape):
-    input_shape = tensor_shape.TensorShape(input_shape)
-    in_size = input_shape.with_rank_at_least(2)[-1].value
-    if in_size is None:
-      raise ValueError("The last dimension of the inputs to `Dense` "
-                       "should be defined. Found `None`.")
-    self._input_spec = layers_lib.InputSpec(min_ndim=2, axes={-1: in_size})
-    dtype = dtypes.as_dtype(self.dtype)
-
-    # Must have a posterior kernel.
-    self.kernel_posterior = self.kernel_posterior_fn(
-        dtype, [in_size, self.units], "kernel_posterior",
-        self.trainable, self.add_variable)
-
-    if self.kernel_prior_fn is None:
-      self.kernel_prior = None
-    else:
-      self.kernel_prior = self.kernel_prior_fn(
-          dtype, [in_size, self.units], "kernel_prior",
-          self.trainable, self.add_variable)
-    self._built_kernel_divergence = False
-
-    if self.bias_posterior_fn is None:
-      self.bias_posterior = None
-    else:
-      self.bias_posterior = self.bias_posterior_fn(
-          dtype, [self.units], "bias_posterior",
-          self.trainable, self.add_variable)
-
-    if self.bias_prior_fn is None:
-      self.bias_prior = None
-    else:
-      self.bias_prior = self.bias_prior_fn(
-          dtype, [self.units], "bias_prior",
-          self.trainable, self.add_variable)
-    self._built_bias_divergence = False
-
-    self.built = True
-
-  def call(self, inputs):
-    inputs = ops.convert_to_tensor(inputs, dtype=self.dtype)
-
-    outputs = self._apply_variational_kernel(inputs)
-    outputs = self._apply_variational_bias(outputs)
-    if self.activation is not None:
-      outputs = self.activation(outputs)  # pylint: disable=not-callable
-    if not self._built_kernel_divergence:
-      kernel_posterior = self.kernel_posterior
-      kernel_prior = self.kernel_prior
-      if isinstance(self.kernel_posterior, independent_lib.Independent):
-        kernel_posterior = kernel_posterior.distribution
-      if isinstance(self.kernel_prior, independent_lib.Independent):
-        kernel_prior = kernel_prior.distribution
-      self._apply_divergence(self.kernel_divergence_fn,
-                             kernel_posterior,
-                             kernel_prior,
-                             self.kernel_posterior_tensor,
-                             name="divergence_kernel")
-      self._built_kernel_divergence = True
-    if not self._built_bias_divergence:
-      bias_posterior = self.bias_posterior
-      bias_prior = self.bias_prior
-      if isinstance(self.bias_posterior, independent_lib.Independent):
-        bias_posterior = bias_posterior.distribution
-      if isinstance(self.bias_prior, independent_lib.Independent):
-        bias_prior = bias_prior.distribution
-      self._apply_divergence(self.bias_divergence_fn,
-                             bias_posterior,
-                             bias_prior,
-                             self.bias_posterior_tensor,
-                             name="divergence_bias")
-      self._built_bias_divergence = True
-    return outputs
-
-  def _apply_variational_bias(self, inputs):
-    if self.bias_posterior is None:
-      self.bias_posterior_tensor = None
-      return inputs
-    self.bias_posterior_tensor = self.bias_posterior_tensor_fn(
-        self.bias_posterior)
-    return nn.bias_add(inputs, self.bias_posterior_tensor)
-
-  def _apply_divergence(self, divergence_fn, posterior, prior,
-                        posterior_tensor, name):
-    if (divergence_fn is None or
-        posterior is None or
-        prior is None):
-      divergence = None
-      return
-    divergence = standard_ops.identity(
-        divergence_fn(
-            posterior, prior, posterior_tensor),
-        name=name)
-    self.add_loss(divergence)
-
-  def _matmul(self, inputs, kernel):
-    if inputs.shape.ndims <= 2:
-      return standard_ops.matmul(inputs, kernel)
-    # To handle broadcasting, we must use `tensordot`.
-    return standard_ops.tensordot(inputs, kernel, axes=[[-1], [0]])
-
-  def _compute_output_shape(self, input_shape):
-    input_shape = tensor_shape.TensorShape(input_shape).with_rank_at_least(2)
-    if input_shape[-1].value is None:
-      raise ValueError(
-          "The innermost dimension of input_shape must be defined, "
-          "but saw: {}".format(input_shape))
-    return input_shape[:-1].concatenate(self.units)
-
-
-class DenseReparameterization(_DenseVariational):
-  """Densely-connected layer class with reparameterization estimator.
-
-  This layer implements the Bayesian variational inference analogue to
-  a dense layer by assuming the `kernel` and/or the `bias` are drawn
-  from distributions. By default, the layer implements a stochastic
-  forward pass via sampling from the kernel and bias posteriors,
-
-  ```none
-  kernel, bias ~ posterior
-  outputs = activation(matmul(inputs, kernel) + bias)
-  ```
-
-  It uses the reparameterization estimator [1], which performs a Monte Carlo
-  approximation of the distribution integrating over the `kernel` and
-  `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    units: Python integer, dimensionality of the output space.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tfp.layers.DenseReparameterization(
-      512, activation=tf.nn.relu)(features)
-  logits = tfp.layers.DenseReparameterization(10)(net)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses reparameterization gradients to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms which are arguments to the layer.
-
-  [1]: "Auto-Encoding Variational Bayes."
-        Diederik P. Kingma, Max Welling.
-        International Conference on Learning Representations, 2014.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      units,
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(
-          is_singular=True),
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(DenseReparameterization, self).__init__(
-        units=units,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        name=name,
-        **kwargs)
-
-  def _apply_variational_kernel(self, inputs):
-    self.kernel_posterior_tensor = self.kernel_posterior_tensor_fn(
-        self.kernel_posterior)
-    self.kernel_posterior_affine = None
-    self.kernel_posterior_affine_tensor = None
-    return self._matmul(inputs, self.kernel_posterior_tensor)
-
-
-@docstring_util.expand_docstring(args=doc_args)
-def dense_reparameterization(
-    inputs,
-    units,
-    activation=None,
-    activity_regularizer=None,
-    trainable=True,
-    kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-    kernel_posterior_tensor_fn=lambda d: d.sample(),
-    kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-        loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-    kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    bias_posterior_fn=layers_util.default_mean_field_normal_fn(is_singular=True),  # pylint: disable=line-too-long
-    bias_posterior_tensor_fn=lambda d: d.sample(),
-    bias_prior_fn=None,
-    bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    name=None,
-    reuse=None):
-  # pylint: disable=g-doc-args
-  """Densely-connected layer with reparameterization estimator.
-
-  This layer implements the Bayesian variational inference analogue to
-  a dense layer by assuming the `kernel` and/or the `bias` are drawn
-  from distributions. By default, the layer implements a stochastic
-  forward pass via sampling from the kernel and bias posteriors,
-
-  ```none
-  kernel, bias ~ posterior
-  outputs = activation(matmul(inputs, kernel) + bias)
-  ```
-
-  It uses the reparameterization estimator [1], which performs a Monte Carlo
-  approximation of the distribution integrating over the `kernel` and
-  `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Args:
-    inputs: Tensor input.
-    @{args}
-
-  Returns:
-    output: `Tensor` representing the affine transformed input under a random
-      draw from the surrogate posterior distribution.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tfp.layers.dense_reparameterization(
-      features, 512, activation=tf.nn.relu)
-  logits = tfp.layers.dense_reparameterization(net, 10)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses reparameterization gradients to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms which are arguments to the layer.
-
-  [1]: "Auto-Encoding Variational Bayes."
-        Diederik P. Kingma, Max Welling.
-        International Conference on Learning Representations, 2014.
-  """
-  # pylint: enable=g-doc-args
-  layer = DenseReparameterization(
-      units,
-      activation=activation,
-      activity_regularizer=activity_regularizer,
-      trainable=trainable,
-      kernel_posterior_fn=kernel_posterior_fn,
-      kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-      kernel_prior_fn=kernel_prior_fn,
-      kernel_divergence_fn=kernel_divergence_fn,
-      bias_posterior_fn=bias_posterior_fn,
-      bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-      bias_prior_fn=bias_prior_fn,
-      bias_divergence_fn=bias_divergence_fn,
-      name=name,
-      dtype=inputs.dtype.base_dtype,
-      _scope=name,
-      _reuse=reuse)
-  return layer.apply(inputs)
-
-
-class DenseLocalReparameterization(_DenseVariational):
-  """Densely-connected layer class with local reparameterization estimator.
-
-  This layer implements the Bayesian variational inference analogue to
-  a dense layer by assuming the `kernel` and/or the `bias` are drawn
-  from distributions. By default, the layer implements a stochastic
-  forward pass via sampling from the kernel and bias posteriors,
-
-  ```none
-  kernel, bias ~ posterior
-  outputs = activation(matmul(inputs, kernel) + bias)
-  ```
-
-  It uses the local reparameterization estimator [1], which performs a
-  Monte Carlo approximation of the distribution on the hidden units
-  induced by the `kernel` and `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    units: Python integer, dimensionality of the output space.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tfp.layers.DenseLocalReparameterization(
-      512, activation=tf.nn.relu)(features)
-  logits = tfp.layers.DenseLocalReparameterization(10)(net)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses local reparameterization gradients to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms which are arguments to the layer.
-
-  [1]: "Variational Dropout and the Local Reparameterization Trick."
-        Diederik P. Kingma, Tim Salimans, Max Welling.
-        Neural Information Processing Systems, 2015.
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      units,
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(
-          is_singular=True),
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(DenseLocalReparameterization, self).__init__(
-        units=units,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        name=name,
-        **kwargs)
-
-  def _apply_variational_kernel(self, inputs):
-    if (not isinstance(self.kernel_posterior, independent_lib.Independent) or
-        not isinstance(self.kernel_posterior.distribution, normal_lib.Normal)):
-      raise TypeError(
-          "`DenseLocalReparameterization` requires "
-          "`kernel_posterior_fn` produce an instance of "
-          "`tf.distributions.Independent(tf.distributions.Normal)` "
-          "(saw: \"{}\").".format(self.kernel_posterior.name))
-    self.kernel_posterior_affine = normal_lib.Normal(
-        loc=self._matmul(inputs, self.kernel_posterior.distribution.loc),
-        scale=standard_ops.sqrt(self._matmul(
-            standard_ops.square(inputs),
-            standard_ops.square(self.kernel_posterior.distribution.scale))))
-    self.kernel_posterior_affine_tensor = (
-        self.kernel_posterior_tensor_fn(self.kernel_posterior_affine))
-    self.kernel_posterior_tensor = None
-    return self.kernel_posterior_affine_tensor
-
-
-@docstring_util.expand_docstring(args=doc_args)
-def dense_local_reparameterization(
-    inputs,
-    units,
-    activation=None,
-    activity_regularizer=None,
-    trainable=True,
-    kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-    kernel_posterior_tensor_fn=lambda d: d.sample(),
-    kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-        loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-    kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    bias_posterior_fn=layers_util.default_mean_field_normal_fn(
-        is_singular=True),
-    bias_posterior_tensor_fn=lambda d: d.sample(),
-    bias_prior_fn=None,
-    bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    name=None,
-    reuse=None):
-  # pylint: disable=g-doc-args
-  """Densely-connected layer with local reparameterization estimator.
-
-  This layer implements the Bayesian variational inference analogue to
-  a dense layer by assuming the `kernel` and/or the `bias` are drawn
-  from distributions. By default, the layer implements a stochastic
-  forward pass via sampling from the kernel and bias posteriors,
-
-  ```none
-  kernel, bias ~ posterior
-  outputs = activation(matmul(inputs, kernel) + bias)
-  ```
-
-  It uses the local reparameterization estimator [1], which performs a
-  Monte Carlo approximation of the distribution on the hidden units
-  induced by the `kernel` and `bias`.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Args:
-    inputs: Tensor input.
-    @{args}
-
-  Returns:
-    output: `Tensor` representing the affine transformed input under a random
-      draw from the surrogate posterior distribution.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tfp.layers.dense_local_reparameterization(
-      features, 512, activation=tf.nn.relu)
-  logits = tfp.layers.dense_local_reparameterization(net, 10)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses local reparameterization gradients to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms which are arguments to the layer.
-
-  [1]: "Variational Dropout and the Local Reparameterization Trick."
-        Diederik P. Kingma, Tim Salimans, Max Welling.
-        Neural Information Processing Systems, 2015.
-  """
-  # pylint: enable=g-doc-args
-  layer = DenseLocalReparameterization(
-      units,
-      activation=activation,
-      activity_regularizer=activity_regularizer,
-      trainable=trainable,
-      kernel_posterior_fn=kernel_posterior_fn,
-      kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-      kernel_prior_fn=kernel_prior_fn,
-      kernel_divergence_fn=kernel_divergence_fn,
-      bias_posterior_fn=bias_posterior_fn,
-      bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-      bias_prior_fn=bias_prior_fn,
-      bias_divergence_fn=bias_divergence_fn,
-      name=name,
-      dtype=inputs.dtype.base_dtype,
-      _scope=name,
-      _reuse=reuse)
-  return layer.apply(inputs)
-
-
-class DenseFlipout(_DenseVariational):
-  """Densely-connected layer class with Flipout estimator.
-
-  This layer implements the Bayesian variational inference analogue to
-  a dense layer by assuming the `kernel` and/or the `bias` are drawn
-  from distributions. By default, the layer implements a stochastic
-  forward pass via sampling from the kernel and bias posteriors,
-
-  ```none
-  kernel, bias ~ posterior
-  outputs = activation(matmul(inputs, kernel) + bias)
-  ```
-
-  It uses the Flipout estimator [1], which performs a Monte Carlo
-  approximation of the distribution integrating over the `kernel` and
-  `bias`. Flipout uses roughly twice as many floating point operations
-  as the reparameterization estimator but has the advantage of
-  significantly lower variance.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Properties:
-    units: Python integer, dimensionality of the output space.
-    activation: Activation function (`callable`).
-    activity_regularizer: Regularizer function for the output.
-    kernel_posterior_fn: `callable` returning posterior.
-    kernel_posterior_tensor_fn: `callable` operating on posterior.
-    kernel_prior_fn: `callable` returning prior.
-    kernel_divergence_fn: `callable` returning divergence.
-    bias_posterior_fn: `callable` returning posterior.
-    bias_posterior_tensor_fn: `callable` operating on posterior.
-    bias_prior_fn: `callable` returning prior.
-    bias_divergence_fn: `callable` returning divergence.
-    seed: Python integer, used to create random seeds.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tfp.layers.DenseFlipout(
-      512, activation=tf.nn.relu)(features)
-  logits = tfp.layers.DenseFlipout(10)(net)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses the Flipout gradient estimator to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms which are arguments to the layer.
-
-  [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
-        Mini-Batches."
-        Anonymous. OpenReview, 2017.
-        https://openreview.net/forum?id=rJnpifWAb
-  """
-
-  @docstring_util.expand_docstring(args=doc_args)
-  def __init__(
-      self,
-      units,
-      activation=None,
-      activity_regularizer=None,
-      trainable=True,
-      kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-      kernel_posterior_tensor_fn=lambda d: d.sample(),
-      kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-          loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-      kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      bias_posterior_fn=layers_util.default_mean_field_normal_fn(
-          is_singular=True),
-      bias_posterior_tensor_fn=lambda d: d.sample(),
-      bias_prior_fn=None,
-      bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-      seed=None,
-      name=None,
-      **kwargs):
-    # pylint: disable=g-doc-args
-    """Construct layer.
-
-    Args:
-      @{args}
-    """
-    # pylint: enable=g-doc-args
-    super(DenseFlipout, self).__init__(
-        units=units,
-        activation=activation,
-        activity_regularizer=activity_regularizer,
-        trainable=trainable,
-        kernel_posterior_fn=kernel_posterior_fn,
-        kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-        kernel_prior_fn=kernel_prior_fn,
-        kernel_divergence_fn=kernel_divergence_fn,
-        bias_posterior_fn=bias_posterior_fn,
-        bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-        bias_prior_fn=bias_prior_fn,
-        bias_divergence_fn=bias_divergence_fn,
-        name=name,
-        **kwargs)
-    self.seed = seed
-
-  def _apply_variational_kernel(self, inputs):
-    if (not isinstance(self.kernel_posterior, independent_lib.Independent) or
-        not isinstance(self.kernel_posterior.distribution, normal_lib.Normal)):
-      raise TypeError(
-          "`DenseFlipout` requires "
-          "`kernel_posterior_fn` produce an instance of "
-          "`tf.distributions.Independent(tf.distributions.Normal)` "
-          "(saw: \"{}\").".format(self.kernel_posterior.name))
-    self.kernel_posterior_affine = normal_lib.Normal(
-        loc=array_ops.zeros_like(self.kernel_posterior.distribution.loc),
-        scale=self.kernel_posterior.distribution.scale)
-    self.kernel_posterior_affine_tensor = (
-        self.kernel_posterior_tensor_fn(self.kernel_posterior_affine))
-    self.kernel_posterior_tensor = None
-
-    input_shape = array_ops.shape(inputs)
-    batch_shape = input_shape[:-1]
-
-    sign_input = layers_util.random_sign(
-        input_shape,
-        dtype=inputs.dtype,
-        seed=self.seed)
-    sign_output = layers_util.random_sign(
-        array_ops.concat([batch_shape,
-                          array_ops.expand_dims(self.units, 0)], 0),
-        dtype=inputs.dtype,
-        seed=distribution_util.gen_new_seed(
-            self.seed, salt="dense_flipout"))
-    perturbed_inputs = self._matmul(
-        inputs * sign_input, self.kernel_posterior_affine_tensor) * sign_output
-
-    outputs = self._matmul(inputs, self.kernel_posterior.distribution.loc)
-    outputs += perturbed_inputs
-    return outputs
-
-
-@docstring_util.expand_docstring(args=doc_args)
-def dense_flipout(
-    inputs,
-    units,
-    activation=None,
-    activity_regularizer=None,
-    trainable=True,
-    kernel_posterior_fn=layers_util.default_mean_field_normal_fn(),
-    kernel_posterior_tensor_fn=lambda d: d.sample(),
-    kernel_prior_fn=lambda dtype, *args: normal_lib.Normal(  # pylint: disable=g-long-lambda
-        loc=dtype.as_numpy_dtype(0.), scale=dtype.as_numpy_dtype(1.)),
-    kernel_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    bias_posterior_fn=layers_util.default_mean_field_normal_fn(
-        is_singular=True),
-    bias_posterior_tensor_fn=lambda d: d.sample(),
-    bias_prior_fn=None,
-    bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
-    seed=None,
-    name=None,
-    reuse=None):
-  # pylint: disable=g-doc-args
-  """Densely-connected layer with Flipout estimator.
-
-  This layer implements the Bayesian variational inference analogue to
-  a dense layer by assuming the `kernel` and/or the `bias` are drawn
-  from distributions. By default, the layer implements a stochastic
-  forward pass via sampling from the kernel and bias posteriors,
-
-  ```none
-  kernel, bias ~ posterior
-  outputs = activation(matmul(inputs, kernel) + bias)
-  ```
-
-  It uses the Flipout estimator [1], which performs a Monte Carlo
-  approximation of the distribution integrating over the `kernel` and
-  `bias`. Flipout uses roughly twice as many floating point operations
-  as the reparameterization estimator but has the advantage of
-  significantly lower variance.
-
-  The arguments permit separate specification of the surrogate posterior
-  (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
-  distributions.
-
-  Args:
-    inputs: Tensor input.
-    @{args}
-
-  Returns:
-    output: `Tensor` representing the affine transformed input under a random
-      draw from the surrogate posterior distribution.
-
-  #### Examples
-
-  We illustrate a Bayesian neural network with [variational inference](
-  https://en.wikipedia.org/wiki/Variational_Bayesian_methods),
-  assuming a dataset of `features` and `labels`.
-
-  ```python
-  tfp = tf.contrib.bayesflow
-
-  net = tfp.layers.dense_flipout(
-      features, 512, activation=tf.nn.relu)
-  logits = tfp.layers.dense_flipout(net, 10)
-  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
-      labels=labels, logits=logits)
-  kl = sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
-  loss = neg_log_likelihood + kl
-  train_op = tf.train.AdamOptimizer().minimize(loss)
-  ```
-
-  It uses the Flipout gradient estimator to minimize the
-  Kullback-Leibler divergence up to a constant, also known as the
-  negative Evidence Lower Bound. It consists of the sum of two terms:
-  the expected negative log-likelihood, which we approximate via
-  Monte Carlo; and the KL divergence, which is added via regularizer
-  terms which are arguments to the layer.
-
-  [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
-        Mini-Batches."
-        Anonymous. OpenReview, 2017.
-        https://openreview.net/forum?id=rJnpifWAb
-  """
-  # pylint: enable=g-doc-args
-  layer = DenseFlipout(
-      units,
-      activation=activation,
-      activity_regularizer=activity_regularizer,
-      trainable=trainable,
-      kernel_posterior_fn=kernel_posterior_fn,
-      kernel_posterior_tensor_fn=kernel_posterior_tensor_fn,
-      kernel_prior_fn=kernel_prior_fn,
-      kernel_divergence_fn=kernel_divergence_fn,
-      bias_posterior_fn=bias_posterior_fn,
-      bias_posterior_tensor_fn=bias_posterior_tensor_fn,
-      bias_prior_fn=bias_prior_fn,
-      bias_divergence_fn=bias_divergence_fn,
-      seed=seed,
-      name=name,
-      dtype=inputs.dtype.base_dtype,
-      _scope=name,
-      _reuse=reuse)
-  return layer.apply(inputs)
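
Note on the removed `_apply_variational_kernel` methods: the sketch below restates, in plain Python, the arithmetic that `DenseLocalReparameterization` and `DenseFlipout` perform on the kernel posterior. It is an illustrative sketch under stated assumptions, not the removed implementation: the helper names `matmul`, `local_reparameterization`, and `flipout` are hypothetical, nested lists stand in for tensors, `random.Random` stands in for TensorFlow seeding, and the bias and activation are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def local_reparameterization(inputs, loc, scale, rng):
    """Sample the pre-activations directly instead of sampling the kernel.

    If the kernel entries are independent Normal(loc, scale), then
    inputs @ kernel is Normal with mean inputs @ loc and variance
    inputs**2 @ scale**2.
    """
    mean = matmul(inputs, loc)
    var = matmul([[x * x for x in row] for row in inputs],
                 [[s * s for s in row] for row in scale])
    return [[rng.gauss(m, math.sqrt(v)) for m, v in zip(m_row, v_row)]
            for m_row, v_row in zip(mean, var)]


def flipout(inputs, loc, scale, rng):
    """Share one zero-mean kernel perturbation, decorrelated by random signs."""
    batch, units = len(inputs), len(loc[0])
    # One perturbation drawn from Normal(0, scale), shared by the whole batch.
    perturbation = [[rng.gauss(0.0, s) for s in row] for row in scale]
    # Independent +/-1 signs per example, on the input and output sides.
    sign_in = [[rng.choice((-1.0, 1.0)) for _ in row] for row in inputs]
    sign_out = [[rng.choice((-1.0, 1.0)) for _ in range(units)]
                for _ in range(batch)]
    signed_inputs = [[x * s for x, s in zip(x_row, s_row)]
                     for x_row, s_row in zip(inputs, sign_in)]
    perturbed = matmul(signed_inputs, perturbation)
    mean = matmul(inputs, loc)
    return [[m + p * s for m, p, s in zip(m_row, p_row, s_row)]
            for m_row, p_row, s_row in zip(mean, perturbed, sign_out)]
```

With a zero `scale`, both helpers reduce to the deterministic product `matmul(inputs, loc)`, which is a convenient sanity check. With a nonzero `scale`, the Flipout variant needs two matrix products per call, which matches the docstring's remark that it costs roughly twice the floating point operations of the reparameterization estimator.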
diff --git a/tensorflow/contrib/bayesflow/python/ops/layers_util.py b/tensorflow/contrib/bayesflow/python/ops/layers_util.py
deleted file mode 100644
index 8c1fb203f7328e8260e49b4326d813fbe133613e..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/layers_util.py
+++ /dev/null
@@ -1,191 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Utilities for probabilistic layers.
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.contrib.distributions.python.ops import deterministic as deterministic_lib
-from tensorflow.contrib.distributions.python.ops import independent as independent_lib
-from tensorflow.python.framework import dtypes
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import init_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import nn_ops
-from tensorflow.python.ops import random_ops
-from tensorflow.python.ops.distributions import normal as normal_lib
-
-
-def default_loc_scale_fn(
-    is_singular=False,
-    loc_initializer=init_ops.random_normal_initializer(stddev=0.1),
-    untransformed_scale_initializer=init_ops.random_normal_initializer(
-        mean=-3., stddev=0.1),
-    loc_regularizer=None,
-    untransformed_scale_regularizer=None,
-    loc_constraint=None,
-    untransformed_scale_constraint=None):
-  """Makes closure which creates `loc`, `scale` params from `tf.get_variable`.
-
-  This function produces a closure which produces `loc`, `scale` using
-  `tf.get_variable`. The closure accepts the following arguments:
-
-    dtype: Type of parameter's event.
-    shape: Python `list`-like representing the parameter's event shape.
-    name: Python `str` name prepended to any created (or existing)
-      `tf.Variable`s.
-    trainable: Python `bool` indicating all created `tf.Variable`s should be
-      added to the graph collection `GraphKeys.TRAINABLE_VARIABLES`.
-    add_variable_fn: `tf.get_variable`-like `callable` used to create (or
-      access existing) `tf.Variable`s.
-
-  Args:
-    is_singular: Python `bool` indicating if `scale is None`. Default: `False`.
-    loc_initializer: Initializer function for the `loc` parameters.
-      The default is `tf.random_normal_initializer(mean=0., stddev=0.1)`.
-    untransformed_scale_initializer: Initializer function for the `scale`
-      parameters. Default value: `tf.random_normal_initializer(mean=-3.,
-      stddev=0.1)`. This implies the softplus transformed result has mean
-      approximately `0.05` and std. deviation approximately `0.005`.
-    loc_regularizer: Regularizer function for the `loc` parameters.
-      The default (`None`) is to use the `tf.get_variable` default.
-    untransformed_scale_regularizer: Regularizer function for the `scale`
-      parameters. The default (`None`) is to use the `tf.get_variable` default.
-    loc_constraint: An optional projection function to be applied to the
-      loc after being updated by an `Optimizer`. The function must take as input
-      the unprojected variable and must return the projected variable (which
-      must have the same shape). Constraints are not safe to use when doing
-      asynchronous distributed training.
-      The default (`None`) is to use the `tf.get_variable` default.
-    untransformed_scale_constraint: An optional projection function to be
-      applied to the `scale` parameters after being updated by an `Optimizer`
-      (e.g. used to implement norm constraints or value constraints). The
-      function must take as input the unprojected variable and must return the
-      projected variable (which must have the same shape). Constraints are not
-      safe to use when doing asynchronous distributed training. The default
-      (`None`) is to use the `tf.get_variable` default.
-
-  Returns:
-    default_loc_scale_fn: Python `callable` which instantiates `loc`, `scale`
-    parameters from args: `dtype, shape, name, trainable, add_variable_fn`.
-  """
-  def _fn(dtype, shape, name, trainable, add_variable_fn):
-    """Creates `loc`, `scale` parameters."""
-    loc = add_variable_fn(
-        name=name + "_loc",
-        shape=shape,
-        initializer=loc_initializer,
-        regularizer=loc_regularizer,
-        constraint=loc_constraint,
-        dtype=dtype,
-        trainable=trainable)
-    if is_singular:
-      return loc, None
-    untransformed_scale = add_variable_fn(
-        name=name + "_untransformed_scale",
-        shape=shape,
-        initializer=untransformed_scale_initializer,
-        regularizer=untransformed_scale_regularizer,
-        constraint=untransformed_scale_constraint,
-        dtype=dtype,
-        trainable=trainable)
-    scale = (np.finfo(dtype.as_numpy_dtype).eps +
-             nn_ops.softplus(untransformed_scale))
-    return loc, scale
-  return _fn
-
-
-def default_mean_field_normal_fn(
-    is_singular=False,
-    loc_initializer=None,
-    untransformed_scale_initializer=None,
-    loc_regularizer=None,
-    untransformed_scale_regularizer=None,
-    loc_constraint=None,
-    untransformed_scale_constraint=None):
-  """Creates a function to build Normal distributions with trainable params.
-
-  This function produces a closure which produces `tf.distributions.Normal`
-  parameterized by a `loc` and `scale` each created using `tf.get_variable`. The
-  produced closure accepts the following arguments:
-
-    name: Python `str` name prepended to any created (or existing)
-      `tf.Variable`s.
-    shape: Python `list`-like representing the parameter's event shape.
-    dtype: Type of parameter's event.
-    trainable: Python `bool` indicating all created `tf.Variable`s should be
-      added to the graph collection `GraphKeys.TRAINABLE_VARIABLES`.
-    add_variable_fn: `tf.get_variable`-like `callable` used to create (or
-      access existing) `tf.Variable`s.
-
-  Args:
-    is_singular: Python `bool` if `True`, forces the special case limit of
-      `scale->0`, i.e., a `Deterministic` distribution.
-    loc_initializer: Initializer function for the `loc` parameters.
-      If `None` (default), values are initialized using the default
-      initializer used by `tf.get_variable`.
-    untransformed_scale_initializer: Initializer function for the `scale`
-      parameters. If `None` (default), values are initialized using the default
-      initializer used by `tf.get_variable`.
-    loc_regularizer: Regularizer function for the `loc` parameters.
-    untransformed_scale_regularizer: Regularizer function for the `scale`
-      parameters.
-    loc_constraint: An optional projection function to be applied to the
-      loc after being updated by an `Optimizer`. The function must take as input
-      the unprojected variable and must return the projected variable (which
-      must have the same shape). Constraints are not safe to use when doing
-      asynchronous distributed training.
-    untransformed_scale_constraint: An optional projection function to be
-      applied to the `scale` parameters after being updated by an `Optimizer`
-      (e.g. used to implement norm constraints or value constraints). The
-      function must take as input the unprojected variable and must return the
-      projected variable (which must have the same shape). Constraints are not
-      safe to use when doing asynchronous distributed training.
-
-  Returns:
-    make_normal_fn: Python `callable` which creates a `tf.distributions.Normal`
-      from args: `dtype, shape, name, trainable, add_variable_fn`.
-  """
-  loc_scale_fn_ = default_loc_scale_fn(
-      is_singular,
-      loc_initializer,
-      untransformed_scale_initializer,
-      loc_regularizer,
-      untransformed_scale_regularizer,
-      loc_constraint,
-      untransformed_scale_constraint)
-  def _fn(dtype, shape, name, trainable, add_variable_fn):
-    """Creates multivariate `Deterministic` or `Normal` distribution."""
-    loc, scale = loc_scale_fn_(dtype, shape, name, trainable, add_variable_fn)
-    if scale is None:
-      dist = deterministic_lib.Deterministic(loc=loc)
-    else:
-      dist = normal_lib.Normal(loc=loc, scale=scale)
-    reinterpreted_batch_ndims = array_ops.shape(dist.batch_shape_tensor())[0]
-    return independent_lib.Independent(
-        dist, reinterpreted_batch_ndims=reinterpreted_batch_ndims)
-  return _fn
-
-
-def random_sign(shape, dtype=dtypes.float32, seed=None):
-  """Draw values from {-1, 1} uniformly, i.e., Rademacher distribution."""
-  random_bernoulli = random_ops.random_uniform(shape, minval=0, maxval=2,
-                                               dtype=dtypes.int32,
-                                               seed=seed)
-  return math_ops.cast(2 * random_bernoulli - 1, dtype)
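Note on the removed `default_loc_scale_fn`: it keeps `scale` strictly positive by adding machine epsilon to the softplus of an unconstrained variable. The following is a minimal pure-Python sketch of that transform, assuming scalar inputs and double precision. The helper names `softplus` and `scale_from_untransformed` are illustrative only and are not part of the removed module.

```python
import math
import sys


def softplus(x):
    """Numerically stable log(1 + exp(x)) for a Python float."""
    # max(x, 0) + log1p(exp(-|x|)) avoids overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def scale_from_untransformed(untransformed_scale, eps=sys.float_info.epsilon):
    """Map an unconstrained real to a strictly positive scale.

    Assumption: `eps` plays the role of `np.finfo(dtype).eps` in the
    removed code; here it is the float64 machine epsilon.
    """
    return eps + softplus(untransformed_scale)
```

Even when the softplus underflows to zero for very negative inputs, the result stays positive because of the added epsilon, so a `Normal` built from it should never see a zero scale.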
diff --git a/tensorflow/contrib/bayesflow/python/ops/mcmc_diagnostics.py b/tensorflow/contrib/bayesflow/python/ops/mcmc_diagnostics.py
deleted file mode 100644
index f3a645eafc249d1c39e0d4a238ae7ec8755c78d8..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/mcmc_diagnostics.py
+++ /dev/null
@@ -1,32 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Utilities for Markov Chain Monte Carlo (MCMC) sampling."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-# go/tf-wildcard-import
-# pylint: disable=wildcard-import
-from tensorflow.contrib.bayesflow.python.ops.mcmc_diagnostics_impl import *
-# pylint: enable=wildcard-import
-from tensorflow.python.util.all_util import remove_undocumented
-
-_allowed_symbols = [
-    "effective_sample_size",
-    "potential_scale_reduction",
-]
-
-remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/bayesflow/python/ops/mcmc_diagnostics_impl.py b/tensorflow/contrib/bayesflow/python/ops/mcmc_diagnostics_impl.py
deleted file mode 100644
index 0424b6952bc89ce7fe5b00b0135c9a5fe1faa8cf..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/mcmc_diagnostics_impl.py
+++ /dev/null
@@ -1,400 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Utilities for Markov Chain Monte Carlo (MCMC) sampling.
-
-@@effective_sample_size
-@@potential_scale_reduction
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.contrib.distributions.python.ops import sample_stats
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_util
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import math_ops
-
-__all__ = [
-    "effective_sample_size",
-    "potential_scale_reduction",
-]
-
-
-def effective_sample_size(states,
-                          filter_threshold=0.,
-                          filter_beyond_lag=None,
-                          name=None):
-  """Estimate a lower bound on effective sample size for each independent chain.
-
-  Roughly speaking, "effective sample size" (ESS) is the size of an iid sample
-  with the same variance as `states`.
-
-  More precisely, given a stationary sequence of possibly correlated random
-  variables `X_1, X_2,...,X_N`, each identically distributed, ESS is the number
-  such that
-
-  ```Variance{ N**-1 * Sum{X_i} } = ESS**-1 * Variance{ X_1 }.```
-
-  If the sequence is uncorrelated, `ESS = N`.  In general, one should expect
-  `ESS <= N`, with more highly correlated sequences having smaller `ESS`.
-
-  #### Example of using ESS to estimate standard error.
-
-  ```
-  tfd = tf.contrib.distributions
-  tfb = tf.contrib.bayesflow
-
-  target = tfd.MultivariateNormalDiag(scale_diag=[1., 2.])
-
-  # Get 1000 states from one chain.
-  states, _ = tfb.hmc.sample_chain(
-      num_results=1000,
-      target_log_prob_fn=target.log_prob,
-      current_state=tf.constant([0., 0.]),
-      step_size=0.05,
-      num_leapfrog_steps=20,
-      num_burnin_steps=200)
-  states.shape
-  ==> (1000, 2)
-
-  ess = effective_sample_size(states)
-  ==> Shape (2,) Tensor
-
-  mean, variance = tf.nn.moments(states, axis=0)
-  standard_error = tf.sqrt(variance / ess)
-  ```
-
-  Some math shows that, with `R_k` the auto-correlation sequence,
-  `R_k := Covariance{X_1, X_{1+k}} / Variance{X_1}`, we have
-
-  ```ESS(N) =  N / [ 1 + 2 * ( (N - 1) / N * R_1 + ... + 1 / N * R_{N-1}  ) ]```
-
-  This function estimates the above by first estimating the auto-correlation.
-  Since `R_k` must be estimated using only `N - k` samples, it becomes
-  progressively noisier for larger `k`.  For this reason, the summation over
-  `R_k` should be truncated at some number `filter_beyond_lag < N`.  Since many
-  MCMC methods generate chains where `R_k > 0`, a reasonable criterion is to
-  truncate at the first index where the estimated auto-correlation becomes
-  negative.
-
-  The arguments `filter_beyond_lag`, `filter_threshold` are filters intended to
-  remove noisy tail terms from `R_k`.  They combine in an "OR" manner meaning
-  terms are removed if they were to be filtered under the `filter_beyond_lag` OR
-  `filter_threshold` criteria.
-
-  Args:
-    states:  `Tensor` or list of `Tensor` objects.  Dimension zero should index
-      identically distributed states.
-    filter_threshold:  `Tensor` or list of `Tensor` objects.
-      Must broadcast with `states`.  The auto-correlation sequence is truncated
-      after the first appearance of a term less than `filter_threshold`.
-      Setting to `None` means we use no threshold filter.  Since `|R_k| <= 1`,
-      setting to any number less than `-1` has the same effect.
-    filter_beyond_lag:  `Tensor` or list of `Tensor` objects.  Must be
-      `int`-like and scalar valued.  The auto-correlation sequence is truncated
-      to this length.  Setting to `None` means we do not filter based on number
-      of lags.
-    name:  `String` name to prepend to created ops.
-
-  Returns:
-    ess:  `Tensor` or list of `Tensor` objects.  The effective sample size of
-      each component of `states`.  Shape will be `states.shape[1:]`.
-
-  Raises:
-    ValueError:  If `states` and `filter_threshold` or `states` and
-      `filter_beyond_lag` are both lists with different lengths.
-  """
-  states_was_list = _is_list_like(states)
-
-  # Convert all args to lists.
-  if not states_was_list:
-    states = [states]
-
-  filter_beyond_lag = _broadcast_maybelist_arg(states, filter_beyond_lag,
-                                               "filter_beyond_lag")
-  filter_threshold = _broadcast_maybelist_arg(states, filter_threshold,
-                                              "filter_threshold")
-
-  # Process items, one at a time.
-  with ops.name_scope(name, "effective_sample_size"):
-    ess_list = [
-        _effective_sample_size_single_state(s, ml, mlt)
-        for (s, ml, mlt) in zip(states, filter_beyond_lag, filter_threshold)
-    ]
-
-  if states_was_list:
-    return ess_list
-  return ess_list[0]
-
-
-def _effective_sample_size_single_state(states, filter_beyond_lag,
-                                        filter_threshold):
-  """ESS computation for one single Tensor argument."""
-
-  with ops.name_scope(
-      "effective_sample_size_single_state",
-      values=[states, filter_beyond_lag, filter_threshold]):
-
-    states = ops.convert_to_tensor(states, name="states")
-    dt = states.dtype
-
-    # filter_beyond_lag == None ==> auto_corr is the full sequence.
-    auto_corr = sample_stats.auto_correlation(
-        states, axis=0, max_lags=filter_beyond_lag)
-    if filter_threshold is not None:
-      filter_threshold = ops.convert_to_tensor(
-          filter_threshold, dtype=dt, name="filter_threshold")
-      # Get a binary mask to zero out values of auto_corr below the threshold.
-      #   mask[i, ...] = 1 if auto_corr[j, ...] > threshold for all j <= i,
-      #   mask[i, ...] = 0, otherwise.
-      # So, along dimension zero, the mask will look like [1, 1, ..., 0, 0,...]
-      # Building step by step,
-      #   Assume auto_corr = [1, 0.5, 0.0, 0.3], and filter_threshold = 0.2.
-      # Step 1:  mask = [False, False, True, False]
-      mask = auto_corr < filter_threshold
-      # Step 2:  mask = [0, 0, 1, 0]
-      mask = math_ops.cast(mask, dtype=dt)
-      # Step 3:  mask = [0, 0, 1, 1]
-      mask = math_ops.cumsum(mask, axis=0)
-      # Step 4:  mask = [1, 1, 0, 0]
-      mask = math_ops.maximum(1. - mask, 0.)
-      auto_corr *= mask
-
-    # With R[k] := auto_corr[k, ...],
-    # ESS = N / {1 + 2 * Sum_{k=1}^N (N - k) / N * R[k]}
-    #     = N / {-1 + 2 * Sum_{k=0}^N (N - k) / N * R[k]} (since R[0] = 1)
-    #     approx N / {-1 + 2 * Sum_{k=0}^M (N - k) / N * R[k]}
-    # where M is the filter_beyond_lag truncation point chosen above.
-
-    # Get the factor (N - k) / N, and give it shape [M, 1,...,1], having total
-    # ndims the same as auto_corr
-    n = _axis_size(states, axis=0)
-    k = math_ops.range(0., _axis_size(auto_corr, axis=0))
-    nk_factor = (n - k) / n
-    if auto_corr.shape.ndims is not None:
-      new_shape = [-1] + [1] * (auto_corr.shape.ndims - 1)
-    else:
-      new_shape = array_ops.concat(
-          ([-1],
-           array_ops.ones([array_ops.rank(auto_corr) - 1], dtype=dtypes.int32)),
-          axis=0)
-    nk_factor = array_ops.reshape(nk_factor, new_shape)
-
-    return n / (-1 + 2 * math_ops.reduce_sum(nk_factor * auto_corr, axis=0))
-
-
-def potential_scale_reduction(chains_states,
-                              independent_chain_ndims=1,
-                              name=None):
-  """Gelman and Rubin's potential scale reduction factor for chain convergence.
-
-  Given `N > 1` states from each of `C > 1` independent chains, the potential
-  scale reduction factor, commonly referred to as R-hat, measures convergence of
-  the chains (to the same target) by testing for equality of means.
-  Specifically, R-hat measures the degree to which variance (of the means)
-  between chains exceeds what one would expect if the chains were identically
-  distributed.  See [1], [2].
-
-  Some guidelines:
-
-  * The initial state of the chains should be drawn from a distribution
-    overdispersed with respect to the target.
-  * If all chains converge to the target, then as `N --> infinity`, R-hat --> 1.
-    Before that, R-hat > 1 (except in pathological cases, e.g. if the chain
-    paths were identical).
-  * The above holds for any number of chains `C > 1`.  Increasing `C` does
-    improve the effectiveness of the diagnostic.
-  * Sometimes, R-hat < 1.2 is used to indicate approximate convergence, but of
-    course this is problem dependent.  See [2].
-  * R-hat only measures non-convergence of the mean. If higher moments, or other
-    statistics are desired, a different diagnostic should be used.  See [2].
-
-  #### Examples
-
-  Diagnosing convergence by monitoring 10 chains that each attempt to
-  sample from a 2-variate normal.
-
-  ```python
-  tfd = tf.contrib.distributions
-  tfb = tf.contrib.bayesflow
-
-  target = tfd.MultivariateNormalDiag(scale_diag=[1., 2.])
-
-  # Get 10 (2x) overdispersed initial states.
-  initial_state = target.sample(10) * 2.
-  ==> (10, 2)
-
-  # Get 1000 samples from the 10 independent chains.
-  chains_states, _ = tfb.hmc.sample_chain(
-      num_results=1000,
-      target_log_prob_fn=target.log_prob,
-      current_state=initial_state,
-      step_size=0.05,
-      num_leapfrog_steps=20,
-      num_burnin_steps=200)
-  chains_states.shape
-  ==> (1000, 10, 2)
-
-  rhat = tfb.mcmc_diagnostics.potential_scale_reduction(
-      chains_states, independent_chain_ndims=1)
-
-  # The second dimension needed a longer burn-in.
-  rhat.eval()
-  ==> [1.05, 1.3]
-  ```
-
-  To see why R-hat is reasonable, let `X` be a random variable drawn uniformly
-  from the combined states (combined over all chains).  Then, in the limit
-  `N, C --> infinity`, with `E`, `Var` denoting expectation and variance,
-
-  ```R-hat = ( E[Var[X | chain]] + Var[E[X | chain]] ) / E[Var[X | chain]].```
-
-  Using the law of total variance, the numerator is the variance of the combined
-  states, and the denominator is the total variance minus the variance of the
-  individual chain means.  If the chains are all drawing from the same
-  distribution, they will have the same mean, and thus the ratio should be one.
-
-  [1] "Inference from Iterative Simulation Using Multiple Sequences"
-      Andrew Gelman and Donald B. Rubin
-      Statist. Sci. Volume 7, Number 4 (1992), 457-472.
-  [2] "General Methods for Monitoring Convergence of Iterative Simulations"
-      Stephen P. Brooks and Andrew Gelman
-      Journal of Computational and Graphical Statistics, 1998. Vol 7, No. 4.
-
-  Args:
-    chains_states:  `Tensor` or Python `list` of `Tensor`s representing the
-      state(s) of a Markov Chain at each result step.  The `ith` state is
-      assumed to have shape `[Ni, Ci1, Ci2,...,CiD] + A`.
-      Dimension `0` indexes the `Ni > 1` result steps of the Markov Chain.
-      Dimensions `1` through `D` index the `Ci1 x ... x CiD` independent
-      chains to be tested for convergence to the same target.
-      The remaining dimensions, `A`, can have any shape (even empty).
-    independent_chain_ndims: Integer type `Tensor` with value `>= 1` giving the
-      number of dimensions, from `dim = 1` to `dim = D`,
-      holding independent chain results to be tested for convergence.
-    name: `String` name to prepend to created ops.  Default:
-      `potential_scale_reduction`.
-
-  Returns:
-    `Tensor` or Python `list` of `Tensor`s representing the R-hat statistic for
-    the state(s).  Same `dtype` as `state`, and shape equal to
-    `state.shape[1 + independent_chain_ndims:]`.
-
-  Raises:
-    ValueError:  If `independent_chain_ndims < 1`.
-  """
-  chains_states_was_list = _is_list_like(chains_states)
-  if not chains_states_was_list:
-    chains_states = [chains_states]
-
-  # tensor_util.constant_value returns None iff a constant value (as a numpy
-  # array) is not efficiently computable.  Therefore, we try constant_value then
-  # check for None.
-  icn_const_ = tensor_util.constant_value(
-      ops.convert_to_tensor(independent_chain_ndims))
-  if icn_const_ is not None:
-    independent_chain_ndims = icn_const_
-    if icn_const_ < 1:
-      raise ValueError(
-          "Argument `independent_chain_ndims` must be `>= 1`, found: {}".format(
-              independent_chain_ndims))
-
-  with ops.name_scope(name, "potential_scale_reduction"):
-    rhat_list = [
-        _potential_scale_reduction_single_state(s, independent_chain_ndims)
-        for s in chains_states
-    ]
-
-  if chains_states_was_list:
-    return rhat_list
-  return rhat_list[0]
-
-
-def _potential_scale_reduction_single_state(state, independent_chain_ndims):
-  """potential_scale_reduction for one single state `Tensor`."""
-  with ops.name_scope(
-      "potential_scale_reduction_single_state",
-      values=[state, independent_chain_ndims]):
-    # We assume exactly one leading dimension indexes e.g. correlated samples
-    # from each Markov chain.
-    state = ops.convert_to_tensor(state, name="state")
-    sample_ndims = 1
-
-    sample_axis = math_ops.range(0, sample_ndims)
-    chain_axis = math_ops.range(sample_ndims,
-                                sample_ndims + independent_chain_ndims)
-    sample_and_chain_axis = math_ops.range(
-        0, sample_ndims + independent_chain_ndims)
-
-    n = _axis_size(state, sample_axis)
-    m = _axis_size(state, chain_axis)
-
-    # In the language of [2],
-    # B / n is the between chain variance, the variance of the chain means.
-    # W is the within sequence variance, the mean of the chain variances.
-    b_div_n = _reduce_variance(
-        math_ops.reduce_mean(state, sample_axis, keepdims=True),
-        sample_and_chain_axis,
-        biased=False)
-    w = math_ops.reduce_mean(
-        _reduce_variance(state, sample_axis, keepdims=True, biased=True),
-        sample_and_chain_axis)
-
-    # sigma^2_+ is an estimate of the true variance, which would be unbiased if
-    # each chain was drawn from the target.  c.f. "law of total variance."
-    sigma_2_plus = w + b_div_n
-
-    return ((m + 1.) / m) * sigma_2_plus / w - (n - 1.) / (m * n)
-
-
-# TODO(b/72873233) Move some variant of this to sample_stats.
-def _reduce_variance(x, axis=None, biased=True, keepdims=False):
-  with ops.name_scope("reduce_variance"):
-    x = ops.convert_to_tensor(x, name="x")
-    mean = math_ops.reduce_mean(x, axis=axis, keepdims=True)
-    biased_var = math_ops.reduce_mean(
-        math_ops.squared_difference(x, mean), axis=axis, keepdims=keepdims)
-    if biased:
-      return biased_var
-    n = _axis_size(x, axis)
-    return (n / (n - 1.)) * biased_var
-
-
-def _axis_size(x, axis=None):
-  """Get number of elements of `x` in `axis`, as type `x.dtype`."""
-  if axis is None:
-    return math_ops.cast(array_ops.size(x), x.dtype)
-  return math_ops.cast(
-      math_ops.reduce_prod(array_ops.gather(array_ops.shape(x), axis)), x.dtype)
-
-
-def _is_list_like(x):
-  """Helper which returns `True` if input is `list`-like."""
-  return isinstance(x, (tuple, list))
-
-
-def _broadcast_maybelist_arg(states, secondary_arg, name):
-  """Broadcast a listable secondary_arg to that of states."""
-  if _is_list_like(secondary_arg):
-    if len(secondary_arg) != len(states):
-      raise ValueError("Argument `{}` had length {}, but `states` had length "
-                       "{}".format(name, len(secondary_arg), len(states)))
-  else:
-    secondary_arg = [secondary_arg] * len(states)
-
-  return secondary_arg
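Note on the removed diagnostics: the two estimators can be restated without TensorFlow. The following is a rough pure-Python sketch for scalar chains, not the removed implementation. It assumes the lag-`k` auto-correlation is averaged over its `N - k` available pairs (the normalization used by `sample_stats.auto_correlation` is not shown in this diff), and it takes chains as a list of lists rather than a `[N, C]` tensor. `filter_beyond_lag` is omitted for brevity.

```python
from statistics import mean, pvariance, variance


def auto_correlation(x):
    """Estimate R_k for k = 0..N-1 (assumed normalization: N - k pairs per lag)."""
    n = len(x)
    mu = sum(x) / n
    centered = [v - mu for v in x]
    var = sum(c * c for c in centered) / n
    return [
        sum(centered[i] * centered[i + k] for i in range(n - k)) / ((n - k) * var)
        for k in range(n)
    ]


def effective_sample_size(x, filter_threshold=0.0):
    """ESS = N / (-1 + 2 * sum_k (N - k) / N * R_k), truncated by threshold."""
    n = len(x)
    total = 0.0
    for k, r_k in enumerate(auto_correlation(x)):
        if r_k < filter_threshold:
            break  # Drop this term and every later one, like the cumsum mask.
        total += (n - k) / n * r_k
    return n / (-1.0 + 2.0 * total)


def potential_scale_reduction(chains):
    """R-hat for a list of m chains, each a list of n floats."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2:
        raise ValueError("Need at least 2 chains with at least 2 states each.")
    # B / n: unbiased variance of the per-chain means.
    b_div_n = variance([mean(c) for c in chains])
    # W: mean of the biased within-chain variances.
    w = mean([pvariance(c) for c in chains])
    sigma_2_plus = w + b_div_n
    return ((m + 1.0) / m) * sigma_2_plus / w - (n - 1.0) / (m * n)
```

For example, two chains with well-separated means, such as `[[0., 1., 0., 1.], [10., 11., 10., 11.]]`, give a very large R-hat, while a strongly anti-correlated sequence such as `[1., -1., 1., -1.]` is truncated at lag 1 by the default threshold.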
diff --git a/tensorflow/contrib/bayesflow/python/ops/sgld_optimizer.py b/tensorflow/contrib/bayesflow/python/ops/sgld_optimizer.py
deleted file mode 100644
index 7786656398e3c87704227be95b3cd23a38785249..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/sgld_optimizer.py
+++ /dev/null
@@ -1,220 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""An optimizer module for stochastic gradient Langevin dynamics."""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import check_ops
-from tensorflow.python.ops import control_flow_ops
-from tensorflow.python.ops import init_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import random_ops
-from tensorflow.python.ops import variable_scope as varscope_ops
-from tensorflow.python.training import optimizer
-from tensorflow.python.training import training_ops
-
-
-class SGLDOptimizer(optimizer.Optimizer):
-  """An optimizer module for stochastic gradient Langevin dynamics.
-
-  This implements the preconditioned Stochastic Gradient Langevin Dynamics
-  optimizer [1]. The optimization variable is regarded as a sample from the
-  posterior under Stochastic Gradient Langevin Dynamics with noise rescaled in
-  each dimension according to RMSProp [2].
-
-  Note: If a prior is included in the loss, it should be scaled by
-  `1/num_pseudo_batches`, where num_pseudo_batches is the number of minibatches
-  in the data.  I.e., it should be divided by the `num_pseudo_batches` term
-  described below.
-
-  [1]: "Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural
-       Networks." Chunyuan Li, Changyou Chen, David Carlson, Lawrence Carin.
-       ArXiv:1512.07666, 2015. https://arxiv.org/abs/1512.07666
-  [2]: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
-
-  Args:
-    learning_rate: Scalar `float`-like `Tensor`. The base learning rate for the
-      optimizer. Must be tuned to the specific function being minimized.
-    preconditioner_decay_rate: Scalar `float`-like `Tensor`. The exponential
-      decay rate of the rescaling of the preconditioner (RMSprop). (This is
-      "alpha" in [1]). Should be smaller than but nearly `1` to approximate
-      sampling from the posterior. (Default: `0.95`)
-    num_pseudo_batches: Scalar `int`-like `Tensor`. The effective number of
-      minibatches in the data set.  Trades off noise and prior with the SGD
-      likelihood term. Note: Assumes the loss is taken as the mean over a
-      minibatch. Otherwise if the sum was taken, divide this number by the
-      batch size.  (Default: `1`)
-    burnin: Scalar `int`-like `Tensor`. The number of iterations to collect
-      gradient statistics to update the preconditioner before starting to draw
-      noisy samples. (Default: `25`)
-    diagonal_bias: Scalar `float`-like `Tensor`. Term added to the diagonal of
-      the preconditioner to prevent the preconditioner from degenerating.
-      (Default: `1e-8`)
-    name: Python `str` describing ops managed by this function.
-      (Default: `"SGLDOptimizer"`)
-    variable_scope: Variable scope used for calls to `tf.get_variable`.
-      If `None`, a new variable scope is created using name
-      `ops.get_default_graph().unique_name(name or default_name)`.
-
-  Raises:
-    InvalidArgumentError: If preconditioner_decay_rate is a `Tensor` not in
-      `[0, 1]`.
-  """
-
-  def __init__(self,
-               learning_rate,
-               preconditioner_decay_rate=0.95,
-               num_pseudo_batches=1,
-               burnin=25,
-               diagonal_bias=1e-8,
-               name=None,
-               variable_scope=None):
-    default_name = 'SGLDOptimizer'
-    with ops.name_scope(name, default_name, [
-        learning_rate, preconditioner_decay_rate, num_pseudo_batches, burnin,
-        diagonal_bias
-    ]):
-      if variable_scope is None:
-        var_scope_name = ops.get_default_graph().unique_name(
-            name or default_name)
-        with varscope_ops.variable_scope(var_scope_name) as scope:
-          self._variable_scope = scope
-      else:
-        self._variable_scope = variable_scope
-
-      self._preconditioner_decay_rate = ops.convert_to_tensor(
-          preconditioner_decay_rate, name='preconditioner_decay_rate')
-      self._num_pseudo_batches = ops.convert_to_tensor(
-          num_pseudo_batches, name='num_pseudo_batches')
-      self._burnin = ops.convert_to_tensor(burnin, name='burnin')
-      self._diagonal_bias = ops.convert_to_tensor(
-          diagonal_bias, name='diagonal_bias')
-      self._learning_rate = ops.convert_to_tensor(
-          learning_rate, name='learning_rate')
-
-      with varscope_ops.variable_scope(self._variable_scope):
-        self._counter = varscope_ops.get_variable(
-            'counter', initializer=0, trainable=False)
-
-      self._preconditioner_decay_rate = control_flow_ops.with_dependencies([
-          check_ops.assert_non_negative(
-              self._preconditioner_decay_rate,
-              message='`preconditioner_decay_rate` must be non-negative'),
-          check_ops.assert_less_equal(
-              self._preconditioner_decay_rate,
-              1.,
-              message='`preconditioner_decay_rate` must be at most 1.'),
-      ], self._preconditioner_decay_rate)
-
-      self._num_pseudo_batches = control_flow_ops.with_dependencies([
-          check_ops.assert_greater(
-              self._num_pseudo_batches,
-              0,
-              message='`num_pseudo_batches` must be greater than zero')
-      ], self._num_pseudo_batches)
-
-      self._burnin = control_flow_ops.with_dependencies([
-          check_ops.assert_non_negative(
-              self._burnin, message='`burnin` must be non-negative'),
-          check_ops.assert_integer(
-              self._burnin, message='`burnin` must be an integer')
-      ], self._burnin)
-
-      self._diagonal_bias = control_flow_ops.with_dependencies([
-          check_ops.assert_non_negative(
-              self._diagonal_bias,
-              message='`diagonal_bias` must be non-negative')
-      ], self._diagonal_bias)
-
-      super(SGLDOptimizer, self).__init__(use_locking=False,
-                                          name=name or default_name)
-
-  def _create_slots(self, var_list):
-    for v in var_list:
-      init_rms = init_ops.ones_initializer(dtype=v.dtype)
-      self._get_or_make_slot_with_initializer(v, init_rms, v.get_shape(),
-                                              v.dtype, 'rms', self._name)
-
-  def _prepare(self):
-    # We need to put the conversion and check here because a user will likely
-    # want to decay the learning rate dynamically.
-    self._learning_rate_tensor = control_flow_ops.with_dependencies([
-        check_ops.assert_non_negative(
-            self._learning_rate, message='`learning_rate` must be non-negative')
-    ], ops.convert_to_tensor(self._learning_rate, name='learning_rate_tensor'))
-    self._decay_tensor = ops.convert_to_tensor(
-        self._preconditioner_decay_rate, name='preconditioner_decay_rate')
-
-    super(SGLDOptimizer, self)._prepare()
-
-  def _apply_dense(self, grad, var):
-    rms = self.get_slot(var, 'rms')
-
-    with ops.control_dependencies([
-        self._update_momentum(rms, grad, math_ops.cast(self._decay_tensor,
-                                                       var.dtype.base_dtype))]):
-      new_grad = self._apply_noisy_update(rms, grad)
-
-    return training_ops.apply_gradient_descent(
-        var,
-        math_ops.cast(self._learning_rate_tensor, var.dtype.base_dtype),
-        new_grad,
-        use_locking=self._use_locking).op
-
-  def _apply_sparse(self, grad, var):
-    rms = self.get_slot(var, 'rms')
-
-    with ops.control_dependencies([
-        self._update_momentum(rms, grad, math_ops.cast(self._decay_tensor,
-                                                       var.dtype.base_dtype))]):
-      new_grad = self._apply_noisy_update(rms, grad)
-
-    return training_ops.apply_gradient_descent(
-        var,
-        math_ops.cast(self._learning_rate_tensor, var.dtype.base_dtype),
-        new_grad,
-        use_locking=self._use_locking).op
-
-  def _finish(self, update_ops, name_scope):
-    update_ops.append(self._counter.assign_add(1))
-    return control_flow_ops.group(*update_ops, name=name_scope)
-
-  @property
-  def variable_scope(self):
-    """Variable scope of all calls to `tf.get_variable`."""
-    return self._variable_scope
-
-  def _apply_noisy_update(self, mom, grad):
-    # Compute and apply the gradient update following
-    # preconditioned Langevin dynamics
-    stddev = array_ops.where(
-        array_ops.squeeze(self._counter > self._burnin),
-        math_ops.cast(math_ops.rsqrt(self._learning_rate), grad.dtype),
-        array_ops.zeros([], grad.dtype))
-
-    preconditioner = math_ops.rsqrt(
-        mom + math_ops.cast(self._diagonal_bias, grad.dtype))
-    return (
-        0.5 * preconditioner * grad * math_ops.cast(self._num_pseudo_batches,
-                                                    grad.dtype) +
-        random_ops.random_normal(array_ops.shape(grad), 1.0, dtype=grad.dtype) *
-        stddev * math_ops.sqrt(preconditioner))
-
-  def _update_momentum(self, mom, grad, decay):
-    # Keep an exponentially weighted moving average of squared gradients.
-    # Not thread safe
-    return mom.assign_add((1.0 - decay) * (math_ops.square(grad) - mom))
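Note on the removed `SGLDOptimizer`: its deterministic part is an RMSProp-style moving average of squared gradients plus a preconditioned gradient step. The following is a minimal scalar sketch of those two pieces, assuming plain Python floats. The function names are illustrative, and the injected Gaussian noise term is deliberately left out.

```python
import math


def update_rms(rms, grad, decay=0.95):
    """One step of the exponentially weighted moving average of grad**2."""
    # Same form as `mom.assign_add((1 - decay) * (grad**2 - mom))`.
    return rms + (1.0 - decay) * (grad * grad - rms)


def preconditioned_drift(rms, grad, num_pseudo_batches=1, diagonal_bias=1e-8):
    """Noise-free part of the update: 0.5 * G * grad * num_pseudo_batches."""
    # G = 1 / sqrt(rms + diagonal_bias) is the diagonal preconditioner.
    preconditioner = 1.0 / math.sqrt(rms + diagonal_bias)
    return 0.5 * preconditioner * grad * num_pseudo_batches
```

With `decay` close to `1` the running average changes slowly, which matches the docstring's advice that the decay rate should be slightly below `1`.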
diff --git a/tensorflow/contrib/bayesflow/python/ops/variable_utils.py b/tensorflow/contrib/bayesflow/python/ops/variable_utils.py
deleted file mode 100644
index eadf6f4d5fa1c776e2c71c66c4b64b8f5ac98359..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/variable_utils.py
+++ /dev/null
@@ -1,29 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Utility functions related to managing `tf.Variable`s."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-# go/tf-wildcard-import
-from tensorflow.contrib.bayesflow.python.ops.variable_utils_impl import *  # pylint: disable=wildcard-import,unused-wildcard-import,g-importing-member
-from tensorflow.python.util import all_util
-
-_allowed_symbols = [
-    "externalize_variables_as_args",
-]
-
-all_util.remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/bayesflow/python/ops/variable_utils_impl.py b/tensorflow/contrib/bayesflow/python/ops/variable_utils_impl.py
deleted file mode 100644
index ca3d75b5bfee093449026c7d1d62e3bdeff6b096..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/variable_utils_impl.py
+++ /dev/null
@@ -1,157 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Utility functions related to managing `tf.Variable`s.
-
-@@externalize_variables_as_args
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import warnings
-
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import gradients_impl as gradients_ops
-from tensorflow.python.ops import variable_scope as varscope_ops
-from tensorflow.python.ops import variables as variables_ops
-
-__all__ = [
-    "externalize_variables_as_args",
-]
-
-
-# Cause all warnings to always be triggered.
-# Without this, subsequent calls won't trigger the warning.
-warnings.simplefilter("always")
-
-
-def externalize_variables_as_args(fn,
-                                  fn_args=(),
-                                  ancestor_variables=None,
-                                  possible_ancestor_vars=None,
-                                  assert_variable_override=False,
-                                  name=None):
-  """Converts variables within a callable into explicit args.
-
-  Makes a new callable from `fn` which has arguments `list(fn_args) +
-  list(ancestor_variables)`. If `ancestor_variables` is not specified, it is
-  inferred by checking which of `possible_ancestor_vars` actually influences the
-  return value of `fn` (concretely, gradient of `fn(*fn_args)` is not `None`).
-  By default `possible_ancestor_vars` is `tf.trainable_variables() +
-  tf.get_collection(tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES)`.
-
-  #### Examples:
-
-  ```python
-  num_samples = 2
-  num_dims = 1
-  dtype = np.float32
-
-  def foo(x):
-    x = tf.convert_to_tensor(x, dtype=dtype, name="x")
-    s = x.shape.as_list()
-    y = tf.get_variable(
-        name="y",
-        dtype=dtype,
-        initializer=np.arange(np.prod(s)).reshape(s).astype(dtype))
-    return x + y
-
-  x = tf.constant(dtype([0.1, 0.2]))
-
-  wrapped_foo, discovered_ancestor_variables = (
-      externalize_variables_as_args(foo, [x]))
-
-  new_x = dtype([[1.], [2.]])
-  new_y = dtype([[3.], [4.]])
-  new_result = wrapped_foo(new_x, new_y)
-  # ==> [[4.], [6.]]
-
-  discovered_ancestor_variables == [tf.get_variable("y", dtype)]
-  # ==> [True]
-  ```
-
-  Args:
-    fn: Python callable which returns a `Tensor` and accepts `*fn_args`.
-    fn_args: Python list of args to `fn`. Represents dummy arguments passed to
-      `fn` to trace its execution; actual values are unimportant. These args are
-      only used to construct the output of `fn` and to resolve the ancestor
-      `tf.Variable`s.
-      Default value: `()` (i.e., `fn` takes no args).
-    ancestor_variables: Python list of `tf.Variable`s. When `None` the list is
-      expanded to non-`None` gradients of `fn(*fn_args)`. By directly providing
-      the `ancestor_variables` the internal call to `fn` is avoided.
-      Default value: `None` (i.e., `tf.Variable` dependencies are discovered).
-    possible_ancestor_vars: Python list of possible `tf.Variable`s which might
-      be a dependency of computing `fn(*fn_args)`.
-      Default value: `None` (i.e., expanded as described above).
-    assert_variable_override: Python `bool` indicating that not finding a
-      `tf.Variable` in the override list is an exception.
-      Default value: `False` (i.e., missing a `Variable` triggers a `warning`).
-    name: Python `str` name prefixed to Ops created by this function.
-      Default value: `None` (i.e., "externalize_variables_as_args").
-
-  Returns:
-    wrapped_fn: Python callable taking arguments like
-      `*(list(fn_args) + discovered_ancestor_variables)`.
-    discovered_ancestor_variables: Python list of `tf.Variable`s known to be a
-      dependency of `fn(*fn_args)`.
-
-  Raises:
-    ValueError: if `assert_variable_override` is `True` and `Variable` is
-      requested but not overridden.
-  """
-  def _make_bypassing_custom_getter_fn(new_var_dict):
-    """Return dict value rather than what would otherwise be dict key."""
-    def _custom_getter(getter, *args, **kwargs):
-      v = getter(*args, **kwargs)
-      new_v = new_var_dict.get(v, None)
-      if new_v is None:
-        msg = "Variable \"{}\" not found in bypass dict.".format(v)
-        if assert_variable_override:
-          raise ValueError(msg)
-        warnings.warn(msg)
-        return v
-      return new_v
-    return _custom_getter
-
-  with ops.name_scope(name, "externalize_variables_as_args"):
-    if ancestor_variables is not None and not ancestor_variables:
-      return fn, ()
-    if ancestor_variables is None:
-      y = fn(*fn_args)  # Side-effect: adds trainable vars.
-      if possible_ancestor_vars is None:
-        possible_ancestor_vars = (
-            variables_ops.trainable_variables() +
-            ops.get_collection(ops.GraphKeys.TRAINABLE_RESOURCE_VARIABLES))
-      # TODO(b/72873296): Add a dedicated op for identifying ancestors.
-      ancestors = [v for g, v
-                   in zip(gradients_ops.gradients(y, possible_ancestor_vars),
-                          possible_ancestor_vars)
-                   if g is not None]
-      ancestor_variables = sorted(ancestors, key=lambda v: v.name)
-  n = len(fn_args)
-  def _fn(*args):
-    with ops.name_scope("wrapped_fn"):
-      vars_dict = dict(
-          (k, ops.convert_to_tensor(
-              v, dtype=k.dtype.base_dtype, name=k.op.name))
-          for k, v in zip(ancestor_variables, args[n:]))
-      with varscope_ops.variable_scope(
-          varscope_ops.get_variable_scope(),
-          reuse=True,
-          custom_getter=_make_bypassing_custom_getter_fn(vars_dict)):
-        return fn(*args[:n])
-  return _fn, ancestor_variables
diff --git a/tensorflow/contrib/bayesflow/python/ops/variational_sgd_optimizer.py b/tensorflow/contrib/bayesflow/python/ops/variational_sgd_optimizer.py
deleted file mode 100644
index 4d5f0cfe9713a011b32c5aba8d429847d81f33e2..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/variational_sgd_optimizer.py
+++ /dev/null
@@ -1,279 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""An optimizer module for constant stochastic gradient descent."""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-from tensorflow.python.framework import errors
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import check_ops
-from tensorflow.python.ops import clip_ops
-from tensorflow.python.ops import control_flow_ops
-from tensorflow.python.ops import init_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import state_ops
-from tensorflow.python.ops import variable_scope as varscope_ops
-from tensorflow.python.training import optimizer
-from tensorflow.python.training import training_ops
-
-
-class VariationalSGDOptimizer(optimizer.Optimizer):
-  """An optimizer module for constant stochastic gradient descent.
-
-  This implements an optimizer module for the constant stochastic gradient
-  descent algorithm [1].  The optimization variable is regarded as an
-  approximate sample from the posterior.
-
-  Note: If a prior is included in the loss, it should be scaled by
-  `1/num_pseudo_batches`, where num_pseudo_batches is the number of minibatches
-  in the data.  I.e., it should be divided by the `num_pseudo_batches` term
-  described below.
-
-  [1]: "Stochastic Gradient Descent as Approximate Bayesian Inference."
-       Stephan Mandt, Matthew D. Hoffman, David M. Blei.
-       ArXiv:1704.04289, 2017. https://arxiv.org/abs/1704.04289
-
-  Args:
-    batch_size: Scalar `int`-like `Tensor`. The number of examples in a
-      minibatch in the data set. Note: Assumes the loss is taken as the mean
-      over a minibatch. Otherwise if the sum was taken set this to 1.
-    total_num_examples: Scalar `int`-like `Tensor`. The total number of examples
-      in the data set.
-    max_learning_rate: Scalar `float`-like `Tensor`. A maximum allowable
-      effective coordinate-wise learning rate. The algorithm scales down any
-      effective learning rate (i.e. after preconditioning) that is larger than
-      this. (Default: `1`)
-    preconditioner_decay_rate: Scalar `float`-like `Tensor`. The exponential
-      decay rate of the rescaling of the preconditioner (RMSprop). (This is
-      "alpha" in [1]). Should be smaller than but nearly `1` to approximate
-      sampling from the posterior. (Default: `0.95`)
-    burnin: Scalar `int`-like `Tensor`. The number of iterations to collect
-      gradient statistics to update the preconditioner before starting to draw
-      noisy samples. (Default: `25`)
-    burnin_max_learning_rate: Scalar `float`-like `Tensor`. Maximum learning
-      rate to use during the burnin period.
-      (Default: `1e-6`)
-    use_single_learning_rate: Boolean indicating whether a single learning
-      rate is used instead of coordinate-wise learning rates.
-      (Default: `False`)
-    name: Python `str` describing ops managed by this function.
-      (Default: `"VariationalSGDOptimizer"`)
-    variable_scope: Variable scope used for calls to `tf.get_variable`.
-      If `None`, a new variable scope is created using name
-      `ops.get_default_graph().unique_name(name or default_name)`.
-
-  Raises:
-    InvalidArgumentError: If preconditioner_decay_rate is a `Tensor` not in
-      `(0,1]`.
-  """
-
-  def __init__(self,
-               batch_size,
-               total_num_examples,
-               max_learning_rate=1.0,
-               preconditioner_decay_rate=0.95,
-               burnin=25,
-               burnin_max_learning_rate=1e-6,
-               use_single_learning_rate=False,
-               name=None,
-               variable_scope=None):
-    default_name = 'VariationalSGDOptimizer'
-    with ops.name_scope(name, default_name, [
-        max_learning_rate, preconditioner_decay_rate, batch_size, burnin,
-        burnin_max_learning_rate
-    ]):
-      if variable_scope is None:
-        var_scope_name = ops.get_default_graph().unique_name(
-            name or default_name)
-        with varscope_ops.variable_scope(var_scope_name) as scope:
-          self._variable_scope = scope
-      else:
-        self._variable_scope = variable_scope
-
-      self._preconditioner_decay_rate = ops.convert_to_tensor(
-          preconditioner_decay_rate, name='preconditioner_decay_rate')
-      self._batch_size = ops.convert_to_tensor(batch_size, name='batch_size')
-      self._total_num_examples = ops.convert_to_tensor(
-          total_num_examples, name='total_num_examples')
-      self._burnin = ops.convert_to_tensor(burnin, name='burnin')
-      self._burnin_max_learning_rate = ops.convert_to_tensor(
-          burnin_max_learning_rate, name='burnin_max_learning_rate')
-      self._max_learning_rate = ops.convert_to_tensor(
-          max_learning_rate, name='max_learning_rate')
-      self._use_single_learning_rate = use_single_learning_rate
-
-      with varscope_ops.variable_scope(self._variable_scope):
-        self._counter = varscope_ops.get_variable(
-            'counter', initializer=0, trainable=False)
-
-      self._preconditioner_decay_rate = control_flow_ops.with_dependencies([
-          check_ops.assert_non_negative(
-              self._preconditioner_decay_rate,
-              message='`preconditioner_decay_rate` must be non-negative'),
-          check_ops.assert_less_equal(
-              self._preconditioner_decay_rate,
-              1.,
-              message='`preconditioner_decay_rate` must be at most 1.'),
-      ], self._preconditioner_decay_rate)
-
-      self._batch_size = control_flow_ops.with_dependencies([
-          check_ops.assert_greater(
-              self._batch_size,
-              0,
-              message='`batch_size` must be greater than zero')
-      ], self._batch_size)
-
-      self._total_num_examples = control_flow_ops.with_dependencies([
-          check_ops.assert_greater(
-              self._total_num_examples,
-              0,
-              message='`total_num_examples` must be greater than zero')
-      ], self._total_num_examples)
-
-      self._burnin = control_flow_ops.with_dependencies([
-          check_ops.assert_non_negative(
-              self._burnin, message='`burnin` must be non-negative'),
-          check_ops.assert_integer(
-              self._burnin, message='`burnin` must be an integer')
-      ], self._burnin)
-
-      self._burnin_max_learning_rate = control_flow_ops.with_dependencies([
-          check_ops.assert_non_negative(
-              self._burnin_max_learning_rate,
-              message='`burnin_max_learning_rate` must be non-negative')
-      ], self._burnin_max_learning_rate)
-
-      self._max_learning_rate = control_flow_ops.with_dependencies([
-          check_ops.assert_non_negative(
-              self._max_learning_rate,
-              message='`max_learning_rate` must be non-negative')
-      ], self._max_learning_rate)
-
-      super(VariationalSGDOptimizer, self).__init__(
-          use_locking=False, name=name or default_name)
-
-  def _create_slots(self, var_list):
-    for v in var_list:
-      init_moment = init_ops.zeros_initializer(dtype=v.dtype)
-      self._get_or_make_slot_with_initializer(
-          v, init_moment, v.get_shape(), v.dtype, 'first_moment', self._name)
-      self._get_or_make_slot_with_initializer(
-          v, init_moment, v.get_shape(), v.dtype, 'second_moment', self._name)
-
-  def _prepare(self):
-    self._decay_tensor = ops.convert_to_tensor(
-        self._preconditioner_decay_rate, name='preconditioner_decay_rate')
-    self._batch_size_tensor = ops.convert_to_tensor(
-        self._batch_size, name='batch_size_tensor')
-
-    super(VariationalSGDOptimizer, self)._prepare()
-
-  def _get_coordinatewise_learning_rate(self, grad, var):
-    # Compute the learning rate using a moving average for the diagonal of BB^T
-    avg_first = self.get_slot(var, 'first_moment')
-    avg_second = self.get_slot(var, 'second_moment')
-    decay_tensor = math_ops.cast(self._decay_tensor, var.dtype)
-    batch_size = math_ops.cast(self._batch_size_tensor, var.dtype)
-
-    # Create an estimator for the moving average of gradient mean and variance
-    # via Welford's algorithm
-    if isinstance(grad, ops.Tensor):
-      delta = grad - avg_first
-      first_moment_update = avg_first.assign_add(
-          array_ops.where(self._counter < 1, math_ops.cast(1, var.dtype),
-                          1. - decay_tensor) * delta)
-
-      with ops.control_dependencies([first_moment_update]):
-        second_moment_update = avg_second.assign_add(
-            math_ops.cast(self._counter < 1, var.dtype) *
-            -(1. - decay_tensor) * (
-                avg_second - decay_tensor  * math_ops.square(delta)))
-      diag_preconditioner = control_flow_ops.with_dependencies(
-          [second_moment_update],
-          clip_ops.clip_by_value(avg_second, 1e-12, 1e12))
-    elif isinstance(grad, ops.IndexedSlices):
-      delta = grad.values - array_ops.gather_nd(avg_first, grad.indices)
-      first_moment_update = state_ops.scatter_add(
-          avg_first,
-          grad.indices,
-          array_ops.where(self._counter < 1,
-                          math_ops.cast(1., var.dtype),
-                          1. - decay_tensor) * delta)
-
-      with ops.control_dependencies([first_moment_update]):
-        avg_second = state_ops.scatter_add(
-            avg_second,
-            grad.indices,
-            math_ops.cast(self._counter < 1, var.dtype) *
-            -(1. - decay_tensor) * (
-                array_ops.gather_nd(avg_second, grad.indices) - decay_tensor *
-                math_ops.square(delta)))
-        avg_second = array_ops.gather_nd(avg_second, grad.indices)
-        # TODO(b/70783772)
-        diag_preconditioner = clip_ops.clip_by_value(avg_second, 1e-12, 1e12)
-    else:
-      raise errors.InvalidArgumentError(
-          None, None, 'grad must be of type Tensor or IndexedSlices')
-
-    diag_preconditioner *= batch_size
-
-    if self._use_single_learning_rate:
-      diag_preconditioner = math_ops.reduce_mean(diag_preconditioner)
-
-    # From Theorem 2 Corollary 1 of Mandt et al. 2017
-    return 2. * batch_size / (
-        math_ops.cast(self._total_num_examples, var.dtype.base_dtype) *
-        diag_preconditioner)
-
-  def _apply_dense(self, grad, var):
-
-    max_learning_rate = array_ops.where(self._counter < self._burnin,
-                                        self._burnin_max_learning_rate,
-                                        self._max_learning_rate)
-
-    learn_rates = clip_ops.clip_by_value(
-        self._get_coordinatewise_learning_rate(grad, var), 0.0,
-        math_ops.cast(max_learning_rate, var.dtype.base_dtype))
-
-    newgrad = grad * learn_rates
-    return training_ops.apply_gradient_descent(
-        var,
-        math_ops.cast(1.0, var.dtype),
-        newgrad,
-        use_locking=self._use_locking).op
-
-  def _apply_sparse(self, grad, var):
-
-    max_learning_rate = array_ops.where(self._counter < self._burnin,
-                                        self._burnin_max_learning_rate,
-                                        self._max_learning_rate)
-
-    learn_rate = clip_ops.clip_by_value(
-        self._get_coordinatewise_learning_rate(grad, var), 0.0,
-        math_ops.cast(max_learning_rate, var.dtype))
-    delta = grad.values * learn_rate
-
-    return state_ops.scatter_sub(var, grad.indices, delta,
-                                 use_locking=self._use_locking)
-
-  def _finish(self, update_ops, name_scope):
-    update_ops.append([self._counter.assign_add(1)])
-    return control_flow_ops.group(*update_ops, name=name_scope)
-
-  @property
-  def variable_scope(self):
-    """Variable scope of all calls to `tf.get_variable`."""
-    return self._variable_scope
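Note on the removed `VariationalSGDOptimizer`: `_get_coordinatewise_learning_rate` and `_apply_dense` together derive a per-coordinate step size from a clipped second-moment estimate, then cap it with a smaller limit during burn-in. Below is a rough scalar sketch of that arithmetic in pure Python, assuming the second-moment estimate is already available as a float. The function name and scalar signature are hypothetical; the defaults mirror the removed constructor.

```python
import math


def constant_sgd_learning_rate(second_moment, batch_size, total_num_examples,
                               counter, burnin=25, max_learning_rate=1.0,
                               burnin_max_learning_rate=1e-6):
    """Scalar analogue of the removed coordinate-wise learning-rate rule."""
    # Clip the variance estimate as the removed code does before using it.
    clipped = min(max(second_moment, 1e-12), 1e12)
    diag_preconditioner = clipped * batch_size
    # Formula the removed code cites from Mandt et al. 2017:
    # 2 * B / (N * preconditioner).
    rate = 2.0 * batch_size / (total_num_examples * diag_preconditioner)
    # A tighter cap applies while the step counter is below `burnin`.
    cap = burnin_max_learning_rate if counter < burnin else max_learning_rate
    return min(max(rate, 0.0), cap)


during_burnin = constant_sgd_learning_rate(0.5, 10, 1000, counter=0)
after_burnin = constant_sgd_learning_rate(0.5, 10, 1000, counter=25)
tiny_variance = constant_sgd_learning_rate(0.0, 10, 1000, counter=25)
```

With a near-zero variance estimate the raw rate becomes very large, which is why the removed optimizer clips it to `max_learning_rate`.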
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/BUILD b/tensorflow/contrib/boosted_trees/estimator_batch/BUILD
index 289f5bb3140974d8c37f4938ceef27275b099f9a..dae402204f5b9f211b911c3f492c294e3c9b280c 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/BUILD
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/BUILD
@@ -149,7 +149,7 @@ py_library(
 
 py_test(
     name = "dnn_tree_combined_estimator_test",
-    size = "small",
+    size = "medium",
     srcs = ["dnn_tree_combined_estimator_test.py"],
     srcs_version = "PY2AND3",
     tags = [
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py b/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py
index 23ba76210b3b68d0d0b2eef9d4040882654bdad9..d9b0d89a03dce40d34f76bb1262d26bb587a2dc7 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py
@@ -54,7 +54,7 @@ def make_custom_export_strategy(name,
     An `ExportStrategy`.
   """
   base_strategy = saved_model_export_utils.make_export_strategy(
-      serving_input_fn=export_input_fn)
+      serving_input_fn=export_input_fn, strip_default_attrs=True)
   input_fn = export_input_fn()
   (sorted_feature_names, dense_floats, sparse_float_indices, _, _,
    sparse_int_indices, _, _) = gbdt_batch.extract_features(
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py b/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py
index cec3892b57655dc967b4e7926f7f5a6a30084487..2e7b8cba05b89feaac3f47e13d26e7ae37a7b0ae 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py
@@ -25,15 +25,20 @@ from __future__ import division
 from __future__ import print_function
 
 import six
-
 from tensorflow.contrib import layers
 from tensorflow.contrib.boosted_trees.estimator_batch import trainer_hooks
 from tensorflow.contrib.boosted_trees.python.ops import model_ops
 from tensorflow.contrib.boosted_trees.python.training.functions import gbdt_batch
 from tensorflow.contrib.layers.python.layers import optimizers
+from tensorflow.contrib.learn.python.learn.estimators import constants
 from tensorflow.contrib.learn.python.learn.estimators import estimator
 from tensorflow.contrib.learn.python.learn.estimators import head as head_lib
 from tensorflow.contrib.learn.python.learn.estimators import model_fn
+from tensorflow.contrib.learn.python.learn.estimators import model_fn as contrib_model_fn_lib
+from tensorflow.contrib.learn.python.learn.estimators import prediction_key
+from tensorflow.python.estimator import model_fn as model_fn_lib
+from tensorflow.python.estimator.export import export_output
+from tensorflow.python.feature_column import feature_column as feature_column_lib
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import nn
@@ -46,6 +51,52 @@ from tensorflow.python.training import training_util
 
 _DNN_LEARNING_RATE = 0.001
 
+_CORE_MODE_TO_CONTRIB_MODE_ = {
+    model_fn_lib.ModeKeys.TRAIN: contrib_model_fn_lib.ModeKeys.TRAIN,
+    model_fn_lib.ModeKeys.EVAL: contrib_model_fn_lib.ModeKeys.EVAL,
+    model_fn_lib.ModeKeys.PREDICT: contrib_model_fn_lib.ModeKeys.INFER
+}
+
+
+def _core_mode_to_contrib_mode(mode):
+  return _CORE_MODE_TO_CONTRIB_MODE_[mode]
+
+
+def _export_outputs_to_output_alternatives(export_outputs):
+  """Converts EstimatorSpec.export_outputs to output_alternatives.
+
+  Args:
+    export_outputs: export_outputs created by create_estimator_spec.
+  Returns:
+    converted output_alternatives.
+  """
+  output = dict()
+  if export_outputs is not None:
+    for key, value in export_outputs.items():
+      if isinstance(value, export_output.ClassificationOutput):
+        exported_predictions = {
+            prediction_key.PredictionKey.SCORES: value.scores,
+            prediction_key.PredictionKey.CLASSES: value.classes
+        }
+        output[key] = (constants.ProblemType.CLASSIFICATION,
+                       exported_predictions)
+    return output
+  return None
+
+
+def _estimator_spec_to_model_fn_ops(estimator_spec, is_regression):
+  # Regression heads carry no output alternatives; classification heads do.
+  alternatives = [] if is_regression else (
+      _export_outputs_to_output_alternatives(estimator_spec.export_outputs))
+
+  return model_fn.ModelFnOps(
+      mode=_core_mode_to_contrib_mode(estimator_spec.mode),
+      predictions=estimator_spec.predictions,
+      loss=estimator_spec.loss,
+      train_op=estimator_spec.train_op,
+      eval_metric_ops=estimator_spec.eval_metric_ops,
+      output_alternatives=alternatives)
+
 
 def _get_optimizer(optimizer):
   if callable(optimizer):
@@ -59,16 +110,26 @@ def _add_hidden_layer_summary(value, tag):
   summary.histogram("%s_activation" % tag, value)
 
 
-def _dnn_tree_combined_model_fn(
-    features, labels, mode, head, dnn_hidden_units,
-    dnn_feature_columns, tree_learner_config, num_trees,
-    tree_examples_per_layer,
-    config=None, dnn_optimizer="Adagrad",
-    dnn_activation_fn=nn.relu, dnn_dropout=None,
-    dnn_input_layer_partitioner=None,
-    dnn_input_layer_to_tree=True, dnn_steps_to_train=10000,
-    tree_feature_columns=None,
-    tree_center_bias=True):
+def _dnn_tree_combined_model_fn(features,
+                                labels,
+                                mode,
+                                head,
+                                dnn_hidden_units,
+                                dnn_feature_columns,
+                                tree_learner_config,
+                                num_trees,
+                                tree_examples_per_layer,
+                                config=None,
+                                dnn_optimizer="Adagrad",
+                                dnn_activation_fn=nn.relu,
+                                dnn_dropout=None,
+                                dnn_input_layer_partitioner=None,
+                                dnn_input_layer_to_tree=True,
+                                dnn_steps_to_train=10000,
+                                tree_feature_columns=None,
+                                tree_center_bias=False,
+                                use_core_versions=False,
+                                is_regression=False):
   """DNN and GBDT combined model_fn.
 
   Args:
@@ -106,6 +167,9 @@ def _dnn_tree_combined_model_fn(
       set to True, these features are in addition to dnn_feature_columns.
     tree_center_bias: Whether a separate tree should be created for
       first fitting the bias.
+    use_core_versions: Whether feature columns and loss are from the core (as
+      opposed to contrib) version of tensorflow.
+    is_regression: Whether the problem is regression or not.
 
   Returns:
     A `ModelFnOps` object.
@@ -135,11 +199,17 @@ def _dnn_tree_combined_model_fn(
         "input_from_feature_columns",
         values=tuple(six.itervalues(features)),
         partitioner=dnn_partitioner) as input_layer_scope:
-      input_layer = layers.input_from_feature_columns(
-          columns_to_tensors=features,
-          feature_columns=dnn_feature_columns,
-          weight_collections=[dnn_parent_scope],
-          scope=input_layer_scope)
+      if use_core_versions:
+        input_layer = feature_column_lib.input_layer(
+            features=features,
+            feature_columns=dnn_feature_columns,
+            weight_collections=[dnn_parent_scope])
+      else:
+        input_layer = layers.input_from_feature_columns(
+            columns_to_tensors=features,
+            feature_columns=dnn_feature_columns,
+            weight_collections=[dnn_parent_scope],
+            scope=input_layer_scope)
     previous_layer = input_layer
     for layer_id, num_hidden_units in enumerate(dnn_hidden_units):
       with variable_scope.variable_scope(
@@ -222,24 +292,51 @@ def _dnn_tree_combined_model_fn(
     del loss
     return control_flow_ops.no_op()
 
-  model_fn_ops = head.create_model_fn_ops(
-      features=features,
-      mode=mode,
-      labels=labels,
-      train_op_fn=_no_train_op_fn,
-      logits=tree_train_logits)
-  dnn_train_op = head.create_model_fn_ops(
-      features=features,
-      mode=mode,
-      labels=labels,
-      train_op_fn=_dnn_train_op_fn,
-      logits=dnn_logits).train_op
-  tree_train_op = head.create_model_fn_ops(
-      features=tree_features,
-      mode=mode,
-      labels=labels,
-      train_op_fn=_tree_train_op_fn,
-      logits=tree_train_logits).train_op
+  if use_core_versions:
+    model_fn_ops = head.create_estimator_spec(
+        features=features,
+        mode=mode,
+        labels=labels,
+        train_op_fn=_no_train_op_fn,
+        logits=tree_train_logits)
+    dnn_train_op = head.create_estimator_spec(
+        features=features,
+        mode=mode,
+        labels=labels,
+        train_op_fn=_dnn_train_op_fn,
+        logits=dnn_logits)
+    dnn_train_op = _estimator_spec_to_model_fn_ops(dnn_train_op,
+                                                   is_regression).train_op
+
+    tree_train_op = head.create_estimator_spec(
+        features=tree_features,
+        mode=mode,
+        labels=labels,
+        train_op_fn=_tree_train_op_fn,
+        logits=tree_train_logits)
+    tree_train_op = _estimator_spec_to_model_fn_ops(tree_train_op,
+                                                    is_regression).train_op
+
+    model_fn_ops = _estimator_spec_to_model_fn_ops(model_fn_ops, is_regression)
+  else:
+    model_fn_ops = head.create_model_fn_ops(
+        features=features,
+        mode=mode,
+        labels=labels,
+        train_op_fn=_no_train_op_fn,
+        logits=tree_train_logits)
+    dnn_train_op = head.create_model_fn_ops(
+        features=features,
+        mode=mode,
+        labels=labels,
+        train_op_fn=_dnn_train_op_fn,
+        logits=dnn_logits).train_op
+    tree_train_op = head.create_model_fn_ops(
+        features=tree_features,
+        mode=mode,
+        labels=labels,
+        train_op_fn=_tree_train_op_fn,
+        logits=tree_train_logits).train_op
 
   if tree_center_bias:
     num_trees += 1
@@ -277,7 +374,8 @@ class DNNBoostedTreeCombinedClassifier(estimator.Estimator):
                dnn_input_layer_to_tree=True,
                dnn_steps_to_train=10000,
                tree_feature_columns=None,
-               tree_center_bias=True):
+               tree_center_bias=False,
+               use_core_versions=False):
     """Initializes a DNNBoostedTreeCombinedClassifier instance.
 
     Args:
@@ -322,6 +420,8 @@ class DNNBoostedTreeCombinedClassifier(estimator.Estimator):
         set to True, these features are in addition to dnn_feature_columns.
       tree_center_bias: Whether a separate tree should be created for
         first fitting the bias.
+      use_core_versions: Whether feature columns and loss are from the core (as
+        opposed to contrib) version of TensorFlow.
     """
     head = head_lib.multi_class_head(
         n_classes=n_classes,
@@ -336,8 +436,8 @@ class DNNBoostedTreeCombinedClassifier(estimator.Estimator):
           tree_learner_config, num_trees, tree_examples_per_layer, config,
           dnn_optimizer, dnn_activation_fn, dnn_dropout,
           dnn_input_layer_partitioner, dnn_input_layer_to_tree,
-          dnn_steps_to_train,
-          tree_feature_columns, tree_center_bias)
+          dnn_steps_to_train, tree_feature_columns, tree_center_bias,
+          use_core_versions)
 
     super(DNNBoostedTreeCombinedClassifier, self).__init__(
         model_fn=_model_fn, model_dir=model_dir,
@@ -366,7 +466,8 @@ class DNNBoostedTreeCombinedRegressor(estimator.Estimator):
                dnn_input_layer_to_tree=True,
                dnn_steps_to_train=10000,
                tree_feature_columns=None,
-               tree_center_bias=True):
+               tree_center_bias=False,
+               use_core_versions=False):
     """Initializes a DNNBoostedTreeCombinedRegressor instance.
 
     Args:
@@ -411,6 +512,8 @@ class DNNBoostedTreeCombinedRegressor(estimator.Estimator):
         set to True, these features are in addition to dnn_feature_columns.
       tree_center_bias: Whether a separate tree should be created for
         first fitting the bias.
+      use_core_versions: Whether feature columns and loss are from the core (as
+        opposed to contrib) version of TensorFlow.
     """
     head = head_lib.regression_head(
         label_name=label_name,
@@ -426,11 +529,26 @@ class DNNBoostedTreeCombinedRegressor(estimator.Estimator):
 
     def _model_fn(features, labels, mode, config):
       return _dnn_tree_combined_model_fn(
-          features, labels, mode, head, dnn_hidden_units, dnn_feature_columns,
-          tree_learner_config, num_trees, tree_examples_per_layer, config,
-          dnn_optimizer, dnn_activation_fn, dnn_dropout,
-          dnn_input_layer_partitioner, dnn_input_layer_to_tree,
-          dnn_steps_to_train, tree_feature_columns, tree_center_bias)
+          features,
+          labels,
+          mode,
+          head,
+          dnn_hidden_units,
+          dnn_feature_columns,
+          tree_learner_config,
+          num_trees,
+          tree_examples_per_layer,
+          config,
+          dnn_optimizer,
+          dnn_activation_fn,
+          dnn_dropout,
+          dnn_input_layer_partitioner,
+          dnn_input_layer_to_tree,
+          dnn_steps_to_train,
+          tree_feature_columns,
+          tree_center_bias,
+          use_core_versions,
+          is_regression=True)
 
     super(DNNBoostedTreeCombinedRegressor, self).__init__(
         model_fn=_model_fn, model_dir=model_dir,
@@ -460,7 +578,8 @@ class DNNBoostedTreeCombinedEstimator(estimator.Estimator):
                dnn_input_layer_to_tree=True,
                dnn_steps_to_train=10000,
                tree_feature_columns=None,
-               tree_center_bias=True):
+               tree_center_bias=False,
+               use_core_versions=False):
     """Initializes a DNNBoostedTreeCombinedEstimator instance.
 
     Args:
@@ -500,6 +619,8 @@ class DNNBoostedTreeCombinedEstimator(estimator.Estimator):
         set to True, these features are in addition to dnn_feature_columns.
       tree_center_bias: Whether a separate tree should be created for
         first fitting the bias.
+      use_core_versions: Whether feature columns and loss are from the core (as
+        opposed to contrib) version of TensorFlow.
     """
     def _model_fn(features, labels, mode, config):
       return _dnn_tree_combined_model_fn(
@@ -507,8 +628,8 @@ class DNNBoostedTreeCombinedEstimator(estimator.Estimator):
           tree_learner_config, num_trees, tree_examples_per_layer, config,
           dnn_optimizer, dnn_activation_fn, dnn_dropout,
           dnn_input_layer_partitioner, dnn_input_layer_to_tree,
-          dnn_steps_to_train,
-          tree_feature_columns, tree_center_bias)
+          dnn_steps_to_train, tree_feature_columns, tree_center_bias,
+          use_core_versions)
 
     super(DNNBoostedTreeCombinedEstimator, self).__init__(
         model_fn=_model_fn, model_dir=model_dir,
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator_test.py b/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator_test.py
index 83d58c561008e8a5a69eb503d1605bb9e940f281..f495edc62f0909880c170ccb4cf5d11e3f20f55c 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator_test.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator_test.py
@@ -19,15 +19,17 @@ from __future__ import division
 from __future__ import print_function
 
 import tempfile
-
 from tensorflow.contrib.boosted_trees.estimator_batch import dnn_tree_combined_estimator as estimator
 from tensorflow.contrib.boosted_trees.proto import learner_pb2
 from tensorflow.contrib.layers.python.layers import feature_column
 from tensorflow.contrib.learn.python.learn.estimators import estimator_test_utils
 from tensorflow.contrib.learn.python.learn.estimators import run_config
+from tensorflow.python.estimator.canned import head as head_lib
+from tensorflow.python.feature_column import feature_column_lib as core_feature_column
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import test_util
+from tensorflow.python.ops.losses import losses
 from tensorflow.python.platform import googletest
 
 
@@ -100,6 +102,35 @@ class DNNBoostedTreeCombinedTest(test_util.TensorFlowTestCase):
     classifier.fit(input_fn=_train_input_fn, steps=15)
     classifier.evaluate(input_fn=_eval_input_fn, steps=1)
 
+  def testFitAndEvaluateDontThrowExceptionWithCore(self):
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 2
+    learner_config.constraints.max_tree_depth = 1
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    # Use core head
+    head_fn = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(
+        loss_reduction=losses.Reduction.SUM_OVER_BATCH_SIZE)
+
+    classifier = estimator.DNNBoostedTreeCombinedEstimator(
+        head=head_fn,
+        dnn_hidden_units=[1],
+        # Use core feature columns
+        dnn_feature_columns=[core_feature_column.numeric_column("x")],
+        tree_learner_config=learner_config,
+        num_trees=1,
+        tree_examples_per_layer=3,
+        model_dir=model_dir,
+        config=config,
+        dnn_steps_to_train=10,
+        dnn_input_layer_to_tree=True,
+        tree_feature_columns=[],
+        use_core_versions=True)
+
+    classifier.fit(input_fn=_train_input_fn, steps=15)
+    classifier.evaluate(input_fn=_eval_input_fn, steps=1)
+
 
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py b/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py
index 01752416b347dd0a5e646283b6b5572592df4690..70454aa6dbdb19297028a3f80822719bef5a0f72 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py
@@ -81,7 +81,8 @@ class GradientBoostedDecisionTreeClassifier(estimator.Estimator):
         n_classes=n_classes,
         weight_column_name=weight_column_name,
         enable_centered_bias=False,
-        loss_fn=loss_fn)
+        loss_fn=loss_fn,
+        label_keys=label_keys)
     if learner_config.num_classes == 0:
       learner_config.num_classes = n_classes
     elif learner_config.num_classes != n_classes:
diff --git a/tensorflow/contrib/boosted_trees/kernels/model_ops.cc b/tensorflow/contrib/boosted_trees/kernels/model_ops.cc
index 754b7bc3270d647fc381033b769eadd7b791771e..3bf33186ec13f5ff991db938d59849c0124a30a0 100644
--- a/tensorflow/contrib/boosted_trees/kernels/model_ops.cc
+++ b/tensorflow/contrib/boosted_trees/kernels/model_ops.cc
@@ -137,6 +137,61 @@ class TreeEnsembleDeserializeOp : public OpKernel {
   }
 };
 
+class TreeEnsembleUsedHandlersOp : public OpKernel {
+ public:
+  explicit TreeEnsembleUsedHandlersOp(OpKernelConstruction* context)
+      : OpKernel(context) {
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("num_all_handlers", &num_handlers_));
+  }
+
+  void Compute(OpKernelContext* context) override {
+    boosted_trees::models::DecisionTreeEnsembleResource* ensemble_resource;
+
+    OP_REQUIRES_OK(context, LookupResource(context, HandleFromInput(context, 0),
+                                           &ensemble_resource));
+    tf_shared_lock l(*ensemble_resource->get_mutex());
+    core::ScopedUnref unref_me(ensemble_resource);
+
+    // Get the stamp token.
+    const Tensor* stamp_token_t;
+    OP_REQUIRES_OK(context, context->input("stamp_token", &stamp_token_t));
+    int64 stamp_token = stamp_token_t->scalar<int64>()();
+
+    // Only the Chief should run this Op and it is guaranteed to be in
+    // a consistent state so the stamps must always match.
+    CHECK(ensemble_resource->is_stamp_valid(stamp_token));
+
+    Tensor* output_used_handlers_t = nullptr;
+    OP_REQUIRES_OK(
+        context, context->allocate_output("used_handlers_mask", {num_handlers_},
+                                          &output_used_handlers_t));
+    auto output_used_handlers = output_used_handlers_t->vec<bool>();
+
+    Tensor* output_num_used_handlers_t = nullptr;
+    OP_REQUIRES_OK(context,
+                   context->allocate_output("num_used_handlers", {},
+                                            &output_num_used_handlers_t));
+    int handler_idx = 0;
+    std::vector<int64> used_handlers = ensemble_resource->GetUsedHandlers();
+    output_num_used_handlers_t->scalar<int64>()() = used_handlers.size();
+    for (int64 i = 0; i < num_handlers_; ++i) {
+      if (handler_idx >= used_handlers.size() ||
+          used_handlers[handler_idx] > i) {
+        output_used_handlers(i) = false;
+      } else {
+        OP_REQUIRES(context, used_handlers[handler_idx] == i,
+                    errors::InvalidArgument("Handler IDs should be sorted."));
+        ++handler_idx;
+        output_used_handlers(i) = true;
+      }
+    }
+  }
+
+ private:
+  int64 num_handlers_;
+};
+
 REGISTER_RESOURCE_HANDLE_KERNEL(DecisionTreeEnsembleResource);
 
 REGISTER_KERNEL_BUILDER(
@@ -155,5 +210,7 @@ REGISTER_KERNEL_BUILDER(Name("TreeEnsembleSerialize").Device(DEVICE_CPU),
 REGISTER_KERNEL_BUILDER(Name("TreeEnsembleDeserialize").Device(DEVICE_CPU),
                         TreeEnsembleDeserializeOp);
 
+REGISTER_KERNEL_BUILDER(Name("TreeEnsembleUsedHandlers").Device(DEVICE_CPU),
+                        TreeEnsembleUsedHandlersOp);
 }  // namespace boosted_trees
 }  // namespace tensorflow
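Reviewer note: the loop in `TreeEnsembleUsedHandlersOp::Compute` above turns a sorted list of used handler IDs into a dense boolean mask. The following is a minimal pure-Python sketch of that same walk, written only to illustrate the logic; the function name `used_handlers_mask` and the use of `ValueError` are assumptions made for this example and are not part of the change.

```python
def used_handlers_mask(used_handlers, num_all_handlers):
    """Illustrative re-statement of the kernel loop (hypothetical helper).

    used_handlers: handler IDs already used by the ensemble, expected sorted.
    num_all_handlers: total number of feature column handlers.
    Returns (number of used handlers, boolean mask of length num_all_handlers).
    """
    mask = []
    handler_idx = 0
    for i in range(num_all_handlers):
        if handler_idx >= len(used_handlers) or used_handlers[handler_idx] > i:
            # Either every used ID has been consumed, or the next one is ahead.
            mask.append(False)
        else:
            # The next used ID must be exactly i; a smaller value means the
            # list contains a duplicate or is out of order.
            if used_handlers[handler_idx] != i:
                raise ValueError("Handler IDs should be sorted.")
            handler_idx += 1
            mask.append(True)
    return len(used_handlers), mask


num_used, mask = used_handlers_mask([1, 5], 6)
print(num_used, mask)
```

With the inputs used in `testUsedHandlers` (IDs 1 and 5, six handlers in total) this yields a count of 2 and a mask with only positions 1 and 5 set.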
diff --git a/tensorflow/contrib/boosted_trees/kernels/training_ops.cc b/tensorflow/contrib/boosted_trees/kernels/training_ops.cc
index 7f8dea1d3c2a04b725843f6e2932a0cdfbc7733c..1bfeed306641111718984b2097512e5ec3fa8630 100644
--- a/tensorflow/contrib/boosted_trees/kernels/training_ops.cc
+++ b/tensorflow/contrib/boosted_trees/kernels/training_ops.cc
@@ -361,27 +361,10 @@ class GrowTreeEnsembleOp : public OpKernel {
     // Increment attempt stats.
     ensemble_resource->IncrementAttempts();
 
-    // In case we want to do feature selection and we have reached the limit,
-    // build a list of handlers used so far to avoid adding new features.
-    std::vector<int64> allowed_handlers;
-    if (learner_config_.constraints().max_number_of_unique_feature_columns() >
-        0) {
-      allowed_handlers = ensemble_resource->GetUsedHandlers();
-      // TODO(soroush): We can disable handlers that are not going to be used to
-      // avoid unnecessary computations.
-      if (allowed_handlers.size() <
-          learner_config_.constraints()
-              .max_number_of_unique_feature_columns()) {
-        // We have not reached the limit yet. Empty the list of allow features
-        // which means we can keep adding new features.
-        allowed_handlers.clear();
-      }
-    }
-
     // Find best splits for each active partition.
     std::map<int32, SplitCandidate> best_splits;
-    FindBestSplitsPerPartition(context, allowed_handlers, partition_ids_list,
-                               gains_list, splits_list, &best_splits);
+    FindBestSplitsPerPartition(context, partition_ids_list, gains_list,
+                               splits_list, &best_splits);
 
     // No-op if no new splits can be considered.
     if (best_splits.empty()) {
@@ -422,19 +405,12 @@ class GrowTreeEnsembleOp : public OpKernel {
   // and finds the best split for each partition.
   void FindBestSplitsPerPartition(
       OpKernelContext* const context,
-      const std::vector<int64>& allowed_handlers,  // Empty means all handlers.
       const OpInputList& partition_ids_list, const OpInputList& gains_list,
       const OpInputList& splits_list,
       std::map<int32, SplitCandidate>* best_splits) {
     // Find best split per partition going through every feature candidate.
     // TODO(salehay): Is this worth parallelizing?
     for (int64 handler_id = 0; handler_id < num_handlers_; ++handler_id) {
-      if (!allowed_handlers.empty()) {
-        if (!std::binary_search(allowed_handlers.begin(),
-                                allowed_handlers.end(), handler_id)) {
-          continue;
-        }
-      }
       const auto& partition_ids = partition_ids_list[handler_id].vec<int32>();
       const auto& gains = gains_list[handler_id].vec<float>();
       const auto& splits = splits_list[handler_id].vec<string>();
diff --git a/tensorflow/contrib/boosted_trees/lib/learner/common/stats/node-stats.h b/tensorflow/contrib/boosted_trees/lib/learner/common/stats/node-stats.h
index cd925f6b65e569538212e9c26aef0abc8482960b..794ba2bcb0aafa26c5e1c90fcd66caf9dd5bf7d5 100644
--- a/tensorflow/contrib/boosted_trees/lib/learner/common/stats/node-stats.h
+++ b/tensorflow/contrib/boosted_trees/lib/learner/common/stats/node-stats.h
@@ -137,7 +137,7 @@ struct NodeStats {
         Eigen::MatrixXf hessian =
             TensorToEigenMatrix(grad_stats.second.t, grad_dim, grad_dim);
         // I is an identity matrix.
-        // The gain in general form is -g^T (H+l2 I)^-1 g.
+        // The gain in general form is g^T (H+l2 I)^-1 g.
         // The node weights are -(H+l2 I)^-1 g.
         Eigen::MatrixXf identity;
         identity.setIdentity(grad_dim, grad_dim);
@@ -240,7 +240,7 @@ struct NodeStats {
   // given regularized Hessian and gradient vector g.
   void CalculateWeightAndGain(const Eigen::MatrixXf& hessian_and_reg,
                               const Eigen::VectorXf& g) {
-    // The gain in general form is -g^T (Hessian_and_regularization)^-1 g.
+    // The gain in general form is g^T (Hessian_and_regularization)^-1 g.
     // The node weights are -(Hessian_and_regularization)^-1 g.
     Eigen::VectorXf weight;
     // If we want to calculate x = K^-1 v, instead of explicitly calculating
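Reviewer note: the corrected comments state that the gain is `g^T (H + l2 I)^-1 g` and the node weights are `-(H + l2 I)^-1 g`. The sketch below works through those two formulas for a 2x2 Hessian in plain Python, as a sanity check that the gain comes out non-negative for a positive-definite regularized Hessian. The helper name `weight_and_gain_2x2` is hypothetical, and the closed-form 2x2 inverse is used here only for brevity; it is not how `node-stats.h` solves the system.

```python
def weight_and_gain_2x2(g, hessian, l2):
    """Compute w = -(H + l2*I)^-1 g and gain = g^T (H + l2*I)^-1 g.

    Hypothetical helper for a 2-dimensional gradient; g is a list of two
    floats and hessian is a 2x2 nested list.
    """
    a = hessian[0][0] + l2
    b = hessian[0][1]
    c = hessian[1][0]
    d = hessian[1][1] + l2
    det = a * d - b * c
    if det == 0:
        raise ValueError("Regularized Hessian is singular.")
    # x = (H + l2*I)^-1 g using the closed-form inverse of a 2x2 matrix.
    x0 = (d * g[0] - b * g[1]) / det
    x1 = (-c * g[0] + a * g[1]) / det
    weight = [-x0, -x1]
    gain = g[0] * x0 + g[1] * x1
    return weight, gain


weight, gain = weight_and_gain_2x2([2.0, 4.0], [[1.0, 0.0], [0.0, 3.0]], 1.0)
print(weight, gain)
```

For this input the regularized Hessian is diag(2, 4), so the weights are [-1, -1] and the gain is 6, which matches the sign convention in the updated comments.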
diff --git a/tensorflow/contrib/boosted_trees/ops/model_ops.cc b/tensorflow/contrib/boosted_trees/ops/model_ops.cc
index 0786c4166410720e8d4d70960e5747ff111076d8..9d6343c7e80f369bf6a5465821c5f4bacb984cd0 100644
--- a/tensorflow/contrib/boosted_trees/ops/model_ops.cc
+++ b/tensorflow/contrib/boosted_trees/ops/model_ops.cc
@@ -110,5 +110,32 @@ stamp_token: Token to use as the new value of the resource stamp.
 tree_ensemble_config: Serialized proto of the ensemble.
 )doc");
 
+REGISTER_OP("TreeEnsembleUsedHandlers")
+    .Attr("num_all_handlers: int >= 0")
+    .Input("tree_ensemble_handle: resource")
+    .Input("stamp_token: int64")
+    .Output("num_used_handlers: int64")
+    .Output("used_handlers_mask: bool")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      shape_inference::ShapeHandle unused_input;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 0, &unused_input));
+      c->set_output(0, c->Scalar());
+      int num_all_handlers;
+      TF_RETURN_IF_ERROR(c->GetAttr("num_all_handlers", &num_all_handlers));
+      c->set_output(1, {c->Vector(num_all_handlers)});
+
+      return Status::OK();
+    })
+    .Doc(R"doc(
+Returns the mask of used handlers along with the number of non-zero elements in
+this mask. Used in feature selection.
+
+tree_ensemble_handle: Handle to the tree ensemble.
+stamp_token: Token that must match the current stamp of the resource.
+num_used_handlers: Number of feature column handlers used in the model.
+used_handlers_mask: A boolean vector showing which handlers are used in the
+                    model.
+)doc");
+
 }  // namespace boosted_trees
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/boosted_trees/ops/quantile_ops.cc b/tensorflow/contrib/boosted_trees/ops/quantile_ops.cc
index ae99d53a2cf805d70d60746cd44f73f7fd9dc6e2..6aa52463987b55a54b7308765920cbe94c15b8d1 100644
--- a/tensorflow/contrib/boosted_trees/ops/quantile_ops.cc
+++ b/tensorflow/contrib/boosted_trees/ops/quantile_ops.cc
@@ -272,6 +272,20 @@ REGISTER_OP("Quantiles")
     .Input("sparse_indices: num_sparse_features * int64")
     .Output("dense_quantiles: num_dense_features * int32")
     .Output("sparse_quantiles: num_sparse_features * int32")
+    .SetShapeFn([](InferenceContext* c) {
+      int num_dense_features;
+      TF_RETURN_IF_ERROR(c->GetAttr("num_dense_features", &num_dense_features));
+      int num_sparse_features;
+      TF_RETURN_IF_ERROR(
+          c->GetAttr("num_sparse_features", &num_sparse_features));
+      // Set output shapes (dense_quantiles and sparse_quantiles) by the
+      // relevant inputs (dense_values and sparse_values). Note that the output
+      // has an additional dimension for dimension_ids.
+      for (int i = 0; i < num_dense_features + num_sparse_features; ++i) {
+        c->set_output(i, c->MakeShape({c->Dim(c->input(i), 0), 2}));
+      }
+      return Status::OK();
+    })
     .Doc(R"doc(
 Computes quantile for each a given list of dense and sparse feature values using
 the given buckets.
diff --git a/tensorflow/contrib/boosted_trees/python/kernel_tests/model_ops_test.py b/tensorflow/contrib/boosted_trees/python/kernel_tests/model_ops_test.py
index 27c288bbf78b3b593d0807e92ac7fd9afc4d2725..63b9c5fddf0d9967d53077608664b59d9ae00481 100644
--- a/tensorflow/contrib/boosted_trees/python/kernel_tests/model_ops_test.py
+++ b/tensorflow/contrib/boosted_trees/python/kernel_tests/model_ops_test.py
@@ -310,6 +310,22 @@ class ModelOpsTest(test_util.TensorFlowTestCase):
         # The third tree was added after the save.
         self.assertAllClose(result.eval(), [[-1.1], [-1.1]])
 
+  def testUsedHandlers(self):
+    with self.test_session():
+      tree_ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
+      tree_ensemble_config.growing_metadata.used_handler_ids.append(1)
+      tree_ensemble_config.growing_metadata.used_handler_ids.append(5)
+      stamp_token = 3
+      tree_ensemble_handle = model_ops.tree_ensemble_variable(
+          stamp_token=stamp_token,
+          tree_ensemble_config=tree_ensemble_config.SerializeToString(),
+          name="create_tree")
+      resources.initialize_resources(resources.shared_resources()).run()
+      result = model_ops.tree_ensemble_used_handlers(
+          tree_ensemble_handle, stamp_token, num_all_handlers=6)
+      self.assertAllEqual([0, 1, 0, 0, 0, 1], result.used_handlers_mask.eval())
+      self.assertEqual(2, result.num_used_handlers.eval())
+
 
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py b/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py
index 8ca1aabacaf53b66aaba184962922294427d6803..3e524efbeac74ff754d63cae92b3e194411cb2de 100644
--- a/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py
+++ b/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py
@@ -1588,7 +1588,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
       self.assertEqual(
           2, tree_ensemble_config.tree_metadata[2].num_tree_weight_updates)
 
-  def testGrowExistingEnsembleTreeWithFeatureSelectionCanStillGrow(self):
+  def testGrowExistingEnsembleTreeWithFeatureSelectionUsedHandlers(self):
     """Test growing a tree with feature selection."""
     with self.test_session() as session:
       # Create existing ensemble with one root split and one bias tree.
@@ -1649,7 +1649,6 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           num_trees_attempted: 2
           num_layers_attempted: 2
           used_handler_ids: 2
-          used_handler_ids: 5
         }
       """, tree_ensemble_config)
       tree_ensemble_handle = model_ops.tree_ensemble_variable(
@@ -1668,183 +1667,8 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           min_node_weight=0,
           pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
           growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
-      # There are 2 handler_ids in used_handler_ids already but one of them
-      # is handler 2, so we can still grow trees.
-      learner_config.constraints.max_number_of_unique_feature_columns = 2
-      learner_config = learner_config.SerializeToString()
-      # Prepare handler inputs.
-      handler1_partitions = np.array([0], dtype=np.int32)
-      handler1_gains = np.array([7.62], dtype=np.float32)
-      handler1_split = [_gen_dense_split_info(5, 0.52, -4.375, 7.143)]
-      handler2_partitions = np.array([0], dtype=np.int32)
-      handler2_gains = np.array([0.63], dtype=np.float32)
-      handler2_split = [_gen_dense_split_info(2, 0.23, -0.6, 0.24)]
-      handler3_partitions = np.array([0], dtype=np.int32)
-      handler3_gains = np.array([7.62], dtype=np.float32)
-      handler3_split = [_gen_categorical_split_info(8, 7, -4.375, 7.143)]
-
-      # Grow tree ensemble.
-      grow_op = training_ops.grow_tree_ensemble(
-          tree_ensemble_handle,
-          stamp_token=0,
-          next_stamp_token=1,
-          learning_rate=1,
-          partition_ids=[
-              handler1_partitions, handler2_partitions, handler3_partitions
-          ],
-          gains=[handler1_gains, handler2_gains, handler3_gains],
-          splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
-          dropout_seed=123,
-          center_bias=True)
-      session.run(grow_op)
-
-      # Expect a new tree to be added with the split from handler 1.
-      _, serialized = session.run(
-          model_ops.tree_ensemble_serialize(tree_ensemble_handle))
-      tree_ensemble_config.ParseFromString(serialized)
-      self.assertEqual(3, len(tree_ensemble_config.trees))
-      self.assertEqual(
-          2, len(tree_ensemble_config.growing_metadata.used_handler_ids))
-
-  def testGrowExistingEnsembleTreeWithFeatureSelectionEmptyEnsemble(self):
-    """Test growing a tree with feature selection with empty ensemble."""
-    with self.test_session() as session:
-      # Create existing ensemble with one root split and one bias tree.
-      tree_ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
-      tree_ensemble_handle = model_ops.tree_ensemble_variable(
-          stamp_token=0,
-          tree_ensemble_config=tree_ensemble_config.SerializeToString(),
-          name="tree_ensemble")
-      resources.initialize_resources(resources.shared_resources()).run()
-
-      # Prepare learner config.
-      learner_config = _gen_learner_config(
-          num_classes=2,
-          l1_reg=0,
-          l2_reg=0,
-          tree_complexity=0,
-          max_depth=1,
-          min_node_weight=0,
-          pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
-          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
-      learner_config.constraints.max_number_of_unique_feature_columns = 2
-      learner_config = learner_config.SerializeToString()
-      # Prepare handler inputs.
-      handler1_partitions = np.array([0], dtype=np.int32)
-      handler1_gains = np.array([7.62], dtype=np.float32)
-      handler1_split = [_gen_dense_split_info(5, 0.52, -4.375, 7.143)]
-      handler2_partitions = np.array([0], dtype=np.int32)
-      handler2_gains = np.array([0.63], dtype=np.float32)
-      handler2_split = [_gen_dense_split_info(2, 0.23, -0.6, 0.24)]
-      handler3_partitions = np.array([0], dtype=np.int32)
-      handler3_gains = np.array([7.62], dtype=np.float32)
-      handler3_split = [_gen_categorical_split_info(8, 7, -4.375, 7.143)]
-
-      # Grow tree ensemble.
-      grow_op = training_ops.grow_tree_ensemble(
-          tree_ensemble_handle,
-          stamp_token=0,
-          next_stamp_token=1,
-          learning_rate=1,
-          partition_ids=[
-              handler1_partitions, handler2_partitions, handler3_partitions
-          ],
-          gains=[handler1_gains, handler2_gains, handler3_gains],
-          splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
-          dropout_seed=123,
-          center_bias=True)
-      session.run(grow_op)
-
-      _, serialized = session.run(
-          model_ops.tree_ensemble_serialize(tree_ensemble_handle))
-      tree_ensemble_config.ParseFromString(serialized)
-      self.assertEqual(1, len(tree_ensemble_config.trees))
-      self.assertEqual(
-          1, len(tree_ensemble_config.growing_metadata.used_handler_ids))
-
-  def testGrowExistingEnsembleTreeWithFeatureSelectionCantGrow(self):
-    """Test growing a tree with feature selection with empty ensemble."""
-    with self.test_session() as session:
-      # Create existing ensemble with one root split and one bias tree.
-      tree_ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
-      text_format.Merge("""
-        trees {
-          nodes {
-            leaf {
-              vector {
-                value: -0.32
-                value: 0.28
-              }
-            }
-          }
-        }
-        trees {
-          nodes {
-            categorical_id_binary_split {
-              feature_column: 3
-              feature_id: 7
-              left_id: 1
-              right_id: 2
-            }
-            node_metadata {
-              gain: 1.3
-            }
-          }
-          nodes {
-            leaf {
-              sparse_vector {
-                index: 0
-                value: 2.3
-              }
-            }
-          }
-          nodes {
-            leaf {
-              sparse_vector {
-                index: 0
-                value: -0.9
-              }
-            }
-          }
-        }
-        tree_weights: 0.7
-        tree_weights: 1
-        tree_metadata {
-          num_tree_weight_updates: 1
-          num_layers_grown: 1
-          is_finalized: true
-        }
-        tree_metadata {
-          num_tree_weight_updates: 5
-          num_layers_grown: 1
-          is_finalized: true
-        }
-        growing_metadata {
-          num_trees_attempted: 2
-          num_layers_attempted: 2
-          used_handler_ids: 4
-          used_handler_ids: 5
-        }
-      """, tree_ensemble_config)
-      tree_ensemble_handle = model_ops.tree_ensemble_variable(
-          stamp_token=0,
-          tree_ensemble_config=tree_ensemble_config.SerializeToString(),
-          name="tree_ensemble")
-      resources.initialize_resources(resources.shared_resources()).run()
 
-      # Prepare learner config.
-      learner_config = _gen_learner_config(
-          num_classes=2,
-          l1_reg=0,
-          l2_reg=0,
-          tree_complexity=0,
-          max_depth=1,
-          min_node_weight=0,
-          pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
-          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
-      learner_config.constraints.max_number_of_unique_feature_columns = 2
+      learner_config.constraints.max_number_of_unique_feature_columns = 3
       learner_config = learner_config.SerializeToString()
       # Prepare handler inputs.
       handler1_partitions = np.array([0], dtype=np.int32)
@@ -1876,12 +1700,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
       _, serialized = session.run(
           model_ops.tree_ensemble_serialize(tree_ensemble_handle))
       tree_ensemble_config.ParseFromString(serialized)
-      # We can't grow a tree since we have reached the limit of 2 unique
-      # features [4, 5] and the only available splits are from
-      # handlers [0, 1, 2].
-      self.assertEqual(2, len(tree_ensemble_config.trees))
-      self.assertEqual(
-          2, len(tree_ensemble_config.growing_metadata.used_handler_ids))
+      self.assertEqual(3, len(tree_ensemble_config.trees))
+      # Handler 2 was already used; handler 0 is being added in this tree.
+      self.assertAllEqual(
+          [0, 2], tree_ensemble_config.growing_metadata.used_handler_ids)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/contrib/boosted_trees/python/ops/model_ops.py b/tensorflow/contrib/boosted_trees/python/ops/model_ops.py
index 7a5f509047d46549ba81039a23d29ec987ca7920..25b2c9e2fd72bd018717e8a87fce726f26bad968 100644
--- a/tensorflow/contrib/boosted_trees/python/ops/model_ops.py
+++ b/tensorflow/contrib/boosted_trees/python/ops/model_ops.py
@@ -25,6 +25,7 @@ from tensorflow.contrib.boosted_trees.python.ops.gen_model_ops import tree_ensem
 from tensorflow.contrib.boosted_trees.python.ops.gen_model_ops import tree_ensemble_serialize
 # pylint: disable=unused-import
 from tensorflow.contrib.boosted_trees.python.ops.gen_model_ops import tree_ensemble_stamp_token
+from tensorflow.contrib.boosted_trees.python.ops.gen_model_ops import tree_ensemble_used_handlers
 # pylint: enable=unused-import
 
 from tensorflow.python.framework import ops
diff --git a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
index f0b66dcbbe1c5167b9993e66b30b1dc8a839c380..85b909e4f2556c520a5bffe46d5954683d9dda5a 100644
--- a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
+++ b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
@@ -57,6 +57,8 @@ PREDICTIONS = "predictions"
 PARTITION_IDS = "partition_ids"
 NUM_LAYERS_ATTEMPTED = "num_layers"
 NUM_TREES_ATTEMPTED = "num_trees"
+NUM_USED_HANDLERS = "num_used_handlers"
+USED_HANDLERS_MASK = "used_handlers_mask"
 _FEATURE_NAME_TEMPLATE = "%s_%d"
 
 
@@ -70,7 +72,8 @@ def _get_column_by_index(tensor, indices):
   return array_ops.reshape(array_ops.gather(p_flat, i_flat), [shape[0], -1])
 
 
-def _make_predictions_dict(stamp, logits, partition_ids, ensemble_stats):
+def _make_predictions_dict(stamp, logits, partition_ids, ensemble_stats,
+                           used_handlers):
   """Returns predictions for the given logits and n_classes.
 
   Args:
@@ -79,6 +82,8 @@ def _make_predictions_dict(stamp, logits, partition_ids, ensemble_stats):
         that contains predictions when no dropout was applied.
     partition_ids: A rank 1 `Tensor` with shape [batch_size].
     ensemble_stats: A TreeEnsembleStatsOp result tuple.
+    used_handlers: A TreeEnsembleUsedHandlersOp result tuple of an int and a
+        boolean mask.
 
   Returns:
     A dict of predictions.
@@ -89,6 +94,8 @@ def _make_predictions_dict(stamp, logits, partition_ids, ensemble_stats):
   result[PARTITION_IDS] = partition_ids
   result[NUM_LAYERS_ATTEMPTED] = ensemble_stats.attempted_layers
   result[NUM_TREES_ATTEMPTED] = ensemble_stats.attempted_trees
+  result[NUM_USED_HANDLERS] = used_handlers.num_used_handlers
+  result[USED_HANDLERS_MASK] = used_handlers.used_handlers_mask
   return result
 
 
@@ -361,6 +368,13 @@ class GradientBoostedDecisionTreeModel(object):
     """
     ensemble_stats = training_ops.tree_ensemble_stats(ensemble_handle,
                                                       ensemble_stamp)
+    num_handlers = (
+        len(self._dense_floats) + len(self._sparse_float_shapes) +
+        len(self._sparse_int_shapes))
+    # Used during feature selection.
+    used_handlers = model_ops.tree_ensemble_used_handlers(
+        ensemble_handle, ensemble_stamp, num_all_handlers=num_handlers)
+
     # We don't need dropout info - we can always restore it based on the
     # seed.
     apply_dropout, seed = _dropout_params(mode, ensemble_stats)
@@ -395,7 +409,7 @@ class GradientBoostedDecisionTreeModel(object):
           use_locking=True)
 
     return _make_predictions_dict(ensemble_stamp, predictions, partition_ids,
-                                  ensemble_stats)
+                                  ensemble_stats, used_handlers)
 
   def predict(self, mode):
     """Returns predictions given the features and mode.
@@ -710,12 +724,28 @@ class GradientBoostedDecisionTreeModel(object):
       active_handlers_current_layer = (
           active_handlers_current_layer <
           self._learner_config.feature_fraction_per_tree)
-      active_handlers = array_ops.stack(active_handlers_current_layer,
-                                        array_ops.ones(
-                                            [len(handlers)], dtype=dtypes.bool))
+      active_handlers = array_ops.stack([
+          active_handlers_current_layer,
+          array_ops.ones([len(handlers)], dtype=dtypes.bool)], axis=1)
     else:
       active_handlers = array_ops.ones([len(handlers), 2], dtype=dtypes.bool)
 
+    if self._learner_config.constraints.max_number_of_unique_feature_columns:
+      target = (
+          self._learner_config.constraints.max_number_of_unique_feature_columns)
+
+      def _feature_selection_active_handlers():
+        # The active list for the current and the next iteration.
+        used_handlers = array_ops.reshape(predictions_dict[USED_HANDLERS_MASK],
+                                          [-1, 1])
+        used_handlers = array_ops.concat([used_handlers, used_handlers], axis=1)
+        return math_ops.logical_and(used_handlers, active_handlers)
+
+      active_handlers = (
+          control_flow_ops.cond(predictions_dict[NUM_USED_HANDLERS] >= target,
+                                _feature_selection_active_handlers,
+                                lambda: active_handlers))
+
     # Prepare empty gradients and hessians when handlers are not ready.
     empty_hess_shape = [1] + hessian_shape.as_list()
     empty_grad_shape = [1] + gradient_shape.as_list()
diff --git a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py
index dba51d4f527792d2a8dedc693f74c07119fd231d..6411f57a5419123e799af9231a04fce8ae7724d4 100644
--- a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py
+++ b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py
@@ -47,6 +47,38 @@ def _squared_loss(label, unused_weights, predictions):
   return loss
 
 
+def _append_to_leaf(leaf, c_id, w):
+  """Helper method for building tree leaves.
+
+  Appends weight contributions for the given class index to a leaf node.
+
+  Args:
+    leaf: leaf node to append to.
+    c_id: class Id for the weight update.
+    w: weight contribution value.
+  """
+  leaf.sparse_vector.index.append(c_id)
+  leaf.sparse_vector.value.append(w)
+
+
+def _set_float_split(split, feat_col, thresh, l_id, r_id):
+  """Helper method for building tree float splits.
+
+  Sets split feature column, threshold and children.
+
+  Args:
+    split: split node to update.
+    feat_col: feature column for the split.
+    thresh: threshold to split on forming rule x <= thresh.
+    l_id: left child Id.
+    r_id: right child Id.
+  """
+  split.feature_column = feat_col
+  split.threshold = thresh
+  split.left_id = l_id
+  split.right_id = r_id
+
+
 class GbdtTest(test_util.TensorFlowTestCase):
 
   def setUp(self):
@@ -917,6 +949,350 @@ class GbdtTest(test_util.TensorFlowTestCase):
           output.trees[0].nodes[2].leaf.sparse_vector.value[0],
           atol=1e-4, rtol=1e-4)
 
+  def testTrainFnChiefFeatureSelectionReachedLimitNoGoodSplit(self):
+    """Tests chief training when the feature limit blocks the good split."""
+    with self.test_session() as sess:
+      ensemble_handle = model_ops.tree_ensemble_variable(
+          stamp_token=0, tree_ensemble_config="", name="tree_ensemble")
+      learner_config = learner_pb2.LearnerConfig()
+      learner_config.learning_rate_tuner.fixed.learning_rate = 0.1
+      learner_config.num_classes = 2
+      learner_config.regularization.l1 = 0
+      learner_config.regularization.l2 = 0
+      learner_config.constraints.max_tree_depth = 1
+      learner_config.constraints.max_number_of_unique_feature_columns = 1
+      learner_config.constraints.min_node_weight = 0
+      features = {}
+      features["dense_float_0"] = array_ops.ones([4, 1], dtypes.float32)
+      # Feature 1 is predictive but it won't be used because we have reached the
+      # limit of num_used_handlers >= max_number_of_unique_feature_columns
+      features["dense_float_1"] = array_ops.constant([0, 0, 1, 1],
+                                                     dtypes.float32)
+
+      gbdt_model = gbdt_batch.GradientBoostedDecisionTreeModel(
+          is_chief=True,
+          num_ps_replicas=0,
+          center_bias=False,
+          ensemble_handle=ensemble_handle,
+          examples_per_layer=1,
+          learner_config=learner_config,
+          logits_dimension=1,
+          features=features)
+
+      predictions = array_ops.constant(
+          [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
+      partition_ids = array_ops.zeros([4], dtypes.int32)
+      ensemble_stamp = variables.Variable(
+          initial_value=0,
+          name="ensemble_stamp",
+          trainable=False,
+          dtype=dtypes.int64)
+
+      predictions_dict = {
+          "predictions":
+              predictions,
+          "predictions_no_dropout":
+              predictions,
+          "partition_ids":
+              partition_ids,
+          "ensemble_stamp":
+              ensemble_stamp,
+          "num_trees":
+              12,
+          "num_used_handlers":
+              array_ops.constant(1, dtype=dtypes.int64),
+          "used_handlers_mask":
+              array_ops.constant([True, False], dtype=dtypes.bool),
+      }
+
+      labels = array_ops.constant([0, 0, 1, 1], dtypes.float32)
+      weights = array_ops.ones([4, 1], dtypes.float32)
+      # Create train op.
+      train_op = gbdt_model.train(
+          loss=math_ops.reduce_mean(
+              _squared_loss(labels, weights, predictions)),
+          predictions_dict=predictions_dict,
+          labels=labels)
+      variables.global_variables_initializer().run()
+      resources.initialize_resources(resources.shared_resources()).run()
+
+      # On first run, expect no splits to be chosen because the quantile
+      # buckets will not be ready.
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      self.assertEquals(len(output.trees), 0)
+      self.assertEquals(len(output.tree_weights), 0)
+      self.assertEquals(stamp_token.eval(), 1)
+
+      # Update the stamp to be able to run a second time.
+      sess.run([ensemble_stamp.assign_add(1)])
+
+      # On second run, expect a trivial split to be chosen to basically
+      # predict the average.
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      self.assertEquals(len(output.trees), 1)
+      self.assertAllClose(output.tree_weights, [0.1])
+      self.assertEquals(stamp_token.eval(), 2)
+      expected_tree = """
+          nodes {
+            dense_float_binary_split {
+              feature_column: 0
+              threshold: 1.0
+              left_id: 1
+              right_id: 2
+            }
+            node_metadata {
+              gain: 0
+            }
+          }
+          nodes {
+            leaf {
+              vector {
+                value: -0.25
+              }
+            }
+          }
+          nodes {
+            leaf {
+              vector {
+                value: 0.0
+              }
+            }
+          }"""
+      self.assertProtoEquals(expected_tree, output.trees[0])
+
+  def testTrainFnChiefFeatureSelectionWithGoodSplits(self):
+    """Tests chief training when the good split uses a selected feature."""
+    with self.test_session() as sess:
+      ensemble_handle = model_ops.tree_ensemble_variable(
+          stamp_token=0, tree_ensemble_config="", name="tree_ensemble")
+      learner_config = learner_pb2.LearnerConfig()
+      learner_config.learning_rate_tuner.fixed.learning_rate = 0.1
+      learner_config.num_classes = 2
+      learner_config.regularization.l1 = 0
+      learner_config.regularization.l2 = 0
+      learner_config.constraints.max_tree_depth = 1
+      learner_config.constraints.max_number_of_unique_feature_columns = 1
+      learner_config.constraints.min_node_weight = 0
+      features = {}
+      features["dense_float_0"] = array_ops.ones([4, 1], dtypes.float32)
+      # Feature 1 is predictive and is in our selected features so it will be
+      # used even when we're at the limit.
+      features["dense_float_1"] = array_ops.constant([0, 0, 1, 1],
+                                                     dtypes.float32)
+
+      gbdt_model = gbdt_batch.GradientBoostedDecisionTreeModel(
+          is_chief=True,
+          num_ps_replicas=0,
+          center_bias=False,
+          ensemble_handle=ensemble_handle,
+          examples_per_layer=1,
+          learner_config=learner_config,
+          logits_dimension=1,
+          features=features)
+
+      predictions = array_ops.constant(
+          [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
+      partition_ids = array_ops.zeros([4], dtypes.int32)
+      ensemble_stamp = variables.Variable(
+          initial_value=0,
+          name="ensemble_stamp",
+          trainable=False,
+          dtype=dtypes.int64)
+
+      predictions_dict = {
+          "predictions":
+              predictions,
+          "predictions_no_dropout":
+              predictions,
+          "partition_ids":
+              partition_ids,
+          "ensemble_stamp":
+              ensemble_stamp,
+          "num_trees":
+              12,
+          "num_used_handlers":
+              array_ops.constant(1, dtype=dtypes.int64),
+          "used_handlers_mask":
+              array_ops.constant([False, True], dtype=dtypes.bool),
+      }
+
+      labels = array_ops.constant([0, 0, 1, 1], dtypes.float32)
+      weights = array_ops.ones([4, 1], dtypes.float32)
+      # Create train op.
+      train_op = gbdt_model.train(
+          loss=math_ops.reduce_mean(
+              _squared_loss(labels, weights, predictions)),
+          predictions_dict=predictions_dict,
+          labels=labels)
+      variables.global_variables_initializer().run()
+      resources.initialize_resources(resources.shared_resources()).run()
+
+      # On first run, expect no splits to be chosen because the quantile
+      # buckets will not be ready.
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      self.assertEquals(len(output.trees), 0)
+      self.assertEquals(len(output.tree_weights), 0)
+      self.assertEquals(stamp_token.eval(), 1)
+
+      # Update the stamp to be able to run a second time.
+      sess.run([ensemble_stamp.assign_add(1)])
+
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+
+      self.assertEquals(len(output.trees), 1)
+      self.assertAllClose(output.tree_weights, [0.1])
+      self.assertEquals(stamp_token.eval(), 2)
+      expected_tree = """
+          nodes {
+            dense_float_binary_split {
+              feature_column: 1
+              left_id: 1
+              right_id: 2
+            }
+            node_metadata {
+              gain: 0.5
+            }
+          }
+          nodes {
+            leaf {
+              vector {
+                value: 0.0
+              }
+            }
+          }
+          nodes {
+            leaf {
+              vector {
+                value: -0.5
+              }
+            }
+          }"""
+      self.assertProtoEquals(expected_tree, output.trees[0])
+
+  def testTrainFnChiefFeatureSelectionReachedLimitIncrementAttemptedLayer(self):
+    """Tests that attempted layers increase when all handlers are disabled."""
+    with self.test_session() as sess:
+      tree_ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
+      tree = tree_ensemble_config.trees.add()
+
+      _set_float_split(tree.nodes.add()
+                       .sparse_float_binary_split_default_right.split, 2, 4.0,
+                       1, 2)
+      _append_to_leaf(tree.nodes.add().leaf, 0, 0.5)
+      _append_to_leaf(tree.nodes.add().leaf, 1, 1.2)
+      tree_ensemble_config.tree_weights.append(1.0)
+      metadata = tree_ensemble_config.tree_metadata.add()
+      metadata.is_finalized = False
+      metadata.num_layers_grown = 1
+      tree_ensemble_config = tree_ensemble_config.SerializeToString()
+      ensemble_handle = model_ops.tree_ensemble_variable(
+          stamp_token=0, tree_ensemble_config=tree_ensemble_config,
+          name="tree_ensemble")
+      learner_config = learner_pb2.LearnerConfig()
+      learner_config.learning_rate_tuner.fixed.learning_rate = 0.1
+      learner_config.num_classes = 2
+      learner_config.regularization.l1 = 0
+      learner_config.regularization.l2 = 0
+      learner_config.constraints.max_tree_depth = 1
+      learner_config.constraints.max_number_of_unique_feature_columns = 1
+      learner_config.constraints.min_node_weight = 0
+      features = {}
+      # Both features will be disabled since the feature selection limit is
+      # already reached.
+      features["dense_float_0"] = array_ops.ones([4, 1], dtypes.float32)
+      features["dense_float_1"] = array_ops.constant([0, 0, 1, 1],
+                                                     dtypes.float32)
+
+      gbdt_model = gbdt_batch.GradientBoostedDecisionTreeModel(
+          is_chief=True,
+          num_ps_replicas=0,
+          center_bias=False,
+          ensemble_handle=ensemble_handle,
+          examples_per_layer=1,
+          learner_config=learner_config,
+          logits_dimension=1,
+          features=features)
+
+      predictions = array_ops.constant(
+          [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
+      partition_ids = array_ops.zeros([4], dtypes.int32)
+      ensemble_stamp = variables.Variable(
+          initial_value=0,
+          name="ensemble_stamp",
+          trainable=False,
+          dtype=dtypes.int64)
+
+      predictions_dict = {
+          "predictions":
+              predictions,
+          "predictions_no_dropout":
+              predictions,
+          "partition_ids":
+              partition_ids,
+          "ensemble_stamp":
+              ensemble_stamp,
+          "num_trees":
+              12,
+          # The limit of 1 has somehow been reached. Both handlers will be
+          # disabled.
+          "num_used_handlers":
+              array_ops.constant(1, dtype=dtypes.int64),
+          "used_handlers_mask":
+              array_ops.constant([False, False], dtype=dtypes.bool),
+      }
+
+      labels = array_ops.constant([0, 0, 1, 1], dtypes.float32)
+      weights = array_ops.ones([4, 1], dtypes.float32)
+      # Create train op.
+      train_op = gbdt_model.train(
+          loss=math_ops.reduce_mean(
+              _squared_loss(labels, weights, predictions)),
+          predictions_dict=predictions_dict,
+          labels=labels)
+      variables.global_variables_initializer().run()
+      resources.initialize_resources(resources.shared_resources()).run()
+
+      # On first run, expect no splits to be chosen because the quantile
+      # buckets will not be ready.
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      self.assertEquals(len(output.trees), 1)
+      self.assertEquals(output.growing_metadata.num_layers_attempted, 1)
+      self.assertEquals(stamp_token.eval(), 1)
+
+      # Update the stamp to be able to run a second time.
+      sess.run([ensemble_stamp.assign_add(1)])
+
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      # Make sure the trees are not modified, but the num_layers_attempted is
+      # incremented so that eventually the training stops.
+      self.assertEquals(len(output.trees), 1)
+      self.assertEquals(len(output.trees[0].nodes), 3)
+
+      self.assertEquals(output.growing_metadata.num_layers_attempted, 2)
 
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/contrib/cluster_resolver/BUILD b/tensorflow/contrib/cluster_resolver/BUILD
index 6b03df2b8eb636ad888d050a3b2b29eabc3f8934..1a124eca364424b651de86bfaac6f33ad131804b 100644
--- a/tensorflow/contrib/cluster_resolver/BUILD
+++ b/tensorflow/contrib/cluster_resolver/BUILD
@@ -110,5 +110,6 @@ tf_py_test(
         "//tensorflow/python:platform_test",
         "//tensorflow/python:training",
     ],
+    grpc_enabled = True,
     main = "python/training/tpu_cluster_resolver_test.py",
 )
diff --git a/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver.py b/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver.py
index b04822fa9d66465e34a545d3b00c399bbb196514..1c480b25134b1e54200e0ddb780bd7bb0f122341 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver.py
@@ -53,11 +53,16 @@ class ClusterResolver(object):
     raise NotImplementedError(
         'cluster_spec is not implemented for {}.'.format(self))
 
+  @abc.abstractmethod
+  def master(self):
+    """..."""
+    raise NotImplementedError('master is not implemented for {}.'.format(self))
+
 
 class SimpleClusterResolver(ClusterResolver):
   """Simple implementation of ClusterResolver that accepts a ClusterSpec."""
 
-  def __init__(self, cluster_spec):
+  def __init__(self, cluster_spec, master=''):
     """Creates a SimpleClusterResolver from a ClusterSpec."""
     super(SimpleClusterResolver, self).__init__()
 
@@ -65,10 +70,18 @@ class SimpleClusterResolver(ClusterResolver):
       raise TypeError('cluster_spec must be a ClusterSpec.')
     self._cluster_spec = cluster_spec
 
+    if not isinstance(master, str):
+      raise TypeError('master must be a string.')
+    self._master = master
+
   def cluster_spec(self):
     """Returns the ClusterSpec passed into the constructor."""
     return self._cluster_spec
 
+  def master(self):
+    """Returns the master address to use when creating a session."""
+    return self._master
+
 
 class UnionClusterResolver(ClusterResolver):
   """Performs a union on underlying ClusterResolvers.
@@ -87,9 +100,13 @@ class UnionClusterResolver(ClusterResolver):
 
     Raises:
       TypeError: If any argument is not a subclass of `ClusterResolvers`.
+      ValueError: If there are no arguments passed.
     """
     super(UnionClusterResolver, self).__init__()
 
+    if not args:
+      raise ValueError('At least one ClusterResolver is required.')
+
     for cluster_resolver in args:
       if not isinstance(cluster_resolver, ClusterResolver):
         raise TypeError('All arguments must be a sub-class of '
@@ -169,3 +186,7 @@ class UnionClusterResolver(ClusterResolver):
           merged_cluster[job_name].update(task_dict)
 
     return ClusterSpec(merged_cluster)
+
+  def master(self):
+    """Returns the master address from the first cluster resolver."""
+    return self._cluster_resolvers[0].master()
diff --git a/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver_test.py b/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver_test.py
index dbfb77723cdaab66e29bb41b764593bb5fd61b35..d9c97d53eb3663f6ab2f7b40395592dc7638b896 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver_test.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver_test.py
@@ -234,5 +234,7 @@ class UnionClusterResolverTest(test.TestCase):
     self._verifyClusterSpecEquality(cluster_spec, expected_proto)
 
 
+# TODO(saeta): Include tests for master resolution
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/cluster_resolver/python/training/gce_cluster_resolver.py b/tensorflow/contrib/cluster_resolver/python/training/gce_cluster_resolver.py
index d6f2eced93ba4fda5ac27f9412b6f729981f4f40..3f5824128948453634bc5e5a7d6fdeedae60f5bd 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/gce_cluster_resolver.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/gce_cluster_resolver.py
@@ -134,3 +134,6 @@ class GceClusterResolver(ClusterResolver):
 
     worker_list.sort()
     return ClusterSpec({self._job_name: worker_list})
+
+  def master(self):
+    return ''
diff --git a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
index a6a6e642e4e4c721b94821a70d55d6fe931347d6..300b19733e2b4d1b912f966e94ae0286ed9c694d 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
@@ -18,12 +18,14 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import os
 
 from six.moves.urllib.request import Request
 from six.moves.urllib.request import urlopen
 
 from tensorflow.contrib.cluster_resolver.python.training.cluster_resolver import ClusterResolver
-from tensorflow.python.training.server_lib import ClusterSpec
+from tensorflow.python.training import server_lib
+from tensorflow.python.util import compat
 
 _GOOGLE_API_CLIENT_INSTALLED = True
 try:
@@ -33,6 +35,9 @@ except ImportError:
   _GOOGLE_API_CLIENT_INSTALLED = False
 
 
+_GKE_ENV_VARIABLE = 'KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS'
+
+
 class TPUClusterResolver(ClusterResolver):
   """Cluster Resolver for Google Cloud TPUs.
 
@@ -46,13 +51,30 @@ class TPUClusterResolver(ClusterResolver):
     req = Request('http://metadata/computeMetadata/v1/%s' % path,
                   headers={'Metadata-Flavor': 'Google'})
     resp = urlopen(req)
-    return resp.read()
+    return compat.as_bytes(resp.read())
+
+  def _shouldResolve(self):
+    if (self._tpu == compat.as_bytes('') or
+        self._tpu == compat.as_bytes('local') or
+        self._tpu.startswith(compat.as_bytes('/bns')) or
+        self._tpu.startswith(compat.as_bytes('grpc://'))):
+      return False
+    return True
+
+  def _inGke(self):
+    """When running in GKE, the environment variable will be set."""
+    return _GKE_ENV_VARIABLE in os.environ
+
+  def _gkeMaster(self):
+    return os.environ[_GKE_ENV_VARIABLE].split(',')[0]
 
   def __init__(self,
-               tpu_names,
+               tpu=None,
                zone=None,
                project=None,
-               job_name='tpu_worker',
+               job_name='worker',
+               coordinator_name='coordinator',
+               coordinator_address=None,
                credentials='default',
                service=None):
     """Creates a new TPUClusterResolver object.
@@ -61,7 +83,11 @@ class TPUClusterResolver(ClusterResolver):
     for the IP addresses and ports of each Cloud TPU listed.
 
     Args:
-      tpu_names: A list of names of the target Cloud TPUs.
+      tpu: Either a string, or a list of strings corresponding to the TPUs to
+        use. If the single string is the empty string, the string 'local', or a
+        string that begins with 'grpc://' or '/bns', then it is assumed to not
+        correspond with a Cloud TPU and will instead be passed as the session
+        master and no ClusterSpec propagation will be done.
       zone: Zone where the TPUs are located. If omitted or empty, we will assume
         that the zone of the TPU is the same as the zone of the GCE VM, which we
         will try to discover from the GCE metadata service.
@@ -69,6 +95,12 @@ class TPUClusterResolver(ClusterResolver):
         empty, we will try to discover the project name of the GCE VM from the
         GCE metadata service.
       job_name: Name of the TensorFlow job the TPUs belong to.
+      coordinator_name: The name to use for the coordinator. Set to None if the
+        coordinator should not be included in the computed ClusterSpec.
+      coordinator_address: The address of the coordinator (typically an ip:port
+        pair). If set to None, a TF server will be started. If coordinator_name
+        is None, a TF server will not be started even if coordinator_address is
+        None.
       credentials: GCE Credentials. If None, then we use default credentials
         from the oauth2client
       service: The GCE API object returned by the googleapiclient.discovery
@@ -77,29 +109,47 @@ class TPUClusterResolver(ClusterResolver):
 
     Raises:
       ImportError: If the googleapiclient is not installed.
+      ValueError: If no TPUs are specified.
     """
+    if isinstance(tpu, list):
+      if not tpu:
+        raise ValueError('At least one TPU must be specified.')
+      if len(tpu) != 1:
+        raise NotImplementedError(
+            'Using multiple TPUs in a single session is not yet implemented')
+      tpu = tpu[0]
+
+    # When using GKE with Cloud TPUs, the env variable will be set.
+    if tpu is None and self._inGke():
+      tpu = self._gkeMaster()
+
+    self._tpu = compat.as_bytes(tpu)  # self._tpu is always bytes
+    self._job_name = job_name
+    self._credentials = credentials
+
+    should_resolve = self._shouldResolve()
 
-    if not project:
-      project = self._requestComputeMetadata('/project/project-id')
+    if not project and should_resolve:
+      project = compat.as_str(
+          self._requestComputeMetadata('project/project-id'))
 
-    if not zone:
-      zone_path = self._requestComputeMetadata('/instance/zone')
+    if not zone and should_resolve:
+      zone_path = compat.as_str(self._requestComputeMetadata('instance/zone'))
       zone = zone_path.split('/')[-1]
 
     self._project = project
     self._zone = zone
-    self._tpu_names = tpu_names
-    self._job_name = job_name
-    self._credentials = credentials
 
-    if credentials == 'default':
+    if credentials == 'default' and should_resolve:
       if _GOOGLE_API_CLIENT_INSTALLED:
         self._credentials = GoogleCredentials.get_application_default()
 
-    if service is None:
+    if service is None and should_resolve:
       if not _GOOGLE_API_CLIENT_INSTALLED:
         raise ImportError('googleapiclient must be installed before using the '
-                          'TPU cluster resolver')
+                          'TPU cluster resolver. Execute: `pip install '
+                          '--upgrade google-api-python-client` to install with '
+                          'pip.')
 
       self._service = discovery.build(
           'tpu', 'v1alpha1',
@@ -107,25 +157,41 @@ class TPUClusterResolver(ClusterResolver):
     else:
       self._service = service
 
-  def get_master(self):
-    """Get the ClusterSpec grpc master path.
+    self._coordinator_name = coordinator_name
+    if coordinator_name and not coordinator_address and should_resolve:
+      self._start_local_server()
+    else:
+      self._coordinator_address = coordinator_address
 
-    This returns the grpc path (grpc://1.2.3.4:8470) of first instance in the
-    ClusterSpec returned by the cluster_spec function. This is suitable for use
-    for the `master` argument in tf.Session() when you are using one TPU.
+  def master(self):
+    """Get the Master string to be used for the session.
+
+    In the normal case, this returns the grpc path (grpc://1.2.3.4:8470) of
+    first instance in the ClusterSpec returned by the cluster_spec function.
+
+    If a non-TPU name is used when constructing a TPUClusterResolver, that will
+    be returned instead (e.g. if the `tpu` argument's value when constructing
+    this TPUClusterResolver was 'grpc://10.240.1.2:8470',
+    'grpc://10.240.1.2:8470' will be returned).
 
     Returns:
-      string, the grpc path of the first instance in the ClusterSpec.
+      string, the connection string to use when creating a session.
 
     Raises:
       ValueError: If none of the TPUs specified exists.
     """
+    if not self._shouldResolve():
+      return self._tpu
+
     job_tasks = self.cluster_spec().job_tasks(self._job_name)
     if not job_tasks:
       raise ValueError('No TPUs exists with the specified names exist.')
 
     return 'grpc://' + job_tasks[0]
 
+  def get_master(self):
+    return self.master()
+
   def cluster_spec(self):
     """Returns a ClusterSpec object based on the latest TPU information.
 
@@ -134,17 +200,54 @@ class TPUClusterResolver(ClusterResolver):
 
     Returns:
       A ClusterSpec containing host information returned from Cloud TPUs.
-    """
-    worker_list = []
 
-    for tpu_name in self._tpu_names:
-      full_name = 'projects/%s/locations/%s/nodes/%s' % (
-          self._project, self._zone, tpu_name)
-      request = self._service.projects().locations().nodes().get(name=full_name)
-      response = request.execute()
-
-      if 'health' in response and response['health'] == 'HEALTHY':
-        instance_url = '%s:%s' % (response['ipAddress'], response['port'])
-        worker_list.append(instance_url)
-
-    return ClusterSpec({self._job_name: worker_list})
+    Raises:
+      RuntimeError: If the provided TPU is not healthy.
+    """
+    if not self._shouldResolve():
+      return server_lib.ClusterSpec({})
+
+    full_name = 'projects/%s/locations/%s/nodes/%s' % (
+        self._project, self._zone, compat.as_text(self._tpu))
+    request = self._service.projects().locations().nodes().get(name=full_name)
+    response = request.execute()
+
+    if 'health' in response and response['health'] != 'HEALTHY':
+      raise RuntimeError('TPU "%s" is unhealthy: "%s"' % (
+          compat.as_text(self._tpu), response['health']))
+
+    if 'networkEndpoints' in response:
+      worker_list = [
+          '%s:%s' % (endpoint['ipAddress'], endpoint['port'])
+          for endpoint in response['networkEndpoints']
+      ]
+    else:
+      # Fall back to the deprecated response format
+      instance_url = '%s:%s' % (response['ipAddress'], response['port'])
+      worker_list = [instance_url]
+
+    cluster_spec = {self._job_name: worker_list}
+
+    if self._coordinator_address:
+      cluster_spec[self._coordinator_name] = [self._coordinator_address]
+
+    return server_lib.ClusterSpec(cluster_spec)
+
+  def _start_local_server(self):
+    address = self._requestComputeMetadata('instance/network-interfaces/0/ip')
+    self._server = server_lib.Server(
+        {
+            'local': ['0.0.0.0:0']
+        }, protocol='grpc', config=None, start=True)
+    # self._server.target is of the form: grpc://ipaddress:port
+    target = compat.as_bytes(self._server.target)
+    splits = target.split(compat.as_bytes(':'))
+    assert len(splits) == 3, self._server.target
+    assert splits[0] == compat.as_bytes('grpc'), self._server.target
+    self._coordinator_port = compat.as_text(splits[2])
+    self._coordinator_address = '%s:%s' % (
+        address, compat.as_text(self._coordinator_port))
+
+  def __deepcopy__(self, memo):
+    # TODO(b/73668574): Remove this once RunConfig avoids performing deepcopy.
+    return self
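Review note: the branches added to `cluster_spec()` above can be exercised without the Cloud TPU API. The following is a minimal sketch, not the resolver's own code. `build_cluster_dict` is a hypothetical helper that takes an already-fetched node response and mirrors the health check, the `networkEndpoints` path, the deprecated single-address fallback, and the optional coordinator job. The default job names `worker` and `coordinator` are assumed from the expected protos in the accompanying tests.

```python
def build_cluster_dict(response, job_name='worker',
                       coordinator_name='coordinator',
                       coordinator_address=None):
    """Hypothetical helper mirroring the branches of cluster_spec() in the diff."""
    # An explicit non-HEALTHY status is an error; a missing 'health' key is tolerated.
    if 'health' in response and response['health'] != 'HEALTHY':
        raise RuntimeError('TPU is unhealthy: "%s"' % response['health'])

    if 'networkEndpoints' in response:
        # New response format: one entry per host, so a pod yields several workers.
        workers = ['%s:%s' % (endpoint['ipAddress'], endpoint['port'])
                   for endpoint in response['networkEndpoints']]
    else:
        # Deprecated response format: a single top-level address and port.
        workers = ['%s:%s' % (response['ipAddress'], response['port'])]

    cluster = {job_name: workers}
    if coordinator_address:
        cluster[coordinator_name] = [coordinator_address]
    return cluster


pod_response = {
    'health': 'HEALTHY',
    'networkEndpoints': [
        {'ipAddress': '10.2.3.4', 'port': 8470},
        {'ipAddress': '10.2.3.5', 'port': 8470},
    ],
}
print(build_cluster_dict(pod_response, coordinator_address='10.128.1.5:10203'))
```

Returning a plain dict keeps the sketch free of TensorFlow imports; in the diff the same mapping is wrapped in `server_lib.ClusterSpec`.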
diff --git a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py
index 4fd34629cf74f90869c77b8cb098d3c585a49404..48c3f6bb4f2d1643982e03d9ed68db14c10c184a 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py
@@ -18,10 +18,12 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import os
+
 from tensorflow.contrib.cluster_resolver.python.training.tpu_cluster_resolver import TPUClusterResolver
 from tensorflow.python.platform import test
 from tensorflow.python.training import server_lib
-
+from tensorflow.python.util import compat
 
 mock = test.mock
 
@@ -50,10 +52,12 @@ class MockNodeClass(object):
 
 def mock_request_compute_metadata(cls, *args, **kwargs):
   del cls, kwargs  # Unused.
-  if args[0] == '/project/project-id':
+  if args[0] == 'project/project-id':
     return 'test-project'
-  elif args[0] == '/instance/zone':
+  elif args[0] == 'instance/zone':
     return 'projects/test-project/locations/us-central1-c'
+  elif args[0] == 'instance/network-interfaces/0/ip':
+    return '10.128.1.2'
   return ''
 
 
@@ -71,18 +75,17 @@ class TPUClusterResolverTest(test.TestCase):
       expected_proto: Expected protobuf
     """
     self.assertProtoEquals(expected_proto, cluster_spec.as_cluster_def())
-    self.assertProtoEquals(
-        expected_proto, server_lib.ClusterSpec(cluster_spec).as_cluster_def())
-    self.assertProtoEquals(
-        expected_proto,
-        server_lib.ClusterSpec(cluster_spec.as_cluster_def()).as_cluster_def())
     self.assertProtoEquals(
         expected_proto,
-        server_lib.ClusterSpec(cluster_spec.as_dict()).as_cluster_def())
+        server_lib.ClusterSpec(cluster_spec).as_cluster_def())
+    self.assertProtoEquals(expected_proto,
+                           server_lib.ClusterSpec(
+                               cluster_spec.as_cluster_def()).as_cluster_def())
+    self.assertProtoEquals(expected_proto,
+                           server_lib.ClusterSpec(
+                               cluster_spec.as_dict()).as_cluster_def())
 
-  def mock_service_client(
-      self,
-      tpu_map=None):
+  def mock_service_client(self, tpu_map=None):
 
     if tpu_map is None:
       tpu_map = {}
@@ -98,8 +101,7 @@ class TPUClusterResolverTest(test.TestCase):
 
     return mock_client
 
-  @mock.patch.object(TPUClusterResolver,
-                     '_requestComputeMetadata',
+  @mock.patch.object(TPUClusterResolver, '_requestComputeMetadata',
                      mock_request_compute_metadata)
   def testRetrieveProjectAndZoneFromMetadata(self):
     tpu_map = {
@@ -113,17 +115,26 @@ class TPUClusterResolverTest(test.TestCase):
     tpu_cluster_resolver = TPUClusterResolver(
         project=None,
         zone=None,
-        tpu_names=['test-tpu-1'],
+        tpu=['test-tpu-1'],
         credentials=None,
         service=self.mock_service_client(tpu_map=tpu_map))
 
     actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
     expected_proto = """
-    job { name: 'tpu_worker' tasks { key: 0 value: '10.1.2.3:8470' } }
-    """
-    self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
+    job {
+      name: 'coordinator'
+      tasks { key: 0 value: '10.128.1.2:%s' }
+    }
+    job {
+      name: 'worker'
+      tasks { key: 0 value: '10.1.2.3:8470' }
+    }
+    """ % tpu_cluster_resolver._coordinator_port
+    self._verifyClusterSpecEquality(actual_cluster_spec, str(expected_proto))
 
-  def testSimpleSuccessfulRetrieval(self):
+  @mock.patch.object(TPUClusterResolver, '_requestComputeMetadata',
+                     mock_request_compute_metadata)
+  def testRetrieveProjectAndZoneFromMetadataNoCoordinator(self):
     tpu_map = {
         'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
             'ipAddress': '10.1.2.3',
@@ -133,116 +144,230 @@ class TPUClusterResolverTest(test.TestCase):
     }
 
     tpu_cluster_resolver = TPUClusterResolver(
-        project='test-project',
-        zone='us-central1-c',
-        tpu_names=['test-tpu-1'],
+        project=None,
+        zone=None,
+        tpu=['test-tpu-1'],
+        coordinator_name=None,
         credentials=None,
         service=self.mock_service_client(tpu_map=tpu_map))
 
     actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
     expected_proto = """
-    job { name: 'tpu_worker' tasks { key: 0 value: '10.1.2.3:8470' } }
+    job { name: 'worker' tasks { key: 0 value: '10.1.2.3:8470' } }
     """
     self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
 
-  def testMultipleSuccessfulRetrieval(self):
+  def testSimpleSuccessfulRetrieval(self):
     tpu_map = {
         'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
             'ipAddress': '10.1.2.3',
             'port': '8470',
             'health': 'HEALTHY'
-        },
-        'projects/test-project/locations/us-central1-c/nodes/test-tpu-2': {
-            'ipAddress': '10.4.5.6',
-            'port': '8470',
-            'health': 'HEALTHY'
         }
     }
 
     tpu_cluster_resolver = TPUClusterResolver(
         project='test-project',
         zone='us-central1-c',
-        tpu_names=['test-tpu-2', 'test-tpu-1'],
+        tpu=['test-tpu-1'],
+        coordinator_address='10.128.1.5:10203',
         credentials=None,
         service=self.mock_service_client(tpu_map=tpu_map))
 
     actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
     expected_proto = """
-    job { name: 'tpu_worker' tasks { key: 0 value: '10.4.5.6:8470' }
-                             tasks { key: 1 value: '10.1.2.3:8470' } }
+    job { name: 'coordinator' tasks { key: 0 value: '10.128.1.5:10203' } }
+    job { name: 'worker' tasks { key: 0 value: '10.1.2.3:8470' } }
     """
     self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
 
-  def testHealthyTpuNodeRetrieval(self):
+  def testNewNetworkEndpointFormat(self):
     tpu_map = {
         'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
-            'ipAddress': '10.1.2.3',
-            'port': '8470',
-            'health': 'HEALTHY'
-        },
-        'projects/test-project/locations/us-central1-c/nodes/test-tpu-2': {
-            'ipAddress': '10.4.5.6',
-            'port': '8470',
-        },
-        'projects/test-project/locations/us-central1-c/nodes/test-tpu-3': {
-            'ipAddress': '10.7.8.9',
-            'port': '8470',
-            'health': 'UNHEALTHY'
+            'health': 'HEALTHY',
+            'networkEndpoints': [{
+                'ipAddress': '10.2.3.4',
+                'port': 8470,
+            }]
         }
     }
 
     tpu_cluster_resolver = TPUClusterResolver(
         project='test-project',
         zone='us-central1-c',
-        tpu_names=['test-tpu-2', 'test-tpu-1', 'test-tpu-3'],
+        tpu='test-tpu-1',
+        coordinator_address='10.128.1.5:10203',
         credentials=None,
         service=self.mock_service_client(tpu_map=tpu_map))
 
     actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
     expected_proto = """
-    job {
-      name: 'tpu_worker'
-      tasks {
-        key: 0
-        value: '10.1.2.3:8470'
-      }
-    }
+    job { name: 'coordinator' tasks { key: 0 value: '10.128.1.5:10203' } }
+    job { name: 'worker' tasks { key: 0 value: '10.2.3.4:8470' } }
     """
     self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
+    self.assertEqual('grpc://10.2.3.4:8470', tpu_cluster_resolver.master())
 
-  def testGetMasterMultipleEntries(self):
+  @mock.patch.object(TPUClusterResolver, '_requestComputeMetadata',
+                     mock_request_compute_metadata)
+  def testPodResolution(self):
     tpu_map = {
         'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
-            'ipAddress': '10.1.2.3',
-            'port': '8470',
-            'health': 'HEALTHY'
-        },
-        'projects/test-project/locations/us-central1-c/nodes/test-tpu-2': {
-            'ipAddress': '10.4.5.6',
-            'port': '8470',
-            'health': 'HEALTHY'
+            'health':
+                'HEALTHY',
+            'networkEndpoints': [
+                {
+                    'ipAddress': '10.2.3.4',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.5',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.6',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.7',
+                    'port': 8470,
+                },
+            ]
+        }
+    }
+
+    tpu_cluster_resolver = TPUClusterResolver(
+        tpu='test-tpu-1',
+        credentials=None,
+        service=self.mock_service_client(tpu_map=tpu_map))
+
+    actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
+    expected_proto = """
+    job {
+      name: 'coordinator',
+      tasks { key: 0 value: '10.128.1.2:%s'}
+    }
+    job {
+      name: 'worker'
+      tasks { key: 0 value: '10.2.3.4:8470' }
+      tasks { key: 1 value: '10.2.3.5:8470' }
+      tasks { key: 2 value: '10.2.3.6:8470' }
+      tasks { key: 3 value: '10.2.3.7:8470' }
+    }
+    """ % tpu_cluster_resolver._coordinator_port
+    self._verifyClusterSpecEquality(actual_cluster_spec, str(expected_proto))
+
+  def testPodResolutionNoCoordinator(self):
+    tpu_map = {
+        'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
+            'health':
+                'HEALTHY',
+            'networkEndpoints': [
+                {
+                    'ipAddress': '10.2.3.4',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.5',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.6',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.7',
+                    'port': 8470,
+                },
+            ]
         }
     }
 
     tpu_cluster_resolver = TPUClusterResolver(
         project='test-project',
         zone='us-central1-c',
-        tpu_names=['test-tpu-2', 'test-tpu-1'],
+        tpu='test-tpu-1',
+        coordinator_name=None,
         credentials=None,
         service=self.mock_service_client(tpu_map=tpu_map))
-    self.assertEqual('grpc://10.4.5.6:8470', tpu_cluster_resolver.get_master())
+
+    actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
+    expected_proto = """
+    job {
+      name: 'worker'
+      tasks { key: 0 value: '10.2.3.4:8470' }
+      tasks { key: 1 value: '10.2.3.5:8470' }
+      tasks { key: 2 value: '10.2.3.6:8470' }
+      tasks { key: 3 value: '10.2.3.7:8470' }
+    }
+    """
+    self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
 
   def testGetMasterNoEntries(self):
     tpu_map = {}
 
+    with self.assertRaises(ValueError):
+      TPUClusterResolver(
+          project='test-project',
+          zone='us-central1-c',
+          tpu=[],
+          coordinator_name=None,
+          credentials=None,
+          service=self.mock_service_client(tpu_map=tpu_map))
+
+  # TODO(saeta): Convert to parameterized test when included in OSS TF.
+  def verifyShouldResolve(self, tpu, should_resolve):
     tpu_cluster_resolver = TPUClusterResolver(
         project='test-project',
         zone='us-central1-c',
-        tpu_names=[],
+        tpu=tpu,
+        coordinator_name=None,
         credentials=None,
-        service=self.mock_service_client(tpu_map=tpu_map))
-    with self.assertRaises(ValueError):
-      tpu_cluster_resolver.get_master()
+        service=self.mock_service_client(tpu_map={}))
+    self.assertEqual(should_resolve, tpu_cluster_resolver._shouldResolve(),
+                     "TPU: '%s'" % tpu)
+
+  def testShouldResolveNoName(self):
+    self.verifyShouldResolve('', False)
+
+  def testShouldResolveLocal(self):
+    self.verifyShouldResolve('local', False)
+
+  def testShouldResolveGrpc(self):
+    self.verifyShouldResolve('grpc://10.1.2.3:8470', False)
+
+  def testShouldResolveBns(self):
+    self.verifyShouldResolve('/bns/foo/bar', False)
+
+  def testShouldResolveName(self):
+    self.verifyShouldResolve('mytpu', True)
+
+  def testShouldResolveList(self):
+    self.verifyShouldResolve(['myothertpu'], True)
+
+  def testShouldResolveGrpcPrefix(self):
+    self.verifyShouldResolve('grpctpu', True)
+
+  def testNoCallComputeMetadata(self):
+    tpu_cluster_resolver = TPUClusterResolver(tpu='/bns/foo/bar')
+    self.assertEqual(
+        compat.as_bytes('/bns/foo/bar'), tpu_cluster_resolver.master())
+    self.assertEqual(
+        server_lib.ClusterSpec({}), tpu_cluster_resolver.cluster_spec())
+
+  def testGkeEnvironment(self):
+    os.environ['KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS'] = 'grpc://10.120.27.5:8470'
+    self.assertTrue('KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS' in os.environ)
+    tpu_cluster_resolver = TPUClusterResolver()
+    self.assertTrue(tpu_cluster_resolver._inGke())
+    self.assertEqual(
+        compat.as_bytes('grpc://10.120.27.5:8470'),
+        compat.as_bytes(tpu_cluster_resolver._gkeMaster()))
+    self.assertEqual(
+        compat.as_bytes('grpc://10.120.27.5:8470'),
+        compat.as_bytes(tpu_cluster_resolver.get_master()))
+    del os.environ['KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS']
+
 
 if __name__ == '__main__':
   test.main()
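Review note: the `testShouldResolve*` cases above pin down the expected behaviour of `_shouldResolve()`, whose body is not part of this section. The sketch below is an assumption, not the actual implementation: `should_resolve` is a hypothetical predicate written only to be consistent with those test cases (an empty string, `local`, `grpc://` targets, and `/bns` paths are passed through unresolved; anything else is treated as a Cloud TPU name).

```python
def should_resolve(tpu):
    """Hypothetical predicate consistent with the testShouldResolve* cases."""
    # The tests pass either a single name or a one-element list of names.
    if isinstance(tpu, (list, tuple)):
        if not tpu:
            raise ValueError('At least one TPU name is required.')
        tpu = tpu[0]
    if isinstance(tpu, bytes):
        tpu = tpu.decode('utf-8')

    # Empty and 'local' targets never need a Cloud TPU API lookup.
    if tpu in ('', 'local'):
        return False
    # Fully qualified targets are used as-is. The prefix check includes '://'
    # so that a TPU merely named 'grpctpu' is still resolved.
    if tpu.startswith('grpc://') or tpu.startswith('/bns'):
        return False
    return True


for candidate in ['', 'local', 'grpc://10.1.2.3:8470', '/bns/foo/bar',
                  'mytpu', ['myothertpu'], 'grpctpu']:
    print(repr(candidate), should_resolve(candidate))
```

Raising `ValueError` for an empty list is a choice made for this sketch; in the tests that error comes from the `TPUClusterResolver` constructor.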
diff --git a/tensorflow/contrib/cmake/README.md b/tensorflow/contrib/cmake/README.md
index 8f85a75ee466dbac524a1266dc2522109ca77cd5..fe83bb32046cd75328c92a74cdb4fdb6ce44560e 100644
--- a/tensorflow/contrib/cmake/README.md
+++ b/tensorflow/contrib/cmake/README.md
@@ -26,7 +26,7 @@ The CMake files in this directory can build the core TensorFlow runtime, an
 example C++ binary, and a PIP package containing the runtime and Python
 bindings.
 
-### Pre-requisites
+### Prerequisites
 
 * CMake version 3.5 or later.
 
@@ -34,14 +34,16 @@ bindings.
 
 * [SWIG](http://www.swig.org/download.html)
 
-* Additional pre-requisites for Microsoft Windows:
+* Additional prerequisites for Microsoft Windows:
   - Visual Studio 2015
   - Python 3.5
-  - NumPy 1.11.0 or later
 
-* Additional pre-requisites for Linux:
+* Additional prerequisites for Linux:
   - Python 2.7 or later
   - [Docker](https://www.docker.com/) (for automated testing)
+
+* Python dependencies:
+  - wheel
   - NumPy 1.11.0 or later
 
 ### Known-good configurations
@@ -102,7 +104,7 @@ ops or APIs.
 Step-by-step Windows build
 ==========================
 
-1. Install the pre-requisites detailed above, and set up your environment.
+1. Install the prerequisites detailed above, and set up your environment.
 
    * The following commands assume that you are using the Windows Command
      Prompt (`cmd.exe`). You will need to set up your environment to use the
diff --git a/tensorflow/contrib/cmake/external/cub.cmake b/tensorflow/contrib/cmake/external/cub.cmake
index 836889895567f679d9960e29ece1600d1a7a58eb..98a8c7e736e5c8c407b90e8eac440cdc7ab21579 100644
--- a/tensorflow/contrib/cmake/external/cub.cmake
+++ b/tensorflow/contrib/cmake/external/cub.cmake
@@ -14,8 +14,8 @@
 # ==============================================================================
 include (ExternalProject)
 
-set(cub_URL https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.7.4.zip)
-set(cub_HASH SHA256=20a1a39fd97e5da7f40f5f2e7fd73fd2ea59f9dc4bb8a6c5f228aa543e727e31)
+set(cub_URL https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.8.0.zip)
+set(cub_HASH SHA256=6bfa06ab52a650ae7ee6963143a0bbc667d6504822cbd9670369b598f18c58c3)
 set(cub_BUILD ${CMAKE_CURRENT_BINARY_DIR}/cub/src/cub)
 set(cub_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/cub/src/cub)
 set(cub_ARCHIVE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/cub_archive)
diff --git a/tensorflow/contrib/cmake/external/grpc.cmake b/tensorflow/contrib/cmake/external/grpc.cmake
index a9f43a3ecba4830533efcc13f8c4c1c61fe1ef78..cc218e8ab8ce211a85aa3ece318558dd24049c83 100644
--- a/tensorflow/contrib/cmake/external/grpc.cmake
+++ b/tensorflow/contrib/cmake/external/grpc.cmake
@@ -17,7 +17,7 @@ include (ExternalProject)
 set(GRPC_INCLUDE_DIRS ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/include)
 set(GRPC_URL https://github.com/grpc/grpc.git)
 set(GRPC_BUILD ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc)
-set(GRPC_TAG 730b778632e79cc3c96ad237f282d687ee325ce7)
+set(GRPC_TAG 575bda39755b98d1f7099406bb57a6e3b2074874)
 
 if(WIN32)
   if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
@@ -35,6 +35,7 @@ else()
   set(grpc_STATIC_LIBRARIES
       ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgrpc++_unsecure.a
       ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgrpc_unsecure.a
+      ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/third_party/cares/cares/lib/libcares.a
       ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgpr.a)
 endif()
 
diff --git a/tensorflow/contrib/cmake/external/nsync.cmake b/tensorflow/contrib/cmake/external/nsync.cmake
index f3a37ff5088e3f9e54e38c0edb5777c27b26969f..b9d1dd88d4c2d3c9141ba56e14911e06b4d33f7c 100644
--- a/tensorflow/contrib/cmake/external/nsync.cmake
+++ b/tensorflow/contrib/cmake/external/nsync.cmake
@@ -16,7 +16,7 @@ include (ExternalProject)
 
 set(nsync_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/nsync/public)
 set(nsync_URL https://github.com/google/nsync)
-set(nsync_TAG 8502189abfa44c249c01c2cad64e6ed660a9a668)
+set(nsync_TAG 0559ce013feac8db639ee1bf776aca0325d28777)
 set(nsync_BUILD ${CMAKE_CURRENT_BINARY_DIR}/nsync/src/nsync)
 set(nsync_INSTALL ${CMAKE_CURRENT_BINARY_DIR}/nsync/install)
 
diff --git a/tensorflow/contrib/cmake/external/protobuf.cmake b/tensorflow/contrib/cmake/external/protobuf.cmake
index aba8a5244e17d717293deec6d9b6e8e725ef010e..ab464bc99a43138130bb2758ae28ecef29805c31 100644
--- a/tensorflow/contrib/cmake/external/protobuf.cmake
+++ b/tensorflow/contrib/cmake/external/protobuf.cmake
@@ -16,7 +16,7 @@ include (ExternalProject)
 
 set(PROTOBUF_INCLUDE_DIRS ${CMAKE_CURRENT_BINARY_DIR}/protobuf/src/protobuf/src)
 set(PROTOBUF_URL https://github.com/google/protobuf.git)
-set(PROTOBUF_TAG 396336eb961b75f03b25824fe86cf6490fb75e3a)
+set(PROTOBUF_TAG b04e5cba356212e4e8c66c61bbe0c3a20537c5b9)
 
 if(WIN32)
   if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
diff --git a/tensorflow/contrib/cmake/patches/nsync/CMakeLists.txt b/tensorflow/contrib/cmake/patches/nsync/CMakeLists.txt
index aaae18a313dd082b428654091c9411600c981ec9..6f059c7225dd0938b758e8f9c28ec36fcff6db4c 100644
--- a/tensorflow/contrib/cmake/patches/nsync/CMakeLists.txt
+++ b/tensorflow/contrib/cmake/patches/nsync/CMakeLists.txt
@@ -42,7 +42,6 @@ if ("${NSYNC_LANGUAGE}X" STREQUAL "c++11X")
   include_directories ("${PROJECT_SOURCE_DIR}/platform/c++11")
   add_definitions ("-DNSYNC_USE_CPP11_TIMEPOINT -DNSYNC_ATOMIC_CPP11")
   set (NSYNC_OS_CPP_SRC
-    "platform/c++11/src/nsync_semaphore_mutex.cc"
     "platform/c++11/src/per_thread_waiter.cc"
     "platform/c++11/src/yield.cc"
     "platform/c++11/src/time_rep_timespec.cc"
@@ -52,6 +51,7 @@ if ("${NSYNC_LANGUAGE}X" STREQUAL "c++11X")
     include_directories ("${PROJECT_SOURCE_DIR}/platform/win32")
     add_compile_options ("/TP")
     set (NSYNC_OS_SRC
+      "platform/c++11/src/nsync_semaphore_mutex.cc"
       "platform/win32/src/clock_gettime.c"
       "platform/win32/src/pthread_key_win32.cc"
       ${NSYNC_OS_CPP_SRC}
@@ -68,6 +68,7 @@ if ("${NSYNC_LANGUAGE}X" STREQUAL "c++11X")
     add_compile_options ("-std=c++11")
     set (NSYNC_OS_SRC
       ${NSYNC_OS_CPP_SRC}
+      "platform/c++11/src/nsync_semaphore_mutex.cc"
       "platform/posix/src/clock_gettime.c"
       "platform/posix/src/nsync_semaphore_mutex.c"
     )
@@ -75,9 +76,11 @@ if ("${NSYNC_LANGUAGE}X" STREQUAL "c++11X")
       "platform/posix/src/start_thread.c"
     )
   elseif ("${CMAKE_SYSTEM_NAME}X" STREQUAL "LinuxX")
+    include_directories (BEFORE "${PROJECT_SOURCE_DIR}/platform/c++11.futex")
     include_directories ("${PROJECT_SOURCE_DIR}/platform/posix")
     add_compile_options ("-std=c++11")
     set (NSYNC_OS_SRC
+      "platform/linux/src/nsync_semaphore_futex.c"
       ${NSYNC_OS_CPP_SRC}
     )
     set (NSYNC_TEST_OS_SRC
@@ -87,6 +90,7 @@ if ("${NSYNC_LANGUAGE}X" STREQUAL "c++11X")
     include_directories ("${PROJECT_SOURCE_DIR}/platform/posix")
     add_compile_options ("-std=c++11")
     set (NSYNC_OS_SRC
+      "platform/c++11/src/nsync_semaphore_mutex.cc"
       ${NSYNC_OS_CPP_SRC}
     )
     set (NSYNC_TEST_OS_SRC
@@ -96,6 +100,7 @@ if ("${NSYNC_LANGUAGE}X" STREQUAL "c++11X")
     include_directories ("${PROJECT_SOURCE_DIR}/platform/posix")
     add_compile_options ("-std=c++11")
     set (NSYNC_OS_SRC
+      "platform/c++11/src/nsync_semaphore_mutex.cc"
       ${NSYNC_OS_CPP_SRC}
     )
     set (NSYNC_TEST_OS_SRC
@@ -105,6 +110,7 @@ if ("${NSYNC_LANGUAGE}X" STREQUAL "c++11X")
     include_directories ("${PROJECT_SOURCE_DIR}/platform/posix")
     add_compile_options ("-std=c++11")
     set (NSYNC_OS_SRC
+      "platform/c++11/src/nsync_semaphore_mutex.cc"
       ${NSYNC_OS_CPP_SRC}
     )
     set (NSYNC_TEST_OS_SRC
diff --git a/tensorflow/contrib/cmake/python_modules.txt b/tensorflow/contrib/cmake/python_modules.txt
index bfe53c01b3b5fb9db8a5d8fa280d1d7f98974882..f7d3c73b2c448db2230227bbb57e16d09bb6886b 100644
--- a/tensorflow/contrib/cmake/python_modules.txt
+++ b/tensorflow/contrib/cmake/python_modules.txt
@@ -147,8 +147,6 @@ tensorflow/contrib/crf
 tensorflow/contrib/crf/python
 tensorflow/contrib/crf/python/ops
 tensorflow/contrib/cudnn_rnn
-tensorflow/contrib/cudnn_rnn/kernels
-tensorflow/contrib/cudnn_rnn/ops
 tensorflow/contrib/cudnn_rnn/python
 tensorflow/contrib/cudnn_rnn/python/layers
 tensorflow/contrib/cudnn_rnn/python/ops
@@ -165,6 +163,7 @@ tensorflow/contrib/distributions/python
 tensorflow/contrib/distributions/python/ops
 tensorflow/contrib/distributions/python/ops/bijectors
 tensorflow/contrib/eager
+tensorflow/contrib/eager/proto
 tensorflow/contrib/eager/python
 tensorflow/contrib/estimator
 tensorflow/contrib/estimator/python
diff --git a/tensorflow/contrib/cmake/python_protos.txt b/tensorflow/contrib/cmake/python_protos.txt
index 8a9c406d8b118c10ddcaafb0e4fc242aa79cdb57..c03c0c80fe62a4f95d0fcf240ee25725a19d86f0 100644
--- a/tensorflow/contrib/cmake/python_protos.txt
+++ b/tensorflow/contrib/cmake/python_protos.txt
@@ -4,6 +4,7 @@ tensorflow/python
 tensorflow/contrib/boosted_trees/proto
 tensorflow/contrib/cloud/kernels
 tensorflow/contrib/decision_trees/proto
+tensorflow/contrib/eager/proto
 tensorflow/contrib/gdr
 tensorflow/contrib/lite/toco
 tensorflow/contrib/mpi
diff --git a/tensorflow/contrib/cmake/tf_core_cpu.cmake b/tensorflow/contrib/cmake/tf_core_cpu.cmake
index 96ac60d095dbc84470ff1be92f4bf52bb420fc52..a54cbff33b66d63d7229fa2f50b8a4ca962111ed 100644
--- a/tensorflow/contrib/cmake/tf_core_cpu.cmake
+++ b/tensorflow/contrib/cmake/tf_core_cpu.cmake
@@ -63,6 +63,12 @@ file(GLOB_RECURSE tf_core_cpu_exclude_srcs
     "${tensorflow_source_dir}/tensorflow/core/grappler/inputs/trivial_test_graph_input_yielder.h"
     "${tensorflow_source_dir}/tensorflow/core/grappler/inputs/trivial_test_graph_input_yielder.cc"
 )
+file(GLOB_RECURSE tf_core_cpu_whitelisted_srcs
+    "${tensorflow_source_dir}/tensorflow/core/common_runtime/gpu/gpu_id.h"
+    "${tensorflow_source_dir}/tensorflow/core/common_runtime/gpu/gpu_id.cc"
+    "${tensorflow_source_dir}/tensorflow/core/common_runtime/gpu/gpu_id_manager.cc"
+)
+list(REMOVE_ITEM tf_core_cpu_exclude_srcs ${tf_core_cpu_whitelisted_srcs})
 list(REMOVE_ITEM tf_core_cpu_srcs ${tf_core_cpu_exclude_srcs})
 
 if (tensorflow_ENABLE_GPU)
@@ -79,6 +85,7 @@ if (tensorflow_ENABLE_GPU)
      "${tensorflow_source_dir}/tensorflow/core/*test*.cc"
   )
   list(REMOVE_ITEM tf_core_gpu_srcs ${tf_core_gpu_exclude_srcs})
+  list(REMOVE_ITEM tf_core_gpu_srcs ${tf_core_cpu_whitelisted_srcs})
   list(APPEND tf_core_cpu_srcs ${tf_core_gpu_srcs})
 endif()
 
diff --git a/tensorflow/contrib/cmake/tf_core_kernels.cmake b/tensorflow/contrib/cmake/tf_core_kernels.cmake
index 998f99ecc19f88921dce14fde892912fb699ad08..ed018b4fed8e47632f632723f19cc755f2079f86 100644
--- a/tensorflow/contrib/cmake/tf_core_kernels.cmake
+++ b/tensorflow/contrib/cmake/tf_core_kernels.cmake
@@ -67,8 +67,6 @@ if(tensorflow_BUILD_CONTRIB_KERNELS)
       "${tensorflow_source_dir}/tensorflow/contrib/coder/kernels/range_coder_ops.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/coder/kernels/range_coder_ops_util.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/coder/ops/coder_ops.cc"
-      "${tensorflow_source_dir}/tensorflow/contrib/cudnn_rnn/kernels/cudnn_rnn_ops.cc"
-      "${tensorflow_source_dir}/tensorflow/contrib/cudnn_rnn/ops/cudnn_rnn_ops.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/ignore_errors_dataset_op.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/prefetching_kernels.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc"
diff --git a/tensorflow/contrib/cmake/tf_core_ops.cmake b/tensorflow/contrib/cmake/tf_core_ops.cmake
index 59e094812aaf4da2549d96314fc550e5635f9de8..d6712aa2b48795bb6faf5153c9a8774a7d8bf3c1 100644
--- a/tensorflow/contrib/cmake/tf_core_ops.cmake
+++ b/tensorflow/contrib/cmake/tf_core_ops.cmake
@@ -21,6 +21,7 @@ set(tf_op_lib_names
     "checkpoint_ops"
     "control_flow_ops"
     "ctc_ops"
+    "cudnn_rnn_ops"
     "data_flow_ops"
     "dataset_ops"
     "functional_ops"
@@ -84,7 +85,6 @@ GENERATE_CONTRIB_OP_LIBRARY(boosted_trees_prediction "${tensorflow_source_dir}/t
 GENERATE_CONTRIB_OP_LIBRARY(boosted_trees_quantiles "${tensorflow_source_dir}/tensorflow/contrib/boosted_trees/ops/quantile_ops.cc")
 GENERATE_CONTRIB_OP_LIBRARY(boosted_trees_stats_accumulator "${tensorflow_source_dir}/tensorflow/contrib/boosted_trees/ops/stats_accumulator_ops.cc")
 GENERATE_CONTRIB_OP_LIBRARY(coder "${tensorflow_source_dir}/tensorflow/contrib/coder/ops/coder_ops.cc")
-GENERATE_CONTRIB_OP_LIBRARY(cudnn_rnn "${tensorflow_source_dir}/tensorflow/contrib/cudnn_rnn/ops/cudnn_rnn_ops.cc")
 GENERATE_CONTRIB_OP_LIBRARY(data_dataset "${tensorflow_source_dir}/tensorflow/contrib/data/ops/dataset_ops.cc")
 GENERATE_CONTRIB_OP_LIBRARY(factorization_clustering "${tensorflow_source_dir}/tensorflow/contrib/factorization/ops/clustering_ops.cc")
 GENERATE_CONTRIB_OP_LIBRARY(factorization_factorization "${tensorflow_source_dir}/tensorflow/contrib/factorization/ops/factorization_ops.cc")
diff --git a/tensorflow/contrib/cmake/tf_python.cmake b/tensorflow/contrib/cmake/tf_python.cmake
index b730ebd3baacafe8ae401e8987104f3062372954..31e715b654c8baa53e25f54b4854c94e80c88049 100755
--- a/tensorflow/contrib/cmake/tf_python.cmake
+++ b/tensorflow/contrib/cmake/tf_python.cmake
@@ -326,6 +326,7 @@ GENERATE_PYTHON_OP_LIB("checkpoint_ops")
 GENERATE_PYTHON_OP_LIB("control_flow_ops"
   ADDITIONAL_LIBRARIES $<TARGET_OBJECTS:tf_no_op>)
 GENERATE_PYTHON_OP_LIB("ctc_ops")
+GENERATE_PYTHON_OP_LIB("cudnn_rnn_ops")
 GENERATE_PYTHON_OP_LIB("data_flow_ops")
 GENERATE_PYTHON_OP_LIB("dataset_ops")
 GENERATE_PYTHON_OP_LIB("image_ops")
@@ -348,6 +349,7 @@ GENERATE_PYTHON_OP_LIB("state_ops")
 GENERATE_PYTHON_OP_LIB("sparse_ops")
 GENERATE_PYTHON_OP_LIB("spectral_ops")
 GENERATE_PYTHON_OP_LIB("string_ops")
+GENERATE_PYTHON_OP_LIB("summary_ops")
 GENERATE_PYTHON_OP_LIB("user_ops")
 GENERATE_PYTHON_OP_LIB("training_ops"
   DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/python/training/gen_training_ops.py)
@@ -366,8 +368,6 @@ GENERATE_PYTHON_OP_LIB("contrib_boosted_trees_stats_accumulator_ops"
   DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/boosted_trees/python/ops/gen_stats_accumulator_ops.py)
 GENERATE_PYTHON_OP_LIB("contrib_coder_ops"
   DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/coder/python/ops/gen_coder_ops.py)
-GENERATE_PYTHON_OP_LIB("contrib_cudnn_rnn_ops"
-  DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/cudnn_rnn/ops/gen_cudnn_rnn_ops.py)
 GENERATE_PYTHON_OP_LIB("contrib_data_dataset_ops"
   DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/data/python/ops/gen_dataset_ops.py)
 GENERATE_PYTHON_OP_LIB("contrib_factorization_clustering_ops"
@@ -419,8 +419,6 @@ GENERATE_PYTHON_OP_LIB("stateless_random_ops"
   DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/stateless/gen_stateless_random_ops.py)
 GENERATE_PYTHON_OP_LIB("debug_ops"
   DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/python/debug/ops/gen_debug_ops.py)
-GENERATE_PYTHON_OP_LIB("summary_ops"
-  DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/summary/gen_summary_ops.py)
 
 add_custom_target(tf_python_ops SOURCES ${tf_python_ops_generated_files} ${PYTHON_PROTO_GENFILES})
 add_dependencies(tf_python_ops tf_python_op_gen_main)
diff --git a/tensorflow/contrib/cmake/tf_shared_lib.cmake b/tensorflow/contrib/cmake/tf_shared_lib.cmake
index 6d36d5fc5c2854b2d7d2542a3cb12e033e193b88..9738bbeb9aebaeb67495127528e26634887d392c 100644
--- a/tensorflow/contrib/cmake/tf_shared_lib.cmake
+++ b/tensorflow/contrib/cmake/tf_shared_lib.cmake
@@ -100,8 +100,7 @@ if(WIN32)
 endif(WIN32)
 
 target_include_directories(tensorflow PUBLIC 
-    $<INSTALL_INTERFACE:include/>
-    $<INSTALL_INTERFACE:include/external/nsync/public>)
+    $<INSTALL_INTERFACE:include/>)
 
 install(TARGETS tensorflow EXPORT tensorflow_export
         RUNTIME DESTINATION bin
@@ -133,10 +132,6 @@ install(DIRECTORY ${tensorflow_source_dir}/tensorflow/stream_executor/
 install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/protobuf/src/protobuf/src/google/
         DESTINATION include/google
         FILES_MATCHING PATTERN "*.h")
-# nsync headers
-install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/external/nsync/
-        DESTINATION include/external/nsync
-        FILES_MATCHING PATTERN "*.h")
 # Eigen directory
 install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/eigen/src/eigen/Eigen/
         DESTINATION include/Eigen)
diff --git a/tensorflow/contrib/cmake/tf_tests.cmake b/tensorflow/contrib/cmake/tf_tests.cmake
index 1c4ebd7f0c1113bcd0857fb0858df2248499f920..b86a8f1ec236d820c2c8bbfec059d8eaed851c59 100644
--- a/tensorflow/contrib/cmake/tf_tests.cmake
+++ b/tensorflow/contrib/cmake/tf_tests.cmake
@@ -195,9 +195,11 @@ if (tensorflow_BUILD_PYTHON_TESTS)
     "${tensorflow_source_dir}/tensorflow/python/profiler/model_analyzer_test.py"
     # Fails because uses data dependencies with bazel
     "${tensorflow_source_dir}/tensorflow/python/saved_model/saved_model_test.py"
+    "${tensorflow_source_dir}/tensorflow/contrib/image/python/kernel_tests/sparse_image_warp_test.py"
     # requires scipy
     "${tensorflow_source_dir}/tensorflow/contrib/keras/python/keras/preprocessing/*_test.py"
     "${tensorflow_source_dir}/tensorflow/contrib/tfprof/python/tools/tfprof/pprof_profiler_test.py"
+    "${tensorflow_source_dir}/tensorflow/contrib/image/python/kernel_tests/interpolate_spline_test.py"
     # Takes very long to run without sharding (defined in bazel build file).
     "${tensorflow_source_dir}/tensorflow/python/kernel_tests/cwise_ops_test.py"
     # Loading resources in contrib doesn't seem to work on Windows
@@ -208,6 +210,9 @@ if (tensorflow_BUILD_PYTHON_TESTS)
     "${tensorflow_source_dir}/tensorflow/contrib/learn/python/learn/learn_io/graph_io_test.py"
     # Test is flaky on Windows GPU builds (b/38283730).
     "${tensorflow_source_dir}/tensorflow/contrib/factorization/python/ops/gmm_test.py"
+    # Disabled to follow the manual tag in BUILD.
+    "${tensorflow_source_dir}/tensorflow/python/keras/_impl/keras/layers/convolutional_test.py"
+
   )
   if (WIN32)
     set(tf_test_src_py_exclude
@@ -222,6 +227,7 @@ if (tensorflow_BUILD_PYTHON_TESTS)
       "${tensorflow_source_dir}/tensorflow/python/debug/cli/curses_ui_test.py"
       # TFDBG grpc:// mode is not yet available on Windows.
       "${tensorflow_source_dir}/tensorflow/python/debug/lib/dist_session_debug_grpc_test.py"
+      "${tensorflow_source_dir}/tensorflow/python/debug/lib/grpc_large_data_test.py"
       "${tensorflow_source_dir}/tensorflow/python/debug/lib/session_debug_grpc_test.py"
       "${tensorflow_source_dir}/tensorflow/python/debug/lib/source_remote_test.py"
       # stl on windows handles overflows different
@@ -475,6 +481,10 @@ if (tensorflow_BUILD_CC_TESTS)
     "${tensorflow_source_dir}/tensorflow/core/profiler/internal/advisor/*_test.cc"
   )
 
+  list(REMOVE_ITEM tf_test_src_simple
+    ${tf_core_profiler_test_srcs}
+  )
+
   set(tf_test_lib tf_test_lib)
   add_library(${tf_test_lib} STATIC ${tf_src_testlib})
 
diff --git a/tensorflow/contrib/cudnn_rnn/BUILD b/tensorflow/contrib/cudnn_rnn/BUILD
index fec358c4e1067dc8dc8173d1b9d05dc90b90ca05..fa86ad38c975a95171883adba152e32cd3905082 100644
--- a/tensorflow/contrib/cudnn_rnn/BUILD
+++ b/tensorflow/contrib/cudnn_rnn/BUILD
@@ -9,52 +9,10 @@ licenses(["notice"])  # Apache 2.0
 
 exports_files(["LICENSE"])
 
-load("//tensorflow:tensorflow.bzl", "tf_custom_op_library")
 load("//tensorflow:tensorflow.bzl", "tf_gen_op_libs")
 load("//tensorflow:tensorflow.bzl", "tf_gen_op_wrapper_py")
-load("//tensorflow:tensorflow.bzl", "tf_kernel_library")
 load("//tensorflow:tensorflow.bzl", "cuda_py_test")
 load("//tensorflow:tensorflow.bzl", "tf_custom_op_py_library")
-load("//tensorflow:tensorflow.bzl", "tf_cc_test")
-
-tf_custom_op_library(
-    name = "python/ops/_cudnn_rnn_ops.so",
-    srcs = [
-        "kernels/cudnn_rnn_ops.cc",
-        "ops/cudnn_rnn_ops.cc",
-    ],
-    deps = [
-        "//tensorflow/core/kernels:bounds_check_lib",
-        "@farmhash_archive//:farmhash",
-    ],
-)
-
-tf_kernel_library(
-    name = "cudnn_rnn_kernels",
-    srcs = ["kernels/cudnn_rnn_ops.cc"],
-    visibility = ["//visibility:public"],
-    deps = [
-        "//tensorflow/core:framework",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:lib_internal",
-        "//tensorflow/core:stream_executor",
-        "//tensorflow/core/kernels:bounds_check_lib",
-        "//third_party/eigen3",
-        "@farmhash_archive//:farmhash",
-    ],
-)
-
-tf_gen_op_libs(
-    op_lib_names = ["cudnn_rnn_ops"],
-    deps = [
-        "//tensorflow/core:lib",
-    ],
-)
-
-tf_gen_op_wrapper_py(
-    name = "cudnn_rnn_ops",
-    deps = [":cudnn_rnn_ops_op_lib"],
-)
 
 tf_custom_op_py_library(
     name = "cudnn_rnn_py",
@@ -64,20 +22,13 @@ tf_custom_op_py_library(
         "python/layers/cudnn_rnn.py",
         "python/ops/cudnn_rnn_ops.py",
     ],
-    dso = [
-        ":python/ops/_cudnn_rnn_ops.so",
-    ],
-    kernels = [
-        ":cudnn_rnn_kernels",
-        ":cudnn_rnn_ops_op_lib",
-    ],
     srcs_version = "PY2AND3",
     visibility = ["//visibility:public"],
     deps = [
-        ":cudnn_rnn_ops",
         "//tensorflow/contrib/util:util_py",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:control_flow_ops",
+        "//tensorflow/python:cudnn_rnn_ops_gen",
         "//tensorflow/python:framework",
         "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:init_ops",
@@ -173,23 +124,6 @@ cuda_py_test(
     ],
 )
 
-tf_cc_test(
-    name = "cudnn_rnn_ops_test_cc",
-    size = "small",
-    srcs = [
-        "ops/cudnn_rnn_ops_test.cc",
-    ],
-    deps = [
-        ":cudnn_rnn_ops_op_lib",
-        "//tensorflow/core",
-        "//tensorflow/core:framework",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:test",
-        "//tensorflow/core:test_main",
-        "//tensorflow/core:testlib",
-    ],
-)
-
 filegroup(
     name = "all_files",
     srcs = glob(
diff --git a/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py b/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py
index e87162f0ee9cc4eed795555171f55a93639e83cf..622241a1774545529a4cdcb974333b53c8f56caa 100644
--- a/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py
+++ b/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py
@@ -17,27 +17,22 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.cudnn_rnn.ops import gen_cudnn_rnn_ops
 from tensorflow.contrib.rnn.python.ops import lstm_ops
-from tensorflow.contrib.util import loader
 from tensorflow.python.framework import common_shapes
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import random_seed
 from tensorflow.python.layers import base as base_layer
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_cudnn_rnn_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import rnn_cell_impl
 from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variable_scope as vs
-from tensorflow.python.platform import resource_loader
 from tensorflow.python.training import saver
 
-_cudnn_rnn_ops_so = loader.load_op_library(
-    resource_loader.get_path_to_datafile("_cudnn_rnn_ops.so"))
-
 CUDNN_RNN_UNIDIRECTION = "unidirectional"
 CUDNN_RNN_BIDIRECTION = "bidirectional"
 CUDNN_LSTM = "lstm"
diff --git a/tensorflow/contrib/data/BUILD b/tensorflow/contrib/data/BUILD
index 0458199ff771bc45603106411550a39448e515b8..d787ed8a1ad163f847add25e3877cf0329441292 100644
--- a/tensorflow/contrib/data/BUILD
+++ b/tensorflow/contrib/data/BUILD
@@ -9,6 +9,10 @@ load(
     "tf_custom_op_library",
     "tf_gen_op_libs",
 )
+load(
+    "//tensorflow/core:platform/default/build_config_root.bzl",
+    "if_static",
+)
 
 py_library(
     name = "data",
@@ -29,7 +33,11 @@ py_library(
 tf_custom_op_library(
     name = "_dataset_ops.so",
     srcs = ["ops/dataset_ops.cc"],
-    deps = ["//tensorflow/contrib/data/kernels:dataset_kernels"],
+    deps = ["//tensorflow/contrib/data/kernels:dataset_kernels"] +
+           if_static(
+               extra_deps = ["//tensorflow/core:lib_proto_parsing"],
+               otherwise = [],
+           ),
 )
 
 tf_gen_op_libs(
diff --git a/tensorflow/contrib/data/__init__.py b/tensorflow/contrib/data/__init__.py
index fcdccdd26ca1824bf13f1fd0cfd80b20ca8a10c3..9212b69700941c190df1d44ed308147105c56fba 100644
--- a/tensorflow/contrib/data/__init__.py
+++ b/tensorflow/contrib/data/__init__.py
@@ -23,12 +23,15 @@ removing existing functionality.
 See the @{$datasets$Importing Data} Programmer's Guide for an overview.
 
 @@Counter
+@@SqlDataset
 
 @@batch_and_drop_remainder
+@@bucket_by_sequence_length
 @@dense_to_sparse_batch
 @@enumerate_dataset
 @@group_by_window
 @@ignore_errors
+@@make_batched_features_dataset
 @@make_saveable_from_iterator
 @@map_and_batch
 @@padded_batch_and_drop_remainder
@@ -37,6 +40,7 @@ See the @{$datasets$Importing Data} Programmer's Guide for an overview.
 @@rejection_resample
 @@scan
 @@shuffle_and_repeat
+@@sliding_window_batch
 @@sloppy_interleave
 @@unbatch
 
@@ -58,15 +62,20 @@ from tensorflow.contrib.data.python.ops.counter import Counter
 from tensorflow.contrib.data.python.ops.enumerate_ops import enumerate_dataset
 from tensorflow.contrib.data.python.ops.error_ops import ignore_errors
 from tensorflow.contrib.data.python.ops.get_single_element import get_single_element
+from tensorflow.contrib.data.python.ops.grouping import bucket_by_sequence_length
 from tensorflow.contrib.data.python.ops.grouping import group_by_window
 from tensorflow.contrib.data.python.ops.interleave_ops import parallel_interleave
 from tensorflow.contrib.data.python.ops.interleave_ops import sloppy_interleave
 from tensorflow.contrib.data.python.ops.iterator_ops import make_saveable_from_iterator
+from tensorflow.contrib.data.python.ops.readers import make_batched_features_dataset
 from tensorflow.contrib.data.python.ops.readers import read_batch_features
 from tensorflow.contrib.data.python.ops.readers import SqlDataset
 from tensorflow.contrib.data.python.ops.resampling import rejection_resample
 from tensorflow.contrib.data.python.ops.scan_ops import scan
 from tensorflow.contrib.data.python.ops.shuffle_ops import shuffle_and_repeat
+from tensorflow.contrib.data.python.ops.sliding import sliding_window_batch
+from tensorflow.python.data.ops.iterator_ops import Iterator
+from tensorflow.python.ops.parsing_ops import parse_single_example_v2 as parse_single_example
 # pylint: enable=unused-import
 
 from tensorflow.python.util.all_util import remove_undocumented
diff --git a/tensorflow/contrib/data/kernels/BUILD b/tensorflow/contrib/data/kernels/BUILD
index 9bd6a42da2d93263e84a759cffdc5a9e8f9742fd..c87da7dfaa5943f7918c370f63362673844c7f0e 100644
--- a/tensorflow/contrib/data/kernels/BUILD
+++ b/tensorflow/contrib/data/kernels/BUILD
@@ -10,6 +10,7 @@ cc_library(
     name = "prefetching_kernels",
     srcs = ["prefetching_kernels.cc"],
     deps = [
+        "//tensorflow/core:core_cpu_headers_lib",
         "//tensorflow/core:framework_headers_lib",
         "//third_party/eigen3",
         "@protobuf_archive//:protobuf_headers",
diff --git a/tensorflow/contrib/data/kernels/prefetching_kernels.cc b/tensorflow/contrib/data/kernels/prefetching_kernels.cc
index d3df14bdd03476e9ee4015b374512e5bb9893a63..2f986f2bb1b3e39c3a3c01055eecadc9695736cd 100644
--- a/tensorflow/contrib/data/kernels/prefetching_kernels.cc
+++ b/tensorflow/contrib/data/kernels/prefetching_kernels.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 #include <deque>
 
+#include "tensorflow/core/common_runtime/process_function_library_runtime.h"
 #include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/resource_op_kernel.h"
@@ -35,27 +36,31 @@ using FunctionBufferCallback = std::function<void(const BufferElement&)>;
 class FunctionBufferingResource : public ResourceBase {
  public:
   FunctionBufferingResource(FunctionLibraryRuntime* lib,
+                            std::unique_ptr<ProcessFunctionLibraryRuntime> pflr,
                             const NameAttrList& func, int64 buffer_size,
                             const string& source_device,
                             const string& target_device,
                             const std::vector<Tensor>& func_args,
                             int64 thread_pool_size)
       : lib_(lib),
+        pflr_(std::move(pflr)),
         func_(func),
         buffer_size_(buffer_size),
         source_device_(source_device),
         target_device_(target_device),
         func_args_(func_args),
-        thread_pool_(new thread::ThreadPool(Env::Default(), ThreadOptions(),
-                                            "buffer_resource", thread_pool_size,
-                                            false /* low_latency_hint */)),
         handle_(kInvalidHandle),
         is_buffering_(false),
         end_of_sequence_(false),
         cancelled_(false) {
-    runner_ = [this](std::function<void()> c) {
-      thread_pool_->Schedule(std::move(c));
-    };
+    if (thread_pool_size > 0) {
+      thread_pool_ = new thread::ThreadPool(Env::Default(), ThreadOptions(),
+                                            "buffer_resource", thread_pool_size,
+                                            false /* low_latency_hint */);
+      runner_ = [this](std::function<void()> c) {
+        thread_pool_->Schedule(std::move(c));
+      };
+    }
   }
 
   ~FunctionBufferingResource() override {
@@ -66,7 +71,9 @@ class FunctionBufferingResource : public ResourceBase {
         cond_var_.wait(l);
       }
     }
-    delete thread_pool_;
+    if (thread_pool_ != nullptr) {
+      delete thread_pool_;
+    }
   }
 
   string DebugString() override {
@@ -172,7 +179,9 @@ class FunctionBufferingResource : public ResourceBase {
     FunctionLibraryRuntime::Options opts;
     // Copied from CapturedFunction::generate_step_id();
     opts.step_id = -std::abs(static_cast<int64>(random::New64()));
-    opts.runner = &runner_;
+    if (runner_ != nullptr) {
+      opts.runner = &runner_;
+    }
     opts.source_device = source_device_;
     AllocatorAttributes arg_alloc_attr;
     arg_alloc_attr.set_on_host(true);
@@ -222,12 +231,13 @@ class FunctionBufferingResource : public ResourceBase {
 
   mutex mu_;
   FunctionLibraryRuntime* lib_;
+  std::unique_ptr<ProcessFunctionLibraryRuntime> pflr_;
   NameAttrList func_;
   const int64 buffer_size_;
   const string source_device_;
   const string target_device_;
   const std::vector<Tensor> func_args_;
-  thread::ThreadPool* thread_pool_;
+  thread::ThreadPool* thread_pool_ = nullptr;
   FunctionLibraryRuntime::Handle handle_ GUARDED_BY(mu_);
   std::deque<BufferElement> buffer_ GUARDED_BY(mu_);
   std::deque<FunctionBufferCallback> requests_ GUARDED_BY(mu_);
@@ -241,7 +251,7 @@ class FunctionBufferingResource : public ResourceBase {
 class FunctionBufferResourceHandleOp : public OpKernel {
  public:
   explicit FunctionBufferResourceHandleOp(OpKernelConstruction* ctx)
-      : OpKernel(ctx) {
+      : OpKernel(ctx), flib_def_(nullptr) {
     OP_REQUIRES_OK(ctx, ctx->GetAttr("f", &func_));
     OP_REQUIRES_OK(ctx, ctx->GetAttr("buffer_size", &buffer_size_));
     OP_REQUIRES_OK(ctx, ctx->GetAttr("container", &container_));
@@ -249,6 +259,17 @@ class FunctionBufferResourceHandleOp : public OpKernel {
     OP_REQUIRES_OK(ctx, ctx->GetAttr("thread_pool_size", &thread_pool_size_));
   }
 
+  ~FunctionBufferResourceHandleOp() override {
+    if (cinfo_.resource_is_private_to_kernel()) {
+      if (!cinfo_.resource_manager()
+               ->Delete<FunctionBufferingResource>(cinfo_.container(),
+                                                   cinfo_.name())
+               .ok()) {
+        // Do nothing; the resource may have been deleted by a session reset.
+      }
+    }
+  }
+
   void Compute(OpKernelContext* ctx) override {
     const Tensor* string_arg;
     OP_REQUIRES_OK(ctx, ctx->input("string_arg", &string_arg));
@@ -267,28 +288,39 @@ class FunctionBufferResourceHandleOp : public OpKernel {
 
     const string& source_device = ctx->device()->name();
 
-    ContainerInfo cinfo;
-    OP_REQUIRES_OK(ctx, cinfo.Init(ctx->resource_manager(), def()));
-    // Create the resource.
-    FunctionBufferingResource* buffer;
-    OP_REQUIRES_OK(
-        ctx, ctx->resource_manager()->LookupOrCreate<FunctionBufferingResource>(
-                 cinfo.container(), cinfo.name(), &buffer,
-                 [lib, &source_device, &target_device, func_args,
-                  this](FunctionBufferingResource** ptr) {
-                   *ptr = new FunctionBufferingResource(
-                       lib, func_, buffer_size_, source_device, target_device,
-                       func_args, thread_pool_size_);
-                   return Status::OK();
-                 }));
-    OP_REQUIRES_OK(ctx, buffer->Instantiate());
+    mutex_lock l(mu_);
+    if (!initialized_) {
+      OP_REQUIRES_OK(ctx, cinfo_.Init(ctx->resource_manager(), def()));
+      FunctionLibraryRuntime* clone_lib;
+      std::unique_ptr<ProcessFunctionLibraryRuntime> pflr;
+      OP_REQUIRES_OK(ctx, lib->Clone(&flib_def_, &pflr, &clone_lib));
+      // Create the resource.
+      FunctionBufferingResource* buffer;
+      OP_REQUIRES_OK(
+          ctx,
+          ctx->resource_manager()->LookupOrCreate<FunctionBufferingResource>(
+              cinfo_.container(), cinfo_.name(), &buffer,
+              [clone_lib, &pflr, &source_device, &target_device, func_args,
+               this](FunctionBufferingResource** ptr) {
+                *ptr = new FunctionBufferingResource(
+                    clone_lib, std::move(pflr), func_, buffer_size_,
+                    source_device, target_device, func_args, thread_pool_size_);
+                return Status::OK();
+              }));
+      OP_REQUIRES_OK(ctx, buffer->Instantiate());
+      initialized_ = true;
+    }
 
     OP_REQUIRES_OK(ctx, MakeResourceHandleToOutput(
-                            ctx, 0, cinfo.container(), cinfo.name(),
+                            ctx, 0, cinfo_.container(), cinfo_.name(),
                             MakeTypeIndex<FunctionBufferingResource>()));
   }
 
  private:
+  mutex mu_;
+  ContainerInfo cinfo_ GUARDED_BY(mu_);
+  bool initialized_ GUARDED_BY(mu_) = false;
+  std::unique_ptr<FunctionLibraryDefinition> flib_def_;
   NameAttrList func_;
   int64 buffer_size_;
   string container_;
diff --git a/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc b/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc
index 4b3edde85fc755f1c7694a555b867317e81f149d..63e19ae3f837c9d3cfb1221df64360ee74117f13 100644
--- a/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc
+++ b/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc
@@ -166,14 +166,10 @@ class ThreadPoolDatasetOp : public UnaryDatasetOpKernel {
         params.runner = [pool](std::function<void()> c) {
           pool->Schedule(std::move(c));
         };
-        params.stats_aggregator_getter = [ctx]() {
-          return ctx->stats_aggregator();
-        };
+        params.stats_aggregator_getter = ctx->stats_aggregator_getter();
         params.lib = ctx->lib();
         params.function_library = ctx->function_library();
-        params.allocator_getter = [ctx](AllocatorAttributes attrs) {
-          return ctx->allocator(attrs);
-        };
+        params.allocator_getter = ctx->allocator_getter();
         IteratorContext threadpool_ctx(params);
         return input_impl_->GetNext(&threadpool_ctx, out_tensors,
                                     end_of_sequence);
diff --git a/tensorflow/contrib/data/python/kernel_tests/BUILD b/tensorflow/contrib/data/python/kernel_tests/BUILD
index 82cd276ce8073b1e66bbc620fa845733aaaca4d4..2c4d4adfdad6d2b3268896cb91cd0357b2b814d9 100644
--- a/tensorflow/contrib/data/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/data/python/kernel_tests/BUILD
@@ -168,8 +168,10 @@ py_test(
     srcs = ["interleave_dataset_op_test.py"],
     srcs_version = "PY2AND3",
     tags = [
+        "manual",
         "no_oss",
         "no_pip",
+        "notap",
     ],
     deps = [
         ":dataset_serialization_test",
@@ -292,9 +294,12 @@ py_test(
         "//tensorflow/python:errors",
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:lib",
+        "//tensorflow/python:math_ops",
         "//tensorflow/python:parsing_ops",
+        "//tensorflow/python:string_ops",
         "//tensorflow/python:util",
         "//tensorflow/python/data/ops:iterator_ops",
+        "//third_party/py/numpy",
     ],
 )
 
@@ -493,6 +498,23 @@ py_test(
     ],
 )
 
+tf_py_test(
+    name = "slide_dataset_op_test",
+    size = "small",
+    srcs = ["slide_dataset_op_test.py"],
+    additional_deps = [
+        "//tensorflow/contrib/data/python/ops:dataset_ops",
+        "//tensorflow/contrib/data/python/ops:transformation_ops",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:errors",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:sparse_tensor",
+        "//third_party/py/numpy",
+    ],
+)
+
 filegroup(
     name = "all_files",
     srcs = glob(
diff --git a/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
index 71dc1c1172c9d515d4c85f85257c952135098329..75482f67da11401305b7b342cd5c971da71a4f3c 100644
--- a/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
@@ -311,10 +311,10 @@ class BatchDatasetTest(test.TestCase):
     self.assertEqual([None], dataset.output_shapes[1][0].as_list())
     self.assertEqual([None, 30], dataset.output_shapes[1][1].as_list())
 
-  def _testBatchAndMapDatasetHelper(self, num_parallel_batches=1):
+  def _testMapAndBatchDatasetHelper(self, num_parallel_batches=1):
     """Test a dataset that maps a TF function across its input elements."""
     # The pipeline is TensorSliceDataset ->
-    # RepeatDataset(count) -> BatchAndMapDataset(square_3, batch_size).
+    # RepeatDataset(count) -> MapAndBatchDataset(square_3, batch_size).
     components = (np.arange(7),
                   np.array([[1, 2, 3]]) * np.arange(7)[:, np.newaxis],
                   np.array(37.0) * np.arange(7))
@@ -381,11 +381,51 @@ class BatchDatasetTest(test.TestCase):
       with self.assertRaises(errors.InvalidArgumentError):
         sess.run(init_op, feed_dict={count: 14, batch_size: 0})
 
-  def testBatchAndMapDataset(self):
-    return self._testBatchAndMapDatasetHelper()
+  def testMapAndBatchDataset(self):
+    return self._testMapAndBatchDatasetHelper()
 
-  def testBatchAndMapDatasetWithParallelBatching(self):
-    return self._testBatchAndMapDatasetHelper(num_parallel_batches=10)
+  def testMapAndBatchDatasetWithParallelBatching(self):
+    return self._testMapAndBatchDatasetHelper(num_parallel_batches=10)
+
+  def _testMapAndBatchPartialBatchHelper(self, drop_remainder=False):
+    iterator = (
+        dataset_ops.Dataset.range(10).apply(
+            batching.map_and_batch(
+                lambda x: array_ops.reshape(x * x, [1]),
+                batch_size=4,
+                drop_remainder=drop_remainder)).make_one_shot_iterator())
+    if drop_remainder:
+      self.assertEqual([4, 1], iterator.output_shapes.as_list())
+    else:
+      self.assertEqual([None, 1], iterator.output_shapes.as_list())
+    next_element = iterator.get_next()
+    with self.test_session() as sess:
+      self.assertAllEqual([[0], [1], [4], [9]], sess.run(next_element))
+      self.assertAllEqual([[16], [25], [36], [49]], sess.run(next_element))
+      if not drop_remainder:
+        self.assertAllEqual([[64], [81]], sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testMapAndBatchPartialBatch(self):
+    return self._testMapAndBatchPartialBatchHelper()
+
+  def testMapAndBatchPartialBatchDropRemainder(self):
+    return self._testMapAndBatchPartialBatchHelper(drop_remainder=True)
+
+  def testMapAndBatchYieldsPartialBatch(self):
+    iterator = (dataset_ops.Dataset.range(10)
+                .apply(batching.map_and_batch(
+                    lambda x: array_ops.reshape(x * x, [1]), 4))
+                .make_one_shot_iterator())
+    self.assertEqual([None, 1], iterator.output_shapes.as_list())
+    next_element = iterator.get_next()
+    with self.test_session() as sess:
+      self.assertAllEqual([[0], [1], [4], [9]], sess.run(next_element))
+      self.assertAllEqual([[16], [25], [36], [49]], sess.run(next_element))
+      self.assertAllEqual([[64], [81]], sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
 
   def testMapAndBatchSparse(self):
 
@@ -411,7 +451,7 @@ class BatchDatasetTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
-  def testBatchAndMapDatasetFails(self):
+  def testMapAndBatchDatasetFails(self):
     """Test a dataset that maps a TF function across its input elements."""
     dataset = dataset_ops.Dataset.from_tensors(
         array_ops.check_numerics(
@@ -425,7 +465,7 @@ class BatchDatasetTest(test.TestCase):
       with self.assertRaisesRegexp(errors.InvalidArgumentError, "oops"):
         sess.run(init_op, feed_dict={batch_size: 14})
 
-  def testBatchAndMapDatasetShapeMismatch(self):
+  def testMapAndBatchDatasetShapeMismatch(self):
     """Test a dataset that maps a TF function across its input elements."""
 
     def generator():
diff --git a/tensorflow/contrib/data/python/kernel_tests/bucketing_test.py b/tensorflow/contrib/data/python/kernel_tests/bucketing_test.py
index f1b494e1a620992365ed75613b508e32f94b40a4..94f800e8a58bc34eef3034cd976b931528c01940 100644
--- a/tensorflow/contrib/data/python/kernel_tests/bucketing_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/bucketing_test.py
@@ -17,6 +17,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import random
+
 import numpy as np
 
 from tensorflow.contrib.data.python.kernel_tests import dataset_serialization_test_base
@@ -379,5 +381,93 @@ class BucketTest(test.TestCase):
       self.assertEqual(batches, 15)
 
 
+class BucketBySequenceLength(test.TestCase):
+
+  def testBucket(self):
+
+    boundaries = [10, 20, 30]
+    batch_sizes = [10, 8, 4, 2]
+    lengths = [8, 13, 25, 35]
+
+    def element_gen():
+      # Produce 1 batch for each bucket
+      elements = []
+      for batch_size, length in zip(batch_sizes, lengths):
+        for _ in range(batch_size):
+          elements.append([1] * length)
+      random.shuffle(elements)
+      for el in elements:
+        yield (el,)
+
+    element_len = lambda el: array_ops.shape(el)[0]
+    dataset = dataset_ops.Dataset.from_generator(
+        element_gen, (dtypes.int64,), ([None],)).apply(
+            grouping.bucket_by_sequence_length(
+                element_len, boundaries, batch_sizes))
+    batch, = dataset.make_one_shot_iterator().get_next()
+
+    with self.test_session() as sess:
+      batches = []
+      for _ in range(4):
+        batches.append(sess.run(batch))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(batch)
+    batch_sizes_val = []
+    lengths_val = []
+    for batch in batches:
+      batch_size = batch.shape[0]
+      length = batch.shape[1]
+      batch_sizes_val.append(batch_size)
+      lengths_val.append(length)
+    self.assertEqual(sum(batch_sizes_val), sum(batch_sizes))
+    self.assertEqual(sorted(batch_sizes), sorted(batch_sizes_val))
+    self.assertEqual(sorted(lengths), sorted(lengths_val))
+
+  def testPadToBoundary(self):
+
+    boundaries = [10, 20, 30]
+    batch_sizes = [10, 8, 4, 2]
+    lengths = [8, 13, 25]
+
+    def element_gen():
+      # Produce 1 batch for each bucket
+      elements = []
+      for batch_size, length in zip(batch_sizes[:-1], lengths):
+        for _ in range(batch_size):
+          elements.append([1] * length)
+      random.shuffle(elements)
+      for el in elements:
+        yield (el,)
+      for _ in range(batch_sizes[-1]):
+        el = [1] * (boundaries[-1] + 5)
+        yield (el,)
+
+    element_len = lambda el: array_ops.shape(el)[0]
+    dataset = dataset_ops.Dataset.from_generator(
+        element_gen, (dtypes.int64,), ([None],)).apply(
+            grouping.bucket_by_sequence_length(
+                element_len, boundaries, batch_sizes,
+                pad_to_bucket_boundary=True))
+    batch, = dataset.make_one_shot_iterator().get_next()
+
+    with self.test_session() as sess:
+      batches = []
+      for _ in range(3):
+        batches.append(sess.run(batch))
+      with self.assertRaisesOpError("bucket_boundaries"):
+        sess.run(batch)
+    batch_sizes_val = []
+    lengths_val = []
+    for batch in batches:
+      batch_size = batch.shape[0]
+      length = batch.shape[1]
+      batch_sizes_val.append(batch_size)
+      lengths_val.append(length)
+    batch_sizes = batch_sizes[:-1]
+    self.assertEqual(sum(batch_sizes_val), sum(batch_sizes))
+    self.assertEqual(sorted(batch_sizes), sorted(batch_sizes_val))
+    self.assertEqual(sorted(boundaries), sorted(lengths_val))
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/get_single_element_test.py b/tensorflow/contrib/data/python/kernel_tests/get_single_element_test.py
index 32ea44f7c7ba329dc253bb9fbbcac0a1ed16aec7..87b7c6ddb7afcbaaf8fe97cd8be87e6f5af8cd4d 100644
--- a/tensorflow/contrib/data/python/kernel_tests/get_single_element_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/get_single_element_test.py
@@ -22,6 +22,7 @@ from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
+from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import test
 
@@ -33,17 +34,25 @@ class GetSingleElementTest(test.TestCase):
     take_value = array_ops.placeholder_with_default(
         constant_op.constant(1, dtype=dtypes.int64), shape=[])
 
+    def make_sparse(x):
+      x_1d = array_ops.reshape(x, [1])
+      x_2d = array_ops.reshape(x, [1, 1])
+      return sparse_tensor.SparseTensor(x_2d, x_1d, x_1d)
+
     dataset = (dataset_ops.Dataset.range(100)
                .skip(skip_value)
-               .map(lambda x: x * x)
+               .map(lambda x: (x * x, make_sparse(x)))
                .take(take_value))
 
     element = get_single_element.get_single_element(dataset)
 
     with self.test_session() as sess:
-      self.assertEqual(0, sess.run(element, feed_dict={skip_value: 0}))
-      self.assertEqual(25, sess.run(element, feed_dict={skip_value: 5}))
-      self.assertEqual(100, sess.run(element, feed_dict={skip_value: 10}))
+      for x in [0, 5, 10]:
+        dense_val, sparse_val = sess.run(element, feed_dict={skip_value: x})
+        self.assertEqual(x * x, dense_val)
+        self.assertAllEqual([[x]], sparse_val.indices)
+        self.assertAllEqual([x], sparse_val.values)
+        self.assertAllEqual([x], sparse_val.dense_shape)
 
       with self.assertRaisesRegexp(errors.InvalidArgumentError,
                                    "Dataset was empty."):
diff --git a/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py b/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py
index 6efe97444a375febc550ff3a3ea04bcd9330a3a5..699e8e7865502facd05e0c4d6d4f01b80f7c050c 100644
--- a/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py
@@ -21,6 +21,8 @@ import gzip
 import os
 import zlib
 
+import numpy as np
+
 from tensorflow.contrib.data.python.kernel_tests import dataset_serialization_test_base
 from tensorflow.contrib.data.python.ops import readers
 from tensorflow.core.example import example_pb2
@@ -33,7 +35,9 @@ from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.lib.io import python_io
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import parsing_ops
+from tensorflow.python.ops import string_ops
 from tensorflow.python.platform import test
 from tensorflow.python.util import compat
 
@@ -262,12 +266,19 @@ class ReadBatchFeaturesTest(test.TestCase):
     self._num_records = 7
     self.test_filenames = self._createFiles()
 
-  def _read_batch_features(self, filenames, num_epochs, batch_size):
+  def _read_batch_features(self,
+                           filenames,
+                           num_epochs,
+                           batch_size,
+                           reader_num_threads=1,
+                           parser_num_threads=1,
+                           shuffle=False,
+                           shuffle_seed=None):
     self.filenames = filenames
     self.num_epochs = num_epochs
     self.batch_size = batch_size
 
-    return readers.read_batch_features(
+    return readers.make_batched_features_dataset(
         file_pattern=self.filenames,
         batch_size=self.batch_size,
         features={
@@ -276,8 +287,12 @@ class ReadBatchFeaturesTest(test.TestCase):
             "keywords": parsing_ops.VarLenFeature(dtypes.string)
         },
         reader=core_readers.TFRecordDataset,
-        randomize_input=False,
-        num_epochs=self.num_epochs)
+        num_epochs=self.num_epochs,
+        shuffle=shuffle,
+        shuffle_seed=shuffle_seed,
+        reader_num_threads=reader_num_threads,
+        parser_num_threads=parser_num_threads).make_one_shot_iterator(
+        ).get_next()
 
   def _record(self, f, r):
     example = example_pb2.Example(features=feature_pb2.Features(
@@ -312,24 +327,35 @@ class ReadBatchFeaturesTest(test.TestCase):
       writer.close()
     return filenames
 
-  def _next_actual_batch(self, sess):
-    file_op = self.outputs["file"]
-    keywords_indices_op = self.outputs["keywords"].indices
-    keywords_values_op = self.outputs["keywords"].values
-    keywords_dense_shape_op = self.outputs["keywords"].dense_shape
-    record_op = self.outputs["record"]
+  def _run_actual_batch(self, outputs, sess):
+    file_op = outputs["file"]
+    keywords_indices_op = outputs["keywords"].indices
+    keywords_values_op = outputs["keywords"].values
+    keywords_dense_shape_op = outputs["keywords"].dense_shape
+    record_op = outputs["record"]
     return sess.run([
         file_op, keywords_indices_op, keywords_values_op,
         keywords_dense_shape_op, record_op
     ])
 
-  def _next_expected_batch(self, file_indices, batch_size, num_epochs):
+  def _next_actual_batch(self, sess):
+    return self._run_actual_batch(self.outputs, sess)
+
+  def _next_expected_batch(self,
+                           file_indices,
+                           batch_size,
+                           num_epochs,
+                           cycle_length=1):
 
     def _next_record(file_indices):
       for j in file_indices:
         for i in range(self._num_records):
           yield j, i
 
+    def _next_record_interleaved(file_indices, cycle_length):
+      return self._interleave([_next_record([i]) for i in file_indices],
+                              cycle_length)
+
     file_batch = []
     keywords_batch_indices = []
     keywords_batch_values = []
@@ -337,7 +363,11 @@ class ReadBatchFeaturesTest(test.TestCase):
     record_batch = []
     batch_index = 0
     for _ in range(num_epochs):
-      for record in _next_record(file_indices):
+      if cycle_length == 1:
+        next_records = _next_record(file_indices)
+      else:
+        next_records = _next_record_interleaved(file_indices, cycle_length)
+      for record in next_records:
         f = record[0]
         r = record[1]
         file_batch.append(f)
@@ -365,14 +395,41 @@ class ReadBatchFeaturesTest(test.TestCase):
           [len(file_batch), keywords_batch_max_len], record_batch
       ]
 
-  def _verify_records(self, sess, batch_size, file_index=None, num_epochs=1):
+  def _interleave(self, iterators, cycle_length):
+    pending_iterators = iterators
+    open_iterators = []
+    num_open = 0
+    for i in range(cycle_length):
+      if pending_iterators:
+        open_iterators.append(pending_iterators.pop(0))
+        num_open += 1
+
+    while num_open:
+      for i in range(min(cycle_length, len(open_iterators))):
+        if open_iterators[i] is None:
+          continue
+        try:
+          yield next(open_iterators[i])
+        except StopIteration:
+          if pending_iterators:
+            open_iterators[i] = pending_iterators.pop(0)
+          else:
+            open_iterators[i] = None
+            num_open -= 1
+
+  def _verify_records(self,
+                      sess,
+                      batch_size,
+                      file_index=None,
+                      num_epochs=1,
+                      interleave_cycle_length=1):
     if file_index is not None:
       file_indices = [file_index]
     else:
       file_indices = range(self._num_files)
 
-    for expected_batch in self._next_expected_batch(file_indices, batch_size,
-                                                    num_epochs):
+    for expected_batch in self._next_expected_batch(
+        file_indices, batch_size, num_epochs, interleave_cycle_length):
       actual_batch = self._next_actual_batch(sess)
       for i in range(len(expected_batch)):
         self.assertAllEqual(expected_batch[i], actual_batch[i])
@@ -435,6 +492,302 @@ class ReadBatchFeaturesTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(next_element)
 
+  def testReadWithFusedShuffleRepeatDataset(self):
+    num_epochs = 5
+    total_records = num_epochs * self._num_records
+    for batch_size in [1, 2]:
+      # Test that shuffling with same seed produces the same result.
+      with ops.Graph().as_default() as g:
+        with self.test_session(graph=g) as sess:
+          outputs1 = self._read_batch_features(
+              filenames=self.test_filenames[0],
+              num_epochs=num_epochs,
+              batch_size=batch_size,
+              shuffle=True,
+              shuffle_seed=5)
+          outputs2 = self._read_batch_features(
+              filenames=self.test_filenames[0],
+              num_epochs=num_epochs,
+              batch_size=batch_size,
+              shuffle=True,
+              shuffle_seed=5)
+          for _ in range(total_records // batch_size):
+            batch1 = self._run_actual_batch(outputs1, sess)
+            batch2 = self._run_actual_batch(outputs2, sess)
+            for i in range(len(batch1)):
+              self.assertAllEqual(batch1[i], batch2[i])
+
+      # Test that shuffling with different seeds produces a different order.
+      with ops.Graph().as_default() as g:
+        with self.test_session(graph=g) as sess:
+          outputs1 = self._read_batch_features(
+              filenames=self.test_filenames[0],
+              num_epochs=num_epochs,
+              batch_size=batch_size,
+              shuffle=True,
+              shuffle_seed=5)
+          outputs2 = self._read_batch_features(
+              filenames=self.test_filenames[0],
+              num_epochs=num_epochs,
+              batch_size=batch_size,
+              shuffle=True,
+              shuffle_seed=15)
+          all_equal = True
+          for _ in range(total_records // batch_size):
+            batch1 = self._run_actual_batch(outputs1, sess)
+            batch2 = self._run_actual_batch(outputs2, sess)
+            for i in range(len(batch1)):
+              all_equal = all_equal and np.array_equal(batch1[i], batch2[i])
+          self.assertFalse(all_equal)
+
+  def testParallelReadersAndParsers(self):
+    num_epochs = 5
+    for batch_size in [1, 2]:
+      for reader_num_threads in [2, 4]:
+        for parser_num_threads in [2, 4]:
+          with ops.Graph().as_default() as g:
+            with self.test_session(graph=g) as sess:
+              self.outputs = self._read_batch_features(
+                  filenames=self.test_filenames,
+                  num_epochs=num_epochs,
+                  batch_size=batch_size,
+                  reader_num_threads=reader_num_threads,
+                  parser_num_threads=parser_num_threads)
+              self._verify_records(
+                  sess,
+                  batch_size,
+                  num_epochs=num_epochs,
+                  interleave_cycle_length=reader_num_threads)
+              with self.assertRaises(errors.OutOfRangeError):
+                self._next_actual_batch(sess)
+
+
+class MakeCsvDatasetTest(test.TestCase):
+
+  COLUMN_TYPES = [
+      dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float64, dtypes.string
+  ]
+  COLUMNS = ["col%d" % i for i in range(len(COLUMN_TYPES))]
+  LABEL = COLUMNS[0]
+
+  def setUp(self):
+    super(MakeCsvDatasetTest, self).setUp()
+    self._num_files = 2
+    self._num_records = 7
+    self._test_filenames = self._create_files()
+
+  def _csv_values(self, fileno, recordno):
+    return [
+        fileno,
+        recordno,
+        fileno * recordno * 0.5,
+        fileno * recordno + 0.5,
+        "record %d" % recordno if recordno % 2 == 1 else "",
+    ]
+
+  def _csv_record(self, fileno, recordno):
+    return ",".join(str(v) for v in self._csv_values(fileno, recordno))
+
+  def _create_files(self):
+    filenames = []
+    for i in range(self._num_files):
+      fn = os.path.join(self.get_temp_dir(), "csv_file%d.csv" % i)
+      filenames.append(fn)
+      f = open(fn, "w")
+      f.write(",".join(self.COLUMNS) + "\n")  # header line
+      for j in range(self._num_records):
+        f.write(self._csv_record(i, j) + "\n")
+        f.write("# Some comment goes here. Should be ignored!\n")
+      f.close()
+    return filenames
+
+  def _make_csv_dataset(self,
+                        filenames,
+                        defaults,
+                        label_key=LABEL,
+                        batch_size=1,
+                        num_epochs=1,
+                        shuffle=False,
+                        shuffle_seed=None):
+    return readers.make_csv_dataset(
+        filenames,
+        column_keys=self.COLUMNS,
+        column_defaults=defaults,
+        label_key=label_key,
+        batch_size=batch_size,
+        num_epochs=num_epochs,
+        shuffle=shuffle,
+        shuffle_seed=shuffle_seed,
+        skip=1,
+        filter_fn=
+        lambda line: math_ops.not_equal(string_ops.substr(line, 0, 1), "#"),
+    )
+
+  def _next_actual_batch(self, file_indices, batch_size, num_epochs):
+    features = {col: list() for col in self.COLUMNS}
+    for _ in range(num_epochs):
+      for i in file_indices:
+        for j in range(self._num_records):
+          values = self._csv_values(i, j)
+          if not values[-1]:
+            values[-1] = "NULL"  # null values in csv are interpreted as default
+          values[-1] = values[-1].encode("utf-8")
+
+          # Regroup lists by column instead of row
+          for n, col in enumerate(self.COLUMNS):
+            features[col].append(values[n])
+          if len(list(features.values())[0]) == batch_size:
+            yield features
+            features = {col: list() for col in self.COLUMNS}
+
+  def _run_actual_batch(self, outputs, sess):
+    features, labels = sess.run(outputs)
+    batch = [features[k] for k in self.COLUMNS if k != self.LABEL]
+    batch.append(labels)
+    return batch
+
+  def _verify_records(
+      self,
+      sess,
+      dataset,
+      file_indices,
+      label_key=LABEL,
+      batch_size=1,
+      num_epochs=1,
+  ):
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next()
+
+    for expected_features in self._next_actual_batch(file_indices, batch_size,
+                                                     num_epochs):
+      actual_features = sess.run(get_next)
+
+      if label_key is not None:
+        expected_labels = expected_features.pop(label_key)
+        # Compare labels
+        self.assertAllEqual(expected_labels, actual_features[1])
+        actual_features = actual_features[0]  # Extract features dict from tuple
+
+      for k in expected_features.keys():
+        # Compare features
+        self.assertAllEqual(expected_features[k], actual_features[k])
+
+    with self.assertRaises(errors.OutOfRangeError):
+      sess.run(get_next)
+
+  def test_make_csv_dataset(self):
+    defaults = [
+        constant_op.constant([], dtype=d) for d in self.COLUMN_TYPES[:-1]
+    ]
+    defaults.append(constant_op.constant(["NULL"], dtype=dtypes.string))
+
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as sess:
+        # Basic test: read from file 0.
+        dataset = self._make_csv_dataset(self._test_filenames[0], defaults)
+        self._verify_records(sess, dataset, [0])
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as sess:
+        # Basic test: read from file 1.
+        dataset = self._make_csv_dataset(self._test_filenames[1], defaults)
+        self._verify_records(sess, dataset, [1])
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as sess:
+        # Read from both files.
+        dataset = self._make_csv_dataset(self._test_filenames, defaults)
+        self._verify_records(sess, dataset, range(self._num_files))
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as sess:
+        # Read from both files. Exercise the `batch_size` and `num_epochs`
+        # parameters of make_csv_dataset and make sure they work.
+        dataset = self._make_csv_dataset(
+            self._test_filenames, defaults, batch_size=2, num_epochs=10)
+        self._verify_records(
+            sess, dataset, range(self._num_files), batch_size=2, num_epochs=10)
+
+  def test_make_csv_dataset_with_no_label(self):
+    defaults = [
+        constant_op.constant([], dtype=d) for d in self.COLUMN_TYPES[:-1]
+    ]
+    defaults.append(constant_op.constant(["NULL"], dtype=dtypes.string))
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as sess:
+        # Read from both files. Make sure this works with no label key supplied.
+        dataset = self._make_csv_dataset(
+            self._test_filenames,
+            defaults,
+            batch_size=2,
+            num_epochs=10,
+            label_key=None)
+        self._verify_records(
+            sess,
+            dataset,
+            range(self._num_files),
+            batch_size=2,
+            num_epochs=10,
+            label_key=None)
+
+  def test_make_csv_dataset_with_types(self):
+    defaults = [d for d in self.COLUMN_TYPES[:-1]]
+    defaults.append(constant_op.constant(["NULL"], dtype=dtypes.string))
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as sess:
+        dataset = self._make_csv_dataset(self._test_filenames, defaults)
+        self._verify_records(sess, dataset, range(self._num_files))
+
+  def test_make_csv_dataset_with_shuffle(self):
+    total_records = self._num_files * self._num_records
+    defaults = [d for d in self.COLUMN_TYPES[:-1]]
+    defaults.append(constant_op.constant(["NULL"], dtype=dtypes.string))
+    for batch_size in [1, 2]:
+      with ops.Graph().as_default() as g:
+        with self.test_session(graph=g) as sess:
+          # Test that shuffling with the same seed produces the same result
+          dataset1 = self._make_csv_dataset(
+              self._test_filenames,
+              defaults,
+              batch_size=batch_size,
+              shuffle=True,
+              shuffle_seed=5)
+          dataset2 = self._make_csv_dataset(
+              self._test_filenames,
+              defaults,
+              batch_size=batch_size,
+              shuffle=True,
+              shuffle_seed=5)
+          outputs1 = dataset1.make_one_shot_iterator().get_next()
+          outputs2 = dataset2.make_one_shot_iterator().get_next()
+          for _ in range(total_records // batch_size):
+            batch1 = self._run_actual_batch(outputs1, sess)
+            batch2 = self._run_actual_batch(outputs2, sess)
+            for i in range(len(batch1)):
+              self.assertAllEqual(batch1[i], batch2[i])
+
+      with ops.Graph().as_default() as g:
+        with self.test_session(graph=g) as sess:
+          # Test that shuffling with a different seed produces different results
+          dataset1 = self._make_csv_dataset(
+              self._test_filenames,
+              defaults,
+              batch_size=batch_size,
+              shuffle=True,
+              shuffle_seed=5)
+          dataset2 = self._make_csv_dataset(
+              self._test_filenames,
+              defaults,
+              batch_size=batch_size,
+              shuffle=True,
+              shuffle_seed=6)
+          outputs1 = dataset1.make_one_shot_iterator().get_next()
+          outputs2 = dataset2.make_one_shot_iterator().get_next()
+          all_equal = True
+          for _ in range(total_records // batch_size):
+            batch1 = self._run_actual_batch(outputs1, sess)
+            batch2 = self._run_actual_batch(outputs2, sess)
+            for i in range(len(batch1)):
+              all_equal = all_equal and np.array_equal(batch1[i], batch2[i])
+          self.assertFalse(all_equal)
+
 
 if __name__ == "__main__":
   test.main()
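
The `_interleave` helper added to `ReadBatchFeaturesTest` above models the order in which records are expected when several readers run in parallel: up to `cycle_length` files are open at once and one record is taken from each in turn. A minimal standalone sketch of that round-robin order follows. The `records` helper and the file and record counts are hypothetical and chosen only for illustration; this is not the dataset implementation itself.

```python
def records(file_index, num_records):
    """Yield (file_index, record_index) pairs for one hypothetical file."""
    for record_index in range(num_records):
        yield file_index, record_index


def interleave(iterators, cycle_length):
    """Round-robin over at most `cycle_length` open iterators."""
    pending = list(iterators)  # copy, so the caller's list is not mutated
    open_slots = [pending.pop(0)
                  for _ in range(min(cycle_length, len(pending)))]
    num_open = len(open_slots)
    while num_open:
        for i in range(len(open_slots)):
            if open_slots[i] is None:
                continue  # this slot is exhausted and has no replacement
            try:
                yield next(open_slots[i])
            except StopIteration:
                if pending:
                    # Reuse the slot; the new iterator is read next round.
                    open_slots[i] = pending.pop(0)
                else:
                    open_slots[i] = None
                    num_open -= 1


files = [records(f, 2) for f in range(3)]
interleaved = list(interleave(files, cycle_length=2))
print(interleaved)
# [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (2, 1)]
```

With `cycle_length=1` the same function yields the files one after another, which is why the test only takes the interleaved path when `cycle_length` is not 1.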
diff --git a/tensorflow/contrib/data/python/kernel_tests/resample_test.py b/tensorflow/contrib/data/python/kernel_tests/resample_test.py
index 3c7b46629edb13459766b5ef3f392e8d00ad4db8..5f47dcb33999119a690bd633f0c97a12a1ae1c84 100644
--- a/tensorflow/contrib/data/python/kernel_tests/resample_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/resample_test.py
@@ -21,7 +21,10 @@ import numpy as np
 
 from tensorflow.contrib.data.python.ops import resampling
 from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import string_ops
 from tensorflow.python.platform import test
 from tensorflow.python.util import compat
@@ -45,12 +48,10 @@ class ResampleTest(test.TestCase):
                 target_dist=target_dist,
                 initial_dist=initial_dist,
                 class_func=lambda c, _: c,
-                seed=27)).make_initializable_iterator())
-    init_op = iterator.initializer
+                seed=27)).make_one_shot_iterator())
     get_next = iterator.get_next()
 
     with self.test_session() as sess:
-      sess.run(init_op)
       returned = []
       with self.assertRaises(errors.OutOfRangeError):
         while True:
@@ -70,6 +71,43 @@ class ResampleTest(test.TestCase):
     returned_dist = class_counts / total_returned
     self.assertAllClose(target_dist, returned_dist, atol=1e-2)
 
+  def testRandomClasses(self):
+    init_dist = [0.25, 0.25, 0.25, 0.25]
+    target_dist = [0.0, 0.0, 0.0, 1.0]
+    num_classes = len(init_dist)
+    # We don't need many samples to test a Dirac-delta target distribution.
+    num_samples = 100
+    data_np = np.random.choice(num_classes, num_samples, p=init_dist)
+
+    dataset = dataset_ops.Dataset.from_tensor_slices(data_np)
+
+    # Apply a random mapping that preserves the data distribution.
+    def _remap_fn(_):
+      return math_ops.cast(random_ops.random_uniform([1]) * num_classes,
+                           dtypes.int32)[0]
+    dataset = dataset.map(_remap_fn)
+
+    # Reshape distribution.
+    dataset = dataset.apply(
+        resampling.rejection_resample(
+            class_func=lambda x: x,
+            target_dist=target_dist,
+            initial_dist=init_dist))
+
+    get_next = dataset.make_one_shot_iterator().get_next()
+
+    with self.test_session() as sess:
+      returned = []
+      with self.assertRaises(errors.OutOfRangeError):
+        while True:
+          returned.append(sess.run(get_next))
+
+    classes, _ = zip(*returned)
+    bincount = np.bincount(
+        np.array(classes),
+        minlength=num_classes).astype(np.float32) / len(classes)
+
+    self.assertAllClose(target_dist, bincount, atol=1e-2)
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..33c48e20bea53b88d69a59e715af38b22dd2cbd4
--- /dev/null
+++ b/tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py
@@ -0,0 +1,242 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the experimental input pipeline ops."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.data.python.ops import sliding
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import test
+
+
+class SlideDatasetTest(test.TestCase):
+
+  def testSlideDataset(self):
+    """Test a dataset that slides a window across its input elements."""
+    components = (np.arange(7),
+                  np.array([[1, 2, 3]]) * np.arange(7)[:, np.newaxis],
+                  np.array(37.0) * np.arange(7))
+
+    count = array_ops.placeholder(dtypes.int64, shape=[])
+    window_size = array_ops.placeholder(dtypes.int64, shape=[])
+    stride = array_ops.placeholder(dtypes.int64, shape=[])
+
+    def _map_fn(x, y, z):
+      return math_ops.square(x), math_ops.square(y), math_ops.square(z)
+
+    # The pipeline is TensorSliceDataset -> MapDataset(square_3) ->
+    # RepeatDataset(count) -> _SlideDataset(window_size, stride).
+    iterator = (dataset_ops.Dataset.from_tensor_slices(components)
+                .map(_map_fn)
+                .repeat(count)
+                .apply(sliding.sliding_window_batch(window_size, stride))
+                .make_initializable_iterator())
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+
+    self.assertEqual([[None] + list(c.shape[1:]) for c in components],
+                     [t.shape.as_list() for t in get_next])
+
+    with self.test_session() as sess:
+      # Slide over a finite input, where the window_size divides the
+      # total number of elements.
+      sess.run(init_op, feed_dict={count: 20, window_size: 14, stride: 7})
+      # Same output-size formula as a convolution layer with no padding.
+      num_batches = (20 * 7 - 14) // 7 + 1
+      for i in range(num_batches):
+        result = sess.run(get_next)
+        for component, result_component in zip(components, result):
+          for j in range(14):
+            self.assertAllEqual(component[(i*7 + j) % 7]**2,
+                                result_component[j])
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+      # Slide over a finite input, where the window_size does not
+      # divide the total number of elements.
+      sess.run(init_op, feed_dict={count: 20, window_size: 17, stride: 9})
+
+      num_batches = (20 * 7 - 17) // 9 + 1
+      for i in range(num_batches):
+        result = sess.run(get_next)
+        for component, result_component in zip(components, result):
+          for j in range(17):
+            self.assertAllEqual(component[(i*9 + j) % 7]**2,
+                                result_component[j])
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+      # Slide over a finite input, which is less than window_size,
+      # should fail straight away.
+      sess.run(init_op, feed_dict={count: 1, window_size: 10, stride: 4})
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+      sess.run(init_op, feed_dict={count: 1, window_size: 10, stride: 8})
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+      # Slide over an empty input should fail straight away.
+      sess.run(init_op, feed_dict={count: 0, window_size: 8, stride: 4})
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+      # Empty window_size should be an initialization time error.
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(init_op, feed_dict={count: 14, window_size: 0, stride: 0})
+
+      # Invalid stride should be an initialization time error.
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(init_op, feed_dict={count: 14, window_size: 3, stride: 0})
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(init_op, feed_dict={count: 14, window_size: 3, stride: 3})
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(init_op, feed_dict={count: 14, window_size: 3, stride: 5})
+
+  def assertSparseValuesEqual(self, a, b):
+    self.assertAllEqual(a.indices, b.indices)
+    self.assertAllEqual(a.values, b.values)
+    self.assertAllEqual(a.dense_shape, b.dense_shape)
+
+  def testSlideSparse(self):
+
+    def _sparse(i):
+      return sparse_tensor.SparseTensorValue(
+          indices=[[0]], values=(i * [1]), dense_shape=[1])
+
+    iterator = dataset_ops.Dataset.range(10).map(_sparse).apply(
+        sliding.sliding_window_batch(5, 3)).make_initializable_iterator()
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(init_op)
+      num_batches = (10 - 5) // 3 + 1
+      for i in range(num_batches):
+        actual = sess.run(get_next)
+        expected = sparse_tensor.SparseTensorValue(
+            indices=[[0, 0], [1, 0], [2, 0], [3, 0], [4, 0]],
+            values=[i * 3, i * 3 + 1, i * 3 + 2, i * 3 + 3, i * 3 + 4],
+            dense_shape=[5, 1])
+        self.assertTrue(sparse_tensor.is_sparse(actual))
+        self.assertSparseValuesEqual(actual, expected)
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+  def testSlideSparseWithDifferentDenseShapes(self):
+
+    def _sparse(i):
+      return sparse_tensor.SparseTensorValue(
+          indices=array_ops.expand_dims(
+              math_ops.range(i, dtype=dtypes.int64), 1),
+          values=array_ops.fill([math_ops.to_int32(i)], i),
+          dense_shape=[i])
+
+    iterator = dataset_ops.Dataset.range(10).map(_sparse).apply(
+        sliding.sliding_window_batch(5, 3)).make_initializable_iterator()
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(init_op)
+      num_batches = (10 - 5) // 3 + 1
+      for i in range(num_batches):
+        actual = sess.run(get_next)
+        expected_indices = []
+        expected_values = []
+        for j in range(5):
+          for k in range(i * 3 + j):
+            expected_indices.append([j, k])
+            expected_values.append(i * 3 + j)
+        expected = sparse_tensor.SparseTensorValue(
+            indices=expected_indices,
+            values=expected_values,
+            dense_shape=[5, i * 3 + 5 - 1])
+        self.assertTrue(sparse_tensor.is_sparse(actual))
+        self.assertSparseValuesEqual(actual, expected)
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+  def testNestedSlideSparse(self):
+
+    def _sparse(i):
+      return sparse_tensor.SparseTensorValue(
+          indices=[[0]], values=(i * [1]), dense_shape=[1])
+
+    iterator = (dataset_ops.Dataset.range(10)
+                .map(_sparse)
+                .apply(sliding.sliding_window_batch(4, 2))
+                .apply(sliding.sliding_window_batch(3, 1))
+                .make_initializable_iterator())
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(init_op)
+      # Slide: 1st batch.
+      actual = sess.run(get_next)
+      expected = sparse_tensor.SparseTensorValue(
+          indices=[[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0],
+                   [1, 0, 0], [1, 1, 0], [1, 2, 0], [1, 3, 0],
+                   [2, 0, 0], [2, 1, 0], [2, 2, 0], [2, 3, 0]],
+          values=[0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7],
+          dense_shape=[3, 4, 1])
+      self.assertTrue(sparse_tensor.is_sparse(actual))
+      self.assertSparseValuesEqual(actual, expected)
+      # Slide: 2nd batch.
+      actual = sess.run(get_next)
+      expected = sparse_tensor.SparseTensorValue(
+          indices=[[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0],
+                   [1, 0, 0], [1, 1, 0], [1, 2, 0], [1, 3, 0],
+                   [2, 0, 0], [2, 1, 0], [2, 2, 0], [2, 3, 0]],
+          values=[2, 3, 4, 5, 4, 5, 6, 7, 6, 7, 8, 9],
+          dense_shape=[3, 4, 1])
+      self.assertTrue(sparse_tensor.is_sparse(actual))
+      self.assertSparseValuesEqual(actual, expected)
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+  def testSlideShapeError(self):
+
+    def generator():
+      yield [1.0, 2.0, 3.0]
+      yield [4.0, 5.0, 6.0]
+      yield [7.0, 8.0, 9.0, 10.0]
+
+    iterator = (dataset_ops.Dataset.from_generator(generator, dtypes.float32,
+                                                   output_shapes=[None])
+                .apply(sliding.sliding_window_batch(3, 1))
+                .make_initializable_iterator())
+    next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      with self.assertRaisesRegexp(
+          errors.InvalidArgumentError,
+          r"Cannot batch tensors with different shapes in component 0. "
+          r"First element had shape \[3\] and element 2 had shape \[4\]."):
+        sess.run(next_element)
+
+
+if __name__ == "__main__":
+  test.main()
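
The tests in `slide_dataset_op_test.py` above compute the expected number of windows as `(n - window_size) // stride + 1`. The sketch below checks that formula against a plain-Python sliding window. The names `sliding_windows` and `num_windows` are hypothetical helpers, not part of `tf.contrib.data`, and the argument check only mirrors what the test expects (`0 < stride < window_size`) rather than reproducing the op's own validation.

```python
def sliding_windows(elements, window_size, stride):
    """Return every full window of `window_size` items, `stride` apart."""
    if window_size <= 0 or not 0 < stride < window_size:
        # Mirrors the combinations the test expects to be rejected.
        raise ValueError("need window_size > 0 and 0 < stride < window_size")
    last_start = len(elements) - window_size
    return [elements[start:start + window_size]
            for start in range(0, last_start + 1, stride)]


def num_windows(n, window_size, stride):
    """Closed-form count of full windows over `n` elements."""
    if n < window_size:
        return 0  # input shorter than one window: nothing is produced
    return (n - window_size) // stride + 1


# Cases taken from the test: 20 repeats of 7 elements, then range(10).
for n, window_size, stride in [(140, 14, 7), (140, 17, 9), (10, 5, 3),
                               (7, 10, 4)]:
    windows = sliding_windows(list(range(n)), window_size, stride)
    assert len(windows) == num_windows(n, window_size, stride)

print(sliding_windows(list(range(10)), 5, 3))
# [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7]]
```

Trailing elements that do not fill a whole window are dropped, which matches the test's expectation that an input shorter than `window_size` raises `OutOfRangeError` on the first `get_next`.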
diff --git a/tensorflow/contrib/data/python/ops/BUILD b/tensorflow/contrib/data/python/ops/BUILD
index 789cb9c99a6bba06a1e3bd3371d1378065f49f46..c3331e963602d60fe27dd44b0cc06dfb20ca2b6a 100644
--- a/tensorflow/contrib/data/python/ops/BUILD
+++ b/tensorflow/contrib/data/python/ops/BUILD
@@ -67,6 +67,8 @@ py_library(
     srcs_version = "PY2AND3",
     deps = [
         ":dataset_ops",
+        ":shuffle_ops",
+        "//tensorflow/python:constant_op",
         "//tensorflow/python:dataset_ops_gen",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
@@ -104,6 +106,7 @@ py_library(
         "interleave_ops.py",
         "resampling.py",
         "scan_ops.py",
+        "sliding.py",
         "stats_ops.py",
         "threadpool.py",
         "unique.py",
@@ -126,6 +129,7 @@ py_library(
         "//tensorflow/python:tensor_util",
         "//tensorflow/python:util",
         "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow/python/data/ops:readers",
         "//tensorflow/python/data/util:convert",
         "//tensorflow/python/data/util:nest",
         "//tensorflow/python/data/util:sparse",
diff --git a/tensorflow/contrib/data/python/ops/batching.py b/tensorflow/contrib/data/python/ops/batching.py
index 6eb512dec67cb7b9c8c4518d03aee0b436205f9a..a212adf6cf580267f9f1e6959bef95f04a4ad782 100644
--- a/tensorflow/contrib/data/python/ops/batching.py
+++ b/tensorflow/contrib/data/python/ops/batching.py
@@ -348,13 +348,19 @@ class _RestructuredDataset(dataset_ops.Dataset):
 class _MapAndBatchDataset(dataset_ops.MapDataset):
   """A `Dataset` that maps a function over a batch of elements."""
 
-  def __init__(self, input_dataset, map_func, batch_size, num_parallel_batches):
+  def __init__(self, input_dataset, map_func, batch_size, num_parallel_batches,
+               drop_remainder):
     """See `Dataset.map()` for details."""
     super(_MapAndBatchDataset, self).__init__(input_dataset, map_func)
-    self._batch_size = ops.convert_to_tensor(
+    self._batch_size_t = ops.convert_to_tensor(
         batch_size, dtype=dtypes.int64, name="batch_size")
-    self._num_parallel_batches = ops.convert_to_tensor(
+    self._num_parallel_batches_t = ops.convert_to_tensor(
         num_parallel_batches, dtype=dtypes.int64, name="num_parallel_batches")
+    self._drop_remainder_t = ops.convert_to_tensor(
+        drop_remainder, dtype=dtypes.bool, name="drop_remainder")
+
+    self._batch_size = batch_size
+    self._drop_remainder = drop_remainder
 
   def _as_variant_tensor(self):
     # pylint: disable=protected-access
@@ -363,8 +369,9 @@ class _MapAndBatchDataset(dataset_ops.MapDataset):
         input_resource,
         self._map_func.captured_inputs,
         f=self._map_func,
-        batch_size=self._batch_size,
-        num_parallel_batches=self._num_parallel_batches,
+        batch_size=self._batch_size_t,
+        num_parallel_batches=self._num_parallel_batches_t,
+        drop_remainder=self._drop_remainder_t,
         output_types=nest.flatten(
             sparse.as_dense_types(self.output_types, self.output_classes)),
         output_shapes=nest.flatten(
@@ -373,9 +380,9 @@ class _MapAndBatchDataset(dataset_ops.MapDataset):
 
   @property
   def output_shapes(self):
+    dim = self._batch_size if self._drop_remainder else None
     return nest.pack_sequence_as(self._output_shapes, [
-        tensor_shape.vector(tensor_util.constant_value(
-            self._batch_size)).concatenate(s)
+        tensor_shape.vector(dim).concatenate(s)
         for s in nest.flatten(self._output_shapes)
     ])
 
@@ -384,7 +391,10 @@ class _MapAndBatchDataset(dataset_ops.MapDataset):
     return self._output_types
 
 
-def map_and_batch(map_func, batch_size, num_parallel_batches=1):
+def map_and_batch(map_func,
+                  batch_size,
+                  num_parallel_batches=1,
+                  drop_remainder=False):
   """Fused implementation of `map` and `batch`.
 
   Maps `map_func` across `batch_size` consecutive elements of this dataset
@@ -404,6 +414,9 @@ def map_and_batch(map_func, batch_size, num_parallel_batches=1):
       number of batches to create in parallel. On one hand, higher values can
       help mitigate the effect of stragglers. On the other hand, higher values
       can increase contention if CPU is scarce.
+    drop_remainder: A `tf.bool` scalar `tf.Tensor`, representing whether the
+      last batch should be dropped in case its size is smaller than desired;
+      the default behavior is not to drop the smaller batch.
 
   Returns:
     A `Dataset` transformation function, which can be passed to
@@ -412,6 +425,6 @@ def map_and_batch(map_func, batch_size, num_parallel_batches=1):
 
   def _apply_fn(dataset):
     return _MapAndBatchDataset(dataset, map_func, batch_size,
-                               num_parallel_batches)
+                               num_parallel_batches, drop_remainder)
 
   return _apply_fn
diff --git a/tensorflow/contrib/data/python/ops/counter.py b/tensorflow/contrib/data/python/ops/counter.py
index 63226fe78163c59025623a362d17c400fbe57c67..6ef65f9624601286691505a795a86dd6226eead1 100644
--- a/tensorflow/contrib/data/python/ops/counter.py
+++ b/tensorflow/contrib/data/python/ops/counter.py
@@ -25,7 +25,7 @@ from tensorflow.python.framework import ops
 
 
 def Counter(start=0, step=1, dtype=dtypes.int64):
-  """Creates a `Dataset` of a `step`-separated count startin from `start`.
+  """Creates a `Dataset` that counts from `start` in steps of size `step`.
 
   For example:
 
@@ -38,12 +38,13 @@ def Counter(start=0, step=1, dtype=dtypes.int64):
   ```
 
   Args:
-    start: starting value for count.
-    step: step size.
-    dtype: counter data type.
+    start: (Optional.) The starting value for the counter. Defaults to 0.
+    step: (Optional.) The step size for the counter. Defaults to 1.
+    dtype: (Optional.) The data type for counter elements. Defaults to
+      `tf.int64`.
 
   Returns:
-    A `Dataset` of scalar elements.
+    A `Dataset` of scalar `dtype` elements.
   """
   with ops.name_scope("counter"):
     start = ops.convert_to_tensor(start, dtype=dtype, name="start")
diff --git a/tensorflow/contrib/data/python/ops/get_single_element.py b/tensorflow/contrib/data/python/ops/get_single_element.py
index a817b45b71b608810a9d7536ec123ab84f7cdc3b..3a07df572748e464284f580d67e3a664e71acdfe 100644
--- a/tensorflow/contrib/data/python/ops/get_single_element.py
+++ b/tensorflow/contrib/data/python/ops/get_single_element.py
@@ -19,6 +19,7 @@ from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import sparse
 from tensorflow.python.ops import gen_dataset_ops
 
 
@@ -59,9 +60,14 @@ def get_single_element(dataset):
   """
   if not isinstance(dataset, dataset_ops.Dataset):
     raise TypeError("`dataset` must be a `tf.data.Dataset` object.")
-  return nest.pack_sequence_as(
-      dataset.output_types,
-      gen_dataset_ops.dataset_to_single_element(
+
+  nested_ret = nest.pack_sequence_as(
+      dataset.output_types, gen_dataset_ops.dataset_to_single_element(
           dataset._as_variant_tensor(),  # pylint: disable=protected-access
-          output_types=nest.flatten(dataset.output_types),
-          output_shapes=nest.flatten(dataset.output_shapes)))
+          output_types=nest.flatten(sparse.as_dense_types(
+              dataset.output_types, dataset.output_classes)),
+          output_shapes=nest.flatten(sparse.as_dense_shapes(
+              dataset.output_shapes, dataset.output_classes))))
+  return sparse.deserialize_sparse_tensors(
+      nested_ret, dataset.output_types, dataset.output_shapes,
+      dataset.output_classes)
diff --git a/tensorflow/contrib/data/python/ops/grouping.py b/tensorflow/contrib/data/python/ops/grouping.py
index 67b085002aa7797d858837fea4646fb968ad5d97..ae10d2eb22d574e251d96a4c25bcdedad78d69ca 100644
--- a/tensorflow/contrib/data/python/ops/grouping.py
+++ b/tensorflow/contrib/data/python/ops/grouping.py
@@ -17,13 +17,20 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import sparse
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import gen_dataset_ops
+from tensorflow.python.ops import math_ops
 
 
 def group_by_window(key_func,
@@ -35,7 +42,7 @@ def group_by_window(key_func,
   This transformation maps each consecutive element in a dataset to a key
   using `key_func` and groups the elements by key. It then applies
   `reduce_func` to at most `window_size_func(key)` elements matching the same
-  key. All execpt the final window for each key will contain
+  key. All except the final window for each key will contain
   `window_size_func(key)` elements; the final window may be smaller.
 
   You may provide either a constant `window_size` or a window size determined by
@@ -85,6 +92,114 @@ def group_by_window(key_func,
   return _apply_fn
 
 
+def bucket_by_sequence_length(element_length_func,
+                              bucket_boundaries,
+                              bucket_batch_sizes,
+                              padded_shapes=None,
+                              padding_values=None,
+                              pad_to_bucket_boundary=False):
+  """A transformation that buckets elements in a `Dataset` by length.
+
+  Elements of the `Dataset` are grouped together by length and then are padded
+  and batched.
+
+  This is useful for sequence tasks in which the elements have variable length.
+  Grouping together elements that have similar lengths reduces the total
+  fraction of padding in a batch, which increases training step efficiency.
+
+  Args:
+    element_length_func: function from element in `Dataset` to `tf.int64`,
+      determines the length of the element, which will determine the bucket it
+      goes into.
+    bucket_boundaries: `list<int>`, upper length boundaries of the buckets.
+    bucket_batch_sizes: `list<int>`, batch size per bucket. Length should be
+      `len(bucket_boundaries) + 1`.
+    padded_shapes: Nested structure of `tf.TensorShape` to pass to
+      @{tf.data.Dataset.padded_batch}. If not provided, will use
+      `dataset.output_shapes`, which will result in variable length dimensions
+      being padded out to the maximum length in each batch.
+    padding_values: Values to pad with, passed to
+      @{tf.data.Dataset.padded_batch}. Defaults to padding with 0.
+    pad_to_bucket_boundary: bool, if `False`, will pad dimensions with unknown
+      size to maximum length in batch. If `True`, will pad dimensions with
+      unknown size to bucket boundary, and caller must ensure that the source
+      `Dataset` does not contain any elements with length longer than
+      `max(bucket_boundaries)`.
+
+  Returns:
+    A `Dataset` transformation function, which can be passed to
+    @{tf.data.Dataset.apply}.
+
+  Raises:
+    ValueError: if `len(bucket_batch_sizes) != len(bucket_boundaries) + 1`.
+  """
+  with ops.name_scope("bucket_by_seq_length"):
+    if len(bucket_batch_sizes) != (len(bucket_boundaries) + 1):
+      raise ValueError(
+          "len(bucket_batch_sizes) must equal len(bucket_boundaries) + 1")
+
+    batch_sizes = constant_op.constant(bucket_batch_sizes, dtype=dtypes.int64)
+
+    def element_to_bucket_id(element):
+      """Return int64 id of the length bucket for this element."""
+      seq_length = element_length_func(element)
+
+      boundaries = list(bucket_boundaries)
+      buckets_min = [np.iinfo(np.int32).min] + boundaries
+      buckets_max = boundaries + [np.iinfo(np.int32).max]
+      conditions_c = math_ops.logical_and(
+          math_ops.less_equal(buckets_min, seq_length),
+          math_ops.less(seq_length, buckets_max))
+      bucket_id = math_ops.reduce_min(array_ops.where(conditions_c))
+
+      return bucket_id
+
+    def window_size_fn(bucket_id):
+      # The window size is set to the batch size for this bucket
+      window_size = batch_sizes[bucket_id]
+      return window_size
+
+    def make_padded_shapes(shapes, none_filler=None):
+      padded = []
+      for shape in nest.flatten(shapes):
+        shape = tensor_shape.TensorShape(shape)
+        shape = [
+            none_filler if d.value is None else d
+            for d in shape
+        ]
+        padded.append(shape)
+      return nest.pack_sequence_as(shapes, padded)
+
+    def batching_fn(bucket_id, grouped_dataset):
+      """Batch elements in dataset."""
+      batch_size = batch_sizes[bucket_id]
+      none_filler = None
+      if pad_to_bucket_boundary:
+        err_msg = ("When pad_to_bucket_boundary=True, elements must have "
+                   "length <= max(bucket_boundaries).")
+        check = check_ops.assert_less(
+            bucket_id,
+            constant_op.constant(len(bucket_batch_sizes) - 1,
+                                 dtype=dtypes.int64),
+            message=err_msg)
+        with ops.control_dependencies([check]):
+          boundaries = constant_op.constant(bucket_boundaries,
+                                            dtype=dtypes.int64)
+          bucket_boundary = boundaries[bucket_id]
+          none_filler = bucket_boundary
+      shapes = make_padded_shapes(
+          padded_shapes or grouped_dataset.output_shapes,
+          none_filler=none_filler)
+      return grouped_dataset.padded_batch(batch_size, shapes, padding_values)
+
+    def _apply_fn(dataset):
+      return dataset.apply(
+          group_by_window(element_to_bucket_id, batching_fn,
+                          window_size_func=window_size_fn))
+
+    return _apply_fn
+
+
 class _VariantDataset(dataset_ops.Dataset):
   """A Dataset wrapper for a tf.variant-typed function argument."""
 
diff --git a/tensorflow/contrib/data/python/ops/interleave_ops.py b/tensorflow/contrib/data/python/ops/interleave_ops.py
index 3124ca1d1540e12d949dded88ce1c66181be3595..91f19da02d4a479820782822475d9121125fc38e 100644
--- a/tensorflow/contrib/data/python/ops/interleave_ops.py
+++ b/tensorflow/contrib/data/python/ops/interleave_ops.py
@@ -17,101 +17,10 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.data.util import convert
-from tensorflow.python.data.util import nest
-from tensorflow.python.data.util import sparse
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import function
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import gen_dataset_ops
+from tensorflow.python.data.ops import readers
 from tensorflow.python.util import deprecation
 
 
-class ParallelInterleaveDataset(dataset_ops.Dataset):
-  """A `Dataset` that maps a function over its input and flattens the result."""
-
-  def __init__(self, input_dataset, map_func, cycle_length, block_length,
-               sloppy, buffer_output_elements, prefetch_input_elements):
-    """See `tf.contrib.data.parallel_interleave()` for details."""
-    super(ParallelInterleaveDataset, self).__init__()
-    self._input_dataset = input_dataset
-
-    @function.Defun(*nest.flatten(
-        sparse.as_dense_types(input_dataset.output_types,
-                              input_dataset.output_classes)))
-    def tf_map_func(*args):
-      """A wrapper for Defun that facilitates shape inference."""
-      # Pass in shape information from the input_dataset.
-      dense_shapes = sparse.as_dense_shapes(input_dataset.output_shapes,
-                                            input_dataset.output_classes)
-      for arg, shape in zip(args, nest.flatten(dense_shapes)):
-        arg.set_shape(shape)
-
-      nested_args = nest.pack_sequence_as(input_dataset.output_types, args)
-      nested_args = sparse.deserialize_sparse_tensors(
-          nested_args, input_dataset.output_types, input_dataset.output_shapes,
-          input_dataset.output_classes)
-      if dataset_ops._should_unpack_args(nested_args):  # pylint: disable=protected-access
-        dataset = map_func(*nested_args)
-      else:
-        dataset = map_func(nested_args)
-
-      if not isinstance(dataset, dataset_ops.Dataset):
-        raise TypeError("`map_func` must return a `Dataset` object.")
-
-      self._output_classes = dataset.output_classes
-      self._output_types = dataset.output_types
-      self._output_shapes = dataset.output_shapes
-
-      return dataset._as_variant_tensor()  # pylint: disable=protected-access
-
-    self._map_func = tf_map_func
-    self._map_func.add_to_graph(ops.get_default_graph())
-
-    self._cycle_length = ops.convert_to_tensor(
-        cycle_length, dtype=dtypes.int64, name="cycle_length")
-    self._block_length = ops.convert_to_tensor(
-        block_length, dtype=dtypes.int64, name="block_length")
-    self._sloppy = ops.convert_to_tensor(
-        sloppy, dtype=dtypes.bool, name="sloppy")
-    self._buffer_output_elements = convert.optional_param_to_tensor(
-        "buffer_output_elements",
-        buffer_output_elements,
-        argument_default=2 * block_length)
-    self._prefetch_input_elements = convert.optional_param_to_tensor(
-        "prefetch_input_elements",
-        prefetch_input_elements,
-        argument_default=2 * cycle_length)
-
-  def _as_variant_tensor(self):
-    return gen_dataset_ops.parallel_interleave_dataset(
-        self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
-        self._map_func.captured_inputs,
-        self._cycle_length,
-        self._block_length,
-        self._sloppy,
-        self._buffer_output_elements,
-        self._prefetch_input_elements,
-        f=self._map_func,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
-
-  @property
-  def output_classes(self):
-    return self._output_classes
-
-  @property
-  def output_shapes(self):
-    return self._output_shapes
-
-  @property
-  def output_types(self):
-    return self._output_types
-
-
 def parallel_interleave(map_func,
                         cycle_length,
                         block_length=1,
@@ -162,7 +71,7 @@ def parallel_interleave(map_func,
     @{tf.data.Dataset.apply}.
   """
   def _apply_fn(dataset):
-    return ParallelInterleaveDataset(
+    return readers.ParallelInterleaveDataset(
         dataset, map_func, cycle_length, block_length, sloppy,
         buffer_output_elements, prefetch_input_elements)
 
@@ -221,7 +130,7 @@ def sloppy_interleave(map_func, cycle_length, block_length=1):
     @{tf.data.Dataset.apply}.
   """
   def _apply_fn(dataset):
-    return ParallelInterleaveDataset(
+    return readers.ParallelInterleaveDataset(
         dataset,
         map_func,
         cycle_length,
diff --git a/tensorflow/contrib/data/python/ops/prefetching_ops.py b/tensorflow/contrib/data/python/ops/prefetching_ops.py
index 96a9e9ed6649444dac5e56d7dd2fcdb62fc56459..b16f12c4eed338b49dfa133443340c875bf0211c 100644
--- a/tensorflow/contrib/data/python/ops/prefetching_ops.py
+++ b/tensorflow/contrib/data/python/ops/prefetching_ops.py
@@ -25,12 +25,14 @@ from tensorflow.contrib.data.python.ops import gen_dataset_ops
 # method and provides a get_next() that calls the prefetch op.
 def function_buffering_resource(string_arg,
                                 target_device,
-                                shared_name,
                                 f,
                                 buffer_size,
-                                thread_pool_size=1,
+                                thread_pool_size=0,
                                 container="",
+                                shared_name=None,
                                 name=None):
+  if shared_name is None:
+    shared_name = ""
   return gen_dataset_ops.function_buffering_resource(
       string_arg=string_arg,
       target_device=target_device,
diff --git a/tensorflow/contrib/data/python/ops/random_ops.py b/tensorflow/contrib/data/python/ops/random_ops.py
index 7d727165feabb101549567f28a2dfa07083de244..28ef5e50f39dd7d1b6f124e58e068fc968ddd6dc 100644
--- a/tensorflow/contrib/data/python/ops/random_ops.py
+++ b/tensorflow/contrib/data/python/ops/random_ops.py
@@ -19,11 +19,10 @@ from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import random_seed
 from tensorflow.python.data.util import sparse
-from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
-from tensorflow.python.framework import random_seed
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import gen_dataset_ops
 
@@ -34,16 +33,7 @@ class RandomDataset(dataset_ops.Dataset):
   def __init__(self, seed=None):
     """A `Dataset` of pseudorandom values."""
     super(RandomDataset, self).__init__()
-    seed, seed2 = random_seed.get_seed(seed)
-    if seed is None:
-      self._seed = constant_op.constant(0, dtype=dtypes.int64, name="seed")
-    else:
-      self._seed = ops.convert_to_tensor(seed, dtype=dtypes.int64, name="seed")
-    if seed2 is None:
-      self._seed2 = constant_op.constant(0, dtype=dtypes.int64, name="seed2")
-    else:
-      self._seed2 = ops.convert_to_tensor(
-          seed2, dtype=dtypes.int64, name="seed2")
+    self._seed, self._seed2 = random_seed.get_seed(seed)
 
   def _as_variant_tensor(self):
     return gen_dataset_ops.random_dataset(
diff --git a/tensorflow/contrib/data/python/ops/readers.py b/tensorflow/contrib/data/python/ops/readers.py
index 57f30102778f3bac47580f9bdf94e411dfe1b621..f70f9c881df168564cbf2431bbc2ebdf7e7f7ded 100644
--- a/tensorflow/contrib/data/python/ops/readers.py
+++ b/tensorflow/contrib/data/python/ops/readers.py
@@ -17,20 +17,289 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.contrib.data.python.ops import interleave_ops
+from tensorflow.contrib.data.python.ops import shuffle_ops
 from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.ops import readers as core_readers
 from tensorflow.python.data.util import nest
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import gen_dataset_ops
 from tensorflow.python.ops import parsing_ops
 from tensorflow.python.platform import gfile
+from tensorflow.python.util import deprecation
 
+_ACCEPTABLE_CSV_TYPES = (dtypes.float32, dtypes.float64, dtypes.int32,
+                         dtypes.int64, dtypes.string)
 
+
+def make_csv_dataset(
+    file_pattern,
+    batch_size,
+    column_keys,
+    column_defaults,
+    label_key=None,
+    field_delim=",",
+    use_quote_delim=True,
+    skip=0,
+    filter_fn=None,
+    num_epochs=None,
+    shuffle=True,
+    shuffle_buffer_size=10000,
+    shuffle_seed=None,
+    prefetch_buffer_size=1,
+):
+  """Reads CSV files into a dataset.
+
+  Reads CSV files into a dataset, where each element is a (features, labels)
+  tuple that corresponds to a batch of CSV rows. The features dictionary
+  maps feature column names to `Tensor`s containing the corresponding
+  feature data, and labels is a `Tensor` containing the batch's label data.
+
+  Args:
+    file_pattern: List of files or patterns of file paths containing CSV
+      records. See @{tf.gfile.Glob} for pattern rules.
+    batch_size: An int representing the number of consecutive elements of this
+      dataset to combine in a single batch.
+    column_keys: A list of strings that corresponds to the CSV columns, in
+      order. One per column of the input record.
+    column_defaults: A list of default values for the CSV fields. One item per
+      column of the input record. Each item in the list is either one of the
+      following dtypes: float32, float64, int32, int64, or string, or a
+      `Tensor` with one of the aforementioned types. Pass a scalar default
+      value for a column if it is optional, or, if the column is required,
+      an empty `Tensor` or a dtype.
+    label_key: An optional string naming the label column. If provided,
+      the data for this column is returned as a separate `Tensor` from the
+      features dictionary, so that the dataset complies with the format expected
+      by a `tf.Estimator.train` or `tf.Estimator.evaluate` input function.
+    field_delim: An optional `string`. Defaults to `","`. Char delimiter to
+      separate fields in a record.
+    use_quote_delim: An optional bool. Defaults to `True`. If false, treats
+      double quotation marks as regular characters inside of the string fields.
+    skip: An integer that corresponds to the number of lines to skip at the
+      head of each CSV file. Defaults to 0.
+    filter_fn: A callable function that takes in a CSV string and returns a
+      boolean that corresponds to whether the record should be included. If
+      None, does not filter records.
+    num_epochs: An int specifying the number of times this dataset is repeated.
+      If None, cycles through the dataset forever.
+    shuffle: A bool that indicates whether the input should be shuffled.
+    shuffle_buffer_size: Buffer size to use for shuffling. A large buffer size
+      ensures better shuffling, but would increase memory usage and startup
+      time.
+    shuffle_seed: Randomization seed to use for shuffling.
+    prefetch_buffer_size: An int specifying the number of feature batches to
+      prefetch for performance improvement. Recommended value is the number of
+      batches consumed per training step.
+
+  Returns:
+    A dataset, where each element is a (features, labels) tuple that corresponds
+    to a batch of `batch_size` CSV rows. The features dictionary maps feature
+    column names to `Tensor`s containing the corresponding column data, and
+    labels is a `Tensor` containing the column data for the label column
+    specified by `label_key`.
+  """
+  filenames = _get_file_names(file_pattern, False)
+  column_defaults = [
+      constant_op.constant([], dtype=x) if x in _ACCEPTABLE_CSV_TYPES else x
+      for x in column_defaults
+  ]
+
+  dataset = dataset_ops.Dataset.from_tensor_slices(filenames)
+  if label_key is not None:
+    assert label_key in column_keys
+
+  def filename_to_dataset(filename):
+    ds = core_readers.TextLineDataset(filename)
+    if skip > 0:
+      ds = ds.skip(skip)
+    if filter_fn is not None:
+      ds = ds.filter(filter_fn)
+    return ds
+
+  def decode_csv(line):
+    """Decodes csv line into features.
+
+    Args:
+      line: String tensor corresponding to one csv record.
+    Returns:
+      A dictionary of feature names to values for that particular record. If
+      label_key is provided, extracts the label feature to be returned as the
+      second element of the tuple.
+    """
+    columns = parsing_ops.decode_csv(
+        line,
+        column_defaults,
+        field_delim=field_delim,
+        use_quote_delim=use_quote_delim)
+    features = dict(zip(column_keys, columns))
+    if label_key is not None:
+      label = features.pop(label_key)
+      return features, label
+    return features
+
+  # TODO(rachelim): interleave records from files for better shuffling
+  dataset = dataset.flat_map(filename_to_dataset)
+  # TODO(rachelim): use fused shuffle_and_repeat for perf
+  if shuffle:
+    dataset = dataset.shuffle(shuffle_buffer_size, shuffle_seed)
+  if num_epochs != 1:
+    dataset = dataset.repeat(num_epochs)
+
+  dataset = dataset.batch(batch_size)
+  dataset = dataset.map(decode_csv)
+  dataset = dataset.prefetch(prefetch_buffer_size)
+  return dataset
+
+
+def make_batched_features_dataset(file_pattern,
+                                  batch_size,
+                                  features,
+                                  reader=core_readers.TFRecordDataset,
+                                  reader_args=None,
+                                  num_epochs=None,
+                                  shuffle=True,
+                                  shuffle_buffer_size=10000,
+                                  shuffle_seed=None,
+                                  prefetch_buffer_size=1,
+                                  reader_num_threads=1,
+                                  parser_num_threads=2,
+                                  sloppy_ordering=False):
+  """Returns a `Dataset` of feature dictionaries from `Example` protos.
+
+  Example:
+
+  ```
+  serialized_examples = [
+    features {
+      feature { key: "age" value { int64_list { value: [ 0 ] } } }
+      feature { key: "gender" value { bytes_list { value: [ "f" ] } } }
+      feature { key: "kws" value { bytes_list { value: [ "code", "art" ] } } }
+    },
+    features {
+      feature { key: "age" value { int64_list { value: [] } } }
+      feature { key: "gender" value { bytes_list { value: [ "f" ] } } }
+      feature { key: "kws" value { bytes_list { value: [ "sports" ] } } }
+    }
+  ]
+  ```
+
+  We can use arguments:
+
+  ```
+  features: {
+    "age": FixedLenFeature([], dtype=tf.int64, default_value=-1),
+    "gender": FixedLenFeature([], dtype=tf.string),
+    "kws": VarLenFeature(dtype=tf.string),
+  }
+  ```
+
+  And the expected output is:
+
+  ```python
+  {
+    "age": [[0], [-1]],
+    "gender": [["f"], ["f"]],
+    "kws": SparseTensor(
+      indices=[[0, 0], [0, 1], [1, 0]],
+      values=["code", "art", "sports"],
+      dense_shape=[2, 2]),
+  }
+  ```
+
+  Args:
+    file_pattern: List of files or patterns of file paths containing
+      `Example` records. See `tf.gfile.Glob` for pattern rules.
+    batch_size: An int representing the number of consecutive elements of this
+      dataset to combine in a single batch.
+    features: A `dict` mapping feature keys to `FixedLenFeature` or
+      `VarLenFeature` values. See `tf.parse_example`.
+    reader: A function or class that can be
+      called with a `filenames` tensor and (optional) `reader_args` and returns
+      a `Dataset` of `Example` tensors. Defaults to `tf.data.TFRecordDataset`.
+    reader_args: Additional arguments to pass to the reader class.
+    num_epochs: Integer specifying the number of times to read through the
+      dataset. If None, cycles through the dataset forever. Defaults to `None`.
+    shuffle: A boolean, indicates whether the input should be shuffled. Defaults
+      to `True`.
+    shuffle_buffer_size: Buffer size of the ShuffleDataset. A large capacity
+      ensures better shuffling but would increase memory usage and startup time.
+    shuffle_seed: Randomization seed to use for shuffling.
+    prefetch_buffer_size: Number of feature batches to prefetch in order to
+      improve performance. Recommended value is the number of batches consumed
+      per training step (default is 1).
+    reader_num_threads: Number of threads used to read `Example` records. If >1,
+      the results will be interleaved.
+    parser_num_threads: Number of threads to use for parsing `Example` tensors
+      into a dictionary of `Feature` tensors.
+    sloppy_ordering: If `True`, reading performance will be improved at
+      the cost of non-deterministic ordering. If `False`, the order of elements
+      produced is deterministic prior to shuffling (elements are still
+      randomized if `shuffle=True`. Note that if the seed is set, then order
+      of elements after shuffling is deterministic). Defaults to `False`.
+
+  Returns:
+    A dataset of `dict` elements. Each `dict` maps feature keys to
+    `Tensor` or `SparseTensor` objects.
+  """
+  # Create dataset of all matching filenames
+  if shuffle:
+    dataset = dataset_ops.Dataset.list_files(file_pattern, shuffle=True)
+  else:
+    # TODO(b/73959787): Use Dataset.list_files() once ordering is deterministic.
+    filenames = _get_file_names(file_pattern, shuffle)
+    dataset = dataset_ops.Dataset.from_tensor_slices(filenames)
+
+  # Read `Example` records from files as tensor objects.
+  if reader_args is None:
+    reader_args = []
+
+  # Read files sequentially (if reader_num_threads=1) or in parallel
+  dataset = dataset.apply(
+      interleave_ops.parallel_interleave(
+          lambda filename: reader(filename, *reader_args),
+          cycle_length=reader_num_threads,
+          sloppy=sloppy_ordering))
+
+  # Extract values if the `Example` tensors are stored as key-value tuples.
+  if dataset.output_types == (dtypes.string, dtypes.string):
+    dataset = dataset.map(lambda _, v: v)
+
+  # Apply dataset repeat and shuffle transformations.
+  repeat_dataset = (num_epochs != 1)
+  if repeat_dataset and shuffle:
+    # Use fused shuffle_and_repeat operation for better performance
+    dataset = dataset.apply(
+        shuffle_ops.shuffle_and_repeat(shuffle_buffer_size, num_epochs,
+                                       shuffle_seed))
+  elif repeat_dataset:
+    dataset = dataset.repeat(num_epochs)
+  elif shuffle:
+    dataset = dataset.shuffle(shuffle_buffer_size, shuffle_seed)
+
+  dataset = dataset.batch(batch_size)
+
+  # Parse `Example` tensors to a dictionary of `Feature` tensors.
+  dataset = dataset.map(
+      lambda x: parsing_ops.parse_example(x, features),
+      num_parallel_calls=parser_num_threads)
+
+  # TODO(rachelim): Add an optional label_key argument for extracting the label
+  # from the features dictionary, to comply with the type expected by the
+  # input_fn of `tf.estimator.Estimator.train` or `Estimator.evaluate`.
+  dataset = dataset.prefetch(prefetch_buffer_size)
+  return dataset
+
+
+@deprecation.deprecated(None,
+                        "Use `tf.contrib.data.make_batched_features_dataset`")
 def read_batch_features(file_pattern,
                         batch_size,
                         features,
-                        reader,
+                        reader=core_readers.TFRecordDataset,
                         reader_args=None,
                         randomize_input=True,
                         num_epochs=None,
@@ -84,43 +353,38 @@ def read_batch_features(file_pattern,
       dataset to combine in a single batch.
     features: A `dict` mapping feature keys to `FixedLenFeature` or
       `VarLenFeature` values. See `tf.parse_example`.
-    reader: A function or class that can be called with a `filenames` tensor
-      and (optional) `reader_args` and returns a `Dataset` of Examples.
+    reader: A function or class that can be
+      called with a `filenames` tensor and (optional) `reader_args` and returns
+      a `Dataset` of `Example` tensors. Defaults to `tf.data.TFRecordDataset`.
     reader_args: Additional arguments to pass to the reader class.
     randomize_input: Whether the input should be randomized.
     num_epochs: Integer specifying the number of times to read through the
       dataset. If None, cycles through the dataset forever.
-    capacity: Capacity of the ShuffleDataset. A large capacity ensures better
+    capacity: Buffer size of the ShuffleDataset. A large capacity ensures better
       shuffling but would increase memory usage and startup time.
-
   Returns:
     A dict from keys in features to `Tensor` or `SparseTensor` objects.
   """
-  filenames = _get_file_names(file_pattern, randomize_input)
-  if reader_args:
-    dataset = reader(filenames, *reader_args)
-  else:
-    dataset = reader(filenames)
-  if dataset.output_types == (dtypes.string, dtypes.string):
-    dataset = dataset.map(lambda _, v: v)
-  if num_epochs != 1:
-    dataset = dataset.repeat(num_epochs)
-  if randomize_input:
-    dataset = dataset.shuffle(capacity)
-  dataset = dataset.batch(batch_size)
-  dataset = dataset.map(lambda x: parsing_ops.parse_example(x, features))
-  dataset = dataset.prefetch(1)
+  dataset = make_batched_features_dataset(
+      file_pattern,
+      batch_size,
+      features,
+      reader=reader,
+      reader_args=reader_args,
+      shuffle=randomize_input,
+      num_epochs=num_epochs,
+      shuffle_buffer_size=capacity)
   iterator = dataset.make_one_shot_iterator()
   outputs = iterator.get_next()
   return outputs
 
 
-def _get_file_names(file_pattern, randomize_input):
+def _get_file_names(file_pattern, shuffle):
   """Parse list of file names from pattern, optionally shuffled.
 
   Args:
     file_pattern: File glob pattern, or list of glob patterns.
-    randomize_input: Whether to shuffle the order of file names.
+    shuffle: Whether to shuffle the order of file names.
 
   Returns:
     List of file names matching `file_pattern`.
@@ -141,7 +405,7 @@ def _get_file_names(file_pattern, randomize_input):
     raise ValueError("No files match %s." % file_pattern)
 
   # Sort files so it will be deterministic for unit tests.
-  if not randomize_input:
+  if not shuffle:
     file_names = sorted(file_names)
   return file_names
 
diff --git a/tensorflow/contrib/data/python/ops/resampling.py b/tensorflow/contrib/data/python/ops/resampling.py
index 56f526a330bfbea7305b0754bfd114c5e97db506..b465397437adbdfaf865efb8ed2f80e57f48fcab 100644
--- a/tensorflow/contrib/data/python/ops/resampling.py
+++ b/tensorflow/contrib/data/python/ops/resampling.py
@@ -54,7 +54,7 @@ def rejection_resample(class_func, target_dist, initial_dist=None, seed=None):
   def _apply_fn(dataset):
     """Function from `Dataset` to `Dataset` that applies the transformation."""
     dist_estimation_batch_size = 32
-    target_dist_t = ops.convert_to_tensor(target_dist, name="initial_dist")
+    target_dist_t = ops.convert_to_tensor(target_dist, name="target_dist")
     class_values_ds = dataset.map(class_func)
     if initial_dist is not None:
       initial_dist_t = ops.convert_to_tensor(initial_dist, name="initial_dist")
@@ -101,14 +101,16 @@ def rejection_resample(class_func, target_dist, initial_dist=None, seed=None):
                                                    initial_dist_ds))
                           .map(maybe_warn_on_large_rejection))
 
-    current_probabilities_ds = dataset_ops.Dataset.zip(
-        (acceptance_dist_ds, class_values_ds)).map(array_ops.gather)
+    def _gather_and_copy(class_val, acceptance_prob, data):
+      return (class_val, array_ops.gather(acceptance_prob, class_val), data)
+    current_probabilities_and_class_and_data_ds = dataset_ops.Dataset.zip(
+        (class_values_ds, acceptance_dist_ds, dataset)).map(_gather_and_copy)
     filtered_ds = (
-        dataset_ops.Dataset.zip((class_values_ds, current_probabilities_ds,
-                                 dataset))
+        current_probabilities_and_class_and_data_ds
         .filter(lambda _1, p, _2: random_ops.random_uniform([], seed=seed) < p))
     return filtered_ds.map(lambda class_value, _, data: (class_value, data))
 
+
   return _apply_fn
 
 
@@ -151,7 +153,7 @@ def _calculate_acceptance_probs(initial_probs, target_probs):
   ```
 
 
-  A solution for a_i in terms of the other variabes is the following:
+  A solution for a_i in terms of the other variables is the following:
     ```a_i = (t_i / p_i) / max_i[t_i / p_i]```
   """
   # Add tiny to initial_probs to avoid divide by zero.
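Review note: the closed-form solution quoted in the docstring above, `a_i = (t_i / p_i) / max_i[t_i / p_i]`, can be checked with a few lines of plain Python. This is a minimal sketch, not the TensorFlow implementation: the function name `acceptance_probs` and the example distributions are hypothetical, and it assumes every initial probability is strictly positive (the real code adds a tiny constant to avoid division by zero).

```python
def acceptance_probs(initial_probs, target_probs):
    """Return a_i = (t_i / p_i) / max_j (t_j / p_j) for each class i."""
    # Assumption: every entry of initial_probs is strictly positive.
    ratios = [t / p for p, t in zip(initial_probs, target_probs)]
    max_ratio = max(ratios)
    return [r / max_ratio for r in ratios]


# Hypothetical example distributions over three classes.
initial = [0.5, 0.25, 0.25]
target = [0.25, 0.25, 0.5]
probs = acceptance_probs(initial, target)

# Probability mass of each class that survives rejection, renormalized.
accepted = [p * a for p, a in zip(initial, probs)]
total = sum(accepted)
resampled = [m / total for m in accepted]
```

With these inputs, the renormalized accepted mass equals the target distribution, which is the property the acceptance probabilities are meant to guarantee.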
diff --git a/tensorflow/contrib/data/python/ops/shuffle_ops.py b/tensorflow/contrib/data/python/ops/shuffle_ops.py
index 99bb79bc06a421f811869ca9169aaa11deaca2f3..f35795abd38000b13cec0f08596e2ff66e86286c 100644
--- a/tensorflow/contrib/data/python/ops/shuffle_ops.py
+++ b/tensorflow/contrib/data/python/ops/shuffle_ops.py
@@ -19,11 +19,11 @@ from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import random_seed
 from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
-from tensorflow.python.framework import random_seed
 from tensorflow.python.ops import gen_dataset_ops
 
 
@@ -45,17 +45,7 @@ class _ShuffleAndRepeatDataset(dataset_ops.Dataset):
     else:
       self._count = ops.convert_to_tensor(
           count, dtype=dtypes.int64, name="count")
-
-    seed, seed2 = random_seed.get_seed(seed)
-    if seed is None:
-      self._seed = constant_op.constant(0, dtype=dtypes.int64, name="seed")
-    else:
-      self._seed = ops.convert_to_tensor(seed, dtype=dtypes.int64, name="seed")
-    if seed2 is None:
-      self._seed2 = constant_op.constant(0, dtype=dtypes.int64, name="seed2")
-    else:
-      self._seed2 = ops.convert_to_tensor(
-          seed2, dtype=dtypes.int64, name="seed2")
+    self._seed, self._seed2 = random_seed.get_seed(seed)
 
   def _as_variant_tensor(self):
     # pylint: disable=protected-access
diff --git a/tensorflow/contrib/data/python/ops/sliding.py b/tensorflow/contrib/data/python/ops/sliding.py
new file mode 100644
index 0000000000000000000000000000000000000000..19cc3cb89fc5c494f79ce1d25ed57c92099c8bd2
--- /dev/null
+++ b/tensorflow/contrib/data/python/ops/sliding.py
@@ -0,0 +1,102 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Sliding dataset transformations."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import sparse
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import gen_dataset_ops
+
+
+class _SlideDataset(dataset_ops.Dataset):
+  """A `Dataset` that passes a sliding window over its input."""
+
+  def __init__(self, input_dataset, window_size, stride=1):
+    """See `sliding_window_batch` for details."""
+    super(_SlideDataset, self).__init__()
+    self._input_dataset = input_dataset
+    self._window_size = ops.convert_to_tensor(
+        window_size, dtype=dtypes.int64, name="window_size")
+    self._stride = ops.convert_to_tensor(
+        stride, dtype=dtypes.int64, name="stride")
+
+  def _as_variant_tensor(self):
+    return gen_dataset_ops.slide_dataset(
+        self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
+        window_size=self._window_size,
+        stride=self._stride,
+        output_shapes=nest.flatten(
+            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
+        output_types=nest.flatten(
+            sparse.as_dense_types(self.output_types, self.output_classes)))
+
+  @property
+  def output_classes(self):
+    return self._input_dataset.output_classes
+
+  @property
+  def output_shapes(self):
+    input_shapes = self._input_dataset.output_shapes
+    return nest.pack_sequence_as(input_shapes, [
+        tensor_shape.vector(None).concatenate(s)
+        for s in nest.flatten(self._input_dataset.output_shapes)
+    ])
+
+  @property
+  def output_types(self):
+    return self._input_dataset.output_types
+
+
+def sliding_window_batch(window_size, stride=1):
+  """A sliding window with size of `window_size` and step of `stride`.
+
+  This transformation passes a sliding window over this dataset. The
+  window size is `window_size` and step size is `stride`. If the remaining
+  elements cannot fill up the sliding window, this transformation will
+  drop the final, smaller window. For example:
+
+  ```python
+  # NOTE: The following examples use `{ ... }` to represent the
+  # contents of a dataset.
+  a = { [1], [2], [3], [4], [5], [6] }
+
+  a.apply(tf.contrib.data.sliding_window_batch(window_size=3, stride=2)) ==
+  {
+      [[1], [2], [3]],
+      [[3], [4], [5]],
+  }
+  ```
+
+  Args:
+    window_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
+      elements in the sliding window.
+    stride: (Optional.) A `tf.int64` scalar `tf.Tensor`, representing the
+      number of elements the sliding window advances per iteration. The
+      default is `1`. It must be in `[1, window_size)`.
+
+  Returns:
+    A `Dataset` transformation function, which can be passed to
+    @{tf.data.Dataset.apply}.
+  """
+  def _apply_fn(dataset):
+    return _SlideDataset(dataset, window_size, stride)
+
+  return _apply_fn
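Review note: the window semantics documented for `sliding_window_batch` can be illustrated without TensorFlow. The sketch below is an assumption-laden stand-in that operates on a Python list rather than a `Dataset`; the helper name `sliding_windows` is hypothetical and is not part of this change.

```python
def sliding_windows(elements, window_size, stride=1):
    """Return full windows of `window_size` elements, advancing by `stride`."""
    windows = []
    start = 0
    # Stop as soon as the remaining elements cannot fill a whole window,
    # which drops the final partial window.
    while start + window_size <= len(elements):
        windows.append(elements[start:start + window_size])
        start += stride
    return windows


a = [[1], [2], [3], [4], [5], [6]]
result = sliding_windows(a, window_size=3, stride=2)
```

For the input used in the docstring, `result` holds the two windows `[[1], [2], [3]]` and `[[3], [4], [5]]`; the trailing `[5], [6]` cannot fill a third window and is dropped.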
diff --git a/tensorflow/contrib/data/python/ops/stats_ops.py b/tensorflow/contrib/data/python/ops/stats_ops.py
index 9cd1701c397b5a0bf5cc47c1bcab033704794d80..b5cf0fcfe91ebc22444302fca5d488a278ef2994 100644
--- a/tensorflow/contrib/data/python/ops/stats_ops.py
+++ b/tensorflow/contrib/data/python/ops/stats_ops.py
@@ -47,7 +47,7 @@ class StatsAggregator(object):
   dataset = ...
   iterator = dataset.make_one_shot_iterator()
   stats_aggregator = stats_ops.StatsAggregator()
-  set_op = stats_op.set_stats_aggregator_op(iterator, stats_aggregator)
+  set_op = stats_aggregator.subscribe(iterator)
 
   with tf.Session() as sess:
     # Running `set_op` will associate `iterator` with `stats_aggregator`.
diff --git a/tensorflow/contrib/data/python/ops/threadpool.py b/tensorflow/contrib/data/python/ops/threadpool.py
index 3f85aa84cd53fcf5e21480aac96e067766ad1b65..56f67e1766bbaff680bdff6b939df0c3ba68c679 100644
--- a/tensorflow/contrib/data/python/ops/threadpool.py
+++ b/tensorflow/contrib/data/python/ops/threadpool.py
@@ -44,7 +44,7 @@ class PrivateThreadPool(object):
 
   def __init__(self, num_threads, display_name=None):
     """Creates a `PrivateThreadPool` with the given number of threads."""
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       shared_name = _generate_shared_name("privatethreadpool")
       self._resource = gen_dataset_ops.thread_pool_handle(
           num_threads=num_threads,
diff --git a/tensorflow/contrib/decision_trees/proto/BUILD b/tensorflow/contrib/decision_trees/proto/BUILD
index f6de5998d73a4869d2444cd90c9b64d1a2c889ac..ae3847b8b62452b1afbe472fcb6369181ec60b73 100644
--- a/tensorflow/contrib/decision_trees/proto/BUILD
+++ b/tensorflow/contrib/decision_trees/proto/BUILD
@@ -25,7 +25,6 @@ tf_proto_library(
     name = "generic_tree_model",
     srcs = ["generic_tree_model.proto"],
     cc_api_version = 2,
-    go_api_version = 2,
     java_api_version = 2,
     visibility = ["//visibility:public"],
 )
@@ -34,7 +33,6 @@ tf_proto_library(
     name = "generic_tree_model_extensions",
     srcs = ["generic_tree_model_extensions.proto"],
     cc_api_version = 2,
-    go_api_version = 2,
     protodeps = [":generic_tree_model"],
     visibility = ["//visibility:public"],
 )
diff --git a/tensorflow/contrib/distributions/BUILD b/tensorflow/contrib/distributions/BUILD
index ed79ef70f829f9b72fa67026a5f7a0928130e95b..e9c827a61823649df3d648d81a0d3c529769830c 100644
--- a/tensorflow/contrib/distributions/BUILD
+++ b/tensorflow/contrib/distributions/BUILD
@@ -350,6 +350,7 @@ cuda_py_test(
         "//tensorflow/python:nn_ops",
         "//tensorflow/python:platform_test",
     ],
+    tags = ["nomsan"],
 )
 
 cuda_py_test(
@@ -474,6 +475,24 @@ cuda_py_test(
     tags = ["nomsan"],  # disable to avoid false positives from scipy.
 )
 
+cuda_py_test(
+    name = "statistical_testing_test",
+    size = "medium",
+    srcs = [
+        "python/kernel_tests/statistical_testing_test.py",
+    ],
+    additional_deps = [
+        ":distributions_py",
+        "//third_party/py/numpy",
+        "//tensorflow/python:client_testlib",
+    ],
+    tags = [
+        "manual",
+        "noasan",
+        "noguitar",
+    ],
+)
+
 cuda_py_test(
     name = "vector_sinh_arcsinh_diag_test",
     size = "medium",
@@ -797,6 +816,25 @@ cuda_py_test(
     tags = ["noasan"],  # times out b/63678675
 )
 
+cuda_py_test(
+    name = "affine_scalar_test",
+    size = "small",
+    srcs = ["python/kernel_tests/bijectors/affine_scalar_test.py"],
+    additional_deps = [
+        ":bijectors_py",
+        ":distributions_py",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
+        "//tensorflow/contrib/linalg:linalg_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+    ],
+)
+
 cuda_py_test(
     name = "affine_linear_operator_test",
     size = "small",
@@ -816,6 +854,22 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "batch_normalization_test",
+    size = "small",
+    srcs = ["python/kernel_tests/bijectors/batch_normalization_test.py"],
+    additional_deps = [
+        ":bijectors_py",
+        ":distributions_py",
+        "//third_party/py/numpy",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:platform_test",
+    ],
+)
+
 cuda_py_test(
     name = "chain_test",
     size = "small",
@@ -1051,10 +1105,12 @@ cuda_py_test(
     ],
 )
 
+# Tests for SinhArcSinh bijector.  The file name has the extra "_bijector" to
+# avoid BUILD rule name conflicts with the distribution by the same name.
 cuda_py_test(
-    name = "sigmoid_centered_test",
+    name = "sinh_arcsinh_bijector_test",
     size = "small",
-    srcs = ["python/kernel_tests/bijectors/sigmoid_centered_test.py"],
+    srcs = ["python/kernel_tests/bijectors/sinh_arcsinh_bijector_test.py"],
     additional_deps = [
         ":bijectors_py",
         ":distributions_py",
@@ -1070,12 +1126,10 @@ cuda_py_test(
     ],
 )
 
-# Tests for SinhArcSinh bijector.  The file name has the extra "_bijector" to
-# avoid BUILD rule name conflicts with the distribution by the same name.
 cuda_py_test(
-    name = "sinh_arcsinh_bijector_test",
+    name = "softmax_centered_test",
     size = "small",
-    srcs = ["python/kernel_tests/bijectors/sinh_arcsinh_bijector_test.py"],
+    srcs = ["python/kernel_tests/bijectors/softmax_centered_test.py"],
     additional_deps = [
         ":bijectors_py",
         ":distributions_py",
@@ -1092,9 +1146,9 @@ cuda_py_test(
 )
 
 cuda_py_test(
-    name = "softmax_centered_test",
+    name = "softplus_test",
     size = "small",
-    srcs = ["python/kernel_tests/bijectors/softmax_centered_test.py"],
+    srcs = ["python/kernel_tests/bijectors/softplus_test.py"],
     additional_deps = [
         ":bijectors_py",
         ":distributions_py",
@@ -1111,9 +1165,9 @@ cuda_py_test(
 )
 
 cuda_py_test(
-    name = "softplus_test",
+    name = "square_test",
     size = "small",
-    srcs = ["python/kernel_tests/bijectors/softplus_test.py"],
+    srcs = ["python/kernel_tests/bijectors/square_test.py"],
     additional_deps = [
         ":bijectors_py",
         ":distributions_py",
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/affine_scalar_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/affine_scalar_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..16173a166fd943413345036df12245c2a4ab8343
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/affine_scalar_test.py
@@ -0,0 +1,153 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Affine Scalar Tests."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.distributions.python.ops.bijectors.affine_scalar import AffineScalar
+from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops.distributions.bijector_test_util import assert_scalar_congruency
+from tensorflow.python.platform import test
+
+
+class AffineScalarBijectorTest(test.TestCase):
+  """Tests correctness of the Y = scale * x + shift transformation."""
+
+  def testProperties(self):
+    with self.test_session():
+      mu = -1.
+      # scale corresponds to 1.
+      bijector = AffineScalar(shift=mu)
+      self.assertEqual("affine_scalar", bijector.name)
+
+  def testNoBatchScalar(self):
+    with self.test_session() as sess:
+
+      def static_run(fun, x):
+        return fun(x).eval()
+
+      def dynamic_run(fun, x_value):
+        x_value = np.array(x_value)
+        x = array_ops.placeholder(dtypes.float32, name="x")
+        return sess.run(fun(x), feed_dict={x: x_value})
+
+      for run in (static_run, dynamic_run):
+        mu = -1.
+        # Corresponds to scale = 2
+        bijector = AffineScalar(shift=mu, scale=2.)
+        x = [1., 2, 3]  # Three scalar samples (no batches).
+        self.assertAllClose([1., 3, 5], run(bijector.forward, x))
+        self.assertAllClose([1., 1.5, 2.], run(bijector.inverse, x))
+        self.assertAllClose([-np.log(2.)] * 3,
+                            run(bijector.inverse_log_det_jacobian, x))
+
+  def testOneBatchScalarViaIdentityIn64BitUserProvidesShiftOnly(self):
+    with self.test_session() as sess:
+
+      def static_run(fun, x):
+        return fun(x).eval()
+
+      def dynamic_run(fun, x_value):
+        x_value = np.array(x_value).astype(np.float64)
+        x = array_ops.placeholder(dtypes.float64, name="x")
+        return sess.run(fun(x), feed_dict={x: x_value})
+
+      for run in (static_run, dynamic_run):
+        mu = np.float64([1.])
+        # One batch, scalar.
+        # Corresponds to scale = 1.
+        bijector = AffineScalar(shift=mu)
+        x = np.float64([1.])  # One sample from one batch.
+        self.assertAllClose([2.], run(bijector.forward, x))
+        self.assertAllClose([0.], run(bijector.inverse, x))
+        self.assertAllClose([0.], run(bijector.inverse_log_det_jacobian, x))
+
+  def testOneBatchScalarViaIdentityIn64BitUserProvidesScaleOnly(self):
+    with self.test_session() as sess:
+
+      def static_run(fun, x):
+        return fun(x).eval()
+
+      def dynamic_run(fun, x_value):
+        x_value = np.array(x_value).astype(np.float64)
+        x = array_ops.placeholder(dtypes.float64, name="x")
+        return sess.run(fun(x), feed_dict={x: x_value})
+
+      for run in (static_run, dynamic_run):
+        multiplier = np.float64([2.])
+        # One batch, scalar.
+        # Corresponds to scale = 2, shift = 0.
+        bijector = AffineScalar(scale=multiplier)
+        x = np.float64([1.])  # One sample from one batch.
+        self.assertAllClose([2.], run(bijector.forward, x))
+        self.assertAllClose([0.5], run(bijector.inverse, x))
+        self.assertAllClose([np.log(0.5)],
+                            run(bijector.inverse_log_det_jacobian, x))
+
+  def testTwoBatchScalarIdentityViaIdentity(self):
+    with self.test_session() as sess:
+
+      def static_run(fun, x):
+        return fun(x).eval()
+
+      def dynamic_run(fun, x_value):
+        x_value = np.array(x_value)
+        x = array_ops.placeholder(dtypes.float32, name="x")
+        return sess.run(fun(x), feed_dict={x: x_value})
+
+      for run in (static_run, dynamic_run):
+        mu = [1., -1]
+        # Univariate, two batches.
+        # Corresponds to scale = 1.
+        bijector = AffineScalar(shift=mu)
+        x = [1., 1]  # One sample from each of two batches.
+        self.assertAllClose([2., 0], run(bijector.forward, x))
+        self.assertAllClose([0., 2], run(bijector.inverse, x))
+        self.assertAllClose([0., 0.], run(bijector.inverse_log_det_jacobian, x))
+
+  def testTwoBatchScalarIdentityViaScale(self):
+    with self.test_session() as sess:
+
+      def static_run(fun, x):
+        return fun(x).eval()
+
+      def dynamic_run(fun, x_value):
+        x_value = np.array(x_value)
+        x = array_ops.placeholder(dtypes.float32, name="x")
+        return sess.run(fun(x), feed_dict={x: x_value})
+
+      for run in (static_run, dynamic_run):
+        mu = [1., -1]
+        # Univariate, two batches.
+        # Corresponds to scale = [2., 1.].
+        bijector = AffineScalar(shift=mu, scale=[2., 1])
+        x = [1., 1]  # One sample from each of two batches.
+        self.assertAllClose([3., 0], run(bijector.forward, x))
+        self.assertAllClose([0., 2], run(bijector.inverse, x))
+        self.assertAllClose(
+            [-np.log(2), 0.], run(bijector.inverse_log_det_jacobian, x))
+
+  def testScalarCongruency(self):
+    with self.test_session():
+      bijector = AffineScalar(shift=3.6, scale=0.42)
+      assert_scalar_congruency(bijector, lower_x=-2., upper_x=2.)
+
+if __name__ == "__main__":
+  test.main()
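Review note: the expected values in `testNoBatchScalar` follow directly from the scalar affine map `y = scale * x + shift`. The sketch below reproduces them in plain Python; the free functions are hypothetical stand-ins for the bijector methods, not the `AffineScalar` implementation, and they assume a nonzero scalar `scale`.

```python
import math


def forward(x, shift=0.0, scale=1.0):
    """y = scale * x + shift."""
    return scale * x + shift


def inverse(y, shift=0.0, scale=1.0):
    """x = (y - shift) / scale; assumes scale != 0."""
    return (y - shift) / scale


def inverse_log_det_jacobian(y, scale=1.0):
    """log|dx/dy| = -log|scale|, independent of y for an affine map."""
    return -math.log(abs(scale))


xs = [1.0, 2.0, 3.0]
forward_values = [forward(x, shift=-1.0, scale=2.0) for x in xs]
inverse_values = [inverse(x, shift=-1.0, scale=2.0) for x in xs]
ildj_values = [inverse_log_det_jacobian(x, scale=2.0) for x in xs]
```

The per-sample log-determinant list mirrors the `[-np.log(2.)] * 3` expectation in the new test, where the scalar bijector reports one value per input element.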
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/affine_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/affine_test.py
index c9158117f7a982e37047e8dd2b534a30040a87d9..077e6176b4e7aecb28369d49edad6d1367cc7259 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/affine_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/affine_test.py
@@ -25,7 +25,6 @@ import numpy as np
 from tensorflow.contrib.distributions.python.ops.bijectors.affine import Affine
 from tensorflow.python.framework import dtypes
 from tensorflow.python.ops import array_ops
-from tensorflow.python.ops.distributions.bijector_test_util import assert_scalar_congruency
 from tensorflow.python.platform import test
 
 
@@ -36,192 +35,9 @@ class AffineBijectorTest(test.TestCase):
     with self.test_session():
       mu = -1.
       # scale corresponds to 1.
-      bijector = Affine(shift=mu, event_ndims=0)
+      bijector = Affine(shift=mu)
       self.assertEqual("affine", bijector.name)
 
-  def testNoBatchScalarViaIdentity(self):
-    with self.test_session() as sess:
-
-      def static_run(fun, x):
-        return fun(x).eval()
-
-      def dynamic_run(fun, x_value):
-        x_value = np.array(x_value)
-        x = array_ops.placeholder(dtypes.float32, name="x")
-        return sess.run(fun(x), feed_dict={x: x_value})
-
-      for run in (static_run, dynamic_run):
-        mu = -1.
-        # Corresponds to scale = 2
-        bijector = Affine(
-            shift=mu, scale_identity_multiplier=2., event_ndims=0)
-        self.assertEqual(0, bijector.event_ndims.eval())  # "is scalar"
-        x = [1., 2, 3]  # Three scalar samples (no batches).
-        self.assertAllClose([1., 3, 5], run(bijector.forward, x))
-        self.assertAllClose([1., 1.5, 2.], run(bijector.inverse, x))
-        self.assertAllClose(-np.log(2.),
-                            run(bijector.inverse_log_det_jacobian, x))
-
-  def testNoBatchScalarViaDiag(self):
-    with self.test_session() as sess:
-
-      def static_run(fun, x):
-        return fun(x).eval()
-
-      def dynamic_run(fun, x_value):
-        x_value = np.array(x_value)
-        x = array_ops.placeholder(dtypes.float32, name="x")
-        return sess.run(fun(x), feed_dict={x: x_value})
-
-      for run in (static_run, dynamic_run):
-        mu = -1.
-        # Corresponds to scale = 2
-        bijector = Affine(shift=mu, scale_identity_multiplier=2., event_ndims=0)
-        self.assertEqual(0, bijector.event_ndims.eval())  # "is scalar"
-        x = [1., 2, 3]  # Three scalar samples (no batches).
-        self.assertAllClose([1., 3, 5], run(bijector.forward, x))
-        self.assertAllClose([1., 1.5, 2.], run(bijector.inverse, x))
-        self.assertAllClose(-np.log(2.),
-                            run(bijector.inverse_log_det_jacobian, x))
-
-  def testWeirdSampleNoBatchScalarViaDiagMultiplier(self):
-    with self.test_session() as sess:
-
-      def static_run(fun, x):
-        return fun(x).eval()
-
-      def dynamic_run(fun, x_value):
-        x_value = np.array(x_value)
-        x = array_ops.placeholder(dtypes.float32, name="x")
-        return sess.run(fun(x), feed_dict={x: x_value})
-
-      for run in (static_run, dynamic_run):
-        mu = -1.
-        # Corresponds to scale = 2.
-        bijector = Affine(
-            shift=mu, scale_identity_multiplier=2., event_ndims=0)
-        self.assertEqual(0, bijector.event_ndims.eval())  # "is scalar"
-        x = [[1., 2, 3], [4, 5, 6]]  # Weird sample shape.
-        self.assertAllClose([[1., 3, 5],
-                             [7, 9, 11]],
-                            run(bijector.forward, x))
-        self.assertAllClose([[1., 1.5, 2.],
-                             [2.5, 3, 3.5]],
-                            run(bijector.inverse, x))
-        self.assertAllClose(-np.log(2.),
-                            run(bijector.inverse_log_det_jacobian, x))
-
-  def testOneBatchScalarViaIdentityIn64BitUserProvidesShiftOnly(self):
-    with self.test_session() as sess:
-
-      def static_run(fun, x):
-        return fun(x).eval()
-
-      def dynamic_run(fun, x_value):
-        x_value = np.array(x_value).astype(np.float64)
-        x = array_ops.placeholder(dtypes.float64, name="x")
-        return sess.run(fun(x), feed_dict={x: x_value})
-
-      for run in (static_run, dynamic_run):
-        mu = np.float64([1.])
-        # One batch, scalar.
-        # Corresponds to scale = 1.
-        bijector = Affine(shift=mu, event_ndims=0)
-        self.assertEqual(0, bijector.event_ndims.eval())  # "is scalar"
-        x = np.float64([1.])  # One sample from one batches.
-        self.assertAllClose([2.], run(bijector.forward, x))
-        self.assertAllClose([0.], run(bijector.inverse, x))
-        self.assertAllClose(0., run(bijector.inverse_log_det_jacobian, x))
-
-  def testOneBatchScalarViaIdentityIn64BitUserProvidesMultiplierOnly(self):
-    with self.test_session() as sess:
-
-      def static_run(fun, x):
-        return fun(x).eval()
-
-      def dynamic_run(fun, x_value):
-        x_value = np.array(x_value).astype(np.float64)
-        x = array_ops.placeholder(dtypes.float64, name="x")
-        return sess.run(fun(x), feed_dict={x: x_value})
-
-      for run in (static_run, dynamic_run):
-        multiplier = np.float64([2.])
-        # One batch, scalar.
-        # Corresponds to scale = 2, shift = 0.
-        bijector = Affine(scale_identity_multiplier=multiplier, event_ndims=0)
-        self.assertEqual(0, bijector.event_ndims.eval())  # "is scalar"
-        x = np.float64([1.])  # One sample from one batches.
-        self.assertAllClose([2.], run(bijector.forward, x))
-        self.assertAllClose([0.5], run(bijector.inverse, x))
-        self.assertAllClose([np.log(0.5)],
-                            run(bijector.inverse_log_det_jacobian, x))
-
-  def testOneBatchScalarViaDiagMultiplier(self):
-    with self.test_session() as sess:
-
-      def static_run(fun, x):
-        return fun(x).eval()
-
-      def dynamic_run(fun, x_value):
-        x_value = np.array(x_value)
-        x = array_ops.placeholder(dtypes.float32, name="x")
-        return sess.run(fun(x), feed_dict={x: x_value})
-
-      for run in (static_run, dynamic_run):
-        mu = [1.]
-        # One batch, scalar.
-        # Corresponds to scale = 1.
-        bijector = Affine(shift=mu, scale_identity_multiplier=1., event_ndims=0)
-        self.assertEqual(0, bijector.event_ndims.eval())  # "is scalar"
-        x = [1.]  # One sample from one batches.
-        self.assertAllClose([2.], run(bijector.forward, x))
-        self.assertAllClose([0.], run(bijector.inverse, x))
-        self.assertAllClose(0., run(bijector.inverse_log_det_jacobian, x))
-
-  def testTwoBatchScalarIdentityViaIdentity(self):
-    with self.test_session() as sess:
-
-      def static_run(fun, x):
-        return fun(x).eval()
-
-      def dynamic_run(fun, x_value):
-        x_value = np.array(x_value)
-        x = array_ops.placeholder(dtypes.float32, name="x")
-        return sess.run(fun(x), feed_dict={x: x_value})
-
-      for run in (static_run, dynamic_run):
-        mu = [1., -1]
-        # Univariate, two batches.
-        # Corresponds to scale = 1.
-        bijector = Affine(shift=mu, event_ndims=0)
-        self.assertEqual(0, bijector.event_ndims.eval())  # "is scalar"
-        x = [1., 1]  # One sample from each of two batches.
-        self.assertAllClose([2., 0], run(bijector.forward, x))
-        self.assertAllClose([0., 2], run(bijector.inverse, x))
-        self.assertAllClose(0., run(bijector.inverse_log_det_jacobian, x))
-
-  def testTwoBatchScalarIdentityViaDiagMultiplier(self):
-    with self.test_session() as sess:
-
-      def static_run(fun, x):
-        return fun(x).eval()
-
-      def dynamic_run(fun, x_value):
-        x_value = np.array(x_value)
-        x = array_ops.placeholder(dtypes.float32, name="x")
-        return sess.run(fun(x), feed_dict={x: x_value})
-
-      for run in (static_run, dynamic_run):
-        mu = [1., -1]
-        # Univariate, two batches.
-        # Corresponds to scale = 1.
-        bijector = Affine(shift=mu, scale_identity_multiplier=1., event_ndims=0)
-        self.assertEqual(0, bijector.event_ndims.eval())  # "is scalar"
-        x = [1., 1]  # One sample from each of two batches.
-        self.assertAllClose([2., 0], run(bijector.forward, x))
-        self.assertAllClose([0., 2], run(bijector.inverse, x))
-        self.assertAllClose(0., run(bijector.inverse_log_det_jacobian, x))
-
   def testNoBatchMultivariateIdentity(self):
     with self.test_session() as sess:
 
@@ -238,7 +54,6 @@ class AffineBijectorTest(test.TestCase):
         # Multivariate
         # Corresponds to scale = [[1., 0], [0, 1.]]
         bijector = Affine(shift=mu)
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [1., 1]
         # matmul(sigma, x) + shift
         # = [-1, -1] + [1, -1]
@@ -269,7 +84,6 @@ class AffineBijectorTest(test.TestCase):
         # Multivariate
         # Corresponds to scale = [[2., 0], [0, 1.]]
         bijector = Affine(shift=mu, scale_diag=[2., 1])
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [1., 1]
         # matmul(sigma, x) + shift
         # = [-1, -1] + [1, -1]
@@ -297,22 +111,17 @@ class AffineBijectorTest(test.TestCase):
       x = array_ops.placeholder(dtypes.float32, name="x")
       mu = array_ops.placeholder(dtypes.float32, name="mu")
       scale_diag = array_ops.placeholder(dtypes.float32, name="scale_diag")
-      event_ndims = array_ops.placeholder(dtypes.int32, name="event_ndims")
 
       x_value = np.array([[1., 1]], dtype=np.float32)
       mu_value = np.array([1., -1], dtype=np.float32)
       scale_diag_value = np.array([2., 2], dtype=np.float32)
-      event_ndims_value = np.array(1, dtype=np.int32)
       feed_dict = {
           x: x_value,
           mu: mu_value,
           scale_diag: scale_diag_value,
-          event_ndims: event_ndims_value
       }
 
-      bijector = Affine(
-          shift=mu, scale_diag=scale_diag, event_ndims=event_ndims)
-      self.assertEqual(1, sess.run(bijector.event_ndims, feed_dict))
+      bijector = Affine(shift=mu, scale_diag=scale_diag)
       self.assertAllClose([[3., 1]], sess.run(bijector.forward(x), feed_dict))
       self.assertAllClose([[0., 1]], sess.run(bijector.inverse(x), feed_dict))
       self.assertAllClose(
@@ -335,7 +144,6 @@ class AffineBijectorTest(test.TestCase):
         # Corresponds to 1 2x2 matrix, with twos on the diagonal.
         scale = 2.
         bijector = Affine(shift=mu, scale_identity_multiplier=scale)
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [[[1., 1]]]
         self.assertAllClose([[[3., 1]]], run(bijector.forward, x))
         self.assertAllClose([[[0., 1]]], run(bijector.inverse, x))
@@ -358,7 +166,6 @@ class AffineBijectorTest(test.TestCase):
         # Corresponds to 1 2x2 matrix, with twos on the diagonal.
         scale_diag = [[2., 2]]
         bijector = Affine(shift=mu, scale_diag=scale_diag)
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [[[1., 1]]]
         self.assertAllClose([[[3., 1]]], run(bijector.forward, x))
         self.assertAllClose([[[0., 1]]], run(bijector.inverse, x))
@@ -370,23 +177,18 @@ class AffineBijectorTest(test.TestCase):
       x = array_ops.placeholder(dtypes.float32, name="x")
       mu = array_ops.placeholder(dtypes.float32, name="mu")
       scale_diag = array_ops.placeholder(dtypes.float32, name="scale_diag")
-      event_ndims = array_ops.placeholder(dtypes.int32, name="event_ndims")
 
       x_value = np.array([[[1., 1]]], dtype=np.float32)
       mu_value = np.array([[1., -1]], dtype=np.float32)
       scale_diag_value = np.array([[2., 2]], dtype=np.float32)
-      event_ndims_value = 1
 
       feed_dict = {
           x: x_value,
           mu: mu_value,
           scale_diag: scale_diag_value,
-          event_ndims: event_ndims_value
       }
 
-      bijector = Affine(
-          shift=mu, scale_diag=scale_diag, event_ndims=event_ndims)
-      self.assertEqual(1, sess.run(bijector.event_ndims, feed_dict))
+      bijector = Affine(shift=mu, scale_diag=scale_diag)
       self.assertAllClose([[[3., 1]]], sess.run(bijector.forward(x), feed_dict))
       self.assertAllClose([[[0., 1]]], sess.run(bijector.inverse(x), feed_dict))
       self.assertAllClose([-np.log(4)],
@@ -410,9 +212,7 @@ class AffineBijectorTest(test.TestCase):
         bijector = Affine(
             shift=mu,
             scale_identity_multiplier=1.,
-            scale_diag=[1., 1., 1.],
-            event_ndims=1)
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
+            scale_diag=[1., 1., 1.])
         x = [1., 2, 3]  # Three scalar samples (no batches).
         self.assertAllClose([1., 3, 5], run(bijector.forward, x))
         self.assertAllClose([1., 1.5, 2.], run(bijector.inverse, x))
@@ -437,7 +237,6 @@ class AffineBijectorTest(test.TestCase):
             shift=mu,
             scale_identity_multiplier=1.,
             scale_tril=[[1., 0], [2., 1]])
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [[1., 2]]  # One multivariate sample.
         self.assertAllClose([[1., 5]], run(bijector.forward, x))
         self.assertAllClose([[1., 0.5]], run(bijector.inverse, x))
@@ -460,7 +259,6 @@ class AffineBijectorTest(test.TestCase):
         # scale = [[2., 0], [2, 3]]
         bijector = Affine(
             shift=mu, scale_diag=[1., 2.], scale_tril=[[1., 0], [2., 1]])
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [[1., 2]]  # One multivariate sample.
         self.assertAllClose([[1., 7]], run(bijector.forward, x))
         self.assertAllClose([[1., 1 / 3.]], run(bijector.inverse, x))
@@ -486,7 +284,6 @@ class AffineBijectorTest(test.TestCase):
             scale_identity_multiplier=1.0,
             scale_diag=[1., 2.],
             scale_tril=[[1., 0], [2., 1]])
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [[1., 2]]  # One multivariate sample.
         self.assertAllClose([[2., 9]], run(bijector.forward, x))
         self.assertAllClose([[2 / 3., 5 / 12.]], run(bijector.inverse, x))
@@ -514,7 +311,6 @@ class AffineBijectorTest(test.TestCase):
             scale_perturb_factor=[[2., 0], [0., 0], [0, 1]])
         bijector_ref = Affine(shift=mu, scale_diag=[10., 2, 3])
 
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [1., 2, 3]  # Vector.
         self.assertAllClose([9., 3, 8], run(bijector.forward, x))
         self.assertAllClose(
@@ -550,7 +346,6 @@ class AffineBijectorTest(test.TestCase):
             scale_perturb_factor=[[2., 0], [0., 0], [0, 1]])
         bijector_ref = Affine(shift=mu, scale_diag=[10., 3, 5])
 
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [1., 2, 3]  # Vector.
         self.assertAllClose([9., 5, 14], run(bijector.forward, x))
         self.assertAllClose(
@@ -586,7 +381,6 @@ class AffineBijectorTest(test.TestCase):
         bijector_ref = Affine(
             shift=mu, scale_tril=[[10., 0, 0], [1, 3, 0], [2, 3, 5]])
 
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [1., 2, 3]  # Vector.
         self.assertAllClose([9., 6, 22], run(bijector.forward, x))
         self.assertAllClose(
@@ -622,7 +416,6 @@ class AffineBijectorTest(test.TestCase):
         bijector_ref = Affine(
             shift=mu, scale_tril=[[6., 0, 0], [1, 3, 0], [2, 3, 5]])
 
-        self.assertEqual(1, bijector.event_ndims.eval())  # "is vector"
         x = [1., 2, 3]  # Vector.
         self.assertAllClose([5., 6, 22], run(bijector.forward, x))
         self.assertAllClose(
@@ -647,38 +440,6 @@ class AffineBijectorTest(test.TestCase):
       with self.assertRaisesOpError("diagonal part must be non-zero"):
         bijector.forward([1., 1.]).eval()
 
-  def testEventNdimsLargerThanOneRaises(self):
-    with self.test_session():
-      mu = [1., -1]
-      with self.assertRaisesRegexp(
-          ValueError, (r"event_ndims\(2\) was not 0 or 1")):
-        # Scale corresponds to 2x2 identity matrix.
-        bijector = Affine(shift=mu, event_ndims=2, validate_args=True)
-        bijector.forward([1., 1.]).eval()
-
-  def testScaleZeroScalarRaises(self):
-    with self.test_session():
-      mu = -1.
-      # Check Identity matrix with zero scaling.
-      bijector = Affine(
-          shift=mu,
-          scale_identity_multiplier=0.,
-          event_ndims=0,
-          validate_args=True)
-      with self.assertRaisesOpError("identity_multiplier should be non-zero"):
-        bijector.forward(1.).eval()
-
-  def testScaleDiagAndEventNdimsZeroRaises(self):
-    # Check Diag matrix with zero scaling.
-    with self.assertRaisesRegexp(ValueError, "only scale argument"):
-      Affine(shift=None, scale_diag=[0.0], event_ndims=0, validate_args=True)
-
-  def testScalarCongruency(self):
-    with self.test_session():
-      bijector = Affine(
-          shift=3.6, scale_identity_multiplier=0.42, event_ndims=0)
-      assert_scalar_congruency(bijector, lower_x=-2., upper_x=2.)
-
   def _makeScale(self,
                  x,
                  scale_identity_multiplier=None,
@@ -747,14 +508,12 @@ class AffineBijectorTest(test.TestCase):
         scale_args = dict({"x": x}, **args)
         scale = self._makeScale(**scale_args)
 
-        bijector_args = dict({"event_ndims": 1}, **args)
-
         # We haven't specified enough information for the scale.
         if scale is None:
           with self.assertRaisesRegexp(ValueError, ("must be specified.")):
-            bijector = Affine(shift=shift, **bijector_args)
+            bijector = Affine(shift=shift, **args)
         else:
-          bijector = Affine(shift=shift, **bijector_args)
+          bijector = Affine(shift=shift, **args)
           np_x = x
           # For the case a vector is passed in, we need to make the shape
           # match the matrix for matmul to work.
@@ -829,15 +588,5 @@ class AffineBijectorTest(test.TestCase):
         x=np.array(
             [1., 2], dtype=np.float32))
 
-  def testScalarEventIdentityScale(self):
-    with self.test_session() as sess:
-      doubler = Affine(
-          scale_identity_multiplier=2.,
-          event_ndims=0)
-      doubler2 = doubler.inverse_log_det_jacobian(2.)
-      doubler2_ildj_ = sess.run([doubler2])
-      self.assertAllClose([-np.log(2.)], doubler2_ildj_)
-
-
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/batch_normalization_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/batch_normalization_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..a215a4a2b1ffbea7951bdb9b4352ed567e0b1e41
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/batch_normalization_test.py
@@ -0,0 +1,236 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for BatchNorm Bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib import distributions
+from tensorflow.contrib.distributions.python.ops import test_util
+from tensorflow.contrib.distributions.python.ops.bijectors.batch_normalization import BatchNormalization
+from tensorflow.contrib.distributions.python.ops.bijectors.invert import Invert
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.layers import normalization
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.ops.distributions import normal as normal_lib
+from tensorflow.python.ops.distributions import transformed_distribution as transformed_distribution_lib
+from tensorflow.python.platform import test
+from tensorflow.python.training import adam
+
+
+class BatchNormTest(test_util.VectorDistributionTestHelpers,
+                    test.TestCase):
+
+  def _reduction_axes(self, input_shape, event_dims):
+    if isinstance(event_dims, int):
+      event_dims = [event_dims]
+    ndims = len(input_shape)
+    # Convert event_dims to non-negative indexing.
+    event_dims = list(event_dims)
+    for idx, x in enumerate(event_dims):
+      if x < 0:
+        event_dims[idx] = ndims + x
+    return tuple(i for i in range(ndims) if i not in event_dims)
+
+  def testForwardInverse(self):
+    """Tests forward and backward passes with different event shapes.
+
+    input_shape: Shape tuple of the input tensor.
+    event_dims: Dimension indices that will be normalized.
+    training: Whether the bijector runs in training or inference mode.
+    """
+    params = [
+        ((5*2, 4), [-1], False),
+        ((5, 2, 4), [-1], False),
+        ((5, 2, 4), [1, 2], False),
+        ((5, 2, 4), [0, 1], False),
+        ((5*2, 4), [-1], True),
+        ((5, 2, 4), [-1], True),
+        ((5, 2, 4), [1, 2], True),
+        ((5, 2, 4), [0, 1], True)
+    ]
+    for input_shape, event_dims, training in params:
+      x_ = np.arange(5 * 4 * 2).astype(np.float32).reshape(input_shape)
+      with self.test_session() as sess:
+        x = constant_op.constant(x_)
+        # With momentum=0., the layer memorizes the exact statistics of the
+        # last minibatch it normalized (instead of a moving average).
+        layer = normalization.BatchNormalization(
+            axis=event_dims, momentum=0., epsilon=0.)
+        batch_norm = BatchNormalization(
+            batchnorm_layer=layer, training=training)
+        # Minibatch statistics are saved only after norm_x has been computed.
+        norm_x = batch_norm.inverse(x)
+        with ops.control_dependencies(batch_norm.batchnorm.updates):
+          moving_mean = array_ops.identity(batch_norm.batchnorm.moving_mean)
+          moving_var = array_ops.identity(batch_norm.batchnorm.moving_variance)
+          denorm_x = batch_norm.forward(array_ops.identity(norm_x))
+          fldj = batch_norm.forward_log_det_jacobian(x)
+          # Use identity to invalidate cache.
+          ildj = batch_norm.inverse_log_det_jacobian(
+              array_ops.identity(denorm_x))
+        variables.global_variables_initializer().run()
+        # Update variables.
+        norm_x_ = sess.run(norm_x)
+        [
+            norm_x_,
+            moving_mean_,
+            moving_var_,
+            denorm_x_,
+            ildj_,
+            fldj_,
+        ] = sess.run([
+            norm_x,
+            moving_mean,
+            moving_var,
+            denorm_x,
+            ildj,
+            fldj,
+        ])
+        self.assertEqual("batch_normalization", batch_norm.name)
+
+        reduction_axes = self._reduction_axes(input_shape, event_dims)
+        keepdims = len(event_dims) > 1
+
+        expected_batch_mean = np.mean(
+            x_, axis=reduction_axes, keepdims=keepdims)
+        expected_batch_var = np.var(x_, axis=reduction_axes, keepdims=keepdims)
+
+        if training:
+          # When training=True, values become normalized across batch dim and
+          # original values are recovered after de-normalizing.
+          zeros = np.zeros_like(norm_x_)
+          self.assertAllClose(np.mean(zeros, axis=reduction_axes),
+                              np.mean(norm_x_, axis=reduction_axes))
+
+          self.assertAllClose(expected_batch_mean, moving_mean_)
+          self.assertAllClose(expected_batch_var, moving_var_)
+          self.assertAllClose(x_, denorm_x_, atol=1e-5)
+          # Since moving statistics are set to batch statistics after
+          # normalization, ildj and -fldj should match.
+          self.assertAllClose(ildj_, -fldj_)
+          # ildj is computed with minibatch statistics.
+          expected_ildj = np.sum(np.log(1.) - .5 * np.log(
+              expected_batch_var + batch_norm.batchnorm.epsilon))
+          self.assertAllClose(expected_ildj, ildj_)
+        else:
+          # When training=False, moving_mean, moving_var remain at their
+          # initialized values (0., 1.), resulting in no scale/shift (a small
+          # shift occurs if epsilon > 0.)
+          self.assertAllClose(x_, norm_x_)
+          self.assertAllClose(x_, denorm_x_, atol=1e-5)
+          # ildj is computed with saved statistics.
+          expected_ildj = np.sum(
+              np.log(1.) - .5 * np.log(1. + batch_norm.batchnorm.epsilon))
+          self.assertAllClose(expected_ildj, ildj_)
+
+  def testMaximumLikelihoodTraining(self):
+    # Test Maximum Likelihood training with default bijector.
+    with self.test_session() as sess:
+      base_dist = distributions.MultivariateNormalDiag(loc=[0., 0.])
+      batch_norm = BatchNormalization(training=True)
+      dist = transformed_distribution_lib.TransformedDistribution(
+          distribution=base_dist,
+          bijector=batch_norm)
+      target_dist = distributions.MultivariateNormalDiag(loc=[1., 2.])
+      target_samples = target_dist.sample(100)
+      dist_samples = dist.sample(3000)
+      loss = -math_ops.reduce_mean(dist.log_prob(target_samples))
+      with ops.control_dependencies(batch_norm.batchnorm.updates):
+        train_op = adam.AdamOptimizer(1e-2).minimize(loss)
+        moving_mean = array_ops.identity(batch_norm.batchnorm.moving_mean)
+        moving_var = array_ops.identity(batch_norm.batchnorm.moving_variance)
+      variables.global_variables_initializer().run()
+      for _ in range(3000):
+        sess.run(train_op)
+      [
+          dist_samples_,
+          moving_mean_,
+          moving_var_
+      ] = sess.run([
+          dist_samples,
+          moving_mean,
+          moving_var
+      ])
+      self.assertAllClose([1., 2.], np.mean(dist_samples_, axis=0), atol=5e-2)
+      self.assertAllClose([1., 2.], moving_mean_, atol=5e-2)
+      self.assertAllClose([1., 1.], moving_var_, atol=5e-2)
+
+  def testLogProb(self):
+    with self.test_session() as sess:
+      layer = normalization.BatchNormalization(epsilon=0.)
+      batch_norm = BatchNormalization(batchnorm_layer=layer, training=False)
+      base_dist = distributions.MultivariateNormalDiag(loc=[0., 0.])
+      dist = transformed_distribution_lib.TransformedDistribution(
+          distribution=base_dist,
+          bijector=batch_norm,
+          validate_args=True)
+      samples = dist.sample(int(1e5))
+      # No volume distortion: with training=False the bijector stays at its
+      # initial state, which is the identity transformation.
+      base_log_prob = base_dist.log_prob(samples)
+      dist_log_prob = dist.log_prob(samples)
+      variables.global_variables_initializer().run()
+      base_log_prob_, dist_log_prob_ = sess.run([base_log_prob, dist_log_prob])
+      self.assertAllClose(base_log_prob_, dist_log_prob_)
+
+  def testMutuallyConsistent(self):
+    # BatchNorm bijector is only mutually consistent when training=False.
+    dims = 4
+    with self.test_session() as sess:
+      layer = normalization.BatchNormalization(epsilon=0.)
+      batch_norm = BatchNormalization(batchnorm_layer=layer, training=False)
+      dist = transformed_distribution_lib.TransformedDistribution(
+          distribution=normal_lib.Normal(loc=0., scale=1.),
+          bijector=batch_norm,
+          event_shape=[dims],
+          validate_args=True)
+      self.run_test_sample_consistent_log_prob(
+          sess_run_fn=sess.run,
+          dist=dist,
+          num_samples=int(1e5),
+          radius=2.,
+          center=0.,
+          rtol=0.02)
+
+  def testInvertMutuallyConsistent(self):
+    # BatchNorm bijector is only mutually consistent when training=False.
+    dims = 4
+    with self.test_session() as sess:
+      layer = normalization.BatchNormalization(epsilon=0.)
+      batch_norm = Invert(
+          BatchNormalization(batchnorm_layer=layer, training=False))
+      dist = transformed_distribution_lib.TransformedDistribution(
+          distribution=normal_lib.Normal(loc=0., scale=1.),
+          bijector=batch_norm,
+          event_shape=[dims],
+          validate_args=True)
+      self.run_test_sample_consistent_log_prob(
+          sess_run_fn=sess.run,
+          dist=dist,
+          num_samples=int(1e5),
+          radius=2.,
+          center=0.,
+          rtol=0.02)
+
+
+if __name__ == "__main__":
+  test.main()
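The expected values checked in `testForwardInverse` above can be reproduced without TensorFlow. The following is a minimal sketch in plain Python, assuming a 2-D input whose last axis is the event axis and a unit scale with zero offset; the helper names (`reduction_axes`, `batch_stats`, `normalize`, `inverse_log_det_jacobian`) are hypothetical and are not part of the bijector's API.

```python
import math


def reduction_axes(ndims, event_dims):
    """Axes the batch statistics are reduced over (all non-event axes)."""
    event = {d % ndims for d in event_dims}  # map negative indices to positive
    return tuple(i for i in range(ndims) if i not in event)


def batch_stats(rows):
    """Per-column mean and biased variance of a 2-D list of floats."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return means, variances


def normalize(rows, means, variances, epsilon=0.0):
    """The bijector's inverse direction: subtract mean, divide by stddev."""
    return [[(v - m) / math.sqrt(s + epsilon)
             for v, m, s in zip(row, means, variances)] for row in rows]


def inverse_log_det_jacobian(variances, epsilon=0.0):
    """log|det J| of normalize with unit scale: sum(-0.5 * log(var + eps))."""
    return sum(-0.5 * math.log(s + epsilon) for s in variances)


x = [[0.0, 1.0], [2.0, 5.0], [4.0, 9.0]]
means, variances = batch_stats(x)
norm_x = normalize(x, means, variances)
```

The `inverse_log_det_jacobian` helper mirrors the `expected_ildj` expression in the test, where `np.log(1.)` stands in for the log of the (unit) scale parameter.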
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/chain_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/chain_test.py
index 20e754308449af3f0399101f4ea1bb47b3356424..a748acd667e58f9b527bab11d8bc4d086996e9f3 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/chain_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/chain_test.py
@@ -66,12 +66,10 @@ class ChainBijectorTest(test.TestCase):
   def testShapeGetters(self):
     with self.test_session():
       bijector = Chain([
-          SoftmaxCentered(
-              event_ndims=1, validate_args=True),
-          SoftmaxCentered(
-              event_ndims=0, validate_args=True)
+          SoftmaxCentered(validate_args=True),
+          SoftmaxCentered(validate_args=True),
       ])
-      x = tensor_shape.TensorShape([])
+      x = tensor_shape.TensorShape([1])
       y = tensor_shape.TensorShape([2 + 1])
       self.assertAllEqual(y, bijector.forward_event_shape(x))
       self.assertAllEqual(
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/cholesky_outer_product_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/cholesky_outer_product_test.py
index 0ff35304283fce9ce3f9e5d31b1258394e384d7b..f392e83d2c3da9dac43c2e87070e952ae2060b34 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/cholesky_outer_product_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/cholesky_outer_product_test.py
@@ -18,70 +18,111 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
 from tensorflow.contrib.distributions.python.ops import bijectors
-from tensorflow.python.framework import tensor_shape
+from tensorflow.python.framework import dtypes
 from tensorflow.python.ops import array_ops
-from tensorflow.python.ops.distributions import gamma as gamma_lib
-from tensorflow.python.ops.distributions import transformed_distribution as transformed_distribution_lib
-from tensorflow.python.ops.distributions.bijector_test_util import assert_scalar_congruency
 from tensorflow.python.platform import test
 
 
-class InvertBijectorTest(test.TestCase):
-  """Tests the correctness of the Y = Invert(bij) transformation."""
+class CholeskyOuterProductBijectorTest(test.TestCase):
+  """Tests the correctness of the Y = X @ X.T transformation."""
 
-  def testBijector(self):
+  def testBijectorMatrix(self):
     with self.test_session():
-      for fwd in [
-          bijectors.Identity(),
-          bijectors.Exp(event_ndims=1),
-          bijectors.Affine(
-              shift=[0., 1.], scale_diag=[2., 3.], event_ndims=1),
-          bijectors.Softplus(event_ndims=1),
-          bijectors.SoftmaxCentered(event_ndims=1),
-          bijectors.SigmoidCentered(),
-      ]:
-        rev = bijectors.Invert(fwd)
-        self.assertEqual("_".join(["invert", fwd.name]), rev.name)
-        x = [[[1., 2.],
-              [2., 3.]]]
-        self.assertAllClose(fwd.inverse(x).eval(), rev.forward(x).eval())
-        self.assertAllClose(fwd.forward(x).eval(), rev.inverse(x).eval())
-        self.assertAllClose(
-            fwd.forward_log_det_jacobian(x).eval(),
-            rev.inverse_log_det_jacobian(x).eval())
-        self.assertAllClose(
-            fwd.inverse_log_det_jacobian(x).eval(),
-            rev.forward_log_det_jacobian(x).eval())
+      bijector = bijectors.CholeskyOuterProduct(validate_args=True)
+      self.assertEqual("cholesky_outer_product", bijector.name)
+      x = [[[1., 0], [2, 1]], [[np.sqrt(2.), 0], [np.sqrt(8.), 1]]]
+      y = np.matmul(x, np.transpose(x, axes=(0, 2, 1)))
+      # Fairly easy to compute differentials since we have 2x2.
+      dx_dy = [[[2. * 1, 0, 0],
+                [2, 1, 0],
+                [0, 2 * 2, 2 * 1]],
+               [[2 * np.sqrt(2.), 0, 0],
+                [np.sqrt(8.), np.sqrt(2.), 0],
+                [0, 2 * np.sqrt(8.), 2 * 1]]]
+      ildj = -np.sum(
+          np.log(np.asarray(dx_dy).diagonal(
+              offset=0, axis1=1, axis2=2)),
+          axis=1)
+      self.assertAllEqual((2, 2, 2), bijector.forward(x).get_shape())
+      self.assertAllEqual((2, 2, 2), bijector.inverse(y).get_shape())
+      self.assertAllClose(y, bijector.forward(x).eval())
+      self.assertAllClose(x, bijector.inverse(y).eval())
+      self.assertAllClose(
+          ildj, bijector.inverse_log_det_jacobian(y).eval(), atol=0., rtol=1e-7)
+      self.assertAllClose(
+          -bijector.inverse_log_det_jacobian(y).eval(),
+          bijector.forward_log_det_jacobian(x).eval(),
+          atol=0.,
+          rtol=1e-7)
 
-  def testScalarCongruency(self):
-    with self.test_session():
-      bijector = bijectors.Invert(bijectors.Exp())
-      assert_scalar_congruency(
-          bijector, lower_x=1e-3, upper_x=1.5, rtol=0.05)
+  def testNoBatchStatic(self):
+    x = np.array([[1., 0], [2, 1]])  # np.linalg.cholesky(y)
+    y = np.array([[1., 2], [2, 5]])  # np.matmul(x, x.T)
+    with self.test_session() as sess:
+      y_actual = bijectors.CholeskyOuterProduct().forward(x=x)
+      x_actual = bijectors.CholeskyOuterProduct().inverse(y=y)
+    [y_actual_, x_actual_] = sess.run([y_actual, x_actual])
+    self.assertAllEqual([2, 2], y_actual.get_shape())
+    self.assertAllEqual([2, 2], x_actual.get_shape())
+    self.assertAllClose(y, y_actual_)
+    self.assertAllClose(x, x_actual_)
 
-  def testShapeGetters(self):
-    with self.test_session():
-      bijector = bijectors.Invert(bijectors.SigmoidCentered(validate_args=True))
-      x = tensor_shape.TensorShape([2])
-      y = tensor_shape.TensorShape([])
-      self.assertAllEqual(y, bijector.forward_event_shape(x))
-      self.assertAllEqual(
-          y.as_list(),
-          bijector.forward_event_shape_tensor(x.as_list()).eval())
-      self.assertAllEqual(x, bijector.inverse_event_shape(y))
-      self.assertAllEqual(
-          x.as_list(),
-          bijector.inverse_event_shape_tensor(y.as_list()).eval())
+  def testNoBatchDeferred(self):
+    x = np.array([[1., 0], [2, 1]])  # np.linalg.cholesky(y)
+    y = np.array([[1., 2], [2, 5]])  # np.matmul(x, x.T)
+    with self.test_session() as sess:
+      x_pl = array_ops.placeholder(dtypes.float32)
+      y_pl = array_ops.placeholder(dtypes.float32)
+      y_actual = bijectors.CholeskyOuterProduct().forward(x=x_pl)
+      x_actual = bijectors.CholeskyOuterProduct().inverse(y=y_pl)
+    [y_actual_, x_actual_] = sess.run([y_actual, x_actual],
+                                      feed_dict={x_pl: x, y_pl: y})
+    self.assertEqual(None, y_actual.get_shape())
+    self.assertEqual(None, x_actual.get_shape())
+    self.assertAllClose(y, y_actual_)
+    self.assertAllClose(x, x_actual_)
 
-  def testDocstringExample(self):
-    with self.test_session():
-      exp_gamma_distribution = (
-          transformed_distribution_lib.TransformedDistribution(
-              distribution=gamma_lib.Gamma(concentration=1., rate=2.),
-              bijector=bijectors.Invert(bijectors.Exp())))
-      self.assertAllEqual(
-          [], array_ops.shape(exp_gamma_distribution.sample()).eval())
+  def testBatchStatic(self):
+    x = np.array([[[1., 0],
+                   [2, 1]],
+                  [[3., 0],
+                   [1, 2]]])  # np.linalg.cholesky(y)
+    y = np.array([[[1., 2],
+                   [2, 5]],
+                  [[9., 3],
+                   [3, 5]]])  # np.matmul(x, x.T)
+    with self.test_session() as sess:
+      y_actual = bijectors.CholeskyOuterProduct().forward(x=x)
+      x_actual = bijectors.CholeskyOuterProduct().inverse(y=y)
+    [y_actual_, x_actual_] = sess.run([y_actual, x_actual])
+    self.assertEqual([2, 2, 2], y_actual.get_shape())
+    self.assertEqual([2, 2, 2], x_actual.get_shape())
+    self.assertAllClose(y, y_actual_)
+    self.assertAllClose(x, x_actual_)
+
+  def testBatchDeferred(self):
+    x = np.array([[[1., 0],
+                   [2, 1]],
+                  [[3., 0],
+                   [1, 2]]])  # np.linalg.cholesky(y)
+    y = np.array([[[1., 2],
+                   [2, 5]],
+                  [[9., 3],
+                   [3, 5]]])  # np.matmul(x, x.T)
+    with self.test_session() as sess:
+      x_pl = array_ops.placeholder(dtypes.float32)
+      y_pl = array_ops.placeholder(dtypes.float32)
+      y_actual = bijectors.CholeskyOuterProduct().forward(x=x_pl)
+      x_actual = bijectors.CholeskyOuterProduct().inverse(y=y_pl)
+    [y_actual_, x_actual_] = sess.run([y_actual, x_actual],
+                                      feed_dict={x_pl: x, y_pl: y})
+    self.assertEqual(None, y_actual.get_shape())
+    self.assertEqual(None, x_actual.get_shape())
+    self.assertAllClose(y, y_actual_)
+    self.assertAllClose(x, x_actual_)
 
 
 if __name__ == "__main__":
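The 2x2 case exercised by `testBijectorMatrix` and `testNoBatchStatic` can be worked by hand. The following is a minimal sketch in plain Python, assuming a single 2x2 lower-triangular input with a positive diagonal; the function names are illustrative stand-ins and not the bijector's implementation. The log-determinant uses the diagonal of the triangular Jacobian d(y00, y10, y11)/d(x00, x10, x11), which is the same diagonal the test reads from its `dx_dy` matrix.

```python
import math


def forward(x):
    """y = x @ x.T for a 2x2 lower-triangular x given as nested lists."""
    (a, _), (b, c) = x
    return [[a * a, a * b], [a * b, b * b + c * c]]


def inverse(y):
    """Cholesky factor of a 2x2 symmetric positive-definite y."""
    a = math.sqrt(y[0][0])
    b = y[1][0] / a
    c = math.sqrt(y[1][1] - b * b)
    return [[a, 0.0], [b, c]]


def forward_log_det_jacobian(x):
    """Sum of logs of the Jacobian diagonal entries 2*x00, x00, 2*x11."""
    a, c = x[0][0], x[1][1]
    return math.log(2.0 * a) + math.log(a) + math.log(2.0 * c)


x = [[1.0, 0.0], [2.0, 1.0]]
y = forward(x)          # the same x and y as in testNoBatchStatic
x_recovered = inverse(y)
fldj = forward_log_det_jacobian(x)
```

The inverse log-det-Jacobian evaluated at `y` is the negation of `fldj`, which is the relation the test asserts between `inverse_log_det_jacobian(y)` and `forward_log_det_jacobian(x)`.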
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/invert_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/invert_test.py
index 0ff35304283fce9ce3f9e5d31b1258394e384d7b..58ba9cedb1437df4e000ce32fe39664afa76c3b5 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/invert_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/invert_test.py
@@ -35,11 +35,9 @@ class InvertBijectorTest(test.TestCase):
       for fwd in [
           bijectors.Identity(),
           bijectors.Exp(event_ndims=1),
-          bijectors.Affine(
-              shift=[0., 1.], scale_diag=[2., 3.], event_ndims=1),
+          bijectors.Affine(shift=[0., 1.], scale_diag=[2., 3.]),
           bijectors.Softplus(event_ndims=1),
-          bijectors.SoftmaxCentered(event_ndims=1),
-          bijectors.SigmoidCentered(),
+          bijectors.SoftmaxCentered(),
       ]:
         rev = bijectors.Invert(fwd)
         self.assertEqual("_".join(["invert", fwd.name]), rev.name)
@@ -62,9 +60,9 @@ class InvertBijectorTest(test.TestCase):
 
   def testShapeGetters(self):
     with self.test_session():
-      bijector = bijectors.Invert(bijectors.SigmoidCentered(validate_args=True))
+      bijector = bijectors.Invert(bijectors.SoftmaxCentered(validate_args=True))
       x = tensor_shape.TensorShape([2])
-      y = tensor_shape.TensorShape([])
+      y = tensor_shape.TensorShape([1])
       self.assertAllEqual(y, bijector.forward_event_shape(x))
       self.assertAllEqual(
           y.as_list(),
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/sigmoid_centered_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/sigmoid_centered_test.py
deleted file mode 100644
index 4ff3f334ccb59f1c117b3d35032d9e799cfd79bb..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/sigmoid_centered_test.py
+++ /dev/null
@@ -1,57 +0,0 @@
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for Bijector."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.contrib.distributions.python.ops.bijectors.sigmoid_centered import SigmoidCentered
-from tensorflow.python.platform import test
-
-
-class SigmoidCenteredBijectorTest(test.TestCase):
-  """Tests correctness of the Y = g(X) = (1 + exp(-X))^-1 transformation."""
-
-  def testBijector(self):
-    with self.test_session():
-      sigmoid = SigmoidCentered()
-      self.assertEqual("sigmoid_centered", sigmoid.name)
-      x = np.log([[2., 3, 4],
-                  [4., 8, 12]])
-      y = [[[2. / 3, 1. / 3],
-            [3. / 4, 1. / 4],
-            [4. / 5, 1. / 5]],
-           [[4. / 5, 1. / 5],
-            [8. / 9, 1. / 9],
-            [12. / 13, 1. / 13]]]
-      self.assertAllClose(y, sigmoid.forward(x).eval())
-      self.assertAllClose(x, sigmoid.inverse(y).eval())
-      self.assertAllClose(
-          -np.sum(np.log(y), axis=2),
-          sigmoid.inverse_log_det_jacobian(y).eval(),
-          atol=0.,
-          rtol=1e-7)
-      self.assertAllClose(
-          -sigmoid.inverse_log_det_jacobian(y).eval(),
-          sigmoid.forward_log_det_jacobian(x).eval(),
-          atol=0.,
-          rtol=1e-7)
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/softmax_centered_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/softmax_centered_test.py
index 62e3869db090e9c9327bc552d10234ff76ba28fd..cad4dd1ac8de0da6405aacb9047714b37eec73e3 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/softmax_centered_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/softmax_centered_test.py
@@ -21,7 +21,9 @@ from __future__ import print_function
 import numpy as np
 
 from tensorflow.contrib.distributions.python.ops.bijectors.softmax_centered import SoftmaxCentered
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops.distributions.bijector_test_util import assert_bijective_and_finite
 from tensorflow.python.platform import test
 
@@ -32,22 +34,16 @@ rng = np.random.RandomState(42)
 class SoftmaxCenteredBijectorTest(test.TestCase):
   """Tests correctness of the Y = g(X) = exp(X) / sum(exp(X)) transformation."""
 
-  def testBijectorScalar(self):
+  def testBijectorVector(self):
     with self.test_session():
-      softmax = SoftmaxCentered()  # scalar by default
+      softmax = SoftmaxCentered()
       self.assertEqual("softmax_centered", softmax.name)
-      x = np.log([[2., 3, 4],
-                  [4., 8, 12]])
-      y = [[[2. / 3, 1. / 3],
-            [3. / 4, 1. / 4],
-            [4. / 5, 1. / 5]],
-           [[4. / 5, 1. / 5],
-            [8. / 9, 1. / 9],
-            [12. / 13, 1. / 13]]]
+      x = np.log([[2., 3, 4], [4., 8, 12]])
+      y = [[0.2, 0.3, 0.4, 0.1], [0.16, 0.32, 0.48, 0.04]]
       self.assertAllClose(y, softmax.forward(x).eval())
       self.assertAllClose(x, softmax.inverse(y).eval())
       self.assertAllClose(
-          -np.sum(np.log(y), axis=2),
+          -np.sum(np.log(y), axis=1),
           softmax.inverse_log_det_jacobian(y).eval(),
           atol=0.,
           rtol=1e-7)
@@ -57,45 +53,49 @@ class SoftmaxCenteredBijectorTest(test.TestCase):
           atol=0.,
           rtol=1e-7)
 
-  def testBijectorVector(self):
+  def testBijectorUnknownShape(self):
     with self.test_session():
-      softmax = SoftmaxCentered(event_ndims=1)
+      softmax = SoftmaxCentered()
       self.assertEqual("softmax_centered", softmax.name)
-      x = np.log([[2., 3, 4], [4., 8, 12]])
-      y = [[0.2, 0.3, 0.4, 0.1], [0.16, 0.32, 0.48, 0.04]]
-      self.assertAllClose(y, softmax.forward(x).eval())
-      self.assertAllClose(x, softmax.inverse(y).eval())
+      x = array_ops.placeholder(shape=[2, None], dtype=dtypes.float32)
+      real_x = np.log([[2., 3, 4], [4., 8, 12]])
+      y = array_ops.placeholder(shape=[2, None], dtype=dtypes.float32)
+      real_y = [[0.2, 0.3, 0.4, 0.1], [0.16, 0.32, 0.48, 0.04]]
+      self.assertAllClose(real_y, softmax.forward(x).eval(
+          feed_dict={x: real_x}))
+      self.assertAllClose(real_x, softmax.inverse(y).eval(
+          feed_dict={y: real_y}))
       self.assertAllClose(
-          -np.sum(np.log(y), axis=1),
-          softmax.inverse_log_det_jacobian(y).eval(),
+          -np.sum(np.log(real_y), axis=1),
+          softmax.inverse_log_det_jacobian(y).eval(
+              feed_dict={y: real_y}),
           atol=0.,
           rtol=1e-7)
       self.assertAllClose(
-          -softmax.inverse_log_det_jacobian(y).eval(),
-          softmax.forward_log_det_jacobian(x).eval(),
+          -softmax.inverse_log_det_jacobian(y).eval(
+              feed_dict={y: real_y}),
+          softmax.forward_log_det_jacobian(x).eval(
+              feed_dict={x: real_x}),
           atol=0.,
           rtol=1e-7)
 
   def testShapeGetters(self):
     with self.test_session():
-      for x, y, b in ((tensor_shape.TensorShape([]),
-                       tensor_shape.TensorShape([2]),
-                       SoftmaxCentered(
-                           event_ndims=0, validate_args=True)),
-                      (tensor_shape.TensorShape([4]),
-                       tensor_shape.TensorShape([5]),
-                       SoftmaxCentered(
-                           event_ndims=1, validate_args=True))):
-        self.assertAllEqual(y, b.forward_event_shape(x))
-        self.assertAllEqual(y.as_list(),
-                            b.forward_event_shape_tensor(x.as_list()).eval())
-        self.assertAllEqual(x, b.inverse_event_shape(y))
-        self.assertAllEqual(x.as_list(),
-                            b.inverse_event_shape_tensor(y.as_list()).eval())
+      x = tensor_shape.TensorShape([4])
+      y = tensor_shape.TensorShape([5])
+      bijector = SoftmaxCentered(validate_args=True)
+      self.assertAllEqual(y, bijector.forward_event_shape(x))
+      self.assertAllEqual(y.as_list(),
+                          bijector.forward_event_shape_tensor(
+                              x.as_list()).eval())
+      self.assertAllEqual(x, bijector.inverse_event_shape(y))
+      self.assertAllEqual(x.as_list(),
+                          bijector.inverse_event_shape_tensor(
+                              y.as_list()).eval())
 
   def testBijectiveAndFinite(self):
     with self.test_session():
-      softmax = SoftmaxCentered(event_ndims=1)
+      softmax = SoftmaxCentered()
       x = np.linspace(-50, 50, num=10).reshape(5, 2).astype(np.float32)
       # Make y values on the simplex with a wide range.
       y_0 = np.ones(5).astype(np.float32)
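
Reviewer note on the updated `SoftmaxCentered` tests: the expected values in `testBijectorVector` can be reproduced outside TensorFlow. The sketch below is a NumPy re-derivation, not the library's implementation. It assumes the forward map appends a zero to the last axis and applies softmax, which is what the test data implies. The helper names `softmax_centered_forward` and `softmax_centered_inverse` are hypothetical.

```python
import numpy as np


def softmax_centered_forward(x):
    """Maps R^n to the (n+1)-simplex: append a 0 logit, then softmax."""
    x = np.asarray(x, dtype=np.float64)
    zeros = np.zeros(x.shape[:-1] + (1,))
    padded = np.concatenate([x, zeros], axis=-1)
    # Subtract the max for numerical stability; softmax is shift-invariant.
    padded = padded - padded.max(axis=-1, keepdims=True)
    e = np.exp(padded)
    return e / e.sum(axis=-1, keepdims=True)


def softmax_centered_inverse(y):
    """Recovers the n logits as log(y[i] / y[n])."""
    y = np.asarray(y, dtype=np.float64)
    return np.log(y[..., :-1]) - np.log(y[..., -1:])


# Same inputs as the test in this hunk.
x = np.log([[2., 3., 4.], [4., 8., 12.]])
y = softmax_centered_forward(x)
x_roundtrip = softmax_centered_inverse(y)

# The test expects ildj = -sum(log(y)). Check the forward log-det for the
# first row directly from the n x n Jacobian diag(y[:n]) - y[:n] y[:n]^T.
n = x.shape[-1]
y0 = y[0]
jacobian = np.diag(y0[:n]) - np.outer(y0[:n], y0[:n])
sign, fldj = np.linalg.slogdet(jacobian)
ildj_expected = -np.sum(np.log(y), axis=1)
```

The Jacobian check uses the first `n` simplex coordinates as the chart. The determinant equals the product of all `n + 1` entries of `y`, which is why the test compares against `-np.sum(np.log(y), axis=1)`.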
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/square_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/square_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..f03d6f1343a11ae4517f9034ceb0c99ca6fe7fa2
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/square_test.py
@@ -0,0 +1,58 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.distributions.python.ops import bijectors
+from tensorflow.python.ops.distributions.bijector_test_util import assert_scalar_congruency
+from tensorflow.python.platform import test
+
+
+class SquareBijectorTest(test.TestCase):
+  """Tests the correctness of the Y = X ** 2 transformation."""
+
+  def testBijectorScalar(self):
+    with self.test_session():
+      bijector = bijectors.Square(validate_args=True)
+      self.assertEqual("square", bijector.name)
+      x = [[[1., 5],
+            [2, 1]],
+           [[np.sqrt(2.), 3],
+            [np.sqrt(8.), 1]]]
+      y = np.square(x)
+      ildj = -np.log(2.) - np.log(x)
+      self.assertAllClose(y, bijector.forward(x).eval())
+      self.assertAllClose(x, bijector.inverse(y).eval())
+      self.assertAllClose(
+          ildj, bijector.inverse_log_det_jacobian(y).eval(), atol=0., rtol=1e-7)
+      self.assertAllClose(
+          -bijector.inverse_log_det_jacobian(y).eval(),
+          bijector.forward_log_det_jacobian(x).eval(),
+          atol=0.,
+          rtol=1e-7)
+
+  def testScalarCongruency(self):
+    with self.test_session():
+      bijector = bijectors.Square(validate_args=True)
+      assert_scalar_congruency(bijector, lower_x=1e-3, upper_x=1.5, rtol=0.05)
+
+
+if __name__ == "__main__":
+  test.main()
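
Reviewer note on `square_test.py`: the expected `ildj = -np.log(2.) - np.log(x)` follows from `x = sqrt(y)`, so `dx/dy = 1 / (2 * sqrt(y)) = 1 / (2 * x)`. The sketch below is a NumPy check with hypothetical helper names, not the `Square` bijector's code. It compares the closed form with a central finite difference, assuming strictly positive inputs as in the test.

```python
import numpy as np


def square_forward(x):
    return np.square(x)


def square_inverse(y):
    # Only valid on the positive branch, matching the test's inputs.
    return np.sqrt(y)


def square_inverse_log_det_jacobian(y):
    # log |dx/dy| = log(1 / (2 * sqrt(y))) = -log(2) - 0.5 * log(y)
    return -np.log(2.0) - 0.5 * np.log(y)


x = np.array([[1.0, 5.0], [2.0, 1.0]])
y = square_forward(x)
ildj = square_inverse_log_det_jacobian(y)

# Central finite difference of the inverse map, elementwise.
step = 1e-6
numeric_ildj = np.log(
    (square_inverse(y + step) - square_inverse(y - step)) / (2.0 * step))
```

Written in terms of `x`, the closed form is `-log(2) - log(x)`, which is the expression the test uses.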
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/independent_test.py b/tensorflow/contrib/distributions/python/kernel_tests/independent_test.py
index 06318ca09dec851cf025fa35c83732b85824cbee..6a69f9e60b99a17c657f074597a075890265a93b 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/independent_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/independent_test.py
@@ -27,6 +27,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bernoulli as bernoulli_lib
+from tensorflow.python.ops.distributions import kullback_leibler
 from tensorflow.python.ops.distributions import normal as normal_lib
 from tensorflow.python.platform import test
 from tensorflow.python.platform import tf_logging
@@ -126,6 +127,100 @@ class ProductDistributionTest(test.TestCase):
       self.assertAllClose(sample_entropy_, actual_entropy_, rtol=0.01, atol=0.)
       self.assertAllClose(loc, actual_mode_, rtol=1e-6, atol=0.)
 
+  def testKLRaises(self):
+    ind1 = independent_lib.Independent(
+        distribution=normal_lib.Normal(
+            loc=np.float32([-1., 1]),
+            scale=np.float32([0.1, 0.5])),
+        reinterpreted_batch_ndims=1)
+    ind2 = independent_lib.Independent(
+        distribution=normal_lib.Normal(
+            loc=np.float32(-1),
+            scale=np.float32(0.5)),
+        reinterpreted_batch_ndims=0)
+
+    with self.assertRaisesRegexp(
+        ValueError, "Event shapes do not match"):
+      kullback_leibler.kl_divergence(ind1, ind2)
+
+    ind1 = independent_lib.Independent(
+        distribution=normal_lib.Normal(
+            loc=np.float32([-1., 1]),
+            scale=np.float32([0.1, 0.5])),
+        reinterpreted_batch_ndims=1)
+    ind2 = independent_lib.Independent(
+        distribution=mvn_diag_lib.MultivariateNormalDiag(
+            loc=np.float32([-1., 1]),
+            scale_diag=np.float32([0.1, 0.5])),
+        reinterpreted_batch_ndims=0)
+
+    with self.assertRaisesRegexp(
+        NotImplementedError, "different event shapes"):
+      kullback_leibler.kl_divergence(ind1, ind2)
+
+  def testKLScalarToMultivariate(self):
+    normal1 = normal_lib.Normal(
+        loc=np.float32([-1., 1]),
+        scale=np.float32([0.1, 0.5]))
+    ind1 = independent_lib.Independent(
+        distribution=normal1, reinterpreted_batch_ndims=1)
+
+    normal2 = normal_lib.Normal(
+        loc=np.float32([-3., 3]),
+        scale=np.float32([0.3, 0.3]))
+    ind2 = independent_lib.Independent(
+        distribution=normal2, reinterpreted_batch_ndims=1)
+
+    normal_kl = kullback_leibler.kl_divergence(normal1, normal2)
+    ind_kl = kullback_leibler.kl_divergence(ind1, ind2)
+    self.assertAllClose(
+        self.evaluate(math_ops.reduce_sum(normal_kl, axis=-1)),
+        self.evaluate(ind_kl))
+
+  def testKLIdentity(self):
+    normal1 = normal_lib.Normal(
+        loc=np.float32([-1., 1]),
+        scale=np.float32([0.1, 0.5]))
+    # This is functionally just a wrapper around normal1,
+    # and doesn't change any outputs.
+    ind1 = independent_lib.Independent(
+        distribution=normal1, reinterpreted_batch_ndims=0)
+
+    normal2 = normal_lib.Normal(
+        loc=np.float32([-3., 3]),
+        scale=np.float32([0.3, 0.3]))
+    # This is functionally just a wrapper around normal2,
+    # and doesn't change any outputs.
+    ind2 = independent_lib.Independent(
+        distribution=normal2, reinterpreted_batch_ndims=0)
+
+    normal_kl = kullback_leibler.kl_divergence(normal1, normal2)
+    ind_kl = kullback_leibler.kl_divergence(ind1, ind2)
+    self.assertAllClose(
+        self.evaluate(normal_kl), self.evaluate(ind_kl))
+
+  def testKLMultivariateToMultivariate(self):
+    # (1, 1, 2) batch of MVNDiag
+    mvn1 = mvn_diag_lib.MultivariateNormalDiag(
+        loc=np.float32([[[[-1., 1, 3.], [2., 4., 3.]]]]),
+        scale_diag=np.float32([[[0.2, 0.1, 5.], [2., 3., 4.]]]))
+    ind1 = independent_lib.Independent(
+        distribution=mvn1, reinterpreted_batch_ndims=2)
+
+    # (1, 1, 2) batch of MVNDiag
+    mvn2 = mvn_diag_lib.MultivariateNormalDiag(
+        loc=np.float32([[[[-2., 3, 2.], [1., 3., 2.]]]]),
+        scale_diag=np.float32([[[0.1, 0.5, 3.], [1., 2., 1.]]]))
+
+    ind2 = independent_lib.Independent(
+        distribution=mvn2, reinterpreted_batch_ndims=2)
+
+    mvn_kl = kullback_leibler.kl_divergence(mvn1, mvn2)
+    ind_kl = kullback_leibler.kl_divergence(ind1, ind2)
+    self.assertAllClose(
+        self.evaluate(math_ops.reduce_sum(mvn_kl, axis=[-1, -2])),
+        self.evaluate(ind_kl))
+
   def _testMnistLike(self, static_shape):
     sample_shape = [4, 5]
     batch_shape = [10]
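
Reviewer note on the new `Independent` KL tests: they assert that the KL divergence of two `Independent` distributions equals the base KL summed over the reinterpreted batch dimensions. The sketch below illustrates that identity in NumPy for normal base distributions. The function names are hypothetical and this is not the registered KL function from the library.

```python
import numpy as np


def kl_normal_normal(loc1, scale1, loc2, scale2):
    """Elementwise KL(N(loc1, scale1) || N(loc2, scale2))."""
    loc1, scale1, loc2, scale2 = [
        np.asarray(v, dtype=np.float64) for v in (loc1, scale1, loc2, scale2)]
    return (np.log(scale2 / scale1)
            + (scale1 ** 2 + (loc1 - loc2) ** 2) / (2.0 * scale2 ** 2)
            - 0.5)


def kl_independent_normal(loc1, scale1, loc2, scale2,
                          reinterpreted_batch_ndims):
    """Sums the base KL over the rightmost reinterpreted batch dims."""
    kl = kl_normal_normal(loc1, scale1, loc2, scale2)
    if reinterpreted_batch_ndims == 0:
        return kl  # Independent is then just a wrapper, as in testKLIdentity.
    axes = tuple(range(-reinterpreted_batch_ndims, 0))
    return kl.sum(axis=axes)


# Parameters taken from testKLScalarToMultivariate.
base_kl = kl_normal_normal([-1., 1.], [0.1, 0.5], [-3., 3.], [0.3, 0.3])
ind_kl = kl_independent_normal(
    [-1., 1.], [0.1, 0.5], [-3., 3.], [0.3, 0.3],
    reinterpreted_batch_ndims=1)
```

The sum is valid because the joint density factorizes over the reinterpreted dimensions, so the log-ratio inside the expectation is a sum of per-coordinate terms.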
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/statistical_testing_test.py b/tensorflow/contrib/distributions/python/kernel_tests/statistical_testing_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..3548ac18078a0b40f117c2bf9e2b34d20cee163b
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/statistical_testing_test.py
@@ -0,0 +1,166 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the statistical testing library."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.distributions.python.ops import statistical_testing as st
+from tensorflow.python.framework import errors
+from tensorflow.python.ops import check_ops
+from tensorflow.python.platform import test
+
+
+class StatisticalTestingTest(test.TestCase):
+
+  def test_dkwm_design_mean_one_sample_soundness(self):
+    numbers = [1e-5, 1e-2, 1.1e-1, 0.9, 1., 1.02, 2., 10., 1e2, 1e5, 1e10]
+    rates = [1e-6, 1e-3, 1e-2, 1.1e-1, 0.2, 0.5, 0.7, 1.]
+    with self.test_session() as sess:
+      for ff in rates:
+        for fp in rates:
+          sufficient_n = st.min_num_samples_for_dkwm_mean_test(
+              numbers, 0., 1., false_fail_rate=ff, false_pass_rate=fp)
+          detectable_d = st.min_discrepancy_of_true_means_detectable_by_dkwm(
+              sufficient_n, 0., 1., false_fail_rate=ff, false_pass_rate=fp)
+          sess.run(check_ops.assert_less_equal(detectable_d, numbers))
+
+  def test_dkwm_design_mean_two_sample_soundness(self):
+    numbers = [1e-5, 1e-2, 1.1e-1, 0.9, 1., 1.02, 2., 10., 1e2, 1e5, 1e10]
+    rates = [1e-6, 1e-3, 1e-2, 1.1e-1, 0.2, 0.5, 0.7, 1.]
+    with self.test_session() as sess:
+      for ff in rates:
+        for fp in rates:
+          (sufficient_n1,
+           sufficient_n2) = st.min_num_samples_for_dkwm_mean_two_sample_test(
+               numbers, 0., 1., 0., 1.,
+               false_fail_rate=ff, false_pass_rate=fp)
+          d_fn = st.min_discrepancy_of_true_means_detectable_by_dkwm_two_sample
+          detectable_d = d_fn(
+              sufficient_n1, 0., 1., sufficient_n2, 0., 1.,
+              false_fail_rate=ff, false_pass_rate=fp)
+          sess.run(check_ops.assert_less_equal(detectable_d, numbers))
+
+  def test_true_mean_confidence_interval_by_dkwm_one_sample(self):
+    rng = np.random.RandomState(seed=0)
+
+    num_samples = 5000
+    # 5000 samples is chosen to be enough to find discrepancies of
+    # size 0.1 or more with assurance 1e-6, as confirmed here:
+    with self.test_session() as sess:
+      d = st.min_discrepancy_of_true_means_detectable_by_dkwm(
+          num_samples, 0., 1., false_fail_rate=1e-6, false_pass_rate=1e-6)
+      d = sess.run(d)
+      self.assertLess(d, 0.1)
+
+    # Test that the confidence interval computed for the mean includes
+    # 0.5 and excludes 0.4 and 0.6.
+    with self.test_session() as sess:
+      samples = rng.uniform(size=num_samples).astype(np.float32)
+      (low, high) = st.true_mean_confidence_interval_by_dkwm(
+          samples, 0., 1., error_rate=1e-6)
+      low, high = sess.run([low, high])
+      self.assertGreater(low, 0.4)
+      self.assertLess(low, 0.5)
+      self.assertGreater(high, 0.5)
+      self.assertLess(high, 0.6)
+
+  def test_dkwm_mean_one_sample_assertion(self):
+    rng = np.random.RandomState(seed=0)
+    num_samples = 5000
+
+    # Test that the test assertion agrees that the mean of the standard
+    # uniform distribution is 0.5.
+    samples = rng.uniform(size=num_samples).astype(np.float32)
+    with self.test_session() as sess:
+      sess.run(st.assert_true_mean_equal_by_dkwm(
+          samples, 0., 1., 0.5, false_fail_rate=1e-6))
+
+      # Test that the test assertion confirms that the mean of the
+      # standard uniform distribution is not 0.4.
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.assert_true_mean_equal_by_dkwm(
+            samples, 0., 1., 0.4, false_fail_rate=1e-6))
+
+      # Test that the test assertion confirms that the mean of the
+      # standard uniform distribution is not 0.6.
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.assert_true_mean_equal_by_dkwm(
+            samples, 0., 1., 0.6, false_fail_rate=1e-6))
+
+  def test_dkwm_mean_two_sample_assertion(self):
+    rng = np.random.RandomState(seed=0)
+    num_samples = 15000
+
+    # 15000 samples is chosen to be enough to find discrepancies of
+    # size 0.1 or more with assurance 1e-6, as confirmed here:
+    with self.test_session() as sess:
+      d = st.min_discrepancy_of_true_means_detectable_by_dkwm_two_sample(
+          num_samples, 0., 1., num_samples, 0., 1.,
+          false_fail_rate=1e-6, false_pass_rate=1e-6)
+      d = sess.run(d)
+      self.assertLess(d, 0.1)
+
+    # Test that the test assertion agrees that the standard
+    # uniform distribution has the same mean as itself.
+    samples1 = rng.uniform(size=num_samples).astype(np.float32)
+    samples2 = rng.uniform(size=num_samples).astype(np.float32)
+    with self.test_session() as sess:
+      sess.run(st.assert_true_mean_equal_by_dkwm_two_sample(
+          samples1, 0., 1., samples2, 0., 1., false_fail_rate=1e-6))
+
+      # Test that the test assertion confirms that the mean of the
+      # standard uniform distribution is different from the mean of beta(2, 1).
+      beta_high_samples = rng.beta(2, 1, size=num_samples).astype(np.float32)
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.assert_true_mean_equal_by_dkwm_two_sample(
+            samples1, 0., 1.,
+            beta_high_samples, 0., 1.,
+            false_fail_rate=1e-6))
+
+      # Test that the test assertion confirms that the mean of the
+      # standard uniform distribution is different from the mean of beta(1, 2).
+      beta_low_samples = rng.beta(1, 2, size=num_samples).astype(np.float32)
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.assert_true_mean_equal_by_dkwm_two_sample(
+            samples1, 0., 1.,
+            beta_low_samples, 0., 1.,
+            false_fail_rate=1e-6))
+
+  def test_dkwm_argument_validity_checking(self):
+    rng = np.random.RandomState(seed=0)
+    samples = rng.uniform(size=5000).astype(np.float32)
+
+    # Test that the test library complains if the given samples fall
+    # outside the purported bounds.
+    with self.test_session() as sess:
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.true_mean_confidence_interval_by_dkwm(
+            samples, 0., 0.5, error_rate=0.5))
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.true_mean_confidence_interval_by_dkwm(
+            samples, 0.5, 1., error_rate=0.5))
+
+      # But doesn't complain if they don't.
+      op = st.true_mean_confidence_interval_by_dkwm(
+          samples, 0., 1., error_rate=0.5)
+      _ = sess.run(op)
+
+
+if __name__ == '__main__':
+  test.main()
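
Reviewer note on `statistical_testing_test.py`: the idea behind a DKWM-style confidence interval for a bounded mean can be sketched in NumPy. This is a simplified bound and is not claimed to match `st.true_mean_confidence_interval_by_dkwm`, which may compute a tighter interval. It assumes samples lie in `[low, high]` and uses the Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant: with probability at least `1 - error_rate`, the empirical CDF is within `eps = sqrt(log(2 / error_rate) / (2 * n))` of the true CDF everywhere, so the two means differ by at most `(high - low) * eps`. The function name `dkw_mean_interval` is hypothetical.

```python
import numpy as np


def dkw_mean_interval(samples, low, high, error_rate):
    """Conservative confidence interval for the mean of a bounded variable."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.min() < low or samples.max() > high:
        raise ValueError("samples fall outside the purported bounds")
    n = samples.size
    # DKW: P(sup |F_n - F| > eps) <= 2 * exp(-2 * n * eps**2).
    eps = np.sqrt(np.log(2.0 / error_rate) / (2.0 * n))
    half_width = (high - low) * eps
    mean = samples.mean()
    return max(low, mean - half_width), min(high, mean + half_width)


# Same setup as test_true_mean_confidence_interval_by_dkwm_one_sample.
rng = np.random.RandomState(seed=0)
samples = rng.uniform(size=5000)
ci_low, ci_high = dkw_mean_interval(samples, 0.0, 1.0, error_rate=1e-6)
```

With 5000 samples and `error_rate=1e-6`, the half-width is about 0.038, which is consistent with the test's expectation that the interval contains 0.5 and excludes 0.4 and 0.6.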
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/transformed_distribution_test.py b/tensorflow/contrib/distributions/python/kernel_tests/transformed_distribution_test.py
index cbaf74d3f66253ae5727e1ba579e2d49235b748e..f0ba1ec3eb57c67c1a0edb15639e91916a4509b7 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/transformed_distribution_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/transformed_distribution_test.py
@@ -186,12 +186,14 @@ class TransformedDistributionTest(test.TestCase):
       standard_normal = ds.Normal(loc=0., scale=1.)
       multi_logit_normal = self._cls()(
           distribution=standard_normal,
-          bijector=softmax)
-      x = [[-np.log(3.), 0.],
-           [np.log(3), np.log(5)]]
+          bijector=softmax,
+          event_shape=[1])
+      x = [[[-np.log(3.)], [0.]],
+           [[np.log(3)], [np.log(5)]]]
       y = softmax.forward(x).eval()
-      expected_log_pdf = (stats.norm(loc=0., scale=1.).logpdf(x) -
-                          np.sum(np.log(y), axis=-1))
+      expected_log_pdf = (
+          np.squeeze(stats.norm(loc=0., scale=1.).logpdf(x)) -
+          np.sum(np.log(y), axis=-1))
       self.assertAllClose(expected_log_pdf,
                           multi_logit_normal.log_prob(y).eval())
       self.assertAllClose(
@@ -245,9 +247,8 @@ class TransformedDistributionTest(test.TestCase):
     with self.test_session() as sess:
       exp2 = self._cls()(
           ds.Exponential(rate=0.25),
-          bijector=ds.bijectors.Affine(
-              scale_identity_multiplier=2.,
-              event_ndims=0))
+          bijector=ds.bijectors.AffineScalar(scale=2.)
+      )
       log_prob = exp2.log_prob(1.)
       log_prob_ = sess.run(log_prob)
       base_log_prob = -0.5 * 0.25 + np.log(0.25)
diff --git a/tensorflow/contrib/distributions/python/ops/autoregressive.py b/tensorflow/contrib/distributions/python/ops/autoregressive.py
index 852298bf334666db003353d5fc8e172ffb738668..69f3d57ff000d6c9acc8aa9e3d0ad8d9cbb6bb3c 100644
--- a/tensorflow/contrib/distributions/python/ops/autoregressive.py
+++ b/tensorflow/contrib/distributions/python/ops/autoregressive.py
@@ -36,7 +36,8 @@ class Autoregressive(distribution_lib.Distribution):
     "Autoregressive models decompose the joint density as a product of
     conditionals, and model each conditional in turn. Normalizing flows
     transform a base density (e.g. a standard Gaussian) into the target density
-    by an invertible transformation with tractable Jacobian." [1]
+    by an invertible transformation with tractable Jacobian." [(Papamakarios et
+    al., 2017)][1]
 
   In other words, the "autoregressive property" is equivalent to the
   decomposition, `p(x) = prod{ p(x[i] | x[0:i]) : i=0, ..., d }`. The provided
@@ -45,17 +46,18 @@ class Autoregressive(distribution_lib.Distribution):
 
   Practically speaking the autoregressive property means that there exists a
   permutation of the event coordinates such that each coordinate is a
-  diffeomorphic function of only preceding coordinates. [2]
+  diffeomorphic function of only preceding coordinates
+  [(van den Oord et al., 2016)][2].
 
   #### Mathematical Details
 
-  The probability function is,
+  The probability function is
 
   ```none
   prob(x; fn, n) = fn(x).prob(x)
   ```
 
-  And a sample is generated by,
+  And a sample is generated by
 
   ```none
   x = fn(...fn(fn(x0).sample()).sample()).sample()
@@ -93,13 +95,15 @@ class Autoregressive(distribution_lib.Distribution):
 
   ```
 
-  [1]: "Masked Autoregressive Flow for Density Estimation."
-       George Papamakarios, Theo Pavlakou, Iain Murray. Arxiv. 2017.
-       https://arxiv.org/abs/1705.07057
+  #### References
 
-  [2]: "Conditional Image Generation with PixelCNN Decoders."
-       Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex
-       Graves, Koray Kavukcuoglu. Arxiv, 2016.
+  [1]: George Papamakarios, Theo Pavlakou, and Iain Murray. Masked
+       Autoregressive Flow for Density Estimation. In _Neural Information
+       Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
+
+  [2]: Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt,
+       Alex Graves, and Koray Kavukcuoglu. Conditional Image Generation with
+       PixelCNN Decoders. In _Neural Information Processing Systems_, 2016.
        https://arxiv.org/abs/1606.05328
   """
 
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py b/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py
index 9437f56b1ebc76165edec224928baeb836277163..bc6b02542ebf3b83d58f888509dafb86351de8a7 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py
@@ -17,7 +17,9 @@
 @@AbsoluteValue
 @@Affine
 @@AffineLinearOperator
+@@AffineScalar
 @@Bijector
+@@BatchNormalization
 @@Chain
 @@CholeskyOuterProduct
 @@ConditionalBijector
@@ -33,10 +35,10 @@
 @@RealNVP
 @@Reshape
 @@Sigmoid
-@@SigmoidCentered
 @@SinhArcsinh
 @@SoftmaxCentered
 @@Softplus
+@@Square
 @@Weibull
 
 @@masked_autoregressive_default_template
@@ -53,6 +55,8 @@ from __future__ import print_function
 from tensorflow.contrib.distributions.python.ops.bijectors.absolute_value import *
 from tensorflow.contrib.distributions.python.ops.bijectors.affine import *
 from tensorflow.contrib.distributions.python.ops.bijectors.affine_linear_operator import *
+from tensorflow.contrib.distributions.python.ops.bijectors.affine_scalar import *
+from tensorflow.contrib.distributions.python.ops.bijectors.batch_normalization import *
 from tensorflow.contrib.distributions.python.ops.bijectors.chain import *
 from tensorflow.contrib.distributions.python.ops.bijectors.cholesky_outer_product import *
 from tensorflow.contrib.distributions.python.ops.bijectors.conditional_bijector import *
@@ -67,10 +71,10 @@ from tensorflow.contrib.distributions.python.ops.bijectors.power_transform impor
 from tensorflow.contrib.distributions.python.ops.bijectors.real_nvp import *
 from tensorflow.contrib.distributions.python.ops.bijectors.reshape import *
 from tensorflow.contrib.distributions.python.ops.bijectors.sigmoid import *
-from tensorflow.contrib.distributions.python.ops.bijectors.sigmoid_centered import *
 from tensorflow.contrib.distributions.python.ops.bijectors.sinh_arcsinh import *
 from tensorflow.contrib.distributions.python.ops.bijectors.softmax_centered import *
 from tensorflow.contrib.distributions.python.ops.bijectors.softplus import *
+from tensorflow.contrib.distributions.python.ops.bijectors.square import *
 from tensorflow.python.ops.distributions.bijector import *
 from tensorflow.python.ops.distributions.identity_bijector import Identity
 
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/affine.py b/tensorflow/contrib/distributions/python/ops/bijectors/affine.py
index 05bb9c2f9bdf35e222c94db3491157893da64ebd..bef7bbb49b715497695f7513e19ecab4fa56c47e 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/affine.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/affine.py
@@ -62,7 +62,7 @@ class Affine(bijector.Bijector):
   matrices, i.e., the matmul is [matrix-free](
   https://en.wikipedia.org/wiki/Matrix-free_methods) when possible.
 
-  Examples:
+  #### Examples
 
   ```python
   # Y = X
@@ -104,7 +104,6 @@ class Affine(bijector.Bijector):
                scale_tril=None,
                scale_perturb_factor=None,
                scale_perturb_diag=None,
-               event_ndims=1,
                validate_args=False,
                name="affine"):
     """Instantiates the `Affine` bijector.
@@ -157,8 +156,6 @@ class Affine(bijector.Bijector):
         matrix. `scale_perturb_diag` has shape [N1, N2, ...  r], which
         represents an `r x r` diagonal matrix. When `None` low rank updates will
         take the form `scale_perturb_factor * scale_perturb_factor.T`.
-      event_ndims: Scalar `int` `Tensor` indicating the number of dimensions
-        associated with a particular draw from the distribution. Must be 0 or 1.
       validate_args: Python `bool` indicating whether arguments should be
         checked for correctness.
       name: Python `str` name given to ops managed by this object.
@@ -187,23 +184,6 @@ class Affine(bijector.Bijector):
     with self._name_scope("init", values=[
         shift, scale_identity_multiplier, scale_diag, scale_tril,
         scale_perturb_diag, scale_perturb_factor]):
-      event_ndims = ops.convert_to_tensor(event_ndims, name="event_ndims")
-      event_ndims_const = tensor_util.constant_value(event_ndims)
-      if event_ndims_const is not None and event_ndims_const not in (0, 1):
-        raise ValueError("event_ndims(%s) was not 0 or 1" % event_ndims_const)
-      else:
-        if validate_args:
-          # Shape tool will catch if event_ndims is negative.
-          event_ndims = control_flow_ops.with_dependencies(
-              [check_ops.assert_less(
-                  event_ndims, 2, message="event_ndims must be 0 or 1")],
-              event_ndims)
-
-      if event_ndims_const == 0 and not self._is_only_identity_multiplier:
-        raise ValueError(
-            "If event_ndims == 0, the only scale argument you can pass is "
-            "scale_identity_multiplier.  All others operate on vectors.")
-
       # In the absence of `loc` and `scale`, we'll assume `dtype` is `float32`.
       dtype = dtypes.float32
 
@@ -251,12 +231,11 @@ class Affine(bijector.Bijector):
       self._scale = scale
       self._shaper = _DistributionShape(
           batch_ndims=batch_ndims,
-          event_ndims=event_ndims,
+          event_ndims=1,
           validate_args=validate_args)
       super(Affine, self).__init__(
-          event_ndims=event_ndims,
+          event_ndims=1,
           graph_parents=(
-              [event_ndims] +
               [self._scale] if tensor_util.is_tensor(self._scale)
               else self._scale.graph_parents +
               [self._shift] if self._shift is not None else []),
@@ -388,9 +367,7 @@ class Affine(bijector.Bijector):
     if self._is_only_identity_multiplier:
       # We don't pad in this case and instead let the fldj be applied
       # via broadcast.
-      event_size = distribution_util.pick_vector(
-          math_ops.equal(self._shaper.event_ndims, 0),
-          [1], array_ops.shape(x))[-1]
+      event_size = array_ops.shape(x)[-1]
       event_size = math_ops.cast(event_size, dtype=self._scale.dtype)
       return math_ops.log(math_ops.abs(self._scale)) * event_size
     return self.scale.log_abs_determinant()
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/affine_scalar.py b/tensorflow/contrib/distributions/python/ops/bijectors/affine_scalar.py
new file mode 100644
index 0000000000000000000000000000000000000000..8adaa54c843d1b243a02967402a37b7c63fabbdf
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/affine_scalar.py
@@ -0,0 +1,138 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""AffineScalar bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops.distributions import bijector
+
+
+__all__ = [
+    "AffineScalar",
+]
+
+
+class AffineScalar(bijector.Bijector):
+  """Compute `Y = g(X; shift, scale) = scale * X + shift`.
+
+  #### Examples
+
+  ```python
+  # Y = X
+  b = AffineScalar()
+
+  # Y = X + shift
+  b = AffineScalar(shift=[1., 2, 3])
+
+  # Y = 2 * X + shift
+  b = AffineScalar(
+    shift=[1., 2, 3],
+    scale=2.)
+  ```
+
+  """
+
+  def __init__(self,
+               shift=None,
+               scale=None,
+               validate_args=False,
+               name="affine_scalar"):
+    """Instantiates the `AffineScalar` bijector.
+
+    This `Bijector` is initialized with `shift` `Tensor` and `scale` arguments,
+    giving the forward operation:
+
+    ```none
+    Y = g(X) = scale * X + shift
+    ```
+
+    If `scale` is not specified, then the bijector has the semantics of
+    `scale = 1.`. Similarly, if `shift` is not specified, then the bijector
+    has the semantics of `shift = 0.`.
+
+    Args:
+      shift: Floating-point `Tensor`. If this is set to `None`, no shift is
+        applied.
+      scale: Floating-point `Tensor`. If this is set to `None`, no scale is
+        applied.
+      validate_args: Python `bool` indicating whether arguments should be
+        checked for correctness.
+      name: Python `str` name given to ops managed by this object.
+    """
+    self._graph_parents = []
+    self._name = name
+    self._validate_args = validate_args
+
+    with self._name_scope("init", values=[scale, shift]):
+      self._shift = shift
+      self._scale = scale
+
+      if self._shift is not None:
+        self._shift = ops.convert_to_tensor(shift, name="shift")
+
+      if self._scale is not None:
+        self._scale = ops.convert_to_tensor(self._scale, name="scale")
+        if validate_args:
+          self._scale = control_flow_ops.with_dependencies(
+              [check_ops.assert_none_equal(
+                  self._scale,
+                  array_ops.zeros([], dtype=self._scale.dtype))],
+              self._scale)
+
+      super(AffineScalar, self).__init__(
+          event_ndims=0,
+          is_constant_jacobian=True,
+          validate_args=validate_args,
+          name=name)
+
+  @property
+  def shift(self):
+    """The `shift` `Tensor` in `Y = scale * X + shift`."""
+    return self._shift
+
+  @property
+  def scale(self):
+    """The `scale` `Tensor` in `Y = scale * X + shift`."""
+    return self._scale
+
+  def _forward(self, x):
+    y = array_ops.identity(x)
+    if self.scale is not None:
+      y *= self.scale
+    if self.shift is not None:
+      y += self.shift
+    return y
+
+  def _inverse(self, y):
+    x = array_ops.identity(y)
+    if self.shift is not None:
+      x -= self.shift
+    if self.scale is not None:
+      x /= self.scale
+    return x
+
+  def _forward_log_det_jacobian(self, x):
+    log_det_jacobian = array_ops.zeros_like(x)
+    if self.scale is None:
+      return log_det_jacobian
+    log_det_jacobian += math_ops.log(math_ops.abs(self.scale))
+    return log_det_jacobian
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py b/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py
new file mode 100644
index 0000000000000000000000000000000000000000..33fdd32d7a0a01685690e598c69adca2c95972e9
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py
@@ -0,0 +1,259 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Batch Norm bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+
+import numpy as np
+
+from tensorflow.python.framework import ops
+from tensorflow.python.layers import normalization
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn
+from tensorflow.python.ops.distributions import bijector
+
+
+__all__ = [
+    "BatchNormalization",
+]
+
+
+def _undo_batch_normalization(x,
+                              mean,
+                              variance,
+                              offset,
+                              scale,
+                              variance_epsilon,
+                              name=None):
+  r"""Inverse of tf.nn.batch_normalization.
+
+  Args:
+    x: Input `Tensor` of arbitrary dimensionality.
+    mean: A mean `Tensor`.
+    variance: A variance `Tensor`.
+    offset: An offset `Tensor`, often denoted `beta` in equations, or
+      None. If present, will be added to the normalized tensor.
+    scale: A scale `Tensor`, often denoted `gamma` in equations, or
+      `None`. If present, the scale is applied to the normalized tensor.
+    variance_epsilon: A small `float` added to the minibatch `variance` to
+      prevent dividing by zero.
+    name: A name for this operation (optional).
+
+  Returns:
+    batch_unnormalized: The de-normalized, de-scaled, de-offset `Tensor`.
+  """
+  with ops.name_scope(
+      name, "undo_batchnorm", [x, mean, variance, scale, offset]):
+    # inv = math_ops.rsqrt(variance + variance_epsilon)
+    # if scale is not None:
+    #   inv *= scale
+    # return x * inv + (
+    #     offset - mean * inv if offset is not None else -mean * inv)
+    rescale = math_ops.sqrt(variance + variance_epsilon)
+    if scale is not None:
+      rescale /= scale
+    batch_unnormalized = x * rescale + (
+        mean - offset * rescale if offset is not None else mean)
+    return batch_unnormalized
+
+
+class BatchNormalization(bijector.Bijector):
+  """Compute `Y = g(X) s.t. X = g^-1(Y) = (Y - mean(Y)) / std(Y)`.
+
+  Applies Batch Normalization [(Ioffe and Szegedy, 2015)][1] to samples from a
+  data distribution. This can be used to stabilize training of normalizing
+  flows ([Papamakarios et al., 2017][3]; [Dinh et al., 2017][2]).
+
+  When training Deep Neural Networks (DNNs), it is common practice to
+  normalize or whiten features by shifting them to have zero mean and
+  scaling them to have unit variance.
+
+  The `inverse()` method of the `BatchNormalization` bijector, which is used in
+  the log-likelihood computation of data samples, implements the normalization
+  procedure (shift-and-scale) using the mean and standard deviation of the
+  current minibatch.
+
+  Conversely, the `forward()` method of the bijector de-normalizes samples (e.g.
+  `X*std(Y) + mean(Y)`) with the running-average mean and standard deviation
+  computed at training-time. De-normalization is useful for sampling.
+
+  ```python
+
+  dist = tfd.TransformedDistribution(
+      distribution=tfd.Normal(loc=0., scale=1.),
+      bijector=tfb.BatchNormalization())
+
+  y = tfd.Normal(loc=1., scale=2.).sample(100)  # ~ N(1, 2)
+  x = dist.bijector.inverse(y)  # ~ N(0, 1)
+  y = dist.sample()  # ~ N(1, 2)
+  ```
+
+  During training time, `BatchNormalization.inverse` and
+  `BatchNormalization.forward` are not guaranteed to be inverses of each other
+  because `inverse(y)` uses statistics of the current minibatch, while
+  `forward(x)` uses running-average statistics accumulated from training. In
+  other words, `BatchNormalization.inverse(BatchNormalization.forward(...))`
+  and `BatchNormalization.forward(BatchNormalization.inverse(...))` will be
+  identical when `training=False` but may be different when `training=True`.
+
+  #### References
+
+  [1]: Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating
+       Deep Network Training by Reducing Internal Covariate Shift. In
+       _International Conference on Machine Learning_, 2015.
+       https://arxiv.org/abs/1502.03167
+
+  [2]: Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density Estimation
+       using Real NVP. In _International Conference on Learning
+       Representations_, 2017. https://arxiv.org/abs/1605.08803
+
+  [3]: George Papamakarios, Theo Pavlakou, and Iain Murray. Masked
+       Autoregressive Flow for Density Estimation. In _Neural Information
+       Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
+  """
+
+  def __init__(self,
+               batchnorm_layer=None,
+               training=True,
+               validate_args=False,
+               name="batch_normalization"):
+    """Instantiates the `BatchNormalization` bijector.
+
+    Args:
+      batchnorm_layer: `tf.layers.BatchNormalization` layer object. If `None`,
+        defaults to `tf.layers.BatchNormalization(gamma_constraint=g)` with
+        `g = lambda x: tf.nn.relu(x) + 1e-6`, which ensures positivity of the
+        scale variable.
+
+      training: If True, updates running-average statistics during call to
+        `inverse()`.
+      validate_args: Python `bool` indicating whether arguments should be
+        checked for correctness.
+      name: Python `str` name given to ops managed by this object.
+    Raises:
+      ValueError: If `batchnorm_layer` is not an instance of
+        `tf.layers.BatchNormalization`, or if it is specified with `renorm=True`
+        or a virtual batch size.
+    """
+    # Scale must be positive.
+    g_constraint = lambda x: nn.relu(x) + 1e-6
+    self.batchnorm = batchnorm_layer or normalization.BatchNormalization(
+        gamma_constraint=g_constraint)
+    self._validate_bn_layer(self.batchnorm)
+    self._training = training
+    super(BatchNormalization, self).__init__(
+        validate_args=validate_args, name=name)
+
+  def _validate_bn_layer(self, layer):
+    """Check for valid BatchNormalization layer.
+
+    Args:
+      layer: Instance of `tf.layers.BatchNormalization`.
+    Raises:
+      ValueError: If batchnorm_layer argument is not an instance of
+        `tf.layers.BatchNormalization`, or if `batchnorm_layer.renorm=True` or
+        if `batchnorm_layer.virtual_batch_size` is specified.
+    """
+    if not isinstance(layer, normalization.BatchNormalization):
+      raise ValueError(
+          "batchnorm_layer must be an instance of BatchNormalization layer.")
+    if layer.renorm:
+      raise ValueError("BatchNorm Bijector does not support renormalization.")
+    if layer.virtual_batch_size:
+      raise ValueError(
+          "BatchNorm Bijector does not support virtual batch sizes.")
+
+  def _get_broadcast_fn(self, x):
+    # Compute shape to broadcast scale/shift parameters to.
+    if not x.shape.is_fully_defined():
+      raise ValueError("Input must have shape known at graph construction.")
+    input_shape = np.int32(x.shape.as_list())
+
+    ndims = len(input_shape)
+    # event_dims = self._compute_event_dims(x)
+    reduction_axes = [i for i in range(ndims) if i not in self.batchnorm.axis]
+    # Broadcasting only necessary for single-axis batch norm where the axis is
+    # not the last dimension.
+    broadcast_shape = [1] * ndims
+    broadcast_shape[self.batchnorm.axis[0]] = (
+        input_shape[self.batchnorm.axis[0]])
+    def _broadcast(v):
+      if (v is not None and
+          len(v.get_shape()) != ndims and
+          reduction_axes != list(range(ndims - 1))):
+        return array_ops.reshape(v, broadcast_shape)
+      return v
+    return _broadcast
+
+  def _normalize(self, y):
+    return self.batchnorm.apply(y, training=self._training)
+
+  def _de_normalize(self, x):
+    # Uses the saved statistics.
+    if not self.batchnorm.built:
+      input_shape = x.get_shape()
+      self.batchnorm.build(input_shape)
+    broadcast_fn = self._get_broadcast_fn(x)
+    mean = broadcast_fn(self.batchnorm.moving_mean)
+    variance = broadcast_fn(self.batchnorm.moving_variance)
+    beta = broadcast_fn(self.batchnorm.beta) if self.batchnorm.center else None
+    gamma = broadcast_fn(self.batchnorm.gamma) if self.batchnorm.scale else None
+    return _undo_batch_normalization(
+        x, mean, variance, beta, gamma, self.batchnorm.epsilon)
+
+  def _forward(self, x):
+    return self._de_normalize(x)
+
+  def _inverse(self, y):
+    return self._normalize(y)
+
+  def _forward_log_det_jacobian(self, x):
+    # Uses saved statistics to compute volume distortion.
+    return -self._inverse_log_det_jacobian(x, use_saved_statistics=True)
+
+  def _inverse_log_det_jacobian(self, y, use_saved_statistics=False):
+    if not y.shape.is_fully_defined():
+      raise ValueError("Input must have shape known at graph construction.")
+    input_shape = np.int32(y.shape.as_list())
+
+    if not self.batchnorm.built:
+      # Create variables.
+      self.batchnorm.build(input_shape)
+
+    event_dims = self.batchnorm.axis
+    reduction_axes = [i for i in range(len(input_shape)) if i not in event_dims]
+
+    if use_saved_statistics or not self._training:
+      log_variance = math_ops.log(
+          self.batchnorm.moving_variance + self.batchnorm.epsilon)
+    else:
+      # At training-time, ildj is computed from the mean and log-variance across
+      # the current minibatch.
+      _, v = nn.moments(y, axes=reduction_axes, keep_dims=True)
+      log_variance = math_ops.log(v + self.batchnorm.epsilon)
+
+    # `gamma` and `log Var(y)` reductions over event_dims.
+    # Log(total change in area from gamma term).
+    log_total_gamma = math_ops.reduce_sum(math_ops.log(self.batchnorm.gamma))
+
+    # Log(total change in area from log-variance term).
+    log_total_variance = math_ops.reduce_sum(log_variance)
+    # The ildj is scalar, as it does not depend on the values of y and is
+    # constant across minibatch elements.
+    return log_total_gamma - 0.5 * log_total_variance
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py b/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py
index cbd60f92a60612c6cf791b2c7708a3310c6e2b6b..8f09e16058b766c788ab3acced6940fd0026b521 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py
@@ -20,8 +20,6 @@ from __future__ import print_function
 
 import numpy as np
 
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
@@ -39,8 +37,6 @@ __all__ = [
 class CholeskyOuterProduct(bijector.Bijector):
   """Compute `g(X) = X @ X.T`; X is lower-triangular, positive-diagonal matrix.
 
-  `event_ndims` must be 0 or 2, i.e., scalar or matrix.
-
   Note: the upper-triangular part of X is ignored (whether or not its zero).
 
   The surjectivity of g as a map from  the set of n x n positive-diagonal
@@ -61,49 +57,34 @@ class CholeskyOuterProduct(bijector.Bijector):
   that, if `I = L_3 @ L_3.T`, with L_3 being lower-triangular with positive-
   diagonal, then `L_3 = I`. Thus, `L_1 = L_2`, proving injectivity of g.
 
-  Examples:
+  #### Examples
 
   ```python
-  bijector.CholeskyOuterProduct(event_ndims=2).forward(x=[[1., 0], [2, 1]])
+  bijector.CholeskyOuterProduct().forward(x=[[1., 0], [2, 1]])
   # Result: [[1., 2], [2, 5]], i.e., x @ x.T
 
-  bijector.CholeskyOuterProduct(event_ndims=2).inverse(y=[[1., 2], [2, 5]])
+  bijector.CholeskyOuterProduct().inverse(y=[[1., 2], [2, 5]])
   # Result: [[1., 0], [2, 1]], i.e., cholesky(y).
   ```
 
   """
 
-  def __init__(self, event_ndims=2, validate_args=False,
-               name="cholesky_outer_product"):
+  def __init__(self, validate_args=False, name="cholesky_outer_product"):
     """Instantiates the `CholeskyOuterProduct` bijector.
 
     Args:
-      event_ndims: `constant` `int32` scalar `Tensor` indicating the number of
-        dimensions associated with a particular draw from the distribution. Must
-        be 0 or 2.
       validate_args: Python `bool` indicating whether arguments should be
         checked for correctness.
       name: Python `str` name given to ops managed by this object.
-
-    Raises:
-      ValueError: if event_ndims is neither 0 or 2.
     """
     self._graph_parents = []
     self._name = name
-    with self._name_scope("init", values=[event_ndims]):
-      event_ndims = ops.convert_to_tensor(event_ndims, name="event_ndims")
-      event_ndims = tensor_util.constant_value(event_ndims)
-    if event_ndims is None or event_ndims not in [0, 2]:
-      raise ValueError("`event_ndims` must be a TF constant which is 0 or 2")
-    self._static_event_ndims = event_ndims
     super(CholeskyOuterProduct, self).__init__(
-        event_ndims=event_ndims,
+        event_ndims=2,
         validate_args=validate_args,
         name=name)
 
   def _forward(self, x):
-    if self._static_event_ndims == 0:
-      return math_ops.square(x)
     if self.validate_args:
       is_matrix = check_ops.assert_rank_at_least(x, 2)
       shape = array_ops.shape(x)
@@ -114,11 +95,7 @@ class CholeskyOuterProduct(bijector.Bijector):
     return math_ops.matmul(x, x, adjoint_b=True)
 
   def _inverse(self, y):
-    return (math_ops.sqrt(y) if self._static_event_ndims == 0
-            else linalg_ops.cholesky(y))
-
-  def _inverse_log_det_jacobian(self, y):
-    return -self._forward_log_det_jacobian(x=self._inverse(y))
+    return linalg_ops.cholesky(y)
 
   def _forward_log_det_jacobian(self, x):
     # Let Y be a symmetric, positive definite matrix and write:
@@ -161,13 +138,6 @@ class CholeskyOuterProduct(bijector.Bijector):
     # Since there is a 2 X[j,j] term for every lower-triangular element of X we
     # conclude:
     #   |Jac(d vec[Y]/d vec[X])| = 2^p prod_{j=0}^{p-1} X[j,j]^{p-j}.
-    if self._static_event_ndims == 0:
-      if self.validate_args:
-        is_positive = check_ops.assert_positive(
-            x, message="All elements must be positive.")
-        x = control_flow_ops.with_dependencies([is_positive], x)
-      return np.log(2.) + math_ops.log(x)
-
     diag = array_ops.matrix_diag_part(x)
 
     # We now ensure diag is columnar. Eg, if `diag = [1, 2, 3]` then the output
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py b/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
index 5251dbcb5748f75688aa43ce6e4e9dbd76be78bb..84b2340c75514c3d2c12bf4d775ba74450a0dc26 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
@@ -45,14 +45,15 @@ __all__ = [
 class MaskedAutoregressiveFlow(bijector_lib.Bijector):
   """Affine MaskedAutoregressiveFlow bijector for vector-valued events.
 
-  The affine autoregressive flow [1] provides a relatively simple framework for
-  user-specified (deep) architectures to learn a distribution over vector-valued
-  events. Regarding terminology,
+  The affine autoregressive flow [(Papamakarios et al., 2017)][3] provides a
+  relatively simple framework for user-specified (deep) architectures to learn
+  a distribution over vector-valued events. Regarding terminology,
 
     "Autoregressive models decompose the joint density as a product of
     conditionals, and model each conditional in turn. Normalizing flows
     transform a base density (e.g. a standard Gaussian) into the target density
-    by an invertible transformation with tractable Jacobian." [1]
+    by an invertible transformation with tractable Jacobian."
+    [(Papamakarios et al., 2017)][3]
 
   In other words, the "autoregressive property" is equivalent to the
   decomposition, `p(x) = prod{ p(x[i] | x[0:i]) : i=0, ..., d }`. The provided
@@ -75,26 +76,26 @@ class MaskedAutoregressiveFlow(bijector_lib.Bijector):
 
   Given a `shift_and_log_scale_fn`, the forward and inverse transformations are
   (a sequence of) affine transformations. A "valid" `shift_and_log_scale_fn`
-  must compute each `shift` (aka `loc` or "mu" [2]) and `log(scale)` (aka
-  "alpha" [2]) such that each are broadcastable with the arguments to `forward`
-  and `inverse`, i.e., such that the calculations in `forward`, `inverse`
-  [below] are possible.
+  must compute each `shift` (aka `loc` or "mu" in [Germain et al. (2015)][1])
+  and `log(scale)` (aka "alpha" in [Germain et al. (2015)][1]) such that each
+  are broadcastable with the arguments to `forward` and `inverse`, i.e., such
+  that the calculations in `forward`, `inverse` [below] are possible.
 
   For convenience, `masked_autoregressive_default_template` is offered as a
   possible `shift_and_log_scale_fn` function. It implements the MADE
-  architecture [2]. MADE is a feed-forward network that computes a `shift` and
-  `log(scale)` using `masked_dense` layers in a deep neural network. Weights are
-  masked to ensure the autoregressive property. It is possible that this
-  architecture is suboptimal for your task. To build alternative networks,
-  either change the arguments to `masked_autoregressive_default_template`, use
-  the `masked_dense` function to roll-out your own, or use some other
-  architecture, e.g., using `tf.layers`.
+  architecture [(Germain et al., 2015)][1]. MADE is a feed-forward network that
+  computes a `shift` and `log(scale)` using `masked_dense` layers in a deep
+  neural network. Weights are masked to ensure the autoregressive property. It
+  is possible that this architecture is suboptimal for your task. To build
+  alternative networks, either change the arguments to
+  `masked_autoregressive_default_template`, use the `masked_dense` function to
+  roll-out your own, or use some other architecture, e.g., using `tf.layers`.
 
   Warning: no attempt is made to validate that the `shift_and_log_scale_fn`
   enforces the "autoregressive property".
 
   Assuming `shift_and_log_scale_fn` has valid shape and autoregressive
-  semantics, the forward transformation is,
+  semantics, the forward transformation is
 
   ```python
   def forward(x):
@@ -106,7 +107,7 @@ class MaskedAutoregressiveFlow(bijector_lib.Bijector):
     return y
   ```
 
-  and the inverse transformation is,
+  and the inverse transformation is
 
   ```python
   def inverse(y):
@@ -121,7 +122,7 @@ class MaskedAutoregressiveFlow(bijector_lib.Bijector):
   the "last" `y` used to compute `shift`, `log_scale`. (Roughly speaking, this
   also proves the transform is bijective.)
 
-  #### Example Use
+  #### Examples
 
   ```python
   tfd = tf.contrib.distributions
@@ -142,7 +143,8 @@ class MaskedAutoregressiveFlow(bijector_lib.Bijector):
   maf.log_prob(x)   # Almost free; uses Bijector caching.
   maf.log_prob(0.)  # Cheap; no `tf.while_loop` despite no Bijector caching.
 
-  # [1] also describes an "Inverse Autoregressive Flow", e.g.,
+  # [Papamakarios et al. (2017)][3] also describe an Inverse Autoregressive
+  # Flow [(Kingma et al., 2016)][2]:
   iaf = tfd.TransformedDistribution(
       distribution=tfd.Normal(loc=0., scale=1.),
       bijector=tfb.Invert(tfb.MaskedAutoregressiveFlow(
@@ -168,14 +170,20 @@ class MaskedAutoregressiveFlow(bijector_lib.Bijector):
       event_shape=[dims])
   ```
 
-  [1]: "Masked Autoregressive Flow for Density Estimation."
-       George Papamakarios, Theo Pavlakou, Iain Murray. Arxiv. 2017.
-       https://arxiv.org/abs/1705.07057
+  #### References
 
-  [2]: "MADE: Masked Autoencoder for Distribution Estimation."
-       Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle. ICML. 2015.
-       https://arxiv.org/abs/1502.03509
+  [1]: Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE:
+       Masked Autoencoder for Distribution Estimation. In _International
+       Conference on Machine Learning_, 2015. https://arxiv.org/abs/1502.03509
 
+  [2]: Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya
+       Sutskever, and Max Welling. Improving Variational Inference with Inverse
+       Autoregressive Flow. In _Neural Information Processing Systems_, 2016.
+       https://arxiv.org/abs/1606.04934
+
+  [3]: George Papamakarios, Theo Pavlakou, and Iain Murray. Masked
+       Autoregressive Flow for Density Estimation. In _Neural Information
+       Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
   """
 
   def __init__(self,
@@ -329,11 +337,7 @@ def masked_dense(inputs,
                  **kwargs):
   """A autoregressively masked dense layer. Analogous to `tf.layers.dense`.
 
-  See [1] for detailed explanation.
-
-  [1]: "MADE: Masked Autoencoder for Distribution Estimation."
-       Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle. ICML. 2015.
-       https://arxiv.org/abs/1502.03509
+  See [Germain et al. (2015)][1] for detailed explanation.
 
   Arguments:
     inputs: Tensor input.
@@ -358,6 +362,12 @@ def masked_dense(inputs,
   Raises:
     NotImplementedError: if rightmost dimension of `inputs` is unknown prior to
       graph execution.
+
+  #### References
+
+  [1]: Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE:
+       Masked Autoencoder for Distribution Estimation. In _International
+       Conference on Machine Learning_, 2015. https://arxiv.org/abs/1502.03509
   """
   # TODO(b/67594795): Better support of dynamic shape.
   input_depth = inputs.shape.with_rank_at_least(1)[-1].value
@@ -398,23 +408,24 @@ def masked_autoregressive_default_template(
     name=None,
     *args,
     **kwargs):
-  """Build the MADE Model [1].
+  """Build the Masked Autoencoder for Distribution Estimation (MADE) network.
 
   This will be wrapped in a make_template to ensure the variables are only
-  created once. It takes the input and returns the `loc` ("mu" [1]) and
-  `log_scale` ("alpha" [1]) from the MADE network.
+  created once. It takes the input and returns the `loc` ("mu" in [Germain et
+  al. (2015)][1]) and `log_scale` ("alpha" in [Germain et al. (2015)][1]) from
+  the MADE network.
 
   Warning: This function uses `masked_dense` to create randomly initialized
   `tf.Variables`. It is presumed that these will be fit, just as you would any
   other neural architecture which uses `tf.layers.dense`.
 
-  #### About Hidden Layers:
+  #### About Hidden Layers
 
   Each element of `hidden_layers` should be greater than the `input_depth`
   (i.e., `input_depth = tf.shape(input)[-1]` where `input` is the input to the
   neural network). This is necessary to ensure the autoregressivity property.
 
-  #### About Clipping:
+  #### About Clipping
 
   This function also optionally clips the `log_scale` (but possibly not its
   gradient). This is useful because if `log_scale` is too small/large it might
@@ -427,11 +438,7 @@ def masked_autoregressive_default_template(
   `grad[exp(clip(x))] = grad[x] exp(clip(x))` rather than the usual
   `grad[clip(x)] exp(clip(x))`.
 
-  [1]: "MADE: Masked Autoencoder for Distribution Estimation."
-       Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle. ICML. 2015.
-       https://arxiv.org/abs/1502.03509
-
-  Arguments:
+  Args:
     hidden_layers: Python `list`-like of non-negative integer, scalars
       indicating the number of units in each hidden layer. Default: `[512, 512].
     shift_only: Python `bool` indicating if only the `shift` term shall be
@@ -450,12 +457,20 @@ def masked_autoregressive_default_template(
     **kwargs: `tf.layers.dense` keyword arguments.
 
   Returns:
-    shift: `Float`-like `Tensor` of shift terms (the "mu" in [2]).
-    log_scale: `Float`-like `Tensor` of log(scale) terms (the "alpha" in [2]).
+    shift: `Float`-like `Tensor` of shift terms (the "mu" in
+      [Germain et al. (2015)][1]).
+    log_scale: `Float`-like `Tensor` of log(scale) terms (the "alpha" in
+      [Germain et al. (2015)][1]).
 
   Raises:
     NotImplementedError: if rightmost dimension of `inputs` is unknown prior to
       graph execution.
+
+  #### References
+
+  [1]: Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE:
+       Masked Autoencoder for Distribution Estimation. In _International
+       Conference on Machine Learning_, 2015. https://arxiv.org/abs/1502.03509
   """
 
   with ops.name_scope(name, "masked_autoregressive_default_template",
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py b/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py
index 2840f52e742eac5e9e37a576bf7f6d6f05a07a35..71ab369d01aafc33854a2c2437f96bbb493cc6fb 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py
@@ -38,7 +38,7 @@ class RealNVP(bijector_lib.Bijector):
   """RealNVP "affine coupling layer" for vector-valued events.
 
   Real NVP models a normalizing flow on a `D`-dimensional distribution via a
-  single `D-d`-dimensional conditional distribution [1]:
+  single `D-d`-dimensional conditional distribution [(Dinh et al., 2017)][1]:
 
   `y[d:D] = x[d:D] * math_ops.exp(log_scale_fn(x[0:d])) + shift_fn(x[0:d])`
   `y[0:d] = x[0:d]`
@@ -51,31 +51,34 @@ class RealNVP(bijector_lib.Bijector):
 
   Masking is currently only supported for base distributions with
   `event_ndims=1`. For more sophisticated masking schemes like checkerboard or
-  channel-wise masking [2], use the `tfb.Permute` bijector to re-order desired
-  masked units into the first `d` units. For base distributions with
-  `event_ndims > 1`, use the `tfb.Reshape` bijector to flatten the event shape.
-
-  Recall that the MAF bijector [2] implements a normalizing flow via an
-  autoregressive transformation. MAF and IAF have opposite computational
-  tradeoffs - MAF can train all units in parallel but must sample units
-  sequentially, while IAF must train units sequentially but can sample in
-  parallel. In contrast, Real NVP can compute both forward and inverse
-  computations in parallel. However, the lack of an autoregressive
+  channel-wise masking [(Papamakarios et al., 2017)][4], use the `tfb.Permute`
+  bijector to re-order desired masked units into the first `d` units. For base
+  distributions with `event_ndims > 1`, use the `tfb.Reshape` bijector to
+  flatten the event shape.
+
+  Recall that the MAF bijector [(Papamakarios et al., 2017)][4] implements a
+  normalizing flow via an autoregressive transformation. MAF and IAF have
+  opposite computational tradeoffs - MAF can train all units in parallel but
+  must sample units sequentially, while IAF must train units sequentially but
+  can sample in parallel. In contrast, Real NVP can compute both forward and
+  inverse computations in parallel. However, the lack of an autoregressive
   transformation makes it less expressive on a per-bijector basis.
 
   A "valid" `shift_and_log_scale_fn` must compute each `shift` (aka `loc` or
-  "mu" [2]) and `log(scale)` (aka "alpha" [2]) such that each are broadcastable
-  with the arguments to `forward` and `inverse`, i.e., such that the
-  calculations in `forward`, `inverse` [below] are possible. For convenience,
+  "mu" in [Papamakarios et al. (2017)][4]) and `log(scale)` (aka "alpha" in
+  [Papamakarios et al. (2017)][4]) such that each are broadcastable with the
+  arguments to `forward` and `inverse`, i.e., such that the calculations in
+  `forward`, `inverse` [below] are possible. For convenience,
   `real_nvp_default_template` is offered as a possible `shift_and_log_scale_fn`
   function.
 
-  NICE [3] is a special case of the Real NVP bijector which discards the scale
-  transformation, resulting in a constant-time inverse-log-determinant-Jacobian.
-  To use a NICE bijector instead of Real NVP, `shift_and_log_scale_fn` should
-  return `(shift, None)`, and `is_constant_jacobian` should be set to `True` in
-  the `RealNVP` constructor. Calling `real_nvp_default_template` with
-  `shift_only=True` returns one such NICE-compatible `shift_and_log_scale_fn`.
+  NICE [(Dinh et al., 2014)][2] is a special case of the Real NVP bijector
+  which discards the scale transformation, resulting in a constant-time
+  inverse-log-determinant-Jacobian. To use a NICE bijector instead of Real
+  NVP, `shift_and_log_scale_fn` should return `(shift, None)`, and
+  `is_constant_jacobian` should be set to `True` in the `RealNVP` constructor.
+  Calling `real_nvp_default_template` with `shift_only=True` returns one such
+  NICE-compatible `shift_and_log_scale_fn`.
 
   Caching: the scalar input depth `D` of the base distribution is not known at
   construction time. The first call to any of `forward(x)`, `inverse(x)`,
@@ -103,23 +106,24 @@ class RealNVP(bijector_lib.Bijector):
   nvp.log_prob(0.)
   ```
 
-  For more examples, see [4].
+  For more examples, see [Jang (2018)][3].
 
-  [1]: "Density Estimation using Real NVP."
-       Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio. ICLR. 2017.
-       https://arxiv.org/abs/1605.08803
+  #### References
 
-  [2]: "Masked Autoregressive Flow for Density Estimation."
-       George Papamakarios, Theo Pavlakou, Iain Murray. Arxiv. 2017.
-       https://arxiv.org/abs/1705.07057
+  [1]: Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density Estimation
+       using Real NVP. In _International Conference on Learning
+       Representations_, 2017. https://arxiv.org/abs/1605.08803
 
-  [3]: "NICE: Non-linear Independent Components Estimation."
-       Laurent Dinh, David Krueger, Yoshua Bengio. ICLR. 2015.
-       https://arxiv.org/abs/1410.8516
+  [2]: Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear
+       Independent Components Estimation. _arXiv preprint arXiv:1410.8516_,
+       2014. https://arxiv.org/abs/1410.8516
 
-  [4]: "Normalizing Flows Tutorial, Part 2: Modern Normalizing Flows."
-       Eric Jang. Blog post. January 2018.
-       http://blog.evjang.com/2018/01/nf2.html
+  [3]: Eric Jang. Normalizing Flows Tutorial, Part 2: Modern Normalizing Flows.
+       _Technical Report_, 2018. http://blog.evjang.com/2018/01/nf2.html
+
+  [4]: George Papamakarios, Theo Pavlakou, and Iain Murray. Masked
+       Autoregressive Flow for Density Estimation. In _Neural Information
+       Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
   """
 
   def __init__(self,
@@ -250,12 +254,20 @@ def real_nvp_default_template(
     **kwargs: `tf.layers.dense` keyword arguments.
 
   Returns:
-    shift: `Float`-like `Tensor` of shift terms (the "mu" in [2]).
-    log_scale: `Float`-like `Tensor` of log(scale) terms (the "alpha" in [2]).
+    shift: `Float`-like `Tensor` of shift terms ("mu" in
+      [Papamakarios et al. (2017)][1]).
+    log_scale: `Float`-like `Tensor` of log(scale) terms ("alpha" in
+      [Papamakarios et al. (2017)][1]).
 
   Raises:
     NotImplementedError: if rightmost dimension of `inputs` is unknown prior to
       graph execution.
+
+  #### References
+
+  [1]: George Papamakarios, Theo Pavlakou, and Iain Murray. Masked
+       Autoregressive Flow for Density Estimation. In _Neural Information
+       Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
   """
 
   with ops.name_scope(name, "real_nvp_default_template"):
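The affine coupling transformation described in the `RealNVP` docstring above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the TensorFlow implementation: `toy_shift_and_log_scale`, `coupling_forward`, and `coupling_inverse` are hypothetical names, and the toy conditioner stands in for a real `shift_and_log_scale_fn` such as the one returned by `real_nvp_default_template`.

```python
import math


def toy_shift_and_log_scale(x_head, output_units):
  """Hypothetical conditioner; any function of x[0:d] would be valid."""
  s = sum(x_head)
  shift = [0.5 * s + i for i in range(output_units)]
  log_scale = [0.1 * s - 0.2 * i for i in range(output_units)]
  return shift, log_scale


def coupling_forward(x, d, fn=toy_shift_and_log_scale):
  """y[0:d] = x[0:d]; y[d:D] = x[d:D] * exp(log_scale) + shift."""
  head, tail = list(x[:d]), list(x[d:])
  shift, log_scale = fn(head, len(tail))
  new_tail = [t * math.exp(a) + m for t, a, m in zip(tail, log_scale, shift)]
  # The Jacobian is triangular, so its log-determinant is sum(log_scale).
  return head + new_tail, sum(log_scale)


def coupling_inverse(y, d, fn=toy_shift_and_log_scale):
  """Inverts coupling_forward; the head is unchanged, so fn sees the same input."""
  head, tail = list(y[:d]), list(y[d:])
  shift, log_scale = fn(head, len(tail))
  old_tail = [(t - m) * math.exp(-a) for t, a, m in zip(tail, log_scale, shift)]
  return head + old_tail
```

Because both directions evaluate the conditioner only on the untouched first `d` units, neither direction needs a sequential loop over units, which is the property the docstring contrasts with MAF and IAF. Returning `None` in place of `log_scale` (and skipping the scaling) would give the NICE special case.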
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/sigmoid_centered.py b/tensorflow/contrib/distributions/python/ops/bijectors/sigmoid_centered.py
deleted file mode 100644
index 223bc9d042c69be05b0e578835a31ed6e83c0c97..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/distributions/python/ops/bijectors/sigmoid_centered.py
+++ /dev/null
@@ -1,39 +0,0 @@
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""SigmoidCentered bijector."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.contrib.distributions.python.ops.bijectors import softmax_centered
-
-
-__all__ = [
-    "SigmoidCentered",
-]
-
-
-class SigmoidCentered(softmax_centered.SoftmaxCentered):
-  """Bijector which computes Y = g(X) = exp([X 0]) / (1 + exp(-X)).
-
-  Equivalent to: `bijector.SoftmaxCentered(event_ndims=0)`.
-
-  See `bijector.SoftmaxCentered` for more details.
-  """
-
-  def __init__(self, validate_args=False, name="sigmoid_centered"):
-    super(SigmoidCentered, self).__init__(
-        event_ndims=0, validate_args=validate_args, name=name)
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/softmax_centered.py b/tensorflow/contrib/distributions/python/ops/bijectors/softmax_centered.py
index a9dcce6c526600f3b26c6bceb730417000917ce7..dc94fd0a38de29f5a7ee6ca826aab0ecf8712966 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/softmax_centered.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/softmax_centered.py
@@ -18,13 +18,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import numpy as np
-
 from tensorflow.contrib.distributions.python.ops import distribution_util
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
-from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
@@ -47,17 +42,14 @@ class SoftmaxCentered(bijector.Bijector):
   e.g., `softmax(x) = exp(x-c) / sum(exp(x-c))` where `c` is the implicit last
   coordinate.
 
-  Because we append a coordinate, this bijector only supports `event_ndim in [0,
-  1]`, i.e., scalars and vectors.
-
   Example Use:
 
   ```python
-  bijector.SoftmaxCentered(event_ndims=1).forward(tf.log([2, 3, 4]))
+  bijector.SoftmaxCentered().forward(tf.log([2, 3, 4]))
   # Result: [0.2, 0.3, 0.4, 0.1]
   # Extra result: 0.1
 
-  bijector.SoftmaxCentered(event_ndims=1).inverse([0.2, 0.3, 0.4, 0.1])
+  bijector.SoftmaxCentered().inverse([0.2, 0.3, 0.4, 0.1])
   # Result: tf.log([2, 3, 4])
   # Extra coordinate removed.
   ```
@@ -69,82 +61,47 @@ class SoftmaxCentered(bijector.Bijector):
   """
 
   def __init__(self,
-               event_ndims=0,
                validate_args=False,
                name="softmax_centered"):
     self._graph_parents = []
     self._name = name
-    with self._name_scope("init", values=[event_ndims]):
-      event_ndims = ops.convert_to_tensor(event_ndims, name="event_ndims")
-      event_ndims = tensor_util.constant_value(event_ndims)
-      if event_ndims is None or event_ndims not in [0, 1]:
-        raise ValueError("`event_ndims` must be a TF constant which is 0 or 1")
-    self._static_event_ndims = event_ndims
     super(SoftmaxCentered, self).__init__(
-        event_ndims=event_ndims,
+        event_ndims=1,
         validate_args=validate_args,
         name=name)
 
   def _forward_event_shape(self, input_shape):
-    if input_shape.ndims is None:
+    if input_shape.ndims is None or input_shape[-1] is None:
       return input_shape
-    if input_shape.ndims != self._static_event_ndims:
-      raise ValueError("input_shape.dims = %d != %d" %
-                       (input_shape.ndims, self._static_event_ndims))
-    if input_shape.ndims == 0:
-      return tensor_shape.TensorShape([2])
-    if input_shape.ndims == 1:
-      return tensor_shape.TensorShape(input_shape[0] + 1)
-    # Unreachable code:
-    raise ValueError("event_ndims = %d must be 0 or 1" % input_shape.ndims)
+    return tensor_shape.TensorShape([input_shape[-1] + 1])
 
   def _forward_event_shape_tensor(self, input_shape):
-    ndims = array_ops.shape(input_shape)
-    if self.validate_args:
-      # It is not possible for a negative shape so we need only check <= 1.
-      is_zero_or_one = check_ops.assert_equal(
-          ndims, 0 if self._static_event_ndims == 0 else 1,
-          message="event_ndims must be 0 or 1")
-      ndims = control_flow_ops.with_dependencies([is_zero_or_one], ndims)
-    if self._static_event_ndims == 0:
-      return ops.convert_to_tensor(
-          [2], dtype=dtypes.int32, name="output_shape")
-    return input_shape + 1
+    return (input_shape[-1] + 1)[..., array_ops.newaxis]
 
   def _inverse_event_shape(self, output_shape):
-    if output_shape.ndims is None:
+    if output_shape.ndims is None or output_shape[-1] is None:
       return output_shape
-    if output_shape.ndims != 1:
-      raise ValueError("output_shape.ndims = %d != 1" % output_shape.ndims)
-    if self._static_event_ndims == 0:
-      return tensor_shape.TensorShape([])
-    return tensor_shape.TensorShape(output_shape[0] - 1)
+    if output_shape[-1] <= 1:
+      raise ValueError("output_shape[-1] = %d <= 1" % output_shape[-1])
+    return tensor_shape.TensorShape([output_shape[-1] - 1])
 
   def _inverse_event_shape_tensor(self, output_shape):
-    ndims = array_ops.shape(output_shape)[0]
     if self.validate_args:
       # It is not possible for a negative shape so we need only check <= 1.
-      is_one = check_ops.assert_equal(
-          ndims, 1, message="event_ndims must be 1")
-      ndims = control_flow_ops.with_dependencies([is_one], ndims)
-    if self._static_event_ndims == 0:
-      return ops.convert_to_tensor([], dtype=dtypes.int32, name="output_shape")
-    return array_ops.expand_dims(output_shape[0] - 1, dim=0)
+      is_greater_one = check_ops.assert_greater(
+          output_shape[-1], 1, message="Need last dimension greater than 1.")
+      output_shape = control_flow_ops.with_dependencies(
+          [is_greater_one], output_shape)
+    return (output_shape[-1] - 1)[..., array_ops.newaxis]
 
   def _forward(self, x):
     # Pad the last dim with a zeros vector. We need this because it lets us
     # infer the scale in the inverse function.
-    y = array_ops.expand_dims(x, dim=-1) if self._static_event_ndims == 0 else x
-    y = distribution_util.pad(y, axis=-1, back=True)
+    y = distribution_util.pad(x, axis=-1, back=True)
 
     # Set shape hints.
     if x.shape.ndims is not None:
-      shape = x.shape.as_list()
-      if self._static_event_ndims == 0:
-        shape += [2]
-      elif shape[-1] is not None:
-        shape[-1] += 1
-      shape = tensor_shape.TensorShape(shape)
+      shape = x.shape[:-1].concatenate(x.shape[-1] + 1)
       y.shape.assert_is_compatible_with(shape)
       y.set_shape(shape)
 
@@ -161,42 +118,17 @@ class SoftmaxCentered(bijector.Bijector):
     # x[i] = log(exp(x[i])) - log(y[end]) - log(normalization)
     #      = log(exp(x[i])/normalization) - log(y[end])
     #      = log(y[i]) - log(y[end])
-    shape = (np.asarray(y.shape.as_list(), dtype=np.int32)
-             if y.shape.is_fully_defined()
-             else array_ops.shape(y, name="shape"))
-    ndims = distribution_util.prefer_static_rank(y)
 
     # Do this first to make sure CSE catches that it'll happen again in
     # _inverse_log_det_jacobian.
     x = math_ops.log(y)
 
-    # We now extract the last coordinate of the rightmost dimension.
-    # Our trick is to slice from [0,0,...,shape[-1]-1] to shape[:-1]+[1].
-    begin = array_ops.one_hot(indices=ndims-1,
-                              depth=ndims,
-                              on_value=shape[-1]-np.array(1, dtype=shape.dtype),
-                              dtype=shape.dtype)
-    size = array_ops.concat([shape[:-1], np.asarray([1], dtype=shape.dtype)], 0)
-    log_normalization = -array_ops.strided_slice(x, begin, begin + size)
-
-    # Here we slice out all but the last coordinate; see above for idea.
-    begin = array_ops.zeros_like(shape)
-    size = array_ops.concat([shape[:-1], [shape[-1] - 1]], 0)
-    x = array_ops.strided_slice(x, begin, begin + size)
-
-    x += log_normalization
-
-    if self._static_event_ndims == 0:
-      x = array_ops.squeeze(x, squeeze_dims=[ndims-1])
+    log_normalization = (-x[..., -1])[..., array_ops.newaxis]
+    x = x[..., :-1] + log_normalization
 
     # Set shape hints.
     if y.shape.ndims is not None:
-      shape = y.shape.as_list()
-      if self._static_event_ndims == 0:
-        shape = shape[:-1]
-      elif shape[-1] is not None:
-        shape[-1] -= 1
-      shape = tensor_shape.TensorShape(shape)
+      shape = y.shape[:-1].concatenate(y.shape[-1] - 1)
       x.shape.assert_is_compatible_with(shape)
       x.set_shape(shape)
 
@@ -222,19 +154,16 @@ class SoftmaxCentered(bijector.Bijector):
     return -math_ops.reduce_sum(math_ops.log(y), axis=-1)
 
   def _forward_log_det_jacobian(self, x):
-    if self._static_event_ndims == 0:
-      return x - 2. * nn_ops.softplus(x)
-    else:
-      # This code is similar to nn_ops.log_softmax but different because we have
-      # an implicit zero column to handle. I.e., instead of:
-      #   reduce_sum(logits - reduce_sum(exp(logits), dim))
-      # we must do:
-      #   log_normalization = 1 + reduce_sum(exp(logits))
-      #   -log_normalization + reduce_sum(logits - log_normalization)
-      log_normalization = nn_ops.softplus(
-          math_ops.reduce_logsumexp(x, axis=-1, keep_dims=True))
-      fldj = (-log_normalization +
-              math_ops.reduce_sum(x - log_normalization,
-                                  axis=-1,
-                                  keep_dims=True))
-      return array_ops.squeeze(fldj, squeeze_dims=-1)
+    # This code is similar to nn_ops.log_softmax but different because we have
+    # an implicit zero column to handle. I.e., instead of:
+    #   reduce_sum(logits - log(reduce_sum(exp(logits), dim)))
+    # we must do:
+    #   log_normalization = log(1 + reduce_sum(exp(logits)))
+    #   -log_normalization + reduce_sum(logits - log_normalization)
+    log_normalization = nn_ops.softplus(
+        math_ops.reduce_logsumexp(x, axis=-1, keep_dims=True))
+    fldj = (-log_normalization +
+            math_ops.reduce_sum(x - log_normalization,
+                                axis=-1,
+                                keep_dims=True))
+    return array_ops.squeeze(fldj, squeeze_dims=-1)
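After this change `SoftmaxCentered` always treats the last axis as the event dimension. The arithmetic of the simplified `_forward`, `_inverse`, and `_forward_log_det_jacobian` can be sketched for a single vector in plain Python. The function names below are hypothetical and the sketch ignores batching and shape hints.

```python
import math


def softmax_centered_forward(x):
  """Maps n unconstrained reals to n + 1 probabilities that sum to one."""
  padded = list(x) + [0.0]  # Append the implicit zero coordinate.
  m = max(padded)  # Subtract the max for numerical stability.
  exps = [math.exp(v - m) for v in padded]
  total = sum(exps)
  return [e / total for e in exps]


def softmax_centered_inverse(y):
  """x[i] = log(y[i]) - log(y[end]); the extra coordinate is dropped."""
  log_y = [math.log(v) for v in y]
  return [v - log_y[-1] for v in log_y[:-1]]


def softmax_centered_fldj(x):
  """Forward log-det-Jacobian with the implicit zero column included."""
  # Equals softplus(logsumexp(x)) = log(1 + sum(exp(x))).
  log_normalization = math.log1p(sum(math.exp(v) for v in x))
  return -log_normalization + sum(v - log_normalization for v in x)


y = softmax_centered_forward([math.log(2.), math.log(3.), math.log(4.)])
# y is approximately [0.2, 0.3, 0.4, 0.1], matching the docstring example.
```

For this input the forward log-det-Jacobian equals `sum(log(y))`, which is the negative of the inverse log-det-Jacobian computed in the hunk above.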
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/square.py b/tensorflow/contrib/distributions/python/ops/bijectors/square.py
new file mode 100644
index 0000000000000000000000000000000000000000..1e9dbf35091fe51f2478dc085c394a77295ca4ee
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/square.py
@@ -0,0 +1,84 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Square bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops.distributions import bijector
+
+
+__all__ = [
+    "Square",
+]
+
+
+class Square(bijector.Bijector):
+  """Compute `g(X) = X^2`; X is a non-negative real number.
+
+  g is a bijection between the non-negative real numbers (R_+) and the
+  non-negative real numbers.
+
+  #### Examples
+
+  ```python
+  bijector.Square().forward(x=[[1., 0], [2, 1]])
+  # Result: [[1., 0], [4, 1]], i.e., x^2
+
+  bijector.Square().inverse(y=[[1., 4], [9, 1]])
+  # Result: [[1., 2], [3, 1]], i.e., sqrt(y).
+  ```
+
+  """
+
+  def __init__(self, validate_args=False, name="square"):
+    """Instantiates the `Square` bijector.
+
+    Args:
+      validate_args: Python `bool` indicating whether arguments should be
+        checked for correctness.
+      name: Python `str` name given to ops managed by this object.
+    """
+    self._name = name
+    super(Square, self).__init__(
+        event_ndims=0,
+        validate_args=validate_args,
+        name=name)
+
+  def _forward(self, x):
+    x = self._maybe_assert_valid(x)
+    return math_ops.square(x)
+
+  def _inverse(self, y):
+    y = self._maybe_assert_valid(y)
+    return math_ops.sqrt(y)
+
+  def _forward_log_det_jacobian(self, x):
+    x = self._maybe_assert_valid(x)
+    return np.log(2.) + math_ops.log(x)
+
+  def _maybe_assert_valid(self, t):
+    if not self.validate_args:
+      return t
+    is_valid = check_ops.assert_non_negative(
+        t, message="All elements must be non-negative.")
+    return control_flow_ops.with_dependencies([is_valid], t)
+
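The log-det-Jacobian in the new `Square` bijector follows from `d/dx x^2 = 2x`, so `log|dy/dx| = log(2) + log(x)`. A scalar sketch in plain Python is shown here; the function names are hypothetical, and the domain check is applied unconditionally rather than being gated on `validate_args`.

```python
import math


def square_forward(x):
  """y = x^2, restricted to x >= 0 so the map is invertible."""
  if x < 0:
    raise ValueError("All elements must be non-negative.")
  return x * x


def square_inverse(y):
  """x = sqrt(y) for y >= 0."""
  if y < 0:
    raise ValueError("All elements must be non-negative.")
  return math.sqrt(y)


def square_forward_log_det_jacobian(x):
  """log|d(x^2)/dx| = log(2x) = log(2) + log(x), defined for x > 0."""
  return math.log(2.) + math.log(x)


def square_inverse_log_det_jacobian(y):
  """The inverse Jacobian is the negated forward one evaluated at sqrt(y)."""
  return -square_forward_log_det_jacobian(square_inverse(y))
```

Note that the log-det-Jacobian diverges at `x = 0`, even though zero itself is accepted by the non-negativity check.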
diff --git a/tensorflow/contrib/distributions/python/ops/chi2.py b/tensorflow/contrib/distributions/python/ops/chi2.py
index bdd5571c966a74e58e4f9f8eed2628f131a1b92e..e610f469e5d5f446b75c734cc39811de30a8cb9a 100644
--- a/tensorflow/contrib/distributions/python/ops/chi2.py
+++ b/tensorflow/contrib/distributions/python/ops/chi2.py
@@ -21,6 +21,8 @@ from __future__ import print_function
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import gamma
 
@@ -87,7 +89,11 @@ class Chi2(gamma.Gamma):
     # allow_nan_stats=True
     # through to the parent class results in unnecessary asserts.
     with ops.name_scope(name, values=[df]):
-      self._df = ops.convert_to_tensor(df, name="df")
+      with ops.control_dependencies([
+          check_ops.assert_positive(df),
+      ] if validate_args else []):
+        self._df = array_ops.identity(df, name="df")
+
       super(Chi2, self).__init__(
           concentration=0.5 * self._df,
           rate=constant_op.constant(0.5, dtype=self._df.dtype),
diff --git a/tensorflow/contrib/distributions/python/ops/gumbel.py b/tensorflow/contrib/distributions/python/ops/gumbel.py
index d0efaefb8e78ddf4436e9e5a112d2c1cdddaf3b5..8d05ad6b8032fb8bada99389959091fb1c28beda 100644
--- a/tensorflow/contrib/distributions/python/ops/gumbel.py
+++ b/tensorflow/contrib/distributions/python/ops/gumbel.py
@@ -190,9 +190,6 @@ class _Gumbel(distribution.Distribution):
   def _log_prob(self, x):
     return self._log_unnormalized_prob(x) - self._log_normalization()
 
-  def _prob(self, x):
-    return math_ops.exp(self._log_prob(x))
-
   def _log_cdf(self, x):
     return -math_ops.exp(-self._z(x))
 
diff --git a/tensorflow/contrib/distributions/python/ops/independent.py b/tensorflow/contrib/distributions/python/ops/independent.py
index cbce005013281ff3c58c94d525d5ce7a865d725a..7dcb3e3ac4db1855adacb7ec0fa8554c45d9c859 100644
--- a/tensorflow/contrib/distributions/python/ops/independent.py
+++ b/tensorflow/contrib/distributions/python/ops/independent.py
@@ -28,6 +28,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import distribution as distribution_lib
+from tensorflow.python.ops.distributions import kullback_leibler
 
 
 class Independent(distribution_lib.Distribution):
@@ -254,3 +255,58 @@ class Independent(distribution_lib.Distribution):
     else:
       which_maximum = np.maximum
     return which_maximum(0, ndims - 1)
+
+
+@kullback_leibler.RegisterKL(Independent, Independent)
+def _kl_independent(a, b, name="kl_independent"):
+  """Batched KL divergence `KL(a || b)` for Independent distributions.
+
+  We can leverage the fact that
+  ```
+  KL(Independent(a) || Independent(b)) = sum(KL(a || b))
+  ```
+  where the sum is over the `reinterpreted_batch_ndims`.
+
+  Args:
+    a: Instance of `Independent`.
+    b: Instance of `Independent`.
+    name: (optional) name to use for created ops. Default "kl_independent".
+
+  Returns:
+    Batchwise `KL(a || b)`.
+
+  Raises:
+    ValueError: If the event space for `a` and `b`, or their underlying
+      distributions don't match.
+  """
+  p = a.distribution
+  q = b.distribution
+
+  # The KL between any two (non)-batched distributions is a scalar.
+  # Given that the KL between two factored distributions is the sum, i.e.
+  # KL(p1(x)p2(y) || q1(x)q2(y)) = KL(p1 || q1) + KL(p2 || q2), we compute
+  # KL(p || q) and do a `reduce_sum` on the reinterpreted batch dimensions.
+  if a.event_shape.is_fully_defined() and b.event_shape.is_fully_defined():
+    if a.event_shape == b.event_shape:
+      if p.event_shape == q.event_shape:
+        num_reduce_dims = a.event_shape.ndims - p.event_shape.ndims
+        reduce_dims = [-i - 1 for i in range(0, num_reduce_dims)]
+
+        return math_ops.reduce_sum(
+            kullback_leibler.kl_divergence(p, q, name=name), axis=reduce_dims)
+      else:
+        raise NotImplementedError("KL between Independents with different "
+                                  "event shapes not supported.")
+    else:
+      raise ValueError("Event shapes do not match.")
+  else:
+    with ops.control_dependencies([
+        check_ops.assert_equal(a.event_shape_tensor(), b.event_shape_tensor()),
+        check_ops.assert_equal(p.event_shape_tensor(), q.event_shape_tensor())
+    ]):
+      num_reduce_dims = (
+          array_ops.shape(a.event_shape_tensor())[0] -
+          array_ops.shape(p.event_shape_tensor())[0])
+      reduce_dims = math_ops.range(-num_reduce_dims, 0, 1)
+      return math_ops.reduce_sum(
+          kullback_leibler.kl_divergence(p, q, name=name), axis=reduce_dims)
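The identity used by `_kl_independent`, that the KL divergence between two factored distributions is the sum of the per-factor KL divergences, can be checked with a small sketch. It assumes scalar Normal factors purely for illustration; `kl_normal` and `kl_independent_normal` are hypothetical helpers, not part of `tf.contrib.distributions`.

```python
import math


def kl_normal(loc_a, scale_a, loc_b, scale_b):
  """Closed-form KL(N(loc_a, scale_a) || N(loc_b, scale_b)) for scalars."""
  return (math.log(scale_b / scale_a)
          + (scale_a ** 2 + (loc_a - loc_b) ** 2) / (2. * scale_b ** 2)
          - 0.5)


def kl_independent_normal(locs_a, scales_a, locs_b, scales_b):
  """KL between two products of independent Normals: sum the factor KLs."""
  if not len(locs_a) == len(scales_a) == len(locs_b) == len(scales_b):
    raise ValueError("Event shapes do not match.")
  return sum(
      kl_normal(la, sa, lb, sb)
      for la, sa, lb, sb in zip(locs_a, scales_a, locs_b, scales_b))


# The static-shape branch reduces over the trailing reinterpreted batch
# dimensions; for two such dimensions the axes are [-1, -2].
num_reduce_dims = 2
reduce_dims = [-i - 1 for i in range(num_reduce_dims)]
```

In the registered function the per-factor KL comes from `kullback_leibler.kl_divergence(p, q)` and the sum is a `reduce_sum` over `reduce_dims`; the sketch collapses a single event axis with Python's `sum` instead.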
diff --git a/tensorflow/contrib/distributions/python/ops/inverse_gamma.py b/tensorflow/contrib/distributions/python/ops/inverse_gamma.py
index ee4d86867d48b20e97757bcec57d452085814b80..51ac61dcf640ca89f22c47127bda71316a179ca4 100644
--- a/tensorflow/contrib/distributions/python/ops/inverse_gamma.py
+++ b/tensorflow/contrib/distributions/python/ops/inverse_gamma.py
@@ -192,12 +192,6 @@ class InverseGamma(distribution.Distribution):
   def _log_prob(self, x):
     return self._log_unnormalized_prob(x) - self._log_normalization()
 
-  def _prob(self, x):
-    return math_ops.exp(self._log_prob(x))
-
-  def _log_cdf(self, x):
-    return math_ops.log(self._cdf(x))
-
   def _cdf(self, x):
     x = self._maybe_assert_valid_sample(x)
     # Note that igammac returns the upper regularized incomplete gamma
diff --git a/tensorflow/contrib/distributions/python/ops/kumaraswamy.py b/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
index 120b38db3cf72e8fce56a7e9293cdf25e75784e2..192dede6ff1d4de8d4be9965c414e7453d7b5d4b 100644
--- a/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
+++ b/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
@@ -44,18 +44,16 @@ _kumaraswamy_sample_note = """Note: `x` must have dtype `self.dtype` and be in
 def _harmonic_number(x):
   """Compute the harmonic number from its analytic continuation.
 
-  Derivation from [1] and Euler's constant [2].
-  [1] -
-  https://en.wikipedia.org/wiki/Digamma_function#Relation_to_harmonic_numbers
-  [2] - https://en.wikipedia.org/wiki/Euler%E2%80%93Mascheroni_constant
-
+  Derivation from [here](
+  https://en.wikipedia.org/wiki/Digamma_function#Relation_to_harmonic_numbers)
+  and [Euler's constant](
+  https://en.wikipedia.org/wiki/Euler%E2%80%93Mascheroni_constant).
 
   Args:
     x: input float.
 
   Returns:
     z: The analytic continuation of the harmonic number for the input.
-
   """
   one = array_ops.ones([], dtype=x.dtype)
   return math_ops.digamma(x + one) - math_ops.digamma(one)
diff --git a/tensorflow/contrib/distributions/python/ops/logistic.py b/tensorflow/contrib/distributions/python/ops/logistic.py
index 473677f8d91b184e029f345bb05f5c5d63df7a40..68e6bca5a554b29a450911073eb5c4fe55f313c6 100644
--- a/tensorflow/contrib/distributions/python/ops/logistic.py
+++ b/tensorflow/contrib/distributions/python/ops/logistic.py
@@ -185,9 +185,6 @@ class Logistic(distribution.Distribution):
   def _log_prob(self, x):
     return self._log_unnormalized_prob(x) - self._log_normalization()
 
-  def _prob(self, x):
-    return math_ops.exp(self._log_prob(x))
-
   def _log_cdf(self, x):
     return -nn_ops.softplus(-self._z(x))
 
diff --git a/tensorflow/contrib/distributions/python/ops/moving_stats.py b/tensorflow/contrib/distributions/python/ops/moving_stats.py
index 20f85643b9e7db61b4786dffe4115c7d3c00b046..87d40805a3c7a9c2871305af7f7182b7e2923530 100644
--- a/tensorflow/contrib/distributions/python/ops/moving_stats.py
+++ b/tensorflow/contrib/distributions/python/ops/moving_stats.py
@@ -47,9 +47,7 @@ def assign_moving_mean_variance(
   Note: `mean_var` is updated *after* `variance_var`, i.e., `variance_var` uses
   the lag-1 mean.
 
-  For derivation justification, see equation 143 of:
-    T. Finch, Feb 2009. "Incremental calculation of weighted mean and variance".
-    http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
+  For derivation justification, see [Finch (2009; Eq. 143)][1].
 
   Args:
     mean_var: `float`-like `Variable` representing the exponentially weighted
@@ -72,6 +70,12 @@ def assign_moving_mean_variance(
     TypeError: if `mean_var` does not have float type `dtype`.
     TypeError: if `mean_var`, `variance_var`, `value`, `decay` have different
       `base_dtype`.
+
+  #### References
+
+  [1]: Tony Finch. Incremental calculation of weighted mean and variance.
+       _Technical Report_, 2009.
+       http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
   """
   with ops.name_scope(name, "assign_moving_mean_variance",
                       [variance_var, mean_var, value, decay]):
@@ -183,9 +187,7 @@ def moving_mean_variance(value, decay, collections=None, name=None):
   Note: `mean_var` is updated *after* `variance_var`, i.e., `variance_var` uses
   the lag-`1` mean.
 
-  For derivation justification, see equation 143 of:
-    T. Finch, Feb 2009. "Incremental calculation of weighted mean and variance".
-    http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
+  For derivation justification, see [Finch (2009; Eq. 143)][1].
 
   Unlike `assign_moving_mean_variance`, this function handles
   variable creation.
@@ -208,6 +210,12 @@ def moving_mean_variance(value, decay, collections=None, name=None):
   Raises:
     TypeError: if `value_var` does not have float type `dtype`.
     TypeError: if `value`, `decay` have different `base_dtype`.
+
+  #### References
+
+  [1]: Tony Finch. Incremental calculation of weighted mean and variance.
+       _Technical Report_, 2009.
+       http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
   """
   if collections is None:
     collections = [ops.GraphKeys.GLOBAL_VARIABLES]
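The docstrings above cite Finch (2009; Eq. 143) for the moving mean and variance update. A scalar sketch of that incremental form is given below, assuming the smoothing weight is `alpha = 1 - decay`. The function name is hypothetical, and the exact op ordering inside `moving_stats.py` is not shown in these hunks, so treat this as an illustration of the cited formula rather than a copy of the implementation.

```python
def update_moving_mean_variance(mean, variance, value, decay):
  """One step of an exponentially weighted mean/variance update.

  Assumes alpha = 1 - decay in Finch's notation. The variance update uses
  the previous (lag-1) mean, as the docstring note describes.
  """
  delta = value - mean  # Deviation from the lag-1 mean.
  new_variance = decay * (variance + (1. - decay) * delta ** 2)
  new_mean = mean + (1. - decay) * delta
  return new_mean, new_variance


mean, variance = 0., 0.
for value in [2., 2., 2., 2.]:
  mean, variance = update_moving_mean_variance(mean, variance, value, 0.5)
```

With a constant input stream the mean converges geometrically to the input value and the variance decays toward zero, which is a quick sanity check on any implementation of this update.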
diff --git a/tensorflow/contrib/distributions/python/ops/onehot_categorical.py b/tensorflow/contrib/distributions/python/ops/onehot_categorical.py
index b76cebf79fad09ebec68f2459c6fe80794ea81c0..46c2cc8b7a8c536a90176fbb2b2d52fed61e4705 100644
--- a/tensorflow/contrib/distributions/python/ops/onehot_categorical.py
+++ b/tensorflow/contrib/distributions/python/ops/onehot_categorical.py
@@ -203,9 +203,6 @@ class OneHotCategorical(distribution.Distribution):
     ret = array_ops.reshape(ret, logits_shape)
     return ret
 
-  def _prob(self, x):
-    return math_ops.exp(self._log_prob(x))
-
   def _entropy(self):
     return -math_ops.reduce_sum(
         nn_ops.log_softmax(self.logits) * self.probs, axis=-1)
diff --git a/tensorflow/contrib/distributions/python/ops/relaxed_onehot_categorical.py b/tensorflow/contrib/distributions/python/ops/relaxed_onehot_categorical.py
index 2aa771a71efe52c8d86d459f090ea8ee137c4487..ff33f327c7a77597e516208cacad8c4aed65d1c9 100644
--- a/tensorflow/contrib/distributions/python/ops/relaxed_onehot_categorical.py
+++ b/tensorflow/contrib/distributions/python/ops/relaxed_onehot_categorical.py
@@ -285,9 +285,6 @@ class ExpRelaxedOneHotCategorical(distribution.Distribution):
     ret = array_ops.reshape(log_prob, logits_shape)
     return ret
 
-  def _prob(self, x):
-    return math_ops.exp(self._log_prob(x))
-
   def _assert_valid_sample(self, x):
     if not self.validate_args:
       return x
diff --git a/tensorflow/contrib/distributions/python/ops/shape.py b/tensorflow/contrib/distributions/python/ops/shape.py
index 5fb6f0c7eaa8c4734ea4c161b0eee6f24d4c9850..bac0b79d5908712f4e64259768fb6f3b4558f620 100644
--- a/tensorflow/contrib/distributions/python/ops/shape.py
+++ b/tensorflow/contrib/distributions/python/ops/shape.py
@@ -32,45 +32,50 @@ from tensorflow.python.ops.distributions import util as distribution_util
 class _DistributionShape(object):
   """Manage and manipulate `Distribution` shape.
 
-  Terminology:
-    Recall that a `Tensor` has:
-      - `shape`: size of `Tensor` dimensions,
-      - `ndims`: size of `shape`; number of `Tensor` dimensions,
-      - `dims`: indexes into `shape`; useful for transpose, reduce.
-
-    `Tensor`s sampled from a `Distribution` can be partitioned by `sample_dims`,
-    `batch_dims`, and `event_dims`. To understand the semantics of these
-    dimensions, consider when two of the three are fixed and the remaining
-    is varied:
-      - `sample_dims`: indexes independent draws from identical
-                       parameterizations of the `Distribution`.
-      - `batch_dims`:  indexes independent draws from non-identical
-                       parameterizations of the `Distribution`.
-      - `event_dims`:  indexes event coordinates from one sample.
-
-    The `sample`, `batch`, and `event` dimensions constitute the entirety of a
-    `Distribution` `Tensor`'s shape.
-
-    The dimensions are always in `sample`, `batch`, `event` order.
-
-  Purpose:
-    This class partitions `Tensor` notions of `shape`, `ndims`, and `dims` into
-    `Distribution` notions of `sample,` `batch,` and `event` dimensions. That
-    is, it computes any of:
+  #### Terminology
 
-    ```
-    sample_shape     batch_shape     event_shape
-    sample_dims      batch_dims      event_dims
-    sample_ndims     batch_ndims     event_ndims
-    ```
+  Recall that a `Tensor` has:
+    - `shape`: size of `Tensor` dimensions,
+    - `ndims`: size of `shape`; number of `Tensor` dimensions,
+    - `dims`: indexes into `shape`; useful for transpose, reduce.
+
+  `Tensor`s sampled from a `Distribution` can be partitioned by `sample_dims`,
+  `batch_dims`, and `event_dims`. To understand the semantics of these
+  dimensions, consider when two of the three are fixed and the remaining
+  is varied:
+    - `sample_dims`: indexes independent draws from identical
+                     parameterizations of the `Distribution`.
+    - `batch_dims`:  indexes independent draws from non-identical
+                     parameterizations of the `Distribution`.
+    - `event_dims`:  indexes event coordinates from one sample.
+
+  The `sample`, `batch`, and `event` dimensions constitute the entirety of a
+  `Distribution` `Tensor`'s shape.
+
+  The dimensions are always in `sample`, `batch`, `event` order.
+
+  #### Purpose
+
+  This class partitions `Tensor` notions of `shape`, `ndims`, and `dims` into
+  `Distribution` notions of `sample,` `batch,` and `event` dimensions. That
+  is, it computes any of:
+
+  ```
+  sample_shape     batch_shape     event_shape
+  sample_dims      batch_dims      event_dims
+  sample_ndims     batch_ndims     event_ndims
+  ```
 
-    for a given `Tensor`, e.g., the result of
-    `Distribution.sample(sample_shape=...)`.
+  for a given `Tensor`, e.g., the result of
+  `Distribution.sample(sample_shape=...)`.
 
-    For a given `Tensor`, this class computes the above table using minimal
-    information: `batch_ndims` and `event_ndims`.
+  For a given `Tensor`, this class computes the above table using minimal
+  information: `batch_ndims` and `event_ndims`.
+
+  #### Examples
+
+  We show examples of distribution shape semantics.
 
-  Examples of `Distribution` `shape` semantics:
     - Sample dimensions:
       Computing summary statistics, i.e., the average is a reduction over sample
       dimensions.
@@ -111,52 +116,54 @@ class _DistributionShape(object):
       tf.div(1., tf.reduce_prod(x, event_dims))
       ```
 
-  Examples using this class:
-    Write `S, B, E` for `sample_shape`, `batch_shape`, and `event_shape`.
-
-    ```python
-    # 150 iid samples from one multivariate Normal with two degrees of freedom.
-    mu = [0., 0]
-    sigma = [[1., 0],
-             [0,  1]]
-    mvn = MultivariateNormal(mu, sigma)
-    rand_mvn = mvn.sample(sample_shape=[3, 50])
-    shaper = DistributionShape(batch_ndims=0, event_ndims=1)
-    S, B, E = shaper.get_shape(rand_mvn)
-    # S = [3, 50]
-    # B = []
-    # E = [2]
-
-    # 12 iid samples from one Wishart with 2x2 events.
-    sigma = [[1., 0],
-             [2,  1]]
-    wishart = Wishart(df=5, scale=sigma)
-    rand_wishart = wishart.sample(sample_shape=[3, 4])
-    shaper = DistributionShape(batch_ndims=0, event_ndims=2)
-    S, B, E = shaper.get_shape(rand_wishart)
-    # S = [3, 4]
-    # B = []
-    # E = [2, 2]
-
-    # 100 iid samples from two, non-identical trivariate Normal distributions.
-    mu    = ...  # shape(2, 3)
-    sigma = ...  # shape(2, 3, 3)
-    X = MultivariateNormal(mu, sigma).sample(shape=[4, 25])
-    # S = [4, 25]
-    # B = [2]
-    # E = [3]
-    ```
-
-  Argument Validation:
-    When `validate_args=False`, checks that cannot be done during
-    graph construction are performed at graph execution. This may result in a
-    performance degradation because data must be switched from GPU to CPU.
-
-    For example, when `validate_args=False` and `event_ndims` is a
-    non-constant `Tensor`, it is checked to be a non-negative integer at graph
-    execution. (Same for `batch_ndims`). Constant `Tensor`s and non-`Tensor`
-    arguments are always checked for correctness since this can be done for
-    "free," i.e., during graph construction.
+  We show examples using this class.
+
+  Write `S, B, E` for `sample_shape`, `batch_shape`, and `event_shape`.
+
+  ```python
+  # 150 iid samples from one multivariate Normal with two degrees of freedom.
+  mu = [0., 0]
+  sigma = [[1., 0],
+           [0,  1]]
+  mvn = MultivariateNormal(mu, sigma)
+  rand_mvn = mvn.sample(sample_shape=[3, 50])
+  shaper = DistributionShape(batch_ndims=0, event_ndims=1)
+  S, B, E = shaper.get_shape(rand_mvn)
+  # S = [3, 50]
+  # B = []
+  # E = [2]
+
+  # 12 iid samples from one Wishart with 2x2 events.
+  sigma = [[1., 0],
+           [2,  1]]
+  wishart = Wishart(df=5, scale=sigma)
+  rand_wishart = wishart.sample(sample_shape=[3, 4])
+  shaper = DistributionShape(batch_ndims=0, event_ndims=2)
+  S, B, E = shaper.get_shape(rand_wishart)
+  # S = [3, 4]
+  # B = []
+  # E = [2, 2]
+
+  # 100 iid samples from two, non-identical trivariate Normal distributions.
+  mu    = ...  # shape(2, 3)
+  sigma = ...  # shape(2, 3, 3)
+  X = MultivariateNormal(mu, sigma).sample(sample_shape=[4, 25])
+  # S = [4, 25]
+  # B = [2]
+  # E = [3]
+  ```
+
+  #### Argument Validation
+
+  When `validate_args=True`, checks that cannot be done during
+  graph construction are performed at graph execution. This may result in a
+  performance degradation because data must be switched from GPU to CPU.
+
+  For example, when `validate_args=True` and `event_ndims` is a
+  non-constant `Tensor`, it is checked to be a non-negative integer at graph
+  execution. (Same for `batch_ndims`.) Constant `Tensor`s and non-`Tensor`
+  arguments are always checked for correctness since this can be done for
+  "free," i.e., during graph construction.
   """
 
   def __init__(self,
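Reviewer note: the `_DistributionShape` docstring says the sample, batch, and event dimensions always appear in that order, and that `batch_ndims` and `event_ndims` are enough to recover the full partition. A minimal sketch of that bookkeeping on plain Python lists follows. The function `partition_shape` is hypothetical, works on static shapes only, and is not the class's actual implementation.

```python
def partition_shape(shape, batch_ndims, event_ndims):
    """Split a full shape into (sample_shape, batch_shape, event_shape).

    Hypothetical illustration: dimensions are assumed to be laid out in
    sample, batch, event order, as the docstring above states.
    """
    shape = list(shape)
    # Whatever is not batch or event is sample.
    sample_ndims = len(shape) - batch_ndims - event_ndims
    if sample_ndims < 0:
        raise ValueError("shape has fewer dims than batch_ndims + event_ndims")
    sample_shape = shape[:sample_ndims]
    batch_shape = shape[sample_ndims:sample_ndims + batch_ndims]
    event_shape = shape[sample_ndims + batch_ndims:]
    return sample_shape, batch_shape, event_shape


# Shapes taken from the three docstring examples.
S, B, E = partition_shape([3, 50, 2], batch_ndims=0, event_ndims=1)
S2, B2, E2 = partition_shape([3, 4, 2, 2], batch_ndims=0, event_ndims=2)
S3, B3, E3 = partition_shape([4, 25, 2, 3], batch_ndims=1, event_ndims=1)
```

The three calls mirror the multivariate Normal, Wishart, and batched trivariate Normal cases from the docstring.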
diff --git a/tensorflow/contrib/distributions/python/ops/sinh_arcsinh.py b/tensorflow/contrib/distributions/python/ops/sinh_arcsinh.py
index c4b8f055b7fbc3f0835b503eddd7617610326d8c..0d8a1926913766da374cb65767dccfa28bf75579 100644
--- a/tensorflow/contrib/distributions/python/ops/sinh_arcsinh.py
+++ b/tensorflow/contrib/distributions/python/ops/sinh_arcsinh.py
@@ -174,13 +174,12 @@ class SinhArcsinh(transformed_distribution.TransformedDistribution):
             skewness=skewness.dtype.as_numpy_dtype(0.),
             tailweight=tailweight, event_ndims=0)
 
-      # Make the Affine bijector, Z --> loc + scale * Z (2 / F_0(2))
+      # Make the AffineScalar bijector, Z --> loc + scale * Z (2 / F_0(2))
       c = 2 * scale / f_noskew.forward(ops.convert_to_tensor(2, dtype=dtype))
-      affine = bijectors.Affine(
+      affine = bijectors.AffineScalar(
           shift=loc,
-          scale_identity_multiplier=c,
-          validate_args=validate_args,
-          event_ndims=0)
+          scale=c,
+          validate_args=validate_args)
 
       bijector = bijectors.Chain([affine, f])
 
diff --git a/tensorflow/contrib/distributions/python/ops/statistical_testing.py b/tensorflow/contrib/distributions/python/ops/statistical_testing.py
new file mode 100644
index 0000000000000000000000000000000000000000..d66c34cc1a45cc09da5138a5f72ae3817690db49
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/ops/statistical_testing.py
@@ -0,0 +1,728 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Statistical test assertions calibrated for their error rates."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import itertools
+
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import gen_math_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn_ops
+
+__all__ = [
+    "true_mean_confidence_interval_by_dkwm",
+    "assert_true_mean_equal_by_dkwm",
+    "min_discrepancy_of_true_means_detectable_by_dkwm",
+    "min_num_samples_for_dkwm_mean_test",
+    "assert_true_mean_equal_by_dkwm_two_sample",
+    "min_discrepancy_of_true_means_detectable_by_dkwm_two_sample",
+    "min_num_samples_for_dkwm_mean_two_sample_test",
+]
+
+
+def _batch_sort_vector(x, ascending=True, name=None):
+  with ops.name_scope(name, "sort_each_row", [x]):
+    x = ops.convert_to_tensor(x, name="x")
+    n = array_ops.shape(x)[-1]
+    if ascending:
+      y, _ = nn_ops.top_k(-x, k=n, sorted=True)
+      y = -y
+    else:
+      y, _ = nn_ops.top_k(x, k=n, sorted=True)
+    y.set_shape(x.shape)
+    return y
+
+
+def _do_maximum_mean(samples, envelope, high, name=None):
+  """Common code between maximum_mean and minimum_mean."""
+  with ops.name_scope(name, "do_maximum_mean", [samples, envelope, high]):
+    n = array_ops.rank(samples)
+    # Move the batch dimension of `samples` to the rightmost position,
+    # where the _batch_sort_vector function wants it.
+    perm = array_ops.concat([math_ops.range(1, n), [0]], axis=0)
+    samples = array_ops.transpose(samples, perm)
+
+    samples = _batch_sort_vector(samples)
+    batch_shape = array_ops.shape(samples)[:-1]
+    n = array_ops.shape(samples)[-1]
+    step = 1. / math_ops.cast(n, dtype=samples.dtype.base_dtype)
+
+    def _loop_body(iter_, total, to_skip):
+      total = array_ops.where(
+          step <= to_skip,
+          total,
+          array_ops.where(
+              to_skip > 0.,
+              total + (step - to_skip) * samples[..., iter_],
+              total + step * samples[..., iter_]))
+      to_skip = array_ops.where(step <= to_skip, to_skip - step, 0.)
+      return [iter_ + 1, total, to_skip]
+
+    _, total, _ = control_flow_ops.while_loop(
+        cond=lambda iter_, *args: iter_ < n,
+        body=_loop_body,
+        loop_vars=[
+            0,
+            array_ops.zeros(batch_shape, dtype=samples.dtype.base_dtype),
+            envelope,  # to_skip
+        ])
+
+  return total + envelope * high
+
+
+def _maximum_mean(samples, envelope, high, name=None):
+  """Returns a stochastic upper bound on the mean of a scalar distribution.
+
+  The idea is that if the true CDF is within an `eps`-envelope of the
+  empirical CDF of the samples, and the support is bounded above, then
+  the mean is bounded above as well.  In symbols,
+
+  ```none
+  sup_x(|F_n(x) - F(x)|) < eps
+  ```
+
+  The 0th dimension of `samples` is interpreted as independent and
+  identically distributed samples.  The remaining dimensions are
+  broadcast together with `envelope` and `high`, and operated on
+  separately.
+
+  Args:
+    samples: Floating-point tensor of samples from the distribution(s)
+      of interest.  Entries are assumed IID across the 0th dimension.
+      The other dimensions must broadcast with `envelope` and `high`.
+    envelope: Floating-point tensor of sizes of admissible CDF
+      envelopes (i.e., the `eps` above).
+    high: Floating-point tensor of upper bounds on the distributions'
+      supports.
+    name: A name for this operation (optional).
+
+  Returns:
+    bound: Floating-point tensor of upper bounds on the true means.
+
+  Raises:
+    InvalidArgumentError: If some `sample` is found to be larger than
+      the corresponding `high`.
+  """
+  with ops.name_scope(name, "maximum_mean", [samples, envelope, high]):
+    samples = ops.convert_to_tensor(samples, name="samples")
+    envelope = ops.convert_to_tensor(envelope, name="envelope")
+    high = ops.convert_to_tensor(high, name="high")
+
+    xmax = math_ops.reduce_max(samples, axis=[0])
+    msg = "Given sample maximum value exceeds expectations"
+    check_op = check_ops.assert_less_equal(xmax, high, message=msg)
+    with ops.control_dependencies([check_op]):
+      return array_ops.identity(_do_maximum_mean(samples, envelope, high))
+
+
+def _minimum_mean(samples, envelope, low, name=None):
+  """Returns a stochastic lower bound on the mean of a scalar distribution.
+
+  The idea is that if the true CDF is within an `eps`-envelope of the
+  empirical CDF of the samples, and the support is bounded below, then
+  the mean is bounded below as well.  In symbols,
+
+  ```none
+  sup_x(|F_n(x) - F(x)|) < eps
+  ```
+
+  The 0th dimension of `samples` is interpreted as independent and
+  identically distributed samples.  The remaining dimensions are
+  broadcast together with `envelope` and `low`, and operated on
+  separately.
+
+  Args:
+    samples: Floating-point tensor of samples from the distribution(s)
+      of interest.  Entries are assumed IID across the 0th dimension.
+      The other dimensions must broadcast with `envelope` and `low`.
+    envelope: Floating-point tensor of sizes of admissible CDF
+      envelopes (i.e., the `eps` above).
+    low: Floating-point tensor of lower bounds on the distributions'
+      supports.
+    name: A name for this operation (optional).
+
+  Returns:
+    bound: Floating-point tensor of lower bounds on the true means.
+
+  Raises:
+    InvalidArgumentError: If some `sample` is found to be smaller than
+      the corresponding `low`.
+  """
+  with ops.name_scope(name, "minimum_mean", [samples, envelope, low]):
+    samples = ops.convert_to_tensor(samples, name="samples")
+    envelope = ops.convert_to_tensor(envelope, name="envelope")
+    low = ops.convert_to_tensor(low, name="low")
+
+    xmin = math_ops.reduce_min(samples, axis=[0])
+    msg = "Given sample minimum value falls below expectations"
+    check_op = check_ops.assert_greater_equal(xmin, low, message=msg)
+    with ops.control_dependencies([check_op]):
+      return - _do_maximum_mean(-samples, envelope, -low)
+
+
+def _dkwm_cdf_envelope(n, error_rate, name=None):
+  """Computes the CDF envelope that the DKWM inequality licenses.
+
+  The [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval)
+  gives a stochastic bound on the distance between the true cumulative
+  distribution function (CDF) of any distribution and its empirical
+  CDF.  To wit, for `n` iid samples from any distribution with CDF F,
+
+  ```none
+  P(sup_x |F_n(x) - F(x)| > eps) < 2exp(-2n eps^2)
+  ```
+
+  This function computes the envelope size `eps` as a function of the
+  number of samples `n` and the desired limit on the left-hand
+  probability above.
+
+  Args:
+    n: Tensor of numbers of samples drawn.
+    error_rate: Floating-point tensor of admissible rates of mistakes.
+    name: A name for this operation (optional).
+
+  Returns:
+    eps: Tensor of maximum distances the true CDF can be from the
+      empirical CDF.  This scales as `O(sqrt(-log(error_rate)))` and
+      as `O(1 / sqrt(n))`.  The shape is the broadcast of `n` and
+      `error_rate`.
+  """
+  with ops.name_scope(name, "dkwm_cdf_envelope", [n, error_rate]):
+    n = math_ops.cast(n, dtype=error_rate.dtype)
+    return math_ops.sqrt(-gen_math_ops.log(error_rate / 2.) / (2. * n))
+
+
+def _check_shape_dominates(tensor, tensors):
+  """Check that broadcasting `tensor` against `tensors` does not expand it.
+
+  Why?  Because I want to be very sure that the samples tensor is not
+  accidentally enlarged by broadcasting against tensors that are
+  supposed to be describing the distribution(s) sampled from, lest the
+  sample counts end up inflated.
+
+  Args:
+    tensor: A Tensor whose shape is to be protected against broadcasting.
+    tensors: A list of Tensors to check `tensor` against.
+
+  Returns:
+    tensor: `tf.identity(tensor)` with control dependencies attached;
+      be sure to use that downstream.
+  """
+  def check(t):
+    target = array_ops.shape(tensor)[1:]
+    result = array_ops.broadcast_dynamic_shape(target, array_ops.shape(t))
+    # This rank check ensures that I don't get a wrong answer from the
+    # _shapes_ broadcasting against each other.
+    gt = check_ops.assert_greater(array_ops.rank(target), array_ops.rank(t))
+    eq = check_ops.assert_equal(target, result)
+    return gt, eq
+  checks = list(itertools.chain(*[check(t) for t in tensors]))
+  with ops.control_dependencies(checks):
+    return array_ops.identity(array_ops.identity(tensor))
+
+
+def true_mean_confidence_interval_by_dkwm(
+    samples, low, high, error_rate=1e-6, name=None):
+  """Computes a confidence interval for the mean of a scalar distribution.
+
+  In batch mode, computes confidence intervals for all distributions
+  in the batch (which need not be identically distributed).
+
+  Relies on the [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval).
+
+  The probability (over the randomness of drawing the given samples)
+  that any true mean is outside the corresponding returned interval is
+  no more than the given `error_rate`.  The size of the intervals
+  scales as
+  `O(1 / sqrt(#samples))`, as `O(high - low)`, and as `O(-log(error_rate))`.
+
+  Note that `error_rate` is a total error rate for all the confidence
+  intervals in the batch.  As such, if the batch is nontrivial, the
+  error rate is not broadcast but divided (evenly) among the batch
+  members.
+
+  Args:
+    samples: Floating-point tensor of samples from the distribution(s)
+      of interest.  Entries are assumed IID across the 0th dimension.
+      The other dimensions must broadcast with `low` and `high`.
+    low: Floating-point tensor of lower bounds on the distributions'
+      supports.
+    high: Floating-point tensor of upper bounds on the distributions'
+      supports.
+    error_rate: *Scalar* admissible total rate of mistakes.
+    name: A name for this operation (optional).
+
+  Returns:
+    low: A floating-point tensor of stochastic lower bounds on the true means.
+    high: A floating-point tensor of stochastic upper bounds on the true means.
+  """
+  with ops.name_scope(
+      name, "true_mean_confidence_interval_by_dkwm",
+      [samples, low, high, error_rate]):
+    samples = ops.convert_to_tensor(samples, name="samples")
+    low = ops.convert_to_tensor(low, name="low")
+    high = ops.convert_to_tensor(high, name="high")
+    error_rate = ops.convert_to_tensor(error_rate, name="error_rate")
+    samples = _check_shape_dominates(samples, [low, high])
+    check_ops.assert_scalar(error_rate)  # Static shape
+    error_rate = _itemwise_error_rate(error_rate, [low, high], samples)
+    n = array_ops.shape(samples)[0]
+    envelope = _dkwm_cdf_envelope(n, error_rate)
+    min_mean = _minimum_mean(samples, envelope, low)
+    max_mean = _maximum_mean(samples, envelope, high)
+    return min_mean, max_mean
+
+
+def _itemwise_error_rate(
+    total_error_rate, param_tensors, sample_tensor=None, name=None):
+  with ops.name_scope(
+      name, "itemwise_error_rate",
+      [total_error_rate, param_tensors, sample_tensor]):
+    result_shape = [1]
+    for p_tensor in param_tensors:
+      result_shape = array_ops.broadcast_dynamic_shape(
+          array_ops.shape(p_tensor), result_shape)
+    if sample_tensor is not None:
+      result_shape = array_ops.broadcast_dynamic_shape(
+          array_ops.shape(sample_tensor)[1:], result_shape)
+    num_items = math_ops.reduce_prod(result_shape)
+    return total_error_rate / math_ops.cast(
+        num_items, dtype=total_error_rate.dtype)
+
+
+def assert_true_mean_equal_by_dkwm(
+    samples, low, high, expected, false_fail_rate=1e-6, name=None):
+  """Asserts the mean of the given distribution is as expected.
+
+  More precisely, fails if there is enough evidence (using the
+  [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval))
+  that the true mean of some distribution from which the given samples are
+  drawn is _not_ the given expected mean with statistical significance
+  `false_fail_rate` or stronger, otherwise passes.  If you also want to
+  check that you are gathering enough evidence that a pass is not
+  spurious, see `min_num_samples_for_dkwm_mean_test` and
+  `min_discrepancy_of_true_means_detectable_by_dkwm`.
+
+  Note that `false_fail_rate` is a total false failure rate for all
+  the assertions in the batch.  As such, if the batch is nontrivial,
+  the assertion will insist on stronger evidence to fail any one member.
+
+  Args:
+    samples: Floating-point tensor of samples from the distribution(s)
+      of interest.  Entries are assumed IID across the 0th dimension.
+      The other dimensions must broadcast with `low` and `high`.
+    low: Floating-point tensor of lower bounds on the distributions'
+      supports.
+    high: Floating-point tensor of upper bounds on the distributions'
+      supports.
+    expected: Floating-point tensor of expected true means.
+    false_fail_rate: *Scalar* admissible total rate of mistakes.
+    name: A name for this operation (optional).
+
+  Returns:
+    check: Op that raises `InvalidArgumentError` if any expected mean is
+      outside the corresponding confidence interval.
+  """
+  with ops.name_scope(
+      name, "assert_true_mean_equal_by_dkwm",
+      [samples, low, high, expected, false_fail_rate]):
+    samples = ops.convert_to_tensor(samples, name="samples")
+    low = ops.convert_to_tensor(low, name="low")
+    high = ops.convert_to_tensor(high, name="high")
+    expected = ops.convert_to_tensor(expected, name="expected")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    samples = _check_shape_dominates(samples, [low, high, expected])
+    min_mean, max_mean = true_mean_confidence_interval_by_dkwm(
+        samples, low, high, error_rate=false_fail_rate)
+    less_op = check_ops.assert_less(
+        min_mean, expected, message="Mean confidence interval too high")
+    with ops.control_dependencies([less_op]):
+      return check_ops.assert_greater(
+          max_mean, expected, message="Mean confidence interval too low")
+
+
+def min_discrepancy_of_true_means_detectable_by_dkwm(
+    n, low, high, false_fail_rate, false_pass_rate, name=None):
+  """Returns the minimum mean discrepancy that a DKWM-based test can detect.
+
+  DKWM is the [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval).
+
+  Note that `false_fail_rate` is a total false failure rate for all
+  the tests in the batch.  As such, if the batch is nontrivial, each
+  member will demand more samples.  The `false_pass_rate` is also
+  interpreted as a total, but is treated asymmetrically: If each test
+  in the batch detects its corresponding discrepancy with probability
+  at least `1 - false_pass_rate`, then running all those tests and
+  failing if any one fails will jointly detect all those discrepancies
+  with the same `false_pass_rate`.
+
+  Args:
+    n: Tensor of numbers of samples to be drawn from the distributions
+      of interest.
+    low: Floating-point tensor of lower bounds on the distributions'
+      supports.
+    high: Floating-point tensor of upper bounds on the distributions'
+      supports.
+    false_fail_rate: *Scalar* admissible total rate of false failures.
+    false_pass_rate: *Scalar* admissible rate of false passes.
+    name: A name for this operation (optional).
+
+  Returns:
+    discr: Tensor of lower bounds on the distances between true
+      means detectable by a DKWM-based test.
+
+  For each batch member `i`, of `K` total, drawing `n[i]` samples from
+  some scalar distribution supported on `[low[i], high[i]]` is enough
+  to detect a difference in means of size `discr[i]` or more.
+  Specifically, we guarantee that (a) if the true mean is the expected
+  mean, `assert_true_mean_equal_by_dkwm` will fail with probability at
+  most `false_fail_rate / K` (which amounts to `false_fail_rate` if
+  applied to the whole batch at once), and (b) if the true mean
+  differs from the expected mean by at least `discr[i]`,
+  `assert_true_mean_equal_by_dkwm` will pass with probability at most
+  `false_pass_rate`.
+
+  The detectable discrepancy scales as
+
+  - `O(high[i] - low[i])`,
+  - `O(1 / sqrt(n[i]))`,
+  - `O(-log(false_fail_rate/K))`, and
+  - `O(-log(false_pass_rate))`.
+  """
+  with ops.name_scope(
+      name, "min_discrepancy_of_true_means_detectable_by_dkwm",
+      [n, low, high, false_fail_rate, false_pass_rate]):
+    n = ops.convert_to_tensor(n, name="n")
+    low = ops.convert_to_tensor(low, name="low")
+    high = ops.convert_to_tensor(high, name="high")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    false_pass_rate = ops.convert_to_tensor(
+        false_pass_rate, name="false_pass_rate")
+    # Algorithm: Assume a true CDF F.  The DKWM inequality gives a
+    # stochastic bound on how far the observed empirical CDF F_n can be.
+    # Then, using the DKWM inequality again gives a stochastic bound on
+    # the farthest candidate true CDF F' that
+    # true_mean_confidence_interval_by_dkwm might consider.  At worst, these
+    # errors may go in the same direction, so the distance between F and
+    # F' is bounded by the sum.
+    # On batching: false fail rates sum, so I need to reduce
+    # the input to account for the batching.  False pass rates
+    # max, so I don't.
+    sampling_envelope = _dkwm_cdf_envelope(n, false_pass_rate)
+    false_fail_rate = _itemwise_error_rate(false_fail_rate, [n, low, high])
+    analysis_envelope = _dkwm_cdf_envelope(n, false_fail_rate)
+    return (high - low) * (sampling_envelope + analysis_envelope)
+
+
+def min_num_samples_for_dkwm_mean_test(
+    discrepancy, low, high,
+    false_fail_rate=1e-6, false_pass_rate=1e-6, name=None):
+  """Returns how many samples suffice for a one-sample DKWM mean test.
+
+  To wit, returns an upper bound on the number of samples necessary to
+  guarantee detecting a mean difference of at least the given
+  `discrepancy`, with the given `false_fail_rate` and `false_pass_rate`,
+  using the [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval)
+  on a scalar distribution supported on `[low, high]`.
+
+  Args:
+    discrepancy: Floating-point tensor of desired upper limits on mean
+      differences that may go undetected with probability higher than
+      `false_pass_rate`.
+    low: Tensor of lower bounds on the distributions' support.
+    high: Tensor of upper bounds on the distributions' support.
+    false_fail_rate: *Scalar* admissible total rate of false failures.
+    false_pass_rate: *Scalar* admissible rate of false passes.
+    name: A name for this operation (optional).
+
+  Returns:
+    n: Tensor of numbers of samples to be drawn from the distributions
+      of interest.
+
+  The `discrepancy`, `low`, and `high` tensors must have
+  broadcast-compatible shapes.
+
+  For each batch member `i`, of `K` total, drawing `n[i]` samples from
+  some scalar distribution supported on `[low[i], high[i]]` is enough
+  to detect a difference in means of size `discrepancy[i]` or more.
+  Specifically, we guarantee that (a) if the true mean is the expected
+  mean, `assert_true_mean_equal_by_dkwm` will fail with probability at
+  most `false_fail_rate / K` (which amounts to `false_fail_rate` if
+  applied to the whole batch at once), and (b) if the true mean
+  differs from the expected mean by at least `discrepancy[i]`,
+  `assert_true_mean_equal_by_dkwm` will pass with probability at most
+  `false_pass_rate`.
+
+  The required number of samples scales
+  as `O((high[i] - low[i])**2)`, `O(-log(false_fail_rate/K))`,
+  `O(-log(false_pass_rate))`, and `O(1 / discrepancy[i]**2)`.
+  """
+  with ops.name_scope(
+      name, "min_num_samples_for_dkwm_mean_test",
+      [low, high, false_fail_rate, false_pass_rate, discrepancy]):
+    discrepancy = ops.convert_to_tensor(
+        discrepancy, name="discrepancy")
+    low = ops.convert_to_tensor(low, name="low")
+    high = ops.convert_to_tensor(high, name="high")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    false_pass_rate = ops.convert_to_tensor(
+        false_pass_rate, name="false_pass_rate")
+    # Could choose to cleverly allocate envelopes, but this is sound.
+    envelope1 = discrepancy / (2. * (high - low))
+    envelope2 = envelope1
+    false_fail_rate = _itemwise_error_rate(
+        false_fail_rate, [low, high, discrepancy])
+    n1 = -math_ops.log(false_fail_rate / 2.) / (2. * envelope1**2)
+    n2 = -math_ops.log(false_pass_rate / 2.) / (2. * envelope2**2)
+    return math_ops.maximum(n1, n2)
+
+
+def assert_true_mean_equal_by_dkwm_two_sample(
+    samples1, low1, high1, samples2, low2, high2,
+    false_fail_rate=1e-6, name=None):
+  """Asserts the means of the given distributions are equal.
+
+  More precisely, fails if there is enough evidence (using the
+  [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval))
+  that the means of the distributions from which the given samples are
+  drawn are _not_ equal with statistical significance `false_fail_rate`
+  or stronger, otherwise passes.  If you also want to check that you
+  are gathering enough evidence that a pass is not spurious, see
+  `min_num_samples_for_dkwm_mean_two_sample_test` and
+  `min_discrepancy_of_true_means_detectable_by_dkwm_two_sample`.
+
+  Note that `false_fail_rate` is a total false failure rate for all
+  the assertions in the batch.  As such, if the batch is nontrivial,
+  the assertion will insist on stronger evidence to fail any one member.
+
+  Args:
+    samples1: Floating-point tensor of samples from the
+      distribution(s) A.  Entries are assumed IID across the 0th
+      dimension.  The other dimensions must broadcast with `low1`,
+      `high1`, `low2`, and `high2`.
+    low1: Floating-point tensor of lower bounds on the supports of the
+      distributions A.
+    high1: Floating-point tensor of upper bounds on the supports of
+      the distributions A.
+    samples2: Floating-point tensor of samples from the
+      distribution(s) B.  Entries are assumed IID across the 0th
+      dimension.  The other dimensions must broadcast with `low1`,
+      `high1`, `low2`, and `high2`.
+    low2: Floating-point tensor of lower bounds on the supports of the
+      distributions B.
+    high2: Floating-point tensor of upper bounds on the supports of
+      the distributions B.
+    false_fail_rate: *Scalar* admissible total rate of false failures.
+    name: A name for this operation (optional).
+
+  Returns:
+    check: Op that raises `InvalidArgumentError` if any pair of confidence
+      intervals for the corresponding true means does not overlap.
+  """
+  with ops.name_scope(
+      name, "assert_true_mean_equal_by_dkwm_two_sample",
+      [samples1, low1, high1, samples2, low2, high2, false_fail_rate]):
+    samples1 = ops.convert_to_tensor(samples1, name="samples1")
+    low1 = ops.convert_to_tensor(low1, name="low1")
+    high1 = ops.convert_to_tensor(high1, name="high1")
+    samples2 = ops.convert_to_tensor(samples2, name="samples2")
+    low2 = ops.convert_to_tensor(low2, name="low2")
+    high2 = ops.convert_to_tensor(high2, name="high2")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    samples1 = _check_shape_dominates(samples1, [low1, high1])
+    samples2 = _check_shape_dominates(samples2, [low2, high2])
+    compatible_samples = check_ops.assert_equal(
+        array_ops.shape(samples1)[1:], array_ops.shape(samples2)[1:])
+    with ops.control_dependencies([compatible_samples]):
+      # Could in principle play games with cleverly allocating
+      # significance instead of the even split below.  It may be possible
+      # to get tighter intervals, in order to obtain a higher power test.
+      # Any allocation strategy that depends only on the support bounds
+      # and sample counts should be valid; however, because the intervals
+      # scale as O(-log(false_fail_rate)), there doesn't seem to be much
+      # room to win.
+      min_mean_1, max_mean_1 = true_mean_confidence_interval_by_dkwm(
+          samples1, low1, high1, false_fail_rate / 2.)
+      min_mean_2, max_mean_2 = true_mean_confidence_interval_by_dkwm(
+          samples2, low2, high2, false_fail_rate / 2.)
+      # I want to assert
+      #   not (max_mean_1 < min_mean_2 or min_mean_1 > max_mean_2),
+      # but I think I only have and-combination of asserts, so use De Morgan.
+      clause1_op = check_ops.assert_greater_equal(max_mean_1, min_mean_2)
+      with ops.control_dependencies([clause1_op]):
+        return check_ops.assert_less_equal(min_mean_1, max_mean_2)
+
+
+def min_discrepancy_of_true_means_detectable_by_dkwm_two_sample(
+    n1, low1, high1, n2, low2, high2,
+    false_fail_rate, false_pass_rate, name=None):
+  """Returns the minimum mean discrepancy for a two-sample DKWM-based test.
+
+  DKWM is the [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval).
+
+  Note that `false_fail_rate` is a total false failure rate for all
+  the tests in the batch.  As such, if the batch is nontrivial, each
+  member will demand more samples.  The `false_pass_rate` is also
+  interpreted as a total, but is treated asymmetrically: If each test
+  in the batch detects its corresponding discrepancy with probability
+  at least `1 - false_pass_rate`, then running all those tests and
+  failing if any one fails will jointly detect all those discrepancies
+  with the same `false_pass_rate`.
+
+  Args:
+    n1: Tensor of numbers of samples to be drawn from the distributions A.
+    low1: Floating-point tensor of lower bounds on the supports of the
+      distributions A.
+    high1: Floating-point tensor of upper bounds on the supports of
+      the distributions A.
+    n2: Tensor of numbers of samples to be drawn from the distributions B.
+    low2: Floating-point tensor of lower bounds on the supports of the
+      distributions B.
+    high2: Floating-point tensor of upper bounds on the supports of
+      the distributions B.
+    false_fail_rate: *Scalar* admissible total rate of false failures.
+    false_pass_rate: *Scalar* admissible rate of false passes.
+    name: A name for this operation (optional).
+
+  Returns:
+    discr: Tensor of lower bounds on the distances between true means
+       detectable by a two-sample DKWM-based test.
+
+  For each batch member `i`, of `K` total, drawing `n1[i]` samples
+  from scalar distribution A supported on `[low1[i], high1[i]]` and `n2[i]`
+  samples from scalar distribution B supported on `[low2[i], high2[i]]`
+  is enough to detect a difference in their true means of size
+  `discr[i]` or more.  Specifically, we guarantee that (a) if their
+  true means are equal, `assert_true_mean_equal_by_dkwm_two_sample`
+  will fail with probability at most `false_fail_rate/K` (which
+  amounts to `false_fail_rate` if applied to the whole batch at once),
+  and (b) if their true means differ by at least `discr[i]`,
+  `assert_true_mean_equal_by_dkwm_two_sample` will pass with
+  probability at most `false_pass_rate`.
+
+  The detectable discrepancy scales as
+
+  - `O(high1[i] - low1[i])`, `O(high2[i] - low2[i])`,
+  - `O(1 / sqrt(n1[i]))`, `O(1 / sqrt(n2[i]))`,
+  - `O(sqrt(-log(false_fail_rate/K)))`, and
+  - `O(sqrt(-log(false_pass_rate)))`.
+  """
+  with ops.name_scope(
+      name, "min_discrepancy_of_true_means_detectable_by_dkwm_two_sample",
+      [n1, low1, high1, n2, low2, high2, false_fail_rate, false_pass_rate]):
+    n1 = ops.convert_to_tensor(n1, name="n1")
+    low1 = ops.convert_to_tensor(low1, name="low1")
+    high1 = ops.convert_to_tensor(high1, name="high1")
+    n2 = ops.convert_to_tensor(n2, name="n2")
+    low2 = ops.convert_to_tensor(low2, name="low2")
+    high2 = ops.convert_to_tensor(high2, name="high2")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    false_pass_rate = ops.convert_to_tensor(
+        false_pass_rate, name="false_pass_rate")
+    det_disc1 = min_discrepancy_of_true_means_detectable_by_dkwm(
+        n1, low1, high1, false_fail_rate / 2., false_pass_rate / 2.)
+    det_disc2 = min_discrepancy_of_true_means_detectable_by_dkwm(
+        n2, low2, high2, false_fail_rate / 2., false_pass_rate / 2.)
+    return det_disc1 + det_disc2
+
+
+def min_num_samples_for_dkwm_mean_two_sample_test(
+    discrepancy, low1, high1, low2, high2,
+    false_fail_rate=1e-6, false_pass_rate=1e-6, name=None):
+  """Returns how many samples suffice for a two-sample DKWM mean test.
+
+  DKWM is the [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval).
+
+  Args:
+    discrepancy: Floating-point tensor of desired upper limits on mean
+      differences that may go undetected with probability higher than
+      `false_pass_rate`.
+    low1: Floating-point tensor of lower bounds on the supports of the
+      distributions A.
+    high1: Floating-point tensor of upper bounds on the supports of
+      the distributions A.
+    low2: Floating-point tensor of lower bounds on the supports of the
+      distributions B.
+    high2: Floating-point tensor of upper bounds on the supports of
+      the distributions B.
+    false_fail_rate: *Scalar* admissible total rate of false failures.
+    false_pass_rate: *Scalar* admissible rate of false passes.
+    name: A name for this operation (optional).
+
+  Returns:
+    n1: Tensor of numbers of samples to be drawn from the distributions A.
+    n2: Tensor of numbers of samples to be drawn from the distributions B.
+
+  For each batch member `i`, of `K` total, drawing `n1[i]` samples
+  from scalar distribution A supported on `[low1[i], high1[i]]` and `n2[i]`
+  samples from scalar distribution B supported on `[low2[i], high2[i]]`
+  is enough to detect a difference in their true means of size
+  `discrepancy[i]` or more.  Specifically, we guarantee that (a) if their
+  true means are equal, `assert_true_mean_equal_by_dkwm_two_sample`
+  will fail with probability at most `false_fail_rate/K` (which
+  amounts to `false_fail_rate` if applied to the whole batch at once),
+  and (b) if their true means differ by at least `discrepancy[i]`,
+  `assert_true_mean_equal_by_dkwm_two_sample` will pass with
+  probability at most `false_pass_rate`.
+
+  The required number of samples scales as
+
+  - `O((high1[i] - low1[i])**2)`, `O((high2[i] - low2[i])**2)`,
+  - `O(-log(false_fail_rate/K))`,
+  - `O(-log(false_pass_rate))`, and
+  - `O(1 / discrepancy[i]**2)`.
+  """
+  with ops.name_scope(
+      name, "min_num_samples_for_dkwm_mean_two_sample_test",
+      [low1, high1, low2, high2,
+       false_fail_rate, false_pass_rate, discrepancy]):
+    discrepancy = ops.convert_to_tensor(discrepancy, name="discrepancy")
+    low1 = ops.convert_to_tensor(low1, name="low1")
+    high1 = ops.convert_to_tensor(high1, name="high1")
+    low2 = ops.convert_to_tensor(low2, name="low2")
+    high2 = ops.convert_to_tensor(high2, name="high2")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    false_pass_rate = ops.convert_to_tensor(
+        false_pass_rate, name="false_pass_rate")
+    # Could choose to cleverly allocate discrepancy tolerances and
+    # failure probabilities, but this is sound.
+    n1 = min_num_samples_for_dkwm_mean_test(
+        discrepancy / 2., low1, high1,
+        false_fail_rate / 2., false_pass_rate / 2.)
+    n2 = min_num_samples_for_dkwm_mean_test(
+        discrepancy / 2., low2, high2,
+        false_fail_rate / 2., false_pass_rate / 2.)
+    return n1, n2
diff --git a/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py b/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py
index 0c747f8e68529484ae6f695b8500cde74857bb11..971d65c4a69140161461fdac93bb588014dd3e88 100644
--- a/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py
+++ b/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py
@@ -181,7 +181,7 @@ def quadrature_scheme_softmaxnormal_quantiles(
       edges = array_ops.reshape(edges, shape=array_ops.concat([
           [-1], array_ops.ones([batch_ndims], dtype=dtypes.int32)], axis=0))
       quantiles = dist.quantile(edges)
-      quantiles = SoftmaxCentered(event_ndims=1).forward(quantiles)
+      quantiles = SoftmaxCentered().forward(quantiles)
       # Cyclically permute left by one.
       perm = array_ops.concat([
           math_ops.range(1, 1 + batch_ndims), [0]], axis=0)
@@ -248,11 +248,7 @@ class VectorDiffeomixture(distribution_lib.Distribution):
   The default quadrature scheme chooses `z_{N, n}` as `N` midpoints of
   the quantiles of `p(z)` (generalized quantiles if `K > 2`).
 
-  See [1] for more details.
-
-  [1]. "Quadrature Compound: An approximating family of distributions"
-       Joshua Dillon, Ian Langmore, arXiv preprints
-       https://arxiv.org/abs/1801.03080
+  See [Dillon and Langmore (2018)][1] for more details.
 
   #### About `Vector` distributions in TensorFlow.
 
@@ -313,6 +309,13 @@ class VectorDiffeomixture(distribution_lib.Distribution):
             is_positive_definite=True),
       ],
       validate_args=True)
+  ```
+
+  #### References
+
+  [1]: Joshua Dillon and Ian Langmore. Quadrature Compound: An approximating
+       family of distributions. _arXiv preprint arXiv:1801.03080_, 2018.
+       https://arxiv.org/abs/1801.03080
   """
 
   def __init__(self,
diff --git a/tensorflow/contrib/distributions/python/ops/vector_sinh_arcsinh_diag.py b/tensorflow/contrib/distributions/python/ops/vector_sinh_arcsinh_diag.py
index e1ccf116457a97261b9ce3965552764771d3bdd2..003c66b9413fdcad20fbcc8b4bf47259692932e7 100644
--- a/tensorflow/contrib/distributions/python/ops/vector_sinh_arcsinh_diag.py
+++ b/tensorflow/contrib/distributions/python/ops/vector_sinh_arcsinh_diag.py
@@ -227,7 +227,7 @@ class VectorSinhArcsinhDiag(transformed_distribution.TransformedDistribution):
       c = 2 * scale_diag_part / f_noskew.forward(
           ops.convert_to_tensor(2, dtype=dtype))
       affine = bijectors.Affine(
-          shift=loc, scale_diag=c, validate_args=validate_args, event_ndims=1)
+          shift=loc, scale_diag=c, validate_args=validate_args)
 
       bijector = bijectors.Chain([affine, f])
 
diff --git a/tensorflow/contrib/eager/python/BUILD b/tensorflow/contrib/eager/python/BUILD
index a26ec8513f4b7b9c278edddc95e6acd2523194f2..80176397c02f22095a3a9be3d12c2115ec4eca29 100644
--- a/tensorflow/contrib/eager/python/BUILD
+++ b/tensorflow/contrib/eager/python/BUILD
@@ -11,12 +11,14 @@ py_library(
     srcs_version = "PY2AND3",
     visibility = ["//visibility:public"],
     deps = [
+        ":checkpointable_utils",
         ":datasets",
         ":metrics",
         ":network",
         ":saver",
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:gradients",
         "//tensorflow/python:numerics",
         "//tensorflow/python:resource_variable_ops",
         "//tensorflow/python:script_ops",
@@ -26,7 +28,6 @@ py_library(
         "//tensorflow/python/eager:backprop",
         "//tensorflow/python/eager:context",
         "//tensorflow/python/eager:core",
-        "//tensorflow/python/eager:custom_gradient",
         "//tensorflow/python/eager:execution_callbacks",
         "//tensorflow/python/eager:function",
     ],
@@ -69,6 +70,7 @@ cuda_py_test(
     srcs = ["datasets_test.py"],
     additional_deps = [
         ":datasets",
+        ":checkpointable_utils",
         "//tensorflow/contrib/data/python/ops:transformation_ops",
         "//tensorflow/contrib/lookup:lookup_py",
         "//tensorflow/python:dtypes",
@@ -116,6 +118,7 @@ py_library(
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:internal"],
     deps = [
+        "//tensorflow/contrib/eager/python:checkpointable_utils",
         "//tensorflow/contrib/summary:summary_ops",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:control_flow_ops",
@@ -230,24 +233,27 @@ py_library(
         "//tensorflow/python:constant_op",
         "//tensorflow/python:control_flow_ops",
         "//tensorflow/python:dtypes",
+        "//tensorflow/python:errors",
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:init_ops",
-        "//tensorflow/python:io_ops",
+        "//tensorflow/python:pywrap_tensorflow",
         "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python:session",
         "//tensorflow/python:tensor_shape",
         "//tensorflow/python:training",
+        "//tensorflow/python:util",
         "//tensorflow/python:variable_scope",
         "//tensorflow/python/eager:context",
     ],
 )
 
-py_test(
+cuda_py_test(
     name = "checkpointable_utils_test",
     srcs = ["checkpointable_utils_test.py"],
-    srcs_version = "PY2AND3",
-    deps = [
+    additional_deps = [
         ":checkpointable_utils",
         ":network",
+        "@six_archive//:six",
         "//tensorflow/python:constant_op",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
@@ -262,7 +268,12 @@ py_test(
         "//tensorflow/python:variables",
         "//tensorflow/python/eager:context",
         "//tensorflow/python/eager:test",
-        "@six_archive//:six",
+        "//tensorflow/python/keras",
+    ],
+    tags = [
+        "no_oss",  # b/74395663
+        "no_windows",  # TODO: needs investigation on Windows
+        "notsan",
     ],
 )
 
diff --git a/tensorflow/contrib/eager/python/checkpointable_utils.py b/tensorflow/contrib/eager/python/checkpointable_utils.py
index e57093bdbc34660c5a6d61fb5af46bcbbbb5f524..adbb92e43b8a64b412a8484139f0ccdbc3daaa43 100644
--- a/tensorflow/contrib/eager/python/checkpointable_utils.py
+++ b/tensorflow/contrib/eager/python/checkpointable_utils.py
@@ -32,7 +32,6 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import init_ops
-from tensorflow.python.ops import io_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.training import checkpointable as core_checkpointable
@@ -220,12 +219,16 @@ def _serialize_checkpointables(
     object_proto = object_graph_proto.nodes.add()
     object_proto.slot_variables.extend(slot_variables.get(checkpointable, ()))
     object_name = object_names[checkpointable]
-    for name, saveable in (
+    for name, saveable_factory in (
         checkpointable._gather_saveables_for_checkpoint().items()):  # pylint: disable=protected-access
       attribute = object_proto.attributes.add()
       attribute.name = name
       attribute.checkpoint_key = "%s/%s/%s" % (
           object_name, _OBJECT_ATTRIBUTES_NAME, _escape_local_name(name))
+      if callable(saveable_factory):
+        saveable = saveable_factory(name=attribute.checkpoint_key)
+      else:
+        saveable = saveable_factory
       # Figure out the name-based Saver's name for this variable.
       saver_dict = saver_lib.BaseSaverBuilder.OpListToDict(
           [saveable], convert_variable_to_tensor=False)
@@ -395,7 +398,7 @@ class CheckpointLoadStatus(_LoadStatus):
 
   def run_restore_ops(self, session=None):
     """Run operations to restore objects in the dependency graph."""
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return  # Run eagerly
     if session is None:
       session = ops.get_default_session()
@@ -459,7 +462,7 @@ class InitializationOnlyStatus(_LoadStatus):
       session: The session to run initialization ops in. If `None`, uses the
         default session.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return  # run eagerly
     if session is None:
       session = ops.get_default_session()
@@ -491,10 +494,11 @@ class NameBasedSaverStatus(_LoadStatus):
       date=None, instructions=_DEPRECATED_RESTORE_INSTRUCTIONS)
   def run_restore_ops(self, session=None):
     """Load the name-based training checkpoint using a new `tf.train.Saver`."""
-    if session is None and context.in_graph_mode():
+    if session is None and not context.executing_eagerly():
       session = ops.get_default_session()
-    saver_lib.Saver(self._object_saver._global_variable_names()).restore(  # pylint: disable=protected-access
-        sess=session, save_path=self._save_path)
+    with ops.device("/cpu:0"):
+      saver_lib.Saver(self._object_saver._global_variable_names()).restore(  # pylint: disable=protected-access
+          sess=session, save_path=self._save_path)
 
   def initialize_or_restore(self, session=None):
     """Alias for `run_restore_ops`."""
@@ -518,7 +522,19 @@ class _SessionWithFeedDictAdditions(session_lib.SessionInterface):
         fetches=fetches, feed_dict=feed_dict, **kwargs)
 
 
-class Saver(object):
+def _copy_saver_with_new_var_list(old_saver, new_var_list):
+  """Copy a `tf.train.Saver`'s state to a new Saver with different variables."""
+  new_saver = saver_lib.Saver(var_list=new_var_list)
+  # TODO(allenl): Move copying functionality to Saver?
+  # pylint: disable=protected-access
+  new_saver._last_checkpoints = old_saver._last_checkpoints
+  new_saver._checkpoints_to_be_deleted = old_saver._checkpoints_to_be_deleted
+  new_saver._next_checkpoint_time = old_saver._next_checkpoint_time
+  # pylint: enable=protected-access
+  return new_saver
+
+
+class CheckpointableSaver(object):
   """Saves and restores a `Checkpointable` object and its dependencies.
 
   See `Checkpointable` for details of dependency management. `Saver` wraps
@@ -548,8 +564,9 @@ class Saver(object):
     # Allow passing in a weak reference to avoid reference cycles when
     # `Checkpointable` objects save themselves.
     self._root_checkpointable_ref = root_checkpointable
-    if context.in_graph_mode():
-      self._file_prefix_placeholder = constant_op.constant("model")
+    if not context.executing_eagerly():
+      with ops.device("/cpu:0"):
+        self._file_prefix_placeholder = constant_op.constant("model")
     else:
       self._file_prefix_placeholder = None
 
@@ -559,7 +576,6 @@ class Saver(object):
     self._last_save_saver = None
 
     # Op caching for restore
-    self._object_graph_restore_tensor = None
     self._last_restore_object_graph = None
     self._last_restore_checkpoint = None
 
@@ -596,43 +612,39 @@ class Saver(object):
     """
     named_variables, graph_proto = _serialize_object_graph(
         self._root_checkpointable)
-    in_graph_mode = context.in_graph_mode()
-    if in_graph_mode:
+    if not context.executing_eagerly():
       if session is None:
         session = ops.get_default_session()
       if self._object_graph_feed_tensor is None:
-        self._object_graph_feed_tensor = constant_op.constant(
-            "", dtype=dtypes.string)
+        with ops.device("/cpu:0"):
+          self._object_graph_feed_tensor = constant_op.constant(
+              "", dtype=dtypes.string)
       object_graph_tensor = self._object_graph_feed_tensor
       feed_additions = {object_graph_tensor: graph_proto.SerializeToString()}
     else:
       session = None
-      object_graph_tensor = constant_op.constant(
-          graph_proto.SerializeToString(), dtype=dtypes.string)
+      with ops.device("/cpu:0"):
+        object_graph_tensor = constant_op.constant(
+            graph_proto.SerializeToString(), dtype=dtypes.string)
       feed_additions = None
     assert _OBJECT_GRAPH_PROTO_KEY not in named_variables
     named_variables[_OBJECT_GRAPH_PROTO_KEY] = _NoRestoreSaveable(
         tensor=object_graph_tensor,
         name=_OBJECT_GRAPH_PROTO_KEY)
-    if not in_graph_mode or self._last_save_object_graph != graph_proto:
-      if self._last_save_object_graph is not None and in_graph_mode:
-        raise NotImplementedError(
-            "Using a single Saver to save a mutated object graph is not "
-            "currently supported when graph building. Use a different Saver "
-            "when the object graph changes (save ops will be duplicated), or "
-            "file a feature request if this limitation bothers you.")
-      saver = saver_lib.Saver(var_list=named_variables)
-      if in_graph_mode:
-        self._last_save_saver = saver
-        self._last_save_object_graph = graph_proto
-    else:
-      saver = self._last_save_saver
-    save_path = saver.save(
-        sess=_SessionWithFeedDictAdditions(
-            session=session, feed_additions=feed_additions),
-        save_path=file_prefix,
-        write_meta_graph=False,
-        global_step=checkpoint_number)
+    if self._last_save_object_graph != graph_proto:
+      if self._last_save_object_graph is not None:
+        self._last_save_saver = _copy_saver_with_new_var_list(
+            old_saver=self._last_save_saver, new_var_list=named_variables)
+      else:
+        self._last_save_saver = saver_lib.Saver(var_list=named_variables)
+      self._last_save_object_graph = graph_proto
+    with ops.device("/cpu:0"):
+      save_path = self._last_save_saver.save(
+          sess=_SessionWithFeedDictAdditions(
+              session=session, feed_additions=feed_additions),
+          save_path=file_prefix,
+          write_meta_graph=False,
+          global_step=checkpoint_number)
     return save_path
 
   def _global_variable_names(self):
@@ -646,7 +658,7 @@ class Saver(object):
             attribute_proto.checkpoint_key]
     return saver_names
 
-  def restore(self, save_path, session=None):
+  def restore(self, save_path):
     """Restore a training checkpoint.
 
     Restores `root_checkpointable` and any objects that it tracks
@@ -656,8 +668,7 @@ class Saver(object):
     constructor after this call will be matched if they have a corresponding
     object in the checkpoint.
 
-    When building a graph, restorations are added to the graph but not run. A
-    session is required to retrieve checkpoint metadata.
+    When building a graph, restorations are added to the graph but not run.
 
     To disallow deferred loading, assert immediately that all checkpointed
     variables have been matched to variable objects:
@@ -695,9 +706,6 @@ class Saver(object):
         object which may run initializers for objects in the dependency
         graph. If the checkpoint was written by the name-based `tf.train.Saver`,
         names are used to match variables.
-      session: The session to retrieve metadata with. Ignored when executing
-        eagerly. If not provided when graph building, the default session is
-        used.
 
     Returns:
       A load status object, which can be used to make assertions about the
@@ -710,32 +718,17 @@ class Saver(object):
     """
     if save_path is None:
       return InitializationOnlyStatus(self._root_checkpointable)
-    in_graph_mode = context.in_graph_mode()
+    in_graph_mode = not context.executing_eagerly()
     if in_graph_mode:
-      if session is None:
-        session = ops.get_default_session()
       file_prefix_tensor = self._file_prefix_placeholder
       file_prefix_feed_dict = {self._file_prefix_placeholder: save_path}
     else:
-      session = None
-      file_prefix_tensor = constant_op.constant(save_path)
+      with ops.device("/cpu:0"):
+        file_prefix_tensor = constant_op.constant(save_path)
       file_prefix_feed_dict = None
+    reader = pywrap_tensorflow.NewCheckpointReader(save_path)
     try:
-      if not in_graph_mode or self._object_graph_restore_tensor is None:
-        object_graph_string, = io_ops.restore_v2(
-            prefix=file_prefix_tensor,
-            tensor_names=[_OBJECT_GRAPH_PROTO_KEY],
-            shape_and_slices=[""],
-            dtypes=[dtypes.string],
-            name="object_graph_proto_read")
-        if in_graph_mode:
-          self._object_graph_restore_tensor = object_graph_string
-      if in_graph_mode:
-        object_graph_string = session.run(
-            self._object_graph_restore_tensor,
-            feed_dict=file_prefix_feed_dict)
-      else:
-        object_graph_string = object_graph_string.numpy()
+      object_graph_string = reader.get_tensor(_OBJECT_GRAPH_PROTO_KEY)
     except errors_impl.NotFoundError:
       # The object graph proto does not exist in this checkpoint. Try again with
       # name-based saving.
@@ -750,7 +743,6 @@ class Saver(object):
       if in_graph_mode:
         dtype_map = None
       else:
-        reader = pywrap_tensorflow.NewCheckpointReader(save_path)
         dtype_map = reader.get_variable_to_dtype_map()
       checkpoint = core_checkpointable_utils._Checkpoint(  # pylint: disable=protected-access
           object_graph_proto=object_graph_proto,
@@ -770,3 +762,103 @@ class Saver(object):
     load_status = CheckpointLoadStatus(
         checkpoint, feed_dict=file_prefix_feed_dict)
     return load_status
+
+
+class Checkpoint(core_checkpointable.Checkpointable):
+  """A utility class which groups `Checkpointable` objects.
+
+  Accepts arbitrary keyword arguments to its constructor and saves those values
+  with a checkpoint. Maintains a `save_counter` for numbering checkpoints.
+
+  Example usage:
+
+  ```python
+  import tensorflow as tf
+  import tensorflow.contrib.eager as tfe
+  import os
+
+  checkpoint_directory = "/tmp/training_checkpoints"
+  checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+
+  root = tfe.Checkpoint(optimizer=optimizer, model=model)
+  root.restore(tf.train.latest_checkpoint(checkpoint_directory))
+  for _ in range(num_training_steps):
+    optimizer.minimize( ... )
+  root.save(file_prefix=checkpoint_prefix)
+  ```
+
+  For more manual control over saving, use `tfe.CheckpointableSaver` directly.
+
+  Attributes:
+    save_counter: Incremented when `save()` is called. Used to number
+      checkpoints.
+  """
+
+  def __init__(self, **kwargs):
+    """Group objects into a training checkpoint.
+
+    Args:
+      **kwargs: Keyword arguments are set as attributes of this object, and are
+        saved with the checkpoint. Attribute values must derive from
+        `CheckpointableBase`.
+    Raises:
+      ValueError: If objects in `kwargs` are not Checkpointable.
+    """
+    super(Checkpoint, self).__init__()
+    for k, v in sorted(kwargs.items(), key=lambda item: item[0]):
+      if not isinstance(v, core_checkpointable.CheckpointableBase):
+        raise ValueError(
+            ("`Checkpoint` was expecting an object derived from "
+             "`CheckpointableBase`, got %s.") % (v,))
+      setattr(self, k, v)
+    self._save_counter = None  # Created lazily for restore-on-create.
+    self._saver = CheckpointableSaver(weakref.ref(self))
+
+  def _maybe_create_save_counter(self):
+    """Create a save counter if it does not yet exist."""
+    if self._save_counter is None:
+      # Initialized to 0 and incremented before saving.
+      with ops.device("/cpu:0"):
+        self._save_counter = add_variable(
+            self, name="save_counter", initializer=0, dtype=dtypes.int64)
+
+  @property
+  def save_counter(self):
+    """An integer variable which starts at zero and is incremented on save.
+
+    Used to number checkpoints.
+
+    Returns:
+      The save counter variable.
+    """
+    self._maybe_create_save_counter()
+    return self._save_counter
+
+  def save(self, file_prefix, session=None):
+    """Save a checkpoint. Wraps `tfe.CheckpointableSaver.save`."""
+    in_graph_mode = not context.executing_eagerly()
+    if in_graph_mode:
+      if session is None:
+        session = ops.get_default_session()
+      if self._save_counter is None:
+        # When graph building, if this is a new save counter variable then it
+        # needs to be initialized before assign_add. This is only an issue if
+        # restore() has not been called first.
+        session.run(self.save_counter.initializer)
+    with ops.colocate_with(self.save_counter):
+      assign_op = self.save_counter.assign_add(1)
+    if in_graph_mode:
+      session.run(assign_op)
+    return self._saver.save(
+        file_prefix=file_prefix,
+        checkpoint_number=self.save_counter,
+        session=session)
+
+  def restore(self, save_path):
+    """Restore a checkpoint. Wraps `tfe.CheckpointableSaver.restore`."""
+    status = self._saver.restore(save_path=save_path)
+    # Create the save counter now so it gets initialized with other variables
+    # when graph building. Creating it earlier would lead to double
+    # initialization when executing eagerly.
+    self._maybe_create_save_counter()
+    return status
diff --git a/tensorflow/contrib/eager/python/checkpointable_utils_test.py b/tensorflow/contrib/eager/python/checkpointable_utils_test.py
index 3d6a200276754d96b6c539cc98c397d09b999b9f..690f3ee67a71a33eec55515a087800151909a214 100644
--- a/tensorflow/contrib/eager/python/checkpointable_utils_test.py
+++ b/tensorflow/contrib/eager/python/checkpointable_utils_test.py
@@ -18,23 +18,25 @@ from __future__ import print_function
 
 import functools
 import os
-import weakref
 
 import six
 
 from tensorflow.contrib.eager.python import checkpointable_utils
-from tensorflow.contrib.eager.python import network as network_lib
+from tensorflow.python.client import session as session_lib
+from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
 from tensorflow.python.eager import test
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
-from tensorflow.python.layers import base
+from tensorflow.python.keras._impl.keras.engine import training
 from tensorflow.python.layers import core
+from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import template
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.training import adam
 from tensorflow.python.training import checkpointable
@@ -42,73 +44,6 @@ from tensorflow.python.training import saver as core_saver
 from tensorflow.python.training import training_util
 
 
-class CheckpointableDenseLayer(core.Dense, checkpointable.Checkpointable):
-
-  def __init__(self, *args, **kwargs):
-    checkpointable.Checkpointable.__init__(self)
-    core.Dense.__init__(self, *args, **kwargs)
-
-  def add_variable(self, name, shape, **kwargs):
-    # Calls both Checkpointable._add_variable and Layer.add_variable. Eventually
-    # Layer.add_variable should inherit from Checkpointable and simply call
-    # super and then do post-processing.
-    return checkpointable.Checkpointable._add_variable_with_custom_getter(
-        self,
-        name=name,
-        shape=shape,
-        getter=functools.partial(core.Dense.add_variable, self),
-        **kwargs)
-
-
-# pylint: disable=not-callable
-class CheckpointableNetwork(network_lib.Network, checkpointable.Checkpointable):
-
-  def __setattr__(self, name, value):
-    if isinstance(value, base.Layer):
-      self.track_layer(value, name=name)
-    # Checkpointable is next in the method resolution order, so this will catch
-    # Checkpointable objects which aren't Layers.
-    super(CheckpointableNetwork, self).__setattr__(name, value)
-
-  def track_layer(self, layer, name):
-    self._track_checkpointable(layer, name=name)
-    return super(CheckpointableNetwork, self).track_layer(layer)
-
-
-class CheckpointableAdam(adam.AdamOptimizer, checkpointable.Checkpointable):
-
-  # NOTE: Copied from Optimizer with modifications to use add_variable
-  # for non-slot variables. These contortions are necessary to maintain
-  # checkpoint compatibility with variable.name based saving.
-  # TODO(allenl): Make this cleaner.
-  def _create_non_slot_variable(self, initial_value, name, colocate_with):
-    """Add an extra variable, not associated with a slot."""
-    if context.in_graph_mode():
-      graph = colocate_with.graph
-    else:
-      graph = None
-
-    key = (name, graph)
-    v = self._non_slot_dict.get(key, None)
-    if v is None:
-      with ops.colocate_with(colocate_with):
-        def _variable_getter(name, shape, dtype, initializer):
-          del shape, dtype  # not used, but there for compatibility
-          return variable_scope.variable(
-              name=name, initial_value=initializer, trainable=False)
-
-        initial_value = ops.convert_to_tensor(initial_value)
-        v = self._add_variable_with_custom_getter(
-            name=name,
-            shape=initial_value.get_shape(),
-            initializer=initial_value,
-            getter=_variable_getter)
-
-      self._non_slot_dict[key] = v
-
-    return v
-
-
 class NonLayerCheckpointable(checkpointable.Checkpointable):
 
   def __init__(self):
@@ -117,64 +52,20 @@ class NonLayerCheckpointable(checkpointable.Checkpointable):
         self, name="a_variable", shape=[])
 
 
-class MyNetwork(CheckpointableNetwork):
-  """A concrete Network for testing."""
+# pylint: disable=not-callable
+class MyModel(training.Model):
+  """A concrete Model for testing."""
 
   def __init__(self):
-    super(MyNetwork, self).__init__()
-    self._named_dense = CheckpointableDenseLayer(1, use_bias=True)
-    self._via_track_layer = self.track_layer(
-        CheckpointableDenseLayer(1, use_bias=False), name="via_track_layer")
+    super(MyModel, self).__init__()
+    self._named_dense = core.Dense(1, use_bias=True)
+    self._second = core.Dense(1, use_bias=False)
     # We can still track Checkpointables which aren't Layers.
     self._non_layer = NonLayerCheckpointable()
 
   def call(self, values):
-    return self._via_track_layer(self._named_dense(values))
-
-
-class Checkpoint(checkpointable.Checkpointable):
-  """A utility class which groups `Checkpointable` objects."""
-
-  def __init__(self, **kwargs):
-    super(Checkpoint, self).__init__()
-    for k, v in sorted(kwargs.items(), key=lambda item: item[0]):
-      setattr(self, k, v)
-    self._save_counter = None  # Created lazily for restore-on-create.
-    self._saver = checkpointable_utils.Saver(weakref.ref(self))
-
-  @property
-  def save_counter(self):
-    """An integer variable which starts at zero and is incremented on save.
-
-    Used to number checkpoints.
-
-    Returns:
-      The save counter variable.
-    """
-    if self._save_counter is None:
-      # Initialized to 0 and incremented before saving.
-      self._save_counter = checkpointable_utils.add_variable(
-          self, name="save_counter", initializer=0, dtype=dtypes.int64)
-    return self._save_counter
-
-  def save(self, file_prefix, session=None):
-    assign_op = self.save_counter.assign_add(1)
-    if context.in_graph_mode():
-      if session is None:
-        session = ops.get_default_session()
-      session.run(assign_op)
-    return self._saver.save(
-        file_prefix=file_prefix,
-        checkpoint_number=self.save_counter,
-        session=session)
-
-  def restore(self, save_path):
-    status = self._saver.restore(save_path=save_path)
-    # Create the save counter now so it gets initialized with other variables
-    # when graph building. Creating it earlier would lead to double
-    # initialization when executing eagerly.
-    self.save_counter  # pylint: disable=pointless-statement
-    return status
+    ret = self._second(self._named_dense(values))
+    return ret
 
 
 class InterfaceTests(test.TestCase):
@@ -219,14 +110,14 @@ class InterfaceTests(test.TestCase):
                          [0., 0.]], self.evaluate(bare_initializer))
     self.assertEqual("a_variable:0", obj.a_variable.name)
     self.assertEqual("duplicate:0", other_duplicate.name)
-    if context.in_graph_mode():
-      # The .name attribute may be globally influenced, but the checkpoint name
-      # won't be (tested below).
-      self.assertEqual("duplicate_1:0", duplicate.name)
-    else:
+    if context.executing_eagerly():
       # When executing eagerly, there's no uniquification of variable names. The
       # checkpoint name will be the same.
       self.assertEqual("duplicate:0", duplicate.name)
+    else:
+      # The .name attribute may be globally influenced, but the checkpoint name
+      # won't be (tested below).
+      self.assertEqual("duplicate_1:0", duplicate.name)
     named_variables, _ = checkpointable_utils._serialize_object_graph(obj)
     expected_checkpoint_names = (
         "a_variable/.ATTRIBUTES/VARIABLE_VALUE",
@@ -263,31 +154,71 @@ class InterfaceTests(test.TestCase):
     self.assertAllEqual([1., 1., 1.], self.evaluate(v2))
 
 
+class _MirroringSaveable(
+    core_saver.BaseSaverBuilder.ResourceVariableSaveable):
+
+  def __init__(self, primary_variable, mirrored_variable, name):
+    self._primary_variable = primary_variable
+    self._mirrored_variable = mirrored_variable
+    super(_MirroringSaveable, self).__init__(
+        self._primary_variable, "", name)
+
+  def restore(self, restored_tensors, restored_shapes):
+    """Restore the same value into both variables."""
+    tensor, = restored_tensors
+    return control_flow_ops.group(
+        self._primary_variable.assign(tensor),
+        self._mirrored_variable.assign(tensor))
+
+
+class _OwnsMirroredVariables(checkpointable.CheckpointableBase):
+  """A Checkpointable object which returns a more complex SaveableObject."""
+
+  def __init__(self):
+    self.non_dep_variable = variable_scope.get_variable(
+        name="non_dep_variable", initializer=6., use_resource=True)
+    self.mirrored = variable_scope.get_variable(
+        name="mirrored", initializer=15., use_resource=True)
+
+  def _gather_saveables_for_checkpoint(self):
+    def _saveable_factory(name=self.non_dep_variable.name):
+      return _MirroringSaveable(
+          primary_variable=self.non_dep_variable,
+          mirrored_variable=self.mirrored,
+          name=name)
+    return {checkpointable.VARIABLE_VALUE_KEY: _saveable_factory}
+
+  # The Saver sorts by name before parsing, so we need a name property.
+  @property
+  def name(self):
+    return self.non_dep_variable.name
+
+
 class CheckpointingTests(test.TestCase):
 
   @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
   def testNamingWithOptimizer(self):
     input_value = constant_op.constant([[3.]])
-    network = MyNetwork()
-    # A nuisance Network using the same optimizer. Its slot variables should not
+    model = MyModel()
+    # A nuisance Model using the same optimizer. Its slot variables should not
     # go in the checkpoint, since it is never depended on.
-    other_network = MyNetwork()
-    optimizer = CheckpointableAdam(0.001)
+    other_model = MyModel()
+    optimizer = adam.AdamOptimizer(0.001)
     optimizer_step = training_util.get_or_create_global_step()
-    root_checkpointable = Checkpoint(
-        optimizer=optimizer, network=network, optimizer_step=optimizer_step)
-    if context.in_eager_mode():
+    root_checkpointable = checkpointable_utils.Checkpoint(
+        optimizer=optimizer, model=model, optimizer_step=optimizer_step)
+    if context.executing_eagerly():
       optimizer.minimize(
-          lambda: network(input_value),
+          lambda: model(input_value),
           global_step=optimizer_step)
       optimizer.minimize(
-          lambda: other_network(input_value),
+          lambda: other_model(input_value),
           global_step=optimizer_step)
     else:
       train_op = optimizer.minimize(
-          network(input_value), global_step=optimizer_step)
+          model(input_value), global_step=optimizer_step)
       optimizer.minimize(
-          other_network(input_value),
+          other_model(input_value),
           global_step=optimizer_step)
       self.evaluate(checkpointable_utils.gather_initializers(
           root_checkpointable))
@@ -297,24 +228,21 @@ class CheckpointingTests(test.TestCase):
     expected_checkpoint_names = (
         # Created in the root node, so no prefix.
         "optimizer_step",
-        # No name provided to track_checkpointable(), so the position is used
-        # instead (one-based).
-        "network/via_track_layer/kernel",
-        # track_checkpointable() with a name provided, so that's used
-        "network/_named_dense/kernel",
-        "network/_named_dense/bias",
-        # non-Layer dependency of the network
-        "network/_non_layer/a_variable",
+        "model/_second/kernel",
+        "model/_named_dense/kernel",
+        "model/_named_dense/bias",
+        # non-Layer dependency of the model
+        "model/_non_layer/a_variable",
         # The optimizer creates two non-slot variables
         "optimizer/beta1_power",
         "optimizer/beta2_power",
         # Slot variables
-        "network/via_track_layer/kernel/.OPTIMIZER_SLOT/optimizer/m",
-        "network/via_track_layer/kernel/.OPTIMIZER_SLOT/optimizer/v",
-        "network/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/m",
-        "network/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/v",
-        "network/_named_dense/bias/.OPTIMIZER_SLOT/optimizer/m",
-        "network/_named_dense/bias/.OPTIMIZER_SLOT/optimizer/v",
+        "model/_second/kernel/.OPTIMIZER_SLOT/optimizer/m",
+        "model/_second/kernel/.OPTIMIZER_SLOT/optimizer/v",
+        "model/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/m",
+        "model/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/v",
+        "model/_named_dense/bias/.OPTIMIZER_SLOT/optimizer/m",
+        "model/_named_dense/bias/.OPTIMIZER_SLOT/optimizer/v",
     )
     suffix = "/.ATTRIBUTES/VARIABLE_VALUE"
     expected_checkpoint_names = [
@@ -326,11 +254,11 @@ class CheckpointingTests(test.TestCase):
         "global_step:0",
         named_variables["optimizer_step" + suffix].name)
     self.assertEqual(
-        "my_network/checkpointable_dense_layer_1/kernel:0",
-        named_variables["network/via_track_layer/kernel" + suffix].name)
+        "my_model/dense_1/kernel:0",
+        named_variables["model/_second/kernel" + suffix].name)
     self.assertEqual(
-        "my_network/checkpointable_dense_layer/kernel:0",
-        named_variables["network/_named_dense/kernel" + suffix].name)
+        "my_model/dense/kernel:0",
+        named_variables["model/_named_dense/kernel" + suffix].name)
     self.assertEqual(
         "beta1_power:0",
         named_variables["optimizer/beta1_power" + suffix].name)
@@ -348,79 +276,110 @@ class CheckpointingTests(test.TestCase):
                      serialized_graph.nodes[optimizer_node.children[0].node_id]
                      .attributes[0].full_name)
     self.assertEqual(
-        "my_network/checkpointable_dense_layer/kernel",
+        "my_model/dense/kernel",
         serialized_graph.nodes[optimizer_node.slot_variables[0]
                                .original_variable_node_id]
         .attributes[0].full_name)
     # We strip off the :0 suffix, as variable.name-based saving does.
     self.assertEqual(
-        "my_network/checkpointable_dense_layer/kernel/Adam",
+        "my_model/dense/kernel/Adam",
         serialized_graph.nodes[optimizer_node.slot_variables[0]
                                .slot_variable_node_id]
         .attributes[0].full_name)
     self.assertEqual(
-        "my_network/checkpointable_dense_layer/kernel/Adam:0",
+        "my_model/dense/kernel/Adam:0",
         optimizer.get_slot(
-            var=named_variables["network/_named_dense/kernel" + suffix],
+            var=named_variables["model/_named_dense/kernel" + suffix],
             name="m").name)
     self.assertEqual(
-        "network/_named_dense/kernel" + suffix,
+        "model/_named_dense/kernel" + suffix,
         serialized_graph.nodes[
             optimizer_node.slot_variables[0]
             .original_variable_node_id].attributes[0].checkpoint_key)
     self.assertEqual("m", optimizer_node.slot_variables[0].slot_name)
     self.assertEqual(
-        "network/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/m" + suffix,
+        "model/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/m" + suffix,
         serialized_graph.nodes[
             optimizer_node.slot_variables[0]
             .slot_variable_node_id].attributes[0].checkpoint_key)
 
+  @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
+  def testMoreComplexSaveableReturned(self):
+    v = _OwnsMirroredVariables()
+    checkpoint = checkpointable_utils.Checkpoint(v=v)
+    test_dir = self.get_temp_dir()
+    prefix = os.path.join(test_dir, "ckpt")
+    self.evaluate(v.non_dep_variable.assign(42.))
+    save_path = checkpoint.save(prefix)
+    self.evaluate(v.non_dep_variable.assign(43.))
+    self.evaluate(v.mirrored.assign(44.))
+    checkpoint.restore(save_path).assert_consumed().initialize_or_restore()
+    self.assertEqual(42., self.evaluate(v.non_dep_variable))
+    self.assertEqual(42., self.evaluate(v.mirrored))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testMoreComplexSaveableReturnedWithGlobalName(self):
+    # The same object can also be saved using the name-based saver.
+    v = _OwnsMirroredVariables()
+    saver = core_saver.Saver(var_list=[v])
+    test_dir = self.get_temp_dir()
+    prefix = os.path.join(test_dir, "ckpt")
+    self.evaluate(v.non_dep_variable.assign(42.))
+    with self.test_session() as sess:
+      save_path = saver.save(sess, prefix)
+      self.evaluate(v.non_dep_variable.assign(43.))
+      self.evaluate(v.mirrored.assign(44.))
+      saver.restore(sess, save_path)
+      self.assertEqual(42., self.evaluate(v.non_dep_variable))
+      self.assertEqual(42., self.evaluate(v.mirrored))
+
   @test_util.run_in_graph_and_eager_modes()
   def testSaveRestore(self):
-    network = MyNetwork()
-    optimizer = CheckpointableAdam(0.001)
-    root_checkpointable = Checkpoint(optimizer=optimizer, network=network)
+    model = MyModel()
+    optimizer = adam.AdamOptimizer(0.001)
+    root_checkpointable = checkpointable_utils.Checkpoint(
+        optimizer=optimizer, model=model)
     input_value = constant_op.constant([[3.]])
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       optimizer.minimize(
-          lambda: network(input_value))
+          lambda: model(input_value))
     else:
-      train_op = optimizer.minimize(network(input_value))
+      train_op = optimizer.minimize(model(input_value))
       # TODO(allenl): Make initialization more pleasant when graph building.
       root_checkpointable.save_counter  # pylint: disable=pointless-statement
       self.evaluate(checkpointable_utils.gather_initializers(
           root_checkpointable))
       self.evaluate(train_op)
     prefix = os.path.join(self.get_temp_dir(), "ckpt")
-    self.evaluate(state_ops.assign(network._named_dense.variables[1], [42.]))
-    m_bias_slot = optimizer.get_slot(network._named_dense.variables[1], "m")
+    self.evaluate(state_ops.assign(model._named_dense.variables[1], [42.]))
+    m_bias_slot = optimizer.get_slot(model._named_dense.variables[1], "m")
     self.evaluate(state_ops.assign(m_bias_slot, [1.5]))
     save_path = root_checkpointable.save(file_prefix=prefix)
-    self.evaluate(state_ops.assign(network._named_dense.variables[1], [43.]))
+    self.evaluate(state_ops.assign(model._named_dense.variables[1], [43.]))
     self.evaluate(state_ops.assign(root_checkpointable.save_counter, 3))
     optimizer_variables = self.evaluate(optimizer.variables())
     self.evaluate(state_ops.assign(m_bias_slot, [-2.]))
     # Immediate restoration
     status = root_checkpointable.restore(save_path=save_path).assert_consumed()
     status.run_restore_ops()
-    self.assertAllEqual([42.], self.evaluate(network._named_dense.variables[1]))
+    self.assertAllEqual([42.], self.evaluate(model._named_dense.variables[1]))
     self.assertAllEqual(1, self.evaluate(root_checkpointable.save_counter))
     self.assertAllEqual([1.5], self.evaluate(m_bias_slot))
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       return  # Restore-on-create is only supported when executing eagerly
-    on_create_network = MyNetwork()
-    on_create_optimizer = CheckpointableAdam(0.001)
-    on_create_root = Checkpoint(
-        optimizer=on_create_optimizer, network=on_create_network)
+    on_create_model = MyModel()
+    on_create_optimizer = adam.AdamOptimizer(0.001)
+    on_create_root = checkpointable_utils.Checkpoint(
+        optimizer=on_create_optimizer, model=on_create_model)
     # Deferred restoration
     status = on_create_root.restore(save_path=save_path)
-    on_create_network(constant_op.constant([[3.]]))  # create variables
+    on_create_model(constant_op.constant([[3.]]))  # create variables
     self.assertAllEqual(1, self.evaluate(on_create_root.save_counter))
     self.assertAllEqual([42.],
                         self.evaluate(
-                            on_create_network._named_dense.variables[1]))
+                            on_create_model._named_dense.variables[1]))
     on_create_m_bias_slot = on_create_optimizer.get_slot(
-        on_create_network._named_dense.variables[1], "m")
+        on_create_model._named_dense.variables[1], "m")
     # Optimizer slot variables are created when the original variable is
     # restored.
     self.assertAllEqual([1.5], self.evaluate(on_create_m_bias_slot))
@@ -440,17 +399,17 @@ class CheckpointingTests(test.TestCase):
     checkpoint_directory = self.get_temp_dir()
     checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
     for training_continuation in range(3):
-      network = MyNetwork()
-      optimizer = CheckpointableAdam(0.001)
-      root = Checkpoint(
-          optimizer=optimizer, network=network,
+      model = MyModel()
+      optimizer = adam.AdamOptimizer(0.001)
+      root = checkpointable_utils.Checkpoint(
+          optimizer=optimizer, model=model,
           optimizer_step=training_util.get_or_create_global_step())
       root.restore(core_saver.latest_checkpoint(checkpoint_directory))
       for _ in range(num_training_steps):
         # TODO(allenl): Use a Dataset and serialize/checkpoint it.
         input_value = constant_op.constant([[3.]])
         optimizer.minimize(
-            lambda: network(input_value),  # pylint: disable=cell-var-from-loop
+            lambda: model(input_value),  # pylint: disable=cell-var-from-loop
             global_step=root.optimizer_step)
       root.save(file_prefix=checkpoint_prefix)
       self.assertEqual((training_continuation + 1) * num_training_steps,
@@ -464,14 +423,14 @@ class CheckpointingTests(test.TestCase):
       checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
       for training_continuation in range(3):
         with ops.Graph().as_default():
-          network = MyNetwork()
-          optimizer = CheckpointableAdam(0.001)
-          root = Checkpoint(
-              optimizer=optimizer, network=network,
+          model = MyModel()
+          optimizer = adam.AdamOptimizer(0.001)
+          root = checkpointable_utils.Checkpoint(
+              optimizer=optimizer, model=model,
               global_step=training_util.get_or_create_global_step())
           input_value = constant_op.constant([[3.]])
           train_op = optimizer.minimize(
-              network(input_value),
+              model(input_value),
               global_step=root.global_step)
           checkpoint_path = core_saver.latest_checkpoint(checkpoint_directory)
           with self.test_session(graph=ops.get_default_graph()) as session:
@@ -500,20 +459,20 @@ class CheckpointingTests(test.TestCase):
     checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
     for training_continuation in range(3):
       with ops.Graph().as_default(), self.test_session(
-          graph=ops.get_default_graph()):
-        network = MyNetwork()
-        optimizer = CheckpointableAdam(0.001)
-        root = Checkpoint(
-            optimizer=optimizer, network=network,
+          graph=ops.get_default_graph()), test_util.device(use_gpu=True):
+        model = MyModel()
+        optimizer = adam.AdamOptimizer(0.001)
+        root = checkpointable_utils.Checkpoint(
+            optimizer=optimizer, model=model,
             global_step=training_util.get_or_create_global_step())
         checkpoint_path = core_saver.latest_checkpoint(checkpoint_directory)
         status = root.restore(save_path=checkpoint_path)
         input_value = constant_op.constant([[3.]])
         train_fn = functools.partial(
             optimizer.minimize,
-            functools.partial(network, input_value),
+            functools.partial(model, input_value),
             global_step=root.global_step)
-        if context.in_graph_mode():
+        if not context.executing_eagerly():
           train_fn = functools.partial(self.evaluate, train_fn())
         status.initialize_or_restore()
         for _ in range(num_training_steps):
@@ -565,6 +524,35 @@ class CheckpointingTests(test.TestCase):
     name, = named_variables.keys()
     self.assertEqual(name, "..ATTRIBUTES/a/.ATTRIBUTES/VARIABLE_VALUE")
 
+  def testAnonymousVarsInInit(self):
+
+    class Model(training.Model):
+
+      def __init__(self):
+        super(Model, self).__init__()
+        self.w = resource_variable_ops.ResourceVariable(0.0)
+        self.b = resource_variable_ops.ResourceVariable(0.0)
+        self.vars = [self.w, self.b]
+
+      def call(self, x):
+        return x * self.w + self.b
+
+    with context.eager_mode():
+      model = Model()
+      optimizer = adam.AdamOptimizer(learning_rate=0.05)
+      checkpoint_directory = self.get_temp_dir()
+      checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+      checkpoint = checkpointable_utils.Checkpoint(
+          model=model, optimizer=optimizer)
+      for _ in range(2):
+        checkpoint.save(checkpoint_prefix)
+        with backprop.GradientTape() as tape:
+          loss = (constant_op.constant(1.)
+                  - model(constant_op.constant(1.))) ** 2
+        grad = tape.gradient(loss, model.vars)
+        optimizer.apply_gradients(
+            [(g, v) for g, v in zip(grad, model.vars)])
+
   @test_util.run_in_graph_and_eager_modes()
   def testLateDependencyTracking(self):
 
@@ -585,9 +573,11 @@ class CheckpointingTests(test.TestCase):
     self.evaluate(state_ops.assign(original.dep.var, 123.))
     checkpoint_directory = self.get_temp_dir()
     checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
-    save_path = checkpointable_utils.Saver(original).save(checkpoint_prefix)
+    save_path = checkpointable_utils.CheckpointableSaver(
+        original).save(checkpoint_prefix)
     load_into = LateDependencies()
-    status = checkpointable_utils.Saver(load_into).restore(save_path)
+    status = checkpointable_utils.CheckpointableSaver(
+        load_into).restore(save_path)
     with self.assertRaises(AssertionError):
       status.assert_consumed()
     load_into.add_dep()
@@ -616,11 +606,12 @@ class CheckpointingTests(test.TestCase):
     self.evaluate(state_ops.assign(dep_after_var.dep.var, -14.))
     checkpoint_directory = self.get_temp_dir()
     checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
-    save_path = checkpointable_utils.Saver(dep_after_var).save(
+    save_path = checkpointable_utils.CheckpointableSaver(dep_after_var).save(
         checkpoint_prefix)
 
     loaded_dep_after_var = DepAfterVar()
-    status = checkpointable_utils.Saver(loaded_dep_after_var).restore(save_path)
+    status = checkpointable_utils.CheckpointableSaver(
+        loaded_dep_after_var).restore(save_path)
     loaded_dep_after_var.add_dep()
     status.assert_consumed()
     status.run_restore_ops()
@@ -633,31 +624,33 @@ class CheckpointingTests(test.TestCase):
     root = checkpointable.Checkpointable()
     root.var = checkpointable_utils.add_variable(
         root, name="var", initializer=0.)
-    optimizer = CheckpointableAdam(0.1)
-    if context.in_graph_mode():
+    optimizer = adam.AdamOptimizer(0.1)
+    if context.executing_eagerly():
+      optimizer.minimize(root.var.read_value)
+    else:
       train_op = optimizer.minimize(root.var)
       # Note that `optimizer` has not been added as a dependency of
       # `root`. Create a one-off grouping so that slot variables for `root.var`
       # get initialized too.
       self.evaluate(checkpointable_utils.gather_initializers(
-          Checkpoint(root=root, optimizer=optimizer)))
+          checkpointable_utils.Checkpoint(root=root, optimizer=optimizer)))
       self.evaluate(train_op)
-    else:
-      optimizer.minimize(root.var.read_value)
     self.evaluate(state_ops.assign(root.var, 12.))
-    no_slots_path = checkpointable_utils.Saver(root).save(
+    no_slots_path = checkpointable_utils.CheckpointableSaver(root).save(
         os.path.join(checkpoint_directory, "no_slots"))
     root.optimizer = optimizer
     self.evaluate(state_ops.assign(root.var, 13.))
     self.evaluate(state_ops.assign(optimizer.get_slot(name="m", var=root.var),
                                    14.))
-    slots_path = checkpointable_utils.Saver(root).save(
+    slots_path = checkpointable_utils.CheckpointableSaver(root).save(
         os.path.join(checkpoint_directory, "with_slots"))
     new_root = checkpointable.Checkpointable()
     # Load the slot-containing checkpoint (deferred), then immediately overwrite
     # the non-slot variable (also deferred).
-    slot_status = checkpointable_utils.Saver(new_root).restore(slots_path)
-    no_slot_status = checkpointable_utils.Saver(new_root).restore(no_slots_path)
+    slot_status = checkpointable_utils.CheckpointableSaver(
+        new_root).restore(slots_path)
+    no_slot_status = checkpointable_utils.CheckpointableSaver(
+        new_root).restore(no_slots_path)
     with self.assertRaises(AssertionError):
       no_slot_status.assert_consumed()
     new_root.var = checkpointable_utils.add_variable(
@@ -665,11 +658,11 @@ class CheckpointingTests(test.TestCase):
     no_slot_status.assert_consumed()
     no_slot_status.run_restore_ops()
     self.assertEqual(12., self.evaluate(new_root.var))
-    new_root.optimizer = CheckpointableAdam(0.1)
+    new_root.optimizer = adam.AdamOptimizer(0.1)
     with self.assertRaisesRegexp(AssertionError, "beta1_power"):
       slot_status.assert_consumed()
     self.assertEqual(12., self.evaluate(new_root.var))
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       # Slot variables are only created with restoring initializers when
       # executing eagerly.
       self.assertEqual(14., self.evaluate(
@@ -677,7 +670,9 @@ class CheckpointingTests(test.TestCase):
     else:
       self.assertIs(new_root.optimizer.get_slot(name="m", var=new_root.var),
                     None)
-    if context.in_graph_mode():
+    if context.executing_eagerly():
+      new_root.optimizer.minimize(new_root.var.read_value)
+    else:
       train_op = new_root.optimizer.minimize(new_root.var)
       # The slot variable now exists; restore() didn't create it, but we should
       # now have a restore op for it.
@@ -685,8 +680,6 @@ class CheckpointingTests(test.TestCase):
       self.assertEqual(14., self.evaluate(
           new_root.optimizer.get_slot(name="m", var=new_root.var)))
       self.evaluate(train_op)
-    else:
-      new_root.optimizer.minimize(new_root.var.read_value)
     slot_status.assert_consumed()
 
   @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
@@ -697,15 +690,17 @@ class CheckpointingTests(test.TestCase):
     save_root.dep.var = checkpointable_utils.add_variable(
         save_root.dep, name="var", initializer=0.)
     self.evaluate(state_ops.assign(save_root.dep.var, 12.))
-    saver = checkpointable_utils.Saver(save_root)
+    saver = checkpointable_utils.CheckpointableSaver(save_root)
     first_path = saver.save(os.path.join(checkpoint_directory, "first"))
     self.evaluate(state_ops.assign(save_root.dep.var, 13.))
     second_path = saver.save(os.path.join(checkpoint_directory, "second"))
 
     first_root = checkpointable.Checkpointable()
     second_root = checkpointable.Checkpointable()
-    first_status = checkpointable_utils.Saver(first_root).restore(first_path)
-    second_status = checkpointable_utils.Saver(second_root).restore(second_path)
+    first_status = checkpointable_utils.CheckpointableSaver(
+        first_root).restore(first_path)
+    second_status = checkpointable_utils.CheckpointableSaver(
+        second_root).restore(second_path)
     load_dep = checkpointable.Checkpointable()
     load_dep.var = checkpointable_utils.add_variable(
         load_dep, name="var", shape=[])
@@ -722,8 +717,10 @@ class CheckpointingTests(test.TestCase):
     # determines the final value.
     first_root = checkpointable.Checkpointable()
     second_root = checkpointable.Checkpointable()
-    second_status = checkpointable_utils.Saver(second_root).restore(second_path)
-    first_status = checkpointable_utils.Saver(first_root).restore(first_path)
+    second_status = checkpointable_utils.CheckpointableSaver(
+        second_root).restore(second_path)
+    first_status = checkpointable_utils.CheckpointableSaver(
+        first_root).restore(first_path)
     load_dep = checkpointable.Checkpointable()
     load_dep.var = checkpointable_utils.add_variable(
         load_dep, name="var", shape=[])
@@ -748,10 +745,10 @@ class CheckpointingTests(test.TestCase):
     save_root.dep_two.dep_three = dep_three
     checkpointable_utils.add_variable(dep_three, name="var", initializer=0.)
     self.evaluate(checkpointable_utils.gather_initializers(save_root))
-    save_path = checkpointable_utils.Saver(save_root).save(
+    save_path = checkpointable_utils.CheckpointableSaver(save_root).save(
         os.path.join(checkpoint_directory, "ckpt"))
     load_root = checkpointable.Checkpointable()
-    checkpointable_utils.Saver(load_root).restore(save_path)
+    checkpointable_utils.CheckpointableSaver(load_root).restore(save_path)
     load_root.dep_one = checkpointable.Checkpointable()
     load_root.dep_two = checkpointable.Checkpointable()
     load_root.dep_one.dep_three = checkpointable.Checkpointable()
@@ -771,7 +768,7 @@ class CheckpointingTests(test.TestCase):
     checkpointable_utils.add_variable(
         save_root.dep_two, name="var2", initializer=64., dtype=dtypes.float64)
     self.evaluate(checkpointable_utils.gather_initializers(save_root))
-    save_path = checkpointable_utils.Saver(save_root).save(
+    save_path = checkpointable_utils.CheckpointableSaver(save_root).save(
         os.path.join(checkpoint_directory, "ckpt"))
     load_root = checkpointable.Checkpointable()
     load_root.dep_one = checkpointable.Checkpointable()
@@ -780,7 +777,7 @@ class CheckpointingTests(test.TestCase):
         load_root.dep_one, name="var1", shape=[], dtype=dtypes.float64)
     v2 = checkpointable_utils.add_variable(
         load_root.dep_one, name="var2", shape=[], dtype=dtypes.float64)
-    status = checkpointable_utils.Saver(load_root).restore(
+    status = checkpointable_utils.CheckpointableSaver(load_root).restore(
         save_path).assert_consumed()
     status.run_restore_ops()
     self.assertEqual(32., self.evaluate(v1))
@@ -800,12 +797,13 @@ class CheckpointingTests(test.TestCase):
         second, "v2", initializer=[1., 1., 2., 3.])
     self.evaluate(checkpointable_utils.gather_initializers(first))
     checkpoint_directory = self.get_temp_dir()
-    save_path = checkpointable_utils.Saver(first).save(
+    save_path = checkpointable_utils.CheckpointableSaver(first).save(
         os.path.join(checkpoint_directory, "ckpt"))
 
     # Test deferred loading
     first_load = checkpointable.Checkpointable()
-    status = checkpointable_utils.Saver(first_load).restore(save_path)
+    status = checkpointable_utils.CheckpointableSaver(
+        first_load).restore(save_path)
     second_load = checkpointable.Checkpointable()
     first_load.second = second_load
     second_load.first = first_load
@@ -825,7 +823,7 @@ class CheckpointingTests(test.TestCase):
     self.assertAllEqual([2., 7., 1.], self.evaluate(first_load.v))
     self.evaluate(second_load.v.assign([2., 7., 1., 8.]))
     self.assertAllEqual([2., 7., 1., 8.], self.evaluate(second_load.v))
-    status = checkpointable_utils.Saver(first_load).restore(
+    status = checkpointable_utils.CheckpointableSaver(first_load).restore(
         save_path).assert_consumed()
     status.run_restore_ops()
     self.assertAllEqual([3., 1., 4.], self.evaluate(first_load.v))
@@ -844,14 +842,15 @@ class CheckpointingTests(test.TestCase):
           name="blah", initializer=0.)
       self.evaluate(first.var1.assign(4.))
       self.evaluate(first.var2.assign(8.))
-      save_path = checkpointable_utils.Saver(first).save(
+      save_path = checkpointable_utils.CheckpointableSaver(first).save(
           checkpoint_prefix)
     restore_graph = ops.Graph()
     with restore_graph.as_default(), self.test_session(restore_graph):
       second = checkpointable.Checkpointable()
       second.var2 = variable_scope.get_variable(
           name="blah", initializer=0.)
-      status = checkpointable_utils.Saver(second).restore(save_path)
+      status = checkpointable_utils.CheckpointableSaver(
+          second).restore(save_path)
       recreated_var1 = variable_scope.get_variable(
           name="outside_var", initializer=0.)
       status.run_restore_ops()
@@ -871,15 +870,81 @@ class CheckpointingTests(test.TestCase):
         checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
         obj = checkpointable.Checkpointable()
         obj.var = variable_scope.get_variable(name="v", initializer=0.)
-        obj.opt = CheckpointableAdam(0.1)
+        obj.opt = adam.AdamOptimizer(0.1)
         obj.opt.minimize(obj.var.read_value())
         self.evaluate(checkpointable_utils.gather_initializers(obj))
-        saver = checkpointable_utils.Saver(obj)
+        saver = checkpointable_utils.CheckpointableSaver(obj)
         saver.save(checkpoint_prefix)
         before_ops = graph.get_operations()
         saver.save(checkpoint_prefix)
         self.assertEqual(before_ops, graph.get_operations())
 
+  @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
+  def testCheckpointCleanup(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    obj = checkpointable.Checkpointable()
+    obj.var = variable_scope.get_variable(name="v", initializer=0.)
+    self.evaluate(checkpointable_utils.gather_initializers(obj))
+    saver = checkpointable_utils.Checkpoint(obj=obj)
+    for _ in range(10):
+      saver.save(checkpoint_prefix)
+    expected_filenames = ["checkpoint"]
+    for checkpoint_number in range(6, 11):
+      expected_filenames.append("ckpt-%d.index" % (checkpoint_number,))
+      expected_filenames.append(
+          "ckpt-%d.data-00000-of-00001" % (checkpoint_number,))
+    six.assertCountEqual(
+        self,
+        expected_filenames,
+        os.listdir(checkpoint_directory))
+
+  @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
+  def testCheckpointCleanupChangingVarList(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    obj = checkpointable.Checkpointable()
+    obj.var = variable_scope.get_variable(name="v", initializer=0.)
+    self.evaluate(checkpointable_utils.gather_initializers(obj))
+    checkpoint = checkpointable_utils.Checkpoint(obj=obj)
+    looped_variables = []
+    for iteration in range(10):
+      new_variable = resource_variable_ops.ResourceVariable(iteration)
+      self.evaluate(new_variable.initializer)
+      setattr(checkpoint, "var_%d" % iteration, new_variable)
+      checkpoint.save(checkpoint_prefix)
+      looped_variables.append(new_variable)
+    expected_filenames = ["checkpoint"]
+    # We've copied the saver each time, but checkpoint management should still
+    # be consistent.
+    for checkpoint_number in range(6, 11):
+      expected_filenames.append("ckpt-%d.index" % (checkpoint_number,))
+      expected_filenames.append(
+          "ckpt-%d.data-00000-of-00001" % (checkpoint_number,))
+    six.assertCountEqual(
+        self,
+        expected_filenames,
+        os.listdir(checkpoint_directory))
+    for v in looped_variables:
+      self.evaluate(v.assign(314))
+    checkpoint.restore(checkpoint_prefix + "-6").run_restore_ops()
+    self.assertEqual(314, self.evaluate(checkpoint.var_9))
+    self.assertEqual(314, self.evaluate(checkpoint.var_8))
+    self.assertEqual(314, self.evaluate(checkpoint.var_6))
+    self.assertEqual(5, self.evaluate(checkpoint.var_5))
+    self.assertEqual(1, self.evaluate(checkpoint.var_1))
+    self.assertEqual(0, self.evaluate(checkpoint.var_0))
+    if context.executing_eagerly():
+      checkpoint.restore(checkpoint_prefix + "-10").run_restore_ops()
+      self.assertEqual(9, self.evaluate(checkpoint.var_9))
+      self.assertEqual(8, self.evaluate(checkpoint.var_8))
+      self.assertEqual(1, self.evaluate(checkpoint.var_1))
+      self.assertEqual(0, self.evaluate(checkpoint.var_0))
+    else:
+      # Restoring into modified graphs is an error while graph building.
+      with self.assertRaises(NotImplementedError):
+        checkpoint.restore(checkpoint_prefix + "-10").run_restore_ops()
+
   def testManyRestoresGraph(self):
     """Restores after the first should not modify the graph."""
     with context.graph_mode():
@@ -889,56 +954,193 @@ class CheckpointingTests(test.TestCase):
         checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
         obj = checkpointable.Checkpointable()
         obj.var = variable_scope.get_variable(name="v", initializer=0.)
-        obj.opt = CheckpointableAdam(0.1)
+        obj.opt = adam.AdamOptimizer(0.1)
         obj.opt.minimize(obj.var.read_value())
         self.evaluate(checkpointable_utils.gather_initializers(obj))
-        saver = checkpointable_utils.Saver(obj)
+        saver = checkpointable_utils.CheckpointableSaver(obj)
         save_path = saver.save(checkpoint_prefix)
         saver.restore(save_path)
         before_ops = graph.get_operations()
         saver.restore(save_path)
         self.assertEqual(before_ops, graph.get_operations())
 
+  def testMultipleGraphsNonSlotVariables(self):
+    with context.graph_mode():
+      checkpoint_directory = self.get_temp_dir()
+      checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+      optimizer = adam.AdamOptimizer(0.001)
+      # Construct a model in one graph
+      first_graph = ops.Graph()
+      first_session = session_lib.Session(graph=first_graph)
+      with first_graph.as_default(), first_session.as_default():
+        first_variable = resource_variable_ops.ResourceVariable([1.])
+        first_root_checkpointable = checkpointable_utils.Checkpoint(
+            optimizer=optimizer, variable=first_variable)
+        train_op = optimizer.minimize(first_variable.read_value)
+        self.evaluate(checkpointable_utils.gather_initializers(
+            first_root_checkpointable))
+        self.evaluate(train_op)
+        self.evaluate(first_variable.assign([1.]))
+        self.evaluate(optimizer.get_slot(
+            var=first_variable, name="m").assign([2.]))
+        beta1_power, _ = optimizer._get_beta_accumulators()
+        self.evaluate(beta1_power.assign(3.))
+
+      # Save and load in a second graph
+      second_graph = ops.Graph()
+      with second_graph.as_default(), session_lib.Session(graph=second_graph):
+        second_variable = resource_variable_ops.ResourceVariable([1.])
+        second_root_checkpointable = checkpointable_utils.Checkpoint(
+            optimizer=optimizer, variable=second_variable)
+        train_op = optimizer.minimize(second_variable.read_value)
+        second_root_checkpointable.restore(None).initialize_or_restore()
+        self.evaluate(train_op)
+        self.evaluate(second_variable.assign([4.]))
+        self.evaluate(optimizer.get_slot(
+            var=second_variable, name="m").assign([5.]))
+        beta1_power, _ = optimizer._get_beta_accumulators()
+        self.evaluate(beta1_power.assign(6.))
+        save_path = second_root_checkpointable.save(checkpoint_prefix)
+        self.evaluate(second_variable.assign([7.]))
+        self.evaluate(optimizer.get_slot(
+            var=second_variable, name="m").assign([8.]))
+        beta1_power, _ = optimizer._get_beta_accumulators()
+        self.assertAllEqual(6., self.evaluate(beta1_power))
+        status = second_root_checkpointable.restore(save_path)
+        status.assert_consumed().run_restore_ops()
+        self.assertAllEqual([4.], self.evaluate(second_variable))
+        self.assertAllEqual([5.], self.evaluate(optimizer.get_slot(
+            var=second_variable, name="m")))
+        beta1_power, _ = optimizer._get_beta_accumulators()
+        self.assertAllEqual(6., self.evaluate(beta1_power))
+
+      # Check that the first graph is unmodified
+      with first_graph.as_default(), first_session.as_default():
+        self.assertAllEqual([1.], self.evaluate(first_variable))
+        self.assertAllEqual([2.], self.evaluate(optimizer.get_slot(
+            var=first_variable, name="m")))
+        beta1_power, _ = optimizer._get_beta_accumulators()
+        self.assertAllEqual(3., self.evaluate(beta1_power))
+
+
+class TemplateTests(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
+  def test_checkpointable_save_restore(self):
+
+    def _templated():
+      v = variable_scope.get_variable(
+          "v", shape=[1], initializer=init_ops.zeros_initializer())
+      v2 = variable_scope.get_variable(
+          "v2", shape=[1], initializer=init_ops.zeros_initializer())
+      return v, v + 1., v2
+
+    save_template = template.make_template("s1", _templated)
+    save_root = checkpointable_utils.Checkpoint(my_template=save_template)
+    v1_save, _, v2_save = save_template()
+    self.evaluate(v1_save.assign([12.]))
+    self.evaluate(v2_save.assign([14.]))
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    save_path = save_root.save(checkpoint_prefix)
+
+    load_template = template.make_template("s2", _templated)
+    load_root = checkpointable_utils.Checkpoint(my_template=load_template)
+    status = load_root.restore(save_path)
+    var, var_plus_one, var2 = load_template()
+    self.assertEqual(2, len(load_template._checkpoint_dependencies))
+    self.assertEqual("v", load_template._checkpoint_dependencies[0].name)
+    self.assertEqual("v2", load_template._checkpoint_dependencies[1].name)
+    status.assert_consumed().run_restore_ops()
+    self.assertAllEqual([12.], self.evaluate(var))
+    self.assertAllEqual([13.], self.evaluate(var_plus_one))
+    self.assertAllEqual([14.], self.evaluate(var2))
+
+  @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
+  def test_checkpointable_save_restore_nested(self):
+
+    def _inner_template():
+      v = variable_scope.get_variable(
+          "v", shape=[1], initializer=init_ops.zeros_initializer())
+      return v
+
+    def _outer_template():
+      first_inner = template.make_template("i1", _inner_template)
+      second_inner = template.make_template("i2", _inner_template)
+      v1 = first_inner()
+      v2 = second_inner()
+      v3 = second_inner()
+      return (first_inner, second_inner), (v1, v2, v3)
+
+    with variable_scope.variable_scope("ignored"):
+      save_template = template.make_template("s1", _outer_template)
+      save_root = checkpointable_utils.Checkpoint(my_template=save_template)
+      (inner_template_one, inner_template_two), _ = save_template()
+    self.evaluate(inner_template_one.variables[0].assign([20.]))
+    self.evaluate(inner_template_two.variables[0].assign([25.]))
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    save_path = save_root.save(checkpoint_prefix)
+
+    load_template = template.make_template("s2", _outer_template)
+    load_root = checkpointable_utils.Checkpoint(my_template=load_template)
+    status = load_root.restore(save_path)
+    (inner_template_one, inner_template_two), (v1, v2, v3) = load_template()
+    outer_template_dependencies = load_root.my_template._checkpoint_dependencies
+    self.assertEqual(2, len(outer_template_dependencies))
+    self.assertEqual("i1", outer_template_dependencies[0].name)
+    self.assertIs(inner_template_one, outer_template_dependencies[0].ref)
+    self.assertEqual("i2", outer_template_dependencies[1].name)
+    self.assertIs(inner_template_two, outer_template_dependencies[1].ref)
+    self.assertEqual(1, len(inner_template_one._checkpoint_dependencies))
+    self.assertEqual("v", inner_template_one._checkpoint_dependencies[0].name)
+    self.assertEqual(1, len(inner_template_two._checkpoint_dependencies))
+    self.assertEqual("v", inner_template_two._checkpoint_dependencies[0].name)
+    status.assert_consumed().run_restore_ops()
+    self.assertAllEqual([20.], self.evaluate(v1))
+    self.assertAllEqual([25.], self.evaluate(v2))
+    self.assertAllEqual([25.], self.evaluate(v3))
+
 
 class CheckpointCompatibilityTests(test.TestCase):
 
   def _initialized_model(self):
     input_value = constant_op.constant([[3.]])
-    network = MyNetwork()
-    optimizer = CheckpointableAdam(0.001)
+    model = MyModel()
+    optimizer = adam.AdamOptimizer(0.001)
     optimizer_step = training_util.get_or_create_global_step()
-    root_checkpointable = Checkpoint(
-        optimizer=optimizer, network=network, optimizer_step=optimizer_step)
+    root_checkpointable = checkpointable_utils.Checkpoint(
+        optimizer=optimizer, model=model, optimizer_step=optimizer_step)
     train_op = optimizer.minimize(
-        functools.partial(network, input_value),
+        functools.partial(model, input_value),
         global_step=optimizer_step)
     self.evaluate(checkpointable_utils.gather_initializers(
         root_checkpointable))
     self.evaluate(train_op)
     # A regular variable, a slot variable, and a non-slot Optimizer variable
     # with known values to check when loading.
-    self.evaluate(network._named_dense.bias.assign([1.]))
+    self.evaluate(model._named_dense.bias.assign([1.]))
     self.evaluate(optimizer.get_slot(
-        var=network._named_dense.bias, name="m").assign([2.]))
+        var=model._named_dense.bias, name="m").assign([2.]))
     beta1_power, _ = optimizer._get_beta_accumulators()
     self.evaluate(beta1_power.assign(3.))
     return root_checkpointable
 
   def _set_sentinels(self, root_checkpointable):
-    self.evaluate(root_checkpointable.network._named_dense.bias.assign([101.]))
+    self.evaluate(root_checkpointable.model._named_dense.bias.assign([101.]))
     self.evaluate(
         root_checkpointable.optimizer.get_slot(
-            var=root_checkpointable.network._named_dense.bias, name="m")
+            var=root_checkpointable.model._named_dense.bias, name="m")
         .assign([102.]))
     beta1_power, _ = root_checkpointable.optimizer._get_beta_accumulators()
     self.evaluate(beta1_power.assign(103.))
 
   def _check_sentinels(self, root_checkpointable):
     self.assertAllEqual(
-        [1.], self.evaluate(root_checkpointable.network._named_dense.bias))
+        [1.], self.evaluate(root_checkpointable.model._named_dense.bias))
     self.assertAllEqual([2.], self.evaluate(
         root_checkpointable.optimizer.get_slot(
-            var=root_checkpointable.network._named_dense.bias, name="m")))
+            var=root_checkpointable.model._named_dense.bias, name="m")))
     beta1_power, _ = root_checkpointable.optimizer._get_beta_accumulators()
     self.assertAllEqual(3., self.evaluate(beta1_power))
 
@@ -958,20 +1160,21 @@ class CheckpointCompatibilityTests(test.TestCase):
   @test_util.run_in_graph_and_eager_modes()
   def testLoadFromNameBasedSaver(self):
     """Save a name-based checkpoint, load it using the object-based API."""
-    save_path = self._write_name_based_checkpoint()
-    root = self._initialized_model()
-    self._set_sentinels(root)
-    with self.assertRaises(AssertionError):
+    with test_util.device(use_gpu=True):
+      save_path = self._write_name_based_checkpoint()
+      root = self._initialized_model()
+      self._set_sentinels(root)
+      with self.assertRaises(AssertionError):
+        self._check_sentinels(root)
+      object_saver = checkpointable_utils.CheckpointableSaver(root)
+      status = object_saver.restore(save_path)
+      with self.assertRaises(AssertionError):
+        status.assert_consumed()
+      status.run_restore_ops()
+      self._check_sentinels(root)
+      self._set_sentinels(root)
+      status.initialize_or_restore()
       self._check_sentinels(root)
-    object_saver = checkpointable_utils.Saver(root)
-    status = object_saver.restore(save_path)
-    with self.assertRaises(AssertionError):
-      status.assert_consumed()
-    status.run_restore_ops()
-    self._check_sentinels(root)
-    self._set_sentinels(root)
-    status.initialize_or_restore()
-    self._check_sentinels(root)
 
   # TODO(allenl): Test for the core name-based saver loading object-based
   # checkpoints once object-based checkpointing is in core.
@@ -984,7 +1187,7 @@ class CheckpointCompatibilityTests(test.TestCase):
       with save_graph.as_default(), self.test_session(
           graph=save_graph) as session:
         root = self._initialized_model()
-        object_saver = checkpointable_utils.Saver(root)
+        object_saver = checkpointable_utils.CheckpointableSaver(root)
         save_path = object_saver.save(
             session=session, file_prefix=checkpoint_prefix)
     with context.eager_mode():
@@ -998,7 +1201,7 @@ class CheckpointCompatibilityTests(test.TestCase):
     checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
     with context.eager_mode():
       root = self._initialized_model()
-      object_saver = checkpointable_utils.Saver(root)
+      object_saver = checkpointable_utils.CheckpointableSaver(root)
       save_path = object_saver.save(file_prefix=checkpoint_prefix)
     with context.graph_mode():
       save_graph = ops.Graph()
diff --git a/tensorflow/contrib/eager/python/datasets.py b/tensorflow/contrib/eager/python/datasets.py
index d177bfeab2d1fdc05d7ced54df8723fae2c77fdb..a4c3283dac9194880a1297371ea7591af6dddb2b 100644
--- a/tensorflow/contrib/eager/python/datasets.py
+++ b/tensorflow/contrib/eager/python/datasets.py
@@ -27,11 +27,12 @@ from tensorflow.python.data.util import sparse
 from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import errors
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import gen_dataset_ops
 from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.training import checkpointable
+from tensorflow.python.training.saver import BaseSaverBuilder
 
 _uid_counter = 0
 _uid_lock = threading.Lock()
@@ -45,8 +46,13 @@ def _generate_shared_name(prefix):
   return "{}{}".format(prefix, uid)
 
 
-class Iterator(object):
-  """An iterator producing tf.Tensor objects from a tf.data.Dataset."""
+class Iterator(iterator_ops.EagerIterator, checkpointable.CheckpointableBase):
+  """An iterator producing tf.Tensor objects from a tf.data.Dataset.
+
+  NOTE: Unlike the iterator created by the
+  @{tf.data.Dataset.make_one_shot_iterator} method, this class enables
+  additional experimental functionality, such as prefetching to the GPU.
+  """
 
   def __init__(self, dataset):
     """Creates a new iterator over the given dataset.
@@ -67,37 +73,12 @@ class Iterator(object):
     Raises:
       RuntimeError: When invoked without eager execution enabled.
     """
-
-    if not context.in_eager_mode():
-      raise RuntimeError(
-          "{} objects can only be used when eager execution is enabled, use "
-          "tf.data.Dataset.make_iterator or "
-          "tf.data.Dataset.make_one_shot_iterator for graph construction".
-          format(type(self)))
-    with ops.device("/device:CPU:0"):
-      ds_variant = dataset._as_variant_tensor()  # pylint: disable=protected-access
-      self._output_classes = dataset.output_classes
-      self._output_types = dataset.output_types
-      self._output_shapes = dataset.output_shapes
-      self._flat_output_types = nest.flatten(
-          sparse.as_dense_types(self._output_types, self._output_classes))
-      self._flat_output_shapes = nest.flatten(
-          sparse.as_dense_shapes(self._output_shapes, self._output_classes))
-      self._resource = gen_dataset_ops.iterator(
-          shared_name="",
-          container=_generate_shared_name("eageriterator"),
-          output_types=self._flat_output_types,
-          output_shapes=self._flat_output_shapes)
-      gen_dataset_ops.make_iterator(ds_variant, self._resource)
-      # Delete the resource when this object is deleted
-      self._resource_deleter = resource_variable_ops.EagerResourceDeleter(
-          handle=self._resource, handle_device="/device:CPU:0")
-    self._device = context.context().device_name
-    self._buffer_resource_handle = None
+    super(Iterator, self).__init__(dataset)
     if not context.context().device_spec.device_type:
       is_remote_device = False
     else:
       is_remote_device = context.context().device_spec.device_type != "CPU"
+    self._buffer_resource_handle = None
     if is_remote_device:
       with ops.device("/device:CPU:0"):
         iter_string_handle = gen_dataset_ops.iterator_to_string_handle(
@@ -106,7 +87,7 @@ class Iterator(object):
         @function.Defun(dtypes.string)
         def remote_fn(h):
           remote_iterator = iterator_ops.Iterator.from_string_handle(
-              h, self._output_types, self._output_shapes)
+              h, self.output_types, self.output_shapes, self.output_classes)
           return remote_iterator.get_next()
 
         remote_fn.add_to_graph(None)
@@ -124,89 +105,43 @@ class Iterator(object):
             handle=self._buffer_resource_handle,
             handle_device=self._device)
 
-  def __iter__(self):
-    return self
-
-  def __next__(self):  # For Python 3 compatibility
-    return self.next()
-
   def _next_internal(self):
     """Returns a nested structure of `tf.Tensor`s containing the next element.
     """
-    with ops.device(self._device):
-      if self._buffer_resource_handle is not None:
+    if self._buffer_resource_handle is not None:
+      with ops.device(self._device):
         ret = prefetching_ops.function_buffering_resource_get_next(
             function_buffer_resource=self._buffer_resource_handle,
             output_types=self._flat_output_types)
-      else:
-        # TODO(ashankar): Consider removing this ops.device() contextmanager
-        # and instead mimic ops placement in graphs: Operations on resource
-        # handles execute on the same device as where the resource is placed.
-        # NOTE(mrry): Here we use the "_sync" variant of `iterator_get_next`
-        # because in eager mode this code will run synchronously on the calling
-        # thread. Therefore we do not need to make a defensive context switch
-        # to a background thread, and can achieve a small constant performance
-        # boost by invoking the iterator synchronously.
-        ret = gen_dataset_ops.iterator_get_next_sync(
-            self._resource,
-            output_types=self._flat_output_types,
-            output_shapes=self._flat_output_shapes)
-
-    return sparse.deserialize_sparse_tensors(
-        nest.pack_sequence_as(self._output_types, ret), self._output_types,
-        self._output_shapes, self._output_classes)
-
-  def next(self):
-    """Returns a nested structure of `tf.Tensor`s containing the next element.
-    """
-    try:
-      return self._next_internal()
-    except errors.OutOfRangeError:
-      raise StopIteration
-
-  @property
-  def output_classes(self):
-    """Returns the class of each component of an element of this iterator.
-
-    The expected values are `tf.Tensor` and `tf.SparseTensor`.
-
-    Returns:
-      A nested structure of Python `type` objects corresponding to each
-      component of an element of this dataset.
-    """
-    return self._output_classes
-
-  @property
-  def output_shapes(self):
-    """Returns the shape of each component of an element of this iterator.
+      return sparse.deserialize_sparse_tensors(
+          nest.pack_sequence_as(self._output_types, ret), self._output_types,
+          self._output_shapes, self._output_classes)
+    else:
+      return super(Iterator, self)._next_internal()
 
-    Returns:
-      A nested structure of `tf.TensorShape` objects corresponding to each
-      component of an element of this dataset.
-    """
-    return self._output_shapes
+  # TODO(shivaniagrawal): Potentially expose checkpointable stateful objects
+  # from dataset attributes.
 
-  @property
-  def output_types(self):
-    """Returns the type of each component of an element of this iterator.
+  class _Saveable(BaseSaverBuilder.SaveableObject):
+    """SaveableObject for saving/restoring iterator state."""
 
-    Returns:
-      A nested structure of `tf.DType` objects corresponding to each component
-      of an element of this dataset.
-    """
-    return self._output_types
+    def __init__(self, iterator_resource, name):
+      serialized_iterator = gen_dataset_ops.serialize_iterator(
+          iterator_resource)
+      specs = [
+          BaseSaverBuilder.SaveSpec(serialized_iterator, "", name + "_STATE")
+      ]
+      # pylint: disable=protected-access
+      super(Iterator._Saveable, self).__init__(iterator_resource, specs, name)
 
-  def get_next(self, name=None):
-    """Returns a nested structure of `tf.Tensor`s containing the next element.
+    def restore(self, restored_tensors, restored_shapes):
+      with ops.colocate_with(self.op):
+        return gen_dataset_ops.deserialize_iterator(self.op,
+                                                    restored_tensors[0])
 
-    Args:
-      name: (Optional.) A name for the created operation. Currently unused.
+  def _gather_saveables_for_checkpoint(self):
 
-    Returns:
-      A nested structure of `tf.Tensor` objects.
+    def _saveable_factory(name):
+      return self._Saveable(self._resource, name)
 
-    Raises:
-      `tf.errors.OutOfRangeError`: If the end of the dataset has been reached.
-    """
-    del name
-    return self._next_internal()
+    return {"ITERATOR": _saveable_factory}
diff --git a/tensorflow/contrib/eager/python/datasets_test.py b/tensorflow/contrib/eager/python/datasets_test.py
index 35c3c5d3fad0a84bbe4d24c7bb17878583bded4b..c658505de41bb6a0007440f4850fef720c3e97f1 100644
--- a/tensorflow/contrib/eager/python/datasets_test.py
+++ b/tensorflow/contrib/eager/python/datasets_test.py
@@ -16,6 +16,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import os
+
 import threading
 import time
 
@@ -24,6 +26,7 @@ import numpy as np
 from tensorflow.contrib import lookup
 from tensorflow.contrib.data.python.ops import threadpool
 from tensorflow.contrib.data.python.ops import unique
+from tensorflow.contrib.eager.python import checkpointable_utils
 from tensorflow.contrib.eager.python import datasets
 from tensorflow.python.data import Dataset
 from tensorflow.python.eager import test
@@ -44,6 +47,18 @@ class IteratorTest(test.TestCase):
       got.append(t.numpy())
     self.assertAllEqual([0, 1, 2, 3], got)
 
+  def testBasicOneShotIterator(self):
+    got = []
+    for t in Dataset.range(4).make_one_shot_iterator():
+      got.append(t.numpy())
+    self.assertAllEqual([0, 1, 2, 3], got)
+
+  def testBasicImplicitIterator(self):
+    got = []
+    for t in Dataset.range(4):
+      got.append(t.numpy())
+    self.assertAllEqual([0, 1, 2, 3], got)
+
   def testGetNext(self):
     iterator = datasets.Iterator(Dataset.range(4))
     self.assertEqual(0, iterator.get_next().numpy())
@@ -53,6 +68,15 @@ class IteratorTest(test.TestCase):
     with self.assertRaises(errors.OutOfRangeError):
       iterator.get_next()
 
+  def testGetNextOneShotIterator(self):
+    iterator = Dataset.range(4).make_one_shot_iterator()
+    self.assertEqual(0, iterator.get_next().numpy())
+    self.assertEqual(1, iterator.get_next().numpy())
+    self.assertEqual(2, iterator.get_next().numpy())
+    self.assertEqual(3, iterator.get_next().numpy())
+    with self.assertRaises(errors.OutOfRangeError):
+      iterator.get_next()
+
   def testMultipleIteratorsOnTheSameDataset(self):
     ds = Dataset.range(4)
     it1 = datasets.Iterator(ds)
@@ -200,6 +224,61 @@ class IteratorTest(test.TestCase):
       # perform work.
       self.assertLessEqual(len(thread_ids), num_threads)
 
+  def testSaveRestore(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, 'ckpt')
+    dataset = Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
+    dataset = dataset.map(math_ops.square).batch(2)
+    iterator = datasets.Iterator(dataset)
+    checkpoint = checkpointable_utils.Checkpoint(iterator=iterator)
+    self.assertAllEqual([1, 4], iterator.get_next().numpy())
+    save_path = checkpoint.save(checkpoint_prefix)
+    self.assertAllEqual([9, 16], iterator.get_next().numpy())
+    self.assertAllEqual([25, 36], iterator.get_next().numpy())
+    checkpoint.restore(save_path)
+    self.assertAllEqual([9, 16], iterator.get_next().numpy())
+    self.assertAllEqual([25, 36], iterator.get_next().numpy())
+
+  def testSaveRestoreMultipleIterator(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, 'ckpt')
+    dataset = Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
+    dataset = dataset.map(math_ops.square).batch(2)
+    iterator_1 = datasets.Iterator(dataset)
+    iterator_2 = datasets.Iterator(dataset)
+    dataset_2 = Dataset.range(10)
+    iterator_3 = datasets.Iterator(dataset_2)
+
+    checkpoint = checkpointable_utils.Checkpoint(
+        iterator_1=iterator_1, iterator_2=iterator_2, iterator_3=iterator_3)
+    self.assertAllEqual([1, 4], iterator_1.get_next().numpy())
+    self.assertEqual(0, iterator_3.get_next().numpy())
+    self.assertEqual(1, iterator_3.get_next().numpy())
+    self.assertEqual(2, iterator_3.get_next().numpy())
+
+    save_path = checkpoint.save(checkpoint_prefix)
+    self.assertAllEqual([1, 4], iterator_2.get_next().numpy())
+    self.assertAllEqual([9, 16], iterator_2.get_next().numpy())
+    self.assertEqual(3, iterator_3.get_next().numpy())
+    checkpoint.restore(save_path)
+    self.assertAllEqual([9, 16], iterator_1.get_next().numpy())
+    self.assertAllEqual([1, 4], iterator_2.get_next().numpy())
+    self.assertEqual(3, iterator_3.get_next().numpy())
+
+  def testRestoreExhaustedIterator(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, 'ckpt')
+    dataset = Dataset.range(3)
+    iterator = datasets.Iterator(dataset)
+
+    checkpoint = checkpointable_utils.Checkpoint(iterator=iterator)
+    self.assertEqual(0, iterator.get_next().numpy())
+    self.assertEqual(1, iterator.get_next().numpy())
+    save_path = checkpoint.save(checkpoint_prefix)
+    self.assertEqual(2, iterator.get_next().numpy())
+    checkpoint.restore(save_path)
+    self.assertEqual(2, iterator.get_next().numpy())
+
 
 class DatasetConstructorBenchmark(test.Benchmark):
 
diff --git a/tensorflow/contrib/eager/python/evaluator.py b/tensorflow/contrib/eager/python/evaluator.py
index 68e7b5421fec7f73f10e381ca45f9d900de299d7..37c8f0d47adbde6932bf409cdcae9a1845d700b5 100644
--- a/tensorflow/contrib/eager/python/evaluator.py
+++ b/tensorflow/contrib/eager/python/evaluator.py
@@ -57,7 +57,7 @@ class Evaluator(object):
     self._model = model
     self._metrics = {}
     self._evaluators = {}
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.call = function.defun(self.call)
 
   # ---- API for users ----
@@ -90,7 +90,7 @@ class Evaluator(object):
     Only for graph execution.
     @end_compatibility
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError("Evaluator.init_variables() not needed when "
                          "eager execution is enabled.")
     return control_flow_ops.group([m.init_variables() for _, m in self.metrics])
@@ -113,7 +113,8 @@ class Evaluator(object):
         with summary_ops.create_file_writer(
             summary_logdir).as_default(), summary_ops.always_record_summaries():
           return self._all_metric_results()
-      if context.in_eager_mode():
+
+      if context.executing_eagerly():
         return f()
       else:
         return function.defun(f)()
@@ -158,16 +159,16 @@ class Evaluator(object):
       @end_compatibility
     """
     summary_logdir = kwargs.pop("summary_logdir", None)
-    if context.in_graph_mode():
-      call_op = self.__call__(dataset.make_one_shot_iterator().get_next(),
-                              *args, **kwargs)
-      init_op = self.init_variables()
-      results_op = self.all_metric_results(summary_logdir)
-      return (init_op, call_op, results_op)
-    # Eager case
-    for example in datasets.Iterator(dataset):
-      self.__call__(example, *args, **kwargs)
-    return self.all_metric_results(summary_logdir)
+    if context.executing_eagerly():
+      for example in datasets.Iterator(dataset):
+        self.__call__(example, *args, **kwargs)
+      return self.all_metric_results(summary_logdir)
+    # Graph construction
+    call_op = self.__call__(dataset.make_one_shot_iterator().get_next(), *args,
+                            **kwargs)
+    init_op = self.init_variables()
+    results_op = self.all_metric_results(summary_logdir)
+    return (init_op, call_op, results_op)
 
   @staticmethod
   def run_evaluation(init_op, call_op, results_op, sess=None):
@@ -192,7 +193,7 @@ class Evaluator(object):
     Only for graph execution.
     @end_compatibility
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError("Evaluator.run_evaluation() not supported when "
                          "eager execution is enabled.")
     sess = sess or ops.get_default_session()
diff --git a/tensorflow/contrib/eager/python/examples/gan/mnist.py b/tensorflow/contrib/eager/python/examples/gan/mnist.py
index 5f51d52622caedc6baa9f9f9950a6fd91761259a..2b7e199fad08c9a5e320b51b3a4de92c2d7dbb1a 100644
--- a/tensorflow/contrib/eager/python/examples/gan/mnist.py
+++ b/tensorflow/contrib/eager/python/examples/gan/mnist.py
@@ -195,7 +195,8 @@ def generator_loss(discriminator_gen_outputs):
 
 
 def train_one_epoch(generator, discriminator, generator_optimizer,
-                    discriminator_optimizer, dataset, log_interval, noise_dim):
+                    discriminator_optimizer, dataset, step_counter,
+                    log_interval, noise_dim):
   """Trains `generator` and `discriminator` models on `dataset`.
 
   Args:
@@ -204,7 +205,8 @@ def train_one_epoch(generator, discriminator, generator_optimizer,
     generator_optimizer: Optimizer to use for generator.
     discriminator_optimizer: Optimizer to use for discriminator.
     dataset: Dataset of images to train on.
-    log_interval: How many global steps to wait between logging and collecting
+    step_counter: An integer variable, used to write summaries regularly.
+    log_interval: How many steps to wait between logging and collecting
       summaries.
     noise_dim: Dimension of noise vector to use.
   """
@@ -213,9 +215,10 @@ def train_one_epoch(generator, discriminator, generator_optimizer,
   total_discriminator_loss = 0.0
   for (batch_index, images) in enumerate(tfe.Iterator(dataset)):
     with tf.device('/cpu:0'):
-      tf.assign_add(tf.train.get_global_step(), 1)
+      tf.assign_add(step_counter, 1)
 
-    with tf.contrib.summary.record_summaries_every_n_global_steps(log_interval):
+    with tf.contrib.summary.record_summaries_every_n_global_steps(
+        log_interval, global_step=step_counter):
       current_batch_size = images.shape[0]
       noise = tf.random_uniform(
           shape=[current_batch_size, noise_dim],
@@ -243,12 +246,10 @@ def train_one_epoch(generator, discriminator, generator_optimizer,
       discriminator_grad = g.gradient(discriminator_loss_val,
                                       discriminator.variables)
 
-      with tf.variable_scope('generator'):
-        generator_optimizer.apply_gradients(
-            zip(generator_grad, generator.variables))
-      with tf.variable_scope('discriminator'):
-        discriminator_optimizer.apply_gradients(
-            zip(discriminator_grad, discriminator.variables))
+      generator_optimizer.apply_gradients(
+          zip(generator_grad, generator.variables))
+      discriminator_optimizer.apply_gradients(
+          zip(discriminator_grad, discriminator.variables))
 
       if log_interval and batch_index > 0 and batch_index % log_interval == 0:
         print('Batch #%d\tAverage Generator Loss: %.6f\t'
@@ -269,13 +270,14 @@ def main(_):
       tf.data.Dataset.from_tensor_slices(data.train.images).shuffle(60000)
       .batch(FLAGS.batch_size))
 
-  # Create the models and optimizers
-  generator = Generator(data_format)
-  discriminator = Discriminator(data_format)
-  with tf.variable_scope('generator'):
-    generator_optimizer = tf.train.AdamOptimizer(FLAGS.lr)
-  with tf.variable_scope('discriminator'):
-    discriminator_optimizer = tf.train.AdamOptimizer(FLAGS.lr)
+  # Create the models and optimizers.
+  model_objects = {
+      'generator': Generator(data_format),
+      'discriminator': Discriminator(data_format),
+      'generator_optimizer': tf.train.AdamOptimizer(FLAGS.lr),
+      'discriminator_optimizer': tf.train.AdamOptimizer(FLAGS.lr),
+      'step_counter': tf.train.get_or_create_global_step(),
+  }
 
   # Prepare summary writer and checkpoint info
   summary_writer = tf.contrib.summary.create_summary_file_writer(
@@ -284,25 +286,22 @@ def main(_):
   latest_cpkt = tf.train.latest_checkpoint(FLAGS.checkpoint_dir)
   if latest_cpkt:
     print('Using latest checkpoint at ' + latest_cpkt)
+  checkpoint = tfe.Checkpoint(**model_objects)
+  # Restore variables on creation if a checkpoint exists.
+  checkpoint.restore(latest_cpkt)
 
   with tf.device(device):
-    for epoch in range(1, 101):
-      with tfe.restore_variables_on_create(latest_cpkt):
-        global_step = tf.train.get_or_create_global_step()
-        start = time.time()
-        with summary_writer.as_default():
-          train_one_epoch(generator, discriminator, generator_optimizer,
-                          discriminator_optimizer, dataset, FLAGS.log_interval,
-                          FLAGS.noise)
-        end = time.time()
-        print('\nTrain time for epoch #%d (global step %d): %f' %
-              (epoch, global_step.numpy(), end - start))
-
-      all_variables = (
-          generator.variables + discriminator.variables +
-          generator_optimizer.variables() +
-          discriminator_optimizer.variables() + [global_step])
-      tfe.Saver(all_variables).save(checkpoint_prefix, global_step=global_step)
+    for _ in range(100):
+      start = time.time()
+      with summary_writer.as_default():
+        train_one_epoch(dataset=dataset, log_interval=FLAGS.log_interval,
+                        noise_dim=FLAGS.noise, **model_objects)
+      end = time.time()
+      checkpoint.save(checkpoint_prefix)
+      print('\nTrain time for epoch #%d (step %d): %f' %
+            (checkpoint.save_counter.numpy(),
+             checkpoint.step_counter.numpy(),
+             end - start))
 
 
 if __name__ == '__main__':
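Note on the `main` change above: every object that should be saved lives in one dict, `model_objects`, which is unpacked twice, once into `tfe.Checkpoint(**model_objects)` and once into `train_one_epoch(..., **model_objects)`. The sketch below shows why that works in plain Python. `NamedObjects` and the toy `train_one_epoch` are hypothetical stand-ins and do not reproduce TensorFlow behavior.

```python
class NamedObjects(object):
  """Hypothetical stand-in: exposes keyword arguments as attributes."""

  def __init__(self, **kwargs):
    for name, value in kwargs.items():
      setattr(self, name, value)


def train_one_epoch(generator, discriminator, step_counter, dataset):
  # Toy body: count one step per element; the models are unused here.
  for _ in dataset:
    step_counter["value"] += 1


model_objects = {
    "generator": "G",
    "discriminator": "D",
    "step_counter": {"value": 0},  # mutable, standing in for a variable
}

checkpoint = NamedObjects(**model_objects)
train_one_epoch(dataset=[1, 2, 3], **model_objects)
steps_seen_by_checkpoint = checkpoint.step_counter["value"]
```

The dict keys have to match the function's parameter names, which is presumably why the patch adds a `step_counter` parameter to `train_one_epoch` under the same name as the dict key.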
diff --git a/tensorflow/contrib/eager/python/examples/gan/mnist_test.py b/tensorflow/contrib/eager/python/examples/gan/mnist_test.py
index 4a3ca8d82bc2619b05a734f6d2e58431c1a45995..bd35e50c1f434d167c5a8c5aa7d224912523ce28 100644
--- a/tensorflow/contrib/eager/python/examples/gan/mnist_test.py
+++ b/tensorflow/contrib/eager/python/examples/gan/mnist_test.py
@@ -62,7 +62,7 @@ class MnistEagerGanBenchmark(tf.test.Benchmark):
                         for _ in range(measure_batches)]
       measure_dataset = tf.data.Dataset.from_tensor_slices(measure_images)
 
-      tf.train.get_or_create_global_step()
+      step_counter = tf.train.get_or_create_global_step()
       with tf.device(device()):
         # Create the models and optimizers
         generator = mnist.Generator(data_format())
@@ -78,13 +78,15 @@ class MnistEagerGanBenchmark(tf.test.Benchmark):
           # warm up
           mnist.train_one_epoch(generator, discriminator, generator_optimizer,
                                 discriminator_optimizer,
-                                burn_dataset, log_interval=SUMMARY_INTERVAL,
+                                burn_dataset, step_counter,
+                                log_interval=SUMMARY_INTERVAL,
                                 noise_dim=NOISE_DIM)
           # measure
           start = time.time()
           mnist.train_one_epoch(generator, discriminator, generator_optimizer,
                                 discriminator_optimizer,
-                                measure_dataset, log_interval=SUMMARY_INTERVAL,
+                                measure_dataset, step_counter,
+                                log_interval=SUMMARY_INTERVAL,
                                 noise_dim=NOISE_DIM)
           self._report('train', start, measure_batches, batch_size)
 
diff --git a/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py b/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py
index 157a6360ea555bba37df008a6458acac0342880b..6ab847cb78a09ab0a38beefff56f87d8314c0713 100644
--- a/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py
+++ b/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py
@@ -54,7 +54,7 @@ class LinearModel(tf.keras.Model):
 
 
 def mean_square_loss(model, xs, ys):
-  return tf.reduce_mean(tf.square(model(xs) - ys))
+  return tf.reduce_mean(tf.square(tf.subtract(model(xs), ys)))
 
 
 def fit(model, dataset, optimizer, verbose=False, logdir=None):
diff --git a/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py b/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
index c106ab0a065c630a88cfb725aca5046d0a535253..d6923293a374f29ab77be70fa9fea44efd1ea40b 100644
--- a/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
+++ b/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
@@ -64,22 +64,29 @@ def train_one_step(model, images, labels, optimizer):
 
 class ResNet50Test(tf.test.TestCase):
 
-  def _apply(self, defun=False):
+  def _apply(self, defun=False, execution_mode=None):
     device, data_format = device_and_data_format()
     model = resnet50.ResNet50(data_format)
     if defun:
       model.call = tfe.defun(model.call)
-    with tf.device(device):
+    with tf.device(device), tfe.execution_mode(execution_mode):
       images, _ = random_batch(2)
       output = model(images, training=False)
+      tfe.async_wait()
     self.assertEqual((2, 1000), output.shape)
 
   def test_apply(self):
     self._apply(defun=False)
 
+  def test_apply_async(self):
+    self._apply(defun=False, execution_mode=tfe.ASYNC)
+
   def test_apply_with_defun(self):
     self._apply(defun=True)
 
+  def test_apply_with_defun_async(self):
+    self._apply(defun=True, execution_mode=tfe.ASYNC)
+
   def test_apply_no_top(self):
     device, data_format = device_and_data_format()
     model = resnet50.ResNet50(data_format, include_top=False)
@@ -98,7 +105,7 @@ class ResNet50Test(tf.test.TestCase):
       output = model(images, training=False)
     self.assertEqual((2, 2048), output.shape)
 
-  def test_train(self):
+  def _test_train(self, execution_mode=None):
     device, data_format = device_and_data_format()
     model = resnet50.ResNet50(data_format)
     tf.train.get_or_create_global_step()
@@ -106,15 +113,22 @@ class ResNet50Test(tf.test.TestCase):
     with tf.contrib.summary.create_file_writer(
         logdir, max_queue=0,
         name='t0').as_default(), tf.contrib.summary.always_record_summaries():
-      with tf.device(device):
+      with tf.device(device), tfe.execution_mode(execution_mode):
         optimizer = tf.train.GradientDescentOptimizer(0.1)
         images, labels = random_batch(2)
         train_one_step(model, images, labels, optimizer)
         self.assertEqual(320, len(model.variables))
+        tfe.async_wait()
     events = summary_test_util.events_from_logdir(logdir)
     self.assertEqual(len(events), 2)
     self.assertEqual(events[1].summary.value[0].tag, 'loss')
 
+  def test_train(self):
+    self._test_train()
+
+  def test_train_async(self):
+    self._test_train(execution_mode=tfe.ASYNC)
+
   def test_no_garbage(self):
     device, data_format = device_and_data_format()
     model = resnet50.ResNet50(data_format)
@@ -183,59 +197,84 @@ class ResNet50Benchmarks(tf.test.Benchmark):
     # a sync. This is a roundabout way, yes.
     tf.constant(1.).cpu()
 
-  def _benchmark_eager_apply(self, label, defun=False):
-    device, data_format = device_and_data_format()
-    model = resnet50.ResNet50(data_format)
-    if defun:
-      model.call = tfe.defun(model.call)
-    batch_size = 64
-    num_burn = 5
-    num_iters = 30
-    with tf.device(device):
-      images, _ = random_batch(batch_size)
-      for _ in xrange(num_burn):
-        model(images).cpu()
-      gc.collect()
-      start = time.time()
-      for _ in xrange(num_iters):
-        model(images).cpu()
-      self._report(label, start, num_iters, device, batch_size, data_format)
-
-  def benchmark_eager_apply(self):
-    self._benchmark_eager_apply('eager_apply', defun=False)
-
-  def benchmark_eager_apply_with_defun(self):
-    self._benchmark_eager_apply('eager_apply_with_defun', defun=True)
-
-  def _benchmark_eager_train(self, label, make_iterator, defun=False):
-    device, data_format = device_and_data_format()
-    for batch_size in self._train_batch_sizes():
-      (images, labels) = random_batch(batch_size)
-      num_burn = 3
-      num_iters = 10
+  def _benchmark_eager_apply(self, label, defun=False, execution_mode=None):
+    with tfe.execution_mode(execution_mode):
+      device, data_format = device_and_data_format()
       model = resnet50.ResNet50(data_format)
       if defun:
         model.call = tfe.defun(model.call)
-      optimizer = tf.train.GradientDescentOptimizer(0.1)
-
+      batch_size = 64
+      num_burn = 5
+      num_iters = 30
       with tf.device(device):
-        iterator = make_iterator((images, labels))
+        images, _ = random_batch(batch_size)
         for _ in xrange(num_burn):
-          (images, labels) = iterator.next()
-          train_one_step(model, images, labels, optimizer)
-        self._force_gpu_sync()
+          model(images, training=False).cpu()
+        if execution_mode:
+          tfe.async_wait()
         gc.collect()
-
         start = time.time()
         for _ in xrange(num_iters):
-          (images, labels) = iterator.next()
-          train_one_step(model, images, labels, optimizer)
-        self._force_gpu_sync()
+          model(images, training=False).cpu()
+        if execution_mode:
+          tfe.async_wait()
         self._report(label, start, num_iters, device, batch_size, data_format)
 
+  def benchmark_eager_apply(self):
+    self._benchmark_eager_apply('eager_apply', defun=False)
+
+  def benchmark_eager_apply_async(self):
+    self._benchmark_eager_apply(
+        'eager_apply_async', defun=False, execution_mode=tfe.ASYNC)
+
+  def benchmark_eager_apply_with_defun(self):
+    self._benchmark_eager_apply('eager_apply_with_defun', defun=True)
+
+  def _benchmark_eager_train(self,
+                             label,
+                             make_iterator,
+                             defun=False,
+                             execution_mode=None):
+    with tfe.execution_mode(execution_mode):
+      device, data_format = device_and_data_format()
+      for batch_size in self._train_batch_sizes():
+        (images, labels) = random_batch(batch_size)
+        num_burn = 3
+        num_iters = 10
+        model = resnet50.ResNet50(data_format)
+        if defun:
+          model.call = tfe.defun(model.call)
+        optimizer = tf.train.GradientDescentOptimizer(0.1)
+
+        with tf.device(device):
+          iterator = make_iterator((images, labels))
+          for _ in xrange(num_burn):
+            (images, labels) = iterator.next()
+            train_one_step(model, images, labels, optimizer)
+          if execution_mode:
+            tfe.async_wait()
+          self._force_gpu_sync()
+          gc.collect()
+
+          start = time.time()
+          for _ in xrange(num_iters):
+            (images, labels) = iterator.next()
+            train_one_step(model, images, labels, optimizer)
+          if execution_mode:
+            tfe.async_wait()
+          self._force_gpu_sync()
+          self._report(label, start, num_iters, device, batch_size, data_format)
+
   def benchmark_eager_train(self):
     self._benchmark_eager_train('eager_train', MockIterator, defun=False)
 
+  def benchmark_eager_train_async(self):
+    self._benchmark_eager_train(
+        'eager_train_async',
+        MockIterator,
+        defun=False,
+        execution_mode=tfe.ASYNC)
+
   def benchmark_eager_train_with_defun(self):
     self._benchmark_eager_train(
         'eager_train_with_defun', MockIterator, defun=True)
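Note on the async benchmarks above: they call `tfe.async_wait()` before reading the clock, presumably because in asynchronous mode a call can return before its work has finished. The sketch below illustrates that timing pitfall with the standard library's `concurrent.futures` (Python 3) rather than TensorFlow. `slow_op` and the 0.05 s sleep are made up for illustration.

```python
import concurrent.futures
import time


def slow_op():
  time.sleep(0.05)  # pretend this is work running in the background
  return 1


executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

start = time.time()
futures = [executor.submit(slow_op) for _ in range(3)]
enqueue_time = time.time() - start  # only measures queuing the work

# Analogous to waiting for pending ops before stopping the timer.
concurrent.futures.wait(futures)
total_time = time.time() - start

results = [f.result() for f in futures]
executor.shutdown()
```

Stopping the timer at `enqueue_time` would report a throughput that the hardware never delivered; waiting first gives the figure that includes the actual work.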
diff --git a/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py b/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py
index aa87b94e7b0876e65405f6bcb2d6aabde36582bf..88fffc962fc33ea1639c8e2f997ec72eb48727cc 100644
--- a/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py
+++ b/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py
@@ -60,6 +60,7 @@ import functools
 import os
 import sys
 import time
+from six.moves import urllib
 
 import six
 import tensorflow as tf
@@ -89,13 +90,35 @@ def parse(line):
   return rgb, chars, length
 
 
+def maybe_download(filename, work_directory, source_url):
+  """Download the data from source url, unless it's already here.
+
+  Args:
+    filename: string, name of the file in the directory.
+    work_directory: string, path to working directory.
+    source_url: url to download from if file doesn't exist.
+
+  Returns:
+    Path to resulting file.
+  """
+  if not tf.gfile.Exists(work_directory):
+    tf.gfile.MakeDirs(work_directory)
+  filepath = os.path.join(work_directory, filename)
+  if not tf.gfile.Exists(filepath):
+    temp_file_name, _ = urllib.request.urlretrieve(source_url)
+    tf.gfile.Copy(temp_file_name, filepath)
+    with tf.gfile.GFile(filepath) as f:
+      size = f.size()
+      print("Successfully downloaded", filename, size, "bytes.")
+  return filepath
+
+
 def load_dataset(data_dir, url, batch_size):
   """Loads the colors data at path into a PaddedDataset."""
 
   # Downloads data at url into data_dir/basename(url). The dataset has a header
   # row (color_name, r, g, b) followed by comma-separated lines.
-  path = tf.contrib.learn.datasets.base.maybe_download(
-      os.path.basename(url), data_dir, url)
+  path = maybe_download(os.path.basename(url), data_dir, url)
 
   # This chain of commands loads our data by:
   #   1. skipping the header; (.skip(1))
@@ -109,7 +132,7 @@ def load_dataset(data_dir, url, batch_size):
 
 
 # pylint: disable=not-callable
-class RNNColorbot(tfe.Network):
+class RNNColorbot(tf.keras.Model):
   """Multi-layer (LSTM) RNN that regresses on real-valued vector labels.
   """
 
@@ -127,23 +150,20 @@ class RNNColorbot(tfe.Network):
     self.label_dimension = label_dimension
     self.keep_prob = keep_prob
 
-    # Note the calls to `track_layer` below; these calls register the layers as
-    # network components that house trainable variables.
-    self.cells = [
-        self.track_layer(tf.nn.rnn_cell.BasicLSTMCell(size))
-        for size in rnn_cell_sizes
-    ]
-    self.relu = self.track_layer(
-        tf.layers.Dense(label_dimension, activation=tf.nn.relu, name="relu"))
+    self.cells = self._add_cells(
+        [tf.nn.rnn_cell.BasicLSTMCell(size) for size in rnn_cell_sizes])
+    self.relu = tf.layers.Dense(
+        label_dimension, activation=tf.nn.relu, name="relu")
 
-  def call(self, chars, sequence_length, training=False):
+  def call(self, inputs, training=False):
     """Implements the RNN logic and prediction generation.
 
     Args:
-      chars: a Tensor of dimension [batch_size, time_steps, 256] holding a
-        batch of one-hot encoded color names
-      sequence_length: a Tensor of dimension [batch_size] holding the length
-        of each character sequence (i.e., color name)
+      inputs: A tuple (chars, sequence_length), where chars is a batch of
+        one-hot encoded color names represented as a Tensor with dimensions
+        [batch_size, time_steps, 256] and sequence_length holds the length
+        of each character sequence (color name) as a Tensor with dimension
+        [batch_size].
       training: whether the invocation is happening during training
 
     Returns:
@@ -151,6 +171,7 @@ class RNNColorbot(tfe.Network):
       passing chars through a multi-layer RNN and applying a ReLU to the final
       hidden state.
     """
+    (chars, sequence_length) = inputs
     # Transpose the first and second dimensions so that chars is of shape
     # [time_steps, batch_size, dimension].
     chars = tf.transpose(chars, [1, 0, 2])
@@ -181,6 +202,14 @@ class RNNColorbot(tfe.Network):
     hidden_states = tf.gather_nd(chars, indices)
     return self.relu(hidden_states)
 
+  def _add_cells(self, cells):
+    # "Magic" required for keras.Model classes to track all the variables in
+    # a list of tf.layers.Layer objects.
+    # TODO(ashankar): Figure out API so user code doesn't have to do this.
+    for i, c in enumerate(cells):
+      setattr(self, "cell-%d" % i, c)
+    return cells
+
 
 def loss(labels, predictions):
   """Computes mean squared loss."""
@@ -191,7 +220,7 @@ def test(model, eval_data):
   """Computes the average loss on eval_data, which should be a Dataset."""
   avg_loss = tfe.metrics.Mean("loss")
   for (labels, chars, sequence_length) in tfe.Iterator(eval_data):
-    predictions = model(chars, sequence_length, training=False)
+    predictions = model((chars, sequence_length), training=False)
     avg_loss(loss(labels, predictions))
   print("eval/loss: %.6f\n" % avg_loss.result())
   with tf.contrib.summary.always_record_summaries():
@@ -204,7 +233,7 @@ def train_one_epoch(model, optimizer, train_data, log_interval=10):
   tf.train.get_or_create_global_step()
 
   def model_loss(labels, chars, sequence_length):
-    predictions = model(chars, sequence_length, training=True)
+    predictions = model((chars, sequence_length), training=True)
     loss_value = loss(labels, predictions)
     tf.contrib.summary.scalar("loss", loss_value)
     return loss_value
@@ -277,7 +306,7 @@ def main(_):
       (chars, length) = (tf.identity(chars), tf.identity(length))
       chars = tf.expand_dims(chars, 0)
       length = tf.expand_dims(length, 0)
-      preds = tf.unstack(model(chars, length, training=False)[0])
+      preds = tf.unstack(model((chars, length), training=False)[0])
 
     # Predictions cannot be negative, as they are generated by a ReLU layer;
     # they may, however, be greater than 1.
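Note on `maybe_download` above: it fetches the file only when it is not already present in the working directory. A standard-library sketch of the same check-then-fetch pattern follows, assuming Python 3 and using `os`/`shutil` in place of `tf.gfile`. The name `maybe_download_stdlib` and the `file://` source are illustrative only.

```python
import os
import pathlib
import shutil
import tempfile
import urllib.request


def maybe_download_stdlib(filename, work_directory, source_url):
  """Fetches source_url into work_directory/filename unless it exists."""
  if not os.path.exists(work_directory):
    os.makedirs(work_directory)
  filepath = os.path.join(work_directory, filename)
  if not os.path.exists(filepath):
    temp_file_name, _ = urllib.request.urlretrieve(source_url)
    shutil.copy(temp_file_name, filepath)
  return filepath


# Exercise the function without network access by using a file:// URL.
source_dir = tempfile.mkdtemp()
source_path = os.path.join(source_dir, "colors.csv")
with open(source_path, "w") as f:
  f.write("name,r,g,b\n")
source_url = pathlib.Path(source_path).as_uri()

work_dir = os.path.join(tempfile.mkdtemp(), "data")
first = maybe_download_stdlib("colors.csv", work_dir, source_url)
os.remove(source_path)  # a second call must not need the source again
second = maybe_download_stdlib("colors.csv", work_dir, source_url)
```

The second call returns the cached path even though the source has been removed, because the existence check short-circuits the download.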
diff --git a/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py b/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
index 5c5c59c87744f4ffa6db90e5d8d3aa3bc8132756..69cd16d12c32c8c7c4744d8f0b4b1feedf946aa1 100644
--- a/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
+++ b/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
@@ -39,21 +39,23 @@ from tensorflow.contrib.cudnn_rnn.python.layers import cudnn_rnn
 from tensorflow.contrib.eager.python import tfe
 
 
-class RNN(tfe.Network):
+class RNN(tf.keras.Model):
   """A static RNN.
 
-  Similar to tf.nn.static_rnn, implemented as a tf.layer.Layer.
+  Similar to tf.nn.static_rnn, implemented as a class.
   """
 
   def __init__(self, hidden_dim, num_layers, keep_ratio):
     super(RNN, self).__init__()
     self.keep_ratio = keep_ratio
-    for _ in range(num_layers):
-      self.track_layer(tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_dim))
+    self.cells = self._add_cells([
+        tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_dim)
+        for _ in range(num_layers)
+    ])
 
   def call(self, input_seq, training):
     batch_size = int(input_seq.shape[1])
-    for c in self.layers:
+    for c in self.cells:
       state = c.zero_state(batch_size, tf.float32)
       outputs = []
       input_seq = tf.unstack(input_seq, num=int(input_seq.shape[0]), axis=0)
@@ -64,7 +66,19 @@ class RNN(tfe.Network):
       input_seq = tf.stack(outputs, axis=0)
       if training:
         input_seq = tf.nn.dropout(input_seq, self.keep_ratio)
-    return input_seq, None
+    # Returning a list instead of a single tensor so that the line:
+    # y = self.rnn(y, ...)[0]
+    # in PTBModel.call works for both this RNN and CudnnLSTM (which returns a
+    # tuple (output, output_states)).
+    return [input_seq]
+
+  def _add_cells(self, cells):
+    # "Magic" required for keras.Model classes to track all the variables in
+    # a list of tf.layers.Layer objects.
+    # TODO(ashankar): Figure out API so user code doesn't have to do this.
+    for i, c in enumerate(cells):
+      setattr(self, "cell-%d" % i, c)
+    return cells
 
 
 class Embedding(tf.layers.Layer):
@@ -87,7 +101,8 @@ class Embedding(tf.layers.Layer):
     return tf.nn.embedding_lookup(self.embedding, x)
 
 
-class PTBModel(tfe.Network):
+# pylint: disable=not-callable
+class PTBModel(tf.keras.Model):
   """LSTM for word language modeling.
 
   Model described in:
@@ -109,19 +124,16 @@ class PTBModel(tfe.Network):
 
     self.keep_ratio = 1 - dropout_ratio
     self.use_cudnn_rnn = use_cudnn_rnn
-    self.embedding = self.track_layer(Embedding(vocab_size, embedding_dim))
+    self.embedding = Embedding(vocab_size, embedding_dim)
 
     if self.use_cudnn_rnn:
       self.rnn = cudnn_rnn.CudnnLSTM(
           num_layers, hidden_dim, dropout=dropout_ratio)
     else:
       self.rnn = RNN(hidden_dim, num_layers, self.keep_ratio)
-    self.track_layer(self.rnn)
 
-    self.linear = self.track_layer(
-        tf.layers.Dense(
-            vocab_size,
-            kernel_initializer=tf.random_uniform_initializer(-0.1, 0.1)))
+    self.linear = tf.layers.Dense(
+        vocab_size, kernel_initializer=tf.random_uniform_initializer(-0.1, 0.1))
     self._output_shape = [-1, embedding_dim]
 
   def call(self, input_seq, training):
@@ -136,7 +148,7 @@ class PTBModel(tfe.Network):
     y = self.embedding(input_seq)
     if training:
       y = tf.nn.dropout(y, self.keep_ratio)
-    y, _ = self.rnn(y, training=training)
+    y = self.rnn(y, training=training)[0]
     return self.linear(tf.reshape(y, self._output_shape))
 
 
@@ -148,7 +160,7 @@ def clip_gradients(grads_and_vars, clip_ratio):
 
 def loss_fn(model, inputs, targets, training):
   labels = tf.reshape(targets, [-1])
-  outputs = model(inputs, training)
+  outputs = model(inputs, training=training)
   return tf.reduce_mean(
       tf.nn.sparse_softmax_cross_entropy_with_logits(
           labels=labels, logits=outputs))
diff --git a/tensorflow/contrib/eager/python/examples/spinn/BUILD b/tensorflow/contrib/eager/python/examples/spinn/BUILD
index a1f8a759e2a556bc219f0aa13942f293c4f34cfa..5966f1d4873e8e77b3ad5914da7bfc7e69d4e341 100644
--- a/tensorflow/contrib/eager/python/examples/spinn/BUILD
+++ b/tensorflow/contrib/eager/python/examples/spinn/BUILD
@@ -38,5 +38,9 @@ cuda_py_test(
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:framework_test_lib",
     ],
-    tags = ["no_pip"],  # because spinn.py is under third_party/.
+    tags = [
+        "no-internal-py3",  # flaky
+        "no_cuda_on_cpu_tap",
+        "no_pip",  # because spinn.py is under third_party/.
+    ],
 )
diff --git a/tensorflow/contrib/eager/python/examples/spinn/spinn_test.py b/tensorflow/contrib/eager/python/examples/spinn/spinn_test.py
index 081b0af14fcc983a3f85d2a50e2bb04d2f2493b3..3f9a7818a570c630c5de277131e8e10ecea089e7 100644
--- a/tensorflow/contrib/eager/python/examples/spinn/spinn_test.py
+++ b/tensorflow/contrib/eager/python/examples/spinn/spinn_test.py
@@ -417,7 +417,6 @@ class SpinnTest(test_util.TensorFlowTestCase):
                     if event.summary.value
                     and event.summary.value[0].tag == "train/loss"]
     self.assertEqual(config.epochs, len(train_losses))
-    self.assertLess(train_losses[-1], train_losses[0])
 
     # 5. Verify that checkpoints exist and contains all the expected variables.
     self.assertTrue(glob.glob(os.path.join(config.logdir, "ckpt*")))
diff --git a/tensorflow/contrib/eager/python/g3doc/guide.md b/tensorflow/contrib/eager/python/g3doc/guide.md
index ebb05051f27841f1cd3d21b6218986e774ed4c9f..11064981c6257a607f88c6f4414418c8d1f8eac7 100644
--- a/tensorflow/contrib/eager/python/g3doc/guide.md
+++ b/tensorflow/contrib/eager/python/g3doc/guide.md
@@ -273,9 +273,9 @@ assert 6 == df(3.)[0].numpy()
 d2f = tfe.gradients_function(lambda x: df(x)[0])
 assert 2 == d2f(3.)[0].numpy()
 
-# Third order derivative.
+# Third order derivative: Will be None
 d3f = tfe.gradients_function(lambda x : d2f(x)[0])
-assert 0 == d3f(3.)[0].numpy()
+assert d3f(3.)[0] is None
 ```
 
 These functions can be used to train models. For example, consider the following
@@ -574,49 +574,45 @@ repository](https://github.com/tensorflow/models/tree/master/official/mnist/mnis
 
 ### Checkpointing trained variables
 
-TensorFlow Variables (`tfe.Variable`) provides a way to represent shared,
-persistent state of your model. The `tfe.Saver` class (which is a thin wrapper
-over the
-[`tf.train.Saver`](https://www.tensorflow.org/api_docs/python/tf/train/Saver)
-class) provides a means to save and restore variables to and from _checkpoints_.
+TensorFlow Variables (`tfe.Variable`) provide a way to represent shared,
+persistent state of your model. The `tfe.Checkpoint` class provides a means to
+save and restore variables to and from _checkpoints_.
 
 For example:
 
 ```python
 # Create variables.
-x = tfe.Variable(10., name='x')
-y = tfe.Variable(5., name='y')
+x = tfe.Variable(10.)
+y = tfe.Variable(5.)
 
-# Create a Saver.
-saver = tfe.Saver([x, y])
+# Indicate that the variables should be saved as "x" and "y".
+checkpoint = tfe.Checkpoint(x=x, y=y)
 
 # Assign new values to the variables and save.
 x.assign(2.)
-saver.save('/tmp/ckpt')
+save_path = checkpoint.save('/tmp/ckpt')
 
 # Change the variable after saving.
 x.assign(11.)
 assert 16. == (x + y).numpy()  # 11 + 5
 
 # Restore the values in the checkpoint.
-saver.restore('/tmp/ckpt')
+checkpoint.restore(save_path)  # save_path='/tmp/ckpt-1'
 
 assert 7. == (x + y).numpy()  # 2 + 5
 ```
 
-### `tfe.Network`
+### `tf.keras.Model`
 
 You may often want to organize your models using classes, like the `MNISTModel`
-class described above. We recommend inheriting from the `tfe.Network` class as
-it provides conveniences like keeping track of all model variables and methods
-to save and restore from checkpoints.
+class described above. We recommend inheriting from the `tf.keras.Model` class
+as it provides conveniences like keeping track of all model variables.
 
-Sub-classes of `tfe.Network` may register `Layer`s (like classes in
-[`tf.layers`](https://www.tensorflow.org/api_docs/python/tf/layers),
-or [Keras
-layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers))
-using a call to `self.track_layer()` and define the computation in an
-implementation of `call()`.
+Sub-classes of `tf.keras.Model` may register `Layer`s (like classes in
+[`tf.layers`](https://www.tensorflow.org/api_docs/python/tf/layers), or [Keras
+layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers)) by
+assigning them to attributes (`self.name = layer_object`) and define the
+computation in an implementation of `call()`.
 
 Note that `tf.layers.Layer` objects (like `tf.layers.Dense`) create variables
 lazily, when the first input is encountered.
@@ -624,12 +620,11 @@ lazily, when the first input is encountered.
 For example, consider the following two-layer neural network:
 
 ```python
-class TwoLayerNet(tfe.Network):
+class TwoLayerNet(tf.keras.Model):
   def __init__(self):
     super(TwoLayerNet, self).__init__()
-    self.layer1 = self.track_layer(
-      tf.layers.Dense(2, activation=tf.nn.relu, use_bias=False))
-    self.layer2 = self.track_layer(tf.layers.Dense(3, use_bias=False))
+    self.layer1 = tf.layers.Dense(2, activation=tf.nn.relu, use_bias=False)
+    self.layer2 = tf.layers.Dense(3, use_bias=False)
 
   def call(self, x):
     return self.layer2(self.layer1(x))
@@ -653,15 +648,16 @@ assert [1, 2] == net.variables[0].shape.as_list()  # weights of layer1.
 assert [2, 3] == net.variables[1].shape.as_list()  # weights of layer2.
 ```
 
-The `tfe.Network` class is itself a sub-class of `tf.layers.Layer`. This allows
-instances of `tfe.Network` to be embedded in other networks. For example:
+The `tf.keras.Model` class is itself a sub-class of `tf.layers.Layer`. This
+allows instances of `tf.keras.Model` to be embedded in other models. For
+example:
 
 ```python
-class ThreeLayerNet(tfe.Network):
+class ThreeLayerNet(tf.keras.Model):
   def __init__(self):
     super(ThreeLayerNet, self).__init__()
-    self.a = self.track_layer(TwoLayerNet())
-    self.b = self.track_layer(tf.layers.Dense(4, use_bias=False))
+    self.a = TwoLayerNet()
+    self.b = tf.layers.Dense(4, use_bias=False)
 
   def call(self, x):
     return self.b(self.a(x))
@@ -678,9 +674,8 @@ assert [3, 4] == net.variables[2].shape.as_list()
 See more examples in
 [`tensorflow/contrib/eager/python/examples`](https://www.tensorflow.org/code/tensorflow/contrib/eager/python/examples).
 
-`tfe.Saver` in combination with `tfe.restore_variables_on_create` provides a
-convenient way to save and load checkpoints without changing the program once
-the checkpoint has been created. For example, we can set an objective for the
+`tfe.Checkpoint` provides a convenient way to save and load training
+checkpoints. Let's define something simple to train. We set an objective for the
 output of our network, choose an optimizer, and a location for the checkpoint:
 
 ```python
@@ -691,30 +686,27 @@ checkpoint_prefix = os.path.join(checkpoint_directory, 'ckpt')
 net = ThreeLayerNet()
 ```
 
-Note that variables have not been created yet. We want them to be restored from
-a checkpoint, if one exists, so we create them inside a
-`tfe.restore_variables_on_create` context manager. Then our training loop is the
-same whether starting training or resuming from a previous checkpoint:
+We group the global step, optimizer, and network in a `tfe.Checkpoint` and
+request that it be restored. This ensures that variables created by these
+objects are restored before their values are used. The training loop is the
+same whether training starts fresh or resumes from a previous checkpoint:
 
 ```python
-with tfe.restore_variables_on_create(
-    tf.train.latest_checkpoint(checkpoint_directory)):
-  global_step = tf.train.get_or_create_global_step()
-  for _ in range(100):
-    loss_fn = lambda: tf.norm(net(inp) - objective)
-    optimizer.minimize(loss_fn, global_step=global_step)
-    if tf.equal(global_step % 20, 0):
-      print("Step %d, output %s" % (global_step.numpy(),
-                                    net(inp).numpy()))
-      all_variables = (
-          net.variables
-          + optimizer.variables()
-          + [global_step])
-      # Save the checkpoint.
-      tfe.Saver(all_variables).save(checkpoint_prefix, global_step=global_step)
-```
-
-The first time it runs, `Network` variables are initialized randomly. Then the
+global_step = tf.train.get_or_create_global_step()
+checkpoint = tfe.Checkpoint(
+    global_step=global_step, optimizer=optimizer, network=net)
+checkpoint.restore(tf.train.latest_checkpoint(checkpoint_directory))
+for _ in range(100):
+  loss_fn = lambda: tf.norm(net(inp) - objective)
+  optimizer.minimize(loss_fn, global_step=global_step)
+  if tf.equal(global_step % 20, 0):
+    print("Step %d, output %s" % (global_step.numpy(),
+                                  net(inp).numpy()))
+    # Save the checkpoint.
+    checkpoint.save(checkpoint_prefix)
+```
+
+The first time it runs, `Model` variables are initialized randomly. Then the
 output is trained to match the objective we've set:
 
 ```
diff --git a/tensorflow/contrib/eager/python/metrics_impl.py b/tensorflow/contrib/eager/python/metrics_impl.py
index ea8dbf2b46ea4bd0e33645ae3c590c4dd13f7a52..2f2347736a073c7d9b3fb6685f52f8d58cc40570 100644
--- a/tensorflow/contrib/eager/python/metrics_impl.py
+++ b/tensorflow/contrib/eager/python/metrics_impl.py
@@ -30,12 +30,12 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import variable_scope
-
+from tensorflow.python.training import checkpointable
 
 _to_replace = re.compile("[^A-Za-z0-9.]")
 
 
-class Metric(object):
+class Metric(checkpointable.CheckpointableBase):
   """A metric holds state for aggregating statistics over an evaluation run.
 
   Example use with eager execution:
@@ -93,11 +93,12 @@ class Metric(object):
   `aggregate()`, it is for use by TensorFlow infrastructure.
   """
 
-  def __init__(self, name=None):
+  def __init__(self, name=None, use_global_variables=False):
     self._built = False
     self._vars = []
     self._initial_values = {}
     self._updates = []
+    self._use_global_variables = use_global_variables
     name = name or self.__class__.__name__
     # Replace things like spaces in name to create a valid scope name.
     scope_name = _to_replace.sub("_", name)
@@ -108,13 +109,25 @@ class Metric(object):
       pos = scope.name.rfind(scope_name)
       self._name = name + scope.name[pos + len(scope_name):]
       self._scope = scope
-    if context.in_graph_mode():
+
+    # Ensures that if the user calls build directly we still set self._built to
+    # True to prevent variables from being recreated.
+    self._build = self.build
+
+    def actual_build(*args, **kwargs):
+      self._build(*args, **kwargs)
+      self._built = True
+    self.build = actual_build
+    self.build.__doc__ = self._build.__doc__
+
+    # Captures construction scope for proper initialization.
+    if context.executing_eagerly():
+      self._construction_scope = context.eager_mode
+    else:
       # We make self.call() into a graph callable here, so that we can
       # return a single op that performs all of the variable updates.
       self._construction_scope = ops.get_default_graph().as_default
       self.call = function.defun(self.call)
-    else:
-      self._construction_scope = context.eager_mode
 
   # ---- API for users ----
   def __call__(self, *args, **kwargs):
@@ -155,10 +168,11 @@ class Metric(object):
       initialization. Under eager execution, the variables are reset to their
       initial values as a side effect and this function returns None.
     """
-    if context.in_graph_mode():
+    if context.executing_eagerly():
+      for v in self._vars:
+        v.assign(self._initial_values[v])
+    else:
       return control_flow_ops.group([v.initializer for v in self._vars])
-    for v in self._vars:
-      v.assign(self._initial_values[v])
 
   # ---- To be implemented by descendants ---
   def build(self, *args, **kwargs):
@@ -200,10 +214,10 @@ class Metric(object):
 
   def value(self):
     """In graph mode returns the result Tensor while in eager the callable."""
-    if context.in_graph_mode():
-      return self.result()
-    else:
+    if context.executing_eagerly():
       return self.result
+    else:
+      return self.result()
 
   # We can support two different strategies of for doing data-parallel
   # distributed metric computations:
@@ -245,19 +259,31 @@ class Metric(object):
     """***Only for use by descendants of Metric***."""
     if self._built:
       raise RuntimeError("Can't call add_variable() except in build().")
-    collections = None if context.in_eager_mode() else [
-        ops.GraphKeys.LOCAL_VARIABLES, ops.GraphKeys.METRIC_VARIABLES
-    ]
-    v = variable_scope.get_variable(
-        name,
-        shape,
-        dtype,
-        initializer,
+    if context.executing_eagerly():
+      collections = None
+    else:
+      if self._use_global_variables:
+        collections = [ops.GraphKeys.GLOBAL_VARIABLES]
+      else:
+        collections = [ops.GraphKeys.LOCAL_VARIABLES]
+      collections += [ops.GraphKeys.METRIC_VARIABLES]
+    # Variables are Checkpointable dependencies of Metrics regardless of the
+    # global/local distinction. Users can avoid saving variables by not adding a
+    # dependency on the Metric.
+    v = self._add_variable_with_custom_getter(
+        name=name,
+        shape=shape,
+        dtype=dtype,
+        initializer=initializer,
         trainable=False,
         collections=collections,
-        use_resource=True)
+        use_resource=True,
+        getter=variable_scope.get_variable,
+        # Raise duplicate variable exceptions from get_variable rather than
+        # Checkpointable.
+        overwrite=True)
     self._vars.append(v)
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       self._initial_values[v] = v.value()
     return v
 
@@ -267,8 +293,10 @@ class Mean(Metric):
   # TODO(josh11b): Maybe have a dtype argument that defaults to tf.float64?
   # Or defaults to type of the input if it is tf.float32, else tf.float64?
 
-  def __init__(self, name=None, dtype=dtypes.float64):
-    super(Mean, self).__init__(name=name)
+  def __init__(self, name=None, dtype=dtypes.float64,
+               use_global_variables=False):
+    super(Mean, self).__init__(name=name,
+                               use_global_variables=use_global_variables)
     self.dtype = dtype
 
   def build(self, *args, **kwargs):
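Note on the `Metric.__init__` change above: the constructor now shadows `build` on the instance with a wrapper that also sets `_built`, so a direct `build()` call by the user no longer leads to variables being created a second time. The following is a minimal pure-Python sketch of that pattern only. `Count` and the list-based "variable" are hypothetical stand-ins, not TensorFlow APIs, and scopes, collections and checkpointing are left out.

```python
class Metric(object):
  """Minimal stand-in that shows only the build-wrapping pattern."""

  def __init__(self):
    self._built = False
    self._vars = []
    # Keep a reference to the subclass's build(), then shadow it on the
    # instance with a wrapper that also flips the _built flag.
    self._build = self.build

    def actual_build(*args, **kwargs):
      self._build(*args, **kwargs)
      self._built = True

    self.build = actual_build

  def __call__(self, *args, **kwargs):
    # Build lazily on first use, unless build() was already called.
    if not self._built:
      self.build(*args, **kwargs)
    return self.call(*args, **kwargs)

  def add_variable(self, initial_value):
    if self._built:
      raise RuntimeError("Can't call add_variable() except in build().")
    var = [initial_value]  # Hypothetical stand-in for a TensorFlow variable.
    self._vars.append(var)
    return var


class Count(Metric):
  """Hypothetical metric that sums the values it is called with."""

  def build(self, *args, **kwargs):
    self.total = self.add_variable(0)

  def call(self, value=1):
    self.total[0] += value
    return self.total[0]


m = Count()
m.build()        # Direct call by the user; the wrapper sets m._built.
first = m.total
m(5)             # No rebuild happens, so the same variable is reused.
assert first is m.total
```

This mirrors what `testBuildMean` checks for `metrics.Mean`: the object created by the explicit `build()` call is the same one used by later calls.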
diff --git a/tensorflow/contrib/eager/python/metrics_test.py b/tensorflow/contrib/eager/python/metrics_test.py
index a9ecaa3f8bced3043ea0eb0ac3aa8bfa65e9e1ff..15ac889191e0fe51269bc5740d5e0ab1bc0e2b72 100644
--- a/tensorflow/contrib/eager/python/metrics_test.py
+++ b/tensorflow/contrib/eager/python/metrics_test.py
@@ -18,8 +18,10 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import os
 import tempfile
 
+from tensorflow.contrib.eager.python import checkpointable_utils
 from tensorflow.contrib.eager.python import metrics
 from tensorflow.contrib.summary import summary_ops
 from tensorflow.contrib.summary import summary_test_util
@@ -50,6 +52,19 @@ class MetricsTest(test.TestCase):
       self.assertEqual(
           set(m.variables),
           set(ops.get_collection(ops.GraphKeys.LOCAL_VARIABLES)))
+      self.assertEqual(ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES), [])
+      self.assertEqual(
+          set(m.variables),
+          set(ops.get_collection(ops.GraphKeys.METRIC_VARIABLES)))
+
+  def testUseGlobalVariablesCollections(self):
+    with context.graph_mode(), ops.Graph().as_default():
+      m = metrics.Mean(use_global_variables=True)
+      m(1000)
+      self.assertEqual(
+          set(m.variables),
+          set(ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)))
+      self.assertEqual(ops.get_collection(ops.GraphKeys.LOCAL_VARIABLES), [])
       self.assertEqual(
           set(m.variables),
           set(ops.get_collection(ops.GraphKeys.METRIC_VARIABLES)))
@@ -180,6 +195,15 @@ class MetricsTest(test.TestCase):
         m2 = metrics.Mean()
         m2(2)
 
+  def testBuildMean(self):
+    # Verify that calling build() on a Mean and then calling the metric itself
+    # does not recreate its variables.
+    m = metrics.Mean()
+    m.build()
+    old_numer = m.numer
+    m(0.0)
+    self.assertIs(old_numer, m.numer)
+
   def testMetricsChain(self):
     with context.graph_mode(), self.test_session():
       m1 = metrics.Mean()
@@ -193,6 +217,31 @@ class MetricsTest(test.TestCase):
       self.assertAllEqual(m2.result().eval(), 2.0)
       self.assertAllEqual(m1.result().eval(), 1.0)
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testSaveRestore(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    mean = metrics.Mean()
+    checkpoint = checkpointable_utils.Checkpoint(mean=mean)
+    mean.build()
+    mean._built = True
+    self.evaluate(mean.init_variables())
+    self.evaluate(mean(100.))
+    self.evaluate(mean(200.))
+    save_path = checkpoint.save(checkpoint_prefix)
+    self.evaluate(mean(1000.))
+    checkpoint.restore(save_path).assert_consumed().run_restore_ops()
+    self.evaluate(mean(300.))
+    self.assertAllEqual(200., self.evaluate(mean.value()))
+
+    restore_mean = metrics.Mean()
+    restore_checkpoint = checkpointable_utils.Checkpoint(mean=restore_mean)
+    status = restore_checkpoint.restore(save_path)
+    restore_update = restore_mean(300.)
+    status.assert_consumed().run_restore_ops()
+    self.evaluate(restore_update)
+    self.assertAllEqual(200., self.evaluate(restore_mean.value()))
+    self.assertEqual(3, self.evaluate(restore_mean.denom))
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/eager/python/network.py b/tensorflow/contrib/eager/python/network.py
index e3c13cbd2e8ccd2ab79da74e0e97905c6ed5c02d..e55a9276ab53f44f76dc5e537b3bdde7c975f463 100644
--- a/tensorflow/contrib/eager/python/network.py
+++ b/tensorflow/contrib/eager/python/network.py
@@ -149,7 +149,7 @@ class Network(base.Layer):
     # check we might have name collisions if the parent scope on init gets
     # closed before build is called.
     self._variable_scope_counts_on_init = (
-        variable_scope._get_default_variable_store().variable_scopes_count)
+        variable_scope.get_variable_scope_store().variable_scopes_count)
 
   def _name_scope_name(self, current_variable_scope):
     """Overrides Layer op naming to match variable naming."""
@@ -639,7 +639,7 @@ def _make_custom_getter_for_deferred_restorations():
       # Mark as already restored from this checkpoint.
       delayed_restoration.checkpointed_variables_to_restore[
           checkpoint_name] = None
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         delayed_restoration.session.run(variable.initializer)
     if found_value:
       # Error checking should run even if we've already restored a value.
@@ -772,7 +772,7 @@ def save_network_checkpoint(
                  variable_map[mapped_name]._shared_name,
                  variable._shared_name,
                  network.scope_name))
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     sess = None
   else:
     sess = ops.get_default_session()
@@ -853,7 +853,7 @@ def _restore_existing_variables(network, save_path, map_func, user_map_func):
             network_name=network.name,
             network_scope_name=network.scope_name))
   if existing_variables_by_checkpoint_name:
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       sess = None
     else:
       sess = ops.get_default_session()
@@ -880,7 +880,7 @@ def _set_restore_on_create(network, save_path, map_func, user_map_func,
   # _DeferredRestoration objects once a Network has been built (so that
   # restoring in a loop does not take increasing amounts of memory).
   if checkpointed_variables_to_restore:
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       sess = None
     else:
       sess = ops.get_default_session()
diff --git a/tensorflow/contrib/eager/python/saver.py b/tensorflow/contrib/eager/python/saver.py
index 62421849c766a1124c726812428985c913c653a3..fdaca90fd13576e6ca8a3408aaf528dbc2384b0c 100644
--- a/tensorflow/contrib/eager/python/saver.py
+++ b/tensorflow/contrib/eager/python/saver.py
@@ -73,7 +73,7 @@ def restore_variables_on_create(save_path, map_func=None):
     NotFoundError: If the variable is not found in checkpoint.
     ValueError: If not used in eager mode or map_func is not callable.
   """
-  if context.in_graph_mode():
+  if not context.executing_eagerly():
     raise ValueError(
         "Currently, restore_variables_on_create can only be used with "
         "eager execution enabled.")
@@ -131,7 +131,7 @@ class Saver(object):
     Raises:
       RuntimeError: if invoked when eager execution has not been enabled.
     """
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       raise RuntimeError("tfe.Saver can only be used when eager "
                          "execution is enabled. Use tf.train.Saver when "
                          "building graphs.")
diff --git a/tensorflow/contrib/eager/python/tfe.py b/tensorflow/contrib/eager/python/tfe.py
index d32bebf90c1e768d1efec26b3b78bf1a522a8f00..c6f3f20e781147140f2c4b339ed465ab7e919d37 100644
--- a/tensorflow/contrib/eager/python/tfe.py
+++ b/tensorflow/contrib/eager/python/tfe.py
@@ -56,14 +56,24 @@ To use, at program startup, call `tfe.enable_eager_execution()`.
 @@save_network_checkpoint
 @@restore_network_checkpoint
 
+@@Checkpoint
+@@Checkpointable
+@@CheckpointableSaver
+
+@@executing_eagerly
 @@in_eager_mode
-@@in_graph_mode
+@@set_execution_mode
+@@execution_mode
+@@async_wait
+@@async_clear_error
 
 @@run_test_in_graph_and_eager_modes
 
 @@DEVICE_PLACEMENT_EXPLICIT
 @@DEVICE_PLACEMENT_WARN
 @@DEVICE_PLACEMENT_SILENT
+@@SYNC
+@@ASYNC
 """
 
 from __future__ import absolute_import
@@ -74,6 +84,8 @@ from __future__ import print_function
 # pylint:disable=g-bad-import-order,g-import-not-at-top,unused-import
 #
 from tensorflow.contrib.eager.python import metrics
+from tensorflow.contrib.eager.python.checkpointable_utils import CheckpointableSaver
+from tensorflow.contrib.eager.python.checkpointable_utils import Checkpoint
 from tensorflow.contrib.eager.python.datasets import Iterator
 from tensorflow.contrib.eager.python.network import Network
 from tensorflow.contrib.eager.python.network import Sequential
@@ -87,11 +99,15 @@ from tensorflow.python.eager import function
 from tensorflow.python.eager.context import DEVICE_PLACEMENT_EXPLICIT
 from tensorflow.python.eager.context import DEVICE_PLACEMENT_WARN
 from tensorflow.python.eager.context import DEVICE_PLACEMENT_SILENT
-from tensorflow.python.eager.context import in_eager_mode
-from tensorflow.python.eager.context import in_graph_mode
+from tensorflow.python.eager.context import executing_eagerly
 from tensorflow.python.eager.context import list_devices
+from tensorflow.python.eager.context import set_execution_mode
+from tensorflow.python.eager.context import execution_mode
+from tensorflow.python.eager.context import async_wait
+from tensorflow.python.eager.context import async_clear_error
+from tensorflow.python.eager.context import SYNC
+from tensorflow.python.eager.context import ASYNC
 from tensorflow.python.eager.context import num_gpus
-from tensorflow.python.eager.custom_gradient import custom_gradient
 from tensorflow.python.eager.execution_callbacks import add_execution_callback
 from tensorflow.python.eager.execution_callbacks import clear_execution_callbacks
 from tensorflow.python.eager.execution_callbacks import inf_callback
@@ -101,10 +117,12 @@ from tensorflow.python.eager.execution_callbacks import seterr
 from tensorflow.python.framework.ops import enable_eager_execution
 from tensorflow.python.framework.ops import eager_run as run
 from tensorflow.python.framework.test_util import run_in_graph_and_eager_modes as run_test_in_graph_and_eager_modes
+from tensorflow.python.ops.custom_gradient import custom_gradient
 from tensorflow.python.ops.resource_variable_ops import ResourceVariable as Variable
 from tensorflow.python.ops.variable_scope import EagerVariableStore
 from tensorflow.python.ops import script_ops
 from tensorflow.python.ops import template
+from tensorflow.python.training.checkpointable import Checkpointable
 from tensorflow.python.util.all_util import remove_undocumented
 
 py_func = script_ops.eager_py_func
@@ -115,5 +133,6 @@ implicit_value_and_gradients = backprop.implicit_val_and_grad
 gradients_function = backprop.gradients_function
 value_and_gradients_function = backprop.val_and_grad_function
 GradientTape = backprop.GradientTape  # pylint: disable=invalid-name
+in_eager_mode = executing_eagerly
 
 remove_undocumented(__name__)
diff --git a/tensorflow/contrib/eager/python/tfe_test.py b/tensorflow/contrib/eager/python/tfe_test.py
index b6659c2a1797feab261d756e78b45231dbea5a02..e80ccbb74d8623e977a98cb7fa5eb41f3c9bf250 100644
--- a/tensorflow/contrib/eager/python/tfe_test.py
+++ b/tensorflow/contrib/eager/python/tfe_test.py
@@ -47,7 +47,8 @@ class TFETest(test_util.TensorFlowTestCase):
 
   def testVariableError(self):
     with self.assertRaisesRegexp(
-        RuntimeError, r'Variable not supported in Eager mode'):
+        RuntimeError,
+        r'Variable not supported when eager execution is enabled'):
       variables.Variable(initial_value=1.0)
 
   def testGradients(self):
diff --git a/tensorflow/contrib/estimator/BUILD b/tensorflow/contrib/estimator/BUILD
index ddccfce3c07d20bde78de297db25437a347d75cb..24374266dcf751203f584c4ca97d7a39a38cc7d0 100644
--- a/tensorflow/contrib/estimator/BUILD
+++ b/tensorflow/contrib/estimator/BUILD
@@ -142,6 +142,7 @@ py_test(
     deps = [
         ":extenders",
         "//tensorflow/contrib/data/python/ops:dataset_ops",
+        "//tensorflow/contrib/predictor",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:constant_op",
         "//tensorflow/python:framework_ops",
@@ -170,9 +171,11 @@ py_library(
         "//tensorflow/python:lookup_ops",
         "//tensorflow/python:math_ops",
         "//tensorflow/python:metrics",
+        "//tensorflow/python:nn",
         "//tensorflow/python:sparse_ops",
         "//tensorflow/python:sparse_tensor",
         "//tensorflow/python:summary",
+        "//tensorflow/python:training",
         "//tensorflow/python/estimator:export_output",
         "//tensorflow/python/estimator:head",
         "//tensorflow/python/estimator:metric_keys",
@@ -192,6 +195,7 @@ py_test(
         ":head",
         "//tensorflow/core:protos_all_py",
         "//tensorflow/python:array_ops",
+        "//tensorflow/python:check_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:constant_op",
         "//tensorflow/python:control_flow_ops",
@@ -289,6 +293,8 @@ py_library(
         "//tensorflow/python:math_ops",
         "//tensorflow/python:metrics",
         "//tensorflow/python:summary",
+        "//tensorflow/python:training",
+        "//tensorflow/python/estimator:export_output",
         "//tensorflow/python/estimator:head",
         "//tensorflow/python/estimator:metric_keys",
         "//tensorflow/python/estimator:model_fn",
@@ -352,6 +358,7 @@ cuda_py_test(
     size = "medium",
     srcs = ["python/estimator/replicate_model_fn_test.py"],
     additional_deps = [
+        "//third_party/py/absl/testing:parameterized",
         "//tensorflow/python/estimator",
         "//tensorflow/python/estimator:dnn",
         "//tensorflow/python/estimator:export_export",
diff --git a/tensorflow/contrib/estimator/__init__.py b/tensorflow/contrib/estimator/__init__.py
index 0f75b77050b0ba4c752a6a74fdc7024170b6f318..6b9f9575b606f1822d760e8597c55994dd8af04c 100644
--- a/tensorflow/contrib/estimator/__init__.py
+++ b/tensorflow/contrib/estimator/__init__.py
@@ -39,6 +39,7 @@ _allowed_symbols = [
     'multi_class_head',
     'multi_head',
     'multi_label_head',
+    'poisson_regression_head',
     'regression_head',
     'DNNEstimator',
     'DNNLinearCombinedEstimator',
diff --git a/tensorflow/contrib/estimator/python/estimator/extenders.py b/tensorflow/contrib/estimator/python/estimator/extenders.py
index c99bf8badb35e6fffb7cae8761db9d402b8b3a8f..266ae933052b11b9ab3edb662e95c90aae207dae 100644
--- a/tensorflow/contrib/estimator/python/estimator/extenders.py
+++ b/tensorflow/contrib/estimator/python/estimator/extenders.py
@@ -23,6 +23,7 @@ import six
 from tensorflow.python.estimator import estimator as estimator_lib
 from tensorflow.python.estimator import model_fn as model_fn_lib
 from tensorflow.python.estimator import util as estimator_util
+from tensorflow.python.estimator.export.export_output import PredictOutput
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor as sparse_tensor_lib
 from tensorflow.python.ops import clip_ops
@@ -33,7 +34,7 @@ _VALID_METRIC_FN_ARGS = set(['features', 'labels', 'predictions', 'config'])
 
 
 def add_metrics(estimator, metric_fn):
-  """Creates a new ${tf.estimator.Estimator} which has given metrics.
+  """Creates a new @{tf.estimator.Estimator} which has given metrics.
 
   Example:
 
@@ -60,7 +61,7 @@ def add_metrics(estimator, metric_fn):
   ```
 
   Args:
-    estimator: A ${tf.estimator.Estimator} object.
+    estimator: A @{tf.estimator.Estimator} object.
     metric_fn: A function which should obey the following signature:
       - Args: can only have following four arguments in any order:
         * predictions: Predictions `Tensor` or dict of `Tensor` created by given
@@ -78,7 +79,7 @@ def add_metrics(estimator, metric_fn):
          function, namely a `(metric_tensor, update_op)` tuple.
 
   Returns:
-      A new ${tf.estimator.Estimator} which has a union of original metrics with
+      A new @{tf.estimator.Estimator} which has a union of original metrics with
         given ones.
   """
   _verify_metric_fn_args(metric_fn)
@@ -161,14 +162,14 @@ def forward_features(estimator, keys=None):
   ```
 
   Args:
-    estimator: A ${tf.estimator.Estimator} object.
+    estimator: A @{tf.estimator.Estimator} object.
     keys: a `string` or a `list` of `string`. If it is `None`, all of the
       `features` in `dict` is forwarded to the `predictions`. If it is a
       `string`, only given key is forwarded. If it is a `list` of strings, all
       the given `keys` are forwarded.
 
   Returns:
-      A new ${tf.estimator.Estimator} which forwards features to predictions.
+      A new @{tf.estimator.Estimator} which forwards features to predictions.
 
   Raises:
     ValueError:
@@ -233,7 +234,17 @@ def forward_features(estimator, keys=None):
             'argument of forward_features to filter unwanted features. Type of '
             'features[{}] is {}.'.format(key, key, type(feature)))
       predictions[key] = feature
-    return spec._replace(predictions=predictions)
+    spec = spec._replace(predictions=predictions)
+    if spec.export_outputs:
+      for ekey in ['predict', 'serving_default']:
+        if (ekey in spec.export_outputs and
+            isinstance(spec.export_outputs[ekey],
+                       PredictOutput)):
+          export_outputs = spec.export_outputs[ekey].outputs
+          for key in get_keys(features):
+            export_outputs[key] = predictions[key]
+
+    return spec
 
   return estimator_lib.Estimator(
       model_fn=new_model_fn,
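Note on the `forward_features` change above: forwarded features are now also copied into the `predict` and `serving_default` export outputs when those are `PredictOutput` instances, so they show up in an exported SavedModel as well as in `predictions`. The sketch below reproduces that control flow in plain Python. `Spec`, the local `PredictOutput` class and `forward_keys` are hypothetical stand-ins for `EstimatorSpec`, `export_output.PredictOutput` and the inner `new_model_fn`; they are not the real TensorFlow classes.

```python
import collections

# Hypothetical stand-in for the relevant fields of an EstimatorSpec.
Spec = collections.namedtuple("Spec", ["predictions", "export_outputs"])


class PredictOutput(object):
  """Hypothetical stand-in that holds a dict of named outputs."""

  def __init__(self, outputs):
    self.outputs = dict(outputs)


def forward_keys(spec, features, keys):
  """Copies features[key] into predictions and into predict-style exports."""
  predictions = dict(spec.predictions)
  for key in keys:
    predictions[key] = features[key]
  spec = spec._replace(predictions=predictions)
  if spec.export_outputs:
    for ekey in ["predict", "serving_default"]:
      output = spec.export_outputs.get(ekey)
      # Other export output types are left untouched, as in the diff.
      if isinstance(output, PredictOutput):
        for key in keys:
          output.outputs[key] = predictions[key]
  return spec


spec = Spec(
    predictions={"score": 0.9},
    export_outputs={"serving_default": PredictOutput({"score": 0.9})})
new_spec = forward_keys(spec, {"x": 3.0, "id": 101}, ["id"])
```

As in the diff, the export output's dict is updated in place rather than rebuilt, so the original `spec.export_outputs` entry sees the forwarded keys too.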
diff --git a/tensorflow/contrib/estimator/python/estimator/extenders_test.py b/tensorflow/contrib/estimator/python/estimator/extenders_test.py
index ad1a8ef152b07ecbab33d9eb3184a2ae89def27d..407af2deaf0928361a4f0b0e44e842b7750118cb 100644
--- a/tensorflow/contrib/estimator/python/estimator/extenders_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/extenders_test.py
@@ -18,20 +18,27 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import os
+import tempfile
 import numpy as np
 
 from tensorflow.contrib.estimator.python.estimator import extenders
+from tensorflow.contrib.predictor import from_saved_model
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.estimator import estimator_lib
 from tensorflow.python.estimator.canned import linear
 from tensorflow.python.feature_column import feature_column as fc
 from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import metrics as metrics_lib
 from tensorflow.python.ops import variables
+from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
 from tensorflow.python.training import training
+from tensorflow.python.util import compat
 
 
 def get_input_fn(x, y):
@@ -177,6 +184,44 @@ class ForwardFeaturesTest(test.TestCase):
     self.assertIn('id', predictions)
     self.assertEqual(101, predictions['id'])
 
+  def test_forward_in_exported(self):
+
+    def serving_input_fn():
+      features_ph = {
+          'x': array_ops.placeholder(dtypes.float32, [None]),
+          'id': array_ops.placeholder(dtypes.int32, [None])
+      }
+      features = {
+          key: array_ops.expand_dims(tensor, -1)
+          for key, tensor in features_ph.items()
+      }
+      return estimator_lib.export.ServingInputReceiver(features, features_ph)
+    def input_fn():
+      return {'x': [[3.], [5.]], 'id': [[101], [102]]}, [[1.], [2.]]
+    # create estimator
+    feature_columns = [fc.numeric_column('x')]
+    estimator = linear.LinearRegressor(feature_columns)
+    estimator.train(input_fn=input_fn, steps=1)
+    estimator = extenders.forward_features(estimator, 'id')
+
+    # export saved model
+    tmpdir = tempfile.mkdtemp()
+    export_dir_base = os.path.join(
+        compat.as_bytes(tmpdir), compat.as_bytes('export'))
+    export_dir = estimator.export_savedmodel(export_dir_base, serving_input_fn)
+    self.assertTrue(gfile.Exists(export_dir))
+
+    # restore model
+    predict_fn = from_saved_model(export_dir, signature_def_key='predict')
+    predictions = predict_fn({'x': [3], 'id': [101]})
+
+    # verify that 'id' exists in predictions
+    self.assertIn('id', predictions)
+    self.assertEqual(101, predictions['id'])
+
+    # Clean up.
+    gfile.DeleteRecursively(tmpdir)
+
   def test_forward_list(self):
 
     def input_fn():
diff --git a/tensorflow/contrib/estimator/python/estimator/head.py b/tensorflow/contrib/estimator/python/estimator/head.py
index a45f6934cc5b9bb7bccf148edbd7553b702c2127..42e1b7b68c356dbebdf7e51065dae3c6b4bbd77c 100644
--- a/tensorflow/contrib/estimator/python/estimator/head.py
+++ b/tensorflow/contrib/estimator/python/estimator/head.py
@@ -31,14 +31,17 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import lookup_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import metrics as metrics_lib
+from tensorflow.python.ops import nn
 from tensorflow.python.ops import sparse_ops
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.saved_model import signature_constants
 from tensorflow.python.summary import summary
+from tensorflow.python.training import training_util
 
 _DEFAULT_SERVING_KEY = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
 
 
+# TODO(b/65403806): Switch loss_reduction default to SUM_OVER_BATCH_SIZE.
 def multi_class_head(n_classes,
                      weight_column=None,
                      label_vocabulary=None,
@@ -237,6 +240,66 @@ def regression_head(weight_column=None,
       name=name)
 
 
+def poisson_regression_head(
+    weight_column=None,
+    label_dimension=1,
+    loss_reduction=losses.Reduction.SUM,
+    compute_full_loss=True,
+    name=None):
+  """Creates a `_Head` for poisson regression using `tf.nn.log_poisson_loss`.
+
+  The loss is the weighted sum over all input dimensions. Namely, if the input
+  labels have shape `[batch_size, label_dimension]`, the loss is the weighted
+  sum over both `batch_size` and `label_dimension`.
+
+  The head expects `logits` with shape `[D0, D1, ... DN, label_dimension]`.
+  In many applications, the shape is `[batch_size, label_dimension]`.
+
+  The `labels` shape must match `logits`, namely
+  `[D0, D1, ... DN, label_dimension]`. If `label_dimension=1`, shape
+  `[D0, D1, ... DN]` is also supported.
+
+  If `weight_column` is specified, weights must be of shape
+  `[D0, D1, ... DN]`, `[D0, D1, ... DN, 1]` or
+  `[D0, D1, ... DN, label_dimension]`.
+
+  This is implemented as a generalized linear model, see
+  https://en.wikipedia.org/wiki/Generalized_linear_model.
+
+  Args:
+    weight_column: A string or a `_NumericColumn` created by
+      `tf.feature_column.numeric_column` defining feature column representing
+      weights. It is used to down weight or boost examples during training. It
+      will be multiplied by the loss of the example.
+    label_dimension: Number of regression labels per example. This is the size
+      of the last dimension of the labels `Tensor` (typically, this has shape
+      `[batch_size, label_dimension]`).
+    loss_reduction: One of `tf.losses.Reduction` except `NONE`. Describes how to
+      reduce training loss over batch. Defaults to `SUM`.
+    compute_full_loss: Whether to include the constant `log(z!)` term in
+      computing the poisson loss. See `tf.nn.log_poisson_loss` for the full
+      documentation.
+    name: name of the head. If provided, summary and metrics keys will be
+      suffixed by `"/" + name`. Also used as `name_scope` when creating ops.
+
+  Returns:
+    An instance of `_Head` for poisson regression.
+
+  Raises:
+    ValueError: If `label_dimension` or `loss_reduction` is invalid.
+  """
+  def _poisson_loss(labels, logits):
+    return nn.log_poisson_loss(
+        targets=labels, log_input=logits, compute_full_loss=compute_full_loss)
+  return head_lib._regression_head_with_mean_squared_error_loss(  # pylint:disable=protected-access
+      weight_column=weight_column,
+      label_dimension=label_dimension,
+      loss_reduction=loss_reduction,
+      loss_fn=_poisson_loss,
+      inverse_link_fn=math_ops.exp,
+      name=name)
+
+
 def multi_label_head(n_classes,
                      weight_column=None,
                      thresholds=None,
@@ -428,8 +491,8 @@ class _MultiLabelHead(head_lib._Head):  # pylint:disable=protected-access
         processed_labels=processed_labels)
 
   def create_estimator_spec(
-      self, features, mode, logits, labels=None, train_op_fn=None,
-      regularization_losses=None):
+      self, features, mode, logits, labels=None, optimizer=None,
+      train_op_fn=None, regularization_losses=None):
     """Returns an `EstimatorSpec`.
 
     Args:
@@ -441,8 +504,11 @@ class _MultiLabelHead(head_lib._Head):  # pylint:disable=protected-access
         with shape `[D0, D1, ... DN, n_classes]` or `SparseTensor` with
         `dense_shape` `[D0, D1, ... DN, ?]`. `labels` is required argument when
         `mode` equals `TRAIN` or `EVAL`.
+      optimizer: `Optimizer` instance to optimize the loss in TRAIN mode.
+        Namely, sets `train_op = optimizer.minimize(loss, global_step)`, which
+        updates variables and increments `global_step`.
       train_op_fn: Function that takes a scalar loss `Tensor` and returns
-        `train_op`. Required in TRAIN mode.
+        `train_op`. Used if `optimizer` is `None`.
       regularization_losses: A list of additional scalar losses to be added to
         the training loss, such as regularization losses. These losses are
         usually expressed as a batch average, so for best results users need to
@@ -452,7 +518,8 @@ class _MultiLabelHead(head_lib._Head):  # pylint:disable=protected-access
     Returns:
       `EstimatorSpec`.
     Raises:
-      ValueError: If `train_op_fn` is `None` in TRAIN mode.
+      ValueError: If both `train_op_fn` and `optimizer` are `None` in TRAIN
+        mode, or if both are set.
     """
     with ops.name_scope(self._name, 'head'):
       logits = head_lib._check_logits_final_dim(logits, self.logits_dimension)  # pylint:disable=protected-access
@@ -504,8 +571,16 @@ class _MultiLabelHead(head_lib._Head):  # pylint:disable=protected-access
                 regularization_loss=regularization_loss))
 
       # Train.
-      if train_op_fn is None:
-        raise ValueError('train_op_fn can not be None.')
+      if optimizer is not None:
+        if train_op_fn is not None:
+          raise ValueError('train_op_fn and optimizer cannot both be set.')
+        train_op = optimizer.minimize(
+            regularized_training_loss,
+            global_step=training_util.get_global_step())
+      elif train_op_fn is not None:
+        train_op = train_op_fn(regularized_training_loss)
+      else:
+        raise ValueError('train_op_fn and optimizer cannot both be None.')
       # Only summarize mean_loss for SUM reduction to preserve backwards
       # compatibility. Otherwise skip it to avoid unnecessary computation.
       if self._loss_reduction == losses.Reduction.SUM:
@@ -531,7 +606,7 @@ class _MultiLabelHead(head_lib._Head):  # pylint:disable=protected-access
         mode=model_fn.ModeKeys.TRAIN,
         predictions=predictions,
         loss=regularized_training_loss,
-        train_op=train_op_fn(regularized_training_loss))
+        train_op=train_op)
 
   def _eval_metric_ops(
       self, labels, probabilities, weights, unreduced_loss,
diff --git a/tensorflow/contrib/estimator/python/estimator/head_test.py b/tensorflow/contrib/estimator/python/estimator/head_test.py
index 1411635228457218578c0297d4d901e9c86ca91a..776f0ee341390c97b8284a2a343c9f67380bb547 100644
--- a/tensorflow/contrib/estimator/python/estimator/head_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/head_test.py
@@ -32,6 +32,7 @@ from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import string_ops
@@ -446,7 +447,7 @@ class MultiLabelHead(test.TestCase):
         # auc and auc_pr cannot be reliably calculated for only 4 samples, but
         # this assert tests that the algorithm remains consistent.
         keys.AUC: 0.3333,
-        keys.AUC_PR: 0.5972,
+        keys.AUC_PR: 0.7639,
     }
     self._test_eval(
         head=head,
@@ -478,7 +479,7 @@ class MultiLabelHead(test.TestCase):
         # auc and auc_pr cannot be reliably calculated for only 4 samples, but
         # this assert tests that the algorithm remains consistent.
         keys.AUC: 0.3333,
-        keys.AUC_PR: 0.5972,
+        keys.AUC_PR: 0.7639,
     }
     self._test_eval(
         head=head,
@@ -509,7 +510,7 @@ class MultiLabelHead(test.TestCase):
         # auc and auc_pr cannot be reliably calculated for only 4 samples, but
         # this assert tests that the algorithm remains consistent.
         keys.AUC: 0.3333,
-        keys.AUC_PR: 0.5972,
+        keys.AUC_PR: 0.7639,
     }
     self._test_eval(
         head=head,
@@ -543,7 +544,7 @@ class MultiLabelHead(test.TestCase):
         # auc and auc_pr cannot be reliably calculated for only 4 samples, but
         # this assert tests that the algorithm remains consistent.
         keys.AUC: 0.3333,
-        keys.AUC_PR: 0.5972,
+        keys.AUC_PR: 0.7639,
     }
     self._test_eval(
         head=head,
@@ -573,7 +574,7 @@ class MultiLabelHead(test.TestCase):
         # auc and auc_pr cannot be reliably calculated for only 4 samples, but
         # this assert tests that the algorithm remains consistent.
         keys.AUC: 0.3333,
-        keys.AUC_PR: 0.5972,
+        keys.AUC_PR: 0.7639,
         keys.ACCURACY_AT_THRESHOLD % thresholds[0]: 2. / 4.,
         keys.PRECISION_AT_THRESHOLD % thresholds[0]: 2. / 3.,
         keys.RECALL_AT_THRESHOLD % thresholds[0]: 2. / 3.,
@@ -621,7 +622,7 @@ class MultiLabelHead(test.TestCase):
         # auc and auc_pr cannot be reliably calculated for only 4 samples, but
         # this assert tests that the algorithm remains consistent.
         keys.AUC: 0.2000,
-        keys.AUC_PR: 0.5833,
+        keys.AUC_PR: 0.7833,
     }
 
     # Assert spec contains expected tensors.
@@ -862,6 +863,42 @@ class MultiLabelHead(test.TestCase):
     self._test_train(
         head=head, logits=logits, labels=labels, expected_loss=expected_loss)
 
+  def test_train_with_optimizer(self):
+    head = head_lib.multi_label_head(n_classes=2)
+    logits = np.array([[-10., 10.], [-15., 10.]], dtype=np.float32)
+    labels = np.array([[1, 0], [1, 1]], dtype=np.int64)
+    # For large logits, sigmoid cross entropy loss is approximated as:
+    # loss = labels * (logits < 0) * (-logits) +
+    #        (1 - labels) * (logits > 0) * logits =>
+    # expected_unweighted_loss = [[10., 10.], [15., 0.]]
+    # Average over classes, sum over weights.
+    expected_loss = 17.5
+    expected_train_result = 'my_train_op'
+
+    class _Optimizer(object):
+
+      def minimize(self, loss, global_step):
+        del global_step
+        return string_ops.string_join(
+            [constant_op.constant(expected_train_result),
+             string_ops.as_string(loss, precision=3)])
+
+    spec = head.create_estimator_spec(
+        features={'x': np.array(((42,),), dtype=np.int32)},
+        mode=model_fn.ModeKeys.TRAIN,
+        logits=logits,
+        labels=labels,
+        optimizer=_Optimizer())
+
+    tol = 1e-3
+    with self.test_session() as sess:
+      _initialize_variables(self, spec.scaffold)
+      loss, train_result = sess.run((spec.loss, spec.train_op))
+      self.assertAllClose(expected_loss, loss, rtol=tol, atol=tol)
+      self.assertEqual(
+          six.b('{0:s}{1:.3f}'.format(expected_train_result, expected_loss)),
+          train_result)
+
   def test_train_with_regularization_losses(self):
     head = head_lib.multi_label_head(
         n_classes=2, loss_reduction=losses.Reduction.SUM_OVER_BATCH_SIZE)
@@ -1095,7 +1132,7 @@ class MultiLabelHead(test.TestCase):
         # auc and auc_pr cannot be reliably calculated for only 4 samples, but
         # this assert tests that the algorithm remains consistent.
         keys.AUC: 0.4977,
-        keys.AUC_PR: 0.4037,
+        keys.AUC_PR: 0.6645,
     }
     self._test_eval(
         head=head,
@@ -1106,5 +1143,75 @@ class MultiLabelHead(test.TestCase):
         expected_metrics=expected_metrics)
 
 
+class PoissonRegressionHead(test.TestCase):
+
+  def setUp(self):
+    ops.reset_default_graph()
+
+  def test_train(self):
+    head = head_lib.poisson_regression_head()
+
+    # Create estimator spec.
+    logits = np.array([[0], [-1], [1]], dtype=np.float32)
+    labels = np.array([[1], [2], [3]], dtype=np.int32)
+    # With x = exp(logits), z = labels.
+    # loss = -ln(exp(-x) * (x^z) / z!)
+    #      = x - z * ln(x) + ln(z!)
+    #      = exp(logits) - labels * logits + ln(labels!)
+    # But for ln(z!) and z > 1, the Stirling approximation is used
+    # ln(z!) = z*ln(z) - z + 0.5*ln(2*pi*z)
+    # loss = [exp(0) - 1 * 0 + ln(1!),
+    #         exp(-1) - 2 * (-1) + 2*ln(2) - 2 + 0.5*ln(2*pi*2),
+    #         exp(1) - 3 * 1 + 3*ln(3) - 3 + 0.5*ln(2*pi*3)]
+    #      = [1.0, 3.020, 1.482]
+    # sum_loss = 5.502
+    expected_loss = 5.502
+    atol = 0.001
+    expected_train_result = b'my_train_op'
+    def _train_op_fn(loss):
+      with ops.control_dependencies((check_ops.assert_near(
+          math_ops.to_float(expected_loss), math_ops.to_float(loss),
+          atol=atol, name='assert_loss'),)):
+        return constant_op.constant(expected_train_result)
+
+    spec = head.create_estimator_spec(
+        features={'x': np.array(((42.,),), dtype=np.int32)},
+        mode=model_fn.ModeKeys.TRAIN,
+        logits=logits,
+        labels=labels,
+        train_op_fn=_train_op_fn)
+
+    with self.test_session() as sess:
+      _initialize_variables(self, spec.scaffold)
+      loss, train_result = sess.run([spec.loss, spec.train_op])
+      self.assertAlmostEqual(expected_loss, loss, delta=atol)
+      self.assertEqual(expected_train_result, train_result)
+
+  def test_predict(self):
+    head = head_lib.poisson_regression_head()
+
+    # Create estimator spec.
+    logits = np.array([[0], [-1], [1]], dtype=np.float32)
+    expected_predictions = np.exp(logits)
+    spec = head.create_estimator_spec(
+        features={'x': np.array(((42.,),), dtype=np.int32)},
+        mode=model_fn.ModeKeys.PREDICT,
+        logits=logits)
+
+    # Assert spec contains expected tensors.
+    keys = prediction_keys.PredictionKeys
+    self.assertItemsEqual(
+        (keys.PREDICTIONS, keys.LOGITS), spec.predictions.keys())
+    self.assertEqual(dtypes.float32, spec.predictions[keys.PREDICTIONS].dtype)
+    self.assertEqual(dtypes.float32, spec.predictions[keys.LOGITS].dtype)
+
+    # Assert predictions.
+    with self.test_session():
+      _initialize_variables(self, spec.scaffold)
+      self.assertAllClose(
+          expected_predictions, spec.predictions[keys.PREDICTIONS].eval())
+      self.assertAllClose(logits, spec.predictions[keys.LOGITS].eval())
+
+
 if __name__ == '__main__':
   test.main()
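Reviewer note on `PoissonRegressionHead.test_train`: the expected loss of 5.502 in the comment derivation can be re-derived without TensorFlow. The following is a minimal sketch in plain Python, not the library implementation. The helper name `poisson_loss_sketch` is hypothetical, and it assumes (as the test comment states) that the Stirling approximation of `ln(z!)` is only added when `z > 1`.

```python
import math


def poisson_loss_sketch(target, log_input, compute_full_loss=True):
    """Hypothetical helper: per-example Poisson negative log-likelihood."""
    # exp(logits) - labels * logits, as in the derivation in the test comment.
    loss = math.exp(log_input) - target * log_input
    if compute_full_loss and target > 1:
        # Stirling approximation: ln(z!) ~= z*ln(z) - z + 0.5*ln(2*pi*z).
        # Assumed to be skipped for targets <= 1, where ln(z!) = 0.
        loss += (target * math.log(target) - target
                 + 0.5 * math.log(2 * math.pi * target))
    return loss


logits = [0.0, -1.0, 1.0]
labels = [1, 2, 3]
per_example = [poisson_loss_sketch(z, c) for z, c in zip(labels, logits)]
total = sum(per_example)  # SUM reduction over the batch.
```

With these inputs the per-example values are approximately 1.0, 3.020 and 1.482, which matches the `atol = 0.001` tolerance used by the test.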
diff --git a/tensorflow/contrib/estimator/python/estimator/multi_head.py b/tensorflow/contrib/estimator/python/estimator/multi_head.py
index 0346ddc24bffd61068177f4622bd03be4acd53d9..bbbc19cc4dfb4b23f9b707023fbfdd124f1f48de 100644
--- a/tensorflow/contrib/estimator/python/estimator/multi_head.py
+++ b/tensorflow/contrib/estimator/python/estimator/multi_head.py
@@ -23,6 +23,7 @@ import six
 from tensorflow.python.estimator import model_fn
 from tensorflow.python.estimator.canned import head as head_lib
 from tensorflow.python.estimator.canned import metric_keys
+from tensorflow.python.estimator.export import export_output as export_output_lib
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
@@ -30,6 +31,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import metrics as metrics_lib
 from tensorflow.python.saved_model import signature_constants
 from tensorflow.python.summary import summary
+from tensorflow.python.training import training_util
 
 
 _DEFAULT_SERVING_KEY = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
@@ -226,8 +228,10 @@ class _MultiHead(head_lib._Head):  # pylint:disable=protected-access
         weights=example_weights_by_head,
         processed_labels=labels_by_head)
 
+  # TODO(b/65403806): Support regularization_losses arg.
   def create_estimator_spec(
-      self, features, mode, logits, labels=None, train_op_fn=None):
+      self, features, mode, logits, labels=None, optimizer=None,
+      train_op_fn=None):
     """See `_Head`."""
     if isinstance(logits, dict):
       logits_dict = logits
@@ -248,9 +252,10 @@ class _MultiHead(head_lib._Head):  # pylint:disable=protected-access
               train_op_fn=_no_op_train_fn))
 
     if mode == model_fn.ModeKeys.TRAIN:
-      if train_op_fn is None:
-        raise ValueError('train_op_fn can not be None in TRAIN mode.')
-      spec = self._merge_train(all_estimator_spec, train_op_fn)
+      spec = self._merge_train(
+          all_estimator_spec=all_estimator_spec,
+          optimizer=optimizer,
+          train_op_fn=train_op_fn)
       with ops.name_scope(''):
         summary.scalar(metric_keys.MetricKeys.LOSS, spec.loss)
       return spec
@@ -279,16 +284,21 @@ class _MultiHead(head_lib._Head):  # pylint:disable=protected-access
         begin_idx += head.logits_dimension
     return logits_dict
 
-  def _merge_train(self, all_estimator_spec, train_op_fn):
+  def _merge_train(self, all_estimator_spec, optimizer, train_op_fn):
     """Merges list of `EstimatorSpec` for training.
 
     Args:
       all_estimator_spec: list of `EstimatorSpec` for the individual heads.
-      train_op_fn: Function to create train op. See `create_estimator_spec`
-        documentation for more details.
+      optimizer: `Optimizer` instance to create train op. See
+        `create_estimator_spec` documentation for more details.
+      train_op_fn: Function to create train op. Used if `optimizer` is `None`.
 
     Returns:
       `EstimatorSpec` that merges all heads for TRAIN.
+
+    Raises:
+      ValueError: If both `train_op_fn` and `optimizer` are `None` in TRAIN
+        mode, or if both are set.
     """
     losses = []
     metrics = {}
@@ -297,11 +307,20 @@ class _MultiHead(head_lib._Head):  # pylint:disable=protected-access
       # Metric keys already contain head.name.
       metrics.update(spec.eval_metric_ops or {})
     loss = _merge_losses(losses, self._head_weights)
+    if optimizer is not None:
+      if train_op_fn is not None:
+        raise ValueError('train_op_fn and optimizer cannot both be set.')
+      train_op = optimizer.minimize(
+          loss, global_step=training_util.get_global_step())
+    elif train_op_fn is not None:
+      train_op = train_op_fn(loss)
+    else:
+      raise ValueError('train_op_fn and optimizer cannot both be None.')
 
     return model_fn.EstimatorSpec(
         mode=model_fn.ModeKeys.TRAIN,
         loss=loss,
-        train_op=train_op_fn(loss),
+        train_op=train_op,
         eval_metric_ops=metrics)
 
   def _merge_predict(self, all_estimator_spec):
@@ -319,6 +338,7 @@ class _MultiHead(head_lib._Head):  # pylint:disable=protected-access
             all_estimator_spec[0].export_outputs,
             self._heads[0].name),
     }
+    merged_predict_outputs = {}
     for head, spec in zip(self._heads, all_estimator_spec):
       head_name = head.name
       for k, v in six.iteritems(spec.export_outputs):
@@ -327,8 +347,15 @@ class _MultiHead(head_lib._Head):  # pylint:disable=protected-access
         else:
           key = '%s/%s' % (k, head_name)
         export_outputs[key] = v
+        if (k == head_lib._PREDICT_SERVING_KEY and  # pylint:disable=protected-access
+            isinstance(v, export_output_lib.PredictOutput)):
+          for kp, vp in six.iteritems(v.outputs):
+            key = '%s/%s' % (head_name, kp)
+            merged_predict_outputs[key] = vp
       for k, v in six.iteritems(spec.predictions):
         predictions[(head_name, k)] = v
+    export_outputs[head_lib._PREDICT_SERVING_KEY] = (  # pylint:disable=protected-access
+        export_output_lib.PredictOutput(merged_predict_outputs))
 
     return model_fn.EstimatorSpec(
         mode=model_fn.ModeKeys.PREDICT,
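Reviewer note on the new `optimizer` argument: both `_MultiLabelHead.create_estimator_spec` and `_MultiHead._merge_train` now apply the same selection rule between `optimizer` and `train_op_fn`. The sketch below restates that rule in plain Python so the three outcomes are easy to see. `make_train_op` and `FakeOptimizer` are hypothetical names used only for illustration, and `global_step` is passed in explicitly here rather than fetched with `training_util.get_global_step()`.

```python
def make_train_op(loss, optimizer=None, train_op_fn=None, global_step=None):
    """Hypothetical helper mirroring the optimizer/train_op_fn selection."""
    if optimizer is not None:
        if train_op_fn is not None:
            raise ValueError('train_op_fn and optimizer cannot both be set.')
        # The optimizer both updates variables and increments global_step.
        return optimizer.minimize(loss, global_step=global_step)
    if train_op_fn is not None:
        return train_op_fn(loss)
    raise ValueError('train_op_fn and optimizer cannot both be None.')


class FakeOptimizer(object):
    """Stand-in for an `Optimizer`; returns a string instead of an op."""

    def minimize(self, loss, global_step=None):
        del global_step  # Unused in this sketch.
        return 'minimize({:.1f})'.format(loss)


from_optimizer = make_train_op(17.5, optimizer=FakeOptimizer())
from_fn = make_train_op(17.5, train_op_fn=lambda loss: 'fn({:.1f})'.format(loss))
```

Passing both arguments, or neither, raises `ValueError`, which is the behavior the updated docstrings describe.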
diff --git a/tensorflow/contrib/estimator/python/estimator/multi_head_test.py b/tensorflow/contrib/estimator/python/estimator/multi_head_test.py
index e47a6788f3b5440c4906b9f0430c802cf73237e3..43cc157a1f6340e1b852634ad5f685a868eca286 100644
--- a/tensorflow/contrib/estimator/python/estimator/multi_head_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/multi_head_test.py
@@ -127,8 +127,8 @@ class MultiHeadTest(test.TestCase):
         logits=logits)
 
     self.assertItemsEqual(
-        (_DEFAULT_SERVING_KEY, 'head1', 'classification/head1', 'predict/head1',
-         'head2', 'classification/head2', 'predict/head2'),
+        (_DEFAULT_SERVING_KEY, 'predict', 'head1', 'classification/head1',
+         'predict/head1', 'head2', 'classification/head2', 'predict/head2'),
         spec.export_outputs.keys())
 
     # Assert predictions and export_outputs.
@@ -158,6 +158,22 @@ class MultiHeadTest(test.TestCase):
       self.assertAllClose(
           expected_probabilities['head2'],
           sess.run(spec.export_outputs['head2'].scores))
+      self.assertAllClose(
+          expected_probabilities['head1'],
+          sess.run(
+              spec.export_outputs['predict'].outputs['head1/probabilities']))
+      self.assertAllClose(
+          expected_probabilities['head2'],
+          sess.run(
+              spec.export_outputs['predict'].outputs['head2/probabilities']))
+      self.assertAllClose(
+          expected_probabilities['head1'],
+          sess.run(
+              spec.export_outputs['predict/head1'].outputs['probabilities']))
+      self.assertAllClose(
+          expected_probabilities['head2'],
+          sess.run(
+              spec.export_outputs['predict/head2'].outputs['probabilities']))
 
   def test_predict_two_heads_logits_tensor(self):
     """Tests predict with logits as Tensor."""
@@ -181,8 +197,8 @@ class MultiHeadTest(test.TestCase):
         logits=logits)
 
     self.assertItemsEqual(
-        (_DEFAULT_SERVING_KEY, 'head1', 'classification/head1', 'predict/head1',
-         'head2', 'classification/head2', 'predict/head2'),
+        (_DEFAULT_SERVING_KEY, 'predict', 'head1', 'classification/head1',
+         'predict/head1', 'head2', 'classification/head2', 'predict/head2'),
         spec.export_outputs.keys())
 
     # Assert predictions and export_outputs.
@@ -238,8 +254,8 @@ class MultiHeadTest(test.TestCase):
         logits=logits)
 
     self.assertItemsEqual(
-        (_DEFAULT_SERVING_KEY, 'head1', 'regression/head1', 'predict/head1',
-         'head2', 'regression/head2', 'predict/head2'),
+        (_DEFAULT_SERVING_KEY, 'predict', 'head1', 'regression/head1',
+         'predict/head1', 'head2', 'regression/head2', 'predict/head2'),
         spec.export_outputs.keys())
 
     # Assert predictions and export_outputs.
@@ -306,8 +322,8 @@ class MultiHeadTest(test.TestCase):
         # this assert tests that the algorithm remains consistent.
         keys.AUC + '/head1': 0.1667,
         keys.AUC + '/head2': 0.3333,
-        keys.AUC_PR + '/head1': 0.49999964,
-        keys.AUC_PR + '/head2': 0.33333313,
+        keys.AUC_PR + '/head1': 0.6667,
+        keys.AUC_PR + '/head2': 0.5000,
     }
 
     # Assert spec contains expected tensors.
@@ -534,6 +550,44 @@ class MultiHeadTest(test.TestCase):
           metric_keys.MetricKeys.LOSS_MEAN + '/head1': expected_loss / 2,
       }, summary_str, tol)
 
+  def test_train_one_head_with_optimizer(self):
+    head1 = head_lib.multi_label_head(n_classes=2, name='head1')
+    multi_head = multi_head_lib.multi_head([head1])
+
+    logits = {'head1': np.array([[-10., 10.], [-15., 10.]], dtype=np.float32)}
+    labels = {'head1': np.array([[1, 0], [1, 1]], dtype=np.int64)}
+    # For large logits, sigmoid cross entropy loss is approximated as:
+    # loss = labels * (logits < 0) * (-logits) +
+    #        (1 - labels) * (logits > 0) * logits =>
+    # expected_unweighted_loss = [[10., 10.], [15., 0.]]
+    # Average over classes, sum over weights.
+    expected_loss = 17.5
+    expected_train_result = 'my_train_op'
+
+    class _Optimizer(object):
+
+      def minimize(self, loss, global_step):
+        del global_step
+        return string_ops.string_join(
+            [constant_op.constant(expected_train_result),
+             string_ops.as_string(loss, precision=3)])
+
+    spec = multi_head.create_estimator_spec(
+        features={'x': np.array(((42,),), dtype=np.int32)},
+        mode=model_fn.ModeKeys.TRAIN,
+        logits=logits,
+        labels=labels,
+        optimizer=_Optimizer())
+
+    tol = 1e-3
+    with self.test_session() as sess:
+      _initialize_variables(self, spec.scaffold)
+      loss, train_result = sess.run((spec.loss, spec.train_op))
+      self.assertAllClose(expected_loss, loss, rtol=tol, atol=tol)
+      self.assertEqual(
+          six.b('{0:s}{1:.3f}'.format(expected_train_result, expected_loss)),
+          train_result)
+
   def test_train_two_heads_with_weights(self):
     head1 = head_lib.multi_label_head(n_classes=2, name='head1')
     head2 = head_lib.multi_label_head(n_classes=3, name='head2')
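Reviewer note on the merged `'predict'` export output: `_merge_predict` now collects every head's `PredictOutput` tensors into a single dict keyed as `'<head_name>/<output_key>'`, which is what the new assertions on `spec.export_outputs['predict'].outputs['head1/probabilities']` exercise. A minimal plain-Python sketch of that key scheme follows. `merge_predict_outputs` is a hypothetical helper, and plain lists stand in for tensors.

```python
def merge_predict_outputs(outputs_by_head):
    """Hypothetical helper: flattens per-head outputs into one dict."""
    merged = {}
    for head_name, outputs in outputs_by_head.items():
        for key, value in outputs.items():
            # Same '<head_name>/<key>' format as in `_merge_predict`.
            merged['%s/%s' % (head_name, key)] = value
    return merged


merged = merge_predict_outputs({
    'head1': {'probabilities': [0.5, 0.5], 'logits': [0.0, 0.0]},
    'head2': {'probabilities': [0.9, 0.1]},
})
```

The per-head signatures such as `'predict/head1'` remain available next to the merged one, so existing consumers of those keys are not affected by the merged signature.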
diff --git a/tensorflow/contrib/estimator/python/estimator/replicate_model_fn_test.py b/tensorflow/contrib/estimator/python/estimator/replicate_model_fn_test.py
index d46a18aacfcd911c56a9f22dc9581060c7b458a6..144b45982c8aec2e2b115c812b24e8843d60ce1e 100644
--- a/tensorflow/contrib/estimator/python/estimator/replicate_model_fn_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/replicate_model_fn_test.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 import re
 import shutil
 import tempfile
+from absl.testing import parameterized
 import numpy as np
 import six
 
@@ -57,26 +58,19 @@ from tensorflow.python.training import gradient_descent
 from tensorflow.python.training import training
 
 
-# TODO(isaprykin):  Parametrize all the tests on
-#   replicate_model_fn._VariableDistributionMode when it's supported.
-class DNNClassifierIntegrationTest(test_util.TensorFlowTestCase):
+class DNNClassifierIntegrationTest(test_util.TensorFlowTestCase,
+                                   parameterized.TestCase):
 
   def setUp(self):
     self._model_dir = tempfile.mkdtemp()
 
-  def test_complete_flow_with_public_version(self):
-    return self._complete_flow_with_mode(mode=None)
-
-  def test_complete_flow_with_mode_local_ps_server(self):
-    return self._complete_flow_with_mode(
-        replicate_model_fn._VariableDistributionMode.
-        SHARED_LOCAL_PARAMETER_SERVER)
-
-  def test_complete_flow_with_mode_round_robin(self):
-    return self._complete_flow_with_mode(
-        replicate_model_fn._VariableDistributionMode.SHARED_ROUND_ROBIN)
-
-  def _complete_flow_with_mode(self, mode):
+  @parameterized.named_parameters(
+      ('PublicInterface', None),
+      ('ParameterServerMode', replicate_model_fn._VariableDistributionMode.
+       SHARED_LOCAL_PARAMETER_SERVER),
+      ('RoundRobinMode',
+       replicate_model_fn._VariableDistributionMode.SHARED_ROUND_ROBIN))
+  def test_complete_flow_with_mode(self, mode):
     n_classes = 3
     input_dimension = 2
     batch_size = 12
diff --git a/tensorflow/contrib/factorization/BUILD b/tensorflow/contrib/factorization/BUILD
index 180f1b68f3b56113dfbbfc100bd04efc3bb8b31f..ad8568ad44ea84f96b97e98567a276c70520d53d 100644
--- a/tensorflow/contrib/factorization/BUILD
+++ b/tensorflow/contrib/factorization/BUILD
@@ -66,6 +66,7 @@ tf_custom_op_py_library(
         "//tensorflow/python:variables",
         "//tensorflow/python/estimator",
         "//tensorflow/python/estimator:model_fn",
+        "//tensorflow/python/feature_column:feature_column_py",
         "//third_party/py/numpy",
     ],
 )
@@ -223,7 +224,10 @@ py_test(
     srcs = ["python/ops/kmeans_test.py"],
     shard_count = 4,
     srcs_version = "PY2AND3",
-    tags = ["notsan"],  # b/67512932
+    tags = [
+        "nomac",  # b/73741358
+        "notsan",  # b/67512932
+    ],
     deps = [
         ":factorization_py",
         ":factorization_py_CYCLIC_DEPENDENCIES_THAT_NEED_TO_GO",
@@ -238,6 +242,7 @@ py_test(
         "//tensorflow/python:random_ops",
         "//tensorflow/python:training",
         "//tensorflow/python/estimator:run_config",
+        "//tensorflow/python/feature_column:feature_column_py",
         "//third_party/py/numpy",
     ],
 )
diff --git a/tensorflow/contrib/factorization/python/ops/kmeans.py b/tensorflow/contrib/factorization/python/ops/kmeans.py
index 7319eaa7de8db8e4677bdf64af3b0a72c1007a90..38faca119d0b5ee883de3b215428a0db8a021016 100644
--- a/tensorflow/contrib/factorization/python/ops/kmeans.py
+++ b/tensorflow/contrib/factorization/python/ops/kmeans.py
@@ -26,6 +26,7 @@ from tensorflow.contrib.factorization.python.ops import clustering_ops
 from tensorflow.python.estimator import estimator
 from tensorflow.python.estimator import model_fn as model_fn_lib
 from tensorflow.python.estimator.export import export_output
+from tensorflow.python.feature_column import feature_column as fc
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
@@ -105,24 +106,32 @@ class _InitializeClustersHook(session_run_hook.SessionRunHook):
         logging.info(e)
 
 
-def _parse_tensor_or_dict(features):
+def _parse_features_if_necessary(features, feature_columns):
   """Helper function to convert the input points into a usable format.
 
   Args:
-    features: The input points.
+    features: The input features.
+    feature_columns: An optional iterable containing all the feature columns
+      used by the model. All items should be feature column instances that can
+      be passed to `tf.feature_column.input_layer`. If this is None, all
+      features will be used.
 
   Returns:
-    If `features` is a dict of `k` features, each of which is a vector of `n`
-    scalars, the return value is a Tensor of shape `(n, k)` representing `n`
-    input points, where the items in the `k` dimension are sorted
-    lexicographically by `features` key. If `features` is not a dict, it is
-    returned unmodified.
+    If `features` is a dict of `k` features (optionally filtered by
+    `feature_columns`), each of which is a vector of `n` scalars, the return
+    value is a Tensor of shape `(n, k)` representing `n` input points, where the
+    items in the `k` dimension are sorted lexicographically by `features` key.
+    If `features` is not a dict, it is returned unmodified.
   """
-  if isinstance(features, dict):
-    keys = sorted(features.keys())
-    with ops.colocate_with(features[keys[0]]):
-      features = array_ops.concat([features[k] for k in keys], axis=1)
-  return features
+  if not isinstance(features, dict):
+    return features
+
+  if feature_columns:
+    return fc.input_layer(features, feature_columns)
+
+  keys = sorted(features.keys())
+  with ops.colocate_with(features[keys[0]]):
+    return array_ops.concat([features[k] for k in keys], axis=1)
 
 
 class _ModelFn(object):
@@ -130,7 +139,8 @@ class _ModelFn(object):
 
   def __init__(self, num_clusters, initial_clusters, distance_metric,
                random_seed, use_mini_batch, mini_batch_steps_per_iteration,
-               kmeans_plus_plus_num_retries, relative_tolerance):
+               kmeans_plus_plus_num_retries, relative_tolerance,
+               feature_columns):
     self._num_clusters = num_clusters
     self._initial_clusters = initial_clusters
     self._distance_metric = distance_metric
@@ -139,6 +149,7 @@ class _ModelFn(object):
     self._mini_batch_steps_per_iteration = mini_batch_steps_per_iteration
     self._kmeans_plus_plus_num_retries = kmeans_plus_plus_num_retries
     self._relative_tolerance = relative_tolerance
+    self._feature_columns = feature_columns
 
   def model_fn(self, features, mode, config):
     """Model function for the estimator.
@@ -166,7 +177,7 @@ class _ModelFn(object):
     # input_points is a single Tensor. Therefore, the sharding functionality
     # in clustering_ops is unused, and some of the values below are lists of a
     # single item.
-    input_points = _parse_tensor_or_dict(features)
+    input_points = _parse_features_if_necessary(features, self._feature_columns)
 
     # Let N = the number of input_points.
     # all_distances: A list of one matrix of shape (N, num_clusters). Each value
@@ -316,7 +327,8 @@ class KMeansClustering(estimator.Estimator):
                mini_batch_steps_per_iteration=1,
                kmeans_plus_plus_num_retries=2,
                relative_tolerance=None,
-               config=None):
+               config=None,
+               feature_columns=None):
     """Creates an Estimator for running KMeans training and inference.
 
     This Estimator implements the following variants of the K-means algorithm:
@@ -383,6 +395,10 @@ class KMeansClustering(estimator.Estimator):
         iterations. Stops learning if the loss changes less than this amount.
         This may not work correctly if `use_mini_batch=True`.
       config: See @{tf.estimator.Estimator}.
+      feature_columns: An optional iterable containing all the feature columns
+        used by the model. All items should be feature column instances that
+        can be passed to `tf.feature_column.input_layer`. If this is None, all
+        features will be used.
 
     Raises:
       ValueError: An invalid argument was passed to `initial_clusters` or
@@ -402,7 +418,8 @@ class KMeansClustering(estimator.Estimator):
         model_fn=_ModelFn(
             num_clusters, initial_clusters, distance_metric, random_seed,
             use_mini_batch, mini_batch_steps_per_iteration,
-            kmeans_plus_plus_num_retries, relative_tolerance).model_fn,
+            kmeans_plus_plus_num_retries, relative_tolerance,
+            feature_columns).model_fn,
         model_dir=model_dir,
         config=config)
 
diff --git a/tensorflow/contrib/factorization/python/ops/kmeans_test.py b/tensorflow/contrib/factorization/python/ops/kmeans_test.py
index f9598bfc08c05ea3bba88b3135da0cf2e6bb0c95..0103cc44394b772a1bb1bb8bed78f5445297950a 100644
--- a/tensorflow/contrib/factorization/python/ops/kmeans_test.py
+++ b/tensorflow/contrib/factorization/python/ops/kmeans_test.py
@@ -27,6 +27,7 @@ from sklearn.cluster import KMeans as SklearnKMeans
 # pylint: disable=g-import-not-at-top
 from tensorflow.contrib.factorization.python.ops import kmeans as kmeans_lib
 from tensorflow.python.estimator import run_config
+from tensorflow.python.feature_column import feature_column as fc
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
@@ -226,6 +227,44 @@ class KMeansTest(KMeansTestBase):
     self._infer_helper(kmeans, clusters, 10)
     self._infer_helper(kmeans, clusters, 1)
 
+  def _parse_feature_dict_helper(self, features, parsed_feature_dict):
+    # Perform a sanity check.
+    self.assertEqual(features.shape, parsed_feature_dict.shape)
+    self.assertEqual(features.dtype, parsed_feature_dict.dtype)
+    # Then check that running the tensor yields the original list of points.
+    with self.test_session() as sess:
+      parsed_points = sess.run(parsed_feature_dict)
+      self.assertAllEqual(self.points, parsed_points)
+
+  def test_parse_features(self):
+    """Tests the various behaviours of kmeans._parse_features_if_necessary."""
+
+    # No-op if a tensor is passed in.
+    features = constant_op.constant(self.points)
+    parsed_features = kmeans_lib._parse_features_if_necessary(features, None)
+    self.assertAllEqual(features, parsed_features)
+
+    # All values from a feature dict are transformed into a tensor.
+    feature_dict = {
+        'x': [[point[0]] for point in self.points],
+        'y': [[point[1]] for point in self.points]
+    }
+    parsed_feature_dict = kmeans_lib._parse_features_if_necessary(
+        feature_dict, None)
+    self._parse_feature_dict_helper(features, parsed_feature_dict)
+
+    # Only the feature_columns of a feature dict are transformed into a tensor.
+    feature_dict_with_extras = {
+        'foo': 'bar',
+        'x': [[point[0]] for point in self.points],
+        'baz': {'fizz': 'buzz'},
+        'y': [[point[1]] for point in self.points]
+    }
+    feature_columns = [fc.numeric_column(key='x'), fc.numeric_column(key='y')]
+    parsed_feature_dict = kmeans_lib._parse_features_if_necessary(
+        feature_dict_with_extras, feature_columns)
+    self._parse_feature_dict_helper(features, parsed_feature_dict)
+
 
 class KMeansTestMultiStageInit(KMeansTestBase):
 
@@ -394,7 +433,6 @@ class KMeansCosineDistanceTest(KMeansTestBase):
     true_assignments = [0] * 2 + [1] * 2 + [2] * 8
     true_score = len(points) - np.tensordot(
         normalize(points), true_centers[true_assignments])
-
     kmeans = kmeans_lib.KMeansClustering(
         3,
         initial_clusters=self.initial_clusters,
diff --git a/tensorflow/contrib/feature_column/BUILD b/tensorflow/contrib/feature_column/BUILD
index 6fc053759c58d30c24657dd22e7d12be46fc7a7e..3614b2b15a6cbdd73f9f24c7e4e4534228d31499 100644
--- a/tensorflow/contrib/feature_column/BUILD
+++ b/tensorflow/contrib/feature_column/BUILD
@@ -25,13 +25,42 @@ py_library(
     srcs = ["__init__.py"],
     srcs_version = "PY2AND3",
     deps = [
-        ":sequential_feature_column",
+        ":sequence_feature_column",
+        "//tensorflow/python:util",
     ],
 )
 
 py_library(
-    name = "sequential_feature_column",
-    srcs = ["python/feature_column/sequential_feature_column.py"],
+    name = "sequence_feature_column",
+    srcs = ["python/feature_column/sequence_feature_column.py"],
     srcs_version = "PY2AND3",
-    deps = [],
+    deps = [
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:check_ops",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:parsing_ops",
+        "//tensorflow/python:sparse_ops",
+        "//tensorflow/python:tensor_shape",
+        "//tensorflow/python:variable_scope",
+        "//tensorflow/python/feature_column",
+    ],
+)
+
+py_test(
+    name = "sequence_feature_column_test",
+    srcs = ["python/feature_column/sequence_feature_column_test.py"],
+    srcs_version = "PY2AND3",
+    tags = ["no_pip"],
+    deps = [
+        ":sequence_feature_column",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:errors",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:sparse_tensor",
+        "//tensorflow/python:training",
+        "//tensorflow/python/feature_column",
+        "//third_party/py/numpy",
+    ],
 )
diff --git a/tensorflow/contrib/feature_column/__init__.py b/tensorflow/contrib/feature_column/__init__.py
index 6da7b126931effae9cc97091a27070d7013450d4..baa8c1567a5aeb39976ab04c54ae2728ba050a7c 100644
--- a/tensorflow/contrib/feature_column/__init__.py
+++ b/tensorflow/contrib/feature_column/__init__.py
@@ -19,12 +19,18 @@ from __future__ import division
 from __future__ import print_function
 
 # pylint: disable=unused-import,line-too-long,wildcard-import
-from tensorflow.contrib.feature_column.python.feature_column.sequential_feature_column import *
+from tensorflow.contrib.feature_column.python.feature_column.sequence_feature_column import *
 
 from tensorflow.python.util.all_util import remove_undocumented
 # pylint: enable=unused-import,line-too-long,wildcard-import
 
 _allowed_symbols = [
+    'sequence_categorical_column_with_hash_bucket',
+    'sequence_categorical_column_with_identity',
+    'sequence_categorical_column_with_vocabulary_list',
+    'sequence_categorical_column_with_vocabulary_file',
+    'sequence_input_layer',
+    'sequence_numeric_column',
 ]
 
 remove_undocumented(__name__, allowed_exception_list=_allowed_symbols)
diff --git a/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py
new file mode 100644
index 0000000000000000000000000000000000000000..555beddeaab419bcb23d06f960d370b706d744c8
--- /dev/null
+++ b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py
@@ -0,0 +1,447 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Experimental methods for tf.feature_column sequence input."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+
+import collections
+
+
+from tensorflow.python.feature_column import feature_column as fc
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import parsing_ops
+from tensorflow.python.ops import sparse_ops
+from tensorflow.python.ops import variable_scope
+
+# pylint: disable=protected-access
+# TODO(b/73827486): Support SequenceExample.
+
+
+def sequence_input_layer(
+    features,
+    feature_columns,
+    weight_collections=None,
+    trainable=True):
+  """Builds input layer for sequence input.
+
+  All `feature_columns` must be sequence dense columns with the same
+  `sequence_length`. The output of this method can be fed into sequence
+  networks, such as RNN.
+
+  The output of this method is a 3D `Tensor` of shape `[batch_size, T, D]`.
+  `T` is the maximum sequence length for this batch, which could differ from
+  batch to batch.
+
+  If multiple `feature_columns` are given with `Di` `num_elements` each, their
+  outputs are concatenated. So, the final `Tensor` has shape
+  `[batch_size, T, D0 + D1 + ... + Dn]`.
+
+  Example:
+
+  ```python
+  rating = sequence_numeric_column('rating')
+  watches = sequence_categorical_column_with_identity(
+      'watches', num_buckets=1000)
+  watches_embedding = embedding_column(watches, dimension=10)
+  columns = [rating, watches_embedding]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Args:
+    features: A dict mapping keys to tensors.
+    feature_columns: An iterable of dense sequence columns. Valid columns are
+      - `embedding_column` that wraps a `sequence_categorical_column_with_*`
+      - `sequence_numeric_column`.
+    weight_collections: A list of collection names to which the Variable will be
+      added. Note that variables will also be added to collections
+      `tf.GraphKeys.GLOBAL_VARIABLES` and `tf.GraphKeys.MODEL_VARIABLES`.
+    trainable: If `True` also add the variable to the graph collection
+      `GraphKeys.TRAINABLE_VARIABLES`.
+
+  Returns:
+    An `(input_layer, sequence_length)` tuple where:
+    - input_layer: A float `Tensor` of shape `[batch_size, T, D]`.
+        `T` is the maximum sequence length for this batch, which could differ
+        from batch to batch. `D` is the sum of `num_elements` for all
+        `feature_columns`.
+    - sequence_length: An int `Tensor` of shape `[batch_size]`. The sequence
+        length for each example.
+
+  Raises:
+    ValueError: If any of the `feature_columns` is the wrong type.
+  """
+  feature_columns = fc._clean_feature_columns(feature_columns)
+  for c in feature_columns:
+    if not isinstance(c, fc._SequenceDenseColumn):
+      raise ValueError(
+          'All feature_columns must be of type _SequenceDenseColumn. '
+          'You can wrap a sequence_categorical_column with an embedding_column '
+          'or indicator_column. '
+          'Given (type {}): {}'.format(type(c), c))
+
+  with variable_scope.variable_scope(
+      None, default_name='sequence_input_layer', values=features.values()):
+    builder = fc._LazyBuilder(features)
+    output_tensors = []
+    sequence_lengths = []
+    ordered_columns = []
+    for column in sorted(feature_columns, key=lambda x: x.name):
+      ordered_columns.append(column)
+      with variable_scope.variable_scope(
+          None, default_name=column._var_scope_name):
+        dense_tensor, sequence_length = column._get_sequence_dense_tensor(
+            builder,
+            weight_collections=weight_collections,
+            trainable=trainable)
+        # Flattens the final dimension to produce a 3D Tensor.
+        num_elements = column._variable_shape.num_elements()
+        shape = array_ops.shape(dense_tensor)
+        output_tensors.append(
+            array_ops.reshape(
+                dense_tensor,
+                shape=array_ops.concat([shape[:2], [num_elements]], axis=0)))
+        sequence_lengths.append(sequence_length)
+    fc._verify_static_batch_size_equality(output_tensors, ordered_columns)
+    fc._verify_static_batch_size_equality(sequence_lengths, ordered_columns)
+    sequence_length = _assert_all_equal_and_return(sequence_lengths)
+    return array_ops.concat(output_tensors, -1), sequence_length
+
+
+def sequence_categorical_column_with_identity(
+    key, num_buckets, default_value=None):
+  """Returns a feature column that represents sequences of integers.
+
+  Pass this to `embedding_column` or `indicator_column` to convert sequence
+  categorical data into dense representation for input to sequence NN, such as
+  RNN.
+
+  Example:
+
+  ```python
+  watches = sequence_categorical_column_with_identity(
+      'watches', num_buckets=1000)
+  watches_embedding = embedding_column(watches, dimension=10)
+  columns = [watches_embedding]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature.
+    num_buckets: Range of inputs. Namely, inputs are expected to be in the
+      range `[0, num_buckets)`.
+    default_value: If `None`, this column's graph operations will fail for
+      out-of-range inputs. Otherwise, this value must be in the range
+      `[0, num_buckets)`, and will replace out-of-range inputs.
+
+  Returns:
+    A `_SequenceCategoricalColumn`.
+
+  Raises:
+    ValueError: if `num_buckets` is less than one.
+    ValueError: if `default_value` is not in range `[0, num_buckets)`.
+  """
+  return fc._SequenceCategoricalColumn(
+      fc.categorical_column_with_identity(
+          key=key,
+          num_buckets=num_buckets,
+          default_value=default_value))
+
+
+def sequence_categorical_column_with_hash_bucket(
+    key, hash_bucket_size, dtype=dtypes.string):
+  """A sequence of categorical terms where ids are set by hashing.
+
+  Pass this to `embedding_column` or `indicator_column` to convert sequence
+  categorical data into dense representation for input to sequence NN, such as
+  RNN.
+
+  Example:
+
+  ```python
+  tokens = sequence_categorical_column_with_hash_bucket(
+      'tokens', hash_bucket_size=1000)
+  tokens_embedding = embedding_column(tokens, dimension=10)
+  columns = [tokens_embedding]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature.
+    hash_bucket_size: An int > 1. The number of buckets.
+    dtype: The type of features. Only string and integer types are supported.
+
+  Returns:
+    A `_SequenceCategoricalColumn`.
+
+  Raises:
+    ValueError: `hash_bucket_size` is not greater than 1.
+    ValueError: `dtype` is neither string nor integer.
+  """
+  return fc._SequenceCategoricalColumn(
+      fc.categorical_column_with_hash_bucket(
+          key=key,
+          hash_bucket_size=hash_bucket_size,
+          dtype=dtype))
+
+
+def sequence_categorical_column_with_vocabulary_file(
+    key, vocabulary_file, vocabulary_size=None, num_oov_buckets=0,
+    default_value=None, dtype=dtypes.string):
+  """A sequence of categorical terms where ids use a vocabulary file.
+
+  Pass this to `embedding_column` or `indicator_column` to convert sequence
+  categorical data into dense representation for input to sequence NN, such as
+  RNN.
+
+  Example:
+
+  ```python
+  states = sequence_categorical_column_with_vocabulary_file(
+      key='states', vocabulary_file='/us/states.txt', vocabulary_size=50,
+      num_oov_buckets=5)
+  states_embedding = embedding_column(states, dimension=10)
+  columns = [states_embedding]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature.
+    vocabulary_file: The vocabulary file name.
+    vocabulary_size: Number of the elements in the vocabulary. This must be no
+      greater than length of `vocabulary_file`; if less than length, later
+      values are ignored. If None, it is set to the length of `vocabulary_file`.
+    num_oov_buckets: Non-negative integer, the number of out-of-vocabulary
+      buckets. All out-of-vocabulary inputs will be assigned IDs in the range
+      `[vocabulary_size, vocabulary_size+num_oov_buckets)` based on a hash of
+      the input value. A positive `num_oov_buckets` can not be specified with
+      `default_value`.
+    default_value: The integer ID value to return for out-of-vocabulary feature
+      values, defaults to `-1`. This can not be specified with a positive
+      `num_oov_buckets`.
+    dtype: The type of features. Only string and integer types are supported.
+
+  Returns:
+    A `_SequenceCategoricalColumn`.
+
+  Raises:
+    ValueError: `vocabulary_file` is missing or cannot be opened.
+    ValueError: `vocabulary_size` is missing or < 1.
+    ValueError: `num_oov_buckets` is a negative integer.
+    ValueError: `num_oov_buckets` and `default_value` are both specified.
+    ValueError: `dtype` is neither string nor integer.
+  """
+  return fc._SequenceCategoricalColumn(
+      fc.categorical_column_with_vocabulary_file(
+          key=key,
+          vocabulary_file=vocabulary_file,
+          vocabulary_size=vocabulary_size,
+          num_oov_buckets=num_oov_buckets,
+          default_value=default_value,
+          dtype=dtype))
+
+
+def sequence_categorical_column_with_vocabulary_list(
+    key, vocabulary_list, dtype=None, default_value=-1, num_oov_buckets=0):
+  """A sequence of categorical terms where ids use an in-memory list.
+
+  Pass this to `embedding_column` or `indicator_column` to convert sequence
+  categorical data into dense representation for input to sequence NN, such as
+  RNN.
+
+  Example:
+
+  ```python
+  colors = sequence_categorical_column_with_vocabulary_list(
+      key='colors', vocabulary_list=('R', 'G', 'B', 'Y'),
+      num_oov_buckets=2)
+  colors_embedding = embedding_column(colors, dimension=3)
+  columns = [colors_embedding]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature.
+    vocabulary_list: An ordered iterable defining the vocabulary. Each feature
+      is mapped to the index of its value (if present) in `vocabulary_list`.
+      Must be castable to `dtype`.
+    dtype: The type of features. Only string and integer types are supported.
+      If `None`, it will be inferred from `vocabulary_list`.
+    default_value: The integer ID value to return for out-of-vocabulary feature
+      values, defaults to `-1`. This can not be specified with a positive
+      `num_oov_buckets`.
+    num_oov_buckets: Non-negative integer, the number of out-of-vocabulary
+      buckets. All out-of-vocabulary inputs will be assigned IDs in the range
+      `[len(vocabulary_list), len(vocabulary_list)+num_oov_buckets)` based on a
+      hash of the input value. A positive `num_oov_buckets` can not be specified
+      with `default_value`.
+
+  Returns:
+    A `_SequenceCategoricalColumn`.
+
+  Raises:
+    ValueError: if `vocabulary_list` is empty, or contains duplicate keys.
+    ValueError: `num_oov_buckets` is a negative integer.
+    ValueError: `num_oov_buckets` and `default_value` are both specified.
+    ValueError: if `dtype` is not integer or string.
+  """
+  return fc._SequenceCategoricalColumn(
+      fc.categorical_column_with_vocabulary_list(
+          key=key,
+          vocabulary_list=vocabulary_list,
+          dtype=dtype,
+          default_value=default_value,
+          num_oov_buckets=num_oov_buckets))
+
+
+def sequence_numeric_column(
+    key,
+    shape=(1,),
+    default_value=0.,
+    dtype=dtypes.float32):
+  """Returns a feature column that represents sequences of numeric data.
+
+  Example:
+
+  ```python
+  temperature = sequence_numeric_column('temperature')
+  columns = [temperature]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature.
+    shape: The shape of the input data per sequence id. E.g. if `shape=(2,)`,
+      each example must contain `2 * sequence_length` values.
+    default_value: A single value compatible with `dtype` that is used for
+      padding the sparse data into a dense `Tensor`.
+    dtype: The type of values.
+
+  Returns:
+    A `_SequenceNumericColumn`.
+
+  Raises:
+    TypeError: if any dimension in shape is not an int.
+    ValueError: if any dimension in shape is not a positive integer.
+    ValueError: if `dtype` is not convertible to `tf.float32`.
+  """
+  shape = fc._check_shape(shape=shape, key=key)
+  if not (dtype.is_integer or dtype.is_floating):
+    raise ValueError('dtype must be convertible to float. '
+                     'dtype: {}, key: {}'.format(dtype, key))
+
+  return _SequenceNumericColumn(
+      key,
+      shape=shape,
+      default_value=default_value,
+      dtype=dtype)
+
+
+def _assert_all_equal_and_return(tensors, name=None):
+  """Asserts that all tensors are equal and returns the first one."""
+  with ops.name_scope(name, 'assert_all_equal', values=tensors):
+    if len(tensors) == 1:
+      return tensors[0]
+    assert_equal_ops = []
+    for t in tensors[1:]:
+      assert_equal_ops.append(check_ops.assert_equal(tensors[0], t))
+    with ops.control_dependencies(assert_equal_ops):
+      return array_ops.identity(tensors[0])
+
+
+class _SequenceNumericColumn(
+    fc._SequenceDenseColumn,
+    collections.namedtuple(
+        '_SequenceNumericColumn',
+        ['key', 'shape', 'default_value', 'dtype'])):
+  """Represents sequences of numeric data."""
+
+  @property
+  def name(self):
+    return self.key
+
+  @property
+  def _parse_example_spec(self):
+    return {self.key: parsing_ops.VarLenFeature(self.dtype)}
+
+  def _transform_feature(self, inputs):
+    return inputs.get(self.key)
+
+  @property
+  def _variable_shape(self):
+    return tensor_shape.TensorShape(self.shape)
+
+  def _get_sequence_dense_tensor(
+      self, inputs, weight_collections=None, trainable=None):
+    # Do nothing with weight_collections and trainable since no variables are
+    # created in this function.
+    del weight_collections
+    del trainable
+    sp_tensor = inputs.get(self)
+    dense_tensor = sparse_ops.sparse_tensor_to_dense(
+        sp_tensor, default_value=self.default_value)
+    # Reshape into [batch_size, T, variable_shape].
+    dense_shape = array_ops.concat(
+        [array_ops.shape(dense_tensor)[:1], [-1], self._variable_shape],
+        axis=0)
+    dense_tensor = array_ops.reshape(dense_tensor, shape=dense_shape)
+    sequence_length = fc._sequence_length_from_sparse_tensor(
+        sp_tensor, num_elements=self._variable_shape.num_elements())
+    return fc._SequenceDenseColumn.TensorSequenceLengthPair(
+        dense_tensor=dense_tensor, sequence_length=sequence_length)
+
+# pylint: enable=protected-access
diff --git a/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column_test.py b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..88f5d535162939e063eb1e7f43d495137c5adef4
--- /dev/null
+++ b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column_test.py
@@ -0,0 +1,816 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for sequence_feature_column."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import numpy as np
+
+from tensorflow.contrib.feature_column.python.feature_column import sequence_feature_column as sfc
+from tensorflow.python.feature_column import feature_column as fc
+from tensorflow.python.feature_column.feature_column import _LazyBuilder
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.platform import test
+from tensorflow.python.training import monitored_session
+
+
+class SequenceInputLayerTest(test.TestCase):
+
+  def test_embedding_column(self):
+    vocabulary_size = 3
+    sparse_input_a = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+    sparse_input_b = sparse_tensor.SparseTensorValue(
+        # example 0, ids [1]
+        # example 1, ids [2, 0]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2))
+
+    embedding_dimension_a = 2
+    embedding_values_a = (
+        (1., 2.),  # id 0
+        (3., 4.),  # id 1
+        (5., 6.)  # id 2
+    )
+    embedding_dimension_b = 3
+    embedding_values_b = (
+        (11., 12., 13.),  # id 0
+        (14., 15., 16.),  # id 1
+        (17., 18., 19.)  # id 2
+    )
+    def _get_initializer(embedding_dimension, embedding_values):
+      def _initializer(shape, dtype, partition_info):
+        self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+        self.assertEqual(dtypes.float32, dtype)
+        self.assertIsNone(partition_info)
+        return embedding_values
+      return _initializer
+
+    expected_input_layer = [
+        # example 0, ids_a [2], ids_b [1]
+        [[5., 6., 14., 15., 16.], [0., 0., 0., 0., 0.]],
+        # example 1, ids_a [0, 1], ids_b [2, 0]
+        [[1., 2., 17., 18., 19.], [3., 4., 11., 12., 13.]],
+    ]
+    expected_sequence_length = [1, 2]
+
+    categorical_column_a = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column_a = fc.embedding_column(
+        categorical_column_a, dimension=embedding_dimension_a,
+        initializer=_get_initializer(embedding_dimension_a, embedding_values_a))
+    categorical_column_b = sfc.sequence_categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size)
+    embedding_column_b = fc.embedding_column(
+        categorical_column_b, dimension=embedding_dimension_b,
+        initializer=_get_initializer(embedding_dimension_b, embedding_values_b))
+
+    input_layer, sequence_length = sfc.sequence_input_layer(
+        features={
+            'aaa': sparse_input_a,
+            'bbb': sparse_input_b,
+        },
+        # Test that columns are reordered alphabetically.
+        feature_columns=[embedding_column_b, embedding_column_a])
+
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('sequence_input_layer/aaa_embedding/embedding_weights:0',
+         'sequence_input_layer/bbb_embedding/embedding_weights:0'),
+        tuple([v.name for v in global_vars]))
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(embedding_values_a, global_vars[0].eval(session=sess))
+      self.assertAllEqual(embedding_values_b, global_vars[1].eval(session=sess))
+      self.assertAllEqual(expected_input_layer, input_layer.eval(session=sess))
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_embedding_column_with_non_sequence_categorical(self):
+    """Tests that error is raised for non-sequence categorical column."""
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column_a = fc.embedding_column(
+        categorical_column_a, dimension=2)
+
+    with self.assertRaisesRegexp(
+        ValueError,
+        r'In embedding_column: aaa_embedding\. categorical_column must be of '
+        r'type _SequenceCategoricalColumn to use sequence_input_layer\.'):
+      _, _ = sfc.sequence_input_layer(
+          features={'aaa': sparse_input},
+          feature_columns=[embedding_column_a])
+
+  def test_indicator_column(self):
+    vocabulary_size_a = 3
+    sparse_input_a = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+    vocabulary_size_b = 2
+    sparse_input_b = sparse_tensor.SparseTensorValue(
+        # example 0, ids [1]
+        # example 1, ids [1, 0]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, 1, 0),
+        dense_shape=(2, 2))
+
+    expected_input_layer = [
+        # example 0, ids_a [2], ids_b [1]
+        [[0., 0., 1., 0., 1.], [0., 0., 0., 0., 0.]],
+        # example 1, ids_a [0, 1], ids_b [1, 0]
+        [[1., 0., 0., 0., 1.], [0., 1., 0., 1., 0.]],
+    ]
+    expected_sequence_length = [1, 2]
+
+    categorical_column_a = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size_a)
+    indicator_column_a = fc.indicator_column(categorical_column_a)
+    categorical_column_b = sfc.sequence_categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size_b)
+    indicator_column_b = fc.indicator_column(categorical_column_b)
+    input_layer, sequence_length = sfc.sequence_input_layer(
+        features={
+            'aaa': sparse_input_a,
+            'bbb': sparse_input_b,
+        },
+        # Test that columns are reordered alphabetically.
+        feature_columns=[indicator_column_b, indicator_column_a])
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(expected_input_layer, input_layer.eval(session=sess))
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_indicator_column_with_non_sequence_categorical(self):
+    """Tests that error is raised for non-sequence categorical column."""
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    indicator_column_a = fc.indicator_column(categorical_column_a)
+
+    with self.assertRaisesRegexp(
+        ValueError,
+        r'In indicator_column: aaa_indicator\. categorical_column must be of '
+        r'type _SequenceCategoricalColumn to use sequence_input_layer\.'):
+      _, _ = sfc.sequence_input_layer(
+          features={'aaa': sparse_input},
+          feature_columns=[indicator_column_a])
+
+  def test_numeric_column(self):
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0.], [1.]]
+        # example 1, values [[10.]]
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+    expected_input_layer = [
+        [[0.], [1.]],
+        [[10.], [0.]],
+    ]
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    input_layer, sequence_length = sfc.sequence_input_layer(
+        features={'aaa': sparse_input},
+        feature_columns=[numeric_column])
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(expected_input_layer, input_layer.eval(session=sess))
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_numeric_column_multi_dim(self):
+    """Tests sequence_input_layer for multi-dimensional numeric_column."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[[0., 1.], [2., 3.]], [[4., 5.], [6., 7.]]]
+        # example 1, values [[[10., 11.], [12., 13.]]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7),
+                 (1, 0), (1, 1), (1, 2), (1, 3)),
+        values=(0., 1., 2., 3., 4., 5., 6., 7., 10., 11., 12., 13.),
+        dense_shape=(2, 8))
+    # The output of numeric_column._get_dense_tensor should be flattened.
+    expected_input_layer = [
+        [[0., 1., 2., 3.], [4., 5., 6., 7.]],
+        [[10., 11., 12., 13.], [0., 0., 0., 0.]],
+    ]
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(2, 2))
+
+    input_layer, sequence_length = sfc.sequence_input_layer(
+        features={'aaa': sparse_input},
+        feature_columns=[numeric_column])
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(expected_input_layer, input_layer.eval(session=sess))
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_sequence_length_not_equal(self):
+    """Tests that an error is raised when sequence lengths are not equal."""
+    # Input a with sequence_length = [2, 1]
+    sparse_input_a = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+    # Input b with sequence_length = [1, 1]
+    sparse_input_b = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0)),
+        values=(1., 10.),
+        dense_shape=(2, 2))
+    numeric_column_a = sfc.sequence_numeric_column('aaa')
+    numeric_column_b = sfc.sequence_numeric_column('bbb')
+
+    _, sequence_length = sfc.sequence_input_layer(
+        features={
+            'aaa': sparse_input_a,
+            'bbb': sparse_input_b,
+        },
+        feature_columns=[numeric_column_a, numeric_column_b])
+
+    with monitored_session.MonitoredSession() as sess:
+      with self.assertRaisesRegexp(
+          errors.InvalidArgumentError,
+          r'\[Condition x == y did not hold element-wise:\] '
+          r'\[x \(sequence_input_layer/aaa/sequence_length:0\) = \] \[2 1\] '
+          r'\[y \(sequence_input_layer/bbb/sequence_length:0\) = \] \[1 1\]'):
+        sess.run(sequence_length)
+
+
+class InputLayerTest(test.TestCase):
+  """Tests input_layer with sequence feature columns."""
+
+  def test_embedding_column(self):
+    """Tests that error is raised for sequence embedding column."""
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+
+    categorical_column_a = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column_a = fc.embedding_column(
+        categorical_column_a, dimension=2)
+
+    with self.assertRaisesRegexp(
+        ValueError,
+        r'In embedding_column: aaa_embedding\. categorical_column must not be '
+        r'of type _SequenceCategoricalColumn\.'):
+      _ = fc.input_layer(
+          features={'aaa': sparse_input},
+          feature_columns=[embedding_column_a])
+
+  def test_indicator_column(self):
+    """Tests that error is raised for sequence indicator column."""
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+
+    categorical_column_a = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    indicator_column_a = fc.indicator_column(categorical_column_a)
+
+    with self.assertRaisesRegexp(
+        ValueError,
+        r'In indicator_column: aaa_indicator\. categorical_column must not be '
+        r'of type _SequenceCategoricalColumn\.'):
+      _ = fc.input_layer(
+          features={'aaa': sparse_input},
+          feature_columns=[indicator_column_a])
+
+
+def _assert_sparse_tensor_value(test_case, expected, actual):
+  _assert_sparse_tensor_indices_shape(test_case, expected, actual)
+
+  test_case.assertEqual(
+      np.array(expected.values).dtype, np.array(actual.values).dtype)
+  test_case.assertAllEqual(expected.values, actual.values)
+
+
+def _assert_sparse_tensor_indices_shape(test_case, expected, actual):
+  test_case.assertEqual(np.int64, np.array(actual.indices).dtype)
+  test_case.assertAllEqual(expected.indices, actual.indices)
+
+  test_case.assertEqual(np.int64, np.array(actual.dense_shape).dtype)
+  test_case.assertAllEqual(expected.dense_shape, actual.dense_shape)
+
+
+class SequenceCategoricalColumnWithIdentityTest(test.TestCase):
+
+  def test_get_sparse_tensors(self):
+    column = sfc.sequence_categorical_column_with_identity(
+        'aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2))
+    expected_sparse_ids = sparse_tensor.SparseTensorValue(
+        indices=((0, 0, 0), (1, 0, 0), (1, 1, 0)),
+        values=np.array((1, 2, 0), dtype=np.int64),
+        dense_shape=(2, 2, 1))
+
+    id_weight_pair = column._get_sparse_tensors(_LazyBuilder({'aaa': inputs}))
+
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with monitored_session.MonitoredSession() as sess:
+      _assert_sparse_tensor_value(
+          self,
+          expected_sparse_ids,
+          id_weight_pair.id_tensor.eval(session=sess))
+
+  def test_get_sparse_tensors_inputs3d(self):
+    """Tests that _get_sparse_tensors raises an error for a 3D input."""
+    column = sfc.sequence_categorical_column_with_identity(
+        'aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0, 0), (1, 0, 0), (1, 1, 0)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2, 1))
+
+    with self.assertRaisesRegexp(
+        errors.InvalidArgumentError,
+        r'Column aaa expected ID tensor of rank 2\.\s*'
+        r'id_tensor shape:\s*\[2 2 1\]'):
+      id_weight_pair = column._get_sparse_tensors(
+          _LazyBuilder({'aaa': inputs}))
+      with monitored_session.MonitoredSession() as sess:
+        id_weight_pair.id_tensor.eval(session=sess)
+
+
+class SequenceCategoricalColumnWithHashBucketTest(test.TestCase):
+
+  def test_get_sparse_tensors(self):
+    column = sfc.sequence_categorical_column_with_hash_bucket(
+        'aaa', hash_bucket_size=10)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('omar', 'stringer', 'marlo'),
+        dense_shape=(2, 2))
+
+    expected_sparse_ids = sparse_tensor.SparseTensorValue(
+        indices=((0, 0, 0), (1, 0, 0), (1, 1, 0)),
+        # Ignored to avoid hash dependence in test.
+        values=np.array((0, 0, 0), dtype=np.int64),
+        dense_shape=(2, 2, 1))
+
+    id_weight_pair = column._get_sparse_tensors(_LazyBuilder({'aaa': inputs}))
+
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with monitored_session.MonitoredSession() as sess:
+      _assert_sparse_tensor_indices_shape(
+          self,
+          expected_sparse_ids,
+          id_weight_pair.id_tensor.eval(session=sess))
+
+
+class SequenceCategoricalColumnWithVocabularyFileTest(test.TestCase):
+
+  def _write_vocab(self, vocab_strings, file_name):
+    vocab_file = os.path.join(self.get_temp_dir(), file_name)
+    with open(vocab_file, 'w') as f:
+      f.write('\n'.join(vocab_strings))
+    return vocab_file
+
+  def setUp(self):
+    super(SequenceCategoricalColumnWithVocabularyFileTest, self).setUp()
+
+    vocab_strings = ['omar', 'stringer', 'marlo']
+    self._wire_vocabulary_file_name = self._write_vocab(vocab_strings,
+                                                        'wire_vocabulary.txt')
+    self._wire_vocabulary_size = 3
+
+  def test_get_sparse_tensors(self):
+    column = sfc.sequence_categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    expected_sparse_ids = sparse_tensor.SparseTensorValue(
+        indices=((0, 0, 0), (1, 0, 0), (1, 1, 0)),
+        values=np.array((2, -1, 0), dtype=np.int64),
+        dense_shape=(2, 2, 1))
+
+    id_weight_pair = column._get_sparse_tensors(_LazyBuilder({'aaa': inputs}))
+
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with monitored_session.MonitoredSession() as sess:
+      _assert_sparse_tensor_value(
+          self,
+          expected_sparse_ids,
+          id_weight_pair.id_tensor.eval(session=sess))
+
+
+class SequenceCategoricalColumnWithVocabularyListTest(test.TestCase):
+
+  def test_get_sparse_tensors(self):
+    column = sfc.sequence_categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=('omar', 'stringer', 'marlo'))
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    expected_sparse_ids = sparse_tensor.SparseTensorValue(
+        indices=((0, 0, 0), (1, 0, 0), (1, 1, 0)),
+        values=np.array((2, -1, 0), dtype=np.int64),
+        dense_shape=(2, 2, 1))
+
+    id_weight_pair = column._get_sparse_tensors(_LazyBuilder({'aaa': inputs}))
+
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with monitored_session.MonitoredSession() as sess:
+      _assert_sparse_tensor_value(
+          self,
+          expected_sparse_ids,
+          id_weight_pair.id_tensor.eval(session=sess))
+
+
+class SequenceEmbeddingColumnTest(test.TestCase):
+
+  def test_get_sequence_dense_tensor(self):
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 1), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(4, 2))
+
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    expected_lookups = [
+        # example 0, ids [2]
+        [[7., 11.], [0., 0.]],
+        # example 1, ids [0, 1]
+        [[1., 2.], [3., 5.]],
+        # example 2, ids []
+        [[0., 0.], [0., 0.]],
+        # example 3, ids [1]
+        [[3., 5.], [0., 0.]],
+    ]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc.embedding_column(
+        categorical_column, dimension=embedding_dimension,
+        initializer=_initializer)
+
+    embedding_lookup, _ = embedding_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('embedding_weights:0',), tuple([v.name for v in global_vars]))
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(embedding_values, global_vars[0].eval(session=sess))
+      self.assertAllEqual(expected_lookups, embedding_lookup.eval(session=sess))
+
+  def test_sequence_length(self):
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+    expected_sequence_length = [1, 2]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc.embedding_column(
+        categorical_column, dimension=2)
+
+    _, sequence_length = embedding_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      sequence_length = sess.run(sequence_length)
+      self.assertAllEqual(expected_sequence_length, sequence_length)
+      self.assertEqual(np.int64, sequence_length.dtype)
+
+  def test_sequence_length_with_empty_rows(self):
+    """Tests _sequence_length when some examples do not have ids."""
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids []
+        # example 1, ids [2]
+        # example 2, ids [0, 1]
+        # example 3, ids []
+        # example 4, ids [1]
+        # example 5, ids []
+        indices=((1, 0), (2, 0), (2, 1), (4, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(6, 2))
+    expected_sequence_length = [0, 1, 2, 0, 1, 0]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc.embedding_column(
+        categorical_column, dimension=2)
+
+    _, sequence_length = embedding_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+class SequenceIndicatorColumnTest(test.TestCase):
+
+  def test_get_sequence_dense_tensor(self):
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 1), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(4, 2))
+
+    expected_lookups = [
+        # example 0, ids [2]
+        [[0., 0., 1.], [0., 0., 0.]],
+        # example 1, ids [0, 1]
+        [[1., 0., 0.], [0., 1., 0.]],
+        # example 2, ids []
+        [[0., 0., 0.], [0., 0., 0.]],
+        # example 3, ids [1]
+        [[0., 1., 0.], [0., 0., 0.]],
+    ]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    indicator_column = fc.indicator_column(categorical_column)
+
+    indicator_tensor, _ = indicator_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(expected_lookups, indicator_tensor.eval(session=sess))
+
+  def test_sequence_length(self):
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+    expected_sequence_length = [1, 2]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    indicator_column = fc.indicator_column(categorical_column)
+
+    _, sequence_length = indicator_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      sequence_length = sess.run(sequence_length)
+      self.assertAllEqual(expected_sequence_length, sequence_length)
+      self.assertEqual(np.int64, sequence_length.dtype)
+
+  def test_sequence_length_with_empty_rows(self):
+    """Tests _sequence_length when some examples do not have ids."""
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids []
+        # example 1, ids [2]
+        # example 2, ids [0, 1]
+        # example 3, ids []
+        # example 4, ids [1]
+        # example 5, ids []
+        indices=((1, 0), (2, 0), (2, 1), (4, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(6, 2))
+    expected_sequence_length = [0, 1, 2, 0, 1, 0]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    indicator_column = fc.indicator_column(categorical_column)
+
+    _, sequence_length = indicator_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+class SequenceNumericColumnTest(test.TestCase):
+
+  def test_defaults(self):
+    a = sfc.sequence_numeric_column('aaa')
+    self.assertEqual('aaa', a.key)
+    self.assertEqual('aaa', a.name)
+    self.assertEqual('aaa', a._var_scope_name)
+    self.assertEqual((1,), a.shape)
+    self.assertEqual(0., a.default_value)
+    self.assertEqual(dtypes.float32, a.dtype)
+
+  def test_shape_saved_as_tuple(self):
+    a = sfc.sequence_numeric_column('aaa', shape=[1, 2])
+    self.assertEqual((1, 2), a.shape)
+
+  def test_shape_must_be_positive_integer(self):
+    with self.assertRaisesRegexp(TypeError, 'shape dimensions must be integer'):
+      sfc.sequence_numeric_column('aaa', shape=[1.0])
+
+    with self.assertRaisesRegexp(
+        ValueError, 'shape dimensions must be greater than 0'):
+      sfc.sequence_numeric_column('aaa', shape=[0])
+
+  def test_dtype_is_convertible_to_float(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'dtype must be convertible to float'):
+      sfc.sequence_numeric_column('aaa', dtype=dtypes.string)
+
+  def test_get_sequence_dense_tensor(self):
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0.], [1.]]
+        # example 1, values [[10.]]
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+    expected_dense_tensor = [
+        [[0.], [1.]],
+        [[10.], [0.]],
+    ]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    dense_tensor, _ = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_dense_tensor, dense_tensor.eval(session=sess))
+
+  def test_get_sequence_dense_tensor_with_shape(self):
+    """Tests get_sequence_dense_tensor with shape != (1,)."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0., 1., 2.], [3., 4., 5.]]
+        # example 1, values [[10., 11., 12.]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
+                 (1, 0), (1, 1), (1, 2)),
+        values=(0., 1., 2., 3., 4., 5., 10., 11., 12.),
+        dense_shape=(2, 6))
+    expected_dense_tensor = [
+        [[0., 1., 2.], [3., 4., 5.]],
+        [[10., 11., 12.], [0., 0., 0.]],
+    ]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(3,))
+
+    dense_tensor, _ = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_dense_tensor, dense_tensor.eval(session=sess))
+
+  def test_get_dense_tensor_multi_dim(self):
+    """Tests get_sequence_dense_tensor for multi-dim numeric_column."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[[0., 1.], [2., 3.]], [[4., 5.], [6., 7.]]]
+        # example 1, values [[[10., 11.], [12., 13.]]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7),
+                 (1, 0), (1, 1), (1, 2), (1, 3)),
+        values=(0., 1., 2., 3., 4., 5., 6., 7., 10., 11., 12., 13.),
+        dense_shape=(2, 8))
+    expected_dense_tensor = [
+        [[[0., 1.], [2., 3.]], [[4., 5.], [6., 7.]]],
+        [[[10., 11.], [12., 13.]], [[0., 0.], [0., 0.]]],
+    ]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(2, 2))
+
+    dense_tensor, _ = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_dense_tensor, dense_tensor.eval(session=sess))
+
+  def test_sequence_length(self):
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0., 1., 2.], [3., 4., 5.]]
+        # example 1, values [[10., 11., 12.]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
+                 (1, 0), (1, 1), (1, 2)),
+        values=(0., 1., 2., 3., 4., 5., 10., 11., 12.),
+        dense_shape=(2, 6))
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(3,))
+
+    _, sequence_length = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      sequence_length = sess.run(sequence_length)
+      self.assertAllEqual(expected_sequence_length, sequence_length)
+      self.assertEqual(np.int64, sequence_length.dtype)
+
+  def test_sequence_length_with_shape(self):
+    """Tests _sequence_length with the default shape (1,)."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0.], [1.]]
+        # example 1, values [[10.]]
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    _, sequence_length = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_sequence_length_with_empty_rows(self):
+    """Tests _sequence_length when some examples do not have values."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values []
+        # example 1, values [[0.], [1.]]
+        # example 2, values [[2.]]
+        # example 3, values []
+        # example 4, values [[3.]]
+        # example 5, values []
+        indices=((1, 0), (1, 1), (2, 0), (4, 0)),
+        values=(0., 1., 2., 3.),
+        dense_shape=(6, 2))
+    expected_sequence_length = [0, 2, 1, 0, 1, 0]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    _, sequence_length = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/ffmpeg/default/ffmpeg_lib.cc b/tensorflow/contrib/ffmpeg/default/ffmpeg_lib.cc
index e61221a6b0d34373279a379f356c99c379488182..35341406a08dc681c861aea30fcff784e3b963ef 100644
--- a/tensorflow/contrib/ffmpeg/default/ffmpeg_lib.cc
+++ b/tensorflow/contrib/ffmpeg/default/ffmpeg_lib.cc
@@ -256,6 +256,9 @@ Status ReadInfoFile(const string& filename, uint32* width, uint32* height,
         if (p != std::string::npos) {
           string rgb24 = line.substr(p + 9, line.find(" ", p + 9));
           rgb24 = rgb24.substr(0, rgb24.find(","));
+          // Strip anything after " ", in case the format is
+          // `640x360 [SAR 1:1 DAR 16:9]`
+          rgb24 = rgb24.substr(0, rgb24.find(" "));
           string rgb24_width = rgb24.substr(0, rgb24.find("x"));
           string rgb24_height = rgb24.substr(rgb24_width.length() + 1);
           if (strings::safe_strtou32(rgb24_width, &width_value) &&
@@ -270,8 +273,10 @@ Status ReadInfoFile(const string& filename, uint32* width, uint32* height,
       // We only look for the first stream mapping to have the number of the
       // frames.
       // Once processed we will not further process stream mapping section.
-      if (line.find("frame=  ") == 0) {
-        string number = line.substr(8, line.find(" ", 8));
+      if (line.find("frame=") == 0) {
+        // The format might be `frame=  166 ` or `frame=12488 `
+        string number = line.substr(6);
+        number = number.substr(number.find_first_not_of(" "));
         number = number.substr(0, number.find(" "));
         if (strings::safe_strtou32(number, &frames_value)) {
           in_mapping = false;
diff --git a/tensorflow/contrib/framework/BUILD b/tensorflow/contrib/framework/BUILD
index dbdb5cfaaca1a687fefb81cee200295d5cbb7fd5..ac043fda0638e61f422e769ab3047a53a1b377bd 100644
--- a/tensorflow/contrib/framework/BUILD
+++ b/tensorflow/contrib/framework/BUILD
@@ -28,7 +28,6 @@ tf_custom_op_py_library(
         "python/framework/graph_util.py",
         "python/framework/tensor_util.py",
         "python/ops/__init__.py",
-        "python/ops/accumulate_n_v2.py",
         "python/ops/arg_scope.py",
         "python/ops/audio_ops.py",
         "python/ops/checkpoint_ops.py",
@@ -63,7 +62,9 @@ tf_custom_op_py_library(
         "//tensorflow/python:math_ops",
         "//tensorflow/python:platform",
         "//tensorflow/python:pywrap_tensorflow",
+        "//tensorflow/python:resource_variable_ops",
         "//tensorflow/python:script_ops",
+        "//tensorflow/python:smart_cond",
         "//tensorflow/python:sparse_tensor",
         "//tensorflow/python:state_ops",
         "//tensorflow/python:state_ops_gen",
@@ -161,23 +162,6 @@ py_test(
     ],
 )
 
-py_test(
-    name = "accumulate_n_v2_test",
-    size = "small",
-    srcs = ["python/ops/accumulate_n_v2_test.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        ":framework_py",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:platform_test",
-        "//tensorflow/python:variables",
-        "//third_party/py/numpy",
-    ],
-)
-
 cuda_py_test(
     name = "critical_section_test",
     size = "medium",
@@ -196,26 +180,6 @@ cuda_py_test(
     ],
 )
 
-py_test(
-    name = "accumulate_n_v2_eager_test",
-    size = "small",
-    srcs = ["python/ops/accumulate_n_v2_eager_test.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        ":framework_py",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:resource_variable_ops",
-        "//tensorflow/python/eager:backprop",
-        "//tensorflow/python/eager:context",
-        "//tensorflow/python/eager:tape",
-        "//third_party/py/numpy",
-    ],
-)
-
 py_test(
     name = "ops_test",
     size = "small",
diff --git a/tensorflow/contrib/framework/__init__.py b/tensorflow/contrib/framework/__init__.py
index deeb5bec79341f3e0468a127aeead69f960114d8..cbb68bd3eb257f9472515e5c29ce4f02057be321 100644
--- a/tensorflow/contrib/framework/__init__.py
+++ b/tensorflow/contrib/framework/__init__.py
@@ -71,6 +71,7 @@ See the @{$python/contrib.framework} guide.
 @@model_variable
 @@variable
 @@VariableDeviceChooser
+@@convolutional_delta_orthogonal
 @@zero_initializer
 
 @@load_checkpoint
@@ -82,11 +83,16 @@ See the @{$python/contrib.framework} guide.
 @@load_linear_multiclass_bias_initializer
 @@load_variable_slot_initializer
 
+@@argsort
 @@py_func
 @@sort
 
 @@get_placeholders
 
+@@smart_cond
+@@smart_constant_value
+@@smart_case
+
 @@CriticalSection
 
 @@BoundedTensorSpec
@@ -104,10 +110,12 @@ from tensorflow.contrib.framework.python.ops import *
 
 from tensorflow.python.framework.ops import prepend_name_scope
 from tensorflow.python.framework.ops import strip_name_scope
+from tensorflow.python.framework.smart_cond import smart_case
+from tensorflow.python.framework.smart_cond import smart_cond
+from tensorflow.python.framework.smart_cond import smart_constant_value
 from tensorflow.python.framework.tensor_spec import BoundedTensorSpec
 from tensorflow.python.framework.tensor_spec import TensorSpec
-from tensorflow.python.ops.control_flow_ops import smart_cond
-from tensorflow.python.ops.control_flow_ops import smart_constant_value
+from tensorflow.python.ops.init_ops import convolutional_delta_orthogonal
 from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = ['nest']
diff --git a/tensorflow/contrib/framework/python/framework/experimental_test.py b/tensorflow/contrib/framework/python/framework/experimental_test.py
index 8e54e09e04ee3c0ddbd4fa84cc0912cb70c93e62..cfdc7df7d8fd4c1406bf447a79038ac33b11e047 100644
--- a/tensorflow/contrib/framework/python/framework/experimental_test.py
+++ b/tensorflow/contrib/framework/python/framework/experimental_test.py
@@ -49,7 +49,6 @@ class ExperimentalTest(test.TestCase):
                      "\nTHIS FUNCTION IS EXPERIMENTAL. It may change or "
                      "be removed at any time, and without warning."
                      "\n"
-                     "\n"
                      "\nArgs:"
                      "\n  arg0: Arg 0."
                      "\n  arg1: Arg 1."
diff --git a/tensorflow/contrib/framework/python/framework/graph_util.py b/tensorflow/contrib/framework/python/framework/graph_util.py
index 49eec3a3f1a0f357ea3adfade51e71cb0f89942d..2703224b1bf62831b6088558d4f93950fe938c10 100644
--- a/tensorflow/contrib/framework/python/framework/graph_util.py
+++ b/tensorflow/contrib/framework/python/framework/graph_util.py
@@ -85,14 +85,19 @@ def fuse_op(graph_def, input_nodes, output_nodes, output_dtypes,
       if n not in reachable_by_input and n not in output_nodes_set:
         # n is between input and output, i.e., part of the fused op
         next_to_visit = [n]
+        visited = set()
         while next_to_visit:
           cur_node = next_to_visit[0]
+          visited.add(cur_node)
           del next_to_visit[0]
           if cur_node in reachable_by_input and cur_node not in input_nodes_set:
             raise TypeError("Node %s uses input %s not in input_nodes." %
                             (n, cur_node))
           if cur_node not in input_nodes_set:
-            next_to_visit += name_to_input_name[cur_node]
+            next_to_visit += [
+                input_node for input_node in name_to_input_name[cur_node]
+                if input_node not in visited
+            ]
     elif n not in reachable_by_input:
       nodes_post_output.append(n)
 
diff --git a/tensorflow/contrib/framework/python/framework/graph_util_test.py b/tensorflow/contrib/framework/python/framework/graph_util_test.py
index b8a6d109e19211d271c2b15bac66ddacd38fe395..812c5fbd8cb759aef6eb1aad532c03794b2ceaf4 100644
--- a/tensorflow/contrib/framework/python/framework/graph_util_test.py
+++ b/tensorflow/contrib/framework/python/framework/graph_util_test.py
@@ -42,7 +42,8 @@ class GraphUtilTest(test.TestCase):
     graph_def = graph_pb2.GraphDef()
     node_a = GetNewNode('A', 'Placeholder', [])
     node_b = GetNewNode('B', 'Op1', ['A'])
-    node_c = GetNewNode('C', 'Op1', ['B'])
+    # A loop in the part that will be fused.
+    node_c = GetNewNode('C', 'Op1', ['B', 'C'])
     node_d = GetNewNode('D', 'Op1', ['C'])
     node_e = GetNewNode('E', 'Op1', ['D'])
     graph_def.node.extend([node_a, node_b, node_c, node_d, node_e])
diff --git a/tensorflow/contrib/framework/python/framework/tensor_util_test.py b/tensorflow/contrib/framework/python/framework/tensor_util_test.py
index 8cdb340f2ddd9b3a7f55c1937ef045f4627e99be..a2834b648933772cab53002462c3edbe9a553e94 100644
--- a/tensorflow/contrib/framework/python/framework/tensor_util_test.py
+++ b/tensorflow/contrib/framework/python/framework/tensor_util_test.py
@@ -209,6 +209,7 @@ class WithShapeTest(test.TestCase):
         self.assertRaisesRegexp(errors_impl.OpError, "Wrong shape",
                                 tensor_2x2.eval, {tensor_no_shape: [42.0]})
 
+  @test_util.enable_c_shapes
   def test_with_shape_partial(self):
     with self.test_session():
       tensor_partial_shape = array_ops.placeholder(dtypes.float32)
diff --git a/tensorflow/contrib/framework/python/ops/accumulate_n_v2.py b/tensorflow/contrib/framework/python/ops/accumulate_n_v2.py
deleted file mode 100644
index 476528b0dd3df05239d5dc402b466e06dd789985..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/framework/python/ops/accumulate_n_v2.py
+++ /dev/null
@@ -1,111 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Ops that will eventually be folded into tensorflow/python/ops/math_ops.py
-"""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-
-from tensorflow.python.eager import context
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_shape
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import gen_math_ops
-from tensorflow.python.ops import math_ops
-
-
-
-def accumulate_n_v2(inputs, shape=None, tensor_dtype=None, name=None):
-  """Returns the element-wise sum of a list of tensors.
-
-  Optionally, pass `shape` and `tensor_dtype` for shape and type checking,
-  otherwise, these are inferred.
-
-  `tf.accumulate_n_v2` performs the same operation as `tf.add_n`, but does not
-  wait for all of its inputs to be ready before beginning to sum. This can
-  save memory if inputs are ready at different times, since minimum temporary
-  storage is proportional to the output size rather than the inputs size.
-
-  Unlike the original `accumulate_n`, `accumulate_n_v2` is differentiable.
-
-  For example:
-
-  ```python
-  a = tf.constant([[1, 2], [3, 4]])
-  b = tf.constant([[5, 0], [0, 6]])
-  tf.accumulate_n_v2([a, b, a])  # [[7, 4], [6, 14]]
-
-  # Explicitly pass shape and type
-  tf.accumulate_n_v2([a, b, a], shape=[2, 2], tensor_dtype=tf.int32)
-                                                                   # [[7,  4],
-                                                                   #  [6, 14]]
-  ```
-
-  Args:
-    inputs: A list of `Tensor` objects, each with same shape and type.
-    shape: Shape of elements of `inputs`.
-    tensor_dtype: The type of `inputs`.
-    name: A name for the operation (optional).
-
-  Returns:
-    A `Tensor` of same shape and type as the elements of `inputs`.
-
-  Raises:
-    ValueError: If `inputs` don't all have same shape and dtype or the shape
-    cannot be inferred.
-  """
-  _INPUTS_ERR_MSG = ValueError("inputs must be a list of at least one Tensor"
-                               "with the same dtype and shape")
-  if not inputs or not isinstance(inputs, (list, tuple)):
-    raise _INPUTS_ERR_MSG
-  inputs = ops.convert_n_to_tensor_or_indexed_slices(inputs)
-  if not all(isinstance(x, ops.Tensor) for x in inputs):
-    raise _INPUTS_ERR_MSG
-  if not all(x.dtype == inputs[0].dtype for x in inputs):
-    raise _INPUTS_ERR_MSG
-  if shape is not None:
-    shape = tensor_shape.as_shape(shape)
-  else:
-    shape = tensor_shape.unknown_shape()
-  for input_tensor in inputs:
-    if isinstance(input_tensor, ops.Tensor):
-      shape = shape.merge_with(input_tensor.get_shape())
-
-  # tensor_dtype is for safety only; operator's output type computed in C++
-  if tensor_dtype is not None and tensor_dtype != inputs[0].dtype:
-    raise TypeError("tensor_dtype is {}, but input is of type {}"
-                    .format(tensor_dtype, inputs[0].dtype))
-
-  if len(inputs) == 1 and name is None:
-    return inputs[0]
-  elif len(inputs) == 1 and name is not None:
-    return array_ops.identity(inputs[0], name=name)
-  elif context.in_eager_mode():
-    # TemporaryVariable not currently supported in eager mode; fall back
-    # onto AddN for now.
-    # TODO(frreiss) remove this once the lifetime of eager variables gets
-    # addressed
-    return math_ops.add_n(inputs, name=name)
-  else:
-    return gen_math_ops._accumulate_nv2(inputs, name=name, shape=shape)
-
-# The following code should eventually be merged into
-# tensorflow/python/ops/math_grad.py
-@ops.RegisterGradient("AccumulateNV2")
-def _AddNGrad(op, grad):
-  """Same as gradient for AddN. Copies the gradient to all inputs."""
-  # Not broadcasting.
-  return [grad] * len(op.inputs)
diff --git a/tensorflow/contrib/framework/python/ops/arg_scope.py b/tensorflow/contrib/framework/python/ops/arg_scope.py
index 409657fe1da0e5540cd2ad6070d86737c039e91f..3cad1fee1984042e3a9ab91a0af70cbaca25cece 100644
--- a/tensorflow/contrib/framework/python/ops/arg_scope.py
+++ b/tensorflow/contrib/framework/python/ops/arg_scope.py
@@ -142,7 +142,7 @@ def arg_scope(list_ops_or_scope, **kwargs):
   else:
     # Assumes that list_ops_or_scope is a list/tuple of ops with kwargs.
     if not isinstance(list_ops_or_scope, (list, tuple)):
-      raise TypeError('list_ops_or_scope must either be a list/tuple or reused'
+      raise TypeError('list_ops_or_scope must either be a list/tuple or reused '
                       'scope (i.e. dict)')
     try:
       current_scope = current_arg_scope().copy()
diff --git a/tensorflow/contrib/framework/python/ops/critical_section_ops.py b/tensorflow/contrib/framework/python/ops/critical_section_ops.py
index 3c5c55ed656432a33f19462130a9e58c2ab14efb..bd764ed57a6da0a4d356235108e998a80ac34362 100644
--- a/tensorflow/contrib/framework/python/ops/critical_section_ops.py
+++ b/tensorflow/contrib/framework/python/ops/critical_section_ops.py
@@ -24,10 +24,8 @@ import collections
 # from tensorflow.core.protobuf import critical_section_pb2
 
 from tensorflow.python.eager import context
-from tensorflow.python.eager import function
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gen_resource_variable_ops
@@ -48,6 +46,26 @@ class _ExecutionSignature(
   pass
 
 
+def _identity(x):
+  """Identity op that recognizes `TensorArray`, `Operation`, and `Tensor`."""
+  if isinstance(x, tensor_array_ops.TensorArray):
+    return x.identity()
+  elif isinstance(x, ops.Operation):
+    return control_flow_ops.group(x)
+  elif context.executing_eagerly() and x is None:
+    return None
+  else:
+    return array_ops.identity(x)
+
+
+def _get_colocation(op):
+  """Get colocation symbol from op, if any."""
+  try:
+    return op.get_attr("_class")
+  except ValueError:
+    return None
+
+
 class CriticalSection(object):
   """Critical section.
 
@@ -143,7 +161,7 @@ class CriticalSection(object):
   def _init_from_args(self, name, shared_name):  # pylint: disable=invalid-name
     """Initialize the CriticalSection from constructor arguments."""
     with ops.name_scope(name, "CriticalSection", []) as name:
-      with ops.control_dependencies(None):
+      with ops.init_scope():
         # pylint: disable=protected-access
         container = ops.get_default_graph()._container
         # pylint: enable=protected-access
@@ -154,7 +172,7 @@ class CriticalSection(object):
         self._handle = gen_resource_variable_ops.mutex_v2(
             shared_name=shared_name, container=container, name=name)
 
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       ops.add_to_collections(CRITICAL_SECTIONS, self)
 
   @property
@@ -180,8 +198,8 @@ class CriticalSection(object):
       The tensors returned from `fn(*args, **kwargs)`.
 
     Raises:
-      ValueError: If `fn` attempts to use this `CriticalSection` in any nested
-        way.
+      ValueError: If `fn` attempts to lock this `CriticalSection` in any nested
+        or lazy way that may cause a deadlock.
       ValueError: If `exclusive_resource_access` is not provided (is `True`) and
         another `CriticalSection` has an execution requesting the same
         resources as in `*args`, `**kwargs`, and any additionaly captured
@@ -193,67 +211,52 @@ class CriticalSection(object):
     exclusive_resource_access = kwargs.pop("exclusive_resource_access", True)
 
     with ops.name_scope(name, "critical_section_execute", []):
+
+      # Ensure that mutex locking only happens *after* all args and
+      # kwargs have been executed.  This avoids certain types of deadlocks.
       lock = gen_resource_variable_ops.mutex_lock(self._handle)
 
-      with ops.control_dependencies([lock]):
-        c_known_ops = set()
-        c_captured_tensors = set()
+      if not context.executing_eagerly():
+        # NOTE(ebrevdo): This is to ensure we don't pick up spurious
+        # Operations created by other threads.
+        with ops.get_default_graph()._lock:  # pylint: disable=protected-access
+          existing_ops = ops.get_default_graph().get_operations()
+          with ops.control_dependencies([lock]):
+            r = fn(*args, **kwargs)
+          # TODO(ebrevdo): If creating critical sections in a python loop, this
+          # makes graph creation time quadratic.  Revisit if this
+          # becomes a problem.
+          created_ops = (set(ops.get_default_graph().get_operations())
+                         .difference(existing_ops))
+      else:
+        with ops.control_dependencies([lock]):
+          r = fn(*args, **kwargs)
+
+      if not context.executing_eagerly():
+        self._add_control_dependencies_to_lock(created_ops, lock.op)
 
-        def add_op_internal(op):
-          c_known_ops.add(op)
-          for i in op.inputs:
-            if i.op not in c_known_ops:
-              c_captured_tensors.add(i)
+        # captured_resources is a set of resources that are directly
+        # accessed only by ops created during fn(), not by any
+        # ancestors of those ops in the graph.
+        captured_resources = set([
+            input_ for op in created_ops
+            for input_ in op.inputs
+            if input_.dtype == dtypes.resource
+        ])
 
-        c = function.HelperContext(add_op_internal)
-        with c:
-          r = fn(*args, **kwargs)
+        # NOTE(ebrevdo): The only time self._is_self_handle() is True
+        # in this call is if one of the ops just created within this
+        # execute() itself attempts to access the CriticalSection.
+        # This would cause a deadlock.
+        if any(self._is_self_handle(x) for x in captured_resources):
+          raise ValueError("The function fn attempts to directly access the "
+                           "CriticalSection in which it would be running.  "
+                           "This is illegal and would cause deadlocks.")
 
-        resource_inputs = set([
-            x for x in
-            list(nest.flatten(args)) + nest.flatten(kwargs.values()) +
-            list(c_captured_tensors)
-            if tensor_util.is_tensor(x) and x.dtype == dtypes.resource])
-
-      if self._handle in resource_inputs:
-        raise ValueError("The function fn attempts to access the "
-                         "CriticalSection in which it would be running.  "
-                         "This is illegal and would cause deadlocks.  "
-                         "CriticalSection: %s." % self._handle)
-
-      if context.in_graph_mode():
-        # Collections and op introspection does not work in eager
-        # mode.  This is generally ok; since eager mode (as of
-        # writing) executes sequentially anyway.
-        for sg in ops.get_collection(CRITICAL_SECTION_EXECUTIONS):
-          if sg.handle.name == self._handle.name:
-            # Other executions in the same critical section are allowed.
-            continue
-          if not (exclusive_resource_access or sg.exclusive_resource_access):
-            # Neither execution requested exclusive access.
-            continue
-          resource_intersection = resource_inputs.intersection(sg.resources)
-          if resource_intersection:
-            raise ValueError(
-                "This execution would access resources: %s.  Either this "
-                "lock (CriticalSection: %s) or lock '%s' "
-                "(CriticalSection: %s) requested exclusive resource access "
-                "of this resource.  Did you mean to call execute with keyword "
-                "argument exclusive_resource_access=False?" %
-                (list(resource_intersection), self._handle.name,
-                 sg.op.name, sg.handle.name))
-
-      def identity(x):  # pylint: disable=invalid-name
-        if isinstance(x, tensor_array_ops.TensorArray):
-          return x.identity()
-        elif isinstance(x, ops.Operation):
-          return control_flow_ops.group(x)
-        elif context.in_eager_mode() and x is None:
-          return None
-        else:
-          return array_ops.identity(x)
-
-      r_flat = [identity(x) for x in nest.flatten(r)]
+        self._check_multiple_access_to_resources(
+            captured_resources, exclusive_resource_access)
+
+      r_flat = [_identity(x) for x in nest.flatten(r)]
 
       with ops.control_dependencies(r_flat):
         # The identity must run on the same machine as self._handle
@@ -266,23 +269,105 @@ class CriticalSection(object):
 
         # Make sure that if any element of r is accessed, all of
         # them are executed together.
-        r = nest.pack_sequence_as(
-            r, control_flow_ops.tuple(nest.flatten(r)))
+        r = nest.pack_sequence_as(r, control_flow_ops.tuple(nest.flatten(r)))
 
       with ops.control_dependencies([ensure_lock_exists]):
-        outputs = nest.map_structure(identity, r)
+        outputs = nest.map_structure(_identity, r)
 
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         signature = _ExecutionSignature(
             op=lock.op,
             handle=self._handle,
-            resources=list(resource_inputs),
+            resources=list(captured_resources),
             exclusive_resource_access=exclusive_resource_access)
         ops.add_to_collections(
             CRITICAL_SECTION_EXECUTIONS, signature)
 
       return outputs
 
+  def _add_control_dependencies_to_lock(self, created_ops, lock_op):
+    """To avoid deadlocks, all args must be executed before lock_op."""
+    # Get all arguments (explicit and captured) of all ops created by fn().
+    all_args = set([input_.op for op in created_ops for input_ in op.inputs])
+    all_args.update(
+        input_op for op in created_ops for input_op in op.control_inputs)
+    # Unfortunately, we can't use sets throughout because TF seems to
+    # create new Operation objects for the same op sometimes; and we
+    # can't rely on id(op).
+
+    # pylint: disable=protected-access
+    all_args_dict = dict((op._id, op) for op in all_args)
+
+    # Remove ops created within fn, or that lock_op already has a
+    # control dependency on.  Also remove a possible self-loop.
+    for op in created_ops:
+      all_args_dict.pop(op._id, None)
+    for op in lock_op.control_inputs:
+      all_args_dict.pop(op._id, None)
+    for input_ in lock_op.inputs:
+      all_args_dict.pop(input_.op._id, None)
+    all_args_dict.pop(lock_op._id, None)
+
+    all_args = all_args_dict.values()
+
+    if not all_args:
+      # No control dependencies to add; return early.
+      return
+
+    # This group is important: it ensures that any ops in all_args
+    # outside the control context of the lock_op (and this fn, which
+    # runs in the same context) are added to this context before
+    # being added to the control dependencies of lock_op.
+    all_args = control_flow_ops.group(*all_args)
+
+    lock_op._add_control_input(all_args)
+    # pylint: enable=protected-access
+
+  def _is_self_handle(self, x):
+    """Check if the tensor `x` is the same Mutex as `self._handle`."""
+    return (x.op.type == "MutexV2"
+            # blank shared_name means the op will create a unique one.
+            and x.op.get_attr("shared_name")
+            and (x.op.get_attr("shared_name") ==
+                 self._handle.op.get_attr("shared_name"))
+            and (x.op.device == self._handle.op.device
+                 or _get_colocation(x.op) == _get_colocation(self._handle.op)))
+
+  def _check_multiple_access_to_resources(
+      self, captured_resources, exclusive_resource_access):
+    """Raise if captured_resources are accessed by another CriticalSection.
+
+    Args:
+      captured_resources: Set of tensors of type resource.
+      exclusive_resource_access: Whether this execution requires exclusive
+        resource access.
+
+    Raises:
+      ValueError: If any tensors in `captured_resources` are also accessed
+        by another `CriticalSection`, and at least one of them requires
+        exclusive resource access.
+    """
+    # Collections and op introspection do not work in eager
+    # mode.  This is generally ok, since eager mode (as of
+    # writing) executes sequentially anyway.
+    for sg in ops.get_collection(CRITICAL_SECTION_EXECUTIONS):
+      if self._is_self_handle(sg.handle):
+        # Other executions in the same critical section are allowed.
+        continue
+      if not (exclusive_resource_access or sg.exclusive_resource_access):
+        # Neither execution requested exclusive access.
+        continue
+      resource_intersection = captured_resources.intersection(sg.resources)
+      if resource_intersection:
+        raise ValueError(
+            "This execution would access resources: %s.  Either this "
+            "lock (CriticalSection: %s) or lock '%s' "
+            "(CriticalSection: %s) requested exclusive resource access "
+            "of this resource.  Did you mean to call execute with keyword "
+            "argument exclusive_resource_access=False?" %
+            (list(resource_intersection), self._handle.name,
+             sg.op.name, sg.handle.name))
+
   # TODO(ebrevdo): Re-enable once CriticalSection is in core.
 
   # def to_proto(self, export_scope=None):
diff --git a/tensorflow/contrib/framework/python/ops/critical_section_test.py b/tensorflow/contrib/framework/python/ops/critical_section_test.py
index c916592ce1979fe3a79cf28ad4bdac44284cce97..ba660295cb3c97d26da7bf892c78bceee53cf2d4 100644
--- a/tensorflow/contrib/framework/python/ops/critical_section_test.py
+++ b/tensorflow/contrib/framework/python/ops/critical_section_test.py
@@ -25,6 +25,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.platform import test
+from tensorflow.python.platform import tf_logging as logging
 # TODO(ebrevdo): Re-enable once CriticalSection is in core.
 # from tensorflow.python.training import saver as saver_lib
 
@@ -37,7 +38,7 @@ class CriticalSectionTest(test.TestCase):
     v = resource_variable_ops.ResourceVariable(0.0, name="v")
 
     def fn(a, b):
-      c = v.read_value()
+      c = v.value()
       with ops.control_dependencies([c]):
         nv = v.assign_add(a * b)
         with ops.control_dependencies([nv]):
@@ -140,15 +141,151 @@ class CriticalSectionTest(test.TestCase):
          ops.get_collection(critical_section_ops.CRITICAL_SECTION_EXECUTIONS)])
 
   def testRecursiveCriticalSectionAccessIsIllegal(self):
+    # This does not work properly in eager mode.  Eager users will
+    # just hit a deadlock if they do this.  But at least it'll be easier
+    # to debug.
+    cs = critical_section_ops.CriticalSection()
+    def fn(x):
+      return cs.execute(lambda y: y + 1, x)
+    with self.assertRaisesRegexp(
+        ValueError,
+        r"attempts to directly access the CriticalSection in which it "
+        r"would be running"):
+      cs.execute(fn, 1.0)
+
+  def testRecursiveCriticalSectionAccessViaCapturedTensorIsProtected(self):
+    # This one is subtle, and we're being overly cautious here.  The
+    # deadlock we are ensuring we catch is:
+    #
+    # to_capture = CS[lambda x: x + 1](1.0)
+    # deadlocked = CS[lambda x: x + to_capture](1.0)
+    #
+    # This would have caused a deadlock because executing `deadlocked` will
+    # lock the mutex on CS; but then due to dependencies, will attempt
+    # to compute `to_capture`.  This computation requires locking CS,
+    # but that is not possible now because CS is already locked by
+    # `deadlocked`.
+    #
+    # We check that CriticalSection.execute properly inserts new
+    # control dependencies to its lock to ensure all captured
+    # operations are finished before anything runs within the critical section.
+    cs = critical_section_ops.CriticalSection(shared_name="cs")
+    fn = array_ops.identity
+    to_capture = cs.execute(fn, 1.0)
+    fn_captures = lambda x: x + to_capture
+    to_capture_too = array_ops.identity(to_capture)
+
+    ex_0 = cs.execute(fn_captures, 1.0)
+
+    with ops.control_dependencies([to_capture]):
+      # This is OK because to_capture will execute before this next call
+      ex_1 = cs.execute(fn_captures, 1.0)
+
+    dependency = array_ops.identity(to_capture)
+
+    fn_captures_dependency = lambda x: x + dependency
+
+    ex_2 = cs.execute(fn_captures_dependency, 1.0)
+
+    with ops.control_dependencies([to_capture_too]):
+      ex_3 = cs.execute(fn_captures_dependency, 1.0)
+
+    # Ensure that none of the executions actually deadlocks.
+    self.assertEquals(2.0, self.evaluate(ex_0))
+    self.assertEquals(2.0, self.evaluate(ex_1))
+    self.assertEquals(2.0, self.evaluate(ex_2))
+    self.assertEquals(2.0, self.evaluate(ex_3))
+
+  def testRecursiveCriticalSectionAccessWithinLoopIsProtected(self):
+    cs = critical_section_ops.CriticalSection(shared_name="cs")
+
+    def body_implicit_capture(i, j):
+      # This would have caused a deadlock if not for logic in execute
+      # that inserts additional control dependencies onto the lock op:
+      #   * Loop body argument j is captured by fn()
+      #   * i is running in parallel to move forward the execution
+      #   * j is not being checked by the predicate function
+      #   * output of cs.execute() is returned as next j.
+      fn = lambda: j + 1
+      return (i + 1, cs.execute(fn))
+
+    (i_n, j_n) = control_flow_ops.while_loop(
+        lambda i, _: i < 1000,
+        body_implicit_capture,
+        [0, 0],
+        parallel_iterations=25)
+    logging.warn(
+        "\n==============\nRunning "
+        "'testRecursiveCriticalSectionAccessWithinLoopIsProtected "
+        "body_implicit_capture'\n"
+        "==============\n")
+    self.assertEquals((1000, 1000), self.evaluate((i_n, j_n)))
+    logging.warn(
+        "\n==============\nSuccessfully finished running "
+        "'testRecursiveCriticalSectionAccessWithinLoopIsProtected "
+        "body_implicit_capture'\n"
+        "==============\n")
+
+    def body_implicit_capture_protected(i, j):
+      # This version is ok because we manually add a control
+      # dependency on j, which is an argument to the while_loop body
+      # and captured by fn.
+      fn = lambda: j + 1
+      with ops.control_dependencies([j]):
+        return (i + 1, cs.execute(fn))
+
+    (i_n, j_n) = control_flow_ops.while_loop(
+        lambda i, _: i < 1000,
+        body_implicit_capture_protected,
+        [0, 0],
+        parallel_iterations=25)
+    logging.warn(
+        "\n==============\nRunning "
+        "'testRecursiveCriticalSectionAccessWithinLoopIsProtected "
+        "body_implicit_capture_protected'\n"
+        "==============\n")
+    self.assertEquals((1000, 1000), self.evaluate((i_n, j_n)))
+    logging.warn(
+        "\n==============\nSuccessfully finished running "
+        "'testRecursiveCriticalSectionAccessWithinLoopIsProtected "
+        "body_implicit_capture_protected'\n"
+        "==============\n")
+
+    def body_args_capture(i, j):
+      # This version is ok because j is an argument to fn and we can
+      # ensure there's a control dependency on j.
+      fn = lambda x: x + 1
+      return (i + 1, cs.execute(fn, j))
+
+    (i_n, j_n) = control_flow_ops.while_loop(
+        lambda i, _: i < 1000,
+        body_args_capture,
+        [0, 0],
+        parallel_iterations=25)
+    logging.warn(
+        "\n==============\nRunning "
+        "'testRecursiveCriticalSectionAccessWithinLoopIsProtected "
+        "body_args_capture'\n"
+        "==============\n")
+    self.assertEquals((1000, 1000), self.evaluate((i_n, j_n)))
+    logging.warn(
+        "\n==============\nSuccessfully finished running "
+        "'testRecursiveCriticalSectionAccessWithinLoopDoesNotDeadlock "
+        "body_args_capture'\n"
+        "==============\n")
+
+  def testRecursiveCriticalSectionAccessIsIllegalSameSharedName(self):
     # This does not work properly in eager mode.  Eager users will
     # just hit a deadlock if they do this.  But at least it'll be easier
     # to debug.
     cs = critical_section_ops.CriticalSection(shared_name="cs")
+    cs_same = critical_section_ops.CriticalSection(shared_name="cs")
     def fn(x):
-      return cs.execute(lambda x: x+1, x)
+      return cs_same.execute(lambda x: x+1, x)
     with self.assertRaisesRegexp(
         ValueError,
-        r"attempts to access the CriticalSection in which it would be running"):
+        r"attempts to directly access the CriticalSection in which it "
+        r"would be running"):
       cs.execute(fn, 1.0)
 
   def testMultipleCSExecutionsRequestSameResource(self):
@@ -179,6 +316,20 @@ class CriticalSectionTest(test.TestCase):
         ValueError, "requested exclusive resource access"):
       cs1.execute(lambda: v2 + 1)
 
+  def testControlDependencyFromOutsideWhileLoopMixedWithInsideLoop(self):
+    cs = critical_section_ops.CriticalSection()
+    v = resource_variable_ops.ResourceVariable(0, name="v")
+    # Make sure that the control dependencies on v do not cause issues
+    # in the lock_op's automatic control dependency adder.
+    #
+    # Note, here v must be a resource variable (or something similar),
+    # otherwise it gets hoisted into the while_loop by the time we add
+    # control dependencies to the lock_op.
+    out = control_flow_ops.while_loop(
+        lambda i: i < 10, lambda i: cs.execute(lambda j: v + j + 1, i), [0])
+    self.evaluate(v.initializer)
+    self.assertEqual(10, self.evaluate(out))
+
   # TODO(ebrevdo): Re-enable once CriticalSection is in core.
   #
   # def testCriticalSectionAndExecuteOpSaverRoundTrip(self):
diff --git a/tensorflow/contrib/framework/python/ops/sort_ops.py b/tensorflow/contrib/framework/python/ops/sort_ops.py
index 8f62f0ea7b9b561f235b9496ffda97a9f378d530..1921a77c1e96ee3531d1ed0f98e41c27c9d427ac 100644
--- a/tensorflow/contrib/framework/python/ops/sort_ops.py
+++ b/tensorflow/contrib/framework/python/ops/sort_ops.py
@@ -14,6 +14,7 @@
 # ==============================================================================
 """Support for sorting tensors.
 
+@@argsort
 @@sort
 """
 
@@ -21,6 +22,9 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import ops as framework_ops
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
@@ -47,64 +51,141 @@ def sort(values, axis=-1, direction='ASCENDING', name=None):
     ValueError: If axis is not a constant scalar, or the direction is invalid.
   """
   with framework_ops.name_scope(name, 'sort'):
-    if direction not in _SORT_IMPL:
-      raise ValueError('%s should be one of %s' %
-                       (direction, ', '.join(sorted(_SORT_IMPL.keys()))))
-    # Axis must be an integer, not a Tensor.
-    axis = framework_ops.convert_to_tensor(axis, name='axis')
-    axis_static = tensor_util.constant_value(axis)
-    if axis.shape.ndims != 0 or axis_static is None:
-      raise ValueError('axis must be a constant scalar')
-    axis_static = int(axis_static)  # Avoids NumPy casting error
+    return _sort_or_argsort(values, axis, direction, return_argsort=False)
+
+
+def argsort(values, axis=-1, direction='ASCENDING', stable=False, name=None):
+  """Returns the indices of a tensor that give its sorted order along an axis.
+
+  For a 1D tensor, `tf.gather(values, tf.argsort(values))` is equivalent to
+  `tf.sort(values)`. For higher dimensions, the output has the same shape as
+  `values`, but along the given axis, values represent the index of the sorted
+  element in that slice of the tensor at the given position.
+
+  Args:
+    values: 1-D or higher numeric `Tensor`.
+    axis: The axis along which to sort. The default is -1, which sorts the last
+        axis.
+    direction: The direction in which to sort the values (`'ASCENDING'` or
+        `'DESCENDING'`).
+    stable: If True, equal elements in the original tensor will not be
+        re-ordered in the returned order. Unstable sort is not yet implemented,
+        but will eventually be the default for performance reasons. If you
+        require a stable order, pass `stable=True` for forwards compatibility.
+    name: Optional name for the operation.
+
+  Returns:
+    An int32 `Tensor` with the same shape as `values`. The indices that would
+        sort each slice of the given `values` along the given `axis`.
+
+  Raises:
+    ValueError: If axis is not a constant scalar, or the direction is invalid.
+  """
+  del stable  # Unused.
+  with framework_ops.name_scope(name, 'argsort'):
+    return _sort_or_argsort(values, axis, direction, return_argsort=True)
+
+
+def _sort_or_argsort(values, axis, direction, return_argsort):
+  """Internal sort/argsort implementation.
+
+  Args:
+    values: The input values.
+    axis: The axis along which to sort.
+    direction: 'ASCENDING' or 'DESCENDING'.
+    return_argsort: Whether to return the argsort result.
+
+  Returns:
+    Either the sorted values, or the indices of the sorted values in the
+        original tensor. See the `sort` and `argsort` docstrings.
+
+  Raises:
+    ValueError: If axis is not a constant scalar, or the direction is invalid.
+  """
+  if direction not in _SORT_IMPL:
+    raise ValueError('%s should be one of %s' %
+                     (direction, ', '.join(sorted(_SORT_IMPL.keys()))))
+  # Axis must be an integer, not a Tensor.
+  axis = framework_ops.convert_to_tensor(axis, name='axis')
+  axis_static = tensor_util.constant_value(axis)
+  if axis.shape.ndims != 0 or axis_static is None:
+    raise ValueError('axis must be a constant scalar')
+  axis_static = int(axis_static)  # Avoids NumPy casting error
 
-    values = framework_ops.convert_to_tensor(values, name='values')
+  values = framework_ops.convert_to_tensor(values, name='values')
 
-    return _SORT_IMPL[direction](values, axis_static)
+  return _SORT_IMPL[direction](values, axis_static, return_argsort)
 
 
-def _descending_sort(values, axis):
+def _descending_sort(values, axis, return_argsort=False):
   """Sorts values in reverse using `top_k`.
 
   Args:
     values: Tensor of numeric values.
     axis: Index of the axis which values should be sorted along.
+    return_argsort: If False, return the sorted values. If True, return the
+        indices that would sort the values.
 
   Returns:
     The sorted values.
   """
   k = array_ops.shape(values)[axis]
   rank = array_ops.rank(values)
+  static_rank = values.shape.ndims
   # Fast path: sorting the last axis.
   if axis == -1 or axis + 1 == values.get_shape().ndims:
-    return nn_ops.top_k(values, k)[0]
-
-  # Otherwise, transpose the array. Swap axes `axis` and `rank - 1`.
-  if axis < 0:
-    # Make axis a Tensor with the real axis index if needed.
-    axis += rank
-  transposition = array_ops.concat(
-      [
-          # Axes up to axis are unchanged.
-          math_ops.range(axis),
-          # Swap axis and rank - 1.
-          [rank - 1],
-          # Axes in [axis + 1, rank - 1) are unchanged.
-          math_ops.range(axis + 1, rank - 1),
-          # Swap axis and rank - 1.
-          [axis]
-      ],
-      axis=0)
-  top_k_input = array_ops.transpose(values, transposition)
-  values, unused_indices = nn_ops.top_k(top_k_input, k)
-  # transposition contains a single cycle of length 2 (swapping 2 elements),
-  # so it is an involution (it is its own inverse).
-  return array_ops.transpose(values, transposition)
-
-
-def _ascending_sort(values, axis):
+    top_k_input = values
+    transposition = None
+  else:
+    # Otherwise, transpose the array. Swap axes `axis` and `rank - 1`.
+    if axis < 0:
+      # Calculate the actual axis index if counting from the end. Use the static
+      # rank if available, or else make the axis back into a tensor.
+      axis += static_rank or rank
+    if static_rank is not None:
+      # Prefer to calculate the transposition array in NumPy and make it a
+      # constant.
+      transposition = constant_op.constant(
+          np.r_[
+              # Axes up to axis are unchanged.
+              np.arange(axis),
+              # Swap axis and rank - 1.
+              [static_rank - 1],
+              # Axes in [axis + 1, rank - 1) are unchanged.
+              np.arange(axis + 1, static_rank - 1),
+              # Swap axis and rank - 1.
+              [axis]],
+          name='transposition')
+    else:
+      # Generate the transposition array from the tensors.
+      transposition = array_ops.concat(
+          [
+              # Axes up to axis are unchanged.
+              math_ops.range(axis),
+              # Swap axis and rank - 1.
+              [rank - 1],
+              # Axes in [axis + 1, rank - 1) are unchanged.
+              math_ops.range(axis + 1, rank - 1),
+              # Swap axis and rank - 1.
+              [axis]
+          ],
+          axis=0)
+    top_k_input = array_ops.transpose(values, transposition)
+
+  values, indices = nn_ops.top_k(top_k_input, k)
+  return_value = indices if return_argsort else values
+  if transposition is not None:
+    # transposition contains a single cycle of length 2 (swapping 2 elements),
+    # so it is an involution (it is its own inverse).
+    return_value = array_ops.transpose(return_value, transposition)
+  return return_value
+
+
+def _ascending_sort(values, axis, return_argsort=False):
   # Negate the values to get the ascending order from descending sort.
-  values_or_indices = _descending_sort(-values, axis)
-  return -values_or_indices
+  values_or_indices = _descending_sort(-values, axis, return_argsort)
+  # If not argsort, negate the values again.
+  return values_or_indices if return_argsort else -values_or_indices
 
 
 _SORT_IMPL = {
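
Reviewer note on the sort_ops.py change above: the new `_descending_sort` moves the requested axis to the end with a swap permutation, runs `top_k` on the last axis, and transposes back with the same permutation. Below is a minimal NumPy sketch of that idea, not the TensorFlow implementation. The helpers `swap_with_last`, `top_k_indices_last_axis`, and `argsort_sketch` are hypothetical names, and a stable `np.argsort` on negated values is assumed as a stand-in for a full-width `top_k`.

```python
import numpy as np


def swap_with_last(axis, rank):
    """Hypothetical helper: permutation swapping `axis` with the last axis."""
    if axis < 0:
        axis += rank
    if axis == rank - 1:
        # Fast path: the axis is already last, so nothing moves.
        return np.arange(rank)
    return np.r_[
        np.arange(axis),                # Axes before `axis` are unchanged.
        [rank - 1],                     # The last axis moves to `axis`.
        np.arange(axis + 1, rank - 1),  # Axes in between are unchanged.
        [axis]]                         # `axis` moves to the end.


def top_k_indices_last_axis(x):
    """Assumed stand-in for top_k over the whole last axis (descending)."""
    return np.argsort(-x, axis=-1, kind="stable")


def argsort_sketch(values, axis=-1, direction="ASCENDING"):
    values = np.asarray(values)
    perm = tuple(swap_with_last(axis, values.ndim))
    # Ascending order is the descending order of the negated values.
    keys = -values if direction == "ASCENDING" else values
    indices = top_k_indices_last_axis(np.transpose(keys, perm))
    # The swap is its own inverse, so the same permutation restores the layout.
    return np.transpose(indices, perm)
```

Note that only the sorted values need to be negated back in the ascending case; indices are returned unchanged, which matches the `return_argsort` branch in `_ascending_sort`.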
diff --git a/tensorflow/contrib/framework/python/ops/sort_ops_test.py b/tensorflow/contrib/framework/python/ops/sort_ops_test.py
index d08ae502f10d98ee14d8bea2f76b18bedb935cea..a8fb94b245dccc8c7cf0e94cef9b436f881fe408 100644
--- a/tensorflow/contrib/framework/python/ops/sort_ops_test.py
+++ b/tensorflow/contrib/framework/python/ops/sort_ops_test.py
@@ -24,6 +24,8 @@ from tensorflow.contrib.framework.python.ops import sort_ops
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.platform import test
@@ -90,6 +92,38 @@ class SortTest(test.TestCase):
               axis=0,
               direction='DESCENDING').eval())
 
+  def testSort_staticallyKnownRank_constantTransposition(self):
+    # The transposition array should be a constant if the rank of "values" is
+    # statically known.
+    tensor = random_ops.random_uniform(
+        # Rank is statically known to be 5, but the dimension lengths are not
+        # known.
+        random_ops.random_uniform(
+            shape=(5,), minval=0, maxval=10, dtype=dtypes.int32))
+    sort_ops.sort(tensor, axis=1)
+    transposition = (
+        ops.get_default_graph().get_tensor_by_name('sort/transposition:0'))
+    self.assertFalse(tensor_util.constant_value(transposition) is None)
+    self.assertAllEqual(
+        # Swaps "1" and "4" to put "1" at the end.
+        tensor_util.constant_value(transposition),
+        [0, 4, 2, 3, 1])
+
+  def testArgsort_1d(self):
+    arr = np.random.random(42)
+    with self.test_session():
+      self.assertAllEqual(
+          np.sort(arr),
+          array_ops.gather(arr, sort_ops.argsort(arr)).eval())
+
+  def testArgsort(self):
+    arr = np.random.random((5, 6, 7, 8))
+    for axis in range(4):
+      with self.test_session():
+        self.assertAllEqual(
+            np.argsort(arr, axis=axis),
+            sort_ops.argsort(arr, axis=axis).eval())
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/gan/python/eval/python/classifier_metrics_impl.py b/tensorflow/contrib/gan/python/eval/python/classifier_metrics_impl.py
index fdfabd07c13f689d075ecbb8786d725fa8a62d01..47e51415fd9e7daa360ca06a11078f6edcf63b5b 100644
--- a/tensorflow/contrib/gan/python/eval/python/classifier_metrics_impl.py
+++ b/tensorflow/contrib/gan/python/eval/python/classifier_metrics_impl.py
@@ -44,11 +44,11 @@ from tensorflow.python.ops import functional_ops
 from tensorflow.python.ops import image_ops
 from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn_impl
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import resource_loader
 
-
 __all__ = [
     'get_graph_def_from_disk',
     'get_graph_def_from_resource',
@@ -62,10 +62,11 @@ __all__ = [
     'frechet_inception_distance',
     'frechet_classifier_distance',
     'frechet_classifier_distance_from_activations',
+    'mean_only_frechet_classifier_distance_from_activations',
+    'diagonal_only_frechet_classifier_distance_from_activations',
     'INCEPTION_DEFAULT_IMAGE_SIZE',
 ]
 
-
 INCEPTION_URL = 'http://download.tensorflow.org/models/frozen_inception_v1_2015_12_05.tar.gz'
 INCEPTION_FROZEN_GRAPH = 'inceptionv1_for_inception_score.pb'
 INCEPTION_INPUT = 'Mul:0'
@@ -77,8 +78,7 @@ INCEPTION_DEFAULT_IMAGE_SIZE = 299
 def _validate_images(images, image_size):
   images = ops.convert_to_tensor(images)
   images.shape.with_rank(4)
-  images.shape.assert_is_compatible_with(
-      [None, image_size, image_size, None])
+  images.shape.assert_is_compatible_with([None, image_size, image_size, None])
   return images
 
 
@@ -109,9 +109,10 @@ def _symmetric_matrix_square_root(mat, eps=1e-10):
       math_ops.matmul(u, array_ops.diag(si)), v, transpose_b=True)
 
 
-def preprocess_image(
-    images, height=INCEPTION_DEFAULT_IMAGE_SIZE,
-    width=INCEPTION_DEFAULT_IMAGE_SIZE, scope=None):
+def preprocess_image(images,
+                     height=INCEPTION_DEFAULT_IMAGE_SIZE,
+                     width=INCEPTION_DEFAULT_IMAGE_SIZE,
+                     scope=None):
   """Prepare a batch of images for evaluation.
 
   This is the preprocessing portion of the graph from
@@ -272,8 +273,11 @@ def run_inception(images,
   return activations
 
 
-def run_image_classifier(tensor, graph_def, input_tensor,
-                         output_tensor, scope='RunClassifier'):
+def run_image_classifier(tensor,
+                         graph_def,
+                         input_tensor,
+                         output_tensor,
+                         scope='RunClassifier'):
   """Runs a network from a frozen graph.
 
   Args:
@@ -317,7 +321,7 @@ def classifier_score(images, classifier_fn, num_batches=1):
 
   NOTE: This function consumes images, computes their logits, and then
   computes the classifier score. If you would like to precompute many logits for
-  large batches, use clasifier_score_from_logits(), which this method also
+  large batches, use classifier_score_from_logits(), which this method also
   uses.
 
   Args:
@@ -433,8 +437,8 @@ def trace_sqrt_product(sigma, sigma_v):
   sqrt_sigma = _symmetric_matrix_square_root(sigma)
 
   # This is sqrt(A sigma_v A) above
-  sqrt_a_sigmav_a = math_ops.matmul(
-      sqrt_sigma, math_ops.matmul(sigma_v, sqrt_sigma))
+  sqrt_a_sigmav_a = math_ops.matmul(sqrt_sigma,
+                                    math_ops.matmul(sigma_v, sqrt_sigma))
 
   return math_ops.trace(_symmetric_matrix_square_root(sqrt_a_sigmav_a))
 
@@ -450,9 +454,9 @@ def frechet_classifier_distance(real_images,
 
   This technique is described in detail in https://arxiv.org/abs/1706.08500.
   Given two Gaussian distribution with means m and m_w and covariance matrices
-  C and C_w, this function calcuates
+  C and C_w, this function calculates
 
-  |m - m_w|^2 + Tr(C + C_w - 2(C * C_w)^(1/2))
+              |m - m_w|^2 + Tr(C + C_w - 2(C * C_w)^(1/2))
 
   which captures how different the distributions of real images and generated
   images (or more accurately, their visual features) are. Note that unlike the
@@ -463,7 +467,7 @@ def frechet_classifier_distance(real_images,
   Frechet distance is biased. It is more biased for small sample sizes. (e.g.
   even if the two distributions are the same, for a small sample size, the
   expected Frechet distance is large). It is important to use the same
-  sample size to compute frechet classifier distance when comparing two
+  sample size to compute Frechet classifier distance when comparing two
   generative models.
 
   NOTE: This function consumes images, computes their activations, and then
@@ -511,10 +515,142 @@ def frechet_classifier_distance(real_images,
   return frechet_classifier_distance_from_activations(real_a, gen_a)
 
 
-def frechet_classifier_distance_from_activations(
+def mean_only_frechet_classifier_distance_from_activations(
     real_activations, generated_activations):
   """Classifier distance for evaluating a generative model from activations.
 
+  Given two Gaussian distributions with means m and m_w and covariance matrices
+  C and C_w, this function calculates
+
+                                |m - m_w|^2
+
+  which captures how different the distributions of real images and generated
+  images (or more accurately, their visual features) are. Note that unlike the
+  Inception score, this is a true distance and utilizes information about real
+  world images.
+
+  Note that when computed using sample means and sample covariance matrices,
+  Frechet distance is biased. It is more biased for small sample sizes. (e.g.
+  even if the two distributions are the same, for a small sample size, the
+  expected Frechet distance is large). It is important to use the same
+  sample size to compute Frechet classifier distance when comparing two
+  generative models.
+
+  In this variant, we only compute the difference between the means of the
+  fitted Gaussians. The computation leads to O(n) vs. O(n^2) memory usage, yet
+  still retains much of the same information as FID.
+
+  Args:
+    real_activations: 2D array of activations of real images of size
+      [num_images, num_dims] to use to compute Frechet Inception distance.
+    generated_activations: 2D array of activations of generated images of size
+      [num_images, num_dims] to use to compute Frechet Inception distance.
+
+  Returns:
+    The mean-only Frechet Inception distance. A floating-point scalar of the
+    same type as the output of the activations.
+  """
+  real_activations.shape.assert_has_rank(2)
+  generated_activations.shape.assert_has_rank(2)
+
+  activations_dtype = real_activations.dtype
+  if activations_dtype != dtypes.float64:
+    real_activations = math_ops.to_double(real_activations)
+    generated_activations = math_ops.to_double(generated_activations)
+
+  # Compute means of activations.
+  m = math_ops.reduce_mean(real_activations, 0)
+  m_w = math_ops.reduce_mean(generated_activations, 0)
+
+  # Next the distance between means.
+  mean = math_ops.reduce_sum(
+      math_ops.squared_difference(m, m_w))  # Equivalent to L2 but more stable.
+  mofid = mean
+  if activations_dtype != dtypes.float64:
+    mofid = math_ops.cast(mofid, activations_dtype)
+
+  return mofid
+
+
+def diagonal_only_frechet_classifier_distance_from_activations(
+    real_activations, generated_activations):
+  """Classifier distance for evaluating a generative model.
+
+  This is based on the Frechet Inception distance, but for an arbitrary
+  classifier.
+
+  This technique is described in detail in https://arxiv.org/abs/1706.08500.
+  Given two Gaussian distributions with means m and m_w and covariance matrices
+  C and C_w, this function calculates
+
+          |m - m_w|^2 + (sigma + sigma_w - 2(sigma x sigma_w)^(1/2))
+
+  which captures how different the distributions of real images and generated
+  images (or more accurately, their visual features) are. Note that unlike the
+  Inception score, this is a true distance and utilizes information about real
+  world images. In this variant, we compute diagonal-only covariance matrices.
+  As a result, instead of computing an expensive matrix square root, we can use
+  a much simpler computation that has O(n) vs. O(n^2) space complexity.
+
+  Note that when computed using sample means and sample covariance matrices,
+  Frechet distance is biased. It is more biased for small sample sizes. (e.g.
+  even if the two distributions are the same, for a small sample size, the
+  expected Frechet distance is large). It is important to use the same
+  sample size to compute Frechet classifier distance when comparing two
+  generative models.
+
+  Args:
+    real_activations: 2D activations of real images, [num_images, num_dims].
+    generated_activations: 2D activations of generated images,
+      [num_images, num_dims].
+
+  Returns:
+    The diagonal-only Frechet Inception distance. A floating-point scalar of
+    the same type as the output of the activations.
+
+  Raises:
+    ValueError: If the shapes of the variance and mean vectors are not equal.
+  """
+  real_activations.shape.assert_has_rank(2)
+  generated_activations.shape.assert_has_rank(2)
+
+  activations_dtype = real_activations.dtype
+  if activations_dtype != dtypes.float64:
+    real_activations = math_ops.to_double(real_activations)
+    generated_activations = math_ops.to_double(generated_activations)
+
+  # Compute the means and per-dimension variances of activations.
+  m, var = nn_impl.moments(real_activations, axes=[0])
+  m_w, var_w = nn_impl.moments(generated_activations, axes=[0])
+
+  actual_shape = var.get_shape()
+  expected_shape = m.get_shape()
+
+  if actual_shape != expected_shape:
+    raise ValueError('shape: {} must match expected shape: {}'.format(
+        actual_shape, expected_shape))
+
+  # Compute the two components of FID.
+
+  # First the covariance component.
+  # Here, note that trace(A + B) = trace(A) + trace(B)
+  trace = math_ops.reduce_sum(
+      (var + var_w) - 2.0 * math_ops.sqrt(math_ops.multiply(var, var_w)))
+
+  # Next the distance between means.
+  mean = math_ops.reduce_sum(
+      math_ops.squared_difference(m, m_w))  # Equivalent to L2 but more stable.
+  dofid = trace + mean
+  if activations_dtype != dtypes.float64:
+    dofid = math_ops.cast(dofid, activations_dtype)
+
+  return dofid
+
+
+def frechet_classifier_distance_from_activations(real_activations,
+                                                 generated_activations):
+  """Classifier distance for evaluating a generative model.
+
   This methods computes the Frechet classifier distance from activations of
   real images and generated images. This can be used independently of the
   frechet_classifier_distance() method, especially in the case of using large
@@ -523,15 +659,22 @@ def frechet_classifier_distance_from_activations(
 
   This technique is described in detail in https://arxiv.org/abs/1706.08500.
   Given two Gaussian distribution with means m and m_w and covariance matrices
-  C and C_w, this function calcuates
+  C and C_w, this function calculates
 
-  |m - m_w|^2 + Tr(C + C_w - 2(C * C_w)^(1/2))
+                |m - m_w|^2 + Tr(C + C_w - 2(C * C_w)^(1/2))
 
   which captures how different the distributions of real images and generated
   images (or more accurately, their visual features) are. Note that unlike the
   Inception score, this is a true distance and utilizes information about real
   world images.
 
+  Note that when computed using sample means and sample covariance matrices,
+  Frechet distance is biased. It is more biased for small sample sizes. (e.g.
+  even if the two distributions are the same, for a small sample size, the
+  expected Frechet distance is large). It is important to use the same
+  sample size to compute Frechet classifier distance when comparing two
+  generative models.
+
   Args:
     real_activations: 2D Tensor containing activations of real data. Shape is
       [batch_size, activation_size].
@@ -553,36 +696,38 @@ def frechet_classifier_distance_from_activations(
 
   # Compute mean and covariance matrices of activations.
   m = math_ops.reduce_mean(real_activations, 0)
-  m_v = math_ops.reduce_mean(generated_activations, 0)
+  m_w = math_ops.reduce_mean(generated_activations, 0)
   num_examples = math_ops.to_double(array_ops.shape(real_activations)[0])
 
   # sigma = (1 / (n - 1)) * (X - mu) (X - mu)^T
   real_centered = real_activations - m
   sigma = math_ops.matmul(
-      real_centered, real_centered, transpose_a=True) / (num_examples - 1)
+      real_centered, real_centered, transpose_a=True) / (
+          num_examples - 1)
 
-  gen_centered = generated_activations - m_v
-  sigma_v = math_ops.matmul(
-      gen_centered, gen_centered, transpose_a=True) / (num_examples - 1)
+  gen_centered = generated_activations - m_w
+  sigma_w = math_ops.matmul(
+      gen_centered, gen_centered, transpose_a=True) / (
+          num_examples - 1)
 
-  # Find the Tr(sqrt(sigma sigma_v)) component of FID
-  sqrt_trace_component = trace_sqrt_product(sigma, sigma_v)
+  # Find the Tr(sqrt(sigma sigma_w)) component of FID
+  sqrt_trace_component = trace_sqrt_product(sigma, sigma_w)
 
   # Compute the two components of FID.
 
   # First the covariance component.
   # Here, note that trace(A + B) = trace(A) + trace(B)
-  trace = math_ops.trace(sigma + sigma_v) - 2.0 * sqrt_trace_component
+  trace = math_ops.trace(sigma + sigma_w) - 2.0 * sqrt_trace_component
 
   # Next the distance between means.
-  mean = math_ops.square(linalg_ops.norm(m - m_v))  # This uses the L2 norm.
+  mean = math_ops.reduce_sum(
+      math_ops.squared_difference(m, m_w))  # Equivalent to L2 but more stable.
   fid = trace + mean
   if activations_dtype != dtypes.float64:
     fid = math_ops.cast(fid, activations_dtype)
 
   return fid
 
-
 frechet_inception_distance = functools.partial(
     frechet_classifier_distance,
     classifier_fn=functools.partial(
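
Reviewer note on the two new metrics above: both reduce to per-dimension statistics, which is where the O(n) memory claim comes from. The NumPy sketch below mirrors the formulas in the docstrings and is not the TensorFlow code. The function names are hypothetical, and population variance (`ddof=0`) is assumed, matching what `nn_impl.moments` returns.

```python
import numpy as np


def mean_only_distance(real, gen):
    """Hypothetical sketch of |m - m_w|^2 over [num_images, num_dims] inputs."""
    return np.square(real.mean(axis=0) - gen.mean(axis=0)).sum()


def diagonal_only_distance(real, gen):
    """Hypothetical sketch: mean term plus the diagonal covariance term."""
    var = real.var(axis=0)    # Population variance per dimension.
    var_w = gen.var(axis=0)
    # With diagonal covariances the matrix square root is elementwise.
    trace = (var + var_w - 2.0 * np.sqrt(var * var_w)).sum()
    return mean_only_distance(real, gen) + trace
```

Since var + var_w - 2 * sqrt(var * var_w) equals (sqrt(var) - sqrt(var_w))^2, the covariance term is never negative, so the diagonal-only value is always at least the mean-only value.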
diff --git a/tensorflow/contrib/gan/python/eval/python/classifier_metrics_test.py b/tensorflow/contrib/gan/python/eval/python/classifier_metrics_test.py
index 61dc8646ddc10605561ae6b19e90f4739c346608..663e49bdca3cb2dd9257da326488c877fcc4256d 100644
--- a/tensorflow/contrib/gan/python/eval/python/classifier_metrics_test.py
+++ b/tensorflow/contrib/gan/python/eval/python/classifier_metrics_test.py
@@ -50,6 +50,26 @@ def _expected_inception_score(logits):
   return np.exp(np.mean(per_example_logincscore))
 
 
+def _expected_mean_only_fid(real_imgs, gen_imgs):
+  m = np.mean(real_imgs, axis=0)
+  m_v = np.mean(gen_imgs, axis=0)
+  mean = np.square(m - m_v).sum()
+  mofid = mean
+  return mofid
+
+
+def _expected_diagonal_only_fid(real_imgs, gen_imgs):
+  m = np.mean(real_imgs, axis=0)
+  m_v = np.mean(gen_imgs, axis=0)
+  var = np.var(real_imgs, axis=0)
+  var_v = np.var(gen_imgs, axis=0)
+  sqcc = np.sqrt(var * var_v)
+  mean = (np.square(m - m_v)).sum()
+  trace = (var + var_v - 2 * sqcc).sum()
+  dofid = mean + trace
+  return dofid
+
+
 def _expected_fid(real_imgs, gen_imgs):
   m = np.mean(real_imgs, axis=0)
   m_v = np.mean(gen_imgs, axis=0)
@@ -285,6 +305,46 @@ class ClassifierMetricsTest(test.TestCase):
 
     self.assertAllClose(_expected_inception_score(logits), incscore_np)
 
+  def test_mean_only_frechet_classifier_distance_value(self):
+    """Test that the mean-only Frechet distance gives the correct value."""
+    np.random.seed(0)
+
+    pool_real_a = np.float32(np.random.randn(256, 2048))
+    pool_gen_a = np.float32(np.random.randn(256, 2048))
+
+    tf_pool_real_a = array_ops.constant(pool_real_a)
+    tf_pool_gen_a = array_ops.constant(pool_gen_a)
+
+    mofid_op = classifier_metrics.mean_only_frechet_classifier_distance_from_activations(  # pylint: disable=line-too-long
+        tf_pool_real_a, tf_pool_gen_a)
+
+    with self.test_session() as sess:
+      actual_mofid = sess.run(mofid_op)
+
+    expected_mofid = _expected_mean_only_fid(pool_real_a, pool_gen_a)
+
+    self.assertAllClose(expected_mofid, actual_mofid, 0.0001)
+
+  def test_diagonal_only_frechet_classifier_distance_value(self):
+    """Test that the diagonal-only Frechet distance gives the correct value."""
+    np.random.seed(0)
+
+    pool_real_a = np.float32(np.random.randn(256, 2048))
+    pool_gen_a = np.float32(np.random.randn(256, 2048))
+
+    tf_pool_real_a = array_ops.constant(pool_real_a)
+    tf_pool_gen_a = array_ops.constant(pool_gen_a)
+
+    dofid_op = classifier_metrics.diagonal_only_frechet_classifier_distance_from_activations(  # pylint: disable=line-too-long
+        tf_pool_real_a, tf_pool_gen_a)
+
+    with self.test_session() as sess:
+      actual_dofid = sess.run(dofid_op)
+
+    expected_dofid = _expected_diagonal_only_fid(pool_real_a, pool_gen_a)
+
+    self.assertAllClose(expected_dofid, actual_dofid, 0.0001)
+
   def test_frechet_classifier_distance_value(self):
     """Test that `frechet_classifier_distance` gives the correct value."""
     np.random.seed(0)
diff --git a/tensorflow/contrib/gan/python/eval/python/sliced_wasserstein_impl.py b/tensorflow/contrib/gan/python/eval/python/sliced_wasserstein_impl.py
index 9bebcacbe46d85fc4226c4275b71b3ecbde57a97..4b10bc0f8e607c02763d8ea622d6f8f2572c586d 100644
--- a/tensorflow/contrib/gan/python/eval/python/sliced_wasserstein_impl.py
+++ b/tensorflow/contrib/gan/python/eval/python/sliced_wasserstein_impl.py
@@ -212,7 +212,7 @@ def sliced_wasserstein_distance(real_images,
   Args:
       real_images: (tensor) Real images (batch, height, width, channels).
       fake_images: (tensor) Fake images (batch, height, width, channels).
-      resolution_min: (int) Minimum resolution for the Laplacion pyramid.
+      resolution_min: (int) Minimum resolution for the Laplacian pyramid.
       patches_per_image: (int) Number of patches to extract per image per
         Laplacian level.
       patch_size: (int) Width of a square patch.
@@ -221,7 +221,7 @@ def sliced_wasserstein_distance(real_images,
       use_svd: experimental method to compute a more accurate distance.
   Returns:
       List of tuples (distance_real, distance_fake) for each level of the
-      Laplacian pyramid from the highest resoluion to the lowest.
+      Laplacian pyramid from the highest resolution to the lowest.
         distance_real is the Wasserstein distance between real images
         distance_fake is the Wasserstein distance between real and fake images.
   Raises:
diff --git a/tensorflow/contrib/gan/python/features/python/conditioning_utils_impl.py b/tensorflow/contrib/gan/python/features/python/conditioning_utils_impl.py
index cd31c62667fc048b1003d334377405b284f32af5..e2594faf85bcf91cbe09f266e4d4211d20bdee17 100644
--- a/tensorflow/contrib/gan/python/features/python/conditioning_utils_impl.py
+++ b/tensorflow/contrib/gan/python/features/python/conditioning_utils_impl.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Miscellanous utilities for TFGAN code and examples.
+"""Miscellaneous utilities for TFGAN code and examples.
 
 Includes:
 1) Conditioning the value of a Tensor, based on techniques from
diff --git a/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py b/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py
index 4cfae0de4451880cf8229903b0eb74b1c6e2e04d..9e4ec59e7098443efc53506a4ba159e84b5c1618 100644
--- a/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py
+++ b/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py
@@ -17,7 +17,7 @@
 We use this to keep a history of values created by a generator, such that
 a discriminator can randomly be trained on some older samples, not just the
 current one. This can help to not let the discriminator get too far ahead of the
-generator and also to keep the system from oscilating, if the discriminator
+generator and also to keep the system from oscillating, if the discriminator
 forgets too fast what past samples from the generator looked like.
 
 See the following papers for more details.
@@ -97,7 +97,7 @@ def tensor_pool(input_values,
         dtypes=[v.dtype for v in input_values],
         shapes=None)
 
-    # In pseudeo code this code does the following:
+    # In pseudo code this code does the following:
     # if not pool_full:
     #   enqueue(input_values)
     #   return input_values
diff --git a/tensorflow/contrib/gan/python/features/python/virtual_batchnorm_test.py b/tensorflow/contrib/gan/python/features/python/virtual_batchnorm_test.py
index 845f89827b6e60eda41a55a80671f43460247b05..2fe06a287284ff994326d5a977a2e4d4634268ae 100644
--- a/tensorflow/contrib/gan/python/features/python/virtual_batchnorm_test.py
+++ b/tensorflow/contrib/gan/python/features/python/virtual_batchnorm_test.py
@@ -148,7 +148,7 @@ class VirtualBatchnormTest(test.TestCase):
       self.assertAllClose(bn_np[i, ...], vb_np)
 
   def test_minibatch_independent(self):
-    """Test that virtual batch normalized exampels are independent.
+    """Test that virtual batch normalized examples are independent.
 
     Unlike batch normalization, virtual batch normalization has the property
     that the virtual batch normalized value of an example is independent of the
diff --git a/tensorflow/contrib/graph_editor/reroute.py b/tensorflow/contrib/graph_editor/reroute.py
index 7ffdbb7139281734917fdb715601b317eb58b82f..95c02a64d47c26e731ef2628fb551529e9bc3f4d 100644
--- a/tensorflow/contrib/graph_editor/reroute.py
+++ b/tensorflow/contrib/graph_editor/reroute.py
@@ -471,9 +471,10 @@ def remove_control_inputs(op, cops):
     if cop not in op.control_inputs:
       raise ValueError("{} is not a control_input of {}".format(op.name,
                                                                 cop.name))
+  control_inputs = [cop for cop in op.control_inputs if cop not in cops]
   # pylint: disable=protected-access
-  op._control_inputs = [cop for cop in op._control_inputs if cop not in cops]
-  op._recompute_node_def()
+  op._remove_all_control_inputs()
+  op._add_control_inputs(control_inputs)
   # pylint: enable=protected-access
 
 
@@ -496,9 +497,6 @@ def add_control_inputs(op, cops):
     if cop in op.control_inputs:
       raise ValueError("{} is already a control_input of {}".format(cop.name,
                                                                     op.name))
-  # pylint: disable=protected-access
-  op._control_inputs += cops
-  op._recompute_node_def()
-  # pylint: enable=protected-access
+  op._add_control_inputs(cops)  # pylint: disable=protected-access
 
 remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/graph_editor/tests/transform_test.py b/tensorflow/contrib/graph_editor/tests/transform_test.py
index ca00394388f67e2ed9508684a47b23c3ee9e79e8..2603de640735a612cbd883cc6227fe3cd9f11fca 100644
--- a/tensorflow/contrib/graph_editor/tests/transform_test.py
+++ b/tensorflow/contrib/graph_editor/tests/transform_test.py
@@ -23,6 +23,7 @@ from tensorflow.contrib import graph_editor as ge
 from tensorflow.contrib.graph_editor.tests import match
 from tensorflow.python.client import session
 from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
@@ -84,9 +85,9 @@ class TransformTest(test.TestCase):
   def test_transform(self):
     transformer = ge.Transformer()
 
-    def my_transform_op_handler(info, op):
+    def my_transform_op_handler(info, op, new_inputs):
       add_noise = op.name.startswith("Add")
-      op_, op_outputs_ = ge.transform.copy_op_handler(info, op)
+      op_, op_outputs_ = ge.transform.copy_op_handler(info, op, new_inputs)
       if not add_noise:
         return op_, op_outputs_
       # add some noise to op
@@ -201,15 +202,56 @@ class TransformTest(test.TestCase):
                         get_operation_by_name("res/grad/mul1_grad/Mul_1"))
 
     # Make sure _original_ops are as expected.
-    self.assertEquals(original_mul1_grad._original_op.name, u"mul1")
-    self.assertEquals(result_mul1_grad._original_op.name, u"res/mul1")
-    self.assertNotEquals(res.name, g.name)
+    self.assertEqual(original_mul1_grad._original_op.name, u"mul1")
+    self.assertEqual(result_mul1_grad._original_op.name, u"res/mul1")
+    self.assertNotEqual(res.name, g.name)
     with session.Session() as sess:
       sess.run(variables.global_variables_initializer())
       g_val, res_val = sess.run([g, res])
     self.assertNear(g_val, 0.0, ERROR_TOLERANCE)
     self.assertNear(res_val, 0.0, ERROR_TOLERANCE)
 
+  def test_graph_while_loop(self):
+    graph = ops.Graph()
+    with graph.as_default():
+      max_index = array_ops.placeholder(dtype=dtypes.int32, shape=tuple())
+      index_start = constant_op.constant(1)
+      sum_start = constant_op.constant(0)
+      _, result = control_flow_ops.while_loop(
+          cond=lambda i, unused_s: i <= max_index,
+          body=lambda i, s: (i + 1, s + i),
+          loop_vars=[index_start, sum_start])
+    copied_graph = ops.Graph()
+    _, copy_info = ge.copy(
+        graph, dst_graph=copied_graph, dst_scope="imported")
+    copied_result = copy_info.transformed(result)
+    copied_max_index = copy_info.transformed(max_index)
+    with copied_graph.as_default():
+      with session.Session() as sess:
+        n = 10
+        sum_val = sess.run(copied_result, feed_dict={copied_max_index: n})
+        self.assertEqual(sum_val, 55)
+
+  def test_graph_cond(self):
+    graph = ops.Graph()
+    with graph.as_default():
+      choice = array_ops.placeholder(shape=(), dtype=dtypes.bool)
+      result = control_flow_ops.cond(
+          choice,
+          lambda: constant_op.constant(1),
+          lambda: constant_op.constant(2))
+    copied_graph = ops.Graph()
+    _, copy_info = ge.copy(
+        graph, dst_graph=copied_graph, dst_scope="imported")
+    copied_result = copy_info.transformed(result)
+    copied_choice = copy_info.transformed(choice)
+    with copied_graph.as_default():
+      with session.Session() as sess:
+        res = sess.run(copied_result, feed_dict={copied_choice: True})
+        self.assertEqual(res, 1)
+        res = sess.run(copied_result, feed_dict={copied_choice: False})
+        self.assertEqual(res, 2)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/graph_editor/transform.py b/tensorflow/contrib/graph_editor/transform.py
index 14ac5296657d48c7f9e94d220c9e7e28af4d4353..d8a48387a745e7d88cc6a74c96cb21a2ba1cfa1f 100644
--- a/tensorflow/contrib/graph_editor/transform.py
+++ b/tensorflow/contrib/graph_editor/transform.py
@@ -129,20 +129,26 @@ def transform_op_if_inside_handler(info, op, keep_if_possible=True):
       return None
 
 
-def copy_op_handler(info, op, copy_shape=True):
+def copy_op_handler(info, op, new_inputs, copy_shape=True):
   """Copy a `tf.Operation`.
 
   Args:
     info: Transform._TmpInfo instance.
     op: the `tf.Operation` to be copied.
+    new_inputs: The new inputs for this op.
     copy_shape: also copy the shape of the tensor
   Returns:
     A `(op, op_outputs)` tuple containing the transformed op and its outputs.
   """
+  # The `new_inputs` argument was added to this function. For compatibility
+  # reasons, raise an error if `new_inputs` is a boolean.
+  if isinstance(new_inputs, bool):
+    raise TypeError("the `new_inputs` argument must be an iterable.")
+
   # pylint: disable=protected-access
 
   # Clone the node def:
-  node_def_ = deepcopy(op._node_def)
+  node_def_ = deepcopy(op.node_def)
 
   # Transform name:
   name_ = info.new_name(op.name)
@@ -155,10 +161,10 @@ def copy_op_handler(info, op, copy_shape=True):
 
   # Make a copy of the op_def too.
   # Its unique to every _type_ of Operation.
-  op_def_ = deepcopy(op._op_def)
+  op_def_ = deepcopy(op.op_def)
 
   # Initialize a new Operation instance
-  op_ = tf_ops.Operation(node_def_, info.graph_, [], output_types_,
+  op_ = tf_ops.Operation(node_def_, info.graph_, new_inputs, output_types_,
                          [], input_types_, None, op_def_)
 
   # copy the shape over
@@ -170,6 +176,7 @@ def copy_op_handler(info, op, copy_shape=True):
   # attribute to exist, we will create a dummy original_op first and then
   # later finalise it with the actual original_op when all the ops have
   # been copied.
+  # TODO(fkp): Stop worrying about _original_op and remove this code?
   if op._original_op:
     op_._original_op = op._original_op
 
@@ -328,6 +335,14 @@ class _TmpInfo(object):
                             for key in self.graph.get_all_collection_keys())
     self.cyclic_ops = []
     self.transform_original_op_handler = transform_op_if_inside_handler
+    # The graph is transformed op by op, in the same order the original ops
+    # were created. However, this is sometimes not possible due to cycles
+    # (e.g. while loops). So when the transformer creates a new op whose
+    # inputs do not exist yet, temporary placeholders are created and stored
+    # in this `tmp_cyclic_ts` container. During a second pass,
+    # those temporary tensors are replaced by the proper transformed tensors
+    # (see the function `_finalize_cycles`).
+    self.tmp_cyclic_ts = []
 
   def new_name(self, name):
     """Compute a destination name from a source name.
@@ -428,10 +443,10 @@ class Transformer(object):
 
     # Create temporary info used during this transform call
     info = _TmpInfo(sgv, dst_graph, dst_scope, src_scope)
-    info.transform_original_op_handler = self.transform_original_op_handler
 
     self._copy_ops(info)
-    self._connect_ops(info)
+    self._finalize_cycles(info)
+    self._connect_control_inputs(info)
 
     # Compute information about the transformation
     res_info = TransformerInfo(info)
@@ -440,10 +455,10 @@ class Transformer(object):
 
   def _copy_ops(self, info):
     """Copy ops without connecting them."""
-    for op in info.sgv.ops:
-      logging.debug("Copying op: %s", op.name)
-      # TODO(fkp): return a subgraph?
-      op_, op_outputs_ = self.transform_op_handler(info, op)
+    sorted_ops = sorted(info.sgv.ops, key=lambda op: op._id)  # pylint: disable=protected-access
+    for op in sorted_ops:
+      new_inputs = [self._transformed_t(info, t, op) for t in op.inputs]
+      op_, op_outputs_ = self.transform_op_handler(info, op, new_inputs)
       if op is op_:
         raise ValueError("In-place transformation not allowed.")
 
@@ -456,27 +471,36 @@ class Transformer(object):
         info.transformed_ts[op_output] = op_output_
         self.assign_collections_handler(info, op_output, op_output_)
 
-  def _connect_ops(self, info):
+  def _finalize_cycles(self, info):
+    """Reconnects the cyclic tensors."""
+    for t, tmp_t_, consumer_op in info.tmp_cyclic_ts:
+      if t not in info.transformed_ts:
+        raise ValueError("The tensor {} should be transformed by now.".format(
+            t.name))
+      if consumer_op not in info.transformed_ops:
+        raise ValueError("The op {} should be transformed by now.".format(
+            consumer_op.name))
+      t_ = info.transformed_ts[t]
+      consumer_op_ = info.transformed_ops[consumer_op]
+      t_index_ = list(consumer_op_.inputs).index(tmp_t_)
+      consumer_op_._update_input(t_index_, t_, update_dtype=False)  # pylint: disable=protected-access
+
+  def _connect_control_inputs(self, info):
     """Connect the previously copied ops."""
     for op in info.sgv.ops:
-      logging.debug("Finalizing op: %s", op.name)
+      logging.debug("Connecting control inputs of op: %s", op.name)
       op_ = info.transformed_ops[op]
 
-      # pylint: disable=protected-access
-      if op_.inputs:
-        raise ValueError("The newly transformed op should not have "
-                         "any inputs yet: {}".format(op_.name))
-      inputs_ = [self._transformed_t(info, t) for t in op.inputs]
-      for t in inputs_:
-        op_._add_input(t)
-
       # Finalize original op.
+      # TODO(fkp): Stop worrying about _original_op and remove this code?
+      # pylint: disable=protected-access
       if op._original_op:
-        original_op = info.transform_original_op_handler(info, op._original_op)
+        original_op = self.transform_original_op_handler(info, op._original_op)
         if original_op is None:
           logging.debug("Could not find original op for: %s", op_.name)
         else:
           op_._original_op = original_op
+      # pylint: enable=protected-access
 
       # Finalize control inputs:
       control_inputs_ = [self.transform_control_input_handler(info, ci)
@@ -525,19 +549,38 @@ class Transformer(object):
 
     return sgv_.remap(input_map_, output_map_)
 
-  def _transformed_t(self, info, t):
+  def _transformed_t(self, info, t, consumer_op):
     """Return tre transformed tensor of `t`."""
-    if t not in info.transformed_ts:
-      # If op is not in the subgraph.
-      if t in info.sgv_inputs_set:
-        # t is an input of the subgraph.
-        return self.transform_external_input_handler(info, t)
+    if t in info.transformed_ts:
+      # If op is in the subgraph, just return its transformed counterpart.
+      return info.transformed_ts[t]
+
+    if t in info.sgv_inputs_set:
+      # `t` is an input of the subgraph.
+      return self.transform_external_input_handler(info, t)
+    elif t.op in info.ops:
+      # `t` is an internal tensor but is not transformed yet because it
+      # belongs to a graph cycle.
+      logging.debug("Cyclic tensor: t.name = %s", t.name)
+      # Try to find an existing tensor we can use for now,
+      # otherwise create one. We'll rewire this later.
+      if consumer_op.type == "Merge":
+        first_input = consumer_op.inputs[0]
+        tmp_t_ = self._transformed_t(info, first_input, consumer_op)
+      elif t.op.type == "Enter":
+        enter_input = t.op.inputs[0]
+        tmp_t_ = self._transformed_t(info, enter_input, consumer_op)
       else:
-        # t is a hidden input of the subgraph.
-        return self.transform_external_hidden_input_handler(info, t)
+        with info.graph_.as_default():
+          tmp_t_ = util.make_placeholder_from_tensor(t, scope=info.scope_,
+                                                     prefix="geph_tmp")
+        logging.debug("Created temporary placeholder: %s.", tmp_t_.name)
+      # Register as temporary and return.
+      info.tmp_cyclic_ts.append((t, tmp_t_, consumer_op))
+      return tmp_t_
     else:
-      # If op is in the subgraph, just return its transformed.
-      return info.transformed_ts[t]
+      # `t` is a hidden input of the subgraph.
+      return self.transform_external_hidden_input_handler(info, t)
 
 
 def copy(sgv, dst_graph=None, dst_scope="", src_scope="",
@@ -624,6 +667,40 @@ def copy_with_input_replacements(sgv, replacement_ts,
       sgv, dst_graph, dst_scope, src_scope, reuse_dst_scope=reuse_dst_scope)
 
 
+def _add_control_flow_ops(ops, control_ios):
+  """Complete `ops` so that the transformed graph is valid.
+
+  Partially copying a graph can lead to a malformed graph. For instance,
+  copying half of a while construct is likely to result in an invalid graph.
+  This function attempts to add missing ops so that the transformation results
+  in a valid graph.
+
+  Args:
+    ops: list of ops (modified in-place).
+    control_ios: object created by a call to `util.ControlOutputs`.
+  """
+  # Find control flow contexts.
+  control_flow_contexts = set()
+  for op in ops:
+    cfc = op._control_flow_context  # pylint: disable=protected-access
+    if cfc:
+      control_flow_contexts.add(cfc)
+  # Find new ops.
+  new_ops = []
+  for cfc in control_flow_contexts:
+    if cfc.IsWhileContext():
+      new_ops += select.get_walks_intersection_ops(
+          [enter_t.op for enter_t in cfc.loop_enters],
+          [exit_t.op for exit_t in cfc.loop_exits],
+          control_ios=control_ios)
+  # Add new ops.
+  new_ops_set = set(new_ops)
+  ops_set = frozenset(ops)
+  for op in new_ops_set:
+    if op not in ops_set:
+      ops.append(op)
+
+
 def graph_replace(target_ts, replacement_ts, dst_scope="",
                   src_scope="", reuse_dst_scope=False):
   """Create a new graph which compute the targets from the replaced Tensors.
@@ -657,8 +734,13 @@ def graph_replace(target_ts, replacement_ts, dst_scope="",
                                           control_ios=control_ios)
   if not ops:
     raise ValueError("Targets and replacements are not connected!")
+
+  # Complete ops to avoid malformed control flow.
+  # TODO(fkp): Consider moving this function deeper (in the transformer?).
+  _add_control_flow_ops(ops, control_ios)
+
   # Create a copy of the relevant subgraph
-  _, info = copy_with_input_replacements(
+  unused_sgv_, info = copy_with_input_replacements(
       ops, replacement_ts, None, dst_scope, src_scope, reuse_dst_scope)
   # Return the transformed targets but keep the original if the transformed
   # counterpart cannot be found
diff --git a/tensorflow/contrib/graph_editor/util.py b/tensorflow/contrib/graph_editor/util.py
index 30bc33b9ee42ba78bc7307c67c0fc0af9f3356ef..584f4509ccc0aab30edc2be3bad7a9cb938d6e6a 100644
--- a/tensorflow/contrib/graph_editor/util.py
+++ b/tensorflow/contrib/graph_editor/util.py
@@ -38,6 +38,11 @@ __all__ = [
 ]
 
 
+# The graph editor sometimes needs to create placeholders; they are named
+# "geph_*". "geph" stands for Graph-Editor PlaceHolder.
+_DEFAULT_PLACEHOLDER_PREFIX = "geph"
+
+
 def concatenate_unique(la, lb):
   """Add all the elements of `lb` to `la` if they are not there already.
 
@@ -405,7 +410,7 @@ def scope_basename(scope):
   return scope[slash + 1:]
 
 
-def placeholder_name(t=None, scope=None):
+def placeholder_name(t=None, scope=None, prefix=_DEFAULT_PLACEHOLDER_PREFIX):
   """Create placeholder name for the graph editor.
 
   Args:
@@ -413,6 +418,7 @@ def placeholder_name(t=None, scope=None):
       on
     scope: absolute scope with which to prefix the placeholder's name. None
       means that the scope of t is preserved. "" means the root scope.
+    prefix: placeholder name prefix.
   Returns:
     A new placeholder name prefixed by "geph". Note that "geph" stands for
       Graph Editor PlaceHolder. This convention allows to quickly identify the
@@ -430,19 +436,20 @@ def placeholder_name(t=None, scope=None):
     if scope is None:
       scope = op_dirname
 
-    if op_basename.startswith("geph__"):
+    if op_basename.startswith("{}__".format(prefix)):
       ph_name = op_basename
     else:
-      ph_name = "geph__{}_{}".format(op_basename, t.value_index)
+      ph_name = "{}__{}_{}".format(prefix, op_basename, t.value_index)
 
     return scope + ph_name
   else:
     if scope is None:
       scope = ""
-    return scope + "geph"
+    return "{}{}".format(scope, prefix)
 
 
-def make_placeholder_from_tensor(t, scope=None):
+def make_placeholder_from_tensor(t, scope=None,
+                                 prefix=_DEFAULT_PLACEHOLDER_PREFIX):
   """Create a `tf.placeholder` for the Graph Editor.
 
   Note that the correct graph scope must be set by the calling function.
@@ -452,17 +459,19 @@ def make_placeholder_from_tensor(t, scope=None):
       (see function placeholder_name).
     scope: absolute scope within which to create the placeholder. None
       means that the scope of `t` is preserved. `""` means the root scope.
+    prefix: placeholder name prefix.
   Returns:
     A newly created `tf.placeholder`.
   Raises:
     TypeError: if `t` is not `None` or a `tf.Tensor`.
   """
   return tf_array_ops.placeholder(
-      dtype=t.dtype, shape=t.get_shape(), name=placeholder_name(
-          t, scope=scope))
+      dtype=t.dtype, shape=t.get_shape(),
+      name=placeholder_name(t, scope=scope, prefix=prefix))
 
 
-def make_placeholder_from_dtype_and_shape(dtype, shape=None, scope=None):
+def make_placeholder_from_dtype_and_shape(dtype, shape=None, scope=None,
+                                          prefix=_DEFAULT_PLACEHOLDER_PREFIX):
   """Create a tf.placeholder for the Graph Editor.
 
   Note that the correct graph scope must be set by the calling function.
@@ -474,11 +483,13 @@ def make_placeholder_from_dtype_and_shape(dtype, shape=None, scope=None):
     shape: the tensor shape (optional).
     scope: absolute scope within which to create the placeholder. None
       means that the scope of t is preserved. "" means the root scope.
+    prefix: placeholder name prefix.
   Returns:
     A newly created tf.placeholder.
   """
   return tf_array_ops.placeholder(
-      dtype=dtype, shape=shape, name=placeholder_name(scope=scope))
+      dtype=dtype, shape=shape,
+      name=placeholder_name(scope=scope, prefix=prefix))
 
 
 _INTERNAL_VARIABLE_RE = re.compile(r"^__\w+__$")
diff --git a/tensorflow/contrib/grid_rnn/python/ops/grid_rnn_cell.py b/tensorflow/contrib/grid_rnn/python/ops/grid_rnn_cell.py
index 252788140f8c1906718c150574b963385b6ecfa1..bcd2a34c4e791a2ab66a439109145d6b78c14e22 100644
--- a/tensorflow/contrib/grid_rnn/python/ops/grid_rnn_cell.py
+++ b/tensorflow/contrib/grid_rnn/python/ops/grid_rnn_cell.py
@@ -110,7 +110,7 @@ class GridRNNCell(rnn.RNNCell):
       logging.warning('%s: Using a concatenated state is slower and will '
                       'soon be deprecated.  Use state_is_tuple=True.', self)
     if not output_is_tuple:
-      logging.warning('%s: Using a concatenated output is slower and will'
+      logging.warning('%s: Using a concatenated output is slower and will '
                       'soon be deprecated.  Use output_is_tuple=True.', self)
 
     if num_dims < 1:
diff --git a/tensorflow/contrib/image/BUILD b/tensorflow/contrib/image/BUILD
index 3ff02e085ee63fabf42b3cc4389f4605455f3800..79eb3762edbc17e5c4682ac42dff87ae423bddfe 100755
--- a/tensorflow/contrib/image/BUILD
+++ b/tensorflow/contrib/image/BUILD
@@ -78,7 +78,10 @@ tf_custom_op_py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
+        ":dense_image_warp_py",
         ":image_ops",
+        ":interpolate_spline_py",
+        ":sparse_image_warp_py",
         "//tensorflow/contrib/util:util_py",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:common_shapes",
@@ -194,6 +197,117 @@ cuda_py_test(
     ],
 )
 
+py_library(
+    name = "dense_image_warp_py",
+    srcs = [
+        "python/ops/dense_image_warp.py",
+    ],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/contrib/util:util_py",
+        "//tensorflow/python:platform",
+        "//tensorflow/python:util",
+        "//third_party/py/numpy",
+    ],
+)
+
+py_library(
+    name = "interpolate_spline_py",
+    srcs = [
+        "python/ops/interpolate_spline.py",
+    ],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/contrib/util:util_py",
+        "//tensorflow/python:platform",
+        "//tensorflow/python:util",
+    ],
+)
+
+py_library(
+    name = "sparse_image_warp_py",
+    srcs = [
+        "python/ops/sparse_image_warp.py",
+    ],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":dense_image_warp_py",
+        ":interpolate_spline_py",
+        "//tensorflow/contrib/util:util_py",
+        "//tensorflow/python:platform",
+        "//tensorflow/python:util",
+    ],
+)
+
+cuda_py_test(
+    name = "sparse_image_warp_test",
+    size = "medium",
+    srcs = ["python/kernel_tests/sparse_image_warp_test.py"],
+    additional_deps = [
+        ":sparse_image_warp_py",
+        "//third_party/py/numpy",
+        "//tensorflow/python:client",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:clip_ops",
+        "//tensorflow/python:io_ops",
+        "//tensorflow/python:platform_test",
+        "//tensorflow/python:random_ops",
+        "//tensorflow/python:image_ops",
+        "//tensorflow/python:variables",
+        "//tensorflow/core:protos_all_py",
+    ],
+    data = [":sparse_image_warp_test_data"],
+    tags = ["no_pip"],
+)
+
+filegroup(
+    name = "sparse_image_warp_test_data",
+    srcs = glob(["python/kernel_tests/test_data/*.png"]),
+)
+
+cuda_py_test(
+    name = "dense_image_warp_test",
+    size = "medium",
+    srcs = ["python/kernel_tests/dense_image_warp_test.py"],
+    additional_deps = [
+        ":dense_image_warp_py",
+        "//third_party/py/numpy",
+        "//tensorflow/python:client",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:clip_ops",
+        "//tensorflow/python:io_ops",
+        "//tensorflow/python:platform_test",
+        "//tensorflow/python:random_ops",
+        "//tensorflow/python:image_ops",
+        "//tensorflow/python:variables",
+        "//tensorflow/core:protos_all_py",
+    ],
+)
+
+cuda_py_test(
+    name = "interpolate_spline_test",
+    size = "medium",
+    srcs = ["python/kernel_tests/interpolate_spline_test.py"],
+    additional_deps = [
+        ":interpolate_spline_py",
+        "//third_party/py/numpy",
+        "//tensorflow/python:client",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:clip_ops",
+        "//tensorflow/python:io_ops",
+        "//tensorflow/python:platform_test",
+        "//tensorflow/python:image_ops",
+        "//tensorflow/python:variables",
+        "//tensorflow/core:protos_all_py",
+    ],
+)
+
 tf_py_test(
     name = "segmentation_test",
     size = "medium",
diff --git a/tensorflow/contrib/image/__init__.py b/tensorflow/contrib/image/__init__.py
index cc8ed117ba2edcc7a53e609381166f17a2fbb45e..e982030bc8959309e72d0f4e02b9755c48535a10 100755
--- a/tensorflow/contrib/image/__init__.py
+++ b/tensorflow/contrib/image/__init__.py
@@ -30,6 +30,9 @@ projective transforms (including rotation) are supported.
 @@transform
 @@translate
 @@translations_to_projective_transforms
+@@dense_image_warp
+@@interpolate_spline
+@@sparse_image_warp
 
 ## Image Segmentation `Ops`
 
@@ -47,6 +50,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.contrib.image.python.ops.dense_image_warp import dense_image_warp
+
 from tensorflow.contrib.image.python.ops.distort_image_ops import adjust_hsv_in_yiq
 from tensorflow.contrib.image.python.ops.distort_image_ops import random_hsv_in_yiq
 
@@ -57,7 +62,9 @@ from tensorflow.contrib.image.python.ops.image_ops import rotate
 from tensorflow.contrib.image.python.ops.image_ops import transform
 from tensorflow.contrib.image.python.ops.image_ops import translate
 from tensorflow.contrib.image.python.ops.image_ops import translations_to_projective_transforms
+from tensorflow.contrib.image.python.ops.interpolate_spline import interpolate_spline
 from tensorflow.contrib.image.python.ops.single_image_random_dot_stereograms import single_image_random_dot_stereograms
+from tensorflow.contrib.image.python.ops.sparse_image_warp import sparse_image_warp
 
 from tensorflow.python.util.all_util import remove_undocumented
 
diff --git a/tensorflow/contrib/image/kernels/segmentation_ops.cc b/tensorflow/contrib/image/kernels/segmentation_ops.cc
index fe8bf6e21c7b7310527668324571774e8bc50893..93722896233f0278c6cbb44af7203345e58c3172 100644
--- a/tensorflow/contrib/image/kernels/segmentation_ops.cc
+++ b/tensorflow/contrib/image/kernels/segmentation_ops.cc
@@ -101,8 +101,8 @@ struct ImageConnectedComponentsFunctor<CPUDevice, T> {
       int cost = (union_find.block_height() + union_find.block_width()) * 20;
       Shard(worker_threads->num_threads, worker_threads->workers,
             num_images * num_blocks_vertically * num_blocks_horizontally, cost,
-            [&union_find, num_images, num_blocks_vertically,
-             num_blocks_horizontally](int64 start_block, int64 limit_block) {
+            [&union_find, num_blocks_vertically, num_blocks_horizontally](
+                int64 start_block, int64 limit_block) {
               for (int64 i = start_block; i < limit_block; i++) {
                 int64 block_x = i % num_blocks_horizontally;
                 int64 block_y =
diff --git a/tensorflow/contrib/image/python/kernel_tests/dense_image_warp_test.py b/tensorflow/contrib/image/python/kernel_tests/dense_image_warp_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..a58b6a247ed6ae252db25a12f1e47c08c9a5c147
--- /dev/null
+++ b/tensorflow/contrib/image/python/kernel_tests/dense_image_warp_test.py
@@ -0,0 +1,267 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for dense_image_warp."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import math
+import numpy as np
+
+from tensorflow.contrib.image.python.ops import dense_image_warp
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gradients
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import random_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import googletest
+
+from tensorflow.python.training import adam
+
+
+class DenseImageWarpTest(test_util.TensorFlowTestCase):
+
+  def setUp(self):
+    np.random.seed(0)
+
+  def test_interpolate_small_grid_ij(self):
+    grid = constant_op.constant(
+        [[0., 1., 2.], [3., 4., 5.], [6., 7., 8.]], shape=[1, 3, 3, 1])
+    query_points = constant_op.constant(
+        [[0., 0.], [1., 0.], [2., 0.5], [1.5, 1.5]], shape=[1, 4, 2])
+    expected_results = np.reshape(np.array([0., 3., 6.5, 6.]), [1, 4, 1])
+
+    interp = dense_image_warp._interpolate_bilinear(grid, query_points)
+
+    with self.test_session() as sess:
+      predicted = sess.run(interp)
+      self.assertAllClose(expected_results, predicted)
+
+  def test_interpolate_small_grid_xy(self):
+    grid = constant_op.constant(
+        [[0., 1., 2.], [3., 4., 5.], [6., 7., 8.]], shape=[1, 3, 3, 1])
+    query_points = constant_op.constant(
+        [[0., 0.], [0., 1.], [0.5, 2.0], [1.5, 1.5]], shape=[1, 4, 2])
+    expected_results = np.reshape(np.array([0., 3., 6.5, 6.]), [1, 4, 1])
+
+    interp = dense_image_warp._interpolate_bilinear(
+        grid, query_points, indexing='xy')
+
+    with self.test_session() as sess:
+      predicted = sess.run(interp)
+      self.assertAllClose(expected_results, predicted)
+
+  def test_interpolate_small_grid_batched(self):
+    grid = constant_op.constant(
+        [[[0., 1.], [3., 4.]], [[5., 6.], [7., 8.]]], shape=[2, 2, 2, 1])
+    query_points = constant_op.constant([[[0., 0.], [1., 0.], [0.5, 0.5]],
+                                         [[0.5, 0.], [1., 0.], [1., 1.]]])
+    expected_results = np.reshape(
+        np.array([[0., 3., 2.], [6., 7., 8.]]), [2, 3, 1])
+
+    interp = dense_image_warp._interpolate_bilinear(grid, query_points)
+
+    with self.test_session() as sess:
+      predicted = sess.run(interp)
+      self.assertAllClose(expected_results, predicted)
+
+  def get_image_and_flow_placeholders(self, shape, image_type, flow_type):
+    batch_size, height, width, numchannels = shape
+    image_shape = [batch_size, height, width, numchannels]
+    flow_shape = [batch_size, height, width, 2]
+
+    tf_type = {
+        'float16': dtypes.half,
+        'float32': dtypes.float32,
+        'float64': dtypes.float64
+    }
+
+    image = array_ops.placeholder(dtype=tf_type[image_type], shape=image_shape)
+
+    flows = array_ops.placeholder(dtype=tf_type[flow_type], shape=flow_shape)
+    return image, flows
+
+  def get_random_image_and_flows(self, shape, image_type, flow_type):
+    batch_size, height, width, numchannels = shape
+    image_shape = [batch_size, height, width, numchannels]
+    image = np.random.normal(size=image_shape)
+    flow_shape = [batch_size, height, width, 2]
+    flows = np.random.normal(size=flow_shape) * 3
+    return image.astype(image_type), flows.astype(flow_type)
+
+  def assert_correct_interpolation_value(self,
+                                         image,
+                                         flows,
+                                         pred_interpolation,
+                                         batch_index,
+                                         y_index,
+                                         x_index,
+                                         low_precision=False):
+    """Assert that the tf interpolation matches hand-computed value."""
+
+    height = image.shape[1]
+    width = image.shape[2]
+    displacement = flows[batch_index, y_index, x_index, :]
+    float_y = y_index - displacement[0]
+    float_x = x_index - displacement[1]
+    floor_y = max(min(height - 2, math.floor(float_y)), 0)
+    floor_x = max(min(width - 2, math.floor(float_x)), 0)
+    ceil_y = floor_y + 1
+    ceil_x = floor_x + 1
+
+    alpha_y = min(max(0.0, float_y - floor_y), 1.0)
+    alpha_x = min(max(0.0, float_x - floor_x), 1.0)
+
+    floor_y = int(floor_y)
+    floor_x = int(floor_x)
+    ceil_y = int(ceil_y)
+    ceil_x = int(ceil_x)
+
+    top_left = image[batch_index, floor_y, floor_x, :]
+    top_right = image[batch_index, floor_y, ceil_x, :]
+    bottom_left = image[batch_index, ceil_y, floor_x, :]
+    bottom_right = image[batch_index, ceil_y, ceil_x, :]
+
+    interp_top = alpha_x * (top_right - top_left) + top_left
+    interp_bottom = alpha_x * (bottom_right - bottom_left) + bottom_left
+    interp = alpha_y * (interp_bottom - interp_top) + interp_top
+    atol = 1e-6
+    rtol = 1e-6
+    if low_precision:
+      atol = 1e-2
+      rtol = 1e-3
+    self.assertAllClose(
+        interp,
+        pred_interpolation[batch_index, y_index, x_index, :],
+        atol=atol,
+        rtol=rtol)
+
+  def check_zero_flow_correctness(self, shape, image_type, flow_type):
+    """Assert using zero flows doesn't change the input image."""
+
+    image, flows = self.get_image_and_flow_placeholders(shape, image_type,
+                                                        flow_type)
+    interp = dense_image_warp.dense_image_warp(image, flows)
+
+    with self.test_session() as sess:
+      rand_image, rand_flows = self.get_random_image_and_flows(
+          shape, image_type, flow_type)
+      rand_flows *= 0
+
+      predicted_interpolation = sess.run(
+          interp, feed_dict={
+              image: rand_image,
+              flows: rand_flows
+          })
+      self.assertAllClose(rand_image, predicted_interpolation)
+
+  def test_zero_flows(self):
+    """Apply check_zero_flow_correctness() for a few sizes and types."""
+
+    shapes_to_try = [[3, 4, 5, 6], [1, 2, 2, 1]]
+    for shape in shapes_to_try:
+      self.check_zero_flow_correctness(
+          shape, image_type='float32', flow_type='float32')
+
+  def check_interpolation_correctness(self,
+                                      shape,
+                                      image_type,
+                                      flow_type,
+                                      num_probes=5):
+    """Interpolate, and then assert correctness for a few query locations."""
+
+    image, flows = self.get_image_and_flow_placeholders(shape, image_type,
+                                                        flow_type)
+    interp = dense_image_warp.dense_image_warp(image, flows)
+    low_precision = image_type == 'float16' or flow_type == 'float16'
+    with self.test_session() as sess:
+      rand_image, rand_flows = self.get_random_image_and_flows(
+          shape, image_type, flow_type)
+
+      pred_interpolation = sess.run(
+          interp, feed_dict={
+              image: rand_image,
+              flows: rand_flows
+          })
+
+      for _ in range(num_probes):
+        batch_index = np.random.randint(0, shape[0])
+        y_index = np.random.randint(0, shape[1])
+        x_index = np.random.randint(0, shape[2])
+
+        self.assert_correct_interpolation_value(
+            rand_image,
+            rand_flows,
+            pred_interpolation,
+            batch_index,
+            y_index,
+            x_index,
+            low_precision=low_precision)
+
+  def test_interpolation(self):
+    """Apply check_interpolation_correctness() for a few sizes and types."""
+
+    shapes_to_try = [[3, 4, 5, 6], [1, 5, 5, 3], [1, 2, 2, 1]]
+    for im_type in ['float32', 'float64', 'float16']:
+      for flow_type in ['float32', 'float64', 'float16']:
+        for shape in shapes_to_try:
+          self.check_interpolation_correctness(shape, im_type, flow_type)
+
+  def test_gradients_exist(self):
+    """Check that backprop can run.
+
+    The correctness of the gradients is assumed, since the forward propagation
+    is tested to be correct and we only use built-in tf ops.
+    However, we perform a simple test to make sure that backprop can actually
+    run. We treat the flows as a tf.Variable and optimize them to minimize
+    the difference between the interpolated image and the input image.
+    """
+
+    batch_size, height, width, numchannels = [4, 5, 6, 7]
+    image_shape = [batch_size, height, width, numchannels]
+    image = random_ops.random_normal(image_shape)
+    flow_shape = [batch_size, height, width, 2]
+    init_flows = np.float32(np.random.normal(size=flow_shape) * 0.25)
+    flows = variables.Variable(init_flows)
+
+    interp = dense_image_warp.dense_image_warp(image, flows)
+    loss = math_ops.reduce_mean(math_ops.square(interp - image))
+
+    optimizer = adam.AdamOptimizer(1.0)
+    grad = gradients.gradients(loss, [flows])
+    opt_func = optimizer.apply_gradients(zip(grad, [flows]))
+    init_op = variables.global_variables_initializer()
+
+    with self.test_session() as sess:
+      sess.run(init_op)
+      for _ in range(10):
+        sess.run(opt_func)
+
+  def test_size_exception(self):
+    """Make sure it throws an exception for images that are too small."""
+
+    shape = [1, 2, 1, 1]
+    msg = 'Should have raised an exception for invalid image size'
+    with self.assertRaises(ValueError, msg=msg):
+      self.check_interpolation_correctness(shape, 'float32', 'float32')
+
+
+if __name__ == '__main__':
+  googletest.main()
diff --git a/tensorflow/contrib/image/python/kernel_tests/interpolate_spline_test.py b/tensorflow/contrib/image/python/kernel_tests/interpolate_spline_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..1939caaa2d8586413cf9ecba6ce73cf64910d6fc
--- /dev/null
+++ b/tensorflow/contrib/image/python/kernel_tests/interpolate_spline_test.py
@@ -0,0 +1,264 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for interpolate_spline."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+from scipy import interpolate as sc_interpolate
+
+from tensorflow.contrib.image.python.ops import interpolate_spline
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
+
+from tensorflow.python.ops import clip_ops
+from tensorflow.python.ops import gradients
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import googletest
+
+from tensorflow.python.training import momentum
+
+
+class _InterpolationProblem(object):
+  """Abstract class for interpolation problem descriptions."""
+
+  def get_problem(self, optimizable=False, extrapolate=True, dtype='float32'):
+    """Make data for an interpolation problem where all x vectors are n-d.
+
+    Args:
+      optimizable: If True, then make train_points a tf.Variable.
+      extrapolate: If False, then clamp the query_points values to be within
+        the max and min of train_points.
+      dtype: The data type to use.
+
+    Returns:
+      query_points, query_values, train_points, train_values: training and
+        test data for the interpolation problem (query_values is a np array).
+    """
+
+    # The values generated here depend on a seed of 0.
+    np.random.seed(0)
+
+    batch_size = 1
+    num_training_points = 10
+    num_query_points = 4
+
+    init_points = np.random.uniform(
+        size=[batch_size, num_training_points, self.DATA_DIM])
+
+    init_points = init_points.astype(dtype)
+    train_points = (
+        variables.Variable(init_points)
+        if optimizable else constant_op.constant(init_points))
+    train_values = self.tf_function(train_points)
+
+    query_points_np = np.random.uniform(
+        size=[batch_size, num_query_points, self.DATA_DIM])
+    query_points_np = query_points_np.astype(dtype)
+    if not extrapolate:
+      query_points_np = np.clip(query_points_np, np.min(init_points),
+                                np.max(init_points))
+
+    query_points = constant_op.constant(query_points_np)
+    query_values = self.np_function(query_points_np)
+
+    return query_points, query_values, train_points, train_values
+
+
+class _QuadraticPlusSinProblem1D(_InterpolationProblem):
+  """1D interpolation problem used for regression testing."""
+  DATA_DIM = 1
+  HARDCODED_QUERY_VALUES = {
+      (1.0, 0.0): [6.2647187603, -7.84362604077, -5.63690142322, 1.42928896387],
+      (1.0,
+       0.01): [6.77688289946, -8.02163669853, -5.79491157027, 1.4063285693],
+      (2.0,
+       0.0): [8.67110264937, -8.41281390883, -5.80190044693, 1.50155606059],
+      (2.0,
+       0.01): [6.70797816797, -7.49709587663, -5.28965776238, 1.52284731741],
+      (3.0,
+       0.0): [9.37691802935, -8.50390141515, -5.80786417426, 1.63467762122],
+      (3.0,
+       0.01): [4.47106304758, -5.71266128361, -3.92529303296, 1.86755293857],
+      (4.0,
+       0.0): [9.58172461111, -8.51432104771, -5.80967675388, 1.63361164256],
+      (4.0, 0.01): [
+          -3.87902711352, -0.0253462273846, 1.79857618022, -0.769339675725
+      ]
+  }
+
+  def np_function(self, x):
+    """Takes np array, evaluates the test function, and returns np array."""
+    return np.sum(
+        np.power((x - 0.5), 3) - 0.25 * x + 10 * np.sin(x * 10),
+        axis=2,
+        keepdims=True)
+
+  def tf_function(self, x):
+    """Takes tf tensor, evaluates the test function, and returns tf tensor."""
+    return math_ops.reduce_sum(
+        math_ops.pow((x - 0.5), 3) - 0.25 * x + 10 * math_ops.sin(x * 10),
+        2,
+        keepdims=True)
+
+
+class _QuadraticPlusSinProblemND(_InterpolationProblem):
+  """3D interpolation problem used for regression testing."""
+
+  DATA_DIM = 3
+  HARDCODED_QUERY_VALUES = {
+      (1.0, 0.0): [1.06609663962, 1.28894849357, 1.10882405595, 1.63966936885],
+      (1.0, 0.01): [1.03123780748, 1.2952930985, 1.10366822954, 1.65265118569],
+      (2.0, 0.0): [0.627787735064, 1.43802857251, 1.00194632358, 1.91667538215],
+      (2.0, 0.01): [0.730159985046, 1.41702471595, 1.0065827217, 1.85758519312],
+      (3.0, 0.0): [0.350460417862, 1.67223539464, 1.00475331246, 2.31580322491],
+      (3.0,
+       0.01): [0.624557250556, 1.63138876667, 0.976588193162, 2.12511237866],
+      (4.0,
+       0.0): [0.898129669986, 1.24434133638, -0.938056116931, 1.59910338833],
+      (4.0,
+       0.01): [0.0930360338179, -3.38791305538, -1.00969032567, 0.745535080382],
+  }
+
+  def np_function(self, x):
+    """Takes np array, evaluates the test function, and returns np array."""
+    return np.sum(
+        np.square(x - 0.5) + 0.25 * x + 1 * np.sin(x * 15),
+        axis=2,
+        keepdims=True)
+
+  def tf_function(self, x):
+    """Takes tf tensor, evaluates the test function, and returns tf tensor."""
+    return math_ops.reduce_sum(
+        math_ops.square(x - 0.5) + 0.25 * x + 1 * math_ops.sin(x * 15),
+        2,
+        keepdims=True)
+
+
+class InterpolateSplineTest(test_util.TensorFlowTestCase):
+
+  def test_1d_linear_interpolation(self):
+    """For 1d linear interpolation, we can compare directly to scipy."""
+
+    tp = _QuadraticPlusSinProblem1D()
+    (query_points, _, train_points, train_values) = tp.get_problem(
+        extrapolate=False, dtype='float64')
+    interpolation_order = 1
+
+    with ops.name_scope('interpolator'):
+      interpolator = interpolate_spline.interpolate_spline(
+          train_points, train_values, query_points, interpolation_order)
+      with self.test_session() as sess:
+        fetches = [query_points, train_points, train_values, interpolator]
+        query_points_, train_points_, train_values_, interp_ = sess.run(fetches)
+
+        # Just look at the first element of the minibatch.
+        # Also, trim the final singleton dimension.
+        interp_ = interp_[0, :, 0]
+        query_points_ = query_points_[0, :, 0]
+        train_points_ = train_points_[0, :, 0]
+        train_values_ = train_values_[0, :, 0]
+
+        # Compute scipy interpolation.
+        scipy_interp_function = sc_interpolate.interp1d(
+            train_points_, train_values_, kind='linear')
+
+        scipy_interpolation = scipy_interp_function(query_points_)
+        scipy_interpolation_on_train = scipy_interp_function(train_points_)
+
+        # Even with float64 precision, the interpolants disagree with scipy a
+        # bit because EPSILON is added to prevent sqrt(0), etc.
+        tol = 1e-3
+
+        self.assertAllClose(
+            train_values_, scipy_interpolation_on_train, atol=tol, rtol=tol)
+        self.assertAllClose(interp_, scipy_interpolation, atol=tol, rtol=tol)
+
+  def test_1d_interpolation(self):
+    """Regression test for interpolation with 1-D points."""
+
+    tp = _QuadraticPlusSinProblem1D()
+    (query_points, _, train_points,
+     train_values) = tp.get_problem(dtype='float64')
+
+    for order in (1, 2, 3):
+      for reg_weight in (0, 0.01):
+        interpolator = interpolate_spline.interpolate_spline(
+            train_points, train_values, query_points, order, reg_weight)
+
+        target_interpolation = tp.HARDCODED_QUERY_VALUES[(order, reg_weight)]
+        target_interpolation = np.array(target_interpolation)
+        with self.test_session() as sess:
+          interp_val = sess.run(interpolator)
+          self.assertAllClose(interp_val[0, :, 0], target_interpolation)
+
+  def test_nd_interpolation(self):
+    """Regression test for interpolation with N-D points."""
+
+    tp = _QuadraticPlusSinProblemND()
+    (query_points, _, train_points,
+     train_values) = tp.get_problem(dtype='float64')
+
+    for order in (1, 2, 3):
+      for reg_weight in (0, 0.01):
+        interpolator = interpolate_spline.interpolate_spline(
+            train_points, train_values, query_points, order, reg_weight)
+
+        target_interpolation = tp.HARDCODED_QUERY_VALUES[(order, reg_weight)]
+        target_interpolation = np.array(target_interpolation)
+        with self.test_session() as sess:
+          interp_val = sess.run(interpolator)
+          self.assertAllClose(interp_val[0, :, 0], target_interpolation)
+
+  def test_interpolation_gradient(self):
+    """Make sure that backprop can run. Correctness of gradients is assumed.
+
+    Here, we use a small 'training' set and a more densely-sampled
+    set of query points, for which we know the true value in advance. The goal
+    is to choose x locations for the training data such that interpolating using
+    this training data yields the best reconstruction for the function
+    values at the query points. The training data locations are optimized
+    iteratively using gradient descent.
+    """
+    tp = _QuadraticPlusSinProblemND()
+    (query_points, query_values, train_points,
+     train_values) = tp.get_problem(optimizable=True)
+
+    regularization = 0.001
+    for interpolation_order in (1, 2, 3, 4):
+      interpolator = interpolate_spline.interpolate_spline(
+          train_points, train_values, query_points, interpolation_order,
+          regularization)
+
+      loss = math_ops.reduce_mean(math_ops.square(query_values - interpolator))
+
+      optimizer = momentum.MomentumOptimizer(0.001, 0.9)
+      grad = gradients.gradients(loss, [train_points])
+      grad, _ = clip_ops.clip_by_global_norm(grad, 1.0)
+      opt_func = optimizer.apply_gradients(zip(grad, [train_points]))
+      init_op = variables.global_variables_initializer()
+
+      with self.test_session() as sess:
+        sess.run(init_op)
+        for _ in range(100):
+          sess.run([loss, opt_func])
+
+
+if __name__ == '__main__':
+  googletest.main()
diff --git a/tensorflow/contrib/image/python/kernel_tests/sparse_image_warp_test.py b/tensorflow/contrib/image/python/kernel_tests/sparse_image_warp_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..0135c66e293693345c3da7fdb21e28ca6d160154
--- /dev/null
+++ b/tensorflow/contrib/image/python/kernel_tests/sparse_image_warp_test.py
@@ -0,0 +1,254 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for sparse_image_warp."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.image.python.ops import sparse_image_warp
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import clip_ops
+from tensorflow.python.ops import gradients
+from tensorflow.python.ops import image_ops
+from tensorflow.python.ops import io_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import googletest
+from tensorflow.python.platform import test
+
+from tensorflow.python.training import momentum
+
+
+class SparseImageWarpTest(test_util.TensorFlowTestCase):
+
+  def setUp(self):
+    np.random.seed(0)
+
+  def testGetBoundaryLocations(self):
+    image_height = 11
+    image_width = 11
+    num_points_per_edge = 4
+    locs = sparse_image_warp._get_boundary_locations(image_height, image_width,
+                                                     num_points_per_edge)
+    num_points = locs.shape[0]
+    self.assertEqual(num_points, 4 + 4 * num_points_per_edge)
+    locs = [(locs[i, 0], locs[i, 1]) for i in range(num_points)]
+    for i in (0, image_height - 1):
+      for j in (0, image_width - 1):
+        self.assertIn((i, j), locs, '{},{} not in the locations'.format(i, j))
+
+    for i in (2, 4, 6, 8):
+      for j in (0, image_width - 1):
+        self.assertIn((i, j), locs, '{},{} not in the locations'.format(i, j))
+
+    for i in (0, image_height - 1):
+      for j in (2, 4, 6, 8):
+        self.assertIn((i, j), locs, '{},{} not in the locations'.format(i, j))
+
+  def testGetGridLocations(self):
+    image_height = 5
+    image_width = 3
+    grid = sparse_image_warp._get_grid_locations(image_height, image_width)
+    for i in range(image_height):
+      for j in range(image_width):
+        self.assertEqual(grid[i, j, 0], i)
+        self.assertEqual(grid[i, j, 1], j)
+
+  def testZeroShift(self):
+    """Run assertZeroShift for various hyperparameters."""
+    for order in (1, 2):
+      for regularization in (0, 0.01):
+        for num_boundary_points in (0, 1):
+          self.assertZeroShift(order, regularization, num_boundary_points)
+
+  def assertZeroShift(self, order, regularization, num_boundary_points):
+    """Check that warping with zero displacements doesn't change the image."""
+    batch_size = 1
+    image_height = 4
+    image_width = 4
+    channels = 3
+
+    image = np.random.uniform(
+        size=[batch_size, image_height, image_width, channels])
+
+    input_image_op = constant_op.constant(np.float32(image))
+
+    control_point_locations = [[1., 1.], [2., 2.], [2., 1.]]
+    control_point_locations = constant_op.constant(
+        np.float32(np.expand_dims(control_point_locations, 0)))
+
+    control_point_displacements = np.zeros(
+        control_point_locations.shape.as_list())
+    control_point_displacements = constant_op.constant(
+        np.float32(control_point_displacements))
+
+    (warped_image_op, flow_field) = sparse_image_warp.sparse_image_warp(
+        input_image_op,
+        control_point_locations,
+        control_point_locations + control_point_displacements,
+        interpolation_order=order,
+        regularization_weight=regularization,
+        num_boundary_points=num_boundary_points)
+
+    with self.test_session() as sess:
+      warped_image, input_image, _ = sess.run(
+          [warped_image_op, input_image_op, flow_field])
+
+      self.assertAllClose(warped_image, input_image)
+
+  def testMoveSinglePixel(self):
+    """Run assertMoveSinglePixel for various hyperparameters and data types."""
+    for order in (1, 2):
+      for num_boundary_points in (1, 2):
+        for type_to_use in (dtypes.float32, dtypes.float64):
+          self.assertMoveSinglePixel(order, num_boundary_points, type_to_use)
+
+  def assertMoveSinglePixel(self, order, num_boundary_points, type_to_use):
+    """Move a single block in a small grid using warping."""
+    batch_size = 1
+    image_height = 7
+    image_width = 7
+    channels = 3
+
+    image = np.zeros([batch_size, image_height, image_width, channels])
+    image[:, 3, 3, :] = 1.0
+    input_image_op = constant_op.constant(image, dtype=type_to_use)
+
+    # Place a control point at the one white pixel.
+    control_point_locations = [[3., 3.]]
+    control_point_locations = constant_op.constant(
+        np.float32(np.expand_dims(control_point_locations, 0)),
+        dtype=type_to_use)
+    # Shift it one pixel to the right.
+    control_point_displacements = [[0., 1.0]]
+    control_point_displacements = constant_op.constant(
+        np.float32(np.expand_dims(control_point_displacements, 0)),
+        dtype=type_to_use)
+
+    (warped_image_op, flow_field) = sparse_image_warp.sparse_image_warp(
+        input_image_op,
+        control_point_locations,
+        control_point_locations + control_point_displacements,
+        interpolation_order=order,
+        num_boundary_points=num_boundary_points)
+
+    with self.test_session() as sess:
+      warped_image, input_image, flow = sess.run(
+          [warped_image_op, input_image_op, flow_field])
+      # Check that it moved the pixel correctly.
+      self.assertAllClose(
+          warped_image[0, 4, 5, :],
+          input_image[0, 4, 4, :],
+          atol=1e-5,
+          rtol=1e-5)
+
+      # Test that there is no flow at the corners.
+      for i in (0, image_height - 1):
+        for j in (0, image_width - 1):
+          self.assertAllClose(
+              flow[0, i, j, :], np.zeros([2]), atol=1e-5, rtol=1e-5)
+
+  def load_image(self, image_file, sess):
+    image_op = image_ops.decode_png(
+        io_ops.read_file(image_file), dtype=dtypes.uint8, channels=4)[:, :, 0:3]
+    return sess.run(image_op)
+
+  def testSmileyFace(self):
+    """Check warping accuracy by comparing to hardcoded warped images."""
+
+    test_data_dir = test.test_src_dir_path('contrib/image/python/'
+                                           'kernel_tests/test_data/')
+    input_file = test_data_dir + 'Yellow_Smiley_Face.png'
+    with self.test_session() as sess:
+      input_image = self.load_image(input_file, sess)
+    control_points = np.asarray([[64, 59], [180 - 64, 59], [39, 111],
+                                 [180 - 39, 111], [90, 143], [58, 134],
+                                 [180 - 58, 134]])  # pyformat: disable
+    control_point_displacements = np.asarray(
+        [[-10.5, 10.5], [10.5, 10.5], [0, 0], [0, 0], [0, -10], [-20, 10.25],
+         [10, 10.75]])
+    control_points_op = constant_op.constant(
+        np.expand_dims(np.float32(control_points[:, [1, 0]]), 0))
+    control_point_displacements_op = constant_op.constant(
+        np.expand_dims(np.float32(control_point_displacements[:, [1, 0]]), 0))
+    float_image = np.expand_dims(np.float32(input_image) / 255, 0)
+    input_image_op = constant_op.constant(float_image)
+
+    for interpolation_order in (1, 2, 3):
+      for num_boundary_points in (0, 1, 4):
+        warp_op, _ = sparse_image_warp.sparse_image_warp(
+            input_image_op,
+            control_points_op,
+            control_points_op + control_point_displacements_op,
+            interpolation_order=interpolation_order,
+            num_boundary_points=num_boundary_points)
+        with self.test_session() as sess:
+          warped_image = sess.run(warp_op)
+          out_image = np.uint8(warped_image[0, :, :, :] * 255)
+          target_file = (
+              test_data_dir +
+              'Yellow_Smiley_Face_Warp-interp' + '-{}-clamp-{}.png'.format(
+                  interpolation_order, num_boundary_points))
+
+          target_image = self.load_image(target_file, sess)
+
+          # Check that the target_image and out_image difference is no
+          # bigger than 2 (on a scale of 0-255). Due to differences in
+          # floating point computation on different devices, the float
+          # output in warped_image may get rounded to a different int
+          # than that in the saved png file loaded into target_image.
+          self.assertAllClose(target_image, out_image, atol=2, rtol=1e-3)
+
+  def testThatBackpropRuns(self):
+    """Run optimization to ensure that gradients can be computed."""
+
+    batch_size = 1
+    image_height = 9
+    image_width = 12
+    image = variables.Variable(
+        np.float32(
+            np.random.uniform(size=[batch_size, image_height, image_width, 3])))
+    control_point_locations = [[3., 3.]]
+    control_point_locations = constant_op.constant(
+        np.float32(np.expand_dims(control_point_locations, 0)))
+    control_point_displacements = [[0.25, -0.5]]
+    control_point_displacements = constant_op.constant(
+        np.float32(np.expand_dims(control_point_displacements, 0)))
+    warped_image, _ = sparse_image_warp.sparse_image_warp(
+        image,
+        control_point_locations,
+        control_point_locations + control_point_displacements,
+        num_boundary_points=3)
+
+    loss = math_ops.reduce_mean(math_ops.abs(warped_image - image))
+    optimizer = momentum.MomentumOptimizer(0.001, 0.9)
+    grad = gradients.gradients(loss, [image])
+    grad, _ = clip_ops.clip_by_global_norm(grad, 1.0)
+    opt_func = optimizer.apply_gradients(zip(grad, [image]))
+    init_op = variables.global_variables_initializer()
+
+    with self.test_session() as sess:
+      sess.run(init_op)
+      for _ in range(5):
+        sess.run([loss, opt_func])
+
+
+if __name__ == '__main__':
+  googletest.main()
diff --git a/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face.png b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face.png
new file mode 100644
index 0000000000000000000000000000000000000000..7e303881e213a82e412d18de9d9d86f368726f06
Binary files /dev/null and b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face.png differ
diff --git a/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-1-clamp-0.png b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-1-clamp-0.png
new file mode 100644
index 0000000000000000000000000000000000000000..7fd9e4e6d69f3120428d1d778846d495cea1a989
Binary files /dev/null and b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-1-clamp-0.png differ
diff --git a/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-1-clamp-1.png b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-1-clamp-1.png
new file mode 100644
index 0000000000000000000000000000000000000000..86d225e5d2158804f88dca881f69ed3ab287d866
Binary files /dev/null and b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-1-clamp-1.png differ
diff --git a/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-1-clamp-4.png b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-1-clamp-4.png
new file mode 100644
index 0000000000000000000000000000000000000000..37e8ffae114625d0cc6a07ab2b8dbbb7413a3829
Binary files /dev/null and b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-1-clamp-4.png differ
diff --git a/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-2-clamp-0.png b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-2-clamp-0.png
new file mode 100644
index 0000000000000000000000000000000000000000..e49b5816120d43a669264915f1b6747606e080e0
Binary files /dev/null and b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-2-clamp-0.png differ
diff --git a/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-2-clamp-1.png b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-2-clamp-1.png
new file mode 100644
index 0000000000000000000000000000000000000000..df3cf2004312ed0ed0ebf1f0340cbfec7fd9ac46
Binary files /dev/null and b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-2-clamp-1.png differ
diff --git a/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-2-clamp-4.png b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-2-clamp-4.png
new file mode 100644
index 0000000000000000000000000000000000000000..e1799a87c8542d7e515b6185d7e8f6f75fe73f3e
Binary files /dev/null and b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-2-clamp-4.png differ
diff --git a/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-3-clamp-0.png b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-3-clamp-0.png
new file mode 100644
index 0000000000000000000000000000000000000000..2c346e0ce5487e21d41aa4e6306fd83a7b4ffdb4
Binary files /dev/null and b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-3-clamp-0.png differ
diff --git a/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-3-clamp-1.png b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-3-clamp-1.png
new file mode 100644
index 0000000000000000000000000000000000000000..6f8b65451cc08a463e4305ddc4be0dbe2879fae9
Binary files /dev/null and b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-3-clamp-1.png differ
diff --git a/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-3-clamp-4.png b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-3-clamp-4.png
new file mode 100644
index 0000000000000000000000000000000000000000..8e78146d955ae8f02230121e6314f3285e87611e
Binary files /dev/null and b/tensorflow/contrib/image/python/kernel_tests/test_data/Yellow_Smiley_Face_Warp-interp-3-clamp-4.png differ
diff --git a/tensorflow/contrib/image/python/ops/dense_image_warp.py b/tensorflow/contrib/image/python/ops/dense_image_warp.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9b219ada492466919c615d8978e462e6c619d33
--- /dev/null
+++ b/tensorflow/contrib/image/python/ops/dense_image_warp.py
@@ -0,0 +1,201 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Image warping using per-pixel flow vectors."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+
+
+def _interpolate_bilinear(grid,
+                          query_points,
+                          name='interpolate_bilinear',
+                          indexing='ij'):
+  """Similar to Matlab's interp2 function.
+
+  Finds values for query points on a grid using bilinear interpolation.
+
+  Args:
+    grid: a 4-D float `Tensor` of shape `[batch, height, width, channels]`.
+    query_points: a 3-D float `Tensor` of N points with shape `[batch, N, 2]`.
+    name: a name for the operation (optional).
+    indexing: whether the query points are specified as row and column (ij),
+      or Cartesian coordinates (xy).
+
+  Returns:
+    values: a 3-D `Tensor` with shape `[batch, N, channels]`
+
+  Raises:
+    ValueError: if the indexing mode is invalid, or if the shape of the inputs
+      is invalid.
+  """
+  if indexing != 'ij' and indexing != 'xy':
+    raise ValueError('Indexing mode must be \'ij\' or \'xy\'')
+
+  with ops.name_scope(name):
+    grid = ops.convert_to_tensor(grid)
+    query_points = ops.convert_to_tensor(query_points)
+    shape = grid.get_shape().as_list()
+    if len(shape) != 4:
+      msg = 'Grid must be 4 dimensional. Received size: '
+      raise ValueError(msg + str(grid.get_shape()))
+
+    batch_size, height, width, channels = shape
+    query_type = query_points.dtype
+    grid_type = grid.dtype
+
+    if (len(query_points.get_shape()) != 3 or
+        query_points.get_shape()[2].value != 2):
+      msg = ('Query points must be 3 dimensional and size 2 in dim 2. Received '
+             'size: ')
+      raise ValueError(msg + str(query_points.get_shape()))
+
+    _, num_queries, _ = query_points.get_shape().as_list()
+
+    if height < 2 or width < 2:
+      msg = 'Grid must be at least batch_size x 2 x 2 in size. Received size: '
+      raise ValueError(msg + str(grid.get_shape()))
+
+    alphas = []
+    floors = []
+    ceils = []
+
+    index_order = [0, 1] if indexing == 'ij' else [1, 0]
+    unstacked_query_points = array_ops.unstack(query_points, axis=2)
+
+    for dim in index_order:
+      with ops.name_scope('dim-' + str(dim)):
+        queries = unstacked_query_points[dim]
+
+        size_in_indexing_dimension = shape[dim + 1]
+
+        # max_floor is size_in_indexing_dimension - 2 so that max_floor + 1
+        # is still a valid index into the grid.
+        max_floor = math_ops.cast(size_in_indexing_dimension - 2, query_type)
+        min_floor = constant_op.constant(0.0, dtype=query_type)
+        floor = math_ops.minimum(
+            math_ops.maximum(min_floor, math_ops.floor(queries)), max_floor)
+        int_floor = math_ops.cast(floor, dtypes.int32)
+        floors.append(int_floor)
+        ceil = int_floor + 1
+        ceils.append(ceil)
+
+        # alpha has the same type as the grid, as we will directly use alpha
+        # when taking linear combinations of pixel values from the image.
+        alpha = math_ops.cast(queries - floor, grid_type)
+        min_alpha = constant_op.constant(0.0, dtype=grid_type)
+        max_alpha = constant_op.constant(1.0, dtype=grid_type)
+        alpha = math_ops.minimum(math_ops.maximum(min_alpha, alpha), max_alpha)
+
+        # Expand alpha to [b, n, 1] so we can use broadcasting
+        # (since the alpha values don't depend on the channel).
+        alpha = array_ops.expand_dims(alpha, 2)
+        alphas.append(alpha)
+
+    if batch_size * height * width > np.iinfo(np.int32).max / 8:
+      error_msg = """The image size or batch size is sufficiently large
+                     that the linearized addresses used by array_ops.gather
+                     may exceed the int32 limit."""
+      raise ValueError(error_msg)
+
+    flattened_grid = array_ops.reshape(grid,
+                                       [batch_size * height * width, channels])
+    batch_offsets = array_ops.reshape(
+        math_ops.range(batch_size) * height * width, [batch_size, 1])
+
+    # This wraps array_ops.gather. We reshape the image data such that the
+    # batch, y, and x coordinates are pulled into the first dimension.
+    # Then we gather. Finally, we reshape the output back. It's possible this
+    # code would be made simpler by using array_ops.gather_nd.
+    def gather(y_coords, x_coords, name):
+      with ops.name_scope('gather-' + name):
+        linear_coordinates = batch_offsets + y_coords * width + x_coords
+        gathered_values = array_ops.gather(flattened_grid, linear_coordinates)
+        return array_ops.reshape(gathered_values,
+                                 [batch_size, num_queries, channels])
+
+    # grab the pixel values in the 4 corners around each query point
+    top_left = gather(floors[0], floors[1], 'top_left')
+    top_right = gather(floors[0], ceils[1], 'top_right')
+    bottom_left = gather(ceils[0], floors[1], 'bottom_left')
+    bottom_right = gather(ceils[0], ceils[1], 'bottom_right')
+
+    # now, do the actual interpolation
+    with ops.name_scope('interpolate'):
+      interp_top = alphas[1] * (top_right - top_left) + top_left
+      interp_bottom = alphas[1] * (bottom_right - bottom_left) + bottom_left
+      interp = alphas[0] * (interp_bottom - interp_top) + interp_top
+
+    return interp
+
+
+def dense_image_warp(image, flow, name='dense_image_warp'):
+  """Image warping using per-pixel flow vectors.
+
+  Apply a non-linear warp to the image, where the warp is specified by a dense
+  flow field of offset vectors that define the correspondences of pixel values
+  in the output image back to locations in the source image. Specifically, the
+  pixel value at output[b, j, i, c] is
+  image[b, j - flow[b, j, i, 0], i - flow[b, j, i, 1], c].
+
+  The locations specified by this formula do not necessarily map to an int
+  index. Therefore, the pixel value is obtained by bilinear
+  interpolation of the 4 nearest pixels around
+  (b, j - flow[b, j, i, 0], i - flow[b, j, i, 1]). For locations outside
+  of the image, we use the nearest pixel values at the image boundary.
+
+
+  Args:
+    image: 4-D float `Tensor` with shape `[batch, height, width, channels]`.
+    flow: A 4-D float `Tensor` with shape `[batch, height, width, 2]`.
+    name: A name for the operation (optional).
+
+    Note that image and flow can be of type tf.half, tf.float32, or tf.float64,
+    and do not necessarily have to be the same type.
+
+  Returns:
+    A 4-D float `Tensor` with shape `[batch, height, width, channels]`
+      and same type as input image.
+
+  Raises:
+    ValueError: if height < 2 or width < 2 or the inputs have the wrong number
+                of dimensions.
+  """
+  with ops.name_scope(name):
+    batch_size, height, width, channels = image.get_shape().as_list()
+    # The flow is defined on the image grid. Turn the flow into a list of query
+    # points in the grid space.
+    grid_x, grid_y = array_ops.meshgrid(
+        math_ops.range(width), math_ops.range(height))
+    stacked_grid = math_ops.cast(
+        array_ops.stack([grid_y, grid_x], axis=2), flow.dtype)
+    batched_grid = array_ops.expand_dims(stacked_grid, axis=0)
+    query_points_on_grid = batched_grid - flow
+    query_points_flattened = array_ops.reshape(query_points_on_grid,
+                                               [batch_size, height * width, 2])
+    # Compute values at the query points, then reshape the result back to the
+    # image grid.
+    interpolated = _interpolate_bilinear(image, query_points_flattened)
+    interpolated = array_ops.reshape(interpolated,
+                                     [batch_size, height, width, channels])
+    return interpolated
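Reviewer sketch (not part of the patch): the clamped bilinear lookup that `dense_image_warp` relies on can be illustrated on a single-channel image stored as nested Python lists. The names `interpolate_bilinear` and `dense_warp` below are hypothetical stand-ins, and the sketch assumes no batch or channel dimension; it only mirrors the floor/alpha clamping described in the docstrings above.

```python
import math


def interpolate_bilinear(grid, y, x):
    """Bilinearly interpolate the 2-D list `grid` at the fractional point (y, x).

    The floor index is clamped to [0, size - 2] and the weight to [0, 1], so
    queries outside the grid resolve to the nearest boundary values.
    """
    height, width = len(grid), len(grid[0])
    if height < 2 or width < 2:
        raise ValueError('Grid must be at least 2 x 2.')

    def split(coord, size):
        # Clamp the floor so that floor + 1 is still a valid index.
        floor = min(max(0, math.floor(coord)), size - 2)
        # Clamp the fractional weight for out-of-range queries.
        alpha = min(max(0.0, coord - floor), 1.0)
        return floor, alpha

    y0, alpha_y = split(y, height)
    x0, alpha_x = split(x, width)

    # Pixel values at the 4 corners around the query point.
    top_left, top_right = grid[y0][x0], grid[y0][x0 + 1]
    bottom_left, bottom_right = grid[y0 + 1][x0], grid[y0 + 1][x0 + 1]

    interp_top = alpha_x * (top_right - top_left) + top_left
    interp_bottom = alpha_x * (bottom_right - bottom_left) + bottom_left
    return alpha_y * (interp_bottom - interp_top) + interp_top


def dense_warp(image, flow):
    """Warp the 2-D list `image`, where flow[j][i] is a (dy, dx) offset."""
    return [[interpolate_bilinear(image, j - flow[j][i][0], i - flow[j][i][1])
             for i in range(len(image[0]))]
            for j in range(len(image))]
```

For example, `interpolate_bilinear([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5)` evaluates to `1.5`, the mean of the four corners.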
diff --git a/tensorflow/contrib/image/python/ops/interpolate_spline.py b/tensorflow/contrib/image/python/ops/interpolate_spline.py
new file mode 100644
index 0000000000000000000000000000000000000000..daf8c56456327f102f1409296a91f9f7b68ec799
--- /dev/null
+++ b/tensorflow/contrib/image/python/ops/interpolate_spline.py
@@ -0,0 +1,291 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Polyharmonic spline interpolation."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import linalg_ops
+from tensorflow.python.ops import math_ops
+
+EPSILON = 0.0000000001
+
+
+def _cross_squared_distance_matrix(x, y):
+  """Pairwise squared distance between two (batch) matrices' rows (2nd dim).
+
+  Computes the pairwise distances between rows of x and rows of y.
+  Args:
+    x: [batch_size, n, d] float `Tensor`
+    y: [batch_size, m, d] float `Tensor`
+
+  Returns:
+    squared_dists: [batch_size, n, m] float `Tensor`, where
+    squared_dists[b,i,j] = ||x[b,i,:] - y[b,j,:]||^2
+  """
+  x_norm_squared = math_ops.reduce_sum(math_ops.square(x), 2)
+  y_norm_squared = math_ops.reduce_sum(math_ops.square(y), 2)
+
+  # Expand so that we can broadcast.
+  x_norm_squared_tile = array_ops.expand_dims(x_norm_squared, 2)
+  y_norm_squared_tile = array_ops.expand_dims(y_norm_squared, 1)
+
+  x_y_transpose = math_ops.matmul(x, y, adjoint_b=True)
+
+  # squared_dists[b,i,j] = ||x_bi - y_bj||^2 = x_bi'x_bi- 2x_bi'y_bj + y_bj'y_bj
+  squared_dists = x_norm_squared_tile - 2 * x_y_transpose + y_norm_squared_tile
+
+  return squared_dists
+
+
+def _pairwise_squared_distance_matrix(x):
+  """Pairwise squared distance among a (batch) matrix's rows (2nd dim).
+
+  This saves a bit of computation vs. using _cross_squared_distance_matrix(x,x)
+
+  Args:
+    x: `[batch_size, n, d]` float `Tensor`
+
+  Returns:
+    squared_dists: `[batch_size, n, n]` float `Tensor`, where
+    squared_dists[b,i,j] = ||x[b,i,:] - x[b,j,:]||^2
+  """
+
+  x_x_transpose = math_ops.matmul(x, x, adjoint_b=True)
+  x_norm_squared = array_ops.matrix_diag_part(x_x_transpose)
+  x_norm_squared_tile = array_ops.expand_dims(x_norm_squared, 2)
+
+  # squared_dists[b,i,j] = ||x_bi - x_bj||^2 = x_bi'x_bi- 2x_bi'x_bj + x_bj'x_bj
+  squared_dists = x_norm_squared_tile - 2 * x_x_transpose + array_ops.transpose(
+      x_norm_squared_tile, [0, 2, 1])
+
+  return squared_dists
+
+
+def _solve_interpolation(train_points, train_values, order,
+                         regularization_weight):
+  """Solve for interpolation coefficients.
+
+  Computes the coefficients of the polyharmonic interpolant for the 'training'
+  data defined by (train_points, train_values) using the kernel phi.
+
+  Args:
+    train_points: `[b, n, d]` interpolation centers
+    train_values: `[b, n, k]` function values
+    order: order of the interpolation
+    regularization_weight: weight to place on smoothness regularization term
+
+  Returns:
+    w: `[b, n, k]` weights on each interpolation center
+    v: `[b, d + 1, k]` weights on each input dimension, plus the bias term
+  """
+
+  b, n, d = train_points.get_shape().as_list()
+  _, _, k = train_values.get_shape().as_list()
+
+  # First, rename variables so that the notation (c, f, w, v, A, B, etc.)
+  # follows https://en.wikipedia.org/wiki/Polyharmonic_spline.
+  # To account for python style guidelines we use
+  # matrix_a for A and matrix_b for B.
+
+  c = train_points
+  f = train_values
+
+  # Next, construct the linear system.
+  with ops.name_scope('construct_linear_system'):
+
+    matrix_a = _phi(_pairwise_squared_distance_matrix(c), order)  # [b, n, n]
+    if regularization_weight > 0:
+      batch_identity_matrix = np.expand_dims(np.eye(n), 0)
+      batch_identity_matrix = constant_op.constant(
+          batch_identity_matrix, dtype=train_points.dtype)
+
+      matrix_a += regularization_weight * batch_identity_matrix
+
+    # Append ones to the feature values for the bias term in the linear model.
+    ones = array_ops.ones([b, n, 1], train_points.dtype)
+    matrix_b = array_ops.concat([c, ones], 2)  # [b, n, d + 1]
+
+    # [b, n + d + 1, n]
+    left_block = array_ops.concat(
+        [matrix_a, array_ops.transpose(matrix_b, [0, 2, 1])], 1)
+
+    num_b_cols = matrix_b.get_shape()[2]  # d + 1
+    lhs_zeros = array_ops.zeros([b, num_b_cols, num_b_cols], train_points.dtype)
+    right_block = array_ops.concat([matrix_b, lhs_zeros],
+                                   1)  # [b, n + d + 1, d + 1]
+    lhs = array_ops.concat([left_block, right_block],
+                           2)  # [b, n + d + 1, n + d + 1]
+
+    rhs_zeros = array_ops.zeros([b, d + 1, k], train_points.dtype)
+    rhs = array_ops.concat([f, rhs_zeros], 1)  # [b, n + d + 1, k]
+
+  # Then, solve the linear system and unpack the results.
+  with ops.name_scope('solve_linear_system'):
+    w_v = linalg_ops.matrix_solve(lhs, rhs)
+    w = w_v[:, :n, :]
+    v = w_v[:, n:, :]
+
+  return w, v
+
+
+def _apply_interpolation(query_points, train_points, w, v, order):
+  """Apply polyharmonic interpolation model to data.
+
+  Given coefficients w and v for the interpolation model, we evaluate
+  interpolated function values at query_points.
+
+  Args:
+    query_points: `[b, m, d]` x values to evaluate the interpolation at
+    train_points: `[b, n, d]` x values that act as the interpolation centers
+                    (the c variables in the Wikipedia article)
+    w: `[b, n, k]` weights on each interpolation center
+    v: `[b, d + 1, k]` weights on each input dimension, plus the bias term
+    order: order of the interpolation
+
+  Returns:
+    Polyharmonic interpolation evaluated at points defined in query_points.
+  """
+
+  batch_size = train_points.get_shape()[0].value
+  num_query_points = query_points.get_shape()[1].value
+
+  # First, compute the contribution from the rbf term.
+  pairwise_dists = _cross_squared_distance_matrix(query_points, train_points)
+  phi_pairwise_dists = _phi(pairwise_dists, order)
+
+  rbf_term = math_ops.matmul(phi_pairwise_dists, w)
+
+  # Then, compute the contribution from the linear term.
+  # Pad query_points with ones, for the bias term in the linear model.
+  query_points_pad = array_ops.concat([
+      query_points,
+      array_ops.ones([batch_size, num_query_points, 1], train_points.dtype)
+  ], 2)
+  linear_term = math_ops.matmul(query_points_pad, v)
+
+  return rbf_term + linear_term
+
+
+def _phi(r, order):
+  """Coordinate-wise nonlinearity used to define the order of the interpolation.
+
+  See https://en.wikipedia.org/wiki/Polyharmonic_spline for the definition.
+
+  Args:
+    r: input op
+    order: interpolation order
+
+  Returns:
+    phi_k evaluated coordinate-wise on r, for k = order
+  """
+
+  # using EPSILON prevents log(0), sqrt(0), etc.
+  # sqrt(0) is well-defined, but its gradient is not
+  with ops.name_scope('phi'):
+    if order == 1:
+      r = math_ops.maximum(r, EPSILON)
+      r = math_ops.sqrt(r)
+      return r
+    elif order == 2:
+      return 0.5 * r * math_ops.log(math_ops.maximum(r, EPSILON))
+    elif order == 4:
+      return 0.5 * math_ops.square(r) * math_ops.log(
+          math_ops.maximum(r, EPSILON))
+    elif order % 2 == 0:
+      r = math_ops.maximum(r, EPSILON)
+      return 0.5 * math_ops.pow(r, 0.5 * order) * math_ops.log(r)
+    else:
+      r = math_ops.maximum(r, EPSILON)
+      return math_ops.pow(r, 0.5 * order)
+
+
+def interpolate_spline(train_points,
+                       train_values,
+                       query_points,
+                       order,
+                       regularization_weight=0.0,
+                       name='interpolate_spline'):
+  r"""Interpolate signal using polyharmonic interpolation.
+
+  The interpolant has the form
+  $$f(x) = \sum_{i = 1}^n w_i \phi(||x - c_i||) + v^T x + b.$$
+
+  This is a sum of two terms: (1) a weighted sum of radial basis function (RBF)
+  terms, with the centers \\(c_1, ... c_n\\), and (2) a linear term with a bias.
+  The \\(c_i\\) vectors are 'training' points. In the code, b is absorbed into v
+  by appending 1 as a final dimension to x. The coefficients w and v are
+  estimated such that the interpolant exactly fits the value of the function at
+  the \\(c_i\\) points, the vector w is orthogonal to each \\(c_i\\), and the
+  vector w sums to 0. With these constraints, the coefficients can be obtained
+  by solving a linear system.
+
+  \\(\phi\\) is an RBF, parametrized by an interpolation
+  order. Using order=2 produces the well-known thin-plate spline.
+
+  We also provide the option to perform regularized interpolation. Here, the
+  interpolant is selected to trade off between the squared loss on the training
+  data and a certain measure of its curvature
+  ([details](https://en.wikipedia.org/wiki/Polyharmonic_spline)).
+  Using a regularization weight greater than zero has the effect that the
+  interpolant will no longer exactly fit the training data. However, it may be
+  less vulnerable to overfitting, particularly for high-order interpolation.
+
+  Note the interpolation procedure is differentiable with respect to all inputs
+  besides the order parameter.
+
+  Args:
+    train_points: `[batch_size, n, d]` float `Tensor` of n d-dimensional
+      locations. These do not need to be regularly-spaced.
+    train_values: `[batch_size, n, k]` float `Tensor` of n k-dimensional values
+      evaluated at train_points.
+    query_points: `[batch_size, m, d]` `Tensor` of m d-dimensional locations
+      where we will output the interpolant's values.
+    order: order of the interpolation. Common values are 1 for
+      \\(\phi(r) = r\\), 2 for \\(\phi(r) = r^2 * log(r)\\) (thin-plate spline),
+       or 3 for \\(\phi(r) = r^3\\).
+    regularization_weight: weight placed on the regularization term.
+      This will depend substantially on the problem, and it should always be
+      tuned. For many problems, it is reasonable to use no regularization.
+      If using a non-zero value, we recommend a small value like 0.001.
+    name: name prefix for ops created by this function
+
+  Returns:
+    `[b, m, k]` float `Tensor` of query values. We use train_points and
+    train_values to perform polyharmonic interpolation. The query values are
+    the values of the interpolant evaluated at the locations specified in
+    query_points.
+  """
+  with ops.name_scope(name):
+    train_points = ops.convert_to_tensor(train_points)
+    train_values = ops.convert_to_tensor(train_values)
+    query_points = ops.convert_to_tensor(query_points)
+
+    # First, fit the spline to the observed data.
+    with ops.name_scope('solve'):
+      w, v = _solve_interpolation(train_points, train_values, order,
+                                  regularization_weight)
+
+    # Then, evaluate the spline at the query locations.
+    with ops.name_scope('predict'):
+      query_values = _apply_interpolation(query_points, train_points, w, v,
+                                          order)
+
+  return query_values
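Reviewer sketch (not part of the patch): the block system that `_solve_interpolation` assembles, `[[A, B], [B^T, 0]] [w; v] = [f; 0]`, can be reproduced in plain Python for a single batch element with scalar values (k = 1). The helpers `fit_spline`, `evaluate_spline`, `solve` and `squared_distance` are hypothetical names introduced for illustration, and a small Gaussian elimination stands in for `linalg_ops.matrix_solve`.

```python
import math

EPSILON = 1e-10  # same guard value as the module-level constant in the patch


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def phi(squared_dist, order):
    """Polyharmonic kernel evaluated on a squared distance, as in `_phi`."""
    r2 = max(squared_dist, EPSILON)
    if order % 2 == 0:
        return 0.5 * r2 ** (0.5 * order) * math.log(r2)
    return r2 ** (0.5 * order)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_spline(centers, values, order, regularization_weight=0.0):
    """Return (w, v): n RBF weights and d + 1 linear weights (bias last)."""
    n, d = len(centers), len(centers[0])
    size = n + d + 1
    lhs = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            lhs[i][j] = phi(squared_distance(centers[i], centers[j]), order)
        lhs[i][i] += regularization_weight
        # Append 1 to each center so the bias is absorbed into v.
        for j, b_ij in enumerate(list(centers[i]) + [1.0]):
            lhs[i][n + j] = b_ij  # block B
            lhs[n + j][i] = b_ij  # block B^T
    solution = solve(lhs, list(values) + [0.0] * (d + 1))
    return solution[:n], solution[n:]


def evaluate_spline(query, centers, w, v, order):
    """Evaluate the fitted interpolant at a single query point."""
    rbf_term = sum(w_i * phi(squared_distance(query, c), order)
                   for w_i, c in zip(w, centers))
    linear_term = sum(v_j * q_j for v_j, q_j in zip(v, list(query) + [1.0]))
    return rbf_term + linear_term
```

When the training values are already an affine function of the centers, the RBF weights come out (numerically) zero and the linear term alone carries the fit, which is a quick sanity check on the block layout.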
diff --git a/tensorflow/contrib/image/python/ops/sparse_image_warp.py b/tensorflow/contrib/image/python/ops/sparse_image_warp.py
new file mode 100644
index 0000000000000000000000000000000000000000..54a215d6db6ded56a1a4a018a7e176f35fe6397e
--- /dev/null
+++ b/tensorflow/contrib/image/python/ops/sparse_image_warp.py
@@ -0,0 +1,201 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Image warping using sparse flow defined at control points."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.image.python.ops import dense_image_warp
+from tensorflow.contrib.image.python.ops import interpolate_spline
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+
+
+def _get_grid_locations(image_height, image_width):
+  """Wrapper for np.meshgrid."""
+
+  y_range = np.linspace(0, image_height - 1, image_height)
+  x_range = np.linspace(0, image_width - 1, image_width)
+  y_grid, x_grid = np.meshgrid(y_range, x_range, indexing='ij')
+  return np.stack((y_grid, x_grid), -1)
+
+
+def _expand_to_minibatch(np_array, batch_size):
+  """Tile arbitrarily-sized np_array to include new batch dimension."""
+  tiles = [batch_size] + [1] * np_array.ndim
+  return np.tile(np.expand_dims(np_array, 0), tiles)
+
+
+def _get_boundary_locations(image_height, image_width, num_points_per_edge):
+  """Compute evenly-spaced indices along edge of image."""
+  y_range = np.linspace(0, image_height - 1, num_points_per_edge + 2)
+  x_range = np.linspace(0, image_width - 1, num_points_per_edge + 2)
+  ys, xs = np.meshgrid(y_range, x_range, indexing='ij')
+  is_boundary = np.logical_or(
+      np.logical_or(xs == 0, xs == image_width - 1),
+      np.logical_or(ys == 0, ys == image_height - 1))
+  return np.stack([ys[is_boundary], xs[is_boundary]], axis=-1)
+
+
+def _add_zero_flow_controls_at_boundary(control_point_locations,
+                                        control_point_flows, image_height,
+                                        image_width, boundary_points_per_edge):
+  """Add control points for zero-flow boundary conditions.
+
+  Augment the set of control points with extra points on the
+  boundary of the image that have zero flow.
+
+  Args:
+    control_point_locations: input control points
+    control_point_flows: their flows
+    image_height: image height
+    image_width: image width
+    boundary_points_per_edge: number of points to add in the middle of each
+                           edge (not including the corners).
+                           The total number of points added is
+                           4 + 4*(boundary_points_per_edge).
+
+  Returns:
+    merged_control_point_locations: augmented set of control point locations
+    merged_control_point_flows: augmented set of control point flows
+  """
+
+  batch_size = control_point_locations.get_shape()[0].value
+
+  boundary_point_locations = _get_boundary_locations(image_height, image_width,
+                                                     boundary_points_per_edge)
+
+  boundary_point_flows = np.zeros([boundary_point_locations.shape[0], 2])
+
+  type_to_use = control_point_locations.dtype
+  boundary_point_locations = constant_op.constant(
+      _expand_to_minibatch(boundary_point_locations, batch_size),
+      dtype=type_to_use)
+
+  boundary_point_flows = constant_op.constant(
+      _expand_to_minibatch(boundary_point_flows, batch_size), dtype=type_to_use)
+
+  merged_control_point_locations = array_ops.concat(
+      [control_point_locations, boundary_point_locations], 1)
+
+  merged_control_point_flows = array_ops.concat(
+      [control_point_flows, boundary_point_flows], 1)
+
+  return merged_control_point_locations, merged_control_point_flows
+
+
+def sparse_image_warp(image,
+                      source_control_point_locations,
+                      dest_control_point_locations,
+                      interpolation_order=2,
+                      regularization_weight=0.0,
+                      num_boundary_points=0,
+                      name='sparse_image_warp'):
+  """Image warping using correspondences between sparse control points.
+
+  Apply a non-linear warp to the image, where the warp is specified by
+  the source and destination locations of a (potentially small) number of
+  control points. First, we use a polyharmonic spline
+  (@{tf.contrib.image.interpolate_spline}) to interpolate the displacements
+  between the corresponding control points to a dense flow field.
+  Then, we warp the image using this dense flow field
+  (@{tf.contrib.image.dense_image_warp}).
+
+  Let t index our control points. For regularization_weight=0, we have:
+  warped_image[b, dest_control_point_locations[b, t, 0],
+                  dest_control_point_locations[b, t, 1], :] =
+  image[b, source_control_point_locations[b, t, 0],
+           source_control_point_locations[b, t, 1], :].
+
+  For regularization_weight > 0, this condition is met approximately, since
+  regularized interpolation trades off smoothness of the interpolant vs.
+  reconstruction of the interpolant at the control points.
+  See @{tf.contrib.image.interpolate_spline} for further documentation of the
+  interpolation_order and regularization_weight arguments.
+
+
+  Args:
+    image: `[batch, height, width, channels]` float `Tensor`
+    source_control_point_locations: `[batch, num_control_points, 2]` float
+      `Tensor`
+    dest_control_point_locations: `[batch, num_control_points, 2]` float
+      `Tensor`
+    interpolation_order: polynomial order used by the spline interpolation
+    regularization_weight: weight on smoothness regularizer in interpolation
+    num_boundary_points: How many zero-flow boundary points to include at
+      each image edge. Usage:
+        num_boundary_points=0: don't add zero-flow points
+        num_boundary_points=1: 4 corners of the image
+        num_boundary_points=2: 4 corners and one in the middle of each edge
+          (8 points total)
+        num_boundary_points=n: 4 corners and n-1 along each edge
+    name: A name for the operation (optional).
+
+    Note that image and offsets can be of type tf.half, tf.float32, or
+    tf.float64, and do not necessarily have to be the same type.
+
+  Returns:
+    warped_image: `[batch, height, width, channels]` float `Tensor` with same
+      type as input image.
+    flow_field: `[batch, height, width, 2]` float `Tensor` containing the dense
+      flow field produced by the interpolation.
+  """
+
+  image = ops.convert_to_tensor(image)
+  source_control_point_locations = ops.convert_to_tensor(
+      source_control_point_locations)
+  dest_control_point_locations = ops.convert_to_tensor(
+      dest_control_point_locations)
+
+  control_point_flows = (
+      dest_control_point_locations - source_control_point_locations)
+
+  clamp_boundaries = num_boundary_points > 0
+  boundary_points_per_edge = num_boundary_points - 1
+
+  with ops.name_scope(name):
+
+    batch_size, image_height, image_width, _ = image.get_shape().as_list()
+
+    # This generates the dense locations where the interpolant
+    # will be evaluated.
+    grid_locations = _get_grid_locations(image_height, image_width)
+
+    flattened_grid_locations = np.reshape(grid_locations,
+                                          [image_height * image_width, 2])
+
+    flattened_grid_locations = constant_op.constant(
+        _expand_to_minibatch(flattened_grid_locations, batch_size), image.dtype)
+
+    if clamp_boundaries:
+      (dest_control_point_locations,
+       control_point_flows) = _add_zero_flow_controls_at_boundary(
+           dest_control_point_locations, control_point_flows, image_height,
+           image_width, boundary_points_per_edge)
+
+    flattened_flows = interpolate_spline.interpolate_spline(
+        dest_control_point_locations, control_point_flows,
+        flattened_grid_locations, interpolation_order, regularization_weight)
+
+    dense_flows = array_ops.reshape(flattened_flows,
+                                    [batch_size, image_height, image_width, 2])
+
+    warped_image = dense_image_warp.dense_image_warp(image, dense_flows)
+
+    return warped_image, dense_flows
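Reviewer sketch (not part of the patch): the boundary-point count stated in the `_add_zero_flow_controls_at_boundary` docstring, 4 + 4 * boundary_points_per_edge, can be checked with a NumPy-free rewrite of `_get_boundary_locations`. The function name `boundary_locations` is hypothetical, and the sketch selects boundary samples by index rather than by comparing coordinate values, which is assumed to be equivalent for images larger than one pixel per side.

```python
def boundary_locations(image_height, image_width, num_points_per_edge):
    """Evenly spaced (y, x) locations on the image border, corners included."""
    steps = num_points_per_edge + 2  # interior points plus the two corners
    ys = [k * (image_height - 1) / (steps - 1) for k in range(steps)]
    xs = [k * (image_width - 1) / (steps - 1) for k in range(steps)]
    last = steps - 1
    # Keep only samples that lie on the first or last row or column.
    return [(y, x)
            for yi, y in enumerate(ys)
            for xi, x in enumerate(xs)
            if yi in (0, last) or xi in (0, last)]


def zero_flow_controls(image_height, image_width, num_boundary_points):
    """Locations and zero flows added for a given `num_boundary_points` > 0."""
    locations = boundary_locations(image_height, image_width,
                                   num_boundary_points - 1)
    flows = [(0.0, 0.0)] * len(locations)
    return locations, flows
```

With this mapping, `num_boundary_points=1` yields only the 4 corners and `num_boundary_points=2` yields 8 points, matching the usage notes in the `sparse_image_warp` docstring.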
diff --git a/tensorflow/contrib/kafka/BUILD b/tensorflow/contrib/kafka/BUILD
index efb403462a6e5df5b69ac0735ffc03f40d4a252c..1c3974871c62911c0cb47677eb92d28286837142 100644
--- a/tensorflow/contrib/kafka/BUILD
+++ b/tensorflow/contrib/kafka/BUILD
@@ -1,66 +1,93 @@
-package(
-    default_visibility = ["//visibility:private"],
-)
+package(default_visibility = ["//tensorflow:internal"])
 
 licenses(["notice"])  # Apache 2.0
 
 exports_files(["LICENSE"])
 
-load("//tensorflow:tensorflow.bzl", "tf_gen_op_libs")
-load("//tensorflow:tensorflow.bzl", "tf_gen_op_wrapper_py")
-load("//tensorflow:tensorflow.bzl", "tf_kernel_library")
-load("//tensorflow:tensorflow.bzl", "tf_py_test")
+load(
+    "//tensorflow:tensorflow.bzl",
+    "tf_gen_op_wrapper_py",
+    "tf_kernel_library",
+    "tf_custom_op_library",
+    "tf_custom_op_py_library",
+    "tf_gen_op_libs",
+    "tf_py_test",
+)
 
-tf_kernel_library(
-    name = "kafka_kernels",
+py_library(
+    name = "kafka",
+    srcs = ["__init__.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":dataset_ops",
+    ],
+)
+
+tf_custom_op_library(
+    name = "_dataset_ops.so",
+    srcs = ["ops/dataset_ops.cc"],
+    deps = [":dataset_kernels"],
+)
+
+tf_gen_op_libs(
+    op_lib_names = ["dataset_ops"],
+)
+
+cc_library(
+    name = "dataset_kernels",
     srcs = ["kernels/kafka_dataset_ops.cc"],
-    visibility = ["//visibility:public"],
     deps = [
-        "//tensorflow/core:framework",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:lib_internal",
-        "//tensorflow/core/kernels:bounds_check_lib",
-        "//tensorflow/core/kernels:dataset",
+        "//tensorflow/core:framework_headers_lib",
         "//third_party/eigen3",
         "@kafka",
+        "@protobuf_archive//:protobuf_headers",
     ],
+    alwayslink = 1,
 )
 
-tf_gen_op_libs(
-    op_lib_names = ["kafka_ops"],
+py_library(
+    name = "dataset_ops",
+    srcs = [
+        "python/ops/kafka_dataset_ops.py",
+    ],
+    srcs_version = "PY2AND3",
     deps = [
-        "//tensorflow/core:lib",
+        ":kafka_op_loader",
+        "//tensorflow/python:dataset_ops_gen",
+        "//tensorflow/python:util",
+        "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow/python/data/util:nest",
     ],
 )
 
 tf_gen_op_wrapper_py(
-    name = "gen_kafka_ops",
-    out = "python/ops/gen_kafka_ops.py",
-    require_shape_functions = True,
-    deps = [":kafka_ops_op_lib"],
+    name = "gen_dataset_ops",
+    out = "python/ops/gen_dataset_ops.py",
+    deps = ["//tensorflow/contrib/kafka:dataset_ops_op_lib"],
 )
 
-py_library(
-    name = "kafka",
-    srcs = [
-        "__init__.py",
-        "python/ops/kafka_dataset_ops.py",
+tf_kernel_library(
+    name = "dataset_ops_kernels",
+    deps = [
+        ":dataset_kernels",
+        "//tensorflow/core:framework",
+    ],
+    alwayslink = 1,
+)
+
+tf_custom_op_py_library(
+    name = "kafka_op_loader",
+    srcs = ["python/ops/kafka_op_loader.py"],
+    dso = ["//tensorflow/contrib/kafka:_dataset_ops.so"],
+    kernels = [
+        ":dataset_ops_kernels",
+        "//tensorflow/contrib/kafka:dataset_ops_op_lib",
     ],
     srcs_version = "PY2AND3",
-    visibility = ["//visibility:public"],
     deps = [
-        ":gen_kafka_ops",
+        ":gen_dataset_ops",
         "//tensorflow/contrib/util:util_py",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:control_flow_ops",
-        "//tensorflow/python:framework",
-        "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:platform",
-        "//tensorflow/python:state_ops",
-        "//tensorflow/python:training",
-        "//tensorflow/python/data/ops:dataset_ops",
-        "//tensorflow/python/data/ops:iterator_ops",
-        "//tensorflow/python/data/ops:readers",
     ],
 )
 
@@ -88,6 +115,7 @@ tf_py_test(
     ],
     tags = [
         "manual",
+        "no_windows",
         "notap",
     ],
 )
@@ -95,7 +123,9 @@ tf_py_test(
 filegroup(
     name = "all_files",
     srcs = glob(
-        ["**/*"],
+        include = [
+            "**/*",
+        ],
         exclude = [
             "**/METADATA",
             "**/OWNERS",
diff --git a/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc b/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc
index 88ef5f357113372b0a2d0cb13382ac980a61252d..a4cd4a2cc4b99b5906185bd2b942ed15c1ddf5e4 100644
--- a/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc
+++ b/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc
@@ -13,9 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/core/kernels/dataset.h"
-
-#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/dataset.h"
 
 #include "src-cpp/rdkafkacpp.h"
 
diff --git a/tensorflow/contrib/kafka/ops/kafka_ops.cc b/tensorflow/contrib/kafka/ops/dataset_ops.cc
similarity index 100%
rename from tensorflow/contrib/kafka/ops/kafka_ops.cc
rename to tensorflow/contrib/kafka/ops/dataset_ops.cc
diff --git a/tensorflow/contrib/kafka/python/ops/kafka_dataset_ops.py b/tensorflow/contrib/kafka/python/ops/kafka_dataset_ops.py
index 8e51d27a342359881de072c3979a2b5a7fc034ea..a1624614d1ab1be31463c5cdc0b4cfb653165a0c 100644
--- a/tensorflow/contrib/kafka/python/ops/kafka_dataset_ops.py
+++ b/tensorflow/contrib/kafka/python/ops/kafka_dataset_ops.py
@@ -17,8 +17,9 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.kafka.python.ops import gen_kafka_ops
-from tensorflow.python.data.ops.readers import Dataset
+from tensorflow.contrib.kafka.python.ops import kafka_op_loader  # pylint: disable=unused-import
+from tensorflow.contrib.kafka.python.ops import gen_dataset_ops
+from tensorflow.python.data.ops.dataset_ops import Dataset
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
@@ -58,8 +59,8 @@ class KafkaDataset(Dataset):
         timeout, dtype=dtypes.int64, name="timeout")
 
   def _as_variant_tensor(self):
-    return gen_kafka_ops.kafka_dataset(self._topics, self._servers, self._group,
-                                       self._eof, self._timeout)
+    return gen_dataset_ops.kafka_dataset(self._topics, self._servers,
+                                         self._group, self._eof, self._timeout)
 
   @property
   def output_classes(self):
diff --git a/tensorflow/contrib/feature_column/python/feature_column/sequential_feature_column.py b/tensorflow/contrib/kafka/python/ops/kafka_op_loader.py
similarity index 75%
rename from tensorflow/contrib/feature_column/python/feature_column/sequential_feature_column.py
rename to tensorflow/contrib/kafka/python/ops/kafka_op_loader.py
index 690a44ff4368663306733300a1ea70397fb93e1e..ec2fdea962ef946d3f8f32b9e630b92649d612fe 100644
--- a/tensorflow/contrib/feature_column/python/feature_column/sequential_feature_column.py
+++ b/tensorflow/contrib/kafka/python/ops/kafka_op_loader.py
@@ -12,8 +12,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Experimental methods for tf.feature_column sequential input."""
-
+"""Python helper for loading kafka ops and kernels."""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+
+from tensorflow.contrib.util import loader
+from tensorflow.python.platform import resource_loader
+
+_dataset_ops = loader.load_op_library(
+    resource_loader.get_path_to_datafile("../../_dataset_ops.so"))
diff --git a/tensorflow/contrib/kfac/python/kernel_tests/BUILD b/tensorflow/contrib/kfac/python/kernel_tests/BUILD
index f4ed978174a9ddd8b54a88e60bfb48a67a2e76d2..146ae8b7e2a3b2b479d5b8db7b8bffaca59a358f 100644
--- a/tensorflow/contrib/kfac/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/kfac/python/kernel_tests/BUILD
@@ -36,6 +36,7 @@ py_test(
     srcs = ["fisher_factors_test.py"],
     srcs_version = "PY2AND3",
     deps = [
+        "//tensorflow/contrib/kfac/python/ops:fisher_blocks",
         "//tensorflow/contrib/kfac/python/ops:fisher_factors",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
diff --git a/tensorflow/contrib/kfac/python/kernel_tests/estimator_test.py b/tensorflow/contrib/kfac/python/kernel_tests/estimator_test.py
index b12f7be76907dc206667eb8ee0c750f3b8db57fc..30c5404e03910eedb48132b0d69b2eabb89a9149 100644
--- a/tensorflow/contrib/kfac/python/kernel_tests/estimator_test.py
+++ b/tensorflow/contrib/kfac/python/kernel_tests/estimator_test.py
@@ -90,59 +90,83 @@ class EstimatorTest(test.TestCase):
   def testEstimatorInitManualRegistration(self):
     with self._graph.as_default():
       # We should be able to build an estimator for only the registered vars.
-      estimator.FisherEstimator(lambda: 0.2, [self.weights], 0.1,
+      estimator.FisherEstimator([self.weights], 0.1, 0.2,
                                 self.layer_collection)
 
       # Check that we throw an error if we try to build an estimator for vars
       # that were not manually registered.
       with self.assertRaises(ValueError):
-        estimator.FisherEstimator(lambda: 0.2, [self.weights, self.bias], 0.1,
-                                  self.layer_collection)
+        est = estimator.FisherEstimator([self.weights, self.bias], 0.1, 0.2,
+                                        self.layer_collection)
+        est.make_ops_and_vars()
 
       # Check that we throw an error if we don't include registered variables,
       # i.e. self.weights
       with self.assertRaises(ValueError):
-        estimator.FisherEstimator(lambda: 0.2, [], 0.1, self.layer_collection)
+        est = estimator.FisherEstimator([], 0.1, 0.2, self.layer_collection)
+        est.make_ops_and_vars()
 
   @test.mock.patch.object(utils.SubGraph, "variable_uses", return_value=42)
   def testVariableWrongNumberOfUses(self, mock_uses):
     with self.assertRaises(ValueError):
-      estimator.FisherEstimator(lambda: 0.2, [self.weights], 0.1,
-                                self.layer_collection)
+      est = estimator.FisherEstimator([self.weights], 0.1, 0.2,
+                                      self.layer_collection)
+      est.make_ops_and_vars()
 
   def testInvalidEstimationMode(self):
     with self.assertRaises(ValueError):
-      estimator.FisherEstimator(lambda: 0.2, [self.weights], 0.1,
-                                self.layer_collection, "not_a_real_mode")
+      est = estimator.FisherEstimator([self.weights], 0.1, 0.2,
+                                      self.layer_collection,
+                                      estimation_mode="not_a_real_mode")
+      est.make_ops_and_vars()
 
-  def testModeListCorrect(self):
+  def testGradientsModeBuild(self):
     with self._graph.as_default():
-      est = estimator.FisherEstimator(lambda: 0.2, [self.weights], 0.1,
-                                      self.layer_collection)
-    self.assertItemsEqual(_ALL_ESTIMATION_MODES, est._gradient_fns.keys())
+      est = estimator.FisherEstimator([self.weights], 0.1, 0.2,
+                                      self.layer_collection,
+                                      estimation_mode="gradients")
+      est.make_ops_and_vars()
+
+  def testEmpiricalModeBuild(self):
+    with self._graph.as_default():
+      est = estimator.FisherEstimator([self.weights], 0.1, 0.2,
+                                      self.layer_collection,
+                                      estimation_mode="empirical")
+      est.make_ops_and_vars()
+
+  def testCurvaturePropModeBuild(self):
+    with self._graph.as_default():
+      est = estimator.FisherEstimator([self.weights], 0.1, 0.2,
+                                      self.layer_collection,
+                                      estimation_mode="curvature_prop")
+      est.make_ops_and_vars()
 
-  def testAllModesBuild(self):
-    for mode in _ALL_ESTIMATION_MODES:
-      with self._graph.as_default():
-        estimator.FisherEstimator(lambda: 0.2, [self.weights], 0.1,
-                                  self.layer_collection, mode)
+  def testExactModeBuild(self):
+    with self._graph.as_default():
+      est = estimator.FisherEstimator([self.weights], 0.1, 0.2,
+                                      self.layer_collection,
+                                      estimation_mode="exact")
+      est.make_ops_and_vars()
 
   def test_cov_update_thunks(self):
     """Ensures covariance update ops run once per global_step."""
     with self._graph.as_default(), self.test_session() as sess:
       fisher_estimator = estimator.FisherEstimator(
-          damping_fn=lambda: 0.2,
           variables=[self.weights],
           layer_collection=self.layer_collection,
+          damping=0.2,
           cov_ema_decay=0.0)
 
       # Construct an op that executes one covariance update per step.
       global_step = training_util.get_or_create_global_step()
+      (cov_variable_thunks, cov_update_op_thunks,
+       _, _) = fisher_estimator.create_ops_and_vars_thunks()
+      for thunk in cov_variable_thunks:
+        thunk()
       cov_matrices = [
           fisher_factor.get_cov()
           for fisher_factor in self.layer_collection.get_factors()
       ]
-      cov_update_op_thunks = fisher_estimator.cov_update_thunks
       cov_update_op = control_flow_ops.case(
           [(math_ops.equal(global_step, i), thunk)
            for i, thunk in enumerate(cov_update_op_thunks)])
@@ -178,19 +202,24 @@ class EstimatorTest(test.TestCase):
     """Ensures inverse update ops run once per global_step."""
     with self._graph.as_default(), self.test_session() as sess:
       fisher_estimator = estimator.FisherEstimator(
-          damping_fn=lambda: 0.2,
           variables=[self.weights],
           layer_collection=self.layer_collection,
+          damping=0.2,
           cov_ema_decay=0.0)
 
       # Construct op that updates one inverse per global step.
       global_step = training_util.get_or_create_global_step()
+      (cov_variable_thunks, _, inv_variable_thunks,
+       inv_update_op_thunks) = fisher_estimator.create_ops_and_vars_thunks()
+      for thunk in cov_variable_thunks:
+        thunk()
+      for thunk in inv_variable_thunks:
+        thunk()
       inv_matrices = [
           matrix
           for fisher_factor in self.layer_collection.get_factors()
-          for matrix in fisher_factor._inverses_by_damping.values()
+          for matrix in fisher_factor._matpower_by_exp_and_damping.values()
       ]
-      inv_update_op_thunks = fisher_estimator.inv_update_thunks
       inv_update_op = control_flow_ops.case(
           [(math_ops.equal(global_step, i), thunk)
            for i, thunk in enumerate(inv_update_op_thunks)])
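Reviewer note: both tests above select one update thunk per `global_step` by pairing each thunk with an equality check on its index. The dispatch idea can be illustrated without TensorFlow. This is a plain-Python analogue under stated assumptions, not the `control_flow_ops.case` API: `make_update_thunks` and `run_one_update` are hypothetical names, and raising on an out-of-range step is a choice made for this sketch only.

```python
def make_update_thunks(factor_names, log):
    """Builds one zero-argument callable per factor; nothing runs yet."""

    def make_thunk(name):
        def thunk():
            # Stand-in for building and running one factor's update op.
            log.append(name)
            return name

        return thunk

    return [make_thunk(name) for name in factor_names]


def run_one_update(step, thunks):
    """Calls only the thunk whose index equals `step`."""
    for i, thunk in enumerate(thunks):
        if step == i:
            return thunk()
    # Assumption for this sketch: an unmatched step is an error.
    raise ValueError("no thunk registered for step %d" % step)


log = []
thunks = make_update_thunks(["input_factor", "output_factor"], log)
for step in range(len(thunks)):
    run_one_update(step, thunks)
```

Because each thunk defers its work until it is called, exactly one update runs per step. The tests check the same property by observing which covariance or inverse matrices change after each step.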
diff --git a/tensorflow/contrib/kfac/python/kernel_tests/fisher_blocks_test.py b/tensorflow/contrib/kfac/python/kernel_tests/fisher_blocks_test.py
index fb4b3a241c1e9fd82e7bf630fd57295917048fbd..b70c700f0936c2d8a2eca6e0836a3ee4ffe4e46d 100644
--- a/tensorflow/contrib/kfac/python/kernel_tests/fisher_blocks_test.py
+++ b/tensorflow/contrib/kfac/python/kernel_tests/fisher_blocks_test.py
@@ -94,6 +94,9 @@ class FullFBTest(test.TestCase):
       block.register_additional_minibatch(32)
       grads = (params[0]**2, math_ops.sqrt(params[1]))
       block.instantiate_factors((grads,), 0.5)
+      block._factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._factor.instantiate_inv_variables()
 
       # Make sure our inverse is something other than the identity.
       sess.run(tf_variables.global_variables_initializer())
@@ -112,6 +115,9 @@ class FullFBTest(test.TestCase):
       block.register_additional_minibatch(32)
       grads = params**2
       block.instantiate_factors((grads,), 0.5)
+      block._factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._factor.instantiate_inv_variables()
 
       # Make sure our inverse is something other than the identity.
       sess.run(tf_variables.global_variables_initializer())
@@ -131,6 +137,9 @@ class FullFBTest(test.TestCase):
       grads = (array_ops.constant([2., 3.]), array_ops.constant(4.))
       damping = 0.5
       block.instantiate_factors((grads,), damping)
+      block._factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._factor.instantiate_inv_variables()
 
       # Make sure our inverse is something other than the identity.
       sess.run(state_ops.assign(block._factor._cov, _make_psd(3)))
@@ -185,6 +194,7 @@ class NaiveDiagonalFBTest(test.TestCase):
       block.register_additional_minibatch(32)
       grads = (params[0]**2, math_ops.sqrt(params[1]))
       block.instantiate_factors((grads,), 0.5)
+      block._factor.instantiate_cov_variables()
 
       # Make sure our inverse is something other than the identity.
       sess.run(tf_variables.global_variables_initializer())
@@ -203,6 +213,7 @@ class NaiveDiagonalFBTest(test.TestCase):
       block.register_additional_minibatch(32)
       grads = params**2
       block.instantiate_factors((grads,), 0.5)
+      block._factor.instantiate_cov_variables()
 
       # Make sure our inverse is something other than the identity.
       sess.run(tf_variables.global_variables_initializer())
@@ -221,6 +232,7 @@ class NaiveDiagonalFBTest(test.TestCase):
       grads = (params[0]**2, math_ops.sqrt(params[1]))
       damping = 0.5
       block.instantiate_factors((grads,), damping)
+      block._factor.instantiate_cov_variables()
 
       cov = array_ops.reshape(array_ops.constant([2., 3., 4.]), [-1, 1])
       sess.run(state_ops.assign(block._factor._cov, cov))
@@ -367,6 +379,7 @@ class FullyConnectedDiagonalFBTest(test.TestCase):
         block.register_additional_minibatch(i, o)
 
       block.instantiate_factors((output_grads,), damping=0.0)
+      block._factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
       sess.run(block._factor.make_covariance_update_op(0.0))
@@ -394,7 +407,7 @@ class EmbeddingKFACFBTest(test.TestCase):
       # Instantiate factor's variables. Ensure it doesn't fail.
       grads = outputs**2.
       damping = array_ops.constant(0.)
-      block.instantiate_factors(([grads],), damping)
+      block.instantiate_factors(((grads,),), damping)
 
   def testMultiplyInverse(self):
     with ops.Graph().as_default(), self.test_session() as sess:
@@ -412,7 +425,12 @@ class EmbeddingKFACFBTest(test.TestCase):
       # Instantiate factor's variables. Ensure it doesn't fail.
       grads = outputs**2.
       damping = array_ops.constant(0.)
-      block.instantiate_factors(([grads],), damping)
+      block.instantiate_factors(((grads,),), damping)
+      block._input_factor.instantiate_cov_variables()
+      block._output_factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._input_factor.instantiate_inv_variables()
+      block._output_factor.instantiate_inv_variables()
 
       # Create a sparse update.
       indices = array_ops.constant([1, 3, 4])
@@ -456,7 +474,7 @@ class FullyConnectedKFACBasicFBTest(test.TestCase):
       block.register_additional_minibatch(inputs, outputs)
 
       grads = outputs**2
-      block.instantiate_factors(([grads],), 0.5)
+      block.instantiate_factors(((grads,),), 0.5)
 
   def testInstantiateFactorsNoBias(self):
     with ops.Graph().as_default():
@@ -467,7 +485,7 @@ class FullyConnectedKFACBasicFBTest(test.TestCase):
       block.register_additional_minibatch(inputs, outputs)
 
       grads = outputs**2
-      block.instantiate_factors(([grads],), 0.5)
+      block.instantiate_factors(((grads,),), 0.5)
 
   def testMultiplyInverseTuple(self):
     with ops.Graph().as_default(), self.test_session() as sess:
@@ -477,7 +495,13 @@ class FullyConnectedKFACBasicFBTest(test.TestCase):
       block = fb.FullyConnectedKFACBasicFB(lc.LayerCollection(), has_bias=False)
       block.register_additional_minibatch(inputs, outputs)
       grads = outputs**2
-      block.instantiate_factors(([grads],), 0.5)
+      block.instantiate_factors(((grads,),), 0.5)
+
+      block._input_factor.instantiate_cov_variables()
+      block._output_factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._input_factor.instantiate_inv_variables()
+      block._output_factor.instantiate_inv_variables()
 
       # Make sure our inverse is something other than the identity.
       sess.run(tf_variables.global_variables_initializer())
@@ -503,7 +527,12 @@ class FullyConnectedKFACBasicFBTest(test.TestCase):
       block = fb.FullyConnectedKFACBasicFB(lc.LayerCollection(), has_bias=False)
       block.register_additional_minibatch(inputs, outputs)
       grads = outputs**2
-      block.instantiate_factors(([grads],), 0.5)
+      block.instantiate_factors(((grads,),), 0.5)
+      block._input_factor.instantiate_cov_variables()
+      block._output_factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._input_factor.instantiate_inv_variables()
+      block._output_factor.instantiate_inv_variables()
 
       # Make sure our inverse is something other than the identity.
       sess.run(tf_variables.global_variables_initializer())
@@ -527,10 +556,17 @@ class FullyConnectedKFACBasicFBTest(test.TestCase):
       block.register_additional_minibatch(inputs, outputs)
       grads = outputs**2
       damping = 0.  # This test is only valid without damping.
-      block.instantiate_factors(([grads],), damping)
+      block.instantiate_factors(((grads,),), damping)
+      block._input_factor.instantiate_cov_variables()
+      block._output_factor.instantiate_cov_variables()
 
       sess.run(state_ops.assign(block._input_factor._cov, _make_psd(3)))
       sess.run(state_ops.assign(block._output_factor._cov, _make_psd(2)))
+
+      block.register_inverse()
+      block._input_factor.instantiate_inv_variables()
+      block._output_factor.instantiate_inv_variables()
+
       sess.run(block._input_factor.make_inverse_update_ops())
       sess.run(block._output_factor.make_inverse_update_ops())
 
@@ -718,6 +754,7 @@ class ConvDiagonalFBTest(test.TestCase):
         block.register_additional_minibatch(i, o)
 
       block.instantiate_factors((output_grads,), damping=0.0)
+      block._factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
       sess.run(block._factor.make_covariance_update_op(0.0))
@@ -727,6 +764,54 @@ class ConvDiagonalFBTest(test.TestCase):
     return multiply_result, multiply_inverse_result
 
 
+class DepthwiseConvKFCBasicFBTest(test.TestCase):
+
+  def testInstantiateFactors(self):
+    with ops.Graph().as_default():
+      random_seed.set_random_seed(200)
+      params = random_ops.random_normal((3, 3, 8, 2))
+      inputs = random_ops.random_normal((32, 5, 5, 8))
+      outputs = random_ops.random_normal((32, 5, 5, 16))
+      layer_collection = lc.LayerCollection()
+      block = fb.DepthwiseConvKFCBasicFB(
+          layer_collection, params=params, strides=[1, 1, 1, 1], padding='SAME')
+      block.register_additional_minibatch(inputs, outputs)
+      grads = outputs**2
+      block.instantiate_factors(([grads],), 0.5)
+
+  def testMultiplyInverse(self):
+    with ops.Graph().as_default(), self.test_session() as sess:
+      random_seed.set_random_seed(200)
+      params = random_ops.random_normal((3, 3, 8, 2))
+      inputs = random_ops.random_normal((32, 5, 5, 8))
+      outputs = random_ops.random_normal((32, 5, 5, 16))
+      layer_collection = lc.LayerCollection()
+      block = fb.DepthwiseConvKFCBasicFB(
+          layer_collection, params=params, strides=[1, 1, 1, 1], padding='SAME')
+      block.register_additional_minibatch(inputs, outputs)
+      grads = outputs**2
+      block.instantiate_factors(([grads],), 0.5)
+      block._input_factor.instantiate_cov_variables()
+      block._output_factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._input_factor.instantiate_inv_variables()
+      block._output_factor.instantiate_inv_variables()
+
+      # Ensure inverse update op doesn't crash.
+      sess.run(tf_variables.global_variables_initializer())
+      sess.run([
+          factor.make_inverse_update_ops()
+          for factor in layer_collection.get_factors()
+      ])
+
+      # Ensure inverse-vector multiply doesn't crash.
+      output = block.multiply_inverse(params)
+      sess.run(output)
+
+      # Ensure same shape.
+      self.assertAllEqual(output.shape, params.shape)
+
+
 class ConvKFCBasicFBTest(test.TestCase):
 
   def _testConvKFCBasicFBInitParams(self, params):
@@ -738,16 +823,17 @@ class ConvKFCBasicFBTest(test.TestCase):
         params = array_ops.constant(params)
       inputs = random_ops.random_normal((2, 2, 2))
       outputs = random_ops.random_normal((2, 2, 2))
-      block = fb.ConvKFCBasicFB(lc.LayerCollection(), params, [1, 1, 1], 'SAME')
+      block = fb.ConvKFCBasicFB(
+          lc.LayerCollection(), params=params, padding='SAME')
       block.register_additional_minibatch(inputs, outputs)
 
       self.assertAllEqual([outputs], block.tensors_to_compute_grads())
 
   def testConvKFCBasicFBInitParamsParamsTuple(self):
-    self._testConvKFCBasicFBInitParams([np.array([1., 2.]), np.array(3.)])
+    self._testConvKFCBasicFBInitParams([np.ones([1, 2, 2]), np.ones([2])])
 
   def testConvKFCBasicFBInitParamsParamsSingle(self):
-    self._testConvKFCBasicFBInitParams([np.array([1., 2.])])
+    self._testConvKFCBasicFBInitParams([np.ones([1, 2, 2])])
 
   def testMultiplyInverseTuple(self):
     with ops.Graph().as_default(), self.test_session() as sess:
@@ -755,11 +841,16 @@ class ConvKFCBasicFBTest(test.TestCase):
       params = random_ops.random_normal((2, 2, 2, 2))
       inputs = random_ops.random_normal((2, 2, 2, 2))
       outputs = random_ops.random_normal((2, 2, 2, 2))
-      block = fb.ConvKFCBasicFB(lc.LayerCollection(), params, (1, 1, 1, 1),
-                                'SAME')
+      block = fb.ConvKFCBasicFB(
+          lc.LayerCollection(), params=params, padding='SAME')
       block.register_additional_minibatch(inputs, outputs)
       grads = outputs**2
-      block.instantiate_factors(([grads],), 0.5)
+      block.instantiate_factors(((grads,),), 0.5)
+      block._input_factor.instantiate_cov_variables()
+      block._output_factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._input_factor.instantiate_inv_variables()
+      block._output_factor.instantiate_inv_variables()
 
       # Make sure our inverse is something other than the identity.
       sess.run(tf_variables.global_variables_initializer())
@@ -781,12 +872,17 @@ class ConvKFCBasicFBTest(test.TestCase):
       params = random_ops.random_normal((2, 2, 2, 2))
       inputs = random_ops.random_normal((2, 2, 2, 2))
       outputs = random_ops.random_normal((2, 2, 2, 2))
-      block = fb.ConvKFCBasicFB(lc.LayerCollection(), params, (1, 1, 1, 1),
-                                'SAME')
+      block = fb.ConvKFCBasicFB(
+          lc.LayerCollection(), params=params, padding='SAME')
       block.register_additional_minibatch(inputs, outputs)
       self.assertFalse(block._has_bias)
       grads = outputs**2
-      block.instantiate_factors(([grads],), 0.5)
+      block.instantiate_factors(((grads,),), 0.5)
+      block._input_factor.instantiate_cov_variables()
+      block._output_factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._input_factor.instantiate_inv_variables()
+      block._output_factor.instantiate_inv_variables()
 
       # Make sure our inverse is something other than the identity.
       sess.run(tf_variables.global_variables_initializer())
@@ -804,12 +900,17 @@ class ConvKFCBasicFBTest(test.TestCase):
       params = [random_ops.random_normal((2, 2, 2, 2))]
       inputs = random_ops.random_normal((2, 2, 2, 2))
       outputs = random_ops.random_normal((2, 2, 2, 2))
-      block = fb.ConvKFCBasicFB(lc.LayerCollection(), params, (1, 1, 1, 1),
-                                'SAME')
+      block = fb.ConvKFCBasicFB(
+          lc.LayerCollection(), params=params, padding='SAME')
       block.register_additional_minibatch(inputs, outputs)
       self.assertTrue(block._has_bias)
       grads = outputs**2
-      block.instantiate_factors(([grads],), 0.5)
+      block.instantiate_factors(((grads,),), 0.5)
+      block._input_factor.instantiate_cov_variables()
+      block._output_factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._input_factor.instantiate_inv_variables()
+      block._output_factor.instantiate_inv_variables()
 
       # Make sure our inverse is something other than the identity.
       sess.run(tf_variables.global_variables_initializer())
@@ -827,12 +928,17 @@ class ConvKFCBasicFBTest(test.TestCase):
       params = array_ops.zeros((2, 2, 2, 2))
       inputs = array_ops.zeros((2, 2, 2, 2))
       outputs = array_ops.zeros((2, 2, 2, 2))
-      block = fb.ConvKFCBasicFB(lc.LayerCollection(), params, (1, 1, 1, 1),
-                                'SAME')
+      block = fb.ConvKFCBasicFB(
+          lc.LayerCollection(), params=params, padding='SAME')
       block.register_additional_minibatch(inputs, outputs)
       grads = outputs**2
       damping = 0.  # This test is only valid without damping.
-      block.instantiate_factors(([grads],), damping)
+      block.instantiate_factors(((grads,),), damping)
+      block._input_factor.instantiate_cov_variables()
+      block._output_factor.instantiate_cov_variables()
+      block.register_inverse()
+      block._input_factor.instantiate_inv_variables()
+      block._output_factor.instantiate_inv_variables()
 
       sess.run(state_ops.assign(block._input_factor._cov, _make_psd(8)))
       sess.run(state_ops.assign(block._output_factor._cov, _make_psd(2)))
@@ -857,9 +963,9 @@ class FullyConnectedSeriesFBTest(test.TestCase):
       random_seed.set_random_seed(200)
       inputs = array_ops.constant([1., 2.])
       outputs = array_ops.constant([3., 4.])
-      block = fb.FullyConnectedSeriesFB(
-          lc.LayerCollection(), inputs=[inputs], outputs=[outputs])
-      self.assertAllEqual([outputs], block.tensors_to_compute_grads())
+      block = fb.FullyConnectedSeriesFB(lc.LayerCollection())
+      block.register_additional_minibatch([inputs], [outputs])
+      self.assertAllEqual([[outputs]], block.tensors_to_compute_grads())
 
   def testInstantiateFactorsHasBias(self):
     with ops.Graph().as_default():
@@ -868,11 +974,10 @@ class FullyConnectedSeriesFBTest(test.TestCase):
       outputs = array_ops.constant([[3., 4.], [5., 6.]])
       block = fb.FullyConnectedSeriesFB(
           lc.LayerCollection(),
-          inputs=[inputs],
-          outputs=[outputs],
           has_bias=True)
+      block.register_additional_minibatch([inputs], [outputs])
       grads = outputs**2
-      block.instantiate_factors(((grads,),), 0.5)
+      block.instantiate_factors((((grads,),),), 0.5)
 
   def testInstantiateFactorsNoBias(self):
     with ops.Graph().as_default():
@@ -881,11 +986,10 @@ class FullyConnectedSeriesFBTest(test.TestCase):
       outputs = array_ops.constant([[3., 4.], [5., 6.]])
       block = fb.FullyConnectedSeriesFB(
           lc.LayerCollection(),
-          inputs=[inputs],
-          outputs=[outputs],
           has_bias=False)
+      block.register_additional_minibatch([inputs], [outputs])
       grads = outputs**2
-      block.instantiate_factors(((grads,),), 0.5)
+      block.instantiate_factors((((grads,),),), 0.5)
 
 
 def as_tensors(tensor_or_tuple):
diff --git a/tensorflow/contrib/kfac/python/kernel_tests/fisher_factors_test.py b/tensorflow/contrib/kfac/python/kernel_tests/fisher_factors_test.py
index 66e18974abfadaad5d7a20b40d0b1352bfda67ee..e007f70939894725fd1b3212a3bf8e2ff877ca08 100644
--- a/tensorflow/contrib/kfac/python/kernel_tests/fisher_factors_test.py
+++ b/tensorflow/contrib/kfac/python/kernel_tests/fisher_factors_test.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 import numpy as np
 import numpy.random as npr
 
+from tensorflow.contrib.kfac.python.ops import fisher_blocks as fb
 from tensorflow.contrib.kfac.python.ops import fisher_factors as ff
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
@@ -29,36 +30,13 @@ from tensorflow.python.framework import random_seed
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variables as tf_variables
 from tensorflow.python.platform import test
 
 
-class MaybeColocateTest(test.TestCase):
-
-  def setUp(self):
-    self._colocate_cov_ops_with_inputs = ff.COLOCATE_COV_OPS_WITH_INPUTS
-
-  def tearDown(self):
-    ff.set_global_constants(
-        colocate_cov_ops_with_inputs=self._colocate_cov_ops_with_inputs)
-
-  def testFalse(self):
-    ff.set_global_constants(colocate_cov_ops_with_inputs=False)
-    with tf_ops.Graph().as_default():
-      a = constant_op.constant([2.0], name='a')
-      with ff.maybe_colocate_with(a):
-        b = constant_op.constant(3.0, name='b')
-      self.assertEqual([b'loc:@a'], a.op.colocation_groups())
-      self.assertEqual([b'loc:@b'], b.op.colocation_groups())
-
-  def testTrue(self):
-    ff.set_global_constants(colocate_cov_ops_with_inputs=True)
-    with tf_ops.Graph().as_default():
-      a = constant_op.constant([2.0], name='a')
-      with ff.maybe_colocate_with(a):
-        b = constant_op.constant(3.0, name='b')
-      self.assertEqual([b'loc:@a'], a.op.colocation_groups())
-      self.assertEqual([b'loc:@a'], b.op.colocation_groups())
+def make_damping_func(damping):
+  return fb._package_func(lambda: damping, damping)
 
 
 class FisherFactorTestingDummy(ff.FisherFactor):
@@ -98,10 +76,13 @@ class FisherFactorTestingDummy(ff.FisherFactor):
   def right_multiply(self, x, damping):
     return NotImplementedError
 
-  def left_multiply_inverse(self, x, damping):
+  def left_multiply_matpower(self, x, exp, damping):
+    return NotImplementedError
+
+  def right_multiply_matpower(self, x, exp, damping):
     return NotImplementedError
 
-  def right_multiply_inverse(self, x, damping):
+  def instantiate_inv_variables(self):
     return NotImplementedError
 
 
@@ -246,21 +227,24 @@ class InverseProvidingFactorTest(test.TestCase):
       factor = InverseProvidingFactorTestingDummy(shape)
       factor_var_scope = 'dummy/a_b_c'
 
-      dampings = 0.1, 1e-1, 0.00001, 1e-5
+      damping_funcs = [make_damping_func(0.1),
+                       make_damping_func(0.1),
+                       make_damping_func(1e-5),
+                       make_damping_func(1e-5)]
+      for damping_func in damping_funcs:
+        factor.register_inverse(damping_func)
 
-      for damping in dampings:
-        factor.register_damped_inverse(damping)
+      factor.instantiate_inv_variables()
 
-      self.assertEqual(set(dampings), set(factor._inverses_by_damping.keys()))
-      inv = factor._inverses_by_damping[dampings[0]]
-      self.assertEqual(inv, factor._inverses_by_damping[dampings[1]])
-      self.assertNotEqual(inv, factor._inverses_by_damping[dampings[2]])
-      self.assertEqual(factor._inverses_by_damping[dampings[2]],
-                       factor._inverses_by_damping[dampings[3]])
+      inv = factor.get_inverse(damping_funcs[0])
+      self.assertEqual(inv, factor.get_inverse(damping_funcs[1]))
+      self.assertNotEqual(inv, factor.get_inverse(damping_funcs[2]))
+      self.assertEqual(factor.get_inverse(damping_funcs[2]),
+                       factor.get_inverse(damping_funcs[3]))
       factor_vars = tf_ops.get_collection(tf_ops.GraphKeys.GLOBAL_VARIABLES,
                                           factor_var_scope)
-      self.assertListEqual([inv, factor._inverses_by_damping[dampings[2]]],
-                           factor_vars)
+      self.assertEqual(set([inv, factor.get_inverse(damping_funcs[2])]),
+                       set(factor_vars))
       self.assertEqual(shape, inv.get_shape())
 
   def testRegisterMatpower(self):
@@ -270,17 +254,22 @@ class InverseProvidingFactorTest(test.TestCase):
       factor = InverseProvidingFactorTestingDummy(shape)
       factor_var_scope = 'dummy/a_b_c'
 
-      factor.register_matpower(1, 0.5)
-      factor.register_matpower(2, 0.5)
+      # TODO(b/74201126): Change to using the same func for both once
+      # Topohash is in place.
+      damping_func_1 = make_damping_func(0.5)
+      damping_func_2 = make_damping_func(0.5)
+
+      factor.register_matpower(-0.5, damping_func_1)
+      factor.register_matpower(2, damping_func_2)
+
+      factor.instantiate_inv_variables()
 
-      self.assertEqual(
-          set([(1, 0.5), (2, 0.5)]),
-          set(factor._matpower_by_exp_and_damping.keys()))
       factor_vars = tf_ops.get_collection(tf_ops.GraphKeys.GLOBAL_VARIABLES,
                                           factor_var_scope)
-      matpower1 = factor.get_matpower(1, 0.5)
-      matpower2 = factor.get_matpower(2, 0.5)
-      self.assertListEqual([matpower1, matpower2], factor_vars)
+      matpower1 = factor.get_matpower(-0.5, damping_func_1)
+      matpower2 = factor.get_matpower(2, damping_func_2)
+
+      self.assertEqual(set([matpower1, matpower2]), set(factor_vars))
 
       self.assertEqual(shape, matpower1.get_shape())
       self.assertEqual(shape, matpower2.get_shape())
@@ -299,17 +288,24 @@ class InverseProvidingFactorTest(test.TestCase):
       factor = InverseProvidingFactorTestingDummy(cov.shape)
       factor._cov = array_ops.constant(cov, dtype=dtypes.float32)
 
+      damping_funcs = []
       for i in range(1, ff.EIGENVALUE_DECOMPOSITION_THRESHOLD + 1):
-        factor.register_damped_inverse(1. / i)
+        damping_funcs.append(make_damping_func(1./i))
+
+      for i in range(ff.EIGENVALUE_DECOMPOSITION_THRESHOLD):
+        factor.register_inverse(damping_funcs[i])
+
+      factor.instantiate_inv_variables()
       ops = factor.make_inverse_update_ops()
       self.assertEqual(1, len(ops))
 
       sess.run(tf_variables.global_variables_initializer())
       new_invs = []
       sess.run(ops)
-      for i in range(1, ff.EIGENVALUE_DECOMPOSITION_THRESHOLD + 1):
+      for i in range(ff.EIGENVALUE_DECOMPOSITION_THRESHOLD):
         # The inverse op will assign the damped inverse of cov to the inv var.
-        new_invs.append(sess.run(factor._inverses_by_damping[1. / i]))
+        new_invs.append(sess.run(factor.get_inverse(damping_funcs[i])))
+
       # We want to see that the new invs are all different from each other.
       for i in range(len(new_invs)):
         for j in range(i + 1, len(new_invs)):
@@ -324,14 +320,16 @@ class InverseProvidingFactorTest(test.TestCase):
       factor._cov = array_ops.constant(cov, dtype=dtypes.float32)
       exp = 2  # NOTE(mattjj): must be int to test with np.linalg.matrix_power
       damping = 0.5
+      damping_func = make_damping_func(damping)
 
-      factor.register_matpower(exp, damping)
+      factor.register_matpower(exp, damping_func)
+      factor.instantiate_inv_variables()
       ops = factor.make_inverse_update_ops()
       self.assertEqual(1, len(ops))
 
       sess.run(tf_variables.global_variables_initializer())
       sess.run(ops[0])
-      matpower = sess.run(factor._matpower_by_exp_and_damping[(exp, damping)])
+      matpower = sess.run(factor.get_matpower(exp, damping_func))
       matpower_np = np.linalg.matrix_power(cov + np.eye(2) * damping, exp)
       self.assertAllClose(matpower, matpower_np)
 
@@ -342,18 +340,21 @@ class InverseProvidingFactorTest(test.TestCase):
       factor = InverseProvidingFactorTestingDummy(cov.shape)
       factor._cov = array_ops.constant(cov, dtype=dtypes.float32)
 
-      factor.register_damped_inverse(0)
+      damping_func = make_damping_func(0)
+
+      factor.register_inverse(damping_func)
+      factor.instantiate_inv_variables()
       ops = factor.make_inverse_update_ops()
       self.assertEqual(1, len(ops))
 
       sess.run(tf_variables.global_variables_initializer())
       # The inverse op will assign the damped inverse of cov to the inv var.
-      old_inv = sess.run(factor._inverses_by_damping[0])
+      old_inv = sess.run(factor.get_inverse(damping_func))
       self.assertAllClose(
           sess.run(ff.inverse_initializer(cov.shape, dtypes.float32)), old_inv)
 
       sess.run(ops)
-      new_inv = sess.run(factor._inverses_by_damping[0])
+      new_inv = sess.run(factor.get_inverse(damping_func))
       self.assertAllClose(new_inv, np.linalg.inv(cov))
 
 
@@ -364,6 +365,7 @@ class FullFactorTest(test.TestCase):
       random_seed.set_random_seed(200)
       tensor = array_ops.ones((2, 3), name='a/b/c')
       factor = ff.FullFactor((tensor,), 32)
+      factor.instantiate_cov_variables()
       self.assertEqual([6, 6], factor.get_cov().get_shape().as_list())
 
   def testFullFactorInitFloat64(self):
@@ -372,6 +374,7 @@ class FullFactorTest(test.TestCase):
       random_seed.set_random_seed(200)
       tensor = array_ops.ones((2, 3), dtype=dtype, name='a/b/c')
       factor = ff.FullFactor((tensor,), 32)
+      factor.instantiate_cov_variables()
       cov = factor.get_cov()
       self.assertEqual(cov.dtype, dtype)
       self.assertEqual([6, 6], cov.get_shape().as_list())
@@ -381,6 +384,7 @@ class FullFactorTest(test.TestCase):
       random_seed.set_random_seed(200)
       tensor = array_ops.constant([1., 2.], name='a/b/c')
       factor = ff.FullFactor((tensor,), 2)
+      factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
       new_cov = sess.run(factor.make_covariance_update_op(.5))
@@ -394,6 +398,7 @@ class NaiveDiagonalFactorTest(test.TestCase):
       random_seed.set_random_seed(200)
       tensor = array_ops.ones((2, 3), name='a/b/c')
       factor = ff.NaiveDiagonalFactor((tensor,), 32)
+      factor.instantiate_cov_variables()
       self.assertEqual([6, 1], factor.get_cov_var().get_shape().as_list())
 
   def testNaiveDiagonalFactorInitFloat64(self):
@@ -402,6 +407,7 @@ class NaiveDiagonalFactorTest(test.TestCase):
       random_seed.set_random_seed(200)
       tensor = array_ops.ones((2, 3), dtype=dtype, name='a/b/c')
       factor = ff.NaiveDiagonalFactor((tensor,), 32)
+      factor.instantiate_cov_variables()
       cov = factor.get_cov_var()
       self.assertEqual(cov.dtype, dtype)
       self.assertEqual([6, 1], cov.get_shape().as_list())
@@ -411,6 +417,7 @@ class NaiveDiagonalFactorTest(test.TestCase):
       random_seed.set_random_seed(200)
       tensor = array_ops.constant([1., 2.], name='a/b/c')
       factor = ff.NaiveDiagonalFactor((tensor,), 2)
+      factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
       new_cov = sess.run(factor.make_covariance_update_op(.5))
@@ -423,7 +430,8 @@ class EmbeddingInputKroneckerFactorTest(test.TestCase):
     with tf_ops.Graph().as_default():
       input_ids = array_ops.constant([[0], [1], [4]])
       vocab_size = 5
-      factor = ff.EmbeddingInputKroneckerFactor((input_ids,), vocab_size)
+      factor = ff.EmbeddingInputKroneckerFactor(input_ids, vocab_size)
+      factor.instantiate_cov_variables()
       cov = factor.get_cov_var()
       self.assertEqual(cov.shape.as_list(), [vocab_size])
 
@@ -431,7 +439,8 @@ class EmbeddingInputKroneckerFactorTest(test.TestCase):
     with tf_ops.Graph().as_default():
       input_ids = array_ops.constant([[0], [1], [4]])
       vocab_size = 5
-      factor = ff.EmbeddingInputKroneckerFactor((input_ids,), vocab_size)
+      factor = ff.EmbeddingInputKroneckerFactor(input_ids, vocab_size)
+      factor.instantiate_cov_variables()
       cov_update_op = factor.make_covariance_update_op(0.0)
 
       with self.test_session() as sess:
@@ -440,6 +449,117 @@ class EmbeddingInputKroneckerFactorTest(test.TestCase):
         self.assertAllClose(np.array([1., 1., 0., 0., 1.]) / 3., new_cov)
 
 
+class ConvDiagonalFactorTest(test.TestCase):
+
+  def setUp(self):
+    self.batch_size = 10
+    self.height = self.width = 32
+    self.in_channels = 3
+    self.out_channels = 1
+    self.kernel_height = self.kernel_width = 3
+    self.strides = [1, 2, 2, 1]
+    self.data_format = 'NHWC'
+    self.padding = 'SAME'
+    self.kernel_shape = [
+        self.kernel_height, self.kernel_width, self.in_channels,
+        self.out_channels
+    ]
+
+  def testInit(self):
+    with tf_ops.Graph().as_default():
+      inputs = random_ops.random_uniform(
+          [self.batch_size, self.height, self.width, self.in_channels])
+      outputs_grads = [
+          random_ops.random_uniform([
+              self.batch_size, self.height // self.strides[1],
+              self.width // self.strides[2], self.out_channels
+          ]) for _ in range(3)
+      ]
+
+      factor = ff.ConvDiagonalFactor(
+          inputs,
+          outputs_grads,
+          self.kernel_shape,
+          self.strides,
+          self.padding,
+          data_format=self.data_format)
+      factor.instantiate_cov_variables()
+
+      # Ensure covariance matrix's shape makes sense.
+      self.assertEqual([
+          self.kernel_height * self.kernel_width * self.in_channels,
+          self.out_channels
+      ],
+                       factor.get_cov_var().shape.as_list())
+
+  def testMakeCovarianceUpdateOp(self):
+    with tf_ops.Graph().as_default():
+      # Construct all arguments such that convolution kernel is applied in
+      # exactly one spatial location.
+      inputs = np.random.randn(
+          1,  # batch_size
+          self.kernel_height,
+          self.kernel_width,
+          self.in_channels)  # in_channels
+      outputs_grad = np.random.randn(
+          1,  # batch_size
+          1,  # output_height
+          1,  # output_width
+          self.out_channels)
+
+      factor = ff.ConvDiagonalFactor(
+          constant_op.constant(inputs), [constant_op.constant(outputs_grad)],
+          self.kernel_shape,
+          strides=[1, 1, 1, 1],
+          padding='VALID')
+      factor.instantiate_cov_variables()
+
+      # Completely forget initial value on first update.
+      cov_update_op = factor.make_covariance_update_op(0.0)
+
+      # Ensure new covariance value equals the elementwise square of the outer
+      # product of the flattened inputs and output gradients.
+      with self.test_session() as sess:
+        sess.run(tf_variables.global_variables_initializer())
+        cov = sess.run(cov_update_op)
+        expected_cov = np.outer(inputs.flatten(), outputs_grad.flatten())**2
+        self.assertAllClose(expected_cov, cov)
+
+  def testHasBias(self):
+    with tf_ops.Graph().as_default():
+      inputs = random_ops.random_uniform(
+          [self.batch_size, self.height, self.width, self.in_channels])
+      outputs_grads = [
+          random_ops.random_uniform([
+              self.batch_size, self.height // self.strides[1],
+              self.width // self.strides[2], self.out_channels
+          ]) for _ in range(3)
+      ]
+
+      factor = ff.ConvDiagonalFactor(
+          inputs,
+          outputs_grads,
+          self.kernel_shape,
+          self.strides,
+          self.padding,
+          data_format=self.data_format,
+          has_bias=True)
+      factor.instantiate_cov_variables()
+
+      # Ensure shape accounts for bias.
+      self.assertEqual([
+          self.kernel_height * self.kernel_width * self.in_channels + 1,
+          self.out_channels
+      ],
+                       factor.get_cov_var().shape.as_list())
+
+      # Ensure update op doesn't crash.
+      cov_update_op = factor.make_covariance_update_op(0.0)
+      with self.test_session() as sess:
+        sess.run(tf_variables.global_variables_initializer())
+        sess.run(cov_update_op)
+
+
 class FullyConnectedKroneckerFactorTest(test.TestCase):
 
   def _testFullyConnectedKroneckerFactorInit(self,
@@ -450,6 +570,7 @@ class FullyConnectedKroneckerFactorTest(test.TestCase):
       random_seed.set_random_seed(200)
       tensor = array_ops.ones((2, 3), dtype=dtype, name='a/b/c')
       factor = ff.FullyConnectedKroneckerFactor((tensor,), has_bias=has_bias)
+      factor.instantiate_cov_variables()
       cov = factor.get_cov()
       self.assertEqual(cov.dtype, dtype)
       self.assertEqual(final_shape, cov.get_shape().as_list())
@@ -467,6 +588,7 @@ class FullyConnectedKroneckerFactorTest(test.TestCase):
       random_seed.set_random_seed(200)
       tensor = array_ops.constant([[1., 2.], [3., 4.]], name='a/b/c')
       factor = ff.FullyConnectedKroneckerFactor((tensor,), has_bias=True)
+      factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
       new_cov = sess.run(factor.make_covariance_update_op(.5))
@@ -477,39 +599,170 @@ class FullyConnectedKroneckerFactorTest(test.TestCase):
       random_seed.set_random_seed(200)
       tensor = array_ops.constant([[1., 2.], [3., 4.]], name='a/b/c')
       factor = ff.FullyConnectedKroneckerFactor((tensor,))
+      factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
       new_cov = sess.run(factor.make_covariance_update_op(.5))
       self.assertAllClose([[3, 3.5], [3.5, 5.5]], new_cov)
 
 
-class ConvInputKroneckerFactorTest(test.TestCase):
+class ConvFactorTestCase(test.TestCase):
+
+  def assertMatrixRank(self, rank, matrix, atol=1e-5):
+    assert rank <= matrix.shape[0], 'Rank cannot be larger than matrix size.'
+    eigvals = np.linalg.eigvals(matrix)
+    nnz_eigvals = np.sum(eigvals > atol)
+    self.assertEqual(
+        rank,
+        nnz_eigvals,
+        msg=('Found %d of %d expected non-zero eigenvalues: %s.' %
+             (nnz_eigvals, rank, eigvals)))
+
+
+class ConvInputKroneckerFactorTest(ConvFactorTestCase):
+
+  def test3DConvolution(self):
+    with tf_ops.Graph().as_default():
+      batch_size = 1
+      width = 3
+      in_channels = 3**3
+      out_channels = 4
+
+      factor = ff.ConvInputKroneckerFactor(
+          inputs=random_ops.random_uniform(
+              (batch_size, width, width, width, in_channels), seed=0),
+          filter_shape=(width, width, width, in_channels, out_channels),
+          padding='SAME',
+          strides=(2, 2, 2),
+          extract_patches_fn='extract_convolution_patches',
+          has_bias=False)
+      factor.instantiate_cov_variables()
+
+      # Ensure shape of covariance matches input size of filter.
+      input_size = in_channels * (width**3)
+      self.assertEqual([input_size, input_size],
+                       factor.get_cov_var().shape.as_list())
+
+      # Ensure cov_update_op doesn't crash.
+      with self.test_session() as sess:
+        sess.run(tf_variables.global_variables_initializer())
+        sess.run(factor.make_covariance_update_op(0.0))
+        cov = sess.run(factor.get_cov_var())
+
+      # Cov should be rank-8, as the filter will be applied at each corner of
+      # the 3-D cube.
+      self.assertMatrixRank(8, cov)
+
+  def testPointwiseConv2d(self):
+    with tf_ops.Graph().as_default():
+      batch_size = 1
+      width = 3
+      in_channels = 3**2
+      out_channels = 4
+
+      factor = ff.ConvInputKroneckerFactor(
+          inputs=random_ops.random_uniform(
+              (batch_size, width, width, in_channels), seed=0),
+          filter_shape=(1, 1, in_channels, out_channels),
+          padding='SAME',
+          strides=(1, 1, 1, 1),
+          extract_patches_fn='extract_pointwise_conv2d_patches',
+          has_bias=False)
+      factor.instantiate_cov_variables()
+
+      # Ensure shape of covariance matches input size of filter.
+      self.assertEqual([in_channels, in_channels],
+                       factor.get_cov_var().shape.as_list())
+
+      # Ensure cov_update_op doesn't crash.
+      with self.test_session() as sess:
+        sess.run(tf_variables.global_variables_initializer())
+        sess.run(factor.make_covariance_update_op(0.0))
+        cov = sess.run(factor.get_cov_var())
+
+      # Cov should be rank-9, as the filter will be applied at each location.
+      self.assertMatrixRank(9, cov)
+
+  def testStrides(self):
+    with tf_ops.Graph().as_default():
+      batch_size = 1
+      width = 3
+      in_channels = 3**2
+      out_channels = 4
+
+      factor = ff.ConvInputKroneckerFactor(
+          inputs=random_ops.random_uniform(
+              (batch_size, width, width, in_channels), seed=0),
+          filter_shape=(1, 1, in_channels, out_channels),
+          padding='SAME',
+          strides=(1, 2, 1, 1),
+          extract_patches_fn='extract_image_patches',
+          has_bias=False)
+      factor.instantiate_cov_variables()
+
+      with self.test_session() as sess:
+        sess.run(tf_variables.global_variables_initializer())
+        sess.run(factor.make_covariance_update_op(0.0))
+        cov = sess.run(factor.get_cov_var())
+
+      # Cov should be the sum of 3 * 2 = 6 outer products.
+      self.assertMatrixRank(6, cov)
+
+  def testDilationRate(self):
+    with tf_ops.Graph().as_default():
+      batch_size = 1
+      width = 3
+      in_channels = 2
+      out_channels = 4
+
+      factor = ff.ConvInputKroneckerFactor(
+          inputs=random_ops.random_uniform(
+              (batch_size, width, width, in_channels), seed=0),
+          filter_shape=(3, 3, in_channels, out_channels),
+          padding='SAME',
+          extract_patches_fn='extract_image_patches',
+          strides=(1, 1, 1, 1),
+          dilation_rate=(1, width, width, 1),
+          has_bias=False)
+      factor.instantiate_cov_variables()
+
+      with self.test_session() as sess:
+        sess.run(tf_variables.global_variables_initializer())
+        sess.run(factor.make_covariance_update_op(0.0))
+        cov = sess.run(factor.get_cov_var())
+
+      # Cov should be rank = in_channels, as only the center of the filter
+      # receives non-zero input for each input channel.
+      self.assertMatrixRank(in_channels, cov)
 
   def testConvInputKroneckerFactorInitNoBias(self):
     with tf_ops.Graph().as_default():
-      random_seed.set_random_seed(200)
-      tensor = array_ops.ones((2, 3), name='a/b/c')
+      tensor = array_ops.ones((64, 1, 2, 3), name='a/b/c')
       factor = ff.ConvInputKroneckerFactor(
-          tensor, (1, 2, 3, 4), 3, 2, has_bias=False)
+          inputs=tensor,
+          filter_shape=(1, 2, 3, 4),
+          padding='SAME',
+          has_bias=False)
+      factor.instantiate_cov_variables()
       self.assertEqual([1 * 2 * 3, 1 * 2 * 3],
                        factor.get_cov().get_shape().as_list())
 
   def testConvInputKroneckerFactorInit(self):
     with tf_ops.Graph().as_default():
-      random_seed.set_random_seed(200)
-      tensor = array_ops.ones((2, 3), name='a/b/c')
+      tensor = array_ops.ones((64, 1, 2, 3), name='a/b/c')
       factor = ff.ConvInputKroneckerFactor(
-          tensor, (1, 2, 3, 4), 3, 2, has_bias=True)
+          tensor, filter_shape=(1, 2, 3, 4), padding='SAME', has_bias=True)
+      factor.instantiate_cov_variables()
       self.assertEqual([1 * 2 * 3 + 1, 1 * 2 * 3 + 1],
                        factor.get_cov().get_shape().as_list())
 
   def testConvInputKroneckerFactorInitFloat64(self):
     with tf_ops.Graph().as_default():
       dtype = dtypes.float64_ref
-      random_seed.set_random_seed(200)
-      tensor = array_ops.ones((2, 3), dtype=dtype, name='a/b/c')
+      tensor = array_ops.ones((64, 1, 2, 3), name='a/b/c', dtype=dtypes.float64)
       factor = ff.ConvInputKroneckerFactor(
-          tensor, (1, 2, 3, 4), 3, 2, has_bias=True)
+          tensor, filter_shape=(1, 2, 3, 4), padding='SAME', has_bias=True)
+      factor.instantiate_cov_variables()
       cov = factor.get_cov()
       self.assertEqual(cov.dtype, dtype)
       self.assertEqual([1 * 2 * 3 + 1, 1 * 2 * 3 + 1],
@@ -517,37 +770,67 @@ class ConvInputKroneckerFactorTest(test.TestCase):
 
   def testMakeCovarianceUpdateOpWithBias(self):
     with tf_ops.Graph().as_default(), self.test_session() as sess:
-      random_seed.set_random_seed(200)
+      input_shape = (2, 1, 1, 1)
       tensor = array_ops.constant(
-          np.arange(1., 17.).reshape(2, 2, 2, 2), dtype=dtypes.float32)
+          np.arange(1, 1 + np.prod(input_shape)).reshape(input_shape).astype(
+              np.float32))
       factor = ff.ConvInputKroneckerFactor(
-          tensor, (1, 2, 1, 1), [1, 1, 1, 1], 'SAME', has_bias=True)
+          tensor, filter_shape=(1, 1, 1, 1), padding='SAME', has_bias=True)
+      factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
-      new_cov = sess.run(factor.make_covariance_update_op(.5))
-      self.assertAllClose([[34.375, 37, 3.125], [37, 41, 3.5], [3.125, 3.5, 1]],
-                          new_cov)
+      new_cov = sess.run(factor.make_covariance_update_op(0.))
+      self.assertAllClose(
+          [
+              [(1. + 4.) / 2., (1. + 2.) / 2.],  #
+              [(1. + 2.) / 2., (1. + 1.) / 2.]
+          ],  #
+          new_cov)
 
   def testMakeCovarianceUpdateOpNoBias(self):
     with tf_ops.Graph().as_default(), self.test_session() as sess:
-      random_seed.set_random_seed(200)
+      input_shape = (2, 1, 1, 1)
       tensor = array_ops.constant(
-          np.arange(1., 17.).reshape(2, 2, 2, 2), dtype=dtypes.float32)
-      factor = ff.ConvInputKroneckerFactor(tensor, (1, 2, 1, 1), [1, 1, 1, 1],
-                                           'SAME')
+          np.arange(1, 1 + np.prod(input_shape)).reshape(input_shape).astype(
+              np.float32))
+      factor = ff.ConvInputKroneckerFactor(
+          tensor, filter_shape=(1, 1, 1, 1), padding='SAME')
+      factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
-      new_cov = sess.run(factor.make_covariance_update_op(.5))
-      self.assertAllClose([[34.375, 37], [37, 41]], new_cov)
+      new_cov = sess.run(factor.make_covariance_update_op(0.))
+      self.assertAllClose([[(1. + 4.) / 2.]], new_cov)
 
 
-class ConvOutputKroneckerFactorTest(test.TestCase):
+class ConvOutputKroneckerFactorTest(ConvFactorTestCase):
+
+  def test3DConvolution(self):
+    with tf_ops.Graph().as_default():
+      batch_size = 1
+      width = 3
+      out_channels = width**3
+
+      factor = ff.ConvOutputKroneckerFactor(outputs_grads=[
+          random_ops.random_uniform(
+              (batch_size, width, width, width, out_channels), seed=0)
+      ])
+      factor.instantiate_cov_variables()
+
+      with self.test_session() as sess:
+        sess.run(tf_variables.global_variables_initializer())
+        sess.run(factor.make_covariance_update_op(0.0))
+        cov = sess.run(factor.get_cov())
+
+      # Cov should be rank 3^3, as each spatial position contributes a rank-1
+      # update.
+      self.assertMatrixRank(width**3, cov)
 
   def testConvOutputKroneckerFactorInit(self):
     with tf_ops.Graph().as_default():
       random_seed.set_random_seed(200)
       tensor = array_ops.ones((2, 3, 4, 5), name='a/b/c')
       factor = ff.ConvOutputKroneckerFactor((tensor,))
+      factor.instantiate_cov_variables()
       self.assertEqual([5, 5], factor.get_cov().get_shape().as_list())
 
   def testConvOutputKroneckerFactorInitFloat64(self):
@@ -556,22 +839,17 @@ class ConvOutputKroneckerFactorTest(test.TestCase):
       random_seed.set_random_seed(200)
       tensor = array_ops.ones((2, 3, 4, 5), dtype=dtype, name='a/b/c')
       factor = ff.ConvOutputKroneckerFactor((tensor,))
+      factor.instantiate_cov_variables()
       cov = factor.get_cov()
       self.assertEqual(cov.dtype, dtype)
       self.assertEqual([5, 5], cov.get_shape().as_list())
 
-  def testConvOutputKroneckerFactorInitNotEnoughDims(self):
-    with tf_ops.Graph().as_default():
-      random_seed.set_random_seed(200)
-      tensor = array_ops.ones((2, 3), name='a/b/c')
-      with self.assertRaises(IndexError):
-        ff.ConvOutputKroneckerFactor(tensor)
-
   def testMakeCovarianceUpdateOp(self):
     with tf_ops.Graph().as_default(), self.test_session() as sess:
       random_seed.set_random_seed(200)
       tensor = np.arange(1, 17).reshape(2, 2, 2, 2).astype(np.float32)
       factor = ff.ConvOutputKroneckerFactor((array_ops.constant(tensor),))
+      factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
       new_cov = sess.run(factor.make_covariance_update_op(.5))
@@ -584,8 +862,8 @@ class FullyConnectedMultiKFTest(test.TestCase):
     with tf_ops.Graph().as_default():
       random_seed.set_random_seed(200)
       tensor = array_ops.ones((2, 3), name='a/b/c')
-      tensor_list = [tensor]
-      factor = ff.FullyConnectedMultiKF((tensor_list,), has_bias=False)
+      factor = ff.FullyConnectedMultiKF((tensor,), has_bias=False)
+      factor.instantiate_cov_variables()
       self.assertEqual([3, 3], factor.get_cov().get_shape().as_list())
 
   def testFullyConnectedMultiKFInitFloat64(self):
@@ -593,8 +871,8 @@ class FullyConnectedMultiKFTest(test.TestCase):
       dtype = dtypes.float64_ref
       random_seed.set_random_seed(200)
       tensor = array_ops.ones((2, 3), dtype=dtype, name='a/b/c')
-      tensor_list = [tensor]
-      factor = ff.FullyConnectedMultiKF((tensor_list,), has_bias=False)
+      factor = ff.FullyConnectedMultiKF((tensor,), has_bias=False)
+      factor.instantiate_cov_variables()
       cov = factor.get_cov()
       self.assertEqual(cov.dtype, dtype)
       self.assertEqual([3, 3], cov.get_shape().as_list())
@@ -603,8 +881,8 @@ class FullyConnectedMultiKFTest(test.TestCase):
     with tf_ops.Graph().as_default(), self.test_session() as sess:
       random_seed.set_random_seed(200)
       tensor = array_ops.constant([[1., 2.], [3., 4.]], name='a/b/c')
-      tensor_list = [tensor]
-      factor = ff.FullyConnectedMultiKF((tensor_list,), has_bias=True)
+      factor = ff.FullyConnectedMultiKF((tensor,), has_bias=True)
+      factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
       new_cov = sess.run(factor.make_covariance_update_op(.5))
@@ -614,8 +892,8 @@ class FullyConnectedMultiKFTest(test.TestCase):
     with tf_ops.Graph().as_default(), self.test_session() as sess:
       random_seed.set_random_seed(200)
       tensor = array_ops.constant([[1., 2.], [3., 4.]], name='a/b/c')
-      tensor_list = [tensor]
-      factor = ff.FullyConnectedMultiKF((tensor_list,))
+      factor = ff.FullyConnectedMultiKF((tensor,))
+      factor.instantiate_cov_variables()
 
       sess.run(tf_variables.global_variables_initializer())
       new_cov = sess.run(factor.make_covariance_update_op(.5))
diff --git a/tensorflow/contrib/kfac/python/kernel_tests/layer_collection_test.py b/tensorflow/contrib/kfac/python/kernel_tests/layer_collection_test.py
index b8ccbeadd0a9d69edb41fef50e3edb090457adf2..ba22099340b9535515afeee8ca22522c34a9ff1d 100644
--- a/tensorflow/contrib/kfac/python/kernel_tests/layer_collection_test.py
+++ b/tensorflow/contrib/kfac/python/kernel_tests/layer_collection_test.py
@@ -104,22 +104,53 @@ class LayerCollectionTest(test.TestCase):
           array_ops.constant(3),
           approx=layer_collection.APPROX_DIAGONAL_NAME)
       lc.register_conv2d(
-          array_ops.constant(4), [1, 1, 1, 1], 'SAME',
-          array_ops.ones((1, 1, 1, 1)), array_ops.constant(3))
+          params=array_ops.ones((2, 3, 4, 5)),
+          strides=[1, 1, 1, 1],
+          padding='SAME',
+          inputs=array_ops.ones((1, 2, 3, 4)),
+          outputs=array_ops.ones((1, 1, 1, 5)))
       lc.register_conv2d(
-          array_ops.constant(4), [1, 1, 1, 1],
-          'SAME',
-          array_ops.ones((1, 1, 1, 1)),
-          array_ops.constant(3),
+          params=array_ops.ones((2, 3, 4, 5)),
+          strides=[1, 1, 1, 1],
+          padding='SAME',
+          inputs=array_ops.ones((1, 2, 3, 4)),
+          outputs=array_ops.ones((1, 1, 1, 5)),
           approx=layer_collection.APPROX_DIAGONAL_NAME)
+      lc.register_separable_conv2d(
+          depthwise_params=array_ops.ones((3, 3, 1, 2)),
+          pointwise_params=array_ops.ones((1, 1, 2, 4)),
+          inputs=array_ops.ones((32, 5, 5, 1)),
+          depthwise_outputs=array_ops.ones((32, 5, 5, 2)),
+          pointwise_outputs=array_ops.ones((32, 5, 5, 4)),
+          strides=[1, 1, 1, 1],
+          padding='SAME')
+      lc.register_convolution(
+          params=array_ops.ones((3, 3, 1, 8)),
+          inputs=array_ops.ones((32, 5, 5, 1)),
+          outputs=array_ops.ones((32, 5, 5, 8)),
+          padding='SAME')
       lc.register_generic(
           array_ops.constant(5), 16, approx=layer_collection.APPROX_FULL_NAME)
       lc.register_generic(
           array_ops.constant(6),
           16,
           approx=layer_collection.APPROX_DIAGONAL_NAME)
-
-      self.assertEqual(6, len(lc.get_blocks()))
+      lc.register_fully_connected_multi(
+          array_ops.constant(1),
+          (array_ops.constant(2), array_ops.constant(3)),
+          (array_ops.constant(4), array_ops.constant(5)))
+      lc.register_conv2d_multi(
+          params=array_ops.ones((2, 3, 4, 5)),
+          strides=[1, 1, 1, 1],
+          padding='SAME',
+          inputs=(array_ops.ones((1, 2, 3, 4)), array_ops.ones((5, 6, 7, 8))),
+          outputs=(array_ops.ones((1, 1, 1, 5)), array_ops.ones((2, 2, 2, 10))))
+      lc.register_embedding_multi(
+          array_ops.constant((1,)),
+          (array_ops.constant(2), array_ops.constant(3)),
+          (array_ops.constant(4), array_ops.constant(5)))
+
+      self.assertEqual(12, len(lc.get_blocks()))
 
   def testRegisterBlocksMultipleRegistrations(self):
     with ops.Graph().as_default():
@@ -237,16 +268,16 @@ class LayerCollectionTest(test.TestCase):
 
       # Create a new loss function by name.
       lc.register_categorical_predictive_distribution(logits, name='loss1')
-      self.assertEqual(1, len(lc.losses))
+      self.assertEqual(1, len(lc.towers_by_loss))
 
       # Add logits to same loss function.
       lc.register_categorical_predictive_distribution(
           logits, name='loss1', reuse=True)
-      self.assertEqual(1, len(lc.losses))
+      self.assertEqual(1, len(lc.towers_by_loss))
 
       # Add another new loss function.
       lc.register_categorical_predictive_distribution(logits, name='loss2')
-      self.assertEqual(2, len(lc.losses))
+      self.assertEqual(2, len(lc.towers_by_loss))
 
   def testLossFunctionWithoutName(self):
     """Ensure loss functions get unique names if 'name' not specified."""
@@ -298,13 +329,9 @@ class LayerCollectionTest(test.TestCase):
             name='loss1',
             reuse=layer_collection.VARIABLE_SCOPE)
 
-      self.assertEqual(len(lc.losses), 1)
-      loss = lc.losses[0]
-
+      self.assertEqual(len(lc.towers_by_loss), 1)
       # Three successful registrations.
-      self.assertEqual(loss.params.shape.as_list(),
-                       [3 * batch_size, output_size])
-      self.assertEqual(loss.targets.shape.as_list(), [3 * batch_size])
+      self.assertEqual(len(lc.towers_by_loss[0]), 3)
 
   def testRegisterCategoricalPredictiveDistributionBatchSize1(self):
     with ops.Graph().as_default():
@@ -479,17 +506,6 @@ class LayerCollectionTest(test.TestCase):
       variables = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
       self.assertTrue(all([var.name.startswith(scope) for var in variables]))
 
-  def testGetUseCountMap(self):
-    """Ensure get_use_count_map() sums 'num_registered_minibatches'."""
-    lc = layer_collection.LayerCollection()
-    lc.fisher_blocks = {
-        'a': MockFisherBlock(),
-        ('a', 'c'): MockFisherBlock(),
-        ('b', 'c'): MockFisherBlock()
-    }
-    use_count_map = lc.get_use_count_map()
-    self.assertDictEqual({'a': 4, 'b': 2, 'c': 4}, use_count_map)
-
   def testIdentifyLinkedParametersSomeRegisteredInOtherTuples(self):
     x = variable_scope.get_variable('x', shape=())
     y = variable_scope.get_variable('y', shape=())
@@ -550,6 +566,32 @@ class LayerCollectionTest(test.TestCase):
     self.assertIsInstance(lc.fisher_blocks[b_0], fisher_blocks.FullFB)
     self.assertIsInstance(lc.fisher_blocks[b_1], fisher_blocks.NaiveDiagonalFB)
 
+  def testDefaultLayerCollection(self):
+    with ops.Graph().as_default():
+      # Can't get default if there isn't one set.
+      with self.assertRaises(ValueError):
+        layer_collection.get_default_layer_collection()
+
+      # Can't set default twice.
+      lc = layer_collection.LayerCollection()
+      layer_collection.set_default_layer_collection(lc)
+      with self.assertRaises(ValueError):
+        layer_collection.set_default_layer_collection(lc)
+
+      # Same as one set.
+      self.assertTrue(lc is layer_collection.get_default_layer_collection())
+
+      # Can set to None.
+      layer_collection.set_default_layer_collection(None)
+      with self.assertRaises(ValueError):
+        layer_collection.get_default_layer_collection()
+
+      # as_default() is the same as setting/clearing.
+      with lc.as_default():
+        self.assertTrue(lc is layer_collection.get_default_layer_collection())
+      with self.assertRaises(ValueError):
+        layer_collection.get_default_layer_collection()
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/kfac/python/kernel_tests/loss_functions_test.py b/tensorflow/contrib/kfac/python/kernel_tests/loss_functions_test.py
index ae787b6f1ac90218f2ac73d37fb270df0b822de2..c00af5593f085e3b1f3e030a24f4b821115cc869 100644
--- a/tensorflow/contrib/kfac/python/kernel_tests/loss_functions_test.py
+++ b/tensorflow/contrib/kfac/python/kernel_tests/loss_functions_test.py
@@ -24,7 +24,6 @@ from tensorflow.contrib.kfac.python.ops import loss_functions
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import random_ops
 from tensorflow.python.platform import test
 
 
@@ -97,22 +96,6 @@ class CategoricalLogitsNegativeLogProbLossTest(test.TestCase):
       # difficult to say if the output is correct or not...
       neg_log_prob = sess.run(neg_log_prob)
 
-  def testMultiMinibatchRegistration(self):
-    """Ensure this loss function supports registering multiple minibatches."""
-    with ops.Graph().as_default():
-      tower_logits = []
-      loss = None
-      num_towers = 5
-      for _ in range(num_towers):
-        logits = random_ops.random_uniform(shape=[2, 3])
-        tower_logits.append(logits)
-        if loss is None:
-          loss = loss_functions.CategoricalLogitsNegativeLogProbLoss(logits)
-        else:
-          loss.register_additional_minibatch(logits)
-      self.assertListEqual(loss.input_minibatches, tower_logits)
-      self.assertEqual(loss.num_registered_minibatches, num_towers)
-
   def testMultiplyFisherSingleVector(self):
     with ops.Graph().as_default(), self.test_session() as sess:
       logits = np.array([1., 2., 3.])
@@ -203,23 +186,5 @@ class OnehotCategoricalLogitsNegativeLogProbLossTest(test.TestCase):
       # difficult to say if the output is correct or not...
       neg_log_prob = sess.run(neg_log_prob)
 
-  def testMultiMinibatchRegistration(self):
-    """Ensure this loss function supports registering multiple minibatches."""
-    with ops.Graph().as_default():
-      tower_logits = []
-      loss = None
-      num_towers = 5
-      for _ in range(num_towers):
-        logits = random_ops.random_uniform(shape=[2, 3])
-        tower_logits.append(logits)
-        if loss is None:
-          loss = loss_functions.OnehotCategoricalLogitsNegativeLogProbLoss(
-              logits)
-        else:
-          loss.register_additional_minibatch(logits)
-      self.assertListEqual(loss.input_minibatches, tower_logits)
-      self.assertEqual(loss.num_registered_minibatches, num_towers)
-
-
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/kfac/python/kernel_tests/utils_test.py b/tensorflow/contrib/kfac/python/kernel_tests/utils_test.py
index 97a97adbf5577cd2694d3055acaa59258ad27964..2cee01212a11595669e9df0fc95a5657926c1038 100644
--- a/tensorflow/contrib/kfac/python/kernel_tests/utils_test.py
+++ b/tensorflow/contrib/kfac/python/kernel_tests/utils_test.py
@@ -29,6 +29,8 @@ from tensorflow.python.framework import random_seed
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
@@ -325,6 +327,84 @@ class UtilsTest(test.TestCase):
           ],
           values)
 
+  def testExtractConvolutionPatches(self):
+    with ops.Graph().as_default(), self.test_session() as sess:
+      batch_size = 10
+      image_spatial_shape = [9, 10, 11]
+      in_channels = out_channels = 32
+      kernel_spatial_shape = [5, 3, 3]
+      spatial_strides = [1, 2, 1]
+      spatial_dilation = [1, 1, 1]
+      padding = 'SAME'
+
+      images = random_ops.random_uniform(
+          [batch_size] + image_spatial_shape + [in_channels], seed=0)
+      kernel_shape = kernel_spatial_shape + [in_channels, out_channels]
+      kernel = random_ops.random_uniform(kernel_shape, seed=1)
+
+      # Ensure shape matches expectation.
+      patches = utils.extract_convolution_patches(
+          images,
+          kernel_shape,
+          padding,
+          strides=spatial_strides,
+          dilation_rate=spatial_dilation)
+      result_spatial_shape = (
+          patches.shape.as_list()[1:1 + len(image_spatial_shape)])
+      self.assertEqual(patches.shape.as_list(),
+                       [batch_size] + result_spatial_shape +
+                       kernel_spatial_shape + [in_channels])
+
+      # Ensure extract...patches() + matmul() and convolution() implementation
+      # give the same answer.
+      outputs = nn_ops.convolution(
+          images,
+          kernel,
+          padding,
+          strides=spatial_strides,
+          dilation_rate=spatial_dilation)
+
+      patches_flat = array_ops.reshape(
+          patches, [-1, np.prod(kernel_spatial_shape) * in_channels])
+      kernel_flat = array_ops.reshape(kernel, [-1, out_channels])
+      outputs_flat = math_ops.matmul(patches_flat, kernel_flat)
+
+      outputs_, outputs_flat_ = sess.run([outputs, outputs_flat])
+      self.assertAllClose(outputs_.flatten(), outputs_flat_.flatten())
+
+  def testExtractPointwiseConv2dPatches(self):
+    with ops.Graph().as_default(), self.test_session() as sess:
+      batch_size = 10
+      image_height = image_width = 8
+      in_channels = out_channels = 3
+      kernel_height = kernel_width = 1
+      strides = [1, 1, 1, 1]
+      padding = 'VALID'
+
+      images = random_ops.random_uniform(
+          [batch_size, image_height, image_width, in_channels], seed=0)
+      kernel_shape = [kernel_height, kernel_width, in_channels, out_channels]
+      kernel = random_ops.random_uniform(kernel_shape, seed=1)
+
+      # Ensure shape matches expectation.
+      patches = utils.extract_pointwise_conv2d_patches(images, kernel_shape)
+      self.assertEqual(patches.shape.as_list(), [
+          batch_size, image_height, image_width, kernel_height, kernel_width,
+          in_channels
+      ])
+
+      # Ensure extract...patches() + matmul() and conv2d() implementation
+      # give the same answer.
+      outputs = nn_ops.conv2d(images, kernel, strides, padding)
+
+      patches_flat = array_ops.reshape(
+          patches, [-1, kernel_height * kernel_width * in_channels])
+      kernel_flat = array_ops.reshape(kernel, [-1, out_channels])
+      outputs_flat = math_ops.matmul(patches_flat, kernel_flat)
+
+      outputs_, outputs_flat_ = sess.run([outputs, outputs_flat])
+      self.assertAllClose(outputs_.flatten(), outputs_flat_.flatten())
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/kfac/python/ops/estimator.py b/tensorflow/contrib/kfac/python/ops/estimator.py
index a7e268c48ae326a4d8fa5fe4a4ed15b8b83a0ed9..64755be65c4b5686397dbfd798fec1ed70ae61dc 100644
--- a/tensorflow/contrib/kfac/python/ops/estimator.py
+++ b/tensorflow/contrib/kfac/python/ops/estimator.py
@@ -27,6 +27,7 @@ from tensorflow.contrib.kfac.python.ops import utils
 from tensorflow.python.framework import ops as tf_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gradients_impl
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.util import nest
 
 
@@ -65,6 +66,13 @@ class _DeviceContextGenerator(object):
         yield
 
 
+def _make_thunk_on_device(func, device):
+  def thunk():
+    with tf_ops.device(device):
+      return func()
+  return thunk
+
+
 class FisherEstimator(object):
   """Fisher estimator class supporting various approximations of the Fisher.
 
@@ -83,26 +91,35 @@ class FisherEstimator(object):
   """
 
   def __init__(self,
-               damping_fn,
                variables,
                cov_ema_decay,
+               damping,
                layer_collection,
+               exps=(-1,),
                estimation_mode="gradients",
                colocate_gradients_with_ops=True,
-               cov_devices=None,
-               inv_devices=None):
+               name="FisherEstimator"):
     """Create a FisherEstimator object.
 
     Args:
-      damping_fn: Function, accepts no arguments and returns damping value.
       variables: A list of the variables for which to estimate the Fisher. This
           must match the variables registered in layer_collection (if it is not
           None).
       cov_ema_decay: The decay factor used when calculating the covariance
           estimate moving averages.
+      damping: float. The damping factor used to stabilize training due to
+          errors in the local approximation with the Fisher information matrix,
+          and to regularize the update direction by making it closer to the
+          gradient. (Higher damping means the update looks more like a standard
+          gradient update - see Tikhonov regularization.)
       layer_collection: The layer collection object, which holds the fisher
           blocks, kronecker factors, and losses associated with the
           graph.
+      exps: List of floats or ints. These represent the different matrix
+          powers of the approximate Fisher that the FisherEstimator will be able
+          to multiply vectors by. If the user asks for a matrix power other
+          than one of these (or 1, which is always supported), there will be a
+          failure. (Default: (-1,))
       estimation_mode: The type of estimator to use for the Fishers.  Can be
           'gradients', 'empirical', 'curvature_prop', or 'exact'.
           (Default: 'gradients').  'gradients' is the basic estimation approach
@@ -121,23 +138,17 @@ class FisherEstimator(object):
           equal to the output dimension, roughly speaking.
       colocate_gradients_with_ops: Whether we should request gradients be
           colocated with their respective ops. (Default: True)
-      cov_devices: Iterable of device strings (e.g. '/gpu:0'). Covariance
-          computations will be placed on these devices in a round-robin fashion.
-          Can be None, which means that no devices are specified.
-      inv_devices: Iterable of device strings (e.g. '/gpu:0'). Inversion
-          computations will be placed on these devices in a round-robin fashion.
-          Can be None, which means that no devices are specified.
-
+      name: A string. A name given to this estimator, which is added to the
+          variable scope when constructing variables and ops.
+          (Default: "FisherEstimator")
     Raises:
       ValueError: If no losses have been registered with layer_collection.
     """
-    self._damping_fn = damping_fn
-    self._cov_ema_decay = cov_ema_decay
     self._variables = variables
+    self._cov_ema_decay = cov_ema_decay
+    self._damping = damping
     self._estimation_mode = estimation_mode
     self._layers = layer_collection
-    self._layers.create_subgraph()
-    self._layers.check_registration(variables)
     self._gradient_fns = {
         "gradients": self._get_grads_lists_gradients,
         "empirical": self._get_grads_lists_empirical,
@@ -146,30 +157,10 @@ class FisherEstimator(object):
     }
     self._colocate_gradients_with_ops = colocate_gradients_with_ops
 
-    # TODO(b/70674513): Factor device placement outside of this class.
-    self._cov_device_context_generator = _DeviceContextGenerator(cov_devices)
-    if inv_devices == cov_devices:
-      self._inv_device_context_generator = self._cov_device_context_generator
-    else:
-      self._inv_device_context_generator = _DeviceContextGenerator(inv_devices)
-
-    self._instantiate_factors()
+    self._made_vars = False
+    self._exps = exps
 
-    self.cov_update_thunks = [
-        self._create_cov_update_thunk(factor)
-        for factor in self._layers.get_factors()
-    ]
-    self.cov_update_ops = [thunk() for thunk in self.cov_update_thunks]
-    self.cov_update_op = control_flow_ops.group(
-        self.cov_update_ops, name="cov_update_op")
-
-    self.inv_update_thunks = [
-        self._create_inv_update_thunk(factor)
-        for factor in self._layers.get_factors()
-    ]
-    self.inv_update_ops = [thunk() for thunk in self.inv_update_thunks]
-    self.inv_update_op = control_flow_ops.group(
-        self.inv_update_ops, name="inv_update_op")
+    self._name = name
 
   @property
   def variables(self):
@@ -177,7 +168,21 @@ class FisherEstimator(object):
 
   @property
   def damping(self):
-    return self._damping_fn()
+    return self._damping
+
+  @property
+  def blocks(self):
+    """All registered FisherBlocks."""
+    return self._layers.get_blocks()
+
+  @property
+  def factors(self):
+    """All registered FisherFactors."""
+    return self._layers.get_factors()
+
+  @property
+  def name(self):
+    return self._name
 
   def _apply_transformation(self, vecs_and_vars, transform):
     """Applies an block-wise transformation to the corresponding vectors.
@@ -212,9 +217,7 @@ class FisherEstimator(object):
       A list of (transformed vector, var) pairs in the same order as
       vecs_and_vars.
     """
-
-    return self._apply_transformation(vecs_and_vars,
-                                      lambda fb, vec: fb.multiply_inverse(vec))
+    return self.multiply_matpower(-1, vecs_and_vars)
 
   def multiply(self, vecs_and_vars):
     """Multiplies the vectors by the corresponding (damped) blocks.
@@ -226,9 +229,22 @@ class FisherEstimator(object):
       A list of (transformed vector, var) pairs in the same order as
       vecs_and_vars.
     """
+    return self.multiply_matpower(1, vecs_and_vars)
 
-    return self._apply_transformation(vecs_and_vars,
-                                      lambda fb, vec: fb.multiply(vec))
+  def multiply_matpower(self, exp, vecs_and_vars):
+    """Multiplies the vecs by the corresponding matrix powers of the blocks.
+
+    Args:
+      exp: A float representing the power to raise the blocks by before
+        multiplying it by the vector.
+      vecs_and_vars: List of (vector, variable) pairs.
+
+    Returns:
+      A list of (transformed vector, var) pairs in the same order as
+      vecs_and_vars.
+    """
+    fcn = lambda fb, vec: fb.multiply_matpower(vec, exp)
+    return self._apply_transformation(vecs_and_vars, fcn)
 
   def _instantiate_factors(self):
     """Instantiates FisherFactors' variables.
@@ -236,9 +252,9 @@ class FisherEstimator(object):
     Raises:
       ValueError: If estimation_mode was improperly specified at construction.
     """
-    fisher_blocks_list = self._layers.get_blocks()
+    blocks = self.blocks
     tensors_to_compute_grads = [
-        fb.tensors_to_compute_grads() for fb in fisher_blocks_list
+        block.tensors_to_compute_grads() for block in blocks
     ]
 
     try:
@@ -248,45 +264,283 @@ class FisherEstimator(object):
       raise ValueError("Unrecognized value {} for estimation_mode.".format(
           self._estimation_mode))
 
-    # TODO(b/68033310): This loop round-robins the "concat" operations which
-    # gather the inputs for the cov_updates. In future, we might do these
-    # computations locally then communicate the results, which would require a
-    # modification to this code.
-    for grads_list, fb in zip(grads_lists, fisher_blocks_list):
-      with self._cov_device_context_generator():
-        fb.instantiate_factors(grads_list, self.damping)
+    for grads_list, block in zip(grads_lists, blocks):
+      block.instantiate_factors(grads_list, self.damping)
+
+  def _check_vars_unmade_and_set_made_flag(self):
+    if self._made_vars:
+      raise Exception("Already made variables.")
+    self._made_vars = True
+
+  def made_vars(self):
+    return self._made_vars
+
+  def _register_matrix_functions(self):
+    for exp in self._exps:
+      for block in self.blocks:
+        block.register_matpower(exp)
+
+  def _finalize_layer_collection(self):
+    self._layers.create_subgraph()
+    self._layers.check_registration(self.variables)
+    self._instantiate_factors()
+    self._register_matrix_functions()
+
+  def make_ops_and_vars(self, scope=None):
+    """Make ops and vars with no specific device placement.
+
+    See make_ops_and_vars_round_robin for further details.
+
+    Args:
+      scope: A string or None.  If None it will be set to the name of this
+        estimator (given by the name property). All variables will be created,
+        and all ops will execute, inside of a variable scope of the given
+        name. (Default: None)
+    Returns:
+      cov_update_ops: List of ops that compute the cov updates. Corresponds
+        one-to-one with the list of factors given by the "factors" property.
+      cov_update_op: cov_update_ops grouped into a single op.
+      inv_update_ops: List of ops that compute the inv updates. Corresponds
+        one-to-one with the list of factors given by the "factors" property.
+      inv_update_op: inv_update_ops grouped into a single op.
+      cov_update_thunks: Thunks that make the ops in cov_update_ops.
+      inv_update_thunks: Thunks that make the ops in inv_update_ops.
+    """
+    return self.make_ops_and_vars_round_robin(scope=scope)
+
+  # TODO(b/70674513): Factor device placement outside of this class.
+  def make_ops_and_vars_round_robin(self, scope=None, cov_devices=None,
+                                    inv_devices=None):
+    """Make ops and vars with a round-robin device placement strategy.
+
+    For each factor, all of that factor's cov variables and their associated
+    update ops will be placed on a particular device.  A new device is chosen
+    for each factor by cycling through the list of devices in the cov_devices
+    argument. If cov_devices is None then no explicit device placement occurs.
+
+    An analogous strategy is followed for inverse update ops, with the list of
+    devices being given by the inv_devices argument.
+
+    Inverse variables on the other hand are not placed on any specific device
+    (they will just use the current device placement context, whatever
+    that happens to be).  The idea is that the inverse variables belong where
+    they will be accessed most often, which is the device that actually applies
+    the preconditioner to the gradient. The user will be responsible for setting
+    the device context for this.
 
-  def _create_cov_update_thunk(self, factor):
+    Args:
+      scope: A string or None.  If None it will be set to the name of this
+        estimator (given by the name property). All variables will be created,
+        and all ops will execute, inside of a variable scope of the given
+        name. (Default: None)
+      cov_devices: Iterable of device strings (e.g. '/gpu:0'). Covariance
+        computations will be placed on these devices in a round-robin fashion.
+        Can be None, which means that no devices are specified.
+      inv_devices: Iterable of device strings (e.g. '/gpu:0'). Inversion
+        computations will be placed on these devices in a round-robin fashion.
+        Can be None, which means that no devices are specified.
+
+    Returns:
+      cov_update_ops: List of ops that compute the cov updates. Corresponds
+        one-to-one with the list of factors given by the "factors" property.
+      cov_update_op: cov_update_ops grouped into a single op.
+      inv_update_ops: List of ops that compute the inv updates. Corresponds
+        one-to-one with the list of factors given by the "factors" property.
+      inv_update_op: inv_update_ops grouped into a single op.
+      cov_update_thunks: Thunks that make the ops in cov_update_ops.
+      inv_update_thunks: Thunks that make the ops in inv_update_ops.
+    """
+    (cov_update_thunks,
+     inv_update_thunks) = self.make_vars_and_create_op_thunks_round_robin(
+         scope=scope,
+         cov_devices=cov_devices,
+         inv_devices=inv_devices)
+    cov_update_ops = [thunk() for thunk in cov_update_thunks]
+    inv_update_ops = [thunk() for thunk in inv_update_thunks]
+
+    scope = self.name if scope is None else scope
+    with variable_scope.variable_scope(scope):
+      cov_update_op = control_flow_ops.group(cov_update_ops,
+                                             name="cov_update_op")
+      inv_update_op = control_flow_ops.group(inv_update_ops,
+                                             name="inv_update_op")
+
+    return (cov_update_ops, cov_update_op, inv_update_ops, inv_update_op,
+            cov_update_thunks, inv_update_thunks)
+
+  def make_vars_and_create_op_thunks_round_robin(self,
+                                                 scope=None,
+                                                 cov_devices=None,
+                                                 inv_devices=None):
+    """Make vars and create op thunks w/ a round-robin device placement strat.
+
+    For each factor, all of that factor's cov variables and their associated
+    update ops will be placed on a particular device.  A new device is chosen
+    for each factor by cycling through the list of devices in the cov_devices
+    argument. If cov_devices is None then no explicit device placement occurs.
+
+    An analogous strategy is followed for inverse update ops, with the list of
+    devices being given by the inv_devices argument.
+
+    Inverse variables on the other hand are not placed on any specific device
+    (they will just use the current device placement context, whatever
+    that happens to be).  The idea is that the inverse variables belong where
+    they will be accessed most often, which is the device that actually applies
+    the preconditioner to the gradient. The user will be responsible for setting
+    the device context for this.
+
+    Args:
+      scope: A string or None.  If None it will be set to the name of this
+        estimator (given by the name property). All variables will be created,
+        and all thunks will execute, inside of a variable scope of the given
+        name. (Default: None)
+      cov_devices: Iterable of device strings (e.g. '/gpu:0'). Covariance
+        computations will be placed on these devices in a round-robin fashion.
+        Can be None, which means that no devices are specified.
+      inv_devices: Iterable of device strings (e.g. '/gpu:0'). Inversion
+        computations will be placed on these devices in a round-robin fashion.
+        Can be None, which means that no devices are specified.
+    Returns:
+      cov_update_thunks: List of cov update thunks. Corresponds one-to-one with
+        the list of factors given by the "factors" property.
+      inv_update_thunks: List of inv update thunks. Corresponds one-to-one with
+        the list of factors given by the "factors" property.
+    """
+
+    (cov_variable_thunks_raw, cov_update_thunks_raw, inv_variable_thunks_raw,
+     inv_update_thunks_raw) = self.create_ops_and_vars_thunks(scope=scope)
+
+    if cov_devices:
+      cov_update_thunks = []
+      for cov_variable_thunk, cov_update_thunk, device in zip(
+          cov_variable_thunks_raw, cov_update_thunks_raw,
+          itertools.cycle(cov_devices)):
+        with tf_ops.device(device):
+          cov_variable_thunk()
+        cov_update_thunks.append(_make_thunk_on_device(cov_update_thunk,
+                                                       device))
+    else:
+      for cov_variable_thunk in cov_variable_thunks_raw:
+        cov_variable_thunk()
+      cov_update_thunks = cov_update_thunks_raw
+
+    for inv_variable_thunk in inv_variable_thunks_raw:
+      inv_variable_thunk()
+
+    if inv_devices:
+      inv_update_thunks = []
+      for inv_update_thunk, device in zip(inv_update_thunks_raw,
+                                          itertools.cycle(inv_devices)):
+        inv_update_thunks.append(_make_thunk_on_device(inv_update_thunk,
+                                                       device))
+    else:
+      inv_update_thunks = inv_update_thunks_raw
+
+    return cov_update_thunks, inv_update_thunks
+
+  def create_ops_and_vars_thunks(self, scope=None):
+    """Create thunks that make the ops and vars on demand.
+
+    This function returns 4 lists of thunks: cov_variable_thunks,
+    cov_update_thunks, inv_variable_thunks, and inv_update_thunks.
+
+    The length of each list is the number of factors and the i-th element of
+    each list corresponds to the i-th factor (given by the "factors" property).
+
+    Note that the execution of these thunks must happen in a certain
+    partial order.  The i-th element of cov_variable_thunks must execute
+    before the i-th element of cov_update_thunks (and also the i-th element
+    of inv_update_thunks).  Similarly, the i-th element of inv_variable_thunks
+    must execute before the i-th element of inv_update_thunks.
+
+    TL;DR (oversimplified): Execute the thunks according to the order that
+    they are returned.
+
+    Args:
+      scope: A string or None.  If None it will be set to the name of this
+        estimator (given by the name property). All thunks will execute inside
+        of a variable scope of the given name. (Default: None)
+    Returns:
+      cov_variable_thunks: A list of thunks that make the cov variables.
+      cov_update_thunks: A list of thunks that make the cov update ops.
+      inv_variable_thunks: A list of thunks that make the inv variables.
+      inv_update_thunks: A list of thunks that make the inv update ops.
+    """
+    self._check_vars_unmade_and_set_made_flag()
+
+    self._finalize_layer_collection()
+
+    scope = self.name if scope is None else scope
+
+    cov_variable_thunks = [
+        self._create_cov_variable_thunk(factor, scope)
+        for factor in self.factors
+    ]
+    cov_update_thunks = [
+        self._create_cov_update_thunk(factor, scope) for factor in self.factors
+    ]
+    inv_variable_thunks = [
+        self._create_inv_variable_thunk(factor, scope)
+        for factor in self.factors
+    ]
+    inv_update_thunks = [
+        self._create_inv_update_thunk(factor, scope) for factor in self.factors
+    ]
+
+    return (cov_variable_thunks, cov_update_thunks,
+            inv_variable_thunks, inv_update_thunks)
+
+  def _create_cov_variable_thunk(self, factor, scope):
+    """Constructs a covariance variable thunk for a single FisherFactor."""
+
+    def thunk():
+      with variable_scope.variable_scope(scope):
+        return factor.instantiate_cov_variables()
+
+    return thunk
+
+  def _create_cov_update_thunk(self, factor, scope):
     """Constructs a covariance update thunk for a single FisherFactor."""
 
     def thunk():
-      with tf_ops.name_scope(
-          "create_cov_update_thunk", values=[self._cov_ema_decay]):
+      with variable_scope.variable_scope(scope):
         return factor.make_covariance_update_op(self._cov_ema_decay)
 
     return thunk
 
-  def _create_inv_update_thunk(self, factor):
+  def _create_inv_variable_thunk(self, factor, scope):
+    """Constructs an inverse variable thunk for a single FisherFactor."""
+
+    def thunk():
+      with variable_scope.variable_scope(scope):
+        return factor.instantiate_inv_variables()
+
+    return thunk
+
+  def _create_inv_update_thunk(self, factor, scope):
     """Constructs an inverse update thunk for a single FisherFactor."""
 
     def thunk():
-      with tf_ops.name_scope("create_inv_update_thunk"):
-        with self._inv_device_context_generator():
-          return control_flow_ops.group(factor.make_inverse_update_ops())
+      with variable_scope.variable_scope(scope):
+        return control_flow_ops.group(factor.make_inverse_update_ops())
 
     return thunk
 
   def _get_grads_lists_gradients(self, tensors):
+    # Passing in a list of loss values is better than passing in the sum, as
+    # the latter creates unnecessary ops on the default device.
     grads_flat = gradients_impl.gradients(
-        self._layers.total_sampled_loss(),
+        self._layers.eval_losses_on_samples(),
         nest.flatten(tensors),
         colocate_gradients_with_ops=self._colocate_gradients_with_ops)
     grads_all = nest.pack_sequence_as(tensors, grads_flat)
     return tuple((grad,) for grad in grads_all)
 
   def _get_grads_lists_empirical(self, tensors):
+    # Passing in a list of loss values is better than passing in the sum, as
+    # the latter creates unnecessary ops on the default device.
     grads_flat = gradients_impl.gradients(
-        self._layers.total_loss(),
+        self._layers.eval_losses(),
         nest.flatten(tensors),
         colocate_gradients_with_ops=self._colocate_gradients_with_ops)
     grads_all = nest.pack_sequence_as(tensors, grads_flat)
@@ -295,9 +549,10 @@ class FisherEstimator(object):
   def _get_transformed_random_signs(self):
     transformed_random_signs = []
     for loss in self._layers.losses:
-      transformed_random_signs.append(
-          loss.multiply_fisher_factor(
-              utils.generate_random_signs(loss.fisher_factor_inner_shape)))
+      with tf_ops.colocate_with(self._layers.loss_colocation_ops[loss]):
+        transformed_random_signs.append(
+            loss.multiply_fisher_factor(
+                utils.generate_random_signs(loss.fisher_factor_inner_shape)))
     return transformed_random_signs
 
   def _get_grads_lists_curvature_prop(self, tensors):
@@ -316,13 +571,14 @@ class FisherEstimator(object):
     # Loop over all coordinates of all losses.
     grads_all = []
     for loss in self._layers.losses:
-      for index in np.ndindex(*loss.fisher_factor_inner_static_shape[1:]):
-        transformed_one_hot = loss.multiply_fisher_factor_replicated_one_hot(
-            index)
-        grads_flat = gradients_impl.gradients(
-            loss.inputs,
-            nest.flatten(tensors),
-            grad_ys=transformed_one_hot,
-            colocate_gradients_with_ops=self._colocate_gradients_with_ops)
-        grads_all.append(nest.pack_sequence_as(tensors, grads_flat))
+      with tf_ops.colocate_with(self._layers.loss_colocation_ops[loss]):
+        for index in np.ndindex(*loss.fisher_factor_inner_static_shape[1:]):
+          transformed_one_hot = loss.multiply_fisher_factor_replicated_one_hot(
+              index)
+          grads_flat = gradients_impl.gradients(
+              loss.inputs,
+              nest.flatten(tensors),
+              grad_ys=transformed_one_hot,
+              colocate_gradients_with_ops=self._colocate_gradients_with_ops)
+          grads_all.append(nest.pack_sequence_as(tensors, grads_flat))
     return zip(*grads_all)
diff --git a/tensorflow/contrib/kfac/python/ops/fisher_blocks.py b/tensorflow/contrib/kfac/python/ops/fisher_blocks.py
index cf38d28b43836dced8babe2ffa7853b1c4b1b369..f517e3148fddd617f9c9d61656f5fcda08676ada 100644
--- a/tensorflow/contrib/kfac/python/ops/fisher_blocks.py
+++ b/tensorflow/contrib/kfac/python/ops/fisher_blocks.py
@@ -40,12 +40,15 @@ from __future__ import print_function
 import abc
 import enum  # pylint: disable=g-bad-import-order
 
+import numpy as np
 import six
 
 from tensorflow.contrib.kfac.python.ops import fisher_factors
 from tensorflow.contrib.kfac.python.ops import utils
+from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.util import nest
 
 # For blocks corresponding to convolutional layers, or any type of block where
 # the parameters can be thought of as being replicated in time or space,
@@ -72,6 +75,37 @@ def set_global_constants(normalize_damping_power=None, pi_type=None):
     PI_TYPE = pi_type
 
 
+def _make_partitionedtensors_inputs(inputs):
+  """Constructs PartitionedTensor for inputs.
+
+  The purpose of this method is to package up the towers/minibatch dimension
+  of these arrays into PartitionedTensor objects.
+
+  Args:
+    inputs: a 1-D list of Tensors. Index is tower/mini-batch.
+
+  Returns:
+    A PartitionedTensor.
+  """
+  return utils.PartitionedTensor(inputs)
+
+
+def _make_partitionedtensors_grads(grads_list):
+  """Constructs PartitionedTensor for grads_list.
+
+  The purpose of this method is to package up the towers/minibatch dimension
+  of these arrays into PartitionedTensor objects.
+
+  Args:
+    grads_list: 2-D list of Tensors. First index is for source, second
+      index for tower.
+
+  Returns:
+    Tuple of PartitionedTensors, one per source.
+  """
+  return tuple(utils.PartitionedTensor(grads) for grads in grads_list)
+
+
 def normalize_damping(damping, num_replications):
   """Normalize damping after adjusting scale by NORMALIZE_DAMPING_POWER."""
   if NORMALIZE_DAMPING_POWER:
@@ -121,12 +155,44 @@ def compute_pi_adjusted_damping(left_cov, right_cov, damping):
     return (damping, damping)
 
 
+class PackagedFunc(object):
+  """A Python thunk with a stable ID.
+
+  Enables stable names for lambdas.
+  """
+
+  def __init__(self, func, func_id):
+    """Initializes PackagedFunc.
+
+    Args:
+      func: a zero-arg Python function.
+      func_id: a hashable, a function that produces a hashable, or a list/tuple
+        thereof.
+    """
+    self._func = func
+    func_id = func_id if isinstance(func_id, (tuple, list)) else (func_id,)
+    self._func_id = func_id
+
+  def __call__(self):
+    return self._func()
+
+  @property
+  def func_id(self):
+    """A hashable identifier for this function."""
+    return tuple(elt() if callable(elt) else elt for elt in self._func_id)
+
+
+def _package_func(func, func_id):
+  return PackagedFunc(func, func_id)
+
+
 @six.add_metaclass(abc.ABCMeta)
 class FisherBlock(object):
   """Abstract base class for objects modeling approximate Fisher matrix blocks.
 
-  Subclasses must implement multiply_inverse(), instantiate_factors(), and
-  tensors_to_compute_grads() methods.
+  Subclasses must implement register_matpower, multiply_matpower,
+  instantiate_factors, tensors_to_compute_grads, and num_registered_minibatches
+  methods.
   """
 
   def __init__(self, layer_collection):
@@ -145,6 +211,32 @@ class FisherBlock(object):
     pass
 
   @abc.abstractmethod
+  def register_matpower(self, exp):
+    """Registers a matrix power to be computed by the block.
+
+    Args:
+      exp: A float representing the power to raise the block by.
+    """
+    pass
+
+  def register_inverse(self):
+    """Registers a matrix inverse to be computed by the block."""
+    self.register_matpower(-1)
+
+  @abc.abstractmethod
+  def multiply_matpower(self, vector, exp):
+    """Multiplies the vector by the (damped) matrix-power of the block.
+
+    Args:
+      vector: The vector (a Tensor or tuple of Tensors) to be multiplied.
+      exp: A float representing the power to raise the block by before
+        multiplying it by the vector.
+
+    Returns:
+      The vector left-multiplied by the (damped) matrix-power of the block.
+    """
+    pass
+
   def multiply_inverse(self, vector):
     """Multiplies the vector by the (damped) inverse of the block.
 
@@ -154,9 +246,8 @@ class FisherBlock(object):
     Returns:
       The vector left-multiplied by the (damped) inverse of the block.
     """
-    pass
+    return self.multiply_matpower(vector, -1)
 
-  @abc.abstractmethod
   def multiply(self, vector):
     """Multiplies the vector by the (damped) block.
 
@@ -166,7 +257,7 @@ class FisherBlock(object):
     Returns:
       The vector left-multiplied by the (damped) block.
     """
-    pass
+    return self.multiply_matpower(vector, 1)
 
   @abc.abstractmethod
   def tensors_to_compute_grads(self):
@@ -207,21 +298,18 @@ class FullFB(FisherBlock):
     super(FullFB, self).__init__(layer_collection)
 
   def instantiate_factors(self, grads_list, damping):
-    self._damping = damping
+    self._damping_func = _package_func(lambda: damping, (damping,))
+
     self._factor = self._layer_collection.make_or_get_factor(
         fisher_factors.FullFactor, (grads_list, self._batch_size))
-    self._factor.register_damped_inverse(damping)
 
-  def multiply_inverse(self, vector):
-    vector_flat = utils.tensors_to_column(vector)
-    out_flat = self._factor.left_multiply_inverse(
-        vector_flat, self._damping)
-    return utils.column_to_tensors(vector, out_flat)
+  def register_matpower(self, exp):
+    self._factor.register_matpower(exp, self._damping_func)
 
-  def multiply(self, vector):
+  def multiply_matpower(self, vector, exp):
     vector_flat = utils.tensors_to_column(vector)
-    out_flat = self._factor.left_multiply(
-        vector_flat, self._damping)
+    out_flat = self._factor.left_multiply_matpower(
+        vector_flat, exp, self._damping_func)
     return utils.column_to_tensors(vector, out_flat)
 
   def full_fisher_block(self):
@@ -271,22 +359,20 @@ class NaiveDiagonalFB(FisherBlock):
     super(NaiveDiagonalFB, self).__init__(layer_collection)
 
   def instantiate_factors(self, grads_list, damping):
-    self._damping = damping
+    self._damping_func = _package_func(lambda: damping, (damping,))
+
     self._factor = self._layer_collection.make_or_get_factor(
         fisher_factors.NaiveDiagonalFactor, (grads_list, self._batch_size))
 
-  def multiply_inverse(self, vector):
-    vector_flat = utils.tensors_to_column(vector)
-    print("vector_flat: %s" % vector_flat)
-    out_flat = self._factor.left_multiply_inverse(
-        vector_flat, self._damping)
-    print("out_flat: %s" % out_flat)
-    return utils.column_to_tensors(vector, out_flat)
+  def register_matpower(self, exp):
+    # Not needed for this.  Matrix powers are computed on demand in the
+    # diagonal case
+    pass
 
-  def multiply(self, vector):
+  def multiply_matpower(self, vector, exp):
     vector_flat = utils.tensors_to_column(vector)
-    out_flat = self._factor.left_multiply(
-        vector_flat, self._damping)
+    out_flat = self._factor.left_multiply_matpower(
+        vector_flat, exp, self._damping_func)
     return utils.column_to_tensors(vector, out_flat)
 
   def full_fisher_block(self):
@@ -312,7 +398,38 @@ class NaiveDiagonalFB(FisherBlock):
     return math_ops.reduce_sum(self._batch_sizes)
 
 
-class FullyConnectedDiagonalFB(FisherBlock):
+class InputOutputMultiMinibatch(object):
+  """Mix-in class for blocks with inputs & outputs and multiple mini-batches."""
+
+  def __init__(self, *args, **kwargs):
+    self.__inputs = []
+    self.__outputs = []
+    super(InputOutputMultiMinibatch, self).__init__(*args, **kwargs)
+
+  def tensors_to_compute_grads(self):
+    """Tensors to compute derivative of loss with respect to."""
+    return self._outputs
+
+  def register_additional_minibatch(self, inputs, outputs):
+    self._inputs.append(inputs)
+    self._outputs.append(outputs)
+
+  @property
+  def num_registered_minibatches(self):
+    result = len(self._inputs)
+    assert result == len(self._outputs)
+    return result
+
+  @property
+  def _inputs(self):
+    return self.__inputs
+
+  @property
+  def _outputs(self):
+    return self.__outputs
+
+
+class FullyConnectedDiagonalFB(InputOutputMultiMinibatch, FisherBlock):
   """FisherBlock for fully-connected (dense) layers using a diagonal approx.
 
   Estimates the Fisher Information matrix's diagonal entries for a fully
@@ -344,80 +461,47 @@ class FullyConnectedDiagonalFB(FisherBlock):
       has_bias: Whether the component Kronecker factors have an additive bias.
           (Default: False)
     """
-    self._inputs = []
-    self._outputs = []
     self._has_bias = has_bias
 
     super(FullyConnectedDiagonalFB, self).__init__(layer_collection)
 
   def instantiate_factors(self, grads_list, damping):
-    inputs = _concat_along_batch_dim(self._inputs)
-    grads_list = tuple(_concat_along_batch_dim(grads) for grads in grads_list)
+    inputs = _make_partitionedtensors_inputs(self._inputs)
+    grads_list = _make_partitionedtensors_grads(grads_list)
 
-    self._damping = damping
     self._factor = self._layer_collection.make_or_get_factor(
         fisher_factors.FullyConnectedDiagonalFactor,
         (inputs, grads_list, self._has_bias))
 
-  def multiply_inverse(self, vector):
-    """Approximate damped inverse Fisher-vector product.
-
-    Args:
-      vector: Tensor or 2-tuple of Tensors. if self._has_bias, Tensor of shape
-        [input_size, output_size] corresponding to layer's weights. If not, a
-        2-tuple of the former and a Tensor of shape [output_size] corresponding
-        to the layer's bias.
+    self._damping_func = _package_func(lambda: damping, (damping,))
 
-    Returns:
-      Tensor of the same shape, corresponding to the inverse Fisher-vector
-      product.
-    """
-    reshaped_vec = utils.layer_params_to_mat2d(vector)
-    reshaped_out = self._factor.left_multiply_inverse(
-        reshaped_vec, self._damping)
-    return utils.mat2d_to_layer_params(vector, reshaped_out)
+  def register_matpower(self, exp):
+    # Not needed for this.  Matrix powers are computed on demand in the
+    # diagonal case
+    pass
 
-  def multiply(self, vector):
-    """Approximate damped Fisher-vector product.
+  def multiply_matpower(self, vector, exp):
+    """Multiplies the vector by the (damped) matrix-power of the block.
 
     Args:
       vector: Tensor or 2-tuple of Tensors. if self._has_bias, Tensor of shape
         [input_size, output_size] corresponding to layer's weights. If not, a
         2-tuple of the former and a Tensor of shape [output_size] corresponding
         to the layer's bias.
+      exp: A scalar representing the power to raise the block to before
+        multiplying it by the vector.
 
     Returns:
-      Tensor of the same shape, corresponding to the Fisher-vector product.
+      The vector left-multiplied by the (damped) matrix-power of the block.
     """
     reshaped_vec = utils.layer_params_to_mat2d(vector)
-    reshaped_out = self._factor.left_multiply(
-        reshaped_vec, self._damping)
+    reshaped_out = self._factor.left_multiply_matpower(
+        reshaped_vec, exp, self._damping_func)
     return utils.mat2d_to_layer_params(vector, reshaped_out)
 
-  def tensors_to_compute_grads(self):
-    """Tensors to compute derivative of loss with respect to."""
-    return self._outputs
-
-  def register_additional_minibatch(self, inputs, outputs):
-    """Registers an additional minibatch to the FisherBlock.
-
-    Args:
-      inputs: Tensor of shape [batch_size, input_size]. Inputs to the
-        matrix-multiply.
-      outputs: Tensor of shape [batch_size, output_size]. Layer preactivations.
-    """
-    self._inputs.append(inputs)
-    self._outputs.append(outputs)
-
-  @property
-  def num_registered_minibatches(self):
-    result = len(self._inputs)
-    assert result == len(self._outputs)
-    return result
-
 
-class ConvDiagonalFB(FisherBlock):
-  """FisherBlock for convolutional layers using a diagonal approx.
+class ConvDiagonalFB(InputOutputMultiMinibatch, FisherBlock):
+  """FisherBlock for 2-D convolutional layers using a diagonal approx.
 
   Estimates the Fisher Information matrix's diagonal entries for a convolutional
   layer. Unlike NaiveDiagonalFB this uses the low-variance "sum of squares"
@@ -441,7 +525,13 @@ class ConvDiagonalFB(FisherBlock):
   to the layer's parameters 'w'.
   """
 
-  def __init__(self, layer_collection, params, strides, padding):
+  def __init__(self,
+               layer_collection,
+               params,
+               strides,
+               padding,
+               data_format=None,
+               dilations=None):
     """Creates a ConvDiagonalFB block.
 
     Args:
@@ -453,92 +543,116 @@ class ConvDiagonalFB(FisherBlock):
         containing the previous and a Tensor of shape [out_channels].
       strides: The stride size in this layer (1-D Tensor of length 4).
       padding: The padding in this layer (e.g. "SAME").
+      data_format: str or None. Format of input data.
+      dilations: List of 4 ints or None. Rate for dilation along all dimensions.
+
+    Raises:
+      ValueError: if strides is not length-4.
+      ValueError: if dilations is not length-4.
+      ValueError: if channel is not last dimension.
     """
-    self._inputs = []
-    self._outputs = []
-    self._strides = tuple(strides) if isinstance(strides, list) else strides
+    if len(strides) != 4:
+      raise ValueError("strides must contain 4 numbers.")
+
+    if dilations is None:
+      dilations = [1, 1, 1, 1]
+
+    if len(dilations) != 4:
+      raise ValueError("dilations must contain 4 numbers.")
+
+    if not utils.is_data_format_channel_last(data_format):
+      raise ValueError("data_format must be channels-last.")
+
+    self._strides = maybe_tuple(strides)
     self._padding = padding
+    self._data_format = data_format
+    self._dilations = maybe_tuple(dilations)
     self._has_bias = isinstance(params, (tuple, list))
 
     fltr = params[0] if self._has_bias else params
     self._filter_shape = tuple(fltr.shape.as_list())
 
+    if len(self._filter_shape) != 4:
+      raise ValueError(
+          "Convolution filter must be of shape"
+          " [filter_height, filter_width, in_channels, out_channels].")
+
     super(ConvDiagonalFB, self).__init__(layer_collection)
 
   def instantiate_factors(self, grads_list, damping):
-    # Concatenate inputs, grads_list into single Tensors.
-    inputs = _concat_along_batch_dim(self._inputs)
-    grads_list = tuple(_concat_along_batch_dim(grads) for grads in grads_list)
+    inputs = _make_partitionedtensors_inputs(self._inputs)
+    grads_list = _make_partitionedtensors_grads(grads_list)
 
     # Infer number of locations upon which convolution is applied.
-    inputs_shape = tuple(inputs.shape.as_list())
-    self._num_locations = (
-        inputs_shape[1] * inputs_shape[2] //
-        (self._strides[1] * self._strides[2]))
-
-    self._damping = (self._num_locations
-                     * normalize_damping(damping, self._num_locations))
+    self._num_locations = num_conv_locations(inputs.shape.as_list(),
+                                             self._strides)
 
     self._factor = self._layer_collection.make_or_get_factor(
         fisher_factors.ConvDiagonalFactor,
         (inputs, grads_list, self._filter_shape, self._strides, self._padding,
-         self._has_bias))
+         self._data_format, self._dilations, self._has_bias))
 
-  def multiply_inverse(self, vector):
-    reshaped_vect = utils.layer_params_to_mat2d(vector)
-    reshaped_out = self._factor.left_multiply_inverse(
-        reshaped_vect, self._damping)
-    return utils.mat2d_to_layer_params(vector, reshaped_out)
+    def damping_func():
+      return self._num_locations * normalize_damping(damping,
+                                                     self._num_locations)
 
-  def multiply(self, vector):
-    reshaped_vect = utils.layer_params_to_mat2d(vector)
-    reshaped_out = self._factor.left_multiply(
-        reshaped_vect, self._damping)
-    return utils.mat2d_to_layer_params(vector, reshaped_out)
-
-  def tensors_to_compute_grads(self):
-    return self._outputs
+    damping_id = (self._num_locations, "mult", "normalize_damping", damping,
+                  self._num_locations)
+    self._damping_func = _package_func(damping_func, damping_id)
 
-  def register_additional_minibatch(self, inputs, outputs):
-    """Registers an additional minibatch to the FisherBlock.
-
-    Args:
-      inputs: Tensor of shape [batch_size, height, width, input_size]. Inputs to
-        the convolution.
-      outputs: Tensor of shape [batch_size, height, width, output_size]. Layer
-        preactivations.
-    """
-    self._inputs.append(inputs)
-    self._outputs.append(outputs)
+  def register_matpower(self, exp):
+    # Not needed for this.  Matrix powers are computed on demand in the
+    # diagonal case
+    pass
 
-  @property
-  def num_registered_minibatches(self):
-    return len(self._inputs)
+  def multiply_matpower(self, vector, exp):
+    reshaped_vect = utils.layer_params_to_mat2d(vector)
+    reshaped_out = self._factor.left_multiply_matpower(
+        reshaped_vect, exp, self._damping_func)
+    return utils.mat2d_to_layer_params(vector, reshaped_out)
 
 
 class KroneckerProductFB(FisherBlock):
-  """A base class for FisherBlocks with separate input and output factors.
+  """A base class for blocks with separate input and output Kronecker factors.
 
   The Fisher block is approximated as a Kronecker product of the input and
   output factors.
   """
 
-  def _register_damped_input_and_output_inverses(self, damping):
-    """Registers damped inverses for both the input and output factors.
+  def __init__(self, layer_collection):
+    super(KroneckerProductFB, self).__init__(layer_collection)
+
+  def _setup_damping(self, damping, normalization=None):
+    """Makes functions that compute the damping values for both factors."""
+    def compute_damping():
+      if normalization is not None:
+        maybe_normalized_damping = normalize_damping(damping, normalization)
+      else:
+        maybe_normalized_damping = damping
+
+      return compute_pi_adjusted_damping(self._input_factor.get_cov(),
+                                         self._output_factor.get_cov(),
+                                         maybe_normalized_damping**0.5)
+
+    if normalization is not None:
+      damping_id = ("compute_pi_adjusted_damping",
+                    "cov", self._input_factor.name,
+                    "cov", self._output_factor.name,
+                    "normalize_damping", damping, normalization, "power", 0.5)
+    else:
+      damping_id = ("compute_pi_adjusted_damping",
+                    "cov", self._input_factor.name,
+                    "cov", self._output_factor.name,
+                    damping, "power", 0.5)
 
-    Sets the instance members _input_damping and _output_damping. Requires the
-    instance members _input_factor and _output_factor.
+    self._input_damping_func = _package_func(lambda: compute_damping()[0],
+                                             damping_id + ("ref", 0))
+    self._output_damping_func = _package_func(lambda: compute_damping()[1],
+                                              damping_id + ("ref", 1))
 
-    Args:
-      damping: The base damping factor (float or Tensor) for the damped inverse.
-    """
-    self._input_damping, self._output_damping = compute_pi_adjusted_damping(
-        self._input_factor.get_cov(),
-        self._output_factor.get_cov(),
-        damping**0.5)
-
-    self._input_factor.register_damped_inverse(self._input_damping)
-    self._output_factor.register_damped_inverse(self._output_damping)
+  def register_matpower(self, exp):
+    self._input_factor.register_matpower(exp, self._input_damping_func)
+    self._output_factor.register_matpower(exp, self._output_damping_func)
 
   @property
   def _renorm_coeff(self):
@@ -552,28 +666,15 @@ class KroneckerProductFB(FisherBlock):
     """
     return 1.0
 
-  def multiply_inverse(self, vector):
-    reshaped_vector = utils.layer_params_to_mat2d(vector)
-    reshaped_out = self._output_factor.right_multiply_inverse(
-        reshaped_vector,
-        self._output_damping)
-    reshaped_out = self._input_factor.left_multiply_inverse(
-        reshaped_out, self._input_damping)
-    if self._renorm_coeff != 1.0:
-      reshaped_out /= math_ops.cast(
-          self._renorm_coeff, dtype=reshaped_out.dtype)
-    return utils.mat2d_to_layer_params(vector, reshaped_out)
-
-  def multiply(self, vector):
+  def multiply_matpower(self, vector, exp):
     reshaped_vector = utils.layer_params_to_mat2d(vector)
-    reshaped_out = self._output_factor.right_multiply(
-        reshaped_vector,
-        self._output_damping)
-    reshaped_out = self._input_factor.left_multiply(
-        reshaped_out, self._input_damping)
+    reshaped_out = self._output_factor.right_multiply_matpower(
+        reshaped_vector, exp, self._output_damping_func)
+    reshaped_out = self._input_factor.left_multiply_matpower(
+        reshaped_out, exp, self._input_damping_func)
     if self._renorm_coeff != 1.0:
-      reshaped_out *= math_ops.cast(
-          self._renorm_coeff, dtype=reshaped_out.dtype)
+      renorm_coeff = math_ops.cast(self._renorm_coeff, dtype=reshaped_out.dtype)
+      reshaped_out *= math_ops.cast(renorm_coeff**exp, dtype=reshaped_out.dtype)
     return utils.mat2d_to_layer_params(vector, reshaped_out)
 
   def full_fisher_block(self):
@@ -590,10 +691,10 @@ class KroneckerProductFB(FisherBlock):
                                                         right_factor)
 
 
-class EmbeddingKFACFB(KroneckerProductFB):
+class EmbeddingKFACFB(InputOutputMultiMinibatch, KroneckerProductFB):
   """K-FAC FisherBlock for embedding layers.
 
-  This FisherBlock is similar to EmbeddingKFACFB, except that its
+  This FisherBlock is similar to FullyConnectedKFACBasicFB, except that its
   input factor is approximated by a diagonal matrix. In the case that each
   example references exactly one embedding, this approximation is exact.
 
@@ -608,8 +709,6 @@ class EmbeddingKFACFB(KroneckerProductFB):
           Fisher information matrix to which this FisherBlock belongs.
       vocab_size: int. Size of vocabulary for this embedding layer.
     """
-    self._inputs = []
-    self._outputs = []
     self._vocab_size = vocab_size
 
     super(EmbeddingKFACFB, self).__init__(layer_collection)
@@ -624,41 +723,18 @@ class EmbeddingKFACFB(KroneckerProductFB):
       damping: 0-D Tensor or float. 'damping' * identity is approximately added
         to this FisherBlock's Fisher approximation.
     """
-    # TODO(b/68033310): Validate which of,
-    #   (1) summing on a single device (as below), or
-    #   (2) on each device in isolation and aggregating
-    # is faster.
-    inputs = _concat_along_batch_dim(self._inputs)
-    grads_list = tuple(_concat_along_batch_dim(grads) for grads in grads_list)
-
-    self._input_factor = self._layer_collection.make_or_get_factor(  #
-        fisher_factors.EmbeddingInputKroneckerFactor,  #
-        ((inputs,), self._vocab_size))
-    self._output_factor = self._layer_collection.make_or_get_factor(  #
-        fisher_factors.FullyConnectedKroneckerFactor,  #
-        (grads_list,))
-    self._register_damped_input_and_output_inverses(damping)
-
-  def tensors_to_compute_grads(self):
-    return self._outputs
-
-  def register_additional_minibatch(self, inputs, outputs):
-    """Registers an additional minibatch to the FisherBlock.
-
-    Args:
-      inputs: Tensor of shape [batch_size, input_size]. Inputs to the
-        matrix-multiply.
-      outputs: Tensor of shape [batch_size, output_size]. Layer preactivations.
-    """
-    self._inputs.append(inputs)
-    self._outputs.append(outputs)
+    inputs = _make_partitionedtensors_inputs(self._inputs)
+    grads_list = _make_partitionedtensors_grads(grads_list)
 
-  @property
-  def num_registered_minibatches(self):
-    return len(self._inputs)
+    self._input_factor = self._layer_collection.make_or_get_factor(
+        fisher_factors.EmbeddingInputKroneckerFactor,
+        (inputs, self._vocab_size))
+    self._output_factor = self._layer_collection.make_or_get_factor(
+        fisher_factors.FullyConnectedKroneckerFactor, (grads_list,))
+    self._setup_damping(damping)
 
 
-class FullyConnectedKFACBasicFB(KroneckerProductFB):
+class FullyConnectedKFACBasicFB(InputOutputMultiMinibatch, KroneckerProductFB):
   """K-FAC FisherBlock for fully-connected (dense) layers.
 
   This uses the Kronecker-factorized approximation from the original
@@ -674,8 +750,6 @@ class FullyConnectedKFACBasicFB(KroneckerProductFB):
       has_bias: Whether the component Kronecker factors have an additive bias.
           (Default: False)
     """
-    self._inputs = []
-    self._outputs = []
     self._has_bias = has_bias
 
     super(FullyConnectedKFACBasicFB, self).__init__(layer_collection)
@@ -690,42 +764,20 @@ class FullyConnectedKFACBasicFB(KroneckerProductFB):
       damping: 0-D Tensor or float. 'damping' * identity is approximately added
         to this FisherBlock's Fisher approximation.
     """
-    # TODO(b/68033310): Validate which of,
-    #   (1) summing on a single device (as below), or
-    #   (2) on each device in isolation and aggregating
-    # is faster.
-    inputs = _concat_along_batch_dim(self._inputs)
-    grads_list = tuple(_concat_along_batch_dim(grads) for grads in grads_list)
-
-    self._input_factor = self._layer_collection.make_or_get_factor(  #
-        fisher_factors.FullyConnectedKroneckerFactor,  #
+    inputs = _make_partitionedtensors_inputs(self._inputs)
+    grads_list = _make_partitionedtensors_grads(grads_list)
+
+    self._input_factor = self._layer_collection.make_or_get_factor(
+        fisher_factors.FullyConnectedKroneckerFactor,
         ((inputs,), self._has_bias))
-    self._output_factor = self._layer_collection.make_or_get_factor(  #
-        fisher_factors.FullyConnectedKroneckerFactor,  #
+    self._output_factor = self._layer_collection.make_or_get_factor(
+        fisher_factors.FullyConnectedKroneckerFactor,
         (grads_list,))
-    self._register_damped_input_and_output_inverses(damping)
+    self._setup_damping(damping)
 
-  def tensors_to_compute_grads(self):
-    return self._outputs
 
-  def register_additional_minibatch(self, inputs, outputs):
-    """Registers an additional minibatch to the FisherBlock.
-
-    Args:
-      inputs: Tensor of shape [batch_size, input_size]. Inputs to the
-        matrix-multiply.
-      outputs: Tensor of shape [batch_size, output_size]. Layer preactivations.
-    """
-    self._inputs.append(inputs)
-    self._outputs.append(outputs)
-
-  @property
-  def num_registered_minibatches(self):
-    return len(self._inputs)
-
-
-class ConvKFCBasicFB(KroneckerProductFB):
-  """FisherBlock for 2D convolutional layers using the basic KFC approx.
+class ConvKFCBasicFB(InputOutputMultiMinibatch, KroneckerProductFB):
+  """FisherBlock for convolutional layers using the basic KFC approx.
 
   Estimates the Fisher Information matrix's blog for a convolutional
   layer.
@@ -748,23 +800,40 @@ class ConvKFCBasicFB(KroneckerProductFB):
   See equation 23 in https://arxiv.org/abs/1602.01407 for details.
   """
 
-  def __init__(self, layer_collection, params, strides, padding):
+  def __init__(self,
+               layer_collection,
+               params,
+               padding,
+               strides=None,
+               dilation_rate=None,
+               data_format=None,
+               extract_patches_fn=None):
     """Creates a ConvKFCBasicFB block.
 
     Args:
       layer_collection: The collection of all layers in the K-FAC approximate
           Fisher information matrix to which this FisherBlock belongs.
       params: The parameters (Tensor or tuple of Tensors) of this layer. If
-        kernel alone, a Tensor of shape [kernel_height, kernel_width,
+        kernel alone, a Tensor of shape [..spatial_filter_shape..,
         in_channels, out_channels]. If kernel and bias, a tuple of 2 elements
         containing the previous and a Tensor of shape [out_channels].
-      strides: The stride size in this layer (1-D Tensor of length 4).
-      padding: The padding in this layer (1-D of Tensor length 4).
+      padding: str. Padding method.
+      strides: List of ints or None. Contains [..spatial_filter_strides..] if
+        'extract_patches_fn' is compatible with tf.nn.convolution(), else
+        [1, ..spatial_filter_strides.., 1].
+      dilation_rate: List of ints or None. Rate for dilation along each spatial
+        dimension if 'extract_patches_fn' is compatible with
+        tf.nn.convolution(), else [1, ..spatial_dilation_rates.., 1].
+      data_format: str or None. Format of input data.
+      extract_patches_fn: str or None. Name of function that extracts image
+        patches. One of "extract_convolution_patches", "extract_image_patches",
+        "extract_pointwise_conv2d_patches".
     """
-    self._inputs = []
-    self._outputs = []
-    self._strides = tuple(strides) if isinstance(strides, list) else strides
     self._padding = padding
+    self._strides = maybe_tuple(strides)
+    self._dilation_rate = maybe_tuple(dilation_rate)
+    self._data_format = data_format
+    self._extract_patches_fn = extract_patches_fn
     self._has_bias = isinstance(params, (tuple, list))
 
     fltr = params[0] if self._has_bias else params
@@ -773,145 +842,524 @@ class ConvKFCBasicFB(KroneckerProductFB):
     super(ConvKFCBasicFB, self).__init__(layer_collection)
 
   def instantiate_factors(self, grads_list, damping):
-    # TODO(b/68033310): Validate which of,
-    #   (1) summing on a single device (as below), or
-    #   (2) on each device in isolation and aggregating
-    # is faster.
-    inputs = _concat_along_batch_dim(self._inputs)
-    grads_list = tuple(_concat_along_batch_dim(grads) for grads in grads_list)
-
     # Infer number of locations upon which convolution is applied.
-    self._num_locations = num_conv_locations(inputs.shape.as_list(),
+    self._num_locations = num_conv_locations(self._inputs[0].shape.as_list(),
                                              self._strides)
 
+    inputs = _make_partitionedtensors_inputs(self._inputs)
+    grads_list = _make_partitionedtensors_grads(grads_list)
+
     self._input_factor = self._layer_collection.make_or_get_factor(
         fisher_factors.ConvInputKroneckerFactor,
-        (inputs, self._filter_shape, self._strides, self._padding,
+        (inputs, self._filter_shape, self._padding, self._strides,
+         self._dilation_rate, self._data_format, self._extract_patches_fn,
          self._has_bias))
     self._output_factor = self._layer_collection.make_or_get_factor(
         fisher_factors.ConvOutputKroneckerFactor, (grads_list,))
 
-    damping = normalize_damping(damping, self._num_locations)
-    self._register_damped_input_and_output_inverses(damping)
-    self._damping = damping
+    self._setup_damping(damping, normalization=self._num_locations)
 
   @property
   def _renorm_coeff(self):
     return self._num_locations
 
-  def tensors_to_compute_grads(self):
-    return self._outputs
 
-  def register_additional_minibatch(self, inputs, outputs):
-    """Registers an additional minibatch to the FisherBlock.
+class DepthwiseConvDiagonalFB(ConvDiagonalFB):
+  """FisherBlock for depthwise_conv2d().
+
+  Equivalent to ConvDiagonalFB applied to each input channel in isolation.
+  """
+
+  def __init__(self,
+               layer_collection,
+               params,
+               strides,
+               padding,
+               rate=None,
+               data_format=None):
+    """Creates a DepthwiseConvDiagonalFB block.
 
     Args:
-      inputs: Tensor of shape [batch_size, height, width, input_size]. Inputs to
-        the convolution.
-      outputs: Tensor of shape [batch_size, height, width, output_size]. Layer
-        preactivations.
+      layer_collection: The collection of all layers in the K-FAC approximate
+          Fisher information matrix to which this FisherBlock belongs.
+      params: Tensor of shape [filter_height, filter_width, in_channels,
+        channel_multiplier].
+      strides: List of 4 ints. Strides along all dimensions.
+      padding: str. Padding method.
+      rate: List of 2 ints or None. Rate for dilation along spatial dimensions.
+      data_format: str or None. Format of input data.
+
+    Raises:
+      NotImplementedError: If params contains a bias.
+      ValueError: If filter is not 4-D.
+      ValueError: If strides is not length-4.
+      ValueError: If rate is not None and is not length-2.
+      ValueError: If channels are not last dimension.
     """
-    self._inputs.append(inputs)
-    self._outputs.append(outputs)
+    if isinstance(params, (tuple, list)):
+      raise NotImplementedError("Bias not yet supported.")
 
-  @property
-  def num_registered_minibatches(self):
-    return len(self._inputs)
+    if params.shape.ndims != 4:
+      raise ValueError("Filter must be 4-D.")
+
+    if len(strides) != 4:
+      raise ValueError("strides must account for 4 dimensions.")
+
+    if rate is not None:
+      if len(rate) != 2:
+        raise ValueError("rate must only account for spatial dimensions.")
+      rate = [1, rate[0], rate[1], 1]  # conv2d expects 4-element rate.
 
+    if not utils.is_data_format_channel_last(data_format):
+      raise ValueError("data_format must be channels-last.")
 
-def _concat_along_batch_dim(tensor_list):
-  """Concatenate tensors along batch (first) dimension.
+    super(DepthwiseConvDiagonalFB, self).__init__(
+        layer_collection=layer_collection,
+        params=params,
+        strides=strides,
+        padding=padding,
+        dilations=rate,
+        data_format=data_format)
+
+    # This is a hack to overwrite the same setting in ConvDiagonalFB.__init__().
+    filter_height, filter_width, in_channels, channel_multiplier = (
+        params.shape.as_list())
+    self._filter_shape = (filter_height, filter_width, in_channels,
+                          in_channels * channel_multiplier)
+
+  def multiply_matpower(self, vector, exp):
+    conv2d_vector = depthwise_conv2d_filter_to_conv2d_filter(vector)
+    conv2d_result = super(DepthwiseConvDiagonalFB, self).multiply_matpower(
+        conv2d_vector, exp)
+    return conv2d_filter_to_depthwise_conv2d_filter(conv2d_result)
+
+
+class DepthwiseConvKFCBasicFB(ConvKFCBasicFB):
+  """FisherBlock for depthwise_conv2d().
+
+  Equivalent to ConvKFCBasicFB applied to each input channel in isolation.
+  """
+
+  def __init__(self,
+               layer_collection,
+               params,
+               strides,
+               padding,
+               rate=None,
+               data_format=None):
+    """Creates a DepthwiseConvKFCBasicFB block.
+
+    Args:
+      layer_collection: The collection of all layers in the K-FAC approximate
+          Fisher information matrix to which this FisherBlock belongs.
+      params: Tensor of shape [filter_height, filter_width, in_channels,
+        channel_multiplier].
+      strides: List of 4 ints. Strides along all dimensions.
+      padding: str. Padding method.
+      rate: List of 2 ints or None. Rate for dilation along spatial dimensions.
+      data_format: str or None. Format of input data.
+
+    Raises:
+      NotImplementedError: If params contains a bias.
+      ValueError: If filter is not 4-D.
+      ValueError: If strides is not length-4.
+      ValueError: If rate is not None and is not length-2.
+      ValueError: If channels are not last dimension.
+    """
+    if isinstance(params, (tuple, list)):
+      raise NotImplementedError("Bias not yet supported.")
+
+    if params.shape.ndims != 4:
+      raise ValueError("Filter must be 4-D.")
+
+    if len(strides) != 4:
+      raise ValueError("strides must account for 4 dimensions.")
+
+    if rate is not None:
+      if len(rate) != 2:
+        raise ValueError("rate must only account for spatial dimensions.")
+      rate = [1, rate[0], rate[1], 1]  # conv2d expects 4-element rate.
+
+    if not utils.is_data_format_channel_last(data_format):
+      raise ValueError("data_format must be channels-last.")
+
+    super(DepthwiseConvKFCBasicFB, self).__init__(
+        layer_collection=layer_collection,
+        params=params,
+        padding=padding,
+        strides=strides,
+        dilation_rate=rate,
+        data_format=data_format,
+        extract_patches_fn="extract_image_patches")
+
+    # This is a hack to overwrite the same setting in ConvKFCBasicFB.__init__().
+    filter_height, filter_width, in_channels, channel_multiplier = (
+        params.shape.as_list())
+    self._filter_shape = (filter_height, filter_width, in_channels,
+                          in_channels * channel_multiplier)
+
+  def multiply_matpower(self, vector, exp):
+    conv2d_vector = depthwise_conv2d_filter_to_conv2d_filter(vector)
+    conv2d_result = super(DepthwiseConvKFCBasicFB, self).multiply_matpower(
+        conv2d_vector, exp)
+    return conv2d_filter_to_depthwise_conv2d_filter(conv2d_result)
+
+
+def depthwise_conv2d_filter_to_conv2d_filter(filter, name=None):  # pylint: disable=redefined-builtin
+  """Converts a convolution filter for use with conv2d.
+
+  Transforms a filter for use with tf.nn.depthwise_conv2d() to one that's
+  compatible with tf.nn.conv2d().
 
   Args:
-    tensor_list: list of Tensors or list of tuples of Tensors.
+    filter: Tensor of shape [height, width, in_channels, channel_multiplier].
+    name: None or str. Name of Op.
 
   Returns:
-    Tensor or tuple of Tensors.
+    Tensor of shape [height, width, in_channels, out_channels].
 
-  Raises:
-    ValueError: If 'tensor_list' is empty.
+  """
+  with ops.name_scope(name, "depthwise_conv2d_filter_to_conv2d_filter",
+                      [filter]):
+    filter = ops.convert_to_tensor(filter)
+    filter_height, filter_width, in_channels, channel_multiplier = (
+        filter.shape.as_list())
+
+    results = []
+    for i in range(in_channels):
+      # Slice out one in_channel's filter. Insert zeros around it to force it
+      # to affect that channel and that channel alone.
+      elements = []
+      if i > 0:
+        elements.append(
+            array_ops.zeros(
+                [filter_height, filter_width, i, channel_multiplier]))
+      elements.append(filter[:, :, i:(i + 1), :])
+      if i + 1 < in_channels:
+        elements.append(
+            array_ops.zeros([
+                filter_height, filter_width, in_channels - (i + 1),
+                channel_multiplier
+            ]))
+
+      # Concat along in_channel.
+      results.append(
+          array_ops.concat(elements, axis=-2, name="in_channel_%d" % i))
+
+    # Concat along out_channel.
+    return array_ops.concat(results, axis=-1, name="out_channel")
+
+
+def conv2d_filter_to_depthwise_conv2d_filter(filter, name=None):  # pylint: disable=redefined-builtin
+  """Converts a convolution filter for use with depthwise_conv2d.
+
+  Transforms a filter for use with tf.nn.conv2d() to one that's
+  compatible with tf.nn.depthwise_conv2d(). Ignores all filters but those along
+  the diagonal.
 
+  Args:
+    filter: Tensor of shape [height, width, in_channels, out_channels].
+    name: None or str. Name of Op.
+
+  Returns:
+    Tensor of shape,
+      [height, width, in_channels, channel_multiplier]
+
+  Raises:
+    ValueError: if out_channels is not evenly divisible by in_channels.
   """
-  if not tensor_list:
-    raise ValueError(
-        "Cannot concatenate Tensors if there are no Tensors to concatenate.")
-
-  if isinstance(tensor_list[0], (tuple, list)):
-    # [(tensor1a, tensor1b),
-    #  (tensor2a, tensor2b), ...] --> (tensor_a, tensor_b)
-    return tuple(
-        array_ops.concat(tensors, axis=0) for tensors in zip(*tensor_list))
-  else:
-    # [tensor1, tensor2] --> tensor
-    return array_ops.concat(tensor_list, axis=0)
+  with ops.name_scope(name, "conv2d_filter_to_depthwise_conv2d_filter",
+                      [filter]):
+    filter = ops.convert_to_tensor(filter)
+    filter_height, filter_width, in_channels, out_channels = (
+        filter.shape.as_list())
+
+    if out_channels % in_channels != 0:
+      raise ValueError("out_channels must be evenly divisible by in_channels.")
+    channel_multiplier = out_channels // in_channels
+
+    results = []
+    filter = array_ops.reshape(filter, [
+        filter_height, filter_width, in_channels, in_channels,
+        channel_multiplier
+    ])
+    for i in range(in_channels):
+      # Slice out output corresponding to the correct filter.
+      filter_slice = array_ops.reshape(
+          filter[:, :, i, i, :],
+          [filter_height, filter_width, 1, channel_multiplier])
+      results.append(filter_slice)
+
+    # Concat along in_channel.
+    return array_ops.concat(results, axis=-2, name="in_channels")
+
+
+def maybe_tuple(obj):
+  if not isinstance(obj, list):
+    return obj
+  return tuple(obj)
 
 
 def num_conv_locations(input_shape, strides):
   """Returns the number of spatial locations a 2D Conv kernel is applied to.
 
   Args:
-    input_shape: list representing shape of inputs to the Conv layer.
-    strides: list representing strides for the Conv kernel.
+    input_shape: List of ints representing shape of inputs to
+      tf.nn.convolution().
+    strides: List of ints representing strides along spatial dimensions as
+      passed in to tf.nn.convolution().
 
   Returns:
     A scalar |T| denoting the number of spatial locations for the Conv layer.
   """
-  return input_shape[1] * input_shape[2] // (strides[1] * strides[2])
-
+  spatial_input_locations = np.prod(input_shape[1:-1])
 
-class FullyConnectedMultiIndepFB(KroneckerProductFB):
+  if strides is None:
+    spatial_strides_divisor = 1
+  else:
+    spatial_strides_divisor = np.prod(strides)
+
+  return spatial_input_locations // spatial_strides_divisor
+
+
+class InputOutputMultiMinibatchMultiUse(InputOutputMultiMinibatch):
+  """Adds methods for multi-use/time-step case to InputOutputMultiMinibatch."""
+
+  def __init__(self, num_uses=None, *args, **kwargs):
+    self._num_uses = num_uses
+    super(InputOutputMultiMinibatchMultiUse, self).__init__(*args, **kwargs)
+
+  def _process_data(self, grads_list):
+    """Process temporal/multi-use data into a standard format."""
+
+    inputs = self._inputs
+
+    # The first possible data format is where inputs is a list of tensors,
+    # one for each use/time-step.
+    if isinstance(inputs[0], (list, tuple)):
+      # The first index is tower/minibatch, the second is use/time-step
+      num_uses = len(inputs[0])
+      if self._num_uses is not None and self._num_uses != num_uses:
+        raise ValueError("num_uses argument doesn't match length of inputs.")
+      else:
+        self._num_uses = num_uses
+
+      # Check that all mini-batches/towers have the same number of uses
+      if not all(len(input_) == num_uses for input_ in inputs):
+        raise ValueError("Length of inputs argument is inconsistent across "
+                         "mini-batches/towers.")
+      # Fold uses/time-step and towers/minibatches dimensions together
+      inputs = nest.flatten(inputs)
+
+      inputs = _make_partitionedtensors_inputs(inputs)
+    # If inputs is not a tuple then we assume that inputs is a tensor
+    # with 'uses' folded into the batch dimension. (And grads_list is a list
+    # across sources of such Tensors.)  This is the native format that the
+    # factor will take as arguments.
+
+    # Now we perform the analogous processing for grads_list
+    if isinstance(grads_list[0][0], (list, tuple)):
+      num_uses = len(grads_list[0][0])
+      if self._num_uses is not None and self._num_uses != num_uses:
+        raise ValueError("num_uses argument doesn't match length of outputs, "
+                         "or length of outputs is inconsistent with length of "
+                         "inputs.")
+      else:
+        self._num_uses = num_uses
+
+      if not all(len(grad) == num_uses for grads in grads_list
+                 for grad in grads):
+        raise ValueError("Length of outputs argument is inconsistent across "
+                         "mini-batches/towers.")
+
+      grads_list = tuple(nest.flatten(grads) for grads in grads_list)
+      grads_list = _make_partitionedtensors_grads(grads_list)
+
+    if self._num_uses is None:
+      raise ValueError("You must supply a value for the num_uses argument if "
+                       "the number of uses cannot be inferred from inputs or "
+                       "outputs arguments (e.g. if they are both given in the "
+                       "single Tensor format, instead of as lists of Tensors).")
+
+    return inputs, grads_list
+
+
+class FullyConnectedMultiIndepFB(InputOutputMultiMinibatchMultiUse,
+                                 KroneckerProductFB):
   """FisherBlock for fully-connected layers that share parameters.
+
+  This class implements the "independence across time" approximation from the
+  following paper:
+    https://openreview.net/pdf?id=HyMTkQZAb
   """
 
-  def __init__(self, layer_collection, inputs, outputs, has_bias=False):
+  def __init__(self, layer_collection, has_bias=False, num_uses=None):
     """Creates a FullyConnectedMultiIndepFB block.
 
     Args:
       layer_collection: LayerCollection instance.
-      inputs: list or tuple of Tensors. Each Tensor has shape [batch_size,
-        inputs_size].
-      outputs: list or tuple of Tensors. Each Tensor has shape [batch_size,
-        outputs_size].
       has_bias: bool. If True, estimates Fisher with respect to a bias
         parameter as well as the layer's parameters.
+      num_uses: int or None. Number of uses of the layer in the model's graph.
+        Only required if the data is formatted with uses/time folded into the
+        batch dimension (instead of uses/time being a list dimension).
+        (Default: None)
     """
-
-    assert len(inputs) == len(outputs)
-    # We need to make sure inputs and outputs are tuples and not lists so that
-    # they get hashed by layer_collection.make_or_get_factor properly.
-    self._inputs = tuple(inputs)
-    self._outputs = tuple(outputs)
     self._has_bias = has_bias
-    self._num_uses = len(inputs)
 
-    super(FullyConnectedMultiIndepFB, self).__init__(layer_collection)
-
-  @property
-  def num_registered_minibatches(self):
-    # TODO(b/69411207): Add support for registering additional minibatches.
-    return 1
+    super(FullyConnectedMultiIndepFB, self).__init__(
+        layer_collection=layer_collection,
+        num_uses=num_uses)
 
   def instantiate_factors(self, grads_list, damping):
+    inputs, grads_list = self._process_data(grads_list)
 
     self._input_factor = self._layer_collection.make_or_get_factor(
         fisher_factors.FullyConnectedMultiKF,
-        ((self._inputs,), self._has_bias))
+        ((inputs,), self._num_uses, self._has_bias))
 
     self._output_factor = self._layer_collection.make_or_get_factor(
-        fisher_factors.FullyConnectedMultiKF, (grads_list,))
+        fisher_factors.FullyConnectedMultiKF, (grads_list, self._num_uses))
 
-    damping = normalize_damping(damping, self._num_uses)
-    self._register_damped_input_and_output_inverses(damping)
+    self._setup_damping(damping, normalization=self._num_uses)
 
   @property
   def _renorm_coeff(self):
-    return self._num_uses
+    return float(self._num_uses)
 
-  def tensors_to_compute_grads(self):
-    return self._outputs
 
-  def num_inputs(self):
-    return len(self._inputs)
+class ConvKFCBasicMultiIndepFB(InputOutputMultiMinibatchMultiUse,
+                               KroneckerProductFB):
+  """FisherBlock for 2D convolutional layers using the basic KFC approx.
+
+  Similar to ConvKFCBasicFB except that this version supports multiple
+  uses/time-steps via a standard independence approximation.  Similar to the
+  "independence across time" used in FullyConnectedMultiIndepFB but generalized
+  in the obvious way to conv layers.
+  """
+
+  def __init__(self,
+               layer_collection,
+               params,
+               padding,
+               strides=None,
+               dilation_rate=None,
+               data_format=None,
+               extract_patches_fn=None,
+               num_uses=None):
+    """Creates a ConvKFCBasicMultiIndepFB block.
+
+    Args:
+      layer_collection: The collection of all layers in the K-FAC approximate
+          Fisher information matrix to which this FisherBlock belongs.
+      params: The parameters (Tensor or tuple of Tensors) of this layer. If
+        kernel alone, a Tensor of shape [..spatial_filter_shape..,
+        in_channels, out_channels]. If kernel and bias, a tuple of 2 elements
+        containing the previous and a Tensor of shape [out_channels].
+      padding: str. Padding method.
+      strides: List of ints or None. Contains [..spatial_filter_strides..] if
+        'extract_patches_fn' is compatible with tf.nn.convolution(), else
+        [1, ..spatial_filter_strides.., 1].
+      dilation_rate: List of ints or None. Rate for dilation along each spatial
+        dimension if 'extract_patches_fn' is compatible with
+        tf.nn.convolution(), else [1, ..spatial_dilation_rates.., 1].
+      data_format: str or None. Format of input data.
+      extract_patches_fn: str or None. Name of function that extracts image
+        patches. One of "extract_convolution_patches", "extract_image_patches",
+        "extract_pointwise_conv2d_patches".
+      num_uses: int or None. Number of uses of the layer in the model's graph.
+        Only required if the data is formatted with uses/time folded into the
+        batch dimension (instead of uses/time being a list dimension).
+        (Default: None)
+    """
+    self._padding = padding
+    self._strides = maybe_tuple(strides)
+    self._dilation_rate = maybe_tuple(dilation_rate)
+    self._data_format = data_format
+    self._extract_patches_fn = extract_patches_fn
+    self._has_bias = isinstance(params, (tuple, list))
+
+    fltr = params[0] if self._has_bias else params
+    self._filter_shape = tuple(fltr.shape.as_list())
+
+    super(ConvKFCBasicMultiIndepFB, self).__init__(
+        layer_collection=layer_collection,
+        num_uses=num_uses)
+
+  def instantiate_factors(self, grads_list, damping):
+    inputs, grads_list = self._process_data(grads_list)
+
+    # Infer number of locations upon which convolution is applied.
+    self._num_locations = num_conv_locations(inputs.shape.as_list(),
+                                             self._strides)
+
+    self._input_factor = self._layer_collection.make_or_get_factor(
+        fisher_factors.ConvInputKroneckerFactor,
+        (inputs, self._filter_shape, self._padding, self._strides,
+         self._dilation_rate, self._data_format, self._extract_patches_fn,
+         self._has_bias))
+    self._output_factor = self._layer_collection.make_or_get_factor(
+        fisher_factors.ConvOutputKroneckerFactor, (grads_list,))
+
+    self._setup_damping(damping, normalization=
+                        (self._num_locations * self._num_uses))
+
+  @property
+  def _renorm_coeff(self):
+    return self._num_locations * self._num_uses
+
+
+class EmbeddingKFACMultiIndepFB(InputOutputMultiMinibatchMultiUse,
+                                KroneckerProductFB):
+  """K-FAC FisherBlock for embedding layers used multiple times in the graph.
+
+  Similar to EmbeddingKFACFB except that this version supports multiple uses
+  of the parameter within a single model. These uses could correspond to time
+  steps in an RNN architecture, but they don't have to.
+
+  Does not support bias parameters.
+  """
+
+  def __init__(self, layer_collection, vocab_size, num_uses):
+    """Creates an EmbeddingKFACMultiIndepFB block.
+
+    Args:
+      layer_collection: The collection of all layers in the K-FAC approximate
+          Fisher information matrix to which this FisherBlock belongs.
+      vocab_size: int. Size of vocabulary for this embedding layer.
+      num_uses: int or None. Number of uses of the layer in the model's graph.
+        Only required if the data is formatted with time folded into the batch
+        dimension (instead of time being a list dimension).
+    """
+    self._vocab_size = vocab_size
+
+    super(EmbeddingKFACMultiIndepFB, self).__init__(
+        layer_collection=layer_collection,
+        num_uses=num_uses)
+
+  def instantiate_factors(self, grads_list, damping):
+    """Instantiate Kronecker Factors for this FisherBlock.
+
+    Args:
+      grads_list: List of list of list of Tensors. grads_list[i][j][k] is the
+        gradient of the loss with respect to 'outputs' from source 'i',
+        tower/mini-batch 'j', and use/time-step 'k'. Each Tensor has shape
+        [tower_minibatch_size, output_size].
+      damping: 0-D Tensor or float. 'damping' * identity is approximately added
+        to this FisherBlock's Fisher approximation.
+    """
+    inputs, grads_list = self._process_data(grads_list)
+
+    self._input_factor = self._layer_collection.make_or_get_factor(
+        fisher_factors.EmbeddingInputKroneckerFactor,
+        (inputs, self._vocab_size))
+    self._output_factor = self._layer_collection.make_or_get_factor(
+        fisher_factors.FullyConnectedMultiKF, (grads_list, self._num_uses))
+    self._setup_damping(damping, normalization=self._num_uses)
+
+  @property
+  def _renorm_coeff(self):
+    return float(self._num_uses)
 
 
 class SeriesFBApproximation(enum.IntEnum):
@@ -920,34 +1368,35 @@ class SeriesFBApproximation(enum.IntEnum):
   option2 = 2
 
 
-class FullyConnectedSeriesFB(FisherBlock):
+class FullyConnectedSeriesFB(InputOutputMultiMinibatchMultiUse,
+                             KroneckerProductFB):
   """FisherBlock for fully-connected layers that share parameters across time.
 
-  See the following preprint for details:
+  This class implements the "Option 1" and "Option 2" approximations from the
+  following paper:
     https://openreview.net/pdf?id=HyMTkQZAb
 
   See the end of the appendix of the paper for a pseudo-code of the
-  algorithm being implemented by multiply_inverse here.  Note that we are
+  algorithm being implemented by multiply_matpower here.  Note that we are
   using pre-computed versions of certain matrix-matrix products to speed
   things up.  This is explicitly explained wherever it is done.
   """
 
   def __init__(self,
                layer_collection,
-               inputs,
-               outputs,
                has_bias=False,
+               num_uses=None,
                option=SeriesFBApproximation.option2):
     """Constructs a new `FullyConnectedSeriesFB`.
 
     Args:
       layer_collection: The collection of all layers in the K-FAC approximate
         Fisher information matrix to which this FisherBlock belongs.
-      inputs: List of tensors of shape [batch_size, input_size].
-        Inputs to the layer.
-      outputs: List of tensors of shape [batch_size, input_size].
-        Outputs of the layer (before activations).
       has_bias: Whether the layer includes a bias parameter.
+      num_uses: int or None. Number of time-steps over which the layer
+        is used. Only required if the data is formatted with time folded into
+        the batch dimension (instead of time being a list dimension).
+        (Default: None)
       option: A `SeriesFBApproximation` specifying the simplifying assumption
         to be used in this block. `option1` approximates the cross-covariance
         over time as a symmetric matrix, while `option2` makes
@@ -955,48 +1404,58 @@ class FullyConnectedSeriesFB(FisherBlock):
         3.5 of the paper for more details.
     """
 
-    assert len(inputs) == len(outputs)
-    # We need to make sure inputs and outputs are tuples and not lists so that
-    # they get hashed by layer_collection.make_or_get_factor properly.
-    self._inputs = tuple(inputs)
-    self._outputs = tuple(outputs)
     self._has_bias = has_bias
-    self._num_timesteps = len(inputs)
     self._option = option
 
-    super(FullyConnectedSeriesFB, self).__init__(layer_collection)
+    super(FullyConnectedSeriesFB, self).__init__(
+        layer_collection=layer_collection,
+        num_uses=num_uses)
 
   @property
-  def num_registered_minibatches(self):
-    # TODO(b/69411207): Add support for registering additional minibatches.
-    return 1
+  def _num_timesteps(self):
+    return self._num_uses
+
+  @property
+  def _renorm_coeff(self):
+    # This should no longer be used since the multiply_X functions from the base
+    # class have been overridden
+    assert False
 
   def instantiate_factors(self, grads_list, damping):
+    inputs, grads_list = self._process_data(grads_list)
 
     self._input_factor = self._layer_collection.make_or_get_factor(
-        fisher_factors.FullyConnectedMultiKF, ((self._inputs,), self._has_bias))
+        fisher_factors.FullyConnectedMultiKF,
+        ((inputs,), self._num_uses, self._has_bias))
+    self._input_factor.register_cov_dt1()
 
     self._output_factor = self._layer_collection.make_or_get_factor(
-        fisher_factors.FullyConnectedMultiKF, (grads_list,))
+        fisher_factors.FullyConnectedMultiKF, (grads_list, self._num_uses))
+    self._output_factor.register_cov_dt1()
+
+    self._setup_damping(damping, normalization=self._num_uses)
 
-    damping = normalize_damping(damping, self._num_timesteps)
-    self._damping_input, self._damping_output = compute_pi_adjusted_damping(
-        self._input_factor.get_cov(),
-        self._output_factor.get_cov(),
-        damping**0.5)
+  def register_matpower(self, exp):
+    if exp != -1:
+      raise NotImplementedError("FullyConnectedSeriesFB only supports inverse"
+                                " multiplications.")
 
     if self._option == SeriesFBApproximation.option1:
-      self._input_factor.register_option1quants(self._damping_input)
-      self._output_factor.register_option1quants(self._damping_output)
+      self._input_factor.register_option1quants(self._input_damping_func)
+      self._output_factor.register_option1quants(self._output_damping_func)
     elif self._option == SeriesFBApproximation.option2:
-      self._input_factor.register_option2quants(self._damping_input)
-      self._output_factor.register_option2quants(self._damping_output)
+      self._input_factor.register_option2quants(self._input_damping_func)
+      self._output_factor.register_option2quants(self._output_damping_func)
     else:
       raise ValueError(
           "Unrecognized FullyConnectedSeriesFB approximation: {}".format(
               self._option))
 
-  def multiply_inverse(self, vector):
+  def multiply_matpower(self, vector, exp):
+    if exp != -1:
+      raise NotImplementedError("FullyConnectedSeriesFB only supports inverse"
+                                " multiplications.")
+
     # pylint: disable=invalid-name
 
     Z = utils.layer_params_to_mat2d(vector)
@@ -1008,8 +1467,10 @@ class FullyConnectedSeriesFB(FisherBlock):
     if self._option == SeriesFBApproximation.option1:
 
       # Note that L_A = A0^(-1/2) * U_A and L_G = G0^(-1/2) * U_G.
-      L_A, psi_A = self._input_factor.get_option1quants(self._damping_input)
-      L_G, psi_G = self._output_factor.get_option1quants(self._damping_output)
+      L_A, psi_A = self._input_factor.get_option1quants(
+          self._input_damping_func)
+      L_G, psi_G = self._output_factor.get_option1quants(
+          self._output_damping_func)
 
       def gamma(x):
         # We are assuming that each case has the same number of time-steps.
@@ -1046,9 +1507,10 @@ class FullyConnectedSeriesFB(FisherBlock):
 
       # Note that P_A = A_1^T * A_0^(-1) and P_G = G_1^T * G_0^(-1),
       # and K_A = A_0^(-1/2) * E_A and K_G = G_0^(-1/2) * E_G.
-      P_A, K_A, mu_A = self._input_factor.get_option2quants(self._damping_input)
+      P_A, K_A, mu_A = self._input_factor.get_option2quants(
+          self._input_damping_func)
       P_G, K_G, mu_G = self._output_factor.get_option2quants(
-          self._damping_output)
+          self._output_damping_func)
 
       # Our approach differs superficially from the pseudo-code in the paper
       # in order to reduce the total number of matrix-matrix multiplies.
@@ -1101,12 +1563,3 @@ class FullyConnectedSeriesFB(FisherBlock):
     return utils.mat2d_to_layer_params(vector, Z)
 
     # pylint: enable=invalid-name
-
-  def multiply(self, vector):
-    raise NotImplementedError
-
-  def tensors_to_compute_grads(self):
-    return self._outputs
-
-  def num_inputs(self):
-    return len(self._inputs)
diff --git a/tensorflow/contrib/kfac/python/ops/fisher_factors.py b/tensorflow/contrib/kfac/python/ops/fisher_factors.py
index 603d8b8b210279ee6d8f1de0ce10869fde23f4d9..f521363536ef57762d22ebf69339980b534ce18d 100644
--- a/tensorflow/contrib/kfac/python/ops/fisher_factors.py
+++ b/tensorflow/contrib/kfac/python/ops/fisher_factors.py
@@ -19,7 +19,6 @@ from __future__ import division
 from __future__ import print_function
 
 import abc
-import contextlib
 
 import numpy as np
 import six
@@ -53,36 +52,16 @@ EIGENVALUE_DECOMPOSITION_THRESHOLD = 2
 # matrix powers. Must be nonnegative.
 EIGENVALUE_CLIPPING_THRESHOLD = 0.0
 
-# Colocate the covariance ops and variables with the input tensors for each
-# factor.
-COLOCATE_COV_OPS_WITH_INPUTS = True
-
-
-@contextlib.contextmanager
-def maybe_colocate_with(op):
-  """Context to colocate with `op` if `COLOCATE_COV_OPS_WITH_INPUTS`."""
-  if COLOCATE_COV_OPS_WITH_INPUTS:
-    if isinstance(op, (list, tuple)):
-      with tf_ops.colocate_with(op[0]):
-        yield
-    else:
-      with tf_ops.colocate_with(op):
-        yield
-  else:
-    yield
-
 
 def set_global_constants(init_covariances_at_zero=None,
                          zero_debias=None,
                          eigenvalue_decomposition_threshold=None,
-                         eigenvalue_clipping_threshold=None,
-                         colocate_cov_ops_with_inputs=None):
+                         eigenvalue_clipping_threshold=None):
   """Sets various global constants used by the classes in this module."""
   global INIT_COVARIANCES_AT_ZERO
   global ZERO_DEBIAS
   global EIGENVALUE_DECOMPOSITION_THRESHOLD
   global EIGENVALUE_CLIPPING_THRESHOLD
-  global COLOCATE_COV_OPS_WITH_INPUTS
 
   if init_covariances_at_zero is not None:
     INIT_COVARIANCES_AT_ZERO = init_covariances_at_zero
@@ -92,8 +71,6 @@ def set_global_constants(init_covariances_at_zero=None,
     EIGENVALUE_DECOMPOSITION_THRESHOLD = eigenvalue_decomposition_threshold
   if eigenvalue_clipping_threshold is not None:
     EIGENVALUE_CLIPPING_THRESHOLD = eigenvalue_clipping_threshold
-  if colocate_cov_ops_with_inputs is not None:
-    COLOCATE_COV_OPS_WITH_INPUTS = colocate_cov_ops_with_inputs
 
 
 def inverse_initializer(shape, dtype, partition_info=None):  # pylint: disable=unused-argument
@@ -181,7 +158,9 @@ def scope_string_from_params(params):
 
   name_parts = []
   for param in params:
-    if isinstance(param, (tuple, list)):
+    if param is None:
+      name_parts.append("None")
+    elif isinstance(param, (tuple, list)):
       if all([isinstance(p, int) for p in param]):
         name_parts.append("-".join([str(p) for p in param]))
       else:
@@ -190,6 +169,8 @@ def scope_string_from_params(params):
       name_parts.append(str(param))
     elif isinstance(param, (tf_ops.Tensor, variables.Variable)):
       name_parts.append(scope_string_from_name(param))
+    elif isinstance(param, utils.PartitionedTensor):
+      name_parts.append(scope_string_from_name(param.tensors))
     else:
       raise ValueError("Encountered an unsupported param type {}".format(
           type(param)))
@@ -207,6 +188,22 @@ def scalar_or_tensor_to_string(val):
   return repr(val) if np.isscalar(val) else scope_string_from_name(val)
 
 
+def list_to_string(lst):
+  return "_".join(val if isinstance(val, six.string_types)
+                  else scalar_or_tensor_to_string(val) for val in lst)
+
+
+def graph_func_to_id(func):
+  """Returns a hashable object that represents func's computation."""
+  # TODO(b/74201126): replace with Topohash of func's output
+  return func.func_id
+
+
+def graph_func_to_string(func):
+  # TODO(b/74201126): replace with Topohash of func's output
+  return list_to_string(func.func_id)
+
+
 @six.add_metaclass(abc.ABCMeta)
 class FisherFactor(object):
   """Base class for objects modeling factors of approximate Fisher blocks.
@@ -223,13 +220,10 @@ class FisherFactor(object):
   Note that for blocks that aren't based on approximations, a 'factor' can
   be the entire block itself, as is the case for the diagonal and full
   representations.
-
-  Subclasses must implement the _compute_new_cov() method, and the _var_scope
-  and _cov_shape properties.
   """
 
   def __init__(self):
-    self.instantiate_covariance()
+    self._cov = None
 
   @abc.abstractproperty
   def _var_scope(self):
@@ -240,6 +234,10 @@ class FisherFactor(object):
     """
     pass
 
+  @property
+  def name(self):
+    return self._var_scope
+
   @abc.abstractproperty
   def _cov_shape(self):
     """The shape of the variable backing this FisherFactor."""
@@ -267,8 +265,9 @@ class FisherFactor(object):
     """Function for initializing covariance variable."""
     return covariance_initializer
 
-  def instantiate_covariance(self):
-    """Instantiates the covariance Variable as the instance member _cov."""
+  def instantiate_cov_variables(self):
+    """Makes the internal cov variable(s)."""
+    assert self._cov is None
     with variable_scope.variable_scope(self._var_scope):
       self._cov = variable_scope.get_variable(
           "cov",
@@ -300,20 +299,17 @@ class FisherFactor(object):
     """
     new_cov_contribs = tuple(self._compute_new_cov(idx)
                              for idx in range(self._num_sources))
-    # This gets the job done but we might want a better solution in the future.
-    # In particular, we could have a separate way of specifying where the
-    # the cov variables finally end up, independent of where their various
-    # contributions are computed.  Right now these are the same thing, but in
-    # the future we might want to perform the cov computations on each tower,
-    # so that each tower will be considered a "source" (allowing us to reuse
-    # the existing "source" code for this).
-    with maybe_colocate_with(new_cov_contribs[0]):
-      new_cov = math_ops.add_n(new_cov_contribs)
-      # Synchronize value across all TPU cores.
-      if utils.on_tpu():
-        new_cov = utils.cross_replica_mean(new_cov)
-      return moving_averages.assign_moving_average(
-          self._cov, new_cov, ema_decay, zero_debias=ZERO_DEBIAS)
+    new_cov = math_ops.add_n(new_cov_contribs)
+    # Synchronize value across all TPU cores.
+    if utils.on_tpu():
+      new_cov = utils.cross_replica_mean(new_cov)
+    return moving_averages.assign_moving_average(
+        self._cov, new_cov, ema_decay, zero_debias=ZERO_DEBIAS)
+
+  @abc.abstractmethod
+  def instantiate_inv_variables(self):
+    """Makes the internal "inverse" variable(s)."""
+    pass
 
   @abc.abstractmethod
   def make_inverse_update_ops(self):
@@ -341,70 +337,47 @@ class FisherFactor(object):
     return self._cov
 
   @abc.abstractmethod
-  def left_multiply(self, x, damping):
-    """Multiplies 'x' by the damped covariance of this factor.
-
-    Let C be the covariance matrix this factor represents, and
-    D = C + damping * I be its damped variant. This method calculates
-    matmul(D, vec(x)).
-
-    Args:
-      x: Tensor. Represents a single vector. Shape depends on implementation.
-      damping: 0-D Tensor. Damping to add to C's diagonal.
-
-    Returns:
-      Tensor of same shape as 'x'.
-    """
-    pass
+  def left_multiply_matpower(self, x, exp, damping_func):
+    """Left multiplies 'x' by matrix power of this factor (w/ damping applied).
 
-  @abc.abstractmethod
-  def right_multiply(self, x, damping):
-    """Multiplies 'x' by the damped covariance of this factor.
+    This calculation is essentially:
+      (C + damping * I)**exp * x
+    where * is matrix-multiplication, ** is matrix power, I is the identity
+    matrix, and C is the matrix represented by this factor.
 
-    Let C be the covariance matrix this factor represents, and
-    D = C + damping * I be its damped variant. This method calculates
-    matmul(vec(x), D).
+    x can represent either a matrix or a vector.  For some factors, 'x' might
+    represent a vector but actually be stored as a 2D matrix for convenience.
 
     Args:
       x: Tensor. Represents a single vector. Shape depends on implementation.
-      damping: 0-D Tensor. Damping to add to C's diagonal.
+      exp: float.  The matrix exponent to use.
+      damping_func: A function that computes a 0-D Tensor or a float which will
+        be the damping value used.  i.e. damping = damping_func().
 
     Returns:
-      Tensor of same shape as 'x'.
+      Tensor of same shape as 'x' representing the result of the multiplication.
     """
     pass
 
   @abc.abstractmethod
-  def left_multiply_inverse(self, x, damping):
-    """Multiplies 'x' by damped inverse of this factor.
+  def right_multiply_matpower(self, x, exp, damping_func):
+    """Right multiplies 'x' by matrix power of this factor (w/ damping applied).
 
-    Let C be the covariance matrix this factor represents and
-    E = inv(C + damping * I) be its damped inverse. This method calculates
-    matmul(E, vec(x)).
+    This calculation is essentially:
+      x * (C + damping * I)**exp
+    where * is matrix-multiplication, ** is matrix power, I is the identity
+    matrix, and C is the matrix represented by this factor.
 
-    Args:
-      x: Tensor. Represents a single vector. Shape depends on implementation.
-      damping: 0-D Tensor. Damping to add to C's diagonal.
-
-    Returns:
-      Tensor of same shape as 'x'.
-    """
-    pass
-
-  @abc.abstractmethod
-  def right_multiply_inverse(self, x, damping):
-    """Multiplies 'x' by damped inverse of this factor.
-
-    Let C be the covariance matrix this factor represents and
-    E = inv(C + damping * I) be its damped inverse. This method calculates
-    matmul(vec(x), E).
+    Unlike left_multiply_matpower, x will always be a matrix.
 
     Args:
       x: Tensor. Represents a single vector. Shape depends on implementation.
-      damping: 0-D Tensor. Damping to add to C's diagonal.
+      exp: float.  The matrix exponent to use.
+      damping_func: A function that computes a 0-D Tensor or a float which will
+        be the damping value used.  i.e. damping = damping_func().
 
     Returns:
-      Tensor of same shape as 'x'.
+      Tensor of same shape as 'x' representing the result of the multiplication.
     """
     pass
 
@@ -428,47 +401,52 @@ class InverseProvidingFactor(FisherFactor):
   # the latter.
 
   def __init__(self):
-    self._inverses_by_damping = {}
-    self._matpower_by_exp_and_damping = {}
+    self._matpower_by_exp_and_damping = {}  # { (float, hashable): variable }
+    self._matpower_registrations = set()  # { (float, hashable) }
     self._eigendecomp = None
+    self._damping_funcs_by_id = {}  # {hashable: lambda}
 
     super(InverseProvidingFactor, self).__init__()
 
-  def register_damped_inverse(self, damping):
-    """Registers a damped inverse needed by a FisherBlock.
-
-    This creates a variable and signals make_inverse_update_ops to make the
-    corresponding update op.  The variable can be read via the method
-    get_inverse.
+  def _register_damping(self, damping_func):
+    damping_id = graph_func_to_id(damping_func)
+    if damping_id not in self._damping_funcs_by_id:
+      self._damping_funcs_by_id[damping_id] = damping_func
+    return damping_id
 
-    Args:
-      damping: The damping value (float or Tensor) for this factor.
-    """
-    if damping not in self._inverses_by_damping:
-      damping_string = scalar_or_tensor_to_string(damping)
-      with variable_scope.variable_scope(self._var_scope):
-        inv = variable_scope.get_variable(
-            "inv_damp{}".format(damping_string),
-            initializer=inverse_initializer,
-            shape=self._cov_shape,
-            trainable=False,
-            dtype=self._dtype)
-      self._inverses_by_damping[damping] = inv
+  def register_inverse(self, damping_func):
+    # Just for backwards compatibility of some old code and tests
+    self.register_matpower(-1, damping_func)
 
-  def register_matpower(self, exp, damping):
-    """Registers a matrix power needed by a FisherBlock.
+  def register_matpower(self, exp, damping_func):
+    """Registers a matrix power to be maintained and served on demand.
 
     This creates a variable and signals make_inverse_update_ops to make the
     corresponding update op.  The variable can be read via the method
     get_matpower.
 
     Args:
-      exp: The exponent (float or Tensor) to raise the matrix to.
-      damping: The damping value (float or Tensor).
+      exp: float.  The exponent to use in the matrix power.
+      damping_func: A function that computes a 0-D Tensor or a float which will
+        be the damping value used.  i.e. damping = damping_func().
     """
-    if (exp, damping) not in self._matpower_by_exp_and_damping:
+    if exp == 1.0:
+      # We don't register these.  The user shouldn't even be calling this
+      # function with exp = 1.0.
+      return
+
+    damping_id = self._register_damping(damping_func)
+
+    if (exp, damping_id) not in self._matpower_registrations:
+      self._matpower_registrations.add((exp, damping_id))
+
+  def instantiate_inv_variables(self):
+    """Makes the internal "inverse" variable(s)."""
+
+    for (exp, damping_id) in self._matpower_registrations:
       exp_string = scalar_or_tensor_to_string(exp)
-      damping_string = scalar_or_tensor_to_string(damping)
+      damping_func = self._damping_funcs_by_id[damping_id]
+      damping_string = graph_func_to_string(damping_func)
       with variable_scope.variable_scope(self._var_scope):
         matpower = variable_scope.get_variable(
             "matpower_exp{}_damp{}".format(exp_string, damping_string),
@@ -476,34 +454,35 @@ class InverseProvidingFactor(FisherFactor):
             shape=self._cov_shape,
             trainable=False,
             dtype=self._dtype)
-      self._matpower_by_exp_and_damping[(exp, damping)] = matpower
+      assert (exp, damping_id) not in self._matpower_by_exp_and_damping
+      self._matpower_by_exp_and_damping[(exp, damping_id)] = matpower
 
   def make_inverse_update_ops(self):
     """Create and return update ops corresponding to registered computations."""
     ops = []
 
-    # We do this to ensure that we don't reuse the eigendecomp from old calls
-    # to make_inverse_update_ops that may be placed on different devices.  This
-    # can happen is the user has both a permanent and lazily constructed
-    # version of the inverse ops (and only uses one of them).
-    self.reset_eigendecomp()
+    num_inverses = sum(1 for (exp, _) in self._matpower_by_exp_and_damping
+                       if exp == -1)
+
+    num_other_matpower = len(self._matpower_by_exp_and_damping) - num_inverses
+
+    other_matrix_power_registered = num_other_matpower >= 1
 
-    num_inverses = len(self._inverses_by_damping)
-    matrix_power_registered = bool(self._matpower_by_exp_and_damping)
     use_eig = (
-        self._eigendecomp or matrix_power_registered or
+        self._eigendecomp or other_matrix_power_registered or
         num_inverses >= EIGENVALUE_DECOMPOSITION_THRESHOLD)
 
+    # We precompute these so we don't need to evaluate them multiple times (for
+    # each matrix power that uses them)
+    damping_value_by_id = {damping_id: self._damping_funcs_by_id[damping_id]()
+                           for damping_id in self._damping_funcs_by_id}
+
     if use_eig:
       eigenvalues, eigenvectors = self.get_eigendecomp()  # pylint: disable=unpacking-non-sequence
 
-      for damping, inv in self._inverses_by_damping.items():
-        ops.append(
-            inv.assign(
-                math_ops.matmul(eigenvectors / (eigenvalues + damping),
-                                array_ops.transpose(eigenvectors))))
-
-      for (exp, damping), matpower in self._matpower_by_exp_and_damping.items():
+      for (exp, damping_id), matpower in (
+          self._matpower_by_exp_and_damping.items()):
+        damping = damping_value_by_id[damping_id]
         ops.append(
             matpower.assign(
                 math_ops.matmul(eigenvectors *
@@ -512,28 +491,31 @@ class InverseProvidingFactor(FisherFactor):
       # These ops share computation and should be run on a single device.
       ops = [control_flow_ops.group(*ops)]
     else:
-      for damping, inv in self._inverses_by_damping.items():
-        ops.append(inv.assign(utils.posdef_inv(self._cov, damping)))
+      for (exp, damping_id), matpower in (
+          self._matpower_by_exp_and_damping.items()):
+        assert exp == -1
+        damping = damping_value_by_id[damping_id]
+        ops.append(matpower.assign(utils.posdef_inv(self._cov, damping)))
 
+    self._eigendecomp = False
     return ops
 
-  def get_damped_inverse(self, damping):
-    # Note that this function returns a variable which gets updated by the
-    # inverse ops.  It may be stale / inconsistent with the latest value of
-    # get_cov().
-    return self._inverses_by_damping[damping]
+  def get_inverse(self, damping_func):
+    # Just for backwards compatibility of some old code and tests
+    damping_id = graph_func_to_id(damping_func)
+    return self._matpower_by_exp_and_damping[(-1, damping_id)]
 
-  def get_matpower(self, exp, damping):
+  def get_matpower(self, exp, damping_func):
     # Note that this function returns a variable which gets updated by the
     # inverse ops.  It may be stale / inconsistent with the latest value of
     # get_cov().
-    return self._matpower_by_exp_and_damping[(exp, damping)]
+    damping_id = graph_func_to_id(damping_func)
+    return self._matpower_by_exp_and_damping[(exp, damping_id)]
 
   def get_eigendecomp(self):
     """Creates or retrieves eigendecomposition of self._cov."""
-    # Unlike get_inverse and get_matpower this doesn't retrieve a stored
-    # variable, but instead always computes a fresh version from the current
-    # value of get_cov().
+    # Unlike get_matpower this doesn't retrieve a stored variable, but instead
+    # always computes a fresh version from the current value of get_cov().
     if not self._eigendecomp:
       eigenvalues, eigenvectors = linalg_ops.self_adjoint_eig(self._cov)
 
@@ -546,63 +528,42 @@ class InverseProvidingFactor(FisherFactor):
 
     return self._eigendecomp
 
-  def reset_eigendecomp(self):
-    self._eigendecomp = None
-
   def get_cov(self):
     # Variable contains full covariance matrix.
     return self.get_cov_var()
 
-  def left_multiply(self, x, damping):
-    n = self.get_cov().shape[0]
-    damped_cov = self.get_cov() + damping * array_ops.eye(n)
-
+  def left_multiply_matpower(self, x, exp, damping_func):
     if isinstance(x, tf_ops.IndexedSlices):
-      raise NotImplementedError(
-          "Left-multiply not yet supported for IndexedSlices.")
+      raise ValueError("Left-multiply not yet supported for IndexedSlices.")
 
-    if len(x.shape) != 2:
+    if x.shape.ndims != 2:
       raise ValueError(
           "InverseProvidingFactors apply to matrix-shaped vectors. Found: %s."
           % (x,))
 
-    return math_ops.matmul(damped_cov, x)
+    if exp == 1:
+      return math_ops.matmul(self.get_cov(), x) + damping_func() * x
 
-  def right_multiply(self, x, damping):
-    n = self.get_cov().shape[0]
-    damped_cov = self.get_cov() + damping * array_ops.eye(n)
+    return math_ops.matmul(self.get_matpower(exp, damping_func), x)
 
+  def right_multiply_matpower(self, x, exp, damping_func):
     if isinstance(x, tf_ops.IndexedSlices):
-      return utils.matmul_sparse_dense(x, damped_cov)
-
-    if len(x.shape) != 2:
-      raise ValueError(
-          "InverseProvidingFactors apply to matrix-shaped vectors. Found: %s."
-          % (x,))
+      if exp == 1:
+        n = self.get_cov().shape[0]
+        damped_cov = self.get_cov() + damping_func() * array_ops.eye(n)
+        return utils.matmul_sparse_dense(x, damped_cov)
 
-    return math_ops.matmul(x, damped_cov)
-
-  def left_multiply_inverse(self, x, damping):
-    if isinstance(x, tf_ops.IndexedSlices):
-      raise ValueError("Left-multiply not yet supported for IndexedSlices.")
+      return utils.matmul_sparse_dense(x, self.get_matpower(exp, damping_func))
 
     if x.shape.ndims != 2:
       raise ValueError(
           "InverseProvidingFactors apply to matrix-shaped vectors. Found: %s."
           % (x,))
 
-    return math_ops.matmul(self.get_damped_inverse(damping), x)
-
-  def right_multiply_inverse(self, x, damping):
-    if isinstance(x, tf_ops.IndexedSlices):
-      return utils.matmul_sparse_dense(x, self.get_damped_inverse(damping))
-
-    if x.shape.ndims != 2:
-      raise ValueError(
-          "InverseProvidingFactors apply to matrix-shaped vectors. Found: %s."
-          % (x,))
+    if exp == 1:
+      return math_ops.matmul(x, self.get_cov()) + damping_func() * x
 
-    return math_ops.matmul(x, self.get_damped_inverse(damping))
+    return math_ops.matmul(x, self.get_matpower(exp, damping_func))
 
 
 class FullFactor(InverseProvidingFactor):
@@ -622,7 +583,7 @@ class FullFactor(InverseProvidingFactor):
 
   @property
   def _var_scope(self):
-    return "ff_full/" + scope_string_from_params(
+    return "ff_full_" + scope_string_from_params(
         [self._params_grads, self._batch_size])
 
   @property
@@ -641,11 +602,10 @@ class FullFactor(InverseProvidingFactor):
 
   def _compute_new_cov(self, idx=0):
     # This will be a very basic rank 1 estimate
-    with maybe_colocate_with(self._params_grads[idx]):
-      params_grads_flat = utils.tensors_to_column(self._params_grads[idx])
-      return ((params_grads_flat * array_ops.transpose(
-          params_grads_flat)) / math_ops.cast(self._batch_size,
-                                              params_grads_flat.dtype))
+    params_grads_flat = utils.tensors_to_column(self._params_grads[idx])
+    return ((params_grads_flat * array_ops.transpose(
+        params_grads_flat)) / math_ops.cast(self._batch_size,
+                                            params_grads_flat.dtype))
 
 
 class DiagonalFactor(FisherFactor):
@@ -656,6 +616,7 @@ class DiagonalFactor(FisherFactor):
   """
 
   def __init__(self):
+    self._damping_funcs_by_id = {}  # { hashable: lambda }
     super(DiagonalFactor, self).__init__()
 
   @property
@@ -665,43 +626,30 @@ class DiagonalFactor(FisherFactor):
   def make_inverse_update_ops(self):
     return []
 
+  def instantiate_inv_variables(self):
+    pass
+
   def get_cov(self):
     # self.get_cov() could be any shape, but it must have one entry per
     # parameter. Flatten it into a vector.
     cov_diag_vec = array_ops.reshape(self.get_cov_var(), [-1])
     return array_ops.diag(cov_diag_vec)
 
-  def left_multiply(self, x, damping):
-    damped_cov = self.get_cov_var() + damping
-    if isinstance(x, tf_ops.IndexedSlices):
-      return utils.matmul_diag_sparse(array_ops.reshape(damped_cov, [-1]), x)
-
-    if x.shape != damped_cov.shape:
-      raise ValueError("x (%s) and cov (%s) must have same shape." %
-                       (x, damped_cov))
-
-    return damped_cov * x
-
-  def right_multiply(self, x, damping):
-    raise NotImplementedError("Only left-multiply is currently supported.")
-
-  def left_multiply_inverse(self, x, damping):
-    inverse = 1. / (self.get_cov_var() + damping)
+  def left_multiply_matpower(self, x, exp, damping_func):
+    matpower = (self.get_cov_var() + damping_func())**exp
 
     if isinstance(x, tf_ops.IndexedSlices):
-      return utils.matmul_diag_sparse(array_ops.reshape(inverse, [-1]), x)
+      return utils.matmul_diag_sparse(array_ops.reshape(matpower, [-1]), x)
 
-    if x.shape != inverse.shape:
+    if x.shape != matpower.shape:
       raise ValueError("x (%s) and cov (%s) must have same shape." %
-                       (x, inverse))
+                       (x, matpower))
+    return matpower * x
 
-    return inverse * x
-
-  def right_multiply_inverse(self, x, damping):
+  def right_multiply_matpower(self, x, exp, damping_func):
     raise NotImplementedError("Only left-multiply is currently supported.")
 
-  def register_damped_inverse(self, damping):
-    # DiagonalFactors don't keep explicit inverses.
+  def register_matpower(self, exp, damping_func):
     pass
 
 
@@ -730,7 +678,7 @@ class NaiveDiagonalFactor(DiagonalFactor):
 
   @property
   def _var_scope(self):
-    return "ff_naivediag/" + scope_string_from_params(
+    return "ff_naivediag_" + scope_string_from_params(
         [self._params_grads, self._batch_size])
 
   @property
@@ -748,10 +696,9 @@ class NaiveDiagonalFactor(DiagonalFactor):
     return self._params_grads[0][0].dtype
 
   def _compute_new_cov(self, idx=0):
-    with maybe_colocate_with(self._params_grads[idx]):
-      params_grads_flat = utils.tensors_to_column(self._params_grads[idx])
-      return (math_ops.square(params_grads_flat) / math_ops.cast(
-          self._batch_size, params_grads_flat.dtype))
+    params_grads_flat = utils.tensors_to_column(self._params_grads[idx])
+    return (math_ops.square(params_grads_flat) / math_ops.cast(
+        self._batch_size, params_grads_flat.dtype))
 
 
 class EmbeddingInputKroneckerFactor(DiagonalFactor):
@@ -772,8 +719,8 @@ class EmbeddingInputKroneckerFactor(DiagonalFactor):
     """Instantiate EmbeddingInputKroneckerFactor.
 
     Args:
-      input_ids: Tuple of Tensors of shape [batch_size, input_size] and dtype
-        int32.  Indices into embedding matrix.
+      input_ids: Tensor of shape [batch_size, input_size] and dtype int32.
+        Indices into embedding matrix.
       vocab_size: int or 0-D Tensor. Maximum value for entries in 'input_ids'.
       dtype: dtype for covariance statistics. Must be a floating point type.
         Defaults to float32.
@@ -786,7 +733,7 @@ class EmbeddingInputKroneckerFactor(DiagonalFactor):
 
   @property
   def _var_scope(self):
-    return "ff_diag_embedding/" + scope_string_from_params(self._input_ids)
+    return "ff_diag_embedding_" + scope_string_from_params(self._input_ids)
 
   @property
   def _cov_shape(self):
@@ -794,42 +741,45 @@ class EmbeddingInputKroneckerFactor(DiagonalFactor):
 
   @property
   def _num_sources(self):
-    return len(self._input_ids)
+    return 1
 
   @property
   def _dtype(self):
     return self._cov_dtype
 
   def _compute_new_cov(self, idx=0):
-    with maybe_colocate_with(self._input_ids):
-      input_ids = self._input_ids[idx]
-      if len(input_ids.shape) > 2:
-        raise ValueError(
-            "Input to embeddings must have rank <= 2. Found rank %d." % len(
-                input_ids.shape))
+    if idx != 0:
+      raise ValueError("EmbeddingInputKroneckerFactor only supports idx = 0")
 
-      batch_size = array_ops.shape(input_ids)[0]
+    input_ids = self._input_ids
 
-      # Transform indices into one-hot vectors.
-      #
-      # TODO(b/72714822): There must be a faster way to construct the diagonal
-      # covariance matrix! This operation is O(batch_size * vocab_size), where
-      # it should be O(batch_size * input_size).
-      flat_input_ids = array_ops.reshape(input_ids, [-1])
-      one_hots = array_ops.one_hot(flat_input_ids,
-                                   self._vocab_size)  # [?, vocab_size]
+    if len(input_ids.shape) > 2:
+      raise ValueError(
+          "Input to embeddings must have rank <= 2. Found rank %d." % len(
+              input_ids.shape))
 
-      # Take average across examples. Note that, because all entries have
-      # magnitude zero or one, there's no need to square the entries.
-      #
-      # TODO(b/72714822): Support for SparseTensor, other kinds of aggregation
-      # within an example such as average.
-      #
-      # TODO(b/72714822): Support for partitioned embeddings.
-      new_cov = math_ops.reduce_sum(one_hots, axis=0)  # [vocab_size]
-      new_cov /= math_ops.cast(batch_size, new_cov.dtype)
+    batch_size = array_ops.shape(input_ids)[0]
 
-      return new_cov
+    # Transform indices into one-hot vectors.
+    #
+    # TODO(b/72714822): There must be a faster way to construct the diagonal
+    # covariance matrix! This operation is O(batch_size * vocab_size), where
+    # it should be O(batch_size * input_size).
+    flat_input_ids = array_ops.reshape(input_ids, [-1])
+    one_hots = array_ops.one_hot(flat_input_ids,
+                                 self._vocab_size)  # [?, vocab_size]
+
+    # Take average across examples. Note that, because all entries have
+    # magnitude zero or one, there's no need to square the entries.
+    #
+    # TODO(b/72714822): Support for SparseTensor, other kinds of aggregation
+    # within an example such as average.
+    #
+    # TODO(b/72714822): Support for partitioned embeddings.
+    new_cov = math_ops.reduce_sum(one_hots, axis=0)  # [vocab_size]
+    new_cov /= math_ops.cast(batch_size, new_cov.dtype)
+
+    return new_cov
 
 
 class FullyConnectedDiagonalFactor(DiagonalFactor):
@@ -850,23 +800,23 @@ class FullyConnectedDiagonalFactor(DiagonalFactor):
     """Instantiate FullyConnectedDiagonalFactor.
 
     Args:
-      inputs: Tensor of shape [batch_size, input_size]. Inputs to fully
-        connected layer.
-      outputs_grads: List of Tensors of shape [batch_size, output_size].
-        Gradient of loss with respect to layer's preactivations.
+      inputs: Tensor of shape [batch_size, input_size]. Inputs to this layer.
+      outputs_grads: List of Tensors, each of shape [batch_size, output_size],
+        which are the gradients of the loss with respect to the layer's
+        outputs. One Tensor for each "source".
+
       has_bias: bool. If True, append '1' to each input.
     """
     self._inputs = inputs
     self._has_bias = has_bias
     self._outputs_grads = outputs_grads
-    self._batch_size = array_ops.shape(inputs)[0]
     self._squared_inputs = None
 
     super(FullyConnectedDiagonalFactor, self).__init__()
 
   @property
   def _var_scope(self):
-    return "ff_diagfc/" + scope_string_from_params(
+    return "ff_diagfc_" + scope_string_from_params(
         (self._inputs,) + tuple(self._outputs_grads))
 
   @property
@@ -883,25 +833,30 @@ class FullyConnectedDiagonalFactor(DiagonalFactor):
   def _dtype(self):
     return self._outputs_grads[0].dtype
 
+  def make_covariance_update_op(self, ema_decay):
+    inputs = self._inputs
+
+    if self._has_bias:
+      inputs = append_homog(inputs)
+    self._squared_inputs = math_ops.square(inputs)
+
+    return super(FullyConnectedDiagonalFactor, self).make_covariance_update_op(
+        ema_decay)
+
   def _compute_new_cov(self, idx=0):
+    batch_size = array_ops.shape(self._squared_inputs)[0]
+    outputs_grad = self._outputs_grads[idx]
+
     # The well-known special formula that uses the fact that the entry-wise
     # square of an outer product is the outer-product of the entry-wise squares.
     # The gradient is the outer product of the input and the output gradients,
     # so we just square both and then take their outer-product.
-    with maybe_colocate_with(self._outputs_grads[idx]):
-      # We only need to compute squared_inputs once
-      if self._squared_inputs is None:
-        inputs = self._inputs
-        if self._has_bias:
-          inputs = append_homog(self._inputs)
-        self._squared_inputs = math_ops.square(inputs)
-
-      new_cov = math_ops.matmul(
-          self._squared_inputs,
-          math_ops.square(self._outputs_grads[idx]),
-          transpose_a=True)
-      new_cov /= math_ops.cast(self._batch_size, new_cov.dtype)
-      return new_cov
+    new_cov = math_ops.matmul(
+        self._squared_inputs,
+        math_ops.square(outputs_grad),
+        transpose_a=True)
+    new_cov /= math_ops.cast(batch_size, new_cov.dtype)
+    return new_cov
 
 
 class ConvDiagonalFactor(DiagonalFactor):
@@ -913,35 +868,64 @@ class ConvDiagonalFactor(DiagonalFactor):
                filter_shape,
                strides,
                padding,
+               data_format=None,
+               dilations=None,
                has_bias=False):
     """Creates a ConvDiagonalFactor object.
 
     Args:
       inputs: Tensor of shape [batch_size, height, width, in_channels].
         Input activations to this layer.
-      outputs_grads: Tensor of shape [batch_size, height, width, out_channels].
-        Per-example gradients to the loss with respect to the layer's output
-        preactivations.
+      outputs_grads: List of Tensors, each of shape [batch_size,
+        height, width, out_channels], which are the gradients of the loss
+        with respect to the layer's outputs. One Tensor for each "source".
       filter_shape: Tuple of 4 ints: (kernel_height, kernel_width, in_channels,
         out_channels). Represents shape of kernel used in this layer.
       strides: The stride size in this layer (1-D Tensor of length 4).
       padding: The padding in this layer (1-D of Tensor length 4).
+      data_format: None or str. Format of conv2d inputs.
+      dilations: None or tuple of 4 ints.
       has_bias: Python bool. If True, the layer is assumed to have a bias
         parameter in addition to its filter parameter.
+
+    Raises:
+      ValueError: If inputs, outputs_grads, and filter_shape do not agree on
+        in_channels or out_channels.
+      ValueError: If strides or dilations are not length-4 lists of ints.
+      ValueError: If data_format does not put channel last.
     """
+    if not utils.is_data_format_channel_last(data_format):
+      raise ValueError("Channel must be last.")
+    if inputs.shape.ndims != 4:
+      raise ValueError("inputs must be 4-D Tensor.")
+    if inputs.shape.as_list()[-1] != filter_shape[-2]:
+      raise ValueError("inputs and filter_shape must agree on in_channels.")
+    for i, outputs_grad in enumerate(outputs_grads):
+      if outputs_grad.shape.ndims != 4:
+        raise ValueError("outputs[%d] must be 4-D Tensor." % i)
+      if outputs_grad.shape.as_list()[-1] != filter_shape[-1]:
+        raise ValueError(
+            "outputs[%d] and filter_shape must agree on out_channels." % i)
+    if len(strides) != 4:
+      raise ValueError("strides must be length-4 list of ints.")
+    if dilations is not None and len(dilations) != 4:
+      raise ValueError("dilations must be length-4 list of ints.")
+
     self._inputs = inputs
+    self._outputs_grads = outputs_grads
     self._filter_shape = filter_shape
     self._strides = strides
     self._padding = padding
+    self._data_format = data_format
+    self._dilations = dilations
     self._has_bias = has_bias
-    self._outputs_grads = outputs_grads
     self._patches = None
 
     super(ConvDiagonalFactor, self).__init__()
 
   @property
   def _var_scope(self):
-    return "ff_convdiag/" + scope_string_from_name(
+    return "ff_convdiag_" + scope_string_from_params(
         (self._inputs,) + tuple(self._outputs_grads))
 
   @property
@@ -961,38 +945,36 @@ class ConvDiagonalFactor(DiagonalFactor):
     return self._outputs_grads[0].dtype
 
   def make_covariance_update_op(self, ema_decay):
-    with maybe_colocate_with(self._inputs):
-      filter_height, filter_width, _, _ = self._filter_shape
+    filter_height, filter_width, _, _ = self._filter_shape
 
-      # TODO(b/64144716): there is potential here for a big savings in terms
-      # of memory use.
-      patches = array_ops.extract_image_patches(
-          self._inputs,
-          ksizes=[1, filter_height, filter_width, 1],
-          strides=self._strides,
-          rates=[1, 1, 1, 1],
-          padding=self._padding)
-
-      if self._has_bias:
-        patches = append_homog(patches)
-
-      self._patches = patches
+    # TODO(b/64144716): there is potential here for a big savings in terms
+    # of memory use.
+    if self._dilations is None:
+      rates = (1, 1, 1, 1)
+    else:
+      rates = tuple(self._dilations)
+    patches = array_ops.extract_image_patches(
+        self._inputs,
+        ksizes=[1, filter_height, filter_width, 1],
+        strides=self._strides,
+        rates=rates,
+        padding=self._padding)
 
-    op = super(ConvDiagonalFactor, self).make_covariance_update_op(ema_decay)
+    if self._has_bias:
+      patches = append_homog(patches)
 
-    self._patches = None
+    self._patches = patches
 
-    return op
+    return super(ConvDiagonalFactor, self).make_covariance_update_op(ema_decay)
 
   def _compute_new_cov(self, idx=0):
-    with maybe_colocate_with(self._outputs_grads[idx]):
-      outputs_grad = self._outputs_grads[idx]
-      batch_size = array_ops.shape(self._patches)[0]
+    batch_size = array_ops.shape(self._patches)[0]
+    outputs_grad = self._outputs_grads[idx]
 
-      new_cov = self._convdiag_sum_of_squares(self._patches, outputs_grad)
-      new_cov /= math_ops.cast(batch_size, new_cov.dtype)
+    new_cov = self._convdiag_sum_of_squares(self._patches, outputs_grad)
+    new_cov /= math_ops.cast(batch_size, new_cov.dtype)
 
-      return new_cov
+    return new_cov
 
   def _convdiag_sum_of_squares(self, patches, outputs_grad):
     # This computes the sum of the squares of the per-training-case "gradients".
@@ -1013,8 +995,9 @@ class FullyConnectedKroneckerFactor(InverseProvidingFactor):
     """Instantiate FullyConnectedKroneckerFactor.
 
     Args:
-      tensors: List of Tensors of shape [batch_size, n]. Represents either a
-        layer's inputs or its output's gradients.
+      tensors: List of Tensors, each of shape [batch_size, n], one for each
+        source.  The Tensors are typically either a layer's inputs or its
+        output's gradients.
       has_bias: bool. If True, append '1' to each row.
     """
     # The tensor argument is either a tensor of input activations or a tensor of
@@ -1025,8 +1008,8 @@ class FullyConnectedKroneckerFactor(InverseProvidingFactor):
 
   @property
   def _var_scope(self):
-    return "ff_fckron/" + scope_string_from_params(
-        [self._tensors, self._has_bias])
+    return "ff_fckron_" + scope_string_from_params(
+        tuple(self._tensors) + (self._has_bias,))
 
   @property
   def _cov_shape(self):
@@ -1042,11 +1025,10 @@ class FullyConnectedKroneckerFactor(InverseProvidingFactor):
     return self._tensors[0].dtype
 
   def _compute_new_cov(self, idx=0):
-    with maybe_colocate_with(self._tensors[idx]):
-      tensor = self._tensors[idx]
-      if self._has_bias:
-        tensor = append_homog(tensor)
-      return compute_cov(tensor)
+    tensor = self._tensors[idx]
+    if self._has_bias:
+      tensor = append_homog(tensor)
+    return compute_cov(tensor)
 
 
 class ConvInputKroneckerFactor(InverseProvidingFactor):
@@ -1062,39 +1044,55 @@ class ConvInputKroneckerFactor(InverseProvidingFactor):
   def __init__(self,
                inputs,
                filter_shape,
-               strides,
                padding,
+               strides=None,
+               dilation_rate=None,
+               data_format=None,
+               extract_patches_fn=None,
                has_bias=False):
     """Initializes ConvInputKroneckerFactor.
 
     Args:
-      inputs: Tensor of shape [batch_size, height, width, in_channels]. Inputs
-        to layer.
-      filter_shape: 1-D Tensor of length 4. Contains [kernel_height,
-        kernel_width, in_channels, out_channels].
-      strides: 1-D Tensor of length 4. Contains [batch_stride, height_stride,
-        width_stride, in_channel_stride].
+      inputs: Tensor of shape [batch_size, ..spatial_input_size.., in_channels].
+        Inputs to layer.
+      filter_shape: List of ints. Contains [..spatial_filter_size..,
+        in_channels, out_channels]. Shape of convolution kernel.
       padding: str. Padding method for layer. "SAME" or "VALID".
+      strides: List of ints or None. Contains [..spatial_filter_strides..] if
+        'extract_patches_fn' is compatible with tf.nn.convolution(), else
+        [1, ..spatial_filter_strides.., 1].
+      dilation_rate: List of ints or None. Rate for dilation along each spatial
+        dimension if 'extract_patches_fn' is compatible with
+        tf.nn.convolution(), else [1, ..spatial_dilation_rates.., 1].
+      data_format: str or None. Format of input data.
+      extract_patches_fn: str or None. Name of function that extracts image
+        patches. One of "extract_convolution_patches", "extract_image_patches",
+        "extract_pointwise_conv2d_patches".
       has_bias: bool. If True, append 1 to in_channel.
     """
+    self._inputs = inputs
     self._filter_shape = filter_shape
     self._strides = strides
     self._padding = padding
+    self._dilation_rate = dilation_rate
+    self._data_format = data_format
+    self._extract_patches_fn = extract_patches_fn
     self._has_bias = has_bias
-    self._inputs = inputs
+
     super(ConvInputKroneckerFactor, self).__init__()
 
   @property
   def _var_scope(self):
-    return "ff_convinkron/" + scope_string_from_params([
+    return "ff_convinkron_" + scope_string_from_params([
         self._inputs, self._filter_shape, self._strides, self._padding,
-        self._has_bias
+        self._dilation_rate, self._data_format, self._has_bias
     ])
 
   @property
   def _cov_shape(self):
-    filter_height, filter_width, in_channels, _ = self._filter_shape
-    size = filter_height * filter_width * in_channels + self._has_bias
+    spatial_filter_shape = self._filter_shape[0:-2]
+    in_channels = self._filter_shape[-2]
+    size = np.prod(spatial_filter_shape) * in_channels + self._has_bias
     return [size, size]
 
   @property
@@ -1109,37 +1107,62 @@ class ConvInputKroneckerFactor(InverseProvidingFactor):
     if idx != 0:
       raise ValueError("ConvInputKroneckerFactor only supports idx = 0")
 
-    with maybe_colocate_with(self._inputs):
-      filter_height, filter_width, in_channels, _ = self._filter_shape
-
-      # TODO(b/64144716): there is potential here for a big savings in terms of
-      # memory use.
+    # TODO(b/64144716): there is potential here for a big savings in terms of
+    # memory use.
+    if self._extract_patches_fn in [None, "extract_convolution_patches"]:
+      patches = utils.extract_convolution_patches(
+          self._inputs,
+          self._filter_shape,
+          padding=self._padding,
+          strides=self._strides,
+          dilation_rate=self._dilation_rate,
+          data_format=self._data_format)
+
+    elif self._extract_patches_fn == "extract_image_patches":
+      assert self._inputs.shape.ndims == 4
+      assert len(self._filter_shape) == 4
+      assert len(self._strides) == 4, self._strides
+      if self._dilation_rate is None:
+        rates = [1, 1, 1, 1]
+      else:
+        rates = self._dilation_rate
+        assert len(rates) == 4
+        assert rates[0] == rates[-1] == 1
       patches = array_ops.extract_image_patches(
           self._inputs,
-          ksizes=[1, filter_height, filter_width, 1],
+          ksizes=[1] + list(self._filter_shape[0:-2]) + [1],
           strides=self._strides,
-          rates=[1, 1, 1, 1],
+          rates=rates,
           padding=self._padding)
 
-      flatten_size = (filter_height * filter_width * in_channels)
-      # patches_flat below is the matrix [[A_l]] from the KFC paper (tilde
-      # omitted over A for clarity). It has shape M|T| x J|Delta| (eq. 14),
-      # where M = minibatch size, |T| = number of spatial locations,
-      # |Delta| = number of spatial offsets, and J = number of input maps
-      # for convolutional layer l.
-      patches_flat = array_ops.reshape(patches, [-1, flatten_size])
-      # We append a homogenous coordinate to patches_flat if the layer has
-      # bias parameters. This gives us [[A_l]]_H from the paper.
-      if self._has_bias:
-        patches_flat = append_homog(patches_flat)
-      # We call compute_cov without passing in a normalizer. compute_cov uses
-      # the first dimension of patches_flat i.e. M|T| as the normalizer by
-      # default. Hence we end up computing 1/M|T| * [[A_l]]^T [[A_l]], with
-      # shape J|Delta| x J|Delta|. This is related to hat{Omega}_l from
-      # the paper but has a different scale here for consistency with
-      # ConvOutputKroneckerFactor.
-      # (Tilde omitted over A for clarity.)
-      return compute_cov(patches_flat)
+    elif self._extract_patches_fn == "extract_pointwise_conv2d_patches":
+      assert self._strides in [None, [1, 1, 1, 1], (1, 1, 1, 1)]
+      assert self._filter_shape[0] == self._filter_shape[1] == 1
+      patches = utils.extract_pointwise_conv2d_patches(
+          self._inputs, self._filter_shape, data_format=None)
+
+    else:
+      raise NotImplementedError(self._extract_patches_fn)
+
+    flatten_size = np.prod(self._filter_shape[0:-1])
+    # patches_flat below is the matrix [[A_l]] from the KFC paper (tilde
+    # omitted over A for clarity). It has shape M|T| x J|Delta| (eq. 14),
+    # where M = minibatch size, |T| = number of spatial locations,
+    # |Delta| = number of spatial offsets, and J = number of input maps
+    # for convolutional layer l.
+    patches_flat = array_ops.reshape(patches, [-1, flatten_size])
+    # We append a homogeneous coordinate to patches_flat if the layer has
+    # bias parameters. This gives us [[A_l]]_H from the paper.
+    if self._has_bias:
+      patches_flat = append_homog(patches_flat)
+    # We call compute_cov without passing in a normalizer. compute_cov uses
+    # the first dimension of patches_flat i.e. M|T| as the normalizer by
+    # default. Hence we end up computing 1/M|T| * [[A_l]]^T [[A_l]], with
+    # shape J|Delta| x J|Delta|. This is related to hat{Omega}_l from
+    # the paper but has a different scale here for consistency with
+    # ConvOutputKroneckerFactor.
+    # (Tilde omitted over A for clarity.)
+    return compute_cov(patches_flat)
 
 
 class ConvOutputKroneckerFactor(InverseProvidingFactor):
@@ -1153,20 +1176,27 @@ class ConvOutputKroneckerFactor(InverseProvidingFactor):
   Section 3.1 Estimating the factors.
   """
 
-  def __init__(self, outputs_grads):
+  def __init__(self, outputs_grads, data_format=None):
     """Initializes ConvOutputKroneckerFactor.
 
     Args:
       outputs_grads: list of Tensors. Each Tensor is of shape
-          [batch_size, height, width, out_channels].
+          [batch_size, ..spatial_input_size.., out_channels]. One Tensor per
+          source.
+      data_format: None or str. Format of outputs_grads.
+
+    Raises:
+      ValueError: If channels are not final dimension.
     """
-    self._out_channels = outputs_grads[0].shape.as_list()[3]
+    if not utils.is_data_format_channel_last(data_format):
+      raise ValueError("Channel must be last.")
+    self._out_channels = outputs_grads[0].shape.as_list()[-1]
     self._outputs_grads = outputs_grads
     super(ConvOutputKroneckerFactor, self).__init__()
 
   @property
   def _var_scope(self):
-    return "ff_convoutkron/" + scope_string_from_params(self._outputs_grads)
+    return "ff_convoutkron_" + scope_string_from_params(self._outputs_grads)
 
   @property
   def _cov_shape(self):
@@ -1182,56 +1212,57 @@ class ConvOutputKroneckerFactor(InverseProvidingFactor):
     return self._outputs_grads[0].dtype
 
   def _compute_new_cov(self, idx=0):
-    with maybe_colocate_with(self._outputs_grads[idx]):
-      # reshaped_tensor below is the matrix DS_l defined in the KFC paper
-      # (tilde omitted over S for clarity). It has shape M|T| x I, where
-      # M = minibatch size, |T| = number of spatial locations, and
-      # I = number of output maps for convolutional layer l.
-      reshaped_tensor = array_ops.reshape(self._outputs_grads[idx],
-                                          [-1, self._out_channels])
-      # Following the reasoning in ConvInputKroneckerFactor._compute_new_cov,
-      # compute_cov here returns 1/M|T| * DS_l^T DS_l = hat{Gamma}_l
-      # as defined in the paper, with shape I x I.
-      # (Tilde omitted over S for clarity.)
-      return compute_cov(reshaped_tensor)
-
-
-class FullyConnectedMultiKF(InverseProvidingFactor):
-  """Kronecker factor for a fully connected recurrent layer."""
+    outputs_grad = self._outputs_grads[idx]
+
+    # reshaped_tensor below is the matrix DS_l defined in the KFC paper
+    # (tilde omitted over S for clarity). It has shape M|T| x I, where
+    # M = minibatch size, |T| = number of spatial locations, and
+    # I = number of output maps for convolutional layer l.
+    reshaped_tensor = array_ops.reshape(outputs_grad, [-1, self._out_channels])
+    # Following the reasoning in ConvInputKroneckerFactor._compute_new_cov,
+    # compute_cov here returns 1/M|T| * DS_l^T DS_l = hat{Gamma}_l
+    # as defined in the paper, with shape I x I.
+    # (Tilde omitted over S for clarity.)
+    return compute_cov(reshaped_tensor)
+
+
+class FullyConnectedMultiKF(FullyConnectedKroneckerFactor):
+  """Kronecker factor for a fully connected layer used multiple times."""
 
   def __init__(self,
-               tensor_lists,
+               tensors,
+               num_uses=None,
                has_bias=False):
     """Constructs a new `FullyConnectedMultiKF`.
 
     Args:
-      tensor_lists: List of lists of Tensors of shape [batch_size, n].
+      tensors: List of Tensors, each of shape [batch_size, n]. Each of
+        these tensors is usually a layer's inputs or its output's gradients.
+        The list is over sources.
+      num_uses: int. The number of time-steps / uses.
       has_bias: bool. If True, '1' is appended to each row.
     """
 
-    self._tensor_lists = tensor_lists
-    self._has_bias = has_bias
-    self._batch_size = array_ops.shape(tensor_lists[0][0])[0]
-    self._num_timesteps = len(tensor_lists[0])
-    self._tensors = [None] * len(tensor_lists)
+    self._num_uses = num_uses
 
     self._cov_dt1 = None
+    self._make_cov_dt1 = False
     self._option1quants_by_damping = {}
     self._option2quants_by_damping = {}
+    self._option1quants_registrations = set()
+    self._option2quants_registrations = set()
 
-    super(FullyConnectedMultiKF, self).__init__()
-
-  @property
-  def _var_scope(self):
-    return "ff_fc_multi/" + scope_string_from_params(self._tensor_lists)
+    super(FullyConnectedMultiKF, self).__init__(tensors=tensors,
+                                                has_bias=has_bias)
 
   @property
-  def _num_sources(self):
-    return len(self._tensor_lists)
+  def _num_timesteps(self):
+    return self._num_uses
 
   @property
-  def _dtype(self):
-    return self._tensor_lists[0][0].dtype
+  def _var_scope(self):
+    return "ff_fc_multi_" + scope_string_from_params(
+        tuple(self._tensors) + (self._num_timesteps, self._has_bias,))
 
   def make_covariance_update_op(self, ema_decay):
 
@@ -1240,71 +1271,62 @@ class FullyConnectedMultiKF(InverseProvidingFactor):
     if self._cov_dt1 is not None:
       new_cov_dt1_contribs = tuple(self._compute_new_cov_dt1(idx)
                                    for idx in range(self._num_sources))
+      new_cov_dt1 = math_ops.add_n(new_cov_dt1_contribs)
+      op2 = moving_averages.assign_moving_average(
+          self._cov_dt1, new_cov_dt1, ema_decay, zero_debias=ZERO_DEBIAS)
 
-      with maybe_colocate_with(new_cov_dt1_contribs[0]):
-        new_cov_dt1 = math_ops.add_n(new_cov_dt1_contribs)
-
-        op2 = moving_averages.assign_moving_average(
-            self._cov_dt1, new_cov_dt1, ema_decay, zero_debias=ZERO_DEBIAS)
-
-        # TODO(b/69112164):
-        # It's important that _cov and _cov_dt1 remain consistent with each
-        # other while the inverse ops are happening. How can we ensure this?
-        # We will need to add explicit synchronization for this to
-        # work with asynchronous training.
-        op = control_flow_ops.group(op, op2)
+      # TODO(b/69112164):
+      # It's important that _cov and _cov_dt1 remain consistent with each
+      # other while the inverse ops are happening. How can we ensure this?
+      # We will need to add explicit synchronization for this to
+      # work with asynchronous training.
+      op = control_flow_ops.group(op, op2)
 
     return op
 
-  def _compute_new_cov(self, idx=0):
-    with maybe_colocate_with(self._tensor_lists[idx]):
-      tensor = array_ops.concat(self._tensor_lists[idx], 0)
-      if self._has_bias:
-        tensor = append_homog(tensor)
-      # We save these so they can be used by _compute_new_cov_dt1
-      self._tensors[idx] = tensor
-      return compute_cov(tensor)
-
-  def _compute_new_cov_dt1(self, idx=0):
+  def _compute_new_cov_dt1(self, idx=0):  # pylint: disable=missing-docstring
     tensor = self._tensors[idx]
-    with maybe_colocate_with(tensor):
-      # Is there a more elegant way to do this computation?
-      tensor_present = tensor[:-self._batch_size, :]
-      tensor_future = tensor[self._batch_size:, :]
-      # We specify a normalizer for this computation to ensure a PSD Fisher
-      # block estimate.  This is equivalent to padding with zeros, as was done
-      # in Section B.2 of the appendix.
-      normalizer = self._num_timesteps * self._batch_size
-      return compute_cov(
-          tensor_future, tensor_right=tensor_present, normalizer=normalizer)
+    if self._has_bias:
+      # This appending is technically done twice (the other time is for
+      # _compute_new_cov())
+      tensor = append_homog(tensor)
 
-  @property
-  def _cov_shape(self):
-    size = self._tensor_lists[0][0].shape[1] + self._has_bias
-    return [size, size]
+    total_len = array_ops.shape(tensor)[0]
+    batch_size = total_len // self._num_timesteps
+
+    tensor_present = tensor[:-batch_size, :]
+    tensor_future = tensor[batch_size:, :]
+
+    # We specify a normalizer for this computation to ensure a PSD Fisher
+    # block estimate.  This is equivalent to padding with zeros, as was done
+    # in Section B.2 of the appendix.
+    return compute_cov(
+        tensor_future, tensor_right=tensor_present, normalizer=total_len)
 
   @property
   def _vec_shape(self):
-    size = self._tensor_lists[0][0].shape[1] + self._has_bias
+    size = self._tensors[0].shape[1] + self._has_bias
     return [size]
 
-  def get_option1quants(self, damping):
-    return self._option1quants_by_damping[damping]
+  def get_option1quants(self, damping_func):
+    damping_id = graph_func_to_id(damping_func)
+    return self._option1quants_by_damping[damping_id]
 
-  def get_option2quants(self, damping):
-    return self._option2quants_by_damping[damping]
+  def get_option2quants(self, damping_func):
+    damping_id = graph_func_to_id(damping_func)
+    return self._option2quants_by_damping[damping_id]
 
   def get_cov_dt1(self):
     assert self._cov_dt1 is not None
     return self._cov_dt1
 
   def register_cov_dt1(self):
-    """Create a variable representing temporal cross-covariance.
+    self._make_cov_dt1 = True
 
-    (This is technically the second moment, not covariance, since it's
-    not mean subtracted.)
-    """
-    if self._cov_dt1 is None:
+  def instantiate_cov_variables(self):
+    super(FullyConnectedMultiKF, self).instantiate_cov_variables()
+    assert self._cov_dt1 is None
+    if self._make_cov_dt1:
       with variable_scope.variable_scope(self._var_scope):
         self._cov_dt1 = variable_scope.get_variable(
             "cov_dt1",
@@ -1313,15 +1335,25 @@ class FullyConnectedMultiKF(InverseProvidingFactor):
             trainable=False,
             dtype=self._dtype)
 
-  def register_option1quants(self, damping):
+  def register_option1quants(self, damping_func):
+    damping_id = self._register_damping(damping_func)
+    if damping_id not in self._option1quants_registrations:
+      self._option1quants_registrations.add(damping_id)
 
-    self.register_cov_dt1()
+  def register_option2quants(self, damping_func):
+    damping_id = self._register_damping(damping_func)
+    if damping_id not in self._option2quants_registrations:
+      self._option2quants_registrations.add(damping_id)
 
-    if damping not in self._option1quants_by_damping:
+  def instantiate_inv_variables(self):
+    super(FullyConnectedMultiKF, self).instantiate_inv_variables()
+
+    for damping_id in self._option1quants_registrations:
+      damping_func = self._damping_funcs_by_id[damping_id]
+      damping_string = graph_func_to_string(damping_func)
       # It's questionable as to whether we should initialize with stuff like
       # this at all.  Ideally these values should never be used until they are
       # updated at least once.
-      damping_string = scalar_or_tensor_to_string(damping)
       with variable_scope.variable_scope(self._var_scope):
         Lmat = variable_scope.get_variable(  # pylint: disable=invalid-name
             "Lmat_damp{}".format(damping_string),
@@ -1336,17 +1368,15 @@ class FullyConnectedMultiKF(InverseProvidingFactor):
             trainable=False,
             dtype=self._dtype)
 
-      self._option1quants_by_damping[damping] = (Lmat, psi)
-
-  def register_option2quants(self, damping):
+      assert damping_id not in self._option1quants_by_damping
+      self._option1quants_by_damping[damping_id] = (Lmat, psi)
 
-    self.register_cov_dt1()
-
-    if damping not in self._option2quants_by_damping:
+    for damping_id in self._option2quants_registrations:
+      damping_func = self._damping_funcs_by_id[damping_id]
+      damping_string = graph_func_to_string(damping_func)
       # It's questionable as to whether we should initialize with stuff like
       # this at all.  Ideally these values should never be used until they are
       # updated at least once.
-      damping_string = scalar_or_tensor_to_string(damping)
       with variable_scope.variable_scope(self._var_scope):
         Pmat = variable_scope.get_variable(  # pylint: disable=invalid-name
             "Lmat_damp{}".format(damping_string),
@@ -1367,14 +1397,15 @@ class FullyConnectedMultiKF(InverseProvidingFactor):
             trainable=False,
             dtype=self._dtype)
 
-      self._option2quants_by_damping[damping] = (Pmat, Kmat, mu)
+      assert damping_id not in self._option2quants_by_damping
+      self._option2quants_by_damping[damping_id] = (Pmat, Kmat, mu)
 
   def make_inverse_update_ops(self):
     """Create and return update ops corresponding to registered computations."""
     # TODO(b/69918258): Add correctness tests for this method.
     # pylint: disable=invalid-name
 
-    ops = super(FullyConnectedMultiKF, self).make_inverse_update_ops()
+    ops = []
 
     if (len(self._option1quants_by_damping) +
         len(self._option2quants_by_damping)):
@@ -1395,8 +1426,10 @@ class FullyConnectedMultiKF(InverseProvidingFactor):
       # consistently, or are somehow read between or during the cov updates.
       # Can this possibly happen?  Is there a way to prevent it?
 
-      for damping, (Lmat_var,
-                    psi_var) in self._option1quants_by_damping.items():
+      for damping_id, (Lmat_var,
+                       psi_var) in self._option1quants_by_damping.items():
+
+        damping = self._damping_funcs_by_id[damping_id]()
 
         invsqrtC0 = math_ops.matmul(
             eigen_V * (eigen_e + damping)**(-0.5), eigen_V, transpose_b=True)
@@ -1421,8 +1454,10 @@ class FullyConnectedMultiKF(InverseProvidingFactor):
         ops.append(Lmat_var.assign(Lmat))
         ops.append(psi_var.assign(psi))
 
-      for damping, (Pmat_var, Kmat_var,
-                    mu_var) in self._option2quants_by_damping.items():
+      for damping_id, (Pmat_var, Kmat_var,
+                       mu_var) in self._option2quants_by_damping.items():
+
+        damping = self._damping_funcs_by_id[damping_id]()
 
         # compute C0^(-1/2)
         invsqrtC0 = math_ops.matmul(
@@ -1463,6 +1498,7 @@ class FullyConnectedMultiKF(InverseProvidingFactor):
         ops.append(Kmat_var.assign(Kmat))
         ops.append(mu_var.assign(mu))
 
+    ops += super(FullyConnectedMultiKF, self).make_inverse_update_ops()
     return [control_flow_ops.group(*ops)]
 
     # pylint: enable=invalid-name
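
Reviewer note on `FullyConnectedDiagonalFactor._compute_new_cov` in the file above: the comment there relies on the identity that the entry-wise square of an outer product equals the outer product of the entry-wise squares. The sketch below checks that identity in plain Python on a tiny hand-made batch. It is illustrative only: the helper names (`outer`, `diag_fisher_naive`, `diag_fisher_fast`) and the sample values are assumptions and are not part of the patch or of the K-FAC library.

```python
# Illustrative only: plain-Python check of the identity used by
# FullyConnectedDiagonalFactor._compute_new_cov. All names here are hypothetical.


def outer(u, v):
    """Outer product of two vectors, returned as a list of rows."""
    return [[a * b for b in v] for a in u]


def diag_fisher_naive(inputs, outputs_grads):
    """Batch average of the entry-wise squared per-example gradients."""
    batch_size = len(inputs)
    rows, cols = len(inputs[0]), len(outputs_grads[0])
    total = [[0.0] * cols for _ in range(rows)]
    for x, g in zip(inputs, outputs_grads):
        grad = outer(x, g)  # per-example weight gradient
        for i in range(rows):
            for j in range(cols):
                total[i][j] += grad[i][j] ** 2
    return [[t / batch_size for t in row] for row in total]


def diag_fisher_fast(inputs, outputs_grads):
    """Same quantity as square(inputs)^T * square(outputs_grads) / batch_size."""
    batch_size = len(inputs)
    rows, cols = len(inputs[0]), len(outputs_grads[0])
    return [
        [
            sum(x[i] ** 2 * g[j] ** 2 for x, g in zip(inputs, outputs_grads))
            / batch_size
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Two examples, two input features plus a trailing 1.0 that plays the role of
# the homogeneous coordinate appended when has_bias is True.
inputs = [[1.0, 2.0, 1.0], [0.5, -1.0, 1.0]]
outputs_grads = [[0.1, -0.2], [0.3, 0.4]]

naive = diag_fisher_naive(inputs, outputs_grads)
fast = diag_fisher_fast(inputs, outputs_grads)
```

Both functions return a matrix of shape [input_size + 1, output_size]; the second form never materializes the per-example gradients, which is why the factor squares the inputs once and reuses them across sources.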
diff --git a/tensorflow/contrib/kfac/python/ops/layer_collection.py b/tensorflow/contrib/kfac/python/ops/layer_collection.py
index ce9005b9ce99a4efa5f2821c56e199dd2086482e..7727c607db329c035b0991c87ee7bba2e3c2510e 100644
--- a/tensorflow/contrib/kfac/python/ops/layer_collection.py
+++ b/tensorflow/contrib/kfac/python/ops/layer_collection.py
@@ -26,6 +26,7 @@ from __future__ import print_function
 
 from collections import defaultdict
 from collections import OrderedDict
+from contextlib import contextmanager
 from functools import partial
 
 import math
@@ -59,6 +60,10 @@ _CONV2D_APPROX_TO_BLOCK_TYPES = {
     APPROX_DIAGONAL_NAME: fb.ConvDiagonalFB,
 }
 
+_EMBEDDING_APPROX_TO_BLOCK_TYPES = {
+    APPROX_KRONECKER_NAME: fb.EmbeddingKFACFB
+}
+
 APPROX_KRONECKER_INDEP_NAME = "kron_indep"
 APPROX_KRONECKER_SERIES_1_NAME = "kron_series_1"
 APPROX_KRONECKER_SERIES_2_NAME = "kron_series_2"
@@ -71,10 +76,39 @@ _FULLY_CONNECTED_MULTI_APPROX_TO_BLOCK_TYPES = {
                                             option=2)
 }
 
+_CONV2D_MULTI_APPROX_TO_BLOCK_TYPES = {
+    APPROX_KRONECKER_INDEP_NAME: fb.ConvKFCBasicMultiIndepFB
+}
+
+_EMBEDDING_MULTI_APPROX_TO_BLOCK_TYPES = {
+    APPROX_KRONECKER_INDEP_NAME: fb.EmbeddingKFACMultiIndepFB
+}
+
 # Possible value for 'reuse' keyword argument. Sets 'reuse' to
 # tf.get_variable_scope().reuse.
 VARIABLE_SCOPE = "VARIABLE_SCOPE"
 
+_DEFAULT_LAYER_COLLECTION = None
+
+
+def get_default_layer_collection():
+  """Get default LayerCollection."""
+  if _DEFAULT_LAYER_COLLECTION is None:
+    raise ValueError(
+        "Attempted to retrieve default LayerCollection when none is set. Use "
+        "LayerCollection.as_default().")
+
+  return _DEFAULT_LAYER_COLLECTION
+
+
+def set_default_layer_collection(layer_collection):
+  global _DEFAULT_LAYER_COLLECTION
+
+  if _DEFAULT_LAYER_COLLECTION is not None and layer_collection is not None:
+    raise ValueError("Default LayerCollection is already set.")
+
+  _DEFAULT_LAYER_COLLECTION = layer_collection
+
 
 class LayerParametersDict(OrderedDict):
   """An OrderedDict where keys are Tensors or tuples of Tensors.
@@ -130,6 +164,8 @@ class LayerCollection(object):
     fisher_factors: an OrderedDict mapping tuples to FisherFactor instances.
     losses: a list of LossFunction objects. The loss to be optimized is their
         sum.
+    loss_colocation_ops: ops to colocate loss function evaluations with.  These
+        will typically be the inputs to the losses.
   """
 
   def __init__(self,
@@ -145,17 +181,27 @@ class LayerCollection(object):
     self._default_generic_approximation = APPROX_FULL_NAME
     self._default_embedding_approximation = APPROX_KRONECKER_NAME
     self._default_fully_connected_approximation = APPROX_KRONECKER_NAME
-    self._default_convolution_2d_approximation = APPROX_KRONECKER_NAME
+    self._default_conv2d_approximation = APPROX_KRONECKER_NAME
     self._default_fully_connected_multi_approximation = (
-        APPROX_KRONECKER_SERIES_2_NAME)
+        APPROX_KRONECKER_INDEP_NAME)
+    self._default_conv2d_multi_approximation = (
+        APPROX_KRONECKER_INDEP_NAME)
+    self._default_embedding_multi_approximation = APPROX_KRONECKER_INDEP_NAME
+    self.loss_colocation_ops = {}
+    self._vars_to_uses = defaultdict(lambda: 0)
 
     with variable_scope.variable_scope(None, default_name=name) as scope:
       self._var_scope = scope.name
 
   @property
   def losses(self):
-    """LossFunctions registered with this LayerCollection."""
-    return list(self._loss_dict.values())
+    """List of LossFunction objects registered with this LayerCollection."""
+    return nest.flatten(self.towers_by_loss)
+
+  @property
+  def towers_by_loss(self):
+    """Tuple across losses of LossFunction objects registered to each tower."""
+    return tuple(tuple(lst) for lst in self._loss_dict.values())
 
   @property
   def registered_variables(self):
@@ -214,14 +260,14 @@ class LayerCollection(object):
 
   @property
   def default_conv2d_approximation(self):
-    return self._default_convolution_2d_approximation
+    return self._default_conv2d_approximation
 
   def set_default_conv2d_approximation(self, value):
     if value not in _CONV2D_APPROX_TO_BLOCK_TYPES:
       raise ValueError(
           "{} is not a valid approximation for 2d convolutional layers.".format(
               value))
-    self._default_convolution_2d_approximation = value
+    self._default_conv2d_approximation = value
 
   @property
   def default_fully_connected_multi_approximation(self):
@@ -233,6 +279,14 @@ class LayerCollection(object):
                        "multi layer.".format(value))
     self._default_fully_connected_multi_approximation = value
 
+  @property
+  def default_conv2d_multi_approximation(self):
+    return self._default_conv2d_multi_approximation
+
+  @property
+  def default_embedding_multi_approximation(self):
+    return self._default_embedding_multi_approximation
+
   def register_block(self, layer_key, fisher_block, reuse=VARIABLE_SCOPE):
     """Validates and registers the layer_key associated with the fisher_block.
 
@@ -290,23 +344,74 @@ class LayerCollection(object):
     self.fisher_blocks[layer_key] = fisher_block
     return fisher_block
 
-  def get_use_count_map(self):
-    """Returns a dict of variables to their number of registrations."""
-    # TODO(b/70283403): Reimplement this in the old way, where each
-    # registration function would be responsible for incrementing the count.
-    # Also, this version has a bug: it won't do the right thing for generic
-    # registration for parameters that are shared.  i.e. it won't set the use
-    # count to infinity.
-    vars_to_uses = defaultdict(int)
-    for key, block in six.iteritems(self.fisher_blocks):
-      n = (
-          block.num_inputs()*block.num_registered_minibatches if isinstance(
-              block, (fb.FullyConnectedSeriesFB, fb.FullyConnectedMultiIndepFB))
-          else block.num_registered_minibatches)
-      key = utils.ensure_sequence(key)
-      for k in key:
-        vars_to_uses[k] += n
-    return vars_to_uses
+  def register_loss_function(self,
+                             loss,
+                             colocation_op,
+                             base_name,
+                             name=None,
+                             reuse=VARIABLE_SCOPE):
+    """Registers a LossFunction object.
+
+    Args:
+      loss: The LossFunction object.
+      colocation_op: The op to colocate the loss function's computations with.
+      base_name: The name to derive a new unique name from if the name argument
+        is None.
+      name: (OPTIONAL) str or None. Unique name for this loss function. If None,
+        a new name is generated. (Default: None)
+      reuse: (OPTIONAL) bool or str.  If True, append 'loss' to the existing
+        loss function 'name' as another minibatch/tower.  If False, register
+        a new one.  If VARIABLE_SCOPE, use tf.get_variable_scope().reuse.
+
+    Raises:
+      ValueError: If reuse == True and name == None.
+      KeyError: If reuse == True and no existing LossFunction with 'name' is
+        found.
+      KeyError: If reuse == False and a LossFunction with 'name' already exists.
+    """
+
+    name = name or self._graph.unique_name(base_name)
+
+    if reuse == VARIABLE_SCOPE:
+      reuse = variable_scope.get_variable_scope().reuse
+
+    if reuse:
+      if name is None:
+        raise ValueError(
+            "If reuse is enabled, loss function's name must be set.")
+
+      loss_list = self._loss_dict.get(name, None)
+
+      if loss_list is None:
+        raise KeyError(
+            "Unable to find loss function named {}. Register a new loss "
+            "function with reuse=False.".format(name))
+    else:
+      if name in self._loss_dict:
+        raise KeyError(
+            "Loss function named {} already exists. Set reuse=True to append "
+            "another minibatch/tower.".format(name))
+
+      loss_list = []
+      self._loss_dict[name] = loss_list
+
+    loss_list.append(loss)
+    self.loss_colocation_ops[loss] = colocation_op
+
+  def _get_use_count_map(self):
+    """Returns a dict mapping variables to their number of registrations."""
+    return self._vars_to_uses
+
+  def _add_uses(self, params, uses):
+    """Register additional uses by params in the graph.
+
+    Args:
+      params: Variable or tuple of Variables. Parameters for a layer.
+      uses: int or float. Number of additional uses for these parameters.
+    """
+    params = params if isinstance(params, (tuple, list)) else (params,)
+    for var in params:
+      self._vars_to_uses[var] += uses
 
   def check_registration(self, variables):
     """Checks that all variable uses have been registered properly.
@@ -324,7 +429,7 @@ class LayerCollection(object):
     # Note that overlapping parameters (i.e. those that share variables) will
     # be caught by layer_collection.LayerParametersDict during registration.
 
-    reg_use_map = self.get_use_count_map()
+    reg_use_map = self._get_use_count_map()
 
     error_messages = []
 
@@ -414,12 +519,27 @@ class LayerCollection(object):
     inputs_to_losses = nest.flatten(tuple(loss.inputs for loss in self.losses))
     self._subgraph = utils.SubGraph(inputs_to_losses)
 
+  def eval_losses(self):
+    """Return evaluated losses (colocated with inputs to losses)."""
+    evals = []
+    for loss in self.losses:
+      with ops.colocate_with(self.loss_colocation_ops[loss]):
+        evals.append(loss.evaluate())
+    return evals
+
+  def eval_losses_on_samples(self):
+    """Return losses evaluated on samples (colocated with inputs to losses)."""
+    evals = []
+    for loss in self.losses:
+      with ops.colocate_with(self.loss_colocation_ops[loss]):
+        evals.append(loss.evaluate_on_sample())
+    return evals
+
   def total_loss(self):
-    return math_ops.add_n(tuple(loss.evaluate() for loss in self.losses))
+    return math_ops.add_n(self.eval_losses())
 
   def total_sampled_loss(self):
-    return math_ops.add_n(
-        tuple(loss.evaluate_on_sample() for loss in self.losses))
+    return math_ops.add_n(self.eval_losses_on_samples())
 
   def _get_linked_approx(self, params):
     """If params were linked, return their specified approximation."""
@@ -429,46 +549,57 @@ class LayerCollection(object):
     else:
       return None
 
+  def _get_block_type(self, params, approx, default, approx_to_type):
+    if approx is None:
+      approx = self._get_linked_approx(params)
+      if approx is None:
+        approx = default
+
+    if approx not in approx_to_type:
+      raise ValueError("Bad value {} for approx.".format(approx))
+
+    return approx_to_type[approx], approx
+
   def register_embedding(self,
                          params,
                          inputs,
                          outputs,
                          approx=None,
                          reuse=VARIABLE_SCOPE):
-    """Registers a fully connnected layer.
+    """Registers an embedding layer.
 
     Args:
       params: Embedding matrix of shape [vocab_size, embedding_size].
       inputs: Tensor of shape [batch_size, input_size] and dtype int32. Indices
         into embedding matrix.
-      outputs: Tensor of shape [batch_size, output_size]. Outputs
+      outputs: Tensor of shape [batch_size, embedding_size]. Outputs
         produced by layer.
-      approx: str. Must be "kron".
-      reuse: bool or str.  If True, reuse an existing FisherBlock. If False,
-        create a new FisherBlock.  If "VARIABLE_SCOPE", use
-        tf.get_variable_scope().reuse.
+      approx: str or None. If not None must be "kron".  The Fisher
+        approximation to use. If None the default value is used. (Default: None)
+      reuse: bool or str.  If True, this adds 'inputs' and 'outputs' as an
+        additional mini-batch/tower of data to use when estimating the Fisher
+        block for this layer (which must have already been registered). If
+        "VARIABLE_SCOPE", use tf.get_variable_scope().reuse.
+        (Default: "VARIABLE_SCOPE")
 
     Raises:
       ValueError: For improper value to 'approx'.
       KeyError: If reuse == True but no FisherBlock found for 'params'.
       ValueError: If reuse == True and FisherBlock found but of the wrong type.
     """
-    if approx is None:
-      approx = self._get_linked_approx(params)
-      if approx is None:
-        approx = self.default_embedding_approximation
-
-    if approx != APPROX_KRONECKER_NAME:
-      raise ValueError("Bad value {} for approx.".format(approx))
+    block_type, approx = self._get_block_type(
+        params, approx, self.default_embedding_approximation,
+        _EMBEDDING_APPROX_TO_BLOCK_TYPES)
 
     if isinstance(params, (tuple, list)):
       raise ValueError("Bias not supported.")
-
     vocab_size = int(params.shape[0])
     block = self.register_block(
-        params, fb.EmbeddingKFACFB(self, vocab_size), reuse=reuse)
+        params, block_type(self, vocab_size), reuse=reuse)
     block.register_additional_minibatch(inputs, outputs)
 
+    self._add_uses(params, 1)
+
   def register_fully_connected(self,
                                params,
                                inputs,
@@ -484,55 +615,65 @@ class LayerCollection(object):
       inputs: Tensor of shape [batch_size, input_size]. Inputs to layer.
       outputs: Tensor of shape [batch_size, output_size]. Outputs
         produced by layer.
-      approx: str. One of "kron" or "diagonal".
-      reuse: bool or str.  If True, reuse an existing FisherBlock. If False,
-        create a new FisherBlock.  If "VARIABLE_SCOPE", use
-        tf.get_variable_scope().reuse.
+      approx: str or None. If not None must be one of "kron" or "diagonal".
+        The Fisher approximation to use. If None the default value is used.
+        (Default: None)
+      reuse: bool or str.  If True, this adds 'inputs' and 'outputs' as an
+        additional mini-batch/tower of data to use when estimating the Fisher
+        block for this layer (which must have already been registered). If
+        "VARIABLE_SCOPE", use tf.get_variable_scope().reuse.
+        (Default: "VARIABLE_SCOPE")
 
     Raises:
       ValueError: For improper value to 'approx'.
       KeyError: If reuse == True but no FisherBlock found for 'params'.
       ValueError: If reuse == True and FisherBlock found but of the wrong type.
     """
-    if approx is None:
-      approx = self._get_linked_approx(params)
-      if approx is None:
-        approx = self.default_fully_connected_approximation
 
-    if approx not in _FULLY_CONNECTED_APPROX_TO_BLOCK_TYPES:
-      raise ValueError("Bad value {} for approx.".format(approx))
+    block_type, approx = self._get_block_type(
+        params, approx, self.default_fully_connected_approximation,
+        _FULLY_CONNECTED_APPROX_TO_BLOCK_TYPES)
 
-    block_type = _FULLY_CONNECTED_APPROX_TO_BLOCK_TYPES[approx]
     has_bias = isinstance(params, (tuple, list))
-
-    block = self.register_block(params, block_type(self, has_bias), reuse=reuse)
+    block = self.register_block(params, block_type(self, has_bias=has_bias),
+                                reuse=reuse)
     block.register_additional_minibatch(inputs, outputs)
 
+    self._add_uses(params, 1)
+
   def register_conv2d(self,
                       params,
                       strides,
                       padding,
                       inputs,
                       outputs,
+                      data_format=None,
+                      dilations=None,
                       approx=None,
                       reuse=VARIABLE_SCOPE):
-    """Registers a convolutional layer.
+    """Registers a call to tf.nn.conv2d().
 
     Args:
       params: Tensor or 2-tuple of Tensors corresponding to weight and bias of
         this layer. Weight matrix should have shape [kernel_height,
         kernel_width, in_channels, out_channels].  Bias should have shape
         [out_channels].
-      strides: 1-D Tensor of length 4. Strides for convolution kernel.
+      strides: List of 4 ints. Strides for convolution kernel.
       padding: string. see tf.nn.conv2d for valid values.
       inputs: Tensor of shape [batch_size, height, width, in_channels]. Inputs
         to layer.
       outputs: Tensor of shape [batch_size, height, width, out_channels].
         Output produced by layer.
-      approx: str. One of "kron" or "diagonal".
-      reuse: bool or str.  If True, reuse an existing FisherBlock. If False,
-        create a new FisherBlock.  If "VARIABLE_SCOPE", use
-        tf.get_variable_scope().reuse.
+      data_format: str or None. Format of data.
+      dilations: List of 4 ints. Dilations along each dimension.
+      approx: str or None. If not None must be one of "kron" or "diagonal".
+        The Fisher approximation to use. If None the default value is used.
+        (Default: None)
+      reuse: bool or str.  If True, this adds 'inputs' and 'outputs' as an
+        additional mini-batch/tower of data to use when estimating the Fisher
+        block for this layer (which must have already been registered). If
+        "VARIABLE_SCOPE", use tf.get_variable_scope().reuse.
+        (Default: "VARIABLE_SCOPE")
 
     Raises:
       ValueError: For improper value to 'approx'.
@@ -540,19 +681,229 @@ class LayerCollection(object):
       ValueError: If reuse == True and FisherBlock found but of the wrong type.
     """
 
-    if approx is None:
-      approx = self._get_linked_approx(params)
-      if approx is None:
-        approx = self.default_conv2d_approximation
+    block_type, approx = self._get_block_type(
+        params, approx, self.default_conv2d_approximation,
+        _CONV2D_APPROX_TO_BLOCK_TYPES)
+
+    # It feels bad to pass in configuration that has to do with the internal
+    # implementation.  And then we can't use the same constructor for both
+    # anymore and are thus forced to use this ugly if-statement.
+    # TODO(b/74793309): Clean this up?
+    if approx == APPROX_KRONECKER_NAME:
+      block = self.register_block(
+          params,
+          block_type(
+              layer_collection=self,
+              params=params,
+              padding=padding,
+              strides=strides,
+              data_format=data_format,
+              dilation_rate=dilations,
+              extract_patches_fn="extract_image_patches"),
+          reuse=reuse)
+    elif approx == APPROX_DIAGONAL_NAME:
+      assert strides[0] == strides[-1] == 1
+      block = self.register_block(
+          params,
+          block_type(
+              layer_collection=self,
+              params=params,
+              padding=padding,
+              strides=strides,
+              dilations=dilations,
+              data_format=data_format),
+          reuse=reuse)
+    else:
+      raise NotImplementedError(approx)
 
-    if approx not in _CONV2D_APPROX_TO_BLOCK_TYPES:
-      raise ValueError("Bad value {} for approx.".format(approx))
+    block.register_additional_minibatch(inputs, outputs)
+
+    self._add_uses(params, 1)
+
+  def register_convolution(self,
+                           params,
+                           inputs,
+                           outputs,
+                           padding,
+                           strides=None,
+                           dilation_rate=None,
+                           data_format=None,
+                           approx=None,
+                           reuse=VARIABLE_SCOPE):
+    """Register a call to tf.nn.convolution().
+
+    Args:
+      params: Tensor or 2-tuple of Tensors corresponding to weight and bias of
+        this layer. Weight matrix should have shape [..filter_spatial_size..,
+        in_channels, out_channels].  Bias should have shape [out_channels].
+      inputs: Tensor of shape [batch_size, ..input_spatial_size.., in_channels].
+        Inputs to layer.
+      outputs: Tensor of shape [batch_size, ..output_spatial_size..,
+        out_channels].  Output produced by layer.
+      padding: string. see tf.nn.conv2d for valid values.
+      strides: List of ints of length len(..input_spatial_size..). Strides for
+        convolution kernel in spatial dimensions.
+      dilation_rate: List of ints of length len(..input_spatial_size..).
+        Dilations along spatial dimension.
+      data_format: str or None. Format of data.
+      approx: str or None. If not None must be one of "kron" or "diagonal".
+        The Fisher approximation to use. If None the default value is used.
+        (Default: None)
+      reuse: bool or str.  If True, this adds 'inputs' and 'outputs' as an
+        additional mini-batch/tower of data to use when estimating the Fisher
+        block for this layer (which must have already been registered). If
+        "VARIABLE_SCOPE", use tf.get_variable_scope().reuse.
+        (Default: "VARIABLE_SCOPE")
+
+    Raises:
+      ValueError: For improper value to 'approx'.
+      KeyError: If reuse == True but no FisherBlock found for 'params'.
+      ValueError: If reuse == True and FisherBlock found but of the wrong type.
+    """
+    # TODO(b/74793309): Have this use _get_block_type like the other
+    # registration functions?
+    assert approx is None or approx == APPROX_KRONECKER_NAME
+
+    block = self.register_block(
+        params,
+        fb.ConvKFCBasicFB(
+            layer_collection=self,
+            params=params,
+            padding=padding,
+            strides=strides,
+            dilation_rate=dilation_rate,
+            data_format=data_format),
+        reuse=reuse)
+    block.register_additional_minibatch(inputs, outputs)
+
+    self._add_uses(params, 1)
+
+  def register_depthwise_conv2d(self,
+                                params,
+                                inputs,
+                                outputs,
+                                strides,
+                                padding,
+                                rate=None,
+                                data_format=None,
+                                approx=None,
+                                reuse=VARIABLE_SCOPE):
+    """Register a call to tf.nn.depthwise_conv2d().
+
+    Args:
+      params: 4-D Tensor of shape [filter_height, filter_width,
+        in_channels, channel_multiplier].  Convolutional filter.
+      inputs: Tensor of shape [batch_size, input_height, input_width,
+        in_channels].  Inputs to layer.
+      outputs: Tensor of shape [batch_size, output_height, output_width,
+        in_channels * channel_multiplier].  Output produced by depthwise conv2d.
+      strides: List of ints of length 4. Strides along all dimensions.
+      padding: string. see tf.nn.conv2d for valid values.
+      rate: None or List of ints of length 2. Dilation rates in spatial
+        dimensions.
+      data_format: str or None. Format of data.
+      approx: str or None. If not None must be "diagonal".  The Fisher
+        approximation to use. If None the default value is used. (Default: None)
+      reuse: bool or str.  If True, this adds 'inputs' and 'outputs' as an
+        additional mini-batch/tower of data to use when estimating the Fisher
+        block for this layer (which must have already been registered). If
+        "VARIABLE_SCOPE", use tf.get_variable_scope().reuse.
+        (Default: "VARIABLE_SCOPE")
+
+    Raises:
+      ValueError: For improper value to 'approx'.
+      KeyError: If reuse == True but no FisherBlock found for 'params'.
+      ValueError: If reuse == True and FisherBlock found but of the wrong type.
+    """
+    # TODO(b/74793309): Have this use _get_block_type like the other
+    # registration functions?
+    assert approx is None or approx == APPROX_DIAGONAL_NAME
+    assert data_format in [None, "NHWC"]
 
-    block_type = _CONV2D_APPROX_TO_BLOCK_TYPES[approx]
     block = self.register_block(
-        params, block_type(self, params, strides, padding), reuse=reuse)
+        params,
+        fb.DepthwiseConvDiagonalFB(
+            layer_collection=self,
+            params=params,
+            strides=strides,
+            padding=padding,
+            rate=rate,
+            data_format=data_format),
+        reuse=reuse)
     block.register_additional_minibatch(inputs, outputs)
 
+    self._add_uses(params, 1)
+
+  def register_separable_conv2d(self,
+                                depthwise_params,
+                                pointwise_params,
+                                inputs,
+                                depthwise_outputs,
+                                pointwise_outputs,
+                                strides,
+                                padding,
+                                rate=None,
+                                data_format=None,
+                                approx=None,
+                                reuse=VARIABLE_SCOPE):
+    """Register a call to tf.nn.separable_conv2d().
+
+    Note: This requires access to intermediate outputs between depthwise and
+    pointwise convolutions.
+
+    Args:
+      depthwise_params: 4-D Tensor of shape [filter_height, filter_width,
+        in_channels, channel_multiplier].  Filter for depthwise conv2d.
+      pointwise_params: 4-D Tensor of shape [1, 1, in_channels *
+        channel_multiplier, out_channels].  Filter for pointwise conv2d.
+      inputs: Tensor of shape [batch_size, input_height, input_width,
+        in_channels].  Inputs to layer.
+      depthwise_outputs: Tensor of shape [batch_size, output_height,
+        output_width, in_channels * channel_multiplier].  Output produced by
+        depthwise conv2d.
+      pointwise_outputs: Tensor of shape [batch_size, output_height,
+        output_width, out_channels].  Output produced by pointwise conv2d.
+      strides: List of ints of length 4. Strides for depthwise conv2d kernel in
+        all dimensions.
+      padding: string. see tf.nn.conv2d for valid values.
+      rate: None or List of ints of length 2. Dilation rate of depthwise conv2d
+        kernel in spatial dimensions.
+      data_format: str or None. Format of data.
+      approx: str or None. If not None must be one of "kron" or "diagonal".
+        The Fisher approximation to use. If None the default value is used.
+        (Default: None)
+      reuse: bool or str.  If True, this adds 'inputs' and 'outputs' as an
+        additional mini-batch/tower of data to use when estimating the Fisher
+        block for this layer (which must have already been registered). If
+        "VARIABLE_SCOPE", use tf.get_variable_scope().reuse.
+        (Default: "VARIABLE_SCOPE")
+
+    Raises:
+      ValueError: For improper value to 'approx'.
+      KeyError: If reuse == True but no FisherBlock found for 'params'.
+      ValueError: If reuse == True and FisherBlock found but of the wrong type.
+    """
+    self.register_depthwise_conv2d(
+        params=depthwise_params,
+        inputs=inputs,
+        outputs=depthwise_outputs,
+        strides=strides,
+        padding=padding,
+        rate=rate,
+        data_format=data_format,
+        approx=APPROX_DIAGONAL_NAME,
+        reuse=reuse)
+
+    self.register_conv2d(
+        params=pointwise_params,
+        inputs=depthwise_outputs,
+        outputs=pointwise_outputs,
+        strides=[1, 1, 1, 1],
+        padding="VALID",
+        data_format=data_format,
+        approx=approx,
+        reuse=reuse)
+
   def register_generic(self,
                        params,
                        batch_size,
@@ -563,31 +914,31 @@ class LayerCollection(object):
     Args:
       params: Tensor or tuple of Tensors corresponding to the parameters.
       batch_size: 0-D Tensor. Size of the minibatch.
-      approx: str. One of "full" or "diagonal".
-      reuse: bool or str.  If True, reuse an existing FisherBlock. If False,
-        create a new FisherBlock.  If "VARIABLE_SCOPE", use
-        tf.get_variable_scope().reuse.
+      approx: str or None. If not None, must be one of "full" or "diagonal".
+        The Fisher approximation to use. If None the default value is used.
+        (Default: None)
+      reuse: bool or str. If True, this adds 'batch_size' to the total
+        mini-batch size used when estimating the Fisher block for this layer
+        (which must have already been registered). If "VARIABLE_SCOPE", use
+        tf.get_variable_scope().reuse. (Default: "VARIABLE_SCOPE")
 
     Raises:
       ValueError: For improper value to 'approx'.
       KeyError: If reuse == True but no FisherBlock found for 'params'.
       ValueError: If reuse == True and FisherBlock found but of the wrong type.
     """
+    block_type, approx = self._get_block_type(
+        params, approx, self.default_generic_approximation,
+        _GENERIC_APPROX_TO_BLOCK_TYPES)
 
-    if approx is None:
-      approx = self._get_linked_approx(params)
-      if approx is None:
-        approx = self.default_generic_approximation
-
-    if approx not in _GENERIC_APPROX_TO_BLOCK_TYPES:
-      raise ValueError("Bad value {} for approx.".format(approx))
-
-    block_type = _GENERIC_APPROX_TO_BLOCK_TYPES[approx]
     block = self.register_block(params, block_type(self, params), reuse=reuse)
     block.register_additional_minibatch(batch_size)
 
+    self._add_uses(params, float("inf"))
+
   def register_fully_connected_multi(self, params, inputs, outputs,
-                                     approx=None):
+                                     num_uses=None, approx=None,
+                                     reuse=VARIABLE_SCOPE):
     """Register fully connected layers with shared parameters.
 
     This can handle general fully-connected layers with shared parameters, but
@@ -598,34 +949,187 @@ class LayerCollection(object):
       params: Tensor or 2-tuple of Tensors corresponding to weight and bias of
         this layer. Weight matrix should have shape [input_size, output_size].
         Bias should have shape [output_size].
-      inputs: A list of tensors, each of shape [batch_size, input_size]. Inputs
-        to layer. In the case of RNNs, one Tensor per time step.
-      outputs: A list of tensors, the same length as 'inputs', each of shape
-        [batch_size, output_size]. Outputs produced by layer. In the case of
-        RNNs, one Tensor per time step.
-      approx: str. One of "kron_indep", "kron_series_1", or "kron_series_2".
+      inputs: A list of Tensors, each of shape [batch_size, input_size]. Inputs
+        to layer. The list indexes each use in the graph (which might
+        correspond to a "time-step" in an RNN). OR, can be single Tensor, of
+        shape [batch_size * num_uses, input_size], which is a reshaped version
+        of a Tensor of shape [batch_size, num_uses, input_size].
+      outputs: A list of Tensors, the same length as 'inputs', each of shape
+        [batch_size, output_size]. Outputs produced by layer. The list indexes
+        each use in the graph (which might correspond to a "time-step" in an
+        RNN). Needs to correspond with the order used in 'inputs'.  OR, can be
+        a single Tensor of shape [batch_size * num_uses, output_size], which is
+        a reshaped version of a Tensor of shape [batch_size, num_uses,
+        output_size].
+      num_uses: int or None. The number of uses/time-steps in the graph where
+        the layer appears. Only needed if both inputs and outputs are given in
+        the single Tensor format. (Default: None)
+      approx: str or None. If not None, must be one of "kron_indep",
+        "kron_series_1" or "kron_series_2". The Fisher approximation to use. If
+        None the default value is used. (Default: None)
+      reuse: bool or str.  If True, this adds inputs and outputs as an
+        additional mini-batch/tower of data to use when estimating the Fisher
+        block for this layer (which must have already been registered). If
+        "VARIABLE_SCOPE", use tf.get_variable_scope().reuse.  (Note that the
+        word 'use' here has a completely different meaning to "use in the graph"
+        as it pertains to the 'inputs', 'outputs', and 'num_uses' arguments.)
+        (Default: "VARIABLE_SCOPE")
 
     Raises:
       ValueError: For improper value to 'approx'.
     """
-    if approx is None:
-      approx = self._get_linked_approx(params)
-      if approx is None:
-        approx = self.default_fully_connected_multi_approximation
-    has_bias = isinstance(params, (tuple, list))
+    block_type, approx = self._get_block_type(
+        params, approx, self.default_fully_connected_multi_approximation,
+        _FULLY_CONNECTED_MULTI_APPROX_TO_BLOCK_TYPES)
 
     # TODO(b/70283649): something along the lines of find_canonical_output
     # should be added back in here (and for the other block types, arguably).
 
-    if approx not in _FULLY_CONNECTED_MULTI_APPROX_TO_BLOCK_TYPES:
-      raise ValueError("Bad value {} for approx.".format(approx))
-    block_type = _FULLY_CONNECTED_MULTI_APPROX_TO_BLOCK_TYPES[approx]
+    has_bias = isinstance(params, (tuple, list))
+    block = self.register_block(params, block_type(self, has_bias=has_bias,
+                                                   num_uses=num_uses),
+                                reuse=reuse)
+    block.register_additional_minibatch(inputs, outputs)
 
-    # For now we don't support multiple minibatches for this type of layer, so
-    # we set reuse=False
-    self.register_block(params,
-                        block_type(self, inputs, outputs, has_bias=has_bias),
-                        reuse=False)
+    assert len(inputs) == len(outputs)
+    self._add_uses(params, len(inputs))
+
+  def register_conv2d_multi(self,
+                            params,
+                            strides,
+                            padding,
+                            inputs,
+                            outputs,
+                            num_uses=None,
+                            data_format=None,
+                            dilations=None,
+                            approx=None,
+                            reuse=VARIABLE_SCOPE):
+    """Registers convolutional layers with shared parameters.
+
+    Args:
+      params: Tensor or 2-tuple of Tensors corresponding to weight and bias of
+        this layer. Weight matrix should have shape [kernel_height,
+        kernel_width, in_channels, out_channels].  Bias should have shape
+        [out_channels].
+      strides: List of 4 ints. Strides for convolution kernel.
+      padding: string. see tf.nn.conv2d for valid values.
+      inputs: A list of Tensors, each of shape [batch_size, height, width,
+        in_channels]. Inputs to layer. The list indexes each use in the graph
+        (which might correspond to a "time-step" in an RNN). OR, can be single
+        Tensor, of shape [batch_size * num_uses, height, width, in_channels],
+        which is a reshaped version of a Tensor of shape [batch_size, num_uses,
+        height, width, in_channels].
+      outputs: A list of Tensors, each of shape [batch_size, height, width,
+        out_channels]. Output produced by layer. The list indexes each use
+        in the graph (which might correspond to a "time-step" in an RNN).
+        Needs to correspond with the order used in 'inputs'.  OR, can be a
+        single Tensor, of shape [batch_size*num_uses, height, width,
+        out_channels], which is a reshaped version of a Tensor of shape
+        [batch_size, num_uses, height, width, out_channels].
+      num_uses: int or None. The number of uses/time-steps in the graph where
+        the layer appears. Only needed if both inputs and outputs are given in
+        the single Tensor format. (Default: None)
+      data_format: str or None. Format of data.
+      dilations: List of 4 ints. Dilations along each dimension.
+      approx: str or None. If not None must be "kron_indep". The Fisher
+        approximation to use. If None the default value is used.
+        (Default: None)
+      reuse: bool or str.  If True, this adds inputs and outputs as an
+        additional mini-batch/tower of data to use when estimating the Fisher
+        block for this layer (which must have already been registered). If
+        "VARIABLE_SCOPE", use tf.get_variable_scope().reuse.  (Note that the
+        word 'use' here has a completely different meaning to "use in the graph"
+        as it pertains to the 'inputs', 'outputs', and 'num_uses' arguments.)
+        (Default: "VARIABLE_SCOPE")
+
+    Raises:
+      ValueError: For improper value to 'approx'.
+      KeyError: If reuse == True but no FisherBlock found for 'params'.
+      ValueError: If reuse == True and FisherBlock found but of the wrong type.
+    """
+    block_type, approx = self._get_block_type(
+        params, approx, self.default_conv2d_multi_approximation,
+        _CONV2D_MULTI_APPROX_TO_BLOCK_TYPES)
+
+    block = self.register_block(
+        params,
+        block_type(
+            layer_collection=self,
+            params=params,
+            padding=padding,
+            strides=strides,
+            data_format=data_format,
+            dilation_rate=dilations,
+            extract_patches_fn="extract_image_patches",
+            num_uses=num_uses),
+        reuse=reuse)
+
+    block.register_additional_minibatch(inputs, outputs)
+
+    assert len(inputs) == len(outputs)
+    self._add_uses(params, len(inputs))
+
+  # TODO(b/74108452): change the loss registration functions names to refer
+  # to "loss functions" instead of distributions.  Following naming convention
+  # of the loss function classes themselves.
+
+  def register_embedding_multi(self,
+                               params,
+                               inputs,
+                               outputs,
+                               num_uses=None,
+                               approx=None,
+                               reuse=VARIABLE_SCOPE):
+    """Registers embedding layers with shared parameters.
+
+    Args:
+      params: Embedding matrix of shape [vocab_size, embedding_size].
+      inputs: A list of Tensors, each of shape [batch_size, input_size] and
+        dtype int32. Indices into embedding matrix. The list indexes each use
+        in the graph (which might correspond to a "time-step" in an RNN).
+        OR, can be a single Tensor, of shape [batch_size * num_uses,
+        input_size], which is a reshaped version of a Tensor of shape
+        [batch_size, num_uses, input_size].
+      outputs: A list of Tensors, each of shape [batch_size, embedding_size].
+        Outputs produced by layer. The list indexes each use in the graph
+        (which might correspond to a "time-step" in an RNN). Needs to
+        correspond with the order used in 'inputs'. OR, can be a
+        single Tensor, of shape [batch_size*num_uses, embedding_size], which
+        is a reshaped version of a Tensor of shape [batch_size, num_uses,
+        embedding_size].
+      num_uses: int or None. The number of uses/time-steps in the graph where
+        the layer appears. Only needed if both inputs and outputs are given in
+        the single Tensor format. (Default: None)
+      approx: str or None. If not None must be "kron_indep". The Fisher
+        approximation to use. If None the default value is used.
+        (Default: None)
+      reuse: bool or str.  If True, this adds inputs and outputs as an
+        additional mini-batch/tower of data to use when estimating the Fisher
+        block for this layer (which must have already been registered). If
+        "VARIABLE_SCOPE", use tf.get_variable_scope().reuse.  (Note that the
+        word 'use' here has a completely different meaning to "use in the graph"
+        as it pertains to the 'inputs', 'outputs', and 'num_uses' arguments.)
+        (Default: "VARIABLE_SCOPE")
+
+    Raises:
+      ValueError: For improper value to 'approx'.
+      KeyError: If reuse == True but no FisherBlock found for 'params'.
+      ValueError: If reuse == True and FisherBlock found but of the wrong type.
+    """
+    block_type, approx = self._get_block_type(
+        params, approx, self.default_embedding_multi_approximation,
+        _EMBEDDING_MULTI_APPROX_TO_BLOCK_TYPES)
+
+    if isinstance(params, (tuple, list)):
+      raise ValueError("Bias not supported.")
+    vocab_size = int(params.shape[0])
+
+    block = self.register_block(
+        params, block_type(self, vocab_size, num_uses=num_uses), reuse=reuse)
+    block.register_additional_minibatch(inputs, outputs)
+
+    self._add_uses(params, len(inputs))
 
   def register_categorical_predictive_distribution(self,
                                                    logits,
@@ -645,53 +1149,24 @@ class LayerCollection(object):
         (Default: None)
       name: (OPTIONAL) str or None. Unique name for this loss function. If None,
         a new name is generated. (Default: None)
-      reuse: (OPTIONAL) bool or str.  If True, reuse an existing FisherBlock.
-        If False, create a new FisherBlock.  If VARIABLE_SCOPE, use
-        tf.get_variable_scope().reuse.
-
-    Raises:
-      ValueError: If reuse == True and name == None.
-      ValueError: If reuse == True and seed != None.
-      KeyError: If reuse == True and no existing LossFunction with 'name' found.
-      KeyError: If reuse == False and existing LossFunction with 'name' found.
+      reuse: bool or str.  If True, this adds 'logits' as an additional
+        mini-batch/tower of inputs to the loss-function/predictive distribution
+        (which must have already been registered). If "VARIABLE_SCOPE", use
+        tf.get_variable_scope().reuse. (Default: "VARIABLE_SCOPE")
     """
-    name = name or self._graph.unique_name(
-        "register_categorical_predictive_distribution")
-
-    if reuse == VARIABLE_SCOPE:
-      reuse = variable_scope.get_variable_scope().reuse
-
-    if reuse:
-      if name is None:
-        raise ValueError(
-            "If reuse is enabled, loss function's name must be set.")
-      if seed is not None:
-        raise ValueError(
-            "Seed can only be specified at LossFunction instantiation.")
-
-      loss = self._loss_dict.get(name, None)
-
-      if loss is None:
-        raise KeyError(
-            "Unable to find loss function named {}. Create a new LossFunction "
-            "with reuse=False.".format(name))
-
-      loss.register_additional_minibatch(logits, targets=targets)
-    else:
-      if name in self._loss_dict:
-        raise KeyError(
-            "Loss function named {} already exists. Set reuse=True to append "
-            "another minibatch.".format(name))
-      loss = lf.CategoricalLogitsNegativeLogProbLoss(
-          logits, targets=targets, seed=seed)
-      self._loss_dict[name] = loss
+    loss = lf.CategoricalLogitsNegativeLogProbLoss(logits, targets=targets,
+                                                   seed=seed)
+    self.register_loss_function(loss, logits,
+                                "categorical_predictive_distribution",
+                                name=name, reuse=reuse)
 
   def register_normal_predictive_distribution(self,
                                               mean,
                                               var=0.5,
                                               seed=None,
                                               targets=None,
-                                              name=None):
+                                              name=None,
+                                              reuse=VARIABLE_SCOPE):
     """Registers a normal predictive distribution.
 
     Args:
@@ -708,21 +1183,23 @@ class LayerCollection(object):
         (Default: None)
       name: (OPTIONAL) str or None. Unique name for this loss function. If None,
         a new name is generated. (Default: None)
+      reuse: bool or str.  If True, this adds 'mean' and 'var' as an additional
+        mini-batch/tower of inputs to the loss-function/predictive distribution
+        (which must have already been registered). If "VARIABLE_SCOPE", use
+        tf.get_variable_scope().reuse. (Default: "VARIABLE_SCOPE")
     """
-    name = name or self._graph.unique_name(
-        "register_normal_predictive_distribution")
-    if name in self._loss_dict:
-      raise NotImplementedError(
-          "Adding logits to an existing LossFunction not yet supported.")
-    loss = lf.NormalMeanNegativeLogProbLoss(
-        mean, var, targets=targets, seed=seed)
-    self._loss_dict[name] = loss
+    loss = lf.NormalMeanNegativeLogProbLoss(mean, var, targets=targets,
+                                            seed=seed)
+    self.register_loss_function(loss, mean,
+                                "normal_predictive_distribution",
+                                name=name, reuse=reuse)
 
   def register_multi_bernoulli_predictive_distribution(self,
                                                        logits,
                                                        seed=None,
                                                        targets=None,
-                                                       name=None):
+                                                       name=None,
+                                                       reuse=VARIABLE_SCOPE):
     """Registers a multi-Bernoulli predictive distribution.
 
     Args:
@@ -735,15 +1212,16 @@ class LayerCollection(object):
         (Default: None)
       name: (OPTIONAL) str or None. Unique name for this loss function. If None,
         a new name is generated. (Default: None)
+      reuse: bool or str.  If True, this adds 'logits' as an additional
+        mini-batch/tower of inputs to the loss-function/predictive distribution
+        (which must have already been registered). If "VARIABLE_SCOPE", use
+        tf.get_variable_scope().reuse. (Default: "VARIABLE_SCOPE")
     """
-    name = name or self._graph.unique_name(
-        "register_multi_bernoulli_predictive_distribution")
-    if name in self._loss_dict:
-      raise NotImplementedError(
-          "Adding logits to an existing LossFunction not yet supported.")
-    loss = lf.MultiBernoulliNegativeLogProbLoss(
-        logits, targets=targets, seed=seed)
-    self._loss_dict[name] = loss
+    loss = lf.MultiBernoulliNegativeLogProbLoss(logits, targets=targets,
+                                                seed=seed)
+    self.register_loss_function(loss, logits,
+                                "multi_bernoulli_predictive_distribution",
+                                name=name, reuse=reuse)
 
   def make_or_get_factor(self, cls, args):
     """Insert 'cls(args)' into 'self.fisher_factors' if not already present.
@@ -772,3 +1250,10 @@ class LayerCollection(object):
       with variable_scope.variable_scope(self._var_scope):
         self.fisher_factors[key] = cls(*args)
     return self.fisher_factors[key]
+
+  @contextmanager
+  def as_default(self):
+    """Sets this LayerCollection as the default."""
+    set_default_layer_collection(self)
+    yield
+    set_default_layer_collection(None)
diff --git a/tensorflow/contrib/kfac/python/ops/layer_collection_lib.py b/tensorflow/contrib/kfac/python/ops/layer_collection_lib.py
index f8aa230d9ca1f542950f56b1e6cf1ab7ccd3d05f..9f4685380705bd409dbcd7e85d0e3bb4189a6adc 100644
--- a/tensorflow/contrib/kfac/python/ops/layer_collection_lib.py
+++ b/tensorflow/contrib/kfac/python/ops/layer_collection_lib.py
@@ -30,6 +30,8 @@ from tensorflow.python.util.all_util import remove_undocumented
 # pylint: enable=unused-import,line-too-long,wildcard-import
 
 _allowed_symbols = [
+    "get_default_layer_collection",
+    "set_default_layer_collection",
     "LayerParametersDict",
     "LayerCollection",
     "APPROX_KRONECKER_NAME",
diff --git a/tensorflow/contrib/kfac/python/ops/loss_functions.py b/tensorflow/contrib/kfac/python/ops/loss_functions.py
index cb3e698b9ceab920785adf735f88bd8e535a628f..e7d4243fc3d1c2d860693f2f62447b1c9aeeee03 100644
--- a/tensorflow/contrib/kfac/python/ops/loss_functions.py
+++ b/tensorflow/contrib/kfac/python/ops/loss_functions.py
@@ -57,30 +57,6 @@ class LossFunction(object):
     """The inputs to the loss function (excluding the targets)."""
     pass
 
-  @property
-  def input_minibatches(self):
-    """A `list` of inputs to the loss function, separated by minibatch.
-
-    Typically there will be one minibatch per tower in a multi-tower setup.
-    Returns a list consisting of `self.inputs` by default; `LossFunction`s
-    supporting registering multiple minibatches should override this method.
-
-    Returns:
-      A `list` of `Tensor`s representing
-    """
-    return [self.inputs]
-
-  @property
-  def num_registered_minibatches(self):
-    """Number of minibatches registered for this LossFunction.
-
-    Typically equal to the number of towers in a multi-tower setup.
-
-    Returns:
-      An `int` representing the number of registered minibatches.
-    """
-    return len(self.input_minibatches)
-
   def evaluate(self):
     """Evaluate the loss function on the targets."""
     if self.targets is not None:
@@ -474,7 +450,6 @@ class NormalMeanVarianceNegativeLogProbLoss(DistributionNegativeLogProbLoss):
     assert len(variance.shape) == 2, "Expect 2D variance tensor."
     self._mean = mean
     self._variance = variance
-    self._scale = math_ops.sqrt(variance)
     self._targets = targets
     super(NormalMeanVarianceNegativeLogProbLoss, self).__init__(seed=seed)
 
@@ -484,7 +459,7 @@ class NormalMeanVarianceNegativeLogProbLoss(DistributionNegativeLogProbLoss):
 
   @property
   def dist(self):
-    return normal.Normal(loc=self._mean, scale=self._scale)
+    return normal.Normal(loc=self._mean, scale=math_ops.sqrt(self._variance))
 
   @property
   def params(self):
@@ -502,7 +477,7 @@ class NormalMeanVarianceNegativeLogProbLoss(DistributionNegativeLogProbLoss):
 
   @property
   def _fisher_mean_factor(self):
-    return 1. / self._scale
+    return 1. / math_ops.sqrt(self._variance)
 
   @property
   def _fisher_var(self):
@@ -611,36 +586,13 @@ class CategoricalLogitsNegativeLogProbLoss(DistributionNegativeLogProbLoss,
         index in [0, output_size).
       seed: int or None. Default random seed when sampling.
     """
-    self._logits_components = []
-    self._targets_components = []
-    self.register_additional_minibatch(logits, targets=targets)
+    self._logits = logits
+    self._targets = targets
     super(CategoricalLogitsNegativeLogProbLoss, self).__init__(seed=seed)
 
-  def register_additional_minibatch(self, logits, targets=None):
-    """Register an additiona minibatch's worth of parameters.
-
-    Args:
-      logits: Tensor of shape [batch_size, output_size]. Parameters for
-        underlying distribution.
-      targets: None or Tensor of shape [batch_size, output_size].  Each row must
-        be a one-hot vector.
-    """
-    self._logits_components.append(logits)
-    self._targets_components.append(targets)
-
-  @property
-  def _logits(self):
-    return array_ops.concat(self._logits_components, axis=0)
-
-  @property
-  def input_minibatches(self):
-    return self._logits_components
-
   @property
   def targets(self):
-    if all(target is None for target in self._targets_components):
-      return None
-    return array_ops.concat(self._targets_components, axis=0)
+    return self._targets
 
   @property
   def dist(self):
diff --git a/tensorflow/contrib/kfac/python/ops/optimizer.py b/tensorflow/contrib/kfac/python/ops/optimizer.py
index 5d456bcb79ff00cedc1aaa7244cc8722d21f6e98..083da768ec97aca3e63995491bb579835bb5377f 100644
--- a/tensorflow/contrib/kfac/python/ops/optimizer.py
+++ b/tensorflow/contrib/kfac/python/ops/optimizer.py
@@ -18,6 +18,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import warnings
+
 # pylint disable=long-line
 from tensorflow.contrib.kfac.python.ops import curvature_matrix_vector_products as cmvp
 from tensorflow.contrib.kfac.python.ops import estimator as est
@@ -50,6 +52,7 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
                name="KFAC",
                estimation_mode="gradients",
                colocate_gradients_with_ops=True,
+               batch_size=None,
                cov_devices=None,
                inv_devices=None):
     """Initializes the KFAC optimizer with the given settings.
@@ -91,12 +94,16 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
       colocate_gradients_with_ops: Whether we should request gradients we
           compute in the estimator be colocated with their respective ops.
           (Default: True)
+      batch_size: The size of the mini-batch. Only needed when momentum_type
+          == 'qmodel' or when automatic adjustment is used.  (Default: None)
       cov_devices: Iterable of device strings (e.g. '/gpu:0'). Covariance
           computations will be placed on these devices in a round-robin fashion.
-          Can be None, which means that no devices are specified.
+          Can be None, which means that no devices are specified. Only used
+          with the (soon-to-be-deprecated) "convenience" properties.
       inv_devices: Iterable of device strings (e.g. '/gpu:0'). Inversion
           computations will be placed on these devices in a round-robin fashion.
-          Can be None, which means that no devices are specified.
+          Can be None, which means that no devices are specified. Only used
+          with the (soon-to-be-deprecated) "convenience" properties.
 
     Raises:
       ValueError: If the momentum type is unsupported.
@@ -110,6 +117,15 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
     if variables is None:
       variables = tf_variables.trainable_variables()
 
+    # Parameters to be passed to the Fisher estimator:
+    self._variables = variables
+    self._cov_ema_decay = cov_ema_decay
+    self._layers = layer_collection
+    self._estimation_mode = estimation_mode
+    self._colocate_gradients_with_ops = colocate_gradients_with_ops
+    self._cov_devices = cov_devices
+    self._inv_devices = inv_devices
+
     # The below paramaters are required only if damping needs to be adapated.
     # These parameters can be set by calling
     # set_damping_adaptation_params() explicitly.
@@ -130,17 +146,6 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
     self._q_model_change = None
     self._update_damping_op = None
 
-    self._layers = layer_collection
-    self._fisher_est = est.FisherEstimator(
-        lambda: self.damping,
-        variables,
-        cov_ema_decay,
-        layer_collection,
-        estimation_mode=estimation_mode,
-        colocate_gradients_with_ops=colocate_gradients_with_ops,
-        cov_devices=cov_devices,
-        inv_devices=inv_devices)
-
     momentum_type = momentum_type.lower()
     legal_momentum_types = ["regular", "adam", "qmodel"]
 
@@ -148,20 +153,27 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
       raise ValueError("Unsupported momentum type {}. Must be one of {}."
                        .format(momentum_type, legal_momentum_types))
     if momentum_type != "regular" and norm_constraint is not None:
-      raise ValueError("Update clipping is only supported with momentum"
+      raise ValueError("Update clipping is only supported with momentum "
                        "type 'regular'.")
     if momentum_type not in ["regular", "adam"] and momentum != 0:
       raise ValueError("Momentum must be unspecified if using a momentum_type "
                        "other than 'regular' or 'adam'.")
 
+    # Extra parameters of the optimizer
     self._momentum = momentum
     self._momentum_type = momentum_type
     self._norm_constraint = norm_constraint
-
-    # this is a bit of a hack
-    # TODO(duckworthd): Handle this in a better way (e.g. pass it in?)
-    self._batch_size = array_ops.shape(layer_collection.losses[0].inputs)[0]
-    self._losses = layer_collection.losses
+    self._batch_size = batch_size
+
+    with variable_scope.variable_scope(name):
+      self._fisher_est = est.FisherEstimator(
+          self._variables,
+          self._cov_ema_decay,
+          self.damping,
+          self._layers,
+          exps=(-1,),
+          estimation_mode=self._estimation_mode,
+          colocate_gradients_with_ops=self._colocate_gradients_with_ops)
 
     super(KfacOptimizer, self).__init__(learning_rate, name=name)
 
@@ -178,6 +190,10 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
     style rule described in Section 6.5 of "Optimizing Neural Networks with
     Kronecker-factored Approximate Curvature".
 
+    Note that this function creates TensorFlow variables which store a few
+    scalars and are accessed by the ops which update the damping (as part
+    of the training op returned by the minimize() method).
+
     Args:
       is_chief: `Boolean`, `True` if the worker is chief.
       prev_train_batch: Training data used to minimize loss in the previous
@@ -199,6 +215,7 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
     """
     if self._adapt_damping:
       raise ValueError("Damping adaptation parameters already set.")
+
     with variable_scope.variable_scope(self.get_name()):
       self._adapt_damping = True
       self._is_chief = is_chief
@@ -221,31 +238,37 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
 
   @property
   def cov_update_thunks(self):
-    return self._fisher_est.cov_update_thunks
+    self._maybe_make_and_save_everything()
+    return self._cov_update_thunks
 
   @property
   def cov_update_ops(self):
-    return self._fisher_est.cov_update_ops
+    self._maybe_make_and_save_everything()
+    return self._cov_update_ops
 
   @property
   def cov_update_op(self):
-    return self._fisher_est.cov_update_op
+    self._maybe_make_and_save_everything()
+    return self._cov_update_op
 
   @property
   def inv_update_thunks(self):
-    return self._fisher_est.inv_update_thunks
+    self._maybe_make_and_save_everything()
+    return self._inv_update_thunks
 
   @property
   def inv_update_ops(self):
-    return self._fisher_est.inv_update_ops
+    self._maybe_make_and_save_everything()
+    return self._inv_update_ops
 
   @property
   def inv_update_op(self):
-    return self._fisher_est.inv_update_op
+    self._maybe_make_and_save_everything()
+    return self._inv_update_op
 
   @property
   def variables(self):
-    return self._fisher_est.variables
+    return self._variables
 
   @property
   def damping(self):
@@ -258,25 +281,162 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
   def damping_adaptation_interval(self):
     return self._damping_adaptation_interval
 
+  def _maybe_make_and_save_everything(self):
+    if not self._fisher_est.made_vars():
+      warnings.warn("These convenience properties will be deprecated soon. "
+                    "Please use explicit op/thunk creation methods instead "
+                    "(e.g. make_ops_and_vars_round_robin, etc).",
+                    DeprecationWarning)
+      (self._cov_update_ops, self._cov_update_op, self._inv_update_ops,
+       self._inv_update_op, self._cov_update_thunks,
+       self._inv_update_thunks) = self.make_ops_and_vars_round_robin(
+           cov_devices=self._cov_devices,
+           inv_devices=self._inv_devices)
+
+  def make_ops_and_vars(self):
+    """Make ops and vars with no specific device placement.
+
+    See make_ops_and_vars_round_robin for details.
+
+    Returns:
+      cov_update_ops: List of ops that compute the cov updates. Corresponds
+        one-to-one with the list of factors given by the "factors" property.
+      cov_update_op: cov_update_ops grouped into a single op.
+      inv_update_ops: List of ops that compute the inv updates.
+        Corresponds one-to-one with the list of factors given by the
+        "factors" property.
+      inv_update_op: inv_update_ops grouped into a single op.
+    """
+    with variable_scope.variable_scope(self.get_name()):
+      return self._fisher_est.make_ops_and_vars()
+
+  def make_ops_and_vars_round_robin(self, cov_devices=None, inv_devices=None):
+    """Make ops and vars with a round-robin device placement strategy.
+
+    For each factor, all of that factor's cov variables and their associated
+    update ops will be placed on a particular device.  A new device is chosen
+    for each factor by cycling through the list of devices in the cov_devices
+    argument. If cov_devices is None then no explicit device placement occurs.
+
+    An analogous strategy is followed for inverse update ops, with the list of
+    devices being given by the inv_devices argument.
+
+    Inverse variables on the other hand are not placed on any specific device
+    (they will just use the current device placement context, whatever
+    that happens to be).  The idea is that the inverse variables belong where
+    they will be accessed most often, which is the device that actually applies
+    the preconditioner to the gradient. The user will be responsible for setting
+    the device context for this.
+
+    Args:
+      cov_devices: Iterable of device strings (e.g. '/gpu:0'). Covariance
+        computations will be placed on these devices in a round-robin fashion.
+        Can be None, which means that no devices are specified.
+      inv_devices: Iterable of device strings (e.g. '/gpu:0'). Inversion
+        computations will be placed on these devices in a round-robin fashion.
+        Can be None, which means that no devices are specified.
+
+    Returns:
+      cov_update_ops: List of ops that compute the cov updates. Corresponds
+        one-to-one with the list of factors given by the "factors" property.
+      cov_update_op: cov_update_ops grouped into a single op.
+      inv_update_ops: List of ops that compute the inv updates.
+        Corresponds one-to-one with the list of factors given by the
+        "factors" property.
+      inv_update_op: inv_update_ops grouped into a single op.
+      cov_update_thunks: Thunks that make the ops in cov_update_ops.
+      inv_update_thunks: Thunks that make the ops in inv_update_ops.
+    """
+    with variable_scope.variable_scope(self.get_name()):
+      return self._fisher_est.make_ops_and_vars_round_robin(
+          cov_devices=cov_devices, inv_devices=inv_devices)
+
+  def make_vars_and_create_op_thunks_round_robin(self,
+                                                 cov_devices=None,
+                                                 inv_devices=None):
+    """Make vars and create op thunks w/ a round-robin device placement strat.
+
+    For each factor, all of that factor's cov variables and their associated
+    update ops will be placed on a particular device.  A new device is chosen
+    for each factor by cycling through the list of devices in the cov_devices
+    argument. If cov_devices is None then no explicit device placement occurs.
+
+    An analogous strategy is followed for inverse update ops, with the list of
+    devices being given by the inv_devices argument.
+
+    Inverse variables on the other hand are not placed on any specific device
+    (they will just use the current device placement context, whatever
+    that happens to be).  The idea is that the inverse variables belong where
+    they will be accessed most often, which is the device that actually applies
+    the preconditioner to the gradient. The user will be responsible for setting
+    the device context for this.
+
+    Args:
+      cov_devices: Iterable of device strings (e.g. '/gpu:0'). Covariance
+        computations will be placed on these devices in a round-robin fashion.
+        Can be None, which means that no devices are specified.
+      inv_devices: Iterable of device strings (e.g. '/gpu:0'). Inversion
+        computations will be placed on these devices in a round-robin fashion.
+        Can be None, which means that no devices are specified.
+    Returns:
+      cov_update_thunks: List of cov update thunks. Corresponds one-to-one with
+        the list of factors given by the "factors" property.
+      inv_update_thunks: List of inv update thunks. Corresponds one-to-one with
+        the list of factors given by the "factors" property.
+    """
+    scope = self.get_name() + "/" + self._fisher_est.name
+    return self._fisher_est.make_vars_and_create_op_thunks_round_robin(
+        scope=scope, cov_devices=cov_devices, inv_devices=inv_devices)
+
+  def ops_and_vars_thunks(self):
+    """Create thunks that make the ops and vars on demand.
+
+    This function returns 4 lists of thunks: cov_variable_thunks,
+    cov_update_thunks, inv_variable_thunks, and inv_update_thunks.
+
+    The length of each list is the number of factors and the i-th element of
+    each list corresponds to the i-th factor (given by the "factors" property).
+
+    Note that the execution of these thunks must happen in a certain
+    partial order.  The i-th element of cov_variable_thunks must execute
+    before the i-th element of cov_update_thunks (and also the i-th element
+    of inv_update_thunks).  Similarly, the i-th element of inv_variable_thunks
+    must execute before the i-th element of inv_update_thunks.
+
+    TL;DR (oversimplified): Execute the thunks according to the order that
+    they are returned.
+
+    Returns:
+      cov_variable_thunks: A list of thunks that make the cov variables.
+      cov_update_thunks: A list of thunks that make the cov update ops.
+      inv_variable_thunks: A list of thunks that make the inv variables.
+      inv_update_thunks: A list of thunks that make the inv update ops.
+    """
+    scope = self.get_name() + "/" + self._fisher_est.name
+    return self._fisher_est.ops_and_vars_thunks(scope=scope)
+
   def minimize(self, *args, **kwargs):
-    kwargs["var_list"] = kwargs.get("var_list") or self.variables
-    if set(kwargs["var_list"]) != set(self.variables):
-      raise ValueError("var_list doesn't match with set of Fisher-estimating "
-                       "variables.")
-    if self._adapt_damping and self._is_chief:
-      global_step = kwargs.get("global_step", None)
-      if not global_step:
-        raise KeyError("global_step needs to be passed to optimizer.minimize "
-                       "if damping parameter is adapted.")
-      update_damping_op = self._update_damping(self._prev_train_batch,
-                                               global_step)
-      with ops.control_dependencies([update_damping_op]):
-        loss = args[0]
-        loss_assign_op = state_ops.assign(self._prev_loss, loss)
-        train_op = super(KfacOptimizer, self).minimize(*args, **kwargs)
-        return control_flow_ops.group(loss_assign_op, train_op)
-    else:
-      return super(KfacOptimizer, self).minimize(*args, **kwargs)
+    # Should this variable scope encompass everything below?  Or will the super-
+    # class make another copy of the same name scope?
+    with variable_scope.variable_scope(self.get_name()):
+      kwargs["var_list"] = kwargs.get("var_list") or self.variables
+      if set(kwargs["var_list"]) != set(self.variables):
+        raise ValueError("var_list doesn't match with set of Fisher-estimating "
+                         "variables.")
+      if self._adapt_damping and self._is_chief:
+        global_step = kwargs.get("global_step", None)
+        if not global_step:
+          raise KeyError("global_step needs to be passed to optimizer.minimize "
+                         "if damping parameter is adapted.")
+        update_damping_op = self._update_damping(self._prev_train_batch,
+                                                 global_step)
+        with ops.control_dependencies([update_damping_op]):
+          loss = args[0]
+          loss_assign_op = state_ops.assign(self._prev_loss, loss)
+          train_op = super(KfacOptimizer, self).minimize(*args, **kwargs)
+          return control_flow_ops.group(loss_assign_op, train_op)
+      else:
+        return super(KfacOptimizer, self).minimize(*args, **kwargs)
 
   def compute_gradients(self, *args, **kwargs):
     # args[1] could be our var_list
@@ -301,6 +461,8 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
     Returns:
       An `Operation` that applies the specified gradients.
     """
+    self._maybe_make_and_save_everything()
+
     # In Python 3, grads_and_vars can be a zip() object which can only be
     # iterated over once. By converting it to a list, we ensure that it can be
     # iterated over more than once.
@@ -450,7 +612,8 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
                     = qmodel(alpha*precon_grad + mu*prev_update) - L(theta).
     """
 
-    cmvpc = cmvp.CurvatureMatrixVectorProductComputer(self._losses, variables)
+    cmvpc = cmvp.CurvatureMatrixVectorProductComputer(self._layers.losses,
+                                                      variables)
 
     # compute the matrix-vector products with the transposed Fisher factor
     fft_precon_grads = cmvpc.multiply_fisher_factor_transpose(precon_grads)
diff --git a/tensorflow/contrib/kfac/python/ops/utils.py b/tensorflow/contrib/kfac/python/ops/utils.py
index 88e6fb20e8f97528aea2a92752d79344c27bbf24..c9de0c7270da5fb213f54ddb0073b5d5cb45c7b7 100644
--- a/tensorflow/contrib/kfac/python/ops/utils.py
+++ b/tensorflow/contrib/kfac/python/ops/utils.py
@@ -24,11 +24,13 @@ from tensorflow.contrib.tpu.python.ops import tpu_ops
 from tensorflow.contrib.tpu.python.tpu import tpu_function
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variables
@@ -430,6 +432,127 @@ def batch_execute(global_step, thunks, batch_size, name=None):
     return result
 
 
+def extract_convolution_patches(inputs,
+                                filter_shape,
+                                padding,
+                                strides=None,
+                                dilation_rate=None,
+                                name=None,
+                                data_format=None):
+  """Extracts inputs to each output coordinate in tf.nn.convolution.
+
+  This is a generalization of tf.extract_image_patches() to tf.nn.convolution(),
+  where the number of spatial dimensions may be something other than 2.
+
+  Assumes,
+  - First dimension of inputs is batch_size
+  - Convolution filter is applied to all input channels.
+
+  Args:
+    inputs: Tensor of shape [batch_size, ..spatial_image_shape.., in_channels].
+      Inputs to tf.nn.convolution().
+    filter_shape: List of ints. Shape of filter passed to tf.nn.convolution().
+    padding: string. Padding method. One of "VALID", "SAME".
+    strides: None or list of ints. Strides along spatial dimensions.
+    dilation_rate: None or list of ints. Dilation along spatial dimensions.
+    name: None or str. Name of Op.
+    data_format: None or str. Format of data.
+
+  Returns:
+    Tensor of shape [batch_size, ..spatial_output_shape..,
+      ..spatial_filter_shape.., in_channels]
+
+  Raises:
+    ValueError: If data_format does not put channel last.
+    ValueError: If inputs and filter disagree on in_channels.
+  """
+  if not is_data_format_channel_last(data_format):
+    raise ValueError("Channel must be last dimension.")
+  with ops.name_scope(name, "extract_convolution_patches",
+                      [inputs, filter_shape, padding, strides, dilation_rate]):
+    batch_size = inputs.shape.as_list()[0]
+    in_channels = inputs.shape.as_list()[-1]
+
+    # filter_shape = spatial_filter_shape + [in_channels, out_channels]
+    spatial_filter_shape = filter_shape[:-2]
+    if in_channels != filter_shape[-2]:
+      raise ValueError("inputs and filter_shape must agree on in_channels.")
+
+    # Map each input feature to a location in the output.
+    out_channels = np.prod(spatial_filter_shape) * in_channels
+    filters = linalg_ops.eye(out_channels)
+    filters = array_ops.reshape(
+        filters,
+        list(spatial_filter_shape) + [in_channels, out_channels])
+
+    result = nn_ops.convolution(
+        inputs,
+        filters,
+        padding=padding,
+        strides=strides,
+        dilation_rate=dilation_rate)
+    spatial_output_shape = result.shape.as_list()[1:-1]
+    result = array_ops.reshape(result,
+                               [batch_size or -1] + spatial_output_shape +
+                               list(spatial_filter_shape) + [in_channels])
+
+    return result
+
+
+def extract_pointwise_conv2d_patches(inputs,
+                                     filter_shape,
+                                     name=None,
+                                     data_format=None):
+  """Extract patches for a 1x1 conv2d.
+
+  Args:
+    inputs: 4-D Tensor of shape [batch_size, height, width, in_channels].
+    filter_shape: List of 4 ints. Shape of filter to apply with conv2d()
+    name: None or str. Name for Op.
+    data_format: None or str. Format for data. See 'data_format' in
+      tf.nn.conv2d() for details.
+
+  Returns:
+    Tensor of shape [batch_size, ..spatial_input_shape..,
+    ..spatial_filter_shape.., in_channels]
+
+  Raises:
+    ValueError: if inputs is not 4-D.
+    ValueError: if filter_shape is not [1, 1, ?, ?]
+    ValueError: if data_format is not channels-last.
+  """
+  if inputs.shape.ndims != 4:
+    raise ValueError("inputs must have 4 dims.")
+  if len(filter_shape) != 4:
+    raise ValueError("filter_shape must have 4 dims.")
+  if filter_shape[0] != 1 or filter_shape[1] != 1:
+    raise ValueError("filter_shape must have shape 1 along spatial dimensions.")
+  if not is_data_format_channel_last(data_format):
+    raise ValueError("data_format must be channels last.")
+  with ops.name_scope(name, "extract_pointwise_conv2d_patches",
+                      [inputs, filter_shape]):
+    ksizes = [1, 1, 1, 1]  # Spatial shape is 1x1.
+    strides = [1, 1, 1, 1]  # Operate on all pixels.
+    rates = [1, 1, 1, 1]  # Dilation has no meaning with spatial shape = 1.
+    padding = "VALID"  # Doesn't matter.
+    result = array_ops.extract_image_patches(inputs, ksizes, strides, rates,
+                                             padding)
+
+    batch_size, input_height, input_width, in_channels = inputs.shape.as_list()
+    filter_height, filter_width, in_channels, _ = filter_shape
+    return array_ops.reshape(result, [
+        batch_size, input_height, input_width, filter_height, filter_width,
+        in_channels
+    ])
+
+
+def is_data_format_channel_last(data_format):
+  """True if data_format puts channel last."""
+  if data_format is None:
+    return True
+  return data_format.endswith("C")
+
+
 def matmul_sparse_dense(A, B, name=None):  # pylint: disable=invalid-name
   """Computes matmul(A, B) where A is sparse, B is dense.
 
@@ -482,5 +605,87 @@ def matmul_diag_sparse(A_diag, B, name=None):  # pylint: disable=invalid-name
     a = array_ops.reshape(a, list(a.shape) + [1] * (B.values.shape.ndims - 1))
     return ops.IndexedSlices(a * B.values, B.indices, dense_shape=B.dense_shape)
 
+
+class PartitionedTensor(object):
+  """A Tensor partitioned across its 0-th dimension."""
+
+  def __init__(self, tensors):
+    """Initializes PartitionedTensor.
+
+    Args:
+      tensors: List of Tensors. All Tensors must agree on shape (excepting
+        batch dimension) and dtype.
+
+    Raises:
+      ValueError: If 'tensors' has length zero.
+      ValueError: if contents of 'tensors' don't agree on shape or dtype.
+    """
+    if not tensors:
+      raise ValueError("tensors must be a list of 1+ Tensors.")
+
+    dtype = tensors[0].dtype
+    if not all(tensor.dtype == dtype for tensor in tensors):
+      raise ValueError("all tensors must have dtype = %s." % dtype)
+
+    shape = tensors[0].shape[1:]
+    if not all(tensor.shape[1:] == shape for tensor in tensors):
+      raise ValueError("All tensors must have shape = %s (excluding batch "
+                       "dimension)." % shape)
+
+    self.tensors = tensors
+    self._concats = {}  # {device: Tensor}
+
+  @property
+  def shape(self):
+    feature_shape = self.tensors[0].shape[1:]
+    batch_size = sum([tensor.shape[0] for tensor in self.tensors],
+                     tensor_shape.Dimension(0))
+    return tensor_shape.TensorShape([batch_size]).concatenate(feature_shape)
+
+  def get_shape(self):
+    return self.shape
+
+  @property
+  def dtype(self):
+    return self.tensors[0].dtype
+
+  def devices(self):
+    return set(tensor.device for tensor in self.tensors)
+
+  def __str__(self):
+    return "PartitionedTensor([%s, ...], dtype=%s, shape=%s)" % (
+        self.tensors[0].name, self.dtype.name, tuple(self.shape.as_list()))
+
+  def __hash__(self):
+    return hash(tuple(self.tensors))
+
+  def __eq__(self, other):
+    if not isinstance(other, PartitionedTensor):
+      return False
+    return self.tensors == other.tensors
+
+  def __ne__(self, other):
+    return not self == other  # pylint: disable=g-comparison-negation
+
+  def __getitem__(self, key):
+    return self.as_tensor()[key]
+
+  def as_tensor(self, dtype=None, name=None, as_ref=False):
+    with ops.name_scope(name, "PartitionedTensor.as_tensor", self.tensors):
+      assert not as_ref
+      assert dtype in [None, self.dtype]
+      result = array_ops.concat(self.tensors, axis=0)
+
+      # Cache 'result' if we haven't already cached a value for this device.
+      if result.device not in self._concats:
+        self._concats[result.device] = result
+      return self._concats[result.device]
+
+
+ops.register_tensor_conversion_function(
+    PartitionedTensor,
+    lambda val, dtype, name, as_ref: val.as_tensor(dtype, name, as_ref))
+
+
 # TODO(b/69623235): Add a function for finding tensors that share gradients
 # to eliminate redundant fisher factor computations.
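
Review note on `extract_convolution_patches` above: the function reshapes an identity matrix into a convolution filter so that each output channel copies exactly one (filter offset, input channel) pair. The following is a minimal pure-Python sketch of that idea for a single 1-D example with VALID padding and stride 1. The helper names (`identity_filters`, `conv1d_valid`, `extract_patches_1d`) are hypothetical and are not part of the patch; the real code delegates to `nn_ops.convolution` and handles batches, strides, dilation, and padding.

```python
def identity_filters(filter_width, in_channels):
    """Builds filters[i][c][o] == 1 iff o == i * in_channels + c.

    This is an identity matrix of size filter_width * in_channels,
    reshaped to [filter_width, in_channels, out_channels].
    """
    out_channels = filter_width * in_channels
    return [[[1.0 if o == i * in_channels + c else 0.0
              for o in range(out_channels)]
             for c in range(in_channels)]
            for i in range(filter_width)]


def conv1d_valid(inputs, filters):
    """Plain 1-D cross-correlation with VALID padding and stride 1.

    inputs:  [width][in_channels]
    filters: [filter_width][in_channels][out_channels]
    """
    filter_width = len(filters)
    in_channels = len(filters[0])
    out_channels = len(filters[0][0])
    out_width = len(inputs) - filter_width + 1
    return [[sum(inputs[p + i][c] * filters[i][c][o]
                 for i in range(filter_width)
                 for c in range(in_channels))
             for o in range(out_channels)]
            for p in range(out_width)]


def extract_patches_1d(inputs, filter_width):
    """Returns patches of shape [out_width][filter_width][in_channels]."""
    in_channels = len(inputs[0])
    flat = conv1d_valid(inputs, identity_filters(filter_width, in_channels))
    # Split each output row of length filter_width * in_channels back into
    # [filter_width][in_channels].
    return [[row[i * in_channels:(i + 1) * in_channels]
             for i in range(filter_width)]
            for row in flat]


signal = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
patches = extract_patches_1d(signal, filter_width=3)
```

Each entry of `patches` is the window of `signal` that a width-3 filter would see at that output position, which is the property the K-FAC convolution factors rely on.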
diff --git a/tensorflow/contrib/kfac/python/ops/utils_lib.py b/tensorflow/contrib/kfac/python/ops/utils_lib.py
index 8e424a794691484fdea7d8481677aa641c433d4c..330d222dbf70fcfa02ffd47261c0513d9dd6e0e9 100644
--- a/tensorflow/contrib/kfac/python/ops/utils_lib.py
+++ b/tensorflow/contrib/kfac/python/ops/utils_lib.py
@@ -40,6 +40,9 @@ _allowed_symbols = [
     "fwd_gradients",
     "ensure_sequence",
     "batch_execute",
+    "extract_convolution_patches",
+    "extract_pointwise_conv2d_patches",
+    "is_data_format_channel_last",
     "matmul_sparse_dense",
     "matmul_diag_sparse",
 ]
diff --git a/tensorflow/contrib/layers/python/layers/embedding_ops.py b/tensorflow/contrib/layers/python/layers/embedding_ops.py
index b62e3050cd7003f1ba72061b133ff9b5d6b616da..ffa208540dae975cb139ad6d76dcf392678ba0ee 100644
--- a/tensorflow/contrib/layers/python/layers/embedding_ops.py
+++ b/tensorflow/contrib/layers/python/layers/embedding_ops.py
@@ -470,7 +470,7 @@ def embedding_lookup_unique(params, ids, name=None):
     ids = ops.convert_to_tensor(ids)
     shape = array_ops.shape(ids)
     ids_flat = array_ops.reshape(
-        ids, math_ops.reduce_prod(shape, keep_dims=True))
+        ids, math_ops.reduce_prod(shape, keepdims=True))
     unique_ids, idx = array_ops.unique(ids_flat)
     unique_embeddings = embedding_ops.embedding_lookup(params, unique_ids)
     embeds_flat = array_ops.gather(unique_embeddings, idx)
diff --git a/tensorflow/contrib/layers/python/layers/encoders.py b/tensorflow/contrib/layers/python/layers/encoders.py
index 89c9d37bd09cb6c43eebb91f3a16600eae9cb490..f42112206d0db9d2e42bd4cff19f6a6533951d46 100644
--- a/tensorflow/contrib/layers/python/layers/encoders.py
+++ b/tensorflow/contrib/layers/python/layers/encoders.py
@@ -125,7 +125,7 @@ def embed_sequence(ids,
       `reuse` is `None` or `False`.
   """
   if not (reuse or (vocab_size and embed_dim)):
-    raise ValueError('Must specify vocab size and embedding dimension when not'
+    raise ValueError('Must specify vocab size and embedding dimension when not '
                      'reusing. Got vocab_size=%s and embed_dim=%s' % (
                          vocab_size, embed_dim))
   with variable_scope.variable_scope(
diff --git a/tensorflow/contrib/layers/python/layers/layers.py b/tensorflow/contrib/layers/python/layers/layers.py
index 80cbe68870808328b387e2044fe236af5a5e39f8..350bcb3bca11b4cad18ce863ab1496076477aa3c 100644
--- a/tensorflow/contrib/layers/python/layers/layers.py
+++ b/tensorflow/contrib/layers/python/layers/layers.py
@@ -2747,7 +2747,7 @@ def softmax(logits, scope=None):
     logits_2d = array_ops.reshape(logits, [-1, num_logits])
     predictions = nn.softmax(logits_2d)
     predictions = array_ops.reshape(predictions, array_ops.shape(logits))
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       predictions.set_shape(logits.get_shape())
     return predictions
 
diff --git a/tensorflow/contrib/learn/BUILD b/tensorflow/contrib/learn/BUILD
index abf6e393bb0fbbce4e43f6d209e9b30517df36c3..16f80a876fac5e19bb8ce13074759c704c113947 100644
--- a/tensorflow/contrib/learn/BUILD
+++ b/tensorflow/contrib/learn/BUILD
@@ -5,6 +5,8 @@ licenses(["notice"])  # Apache 2.0
 
 exports_files(["LICENSE"])
 
+load("//tensorflow:tensorflow.bzl", "py_test")
+
 package(default_visibility = [
     "//engedu/ml/tf_from_scratch:__pkg__",
     "//tensorflow:internal",
@@ -224,6 +226,7 @@ py_test(
     size = "small",
     srcs = ["python/learn/monitors_test.py"],
     srcs_version = "PY2AND3",
+    tags = ["no_pip_gpu"],  # b/74437598
     deps = [
         ":learn",
         "//tensorflow/contrib/framework:framework_py",
@@ -426,6 +429,10 @@ py_test(
     size = "medium",
     srcs = ["python/learn/estimators/kmeans_test.py"],
     srcs_version = "PY2AND3",
+    tags = [
+        "noasan",  # b/73741358
+        "nomac",
+    ],
     deps = [
         ":learn",
         "//tensorflow/python:array_ops",
diff --git a/tensorflow/contrib/learn/README.md b/tensorflow/contrib/learn/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..d516bffc5e0327a3400068b35de5503e5a925a54
--- /dev/null
+++ b/tensorflow/contrib/learn/README.md
@@ -0,0 +1,143 @@
+EVERYTHING IN THIS DIRECTORY IS DEPRECATED.
+
+Using functions or classes will result in warnings.
+
+Instructions for converting to current alternatives are included in the
+warnings. A high-level overview is below.
+
+## Canned Estimators
+
+Many canned estimators (subclasses of `Estimator`) have equivalents in core:
+`DNNClassifier`, `DNNRegressor`, `DNNEstimator`, `LinearClassifier`,
+`LinearRegressor`, `DNNLinearCombinedClassifier` and
+`DNNLinearCombinedRegressor`. They are exposed under `tf.estimator`.
+`DNNEstimator`, `LinearEstimator` and `DNNLinearCombinedEstimator`
+are exposed under `tf.contrib.estimator`.
+
+To migrate to the new API, users need to take the following steps:
+
+* Replace `tf.contrib.learn` with `tf.estimator`.
+* If you subclass any of the estimators, stop doing that. You should be able to
+  write a factory method that returns a canned estimator instead. If this is not
+  possible (if you override methods from the canned estimator), consider writing
+  a custom estimator instead. See `tf.estimator.Estimator`.
+* Set `loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE` to preserve loss
+  reduction as the average over batch.
+* Some optimizer-related arguments are no longer passed in the estimator
+  constructor. Instead, we provide methods that perform the same job by wrapping
+  an optimizer. Specifically:
+  *  `gradient_clip_norm`: Use `tf.contrib.estimator.clip_gradients_by_norm`
+  *  `embedding_lr_multipliers`: Not supported.
+  Other arguments:
+  * `input_layer_min_slice_size`: Replaced by `input_layer_partitioner`
+  * `enable_centered_bias`: Not supported. Dropping this argument is unlikely to
+    harm your model.
+  * `feature_engineering_fn`: Not supported. You can call your
+    `feature_engineering_fn` inside your input_fn:
+    ```python
+    def new_input_fn():
+      features, labels = old_input_fn()
+      return feature_engineering_fn(features, labels)
+    ```
+* Use `tf.reshape` to reshape labels in your `input_fn`. `tf.estimator`
+  classifiers and regressors expect labels as a 2D Tensor of shape
+  `[batch_size, 1]`, or `[batch_size, n_labels]`. In contrast,
+  `tf.contrib.learn` classifiers and regressors supported labels with shape
+  `[batch_size]`.
+* If you pass custom metrics from the `evaluate()` method call, use
+  `tf.contrib.estimator.add_metrics`.
+* Replace your `serving_input_fn` with a `serving_input_receiver_fn`.
+  Note this should be entirely distinct from your training `input_fn`, so if you
+  previously had one `input_fn` with different "modes", you should now factor
+  that apart.  Where the former returned either a simple `(features, labels)`
+  tuple or `InputFnOps`, you should now return a `ServingInputReceiver`.
+  If you were generating your `serving_input_fn` using the
+  `build_parsing_serving_input_fn` helper, you can simply drop in the
+  replacement `build_parsing_serving_input_receiver_fn`.
+
+Some remaining estimators/classes:
+
+* `DynamicRnnEstimator`:  Consider a custom `model_fn`.
+* `KMeansClustering`: Use `tf.contrib.factorization.KMeansClustering`.
+* `LogisticRegressor`: Not supported. Instead, use `binary_classification_head`
+  with a custom `model_fn`, or with `DNNEstimator`.
+* `StateSavingRnnEstimator`: Consider a custom `model_fn`.
+* SVM: Consider a custom `model_fn`.
+* `LinearComposableModel` and `DNNComposableModel`: Not supported.
+  Consider `tf.contrib.estimator.DNNEstimator`, or write a custom `model_fn`.
+* `MetricSpec`: Deprecated. For adding custom metrics to canned Estimators, use
+  `tf.contrib.estimator.add_metrics`.
+
+## Estimator
+`tf.contrib.learn.Estimator` is migrated to `tf.estimator.Estimator`.
+
+To migrate, users need to take the following steps:
+
+* Replace `tf.contrib.learn.Estimator` with `tf.estimator.Estimator`.
+* If you pass a `config` argument to `Estimator`, this must be
+  `tf.estimator.RunConfig`. You may need to edit your code accordingly.
+* Edit your `model_fn` to return `tf.estimator.EstimatorSpec`. Refer to
+  `EstimatorSpec` for documentation of specific fields.
+* If your `model_fn` uses the `mode` argument, use `tf.estimator.ModeKeys`.
+
+Some related classes:
+* `Evaluable`, `Trainable`: Not supported, merged into `tf.estimator.Estimator`.
+* `ExportStrategy`: Replaced by `tf.estimator.Exporter`.
+
+## Head/MultiHead
+These classes are now supported under `tf.contrib.estimator`, e.g.
+`tf.contrib.estimator.multi_class_head` and `tf.contrib.estimator.multi_head`.
+
+Some differences:
+
+* `multi_class_head`: If you use `tf.contrib.learn.multi_class_head` with
+  `n_classes=2`, switch to `tf.contrib.estimator.binary_classification_head`.
+* `loss_only_head`: Not supported.
+* `poisson_regression_head`: Not supported (yet).
+* `binary_svm_head`: Not supported (yet).
+* `no_op_train_fn`: Replace it with `tf.no_op`.
+
+Some arguments are renamed; please refer to the documentation. In addition:
+
+* `loss_fn`: Supported for `multi_label_head`. If you need it for other heads,
+  please open an issue.
+* `metric_class_ids`: Not supported (yet).
+* `enable_centered_bias`: Not supported. Dropping this argument is unlikely to
+  harm your model.
+* `label_name`: Not needed in `tf.estimator`. If you don’t use `multi_head`,
+  drop this argument. If you use `multi_head`, refer to
+  `tf.contrib.estimator.multi_head` documentation.
+
+## Experiment Class - Distributed Training Tooling
+
+Switch to `tf.estimator.train_and_evaluate`. Some differences:
+
+* Most of the constructor arguments, like `train_input_fn`, `eval_input_fn`,
+  should be wrapped into `tf.estimator.TrainSpec` and `tf.estimator.EvalSpec`.
+* Remove the `experiment_fn`. Instead, create the `Estimator`,
+  `train_spec` and `eval_spec`, then call `tf.estimator.train_and_evaluate`
+  directly.
+* Inside `tf.estimator.EvalSpec`, the `exporter` field is the replacement
+  for `export_strategy`. To be precise, `tf.estimator.LatestExporter` is the
+  replacement for `tf.contrib.learn.make_export_strategy`. If you want to export
+  only at the end of training, use `tf.estimator.FinalExporter`.
+* If the `TF_CONFIG` environment variable is constructed manually, please read
+  the `train_and_evaluate` documentation for the new requirements (in
+  particular, the chief node and evaluator node).
+
+## Other Classes and Functions
+
+* `tf.contrib.learn.datasets` is deprecated. We are adding ready-to-use datasets
+  to tensorflow/models. Many smaller datasets are available from other sources,
+  such as scikit-learn. Some Python processing may have to be written, but this
+  is straightforward to implement using the standard modules.
+* `tf.contrib.learn.preprocessing`: Deprecated. The python-only preprocessing
+  functions are not a good fit for TensorFlow. Please use `tf.data`, and
+  consider tensorflow/transform for more complex use cases.
+* `tf.contrib.learn.models`: Not supported, use canned estimators instead.
+* `tf.contrib.learn.monitors`: Implement `SessionRunHook` instead. Hook
+  implementations are in `tf.train`.
+* `tf.contrib.learn.learn_io`: Use the methods in `tf.estimator.inputs`, such as
+  `tf.estimator.inputs.numpy_input_fn`. Some utility functions have no
+  equivalent; we encourage the use of `tf.data`.
+
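
Review note on the label-shape step in the README above: `tf.estimator` classifiers and regressors expect labels of shape `[batch_size, 1]`, while `tf.contrib.learn` accepted `[batch_size]`. The sketch below shows the wrapping pattern with plain Python lists so the shape change is visible without a TensorFlow install. `old_input_fn` and its data are hypothetical stand-ins; in real code the list comprehension would be `tf.reshape(labels, [-1, 1])` applied to a Tensor.

```python
def old_input_fn():
    # Hypothetical stand-in for a tf.contrib.learn-style input_fn.
    features = {"x": [[0.1], [0.2], [0.3]]}
    labels = [0, 1, 1]  # Shape [batch_size].
    return features, labels


def new_input_fn():
    features, labels = old_input_fn()
    # With real Tensors this would be: labels = tf.reshape(labels, [-1, 1]).
    labels = [[label] for label in labels]  # Shape [batch_size, 1].
    return features, labels


features, labels = new_input_fn()
```

This mirrors the `feature_engineering_fn` wrapper shown in the README: the old `input_fn` stays untouched and the adaptation lives in a thin wrapper.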
diff --git a/tensorflow/contrib/learn/__init__.py b/tensorflow/contrib/learn/__init__.py
index 3698af027e38f1063ad829c26eb179734968f813..79bd73faaf1301a2fc4999b64f88d30542577980 100644
--- a/tensorflow/contrib/learn/__init__.py
+++ b/tensorflow/contrib/learn/__init__.py
@@ -13,8 +13,11 @@
 # limitations under the License.
 # ==============================================================================
 
-# TODO(ptucker,ipolosukhin): Improve descriptions.
-"""High level API for learning.
+"""High level API for learning (DEPRECATED).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 See the @{$python/contrib.learn} guide.
 
diff --git a/tensorflow/contrib/learn/python/__init__.py b/tensorflow/contrib/learn/python/__init__.py
index bbebd5ab9792cb937219cf937f08c4d4e6e44a92..df23aeb2c433c2b4392f706730f715246ce01cea 100644
--- a/tensorflow/contrib/learn/python/__init__.py
+++ b/tensorflow/contrib/learn/python/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""High level API for learning with TensorFlow."""
+"""High level API for learning with TensorFlow (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/__init__.py b/tensorflow/contrib/learn/python/learn/__init__.py
index cdc67c77d5fd1df61016835dc75ba44feb458cf9..76e0e8ac8f19026086959f3b197cfd1a81e65a3e 100644
--- a/tensorflow/contrib/learn/python/learn/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""High level API for learning with TensorFlow."""
+"""High level API for learning with TensorFlow (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/basic_session_run_hooks.py b/tensorflow/contrib/learn/python/learn/basic_session_run_hooks.py
index 2284ec46e971731af74f17678fc0d1d3888419e2..fed1c44d1970bf07c808ace817aa9972d7776d88 100644
--- a/tensorflow/contrib/learn/python/learn/basic_session_run_hooks.py
+++ b/tensorflow/contrib/learn/python/learn/basic_session_run_hooks.py
@@ -12,20 +12,47 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Some common SessionRunHook classes."""
+"""Some common SessionRunHook classes (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.training import basic_session_run_hooks
+from tensorflow.python.util.deprecation import deprecated_alias
 
 # pylint: disable=invalid-name
-LoggingTensorHook = basic_session_run_hooks.LoggingTensorHook
-StopAtStepHook = basic_session_run_hooks.StopAtStepHook
-CheckpointSaverHook = basic_session_run_hooks.CheckpointSaverHook
-StepCounterHook = basic_session_run_hooks.StepCounterHook
-NanLossDuringTrainingError = basic_session_run_hooks.NanLossDuringTrainingError
-NanTensorHook = basic_session_run_hooks.NanTensorHook
-SummarySaverHook = basic_session_run_hooks.SummarySaverHook
+LoggingTensorHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.LoggingTensorHook',
+    'tf.train.LoggingTensorHook',
+    basic_session_run_hooks.LoggingTensorHook)
+StopAtStepHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.StopAtStepHook',
+    'tf.train.StopAtStepHook',
+    basic_session_run_hooks.StopAtStepHook)
+CheckpointSaverHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.CheckpointSaverHook',
+    'tf.train.CheckpointSaverHook',
+    basic_session_run_hooks.CheckpointSaverHook)
+StepCounterHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.StepCounterHook',
+    'tf.train.StepCounterHook',
+    basic_session_run_hooks.StepCounterHook)
+NanLossDuringTrainingError = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.NanLossDuringTrainingError',
+    'tf.train.NanLossDuringTrainingError',
+    basic_session_run_hooks.NanLossDuringTrainingError)
+NanTensorHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.NanTensorHook',
+    'tf.train.NanTensorHook',
+    basic_session_run_hooks.NanTensorHook)
+SummarySaverHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.SummarySaverHook',
+    'tf.train.SummarySaverHook',
+    basic_session_run_hooks.SummarySaverHook)
 # pylint: enable=invalid-name
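
Review note on the `deprecated_alias` calls above: each call takes the deprecated name, the replacement name, and the object being aliased. The sketch below shows one way such a wrapper could behave for plain functions. It is an assumption-laden illustration, not TensorFlow's implementation: `make_deprecated_alias`, `new_hook`, and the module names are hypothetical, and this version does not handle classes, which the hooks in this file are.

```python
import functools
import warnings


def make_deprecated_alias(deprecated_name, name, func):
    """Returns a wrapper around `func` that warns on every call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "%s is deprecated; use %s instead." % (deprecated_name, name),
            DeprecationWarning,
            stacklevel=2)
        return func(*args, **kwargs)

    return wrapper


def new_hook(every_n_steps):
    # Hypothetical replacement API.
    return {"every_n_steps": every_n_steps}


# The old name keeps working but points callers at the new one.
old_hook = make_deprecated_alias(
    "old_module.old_hook", "new_module.new_hook", new_hook)
```

Calling `old_hook(10)` returns the same value as `new_hook(10)` and emits a `DeprecationWarning` naming both the old and the new symbol.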
diff --git a/tensorflow/contrib/learn/python/learn/datasets/__init__.py b/tensorflow/contrib/learn/python/learn/datasets/__init__.py
index 7240b0de149051afa045a8113f9e9b212840c311..3c34712ac859d32f549468345950a93d2ed2aa56 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/__init__.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Dataset utilities and synthetic/reference datasets."""
+"""Dataset utilities and synthetic/reference datasets (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -27,6 +32,7 @@ from tensorflow.contrib.learn.python.learn.datasets import base
 from tensorflow.contrib.learn.python.learn.datasets import mnist
 from tensorflow.contrib.learn.python.learn.datasets import synthetic
 from tensorflow.contrib.learn.python.learn.datasets import text_datasets
+from tensorflow.python.util.deprecation import deprecated
 
 # Export load_iris and load_boston.
 load_iris = base.load_iris
@@ -51,6 +57,7 @@ SYNTHETIC = {
 }
 
 
+@deprecated(None, 'Please use tf.data.')
 def load_dataset(name, size='small', test_with_fake_data=False):
   """Loads dataset by name.
 
@@ -73,8 +80,9 @@ def load_dataset(name, size='small', test_with_fake_data=False):
     return DATASETS[name]()
 
 
+@deprecated(None, 'Please use tf.data.')
 def make_dataset(name, n_samples=100, noise=None, seed=42, *args, **kwargs):
-  """Creates binary synthetic datasets
+  """Creates binary synthetic datasets.
 
   Args:
     name: str, name of the dataset to generate
diff --git a/tensorflow/contrib/learn/python/learn/datasets/base.py b/tensorflow/contrib/learn/python/learn/datasets/base.py
index ca720ae5ed26e74da12bd6c5a37231b41442f76f..3b5c9b97c08a388e1f35249967b6cab26861f100 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/base.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/base.py
@@ -12,7 +12,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Base utilities for loading datasets."""
+
+"""Base utilities for loading datasets (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -29,11 +35,14 @@ import numpy as np
 from six.moves import urllib
 
 from tensorflow.python.platform import gfile
+from tensorflow.python.util.deprecation import deprecated
+
 
 Dataset = collections.namedtuple('Dataset', ['data', 'target'])
 Datasets = collections.namedtuple('Datasets', ['train', 'validation', 'test'])
 
 
+@deprecated(None, 'Use tf.data instead.')
 def load_csv_with_header(filename,
                          target_dtype,
                          features_dtype,
@@ -53,6 +62,7 @@ def load_csv_with_header(filename,
   return Dataset(data=data, target=target)
 
 
+@deprecated(None, 'Use tf.data instead.')
 def load_csv_without_header(filename,
                             target_dtype,
                             features_dtype,
@@ -70,6 +80,7 @@ def load_csv_without_header(filename,
   return Dataset(data=data, target=target)
 
 
+@deprecated(None, 'Use tf.data instead.')
 def shrink_csv(filename, ratio):
   """Create a smaller dataset of only 1/ratio of original data."""
   filename_small = filename.replace('.', '_small.')
@@ -84,6 +95,7 @@ def shrink_csv(filename, ratio):
         i += 1
 
 
+@deprecated(None, 'Use sklearn.datasets.')
 def load_iris(data_path=None):
   """Load Iris dataset.
 
@@ -100,6 +112,7 @@ def load_iris(data_path=None):
       data_path, target_dtype=np.int, features_dtype=np.float)
 
 
+@deprecated(None, 'Use sklearn.datasets.')
 def load_boston(data_path=None):
   """Load Boston housing dataset.
 
@@ -116,7 +129,12 @@ def load_boston(data_path=None):
       data_path, target_dtype=np.float, features_dtype=np.float)
 
 
-def retry(initial_delay, max_delay, factor=2.0, jitter=0.25, is_retriable=None):
+@deprecated(None, 'Use the retry module or similar alternatives.')
+def retry(initial_delay,
+          max_delay,
+          factor=2.0,
+          jitter=0.25,
+          is_retriable=None):
   """Simple decorator for wrapping retriable functions.
 
   Args:
@@ -152,7 +170,7 @@ def retry(initial_delay, max_delay, factor=2.0, jitter=0.25, is_retriable=None):
       for delay in delays():
         try:
           return fn(*args, **kwargs)
-        except Exception as e:  # pylint: disable=broad-except)
+        except Exception as e:  # pylint: disable=broad-except
           if is_retriable is None:
             continue
 
@@ -176,11 +194,13 @@ def _is_retriable(e):
   return isinstance(e, IOError) and e.errno in _RETRIABLE_ERRNOS
 
 
+@deprecated(None, 'Please use urllib or similar directly.')
 @retry(initial_delay=1.0, max_delay=16.0, is_retriable=_is_retriable)
 def urlretrieve_with_retry(url, filename=None):
   return urllib.request.urlretrieve(url, filename)
 
 
+@deprecated(None, 'Please write your own downloading logic.')
 def maybe_download(filename, work_directory, source_url):
   """Download the data from source url, unless it's already here.
 
diff --git a/tensorflow/contrib/learn/python/learn/datasets/mnist.py b/tensorflow/contrib/learn/python/learn/datasets/mnist.py
index 37f9175015a239f763c7721cf36ab8063c0a3e32..abbb44c2f5b701829ce16f64eadd8ebc04c84e2c 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/mnist.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/mnist.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Functions for downloading and reading MNIST data."""
+"""Functions for downloading and reading MNIST data (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -27,6 +32,7 @@ from tensorflow.contrib.learn.python.learn.datasets import base
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import random_seed
 from tensorflow.python.platform import gfile
+from tensorflow.python.util.deprecation import deprecated
 
 # CVDF mirror of http://yann.lecun.com/exdb/mnist/
 DEFAULT_SOURCE_URL = 'https://storage.googleapis.com/cvdf-datasets/mnist/'
@@ -37,6 +43,7 @@ def _read32(bytestream):
   return numpy.frombuffer(bytestream.read(4), dtype=dt)[0]
 
 
+@deprecated(None, 'Please use tf.data to implement this functionality.')
 def extract_images(f):
   """Extract the images into a 4D uint8 numpy array [index, y, x, depth].
 
@@ -65,6 +72,7 @@ def extract_images(f):
     return data
 
 
+@deprecated(None, 'Please use tf.one_hot on tensors.')
 def dense_to_one_hot(labels_dense, num_classes):
   """Convert class labels from scalars to one-hot vectors."""
   num_labels = labels_dense.shape[0]
@@ -74,6 +82,7 @@ def dense_to_one_hot(labels_dense, num_classes):
   return labels_one_hot
 
 
+@deprecated(None, 'Please use tf.data to implement this functionality.')
 def extract_labels(f, one_hot=False, num_classes=10):
   """Extract the labels into a 1D uint8 numpy array [index].
 
@@ -103,7 +112,15 @@ def extract_labels(f, one_hot=False, num_classes=10):
 
 
 class DataSet(object):
+  """Container class for a dataset (deprecated).
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Please use alternatives such as official/mnist/dataset.py'
+              ' from tensorflow/models.')
   def __init__(self,
                images,
                labels,
@@ -210,6 +227,8 @@ class DataSet(object):
       return self._images[start:end], self._labels[start:end]
 
 
+@deprecated(None, 'Please use alternatives such as official/mnist/dataset.py'
+            ' from tensorflow/models.')
 def read_data_sets(train_dir,
                    fake_data=False,
                    one_hot=False,
@@ -275,5 +294,7 @@ def read_data_sets(train_dir,
   return base.Datasets(train=train, validation=validation, test=test)
 
 
+@deprecated(None, 'Please use alternatives such as official/mnist/dataset.py'
+            ' from tensorflow/models.')
 def load_mnist(train_dir='MNIST-data'):
   return read_data_sets(train_dir)
diff --git a/tensorflow/contrib/learn/python/learn/datasets/produce_small_datasets.py b/tensorflow/contrib/learn/python/learn/datasets/produce_small_datasets.py
index 6e0ba38941ce4650ede9f7210e284bde2ed8e6a9..a4848fa64a72f031ef35c0c3256e97a7326acd60 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/produce_small_datasets.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/produce_small_datasets.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Produce DBpedia datasets of a smaller size."""
+"""Produce DBpedia datasets of a smaller size (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/datasets/synthetic.py b/tensorflow/contrib/learn/python/learn/datasets/synthetic.py
index 9a843168c27d9cae3f55efe4fe4c688d86c745f3..6a0e3350b3d1052249160a2a997a76de7a5040c3 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/synthetic.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/synthetic.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Synthetic dataset generators."""
+"""Synthetic dataset generators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -21,8 +26,10 @@ from __future__ import print_function
 import numpy as np
 
 from tensorflow.contrib.learn.python.learn.datasets.base import Dataset
+from tensorflow.python.util.deprecation import deprecated
 
 
+@deprecated(None, 'Consider using synthetic datasets from scikit-learn.')
 def circles(n_samples=100,
             noise=None,
             seed=None,
@@ -93,6 +100,7 @@ def circles(n_samples=100,
   return Dataset(data=X[indices], target=y[indices])
 
 
+@deprecated(None, 'Consider using synthetic datasets from scikit-learn.')
 def spirals(n_samples=100,
             noise=None,
             seed=None,
diff --git a/tensorflow/contrib/learn/python/learn/datasets/text_datasets.py b/tensorflow/contrib/learn/python/learn/datasets/text_datasets.py
index 2596a2ecaf1572506504831e8b08fab9b5dbc119..ce9466301728082f8e9d99c90989ba8fe623bcf0 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/text_datasets.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/text_datasets.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Text datasets."""
+"""Text datasets (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -26,10 +31,12 @@ import numpy as np
 
 from tensorflow.contrib.learn.python.learn.datasets import base
 from tensorflow.python.platform import gfile
+from tensorflow.python.util.deprecation import deprecated
 
 DBPEDIA_URL = 'https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz'
 
 
+@deprecated(None, 'See contrib/learn/README.md')
 def maybe_download_dbpedia(data_dir):
   """Download if DBpedia data is not present."""
   train_path = os.path.join(data_dir, 'dbpedia_csv/train.csv')
@@ -41,6 +48,7 @@ def maybe_download_dbpedia(data_dir):
     tfile.extractall(data_dir)
 
 
+@deprecated(None, 'See contrib/learn/README.md')
 def load_dbpedia(size='small', test_with_fake_data=False):
   """Get DBpedia datasets from CSV files."""
   if not test_with_fake_data:
diff --git a/tensorflow/contrib/learn/python/learn/estimators/__init__.py b/tensorflow/contrib/learn/python/learn/estimators/__init__.py
index 4981750c94c7ac31e23b7a3f71ca30e3c9573a20..3e64595f312bcc2a2e8dcba589fb993a249b684b 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""An estimator is a rule for calculating an estimate of a given quantity.
+"""An estimator is a rule for calculating an estimate of a given quantity (deprecated).
+
+These classes are deprecated and replaced with `tf.estimator`.
+
+See [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 # Estimators
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/_sklearn.py b/tensorflow/contrib/learn/python/learn/estimators/_sklearn.py
index 15277415a1ce83dc1d4a334e60fe1933ba244df0..1f0e4663d060a3850e2002b27f809fde1db47e48 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/_sklearn.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/_sklearn.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 # ==============================================================================
 
-"""sklearn cross-support."""
+"""sklearn cross-support (deprecated)."""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -132,6 +132,8 @@ class _TransformerMixin():
 class NotFittedError(ValueError, AttributeError):
   """Exception class to raise if estimator is used before fitting.
 
+  USE OF THIS EXCEPTION IS DEPRECATED.
+
   This class inherits from both ValueError and AttributeError to help with
   exception handling and backward compatibility.
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/composable_model.py b/tensorflow/contrib/learn/python/learn/estimators/composable_model.py
index a02c726c74946d93b8e1726473db746220b00795..1fa58271e2b886cd143683a759145fd750791473 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/composable_model.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/composable_model.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""TensorFlow composable models used as building blocks for estimators."""
+"""TensorFlow composable models used as building blocks for estimators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -34,6 +39,7 @@ from tensorflow.python.ops import nn
 from tensorflow.python.ops import partitioned_variables
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.summary import summary
+from tensorflow.python.util.deprecation import deprecated
 
 
 class _ComposableModel(object):
@@ -46,6 +52,7 @@ class _ComposableModel(object):
   _ComposableModel and its subclasses are not part of the public tf.learn API.
   """
 
+  @deprecated(None, "Please use model_fns in tf.estimator.")
   def __init__(self,
                num_label_columns,
                optimizer,
@@ -141,6 +148,10 @@ class _ComposableModel(object):
 class LinearComposableModel(_ComposableModel):
   """A _ComposableModel that implements linear regression.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Instances of this class can be used to build estimators through the use
   of composition.
   """
@@ -252,6 +263,10 @@ class LinearComposableModel(_ComposableModel):
 class DNNComposableModel(_ComposableModel):
   """A _ComposableModel that implements a DNN.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Instances of this class can be used to build estimators through the use
   of composition.
   """
diff --git a/tensorflow/contrib/learn/python/learn/estimators/constants.py b/tensorflow/contrib/learn/python/learn/estimators/constants.py
index fc69e810244a182b864be856e6720f8584f7aa65..d2548946bc77dea7c452d61c7e2b6e12c3d6239a 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/constants.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/constants.py
@@ -13,9 +13,11 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Constants regarding Estimators.
+"""Constants regarding Estimators (deprecated).
 
-This file is obsoleted in the move of Estimator to core.
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 """
 from __future__ import absolute_import
 from __future__ import division
@@ -25,6 +27,8 @@ from __future__ import print_function
 class ProblemType(object):
   """Enum-like values for the type of problem that the model solves.
 
+  THIS CLASS IS DEPRECATED.
+
   These values are used when exporting the model to produce the appropriate
   signature function for serving.
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/debug.py b/tensorflow/contrib/learn/python/learn/estimators/debug.py
index 9d5f6c2bf969d7c85d251bf1b06a0307a41b2297..24b067b7e38b12df3d1d0c49f626344217218571 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/debug.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/debug.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Debug estimators.
+"""Debug estimators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 Debug estimators are bias-only estimators that can be used for debugging
 and as simple baselines.
@@ -118,6 +122,10 @@ def debug_model_fn(features, labels, mode, params, config=None):
 class DebugClassifier(estimator.Estimator):
   """A classifier for TensorFlow Debug models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Example:
 
   ```python
@@ -237,6 +245,10 @@ class DebugClassifier(estimator.Estimator):
 class DebugRegressor(estimator.Estimator):
   """A regressor for TensorFlow Debug models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Example:
 
   ```python
diff --git a/tensorflow/contrib/learn/python/learn/estimators/dnn.py b/tensorflow/contrib/learn/python/learn/estimators/dnn.py
index c17b41c0f767e19d9c3635a8f60347a49b297cfb..eabebb7e881558471c343c0573cc9a8f4a425312 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/dnn.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/dnn.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Deep Neural Network estimators."""
+"""Deep Neural Network estimators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -212,6 +217,10 @@ def _dnn_model_fn(features, labels, mode, params, config=None):
 class DNNClassifier(estimator.Estimator):
   """A classifier for TensorFlow DNN models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Example:
 
   ```python
@@ -521,6 +530,10 @@ class DNNClassifier(estimator.Estimator):
 class DNNRegressor(estimator.Estimator):
   """A regressor for TensorFlow DNN models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Example:
 
   ```python
@@ -796,6 +809,10 @@ class DNNRegressor(estimator.Estimator):
 class DNNEstimator(estimator.Estimator):
   """A Estimator for TensorFlow DNN models with user specified _Head.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Example:
 
   ```python
diff --git a/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py b/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py
index 726612235050def6e7addb503cc6646a25de0e42..3d85533d92d17095bae9a69f229171e1bf61ba10 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorFlow estimators for Linear and DNN joined training models."""
+"""TensorFlow estimators for Linear and DNN joined training models (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -372,6 +377,10 @@ def _dnn_linear_combined_model_fn(features, labels, mode, params, config=None):
 class DNNLinearCombinedEstimator(estimator.Estimator):
   """An estimator for TensorFlow Linear and DNN joined training models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Note: New users must set `fix_global_step_increment_bug=True` when creating an
   estimator.
 
@@ -490,6 +499,10 @@ class DNNLinearCombinedEstimator(estimator.Estimator):
 class DNNLinearCombinedClassifier(estimator.Estimator):
   """A classifier for TensorFlow Linear and DNN joined training models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Note: New users must set `fix_global_step_increment_bug=True` when creating an
   estimator.
 
@@ -832,6 +845,10 @@ class DNNLinearCombinedClassifier(estimator.Estimator):
 class DNNLinearCombinedRegressor(estimator.Estimator):
   """A regressor for TensorFlow Linear and DNN joined training models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Note: New users must set `fix_global_step_increment_bug=True` when creating an
   estimator.
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/dynamic_rnn_estimator.py b/tensorflow/contrib/learn/python/learn/estimators/dynamic_rnn_estimator.py
index 69440e823ef1ed2d739f28bc14587891f2de80bb..a703dc66e922d48ceb64edc2a979061b8e45db49 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/dynamic_rnn_estimator.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/dynamic_rnn_estimator.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Estimator for Dynamic RNNs."""
+"""Estimator for Dynamic RNNs (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -540,6 +545,12 @@ def _get_dynamic_rnn_model_fn(
 
 
 class DynamicRnnEstimator(estimator.Estimator):
+  """Dynamically unrolled RNN (deprecated).
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   def __init__(self,
                problem_type,
diff --git a/tensorflow/contrib/learn/python/learn/estimators/estimator.py b/tensorflow/contrib/learn/python/learn/estimators/estimator.py
index 4b63e08ab3372849309ee5d28d754de82e9632f4..7a026a15e4aeea0dde4ed9f7de053a757a0abb58 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/estimator.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/estimator.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Base Estimator class."""
+"""Base Estimator class (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -138,6 +143,7 @@ def _get_input_fn(x, y, input_fn, feed_fn, batch_size, shuffle=False, epochs=1):
   return df.input_builder, df.get_feed_dict_fn()
 
 
+@deprecated(None, 'Please specify feature columns explicitly.')
 def infer_real_valued_columns_from_input_fn(input_fn):
   """Creates `FeatureColumn` objects for inputs defined by `input_fn`.
 
@@ -158,6 +164,7 @@ def infer_real_valued_columns_from_input_fn(input_fn):
     return layers.infer_real_valued_columns(features)
 
 
+@deprecated(None, 'Please specify feature columns explicitly.')
 def infer_real_valued_columns_from_input(x):
   """Creates `FeatureColumn` objects for inputs defined by input `x`.
 
@@ -389,6 +396,10 @@ class BaseEstimator(sklearn.BaseEstimator, evaluable.Evaluable,
                     trainable.Trainable):
   """Abstract BaseEstimator class to train and evaluate TensorFlow models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Users should not instantiate or subclass this class. Instead, use an
   `Estimator`.
   """
@@ -399,6 +410,8 @@ class BaseEstimator(sklearn.BaseEstimator, evaluable.Evaluable,
   # TODO(wicke): Remove this once launcher takes over config functionality
   _Config = run_config.RunConfig  # pylint: disable=invalid-name
 
+  @deprecated(None, 'Please replace uses of any Estimator from tf.contrib.learn'
+              ' with an Estimator from tf.estimator.*')
   def __init__(self, model_dir=None, config=None):
     """Initializes a BaseEstimator instance.
 
@@ -457,6 +470,20 @@ class BaseEstimator(sklearn.BaseEstimator, evaluable.Evaluable,
     # TODO(wicke): make RunConfig immutable, and then return it without a copy.
     return copy.deepcopy(self._config)
 
+  @property
+  def model_fn(self):
+    """Returns the model_fn which is bound to self.params.
+
+    Returns:
+      The model_fn with the following signature:
+        `def model_fn(features, labels, mode, config)`
+    """
+
+    def public_model_fn(features, labels, mode, config):
+      return self._call_model_fn(features, labels, mode, config=config)
+
+    return public_model_fn
+
   @deprecated_args(SCIKIT_DECOUPLE_DATE, SCIKIT_DECOUPLE_INSTRUCTIONS,
                    ('x', None), ('y', None), ('batch_size', None))
   def fit(self,
@@ -890,8 +917,8 @@ class BaseEstimator(sklearn.BaseEstimator, evaluable.Evaluable,
       if feed_fn:
         hooks.append(basic_session_run_hooks.FeedFnHook(feed_fn))
       if steps == 0:
-        logging.warning('evaluation steps are 0. If `input_fn` does not raise'
-                        'OutOfRangeError`, the evaluation will never stop.'
+        logging.warning('evaluation steps are 0. If `input_fn` does not raise '
+                        '`OutOfRangeError`, the evaluation will never stop. '
                         'Use steps=None if intended.')
       if steps:
         hooks.append(
@@ -1074,6 +1101,10 @@ def _identity_feature_engineering_fn(features, labels):
 
 class Estimator(BaseEstimator):
   """Estimator class is the basic TensorFlow model trainer/evaluator.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
   """
 
   def __init__(self,
@@ -1162,7 +1193,7 @@ class Estimator(BaseEstimator):
     self._feature_engineering_fn = (
         feature_engineering_fn or _identity_feature_engineering_fn)
 
-  def _call_model_fn(self, features, labels, mode, metrics=None):
+  def _call_model_fn(self, features, labels, mode, metrics=None, config=None):
     """Calls model function with support of 2, 3 or 4 arguments.
 
     Args:
@@ -1170,6 +1201,7 @@ class Estimator(BaseEstimator):
       labels: labels dict.
       mode: ModeKeys
       metrics: Dict of metrics.
+      config: RunConfig.
 
     Returns:
       A `ModelFnOps` object. If model_fn returns a tuple, wraps them up in a
@@ -1186,7 +1218,10 @@ class Estimator(BaseEstimator):
     if 'params' in model_fn_args:
       kwargs['params'] = self.params
     if 'config' in model_fn_args:
-      kwargs['config'] = self.config
+      if config:
+        kwargs['config'] = config
+      else:
+        kwargs['config'] = self.config
     if 'model_dir' in model_fn_args:
       kwargs['model_dir'] = self.model_dir
     model_fn_results = self._model_fn(features, labels, **kwargs)
@@ -1458,8 +1493,14 @@ class Estimator(BaseEstimator):
 # For time of deprecation x,y from Estimator allow direct access.
 # pylint: disable=protected-access
 class SKCompat(sklearn.BaseEstimator):
-  """Scikit learn wrapper for TensorFlow Learn Estimator."""
+  """Scikit learn wrapper for TensorFlow Learn Estimator.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Please switch to the Estimator interface.')
   def __init__(self, estimator):
     self._estimator = estimator
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/estimator_test_utils.py b/tensorflow/contrib/learn/python/learn/estimators/estimator_test_utils.py
index fd47710e3015de9ae6a453f98978b0ef8f88968c..e4c31396baf8271c49395a2b87b454dbc77177e2 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/estimator_test_utils.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/estimator_test_utils.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Utils for Estimator."""
+"""Utils for Estimator (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/estimators/head.py b/tensorflow/contrib/learn/python/learn/estimators/head.py
index 9b124b2c19f16bbc9b2afeadb82a32006e1a0ae9..2b4b6eff39f4fc8a20a149edfc07d2f4f27a9bae 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/head.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/head.py
@@ -12,8 +12,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Abstractions for the head(s) of a model.
+"""Abstractions for the head(s) of a model (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 """
+
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
@@ -47,11 +52,16 @@ from tensorflow.python.summary import summary
 from tensorflow.python.training import training
 from tensorflow.python.util import tf_decorator
 from tensorflow.python.util import tf_inspect
+from tensorflow.python.util.deprecation import deprecated
 
 
 class Head(object):
   """Interface for the head/top of a model.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Given logits (or output of a hidden layer), a Head knows how to compute
   predictions, loss, default metric and export signature. It is meant to,
 
@@ -177,6 +187,7 @@ class Head(object):
     raise NotImplementedError("Calling an abstract method.")
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def regression_head(label_name=None,
                     weight_column_name=None,
                     label_dimension=1,
@@ -216,6 +227,7 @@ def regression_head(label_name=None,
       link_fn=(link_fn if link_fn is not None else array_ops.identity))
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def poisson_regression_head(label_name=None,
                             weight_column_name=None,
                             label_dimension=1,
@@ -254,6 +266,7 @@ def poisson_regression_head(label_name=None,
 # TODO(zakaria): Consider adding a _RegressionHead for logistic_regression
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def multi_class_head(n_classes,
                      label_name=None,
                      weight_column_name=None,
@@ -335,6 +348,7 @@ def multi_class_head(n_classes,
       label_keys=label_keys)
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def binary_svm_head(
     label_name=None,
     weight_column_name=None,
@@ -370,6 +384,7 @@ def binary_svm_head(
       thresholds=thresholds)
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def multi_label_head(n_classes,
                      label_name=None,
                      weight_column_name=None,
@@ -430,6 +445,7 @@ def multi_label_head(n_classes,
       loss_fn=_wrap_custom_loss_fn(loss_fn) if loss_fn else None)
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def loss_only_head(loss_fn, head_name=None):
   """Creates a Head that contains only loss terms.
 
@@ -447,6 +463,7 @@ def loss_only_head(loss_fn, head_name=None):
   return _LossOnlyHead(loss_fn, head_name=head_name)
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def multi_head(heads, loss_weights=None):
   """Creates a MultiHead stemming from same logits/hidden layer.
 
@@ -479,6 +496,7 @@ def multi_head(heads, loss_weights=None):
   return _MultiHead(heads, loss_merger=_weighted_loss_merger)
 
 
+@deprecated(None, "Use 'lambda _: tf.no_op()'.")
 def no_op_train_fn(loss):
   del loss
   return control_flow_ops.no_op()
diff --git a/tensorflow/contrib/learn/python/learn/estimators/head_test.py b/tensorflow/contrib/learn/python/learn/estimators/head_test.py
index 6d5da81b4c2087fb9c5307902e452a6220a17cd0..7c2d9bb0767cb979dae9c84b5342d129225677ed 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/head_test.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/head_test.py
@@ -362,7 +362,7 @@ class MultiLabelHeadTest(test.TestCase):
         "auc_precision_recall": 0.166667,
         "auc_precision_recall/class0": 0,
         "auc_precision_recall/class1": 0.,
-        "auc_precision_recall/class2": 0.49999,
+        "auc_precision_recall/class2": 1.,
         "labels/actual_label_mean/class0": self._labels[0][0],
         "labels/actual_label_mean/class1": self._labels[0][1],
         "labels/actual_label_mean/class2": self._labels[0][2],
@@ -748,7 +748,7 @@ class BinaryClassificationHeadTest(test.TestCase):
         "accuracy/baseline_label_mean": label_mean,
         "accuracy/threshold_0.500000_mean": 1. / 2,
         "auc": 1. / 2,
-        "auc_precision_recall": 0.25,
+        "auc_precision_recall": 0.749999,
         "labels/actual_label_mean": label_mean,
         "labels/prediction_mean": .731059,  # softmax
         "loss": expected_loss,
diff --git a/tensorflow/contrib/learn/python/learn/estimators/kmeans.py b/tensorflow/contrib/learn/python/learn/estimators/kmeans.py
index 8f9d6fc318a357853bdb8e3264f6691b410006b1..66ebcfd1d81904b9afe5be6bd1a648fe325e1e0b 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/kmeans.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/kmeans.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Implementation of k-means clustering on top of `Estimator` API.
+"""Implementation of k-means clustering on top of `Estimator` API (deprecated).
 
 This module is deprecated. Please use
 @{tf.contrib.factorization.KMeansClustering} instead of
@@ -153,7 +153,12 @@ def _kmeans_clustering_model_fn(features, labels, mode, params, config):
 
 # TODO(agarwal,ands): support sharded input.
 class KMeansClustering(estimator.Estimator):
-  """An Estimator for K-Means clustering."""
+  """An Estimator for K-Means clustering.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
   SQUARED_EUCLIDEAN_DISTANCE = clustering_ops.SQUARED_EUCLIDEAN_DISTANCE
   COSINE_DISTANCE = clustering_ops.COSINE_DISTANCE
   RANDOM_INIT = clustering_ops.RANDOM_INIT
diff --git a/tensorflow/contrib/learn/python/learn/estimators/linear.py b/tensorflow/contrib/learn/python/learn/estimators/linear.py
index 37aa8b339622415d082933cdf66d2472a4119b48..64d7ecc68e7abb1d36a3eb098fedd8184d6e9d77 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/linear.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/linear.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Linear Estimators."""
+"""Linear Estimators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -305,6 +310,10 @@ class _SdcaUpdateWeightsHook(session_run_hook.SessionRunHook):
 class LinearClassifier(estimator.Estimator):
   """Linear classifier model.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Train a linear model to classify instances into one of multiple possible
   classes. When number of possible classes is 2, this is binary classification.
 
@@ -625,6 +634,10 @@ class LinearClassifier(estimator.Estimator):
 class LinearRegressor(estimator.Estimator):
   """Linear regressor model.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Train a linear regression model to predict label value given observation of
   feature values.
 
@@ -860,6 +873,10 @@ class LinearRegressor(estimator.Estimator):
 class LinearEstimator(estimator.Estimator):
   """Linear model with user specified head.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Train a generalized linear model to predict label value given observation of
   feature values.
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/logistic_regressor.py b/tensorflow/contrib/learn/python/learn/estimators/logistic_regressor.py
index fb339160d58e09d4ffd50090f2dbbcec08bebe47..3cbcc6e98de1c915c302617e4591c9baa33adeaf 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/logistic_regressor.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/logistic_regressor.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Logistic regression (aka binary classifier) class.
+"""Logistic regression (aka binary classifier) class (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 This defines some useful basic metrics for using logistic regression to classify
 a binary event (0 vs 1).
@@ -75,6 +79,10 @@ def LogisticRegressor(  # pylint: disable=invalid-name
     feature_engineering_fn=None):
   """Builds a logistic regression Estimator for binary classification.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   This method provides a basic Estimator with some additional metrics for custom
   binary classification models, including AUC, precision/recall and accuracy.
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/metric_key.py b/tensorflow/contrib/learn/python/learn/estimators/metric_key.py
index 99388f116b345bd038f2985606c6203011597ea2..f264248e44d9aa48f26ee32e36746bd4c3145a8d 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/metric_key.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/metric_key.py
@@ -12,14 +12,20 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Enum for metric keys."""
+"""Enum for metric keys (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 
 class MetricKey(object):
-  """Metric key strings."""
+  """Metric key strings (deprecated)."""
+
   LOSS = "loss"
   AUC = "auc"
   AUC_PR = "auc_precision_recall"
diff --git a/tensorflow/contrib/learn/python/learn/estimators/model_fn.py b/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
index 44e6c7c52dac524a22e9099e33e2aef82f8fe7ba..dcb161180c99ce71195c820217e8bdaf79d70901 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Classes and methods related to model_fn."""
+"""Classes and methods related to model_fn (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -37,10 +42,13 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.saved_model import signature_constants
 from tensorflow.python.training import session_run_hook
+from tensorflow.python.util.deprecation import deprecated
 
 
 class ModeKeys(object):
-  """Standard names for model modes.
+  """Standard names for model modes (deprecated).
+
+  THIS CLASS IS DEPRECATED.
 
   The following standard keys are defined:
 
@@ -65,8 +73,16 @@ class ModelFnOps(
         'output_alternatives', 'training_chief_hooks', 'training_hooks',
         'scaffold', 'mode'
     ])):
-  """Ops returned from a model_fn."""
+  """Ops returned from a model_fn.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'When switching to tf.estimator.Estimator, use '
+              'tf.estimator.EstimatorSpec. You can use the `estimator_spec`'
+              ' method to create an equivalent one.')
   def __new__(cls,
               mode,
               predictions=None,
diff --git a/tensorflow/contrib/learn/python/learn/estimators/prediction_key.py b/tensorflow/contrib/learn/python/learn/estimators/prediction_key.py
index f8d87b8914307a86eb2f46123a28ff11eb925eda..6fd2fc9d592cef4e44a640e2f27cb28b367d44d5 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/prediction_key.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/prediction_key.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Enum for model prediction keys.
+"""Enum for model prediction keys (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 This file is obsoleted in the move of Estimator to core.
 """
@@ -22,6 +26,8 @@ from __future__ import print_function
 
 
 class PredictionKey(object):
+  """THIS CLASS IS DEPRECATED."""
+
   CLASSES = "classes"
   PROBABILITIES = "probabilities"
   LOGITS = "logits"
diff --git a/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py b/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py
index 2752bc2d90ee0f51b2c40ccc4d24a4eb21cff38f..215022e5d9e5d3cd5d6a96583b325b19a1719568 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Common operations for RNN Estimators."""
+"""Common operations for RNN Estimators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/estimators/run_config.py b/tensorflow/contrib/learn/python/learn/estimators/run_config.py
index fd90fd1cc6277e7d80287aefdbab6134dac7c0d5..1d161093de01ef838d0c75ec9a39574c7529bd57 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/run_config.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/run_config.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Run Config."""
+"""Run Config (deprecated, use tf.estimator.RunConfig instead).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -29,11 +34,12 @@ from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.estimator import run_config as core_run_config
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import server_lib
+from tensorflow.python.util.deprecation import deprecated
 
 
 # A list of the property names in RunConfig user allows to change. They will
 # not affect the execution framework, so when execution framework checks the
-# `uid` of the RunConfig, it should be ingored.
+# `uid` of the RunConfig, it should be ignored.
 _DEFAULT_UID_WHITE_LIST = [
     'tf_random_seed',
     'save_summary_steps',
@@ -47,6 +53,7 @@ _DEFAULT_UID_WHITE_LIST = [
 
 
 class Environment(object):
+  """DEPRECATED CLASS."""
   # For running general distributed training.
   CLOUD = 'cloud'
   # For running Google-internal distributed training.
@@ -56,6 +63,7 @@ class Environment(object):
 
 
 class TaskType(object):
+  """DEPRECATED CLASS."""
   MASTER = 'master'
   PS = 'ps'
   WORKER = 'worker'
@@ -64,6 +72,8 @@ class TaskType(object):
 class ClusterConfig(object):
   """This class specifies the configurations for a distributed run.
 
+  THIS CLASS IS DEPRECATED. Use tf.estimator.RunConfig instead.
+
   If you're using an `Estimator`, you should probably use the subclass
   RunConfig instead.
   """
@@ -211,10 +221,13 @@ class ClusterConfig(object):
 class RunConfig(ClusterConfig, core_run_config.RunConfig):
   """This class specifies the configurations for an `Estimator` run.
 
-  This class is the implementation of @{tf.estimator.RunConfig} interface.
+  This class is a deprecated implementation of the @{tf.estimator.RunConfig}
+  interface.
   """
   _USE_DEFAULT = 0
 
+  @deprecated(None, 'When switching to tf.estimator.Estimator, use'
+              ' tf.estimator.RunConfig instead.')
   def __init__(self,
                master=None,
                num_cores=0,
diff --git a/tensorflow/contrib/learn/python/learn/estimators/state_saving_rnn_estimator.py b/tensorflow/contrib/learn/python/learn/estimators/state_saving_rnn_estimator.py
index 0cea35e219a4457417a161a3ac4ac4292fd24f53..de78c72c3ae3ef14f5f7c46b1d47f82e8266c7c6 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/state_saving_rnn_estimator.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/state_saving_rnn_estimator.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Estimator for State Saving RNNs."""
+"""Estimator for State Saving RNNs (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -528,6 +533,12 @@ def _get_rnn_model_fn(cell_type,
 
 
 class StateSavingRnnEstimator(estimator.Estimator):
+  """RNN with static unrolling and state saving (deprecated).
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   def __init__(self,
                problem_type,
diff --git a/tensorflow/contrib/learn/python/learn/estimators/svm.py b/tensorflow/contrib/learn/python/learn/estimators/svm.py
index 72920d73c0c92886e54f533ad7fe170fe27d9870..3459997baba16fc0d4045e50819ecdd0e7121657 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/svm.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/svm.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Support Vector Machine (SVM) Estimator."""
+"""Support Vector Machine (SVM) Estimator (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -36,6 +41,10 @@ def _as_iterable(preds, output):
 class SVM(estimator.Estimator):
   """Support Vector Machine (SVM) model for binary classification.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Currently, only linear SVMs are supported. For the underlying optimization
   problem, the `SDCAOptimizer` is used. For performance and convergence tuning,
   the num_loss_partitions parameter passed to `SDCAOptimizer` (see `__init__()`
diff --git a/tensorflow/contrib/learn/python/learn/estimators/tensor_signature.py b/tensorflow/contrib/learn/python/learn/estimators/tensor_signature.py
index a120bc6cc3975a3d4559d018c8aa74ff42a16d2d..71b5658dd174d2b47e33860844359f68e6768026 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/tensor_signature.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/tensor_signature.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorSignature class and utilities."""
+"""TensorSignature class and utilities (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -33,6 +38,10 @@ class TensorSignature(collections.namedtuple(
     "TensorSignature", ["dtype", "shape", "is_sparse"])):
   """Signature of the `Tensor` object.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Useful to check compatibility of tensors.
 
   Example:
diff --git a/tensorflow/contrib/learn/python/learn/estimators/test_data.py b/tensorflow/contrib/learn/python/learn/estimators/test_data.py
index ed201bfc58f273e6587850032386c2686aea4148..e4b057b4f5a9e081c2d891bd9828ffc315e51e91 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/test_data.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/test_data.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Test data utilities."""
+"""Test data utilities (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/evaluable.py b/tensorflow/contrib/learn/python/learn/evaluable.py
index 8f6cd39864b437f163dd7c1140dc88755ce98529..10881ca885599bc81386e15f814a2687d907f63b 100644
--- a/tensorflow/contrib/learn/python/learn/evaluable.py
+++ b/tensorflow/contrib/learn/python/learn/evaluable.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""`Evaluable` interface."""
+"""`Evaluable` interface (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -23,6 +28,10 @@ import abc
 
 class Evaluable(object):
   """Interface for objects that are evaluatable by, e.g., `Experiment`.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
   """
   __metaclass__ = abc.ABCMeta
 
diff --git a/tensorflow/contrib/learn/python/learn/experiment.py b/tensorflow/contrib/learn/python/learn/experiment.py
index 331bc115499c8d6f4057bf1c0908bcea05f005a3..3744abd860e7f460133873eb534fd75887182f78 100644
--- a/tensorflow/contrib/learn/python/learn/experiment.py
+++ b/tensorflow/contrib/learn/python/learn/experiment.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Experiment class collecting information needed for a single training run."""
+"""Experiment class collecting information for a single training run (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -25,7 +30,6 @@ import os
 import time
 
 from tensorflow.contrib.framework import deprecated
-from tensorflow.contrib.framework import deprecated_args
 from tensorflow.contrib.framework.python.framework import experimental
 from tensorflow.contrib.learn.python.learn import evaluable
 from tensorflow.contrib.learn.python.learn import export_strategy
@@ -118,6 +122,10 @@ class _EvalAndExportListener(basic_session_run_hooks.CheckpointSaverListener):
 class Experiment(object):
   """Experiment is a class containing all information needed to train a model.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   After an experiment is created (by passing an Estimator and inputs for
   training and evaluation), an Experiment instance knows how to invoke training
   and eval loops in a sensible fashion for distributed training.
@@ -125,16 +133,8 @@ class Experiment(object):
 
   # TODO(ispir): remove delay_workers_by_global_step and make global step based
   # waiting as only behavior.
-  @deprecated_args(
-      "2016-10-23",
-      "local_eval_frequency is deprecated as local_run will be renamed to "
-      "train_and_evaluate. Use min_eval_frequency and call train_and_evaluate "
-      "instead. Note, however, that the default for min_eval_frequency is 1, "
-      "meaning models will be evaluated every time a new checkpoint is "
-      "available. In contrast, the default for local_eval_frequency is None, "
-      "resulting in evaluation occurring only after training has completed. "
-      "min_eval_frequency is ignored when calling the deprecated local_run.",
-      "local_eval_frequency")
+  @deprecated(None, "Please switch to tf.estimator.train_and_evaluate. You will"
+              " also have to convert to a tf.estimator.Estimator.")
   def __init__(self,
                estimator,
                train_input_fn,
@@ -358,7 +358,7 @@ class Experiment(object):
         self._start_server()
     elif config.cluster_spec and config.master:
       raise ValueError(
-          "For distributed runtime, Experiment class only works with"
+          "For distributed runtime, Experiment class only works with "
           "tf.contrib.learn.RunConfig for now, but provided {}".format(
               type(config)))
 
diff --git a/tensorflow/contrib/learn/python/learn/export_strategy.py b/tensorflow/contrib/learn/python/learn/export_strategy.py
index 55a8b824312b89e0ac66513242191f4201ac212a..075cab536ecb5279e7e6f23abb0b70c75043a7ec 100644
--- a/tensorflow/contrib/learn/python/learn/export_strategy.py
+++ b/tensorflow/contrib/learn/python/learn/export_strategy.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""ExportStrategy class represents different flavors of model export."""
+"""ExportStrategy class represents different flavors of model export (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -21,6 +26,7 @@ from __future__ import print_function
 import collections
 
 from tensorflow.python.util import tf_inspect
+from tensorflow.python.util.deprecation import deprecated
 
 __all__ = ['ExportStrategy']
 
@@ -30,6 +36,10 @@ class ExportStrategy(
                            ['name', 'export_fn', 'strip_default_attrs'])):
   """A class representing a type of model export.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Typically constructed by a utility function specific to the exporter, such as
   `saved_model_export_utils.make_export_strategy()`.
 
@@ -56,6 +66,8 @@ class ExportStrategy(
         forward compatibility of the resulting `SavedModel`.
   """
 
+  @deprecated(None, 'Please switch to tf.estimator.train_and_evaluate, and use '
+              'tf.estimator.Exporter.')
   def __new__(cls, name, export_fn, strip_default_attrs=None):
     return super(ExportStrategy, cls).__new__(
         cls, name, export_fn, strip_default_attrs)
diff --git a/tensorflow/contrib/learn/python/learn/graph_actions.py b/tensorflow/contrib/learn/python/learn/graph_actions.py
index 98365c05f663e5d2a06703457fc5663d7135f7d9..a997fab723a16dddf150aa9397863605e4e77933 100644
--- a/tensorflow/contrib/learn/python/learn/graph_actions.py
+++ b/tensorflow/contrib/learn/python/learn/graph_actions.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""High level operations on graphs."""
+"""High level operations on graphs (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -68,6 +73,7 @@ def clear_summary_writers():
   return summary_io.SummaryWriterCache.clear()
 
 
+@deprecated(None, 'Use `SummaryWriterCache.get` directly.')
 def get_summary_writer(logdir):
   """Returns single SummaryWriter per logdir in current run.
 
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/__init__.py b/tensorflow/contrib/learn/python/learn/learn_io/__init__.py
index 06c3782a471537cf3879450e6bd20899a35d96ac..8b133a4440d8cbc19abca64f972791fc16ade6f8 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/__init__.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tools to allow different io formats."""
+"""Tools to allow different io formats (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py b/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py
index 7d666391cea3c0a52a2cb7e324c00d5f480710d5..e0a1948d95a727675dac8ff3ce9f55c35d5f8d8d 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Methods to allow dask.DataFrame."""
+"""Methods to allow dask.DataFrame (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -21,6 +26,8 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.util.deprecation import deprecated
+
 try:
   # pylint: disable=g-import-not-at-top
   import dask.dataframe as dd
@@ -60,6 +67,7 @@ def _construct_dask_df_with_divisions(df):
     return dd.Series(merge(dsk, df.dask), name, df.name, divisions)
 
 
+@deprecated(None, 'Please feed input to tf.data to support dask.')
 def extract_dask_data(data):
   """Extract data from dask.Series or dask.DataFrame for predictors.
 
@@ -81,6 +89,7 @@ def extract_dask_data(data):
     return data
 
 
+@deprecated(None, 'Please feed input to tf.data to support dask.')
 def extract_dask_labels(labels):
   """Extract data from dask.Series or dask.DataFrame for labels.
 
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py b/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py
index 96be8b1bc402479d5611965f27abb197363cb939..c45b1d186471125776d6536112aebb66bb5ad558 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Implementations of different data feeders to provide data for TF trainer."""
+"""Implementations of different data feeders to provide data for TF trainer (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 # TODO(ipolosukhin): Replace this module with feed-dict queue runners & queues.
 
@@ -31,6 +36,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.util.deprecation import deprecated
 
 # pylint: disable=g-multiple-import,g-bad-import-order
 from .pandas_io import HAS_PANDAS, extract_pandas_data, extract_pandas_matrix, extract_pandas_labels
@@ -101,6 +107,7 @@ def _is_iterable(x):
   return hasattr(x, 'next') or hasattr(x, '__next__')
 
 
+@deprecated(None, 'Please use tensorflow/transform or tf.data.')
 def setup_train_data_feeder(x,
                             y,
                             n_classes,
@@ -188,6 +195,7 @@ def _batch_data(x, batch_size=None):
     yield np.matrix(chunk)
 
 
+@deprecated(None, 'Please use tensorflow/transform or tf.data.')
 def setup_predict_data_feeder(x, batch_size=None):
   """Returns an iterable for feeding into predict step.
 
@@ -219,6 +227,7 @@ def setup_predict_data_feeder(x, batch_size=None):
   return [x]
 
 
+@deprecated(None, 'Please use tensorflow/transform or tf.data.')
 def setup_processor_data_feeder(x):
   """Sets up processor iterable.
 
@@ -233,6 +242,7 @@ def setup_processor_data_feeder(x):
   return x
 
 
+@deprecated(None, 'Please convert numpy dtypes explicitly.')
 def check_array(array, dtype):
   """Checks array on dtype and converts it if different.
 
@@ -275,8 +285,14 @@ def _check_dtype(dtype):
 
 
 class DataFeeder(object):
-  """Data feeder is an example class to sample data for TF trainer."""
+  """Data feeder is an example class to sample data for TF trainer.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Please use tensorflow/transform or tf.data.')
   def __init__(self,
                x,
                y,
@@ -563,6 +579,10 @@ class DataFeeder(object):
 class StreamingDataFeeder(DataFeeder):
   """Data feeder for TF trainer that reads data from iterator.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Streaming data feeder allows to read data as it comes it from disk or
   somewhere else. It's custom to have this iterators rotate infinetly over
   the dataset, to allow control of how much to learn on the trainer side.
@@ -771,11 +791,16 @@ class StreamingDataFeeder(DataFeeder):
 class DaskDataFeeder(object):
   """Data feeder for that reads data from dask.Series and dask.DataFrame.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Numpy arrays can be serialized to disk and it's possible to do random seeks
   into them. DaskDataFeeder will remove requirement to have full dataset in the
   memory and still do random seeks for sampling of batches.
   """
 
+  @deprecated(None, 'Please feed input to tf.data to support dask.')
   def __init__(self,
                x,
                y,
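
Reviewer note: the hunks above attach `@deprecated(None, '...')` to each public function and to each `__init__`. The block below is a minimal, standard-library-only sketch of what a decorator with that `(date, instructions)` call shape might do; it is not TensorFlow's implementation. The use of `warnings.warn`, the message wording, and the body of `setup_processor_data_feeder` are assumptions made for illustration; only the decorator arguments are taken from the diff.

```python
import functools
import warnings


def deprecated(date, instructions):
  """Simplified stand-in for a `deprecated(date, instructions)` decorator."""
  # Assumption: a date of None means no removal date has been scheduled.
  when = 'in a future version' if date is None else 'after %s' % date

  def decorator(func):
    @functools.wraps(func)  # Preserve the wrapped function's name and docstring.
    def wrapper(*args, **kwargs):
      warnings.warn(
          '%s is deprecated and will be removed %s. '
          'Instructions for updating: %s' % (func.__name__, when, instructions),
          DeprecationWarning,
          stacklevel=2)
      return func(*args, **kwargs)
    return wrapper

  return decorator


@deprecated(None, 'Please use tensorflow/transform or tf.data.')
def setup_processor_data_feeder(x):
  # Hypothetical body for illustration only; the real function does more.
  return x
```

Under this sketch, existing callers keep working and see the migration hint each time they call the function, which matches the intent of marking the API deprecated without removing it.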
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/generator_io.py b/tensorflow/contrib/learn/python/learn/learn_io/generator_io.py
index 884faf8335e2a3ca1d27d2d93b4c817131648774..f8aaa0c9e3e5b589a6ad47678dba3dc38de7c471 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/generator_io.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/generator_io.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Methods to allow generator of dict with numpy arrays."""
+"""Methods to allow generator of dict with numpy arrays (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -23,8 +28,10 @@ from types import FunctionType
 from types import GeneratorType
 
 from tensorflow.python.estimator.inputs.queues.feeding_functions import _enqueue_data as enqueue_data
+from tensorflow.python.util.deprecation import deprecated
 
 
+@deprecated(None, 'Please use tf.data.')
 def generator_input_fn(x,
                        target_key=None,
                        batch_size=128,
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/graph_io.py b/tensorflow/contrib/learn/python/learn/learn_io/graph_io.py
index 3a46c239688017f9204d2c6182a6f81cd325a417..9e816f54b6cf8dee84c6d62406ab3db700054d06 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/graph_io.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/graph_io.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Methods to read data in the graph."""
+"""Methods to read data in the graph (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -34,11 +39,13 @@ from tensorflow.python.platform import gfile
 from tensorflow.python.summary import summary
 from tensorflow.python.training import input as input_ops
 from tensorflow.python.training import queue_runner
+from tensorflow.python.util.deprecation import deprecated
 
 # Default name for key in the feature dict.
 KEY_FEATURE_NAME = '__key__'
 
 
+@deprecated(None, 'Use tf.data.')
 def read_batch_examples(file_pattern,
                         batch_size,
                         reader,
@@ -106,6 +113,7 @@ def read_batch_examples(file_pattern,
   return examples
 
 
+@deprecated(None, 'Use tf.data.')
 def read_keyed_batch_examples(file_pattern,
                               batch_size,
                               reader,
@@ -175,6 +183,7 @@ def read_keyed_batch_examples(file_pattern,
       seed=seed)
 
 
+@deprecated(None, 'Use tf.data.')
 def read_keyed_batch_examples_shared_queue(file_pattern,
                                            batch_size,
                                            reader,
@@ -452,6 +461,7 @@ def _read_keyed_batch_examples_helper(file_pattern,
     return queued_examples_with_keys
 
 
+@deprecated(None, 'Use tf.data.')
 def read_keyed_batch_features(file_pattern,
                               batch_size,
                               features,
@@ -540,6 +550,7 @@ def read_keyed_batch_features(file_pattern,
         name=scope)
 
 
+@deprecated(None, 'Use tf.data.')
 def read_keyed_batch_features_shared_queue(file_pattern,
                                            batch_size,
                                            features,
@@ -620,6 +631,7 @@ def read_keyed_batch_features_shared_queue(file_pattern,
         name=scope)
 
 
+@deprecated(None, 'Use tf.data.')
 def queue_parsed_features(parsed_features,
                           keys=None,
                           feature_queue_capacity=100,
@@ -742,6 +754,7 @@ def queue_parsed_features(parsed_features,
     return dequeued_keys, dequeued_parsed_features
 
 
+@deprecated(None, 'Use tf.data.')
 def read_batch_features(file_pattern,
                         batch_size,
                         features,
@@ -821,6 +834,7 @@ def read_batch_features(file_pattern,
   return features
 
 
+@deprecated(None, 'Use tf.data.')
 def read_batch_record_features(file_pattern,
                                batch_size,
                                features,
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/numpy_io.py b/tensorflow/contrib/learn/python/learn/learn_io/numpy_io.py
index 692438807fbd7febb156d4db73b5d3deba6c987d..29552d24f1eaa0d85a99c8b09f69d007e7e4fe9f 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/numpy_io.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/numpy_io.py
@@ -12,15 +12,22 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Methods to allow dict of numpy arrays."""
+"""Methods to allow dict of numpy arrays (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.estimator.inputs.numpy_io import numpy_input_fn as core_numpy_input_fn
+from tensorflow.python.util.deprecation import deprecated
 
 
+@deprecated(None, 'Use tf.estimator.inputs.numpy_input_fn.')
 def numpy_input_fn(x,
                    y=None,
                    batch_size=128,
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/pandas_io.py b/tensorflow/contrib/learn/python/learn/learn_io/pandas_io.py
index ede7558eafa9237dc63aa95a62e599c5e9755822..b4ef055f5ae484ec704ad42efcf2c00c4a7a4f56 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/pandas_io.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/pandas_io.py
@@ -13,13 +13,19 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Methods to allow pandas.DataFrame."""
+"""Methods to allow pandas.DataFrame (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.estimator.inputs.pandas_io import pandas_input_fn as core_pandas_input_fn
+from tensorflow.python.util.deprecation import deprecated
 
 try:
   # pylint: disable=g-import-not-at-top
@@ -47,6 +53,7 @@ PANDAS_DTYPES = {
 }
 
 
+@deprecated(None, 'Please use tf.estimator.inputs.pandas_input_fn.')
 def pandas_input_fn(x,
                     y=None,
                     batch_size=128,
@@ -66,6 +73,7 @@ def pandas_input_fn(x,
                               target_column=target_column)
 
 
+@deprecated(None, 'Please access pandas data directly.')
 def extract_pandas_data(data):
   """Extract data from pandas.DataFrame for predictors.
 
@@ -96,6 +104,7 @@ def extract_pandas_data(data):
                      'float, or bool. Found: ' + ', '.join(error_report))
 
 
+@deprecated(None, 'Please access pandas data directly.')
 def extract_pandas_matrix(data):
   """Extracts numpy matrix from pandas DataFrame.
 
@@ -111,6 +120,7 @@ def extract_pandas_matrix(data):
   return data.as_matrix()
 
 
+@deprecated(None, 'Please access pandas data directly.')
 def extract_pandas_labels(labels):
   """Extract data from pandas.DataFrame for labels.
 
diff --git a/tensorflow/contrib/learn/python/learn/learn_runner.py b/tensorflow/contrib/learn/python/learn/learn_runner.py
index 2af723a0d64822e81fa0fbeb106ab812de6ab4e8..d719a3e488b9905ef7903e21d90dbaae0449735c 100644
--- a/tensorflow/contrib/learn/python/learn/learn_runner.py
+++ b/tensorflow/contrib/learn/python/learn/learn_runner.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Runs an Experiment."""
+"""Runs an Experiment (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -22,6 +27,7 @@ from tensorflow.contrib.learn.python.learn.estimators import run_config as run_c
 from tensorflow.contrib.learn.python.learn.experiment import Experiment
 from tensorflow.contrib.training.python.training import hparam as hparam_lib
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.util.deprecation import deprecated
 
 
 # TODO(xiejw): Refactor the learn_runner to make code reusable.
@@ -99,6 +105,7 @@ def _wrapped_experiment_fn_with_uid_check(experiment_fn, require_hparams=False):
   return wrapped_experiment_fn
 
 
+@deprecated(None, 'Use tf.estimator.train_and_evaluate.')
 def run(experiment_fn, output_dir=None, schedule=None, run_config=None,
         hparams=None):
   """Make and run an experiment.
@@ -218,6 +225,7 @@ def run(experiment_fn, output_dir=None, schedule=None, run_config=None,
   return _execute_schedule(experiment, schedule)
 
 
+@deprecated(None, 'Use tf.estimator.train_and_evaluate.')
 def tune(experiment_fn, tuner):
   """Tune an experiment with hyper-parameters.
 
diff --git a/tensorflow/contrib/learn/python/learn/learn_runner_lib.py b/tensorflow/contrib/learn/python/learn/learn_runner_lib.py
index 7d9b1c7716f0ab1f2274ca53406175240b613027..ba2d067787c1dfd4e4820ecc916f1053e9f3cf60 100644
--- a/tensorflow/contrib/learn/python/learn/learn_runner_lib.py
+++ b/tensorflow/contrib/learn/python/learn/learn_runner_lib.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Utilities to run and tune an Experiment.
+"""Utilities to run and tune an Experiment (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 @@run
 @@tune
diff --git a/tensorflow/contrib/learn/python/learn/metric_spec.py b/tensorflow/contrib/learn/python/learn/metric_spec.py
index 6440bc204b8e339ff51311dcc87b36f556b94092..97220365d5dddb82b602369f06bea021a86d584f 100644
--- a/tensorflow/contrib/learn/python/learn/metric_spec.py
+++ b/tensorflow/contrib/learn/python/learn/metric_spec.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""The metric spec class to flexibly connect models and metrics."""
+"""The metric spec class to flexibly connect models and metrics (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -22,6 +27,7 @@ import six
 
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import tf_inspect
+from tensorflow.python.util.deprecation import deprecated
 
 
 def _assert_named_args(sentinel):
@@ -223,6 +229,10 @@ def _adapt_metric_fn(
 class MetricSpec(object):
   """MetricSpec connects a model to metric functions.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   The MetricSpec class contains all information necessary to connect the
   output of a `model_fn` to the metrics (usually, streaming metrics) that are
   used in evaluation.
@@ -284,6 +294,7 @@ class MetricSpec(object):
 
   """
 
+  @deprecated(None, 'Use tf.estimator.EstimatorSpec.eval_metric_ops.')
   def __init__(self,
                metric_fn,
                prediction_key=None,
diff --git a/tensorflow/contrib/learn/python/learn/models.py b/tensorflow/contrib/learn/python/learn/models.py
index 4283240d018c949bb35aeb12032d2ee8b75884a5..bd4bbf9f8c9ad7e8a0fc06d8c0dc24672536c158 100644
--- a/tensorflow/contrib/learn/python/learn/models.py
+++ b/tensorflow/contrib/learn/python/learn/models.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Various high level TF models."""
+"""Various high level TF models (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -28,8 +33,10 @@ from tensorflow.python.ops import array_ops as array_ops_
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.summary import summary
+from tensorflow.python.util.deprecation import deprecated
 
 
+@deprecated(None, 'Consider using a tf.estimator.LinearRegressor')
 def linear_regression_zero_init(x, y):
   """Linear regression subgraph with zero-value initial weights and bias.
 
@@ -43,6 +50,7 @@ def linear_regression_zero_init(x, y):
   return linear_regression(x, y, init_mean=0.0, init_stddev=0.0)
 
 
+@deprecated(None, 'Consider using a tf.estimator.LinearClassifier')
 def logistic_regression_zero_init(x, y):
   """Logistic regression subgraph with zero-value initial weights and bias.
 
@@ -56,6 +64,7 @@ def logistic_regression_zero_init(x, y):
   return logistic_regression(x, y, init_mean=0.0, init_stddev=0.0)
 
 
+@deprecated(None, 'Consider using a class from tf.estimator.')
 def linear_regression(x, y, init_mean=None, init_stddev=1.0):
   """Creates linear regression TensorFlow subgraph.
 
@@ -107,6 +116,7 @@ def linear_regression(x, y, init_mean=None, init_stddev=1.0):
     return losses_ops.mean_squared_error_regressor(x, y, weights, bias)
 
 
+@deprecated(None, 'Consider using a class from tf.estimator.')
 def logistic_regression(x,
                         y,
                         class_weight=None,
@@ -203,6 +213,7 @@ def _reverse_seq(input_seq, lengths):
   return result
 
 
+@deprecated(None, 'Please consider `tf.nn.bidirectional_dynamic_rnn`.')
 def bidirectional_rnn(cell_fw,
                       cell_bw,
                       inputs,
@@ -283,6 +294,7 @@ def bidirectional_rnn(cell_fw,
 # End of TensorFlow 0.7
 
 
+@deprecated(None, 'Please consider tensorflow/tensor2tensor.')
 def get_rnn_model(rnn_size, cell_type, num_layers, input_op_fn, bidirectional,
                   target_predictor_fn, sequence_length, initial_state,
                   attn_length, attn_size, attn_vec_size):
diff --git a/tensorflow/contrib/learn/python/learn/monitored_session.py b/tensorflow/contrib/learn/python/learn/monitored_session.py
index 22602e9f69d972505d83a66a6f9183b5e4d15c44..ac0433f1775feeed2ec3cf49291da01500bef01b 100644
--- a/tensorflow/contrib/learn/python/learn/monitored_session.py
+++ b/tensorflow/contrib/learn/python/learn/monitored_session.py
@@ -13,7 +13,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""A wrapper of Session API which runs hooks."""
+"""A wrapper of Session API which runs hooks (deprecated).
+
+These are deprecated aliases for classes and functions in `tf.train`. Please use
+those directly.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/monitors.py b/tensorflow/contrib/learn/python/learn/monitors.py
index 9457a73ecfb41782c888e3bba0b140db83d4d464..77f7c73d5412d40b338eaff4cf04d99fd0892723 100644
--- a/tensorflow/contrib/learn/python/learn/monitors.py
+++ b/tensorflow/contrib/learn/python/learn/monitors.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Monitors instrument the training process.
+"""Monitors instrument the training process (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 @@get_default_monitors
 @@BaseMonitor
@@ -59,6 +63,10 @@ from tensorflow.python.util import tf_inspect
 class BaseMonitor(object):
   """Base class for Monitors.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Defines basic interfaces of Monitors.
   Monitors can either be run on all workers or, more commonly, restricted
   to run exclusively on the elected chief worker.
@@ -229,6 +237,10 @@ def _extract_output(outputs, request):
 class EveryN(BaseMonitor):
   """Base class for monitors that execute callbacks every N steps.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   This class adds three new callbacks:
     - every_n_step_begin
     - every_n_step_end
@@ -418,6 +430,10 @@ class StopAtStep(BaseMonitor):
 class PrintTensor(EveryN):
   """Prints given tensors every N steps.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   This is an `EveryN` monitor and has consistent semantic for `every_n`
   and `first_n`.
 
@@ -455,9 +471,12 @@ class PrintTensor(EveryN):
 class LoggingTrainable(EveryN):
   """Writes trainable variable values into log every N steps.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Write the tensors in trainable variables `every_n` steps,
   starting with the `first_n`th step.
-
   """
 
   def __init__(self, scope=None, every_n=100, first_n=1):
@@ -493,7 +512,12 @@ class LoggingTrainable(EveryN):
 
 
 class SummarySaver(EveryN):
-  """Saves summaries every N steps."""
+  """Saves summaries every N steps.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   def __init__(self,
                summary_op,
@@ -554,6 +578,10 @@ class SummarySaver(EveryN):
 class ValidationMonitor(EveryN):
   """Runs evaluation of a given estimator, at most every N steps.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Note that the evaluation is done based on the saved checkpoint, which will
   usually be older than the current step.
 
@@ -756,6 +784,10 @@ class ValidationMonitor(EveryN):
 class CaptureVariable(EveryN):
   """Captures a variable's values into a collection.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   This monitor is useful for unit testing. You should exercise caution when
   using this monitor in production, since it never discards values.
 
@@ -794,6 +826,7 @@ class CaptureVariable(EveryN):
     self._var_values[step] = _extract_output(outputs, self._var_name)
 
 
+@deprecation.deprecated(None, "Use tf.train.MonitoredTrainingSession.")
 def get_default_monitors(loss_op=None,
                          summary_op=None,
                          save_summary_steps=100,
@@ -828,6 +861,10 @@ def get_default_monitors(loss_op=None,
 class GraphDump(BaseMonitor):
   """Dumps almost all tensors in the graph at every step.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Note, this is very expensive, prefer `PrintTensor` in production.
   """
 
@@ -917,7 +954,12 @@ class GraphDump(BaseMonitor):
 
 
 class ExportMonitor(EveryN):
-  """Monitor that exports Estimator every N steps."""
+  """Monitor that exports Estimator every N steps.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   @deprecation.deprecated("2017-03-25",
                           "ExportMonitor is deprecated. Please pass an "
@@ -1040,7 +1082,12 @@ class ExportMonitor(EveryN):
 
 
 class CheckpointSaver(BaseMonitor):
-  """Saves checkpoints every N steps or N seconds."""
+  """Saves checkpoints every N steps or N seconds.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   def __init__(self,
                checkpoint_dir,
@@ -1125,7 +1172,12 @@ class CheckpointSaver(BaseMonitor):
 
 
 class StepCounter(EveryN):
-  """Steps per second monitor."""
+  """Steps per second monitor.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   def __init__(self, every_n_steps=100, output_dir=None, summary_writer=None):
     super(StepCounter, self).__init__(every_n_steps=every_n_steps)
@@ -1165,6 +1217,10 @@ class NanLossDuringTrainingError(RuntimeError):
 class NanLoss(EveryN):
   """NaN Loss monitor.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Monitors loss and stops training if loss is NaN.
   Can either fail with exception or just stop training.
   """
diff --git a/tensorflow/contrib/learn/python/learn/ops/__init__.py b/tensorflow/contrib/learn/python/learn/ops/__init__.py
index 33962e34cc685ce2c830a7bbfd1b5c626bcd8b31..efb1f47cf5bb2dcd0fb37b7b85cd8f170d56e4d1 100644
--- a/tensorflow/contrib/learn/python/learn/ops/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/ops/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Various TensorFlow Ops."""
+"""Various TensorFlow Ops (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py b/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py
index fa3b7323e343371e986b763d30a8a44620894549..8f9811cf251ae0af1e0055a56e1358c2771b1367 100644
--- a/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py
+++ b/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py
@@ -13,7 +13,11 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorFlow Ops to work with embeddings.
+"""TensorFlow Ops to work with embeddings (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 Note: categorical variables are handled via embeddings in many cases.
 For example, in case of words.
@@ -57,7 +61,7 @@ def embedding_lookup(params, ids, name='embedding_lookup'):
     ids = ops.convert_to_tensor(ids)
     shape = array_ops_.shape(ids)
     ids_flat = array_ops_.reshape(
-        ids, math_ops.reduce_prod(shape, keep_dims=True))
+        ids, math_ops.reduce_prod(shape, keepdims=True))
     embeds_flat = nn.embedding_lookup(params, ids_flat, name)
     embed_shape = array_ops_.concat([shape, [-1]], 0)
     embeds = array_ops_.reshape(embeds_flat, embed_shape)
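
Reviewer note: the hunk above only renames the keyword `keep_dims` to `keepdims`, so the flattening logic is unchanged. The block below is a pure-Python sketch of why the reduced dimension is kept at all: the target shape passed to a reshape has to be a one-element list such as `[6]`, not a bare scalar. The `reduce_prod` helper and the example shape are hypothetical stand-ins, not the TensorFlow ops.

```python
import functools
import operator


def reduce_prod(values, keepdims=False):
  """Pure-Python stand-in for a product reduction over a 1-D list."""
  product = functools.reduce(operator.mul, values, 1)
  # With keepdims=True the reduced axis survives with length 1, so the
  # result is a one-element list rather than a bare scalar.
  return [product] if keepdims else product


# Shape of a hypothetical batch of ids: 2 rows of 3 ids each.
ids_shape = [2, 3]

# Target shape for flattening the ids into a single axis.
flat_shape = reduce_prod(ids_shape, keepdims=True)

# Shape used to restore the batch layout, with the embedding size inferred.
embed_shape = ids_shape + [-1]
```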
diff --git a/tensorflow/contrib/learn/python/learn/ops/losses_ops.py b/tensorflow/contrib/learn/python/learn/ops/losses_ops.py
index b040ab3bb6c516158589a8e30d56fff1f7728951..92976d1539c7ddc226b81f903beee82b798ec8db 100644
--- a/tensorflow/contrib/learn/python/learn/ops/losses_ops.py
+++ b/tensorflow/contrib/learn/python/learn/ops/losses_ops.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorFlow Ops for loss computation."""
+"""TensorFlow Ops for loss computation (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py b/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py
index 45727faab4362abeab18f77861353eb53976023a..aa37cb4a76e2a6157bf077d327248353bd516472 100644
--- a/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py
+++ b/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorFlow Ops for Sequence to Sequence models."""
+"""TensorFlow Ops for Sequence to Sequence models (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -26,8 +31,10 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn
 from tensorflow.python.ops import variable_scope as vs
+from tensorflow.python.util.deprecation import deprecated
 
 
+@deprecated(None, 'Please use tf.nn/tf.layers directly.')
 def sequence_classifier(decoding, labels, sampling_decoding=None, name=None):
   """Returns predictions and loss for sequence of predictions.
 
@@ -57,6 +64,7 @@ def sequence_classifier(decoding, labels, sampling_decoding=None, name=None):
     return array_ops.stack(predictions, axis=1), loss
 
 
+@deprecated(None, 'Please use tf.nn/tf.layers directly.')
 def seq2seq_inputs(x, y, input_length, output_length, sentinel=None, name=None):
   """Processes inputs for Sequence to Sequence models.
 
@@ -87,6 +95,7 @@ def seq2seq_inputs(x, y, input_length, output_length, sentinel=None, name=None):
     return in_x, in_y, out_y
 
 
+@deprecated(None, 'Please use tf.nn/tf.layers directly.')
 def rnn_decoder(decoder_inputs, initial_state, cell, scope=None):
   """RNN Decoder that creates training and sampling sub-graphs.
 
@@ -123,6 +132,7 @@ def rnn_decoder(decoder_inputs, initial_state, cell, scope=None):
   return outputs, states, sampling_outputs, sampling_states
 
 
+@deprecated(None, 'Please use tf.nn/tf.layers directly.')
 def rnn_seq2seq(encoder_inputs,
                 decoder_inputs,
                 encoder_cell,
diff --git a/tensorflow/contrib/learn/python/learn/preprocessing/__init__.py b/tensorflow/contrib/learn/python/learn/preprocessing/__init__.py
index 7bcc177d4ea0ab57f092d68888a72de2b2fd5edc..e8c6e1acf80f0791421bee59aff30e67bccb44b2 100644
--- a/tensorflow/contrib/learn/python/learn/preprocessing/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/preprocessing/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Preprocessing tools useful for building models."""
+"""Preprocessing tools useful for building models (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/preprocessing/categorical.py b/tensorflow/contrib/learn/python/learn/preprocessing/categorical.py
index 154739d497ec1029026eaca1e93b37cd225f1050..faba3b2025e8abb51d1989c3fafbd5e711d6559b 100644
--- a/tensorflow/contrib/learn/python/learn/preprocessing/categorical.py
+++ b/tensorflow/contrib/learn/python/learn/preprocessing/categorical.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Implements preprocessing transformers for categorical variables."""
+"""Implements preprocessing transformers for categorical variables (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -22,6 +27,8 @@ from __future__ import print_function
 import math
 import numpy as np
 
+from tensorflow.python.util.deprecation import deprecated
+
 # pylint: disable=g-bad-import-order
 from . import categorical_vocabulary
 from ..learn_io.data_feeder import setup_processor_data_feeder
@@ -31,10 +38,16 @@ from ..learn_io.data_feeder import setup_processor_data_feeder
 class CategoricalProcessor(object):
   """Maps documents to sequences of word ids.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   As a common convention, Nan values are handled as unknown tokens.
   Both float('nan') and np.nan are accepted.
   """
 
+  @deprecated(None, 'Please use tensorflow/transform or tf.data for sequence '
+              'processing.')
   def __init__(self, min_frequency=0, share=False, vocabularies=None):
     """Initializes a CategoricalProcessor instance.
 
diff --git a/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py b/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py
index 5709955c49fba50ca4a299a443a2902bbd9c6b23..3ac370a6ab4423846e810900514445ad5269b680 100644
--- a/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py
+++ b/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py
@@ -13,7 +13,11 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Categorical vocabulary classes to map categories to indexes.
+"""Categorical vocabulary classes to map categories to indexes (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 Can be used for categorical variables, sparse variables and words.
 """
@@ -25,14 +29,21 @@ from __future__ import print_function
 import collections
 import six
 
+from tensorflow.python.util.deprecation import deprecated
+
 
 class CategoricalVocabulary(object):
   """Categorical variables vocabulary class.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Accumulates and provides mapping from classes to indexes.
   Can be easily used for words.
   """
 
+  @deprecated(None, 'Please use tensorflow/transform or tf.data.')
   def __init__(self, unknown_token="<UNK>", support_reverse=True):
     self._unknown_token = unknown_token
     self._mapping = {unknown_token: 0}
diff --git a/tensorflow/contrib/learn/python/learn/preprocessing/text.py b/tensorflow/contrib/learn/python/learn/preprocessing/text.py
index 3af2074c2a46f0258c04111fff0235ba8309625e..f2b6776be7789a9433bfe41eb9354b74347059ec 100644
--- a/tensorflow/contrib/learn/python/learn/preprocessing/text.py
+++ b/tensorflow/contrib/learn/python/learn/preprocessing/text.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Implements a number of text preprocessing utilities."""
+"""Implements a number of text preprocessing utilities (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -24,6 +29,7 @@ import numpy as np
 import six
 
 from tensorflow.python.platform import gfile
+from tensorflow.python.util.deprecation import deprecated
 
 from .categorical_vocabulary import CategoricalVocabulary  # pylint: disable=g-bad-import-order
 
@@ -38,6 +44,7 @@ TOKENIZER_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",
                           re.UNICODE)
 
 
+@deprecated(None, 'Please use tensorflow/transform or tf.data.')
 def tokenizer(iterator):
   """Tokenizer generator.
 
@@ -51,9 +58,16 @@ def tokenizer(iterator):
     yield TOKENIZER_RE.findall(value)
 
 
+@deprecated(None, 'Please use tensorflow/transform or tf.data.')
 class ByteProcessor(object):
-  """Maps documents into sequence of ids for bytes."""
+  """Maps documents into sequence of ids for bytes.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Please use tensorflow/transform or tf.data.')
   def __init__(self, max_document_length):
     self.max_document_length = max_document_length
 
@@ -108,8 +122,14 @@ class ByteProcessor(object):
 
 
 class VocabularyProcessor(object):
-  """Maps documents to sequences of word ids."""
+  """Maps documents to sequences of word ids.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Please use tensorflow/transform or tf.data.')
   def __init__(self,
                max_document_length,
                min_frequency=0,
diff --git a/tensorflow/contrib/learn/python/learn/session_run_hook.py b/tensorflow/contrib/learn/python/learn/session_run_hook.py
index a8ba2be97206f2b974d256ad2c62c21a4e3e55d8..87edc9b720bdb3edcd5f2dcd1662d14da53c51cf 100644
--- a/tensorflow/contrib/learn/python/learn/session_run_hook.py
+++ b/tensorflow/contrib/learn/python/learn/session_run_hook.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""This file is deprecated. Use tensorflow.python.training.session_run_hook."""
+"""This file is deprecated. Use `tensorflow.python.training.session_run_hook`.
+
+See [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/summary_writer_cache.py b/tensorflow/contrib/learn/python/learn/summary_writer_cache.py
index 919d415c302b8ec17202aad34ff0bee69bfee2c7..d663cf5fb79c428b0e70d66b0f1305f0559a05c9 100644
--- a/tensorflow/contrib/learn/python/learn/summary_writer_cache.py
+++ b/tensorflow/contrib/learn/python/learn/summary_writer_cache.py
@@ -12,7 +12,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Wrapper for a Session-like object that handles threads and recovery.
+"""Wrapper for a Session-like object that handles threads and recovery (deprecated).
+
+These are deprecated aliases for classes and functions in `tf.train`. Please use
+those directly.
 
 Based on an original design of Illia Polosukhin.
 """
diff --git a/tensorflow/contrib/learn/python/learn/trainable.py b/tensorflow/contrib/learn/python/learn/trainable.py
index 429b6040be21d8cbe1f2bba58090366552fdfbe7..a1a3f20dcd8cb5ff7baa559ac41d5e5c40780511 100644
--- a/tensorflow/contrib/learn/python/learn/trainable.py
+++ b/tensorflow/contrib/learn/python/learn/trainable.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""`Trainable` interface."""
+"""`Trainable` interface (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -23,6 +28,8 @@ import abc
 
 class Trainable(object):
   """Interface for objects that are trainable by, e.g., `Experiment`.
+
+  THIS CLASS IS DEPRECATED.
   """
   __metaclass__ = abc.ABCMeta
 
diff --git a/tensorflow/contrib/learn/python/learn/utils/__init__.py b/tensorflow/contrib/learn/python/learn/utils/__init__.py
index 48978d0ac34cec2b18e6794dcf3b260bc3b683c4..66d8dc6fd43b383919a16515bc96be492a253bf6 100644
--- a/tensorflow/contrib/learn/python/learn/utils/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/utils/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorFlow Learn Utils."""
+"""TensorFlow Learn Utils (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/utils/export.py b/tensorflow/contrib/learn/python/learn/utils/export.py
index cb34cb1d26b6812c7f3f39e9f965615de5a8ef07..3eacac7a3d3dcff4d39025fdee88e16e385b1b84 100644
--- a/tensorflow/contrib/learn/python/learn/utils/export.py
+++ b/tensorflow/contrib/learn/python/learn/utils/export.py
@@ -13,14 +13,18 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Export utilities."""
+"""Export utilities (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow.contrib.framework import deprecated
-from tensorflow.python.training import training_util
 from tensorflow.contrib.session_bundle import exporter
 from tensorflow.contrib.session_bundle import gc
 from tensorflow.python.client import session as tf_session
@@ -32,6 +36,7 @@ from tensorflow.python.ops import lookup_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import saver as tf_saver
+from tensorflow.python.training import training_util
 
 
 @deprecated('2017-03-25', 'Please use Estimator.export_savedmodel() instead.')
diff --git a/tensorflow/contrib/learn/python/learn/utils/gc.py b/tensorflow/contrib/learn/python/learn/utils/gc.py
index 226915987a4934626066b12810f579ae675107b2..916aecbea88b10bbef316ffb89d4c4d89667cb29 100644
--- a/tensorflow/contrib/learn/python/learn/utils/gc.py
+++ b/tensorflow/contrib/learn/python/learn/utils/gc.py
@@ -13,7 +13,11 @@
 # limitations under the License.
 # ==============================================================================
 
-r"""System for specifying garbage collection (GC) of path based data.
+r"""System for specifying garbage collection (GC) of path based data (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 This framework allows for GC of data specified by path names, for example files
 on disk.  gc.Path objects each represent a single item stored at a path and may
@@ -73,10 +77,12 @@ import os
 
 from tensorflow.python.platform import gfile
 from tensorflow.python.util import compat
+from tensorflow.python.util.deprecation import deprecated
 
 Path = collections.namedtuple('Path', 'path export_version')
 
 
+@deprecated(None, 'Please implement your own file management or use Saver.')
 def largest_export_versions(n):
   """Creates a filter that keeps the largest n export versions.
 
@@ -97,6 +103,7 @@ def largest_export_versions(n):
   return keep
 
 
+@deprecated(None, 'Please implement your own file management or use Saver.')
 def one_of_every_n_export_versions(n):
   """Creates a filter that keeps one of every n export versions.
 
@@ -128,6 +135,7 @@ def one_of_every_n_export_versions(n):
   return keep
 
 
+@deprecated(None, 'Please implement your own file management or use Saver.')
 def mod_export_version(n):
   """Creates a filter that keeps every export that is a multiple of n.
 
@@ -146,6 +154,7 @@ def mod_export_version(n):
   return keep
 
 
+@deprecated(None, 'Please implement your own file management or use Saver.')
 def union(lf, rf):
   """Creates a filter that keeps the union of two filters.
 
@@ -163,6 +172,7 @@ def union(lf, rf):
   return keep
 
 
+@deprecated(None, 'Please implement your own file management or use Saver.')
 def negation(f):
   """Negate a filter.
 
@@ -179,6 +189,7 @@ def negation(f):
   return keep
 
 
+@deprecated(None, 'Please implement your own file name management.')
 def get_paths(base_dir, parser):
   """Gets a list of Paths in a given directory.
 
diff --git a/tensorflow/contrib/learn/python/learn/utils/input_fn_utils.py b/tensorflow/contrib/learn/python/learn/utils/input_fn_utils.py
index b2521933e524e7ec24d73d4b5171f33e507dd88c..b92eb9fea8b7ccea56c781df74dcfa1cc5508e48 100644
--- a/tensorflow/contrib/learn/python/learn/utils/input_fn_utils.py
+++ b/tensorflow/contrib/learn/python/learn/utils/input_fn_utils.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Utilities for creating input_fns.
+"""Utilities for creating input_fns (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 Contents of this file are moved to tensorflow/python/estimator/export.py.
 InputFnOps is renamed to ServingInputReceiver.
@@ -32,13 +36,17 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import parsing_ops
+from tensorflow.python.util.deprecation import deprecated
 
 
 class InputFnOps(collections.namedtuple('InputFnOps',
                                         ['features',
                                          'labels',
                                          'default_inputs'])):
-  """A return type for an input_fn.
+  """A return type for an input_fn (deprecated).
+
+  THIS CLASS IS DEPRECATED. Please use tf.estimator.export.ServingInputReceiver
+  instead.
 
   This return type is currently only supported for serving input_fn.
   Training and eval input_fn should return a `(features, labels)` tuple.
@@ -56,6 +64,8 @@ class InputFnOps(collections.namedtuple('InputFnOps',
   """
 
 
+@deprecated(None, 'Please use '
+            'tf.estimator.export.build_parsing_serving_input_receiver_fn.')
 def build_parsing_serving_input_fn(feature_spec, default_batch_size=None):
   """Build an input_fn appropriate for serving, expecting fed tf.Examples.
 
@@ -84,6 +94,8 @@ def build_parsing_serving_input_fn(feature_spec, default_batch_size=None):
   return input_fn
 
 
+@deprecated(None, 'Please use '
+            'tf.estimator.export.build_raw_serving_input_receiver_fn.')
 def build_default_serving_input_fn(features, default_batch_size=None):
   """Build an input_fn appropriate for serving, expecting feature Tensors.
 
diff --git a/tensorflow/contrib/learn/python/learn/utils/inspect_checkpoint.py b/tensorflow/contrib/learn/python/learn/utils/inspect_checkpoint.py
index 6a63fb545a56e6040b0b0c3bbb6a17cd96925895..6dbaa15f8391b0044be8e30ca191753beb88db93 100644
--- a/tensorflow/contrib/learn/python/learn/utils/inspect_checkpoint.py
+++ b/tensorflow/contrib/learn/python/learn/utils/inspect_checkpoint.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""A simple script for inspect checkpoint files."""
+"""A simple script for inspecting checkpoint files (deprecated)."""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py b/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
index 1593380007b2799fb1d17e92408ab19a7b47fe1e..c7cdb4131215c388412407a008113de13bdd0934 100644
--- a/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
+++ b/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Utilities supporting export to SavedModel.
+"""Utilities supporting export to SavedModel (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 Some contents of this file are moved to tensorflow/python/estimator/export.py:
 
@@ -52,8 +56,9 @@ from tensorflow.python.saved_model import signature_constants
 from tensorflow.python.saved_model import signature_def_utils
 from tensorflow.python.summary import summary_iterator
 from tensorflow.python.training import saver
-
 from tensorflow.python.util import compat
+from tensorflow.python.util.deprecation import deprecated
+
 
 # A key for use in the input_alternatives dict indicating the default input.
 # This is the input that will be expected when a serving request does not
@@ -77,6 +82,7 @@ FEATURES_INPUT_ALTERNATIVE_KEY = 'features_input_alternative'
 _FALLBACK_DEFAULT_OUTPUT_ALTERNATIVE_KEY = 'default_output_alternative'
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def build_standardized_signature_def(input_tensors, output_tensors,
                                      problem_type):
   """Build a SignatureDef using problem type and input and output Tensors.
@@ -156,6 +162,7 @@ def _is_regression_problem(problem_type, input_tensors, output_tensors):
           len(input_tensors) == 1 and len(output_tensors) == 1)
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def get_input_alternatives(input_ops):
   """Obtain all input alternatives using the input_fn output and heuristics."""
   input_alternatives = {}
@@ -181,6 +188,7 @@ def get_input_alternatives(input_ops):
   return input_alternatives, features
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def get_output_alternatives(model_fn_ops, default_output_alternative_key=None):
   """Obtain all output alternatives using the model_fn output and heuristics.
 
@@ -246,6 +254,7 @@ def get_output_alternatives(model_fn_ops, default_output_alternative_key=None):
                        sorted(output_alternatives.keys())))
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def build_all_signature_defs(input_alternatives, output_alternatives,
                              actual_default_output_alternative_key):
   """Build `SignatureDef`s from all pairs of input and output alternatives."""
@@ -279,6 +288,7 @@ def build_all_signature_defs(input_alternatives, output_alternatives,
 MAX_DIRECTORY_CREATION_ATTEMPTS = 10
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def get_timestamped_export_dir(export_dir_base):
   """Builds a path to a new subdirectory within the base directory.
 
@@ -317,6 +327,7 @@ def get_timestamped_export_dir(export_dir_base):
                      '{} attempts.'.format(MAX_DIRECTORY_CREATION_ATTEMPTS))
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def get_temp_export_dir(timestamped_export_dir):
   """Builds a directory name based on the argument but starting with 'temp-'.
 
@@ -344,6 +355,7 @@ def _export_version_parser(path):
   return path._replace(export_version=int(filename))
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def get_most_recent_export(export_dir_base):
   """Locate the most recent SavedModel export in a directory of many exports.
 
@@ -363,6 +375,7 @@ def get_most_recent_export(export_dir_base):
   return next(iter(results or []), None)
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def garbage_collect_exports(export_dir_base, exports_to_keep):
   """Deletes older exports, retaining only a given number of the most recent.
 
@@ -387,6 +400,7 @@ def garbage_collect_exports(export_dir_base, exports_to_keep):
       logging.warn('Can not delete %s recursively: %s', p.path, e)
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def make_export_strategy(serving_input_fn,
                          default_output_alternative_key=None,
                          assets_extra=None,
@@ -400,7 +414,7 @@ def make_export_strategy(serving_input_fn,
       `InputFnOps`.
     default_output_alternative_key: the name of the head to serve when an
       incoming serving request does not explicitly request a specific head.
-      Must be `None` if the estimator inherits from ${tf.estimator.Estimator}
+      Must be `None` if the estimator inherits from @{tf.estimator.Estimator}
       or for single-headed models.
     assets_extra: A dict specifying how to populate the assets.extra directory
       within the exported SavedModel.  Each key should give the destination
@@ -438,7 +452,7 @@ def make_export_strategy(serving_input_fn,
       The string path to the exported directory.
 
     Raises:
-      ValueError: If `estimator` is a ${tf.estimator.Estimator} instance
+      ValueError: If `estimator` is a @{tf.estimator.Estimator} instance
         and `default_output_alternative_key` was specified.
     """
     if isinstance(estimator, core_estimator.Estimator):
@@ -469,6 +483,8 @@ def make_export_strategy(serving_input_fn,
   return export_strategy.ExportStrategy('Servo', export_fn, strip_default_attrs)
 
 
+@deprecated(None,
+            'Use tf.estimator.export.build_parsing_serving_input_receiver_fn')
 def make_parsing_export_strategy(feature_columns,
                                  default_output_alternative_key=None,
                                  assets_extra=None,
@@ -487,7 +503,7 @@ def make_parsing_export_strategy(feature_columns,
       that must be provided at serving time (excluding labels!).
     default_output_alternative_key: the name of the head to serve when an
       incoming serving request does not explicitly request a specific head.
-      Must be `None` if the estimator inherits from ${tf.estimator.Estimator}
+      Must be `None` if the estimator inherits from @{tf.estimator.Estimator}
       or for single-headed models.
     assets_extra: A dict specifying how to populate the assets.extra directory
       within the exported SavedModel.  Each key should give the destination
@@ -555,8 +571,14 @@ def _default_compare_fn(curr_best_eval_result, cand_eval_result):
 
 
 class BestModelSelector(object):
-  """A helper that keeps track of export selection candidates."""
+  """A helper that keeps track of export selection candidates.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
   def __init__(self, event_file_pattern=None, compare_fn=None):
     """Constructor of this class.
 
@@ -622,6 +644,7 @@ class BestModelSelector(object):
     return best_eval_result
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def make_best_model_export_strategy(
     serving_input_fn,
     exports_to_keep=1,
@@ -707,6 +730,7 @@ def make_best_model_export_strategy(
 
 # TODO(b/67013778): Revisit this approach when corresponding changes to
 # TF Core are finalized.
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def extend_export_strategy(base_export_strategy,
                            post_export_fn,
                            post_export_name=None):
@@ -741,7 +765,7 @@ def extend_export_strategy(base_export_strategy,
       The string path to the SavedModel indicated by post_export_fn.
 
     Raises:
-      ValueError: If `estimator` is a ${tf.estimator.Estimator} instance
+      ValueError: If `estimator` is a @{tf.estimator.Estimator} instance
         and `default_output_alternative_key` was specified or if post_export_fn
         does not return a valid directory.
       RuntimeError: If unable to create temporary or final export directory.
diff --git a/tensorflow/contrib/linalg/BUILD b/tensorflow/contrib/linalg/BUILD
index 208e7bc69be76680868c766bc99429eea5870c80..3bc1427bd2f05eec75441f376b7ba41bde30f544 100644
--- a/tensorflow/contrib/linalg/BUILD
+++ b/tensorflow/contrib/linalg/BUILD
@@ -43,6 +43,24 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "linear_operator_block_diag_test",
+    size = "medium",
+    srcs = ["python/kernel_tests/linear_operator_block_diag_test.py"],
+    additional_deps = [
+        ":linalg_py",
+        "//third_party/py/numpy",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+    ],
+    shard_count = 4,
+)
+
 filegroup(
     name = "all_files",
     srcs = glob(
diff --git a/tensorflow/contrib/linalg/__init__.py b/tensorflow/contrib/linalg/__init__.py
index 4720692c3384ba1bede1f486c1b1e0e69d10a63a..14cc3b2b4971de1a31960ee33c2f304154b1f411 100644
--- a/tensorflow/contrib/linalg/__init__.py
+++ b/tensorflow/contrib/linalg/__init__.py
@@ -17,6 +17,7 @@
 See the @{$python/contrib.linalg} guide.
 
 @@LinearOperator
+@@LinearOperatorBlockDiag
 @@LinearOperatorDiag
 @@LinearOperatorIdentity
 @@LinearOperatorScaledIdentity
@@ -34,6 +35,7 @@ from __future__ import print_function
 # pylint: disable=unused-import,wildcard-import,line-too-long,g-importing-member
 
 from tensorflow.contrib.linalg.python.ops.linear_operator_addition import *
+from tensorflow.contrib.linalg.python.ops.linear_operator_block_diag import *
 from tensorflow.python.ops.linalg.linear_operator import *
 from tensorflow.python.ops.linalg.linear_operator_composition import *
 from tensorflow.python.ops.linalg.linear_operator_diag import *
@@ -45,4 +47,5 @@ from tensorflow.python.ops.linalg.linear_operator_lower_triangular import *
 # pylint: enable=unused-import,wildcard-import,line-too-long,g-importing-member
 
 from tensorflow.python.util.all_util import remove_undocumented
+
 remove_undocumented(__name__)
diff --git a/tensorflow/contrib/linalg/python/kernel_tests/linear_operator_block_diag_test.py b/tensorflow/contrib/linalg/python/kernel_tests/linear_operator_block_diag_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..cc1a047d6a2b6029080fad3f240aa00f50504f07
--- /dev/null
+++ b/tensorflow/contrib/linalg/python/kernel_tests/linear_operator_block_diag_test.py
@@ -0,0 +1,253 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.linalg.python.ops import linear_operator_block_diag as block_diag
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import random_seed
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops.linalg import linalg as linalg_lib
+from tensorflow.python.ops.linalg import linear_operator_test_util
+from tensorflow.python.ops.linalg import linear_operator_util
+from tensorflow.python.platform import test
+
+linalg = linalg_lib
+random_seed.set_random_seed(23)
+rng = np.random.RandomState(0)
+
+
+def _block_diag_dense(expected_shape, blocks):
+  """Convert a list of blocks into a dense block diagonal matrix."""
+  rows = []
+  num_cols = 0
+  for block in blocks:
+    # Get the batch shape for the block.
+    batch_row_shape = array_ops.shape(block)[:-1]
+
+    zeros_to_pad_before_shape = array_ops.concat(
+        [batch_row_shape, [num_cols]], axis=-1)
+    zeros_to_pad_before = array_ops.zeros(
+        shape=zeros_to_pad_before_shape, dtype=block.dtype)
+    num_cols += array_ops.shape(block)[-1]
+    zeros_to_pad_after_shape = array_ops.concat(
+        [batch_row_shape, [expected_shape[-2] - num_cols]], axis=-1)
+    zeros_to_pad_after = array_ops.zeros(
+        zeros_to_pad_after_shape, dtype=block.dtype)
+
+    rows.append(array_ops.concat(
+        [zeros_to_pad_before, block, zeros_to_pad_after], axis=-1))
+
+  return array_ops.concat(rows, axis=-2)
+
+
+class SquareLinearOperatorBlockDiagTest(
+    linear_operator_test_util.SquareLinearOperatorDerivedClassTest):
+  """Most tests done in the base class LinearOperatorDerivedClassTest."""
+
+  def setUp(self):
+    # Increase from 1e-6 to 1e-4
+    self._atol[dtypes.float32] = 1e-4
+    self._atol[dtypes.complex64] = 1e-4
+    self._rtol[dtypes.float32] = 1e-4
+    self._rtol[dtypes.complex64] = 1e-4
+
+  @property
+  def _operator_build_infos(self):
+    build_info = linear_operator_test_util.OperatorBuildInfo
+    return [
+        build_info((0, 0)),
+        build_info((1, 1)),
+        build_info((1, 3, 3)),
+        build_info((5, 5), blocks=[(2, 2), (3, 3)]),
+    ]
+
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
+    shape = list(build_info.shape)
+    expected_blocks = (
+        build_info.__dict__["blocks"] if "blocks" in build_info.__dict__
+        else [shape])
+    matrices = [
+        linear_operator_test_util.random_positive_definite_matrix(
+            block_shape, dtype, force_well_conditioned=True)
+        for block_shape in expected_blocks
+    ]
+
+    if use_placeholder:
+      matrices_ph = [
+          array_ops.placeholder(dtype=dtype) for _ in expected_blocks
+      ]
+      # Evaluate here because (i) you cannot feed a tensor, and (ii)
+      # values are random and we want the same value used for both mat and
+      # feed_dict.
+      matrices = self.evaluate(matrices)
+      operator = block_diag.LinearOperatorBlockDiag(
+          [linalg.LinearOperatorFullMatrix(
+              m_ph, is_square=True) for m_ph in matrices_ph],
+          is_square=True)
+      feed_dict = {m_ph: m for (m_ph, m) in zip(matrices_ph, matrices)}
+    else:
+      operator = block_diag.LinearOperatorBlockDiag(
+          [linalg.LinearOperatorFullMatrix(
+              m, is_square=True) for m in matrices])
+      feed_dict = None
+      # Should be auto-set.
+      self.assertTrue(operator.is_square)
+
+    # Broadcast the shapes.
+    expected_shape = list(build_info.shape)
+
+    matrices = linear_operator_util.broadcast_matrix_batch_dims(matrices)
+
+    block_diag_dense = _block_diag_dense(expected_shape, matrices)
+
+    if not use_placeholder:
+      block_diag_dense.set_shape(
+          expected_shape[:-2] + [expected_shape[-1], expected_shape[-1]])
+
+    return operator, block_diag_dense, feed_dict
+
+  def test_is_x_flags(self):
+    # Matrix with two positive eigenvalues, 1, and 1.
+    # The matrix values do not affect auto-setting of the flags.
+    matrix = [[1., 0.], [1., 1.]]
+    operator = block_diag.LinearOperatorBlockDiag(
+        [linalg.LinearOperatorFullMatrix(matrix)],
+        is_positive_definite=True,
+        is_non_singular=True,
+        is_self_adjoint=False)
+    self.assertTrue(operator.is_positive_definite)
+    self.assertTrue(operator.is_non_singular)
+    self.assertFalse(operator.is_self_adjoint)
+
+  def test_is_non_singular_auto_set(self):
+    # Matrix with two positive eigenvalues, 11 and 8.
+    # The matrix values do not affect auto-setting of the flags.
+    matrix = [[11., 0.], [1., 8.]]
+    operator_1 = linalg.LinearOperatorFullMatrix(matrix, is_non_singular=True)
+    operator_2 = linalg.LinearOperatorFullMatrix(matrix, is_non_singular=True)
+
+    operator = block_diag.LinearOperatorBlockDiag(
+        [operator_1, operator_2],
+        is_positive_definite=False,  # No reason it HAS to be False...
+        is_non_singular=None)
+    self.assertFalse(operator.is_positive_definite)
+    self.assertTrue(operator.is_non_singular)
+
+    with self.assertRaisesRegexp(ValueError, "always non-singular"):
+      block_diag.LinearOperatorBlockDiag(
+          [operator_1, operator_2], is_non_singular=False)
+
+  def test_name(self):
+    matrix = [[11., 0.], [1., 8.]]
+    operator_1 = linalg.LinearOperatorFullMatrix(matrix, name="left")
+    operator_2 = linalg.LinearOperatorFullMatrix(matrix, name="right")
+
+    operator = block_diag.LinearOperatorBlockDiag([operator_1, operator_2])
+
+    self.assertEqual("left_ds_right", operator.name)
+
+  def test_different_dtypes_raises(self):
+    operators = [
+        linalg.LinearOperatorFullMatrix(rng.rand(2, 3, 3)),
+        linalg.LinearOperatorFullMatrix(rng.rand(2, 3, 3).astype(np.float32))
+    ]
+    with self.assertRaisesRegexp(TypeError, "same dtype"):
+      block_diag.LinearOperatorBlockDiag(operators)
+
+  def test_non_square_operator_raises(self):
+    operators = [
+        linalg.LinearOperatorFullMatrix(rng.rand(3, 4), is_square=False),
+        linalg.LinearOperatorFullMatrix(rng.rand(3, 3))
+    ]
+    with self.assertRaisesRegexp(ValueError, "square matrices"):
+      block_diag.LinearOperatorBlockDiag(operators)
+
+  def test_empty_operators_raises(self):
+    with self.assertRaisesRegexp(ValueError, "non-empty"):
+      block_diag.LinearOperatorBlockDiag([])
+
+
+# This test is for blocks with different batch dimensions.
+# LinearOperatorFullMatrix doesn't broadcast matmul/solve.
+class SquareDiagLinearOperatorBlockDiagTest(
+    linear_operator_test_util.SquareLinearOperatorDerivedClassTest):
+  """Most tests done in the base class LinearOperatorDerivedClassTest."""
+
+  def setUp(self):
+    # Increase from 1e-6 to 1e-4
+    self._atol[dtypes.float32] = 1e-4
+    self._atol[dtypes.complex64] = 1e-4
+    self._rtol[dtypes.float32] = 1e-4
+    self._rtol[dtypes.complex64] = 1e-4
+
+  @property
+  def _operator_build_infos(self):
+    build_info = linear_operator_test_util.OperatorBuildInfo
+    return [
+        build_info((3, 7, 7), blocks=[(1, 2, 2), (3, 2, 2), (1, 3, 3)]),
+        build_info((2, 1, 6, 6), blocks=[(2, 1, 2, 2), (1, 1, 4, 4)]),
+    ]
+
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
+    shape = list(build_info.shape)
+    expected_blocks = (
+        build_info.__dict__["blocks"] if "blocks" in build_info.__dict__
+        else [shape])
+    diag_matrices = [
+        linear_operator_test_util.random_uniform(
+            shape=block_shape[:-1], minval=1., maxval=20., dtype=dtype)
+        for block_shape in expected_blocks
+    ]
+
+    if use_placeholder:
+      diag_matrices_ph = [
+          array_ops.placeholder(dtype=dtype) for _ in expected_blocks
+      ]
+      diag_matrices = self.evaluate(diag_matrices)
+      # Evaluate here because (i) you cannot feed a tensor, and (ii)
+      # values are random and we want the same value used for both mat and
+      # feed_dict.
+      operator = block_diag.LinearOperatorBlockDiag(
+          [linalg.LinearOperatorDiag(m_ph) for m_ph in diag_matrices_ph])
+      feed_dict = {m_ph: m for (m_ph, m) in zip(
+          diag_matrices_ph, diag_matrices)}
+    else:
+      operator = block_diag.LinearOperatorBlockDiag(
+          [linalg.LinearOperatorDiag(m) for m in diag_matrices])
+      feed_dict = None
+      # Should be auto-set.
+      self.assertTrue(operator.is_square)
+
+    # Broadcast the shapes.
+    expected_shape = list(build_info.shape)
+
+    matrices = linear_operator_util.broadcast_matrix_batch_dims(
+        [array_ops.matrix_diag(diag_block) for diag_block in diag_matrices])
+
+    block_diag_dense = _block_diag_dense(expected_shape, matrices)
+    if not use_placeholder:
+      block_diag_dense.set_shape(
+          expected_shape[:-2] + [expected_shape[-1], expected_shape[-1]])
+
+    return operator, block_diag_dense, feed_dict
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/linalg/python/ops/linear_operator_block_diag.py b/tensorflow/contrib/linalg/python/ops/linear_operator_block_diag.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d7a99664d38eca035bd5a86710050bce4b22c1e
--- /dev/null
+++ b/tensorflow/contrib/linalg/python/ops/linear_operator_block_diag.py
@@ -0,0 +1,358 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Create a Block Diagonal operator from one or more `LinearOperators`."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import common_shapes
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops.linalg import linear_operator
+from tensorflow.python.ops.linalg import linear_operator_util
+
+
+class LinearOperatorBlockDiag(linear_operator.LinearOperator):
+  """Combines one or more `LinearOperators` into a Block Diagonal matrix.
+
+  This operator combines one or more linear operators `[op1,...,opJ]`,
+  building a new `LinearOperator`, whose underlying matrix representation is
+  square and has each operator `opi` on the main diagonal, and zeros elsewhere.
+
+  #### Shape compatibility
+
+  If `opj` acts like a [batch] square matrix `Aj`, then `op_combined` acts like
+  the [batch] square matrix formed by having each matrix `Aj` on the main
+  diagonal.
+
+
+  Each `opj` is required to represent a square matrix, and hence will have
+  shape `batch_shape_j + [M_j, M_j]`.
+
+  If `opj` has shape `batch_shape_j + [M_j, M_j]`, then the combined operator
+  has shape `broadcast_batch_shape + [sum M_j, sum M_j]`, where
+  `broadcast_batch_shape` is the mutual broadcast of `batch_shape_j`,
+  `j = 1,...,J`, assuming the intermediate batch shapes broadcast.
+  Even if the combined shape is well defined, the combined operator's
+  methods may fail due to lack of broadcasting ability in the defining
+  operators' methods.
+
+  ```python
+  # Create a 4 x 4 linear operator combined of two 2 x 2 operators.
+  operator_1 = LinearOperatorFullMatrix([[1., 2.], [3., 4.]])
+  operator_2 = LinearOperatorFullMatrix([[1., 0.], [0., 1.]])
+  operator = LinearOperatorBlockDiag([operator_1, operator_2])
+
+  operator.to_dense()
+  ==> [[1., 2., 0., 0.],
+       [3., 4., 0., 0.],
+       [0., 0., 1., 0.],
+       [0., 0., 0., 1.]]
+
+  operator.shape
+  ==> [4, 4]
+
+  operator.log_abs_determinant()
+  ==> scalar Tensor
+
+  x1 = ... # Shape [2, 2] Tensor
+  x2 = ... # Shape [2, 2] Tensor
+  x = tf.concat([x1, x2], 0)  # Shape [4, 2] Tensor
+  operator.matmul(x)
+  ==> tf.concat([operator_1.matmul(x1), operator_2.matmul(x2)], 0)
+
+  # Create a [2, 3] batch of 4 x 4 linear operators.
+  matrix_44 = tf.random_normal(shape=[2, 3, 4, 4])
+  operator_44 = LinearOperatorFullMatrix(matrix_44)
+
+  # Create a [1, 3] batch of 5 x 5 linear operators.
+  matrix_55 = tf.random_normal(shape=[1, 3, 5, 5])
+  operator_55 = LinearOperatorFullMatrix(matrix_55)
+
+  # Combine to create a [2, 3] batch of 9 x 9 operators.
+  operator_99 = LinearOperatorBlockDiag([operator_44, operator_55])
+
+  # Create a shape [2, 3, 9] vector.
+  x = tf.random_normal(shape=[2, 3, 9])
+  operator_99.matvec(x)
+  ==> Shape [2, 3, 9] Tensor
+  ```
+
+  #### Performance
+
+  The performance of `LinearOperatorBlockDiag` on any operation is equal to
+  the sum of the individual operators' operations.
+
+
+  #### Matrix property hints
+
+  This `LinearOperator` is initialized with boolean flags of the form `is_X`,
+  for `X = non_singular, self_adjoint, positive_definite, square`.
+  These have the following meaning:
+
+  * If `is_X == True`, callers should expect the operator to have the
+    property `X`.  This is a promise that should be fulfilled, but is *not* a
+    runtime assert.  For example, finite floating point precision may result
+    in these promises being violated.
+  * If `is_X == False`, callers should expect the operator to not have `X`.
+  * If `is_X == None` (the default), callers should have no expectation either
+    way.
+  """
+
+  def __init__(self,
+               operators,
+               is_non_singular=None,
+               is_self_adjoint=None,
+               is_positive_definite=None,
+               is_square=True,
+               name=None):
+    r"""Initialize a `LinearOperatorBlockDiag`.
+
+    `LinearOperatorBlockDiag` is initialized with a list of operators
+    `[op_1,...,op_J]`.
+
+    Args:
+      operators:  Iterable of `LinearOperator` objects, each with
+        the same `dtype` and composable shape.
+      is_non_singular:  Expect that this operator is non-singular.
+      is_self_adjoint:  Expect that this operator is equal to its hermitian
+        transpose.
+      is_positive_definite:  Expect that this operator is positive definite,
+        meaning the quadratic form `x^H A x` has positive real part for all
+        nonzero `x`.  Note that we do not require the operator to be
+        self-adjoint to be positive-definite.  See:
+        https://en.wikipedia.org/wiki/Positive-definite_matrix\
+            #Extension_for_non_symmetric_matrices
+      is_square:  Expect that this operator acts like square [batch] matrices.
+        This is true by default, and will raise a `ValueError` otherwise.
+      name: A name for this `LinearOperator`.  Default is the individual
+        operators' names joined with `_ds_`.
+
+    Raises:
+      TypeError:  If the operators do not all have the same `dtype`.
+      ValueError:  If `operators` is empty or any operator is non-square.
+    """
+    # Validate operators.
+    check_ops.assert_proper_iterable(operators)
+    operators = list(operators)
+    if not operators:
+      raise ValueError(
+          "Expected a non-empty list of operators. Found: %s" % operators)
+    self._operators = operators
+
+    # Validate dtype.
+    dtype = operators[0].dtype
+    for operator in operators:
+      if operator.dtype != dtype:
+        name_type = (str((o.name, o.dtype)) for o in operators)
+        raise TypeError(
+            "Expected all operators to have the same dtype.  Found %s"
+            % "   ".join(name_type))
+
+    # Auto-set and check hints.
+    if all(operator.is_non_singular for operator in operators):
+      if is_non_singular is False:
+        raise ValueError(
+            "The direct sum of non-singular operators is always non-singular.")
+      is_non_singular = True
+
+    if all(operator.is_self_adjoint for operator in operators):
+      if is_self_adjoint is False:
+        raise ValueError(
+            "The direct sum of self-adjoint operators is always self-adjoint.")
+      is_self_adjoint = True
+
+    if all(operator.is_positive_definite for operator in operators):
+      if is_positive_definite is False:
+        raise ValueError(
+            "The direct sum of positive definite operators is always "
+            "positive definite.")
+      is_positive_definite = True
+
+    if not (is_square and all(operator.is_square for operator in operators)):
+      raise ValueError(
+          "Can only represent a block diagonal of square matrices.")
+
+    # Initialization.
+    graph_parents = []
+    for operator in operators:
+      graph_parents.extend(operator.graph_parents)
+
+    if name is None:
+      # Using ds to mean direct sum.
+      name = "_ds_".join(operator.name for operator in operators)
+    with ops.name_scope(name, values=graph_parents):
+      super(LinearOperatorBlockDiag, self).__init__(
+          dtype=dtype,
+          graph_parents=graph_parents,
+          is_non_singular=is_non_singular,
+          is_self_adjoint=is_self_adjoint,
+          is_positive_definite=is_positive_definite,
+          is_square=True,
+          name=name)
+
+  @property
+  def operators(self):
+    return self._operators
+
+  def _shape(self):
+    # Get final matrix shape.
+    domain_dimension = self.operators[0].domain_dimension
+    range_dimension = self.operators[0].range_dimension
+    for operator in self.operators[1:]:
+      domain_dimension += operator.domain_dimension
+      range_dimension += operator.range_dimension
+
+    matrix_shape = tensor_shape.TensorShape([domain_dimension, range_dimension])
+
+    # Get broadcast batch shape.
+    # broadcast_shape checks for compatibility.
+    batch_shape = self.operators[0].batch_shape
+    for operator in self.operators[1:]:
+      batch_shape = common_shapes.broadcast_shape(
+          batch_shape, operator.batch_shape)
+
+    return batch_shape.concatenate(matrix_shape)
+
+  def _shape_tensor(self):
+    # Avoid messy broadcasting if possible.
+    if self.shape.is_fully_defined():
+      return ops.convert_to_tensor(
+          self.shape.as_list(), dtype=dtypes.int32, name="shape")
+
+    domain_dimension = self.operators[0].domain_dimension_tensor()
+    range_dimension = self.operators[0].range_dimension_tensor()
+    for operator in self.operators[1:]:
+      domain_dimension += operator.domain_dimension_tensor()
+      range_dimension += operator.range_dimension_tensor()
+
+    matrix_shape = array_ops.stack([domain_dimension, range_dimension])
+
+    # Dummy Tensor of zeros.  Will never be materialized.
+    zeros = array_ops.zeros(shape=self.operators[0].batch_shape_tensor())
+    for operator in self.operators[1:]:
+      zeros += array_ops.zeros(shape=operator.batch_shape_tensor())
+    batch_shape = array_ops.shape(zeros)
+
+    return array_ops.concat((batch_shape, matrix_shape), 0)
+
+  def _matmul(self, x, adjoint=False, adjoint_arg=False):
+    split_dim = -1 if adjoint_arg else -2
+    # Split input by rows normally, or by columns if `adjoint_arg` is set.
+    split_x = self._split_input_into_blocks(x, axis=split_dim)
+
+    result_list = []
+    for index, operator in enumerate(self.operators):
+      result_list += [operator.matmul(
+          split_x[index], adjoint=adjoint, adjoint_arg=adjoint_arg)]
+    result_list = linear_operator_util.broadcast_matrix_batch_dims(
+        result_list)
+    return array_ops.concat(result_list, axis=-2)
+
+  def _determinant(self):
+    result = self.operators[0].determinant()
+    for operator in self.operators[1:]:
+      result *= operator.determinant()
+    return result
+
+  def _log_abs_determinant(self):
+    result = self.operators[0].log_abs_determinant()
+    for operator in self.operators[1:]:
+      result += operator.log_abs_determinant()
+    return result
+
+  def _solve(self, rhs, adjoint=False, adjoint_arg=False):
+    split_dim = -1 if adjoint_arg else -2
+    # Split input by rows normally, or by columns if `adjoint_arg` is set.
+    split_rhs = self._split_input_into_blocks(rhs, axis=split_dim)
+
+    solution_list = []
+    for index, operator in enumerate(self.operators):
+      solution_list += [operator.solve(
+          split_rhs[index], adjoint=adjoint, adjoint_arg=adjoint_arg)]
+
+    solution_list = linear_operator_util.broadcast_matrix_batch_dims(
+        solution_list)
+    return array_ops.concat(solution_list, axis=-2)
+
+  def _diag_part(self):
+    diag_list = []
+    for operator in self.operators:
+      # Extend the axis for broadcasting.
+      diag_list += [operator.diag_part()[..., array_ops.newaxis]]
+    diag_list = linear_operator_util.broadcast_matrix_batch_dims(diag_list)
+    diagonal = array_ops.concat(diag_list, axis=-2)
+    return array_ops.squeeze(diagonal, axis=-1)
+
+  def _trace(self):
+    result = self.operators[0].trace()
+    for operator in self.operators[1:]:
+      result += operator.trace()
+    return result
+
+  def _to_dense(self):
+    num_cols = 0
+    rows = []
+    broadcasted_blocks = [operator.to_dense() for operator in self.operators]
+    broadcasted_blocks = linear_operator_util.broadcast_matrix_batch_dims(
+        broadcasted_blocks)
+    for block in broadcasted_blocks:
+      batch_row_shape = array_ops.shape(block)[:-1]
+
+      zeros_to_pad_before_shape = array_ops.concat(
+          [batch_row_shape, [num_cols]], axis=-1)
+      zeros_to_pad_before = array_ops.zeros(
+          shape=zeros_to_pad_before_shape, dtype=block.dtype)
+      num_cols += array_ops.shape(block)[-1]
+      zeros_to_pad_after_shape = array_ops.concat(
+          [batch_row_shape,
+           [self.domain_dimension_tensor() - num_cols]], axis=-1)
+      zeros_to_pad_after = array_ops.zeros(
+          shape=zeros_to_pad_after_shape, dtype=block.dtype)
+
+      rows.append(array_ops.concat(
+          [zeros_to_pad_before, block, zeros_to_pad_after], axis=-1))
+
+    mat = array_ops.concat(rows, axis=-2)
+    mat.set_shape(self.shape)
+    return mat
+
+  def _split_input_into_blocks(self, x, axis=-1):
+    """Split `x` into blocks matching `operators`'s `domain_dimension`.
+
+    Specifically, if we have a block diagonal matrix, with block sizes
+    `[M_j, M_j] j = 1..J`,  this method splits `x` on `axis` into `J`
+    tensors, whose shape at `axis` is `M_j`.
+
+    Args:
+      x: `Tensor`. `x` is split into `J` tensors.
+      axis: Python `Integer` representing the axis to split `x` on.
+
+    Returns:
+      A list of `Tensor`s.
+    """
+    block_sizes = []
+    if self.shape.is_fully_defined():
+      for operator in self.operators:
+        block_sizes += [operator.domain_dimension.value]
+    else:
+      for operator in self.operators:
+        block_sizes += [operator.domain_dimension_tensor()]
+
+    return array_ops.split(x, block_sizes, axis=axis)
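Review note on `linear_operator_block_diag.py`: the `_to_dense` and `_matmul` methods above rely on the fact that multiplying by a block diagonal matrix is the same as splitting the rows of the argument by block size, multiplying each piece by its own block, and stacking the results. The following is a minimal pure-Python sketch of that identity, not the TensorFlow implementation. The helper names (`block_diag_dense`, `matmul`, `block_diag_matmul`) are hypothetical, and the sketch assumes unbatched square blocks given as lists of rows, so it ignores batch dimensions and broadcasting.

```python
def block_diag_dense(blocks):
    """Place square blocks on the main diagonal of a zero matrix."""
    total = sum(len(block) for block in blocks)
    dense = []
    offset = 0  # Number of zero columns to the left of the current block.
    for block in blocks:
        size = len(block)
        for row in block:
            padding_after = total - offset - size
            dense.append([0.0] * offset + list(row) + [0.0] * padding_after)
        offset += size
    return dense


def matmul(a, b):
    """Plain dense matrix product of two matrices given as lists of rows."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(a_row, b_col))
             for b_col in zip(*b)]
            for a_row in a]


def block_diag_matmul(blocks, x):
    """Multiply blockwise: split the rows of x by block size, then stack."""
    result = []
    start = 0
    for block in blocks:
        size = len(block)
        result.extend(matmul(block, x[start:start + size]))
        start += size
    return result


# The same two 2 x 2 blocks used in the class docstring example.
blocks = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]]
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # Shape [4, 2].

dense = block_diag_dense(blocks)
# The blockwise product agrees with the product against the dense matrix.
assert block_diag_matmul(blocks, x) == matmul(dense, x)
```

The blockwise form never materializes the zero padding, which is why the operator can delegate each piece to the corresponding sub-operator instead of building the full matrix.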
diff --git a/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py b/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py
index 70f777f08bd5b8157e601f19019075d3e7543811..cfe62fac43b35d863eb559b95057ae62a41bed49 100644
--- a/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py
+++ b/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py
@@ -270,14 +270,14 @@ class SdcaWithLogisticLossTest(SdcaModelTest):
 
           train_op = lr.minimize()
 
-          def Minimize():
+          def minimize():
             with self._single_threaded_test_session():
               for _ in range(_MAX_ITERATIONS):
-                train_op.run()
+                train_op.run()  # pylint: disable=cell-var-from-loop
 
           threads = []
           for _ in range(num_loss_partitions):
-            threads.append(threading.Thread(target=Minimize))
+            threads.append(threading.Thread(target=minimize))
             threads[-1].start()
 
           for t in threads:
@@ -395,7 +395,7 @@ class SdcaWithLogisticLossTest(SdcaModelTest):
         predicted_labels = get_binary_predictions_for_logistic(predictions)
         self.assertAllClose([0, 1, 1, 1], predicted_labels.eval())
         self.assertAllClose(
-            0.01, lr.approximate_duality_gap().eval(), rtol=1e-2, atol=1e-2)
+            0.0, lr.approximate_duality_gap().eval(), rtol=1e-2, atol=1e-2)
 
   def testFractionalExampleLabel(self):
     # Setup test data with 1 positive, and 1 mostly-negative example.
@@ -407,7 +407,7 @@ class SdcaWithLogisticLossTest(SdcaModelTest):
         make_example_proto({
             'age': [1],
             'gender': [1]
-        }, 1),
+        }, 0.9),
     ]
     example_weights = [1.0, 1.0]
     for num_shards in _SHARD_NUMBERS:
diff --git a/tensorflow/contrib/lite/BUILD b/tensorflow/contrib/lite/BUILD
index 44c4a7e2ca8d019ca602c7f2b492cd1e70b17561..dafe6f136ef671ffc43448ec06f90f24baa72c9c 100644
--- a/tensorflow/contrib/lite/BUILD
+++ b/tensorflow/contrib/lite/BUILD
@@ -132,6 +132,7 @@ cc_library(
         ":memory_planner",
         ":schema_fbs_version",
         ":simple_memory_arena",
+        ":util",
         "//tensorflow/contrib/lite/kernels:gemm_support",
         "//tensorflow/contrib/lite/nnapi:nnapi_lib",
         "//tensorflow/contrib/lite/schema:schema_fbs",
@@ -169,6 +170,7 @@ cc_test(
     deps = [
         ":framework",
         ":string_util",
+        "//tensorflow/contrib/lite/kernels:kernel_util",
         "//tensorflow/contrib/lite/kernels/internal:tensor_utils",
         "//tensorflow/contrib/lite/schema:schema_fbs",
         "//tensorflow/contrib/lite/testing:util",
@@ -232,6 +234,27 @@ cc_test(
     ],
 )
 
+cc_library(
+    name = "util",
+    srcs = ["util.cc"],
+    hdrs = ["util.h"],
+    deps = [
+        ":context",
+    ],
+)
+
+cc_test(
+    name = "util_test",
+    size = "small",
+    srcs = ["util_test.cc"],
+    deps = [
+        ":context",
+        ":util",
+        "//tensorflow/contrib/lite/testing:util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
 # Test the serialization of a model with optional tensors.
 
 # Model tests
diff --git a/tensorflow/contrib/lite/Makefile b/tensorflow/contrib/lite/Makefile
index 7f316292724ea0baaf034d4e914773ad97a957d4..b4504f246a0f806d35d8c3d659717a86d2f2a4f5 100644
--- a/tensorflow/contrib/lite/Makefile
+++ b/tensorflow/contrib/lite/Makefile
@@ -27,10 +27,10 @@ LIBDIR := $(MAKEFILE_DIR)/gen/lib/
 GENDIR := $(MAKEFILE_DIR)/gen/obj/
 
 # Settings for the host compiler.
-CXX := $(CC_PREFIX) gcc
+CXX := $(CC_PREFIX)gcc
 CXXFLAGS := --std=c++11 -O3 -DNDEBUG
-CC := $(CC_PREFIX) gcc
-CFLAGS :=
+CC := $(CC_PREFIX)gcc
+CFLAGS := -O3 -DNDEBUG
 LDOPTS :=
 LDOPTS += -L/usr/local/lib
 ARFLAGS := -r
@@ -57,10 +57,11 @@ LIBS := \
 
 # If we're on Linux, also link in the dl library.
 ifeq ($(HOST_OS),LINUX)
-	LIBS += -ldl -lpthread
+	LIBS += -ldl
 endif
 
 include $(MAKEFILE_DIR)/ios_makefile.inc
+include $(MAKEFILE_DIR)/rpi_makefile.inc
 
 # This library is the main target for this makefile. It will contain a minimal
 # runtime that can be linked in to other programs.
diff --git a/tensorflow/contrib/lite/README.md b/tensorflow/contrib/lite/README.md
index 00e93d2c4f3ab27057b855fba6fccf2ec8d7a1c1..2680d515ebee941f6b80513b241cf520adc7e510 100644
--- a/tensorflow/contrib/lite/README.md
+++ b/tensorflow/contrib/lite/README.md
@@ -91,7 +91,7 @@ Currently, we only support building the Android demo app within a Python 2
 environment (due to a Bazel bug).
 
 ### More about the demo
-The demo is resizing each camera image frame to (224 width * 224 height) to match the quantized Mobilenet model being used (229 * 229 for Inception-v3). The resized image is converted into a ByteBuffer row by row of size 1 * 224 * 224 * 3 bytes, where 1 is the number of images in a batch. 224 * 224 (299 * 299) is the width and height of the image. 3 bytes represents three colors of a pixel. This demo uses the TensorFlow Lite Java inference API for models which take a single input and provide a single output. This outputs a two-dimensional array, with the first dimension being the category index and the second dimension being the confidence of classification. Both models have 1001 unique categories and the app sorts the probabilities of all the categories and displays the top three. The model file must be downloaded and bundled within the assets directory of the app.
+The demo resizes each camera image frame to (224 width * 224 height) to match the quantized Mobilenet model being used (299 * 299 for Inception-v3). The resized image is converted row by row into a ByteBuffer of size 1 * 224 * 224 * 3 bytes, where 1 is the number of images in a batch, 224 * 224 (299 * 299) is the width and height of the image, and 3 bytes represent the three colors of a pixel. This demo uses the TensorFlow Lite Java inference API for models which take a single input and provide a single output. This outputs a two-dimensional array, with the first dimension being the category index and the second dimension being the confidence of classification. Both models have 1001 unique categories and the app sorts the probabilities of all the categories and displays the top three. The model file must be downloaded and bundled within the assets directory of the app.
 
 # iOS Demo App
 
@@ -99,7 +99,7 @@ Similar to the Android demo app, there's an iOS camera app that uses exactly the
 
 This demo app requires a camera so it doesn't work with simulators. It need to be executed on a real iOS device. Follow the instructions to build and run the demo app:
 
-1.   Run `third_party/tensorflow/contrib/lite/examples/ios/download_models.sh` to download the model files used by the demo app.
+1.   Run `tensorflow/contrib/lite/examples/ios/download_models.sh` to download the model files used by the demo app.
 1.   Install [CocoaPods](https://cocoapods.org/) if it wasn't installed yet: `sudo gem install cocoapods`.
 1.   Run `pod install` in `tensorflow/contrib/lite/examples/ios/camera` to generate the workspace file.
 1.   Open the project by running `open tflite_camera_example.xcworkspace`, and build the app in XCode.
@@ -165,7 +165,7 @@ bazel-bin/tensorflow/python/tools/freeze_graph\
     --input_graph=/tmp/mobilenet_v1_224.pb \
     --input_checkpoint=/tmp/checkpoints/mobilenet-10202.ckpt \
     --input_binary=true --output_graph=/tmp/frozen_mobilenet_v1_224.pb \
-    --output_node_names=MobileNet/Predictions/Reshape_1
+    --output_node_names=MobilenetV1/Predictions/Reshape_1
 ```
 
 The user has to first build the freeze_graph script using bazel and then run the script.  The input_binary flag has to be enabled to ensure that the protobuf is read and written in binary format.  The user has to input the .pb and the .ckpt files to freeze the graph The output_node_names may not be obvious outside of the code that built the model. The easiest way to find them is to visualize the graph, either with
diff --git a/tensorflow/contrib/lite/arena_planner.h b/tensorflow/contrib/lite/arena_planner.h
index 58bc164619c2c053b9492e9a0e5de2da30e199af..f84b3dad9550e789237c8e45971002c7d336b9d3 100644
--- a/tensorflow/contrib/lite/arena_planner.h
+++ b/tensorflow/contrib/lite/arena_planner.h
@@ -33,7 +33,7 @@ class AllocationInfo;
 // each tensor needs to be allocated and deallocated, and preallocates all the
 // necessary memory (the PlanAllocations phase). It then assigns portions of
 // this memory buffer to each tensor (the ExecuteAllocations phase). Tensors may
-// share some of the bufer if a tensor B is to be allocated after another tensor
+// share some of the buffer if a tensor B is to be allocated after another tensor
 // A has been deallocated.
 //
 // If dynamic tensors are used the planning steps can be repeated during model
diff --git a/tensorflow/contrib/lite/build_def.bzl b/tensorflow/contrib/lite/build_def.bzl
index 19829e4991651111e13fc1805f97daef8bc016a7..2813d1c347163e67c70983d3dd49773f4a4b4544 100644
--- a/tensorflow/contrib/lite/build_def.bzl
+++ b/tensorflow/contrib/lite/build_def.bzl
@@ -104,7 +104,7 @@ def tflite_jni_binary(name,
   """Builds a jni binary for TFLite."""
   linkopts = linkopts + [
       "-Wl,--version-script",  # Export only jni functions & classes.
-      linkscript,
+      "$(location {})".format(linkscript),
   ]
   native.cc_binary(
       name=name,
diff --git a/tensorflow/tools/integration_tests/gcs_smoke_test/setup.sh b/tensorflow/contrib/lite/build_rpi_lib.sh
similarity index 69%
rename from tensorflow/tools/integration_tests/gcs_smoke_test/setup.sh
rename to tensorflow/contrib/lite/build_rpi_lib.sh
index 6553ba5e3093c26d3c95f40216cd3922a1fb9e4e..3824b16412ed26a6cab79df3242da6017c3322b0 100755
--- a/tensorflow/tools/integration_tests/gcs_smoke_test/setup.sh
+++ b/tensorflow/contrib/lite/build_rpi_lib.sh
@@ -1,5 +1,5 @@
-#!/bin/bash
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#!/bin/bash -x
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -13,8 +13,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-GCS_NUMBER=$(cat /dev/urandom | tr -dc 'A-F0-9' | fold -w 8 | head -n 1)
-GCS_PATH="$1"/"$GCS_NUMBER".tfrecord
 
-echo "gcs_path=$GCS_PATH" > "$_SETUP_OUTPUT"
-touch "$_SETUP_DONE"
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR/../../.."
+
+CC_PREFIX=arm-linux-gnueabihf- make -j 3 -f tensorflow/contrib/lite/Makefile TARGET=RPI TARGET_ARCH=armv7
diff --git a/tensorflow/contrib/lite/builtin_ops.h b/tensorflow/contrib/lite/builtin_ops.h
index 88cdf1d46312f1e610825f23f3d8d357b0762bac..d7993e60cc77839b823e17ce11f8a57d3e0972db 100644
--- a/tensorflow/contrib/lite/builtin_ops.h
+++ b/tensorflow/contrib/lite/builtin_ops.h
@@ -24,14 +24,14 @@ extern "C" {
 #endif  // __cplusplus
 
 // The enum for builtin operators.
-// Note: CUSTOM and DELEGATE are 2 special ops which are not real biultin
-// ops.
+// Note: CUSTOM and DELEGATE are 2 special ops which are not real built-in ops.
 typedef enum {
   kTfLiteBuiltinAdd = 0,
   kTfLiteBuiltinAveragePool2d = 1,
   kTfLiteBuiltinConcatenation = 2,
   kTfLiteBuiltinConv2d = 3,
   kTfLiteBuiltinDepthwiseConv2d = 4,
+  kTfLiteBuiltinDequantize = 6,
   kTfLiteBuiltinEmbeddingLookup = 7,
   kTfLiteBuiltinFullyConnected = 9,
   kTfLiteBuiltinHashtableLookup = 10,
@@ -77,6 +77,8 @@ typedef enum {
   kTfLiteBuiltinLogSoftmax = 50,
   kTfLiteBuiltinDelegate = 51,
   kTfLiteBuiltinBidirectionalSequenceLstm = 52,
+  kTfLiteBuiltinCast = 53,
+  kTfLiteBuiltinPrelu = 54,
 } TfLiteBuiltinOperator;
 
 #ifdef __cplusplus
diff --git a/tensorflow/contrib/lite/context.c b/tensorflow/contrib/lite/context.c
index c09e838c5c2e50e0f4a38eaf66e55246fd9a6f7f..5c6f5e72a47180cd98be46f60cfa8eaf28197806 100644
--- a/tensorflow/contrib/lite/context.c
+++ b/tensorflow/contrib/lite/context.c
@@ -17,9 +17,14 @@ limitations under the License.
 #include <stdio.h>
 #include <string.h>
 
+int TfLiteIntArrayGetSizeInBytes(int size) {
+  static TfLiteIntArray dummy;
+  return sizeof(dummy) + sizeof(dummy.data[0]) * size;
+}
+
 TfLiteIntArray* TfLiteIntArrayCreate(int size) {
   TfLiteIntArray* ret =
-      (TfLiteIntArray*)malloc(sizeof(*ret) + sizeof(ret->data[0]) * size);
+      (TfLiteIntArray*)malloc(TfLiteIntArrayGetSizeInBytes(size));
   ret->size = size;
   return ret;
 }
@@ -55,12 +60,16 @@ TfLiteIntArray* TfLiteIntArrayCopy(TfLiteIntArray* src) {
 
 void TfLiteIntArrayFree(TfLiteIntArray* a) { free(a); }
 
-void TfLiteTensorFree(TfLiteTensor* t) {
+void TfLiteTensorDataFree(TfLiteTensor* t) {
   if (t->allocation_type == kTfLiteDynamic && t->data.raw) {
     free(t->data.raw);
   }
-  if (t->dims) TfLiteIntArrayFree(t->dims);
   t->data.raw = NULL;
+}
+
+void TfLiteTensorFree(TfLiteTensor* t) {
+  TfLiteTensorDataFree(t);
+  if (t->dims) TfLiteIntArrayFree(t->dims);
   t->dims = NULL;
 }
 
diff --git a/tensorflow/contrib/lite/context.h b/tensorflow/contrib/lite/context.h
index ed7f4515fa4437d61a37be93616c28a046295c5a..45184b05ecefb504c75815ae900f3b605359a443 100644
--- a/tensorflow/contrib/lite/context.h
+++ b/tensorflow/contrib/lite/context.h
@@ -29,6 +29,7 @@ limitations under the License.
 #ifndef TENSORFLOW_CONTRIB_LITE_CONTEXT_H_
 #define TENSORFLOW_CONTRIB_LITE_CONTEXT_H_
 
+#include <stdbool.h>
 #include <stdint.h>
 #include <stdlib.h>
 
@@ -40,6 +41,7 @@ typedef enum { kTfLiteOk = 0, kTfLiteError = 1 } TfLiteStatus;
 
 // Forward declare so GetNode can use this is in Context.
 typedef struct _TfLiteRegistration TfLiteRegistration;
+typedef struct _TfLiteDelegate TfLiteDelegate;
 
 #define kOptionalTensor (-1)
 
@@ -57,6 +59,10 @@ typedef struct {
 #endif
 } TfLiteIntArray;
 
+// Given the size (number of elements) in a TfLiteIntArray, calculate its size
+// in bytes.
+int TfLiteIntArrayGetSizeInBytes(int size);
+
 // Create a array of a given `size` (uninitialized entries).
 // This returns a pointer, that you must free using TfLiteIntArrayFree().
 TfLiteIntArray* TfLiteIntArrayCreate(int size);
@@ -162,6 +168,11 @@ typedef enum {
   kTfLiteDynamic,
 } TfLiteAllocationType;
 
+// The delegates should use zero or positive integers to represent handles.
+// -1 is reserved for unallocated status.
+typedef int TfLiteBufferHandle;
+const TfLiteBufferHandle kTfLiteNullBufferHandle = -1;
+
 // An tensor in the interpreter system which is a wrapper around a buffer of
 // data including a dimensionality (or NULL if not currently defined).
 typedef struct {
@@ -194,8 +205,27 @@ typedef struct {
 
   // Null-terminated name of this tensor.
   const char* name;
+
+  // The delegate which knows how to handle `buffer_handle`.
+  // WARNING: This is an experimental interface that is subject to change.
+  TfLiteDelegate* delegate;
+
+  // An integer buffer handle that can be handled by `delegate`.
+  // The value is valid only when delegate is not null.
+  // WARNING: This is an experimental interface that is subject to change.
+  TfLiteBufferHandle buffer_handle;
+
+  // If the delegate uses its own buffer (e.g. GPU memory), the delegate is
+  // responsible for setting data_is_stale to true.
+  // `delegate->CopyFromBufferHandle` can be called to copy the data from
+  // delegate buffer.
+  // WARNING: This is an experimental interface that is subject to change.
+  bool data_is_stale;
 } TfLiteTensor;
 
+// Free data memory of tensor `t`;
+void TfLiteTensorDataFree(TfLiteTensor* t);
+
 // Free memory of tensor `t`;
 void TfLiteTensorFree(TfLiteTensor* t);
 
@@ -234,6 +264,11 @@ typedef struct {
   // WARNING: This is an experimental interface that is subject to change.
   const void* custom_initial_data;
   int custom_initial_data_size;
+
+  // The pointer to the delegate. This is non-null only when the node is
+  // created by calling `interpreter.ModifyGraphWithDelegate`.
+  // WARNING: This is an experimental interface that is subject to change.
+  TfLiteDelegate* delegate;
 } TfLiteNode;
 
 typedef struct TfLiteContext {
@@ -287,11 +322,16 @@ typedef struct TfLiteContext {
   // does not take ownership of `nodes_to_replace`.
   TfLiteStatus (*ReplaceSubgraphsWithDelegateKernels)(
       struct TfLiteContext*, TfLiteRegistration registration,
-      const TfLiteIntArray* nodes_to_replace);
+      const TfLiteIntArray* nodes_to_replace, TfLiteDelegate* delegate);
+
+  // Number of threads that are recommended to subsystems like gemmlowp and
+  // eigen.
+  int recommended_num_threads;
 
   // TODO(ahentz): we should create a more general mechanism for this sort of
   // library-global objects.
   void* gemm_context;
+  void* eigen_context;
 } TfLiteContext;
 
 typedef struct _TfLiteRegistration {
@@ -338,19 +378,47 @@ typedef struct _TfLiteRegistration {
 } TfLiteRegistration;
 
 // WARNING: This is an experimental interface that is subject to change.
-typedef struct {
+typedef struct _TfLiteDelegate {
   // Data that delegate needs to identify itself. This data is owned by the
   // delegate. The delegate is owned in the user code, so the delegate is
   // responsible for doing this when it is destroyed.
   void* data_;
+
   // Invoked by ModifyGraphWithDelegate. This prepare is called, giving the
   // delegate a view of the current graph through TfLiteContext*. It typically
   // will look at the nodes and call ReplaceSubgraphsWithDelegateKernels()
   // to ask the TensorFlow lite runtime to create macro-nodes to represent
   // delegated subgraphs of the original graph.
-  TfLiteStatus (*Prepare)(TfLiteContext* context, void* data);
+  TfLiteStatus (*Prepare)(TfLiteContext* context, TfLiteDelegate* delegate);
+
+  // Copy the data from delegate buffer handle to raw memory.
+  // This can be null if the delegate doesn't use its own buffer.
+  TfLiteStatus (*CopyFromBufferHandle)(TfLiteDelegate* delegate,
+                                       TfLiteBufferHandle buffer_handle,
+                                       void* data, int size);
+
+  // Copy the data from raw memory to delegate buffer handle.
+  // This can be null if the delegate doesn't use its own buffer.
+  TfLiteStatus (*CopyToBufferHandle)(TfLiteDelegate* delegate,
+                                     TfLiteBufferHandle buffer_handle,
+                                     void* data, int size);
+
+  // Free the Delegate Buffer Handle. Note: This only frees the handle, but
+  // this doesn't release the underlying resource (e.g. textures). The
+  // resources are either owned by application layer or the delegate.
+  // This can be null if the delegate doesn't use its own buffer.
+  void (*FreeBufferHandle)(TfLiteDelegate* delegate,
+                           TfLiteBufferHandle* handle);
 } TfLiteDelegate;
 
+// WARNING: This is an experimental interface that is subject to change.
+typedef struct {
+  TfLiteDelegate* delegate;
+  TfLiteIntArray* nodes_to_replace;
+  TfLiteIntArray* input_tensors;
+  TfLiteIntArray* output_tensors;
+} TfLiteDelegateParams;
+
 #ifdef __cplusplus
 }  // extern "C"
 #endif  // __cplusplus
diff --git a/tensorflow/contrib/lite/error_reporter.h b/tensorflow/contrib/lite/error_reporter.h
index da193d2586e9123341b9a41be049ee2a4382017a..3c5f805f12f6a1fb7185c140604f692ac282a143 100644
--- a/tensorflow/contrib/lite/error_reporter.h
+++ b/tensorflow/contrib/lite/error_reporter.h
@@ -30,7 +30,7 @@ namespace tflite {
 //  va_list args;
 //  foo.Report("test %d", args); // where args is va_list
 //
-// Sublclass ErrorReporter to provide another reporting destination.
+// Subclass ErrorReporter to provide another reporting destination.
 // For example, if you have a GUI program, you might redirect to a buffer
 // that drives a GUI error log box.
 class ErrorReporter {
diff --git a/tensorflow/contrib/lite/g3doc/ios.md b/tensorflow/contrib/lite/g3doc/ios.md
index a359b8d4b481dbc15cc86db14eabda5433722b8b..e0358a444d6dffc377bf13ee72ba5477359d6e07 100644
--- a/tensorflow/contrib/lite/g3doc/ios.md
+++ b/tensorflow/contrib/lite/g3doc/ios.md
@@ -22,6 +22,15 @@ Then install
 brew install automake
 brew install libtool
 ```
+If you get an error where either automake or libtool install but do not link correctly, you'll first need to:
+```bash
+sudo chown -R $(whoami) /usr/local/*
+```
+Then follow the instructions to perform the linking:
+```bash
+brew link automake
+brew link libtool
+```
 
 Then you need to run a shell script to download the dependencies you need:
 
diff --git a/tensorflow/contrib/lite/g3doc/models.md b/tensorflow/contrib/lite/g3doc/models.md
index 5b393140d61544e6d6e40d4b6ee1872b22cc84b2..48f43d4fc460a3a5307c5ee1f5e096a409a46af5 100644
--- a/tensorflow/contrib/lite/g3doc/models.md
+++ b/tensorflow/contrib/lite/g3doc/models.md
@@ -1,4 +1,4 @@
-#List of Hosted Models
+# List of Hosted Models
 
 *   [Inception V3 2015](https://storage.googleapis.com/download.tensorflow.org/models/tflite/inception_v3_2015_2017_11_10.zip)
 *   [Inception V3 Slim 2016](https://storage.googleapis.com/download.tensorflow.org/models/tflite/inception_v3_slim_2016_android_2017_11_10.zip)
diff --git a/tensorflow/contrib/lite/g3doc/rpi.md b/tensorflow/contrib/lite/g3doc/rpi.md
new file mode 100644
index 0000000000000000000000000000000000000000..7a3a231626d0e1c71e474ff4ff16789ebe2901db
--- /dev/null
+++ b/tensorflow/contrib/lite/g3doc/rpi.md
@@ -0,0 +1,50 @@
+# TensorFlow Lite for Raspberry Pi
+
+## Cross compiling
+### Installing toolchain
+This has been tested on Ubuntu 16.04.3 64-bit and the TensorFlow devel Docker image [tensorflow/tensorflow:nightly-devel](https://hub.docker.com/r/tensorflow/tensorflow/tags/).
+
+To cross compile TensorFlow Lite, first install the toolchain and libraries.
+```bash
+sudo apt-get update
+sudo apt-get install crossbuild-essential-armhf
+```
+> If you are using Docker, you may not need `sudo`.
+
+### Building
+Clone this TensorFlow repository, then run this script at the root of the repository to download all the dependencies:
+> If you are using the `tensorflow/tensorflow:nightly-devel` Docker image, the TensorFlow repository is in `/tensorflow`.
+```bash
+./tensorflow/contrib/lite/download_dependencies.sh
+```
+Note that you only need to do this once.
+
+You should then be able to compile:
+```bash
+./tensorflow/contrib/lite/build_rpi_lib.sh
+```
+
+This should compile a static library in:
+`tensorflow/contrib/lite/gen/lib/rpi_armv7/libtensorflow-lite.a`.
+
+## Native compiling
+This has been tested on Raspberry Pi 3b, Raspbian GNU/Linux 9.1 (stretch), gcc version 6.3.0 20170516 (Raspbian 6.3.0-18+rpi1).
+
+Log in to your Raspberry Pi and install the toolchain.
+```bash
+sudo apt-get install build-essential
+```
+
+First, clone this TensorFlow repository. Run this at the root of the repository:
+```bash
+./tensorflow/contrib/lite/download_dependencies.sh
+```
+Note that you only need to do this once.
+
+You should then be able to compile:
+```bash
+./tensorflow/contrib/lite/build_rpi_lib.sh
+```
+
+This should compile a static library in:
+`tensorflow/contrib/lite/gen/lib/rpi_armv7/libtensorflow-lite.a`.
diff --git a/tensorflow/contrib/lite/interpreter.cc b/tensorflow/contrib/lite/interpreter.cc
index 370e4955271679975072d458e0ad9837a69d9556..937c185b0abc0747d94365865afc037feac83f89 100644
--- a/tensorflow/contrib/lite/interpreter.cc
+++ b/tensorflow/contrib/lite/interpreter.cc
@@ -26,15 +26,30 @@ limitations under the License.
 #include "tensorflow/contrib/lite/memory_planner.h"
 #include "tensorflow/contrib/lite/nnapi_delegate.h"
 #include "tensorflow/contrib/lite/schema/schema_generated.h"
+#include "tensorflow/contrib/lite/util.h"
+
+namespace tflite {
 
 namespace {
 
-// std::vector preallocation tuning.
-constexpr const int kSlotsToReserve = 128;
+// Stub method which returns kTfLiteError when the function is forbidden.
+// We're registering this function for several different functions to save
+// compiled binary size. Please note the restrictions:
+// * The type of the first parameter has to be `TfLiteContext*`.
+// * All parameters must be trivially destructible. (E.g. No C++ class)
+TfLiteStatus ForbiddenContextFunction(TfLiteContext* context, ...) {
+  context->ReportError(context,
+                       "The function is forbidden if not calling in delegate.");
+  return kTfLiteError;
+}
 
-}  // namespace
+// Set the ForbiddenContextFunction to a compatible function pointer.
+template <typename FunctionType>
+void SetForbiddenContextFunction(FunctionType* func) {
+  *func = reinterpret_cast<FunctionType>(ForbiddenContextFunction);
+}
 
-namespace tflite {
+}  // namespace
 
 // A trivial implementation of GraphInfo around the Interpreter.
 // NOTE: this interpreter info represents the subset of the
@@ -77,16 +92,18 @@ Interpreter::Interpreter(ErrorReporter* error_reporter)
   context_.AddTensors = AddTensors;
   context_.tensors = nullptr;
   context_.tensors_size = 0;
+  context_.eigen_context = nullptr;
   context_.gemm_context = nullptr;
+  context_.recommended_num_threads = -1;
 
   // Invalid to call these these except from TfLiteDelegate
-  context_.GetNodeAndRegistration = nullptr;
-  context_.ReplaceSubgraphsWithDelegateKernels = nullptr;
-  context_.GetExecutionPlan = nullptr;
+  SetForbiddenContextFunction(&context_.GetNodeAndRegistration);
+  SetForbiddenContextFunction(&context_.ReplaceSubgraphsWithDelegateKernels);
+  SetForbiddenContextFunction(&context_.GetExecutionPlan);
 
   // Reserve some space for the tensors to avoid excessive resizing.
-  tensors_.reserve(kSlotsToReserve);
-  nodes_and_registration_.reserve(kSlotsToReserve);
+  tensors_.reserve(kTensorsReservedCapacity);
+  nodes_and_registration_.reserve(kTensorsReservedCapacity);
   next_execution_plan_index_to_prepare_ = 0;
   UseNNAPI(false);
 }
@@ -103,19 +120,99 @@ Interpreter::~Interpreter() {
   }
 
   for (int i = 0; i < context_.tensors_size; i++) {
-    TfLiteTensorFree(&context_.tensors[i]);
+    TfLiteTensor* tensor = &context_.tensors[i];
+    if (tensor->buffer_handle != kTfLiteNullBufferHandle) {
+      tensor->delegate->FreeBufferHandle(tensor->delegate,
+                                         &tensor->buffer_handle);
+    }
+    TfLiteTensorFree(tensor);
   }
 }
 
 TfLiteStatus Interpreter::ReplaceSubgraphsWithDelegateKernels(
     TfLiteContext* context, TfLiteRegistration registration,
-    const TfLiteIntArray* nodes_to_replace) {
+    const TfLiteIntArray* nodes_to_replace, TfLiteDelegate* delegate) {
   return static_cast<Interpreter*>(context->impl_)
-      ->ReplaceSubgraphsWithDelegateKernels(registration, nodes_to_replace);
+      ->ReplaceSubgraphsWithDelegateKernels(registration, nodes_to_replace,
+                                            delegate);
 }
 
+namespace {
+
+// Copy a std::vector<int> to an existing TfLiteIntArray.
+// This is a low-level data manipulation function, and it is the caller's
+// responsibility to ensure the TfLiteIntArray is large enough.
+void CopyVectorToTfLiteIntArray(const std::vector<int>& vec,
+                                TfLiteIntArray* arr) {
+  arr->size = vec.size();
+  memcpy(arr->data, vec.data(), sizeof(int) * arr->size);
+}
+
+// This function allocates a contiguous memory space that contains a
+// TfLiteDelegateParams followed by several TfLiteIntArrays.
+// When `free` is called on the TfLiteDelegateParams*, all the allocated
+// space will be freed together.
+//
+// +-----------------------------------+
+// | TfLiteDelegateParams              |
+// | TfLiteDelegate* delegate;         |
+// | TfLiteIntArray* nodes_to_replace; |--\
+// | TfLiteIntArray* input_tensors;    |--+--\
+// | TfLiteIntArray* output_tensors;   |--+--+--\
+// +-----------------------------------+  |  |  |
+// | TfLiteIntArray (variable size)    |<-/  |  |
+// +-----------------------------------+     |  |
+// | TfLiteIntArray (variable size)    |<----/  |
+// +-----------------------------------+        |
+// | TfLiteIntArray (variable size)    |<-------/
+// +-----------------------------------+
+TfLiteDelegateParams* CreateDelegateParams(TfLiteDelegate* delegate,
+                                           const Subgraph& subgraph) {
+  // Step 1: Calculate the allocation size.
+  int allocation_size = sizeof(TfLiteDelegateParams);
+
+  int nodes_to_replace_size =
+      TfLiteIntArrayGetSizeInBytes(subgraph.nodes.size());
+  allocation_size += nodes_to_replace_size;
+
+  int input_tensors_size =
+      TfLiteIntArrayGetSizeInBytes(subgraph.input_tensors.size());
+  allocation_size += input_tensors_size;
+
+  int output_tensors_size =
+      TfLiteIntArrayGetSizeInBytes(subgraph.output_tensors.size());
+  allocation_size += output_tensors_size;
+
+  // Step 2: Allocate the memory.
+  // Use `char*` to conveniently step through the allocated space by bytes.
+  char* allocation = reinterpret_cast<char*>(malloc(allocation_size));
+
+  // Step 3: Fill all data structures.
+  TfLiteDelegateParams* params =
+      reinterpret_cast<TfLiteDelegateParams*>(allocation);
+  params->delegate = delegate;
+  allocation += sizeof(TfLiteDelegateParams);
+
+  params->nodes_to_replace = reinterpret_cast<TfLiteIntArray*>(allocation);
+  CopyVectorToTfLiteIntArray(subgraph.nodes, params->nodes_to_replace);
+  allocation += nodes_to_replace_size;
+
+  params->input_tensors = reinterpret_cast<TfLiteIntArray*>(allocation);
+  CopyVectorToTfLiteIntArray(subgraph.input_tensors, params->input_tensors);
+  allocation += input_tensors_size;
+
+  params->output_tensors = reinterpret_cast<TfLiteIntArray*>(allocation);
+  CopyVectorToTfLiteIntArray(subgraph.output_tensors, params->output_tensors);
+  allocation += output_tensors_size;
+
+  return params;
+}
+
+}  // namespace
+
 TfLiteStatus Interpreter::ReplaceSubgraphsWithDelegateKernels(
-    TfLiteRegistration registration, const TfLiteIntArray* nodes_to_replace) {
+    TfLiteRegistration registration, const TfLiteIntArray* nodes_to_replace,
+    TfLiteDelegate* delegate) {
   // Annotate the registration as DELEGATE op.
   registration.builtin_code = BuiltinOperator_DELEGATE;
 
@@ -127,30 +224,37 @@ TfLiteStatus Interpreter::ReplaceSubgraphsWithDelegateKernels(
 
   execution_plan_.clear();
   for (auto& subgraph : subgraphs) {
-    // Turn subgraph.nodes into a TfLiteIntArray compatible data structure.
-    // TODO(aselle): Avoid this copy by constructing subgraph.nodes that way
-    // in the first place
-    subgraph.nodes.insert(subgraph.nodes.begin(),
-                          static_cast<int>(subgraph.nodes.size()));
     // Subgraphs calimed by the delegate should have a "macro" op created, the
     // other subgraphs (kTfNonPartition) just have their nodes added back to
     // the execution plan.
     switch (subgraph.type) {
       case Subgraph::kTfNonPartition:
-        for (auto it = subgraph.nodes.begin() + 1; it != subgraph.nodes.end();
+        for (auto it = subgraph.nodes.begin(); it != subgraph.nodes.end();
              ++it) {
           execution_plan_.push_back(*it);
         }
         break;
       case Subgraph::kTfPartition: {
-        void* builtin_data = nullptr;
         int node_index;
-        // Create a node that represents computation of this subgraph.
-        AddNodeWithParameters(
-            subgraph.input_tensors, subgraph.output_tensors,
-            reinterpret_cast<const char*>(subgraph.nodes.data()),
-            subgraph.nodes.size() * sizeof(subgraph.nodes[0]), builtin_data,
-            &registration, &node_index);
+
+        TfLiteDelegateParams* params = CreateDelegateParams(delegate, subgraph);
+        AddNodeWithParameters(subgraph.input_tensors, subgraph.output_tensors,
+                              nullptr, 0, params, &registration, &node_index);
+
+        // Initialize the output tensors' delegate-related fields.
+        for (int tensor_index : subgraph.output_tensors) {
+          TfLiteTensor* tensor = &tensors_[tensor_index];
+          TF_LITE_ENSURE_EQ(&context_, tensor->delegate, nullptr);
+          TF_LITE_ENSURE_EQ(&context_, tensor->buffer_handle,
+                            kTfLiteNullBufferHandle);
+          // buffer_handle will be filled in delegate's `Prepare`
+          // function.
+          tensor->delegate = delegate;
+        }
+
+        // Associate the node with the delegate.
+        TfLiteNode* node = &nodes_and_registration_[node_index].first;
+        node->delegate = delegate;
       } break;
       case Subgraph::kTfUnexplored:
         return kTfLiteError;
@@ -169,8 +273,8 @@ TfLiteStatus Interpreter::GetExecutionPlan(TfLiteIntArray** execution_plan) {
   *execution_plan = plan_cache_.get();
   static_assert(sizeof(plan_cache_->data[0]) == sizeof(execution_plan_[0]),
                 "TfLiteIntArray and execution_plan do not contain same type.");
-  memcpy(plan_cache_->data, execution_plan_.data(),
-         sizeof(plan_cache_->data[0]) * execution_plan_.size());
+  std::memcpy(plan_cache_->data, execution_plan_.data(),
+              sizeof(plan_cache_->data[0]) * execution_plan_.size());
   return kTfLiteOk;
 }
 
@@ -240,14 +344,6 @@ TfLiteStatus Interpreter::BytesRequired(TfLiteType type, const int* dims,
   return kTfLiteOk;
 }
 
-namespace {
-TfLiteIntArray* convertVectorToTfLiteIntArray(const std::vector<int>& x) {
-  TfLiteIntArray* lite = TfLiteIntArrayCreate(x.size());
-  for (size_t i = 0; i < x.size(); i++) lite->data[i] = x[i];
-  return lite;
-}
-}  // namespace
-
 TfLiteStatus Interpreter::AllocateTensors() {
   next_execution_plan_index_to_prepare_ = 0;
   if (memory_planner_) {
@@ -260,7 +356,11 @@ TfLiteStatus Interpreter::AllocateTensors() {
   }
 
   TF_LITE_ENSURE_STATUS(PrepareOpsAndTensors());
-  invokable_ = true;
+  if (state_ == kStateUninvokable) {
+    state_ = kStateInvokable;
+  }
+  TF_LITE_ENSURE(&context_, state_ == kStateInvokable ||
+                                state_ == kStateInvokableAndImmutable);
   return kTfLiteOk;
 }
 
@@ -268,7 +368,12 @@ TfLiteStatus Interpreter::AddNodeWithParameters(
     const std::vector<int>& inputs, const std::vector<int>& outputs,
     const char* init_data, size_t init_data_size, void* builtin_data,
     const TfLiteRegistration* registration, int* node_index) {
-  invokable_ = false;
+  if (state_ == kStateInvokableAndImmutable) {
+    ReportError(&context_,
+                "AddNodeWithParameters is disallowed when graph is immutable.");
+    return kTfLiteError;
+  }
+  state_ = kStateUninvokable;
 
   std::unique_ptr<void, decltype(free)*> builtin_data_deleter(builtin_data,
                                                               free);
@@ -282,7 +387,6 @@ TfLiteStatus Interpreter::AddNodeWithParameters(
   int new_node_index = nodes_and_registration_.size();
   if (node_index) *node_index = new_node_index;
   nodes_and_registration_.resize(nodes_and_registration_.size() + 1);
-
   auto& node_and_reg = nodes_and_registration_.back();
   TfLiteNode& node = node_and_reg.first;
   if (node.inputs) TfLiteIntArrayFree(node.inputs);
@@ -292,8 +396,8 @@ TfLiteStatus Interpreter::AddNodeWithParameters(
   // NOTE, here we are not using move semantics yet, since our internal
   // representation isn't std::vector, but in the future we would like to avoid
   // copies, so we want the interface to take r-value references now.
-  node.inputs = convertVectorToTfLiteIntArray(inputs);
-  node.outputs = convertVectorToTfLiteIntArray(outputs);
+  node.inputs = ConvertVectorToTfLiteIntArray(inputs);
+  node.outputs = ConvertVectorToTfLiteIntArray(outputs);
   node.temporaries = TfLiteIntArrayCreate(0);
   if (init_data) {
     node.user_data = OpInit(*registration, init_data, init_data_size);
@@ -306,6 +410,7 @@ TfLiteStatus Interpreter::AddNodeWithParameters(
   node.builtin_data = builtin_data_deleter.release();
   // TODO(ycling): Filling `custom_initial_data` and `custom_initial_data_size`
   // properly for nodes generated by ReplaceSubgraphsWithDelegateKernels.
+
   if (registration->builtin_code == BuiltinOperator_CUSTOM) {
     // When it's a CUSTOM op, the `custom_options` field in the Flatbuffer
     // `Operator` table is passed in.
@@ -316,6 +421,7 @@ TfLiteStatus Interpreter::AddNodeWithParameters(
     node.custom_initial_data_size = 0;
   }
 
+  node.delegate = nullptr;
   node_and_reg.second = *registration;
   execution_plan_.push_back(new_node_index);
   return kTfLiteOk;
@@ -323,13 +429,18 @@ TfLiteStatus Interpreter::AddNodeWithParameters(
 
 TfLiteStatus Interpreter::ResizeInputTensor(int tensor_index,
                                             const std::vector<int>& dims) {
+  if (state_ == kStateInvokableAndImmutable) {
+    ReportError(&context_,
+                "ResizeInputTensor is disallowed when graph is immutable.");
+    return kTfLiteError;
+  }
+  state_ = kStateUninvokable;
+
   // TODO(aselle): All bounds checks can be implemented as one-sided bounds
   // checks by casting to unsigned for efficiency. Profile before doing this.
-
   TF_LITE_ENSURE(&context_,
                  tensor_index < context_.tensors_size && tensor_index >= 0);
-  invokable_ = false;
-  TfLiteIntArray* dims_lite = convertVectorToTfLiteIntArray(dims);
+  TfLiteIntArray* dims_lite = ConvertVectorToTfLiteIntArray(dims);
   return ResizeTensorImpl(&context_.tensors[tensor_index], dims_lite);
 }
 
@@ -353,6 +464,7 @@ TfLiteStatus Interpreter::PrepareOpsStartingAt(
     TfLiteNode& node = nodes_and_registration_[node_index].first;
     const TfLiteRegistration& registration =
         nodes_and_registration_[node_index].second;
+    EnsureTensorsVectorCapacity();
     if (OpPrepare(registration, &node) == kTfLiteError) {
       return kTfLiteError;
     }
@@ -392,7 +504,7 @@ TfLiteStatus Interpreter::Invoke() {
     ReportError(&context_, "Invoke called on model that is not consistent.");
     return kTfLiteError;
   }
-  if (!invokable_) {
+  if (state_ == kStateUninvokable) {
     ReportError(&context_, "Invoke called on model that is not ready.");
     return kTfLiteError;
   }
@@ -430,10 +542,29 @@ TfLiteStatus Interpreter::Invoke() {
     TfLiteNode& node = nodes_and_registration_[node_index].first;
     const TfLiteRegistration& registration =
         nodes_and_registration_[node_index].second;
+
+    // TODO(ycling): This is an extra loop through inputs to check if the data
+    // need to be copied from Delegate buffer to raw memory, which is often not
+    // needed. We may want to cache this in prepare to know if this needs to be
+    // done for a node or not.
+    for (int i = 0; i < node.inputs->size; ++i) {
+      int tensor_index = node.inputs->data[i];
+      if (tensor_index == kOptionalTensor) {
+        continue;
+      }
+      TfLiteTensor* tensor = &tensors_[tensor_index];
+      if (tensor->delegate && tensor->delegate != node.delegate &&
+          tensor->data_is_stale) {
+        EnsureTensorDataIsReadable(tensor_index);
+      }
+    }
+
+    EnsureTensorsVectorCapacity();
     if (OpInvoke(registration, &node) == kTfLiteError) {
       status = kTfLiteError;
     }
   }
+
   return status;
 }
 
@@ -469,6 +600,7 @@ TfLiteStatus Interpreter::AddTensors(int tensors_to_add,
   tensors_.resize(tensors_.size() + tensors_to_add);
   for (int i = base_index; i < tensors_.size(); i++) {
     memset(&tensors_[i], 0, sizeof(tensors_[i]));
+    tensors_[i].buffer_handle = kTfLiteNullBufferHandle;
   }
   context_.tensors = tensors_.data();
   context_.tensors_size = tensors_.size();
@@ -501,9 +633,16 @@ TfLiteStatus Interpreter::GetNodeAndRegistration(
 }
 
 TfLiteStatus Interpreter::SetTensorParametersReadOnly(
-    int tensor_index, TfLiteType type, const char* name,
-    const std::vector<int>& dims, TfLiteQuantizationParams quantization,
-    const char* buffer, size_t bytes, const Allocation* allocation) {
+    int tensor_index, TfLiteType type, const char* name, const int rank,
+    const int* dims, TfLiteQuantizationParams quantization, const char* buffer,
+    size_t bytes, const Allocation* allocation) {
+  if (state_ == kStateInvokableAndImmutable) {
+    ReportError(
+        &context_,
+        "SetTensorParametersReadOnly is disallowed when graph is immutable.");
+    return kTfLiteError;
+  }
+
   TF_LITE_ENSURE(&context_,
                  tensor_index < context_.tensors_size && tensor_index >= 0);
   // For most tensors we know exactly how much memory is necessary so we can
@@ -511,14 +650,27 @@ TfLiteStatus Interpreter::SetTensorParametersReadOnly(
   // because their sizes change with the contents of the individual strings.
   if (type != kTfLiteString) {
     size_t required_bytes;
-    TF_LITE_ENSURE_OK(&context_, BytesRequired(type, dims.data(), dims.size(),
-                                               &required_bytes));
+    TF_LITE_ENSURE_OK(&context_,
+                      BytesRequired(type, dims, rank, &required_bytes));
     TF_LITE_ENSURE_EQ(&context_, required_bytes, bytes);
   }
-  invokable_ = false;
-  TfLiteTensorReset(type, name, convertVectorToTfLiteIntArray(dims),
-                    quantization, const_cast<char*>(buffer), bytes,
-                    kTfLiteMmapRo, allocation, &context_.tensors[tensor_index]);
+
+  TfLiteTensor& tensor = context_.tensors[tensor_index];
+  if (type == tensor.type &&
+      EqualArrayAndTfLiteIntArray(tensor.dims, rank, dims)) {
+    // Fast path which does not invalidate the invokable property.
+    TfLiteTensorDataFree(&tensor);
+    tensor.data.raw = const_cast<char*>(buffer);
+    if (!tensor.dims) tensor.dims = ConvertArrayToTfLiteIntArray(rank, dims);
+    tensor.params = quantization;
+    tensor.allocation_type = kTfLiteMmapRo;
+    tensor.allocation = allocation;
+  } else {
+    state_ = kStateUninvokable;
+    TfLiteTensorReset(type, name, ConvertArrayToTfLiteIntArray(rank, dims),
+                      quantization, const_cast<char*>(buffer), bytes,
+                      kTfLiteMmapRo, allocation, &tensor);
+  }
   return kTfLiteOk;
 }
 
@@ -527,9 +679,14 @@ TfLiteStatus Interpreter::SetTensorParametersReadOnly(
 // bytes. The lifetime of buffer must be ensured to be greater or equal
 // to Interpreter.
 TfLiteStatus Interpreter::SetTensorParametersReadWrite(
-    int tensor_index, TfLiteType type, const char* name,
-    const std::vector<int>& dims, TfLiteQuantizationParams quantization) {
-  invokable_ = false;
+    int tensor_index, TfLiteType type, const char* name, const int rank,
+    const int* dims, TfLiteQuantizationParams quantization) {
+  if (state_ == kStateInvokableAndImmutable) {
+    ReportError(
+        &context_,
+        "SetTensorParametersReadWrite is disallowed when graph is immutable.");
+    return kTfLiteError;
+  }
   TF_LITE_ENSURE(&context_,
                  tensor_index < context_.tensors_size && tensor_index >= 0);
   size_t required_bytes = 0;
@@ -538,10 +695,10 @@ TfLiteStatus Interpreter::SetTensorParametersReadWrite(
     // many bytes we will need based on the dimensions. String tensors are
     // allocated dynamically and we can't know ahead of time how much space
     // they will require.
-    TF_LITE_ENSURE_OK(&context_, BytesRequired(type, dims.data(), dims.size(),
-                                               &required_bytes));
+    TF_LITE_ENSURE_OK(&context_,
+                      BytesRequired(type, dims, rank, &required_bytes));
   }
-  TfLiteTensorReset(type, name, convertVectorToTfLiteIntArray(dims),
+  TfLiteTensorReset(type, name, ConvertArrayToTfLiteIntArray(rank, dims),
                     quantization,
                     /*buffer=*/nullptr, required_bytes,
                     type == kTfLiteString ? kTfLiteDynamic : kTfLiteArenaRw,
@@ -604,26 +761,90 @@ void Interpreter::UseNNAPI(bool enable) {
 }
 
 void Interpreter::SetNumThreads(int num_threads) {
-  // TODO(ahentz): this forces us to link against gemmlowp even when the ops
-  // don't use it. We should implement some dynamic mechanism for this sort of
-  // library-specific initialization.
-  tflite::gemm_support::SetMaxNumThreads(&context_, num_threads);
-}
+  context_.recommended_num_threads = num_threads;
+}
+
+TfLiteStatus Interpreter::ModifyGraphWithDelegate(TfLiteDelegate* delegate,
+                                                  bool allow_dynamic_tensors) {
+  if (!allow_dynamic_tensors) {
+    int last_execution_plan_index_prepared;
+    TF_LITE_ENSURE_OK(&context_, PrepareOpsStartingAt(
+                                     0, &last_execution_plan_index_prepared));
+
+    bool has_dynamic_tensors = true;
+    // Dynamic tensors exist if not all nodes can be prepared.
+    if (last_execution_plan_index_prepared + 1 == execution_plan_.size()) {
+      // If all the nodes can be prepared, check if the last node has dynamic
+      // tensors.
+      int node_index = execution_plan_[last_execution_plan_index_prepared];
+      TfLiteNode& node = nodes_and_registration_[node_index].first;
+      if (!HasDynamicTensor(context_, node.outputs)) {
+        has_dynamic_tensors = false;
+      }
+    }
+    if (has_dynamic_tensors) {
+      ReportError(&context_, "Delegate doesn't support dynamic-sized tensors.");
+      return kTfLiteError;
+    }
+  }
 
-TfLiteStatus Interpreter::ModifyGraphWithDelegate(TfLiteDelegate* delegate) {
   // TODO(aselle): Consider if it is worth storing pointers to delegates.
-  // Setup additional context interface
+  // Setup additional context interface.
   context_.GetNodeAndRegistration = GetNodeAndRegistration;
   context_.ReplaceSubgraphsWithDelegateKernels =
       ReplaceSubgraphsWithDelegateKernels;
   context_.GetExecutionPlan = GetExecutionPlan;
 
-  TfLiteStatus status = delegate->Prepare(&context_, delegate->data_);
+  TfLiteStatus status = delegate->Prepare(&context_, delegate);
+
   // Remove additional context info.
-  context_.GetNodeAndRegistration = nullptr;
-  context_.ReplaceSubgraphsWithDelegateKernels = nullptr;
-  context_.GetExecutionPlan = nullptr;
+  SetForbiddenContextFunction(&context_.GetNodeAndRegistration);
+  SetForbiddenContextFunction(&context_.ReplaceSubgraphsWithDelegateKernels);
+  SetForbiddenContextFunction(&context_.GetExecutionPlan);
+
+  TF_LITE_ENSURE_OK(&context_, status);
+
+  if (!allow_dynamic_tensors) {
+    TF_LITE_ENSURE_OK(&context_, AllocateTensors());
+    TF_LITE_ENSURE(&context_, state_ == kStateInvokable ||
+                                  state_ == kStateInvokableAndImmutable);
+    // After using a delegate which doesn't support dynamic tensors, make the
+    // entire graph immutable.
+    state_ = kStateInvokableAndImmutable;
+  }
+
   return status;
 }
 
+TfLiteStatus Interpreter::SetBufferHandle(int tensor_index,
+                                          TfLiteBufferHandle buffer_handle,
+                                          TfLiteDelegate* delegate) {
+  TF_LITE_ENSURE(&context_, tensor_index < tensors_size());
+  TfLiteTensor* tensor = &tensors_[tensor_index];
+
+  TF_LITE_ENSURE(&context_,
+                 tensor->delegate == nullptr || tensor->delegate == delegate);
+  tensor->delegate = delegate;
+  if (tensor->buffer_handle != kTfLiteNullBufferHandle) {
+    TF_LITE_ENSURE(&context_, tensor->delegate->FreeBufferHandle != nullptr);
+    tensor->delegate->FreeBufferHandle(tensor->delegate,
+                                       &tensor->buffer_handle);
+  }
+  tensor->buffer_handle = buffer_handle;
+
+  return kTfLiteOk;
+}
+
+TfLiteStatus Interpreter::GetBufferHandle(int tensor_index,
+                                          TfLiteBufferHandle* buffer_handle,
+                                          TfLiteDelegate** delegate) {
+  TF_LITE_ENSURE(&context_, tensor_index < tensors_size());
+  TfLiteTensor* tensor = &tensors_[tensor_index];
+
+  *delegate = tensor->delegate;
+  *buffer_handle = tensor->buffer_handle;
+
+  return kTfLiteOk;
+}
+
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/interpreter.h b/tensorflow/contrib/lite/interpreter.h
index a9df2627e02486f71e9b5b7b0d1bfd89c7ec70c0..77db17878318276c6cf5067274a3af3be262c8e1 100644
--- a/tensorflow/contrib/lite/interpreter.h
+++ b/tensorflow/contrib/lite/interpreter.h
@@ -24,7 +24,6 @@ limitations under the License.
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/error_reporter.h"
 #include "tensorflow/contrib/lite/memory_planner.h"
-#include "tensorflow/contrib/lite/schema/schema_generated.h"
 
 namespace tflite {
 
@@ -134,18 +133,34 @@ class Interpreter {
   // This variant assumes an external buffer has been allocated of size
   // bytes. The lifetime of buffer must be ensured to be greater or equal
   // to Interpreter.
-  TfLiteStatus SetTensorParametersReadOnly(
+  inline TfLiteStatus SetTensorParametersReadOnly(
       int tensor_index, TfLiteType type, const char* name,
       const std::vector<int>& dims, TfLiteQuantizationParams quantization,
+      const char* buffer, size_t bytes,
+      const Allocation* allocation = nullptr) {
+    return SetTensorParametersReadOnly(tensor_index, type, name, dims.size(),
+                                       dims.data(), quantization, buffer, bytes,
+                                       allocation);
+  }
+
+  TfLiteStatus SetTensorParametersReadOnly(
+      int tensor_index, TfLiteType type, const char* name, const int rank,
+      const int* dims, TfLiteQuantizationParams quantization,
       const char* buffer, size_t bytes, const Allocation* allocation = nullptr);
 
   // Set description of inputs/outputs/data/fptrs for node `node_index`.
   // This variant assumes an external buffer has been allocated of size
   // bytes. The lifetime of buffer must be ensured to be greater or equal
   // to Interpreter.
-  TfLiteStatus SetTensorParametersReadWrite(
+  inline TfLiteStatus SetTensorParametersReadWrite(
       int tensor_index, TfLiteType type, const char* name,
-      const std::vector<int>& dims, TfLiteQuantizationParams quantization);
+      const std::vector<int>& dims, TfLiteQuantizationParams quantization) {
+    return SetTensorParametersReadWrite(tensor_index, type, name, dims.size(),
+                                        dims.data(), quantization);
+  }
+  TfLiteStatus SetTensorParametersReadWrite(
+      int tensor_index, TfLiteType type, const char* name, const int rank,
+      const int* dims, TfLiteQuantizationParams quantization);
 
   // Functions to access tensor data
 
@@ -257,13 +272,57 @@ class Interpreter {
   // Allow a delegate to look at the graph and modify the graph to handle
   // parts of the graph themselves. After this is called, the graph may
   // contain new nodes that replace 1 more nodes.
-  TfLiteStatus ModifyGraphWithDelegate(TfLiteDelegate* delegate);
+  // WARNING: This is an experimental API and subject to change.
+  TfLiteStatus ModifyGraphWithDelegate(TfLiteDelegate* delegate,
+                                       bool allow_dynamic_tensors = false);
+
+  // Ensure the data in `tensor.data` is readable. If a delegate is used,
+  // this may require copying the data from the delegate buffer to raw memory.
+  TfLiteStatus EnsureTensorDataIsReadable(int tensor_index) {
+    TF_LITE_ENSURE(&context_, tensor_index < tensors_size());
+    TfLiteTensor* tensor = &tensors_[tensor_index];
+    if (tensor->data_is_stale) {
+      TF_LITE_ENSURE(&context_, tensor->delegate != nullptr);
+      TF_LITE_ENSURE(&context_,
+                     tensor->buffer_handle != kTfLiteNullBufferHandle);
+      // This can be null if the delegate doesn't use its own buffer.
+      TF_LITE_ENSURE(&context_,
+                     tensor->delegate->CopyFromBufferHandle != nullptr);
+      tensor->delegate->CopyFromBufferHandle(tensor->delegate,
+                                             tensor->buffer_handle,
+                                             tensor->data.raw, tensor->bytes);
+      tensor->data_is_stale = false;
+    }
+    return kTfLiteOk;
+  }
 
-  // WARNING: This is a deprecated interface and will be removed as soon as
-  // possible.  Please do not use it.
-  // TODO(impjdi): Remove this interface after resolving dependencies.
-  void set_model(const Model* model) { model_ = const_cast<Model*>(model); }
-  Model* model() const { return model_; }
+  // Set the delegate buffer handle to a tensor. It can be called in the
+  // following cases:
+  // 1. Set the buffer handle to a tensor that's not being written by a
+  //    delegate. For example, feeding an OpenGL texture as the input of the
+  //    inference graph.
+  // 2. Set the buffer handle to a tensor that uses the same delegate.
+  //    For example, set an OpenGL texture as the output of inference, while
+  //    the node which produces output is an OpenGL delegate node.
+  // WARNING: This is an experimental API and subject to change.
+  TfLiteStatus SetBufferHandle(int tensor_index,
+                               TfLiteBufferHandle buffer_handle,
+                               TfLiteDelegate* delegate);
+
+  // Get the delegate buffer handle, and the delegate which can process the
+  // buffer handle.
+  // WARNING: This is an experimental API and subject to change.
+  TfLiteStatus GetBufferHandle(int tensor_index,
+                               TfLiteBufferHandle* buffer_handle,
+                               TfLiteDelegate** delegate);
+
+  // The default capacity of `tensors_` vector.
+  static constexpr int kTensorsReservedCapacity = 128;
+  // The capacity headroom of the `tensors_` vector before calling ops'
+  // `prepare` and `invoke` functions. Within these functions, allocating up
+  // to `kTensorsCapacityHeadroom` more tensors is guaranteed not to
+  // invalidate pointers to existing tensors.
+  static constexpr int kTensorsCapacityHeadroom = 16;
 
  private:
   // Give 'op_reg' a chance to initialize itself using the contents of
@@ -347,14 +406,15 @@ class Interpreter {
   // Entry point for C API ReplaceSubgraphsWithDelegateKernels
   static TfLiteStatus ReplaceSubgraphsWithDelegateKernels(
       TfLiteContext* context, TfLiteRegistration registration,
-      const TfLiteIntArray* nodes_to_replace);
+      const TfLiteIntArray* nodes_to_replace, TfLiteDelegate* delegate);
 
   // Update the execution graph to replace some of the nodes with stub
   // nodes. Specifically any node index that has `nodes[index]==1` will be
   // slated for replacement with a delegate kernel specified by registration.
   // WARNING: This is an experimental interface that is subject to change.
   TfLiteStatus ReplaceSubgraphsWithDelegateKernels(
-      TfLiteRegistration registration, const TfLiteIntArray* nodes_to_replace);
+      TfLiteRegistration registration, const TfLiteIntArray* nodes_to_replace,
+      TfLiteDelegate* delegate);
 
   // WARNING: This is an experimental interface that is subject to change.
   // Gets the internal pointer to a TensorFlow lite node by node_index.
@@ -377,6 +437,32 @@ class Interpreter {
   static TfLiteStatus GetExecutionPlan(struct TfLiteContext* context,
                                        TfLiteIntArray** execution_plan);
 
+  // Ensures that `tensors_` has at least `kTensorsCapacityHeadroom` extra
+  // capacity. Calling this function may invalidate existing pointers to
+  // tensors. After calling this function, adding `kTensorsCapacityHeadroom`
+  // more tensors won't invalidate pointers to existing tensors.
+  void EnsureTensorsVectorCapacity() {
+    const int required_capacity = tensors_size() + kTensorsCapacityHeadroom;
+    if (required_capacity > tensors_.capacity()) {
+      tensors_.reserve(required_capacity);
+      context_.tensors = tensors_.data();
+    }
+  }
+
+  // The state of the Interpreter.
+  enum State {
+    // The interpreter isn't ready to be invoked.
+    // `AllocateTensors` needs to be called to enter an invokable state.
+    kStateUninvokable = 0,
+    // The interpreter is ready to be invoked.
+    kStateInvokable,
+    // The interpreter is ready to be invoked, and graph can't be further
+    // modified. The interpreter will enter this state when calling
+    // `ModifyGraphWithDelegate` with `allow_dynamic_tensors=false`.
+    kStateInvokableAndImmutable,
+  };
+  State state_ = kStateUninvokable;
+
   // A pure C data structure used to communicate with the pure C plugin
   // interface. To avoid copying tensor metadata, this is also the definitive
   // structure to store tensors.
@@ -392,10 +478,6 @@ class Interpreter {
   // the tensor array.
   bool consistent_ = true;
 
-  // Whether the model is safe to invoke (if any errors occurred this
-  // will be false).
-  bool invokable_ = false;
-
   // Array of indices representing the tensors that are inputs to the
   // interpreter.
   std::vector<int> inputs_;
@@ -411,7 +493,7 @@ class Interpreter {
   // During Invoke(), Interpreter will allocate input tensors first, which are
   // known to be fixed size. Then it will allocate outputs from nodes as many
   // as possible. When there is a node that produces dynamic sized tensor.
-  // Intepreter will stop allocating tensors, set the value of next allocate
+  // Interpreter will stop allocating tensors, set the value of next allocate
   // node id, and execute the node to generate the output tensor before continue
   // to allocate successors. This process repeats until all nodes are executed.
   // NOTE: this relies on the order of nodes that is in topological order.
@@ -432,11 +514,6 @@ class Interpreter {
   std::unique_ptr<NNAPIDelegate> nnapi_delegate_;
 
   std::unique_ptr<MemoryPlanner> memory_planner_;
-
-  // WARNING: This is a deprecated interface and will be removed as soon as
-  // possible.  Please do not use it.
-  // TODO(impjdi): Remove this interface after resolving dependencies.
-  Model* model_ = nullptr;
 };
 
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/interpreter_test.cc b/tensorflow/contrib/lite/interpreter_test.cc
index 28c96e5dde6ffa62bb073db9716a00f91c6e0bdf..131e088079857af34478645b7f1559364d03a493 100644
--- a/tensorflow/contrib/lite/interpreter_test.cc
+++ b/tensorflow/contrib/lite/interpreter_test.cc
@@ -17,9 +17,11 @@ limitations under the License.
 #include <gtest/gtest.h>
 #include "tensorflow/contrib/lite/error_reporter.h"
 #include "tensorflow/contrib/lite/kernels/internal/compatibility.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
 #include "tensorflow/contrib/lite/schema/schema_generated.h"
 #include "tensorflow/contrib/lite/string_util.h"
 #include "tensorflow/contrib/lite/testing/util.h"
+
 namespace tflite {
 namespace {
 
@@ -40,7 +42,7 @@ TEST(BasicInterpreter, InvokeInvalidModel) {
   ASSERT_EQ(interpreter.Invoke(), kTfLiteOk);
 }
 
-// Test size accesser functions.
+// Test size accessor functions.
 TEST(BasicInterpreter, TestSizeFunctions) {
   Interpreter interpreter;
   int base_index;
@@ -439,12 +441,12 @@ TEST(BasicInterpreter, ThreeStepAllocate) {
   // String-in String-out node.
   TfLiteRegistration reg_copy = {nullptr, nullptr, nullptr, nullptr};
   reg_copy.invoke = [](TfLiteContext* context, TfLiteNode* node) {
-    TfLiteTensor* a0 = &context->tensors[node->inputs->data[0]];
-    TfLiteTensor* a1 = &context->tensors[node->outputs->data[0]];
+    TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
+    TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
     DynamicBuffer buf;
-    StringRef str_ref = GetString(a0, 0);
+    StringRef str_ref = GetString(input, 0);
     buf.AddString(str_ref);
-    buf.WriteToTensor(a1);
+    buf.WriteToTensor(output);
     return kTfLiteOk;
   };
 
@@ -561,6 +563,86 @@ TEST(BasicInterpreter, TestCustomErrorReporter) {
   ASSERT_EQ(reporter.calls, 1);
 }
 
+TEST(BasicInterpreter, TestUnsupportedDelegateFunctions) {
+  Interpreter interpreter;
+  ASSERT_EQ(interpreter.AddTensors(2), kTfLiteOk);
+  TfLiteRegistration registration = {
+      .init = nullptr, .free = nullptr, .prepare = nullptr, .invoke = nullptr};
+  // These functions are only supported inside a delegate's `Prepare` function.
+  // The test verifies that these functions return `kTfLiteError` rather
+  // than returning `kTfLiteOk` or crashing.
+  registration.prepare = [](TfLiteContext* context, TfLiteNode* node) {
+    {
+      TfLiteIntArray* execution_plan;
+      EXPECT_EQ(context->GetExecutionPlan(context, &execution_plan),
+                kTfLiteError);
+    }
+    {
+      TfLiteNode* node;
+      TfLiteRegistration* registration;
+      EXPECT_EQ(
+          context->GetNodeAndRegistration(context, 0, &node, &registration),
+          kTfLiteError);
+    }
+    {
+      TfLiteRegistration delegate_registration = {nullptr, nullptr, nullptr,
+                                                  nullptr};
+      TfLiteIntArray nodes_to_replace;
+      nodes_to_replace.size = 0;
+      EXPECT_EQ(context->ReplaceSubgraphsWithDelegateKernels(
+                    context, delegate_registration, &nodes_to_replace, nullptr),
+                kTfLiteError);
+    }
+    return kTfLiteError;
+  };
+  ASSERT_EQ(interpreter.SetInputs({0}), kTfLiteOk);
+  ASSERT_EQ(interpreter.SetOutputs({0}), kTfLiteOk);
+  ASSERT_EQ(interpreter.AddNodeWithParameters({0}, {1}, nullptr, 0, nullptr,
+                                              &registration),
+            kTfLiteOk);
+  EXPECT_EQ(interpreter.AllocateTensors(), kTfLiteError);
+}
+
+TEST(InterpreterTensorsCapacityTest, TestWithinHeadroom) {
+  Interpreter interpreter;
+  ASSERT_EQ(interpreter.AddTensors(Interpreter::kTensorsReservedCapacity),
+            kTfLiteOk);
+  TfLiteRegistration registration = {nullptr, nullptr, nullptr, nullptr};
+  registration.prepare = [](TfLiteContext* context, TfLiteNode* node) {
+    TfLiteTensor* first_tensor = context->tensors;
+
+    int new_tensor_index;
+    context->AddTensors(context, Interpreter::kTensorsCapacityHeadroom,
+                        &new_tensor_index);
+    EXPECT_EQ(first_tensor, context->tensors);
+    return kTfLiteOk;
+  };
+  ASSERT_EQ(interpreter.AddNodeWithParameters({0}, {1}, nullptr, 0, nullptr,
+                                              &registration),
+            kTfLiteOk);
+  ASSERT_EQ(interpreter.AllocateTensors(), kTfLiteOk);
+}
+
+TEST(InterpreterTensorsCapacityTest, TestExceedHeadroom) {
+  Interpreter interpreter;
+  ASSERT_EQ(interpreter.AddTensors(Interpreter::kTensorsReservedCapacity),
+            kTfLiteOk);
+  TfLiteRegistration registration = {nullptr, nullptr, nullptr, nullptr};
+  registration.prepare = [](TfLiteContext* context, TfLiteNode* node) {
+    TfLiteTensor* first_tensor = context->tensors;
+
+    int new_tensor_index;
+    context->AddTensors(context, Interpreter::kTensorsCapacityHeadroom + 1,
+                        &new_tensor_index);
+    EXPECT_NE(first_tensor, context->tensors);
+    return kTfLiteOk;
+  };
+  ASSERT_EQ(interpreter.AddNodeWithParameters({0}, {1}, nullptr, 0, nullptr,
+                                              &registration),
+            kTfLiteOk);
+  ASSERT_EQ(interpreter.AllocateTensors(), kTfLiteOk);
+}
+
 // Test fixture that allows playing with execution plans. It creates a two
 // node graph that can be executed in either [0,1] order or [1,0] order.
 // The CopyOp records when it is invoked in the class member run_order_
@@ -698,13 +780,17 @@ TfLiteRegistration AddOpRegistration() {
 
   reg.prepare = [](TfLiteContext* context, TfLiteNode* node) {
     // Set output size to input size
-    TfLiteTensor* tensor0 = &context->tensors[node->inputs->data[0]];
-    TfLiteTensor* tensor1 = &context->tensors[node->inputs->data[1]];
-    TfLiteTensor* tensor2 = &context->tensors[node->outputs->data[0]];
-    TfLiteIntArray* newSize = TfLiteIntArrayCopy(tensor0->dims);
-    TfLiteIntArray* newSizeOther = TfLiteIntArrayCopy(tensor1->dims);
-    TF_LITE_ENSURE_EQ(context, newSize->size, newSizeOther->size);
-    TF_LITE_ENSURE_STATUS(context->ResizeTensor(context, tensor2, newSize));
+    TfLiteTensor* input1 = &context->tensors[node->inputs->data[0]];
+    TfLiteTensor* input2 = &context->tensors[node->inputs->data[1]];
+    TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
+
+    TF_LITE_ENSURE_EQ(context, input1->dims->size, input2->dims->size);
+    for (int i = 0; i < input1->dims->size; ++i) {
+      TF_LITE_ENSURE_EQ(context, input1->dims->data[i], input2->dims->data[i]);
+    }
+
+    TF_LITE_ENSURE_STATUS(context->ResizeTensor(
+        context, output, TfLiteIntArrayCopy(input1->dims)));
     return kTfLiteOk;
   };
 
@@ -723,26 +809,40 @@ TfLiteRegistration AddOpRegistration() {
 }
 
 class TestDelegate : public ::testing::Test {
- public:
-  TestDelegate() {
-    interpreter_.AddTensors(5);
-    interpreter_.SetInputs({0, 1});
-    interpreter_.SetOutputs({3, 4});
+ protected:
+  void SetUp() override {
+    interpreter_.reset(new Interpreter);
+    interpreter_->AddTensors(5);
+    interpreter_->SetInputs({0, 1});
+    interpreter_->SetOutputs({3, 4});
     TfLiteQuantizationParams quant;
-    interpreter_.SetTensorParametersReadWrite(0, kTfLiteFloat32, "", {3},
-                                              quant);
-    interpreter_.SetTensorParametersReadWrite(1, kTfLiteFloat32, "", {3},
-                                              quant);
-    interpreter_.SetTensorParametersReadWrite(2, kTfLiteFloat32, "", {3},
-                                              quant);
-    interpreter_.SetTensorParametersReadWrite(3, kTfLiteFloat32, "", {3},
-                                              quant);
+    interpreter_->SetTensorParametersReadWrite(0, kTfLiteFloat32, "", {3},
+                                               quant);
+    interpreter_->SetTensorParametersReadWrite(1, kTfLiteFloat32, "", {3},
+                                               quant);
+    interpreter_->SetTensorParametersReadWrite(2, kTfLiteFloat32, "", {3},
+                                               quant);
+    interpreter_->SetTensorParametersReadWrite(3, kTfLiteFloat32, "", {3},
+                                               quant);
+    interpreter_->SetTensorParametersReadWrite(4, kTfLiteFloat32, "", {3},
+                                               quant);
     TfLiteRegistration reg = AddOpRegistration();
-    interpreter_.AddNodeWithParameters({0, 0}, {2}, nullptr, 0, nullptr, &reg);
-    interpreter_.AddNodeWithParameters({1, 1}, {3}, nullptr, 0, nullptr, &reg);
-    interpreter_.AddNodeWithParameters({2, 1}, {4}, nullptr, 0, nullptr, &reg);
+    interpreter_->AddNodeWithParameters({0, 0}, {2}, nullptr, 0, nullptr, &reg);
+    interpreter_->AddNodeWithParameters({1, 1}, {3}, nullptr, 0, nullptr, &reg);
+    interpreter_->AddNodeWithParameters({2, 1}, {4}, nullptr, 0, nullptr, &reg);
+  }
+
+  void TearDown() override {
+    // Interpreter relies on delegate_ to free the resources properly. Thus
+    // the delegate must outlive the interpreter.
+    interpreter_.reset();
+    delegate_.reset();
   }
 
+  TfLiteBufferHandle last_allocated_handle_ = kTfLiteNullBufferHandle;
+
+  TfLiteBufferHandle AllocateBufferHandle() { return ++last_allocated_handle_; }
+
  protected:
   class SimpleDelegate {
    public:
@@ -751,8 +851,8 @@ class TestDelegate : public ::testing::Test {
     // value-copyable and compatible with TfLite.
     explicit SimpleDelegate(const std::vector<int>& nodes) : nodes_(nodes) {
       delegate_.Prepare = [](TfLiteContext* context,
-                             void* data) -> TfLiteStatus {
-        auto* simple = reinterpret_cast<SimpleDelegate*>(data);
+                             TfLiteDelegate* delegate) -> TfLiteStatus {
+        auto* simple = reinterpret_cast<SimpleDelegate*>(delegate->data_);
         TfLiteIntArray* nodes_to_separate =
             TfLiteIntArrayCreate(simple->nodes_.size());
         // Mark nodes that we want in TfLiteIntArray* structure.
@@ -783,10 +883,26 @@ class TestDelegate : public ::testing::Test {
         }
 
         context->ReplaceSubgraphsWithDelegateKernels(
-            context, FakeFusedRegistration(), nodes_to_separate);
+            context, FakeFusedRegistration(), nodes_to_separate, delegate);
         TfLiteIntArrayFree(nodes_to_separate);
         return kTfLiteOk;
       };
+      delegate_.CopyToBufferHandle = [](TfLiteDelegate* delegate,
+                                        TfLiteBufferHandle buffer_handle,
+                                        void* data, int size) -> TfLiteStatus {
+        // TODO(ycling): Implement tests to test buffer copying logic.
+        return kTfLiteOk;
+      };
+      delegate_.CopyFromBufferHandle =
+          [](TfLiteDelegate* delegate, TfLiteBufferHandle buffer_handle,
+             void* data, int size) -> TfLiteStatus {
+        // TODO(ycling): Implement tests to test buffer copying logic.
+        return kTfLiteOk;
+      };
+      delegate_.FreeBufferHandle = [](TfLiteDelegate* delegate,
+                                      TfLiteBufferHandle* handle) {
+        *handle = kTfLiteNullBufferHandle;
+      };
       // Store type-punned data SimpleDelegate structure.
       delegate_.data_ = reinterpret_cast<void*>(this);
     }
@@ -803,36 +919,196 @@ class TestDelegate : public ::testing::Test {
     std::vector<int> nodes_;
     TfLiteDelegate delegate_;
   };
-  Interpreter interpreter_;
+  std::unique_ptr<Interpreter> interpreter_;
+  std::unique_ptr<SimpleDelegate> delegate_;
 };
 
 TEST_F(TestDelegate, BasicDelegate) {
-  interpreter_.Invoke();
-  SimpleDelegate simple({0, 1, 2});
-  interpreter_.ModifyGraphWithDelegate(simple.get_tf_lite_delegate());
+  delegate_ = std::unique_ptr<SimpleDelegate>(new SimpleDelegate({0, 1, 2}));
+  interpreter_->ModifyGraphWithDelegate(delegate_->get_tf_lite_delegate());
 
-  ASSERT_EQ(interpreter_.execution_plan().size(), 1);
-  int node = interpreter_.execution_plan()[0];
-  const auto* node_and_reg = interpreter_.node_and_registration(node);
-  ASSERT_EQ(node_and_reg->second.custom_name,
+  ASSERT_EQ(interpreter_->execution_plan().size(), 1);
+  int node = interpreter_->execution_plan()[0];
+  const auto* node_and_reg = interpreter_->node_and_registration(node);
+  EXPECT_EQ(node_and_reg->second.custom_name,
             SimpleDelegate::FakeFusedRegistration().custom_name);
+
+  const TfLiteDelegateParams* params =
+      reinterpret_cast<const TfLiteDelegateParams*>(
+          node_and_reg->first.builtin_data);
+  ASSERT_EQ(params->nodes_to_replace->size, 3);
+  EXPECT_EQ(params->nodes_to_replace->data[0], 0);
+  EXPECT_EQ(params->nodes_to_replace->data[1], 1);
+  EXPECT_EQ(params->nodes_to_replace->data[2], 2);
+
+  ASSERT_EQ(params->input_tensors->size, 2);
+  EXPECT_EQ(params->input_tensors->data[0], 0);
+  EXPECT_EQ(params->input_tensors->data[1], 1);
+
+  ASSERT_EQ(params->output_tensors->size, 2);
+  EXPECT_EQ(params->output_tensors->data[0], 3);
+  EXPECT_EQ(params->output_tensors->data[1], 4);
 }
 
 TEST_F(TestDelegate, ComplexDeligate) {
-  interpreter_.Invoke();
-  SimpleDelegate simple({1, 2});
-  interpreter_.ModifyGraphWithDelegate(simple.get_tf_lite_delegate());
+  delegate_ = std::unique_ptr<SimpleDelegate>(new SimpleDelegate({1, 2}));
+  interpreter_->ModifyGraphWithDelegate(delegate_->get_tf_lite_delegate());
 
-  ASSERT_EQ(interpreter_.execution_plan().size(), 2);
+  ASSERT_EQ(interpreter_->execution_plan().size(), 2);
   // 0th should be a non-delegated original op
-  ASSERT_EQ(interpreter_.execution_plan()[0], 0);
+  ASSERT_EQ(interpreter_->execution_plan()[0], 0);
   // 1st should be a new macro op (3) which didn't exist)
-  ASSERT_EQ(interpreter_.execution_plan()[1], 3);
-  const auto* node_and_reg = interpreter_.node_and_registration(3);
+  ASSERT_EQ(interpreter_->execution_plan()[1], 3);
+  const auto* node_and_reg = interpreter_->node_and_registration(3);
   ASSERT_EQ(node_and_reg->second.custom_name,
             SimpleDelegate::FakeFusedRegistration().custom_name);
 }
 
+TEST_F(TestDelegate, SetBufferHandleToInput) {
+  delegate_ = std::unique_ptr<SimpleDelegate>(new SimpleDelegate({0, 1, 2}));
+  TfLiteDelegate* delegate = delegate_->get_tf_lite_delegate();
+  interpreter_->ModifyGraphWithDelegate(delegate);
+
+  constexpr int kInputTensorIndex = 0;
+  TfLiteTensor* tensor = interpreter_->tensor(kInputTensorIndex);
+  ASSERT_EQ(tensor->delegate, nullptr);
+  ASSERT_EQ(tensor->buffer_handle, kTfLiteNullBufferHandle);
+
+  TfLiteBufferHandle handle = AllocateBufferHandle();
+  TfLiteStatus status =
+      interpreter_->SetBufferHandle(kInputTensorIndex, handle, delegate);
+  ASSERT_EQ(status, kTfLiteOk);
+  EXPECT_EQ(tensor->delegate, delegate);
+  EXPECT_EQ(tensor->buffer_handle, handle);
+}
+
+TEST_F(TestDelegate, SetBufferHandleToOutput) {
+  delegate_ = std::unique_ptr<SimpleDelegate>(new SimpleDelegate({0, 1, 2}));
+  TfLiteDelegate* delegate = delegate_->get_tf_lite_delegate();
+  interpreter_->ModifyGraphWithDelegate(delegate);
+
+  constexpr int kOutputTensorIndex = 3;
+  TfLiteTensor* tensor = interpreter_->tensor(kOutputTensorIndex);
+  // Before setting the buffer handle, the tensor's `delegate` is already set
+  // because it will be written by the delegate.
+  ASSERT_EQ(tensor->delegate, delegate);
+  ASSERT_EQ(tensor->buffer_handle, kTfLiteNullBufferHandle);
+
+  TfLiteBufferHandle handle = AllocateBufferHandle();
+  TfLiteStatus status =
+      interpreter_->SetBufferHandle(kOutputTensorIndex, handle, delegate);
+  ASSERT_EQ(status, kTfLiteOk);
+  EXPECT_EQ(tensor->delegate, delegate);
+  EXPECT_EQ(tensor->buffer_handle, handle);
+}
+
+TEST_F(TestDelegate, SetInvalidHandleToTensor) {
+  interpreter_->Invoke();
+  delegate_ = std::unique_ptr<SimpleDelegate>(new SimpleDelegate({0, 1, 2}));
+  TfLiteDelegate* delegate = delegate_->get_tf_lite_delegate();
+  interpreter_->ModifyGraphWithDelegate(delegate, true);
+
+  SimpleDelegate another_simple_delegate({0, 1, 2});
+
+  constexpr int kOutputTensorIndex = 3;
+  TfLiteTensor* tensor = interpreter_->tensor(kOutputTensorIndex);
+  // Before setting the buffer handle, the tensor's `delegate` is already set
+  // because it will be written by the delegate.
+  ASSERT_EQ(tensor->delegate, delegate);
+  ASSERT_EQ(tensor->buffer_handle, kTfLiteNullBufferHandle);
+
+  TfLiteBufferHandle handle = AllocateBufferHandle();
+  TfLiteStatus status = interpreter_->SetBufferHandle(
+      kOutputTensorIndex, handle,
+      another_simple_delegate.get_tf_lite_delegate());
+  // Setting a buffer handle to a tensor with another delegate will fail.
+  ASSERT_EQ(status, kTfLiteError);
+  EXPECT_EQ(tensor->delegate, delegate);
+  EXPECT_EQ(tensor->buffer_handle, kTfLiteNullBufferHandle);
+}
+
+TEST_F(TestDelegate, ResizeInputWithNonDynamicDelegateShouldFail) {
+  delegate_ = std::unique_ptr<SimpleDelegate>(new SimpleDelegate({0, 1, 2}));
+  ASSERT_EQ(interpreter_->ResizeInputTensor(0, {1, 2}), kTfLiteOk);
+  ASSERT_EQ(interpreter_->ResizeInputTensor(1, {1, 2}), kTfLiteOk);
+  ASSERT_EQ(
+      interpreter_->ModifyGraphWithDelegate(delegate_->get_tf_lite_delegate()),
+      kTfLiteOk);
+  ASSERT_EQ(interpreter_->ResizeInputTensor(0, {1, 2}), kTfLiteError);
+}
+
+class TestDelegateWithDynamicTensors : public ::testing::Test {
+ protected:
+  void SetUp() override {
+    interpreter_.reset(new Interpreter);
+
+    interpreter_->AddTensors(2);
+    interpreter_->SetInputs({0});
+    interpreter_->SetOutputs({1});
+    TfLiteQuantizationParams quant;
+    interpreter_->SetTensorParametersReadWrite(0, kTfLiteFloat32, "", {3},
+                                               quant);
+    interpreter_->SetTensorParametersReadWrite(1, kTfLiteFloat32, "", {3},
+                                               quant);
+    TfLiteRegistration reg = DynamicCopyOpRegistration();
+    interpreter_->AddNodeWithParameters({0}, {1}, nullptr, 0, nullptr, &reg);
+
+    delegate_.Prepare = [](TfLiteContext* context,
+                           TfLiteDelegate* delegate) -> TfLiteStatus {
+      // In this test, the delegate replaces all the nodes if this function is
+      // called.
+      TfLiteIntArray* execution_plan;
+      TF_LITE_ENSURE_STATUS(
+          context->GetExecutionPlan(context, &execution_plan));
+      context->ReplaceSubgraphsWithDelegateKernels(
+          context, DelegateRegistration(), execution_plan, delegate);
+      return kTfLiteOk;
+    };
+  }
+
+  static TfLiteRegistration DynamicCopyOpRegistration() {
+    TfLiteRegistration reg = {nullptr, nullptr, nullptr, nullptr};
+
+    reg.prepare = [](TfLiteContext* context, TfLiteNode* node) {
+      TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
+      SetTensorToDynamic(output);
+      return kTfLiteOk;
+    };
+
+    reg.invoke = [](TfLiteContext* context, TfLiteNode* node) {
+      // Not implemented since this isn't required in testing.
+      return kTfLiteOk;
+    };
+    return reg;
+  }
+
+  static TfLiteRegistration DelegateRegistration() {
+    TfLiteRegistration reg = {nullptr, nullptr, nullptr, nullptr};
+    return reg;
+  }
+
+  std::unique_ptr<Interpreter> interpreter_;
+  TfLiteDelegate delegate_;
+};
+
+TEST_F(TestDelegateWithDynamicTensors, DisallowDynamicTensors) {
+  interpreter_->ModifyGraphWithDelegate(&delegate_, false);
+
+  ASSERT_EQ(interpreter_->execution_plan().size(), 1);
+  // The interpreter should not call delegate's `Prepare` when dynamic tensors
+  // exist. So the node ID isn't changed.
+  ASSERT_EQ(interpreter_->execution_plan()[0], 0);
+}
+
+TEST_F(TestDelegateWithDynamicTensors, AllowDynamicTensors) {
+  interpreter_->ModifyGraphWithDelegate(&delegate_, true);
+
+  ASSERT_EQ(interpreter_->execution_plan().size(), 1);
+  // The node should be replaced because dynamic tensors are allowed. Therefore
+  // the only node ID in the execution plan changes from 0 to 1.
+  ASSERT_EQ(interpreter_->execution_plan()[0], 1);
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/ios_makefile.inc b/tensorflow/contrib/lite/ios_makefile.inc
index fc6594c3a04ba6aabba99bb631f85737baf389f1..079320586ffd01fc77818a81e0c5962f1d28c1f1 100644
--- a/tensorflow/contrib/lite/ios_makefile.inc
+++ b/tensorflow/contrib/lite/ios_makefile.inc
@@ -31,9 +31,6 @@ ifeq ($(TARGET), IOS)
 		${IPHONEOS_SYSROOT} \
 		-arch $(IOS_ARCH) \
 		-O3
-	ifeq ($(IOS_ARCH), x86_64)
-		CXXFLAGS += -msse4.1
-	endif
 	CCFLAGS += -miphoneos-version-min=$(MIN_SDK_VERSION) \
 		-fembed-bitcode \
 		-mno-thumb \
diff --git a/tensorflow/contrib/lite/java/BUILD b/tensorflow/contrib/lite/java/BUILD
index 35aacb70002d1d454f675484e4398bcdffc4acf1..f52d6ba6c5390e631d29e75f833aa4dd5bba1a68 100644
--- a/tensorflow/contrib/lite/java/BUILD
+++ b/tensorflow/contrib/lite/java/BUILD
@@ -29,7 +29,7 @@ android_library(
     visibility = ["//visibility:public"],
     deps = [
         ":tflite_runtime",
-        "@javax_validation",
+        "@org_checkerframework_qual",
     ],
 )
 
@@ -42,7 +42,7 @@ android_library(
     ),
     visibility = ["//visibility:public"],
     deps = [
-        "@javax_validation",
+        "@org_checkerframework_qual",
     ],
 )
 
@@ -58,7 +58,7 @@ java_library(
     deps = [
         ":libtensorflowlite_jni.so",
         "//tensorflow/contrib/lite/java/src/main/native",
-        "@javax_validation",
+        "@org_checkerframework_qual",
     ],
 )
 
diff --git a/tensorflow/contrib/lite/java/demo/app/src/main/BUILD b/tensorflow/contrib/lite/java/demo/app/src/main/BUILD
index 654fa9d6d2799fc3cafa3e0e042cb2a5746bf2c5..5eb749aae6e224bec64b66832f116ebc3372c1ef 100644
--- a/tensorflow/contrib/lite/java/demo/app/src/main/BUILD
+++ b/tensorflow/contrib/lite/java/demo/app/src/main/BUILD
@@ -6,7 +6,7 @@ android_binary(
     name = "TfLiteCameraDemo",
     srcs = glob(["java/**/*.java"]),
     assets = [
-        "@tflite_mobilenet//:labels.txt",
+        "//tensorflow/contrib/lite/java/demo/app/src/main/assets:labels_mobilenet_quant_v1_224.txt",
         "@tflite_mobilenet//:mobilenet_quant_v1_224.tflite",
     ],
     assets_dir = "",
diff --git a/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/Camera2BasicFragment.java b/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/Camera2BasicFragment.java
index 9b9fdffab557060f0211a0ce361b002cc7d03956..300786c3ca01b12a46f7f9a6fe8fd720f97a79f4 100644
--- a/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/Camera2BasicFragment.java
+++ b/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/Camera2BasicFragment.java
@@ -299,7 +299,7 @@ public class Camera2BasicFragment extends Fragment
       // create either a new ImageClassifierQuantizedMobileNet or an ImageClassifierFloatInception
       classifier = new ImageClassifierQuantizedMobileNet(getActivity());
     } catch (IOException e) {
-      Log.e(TAG, "Failed to initialize an image classifier.");
+      Log.e(TAG, "Failed to initialize an image classifier.", e);
     }
     startBackgroundThread();
   }
@@ -433,7 +433,7 @@ public class Camera2BasicFragment extends Fragment
         return;
       }
     } catch (CameraAccessException e) {
-      e.printStackTrace();
+      Log.e(TAG, "Failed to access Camera", e);
     } catch (NullPointerException e) {
       // Currently an NPE is thrown when the Camera2API is used but not supported on the
       // device this code runs.
@@ -478,7 +478,7 @@ public class Camera2BasicFragment extends Fragment
       }
       manager.openCamera(cameraId, stateCallback, backgroundHandler);
     } catch (CameraAccessException e) {
-      e.printStackTrace();
+      Log.e(TAG, "Failed to open Camera", e);
     } catch (InterruptedException e) {
       throw new RuntimeException("Interrupted while trying to lock camera opening.", e);
     }
@@ -545,7 +545,7 @@ public class Camera2BasicFragment extends Fragment
         runClassifier = false;
       }
     } catch (InterruptedException e) {
-      e.printStackTrace();
+      Log.e(TAG, "Interrupted when stopping background thread", e);
     }
   }
 
@@ -604,7 +604,7 @@ public class Camera2BasicFragment extends Fragment
                 captureSession.setRepeatingRequest(
                     previewRequest, captureCallback, backgroundHandler);
               } catch (CameraAccessException e) {
-                e.printStackTrace();
+                Log.e(TAG, "Failed to set up config to capture Camera", e);
               }
             }
 
@@ -615,7 +615,7 @@ public class Camera2BasicFragment extends Fragment
           },
           null);
     } catch (CameraAccessException e) {
-      e.printStackTrace();
+      Log.e(TAG, "Failed to preview Camera", e);
     }
   }
 
diff --git a/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifierQuantizedMobileNet.java b/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifierQuantizedMobileNet.java
index 156c895146940adfe71f111be6e354e02b75ea48..e164ac75543ebab12e6b1c057c4ed487eb9accdf 100644
--- a/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifierQuantizedMobileNet.java
+++ b/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifierQuantizedMobileNet.java
@@ -16,7 +16,6 @@ limitations under the License.
 package com.example.android.tflitecamerademo;
 
 import android.app.Activity;
-
 import java.io.IOException;
 
 /**
diff --git a/tensorflow/contrib/lite/java/proguard.flags b/tensorflow/contrib/lite/java/proguard.flags
new file mode 100644
index 0000000000000000000000000000000000000000..8ee3d7e7ae728b27789336ac56208acdf13ee424
--- /dev/null
+++ b/tensorflow/contrib/lite/java/proguard.flags
@@ -0,0 +1,3 @@
+-keepclassmembers class org.tensorflow.lite.NativeInterpreterWrapper {
+    private long inferenceDurationNanoseconds;
+}
\ No newline at end of file
diff --git a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/DataType.java b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/DataType.java
index d63c299589d2e8ce1051a52d29b533ed126bbcf7..fc16488a6459eb227fde712055d3e8ccfcce0070 100644
--- a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/DataType.java
+++ b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/DataType.java
@@ -71,6 +71,23 @@ enum DataType {
     throw new IllegalArgumentException("DataType " + this + " is not supported yet");
   }
 
+  /** Gets the string name of the data type. */
+  String toStringName() {
+    switch (this) {
+      case FLOAT32:
+        return "float";
+      case INT32:
+        return "int";
+      case UINT8:
+        return "byte";
+      case INT64:
+        return "long";
+      case BYTEBUFFER:
+        return "ByteBuffer";
+    }
+    throw new IllegalArgumentException("DataType " + this + " is not supported yet");
+  }
+
   // Cached to avoid copying it
   private static final DataType[] values = values();
 }
diff --git a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
index dd883d69d2065236ee29012b9bde99972aefbcf7..14f461f5f9ba8c0755d2a1968533a79cce10750a 100644
--- a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
+++ b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
@@ -19,7 +19,7 @@ import java.io.File;
 import java.nio.MappedByteBuffer;
 import java.util.HashMap;
 import java.util.Map;
-import javax.validation.constraints.NotNull;
+import org.checkerframework.checker.nullness.qual.NonNull;
 
 /**
  * Driver class to drive model inference with TensorFlow Lite.
@@ -60,7 +60,7 @@ public final class Interpreter implements AutoCloseable {
    *
    * @param modelFile: a File of a pre-trained TF Lite model.
    */
-  public Interpreter(@NotNull File modelFile) {
+  public Interpreter(@NonNull File modelFile) {
     if (modelFile == null) {
       return;
     }
@@ -73,20 +73,34 @@ public final class Interpreter implements AutoCloseable {
    * <p>The {@code MappedByteBuffer} should remain unchanged after the construction of a {@code
    * Interpreter}.
    */
-  public Interpreter(@NotNull MappedByteBuffer mappedByteBuffer) {
+  public Interpreter(@NonNull MappedByteBuffer mappedByteBuffer) {
     wrapper = new NativeInterpreterWrapper(mappedByteBuffer);
   }
 
+  /**
+   * Initializes a {@code Interpreter} with a {@code MappedByteBuffer} to the model file and
+   * specifies the number of threads used for inference.
+   *
+   * <p>The {@code MappedByteBuffer} should remain unchanged after the construction of a {@code
+   * Interpreter}.
+   */
+  public Interpreter(@NonNull MappedByteBuffer mappedByteBuffer, int numThreads) {
+    wrapper = new NativeInterpreterWrapper(mappedByteBuffer, numThreads);
+  }
+
   /**
    * Runs model inference if the model takes only one input, and provides only one output.
    *
+   * <p>Warning: The API runs much faster if {@link ByteBuffer} is used as input data type. Please
+   * consider using {@link ByteBuffer} to feed input data for better performance.
+   *
    * @param input an array or multidimensional array, or a {@link ByteBuffer} of primitive types
    *     including int, float, long, and byte. {@link ByteBuffer} is the preferred way to pass large
    *     input data. When {@link ByteBuffer} is used, its content should remain unchanged until
    *     model inference is done.
    * @param output a multidimensional array of output data.
    */
-  public void run(@NotNull Object input, @NotNull Object output) {
+  public void run(@NonNull Object input, @NonNull Object output) {
     Object[] inputs = {input};
     Map<Integer, Object> outputs = new HashMap<>();
     outputs.put(0, output);
@@ -96,6 +110,9 @@ public final class Interpreter implements AutoCloseable {
   /**
    * Runs model inference if the model takes multiple inputs, or returns multiple outputs.
    *
+   * <p>Warning: The API runs much faster if {@link ByteBuffer} is used as input data type. Please
+   * consider using {@link ByteBuffer} to feed input data for better performance.
+   *
    * @param inputs an array of input data. The inputs should be in the same order as inputs of the
    *     model. Each input can be an array or multidimensional array, or a {@link ByteBuffer} of
    *     primitive types including int, float, long, and byte. {@link ByteBuffer} is the preferred
@@ -105,7 +122,7 @@ public final class Interpreter implements AutoCloseable {
    *     needs to keep entries for the outputs to be used.
    */
   public void runForMultipleInputsOutputs(
-      @NotNull Object[] inputs, @NotNull Map<Integer, Object> outputs) {
+      @NonNull Object[] inputs, @NonNull Map<Integer, Object> outputs) {
     if (wrapper == null) {
       throw new IllegalStateException("The Interpreter has already been closed.");
     }
@@ -128,7 +145,7 @@ public final class Interpreter implements AutoCloseable {
    *
    * <p>IllegalArgumentException will be thrown if it fails to resize.
    */
-  public void resizeInput(int idx, @NotNull int[] dims) {
+  public void resizeInput(int idx, @NonNull int[] dims) {
     if (wrapper == null) {
       throw new IllegalStateException("The Interpreter has already been closed.");
     }
@@ -161,6 +178,27 @@ public final class Interpreter implements AutoCloseable {
     return wrapper.getOutputIndex(opName);
   }
 
+  /**
+   * Returns the last native inference duration in nanoseconds, or null if none is available.
+   * <p>IllegalStateException will be thrown if the {@link Interpreter} has already been
+   * closed.
+   */
+  public Long getLastNativeInferenceDurationNanoseconds() {
+    if (wrapper == null) {
+      throw new IllegalStateException("The interpreter has already been closed.");
+    }
+    return wrapper.getLastNativeInferenceDurationNanoseconds();
+  }
+
+  /** Turns on/off Android NNAPI for hardware acceleration when it is available. */
+  public void setUseNNAPI(boolean useNNAPI) {
+    if (wrapper != null) {
+      wrapper.setUseNNAPI(useNNAPI);
+    } else {
+      throw new IllegalStateException("NativeInterpreterWrapper has already been closed.");
+    }
+  }
+
   /** Release resources associated with the {@code Interpreter}. */
   @Override
   public void close() {
diff --git a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java
index 5ee594dec492ad2fee22e603a6de311b3fed4cac..dbf8f8f7cc2815a46130e342d7e45d4e471696de 100644
--- a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java
+++ b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java
@@ -34,7 +34,8 @@ final class NativeInterpreterWrapper implements AutoCloseable {
   NativeInterpreterWrapper(String modelPath) {
     errorHandle = createErrorReporter(ERROR_BUFFER_SIZE);
     modelHandle = createModel(modelPath, errorHandle);
-    interpreterHandle = createInterpreter(modelHandle, errorHandle);
+    interpreterHandle = createInterpreter(modelHandle, errorHandle, /* numThreads= */ -1);
+    isMemoryAllocated = true;
   }
 
   /**
@@ -46,7 +47,21 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     modelByteBuffer = mappedByteBuffer;
     errorHandle = createErrorReporter(ERROR_BUFFER_SIZE);
     modelHandle = createModelWithBuffer(modelByteBuffer, errorHandle);
-    interpreterHandle = createInterpreter(modelHandle, errorHandle);
+    interpreterHandle = createInterpreter(modelHandle, errorHandle, /* numThreads= */ -1);
+    isMemoryAllocated = true;
+  }
+
+  /**
+   * Initializes a {@code NativeInterpreterWrapper} with a {@code MappedByteBuffer} and specifies
+   * the number of inference threads. The MappedByteBuffer should not be modified after the
+   * construction of a {@code NativeInterpreterWrapper}.
+   */
+  NativeInterpreterWrapper(MappedByteBuffer mappedByteBuffer, int numThreads) {
+    modelByteBuffer = mappedByteBuffer;
+    errorHandle = createErrorReporter(ERROR_BUFFER_SIZE);
+    modelHandle = createModelWithBuffer(modelByteBuffer, errorHandle);
+    interpreterHandle = createInterpreter(modelHandle, errorHandle, numThreads);
+    isMemoryAllocated = true;
   }
 
   /** Releases resources associated with this {@code NativeInterpreterWrapper}. */
@@ -59,6 +74,7 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     modelByteBuffer = null;
     inputsIndexes = null;
     outputsIndexes = null;
+    isMemoryAllocated = false;
   }
 
   /** Sets inputs, runs model inference and returns outputs. */
@@ -91,11 +107,21 @@ final class NativeInterpreterWrapper implements AutoCloseable {
                 i, inputs.length));
       }
     }
+    inferenceDurationNanoseconds = -1;
     long[] outputsHandles =
-        run(interpreterHandle, errorHandle, sizes, dataTypes, numsOfBytes, inputs);
+        run(
+            interpreterHandle,
+            errorHandle,
+            sizes,
+            dataTypes,
+            numsOfBytes,
+            inputs,
+            this,
+            isMemoryAllocated);
     if (outputsHandles == null || outputsHandles.length == 0) {
       throw new IllegalStateException("Interpreter has no outputs.");
     }
+    isMemoryAllocated = true;
     Tensor[] outputs = new Tensor[outputsHandles.length];
     for (int i = 0; i < outputsHandles.length; ++i) {
       outputs[i] = Tensor.fromHandle(outputsHandles[i]);
@@ -109,14 +135,18 @@ final class NativeInterpreterWrapper implements AutoCloseable {
       Object[] sizes,
       int[] dtypes,
       int[] numsOfBytes,
-      Object[] values);
+      Object[] values,
+      NativeInterpreterWrapper wrapper,
+      boolean memoryAllocated);
 
   /** Resizes dimensions of a specific input. */
   void resizeInput(int idx, int[] dims) {
-    resizeInput(interpreterHandle, errorHandle, idx, dims);
+    if (resizeInput(interpreterHandle, errorHandle, idx, dims)) {
+      isMemoryAllocated = false;
+    }
   }
 
-  private static native void resizeInput(
+  private static native boolean resizeInput(
       long interpreterHandle, long errorHandle, int inputIdx, int[] dims);
 
   void setUseNNAPI(boolean useNNAPI) {
@@ -236,6 +266,35 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     }
   }
 
+  /**
+   * Gets the last inference duration in nanoseconds. It returns null if there is no previous
+   * inference run or the last inference run failed.
+   */
+  Long getLastNativeInferenceDurationNanoseconds() {
+    return (inferenceDurationNanoseconds < 0) ? null : inferenceDurationNanoseconds;
+  }
+
+  /**
+   * Gets the dimensions of an input. It throws IllegalArgumentException if input index is invalid.
+   */
+  int[] getInputDims(int index) {
+    return getInputDims(interpreterHandle, index, -1);
+  }
+
+  /**
+   * Gets the dimensions of an input. If numBytes >= 0, it will check whether num of bytes match the
+   * input.
+   */
+  private static native int[] getInputDims(long interpreterHandle, int inputIdx, int numBytes);
+
+  /** Gets the type of an output. It throws IllegalArgumentException if output index is invalid. */
+  String getOutputDataType(int index) {
+    int type = getOutputDataType(interpreterHandle, index);
+    return DataType.fromNumber(type).toStringName();
+  }
+
+  private static native int getOutputDataType(long interpreterHandle, int outputIdx);
+
   private static final int ERROR_BUFFER_SIZE = 512;
 
   private long errorHandle;
@@ -246,12 +305,16 @@ final class NativeInterpreterWrapper implements AutoCloseable {
 
   private int inputSize;
 
+  private long inferenceDurationNanoseconds = -1;
+
   private MappedByteBuffer modelByteBuffer;
 
   private Map<String, Integer> inputsIndexes;
 
   private Map<String, Integer> outputsIndexes;
 
+  private boolean isMemoryAllocated = false;
+
   private static native String[] getInputNames(long interpreterHandle);
 
   private static native String[] getOutputNames(long interpreterHandle);
@@ -264,12 +327,10 @@ final class NativeInterpreterWrapper implements AutoCloseable {
 
   private static native long createModelWithBuffer(MappedByteBuffer modelBuffer, long errorHandle);
 
-  private static native long createInterpreter(long modelHandle, long errorHandle);
+  private static native long createInterpreter(long modelHandle, long errorHandle, int numThreads);
 
   private static native void delete(long errorHandle, long modelHandle, long interpreterHandle);
 
-  private static native int[] getInputDims(long interpreterHandle, int inputIdx, int numBytes);
-
   static {
     TensorFlowLite.init();
   }
diff --git a/tensorflow/contrib/lite/java/src/main/native/BUILD b/tensorflow/contrib/lite/java/src/main/native/BUILD
index 15806d57c8ed7a45d2db9b80e2aab8e22349ee3e..3571182ca92e959d54935cfdc76679ab69a8cfa9 100644
--- a/tensorflow/contrib/lite/java/src/main/native/BUILD
+++ b/tensorflow/contrib/lite/java/src/main/native/BUILD
@@ -11,6 +11,7 @@ licenses(["notice"])  # Apache 2.0
 cc_library(
     name = "native_framework_only",
     srcs = [
+        "duration_utils_jni.cc",
         "exception_jni.cc",
         "nativeinterpreterwrapper_jni.cc",
         "tensor_jni.cc",
diff --git a/tensorflow/contrib/lite/java/src/main/native/duration_utils_jni.cc b/tensorflow/contrib/lite/java/src/main/native/duration_utils_jni.cc
new file mode 100644
index 0000000000000000000000000000000000000000..0e08a04370592f6e3c92b5811fa7e163f808e03c
--- /dev/null
+++ b/tensorflow/contrib/lite/java/src/main/native/duration_utils_jni.cc
@@ -0,0 +1,38 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <jni.h>
+#include <time.h>
+
+namespace tflite {
+
+// Gets the current time from the monotonic clock as a timespec.
+timespec getCurrentTime() {
+  timespec time;
+  clock_gettime(CLOCK_MONOTONIC, &time);
+  return time;
+}
+
+// Computes the time diff from two timespecs. Returns '-1' if 'stop' is earlier
+// than 'start'.
+jlong timespec_diff_nanoseconds(struct timespec* start, struct timespec* stop) {
+  jlong result = stop->tv_sec - start->tv_sec;
+  if (result < 0) return -1;
+  result = 1000000000 * result + (stop->tv_nsec - start->tv_nsec);
+  if (result < 0) return -1;
+  return result;
+}
+
+}  // namespace tflite
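Review note on `timespec_diff_nanoseconds`: the sketch below restates the same arithmetic as a standalone function so the edge cases can be checked outside a JNI build. `ElapsedNanoseconds`, `MakeTimespec`, and `CheckExamples` are hypothetical names that are not part of this patch, and `int64_t` stands in for `jlong` on the assumption that `<jni.h>` is unavailable.

```cpp
#include <cassert>  // used by CheckExamples below
#include <cstdint>
#include <time.h>

// Hypothetical stand-in for timespec_diff_nanoseconds: returns stop - start
// in nanoseconds, or -1 if 'stop' is earlier than 'start'.
int64_t ElapsedNanoseconds(const timespec& start, const timespec& stop) {
  int64_t result = static_cast<int64_t>(stop.tv_sec) - start.tv_sec;
  if (result < 0) return -1;
  // tv_nsec of 'stop' may be smaller than that of 'start'; the seconds term
  // absorbs the borrow, so the sum is still the true difference.
  result = 1000000000LL * result + (stop.tv_nsec - start.tv_nsec);
  if (result < 0) return -1;
  return result;
}

// Hypothetical helper that builds a timespec from seconds and nanoseconds.
timespec MakeTimespec(time_t sec, long nsec) {
  timespec t;
  t.tv_sec = sec;
  t.tv_nsec = nsec;
  return t;
}

// Exercises the borrow case and the "stop before start" case.
void CheckExamples() {
  assert(ElapsedNanoseconds(MakeTimespec(1, 900000000),
                            MakeTimespec(2, 100000000)) == 200000000);
  assert(ElapsedNanoseconds(MakeTimespec(5, 0), MakeTimespec(4, 0)) == -1);
}
```

The -1 sentinel matters on the Java side, where `getLastNativeInferenceDurationNanoseconds()` maps any negative value to `null`.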
diff --git a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc
index c346f9f92e360c0722ebac440d790da6441ceecf..844226203bb02f4017b2f04da34ac81ac2b7a191 100644
--- a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc
+++ b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc
@@ -14,7 +14,6 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h"
-
 namespace {
 
 const int kByteBufferValue = 999;
@@ -79,6 +78,21 @@ TfLiteType resolveDataType(jint data_type) {
   }
 }
 
+int getDataType(TfLiteType data_type) {
+  switch (data_type) {
+    case kTfLiteFloat32:
+      return 1;
+    case kTfLiteInt32:
+      return 2;
+    case kTfLiteUInt8:
+      return 3;
+    case kTfLiteInt64:
+      return 4;
+    default:
+      return -1;
+  }
+}
+
 void printDims(char* buffer, int max_size, int* dims, int num_dims) {
   if (max_size <= 0) return;
   buffer[0] = '?';
@@ -149,6 +163,45 @@ TfLiteStatus checkInputs(JNIEnv* env, tflite::Interpreter* interpreter,
   return kTfLiteOk;
 }
 
+// Checks whether the dimensions of a tensor differ from the given dimensions.
+// Returns true if they differ, false otherwise.
+bool areDimsDifferent(JNIEnv* env, TfLiteTensor* tensor, jintArray dims) {
+  int num_dims = static_cast<int>(env->GetArrayLength(dims));
+  jint* ptr = env->GetIntArrayElements(dims, nullptr);
+  if (ptr == nullptr) {
+    throwException(env, kIllegalArgumentException,
+                   "Empty dimensions of input array.");
+    return true;
+  }
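+  bool different = (tensor->dims->size != num_dims);
+  for (int i = 0; !different && i < num_dims; ++i) {
+    if (ptr[i] != tensor->dims->data[i]) {
+      different = true;
+    }
+  }
+  // Release the array on every path; JNI_ABORT skips the copy-back because
+  // the elements were only read.
+  env->ReleaseIntArrayElements(dims, ptr, JNI_ABORT);
+  return different;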
+}
+
+bool areInputDimensionsTheSame(JNIEnv* env, tflite::Interpreter* interpreter,
+                               int input_size, jobjectArray sizes) {
+  if (interpreter->inputs().size() != input_size) {
+    return false;
+  }
+  for (int i = 0; i < input_size; ++i) {
+    int input_idx = interpreter->inputs()[i];
+    jintArray dims =
+        static_cast<jintArray>(env->GetObjectArrayElement(sizes, i));
+    TfLiteTensor* target = interpreter->tensor(input_idx);
+    if (areDimsDifferent(env, target, dims)) return false;
+    env->DeleteLocalRef(dims);
+    if (env->ExceptionCheck()) return false;
+  }
+  return true;
+}
+
 TfLiteStatus resizeInputs(JNIEnv* env, tflite::Interpreter* interpreter,
                           int input_size, jobjectArray sizes) {
   for (int i = 0; i < input_size; ++i) {
@@ -270,6 +323,19 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_createErrorReporter(
   return reinterpret_cast<jlong>(error_reporter);
 }
 
+// Verifies whether the model is a flatbuffer file.
+class JNIFlatBufferVerifier : public tflite::TfLiteVerifier {
+ public:
+  bool Verify(const char* data, int length,
+              tflite::ErrorReporter* reporter) override {
+    if (!VerifyModel(data, length)) {
+      reporter->Report("The model is not a valid Flatbuffer file");
+      return false;
+    }
+    return true;
+  }
+};
+
 JNIEXPORT jlong JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_createModel(
     JNIEnv* env, jclass clazz, jstring model_file, jlong error_handle) {
@@ -278,17 +344,11 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_createModel(
   if (error_reporter == nullptr) return 0;
   const char* path = env->GetStringUTFChars(model_file, nullptr);
 
-  {
-    tflite::FileCopyAllocation allocation(path, nullptr);
-    if (!VerifyModel(allocation.base(), allocation.bytes())) {
-      throwException(env, kIllegalArgumentException,
-                     "Contents of %s is not a valid flatbuffer model", path);
-      env->ReleaseStringUTFChars(model_file, path);
-      return 0;
-    }
-  }
+  std::unique_ptr<tflite::TfLiteVerifier> verifier;
+  verifier.reset(new JNIFlatBufferVerifier());
 
-  auto model = tflite::FlatBufferModel::BuildFromFile(path, error_reporter);
+  auto model = tflite::FlatBufferModel::VerifyAndBuildFromFile(
+      path, verifier.get(), error_reporter);
   if (!model) {
     throwException(env, kIllegalArgumentException,
                    "Contents of %s does not encode a valid TensorFlowLite "
@@ -330,7 +390,8 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_createModelWithBuffer(
 
 JNIEXPORT jlong JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_createInterpreter(
-    JNIEnv* env, jclass clazz, jlong model_handle, jlong error_handle) {
+    JNIEnv* env, jclass clazz, jlong model_handle, jlong error_handle,
+    jint num_threads) {
   tflite::FlatBufferModel* model = convertLongToModel(env, model_handle);
   if (model == nullptr) return 0;
   BufferErrorReporter* error_reporter =
@@ -338,12 +399,21 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_createInterpreter(
   if (error_reporter == nullptr) return 0;
   auto resolver = ::tflite::CreateOpResolver();
   std::unique_ptr<tflite::Interpreter> interpreter;
-  TfLiteStatus status =
-      tflite::InterpreterBuilder(*model, *(resolver.get()))(&interpreter);
+  TfLiteStatus status = tflite::InterpreterBuilder(*model, *(resolver.get()))(
+      &interpreter, static_cast<int>(num_threads));
   if (status != kTfLiteOk) {
     throwException(env, kIllegalArgumentException,
                    "Cannot create interpreter: %s",
                    error_reporter->CachedErrorMessage());
+    return 0;
+  }
+  // allocates memory
+  status = interpreter->AllocateTensors();
+  if (status != kTfLiteOk) {
+    throwException(env, kNullPointerException,
+                   "Can not allocate memory for the interpreter: %s",
+                   error_reporter->CachedErrorMessage());
+    return 0;
   }
   return reinterpret_cast<jlong>(interpreter.release());
 }
@@ -353,7 +423,7 @@ JNIEXPORT jlongArray JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
     JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle,
     jobjectArray sizes, jintArray data_types, jintArray nums_of_bytes,
-    jobjectArray values) {
+    jobjectArray values, jobject wrapper, jboolean memory_allocated) {
   tflite::Interpreter* interpreter =
       convertLongToInterpreter(env, interpreter_handle);
   if (interpreter == nullptr) return nullptr;
@@ -365,25 +435,29 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
   TfLiteStatus status = checkInputs(env, interpreter, input_size, data_types,
                                     nums_of_bytes, values, sizes);
   if (status != kTfLiteOk) return nullptr;
-  // resizes inputs
-  status = resizeInputs(env, interpreter, input_size, sizes);
-  if (status != kTfLiteOk) {
-    throwException(env, kNullPointerException, "Can not resize the input: %s",
-                   error_reporter->CachedErrorMessage());
-    return nullptr;
-  }
-  // allocates memory
-  status = interpreter->AllocateTensors();
-  if (status != kTfLiteOk) {
-    throwException(env, kNullPointerException,
-                   "Can not allocate memory for the given inputs: %s",
-                   error_reporter->CachedErrorMessage());
-    return nullptr;
+  if (!memory_allocated ||
+      !areInputDimensionsTheSame(env, interpreter, input_size, sizes)) {
+    // resizes inputs
+    status = resizeInputs(env, interpreter, input_size, sizes);
+    if (status != kTfLiteOk) {
+      throwException(env, kNullPointerException, "Can not resize the input: %s",
+                     error_reporter->CachedErrorMessage());
+      return nullptr;
+    }
+    // allocates memory
+    status = interpreter->AllocateTensors();
+    if (status != kTfLiteOk) {
+      throwException(env, kNullPointerException,
+                     "Can not allocate memory for the given inputs: %s",
+                     error_reporter->CachedErrorMessage());
+      return nullptr;
+    }
   }
   // sets inputs
   status = setInputs(env, interpreter, input_size, data_types, nums_of_bytes,
                      values);
   if (status != kTfLiteOk) return nullptr;
+  timespec beforeInference = ::tflite::getCurrentTime();
   // runs inference
   if (interpreter->Invoke() != kTfLiteOk) {
     throwException(env, kIllegalArgumentException,
@@ -391,6 +465,17 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
                    error_reporter->CachedErrorMessage());
     return nullptr;
   }
+  timespec afterInference = ::tflite::getCurrentTime();
+  jclass wrapper_clazz = env->GetObjectClass(wrapper);
+  jfieldID fid =
+      env->GetFieldID(wrapper_clazz, "inferenceDurationNanoseconds", "J");
+  if (env->ExceptionCheck()) {
+    env->ExceptionClear();
+  } else if (fid != nullptr) {
+    env->SetLongField(
+        wrapper, fid,
+        ::tflite::timespec_diff_nanoseconds(&beforeInference, &afterInference));
+  }
   // returns outputs
   const std::vector<int>& results = interpreter->outputs();
   if (results.empty()) {
@@ -414,7 +499,7 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputDims(
   tflite::Interpreter* interpreter = convertLongToInterpreter(env, handle);
   if (interpreter == nullptr) return nullptr;
   const int idx = static_cast<int>(input_idx);
-  if (input_idx >= interpreter->inputs().size()) {
+  if (input_idx < 0 || input_idx >= interpreter->inputs().size()) {
     throwException(env, kIllegalArgumentException,
                    "Out of range: Failed to get %d-th input out of %d inputs",
                    input_idx, interpreter->inputs().size());
@@ -422,45 +507,72 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputDims(
   }
   TfLiteTensor* target = interpreter->tensor(interpreter->inputs()[idx]);
   int size = target->dims->size;
-  int expected_num_bytes = elementByteSize(target->type);
-  for (int i = 0; i < size; ++i) {
-    expected_num_bytes *= target->dims->data[i];
-  }
-  if (num_bytes != expected_num_bytes) {
-    throwException(env, kIllegalArgumentException,
-                   "Failed to get input dimensions. %d-th input should have"
-                   " %d bytes, but found %d bytes.",
-                   idx, expected_num_bytes, num_bytes);
-    return nullptr;
+  if (num_bytes >= 0) {  // verifies num of bytes matches if num_bytes is valid.
+    int expected_num_bytes = elementByteSize(target->type);
+    for (int i = 0; i < size; ++i) {
+      expected_num_bytes *= target->dims->data[i];
+    }
+    if (num_bytes != expected_num_bytes) {
+      throwException(env, kIllegalArgumentException,
+                     "Failed to get input dimensions. %d-th input should have"
+                     " %d bytes, but found %d bytes.",
+                     idx, expected_num_bytes, num_bytes);
+      return nullptr;
+    }
   }
   jintArray outputs = env->NewIntArray(size);
   env->SetIntArrayRegion(outputs, 0, size, &(target->dims->data[0]));
   return outputs;
 }
 
-JNIEXPORT void JNICALL
+JNIEXPORT jint JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_getOutputDataType(
+    JNIEnv* env, jclass clazz, jlong handle, jint output_idx) {
+  tflite::Interpreter* interpreter = convertLongToInterpreter(env, handle);
+  if (interpreter == nullptr) return -1;
+  const int idx = static_cast<int>(output_idx);
+  if (output_idx < 0 || output_idx >= interpreter->outputs().size()) {
+    throwException(env, kIllegalArgumentException,
+                   "Out of range: Failed to get %d-th output out of %d outputs",
+                   output_idx, interpreter->outputs().size());
+    return -1;
+  }
+  TfLiteTensor* target = interpreter->tensor(interpreter->outputs()[idx]);
+  int type = getDataType(target->type);
+  return static_cast<jint>(type);
+}
+
+JNIEXPORT jboolean JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_resizeInput(
     JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle,
     jint input_idx, jintArray dims) {
   BufferErrorReporter* error_reporter =
       convertLongToErrorReporter(env, error_handle);
-  if (error_reporter == nullptr) return;
+  if (error_reporter == nullptr) return JNI_FALSE;
   tflite::Interpreter* interpreter =
       convertLongToInterpreter(env, interpreter_handle);
-  if (interpreter == nullptr) return;
+  if (interpreter == nullptr) return JNI_FALSE;
   const int idx = static_cast<int>(input_idx);
   if (idx < 0 || idx >= interpreter->inputs().size()) {
     throwException(env, kIllegalArgumentException,
                    "Can not resize %d-th input for a model having %d inputs.",
                    idx, interpreter->inputs().size());
+    return JNI_FALSE;
   }
-  TfLiteStatus status = interpreter->ResizeInputTensor(
-      interpreter->inputs()[idx], convertJIntArrayToVector(env, dims));
-  if (status != kTfLiteOk) {
-    throwException(env, kIllegalArgumentException,
-                   "Failed to resize %d-th input: %s", idx,
-                   error_reporter->CachedErrorMessage());
+  // check whether it is resizing with the same dimensions.
+  TfLiteTensor* target = interpreter->tensor(interpreter->inputs()[idx]);
+  bool is_changed = areDimsDifferent(env, target, dims);
+  if (is_changed) {
+    TfLiteStatus status = interpreter->ResizeInputTensor(
+        interpreter->inputs()[idx], convertJIntArrayToVector(env, dims));
+    if (status != kTfLiteOk) {
+      throwException(env, kIllegalArgumentException,
+                     "Failed to resize %d-th input: %s", idx,
+                     error_reporter->CachedErrorMessage());
+      return JNI_FALSE;
+    }
   }
+  return is_changed ? JNI_TRUE : JNI_FALSE;
 }
 
 JNIEXPORT void JNICALL Java_org_tensorflow_lite_NativeInterpreterWrapper_delete(
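Review note on the `resizeInput` change above: the new boolean return value reports whether the requested dimensions differ from the tensor's current ones, so the caller can skip a reallocation when nothing changed. The bodies of `areDimsDifferent` and `areInputDimensionsTheSame` are not shown in these hunks, so the following is only a minimal sketch of the compare-then-resize idea. It uses plain `std::vector<int>` shapes in place of `TfLiteTensor` and `jintArray`, and the names `DimsDiffer` and `ResizeIfChanged` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true when the two shapes differ in rank or in
// any single dimension.
bool DimsDiffer(const std::vector<int>& current,
                const std::vector<int>& requested) {
  if (current.size() != requested.size()) return true;
  for (std::size_t i = 0; i < current.size(); ++i) {
    if (current[i] != requested[i]) return true;
  }
  return false;
}

// Hypothetical helper: overwrites *current only when the requested shape is
// different, and reports whether a resize actually happened.
bool ResizeIfChanged(std::vector<int>* current,
                     const std::vector<int>& requested) {
  if (!DimsDiffer(*current, requested)) return false;
  *current = requested;
  return true;
}
```

Under these assumptions, a caller would only need to reallocate buffers when `ResizeIfChanged` returns true, which mirrors how the `memory_allocated` flag gates `AllocateTensors()` in `run`.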
diff --git a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h
index c52a7e4e439936344be26d5761fb5747db64794a..0e28a77feea41d72be126d6e60fffbe7ce374a76 100644
--- a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h
+++ b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h
@@ -18,6 +18,7 @@ limitations under the License.
 
 #include <jni.h>
 #include <stdio.h>
+#include <time.h>
 #include <vector>
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/interpreter.h"
@@ -28,6 +29,9 @@ limitations under the License.
 namespace tflite {
 // This is to be provided at link-time by a library.
 extern std::unique_ptr<OpResolver> CreateOpResolver();
+extern timespec getCurrentTime();
+extern jlong timespec_diff_nanoseconds(struct timespec* start,
+                                       struct timespec* stop);
 }  // namespace tflite
 
 #ifdef __cplusplus
@@ -95,30 +99,33 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_createModelWithBuffer(
 /*
  *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
  *  Method:
- *  Signature: (JJ)J
+ *  Signature: (JJI)J
  */
 JNIEXPORT jlong JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_createInterpreter(
-    JNIEnv* env, jclass clazz, jlong model_handle, jlong error_handle);
+    JNIEnv* env, jclass clazz, jlong model_handle, jlong error_handle,
+    jint num_threads);
 
 /*
  *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
  *  Method:
- *  Signature: (JJ[Ljava/lang/Object;[I[I[Ljava/lang/Object;)[J
+ *  Signature:
+ * (JJ[Ljava/lang/Object;[I[I[Ljava/lang/Object;Ljava/lang/Object;Z)[J
  */
 JNIEXPORT jlongArray JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
     JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle,
     jobjectArray sizes, jintArray data_types, jintArray nums_of_bytes,
-    jobjectArray values);
+    jobjectArray values, jobject wrapper, jboolean memory_allocated);
 
 /*
  *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
  *  Method:
  *  Signature: (JII)[I
  *
- * It gets input dimensions if num_bytes matches number of bytes required by
- * the input, else returns null and throws IllegalArgumentException.
+ * Gets input dimensions. If num_bytes is non-negative, it will check whether
+ * num_bytes matches num of bytes required by the input, and return null and
+ * throw IllegalArgumentException if not.
  */
 JNIEXPORT jintArray JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputDims(
@@ -127,11 +134,23 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputDims(
 /*
  *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
  *  Method:
- *  Signature: (JJI[I)
+ *  Signature: (JI)I
  *
- * It resizes dimensions of a input.
+ * Gets output data type.
  */
-JNIEXPORT void JNICALL
+JNIEXPORT jint JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_getOutputDataType(
+    JNIEnv* env, jclass clazz, jlong handle, jint output_idx);
+
+/*
+ *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
+ *  Method:
+ *  Signature: (JJI[I)Z
+ *
+ * Returns true if the input tensor is resized to different dimensions, else
+ * returns false.
+ */
+JNIEXPORT jboolean JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_resizeInput(
     JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle,
     jint input_idx, jintArray dims);
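Review note on the header change above: `getCurrentTime()` and `timespec_diff_nanoseconds()` are declared `extern` here and their definitions are not part of these hunks. As a rough illustration of what such helpers could look like, the sketch below assumes a POSIX system with `clock_gettime` and a monotonic clock. The names `CurrentTime` and `DiffNanoseconds` are hypothetical, and clamping negative differences to zero is a design choice of this sketch, not something the diff confirms.

```cpp
#include <time.h>

#include <cstdint>

// Hypothetical helper: reads the monotonic clock (assumes POSIX clock_gettime).
inline timespec CurrentTime() {
  timespec t;
  clock_gettime(CLOCK_MONOTONIC, &t);
  return t;
}

// Hypothetical helper: returns stop - start in nanoseconds. A negative
// difference is clamped to zero, which is an assumption of this sketch.
inline int64_t DiffNanoseconds(const timespec& start, const timespec& stop) {
  const int64_t kNanosPerSecond = 1000000000LL;
  const int64_t sec =
      static_cast<int64_t>(stop.tv_sec) - static_cast<int64_t>(start.tv_sec);
  const int64_t nsec =
      static_cast<int64_t>(stop.tv_nsec) - static_cast<int64_t>(start.tv_nsec);
  const int64_t total = sec * kNanosPerSecond + nsec;
  return total < 0 ? 0 : total;
}
```

Taking one reading immediately before `Invoke()` and one immediately after, as `run` does, keeps input copying and output conversion out of the reported latency.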
diff --git a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java
index 424b3de6c97672e310c54230a7ac1204f46d9ac8..61d6c35ec86beebf78dd81e17e145863516802fa 100644
--- a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java
+++ b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java
@@ -218,4 +218,52 @@ public final class InterpreterTest {
     int index = interpreter.getOutputIndex("MobilenetV1/Predictions/Softmax");
     assertThat(index).isEqualTo(0);
   }
+
+  @Test
+  public void testTurnOffNNAPI() throws Exception {
+    Path path = MODEL_FILE.toPath();
+    FileChannel fileChannel =
+        (FileChannel) Files.newByteChannel(path, EnumSet.of(StandardOpenOption.READ));
+    MappedByteBuffer mappedByteBuffer =
+        fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());
+    Interpreter interpreter = new Interpreter(mappedByteBuffer);
+    interpreter.setUseNNAPI(true);
+    float[] oneD = {1.23f, 6.54f, 7.81f};
+    float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD, oneD};
+    float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
+    float[][][][] fourD = {threeD, threeD};
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    interpreter.run(fourD, parsedOutputs);
+    float[] outputOneD = parsedOutputs[0][0][0];
+    float[] expected = {3.69f, 19.62f, 23.43f};
+    assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
+    interpreter.setUseNNAPI(false);
+    interpreter.run(fourD, parsedOutputs);
+    outputOneD = parsedOutputs[0][0][0];
+    assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
+    interpreter.close();
+    fileChannel.close();
+  }
+
+  @Test
+  public void testTurnOnNNAPI() throws Exception {
+    Path path = MODEL_FILE.toPath();
+    FileChannel fileChannel =
+        (FileChannel) Files.newByteChannel(path, EnumSet.of(StandardOpenOption.READ));
+    MappedByteBuffer mappedByteBuffer =
+        fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());
+    Interpreter interpreter = new Interpreter(mappedByteBuffer);
+    interpreter.setUseNNAPI(true);
+    float[] oneD = {1.23f, 6.54f, 7.81f};
+    float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD, oneD};
+    float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
+    float[][][][] fourD = {threeD, threeD};
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    interpreter.run(fourD, parsedOutputs);
+    float[] outputOneD = parsedOutputs[0][0][0];
+    float[] expected = {3.69f, 19.62f, 23.43f};
+    assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
+    interpreter.close();
+    fileChannel.close();
+  }
 }
diff --git a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java
index 90323555d88419d837a76bca7de6d9998e388fca..dbe45e5a05b8227b441de7ca6747f61d010ae210 100644
--- a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java
+++ b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java
@@ -47,6 +47,9 @@ public final class NativeInterpreterWrapperTest {
   private static final String MODEL_WITH_CUSTOM_OP_PATH =
       "tensorflow/contrib/lite/java/src/testdata/with_custom_op.lite";
 
+  private static final String NONEXISTING_MODEL_PATH =
+      "tensorflow/contrib/lite/java/src/testdata/nonexisting_model.bin";
+
   @Test
   public void testConstructor() {
     NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
@@ -60,7 +63,18 @@ public final class NativeInterpreterWrapperTest {
       NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(INVALID_MODEL_PATH);
       fail();
     } catch (IllegalArgumentException e) {
-      assertThat(e).hasMessageThat().contains("is not a valid flatbuffer model");
+      assertThat(e).hasMessageThat().contains("The model is not a valid Flatbuffer file");
+    }
+  }
+
+  @Test
+  public void testConstructorWithNonexistingModel() {
+    try {
+      NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(NONEXISTING_MODEL_PATH);
+      fail();
+    } catch (IllegalArgumentException e) {
+      assertThat(e).hasMessageThat().contains("The model is not a valid Flatbuffer file");
+      assertThat(e).hasMessageThat().contains("Could not open");
     }
   }
 
@@ -94,6 +108,30 @@ public final class NativeInterpreterWrapperTest {
     wrapper.close();
   }
 
+  @Test
+  public void testRunWithInputsOfSameDims() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    float[] oneD = {1.23f, -6.54f, 7.81f};
+    float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD, oneD};
+    float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
+    float[][][][] fourD = {threeD, threeD};
+    Object[] inputs = {fourD};
+    Tensor[] outputs = wrapper.run(inputs);
+    assertThat(outputs.length).isEqualTo(1);
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    outputs[0].copyTo(parsedOutputs);
+    float[] outputOneD = parsedOutputs[0][0][0];
+    float[] expected = {3.69f, -19.62f, 23.43f};
+    assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
+    outputs = wrapper.run(inputs);
+    assertThat(outputs.length).isEqualTo(1);
+    parsedOutputs = new float[2][8][8][3];
+    outputs[0].copyTo(parsedOutputs);
+    outputOneD = parsedOutputs[0][0][0];
+    assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
+    wrapper.close();
+  }
+
   @Test
   public void testRunWithInt() {
     NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(INT_MODEL_PATH);
@@ -417,4 +455,87 @@ public final class NativeInterpreterWrapperTest {
     assertThat(shape[1]).isEqualTo(3);
     assertThat(shape[2]).isEqualTo(1);
   }
+
+  @Test
+  public void testGetInferenceLatency() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    float[] oneD = {1.23f, 6.54f, 7.81f};
+    float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD, oneD};
+    float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
+    float[][][][] fourD = {threeD, threeD};
+    Object[] inputs = {fourD};
+    Tensor[] outputs = wrapper.run(inputs);
+    assertThat(outputs.length).isEqualTo(1);
+    assertThat(wrapper.getLastNativeInferenceDurationNanoseconds()).isGreaterThan(0L);
+    wrapper.close();
+  }
+
+  @Test
+  public void testGetInferenceLatencyWithNewWrapper() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    assertThat(wrapper.getLastNativeInferenceDurationNanoseconds()).isNull();
+    wrapper.close();
+  }
+
+  @Test
+  public void testGetLatencyAfterFailedInference() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    float[] oneD = {1.23f, 6.54f, 7.81f};
+    float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD};
+    float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
+    float[][][][] fourD = {threeD, threeD};
+    Object[] inputs = {fourD};
+    try {
+      wrapper.run(inputs);
+      fail();
+    } catch (IllegalArgumentException e) {
+      assertThat(e)
+          .hasMessageThat()
+          .contains("0-th input dimension should be [?,8,8,3], but found [?,8,7,3]");
+    }
+    assertThat(wrapper.getLastNativeInferenceDurationNanoseconds()).isNull();
+    wrapper.close();
+  }
+
+  @Test
+  public void testGetInputDims() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    int[] expectedDims = {1, 8, 8, 3};
+    assertThat(wrapper.getInputDims(0)).isEqualTo(expectedDims);
+    wrapper.close();
+  }
+
+  @Test
+  public void testGetInputDimsOutOfRange() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    try {
+      wrapper.getInputDims(-1);
+      fail();
+    } catch (IllegalArgumentException e) {
+      assertThat(e).hasMessageThat().contains("Out of range");
+    }
+    try {
+      wrapper.getInputDims(1);
+      fail();
+    } catch (IllegalArgumentException e) {
+      assertThat(e).hasMessageThat().contains("Out of range");
+    }
+    wrapper.close();
+  }
+
+  @Test
+  public void testGetOutputDataType() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    assertThat(wrapper.getOutputDataType(0)).contains("float");
+    wrapper.close();
+    wrapper = new NativeInterpreterWrapper(LONG_MODEL_PATH);
+    assertThat(wrapper.getOutputDataType(0)).contains("long");
+    wrapper.close();
+    wrapper = new NativeInterpreterWrapper(INT_MODEL_PATH);
+    assertThat(wrapper.getOutputDataType(0)).contains("int");
+    wrapper.close();
+    wrapper = new NativeInterpreterWrapper(BYTE_MODEL_PATH);
+    assertThat(wrapper.getOutputDataType(0)).contains("byte");
+    wrapper.close();
+  }
 }
diff --git a/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java b/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java
index 8660cabf709e6531a5667a16e5cf43a93c7135bd..3aef0c3bb6cc4748de0e55d31f0215a77320ae69 100644
--- a/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java
+++ b/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java
@@ -32,4 +32,55 @@ public class TestHelper {
       throw new IllegalArgumentException("Interpreter has not initialized; Failed to setUseNNAPI.");
     }
   }
+
+  /**
+   * Gets the last inference duration in nanoseconds. It returns null if there is no previous
+   * inference run or the last inference run failed.
+   *
+   * @param interpreter an instance of {@code Interpreter}. If it is not initialized, an {@code
+   *     IllegalArgumentException} will be thrown.
+   */
+  public static Long getLastNativeInferenceDurationNanoseconds(Interpreter interpreter) {
+    if (interpreter != null && interpreter.wrapper != null) {
+      return interpreter.wrapper.getLastNativeInferenceDurationNanoseconds();
+    } else {
+      throw new IllegalArgumentException("Interpreter has not initialized; Failed to get latency.");
+    }
+  }
+
+  /**
+   * Gets the dimensions of an input.
+   *
+   * @param interpreter an instance of {@code Interpreter}. If it is not initialized, an {@code
+   *     IllegalArgumentException} will be thrown.
+   * @param index an integer index of the input. If it is invalid, an {@code
+   *     IllegalArgumentException} will be thrown.
+   */
+  public static int[] getInputDims(Interpreter interpreter, int index) {
+    if (interpreter != null && interpreter.wrapper != null) {
+      return interpreter.wrapper.getInputDims(index);
+    } else {
+      throw new IllegalArgumentException(
+          "Interpreter has not initialized;" + " Failed to get input dimensions.");
+    }
+  }
+
+  /**
+   * Gets the string name of the data type of an output.
+   *
+   * @param interpreter an instance of {@code Interpreter}. If it is not initialized, an {@code
+   *     IllegalArgumentException} will be thrown.
+   * @param index an integer index of the output. If it is invalid, an {@code
+   *     IllegalArgumentException} will be thrown.
+   * @return string name of the data type. Possible values include "float", "int", "byte", and
+   *     "long".
+   */
+  public static String getOutputDataType(Interpreter interpreter, int index) {
+    if (interpreter != null && interpreter.wrapper != null) {
+      return interpreter.wrapper.getOutputDataType(index);
+    } else {
+      throw new IllegalArgumentException(
+          "Interpreter has not initialized;" + " Failed to get output data type.");
+    }
+  }
 }
diff --git a/tensorflow/contrib/lite/kernels/BUILD b/tensorflow/contrib/lite/kernels/BUILD
index 701993bc43d769c4485880786d8e390617568e97..48021aea47573b1b24bae78a9532200dc222020e 100644
--- a/tensorflow/contrib/lite/kernels/BUILD
+++ b/tensorflow/contrib/lite/kernels/BUILD
@@ -5,15 +5,17 @@ package(default_visibility = [
 licenses(["notice"])  # Apache 2.0
 
 load("//tensorflow/contrib/lite:build_def.bzl", "tflite_copts")
-load(
-    "//tensorflow:tensorflow.bzl",
-    "tf_cc_test",
-)
+load("//tensorflow/contrib/lite:special_rules.bzl", "tflite_portable_test_suite")
+load("//tensorflow:tensorflow.bzl", "tf_cc_test")
 
 tf_cc_test(
     name = "optional_tensor_test",
     size = "small",
     srcs = ["optional_tensor_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -33,11 +35,27 @@ cc_library(
         "//tensorflow/contrib/lite:schema_fbs_version",
         "//tensorflow/contrib/lite:string_util",
         "//tensorflow/contrib/lite/testing:util",
-        "//tensorflow/core:lib",
+        "//tensorflow/core:tflite_portable_logging",
         "@com_google_googletest//:gtest",
     ],
 )
 
+cc_library(
+    name = "eigen_support",
+    srcs = [
+        "eigen_support.cc",
+    ],
+    hdrs = [
+        "eigen_support.h",
+    ],
+    copts = tflite_copts(),
+    deps = [
+        ":op_macros",
+        "//tensorflow/contrib/lite:context",
+        "//third_party/eigen3",
+    ],
+)
+
 cc_library(
     name = "gemm_support",
     srcs = [
@@ -90,6 +108,10 @@ tf_cc_test(
     name = "kernel_util_test",
     size = "small",
     srcs = ["kernel_util_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":kernel_util",
         "//tensorflow/contrib/lite/testing:util",
@@ -97,18 +119,32 @@ tf_cc_test(
     ],
 )
 
+tf_cc_test(
+    name = "test_util_test",
+    size = "small",
+    srcs = ["test_util_test.cc"],
+    deps = [
+        ":test_util",
+        "//tensorflow/contrib/lite/testing:util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
 cc_library(
     name = "builtin_ops",
     srcs = [
         "activations.cc",
         "add.cc",
+        "audio_spectrogram.cc",
         "basic_rnn.cc",
         "batch_to_space_nd.cc",
         "bidirectional_sequence_lstm.cc",
         "bidirectional_sequence_rnn.cc",
+        "cast.cc",
         "concatenation.cc",
         "conv.cc",
         "depthwise_conv.cc",
+        "dequantize.cc",
         "div.cc",
         "embedding_lookup.cc",
         "embedding_lookup_sparse.cc",
@@ -122,6 +158,7 @@ cc_library(
         "lstm.cc",
         "maximum.cc",
         "mean.cc",
+        "mfcc.cc",
         "mul.cc",
         "pad.cc",
         "pooling.cc",
@@ -155,21 +192,49 @@ cc_library(
     }),
     deps = [
         ":activation_functor",
+        ":eigen_support",
         ":kernel_util",
         ":op_macros",
         "//tensorflow/contrib/lite:builtin_op_data",
         "//tensorflow/contrib/lite:framework",
         "//tensorflow/contrib/lite:string_util",
         "//tensorflow/contrib/lite/kernels:gemm_support",
+        "//tensorflow/contrib/lite/kernels/internal:audio_utils",
         "//tensorflow/contrib/lite/kernels/internal:kernel_utils",
         "//tensorflow/contrib/lite/kernels/internal:optimized",
         "//tensorflow/contrib/lite/kernels/internal:optimized_base",
         "//tensorflow/contrib/lite/kernels/internal:quantization_util",
         "//tensorflow/contrib/lite/kernels/internal:reference",
         "//tensorflow/contrib/lite/kernels/internal:reference_base",
-        "//tensorflow/contrib/lite/kernels/internal:round",
         "//tensorflow/contrib/lite/kernels/internal:tensor_utils",
         "@farmhash_archive//:farmhash",
+        "@flatbuffers",
+    ],
+)
+
+tf_cc_test(
+    name = "audio_spectrogram_test",
+    size = "small",
+    srcs = ["audio_spectrogram_test.cc"],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+        "@flatbuffers",
+    ],
+)
+
+tf_cc_test(
+    name = "mfcc_test",
+    size = "small",
+    srcs = ["mfcc_test.cc"],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+        "@flatbuffers",
     ],
 )
 
@@ -177,6 +242,10 @@ tf_cc_test(
     name = "activations_test",
     size = "small",
     srcs = ["activations_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -189,6 +258,42 @@ tf_cc_test(
     name = "add_test",
     size = "small",
     srcs = ["add_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+tf_cc_test(
+    name = "div_test",
+    size = "small",
+    srcs = ["div_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+tf_cc_test(
+    name = "sub_test",
+    size = "small",
+    srcs = ["sub_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -201,6 +306,10 @@ tf_cc_test(
     name = "transpose_test",
     size = "small",
     srcs = ["transpose_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -215,6 +324,10 @@ tf_cc_test(
     name = "space_to_batch_nd_test",
     size = "small",
     srcs = ["space_to_batch_nd_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -227,6 +340,22 @@ tf_cc_test(
     name = "batch_to_space_nd_test",
     size = "small",
     srcs = ["batch_to_space_nd_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+tf_cc_test(
+    name = "cast_test",
+    size = "small",
+    srcs = ["cast_test.cc"],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -239,6 +368,10 @@ tf_cc_test(
     name = "concatenation_test",
     size = "small",
     srcs = ["concatenation_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -251,6 +384,10 @@ tf_cc_test(
     name = "conv_test",
     size = "small",
     srcs = ["conv_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -264,10 +401,27 @@ tf_cc_test(
     name = "depthwise_conv_test",
     size = "small",
     srcs = ["depthwise_conv_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+tf_cc_test(
+    name = "dequantize_test",
+    size = "small",
+    srcs = ["dequantize_test.cc"],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
         "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_absl//absl/memory",
         "@com_google_googletest//:gtest",
     ],
 )
@@ -276,6 +430,10 @@ tf_cc_test(
     name = "basic_rnn_test",
     size = "small",
     srcs = ["basic_rnn_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -288,6 +446,10 @@ tf_cc_test(
     name = "bidirectional_sequence_lstm_test",
     size = "small",
     srcs = ["bidirectional_sequence_lstm_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -300,6 +462,10 @@ tf_cc_test(
     name = "unidirectional_sequence_lstm_test",
     size = "small",
     srcs = ["unidirectional_sequence_lstm_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -312,6 +478,9 @@ tf_cc_test(
     name = "bidirectional_sequence_rnn_test",
     size = "small",
     srcs = ["bidirectional_sequence_rnn_test.cc"],
+    tags = [
+        "tflite_not_portable",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -324,6 +493,10 @@ tf_cc_test(
     name = "unidirectional_sequence_rnn_test",
     size = "small",
     srcs = ["unidirectional_sequence_rnn_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -336,6 +509,10 @@ tf_cc_test(
     name = "l2norm_test",
     size = "small",
     srcs = ["l2norm_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -348,6 +525,10 @@ tf_cc_test(
     name = "exp_test",
     size = "small",
     srcs = ["exp_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -372,6 +553,10 @@ tf_cc_test(
     name = "mean_test",
     size = "small",
     srcs = ["mean_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -384,6 +569,10 @@ tf_cc_test(
     name = "mul_test",
     size = "small",
     srcs = ["mul_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -396,6 +585,10 @@ tf_cc_test(
     name = "pad_test",
     size = "small",
     srcs = ["pad_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -408,6 +601,10 @@ tf_cc_test(
     name = "reshape_test",
     size = "small",
     srcs = ["reshape_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -420,6 +617,10 @@ tf_cc_test(
     name = "gather_test",
     size = "small",
     srcs = ["gather_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:builtin_op_data",
@@ -433,6 +634,10 @@ tf_cc_test(
     name = "topk_v2_test",
     size = "small",
     srcs = ["topk_v2_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:builtin_op_data",
@@ -446,6 +651,10 @@ tf_cc_test(
     name = "resize_bilinear_test",
     size = "small",
     srcs = ["resize_bilinear_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -458,6 +667,10 @@ tf_cc_test(
     name = "svdf_test",
     size = "small",
     srcs = ["svdf_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -470,6 +683,10 @@ tf_cc_test(
     name = "embedding_lookup_test",
     size = "small",
     srcs = ["embedding_lookup_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -482,6 +699,10 @@ tf_cc_test(
     name = "embedding_lookup_sparse_test",
     size = "small",
     srcs = ["embedding_lookup_sparse_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -494,6 +715,10 @@ tf_cc_test(
     name = "fully_connected_test",
     size = "small",
     srcs = ["fully_connected_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -506,6 +731,10 @@ tf_cc_test(
     name = "local_response_norm_test",
     size = "small",
     srcs = ["local_response_norm_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -518,6 +747,10 @@ tf_cc_test(
     name = "pooling_test",
     size = "small",
     srcs = ["pooling_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -530,6 +763,10 @@ tf_cc_test(
     name = "softmax_test",
     size = "small",
     srcs = ["softmax_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -543,6 +780,10 @@ tf_cc_test(
     name = "log_softmax_test",
     size = "small",
     srcs = ["log_softmax_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -556,6 +797,10 @@ tf_cc_test(
     name = "lsh_projection_test",
     size = "small",
     srcs = ["lsh_projection_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -568,6 +813,10 @@ tf_cc_test(
     name = "hashtable_lookup_test",
     size = "small",
     srcs = ["hashtable_lookup_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -581,6 +830,10 @@ tf_cc_test(
     name = "lstm_test",
     size = "small",
     srcs = ["lstm_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -593,6 +846,10 @@ tf_cc_test(
     name = "skip_gram_test",
     size = "small",
     srcs = ["skip_gram_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -606,6 +863,10 @@ tf_cc_test(
     name = "space_to_depth_test",
     size = "small",
     srcs = ["space_to_depth_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -618,6 +879,10 @@ tf_cc_test(
     name = "split_test",
     size = "small",
     srcs = ["split_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -630,6 +895,10 @@ tf_cc_test(
     name = "squeeze_test",
     size = "small",
     srcs = ["squeeze_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -642,6 +911,10 @@ tf_cc_test(
     name = "strided_slice_test",
     size = "small",
     srcs = ["strided_slice_test.cc"],
+    tags = [
+        "tflite_not_portable_ios_arm64",
+        "tflite_not_portable_ios_x86_64",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -661,3 +934,5 @@ filegroup(
     ),
     visibility = ["//tensorflow:__subpackages__"],
 )
+
+tflite_portable_test_suite()
diff --git a/tensorflow/contrib/lite/kernels/activations.cc b/tensorflow/contrib/lite/kernels/activations.cc
index 6acded3091cb820ba641bac2498799d295d7dc7f..39a54c93962b33f3a787b3387d9a133119d0e80a 100644
--- a/tensorflow/contrib/lite/kernels/activations.cc
+++ b/tensorflow/contrib/lite/kernels/activations.cc
@@ -63,6 +63,33 @@ TfLiteStatus GenericPrepare(TfLiteContext* context, TfLiteNode* node) {
                                TfLiteIntArrayCopy(input->dims));
 }
 
+TfLiteStatus TanhPrepare(TfLiteContext* context, TfLiteNode* node) {
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
+
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 1);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+  TfLiteTensor* input = GetInput(context, node, 0);
+  TfLiteTensor* output = GetOutput(context, node, 0);
+  TF_LITE_ENSURE_EQ(context, input->type, output->type);
+
+  if (input->type == kTfLiteUInt8) {
+    static constexpr int kInputIntegerBits = 4;
+
+    const double input_real_multiplier =
+        input->params.scale *
+        static_cast<double>(1 << (31 - kInputIntegerBits));
+
+    QuantizeMultiplierGreaterThanOne(input_real_multiplier,
+                                     &data->input_multiplier,
+                                     &data->input_left_shift);
+    data->input_range_radius =
+        CalculateInputRadius(kInputIntegerBits, data->input_left_shift);
+  }
+
+  return context->ResizeTensor(context, output,
+                               TfLiteIntArrayCopy(input->dims));
+}
+
 TfLiteStatus SigmoidPrepare(TfLiteContext* context, TfLiteNode* node) {
   OpData* data = reinterpret_cast<OpData*>(node->user_data);
 
@@ -123,6 +150,34 @@ TfLiteStatus SoftmaxPrepare(TfLiteContext* context, TfLiteNode* node) {
                                TfLiteIntArrayCopy(input->dims));
 }
 
+TfLiteStatus PreluPrepare(TfLiteContext* context, TfLiteNode* node) {
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 2);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+  TfLiteTensor* input = GetInput(context, node, 0);
+  TfLiteTensor* output = GetOutput(context, node, 0);
+  TfLiteTensor* alpha = GetInput(context, node, 1);
+
+  output->type = input->type;
+
+  // Currently only Float32 is supported
+  // TODO(ycling): Support other data types.
+  TF_LITE_ENSURE_EQ(context, input->type, kTfLiteFloat32);
+  TF_LITE_ENSURE_EQ(context, alpha->type, kTfLiteFloat32);
+
+  // Currently, only 4D `input` and 3D `alpha` with shape (1, 1, channels)
+  // are supported.
+  // TODO(impjdi): Support other cases where `alpha` is broadcastable
+  // to `input`.
+  TF_LITE_ENSURE_EQ(context, input->dims->size, 4);
+  TF_LITE_ENSURE_EQ(context, alpha->dims->size, 3);
+  TF_LITE_ENSURE_EQ(context, alpha->dims->data[0], 1);
+  TF_LITE_ENSURE_EQ(context, alpha->dims->data[1], 1);
+  TF_LITE_ENSURE_EQ(context, alpha->dims->data[2], input->dims->data[3]);
+
+  return context->ResizeTensor(context, output,
+                               TfLiteIntArrayCopy(input->dims));
+}
+
 TfLiteStatus ReluEval(TfLiteContext* context, TfLiteNode* node) {
   TfLiteTensor* input = GetInput(context, node, 0);
   TfLiteTensor* output = GetOutput(context, node, 0);
@@ -180,6 +235,7 @@ TfLiteStatus Relu6Eval(TfLiteContext* context, TfLiteNode* node) {
 }
 
 TfLiteStatus TanhEval(TfLiteContext* context, TfLiteNode* node) {
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
   TfLiteTensor* input = GetInput(context, node, 0);
   TfLiteTensor* output = GetOutput(context, node, 0);
   switch (input->type) {
@@ -191,6 +247,14 @@ TfLiteStatus TanhEval(TfLiteContext* context, TfLiteNode* node) {
       for (; in < in_end; in++, out++) *out = std::tanh(*in);
       return kTfLiteOk;
     } break;
+    case kTfLiteUInt8: {
+      optimized_ops::Tanh(GetTensorData<uint8_t>(input), GetTensorDims(input),
+                          input->params.zero_point, data->input_range_radius,
+                          data->input_multiplier, data->input_left_shift,
+                          GetTensorData<uint8_t>(output),
+                          GetTensorDims(output));
+      return kTfLiteOk;
+    } break;
     default:
       context->ReportError(context, "Only float32 supported currently.");
       return kTfLiteError;
@@ -352,6 +416,35 @@ TfLiteStatus LogSoftmaxEval(TfLiteContext* context, TfLiteNode* node) {
   }
 }
 
+TfLiteStatus PreluEval(TfLiteContext* context, TfLiteNode* node) {
+  TfLiteTensor* input = GetInput(context, node, 0);
+  TfLiteTensor* alpha = GetInput(context, node, 1);
+  TfLiteTensor* output = GetOutput(context, node, 0);
+
+  if (input->type != kTfLiteFloat32) {
+    context->ReportError(context, "Only float32 supported currently.");
+    return kTfLiteError;
+  }
+  TF_LITE_ENSURE_EQ(context, input->dims->size, 4);
+  const int batches = input->dims->data[0];
+  const int height = input->dims->data[1];
+  const int width = input->dims->data[2];
+  const int channels = input->dims->data[3];
+
+  TF_LITE_ENSURE_EQ(context, alpha->dims->size, 3);
+  TF_LITE_ENSURE_EQ(context, alpha->dims->data[0], 1);
+  TF_LITE_ENSURE_EQ(context, alpha->dims->data[1], 1);
+  TF_LITE_ENSURE_EQ(context, alpha->dims->data[2], channels);
+
+  const int n = batches * height * width * channels;
+  for (int i = 0; i < n; ++i) {
+    const float x = input->data.f[i];
+    output->data.f[i] = x >= 0.0f ? x : alpha->data.f[i % channels] * x;
+  }
+
+  return kTfLiteOk;
+}
+
 }  // namespace activations
 
 TfLiteRegistration* Register_RELU() {
@@ -376,8 +469,8 @@ TfLiteRegistration* Register_RELU6() {
 }
 
 TfLiteRegistration* Register_TANH() {
-  static TfLiteRegistration r = {/*init=*/nullptr, /*free=*/nullptr,
-                                 activations::GenericPrepare,
+  static TfLiteRegistration r = {activations::Init, activations::Free,
+                                 activations::TanhPrepare,
                                  activations::TanhEval};
   return &r;
 }
@@ -403,6 +496,13 @@ TfLiteRegistration* Register_LOG_SOFTMAX() {
   return &r;
 }
 
+TfLiteRegistration* Register_PRELU() {
+  static TfLiteRegistration r = {/*init=*/nullptr, /*free=*/nullptr,
+                                 activations::PreluPrepare,
+                                 activations::PreluEval};
+  return &r;
+}
+
 }  // namespace builtin
 }  // namespace ops
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/activations_test.cc b/tensorflow/contrib/lite/kernels/activations_test.cc
index 302e52b96db0206f77eb4c8fcffd565b1db0cd3e..50a84edd475c8051a563cf8ed9fc03099829b786 100644
--- a/tensorflow/contrib/lite/kernels/activations_test.cc
+++ b/tensorflow/contrib/lite/kernels/activations_test.cc
@@ -52,6 +52,14 @@ class BaseActivationsOpModel : public SingleOpModel {
     BuildInterpreter({GetShape(input_)});
   }
 
+  BaseActivationsOpModel(BuiltinOperator type, const TensorData &input,
+                         const TensorData &output) {
+    input_ = AddInput(input);
+    output_ = AddOutput(output);
+    SetBuiltinOp(type, BuiltinOptions_NONE, 0);
+    BuildInterpreter({GetShape(input_)});
+  }
+
  protected:
   int input_;
   int output_;
@@ -143,6 +151,27 @@ TEST(FloatActivationsOpTest, Tanh) {
                              })));
 }
 
+TEST(QuantizedActivationsOpTest, Tanh) {
+  QuantizedActivationsOpModel m(
+      BuiltinOperator_TANH,
+      /*input=*/{TensorType_UINT8, {1, 2, 4, 1}, -8, 8},
+      /*output=*/{TensorType_UINT8, {1, 2, 4, 1}, -1, 1});
+  m.SetInput({
+      0, -6, 2, 4,   //
+      -4, -2, 8, 1,  //
+  });
+  m.Invoke();
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(ArrayFloatNear(
+                  {
+                      0.0, -0.999987, 0.964027, 0.999329,     //
+                      -0.996078, -0.96402, 0.99999, 0.76159,  //
+                  },
+                  4 * (1. / 256))));
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray({128, 0, 251, 255, 0, 5, 255, 226}));
+}
+
 TEST(FloatActivationsOpTest, Sigmoid) {
   FloatActivationsOpModel m(BuiltinOperator_LOGISTIC,
                             /*input=*/{TensorType_FLOAT32, {1, 2, 4, 1}});
@@ -354,6 +383,49 @@ TEST(FloatActivationsOpTest, LogSoftmax) {
                               })));
 }
 
+class PReluOpModel : public SingleOpModel {
+ public:
+  PReluOpModel(const TensorData& input, const TensorData& alpha) {
+    input_ = AddInput(input);
+    alpha_ = AddInput(alpha);
+    output_ = AddOutput(input);
+    SetBuiltinOp(BuiltinOperator_PRELU, BuiltinOptions_NONE, 0);
+    BuildInterpreter({GetShape(input_), GetShape(alpha_)});
+  }
+  void SetInput(std::initializer_list<float> data) {
+    PopulateTensor(input_, data);
+  }
+  void SetAlpha(std::initializer_list<float> data) {
+    PopulateTensor(alpha_, data);
+  }
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+
+ protected:
+  int input_;
+  int alpha_;
+  int output_;
+};
+
+TEST(FloatActivationsOpTest, PRelu) {
+  PReluOpModel m({TensorType_FLOAT32, {1, 2, 2, 3}},
+                 {TensorType_FLOAT32, {1, 1, 3}});
+
+  m.SetInput({
+      0.0f, 0.0f, 0.0f,     // Row 1, Column 1
+      1.0f, 1.0f, 1.0f,     // Row 1, Column 2
+      -1.0f, -1.0f, -1.0f,  // Row 2, Column 1
+      -2.0f, -2.0f, -2.0f,  // Row 2, Column 2
+  });
+  m.SetAlpha({0.0f, 1.0f, 2.0f});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({
+                                 0.0f, 0.0f, 0.0f,    // Row 1, Column 1
+                                 1.0f, 1.0f, 1.0f,    // Row 1, Column 2
+                                 0.0f, -1.0f, -2.0f,  // Row 2, Column 1
+                                 0.0f, -2.0f, -4.0f,  // Row 2, Column 2
+                             }));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/audio_spectrogram.cc b/tensorflow/contrib/lite/kernels/audio_spectrogram.cc
new file mode 100644
index 0000000000000000000000000000000000000000..602f3888c10b3790dc0328c817bdd83276544b25
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/audio_spectrogram.cc
@@ -0,0 +1,168 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <cmath>
+#include <vector>
+
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/spectrogram.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+
+#include "flatbuffers/flexbuffers.h"
+
+namespace tflite {
+namespace ops {
+namespace custom {
+namespace audio_spectrogram {
+
+constexpr int kInputTensor = 0;
+constexpr int kOutputTensor = 0;
+
+enum KernelType {
+  kReference,
+};
+
+typedef struct {
+  int window_size;
+  int stride;
+  bool magnitude_squared;
+  int output_height;
+  internal::Spectrogram* spectrogram;
+} TfLiteAudioSpectrogramParams;
+
+void* Init(TfLiteContext* context, const char* buffer, size_t length) {
+  auto* data = new TfLiteAudioSpectrogramParams;
+
+  const uint8_t* buffer_t = reinterpret_cast<const uint8_t*>(buffer);
+
+  const flexbuffers::Map& m = flexbuffers::GetRoot(buffer_t, length).AsMap();
+  data->window_size = m["window_size"].AsInt64();
+  data->stride = m["stride"].AsInt64();
+  data->magnitude_squared = m["magnitude_squared"].AsBool();
+
+  data->spectrogram = new internal::Spectrogram;
+
+  return data;
+}
+
+void Free(TfLiteContext* context, void* buffer) {
+  auto* params = reinterpret_cast<TfLiteAudioSpectrogramParams*>(buffer);
+  delete params->spectrogram;
+  delete params;
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  auto* params =
+      reinterpret_cast<TfLiteAudioSpectrogramParams*>(node->user_data);
+
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 1);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+
+  TfLiteTensor* input = GetInput(context, node, kInputTensor);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+
+  TF_LITE_ENSURE_EQ(context, NumDimensions(input), 2);
+
+  TF_LITE_ENSURE_EQ(context, output->type, kTfLiteFloat32);
+  TF_LITE_ENSURE_EQ(context, input->type, output->type);
+
+  TF_LITE_ENSURE(context, params->spectrogram->Initialize(params->window_size,
+                                                          params->stride));
+  const int64_t sample_count = input->dims->data[0];
+  const int64_t length_minus_window = (sample_count - params->window_size);
+  if (length_minus_window < 0) {
+    params->output_height = 0;
+  } else {
+    params->output_height = 1 + (length_minus_window / params->stride);
+  }
+  TfLiteIntArray* output_size = TfLiteIntArrayCreate(3);
+  output_size->data[0] = input->dims->data[1];
+  output_size->data[1] = params->output_height;
+  output_size->data[2] = params->spectrogram->output_frequency_channels();
+
+  return context->ResizeTensor(context, output, output_size);
+}
+
+template <KernelType kernel_type>
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  auto* params =
+      reinterpret_cast<TfLiteAudioSpectrogramParams*>(node->user_data);
+
+  TfLiteTensor* input = GetInput(context, node, kInputTensor);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+
+  TF_LITE_ENSURE(context, params->spectrogram->Initialize(params->window_size,
+                                                          params->stride));
+
+  const float* input_data = GetTensorData<float>(input);
+
+  const int64_t sample_count = input->dims->data[0];
+  const int64_t channel_count = input->dims->data[1];
+
+  const int64_t output_width = params->spectrogram->output_frequency_channels();
+
+  float* output_flat = GetTensorData<float>(output);
+
+  std::vector<float> input_for_channel(sample_count);
+  for (int64_t channel = 0; channel < channel_count; ++channel) {
+    float* output_slice =
+        output_flat + (channel * params->output_height * output_width);
+    for (int i = 0; i < sample_count; ++i) {
+      input_for_channel[i] = input_data[i * channel_count + channel];
+    }
+    std::vector<std::vector<float>> spectrogram_output;
+    TF_LITE_ENSURE(context,
+                   params->spectrogram->ComputeSquaredMagnitudeSpectrogram(
+                       input_for_channel, &spectrogram_output));
+    TF_LITE_ENSURE_EQ(context, spectrogram_output.size(),
+                      params->output_height);
+    TF_LITE_ENSURE(context, spectrogram_output.empty() ||
+                                (spectrogram_output[0].size() == output_width));
+    for (int row_index = 0; row_index < params->output_height; ++row_index) {
+      const std::vector<float>& spectrogram_row = spectrogram_output[row_index];
+      TF_LITE_ENSURE_EQ(context, spectrogram_row.size(), output_width);
+      float* output_row = output_slice + (row_index * output_width);
+      if (params->magnitude_squared) {
+        for (int i = 0; i < output_width; ++i) {
+          output_row[i] = spectrogram_row[i];
+        }
+      } else {
+        for (int i = 0; i < output_width; ++i) {
+          output_row[i] = sqrtf(spectrogram_row[i]);
+        }
+      }
+    }
+  }
+  return kTfLiteOk;
+}
+
+}  // namespace audio_spectrogram
+
+TfLiteRegistration* Register_AUDIO_SPECTROGRAM() {
+  static TfLiteRegistration r = {
+      audio_spectrogram::Init, audio_spectrogram::Free,
+      audio_spectrogram::Prepare,
+      audio_spectrogram::Eval<audio_spectrogram::kReference>};
+  return &r;
+}
+
+}  // namespace custom
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/audio_spectrogram_test.cc b/tensorflow/contrib/lite/kernels/audio_spectrogram_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..8d460fdfc610ef9a867acd492ca0558fb6eab8c3
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/audio_spectrogram_test.cc
@@ -0,0 +1,122 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <functional>
+#include <memory>
+#include <vector>
+
+#include <gtest/gtest.h>
+#include "flatbuffers/flexbuffers.h"
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace ops {
+namespace custom {
+
+TfLiteRegistration* Register_AUDIO_SPECTROGRAM();
+
+namespace {
+
+using ::testing::ElementsAre;
+using ::testing::ElementsAreArray;
+
+class BaseAudioSpectrogramOpModel : public SingleOpModel {
+ public:
+  BaseAudioSpectrogramOpModel(const TensorData& input1,
+                              const TensorData& output, int window_size,
+                              int stride, bool magnitude_squared) {
+    input1_ = AddInput(input1);
+    output_ = AddOutput(output);
+
+    flexbuffers::Builder fbb;
+    fbb.Map([&]() {
+      fbb.Int("window_size", window_size);
+      fbb.Int("stride", stride);
+      fbb.Bool("magnitude_squared", magnitude_squared);
+    });
+    fbb.Finish();
+    SetCustomOp("AudioSpectrogram", fbb.GetBuffer(),
+                Register_AUDIO_SPECTROGRAM);
+    BuildInterpreter({GetShape(input1_)});
+  }
+
+  int input1() { return input1_; }
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ protected:
+  int input1_;
+  int output_;
+};
+
+TEST(SpectrogramOpTest, NonSquaredTest) {
+  BaseAudioSpectrogramOpModel m({TensorType_FLOAT32, {8, 1}},
+                                {TensorType_FLOAT32, {}}, 8, 1, false);
+  m.PopulateTensor<float>(m.input1(),
+                          {-1.0f, 0.0f, 1.0f, 0.0f, -1.0f, 0.0f, 1.0f, 0.0f});
+
+  m.Invoke();
+
+  std::vector<int> output_shape = m.GetOutputShape();
+  EXPECT_EQ(3, output_shape.size());
+  EXPECT_THAT(output_shape, ElementsAre(1, 1, 5));
+
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear(
+                                 {0.0f, 1.0f, 2.0f, 1.0f, 0.0f}, 1e-3)));
+}
+
+TEST(SpectrogramOpTest, SquaredTest) {
+  BaseAudioSpectrogramOpModel m({TensorType_FLOAT32, {8, 1}},
+                                {TensorType_FLOAT32, {}}, 8, 1, true);
+  m.PopulateTensor<float>(m.input1(),
+                          {-1.0f, 0.0f, 1.0f, 0.0f, -1.0f, 0.0f, 1.0f, 0.0f});
+
+  m.Invoke();
+
+  std::vector<int> output_shape = m.GetOutputShape();
+  EXPECT_EQ(3, output_shape.size());
+  EXPECT_THAT(output_shape, ElementsAre(1, 1, 5));
+
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear(
+                                 {0.f, 1.f, 4.f, 1.f, 0.f}, 1e-3)));
+}
+
+TEST(SpectrogramOpTest, StrideTest) {
+  BaseAudioSpectrogramOpModel m({TensorType_FLOAT32, {10, 1}},
+                                {TensorType_FLOAT32, {}}, 8, 2, true);
+  m.PopulateTensor<float>(m.input1(), {-1.0f, 0.0f, 1.0f, 0.0f, -1.0f, 0.0f,
+                                       1.0f, 0.0f, 1.0f, 0.0f});
+
+  m.Invoke();
+
+  std::vector<int> output_shape = m.GetOutputShape();
+  EXPECT_THAT(output_shape, ElementsAre(1, 2, 5));
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear(
+                                 {0, 1, 4, 1, 0, 1, 2, 1, 2, 1}, 1e-3)));
+}
+
+}  // namespace
+}  // namespace custom
+}  // namespace ops
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc b/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc
index 8d70df5e21fab110be238a6f72abe9aac8a75622..a64ac42bc43336db928d2682e290f5263f3db0f4 100644
--- a/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc
+++ b/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/kernels/activation_functor.h"
+#include "tensorflow/contrib/lite/kernels/internal/kernel_utils.h"
 #include "tensorflow/contrib/lite/kernels/internal/tensor_utils.h"
 #include "tensorflow/contrib/lite/kernels/kernel_util.h"
 #include "tensorflow/contrib/lite/kernels/op_macros.h"
@@ -443,166 +444,6 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   return kTfLiteOk;
 }
 
-// Performs an LSTM batch inference step for input specified by input_ptr_batch.
-// The LSTM cell is specified by the pointers to its weights (*_weights_ptr) and
-// biases (*_bias_ptr), and buffers (*_scratch), along with additional
-// parameters:
-//  - params: various LSTM params including activation, clipping, etc.,
-//  - use_cifg: use coupled input forget gates,
-//  - use_peephole: whether to use peephole connection or not,
-//  - n_batch: size of batch,
-//  - n_cell: number of cells (or units),
-//  - n_input: the input size,
-//  - n_output: the output size.
-//
-// The pointers to the hidden state and the output are updated as a result.
-//
-// The pointers with the suffix "_batch" point to data aligned in batch_major
-// order, and each step processes batch_size many inputs from input_ptr_batch,
-// and updates batch_size many outputs and hidden states.
-void LstmBatchStep(
-    const float* input_ptr_batch, const float* input_to_input_weights_ptr,
-    const float* input_to_forget_weights_ptr,
-    const float* input_to_cell_weights_ptr,
-    const float* input_to_output_weights_ptr,
-    const float* recurrent_to_input_weights_ptr,
-    const float* recurrent_to_forget_weights_ptr,
-    const float* recurrent_to_cell_weights_ptr,
-    const float* recurrent_to_output_weights_ptr,
-    const float* cell_to_input_weights_ptr,
-    const float* cell_to_forget_weights_ptr,
-    const float* cell_to_output_weights_ptr, const float* input_gate_bias_ptr,
-    const float* forget_gate_bias_ptr, const float* cell_bias_ptr,
-    const float* output_gate_bias_ptr, const float* projection_weights_ptr,
-    const float* projection_bias_ptr, const TfLiteLSTMParams* params,
-    bool use_cifg, bool use_peephole, int n_batch, int n_cell, int n_input,
-    int n_output, float* output_state_ptr, float* cell_state_ptr,
-    float* input_gate_scratch, float* forget_gate_scratch, float* cell_scratch,
-    float* output_gate_scratch, float* output_ptr_time) {
-  // Initialize scratch buffers with bias.
-  if (!use_cifg) {
-    tensor_utils::VectorBatchVectorAssign(input_gate_bias_ptr, n_cell, n_batch,
-                                          input_gate_scratch);
-  }
-  tensor_utils::VectorBatchVectorAssign(forget_gate_bias_ptr, n_cell, n_batch,
-                                        forget_gate_scratch);
-  tensor_utils::VectorBatchVectorAssign(cell_bias_ptr, n_cell, n_batch,
-                                        cell_scratch);
-  tensor_utils::VectorBatchVectorAssign(output_gate_bias_ptr, n_cell, n_batch,
-                                        output_gate_scratch);
-
-  // For each batch and cell: compute input_weight * input.
-  if (!use_cifg) {
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        input_to_input_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
-        input_gate_scratch, /*result_stride=*/1);
-  }
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      input_to_forget_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
-      forget_gate_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      input_to_cell_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
-      cell_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      input_to_output_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
-      output_gate_scratch, /*result_stride=*/1);
-
-  // For each batch and cell: compute recurrent_weight * output_state.
-  if (!use_cifg) {
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        recurrent_to_input_weights_ptr, n_cell, n_output, output_state_ptr,
-        n_batch, input_gate_scratch,
-        /*result_stride=*/1);
-  }
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      recurrent_to_forget_weights_ptr, n_cell, n_output, output_state_ptr,
-      n_batch, forget_gate_scratch,
-      /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      recurrent_to_cell_weights_ptr, n_cell, n_output, output_state_ptr,
-      n_batch, cell_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      recurrent_to_output_weights_ptr, n_cell, n_output, output_state_ptr,
-      n_batch, output_gate_scratch,
-      /*result_stride=*/1);
-
-  // For each batch and cell: update input gate.
-  if (!use_cifg) {
-    if (use_peephole) {
-      tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-          cell_to_input_weights_ptr, n_cell, cell_state_ptr, n_batch,
-          input_gate_scratch);
-    }
-    tensor_utils::ApplySigmoidToVector(input_gate_scratch, n_cell * n_batch,
-                                       input_gate_scratch);
-  }
-
-  // For each batch and cell: update forget gate.
-  if (use_peephole) {
-    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-        cell_to_forget_weights_ptr, n_cell, cell_state_ptr, n_batch,
-        forget_gate_scratch);
-  }
-  tensor_utils::ApplySigmoidToVector(forget_gate_scratch, n_cell * n_batch,
-                                     forget_gate_scratch);
-
-  // For each batch and cell: update the cell.
-  tensor_utils::VectorVectorCwiseProduct(forget_gate_scratch, cell_state_ptr,
-                                         n_batch * n_cell, cell_state_ptr);
-  tensor_utils::ApplyActivationToVector(cell_scratch, n_batch * n_cell,
-                                        params->activation, cell_scratch);
-  if (use_cifg) {
-    tensor_utils::Sub1Vector(forget_gate_scratch, n_batch * n_cell,
-                             forget_gate_scratch);
-    tensor_utils::VectorVectorCwiseProductAccumulate(
-        cell_scratch, forget_gate_scratch, n_batch * n_cell, cell_state_ptr);
-  } else {
-    tensor_utils::VectorVectorCwiseProductAccumulate(
-        cell_scratch, input_gate_scratch, n_batch * n_cell, cell_state_ptr);
-  }
-  if (params->cell_clip > 0.0) {
-    tensor_utils::ClipVector(cell_state_ptr, n_batch * n_cell,
-                             params->cell_clip, cell_state_ptr);
-  }
-
-  // For each batch and cell: update the output gate.
-  if (use_peephole) {
-    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-        cell_to_output_weights_ptr, n_cell, cell_state_ptr, n_batch,
-        output_gate_scratch);
-  }
-  tensor_utils::ApplySigmoidToVector(output_gate_scratch, n_batch * n_cell,
-                                     output_gate_scratch);
-  tensor_utils::ApplyActivationToVector(cell_state_ptr, n_batch * n_cell,
-                                        params->activation, cell_scratch);
-  tensor_utils::VectorVectorCwiseProduct(output_gate_scratch, cell_scratch,
-                                         n_batch * n_cell, output_gate_scratch);
-
-  // For each batch: update the projection and output_state.
-  const bool use_projection_weight = (projection_weights_ptr != nullptr);
-  const bool use_projection_bias = (projection_bias_ptr != nullptr);
-  if (use_projection_weight) {
-    if (use_projection_bias) {
-      tensor_utils::VectorBatchVectorAssign(projection_bias_ptr, n_output,
-                                            n_batch, output_ptr_time);
-    } else {
-      tensor_utils::ZeroVector(output_ptr_time, n_batch * n_output);
-    }
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        projection_weights_ptr, n_output, n_cell, output_gate_scratch, n_batch,
-        output_ptr_time, /*result_stride=*/1);
-    if (params->proj_clip > 0.0) {
-      tensor_utils::ClipVector(output_ptr_time, n_batch * n_output,
-                               params->proj_clip, output_ptr_time);
-    }
-  } else {
-    tensor_utils::CopyVector(output_gate_scratch, n_batch * n_output,
-                             output_ptr_time);
-  }
-  tensor_utils::CopyVector(output_ptr_time, n_batch * n_output,
-                           output_state_ptr);
-}
-
 // The LSTM Op engine.
 TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   auto* params = reinterpret_cast<TfLiteLSTMParams*>(node->builtin_data);
@@ -756,7 +597,7 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
     const float* input_ptr_batch = input->data.f + t * n_batch * n_input;
     float* output_ptr_time = fw_output->data.f + t * n_batch * n_fw_output;
 
-    LstmBatchStep(
+    kernel_utils::LstmStep(
         input_ptr_batch, fw_input_to_input_weights_ptr,
         fw_input_to_forget_weights->data.f, fw_input_to_cell_weights->data.f,
         fw_input_to_output_weights->data.f, fw_recurrent_to_input_weights_ptr,
@@ -766,11 +607,10 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
         fw_cell_to_forget_weights_ptr, fw_cell_to_output_weights_ptr,
         fw_input_gate_bias_ptr, fw_forget_gate_bias->data.f,
         fw_cell_bias->data.f, fw_output_gate_bias->data.f,
-        fw_projection_weights_ptr, fw_projection_bias_ptr, params, fw_use_cifg,
-        fw_use_peephole, n_batch, n_fw_cell, n_input, n_fw_output,
-        fw_output_state->data.f, fw_cell_state->data.f, fw_input_gate_scratch,
-        fw_forget_gate_scratch, fw_cell_scratch, fw_output_gate_scratch,
-        output_ptr_time);
+        fw_projection_weights_ptr, fw_projection_bias_ptr, params, n_batch,
+        n_fw_cell, n_input, n_fw_output, fw_output_state->data.f,
+        fw_cell_state->data.f, fw_input_gate_scratch, fw_forget_gate_scratch,
+        fw_cell_scratch, fw_output_gate_scratch, output_ptr_time);
   }
 
   // n_cell and n_output will be the same size when there is no projection.
@@ -828,7 +668,7 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
     const float* input_ptr_batch = input->data.f + t * n_batch * n_input;
     float* output_ptr_time = bw_output->data.f + t * n_batch * n_bw_output;
 
-    LstmBatchStep(
+    kernel_utils::LstmStep(
         input_ptr_batch, bw_input_to_input_weights_ptr,
         bw_input_to_forget_weights->data.f, bw_input_to_cell_weights->data.f,
         bw_input_to_output_weights->data.f, bw_recurrent_to_input_weights_ptr,
@@ -838,11 +678,10 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
         bw_cell_to_forget_weights_ptr, bw_cell_to_output_weights_ptr,
         bw_input_gate_bias_ptr, bw_forget_gate_bias->data.f,
         bw_cell_bias->data.f, bw_output_gate_bias->data.f,
-        bw_projection_weights_ptr, bw_projection_bias_ptr, params, bw_use_cifg,
-        bw_use_peephole, n_batch, n_bw_cell, n_input, n_bw_output,
-        bw_output_state->data.f, bw_cell_state->data.f, bw_input_gate_scratch,
-        bw_forget_gate_scratch, bw_cell_scratch, bw_output_gate_scratch,
-        output_ptr_time);
+        bw_projection_weights_ptr, bw_projection_bias_ptr, params, n_batch,
+        n_bw_cell, n_input, n_bw_output, bw_output_state->data.f,
+        bw_cell_state->data.f, bw_input_gate_scratch, bw_forget_gate_scratch,
+        bw_cell_scratch, bw_output_gate_scratch, output_ptr_time);
   }
 
   // Backward step.
diff --git a/tensorflow/contrib/lite/kernels/cast.cc b/tensorflow/contrib/lite/kernels/cast.cc
new file mode 100644
index 0000000000000000000000000000000000000000..19942de7bc0c083f192a4b337b224b778d991140
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/cast.cc
@@ -0,0 +1,99 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <string.h>
+#include <algorithm>
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+#include "tensorflow/contrib/lite/string_util.h"
+
+namespace tflite {
+namespace ops {
+namespace builtin {
+namespace cast {
+constexpr int kInputTensor = 0;
+constexpr int kOutputTensor = 0;
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 1);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+  TfLiteTensor* input = GetInput(context, node, kInputTensor);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+  return context->ResizeTensor(context, output,
+                               TfLiteIntArrayCopy(input->dims));
+}
+
+template <typename FromT, typename ToT>
+void copyCast(const FromT* in, ToT* out, int num_elements) {
+  std::transform(in, in + num_elements, out,
+                 [](FromT a) { return static_cast<ToT>(a); });
+}
+
+template <typename FromT>
+TfLiteStatus copyToTensor(const FromT* in, TfLiteTensor* out,
+                          int num_elements) {
+  switch (out->type) {
+    case kTfLiteInt64:
+      copyCast(in, out->data.i64, num_elements);
+      break;
+    case kTfLiteInt32:
+      copyCast(in, out->data.i32, num_elements);
+      break;
+    case kTfLiteUInt8:
+      copyCast(in, out->data.uint8, num_elements);
+      break;
+    case kTfLiteFloat32:
+      copyCast(in, out->data.f, num_elements);
+      break;
+    default:
+      // Unsupported type.
+      return kTfLiteError;
+  }
+  return kTfLiteOk;
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  TfLiteTensor* input = GetInput(context, node, kInputTensor);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+  const int num_elements = NumElements(input);
+  TF_LITE_ENSURE_EQ(context, num_elements, NumElements(output));
+  switch (input->type) {
+    case kTfLiteInt64:
+      return copyToTensor(input->data.i64, output, num_elements);
+    case kTfLiteInt32:
+      return copyToTensor(input->data.i32, output, num_elements);
+    case kTfLiteUInt8:
+      return copyToTensor(input->data.uint8, output, num_elements);
+    case kTfLiteFloat32:
+      return copyToTensor(input->data.f, output, num_elements);
+    default:
+      // Unsupported type.
+      return kTfLiteError;
+  }
+  return kTfLiteOk;
+}
+}  // namespace cast
+
+TfLiteRegistration* Register_CAST() {
+  static TfLiteRegistration r = {nullptr, nullptr, cast::Prepare, cast::Eval};
+  return &r;
+}
+
+}  // namespace builtin
+}  // namespace ops
+}  // namespace tflite
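Review note on cast.cc: the two-level dispatch (`Eval` switches on the input type, `copyToTensor` on the output type) bottoms out in an element-wise `static_cast`, so float-to-integer casts truncate toward zero. That is the behavior `CastFloatToInt` in cast_test.cc expects (`0.999f` becomes `0`). A minimal standalone sketch of that inner step follows; `CopyCast` is a hypothetical stand-in that uses `std::vector` rather than `TfLiteTensor`, and it is not part of the TFLite API.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the kernel's copyCast helper: converts every
// element with static_cast, so float -> int conversions truncate toward zero.
template <typename FromT, typename ToT>
std::vector<ToT> CopyCast(const std::vector<FromT>& in) {
  std::vector<ToT> out(in.size());
  std::transform(in.begin(), in.end(), out.begin(),
                 [](FromT a) { return static_cast<ToT>(a); });
  return out;
}
```

Values outside the destination type's range are not handled here; as with the kernel, a plain `static_cast` leaves such float-to-integer conversions undefined.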
diff --git a/tensorflow/contrib/lite/kernels/cast_test.cc b/tensorflow/contrib/lite/kernels/cast_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4e56482a371550b6275a6380e2beebe3cef958ff
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/cast_test.cc
@@ -0,0 +1,66 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+class CastOpModel : public SingleOpModel {
+ public:
+  CastOpModel(const TensorData& input, const TensorData& output) {
+    input_ = AddInput(input);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_CAST, BuiltinOptions_CastOptions,
+                 CreateCastOptions(builder_).Union());
+    BuildInterpreter({GetShape(input_)});
+  }
+
+  int input() const { return input_; }
+  int output() const { return output_; }
+
+ protected:
+  int input_;
+  int output_;
+};
+
+TEST(CastOpModel, CastIntToFloat) {
+  CastOpModel m({TensorType_INT64, {2, 3}}, {TensorType_FLOAT32, {2, 3}});
+  m.PopulateTensor<int64_t>(m.input(), {100, 200, 300, 400, 500, 600});
+  m.Invoke();
+  EXPECT_THAT(m.ExtractVector<float>(m.output()),
+              ElementsAreArray({100.f, 200.f, 300.f, 400.f, 500.f, 600.f}));
+}
+
+TEST(CastOpModel, CastFloatToInt) {
+  CastOpModel m({TensorType_FLOAT32, {3, 2}}, {TensorType_INT32, {3, 2}});
+  m.PopulateTensor<float>(m.input(), {100.f, 20.f, 3.f, 0.4f, 0.999f, 1.1f});
+  m.Invoke();
+  EXPECT_THAT(m.ExtractVector<int>(m.output()),
+              ElementsAreArray({100, 20, 3, 0, 0, 1}));
+}
+
+}  // namespace
+}  // namespace tflite
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/conv.cc b/tensorflow/contrib/lite/kernels/conv.cc
index b93a416351cae34b2df8791e382a8a2cd38dcffb..e0cd12f1b4042c3d8b28159e288166bf1437e6ef 100644
--- a/tensorflow/contrib/lite/kernels/conv.cc
+++ b/tensorflow/contrib/lite/kernels/conv.cc
@@ -23,6 +23,7 @@ limitations under the License.
 
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 #include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/eigen_support.h"
 #include "tensorflow/contrib/lite/kernels/gemm_support.h"
 #include "tensorflow/contrib/lite/kernels/internal/optimized/cblas_conv.h"
 #include "tensorflow/contrib/lite/kernels/internal/optimized/multithreaded_conv.h"
@@ -43,6 +44,8 @@ namespace conv {
 enum KernelType {
   kReference,
   kGenericOptimized,  // Neon-free
+  // kMultithreadOptimized is a mixture of an Eigen-based kernel when threads
+  // are available and kGenericOptimized when we must use only one thread.
   kMultithreadOptimized,
   // The kernel uses use CBLAS interface for matrix multiplication.
   // It's fast when an optimized CBLAS implementation is available (e.g. Apple
@@ -61,7 +64,7 @@ struct OpData {
 
   TfLitePaddingValues padding;
   // The scaling factor from input to output (aka the 'real multiplier') can
-  // be represented as a fixed point multipler plus a left shift.
+  // be represented as a fixed point multiplier plus a left shift.
   int32_t output_multiplier;
   int output_shift;
   // The range of the fused activation layer. For example for kNone and
@@ -75,6 +78,8 @@ struct OpData {
   bool need_hwcn_weights;
   bool have_weights_been_transposed;
   bool need_im2col;
+
+  bool run_multithreaded_kernel;
 };
 
 void* Init(TfLiteContext* context, const char* buffer, size_t length) {
@@ -83,10 +88,15 @@ void* Init(TfLiteContext* context, const char* buffer, size_t length) {
   // to carry information from Prepare() to Eval().
   auto* data = new OpData;
   gemm_support::IncrementUsageCounter(context);
+  eigen_support::IncrementUsageCounter(context);
+
+  data->run_multithreaded_kernel = context->recommended_num_threads != 1;
+
   return data;
 }
 
 void Free(TfLiteContext* context, void* buffer) {
+  eigen_support::DecrementUsageCounter(context);
   gemm_support::DecrementUsageCounter(context);
   delete reinterpret_cast<OpData*>(buffer);
 }
@@ -137,7 +147,8 @@ static TfLiteStatus AllocateTemporaryTensorsIfRequired(TfLiteContext* context,
   // buffer to store the results.
   // This path is only used for float processing, so only create the buffer if
   // we're running with that data type.
-  data->need_hwcn_weights = (input->type == kTfLiteFloat32);
+  data->need_hwcn_weights =
+      (input->type == kTfLiteFloat32 && data->run_multithreaded_kernel);
 
   int temporaries_count = 0;
   if (data->need_im2col) {
@@ -449,8 +460,13 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   // separate ops to avoid dispatch overhead here.
   switch (input->type) {  // Already know in/outtypes are same.
     case kTfLiteFloat32:
-      EvalFloat<kernel_type>(context, node, params, data, input, filter, bias,
-                             im2col, hwcn_weights, output);
+      if (data->run_multithreaded_kernel) {
+        EvalFloat<kernel_type>(context, node, params, data, input, filter, bias,
+                               im2col, hwcn_weights, output);
+      } else {
+        EvalFloat<kGenericOptimized>(context, node, params, data, input, filter,
+                                     bias, im2col, hwcn_weights, output);
+      }
       break;
     case kTfLiteUInt8:
       EvalQuantized<kernel_type>(context, node, params, data, input, filter,
diff --git a/tensorflow/contrib/lite/kernels/depthwise_conv.cc b/tensorflow/contrib/lite/kernels/depthwise_conv.cc
index 15dbfe08c82befcf001b9ed9a053528b5606053e..cad9ce114c8387047af2b63bee704035fd329330 100644
--- a/tensorflow/contrib/lite/kernels/depthwise_conv.cc
+++ b/tensorflow/contrib/lite/kernels/depthwise_conv.cc
@@ -52,7 +52,7 @@ enum KernelType {
 struct OpData {
   TfLitePaddingValues padding;
   // The scaling factor from input to output (aka the 'real multiplier') can
-  // be represented as a fixed point multipler plus a left shift.
+  // be represented as a fixed point multiplier plus a left shift.
   int32_t output_multiplier;
   int output_shift;
   // The range of the fused activation layer. For example for kNone and
diff --git a/tensorflow/contrib/lite/kernels/dequantize.cc b/tensorflow/contrib/lite/kernels/dequantize.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e685f2465f627cf30e02564e6f16e1ec69e208e2
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/dequantize.cc
@@ -0,0 +1,77 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <string.h>
+#include <vector>
+
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+
+namespace tflite {
+namespace ops {
+namespace builtin {
+namespace dequantize {
+
+struct OpContext {
+  OpContext(TfLiteContext* context, TfLiteNode* node) {
+    input = GetInput(context, node, 0);
+    output = GetOutput(context, node, 0);
+  }
+  TfLiteTensor* input;
+  TfLiteTensor* output;
+};
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 1);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+
+  OpContext op_context(context, node);
+
+  TF_LITE_ENSURE(context, op_context.input->type == kTfLiteUInt8);
+
+  op_context.output->type = kTfLiteFloat32;
+  return context->ResizeTensor(context, op_context.output,
+                               TfLiteIntArrayCopy(op_context.input->dims));
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  OpContext op_context(context, node);
+
+  auto zero_point = op_context.input->params.zero_point;
+  auto scale = op_context.input->params.scale;
+
+  optimized_ops::Dequantize(GetTensorData<uint8_t>(op_context.input),
+                            GetTensorDims(op_context.input), zero_point, scale,
+                            GetTensorData<float>(op_context.output),
+                            GetTensorDims(op_context.output));
+  return kTfLiteOk;
+}
+
+}  // namespace dequantize
+
+TfLiteRegistration* Register_DEQUANTIZE_OPT() {
+  static TfLiteRegistration r = {nullptr, nullptr, dequantize::Prepare,
+                                 dequantize::Eval};
+  return &r;
+}
+
+TfLiteRegistration* Register_DEQUANTIZE() { return Register_DEQUANTIZE_OPT(); }
+
+}  // namespace builtin
+}  // namespace ops
+}  // namespace tflite
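Review note on dequantize.cc: `Eval` forwards the input tensor's `zero_point` and `scale` to `optimized_ops::Dequantize`. The sketch below assumes the usual affine mapping `real = scale * (quantized - zero_point)`; `DequantizeSketch` is a hypothetical helper, not the TFLite implementation. With scale 0.5 and zero point 127 (the values implied by the `[-63.5, 64]` range used in dequantize_test.cc), it reproduces that test's expected outputs.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper illustrating affine dequantization of uint8 data:
//   real = scale * (quantized - zero_point)
std::vector<float> DequantizeSketch(const std::vector<uint8_t>& in,
                                    int32_t zero_point, float scale) {
  std::vector<float> out;
  out.reserve(in.size());
  for (uint8_t q : in) {
    // Widen to int32_t before subtracting so the difference can be negative.
    const int32_t centered = static_cast<int32_t>(q) - zero_point;
    out.push_back(scale * static_cast<float>(centered));
  }
  return out;
}
```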
diff --git a/tensorflow/contrib/lite/kernels/dequantize_test.cc b/tensorflow/contrib/lite/kernels/dequantize_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..fcd74206177a0a97db168338e3619d4b95c052a9
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/dequantize_test.cc
@@ -0,0 +1,65 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+class DequantizeOpModel : public SingleOpModel {
+ public:
+  DequantizeOpModel(std::initializer_list<int> shape, float min, float max) {
+    input_ = AddInput({TensorType_UINT8, shape, min, max});
+    output_ = AddOutput({TensorType_FLOAT32, shape});
+    SetBuiltinOp(BuiltinOperator_DEQUANTIZE, BuiltinOptions_DequantizeOptions,
+                 CreateDequantizeOptions(builder_).Union());
+
+    BuildInterpreter({GetShape(input_)});
+  }
+
+  void SetInput(std::initializer_list<uint8_t> data) {
+    PopulateTensor(input_, data);
+  }
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+
+ private:
+  int input_;
+  int output_;
+};
+
+TEST(DequantizeOpTest, Uint8) {
+  DequantizeOpModel m({2, 5}, -63.5, 64);
+
+  m.SetInput({0, 1, 2, 3, 4, 251, 252, 253, 254, 255});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray(ArrayFloatNear(
+                  {-63.5, -63, -62.5, -62, -61.5, 62, 62.5, 63, 63.5, 64})));
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/div.cc b/tensorflow/contrib/lite/kernels/div.cc
index 44bd0dc85d50c98ec6b6888e05064a8f2e2731c0..6dd243ad62ece3e094529d923ce80d1d4a0c19ca 100644
--- a/tensorflow/contrib/lite/kernels/div.cc
+++ b/tensorflow/contrib/lite/kernels/div.cc
@@ -37,7 +37,23 @@ constexpr int kInputTensor1 = 0;
 constexpr int kInputTensor2 = 1;
 constexpr int kOutputTensor = 0;
 
+struct OpData {
+  bool requires_broadcast;
+};
+
+void* Init(TfLiteContext* context, const char* buffer, size_t length) {
+  auto* data = new OpData;
+  data->requires_broadcast = false;
+  return data;
+}
+
+void Free(TfLiteContext* context, void* buffer) {
+  delete reinterpret_cast<OpData*>(buffer);
+}
+
 TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
+
   TF_LITE_ENSURE_EQ(context, NumInputs(node), 2);
   TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
 
@@ -45,35 +61,47 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
 
-  TF_LITE_ENSURE_EQ(context, NumDimensions(input1), NumDimensions(input2));
-  for (int i = 0; i < NumDimensions(input1); ++i) {
-    TF_LITE_ENSURE_EQ(context, SizeOfDimension(input1, i),
-                      SizeOfDimension(input2, i));
-  }
+  TF_LITE_ENSURE_EQ(context, input1->type, input2->type);
+  output->type = input2->type;
+
+  data->requires_broadcast = !HaveSameShapes(input1, input2);
 
-  TF_LITE_ENSURE_EQ(context, input1->type, output->type);
-  TF_LITE_ENSURE_EQ(context, input2->type, output->type);
+  TfLiteIntArray* output_size = nullptr;
+  if (data->requires_broadcast) {
+    TF_LITE_ENSURE_OK(context, CalculateShapeForBroadcast(
+                                   context, input1, input2, &output_size));
+  } else {
+    output_size = TfLiteIntArrayCopy(input1->dims);
+  }
 
-  TfLiteIntArray* output_size = TfLiteIntArrayCopy(input1->dims);
   return context->ResizeTensor(context, output, output_size);
 }
 
 template <KernelType kernel_type>
-void EvalDivFloat(TfLiteContext* context, TfLiteNode* node,
-                  TfLiteDivParams* params, TfLiteTensor* input1,
-                  TfLiteTensor* input2, TfLiteTensor* output) {
+void EvalFloat(TfLiteContext* context, TfLiteNode* node,
+               TfLiteDivParams* params, const OpData* data,
+               TfLiteTensor* input1, TfLiteTensor* input2,
+               TfLiteTensor* output) {
   float output_activation_min, output_activation_max;
   CalculateActivationRangeFloat(params->activation, &output_activation_min,
                                 &output_activation_max);
-#define TF_LITE_DIV(type)                                        \
-  type::Div(GetTensorData<float>(input1), GetTensorDims(input1), \
-            GetTensorData<float>(input2), GetTensorDims(input2), \
-            output_activation_min, output_activation_max,        \
-            GetTensorData<float>(output), GetTensorDims(output))
+#define TF_LITE_DIV(type, opname)                                   \
+  type::opname(GetTensorData<float>(input1), GetTensorDims(input1), \
+               GetTensorData<float>(input2), GetTensorDims(input2), \
+               output_activation_min, output_activation_max,        \
+               GetTensorData<float>(output), GetTensorDims(output))
   if (kernel_type == kReference) {
-    TF_LITE_DIV(reference_ops);
+    if (data->requires_broadcast) {
+      TF_LITE_DIV(reference_ops, BroadcastDiv);
+    } else {
+      TF_LITE_DIV(reference_ops, Div);
+    }
   } else {
-    TF_LITE_DIV(optimized_ops);
+    if (data->requires_broadcast) {
+      TF_LITE_DIV(optimized_ops, BroadcastDiv);
+    } else {
+      TF_LITE_DIV(optimized_ops, Div);
+    }
   }
 #undef TF_LITE_DIV
 }
@@ -81,13 +109,14 @@ void EvalDivFloat(TfLiteContext* context, TfLiteNode* node,
 template <KernelType kernel_type>
 TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   auto* params = reinterpret_cast<TfLiteDivParams*>(node->builtin_data);
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
 
   TfLiteTensor* input1 = GetInput(context, node, kInputTensor1);
   TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
 
   if (output->type == kTfLiteFloat32) {
-    EvalDivFloat<kernel_type>(context, node, params, input1, input2, output);
+    EvalFloat<kernel_type>(context, node, params, data, input1, input2, output);
   } else {
     context->ReportError(context, "Inputs and outputs not all float types.");
     return kTfLiteError;
@@ -99,19 +128,19 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
 }  // namespace div
 
 TfLiteRegistration* Register_DIV_REF() {
-  static TfLiteRegistration r = {nullptr, nullptr, div::Prepare,
+  static TfLiteRegistration r = {div::Init, div::Free, div::Prepare,
                                  div::Eval<div::kReference>};
   return &r;
 }
 
 TfLiteRegistration* Register_DIV_GENERIC_OPT() {
-  static TfLiteRegistration r = {nullptr, nullptr, div::Prepare,
+  static TfLiteRegistration r = {div::Init, div::Free, div::Prepare,
                                  div::Eval<div::kGenericOptimized>};
   return &r;
 }
 
 TfLiteRegistration* Register_DIV_NEON_OPT() {
-  static TfLiteRegistration r = {nullptr, nullptr, div::Prepare,
+  static TfLiteRegistration r = {div::Init, div::Free, div::Prepare,
                                  div::Eval<div::kNeonOptimized>};
   return &r;
 }
diff --git a/tensorflow/contrib/lite/kernels/div_test.cc b/tensorflow/contrib/lite/kernels/div_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..276b8289fbc1b4dcbf4624b76b854300d0fd4912
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/div_test.cc
@@ -0,0 +1,118 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+class BaseDivOpModel : public SingleOpModel {
+ public:
+  BaseDivOpModel(const TensorData& input1, const TensorData& input2,
+                 const TensorData& output,
+                 ActivationFunctionType activation_type) {
+    input1_ = AddInput(input1);
+    input2_ = AddInput(input2);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_DIV, BuiltinOptions_DivOptions,
+                 CreateDivOptions(builder_, activation_type).Union());
+    BuildInterpreter({GetShape(input1_), GetShape(input2_)});
+  }
+
+  int input1() { return input1_; }
+  int input2() { return input2_; }
+
+ protected:
+  int input1_;
+  int input2_;
+  int output_;
+};
+
+class FloatDivOpModel : public BaseDivOpModel {
+ public:
+  using BaseDivOpModel::BaseDivOpModel;
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+};
+
+TEST(FloatDivOpTest, NoActivation) {
+  FloatDivOpModel m({TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {}}, ActivationFunctionType_NONE);
+  m.PopulateTensor<float>(m.input1(), {-0.2, 0.2, -1.2, 0.8});
+  m.PopulateTensor<float>(m.input2(), {0.5, 0.2, -1.5, 0.5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray(ArrayFloatNear({-0.4, 1.0, 0.8, 1.6})));
+}
+
+TEST(FloatDivOpTest, ActivationRELU_N1_TO_1) {
+  FloatDivOpModel m(
+      {TensorType_FLOAT32, {1, 2, 2, 1}}, {TensorType_FLOAT32, {1, 2, 2, 1}},
+      {TensorType_FLOAT32, {}}, ActivationFunctionType_RELU_N1_TO_1);
+  m.PopulateTensor<float>(m.input1(), {-0.2, 0.2, -1.2, 0.8});
+  m.PopulateTensor<float>(m.input2(), {0.1, 0.2, -1.5, 0.5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray(ArrayFloatNear({-1.0, 1.0, 0.8, 1.0})));
+}
+
+TEST(FloatDivOpTest, VariousInputShapes) {
+  std::vector<std::initializer_list<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    FloatDivOpModel m({TensorType_FLOAT32, test_shapes[i]},
+                      {TensorType_FLOAT32, test_shapes[i]},
+                      {TensorType_FLOAT32, {}}, ActivationFunctionType_NONE);
+    m.PopulateTensor<float>(m.input1(), {-2.0, 0.2, 0.3, 0.8, 1.1, -2.0});
+    m.PopulateTensor<float>(m.input2(), {0.1, 0.2, 0.6, 0.5, -1.1, -0.1});
+    m.Invoke();
+    EXPECT_THAT(
+        m.GetOutput(),
+        ElementsAreArray(ArrayFloatNear({-20.0, 1.0, 0.5, 1.6, -1.0, 20.0})))
+        << "With shape number " << i;
+  }
+}
+
+TEST(FloatDivOpTest, WithBroadcast) {
+  std::vector<std::initializer_list<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    FloatDivOpModel m({TensorType_FLOAT32, test_shapes[i]},
+                      {TensorType_FLOAT32, {}},  // always a scalar
+                      {TensorType_FLOAT32, {}}, ActivationFunctionType_NONE);
+    m.PopulateTensor<float>(m.input1(), {-0.2, 0.2, 0.07, 0.08, 0.11, -0.123});
+    m.PopulateTensor<float>(m.input2(), {0.1});
+    m.Invoke();
+    EXPECT_THAT(
+        m.GetOutput(),
+        ElementsAreArray(ArrayFloatNear({-2.0, 2.0, 0.7, 0.8, 1.1, -1.23})))
+        << "With shape number " << i;
+  }
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/eigen_support.cc b/tensorflow/contrib/lite/kernels/eigen_support.cc
new file mode 100644
index 0000000000000000000000000000000000000000..213e46555210102b8faeb2e4d9900f924a023366
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/eigen_support.cc
@@ -0,0 +1,53 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/kernels/eigen_support.h"
+
+#include "third_party/eigen3/Eigen/Core"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+
+namespace tflite {
+namespace eigen_support {
+
+struct RefCountedEigenContext {
+  int num_references = 0;
+};
+
+void IncrementUsageCounter(TfLiteContext* context) {
+  auto* ptr = reinterpret_cast<RefCountedEigenContext*>(context->eigen_context);
+  if (ptr == nullptr) {
+    if (context->recommended_num_threads != -1) {
+      Eigen::setNbThreads(context->recommended_num_threads);
+    }
+    ptr = new RefCountedEigenContext;
+    context->eigen_context = ptr;
+  }
+  ptr->num_references++;
+}
+
+void DecrementUsageCounter(TfLiteContext* context) {
+  auto* ptr = reinterpret_cast<RefCountedEigenContext*>(context->eigen_context);
+  if (ptr == nullptr) {
+    TF_LITE_FATAL(
+        "Call to DecrementUsageCounter() not preceded by "
+        "IncrementUsageCounter()");
+  }
+  if (--ptr->num_references == 0) {
+    delete ptr;
+    context->eigen_context = nullptr;
+  }
+}
+
+}  // namespace eigen_support
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/eigen_support.h b/tensorflow/contrib/lite/kernels/eigen_support.h
new file mode 100644
index 0000000000000000000000000000000000000000..d47e691123282a8a8cc53c29be1d95af037e3939
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/eigen_support.h
@@ -0,0 +1,34 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_KERNELS_EIGEN_SUPPORT_H_
+#define TENSORFLOW_CONTRIB_LITE_KERNELS_EIGEN_SUPPORT_H_
+
+#include "tensorflow/contrib/lite/context.h"
+
+namespace tflite {
+namespace eigen_support {
+
+// Let the framework know that the op will be using Eigen. If necessary a set of
+// temporary Eigen objects might be created and placed in 'context'.
+void IncrementUsageCounter(TfLiteContext* context);
+
+// Let the framework know that the op stopped using Eigen. If there are no more
+// usages all temporary Eigen objects will be deleted.
+void DecrementUsageCounter(TfLiteContext* context);
+
+}  // namespace eigen_support
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_EIGEN_SUPPORT_H_
diff --git a/tensorflow/contrib/lite/kernels/fully_connected.cc b/tensorflow/contrib/lite/kernels/fully_connected.cc
index a77fe94e499078bc2f0660e8e49fd557ed0f625d..888e67966c0a408257e763a405bf6e928310f4d9 100644
--- a/tensorflow/contrib/lite/kernels/fully_connected.cc
+++ b/tensorflow/contrib/lite/kernels/fully_connected.cc
@@ -48,7 +48,7 @@ enum KernelType {
 
 struct OpData {
   // The scaling factor from input to output (aka the 'real multiplier') can
-  // be represented as a fixed point multipler plus a left shift.
+  // be represented as a fixed point multiplier plus a left shift.
   int32_t output_multiplier;
   int output_shift;
   // The range of the fused activation layer. For example for kNone and
diff --git a/tensorflow/contrib/lite/kernels/gemm_support.cc b/tensorflow/contrib/lite/kernels/gemm_support.cc
index eb2b0aacf7ecc3ed5dbde5ccce7a46dcda0a93b3..76a5165d148c6c1829580a47456cebce321d7c5a 100644
--- a/tensorflow/contrib/lite/kernels/gemm_support.cc
+++ b/tensorflow/contrib/lite/kernels/gemm_support.cc
@@ -29,6 +29,9 @@ void IncrementUsageCounter(TfLiteContext* context) {
   if (ptr == nullptr) {
     ptr = new RefCountedGemmContext;
     ptr->gemm_context_ = new gemmlowp::GemmContext();
+    if (context->recommended_num_threads != -1) {
+      ptr->gemm_context_->set_max_num_threads(context->recommended_num_threads);
+    }
     ptr->num_references_ = 0;
     context->gemm_context = ptr;
   }
@@ -58,11 +61,5 @@ gemmlowp::GemmContext* GetFromContext(TfLiteContext* context) {
   return ptr->gemm_context_;
 }
 
-void SetMaxNumThreads(TfLiteContext* context, int num_threads) {
-  IncrementUsageCounter(context);
-  GetFromContext(context)->set_max_num_threads(num_threads);
-  DecrementUsageCounter(context);
-}
-
 }  // namespace gemm_support
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/gemm_support.h b/tensorflow/contrib/lite/kernels/gemm_support.h
index 466781cbcecc7fb851d9078c450cc6c12364d2bb..37af772c6846f2f8124faabf1a0f0987e2e9393d 100644
--- a/tensorflow/contrib/lite/kernels/gemm_support.h
+++ b/tensorflow/contrib/lite/kernels/gemm_support.h
@@ -45,9 +45,6 @@ void IncrementUsageCounter(TfLiteContext* context);
 // 'context'. If there are no more usages the GemmContext will be deleted.
 void DecrementUsageCounter(TfLiteContext* context);
 
-// Set the maximum number threads available for gemmlowp operations.
-void SetMaxNumThreads(TfLiteContext* context, int num_threads);
-
 }  // namespace gemm_support
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/internal/BUILD b/tensorflow/contrib/lite/kernels/internal/BUILD
index f47fb04cbaa688b75e763ff9d3cb7df44ac3f166..aa3957bee133c8b51a82e9c62884ce365e086d2e 100644
--- a/tensorflow/contrib/lite/kernels/internal/BUILD
+++ b/tensorflow/contrib/lite/kernels/internal/BUILD
@@ -10,21 +10,25 @@ tflite_deps_intel = [
     "@arm_neon_2_x86_sse",
 ]
 
+HARD_FP_FLAGS_IF_APPLICABLE = select({
+    "//tensorflow:android_arm": ["-mfloat-abi=softfp"],
+    "//tensorflow:android_arm64": ["-mfloat-abi=softfp"],
+    "//tensorflow:android_armeabi": ["-mfloat-abi=softfp"],
+    "//conditions:default": [],
+})
+
 NEON_FLAGS_IF_APPLICABLE = select({
     ":arm": [
         "-O3",
         "-mfpu=neon",
-        "-mfloat-abi=softfp",
     ],
     ":armeabi-v7a": [
         "-O3",
         "-mfpu=neon",
-        "-mfloat-abi=softfp",
     ],
     ":armv7a": [
         "-O3",
         "-mfpu=neon",
-        "-mfloat-abi=softfp",
     ],
     "//conditions:default": [
         "-O3",
@@ -145,6 +149,7 @@ cc_library(
         "common.h",
         "optimized/depthwiseconv_float.h",
         "optimized/depthwiseconv_uint8.h",
+        "optimized/depthwiseconv_uint8_3x3_filter.h",
         "optimized/optimized_ops.h",
     ],
     copts = tflite_copts(),
@@ -208,7 +213,10 @@ cc_library(
         "compatibility.h",
         "quantization_util.h",
     ],
-    deps = [":round"],
+    deps = [
+        ":round",
+        ":types",
+    ],
 )
 
 cc_test(
@@ -283,7 +291,7 @@ cc_library(
         "optimized/neon_tensor_utils.h",
         "optimized/tensor_utils_impl.h",
     ],
-    copts = NEON_FLAGS_IF_APPLICABLE,
+    copts = NEON_FLAGS_IF_APPLICABLE + HARD_FP_FLAGS_IF_APPLICABLE,
     deps = [
         ":cpu_check",
         ":portable_tensor_utils",
@@ -305,6 +313,27 @@ cc_library(
     ],
 )
 
+# Audio support classes imported directly from TensorFlow.
+cc_library(
+    name = "audio_utils",
+    srcs = [
+        "mfcc.cc",
+        "mfcc_dct.cc",
+        "mfcc_mel_filterbank.cc",
+        "spectrogram.cc",
+    ],
+    hdrs = [
+        "mfcc.h",
+        "mfcc_dct.h",
+        "mfcc_mel_filterbank.h",
+        "spectrogram.h",
+    ],
+    deps = [
+        "//third_party/fft2d:fft2d_headers",
+        "@fft2d",
+    ],
+)
+
 cc_library(
     name = "tensor_utils",
     srcs = [
diff --git a/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc b/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
index 510395126ce3785b1d44fec1e0eb994c29ff0db7..f142374269606bdd3d4184af013749102666ab89 100644
--- a/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
+++ b/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
@@ -40,5 +40,152 @@ void RnnBatchStep(const float* input_ptr_batch, const float* input_weights_ptr,
                                         hidden_state_ptr_batch);
 }
 
+void LstmStep(
+    const float* input_ptr_batch, const float* input_to_input_weights_ptr,
+    const float* input_to_forget_weights_ptr,
+    const float* input_to_cell_weights_ptr,
+    const float* input_to_output_weights_ptr,
+    const float* recurrent_to_input_weights_ptr,
+    const float* recurrent_to_forget_weights_ptr,
+    const float* recurrent_to_cell_weights_ptr,
+    const float* recurrent_to_output_weights_ptr,
+    const float* cell_to_input_weights_ptr,
+    const float* cell_to_forget_weights_ptr,
+    const float* cell_to_output_weights_ptr, const float* input_gate_bias_ptr,
+    const float* forget_gate_bias_ptr, const float* cell_bias_ptr,
+    const float* output_gate_bias_ptr, const float* projection_weights_ptr,
+    const float* projection_bias_ptr, const TfLiteLSTMParams* params,
+    int n_batch, int n_cell, int n_input, int n_output, float* output_state_ptr,
+    float* cell_state_ptr, float* input_gate_scratch,
+    float* forget_gate_scratch, float* cell_scratch, float* output_gate_scratch,
+    float* output_ptr_batch) {
+  // Since we have already checked that weights are all there or none, we can
+  // check the existence of only one to get the condition.
+  const bool use_cifg = (input_to_input_weights_ptr == nullptr);
+  const bool use_peephole = (cell_to_output_weights_ptr != nullptr);
+  // Initialize scratch buffers with bias.
+  if (!use_cifg) {
+    tensor_utils::VectorBatchVectorAssign(input_gate_bias_ptr, n_cell, n_batch,
+                                          input_gate_scratch);
+  }
+  tensor_utils::VectorBatchVectorAssign(forget_gate_bias_ptr, n_cell, n_batch,
+                                        forget_gate_scratch);
+  tensor_utils::VectorBatchVectorAssign(cell_bias_ptr, n_cell, n_batch,
+                                        cell_scratch);
+  tensor_utils::VectorBatchVectorAssign(output_gate_bias_ptr, n_cell, n_batch,
+                                        output_gate_scratch);
+
+  // For each batch and cell: compute input_weight * input.
+  if (!use_cifg) {
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        input_to_input_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
+        input_gate_scratch, /*result_stride=*/1);
+  }
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      input_to_forget_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
+      forget_gate_scratch, /*result_stride=*/1);
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      input_to_cell_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
+      cell_scratch, /*result_stride=*/1);
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      input_to_output_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
+      output_gate_scratch, /*result_stride=*/1);
+
+  // For each batch and cell: compute recurrent_weight * output_state.
+  if (!use_cifg) {
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        recurrent_to_input_weights_ptr, n_cell, n_output, output_state_ptr,
+        n_batch, input_gate_scratch,
+        /*result_stride=*/1);
+  }
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      recurrent_to_forget_weights_ptr, n_cell, n_output, output_state_ptr,
+      n_batch, forget_gate_scratch,
+      /*result_stride=*/1);
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      recurrent_to_cell_weights_ptr, n_cell, n_output, output_state_ptr,
+      n_batch, cell_scratch, /*result_stride=*/1);
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      recurrent_to_output_weights_ptr, n_cell, n_output, output_state_ptr,
+      n_batch, output_gate_scratch,
+      /*result_stride=*/1);
+
+  // For each batch and cell: update input gate.
+  if (!use_cifg) {
+    if (use_peephole) {
+      tensor_utils::VectorBatchVectorCwiseProductAccumulate(
+          cell_to_input_weights_ptr, n_cell, cell_state_ptr, n_batch,
+          input_gate_scratch);
+    }
+    tensor_utils::ApplySigmoidToVector(input_gate_scratch, n_cell * n_batch,
+                                       input_gate_scratch);
+  }
+
+  // For each batch and cell: update forget gate.
+  if (use_peephole) {
+    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
+        cell_to_forget_weights_ptr, n_cell, cell_state_ptr, n_batch,
+        forget_gate_scratch);
+  }
+  tensor_utils::ApplySigmoidToVector(forget_gate_scratch, n_cell * n_batch,
+                                     forget_gate_scratch);
+
+  // For each batch and cell: update the cell.
+  tensor_utils::VectorVectorCwiseProduct(forget_gate_scratch, cell_state_ptr,
+                                         n_batch * n_cell, cell_state_ptr);
+  tensor_utils::ApplyActivationToVector(cell_scratch, n_batch * n_cell,
+                                        params->activation, cell_scratch);
+  if (use_cifg) {
+    tensor_utils::Sub1Vector(forget_gate_scratch, n_batch * n_cell,
+                             forget_gate_scratch);
+    tensor_utils::VectorVectorCwiseProductAccumulate(
+        cell_scratch, forget_gate_scratch, n_batch * n_cell, cell_state_ptr);
+  } else {
+    tensor_utils::VectorVectorCwiseProductAccumulate(
+        cell_scratch, input_gate_scratch, n_batch * n_cell, cell_state_ptr);
+  }
+  if (params->cell_clip > 0.0) {
+    tensor_utils::ClipVector(cell_state_ptr, n_batch * n_cell,
+                             params->cell_clip, cell_state_ptr);
+  }
+
+  // For each batch and cell: update the output gate.
+  if (use_peephole) {
+    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
+        cell_to_output_weights_ptr, n_cell, cell_state_ptr, n_batch,
+        output_gate_scratch);
+  }
+  tensor_utils::ApplySigmoidToVector(output_gate_scratch, n_batch * n_cell,
+                                     output_gate_scratch);
+  tensor_utils::ApplyActivationToVector(cell_state_ptr, n_batch * n_cell,
+                                        params->activation, cell_scratch);
+  tensor_utils::VectorVectorCwiseProduct(output_gate_scratch, cell_scratch,
+                                         n_batch * n_cell, output_gate_scratch);
+
+  // For each batch: update the projection and output_state.
+  const bool use_projection_weight = (projection_weights_ptr != nullptr);
+  const bool use_projection_bias = (projection_bias_ptr != nullptr);
+  if (use_projection_weight) {
+    if (use_projection_bias) {
+      tensor_utils::VectorBatchVectorAssign(projection_bias_ptr, n_output,
+                                            n_batch, output_ptr_batch);
+    } else {
+      tensor_utils::ZeroVector(output_ptr_batch, n_batch * n_output);
+    }
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        projection_weights_ptr, n_output, n_cell, output_gate_scratch, n_batch,
+        output_ptr_batch, /*result_stride=*/1);
+    if (params->proj_clip > 0.0) {
+      tensor_utils::ClipVector(output_ptr_batch, n_batch * n_output,
+                               params->proj_clip, output_ptr_batch);
+    }
+  } else {
+    tensor_utils::CopyVector(output_gate_scratch, n_batch * n_output,
+                             output_ptr_batch);
+  }
+  tensor_utils::CopyVector(output_ptr_batch, n_batch * n_output,
+                           output_state_ptr);
+}
+
 }  // namespace kernel_utils
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/kernel_utils.h b/tensorflow/contrib/lite/kernels/internal/kernel_utils.h
index 9872d4500b862388ed4b96c97e3755f548e35d35..3ec60ee57a87833959a34ba95d32df15bea188a4 100644
--- a/tensorflow/contrib/lite/kernels/internal/kernel_utils.h
+++ b/tensorflow/contrib/lite/kernels/internal/kernel_utils.h
@@ -35,6 +35,42 @@ void RnnBatchStep(const float* input_ptr_batch, const float* input_weights_ptr,
                   TfLiteFusedActivation activation,
                   float* hidden_state_ptr_batch, float* output_ptr_batch);
 
+// Performs an LSTM batch inference step for input specified by input_ptr_batch.
+// The LSTM cell is specified by the pointers to its weights (*_weights_ptr) and
+// biases (*_bias_ptr), and buffers (*_scratch), along with additional
+// parameters:
+//  - params: various LSTM params including activation, clipping, etc.,
+//  - n_batch: size of batch,
+//  - n_cell: number of cells (or units),
+//  - n_input: the input size,
+//  - n_output: the output size.
+//
+// The pointers to the cell and output state and the output are updated. Unless
+// projection is specified output and output state contain the same data.
+//
+// The pointers with the suffix "_batch" point to data aligned in batch_major
+// order, and each step processes batch_size many inputs from input_ptr_batch,
+// and updates batch_size many cell and output states.
+void LstmStep(
+    const float* input_ptr_batch, const float* input_to_input_weights_ptr,
+    const float* input_to_forget_weights_ptr,
+    const float* input_to_cell_weights_ptr,
+    const float* input_to_output_weights_ptr,
+    const float* recurrent_to_input_weights_ptr,
+    const float* recurrent_to_forget_weights_ptr,
+    const float* recurrent_to_cell_weights_ptr,
+    const float* recurrent_to_output_weights_ptr,
+    const float* cell_to_input_weights_ptr,
+    const float* cell_to_forget_weights_ptr,
+    const float* cell_to_output_weights_ptr, const float* input_gate_bias_ptr,
+    const float* forget_gate_bias_ptr, const float* cell_bias_ptr,
+    const float* output_gate_bias_ptr, const float* projection_weights_ptr,
+    const float* projection_bias_ptr, const TfLiteLSTMParams* params,
+    int n_batch, int n_cell, int n_input, int n_output, float* output_state_ptr,
+    float* cell_state_ptr, float* input_gate_scratch,
+    float* forget_gate_scratch, float* cell_scratch, float* output_gate_scratch,
+    float* output_ptr_batch);
+
 }  // namespace kernel_utils
 }  // namespace tflite
 #endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_KERNEL_UTILS_H_
diff --git a/tensorflow/contrib/lite/kernels/internal/mfcc.cc b/tensorflow/contrib/lite/kernels/internal/mfcc.cc
new file mode 100644
index 0000000000000000000000000000000000000000..eafe0c7afee6fabd5a4a258aa5176e23f5e8d62a
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/mfcc.cc
@@ -0,0 +1,65 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <math.h>
+
+#include "tensorflow/contrib/lite/kernels/internal/mfcc.h"
+
+namespace tflite {
+namespace internal {
+
+const double kDefaultUpperFrequencyLimit = 4000;
+const double kDefaultLowerFrequencyLimit = 20;
+const double kFilterbankFloor = 1e-12;
+const int kDefaultFilterbankChannelCount = 40;
+const int kDefaultDCTCoefficientCount = 13;
+
+Mfcc::Mfcc()
+    : initialized_(false),
+      lower_frequency_limit_(kDefaultLowerFrequencyLimit),
+      upper_frequency_limit_(kDefaultUpperFrequencyLimit),
+      filterbank_channel_count_(kDefaultFilterbankChannelCount),
+      dct_coefficient_count_(kDefaultDCTCoefficientCount) {}
+
+bool Mfcc::Initialize(int input_length, double input_sample_rate) {
+  bool initialized = mel_filterbank_.Initialize(
+      input_length, input_sample_rate, filterbank_channel_count_,
+      lower_frequency_limit_, upper_frequency_limit_);
+  initialized &=
+      dct_.Initialize(filterbank_channel_count_, dct_coefficient_count_);
+  initialized_ = initialized;
+  return initialized;
+}
+
+void Mfcc::Compute(const std::vector<double>& spectrogram_frame,
+                   std::vector<double>* output) const {
+  if (!initialized_) {
+    // LOG(ERROR) << "Mfcc not initialized.";
+    return;
+  }
+  std::vector<double> working;
+  mel_filterbank_.Compute(spectrogram_frame, &working);
+  for (int i = 0; i < working.size(); ++i) {
+    double val = working[i];
+    if (val < kFilterbankFloor) {
+      val = kFilterbankFloor;
+    }
+    working[i] = log(val);
+  }
+  dct_.Compute(working, output);
+}
+
+}  // namespace internal
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/mfcc.h b/tensorflow/contrib/lite/kernels/internal/mfcc.h
new file mode 100644
index 0000000000000000000000000000000000000000..d8500ecdcf38e5dcfe9eb89915501678455b3dd9
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/mfcc.h
@@ -0,0 +1,78 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Basic class for computing MFCCs from spectrogram slices.
+
+#ifndef TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_MFCC_H_
+#define TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_MFCC_H_
+
+#include <vector>
+
+#include "tensorflow/contrib/lite/kernels/internal/mfcc_dct.h"
+#include "tensorflow/contrib/lite/kernels/internal/mfcc_mel_filterbank.h"
+
+namespace tflite {
+namespace internal {
+
+class Mfcc {
+ public:
+  Mfcc();
+  bool Initialize(int input_length, double input_sample_rate);
+
+  // Input is a single squared-magnitude spectrogram frame. The input spectrum
+  // is converted to linear magnitude and weighted into bands using a
+  // triangular mel filterbank, and a discrete cosine transform (DCT) of the
+  // values is taken. Output is populated with the lowest dct_coefficient_count
+  // of these values.
+  void Compute(const std::vector<double>& spectrogram_frame,
+               std::vector<double>* output) const;
+
+  void set_upper_frequency_limit(double upper_frequency_limit) {
+    // CHECK(!initialized_) << "Set frequency limits before calling
+    // Initialize.";
+    upper_frequency_limit_ = upper_frequency_limit;
+  }
+
+  void set_lower_frequency_limit(double lower_frequency_limit) {
+    // CHECK(!initialized_) << "Set frequency limits before calling
+    // Initialize.";
+    lower_frequency_limit_ = lower_frequency_limit;
+  }
+
+  void set_filterbank_channel_count(int filterbank_channel_count) {
+    // CHECK(!initialized_) << "Set channel count before calling Initialize.";
+    filterbank_channel_count_ = filterbank_channel_count;
+  }
+
+  void set_dct_coefficient_count(int dct_coefficient_count) {
+    // CHECK(!initialized_) << "Set coefficient count before calling
+    // Initialize.";
+    dct_coefficient_count_ = dct_coefficient_count;
+  }
+
+ private:
+  MfccMelFilterbank mel_filterbank_;
+  MfccDct dct_;
+  bool initialized_;
+  double lower_frequency_limit_;
+  double upper_frequency_limit_;
+  int filterbank_channel_count_;
+  int dct_coefficient_count_;
+};
+
+}  // namespace internal
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_MFCC_H_
diff --git a/tensorflow/contrib/lite/kernels/internal/mfcc_dct.cc b/tensorflow/contrib/lite/kernels/internal/mfcc_dct.cc
new file mode 100644
index 0000000000000000000000000000000000000000..b0b7d181bdcf01688a387f33a3e64fc904324b50
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/mfcc_dct.cc
@@ -0,0 +1,78 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/kernels/internal/mfcc_dct.h"
+
+#include <math.h>
+
+namespace tflite {
+namespace internal {
+
+MfccDct::MfccDct() : initialized_(false) {}
+
+bool MfccDct::Initialize(int input_length, int coefficient_count) {
+  coefficient_count_ = coefficient_count;
+  input_length_ = input_length;
+
+  if (coefficient_count_ < 1) {
+    return false;
+  }
+
+  if (input_length < 1) {
+    return false;
+  }
+
+  if (coefficient_count_ > input_length_) {
+    return false;
+  }
+
+  cosines_.resize(coefficient_count_);
+  double fnorm = sqrt(2.0 / input_length_);
+  // Some platforms don't have M_PI, so define a local constant here.
+  const double pi = atan(1) * 4;
+  double arg = pi / input_length_;
+  for (int i = 0; i < coefficient_count_; ++i) {
+    cosines_[i].resize(input_length_);
+    for (int j = 0; j < input_length_; ++j) {
+      cosines_[i][j] = fnorm * cos(i * arg * (j + 0.5));
+    }
+  }
+  initialized_ = true;
+  return true;
+}
+
+void MfccDct::Compute(const std::vector<double> &input,
+                      std::vector<double> *output) const {
+  if (!initialized_) {
+    return;
+  }
+
+  output->resize(coefficient_count_);
+  int length = input.size();
+  if (length > input_length_) {
+    length = input_length_;
+  }
+
+  for (int i = 0; i < coefficient_count_; ++i) {
+    double sum = 0.0;
+    for (int j = 0; j < length; ++j) {
+      sum += cosines_[i][j] * input[j];
+    }
+    (*output)[i] = sum;
+  }
+}
+
+}  // namespace internal
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/mfcc_dct.h b/tensorflow/contrib/lite/kernels/internal/mfcc_dct.h
new file mode 100644
index 0000000000000000000000000000000000000000..a53f5cbd9bb70c7c9dd49672681140bb9cbd2f4f
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/mfcc_dct.h
@@ -0,0 +1,43 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Basic minimal DCT class for MFCC speech processing.
+
+#ifndef TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_MFCC_DCT_H_
+#define TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_MFCC_DCT_H_
+
+#include <vector>
+
+namespace tflite {
+namespace internal {
+
+class MfccDct {
+ public:
+  MfccDct();
+  bool Initialize(int input_length, int coefficient_count);
+  void Compute(const std::vector<double>& input,
+               std::vector<double>* output) const;
+
+ private:
+  bool initialized_;
+  int coefficient_count_;
+  int input_length_;
+  std::vector<std::vector<double> > cosines_;
+};
+
+}  // namespace internal
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_MFCC_DCT_H_
diff --git a/tensorflow/contrib/lite/kernels/internal/mfcc_mel_filterbank.cc b/tensorflow/contrib/lite/kernels/internal/mfcc_mel_filterbank.cc
new file mode 100644
index 0000000000000000000000000000000000000000..c3deb33d91a47bfe54b7c84d2a615df2422f90cc
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/mfcc_mel_filterbank.cc
@@ -0,0 +1,204 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// This code resamples the FFT bins, and smooths them with triangle-shaped
+// weights to create a mel-frequency filter bank. For filter i centered at f_i,
+// there is a triangular weighting of the FFT bins that extends from
+// filter f_i-1 (with a value of zero at the left edge of the triangle) to f_i
+// (where the filter value is 1) to f_i+1 (where the filter value returns to
+// zero).
+
+// Note: this code fails if you ask for too many channels.  The algorithm used
+// here assumes that each FFT bin contributes to at most two channels: the
+// right side of a triangle for channel i, and the left side of the triangle
+// for channel i+1.  If you ask for so many channels that some of the
+// resulting mel triangle filters are smaller than a single FFT bin, these
+// channels may end up with no contributing FFT bins.  The resulting mel
+// spectrum output will have some channels that are always zero.
+
+#include "tensorflow/contrib/lite/kernels/internal/mfcc_mel_filterbank.h"
+
+#include <math.h>
+
+namespace tflite {
+namespace internal {
+
+MfccMelFilterbank::MfccMelFilterbank() : initialized_(false) {}
+
+bool MfccMelFilterbank::Initialize(int input_length, double input_sample_rate,
+                                   int output_channel_count,
+                                   double lower_frequency_limit,
+                                   double upper_frequency_limit) {
+  num_channels_ = output_channel_count;
+  sample_rate_ = input_sample_rate;
+  input_length_ = input_length;
+
+  if (num_channels_ < 1) {
+    // LOG(ERROR) << "Number of filterbank channels must be positive.";
+    return false;
+  }
+
+  if (sample_rate_ <= 0) {
+    // LOG(ERROR) << "Sample rate must be positive.";
+    return false;
+  }
+
+  if (input_length < 2) {
+    // LOG(ERROR) << "Input length must be greater than 1.";
+    return false;
+  }
+
+  if (lower_frequency_limit < 0) {
+    // LOG(ERROR) << "Lower frequency limit must be nonnegative.";
+    return false;
+  }
+
+  if (upper_frequency_limit <= lower_frequency_limit) {
+    // LOG(ERROR) << "Upper frequency limit must be greater than "
+    //           << "lower frequency limit.";
+    return false;
+  }
+
+  // An extra center frequency is computed at the top to get the upper
+  // limit on the high side of the final triangular filter.
+  center_frequencies_.resize(num_channels_ + 1);
+  const double mel_low = FreqToMel(lower_frequency_limit);
+  const double mel_hi = FreqToMel(upper_frequency_limit);
+  const double mel_span = mel_hi - mel_low;
+  const double mel_spacing = mel_span / static_cast<double>(num_channels_ + 1);
+  for (int i = 0; i < num_channels_ + 1; ++i) {
+    center_frequencies_[i] = mel_low + (mel_spacing * (i + 1));
+  }
+
+  // Always exclude DC; emulate HTK.
+  const double hz_per_sbin =
+      0.5 * sample_rate_ / static_cast<double>(input_length_ - 1);
+  start_index_ = static_cast<int>(1.5 + (lower_frequency_limit / hz_per_sbin));
+  end_index_ = static_cast<int>(upper_frequency_limit / hz_per_sbin);
+
+  // Maps the input spectrum bin indices to filter bank channels/indices. For
+  // each FFT bin, band_mapper tells us which channel this bin contributes to
+  // on the right side of the triangle.  Thus this bin also contributes to the
+  // left side of the next channel's triangle response.
+  band_mapper_.resize(input_length_);
+  int channel = 0;
+  for (int i = 0; i < input_length_; ++i) {
+    double melf = FreqToMel(i * hz_per_sbin);
+    if ((i < start_index_) || (i > end_index_)) {
+      band_mapper_[i] = -2;  // Indicate an unused Fourier coefficient.
+    } else {
+      while ((center_frequencies_[channel] < melf) &&
+             (channel < num_channels_)) {
+        ++channel;
+      }
+      band_mapper_[i] = channel - 1;  // Can be == -1
+    }
+  }
+
+  // Create the weighting functions to taper the band edges.  The contribution
+  // of any one FFT bin is based on its distance along the continuum between two
+  // mel-channel center frequencies.  This bin contributes weights_[i] to the
+  // current channel and 1-weights_[i] to the next channel.
+  weights_.resize(input_length_);
+  for (int i = 0; i < input_length_; ++i) {
+    channel = band_mapper_[i];
+    if ((i < start_index_) || (i > end_index_)) {
+      weights_[i] = 0.0;
+    } else {
+      if (channel >= 0) {
+        weights_[i] =
+            (center_frequencies_[channel + 1] - FreqToMel(i * hz_per_sbin)) /
+            (center_frequencies_[channel + 1] - center_frequencies_[channel]);
+      } else {
+        weights_[i] = (center_frequencies_[0] - FreqToMel(i * hz_per_sbin)) /
+                      (center_frequencies_[0] - mel_low);
+      }
+    }
+  }
+  // Check the sum of FFT bin weights for every mel band to identify
+  // situations where the mel bands are so narrow that they don't get
+  // significant weight on enough (or any) FFT bins -- i.e., too many
+  // mel bands have been requested for the given FFT size.
+  std::vector<int> bad_channels;
+  for (int c = 0; c < num_channels_; ++c) {
+    float band_weights_sum = 0.0;
+    for (int i = 0; i < input_length_; ++i) {
+      if (band_mapper_[i] == c - 1) {
+        band_weights_sum += (1.0 - weights_[i]);
+      } else if (band_mapper_[i] == c) {
+        band_weights_sum += weights_[i];
+      }
+    }
+    // The lowest mel channels have the fewest FFT bins and the lowest
+    // weights sum.  But given that the target gain at the center frequency
+    // is 1.0, if the total sum of weights is below 0.5, we're in bad shape.
+    if (band_weights_sum < 0.5) {
+      bad_channels.push_back(c);
+    }
+  }
+  if (!bad_channels.empty()) {
+    /*
+    LOG(ERROR) << "Missing " << bad_channels.size() << " bands "
+               << " starting at " << bad_channels[0]
+               << " in mel-frequency design. "
+               << "Perhaps too many channels or "
+               << "not enough frequency resolution in spectrum. ("
+               << "input_length: " << input_length
+               << " input_sample_rate: " << input_sample_rate
+               << " output_channel_count: " << output_channel_count
+               << " lower_frequency_limit: " << lower_frequency_limit
+               << " upper_frequency_limit: " << upper_frequency_limit;
+               */
+  }
+  initialized_ = true;
+  return true;
+}
+
+// Compute the mel spectrum from the squared-magnitude FFT input by taking the
+// square root, then summing FFT magnitudes under triangular integration windows
+// whose widths increase with frequency.
+void MfccMelFilterbank::Compute(const std::vector<double> &input,
+                                std::vector<double> *output) const {
+  if (!initialized_) {
+    // LOG(ERROR) << "Mel Filterbank not initialized.";
+    return;
+  }
+
+  if (input.size() <= end_index_) {
+    // LOG(ERROR) << "Input too short to compute filterbank";
+    return;
+  }
+
+  // Ensure output is right length and reset all values.
+  output->assign(num_channels_, 0.0);
+
+  for (int i = start_index_; i <= end_index_; i++) {  // For each FFT bin
+    double spec_val = sqrt(input[i]);
+    double weighted = spec_val * weights_[i];
+    int channel = band_mapper_[i];
+    if (channel >= 0)
+      (*output)[channel] += weighted;  // Right side of triangle, downward slope
+    channel++;
+    if (channel < num_channels_)
+      (*output)[channel] += spec_val - weighted;  // Left side of triangle
+  }
+}
+
+double MfccMelFilterbank::FreqToMel(double freq) const {
+  return 1127.0 * log(1.0 + (freq / 700.0));
+}
+
+}  // namespace internal
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/mfcc_mel_filterbank.h b/tensorflow/contrib/lite/kernels/internal/mfcc_mel_filterbank.h
new file mode 100644
index 0000000000000000000000000000000000000000..c1db28243eea39a694b7613ac7144dce9b294897
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/mfcc_mel_filterbank.h
@@ -0,0 +1,63 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Basic class for applying a mel-scale mapping to a power spectrum.
+
+#ifndef TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_MFCC_MEL_FILTERBANK_H_
+#define TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_MFCC_MEL_FILTERBANK_H_
+
+#include <vector>
+
+namespace tflite {
+namespace internal {
+
+class MfccMelFilterbank {
+ public:
+  MfccMelFilterbank();
+  bool Initialize(int input_length,  // Number of unique FFT bins fftsize/2+1.
+                  double input_sample_rate, int output_channel_count,
+                  double lower_frequency_limit, double upper_frequency_limit);
+
+  // Takes a squared-magnitude spectrogram slice as input, computes a
+  // triangular-mel-weighted linear-magnitude filterbank, and places the result
+  // in output.
+  void Compute(const std::vector<double>& input,
+               std::vector<double>* output) const;
+
+ private:
+  double FreqToMel(double freq) const;
+  bool initialized_;
+  int num_channels_;
+  double sample_rate_;
+  int input_length_;
+  std::vector<double> center_frequencies_;  // In mel, for each mel channel.
+
+  // Each FFT bin b contributes to two triangular mel channels, with
+  // proportion weights_[b] going into mel channel band_mapper_[b], and
+  // proportion (1 - weights_[b]) going into channel band_mapper_[b] + 1.
+  // Thus, weights_ contains the weighting applied to each FFT bin for the
+  // upper-half of the triangular band.
+  std::vector<double> weights_;  // Right-side weight for this FFT bin.
+
+  // FFT bin i contributes to the upper side of mel channel band_mapper_[i].
+  std::vector<int> band_mapper_;
+  int start_index_;  // Lowest FFT bin used to calculate mel spectrum.
+  int end_index_;    // Highest FFT bin used to calculate mel spectrum.
+};
+
+}  // namespace internal
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_MFCC_MEL_FILTERBANK_H_
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8.h b/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8.h
index dbc4f0d6fdca8279072d6ea225334722d6a89eb2..c71b070680ead77769dd8b04d0d7a133ad694abc 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8.h
@@ -18,6 +18,7 @@ limitations under the License.
 #include "fixedpoint/fixedpoint.h"
 #include "public/gemmlowp.h"
 #include "tensorflow/contrib/lite/kernels/internal/common.h"
+#include "tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8_3x3_filter.h"
 #include "tensorflow/contrib/lite/kernels/internal/types.h"
 
 namespace tflite {
@@ -1692,6 +1693,23 @@ inline void DepthwiseConv(const uint8* input_data, const Dims<4>& input_dims,
   const int output_width = ArraySize(output_dims, 1);
   TFLITE_DCHECK(output_depth == input_depth * depth_multiplier);
 
+#ifdef __aarch64__
+  // Call kernel optimized for depthwise convolutions using 3x3 filters,
+  // stride 1 or 2, no padding, depth_multiplier = 1 and depth a multiple of 16.
+  if (filter_width == 3 && filter_height == 3 && depth_multiplier == 1 &&
+      (stride_width == 1 || stride_width == 2) &&
+      (stride_height == 1 || stride_height == 2) && pad_width == 0 &&
+      pad_height == 0 && (input_depth % 16) == 0) {
+    DepthwiseConv3by3FilterDepth16(
+        input_data, input_dims, input_offset, filter_data, filter_dims,
+        filter_offset, bias_data, bias_dims, stride_width, stride_height,
+        pad_width, pad_height, depth_multiplier, output_offset,
+        output_multiplier, output_shift, output_activation_min,
+        output_activation_max, output_data, output_dims);
+    return;
+  }
+#endif
+
   static const int kAccBufferMaxSize = 2048;
   int32 acc_buffer[kAccBufferMaxSize];
   TFLITE_DCHECK_GE(kAccBufferMaxSize, output_depth);
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8_3x3_filter.h b/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8_3x3_filter.h
new file mode 100644
index 0000000000000000000000000000000000000000..9dc76e7608f170fcf21bb188226bf30995df8cda
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8_3x3_filter.h
@@ -0,0 +1,706 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_OPTIMIZED_DEPTHWISECONV_UINT8_3X3_FILTER_H_
+#define TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_OPTIMIZED_DEPTHWISECONV_UINT8_3X3_FILTER_H_
+
+#include "fixedpoint/fixedpoint.h"
+#include "public/gemmlowp.h"
+#include "tensorflow/contrib/lite/kernels/internal/common.h"
+#include "tensorflow/contrib/lite/kernels/internal/types.h"
+
+namespace tflite {
+namespace optimized_ops {
+
+#ifdef __aarch64__
+
+inline void preload_l1_keep(const uint8* ptr) {
+#ifdef GEMMLOWP_ARM_64
+  asm volatile("prfm pldl1keep, [%[ptr]]\n" ::[ptr] "r"(ptr) :);
+#else
+  gemmlowp::Prefetch(ptr);
+#endif
+}
+
+// Implementation of quantized DepthwiseConv for 3x3 filters.
+
+// Below are helper structs to remove the use of arrays.
+// There is an llvm bug that causes significant slowdown when using arrays for
+// NEON intrinsics vector data types.
+// See: https://bugs.llvm.org/show_bug.cgi?id=34945
+
+struct Int32x16 {
+  int32x4_t v0, v1, v2, v3;
+};
+
+struct Int16x16 {
+  int16x8_t low, high;
+};
+
+struct Int16x16x3 {
+  Int16x16 v0, v1, v2;
+};
+
+struct Filter3x3x16 {
+  Int16x16x3 r0, r1, r2;
+};
+
+// Loads 3x3 filter of depth 16 and adds filter offsets.
+inline Filter3x3x16 LoadFilterDepth16(const uint8* filter_ptr,
+                                      int32 filter_offset, int output_depth) {
+  Filter3x3x16 filter;
+
+  uint8x8_t temp_u8_0, temp_u8_1, temp_u8_2, temp_u8_3, temp_u8_4, temp_u8_5,
+      temp_u8_6, temp_u8_7, temp_u8_8, temp_u8_9, temp_u8_10, temp_u8_11,
+      temp_u8_12, temp_u8_13, temp_u8_14, temp_u8_15, temp_u8_16, temp_u8_17;
+  int16x8_t filter_offset_vec = vdupq_n_s16(filter_offset);
+
+  temp_u8_0 = vld1_u8(filter_ptr + 0 * output_depth);
+  temp_u8_1 = vld1_u8(filter_ptr + 0 * output_depth + 8);
+  temp_u8_2 = vld1_u8(filter_ptr + 1 * output_depth);
+  temp_u8_3 = vld1_u8(filter_ptr + 1 * output_depth + 8);
+  temp_u8_4 = vld1_u8(filter_ptr + 2 * output_depth);
+  temp_u8_5 = vld1_u8(filter_ptr + 2 * output_depth + 8);
+
+  temp_u8_6 = vld1_u8(filter_ptr + 3 * output_depth);
+  temp_u8_7 = vld1_u8(filter_ptr + 3 * output_depth + 8);
+  temp_u8_8 = vld1_u8(filter_ptr + 4 * output_depth);
+  temp_u8_9 = vld1_u8(filter_ptr + 4 * output_depth + 8);
+  temp_u8_10 = vld1_u8(filter_ptr + 5 * output_depth);
+  temp_u8_11 = vld1_u8(filter_ptr + 5 * output_depth + 8);
+
+  temp_u8_12 = vld1_u8(filter_ptr + 6 * output_depth);
+  temp_u8_13 = vld1_u8(filter_ptr + 6 * output_depth + 8);
+  temp_u8_14 = vld1_u8(filter_ptr + 7 * output_depth);
+  temp_u8_15 = vld1_u8(filter_ptr + 7 * output_depth + 8);
+  temp_u8_16 = vld1_u8(filter_ptr + 8 * output_depth);
+  temp_u8_17 = vld1_u8(filter_ptr + 8 * output_depth + 8);
+
+  filter.r0.v0.low = vreinterpretq_s16_u16(vmovl_u8(temp_u8_0));
+  filter.r0.v0.high = vreinterpretq_s16_u16(vmovl_u8(temp_u8_1));
+  filter.r0.v1.low = vreinterpretq_s16_u16(vmovl_u8(temp_u8_2));
+  filter.r0.v1.high = vreinterpretq_s16_u16(vmovl_u8(temp_u8_3));
+  filter.r0.v2.low = vreinterpretq_s16_u16(vmovl_u8(temp_u8_4));
+  filter.r0.v2.high = vreinterpretq_s16_u16(vmovl_u8(temp_u8_5));
+
+  filter.r1.v0.low = vreinterpretq_s16_u16(vmovl_u8(temp_u8_6));
+  filter.r1.v0.high = vreinterpretq_s16_u16(vmovl_u8(temp_u8_7));
+  filter.r1.v1.low = vreinterpretq_s16_u16(vmovl_u8(temp_u8_8));
+  filter.r1.v1.high = vreinterpretq_s16_u16(vmovl_u8(temp_u8_9));
+  filter.r1.v2.low = vreinterpretq_s16_u16(vmovl_u8(temp_u8_10));
+  filter.r1.v2.high = vreinterpretq_s16_u16(vmovl_u8(temp_u8_11));
+
+  filter.r2.v0.low = vreinterpretq_s16_u16(vmovl_u8(temp_u8_12));
+  filter.r2.v0.high = vreinterpretq_s16_u16(vmovl_u8(temp_u8_13));
+  filter.r2.v1.low = vreinterpretq_s16_u16(vmovl_u8(temp_u8_14));
+  filter.r2.v1.high = vreinterpretq_s16_u16(vmovl_u8(temp_u8_15));
+  filter.r2.v2.low = vreinterpretq_s16_u16(vmovl_u8(temp_u8_16));
+  filter.r2.v2.high = vreinterpretq_s16_u16(vmovl_u8(temp_u8_17));
+
+  filter.r0.v0.low = vaddq_s16(filter.r0.v0.low, filter_offset_vec);
+  filter.r0.v0.high = vaddq_s16(filter.r0.v0.high, filter_offset_vec);
+  filter.r0.v1.low = vaddq_s16(filter.r0.v1.low, filter_offset_vec);
+  filter.r0.v1.high = vaddq_s16(filter.r0.v1.high, filter_offset_vec);
+  filter.r0.v2.low = vaddq_s16(filter.r0.v2.low, filter_offset_vec);
+  filter.r0.v2.high = vaddq_s16(filter.r0.v2.high, filter_offset_vec);
+
+  filter.r1.v0.low = vaddq_s16(filter.r1.v0.low, filter_offset_vec);
+  filter.r1.v0.high = vaddq_s16(filter.r1.v0.high, filter_offset_vec);
+  filter.r1.v1.low = vaddq_s16(filter.r1.v1.low, filter_offset_vec);
+  filter.r1.v1.high = vaddq_s16(filter.r1.v1.high, filter_offset_vec);
+  filter.r1.v2.low = vaddq_s16(filter.r1.v2.low, filter_offset_vec);
+  filter.r1.v2.high = vaddq_s16(filter.r1.v2.high, filter_offset_vec);
+
+  filter.r2.v0.low = vaddq_s16(filter.r2.v0.low, filter_offset_vec);
+  filter.r2.v0.high = vaddq_s16(filter.r2.v0.high, filter_offset_vec);
+  filter.r2.v1.low = vaddq_s16(filter.r2.v1.low, filter_offset_vec);
+  filter.r2.v1.high = vaddq_s16(filter.r2.v1.high, filter_offset_vec);
+  filter.r2.v2.low = vaddq_s16(filter.r2.v2.low, filter_offset_vec);
+  filter.r2.v2.high = vaddq_s16(filter.r2.v2.high, filter_offset_vec);
+
+  return filter;
+}
+
+// Loads 3 input cells of depth 16 and adds input offsets.
+inline Int16x16x3 LoadInputRowDepth16(const uint8* ptr, int input_depth,
+                                      int32 input_offset,
+                                      Int16x16x3 input_row) {
+  uint8x8_t temp_0, temp_1;
+  int16x8_t offset_vec = vdupq_n_s16(input_offset);
+
+  temp_0 = vld1_u8(ptr + 0 * input_depth);
+  temp_1 = vld1_u8(ptr + 0 * input_depth + 8);
+  input_row.v0.low = vreinterpretq_s16_u16(vmovl_u8(temp_0));
+  input_row.v0.high = vreinterpretq_s16_u16(vmovl_u8(temp_1));
+  input_row.v0.low = vaddq_s16(input_row.v0.low, offset_vec);
+  input_row.v0.high = vaddq_s16(input_row.v0.high, offset_vec);
+
+  temp_0 = vld1_u8(ptr + 1 * input_depth);
+  temp_1 = vld1_u8(ptr + 1 * input_depth + 8);
+  input_row.v1.low = vreinterpretq_s16_u16(vmovl_u8(temp_0));
+  input_row.v1.high = vreinterpretq_s16_u16(vmovl_u8(temp_1));
+  input_row.v1.low = vaddq_s16(input_row.v1.low, offset_vec);
+  input_row.v1.high = vaddq_s16(input_row.v1.high, offset_vec);
+
+  temp_0 = vld1_u8(ptr + 2 * input_depth);
+  temp_1 = vld1_u8(ptr + 2 * input_depth + 8);
+  input_row.v2.low = vreinterpretq_s16_u16(vmovl_u8(temp_0));
+  input_row.v2.high = vreinterpretq_s16_u16(vmovl_u8(temp_1));
+  input_row.v2.low = vaddq_s16(input_row.v2.low, offset_vec);
+  input_row.v2.high = vaddq_s16(input_row.v2.high, offset_vec);
+
+  return input_row;
+}
+
+// Performs multiply accumulate on 3 inputs of depth 16.
+inline Int32x16 MultiplyAccumulateRowDepth16(Int32x16 output,
+                                             const Int16x16x3& filter_row,
+                                             const Int16x16x3& input_row) {
+  output.v0 = vmlal_s16(output.v0, vget_low_s16(filter_row.v0.low),
+                        vget_low_s16(input_row.v0.low));
+  output.v1 = vmlal_s16(output.v1, vget_high_s16(filter_row.v0.low),
+                        vget_high_s16(input_row.v0.low));
+  output.v2 = vmlal_s16(output.v2, vget_low_s16(filter_row.v0.high),
+                        vget_low_s16(input_row.v0.high));
+  output.v3 = vmlal_s16(output.v3, vget_high_s16(filter_row.v0.high),
+                        vget_high_s16(input_row.v0.high));
+
+  output.v0 = vmlal_s16(output.v0, vget_low_s16(filter_row.v1.low),
+                        vget_low_s16(input_row.v1.low));
+  output.v1 = vmlal_s16(output.v1, vget_high_s16(filter_row.v1.low),
+                        vget_high_s16(input_row.v1.low));
+  output.v2 = vmlal_s16(output.v2, vget_low_s16(filter_row.v1.high),
+                        vget_low_s16(input_row.v1.high));
+  output.v3 = vmlal_s16(output.v3, vget_high_s16(filter_row.v1.high),
+                        vget_high_s16(input_row.v1.high));
+
+  output.v0 = vmlal_s16(output.v0, vget_low_s16(filter_row.v2.low),
+                        vget_low_s16(input_row.v2.low));
+  output.v1 = vmlal_s16(output.v1, vget_high_s16(filter_row.v2.low),
+                        vget_high_s16(input_row.v2.low));
+  output.v2 = vmlal_s16(output.v2, vget_low_s16(filter_row.v2.high),
+                        vget_low_s16(input_row.v2.high));
+  output.v3 = vmlal_s16(output.v3, vget_high_s16(filter_row.v2.high),
+                        vget_high_s16(input_row.v2.high));
+
+  return output;
+}
+
+// Downquantizes a set of depth-16 accumulator registers, adds the output
+// offset and applies the activation clamp. Stores results to output.
+inline void DownquantizeAndStoreDepth16(Int32x16 acc, int32 output_multiplier,
+                                        int output_shift,
+                                        int32x4_t output_offset_vec,
+                                        int32x4_t output_activation_min_vec,
+                                        int32x4_t output_activation_max_vec,
+                                        uint8* output_ptr) {
+  // Fixed-point multiplication.
+  acc.v0 = vqrdmulhq_n_s32(acc.v0, output_multiplier);
+  acc.v1 = vqrdmulhq_n_s32(acc.v1, output_multiplier);
+  acc.v2 = vqrdmulhq_n_s32(acc.v2, output_multiplier);
+  acc.v3 = vqrdmulhq_n_s32(acc.v3, output_multiplier);
+
+  using gemmlowp::RoundingDivideByPOT;
+  acc.v0 = RoundingDivideByPOT(acc.v0, output_shift);
+  acc.v1 = RoundingDivideByPOT(acc.v1, output_shift);
+  acc.v2 = RoundingDivideByPOT(acc.v2, output_shift);
+  acc.v3 = RoundingDivideByPOT(acc.v3, output_shift);
+
+  // Add the output offset.
+  acc.v0 = vaddq_s32(acc.v0, output_offset_vec);
+  acc.v1 = vaddq_s32(acc.v1, output_offset_vec);
+  acc.v2 = vaddq_s32(acc.v2, output_offset_vec);
+  acc.v3 = vaddq_s32(acc.v3, output_offset_vec);
+
+  // Apply the activation function.
+  acc.v0 = vmaxq_s32(acc.v0, output_activation_min_vec);
+  acc.v1 = vmaxq_s32(acc.v1, output_activation_min_vec);
+  acc.v2 = vmaxq_s32(acc.v2, output_activation_min_vec);
+  acc.v3 = vmaxq_s32(acc.v3, output_activation_min_vec);
+
+  acc.v0 = vminq_s32(acc.v0, output_activation_max_vec);
+  acc.v1 = vminq_s32(acc.v1, output_activation_max_vec);
+  acc.v2 = vminq_s32(acc.v2, output_activation_max_vec);
+  acc.v3 = vminq_s32(acc.v3, output_activation_max_vec);
+
+  // Saturating cast to uint8 and store to destination.
+  int16x4_t acc_tlla_s16 = vqmovn_s32(acc.v0);
+  int16x4_t acc_tllb_s16 = vqmovn_s32(acc.v1);
+  int16x4_t acc_tlha_s16 = vqmovn_s32(acc.v2);
+  int16x4_t acc_tlhb_s16 = vqmovn_s32(acc.v3);
+
+  int16x8_t res_s16_0 = vcombine_s16(acc_tlla_s16, acc_tllb_s16);
+  int16x8_t res_s16_1 = vcombine_s16(acc_tlha_s16, acc_tlhb_s16);
+  uint8x8_t res_u8_0 = vqmovun_s16(res_s16_0);
+  uint8x8_t res_u8_1 = vqmovun_s16(res_s16_1);
+  vst1q_u8(output_ptr, vcombine_u8(res_u8_0, res_u8_1));
+}
+
+// A kernel that is specialized on the number of output cells in the x and y
+// direction, and the stride. Assumes 3x3 filters of depth 16.
+template <int kFixedOutputX, int kFixedOutputY, int kFixedStride = 1>
+struct ConvKernel3x3FilterDepth16 {};
+
+template <>
+struct ConvKernel3x3FilterDepth16<1, 2, 1> {
+  static void Run(const Filter3x3x16& filter, const uint8* input_ptr,
+                  int input_depth, int32 input_offset, int input_row_width,
+                  const int32* bias_ptr, int32 output_offset,
+                  int32 output_multiplier, int output_shift,
+                  int32 output_activation_min, int32 output_activation_max,
+                  uint8* output_ptr, int output_depth, int output_width) {
+    // 16 depth accumulators for the 2 outputs.
+    Int32x16 acc0, acc1;
+
+    // Accumulators for top filter.
+    acc0.v0 = vld1q_s32(bias_ptr);
+    acc0.v1 = vld1q_s32(bias_ptr + 4);
+    acc0.v2 = vld1q_s32(bias_ptr + 8);
+    acc0.v3 = vld1q_s32(bias_ptr + 12);
+    // Accumulators for bottom filter.
+    acc1.v0 = vld1q_s32(bias_ptr);
+    acc1.v1 = vld1q_s32(bias_ptr + 4);
+    acc1.v2 = vld1q_s32(bias_ptr + 8);
+    acc1.v3 = vld1q_s32(bias_ptr + 12);
+
+    // Main multiply accumulate work.
+    {
+      // Load inputs for one filter row at a time.
+      Int16x16x3 input;
+
+      // Do first row of top filter.
+      input = LoadInputRowDepth16(input_ptr, input_depth, input_offset, input);
+      acc0 = MultiplyAccumulateRowDepth16(acc0, filter.r0, input);
+
+      // Do second row of top filter.
+      input = LoadInputRowDepth16(input_ptr + input_row_width, input_depth,
+                                  input_offset, input);
+      acc0 = MultiplyAccumulateRowDepth16(acc0, filter.r1, input);
+
+      // The inputs to the second row of the top filter are also the inputs to
+      // the first row of the bottom filter.
+      acc1 = MultiplyAccumulateRowDepth16(acc1, filter.r0, input);
+
+      // Do third row of top filter.
+      input = LoadInputRowDepth16(input_ptr + 2 * input_row_width, input_depth,
+                                  input_offset, input);
+      acc0 = MultiplyAccumulateRowDepth16(acc0, filter.r2, input);
+
+      // The inputs to the third row of the top filter are also the inputs to
+      // the second row of the bottom filter.
+      acc1 = MultiplyAccumulateRowDepth16(acc1, filter.r1, input);
+
+      // Do third row of bottom filter.
+      input = LoadInputRowDepth16(input_ptr + 3 * input_row_width, input_depth,
+                                  input_offset, input);
+      acc1 = MultiplyAccumulateRowDepth16(acc1, filter.r2, input);
+    }
+
+    // Apply activation, downquantize and store.
+    int32x4_t output_offset_vec = vdupq_n_s32(output_offset);
+    int32x4_t output_activation_min_vec = vdupq_n_s32(output_activation_min);
+    int32x4_t output_activation_max_vec = vdupq_n_s32(output_activation_max);
+
+    DownquantizeAndStoreDepth16(acc0, output_multiplier, output_shift,
+                                output_offset_vec, output_activation_min_vec,
+                                output_activation_max_vec, output_ptr);
+
+    DownquantizeAndStoreDepth16(acc1, output_multiplier, output_shift,
+                                output_offset_vec, output_activation_min_vec,
+                                output_activation_max_vec,
+                                output_ptr + output_depth * output_width);
+  }
+};
+
+template <>
+struct ConvKernel3x3FilterDepth16<1, 2, 2> {
+  static void Run(const Filter3x3x16& filter, const uint8* input_ptr,
+                  int input_depth, int32 input_offset, int input_row_width,
+                  const int32* bias_ptr, int32 output_offset,
+                  int32 output_multiplier, int output_shift,
+                  int32 output_activation_min, int32 output_activation_max,
+                  uint8* output_ptr, int output_depth, int output_width) {
+    // 16 depth accumulators for the 2 outputs.
+    Int32x16 acc0, acc1;
+
+    // Accumulators for top filter.
+    acc0.v0 = vld1q_s32(bias_ptr);
+    acc0.v1 = vld1q_s32(bias_ptr + 4);
+    acc0.v2 = vld1q_s32(bias_ptr + 8);
+    acc0.v3 = vld1q_s32(bias_ptr + 12);
+    // Accumulators for bottom filter.
+    acc1.v0 = vld1q_s32(bias_ptr);
+    acc1.v1 = vld1q_s32(bias_ptr + 4);
+    acc1.v2 = vld1q_s32(bias_ptr + 8);
+    acc1.v3 = vld1q_s32(bias_ptr + 12);
+
+    // Main multiply accumulate work.
+    {
+      // Load inputs for one filter row at a time.
+      Int16x16x3 input;
+
+      // Do first row of top filter.
+      input = LoadInputRowDepth16(input_ptr, input_depth, input_offset, input);
+      acc0 = MultiplyAccumulateRowDepth16(acc0, filter.r0, input);
+
+      // Do second row of top filter.
+      input = LoadInputRowDepth16(input_ptr + input_row_width, input_depth,
+                                  input_offset, input);
+      acc0 = MultiplyAccumulateRowDepth16(acc0, filter.r1, input);
+
+      // Do third row of top filter.
+      input = LoadInputRowDepth16(input_ptr + 2 * input_row_width, input_depth,
+                                  input_offset, input);
+      acc0 = MultiplyAccumulateRowDepth16(acc0, filter.r2, input);
+
+      // The inputs to the third row of the top filter are also the inputs to
+      // the first row of the bottom filter.
+      acc1 = MultiplyAccumulateRowDepth16(acc1, filter.r0, input);
+
+      // Do second row of bottom filter.
+      input = LoadInputRowDepth16(input_ptr + 3 * input_row_width, input_depth,
+                                  input_offset, input);
+      acc1 = MultiplyAccumulateRowDepth16(acc1, filter.r1, input);
+
+      // Do third row of bottom filter.
+      input = LoadInputRowDepth16(input_ptr + 4 * input_row_width, input_depth,
+                                  input_offset, input);
+      acc1 = MultiplyAccumulateRowDepth16(acc1, filter.r2, input);
+    }
+
+    // Apply activation, downquantize and store.
+    int32x4_t output_offset_vec = vdupq_n_s32(output_offset);
+    int32x4_t output_activation_min_vec = vdupq_n_s32(output_activation_min);
+    int32x4_t output_activation_max_vec = vdupq_n_s32(output_activation_max);
+
+    DownquantizeAndStoreDepth16(acc0, output_multiplier, output_shift,
+                                output_offset_vec, output_activation_min_vec,
+                                output_activation_max_vec, output_ptr);
+
+    DownquantizeAndStoreDepth16(acc1, output_multiplier, output_shift,
+                                output_offset_vec, output_activation_min_vec,
+                                output_activation_max_vec,
+                                output_ptr + output_depth * output_width);
+  }
+};
+
+template <>
+struct ConvKernel3x3FilterDepth16<1, 1> {
+  static void Run(const Filter3x3x16& filter, const uint8* input_ptr,
+                  int input_depth, int32 input_offset, int input_row_width,
+                  const int32* bias_ptr, int32 output_offset,
+                  int32 output_multiplier, int output_shift,
+                  int32 output_activation_min, int32 output_activation_max,
+                  uint8* output_ptr, int output_depth, int output_width) {
+    Int32x16 acc;
+    acc.v0 = vld1q_s32(bias_ptr);
+    acc.v1 = vld1q_s32(bias_ptr + 4);
+    acc.v2 = vld1q_s32(bias_ptr + 8);
+    acc.v3 = vld1q_s32(bias_ptr + 12);
+
+    // Main multiply accumulate work.
+    {
+      // Load inputs for one filter row at a time.
+      Int16x16x3 input;
+
+      // Do first row.
+      input = LoadInputRowDepth16(input_ptr, input_depth, input_offset, input);
+      acc = MultiplyAccumulateRowDepth16(acc, filter.r0, input);
+
+      // Do second row.
+      input = LoadInputRowDepth16(input_ptr + input_row_width, input_depth,
+                                  input_offset, input);
+      acc = MultiplyAccumulateRowDepth16(acc, filter.r1, input);
+
+      // Do third row.
+      input = LoadInputRowDepth16(input_ptr + 2 * input_row_width, input_depth,
+                                  input_offset, input);
+      acc = MultiplyAccumulateRowDepth16(acc, filter.r2, input);
+    }
+
+    // Apply activation, downquantize and store.
+    int32x4_t output_offset_vec = vdupq_n_s32(output_offset);
+    int32x4_t output_activation_min_vec = vdupq_n_s32(output_activation_min);
+    int32x4_t output_activation_max_vec = vdupq_n_s32(output_activation_max);
+
+    DownquantizeAndStoreDepth16(acc, output_multiplier, output_shift,
+                                output_offset_vec, output_activation_min_vec,
+                                output_activation_max_vec, output_ptr);
+  }
+};
+
+inline void DepthwiseConv3by3FilterDepth16(
+    const uint8* input_data, const Dims<4>& input_dims, int32 input_offset,
+    const uint8* filter_data, const Dims<4>& filter_dims, int32 filter_offset,
+    const int32* bias_data, const Dims<4>& bias_dims, int stride_width,
+    int stride_height, int pad_width, int pad_height, int depth_multiplier,
+    int32 output_offset, int32 output_multiplier, int output_shift,
+    int32 output_activation_min, int32 output_activation_max,
+    uint8* output_data, const Dims<4>& output_dims) {
+  const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
+  const int output_depth = MatchingArraySize(filter_dims, 0, output_dims, 0);
+  const int input_height = ArraySize(input_dims, 2);
+  const int input_width = ArraySize(input_dims, 1);
+  const int input_depth = ArraySize(input_dims, 0);
+  const int filter_height = ArraySize(filter_dims, 2);
+  const int filter_width = ArraySize(filter_dims, 1);
+  const int output_height = ArraySize(output_dims, 2);
+  const int output_width = ArraySize(output_dims, 1);
+
+  // The algorithm assumes the constraints below. It is optimized for a depth
+  // multiplier of 1, a 3x3 filter, no padding, and strides of 1 or 2.
+  TFLITE_DCHECK(output_depth == input_depth * depth_multiplier);
+  TFLITE_DCHECK(depth_multiplier == 1);
+  TFLITE_DCHECK(filter_height == 3);
+  TFLITE_DCHECK(filter_width == 3);
+  TFLITE_DCHECK(pad_height == 0);
+  TFLITE_DCHECK(pad_width == 0);
+  TFLITE_DCHECK(stride_width == 1 || stride_width == 2);
+  TFLITE_DCHECK(stride_height == 1 || stride_height == 2);
+
+  // The number of outputs to process in the main loop.
+  const int num_x_outputs = 1;
+  const int num_y_outputs = 2;
+
+  const int input_row_width = output_depth * (input_width + 2 * pad_width);
+  const int input_batch_size =
+      input_row_width * (input_height + 2 * pad_height);
+  const int output_batch_size = output_depth * output_width * output_height;
+  const int input_ptr_x_increment = input_depth * stride_width;
+
+  // Calculate extents of non-boundary loop.
+  int out_x_start = 0;
+  for (; out_x_start < input_width; out_x_start++) {
+    int in_x = (out_x_start * stride_width) - pad_width;
+    if (in_x >= 0) {
+      break;
+    }
+  }
+  int out_x_end = output_width - 1;
+  for (; out_x_end >= 0; out_x_end--) {
+    int in_x = (out_x_end * stride_width) - pad_width;
+    int in_x_end = in_x + filter_width + (num_x_outputs - 1) * stride_width;
+    if (in_x_end <= input_width) {
+      out_x_end++;
+      break;
+    }
+  }
+  int out_y_start = 0;
+  for (; out_y_start < input_height; out_y_start++) {
+    int in_y = (out_y_start * stride_height) - pad_height;
+    if (in_y >= 0) {
+      break;
+    }
+  }
+  int out_y_end = output_height - 1;
+  for (; out_y_end >= 0; out_y_end--) {
+    int in_y = (out_y_end * stride_height) - pad_height;
+    int in_y_end = in_y + filter_height + (num_y_outputs - 1) * stride_height;
+    if (in_y_end <= input_height) {
+      out_y_end++;
+      break;
+    }
+  }
+
+  using dot_product_func_t =
+      decltype(&ConvKernel3x3FilterDepth16<1, 2, 1>::Run);
+  dot_product_func_t dot_product_func = nullptr;
+
+  if (stride_width == 1 && stride_height == 1) {
+    dot_product_func = ConvKernel3x3FilterDepth16<1, 2, 1>::Run;
+  } else {
+    dot_product_func = ConvKernel3x3FilterDepth16<1, 2, 2>::Run;
+  }
+
+  // Offsets for preloading inputs.
+  const int i0 = 0;
+  const int i1 = input_depth;
+  const int i2 = 2 * input_depth;
+  const int i3 = input_row_width;
+  const int i4 = input_row_width + input_depth;
+  const int i5 = input_row_width + 2 * input_depth;
+  const int i6 = 2 * input_row_width;
+  const int i7 = 2 * input_row_width + input_depth;
+  const int i8 = 2 * input_row_width + 2 * input_depth;
+  const int i9 = 3 * input_row_width;
+  const int i10 = 3 * input_row_width + input_depth;
+  const int i11 = 3 * input_row_width + 2 * input_depth;
+  const int i12 = 4 * input_row_width;
+  const int i13 = 4 * input_row_width + input_depth;
+  const int i14 = 4 * input_row_width + 2 * input_depth;
+
+  for (int b = 0; b < batches; ++b) {
+    const int32* bias_ptr = bias_data;
+    const uint8* filter_ptr = filter_data;
+
+    const int in_batch_offset = b * input_batch_size;
+    const int out_batch_offset = b * output_batch_size;
+
+    int depth = 0;
+    for (; depth <= output_depth - 16; depth += 16) {
+      Filter3x3x16 filter =
+          LoadFilterDepth16(filter_ptr, filter_offset, output_depth);
+
+      // Handle 1x2 outputs.
+      int out_y = out_y_start;
+      for (; out_y < out_y_end; out_y += num_y_outputs) {
+        int out_x = out_x_start;
+
+        int in_y_offset =
+            stride_height * input_row_width * (out_y + pad_height);
+        int in_x_offset = stride_width * input_depth * (out_x + pad_width);
+
+        const uint8* input_ptr =
+            input_data + depth + in_x_offset + in_y_offset + in_batch_offset;
+
+        // Preload inputs. If input depth is large, preload every value of the
+        // input for this depth range. Otherwise, preload only the first values
+        // of each row.
+        if (input_depth >= 32) {
+          preload_l1_keep(input_ptr + i0);
+          preload_l1_keep(input_ptr + i1);
+          preload_l1_keep(input_ptr + i2);
+          preload_l1_keep(input_ptr + i3);
+          preload_l1_keep(input_ptr + i4);
+          preload_l1_keep(input_ptr + i5);
+          preload_l1_keep(input_ptr + i6);
+          preload_l1_keep(input_ptr + i7);
+          preload_l1_keep(input_ptr + i8);
+          preload_l1_keep(input_ptr + i9);
+          preload_l1_keep(input_ptr + i10);
+          preload_l1_keep(input_ptr + i11);
+
+          if (stride_height == 2) {
+            preload_l1_keep(input_ptr + i12);
+            preload_l1_keep(input_ptr + i13);
+            preload_l1_keep(input_ptr + i14);
+          }
+        } else {
+          preload_l1_keep(input_ptr + i0);
+          preload_l1_keep(input_ptr + i3);
+          preload_l1_keep(input_ptr + i6);
+          preload_l1_keep(input_ptr + i9);
+
+          if (stride_height == 2) {
+            preload_l1_keep(input_ptr + i12);
+          }
+        }
+
+        uint8* output_ptr = output_data + depth + (out_x * output_depth) +
+                            (output_depth * output_width * out_y) +
+                            out_batch_offset;
+
+        for (; out_x < out_x_end; out_x += num_x_outputs) {
+          dot_product_func(filter, input_ptr, input_depth, input_offset,
+                           input_row_width, bias_ptr, output_offset,
+                           output_multiplier, output_shift,
+                           output_activation_min, output_activation_max,
+                           output_ptr, output_depth, output_width);
+
+          input_ptr += input_ptr_x_increment * num_x_outputs;
+          output_ptr += output_depth * num_x_outputs;
+
+          // Preload the next inputs depending on stride.
+          if (stride_width == 1) {
+            preload_l1_keep(input_ptr + i2);
+            preload_l1_keep(input_ptr + i5);
+            preload_l1_keep(input_ptr + i8);
+            preload_l1_keep(input_ptr + i11);
+          } else if (stride_width == 2) {
+            preload_l1_keep(input_ptr + i1);
+            preload_l1_keep(input_ptr + i2);
+            preload_l1_keep(input_ptr + i4);
+            preload_l1_keep(input_ptr + i5);
+            preload_l1_keep(input_ptr + i7);
+            preload_l1_keep(input_ptr + i8);
+            preload_l1_keep(input_ptr + i10);
+            preload_l1_keep(input_ptr + i11);
+            preload_l1_keep(input_ptr + i13);
+            preload_l1_keep(input_ptr + i14);
+          }
+        }
+
+        // Handle the rest of the right side.
+        for (; out_x < output_width; out_x++) {
+          // This code path can only be reached if we're handling >1 x outputs
+          // at a time or support padding.
+        }
+      }
+
+      // Handle the rest of the bottom side.
+      for (; out_y < output_height; out_y++) {
+        int out_x = out_x_start;
+
+        int in_y_offset =
+            stride_height * input_row_width * (out_y + pad_height);
+        int in_x_offset = stride_width * input_depth * (out_x + pad_width);
+
+        const uint8* input_ptr =
+            input_data + depth + in_x_offset + in_y_offset + in_batch_offset;
+
+        if (input_depth >= 32) {
+          preload_l1_keep(input_ptr + i0);
+          preload_l1_keep(input_ptr + i1);
+          preload_l1_keep(input_ptr + i2);
+          preload_l1_keep(input_ptr + i3);
+          preload_l1_keep(input_ptr + i4);
+          preload_l1_keep(input_ptr + i5);
+          preload_l1_keep(input_ptr + i6);
+          preload_l1_keep(input_ptr + i7);
+        } else {
+          preload_l1_keep(input_ptr + i0);
+          preload_l1_keep(input_ptr + i3);
+          preload_l1_keep(input_ptr + i6);
+        }
+
+        uint8* output_ptr = output_data + depth + (out_x * output_depth) +
+                            (output_depth * output_width * out_y) +
+                            out_batch_offset;
+
+        for (; out_x < output_width; out_x++) {
+          ConvKernel3x3FilterDepth16<1, 1>::Run(
+              filter, input_ptr, input_depth, input_offset, input_row_width,
+              bias_ptr, output_offset, output_multiplier, output_shift,
+              output_activation_min, output_activation_max, output_ptr,
+              output_depth, output_width);
+
+          input_ptr += input_ptr_x_increment;
+          output_ptr += output_depth;
+
+          if (stride_width == 1) {
+            preload_l1_keep(input_ptr + i2);
+            preload_l1_keep(input_ptr + i5);
+            preload_l1_keep(input_ptr + i8);
+          } else if (stride_width == 2) {
+            preload_l1_keep(input_ptr + i1);
+            preload_l1_keep(input_ptr + i2);
+            preload_l1_keep(input_ptr + i4);
+            preload_l1_keep(input_ptr + i5);
+            preload_l1_keep(input_ptr + i7);
+            preload_l1_keep(input_ptr + i8);
+          }
+        }
+      }
+      filter_ptr += 16;
+      bias_ptr += 16;
+    }
+  }
+}
+
+#endif  // __aarch64__
+
+}  // namespace optimized_ops
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_OPTIMIZED_DEPTHWISECONV_UINT8_3X3_FILTER_H_
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h b/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
index 3866f86d38a6f200e091497cab2972ed92e25c6b..f7840258ec6b81f2ba8c24e0359d3c8ee5d5e682 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
@@ -610,6 +610,58 @@ inline void FullyConnected(const uint8* input_data, const Dims<4>& input_dims,
       input_offset, output_pipeline);
 }
 
+inline void FullyConnected(const uint8* input_data, const Dims<4>& input_dims,
+                           int32 input_offset, const uint8* filter_data,
+                           const Dims<4>& filter_dims, int32 filter_offset,
+                           const int32* bias_data, const Dims<4>& bias_dims,
+                           int32 output_offset, int32 output_multiplier,
+                           int output_shift, int32 output_activation_min,
+                           int32 output_activation_max, int16* output_data,
+                           const Dims<4>& output_dims,
+                           gemmlowp::GemmContext* gemm_context) {
+  gemmlowp::ScopedProfilingLabel label("FullyConnected/Uint8Int16");
+  // This is a copy of the reference implementation. We do not currently have a
+  // properly optimized version.
+  (void)gemm_context;  // only used in properly optimized code.
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  TFLITE_DCHECK_EQ(output_offset, 0);
+  // TODO(benoitjacob): This really should be:
+  //     const int batches = ArraySize(output_dims, 1);
+  // but the current --variable_batch hack consists of overwriting the 3rd
+  // dimension with the runtime batch size, since we don't track, for each
+  // array, which of its dimensions is the batch dimension.
+  const int batches = ArraySize(output_dims, 1) * ArraySize(output_dims, 2) *
+                      ArraySize(output_dims, 3);
+  const int output_depth = MatchingArraySize(filter_dims, 1, output_dims, 0);
+  const int accum_depth = ArraySize(filter_dims, 0);
+  TFLITE_DCHECK(IsPackedWithoutStrides(input_dims));
+  TFLITE_DCHECK(IsPackedWithoutStrides(filter_dims));
+  for (int b = 0; b < batches; ++b) {
+    for (int out_c = 0; out_c < output_depth; ++out_c) {
+      // Internal accumulation.
+      // Initialize accumulator with the bias-value.
+      int32 accum = bias_data[out_c];
+      // Accumulation loop.
+      for (int d = 0; d < accum_depth; ++d) {
+        int16 input_val = input_data[b * accum_depth + d] + input_offset;
+        int16 filter_val = filter_data[out_c * accum_depth + d] + filter_offset;
+        accum += filter_val * input_val;
+      }
+      // Down-scale the final int32 accumulator to the scale used by our
+      // (16-bit, typically 3 integer bits) fixed-point format. The quantized
+      // multiplier and shift here have been pre-computed offline
+      // (e.g. by toco).
+      accum = MultiplyByQuantizedMultiplier(accum, output_multiplier,
+                                            -output_shift);
+      // Saturate, cast to int16, and store to output array.
+      accum = std::max(accum, output_activation_min - output_offset);
+      accum = std::min(accum, output_activation_max - output_offset);
+      accum += output_offset;
+      output_data[out_c + output_depth * b] = accum;
+    }
+  }
+}
+
 // legacy, for compatibility with old checked-in code
 template <FusedActivationFunctionType Ac>
 void FullyConnected(const uint8* input_data, const Dims<4>& input_dims,
@@ -768,6 +820,7 @@ inline void DilatedConv(const float* input_data, const Dims<4>& input_dims,
                         float output_activation_max, float* output_data,
                         const Dims<4>& output_dims, float* im2col_data,
                         const Dims<4>& im2col_dims) {
+  gemmlowp::ScopedProfilingLabel label("DilatedConv");
   // This is a copy of the reference Conv implementation. We do not currently
   // have an optimized path for dilation.
   (void)im2col_data;  // only used in optimized code.
@@ -1530,6 +1583,8 @@ inline void Add(int left_shift, const uint8* input1_data,
   TFLITE_DCHECK_LT(input1_offset, 256);
   TFLITE_DCHECK_LT(input2_offset, 256);
 #ifdef USE_NEON
+  const auto output_activation_min_vector = vdup_n_u8(output_activation_min);
+  const auto output_activation_max_vector = vdup_n_u8(output_activation_max);
   for (; i <= size - 8; i += 8) {
     const auto input1_val_original = vld1_u8(input1_data + i);
     const auto input2_val_original = vld1_u8(input2_data + i);
@@ -1575,7 +1630,10 @@ inline void Add(int left_shift, const uint8* input1_data,
     const auto s2_narrowed = vmovn_s32(s2);
     const auto s = vaddq_s16(vcombine_s16(s1_narrowed, s2_narrowed),
                              vdupq_n_s16(output_offset));
-    vst1_u8(output_data + i, vqmovun_s16(s));
+    const auto clamped =
+        vmax_u8(output_activation_min_vector,
+                vmin_u8(output_activation_max_vector, vqmovun_s16(s)));
+    vst1_u8(output_data + i, clamped);
   }
 #endif  // NEON
 
@@ -1598,6 +1656,39 @@ inline void Add(int left_shift, const uint8* input1_data,
   }
 }
 
+template <FusedActivationFunctionType Ac>
+inline void Add(const int16* input1_data, const Dims<4>& input1_dims,
+                int input1_shift, const int16* input2_data,
+                const Dims<4>& input2_dims, int input2_shift,
+                int16* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("Add/Int16");
+  // This is a copy of the reference implementation. We do not currently have a
+  // properly optimized version.
+  static_assert(Ac == FusedActivationFunctionType::kNone, "");
+
+  const int flat_size = RequiredBufferSizeForDims(output_dims);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input1_dims), flat_size);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input2_dims), flat_size);
+
+  TFLITE_DCHECK(input1_shift == 0 || input2_shift == 0);
+  TFLITE_DCHECK_GE(input1_shift, 0);
+  TFLITE_DCHECK_GE(input2_shift, 0);
+  const int16* not_shift_input = input1_shift == 0 ? input1_data : input2_data;
+  const int16* shift_input = input1_shift == 0 ? input2_data : input1_data;
+  const int input_shift = input1_shift == 0 ? input2_shift : input1_shift;
+
+  for (int i = 0; i < flat_size; i++) {
+    // F0 uses 0 integer bits, range [-1, 1].
+    using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
+
+    F0 input_ready_scaled = F0::FromRaw(not_shift_input[i]);
+    F0 scaled_input =
+        F0::FromRaw(gemmlowp::RoundingDivideByPOT(shift_input[i], input_shift));
+    F0 result = gemmlowp::SaturatingAdd(scaled_input, input_ready_scaled);
+    output_data[i] = result.raw();
+  }
+}
+
 template <FusedActivationFunctionType Ac>
 void Add(const int32* input1_data, const Dims<4>& input1_dims,
          const int32* input2_data, const Dims<4>& input2_dims,
@@ -1872,6 +1963,57 @@ void Mul(const int32* input1_data, const Dims<4>& input1_dims,
   }
 }
 
+inline void Mul(const int16* input1_data, const Dims<4>& input1_dims,
+                const int16* input2_data, const Dims<4>& input2_dims,
+                int16* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("Mul/Int16");
+  // This is a copy of the reference implementation. We do not currently have a
+  // properly optimized version.
+
+  const int flat_size = RequiredBufferSizeForDims(output_dims);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input1_dims), flat_size);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input2_dims), flat_size);
+
+  for (int i = 0; i < flat_size; i++) {
+    // F0 uses 0 integer bits, range [-1, 1].
+    using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
+
+    F0 unclamped_result =
+        F0::FromRaw(input1_data[i]) * F0::FromRaw(input2_data[i]);
+    output_data[i] = unclamped_result.raw();
+  }
+}
+
+inline void Mul(const int16* input1_data, const Dims<4>& input1_dims,
+                const int16* input2_data, const Dims<4>& input2_dims,
+                int32 output_offset, int32 output_activation_min,
+                int32 output_activation_max, uint8* output_data,
+                const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("Mul/Int16Uint8");
+  // This is a copy of the reference implementation. We do not currently have a
+  // properly optimized version.
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+
+  const int flat_size = RequiredBufferSizeForDims(output_dims);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input1_dims), flat_size);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input2_dims), flat_size);
+
+  for (int i = 0; i < flat_size; i++) {
+    // F0 uses 0 integer bits, range [-1, 1].
+    using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
+
+    F0 unclamped_result =
+        F0::FromRaw(input1_data[i]) * F0::FromRaw(input2_data[i]);
+    int16 rescaled_result =
+        gemmlowp::RoundingDivideByPOT(unclamped_result.raw(), 8);
+    int16 clamped_result =
+        std::min<int16>(output_activation_max - output_offset, rescaled_result);
+    clamped_result =
+        std::max<int16>(output_activation_min - output_offset, clamped_result);
+    output_data[i] = output_offset + clamped_result;
+  }
+}
+
 // TODO(jiawen): We can implement BroadcastMul on buffers of arbitrary
 // dimensionality if the runtime code does a single loop over one dimension
 // that handles broadcasting as the base case. The code generator would then
@@ -2020,6 +2162,51 @@ inline void Div(const float* input1_data, const Dims<4>& input1_dims,
   }
 }
 
+// TODO(jiawen): We can implement BroadcastDiv on buffers of arbitrary
+// dimensionality if the runtime code does a single loop over one dimension
+// that handles broadcasting as the base case. The code generator would then
+// generate max(D1, D2) nested for loops.
+// TODO(benoitjacob): BroadcastDiv is intentionally duplicated from
+// reference_ops.h. Once an optimized version is implemented and NdArrayDesc<T>
+// is no longer referenced in this file, move NdArrayDesc<T> from types.h to
+// reference_ops.h.
+template <typename T>
+void BroadcastDiv(const T* input1_data, const Dims<4>& input1_dims,
+                  const T* input2_data, const Dims<4>& input2_dims,
+                  T output_activation_min, T output_activation_max,
+                  T* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastDiv");
+
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+
+  // In TensorFlow, the dimensions are canonically named (batch_number, row,
+  // col, channel), with extents (batches, height, width, depth), with the
+  // trailing dimension changing most rapidly (channels has the smallest stride,
+  // typically 1 element).
+  //
+  // In generated C code, we store arrays with the dimensions reversed. The
+  // first dimension has smallest stride.
+  //
+  // We name our variables by their TensorFlow convention, but generate C code
+  // nesting loops such that the innermost loop has the smallest stride for the
+  // best cache behavior.
+  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
+    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
+      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
+        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
+          output_data[Offset(output_dims, c, x, y, b)] =
+              ActivationFunctionWithMinMax(
+                  input1_data[SubscriptToIndex(desc1, c, x, y, b)] /
+                      input2_data[SubscriptToIndex(desc2, c, x, y, b)],
+                  output_activation_min, output_activation_max);
+        }
+      }
+    }
+  }
+}
+
 // TODO(aselle): This is not actually optimized yet.
 inline void Sub(const float* input1_data, const Dims<4>& input1_dims,
                 const float* input2_data, const Dims<4>& input2_dims,
@@ -2047,6 +2234,111 @@ inline void Sub(const float* input1_data, const Dims<4>& input1_dims,
     }
   }
 }
+
+// TODO(jiawen): We can implement BroadcastSub on buffers of arbitrary
+// dimensionality if the runtime code does a single loop over one dimension
+// that handles broadcasting as the base case. The code generator would then
+// generate max(D1, D2) nested for loops.
+// TODO(benoitjacob): BroadcastSub is intentionally duplicated from
+// reference_ops.h. Once an optimized version is implemented and NdArrayDesc<T>
+// is no longer referenced in this file, move NdArrayDesc<T> from types.h to
+// reference_ops.h.
+template <typename T>
+void BroadcastSub(const T* input1_data, const Dims<4>& input1_dims,
+                  const T* input2_data, const Dims<4>& input2_dims,
+                  T output_activation_min, T output_activation_max,
+                  T* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastSub");
+
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+
+  // In TensorFlow, the dimensions are canonically named (batch_number, row,
+  // col, channel), with extents (batches, height, width, depth), with the
+  // trailing dimension changing most rapidly (channels has the smallest stride,
+  // typically 1 element).
+  //
+  // In generated C code, we store arrays with the dimensions reversed. The
+  // first dimension has smallest stride.
+  //
+  // We name our variables by their TensorFlow convention, but generate C code
+  // nesting loops such that the innermost loop has the smallest stride for the
+  // best cache behavior.
+  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
+    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
+      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
+        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
+          output_data[Offset(output_dims, c, x, y, b)] =
+              ActivationFunctionWithMinMax(
+                  input1_data[SubscriptToIndex(desc1, c, x, y, b)] -
+                      input2_data[SubscriptToIndex(desc2, c, x, y, b)],
+                  output_activation_min, output_activation_max);
+        }
+      }
+    }
+  }
+}
+
+inline void BroadcastSub(int left_shift, const uint8* input1_data,
+                         const Dims<4>& input1_dims, int32 input1_offset,
+                         int32 input1_multiplier, int input1_shift,
+                         const uint8* input2_data, const Dims<4>& input2_dims,
+                         int32 input2_offset, int32 input2_multiplier,
+                         int input2_shift, int32 output_offset,
+                         int32 output_multiplier, int output_shift,
+                         int32 output_activation_min,
+                         int32 output_activation_max, uint8* output_data,
+                         const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastSub/8bit");
+
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+
+  // In TensorFlow, the dimensions are canonically named (batch_number, row,
+  // col, channel), with extents (batches, height, width, depth), with the
+  // trailing dimension changing most rapidly (channels has the smallest stride,
+  // typically 1 element).
+  //
+  // In generated C code, we store arrays with the dimensions reversed. The
+  // first dimension has smallest stride.
+  //
+  // We name our variables by their TensorFlow convention, but generate C code
+  // nesting loops such that the innermost loop has the smallest stride for the
+  // best cache behavior.
+  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
+    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
+      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
+        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
+          const int32 input1_val =
+              input1_offset + input1_data[SubscriptToIndex(desc1, c, x, y, b)];
+          const int32 input2_val =
+              input2_offset + input2_data[SubscriptToIndex(desc2, c, x, y, b)];
+          const int32 shifted_input1_val = input1_val * (1 << left_shift);
+          const int32 shifted_input2_val = input2_val * (1 << left_shift);
+          const int32 scaled_input1_val =
+              MultiplyByQuantizedMultiplierSmallerThanOne(
+                  shifted_input1_val, input1_multiplier, input1_shift);
+          const int32 scaled_input2_val =
+              MultiplyByQuantizedMultiplierSmallerThanOne(
+                  shifted_input2_val, input2_multiplier, input2_shift);
+          const int32 raw_sub = scaled_input1_val - scaled_input2_val;
+          const int32 raw_output =
+              MultiplyByQuantizedMultiplierSmallerThanOne(
+                  raw_sub, output_multiplier, output_shift) +
+              output_offset;
+          const int32 clamped_output =
+              std::min(output_activation_max,
+                       std::max(output_activation_min, raw_output));
+          output_data[Offset(output_dims, c, x, y, b)] =
+              static_cast<uint8>(clamped_output);
+        }
+      }
+    }
+  }
+}
+
 template <FusedActivationFunctionType Ac, typename Scalar>
 void Concatenation(int concat_dim, const Scalar* const* input_data,
                    const Dims<4>* const* input_dims, int inputs_count,
@@ -3631,6 +3923,28 @@ inline void Logistic(const uint8* input_data, const Dims<4>& input_dims,
   }
 }
 
+inline void Logistic(const int16* input_data, const Dims<4>& input_dims,
+                     int16* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("Logistic/Int16");
+  // This is a copy of the reference implementation. We do not currently have a
+  // properly optimized version.
+  const int flat_size = RequiredBufferSizeForDims(output_dims);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input_dims), flat_size);
+
+  for (int i = 0; i < flat_size; i++) {
+    // F0 uses 0 integer bits, range [-1, 1].
+    // This is the return type of math functions such as tanh, logistic,
+    // whose range is in [-1, 1].
+    using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
+    // F3 uses 3 integer bits, range [-8, 8], the input range expected here.
+    using F3 = gemmlowp::FixedPoint<std::int16_t, 3>;
+
+    const F3 input = F3::FromRaw(input_data[i]);
+    F0 output = gemmlowp::logistic(input);
+    output_data[i] = output.raw();
+  }
+}
+
 inline void Tanh(const float* input_data, const Dims<4>& input_dims,
                  float* output_data, const Dims<4>& output_dims) {
   gemmlowp::ScopedProfilingLabel label("Tanh");
@@ -3789,6 +4103,45 @@ inline void Tanh(const uint8* input_data, const Dims<4>& input_dims,
     output_data[c] = output_val;
   }
 }
+
+inline void Tanh(const int16* input_data, const Dims<4>& input_dims,
+                 int input_left_shift, int16* output_data,
+                 const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("Tanh/Int16");
+  // This is a copy of the reference implementation. We do not currently have a
+  // properly optimized version.
+
+  // Support for shifts is limited until we have a parameterized version of
+  // SaturatingRoundingMultiplyByPOT().
+  TFLITE_DCHECK_GE(input_left_shift, 0);
+  TFLITE_DCHECK_LE(input_left_shift, 1);
+
+  const int flat_size = RequiredBufferSizeForDims(output_dims);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input_dims), flat_size);
+
+  // F0 uses 0 integer bits, range [-1, 1].
+  // This is the return type of math functions such as tanh, logistic,
+  // whose range is in [-1, 1].
+  using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
+  // F3 uses 3 integer bits, range [-8, 8], the input range expected here.
+  using F3 = gemmlowp::FixedPoint<std::int16_t, 3>;
+
+  if (input_left_shift == 0) {
+    for (int i = 0; i < flat_size; i++) {
+      F3 input = F3::FromRaw(input_data[i]);
+      F0 output = gemmlowp::tanh(input);
+      output_data[i] = output.raw();
+    }
+  } else {
+    for (int i = 0; i < flat_size; i++) {
+      F3 input = F3::FromRaw(
+          gemmlowp::SaturatingRoundingMultiplyByPOT<1>(input_data[i]));
+      F0 output = gemmlowp::tanh(input);
+      output_data[i] = output.raw();
+    }
+  }
+}
+
 inline void Dequantize(const uint8* input_data, const Dims<4>& input_dims,
                        int32 zero_point, double scale, float* output_data,
                        const Dims<4>& output_dims) {
@@ -4725,6 +5078,78 @@ void Transpose(const T* input, const Dims<4>& input_dims, T* output,
   }
 }
 
+inline void TransposeConv(const float* input_data, const Dims<4>& input_dims,
+                          const float* filter_data, const Dims<4>& filter_dims,
+                          int stride_width, int stride_height, int pad_width,
+                          int pad_height, float* output_data,
+                          const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("TransposeConv");
+  // THIS FUNCTION IS A COPY FROM reference_ops.h.
+  // To optimize, start by using the conv code with transposed weights for the
+  // case of stride_height = stride_width = 1.
+  const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
+  const int input_depth = MatchingArraySize(input_dims, 0, filter_dims, 3);
+  const int output_depth = MatchingArraySize(filter_dims, 0, output_dims, 0);
+  const int input_height = ArraySize(input_dims, 2);
+  const int input_width = ArraySize(input_dims, 1);
+  const int filter_height = ArraySize(filter_dims, 2);
+  const int filter_width = ArraySize(filter_dims, 1);
+  const int output_height = ArraySize(output_dims, 2);
+  const int output_width = ArraySize(output_dims, 1);
+
+  // Although transpose convolution simplifies to convolution with transposed
+  // weights for strides of 1, non-unitary striding complicates matters. To
+  // keep this reference implementation as clear as possible, we use a "scatter"
+  // access pattern, where we loop through all the input elements, computing
+  // their influence on the output, rather than looping through the output
+  // elements in the typical "gather" access pattern of a conv. We therefore
+  // must initialize the output array to zero.
+  for (int batch = 0; batch < batches; ++batch) {
+    for (int out_y = 0; out_y < output_height; ++out_y) {
+      for (int out_x = 0; out_x < output_width; ++out_x) {
+        for (int out_channel = 0; out_channel < output_depth; ++out_channel) {
+          output_data[Offset(output_dims, out_channel, out_x, out_y, batch)] =
+              0.0f;
+        }
+      }
+    }
+  }
+
+  // Loop through input elements one at a time.
+  for (int batch = 0; batch < batches; ++batch) {
+    for (int in_y = 0; in_y < input_height; ++in_y) {
+      for (int in_x = 0; in_x < input_width; ++in_x) {
+        for (int in_channel = 0; in_channel < input_depth; ++in_channel) {
+          // Loop through the output elements it will influence
+          const int out_x_origin = (in_x * stride_width) - pad_width;
+          const int out_y_origin = (in_y * stride_height) - pad_height;
+          for (int filter_y = 0; filter_y < filter_height; ++filter_y) {
+            for (int filter_x = 0; filter_x < filter_width; ++filter_x) {
+              for (int out_channel = 0; out_channel < output_depth;
+                   ++out_channel) {
+                // Compute output element location
+                const int out_x = out_x_origin + filter_x;
+                const int out_y = out_y_origin + filter_y;
+                // We cannot accumulate out of bounds
+                if ((out_x >= 0) && (out_x < output_width) && (out_y >= 0) &&
+                    (out_y < output_height)) {
+                  float input_value = input_data[Offset(input_dims, in_channel,
+                                                        in_x, in_y, batch)];
+                  float filter_value =
+                      filter_data[Offset(filter_dims, out_channel, filter_x,
+                                         filter_y, in_channel)];
+                  output_data[Offset(output_dims, out_channel, out_x, out_y,
+                                     batch)] += input_value * filter_value;
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+}
+
 }  // namespace optimized_ops
 }  // namespace tflite
 
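Reviewer note: the "scatter" access pattern described in the TransposeConv comment above may be easier to follow in one dimension. The following is a minimal sketch, assuming a single channel and a single batch; `TransposeConv1D` and its parameters are hypothetical names for illustration and are not part of TensorFlow Lite.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1-D, single-channel helper (not a TensorFlow Lite function).
// Each input element is "scattered" into the output positions it influences.
std::vector<float> TransposeConv1D(const std::vector<float>& input,
                                   const std::vector<float>& filter,
                                   int stride, int pad, int output_width) {
  assert(stride > 0);
  assert(output_width >= 0);
  // Scatter accumulates with +=, so the output must start at zero.
  std::vector<float> output(output_width, 0.0f);
  const int input_width = static_cast<int>(input.size());
  const int filter_width = static_cast<int>(filter.size());
  for (int in_x = 0; in_x < input_width; ++in_x) {
    // First output position touched by this input element.
    const int out_x_origin = in_x * stride - pad;
    for (int filter_x = 0; filter_x < filter_width; ++filter_x) {
      const int out_x = out_x_origin + filter_x;
      // We cannot accumulate out of bounds.
      if (out_x >= 0 && out_x < output_width) {
        output[out_x] += input[in_x] * filter[filter_x];
      }
    }
  }
  return output;
}
```

With input {1, 2}, filter {1, 1, 1}, stride 2, no padding and an output width of 5, the two input elements overlap at output position 2, giving {1, 1, 3, 2, 2}.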
diff --git a/tensorflow/contrib/lite/kernels/internal/quantization_util.h b/tensorflow/contrib/lite/kernels/internal/quantization_util.h
index ba06bc0975b6847b24592daa60efe99983d03707..f7706c793883f012d2b66cf3d3167a59afe31f91 100644
--- a/tensorflow/contrib/lite/kernels/internal/quantization_util.h
+++ b/tensorflow/contrib/lite/kernels/internal/quantization_util.h
@@ -12,13 +12,91 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#ifndef PHOTOS_VISION_LEARNING_TENSORFLOW_MINI_QUANTIZATION_UTIL_H_
-#define PHOTOS_VISION_LEARNING_TENSORFLOW_MINI_QUANTIZATION_UTIL_H_
+#ifndef TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_QUANTIZATION_UTIL_H_
+#define TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_QUANTIZATION_UTIL_H_
 
+#include <cmath>
 #include <cstdint>
+#include <limits>
+
+#include "tensorflow/contrib/lite/kernels/internal/compatibility.h"
+#include "tensorflow/contrib/lite/kernels/internal/round.h"
+#include "tensorflow/contrib/lite/kernels/internal/types.h"
 
 namespace tflite {
 
+// Given the min and max values of a float array, return
+// reasonable quantization parameters to use for this array.
+template <typename T>
+QuantizationParams ChooseQuantizationParams(double rmin, double rmax) {
+  const T qmin = std::numeric_limits<T>::min();
+  const T qmax = std::numeric_limits<T>::max();
+  const double qmin_double = qmin;
+  const double qmax_double = qmax;
+  // 0 should always be a representable value. Let's assume that the initial
+  // min,max range contains 0.
+  TFLITE_CHECK_LE(rmin, 0.);
+  TFLITE_CHECK_GE(rmax, 0.);
+  if (rmin == rmax) {
+    // Special case where the min,max range is a point. Should be {0}.
+    TFLITE_CHECK_EQ(rmin, 0.);
+    TFLITE_CHECK_EQ(rmax, 0.);
+    QuantizationParams quantization_params;
+    quantization_params.zero_point = 0;
+    quantization_params.scale = 0.;
+    return quantization_params;
+  }
+
+  // General case.
+  //
+  // First determine the scale.
+  const double scale = (rmax - rmin) / (qmax_double - qmin_double);
+
+  // Zero-point computation.
+  // First the initial floating-point computation. The zero-point can be
+  // determined from solving an affine equation for any known pair
+  // (real value, corresponding quantized value).
+  // We know two such pairs: (rmin, qmin) and (rmax, qmax).
+  // The arithmetic error on the zero point computed from either pair
+  // will be roughly machine_epsilon * (sum of absolute values of terms)
+  // so we want to use the variant that adds the smaller terms.
+  const double zero_point_from_min = qmin_double - rmin / scale;
+  const double zero_point_from_max = qmax_double - rmax / scale;
+  const double zero_point_from_min_error =
+      std::abs(qmin_double) + std::abs(rmin / scale);
+  const double zero_point_from_max_error =
+      std::abs(qmax_double) + std::abs(rmax / scale);
+
+  const double zero_point_double =
+      zero_point_from_min_error < zero_point_from_max_error
+          ? zero_point_from_min
+          : zero_point_from_max;
+
+  // Now we need to nudge the zero point to be an integer
+  // (our zero points are integer, and this is motivated by the requirement
+  // to be able to represent the real value "0" exactly as a quantized value,
+  // which is required in multiple places, for example in Im2col with SAME
+  // padding).
+  T nudged_zero_point = 0;
+  if (zero_point_double < qmin_double) {
+    nudged_zero_point = qmin;
+  } else if (zero_point_double > qmax_double) {
+    nudged_zero_point = qmax;
+  } else {
+    nudged_zero_point = static_cast<T>(round(zero_point_double));
+  }
+  // The zero point should always be in the range of quantized values,
+  // [qmin, qmax].
+  TFLITE_CHECK_GE(nudged_zero_point, qmin);
+  TFLITE_CHECK_LE(nudged_zero_point, qmax);
+
+  // Finally, store the resulting nudged quantization params.
+  QuantizationParams quantization_params;
+  quantization_params.zero_point = nudged_zero_point;
+  quantization_params.scale = scale;
+  return quantization_params;
+}
+
 // Decompose a double multiplier into a Q0.31 int32 representation of its
 // significand, and shift representation of NEGATIVE its exponent ---
 // this is intended as a RIGHT-shift.
@@ -63,4 +141,4 @@ int CalculateInputRadius(int input_integer_bits, int input_left_shift);
 
 }  // namespace tflite
 
-#endif  // PHOTOS_VISION_LEARNING_TENSORFLOW_MINI_QUANTIZATION_UTIL_H_
+#endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_QUANTIZATION_UTIL_H_
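Reviewer note: the scale and zero-point selection added in this header can be tried outside the TensorFlow tree. The following is a minimal standalone sketch, assuming plain `assert` in place of the `TFLITE_CHECK_*` macros; `SketchQuantParams`, `SketchChooseQuantizationParams` and `SketchDequantize` are hypothetical names, and the struct only approximates `QuantizationParams`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for tflite::QuantizationParams.
struct SketchQuantParams {
  int32_t zero_point;
  double scale;
};

// Mirrors the steps of ChooseQuantizationParams: pick the scale from the
// range widths, then pick the zero point from whichever endpoint pair gives
// the smaller estimated rounding error, and nudge it to an integer.
template <typename T>
SketchQuantParams SketchChooseQuantizationParams(double rmin, double rmax) {
  const double qmin = std::numeric_limits<T>::min();
  const double qmax = std::numeric_limits<T>::max();
  // The real range is assumed to contain 0.
  assert(rmin <= 0.0 && rmax >= 0.0);
  if (rmin == rmax) return {0, 0.0};  // Degenerate range {0}.

  const double scale = (rmax - rmin) / (qmax - qmin);
  const double from_min = qmin - rmin / scale;
  const double from_max = qmax - rmax / scale;
  const double err_min = std::abs(qmin) + std::abs(rmin / scale);
  const double err_max = std::abs(qmax) + std::abs(rmax / scale);
  const double zero_point = err_min < err_max ? from_min : from_max;

  // Clamp to [qmin, qmax], then round to the nearest integer.
  const double clamped = std::min(std::max(zero_point, qmin), qmax);
  return {static_cast<int32_t>(std::round(clamped)), scale};
}

// Maps a quantized value back to its real value.
inline double SketchDequantize(const SketchQuantParams& qp, int32_t q) {
  return qp.scale * (q - qp.zero_point);
}
```

For the range [-10, 30] with `uint8_t`, the scale is 40 / 255 and the unrounded zero point is 63.75, which is nudged to 64, so the real value 0 is represented exactly by the quantized value 64.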
diff --git a/tensorflow/contrib/lite/kernels/internal/quantization_util_test.cc b/tensorflow/contrib/lite/kernels/internal/quantization_util_test.cc
index 19b1b408ec74b0939065b0ad10b91ecfc2cd4765..4ae2085c30ed08790eaff27c7921909d47687707 100644
--- a/tensorflow/contrib/lite/kernels/internal/quantization_util_test.cc
+++ b/tensorflow/contrib/lite/kernels/internal/quantization_util_test.cc
@@ -22,6 +22,51 @@ namespace {
 
 using ::testing::Pair;
 
+// Example taken from http://www.tensorflow.org/performance/quantization
+//
+//  Quantized | Float
+//  --------- | -----
+//  0         | -10.0
+//  255       | 30.0
+//  128       | 10.0
+TEST(QuantizationUtilTest, ChooseQuantizationParams) {
+  QuantizationParams qp = ChooseQuantizationParams<uint8>(-10.0, 30.0);
+  EXPECT_NEAR(qp.scale, 0.156863, 1e-5);
+  EXPECT_EQ(qp.zero_point, 64);
+}
+
+TEST(QuantizationUtilTest, ChooseQuantizationParamsZeroPointOnMinBoundary) {
+  QuantizationParams qp = ChooseQuantizationParams<uint8>(0.0, 30.0);
+  EXPECT_NEAR(qp.scale, 0.117647, 1e-5);
+  EXPECT_EQ(qp.zero_point, 0);
+}
+
+TEST(QuantizationUtilTest, ChooseQuantizationParamsZeroNotInRange) {
+  // Assumption is that zero is within the range.
+  EXPECT_DEATH(ChooseQuantizationParams<uint8>(10.0, 30.0), "");
+}
+
+TEST(QuantizationUtilTest, ChooseQuantizationParamsEmptyRangePositive) {
+  // Assumption is that zero is within the range.
+  EXPECT_DEATH(ChooseQuantizationParams<uint8>(30.0, 30.0), "");
+}
+
+TEST(QuantizationUtilTest, ChooseQuantizationParamsEmptyRangeZero) {
+  QuantizationParams qp = ChooseQuantizationParams<uint8>(0.0, 0.0);
+  EXPECT_NEAR(qp.scale, 0.0, 1e-5);
+  EXPECT_EQ(qp.zero_point, 0);
+}
+
+TEST(QuantizationUtilTest, ChooseQuantizationParamsZeroPointOnMaxBoundary) {
+  QuantizationParams qp = ChooseQuantizationParams<uint8>(-10.0, 0.0);
+  EXPECT_NEAR(qp.scale, 0.039216, 1e-5);
+  EXPECT_EQ(qp.zero_point, 255);
+}
+
+TEST(QuantizationUtilTest, ChooseQuantizationParamsInvalidRange) {
+  EXPECT_DEATH(ChooseQuantizationParams<uint8>(10.0, -30.0), "");
+}
+
 TEST(QuantizationUtilTest, QuantizeMultiplierSmallerThanOne) {
   auto quantize = [](double d) {
     int32_t q;
diff --git a/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h b/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
index d482dc9e3248a48fa90449c37bd36637ae46e39f..f31420b9836c68d2a7ed128fff2d131267519203 100644
--- a/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
+++ b/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
@@ -552,6 +552,55 @@ inline void FullyConnected(const uint8* input_data, const Dims<4>& input_dims,
   }
 }
 
+inline void FullyConnected(const uint8* input_data, const Dims<4>& input_dims,
+                           int32 input_offset, const uint8* filter_data,
+                           const Dims<4>& filter_dims, int32 filter_offset,
+                           const int32* bias_data, const Dims<4>& bias_dims,
+                           int32 output_offset, int32 output_multiplier,
+                           int output_shift, int32 output_activation_min,
+                           int32 output_activation_max, int16* output_data,
+                           const Dims<4>& output_dims,
+                           gemmlowp::GemmContext* gemm_context) {
+  (void)gemm_context;  // only used in optimized code.
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  TFLITE_DCHECK_EQ(output_offset, 0);
+  // TODO(benoitjacob): This really should be:
+  //     const int batches = ArraySize(output_dims, 1);
+  // but the current --variable_batch hack consists in overwriting the 3rd
+  // dimension with the runtime batch size, as we don't keep track for each
+  // array of which dimension is the batch dimension in it.
+  const int batches = ArraySize(output_dims, 1) * ArraySize(output_dims, 2) *
+                      ArraySize(output_dims, 3);
+  const int output_depth = MatchingArraySize(filter_dims, 1, output_dims, 0);
+  const int accum_depth = ArraySize(filter_dims, 0);
+  TFLITE_DCHECK(IsPackedWithoutStrides(input_dims));
+  TFLITE_DCHECK(IsPackedWithoutStrides(filter_dims));
+  for (int b = 0; b < batches; ++b) {
+    for (int out_c = 0; out_c < output_depth; ++out_c) {
+      // Internal accumulation.
+      // Initialize accumulator with the bias-value.
+      int32 accum = bias_data[out_c];
+      // Accumulation loop.
+      for (int d = 0; d < accum_depth; ++d) {
+        int16 input_val = input_data[b * accum_depth + d] + input_offset;
+        int16 filter_val = filter_data[out_c * accum_depth + d] + filter_offset;
+        accum += filter_val * input_val;
+      }
+      // Down-scale the final int32 accumulator to the scale used by our
+      // (16-bit, typically 3 integer bits) fixed-point format. The quantized
+      // multiplier and shift here have been pre-computed offline
+      // (e.g. by toco).
+      accum = MultiplyByQuantizedMultiplier(accum, output_multiplier,
+                                            -output_shift);
+      // Saturate, cast to int16, and store to output array.
+      accum = std::max(accum, output_activation_min - output_offset);
+      accum = std::min(accum, output_activation_max - output_offset);
+      accum += output_offset;
+      output_data[out_c + output_depth * b] = accum;
+    }
+  }
+}
+
 // legacy, for compatibility with old checked-in code
 template <FusedActivationFunctionType Ac>
 void FullyConnected(const uint8* input_data, const Dims<4>& input_dims,
@@ -904,6 +953,36 @@ inline void Add(int left_shift, const uint8* input1_data,
   }
 }
 
+template <FusedActivationFunctionType Ac>
+inline void Add(const int16* input1_data, const Dims<4>& input1_dims,
+                int input1_shift, const int16* input2_data,
+                const Dims<4>& input2_dims, int input2_shift,
+                int16* output_data, const Dims<4>& output_dims) {
+  static_assert(Ac == FusedActivationFunctionType::kNone, "");
+
+  const int flat_size = RequiredBufferSizeForDims(output_dims);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input1_dims), flat_size);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input2_dims), flat_size);
+
+  TFLITE_DCHECK(input1_shift == 0 || input2_shift == 0);
+  TFLITE_DCHECK_GE(input1_shift, 0);
+  TFLITE_DCHECK_GE(input2_shift, 0);
+  const int16* not_shift_input = input1_shift == 0 ? input1_data : input2_data;
+  const int16* shift_input = input1_shift == 0 ? input2_data : input1_data;
+  const int input_shift = input1_shift == 0 ? input2_shift : input1_shift;
+
+  for (int i = 0; i < flat_size; i++) {
+    // F0 uses 0 integer bits, range [-1, 1].
+    using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
+
+    F0 input_ready_scaled = F0::FromRaw(not_shift_input[i]);
+    F0 scaled_input =
+        F0::FromRaw(gemmlowp::RoundingDivideByPOT(shift_input[i], input_shift));
+    F0 result = gemmlowp::SaturatingAdd(scaled_input, input_ready_scaled);
+    output_data[i] = result.raw();
+  }
+}
+
 // TODO(jiawen): We can implement BroadcastAdd on buffers of arbitrary
 // dimensionality if the runtime code does a single loop over one dimension
 // that handles broadcasting as the base case. The code generator would then
@@ -1185,6 +1264,53 @@ inline void BroadcastMul(const uint8* input1_data, const Dims<4>& input1_dims,
   }
 }
 
+inline void Mul(const int16* input1_data, const Dims<4>& input1_dims,
+                const int16* input2_data, const Dims<4>& input2_dims,
+                int16* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("Mul/Int16");
+
+  const int flat_size = RequiredBufferSizeForDims(output_dims);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input1_dims), flat_size);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input2_dims), flat_size);
+
+  for (int i = 0; i < flat_size; i++) {
+    // F0 uses 0 integer bits, range [-1, 1].
+    using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
+
+    F0 unclamped_result =
+        F0::FromRaw(input1_data[i]) * F0::FromRaw(input2_data[i]);
+    output_data[i] = unclamped_result.raw();
+  }
+}
+
+inline void Mul(const int16* input1_data, const Dims<4>& input1_dims,
+                const int16* input2_data, const Dims<4>& input2_dims,
+                int32 output_offset, int32 output_activation_min,
+                int32 output_activation_max, uint8* output_data,
+                const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("Mul/Int16Uint8");
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+
+  const int flat_size = RequiredBufferSizeForDims(output_dims);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input1_dims), flat_size);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input2_dims), flat_size);
+
+  for (int i = 0; i < flat_size; i++) {
+    // F0 uses 0 integer bits, range [-1, 1].
+    using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
+
+    F0 unclamped_result =
+        F0::FromRaw(input1_data[i]) * F0::FromRaw(input2_data[i]);
+    int16 rescaled_result =
+        gemmlowp::RoundingDivideByPOT(unclamped_result.raw(), 8);
+    int16 clamped_result =
+        std::min<int16>(output_activation_max - output_offset, rescaled_result);
+    clamped_result =
+        std::max<int16>(output_activation_min - output_offset, clamped_result);
+    output_data[i] = output_offset + clamped_result;
+  }
+}
+
 // legacy, for compatibility with old checked-in code
 template <FusedActivationFunctionType Ac>
 inline void BroadcastMul(const uint8* input1_data, const Dims<4>& input1_dims,
@@ -1200,6 +1326,47 @@ inline void BroadcastMul(const uint8* input1_data, const Dims<4>& input1_dims,
                output_data, output_dims);
 }
 
+// TODO(jiawen): We can implement BroadcastDiv on buffers of arbitrary
+// dimensionality if the runtime code does a single loop over one dimension
+// that handles broadcasting as the base case. The code generator would then
+// generate max(D1, D2) nested for loops.
+template <typename T>
+void BroadcastDiv(const T* input1_data, const Dims<4>& input1_dims,
+                  const T* input2_data, const Dims<4>& input2_dims,
+                  T output_activation_min, T output_activation_max,
+                  T* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastDiv");
+
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+
+  // In Tensorflow, the dimensions are canonically named (batch_number, row,
+  // col, channel), with extents (batches, height, width, depth), with the
+  // trailing dimension changing most rapidly (channels has the smallest
+  // stride, typically 1 element).
+  //
+  // In generated C code, we store arrays with the dimensions reversed. The
+  // first dimension has smallest stride.
+  //
+  // We name our variables by their Tensorflow convention, but generate C code
+  // nesting loops such that the innermost loop has the smallest stride for
+  // the best cache behavior.
+  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
+    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
+      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
+        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
+          output_data[Offset(output_dims, c, x, y, b)] =
+              ActivationFunctionWithMinMax(
+                  input1_data[SubscriptToIndex(desc1, c, x, y, b)] /
+                      input2_data[SubscriptToIndex(desc2, c, x, y, b)],
+                  output_activation_min, output_activation_max);
+        }
+      }
+    }
+  }
+}
+
 inline void Div(const float* input1_data, const Dims<4>& input1_dims,
                 const float* input2_data, const Dims<4>& input2_dims,
                 float output_activation_min, float output_activation_max,
@@ -1254,6 +1421,106 @@ inline void Sub(const float* input1_data, const Dims<4>& input1_dims,
   }
 }
 
+// TODO(jiawen): We can implement BroadcastSub on buffers of arbitrary
+// dimensionality if the runtime code does a single loop over one dimension
+// that handles broadcasting as the base case. The code generator would then
+// generate max(D1, D2) nested for loops.
+template <typename T>
+void BroadcastSub(const T* input1_data, const Dims<4>& input1_dims,
+                  const T* input2_data, const Dims<4>& input2_dims,
+                  T output_activation_min, T output_activation_max,
+                  T* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastSub");
+
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+
+  // In Tensorflow, the dimensions are canonically named (batch_number, row,
+  // col, channel), with extents (batches, height, width, depth), with the
+  // trailing dimension changing most rapidly (channels has the smallest stride,
+  // typically 1 element).
+  //
+  // In generated C code, we store arrays with the dimensions reversed. The
+  // first dimension has smallest stride.
+  //
+  // We name our variables by their Tensorflow convention, but generate C code
+  // nesting loops such that the innermost loop has the smallest stride for the
+  // best cache behavior.
+  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
+    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
+      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
+        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
+          output_data[Offset(output_dims, c, x, y, b)] =
+              ActivationFunctionWithMinMax(
+                  input1_data[SubscriptToIndex(desc1, c, x, y, b)] -
+                      input2_data[SubscriptToIndex(desc2, c, x, y, b)],
+                  output_activation_min, output_activation_max);
+        }
+      }
+    }
+  }
+}
+
+inline void BroadcastSub(int left_shift, const uint8* input1_data,
+                         const Dims<4>& input1_dims, int32 input1_offset,
+                         int32 input1_multiplier, int input1_shift,
+                         const uint8* input2_data, const Dims<4>& input2_dims,
+                         int32 input2_offset, int32 input2_multiplier,
+                         int input2_shift, int32 output_offset,
+                         int32 output_multiplier, int output_shift,
+                         int32 output_activation_min,
+                         int32 output_activation_max, uint8* output_data,
+                         const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastSub/8bit");
+
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+
+  // In Tensorflow, the dimensions are canonically named (batch_number, row,
+  // col, channel), with extents (batches, height, width, depth), with the
+  // trailing dimension changing most rapidly (channels has the smallest stride,
+  // typically 1 element).
+  //
+  // In generated C code, we store arrays with the dimensions reversed. The
+  // first dimension has smallest stride.
+  //
+  // We name our variables by their Tensorflow convention, but generate C code
+  // nesting loops such that the innermost loop has the smallest stride for the
+  // best cache behavior.
+  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
+    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
+      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
+        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
+          const int32 input1_val =
+              input1_offset + input1_data[SubscriptToIndex(desc1, c, x, y, b)];
+          const int32 input2_val =
+              input2_offset + input2_data[SubscriptToIndex(desc2, c, x, y, b)];
+          const int32 shifted_input1_val = input1_val * (1 << left_shift);
+          const int32 shifted_input2_val = input2_val * (1 << left_shift);
+          const int32 scaled_input1_val =
+              MultiplyByQuantizedMultiplierSmallerThanOne(
+                  shifted_input1_val, input1_multiplier, input1_shift);
+          const int32 scaled_input2_val =
+              MultiplyByQuantizedMultiplierSmallerThanOne(
+                  shifted_input2_val, input2_multiplier, input2_shift);
+          const int32 raw_sub = scaled_input1_val - scaled_input2_val;
+          const int32 raw_output =
+              MultiplyByQuantizedMultiplierSmallerThanOne(
+                  raw_sub, output_multiplier, output_shift) +
+              output_offset;
+          const int32 clamped_output =
+              std::min(output_activation_max,
+                       std::max(output_activation_min, raw_output));
+          output_data[Offset(output_dims, c, x, y, b)] =
+              static_cast<uint8>(clamped_output);
+        }
+      }
+    }
+  }
+}
+
 template <FusedActivationFunctionType Ac, typename Scalar>
 void Concatenation(int concat_dim, const Scalar* const* input_data,
                    const Dims<4>* const* input_dims, int inputs_count,
@@ -2318,11 +2585,13 @@ inline void Logistic(const uint8* input_data, const Dims<4>& input_dims,
             const FixedPoint4 input_val_f4 =
                 FixedPoint4::FromRaw(input_val_rescaled);
             const FixedPoint0 output_val_f0 = gemmlowp::logistic(input_val_f4);
+            // Convert from Q0.31 to Q23.8.
             using gemmlowp::RoundingDivideByPOT;
             int32 output_val_s32 = RoundingDivideByPOT(output_val_f0.raw(), 23);
             if (output_val_s32 == 256) {
               output_val_s32 = 255;
             }
+            // Reinterpret as U0.8.
             TFLITE_DCHECK_GE(output_val_s32, 0);
             TFLITE_DCHECK_LE(output_val_s32, 255);
             output_val = static_cast<uint8>(output_val_s32);
@@ -2334,6 +2603,25 @@ inline void Logistic(const uint8* input_data, const Dims<4>& input_dims,
   }
 }
 
+inline void Logistic(const int16* input_data, const Dims<4>& input_dims,
+                     int16* output_data, const Dims<4>& output_dims) {
+  const int flat_size = RequiredBufferSizeForDims(output_dims);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input_dims), flat_size);
+
+  for (int i = 0; i < flat_size; i++) {
+    // F0 uses 0 integer bits, range [-1, 1].
+    // This is the return type of math functions such as tanh, logistic,
+    // whose range is in [-1, 1].
+    using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
+    // F3 uses 3 integer bits, range [-8, 8], the input range expected here.
+    using F3 = gemmlowp::FixedPoint<std::int16_t, 3>;
+
+    const F3 input = F3::FromRaw(input_data[i]);
+    F0 output = gemmlowp::logistic(input);
+    output_data[i] = output.raw();
+  }
+}
+
 inline void Tanh(const float* input_data, const Dims<4>& input_dims,
                  float* output_data, const Dims<4>& output_dims) {
   const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
@@ -2383,13 +2671,14 @@ inline void Tanh(const uint8* input_data, const Dims<4>& input_dims,
             const FixedPoint4 input_val_f4 =
                 FixedPoint4::FromRaw(input_val_rescaled);
             const FixedPoint0 output_val_f0 = gemmlowp::tanh(input_val_f4);
-
+            // Convert from Q0.31 to Q24.7.
             using gemmlowp::RoundingDivideByPOT;
             int32 output_val_s32 = RoundingDivideByPOT(output_val_f0.raw(), 24);
             output_val_s32 += output_zero_point;
             if (output_val_s32 == 256) {
               output_val_s32 = 255;
             }
+            // Reinterpret as Q0.7, encoded in uint8.
             TFLITE_DCHECK_GE(output_val_s32, 0);
             TFLITE_DCHECK_LE(output_val_s32, 255);
             output_val = static_cast<uint8>(output_val_s32);
@@ -2401,6 +2690,40 @@ inline void Tanh(const uint8* input_data, const Dims<4>& input_dims,
   }
 }
 
+inline void Tanh(const int16* input_data, const Dims<4>& input_dims,
+                 int input_left_shift, int16* output_data,
+                 const Dims<4>& output_dims) {
+  // Support for shifts is limited until we have a parameterized version of
+  // SaturatingRoundingMultiplyByPOT().
+  TFLITE_DCHECK_GE(input_left_shift, 0);
+  TFLITE_DCHECK_LE(input_left_shift, 1);
+
+  const int flat_size = RequiredBufferSizeForDims(output_dims);
+  TFLITE_DCHECK_EQ(RequiredBufferSizeForDims(input_dims), flat_size);
+
+  // F0 uses 0 integer bits, range [-1, 1].
+  // This is the return type of math functions such as tanh, logistic,
+  // whose range is in [-1, 1].
+  using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
+  // F3 uses 3 integer bits, range [-8, 8], the input range expected here.
+  using F3 = gemmlowp::FixedPoint<std::int16_t, 3>;
+
+  if (input_left_shift == 0) {
+    for (int i = 0; i < flat_size; i++) {
+      F3 input = F3::FromRaw(input_data[i]);
+      F0 output = gemmlowp::tanh(input);
+      output_data[i] = output.raw();
+    }
+  } else {
+    for (int i = 0; i < flat_size; i++) {
+      F3 input = F3::FromRaw(
+          gemmlowp::SaturatingRoundingMultiplyByPOT<1>(input_data[i]));
+      F0 output = gemmlowp::tanh(input);
+      output_data[i] = output.raw();
+    }
+  }
+}
+
 inline void Dequantize(const uint8* input_data, const Dims<4>& input_dims,
                        int32 zero_point, double scale, float* output_data,
                        const Dims<4>& output_dims) {
@@ -3109,6 +3432,67 @@ void Transpose(const T* input, const Dims<4>& input_dims, T* output,
   }
 }
 
+inline void TransposeConv(const float* input_data, const Dims<4>& input_dims,
+                          const float* filter_data, const Dims<4>& filter_dims,
+                          int stride_width, int stride_height, int pad_width,
+                          int pad_height, float* output_data,
+                          const Dims<4>& output_dims) {
+  const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
+  const int input_depth = MatchingArraySize(input_dims, 0, filter_dims, 3);
+  const int output_depth = MatchingArraySize(filter_dims, 0, output_dims, 0);
+  const int input_height = ArraySize(input_dims, 2);
+  const int input_width = ArraySize(input_dims, 1);
+  const int filter_height = ArraySize(filter_dims, 2);
+  const int filter_width = ArraySize(filter_dims, 1);
+  const int output_height = ArraySize(output_dims, 2);
+  const int output_width = ArraySize(output_dims, 1);
+
+  // Although transpose convolution simplifies to convolution with transposed
+  // weights for strides of 1, non-unitary striding complicates matters. To
+  // keep this reference implementation as clear as possible, we use a "scatter"
+  // access pattern, where we loop through all the input elements, computing
+  // their influence on the output, rather than looping through the output
+  // elements in the typical "gather" access pattern of a conv. We therefore
+  // must initialize the output array to zero.
+  for (int i = 0; i < RequiredBufferSizeForDims(output_dims); i++) {
+    output_data[i] = 0.0f;
+  }
+
+  // Loop through input elements one at a time.
+  for (int batch = 0; batch < batches; ++batch) {
+    for (int in_y = 0; in_y < input_height; ++in_y) {
+      for (int in_x = 0; in_x < input_width; ++in_x) {
+        for (int in_channel = 0; in_channel < input_depth; ++in_channel) {
+          // Loop through the output elements it will influence
+          const int out_x_origin = (in_x * stride_width) - pad_width;
+          const int out_y_origin = (in_y * stride_height) - pad_height;
+          for (int filter_y = 0; filter_y < filter_height; ++filter_y) {
+            for (int filter_x = 0; filter_x < filter_width; ++filter_x) {
+              for (int out_channel = 0; out_channel < output_depth;
+                   ++out_channel) {
+                // Compute output element location
+                const int out_x = out_x_origin + filter_x;
+                const int out_y = out_y_origin + filter_y;
+                // We cannot accumulate out of bounds
+                if ((out_x >= 0) && (out_x < output_width) && (out_y >= 0) &&
+                    (out_y < output_height)) {
+                  float input_value = input_data[Offset(input_dims, in_channel,
+                                                        in_x, in_y, batch)];
+                  float filter_value =
+                      filter_data[Offset(filter_dims, out_channel, filter_x,
+                                         filter_y, in_channel)];
+                  output_data[Offset(output_dims, out_channel, out_x, out_y,
+                                     batch)] += input_value * filter_value;
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+}
+
 }  // namespace reference_ops
 }  // namespace tflite
 
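The "scatter" access pattern described in the TransposeConv comment above can be sketched in isolation. This is a minimal illustration, not the TensorFlow Lite kernel: it assumes a single batch and a single channel, row-major buffers, and one stride and padding value shared by both axes. The name `TransposeConv2DSketch` and its parameter list are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of TensorFlow Lite): single-batch,
// single-channel transpose convolution over row-major 2-D buffers.
std::vector<float> TransposeConv2DSketch(
    const std::vector<float>& input, int input_height, int input_width,
    const std::vector<float>& filter, int filter_height, int filter_width,
    int stride, int pad, int output_height, int output_width) {
  assert(static_cast<int>(input.size()) == input_height * input_width);
  assert(static_cast<int>(filter.size()) == filter_height * filter_width);
  // Scatter accumulates with +=, so the output must start at zero.
  std::vector<float> output(output_height * output_width, 0.0f);
  // Visit each input element once and add its contribution to every
  // output element it influences.
  for (int in_y = 0; in_y < input_height; ++in_y) {
    for (int in_x = 0; in_x < input_width; ++in_x) {
      const int out_y_origin = in_y * stride - pad;
      const int out_x_origin = in_x * stride - pad;
      for (int f_y = 0; f_y < filter_height; ++f_y) {
        for (int f_x = 0; f_x < filter_width; ++f_x) {
          const int out_y = out_y_origin + f_y;
          const int out_x = out_x_origin + f_x;
          // Contributions that fall outside the output are dropped.
          if (out_y < 0 || out_y >= output_height || out_x < 0 ||
              out_x >= output_width) {
            continue;
          }
          output[out_y * output_width + out_x] +=
              input[in_y * input_width + in_x] *
              filter[f_y * filter_width + f_x];
        }
      }
    }
  }
  return output;
}
```

With a 2x2 input, a 2x2 filter of ones, stride 2 and a 4x4 output, each input value is stamped into its own 2x2 block. With stride 1 and a 3x3 output the stamps overlap, and the centre element accumulates all four inputs.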
diff --git a/tensorflow/contrib/lite/kernels/internal/spectrogram.cc b/tensorflow/contrib/lite/kernels/internal/spectrogram.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4eddf7bf0a2cbca695dae20ba8ba56a9cd72e4ba
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/spectrogram.cc
@@ -0,0 +1,244 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/kernels/internal/spectrogram.h"
+
+#include <assert.h>
+#include <math.h>
+
+#include "third_party/fft2d/fft.h"
+
+namespace tflite {
+namespace internal {
+
+using std::complex;
+
+namespace {
+// Returns the default Hann window function for the spectrogram.
+void GetPeriodicHann(int window_length, std::vector<double>* window) {
+  // Some platforms don't have M_PI, so define a local constant here.
+  const double pi = std::atan(1) * 4;
+  window->resize(window_length);
+  for (int i = 0; i < window_length; ++i) {
+    (*window)[i] = 0.5 - 0.5 * cos((2 * pi * i) / window_length);
+  }
+}
+}  // namespace
+
+bool Spectrogram::Initialize(int window_length, int step_length) {
+  std::vector<double> window;
+  GetPeriodicHann(window_length, &window);
+  return Initialize(window, step_length);
+}
+
+inline int Log2Floor(uint n) {
+  if (n == 0) return -1;
+  int log = 0;
+  uint value = n;
+  for (int i = 4; i >= 0; --i) {
+    int shift = (1 << i);
+    uint x = value >> shift;
+    if (x != 0) {
+      value = x;
+      log += shift;
+    }
+  }
+  return log;
+}
+
+inline int Log2Ceiling(uint n) {
+  int floor = Log2Floor(n);
+  if (n == (n & ~(n - 1)))  // zero or a power of two
+    return floor;
+  else
+    return floor + 1;
+}
+
+inline uint NextPowerOfTwo(uint value) {
+  int exponent = Log2Ceiling(value);
+  // DCHECK_LT(exponent, std::numeric_limits<uint32>::digits);
+  return 1 << exponent;
+}
+
+bool Spectrogram::Initialize(const std::vector<double>& window,
+                             int step_length) {
+  window_length_ = window.size();
+  window_ = window;  // Copy window.
+  if (window_length_ < 2) {
+    // LOG(ERROR) << "Window length too short.";
+    initialized_ = false;
+    return false;
+  }
+
+  step_length_ = step_length;
+  if (step_length_ < 1) {
+    // LOG(ERROR) << "Step length must be positive.";
+    initialized_ = false;
+    return false;
+  }
+
+  fft_length_ = NextPowerOfTwo(window_length_);
+  // CHECK(fft_length_ >= window_length_);
+  output_frequency_channels_ = 1 + fft_length_ / 2;
+
+  // Allocate 2 more than what rdft needs, so we can rationalize the layout.
+  fft_input_output_.assign(fft_length_ + 2, 0.0);
+
+  int half_fft_length = fft_length_ / 2;
+  fft_double_working_area_.assign(half_fft_length, 0.0);
+  fft_integer_working_area_.assign(2 + static_cast<int>(sqrt(half_fft_length)),
+                                   0);
+  // Set flag element to ensure that the working areas are initialized
+  // on the first call to rdft.  It's redundant given the assign above,
+  // but keep it as a reminder.
+  fft_integer_working_area_[0] = 0;
+  input_queue_.clear();
+  samples_to_next_step_ = window_length_;
+  initialized_ = true;
+  return true;
+}
+
+template <class InputSample, class OutputSample>
+bool Spectrogram::ComputeComplexSpectrogram(
+    const std::vector<InputSample>& input,
+    std::vector<std::vector<complex<OutputSample>>>* output) {
+  if (!initialized_) {
+    // LOG(ERROR) << "ComputeComplexSpectrogram() called "
+    //           << "before successful call "
+    //           << "to Initialize().";
+    return false;
+  }
+  // CHECK(output);
+  output->clear();
+  int input_start = 0;
+  while (GetNextWindowOfSamples(input, &input_start)) {
+    // DCHECK_EQ(input_queue_.size(), window_length_);
+    ProcessCoreFFT();  // Processes input_queue_ to fft_input_output_.
+    // Add a new slice vector onto the output, to save new result to.
+    output->resize(output->size() + 1);
+    // Get a reference to the newly added slice to fill in.
+    auto& spectrogram_slice = output->back();
+    spectrogram_slice.resize(output_frequency_channels_);
+    for (int i = 0; i < output_frequency_channels_; ++i) {
+      // This will convert double to float if it needs to.
+      spectrogram_slice[i] = complex<OutputSample>(
+          fft_input_output_[2 * i], fft_input_output_[2 * i + 1]);
+    }
+  }
+  return true;
+}
+// Instantiate it four ways:
+template bool Spectrogram::ComputeComplexSpectrogram(
+    const std::vector<float>& input, std::vector<std::vector<complex<float>>>*);
+template bool Spectrogram::ComputeComplexSpectrogram(
+    const std::vector<double>& input,
+    std::vector<std::vector<complex<float>>>*);
+template bool Spectrogram::ComputeComplexSpectrogram(
+    const std::vector<float>& input,
+    std::vector<std::vector<complex<double>>>*);
+template bool Spectrogram::ComputeComplexSpectrogram(
+    const std::vector<double>& input,
+    std::vector<std::vector<complex<double>>>*);
+
+template <class InputSample, class OutputSample>
+bool Spectrogram::ComputeSquaredMagnitudeSpectrogram(
+    const std::vector<InputSample>& input,
+    std::vector<std::vector<OutputSample>>* output) {
+  if (!initialized_) {
+    // LOG(ERROR) << "ComputeSquaredMagnitudeSpectrogram() called before "
+    //           << "successful call to Initialize().";
+    return false;
+  }
+  // CHECK(output);
+  output->clear();
+  int input_start = 0;
+  while (GetNextWindowOfSamples(input, &input_start)) {
+    // DCHECK_EQ(input_queue_.size(), window_length_);
+    ProcessCoreFFT();  // Processes input_queue_ to fft_input_output_.
+    // Add a new slice vector onto the output, to save new result to.
+    output->resize(output->size() + 1);
+    // Get a reference to the newly added slice to fill in.
+    auto& spectrogram_slice = output->back();
+    spectrogram_slice.resize(output_frequency_channels_);
+    for (int i = 0; i < output_frequency_channels_; ++i) {
+      // Similar to the Complex case, except storing the norm.
+      // But the norm function is known to be a performance killer,
+      // so do it this way with explicit real and imaginary temps.
+      const double re = fft_input_output_[2 * i];
+      const double im = fft_input_output_[2 * i + 1];
+      // Which finally converts double to float if it needs to.
+      spectrogram_slice[i] = re * re + im * im;
+    }
+  }
+  return true;
+}
+// Instantiate it four ways:
+template bool Spectrogram::ComputeSquaredMagnitudeSpectrogram(
+    const std::vector<float>& input, std::vector<std::vector<float>>*);
+template bool Spectrogram::ComputeSquaredMagnitudeSpectrogram(
+    const std::vector<double>& input, std::vector<std::vector<float>>*);
+template bool Spectrogram::ComputeSquaredMagnitudeSpectrogram(
+    const std::vector<float>& input, std::vector<std::vector<double>>*);
+template bool Spectrogram::ComputeSquaredMagnitudeSpectrogram(
+    const std::vector<double>& input, std::vector<std::vector<double>>*);
+
+// Return true if a full window of samples is prepared; manage the queue.
+template <class InputSample>
+bool Spectrogram::GetNextWindowOfSamples(const std::vector<InputSample>& input,
+                                         int* input_start) {
+  auto input_it = input.begin() + *input_start;
+  int input_remaining = input.end() - input_it;
+  if (samples_to_next_step_ > input_remaining) {
+    // Copy in as many samples as are left and return false; no full window.
+    input_queue_.insert(input_queue_.end(), input_it, input.end());
+    *input_start += input_remaining;  // Increases it to input.size().
+    samples_to_next_step_ -= input_remaining;
+    return false;  // Not enough for a full window.
+  } else {
+    // Copy just enough into queue to make a new window, then trim the
+    // front off the queue to make it window-sized.
+    input_queue_.insert(input_queue_.end(), input_it,
+                        input_it + samples_to_next_step_);
+    *input_start += samples_to_next_step_;
+    input_queue_.erase(
+        input_queue_.begin(),
+        input_queue_.begin() + input_queue_.size() - window_length_);
+    // DCHECK_EQ(window_length_, input_queue_.size());
+    samples_to_next_step_ = step_length_;  // Be ready for next time.
+    return true;  // Yes, input_queue_ now contains exactly a window-full.
+  }
+}
+
+void Spectrogram::ProcessCoreFFT() {
+  for (int j = 0; j < window_length_; ++j) {
+    fft_input_output_[j] = input_queue_[j] * window_[j];
+  }
+  // Zero-pad the rest of the input buffer.
+  for (int j = window_length_; j < fft_length_; ++j) {
+    fft_input_output_[j] = 0.0;
+  }
+  const int kForwardFFT = 1;  // 1 means forward; -1 reverse.
+  // This real FFT is a fair amount faster than using cdft here.
+  rdft(fft_length_, kForwardFFT, &fft_input_output_[0],
+       &fft_integer_working_area_[0], &fft_double_working_area_[0]);
+  // Make rdft result look like cdft result;
+  // unpack the last real value from the first position's imag slot.
+  fft_input_output_[fft_length_] = fft_input_output_[1];
+  fft_input_output_[fft_length_ + 1] = 0;
+  fft_input_output_[1] = 0;
+}
+
+}  // namespace internal
+}  // namespace tflite
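The size bookkeeping in spectrogram.cc above (FFT length, channel count, and how many slices one call yields) can be worked through with a few small functions. This is a sketch under stated assumptions: the helper names are hypothetical, the power-of-two search uses a plain loop instead of the file's Log2Floor/Log2Ceiling pair, and the frame count assumes a freshly initialized object whose input queue is still empty.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest power of two >= value (loop-based sketch).
inline std::uint32_t NextPowerOfTwoSketch(std::uint32_t value) {
  assert(value <= 0x80000000u);  // Larger values have no 32-bit answer.
  std::uint32_t result = 1;
  while (result < value) result <<= 1;
  return result;
}

// Mirrors output_frequency_channels_ = 1 + fft_length_ / 2.
inline int OutputFrequencyChannelsSketch(int window_length) {
  assert(window_length >= 2);
  const std::uint32_t fft_length =
      NextPowerOfTwoSketch(static_cast<std::uint32_t>(window_length));
  return 1 + static_cast<int>(fft_length / 2);
}

// Number of slices produced from num_samples fed to a fresh object:
// the first slice needs a full window, each later one needs step_length
// additional samples.
inline int FramesFromFreshInputSketch(int num_samples, int window_length,
                                      int step_length) {
  assert(num_samples >= 0 && window_length >= 2 && step_length >= 1);
  if (num_samples < window_length) return 0;
  return 1 + (num_samples - window_length) / step_length;
}
```

For example, a 400-sample window is zero-padded to a 512-point FFT, giving 257 frequency channels, and 1000 input samples with a step of 160 yield 4 slices. Samples left over after the last slice stay in the queue for the next call.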
diff --git a/tensorflow/contrib/lite/kernels/internal/spectrogram.h b/tensorflow/contrib/lite/kernels/internal/spectrogram.h
new file mode 100644
index 0000000000000000000000000000000000000000..b77a68f7dfe6edb07ec4e5db540c673b2d6f6d6e
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/spectrogram.h
@@ -0,0 +1,110 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Class for generating spectrogram slices from a waveform.
+// Initialize() should be called before calls to other functions.  Once
+// Initialize() has been called and returned true, the Compute*() functions can
+// be called repeatedly with sequential input data (i.e. the first element of the
+// next input vector directly follows the last element of the previous input
+// vector). Whenever enough audio samples are buffered to produce a
+// new frame, it will be placed in output. Output is cleared on each
+// call to Compute*(). This class is thread-unsafe, and should only be
+// called from one thread at a time.
+// With the default parameters, the output of this class should be very
+// close to the results of the following MATLAB code:
+// overlap_samples = window_length_samples - step_samples;
+// window = hann(window_length_samples, 'periodic');
+// S = abs(spectrogram(audio, window, overlap_samples)).^2;
+
+#ifndef TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_SPECTROGRAM_H_
+#define TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_SPECTROGRAM_H_
+
+#include <complex>
+#include <deque>
+#include <vector>
+
+#include "third_party/fft2d/fft.h"
+
+namespace tflite {
+namespace internal {
+
+class Spectrogram {
+ public:
+  Spectrogram() : initialized_(false) {}
+  ~Spectrogram() {}
+
+  // Initializes the class with a given window length and step length
+  // (both in samples). Internally a Hann window is used as the window
+  // function. Returns true on success, after which calls to the Compute*()
+  // functions are possible. window_length must be greater than 1 and
+  // step_length must be greater than 0.
+  bool Initialize(int window_length, int step_length);
+
+  // Initialize with an explicit window instead of a length.
+  bool Initialize(const std::vector<double>& window, int step_length);
+
+  // Processes an arbitrary amount of audio data (contained in input)
+  // to yield complex spectrogram frames. After a successful call to
+  // Initialize(), this function may be called repeatedly with new input data
+  // each time.  The audio input is buffered internally, and the output
+  // vector is populated with as many temporally-ordered spectral slices
+  // as it is possible to generate from the input.  The output is cleared
+  // on each call before the new frames (if any) are added.
+  //
+  // The template parameters can be float or double.
+  template <class InputSample, class OutputSample>
+  bool ComputeComplexSpectrogram(
+      const std::vector<InputSample>& input,
+      std::vector<std::vector<std::complex<OutputSample>>>* output);
+
+  // This function works like the one above, but returns the power
+  // (the squared L2 norm, i.e. the squared magnitude) of each complex value.
+  template <class InputSample, class OutputSample>
+  bool ComputeSquaredMagnitudeSpectrogram(
+      const std::vector<InputSample>& input,
+      std::vector<std::vector<OutputSample>>* output);
+
+  // Return reference to the window function used internally.
+  const std::vector<double>& GetWindow() const { return window_; }
+
+  // Return the number of frequency channels in the spectrogram.
+  int output_frequency_channels() const { return output_frequency_channels_; }
+
+ private:
+  template <class InputSample>
+  bool GetNextWindowOfSamples(const std::vector<InputSample>& input,
+                              int* input_start);
+  void ProcessCoreFFT();
+
+  int fft_length_;
+  int output_frequency_channels_;
+  int window_length_;
+  int step_length_;
+  bool initialized_;
+  int samples_to_next_step_;
+
+  std::vector<double> window_;
+  std::vector<double> fft_input_output_;
+  std::deque<double> input_queue_;
+
+  // Working data areas for the FFT routines.
+  std::vector<int> fft_integer_working_area_;
+  std::vector<double> fft_double_working_area_;
+};
+
+}  // namespace internal
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_SPECTROGRAM_H_
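The header above says the default window corresponds to MATLAB's `hann(window_length_samples, 'periodic')`. A minimal sketch of that window follows. The function name `PeriodicHannSketch` is hypothetical, and returning the vector by value is a choice made for this example only.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: periodic Hann window of the given length.
// The periodic form divides by window_length; the symmetric form would
// divide by window_length - 1 instead.
std::vector<double> PeriodicHannSketch(int window_length) {
  assert(window_length >= 1);
  // Some platforms lack M_PI, so derive pi locally.
  const double pi = std::atan(1.0) * 4.0;
  std::vector<double> window(window_length);
  for (int i = 0; i < window_length; ++i) {
    window[i] = 0.5 - 0.5 * std::cos((2.0 * pi * i) / window_length);
  }
  return window;
}
```

For a length of 4 this gives approximately 0, 0.5, 1, 0.5. Only the first sample is zero, which is what lets consecutive overlapping windows tile without a doubled zero at the seams.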
diff --git a/tensorflow/contrib/lite/kernels/internal/types.h b/tensorflow/contrib/lite/kernels/internal/types.h
index afe131b06ec41201395e80aa5415fd7db990f8d4..293538fcbb6406d6065d8efd25adb3b163638c92 100644
--- a/tensorflow/contrib/lite/kernels/internal/types.h
+++ b/tensorflow/contrib/lite/kernels/internal/types.h
@@ -21,6 +21,22 @@ namespace tflite {
 
 enum class FusedActivationFunctionType : uint8 { kNone, kRelu6, kRelu1, kRelu };
 
+// Quantization parameters, determining the mapping of quantized values
+// to real values (i.e. determining how quantized values are mathematically
+// interpreted).
+//
+// The correspondence is as follows:
+//
+//   real_value = scale * (quantized_value - zero_point);
+//
+// In other words, zero_point designates which quantized value corresponds to
+// the real 0 value, and scale designates the difference between the real values
+// corresponding to consecutive quantized values differing by 1.
+struct QuantizationParams {
+  int32 zero_point = 0;
+  double scale = 0.0;
+};
+
 template <int N>
 struct Dims {
   int sizes[N];
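The mapping documented for QuantizationParams in this hunk, real_value = scale * (quantized_value - zero_point), can be exercised in both directions. This is a sketch, not TensorFlow Lite code: the struct and function names are hypothetical stand-ins, and the inverse direction assumes round-to-nearest followed by clamping to the uint8 range, which the hunk itself does not specify.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the struct added in this hunk.
struct QuantizationParamsSketch {
  std::int32_t zero_point = 0;
  double scale = 0.0;
};

// real_value = scale * (quantized_value - zero_point)
inline double DequantizeSketch(const QuantizationParamsSketch& params,
                               std::int32_t quantized_value) {
  return params.scale *
         (static_cast<double>(quantized_value) - params.zero_point);
}

// Inverse mapping. Rounding and clamping are assumptions of this sketch.
inline std::uint8_t QuantizeToUint8Sketch(
    const QuantizationParamsSketch& params, double real_value) {
  double q = std::round(real_value / params.scale) + params.zero_point;
  if (q < 0.0) q = 0.0;
  if (q > 255.0) q = 255.0;
  return static_cast<std::uint8_t>(q);
}
```

With zero_point = 128 and scale = 0.5, the quantized value 128 maps to real 0.0 and 130 maps to 1.0. Real values outside the representable range saturate at 0 or 255.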
diff --git a/tensorflow/contrib/lite/kernels/kernel_util.h b/tensorflow/contrib/lite/kernels/kernel_util.h
index 28f53b9fbbc5620f2fab5c73e40bed8af4af5f1e..2f407b5da31594335dba31b3057737e67a974057 100644
--- a/tensorflow/contrib/lite/kernels/kernel_util.h
+++ b/tensorflow/contrib/lite/kernels/kernel_util.h
@@ -53,13 +53,13 @@ inline TfLiteTensor* GetOptionalInputTensor(TfLiteContext* context,
 }
 
 // Determines whether tensor is constant.
-inline bool IsConstantTensor(TfLiteTensor* tensor) {
+inline bool IsConstantTensor(const TfLiteTensor* tensor) {
   return tensor->allocation_type == kTfLiteMmapRo;
 }
 
 // Determines whether tensor is dynamic. Note that a tensor can be non-const and
-// not dynamic. This function specificially checks for a dynamic tensor.
-inline bool IsDynamicTensor(TfLiteTensor* tensor) {
+// not dynamic. This function specifically checks for a dynamic tensor.
+inline bool IsDynamicTensor(const TfLiteTensor* tensor) {
   return tensor->allocation_type == kTfLiteDynamic;
 }
 
diff --git a/tensorflow/contrib/lite/kernels/lsh_projection.cc b/tensorflow/contrib/lite/kernels/lsh_projection.cc
index 5f73b56ed9790b216adc788490faebaabd2bc756..0ee35775d50b8750455572f789d7b92481655a95 100644
--- a/tensorflow/contrib/lite/kernels/lsh_projection.cc
+++ b/tensorflow/contrib/lite/kernels/lsh_projection.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-// LSH Projection projects an input to a bit vector via locality senstive
+// LSH Projection projects an input to a bit vector via locality sensitive
 // hashing.
 //
 // Options:
diff --git a/tensorflow/contrib/lite/kernels/lstm.cc b/tensorflow/contrib/lite/kernels/lstm.cc
index 6c06264d845c24e71647b6fd2374734be32383ef..8cf1165135bdb0d4669bb97fd2d98e3dc044b4d9 100644
--- a/tensorflow/contrib/lite/kernels/lstm.cc
+++ b/tensorflow/contrib/lite/kernels/lstm.cc
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/kernels/activation_functor.h"
+#include "tensorflow/contrib/lite/kernels/internal/kernel_utils.h"
 #include "tensorflow/contrib/lite/kernels/internal/tensor_utils.h"
 #include "tensorflow/contrib/lite/kernels/kernel_util.h"
 #include "tensorflow/contrib/lite/kernels/op_macros.h"
@@ -212,9 +213,9 @@ TfLiteStatus CheckInputTensorDimensions(TfLiteContext* context,
   // present.
   // 2) If projection weight is present, then projection bias is optional.
   // TODO(ghodrat): make sure this is correct.
-  const bool projecton_tensors_consistent =
+  const bool projection_tensors_consistent =
       ((projection_weights != nullptr) || (projection_bias == nullptr));
-  TF_LITE_ENSURE(context, projecton_tensors_consistent == true);
+  TF_LITE_ENSURE(context, projection_tensors_consistent == true);
 
   return kTfLiteOk;
 }
@@ -356,7 +357,7 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   const int n_output = recurrent_to_output_weights->dims->data[1];
 
   // Since we have already checked that weights are all there or none, we can
-  // check the existense of only one to the get the condition.
+  // check the existence of only one to get the condition.
   const bool use_cifg = (input_to_input_weights == nullptr);
   const bool use_peephole = (cell_to_output_weights != nullptr);
 
@@ -377,127 +378,54 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
     output_gate_scratch = scratch_buffer->data.f + 3 * n_cell * n_batch;
   }
 
-  // Initialize scratch buffers with bias.
-  if (!use_cifg) {
-    tensor_utils::VectorBatchVectorAssign(input_gate_bias->data.f, n_cell,
-                                          n_batch, input_gate_scratch);
-  }
-  tensor_utils::VectorBatchVectorAssign(forget_gate_bias->data.f, n_cell,
-                                        n_batch, forget_gate_scratch);
-  tensor_utils::VectorBatchVectorAssign(cell_bias->data.f, n_cell, n_batch,
-                                        cell_scratch);
-  tensor_utils::VectorBatchVectorAssign(output_gate_bias->data.f, n_cell,
-                                        n_batch, output_gate_scratch);
-
-  // For each batch and cell: compute input_weight * input.
-  if (!use_cifg) {
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        input_to_input_weights->data.f, n_cell, n_input, input->data.f, n_batch,
-        input_gate_scratch, /*result_stride=*/1);
-  }
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      input_to_forget_weights->data.f, n_cell, n_input, input->data.f, n_batch,
-      forget_gate_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      input_to_cell_weights->data.f, n_cell, n_input, input->data.f, n_batch,
-      cell_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      input_to_output_weights->data.f, n_cell, n_input, input->data.f, n_batch,
-      output_gate_scratch, /*result_stride=*/1);
-
-  // For each batch and cell: compute recurrent_weight * output_state.
-  if (!use_cifg) {
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        recurrent_to_input_weights->data.f, n_cell, n_output,
-        output_state->data.f, n_batch, input_gate_scratch, /*result_stride=*/1);
-  }
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      recurrent_to_forget_weights->data.f, n_cell, n_output,
-      output_state->data.f, n_batch, forget_gate_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      recurrent_to_cell_weights->data.f, n_cell, n_output, output_state->data.f,
-      n_batch, cell_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      recurrent_to_output_weights->data.f, n_cell, n_output,
-      output_state->data.f, n_batch, output_gate_scratch, /*result_stride=*/1);
-
-  // For each batch and cell: update input gate.
-  if (!use_cifg) {
-    if (use_peephole) {
-      tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-          cell_to_input_weights->data.f, n_cell, cell_state->data.f, n_batch,
-          input_gate_scratch);
-    }
-    tensor_utils::ApplySigmoidToVector(input_gate_scratch, n_cell * n_batch,
-                                       input_gate_scratch);
-  }
-
-  // For each batch and cell: update forget gate.
-  if (use_peephole) {
-    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-        cell_to_forget_weights->data.f, n_cell, cell_state->data.f, n_batch,
-        forget_gate_scratch);
-  }
-  tensor_utils::ApplySigmoidToVector(forget_gate_scratch, n_cell * n_batch,
-                                     forget_gate_scratch);
-
-  // For each batch and cell: update the cell.
-  tensor_utils::VectorVectorCwiseProduct(forget_gate_scratch,
-                                         cell_state->data.f, n_batch * n_cell,
-                                         cell_state->data.f);
-  tensor_utils::ApplyActivationToVector(cell_scratch, n_batch * n_cell,
-                                        params->activation, cell_scratch);
-  if (use_cifg) {
-    tensor_utils::Sub1Vector(forget_gate_scratch, n_batch * n_cell,
-                             forget_gate_scratch);
-    tensor_utils::VectorVectorCwiseProductAccumulate(
-        cell_scratch, forget_gate_scratch, n_batch * n_cell,
-        cell_state->data.f);
-  } else {
-    tensor_utils::VectorVectorCwiseProductAccumulate(
-        cell_scratch, input_gate_scratch, n_batch * n_cell, cell_state->data.f);
-  }
-  if (params->cell_clip > 0.0) {
-    tensor_utils::ClipVector(cell_state->data.f, n_batch * n_cell,
-                             params->cell_clip, cell_state->data.f);
-  }
-
-  // For each batch and cell: update the output gate.
-  if (use_peephole) {
-    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-        cell_to_output_weights->data.f, n_cell, cell_state->data.f, n_batch,
-        output_gate_scratch);
-  }
-  tensor_utils::ApplySigmoidToVector(output_gate_scratch, n_batch * n_cell,
-                                     output_gate_scratch);
-  tensor_utils::ApplyActivationToVector(cell_state->data.f, n_batch * n_cell,
-                                        params->activation, cell_scratch);
-  tensor_utils::VectorVectorCwiseProduct(output_gate_scratch, cell_scratch,
-                                         n_batch * n_cell, output_gate_scratch);
-
-  // For each batch: update the projection and output_state.
-  const bool use_projection_weight = (projection_weights != nullptr);
-  const bool use_projection_bias = (projection_bias != nullptr);
-  if (use_projection_weight) {
-    if (use_projection_bias) {
-      tensor_utils::VectorBatchVectorAssign(projection_bias->data.f, n_output,
-                                            n_batch, output->data.f);
-    } else {
-      tensor_utils::ZeroVector(output->data.f, n_batch * n_output);
-    }
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        projection_weights->data.f, n_output, n_cell, output_gate_scratch,
-        n_batch, output->data.f, /*result_stride=*/1);
-    if (params->proj_clip > 0.0) {
-      tensor_utils::ClipVector(output->data.f, n_batch * n_output,
-                               params->proj_clip, output->data.f);
-    }
-  } else {
-    tensor_utils::CopyVector(output_gate_scratch, n_batch * n_output,
-                             output->data.f);
-  }
-  tensor_utils::CopyVector(output->data.f, n_batch * n_output,
-                           output_state->data.f);
+  // Check optional tensors, the respective pointers can be null.
+  const float* input_to_input_weights_ptr =
+      (use_cifg) ? nullptr : input_to_input_weights->data.f;
+  const float* recurrent_to_input_weights_ptr =
+      (use_cifg) ? nullptr : recurrent_to_input_weights->data.f;
+  const float* input_gate_bias_ptr =
+      (use_cifg) ? nullptr : input_gate_bias->data.f;
+  const float* cell_to_input_weights_ptr =
+      (use_peephole && !use_cifg) ? cell_to_input_weights->data.f : nullptr;
+  const float* cell_to_forget_weights_ptr =
+      (use_peephole) ? cell_to_forget_weights->data.f : nullptr;
+  const float* cell_to_output_weights_ptr =
+      (use_peephole) ? cell_to_output_weights->data.f : nullptr;
+  const float* projection_weights_ptr =
+      (projection_weights == nullptr) ? nullptr : projection_weights->data.f;
+  const float* projection_bias_ptr =
+      (projection_bias == nullptr) ? nullptr : projection_bias->data.f;
+
+  // Required tensors, pointers are non-null.
+  const float* input_ptr_batch = input->data.f;
+  const float* input_to_forget_weights_ptr = input_to_forget_weights->data.f;
+  const float* input_to_cell_weights_ptr = input_to_cell_weights->data.f;
+  const float* input_to_output_weights_ptr = input_to_output_weights->data.f;
+  const float* recurrent_to_forget_weights_ptr =
+      recurrent_to_forget_weights->data.f;
+  const float* recurrent_to_cell_weights_ptr =
+      recurrent_to_cell_weights->data.f;
+  const float* recurrent_to_output_weights_ptr =
+      recurrent_to_output_weights->data.f;
+  const float* forget_gate_bias_ptr = forget_gate_bias->data.f;
+  const float* cell_bias_ptr = cell_bias->data.f;
+  const float* output_gate_bias_ptr = output_gate_bias->data.f;
+
+  float* output_state_ptr = output_state->data.f;
+  float* cell_state_ptr = cell_state->data.f;
+  float* output_ptr_batch = output->data.f;
+
+  kernel_utils::LstmStep(
+      input_ptr_batch, input_to_input_weights_ptr, input_to_forget_weights_ptr,
+      input_to_cell_weights_ptr, input_to_output_weights_ptr,
+      recurrent_to_input_weights_ptr, recurrent_to_forget_weights_ptr,
+      recurrent_to_cell_weights_ptr, recurrent_to_output_weights_ptr,
+      cell_to_input_weights_ptr, cell_to_forget_weights_ptr,
+      cell_to_output_weights_ptr, input_gate_bias_ptr, forget_gate_bias_ptr,
+      cell_bias_ptr, output_gate_bias_ptr, projection_weights_ptr,
+      projection_bias_ptr, params, n_batch, n_cell, n_input, n_output,
+      output_state_ptr, cell_state_ptr, input_gate_scratch, forget_gate_scratch,
+      cell_scratch, output_gate_scratch, output_ptr_batch);
 
   return kTfLiteOk;
 }
diff --git a/tensorflow/contrib/lite/kernels/mfcc.cc b/tensorflow/contrib/lite/kernels/mfcc.cc
new file mode 100644
index 0000000000000000000000000000000000000000..018db0dc54c5d281bf3fb3ff8a1f111b427fe76b
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/mfcc.cc
@@ -0,0 +1,154 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/kernels/internal/mfcc.h"
+#include "flatbuffers/flexbuffers.h"
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/mfcc_dct.h"
+#include "tensorflow/contrib/lite/kernels/internal/mfcc_mel_filterbank.h"
+#include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+
+namespace tflite {
+namespace ops {
+namespace custom {
+namespace mfcc {
+
+enum KernelType {
+  kReference,
+};
+
+typedef struct {
+  float upper_frequency_limit;
+  float lower_frequency_limit;
+  int filterbank_channel_count;
+  int dct_coefficient_count;
+} TfLiteMfccParams;
+
+constexpr int kInputTensorWav = 0;
+constexpr int kInputTensorRate = 1;
+constexpr int kOutputTensor = 0;
+
+void* Init(TfLiteContext* context, const char* buffer, size_t length) {
+  auto* data = new TfLiteMfccParams;
+
+  const uint8_t* buffer_t = reinterpret_cast<const uint8_t*>(buffer);
+
+  const flexbuffers::Map& m = flexbuffers::GetRoot(buffer_t, length).AsMap();
+  data->upper_frequency_limit = m["upper_frequency_limit"].AsInt64();
+  data->lower_frequency_limit = m["lower_frequency_limit"].AsInt64();
+  data->filterbank_channel_count = m["filterbank_channel_count"].AsInt64();
+  data->dct_coefficient_count = m["dct_coefficient_count"].AsInt64();
+  return data;
+}
+
+void Free(TfLiteContext* context, void* buffer) {
+  delete reinterpret_cast<TfLiteMfccParams*>(buffer);
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  auto* params = reinterpret_cast<TfLiteMfccParams*>(node->user_data);
+
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 2);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+
+  TfLiteTensor* inputWav = GetInput(context, node, kInputTensorWav);
+  TfLiteTensor* inputRate = GetInput(context, node, kInputTensorRate);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+
+  TF_LITE_ENSURE_EQ(context, NumDimensions(inputWav), 3);
+  TF_LITE_ENSURE_EQ(context, NumDimensions(inputRate), 1);
+
+  TF_LITE_ENSURE_EQ(context, output->type, kTfLiteFloat32);
+  TF_LITE_ENSURE_EQ(context, inputWav->type, output->type);
+
+  TfLiteIntArray* output_size = TfLiteIntArrayCreate(3);
+  output_size->data[0] = inputWav->dims->data[0];
+  output_size->data[1] = inputWav->dims->data[1];
+  output_size->data[2] = params->dct_coefficient_count;
+
+  return context->ResizeTensor(context, output, output_size);
+}
+
+// Input is a single squared-magnitude spectrogram frame. The input spectrum
+// is converted to linear magnitude and weighted into bands using a
+// triangular mel filterbank, and a discrete cosine transform (DCT) of the
+// values is taken. Output is populated with the lowest dct_coefficient_count
+// of these values.
+template <KernelType kernel_type>
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  auto* params = reinterpret_cast<TfLiteMfccParams*>(node->user_data);
+
+  TfLiteTensor* inputWav = GetInput(context, node, kInputTensorWav);
+  TfLiteTensor* inputRate = GetInput(context, node, kInputTensorRate);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+
+  const int32 sample_rate = *GetTensorData<int>(inputRate);
+
+  const int spectrogram_channels = inputWav->dims->data[2];
+  const int spectrogram_samples = inputWav->dims->data[1];
+  const int audio_channels = inputWav->dims->data[0];
+
+  internal::Mfcc mfcc;
+  mfcc.set_upper_frequency_limit(params->upper_frequency_limit);
+  mfcc.set_lower_frequency_limit(params->lower_frequency_limit);
+  mfcc.set_filterbank_channel_count(params->filterbank_channel_count);
+  mfcc.set_dct_coefficient_count(params->dct_coefficient_count);
+
+  mfcc.Initialize(spectrogram_channels, sample_rate);
+
+  const float* spectrogram_flat = GetTensorData<float>(inputWav);
+  float* output_flat = GetTensorData<float>(output);
+
+  for (int audio_channel = 0; audio_channel < audio_channels; ++audio_channel) {
+    for (int spectrogram_sample = 0; spectrogram_sample < spectrogram_samples;
+         ++spectrogram_sample) {
+      const float* sample_data =
+          spectrogram_flat +
+          (audio_channel * spectrogram_samples * spectrogram_channels) +
+          (spectrogram_sample * spectrogram_channels);
+      std::vector<double> mfcc_input(sample_data,
+                                     sample_data + spectrogram_channels);
+      std::vector<double> mfcc_output;
+      mfcc.Compute(mfcc_input, &mfcc_output);
+      TF_LITE_ENSURE_EQ(context, params->dct_coefficient_count,
+                        mfcc_output.size());
+      float* output_data = output_flat +
+                           (audio_channel * spectrogram_samples *
+                            params->dct_coefficient_count) +
+                           (spectrogram_sample * params->dct_coefficient_count);
+      for (int i = 0; i < params->dct_coefficient_count; ++i) {
+        output_data[i] = mfcc_output[i];
+      }
+    }
+  }
+
+  return kTfLiteOk;
+}
+
+}  // namespace mfcc
+
+TfLiteRegistration* Register_MFCC() {
+  static TfLiteRegistration r = {mfcc::Init, mfcc::Free, mfcc::Prepare,
+                                 mfcc::Eval<mfcc::kReference>};
+  return &r;
+}
+
+}  // namespace custom
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/mfcc_test.cc b/tensorflow/contrib/lite/kernels/mfcc_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..0291ca8c1c58ea6ab3bb7c22bc436ed3404cba74
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/mfcc_test.cc
@@ -0,0 +1,104 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <functional>
+#include <memory>
+#include <vector>
+
+#include <gtest/gtest.h>
+#include "flatbuffers/flexbuffers.h"
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace ops {
+namespace custom {
+
+TfLiteRegistration* Register_MFCC();
+
+namespace {
+
+using ::testing::ElementsAre;
+using ::testing::ElementsAreArray;
+
+class BaseMfccOpModel : public SingleOpModel {
+ public:
+  BaseMfccOpModel(const TensorData& input1, const TensorData& input2,
+                  const TensorData& output) {
+    input1_ = AddInput(input1);
+    input2_ = AddInput(input2);
+    output_ = AddOutput(output);
+
+    flexbuffers::Builder fbb;
+    fbb.Map([&]() {
+      fbb.Int("upper_frequency_limit", 4000);
+      fbb.Int("lower_frequency_limit", 20);
+      fbb.Int("filterbank_channel_count", 40);
+      fbb.Int("dct_coefficient_count", 13);
+    });
+    fbb.Finish();
+    SetCustomOp("Mfcc", fbb.GetBuffer(), Register_MFCC);
+
+    BuildInterpreter({GetShape(input1_), GetShape(input2_)});
+  }
+
+  int input1() { return input1_; }
+  int input2() { return input2_; }
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ protected:
+  int input1_;
+  int input2_;
+  int output_;
+};
+
+TEST(MfccOpTest, SimpleTest) {
+  BaseMfccOpModel m({TensorType_FLOAT32, {1, 1, 513}}, {TensorType_INT32, {1}},
+                    {TensorType_FLOAT32, {}});
+
+  std::vector<float> data(513);
+  for (int i = 0; i < data.size(); ++i) {
+    data[i] = i + 1;
+  }
+  m.PopulateTensor<float>(m.input1(), 0, data.data(),
+                          data.data() + data.size());
+  m.PopulateTensor<int>(m.input2(), {22050});
+
+  m.Invoke();
+
+  std::vector<int> output_shape = m.GetOutputShape();
+  EXPECT_THAT(output_shape, ElementsAre(1, 1, 13));
+  EXPECT_THAT(
+      m.GetOutput(),
+      ElementsAreArray(ArrayFloatNear(
+          {29.13970072, -6.41568601, -0.61903012, -0.96778652, -0.26819878,
+           -0.40907028, -0.15614748, -0.23203119, -0.10481487, -0.1543029,
+           -0.0769791, -0.10806114, -0.06047613},
+          1e-3)));
+}
+
+}  // namespace
+}  // namespace custom
+}  // namespace ops
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/register.cc b/tensorflow/contrib/lite/kernels/register.cc
index ec700e1e18880fbd13156b45c3c4a2efc8ceb6a4..0f98154b904b1f776016e6bbee3263027f815244 100644
--- a/tensorflow/contrib/lite/kernels/register.cc
+++ b/tensorflow/contrib/lite/kernels/register.cc
@@ -17,6 +17,14 @@ limitations under the License.
 
 namespace tflite {
 namespace ops {
+
+namespace custom {
+
+TfLiteRegistration* Register_AUDIO_SPECTROGRAM();
+TfLiteRegistration* Register_MFCC();
+
+}  // namespace custom
+
 namespace builtin {
 
 TfLiteRegistration* Register_RELU();
@@ -65,6 +73,9 @@ TfLiteRegistration* Register_STRIDED_SLICE();
 TfLiteRegistration* Register_EXP();
 TfLiteRegistration* Register_TOPK_V2();
 TfLiteRegistration* Register_LOG_SOFTMAX();
+TfLiteRegistration* Register_CAST();
+TfLiteRegistration* Register_DEQUANTIZE();
+TfLiteRegistration* Register_PRELU();
 TfLiteRegistration* Register_MAXIMUM();
 
 BuiltinOpResolver::BuiltinOpResolver() {
@@ -120,7 +131,16 @@ BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_EXP, Register_EXP());
   AddBuiltin(BuiltinOperator_TOPK_V2, Register_TOPK_V2());
   AddBuiltin(BuiltinOperator_LOG_SOFTMAX, Register_LOG_SOFTMAX());
+  AddBuiltin(BuiltinOperator_CAST, Register_CAST());
+  AddBuiltin(BuiltinOperator_DEQUANTIZE, Register_DEQUANTIZE());
+  AddBuiltin(BuiltinOperator_PRELU, Register_PRELU());
   AddBuiltin(BuiltinOperator_MAXIMUM, Register_MAXIMUM());
+
+  // TODO(andrewharp, ahentz): Move these somewhere more appropriate so that
+  // custom ops aren't always included by default.
+  AddCustom("Mfcc", tflite::ops::custom::Register_MFCC());
+  AddCustom("AudioSpectrogram",
+            tflite::ops::custom::Register_AUDIO_SPECTROGRAM());
 }
 
 TfLiteRegistration* BuiltinOpResolver::FindOp(
diff --git a/tensorflow/contrib/lite/kernels/reshape.cc b/tensorflow/contrib/lite/kernels/reshape.cc
index f3e6ddc9f480e3863cac52157ae28b7329ee2088..438f70d3115130efe477a3ceeccd2e77108c979a 100644
--- a/tensorflow/contrib/lite/kernels/reshape.cc
+++ b/tensorflow/contrib/lite/kernels/reshape.cc
@@ -49,20 +49,20 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
 
   TfLiteIntArray* output_size = TfLiteIntArrayCreate(params->num_dimensions);
   int num_output_elements = 1;
-  int strech_dim = -1;
+  int stretch_dim = -1;
   for (int i = 0; i < params->num_dimensions; ++i) {
     int value = params->shape[i];
     if (value == -1) {
-      TF_LITE_ENSURE_EQ(context, strech_dim, -1);
-      strech_dim = i;
+      TF_LITE_ENSURE_EQ(context, stretch_dim, -1);
+      stretch_dim = i;
     } else {
       num_output_elements *= value;
       output_size->data[i] = value;
     }
   }
-  if (strech_dim != -1) {
-    output_size->data[strech_dim] = num_input_elements / num_output_elements;
-    num_output_elements *= output_size->data[strech_dim];
+  if (stretch_dim != -1) {
+    output_size->data[stretch_dim] = num_input_elements / num_output_elements;
+    num_output_elements *= output_size->data[stretch_dim];
   }
 
   TF_LITE_ENSURE_EQ(context, num_input_elements, num_output_elements);
diff --git a/tensorflow/contrib/lite/kernels/reshape_test.cc b/tensorflow/contrib/lite/kernels/reshape_test.cc
index 0fbcf6e6aa311d2cac491336ee54ccf58bbda8fd..aecbd0399f7454045e8189072f45b695b0525204 100644
--- a/tensorflow/contrib/lite/kernels/reshape_test.cc
+++ b/tensorflow/contrib/lite/kernels/reshape_test.cc
@@ -60,7 +60,7 @@ TEST(ReshapeOpTest, TooManyDimensions) {
 
 TEST(ReshapeOpTest, TooManySpecialDimensions) {
   EXPECT_DEATH(ReshapeOpModel({1, 2, 4, 1}, {-1, -1, 2, 4}),
-               "strech_dim != -1");
+               "stretch_dim != -1");
 }
 
 TEST(ReshapeOpTest, SimpleTest) {
diff --git a/tensorflow/contrib/lite/kernels/strided_slice.cc b/tensorflow/contrib/lite/kernels/strided_slice.cc
index fb1e11e0ca00abb36d7f29d562711a7bbcbeca1c..eb374d903182f46b40f5c80bfd769a19a5594742 100644
--- a/tensorflow/contrib/lite/kernels/strided_slice.cc
+++ b/tensorflow/contrib/lite/kernels/strided_slice.cc
@@ -48,7 +48,7 @@ struct StridedSliceContext {
     output = GetOutput(context, node, kOutputTensor);
     dims = NumDimensions(input);
   }
-  TfLiteStridedSliceParams* params;
+  const TfLiteStridedSliceParams* params;
   TfLiteTensor* input;
   TfLiteTensor* begin;
   TfLiteTensor* end;
@@ -199,19 +199,17 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
     strides.emplace_back(1);
   }
 
-  op_context.params->begin_mask =
+  int begin_mask =
       ReverseMaskBits(op_context.params->begin_mask, op_context.dims);
-  op_context.params->end_mask =
-      ReverseMaskBits(op_context.params->end_mask, op_context.dims);
-  op_context.params->shrink_axis_mask =
+  int end_mask = ReverseMaskBits(op_context.params->end_mask, op_context.dims);
+  int shrink_axis_mask =
       ReverseMaskBits(op_context.params->shrink_axis_mask, op_context.dims);
 
-#define TF_LITE_STRIDED_SLICE(kernel_type, data_type)                      \
-  kernel_type::StridedSlice(                                               \
-      GetTensorData<data_type>(op_context.input),                          \
-      GetTensorDims(op_context.input), op_context.params->begin_mask,      \
-      op_context.params->end_mask, op_context.params->shrink_axis_mask,    \
-      starts, stops, strides, GetTensorData<data_type>(op_context.output), \
+#define TF_LITE_STRIDED_SLICE(kernel_type, data_type)                          \
+  kernel_type::StridedSlice(                                                   \
+      GetTensorData<data_type>(op_context.input),                              \
+      GetTensorDims(op_context.input), begin_mask, end_mask, shrink_axis_mask, \
+      starts, stops, strides, GetTensorData<data_type>(op_context.output),     \
       GetTensorDims(op_context.output))
 
   switch (op_context.input->type) {
diff --git a/tensorflow/contrib/lite/kernels/strided_slice_test.cc b/tensorflow/contrib/lite/kernels/strided_slice_test.cc
index 5cac04b38364958c5b0794c21742e8b592372ae9..5c98c5f43181fe75f35716dae5682113bde883ec 100644
--- a/tensorflow/contrib/lite/kernels/strided_slice_test.cc
+++ b/tensorflow/contrib/lite/kernels/strided_slice_test.cc
@@ -522,6 +522,28 @@ TEST(StridedSliceOpTest, In3D_IdentityShrinkAxis7) {
   EXPECT_TRUE(m.GetOutputShape().empty());
   EXPECT_THAT(m.GetOutput(), ElementsAreArray({1}));
 }
+
+// This test catches a very subtle bug that was fixed by cl/188403234.
+TEST(StridedSliceOpTest, RunTwice) {
+  StridedSliceOpModel m({2, 3}, {2}, {2}, {2}, 1, 0, 0, 0, 0);
+
+  auto setup_inputs = [&m]() {
+    m.SetInput({1, 2, 3, 4, 5, 6});
+    m.SetBegin({1, 0});
+    m.SetEnd({2, 2});
+    m.SetStrides({1, 1});
+  };
+
+  setup_inputs();
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({1, 2, 4, 5}));
+
+  setup_inputs();
+  m.Invoke();
+  // Prior to cl/188403234 this was {4, 5}.
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({1, 2, 4, 5}));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/sub.cc b/tensorflow/contrib/lite/kernels/sub.cc
index ddaf498d5bac0109429224e7cf66cb3debcabc22..66b06aeaec52dd3d2d98acfec8218ffdd0ae6bf3 100644
--- a/tensorflow/contrib/lite/kernels/sub.cc
+++ b/tensorflow/contrib/lite/kernels/sub.cc
@@ -26,7 +26,7 @@ namespace ops {
 namespace builtin {
 namespace sub {
 
-// This file has three implementation of Div.
+// This file has three implementations of Sub.
 enum KernelType {
   kReference,
   kGenericOptimized,  // Neon-free
@@ -37,7 +37,23 @@ constexpr int kInputTensor1 = 0;
 constexpr int kInputTensor2 = 1;
 constexpr int kOutputTensor = 0;
 
+struct OpData {
+  bool requires_broadcast;
+};
+
+void* Init(TfLiteContext* context, const char* buffer, size_t length) {
+  auto* data = new OpData;
+  data->requires_broadcast = false;
+  return data;
+}
+
+void Free(TfLiteContext* context, void* buffer) {
+  delete reinterpret_cast<OpData*>(buffer);
+}
+
 TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
+
   TF_LITE_ENSURE_EQ(context, NumInputs(node), 2);
   TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
 
@@ -45,49 +61,118 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
 
-  TF_LITE_ENSURE_EQ(context, NumDimensions(input1), NumDimensions(input2));
-  for (int i = 0; i < NumDimensions(input1); ++i) {
-    TF_LITE_ENSURE_EQ(context, SizeOfDimension(input1, i),
-                      SizeOfDimension(input2, i));
-  }
+  TF_LITE_ENSURE_EQ(context, input1->type, input2->type);
+  output->type = input2->type;
 
-  TF_LITE_ENSURE_EQ(context, input1->type, output->type);
-  TF_LITE_ENSURE_EQ(context, input2->type, output->type);
+  data->requires_broadcast = !HaveSameShapes(input1, input2);
+
+  TfLiteIntArray* output_size = nullptr;
+  if (data->requires_broadcast) {
+    TF_LITE_ENSURE_OK(context, CalculateShapeForBroadcast(
+                                   context, input1, input2, &output_size));
+  } else {
+    output_size = TfLiteIntArrayCopy(input1->dims);
+  }
 
-  TfLiteIntArray* output_size = TfLiteIntArrayCopy(input1->dims);
   return context->ResizeTensor(context, output, output_size);
 }
 
 template <KernelType kernel_type>
-void EvalSubFloat(TfLiteContext* context, TfLiteNode* node,
-                  TfLiteSubParams* params, TfLiteTensor* input1,
-                  TfLiteTensor* input2, TfLiteTensor* output) {
+void EvalFloat(TfLiteContext* context, TfLiteNode* node,
+               TfLiteSubParams* params, const OpData* data,
+               TfLiteTensor* input1, TfLiteTensor* input2,
+               TfLiteTensor* output) {
   float output_activation_min, output_activation_max;
   CalculateActivationRangeFloat(params->activation, &output_activation_min,
                                 &output_activation_max);
-#define TF_LITE_Sub(type)                                        \
-  type::Sub(GetTensorData<float>(input1), GetTensorDims(input1), \
-            GetTensorData<float>(input2), GetTensorDims(input2), \
-            output_activation_min, output_activation_max,        \
-            GetTensorData<float>(output), GetTensorDims(output))
+#define TF_LITE_SUB(type, opname)                                   \
+  type::opname(GetTensorData<float>(input1), GetTensorDims(input1), \
+               GetTensorData<float>(input2), GetTensorDims(input2), \
+               output_activation_min, output_activation_max,        \
+               GetTensorData<float>(output), GetTensorDims(output))
+  if (kernel_type == kReference) {
+    if (data->requires_broadcast) {
+      TF_LITE_SUB(reference_ops, BroadcastSub);
+    } else {
+      TF_LITE_SUB(reference_ops, Sub);
+    }
+  } else {
+    if (data->requires_broadcast) {
+      TF_LITE_SUB(optimized_ops, BroadcastSub);
+    } else {
+      TF_LITE_SUB(optimized_ops, Sub);
+    }
+  }
+#undef TF_LITE_SUB
+}
+
+template <KernelType kernel_type>
+void EvalQuantized(TfLiteContext* context, TfLiteNode* node,
+                   TfLiteSubParams* params, const OpData* data,
+                   TfLiteTensor* input1, TfLiteTensor* input2,
+                   TfLiteTensor* output) {
+  auto input1_offset = -input1->params.zero_point;
+  auto input2_offset = -input2->params.zero_point;
+  auto output_offset = output->params.zero_point;
+  const int left_shift = 20;
+  const double twice_max_input_scale =
+      2 * std::max(input1->params.scale, input2->params.scale);
+  const double real_input1_multiplier =
+      input1->params.scale / twice_max_input_scale;
+  const double real_input2_multiplier =
+      input2->params.scale / twice_max_input_scale;
+  const double real_output_multiplier =
+      twice_max_input_scale / ((1 << left_shift) * output->params.scale);
+
+  int32 input1_multiplier;
+  int input1_shift;
+  QuantizeMultiplierSmallerThanOne(real_input1_multiplier, &input1_multiplier,
+                                   &input1_shift);
+  int32 input2_multiplier;
+  int input2_shift;
+  QuantizeMultiplierSmallerThanOne(real_input2_multiplier, &input2_multiplier,
+                                   &input2_shift);
+  int32 output_multiplier;
+  int output_shift;
+  QuantizeMultiplierSmallerThanOne(real_output_multiplier, &output_multiplier,
+                                   &output_shift);
+
+  int32 output_activation_min, output_activation_max;
+  CalculateActivationRangeUint8(params->activation, output,
+                                &output_activation_min, &output_activation_max);
+
+#define TF_LITE_SUB(type, opname)                                            \
+  type::opname(left_shift, GetTensorData<uint8_t>(input1),                   \
+               GetTensorDims(input1), input1_offset, input1_multiplier,      \
+               input1_shift, GetTensorData<uint8_t>(input2),                 \
+               GetTensorDims(input2), input2_offset, input2_multiplier,      \
+               input2_shift, output_offset, output_multiplier, output_shift, \
+               output_activation_min, output_activation_max,                 \
+               GetTensorData<uint8_t>(output), GetTensorDims(output));
+  // The quantized version of Sub doesn't support activations, so we
+  // always use BroadcastSub.
   if (kernel_type == kReference) {
-    TF_LITE_Sub(reference_ops);
+    TF_LITE_SUB(reference_ops, BroadcastSub);
   } else {
-    TF_LITE_Sub(optimized_ops);
+    TF_LITE_SUB(optimized_ops, BroadcastSub);
   }
-#undef TF_LITE_Sub
+#undef TF_LITE_SUB
 }
 
 template <KernelType kernel_type>
 TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   auto* params = reinterpret_cast<TfLiteSubParams*>(node->builtin_data);
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
 
   TfLiteTensor* input1 = GetInput(context, node, kInputTensor1);
   TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
 
   if (output->type == kTfLiteFloat32) {
-    EvalSubFloat<kernel_type>(context, node, params, input1, input2, output);
+    EvalFloat<kernel_type>(context, node, params, data, input1, input2, output);
+  } else if (output->type == kTfLiteUInt8) {
+    EvalQuantized<kernel_type>(context, node, params, data, input1, input2,
+                               output);
   } else {
     context->ReportError(context, "Inputs and outputs not all float types.");
     return kTfLiteError;
@@ -99,19 +184,19 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
 }  // namespace sub
 
 TfLiteRegistration* Register_SUB_REF() {
-  static TfLiteRegistration r = {nullptr, nullptr, sub::Prepare,
+  static TfLiteRegistration r = {sub::Init, sub::Free, sub::Prepare,
                                  sub::Eval<sub::kReference>};
   return &r;
 }
 
 TfLiteRegistration* Register_SUB_GENERIC_OPT() {
-  static TfLiteRegistration r = {nullptr, nullptr, sub::Prepare,
+  static TfLiteRegistration r = {sub::Init, sub::Free, sub::Prepare,
                                  sub::Eval<sub::kGenericOptimized>};
   return &r;
 }
 
 TfLiteRegistration* Register_SUB_NEON_OPT() {
-  static TfLiteRegistration r = {nullptr, nullptr, sub::Prepare,
+  static TfLiteRegistration r = {sub::Init, sub::Free, sub::Prepare,
                                  sub::Eval<sub::kNeonOptimized>};
   return &r;
 }
diff --git a/tensorflow/contrib/lite/kernels/sub_test.cc b/tensorflow/contrib/lite/kernels/sub_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..ff07aeec49dbfcc0e1f65df3d674d5ec30f1b54c
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/sub_test.cc
@@ -0,0 +1,218 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+class BaseSubOpModel : public SingleOpModel {
+ public:
+  BaseSubOpModel(const TensorData& input1, const TensorData& input2,
+                 const TensorData& output,
+                 ActivationFunctionType activation_type) {
+    input1_ = AddInput(input1);
+    input2_ = AddInput(input2);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_SUB, BuiltinOptions_SubOptions,
+                 CreateSubOptions(builder_, activation_type).Union());
+    BuildInterpreter({GetShape(input1_), GetShape(input2_)});
+  }
+
+  int input1() { return input1_; }
+  int input2() { return input2_; }
+
+ protected:
+  int input1_;
+  int input2_;
+  int output_;
+};
+
+class FloatSubOpModel : public BaseSubOpModel {
+ public:
+  using BaseSubOpModel::BaseSubOpModel;
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+};
+
+class QuantizedSubOpModel : public BaseSubOpModel {
+ public:
+  using BaseSubOpModel::BaseSubOpModel;
+
+  std::vector<float> GetDequantizedOutput() {
+    return Dequantize<uint8_t>(ExtractVector<uint8_t>(output_),
+                               GetScale(output_), GetZeroPoint(output_));
+  }
+};
+
+// For quantized Sub, the error shouldn't exceed 2*step.
+float GetTolerance(int min, int max) {
+  float kQuantizedStep = (max - min) / 255.0;
+  float kQuantizedTolerance = 2.0 * kQuantizedStep;
+  return kQuantizedTolerance;
+}
+
+TEST(FloatSubOpModel, NoActivation) {
+  FloatSubOpModel m({TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {}}, ActivationFunctionType_NONE);
+  m.PopulateTensor<float>(m.input1(), {-2.0, 0.2, 1.7, 0.5});
+  m.PopulateTensor<float>(m.input2(), {0.1, 0.2, 0.3, 0.8});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray(ArrayFloatNear({-2.1, 0.0, 1.4, -0.3})));
+}
+
+TEST(FloatSubOpModel, ActivationRELU_N1_TO_1) {
+  FloatSubOpModel m(
+      {TensorType_FLOAT32, {1, 2, 2, 1}}, {TensorType_FLOAT32, {1, 2, 2, 1}},
+      {TensorType_FLOAT32, {}}, ActivationFunctionType_RELU_N1_TO_1);
+  m.PopulateTensor<float>(m.input1(), {-2.0, 0.2, 1.7, 0.5});
+  m.PopulateTensor<float>(m.input2(), {0.1, 0.2, 0.3, 0.8});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray(ArrayFloatNear({-1.0, 0.0, 1.0, -0.3})));
+}
+
+TEST(FloatSubOpModel, VariousInputShapes) {
+  std::vector<std::initializer_list<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    FloatSubOpModel m({TensorType_FLOAT32, test_shapes[i]},
+                      {TensorType_FLOAT32, test_shapes[i]},
+                      {TensorType_FLOAT32, {}}, ActivationFunctionType_NONE);
+    m.PopulateTensor<float>(m.input1(), {-2.0, 0.2, 1.7, 0.5, -1.1, 2.0});
+    m.PopulateTensor<float>(m.input2(), {0.1, 0.2, 0.3, 0.8, -1.1, 0.1});
+    m.Invoke();
+    EXPECT_THAT(
+        m.GetOutput(),
+        ElementsAreArray(ArrayFloatNear({-2.1, 0.0, 1.4, -0.3, 0.0, 1.9})))
+        << "With shape number " << i;
+  }
+}
+
+TEST(FloatSubOpModel, WithBroadcast) {
+  std::vector<std::initializer_list<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    FloatSubOpModel m({TensorType_FLOAT32, test_shapes[i]},
+                      {TensorType_FLOAT32, {}},  // always a scalar
+                      {TensorType_FLOAT32, {}}, ActivationFunctionType_NONE);
+    m.PopulateTensor<float>(m.input1(), {-2.0, 0.2, 1.7, 0.5, -1.1, 2.0});
+    m.PopulateTensor<float>(m.input2(), {0.5});
+    m.Invoke();
+    EXPECT_THAT(
+        m.GetOutput(),
+        ElementsAreArray(ArrayFloatNear({-2.5, -0.3, 1.2, 0.0, -1.6, 1.5})))
+        << "With shape number " << i;
+  }
+}
+
+TEST(QuantizedSubOpModel, QuantizedTestsNoActivation) {
+  float kQuantizedTolerance = GetTolerance(-1.0, 1.0);
+  std::vector<std::initializer_list<float>> inputs1 = {
+      {0.1, 0.2, 0.3, 0.4}, {-0.2, 0.2, 0.4, 0.7}, {-0.01, 0.2, 0.7, 0.3}};
+  std::vector<std::initializer_list<float>> inputs2 = {
+      {0.6, 0.4, 0.3, 0.1}, {0.6, 0.4, 0.5, -0.2}, {0.6, 0.4, -0.18, 0.5}};
+  std::vector<std::initializer_list<float>> results = {
+      {-0.5, -0.2, 0.0, 0.3},
+      {-0.8, -0.2, -0.1, 0.9},
+      {-0.61, -0.2, 0.88, -0.2}};
+  for (int i = 0; i < inputs1.size(); ++i) {
+    QuantizedSubOpModel m({TensorType_UINT8, {1, 2, 2, 1}, -1.0, 1.0},
+                          {TensorType_UINT8, {1, 2, 2, 1}, -1.0, 1.0},
+                          {TensorType_UINT8, {}, -1.0, 1.0},
+                          ActivationFunctionType_NONE);
+    m.QuantizeAndPopulate<uint8_t>(m.input1(), inputs1[i]);
+    m.QuantizeAndPopulate<uint8_t>(m.input2(), inputs2[i]);
+    m.Invoke();
+    EXPECT_THAT(m.GetDequantizedOutput(), ElementsAreArray(ArrayFloatNear(
+                                              results[i], kQuantizedTolerance)))
+        << "With test number " << i;
+  }
+}
+
+TEST(QuantizedSubOpModel, QuantizedTestsActivationRELU_N1_TO_1) {
+  float kQuantizedTolerance = GetTolerance(-1.0, 1.0);
+  std::vector<std::initializer_list<float>> inputs1 = {{-0.8, 0.2, 0.9, 0.7},
+                                                       {-0.8, 0.2, 0.7, 0.5}};
+  std::vector<std::initializer_list<float>> inputs2 = {{0.6, 0.4, 0.9, -0.8},
+                                                       {0.6, 0.4, -0.8, 0.3}};
+  std::vector<std::initializer_list<float>> results = {{-1.0, -0.2, 0.0, 1.0},
+                                                       {-1.0, -0.2, 1.0, 0.2}};
+  for (int i = 0; i < inputs1.size(); ++i) {
+    QuantizedSubOpModel m({TensorType_UINT8, {1, 2, 2, 1}, -1.0, 1.0},
+                          {TensorType_UINT8, {1, 2, 2, 1}, -1.0, 1.0},
+                          {TensorType_UINT8, {}, -1.0, 1.0},
+                          ActivationFunctionType_RELU_N1_TO_1);
+    m.QuantizeAndPopulate<uint8_t>(m.input1(), inputs1[i]);
+    m.QuantizeAndPopulate<uint8_t>(m.input2(), inputs2[i]);
+    m.Invoke();
+    EXPECT_THAT(m.GetDequantizedOutput(), ElementsAreArray(ArrayFloatNear(
+                                              results[i], kQuantizedTolerance)))
+        << "With test number " << i;
+  }
+}
+
+TEST(QuantizedSubOpModel, QuantizedVariousInputShapes) {
+  float kQuantizedTolerance = GetTolerance(-3.0, 3.0);
+  std::vector<std::initializer_list<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    QuantizedSubOpModel m({TensorType_UINT8, test_shapes[i], -3.0, 3.0},
+                          {TensorType_UINT8, test_shapes[i], -3.0, 3.0},
+                          {TensorType_UINT8, {}, -3.0, 3.0},
+                          ActivationFunctionType_NONE);
+    m.QuantizeAndPopulate<uint8_t>(m.input1(), {-2.0, 0.2, 0.7, 0.8, 1.1, 2.0});
+    m.QuantizeAndPopulate<uint8_t>(m.input2(), {0.1, 0.3, 0.3, 0.5, 1.1, 0.1});
+    m.Invoke();
+    EXPECT_THAT(m.GetDequantizedOutput(),
+                ElementsAreArray(ArrayFloatNear(
+                    {-2.1, -0.1, 0.4, 0.3, 0.0, 1.9}, kQuantizedTolerance)))
+        << "With shape number " << i;
+  }
+}
+
+TEST(QuantizedSubOpModel, QuantizedWithBroadcast) {
+  float kQuantizedTolerance = GetTolerance(-3.0, 3.0);
+  std::vector<std::initializer_list<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    QuantizedSubOpModel m({TensorType_UINT8, test_shapes[i], -3.0, 3.0},
+                          {TensorType_UINT8, {}, -3.0, 3.0},
+                          {TensorType_UINT8, {}, -3.0, 3.0},
+                          ActivationFunctionType_NONE);
+    m.QuantizeAndPopulate<uint8_t>(m.input1(), {-2.0, 0.2, 0.7, 0.8, 1.1, 2.0});
+    m.QuantizeAndPopulate<uint8_t>(m.input2(), {0.7});
+    m.Invoke();
+    EXPECT_THAT(m.GetDequantizedOutput(),
+                ElementsAreArray(ArrayFloatNear(
+                    {-2.7, -0.5, 0.0, 0.1, 0.4, 1.3}, kQuantizedTolerance)))
+        << "With shape number " << i;
+  }
+}
+
+}  // namespace
+}  // namespace tflite
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/test_util.cc b/tensorflow/contrib/lite/kernels/test_util.cc
index 373310bd87370a670a847cf5328633956028a850..0bb28b50b2a5e5a9fd803ecf1b0928026f63881e 100644
--- a/tensorflow/contrib/lite/kernels/test_util.cc
+++ b/tensorflow/contrib/lite/kernels/test_util.cc
@@ -141,8 +141,8 @@ void SingleOpModel::SetBuiltinOp(BuiltinOperator type,
 
 void SingleOpModel::SetCustomOp(
     const string& name, const std::vector<uint8_t>& custom_option,
-    const std::function<TfLiteRegistration*()>& registeration) {
-  custom_registrations_[name] = registeration;
+    const std::function<TfLiteRegistration*()>& registration) {
+  custom_registrations_[name] = registration;
   opcodes_.push_back(
       CreateOperatorCodeDirect(builder_, BuiltinOperator_CUSTOM, name.data()));
   operators_.push_back(CreateOperator(
diff --git a/tensorflow/contrib/lite/kernels/test_util.h b/tensorflow/contrib/lite/kernels/test_util.h
index 7d476ba1eaffbb24fb77390c0e71c32d60b6411e..a9064d54e7704d52eefa34f6bf446ec1cfe68fe1 100644
--- a/tensorflow/contrib/lite/kernels/test_util.h
+++ b/tensorflow/contrib/lite/kernels/test_util.h
@@ -39,10 +39,10 @@ inline std::vector<T> Quantize(const std::vector<float>& data, float scale,
                                int32_t zero_point) {
   std::vector<T> q;
   for (float f : data) {
-    q.push_back(std::max(
+    q.push_back(static_cast<T>(std::max<float>(
         std::numeric_limits<T>::min(),
-        std::min(std::numeric_limits<T>::max(),
-                 static_cast<T>(std::round(zero_point + (f / scale))))));
+        std::min<float>(std::numeric_limits<T>::max(),
+                        std::round(zero_point + (f / scale))))));
   }
   return q;
 }
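Note on the `Quantize` hunk above: the old code cast to `T` before clamping, so a value outside the range of `T` was converted before `std::max`/`std::min` could limit it. The new code clamps in `float` and casts last. The following is a minimal stand-alone sketch of that clamp-then-cast order. `QuantizeSketch` is a hypothetical name used only for illustration and is not part of test_util.h.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-alone version of the clamp-then-cast pattern; it is a
// sketch for illustration, not the helper from test_util.h itself.
template <typename T>
std::vector<T> QuantizeSketch(const std::vector<float>& data, float scale,
                              int32_t zero_point) {
  std::vector<T> q;
  q.reserve(data.size());
  // Express the limits of T as floats so the clamp happens in float space.
  const float lo = static_cast<float>(std::numeric_limits<T>::min());
  const float hi = static_cast<float>(std::numeric_limits<T>::max());
  for (float f : data) {
    const float rounded = std::round(zero_point + (f / scale));
    // Clamp first, then cast: the cast only ever sees an in-range value.
    q.push_back(static_cast<T>(std::max(lo, std::min(hi, rounded))));
  }
  return q;
}
```

Under these assumptions, `QuantizeSketch<uint8_t>({-1.0f, 1000.0f}, 1.0f, 0)` saturates both values to the ends of the `uint8_t` range instead of converting an out-of-range float.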
diff --git a/tensorflow/contrib/lite/kernels/test_util_test.cc b/tensorflow/contrib/lite/kernels/test_util_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..1e10e89061213b6fcabd404310893dd97a51d83f
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/test_util_test.cc
@@ -0,0 +1,51 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include <gtest/gtest.h>
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+TEST(TestUtilTest, QuantizeVector) {
+  std::vector<float> data = {-1.0, -0.5, 0.0, 0.5, 1.0, 1000.0};
+  auto q_data = Quantize<uint8>(data, /*scale=*/1.0, /*zero_point=*/0);
+  std::vector<uint8> expected = {0, 0, 0, 1, 1, 255};
+  EXPECT_THAT(q_data, ElementsAreArray(expected));
+}
+
+TEST(TestUtilTest, QuantizeVectorScalingDown) {
+  std::vector<float> data = {-1.0, -0.5, 0.0, 0.5, 1.0, 1000.0};
+  auto q_data = Quantize<uint8>(data, /*scale=*/10.0, /*zero_point=*/0);
+  std::vector<uint8> expected = {0, 0, 0, 0, 0, 100};
+  EXPECT_THAT(q_data, ElementsAreArray(expected));
+}
+
+TEST(TestUtilTest, QuantizeVectorScalingUp) {
+  std::vector<float> data = {-1.0, -0.5, 0.0, 0.5, 1.0, 1000.0};
+  auto q_data = Quantize<uint8>(data, /*scale=*/0.1, /*zero_point=*/0);
+  std::vector<uint8> expected = {0, 0, 0, 5, 10, 255};
+  EXPECT_THAT(q_data, ElementsAreArray(expected));
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc b/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
index 9cdb58714edb5fee771fc45f3c53a570f8fb28d1..42941a97db70adb37c20500c8f9438adfea25389 100644
--- a/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
+++ b/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/kernels/activation_functor.h"
+#include "tensorflow/contrib/lite/kernels/internal/kernel_utils.h"
 #include "tensorflow/contrib/lite/kernels/internal/tensor_utils.h"
 #include "tensorflow/contrib/lite/kernels/kernel_util.h"
 #include "tensorflow/contrib/lite/kernels/op_macros.h"
@@ -359,7 +360,7 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   const int n_output = recurrent_to_output_weights->dims->data[1];
 
   // Since we have already checked that weights are all there or none, we can
-  // check the existense of only one to the get the condition.
+  // check the existence of only one to get the condition.
   const bool use_cifg = (input_to_input_weights == nullptr);
   const bool use_peephole = (cell_to_output_weights != nullptr);
 
@@ -380,135 +381,57 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
     output_gate_scratch = scratch_buffer->data.f + 3 * n_cell * n_batch;
   }
 
+  // Check optional tensors; the respective pointers can be null.
+  const float* input_to_input_weights_ptr =
+      (use_cifg) ? nullptr : input_to_input_weights->data.f;
+  const float* recurrent_to_input_weights_ptr =
+      (use_cifg) ? nullptr : recurrent_to_input_weights->data.f;
+  const float* input_gate_bias_ptr =
+      (use_cifg) ? nullptr : input_gate_bias->data.f;
+  const float* cell_to_input_weights_ptr =
+      (use_peephole && !use_cifg) ? cell_to_input_weights->data.f : nullptr;
+  const float* cell_to_forget_weights_ptr =
+      (use_peephole) ? cell_to_forget_weights->data.f : nullptr;
+  const float* cell_to_output_weights_ptr =
+      (use_peephole) ? cell_to_output_weights->data.f : nullptr;
+  const float* projection_weights_ptr =
+      (projection_weights == nullptr) ? nullptr : projection_weights->data.f;
+  const float* projection_bias_ptr =
+      (projection_bias == nullptr) ? nullptr : projection_bias->data.f;
+
+  // Required tensors; pointers are non-null.
+  const float* input_to_forget_weights_ptr = input_to_forget_weights->data.f;
+  const float* input_to_cell_weights_ptr = input_to_cell_weights->data.f;
+  const float* input_to_output_weights_ptr = input_to_output_weights->data.f;
+  const float* recurrent_to_forget_weights_ptr =
+      recurrent_to_forget_weights->data.f;
+  const float* recurrent_to_cell_weights_ptr =
+      recurrent_to_cell_weights->data.f;
+  const float* recurrent_to_output_weights_ptr =
+      recurrent_to_output_weights->data.f;
+  const float* forget_gate_bias_ptr = forget_gate_bias->data.f;
+  const float* cell_bias_ptr = cell_bias->data.f;
+  const float* output_gate_bias_ptr = output_gate_bias->data.f;
+
+  float* output_state_ptr = output_state->data.f;
+  float* cell_state_ptr = cell_state->data.f;
+
   for (int t = 0; t < max_time; t++) {
-    const float* input_ptr_time = input->data.f + t * n_batch * n_input;
-    // Initialize scratch buffers with bias.
-    if (!use_cifg) {
-      tensor_utils::VectorBatchVectorAssign(input_gate_bias->data.f, n_cell,
-                                            n_batch, input_gate_scratch);
-    }
-    tensor_utils::VectorBatchVectorAssign(forget_gate_bias->data.f, n_cell,
-                                          n_batch, forget_gate_scratch);
-    tensor_utils::VectorBatchVectorAssign(cell_bias->data.f, n_cell, n_batch,
-                                          cell_scratch);
-    tensor_utils::VectorBatchVectorAssign(output_gate_bias->data.f, n_cell,
-                                          n_batch, output_gate_scratch);
-
-    // For each batch and cell: compute input_weight * input.
-    if (!use_cifg) {
-      tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-          input_to_input_weights->data.f, n_cell, n_input, input_ptr_time,
-          n_batch, input_gate_scratch, /*result_stride=*/1);
-    }
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        input_to_forget_weights->data.f, n_cell, n_input, input_ptr_time,
-        n_batch, forget_gate_scratch, /*result_stride=*/1);
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        input_to_cell_weights->data.f, n_cell, n_input, input_ptr_time, n_batch,
-        cell_scratch, /*result_stride=*/1);
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        input_to_output_weights->data.f, n_cell, n_input, input_ptr_time,
-        n_batch, output_gate_scratch, /*result_stride=*/1);
-
-    // For each batch and cell: compute recurrent_weight * output_state.
-    if (!use_cifg) {
-      tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-          recurrent_to_input_weights->data.f, n_cell, n_output,
-          output_state->data.f, n_batch, input_gate_scratch,
-          /*result_stride=*/1);
-    }
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        recurrent_to_forget_weights->data.f, n_cell, n_output,
-        output_state->data.f, n_batch, forget_gate_scratch,
-        /*result_stride=*/1);
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        recurrent_to_cell_weights->data.f, n_cell, n_output,
-        output_state->data.f, n_batch, cell_scratch, /*result_stride=*/1);
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        recurrent_to_output_weights->data.f, n_cell, n_output,
-        output_state->data.f, n_batch, output_gate_scratch,
-        /*result_stride=*/1);
-
-    // For each batch and cell: update input gate.
-    if (!use_cifg) {
-      if (use_peephole) {
-        tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-            cell_to_input_weights->data.f, n_cell, cell_state->data.f, n_batch,
-            input_gate_scratch);
-      }
-      tensor_utils::ApplySigmoidToVector(input_gate_scratch, n_cell * n_batch,
-                                         input_gate_scratch);
-    }
-
-    // For each batch and cell: update forget gate.
-    if (use_peephole) {
-      tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-          cell_to_forget_weights->data.f, n_cell, cell_state->data.f, n_batch,
-          forget_gate_scratch);
-    }
-    tensor_utils::ApplySigmoidToVector(forget_gate_scratch, n_cell * n_batch,
-                                       forget_gate_scratch);
-
-    // For each batch and cell: update the cell.
-    tensor_utils::VectorVectorCwiseProduct(forget_gate_scratch,
-                                           cell_state->data.f, n_batch * n_cell,
-                                           cell_state->data.f);
-    tensor_utils::ApplyActivationToVector(cell_scratch, n_batch * n_cell,
-                                          params->activation, cell_scratch);
-    if (use_cifg) {
-      tensor_utils::Sub1Vector(forget_gate_scratch, n_batch * n_cell,
-                               forget_gate_scratch);
-      tensor_utils::VectorVectorCwiseProductAccumulate(
-          cell_scratch, forget_gate_scratch, n_batch * n_cell,
-          cell_state->data.f);
-    } else {
-      tensor_utils::VectorVectorCwiseProductAccumulate(
-          cell_scratch, input_gate_scratch, n_batch * n_cell,
-          cell_state->data.f);
-    }
-    if (params->cell_clip > 0.0) {
-      tensor_utils::ClipVector(cell_state->data.f, n_batch * n_cell,
-                               params->cell_clip, cell_state->data.f);
-    }
-
-    // For each batch and cell: update the output gate.
-    if (use_peephole) {
-      tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-          cell_to_output_weights->data.f, n_cell, cell_state->data.f, n_batch,
-          output_gate_scratch);
-    }
-    tensor_utils::ApplySigmoidToVector(output_gate_scratch, n_batch * n_cell,
-                                       output_gate_scratch);
-    tensor_utils::ApplyActivationToVector(cell_state->data.f, n_batch * n_cell,
-                                          params->activation, cell_scratch);
-    tensor_utils::VectorVectorCwiseProduct(output_gate_scratch, cell_scratch,
-                                           n_batch * n_cell,
-                                           output_gate_scratch);
-
-    // For each batch: update the projection and output_state.
-    const bool use_projection_weight = (projection_weights != nullptr);
-    const bool use_projection_bias = (projection_bias != nullptr);
-    float* output_ptr_time = output->data.f + t * n_batch * n_output;
-    if (use_projection_weight) {
-      if (use_projection_bias) {
-        tensor_utils::VectorBatchVectorAssign(projection_bias->data.f, n_output,
-                                              n_batch, output_ptr_time);
-      } else {
-        tensor_utils::ZeroVector(output_ptr_time, n_batch * n_output);
-      }
-      tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-          projection_weights->data.f, n_output, n_cell, output_gate_scratch,
-          n_batch, output_ptr_time, /*result_stride=*/1);
-      if (params->proj_clip > 0.0) {
-        tensor_utils::ClipVector(output_ptr_time, n_batch * n_output,
-                                 params->proj_clip, output_ptr_time);
-      }
-    } else {
-      tensor_utils::CopyVector(output_gate_scratch, n_batch * n_output,
-                               output_ptr_time);
-    }
-    tensor_utils::CopyVector(output_ptr_time, n_batch * n_output,
-                             output_state->data.f);
+    const float* input_ptr_batch = input->data.f + t * n_batch * n_input;
+    float* output_ptr_batch = output->data.f + t * n_batch * n_output;
+
+    kernel_utils::LstmStep(
+        input_ptr_batch, input_to_input_weights_ptr,
+        input_to_forget_weights_ptr, input_to_cell_weights_ptr,
+        input_to_output_weights_ptr, recurrent_to_input_weights_ptr,
+        recurrent_to_forget_weights_ptr, recurrent_to_cell_weights_ptr,
+        recurrent_to_output_weights_ptr, cell_to_input_weights_ptr,
+        cell_to_forget_weights_ptr, cell_to_output_weights_ptr,
+        input_gate_bias_ptr, forget_gate_bias_ptr, cell_bias_ptr,
+        output_gate_bias_ptr, projection_weights_ptr, projection_bias_ptr,
+        params, n_batch, n_cell, n_input, n_output, output_state_ptr,
+        cell_state_ptr, input_gate_scratch, forget_gate_scratch, cell_scratch,
+        output_gate_scratch, output_ptr_batch);
   }
   return kTfLiteOk;
 }
diff --git a/tensorflow/contrib/lite/memory_planner.h b/tensorflow/contrib/lite/memory_planner.h
index 5cd6c208500f3ea84ab8146f7f136e8b7851ff03..0294ec815c4820d41361b8cd4a814b74c3c1d770 100644
--- a/tensorflow/contrib/lite/memory_planner.h
+++ b/tensorflow/contrib/lite/memory_planner.h
@@ -34,8 +34,8 @@ class MemoryPlanner {
   // [first_node, last_node].
   virtual TfLiteStatus ExecuteAllocations(int first_node, int last_node) = 0;
 
-  // Invalidates allocations made earliers. This is called when tensors sizes
-  // have change. All planned allocations remain, but can't be used until
+  // Invalidates allocations made earlier. This is called when tensor sizes
+  // have changed. All planned allocations remain, but can't be used until
   // ExecuteAllocations() is called.
   virtual TfLiteStatus ResetAllocations() = 0;
 };
diff --git a/tensorflow/contrib/lite/model.cc b/tensorflow/contrib/lite/model.cc
index 187af8af065f934380d150ab09a99d97deeab5b3..791d1378f393594ceb6f1fcec7cc5aadaa81dab3 100644
--- a/tensorflow/contrib/lite/model.cc
+++ b/tensorflow/contrib/lite/model.cc
@@ -32,11 +32,46 @@ namespace tflite {
 
 const char* kEmptyTensorName = "";
 
+// Loads a model from `filename`. If `mmap_file` is true then use mmap,
+// otherwise make a copy of the model in a buffer.
+std::unique_ptr<Allocation> GetAllocationFromFile(const char* filename,
+                                                  bool mmap_file,
+                                                  ErrorReporter* error_reporter,
+                                                  bool use_nnapi) {
+  std::unique_ptr<Allocation> allocation;
+  if (mmap_file) {
+    if (use_nnapi && NNAPIExists())
+      allocation.reset(new NNAPIAllocation(filename, error_reporter));
+    else
+      allocation.reset(new MMAPAllocation(filename, error_reporter));
+  } else {
+    allocation.reset(new FileCopyAllocation(filename, error_reporter));
+  }
+  return allocation;
+}
+
 std::unique_ptr<FlatBufferModel> FlatBufferModel::BuildFromFile(
     const char* filename, ErrorReporter* error_reporter) {
   std::unique_ptr<FlatBufferModel> model;
-  model.reset(new FlatBufferModel(filename, /*mmap_file=*/true, error_reporter,
-                                  /*use_nnapi=*/true));
+  auto allocation = GetAllocationFromFile(filename, /*mmap_file=*/true,
+                                          error_reporter, /*use_nnapi=*/true);
+  model.reset(new FlatBufferModel(allocation.release(), error_reporter));
+  if (!model->initialized()) model.reset();
+  return model;
+}
+
+std::unique_ptr<FlatBufferModel> FlatBufferModel::VerifyAndBuildFromFile(
+    const char* filename, TfLiteVerifier* verifier,
+    ErrorReporter* error_reporter) {
+  std::unique_ptr<FlatBufferModel> model;
+  auto allocation = GetAllocationFromFile(filename, /*mmap_file=*/true,
+                                          error_reporter, /*use_nnapi=*/true);
+  if (verifier &&
+      !verifier->Verify(static_cast<const char*>(allocation->base()),
+                        allocation->bytes(), error_reporter)) {
+    return model;
+  }
+  model.reset(new FlatBufferModel(allocation.release(), error_reporter));
   if (!model->initialized()) model.reset();
   return model;
 }
@@ -44,7 +79,9 @@ std::unique_ptr<FlatBufferModel> FlatBufferModel::BuildFromFile(
 std::unique_ptr<FlatBufferModel> FlatBufferModel::BuildFromBuffer(
     const char* buffer, size_t buffer_size, ErrorReporter* error_reporter) {
   std::unique_ptr<FlatBufferModel> model;
-  model.reset(new FlatBufferModel(buffer, buffer_size, error_reporter));
+  Allocation* allocation =
+      new MemoryAllocation(buffer, buffer_size, error_reporter);
+  model.reset(new FlatBufferModel(allocation, error_reporter));
   if (!model->initialized()) model.reset();
   return model;
 }
@@ -57,23 +94,6 @@ std::unique_ptr<FlatBufferModel> FlatBufferModel::BuildFromModel(
   return model;
 }
 
-FlatBufferModel::FlatBufferModel(const char* filename, bool mmap_file,
-                                 ErrorReporter* error_reporter, bool use_nnapi)
-    : error_reporter_(error_reporter ? error_reporter
-                                     : DefaultErrorReporter()) {
-  if (mmap_file) {
-    if (use_nnapi && NNAPIExists())
-      allocation_ = new NNAPIAllocation(filename, error_reporter);
-    else
-      allocation_ = new MMAPAllocation(filename, error_reporter);
-  } else {
-    allocation_ = new FileCopyAllocation(filename, error_reporter);
-  }
-  if (!allocation_->valid() || !CheckModelIdentifier()) return;
-
-  model_ = ::tflite::GetModel(allocation_->base());
-}
-
 bool FlatBufferModel::CheckModelIdentifier() const {
   if (!tflite::ModelBufferHasIdentifier(allocation_->base())) {
     const char* ident = flatbuffers::GetBufferIdentifier(allocation_->base());
@@ -85,21 +105,21 @@ bool FlatBufferModel::CheckModelIdentifier() const {
   return true;
 }
 
-FlatBufferModel::FlatBufferModel(const char* ptr, size_t num_bytes,
+FlatBufferModel::FlatBufferModel(const Model* model,
                                  ErrorReporter* error_reporter)
     : error_reporter_(error_reporter ? error_reporter
                                      : DefaultErrorReporter()) {
-  allocation_ = new MemoryAllocation(ptr, num_bytes, error_reporter);
-  if (!allocation_->valid()) return;
-
-  model_ = ::tflite::GetModel(allocation_->base());
+  model_ = model;
 }
 
-FlatBufferModel::FlatBufferModel(const Model* model,
+FlatBufferModel::FlatBufferModel(Allocation* allocation,
                                  ErrorReporter* error_reporter)
     : error_reporter_(error_reporter ? error_reporter
                                      : DefaultErrorReporter()) {
-  model_ = model;
+  allocation_ = allocation;
+  if (!allocation_->valid() || !CheckModelIdentifier()) return;
+
+  model_ = ::tflite::GetModel(allocation_->base());
 }
 
 FlatBufferModel::~FlatBufferModel() { delete allocation_; }
@@ -287,6 +307,9 @@ void* ParseOpData(const Operator* op, BuiltinOperator op_type,
     case BuiltinOperator_EXP:
     case BuiltinOperator_TOPK_V2:
     case BuiltinOperator_LOG_SOFTMAX:
+    case BuiltinOperator_CAST:
+    case BuiltinOperator_DEQUANTIZE:
+    case BuiltinOperator_PRELU:
       break;
     case BuiltinOperator_LSH_PROJECTION: {
       TfLiteLSHProjectionParams* params =
@@ -660,9 +683,27 @@ TfLiteStatus InterpreterBuilder::ParseTensors(
       // but we really only support one value for the whole tensor.
       // TODO(aselle): This breaks as well if these are nullptr's.
       // TODO(aselle): This assumes non per-channel quantization.
-      if (q_params->scale()) quantization.scale = q_params->scale()->Get(0);
-      if (q_params->zero_point())
+
+      if (q_params->scale()) {
+        if (q_params->scale()->size() != 1) {
+          error_reporter_->Report(
+              "QuantizationParam has %d scale values (only 1 is supported).",
+              q_params->scale()->size());
+          return kTfLiteError;
+        }
+        quantization.scale = q_params->scale()->Get(0);
+      }
+
+      if (q_params->zero_point()) {
+        if (q_params->zero_point()->size() != 1) {
+          error_reporter_->Report(
+              "QuantizationParam has %d zero_point values"
+              " (only 1 is supported).",
+              q_params->zero_point()->size());
+          return kTfLiteError;
+        }
         quantization.zero_point = q_params->zero_point()->Get(0);
+      }
     }
 
     TfLiteType type;
@@ -740,6 +781,11 @@ TfLiteStatus InterpreterBuilder::ParseTensors(
 
 TfLiteStatus InterpreterBuilder::operator()(
     std::unique_ptr<Interpreter>* interpreter) {
+  return operator()(interpreter, /*num_threads=*/-1);
+}
+
+TfLiteStatus InterpreterBuilder::operator()(
+    std::unique_ptr<Interpreter>* interpreter, int num_threads) {
   if (!interpreter) {
     error_reporter_->Report(
         "Null output pointer passed to InterpreterBuilder.");
@@ -794,9 +840,8 @@ TfLiteStatus InterpreterBuilder::operator()(
   if ((**interpreter).AddTensors(tensors->Length()) != kTfLiteOk) {
     return cleanup_and_error();
   }
-
-  (**interpreter).set_model(model_);
-
+  // Set num threads
+  (**interpreter).SetNumThreads(num_threads);
   // Parse inputs/outputs
   (**interpreter).SetInputs(FlatBufferIntArrayToVector(subgraph->inputs()));
   (**interpreter).SetOutputs(FlatBufferIntArrayToVector(subgraph->outputs()));
diff --git a/tensorflow/contrib/lite/model.h b/tensorflow/contrib/lite/model.h
index a467df5bb4eee3f6ce814512cb8b74bf09a6a4e7..036dc46e03f565c40791aee55d4158cef5c832e0 100644
--- a/tensorflow/contrib/lite/model.h
+++ b/tensorflow/contrib/lite/model.h
@@ -41,6 +41,17 @@ limitations under the License.
 
 namespace tflite {
 
+// Abstract interface that verifies whether a given model is legit.
+// It facilitates the use-case to verify and build a model without loading it
+// twice.
+class TfLiteVerifier {
+ public:
+  // Returns true if the model is legit.
+  virtual bool Verify(const char* data, int length,
+                      ErrorReporter* reporter) = 0;
+  virtual ~TfLiteVerifier() {}
+};
+
 // An RAII object that represents a read-only tflite model, copied from disk,
 // or mmapped. This uses flatbuffers as the serialization format.
 class FlatBufferModel {
@@ -50,6 +61,12 @@ class FlatBufferModel {
       const char* filename,
       ErrorReporter* error_reporter = DefaultErrorReporter());
 
+  // Verifies whether the content of the file is legit, then builds a model
+  // based on the file. Returns a nullptr in case of failure.
+  static std::unique_ptr<FlatBufferModel> VerifyAndBuildFromFile(
+      const char* filename, TfLiteVerifier* verifier = nullptr,
+      ErrorReporter* error_reporter = DefaultErrorReporter());
+
   // Builds a model based on a pre-loaded flatbuffer. The caller retains
   // ownership of the buffer and should keep it alive until the returned object
   // is destroyed. Returns a nullptr in case of failure.
@@ -64,7 +81,7 @@ class FlatBufferModel {
       const tflite::Model* model_spec,
       ErrorReporter* error_reporter = DefaultErrorReporter());
 
-  // Releases memory or unmaps mmaped meory.
+  // Releases memory or unmaps mmapped memory.
   ~FlatBufferModel();
 
   // Copying or assignment is disallowed to simplify ownership semantics.
@@ -82,23 +99,9 @@ class FlatBufferModel {
   bool CheckModelIdentifier() const;
 
  private:
-  // Loads a model from `filename`. If `mmap_file` is true then use mmap,
-  // otherwise make a copy of the model in a buffer.
-  //
-  // Note, if `error_reporter` is null, then a DefaultErrorReporter() will be
-  // used.
-  explicit FlatBufferModel(
-      const char* filename, bool mmap_file = true,
-      ErrorReporter* error_reporter = DefaultErrorReporter(),
-      bool use_nnapi = false);
-
-  // Loads a model from `ptr` and `num_bytes` of the model file. The `ptr` has
-  // to remain alive and unchanged until the end of this flatbuffermodel's
-  // lifetime.
-  //
-  // Note, if `error_reporter` is null, then a DefaultErrorReporter() will be
-  // used.
-  FlatBufferModel(const char* ptr, size_t num_bytes,
+  // Loads a model from a given allocation. FlatBufferModel will take over the
+  // ownership of `allocation`, and delete it in the destructor.
+  FlatBufferModel(Allocation* allocation,
                   ErrorReporter* error_reporter = DefaultErrorReporter());
 
   // Loads a model from Model flatbuffer. The `model` has to remain alive and
@@ -151,6 +154,8 @@ class InterpreterBuilder {
   InterpreterBuilder(const InterpreterBuilder&) = delete;
   InterpreterBuilder& operator=(const InterpreterBuilder&) = delete;
   TfLiteStatus operator()(std::unique_ptr<Interpreter>* interpreter);
+  TfLiteStatus operator()(std::unique_ptr<Interpreter>* interpreter,
+                          int num_threads);
 
  private:
   TfLiteStatus BuildLocalIndexToRegistrationMapping();
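Note on the `TfLiteVerifier` interface added above: `VerifyAndBuildFromFile` builds the model when the verifier is null or returns true, and returns a null model when it returns false. The sketch below illustrates that control flow with a verifier that checks a FlatBuffer file identifier. `VerifierSketch`, `IdentifierVerifier` and `ShouldBuild` are hypothetical names, and the interface drops the `ErrorReporter` argument so the example compiles without TF Lite headers. It assumes the identifier occupies the four bytes after the root offset.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in for the verifier interface; the real one also takes
// an ErrorReporter*, which is omitted here to keep the sketch self-contained.
class VerifierSketch {
 public:
  virtual bool Verify(const char* data, int length) = 0;
  virtual ~VerifierSketch() {}
};

// Accepts a buffer only if it is long enough to hold a file identifier and
// that identifier matches `expected` (which must be 4 characters long).
class IdentifierVerifier : public VerifierSketch {
 public:
  explicit IdentifierVerifier(const std::string& expected)
      : expected_(expected) {}

  bool Verify(const char* data, int length) override {
    // Assumption: the identifier sits in bytes 4..7, after the root offset.
    if (data == nullptr || length < 8 || expected_.size() != 4) return false;
    return std::memcmp(data + 4, expected_.data(), 4) == 0;
  }

 private:
  std::string expected_;
};

// Mirrors the decision made before the model is constructed: a null verifier
// skips verification, a failing verifier prevents the build.
bool ShouldBuild(VerifierSketch* verifier, const char* data, int length) {
  return verifier == nullptr || verifier->Verify(data, length);
}
```

A caller following this pattern would run `ShouldBuild` on the loaded bytes once and construct the model only on success, which is the "verify and build without loading twice" use case the header comment describes.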
diff --git a/tensorflow/contrib/lite/model_test.cc b/tensorflow/contrib/lite/model_test.cc
index 66f22fd66a9ae0d35553a1f780ef73a5c5994c99..ae6c1ece18963f11f48a6f07bea4065ce39687e0 100644
--- a/tensorflow/contrib/lite/model_test.cc
+++ b/tensorflow/contrib/lite/model_test.cc
@@ -209,6 +209,38 @@ TEST(BasicFlatBufferModel, TestNullModel) {
   ASSERT_EQ(interpreter.get(), nullptr);
 }
 
+// Mocks the verifier by setting the result in ctor.
+class FakeVerifier : public tflite::TfLiteVerifier {
+ public:
+  explicit FakeVerifier(bool result) : result_(result) {}
+  bool Verify(const char* data, int length,
+              tflite::ErrorReporter* reporter) override {
+    return result_;
+  }
+
+ private:
+  bool result_;
+};
+
+TEST(BasicFlatBufferModel, TestWithTrueVerifier) {
+  FakeVerifier verifier(true);
+  ASSERT_TRUE(FlatBufferModel::VerifyAndBuildFromFile(
+      "tensorflow/contrib/lite/testdata/test_model.bin",
+      &verifier));
+}
+
+TEST(BasicFlatBufferModel, TestWithFalseVerifier) {
+  FakeVerifier verifier(false);
+  ASSERT_FALSE(FlatBufferModel::VerifyAndBuildFromFile(
+      "tensorflow/contrib/lite/testdata/test_model.bin",
+      &verifier));
+}
+
+TEST(BasicFlatBufferModel, TestWithNullVerifier) {
+  ASSERT_TRUE(FlatBufferModel::VerifyAndBuildFromFile(
+      "tensorflow/contrib/lite/testdata/test_model.bin", nullptr));
+}
+
 struct TestErrorReporter : public ErrorReporter {
   int Report(const char* format, va_list args) override {
     calls++;
diff --git a/tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h b/tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h
index 76032771af2c8e099aed498b2071816646f3b606..bd49d327c995ef53dc6cf9f8301ab749c925b2c7 100644
--- a/tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h
+++ b/tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h
@@ -569,7 +569,7 @@ enum {
   ANEURALNETWORKS_LOGISTIC = 14,
 
   /**
-   * Projects an input to a bit vector via locality senstive hashing.
+   * Projects an input to a bit vector via locality sensitive hashing.
    *
    * Inputs:
    * * 0: Hash functions. Dim.size == 2, DataType: Float.
diff --git a/tensorflow/contrib/lite/nnapi_delegate.cc b/tensorflow/contrib/lite/nnapi_delegate.cc
index e6da21664ced109a7091269b5d498666a838da38..decaf9f160ad35b66f0ed56d0840634c610e4246 100644
--- a/tensorflow/contrib/lite/nnapi_delegate.cc
+++ b/tensorflow/contrib/lite/nnapi_delegate.cc
@@ -346,7 +346,10 @@ void AddOpsAndParams(tflite::Interpreter* interpreter,
       case tflite::BuiltinOperator_STRIDED_SLICE:
       case tflite::BuiltinOperator_EXP:
       case tflite::BuiltinOperator_LOG_SOFTMAX:
+      case tflite::BuiltinOperator_DEQUANTIZE:
       case tflite::BuiltinOperator_DELEGATE:
+      case tflite::BuiltinOperator_CAST:
+      case tflite::BuiltinOperator_PRELU:
       case tflite::BuiltinOperator_MAXIMUM:
         FATAL("Op code %d is currently not delegated to NNAPI", builtin);
         nn_op_type = -1;  // set to invalid
diff --git a/tensorflow/contrib/lite/python/BUILD b/tensorflow/contrib/lite/python/BUILD
index 82feae0f0041997949212613c654a5695f468d56..d6f39219ed93718af767e7ad1e1de2d843062c61 100644
--- a/tensorflow/contrib/lite/python/BUILD
+++ b/tensorflow/contrib/lite/python/BUILD
@@ -4,6 +4,38 @@ package(default_visibility = ["//tensorflow:internal"])
 
 load("//tensorflow:tensorflow.bzl", "py_test")
 
+filegroup(
+    name = "interpreter_test_data",
+    srcs = glob(["**/testdata/*"]),
+    visibility = ["//tensorflow:__subpackages__"],
+)
+
+py_library(
+    name = "interpreter",
+    srcs = [
+        "interpreter.py",
+    ],
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/contrib/lite/python/interpreter_wrapper:tensorflow_wrap_interpreter_wrapper",
+    ],
+)
+
+py_test(
+    name = "interpreter_test",
+    srcs = ["interpreter_test.py"],
+    data = [":interpreter_test_data"],
+    srcs_version = "PY2AND3",
+    tags = ["no_oss"],
+    deps = [
+        ":interpreter",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:platform_test",
+    ],
+)
+
 py_library(
     name = "lite",
     srcs = ["lite.py"],
@@ -37,7 +69,10 @@ py_test(
     name = "lite_test",
     srcs = ["lite_test.py"],
     srcs_version = "PY2AND3",
-    tags = ["no_oss"],
+    tags = [
+        "no-internal-py3",
+        "no_oss",
+    ],
     deps = [
         ":lite",
         ":op_hint",
diff --git a/tensorflow/contrib/lite/python/interpreter.py b/tensorflow/contrib/lite/python/interpreter.py
new file mode 100644
index 0000000000000000000000000000000000000000..b8638007f7e49737726d9939a00e8cb1d6a41281
--- /dev/null
+++ b/tensorflow/contrib/lite/python/interpreter.py
@@ -0,0 +1,151 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Python TF-Lite interpreter."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.lite.python.interpreter_wrapper import tensorflow_wrap_interpreter_wrapper as interpreter_wrapper
+
+
+class Interpreter(object):
+  """Interpreter interface for TF-Lite Models."""
+
+  def __init__(self, model_path=None, model_content=None):
+    """Constructor.
+
+    Args:
+      model_path: Path to TF-Lite Flatbuffer file.
+      model_content: Content of model.
+
+    Raises:
+      ValueError: If the interpreter could not be created.
+    """
+    if model_path and not model_content:
+      self._interpreter = (
+          interpreter_wrapper.InterpreterWrapper_CreateWrapperCPPFromFile(
+              model_path))
+      if not self._interpreter:
+        raise ValueError('Failed to open {}'.format(model_path))
+    elif model_content and not model_path:
+      self._interpreter = (
+          interpreter_wrapper.InterpreterWrapper_CreateWrapperCPPFromBuffer(
+              model_content, len(model_content)))
+      if not self._interpreter:
+        raise ValueError(
+            'Failed to create model from {} bytes'.format(len(model_content)))
+    elif not model_path and not model_content:
+      raise ValueError('`model_path` or `model_content` must be specified.')
+    else:
+      raise ValueError('Can\'t provide both `model_path` and `model_content`')
+
+  def allocate_tensors(self):
+    if not self._interpreter.AllocateTensors():
+      raise ValueError('Failed to allocate tensors')
+
+  def _get_tensor_details(self, tensor_index):
+    """Gets tensor details.
+
+    Args:
+      tensor_index: Tensor index of tensor to query.
+
+    Returns:
+      A dict with the tensor's name, index, shape, dtype and quantization.
+
+    Raises:
+      ValueError: If tensor_index is invalid.
+    """
+    tensor_index = int(tensor_index)
+    tensor_name = self._interpreter.TensorName(tensor_index)
+    tensor_size = self._interpreter.TensorSize(tensor_index)
+    tensor_type = self._interpreter.TensorType(tensor_index)
+    tensor_quantization = self._interpreter.TensorQuantization(tensor_index)
+
+    if not tensor_name or not tensor_type:
+      raise ValueError('Could not get tensor details')
+
+    details = {
+        'name': tensor_name,
+        'index': tensor_index,
+        'shape': tensor_size,
+        'dtype': tensor_type,
+        'quantization': tensor_quantization,
+    }
+
+    return details
+
+  def get_input_details(self):
+    """Gets model input details.
+
+    Returns:
+      A list of input details.
+    """
+    return [
+        self._get_tensor_details(i) for i in self._interpreter.InputIndices()
+    ]
+
+  def set_tensor(self, tensor_index, value):
+    """Sets the value of the input.
+
+    Args:
+      tensor_index: Tensor index of tensor to set. This value can be gotten from
+                    the 'index' field in get_input_details.
+      value: Value of tensor to set.
+
+    Raises:
+      ValueError: If the interpreter could not set the tensor.
+    """
+    if not self._interpreter.SetTensor(tensor_index, value):
+      raise ValueError('Failed to set tensor')
+
+  def resize_tensor_input(self, input_index, tensor_size):
+    """Resizes an input tensor.
+
+    Args:
+      input_index: Tensor index of input to set. This value can be gotten from
+                   the 'index' field in get_input_details.
+      tensor_size: The tensor_shape to resize the input to.
+
+    Raises:
+      ValueError: If the interpreter could not resize the input tensor.
+    """
+    if not self._interpreter.ResizeInputTensor(input_index, tensor_size):
+      raise ValueError('Failed to resize input')
+
+  def get_output_details(self):
+    """Gets model output details.
+
+    Returns:
+      A list of output details.
+    """
+    return [
+        self._get_tensor_details(i) for i in self._interpreter.OutputIndices()
+    ]
+
+  def get_tensor(self, tensor_index):
+    """Gets the value of a tensor.
+
+    Args:
+      tensor_index: Tensor index of tensor to get. This value can be gotten from
+                    the 'index' field in get_output_details.
+
+    Returns:
+      a numpy array.
+    """
+    return self._interpreter.GetTensor(tensor_index)
+
+  def invoke(self):
+    if not self._interpreter.Invoke():
+      raise ValueError('Failed to invoke TFLite model')
diff --git a/tensorflow/contrib/lite/python/interpreter_test.py b/tensorflow/contrib/lite/python/interpreter_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..cd2386f5263f24e1e034015ec6880e71f0608c7c
--- /dev/null
+++ b/tensorflow/contrib/lite/python/interpreter_test.py
@@ -0,0 +1,92 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""TensorFlow Lite Python Interface: Sanity check."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import io
+import numpy as np
+
+from tensorflow.contrib.lite.python import interpreter as interpreter_wrapper
+from tensorflow.python.framework import test_util
+from tensorflow.python.platform import resource_loader
+from tensorflow.python.platform import test
+
+
+class InterpreterTest(test_util.TensorFlowTestCase):
+
+  def testFloat(self):
+    interpreter = interpreter_wrapper.Interpreter(
+        model_path=resource_loader.get_path_to_datafile(
+            'testdata/permute_float.tflite'))
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(1, len(input_details))
+    self.assertEqual('input', input_details[0]['name'])
+    self.assertEqual(np.float32, input_details[0]['dtype'])
+    self.assertTrue(([1, 4] == input_details[0]['shape']).all())
+    self.assertEqual((0.0, 0), input_details[0]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('output', output_details[0]['name'])
+    self.assertEqual(np.float32, output_details[0]['dtype'])
+    self.assertTrue(([1, 4] == output_details[0]['shape']).all())
+    self.assertEqual((0.0, 0), output_details[0]['quantization'])
+
+    test_input = np.array([[1.0, 2.0, 3.0, 4.0]], dtype=np.float32)
+    expected_output = np.array([[4.0, 3.0, 2.0, 1.0]], dtype=np.float32)
+    interpreter.set_tensor(input_details[0]['index'], test_input)
+    interpreter.invoke()
+
+    output_data = interpreter.get_tensor(output_details[0]['index'])
+    self.assertTrue((expected_output == output_data).all())
+
+  def testUint8(self):
+    model_path = resource_loader.get_path_to_datafile(
+        'testdata/permute_uint8.tflite')
+    with io.open(model_path, 'rb') as model_file:
+      data = model_file.read()
+
+    interpreter = interpreter_wrapper.Interpreter(model_content=data)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(1, len(input_details))
+    self.assertEqual('input', input_details[0]['name'])
+    self.assertEqual(np.uint8, input_details[0]['dtype'])
+    self.assertTrue(([1, 4] == input_details[0]['shape']).all())
+    self.assertEqual((1.0, 0), input_details[0]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('output', output_details[0]['name'])
+    self.assertEqual(np.uint8, output_details[0]['dtype'])
+    self.assertTrue(([1, 4] == output_details[0]['shape']).all())
+    self.assertEqual((1.0, 0), output_details[0]['quantization'])
+
+    test_input = np.array([[1, 2, 3, 4]], dtype=np.uint8)
+    expected_output = np.array([[4, 3, 2, 1]], dtype=np.uint8)
+    interpreter.set_tensor(input_details[0]['index'], test_input)
+    interpreter.invoke()
+
+    output_data = interpreter.get_tensor(output_details[0]['index'])
+    self.assertTrue((expected_output == output_data).all())
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD b/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..453eda6e7345762666917fd501b69c7181c349e8
--- /dev/null
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD
@@ -0,0 +1,32 @@
+package(
+    default_visibility = ["//visibility:public"],
+)
+
+licenses(["notice"])  # Apache 2.0
+
+load("//tensorflow:tensorflow.bzl", "tf_py_wrap_cc")
+
+cc_library(
+    name = "interpreter_wrapper_lib",
+    srcs = ["interpreter_wrapper.cc"],
+    hdrs = ["interpreter_wrapper.h"],
+    deps = [
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:builtin_ops",
+        "//tensorflow/core:lib",
+        "//tensorflow/python:numpy_lib",
+        "//util/python:python_headers",
+        "@com_google_absl//absl/memory",
+    ],
+)
+
+tf_py_wrap_cc(
+    name = "tensorflow_wrap_interpreter_wrapper",
+    srcs = [
+        "interpreter_wrapper.i",
+    ],
+    deps = [
+        ":interpreter_wrapper_lib",
+        "//util/python:python_headers",
+    ],
+)
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc
new file mode 100644
index 0000000000000000000000000000000000000000..35ad226b78c906f0819afd5b029a1a0d438d69af
--- /dev/null
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc
@@ -0,0 +1,337 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h"
+
+#include <string>
+
+#include "absl/memory/memory.h"
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/model.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/python/lib/core/numpy.h"
+
+#if PY_MAJOR_VERSION >= 3
+#define PY_TO_CPPSTRING PyBytes_AsStringAndSize
+#define CPP_TO_PYSTRING PyBytes_FromStringAndSize
+#else
+#define PY_TO_CPPSTRING PyString_AsStringAndSize
+#define CPP_TO_PYSTRING PyString_FromStringAndSize
+#endif
+
+namespace tflite {
+namespace interpreter_wrapper {
+
+namespace {
+std::unique_ptr<tflite::Interpreter> CreateInterpreter(
+    const tflite::FlatBufferModel* model,
+    const tflite::ops::builtin::BuiltinOpResolver& resolver) {
+  if (!model) {
+    return nullptr;
+  }
+
+  std::unique_ptr<tflite::Interpreter> interpreter;
+  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
+  if (interpreter) {
+    for (const int input_index : interpreter->inputs()) {
+      const TfLiteTensor* tensor = interpreter->tensor(input_index);
+      CHECK(tensor);
+      const TfLiteIntArray* dims = tensor->dims;
+      if (!dims) {
+        continue;
+      }
+
+      std::vector<int> input_dims(dims->data, dims->data + dims->size);
+      interpreter->ResizeInputTensor(input_index, input_dims);
+    }
+  }
+  return interpreter;
+}
+
+int TfLiteTypeToPyArrayType(TfLiteType tf_lite_type) {
+  switch (tf_lite_type) {
+    case kTfLiteFloat32:
+      return NPY_FLOAT32;
+    case kTfLiteInt32:
+      return NPY_INT32;
+    case kTfLiteUInt8:
+      return NPY_UINT8;
+    case kTfLiteInt64:
+      return NPY_INT64;
+    case kTfLiteString:
+      return NPY_OBJECT;
+    case kTfLiteNoType:
+      return -1;
+  }
+  LOG(ERROR) << "Unknown TfLiteType " << tf_lite_type;
+  return -1;
+}
+
+TfLiteType TfLiteTypeFromPyArray(PyArrayObject* array) {
+  int pyarray_type = PyArray_TYPE(array);
+  switch (pyarray_type) {
+    case NPY_FLOAT32:
+      return kTfLiteFloat32;
+    case NPY_INT32:
+      return kTfLiteInt32;
+    case NPY_UINT8:
+      return kTfLiteUInt8;
+    case NPY_INT64:
+      return kTfLiteInt64;
+    case NPY_OBJECT:
+    case NPY_STRING:
+    case NPY_UNICODE:
+      return kTfLiteString;
+  }
+  LOG(ERROR) << "Unknown PyArray dtype " << pyarray_type;
+  return kTfLiteNoType;
+}
+
+struct PyDecrefDeleter {
+  void operator()(PyObject* p) const { Py_DECREF(p); }
+};
+
+PyObject* PyArrayFromIntVector(const int* data, npy_intp size) {
+  void* pydata = malloc(size * sizeof(int));
+  memcpy(pydata, data, size * sizeof(int));
+  return PyArray_SimpleNewFromData(1, &size, NPY_INT32, pydata);
+}
+
+PyObject* PyTupleFromQuantizationParam(const TfLiteQuantizationParams& param) {
+  PyObject* result = PyTuple_New(2);
+  PyTuple_SET_ITEM(result, 0, PyFloat_FromDouble(param.scale));
+  PyTuple_SET_ITEM(result, 1, PyLong_FromLong(param.zero_point));
+  return result;
+}
+
+}  // namespace
+
+InterpreterWrapper::InterpreterWrapper(
+    std::unique_ptr<tflite::FlatBufferModel> model)
+    : model_(std::move(model)),
+      resolver_(absl::make_unique<tflite::ops::builtin::BuiltinOpResolver>()),
+      interpreter_(CreateInterpreter(model_.get(), *resolver_)) {}
+
+InterpreterWrapper::~InterpreterWrapper() {}
+
+bool InterpreterWrapper::AllocateTensors() {
+  if (!interpreter_) {
+    LOG(ERROR) << "Cannot allocate tensors: invalid interpreter.";
+    return false;
+  }
+
+  if (interpreter_->AllocateTensors() != kTfLiteOk) {
+    LOG(ERROR) << "Unable to allocate tensors.";
+    return false;
+  }
+
+  return true;
+}
+
+bool InterpreterWrapper::Invoke() {
+  return interpreter_ ? (interpreter_->Invoke() == kTfLiteOk) : false;
+}
+
+PyObject* InterpreterWrapper::InputIndices() const {
+  PyObject* np_array = PyArrayFromIntVector(interpreter_->inputs().data(),
+                                            interpreter_->inputs().size());
+
+  return PyArray_Return(reinterpret_cast<PyArrayObject*>(np_array));
+}
+
+PyObject* InterpreterWrapper::OutputIndices() const {
+  PyObject* np_array = PyArrayFromIntVector(interpreter_->outputs().data(),
+                                            interpreter_->outputs().size());
+
+  return PyArray_Return(reinterpret_cast<PyArrayObject*>(np_array));
+}
+
+bool InterpreterWrapper::ResizeInputTensor(int i, PyObject* value) {
+  if (!interpreter_) {
+    LOG(ERROR) << "Invalid interpreter.";
+    return false;
+  }
+
+  std::unique_ptr<PyObject, PyDecrefDeleter> array_safe(
+      PyArray_FromAny(value, nullptr, 0, 0, NPY_ARRAY_CARRAY, nullptr));
+  if (!array_safe) {
+    LOG(ERROR) << "Failed to convert value into readable tensor.";
+    return false;
+  }
+
+  PyArrayObject* array = reinterpret_cast<PyArrayObject*>(array_safe.get());
+
+  if (PyArray_NDIM(array) != 1) {
+    LOG(ERROR) << "Expected 1-D array defining input shape.";
+    return false;
+  }
+
+  if (PyArray_TYPE(array) != NPY_INT32) {
+    LOG(ERROR) << "Shape must be an int32 array";
+    return false;
+  }
+
+  std::vector<int> dims(PyArray_SHAPE(array)[0]);
+  memcpy(dims.data(), PyArray_BYTES(array), dims.size() * sizeof(int));
+
+  return interpreter_->ResizeInputTensor(i, dims);
+}
+
+std::string InterpreterWrapper::TensorName(int i) const {
+  if (!interpreter_ || i >= interpreter_->tensors_size() || i < 0) {
+    return "";
+  }
+
+  const TfLiteTensor* tensor = interpreter_->tensor(i);
+  return tensor->name;
+}
+
+PyObject* InterpreterWrapper::TensorType(int i) const {
+  if (!interpreter_ || i >= interpreter_->tensors_size() || i < 0) {
+    return nullptr;
+  }
+
+  const TfLiteTensor* tensor = interpreter_->tensor(i);
+  int typenum = TfLiteTypeToPyArrayType(tensor->type);
+  return PyArray_TypeObjectFromType(typenum);
+}
+
+PyObject* InterpreterWrapper::TensorSize(int i) const {
+  if (!interpreter_ || i >= interpreter_->tensors_size() || i < 0) {
+    Py_INCREF(Py_None);
+    return Py_None;
+  }
+
+  const TfLiteTensor* tensor = interpreter_->tensor(i);
+  PyObject* np_array =
+      PyArrayFromIntVector(tensor->dims->data, tensor->dims->size);
+
+  return PyArray_Return(reinterpret_cast<PyArrayObject*>(np_array));
+}
+
+PyObject* InterpreterWrapper::TensorQuantization(int i) const {
+  if (!interpreter_ || i >= interpreter_->tensors_size() || i < 0) {
+    Py_INCREF(Py_None);
+    return Py_None;
+  }
+
+  const TfLiteTensor* tensor = interpreter_->tensor(i);
+  return PyTupleFromQuantizationParam(tensor->params);
+}
+
+bool InterpreterWrapper::SetTensor(int i, PyObject* value) {
+  if (!interpreter_) {
+    LOG(ERROR) << "Invalid interpreter.";
+    return false;
+  }
+
+  if (i >= interpreter_->tensors_size()) {
+    LOG(ERROR) << "Invalid tensor index: " << i << " exceeds max tensor index "
+               << interpreter_->tensors_size();
+    return false;
+  }
+
+  std::unique_ptr<PyObject, PyDecrefDeleter> array_safe(
+      PyArray_FromAny(value, nullptr, 0, 0, NPY_ARRAY_CARRAY, nullptr));
+  if (!array_safe) {
+    LOG(ERROR) << "Failed to convert value into readable tensor.";
+    return false;
+  }
+
+  PyArrayObject* array = reinterpret_cast<PyArrayObject*>(array_safe.get());
+  const TfLiteTensor* tensor = interpreter_->tensor(i);
+
+  if (TfLiteTypeFromPyArray(array) != tensor->type) {
+    LOG(ERROR) << "Cannot set tensor:"
+               << " Got tensor of type " << TfLiteTypeFromPyArray(array)
+               << " but expected type " << tensor->type << " for input " << i;
+    return false;
+  }
+
+  if (PyArray_NDIM(array) != tensor->dims->size) {
+    LOG(ERROR) << "Cannot set tensor: Dimension mismatch";
+    return false;
+  }
+
+  for (int j = 0; j < PyArray_NDIM(array); j++) {
+    if (tensor->dims->data[j] != PyArray_SHAPE(array)[j]) {
+      LOG(ERROR) << "Cannot set tensor: Dimension mismatch";
+      return false;
+    }
+  }
+
+  size_t size = PyArray_NBYTES(array);
+  DCHECK_EQ(size, tensor->bytes);
+  memcpy(tensor->data.raw, PyArray_DATA(array), size);
+  return true;
+}
+
+PyObject* InterpreterWrapper::GetTensor(int i) const {
+  if (!interpreter_) {
+    LOG(ERROR) << "Invalid interpreter.";
+    Py_INCREF(Py_None);
+    return Py_None;
+  }
+
+  if (i >= interpreter_->tensors_size()) {
+    LOG(ERROR) << "Invalid tensor index: " << i << " exceeds max tensor index "
+               << interpreter_->tensors_size();
+    Py_INCREF(Py_None);
+    return Py_None;
+  }
+
+  const TfLiteTensor* output_tensor = interpreter_->tensor(i);
+  const int tensor_size = output_tensor->bytes;
+  if (tensor_size <= 0) {
+    LOG(ERROR) << "Invalid tensor size";
+    Py_INCREF(Py_None);
+    return Py_None;
+  }
+
+  int type_num = TfLiteTypeToPyArrayType(output_tensor->type);
+  if (type_num == -1) {
+    LOG(ERROR) << "Unknown tensor type " << output_tensor->type;
+    Py_INCREF(Py_None);
+    return Py_None;
+  }
+
+  void* data = malloc(tensor_size);
+  memcpy(data, output_tensor->data.raw, tensor_size);
+
+  const TfLiteIntArray* output_dims = output_tensor->dims;
+  std::vector<npy_intp> dims(output_dims->data,
+                             output_dims->data + output_dims->size);
+  PyObject* np_array =
+      PyArray_SimpleNewFromData(dims.size(), dims.data(), type_num, data);
+
+  return PyArray_Return(reinterpret_cast<PyArrayObject*>(np_array));
+}
+
+InterpreterWrapper* InterpreterWrapper::CreateWrapperCPPFromFile(
+    const char* model_path) {
+  std::unique_ptr<tflite::FlatBufferModel> model =
+      tflite::FlatBufferModel::BuildFromFile(model_path);
+  return model ? new InterpreterWrapper(std::move(model)) : nullptr;
+}
+
+InterpreterWrapper* InterpreterWrapper::CreateWrapperCPPFromBuffer(
+    const char* data, size_t len) {
+  std::unique_ptr<tflite::FlatBufferModel> model =
+      tflite::FlatBufferModel::BuildFromBuffer(data, len);
+  return model ? new InterpreterWrapper(std::move(model)) : nullptr;
+}
+
+}  // namespace interpreter_wrapper
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h
new file mode 100644
index 0000000000000000000000000000000000000000..0972c572595f5044a305a81afaccbea5f131247c
--- /dev/null
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h
@@ -0,0 +1,77 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_PYTHON_INTERPRETER_WRAPPER_INTERPRETER_WRAPPER_H_
+#define TENSORFLOW_CONTRIB_LITE_PYTHON_INTERPRETER_WRAPPER_INTERPRETER_WRAPPER_H_
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include <Python.h>
+
+// We forward declare TFLite classes here to avoid exposing them to SWIG.
+namespace tflite {
+namespace ops {
+namespace builtin {
+class BuiltinOpResolver;
+}  // namespace builtin
+}  // namespace ops
+
+class FlatBufferModel;
+class Interpreter;
+
+namespace interpreter_wrapper {
+
+class InterpreterWrapper {
+ public:
+  // SWIG caller takes ownership of pointer.
+  static InterpreterWrapper* CreateWrapperCPPFromFile(const char* model_path);
+
+  // SWIG caller takes ownership of pointer.
+  static InterpreterWrapper* CreateWrapperCPPFromBuffer(const char* data,
+                                                        size_t len);
+
+  ~InterpreterWrapper();
+  bool AllocateTensors();
+  bool Invoke();
+
+  PyObject* InputIndices() const;
+  PyObject* OutputIndices() const;
+  bool ResizeInputTensor(int i, PyObject* value);
+
+  std::string TensorName(int i) const;
+  PyObject* TensorType(int i) const;
+  PyObject* TensorSize(int i) const;
+  PyObject* TensorQuantization(int i) const;
+  bool SetTensor(int i, PyObject* value);
+  PyObject* GetTensor(int i) const;
+
+ private:
+  InterpreterWrapper(std::unique_ptr<tflite::FlatBufferModel> model);
+
+  // InterpreterWrapper is not copyable or assignable. We avoid the use of
+  // InterpreterWrapper() = delete here for SWIG compatibility.
+  InterpreterWrapper();
+  InterpreterWrapper(const InterpreterWrapper& rhs);
+
+  const std::unique_ptr<tflite::FlatBufferModel> model_;
+  const std::unique_ptr<tflite::ops::builtin::BuiltinOpResolver> resolver_;
+  const std::unique_ptr<tflite::Interpreter> interpreter_;
+};
+
+}  // namespace interpreter_wrapper
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_PYTHON_INTERPRETER_WRAPPER_INTERPRETER_WRAPPER_H_
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.i b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.i
new file mode 100644
index 0000000000000000000000000000000000000000..7f51f9f00d1b2fe057052f7b7bd52bcb65231164
--- /dev/null
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.i
@@ -0,0 +1,25 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+%include "std_string.i"
+
+
+%{
+#define SWIG_FILE_WITH_INIT
+#include "tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h"
+%}
+
+
+%include "tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h"
diff --git a/tensorflow/contrib/lite/python/lite.py b/tensorflow/contrib/lite/python/lite.py
index 5d2f21653762a405a57288a7ba38323e5e42b3e1..35d224924ee4d8cd94543e10c082afee25b7630e 100644
--- a/tensorflow/contrib/lite/python/lite.py
+++ b/tensorflow/contrib/lite/python/lite.py
@@ -202,11 +202,12 @@ def toco_convert(input_data,
 
     input_array.name = _tensor_name(input_tensor)
     input_array.shape.dims.extend(map(int, input_tensor.get_shape()))
-    toco.inference_input_type = tflite_input_type
 
   for output_tensor in output_tensors:
     model.output_arrays.append(_tensor_name(output_tensor))
 
+  # TODO(aselle): Consider handling the case of allowing quantized
+  # inputs to be converted to float (via the toco.inference_input_type field).
   data = toco_convert_protos(model.SerializeToString(),
                              toco.SerializeToString(),
                              input_data.SerializeToString())
diff --git a/tensorflow/contrib/lite/python/op_hint.py b/tensorflow/contrib/lite/python/op_hint.py
index 9a3971228a683211e84b4c55d3a3e8d574b5ed94..7908689ce4a719ab15bd49a368a87f9cad7c6d61 100644
--- a/tensorflow/contrib/lite/python/op_hint.py
+++ b/tensorflow/contrib/lite/python/op_hint.py
@@ -119,8 +119,10 @@ class OpHint(object):
 
   def _setattr(self, dest_op, name, value):
     tensor_value = _ops.convert_to_tensor(value)
-    dest_op.op.node_def.attr[name].tensor.CopyFrom(
-        tensor_value.op.node_def.attr["value"].tensor)
+    # pylint: disable=protected-access
+    dest_op.op._set_attr(name, _attr_value_pb2.AttrValue(
+        tensor=tensor_value.op.node_def.attr["value"].tensor))
+    # pylint: enable=protected-access
 
   def add_inputs(self, *args):
     """Add a sequence of inputs to the function invocation.
diff --git a/tensorflow/contrib/lite/rpi_makefile.inc b/tensorflow/contrib/lite/rpi_makefile.inc
new file mode 100644
index 0000000000000000000000000000000000000000..832ef5824bea86a368184bd7e3d17915739e9d46
--- /dev/null
+++ b/tensorflow/contrib/lite/rpi_makefile.inc
@@ -0,0 +1,33 @@
+# Settings for Raspberry Pi.
+ifeq ($(TARGET), RPI)
+	ifeq ($(TARGET_ARCH), armv7)
+		CXXFLAGS += \
+			-march=armv7-a \
+			-mfpu=neon-vfpv4 \
+			-funsafe-math-optimizations \
+			-ftree-vectorize
+
+		CCFLAGS += \
+			-march=armv7-a \
+			-mfpu=neon-vfpv4 \
+			-funsafe-math-optimizations \
+			-ftree-vectorize
+
+		LDFLAGS := \
+			-Wl,--no-export-dynamic \
+			-Wl,--exclude-libs,ALL \
+			-Wl,--gc-sections \
+			-Wl,--as-needed
+	endif
+
+	LIBS := \
+	-lstdc++ \
+	-lpthread \
+	-lm \
+	-ldl
+
+	OBJDIR := $(OBJDIR)rpi_$(TARGET_ARCH)/
+	LIBDIR := $(LIBDIR)rpi_$(TARGET_ARCH)/
+	BINDIR := $(BINDIR)rpi_$(TARGET_ARCH)/
+	DEPDIR := $(DEPDIR)rpi_$(TARGET_ARCH)/
+endif
diff --git a/tensorflow/contrib/lite/schema/BUILD b/tensorflow/contrib/lite/schema/BUILD
index 54167ddd9a5a003d0ff21e6627a1dbe94afa3e87..da65ec659c7ab39348d2b7911aceaa9dbdd2654b 100644
--- a/tensorflow/contrib/lite/schema/BUILD
+++ b/tensorflow/contrib/lite/schema/BUILD
@@ -5,6 +5,7 @@ package(default_visibility = [
 licenses(["notice"])  # Apache 2.0
 
 load("//tensorflow:tensorflow.bzl", "py_test")
+load("//tensorflow/contrib/lite:special_rules.bzl", "tflite_portable_test_suite")
 
 py_binary(
     name = "upgrade_schema",
@@ -80,3 +81,5 @@ filegroup(
     ),
     visibility = ["//tensorflow:__subpackages__"],
 )
+
+tflite_portable_test_suite()
diff --git a/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc b/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc
index 08bcfe451685f488be2c3bc180f2dfc43dfe4f05..ac408d2f94b98d505afe4c951d7cc2ff960606fb 100644
--- a/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc
+++ b/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc
@@ -46,8 +46,7 @@ extern "C" {
 #endif  // __cplusplus
 
 // The enum for builtin operators.
-// Note: CUSTOM and DELEGATE are 2 special ops which are not real biultin
-// ops.
+// Note: CUSTOM and DELEGATE are 2 special ops which are not real built-in ops.
 typedef enum {
 )";
 
diff --git a/tensorflow/contrib/lite/schema/schema.fbs b/tensorflow/contrib/lite/schema/schema.fbs
index d97be553aac55e7d7606a50ed51ab08c9f4c60de..7d2e00fe329a5da77af7bf091eaa99badbd1022a 100644
--- a/tensorflow/contrib/lite/schema/schema.fbs
+++ b/tensorflow/contrib/lite/schema/schema.fbs
@@ -75,7 +75,7 @@ enum BuiltinOperator : byte {
   CONV_2D = 3,
   DEPTHWISE_CONV_2D = 4,
   // DEPTH_TO_SPACE = 5,
-  // DEQUANTIZE = 6,
+  DEQUANTIZE = 6,
   EMBEDDING_LOOKUP = 7,
   // FLOOR = 8,
   FULLY_CONNECTED = 9,
@@ -129,7 +129,9 @@ enum BuiltinOperator : byte {
   // WARNING: Experimental interface, subject to change
   DELEGATE = 51,
   BIDIRECTIONAL_SEQUENCE_LSTM = 52,
-  MAXIMUM = 53,
+  CAST = 53,
+  PRELU = 54,
+  MAXIMUM = 55,
 }
 
 // Options for the builtin operators.
@@ -170,6 +172,8 @@ union BuiltinOptions {
   TopKV2Options,
   SplitOptions,
   LogSoftmaxOptions,
+  CastOptions,
+  DequantizeOptions,
   MaximumOptions,
 }
 
@@ -376,6 +380,12 @@ table StridedSliceOptions {
 table LogSoftmaxOptions {
 }
 
+table CastOptions {
+}
+
+table DequantizeOptions {
+}
+
 table MaximumOptions {
 }
 
diff --git a/tensorflow/contrib/lite/schema/schema_generated.h b/tensorflow/contrib/lite/schema/schema_generated.h
index b8f3e0bb84abf19e97da2de95aa4bdfa1960b51d..66a97a1460d12b48102f53f975cb1e25e7735111 100755
--- a/tensorflow/contrib/lite/schema/schema_generated.h
+++ b/tensorflow/contrib/lite/schema/schema_generated.h
@@ -139,6 +139,12 @@ struct StridedSliceOptionsT;
 struct LogSoftmaxOptions;
 struct LogSoftmaxOptionsT;
 
+struct CastOptions;
+struct CastOptionsT;
+
+struct DequantizeOptions;
+struct DequantizeOptionsT;
+
 struct MaximumOptions;
 struct MaximumOptionsT;
 
@@ -204,6 +210,7 @@ enum BuiltinOperator {
   BuiltinOperator_CONCATENATION = 2,
   BuiltinOperator_CONV_2D = 3,
   BuiltinOperator_DEPTHWISE_CONV_2D = 4,
+  BuiltinOperator_DEQUANTIZE = 6,
   BuiltinOperator_EMBEDDING_LOOKUP = 7,
   BuiltinOperator_FULLY_CONNECTED = 9,
   BuiltinOperator_HASHTABLE_LOOKUP = 10,
@@ -249,18 +256,21 @@ enum BuiltinOperator {
   BuiltinOperator_LOG_SOFTMAX = 50,
   BuiltinOperator_DELEGATE = 51,
   BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM = 52,
-  BuiltinOperator_MAXIMUM = 53,
+  BuiltinOperator_CAST = 53,
+  BuiltinOperator_PRELU = 54,
+  BuiltinOperator_MAXIMUM = 55,
   BuiltinOperator_MIN = BuiltinOperator_ADD,
   BuiltinOperator_MAX = BuiltinOperator_MAXIMUM
 };
 
-inline BuiltinOperator (&EnumValuesBuiltinOperator())[51] {
+inline BuiltinOperator (&EnumValuesBuiltinOperator())[54] {
   static BuiltinOperator values[] = {
     BuiltinOperator_ADD,
     BuiltinOperator_AVERAGE_POOL_2D,
     BuiltinOperator_CONCATENATION,
     BuiltinOperator_CONV_2D,
     BuiltinOperator_DEPTHWISE_CONV_2D,
+    BuiltinOperator_DEQUANTIZE,
     BuiltinOperator_EMBEDDING_LOOKUP,
     BuiltinOperator_FULLY_CONNECTED,
     BuiltinOperator_HASHTABLE_LOOKUP,
@@ -306,6 +316,8 @@ inline BuiltinOperator (&EnumValuesBuiltinOperator())[51] {
     BuiltinOperator_LOG_SOFTMAX,
     BuiltinOperator_DELEGATE,
     BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM,
+    BuiltinOperator_CAST,
+    BuiltinOperator_PRELU,
     BuiltinOperator_MAXIMUM
   };
   return values;
@@ -319,7 +331,7 @@ inline const char **EnumNamesBuiltinOperator() {
     "CONV_2D",
     "DEPTHWISE_CONV_2D",
     "",
-    "",
+    "DEQUANTIZE",
     "EMBEDDING_LOOKUP",
     "",
     "FULLY_CONNECTED",
@@ -366,6 +378,8 @@ inline const char **EnumNamesBuiltinOperator() {
     "LOG_SOFTMAX",
     "DELEGATE",
     "BIDIRECTIONAL_SEQUENCE_LSTM",
+    "CAST",
+    "PRELU",
     "MAXIMUM",
     nullptr
   };
@@ -415,12 +429,14 @@ enum BuiltinOptions {
   BuiltinOptions_TopKV2Options = 34,
   BuiltinOptions_SplitOptions = 35,
   BuiltinOptions_LogSoftmaxOptions = 36,
-  BuiltinOptions_MaximumOptions = 37,
+  BuiltinOptions_CastOptions = 37,
+  BuiltinOptions_DequantizeOptions = 38,
+  BuiltinOptions_MaximumOptions = 39,
   BuiltinOptions_MIN = BuiltinOptions_NONE,
   BuiltinOptions_MAX = BuiltinOptions_MaximumOptions
 };
 
-inline BuiltinOptions (&EnumValuesBuiltinOptions())[38] {
+inline BuiltinOptions (&EnumValuesBuiltinOptions())[40] {
   static BuiltinOptions values[] = {
     BuiltinOptions_NONE,
     BuiltinOptions_Conv2DOptions,
@@ -459,6 +475,8 @@ inline BuiltinOptions (&EnumValuesBuiltinOptions())[38] {
     BuiltinOptions_TopKV2Options,
     BuiltinOptions_SplitOptions,
     BuiltinOptions_LogSoftmaxOptions,
+    BuiltinOptions_CastOptions,
+    BuiltinOptions_DequantizeOptions,
     BuiltinOptions_MaximumOptions
   };
   return values;
@@ -503,6 +521,8 @@ inline const char **EnumNamesBuiltinOptions() {
     "TopKV2Options",
     "SplitOptions",
     "LogSoftmaxOptions",
+    "CastOptions",
+    "DequantizeOptions",
     "MaximumOptions",
     nullptr
   };
@@ -662,6 +682,14 @@ template<> struct BuiltinOptionsTraits<LogSoftmaxOptions> {
   static const BuiltinOptions enum_value = BuiltinOptions_LogSoftmaxOptions;
 };
 
+template<> struct BuiltinOptionsTraits<CastOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_CastOptions;
+};
+
+template<> struct BuiltinOptionsTraits<DequantizeOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_DequantizeOptions;
+};
+
 template<> struct BuiltinOptionsTraits<MaximumOptions> {
   static const BuiltinOptions enum_value = BuiltinOptions_MaximumOptions;
 };
@@ -985,6 +1013,22 @@ struct BuiltinOptionsUnion {
     return type == BuiltinOptions_LogSoftmaxOptions ?
       reinterpret_cast<const LogSoftmaxOptionsT *>(value) : nullptr;
   }
+  CastOptionsT *AsCastOptions() {
+    return type == BuiltinOptions_CastOptions ?
+      reinterpret_cast<CastOptionsT *>(value) : nullptr;
+  }
+  const CastOptionsT *AsCastOptions() const {
+    return type == BuiltinOptions_CastOptions ?
+      reinterpret_cast<const CastOptionsT *>(value) : nullptr;
+  }
+  DequantizeOptionsT *AsDequantizeOptions() {
+    return type == BuiltinOptions_DequantizeOptions ?
+      reinterpret_cast<DequantizeOptionsT *>(value) : nullptr;
+  }
+  const DequantizeOptionsT *AsDequantizeOptions() const {
+    return type == BuiltinOptions_DequantizeOptions ?
+      reinterpret_cast<const DequantizeOptionsT *>(value) : nullptr;
+  }
   MaximumOptionsT *AsMaximumOptions() {
     return type == BuiltinOptions_MaximumOptions ?
       reinterpret_cast<MaximumOptionsT *>(value) : nullptr;
@@ -3656,6 +3700,86 @@ inline flatbuffers::Offset<LogSoftmaxOptions> CreateLogSoftmaxOptions(
 
 flatbuffers::Offset<LogSoftmaxOptions> CreateLogSoftmaxOptions(flatbuffers::FlatBufferBuilder &_fbb, const LogSoftmaxOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
 
+struct CastOptionsT : public flatbuffers::NativeTable {
+  typedef CastOptions TableType;
+  CastOptionsT() {
+  }
+};
+
+struct CastOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef CastOptionsT NativeTableType;
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           verifier.EndTable();
+  }
+  CastOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(CastOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<CastOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const CastOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct CastOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  explicit CastOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  CastOptionsBuilder &operator=(const CastOptionsBuilder &);
+  flatbuffers::Offset<CastOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<CastOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<CastOptions> CreateCastOptions(
+    flatbuffers::FlatBufferBuilder &_fbb) {
+  CastOptionsBuilder builder_(_fbb);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<CastOptions> CreateCastOptions(flatbuffers::FlatBufferBuilder &_fbb, const CastOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
+struct DequantizeOptionsT : public flatbuffers::NativeTable {
+  typedef DequantizeOptions TableType;
+  DequantizeOptionsT() {
+  }
+};
+
+struct DequantizeOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef DequantizeOptionsT NativeTableType;
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           verifier.EndTable();
+  }
+  DequantizeOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(DequantizeOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<DequantizeOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const DequantizeOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct DequantizeOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  explicit DequantizeOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  DequantizeOptionsBuilder &operator=(const DequantizeOptionsBuilder &);
+  flatbuffers::Offset<DequantizeOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<DequantizeOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<DequantizeOptions> CreateDequantizeOptions(
+    flatbuffers::FlatBufferBuilder &_fbb) {
+  DequantizeOptionsBuilder builder_(_fbb);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<DequantizeOptions> CreateDequantizeOptions(flatbuffers::FlatBufferBuilder &_fbb, const DequantizeOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
 struct MaximumOptionsT : public flatbuffers::NativeTable {
   typedef MaximumOptions TableType;
   MaximumOptionsT() {
@@ -3921,6 +4045,12 @@ struct Operator FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
   const LogSoftmaxOptions *builtin_options_as_LogSoftmaxOptions() const {
     return builtin_options_type() == BuiltinOptions_LogSoftmaxOptions ? static_cast<const LogSoftmaxOptions *>(builtin_options()) : nullptr;
   }
+  const CastOptions *builtin_options_as_CastOptions() const {
+    return builtin_options_type() == BuiltinOptions_CastOptions ? static_cast<const CastOptions *>(builtin_options()) : nullptr;
+  }
+  const DequantizeOptions *builtin_options_as_DequantizeOptions() const {
+    return builtin_options_type() == BuiltinOptions_DequantizeOptions ? static_cast<const DequantizeOptions *>(builtin_options()) : nullptr;
+  }
   const MaximumOptions *builtin_options_as_MaximumOptions() const {
     return builtin_options_type() == BuiltinOptions_MaximumOptions ? static_cast<const MaximumOptions *>(builtin_options()) : nullptr;
   }
@@ -4094,6 +4224,14 @@ template<> inline const LogSoftmaxOptions *Operator::builtin_options_as<LogSoftm
   return builtin_options_as_LogSoftmaxOptions();
 }
 
+template<> inline const CastOptions *Operator::builtin_options_as<CastOptions>() const {
+  return builtin_options_as_CastOptions();
+}
+
+template<> inline const DequantizeOptions *Operator::builtin_options_as<DequantizeOptions>() const {
+  return builtin_options_as_DequantizeOptions();
+}
+
 template<> inline const MaximumOptions *Operator::builtin_options_as<MaximumOptions>() const {
   return builtin_options_as_MaximumOptions();
 }
@@ -5580,6 +5718,52 @@ inline flatbuffers::Offset<LogSoftmaxOptions> CreateLogSoftmaxOptions(flatbuffer
       _fbb);
 }
 
+inline CastOptionsT *CastOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new CastOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void CastOptions::UnPackTo(CastOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+}
+
+inline flatbuffers::Offset<CastOptions> CastOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const CastOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateCastOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<CastOptions> CreateCastOptions(flatbuffers::FlatBufferBuilder &_fbb, const CastOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const CastOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  return tflite::CreateCastOptions(
+      _fbb);
+}
+
+inline DequantizeOptionsT *DequantizeOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new DequantizeOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void DequantizeOptions::UnPackTo(DequantizeOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+}
+
+inline flatbuffers::Offset<DequantizeOptions> DequantizeOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const DequantizeOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateDequantizeOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<DequantizeOptions> CreateDequantizeOptions(flatbuffers::FlatBufferBuilder &_fbb, const DequantizeOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const DequantizeOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  return tflite::CreateDequantizeOptions(
+      _fbb);
+}
+
 inline MaximumOptionsT *MaximumOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
   auto _o = new MaximumOptionsT();
   UnPackTo(_o, _resolver);
@@ -5927,6 +6111,14 @@ inline bool VerifyBuiltinOptions(flatbuffers::Verifier &verifier, const void *ob
       auto ptr = reinterpret_cast<const LogSoftmaxOptions *>(obj);
       return verifier.VerifyTable(ptr);
     }
+    case BuiltinOptions_CastOptions: {
+      auto ptr = reinterpret_cast<const CastOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_DequantizeOptions: {
+      auto ptr = reinterpret_cast<const DequantizeOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
     case BuiltinOptions_MaximumOptions: {
       auto ptr = reinterpret_cast<const MaximumOptions *>(obj);
       return verifier.VerifyTable(ptr);
@@ -6093,6 +6285,14 @@ inline void *BuiltinOptionsUnion::UnPack(const void *obj, BuiltinOptions type, c
       auto ptr = reinterpret_cast<const LogSoftmaxOptions *>(obj);
       return ptr->UnPack(resolver);
     }
+    case BuiltinOptions_CastOptions: {
+      auto ptr = reinterpret_cast<const CastOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_DequantizeOptions: {
+      auto ptr = reinterpret_cast<const DequantizeOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
     case BuiltinOptions_MaximumOptions: {
       auto ptr = reinterpret_cast<const MaximumOptions *>(obj);
       return ptr->UnPack(resolver);
@@ -6247,6 +6447,14 @@ inline flatbuffers::Offset<void> BuiltinOptionsUnion::Pack(flatbuffers::FlatBuff
       auto ptr = reinterpret_cast<const LogSoftmaxOptionsT *>(value);
       return CreateLogSoftmaxOptions(_fbb, ptr, _rehasher).Union();
     }
+    case BuiltinOptions_CastOptions: {
+      auto ptr = reinterpret_cast<const CastOptionsT *>(value);
+      return CreateCastOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_DequantizeOptions: {
+      auto ptr = reinterpret_cast<const DequantizeOptionsT *>(value);
+      return CreateDequantizeOptions(_fbb, ptr, _rehasher).Union();
+    }
     case BuiltinOptions_MaximumOptions: {
       auto ptr = reinterpret_cast<const MaximumOptionsT *>(value);
       return CreateMaximumOptions(_fbb, ptr, _rehasher).Union();
@@ -6401,6 +6609,14 @@ inline BuiltinOptionsUnion::BuiltinOptionsUnion(const BuiltinOptionsUnion &u) FL
       value = new LogSoftmaxOptionsT(*reinterpret_cast<LogSoftmaxOptionsT *>(u.value));
       break;
     }
+    case BuiltinOptions_CastOptions: {
+      value = new CastOptionsT(*reinterpret_cast<CastOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_DequantizeOptions: {
+      value = new DequantizeOptionsT(*reinterpret_cast<DequantizeOptionsT *>(u.value));
+      break;
+    }
     case BuiltinOptions_MaximumOptions: {
       value = new MaximumOptionsT(*reinterpret_cast<MaximumOptionsT *>(u.value));
       break;
@@ -6592,6 +6808,16 @@ inline void BuiltinOptionsUnion::Reset() {
       delete ptr;
       break;
     }
+    case BuiltinOptions_CastOptions: {
+      auto ptr = reinterpret_cast<CastOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_DequantizeOptions: {
+      auto ptr = reinterpret_cast<DequantizeOptionsT *>(value);
+      delete ptr;
+      break;
+    }
     case BuiltinOptions_MaximumOptions: {
       auto ptr = reinterpret_cast<MaximumOptionsT *>(value);
       delete ptr;
diff --git a/tensorflow/contrib/lite/schema/upgrade_schema.py b/tensorflow/contrib/lite/schema/upgrade_schema.py
index 94f5730be5d991ae13fb019e4d035e23f76fe441..e0b36d3d3ee94b00cccd3968d14c63fe19c3c27c 100644
--- a/tensorflow/contrib/lite/schema/upgrade_schema.py
+++ b/tensorflow/contrib/lite/schema/upgrade_schema.py
@@ -39,8 +39,8 @@ import tensorflow as tf
 from tensorflow.python.platform import resource_loader
 
 parser = argparse.ArgumentParser(
-    description="Script to move TFLite models from pre-release schema to"
-    " new schema.")
+    description="Script to move TFLite models from pre-release schema to "
+    "new schema.")
 parser.add_argument(
     "input",
     type=str,
@@ -48,7 +48,7 @@ parser.add_argument(
 parser.add_argument(
     "output",
     type=str,
-    help="Output json or bin TensorFlow lite model compliant with"
+    help="Output json or bin TensorFlow lite model compliant with "
     "the new schema. Extension must be `.json`, `.bin` or `.tflite`.")
 
 
@@ -258,7 +258,7 @@ class Converter(object):
       # Check if builtin_code is the appropriate string type
       # use type("") instead of str or unicode. for py2and3
       if not isinstance(operator_code["builtin_code"], type(u"")):
-        raise ValueError("builtin_code %r is non-string. this usually means"
+        raise ValueError("builtin_code %r is non-string. this usually means "
                          "your model has consistency problems." %
                          (operator_code["builtin_code"]))
       operator_code["builtin_code"] = (RemapOperator(
diff --git a/tensorflow/contrib/lite/simple_memory_arena.cc b/tensorflow/contrib/lite/simple_memory_arena.cc
index 4aab244989ca5300fbe74162e03deaac89af60ad..2f2004f56bcad5b56f9dd6d4bc824ec14d79e795 100644
--- a/tensorflow/contrib/lite/simple_memory_arena.cc
+++ b/tensorflow/contrib/lite/simple_memory_arena.cc
@@ -113,21 +113,21 @@ TfLiteStatus SimpleMemoryArena::Commit(TfLiteContext* context) {
     underlying_buffer_size_ = required_size;
     underlying_buffer_aligned_ptr_ = new_underlying_buffer_aligned_ptr;
   }
-  commited_ = true;
+  committed_ = true;
   return underlying_buffer_ != nullptr ? kTfLiteOk : kTfLiteError;
 }
 
 TfLiteStatus SimpleMemoryArena::ResolveAlloc(TfLiteContext* context,
                                              const ArenaAlloc& alloc,
                                              char** output_ptr) {
-  TF_LITE_ENSURE(context, commited_);
+  TF_LITE_ENSURE(context, committed_);
   TF_LITE_ENSURE(context, output_ptr != nullptr);
   *output_ptr = underlying_buffer_aligned_ptr_ + alloc.offset;
   return kTfLiteOk;
 }
 
 TfLiteStatus SimpleMemoryArena::Clear() {
-  commited_ = false;
+  committed_ = false;
   high_water_mark_ = 0;
   allocs_.clear();
   return kTfLiteOk;
diff --git a/tensorflow/contrib/lite/simple_memory_arena.h b/tensorflow/contrib/lite/simple_memory_arena.h
index 0535522374c63459d029c252ebe94628cf3122d5..5faf78b59e3755d22e4e866d433e622baa6c66c1 100644
--- a/tensorflow/contrib/lite/simple_memory_arena.h
+++ b/tensorflow/contrib/lite/simple_memory_arena.h
@@ -22,7 +22,7 @@ limitations under the License.
 namespace tflite {
 
 // This little structure holds the offset and the size for a dynamic memory
-// allocation in the memory arena. When the arena is commited and the
+// allocation in the memory arena. When the arena is committed and the
 // underlying buffer is set, the alloc can be resolved into an actual memory
 // pointer.
 struct ArenaAlloc {
@@ -43,7 +43,7 @@ struct ArenaAlloc {
 class SimpleMemoryArena {
  public:
   explicit SimpleMemoryArena(size_t arena_alignment)
-      : commited_(false),
+      : committed_(false),
         arena_alignment_(arena_alignment),
         high_water_mark_(0),
         underlying_buffer_size_(0),
@@ -73,7 +73,7 @@ class SimpleMemoryArena {
   }
 
  private:
-  bool commited_;
+  bool committed_;
   size_t arena_alignment_;
   size_t high_water_mark_;
   std::unique_ptr<char[]> underlying_buffer_;
diff --git a/tensorflow/contrib/lite/special_rules.bzl b/tensorflow/contrib/lite/special_rules.bzl
new file mode 100644
index 0000000000000000000000000000000000000000..54083c49182c707620cbd231b957405cfe24be92
--- /dev/null
+++ b/tensorflow/contrib/lite/special_rules.bzl
@@ -0,0 +1,6 @@
+"""External versions of build rules that differ outside of Google."""
+
+def tflite_portable_test_suite(**kwargs):
+  """This is a no-op outside of Google."""
+  _ignore = [kwargs]
+  pass
diff --git a/tensorflow/contrib/lite/testing/BUILD b/tensorflow/contrib/lite/testing/BUILD
index c8d6e4b43e454073650e792e86bf657fd02d9224..12b7b3c35088a0560213e2e1431f23427d4fe640 100644
--- a/tensorflow/contrib/lite/testing/BUILD
+++ b/tensorflow/contrib/lite/testing/BUILD
@@ -8,6 +8,7 @@ load(
     "//tensorflow/contrib/lite:build_def.bzl",
     "gen_zipped_test_files",
 )
+load("//tensorflow/contrib/lite:special_rules.bzl", "tflite_portable_test_suite")
 load(
     "//tensorflow:tensorflow.bzl",
     "tf_cc_test",
@@ -34,12 +35,12 @@ gen_zipped_test_files(
         "l2norm.zip",
         "local_response_norm.zip",
         "log_softmax.zip",
-        "lstm.zip",
         "max_pool.zip",
         "maximum.zip",
         "mean.zip",
         "mul.zip",
         "pad.zip",
+        "prelu.zip",
         "relu.zip",
         "relu1.zip",
         "relu6.zip",
@@ -237,6 +238,9 @@ cc_test(
     size = "small",
     srcs = ["tf_driver_test.cc"],
     data = ["//tensorflow/contrib/lite:testdata/multi_add.pb"],
+    tags = [
+        "tflite_not_portable",
+    ],
     deps = [
         ":tf_driver",
         "@com_google_googletest//:gtest_main",
@@ -260,6 +264,9 @@ cc_test(
     name = "generate_testspec_test",
     size = "small",
     srcs = ["generate_testspec_test.cc"],
+    tags = [
+        "tflite_not_portable",
+    ],
     deps = [
         ":generate_testspec",
         "@com_google_googletest//:gtest_main",
@@ -321,6 +328,7 @@ tf_cc_test(
     tags = [
         "no_cuda_on_cpu_tap",
         "no_oss",
+        "tflite_not_portable",
     ],
     deps = [
         ":tflite_diff_flags",
@@ -340,7 +348,10 @@ tf_cc_test(
     ],
     data = [":optest"],
     shard_count = 20,
-    tags = ["no_oss"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable",
+    ],
     deps = [
         ":parse_testdata_lib",
         ":tflite_driver",
@@ -374,3 +385,5 @@ filegroup(
     ),
     visibility = ["//tensorflow:__subpackages__"],
 )
+
+tflite_portable_test_suite()
diff --git a/tensorflow/contrib/lite/testing/generate_examples.py b/tensorflow/contrib/lite/testing/generate_examples.py
index e714f5299d048132533c940b2961cda12a863de7..deab5a91d2e1bc1df4d45a5e87ddb42c71b866a6 100644
--- a/tensorflow/contrib/lite/testing/generate_examples.py
+++ b/tensorflow/contrib/lite/testing/generate_examples.py
@@ -617,6 +617,54 @@ def make_relu6_tests(zip_path):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
+def make_prelu_tests(zip_path):
+  """Make a set of tests to do PReLU."""
+
+  test_parameters = [{
+      # The canonical case for image processing is having a 4D `input` (NHWC)
+      # and `shared_axes`=[1, 2], so the alpha parameter is per channel.
+      "input_shape": [[1, 10, 10, 3], [3, 3, 3, 3]],
+      "shared_axes": [[1, 2], [1]],
+  }]
+
+  def build_graph(parameters):
+    """Build the graph for the test case."""
+
+    input_tensor = tf.placeholder(
+        dtype=tf.float32, name="input", shape=parameters["input_shape"])
+    prelu = tf.keras.layers.PReLU(shared_axes=parameters["shared_axes"])
+    out = prelu(input_tensor)
+    return [input_tensor], [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    """Build the inputs for the test case."""
+
+    input_shape = parameters["input_shape"]
+    input_values = create_tensor_data(
+        np.float32, input_shape, min_value=-10, max_value=10)
+    shared_axes = parameters["shared_axes"]
+
+    alpha_shape = []
+    for dim in range(1, len(input_shape)):
+      alpha_shape.append(1 if dim in shared_axes else input_shape[dim])
+
+    alpha_values = create_tensor_data(np.float32, alpha_shape)
+
+    with tf.variable_scope("", reuse=True):
+      alpha = tf.get_variable("p_re_lu/alpha")
+      sess.run(alpha.assign(alpha_values))
+
+    return [input_values], sess.run(
+        outputs, feed_dict=dict(zip(inputs, [input_values])))
+
+  make_zip_of_tests(
+      zip_path,
+      test_parameters,
+      build_graph,
+      build_inputs,
+      use_frozen_graph=True)
+
+
 # This function tests various TensorFLow functions that generates Const op,
 # including `tf.ones`, `tf.zeros` and random functions.
 def make_constant_tests(zip_path):
@@ -1606,7 +1654,7 @@ def make_transpose_tests(zip_path):
   }, {
       "dtype": [tf.float32],
       "input_shape": [[1, 2, 3, 4, 5]],
-      "perm": [[0, 1, 2, 3, 4]],
+      "perm": [[4, 3, 2, 1, 0]],
       "constant_perm": [True, False],
   }]
 
@@ -1946,6 +1994,7 @@ def main(unused_args):
         "relu.zip": make_relu_tests,
         "relu1.zip": make_relu1_tests,
         "relu6.zip": make_relu6_tests,
+        "prelu.zip": make_prelu_tests,
         "l2_pool.zip": make_pool_tests(make_l2_pool),
         "avg_pool.zip": make_pool_tests(tf.nn.avg_pool),
         "max_pool.zip": make_pool_tests(tf.nn.max_pool),
diff --git a/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc b/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
index f0e5dcc482c77b261186dc215c34eff7932f32ad..e9d505a76d15c8eaf1d3b6ba55bffe512532585e 100644
--- a/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
+++ b/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
@@ -47,10 +47,6 @@ tensorflow::Env* env = tensorflow::Env::Default();
 // Key is a substring of the test name and value is a bug number.
 // TODO(ahentz): make sure we clean this list up frequently.
 std::map<string, string> kBrokenTests = {
-    // Sub and Div don't support broadcasting.
-    {R"(^\/diva.*input_shape_1=\[1,3,4,3\],input_shape_2=\[3\])", "68500195"},
-    {R"(^\/suba.*input_shape_1=\[1,3,4,3\],input_shape_2=\[3\])", "68500195"},
-
     // Add only supports float32. (and "constant" tests use Add)
     {R"(^\/adda.*int32)", "68808744"},
     {R"(^\/constant.*int32)", "68808744"},
@@ -93,8 +89,8 @@ std::map<string, string> kBrokenTests = {
     // Transpose only supports 1D-4D input tensors.
     {R"(^\/transpose.*input_shape=\[.,.,.,.,.\])", "71545879"},
 
-    // Lstm kernel gets different results on tsan, asan, msan.
-    {R"(^\/lstmdtype=tf.float32.*)", "73830845"},
+    // PRelu only supports 4D input with (1, 1, channels) 3D alpha now.
+    {R"(^\/prelu.*shared_axes=\[1\])", "75975192"},
 };
 
 // Allows test data to be unzipped into a temporary directory and makes
@@ -238,41 +234,42 @@ TEST_P(OpsTest, RunStuff) {
 
 INSTANTIATE_TESTS(add)
 INSTANTIATE_TESTS(avg_pool)
-INSTANTIATE_TESTS(space_to_batch_nd)
 INSTANTIATE_TESTS(batch_to_space_nd)
 INSTANTIATE_TESTS(concat)
 INSTANTIATE_TESTS(constant)
 INSTANTIATE_TESTS(control_dep)
 INSTANTIATE_TESTS(conv)
 INSTANTIATE_TESTS(depthwiseconv)
+INSTANTIATE_TESTS(div)
 INSTANTIATE_TESTS(exp)
 INSTANTIATE_TESTS(fully_connected)
 INSTANTIATE_TESTS(fused_batch_norm)
 INSTANTIATE_TESTS(gather)
 INSTANTIATE_TESTS(global_batch_norm)
-INSTANTIATE_TESTS(l2norm)
 INSTANTIATE_TESTS(l2_pool)
+INSTANTIATE_TESTS(l2norm)
 INSTANTIATE_TESTS(local_response_norm)
 INSTANTIATE_TESTS(log_softmax)
 INSTANTIATE_TESTS(maximum)
 INSTANTIATE_TESTS(max_pool)
+INSTANTIATE_TESTS(mean)
 INSTANTIATE_TESTS(mul)
 INSTANTIATE_TESTS(pad)
 INSTANTIATE_TESTS(relu)
 INSTANTIATE_TESTS(relu1)
+INSTANTIATE_TESTS(prelu)
 INSTANTIATE_TESTS(relu6)
 INSTANTIATE_TESTS(reshape)
 INSTANTIATE_TESTS(resize_bilinear)
 INSTANTIATE_TESTS(sigmoid)
 INSTANTIATE_TESTS(softmax)
+INSTANTIATE_TESTS(space_to_batch_nd)
 INSTANTIATE_TESTS(space_to_depth)
-INSTANTIATE_TESTS(sub)
 INSTANTIATE_TESTS(split)
-INSTANTIATE_TESTS(div)
-INSTANTIATE_TESTS(transpose)
-INSTANTIATE_TESTS(mean)
 INSTANTIATE_TESTS(squeeze)
 INSTANTIATE_TESTS(strided_slice)
+INSTANTIATE_TESTS(sub)
+INSTANTIATE_TESTS(transpose)
 
 }  // namespace testing
 }  // namespace tflite
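Note on the new `prelu` entry in `kBrokenTests`: the alpha shape that `make_prelu_tests` builds in `generate_examples.py` drops the batch dimension and collapses each shared axis to 1. The sketch below restates that loop as a standalone helper so the two parameter combinations can be compared; the name `prelu_alpha_shape` is hypothetical and does not appear in the change.

```python
def prelu_alpha_shape(input_shape, shared_axes):
    """Mirror of the alpha_shape loop in make_prelu_tests (helper name is illustrative).

    The batch dimension (index 0) is skipped; every axis listed in
    shared_axes is collapsed to size 1, all other axes keep their size.
    """
    return [
        1 if dim in shared_axes else input_shape[dim]
        for dim in range(1, len(input_shape))
    ]


# Canonical NHWC case: alpha is per channel, i.e. (1, 1, channels).
print(prelu_alpha_shape([1, 10, 10, 3], [1, 2]))  # [1, 1, 3]

# shared_axes=[1] leaves a full-size width axis in alpha.
print(prelu_alpha_shape([3, 3, 3, 3], [1]))  # [1, 3, 3]
```

Only the first case yields the `(1, 1, channels)` alpha that the comment says the kernel supports, which appears to be why the `shared_axes=[1]` variants are listed as broken.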
diff --git a/tensorflow/contrib/lite/toco/BUILD b/tensorflow/contrib/lite/toco/BUILD
index 17407f3db27ead984d1cfffc3f0085ac86f5318f..486ff1edcddaa5b92015b83abed8b3dd677a4067 100644
--- a/tensorflow/contrib/lite/toco/BUILD
+++ b/tensorflow/contrib/lite/toco/BUILD
@@ -173,6 +173,7 @@ cc_library(
         "graph_transformations/convert_expanddims_to_reshape.cc",
         "graph_transformations/convert_pure_conv_to_depthwise.cc",
         "graph_transformations/convert_reorder_axes.cc",
+        "graph_transformations/convert_squeeze_to_reshape.cc",
         "graph_transformations/convert_trivial_addn_to_add.cc",
         "graph_transformations/convert_trivial_stack_to_reshape.cc",
         "graph_transformations/convert_trivial_transpose_to_reshape.cc",
@@ -192,9 +193,11 @@ cc_library(
         "graph_transformations/identify_lstm.cc",
         "graph_transformations/identify_lstm_merge_inputs.cc",
         "graph_transformations/identify_lstm_split_inputs.cc",
+        "graph_transformations/identify_prelu.cc",
         "graph_transformations/identify_relu1.cc",
         "graph_transformations/lstm_utils.cc",
         "graph_transformations/make_initial_dequantize_operator.cc",
+        "graph_transformations/propagate_activation_function_into_constants.cc",
         "graph_transformations/propagate_array_data_types.cc",
         "graph_transformations/propagate_fixed_sizes.cc",
         "graph_transformations/quantize.cc",
@@ -218,6 +221,7 @@ cc_library(
         "graph_transformations/resolve_constant_concatenation.cc",
         "graph_transformations/resolve_constant_fake_quant.cc",
         "graph_transformations/resolve_constant_fill.cc",
+        "graph_transformations/resolve_constant_gather.cc",
         "graph_transformations/resolve_constant_range.cc",
         "graph_transformations/resolve_constant_shape_or_rank.cc",
         "graph_transformations/resolve_constant_stack.cc",
@@ -240,6 +244,7 @@ cc_library(
         "graph_transformations/resolve_tensorflow_tile.cc",
         "graph_transformations/resolve_transpose_attributes.cc",
         "graph_transformations/unfuse_activation_functions.cc",
+        "graph_transformations/unpartition_embedding_lookup.cc",
         "graph_transformations/unroll_batch_matmul.cc",
     ],
     hdrs = [
@@ -328,6 +333,7 @@ cc_library(
         ":toco_graphviz_dump_options",
         ":toco_port",
         ":types_proto_cc",
+        "//tensorflow/contrib/lite/kernels/internal:quantization_util",
         "//tensorflow/core:lib",
         "@com_google_absl//absl/strings",
         "@protobuf_archive//:protobuf_headers",
diff --git a/tensorflow/contrib/lite/toco/allocate_transient_arrays.cc b/tensorflow/contrib/lite/toco/allocate_transient_arrays.cc
index 49cc1fc2aa365925cde86ceb658ff2b354d06911..621fbcb98db049f819ebbbda8816ad4e30538530 100644
--- a/tensorflow/contrib/lite/toco/allocate_transient_arrays.cc
+++ b/tensorflow/contrib/lite/toco/allocate_transient_arrays.cc
@@ -248,29 +248,49 @@ void AllocateTransientArrays(Model* model,
        op_index++) {
     const auto& op = model->operators[op_index];
     // Allocate those arrays whose lifespan starts exactly here.
+    std::vector<string> arrays_to_allocate;
     for (const auto& input : op->inputs) {
       if (StartsAt(array_lifespans[input], op_index)) {
-        AllocateTransientArray(*model, input, &allocator,
-                               transient_data_alignment);
+        if (std::find(arrays_to_allocate.begin(), arrays_to_allocate.end(),
+                      input) == arrays_to_allocate.end()) {
+          arrays_to_allocate.push_back(input);
+        }
       }
     }
     for (const auto& output : op->outputs) {
       if (StartsAt(array_lifespans[output], op_index)) {
-        AllocateTransientArray(*model, output, &allocator,
-                               transient_data_alignment);
+        if (std::find(arrays_to_allocate.begin(), arrays_to_allocate.end(),
+                      output) == arrays_to_allocate.end()) {
+          arrays_to_allocate.push_back(output);
+        }
       }
     }
+    for (const string& array : arrays_to_allocate) {
+      AllocateTransientArray(*model, array, &allocator,
+                             transient_data_alignment);
+    }
+
     // Deallocate those arrays whose lifespan ends exactly here.
+    std::vector<string> arrays_to_deallocate;
     for (const auto& input : op->inputs) {
       if (EndsAt(array_lifespans[input], op_index)) {
-        DeallocateTransientArray(*model, input, &allocator);
+        if (std::find(arrays_to_deallocate.begin(), arrays_to_deallocate.end(),
+                      input) == arrays_to_deallocate.end()) {
+          arrays_to_deallocate.push_back(input);
+        }
       }
     }
     for (const auto& output : op->outputs) {
       if (EndsAt(array_lifespans[output], op_index)) {
-        DeallocateTransientArray(*model, output, &allocator);
+        if (std::find(arrays_to_deallocate.begin(), arrays_to_deallocate.end(),
+                      output) == arrays_to_deallocate.end()) {
+          arrays_to_deallocate.push_back(output);
+        }
       }
     }
+    for (const string& array : arrays_to_deallocate) {
+      DeallocateTransientArray(*model, array, &allocator);
+    }
   }
 
   // Just out of curiosity (not used in the actual allocation process)
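Note on the hunk above: the allocation and deallocation loops now collect array names first and skip names already seen, so an array that appears twice on the same operator is handled once. The following is a minimal standalone sketch of that order-preserving de-duplication. `PushBackIfAbsent` and `UniqueArrayNames` are hypothetical names used only for illustration; they are not part of toco.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not part of toco): appends `name` to `names` only if
// it is not already present, so first-seen order is preserved.
void PushBackIfAbsent(const std::string& name,
                      std::vector<std::string>* names) {
  if (std::find(names->begin(), names->end(), name) == names->end()) {
    names->push_back(name);
  }
}

// Collects the distinct array names touched by one operator, inputs first.
std::vector<std::string> UniqueArrayNames(
    const std::vector<std::string>& inputs,
    const std::vector<std::string>& outputs) {
  std::vector<std::string> unique_names;
  for (const auto& name : inputs) PushBackIfAbsent(name, &unique_names);
  for (const auto& name : outputs) PushBackIfAbsent(name, &unique_names);
  return unique_names;
}
```

A linear `std::find` is quadratic in the number of names, which is probably acceptable here because a single operator references only a few arrays.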
diff --git a/tensorflow/contrib/lite/toco/export_tensorflow.cc b/tensorflow/contrib/lite/toco/export_tensorflow.cc
index 6900468ec6484d5c1896752286a2fa72f4d38c07..22a23357b36c16ea937e726f1e49aa95d7f964e3 100644
--- a/tensorflow/contrib/lite/toco/export_tensorflow.cc
+++ b/tensorflow/contrib/lite/toco/export_tensorflow.cc
@@ -548,6 +548,38 @@ void ConvertDepthwiseConvOperator(const Model& model,
   }
 }
 
+void ConvertTransposeConvOperator(const Model& model,
+                                  const TransposeConvOperator& src_op,
+                                  GraphDef* tensorflow_graph) {
+  auto* conv2d_op = tensorflow_graph->add_node();
+  conv2d_op->set_op("Conv2DBackpropInput");
+  conv2d_op->set_name(src_op.outputs[0]);
+  *conv2d_op->add_input() = src_op.inputs[0];
+  *conv2d_op->add_input() = src_op.inputs[1];
+  *conv2d_op->add_input() = src_op.inputs[2];
+  (*conv2d_op->mutable_attr())["T"].set_type(DT_FLOAT);
+  const string& weights_array_name = WalkUpToConstantArray(
+      model, src_op.inputs[TransposeConvOperator::WEIGHTS]);
+  const auto& weights_array = model.GetArray(weights_array_name);
+  CHECK(weights_array.buffer->type == ArrayDataType::kFloat);
+  ConvertFloatTensorConst(model, weights_array_name, AxesOrder::kOHWI,
+                          AxesOrder::kHWIO, tensorflow_graph);
+  auto& strides = (*conv2d_op->mutable_attr())["strides"];
+  strides.mutable_list()->add_i(1);
+  strides.mutable_list()->add_i(src_op.stride_height);
+  strides.mutable_list()->add_i(src_op.stride_width);
+  strides.mutable_list()->add_i(1);
+  string padding;
+  if (src_op.padding.type == PaddingType::kSame) {
+    padding = "SAME";
+  } else if (src_op.padding.type == PaddingType::kValid) {
+    padding = "VALID";
+  } else {
+    LOG(FATAL) << "Bad padding (only SAME and VALID are supported)";
+  }
+  (*conv2d_op->mutable_attr())["padding"].set_s(padding);
+}
+
 void ConvertDepthToSpaceOperator(const Model& model,
                                  const DepthToSpaceOperator& src_op,
                                  GraphDef* tensorflow_graph) {
@@ -1622,9 +1654,11 @@ void ConvertSqueezeOperator(const Model& model, const SqueezeOperator& src_op,
   const auto params_type = GetTensorFlowDataType(model, src_op.inputs[0]);
   (*new_op->mutable_attr())["T"].set_type(params_type);
 
-  auto& squeeze_dims = (*new_op->mutable_attr())["squeeze_dims"];
-  for (int i : src_op.squeeze_dims) {
-    squeeze_dims.mutable_list()->add_i(i);
+  if (!src_op.squeeze_dims.empty()) {
+    auto& squeeze_dims = (*new_op->mutable_attr())["squeeze_dims"];
+    for (int i : src_op.squeeze_dims) {
+      squeeze_dims.mutable_list()->add_i(i);
+    }
   }
 }
 
@@ -1859,6 +1893,10 @@ void ConvertOperator(const Model& model, const Operator& src_op,
     ConvertExpandDimsOperator(model,
                               static_cast<const ExpandDimsOperator&>(src_op),
                               tensorflow_graph);
+  } else if (src_op.type == OperatorType::kTransposeConv) {
+    ConvertTransposeConvOperator(
+        model, static_cast<const TransposeConvOperator&>(src_op),
+        tensorflow_graph);
   } else {
     LOG(FATAL) << "Unhandled operator type " << OperatorTypeName(src_op.type);
   }
diff --git a/tensorflow/contrib/lite/toco/g3doc/python_api.md b/tensorflow/contrib/lite/toco/g3doc/python_api.md
index 440f9c367c25726e20aa8828e3050cd1dc1b230d..36e2d9c37238bb6184ec99c567810b1bcb9a68ce 100644
--- a/tensorflow/contrib/lite/toco/g3doc/python_api.md
+++ b/tensorflow/contrib/lite/toco/g3doc/python_api.md
@@ -28,7 +28,7 @@ val = img + tf.constant([1., 2., 3.]) + tf.constant([1., 4., 4.])
 out = tf.identity(val, name="out")
 with tf.Session() as sess:
   tflite_model = tf.contrib.lite.toco_convert(sess.graph_def, [img], [out])
-  open("test.tflite", "wb").write(tflite_modeL)
+  open("test.tflite", "wb").write(tflite_model)
 ```
 
 **NOTE** Currently, the TOCO command will cause a fatal error to the Python
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/convert_squeeze_to_reshape.cc b/tensorflow/contrib/lite/toco/graph_transformations/convert_squeeze_to_reshape.cc
new file mode 100644
index 0000000000000000000000000000000000000000..81cedb5dad751aacbbb32326db73de386aba282d
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/convert_squeeze_to_reshape.cc
@@ -0,0 +1,85 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+#include "absl/strings/str_cat.h"
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace toco {
+
+// Replaces a tf.squeeze operator with a reshape.
+// Squeeze removes dimensions == 1 (if in the list of squeeze_dims). This
+// means that the data layout will never change with this op, just the shape.
+// By converting these to reshapes once we have run shape propagation we allow
+// standard reshape optimization transforms to do their magic.
+bool ConvertSqueezeToReshape::Run(Model* model, std::size_t op_index) {
+  auto squeeze_it = model->operators.begin() + op_index;
+  if (squeeze_it->get()->type != OperatorType::kSqueeze) {
+    return false;
+  }
+  auto squeeze_op = static_cast<SqueezeOperator*>(squeeze_it->get());
+  CHECK_EQ(squeeze_op->inputs.size(), 1);
+  CHECK_EQ(squeeze_op->outputs.size(), 1);
+
+  const auto& input_array = model->GetArray(squeeze_op->inputs[0]);
+  if (!input_array.has_shape()) {
+    // Yield until input dims have been resolved.
+    return false;
+  }
+  if (input_array.shape().dimensions_count() == 0) {
+    // Input array cannot be 0-D.
+    return false;
+  }
+  if (!model->HasArray(squeeze_op->outputs[0]) ||
+      !model->GetArray(squeeze_op->outputs[0]).has_shape()) {
+    // Yield until shape propagation has set the output shape for us.
+    return false;
+  }
+
+  // We use the output shape that has been calculated by shape propagation.
+  const auto& output_shape = model->GetArray(squeeze_op->outputs[0]).shape();
+
+  // Empty shapes will not work as empty data arrays.
+  if (output_shape.dimensions_count() == 0) {
+    return false;
+  }
+
+  auto* reshape_op = new TensorFlowReshapeOperator;
+  reshape_op->inputs = {
+      squeeze_op->inputs[0],
+      CreateInt32Array(model, squeeze_op->outputs[0] + "_shape",
+                       output_shape.dims()),
+  };
+  reshape_op->outputs = squeeze_op->outputs;
+
+  AddMessageF("Replacing %s with %s", LogName(*squeeze_op),
+              LogName(*reshape_op));
+
+  // Replace the operator in the graph.
+  const auto reshape_it = model->operators.emplace(squeeze_it, reshape_op);
+  squeeze_it = reshape_it + 1;
+  CHECK_EQ(squeeze_it->get(), squeeze_op);
+  model->operators.erase(squeeze_it);
+
+  return true;
+}
+
+}  // namespace toco
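Note on the new file above: the transformation relies on Squeeze only dropping size-1 dimensions, so the flat element count (and therefore the buffer layout) is unchanged and a Reshape to the propagated output shape is equivalent. The sketch below illustrates that property in isolation. `SqueezedShape` and `FlatSize` are hypothetical helpers, and the sketch assumes non-negative entries in `squeeze_dims`.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: returns `dims` with size-1 dimensions removed. If
// `squeeze_dims` is empty every size-1 dimension is dropped; otherwise only
// the listed (non-negative) axes are candidates.
std::vector<int> SqueezedShape(const std::vector<int>& dims,
                               const std::vector<int>& squeeze_dims) {
  std::vector<int> out;
  for (int i = 0; i < static_cast<int>(dims.size()); ++i) {
    const bool listed = std::find(squeeze_dims.begin(), squeeze_dims.end(),
                                  i) != squeeze_dims.end();
    const bool drop = dims[i] == 1 && (squeeze_dims.empty() || listed);
    if (!drop) out.push_back(dims[i]);
  }
  return out;
}

// Number of elements in an array of the given shape.
int FlatSize(const std::vector<int>& dims) {
  int size = 1;
  for (int d : dims) size *= d;
  return size;
}
```

Because `FlatSize` is the same before and after, the same buffer can be reinterpreted with the new shape, which is what a Reshape does.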
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/convert_trivial_transpose_to_reshape.cc b/tensorflow/contrib/lite/toco/graph_transformations/convert_trivial_transpose_to_reshape.cc
index c2b166033c33b777bad88cb712adf8517be1762a..5a36a90b3841504d6f018832777e50bac95218d7 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/convert_trivial_transpose_to_reshape.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/convert_trivial_transpose_to_reshape.cc
@@ -21,6 +21,33 @@ limitations under the License.
 
 namespace toco {
 
+namespace {
+
+bool TransposeAffectsMemoryOrder(const std::vector<int>& perm,
+                                 const std::vector<int>& in_shape) {
+  CHECK_EQ(perm.size(), in_shape.size());
+  // Compare the ordering of the non-unitary dimensions before and after the
+  // transpose permutation. If the major indices stay in the same order (not
+  // just the shape) then the flat buffer representation shouldn't change.
+  std::vector<int> old_major_index_ordering;
+  std::vector<int> new_major_index_ordering;
+  for (int i = 0; i < in_shape.size(); i++) {
+    if (in_shape[i] != 1) {
+      old_major_index_ordering.push_back(i);
+    }
+
+    if (in_shape[perm[i]] != 1) {
+      new_major_index_ordering.push_back(perm[i]);
+    }
+  }
+
+  CHECK_EQ(new_major_index_ordering.size(), old_major_index_ordering.size());
+
+  return old_major_index_ordering != new_major_index_ordering;
+}
+
+}  // namespace
+
 bool ConvertTrivialTransposeToReshape::Run(Model* model, std::size_t op_index) {
   auto transpose_it = model->operators.begin() + op_index;
   if (transpose_it->get()->type != OperatorType::kTranspose) {
@@ -29,23 +56,26 @@ bool ConvertTrivialTransposeToReshape::Run(Model* model, std::size_t op_index) {
   TransposeOperator* transpose_op =
       static_cast<TransposeOperator*>(transpose_it->get());
 
+  const auto& input_array = model->GetArray(transpose_op->inputs[0]);
   const auto& output_array = model->GetArray(transpose_op->outputs[0]);
-  if (!output_array.has_shape()) {
+  if (!input_array.has_shape() || !output_array.has_shape()) {
     // Yield until PropagateFixedSizes has been run on this op.
     return false;
   }
   // Note: We can assume we have error checked inputs in PropagateFixedSizes.
 
-  // This transpose is trivial if we only have one non-unitary dimension.
-  std::vector<int> const& dims = output_array.shape().dims();
-  unsigned non_unitary_axis_count = 0;
-  for (int i = 0; i < dims.size(); i++) {
-    if (dims[i] != 1) {
-      non_unitary_axis_count++;
-    }
+  // Check that the permutation has propagated.
+  std::vector<int> const& perm = transpose_op->perm;
+  if (perm.empty()) {
+    return false;
   }
-  if (non_unitary_axis_count > 1) {
-    // Transpose is not trivial
+
+  // This transpose is trivial if non-unitary dimensions remain in the same
+  // order.
+  std::vector<int> const& input_dims = input_array.shape().dims();
+  std::vector<int> const& output_dims = output_array.shape().dims();
+
+  if (TransposeAffectsMemoryOrder(perm, input_dims)) {
     return false;
   }
 
@@ -61,11 +91,11 @@ bool ConvertTrivialTransposeToReshape::Run(Model* model, std::size_t op_index) {
   string shape_array_name = toco::AvailableArrayName(*model, perm_array_name);
   Array& shape_array = model->GetOrCreateArray(shape_array_name);
   *(shape_array.mutable_shape()->mutable_dims()) = {
-      1, static_cast<int>(dims.size())};
+      1, static_cast<int>(output_dims.size())};
   reshape_op->inputs.push_back(shape_array_name);
   shape_array.data_type = ArrayDataType::kInt32;
   auto& shape_buffer = shape_array.GetMutableBuffer<ArrayDataType::kInt32>();
-  shape_buffer.data = dims;
+  shape_buffer.data = output_dims;
 
   // Delete perm array if unused
   if (IsDiscardableArray(*model, perm_array_name) &&
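Note on the change above: a transpose can be replaced by a reshape when the axes whose size is not 1 keep their relative order, because moving size-1 axes does not change the flat buffer. Below is a standalone sketch of the same check with `assert` standing in for `CHECK_EQ`; it mirrors the logic in the hunk but is an illustration, not the toco source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if applying `perm` to an array of shape `in_shape` changes the
// order of elements in the flat buffer. Only axes with size != 1 matter.
bool TransposeAffectsMemoryOrder(const std::vector<int>& perm,
                                 const std::vector<int>& in_shape) {
  assert(perm.size() == in_shape.size());
  std::vector<int> old_order;  // non-unitary axes in input order
  std::vector<int> new_order;  // non-unitary axes in output order
  for (std::size_t i = 0; i < in_shape.size(); ++i) {
    if (in_shape[i] != 1) old_order.push_back(static_cast<int>(i));
    if (in_shape[perm[i]] != 1) new_order.push_back(perm[i]);
  }
  return old_order != new_order;
}
```

For example, shape `{2, 1, 3}` with `perm = {1, 0, 2}` only moves the size-1 axis, so the buffer is unchanged, while shape `{2, 3}` with `perm = {1, 0}` is a real transpose.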
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/fuse_activation_functions.cc b/tensorflow/contrib/lite/toco/graph_transformations/fuse_activation_functions.cc
index ab943f72d1dd87ae9ff4bd53a807cd4923a88c38..c5ce3fcd95eb0aaf63dcc7f43b96d8a13ed93929 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/fuse_activation_functions.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/fuse_activation_functions.cc
@@ -42,9 +42,9 @@ bool FuseActivationFunctions::Run(Model* model, std::size_t op_index) {
 
   if (CountTrueOutputs(*model, *op) > 1) {
     AddMessageF(
-        "Not fusing activation function into %s because it has more than one "
-        " consumed output",
-        LogName(*op));
+        "Not fusing activation function %s into %s because it has more than "
+        "one consumed output",
+        LogName(*ac_op), LogName(*op));
     return false;
   }
 
@@ -56,22 +56,31 @@ bool FuseActivationFunctions::Run(Model* model, std::size_t op_index) {
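     AddMessageF(
-        "Not fusing activation function into %s because it is consumed by more "
-        "than 1 other operator",
-        LogName(*op));
+        "Not fusing activation function %s into %s because it is consumed by "
+        "more than 1 other operator",
+        LogName(*ac_op), LogName(*op));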
+    return false;
+  }
+
+  if (!IsDiscardableArray(*model, op->outputs[0])) {
+    AddMessageF(
+        "Not fusing activation function %s into %s because output %s is not "
+        "discardable",
+        LogName(*ac_op), LogName(*op), op->outputs[0]);
     return false;
   }
 
   if (op->fused_activation_function != FusedActivationFunctionType::kNone) {
     AddMessageF(
-        "Not fusing activation function into %s because it already has a fused "
-        "activation function",
-        LogName(*op));
+        "Not fusing activation function %s into %s because it already has a "
+        "fused activation function",
+        LogName(*ac_op), LogName(*op));
     return false;
   }
 
   if (!OperatorSupportsFusedActivation(op->type)) {
     AddMessageF(
-        "Not fusing activation function because the %s op doesn't support it",
-        LogName(*op));
+        "Not fusing activation function %s because the %s op doesn't support "
+        "it",
+        LogName(*ac_op), LogName(*op));
     return false;
   }
 
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/fuse_binary_into_preceding_affine.cc b/tensorflow/contrib/lite/toco/graph_transformations/fuse_binary_into_preceding_affine.cc
index 5b57178b18d2d60e1f301a1a8b257d8057618550..76c6be00d407ca30b898d088c9fa34cd7f76f656 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/fuse_binary_into_preceding_affine.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/fuse_binary_into_preceding_affine.cc
@@ -50,7 +50,17 @@ void FuseAddOrSubParamsIntoPrecedingAffine(Model* model, Operator* preceding_op,
 
   // TODO(b/62904716): Bias array should become 1-D when padding removed.
   const int depth = bias_shape.dims(bias_shape.dimensions_count() - 1);
-  CHECK_EQ(depth, operand_shape.dims(operand_shape.dimensions_count() - 1));
+  int operand_channel_increment = 0;
+  if (operand_shape.dimensions_count() >= 1 &&
+      operand_shape.dims(operand_shape.dimensions_count() - 1) ==
+          bias_shape.dims(bias_shape.dimensions_count() - 1)) {
+    operand_channel_increment = 1;
+  } else if (operand_shape.dimensions_count() == 0 ||
+             operand_shape.dims(operand_shape.dimensions_count() - 1) == 1) {
+    operand_channel_increment = 0;
+  } else {
+    LOG(FATAL) << "Operand shape mismatch.";
+  }
 
   enum class OpType { BiasPlusOperand, BiasMinusOperand, OperandMinusBias };
 
@@ -60,9 +70,10 @@ void FuseAddOrSubParamsIntoPrecedingAffine(Model* model, Operator* preceding_op,
                                   ? OpType::BiasMinusOperand
                                   : OpType::OperandMinusBias;
 
+  int operand_channel = 0;
   for (int i = 0; i < depth; i++) {
     float& bias_val = bias_data[i];
-    const float operand_val = operand_data[i];
+    const float operand_val = operand_data[operand_channel];
     if (optype == OpType::BiasPlusOperand) {
       bias_val += operand_val;
     } else if (optype == OpType::BiasMinusOperand) {
@@ -72,6 +83,7 @@ void FuseAddOrSubParamsIntoPrecedingAffine(Model* model, Operator* preceding_op,
     } else {
       LOG(FATAL) << "Should not get here.";
     }
+    operand_channel += operand_channel_increment;
   }
 }
 
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h b/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
index f2c81ebc81c2928ae60d66bfcd7f643c5412f196..640afc7c74d7284fb9e212ab23d74a8215314add 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
+++ b/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
@@ -114,6 +114,7 @@ void RunGraphTransformations(Model* model, const string& message,
 // List of all graph transformations
 DECLARE_GRAPH_TRANSFORMATION(ConvertExpandDimsToReshape)
 DECLARE_GRAPH_TRANSFORMATION(ConvertPureConvToDepthwise)
+DECLARE_GRAPH_TRANSFORMATION(ConvertSqueezeToReshape)
 DECLARE_GRAPH_TRANSFORMATION(ConvertTrivialAddNToAdd)
 DECLARE_GRAPH_TRANSFORMATION(ConvertTrivialStackToReshape)
 DECLARE_GRAPH_TRANSFORMATION(ConvertTrivialTransposeToReshape)
@@ -128,8 +129,10 @@ DECLARE_GRAPH_TRANSFORMATION(IdentifyLstmCell)
 DECLARE_GRAPH_TRANSFORMATION(SplitLstmCellInputs)
 DECLARE_GRAPH_TRANSFORMATION(MergeLstmCellInputs)
 DECLARE_GRAPH_TRANSFORMATION(IdentifyRelu1)
+DECLARE_GRAPH_TRANSFORMATION(IdentifyPRelu)
 DECLARE_GRAPH_TRANSFORMATION(IdentifyDilatedConv)
 DECLARE_GRAPH_TRANSFORMATION(MakeInitialDequantizeOperator)
+DECLARE_GRAPH_TRANSFORMATION(PropagateActivationFunctionIntoConstants)
 DECLARE_GRAPH_TRANSFORMATION(PropagateArrayDataTypes)
 DECLARE_GRAPH_TRANSFORMATION(PropagateFixedSizes)
 DECLARE_GRAPH_TRANSFORMATION(HardcodeMinMax)
@@ -175,8 +178,10 @@ DECLARE_GRAPH_TRANSFORMATION(ResolveConstantShapeOrRank)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantStack)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantStridedSlice)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantFill)
+DECLARE_GRAPH_TRANSFORMATION(ResolveConstantGather)
 DECLARE_GRAPH_TRANSFORMATION(ResolveMultiplyByZero)
 DECLARE_GRAPH_TRANSFORMATION(Dequantize)
+DECLARE_GRAPH_TRANSFORMATION(UnpartitionEmbeddingLookup)
 
 class ResolveReshapeAttributes : public GraphTransformation {
  public:
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc b/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
index 938d76386d6f315abfe6fe55b133cb4d19014f01..5cc82da5d544846cc095046ceccf0664525aae41 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
@@ -326,9 +326,12 @@ bool HardcodeMinMax::Run(Model* model, std::size_t op_index) {
       changed = HardcodeMinMaxForAverageOrMaxPool(model, op);
       break;
 
+    case OperatorType::kStridedSlice:
     case OperatorType::kSqueeze:
     case OperatorType::kTensorFlowReshape:
     case OperatorType::kPad:
+    case OperatorType::kGather:
+    case OperatorType::kTranspose:
       changed = HardcodeMinMaxFromFirstInput(model, op);
       break;
 
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/identify_prelu.cc b/tensorflow/contrib/lite/toco/graph_transformations/identify_prelu.cc
new file mode 100644
index 0000000000000000000000000000000000000000..30be4ac0aa5e9f639bbf0630e142c2806faa3260
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/identify_prelu.cc
@@ -0,0 +1,119 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+#include "tensorflow/core/platform/logging.h"
+
+// This transformation rule tries to identify the PRelu structure generated by
+// Keras, and convert it to a single op.
+//
+// The formula of PReLU is:
+// f(x) = alpha * x for x < 0, f(x) = x for x >= 0.
+//
+// `x` is the input, and `alpha` is a trainable tensor which can be broadcast
+// to the shape of `x`.
+//
+// There's no native PRelu op in TensorFlow, so Keras generates the following
+// structure which does the equivalent calculation:
+// f(x) = Relu(x) + (-alpha * Relu(-x))
+//
+// Practically, alpha is always a constant in the inference graph, and Toco has
+// other graph transformations which fold the activation functions to other ops.
+// Therefore, we're looking for the structure:
+//
+// f(x) = Relu(x) + (negative_alpha * Neg(x, activation=Relu))
+
+namespace toco {
+
+bool IdentifyPRelu::Run(Model* model, std::size_t op_index) {
+  const auto add_op_it = model->operators.begin() + op_index;
+  const auto* add_op = add_op_it->get();
+  if (add_op == nullptr || add_op->type != OperatorType::kAdd ||
+      add_op->inputs.size() != 2 ||
+      add_op->fused_activation_function != FusedActivationFunctionType::kNone) {
+    return false;
+  }
+
+  const auto* relu_input_op = GetOpWithOutput(*model, add_op->inputs[0]);
+  if (relu_input_op == nullptr || relu_input_op->type != OperatorType::kRelu ||
+      relu_input_op->inputs.size() != 1 ||
+      relu_input_op->fused_activation_function !=
+          FusedActivationFunctionType::kNone) {
+    return false;
+  }
+
+  // TODO(ycling): Both Add and Mul are commutative. Support the case where
+  // the operands are swapped.
+  const auto* mul_op = GetOpWithOutput(*model, add_op->inputs[1]);
+  if (mul_op == nullptr || mul_op->type != OperatorType::kMul ||
+      mul_op->inputs.size() != 2 ||
+      mul_op->fused_activation_function != FusedActivationFunctionType::kNone) {
+    return false;
+  }
+
+  const auto neg_alpha_tensor_name = mul_op->inputs[0];
+
+  const auto* relu_neg_input_op = GetOpWithOutput(*model, mul_op->inputs[1]);
+
+  if (relu_neg_input_op == nullptr ||
+      relu_neg_input_op->type != OperatorType::kNeg ||
+      relu_neg_input_op->fused_activation_function !=
+          FusedActivationFunctionType::kRelu ||
+      relu_neg_input_op->inputs.size() != 1) {
+    return false;
+  }
+
+  if (relu_input_op->inputs[0] != relu_neg_input_op->inputs[0]) {
+    return false;
+  }
+
+  const auto input_tensor_name = relu_input_op->inputs[0];
+  const auto output_tensor_name = add_op->outputs[0];
+
+  // Construct a tensor for positive alpha (double negative).
+  const auto alpha_tensor_name =
+      AvailableArrayName(*model, neg_alpha_tensor_name + "_neg");
+  model->GetOrCreateArray(alpha_tensor_name);
+
+  auto* neg_neg_alpha_op = new NegOperator;
+  neg_neg_alpha_op->inputs = {neg_alpha_tensor_name};
+  neg_neg_alpha_op->outputs = {alpha_tensor_name};
+  // emplace() invalidates add_op_it, so keep the iterator it returns.
+  const auto neg_it = model->operators.emplace(add_op_it, neg_neg_alpha_op);
+  auto* prelu_op = new PReluOperator;
+  prelu_op->inputs = {input_tensor_name, alpha_tensor_name};
+  prelu_op->outputs = {output_tensor_name};
+  model->operators.emplace(neg_it + 1, prelu_op);
+  AddMessageF("Creating %s replacing equivalent subgraph", LogName(*prelu_op));
+
+  DeleteArrayIfUsedOnce(neg_alpha_tensor_name, model);
+  DeleteArrayIfUsedOnce(add_op->inputs[0], model);
+  DeleteArrayIfUsedOnce(add_op->inputs[1], model);
+  DeleteArrayIfUsedOnce(mul_op->inputs[1], model);
+  // Remove the existing Add op that outputs the final result. If the other
+  // intermediate tensors aren't used by other ops, those will be removed by
+  // other graph transformation rules.
+  model->operators.erase(FindOp(*model, add_op));
+
+  return true;
+}
+
+}  // namespace toco
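Note on the new file above: the pattern match rests on the identity PRelu(x) = Relu(x) + (-alpha) * Relu(-x) stated in the file comment. The sketch below checks that identity on scalars. The function names are hypothetical and chosen only for this illustration.

```cpp
// Scalar reference implementations, for illustration only.
float Relu(float x) { return x > 0.f ? x : 0.f; }

// PReLU as defined in the file comment: alpha * x for x < 0, x otherwise.
float PRelu(float x, float alpha) { return x >= 0.f ? x : alpha * x; }

// The decomposition the transformation looks for:
// Relu(x) + negative_alpha * Relu(-x), with negative_alpha == -alpha.
float DecomposedPRelu(float x, float alpha) {
  const float negative_alpha = -alpha;
  return Relu(x) + negative_alpha * Relu(-x);
}
```

For x >= 0 the second term is zero and the result is x. For x < 0 the first term is zero and the second is (-alpha) * (-x) = alpha * x.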
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/make_initial_dequantize_operator.cc b/tensorflow/contrib/lite/toco/graph_transformations/make_initial_dequantize_operator.cc
index d83603e9a2c59ae74a5e5fda5b11178740336bfb..935da9f966ca63095faa17476be3a559d1a0193a 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/make_initial_dequantize_operator.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/make_initial_dequantize_operator.cc
@@ -85,8 +85,8 @@ bool AddDequantizeOperatorToInput(const string& input_name, const Operator* op,
   auto& dequantized_input_minmax = dequantized_input_array.GetOrCreateMinMax();
   dequantized_input_minmax = input_minmax;
   auto& input_qparams = input_array.GetOrCreateQuantizationParams();
-  GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(
-      model->flags, input_minmax, &input_qparams);
+  GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(input_minmax,
+                                                         &input_qparams);
 
   transformation->AddMessageF(
       "Created %s"
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_activation_function_into_constants.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_activation_function_into_constants.cc
new file mode 100644
index 0000000000000000000000000000000000000000..cf17c49b1098d02468935aa72d1d1e73b4addbe1
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_activation_function_into_constants.cc
@@ -0,0 +1,121 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_passthrough.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/runtime/types.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace toco {
+
+bool PropagateActivationFunctionIntoConstants::Run(Model* model,
+                                                   std::size_t op_index) {
+  const auto ac_it = model->operators.begin() + op_index;
+  const auto* ac_op = ac_it->get();
+  if (ac_op->type != OperatorType::kRelu6 &&
+      ac_op->type != OperatorType::kRelu1 &&
+      ac_op->type != OperatorType::kRelu) {
+    return false;
+  }
+
+  // Find the op producing the array passed to this activation function.
+  auto* src_op = GetOpWithOutput(*model, ac_op->inputs[0]);
+  if (!src_op) {
+    return false;
+  }
+
+  // Ensure the src_op is not used without the activation function applied.
+  if (CountTrueOutputs(*model, *src_op) > 1) {
+    AddMessageF(
+        "Not propagating activation function %s into %s because it has more "
+        "than one consumed output", LogName(*ac_op), LogName(*src_op));
+    return false;
+  }
+
+  // Filter to the list of supported ops.
+  string src_op_input;
+  switch (src_op->type) {
+    case OperatorType::kGather:
+      src_op_input = src_op->inputs[0];
+      break;
+    default:
+      return false;
+  }
+  CHECK_EQ(src_op->outputs[0], ac_op->inputs[0]);
+
+  // Ensure the input is constant as otherwise this needs to happen at runtime.
+  // If we bail here, it's still possible that FuseActivationFunctions will fuse
+  // the activation if it's supported by the op.
+  if (!IsConstantParameterArray(*model, src_op_input)) {
+    AddMessageF(
+        "Not propagating activation function %s into %s:%s because it is not "
+        "constant",
+        LogName(*ac_op), LogName(*src_op), src_op_input);
+    return false;
+  }
+
+  // Get the array we'll be working with and ensure it's a compatible type.
+  auto& const_array = model->GetArray(src_op_input);
+  if (const_array.data_type != ArrayDataType::kFloat) {
+    AddMessageF(
+        "Not propagating activation function %s into %s:%s because it is "
+        "non-float data",
+        LogName(*ac_op), LogName(*src_op), src_op_input);
+    return false;
+  }
+  auto& const_array_data =
+      const_array.GetMutableBuffer<ArrayDataType::kFloat>().data;
+
+  // Perform the activation function directly into the constant data array.
+  for (size_t i = 0; i < const_array_data.size(); ++i) {
+    const float value = const_array_data[i];
+    float new_value = value;
+    switch (ac_op->type) {
+      case OperatorType::kRelu: {
+        static constexpr float kLower = 0;
+        new_value = value < kLower ? kLower : value;
+        break;
+      }
+      case OperatorType::kRelu1: {
+        static constexpr float kUpper = 1;
+        static constexpr float kLower = -1;
+        new_value = value > kUpper ? kUpper : value < kLower ? kLower : value;
+        break;
+      }
+      case OperatorType::kRelu6: {
+        static constexpr float kUpper = 6;
+        static constexpr float kLower = 0;
+        new_value = value > kUpper ? kUpper : value < kLower ? kLower : value;
+        break;
+      }
+      default:
+        LOG(FATAL) << "Unsupported activation function " << LogName(*ac_op);
+        return false;
+    }
+    const_array_data[i] = new_value;
+  }
+
+  AddMessageF("Propagated activation function %s into %s:%s", LogName(*ac_op),
+              LogName(*src_op), src_op_input);
+  return RemoveTrivialPassthroughOp(this, model, op_index);
+}
+
+}  // namespace toco
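Note on the new file above: when the Gather input is a constant float array, the activation can be applied to the constant once at conversion time, because Gather only selects rows and clamping each element commutes with selection. The sketch below shows the element-wise clamping with the same bounds as the file (Relu: [0, inf), Relu1: [-1, 1], Relu6: [0, 6]). The `Activation` enum and `ApplyActivationInPlace` are hypothetical names for illustration, not toco types.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the three operator types handled above.
enum class Activation { kRelu, kRelu1, kRelu6 };

// Clamps every element of `data` according to `activation`, in place.
void ApplyActivationInPlace(Activation activation, std::vector<float>* data) {
  for (float& value : *data) {
    switch (activation) {
      case Activation::kRelu:
        value = std::max(value, 0.f);
        break;
      case Activation::kRelu1:
        value = std::min(std::max(value, -1.f), 1.f);
        break;
      case Activation::kRelu6:
        value = std::min(std::max(value, 0.f), 6.f);
        break;
    }
  }
}
```

After the constant has been rewritten this way, the activation operator itself is a passthrough and can be removed from the graph.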
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
index f0d107232b4517115aa3f64b39b825dbaffb83ce..778da39bf13563cbbdbe54f1140595b057253ae3 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
@@ -71,6 +71,11 @@ bool PropagateArrayDataTypes::Run(Model* model, std::size_t op_index) {
     CHECK_GE(op->inputs.size(), 2);
     const ArrayDataType data_type = model->GetArray(op->inputs[1]).data_type;
     SetDataTypeForAllOutputs(model, op, data_type);
+  } else if (op->type == OperatorType::kTransposeConv) {
+    // TransposeConv produces an output with the same type as its 3rd input.
+    CHECK_GE(op->inputs.size(), 3);
+    const ArrayDataType data_type = model->GetArray(op->inputs[2]).data_type;
+    SetDataTypeForAllOutputs(model, op, data_type);
   } else if (op->type == OperatorType::kCast) {
     // Data type of the Cast op is specified.
     CHECK_EQ(op->outputs.size(), 1);
@@ -97,10 +102,13 @@ bool PropagateArrayDataTypes::Run(Model* model, std::size_t op_index) {
     SetDataTypeForAllOutputs(model, op, data_type);
   } else if (op->type == OperatorType::kTensorFlowUnsupported) {
     auto* unsupported_op = static_cast<TensorFlowUnsupportedOperator*>(op);
-    if (unsupported_op->output_data_types.size() != op->outputs.size()) {
+    // Some output tensors from the op could be eliminated by optimization.
+    // This can make unsupported_op->output_data_types have more elements than
+    // op->outputs.
+    if (unsupported_op->output_data_types.size() < op->outputs.size()) {
       return false;
     }
-    for (int i = 0; i < unsupported_op->output_data_types.size(); ++i) {
+    for (int i = 0; i < op->outputs.size(); ++i) {
       auto output = op->outputs[i];
       auto data_type = unsupported_op->output_data_types[i];
       model->GetArray(output).data_type = data_type;
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
index 0e2e5ecf30053103492337685d85a2aacf832caf..676736cfc523c03c9f4d99c404eb2b5209209945 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
@@ -190,6 +190,116 @@ void ProcessConvOperator(Model* model, ConvOperator* op) {
   }
 }
 
+void ProcessTransposeConvOperator(Model* model, TransposeConvOperator* op) {
+  // TransposeConv is unique in that it is specifically given the output shape
+  // as a 1D array on its 1st input. Theoretically then, resolving the output
+  // shape is as easy as waiting for this input to be resolved. However, we also
+  // have to calculate the padding which requires the weights shape. So, we
+  // might as well calculate the output shape and ensure it matches the
+  // specified one.
+
+  // Check if we have already run.
+  auto& output_array = model->GetArray(op->outputs[0]);
+  if (output_array.has_shape()) {
+    return;
+  }
+
+  // SPECIFIED OUTPUT SHAPE
+  // The below is the specified, or prescribed output shape, _given_ to the
+  // operator as an input.
+  auto& specified_output_shape_array =
+      model->GetArray(op->inputs[TransposeConvOperator::OUTPUT_SHAPE]);
+  if (!specified_output_shape_array.has_shape() ||
+      !specified_output_shape_array.buffer) {
+    // Yield until the specified output shape is resolved as a constant
+    return;
+  }
+
+  CHECK(specified_output_shape_array.data_type == ArrayDataType::kInt32)
+      << "TransposeConv input_dims must be int32";
+
+  CHECK(specified_output_shape_array.shape().dimensions_count() == 1 &&
+        specified_output_shape_array.shape().dims(0) == 4)
+      << "TransposeConv requires a 1D, 4 element array on its 0th input "
+         "specifying the output shape. \""
+      << op->inputs[TransposeConvOperator::OUTPUT_SHAPE] << "\" had shape "
+      << toco::ShapeToString(specified_output_shape_array.shape());
+
+  // COMPUTE PADDING
+  // We require the weights shape to calculate padding.
+  const auto& weights_array =
+      model->GetArray(op->inputs[TransposeConvOperator::WEIGHTS]);
+  if (!weights_array.has_shape()) {
+    // Yield until weights dims have been resolved.
+    return;
+  }
+  const auto& weights_shape = weights_array.shape();
+  CHECK_EQ(weights_shape.dimensions_count(), 4)
+      << "TransposeConv weights must have 4 input dimensions. Input weights \""
+      << op->inputs[TransposeConvOperator::WEIGHTS] << "\" had shape "
+      << toco::ShapeToString(weights_shape) << ".";
+
+  CHECK(weights_shape.dims(0) == 1 && weights_shape.dims(3) == 1)
+      << "TransposeConv weights dimensions must begin and end with 1. Input "
+         "weights \""
+      << op->inputs[TransposeConvOperator::WEIGHTS] << "\" had shape "
+      << toco::ShapeToString(weights_shape) << ".";
+
+  // Compute padding
+  const int kheight = weights_shape.dims(1);
+  const int kwidth = weights_shape.dims(2);
+  op->padding.GetOrCreateFixedPadding();
+  if (op->padding.type == PaddingType::kValid) {
+    op->padding.fixed->height = 0;
+    op->padding.fixed->width = 0;
+  } else if (op->padding.type == PaddingType::kSame) {
+    op->padding.fixed->height = (kheight - 1) / 2;
+    op->padding.fixed->width = (kwidth - 1) / 2;
+  } else {
+    LOG(FATAL) << "TransposeConv only supports SAME or VALID padding";
+  }
+
+  // VALIDATE OUTPUT SHAPE
+  // Compute the output shape from the input and weights shapes to verify it
+  // agrees with the specified output shape.
+  const auto& input_array =
+      model->GetArray(op->inputs[TransposeConvOperator::DATA_INPUT]);
+  if (!input_array.has_shape()) {
+    // Yield until input dims have been resolved.
+    return;
+  }
+  const auto& input_shape = input_array.shape();
+  CHECK_EQ(input_shape.dimensions_count(), 4)
+      << "TransposeConv input shape must have 4 dimensions. Input \""
+      << op->inputs[TransposeConvOperator::DATA_INPUT] << "\" had shape "
+      << toco::ShapeToString(input_shape) << ".";
+
+  // Compute output shape
+  const int input_width = input_shape.dims(2);
+  const int input_height = input_shape.dims(1);
+  int output_height = op->stride_height * (input_height - 1);
+  int output_width = op->stride_width * (input_width - 1);
+  if (op->padding.type == PaddingType::kValid) {
+    output_height += kheight;
+    output_width += kwidth;
+  } else if (op->padding.type == PaddingType::kSame) {
+    output_height += 1;
+    output_width += 1;
+  }
+
+  CHECK(specified_output_shape_array.GetBuffer<ArrayDataType::kInt32>().data ==
+        std::vector<int32>({input_shape.dims(0), output_height, output_width,
+                            weights_shape.dims(3)}))
+      << "Specified output shape: " << ShapeToString(output_array.shape())
+      << ", does not agree with shape computed from input data and weights: ["
+      << input_shape.dims(0) << ", " << output_height << ", " << output_width
+      << ", " << weights_shape.dims(3) << "].";
+
+  // SUCCESS: Set the op's output shape according to the specified output shape.
+  *(output_array.mutable_shape()->mutable_dims()) =
+      specified_output_shape_array.GetBuffer<ArrayDataType::kInt32>().data;
+}
+
 void ProcessDepthwiseConvOperator(Model* model, DepthwiseConvOperator* op) {
   if (!EnsureBiasVectorShape(model, op)) {
     return;
@@ -1300,7 +1410,7 @@ void ProcessTransposeOperator(Model* model, TransposeOperator* op) {
   std::vector<int32> const& perm =
       perm_array.GetBuffer<ArrayDataType::kInt32>().data;
   CHECK_EQ(perm.size(), input_shape.dimensions_count())
-      << "Transpose permutation input " << op->inputs[0]
+      << "Transpose permutation input " << op->inputs[1]
       << " must be same length as input dimensions";
   std::vector<int>* output_dims = output_array.mutable_shape()->mutable_dims();
   for (int i = 0; i < perm.size(); i++) {
@@ -1357,6 +1467,7 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
     case OperatorType::kRelu:
     case OperatorType::kRelu1:
     case OperatorType::kRelu6:
+    case OperatorType::kPRelu:
     case OperatorType::kSoftmax:
     case OperatorType::kLogSoftmax:
     case OperatorType::kLogistic:
@@ -1402,8 +1513,8 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
       ProcessConvOperator(model, static_cast<ConvOperator*>(op));
       break;
     case OperatorType::kTransposeConv:
-      // Unimplemented, hopefully another graph transformation will drop it or
-      // rewrite it.
+      ProcessTransposeConvOperator(model,
+                                   static_cast<TransposeConvOperator*>(op));
       break;
     case OperatorType::kDepthwiseConv:
       ProcessDepthwiseConvOperator(model,
@@ -1542,6 +1653,12 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
     case OperatorType::kTranspose:
       ProcessTransposeOperator(model, static_cast<TransposeOperator*>(op));
       break;
+    case OperatorType::kDynamicPartition:
+    case OperatorType::kDynamicStitch:
+      // DynamicPartition/DynamicStitch are currently only supported for
+      // transforms that remove them, so we avoid propagating shapes through
+      // them and let things settle once they've been removed.
+      break;
     default:
       // Unimplemented, another graph transformation should drop it.
       LOG(FATAL) << "Unhandled operator type " << OperatorTypeName(op->type);
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc b/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
index 77316751bc2642a0c974d16f694aeebe1cd53a9f..ad3f05274b59e3019726f1b7a7080e74d1934c89 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
@@ -49,7 +49,10 @@ bool SupportsQuantization(const Operator& op) {
          type == OperatorType::kTensorFlowReshape ||
          type == OperatorType::kTanh || type == OperatorType::kMul ||
          type == OperatorType::kSpaceToDepth ||
-         type == OperatorType::kDepthToSpace || type == OperatorType::kLstmCell;
+         type == OperatorType::kStridedSlice ||
+         type == OperatorType::kDepthToSpace ||
+         type == OperatorType::kLstmCell || type == OperatorType::kGather ||
+         type == OperatorType::kTranspose;
 }
 
 template <ArrayDataType A>
@@ -222,7 +225,49 @@ ArrayDataType GetQuantizedDataType(const Array& array,
     default:
       LOG(FATAL) << "Unhandled final quantization type "
                  << static_cast<int>(array.final_data_type);
-      return default_type;
+  }
+}
+
+void GetQuantizationParams(ArrayDataType data_type, const MinMax& minmax,
+                           QuantizationParams* quantization_params) {
+  switch (data_type) {
+    case ArrayDataType::kInt8:
+      GetQuantizationParamsFromMinMax<ArrayDataType::kInt8>(
+          minmax, quantization_params);
+      break;
+    case ArrayDataType::kUint8:
+      GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(
+          minmax, quantization_params);
+      break;
+    case ArrayDataType::kInt16:
+      GetQuantizationParamsFromMinMax<ArrayDataType::kInt16>(
+          minmax, quantization_params);
+      break;
+    case ArrayDataType::kUint16:
+      GetQuantizationParamsFromMinMax<ArrayDataType::kUint16>(
+          minmax, quantization_params);
+      break;
+    case ArrayDataType::kInt32:
+      GetQuantizationParamsFromMinMax<ArrayDataType::kInt32>(
+          minmax, quantization_params);
+      break;
+    case ArrayDataType::kUint32:
+      GetQuantizationParamsFromMinMax<ArrayDataType::kUint32>(
+          minmax, quantization_params);
+      break;
+    case ArrayDataType::kInt64:
+      GetQuantizationParamsFromMinMax<ArrayDataType::kInt64>(
+          minmax, quantization_params);
+      break;
+    case ArrayDataType::kUint64:
+      GetQuantizationParamsFromMinMax<ArrayDataType::kUint64>(
+          minmax, quantization_params);
+      break;
+    case ArrayDataType::kFloat:
+    case ArrayDataType::kNone:
+    default:
+      LOG(FATAL) << "Unhandled final quantization type "
+                 << static_cast<int>(data_type);
   }
 }
 
@@ -284,16 +329,14 @@ bool ChooseQuantizationForOperatorInput(
 
   if (op.type == OperatorType::kLstmCell) {
     if (input_index == LstmCellOperator::PREV_STATE_INPUT) {
-      GetQuantizationParamsFromMinMax<ArrayDataType::kInt16>(
-          model->flags, minmax, quantization_params);
       *quantized_data_type = ArrayDataType::kInt16;
+      GetQuantizationParams(*quantized_data_type, minmax, quantization_params);
       return true;
     }
   }
 
-  GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(model->flags, minmax,
-                                                         quantization_params);
   *quantized_data_type = GetQuantizedDataType(array, ArrayDataType::kUint8);
+  GetQuantizationParams(*quantized_data_type, minmax, quantization_params);
   transformation->AddMessageF(
       "For input array %s with min=%g"
       ", max=%g"
@@ -416,15 +459,13 @@ bool ChooseQuantizationForOperatorOutput(
   if (op.type == OperatorType::kLstmCell) {
     if (output_index == LstmCellOperator::STATE_OUTPUT ||
         output_index == LstmCellOperator::ACTIV_TEMP) {
-      GetQuantizationParamsFromMinMax<ArrayDataType::kInt16>(
-          model->flags, minmax, quantization_params);
       *quantized_data_type = ArrayDataType::kInt16;
+      GetQuantizationParams(*quantized_data_type, minmax, quantization_params);
       return true;
     }
   }
-  GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(model->flags, minmax,
-                                                         quantization_params);
   *quantized_data_type = GetQuantizedDataType(array, ArrayDataType::kUint8);
+  GetQuantizationParams(*quantized_data_type, minmax, quantization_params);
   transformation->AddMessageF(
       "For output array %s with min=%g, max=%g"
       ", chose to quantize as %s with zero_point=%d"
@@ -472,9 +513,11 @@ bool Quantize::Run(Model* model, std::size_t op_index) {
   //
   // Let us just guard this assumption by the following assertion:
   for (const auto& input : op.inputs) {
-    if (IsInputArray(*model, input)) {
-      const auto& input_array = model->GetArray(input);
-      CHECK(input_array.quantization_params);
+    const auto& input_array = model->GetArray(input);
+    if (IsInputArray(*model, input) &&
+        input_array.data_type == ArrayDataType::kFloat) {
+      CHECK(input_array.quantization_params)
+          << "Input array " << input << " is missing quantization_params";
     }
   }
   if (!SupportsQuantization(op)) {
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_passthrough.cc b/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_passthrough.cc
index 587f171bbf823408a45083c36d52f1d38c300123..aa93ace03af300f9cbd3f9c6620a6a58b9329aa4 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_passthrough.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_passthrough.cc
@@ -60,7 +60,9 @@ bool RemoveTrivialPassthroughOp(GraphTransformation* transformation,
   for (int i = 0; i < passthru_op->inputs.size(); i++) {
     if (!model->GetArray(passthru_op->inputs[i]).buffer) {
       count_nonconstant_input_arrays++;
-      main_input_array_index = i;
+      if (count_nonconstant_input_arrays == 1) {
+        main_input_array_index = i;
+      }
     }
   }
 
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_quantized_activation_func.cc b/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_quantized_activation_func.cc
index 28f76c9d36d6f68c8997fa0cf620c8aec4273619..9b65feaa6443cd32ac1bef961600ff225d52d4b2 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_quantized_activation_func.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_quantized_activation_func.cc
@@ -12,6 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
+#include <limits>
 #include <memory>
 #include <string>
 #include <vector>
@@ -30,6 +31,7 @@ bool RemoveTrivialQuantizedActivationFunc::Run(Model* model,
   const auto it = model->operators.begin() + op_index;
   auto* op = it->get();
   if (op->fused_activation_function != FusedActivationFunctionType::kRelu &&
+      op->fused_activation_function != FusedActivationFunctionType::kRelu1 &&
       op->fused_activation_function != FusedActivationFunctionType::kRelu6) {
     return false;
   }
@@ -42,33 +44,49 @@ bool RemoveTrivialQuantizedActivationFunc::Run(Model* model,
   }
   const auto& quantization_params = output_array.GetQuantizationParams();
 
+  double clamp_min;
+  double clamp_max;
+  switch (op->fused_activation_function) {
+    case FusedActivationFunctionType::kRelu:
+      clamp_min = 0.0;
+      clamp_max = std::numeric_limits<double>::infinity();
+      break;
+    case FusedActivationFunctionType::kRelu1:
+      clamp_min = -1.0;
+      clamp_max = 1.0;
+      break;
+    case FusedActivationFunctionType::kRelu6:
+      clamp_min = 0.0;
+      clamp_max = 6.0;
+      break;
+    default:
+      LOG(FATAL) << "Unsupported fused activation type: "
+                 << static_cast<int>(op->fused_activation_function);
+      return false;
+  }
+
   bool has_nontrivial_min_bound = false;
   bool has_nontrivial_max_bound = false;
 
-  if (op->fused_activation_function == FusedActivationFunctionType::kRelu ||
-      op->fused_activation_function == FusedActivationFunctionType::kRelu6) {
-    double lowest_representable_output =
-        (0. - quantization_params.zero_point) * quantization_params.scale;
-    if (lowest_representable_output < 0.) {
-      has_nontrivial_min_bound = true;
-      AddMessageF(
-          "Quantized activation function is not trivial: "
-          "the lowest representable output value %g"
-          " less than the clamp min bound.",
-          lowest_representable_output);
-    }
+  double lowest_representable_output =
+      (0. - quantization_params.zero_point) * quantization_params.scale;
+  if (lowest_representable_output < clamp_min) {
+    has_nontrivial_min_bound = true;
+    AddMessageF(
+        "Quantized activation function is not trivial: "
+        "the lowest representable output value %g"
+        " is less than the clamp min bound %g.",
+        lowest_representable_output, clamp_min);
   }
-  if (op->fused_activation_function == FusedActivationFunctionType::kRelu6) {
-    double highest_representable_output =
-        (255. - quantization_params.zero_point) * quantization_params.scale;
-    if (highest_representable_output > 6.) {
-      has_nontrivial_max_bound = true;
-      AddMessageF(
-          "Quantized activation function is not trivial: "
-          "the highest representable output value %g"
-          " is greater than the clamp max bound.",
-          highest_representable_output);
-    }
+  double highest_representable_output =
+      (255. - quantization_params.zero_point) * quantization_params.scale;
+  if (highest_representable_output > clamp_max) {
+    has_nontrivial_max_bound = true;
+    AddMessageF(
+        "Quantized activation function is not trivial: "
+        "the highest representable output value %g"
+        " is greater than the clamp max bound %g.",
+        highest_representable_output, clamp_max);
   }
 
   if (has_nontrivial_min_bound || has_nontrivial_max_bound) {
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_reshape.cc b/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_reshape.cc
index 90f9381ec154f145cda826ff9730ff332cd96701..61477d59aea2f11c6347b84d8863763a86c43558 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_reshape.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_reshape.cc
@@ -61,8 +61,8 @@ bool IsReshapeTrivial(const Model& model, const Operator& op,
     if (next_op->type == OperatorType::kTensorFlowReshape) {
       transformation->AddMessageF(
           "%s is trivial because its output is only consumed by another "
-          "Reshape op",
-          LogName(op));
+          "Reshape op %s",
+          LogName(op), LogName(*next_op));
       return true;
     }
   }
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/reorder_activation_functions.cc b/tensorflow/contrib/lite/toco/graph_transformations/reorder_activation_functions.cc
index 30a005c789bb12e880e8e4534088d99ebacba84a..9852c86c21b9a0714bc728e60b5d9dfe61ff52d1 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/reorder_activation_functions.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/reorder_activation_functions.cc
@@ -42,14 +42,22 @@ bool ReorderActivationFunctions::Run(Model* model, std::size_t op_index) {
   std::unique_ptr<Operator>& exchange_op = *exchange_it;
   DCHECK(exchange_op);
 
-  if (exchange_op->type != OperatorType::kTensorFlowReshape) {
-    return false;
+  // Allow activation functions to move up over any operator that does not
+  // change the values.
+  switch (exchange_op->type) {
+    case OperatorType::kExpandDims:
+    case OperatorType::kSqueeze:
+    case OperatorType::kTensorFlowReshape:
+    case OperatorType::kTranspose:
+      break;
+    default:
+      return false;
   }
 
   DCHECK_EQ(exchange_op->outputs[0], ac_op->inputs[0]);
-  const auto& exchange_op_input = exchange_op->inputs[0];
-  const auto& intermediate_array = exchange_op->outputs[0];
-  const auto& ac_op_output = ac_op->outputs[0];
+  const auto exchange_op_input = exchange_op->inputs[0];
+  const auto intermediate_array = exchange_op->outputs[0];
+  const auto ac_op_output = ac_op->outputs[0];
 
   int count_ops_consuming_output =
       CountOpsWithInput(*model, intermediate_array);
@@ -62,32 +70,58 @@ bool ReorderActivationFunctions::Run(Model* model, std::size_t op_index) {
     return false;
   }
 
-  // If the ac_op was originally producing an output_array we can't reorder as
-  // otherwise the output array would change. It'd be nice to still be able to
-  // reorder but if code is relying on the fetch names instead of array indices
-  // this won't work.
-  for (int i = 0; i < model->flags.output_arrays_size(); ++i) {
-    if (model->flags.output_arrays(i) == ac_op->outputs[0]) {
-      AddMessageF(
-          "Not exchanging activation function with %s to preserve output array "
-          "name %s",
-          LogName(*exchange_op), ac_op->outputs[0]);
-      return false;
-    }
-  }
-
-  // Rewire by changing inputs, including all consumers.
-  Operator* consumer = GetFirstOpWithInput(*model, ac_op_output);
-  while (consumer) {
-    for (int i = 0; i < consumer->inputs.size(); ++i) {
-      if (consumer->inputs[i] == ac_op_output) {
-        consumer->inputs[i] = intermediate_array;
+  // If the ac_op was originally producing an output_array we can't trivially
+  // reorder, as the output array name would otherwise change and break
+  // downstream assumptions. To work around that, we rename arrays below. This
+  // rare case comes at the cost of slightly more confusing intermediate array
+  // names.
+  bool is_ac_op_output =
+      std::find(model->flags.output_arrays().begin(),
+                model->flags.output_arrays().end(),
+                ac_op_output) != model->flags.output_arrays().end();
+  if (is_ac_op_output) {
+    // To preserve the output array name of the activation function we need to
+    // create a temporary to use to pass between ac->ex.
+    //
+    // Original:
+    //  (a) -> EX -> (b) -> AC -> (c)
+    // Now:
+    //  (a) -> AC -> (c') -> EX -> (c)
+    AddMessageF(
+        "Exchanging activation function %s with %s but renaming to preserve "
+        "output array %s",
+        LogName(*ac_op), LogName(*exchange_op), ac_op->outputs[0]);
+
+    auto renamed_ac_op_output =
+        AvailableArrayName(*model, ac_op_output + "_exchange");
+    ac_op->inputs[0] = exchange_op_input;
+    ac_op->outputs[0] = renamed_ac_op_output;
+    model->EraseArray(exchange_op->outputs[0]);
+    exchange_op->inputs[0] = renamed_ac_op_output;
+    exchange_op->outputs[0] = ac_op_output;
+  } else {
+    // Simply swap the order and update consumers to use the exchange_op output
+    // array (b).
+    //
+    // Original:
+    //  (a) -> EX -> (b) -> AC -> (c)
+    // Now:
+    //  (a) -> AC -> (c) -> EX -> (b)
+    AddMessageF("Exchanging activation function %s with %s", LogName(*ac_op),
+                LogName(*exchange_op));
+
+    Operator* consumer = GetFirstOpWithInput(*model, ac_op_output);
+    while (consumer) {
+      for (int i = 0; i < consumer->inputs.size(); ++i) {
+        if (consumer->inputs[i] == ac_op_output) {
+          consumer->inputs[i] = intermediate_array;
+        }
       }
+      consumer = GetFirstOpWithInput(*model, ac_op_output);
     }
-    consumer = GetFirstOpWithInput(*model, ac_op_output);
+    ac_op->inputs[0] = exchange_op_input;
+    exchange_op->inputs[0] = ac_op_output;
   }
-  ac_op->inputs[0] = exchange_op_input;
-  exchange_op->inputs[0] = ac_op_output;
 
   // Clear shapes; this will allow shape propagation to fix the sizes for us.
   model->GetOrCreateArray(ac_op->outputs[0]).clear_shape();
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_fake_quant.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_fake_quant.cc
index 944901ece77430708013ea4ca340a30511ba0174..625d90205a801ad7c3fc1026c9cedc9b509f920d 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_fake_quant.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_fake_quant.cc
@@ -55,8 +55,8 @@ bool ResolveConstantFakeQuant::Run(Model* model, std::size_t op_index) {
   const int size = input_buffer.data.size();
   output_buffer.data.resize(size);
   QuantizationParams qparams;
-  GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(
-      model->flags, *fakequant_op->minmax, &qparams);
+  GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(*fakequant_op->minmax,
+                                                         &qparams);
   for (int i = 0; i < size; i++) {
     const double src_val = input_buffer.data[i];
     const double unclamped_quantized_val =
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_gather.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_gather.cc
new file mode 100644
index 0000000000000000000000000000000000000000..d999c2df9483e096f333c6af83e1d9fee873d4d6
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_gather.cc
@@ -0,0 +1,134 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <vector>
+
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace toco {
+
+namespace {
+
+// Gathers data from axis 0.
+template <ArrayDataType Type>
+inline void Gather(const Array& input_array, int input_rank,
+                   const Array& coords_array, Array* output_array) {
+  const Shape& input_shape = input_array.shape();
+  const std::vector<DataType<Type>>& input_data =
+      input_array.GetBuffer<Type>().data;
+  const Shape& coords_shape = coords_array.shape();
+  const std::vector<int32>& coords_data =
+      coords_array.GetBuffer<ArrayDataType::kInt32>().data;
+
+  const Shape& output_shape = output_array->shape();
+  std::vector<DataType<Type>>& output_data =
+      output_array->GetMutableBuffer<Type>().data;
+  output_data.resize(RequiredBufferSizeForShape(output_shape));
+
+  int rev_input_rank = input_shape.dimensions_count() - 1 - (input_rank - 1);
+  CHECK_EQ(coords_shape.dims(0), output_array->shape().dims(rev_input_rank));
+
+  int stride = 1;
+  for (int i = input_shape.dimensions_count() - 1; i >= input_rank - 1; --i) {
+    stride *= input_shape.dims(i);
+  }
+
+  for (int i = 0; i < coords_shape.dims(0); ++i) {
+    DCHECK_GE(coords_data[i], 0);
+    DCHECK_LT(coords_data[i], input_shape.dims(rev_input_rank));
+    DataType<Type>* out = output_data.data() + i * stride;
+    const DataType<Type>* in = input_data.data() + coords_data[i] * stride;
+    memcpy(out, in, sizeof(DataType<Type>) * stride);
+  }
+}
+
+}  // namespace
+
+// Resolves a constant Gather operation.
+// This simply performs the gather and produces the output array with the
+// appropriate values.
+bool ResolveConstantGather::Run(Model* model, std::size_t op_index) {
+  auto it = model->operators.begin() + op_index;
+  const auto* base_op = it->get();
+  if (base_op->type != OperatorType::kGather) {
+    return false;
+  }
+  const auto* op = static_cast<const GatherOperator*>(base_op);
+
+  CHECK_EQ(op->inputs.size(), 2);
+  CHECK_EQ(op->outputs.size(), 1);
+  auto& output_array = model->GetArray(op->outputs[0]);
+  if (output_array.data_type == ArrayDataType::kNone) {
+    // Yield until the output type has been set by PropagateArrayDataTypes.
+    return false;
+  }
+  if (!output_array.has_shape()) {
+    // Yield until the output shape has been set by PropagateFixedSizes.
+    return false;
+  }
+
+  // Only handling axis=0 for now.
+  if (op->axis != 0) {
+    AddMessageF("%s has axis %d; only axis=0 is supported", LogName(*op),
+                op->axis);
+    return false;
+  }
+
+  // We require constant inputs.
+  if (!IsConstantParameterArray(*model, op->inputs[0]) ||
+      !IsConstantParameterArray(*model, op->inputs[1])) {
+    return false;
+  }
+  const Array& input_array = model->GetArray(op->inputs[0]);
+  const Array& coords_array = model->GetArray(op->inputs[1]);
+  CHECK(coords_array.data_type == ArrayDataType::kInt32)
+      << "Only int32 indices are supported";
+
+  CHECK(!output_array.buffer);
+  switch (output_array.data_type) {
+    case ArrayDataType::kFloat:
+      Gather<ArrayDataType::kFloat>(input_array, op->input_rank, coords_array,
+                                    &output_array);
+      break;
+    case ArrayDataType::kUint8:
+      Gather<ArrayDataType::kUint8>(input_array, op->input_rank, coords_array,
+                                    &output_array);
+      break;
+    case ArrayDataType::kInt32:
+      Gather<ArrayDataType::kInt32>(input_array, op->input_rank, coords_array,
+                                    &output_array);
+      break;
+    case ArrayDataType::kInt64:
+      Gather<ArrayDataType::kInt64>(input_array, op->input_rank, coords_array,
+                                    &output_array);
+      break;
+    default:
+      LOG(FATAL) << "Unsupported data type given to Gather op with output \""
+                 << op->outputs[0] << "\"";
+      break;
+  }
+
+  // Erase input arrays if no longer used after we remove the op.
+  DeleteArrayIfUsedOnce(op->inputs[0], model);
+  DeleteArrayIfUsedOnce(op->inputs[1], model);
+
+  // Erase the operator.
+  model->operators.erase(it);
+  return true;
+}
+
+}  // namespace toco
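
The constant-folding step above restricts Gather to axis 0 with int32 indices. As a minimal sketch of what that amounts to on a flat row-major buffer (the helper `GatherAxis0` and its `stride` parameter are hypothetical names for illustration, not toco's own `Gather<>` template):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of toco: gathers whole rows of a row-major
// buffer along axis 0. `stride` is the number of elements in one row, i.e.
// the product of all dimensions after the first.
template <typename T>
std::vector<T> GatherAxis0(const std::vector<T>& input, std::size_t stride,
                           const std::vector<std::int32_t>& coords) {
  assert(stride > 0 && input.size() % stride == 0);
  const std::size_t rows = input.size() / stride;
  std::vector<T> output;
  output.reserve(coords.size() * stride);
  for (std::int32_t coord : coords) {
    // Each coordinate must name an existing row.
    assert(coord >= 0 && static_cast<std::size_t>(coord) < rows);
    const auto first =
        input.begin() +
        static_cast<std::ptrdiff_t>(static_cast<std::size_t>(coord) * stride);
    output.insert(output.end(), first,
                  first + static_cast<std::ptrdiff_t>(stride));
  }
  return output;
}
```

For a 3x2 table stored as `{1, 2, 3, 4, 5, 6}`, the coordinates `{2, 0, 2}` select rows 2, 0 and 2 in that order.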
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc
index f227554bc505efe6a758fdd9894fee43f2500641..d4db6f1c009cd19515655fb31974a2e97cfa42e8 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc
@@ -28,21 +28,45 @@ limitations under the License.
 
 namespace toco {
 
+bool CopyMinMaxFromFirstInput(const Operator& op, Model* model) {
+  auto& output_array = model->GetArray(op.outputs[0]);
+  if (output_array.minmax) {
+    return false;
+  }
+  const auto& input_array = model->GetArray(op.inputs[0]);
+  if (!input_array.minmax) {
+    return false;
+  }
+  const auto& input_minmax = input_array.GetMinMax();
+  CHECK(!output_array.minmax);
+  auto& output_minmax = output_array.GetOrCreateMinMax();
+  output_minmax.min = input_minmax.min;
+  output_minmax.max = input_minmax.max;
+  return true;
+}
+
 bool ResolveConstantUnaryOperator::Run(Model* model, std::size_t op_index) {
   const auto unary_it = model->operators.begin() + op_index;
   const auto* unary_op = unary_it->get();
-  // Test for unary ops of types that we know how to resolve
-  if (unary_op->type != OperatorType::kCast &&
-      unary_op->type != OperatorType::kNeg &&
-      unary_op->type != OperatorType::kTensorFlowRsqrt &&
-      unary_op->type != OperatorType::kTensorFlowSqrt &&
-      unary_op->type != OperatorType::kTensorFlowSquare &&
-      unary_op->type != OperatorType::kTensorFlowSum &&
-      unary_op->type != OperatorType::kTensorFlowMin &&
-      unary_op->type != OperatorType::kTensorFlowMax &&
-      unary_op->type != OperatorType::kTensorFlowReshape) {
-    return false;
+  // Test for unary ops of types that we know how to resolve.
+  switch (unary_op->type) {
+    case OperatorType::kCast:
+    case OperatorType::kNeg:
+    case OperatorType::kTensorFlowRsqrt:
+    case OperatorType::kTensorFlowSqrt:
+    case OperatorType::kTensorFlowSquare:
+    case OperatorType::kTensorFlowSum:
+    case OperatorType::kTensorFlowMin:
+    case OperatorType::kTensorFlowMax:
+    case OperatorType::kTensorFlowReshape:
+    case OperatorType::kRelu6:
+    case OperatorType::kRelu1:
+    case OperatorType::kRelu:
+      break;
+    default:
+      return false;
   }
+
   // Check if the input is a constant parameter.
   if (!IsConstantParameterArray(*model, unary_op->inputs[0])) {
     return false;
@@ -76,6 +100,12 @@ bool ResolveConstantUnaryOperator::Run(Model* model, std::size_t op_index) {
     return false;
   }
 
+  // The min-max is only copied for ops that copy data without arithmetic.
+  // In the future, trivial transposes, etc., can be handled here.
+  if (unary_op->type == OperatorType::kTensorFlowReshape) {
+    CopyMinMaxFromFirstInput(*unary_op, model);
+  }
+
   const auto& input_array = model->GetArray(unary_op->inputs[0]);
   // We have already tested above for existence of buffers (synonymous to being
   // a constant param).
@@ -135,15 +165,34 @@ bool ResolveConstantUnaryOperator::Run(Model* model, std::size_t op_index) {
     }
   } else if (unary_op->type == OperatorType::kTensorFlowReshape) {
     CHECK(input_buffer_size == output_buffer_size);
-    memcpy(output_float_data.data(), (*input_float_data).data(),
-           output_buffer_size * sizeof(output_float_data[0]));
+    output_float_data = *input_float_data;
   } else if (unary_op->type == OperatorType::kTensorFlowSum) {
-    // At the moment only full reduction across all dimensions is supported.
-    float sum = 0.f;
-    for (int i = 0; i < input_buffer_size; i++) {
-      sum += (*input_float_data)[i];
+    CHECK_EQ(unary_op->inputs.size(), 2) << "Sum needs 2 inputs";
+    if (!IsConstantParameterArray(*model, unary_op->inputs[1])) {
+      AddMessageF("Axis input is non-constant");
+      return false;
     }
-    for (int i = 0; i < output_buffer_size; ++i) {
+    auto& axis_array = model->GetArray(unary_op->inputs[1]);
+    CHECK(axis_array.data_type == ArrayDataType::kInt32);
+    int axis = axis_array.GetBuffer<ArrayDataType::kInt32>().data[0];
+    CHECK_LT(axis, input_shape.dimensions_count()) << "Axis out of bounds";
+
+    // We currently only handle reduction on axis 0.
+    CHECK_EQ(axis, 0) << "Only reduction along axis 0 is supported";
+    // We currently only handle 1-D and 2-D input tensors.
+    CHECK_LE(input_shape.dimensions_count(), 2) << "Rank >2 not yet supported";
+    // We only support keep_dims=true; shape prop will need to change otherwise.
+    auto sum_op = static_cast<const TensorFlowSumOperator*>(unary_op);
+    CHECK(sum_op->keep_dims) << "Only keep_dims=true is supported";
+
+    std::vector<int> indices(input_shape.dimensions_count());
+    for (int i = 0; i < input_shape.dims(1); ++i) {
+      indices[1] = i;
+      float sum = 0.f;
+      for (int j = 0; j < input_shape.dims(0); ++j) {
+        indices[0] = j;
+        sum += (*input_float_data)[Offset(input_shape, indices)];
+      }
       output_float_data[i] = sum;
     }
   } else if (unary_op->type == OperatorType::kTensorFlowMin) {
@@ -193,6 +242,37 @@ bool ResolveConstantUnaryOperator::Run(Model* model, std::size_t op_index) {
       }
       output_float_data[i] = outval;
     }
+  } else if (unary_op->type == OperatorType::kRelu6 ||
+             unary_op->type == OperatorType::kRelu1 ||
+             unary_op->type == OperatorType::kRelu) {
+    for (size_t i = 0; i < output_buffer_size; ++i) {
+      const float value = (*input_float_data)[i];
+      float new_value = 0.0f;
+      switch (unary_op->type) {
+        case OperatorType::kRelu: {
+          static constexpr float kLower = 0;
+          new_value = value < kLower ? kLower : value;
+          break;
+        }
+        case OperatorType::kRelu1: {
+          static constexpr float kUpper = 1;
+          static constexpr float kLower = -1;
+          new_value = value > kUpper ? kUpper : value < kLower ? kLower : value;
+          break;
+        }
+        case OperatorType::kRelu6: {
+          static constexpr float kUpper = 6;
+          static constexpr float kLower = 0;
+          new_value = value > kUpper ? kUpper : value < kLower ? kLower : value;
+          break;
+        }
+        default:
+          LOG(FATAL) << "Unsupported activation function "
+                     << LogName(*unary_op);
+          return false;
+      }
+      output_float_data[i] = new_value;
+    }
   } else {
     LOG(FATAL) << "should not get here.";
   }
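
The Sum and Relu branches above can be illustrated in isolation. This is a minimal sketch, assuming a row-major 2-D buffer and `keep_dims=true`; the function names (`SumAxis0`, `Clamp`, `Relu`, `Relu1`, `Relu6`) are hypothetical stand-ins and do not appear in toco:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a row-major [rows, cols] buffer along axis 0. With keep_dims=true the
// result has shape [1, cols], stored here as `cols` floats.
std::vector<float> SumAxis0(const std::vector<float>& input, std::size_t rows,
                            std::size_t cols) {
  assert(input.size() == rows * cols);
  std::vector<float> output(cols, 0.0f);
  for (std::size_t col = 0; col < cols; ++col) {
    for (std::size_t row = 0; row < rows; ++row) {
      output[col] += input[row * cols + col];
    }
  }
  return output;
}

// Clamps `value` to the closed interval [lower, upper].
float Clamp(float value, float lower, float upper) {
  return value > upper ? upper : value < lower ? lower : value;
}

// Relu has only a lower bound; Relu1 clamps to [-1, 1]; Relu6 to [0, 6].
float Relu(float value) { return value < 0.0f ? 0.0f : value; }
float Relu1(float value) { return Clamp(value, -1.0f, 1.0f); }
float Relu6(float value) { return Clamp(value, 0.0f, 6.0f); }
```

Note that the three activation types are alternatives, so a dispatch on the operator type has to test them with a logical OR; a conjunction of three different equality tests can never be true.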
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/unpartition_embedding_lookup.cc b/tensorflow/contrib/lite/toco/graph_transformations/unpartition_embedding_lookup.cc
new file mode 100644
index 0000000000000000000000000000000000000000..48c326651f3201b4f7a31ac2440b171841e8ed7b
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/unpartition_embedding_lookup.cc
@@ -0,0 +1,240 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+
+namespace toco {
+
+bool UnpartitionEmbeddingLookup::Run(Model* model, std::size_t op_index) {
+  // Collapses a partitioned tf.nn.embedding_lookup back into a single Gather.
+  // https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup
+  // This transform attempts to identify the len(params) > 1 case and collapse
+  // it to the len(params) = 1 case by concatenating the original params and
+  // reversing the partitioning.
+  //
+  // If len(params) to the tf.nn.embedding_lookup == 1, the whole op becomes
+  // simply a gather:
+  // https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/python/ops/embedding_ops.py#L150
+  //
+  // Notes on this implementation:
+  // - only supports partition_strategy='mod'
+  //
+  // A rough graph of a partitioned embedding_lookup looks like:
+  //   (ids)--+-->FloorDiv--+-->DynamicPartition-->[[Gather]]--\
+  //          \-->FloorMod--/                                  |
+  //                 V                                         |
+  //   Range-->DynamicPartition-------->DynamicStitch<---------/
+  //  (const)                                V
+  //                                     (embeddings)
+
+  // First look for the final DynamicStitch.
+  auto op_it = model->operators.begin() + op_index;
+  if (op_it->get()->type != OperatorType::kDynamicStitch) {
+    return false;
+  }
+  auto* stitch_op = static_cast<DynamicStitchOperator*>(op_it->get());
+
+  // Split up the DynamicStitch inputs into the indices and data.
+  std::vector<string> stitch_indices_inputs;
+  std::vector<string> stitch_data_inputs;
+  for (size_t i = 0; i < stitch_op->num_partitions; ++i) {
+    stitch_indices_inputs.push_back(stitch_op->inputs[i]);
+  }
+  for (size_t i = stitch_op->num_partitions; i < stitch_op->num_partitions * 2;
+       ++i) {
+    stitch_data_inputs.push_back(stitch_op->inputs[i]);
+  }
+
+  // Validate all indices come from the same DynamicPartition.
+  DynamicPartitionOperator* indices_partition_op = nullptr;
+  for (const string& indices_partition_output_name : stitch_indices_inputs) {
+    auto* op = GetOpWithOutput(*model, indices_partition_output_name);
+    CHECK(op) << "Source of " << indices_partition_output_name << " not found";
+    if (op->type != OperatorType::kDynamicPartition) {
+      AddMessageF(
+          "Skipping because indices input %s into "
+          "%s is unexpected",
+          LogName(*op), LogName(*stitch_op));
+      return false;
+    }
+    if (!indices_partition_op) {
+      indices_partition_op = static_cast<DynamicPartitionOperator*>(op);
+    } else {
+      // Ensure this is the same op as previous ones.
+      if (op != indices_partition_op) {
+        AddMessageF(
+            "Skipping because indices input %s into "
+            "%s is from a different source op than others",
+            LogName(*op), LogName(*stitch_op));
+        return false;
+      }
+    }
+  }
+  CHECK(indices_partition_op) << "No indices inputs";
+
+  // The data for the indices must be a constant range of the array shape.
+  if (!IsConstantParameterArray(*model, indices_partition_op->inputs[0])) {
+    AddMessageF("Skipping because indices partition data is non-constant");
+    return false;
+  }
+  auto& indices_data_array = model->GetArray(indices_partition_op->inputs[0]);
+  if (indices_data_array.data_type == ArrayDataType::kNone) {
+    // Yield until data types are propagated.
+    return false;
+  }
+  CHECK(indices_data_array.data_type == ArrayDataType::kInt32)
+      << "Indices partition inputs must be int32";
+  const auto& indices_data_buffer =
+      indices_data_array.GetBuffer<ArrayDataType::kInt32>().data;
+  for (size_t i = 0; i < indices_data_buffer.size(); ++i) {
+    CHECK_EQ(indices_data_buffer[i], i) << "Indices range must be identity";
+  }
+
+  // Find all of the gathers used for the data inputs.
+  std::vector<GatherOperator*> gather_ops;
+  for (const string& gather_output_name : stitch_data_inputs) {
+    auto* op = GetOpWithOutput(*model, gather_output_name);
+    CHECK(op) << "Source of " << gather_output_name << " not found";
+    if (op->type != OperatorType::kGather) {
+      AddMessageF(
+          "Skipping because data input %s into %s "
+          "is unexpected",
+          LogName(*op), LogName(*stitch_op));
+      return false;
+    }
+    gather_ops.push_back(static_cast<GatherOperator*>(op));
+  }
+
+  // Validate all gathers come from the same DynamicPartition.
+  DynamicPartitionOperator* data_partition_op = nullptr;
+  for (auto* gather_op : gather_ops) {
+    auto* op = GetOpWithOutput(*model, gather_op->inputs[1]);
+    CHECK(op) << "Source of " << gather_op->inputs[1] << " not found";
+    if (op->type != OperatorType::kDynamicPartition) {
+      AddMessageF(
+          "Skipping because data input %s into "
+          "%s is unexpected",
+          LogName(*op), LogName(*gather_op));
+      return false;
+    }
+    if (!data_partition_op) {
+      data_partition_op = static_cast<DynamicPartitionOperator*>(op);
+    } else {
+      // Ensure this is the same op as previous ones.
+      if (op != data_partition_op) {
+        AddMessageF(
+            "Skipping because data input %s into "
+            "%s is from a different source op than others",
+            LogName(*op), LogName(*gather_op));
+        return false;
+      }
+    }
+  }
+  CHECK(data_partition_op) << "No data inputs";
+
+  // Validate the partition ops have the same sizes.
+  CHECK_EQ(indices_partition_op->num_partitions,
+           data_partition_op->num_partitions)
+      << "Indices and data partition ops have differing dimensions";
+  int num_partitions = indices_partition_op->num_partitions;
+
+  // Partition strategy of 'mod' gives us a FloorMod and FloorDiv.
+  // The gather partition uses the FloorDiv as the data and FloorMod as the
+  // partitions, and the indices use the FloorMod as their partitions.
+  Operator* div_op = GetOpWithOutput(*model, data_partition_op->inputs[0]);
+  Operator* mod_op = GetOpWithOutput(*model, data_partition_op->inputs[1]);
+  CHECK(div_op && div_op->type == OperatorType::kFloorDiv)
+      << "Unsupported partition strategy";
+  CHECK(mod_op && mod_op->type == OperatorType::kFloorMod)
+      << "Unsupported partition strategy";
+  CHECK_EQ(mod_op, GetOpWithOutput(*model, indices_partition_op->inputs[1]))
+      << "Indices and data partition ops require the same partition strategy "
+         "and inputs";
+
+  // Glob together all of the gather data. This is not yet in the correct order.
+  auto* gather_params_concat_op = new ConcatenationOperator;
+  for (const auto& gather_op : gather_ops) {
+    gather_params_concat_op->inputs.push_back(gather_op->inputs[0]);
+  }
+  gather_params_concat_op->outputs.push_back(
+      AvailableArrayName(*model, gather_ops[0]->inputs[0] + "_unpartitioned"));
+  op_it = model->operators.emplace(op_it, gather_params_concat_op) + 1;
+  model->GetOrCreateArray(gather_params_concat_op->outputs[0]);
+
+  // Permute the gather params to undo the partitioning that was originally
+  // done.
+  auto* gather_params_permute_op = new GatherOperator;
+  gather_params_permute_op->inputs.push_back(
+      gather_params_concat_op->outputs[0]);
+  gather_params_permute_op->inputs.push_back(
+      AvailableArrayName(*model, gather_ops[0]->inputs[0] + "_permuted/perm"));
+  gather_params_permute_op->outputs.push_back(
+      AvailableArrayName(*model, gather_ops[0]->inputs[0] + "_permuted"));
+  op_it = model->operators.emplace(op_it, gather_params_permute_op) + 1;
+  model->GetOrCreateArray(gather_params_permute_op->outputs[0]);
+  const auto& partition_array = model->GetArray(gather_ops[0]->inputs[0]);
+  const auto& partition_array_dims = partition_array.shape().dims();
+  gather_params_permute_op->input_rank =
+      partition_array.shape().dimensions_count();
+  auto& perm_array =
+      model->GetOrCreateArray(gather_params_permute_op->inputs[1]);
+  perm_array.data_type = ArrayDataType::kInt32;
+  perm_array.mutable_shape()->ReplaceDims(
+      {num_partitions * partition_array_dims[0]});
+  auto& perm_data = perm_array.GetMutableBuffer<ArrayDataType::kInt32>().data;
+  perm_data.resize(RequiredBufferSizeForShape(perm_array.shape()));
+  // NOTE: this is what relies on the partition_strategy.
+  for (int i = 0; i < num_partitions * partition_array_dims[0]; ++i) {
+    int p = i % num_partitions;
+    perm_data[i] = p * partition_array_dims[0] + i / num_partitions;
+  }
+
+  // Insert the new unpartitioned gather op.
+  auto* merged_gather_op = new GatherOperator;
+  merged_gather_op->inputs = {gather_params_permute_op->outputs[0],
+                              mod_op->inputs[0]};
+  merged_gather_op->outputs = {stitch_op->outputs[0]};
+  merged_gather_op->input_rank = partition_array.shape().dimensions_count();
+  model->operators.emplace(op_it, merged_gather_op);
+
+  AddMessageF(
+      "Replacing suspected partitioned tf.nn.embedding_lookup (starting at %s "
+      "+ %s and ending at %s) with a single unpartitioned gather %s",
+      LogName(*div_op), LogName(*mod_op), LogName(*stitch_op),
+      LogName(*merged_gather_op));
+
+  // Ensure the stitch output array is dead, as we don't want whatever was in it
+  // previously now that we've redefined it. It'll be recreated when needed.
+  model->EraseArray(stitch_op->outputs[0]);
+  model->GetOrCreateArray(merged_gather_op->outputs[0]);
+
+  // Erase all the original ops.
+  DeleteOpAndArraysIfUnused(model, div_op);
+  DeleteOpAndArraysIfUnused(model, mod_op);
+  for (auto* gather_op : gather_ops) {
+    DeleteOpAndArraysIfUnused(model, gather_op);
+  }
+  DeleteOpAndArraysIfUnused(model, indices_partition_op);
+  DeleteOpAndArraysIfUnused(model, data_partition_op);
+  DeleteOpAndArraysIfUnused(model, stitch_op);
+  return true;
+}
+
+}  // namespace toco
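
The permutation built in `UnpartitionEmbeddingLookup::Run` can be checked on small numbers. The sketch below assumes `partition_strategy='mod'` and that every partition holds the same number of rows, which is what the transform itself relies on; `ModPartitionPermutation` and `ConcatModPartitions` are hypothetical helper names used only for this illustration:

```cpp
#include <cassert>
#include <vector>

// Builds the permutation that maps an original row id to its position in the
// concatenation of all partitions, mirroring the loop in the transform.
std::vector<int> ModPartitionPermutation(int num_partitions,
                                         int rows_per_partition) {
  assert(num_partitions > 0 && rows_per_partition >= 0);
  const int total = num_partitions * rows_per_partition;
  std::vector<int> perm(total);
  for (int i = 0; i < total; ++i) {
    const int p = i % num_partitions;  // Partition that holds row i.
    perm[i] = p * rows_per_partition + i / num_partitions;
  }
  return perm;
}

// Simulates the concatenated partitions: under the 'mod' strategy, partition
// p holds the original row ids p, p + P, p + 2P, ... (P = num_partitions).
std::vector<int> ConcatModPartitions(int num_partitions,
                                     int rows_per_partition) {
  std::vector<int> concatenated;
  for (int p = 0; p < num_partitions; ++p) {
    for (int r = 0; r < rows_per_partition; ++r) {
      concatenated.push_back(p + r * num_partitions);
    }
  }
  return concatenated;
}
```

With 3 partitions of 2 rows each, the concatenation holds ids `0, 3, 1, 4, 2, 5`, and gathering it with the permutation `0, 2, 4, 1, 3, 5` restores the original order.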
diff --git a/tensorflow/contrib/lite/toco/import_tensorflow.cc b/tensorflow/contrib/lite/toco/import_tensorflow.cc
index 27d2f33a8d278156262753e6572c10ff967bda4c..a7a50e6fc9326338d69cd0334c83b60e2fa50402 100644
--- a/tensorflow/contrib/lite/toco/import_tensorflow.cc
+++ b/tensorflow/contrib/lite/toco/import_tensorflow.cc
@@ -272,6 +272,39 @@ void ImportInt64Array(const TensorProto& input_tensor, Array* output_array) {
   }
 }
 
+void ImportBoolArray(const TensorProto& input_tensor, Array* output_array) {
+  CHECK_EQ(input_tensor.dtype(), DT_BOOL);
+  const auto& input_shape = input_tensor.tensor_shape();
+  CHECK_LE(input_shape.dim_size(), 4);
+  ImportShape(input_shape.dim(), output_array->mutable_shape());
+  int input_flat_size = 1;
+  for (int k = 0; k < input_shape.dim_size(); k++) {
+    input_flat_size *= input_shape.dim(k).size();
+  }
+  auto& output_bool_data =
+      output_array->GetMutableBuffer<ArrayDataType::kBool>().data;
+  output_bool_data.resize(RequiredBufferSizeForShape(output_array->shape()),
+                          false);
+  if (input_tensor.bool_val_size()) {
+    for (int i = 0; i < input_tensor.bool_val_size(); i++) {
+      output_bool_data[i] = input_tensor.bool_val(i);
+    }
+  } else if (input_tensor.tensor_content().size() == input_flat_size) {
+    std::vector<char> buf(input_tensor.tensor_content().size());
+    toco::port::CopyToBuffer(input_tensor.tensor_content(), buf.data());
+    for (int i = 0; i < input_tensor.tensor_content().size(); i++) {
+      output_bool_data[i] = static_cast<bool>(buf[i]);
+    }
+  } else {
+    // Some graphs have bool const nodes without an actual value;
+    // we assume that 'false' is implied.
+    // So far we have only encountered that in an array with 1 entry, so
+    // require that until we encounter a graph where that's not the case.
+    CHECK_EQ(output_bool_data.size(), 1);
+    output_bool_data[0] = false;
+  }
+}
+
 void ImportStringArray(const TensorProto& input_tensor, Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_STRING);
   const auto& input_shape = input_tensor.tensor_shape();
@@ -318,6 +351,18 @@ void CheckInputsCount(const NodeDef& node,
       << " input(s) other than control dependencies: " << node.DebugString();
 }
 
+template <ArrayDataType T>
+string CreateConstArray(Model* model, string const& name,
+                        std::vector<typename toco::DataType<T> > const& data) {
+  // Utility function to create a const 1D array, useful for input parameters.
+  string array_name = toco::AvailableArrayName(*model, name);
+  auto& array = model->GetOrCreateArray(array_name);
+  array.data_type = T;
+  array.mutable_shape()->mutable_dims()->emplace_back(data.size());
+  array.GetMutableBuffer<T>().data = data;
+  return array_name;
+}
+
 void ConvertConstOperator(const NodeDef& node,
                           const TensorFlowImportFlags& tf_import_flags,
                           Model* model) {
@@ -347,6 +392,10 @@ void ConvertConstOperator(const NodeDef& node,
       array.data_type = ArrayDataType::kString;
       ImportStringArray(tensor, &array);
       break;
+    case DT_BOOL:
+      array.data_type = ArrayDataType::kBool;
+      ImportBoolArray(tensor, &array);
+      break;
     default:
       array.data_type = ArrayDataType::kNone;
       // do nothing, silently ignore the Const data.
@@ -678,9 +727,12 @@ void ConvertSqueezeOperator(const NodeDef& node,
   op->inputs.push_back(node.input(0));
   op->outputs.push_back(node.name());
 
-  const auto& squeeze_dims = GetListAttr(node, "squeeze_dims");
-  for (int i = 0; i < squeeze_dims.i_size(); ++i) {
-    op->squeeze_dims.push_back(squeeze_dims.i(i));
+  // When omitted, we squeeze all dimensions of size 1.
+  if (HasAttr(node, "squeeze_dims")) {
+    const auto& squeeze_dims = GetListAttr(node, "squeeze_dims");
+    for (int i = 0; i < squeeze_dims.i_size(); ++i) {
+      op->squeeze_dims.push_back(squeeze_dims.i(i));
+    }
   }
 
   model->operators.emplace_back(op);
@@ -1399,12 +1451,8 @@ void ConvertFusedBatchNormOperator(const NodeDef& node,
   const string& moving_variance_input = node.input(4);
 
   // Create an array holding the epsilon value (typically, 0.001).
-  const string epsilon_array_name = node.name() + "_epsilon_array";
-  auto& epsilon_array = model->GetOrCreateArray(epsilon_array_name);
-  epsilon_array.data_type = ArrayDataType::kFloat;
-  *epsilon_array.mutable_shape()->mutable_dims() = {1};
-  epsilon_array.GetMutableBuffer<ArrayDataType::kFloat>().data.push_back(
-      GetFloatAttr(node, "epsilon"));
+  const string epsilon_array_name = CreateConstArray<ArrayDataType::kFloat>(
+      model, node.name() + "_epsilon_array", {GetFloatAttr(node, "epsilon")});
 
   // Add epsilon to the moving variance.
   const string epsilon_add_op_name = node.name() + "_epsilon";
@@ -1532,16 +1580,56 @@ void ConvertTransposeConvOperator(const NodeDef& node,
   CHECK_EQ(node.op(), "Conv2DBackpropInput");
   CheckInputsCount(node, tf_import_flags, 3);
   auto* op = new TransposeConvOperator;
-  op->inputs.push_back(node.input(2));
-  op->inputs.push_back(node.input(1));
   op->inputs.push_back(node.input(0));
+  op->inputs.push_back(node.input(1));
+  op->inputs.push_back(node.input(2));
   op->outputs.push_back(node.name());
   const auto& strides = GetListAttr(node, "strides");
-  CHECK_EQ(strides.i_size(), 4);
-  CHECK_EQ(strides.i(0), 1);
   op->stride_height = strides.i(1);
   op->stride_width = strides.i(2);
-  CHECK_EQ(strides.i(3), 1);
+  CHECK_EQ(strides.i_size(), 4)
+      << "Can only import TransposeConv ops with 4D strides. TensorFlow op \""
+      << node.name() << "\" has " << strides.i_size() << "D strides.";
+  CHECK((strides.i(0) == 1) && (strides.i(3) == 1))
+      << "Can only import TransposeConv ops with striding along the height "
+         "(1st) or width (2nd) axis. TensorFlow op \""
+      << node.name() << "\" had strides:[ " << strides.i(0) << ", "
+      << strides.i(1) << ", " << strides.i(2) << ", " << strides.i(3) << "].";
+  op->stride_height = strides.i(1);
+  op->stride_width = strides.i(2);
+  if (HasAttr(node, "dilations")) {
+    const auto& dilations = GetListAttr(node, "dilations");
+    CHECK_EQ(dilations.i_size(), 4)
+        << "TransposeConv dilations must be 4D. TensorFlow op \""
+        << node.name() << "\" has " << dilations.i_size() << "D dilations.";
+    CHECK((dilations.i(0) == 1) && (dilations.i(1) == 1) &&
+          (dilations.i(2) == 1) && (dilations.i(3) == 1))
+        << "Dilation unsupported in TransposeConv. TensorFlow op \""
+        << node.name() << "\" had dilations:[ " << dilations.i(0) << ", "
+        << dilations.i(1) << ", " << dilations.i(2) << ", " << dilations.i(3)
+        << "].";
+  }
+
+  const string& weights_name = node.input(TransposeConvOperator::WEIGHTS);
+  const string& transposed_weights_name = weights_name + "_transposed";
+  // Check if a TransposeOperator was already created for these weights
+  // (can happen when multiple layers share the same weights).
+  const Operator* existing_transpose =
+      GetOpWithOutput(*model, transposed_weights_name);
+  if (existing_transpose) {
+    CHECK(existing_transpose->type == OperatorType::kTranspose);
+  } else {
+    // Transpose weights from HWIO order to OHWI order, which is more efficient
+    // for computation.
+    TransposeOperator* transpose = new TransposeOperator;
+    string perm_array = CreateConstArray<ArrayDataType::kInt32>(
+        model, node.name() + "_transpose_perm", {3, 0, 1, 2});
+    transpose->inputs = {weights_name, perm_array};
+    transpose->outputs = {transposed_weights_name};
+    model->operators.emplace_back(transpose);
+  }
+  op->inputs[1] = transposed_weights_name;
+
   auto const& padding = GetStringAttr(node, "padding");
   if (padding == "SAME") {
     op->padding.type = PaddingType::kSame;
@@ -1837,19 +1925,9 @@ void ConvertTopKV2Operator(const NodeDef& node,
   op->inputs.push_back(node.input(0));
   // K can be encoded as attr (TopK) convert it to a const.
   if (HasAttr(node, "k")) {
-    // Convert attribute into const tensor.
-    const string array_name = node.name() + "k";
-    auto& array = model->GetOrCreateArray(array_name);
-    array.data_type = ArrayDataType::kInt32;
-    // Size of array is always 1.
-    array.mutable_shape()->mutable_dims()->emplace_back(1);
-
-    auto& output_int_data =
-        array.GetMutableBuffer<ArrayDataType::kInt32>().data;
-    output_int_data.resize(1);
-    output_int_data[0] = GetIntAttr(node, "k");
-    op->inputs.push_back(array_name);
-
+    string k_array = CreateConstArray<ArrayDataType::kInt32>(
+        model, node.name() + "k", {GetIntAttr(node, "k")});
+    op->inputs.push_back(k_array);
   } else {
     CheckInputsCount(node, tf_import_flags, 2);
     op->inputs.push_back(node.input(1));
@@ -1859,6 +1937,42 @@ void ConvertTopKV2Operator(const NodeDef& node,
   op->outputs.push_back(node.name() + ":1");
   model->operators.emplace_back(op.release());
 }
+
+void ConvertDynamicPartitionOperator(
+    const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
+    Model* model) {
+  auto op = absl::make_unique<DynamicPartitionOperator>();
+  CHECK(HasAttr(node, "num_partitions"));
+  op->num_partitions = GetIntAttr(node, "num_partitions");
+  CheckInputsCount(node, tf_import_flags, 2);
+  op->inputs.push_back(node.input(0));
+  op->inputs.push_back(node.input(1));
+  CHECK_GT(op->num_partitions, 1);
+  op->outputs.push_back(node.name());  // Implicit :0.
+  for (int i = 1; i < op->num_partitions; ++i) {
+    op->outputs.push_back(node.name() + ":" + std::to_string(i));
+  }
+  model->operators.emplace_back(op.release());
+}
+
+void ConvertDynamicStitchOperator(const NodeDef& node,
+                                  const TensorFlowImportFlags& tf_import_flags,
+                                  Model* model) {
+  // The parallel and non-parallel variants are the same besides whether they
+  // have a parallel loop; there are no behavioral differences.
+  CHECK(node.op() == "DynamicStitch" || node.op() == "ParallelDynamicStitch");
+  auto op = absl::make_unique<DynamicStitchOperator>();
+  CHECK(HasAttr(node, "N"));
+  op->num_partitions = GetIntAttr(node, "N");
+  // Expect all ID partitions + all value partitions.
+  CheckInputsCount(node, tf_import_flags, op->num_partitions * 2);
+  for (int i = 0; i < op->num_partitions * 2; ++i) {
+    op->inputs.push_back(node.input(i));
+  }
+  op->outputs.push_back(node.name());
+  model->operators.emplace_back(op.release());
+}
+
 }  // namespace
 
 std::unique_ptr<Model> ImportTensorFlowGraphDef(
@@ -2044,6 +2158,11 @@ std::unique_ptr<Model> ImportTensorFlowGraphDef(
       ConvertExpOperator(node, tf_import_flags, model);
     } else if (node.op() == "TopK" || node.op() == "TopKV2") {
       ConvertTopKV2Operator(node, tf_import_flags, model);
+    } else if (node.op() == "DynamicPartition") {
+      ConvertDynamicPartitionOperator(node, tf_import_flags, model);
+    } else if (node.op() == "DynamicStitch" ||
+               node.op() == "ParallelDynamicStitch") {
+      ConvertDynamicStitchOperator(node, tf_import_flags, model);
     } else {
       ConvertUnsupportedOperator(node, tf_import_flags, model);
     }
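
The TransposeConv import above inserts a Transpose with the permutation `{3, 0, 1, 2}` to reorder the weights from HWIO to OHWI. A minimal sketch of how such a permutation acts on a shape, assuming the usual TensorFlow transpose convention that output dimension `i` takes input dimension `perm[i]` (`PermuteShape` is a hypothetical helper, not a toco function):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Applies a transpose permutation to a shape: out[i] = shape[perm[i]].
std::vector<int> PermuteShape(const std::vector<int>& shape,
                              const std::vector<int>& perm) {
  assert(shape.size() == perm.size());
  std::vector<int> out(shape.size());
  for (std::size_t i = 0; i < perm.size(); ++i) {
    assert(perm[i] >= 0 && static_cast<std::size_t>(perm[i]) < shape.size());
    out[i] = shape[static_cast<std::size_t>(perm[i])];
  }
  return out;
}
```

For example, an HWIO weight shape of `{3, 3, 8, 16}` becomes the OHWI shape `{16, 3, 3, 8}`.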
diff --git a/tensorflow/contrib/lite/toco/model.h b/tensorflow/contrib/lite/toco/model.h
index 346859ab392d257355b21411a1b3691c8dda5421..5199e292e19c2ac59dcfc2efd9947cc788b0299d 100644
--- a/tensorflow/contrib/lite/toco/model.h
+++ b/tensorflow/contrib/lite/toco/model.h
@@ -29,6 +29,8 @@ limitations under the License.
 
 namespace toco {
 
+using tflite::QuantizationParams;
+
 enum class OperatorType {
   kNone,
   // General-purpose neural network operators.
@@ -63,6 +65,7 @@ enum class OperatorType {
   kRelu,
   kRelu1,
   kRelu6,
+  kPRelu,
   kSoftmax,
   kLogSoftmax,
   kSub,
@@ -115,6 +118,8 @@ enum class OperatorType {
   kTensorFlowTile,
   kTranspose,
   kTopK_V2,
+  kDynamicPartition,
+  kDynamicStitch,
   // An unsupported TF operation. It's only needed to be able to represent TF
   // graph internally and is expected to be dropped by graph transformations.
   kTensorFlowUnsupported,
@@ -244,6 +249,8 @@ struct GenericBuffer {
   // in containers and have the containers call the right subclass destructor.
   virtual ~GenericBuffer() {}
 
+  virtual int Length() const = 0;
+
   const ArrayDataType type;
 
  protected:
@@ -256,6 +263,8 @@ template <ArrayDataType A>
 struct Buffer : GenericBuffer {
   Buffer() : GenericBuffer(A) {}
 
+  int Length() const override { return data.size(); }
+
   std::vector<DataType<A>> data;
 };
 
@@ -558,6 +567,18 @@ struct Relu6Operator : Operator {
   Relu6Operator() : Operator(OperatorType::kRelu6) {}
 };
 
+// PRelu
+//   f(x) = alpha * x for x < 0, f(x) = x for x >= 0.
+//
+// Inputs:
+//   inputs[0]: required: the input array
+//   inputs[1]: required: the alpha array
+//
+// Equivalent to keras.layers.PReLU.
+struct PReluOperator : Operator {
+  PReluOperator() : Operator(OperatorType::kPRelu) {}
+};
+
 // Element-wise Logistic operator:
 //   x -> Logistic(x) = 1 / (1 + exp(-x))
 //
@@ -840,19 +861,29 @@ struct SqueezeOperator : Operator {
 };
 
 // Inputs:
-//   inputs[0]: required: the input activations array
-//   inputs[1]: required: the Conv weights
-//   channel.
+//   inputs[0]: required: the output shape
+//   inputs[1]: required: the weights
+//   inputs[2]: required: the input activations array
+//   NOTE: The input activations array is NOT the first input.
+//
 //
 // Outputs:
 //   outputs[0]: required: the output activations array
 //
 // TensorFlow equivalent: Conv2DBackpropInput
 struct TransposeConvOperator : Operator {
+  enum Inputs {
+    OUTPUT_SHAPE = 0,
+    WEIGHTS = 1,
+    DATA_INPUT = 2,
+  };
+
   TransposeConvOperator() : Operator(OperatorType::kTransposeConv) {}
   Padding padding;
   int stride_width = 0;
   int stride_height = 0;
+  // Dilation is possible with transpose convolution, but TensorFlow does not
+  // currently support it, so we omit it.
 };
 
 // Given a tensor input, this operation calculates element-wise exponential
@@ -1410,6 +1441,30 @@ struct TopKV2Operator : Operator {
   TopKV2Operator() : Operator(OperatorType::kTopK_V2) {}
 };
 
+// DynamicPartition operator:
+//
+// Inputs:
+//  inputs[0]: required: data.
+//  inputs[1]: required: partitions.
+//
+// TensorFlow equivalent: DynamicPartition
+struct DynamicPartitionOperator : Operator {
+  DynamicPartitionOperator() : Operator(OperatorType::kDynamicPartition) {}
+  int num_partitions = 0;
+};
+
+// DynamicStitch operator:
+//
+// Inputs:
+//  inputs[0,N): required: indices.
+//  inputs[N,2N): required: data.
+//
+// TensorFlow equivalent: DynamicStitch/ParallelDynamicStitch
+struct DynamicStitchOperator : Operator {
+  DynamicStitchOperator() : Operator(OperatorType::kDynamicStitch) {}
+  int num_partitions = 0;
+};
+
 // Alloc's are used for transient arrays only. An Alloc specifies which interval
 // of the "transient_data" workspace buffer passed to inference functions, is to
 // be used for the transient array at hand. The 'start' and 'end' values are
@@ -1423,22 +1478,6 @@ inline bool operator<(const Alloc& a, const Alloc& b) {
   return a.start < b.start;
 }
 
-// Quantization parameters, determining the mapping of quantized values
-// to real values (i.e. determining how quantized values are mathematically
-// interpreted).
-//
-// The correspondence is as follows:
-//
-//   real_value = scale * (quantized_value - zero_point);
-//
-// In other words, zero_point designates which quantized value corresponds to
-// the real 0 value, and scale designates the difference between the real values
-// corresponding to consecutive quantized values differing by 1.
-struct QuantizationParams {
-  int32 zero_point = 0;
-  double scale = 0.;
-};
-
 class Shape {
  public:
   // For Shape, we stick to half-way encapsulation for now:
diff --git a/tensorflow/contrib/lite/toco/model_flags.proto b/tensorflow/contrib/lite/toco/model_flags.proto
index 867b86f31d16b502a7aeb92cb3d8c96117630cd2..42e0f54826dd809a801a8ac1bfd0a5a7660382a8 100644
--- a/tensorflow/contrib/lite/toco/model_flags.proto
+++ b/tensorflow/contrib/lite/toco/model_flags.proto
@@ -96,11 +96,13 @@ message RnnState {
 // model that does not already contain such MinMax information.
 message ArraysExtraInfo {
   message Entry {
-    // Next ID to use: 5.
+    // Next ID to use: 7.
     optional string name = 1;
     optional float min = 2;
     optional float max = 3;
     optional IODataType data_type = 4;
+    optional InputArrayShape shape = 5;
+    optional float constant_float_value = 6;
   }
   repeated Entry entries = 1;
 }
diff --git a/tensorflow/contrib/lite/toco/tensorflow_graph_matching/resolve_cluster.cc b/tensorflow/contrib/lite/toco/tensorflow_graph_matching/resolve_cluster.cc
index fddf6cc83686632033f31496ec42b33e2ea15f20..5e421ba944cccd9746c66bc33e986b4406dd3bf5 100644
--- a/tensorflow/contrib/lite/toco/tensorflow_graph_matching/resolve_cluster.cc
+++ b/tensorflow/contrib/lite/toco/tensorflow_graph_matching/resolve_cluster.cc
@@ -144,7 +144,9 @@ std::unique_ptr<GraphDef> MaybeReplaceCompositeSubgraph(
       MaybeResolveClusters(tf_graph, cluster_factories);
 
   // Copy function definitions
-  *(pruned_graph->mutable_library()) = tf_graph.library();
+  if (pruned_graph) {
+    *(pruned_graph->mutable_library()) = tf_graph.library();
+  }
   return pruned_graph;
 }
 
diff --git a/tensorflow/contrib/lite/toco/tflite/BUILD b/tensorflow/contrib/lite/toco/tflite/BUILD
index a2b8145a67278c3ac0065f9551da6ffd1de60772..9d3e1daf1258c6bc076dac566129174430bb761d 100644
--- a/tensorflow/contrib/lite/toco/tflite/BUILD
+++ b/tensorflow/contrib/lite/toco/tflite/BUILD
@@ -115,9 +115,11 @@ cc_library(
     deps = [
         ":operator",
         ":types",
+        "//tensorflow/contrib/lite:framework",
         "//tensorflow/contrib/lite/schema:schema_fbs",
         "//tensorflow/contrib/lite/toco:model",
         "//tensorflow/contrib/lite/toco:tooling_util",
+        "//tensorflow/contrib/lite/tools:verifier",
         "@flatbuffers",
     ],
 )
diff --git a/tensorflow/contrib/lite/toco/tflite/import.cc b/tensorflow/contrib/lite/toco/tflite/import.cc
index 5b1ab514b23248cd98e66847185d0e8b9fe2d6aa..c0e7ab2ef57ed8edf1b7cda08c64f6ae66172af3 100644
--- a/tensorflow/contrib/lite/toco/tflite/import.cc
+++ b/tensorflow/contrib/lite/toco/tflite/import.cc
@@ -15,10 +15,12 @@ limitations under the License.
 #include "tensorflow/contrib/lite/toco/tflite/import.h"
 
 #include "flatbuffers/flexbuffers.h"
+#include "tensorflow/contrib/lite/model.h"
 #include "tensorflow/contrib/lite/schema/schema_generated.h"
 #include "tensorflow/contrib/lite/toco/tflite/operator.h"
 #include "tensorflow/contrib/lite/toco/tflite/types.h"
 #include "tensorflow/contrib/lite/toco/tooling_util.h"
+#include "tensorflow/contrib/lite/tools/verifier.h"
 
 namespace toco {
 
@@ -64,6 +66,9 @@ void ImportTensors(const ::tflite::Model& input_model, Model* model) {
 
     auto shape = input_tensor->shape();
     if (shape) {
+      // If the shape is 0-dimensional, make sure to record it as such,
+      // as opposed to leaving the array without a shape.
+      array.mutable_shape()->mutable_dims()->clear();
       for (int i = 0; i < shape->Length(); ++i) {
         auto d = shape->Get(i);
         array.mutable_shape()->mutable_dims()->push_back(d);
@@ -159,16 +164,28 @@ void ImportIOTensors(const ::tflite::Model& input_model,
   }
 }
 
+namespace {
+bool Verify(const void* buf, size_t len) {
+  ::flatbuffers::Verifier verifier(static_cast<const uint8_t*>(buf), len);
+  return ::tflite::VerifyModelBuffer(verifier);
+}
+}  // namespace
+
 std::unique_ptr<Model> Import(const ModelFlags& model_flags,
                               const string& input_file_contents) {
+  ::tflite::AlwaysTrueResolver r;
+  if (!::tflite::Verify(input_file_contents.data(), input_file_contents.size(),
+                        r, ::tflite::DefaultErrorReporter())) {
+    LOG(FATAL) << "Invalid flatbuffer.";
+  }
   const ::tflite::Model* input_model =
       ::tflite::GetModel(input_file_contents.data());
 
   // Full list of all known operators.
   const auto ops_by_name = BuildOperatorByNameMap();
 
-  if (input_model->subgraphs()->size() != 1) {
-    LOG(FATAL) << "# of subgraphs in tflite should be exactly 1 for now.";
+  if (!input_model->subgraphs() || input_model->subgraphs()->size() != 1) {
+    LOG(FATAL) << "Number of subgraphs in tflite should be exactly 1.";
   }
   std::unique_ptr<Model> model;
   model.reset(new Model);
diff --git a/tensorflow/contrib/lite/toco/tflite/import_test.cc b/tensorflow/contrib/lite/toco/tflite/import_test.cc
index aad6e780d5eb5c3dbc880906df5053ad231ffd54..edd22f783f03b1fbd34039cd7b00f08d34ca9fc6 100644
--- a/tensorflow/contrib/lite/toco/tflite/import_test.cc
+++ b/tensorflow/contrib/lite/toco/tflite/import_test.cc
@@ -27,60 +27,110 @@ namespace {
 
 using ::testing::ElementsAre;
 
+using flatbuffers::Offset;
+using flatbuffers::Vector;
 class ImportTest : public ::testing::Test {
  protected:
   template <typename T>
-  flatbuffers::Offset<flatbuffers::Vector<unsigned char>> CreateDataVector(
-      const std::vector<T>& data) {
+  Offset<Vector<unsigned char>> CreateDataVector(const std::vector<T>& data) {
     return builder_.CreateVector(reinterpret_cast<const uint8_t*>(data.data()),
                                  sizeof(T) * data.size());
   }
-  // This is a very simplistic model. We are not interested in testing all the
-  // details here, since tf.mini's testing framework will be exercising all the
-  // conversions multiple times, and the conversion of operators is tested by
-  // separate unittests.
-  void BuildTestModel() {
-    // The tensors
+
+  Offset<Vector<Offset<::tflite::Buffer>>> BuildBuffers() {
+    auto buf0 = ::tflite::CreateBuffer(builder_, CreateDataVector<float>({}));
+    auto buf1 = ::tflite::CreateBuffer(
+        builder_, CreateDataVector<float>({1.0f, 2.0f, 3.0f, 4.0f}));
+    auto buf2 =
+        ::tflite::CreateBuffer(builder_, CreateDataVector<float>({3.0f, 4.0f}));
+    return builder_.CreateVector(
+        std::vector<Offset<::tflite::Buffer>>({buf0, buf1, buf2}));
+  }
+
+  Offset<Vector<Offset<::tflite::Tensor>>> BuildTensors() {
     auto q = ::tflite::CreateQuantizationParameters(
         builder_,
         /*min=*/builder_.CreateVector<float>({0.1f}),
         /*max=*/builder_.CreateVector<float>({0.2f}),
         /*scale=*/builder_.CreateVector<float>({0.3f}),
         /*zero_point=*/builder_.CreateVector<int64_t>({100ll}));
-    auto buf0 = ::tflite::CreateBuffer(builder_, CreateDataVector<float>({}));
-    auto buf1 =
-        ::tflite::CreateBuffer(builder_, CreateDataVector<float>({1.0f, 2.0f}));
-    auto buf2 =
-        ::tflite::CreateBuffer(builder_, CreateDataVector<float>({3.0f}));
-    auto buffers = builder_.CreateVector(
-        std::vector<flatbuffers::Offset<::tflite::Buffer>>({buf0, buf1, buf2}));
-    auto t1 = ::tflite::CreateTensor(builder_,
-                                     builder_.CreateVector<int>({1, 2, 3, 4}),
-                                     ::tflite::TensorType_FLOAT32, 1,
-                                     builder_.CreateString("tensor_one"), q);
+    auto t1 =
+        ::tflite::CreateTensor(builder_, builder_.CreateVector<int>({1, 2, 2}),
+                               ::tflite::TensorType_FLOAT32, 1,
+                               builder_.CreateString("tensor_one"), q);
     auto t2 =
         ::tflite::CreateTensor(builder_, builder_.CreateVector<int>({2, 1}),
                                ::tflite::TensorType_FLOAT32, 2,
                                builder_.CreateString("tensor_two"), q);
-    auto tensors = builder_.CreateVector(
-        std::vector<flatbuffers::Offset<::tflite::Tensor>>({t1, t2}));
-
-    // The operator codes.
-    auto c1 =
-        ::tflite::CreateOperatorCode(builder_, ::tflite::BuiltinOperator_CUSTOM,
-                                     builder_.CreateString("custom_op_one"));
-    auto c2 = ::tflite::CreateOperatorCode(
-        builder_, ::tflite::BuiltinOperator_CONV_2D, 0);
-    auto opcodes = builder_.CreateVector(
-        std::vector<flatbuffers::Offset<::tflite::OperatorCode>>({c1, c2}));
-
-    auto subgraph = ::tflite::CreateSubGraph(builder_, tensors, 0, 0, 0);
-    std::vector<flatbuffers::Offset<::tflite::SubGraph>> subgraph_vector(
-        {subgraph});
-    auto subgraphs = builder_.CreateVector(subgraph_vector);
+    return builder_.CreateVector(
+        std::vector<Offset<::tflite::Tensor>>({t1, t2}));
+  }
+
+  Offset<Vector<Offset<::tflite::OperatorCode>>> BuildOpCodes(
+      std::initializer_list<::tflite::BuiltinOperator> op_codes) {
+    std::vector<Offset<::tflite::OperatorCode>> op_codes_vector;
+    for (auto op : op_codes) {
+      op_codes_vector.push_back(::tflite::CreateOperatorCode(builder_, op, 0));
+    }
+    return builder_.CreateVector(op_codes_vector);
+  }
+
+  Offset<Vector<Offset<::tflite::OperatorCode>>> BuildOpCodes() {
+    return BuildOpCodes({::tflite::BuiltinOperator_MAX_POOL_2D,
+                         ::tflite::BuiltinOperator_CONV_2D});
+  }
+
+  Offset<Vector<Offset<::tflite::Operator>>> BuildOperators(
+      std::initializer_list<int> inputs, std::initializer_list<int> outputs) {
+    auto is = builder_.CreateVector<int>(inputs);
+    if (inputs.size() == 0) is = 0;
+    auto os = builder_.CreateVector<int>(outputs);
+    if (outputs.size() == 0) os = 0;
+    auto op = ::tflite::CreateOperator(
+        builder_, 0, is, os, ::tflite::BuiltinOptions_Conv2DOptions,
+        ::tflite::CreateConv2DOptions(builder_, ::tflite::Padding_VALID, 1, 1,
+                                      ::tflite::ActivationFunctionType_NONE)
+            .Union(),
+        /*custom_options=*/0, ::tflite::CustomOptionsFormat_FLEXBUFFERS);
+
+    return builder_.CreateVector(std::vector<Offset<::tflite::Operator>>({op}));
+  }
+
+  Offset<Vector<Offset<::tflite::Operator>>> BuildOperators() {
+    return BuildOperators({0}, {1});
+  }
+
+  Offset<Vector<Offset<::tflite::SubGraph>>> BuildSubGraphs(
+      Offset<Vector<Offset<::tflite::Tensor>>> tensors,
+      Offset<Vector<Offset<::tflite::Operator>>> operators,
+      int num_sub_graphs = 1) {
+    std::vector<int32_t> inputs = {0};
+    std::vector<int32_t> outputs = {1};
+    std::vector<Offset<::tflite::SubGraph>> v;
+    for (int i = 0; i < num_sub_graphs; ++i) {
+      v.push_back(::tflite::CreateSubGraph(
+          builder_, tensors, builder_.CreateVector(inputs),
+          builder_.CreateVector(outputs), operators,
+          builder_.CreateString("subgraph")));
+    }
+    return builder_.CreateVector(v);
+  }
+
+  // This is a very simplistic model. We are not interested in testing all the
+  // details here, since tf.mini's testing framework will be exercising all the
+  // conversions multiple times, and the conversion of operators is tested by
+  // separate unittests.
+  void BuildTestModel() {
+    auto buffers = BuildBuffers();
+    auto tensors = BuildTensors();
+    auto opcodes = BuildOpCodes();
+    auto operators = BuildOperators();
+    auto subgraphs = BuildSubGraphs(tensors, operators);
     auto s = builder_.CreateString("");
-    builder_.Finish(::tflite::CreateModel(builder_, TFLITE_SCHEMA_VERSION,
-                                          opcodes, subgraphs, s, buffers));
+
+    ::tflite::FinishModelBuffer(
+        builder_, ::tflite::CreateModel(builder_, TFLITE_SCHEMA_VERSION,
+                                        opcodes, subgraphs, s, buffers));
 
     input_model_ = ::tflite::GetModel(builder_.GetBufferPointer());
   }
@@ -89,7 +139,6 @@ class ImportTest : public ::testing::Test {
                   builder_.GetSize());
   }
   flatbuffers::FlatBufferBuilder builder_;
-  // const uint8_t* buffer_ = nullptr;
   const ::tflite::Model* input_model_ = nullptr;
 };
 
@@ -106,7 +155,7 @@ TEST_F(ImportTest, LoadOperatorsTable) {
 
   details::OperatorsTable operators;
   details::LoadOperatorsTable(*input_model_, &operators);
-  EXPECT_THAT(operators, ElementsAre("custom_op_one", "CONV_2D"));
+  EXPECT_THAT(operators, ElementsAre("MAX_POOL_2D", "CONV_2D"));
 }
 
 TEST_F(ImportTest, Tensors) {
@@ -118,9 +167,9 @@ TEST_F(ImportTest, Tensors) {
   Array& a1 = model->GetArray("tensor_one");
   EXPECT_EQ(ArrayDataType::kFloat, a1.data_type);
   EXPECT_THAT(a1.GetBuffer<ArrayDataType::kFloat>().data,
-              ElementsAre(1.0f, 2.0f));
+              ElementsAre(1.0f, 2.0f, 3.0f, 4.0f));
   ASSERT_TRUE(a1.has_shape());
-  EXPECT_THAT(a1.shape().dims(), ElementsAre(1, 2, 3, 4));
+  EXPECT_THAT(a1.shape().dims(), ElementsAre(1, 2, 2));
 
   const auto& mm = a1.minmax;
   ASSERT_TRUE(mm.get());
@@ -133,6 +182,80 @@ TEST_F(ImportTest, Tensors) {
   EXPECT_EQ(100, q->zero_point);
 }
 
+TEST_F(ImportTest, NoBuffers) {
+  auto buffers = 0;
+  auto tensors = BuildTensors();
+  auto opcodes = BuildOpCodes();
+  auto operators = BuildOperators();
+  auto subgraphs = BuildSubGraphs(tensors, operators);
+  auto comment = builder_.CreateString("");
+  ::tflite::FinishModelBuffer(
+      builder_, ::tflite::CreateModel(builder_, TFLITE_SCHEMA_VERSION, opcodes,
+                                      subgraphs, comment, buffers));
+  EXPECT_DEATH(Import(ModelFlags(), InputModelAsString()),
+               "Missing 'buffers' section.");
+}
+
+TEST_F(ImportTest, NoInputs) {
+  auto buffers = BuildBuffers();
+  auto tensors = BuildTensors();
+  auto opcodes = BuildOpCodes();
+  auto operators = BuildOperators({}, {1});
+  auto subgraphs = BuildSubGraphs(tensors, operators);
+  auto comment = builder_.CreateString("");
+  ::tflite::FinishModelBuffer(
+      builder_, ::tflite::CreateModel(builder_, TFLITE_SCHEMA_VERSION, opcodes,
+                                      subgraphs, comment, buffers));
+  EXPECT_DEATH(Import(ModelFlags(), InputModelAsString()),
+               "Missing 'inputs' for operator.");
+}
+
+TEST_F(ImportTest, NoOutputs) {
+  auto buffers = BuildBuffers();
+  auto tensors = BuildTensors();
+  auto opcodes = BuildOpCodes();
+  auto operators = BuildOperators({0}, {});
+  auto subgraphs = BuildSubGraphs(tensors, operators);
+  auto comment = builder_.CreateString("");
+  ::tflite::FinishModelBuffer(
+      builder_, ::tflite::CreateModel(builder_, TFLITE_SCHEMA_VERSION, opcodes,
+                                      subgraphs, comment, buffers));
+  EXPECT_DEATH(Import(ModelFlags(), InputModelAsString()),
+               "Missing 'outputs' for operator.");
+}
+
+TEST_F(ImportTest, InvalidOpCode) {
+  auto buffers = BuildBuffers();
+  auto tensors = BuildTensors();
+  auto opcodes = BuildOpCodes({static_cast<::tflite::BuiltinOperator>(-1),
+                               ::tflite::BuiltinOperator_CONV_2D});
+  auto operators = BuildOperators();
+  auto subgraphs = BuildSubGraphs(tensors, operators);
+  auto comment = builder_.CreateString("");
+  ::tflite::FinishModelBuffer(
+      builder_, ::tflite::CreateModel(builder_, TFLITE_SCHEMA_VERSION, opcodes,
+                                      subgraphs, comment, buffers));
+  EXPECT_DEATH(Import(ModelFlags(), InputModelAsString()),
+               "Operator id '-1' is out of range.");
+}
+
+TEST_F(ImportTest, MultipleSubGraphs) {
+  auto buffers = BuildBuffers();
+  auto tensors = BuildTensors();
+  auto opcodes = BuildOpCodes();
+  auto operators = BuildOperators();
+  auto subgraphs = BuildSubGraphs(tensors, operators, 2);
+  auto comment = builder_.CreateString("");
+  ::tflite::FinishModelBuffer(
+      builder_, ::tflite::CreateModel(builder_, TFLITE_SCHEMA_VERSION, opcodes,
+                                      subgraphs, comment, buffers));
+
+  input_model_ = ::tflite::GetModel(builder_.GetBufferPointer());
+
+  EXPECT_DEATH(Import(ModelFlags(), InputModelAsString()),
+               "Number of subgraphs in tflite should be exactly 1.");
+}
+
 // TODO(ahentz): still need tests for Operators and IOTensors.
 
 }  // namespace
diff --git a/tensorflow/contrib/lite/toco/tflite/operator.cc b/tensorflow/contrib/lite/toco/tflite/operator.cc
index 1d57d2dd7ef9db095dbc86875affdba77641ece2..0989bfe5a3de9a7c0f62b272b0be84df1f4ddcb0 100644
--- a/tensorflow/contrib/lite/toco/tflite/operator.cc
+++ b/tensorflow/contrib/lite/toco/tflite/operator.cc
@@ -854,6 +854,8 @@ std::vector<std::unique_ptr<BaseOperator>> BuildOperatorList() {
       new SimpleOperator<Relu1Operator>("RELU_N1_TO_1", OperatorType::kRelu1));
   ops.emplace_back(
       new SimpleOperator<Relu6Operator>("RELU6", OperatorType::kRelu6));
+  ops.emplace_back(
+      new SimpleOperator<PReluOperator>("PRELU", OperatorType::kPRelu));
   ops.emplace_back(new SimpleOperator<LogisticOperator>(
       "LOGISTIC", OperatorType::kLogistic));
   ops.emplace_back(
diff --git a/tensorflow/contrib/lite/toco/tflite/types.cc b/tensorflow/contrib/lite/toco/tflite/types.cc
index b4c2851502a40a1ca36965d4ddd2c8a15b8fe60f..0afd2f3df57caf3214dd198bfa2ee75fa7a8fd7b 100644
--- a/tensorflow/contrib/lite/toco/tflite/types.cc
+++ b/tensorflow/contrib/lite/toco/tflite/types.cc
@@ -90,6 +90,8 @@ flatbuffers::Offset<flatbuffers::Vector<uint8_t>> DataBuffer::Serialize(
       return CopyBuffer<ArrayDataType::kFloat>(array, builder);
     case ArrayDataType::kInt32:
       return CopyBuffer<ArrayDataType::kInt32>(array, builder);
+    case ArrayDataType::kInt64:
+      return CopyBuffer<ArrayDataType::kInt64>(array, builder);
     case ArrayDataType::kString:
       return CopyBuffer<ArrayDataType::kString>(array, builder);
     case ArrayDataType::kUint8:
diff --git a/tensorflow/contrib/lite/toco/toco_tooling.cc b/tensorflow/contrib/lite/toco/toco_tooling.cc
index a09a3c4ef56edc6ba7fd19eb1ff45a2e41cf3dd2..30dd6fab9ebbad9c2add7f830f9b58a73f41714b 100644
--- a/tensorflow/contrib/lite/toco/toco_tooling.cc
+++ b/tensorflow/contrib/lite/toco/toco_tooling.cc
@@ -52,12 +52,14 @@ void MakeGeneralGraphTransformationsSet(
     GraphTransformationsSet* transformations) {
   CHECK(transformations->empty());
   transformations->Add(new ConvertExpandDimsToReshape);
+  transformations->Add(new ConvertSqueezeToReshape);
   transformations->Add(new ConvertTrivialAddNToAdd);
   transformations->Add(new ConvertTrivialStackToReshape);
   transformations->Add(new ConvertTrivialTransposeToReshape);
   transformations->Add(new ConvertReorderAxes);
   transformations->Add(new ResolveReshapeAttributes);
   transformations->Add(new ResolveTransposeAttributes);
+  transformations->Add(new PropagateActivationFunctionIntoConstants);
   transformations->Add(new PropagateArrayDataTypes);
   transformations->Add(new PropagateFixedSizes);
   transformations->Add(new RemoveTensorFlowAssert);
@@ -76,6 +78,7 @@ void MakeGeneralGraphTransformationsSet(
   transformations->Add(new ResolveBatchNormalization);
   transformations->Add(new ResolveConstantBinaryOperator);
   transformations->Add(new ResolveConstantFill);
+  transformations->Add(new ResolveConstantGather);
   transformations->Add(new ResolveConstantRange);
   transformations->Add(new ResolveConstantStack);
   transformations->Add(new ResolveConstantStridedSlice);
@@ -91,6 +94,7 @@ void MakeGeneralGraphTransformationsSet(
   transformations->Add(new IdentifyL2Normalization);
   transformations->Add(new IdentifyL2Pool);
   transformations->Add(new IdentifyRelu1);
+  transformations->Add(new IdentifyPRelu);
   transformations->Add(new RemoveTrivialBinaryOperator);
   transformations->Add(new ReadFakeQuantMinMax);
   transformations->Add(new ResolveSpaceToBatchNDAttributes);
@@ -102,6 +106,7 @@ void MakeGeneralGraphTransformationsSet(
   transformations->Add(new ResolveConstantShapeOrRank);
   transformations->Add(new MakeInitialDequantizeOperator);
   transformations->Add(new ResolveConstantFakeQuant);
+  transformations->Add(new UnpartitionEmbeddingLookup);
 }
 
 bool SupportsQuantization(FileFormat format) {
@@ -285,6 +290,10 @@ void Transform(const TocoFlags& toco_flags, Model* model) {
     EncodeConstantArraysMinMaxByWrappingThemInFakeQuantNodes(model);
   }
 
+  // Fix any issues with IO edges. This must happen after any transform that
+  // may modify the structure of the edges.
+  FixEdgeArrays(model);
+
   LogDump(kLogLevelModelChanged, "AFTER TRANSFORMATIONS", *model);
 
   if (output_format != GRAPHVIZ_DOT && output_format != TFLITE) {
diff --git a/tensorflow/contrib/lite/toco/tooling_util.cc b/tensorflow/contrib/lite/toco/tooling_util.cc
index 9e725822383b06985bbb5cffdc19a759bc6d5cf3..f3f50487ff74904bf3708fa4c86f522997b55ca0 100644
--- a/tensorflow/contrib/lite/toco/tooling_util.cc
+++ b/tensorflow/contrib/lite/toco/tooling_util.cc
@@ -84,6 +84,8 @@ string ArrayDataTypeName(ArrayDataType data_type) {
       return "Uint64";
     case ArrayDataType::kString:
       return "String";
+    case ArrayDataType::kBool:
+      return "Bool";
     case ArrayDataType::kNone:
       return "None";
     default:
@@ -157,6 +159,15 @@ bool DeleteArrayIfUsedOnce(const string& array_name, Model* model) {
   return false;
 }
 
+void DeleteOpAndArraysIfUnused(Model* model, Operator* op) {
+  for (const string& array_name : op->inputs) {
+    DeleteArrayIfUsedOnce(array_name, model);
+  }
+  auto op_it = FindOp(*model, op);
+  CHECK(op_it != model->operators.end());
+  model->operators.erase(op_it);
+}
+
 std::vector<std::unique_ptr<Operator>>::const_iterator FindOpWithOutput(
     const Model& model, const string& array_name) {
   for (auto it = model.operators.begin(); it != model.operators.end(); ++it) {
@@ -289,6 +300,7 @@ const char* OperatorTypeName(OperatorType type) {
     HANDLE_OPERATORTYPENAME_CASE(Relu)
     HANDLE_OPERATORTYPENAME_CASE(Relu1)
     HANDLE_OPERATORTYPENAME_CASE(Relu6)
+    HANDLE_OPERATORTYPENAME_CASE(PRelu)
     HANDLE_OPERATORTYPENAME_CASE(ReorderAxes)
     HANDLE_OPERATORTYPENAME_CASE(Softmax)
     HANDLE_OPERATORTYPENAME_CASE(LogSoftmax)
@@ -345,6 +357,8 @@ const char* OperatorTypeName(OperatorType type) {
     HANDLE_OPERATORTYPENAME_CASE(TopK_V2)
     HANDLE_OPERATORTYPENAME_CASE(TensorFlowUnsupported)
     HANDLE_OPERATORTYPENAME_CASE(Exp)
+    HANDLE_OPERATORTYPENAME_CASE(DynamicPartition)
+    HANDLE_OPERATORTYPENAME_CASE(DynamicStitch)
     default:
       LOG(FATAL) << "Unhandled op type";
 #undef HANDLE_OPERATORTYPENAME_CASE
@@ -809,9 +823,15 @@ void CheckEachArray(const Model& model) {
     // It's OK to have a buffer or an alloc, but not both.
     // (Since allocs are for transient arrays without a buffer).
     CHECK(!array->buffer || !array->alloc);
-    // If there is a buffer, its type should be consistent with data_type.
     if (array->buffer) {
+      // If there is a buffer, its type should be consistent with data_type.
       CHECK(array->buffer->type == array->data_type);
+      // The presence of a fixed buffer should imply the presence of a fixed
+      // shape.
+      CHECK(array->has_shape());
+      // The shape flat-size should agree with the buffer length.
+      CHECK_EQ(array->buffer->Length(),
+               RequiredBufferSizeForShape(array->shape()));
     }
 
     // Check name.  Either "name_with_suffix_8", "name_with_port:3", but not
@@ -1028,6 +1048,117 @@ void CheckModelCounts(const Model& model) {
   }
 }
 
+void FixEdgeArrays(Model* model) {
+  for (const string& output_array_name : model->flags.output_arrays()) {
+    if (!GetOpWithOutput(*model, output_array_name)) {
+      // Output has no operator producing it. Change that by inserting a copy.
+      LOG(WARNING) << "Fixing constant output array " << output_array_name
+                   << " by inserting a copy. This is not optimal.";
+      string intermediate_array_name =
+          AvailableArrayName(*model, output_array_name + "_copy");
+      CloneArray(model, output_array_name, intermediate_array_name);
+      InsertCopyOperator(model, intermediate_array_name, output_array_name);
+    }
+  }
+}
+
+void InsertCopyOperator(Model* model, const string& source_array_name,
+                        const string& target_array_name) {
+  // Drop constant data from the target array as the copy will be done at
+  // runtime.
+  Array& target_array = model->GetOrCreateArray(target_array_name);
+  target_array.buffer.reset();
+
+  // Reshape to the same size. This should be a no-op.
+  const Array& source_array = model->GetArray(source_array_name);
+  std::vector<int> shape = source_array.shape().dims();
+
+  // Insert copy operator.
+  auto* copy_op = new TensorFlowReshapeOperator;
+  copy_op->inputs = {
+      source_array_name,
+      CreateInt32Array(model, target_array_name + "_copy_shape", shape)};
+  copy_op->outputs = {target_array_name};
+  model->operators.emplace_back(copy_op);
+}
+
+namespace {
+template <ArrayDataType A>
+void CopyArrayBuffer(const Array& source_array, Array* target_array) {
+  if (source_array.buffer) {
+    const auto& source_buffer = source_array.GetBuffer<A>();
+    auto& target_buffer = target_array->GetMutableBuffer<A>();
+    target_buffer.data = source_buffer.data;
+  }
+}
+}  // namespace
+
+void CloneArray(Model* model, const string& source_array_name,
+                const string& target_array_name) {
+  CHECK(!model->HasArray(target_array_name));
+  const Array& source_array = model->GetArray(source_array_name);
+  Array& target_array = model->GetOrCreateArray(target_array_name);
+
+  switch (source_array.data_type) {
+    case ArrayDataType::kBool:
+      CopyArrayBuffer<ArrayDataType::kBool>(source_array, &target_array);
+      break;
+    case ArrayDataType::kFloat:
+      CopyArrayBuffer<ArrayDataType::kFloat>(source_array, &target_array);
+      break;
+    case ArrayDataType::kInt8:
+      CopyArrayBuffer<ArrayDataType::kInt8>(source_array, &target_array);
+      break;
+    case ArrayDataType::kUint8:
+      CopyArrayBuffer<ArrayDataType::kUint8>(source_array, &target_array);
+      break;
+    case ArrayDataType::kInt16:
+      CopyArrayBuffer<ArrayDataType::kInt16>(source_array, &target_array);
+      break;
+    case ArrayDataType::kUint16:
+      CopyArrayBuffer<ArrayDataType::kUint16>(source_array, &target_array);
+      break;
+    case ArrayDataType::kInt32:
+      CopyArrayBuffer<ArrayDataType::kInt32>(source_array, &target_array);
+      break;
+    case ArrayDataType::kUint32:
+      CopyArrayBuffer<ArrayDataType::kUint32>(source_array, &target_array);
+      break;
+    case ArrayDataType::kInt64:
+      CopyArrayBuffer<ArrayDataType::kInt64>(source_array, &target_array);
+      break;
+    case ArrayDataType::kUint64:
+      CopyArrayBuffer<ArrayDataType::kUint64>(source_array, &target_array);
+      break;
+    case ArrayDataType::kString:
+      CopyArrayBuffer<ArrayDataType::kString>(source_array, &target_array);
+      break;
+    default:
+      LOG(FATAL) << "Unsupported data type: "
+                 << ArrayDataTypeName(source_array.data_type);
+      return;
+  }
+
+  if (source_array.minmax) {
+    const auto& smm = source_array.GetMinMax();
+    auto& tmm = target_array.GetOrCreateMinMax();
+    tmm.min = smm.min;
+    tmm.max = smm.max;
+  }
+
+  if (source_array.quantization_params) {
+    const auto& sqp = source_array.GetQuantizationParams();
+    auto& tqp = target_array.GetOrCreateQuantizationParams();
+    tqp.zero_point = sqp.zero_point;
+    tqp.scale = sqp.scale;
+  }
+
+  target_array.data_type = source_array.data_type;
+  target_array.final_data_type = source_array.final_data_type;
+
+  target_array.copy_shape(source_array.shape());
+}
+
 void MakeArrayDims(int num_dims, int batch, int height, int width, int depth,
                    std::vector<int>* out_dims) {
   CHECK(out_dims->empty());
@@ -1191,7 +1322,7 @@ void ResolveModelFlags(const ModelFlags& model_flags, Model* model) {
       << "This model does not define output arrays, so a "
          "--output_arrays flag must be given on the command-line.";
 
-  for (const auto& input_array_proto : model->flags.input_arrays()) {
+  for (auto& input_array_proto : *model->flags.mutable_input_arrays()) {
     auto& input_array = model->GetOrCreateArray(input_array_proto.name());
     if (input_array_proto.has_data_type()) {
       const ArrayDataType specified_type =
@@ -1235,6 +1366,11 @@ void ResolveModelFlags(const ModelFlags& model_flags, Model* model) {
         for (int i = 0; i < input_array_dims.size(); i++) {
           CHECK_EQ(input_array_dims[i], input_array_proto.shape().dims(i));
         }
+      } else {
+        for (int i = 0; i < input_array.shape().dimensions_count(); i++) {
+          input_array_proto.mutable_shape()->add_dims(
+              input_array.shape().dims(i));
+        }
       }
     }
 
@@ -1330,6 +1466,8 @@ void UseDefaultMinMaxRangeValues(Model* model, double default_ranges_min,
 
 int ElementSize(ArrayDataType data_type) {
   switch (data_type) {
+    case ArrayDataType::kBool:
+      return sizeof(bool);
     case ArrayDataType::kFloat:
       return 4;
     case ArrayDataType::kInt8:
@@ -1355,7 +1493,7 @@ int ElementSize(ArrayDataType data_type) {
       LOG(FATAL) << "Transient arrays with strings are not supported yet";
       return 0;
     default:
-      LOG(FATAL) << "Should not get here.";
+      LOG(FATAL) << "Unknown data_type = " << static_cast<int>(data_type);
       return 0;
   }
 }
@@ -1785,7 +1923,10 @@ bool IsDiscardableArray(const Model& model, const string& array_name) {
 void CheckFinalDataTypesSatisfied(const Model& model) {
   for (const auto& array_entry : model.GetArrayMap()) {
     const auto& array = *array_entry.second;
-    if (array.final_data_type != ArrayDataType::kNone) {
+    // If the final data type is int16, the data type may be float, for example
+    // after dequantization.
+    if (array.final_data_type != ArrayDataType::kNone &&
+        array.final_data_type != ArrayDataType::kInt16) {
       CHECK(array.final_data_type == array.data_type)
           << "Array \"" << array_entry.first
           << "\" has mis-matching actual and final data types ("
@@ -1831,9 +1972,9 @@ void FinishBuildingRNNStates(Model* model) {
 
 void UseArraysExtraInfo(Model* model) {
   for (const auto& entry : model->flags.arrays_extra_info().entries()) {
-    QCHECK(model->HasArray(entry.name()))
-        << "ArraysExtraInfo refers to non-existent array name: "
-        << entry.name();
+    if (!model->HasArray(entry.name())) {
+      continue;
+    }
     auto& array = model->GetArray(entry.name());
     auto& minmax = array.GetOrCreateMinMax();
     if (entry.has_min() || entry.has_max()) {
@@ -1845,6 +1986,24 @@ void UseArraysExtraInfo(Model* model) {
       array.final_data_type =
           ConvertIODataTypeToArrayDataType(entry.data_type());
     }
+    if (entry.has_shape()) {
+      array.clear_shape();
+      // Make sure to create the shape even if there are no dims, to
+      // correctly record 0-D shapes.
+      array.mutable_shape();
+      for (int dim : entry.shape().dims()) {
+        array.mutable_shape()->mutable_dims()->push_back(dim);
+      }
+    }
+    if (entry.has_constant_float_value()) {
+      CHECK(array.has_shape());
+      CHECK(array.data_type == ArrayDataType::kFloat);
+      auto& data = array.GetMutableBuffer<ArrayDataType::kFloat>().data;
+      data.resize(RequiredBufferSizeForShape(array.shape()));
+      for (float& f : data) {
+        f = entry.constant_float_value();
+      }
+    }
   }
 }
 
diff --git a/tensorflow/contrib/lite/toco/tooling_util.h b/tensorflow/contrib/lite/toco/tooling_util.h
index 11208ed667212d56f9ef45e4f394e0bbf5000cbc..d3b7224fe3a773e389ad8fc9a40f0a0fad4debe5 100644
--- a/tensorflow/contrib/lite/toco/tooling_util.h
+++ b/tensorflow/contrib/lite/toco/tooling_util.h
@@ -28,6 +28,7 @@ limitations under the License.
 #if TOCO_SUPPORT_PORTABLE_PROTOS
 #include "third_party/protobuf/src/google/protobuf/text_format.h"
 #endif  // TOCO_SUPPORT_PORTABLE_PROTOS
+#include "tensorflow/contrib/lite/kernels/internal/quantization_util.h"
 #include "tensorflow/contrib/lite/toco/model.h"
 #include "tensorflow/contrib/lite/toco/model_flags.pb.h"
 #include "tensorflow/contrib/lite/toco/runtime/types.h"
@@ -64,6 +65,10 @@ int CountOpsWithInput(const Model& model, const string& array_name);
 bool DeleteArrayIfUnused(const string& array_name, Model* model);
 bool DeleteArrayIfUsedOnce(const string& array_name, Model* model);
 
+// Deletes the op and any of its input and output arrays if they are unused
+// after the op has been deleted.
+void DeleteOpAndArraysIfUnused(Model* model, Operator* op);
+
 std::vector<std::unique_ptr<Operator>>::const_iterator FindOpWithOutput(
     const Model& model, const string& array_name);
 Operator* GetOpWithOutput(const Model& model, const string& array_name);
@@ -71,8 +76,6 @@ Operator* GetOpWithOutput(const Model& model, const string& array_name);
 std::vector<std::unique_ptr<Operator>>::iterator FindOpWithOutput(
     Model& model, const string& array_name);
 
-Operator* GetOpWithOutput(const Model& model, const string& array_name);
-
 std::vector<std::unique_ptr<Operator>>::const_iterator FindOpWithInput(
     const Model& model, const string& array_name);
 
@@ -141,78 +144,29 @@ void FixOperatorOrdering(Model* model);
 void FixNoMissingArray(Model* model);
 void FixNoOrphanedArray(Model* model);
 
+// Fixes input/output arrays that may have issues during export or inference.
+void FixEdgeArrays(Model* model);
+
+// Inserts a no-op reshape operator between the source array and the target
+// array. This effectively just copies the data.
+void InsertCopyOperator(Model* model, const string& source_array_name,
+                        const string& target_array_name);
+
+// Clones an array with all data and parameters.
+void CloneArray(Model* model, const string& source_array_name,
+                const string& target_array_name);
+
 void ResolveModelFlags(const ModelFlags& model_flags, Model* model);
 
 template <ArrayDataType A>
-void GetQuantizationParamsFromMinMax(const ModelFlags& model_flags,
-                                     const MinMax& minmax,
+void GetQuantizationParamsFromMinMax(const MinMax& minmax,
                                      QuantizationParams* quantization_params) {
   using Integer = DataType<A>;
-  const Integer qmin = std::numeric_limits<Integer>::min();
-  const Integer qmax = std::numeric_limits<Integer>::max();
-  const double qmin_double = qmin;
-  const double qmax_double = qmax;
   const double rmin = minmax.min;
   const double rmax = minmax.max;
-  // 0 should always be a representable value. Let's assume that the initial
-  // min,max range contains 0.
-  CHECK_LE(rmin, 0.);
-  CHECK_GE(rmax, 0.);
-  if (rmin == rmax) {
-    // Special case where the min,max range is a point. Should be {0}.
-    CHECK_EQ(rmin, 0.);
-    CHECK_EQ(rmax, 0.);
-    quantization_params->zero_point = 0;
-    quantization_params->scale = 0.;
-    return;
-  }
 
-  // General case.
-  //
-  // First determine the scale.
-  const double scale = (rmax - rmin) / (qmax_double - qmin_double);
-
-  // Zero-point computation.
-  // First the initial floating-point computation. The zero-point can be
-  // determined from solving an affine equation for any known pair
-  // (real value, corresponding quantized value).
-  // We know two such pairs: (rmin, qmin) and (rmax, qmax).
-  // The arithmetic error on the zero point computed from either pair
-  // will be roughly machine_epsilon * (sum of absolute values of terms)
-  // so we want to use the variant that adds the smaller terms.
-  const double zero_point_from_min = qmin_double - rmin / scale;
-  const double zero_point_from_max = qmax_double - rmax / scale;
-  const double zero_point_from_min_error =
-      std::abs(qmin_double) + std::abs(rmin / scale);
-  const double zero_point_from_max_error =
-      std::abs(qmax_double) + std::abs(rmax / scale);
-
-  const double zero_point_double =
-      zero_point_from_min_error < zero_point_from_max_error
-          ? zero_point_from_min
-          : zero_point_from_max;
-
-  // Now we need to nudge the zero point to be an integer
-  // (our zero points are integer, and this is motivated by the requirement
-  // to be able to represent the real value "0" exactly as a quantized value,
-  // which is required in multiple places, for example in Im2col with SAME
-  // padding).
-  Integer nudged_zero_point = 0;
-  if (zero_point_double < qmin_double) {
-    nudged_zero_point = qmin;
-  } else if (zero_point_double > qmax_double) {
-    nudged_zero_point = qmax;
-  } else {
-    nudged_zero_point = static_cast<Integer>(std::round(zero_point_double));
-  }
-  // The zero point should always be in the range of quantized value,
-  // [qmin, qmax].
-  CHECK_GE(nudged_zero_point, qmin);
-  CHECK_LE(nudged_zero_point, qmax);
-
-  // Finally, store the result nudged quantization params.
-  quantization_params->zero_point = nudged_zero_point;
-  quantization_params->scale = scale;
+  *quantization_params =
+      ::tflite::ChooseQuantizationParams<Integer>(rmin, rmax);
 }
 
 void CheckIsReadyForQuantization(const Model& model);
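The scale and zero-point computation that the hunk above removes from GetQuantizationParamsFromMinMax can be sketched as a standalone function. This is a minimal sketch that follows the removed lines only; `QuantParamsSketch` and `ChooseParamsSketch` are hypothetical names, plain `assert` stands in for the CHECK macros, and nothing here is claimed to match the body of `::tflite::ChooseQuantizationParams`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for toco's QuantizationParams (assumption).
struct QuantParamsSketch {
  double scale = 0.0;
  int32_t zero_point = 0;
};

// Maps the real range [rmin, rmax] onto the full range of Integer.
template <typename Integer>
QuantParamsSketch ChooseParamsSketch(double rmin, double rmax) {
  const double qmin = std::numeric_limits<Integer>::min();
  const double qmax = std::numeric_limits<Integer>::max();

  // The real value 0 must be representable, so the range must contain it.
  assert(rmin <= 0.0 && rmax >= 0.0);

  QuantParamsSketch params;
  if (rmin == rmax) {
    // Degenerate range {0}: scale and zero point both stay 0.
    return params;
  }

  const double scale = (rmax - rmin) / (qmax - qmin);

  // Two candidate zero points, one from each known (real, quantized) pair.
  // Prefer the one whose terms are smaller in magnitude, since its
  // floating-point rounding error is roughly proportional to that sum.
  const double zp_from_min = qmin - rmin / scale;
  const double zp_from_max = qmax - rmax / scale;
  const double err_from_min = std::abs(qmin) + std::abs(rmin / scale);
  const double err_from_max = std::abs(qmax) + std::abs(rmax / scale);
  const double zp = err_from_min < err_from_max ? zp_from_min : zp_from_max;

  // Nudge the zero point to an integer inside [qmin, qmax].
  double nudged = std::round(zp);
  if (zp < qmin) {
    nudged = qmin;
  } else if (zp > qmax) {
    nudged = qmax;
  }

  params.scale = scale;
  params.zero_point = static_cast<int32_t>(nudged);
  return params;
}
```

For example, `ChooseParamsSketch<uint8_t>(-128.0, 127.0)` gives a scale of 1.0 and a zero point of 128, so the real value 0 maps exactly to the quantized value 128.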
diff --git a/tensorflow/contrib/lite/tools/BUILD b/tensorflow/contrib/lite/tools/BUILD
index 999ccf2ebc009b6b7c50a9a2d1667d69a3f690e7..b5abbc0712599814e078d19bc015bc7bf1812f95 100644
--- a/tensorflow/contrib/lite/tools/BUILD
+++ b/tensorflow/contrib/lite/tools/BUILD
@@ -4,6 +4,7 @@ package(default_visibility = [
 
 licenses(["notice"])  # Apache 2.0
 
+load("//tensorflow/contrib/lite:special_rules.bzl", "tflite_portable_test_suite")
 load("//tensorflow:tensorflow.bzl", "tf_cc_binary")
 
 py_binary(
@@ -45,7 +46,15 @@ tf_cc_binary(
         "//tensorflow/contrib/lite:framework",
         "//tensorflow/contrib/lite:string_util",
         "//tensorflow/contrib/lite/kernels:builtin_ops",
-    ],
+    ] + select({
+        "//tensorflow:android": [
+            "//tensorflow/core:android_tensorflow_lib",
+        ],
+        "//conditions:default": [
+            "//tensorflow/core:framework_internal",
+            "//tensorflow/core:lib",
+        ],
+    }),
 )
 
 cc_library(
@@ -111,6 +120,9 @@ cc_test(
     name = "verifier_test",
     size = "small",
     srcs = ["verifier_test.cc"],
+    tags = [
+        "tflite_not_portable",
+    ],
     deps = [
         ":mutable_op_resolver",
         ":verifier",
@@ -124,3 +136,5 @@ cc_test(
         "@flatbuffers",
     ],
 )
+
+tflite_portable_test_suite()
diff --git a/tensorflow/contrib/lite/tools/benchmark_model.cc b/tensorflow/contrib/lite/tools/benchmark_model.cc
index 6ae3ab57294a92162b15f326630ac202a9ba2a82..93c80e0f5e021f76bff6858b0ea3370724393d6d 100644
--- a/tensorflow/contrib/lite/tools/benchmark_model.cc
+++ b/tensorflow/contrib/lite/tools/benchmark_model.cc
@@ -1,4 +1,4 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -25,36 +25,89 @@ limitations under the License.
 #include "tensorflow/contrib/lite/model.h"
 #include "tensorflow/contrib/lite/string_util.h"
 #include "tensorflow/contrib/lite/tools/mutable_op_resolver.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/init_main.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/util/command_line_flags.h"
 
 #ifdef TFLITE_CUSTOM_OPS_HEADER
 void RegisterSelectedOps(::tflite::MutableOpResolver* resolver);
 #endif
 
-#define LOG(x) std::cerr
+namespace tflite {
 
-#define CHECK(x)                  \
-  if (!(x)) {                     \
-    LOG(ERROR) << #x << "failed"; \
-    exit(1);                      \
+using ::tensorflow::Env;
+using ::tensorflow::str_util::Split;
+using ::tensorflow::str_util::SplitAndParseAsFloats;
+using ::tensorflow::str_util::SplitAndParseAsInts;
+
+struct InputLayerInfo {
+  string name;
+  TfLiteType data_type;
+  std::vector<int> shape;
+  // Note that initialization_values is currently unused.
+  std::vector<float> initialization_values;
+};
+
+template <typename T>
+void FillRandomValue(T* ptr, const std::vector<int>& sizes,
+                     const std::function<T()>& random_func) {
+  int num_elements = 1;
+  for (int dim : sizes) {
+    num_elements *= dim;
+  }
+  for (int i = 0; i < num_elements; ++i) {
+    *ptr++ = random_func();
   }
+}
 
-namespace tensorflow {
-namespace benchmark_tflite_model {
+void FillRandomString(tflite::DynamicBuffer* buffer,
+                      const std::vector<int>& sizes,
+                      const std::function<string()>& random_func) {
+  int num_elements = 1;
+  for (int dim : sizes) {
+    num_elements *= dim;
+  }
+  for (int i = 0; i < num_elements; ++i) {
+    auto str = random_func();
+    buffer->AddString(str.data(), str.length());
+  }
+}
 
-std::unique_ptr<tflite::FlatBufferModel> model;
-std::unique_ptr<tflite::Interpreter> interpreter;
+TfLiteType TfLiteTypeFromString(const string& input_layer_type) {
+  if (input_layer_type == "string")
+    return kTfLiteString;
+  else if (input_layer_type == "float")
+    return kTfLiteFloat32;
+  else if (input_layer_type == "uint8")
+    return kTfLiteUInt8;
+  else if (input_layer_type == "int32")
+    return kTfLiteInt32;
+  else if (input_layer_type == "int64")
+    return kTfLiteInt64;
+  else
+    return kTfLiteNoType;
+}
 
-void InitImpl(const std::string& graph, const std::vector<int>& sizes,
-              const std::string& input_layer_type, int num_threads) {
-  CHECK(graph.c_str());
+std::vector<int> ShapeFromTfLiteTensor(TfLiteTensor* t) {
+  std::vector<int> result;
+  result.reserve(t->dims->size);
+  for (int i = 0; i < t->dims->size; ++i) {
+    result.push_back(t->dims->data[i]);
+  }
+  CHECK(!result.empty()) << "Found no shapes in model";
+  return result;
+}
 
-  model = tflite::FlatBufferModel::BuildFromFile(graph.c_str());
+bool CreateInterpreter(const string& graph,
+                       std::unique_ptr<FlatBufferModel>* model,
+                       std::unique_ptr<Interpreter>* interpreter) {
+  *model = tflite::FlatBufferModel::BuildFromFile(graph.c_str());
   if (!model) {
-    LOG(FATAL) << "Failed to mmap model " << graph;
+    std::cerr << "Failed to load model " << graph << std::endl;
+    return false;
   }
-  LOG(INFO) << "Loaded model " << graph;
-  model->error_reporter();
-  LOG(INFO) << "resolved reporter";
 
 #ifdef TFLITE_CUSTOM_OPS_HEADER
   tflite::MutableOpResolver resolver;
@@ -63,34 +116,360 @@ void InitImpl(const std::string& graph, const std::vector<int>& sizes,
   tflite::ops::builtin::BuiltinOpResolver resolver;
 #endif
 
-  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
-  if (!interpreter) {
-    LOG(FATAL) << "Failed to construct interpreter";
+  tflite::InterpreterBuilder(*(model->get()), resolver)(interpreter);
+  if (!(*interpreter)) {
+    std::cerr << "Failed to construct interpreter" << std::endl;
+    return false;
   }
 
+  return true;
+}
+
+bool PrepareInterpreter(const std::vector<InputLayerInfo>& inputs,
+                        int num_threads, bool use_nnapi,
+                        Interpreter* interpreter) {
   if (num_threads != -1) {
     interpreter->SetNumThreads(num_threads);
   }
 
-  int input = interpreter->inputs()[0];
+  interpreter->UseNNAPI(use_nnapi);
 
-  if (input_layer_type != "string") {
-    interpreter->ResizeInputTensor(input, sizes);
+  // Check that all names and types match
+  for (const InputLayerInfo& input : inputs) {
+    for (int i : interpreter->inputs()) {
+      TfLiteTensor* t = interpreter->tensor(i);
+      CHECK_EQ(t->name, input.name)
+          << "Tensor # " << i << " is named " << t->name
+          << " but flags call it " << input.name;
+      CHECK_EQ(t->type, input.data_type)
+          << "Could not match the type of input tensor " << t->name;
+    }
+  }
+
+  // Resize all non-string tensors.
+  for (const InputLayerInfo& input : inputs) {
+    for (int i : interpreter->inputs()) {
+      TfLiteTensor* t = interpreter->tensor(i);
+      if (t->type != kTfLiteString) {
+        interpreter->ResizeInputTensor(i, input.shape);
+      }
+    }
   }
 
   if (interpreter->AllocateTensors() != kTfLiteOk) {
-    LOG(FATAL) << "Failed to allocate tensors!";
+    std::cerr << "Failed to allocate tensors!" << std::endl;
+    return false;
+  }
+
+  // Set the values of the input tensors.
+  for (int i : interpreter->inputs()) {
+    TfLiteTensor* t = interpreter->tensor(i);
+    std::vector<int> sizes = ShapeFromTfLiteTensor(t);
+
+    // TODO(ahentz): below we ignore the 0-th dimension (number of batches).
+    if (t->type == kTfLiteFloat32) {
+      FillRandomValue<float>(
+          interpreter->typed_tensor<float>(i),
+          std::vector<int>(sizes.begin() + 1, sizes.end()),
+          []() { return static_cast<float>(rand()) / RAND_MAX - 0.5f; });
+    } else if (t->type == kTfLiteUInt8) {
+      FillRandomValue<uint8_t>(
+          interpreter->typed_tensor<uint8_t>(i),
+          std::vector<int>(sizes.begin() + 1, sizes.end()),
+          []() { return static_cast<uint8_t>(rand()) % 255; });
+    } else if (t->type == kTfLiteString) {
+      tflite::DynamicBuffer buffer;
+      FillRandomString(&buffer, sizes, []() {
+        return "we're have some friends over saturday to hang out in the yard";
+      });
+      buffer.WriteToTensor(interpreter->tensor(i));
+    } else {
+      std::cerr << "Don't know how to populate tensor " << t->name
+                << " of type " << t->type << std::endl;
+      return false;
+    }
+  }
+  return true;
+}
+
+bool PopulateInputLayerInfo(const string& names_string,
+                            const string& shapes_string,
+                            const string& types_string,
+                            const string& values_string,
+                            std::vector<InputLayerInfo>* info) {
+  std::vector<string> names = Split(names_string, ',');
+  std::vector<string> shapes = Split(shapes_string, ':');
+  std::vector<string> types = Split(types_string, ',');
+  std::vector<string> values = Split(values_string, ':');
+
+  if (names.size() != shapes.size()) {
+    LOG(ERROR) << "The number of items in"
+               << " --input_layer_shape (" << shapes_string << ", with "
+               << shapes.size() << " items)"
+               << " must match the number of items in"
+               << " --input_layer (" << names_string << ", with "
+               << names.size() << " items)."
+               << " For example --input_layer=input1,input2"
+               << " --input_layer_shape=1,224,224,4:1,20";
+    return false;
+  }
+  if (names.size() != types.size()) {
+    LOG(ERROR) << "The number of items in"
+               << " --input_layer_type (" << types_string << ", with "
+               << types.size() << " items)"
+               << " must match the number of items in"
+               << " --input_layer (" << names_string << ", with "
+               << names.size() << " items)."
+               << " For example --input_layer=input1,input2"
+               << " --input_layer_type=float,int32";
+    return false;
+  }
+
+  for (int i = 0; i < names.size(); ++i) {
+    info->push_back(InputLayerInfo());
+    InputLayerInfo& input = info->back();
+
+    input.name = names[i];
+
+    input.data_type = TfLiteTypeFromString(types[i]);
+    CHECK(input.data_type != kTfLiteNoType)
+        << types[i] << " was an invalid type";
+
+    CHECK(SplitAndParseAsInts(shapes[i], ',', &input.shape))
+        << "Incorrect size string specified: " << shapes[i];
+    for (int dim : input.shape) {
+      if (dim == -1) {
+        LOG(ERROR) << "Any unknown sizes in the shapes (-1's) must be replaced"
+                   << " with the size you want to benchmark with.";
+        return false;
+      }
+    }
+
+    if (i < values.size()) {
+      CHECK(SplitAndParseAsFloats(values[i], ',', &input.initialization_values))
+          << "Incorrect initialization values string specified: " << values[i];
+    }
+  }
+
+  return true;
+}
+
+bool RunBenchmark(Interpreter* interpreter, int64_t* inference_time_us) {
+  const int64_t start_time = Env::Default()->NowMicros();
+
+  if (interpreter->Invoke() != kTfLiteOk) {
+    std::cerr << "Failed to invoke!" << std::endl;
+    return false;
   }
+
+  const int64_t end_time = Env::Default()->NowMicros();
+  *inference_time_us = end_time - start_time;
+  return true;
+}
+
+class Latencies {
+ public:
+  void AddMeasurement(int64_t time_us) {
+    max_ = std::max(time_us, max_);
+    min_ = std::min(time_us, min_);
+    ++count_;
+    sum_ += time_us;
+    squared_sum_ += static_cast<double>(time_us) * time_us;
+  }
+
+  double avg() const {
+    if (count_ == 0) return std::numeric_limits<double>::quiet_NaN();
+    return static_cast<double>(sum_) / count_;
+  }
+
+  int64_t std_deviation() const {
+    if (count_ == 0 || min_ == max_) return 0;
+    return sqrt(squared_sum_ / count_ - avg() * avg());
+  }
+
+  void OutputToStream(std::ostream* stream) const {
+    *stream << "count=" << count_;
+    if (count_ == 0) return;
+    *stream << " min=" << min_ << " max=" << max_;
+    *stream << " avg=" << avg() << " std=" << std_deviation();
+  }
+
+ private:
+  int64_t count_ = 0;
+  int64_t min_ = std::numeric_limits<int64_t>::max();
+  int64_t max_ = std::numeric_limits<int64_t>::min();
+  int64_t sum_ = 0;
+  double squared_sum_ = 0;
+};
+
+bool TimeMultipleRuns(Interpreter* interpreter, double sleep_seconds,
+                      int num_runs, int64* total_time_us) {
+  // Convert the sleep duration (in seconds) into a timespec.
+  timespec req;
+  req.tv_sec = static_cast<time_t>(sleep_seconds);
+  req.tv_nsec = (sleep_seconds - req.tv_sec) * 1000000000;
+
+  *total_time_us = 0;
+
+  std::cout << "Running benchmark for " << num_runs
+            << " iterations: " << std::endl;
+
+  Latencies latencies;
+  for (int i = 0; i < num_runs; ++i) {
+    int64_t time_us;
+    bool run_status = RunBenchmark(interpreter, &time_us);
+    latencies.AddMeasurement(time_us);
+    *total_time_us += time_us;
+    if (!run_status) {
+      std::cout << "Failed on run " << i << std::endl;
+      return false;
+    }
+
+    // If requested, sleep between runs for an arbitrary amount of time.
+    // This can be helpful to determine the effect of mobile processor
+    // scaling and thermal throttling.
+    if (sleep_seconds > 0.0) {
+#ifdef PLATFORM_WINDOWS
+      Sleep(sleep_seconds * 1000);
+#else
+      nanosleep(&req, nullptr);
+#endif
+    }
+  }
+  latencies.OutputToStream(&std::cout);
+  std::cout << std::endl;
+
+  return true;
 }
 
 int Main(int argc, char** argv) {
-  InitImpl("", {}, "", 1);
+  using tensorflow::Flag;
+  using tensorflow::Flags;
+
+  string graph;               // e.g.: /data/local/tmp/tfl_inception-v1_model.fb
+  string input_layer_string;  // e.g.: input
+  string input_layer_shape_string;  // e.g.: 1,224,224,3
+  string input_layer_type_string;   // e.g.: float
+  string input_layer_values_string;
+  string output_layer_string;  // e.g.: output
+  int num_runs = 50;
+  string run_delay = "-1.0";
+  int num_threads = -1;
+  string benchmark_name = "";
+  string output_prefix = "";
+  int warmup_runs = 1;
+  bool use_nnapi = false;
+
+  std::vector<Flag> flag_list = {
+      Flag("graph", &graph, "graph file name"),
+      // All the following flags are optional, but can be used in order
+      // to benchmark different input shapes.
+      Flag("input_layer", &input_layer_string, "input layer names"),
+      Flag("input_layer_shape", &input_layer_shape_string, "input layer shape"),
+      Flag("input_layer_type", &input_layer_type_string, "input layer type"),
+      Flag("input_layer_values", &input_layer_values_string,
+           "values to initialize the inputs with"),
+      Flag("output_layer", &output_layer_string, "output layer name"),
+      Flag("num_runs", &num_runs, "number of runs"),
+      Flag("run_delay", &run_delay, "delay between runs in seconds"),
+      Flag("num_threads", &num_threads, "number of threads"),
+      Flag("benchmark_name", &benchmark_name, "benchmark name"),
+      Flag("output_prefix", &output_prefix, "benchmark output prefix"),
+      Flag("warmup_runs", &warmup_runs, "how many runs to initialize model"),
+      Flag("use_nnapi", &use_nnapi, "use nnapi api"),
+  };
+  string usage = Flags::Usage(argv[0], flag_list);
+  const bool parse_result = Flags::Parse(&argc, argv, flag_list);
+  tensorflow::port::InitMain(argv[0], &argc, &argv);
+
+  if (!parse_result) {
+    std::cerr << usage << std::endl;
+    return -1;
+  }
+
+  std::cout << "Graph: [" << graph << "]" << std::endl;
+  if (!input_layer_string.empty()) {
+    std::cout << "Input layers: [" << input_layer_string << "]" << std::endl;
+    std::cout << "Input shapes: [" << input_layer_shape_string << "]"
+              << std::endl;
+    std::cout << "Input types: [" << input_layer_type_string << "]"
+              << std::endl;
+  }
+  if (!output_layer_string.empty()) {
+    std::cout << "Output layers: [" << output_layer_string << "]" << std::endl;
+  }
+  std::cout << "Num runs: [" << num_runs << "]" << std::endl;
+  std::cout << "Inter-run delay (seconds): [" << run_delay << "]" << std::endl;
+  std::cout << "Num threads: [" << num_threads << "]" << std::endl;
+  if (!benchmark_name.empty()) {
+    std::cout << "Benchmark name: [" << benchmark_name << "]" << std::endl;
+    std::cout << "Output prefix: [" << output_prefix << "]" << std::endl;
+  }
+  std::cout << "Warmup runs: [" << warmup_runs << "]" << std::endl;
+  std::cout << "Use nnapi : [" << use_nnapi << "]" << std::endl;
+
+  if (graph.empty()) {
+    std::cout
+        << "Please specify the name of your TF Lite input file with --graph"
+        << std::endl;
+    return -1;
+  }
+
+  std::vector<InputLayerInfo> inputs;
+  if (!PopulateInputLayerInfo(input_layer_string, input_layer_shape_string,
+                              input_layer_type_string,
+                              input_layer_values_string, &inputs)) {
+    return -1;
+  }
+
+  int64 initialization_start_us = Env::Default()->NowMicros();
+
+  std::unique_ptr<tflite::FlatBufferModel> model;
+  std::unique_ptr<tflite::Interpreter> interpreter;
+  if (!CreateInterpreter(graph, &model, &interpreter)) {
+    return -1;
+  }
+  if (!PrepareInterpreter(inputs, num_threads, use_nnapi, interpreter.get())) {
+    return -1;
+  }
+
+  int64 initialization_end_us = Env::Default()->NowMicros();
+
+  const double initialization_time_s =
+      (initialization_end_us - initialization_start_us) / 1000000.0f;
+  std::cout << "Initialized session in " << initialization_time_s << "s"
+            << std::endl;
+
+  const double sleep_seconds = std::strtod(run_delay.c_str(), nullptr);
+
+  // If requested, run through the graph first to preinitialize everything
+  // before the benchmarking runs.
+  int64 warmup_time_us = 0;
+  if (warmup_runs > 0) {
+    if (!TimeMultipleRuns(interpreter.get(), sleep_seconds, warmup_runs,
+                          &warmup_time_us)) {
+      std::cerr << "Warmup failed" << std::endl;
+      return -1;
+    }
+  }
+
+  // Capture overall inference time without stat logging overhead. This is the
+  // timing data that can be compared to other libraries.
+  int64 no_stat_time_us = 0;
+  if (!TimeMultipleRuns(interpreter.get(), sleep_seconds, num_runs,
+                        &no_stat_time_us)) {
+    std::cerr << "Timing failed." << std::endl;
+    return -1;
+  }
+
+  std::cout << "Average inference timings in us: " << no_stat_time_us / num_runs
+            << " , Warmup: "
+            << (warmup_runs > 0 ? warmup_time_us / warmup_runs : 0) << ", "
+            << std::endl;
+
   return 0;
 }
 
-}  // namespace benchmark_tflite_model
-}  // namespace tensorflow
+}  // namespace tflite
 
-int main(int argc, char** argv) {
-  return tensorflow::benchmark_tflite_model::Main(argc, argv);
-}
+int main(int argc, char** argv) { return ::tflite::Main(argc, argv); }
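The running statistics kept by the Latencies class in the file above (count, min, max, sum, and sum of squares) are enough to report a mean and a population standard deviation without storing individual samples. The following is a minimal standalone sketch of that technique; `LatencyStatsSketch` is a hypothetical name, and returning `double` from both accessors and clamping the variance at zero are choices of this sketch rather than behavior of the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical sketch of single-pass latency statistics (not the patch's class).
class LatencyStatsSketch {
 public:
  void Add(int64_t time_us) {
    assert(time_us >= 0);  // Latencies are durations, so never negative.
    min_ = std::min(min_, time_us);
    max_ = std::max(max_, time_us);
    ++count_;
    sum_ += time_us;
    squared_sum_ += static_cast<double>(time_us) * time_us;
  }

  // Mean of all samples; NaN when nothing has been recorded.
  double avg() const {
    if (count_ == 0) return std::numeric_limits<double>::quiet_NaN();
    return static_cast<double>(sum_) / count_;
  }

  // Population standard deviation: sqrt(E[x^2] - E[x]^2).
  double std_deviation() const {
    if (count_ == 0 || min_ == max_) return 0.0;
    const double variance = squared_sum_ / count_ - avg() * avg();
    // Rounding can push a tiny variance slightly below zero; clamp it.
    return std::sqrt(std::max(variance, 0.0));
  }

 private:
  int64_t count_ = 0;
  int64_t min_ = std::numeric_limits<int64_t>::max();
  int64_t max_ = std::numeric_limits<int64_t>::min();
  int64_t sum_ = 0;
  double squared_sum_ = 0.0;
};
```

Feeding the samples 2, 4, 4, 4, 5, 5, 7, 9 gives a mean of 5 and a standard deviation of 2. Note that `std::numeric_limits<int64_t>::quiet_NaN()` returns 0 for integer types, which is why the sketch asks for the `double` specialization.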
diff --git a/tensorflow/contrib/lite/tools/verifier.cc b/tensorflow/contrib/lite/tools/verifier.cc
index 59c74205f0a311ec12ff87f46622041605fb493b..8818a7dc85d9ffdc1da450fb389d5ed11139bc31 100644
--- a/tensorflow/contrib/lite/tools/verifier.cc
+++ b/tensorflow/contrib/lite/tools/verifier.cc
@@ -148,11 +148,52 @@ bool VerifyNumericTensorBuffer(const Tensor& tensor, const Buffer& buffer,
   // TODO(yichengfan): verify quantized tensors.
 }
 
+using flatbuffers::Offset;
+using flatbuffers::Vector;
+
+bool VerifyOperators(const Vector<Offset<Operator>>& operators,
+                     ErrorReporter* error_reporter) {
+  for (const auto& op : operators) {
+    if (!op->inputs()) {
+      ReportError(error_reporter, "Missing 'inputs' for operator.");
+      return false;
+    }
+    if (!op->outputs()) {
+      ReportError(error_reporter, "Missing 'outputs' for operator.");
+      return false;
+    }
+  }
+  return true;
+}
+
+bool VerifySubGraphs(const Model& model, ErrorReporter* error_reporter) {
+  if (!model.subgraphs()) {
+    ReportError(error_reporter, "Missing 'subgraphs' section.");
+    return false;
+  }
+  for (const auto& subgraph : *model.subgraphs()) {
+    if (!subgraph->operators()) {
+      ReportError(error_reporter, "Missing 'operators' section in subgraph.");
+      return false;
+    }
+
+    if (!VerifyOperators(*subgraph->operators(), error_reporter)) {
+      return false;
+    }
+  }
+  return true;
+}
+
 // Verifies tensors have valid properties and legit buffer if set.
 bool VerifyTensors(const Model& model, ErrorReporter* error_reporter) {
   if (!model.subgraphs()) {
     return true;
   }
+  if (!model.buffers()) {
+    ReportError(error_reporter, "Missing 'buffers' section.");
+    return false;
+  }
+
   for (const auto& subgraph : *model.subgraphs()) {
     if (!subgraph->tensors()) {
       continue;
@@ -167,19 +208,23 @@ bool VerifyTensors(const Model& model, ErrorReporter* error_reporter) {
         return false;
       }
       auto* buffer = model.buffers()->Get(tensor->buffer());
-      if (!buffer || !buffer->data()) {
+      if (!buffer) {
         ReportError(error_reporter, "Tensor buffer %d not set",
                     tensor->buffer());
         return false;
       }
 
-      if (tensor->type() == TensorType_STRING) {
-        if (!VerifyStringTensorBuffer(*buffer, error_reporter)) {
-          return false;
-        }
-      } else {
-        if (!VerifyNumericTensorBuffer(*tensor, *buffer, error_reporter)) {
-          return false;
+      // Many transient tensors don't have data in the flatbuffer. Their
+      // buffers will be allocated by the interpreter at run-time.
+      if (buffer->data()) {
+        if (tensor->type() == TensorType_STRING) {
+          if (!VerifyStringTensorBuffer(*buffer, error_reporter)) {
+            return false;
+          }
+        } else {
+          if (!VerifyNumericTensorBuffer(*tensor, *buffer, error_reporter)) {
+            return false;
+          }
         }
       }
     }
@@ -193,6 +238,13 @@ bool VerifyOps(const Model& model, const OpResolver& resolver,
     return true;
   }
   for (const auto& opcode : *model.operator_codes()) {
+    if (opcode->builtin_code() < BuiltinOperator_MIN ||
+        opcode->builtin_code() > BuiltinOperator_MAX) {
+      ReportError(error_reporter, "Operator id '%d' is out of range.",
+                  opcode->builtin_code());
+      return false;
+    }
+
     if (opcode->builtin_code() == BuiltinOperator_CUSTOM) {
       if (!resolver.FindOp(opcode->custom_code()->c_str())) {
         ReportError(error_reporter, "Unsupported custom op: %s",
@@ -223,6 +275,9 @@ bool Verify(const void* buf, size_t len, const OpResolver& resolver,
     ReportError(error_reporter, "Invalid model version %d", model->version());
     return false;
   }
+  if (!VerifySubGraphs(*model, error_reporter)) {
+    return false;
+  }
   if (!VerifyTensors(*model, error_reporter)) {
     return false;
   }
diff --git a/tensorflow/contrib/lite/tools/verifier.h b/tensorflow/contrib/lite/tools/verifier.h
index c2ee11215c861ed7b27696a8d786bb6e2a48e930..b7ce4e830576af14002d6bd9080af1da5764b1c9 100644
--- a/tensorflow/contrib/lite/tools/verifier.h
+++ b/tensorflow/contrib/lite/tools/verifier.h
@@ -23,6 +23,21 @@ limitations under the License.
 
 namespace tflite {
 
+class AlwaysTrueResolver : public OpResolver {
+ public:
+  AlwaysTrueResolver() {}
+  TfLiteRegistration* FindOp(tflite::BuiltinOperator op) const override {
+    static TfLiteRegistration null_registration = {nullptr, nullptr, nullptr,
+                                                   nullptr};
+    return &null_registration;
+  }
+  TfLiteRegistration* FindOp(const char* op) const override {
+    static TfLiteRegistration null_registration = {nullptr, nullptr, nullptr,
+                                                   nullptr};
+    return &null_registration;
+  }
+};
+
 // Verifies the integrity of a Tensorflow Lite flatbuffer model file.
 // Currently, it verifies:
 // * The file is following a legit flatbuffer schema.
diff --git a/tensorflow/contrib/lite/tools/verifier_test.cc b/tensorflow/contrib/lite/tools/verifier_test.cc
index b3e611f999b2837efbf8876bd989db44c408b8c7..03b93afe3ed04b4bff13bc01d7c7c8e9fae9bdf3 100644
--- a/tensorflow/contrib/lite/tools/verifier_test.cc
+++ b/tensorflow/contrib/lite/tools/verifier_test.cc
@@ -113,8 +113,8 @@ TEST(VerifyModel, TestEmptyModel) {
                            /*description=*/0, /*buffers=*/0);
   ::tflite::FinishModelBuffer(builder, model);
 
-  ASSERT_TRUE(Verify(builder.GetBufferPointer(), builder.GetSize(),
-                     MutableOpResolver{}, DefaultErrorReporter()));
+  ASSERT_FALSE(Verify(builder.GetBufferPointer(), builder.GetSize(),
+                      MutableOpResolver{}, DefaultErrorReporter()));
 }
 
 TEST(VerifyModel, TestSimpleModel) {
diff --git a/tensorflow/contrib/lite/util.cc b/tensorflow/contrib/lite/util.cc
new file mode 100644
index 0000000000000000000000000000000000000000..fb4af07d060cac3a6a4e01c7d625b6db5241f10d
--- /dev/null
+++ b/tensorflow/contrib/lite/util.cc
@@ -0,0 +1,41 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/util.h"
+
+namespace tflite {
+
+TfLiteIntArray* ConvertVectorToTfLiteIntArray(const std::vector<int>& input) {
+  return ConvertArrayToTfLiteIntArray(input.size(), input.data());
+}
+
+TfLiteIntArray* ConvertArrayToTfLiteIntArray(const int rank, const int* dims) {
+  TfLiteIntArray* output = TfLiteIntArrayCreate(rank);
+  for (int i = 0; i < rank; ++i) {
+    output->data[i] = dims[i];
+  }
+  return output;
+}
+
+bool EqualArrayAndTfLiteIntArray(const TfLiteIntArray* a, const int b_size,
+                                 const int* b) {
+  if (!a) return false;
+  if (a->size != b_size) return false;
+  for (int i = 0; i < a->size; ++i) {
+    if (a->data[i] != b[i]) return false;
+  }
+  return true;
+}
+
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/util.h b/tensorflow/contrib/lite/util.h
new file mode 100644
index 0000000000000000000000000000000000000000..a34db35823104414cce028b9119397da085d05b1
--- /dev/null
+++ b/tensorflow/contrib/lite/util.h
@@ -0,0 +1,40 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// This file provides general C++ utility functions in TFLite.
+// For example: Converting between `TfLiteIntArray`, `std::vector` and
+// Flatbuffer vectors. These functions can't live in `context.h` since it's pure
+// C.
+
+#ifndef TENSORFLOW_CONTRIB_LITE_UTIL_H_
+#define TENSORFLOW_CONTRIB_LITE_UTIL_H_
+
+#include <vector>
+#include "tensorflow/contrib/lite/context.h"
+
+namespace tflite {
+
+// Converts a `std::vector` to a `TfLiteIntArray`.
+TfLiteIntArray* ConvertVectorToTfLiteIntArray(const std::vector<int>& input);
+
+TfLiteIntArray* ConvertArrayToTfLiteIntArray(const int rank, const int* dims);
+
+// Checks whether a `TfLiteIntArray` and an int array have matching elements.
+bool EqualArrayAndTfLiteIntArray(const TfLiteIntArray* a, const int b_size,
+                                 const int* b);
+
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_UTIL_H_
diff --git a/tensorflow/contrib/lite/util_test.cc b/tensorflow/contrib/lite/util_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..04579c53aa4835c47d812c89a1554a0d2f2f30b8
--- /dev/null
+++ b/tensorflow/contrib/lite/util_test.cc
@@ -0,0 +1,50 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <vector>
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/util.h"
+
+namespace tflite {
+namespace {
+
+TEST(ConvertVectorToTfLiteIntArray, TestWithVector) {
+  std::vector<int> input = {1, 2};
+  TfLiteIntArray* output = ConvertVectorToTfLiteIntArray(input);
+  ASSERT_NE(output, nullptr);
+  EXPECT_EQ(output->size, 2);
+  EXPECT_EQ(output->data[0], 1);
+  EXPECT_EQ(output->data[1], 2);
+  TfLiteIntArrayFree(output);
+}
+
+TEST(ConvertVectorToTfLiteIntArray, TestWithEmptyVector) {
+  std::vector<int> input;
+  TfLiteIntArray* output = ConvertVectorToTfLiteIntArray(input);
+  ASSERT_NE(output, nullptr);
+  EXPECT_EQ(output->size, 0);
+  TfLiteIntArrayFree(output);
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lookup/lookup_ops.py b/tensorflow/contrib/lookup/lookup_ops.py
index a430dac4ec43ce31f0b5aaae5e7b0b51d25c9632..a57a1e5421890b2331211bbddcb8163807a8aabd 100644
--- a/tensorflow/contrib/lookup/lookup_ops.py
+++ b/tensorflow/contrib/lookup/lookup_ops.py
@@ -105,7 +105,7 @@ def index_table_from_tensor(mapping,
   ...
   tf.tables_initializer().run()
 
-  ids.eval()  ==> [0, 1, 4, 2]
+  ids.eval()  ==> [0, 1, 3, 2]
   ```
 
   Args:
@@ -341,23 +341,21 @@ class MutableHashTable(LookupInterface):
     # training to work correctly. Use the node name if no shared_name has been
     # explicitly specified.
     use_node_name_sharing = checkpoint and shared_name is None
-    # pylint: disable=protected-access
     if self._default_value.get_shape().ndims == 0:
-      self._table_ref = gen_lookup_ops._mutable_hash_table_v2(
+      self._table_ref = gen_lookup_ops.mutable_hash_table_v2(
           shared_name=shared_name,
           use_node_name_sharing=use_node_name_sharing,
           key_dtype=key_dtype,
           value_dtype=value_dtype,
           name=name)
     else:
-      self._table_ref = gen_lookup_ops._mutable_hash_table_of_tensors_v2(
+      self._table_ref = gen_lookup_ops.mutable_hash_table_of_tensors_v2(
           shared_name=shared_name,
           use_node_name_sharing=use_node_name_sharing,
           key_dtype=key_dtype,
           value_dtype=value_dtype,
           value_shape=self._default_value.get_shape(),
           name=name)
-    # pylint: enable=protected-access
     super(MutableHashTable, self).__init__(key_dtype, value_dtype,
                                            self._table_ref.op.name.split(
                                                "/")[-1])
@@ -378,9 +376,7 @@ class MutableHashTable(LookupInterface):
     with ops.name_scope(name, "%s_Size" % self._name,
                         [self._table_ref]) as name:
       with ops.colocate_with(self._table_ref):
-
-        # pylint: disable=protected-access
-        return gen_lookup_ops._lookup_table_size_v2(self._table_ref, name=name)
+        return gen_lookup_ops.lookup_table_size_v2(self._table_ref, name=name)
 
   def lookup(self, keys, name=None):
     """Looks up `keys` in a table, outputs the corresponding values.
@@ -406,8 +402,7 @@ class MutableHashTable(LookupInterface):
     with ops.name_scope(name, "%s_lookup_table_find" % self._name,
                         (self._table_ref, keys, self._default_value)) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        values = gen_lookup_ops._lookup_table_find_v2(
+        values = gen_lookup_ops.lookup_table_find_v2(
             self._table_ref, keys, self._default_value, name=name)
 
         values.set_shape(keys.get_shape().concatenate(self._value_shape))
@@ -437,7 +432,7 @@ class MutableHashTable(LookupInterface):
                         [self._table_ref, keys, values]) as name:
       with ops.colocate_with(self._table_ref):
         # pylint: disable=protected-access
-        op = gen_lookup_ops._lookup_table_insert_v2(
+        op = gen_lookup_ops.lookup_table_insert_v2(
             self._table_ref, keys, values, name=name)
     return op
 
@@ -454,8 +449,7 @@ class MutableHashTable(LookupInterface):
     with ops.name_scope(name, "%s_lookup_table_export_values" % self._name,
                         [self._table_ref]) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        exported_keys, exported_values = gen_lookup_ops._lookup_table_export_v2(
+        exported_keys, exported_values = gen_lookup_ops.lookup_table_export_v2(
             self._table_ref, self._key_dtype, self._value_dtype, name=name)
 
     exported_values.set_shape(exported_keys.get_shape().concatenate(
@@ -477,7 +471,7 @@ class MutableHashTable(LookupInterface):
     def restore(self, restored_tensors, unused_restored_shapes):
       # pylint: disable=protected-access
       with ops.colocate_with(self.op._table_ref):
-        return gen_lookup_ops._lookup_table_import_v2(
+        return gen_lookup_ops.lookup_table_import_v2(
             self.op._table_ref, restored_tensors[0], restored_tensors[1])
 
 
@@ -551,8 +545,7 @@ class MutableDenseHashTable(LookupInterface):
     # explicitly specified.
     use_node_name_sharing = checkpoint and shared_name is None
     empty_key = ops.convert_to_tensor(empty_key, dtype=key_dtype)
-    # pylint: disable=protected-access
-    self._table_ref = gen_lookup_ops._mutable_dense_hash_table_v2(
+    self._table_ref = gen_lookup_ops.mutable_dense_hash_table_v2(
         empty_key=empty_key,
         shared_name=shared_name,
         use_node_name_sharing=use_node_name_sharing,
@@ -560,7 +553,6 @@ class MutableDenseHashTable(LookupInterface):
         value_shape=self._value_shape,
         initial_num_buckets=initial_num_buckets,
         name=name)
-    # pylint: enable=protected-access
     super(MutableDenseHashTable, self).__init__(
         key_dtype, value_dtype, self._table_ref.op.name.split("/")[-1])
 
@@ -580,8 +572,7 @@ class MutableDenseHashTable(LookupInterface):
     with ops.name_scope(name, "%s_Size" % self._name,
                         [self._table_ref]) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        return gen_lookup_ops._lookup_table_size_v2(self._table_ref, name=name)
+        return gen_lookup_ops.lookup_table_size_v2(self._table_ref, name=name)
 
   def lookup(self, keys, name=None):
     """Looks up `keys` in a table, outputs the corresponding values.
@@ -607,8 +598,7 @@ class MutableDenseHashTable(LookupInterface):
     with ops.name_scope(name, "%s_lookup_table_find" % self._name,
                         [self._table_ref, keys]) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        values = gen_lookup_ops._lookup_table_find_v2(
+        values = gen_lookup_ops.lookup_table_find_v2(
             self._table_ref, keys, self._default_value, name=name)
 
     if keys.get_shape().ndims is not None and keys.get_shape().ndims > 0:
@@ -640,8 +630,7 @@ class MutableDenseHashTable(LookupInterface):
     with ops.name_scope(name, "%s_lookup_table_insert" % self._name,
                         [self._table_ref, keys, values]) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        op = gen_lookup_ops._lookup_table_insert_v2(
+        op = gen_lookup_ops.lookup_table_insert_v2(
             self._table_ref, keys, values, name=name)
       return op
 
@@ -658,8 +647,7 @@ class MutableDenseHashTable(LookupInterface):
     with ops.name_scope(name, "%s_lookup_table_export_values" % self._name,
                         [self._table_ref]) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        exported_keys, exported_values = gen_lookup_ops._lookup_table_export_v2(
+        exported_keys, exported_values = gen_lookup_ops.lookup_table_export_v2(
             self._table_ref, self._key_dtype, self._value_dtype, name=name)
 
     exported_values.set_shape(exported_keys.get_shape().concatenate(
@@ -681,5 +669,5 @@ class MutableDenseHashTable(LookupInterface):
     def restore(self, restored_tensors, unused_restored_shapes):
       # pylint: disable=protected-access
       with ops.colocate_with(self.op._table_ref):
-        return gen_lookup_ops._lookup_table_import_v2(
+        return gen_lookup_ops.lookup_table_import_v2(
             self.op._table_ref, restored_tensors[0], restored_tensors[1])
diff --git a/tensorflow/contrib/makefile/Makefile b/tensorflow/contrib/makefile/Makefile
index 81327407d44b4317b7aecb964a689a35aa35c163..05e8d9064bea748c935859f5f9b4c7e646f504cf 100644
--- a/tensorflow/contrib/makefile/Makefile
+++ b/tensorflow/contrib/makefile/Makefile
@@ -677,6 +677,7 @@ endif  # TEGRA
 TF_CC_SRCS := $(filter-out $(CORE_CC_EXCLUDE_SRCS), $(CORE_CC_ALL_SRCS))
 # Add in any extra files that don't fit the patterns easily
 TF_CC_SRCS += tensorflow/contrib/makefile/downloads/fft2d/fftsg.c
+TF_CC_SRCS += tensorflow/core/common_runtime/gpu/gpu_id_manager.cc
 # Also include the op and kernel definitions.
 TF_CC_SRCS += $(shell cat $(MAKEFILE_DIR)/tf_op_files.txt)
 PBT_CC_SRCS := $(shell cat $(MAKEFILE_DIR)/tf_pb_text_files.txt)
diff --git a/tensorflow/contrib/makefile/README.md b/tensorflow/contrib/makefile/README.md
index 995230dfa848532dc2a50b85f58d19ba264f293e..6c3b02e12b3082be8bfcc316c4c6122931eb5f76 100644
--- a/tensorflow/contrib/makefile/README.md
+++ b/tensorflow/contrib/makefile/README.md
@@ -194,6 +194,8 @@ with:
 srcs = glob(["libs/arm64-v8a/*.so"]),
 ```
 
+If you are building for Android TV (Shield TV devices), replace "portrait" with "landscape" for `android:screenOrientation` in all four activities in `tensorflow/examples/android/AndroidManifest.xml`.
+
 Then run:
 ```bash
 # Create dir for native libs
diff --git a/tensorflow/contrib/makefile/build_all_ios.sh b/tensorflow/contrib/makefile/build_all_ios.sh
index 2d9979183975e6a17527b40ef5ee1795ced44a7b..0a458a27b3ac9b1a24b0f42de2f0166d515e8cd9 100755
--- a/tensorflow/contrib/makefile/build_all_ios.sh
+++ b/tensorflow/contrib/makefile/build_all_ios.sh
@@ -80,10 +80,9 @@ if [[ ! -z "${OPTIMIZE_FOR_GRAPH}" ]]; then
         fi
     else
         echo "${PRNT_SLCTV_BIN} found. Using it"
-        ${PRNT_SLCTV_BIN} --graphs=${OPTIMIZE_FOR_GRAPH} > ${TOP_SRCDIR}/tensorflow/core/framework/ops_to_register.h
-
     fi
 
+    ${PRNT_SLCTV_BIN} --graphs=${OPTIMIZE_FOR_GRAPH} > ${TOP_SRCDIR}/tensorflow/core/framework/ops_to_register.h
 fi
 
 if [[ "${ONLY_MAKE_TENSORFLOW}" != "true" ]]; then
@@ -111,7 +110,7 @@ if [[ -z "${BUILD_ARCH}" ]]; then
     TARGET_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh -t ios`
 else
     # arch specified so build just that
-    TARGET_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh -t ios -a ${BUILD_ARCH}`
+    TARGET_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh -t ios -a "${BUILD_ARCH}"`
 fi
 export HOST_NSYNC_LIB TARGET_NSYNC_LIB
 
diff --git a/tensorflow/contrib/makefile/compile_nsync.sh b/tensorflow/contrib/makefile/compile_nsync.sh
index 7927997678f077a716d81749561068f259d9744f..e8c6edd7ba9aa6a45d956d1d5655b2809d8d2309 100755
--- a/tensorflow/contrib/makefile/compile_nsync.sh
+++ b/tensorflow/contrib/makefile/compile_nsync.sh
@@ -109,17 +109,18 @@ for arch in $archs; do
         linux)  makefile='
                         CC=${CC_PREFIX} g++
                         PLATFORM_CPPFLAGS=-DNSYNC_USE_CPP11_TIMEPOINT -DNSYNC_ATOMIC_CPP11 \
+                                          -I../../platform/c++11.futex \
                                           -I../../platform/c++11 -I../../platform/gcc \
                                           -I../../platform/posix -pthread
                         PLATFORM_CFLAGS=-std=c++11 -Werror -Wall -Wextra -pedantic
                         PLATFORM_LDFLAGS=-pthread
                         MKDEP=${CC} -M -std=c++11
-                        PLATFORM_C=../../platform/c++11/src/nsync_semaphore_mutex.cc \
+                        PLATFORM_C=../../platform/linux/src/nsync_semaphore_futex.c \
                                    ../../platform/c++11/src/per_thread_waiter.cc \
                                    ../../platform/c++11/src/yield.cc \
                                    ../../platform/c++11/src/time_rep_timespec.cc \
                                    ../../platform/c++11/src/nsync_panic.cc
-                        PLATFORM_OBJS=nsync_semaphore_mutex.o per_thread_waiter.o yield.o \
+                        PLATFORM_OBJS=nsync_semaphore_futex.o per_thread_waiter.o yield.o \
                                       time_rep_timespec.o nsync_panic.o
                         TEST_PLATFORM_C=../../platform/c++11/src/start_thread.cc
                         TEST_PLATFORM_OBJS=start_thread.o
diff --git a/tensorflow/contrib/makefile/download_dependencies.sh b/tensorflow/contrib/makefile/download_dependencies.sh
index 4ae18b2cef28335a90bbc967529c0cf76b0a5da2..8b415e6527f85a5a7844b9d4156fd39ecb1b637a 100755
--- a/tensorflow/contrib/makefile/download_dependencies.sh
+++ b/tensorflow/contrib/makefile/download_dependencies.sh
@@ -34,7 +34,7 @@ PROTOBUF_URL="$(grep -o 'https://mirror.bazel.build/github.com/google/protobuf/.
 RE2_URL="$(grep -o 'https://mirror.bazel.build/github.com/google/re2/.*tar\.gz' "${BZL_FILE_PATH}" | head -n1)"
 FFT2D_URL="$(grep -o 'http.*fft\.tgz' "${BZL_FILE_PATH}" | grep -v mirror.bazel | head -n1)"
 ABSL_URL="$(grep -o 'https://github.com/abseil/abseil-cpp/.*tar.gz' "${BZL_FILE_PATH}" | head -n1)"
-CUB_URL="$(grep -o 'https.*cub/archive.*zip' "${BZL_FILE_PATH}" | grep -v bazel-mirror | head -n1)"
+CUB_URL="$(grep -o 'https.*cub/archive.*zip' "${BZL_FILE_PATH}" | grep -v mirror.bazel | head -n1)"
 
 # TODO(petewarden): Some new code in Eigen triggers a clang bug with iOS arm64,
 #                   so work around it by patching the source.
diff --git a/tensorflow/contrib/makefile/proto_text_cc_files.txt b/tensorflow/contrib/makefile/proto_text_cc_files.txt
index d56e388477db6239cfb577f7e2754321ff33bd82..77c936d8c5b99033ff5c5e149a6ce6613b603132 100644
--- a/tensorflow/contrib/makefile/proto_text_cc_files.txt
+++ b/tensorflow/contrib/makefile/proto_text_cc_files.txt
@@ -17,6 +17,7 @@ tensorflow/core/platform/env_time.cc
 tensorflow/core/platform/setround.cc
 tensorflow/core/platform/denormal.cc
 tensorflow/core/platform/default/tracing.cc
+tensorflow/core/platform/default/mutex.cc
 tensorflow/core/platform/default/logging.cc
 tensorflow/core/platform/cpu_info.cc
 tensorflow/core/lib/wav/wav_io.cc
diff --git a/tensorflow/contrib/metrics/python/ops/metric_ops.py b/tensorflow/contrib/metrics/python/ops/metric_ops.py
index 31e274c5fd7c670458b1b40a4f58c668a23776c7..81f05e7ce587ed1da67a17efbbeb809dbe7fc0b3 100644
--- a/tensorflow/contrib/metrics/python/ops/metric_ops.py
+++ b/tensorflow/contrib/metrics/python/ops/metric_ops.py
@@ -1263,7 +1263,7 @@ def _compute_placement_auc(labels, predictions, weights, alpha,
   weights_for_true = ordered_weights * float_labels_for_true
   weights_for_false = ordered_weights * float_labels_for_false
 
- # For each set of weights with the same segmented indices, we add up the
+  # For each set of weights with the same segmented indices, we add up the
   # weight values. Note that for each label, we deliberately rely on weights
   # for the opposite label.
   weight_totals_for_true = math_ops.segment_sum(weights_for_false,
@@ -3646,8 +3646,8 @@ def cohen_kappa(labels,
       `updates_collections` are not a list or tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
-    raise RuntimeError('tf.contrib.metrics.cohen_kappa is not supported'
+  if context.executing_eagerly():
+    raise RuntimeError('tf.contrib.metrics.cohen_kappa is not supported '
                        'when eager execution is enabled.')
   if num_classes < 2:
     raise ValueError('`num_classes` must be >= 2.'
diff --git a/tensorflow/contrib/metrics/python/ops/metric_ops_test.py b/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
index b387f26c0195432fb972dac450d2919bdaa702a1..33eb655fb660f0ecdfe1c5ab870d7f17690ae3ff 100644
--- a/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
+++ b/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
@@ -1802,9 +1802,9 @@ class StreamingAUCTest(test.TestCase):
       auc, update_op = metrics.streaming_auc(predictions, labels, curve='PR')
 
       sess.run(variables.local_variables_initializer())
-      self.assertAlmostEqual(0.54166603, sess.run(update_op), delta=1e-3)
+      self.assertAlmostEqual(0.79166, sess.run(update_op), delta=1e-3)
 
-      self.assertAlmostEqual(0.54166603, auc.eval(), delta=1e-3)
+      self.assertAlmostEqual(0.79166, auc.eval(), delta=1e-3)
 
   def testAnotherAUCPRSpecialCase(self):
     with self.test_session() as sess:
@@ -1816,9 +1816,9 @@ class StreamingAUCTest(test.TestCase):
       auc, update_op = metrics.streaming_auc(predictions, labels, curve='PR')
 
       sess.run(variables.local_variables_initializer())
-      self.assertAlmostEqual(0.44365042, sess.run(update_op), delta=1e-3)
+      self.assertAlmostEqual(0.610317, sess.run(update_op), delta=1e-3)
 
-      self.assertAlmostEqual(0.44365042, auc.eval(), delta=1e-3)
+      self.assertAlmostEqual(0.610317, auc.eval(), delta=1e-3)
 
   def testThirdAUCPRSpecialCase(self):
     with self.test_session() as sess:
@@ -1830,9 +1830,9 @@ class StreamingAUCTest(test.TestCase):
       auc, update_op = metrics.streaming_auc(predictions, labels, curve='PR')
 
       sess.run(variables.local_variables_initializer())
-      self.assertAlmostEqual(0.73611039, sess.run(update_op), delta=1e-3)
+      self.assertAlmostEqual(0.90277, sess.run(update_op), delta=1e-3)
 
-      self.assertAlmostEqual(0.73611039, auc.eval(), delta=1e-3)
+      self.assertAlmostEqual(0.90277, auc.eval(), delta=1e-3)
 
   def testAllIncorrect(self):
     inputs = np.random.randint(0, 2, size=(100, 1))
@@ -1865,9 +1865,9 @@ class StreamingAUCTest(test.TestCase):
       auc, update_op = metrics.streaming_auc(predictions, labels, curve='PR')
 
       sess.run(variables.local_variables_initializer())
-      self.assertAlmostEqual(0.49999976, sess.run(update_op), 6)
+      self.assertAlmostEqual(1, sess.run(update_op), 6)
 
-      self.assertAlmostEqual(0.49999976, auc.eval(), 6)
+      self.assertAlmostEqual(1, auc.eval(), 6)
 
   def testWithMultipleUpdates(self):
     num_samples = 1000
@@ -6888,8 +6888,7 @@ class CohenKappaTest(test.TestCase):
     # [[0, 25, 0],
     #  [0, 0, 25],
     #  [25, 0, 0]]
-    # Calculated by v0.19: sklearn.metrics.cohen_kappa_score(
-    #                          labels, predictions)
+    # Calculated by v0.19: sklearn.metrics.cohen_kappa_score(labels, predictions)
     expect = -0.333333333333
 
     with self.test_session() as sess:
@@ -6948,8 +6947,7 @@ class CohenKappaTest(test.TestCase):
                 weights_t: weights[batch_start:batch_end]
             })
       # Calculated by v0.19: sklearn.metrics.cohen_kappa_score(
-      #                          labels_np, predictions_np,
-      #                          sample_weight=weights_np)
+      #                          labels_np, predictions_np, sample_weight=weights_np)
       expect = 0.289965397924
       self.assertAlmostEqual(expect, kappa.eval(), 5)
 
diff --git a/tensorflow/contrib/model_pruning/README.md b/tensorflow/contrib/model_pruning/README.md
index d286750c257e9a78a82c95c1fc872b3ca6972203..52b659c69fdfc507e6259e928d79c65471f2f025 100644
--- a/tensorflow/contrib/model_pruning/README.md
+++ b/tensorflow/contrib/model_pruning/README.md
@@ -134,7 +134,7 @@ $ bazel-bin/$examples_dir/cifar10/cifar10_eval --run_once
 
 ### Block Sparsity
 
-For some hardware architectures, it may be beneficial to induce spatially correlated sparsity. To train models in which the weight tensors have block sparse structure, set *block_height* and *block_width* hyperparameters to the desired block configuration (2x2, 4x4, 4x1, 1x8, etc). Currently, block sparsity is supported for weight tensors with rank 2 only. The matrix is partitioned into non-overlapping blocks of size *[block_height, block_dim]* and the either the average or max absolute value in this block is taken as a proxy for the entire block (set by *block_pooling_function* hyperparameter).
+For some hardware architectures, it may be beneficial to induce spatially correlated sparsity. To train models in which the weight tensors have block sparse structure, set the *block_height* and *block_width* hyperparameters to the desired block configuration (2x2, 4x4, 4x1, 1x8, etc). Currently, block sparsity is supported only for weight tensors that can be squeezed to rank 2. The matrix is partitioned into non-overlapping blocks of size *[block_height, block_width]*, and either the average or the max absolute value in each block is taken as a proxy for the entire block (set by the *block_pooling_function* hyperparameter).
 The convolution layer tensors are always pruned used block dimensions of [1,1].
 
 ## References
diff --git a/tensorflow/contrib/model_pruning/python/layers/layers.py b/tensorflow/contrib/model_pruning/python/layers/layers.py
index 988748ad75bdf72f1da3f4e1c6e85aabb04a5954..466daf204a1ae86a7f37107342046305ea7249fc 100644
--- a/tensorflow/contrib/model_pruning/python/layers/layers.py
+++ b/tensorflow/contrib/model_pruning/python/layers/layers.py
@@ -214,7 +214,7 @@ def masked_convolution(inputs,
     elif data_format == 'NCHW':
       df = 'channels_first'
     else:
-      raise ValueError('Unsupported data fromat', data_format)
+      raise ValueError('Unsupported data format', data_format)
 
     layer = layer_class(
         filters=num_outputs,
diff --git a/tensorflow/contrib/model_pruning/python/pruning.py b/tensorflow/contrib/model_pruning/python/pruning.py
index d16af9da19816211ee22f6ea48a347f0b9a4e612..5146a4a2de7806041991c04958de378b2d3dc810 100644
--- a/tensorflow/contrib/model_pruning/python/pruning.py
+++ b/tensorflow/contrib/model_pruning/python/pruning.py
@@ -216,7 +216,7 @@ def _partitioned_variable_assign(partitioned_var, new_value):
   """Assign op for partitioned variables.
 
   Args:
-    partitioned_var: A partitioned tensotflow variable
+    partitioned_var: A partitioned tensorflow variable
     new_value: Value to be assigned to the variable var
 
   Returns:
@@ -523,7 +523,8 @@ class Pruning(object):
     """Performs block-granular masking of the weights.
 
     Block pruning occurs only if the block_height or block_width is > 1 and
-    if the weight tensor has ndims = 2. Otherwise, elementwise pruning occurs.
+    if the weight tensor, when squeezed, has ndims = 2. Otherwise, elementwise
+    pruning occurs.
     Args:
       weights: The weight tensor that needs to be masked.
       threshold: The current threshold value. The function will compute a new
@@ -540,7 +541,8 @@ class Pruning(object):
     Raises:
       ValueError: if block pooling function is not AVG or MAX
     """
-    if weights.get_shape().ndims != 2 or self._block_dim == [1, 1]:
+    squeezed_weights = array_ops.squeeze(weights)
+    if squeezed_weights.get_shape().ndims != 2 or self._block_dim == [1, 1]:
       return self._update_mask(weights, threshold)
 
     if self._block_pooling_function not in ['AVG', 'MAX']:
@@ -549,9 +551,11 @@ class Pruning(object):
 
     with ops.name_scope(weights.op.name + '_pruning_ops'):
       abs_weights = math_ops.abs(
-          array_ops.reshape(
-              weights, [1, weights.get_shape()[0],
-                        weights.get_shape()[1], 1]))
+          array_ops.reshape(weights, [
+              1,
+              squeezed_weights.get_shape()[0],
+              squeezed_weights.get_shape()[1], 1
+          ]))
       pool_window = [self._block_dim[0], self._block_dim[1]]
       pooled_weights = nn_ops.pool(
           abs_weights,
@@ -572,9 +576,10 @@ class Pruning(object):
                                         array_ops.ones(self._block_dim))
       sliced_mask = array_ops.slice(
           updated_mask, [0, 0],
-          [weights.get_shape()[0],
-           weights.get_shape()[1]])
-    return smoothed_threshold, sliced_mask
+          [squeezed_weights.get_shape()[0],
+           squeezed_weights.get_shape()[1]])
+    return smoothed_threshold, array_ops.reshape(sliced_mask,
+                                                 array_ops.shape(weights))
 
   def _get_mask_assign_ops(self):
     # Make sure the assignment ops have not already been added to the list
diff --git a/tensorflow/contrib/model_pruning/python/pruning_test.py b/tensorflow/contrib/model_pruning/python/pruning_test.py
index 1767b4bb94a9bb56bc6a4933423ad27d8cf3ed35..89e65713197afc6ed37346cb67a6e9be3fa9290f 100644
--- a/tensorflow/contrib/model_pruning/python/pruning_test.py
+++ b/tensorflow/contrib/model_pruning/python/pruning_test.py
@@ -140,6 +140,23 @@ class PruningTest(test.TestCase):
          [0.0, -0.3, 0.0, -0.4]])
     expected_mask = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
 
+    self._blockMasking(param_list + ["block_pooling_function=MAX"], weights_max,
+                       expected_mask)
+    self._blockMasking(param_list + ["block_pooling_function=AVG"], weights_avg,
+                       expected_mask)
+
+  def testBlockMaskingWithHigherDimensions(self):
+    param_list = ["block_height=2", "block_width=2", "threshold_decay=0"]
+
+    # Weights as in testBlockMasking, but with one extra dimension.
+    weights_avg = constant_op.constant(
+        [[[0.1, 0.1, 0.2, 0.2], [0.1, 0.1, 0.2, 0.2], [0.3, 0.3, 0.4, 0.4],
+          [0.3, 0.3, 0.4, 0.4]]])
+    weights_max = constant_op.constant(
+        [[[0.1, 0.0, 0.2, 0.0], [0.0, -0.1, 0.0, -0.2], [0.3, 0.0, 0.4, 0.0],
+          [0.0, -0.3, 0.0, -0.4]]])
+    expected_mask = [[[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]]
+
     self._blockMasking(param_list + ["block_pooling_function=MAX"], weights_max,
                        expected_mask)
     self._blockMasking(param_list + ["block_pooling_function=AVG"],
diff --git a/tensorflow/contrib/mpi/mpi_utils.h b/tensorflow/contrib/mpi/mpi_utils.h
index fa297c28cb47d43ba927ab941854bd472d90b465..df055ff56731140b3bd09704c70e65f81362f763 100644
--- a/tensorflow/contrib/mpi/mpi_utils.h
+++ b/tensorflow/contrib/mpi/mpi_utils.h
@@ -24,6 +24,8 @@ limitations under the License.
 
 #include "tensorflow/core/lib/strings/str_util.h"
 
+// Skip MPI C++ bindings support; this matches the usage in other places.
+#define OMPI_SKIP_MPICXX
 #include "third_party/mpi/mpi.h"
 #define MPI_CHECK(cmd)                                                \
   do {                                                                \
diff --git a/tensorflow/contrib/nccl/BUILD b/tensorflow/contrib/nccl/BUILD
index 5ac96007df7ee08b1e32aacd28f83768859810a9..94d01efee1546feca89a7e88acedf915b1dfb3a4 100644
--- a/tensorflow/contrib/nccl/BUILD
+++ b/tensorflow/contrib/nccl/BUILD
@@ -52,6 +52,7 @@ tf_cuda_cc_test(
         "manual",
         "multi_gpu",
         "no_oss",
+        "noguitar",
         "notap",
     ],
     deps =
@@ -136,6 +137,7 @@ cuda_py_test(
         "manual",
         "multi_gpu",
         "no_oss",
+        "noguitar",
         "notap",
     ],
 )
diff --git a/tensorflow/contrib/nccl/python/ops/nccl_ops.py b/tensorflow/contrib/nccl/python/ops/nccl_ops.py
index 8dc038b9ac992de7db8b762e3697c6693099e192..794372a1f4b0dcc41bcf0da611f5bc2ec9301973 100644
--- a/tensorflow/contrib/nccl/python/ops/nccl_ops.py
+++ b/tensorflow/contrib/nccl/python/ops/nccl_ops.py
@@ -267,5 +267,5 @@ def _check_device(tensor, expected=None):
 
 
 def _check_graph_mode():
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise ValueError('Nccl ops are not supported in eager mode')
diff --git a/tensorflow/contrib/nn/python/ops/scaled_softplus.py b/tensorflow/contrib/nn/python/ops/scaled_softplus.py
index fcbfbc239ca5b8a1d4b17b403f99b7eb05db47b0..7184ef2b66ec4662af3a37def070ab151d6e7c15 100644
--- a/tensorflow/contrib/nn/python/ops/scaled_softplus.py
+++ b/tensorflow/contrib/nn/python/ops/scaled_softplus.py
@@ -30,9 +30,7 @@ def _reduce_and_reshape_grad(g, t):
   """Returns the gradient, sum-reduced and reshaped to `t`'s shape."""
   shape = array_ops.shape(t)
   g_shape = array_ops.shape(g)
-  # pylint: disable=protected-access
-  bcast_dims, _ = gen_array_ops._broadcast_gradient_args(shape, g_shape)
-  # pylint: enable=protected-access
+  bcast_dims, _ = gen_array_ops.broadcast_gradient_args(shape, g_shape)
   return array_ops.reshape(math_ops.reduce_sum(g, bcast_dims), shape)
 
 
diff --git a/tensorflow/contrib/opt/BUILD b/tensorflow/contrib/opt/BUILD
index 827279bd476f9666a972f43ad557fde6d0b6c59a..bacf15bbd6140caf647552f0dca02209634ae56b 100644
--- a/tensorflow/contrib/opt/BUILD
+++ b/tensorflow/contrib/opt/BUILD
@@ -52,6 +52,9 @@ py_test(
     name = "external_optimizer_test",
     srcs = ["python/training/external_optimizer_test.py"],
     srcs_version = "PY2AND3",
+    tags = [
+        "no-internal-py3",
+    ],
     deps = [
         ":opt_py",
         "//tensorflow/python:array_ops",
diff --git a/tensorflow/contrib/opt/python/training/addsign_test.py b/tensorflow/contrib/opt/python/training/addsign_test.py
index bd19ee3e7ac514448c6d79272abb86a154f55e9a..08d45ed73f3ae4b580d7078272e79fef22ef67c5 100644
--- a/tensorflow/contrib/opt/python/training/addsign_test.py
+++ b/tensorflow/contrib/opt/python/training/addsign_test.py
@@ -97,7 +97,7 @@ class AddSignTest(test.TestCase):
                                      global_step=global_step)
         neg_update = opt.apply_gradients(zip([-grads0, -grads1], [var0, var1]),
                                          global_step=global_step)
-        if context.in_graph_mode():
+        if not context.executing_eagerly():
           self.evaluate(variables.global_variables_initializer())
           # Fetch params to validate initial values
           self.assertAllClose([1.0, 2.0], self.evaluate(var0))
@@ -108,13 +108,13 @@ class AddSignTest(test.TestCase):
         # last 3 steps with negative gradient (sign(gm) should be -1)
         for t in range(1, 8):
           if t < 5:
-            if context.in_graph_mode():
+            if not context.executing_eagerly():
               self.evaluate(update)
             elif t > 1:
               opt.apply_gradients(zip([grads0, grads1], [var0, var1]),
                                   global_step=global_step)
           else:
-            if context.in_graph_mode():
+            if not context.executing_eagerly():
               self.evaluate(neg_update)
             elif t > 1:
               opt.apply_gradients(zip([-grads0, -grads1], [var0, var1]),
diff --git a/tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper.py b/tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper.py
index cb6c77a86feedde3285d75092511c8eb1e63b2a5..9076cc9d128552e37c09852ab2f24aa0c9977892 100644
--- a/tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper.py
+++ b/tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper.py
@@ -22,6 +22,7 @@ import types
 import six
 
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import clip_ops
 from tensorflow.python.ops import control_flow_ops
@@ -40,8 +41,10 @@ def _get_wrapper(fn, opt):
 
   def wrapper(self, grad, *args, **kwargs):  # pylint: disable=unused-argument
     all_zeros = _is_all_zeros(grad)
-    return control_flow_ops.cond(all_zeros, control_flow_ops.no_op,
-                                 lambda: fn(grad, *args, **kwargs))
+    def call_fn():
+      with ops.control_dependencies([fn(grad, *args, **kwargs)]):
+        return control_flow_ops.no_op()
+    return control_flow_ops.cond(all_zeros, control_flow_ops.no_op, call_fn)
 
   wrapper = types.MethodType(wrapper, opt)
   return wrapper
diff --git a/tensorflow/contrib/opt/python/training/powersign_test.py b/tensorflow/contrib/opt/python/training/powersign_test.py
index ff7b1a72d47d8ef54980905323bcaf358c988a82..5214082dd66f00eadadad71d50f7e00b178b8c10 100644
--- a/tensorflow/contrib/opt/python/training/powersign_test.py
+++ b/tensorflow/contrib/opt/python/training/powersign_test.py
@@ -99,7 +99,7 @@ class PowerSignTest(test.TestCase):
         neg_update = opt.apply_gradients(zip([-grads0, -grads1], [var0, var1]),
                                          global_step=global_step)
 
-        if context.in_graph_mode():
+        if not context.executing_eagerly():
           self.evaluate(variables.global_variables_initializer())
           # Fetch params to validate initial values
           self.assertAllClose([1.0, 2.0], self.evaluate(var0))
@@ -110,13 +110,13 @@ class PowerSignTest(test.TestCase):
         # last 3 steps with negative gradient (sign(gm) should be -1)
         for t in range(1, 8):
           if t < 5:
-            if context.in_graph_mode():
+            if not context.executing_eagerly():
               self.evaluate(update)
             elif t > 1:
               opt.apply_gradients(zip([grads0, grads1], [var0, var1]),
                                   global_step=global_step)
           else:
-            if context.in_graph_mode():
+            if not context.executing_eagerly():
               self.evaluate(neg_update)
             elif t > 1:
               opt.apply_gradients(zip([-grads0, -grads1], [var0, var1]),
diff --git a/tensorflow/contrib/predictor/predictor_factories.py b/tensorflow/contrib/predictor/predictor_factories.py
index 04b5d5bdf158dc6a478d7a24b538c75d1dca8d45..6e77e934fe19851eea9ed0b74eb7aecc76f6237a 100644
--- a/tensorflow/contrib/predictor/predictor_factories.py
+++ b/tensorflow/contrib/predictor/predictor_factories.py
@@ -53,7 +53,7 @@ def from_contrib_estimator(estimator,
       `Estimator`.
   """
   if isinstance(estimator, core_estimator.Estimator):
-    raise TypeError('Espected estimator to be of type '
+    raise TypeError('Expected estimator to be of type '
                     'tf.contrib.learn.Estimator, but got type '
                     'tf.python.estimator.Estimator. You likely want to call '
                     'from_estimator.')
@@ -88,7 +88,7 @@ def from_estimator(estimator,
       `Estimator`.
   """
   if isinstance(estimator, contrib_estimator.Estimator):
-    raise TypeError('Espected estimator to be of type '
+    raise TypeError('Expected estimator to be of type '
                     'tf.python.estimator.Estimator, but got type '
                     'tf.contrib.learn.Estimator. You likely want to call '
                     'from_contrib_estimator.')
diff --git a/tensorflow/contrib/py2tf/__init__.py b/tensorflow/contrib/py2tf/__init__.py
index 379fa7fd5c2a22b5b16a21cca8c2ea8afdcaeefa..a4b62a0976f9e34f100be27d4fd9dc13bc0a3b3d 100644
--- a/tensorflow/contrib/py2tf/__init__.py
+++ b/tensorflow/contrib/py2tf/__init__.py
@@ -23,14 +23,17 @@ from __future__ import print_function
 
 from tensorflow.contrib.py2tf import utils
 from tensorflow.contrib.py2tf.impl.api import convert
-from tensorflow.contrib.py2tf.impl.api import graph_ready
+from tensorflow.contrib.py2tf.impl.api import converted_call
+from tensorflow.contrib.py2tf.impl.api import do_not_convert
+from tensorflow.contrib.py2tf.impl.api import RunMode
 from tensorflow.contrib.py2tf.impl.api import to_code
 from tensorflow.contrib.py2tf.impl.api import to_graph
 from tensorflow.contrib.py2tf.pyct.transformer import PyFlowParseError
 from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = [
-    'to_graph', 'to_code', 'convert', 'graph_ready', 'utils', 'PyFlowParseError'
+    'utils', 'convert', 'converted_call', 'do_not_convert', 'RunMode',
+    'to_code', 'to_graph', 'PyFlowParseError'
 ]
 
 remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/py2tf/converters/BUILD b/tensorflow/contrib/py2tf/converters/BUILD
index 42baaaaba72c6c3dbc896e87b6e0f3c62b7f06fc..f624c4268639649e13a570f1c1b3e8ec3bedb4f4 100644
--- a/tensorflow/contrib/py2tf/converters/BUILD
+++ b/tensorflow/contrib/py2tf/converters/BUILD
@@ -25,10 +25,13 @@ py_library(
         "control_flow.py",
         "decorators.py",
         "for_loops.py",
+        "ifexp.py",
         "list_comprehension.py",
+        "lists.py",
         "logical_expressions.py",
         "name_scopes.py",
         "side_effect_guards.py",
+        "single_return.py",
     ],
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:__subpackages__"],
@@ -46,6 +49,7 @@ py_library(
     visibility = ["//tensorflow:__subpackages__"],
     deps = [
         ":converters",
+        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/contrib/py2tf/pyct/static_analysis",
         "//tensorflow/contrib/py2tf/utils",
         "@gast_archive//:gast",
@@ -59,7 +63,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -70,7 +73,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -81,7 +83,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -92,7 +93,7 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
+        "//tensorflow/contrib/py2tf/impl",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -103,7 +104,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -114,7 +114,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -125,7 +124,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -136,7 +134,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -157,7 +154,16 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
+py_test(
+    name = "lists_test",
+    srcs = ["lists_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":test_lib",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -168,7 +174,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -182,6 +187,27 @@ py_test(
         "flaky",
         "notap",
     ],
+    deps = [
+        ":test_lib",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
+py_test(
+    name = "single_return_test",
+    srcs = ["single_return_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":test_lib",
+        "//tensorflow/contrib/py2tf/pyct",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
+py_test(
+    name = "ifexp_test",
+    srcs = ["ifexp_test.py"],
+    srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
         "//tensorflow/contrib/py2tf/pyct",
diff --git a/tensorflow/contrib/py2tf/converters/builtin_functions.py b/tensorflow/contrib/py2tf/converters/builtin_functions.py
index e69038acedd6b7a251c3328a14f36ed107bde746..f1129ef153e6be6cbcbbf4bab63c4fe32ec77147 100644
--- a/tensorflow/contrib/py2tf/converters/builtin_functions.py
+++ b/tensorflow/contrib/py2tf/converters/builtin_functions.py
@@ -36,23 +36,24 @@ class BuiltinFunctionTransformer(transformer.Base):
 
   # pylint:disable=invalid-name
 
-  def _convert_len(self, node):
+  def _convert_builtin(self, node):
     template = """
-      py2tf_utils.dynamic_len(args)
+      py2tf_utils.dynamic_builtin(func, args)
     """
-    return templates.replace(template, args=node.args)[0].value
+    return templates.replace(template, func=node.func, args=node.args)[0].value
 
   def _convert_print(self, node):
     template = """
-      py2tf_utils.call_print(args)
+      py2tf_utils.dynamic_print(args)
     """
     return templates.replace(template, args=node.args)[0].value
 
   def visit_Call(self, node):
     self.generic_visit(node)
     # TODO(mdan): This won't work if the function was hidden.
-    if isinstance(node.func, gast.Name) and node.func.id == 'len':
-      return self._convert_len(node)
+    if isinstance(node.func, gast.Name) and node.func.id in ('len', 'range'):
+      return self._convert_builtin(node)
+    # Print needs to be handled separately because it can be read as a statement.
     if isinstance(node.func, gast.Name) and node.func.id == 'print':
       return self._convert_print(node)
     return node
diff --git a/tensorflow/contrib/py2tf/converters/call_trees.py b/tensorflow/contrib/py2tf/converters/call_trees.py
index 1050ba654c63bb52c1c5e71c981a6a0baa3fc987..e3040f09e4066bc9b592373df0887f1df7e3473d 100644
--- a/tensorflow/contrib/py2tf/converters/call_trees.py
+++ b/tensorflow/contrib/py2tf/converters/call_trees.py
@@ -22,17 +22,30 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from collections import namedtuple
 import types
 
 import gast
 
 from tensorflow.contrib.py2tf.pyct import anno
+from tensorflow.contrib.py2tf.pyct import ast_util
+from tensorflow.contrib.py2tf.pyct import inspect_utils
 from tensorflow.contrib.py2tf.pyct import parser
 from tensorflow.contrib.py2tf.pyct import templates
 from tensorflow.contrib.py2tf.pyct import transformer
 from tensorflow.python.util import tf_inspect
 
 
+class FunctionInfo(namedtuple('FunctionInfo', ('dtype',))):
+  pass
+
+
+# TODO(mdan): Move this to config.py.
+KNOWN_NUMPY_FUNCTIONS = {
+    ('numpy', 'random', 'binomial'): FunctionInfo(dtype='tf.int64'),
+}
+
+
 class FunctionNamer(object):
   """Describes the interface for CallTreeTransformer's namer."""
 
@@ -72,9 +85,8 @@ class CallTreeTransformer(transformer.Base):
     self.uncompiled_modules = uncompiled_modules
     self.nocompile_decorators = nocompile_decorators
 
-  # pylint:disable=invalid-name
-
   def _resolve_name(self, node):
+    """Used to resolve decorator info."""
     if isinstance(node, gast.Call):
       return self._resolve_name(node.func)
     if isinstance(node, gast.Name):
@@ -99,7 +111,13 @@ class CallTreeTransformer(transformer.Base):
                          (owner_type, node.attr))
     return None
 
+  def _function_is_compilable(self, target_entity):
+    """Determines whether an entity can be compiled at all."""
+    # TODO(mdan): This is just a placeholder. Implement.
+    return not isinstance(target_entity, types.BuiltinFunctionType)
+
   def _should_compile(self, node, fqn):
+    """Determines whether an entity should be compiled in the context."""
     for i in range(1, len(fqn)):
       if fqn[:i] in self.uncompiled_modules:
         return False
@@ -141,33 +159,6 @@ class CallTreeTransformer(transformer.Base):
 
     return True
 
-  def _determine_function_owner(self, m):
-    # TODO(mdan): The parent type should be known at analysis. Use that instead.
-    if hasattr(m, 'im_class'):  # Python 2
-      return m.im_class
-    if hasattr(m, '__qualname__'):  # Python 3
-      # Object attributes: should be bound to "self".
-      if hasattr(m, '__self__'):
-        return type(m.__self__)
-
-      # Class attributes: should have the owner name in their namespace.
-      qn = m.__qualname__.split('.')
-      if len(qn) < 2:
-        return None
-      owner_name, func_name = qn[-2:]
-      if func_name != m.__name__:
-        raise ValueError('Inconsistent names detected '
-                         '(__qualname__[1] = "%s", __name__ = "%s") for %s.' %
-                         (func_name, m.__name__, m))
-      if owner_name == '<locals>':
-        return None
-      if owner_name not in self.context.namespace:
-        raise ValueError(
-            'Could not resolve name "%s" while analyzing %s. Namespace:\n%s' %
-            (owner_name, m, self.context.namespace))
-      return self.context.namespace[owner_name]
-    return None
-
   def _rename_compilable_function(self, node):
     assert anno.hasanno(node.func, 'live_val')
     assert anno.hasanno(node.func, 'fqn')
@@ -182,7 +173,11 @@ class CallTreeTransformer(transformer.Base):
           target_fqn, live_entity=target_entity)
       do_rename = True
     else:
-      owner_type = self._determine_function_owner(target_entity)
+      if anno.hasanno(node.func, 'parent_type'):
+        owner_type = anno.getanno(node.func, 'parent_type')
+      else:
+        # Fallback - not reliable.
+        owner_type = inspect_utils.getmethodclass(target_entity)
       new_name, do_rename = self.context.namer.compiled_function_name(
           target_fqn, live_entity=target_entity, owner_type=owner_type)
 
@@ -196,15 +191,57 @@ class CallTreeTransformer(transformer.Base):
     return node
 
   def _wrap_to_py_func_no_return(self, node):
-    # TODO(mdan): Properly handle varargs, kwargs, etc.
+    # TODO(mdan): Properly handle varargs, etc.
+    template = """
+      py2tf_utils.wrap_py_func(func, None, (args,), kwargs, True)
+    """
+    return templates.replace(
+        template,
+        func=node.func,
+        args=node.args,
+        kwargs=ast_util.keywords_to_dict(node.keywords))
+
+  def _wrap_to_py_func_single_return(self, node, dtype):
+    # TODO(mdan): Properly handle varargs, etc.
+    template = """
+      py2tf_utils.wrap_py_func(func, dtype, (args,), kwargs, False)
+    """
+    return templates.replace_as_expression(
+        template,
+        func=node.func,
+        dtype=parser.parse_expression(dtype),
+        args=node.args,
+        kwargs=ast_util.keywords_to_dict(node.keywords))
+
+  def _insert_dynamic_conversion(self, node):
+    """Inlines a dynamic conversion for a dynamic function."""
+    # TODO(mdan): Pass information on the statically compiled functions.
+    # Having access to the statically compiled functions can help avoid
+    # unnecessary compilation.
+    # For example, this would lead to function `a` being compiled twice:
+    #
+    #   def a():
+    #     v = b
+    #     b()
+    #   def b():
+    #     a()
+    #
+    # This is really a problem with recursive calls, which currently can
+    # only be gated by a static condition, and should be rare.
+    # TODO(mdan): It probably makes sense to use dynamic conversion every time.
+    # Before we could convert all the time though, we'd need a reasonable
+    # caching mechanism.
     template = """
-      py2tf_utils.wrap_py_func(func, None, (original_args,), True)
+      py2tf_api.converted_call(func, True, False, {}, args)
     """
-    return templates.replace(template, func=node.func, original_args=node.args)
+    call_expr = templates.replace(
+        template, func=node.func, args=node.args)
+    new_call = call_expr[0].value
+    # TODO(mdan): Improve the template mechanism to better support this.
+    new_call.keywords = node.keywords
+    return new_call
 
-  def _function_is_compilable(self, target_entity):
-    # TODO(mdan): This is just a placeholder. Implement.
-    return not isinstance(target_entity, types.BuiltinFunctionType)
+  # pylint:disable=invalid-name
 
   def visit_Expr(self, node):
     if isinstance(node.value, gast.Call):
@@ -239,15 +276,24 @@ class CallTreeTransformer(transformer.Base):
     self.generic_visit(node)
     if anno.hasanno(node.func, 'live_val'):
       target_entity = anno.getanno(node.func, 'live_val')
+      if anno.hasanno(node.func, 'fqn'):
+        target_fqn = anno.getanno(node.func, 'fqn')
+      else:
+        target_fqn = None
       if self._function_is_compilable(target_entity):
         node = self._rename_compilable_function(node)
+      elif target_fqn and target_fqn in KNOWN_NUMPY_FUNCTIONS:
+        # TODO(mdan): Should we replace these with equivalent TF ops instead?
+        node = self._wrap_to_py_func_single_return(
+            node, KNOWN_NUMPY_FUNCTIONS[target_fqn].dtype)
       else:
-        raise NotImplementedError('py_func with return values')
+        raise NotImplementedError(
+            'py_func with return values (unknown function)')
     else:
       if self.context.recursive:
-        raise NotImplementedError('Could not resolve target function.')
+        node = self._insert_dynamic_conversion(node)
       else:
-        # TODO(mdan): Double check. Is this reachable code?
+        # Unresolved functions are allowed in non-recursive mode.
         pass
     return node
 
diff --git a/tensorflow/contrib/py2tf/converters/call_trees_test.py b/tensorflow/contrib/py2tf/converters/call_trees_test.py
index 777648dc0b31863227262fbf931aba680bb4ed98..1106432da6f84e2ef6653e016e11a8f255c26af9 100644
--- a/tensorflow/contrib/py2tf/converters/call_trees_test.py
+++ b/tensorflow/contrib/py2tf/converters/call_trees_test.py
@@ -18,9 +18,13 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
 from tensorflow.contrib.py2tf.converters import call_trees
 from tensorflow.contrib.py2tf.converters import converter_test_base
 from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.platform import test
 
@@ -47,6 +51,21 @@ class CallTreesTest(converter_test_base.TestCase):
       result.renamed_test_fn_1 = renamed_test_fn_1
       self.assertEquals(3, result.test_fn_2(1))
 
+  def test_dynamic_function(self):
+
+    def test_fn_1():
+      raise ValueError('This should be masked by the mock.')
+
+    def test_fn_2(f):
+      return f() + 3
+
+    node = self.parse_and_analyze(test_fn_2, {})
+    node = call_trees.transform(node, self.ctx, (), ())
+
+    with self.compiled(node) as result:
+      # 10 = 7 (from the mock) + 3 (from test_fn_2)
+      self.assertEquals(10, result.test_fn_2(test_fn_1))
+
   def test_simple_methods(self):
 
     class TestClass(object):
@@ -59,6 +78,7 @@ class CallTreesTest(converter_test_base.TestCase):
 
     node = self.parse_and_analyze(
         TestClass.test_fn_2, {'TestClass': TestClass},
+        namer=converter_test_base.FakeNoRenameNamer(),
         arg_types={'self': (TestClass.__name__, TestClass)})
     node = call_trees.transform(node, self.ctx, (), ())
 
@@ -89,6 +109,20 @@ class CallTreesTest(converter_test_base.TestCase):
         sess.run(sess.graph.get_operations()[0])
         self.assertEquals('bar', a.foo)
 
+  def test_py_func_wrap_known_function(self):
+
+    def test_fn():
+      return np.random.binomial(2, 0.5)
+
+    node = self.parse_and_analyze(test_fn, {'np': np})
+    node = call_trees.transform(node, self.ctx, (), ())
+
+    with self.compiled(node, dtypes.int64) as result:
+      result.np = np
+      with self.test_session() as sess:
+        self.assertTrue(isinstance(result.test_fn(), ops.Tensor))
+        self.assertIn(sess.run(result.test_fn()), (0, 1, 2))
+
   def test_uncompiled_modules(self):
 
     def test_fn(a):
diff --git a/tensorflow/contrib/py2tf/converters/control_flow.py b/tensorflow/contrib/py2tf/converters/control_flow.py
index d53e3e4fd6d87004cbe55bd430346ad263e898ea..762c26f0c77e13c077761ceec41cb29db9149a35 100644
--- a/tensorflow/contrib/py2tf/converters/control_flow.py
+++ b/tensorflow/contrib/py2tf/converters/control_flow.py
@@ -171,6 +171,14 @@ class ControlFlowTransformer(transformer.Base):
     all_referenced = body_scope.referenced
 
     state = list(body_closure)
+    if not state:
+      # TODO(mdan): Implement this properly.
+      # To complete this statement, we need to check whether any variable
+      # created inside the body scope is used before being modified outside the
+      # scope. This should be done during activity analysis, and in general
+      # should cover the case where variables may not be initialized.
+      raise ValueError('cannot convert while loop: no outputs')
+
     state_ssf = [
         self.context.namer.new_symbol(s.ssf(), all_referenced) for s in state
     ]
diff --git a/tensorflow/contrib/py2tf/converters/converter_test_base.py b/tensorflow/contrib/py2tf/converters/converter_test_base.py
index afa5c2f96fb55302e67e5ecac3532cb87827871a..8c08c5492a4b10d4abb0ec3b19b39d5b17e41a0a 100644
--- a/tensorflow/contrib/py2tf/converters/converter_test_base.py
+++ b/tensorflow/contrib/py2tf/converters/converter_test_base.py
@@ -25,6 +25,7 @@ from tensorflow.contrib.py2tf import utils
 from tensorflow.contrib.py2tf.pyct import compiler
 from tensorflow.contrib.py2tf.pyct import context
 from tensorflow.contrib.py2tf.pyct import parser
+from tensorflow.contrib.py2tf.pyct import pretty_printer
 from tensorflow.contrib.py2tf.pyct import qual_names
 from tensorflow.contrib.py2tf.pyct.static_analysis import activity
 from tensorflow.contrib.py2tf.pyct.static_analysis import live_values
@@ -52,26 +53,49 @@ class FakeNamer(object):
     return ('renamed_%s' % '_'.join(original_fqn)), True
 
 
+class FakeNoRenameNamer(FakeNamer):
+
+  def compiled_function_name(self, original_fqn, **_):
+    return str(original_fqn), False
+
+
 class TestCase(test.TestCase):
   """Base class for unit tests in this module. Contains relevant utilities."""
 
   @contextlib.contextmanager
   def compiled(self, node, *symbols):
-    source = '<compile failed>'
+    source = None
+
+    self.dynamic_calls = []
+    def converted_call(*args):
+      """Mock version of api.converted_call."""
+      self.dynamic_calls.append(args)
+      return 7
+
     try:
       result, source = compiler.ast_to_object(node)
-      result.tf = self.make_fake_tf(*symbols)
+      result.tf = self.make_fake_mod('fake_tf', *symbols)
       result.py2tf_utils = utils
+      result.py2tf_api = self.make_fake_mod('fake_api', converted_call)
       yield result
     except Exception:  # pylint:disable=broad-except
-      print('Offending compiled code:\n%s' % source)
+      if source is None:
+        print('Offending AST:\n%s' % pretty_printer.fmt(node, color=False))
+      else:
+        print('Offending compiled code:\n%s' % source)
       raise
 
-  def make_fake_tf(self, *symbols):
-    fake_tf = imp.new_module('fake_tf')
+  def make_fake_mod(self, name, *symbols):
+    fake_mod = imp.new_module(name)
     for s in symbols:
-      setattr(fake_tf, s.__name__, s)
-    return fake_tf
+      if hasattr(s, '__name__'):
+        setattr(fake_mod, s.__name__, s)
+      elif hasattr(s, 'name'):
+        # This is a bit of a hack, but works for things like tf.int32
+        setattr(fake_mod, s.name, s)
+      else:
+        raise ValueError('can not attach %s - what should be its name?' % s)
+    return fake_mod
 
   def attach_namespace(self, module, **ns):
     for k, v in ns.items():
@@ -94,7 +118,8 @@ class TestCase(test.TestCase):
         arg_values=None,
         arg_types=arg_types,
         owner_type=owner_type,
-        recursive=recursive)
+        recursive=recursive,
+        type_annotation_func=utils.set_element_type)
     node = qual_names.resolve(node)
     node = activity.resolve(node, ctx)
     node = live_values.resolve(node, ctx, {})
diff --git a/tensorflow/contrib/py2tf/converters/for_loops.py b/tensorflow/contrib/py2tf/converters/for_loops.py
index 935dade0ed30975dd29c8ffe5be875993936d241..4297c1cf2a3632e097973280cc985fc48da64475 100644
--- a/tensorflow/contrib/py2tf/converters/for_loops.py
+++ b/tensorflow/contrib/py2tf/converters/for_loops.py
@@ -37,14 +37,18 @@ class ForLoopCanonicalizationTransformer(transformer.Base):
   def visit_For(self, node):
     self.generic_visit(node)
     body_scope = anno.getanno(node, NodeAnno.BODY_SCOPE)
-
+    i_var = self.context.namer.new_symbol('i', body_scope.referenced)
+    n_var = self.context.namer.new_symbol('n', body_scope.referenced)
+    iterated_var = self.context.namer.new_symbol('iterated',
+                                                 body_scope.referenced)
+    # TODO(mdan): Use TensorListFromTensor(loop_iter) here.
     if anno.hasanno(node, 'extra_cond'):
       template = """
         i = 0
-        n = len(loop_iter)
+        iterated = loop_iter
+        n = len(iterated)
         while i < n and extra_cond:
-          # TODO(mdan): Use TensorListFromTensor(loop_iter) here.
-          target = loop_iter[i]
+          target = iterated[i]
           body
           i += 1
       """
@@ -53,17 +57,18 @@ class ForLoopCanonicalizationTransformer(transformer.Base):
           loop_iter=node.iter,
           target=node.target,
           body=node.body,
-          i=self.context.namer.new_symbol('i', body_scope.referenced),
-          n=self.context.namer.new_symbol('n', body_scope.referenced),
+          i=i_var,
+          n=n_var,
+          iterated=iterated_var,
           extra_cond=anno.getanno(node, 'extra_cond'))
     else:
       template = """
         i = 0
-        n = len(loop_iter)
+        iterated = loop_iter
+        n = len(iterated)
         while i < n:
-          # TODO(mdan): Use TensorListFromTensor(loop_iter) here.
-          target = loop_iter[i]
-          body  # pylint:disable=pointless-statement
+          target = iterated[i]
+          body
           i += 1
       """
       repl = templates.replace(
@@ -71,8 +76,9 @@ class ForLoopCanonicalizationTransformer(transformer.Base):
           loop_iter=node.iter,
           target=node.target,
           body=node.body,
-          i=self.context.namer.new_symbol('i', body_scope.referenced),
-          n=self.context.namer.new_symbol('n', body_scope.referenced))
+          i=i_var,
+          n=n_var,
+          iterated=iterated_var)
       return repl
 
   def visit_Continue(self, node):
diff --git a/tensorflow/contrib/py2tf/converters/for_loops_test.py b/tensorflow/contrib/py2tf/converters/for_loops_test.py
index 70a367d3b517e528b67f260d607431d324d2ab7d..b6e3e8c8d8d4960977e2b72b56a3fab8329ad2a7 100644
--- a/tensorflow/contrib/py2tf/converters/for_loops_test.py
+++ b/tensorflow/contrib/py2tf/converters/for_loops_test.py
@@ -42,6 +42,29 @@ class ControlFlowTest(converter_test_base.TestCase):
       l = []
       self.assertEqual(test_fn(l), result.test_fn(l))
 
+  def test_for_with_iterated_expression(self):
+
+    eval_count = [0]
+
+    def count_evals(x):
+      eval_count[0] += 1
+      return x
+
+    def test_fn(n):
+      s = 0
+      for e in count_evals(range(n)):
+        s += e
+      return s
+
+    node = self.parse_and_analyze(test_fn, {'count_evals': count_evals})
+    node = for_loops.transform(node, self.ctx)
+
+    with self.compiled(node) as result:
+      result.count_evals = count_evals
+      self.assertEqual(test_fn(5), result.test_fn(5))
+      # count_evals ran twice, once for test_fn and another for result.test_fn
+      self.assertEqual(eval_count[0], 2)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/py2tf/utils/printing.py b/tensorflow/contrib/py2tf/converters/ifexp.py
similarity index 51%
rename from tensorflow/contrib/py2tf/utils/printing.py
rename to tensorflow/contrib/py2tf/converters/ifexp.py
index 95a62bd80b5f4854e6a062df18d882f7bd495555..5fd6f348af0df81a6ff35745da603bd431130e20 100644
--- a/tensorflow/contrib/py2tf/utils/printing.py
+++ b/tensorflow/contrib/py2tf/converters/ifexp.py
@@ -12,36 +12,38 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""TensorFlow printing support utilities."""
+"""Canonicalizes the ternary conditional operator."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.py2tf.utils import py_func
-from tensorflow.python.ops import logging_ops
+from tensorflow.contrib.py2tf.pyct import templates
+from tensorflow.contrib.py2tf.pyct import transformer
 
 
-def is_tf_print_compatible(value):
-  # TODO(mdan): Enable once we can reliably test this.
-  # This is currently disabled because we can't capture the output of
-  # op kernels from Python.
-  del value
-  return False
+class IfExp(transformer.Base):
+  """Canonicalizes all IfExp nodes into plain conditionals."""
 
+  def visit_IfExp(self, node):
+    template = """
+        py2tf_utils.run_cond(test, lambda: body, lambda: orelse)
+    """
+    desugared_ifexp = templates.replace_as_expression(
+        template, test=node.test, body=node.body, orelse=node.orelse)
+    return desugared_ifexp
 
-def call_print(*values):
-  """Compiled counterpart of the print builtin.
 
-  The function attempts to use tf.Print if all the values are compatible.
-  Otherwise, it will fall back to py_func.
+def transform(node, context):
+  """Desugar IfExp nodes into plain conditionals.
 
   Args:
-    *values: values to print
+     node: an AST node to transform
+     context: a context object
+
   Returns:
-    A dummy value indicating the print completed. If tf.
+     new_node: an AST with no IfExp nodes, only conditionals.
   """
 
-  if all(map(is_tf_print_compatible, values)):
-    return logging_ops.Print(1, values)
-  return py_func.wrap_py_func(print, None, values, use_dummy_return=True)
+  node = IfExp(context).visit(node)
+  return node
diff --git a/tensorflow/contrib/py2tf/converters/ifexp_test.py b/tensorflow/contrib/py2tf/converters/ifexp_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..9c357ef35b550833bcb79d39f0bdbc6d758d31a5
--- /dev/null
+++ b/tensorflow/contrib/py2tf/converters/ifexp_test.py
@@ -0,0 +1,106 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for ifexp module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.py2tf import utils
+from tensorflow.contrib.py2tf.converters import converter_test_base
+from tensorflow.contrib.py2tf.converters import ifexp
+from tensorflow.python.platform import test
+
+
+class IfExpTest(converter_test_base.TestCase):
+
+  def compiled_fn(self, test_fn, *args):
+    node = self.parse_and_analyze(test_fn, {})
+    node = ifexp.transform(node, self.ctx)
+    module = self.compiled(node, *args)
+    return module
+
+  def test_simple(self):
+
+    def test_fn(x):
+      return 1 if x else 0
+
+    with self.compiled_fn(test_fn) as result:
+      result.py2tf_utils = utils
+      for x in [0, 1]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_fn(self):
+
+    def f(x):
+      return 3 * x
+
+    def test_fn(x):
+      y = f(x * x if x > 0 else x)
+      return y
+
+    with self.compiled_fn(test_fn) as result:
+      result.py2tf_utils = utils
+      result.f = f
+      for x in [-2, 2]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_exp(self):
+
+    def test_fn(x):
+      return x * x if x > 0 else x
+
+    with self.compiled_fn(test_fn) as result:
+      result.py2tf_utils = utils
+      for x in [-2, 2]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_nested(self):
+
+    def test_fn(x):
+      return x * x if x > 0 else x if x else 1
+
+    with self.compiled_fn(test_fn) as result:
+      result.py2tf_utils = utils
+      for x in [-2, 0, 2]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_in_cond(self):
+
+    def test_fn(x):
+      if x > 0:
+        return x * x if x < 5 else x * x * x
+      return -x
+
+    with self.compiled_fn(test_fn) as result:
+      result.py2tf_utils = utils
+      for x in [-2, 2, 5]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_assign_in_cond(self):
+
+    def test_fn(x):
+      if x > 0:
+        x = -x if x < 5 else x
+      return x
+
+    with self.compiled_fn(test_fn) as result:
+      result.py2tf_utils = utils
+      for x in [-2, 2, 5]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/py2tf/converters/lists.py b/tensorflow/contrib/py2tf/converters/lists.py
new file mode 100644
index 0000000000000000000000000000000000000000..12ebd00062a91cb63a5ca49a4b405b455af83e42
--- /dev/null
+++ b/tensorflow/contrib/py2tf/converters/lists.py
@@ -0,0 +1,103 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Converter for list operations.
+
+This includes converting Python lists to TensorArray/TensorList.
+"""
+
+# TODO(mdan): Elaborate the logic here.
+# TODO(mdan): Does it even make sense to try to use TensorArrays?
+# The current rule (always convert to TensorArray) is naive and insufficient.
+# In general, a better mechanism could look like:
+#   * convert to TensorList by default
+#   * leave as Python list if the user explicitly forbids it
+#   * convert to TensorArray only when complete write-once behavior can be
+#     guaranteed (e.g. list comprehensions)
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import gast
+
+from tensorflow.contrib.py2tf.pyct import anno
+from tensorflow.contrib.py2tf.pyct import templates
+from tensorflow.contrib.py2tf.pyct import transformer
+from tensorflow.python.framework import dtypes
+
+
+class ListTransformer(transformer.Base):
+  """Converts lists and related operations to their TF counterpart."""
+
+  def _empty_list(self, node):
+    if not anno.hasanno(node, 'element_type'):
+      raise NotImplementedError(
+          'type inference for empty lists is not yet supported; '
+          'use utils.set_element_type(<list>, <dtype>) to continue')
+    dtype = anno.getanno(node, 'element_type')
+    if not isinstance(dtype, dtypes.DType):
+      # TODO(mdan): Allow non-TF dtypes?
+      # That would be consistent with the dynamic dispatch pattern, but
+      # we must make sure that doesn't become confusing.
+      raise NotImplementedError('element type "%s" not yet supported' % dtype)
+
+    dtype_name = dtype.name
+    # TODO(mdan): Does it ever make sense not to use tensor lists?
+    template = """
+      tf.TensorArray(tf.dtype_name, size=0, dynamic_size=True)
+    """
+    return templates.replace_as_expression(template, dtype_name=dtype_name)
+
+  def _pre_populated_list(self, node):
+    raise NotImplementedError('pre-populated lists')
+
+  def visit_Expr(self, node):
+    node = self.generic_visit(node)
+    if isinstance(node.value, gast.Call):
+      call_node = node.value
+      qn = anno.getanno(call_node.func, anno.Basic.QN)
+
+      if qn.qn[-1] == 'append' and (len(call_node.args) == 1):
+        template = """
+          target = py2tf_utils.dynamic_list_append(target, element)
+        """
+        node = templates.replace(
+            template,
+            target=qn.parent.ast(),
+            element=call_node.args[0])
+    return node
+
+  def visit_Assign(self, node):
+    node = self.generic_visit(node)
+
+    # Only convert lists when they are assigned to a variable, e.g.:
+    #   l = []
+    # TODO(mdan): This rule should be improved.
+    if len(node.targets) != 1:
+      return node
+    if not isinstance(node.value, gast.List):
+      return node
+    if not isinstance(node.value.ctx, gast.Load):
+      return node
+
+    if node.value.elts:
+      node.value = self._pre_populated_list(node.value)
+    else:
+      node.value = self._empty_list(node.value)
+    return node
+
+
+def transform(node, context):
+  return ListTransformer(context).visit(node)
diff --git a/tensorflow/contrib/py2tf/utils/printing_test.py b/tensorflow/contrib/py2tf/converters/lists_test.py
similarity index 50%
rename from tensorflow/contrib/py2tf/utils/printing_test.py
rename to tensorflow/contrib/py2tf/converters/lists_test.py
index 2070deb304d8df2433fb9a95ae36d48415578482..671a1cc7b1225061a00731596c536c4403e0bdff 100644
--- a/tensorflow/contrib/py2tf/utils/printing_test.py
+++ b/tensorflow/contrib/py2tf/converters/lists_test.py
@@ -12,41 +12,40 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for printing module."""
+"""Tests for lists module."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import sys
+from tensorflow.contrib.py2tf import utils
+from tensorflow.contrib.py2tf.converters import converter_test_base
+from tensorflow.contrib.py2tf.converters import lists
+from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import tensor_array_ops
+from tensorflow.python.platform import test
 
-import six
 
-from tensorflow.contrib.py2tf.utils import printing
-from tensorflow.python.platform import test
+class ListTest(converter_test_base.TestCase):
 
+  def test_empty_annotated_list(self):
 
-class ContextManagersTest(test.TestCase):
+    def test_fn():
+      l = []
+      utils.set_element_type(l, dtypes.int32)
+      l.append(1)
+      return l
 
-  def test_call_print_tf(self):
-    try:
-      out_capturer = six.StringIO()
-      sys.stdout = out_capturer
-      with self.test_session() as sess:
-        sess.run(printing.call_print('test message', 1))
-        self.assertEqual(out_capturer.getvalue(), 'test message 1\n')
-    finally:
-      sys.stdout = sys.__stdout__
-
-  def test_call_print_py_func(self):
-    try:
-      out_capturer = six.StringIO()
-      sys.stdout = out_capturer
+    node = self.parse_and_analyze(test_fn, {'dtypes': dtypes, 'utils': utils})
+    node = lists.transform(node, self.ctx)
+
+    with self.compiled(node, tensor_array_ops.TensorArray,
+                       dtypes.int32) as result:
+      # TODO(mdan): Attach these additional modules automatically.
+      result.utils = utils
+      result.dtypes = dtypes
       with self.test_session() as sess:
-        sess.run(printing.call_print('test message', [1, 2]))
-        self.assertEqual(out_capturer.getvalue(), 'test message [1, 2]\n')
-    finally:
-      sys.stdout = sys.__stdout__
+        self.assertEqual(test_fn(), sess.run(result.test_fn().stack()))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/py2tf/converters/logical_expressions.py b/tensorflow/contrib/py2tf/converters/logical_expressions.py
index df980d41c9c57e325bee9a1fa870d9c95f46ea41..e0abf74ebc70b274b556b734532aeeb6fc6b4292 100644
--- a/tensorflow/contrib/py2tf/converters/logical_expressions.py
+++ b/tensorflow/contrib/py2tf/converters/logical_expressions.py
@@ -23,52 +23,110 @@ from __future__ import print_function
 
 import gast
 
+from tensorflow.contrib.py2tf.pyct import anno
 from tensorflow.contrib.py2tf.pyct import parser
+from tensorflow.contrib.py2tf.pyct import templates
+from tensorflow.contrib.py2tf.pyct import transformer
 
 
-class LogicalExpressionTransformer(gast.NodeTransformer):
+# TODO(mdan): Properly extract boolean ops according to lazy eval rules.
+# Note that this isn't completely safe either, because tensors may have control
+# dependencies.
+# Note that for loops, this should be done after the loop was converted to
+# tf.while_loop so that the expanded conditionals are properly scoped.
+
+# Used to signal that an operand is safe for non-lazy evaluation.
+SAFE_BOOLEAN_OPERAND = 'SAFE_BOOLEAN_OPERAND'
+
+
+class LogicalExpressionTransformer(transformer.Base):
   """Converts logical expressions to corresponding TF calls."""
 
-  def __init__(self):
+  def __init__(self, context):
+    super(LogicalExpressionTransformer, self).__init__(context)
     # TODO(mdan): Look into replacing with bitwise operators instead.
+    # TODO(mdan): Skip replacing if the function is trivial.
     self.op_mapping = {
         gast.And: 'tf.logical_and',
-        gast.Or: 'tf.logical_or',
-        gast.Not: 'tf.logical_not',
         gast.Eq: 'tf.equal',
+        gast.Gt: 'tf.greater',
+        gast.GtE: 'tf.greater_equal',
+        gast.Lt: 'tf.less',
+        gast.LtE: 'tf.less_equal',
+        gast.Not: 'tf.logical_not',
+        gast.NotEq: 'tf.not_equal',
+        gast.Or: 'tf.logical_or',
+        gast.USub: 'tf.negative',
+        gast.Is: 'py2tf_utils.dynamic_is',
+        gast.IsNot: 'py2tf_utils.dynamic_is_not'
     }
 
+  def _expect_simple_symbol(self, operand):
+    if isinstance(operand, gast.Name):
+      return
+    if anno.hasanno(operand, SAFE_BOOLEAN_OPERAND):
+      return
+    raise NotImplementedError(
+        'only simple local variables are supported in logical and compound '
+        'comparison expressions; for example, we support "a or b" but not '
+        '"a.x or b"; for a workaround, assign the expression to a local '
+        'variable and use that instead, for example "tmp = a.x", "tmp or b"')
+
+  def _matching_func(self, operator):
+    op_type = type(operator)
+    mapped_op = self.op_mapping.get(op_type)
+    if not mapped_op:
+      raise NotImplementedError('operator %s is not yet supported' % op_type)
+    return mapped_op
+
+  def _as_function(self, func_name, args):
+    template = """
+      func_name(args)
+    """
+    replacement = templates.replace_as_expression(
+        template, func_name=parser.parse_expression(func_name), args=args)
+    anno.setanno(replacement, SAFE_BOOLEAN_OPERAND, True)
+    return replacement
+
   def visit_Compare(self, node):
     node = self.generic_visit(node)
-    if len(node.ops) > 1:
-      raise NotImplementedError()
-    cmp_type = type(node.ops[0])
-    if cmp_type in self.op_mapping:
-      tf_function = parser.parse_str(self.op_mapping[cmp_type]).body[0].value
-      return gast.Call(
-          func=tf_function, args=[node.left, node.comparators[0]], keywords=[])
-    return node
+    ops_and_comps = list(zip(node.ops, node.comparators))
+    left = node.left
+    op_tree = None
+
+    # Repeated comparisons are converted to conjunctions:
+    #   a < b < c   ->   a < b and b < c
+    while ops_and_comps:
+      op, right = ops_and_comps.pop(0)
+      binary_comparison = self._as_function(
+          self._matching_func(op), (left, right))
+      if isinstance(left, gast.Name) and isinstance(right, gast.Name):
+        anno.setanno(binary_comparison, SAFE_BOOLEAN_OPERAND, True)
+      if op_tree:
+        self._expect_simple_symbol(right)
+        op_tree = self._as_function('tf.logical_and',
+                                    (binary_comparison, op_tree))
+      else:
+        op_tree = binary_comparison
+      left = right
+    assert op_tree is not None
+    return op_tree
 
   def visit_UnaryOp(self, node):
     node = self.generic_visit(node)
-    if isinstance(node.op, gast.Not):
-      tf_function = parser.parse_str(self.op_mapping[type(
-          node.op)]).body[0].value
-      node = gast.Call(func=tf_function, args=[node.operand], keywords=[])
-    return node
+    return self._as_function(self._matching_func(node.op), node.operand)
 
   def visit_BoolOp(self, node):
-    # TODO(mdan): A normalizer may be useful here. Use ANF?
     node = self.generic_visit(node)
-    tf_function = parser.parse_str(self.op_mapping[type(node.op)]).body[0].value
-    left = node.values[0]
-    for i in range(1, len(node.values)):
-      left = gast.Call(
-          func=tf_function, args=[left, node.values[i]], keywords=[])
-    return left
-
-
-def transform(node):
-  transformer = LogicalExpressionTransformer()
-  node = transformer.visit(node)
-  return node
+    node_values = node.values
+    right = node.values.pop()
+    self._expect_simple_symbol(right)
+    while node_values:
+      left = node_values.pop()
+      self._expect_simple_symbol(left)
+      right = self._as_function(self._matching_func(node.op), (left, right))
+    return right
+
+
+def transform(node, context):
+  return LogicalExpressionTransformer(context).visit(node)
diff --git a/tensorflow/contrib/py2tf/converters/logical_expressions_test.py b/tensorflow/contrib/py2tf/converters/logical_expressions_test.py
index a28326c517d468230f35e45f0fbfe5257d769895..eb28c309a429f2267cc1ae1f6f65a8cde0ad91b8 100644
--- a/tensorflow/contrib/py2tf/converters/logical_expressions_test.py
+++ b/tensorflow/contrib/py2tf/converters/logical_expressions_test.py
@@ -32,7 +32,7 @@ class GradientsFunctionTest(converter_test_base.TestCase):
       return a == b
 
     node = self.parse_and_analyze(test_fn, {})
-    node = logical_expressions.transform(node)
+    node = logical_expressions.transform(node, self.ctx)
 
     with self.compiled(node, math_ops.equal) as result:
       with self.test_session() as sess:
@@ -45,7 +45,7 @@ class GradientsFunctionTest(converter_test_base.TestCase):
       return (a or b) and (a or b or c)
 
     node = self.parse_and_analyze(test_fn, {})
-    node = logical_expressions.transform(node)
+    node = logical_expressions.transform(node, self.ctx)
 
     with self.compiled(node, math_ops.logical_or,
                        math_ops.logical_and) as result:
diff --git a/tensorflow/contrib/py2tf/converters/single_return.py b/tensorflow/contrib/py2tf/converters/single_return.py
new file mode 100644
index 0000000000000000000000000000000000000000..1194b98f5ebeffa79a41fc3b32aa79ffd8cc407b
--- /dev/null
+++ b/tensorflow/contrib/py2tf/converters/single_return.py
@@ -0,0 +1,317 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Canonicalizes functions with multiple returns to use just one."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import gast
+
+from tensorflow.contrib.py2tf.pyct import anno
+from tensorflow.contrib.py2tf.pyct import ast_util
+from tensorflow.contrib.py2tf.pyct import templates
+from tensorflow.contrib.py2tf.pyct import transformer
+from tensorflow.contrib.py2tf.pyct.static_analysis.annos import NodeAnno
+
+
+# TODO(mdan): Move this logic into transformer_base.
+class BodyVisitor(transformer.Base):
+  """Walks breadth- or depth-first the list-of-nodes bodies of AST nodes."""
+
+  def __init__(self, context, depth_first=False):
+    self.depth_first = depth_first
+    self.changes_made = False
+    super(BodyVisitor, self).__init__(context)
+
+  def visit_nodelist(self, nodelist):
+    for node in nodelist:
+      if isinstance(node, list):
+        node = self.visit_nodelist(node)
+      else:
+        node = self.generic_visit(node)
+    return nodelist
+
+  def visit_If(self, node):
+    if self.depth_first:
+      node = self.generic_visit(node)
+    node.body = self.visit_nodelist(node.body)
+    node.orelse = self.visit_nodelist(node.orelse)
+    if not self.depth_first:
+      node = self.generic_visit(node)
+    return node
+
+  def visit_For(self, node):
+    if self.depth_first:
+      node = self.generic_visit(node)
+    node.body = self.visit_nodelist(node.body)
+    node.orelse = self.visit_nodelist(node.orelse)
+    if not self.depth_first:
+      node = self.generic_visit(node)
+    return node
+
+  def visit_While(self, node):
+    if self.depth_first:
+      node = self.generic_visit(node)
+    node.body = self.visit_nodelist(node.body)
+    node.orelse = self.visit_nodelist(node.orelse)
+    if not self.depth_first:
+      node = self.generic_visit(node)
+    return node
+
+  def visit_Try(self, node):
+    if self.depth_first:
+      node = self.generic_visit(node)
+    node.body = self.visit_nodelist(node.body)
+    node.orelse = self.visit_nodelist(node.orelse)
+    node.finalbody = self.visit_nodelist(node.finalbody)
+    for i in range(len(node.handlers)):
+      node.handlers[i].body = self.visit_nodelist(node.handlers[i].body)
+    if not self.depth_first:
+      node = self.generic_visit(node)
+    return node
+
+  def visit_With(self, node):
+    if self.depth_first:
+      node = self.generic_visit(node)
+    node.body = self.visit_nodelist(node.body)
+    if not self.depth_first:
+      node = self.generic_visit(node)
+    return node
+
+  def visit_FunctionDef(self, node):
+    if self.depth_first:
+      node = self.generic_visit(node)
+    node.body = self.visit_nodelist(node.body)
+    self.generic_visit(node)
+    if not self.depth_first:
+      node = self.generic_visit(node)
+    return node
+
+
+class FoldElse(BodyVisitor):
+
+  def visit_nodelist(self, nodelist):
+    for i in range(len(nodelist)):
+      node = nodelist[i]
+      if isinstance(node, gast.If):
+        true_branch_returns = isinstance(node.body[-1], gast.Return)
+        false_branch_returns = len(node.orelse) and isinstance(
+            node.orelse[-1], gast.Return)
+        # If the last node in the if body is a return,
+        # then every line after this if statement effectively
+        # belongs in the else.
+        if true_branch_returns and not false_branch_returns:
+          for j in range(i + 1, len(nodelist)):
+            nodelist[i].orelse.append(ast_util.copy_clean(nodelist[j]))
+          if nodelist[i + 1:]:
+            self.changes_made = True
+          return nodelist[:i + 1]
+        elif not true_branch_returns and false_branch_returns:
+          for j in range(i + 1, len(nodelist)):
+            nodelist[i].body.append(ast_util.copy_clean(nodelist[j]))
+          if nodelist[i + 1:]:
+            self.changes_made = True
+          return nodelist[:i + 1]
+        elif true_branch_returns and false_branch_returns:
+          if nodelist[i + 1:]:
+            raise ValueError(
+                'Unreachable code after conditional where both branches return.'
+            )
+          return nodelist
+      elif isinstance(node, gast.Return) and nodelist[i + 1:]:
+        raise ValueError(
+            'Cannot have statements after a return in the same basic block')
+    return nodelist
+
+
+def contains_return(node):
+  for n in gast.walk(node):
+    if isinstance(n, gast.Return):
+      return True
+  return False
+
+
+class LiftReturn(transformer.Base):
+  """Move return statements out of If and With blocks."""
+
+  def __init__(self, context):
+    self.changes_made = False
+    self.common_return_name = None
+    super(LiftReturn, self).__init__(context)
+
+  def visit_If(self, node):
+    # Depth-first traversal of if statements
+    node = self.generic_visit(node)
+
+    # We check if both branches return, and if so, lift the return out of the
+    # conditional. We don't enforce that the true and false branches either
+    # both return or both do not, because FoldElse might move a return
+    # into a branch after this transform completes. FoldElse and LiftReturn
+    # are alternately run until the code reaches a fixed point.
+    true_branch_returns = isinstance(node.body[-1], gast.Return)
+    false_branch_returns = len(node.orelse) and isinstance(
+        node.orelse[-1], gast.Return)
+    if true_branch_returns and false_branch_returns:
+      node.body[-1] = templates.replace(
+          'a = b', a=self.common_return_name, b=node.body[-1].value)[0]
+      node.orelse[-1] = templates.replace(
+          'a = b', a=self.common_return_name, b=node.orelse[-1].value)[0]
+      return_node = templates.replace('return a', a=self.common_return_name)[0]
+      self.changes_made = True
+      return [node, return_node]
+    else:
+      return node
+
+  def visit_With(self, node):
+    # Depth-first traversal of syntax
+    node = self.generic_visit(node)
+
+    # If the with statement returns, lift the return
+    if isinstance(node.body[-1], gast.Return):
+      node.body[-1] = templates.replace(
+          'a = b', a=self.common_return_name, b=node.body[-1].value)[0]
+      return_node = templates.replace('return a', a=self.common_return_name)[0]
+      node = self.generic_visit(node)
+      self.changes_made = True
+      return [node, return_node]
+    else:
+      return node
+
+  def visit_FunctionDef(self, node):
+    # Ensure we're doing depth-first traversal
+    last_return_name = self.common_return_name
+    body_scope = anno.getanno(node, NodeAnno.BODY_SCOPE)
+    referenced_names = body_scope.referenced
+    self.common_return_name = self.context.namer.new_symbol(
+        'return_', referenced_names)
+    node = self.generic_visit(node)
+    self.common_return_name = last_return_name
+    return node
+
+
+class DetectReturnInUnsupportedControlFlow(gast.NodeVisitor):
+  """Throws an error if code returns inside loops or try/except."""
+
+  # First, throw an error if we detect a return statement in a loop.
+  # TODO(alexbw): we need to learn to handle returns inside a loop,
+  # but don't currently have the TF constructs to do so (need something
+  # that looks vaguely like a goto).
+
+  def __init__(self):
+    self.cant_return = False
+    super(DetectReturnInUnsupportedControlFlow, self).__init__()
+
+  def visit_While(self, node):
+    was_blocked, self.cant_return = self.cant_return, True
+    self.generic_visit(node)
+    self.cant_return = was_blocked
+
+  def visit_For(self, node):
+    was_blocked, self.cant_return = self.cant_return, True
+    self.generic_visit(node)
+    self.cant_return = was_blocked
+
+  def visit_Try(self, node):
+    was_blocked, self.cant_return = self.cant_return, True
+    self.generic_visit(node)
+    self.cant_return = was_blocked
+
+  def visit_Return(self, node):
+    if self.cant_return:
+      raise ValueError(
+          'py2tf currently does not support `return` statements in loops. '
+          'Try assigning to a variable in the while loop, and returning '
+          'outside of the loop')
+
+
+class DetectReturnInConditional(gast.NodeVisitor):
+  """Assert that no return statements are present in conditionals."""
+
+  def __init__(self):
+    self.cant_return = False
+    super(DetectReturnInConditional, self).__init__()
+
+  def visit_If(self, node):
+    was_blocked, self.cant_return = self.cant_return, True
+    self.generic_visit(node)
+    self.cant_return = was_blocked
+
+  def visit_Return(self, node):
+    if self.cant_return:
+      raise ValueError(
+          'After transforms, a conditional contained a `return` statement, '
+          'which is not allowed. This is a bug, and should not happen.')
+
+
+class DetectReturnInFunctionDef(gast.NodeVisitor):
+
+  def visit_FunctionDef(self, node):
+    self.generic_visit(node)
+    if not contains_return(node):
+      raise ValueError(
+          'Each function definition should contain at least one return.')
+
+
+def transform(node, context):
+  """Ensure a function has only a single return.
+
+  This successively transforms an AST node with multiple returns into one
+  containing only a single return node.
+  There are a few restrictions on what we can handle:
+   - An AST being transformed must contain at least one return.
+   - No returns allowed in loops. We have to know the type of the return value,
+     and we currently have neither a type inference system to discover it
+     nor a mechanism for late type binding in TensorFlow.
+   - After all transformations are finished, a Return node is not allowed inside
+     control flow. If we were unable to move a return outside of control flow,
+     this is an error.
+
+  Args:
+    node: an AST node to transform
+    context: a context object
+
+  Returns:
+    new_node: an AST with a single return value
+
+  Raises:
+    ValueError: if the AST is structured so that we can't perform the
+      transform.
+  """
+  # Make sure that the function has at least one return statement
+  # TODO(alexbw): turning off this assertion for now --
+  # we need to not require this in e.g. class constructors.
+  # DetectReturnInFunctionDef().visit(node)
+
+  # Make sure there are no returns in unsupported locations (loops, try/except)
+  DetectReturnInUnsupportedControlFlow().visit(node)
+
+  while True:
+
+    # Try to lift all returns out of if statements and with blocks
+    lr = LiftReturn(context)
+    node = lr.visit(node)
+    changes_made = lr.changes_made
+    fe = FoldElse(context)
+    node = fe.visit(node)
+    changes_made = changes_made or fe.changes_made
+
+    if not changes_made:
+      break
+
+  # Make sure we've scrubbed all returns from conditionals
+  DetectReturnInConditional().visit(node)
+
+  return node
diff --git a/tensorflow/contrib/py2tf/converters/single_return_test.py b/tensorflow/contrib/py2tf/converters/single_return_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..2ea7a9d6d3e25c8dafd8f211994c8fe99bd0e781
--- /dev/null
+++ b/tensorflow/contrib/py2tf/converters/single_return_test.py
@@ -0,0 +1,189 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for single_return module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.py2tf.converters import converter_test_base
+from tensorflow.contrib.py2tf.converters import single_return
+from tensorflow.python.framework.ops import name_scope
+from tensorflow.python.platform import test
+
+
+class SingleReturnTest(converter_test_base.TestCase):
+
+  def compiled_fn(self, test_fn, *args):
+    node = self.parse_and_analyze(test_fn, {})
+    node = single_return.transform(node, self.ctx)
+    module = self.compiled(node, *args)
+    return module
+
+  def test_noop(self):
+    # Noop
+    def test_fn(x):
+      return x
+
+    with self.compiled_fn(test_fn) as result:
+      self.assertEqual(test_fn(2.0), result.test_fn(2.0))
+
+  def test_return_expression(self):
+    # ANF
+    def test_fn(x):
+      return x * x
+
+    with self.compiled_fn(test_fn) as result:
+      x = 2
+      self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_merge(self):
+    # Simple merge
+    def test_fn(x):
+      if x > 0:
+        return x
+      else:
+        return x * x
+
+    with self.compiled_fn(test_fn) as result:
+      for x in [-2, 2]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_orphan_branch(self):
+
+    def test_fn(x):
+      if x > 0:
+        return x
+
+    with self.assertRaises(ValueError):
+      self.compiled_fn(test_fn)
+
+  def test_lift_body_into_false_branch(self):
+
+    def test_fn(x):
+      if x > 0:
+        return x
+      return x * x
+
+    with self.compiled_fn(test_fn) as result:
+      for x in [-2, 2]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_lift_body_into_true_branch(self):
+
+    def test_fn(x):
+      if x < 0:
+        x *= x
+      else:
+        # TODO(alexbw): a linter bug here forces us to suppress this warning.
+        return x  # pylint: disable=undefined-loop-variable
+      return x
+
+    with self.compiled_fn(test_fn) as result:
+      for x in [-2, 2]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_nested_if(self):
+
+    def test_fn(x):
+      if x > 0:
+        if x < 5:
+          return x
+        else:
+          return x * x
+      else:
+        return x * x * x
+
+    with self.compiled_fn(test_fn) as result:
+      for x in [-2, 2, 5]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_context_manager(self):
+
+    def test_fn(x):
+
+      with name_scope(''):
+        return x * x
+
+    with self.compiled_fn(test_fn) as result:
+      result.name_scope = name_scope
+      for x in [-2, 2]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_context_manager_in_conditional(self):
+
+    def test_fn(x):
+      if x > 0:
+        with name_scope(''):
+          return x * x
+      else:
+        return x
+
+    with self.compiled_fn(test_fn, name_scope) as result:
+      result.name_scope = name_scope
+      for x in [-2, 2]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_conditional_in_context_manager(self):
+
+    def test_fn(x):
+      with name_scope(''):
+        if x > 0:
+          return x * x
+        else:
+          return x
+
+    with self.compiled_fn(test_fn) as result:
+      result.name_scope = name_scope
+      for x in [-2, 2]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_no_return(self):
+
+    def test_fn(x):
+      x *= x
+
+    with self.compiled_fn(test_fn) as result:
+      self.assertEqual(test_fn(2), result.test_fn(2))
+
+  def test_nested_functiondefs(self):
+
+    def test_fn(x):
+
+      def inner_fn(y):
+        if y > 0:
+          return y * y
+        else:
+          return y
+
+      return inner_fn(x)
+
+    with self.compiled_fn(test_fn) as result:
+      for x in [-2, 2]:
+        self.assertEqual(test_fn(x), result.test_fn(x))
+
+  def test_loop(self):
+
+    def test_fn(x):
+      for _ in range(10):
+        return x
+      return x
+
+    with self.assertRaises(ValueError):
+      self.compiled_fn(test_fn)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/py2tf/impl/BUILD b/tensorflow/contrib/py2tf/impl/BUILD
index 90ffabbc9bf4524ec2ebf54b6dd847bd8768a486..cc49d71b78e13720e8aceb66898249cfe9c4891f 100644
--- a/tensorflow/contrib/py2tf/impl/BUILD
+++ b/tensorflow/contrib/py2tf/impl/BUILD
@@ -42,6 +42,7 @@ py_test(
         ":impl",
         "//tensorflow/contrib/py2tf/utils",
         "//tensorflow/python:client_testlib",
+        "//third_party/py/numpy",
     ],
 )
 
diff --git a/tensorflow/contrib/py2tf/impl/api.py b/tensorflow/contrib/py2tf/impl/api.py
index 29d2e038a73c8cac89c121ec65c32f0d4f68aff6..a9e8ea204379e8e2f7ccf3db70b7a19438ad08a5 100644
--- a/tensorflow/contrib/py2tf/impl/api.py
+++ b/tensorflow/contrib/py2tf/impl/api.py
@@ -20,13 +20,20 @@ from __future__ import print_function
 
 from functools import wraps
 
+from enum import Enum
+
+# pylint:disable=g-bad-import-order
 import gast
 import six
+# pylint:enable=g-bad-import-order
 
 from tensorflow.contrib.py2tf.impl import config
 from tensorflow.contrib.py2tf.impl import conversion
 from tensorflow.contrib.py2tf.pyct import compiler
+from tensorflow.contrib.py2tf.pyct import inspect_utils
 from tensorflow.contrib.py2tf.pyct import parser
+from tensorflow.contrib.py2tf.utils import builtins
+from tensorflow.contrib.py2tf.utils import py_func
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import tf_inspect
 
@@ -35,55 +42,6 @@ from tensorflow.python.util import tf_inspect
 # (currently we require (module + class name, type))
 
 
-def graph_ready(f):
-  """No-op decorator that explicitly marks a function as graph-ready.
-
-  Graph-ready functions are assumed to not need any conversion.
-
-  Args:
-    f: Any callable.
-  Returns:
-    f itself.
-  """
-  setattr(f, '__pyct_is_compile_decorator', True)
-  return f
-
-
-def convert_inline(f, *args, **kwargs):
-  """Shorthand to convert and call a function.
-
-  For example, the following two statements are equivalent:
-
-      @convert()
-      def foo():
-        ...
-      foo(bar)
-
-      def foo():
-        ...
-      convert_inline(foo, bar)
-
-  Args:
-    f: Function to convert. Only this call will be converted.
-    *args: Passed through to f.
-    **kwargs: Passed through to f, with the following exceptions:
-        * arg_value_hints: A dict mapping parameter names to objects that can
-            hint at the type of those parameters.
-
-  Returns:
-    The result of the converted f applied to args and kwargs.
-  """
-  if 'arg_value_hints' in kwargs:
-    arg_value_hints = kwargs['arg_value_hints']
-    del kwargs['arg_value_hints']
-  else:
-    arg_value_hints = None
-  if tf_inspect.ismethod(f):
-    # When converting methods, the result is still an unbound function.
-    args = (f.__self__,) + args
-  return convert(arg_value_hints)(f)(*args, **kwargs)
-
-
 def convert(recursive=False, verbose=False, arg_types=None):
   """Decorator that compiles a function to graph mode.
 
@@ -110,28 +68,7 @@ def convert(recursive=False, verbose=False, arg_types=None):
 
     @wraps(f)
     def wrapper(*args, **kwargs):
-      """Wrapper that calls the compiled version of the wrapped function."""
-      partial_types = ()
-      arg_values = {}
-      arg_names = tf_inspect.getargspec(f)[0]
-      for name, arg in zip(arg_names, args):
-        arg_values[name] = arg
-        arg_class = arg.__class__
-        # If arg_value_hints specifies any name, use that instead.
-        if name not in arg_types:
-          arg_types[name] = (arg_class.__name__, arg_class)
-        if name == 'self' and tf_inspect.isclass(arg_class):
-          # Annotated methods need to specify that their owner type is partial,
-          # otherwise other members they call will not be converted.
-          partial_types = (arg_class,)
-      wrapped = to_graph(
-          f,
-          recursive=recursive,
-          verbose=verbose,
-          arg_values=arg_values,
-          arg_types=arg_types,
-          partial_types=partial_types)
-      return wrapped(*args, **kwargs)
+      return converted_call(f, recursive, verbose, arg_types, *args, **kwargs)
 
     # Sometimes the decorator is just desugared, making it impossible to detect.
     # This attribute makes detection easier.
@@ -141,6 +78,127 @@ def convert(recursive=False, verbose=False, arg_types=None):
   return decorator
 
 
+class RunMode(Enum):
+  GRAPH = 1
+  PY_FUNC = 2
+
+
+def do_not_convert(run_as=RunMode.GRAPH, return_dtypes=None):
+  """Decorator that suppresses compilation of a function.
+
+  Args:
+    run_as: RunMode value. Whether to run the function as-is, or wrap it into
+        a py_func.
+    return_dtypes: See py2tf.utils.py_func.wrap_py_func. Setting to None or
+        empty list or tuple will create a dummy return value that can be used
+        to set control dependencies.
+
+  Returns:
+    A decorator that wraps the original function.
+  """
+  def decorator(f):
+    """Decorator implementation."""
+
+    @wraps(f)
+    def graph_wrapper(*args, **kwargs):
+      return f(*args, **kwargs)
+
+    @wraps(f)
+    def py_func_wrapper(*args, **kwargs):
+      if kwargs:
+        raise NotImplementedError(
+            'RunMode.PY_FUNC does not yet support kwargs')
+      # TODO(mdan): Add support for kwargs.
+      return py_func.wrap_py_func(
+          f, return_dtypes, args, kwargs, use_dummy_return=not return_dtypes)
+
+    if run_as == RunMode.GRAPH:
+      wrapper = graph_wrapper
+    elif run_as == RunMode.PY_FUNC:
+      wrapper = py_func_wrapper
+    else:
+      raise ValueError('unknown value for run_as: %s' % run_as)
+
+    # Sometimes the decorator is just desugared, making it impossible to detect.
+    # This attribute makes detection easier.
+    setattr(wrapper, '__pyct_is_compile_decorator', True)
+    return wrapper
+
+  return decorator
+
+
+def converted_call(f, recursive, verbose, arg_types, *args, **kwargs):
+  """Compiles a function call inline."""
+  # TODO(mdan): This needs cleanup.
+  # In particular, we may want to avoid renaming functions altogether.
+
+  if conversion.is_whitelisted_for_graph(f):
+    return f(*args, **kwargs)
+
+  unknown_arg_value = object()  # Sentinel for arguments of unknown value
+
+  if tf_inspect.isbuiltin(f):
+    return builtins.dynamic_builtin(f, *args, **kwargs)
+
+  if tf_inspect.isfunction(f) or tf_inspect.ismethod(f):
+    # Regular functions
+    target_entity = f
+    arg_map_target = f
+    effective_args = args
+    f_class = inspect_utils.getmethodclass(f)
+
+    if f_class is not None:
+      partial_types = (f_class,)
+    else:
+      partial_types = ()
+
+  elif tf_inspect.isclass(f):
+    # Constructors
+    target_entity = f
+    arg_map_target = f.__init__
+    effective_args = (unknown_arg_value,) + args
+    partial_types = ()
+
+  elif hasattr(f, '__call__') and hasattr(f, '__class__'):
+    # Callable objects
+    target_entity = f.__call__
+    arg_map_target = f.__call__
+    effective_args = (f,) + args
+    partial_types = (f.__class__,)
+
+  else:
+    raise NotImplementedError('unknown callable type "%s"' % type(f))
+
+  arg_values = tf_inspect.getcallargs(arg_map_target, *args, **kwargs)
+  for name, arg in arg_values.items():
+    if arg is unknown_arg_value:
+      continue
+    arg_class = arg.__class__
+    # If arg_types already specifies a type for this name, keep that instead.
+    if name not in arg_types:
+      arg_types[name] = (arg_class.__name__, arg_class)
+
+  # When called from within a decorator, this is the only indication that
+  # the function is a method - it appears that the decorator is applied
+  # before the method is bound.
+  if not partial_types:
+    if 'self' in arg_values:
+      if tf_inspect.isclass(arg_values['self'].__class__):
+        partial_types = (arg_values['self'].__class__,)
+    elif 'cls' in arg_values:
+      if tf_inspect.isclass(arg_values['cls']):
+        partial_types = (arg_values['cls'],)
+
+  converted_f = to_graph(
+      target_entity,
+      recursive=recursive,
+      verbose=verbose,
+      arg_values=arg_values,
+      arg_types=arg_types,
+      partial_types=partial_types)
+  return converted_f(*effective_args, **kwargs)
+
+
 def to_graph(e,
              recursive=True,
              verbose=False,
@@ -174,14 +232,14 @@ def to_graph(e,
   """
   conversion_map = conversion.ConversionMap(
       recursive=recursive,
-      nocompile_decorators=(convert, graph_ready, convert_inline),
+      nocompile_decorators=(convert, do_not_convert, converted_call),
       partial_types=partial_types,
       api_module=tf_inspect.getmodule(to_graph))
   _, name = conversion.entity_to_graph(e, conversion_map, arg_values, arg_types)
 
   module = gast.Module([])
   for import_line in config.COMPILED_IMPORT_STATEMENTS:
-    module.body.append(parser.parse_str(import_line))
+    module.body.extend(parser.parse_str(import_line).body)
   for dep in conversion_map.dependency_cache.values():
     module.body.append(dep)
   compiled_node, compiled_src = compiler.ast_to_object(module)
@@ -189,7 +247,7 @@ def to_graph(e,
   # The compiled code should see everything the entry function saw.
   # TODO(mdan): This might not work well if the call tree spans modules?
   if tf_inspect.isfunction(e):
-    compiled_node.__dict__.update(six.get_function_globals(e))
+    compiled_node.__dict__.update(inspect_utils.getnamespace(e))
   compiled_fn = getattr(compiled_node, name)
 
   if verbose:
@@ -221,7 +279,7 @@ def to_code(e,
   """
   conversion_map = conversion.ConversionMap(
       recursive=recursive,
-      nocompile_decorators=(convert, graph_ready, convert_inline),
+      nocompile_decorators=(convert, do_not_convert, converted_call),
       partial_types=partial_types,
       api_module=tf_inspect.getmodule(to_graph))
   conversion.entity_to_graph(e, conversion_map, arg_values, arg_types)
diff --git a/tensorflow/contrib/py2tf/impl/api_test.py b/tensorflow/contrib/py2tf/impl/api_test.py
index 51e99864adeba9c928b6e74eb759054ef1d1d78c..a7b1aba85230124da7275fc973ae0714087b35c7 100644
--- a/tensorflow/contrib/py2tf/impl/api_test.py
+++ b/tensorflow/contrib/py2tf/impl/api_test.py
@@ -18,23 +18,29 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
+from tensorflow.contrib.py2tf import utils
 from tensorflow.contrib.py2tf.impl import api
 from tensorflow.contrib.py2tf.impl import config
 from tensorflow.contrib.py2tf.pyct import parser
+from tensorflow.contrib.py2tf.utils import py_func
 from tensorflow.python.framework import constant_op
-from tensorflow.python.ops import math_ops
 from tensorflow.python.platform import test
 
 
+tf = utils.fake_tf()
+
+
 class ApiTest(test.TestCase):
 
   def setUp(self):
-    config.DEFAULT_UNCOMPILED_MODULES.add((math_ops.__name__,))
     config.COMPILED_IMPORT_STATEMENTS = (
-        'from tensorflow.python.framework '
-        'import ops as tf',
+        'from __future__ import print_function',
         'from tensorflow.contrib.py2tf import utils as '
-        'py2tf_utils')
+        'py2tf_utils',
+        'tf = py2tf_utils.fake_tf()'
+    )
 
   def test_decorator_recurses(self):
 
@@ -47,7 +53,7 @@ class ApiTest(test.TestCase):
 
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
+        while tf.reduce_sum(x) > s:
           x //= self.called_member(a)
         return x
 
@@ -63,11 +69,11 @@ class ApiTest(test.TestCase):
     class TestClass(object):
 
       def called_member(self, a):
-        return math_ops.negative(a)
+        return tf.negative(a)
 
       @api.convert(recursive=False)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
+        while tf.reduce_sum(x) > s:
           x //= self.called_member(a)
         return x
 
@@ -78,17 +84,17 @@ class ApiTest(test.TestCase):
           constant_op.constant(-2))
       self.assertListEqual([0, 1], sess.run(x).tolist())
 
-  def test_decorator_calls_converted(self):
+  def test_decorator_calls_unconverted_graph(self):
 
     class TestClass(object):
 
-      @api.graph_ready
+      @api.do_not_convert(api.RunMode.GRAPH)
       def called_member(self, a):
-        return math_ops.negative(a)
+        return tf.negative(a)
 
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
+        while tf.reduce_sum(x) > s:
           x //= self.called_member(a)
         return x
 
@@ -99,20 +105,23 @@ class ApiTest(test.TestCase):
           constant_op.constant(-2))
       self.assertListEqual([0, 1], sess.run(x).tolist())
 
-  def test_decorator_calls_decorated(self):
+  def test_decorator_calls_unconverted_py_func(self):
 
     class TestClass(object):
 
-      @api.convert()
+      @api.do_not_convert(
+          api.RunMode.PY_FUNC, return_dtypes=py_func.MatchDType(1))
       def called_member(self, a):
-        if a < 0:
-          a = -a
-        return a
+        return np.negative(a)
 
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
-          x //= self.called_member(a)
+        while tf.reduce_sum(x) > s:
+          y = self.called_member(a)
+          # set_shape works around while_loop's limitations.
+          # TODO(mdan): Allow specifying shapes (or ShapeLike) instead.
+          y.set_shape(a.shape)
+          x //= y
         return x
 
     tc = TestClass()
@@ -122,10 +131,11 @@ class ApiTest(test.TestCase):
           constant_op.constant(-2))
       self.assertListEqual([0, 1], sess.run(x).tolist())
 
-  def test_convert_call_site_decorator(self):
+  def test_decorator_calls_decorated(self):
 
     class TestClass(object):
 
+      @api.convert()
       def called_member(self, a):
         if a < 0:
           a = -a
@@ -133,8 +143,8 @@ class ApiTest(test.TestCase):
 
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
-          x //= api.convert_inline(self.called_member, a)
+        while tf.reduce_sum(x) > s:
+          x //= self.called_member(a)
         return x
 
     tc = TestClass()
@@ -144,17 +154,20 @@ class ApiTest(test.TestCase):
           constant_op.constant(-2))
       self.assertListEqual([0, 1], sess.run(x).tolist())
 
-  def test_graph_ready_call_site_decorator(self):
+  def test_convert_call_site_decorator(self):
 
     class TestClass(object):
 
       def called_member(self, a):
-        return math_ops.negative(a)
+        if a < 0:
+          a = -a
+        return a
 
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
-          x //= api.graph_ready(self.called_member(a))
+        while tf.reduce_sum(x) > s:
+          x //= api.converted_call(self.called_member, False, False, {}, self,
+                                   a)
         return x
 
     tc = TestClass()
@@ -165,8 +178,9 @@ class ApiTest(test.TestCase):
       self.assertListEqual([0, 1], sess.run(x).tolist())
 
   def test_to_graph_basic(self):
+
     def test_fn(x, s):
-      while math_ops.reduce_sum(x) > s:
+      while tf.reduce_sum(x) > s:
         x //= 2
       return x
 
@@ -177,8 +191,9 @@ class ApiTest(test.TestCase):
       self.assertListEqual([1, 2], sess.run(x).tolist())
 
   def test_to_code_basic(self):
+
     def test_fn(x, s):
-      while math_ops.reduce_sum(x) > s:
+      while tf.reduce_sum(x) > s:
         x /= 2
       return x
 
diff --git a/tensorflow/contrib/py2tf/impl/config.py b/tensorflow/contrib/py2tf/impl/config.py
index c90e85c96b690b7781358b173e5d83fe60e29c00..bdbc6663dd65ed66c55ad2d2e52428084bbea219 100644
--- a/tensorflow/contrib/py2tf/impl/config.py
+++ b/tensorflow/contrib/py2tf/impl/config.py
@@ -31,12 +31,16 @@ PYTHON_LITERALS = {
 DEFAULT_UNCOMPILED_MODULES = set((
     ('tensorflow',),
     (utils.__name__,),
+
+    # All of tensorflow's subpackages. Unlike the root tf module, they don't
+    # have well-known names. Not referring to the module directly to avoid
+    # circular imports.
+    (utils.__name__[:-len('.contrib.py2tf.utils')],),
 ))
 
 NO_SIDE_EFFECT_CONSTRUCTORS = set(('tensorflow',))
 
 # TODO(mdan): Also allow controlling the generated names (for testability).
-# TODO(mdan): Make sure copybara renames the reference below.
 COMPILED_IMPORT_STATEMENTS = (
     'from __future__ import print_function',
     'import tensorflow as tf',
diff --git a/tensorflow/contrib/py2tf/impl/conversion.py b/tensorflow/contrib/py2tf/impl/conversion.py
index 044de3356834a6ac7072c7390ce52385499eee85..37b24ab55fdd1b03e12e9afe06530e3c26218b61 100644
--- a/tensorflow/contrib/py2tf/impl/conversion.py
+++ b/tensorflow/contrib/py2tf/impl/conversion.py
@@ -29,9 +29,12 @@ from tensorflow.contrib.py2tf.converters import continue_statements
 from tensorflow.contrib.py2tf.converters import control_flow
 from tensorflow.contrib.py2tf.converters import decorators
 from tensorflow.contrib.py2tf.converters import for_loops
+from tensorflow.contrib.py2tf.converters import ifexp
+from tensorflow.contrib.py2tf.converters import lists
 from tensorflow.contrib.py2tf.converters import logical_expressions
 from tensorflow.contrib.py2tf.converters import name_scopes
 from tensorflow.contrib.py2tf.converters import side_effect_guards
+from tensorflow.contrib.py2tf.converters import single_return
 from tensorflow.contrib.py2tf.impl import config
 from tensorflow.contrib.py2tf.impl import naming
 from tensorflow.contrib.py2tf.pyct import context
@@ -41,6 +44,7 @@ from tensorflow.contrib.py2tf.pyct import qual_names
 from tensorflow.contrib.py2tf.pyct.static_analysis import activity
 from tensorflow.contrib.py2tf.pyct.static_analysis import live_values
 from tensorflow.contrib.py2tf.pyct.static_analysis import type_info
+from tensorflow.contrib.py2tf.utils import type_hints
 from tensorflow.python.util import tf_inspect
 
 
@@ -48,7 +52,9 @@ from tensorflow.python.util import tf_inspect
 
 
 class ConversionMap(object):
-  """ConversionMaps keep track of converting function hierarchies.
+  """ConversionMap keeps track of converting function hierarchies.
+
+  This object is mutable, and is updated as functions are converted.
 
   Attributes:
     recursive: Whether to recusrively convert any functions that the decorator
@@ -97,6 +103,24 @@ class ConversionMap(object):
     self.dependency_cache[original_entity] = converted_ast
 
 
+def is_whitelisted_for_graph(o):
+  """Check whether an entity is whitelisted for use in graph mode.
+
+  Examples of whitelisted entities include all members of the tensorflow
+  package.
+
+  Args:
+    o: A Python entity.
+  Returns:
+    Boolean
+  """
+  m = tf_inspect.getmodule(o)
+  for prefix, in config.DEFAULT_UNCOMPILED_MODULES:
+    if m.__name__.startswith(prefix):
+      return True
+  return False
+
+
 def entity_to_graph(o, conversion_map, arg_values, arg_types):
   """Compile a Python entity into equivalent TensorFlow.
 
@@ -136,14 +160,20 @@ def entity_to_graph(o, conversion_map, arg_values, arg_types):
 
   conversion_map.add_to_cache(o, node)
   if conversion_map.recursive:
-    for obj in conversion_map.name_map.keys():
-      if obj not in conversion_map.dependency_cache:
-        if (hasattr(obj, 'im_class') and
-            getattr(obj, 'im_class') not in conversion_map.partial_types):
-          # Class members are converted with their objects, unless they're
-          # only converted partially.
-          continue
-        entity_to_graph(obj, conversion_map, {}, {})
+    while True:
+      candidate = None
+      for obj in conversion_map.name_map.keys():
+        if obj in conversion_map.dependency_cache:
+          continue
+        if (hasattr(obj, 'im_class') and
+            getattr(obj, 'im_class') not in conversion_map.partial_types):
+          # Class members are converted with their objects, unless partial.
+          continue
+        candidate = obj
+        break
+      if candidate is None:
+        break
+      entity_to_graph(candidate, conversion_map, {}, {})
 
   return node, new_name
 
@@ -151,9 +181,10 @@ def entity_to_graph(o, conversion_map, arg_values, arg_types):
 def class_to_graph(c, conversion_map):
   """Specialization of `entity_to_graph` for classes."""
   converted_members = {}
-  members = tf_inspect.getmembers(c, predicate=tf_inspect.ismethod)
+  method_filter = lambda m: tf_inspect.isfunction(m) or tf_inspect.ismethod(m)
+  members = tf_inspect.getmembers(c, predicate=method_filter)
   if not members:
-    raise ValueError('Cannot convert %s: it has no member methods.')
+    raise ValueError('Cannot convert %s: it has no member methods.' % c)
 
   class_namespace = None
   for _, m in members:
@@ -173,7 +204,7 @@ def class_to_graph(c, conversion_map):
       class_name,
       bases=[],
       keywords=[],
-      body=converted_members.values(),
+      body=list(converted_members.values()),
       decorator_list=[])
 
   return node, class_name
@@ -215,7 +246,8 @@ def function_to_graph(f, conversion_map, arg_values, arg_types,
       arg_values=arg_values,
       arg_types=arg_types,
       owner_type=owner_type,
-      recursive=conversion_map.recursive)
+      recursive=conversion_map.recursive,
+      type_annotation_func=type_hints.set_element_type)
   node, deps = node_to_graph(node, ctx, conversion_map.nocompile_decorators)
 
   # TODO(mdan): This somewhat duplicates the call rename logic in call_treest.py
@@ -268,10 +300,15 @@ def node_to_graph(node, ctx, nocompile_decorators):
   # to re-run the analysis.
 
   node = _static_analysis_pass(node, ctx)
+
+  # TODO(mdan): Clean this up.
+  # Some intermediate analyses are not required, and some comments got orphaned.
+
   # Past this point, line numbers are no longer accurate so we ignore the
   # source.
   # TODO(mdan): Is it feasible to reconstruct intermediate source code?
   ctx.source_code = None
+  node = ifexp.transform(node, ctx)
   node, deps = decorators.transform(node, nocompile_decorators)
   node = break_statements.transform(node, ctx)
   node = asserts.transform(node, ctx)
@@ -283,6 +320,10 @@ def node_to_graph(node, ctx, nocompile_decorators):
   ctx.namespace['len'] = len
 
   node = _static_analysis_pass(node, ctx)
+  node = single_return.transform(node, ctx)
+
+  node = _static_analysis_pass(node, ctx)
+  node = lists.transform(node, ctx)
   node = for_loops.transform(node, ctx)
   # for_loops may insert new global references.
   node = builtin_functions.transform(node, ctx)
@@ -294,7 +335,7 @@ def node_to_graph(node, ctx, nocompile_decorators):
 
   # control_flow may create new symbols and change scopes.
   node = _static_analysis_pass(node, ctx)
-  node = logical_expressions.transform(node)
+  node = logical_expressions.transform(node, ctx)
   node = side_effect_guards.transform(node, ctx)
   node = name_scopes.transform(node, ctx)
 
diff --git a/tensorflow/contrib/py2tf/impl/conversion_test.py b/tensorflow/contrib/py2tf/impl/conversion_test.py
index 7816f958575d58236007fc7f0f1f3d1f3a99c4cf..9ff256aace7a0e7ac5e7ac07e580b8bed7d8df6f 100644
--- a/tensorflow/contrib/py2tf/impl/conversion_test.py
+++ b/tensorflow/contrib/py2tf/impl/conversion_test.py
@@ -20,12 +20,23 @@ from __future__ import print_function
 
 import gast
 
+from tensorflow.contrib.py2tf import utils
 from tensorflow.contrib.py2tf.impl import conversion
+from tensorflow.python.framework import constant_op
 from tensorflow.python.platform import test
 
 
 class ConversionTest(test.TestCase):
 
+  def test_is_whitelisted_for_graph(self):
+
+    def test_fn():
+      return constant_op.constant(1)
+
+    self.assertFalse(conversion.is_whitelisted_for_graph(test_fn))
+    self.assertTrue(conversion.is_whitelisted_for_graph(utils))
+    self.assertTrue(conversion.is_whitelisted_for_graph(constant_op.constant))
+
   def test_entity_to_graph_unsupported_types(self):
     with self.assertRaises(ValueError):
       conversion_map = conversion.ConversionMap(True, (), (), None)
diff --git a/tensorflow/contrib/py2tf/pyct/anno.py b/tensorflow/contrib/py2tf/pyct/anno.py
index 7a0528b6d0b65b6604930b7a13d8493af9d61f02..cc4a7edf02ed7556c9a552d8730e4c7875038c83 100644
--- a/tensorflow/contrib/py2tf/pyct/anno.py
+++ b/tensorflow/contrib/py2tf/pyct/anno.py
@@ -70,3 +70,8 @@ def delanno(node, key, field_name='___pyct_anno'):
   if not annotations:
     delattr(node, field_name)
     node._fields = tuple(f for f in node._fields if f != field_name)
+
+
+def copyanno(from_node, to_node, key, field_name='___pyct_anno'):
+  if hasanno(from_node, key, field_name):
+    setanno(to_node, key, getanno(from_node, key, field_name), field_name)
diff --git a/tensorflow/contrib/py2tf/pyct/anno_test.py b/tensorflow/contrib/py2tf/pyct/anno_test.py
index ff40bfe1f50ae731648afdf509c26c3a70d3f6cb..6c29918fdfaaa0224f20a2c3cb2ea8088f3eb52b 100644
--- a/tensorflow/contrib/py2tf/pyct/anno_test.py
+++ b/tensorflow/contrib/py2tf/pyct/anno_test.py
@@ -24,6 +24,9 @@ from tensorflow.contrib.py2tf.pyct import anno
 from tensorflow.python.platform import test
 
 
+# TODO(mdan): Consider strong types instead of primitives.
+
+
 class AnnoTest(test.TestCase):
 
   def test_basic(self):
@@ -42,6 +45,17 @@ class AnnoTest(test.TestCase):
     with self.assertRaises(AttributeError):
       anno.getanno(node, 'foo')
 
+  def test_copyanno(self):
+    node_1 = ast.Name()
+    anno.setanno(node_1, 'foo', 3)
+
+    node_2 = ast.Name()
+    anno.copyanno(node_1, node_2, 'foo')
+    anno.copyanno(node_1, node_2, 'bar')
+
+    self.assertTrue(anno.hasanno(node_2, 'foo'))
+    self.assertFalse(anno.hasanno(node_2, 'bar'))
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/py2tf/pyct/ast_util.py b/tensorflow/contrib/py2tf/pyct/ast_util.py
index f916775b9cf3cec960ec2896c334f1d737862205..6f7e656c2602efe40d6091532503b86c092f0162 100644
--- a/tensorflow/contrib/py2tf/pyct/ast_util.py
+++ b/tensorflow/contrib/py2tf/pyct/ast_util.py
@@ -94,3 +94,12 @@ def rename_symbols(node, name_map):
   elif isinstance(node, tuple):
     return tuple(renamer.visit(n) for n in node)
   return renamer.visit(node)
+
+
+def keywords_to_dict(keywords):
+  keys = []
+  values = []
+  for kw in keywords:
+    keys.append(gast.Str(kw.arg))
+    values.append(kw.value)
+  return gast.Dict(keys=keys, values=values)
diff --git a/tensorflow/contrib/py2tf/pyct/ast_util_test.py b/tensorflow/contrib/py2tf/pyct/ast_util_test.py
index e0b00c178168f96e656c57cc75a76e6da8af1d8a..8d123679e3f06490474c5caa401bb27f0ca87ab4 100644
--- a/tensorflow/contrib/py2tf/pyct/ast_util_test.py
+++ b/tensorflow/contrib/py2tf/pyct/ast_util_test.py
@@ -21,6 +21,8 @@ from __future__ import print_function
 import ast
 
 from tensorflow.contrib.py2tf.pyct import ast_util
+from tensorflow.contrib.py2tf.pyct import compiler
+from tensorflow.contrib.py2tf.pyct import parser
 from tensorflow.contrib.py2tf.pyct import qual_names
 from tensorflow.python.platform import test
 
@@ -33,15 +35,15 @@ class AstUtilTest(test.TestCase):
         ast.Name('b', ast.Load()),
         ast.Attribute(ast.Name('b', None), 'c', ast.Store()),
         ast.Attribute(
-            ast.Attribute(ast.Name('b', None), 'c', ast.Load()), 'd',
-            None)
+            ast.Attribute(ast.Name('b', None), 'c', ast.Load()), 'd', None)
     ], None)
     node = qual_names.resolve(node)
     node = ast_util.rename_symbols(
-        node,
-        {
-            qual_names.QN('a'): qual_names.QN('renamed_a'),
-            qual_names.QN('b.c'): qual_names.QN('renamed_b_c'),
+        node, {
+            qual_names.QN('a'):
+                qual_names.QN('renamed_a'),
+            qual_names.QN(qual_names.QN('b'), attr='c'):
+                qual_names.QN('renamed_b_c'),
         })
 
     self.assertEqual(node.elts[0].id, 'renamed_a')
@@ -74,6 +76,17 @@ class AstUtilTest(test.TestCase):
     self.assertFalse(ret is new_node.body[0])
     self.assertFalse(hasattr(new_node.body[0], '__foo'))
 
+  def test_keywords_to_dict(self):
+    keywords = parser.parse_expression('f(a=b, c=1, d=\'e\')').keywords
+    d = ast_util.keywords_to_dict(keywords)
+    # Make sure we generate a usable dict node by attaching it to a variable and
+    # compiling everything.
+    output = parser.parse_str('b = 3')
+    output.body += (ast.Assign([ast.Name(id='d', ctx=ast.Store())], d),)
+    result, _ = compiler.ast_to_object(output)
+    self.assertDictEqual(result.d, {'a': 3, 'c': 1, 'd': 'e'})
+    print(d)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/py2tf/pyct/compiler.py b/tensorflow/contrib/py2tf/pyct/compiler.py
index 51cf6930e8bcb3728ee55bf5d4781f01a5ef73bd..24c4517afa89147101f80af3ef60237132c1144c 100644
--- a/tensorflow/contrib/py2tf/pyct/compiler.py
+++ b/tensorflow/contrib/py2tf/pyct/compiler.py
@@ -31,7 +31,7 @@ import astor
 import gast
 
 
-def ast_to_source(node, indentation):
+def ast_to_source(node, indentation='  '):
   """Return the source code of given AST."""
   if isinstance(node, gast.AST):
     node = gast.gast_to_ast(node)
@@ -39,7 +39,10 @@ def ast_to_source(node, indentation):
                                             astor.string_repr.pretty_string)
   generator.visit(node)
   generator.result.append('\n')
-  return astor.source_repr.pretty_source(generator.result).lstrip()
+  # In some versions of Python, literals may appear as actual values. This
+  # ensures everything is a string.
+  code = map(str, generator.result)
+  return astor.source_repr.pretty_source(code).lstrip()
 
 
 def ast_to_object(
diff --git a/tensorflow/contrib/py2tf/pyct/compiler_test.py b/tensorflow/contrib/py2tf/pyct/compiler_test.py
index c1f84238efa7dd6fc0748748a2cb4f074572b4c6..243f4c81538f5853a01ff444f2ff16ccf7cd5d62 100644
--- a/tensorflow/contrib/py2tf/pyct/compiler_test.py
+++ b/tensorflow/contrib/py2tf/pyct/compiler_test.py
@@ -23,11 +23,28 @@ import textwrap
 import gast
 
 from tensorflow.contrib.py2tf.pyct import compiler
+from tensorflow.contrib.py2tf.pyct import parser
 from tensorflow.python.platform import test
+from tensorflow.python.util import tf_inspect
 
 
 class CompilerTest(test.TestCase):
 
+  def test_parser_compile_idempotent(self):
+
+    def test_fn(x):
+      a = True
+      b = ''
+      if a:
+        b = x + 1
+      return b
+
+    self.assertEqual(
+        textwrap.dedent(tf_inspect.getsource(test_fn)),
+        tf_inspect.getsource(
+            compiler.ast_to_object(
+                parser.parse_entity(test_fn)[0].body[0])[0].test_fn))
+
   def test_ast_to_source(self):
     node = gast.If(
         test=gast.Num(1),
diff --git a/tensorflow/contrib/py2tf/pyct/context.py b/tensorflow/contrib/py2tf/pyct/context.py
index 4fcf2a687d58af951adfc0dcf52ff7303d2b17f5..b34015cfd2888f0dbeb6492b9e7335d561bf4763 100644
--- a/tensorflow/contrib/py2tf/pyct/context.py
+++ b/tensorflow/contrib/py2tf/pyct/context.py
@@ -22,6 +22,8 @@ from __future__ import print_function
 class EntityContext(object):
   """Contains information about an entity, like source code.
 
+  In general, objects of this class should be considered immutable.
+
   Attributes:
     namer: Namer that matches the contract of all converters.
     source_code: The entity's source code.
@@ -33,8 +35,9 @@ class EntityContext(object):
     owner_type: The surrounding class type of the function, if present.
   """
 
+  # TODO(mdan): Remove the default and update tests.
   def __init__(self, namer, source_code, source_file, namespace, arg_values,
-               arg_types, owner_type, recursive):
+               arg_types, owner_type, recursive, type_annotation_func=None):
     self.namer = namer
     self.source_code = source_code
     self.source_file = source_file
@@ -43,3 +46,4 @@ class EntityContext(object):
     self.arg_types = {} if arg_types is None else arg_types
     self.owner_type = owner_type
     self.recursive = recursive
+    self.type_annotation_func = type_annotation_func
diff --git a/tensorflow/contrib/py2tf/pyct/parser.py b/tensorflow/contrib/py2tf/pyct/parser.py
index dc7df883b349becd860bb0dbceab22cb39c750b5..c961efa892df6a21804dae8f52ef64bf99cd409e 100644
--- a/tensorflow/contrib/py2tf/pyct/parser.py
+++ b/tensorflow/contrib/py2tf/pyct/parser.py
@@ -29,12 +29,30 @@ from tensorflow.python.util import tf_inspect
 
 
 def parse_entity(entity):
-  """Return the AST of given entity."""
+  """Returns the AST of given entity."""
   source = tf_inspect.getsource(entity)
   source = textwrap.dedent(source)
   return parse_str(source), source
 
 
 def parse_str(src):
-  """Return the AST of given piece of code."""
+  """Returns the AST of given piece of code."""
   return gast.parse(src)
+
+
+def parse_expression(src):
+  """Returns the AST of the given expression.
+
+  Args:
+    src: A piece of code that represents a single Python expression.
+  Returns:
+    A gast.AST object.
+  Raises:
+    ValueError: if src does not consist of a single Expression.
+  """
+  node = parse_str(src)
+  assert isinstance(node, gast.Module)
+  if len(node.body) != 1 or not isinstance(node.body[0], gast.Expr):
+    raise ValueError(
+        'Expected a single expression, found instead %s' % node.body)
+  return node.body[0].value
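Note on parse_expression: the docstring promises a ValueError whenever src is not a single expression, so the guard has to reject input that fails either condition (wrong statement count, or a statement that is not a bare expression). A minimal sketch of that contract, assuming the standard library `ast` module as a stand-in for `gast`:

```python
import ast


def parse_expression(src):
  """Returns the node of the single expression in src (stdlib sketch)."""
  module = ast.parse(src)
  # Reject the input unless it is exactly one statement and that statement is
  # a bare expression. Testing the length first also guards the body[0] access
  # for empty input.
  if len(module.body) != 1 or not isinstance(module.body[0], ast.Expr):
    raise ValueError(
        'Expected a single expression, found instead %s' % module.body)
  return module.body[0].value


node = parse_expression('a.b')
```

With this guard, an empty string, an assignment such as `a = 1`, and two statements such as `a; b` are all rejected with ValueError.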
diff --git a/tensorflow/contrib/py2tf/pyct/parser_test.py b/tensorflow/contrib/py2tf/pyct/parser_test.py
index f35dfa04c70dc191078248c32f9a04d28133129a..c58ffc7e0c5a72cf5e92899852833e8bd6360222 100644
--- a/tensorflow/contrib/py2tf/pyct/parser_test.py
+++ b/tensorflow/contrib/py2tf/pyct/parser_test.py
@@ -24,24 +24,29 @@ from tensorflow.contrib.py2tf.pyct import parser
 from tensorflow.python.platform import test
 
 
-def f(x):
-  return x + 1
-
-
 class ParserTest(test.TestCase):
 
   def test_parse_entity(self):
+
+    def f(x):
+      return x + 1
+
     mod, _ = parser.parse_entity(f)
     self.assertEqual('f', mod.body[0].name)
 
   def test_parse_str(self):
     mod = parser.parse_str(
         textwrap.dedent("""
-        def f(x):
-          return x + 1
+            def f(x):
+              return x + 1
     """))
     self.assertEqual('f', mod.body[0].name)
 
+  def test_parse_expression(self):
+    node = parser.parse_expression('a.b')
+    self.assertEqual('a', node.value.id)
+    self.assertEqual('b', node.attr)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/py2tf/pyct/qual_names.py b/tensorflow/contrib/py2tf/pyct/qual_names.py
index 8717ee6cff198ff31f6cbdb7213e5a8dd3df1149..7dec13db920abd9bf262eb7170a8da52963318fe 100644
--- a/tensorflow/contrib/py2tf/pyct/qual_names.py
+++ b/tensorflow/contrib/py2tf/pyct/qual_names.py
@@ -25,34 +25,87 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import collections
+
 import gast
 
 from tensorflow.contrib.py2tf.pyct import anno
 
 
+class Symbol(collections.namedtuple('Symbol', ['name'])):
+  """Represents a Python symbol."""
+
+
+class StringLiteral(collections.namedtuple('StringLiteral', ['value'])):
+  """Represents a Python string literal."""
+
+  def __str__(self):
+    return '\'%s\'' % self.value
+
+  def __repr__(self):
+    return str(self)
+
+
+class NumberLiteral(collections.namedtuple('NumberLiteral', ['value'])):
+  """Represents a Python numeric literal."""
+
+  def __str__(self):
+    return '%s' % self.value
+
+  def __repr__(self):
+    return str(self)
+
+
+# TODO(mdan): Use subclasses to remove the has_attr and has_subscript booleans.
 class QN(object):
   """Represents a qualified name."""
 
-  def __init__(self, base, attr=None):
-    if attr:
+  def __init__(self, base, attr=None, subscript=None):
+    if attr is not None and subscript is not None:
+      raise ValueError('A QN can only be either an attr or a subscript, not '
+                       'both: attr={}, subscript={}.'.format(attr, subscript))
+    self._has_attr = False
+    self._has_subscript = False
+
+    if attr is not None:
+      if not isinstance(base, QN):
+        raise ValueError(
+            'For attribute QNs, base must be a QN; got instead "%s"' % base)
+      if not isinstance(attr, str):
+        raise ValueError('attr may only be a string; got instead "%s"' % attr)
+      self._parent = base
+      # TODO(mdan): Get rid of the tuple - it can only have 1 or 2 elements now.
+      self.qn = (base, attr)
+      self._has_attr = True
+
+    elif subscript is not None:
       if not isinstance(base, QN):
-        raise ValueError('For attribute QNs, base must be a QN.')
+        raise ValueError('For subscript QNs, base must be a QN.')
       self._parent = base
-      self.qn = base.qn + (attr,)
+      self.qn = (base, subscript)
+      self._has_subscript = True
+
     else:
-      if isinstance(base, QN):
-        if base.is_composite():
-          self._parent = base.parent
-        else:
-          self._parent = None
-        self.qn = base.qn
-      else:
-        self._parent = None
-        self.qn = tuple(base.split('.'))
+      if not isinstance(base, (str, StringLiteral, NumberLiteral)):
+        # TODO(mdan): Require Symbol instead of string.
+        raise ValueError(
+            'For simple QNs, base must be a string or a Literal object.')
+      assert '.' not in base and '[' not in base and ']' not in base
+      self._parent = None
+      self.qn = (base,)
+
+  def is_symbol(self):
+    return isinstance(self.qn[0], str)
 
   def is_composite(self):
     return len(self.qn) > 1
 
+  def has_subscript(self):
+    return self._has_subscript
+
+  def has_attr(self):
+    return self._has_attr
+
   @property
   def parent(self):
     if self._parent is None:
@@ -60,26 +113,54 @@ class QN(object):
     return self._parent
 
   def __hash__(self):
-    return hash(self.qn)
+    return hash(self.qn + (self._has_attr, self._has_subscript))
 
   def __eq__(self, other):
-    return self.qn == other.qn
+    return (isinstance(other, QN) and self.qn == other.qn and
+            self.has_subscript() == other.has_subscript() and
+            self.has_attr() == other.has_attr())
 
   def __str__(self):
-    return '.'.join(self.qn)
+    if self.has_subscript():
+      return str(self.qn[0]) + '[' + str(self.qn[1]) + ']'
+    if self.has_attr():
+      return '.'.join(map(str, self.qn))
+    else:
+      return str(self.qn[0])
 
   def __repr__(self):
     return str(self)
 
   def ssf(self):
     """Simple symbol form."""
-    return '_'.join(self.qn)
+    ssfs = [n.ssf() if isinstance(n, QN) else n for n in self.qn]
+    ssf_string = ''
+    for i in range(0, len(self.qn) - 1):
+      if self.has_subscript():
+        delimiter = '_sub_'
+      else:
+        delimiter = '_'
+      ssf_string += ssfs[i] + delimiter
+    return ssf_string + ssfs[-1]
 
   def ast(self):
     # The caller must adjust the context appropriately.
-    if self.is_composite():
+    if self.has_subscript():
+      return gast.Subscript(self.parent.ast(), gast.Index(self.qn[-1].ast()),
+                            None)
+    if self.has_attr():
       return gast.Attribute(self.parent.ast(), self.qn[-1], None)
-    return gast.Name(self.qn[0], None, None)
+
+    base = self.qn[0]
+    if isinstance(base, str):
+      return gast.Name(base, None, None)
+    elif isinstance(base, StringLiteral):
+      return gast.Str(base.value)
+    elif isinstance(base, NumberLiteral):
+      return gast.Num(base.value)
+    else:
+      assert False, ('the constructor should prevent types other than '
+                     'str, StringLiteral and NumberLiteral')
 
 
 class QnResolver(gast.NodeTransformer):
@@ -89,14 +170,34 @@ class QnResolver(gast.NodeTransformer):
   """
 
   def visit_Name(self, node):
-    self.generic_visit(node)
+    node = self.generic_visit(node)
     anno.setanno(node, anno.Basic.QN, QN(node.id))
     return node
 
   def visit_Attribute(self, node):
-    self.generic_visit(node)
-    anno.setanno(node, anno.Basic.QN,
-                 QN(anno.getanno(node.value, anno.Basic.QN), node.attr))
+    node = self.generic_visit(node)
+    if anno.hasanno(node.value, anno.Basic.QN):
+      anno.setanno(node, anno.Basic.QN,
+                   QN(anno.getanno(node.value, anno.Basic.QN), attr=node.attr))
+    return node
+
+  def visit_Subscript(self, node):
+    node = self.generic_visit(node)
+    s = node.slice
+    if not isinstance(s, gast.Index):
+      # TODO(mdan): Support range and multi-dimensional indices.
+      # Continuing silently because some demos use these.
+      return node
+    if isinstance(s.value, gast.Num):
+      subscript = QN(NumberLiteral(s.value.n))
+    elif isinstance(s.value, gast.Str):
+      subscript = QN(StringLiteral(s.value.s))
+    else:
+      subscript = anno.getanno(node.slice.value, anno.Basic.QN)
+    if anno.hasanno(node.value, anno.Basic.QN):
+      anno.setanno(node, anno.Basic.QN,
+                   QN(anno.getanno(node.value, anno.Basic.QN),
+                      subscript=subscript))
     return node
 
 
diff --git a/tensorflow/contrib/py2tf/pyct/qual_names_test.py b/tensorflow/contrib/py2tf/pyct/qual_names_test.py
index 1b1eee2deca18bb0540c17d6ee85d421602aa2b7..6583fa243bd82c6f4a03474445af0fd9441d1d4a 100644
--- a/tensorflow/contrib/py2tf/pyct/qual_names_test.py
+++ b/tensorflow/contrib/py2tf/pyct/qual_names_test.py
@@ -23,13 +23,15 @@ import textwrap
 from tensorflow.contrib.py2tf.pyct import anno
 from tensorflow.contrib.py2tf.pyct import parser
 from tensorflow.contrib.py2tf.pyct import qual_names
+from tensorflow.contrib.py2tf.pyct.qual_names import QN
+from tensorflow.contrib.py2tf.pyct.qual_names import resolve
 from tensorflow.python.platform import test
 
 
 class QNTest(test.TestCase):
 
   def test_basic(self):
-    a = qual_names.QN('a')
+    a = QN('a')
     self.assertEqual(a.qn, ('a',))
     self.assertEqual(str(a), 'a')
     self.assertEqual(a.ssf(), 'a')
@@ -38,8 +40,8 @@ class QNTest(test.TestCase):
     with self.assertRaises(ValueError):
       _ = a.parent
 
-    a_b = qual_names.QN(a, 'b')
-    self.assertEqual(a_b.qn, ('a', 'b'))
+    a_b = QN(a, attr='b')
+    self.assertEqual(a_b.qn, (a, 'b'))
     self.assertEqual(str(a_b), 'a.b')
     self.assertEqual(a_b.ssf(), 'a_b')
     self.assertEqual(a_b.ast().value.id, 'a')
@@ -47,13 +49,48 @@ class QNTest(test.TestCase):
     self.assertTrue(a_b.is_composite())
     self.assertEqual(a_b.parent.qn, ('a',))
 
-    a2 = qual_names.QN(a)
+  def test_subscripts(self):
+    a = QN('a')
+    b = QN('b')
+    a_sub_b = QN(a, subscript=b)
+    self.assertEqual(a_sub_b.qn, (a, b))
+    self.assertEqual(str(a_sub_b), 'a[b]')
+    self.assertEqual(a_sub_b.ssf(), 'a_sub_b')
+    self.assertEqual(a_sub_b.ast().value.id, 'a')
+    self.assertEqual(a_sub_b.ast().slice.value.id, 'b')
+    self.assertTrue(a_sub_b.is_composite())
+    self.assertTrue(a_sub_b.has_subscript())
+    self.assertEqual(a_sub_b.parent.qn, ('a',))
+
+    c = QN('c')
+    b_sub_c = QN(b, subscript=c)
+    a_sub_b_sub_c = QN(a, subscript=b_sub_c)
+    self.assertEqual(a_sub_b_sub_c.qn, (a, b_sub_c))
+    self.assertTrue(a_sub_b.is_composite())
+    self.assertTrue(a_sub_b_sub_c.is_composite())
+    self.assertTrue(a_sub_b.has_subscript())
+    self.assertTrue(a_sub_b_sub_c.has_subscript())
+    self.assertEqual(b_sub_c.qn, (b, c))
+    self.assertEqual(str(a_sub_b_sub_c), 'a[b[c]]')
+    self.assertEqual(a_sub_b_sub_c.ssf(), 'a_sub_b_sub_c')
+    self.assertEqual(a_sub_b_sub_c.ast().value.id, 'a')
+    self.assertEqual(a_sub_b_sub_c.ast().slice.value.value.id, 'b')
+    self.assertEqual(a_sub_b_sub_c.ast().slice.value.slice.value.id, 'c')
+    self.assertEqual(b_sub_c.ast().slice.value.id, 'c')
+    self.assertEqual(a_sub_b_sub_c.parent.qn, ('a',))
+    with self.assertRaises(ValueError):
+      QN('a', 'b')
+
+  def test_equality(self):
+    a = QN('a')
+    a2 = QN('a')
+    a_b = QN(a, attr='b')
     self.assertEqual(a2.qn, ('a',))
     with self.assertRaises(ValueError):
       _ = a.parent
 
-    a_b2 = qual_names.QN(a_b)
-    self.assertEqual(a_b2.qn, ('a', 'b'))
+    a_b2 = QN(a, attr='b')
+    self.assertEqual(a_b2.qn, (a, 'b'))
     self.assertEqual(a_b2.parent.qn, ('a',))
 
     self.assertTrue(a2 == a)
@@ -65,16 +102,57 @@ class QNTest(test.TestCase):
     self.assertTrue(a_b2 == a_b)
     self.assertFalse(a_b2 is a_b)
     self.assertFalse(a_b2 == a)
+    a_sub_b = QN(a, subscript='b')
+    a_sub_b2 = QN(a, subscript='b')
+    self.assertTrue(a_sub_b == a_sub_b2)
+    self.assertFalse(a_sub_b == a_b)
 
-    with self.assertRaises(ValueError):
-      qual_names.QN('a', 'b')
+  def test_nested_attrs_subscripts(self):
+    a = QN('a')
+    b = QN('b')
+    c = QN('c')
+    b_sub_c = QN(b, subscript=c)
+    a_sub_b_sub_c = QN(a, subscript=b_sub_c)
+
+    b_dot_c = QN(b, attr='c')
+    a_sub__b_dot_c = QN(a, subscript=b_dot_c)
+
+    a_sub_b = QN(a, subscript=b)
+    a_sub_b__dot_c = QN(a_sub_b, attr='c')
+
+    a_dot_b = QN(a, attr='b')
+    a_dot_b_sub_c = QN(a_dot_b, subscript=c)
+
+    self.assertEqual(str(a_sub_b_sub_c), 'a[b[c]]')
+    self.assertEqual(str(a_sub__b_dot_c), 'a[b.c]')
+    self.assertEqual(str(a_sub_b__dot_c), 'a[b].c')
+    self.assertEqual(str(a_dot_b_sub_c), 'a.b[c]')
+
+    self.assertNotEqual(a_sub_b_sub_c, a_sub__b_dot_c)
+    self.assertNotEqual(a_sub_b_sub_c, a_sub_b__dot_c)
+    self.assertNotEqual(a_sub_b_sub_c, a_dot_b_sub_c)
+
+    self.assertNotEqual(a_sub__b_dot_c, a_sub_b__dot_c)
+    self.assertNotEqual(a_sub__b_dot_c, a_dot_b_sub_c)
+
+    self.assertNotEqual(a_sub_b__dot_c, a_dot_b_sub_c)
 
   def test_hashable(self):
-    d = {qual_names.QN('a'): 'a', qual_names.QN('b'): 'b'}
+    d = {QN('a'): 'a', QN('b'): 'b'}
+    self.assertEqual(d[QN('a')], 'a')
+    self.assertEqual(d[QN('b')], 'b')
+    self.assertTrue(QN('c') not in d)
+
+  def test_literals(self):
+    a = QN('a')
+    a_sub_str_b = QN(a, subscript=QN(qual_names.StringLiteral('b')))
+    a_sub_b = QN(a, subscript=QN('b'))
 
-    self.assertEqual(d[qual_names.QN('a')], 'a')
-    self.assertEqual(d[qual_names.QN('b')], 'b')
-    self.assertTrue(qual_names.QN('c') not in d)
+    self.assertNotEqual(a_sub_str_b, a_sub_b)
+    self.assertNotEqual(hash(a_sub_str_b), hash(a_sub_b))
+
+    a_sub_three = QN(a, subscript=QN(qual_names.NumberLiteral(3)))
+    self.assertEqual(a_sub_three.ast().slice.value.n, 3)
 
 
 class QNResolverTest(test.TestCase):
@@ -90,7 +168,7 @@ class QNResolverTest(test.TestCase):
       [f, (g.h.i)]
       j(k, l)
     """
-    nodes = qual_names.resolve(parser.parse_str(textwrap.dedent(samples)))
+    nodes = resolve(parser.parse_str(textwrap.dedent(samples)))
     nodes = tuple(n.value for n in nodes.body)
 
     self.assertQNStringIs(nodes[0], 'a')
@@ -103,6 +181,51 @@ class QNResolverTest(test.TestCase):
     self.assertQNStringIs(nodes[4].args[0], 'k')
     self.assertQNStringIs(nodes[4].args[1], 'l')
 
+  def test_subscript_resolve(self):
+    samples = """
+      x[i]
+      x[i.b]
+      a.b[c]
+      a.b[x.y]
+      a[z[c]]
+      a[b[c[d]]]
+      a[b].c
+      a.b.c[d].e.f
+      a.b[c[d]].e.f
+      a.b[c[d.e.f].g].h
+    """
+    nodes = resolve(parser.parse_str(textwrap.dedent(samples)))
+    nodes = tuple(n.value for n in nodes.body)
+
+    self.assertQNStringIs(nodes[0], 'x[i]')
+    self.assertQNStringIs(nodes[1], 'x[i.b]')
+    self.assertQNStringIs(nodes[2], 'a.b[c]')
+    self.assertQNStringIs(nodes[3], 'a.b[x.y]')
+    self.assertQNStringIs(nodes[4], 'a[z[c]]')
+    self.assertQNStringIs(nodes[5], 'a[b[c[d]]]')
+    self.assertQNStringIs(nodes[6], 'a[b].c')
+    self.assertQNStringIs(nodes[7], 'a.b.c[d].e.f')
+    self.assertQNStringIs(nodes[8], 'a.b[c[d]].e.f')
+    self.assertQNStringIs(nodes[9], 'a.b[c[d.e.f].g].h')
+
+  def test_function_calls(self):
+    samples = """
+      a.b
+      a.b()
+      a().b
+      z[i]
+      z[i]()
+      z()[i]
+    """
+    nodes = resolve(parser.parse_str(textwrap.dedent(samples)))
+    nodes = tuple(n.value for n in nodes.body)
+    self.assertQNStringIs(nodes[0], 'a.b')
+    self.assertQNStringIs(nodes[1].func, 'a.b')
+    self.assertQNStringIs(nodes[2].value.func, 'a')
+    self.assertQNStringIs(nodes[3], 'z[i]')
+    self.assertQNStringIs(nodes[4].func, 'z[i]')
+    self.assertQNStringIs(nodes[5].value.func, 'z')
+
 
 if __name__ == '__main__':
   test.main()
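Note on the QN tests above: they rely on attribute and subscript names being distinguishable even when their underlying tuples are identical. The sketch below is a reduced, hypothetical stand-in for `qual_names.QN` (no `gast`, no literal types, no `ssf()` or `ast()`), meant only to illustrate how the nesting and the two flags determine string form, equality and hashing.

```python
class QN(object):
  """Reduced stand-in for a qualified name such as a.b or a[b]."""

  def __init__(self, base, attr=None, subscript=None):
    if attr is not None and subscript is not None:
      raise ValueError('A QN can be an attr or a subscript, not both.')
    self.has_attr = attr is not None
    self.has_subscript = subscript is not None
    if self.has_attr or self.has_subscript:
      if not isinstance(base, QN):
        raise ValueError('For composite QNs, base must be a QN.')
      self.qn = (base, attr if self.has_attr else subscript)
    else:
      self.qn = (base,)

  def __hash__(self):
    # The flags keep QN(a, attr='b') and QN(a, subscript='b') apart even
    # though both store the tuple (a, 'b').
    return hash(self.qn + (self.has_attr, self.has_subscript))

  def __eq__(self, other):
    return (isinstance(other, QN) and self.qn == other.qn and
            self.has_attr == other.has_attr and
            self.has_subscript == other.has_subscript)

  def __ne__(self, other):
    return not self == other

  def __str__(self):
    if self.has_subscript:
      return '%s[%s]' % self.qn
    if self.has_attr:
      return '%s.%s' % self.qn
    return str(self.qn[0])


a, b = QN('a'), QN('b')
a_sub_b_dot_c = QN(QN(a, subscript=b), attr='c')  # Represents a[b].c
a_sub_b_c = QN(a, subscript=QN(b, attr='c'))      # Represents a[b.c]
```

Because each composite name stores its parent as a nested QN rather than a flattened tuple, `a[b].c` and `a[b.c]` compare unequal, which is what test_nested_attrs_subscripts asserts for the real class.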
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/BUILD b/tensorflow/contrib/py2tf/pyct/static_analysis/BUILD
index fbfce18c60cca4b105e7de3c3ea7b9c3438f6b2a..2799b56a0042e99b8f8b38100d07c5afaef9f424 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/BUILD
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/BUILD
@@ -60,6 +60,7 @@ py_test(
     deps = [
         ":static_analysis",
         "//tensorflow/contrib/py2tf/pyct",
+        "//tensorflow/contrib/py2tf/utils",
         "//tensorflow/python:client_testlib",
     ],
 )
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/activity.py b/tensorflow/contrib/py2tf/pyct/static_analysis/activity.py
index 02ea6fdeaf78152b6bc48983f79b36f43d4f665d..716672a53b444abdae97a6b1a0d2e89eb08735fd 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/activity.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/activity.py
@@ -71,13 +71,33 @@ class Scope(object):
                                         tuple(self.modified))
 
   def copy_from(self, other):
+    """Recursively copies the contents of this scope from another scope."""
+    if (self.parent is None) != (other.parent is None):
+      raise ValueError('cannot copy scopes of different structures')
+    if other.parent is not None:
+      self.parent.copy_from(other.parent)
+    self.isolated = other.isolated
     self.modified = copy.copy(other.modified)
     self.created = copy.copy(other.created)
     self.used = copy.copy(other.used)
     self.params = copy.copy(other.params)
     self.returned = copy.copy(other.returned)
 
+  @classmethod
+  def copy_of(cls, other):
+    if other.parent is not None:
+      parent = cls.copy_of(other.parent)
+    else:
+      parent = None
+    new_copy = cls(parent)
+    new_copy.copy_from(other)
+    return new_copy
+
   def merge_from(self, other):
+    if (self.parent is None) != (other.parent is None):
+      raise ValueError('cannot merge scopes of different structures')
+    if other.parent is not None:
+      self.parent.merge_from(other.parent)
     self.modified |= other.modified
     self.created |= other.created
     self.used |= other.used
@@ -151,6 +171,10 @@ class ActivityAnalizer(transformer.Base):
     self._in_return_statement = False
 
   def _track_symbol(self, node):
+    # This can happen when we have an attribute (or subscript) on a function
+    # call.  Example: a().b
+    if not anno.hasanno(node, anno.Basic.QN):
+      return
     qn = anno.getanno(node, anno.Basic.QN)
 
     if isinstance(node.ctx, gast.Store):
@@ -225,14 +249,12 @@ class ActivityAnalizer(transformer.Base):
     # modifies the parent state causing the other child blocks to be
     # processed incorrectly. So we need to checkpoint the parent scope so that
     # each child sees the same context.
-    before_parent = Scope(None)
-    before_parent.copy_from(self.scope)
+    before_parent = Scope.copy_of(self.scope)
     after_children = []
     for child, scope_name in children:
       self.scope.copy_from(before_parent)
       parent = self._process_block_node(parent, child, scope_name)
-      after_child = Scope(None)
-      after_child.copy_from(self.scope)
+      after_child = Scope.copy_of(self.scope)
       after_children.append(after_child)
     for after_child in after_children:
       self.scope.merge_from(after_child)
@@ -250,6 +272,15 @@ class ActivityAnalizer(transformer.Base):
     self.scope = current_scope
     return node
 
+  def visit_With(self, node):
+    current_scope = self.scope
+    with_scope = Scope(current_scope, isolated=False)
+    self.scope = with_scope
+    self.generic_visit(node)
+    anno.setanno(node, NodeAnno.BODY_SCOPE, with_scope)
+    self.scope = current_scope
+    return node
+
   def visit_If(self, node):
     self.visit(node.test)
     node = self._process_parallel_blocks(node,
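Note on Scope.copy_of: the checkpointing in _process_parallel_blocks depends on the snapshot owning copies of the whole parent chain, so that later activity in the live scopes does not leak into it. The sketch below uses a reduced, hypothetical `Scope` that tracks only `used` and `modified`; the real class carries more fields and an `isolated` flag.

```python
import copy


class Scope(object):
  """Reduced stand-in for activity.Scope with two tracked sets."""

  def __init__(self, parent):
    self.parent = parent
    self.used = set()
    self.modified = set()

  def copy_from(self, other):
    """Recursively copies the contents of another scope into this one."""
    if (self.parent is None) != (other.parent is None):
      raise ValueError('cannot copy scopes of different structures')
    if other.parent is not None:
      self.parent.copy_from(other.parent)
    self.used = copy.copy(other.used)
    self.modified = copy.copy(other.modified)

  @classmethod
  def copy_of(cls, other):
    """Builds a new scope chain mirroring other, then copies its contents."""
    parent = cls.copy_of(other.parent) if other.parent is not None else None
    new_copy = cls(parent)
    new_copy.copy_from(other)
    return new_copy


root = Scope(None)
root.used.add('foo')
child = Scope(root)
child.used.add('bar')

snapshot = Scope.copy_of(child)

# Later activity in the live scopes must not affect the snapshot.
child.used.add('baz')
root.modified.add('foo')
```

The earlier pattern `Scope(None)` followed by `copy_from` could not capture parent state, and with the structure check added in this change it would raise for any scope that has a parent.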
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/activity_test.py b/tensorflow/contrib/py2tf/pyct/static_analysis/activity_test.py
index 69f5f4fc582f159e46c8b8929a90ca95b724794d..b16d15b39d8eb4c444cbc50ae62baa3a8fcc7841 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/activity_test.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/activity_test.py
@@ -45,7 +45,7 @@ class ScopeTest(test.TestCase):
     scope.mark_read(QN('bar'))
     self.assertFalse(scope.has(QN('bar')))
 
-  def test_copy(self):
+  def test_copy_from(self):
     scope = activity.Scope(None)
     scope.mark_write(QN('foo'))
 
@@ -65,6 +65,17 @@ class ScopeTest(test.TestCase):
     self.assertTrue(QN('bar') in scope.created)
     self.assertFalse(QN('bar') in other.created)
 
+  def test_copy_of(self):
+    scope = activity.Scope(None)
+    scope.mark_read(QN('foo'))
+
+    self.assertTrue(QN('foo') in activity.Scope.copy_of(scope).used)
+
+    child_scope = activity.Scope(scope)
+    child_scope.mark_read(QN('bar'))
+
+    self.assertTrue(QN('bar') in activity.Scope.copy_of(child_scope).used)
+
   def test_nesting(self):
     scope = activity.Scope(None)
     scope.mark_write(QN('foo'))
@@ -133,7 +144,7 @@ class ActivityAnalizerTest(test.TestCase):
         anno.getanno(node.body[0].body[2].value,
                      NodeAnno.IS_LOCAL))  # b in return b
 
-  def assertScopeIs(self, scope, used, modified, created):
+  def assertScopeIsRmc(self, scope, used, modified, created):
     self.assertItemsEqual(used, tuple(str(s) for s in scope.used))
     self.assertItemsEqual(modified, tuple(str(s) for s in scope.modified))
     self.assertItemsEqual(created, tuple(str(s) for s in scope.created))
@@ -159,7 +170,7 @@ class ActivityAnalizerTest(test.TestCase):
       print_args_scope = anno.getanno(print_node, NodeAnno.ARGS_SCOPE)
     # We basically need to detect which variables are captured by the call
     # arguments.
-    self.assertScopeIs(print_args_scope, ('a', 'b'), (), ())
+    self.assertScopeIsRmc(print_args_scope, ('a', 'b'), (), ())
 
   def test_call(self):
 
@@ -173,7 +184,7 @@ class ActivityAnalizerTest(test.TestCase):
     call_node = node.body[0].body[2].value
     # We basically need to detect which variables are captured by the call
     # arguments.
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(call_node, NodeAnno.ARGS_SCOPE), ('a', 'b'), (), ())
 
   def test_while(self):
@@ -187,10 +198,10 @@ class ActivityAnalizerTest(test.TestCase):
 
     node = self._parse_and_analyze(test_fn)
     while_node = node.body[0].body[1]
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(while_node, NodeAnno.BODY_SCOPE), ('b',), ('b', 'c'),
         ('c',))
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(while_node, NodeAnno.BODY_SCOPE).parent, ('a', 'b', 'c'),
         ('b', 'c'), ('a', 'b', 'c'))
 
@@ -205,9 +216,9 @@ class ActivityAnalizerTest(test.TestCase):
 
     node = self._parse_and_analyze(test_fn)
     for_node = node.body[0].body[1]
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(for_node, NodeAnno.BODY_SCOPE), ('b',), ('b', 'c'), ('c',))
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(for_node, NodeAnno.BODY_SCOPE).parent, ('a', 'b', 'c'),
         ('b', 'c', '_'), ('a', 'b', 'c', '_'))
 
@@ -226,21 +237,40 @@ class ActivityAnalizerTest(test.TestCase):
 
     node = self._parse_and_analyze(test_fn)
     if_node = node.body[0].body[0]
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(if_node, NodeAnno.BODY_SCOPE), ('x', 'y'), ('x', 'y', 'z'),
         ('y', 'z'))
     # TODO(mdan): Double check: is it ok to not mark a local symbol as not read?
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(if_node, NodeAnno.BODY_SCOPE).parent, ('x', 'z', 'u'),
         ('x', 'y', 'z', 'u'), ('x', 'y', 'z', 'u'))
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(if_node, NodeAnno.ORELSE_SCOPE), ('x', 'y'),
         ('x', 'y', 'u'), ('y', 'u'))
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(if_node, NodeAnno.ORELSE_SCOPE).parent, ('x', 'z', 'u'),
         ('x', 'y', 'z', 'u'), ('x', 'y', 'z', 'u'))
 
-  def test_functiondef(self):
+  def test_nested_if_else_creation(self):
+
+    def test_fn(b):
+      if b > 0:
+        if b < 5:
+          a = b
+        else:
+          a = b * b
+      return a
+
+    node = self._parse_and_analyze(test_fn)
+    inner_if_node = node.body[0].body[0].body[0]
+    self.assertScopeIsRmc(
+        anno.getanno(inner_if_node, NodeAnno.BODY_SCOPE), ('b',), ('a',),
+        ('a',))
+    self.assertScopeIsRmc(
+        anno.getanno(inner_if_node, NodeAnno.ORELSE_SCOPE), ('b',), ('a',),
+        ('a',))
+
+  def test_function_def(self):
 
     def test_fn(a):
 
@@ -257,11 +287,11 @@ class ActivityAnalizerTest(test.TestCase):
     node = self._parse_and_analyze(test_fn)
     fndef_node = node.body[0].body[0]
 
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(fndef_node,
                      NodeAnno.BODY_SCOPE).parent, ('b', 'i', 'f', 'c', 'a'),
         ('f', 'b', 'c', 'i'), ('f', 'a', 'b', 'c', 'i'))
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(fndef_node, NodeAnno.BODY_SCOPE), ('x', 'y'), ('y',), (
             'x',
             'y',
@@ -284,13 +314,13 @@ class ActivityAnalizerTest(test.TestCase):
 
     node = self._parse_and_analyze(test_fn)
     call_node = node.body[0].body[0].value
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(call_node, NodeAnno.ARGS_SCOPE), ('a', 'a.b', 'a.c'), (),
         ())
     if_node = node.body[0].body[1]
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(if_node, NodeAnno.BODY_SCOPE), ('a',), ('a.b',), ())
-    self.assertScopeIs(
+    self.assertScopeIsRmc(
         anno.getanno(if_node, NodeAnno.ORELSE_SCOPE),
         ('a', 'a.c', 'd', 'd.e', 'f'), ('a.c', 'd', 'd.e', 'f'), ('d', 'f'))
 
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/annos.py b/tensorflow/contrib/py2tf/pyct/static_analysis/annos.py
index 2d8e49442364fdd4a4752c8a83a5f3b76117fe57..5254b83ca7c775867fc2ad5ef0a0ad93ac483ba0 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/annos.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/annos.py
@@ -34,13 +34,14 @@ class NodeAnno(NoValue):
   """
 
   # Symbols
-
+  # These flags are boolean.
   IS_LOCAL = 'Symbol is local to the function scope being analized.'
   IS_PARAM = 'Symbol is a parameter to the function being analized.'
   IS_MODIFIED_SINCE_ENTRY = (
       'Symbol has been explicitly replaced in the current function scope.')
 
   # Scopes
+  # Scopes are represented by objects of type activity.Scope.
   ARGS_SCOPE = 'The scope for the argument list of a function call.'
   BODY_SCOPE = (
       'The scope for the main body of a statement (True branch for if '
@@ -48,3 +49,10 @@ class NodeAnno(NoValue):
   ORELSE_SCOPE = (
       'The scope for the orelse body of a statement (False branch for if '
       'statements, orelse body for loops).')
+
+  # Type and Value annotations
+  # Type annotations are represented by objects of type type_info.Type.
+  STATIC_INFO = (
+      'The type or value information that should be asserted about the entity '
+      'referenced by the symbol holding this annotation, irrespective of the '
+      'execution context.')
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/live_values.py b/tensorflow/contrib/py2tf/pyct/static_analysis/live_values.py
index 0388be5d252389f2f3516c8b27828905d6475589..ac5697900a95853bf68553b453e84218a60f3a12 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/live_values.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/live_values.py
@@ -55,11 +55,13 @@ class LiveValueResolver(transformer.Base):
       if not symbol_is_local and not symbol_is_param:
         if node.id in self.literals:
           anno.setanno(node, 'live_val', self.literals[node.id])
-          # TODO(mdan): Could live values have FQNs? i.e. 'a'.join()
         elif node.id in self.context.namespace:
           obj = self.context.namespace[node.id]
           anno.setanno(node, 'live_val', obj)
-          anno.setanno(node, 'fqn', (obj.__name__,))
+          if hasattr(obj, '__name__'):
+            # If the symbol value is for example a primitive, then it will not
+            # have a name.
+            anno.setanno(node, 'fqn', (obj.__name__,))
         else:
           pass
           # TODO(mdan): Should we raise an error here?
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/live_values_test.py b/tensorflow/contrib/py2tf/pyct/static_analysis/live_values_test.py
index c133a455b3dd328689102634c6076f366212ac25..a56dff824e484a893fa8a20f2bff6a68aa77129a 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/live_values_test.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/live_values_test.py
@@ -57,13 +57,26 @@ class LiveValuesResolverTest(test.TestCase):
 
   def test_literals(self):
 
+    a = None
+
     def test_fn():
-      return Foo  # pylint: disable=undefined-variable
+      return a
 
-    node = self._parse_and_analyze(test_fn, {}, {'Foo': 'bar'})
+    node = self._parse_and_analyze(test_fn, {}, literals={'a': 'bar'})
     retval_node = node.body[0].body[0].value
     self.assertEquals('bar', anno.getanno(retval_node, 'live_val'))
 
+  def test_primitive_values(self):
+
+    a = None
+
+    def test_fn():
+      return a
+
+    node = self._parse_and_analyze(test_fn, {'a': True})
+    retval_node = node.body[0].body[0].value
+    self.assertFalse(anno.hasanno(retval_node, 'fqn'))
+
   def test_namespace(self):
 
     def foo():
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/type_info.py b/tensorflow/contrib/py2tf/pyct/static_analysis/type_info.py
index 8203bda0f9a792a5b24b9abb25d8f39b61625748..5556a58c025da695bcef10352c597c7c8dd612d9 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/type_info.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/type_info.py
@@ -14,9 +14,29 @@
 # ==============================================================================
 """Type resolution.
 
+This analyzer uses known live values to further infer object types. This
+may include for instance constructed objects and object member functions.
+
+In addition, the analyzer processes the type annotations used for TF (staged)
+types.
+
 Requires annotations generated by LiveValuesResolver.
 """
 
+# TODO(mdan): This would be more robust with a CFG.
+# Situations with multiple reaching modifications (e.g. modified inside and
+# outside a control flow statement) should be more robustly detected and
+# analyzed.
+
+# TODO(mdan): Look into using Python AST's type annotation fields instead.
+# It would be desirable to use that mechanism if we can.
+# Some caveats to consider: We may need to annotate other nodes like
+# Attribute. It may also not be feasible for us to faithfully replicate
+# PY3's type annotations where it isn't available. It would also require us
+# to design rigorous type definitions that can accommodate Python types
+# as well as TensorFlow dtypes and shapes.
+
+
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
@@ -29,7 +49,7 @@ from tensorflow.python.util import tf_inspect
 
 
 class Scope(object):
-  """Encloses symbol value references.
+  """Tracks symbol value references.
 
   Attributes:
     values: A dict mapping string to gast.Node, containing the value that was
@@ -138,11 +158,14 @@ class TypeInfoResolver(transformer.Base):
     elif isinstance(node.ctx, gast.Load) and self.scope.hasval(qn):
       # E.g. if we had
       # a = b
-      # then for future references to `a` we should have traced_source = `b`
-      traced_source = self.scope.getval(qn)
-      if anno.hasanno(traced_source, 'type'):
-        anno.setanno(node, 'type', anno.getanno(traced_source, 'type'))
-        anno.setanno(node, 'type_fqn', anno.getanno(traced_source, 'type_fqn'))
+      # then for future references to `a` we should have definition = `b`
+      definition = self.scope.getval(qn)
+      if anno.hasanno(definition, 'type'):
+        anno.setanno(node, 'type', anno.getanno(definition, 'type'))
+        anno.setanno(node, 'type_fqn', anno.getanno(definition, 'type_fqn'))
+      if anno.hasanno(definition, 'element_type'):
+        anno.setanno(node, 'element_type',
+                     anno.getanno(definition, 'element_type'))
     return node
 
   def _process_variable_assignment(self, source, targets):
@@ -181,6 +204,34 @@ class TypeInfoResolver(transformer.Base):
     self._process_variable_assignment(node.value, node.targets)
     return node
 
+  def visit_Call(self, node):
+    if anno.hasanno(node.func, 'live_val'):
+      # Symbols targeted by the "set_type" marker function are assigned the data
+      # type that it specified.
+      if (anno.getanno(node.func, 'live_val') is
+          self.context.type_annotation_func):
+        # Expecting the actual type to be the second argument.
+        if len(node.args) != 2:
+          raise ValueError('"%s" must have exactly two parameters'
+                           % self.context.type_annotation_func)
+        if not anno.hasanno(node.args[0], anno.Basic.QN):
+          raise ValueError('the first argument of "%s" must be a symbol'
+                           % self.context.type_annotation_func)
+        if not anno.hasanno(node.args[1], 'live_val'):
+          raise ValueError(
+              'the second argument of "%s" must be statically resolvable' %
+              self.context.type_annotation_func)
+        target_symbol = anno.getanno(node.args[0], anno.Basic.QN)
+        element_type = anno.getanno(node.args[1], 'live_val')
+        # Find the definition of this symbol and annotate it with the given
+        # data type. That in turn will cause future uses of the symbol
+        # to receive the same type annotation.
+        definition = self.scope.getval(target_symbol)
+        anno.setanno(node, 'element_type', element_type)
+        anno.setanno(definition, 'element_type', element_type)
+        # TODO(mdan): Should we update references between definition and here?
+    return self.generic_visit(node)
+
 
 def resolve(node, context):
   return TypeInfoResolver(context).visit(node)
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/type_info_test.py b/tensorflow/contrib/py2tf/pyct/static_analysis/type_info_test.py
index a3e78202c80e45552c038a6a1da763eb30aff52f..0d9d5a85f055b170ea6e493e8ac185f1298ebf3c 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/type_info_test.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/type_info_test.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.contrib.py2tf import utils
 from tensorflow.contrib.py2tf.pyct import anno
 from tensorflow.contrib.py2tf.pyct import context
 from tensorflow.contrib.py2tf.pyct import parser
@@ -56,7 +57,10 @@ class ScopeTest(test.TestCase):
 
 class TypeInfoResolverTest(test.TestCase):
 
-  def _parse_and_analyze(self, test_fn, namespace, arg_types=None):
+  def _parse_and_analyze(self,
+                         test_fn,
+                         namespace,
+                         arg_types=None):
     node, source = parser.parse_entity(test_fn)
     ctx = context.EntityContext(
         namer=None,
@@ -66,7 +70,8 @@ class TypeInfoResolverTest(test.TestCase):
         arg_values=None,
         arg_types=arg_types,
         owner_type=None,
-        recursive=True)
+        recursive=True,
+        type_annotation_func=utils.set_element_type)
     node = qual_names.resolve(node)
     node = activity.resolve(node, ctx)
     node = live_values.resolve(node, ctx, {})
@@ -175,6 +180,22 @@ class TypeInfoResolverTest(test.TestCase):
     method_call = node.body[0].body[1].value.func
     self.assertFalse(anno.hasanno(method_call, 'live_val'))
 
+  def test_type_annotation(self):
+
+    class Foo(object):
+      pass
+
+    def test_fn():
+      f = []
+      f = utils.set_element_type(f, Foo)
+      return f
+
+    node = self._parse_and_analyze(test_fn, {'Foo': Foo, 'utils': utils})
+    f_def = node.body[0].body[0].value
+    self.assertEqual(anno.getanno(f_def, 'element_type'), Foo)
+    f_ref = node.body[0].body[1].value
+    self.assertEqual(anno.getanno(f_ref, 'element_type'), Foo)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/py2tf/pyct/templates.py b/tensorflow/contrib/py2tf/pyct/templates.py
index 6ee6c0c5ceb70d87779ee313670135cadc5214b5..590be682345841626e6f0cfbe1ac5bad0b6e6baa 100644
--- a/tensorflow/contrib/py2tf/pyct/templates.py
+++ b/tensorflow/contrib/py2tf/pyct/templates.py
@@ -44,8 +44,6 @@ class ReplaceTransformer(gast.NodeTransformer):
     self.replacements = replacements
     self.in_replacements = False
 
-  # TODO(mdan): Make a more detailed pass and clean up if needed.
-
   def visit_Expr(self, node):
     if (isinstance(node.value, gast.Name) and
         node.value.id in self.replacements):
@@ -53,17 +51,57 @@ class ReplaceTransformer(gast.NodeTransformer):
     self.generic_visit(node)
     return node
 
+  def visit_keyword(self, node):
+    if node.arg in self.replacements:
+      repl = self.replacements[node.arg]
+      if isinstance(repl, gast.keyword):
+        return repl
+      elif (isinstance(repl, (list, tuple)) and repl and
+            all(isinstance(r, gast.keyword) for r in repl)):
+        return repl
+      # TODO(mdan): We may allow replacing with a string as well.
+      # For example, if one wanted to replace foo with bar in foo=baz, then
+      # we could allow changing just node.arg, so that we end up with bar=baz.
+      raise ValueError(
+          'a keyword argument may only be replaced by another keyword or a '
+          'non-empty list of keywords. Found: %s' % repl)
+    return self.generic_visit(node)
+
   def visit_FunctionDef(self, node):
     node = self.generic_visit(node)
     if node.name in self.replacements:
       repl = self.replacements[node.name]
       if not isinstance(repl, (gast.Name, ast.Name)):
         raise ValueError(
-            'A function name can only be replaced by a Name node. Found: %s' %
+            'a function name can only be replaced by a Name node. Found: %s' %
             repl)
       node.name = repl.id
     return node
 
+  def _check_has_context(self, node):
+    if not node.ctx:
+      raise ValueError('node %s is missing ctx value' % node)
+
+  def _check_inner_children_have_context(self, node):
+    if isinstance(node, gast.Attribute):
+      self._check_inner_children_have_context(node.value)
+      self._check_has_context(node)
+    elif isinstance(node, gast.Tuple):
+      for e in node.elts:
+        self._check_inner_children_have_context(e)
+      self._check_has_context(node)
+    elif isinstance(node, gast.Dict):
+      for e in node.keys:
+        self._check_inner_children_have_context(e)
+      for e in node.values:
+        self._check_inner_children_have_context(e)
+    elif isinstance(node, gast.Name):
+      self._check_has_context(node)
+    elif isinstance(node, (gast.Str, gast.Num)):
+      pass
+    else:
+      raise ValueError('unexpected node type "%s"' % node)
+
   def _set_inner_child_context(self, node, ctx):
     if isinstance(node, gast.Attribute):
       self._set_inner_child_context(node.value, ctx)
@@ -74,11 +112,37 @@ class ReplaceTransformer(gast.NodeTransformer):
       node.ctx = ctx
     elif isinstance(node, gast.Name):
       node.ctx = ctx
+    elif isinstance(node, gast.Call):
+      self._set_inner_child_context(node.func, ctx)
+      # We may be able to override these to Load(), but for now it's simpler
+      # to just assert that they're set.
+      for a in node.args:
+        self._check_inner_children_have_context(a)
+      for k in node.keywords:
+        self._check_inner_children_have_context(k.value)
+    elif isinstance(node, gast.Dict):
+      # We may be able to override these to Load(), but for now it's simpler
+      # to just assert that they're set.
+      for e in node.keys:
+        self._check_inner_children_have_context(e)
+      for e in node.values:
+        self._check_inner_children_have_context(e)
     elif isinstance(node, (gast.Str, gast.Num)):
       pass
     else:
       raise ValueError('unexpected node type "%s"' % node)
 
+  def visit_Attribute(self, node):
+    node = self.generic_visit(node)
+    if node.attr not in self.replacements:
+      return node
+    repl = self.replacements[node.attr]
+    if not isinstance(repl, gast.Name):
+      raise ValueError(
+          'an attribute can only be replaced by a Name node. Found: %s' % repl)
+    node.attr = repl.id
+    return node
+
   def visit_Name(self, node):
     if node.id not in self.replacements:
       return node
@@ -154,3 +218,17 @@ def replace(template, **replacements):
   if isinstance(results, list):
     return [qual_names.resolve(r) for r in results]
   return qual_names.resolve(results)
+
+
+def replace_as_expression(template, **replacements):
+  """Variant of replace that generates expressions, instead of code blocks."""
+  replacement = replace(template, **replacements)
+  if len(replacement) != 1:
+    raise ValueError(
+        'single expression expected; for more general templates use replace')
+  node = replacement[0]
+  if not isinstance(node, gast.Expr):
+    raise ValueError(
+        'the template is expected to generate an expression node; instead '
+        'found %s' % node)
+  return node.value
diff --git a/tensorflow/contrib/py2tf/pyct/templates_test.py b/tensorflow/contrib/py2tf/pyct/templates_test.py
index 8ccfde8573724741b0bbe4eacb3c54beb381ee7e..af939caf32f40206d6a1ce43c7a01fbfbdf21b6d 100644
--- a/tensorflow/contrib/py2tf/pyct/templates_test.py
+++ b/tensorflow/contrib/py2tf/pyct/templates_test.py
@@ -18,9 +18,12 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import imp
+
 import gast
 
 from tensorflow.contrib.py2tf.pyct import compiler
+from tensorflow.contrib.py2tf.pyct import parser
 from tensorflow.contrib.py2tf.pyct import templates
 from tensorflow.python.platform import test
 
@@ -62,7 +65,7 @@ class TemplatesTest(test.TestCase):
     result, _ = compiler.ast_to_object(node)
     self.assertEquals(7, result.test_fn(2))
 
-  def test_code_block(self):
+  def test_replace_code_block(self):
     template = """
       def test_fn(a):
         block
@@ -79,6 +82,87 @@ class TemplatesTest(test.TestCase):
     result, _ = compiler.ast_to_object(node)
     self.assertEquals(3, result.test_fn(1))
 
+  def test_replace_attribute(self):
+    template = """
+      def test_fn(a):
+        return a.foo
+    """
+
+    node = templates.replace(template, foo='b')[0]
+    result, _ = compiler.ast_to_object(node)
+    mod = imp.new_module('test')
+    mod.b = 3
+    self.assertEquals(3, result.test_fn(mod))
+
+    with self.assertRaises(ValueError):
+      templates.replace(template, foo=1)
+
+  def test_replace_call_keyword(self):
+    template = """
+      def test_fn():
+        def f(a, d, f):
+          return a + d + f
+        return f(1, kws=None)
+    """
+
+    source = parser.parse_expression('f(d=3, f=5)')
+    node = templates.replace(template, kws=source.keywords)[0]
+    result, _ = compiler.ast_to_object(node)
+    self.assertEquals(9, result.test_fn())
+
+    for kws in ([], 1):
+      with self.assertRaises(ValueError):
+        templates.replace(template, kws=kws)
+
+  def test_replace_name_with_call(self):
+    template = """
+      def test_fn():
+        b = 5
+        def g(a):
+          return 3 * a
+        def f():
+          return g
+        return foo
+    """
+
+    source = parser.parse_expression('f()(b)')
+    node = templates.replace(template, foo=source)[0]
+    result, _ = compiler.ast_to_object(node)
+    self.assertEquals(15, result.test_fn())
+
+  def test_replace_name_with_dict(self):
+    template = """
+      def test_fn():
+        return foo['bar']
+    """
+
+    source = parser.parse_expression('{\'bar\': 3}')
+    node = templates.replace(template, foo=source)[0]
+    result, _ = compiler.ast_to_object(node)
+    self.assertEquals(3, result.test_fn())
+
+  def test_replace_as_expression(self):
+    template = """
+      foo(a)
+    """
+
+    node = templates.replace_as_expression(template, foo='bar', a='baz')
+    self.assertIsInstance(node, gast.Call)
+    self.assertEqual(node.func.id, 'bar')
+    self.assertEqual(node.args[0].id, 'baz')
+
+  def test_replace_as_expression_restrictions(self):
+    template = """
+      foo(a)
+      bar(b)
+    """
+    with self.assertRaises(ValueError):
+      templates.replace_as_expression(template)
+    with self.assertRaises(ValueError):
+      templates.replace_as_expression('')
+    with self.assertRaises(ValueError):
+      templates.replace_as_expression('a = b')
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/py2tf/pyct/transformer.py b/tensorflow/contrib/py2tf/pyct/transformer.py
index 877d52af016af720424c8a56257fec9ab64611cb..31ef7e1c05d2afd9686af168552b135d1a7eaafc 100644
--- a/tensorflow/contrib/py2tf/pyct/transformer.py
+++ b/tensorflow/contrib/py2tf/pyct/transformer.py
@@ -24,6 +24,7 @@ import gast
 import six
 
 from tensorflow.contrib.py2tf.pyct import anno
+from tensorflow.contrib.py2tf.pyct import compiler
 from tensorflow.contrib.py2tf.pyct import pretty_printer
 
 
@@ -31,6 +32,13 @@ class PyFlowParseError(SyntaxError):
   pass
 
 
+def try_ast_to_source(node):
+  try:
+    return compiler.ast_to_source(node)
+  except AssertionError:
+    return '<could not convert AST to source>'
+
+
 class Base(gast.NodeTransformer):
   """Base class for specialized transformers."""
 
@@ -44,6 +52,12 @@ class Base(gast.NodeTransformer):
     self._col_offset = 0
     self.context = context
 
+  def debug_print(self, node):
+    """Helper method useful for debugging."""
+    if __debug__:
+      print(pretty_printer.fmt(node))
+    return node
+
   def visit(self, node):
     source_code = self.context.source_code
     source_file = self.context.source_file
@@ -56,8 +70,9 @@ class Base(gast.NodeTransformer):
       return super(Base, self).visit(node)
     except (ValueError, AttributeError, KeyError, NotImplementedError,
             AssertionError) as e:
-      msg = '%s: %s\nOccurred at node:\n%s' % (
-          e.__class__.__name__, str(e), pretty_printer.fmt(node, color=False))
+      msg = '%s: %s\nOffending source:\n%s\n\nOccurred at node:\n%s' % (
+          e.__class__.__name__, str(e), try_ast_to_source(node),
+          pretty_printer.fmt(node, color=False))
       if source_code:
         line = source_code.splitlines()[self._lineno - 1]
       else:
diff --git a/tensorflow/contrib/py2tf/utils/BUILD b/tensorflow/contrib/py2tf/utils/BUILD
index c2fdd40707775783140390e4b5c0186c9c3e562e..d029289f5aea82e99d2fea63cc1cc6cf8d774183 100644
--- a/tensorflow/contrib/py2tf/utils/BUILD
+++ b/tensorflow/contrib/py2tf/utils/BUILD
@@ -20,25 +20,28 @@ py_library(
     name = "utils",
     srcs = [
         "__init__.py",
+        "builtins.py",
         "context_managers.py",
         "misc.py",
         "multiple_dispatch.py",
-        "printing.py",
         "py_func.py",
         "tensor_list.py",
+        "testing.py",
         "type_check.py",
+        "type_hints.py",
     ],
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:__subpackages__"],
     deps = [
+        "//tensorflow/python:list_ops",
         "//tensorflow/python:script_ops",
         "@six_archive//:six",
     ],
 )
 
 py_test(
-    name = "context_managers_test",
-    srcs = ["context_managers_test.py"],
+    name = "builtins_test",
+    srcs = ["builtins_test.py"],
     srcs_version = "PY2AND3",
     deps = [
         ":utils",
@@ -47,8 +50,8 @@ py_test(
 )
 
 py_test(
-    name = "misc_test",
-    srcs = ["misc_test.py"],
+    name = "context_managers_test",
+    srcs = ["context_managers_test.py"],
     srcs_version = "PY2AND3",
     deps = [
         ":utils",
@@ -57,8 +60,8 @@ py_test(
 )
 
 py_test(
-    name = "multiple_dispatch_test",
-    srcs = ["multiple_dispatch_test.py"],
+    name = "misc_test",
+    srcs = ["misc_test.py"],
     srcs_version = "PY2AND3",
     deps = [
         ":utils",
@@ -67,8 +70,8 @@ py_test(
 )
 
 py_test(
-    name = "py_func_test",
-    srcs = ["py_func_test.py"],
+    name = "multiple_dispatch_test",
+    srcs = ["multiple_dispatch_test.py"],
     srcs_version = "PY2AND3",
     deps = [
         ":utils",
@@ -77,8 +80,8 @@ py_test(
 )
 
 py_test(
-    name = "printing_test",
-    srcs = ["printing_test.py"],
+    name = "py_func_test",
+    srcs = ["py_func_test.py"],
     srcs_version = "PY2AND3",
     deps = [
         ":utils",
diff --git a/tensorflow/contrib/py2tf/utils/__init__.py b/tensorflow/contrib/py2tf/utils/__init__.py
index d931322bf34cc36b614e587bbf5a36f5c1a4e38c..d9d8e3468966bc9da31c3fc756a9660f5ff7d115 100644
--- a/tensorflow/contrib/py2tf/utils/__init__.py
+++ b/tensorflow/contrib/py2tf/utils/__init__.py
@@ -18,11 +18,17 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.contrib.py2tf.utils.builtins import dynamic_builtin
+from tensorflow.contrib.py2tf.utils.builtins import dynamic_print
+from tensorflow.contrib.py2tf.utils.builtins import dynamic_range
 from tensorflow.contrib.py2tf.utils.context_managers import control_dependency_on_returns
 from tensorflow.contrib.py2tf.utils.misc import alias_tensors
-from tensorflow.contrib.py2tf.utils.misc import dynamic_len
+from tensorflow.contrib.py2tf.utils.multiple_dispatch import dynamic_is
+from tensorflow.contrib.py2tf.utils.multiple_dispatch import dynamic_is_not
 from tensorflow.contrib.py2tf.utils.multiple_dispatch import run_cond
 from tensorflow.contrib.py2tf.utils.multiple_dispatch import run_while
-from tensorflow.contrib.py2tf.utils.printing import call_print
 from tensorflow.contrib.py2tf.utils.py_func import wrap_py_func
+from tensorflow.contrib.py2tf.utils.tensor_list import dynamic_list_append
+from tensorflow.contrib.py2tf.utils.testing import fake_tf
 from tensorflow.contrib.py2tf.utils.type_check import is_tensor
+from tensorflow.contrib.py2tf.utils.type_hints import set_element_type
diff --git a/tensorflow/contrib/py2tf/utils/builtins.py b/tensorflow/contrib/py2tf/utils/builtins.py
new file mode 100644
index 0000000000000000000000000000000000000000..3cb62b55d4d23545af4d641ecab1663ee7f7b876
--- /dev/null
+++ b/tensorflow/contrib/py2tf/utils/builtins.py
@@ -0,0 +1,99 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Builtin conversion utilities."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import six
+
+from tensorflow.contrib.py2tf.utils import py_func
+from tensorflow.contrib.py2tf.utils import type_check
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import logging_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.util import tf_inspect
+
+
+def dynamic_builtin(f, *args, **kwargs):
+  """Converts a builtin function call inline."""
+  # Some built-ins may be objects.
+  if not tf_inspect.isbuiltin(f) and f not in (range,):
+    return f(*args, **kwargs)
+
+  if f is len:
+    return dynamic_len(*args, **kwargs)
+  if six.PY2 and f is xrange:
+    return dynamic_range(*args, **kwargs)
+  if f is range:
+    return dynamic_range(*args, **kwargs)
+
+  raise NotImplementedError(
+      'The "%s" builtin is not yet supported.' % f.__name__)
+
+
+def dynamic_len(list_or_tensor):
+  """Implementation of len using dynamic dispatch."""
+  if tensor_util.is_tensor(list_or_tensor):
+    shape = list_or_tensor.shape
+    if not shape:
+      raise ValueError(
+          'len requires non-zero rank for tensor "%s"' % list_or_tensor)
+    return array_ops.shape(list_or_tensor)[0]
+
+  return len(list_or_tensor)
+
+
+def dynamic_range(start_or_stop, stop=None, step=None):
+  """Implementation of range using dynamic dispatch."""
+  if type_check.is_tensor(start_or_stop, stop, step):
+    if step is not None:
+      return math_ops.range(start_or_stop, stop, step)
+    if stop is not None:
+      return math_ops.range(start_or_stop, stop)
+    return math_ops.range(start_or_stop)
+
+  if step is not None:
+    return range(start_or_stop, stop, step)
+  elif stop is not None:
+    return range(start_or_stop, stop)
+  return range(start_or_stop)
+
+
+def is_tf_print_compatible(value):
+  # TODO(mdan): Enable once we can reliably test this.
+  # This is currently disabled because we can't capture the output of
+  # op kernels from Python.
+  del value
+  return False
+
+
+def dynamic_print(*values):
+  """Implementation of print using dynamic dispatch.
+
+  The function attempts to use tf.Print if all the values are compatible.
+  Otherwise, it will fall back to py_func.
+
+  Args:
+    *values: values to print
+  Returns:
+    A dummy value indicating the print completed.
+  """
+
+  if all(map(is_tf_print_compatible, values)):
+    return logging_ops.Print(1, values)
+  return py_func.wrap_py_func(print, None, values, use_dummy_return=True)
diff --git a/tensorflow/contrib/py2tf/utils/builtins_test.py b/tensorflow/contrib/py2tf/utils/builtins_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..59b3573d38c5bd98f416c7b77d1bc772cb8069dd
--- /dev/null
+++ b/tensorflow/contrib/py2tf/utils/builtins_test.py
@@ -0,0 +1,111 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for builtins module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import sys
+
+import six
+
+from tensorflow.contrib.py2tf.utils import builtins
+from tensorflow.python.framework import constant_op
+from tensorflow.python.platform import test
+
+
+class BuiltinsTest(test.TestCase):
+
+  def test_dynamic_len_tf_scalar(self):
+    a = constant_op.constant(1)
+
+    with self.assertRaises(ValueError):
+      with self.test_session() as sess:
+        sess.run(builtins.dynamic_builtin(len, a))
+
+  def test_dynamic_len_tf_array(self):
+    a = constant_op.constant([1, 2, 3])
+
+    with self.test_session() as sess:
+      self.assertEqual(3, sess.run(builtins.dynamic_builtin(len, a)))
+
+  def test_dynamic_len_tf_matrix(self):
+    a = constant_op.constant([[1, 2], [3, 4]])
+
+    with self.test_session() as sess:
+      self.assertEqual(2, sess.run(builtins.dynamic_builtin(len, a)))
+
+  def test_dynamic_len_py_list(self):
+    a = [3] * 5
+
+    self.assertEqual(5, builtins.dynamic_builtin(len, a))
+
+  def test_dynamic_range_all_python(self):
+    self.assertListEqual(list(builtins.dynamic_builtin(range, 3)), [0, 1, 2])
+    self.assertListEqual(list(builtins.dynamic_builtin(range, 1, 3)), [1, 2])
+    self.assertListEqual(
+        list(builtins.dynamic_builtin(range, 2, 0, -1)), [2, 1])
+
+  def test_dynamic_range_tf(self):
+    with self.test_session() as sess:
+      self.assertAllEqual(
+          sess.run(builtins.dynamic_builtin(range, constant_op.constant(3))),
+          [0, 1, 2])
+      self.assertAllEqual(
+          sess.run(builtins.dynamic_builtin(range, 1, constant_op.constant(3))),
+          [1, 2])
+      self.assertAllEqual(
+          sess.run(
+              builtins.dynamic_builtin(range, 2, 0, constant_op.constant(-1))),
+          [2, 1])
+
+  def test_dynamic_range_detection(self):
+    def range(x):  # pylint:disable=redefined-builtin
+      return x
+
+    # Functions that just have the names of builtins are ignored.
+    self.assertEqual(builtins.dynamic_builtin(range, 1), 1)
+    if six.PY2:
+      self.assertListEqual(
+          list(builtins.dynamic_builtin(xrange, 3)), [0, 1, 2])
+    self.assertListEqual(
+        list(builtins.dynamic_builtin(six.moves.range, 3)), [0, 1, 2])
+    self.assertListEqual(
+        list(builtins.dynamic_builtin(six.moves.xrange, 3)), [0, 1, 2])
+
+  def test_dynamic_print_tf(self):
+    try:
+      out_capturer = six.StringIO()
+      sys.stdout = out_capturer
+      with self.test_session() as sess:
+        sess.run(builtins.dynamic_print('test message', 1))
+        self.assertEqual(out_capturer.getvalue(), 'test message 1\n')
+    finally:
+      sys.stdout = sys.__stdout__
+
+  def test_dynamic_print_complex(self):
+    try:
+      out_capturer = six.StringIO()
+      sys.stdout = out_capturer
+      with self.test_session() as sess:
+        sess.run(builtins.dynamic_print('test message', [1, 2]))
+        self.assertEqual(out_capturer.getvalue(), 'test message [1, 2]\n')
+    finally:
+      sys.stdout = sys.__stdout__
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/py2tf/utils/context_managers.py b/tensorflow/contrib/py2tf/utils/context_managers.py
index 38d9e11fe9069722b9023fee848bf53e1f72de6a..3d150a95817b83c4d7aaa78dc250092dcc4c5a9b 100644
--- a/tensorflow/contrib/py2tf/utils/context_managers.py
+++ b/tensorflow/contrib/py2tf/utils/context_managers.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 import contextlib
 
 from tensorflow.python.framework import ops
+from tensorflow.python.ops import tensor_array_ops
 
 
 def control_dependency_on_returns(return_value):
@@ -34,9 +35,15 @@ def control_dependency_on_returns(return_value):
   Returns:
     A context manager.
   """
+  def control_dependency_handle(t):
+    if isinstance(t, tensor_array_ops.TensorArray):
+      return t.flow
+    return t
+
   if return_value is None:
     return contextlib.contextmanager(lambda: (yield))()
   # TODO(mdan): Filter to tensor objects.
   if not isinstance(return_value, (list, tuple)):
     return_value = (return_value,)
+  return_value = tuple(control_dependency_handle(t) for t in return_value)
   return ops.control_dependencies(return_value)
diff --git a/tensorflow/contrib/py2tf/utils/context_managers_test.py b/tensorflow/contrib/py2tf/utils/context_managers_test.py
index 633ba93540e696889a6b2b71b40b999da39d48ff..404f6e44e59d8bd6131367e3234843f03b351910 100644
--- a/tensorflow/contrib/py2tf/utils/context_managers_test.py
+++ b/tensorflow/contrib/py2tf/utils/context_managers_test.py
@@ -20,6 +20,8 @@ from __future__ import print_function
 
 from tensorflow.contrib.py2tf.utils import context_managers
 from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.platform import test
 
 
@@ -32,6 +34,9 @@ class ContextManagersTest(test.TestCase):
     with context_managers.control_dependency_on_returns(
         constant_op.constant(1)):
       pass
+    with context_managers.control_dependency_on_returns(
+        tensor_array_ops.TensorArray(dtypes.int32, size=1)):
+      pass
     with context_managers.control_dependency_on_returns(
         [constant_op.constant(1),
          constant_op.constant(2)]):
diff --git a/tensorflow/contrib/py2tf/utils/misc.py b/tensorflow/contrib/py2tf/utils/misc.py
index 7548048388766d0f12a55eecd77fca2706f9734b..1b06caf0bdeb6f4a079e33f2e887d2dca017adc2 100644
--- a/tensorflow/contrib/py2tf/utils/misc.py
+++ b/tensorflow/contrib/py2tf/utils/misc.py
@@ -19,22 +19,9 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 
 
-def dynamic_len(list_or_tensor):
-  """Implementation of len using dynamic dispatch."""
-  if tensor_util.is_tensor(list_or_tensor):
-    shape = list_or_tensor.shape
-    if not shape:
-      raise ValueError(
-          'len requires non-zero rank for tensor "%s"' % list_or_tensor)
-    return array_ops.shape(list_or_tensor)[0]
-
-  return len(list_or_tensor)
-
-
 def alias_tensors(*args):
   """Wrap any Tensor arguments with an identity op.
 
diff --git a/tensorflow/contrib/py2tf/utils/misc_test.py b/tensorflow/contrib/py2tf/utils/misc_test.py
index ec88e7cb74bd40b851fb3d2fe246d37d8c668d82..8aedd4cd64798660cc07364c45487399986c9be6 100644
--- a/tensorflow/contrib/py2tf/utils/misc_test.py
+++ b/tensorflow/contrib/py2tf/utils/misc_test.py
@@ -19,37 +19,12 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.contrib.py2tf.utils.misc import alias_tensors
-from tensorflow.contrib.py2tf.utils.misc import dynamic_len
 from tensorflow.python.framework.constant_op import constant
 from tensorflow.python.ops.variables import Variable
 from tensorflow.python.platform import test
 
 
-class ContextManagersTest(test.TestCase):
-
-  def test_dynamic_len_tf_scalar(self):
-    a = constant(1)
-
-    with self.assertRaises(ValueError):
-      with self.test_session() as sess:
-        sess.run(dynamic_len(a))
-
-  def test_dynamic_len_tf_array(self):
-    a = constant([1, 2, 3])
-
-    with self.test_session() as sess:
-      self.assertEqual(3, sess.run(dynamic_len(a)))
-
-  def test_dynamic_len_tf_matrix(self):
-    a = constant([[1, 2], [3, 4]])
-
-    with self.test_session() as sess:
-      self.assertEqual(2, sess.run(dynamic_len(a)))
-
-  def test_dynamic_len_py_list(self):
-    a = [3] * 5
-
-    self.assertEqual(5, dynamic_len(a))
+class MiscTest(test.TestCase):
 
   def test_alias_single_tensor(self):
     a = constant(1)
diff --git a/tensorflow/contrib/py2tf/utils/multiple_dispatch.py b/tensorflow/contrib/py2tf/utils/multiple_dispatch.py
index a855fdc075941915035d1e3380846ff912803494..427a936c35b9f11db51f9c29651d6dc932007c89 100644
--- a/tensorflow/contrib/py2tf/utils/multiple_dispatch.py
+++ b/tensorflow/contrib/py2tf/utils/multiple_dispatch.py
@@ -24,6 +24,16 @@ from tensorflow.contrib.py2tf.utils.type_check import is_tensor
 from tensorflow.python.ops import control_flow_ops
 
 
+def dynamic_is(left, right):
+  # TODO(alexbw) if we're sure we should leave 'is' in place,
+  # then change the semantics in converters/logical_expressions.py
+  return left is right
+
+
+def dynamic_is_not(left, right):
+  return left is not right
+
+
 def run_cond(condition, true_fn, false_fn):
   """Type-dependent functional conditional.
 
diff --git a/tensorflow/contrib/py2tf/utils/multiple_dispatch_test.py b/tensorflow/contrib/py2tf/utils/multiple_dispatch_test.py
index 5bb4d4086b002211eebb86783bb7212c707a1418..75e8fdd5ed10eee385c7419aea896aeafce4fd48 100644
--- a/tensorflow/contrib/py2tf/utils/multiple_dispatch_test.py
+++ b/tensorflow/contrib/py2tf/utils/multiple_dispatch_test.py
@@ -17,6 +17,9 @@
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+
+import numpy as np
+
 from tensorflow.contrib.py2tf.utils import multiple_dispatch
 from tensorflow.python.client.session import Session
 from tensorflow.python.framework.constant_op import constant
@@ -25,6 +28,33 @@ from tensorflow.python.platform import test
 
 class MultipleDispatchTest(test.TestCase):
 
+  def test_dynamic_is_python(self):
+    a = np.eye(3)
+    also_a = a
+    not_actually_a = np.eye(3)
+    should_be_true1 = multiple_dispatch.dynamic_is(a, also_a)
+    should_be_false1 = multiple_dispatch.dynamic_is_not(a, also_a)
+    should_be_true2 = multiple_dispatch.dynamic_is_not(a, not_actually_a)
+    should_be_false2 = multiple_dispatch.dynamic_is(a, not_actually_a)
+    self.assertTrue(should_be_true1)
+    self.assertTrue(should_be_true2)
+    self.assertFalse(should_be_false1)
+    self.assertFalse(should_be_false2)
+
+  def test_dynamic_is_tf(self):
+    with Session().as_default():
+      a = constant([2.0])
+      also_a = a
+      not_actually_a = constant([2.0])
+      should_be_true1 = multiple_dispatch.dynamic_is(a, also_a)
+      should_be_false1 = multiple_dispatch.dynamic_is_not(a, also_a)
+      should_be_true2 = multiple_dispatch.dynamic_is_not(a, not_actually_a)
+      should_be_false2 = multiple_dispatch.dynamic_is(a, not_actually_a)
+      self.assertTrue(should_be_true1)
+      self.assertTrue(should_be_true2)
+      self.assertFalse(should_be_false1)
+      self.assertFalse(should_be_false2)
+
   def test_run_cond_python(self):
     true_fn = lambda: 2.0
     false_fn = lambda: 3.0
diff --git a/tensorflow/contrib/py2tf/utils/py_func.py b/tensorflow/contrib/py2tf/utils/py_func.py
index 838872d092a3ab07e965180eff4fec7ff6c4ccf9..34f2a8b70b76542741dfe2d20050d85067f0c378 100644
--- a/tensorflow/contrib/py2tf/utils/py_func.py
+++ b/tensorflow/contrib/py2tf/utils/py_func.py
@@ -18,12 +18,24 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from collections import namedtuple
+
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import script_ops
 
 
-def wrap_py_func(f, return_dtypes, arguments, use_dummy_return=False):
+class MatchDType(namedtuple('MatchDType', ('arg_number',))):
+  """Allows matching the dtype of an argument.
+
+  Used in conjunction with function calls. For example, MatchDType(0) will
+  match the DType of the first argument.
+  """
+
+  pass
+
+
+def wrap_py_func(f, return_dtypes, args, kwargs=None, use_dummy_return=False):
   """Helper that wraps a callable to py_func.
 
   The helper passes tensor arguments through the py_func interface. Non-tensor
@@ -34,36 +46,87 @@ def wrap_py_func(f, return_dtypes, arguments, use_dummy_return=False):
 
   Args:
     f: Callable
-    return_dtypes: DType, tuple, list or None, the data type for each of f's
-        return value. None if f has no return values or use_dummy_return is
-        True.
-    arguments: Arguments for f
+    return_dtypes: None, individual or tuple/list of DType or MatchDType, the
+        data type for each of f's return value(s). Set to None if f has no
+        return values or use_dummy_return is True. Use MatchDType to define a
+        dtype identical to that of the `i`th argument (argument 0 is the
+        first); an argument must be a Tensor if it is used with MatchDType.
+    args: Positional arguments for f, as list or tuple.
+    kwargs: Keyword arguments for f, as dict with string keys. May be None.
     use_dummy_return: If True, the function will return a dummy value of 1
         and discard its actual return value.
   Returns:
     The return values of f converted to tensor.
   Raises:
-    ValueError: if the arguments are incorrect.
+    ValueError: if any of the arguments are incorrect.
   """
 
   if return_dtypes and use_dummy_return:
     raise ValueError('if use_dummy_return is True, return_dtypes must be empty')
 
-  n = len(arguments)
-  arg_is_tensor = tuple(map(tensor_util.is_tensor, arguments))
-  index_in_tensor_list = [0] * n
-  i = 0
-  for j in range(n):
-    index_in_tensor_list[j] = i
-    if arg_is_tensor[j]:
-      i += 1
+  tensor_args = []
+  tensor_args_idx = {}
+
+  # Of the positional arguments, only grab the tensor ones to be passed through
+  # the py_func.
+  n_args = len(args)
+  arg_is_tensor = tuple(map(tensor_util.is_tensor, args))
+  for i in range(n_args):
+    if arg_is_tensor[i]:
+      tensor_args_idx[i] = len(tensor_args)
+      tensor_args.append(args[i])
+
+  # We essentially take the tensor kwargs, if any, and add them to the list of
+  # positional arguments. The kwargs are then reconstructed inside the py_func.
+  #
+  # For example, if
+  #
+  #     args = [Tensor(1), 'foo']
+  #     kwargs = {'a': Tensor(2), 'b': 'bar'}
+  #
+  # Then
+  #
+  #     tensor_args = (Tensor(1), Tensor(2))
+  #     kwarg_keys = ('a', 'b')
+  if kwargs:
+    kwarg_keys = tuple(kwargs.keys())
+    kwarg_is_tensor = {k: tensor_util.is_tensor(kwargs[k]) for k in kwarg_keys}
+    for k in kwarg_keys:
+      if kwarg_is_tensor[k]:
+        tensor_args_idx[k] = len(tensor_args)
+        tensor_args.append(kwargs[k])
+  else:
+    kwarg_keys = ()
+
+  # Set up return dtypes.
+  def match_arg_dtype(arg_number):
+    arg = args[arg_number]
+    if not arg_is_tensor[arg_number]:
+      raise ValueError(
+          'argument %d was used with MatchDType and must be a tf.Tensor, but '
+          'was %s instead' % (arg_number, type(arg)))
+    return arg.dtype
+
+  if return_dtypes:
+    if isinstance(return_dtypes, MatchDType):
+      return_dtypes = match_arg_dtype(return_dtypes.arg_number)
+    elif isinstance(return_dtypes, (list, tuple)):
+      return_dtypes = tuple(
+          match_arg_dtype(a.arg_number) if isinstance(a, MatchDType) else a
+          for a in return_dtypes)
+    else:
+      assert isinstance(return_dtypes, dtypes.DType)
 
   def f_wrapper(*tensor_args):
-    f_args = tuple(tensor_args[index_in_tensor_list[i]]
-                   if arg_is_tensor[i] else arguments[i] for i in range(n))
-    retval = f(*f_args)
+    f_args = tuple(
+        tensor_args[tensor_args_idx[i]] if arg_is_tensor[i] else a
+        for i, a in enumerate(args))
+    f_kwargs = {
+        k: tensor_args[tensor_args_idx[k]] if kwarg_is_tensor[k] else kwargs[k]
+        for k in kwarg_keys
+    }
+    retval = f(*f_args, **f_kwargs)
     return 1 if use_dummy_return else retval
 
-  return script_ops.py_func(
-      f_wrapper, tuple(arguments[i] for i in range(n) if arg_is_tensor[i]),
-      dtypes.int64 if use_dummy_return else return_dtypes)
+  return script_ops.py_func(f_wrapper, tensor_args, dtypes.int64
+                            if use_dummy_return else return_dtypes)
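Reviewer note on the `wrap_py_func` change above: the argument bookkeeping can be illustrated without TensorFlow. The sketch below is not the library code. `FakeTensor`, `is_tensor`, `pack_and_call` and `demo_fn` are hypothetical names introduced only for illustration, and the py_func boundary is simulated by a plain function call. It shows how tensor-valued positional and keyword arguments are collected into one flat list and then put back into their original slots inside the wrapper.

```python
class FakeTensor(object):
  """Hypothetical stand-in for tf.Tensor; it only carries a value."""

  def __init__(self, value):
    self.value = value


def is_tensor(x):
  # Stand-in for tensor_util.is_tensor (assumption for this sketch).
  return isinstance(x, FakeTensor)


def pack_and_call(f, args, kwargs=None):
  """Mimics the argument bookkeeping of wrap_py_func (sketch only)."""
  kwargs = kwargs or {}
  tensor_args = []
  tensor_args_idx = {}  # int keys for positional args, str keys for kwargs.

  for i, a in enumerate(args):
    if is_tensor(a):
      tensor_args_idx[i] = len(tensor_args)
      tensor_args.append(a)
  for k in kwargs:
    if is_tensor(kwargs[k]):
      tensor_args_idx[k] = len(tensor_args)
      tensor_args.append(kwargs[k])

  def f_wrapper(*passed):
    # Rebuild the original call: tensor slots are filled from `passed`,
    # everything else is captured from the enclosing scope.
    f_args = tuple(
        passed[tensor_args_idx[i]] if i in tensor_args_idx else a
        for i, a in enumerate(args))
    f_kwargs = {
        k: passed[tensor_args_idx[k]] if k in tensor_args_idx else v
        for k, v in kwargs.items()
    }
    return f(*f_args, **f_kwargs)

  # A real py_func would hand the wrapper evaluated values; unwrapping the
  # fake tensors imitates that step.
  return f_wrapper(*[t.value for t in tensor_args]), len(tensor_args)


def demo_fn(a, b, c=0, d=''):
  return a + c, b + d


result, n_tensor_args = pack_and_call(
    demo_fn, (FakeTensor(1), 'foo'), {'c': FakeTensor(2), 'd': 'bar'})
```

Only the two tensor-valued arguments cross the simulated boundary. The strings `'foo'` and `'bar'` stay captured in the closure, which mirrors the example given in the comment block of the patch.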
diff --git a/tensorflow/contrib/py2tf/utils/py_func_test.py b/tensorflow/contrib/py2tf/utils/py_func_test.py
index 776b5309c6f027bb2008aa83d48e4155e817ed97..3b7a35365a96ac370253e2551a4c3f9db18c66fd 100644
--- a/tensorflow/contrib/py2tf/utils/py_func_test.py
+++ b/tensorflow/contrib/py2tf/utils/py_func_test.py
@@ -32,19 +32,15 @@ class PyFuncTest(test.TestCase):
       return a + b + c
 
     with self.test_session() as sess:
-      tensor_1 = constant_op.constant(1)
-      self.assertEqual(3,
-                       sess.run(
-                           py_func.wrap_py_func(test_fn, dtypes.int64,
-                                                (1, tensor_1, 1))))
-      self.assertEqual(3,
-                       sess.run(
-                           py_func.wrap_py_func(test_fn, dtypes.int64,
-                                                (1, 1, 1))))
-      self.assertEqual(3,
-                       sess.run(
-                           py_func.wrap_py_func(test_fn, dtypes.int64,
-                                                (tensor_1, 1, tensor_1))))
+      result = py_func.wrap_py_func(test_fn, dtypes.int64,
+                                    (1, constant_op.constant(1), 1))
+      self.assertEqual(3, sess.run(result))
+      result = py_func.wrap_py_func(test_fn, dtypes.int64, (1, 1, 1))
+      self.assertEqual(3, sess.run(result))
+      result = py_func.wrap_py_func(
+          test_fn, dtypes.int64,
+          (constant_op.constant(1), 1, constant_op.constant(1)))
+      self.assertEqual(3, sess.run(result))
 
   def test_wrap_py_func_complex_args(self):
 
@@ -57,15 +53,34 @@ class PyFuncTest(test.TestCase):
       return a * b.foo
 
     with self.test_session() as sess:
-      self.assertEqual(35,
-                       sess.run(
-                           py_func.wrap_py_func(test_fn, dtypes.int64,
-                                                (7, TestClass()))))
-      self.assertEqual(
-          35,
-          sess.run(
-              py_func.wrap_py_func(test_fn, dtypes.int64,
-                                   (constant_op.constant(7), TestClass()))))
+      result = py_func.wrap_py_func(test_fn, dtypes.int64, (7, TestClass()))
+      self.assertEqual(35, sess.run(result))
+      result = py_func.wrap_py_func(test_fn, dtypes.int64,
+                                    (constant_op.constant(7), TestClass()))
+      self.assertEqual(35, sess.run(result))
+
+  def test_wrap_py_func_kwargs(self):
+
+    class TestClass(object):
+
+      def __init__(self, foo):
+        self.foo = foo
+
+    def test_fn(a, b, c, d):
+      return a * b.foo + c * d.foo
+
+    with self.test_session() as sess:
+      result = py_func.wrap_py_func(test_fn, dtypes.int64, (7, TestClass(5)), {
+          'c': 11,
+          'd': TestClass(13)
+      })
+      self.assertEqual(178, sess.run(result))
+      result = py_func.wrap_py_func(test_fn, dtypes.int64,
+                                    (constant_op.constant(7), TestClass(5)), {
+                                        'c': constant_op.constant(11),
+                                        'd': TestClass(13)
+                                    })
+      self.assertEqual(178, sess.run(result))
 
   def test_wrap_py_func_dummy_return(self):
 
@@ -75,15 +90,12 @@ class PyFuncTest(test.TestCase):
       side_counter[0] += 1
 
     with self.test_session() as sess:
-      self.assertEqual(1,
-                       sess.run(
-                           py_func.wrap_py_func(test_fn, None, (5,), True)))
+      result = py_func.wrap_py_func(test_fn, None, (5,), use_dummy_return=True)
+      self.assertEqual(1, sess.run(result))
       self.assertEqual([1], side_counter)
-      self.assertEqual(1,
-                       sess.run(
-                           py_func.wrap_py_func(test_fn, None,
-                                                (constant_op.constant(5),),
-                                                True)))
+      result = py_func.wrap_py_func(
+          test_fn, None, (constant_op.constant(5),), use_dummy_return=True)
+      self.assertEqual(1, sess.run(result))
       self.assertEqual([2], side_counter)
 
 
diff --git a/tensorflow/contrib/py2tf/utils/tensor_list.py b/tensorflow/contrib/py2tf/utils/tensor_list.py
index b6ff49e2a0eff384f10903e12212ab929e267804..2556f412891b4f0b954af5a6f0193341a6a5020a 100644
--- a/tensorflow/contrib/py2tf/utils/tensor_list.py
+++ b/tensorflow/contrib/py2tf/utils/tensor_list.py
@@ -18,7 +18,26 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.framework import ops
 from tensorflow.python.ops import list_ops
+from tensorflow.python.ops import tensor_array_ops
+
+
+def dynamic_list_append(target, element):
+  """Converts a list append call inline."""
+  if isinstance(target, tensor_array_ops.TensorArray):
+    return target.write(target.size(), element)
+  # TODO(mdan): What's the right way to check this?
+  # TODO(mdan): We may not need this branch.
+  # It may be possible to use TensorList alone if the loop body will not
+  # require wrapping it, although we'd have to think about an autoboxing
+  # mechanism for lists received as parameter.
+  if isinstance(target, ops.Tensor):
+    return list_ops.tensor_list_push_back(target, element)
+
+  # Python targets (including TensorList): fall back to their original append.
+  target.append(element)
+  return target
 
 
 class TensorList(object):
diff --git a/tensorflow/contrib/py2tf/utils/tensor_list_test.py b/tensorflow/contrib/py2tf/utils/tensor_list_test.py
index b5e554a162674e08da21785dcbe193c54647f128..110e4d105e934d9d752afc2ccc0c53c99b70d41d 100644
--- a/tensorflow/contrib/py2tf/utils/tensor_list_test.py
+++ b/tensorflow/contrib/py2tf/utils/tensor_list_test.py
@@ -21,13 +21,41 @@ from __future__ import print_function
 from tensorflow.contrib.py2tf.utils import tensor_list as tl
 from tensorflow.python.client.session import Session
 from tensorflow.python.eager import context
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework.constant_op import constant
+from tensorflow.python.ops import list_ops
+from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.platform import test
 
 
 class TensorListTest(test.TestCase):
 
+  def _shape(self, shape_tuple):
+    return constant(shape_tuple, dtypes.int32)
+
+  def test_dynamic_list_append(self):
+    l = []
+    l = tl.dynamic_list_append(l, 1)
+    self.assertListEqual(l, [1])
+
+    l = list_ops.empty_tensor_list(self._shape(()), dtypes.int32)
+    l = tl.dynamic_list_append(l, 1)
+    s = list_ops.tensor_list_stack(l, element_dtype=dtypes.int32)
+    with self.test_session() as sess:
+      self.assertAllEqual(sess.run(s), [1])
+
+    l = tensor_array_ops.TensorArray(dtypes.int32, size=0, dynamic_size=True)
+    l = tl.dynamic_list_append(l, 1)
+    s = l.stack()
+    with self.test_session() as sess:
+      self.assertAllEqual(sess.run(s), [1])
+
+    l = tl.TensorList(self._shape(()), dtypes.int32)
+    l = tl.dynamic_list_append(l, 1)
+    with self.test_session() as sess:
+      self.assertAllEqual(sess.run(l[0]), 1)
+
   def test_list_append_python(self):
     with context.eager_mode():
       a = constant(3.0)
diff --git a/tensorflow/contrib/bayesflow/python/ops/custom_grad.py b/tensorflow/contrib/py2tf/utils/testing.py
similarity index 65%
rename from tensorflow/contrib/bayesflow/python/ops/custom_grad.py
rename to tensorflow/contrib/py2tf/utils/testing.py
index ca1ecb9c40204c3c723fa3423cfe148e823adc28..cb4785d0dc0f4674b3560418daeb6733364b21e7 100644
--- a/tensorflow/contrib/bayesflow/python/ops/custom_grad.py
+++ b/tensorflow/contrib/py2tf/utils/testing.py
@@ -12,23 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Functions for specifying custom gradients.
-
-See ${python/contrib.bayesflow.custom_gradient}.
-"""
+"""Testing utilities."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-# go/tf-wildcard-import
-# pylint: disable=wildcard-import
-from tensorflow.contrib.bayesflow.python.ops.custom_grad_impl import *
-# pylint: enable=wildcard-import
-from tensorflow.python.util.all_util import remove_undocumented
+import imp
+
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import math_ops
 
-_allowed_symbols = [
-    'custom_gradient',
-]
 
-remove_undocumented(__name__, _allowed_symbols)
+def fake_tf():
+  """Creates a fake module that looks like TensorFlow, for testing."""
+  mod = imp.new_module('tensorflow')
+  mod_contents = dict()
+  mod_contents.update(math_ops.__dict__)
+  mod_contents.update(ops.__dict__)
+  mod_contents.update(mod.__dict__)
+  mod.__dict__.update(mod_contents)
+  return mod
diff --git a/tensorflow/contrib/py2tf/utils/type_check.py b/tensorflow/contrib/py2tf/utils/type_check.py
index 9ca2dec872c8a9ca7bedaa8603f70e3214a3e24a..b9b2b451a4e22684a19f0d10fbf5e4fae5d6152b 100644
--- a/tensorflow/contrib/py2tf/utils/type_check.py
+++ b/tensorflow/contrib/py2tf/utils/type_check.py
@@ -22,12 +22,12 @@ from tensorflow.python.framework import tensor_util
 
 
 def is_tensor(*args):
-  """Check if all arguments are tensors.
+  """Check if any arguments are tensors.
 
   Args:
     *args: Python objects that may or may not be tensors.
 
   Returns:
-    True if all *args are TensorFlow types, False if one or more are not.
+    True if any *args are TensorFlow types, False if none are.
   """
   return any([tensor_util.is_tensor(a) for a in args])
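Reviewer note on the docstring fix above: the corrected wording ("any" rather than "all") matches what the `any(...)` call has always done. The sketch below shows the difference using a hypothetical `FakeTensor` class in place of a real TensorFlow type; it is an illustration under that assumption, not the library helper itself.

```python
class FakeTensor(object):
  """Hypothetical stand-in for a TensorFlow tensor."""


def is_tensor(*args):
  # Same shape as the helper in the patch: True as soon as one argument
  # is a tensor, False only when none of them are.
  return any(isinstance(a, FakeTensor) for a in args)


mixed = is_tensor(1, FakeTensor())  # One tensor among plain Python values.
none_are = is_tensor(1, 'a')        # No tensors at all.
no_args = is_tensor()               # any() of an empty sequence is False.
```

With "all" semantics the mixed call would have returned False, so callers that dispatch to TensorFlow ops whenever any operand is a tensor depend on the behaviour the new docstring describes.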
diff --git a/tensorflow/contrib/bayesflow/python/ops/halton_sequence.py b/tensorflow/contrib/py2tf/utils/type_hints.py
similarity index 54%
rename from tensorflow/contrib/bayesflow/python/ops/halton_sequence.py
rename to tensorflow/contrib/py2tf/utils/type_hints.py
index 49d747d538f5a4aa3134d28ba00a651cb509fa41..aeb9e545610460afbe364dfcfc7a54b9aede29fe 100644
--- a/tensorflow/contrib/bayesflow/python/ops/halton_sequence.py
+++ b/tensorflow/contrib/py2tf/utils/type_hints.py
@@ -12,22 +12,30 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Support for low discrepancy Halton sequences.
+"""No-op utilities that provide static type hints.
 
+These are used when the data type is not known at creation, for instance in the
+case of empty lists.
 """
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-# go/tf-wildcard-import
-# pylint: disable=wildcard-import
-from tensorflow.contrib.bayesflow.python.ops.halton_sequence_impl import *
-# pylint: enable=wildcard-import
-from tensorflow.python.util.all_util import remove_undocumented
 
-_allowed_symbols = [
-    'sample',
-]
+def set_element_type(entity, dtype, shape=None):
+  """Indicates that the entity is expected to hold items of specified type.
 
-remove_undocumented(__name__, _allowed_symbols)
+  This function is a no-op. Its presence merely marks the data type of its
+  argument. The staged TensorFlow ops will reflect and assert this data type.
+
+  Args:
+    entity: A Tensor or TensorArray.
+    dtype: TensorFlow dtype value to assert for entity.
+    shape: Optional shape to assert for entity.
+  Returns:
+    The value of entity, unchanged.
+  """
+  del dtype
+  del shape
+  return entity
diff --git a/tensorflow/contrib/quantize/BUILD b/tensorflow/contrib/quantize/BUILD
index aec9f47ccb20349c08bbe2fd813ee24a807f9fe3..0b7629620418340d803753be0df1f04c342dc490 100644
--- a/tensorflow/contrib/quantize/BUILD
+++ b/tensorflow/contrib/quantize/BUILD
@@ -24,6 +24,7 @@ py_test(
         "//tensorflow/python:framework_test_lib",
         "//tensorflow/python:platform_test",
         "//tensorflow/python:session",
+        "//tensorflow/python:variable_scope",
     ],
 )
 
diff --git a/tensorflow/contrib/quantize/README.md b/tensorflow/contrib/quantize/README.md
index 8b0e7bb68f5a11f5d1942f7cf048e96768da259e..348c824a4072c3329ac4a3441c19c71598bc9c03 100644
--- a/tensorflow/contrib/quantize/README.md
+++ b/tensorflow/contrib/quantize/README.md
@@ -3,8 +3,7 @@
 tf.contrib.quantize provides tools for transforming graphs to include ops to
 model quantization of weights, biases and activations during both training and
 inference. This is done using the
-[fake quantization op]
-(https://www.tensorflow.org/versions/r0.12/api_docs/python/array_ops/fake_quantization).
+[fake quantization op](https://www.tensorflow.org/versions/r0.12/api_docs/python/array_ops/fake_quantization).
 
 Recent literature has shown that fixed point networks provide comparable
 performance to floating point networks [1]. This is achieved by modeling the
diff --git a/tensorflow/contrib/quantize/python/common.py b/tensorflow/contrib/quantize/python/common.py
index 3a1fa61e43986af1a1315d5a9e6f010e802ea157..bf648e158ec15e1bfa962ba7dbe0567263c89c9b 100644
--- a/tensorflow/contrib/quantize/python/common.py
+++ b/tensorflow/contrib/quantize/python/common.py
@@ -23,6 +23,7 @@ import re
 
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variable_scope
@@ -101,7 +102,7 @@ def CreateOrGetQuantizationStep():
     Quantization step Tensor.
   """
   quantization_step_name = 'fake_quantization_step'
-  quantization_step_tensor_name = quantization_step_name + '/AssignAdd:0'
+  quantization_step_tensor_name = quantization_step_name + '/Identity:0'
   g = ops.get_default_graph()
   try:
     return g.get_tensor_by_name(quantization_step_tensor_name)
@@ -118,5 +119,15 @@ def CreateOrGetQuantizationStep():
       with g.name_scope(quantization_step_tensor.op.name + '/'):
         # We return the incremented variable tensor. Since this is used in conds
         # for quant_delay and freeze_bn_delay, it will run once per graph
-        # execution.
-        return state_ops.assign_add(quantization_step_tensor, 1)
+        # execution. We return an identity to force resource variables and
+        # normal variables to return a tensor of the same name.
+        return array_ops.identity(
+            state_ops.assign_add(quantization_step_tensor, 1))
+
+
+def DropStringPrefix(s, prefix):
+  """If the string starts with this prefix, drops it."""
+  if s.startswith(prefix):
+    return s[len(prefix):]
+  else:
+    return s
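Reviewer note on `DropStringPrefix` above: the function body below is copied from the patch so the snippet runs on its own. The scope-like strings are made-up inputs chosen for illustration and do not come from the quantize code.

```python
def DropStringPrefix(s, prefix):
  """If the string starts with this prefix, drops it."""
  if s.startswith(prefix):
    return s[len(prefix):]
  else:
    return s


# Hypothetical inputs, for illustration only.
stripped = DropStringPrefix('scope/conv1/weights', 'scope/')
untouched = DropStringPrefix('conv1/weights', 'scope/')
only_leading = DropStringPrefix('a/a/b', 'a/')
```

Only one leading occurrence is removed, and a string without the prefix is returned unchanged. This differs from `str.lstrip`, which strips a set of characters rather than a prefix.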
diff --git a/tensorflow/contrib/quantize/python/common_test.py b/tensorflow/contrib/quantize/python/common_test.py
index d6237fe5e38d905bf262d7be3746b9ee6046da47..06c62f2d265503bf42d46fb682a398ce1f4d15fb 100644
--- a/tensorflow/contrib/quantize/python/common_test.py
+++ b/tensorflow/contrib/quantize/python/common_test.py
@@ -22,6 +22,7 @@ from tensorflow.contrib.quantize.python import common
 from tensorflow.python.client import session
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import googletest
 
@@ -29,8 +30,15 @@ from tensorflow.python.platform import googletest
 class CommonTest(test_util.TensorFlowTestCase):
 
   def testCreateOrGetQuantizationStep(self):
+    self._TestCreateOrGetQuantizationStep(False)
+
+  def testCreateOrGetQuantizationStepResourceVar(self):
+    self._TestCreateOrGetQuantizationStep(True)
+
+  def _TestCreateOrGetQuantizationStep(self, use_resource):
     g = ops.Graph()
     with session.Session(graph=g) as sess:
+      variable_scope.get_variable_scope().set_use_resource(use_resource)
       quantization_step_tensor = common.CreateOrGetQuantizationStep()
 
       # Check that operations are added to the graph.
diff --git a/tensorflow/contrib/quantize/python/fold_batch_norms.py b/tensorflow/contrib/quantize/python/fold_batch_norms.py
index 75d9eb0e58d96e4bb2946684febd250e2e1a6b4a..5750be6f4cbd501ec85656a66b9002a470b1a863 100644
--- a/tensorflow/contrib/quantize/python/fold_batch_norms.py
+++ b/tensorflow/contrib/quantize/python/fold_batch_norms.py
@@ -31,6 +31,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn
 from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.util import compat
 
 
@@ -194,7 +195,7 @@ def _FindFusedBatchNorms(graph):
     layer_op = match_result.get_op(layer_pattern)
     layer_tensor = match_result.get_tensor(layer_pattern)
     bn_op = match_result.get_op(batch_norm_pattern)
-    batch_epsilon_tensor = bn_op.get_attr('epsilon')
+    batch_epsilon = bn_op.get_attr('epsilon')
 
     # In the MatMul case, the output of batch norm is reshaped back into a
     # 2D tensor, so the output_tensor is the output of the Reshape op.
@@ -207,6 +208,11 @@ def _FindFusedBatchNorms(graph):
         continue
       output_tensor = output_reshape_op.outputs[0]
 
+    # Ensure that the output tensor has consumers, otherwise this is a dangling
+    # node and not a match.
+    if not output_tensor.consumers():
+      continue
+
     input_tensor = match_result.get_tensor(input_pattern)
     weight_tensor = match_result.get_tensor(weight_pattern)
     gamma_tensor = match_result.get_tensor(gamma_pattern)
@@ -231,7 +237,7 @@ def _FindFusedBatchNorms(graph):
       # The batch variance used during forward and backward prop is biased,
       # i.e it is calculated as: V=sum(x(k)-mu)^2/N. For the moving average
       # calculation, the variance is corrected by the term N/N-1 (Bessel's
+      # correction). The variance tensor read from FusedBatchNorm has Bessel's
+      # correction). The variance tensor read from FuseBatchNorm has Bessel's
       # correction applied, so we undo it here.
       scope, sep, _ = bn_op.name.rpartition('/')
       g = ops.get_default_graph()
@@ -270,7 +276,7 @@ def _FindFusedBatchNorms(graph):
         moving_variance_tensor=moving_variance_tensor,
         bn_decay_mean_tensor=bn_decay_mean_tensor,
         bn_decay_var_tensor=bn_decay_var_tensor,
-        batch_epsilon_tensor=batch_epsilon_tensor)
+        batch_epsilon=batch_epsilon)
 
 
 def _ComputeBatchNormCorrections(context, match, freeze_batch_norm_delay,
@@ -300,7 +306,7 @@ def _ComputeBatchNormCorrections(context, match, freeze_batch_norm_delay,
 
   Args:
     context: The scope under which we look for batch norm params
-    match: Object containg required batch norm tensors for correction
+    match: Object containing required batch norm tensors for correction
       computation.
     freeze_batch_norm_delay: Delay in steps at which computation switches
       from regular batch norm to frozen mean and variance.
@@ -311,11 +317,11 @@ def _ComputeBatchNormCorrections(context, match, freeze_batch_norm_delay,
   """
 
   g = ops.get_default_graph()
-  with g.name_scope(context + '/batch_norm_correction'):
+  prefix = '' if not context else context + '/'
+  with g.name_scope(prefix + 'batch_norm_correction'):
     recip_sigma_mv = math_ops.rsqrt(
-        match.moving_variance_tensor + match.batch_epsilon_tensor)
-    recip_sigma = math_ops.rsqrt(
-        match.variance_tensor + match.batch_epsilon_tensor)
+        match.moving_variance_tensor + match.batch_epsilon)
+    recip_sigma = math_ops.rsqrt(match.variance_tensor + match.batch_epsilon)
     correction_scale = math_ops.divide(
         recip_sigma_mv, recip_sigma, name='scale_compute')
     correction_scale = array_ops.identity(
@@ -434,6 +440,9 @@ def _FoldUnfusedBatchNorms(graph, is_training, freeze_batch_norm_delay):
   for bn in common.BatchNormGroups(graph):
     has_scaling = _HasScaling(graph, input_to_ops_map, bn)
 
+    if not _IsValidUnfusedBatchNorm(graph, bn):
+      continue
+
     # The mangling code intimately depends on BatchNorm node's internals.
     original_op, folded_op = _CreateFoldedOp(
         graph,
@@ -462,6 +471,15 @@ def _FoldUnfusedBatchNorms(graph, is_training, freeze_batch_norm_delay):
       raise ValueError('Unexpected inputs to op: %s' % add_bypass.name)
 
 
+def _IsValidUnfusedBatchNorm(graph, context):
+  """Checks that the output of the unfused batch norm has consumers."""
+  add_shift = graph.get_operation_by_name(
+      context + '/BatchNorm/batchnorm/add_1')
+  # Ensure that the output tensor of batch norm has consumers, otherwise this
+  # is a dangling node and not a match.
+  return bool(add_shift.outputs[0].consumers())
+
+
 def _GetBatchNormParams(graph, context, has_scaling):
   """Extracts relevant tensors for folding batch norms.
 
@@ -478,7 +496,7 @@ def _GetBatchNormParams(graph, context, has_scaling):
   batch_variance_tensor = None
   moving_mean_tensor = None
   moving_variance_tensor = None
-  batch_epsilon_tensor = None
+  batch_epsilon = None
   bn_decay_mean_tensor = None
   bn_decay_var_tensor = None
 
@@ -486,15 +504,23 @@ def _GetBatchNormParams(graph, context, has_scaling):
   base_context = split_context[-1]
 
   oplist = graph.get_operations()
-  op_suffix_gamma = base_context + '/BatchNorm/gamma'
   op_suffix_mean = base_context + '/BatchNorm/moments/Squeeze'
   op_suffix_variance = base_context + '/BatchNorm/moments/Squeeze_1'
-  op_suffix_moving_variance = base_context + '/BatchNorm/moving_variance/read'
-  op_suffix_moving_mean = base_context + '/BatchNorm/moving_mean/read'
   op_suffix_epsilon = base_context + '/BatchNorm/batchnorm/add/y'
   op_suffix_bn_decay_mean = base_context + '/BatchNorm/AssignMovingAvg/decay'
   op_suffix_bn_decay_var = base_context + '/BatchNorm/AssignMovingAvg_1/decay'
 
+  if variable_scope.get_variable_scope().use_resource:
+    op_suffix_gamma = base_context + '/BatchNorm/gamma/Read/ReadVariableOp'
+    op_suffix_moving_variance = (
+        base_context + '/BatchNorm/moving_variance/Read/ReadVariableOp')
+    op_suffix_moving_mean = (
+        base_context + '/BatchNorm/moving_mean/Read/ReadVariableOp')
+  else:
+    op_suffix_gamma = base_context + '/BatchNorm/gamma'
+    op_suffix_moving_variance = base_context + '/BatchNorm/moving_variance/read'
+    op_suffix_moving_mean = base_context + '/BatchNorm/moving_mean/read'
+
   # Parse through list of ops to find relevant ops
   for op in oplist:
     if op.name.endswith(op_suffix_mean):
@@ -509,7 +535,7 @@ def _GetBatchNormParams(graph, context, has_scaling):
     if op.name.endswith(op_suffix_moving_variance):
       moving_variance_tensor = graph.get_tensor_by_name(op.name + ':0')
     if op.name.endswith(op_suffix_epsilon):
-      batch_epsilon_tensor = graph.get_tensor_by_name(op.name + ':0')
+      batch_epsilon = graph.get_tensor_by_name(op.name + ':0')
     if op.name.endswith(op_suffix_bn_decay_mean):
       bn_decay_mean_tensor = graph.get_tensor_by_name(op.name + ':0')
     if op.name.endswith(op_suffix_bn_decay_var):
@@ -535,7 +561,7 @@ def _GetBatchNormParams(graph, context, has_scaling):
       moving_variance_tensor=moving_variance_tensor,
       bn_decay_mean_tensor=bn_decay_mean_tensor,
       bn_decay_var_tensor=bn_decay_var_tensor,
-      batch_epsilon_tensor=batch_epsilon_tensor)
+      batch_epsilon=batch_epsilon)
 
 
 def _CreateFoldedOp(graph, context, has_scaling, freeze_batch_norm_delay,
@@ -816,7 +842,7 @@ class _BatchNormMatch(object):
   def __init__(self, layer_op, bn_op, output_tensor, input_tensor,
                weight_tensor, gamma_tensor, beta_tensor, mean_tensor,
                variance_tensor, moving_mean_tensor, moving_variance_tensor,
-               bn_decay_mean_tensor, bn_decay_var_tensor, batch_epsilon_tensor):
+               bn_decay_mean_tensor, bn_decay_var_tensor, batch_epsilon):
     self._layer_op = layer_op
     self._bn_op = bn_op
     self._output_tensor = output_tensor
@@ -830,7 +856,7 @@ class _BatchNormMatch(object):
     self._moving_variance_tensor = moving_variance_tensor
     self._bn_decay_mean_tensor = bn_decay_mean_tensor
     self._bn_decay_var_tensor = bn_decay_var_tensor
-    self._batch_epsilon_tensor = batch_epsilon_tensor
+    self._batch_epsilon = batch_epsilon
 
   @property
   def layer_op(self):
@@ -877,8 +903,8 @@ class _BatchNormMatch(object):
     return self._moving_variance_tensor
 
   @property
-  def batch_epsilon_tensor(self):
-    return self._batch_epsilon_tensor
+  def batch_epsilon(self):
+    return self._batch_epsilon
 
   @property
   def bn_decay_mean_tensor(self):
diff --git a/tensorflow/contrib/quantize/python/fold_batch_norms_test.py b/tensorflow/contrib/quantize/python/fold_batch_norms_test.py
index c90a18ab0357f1bcbc5d8ccd48edf894d7baf5f9..af31467476b1536adef2bb74308fd1093f7bea7a 100644
--- a/tensorflow/contrib/quantize/python/fold_batch_norms_test.py
+++ b/tensorflow/contrib/quantize/python/fold_batch_norms_test.py
@@ -128,6 +128,9 @@ class FoldBatchNormsTest(test_util.TensorFlowTestCase):
     output_op_names = ['test/Add' if with_bypass else 'test/' + relu_op_name]
     self._AssertOutputGoesToOps(folded_add, g, output_op_names)
 
+    for op in g.get_operations():
+      self.assertFalse('//' in op.name, 'Double slash in op %s' % op.name)
+
   def testFoldConv2d(self):
     self._RunTestOverParameters(self._TestFoldConv2d)
 
@@ -196,6 +199,9 @@ class FoldBatchNormsTest(test_util.TensorFlowTestCase):
     output_op_names = ['test/Add' if with_bypass else 'test/' + relu_op_name]
     self._AssertOutputGoesToOps(folded_add, g, output_op_names)
 
+    for op in g.get_operations():
+      self.assertFalse('//' in op.name, 'Double slash in op %s' % op.name)
+
   def testFoldConv2dUnknownShape(self):
     self._RunTestOverParameters(self._TestFoldConv2dUnknownShape)
 
@@ -260,6 +266,9 @@ class FoldBatchNormsTest(test_util.TensorFlowTestCase):
     output_op_names = ['test/Add' if with_bypass else 'test/' + relu_op_name]
     self._AssertOutputGoesToOps(folded_add, g, output_op_names)
 
+    for op in g.get_operations():
+      self.assertFalse('//' in op.name, 'Double slash in op %s' % op.name)
+
   def testFoldFullyConnectedLayer(self):
     self._RunTestOverParameters(self._TestFoldFullyConnectedLayer)
 
@@ -337,6 +346,9 @@ class FoldBatchNormsTest(test_util.TensorFlowTestCase):
     output_op_names = ['test/Add' if with_bypass else 'test/' + relu_op_name]
     self._AssertOutputGoesToOps(folded_add, g, output_op_names)
 
+    for op in g.get_operations():
+      self.assertFalse('//' in op.name, 'Double slash in op %s' % op.name)
+
   def testFoldDepthwiseConv2d(self):
     self._RunTestOverParameters(self._TestFoldDepthwiseConv2d)
 
diff --git a/tensorflow/contrib/quantize/python/quant_ops.py b/tensorflow/contrib/quantize/python/quant_ops.py
index 0a8e35080cb08f71dc28e33c6138a12656e5a5ea..a4f7b1b22139588be29171126d43b872d6658168 100644
--- a/tensorflow/contrib/quantize/python/quant_ops.py
+++ b/tensorflow/contrib/quantize/python/quant_ops.py
@@ -282,8 +282,8 @@ def _FakeQuantWithMinMaxVars(inputs, min_var, max_var, per_channel, num_bits,
   Args:
     inputs: a tensor containing values to be quantized.
     min_var: a variable containing quantization range lower end(s).
-    max_var: a variable containing quantization range lupper end(s).
-    per_channel: a boolean specifying whether to use per-channel quantizatioh.
+    max_var: a variable containing quantization range upper end(s).
+    per_channel: a boolean specifying whether to use per-channel quantization.
     num_bits: Number of bits to use for quantization, must be between 2 and 8.
     narrow_range: Whether to use the narrow quantization range
       [1; 2^num_bits - 1] or wide range [0; 2^num_bits - 1].
diff --git a/tensorflow/contrib/quantize/python/quantize.py b/tensorflow/contrib/quantize/python/quantize.py
index 5fd806d195dce671d079386ea4b6c89042e26cf6..33f14e8d0e9b7ce5704523d68bdcdee70ffff1cd 100644
--- a/tensorflow/contrib/quantize/python/quantize.py
+++ b/tensorflow/contrib/quantize/python/quantize.py
@@ -35,8 +35,7 @@ _QUANTIZABLE_TYPES = {'Conv2D', 'MatMul', 'DepthwiseConv2dNative'}
 _ACTIVATION_TYPES = {'Relu', 'Relu6', 'Identity'}
 
 # Weight types that are supported by the quantization rewrite.
-# TODO(suharshs): Add support for ResourceVariable.
-_WEIGHT_TYPES = {'Variable', 'VariableV2'}
+_WEIGHT_TYPES = {'Variable', 'VariableV2', 'VarHandleOp'}
 
 
 def Quantize(graph,
@@ -45,7 +44,7 @@ def Quantize(graph,
              activation_bits=8,
              ema_decay=0.999,
              quant_delay=None,
-             vars_collection=ops.GraphKeys.MOVING_AVERAGE_VARIABLES):
+             vars_collection=ops.GraphKeys.GLOBAL_VARIABLES):
   """Updates graph with quantization operations.
 
   Args:
@@ -124,10 +123,47 @@ def Quantize(graph,
           vars_collection=vars_collection,
           bits=activation_bits)
 
+    if layer_match.post_activation_bypass_op is not None:
+      _InsertQuantOp(
+          add_context,
+          'post_activation_bypass_quant',
+          layer_match.post_activation_bypass_op,
+          input_to_ops_map.ConsumerOperations(
+              layer_match.post_activation_bypass_op),
+          is_training,
+          moving_avg=True,
+          ema_decay=ema_decay,
+          quant_delay=quant_delay,
+          vars_collection=vars_collection,
+          bits=activation_bits)
+
 
 def _FindLayersToQuantize(graph):
   """Matches layers in graph to quantize.
 
+  The following patterns get matched. Nodes surrounded by [] will be
+  optionally matched:
+
+          weight|folded_weight
+                /
+         conv|fc
+            |
+    [post_conv_correction]
+            |
+     biasadd|folded_bias
+            |
+         [bypass]
+            |
+        activation
+            |
+   [post_activation_bypass]
+
+  Match replacements:
+    If weight|folded_weight is found, FakeQuant is added afterwards.
+    If bypass is found, FakeQuant is added before and after.
+    If activation is found, FakeQuant is added afterwards.
+    If post_activation_bypass is found, FakeQuant is added afterwards.
+
   Args:
     graph: Graph to perform match on.
 
@@ -137,7 +173,7 @@ def _FindLayersToQuantize(graph):
   input_pattern = graph_matcher.OpTypePattern('*')
   weight_var_pattern = graph_matcher.OpTypePattern('|'.join(_WEIGHT_TYPES))
   weight_pattern = graph_matcher.OpTypePattern(
-      'Identity', inputs=[weight_var_pattern])
+      'Identity|ReadVariableOp', inputs=[weight_var_pattern])
 
   folded_weight_pattern = graph_matcher.OpTypePattern('Mul')
 
@@ -180,7 +216,7 @@ def _FindLayersToQuantize(graph):
               [bias_add_pattern, folded_bias_add_pattern])
       ])
 
-  # The input to the activation can come from bias add, fold bias add or the
+  # The input to the activation can come from bias add, fold bias add, or the
   # bypasses.
   activation_pattern = graph_matcher.OpTypePattern(
       '|'.join(_ACTIVATION_TYPES),
@@ -191,7 +227,16 @@ def _FindLayersToQuantize(graph):
           ])
       ])
 
-  layer_matcher = graph_matcher.GraphMatcher(activation_pattern)
+  post_activation_bypass_pattern_a = graph_matcher.OpTypePattern(
+      'Add', inputs=['*', activation_pattern])
+  post_activation_bypass_pattern_b = graph_matcher.OpTypePattern(
+      'Add', inputs=[activation_pattern, '*'])
+
+  layer_matcher = graph_matcher.GraphMatcher(
+      graph_matcher.OneofPattern([
+          post_activation_bypass_pattern_a, post_activation_bypass_pattern_b,
+          activation_pattern
+      ]))
   for match_result in layer_matcher.match_graph(graph):
     layer_op = match_result.get_op(layer_pattern)
     weight_tensor = match_result.get_tensor(weight_pattern)
@@ -204,8 +249,19 @@ def _FindLayersToQuantize(graph):
     bypass_op = match_result.get_op(bypass_pattern_a)
     if bypass_op is None:
       bypass_op = match_result.get_op(bypass_pattern_b)
+    post_activation_bypass_op = match_result.get_op(
+        post_activation_bypass_pattern_a)
+    if post_activation_bypass_op is None:
+      post_activation_bypass_op = match_result.get_op(
+          post_activation_bypass_pattern_b)
+    # If we don't find a post_activation_bypass_op but activation_op has a
+    # bypass following it, then we need to skip this match, since there will be
+    # another match that includes post_activation_bypass_op.
+    if post_activation_bypass_op is None and _HasPostActivationBypass(
+        activation_op):
+      continue
     yield _LayerMatch(layer_op, weight_tensor, activation_op, bypass_op,
-                      bias_add_op)
+                      post_activation_bypass_op, bias_add_op)
 
   # Match the final layer, where there will not be an activation and instead
   # the output of the final BiasAdd must be quantized, so we treat it as the
@@ -216,19 +272,32 @@ def _FindLayersToQuantize(graph):
   for match_result in final_layer_matcher.match_graph(graph):
     layer_op = match_result.get_op(layer_pattern)
     weight_tensor = match_result.get_tensor(weight_pattern)
+    if weight_tensor is None:
+      weight_tensor = match_result.get_tensor(folded_weight_pattern)
     activation_op = match_result.get_op(bias_add_pattern)
-    yield _LayerMatch(layer_op, weight_tensor, activation_op, None, None)
+    if activation_op is None:
+      activation_op = match_result.get_op(folded_bias_add_pattern)
+    yield _LayerMatch(layer_op, weight_tensor, activation_op, None, None, None)
+
+
+def _HasPostActivationBypass(activation_op):
+  for activation_tensor in activation_op.outputs:
+    for output_op in activation_tensor.consumers():
+      if output_op.type == 'Add':
+        return True
+  return False
 
 
 class _LayerMatch(object):
   """Contains all information related to a matched Layer."""
 
   def __init__(self, layer_op, weight_tensor, activation_op, bypass_op,
-               bias_add_op):
+               post_activation_bypass_op, bias_add_op):
     self._layer_op = layer_op
     self._weight_tensor = weight_tensor
     self._activation_op = activation_op
     self._bypass_op = bypass_op
+    self._post_activation_bypass_op = post_activation_bypass_op
     self._bias_add_op = bias_add_op
 
   @property
@@ -247,6 +316,10 @@ class _LayerMatch(object):
   def bypass_op(self):
     return self._bypass_op
 
+  @property
+  def post_activation_bypass_op(self):
+    return self._post_activation_bypass_op
+
   @property
   def bias_add_op(self):
     return self._bias_add_op
@@ -263,12 +336,12 @@ def _InsertQuantOp(context,
                    bits=8,
                    ema_decay=0.999,
                    quant_delay=None,
-                   vars_collection=ops.GraphKeys.MOVING_AVERAGE_VARIABLES,
+                   vars_collection=ops.GraphKeys.GLOBAL_VARIABLES,
                    narrow_range=False):
   """Inserts a quant op between a producer op and (multiple) consumer ops.
 
   Args:
-    context: Context w,here producer and consumer operations are nested.
+    context: Context where producer and consumer operations are nested.
     name: Name for the new quantization op within the context.
     producer: Producer operation of the pairs where quantization will be
       inserted.
@@ -294,6 +367,12 @@ def _InsertQuantOp(context,
       consumer operation.
   """
   name_prefix = _AddContextToName(context, name)
+  # This is needed on TPU, where name_scope == 'TPUReplicate/loop' and
+  # name_prefix starts with 'TPUReplicate/loop/'. Without dropping the prefix,
+  # variables are created as TPUReplicate/loop/TPUReplicate/loop/..., which
+  # breaks things later.
+  name_prefix = common.DropStringPrefix(name_prefix, ops.get_name_scope() + '/')
+
   inputs = producer.outputs[0]
   if moving_avg:
     quant = (
diff --git a/tensorflow/contrib/quantize/python/quantize_graph.py b/tensorflow/contrib/quantize/python/quantize_graph.py
index 5a3a74cec4864ad3808d485849334c81f569d300..0b74b438ac317967bbe10ad936b451de6f69d62c 100644
--- a/tensorflow/contrib/quantize/python/quantize_graph.py
+++ b/tensorflow/contrib/quantize/python/quantize_graph.py
@@ -72,6 +72,8 @@ def _create_graph(input_graph=None,
 def create_training_graph(input_graph=None, quant_delay=0):
   """Rewrites a training input_graph in place for simulated quantization.
 
+  Variables added by the rewrite get added to the global variables collection.
+
   The graph has fake quantization ops inserted to simulate the error
   introduced by quantization. Since the graph is transformed in place,
   the expected behavior of previously held references to nodes and tensors may
@@ -97,16 +99,7 @@ def create_training_graph(input_graph=None, quant_delay=0):
   # TODO(raghuramank) Need to have freeze_bn_delay be a function of batch size
   # Currently the values below are hardcoded for mobilenetV1 on imagenet
   # Please use the experimental API if you need to tune these values.
-  if quant_delay == 0:
-    # Corresponds to case of restoring from a floating point checkpoint
-    # In this case, we can freeze the moving mean and variance early on and
-    # switch to using them during training. Therefore, freeze_bn_delay is set to
-    # 2e5.
-    freeze_bn_delay = int(2e5)
-  else:
-    # If training from scratch, set freeze_bn_delay to 100 epochs after quant
-    # delay. With a batch size of 64, this corresponds to 20000*100=2M steps.
-    freeze_bn_delay = quant_delay + int(2e6)
+  freeze_bn_delay = None
 
   _create_graph(
       input_graph=input_graph,
@@ -118,6 +111,8 @@ def create_training_graph(input_graph=None, quant_delay=0):
 def create_eval_graph(input_graph=None):
   """Rewrites an eval input_graph in place for simulated quantization.
 
+  Variables added by the rewrite get added to the global variables collection.
+
   The graph has fake quantization ops inserted to simulate the error
   introduced by quantization. Since the graph is transformed in place,
   the expected behavior of previously held references to nodes and tensors may
@@ -138,9 +133,11 @@ def experimental_create_training_graph(input_graph=None,
                                        weight_bits=8,
                                        activation_bits=8,
                                        quant_delay=0,
-                                       freeze_bn_delay=int(2e5)):
+                                       freeze_bn_delay=None):
   """Rewrites a training input_graph in place for simulated quantization.
 
+  Variables added by the rewrite get added to the global variables collection.
+
   This function has additional experimental options not (yet) available to
   create_training_graph. The resulting behavior may be undefined.
 
@@ -158,7 +155,7 @@ def experimental_create_training_graph(input_graph=None,
   often fail.
 
   Args:
-    input_graph: The tf.Graph to be transformed,if None then defaults to the
+    input_graph: The tf.Graph to be transformed, if None then defaults to the
       default graph.
     weight_bits: Number of bits to use for quantizing weights.
     activation_bits: Number of bits to use for quantizing activations.
@@ -188,6 +185,8 @@ def experimental_create_eval_graph(input_graph=None,
                                    activation_bits=8):
   """Rewrites an eval input_graph in place for simulated quantization.
 
+  Variables added by the rewrite get added to the global variables collection.
+
   This function has additional experimental options not (yet) available to
   create_eval_graph. The resulting behavior may be undefined.
 
diff --git a/tensorflow/contrib/quantize/python/quantize_parameterized_test.py b/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
index 639a7454a92aebd7289c59498cebff82cc003f75..db745aa56212af6a9c20e06ee9e4e5d6e27cf3c3 100644
--- a/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
+++ b/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
@@ -28,6 +28,7 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.platform import googletest
 
 batch_norm = layers.batch_norm
@@ -56,52 +57,46 @@ class QuantizeTest(test_util.TensorFlowTestCase):
         (array_ops.identity, 'Identity', True, 5000),
     ]
     for params in parameters_list:
-      test_fn(params[0], params[1], params[2], params[3])
+      # Test everything with resource variables and normal variables.
+      test_fn(params[0], params[1], params[2], params[3], False)
+      test_fn(params[0], params[1], params[2], params[3], True)
 
-  def _TestQuantize_Conv2dWithoutBatchNorm(self, activation, activation_op_name,
-                                           with_bypass, delay):
-    """Tests quantization: inputs -> Conv2d no batch norm -> Activation.
-
-    Args:
-      activation: Callable that returns an Operation, a factory method for the
-        Activation.
-      activation_op_name: String, name of the Activation operation.
-      with_bypass: Bool, when true there is an extra connection added from
-        inputs to just before Activation.
-      delay: Int (optional), delay in number of steps until quantization starts.
-    """
-    graph = ops.Graph()
-    with graph.as_default():
-      batch_size, height, width, depth = 5, 128, 128, 3
-      inputs = array_ops.zeros((batch_size, height, width, depth))
-      stride = 1 if with_bypass else 2
-      out_depth = 3 if with_bypass else 32
-      activation_fn = None if with_bypass else activation
-      scope = 'test/test2' if with_bypass else 'test'
-      node = conv2d(inputs, out_depth, [5, 5], stride=stride, padding='SAME',
-                    weights_initializer=self._WeightInit(0.09),
-                    activation_fn=activation_fn, scope=scope)
-      if with_bypass:
-        node = math_ops.add(inputs, node, name='test/Add')
-        node = activation(node, name='test/' + activation_op_name)
-      update_barrier = control_flow_ops.no_op(name='update_barrier')
-      with ops.control_dependencies([update_barrier]):
-        array_ops.identity(node, name='control_dependency')
-
-      quantize.Quantize(graph, True, quant_delay=delay)
+  def _AssertCorrectQuantizedGraphWithoutBatchNorm(
+      self, graph, scope, layer, activation_op_name, with_bypass, delay,
+      use_resource):
     quantization_node_name = 'FakeQuantWithMinMaxVars'
     weights_quant = graph.get_operation_by_name(scope + '/weights_quant/' +
                                                 quantization_node_name)
     self.assertEqual(weights_quant.type, quantization_node_name)
-    expected_inputs = [
-        scope + '/weights_quant/AssignMinLast',
-        scope + '/weights_quant/AssignMaxLast', scope + '/weights/read'
-    ]
+
+    # Assemble the expected inputs.
+    if use_resource:
+      expected_inputs = [
+          scope + '/weights_quant/FakeQuantWithMinMaxVars/ReadVariableOp',
+          scope + '/weights_quant/FakeQuantWithMinMaxVars/ReadVariableOp_1',
+      ]
+      if layer == 'DepthwiseConv2dNative':
+        expected_inputs.append(scope + '/depthwise/ReadVariableOp')
+      else:
+        expected_inputs.append(scope + '/' + layer + '/ReadVariableOp')
+    else:
+      expected_inputs = [
+          scope + '/weights_quant/AssignMinLast',
+          scope + '/weights_quant/AssignMaxLast',
+      ]
+      if layer == 'DepthwiseConv2dNative':
+        expected_inputs.append(scope + '/depthwise_weights/read')
+      else:
+        expected_inputs.append(scope + '/weights/read')
+
     self._AssertInputOpsAre(weights_quant, expected_inputs)
     if delay and delay > 0:
       output_op_name = scope + '/weights_quant/delayed_quant/Switch_1'
     else:
-      output_op_name = scope + '/Conv2D'
+      if layer == 'DepthwiseConv2dNative':
+        output_op_name = scope + '/depthwise'
+      else:
+        output_op_name = scope + '/' + layer
 
     self._AssertOutputGoesToOps(weights_quant, graph, [output_op_name])
 
@@ -109,10 +104,17 @@ class QuantizeTest(test_util.TensorFlowTestCase):
       conv_quant = graph.get_operation_by_name(scope + '/conv_quant/' +
                                                quantization_node_name)
       self.assertEqual(conv_quant.type, quantization_node_name)
-      expected_inputs = [
-          scope + '/conv_quant/AssignMinEma',
-          scope + '/conv_quant/AssignMaxEma', scope + '/BiasAdd'
-      ]
+      if use_resource:
+        expected_inputs = [
+            scope + '/conv_quant/FakeQuantWithMinMaxVars/ReadVariableOp',
+            scope + '/conv_quant/FakeQuantWithMinMaxVars/ReadVariableOp_1',
+            scope + '/BiasAdd',
+        ]
+      else:
+        expected_inputs = [
+            scope + '/conv_quant/AssignMinEma',
+            scope + '/conv_quant/AssignMaxEma', scope + '/BiasAdd'
+        ]
       self._AssertInputOpsAre(conv_quant, expected_inputs)
       output_op_name = (scope + '/conv_quant/delayed_quant/Switch_1'
                         if delay else 'test/Add')
@@ -121,22 +123,76 @@ class QuantizeTest(test_util.TensorFlowTestCase):
     act_quant = graph.get_operation_by_name('test/act_quant/' +
                                             quantization_node_name)
     self.assertEqual(act_quant.type, quantization_node_name)
-
-    expected_inputs = [
-        'test/act_quant/AssignMinEma', 'test/act_quant/AssignMaxEma',
-        'test/' + activation_op_name
-    ]
+    if use_resource:
+      expected_inputs = [
+          'test/act_quant/FakeQuantWithMinMaxVars/ReadVariableOp',
+          'test/act_quant/FakeQuantWithMinMaxVars/ReadVariableOp_1',
+          'test/' + activation_op_name,
+      ]
+    else:
+      expected_inputs = [
+          'test/act_quant/AssignMinEma', 'test/act_quant/AssignMaxEma',
+          'test/' + activation_op_name
+      ]
     self._AssertInputOpsAre(act_quant, expected_inputs)
     output_op_name = ('test/act_quant/delayed_quant/Switch_1'
                       if delay else 'control_dependency')
     self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+    self._AssertIdempotent(graph)
 
   def testQuantize_Conv2dWithoutBatchNorm(self):
     self._RunWithoutBatchNormTestOverParameters(
         self._TestQuantize_Conv2dWithoutBatchNorm)
 
+  def _TestQuantize_Conv2dWithoutBatchNorm(self, activation, activation_op_name,
+                                           with_bypass, delay, use_resource):
+    """Tests quantization: inputs -> Conv2d no batch norm -> Activation.
+
+    Args:
+      activation: Callable that returns an Operation, a factory method for the
+        Activation.
+      activation_op_name: String, name of the Activation operation.
+      with_bypass: Bool, when true there is an extra connection added from
+        inputs to just before Activation.
+      delay: Int (optional), delay in number of steps until quantization starts.
+      use_resource: Bool, when true uses resource variables.
+    """
+    graph = ops.Graph()
+    with graph.as_default():
+      variable_scope.get_variable_scope().set_use_resource(use_resource)
+      batch_size, height, width, depth = 5, 128, 128, 3
+      inputs = array_ops.zeros((batch_size, height, width, depth))
+      stride = 1 if with_bypass else 2
+      out_depth = 3 if with_bypass else 32
+      activation_fn = None if with_bypass else activation
+      scope = 'test/test2' if with_bypass else 'test'
+      node = conv2d(
+          inputs,
+          out_depth, [5, 5],
+          stride=stride,
+          padding='SAME',
+          weights_initializer=self._WeightInit(0.09),
+          activation_fn=activation_fn,
+          scope=scope)
+      if with_bypass:
+        node = math_ops.add(inputs, node, name='test/Add')
+        node = activation(node, name='test/' + activation_op_name)
+      update_barrier = control_flow_ops.no_op(name='update_barrier')
+      with ops.control_dependencies([update_barrier]):
+        array_ops.identity(node, name='control_dependency')
+
+      quantize.Quantize(graph, True, quant_delay=delay)
+
+    self._AssertCorrectQuantizedGraphWithoutBatchNorm(
+        graph, scope, 'Conv2D', activation_op_name, with_bypass, delay,
+        use_resource)
+
+  def testQuantize_FCWithoutBatchNorm(self):
+    self._RunWithoutBatchNormTestOverParameters(
+        self._TestQuantize_FCWithoutBatchNorm)
+
   def _TestQuantize_FCWithoutBatchNorm(self, activation, activation_op_name,
-                                       with_bypass, delay):
+                                       with_bypass, delay, use_resource):
     """Tests quantization: inputs -> FC no batch norm -> Activation.
 
     Args:
@@ -146,72 +202,40 @@ class QuantizeTest(test_util.TensorFlowTestCase):
       with_bypass: Bool, when true there is an extra connection added from
         inputs to just before Activation.
       delay: Int (optional), delay in number of steps until quantization starts.
+      use_resource: Bool, when true uses resource variables.
     """
     graph = ops.Graph()
     with graph.as_default():
+      variable_scope.get_variable_scope().set_use_resource(use_resource)
       batch_size, depth = 5, 256
       inputs = array_ops.zeros((batch_size, depth))
       out_depth = 256 if with_bypass else 128
       activation_fn = None if with_bypass else activation
       scope = 'test/test2' if with_bypass else 'test'
-      node = fully_connected(inputs, out_depth,
-                             weights_initializer=self._WeightInit(0.03),
-                             activation_fn=activation_fn, scope=scope)
+      node = fully_connected(
+          inputs,
+          out_depth,
+          weights_initializer=self._WeightInit(0.03),
+          activation_fn=activation_fn,
+          scope=scope)
       if with_bypass:
         node = math_ops.add(inputs, node, name='test/Add')
         node = activation(node, name='test/' + activation_op_name)
       update_barrier = control_flow_ops.no_op(name='update_barrier')
       with ops.control_dependencies([update_barrier]):
         array_ops.identity(node, name='control_dependency')
-
       quantize.Quantize(graph, True, quant_delay=delay)
 
-    quantization_node_name = 'FakeQuantWithMinMaxVars'
-    weights_quant = graph.get_operation_by_name(scope + '/weights_quant/' +
-                                                quantization_node_name)
-    self.assertEqual(weights_quant.type, quantization_node_name)
-    expected_inputs = [
-        scope + '/weights_quant/AssignMinLast',
-        scope + '/weights_quant/AssignMaxLast', scope + '/weights/read'
-    ]
-    self._AssertInputOpsAre(weights_quant, expected_inputs)
-    if delay and delay > 0:
-      output_op_name = scope + '/weights_quant/delayed_quant/Switch_1'
-    else:
-      output_op_name = scope + '/MatMul'
-    self._AssertOutputGoesToOps(weights_quant, graph, [output_op_name])
-
-    if with_bypass:
-      conv_quant = graph.get_operation_by_name(scope + '/conv_quant/' +
-                                               quantization_node_name)
-      self.assertEqual(conv_quant.type, quantization_node_name)
-      expected_inputs = [
-          scope + '/conv_quant/AssignMinEma',
-          scope + '/conv_quant/AssignMaxEma', scope + '/BiasAdd'
-      ]
-      self._AssertInputOpsAre(conv_quant, expected_inputs)
-      output_op_name = (scope + '/conv_quant/delayed_quant/Switch_1'
-                        if delay else 'test/Add')
-      self._AssertOutputGoesToOps(conv_quant, graph, [output_op_name])
-
-    act_quant = graph.get_operation_by_name('test/act_quant/' +
-                                            quantization_node_name)
-    self.assertEqual(act_quant.type, quantization_node_name)
-    expected_inputs = [
-        'test/act_quant/AssignMinEma', 'test/act_quant/AssignMaxEma',
-        'test/' + activation_op_name
-    ]
-    self._AssertInputOpsAre(act_quant, expected_inputs)
-    output_op_name = ('test/act_quant/delayed_quant/Switch_1'
-                      if delay else 'control_dependency')
-    self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+    self._AssertCorrectQuantizedGraphWithoutBatchNorm(
+        graph, scope, 'MatMul', activation_op_name, with_bypass, delay,
+        use_resource)
 
-  def testQuantize_FCWithoutBatchNorm(self):
+  def testQuantize_DepthwiseConv2dWithoutBatchNorm(self):
     self._RunWithoutBatchNormTestOverParameters(
-        self._TestQuantize_FCWithoutBatchNorm)
+        self._TestQuantize_DepthwiseConv2dWithoutBatchNorm)
 
   def _TestQuantize_DepthwiseConv2dWithoutBatchNorm(
-      self, activation, activation_op_name, with_bypass, delay):
+      self, activation, activation_op_name, with_bypass, delay, use_resource):
     """Tests quantization: inputs -> DWConv2d no batch norm -> Activation.
 
     Args:
@@ -221,71 +245,36 @@ class QuantizeTest(test_util.TensorFlowTestCase):
       with_bypass: Bool, when true there is an extra connection added from
         inputs to just before Activation.
       delay: Int (optional), delay in number of steps until quantization starts.
+      use_resource: Bool, when true uses resource variables.
     """
     graph = ops.Graph()
     with graph.as_default():
+      variable_scope.get_variable_scope().set_use_resource(use_resource)
       batch_size, height, width, depth = 5, 128, 128, 3
       inputs = array_ops.zeros((batch_size, height, width, depth))
       stride = 1 if with_bypass else 2
       activation_fn = None if with_bypass else activation
       scope = 'test/test2' if with_bypass else 'test'
-      node = separable_conv2d(inputs, None, [5, 5], stride=stride,
-                              depth_multiplier=1.0, padding='SAME',
-                              weights_initializer=self._WeightInit(0.09),
-                              activation_fn=activation_fn, scope=scope)
+      node = separable_conv2d(
+          inputs,
+          None, [5, 5],
+          stride=stride,
+          depth_multiplier=1.0,
+          padding='SAME',
+          weights_initializer=self._WeightInit(0.09),
+          activation_fn=activation_fn,
+          scope=scope)
       if with_bypass:
         node = math_ops.add(inputs, node, name='test/Add')
         node = activation(node, name='test/' + activation_op_name)
       update_barrier = control_flow_ops.no_op(name='update_barrier')
       with ops.control_dependencies([update_barrier]):
         array_ops.identity(node, name='control_dependency')
-
       quantize.Quantize(graph, True, quant_delay=delay)
 
-    quantization_node_name = 'FakeQuantWithMinMaxVars'
-    weights_quant = graph.get_operation_by_name(scope + '/weights_quant/' +
-                                                quantization_node_name)
-    self.assertEqual(weights_quant.type, quantization_node_name)
-    expected_inputs = [
-        scope + '/weights_quant/AssignMinLast',
-        scope + '/weights_quant/AssignMaxLast',
-        scope + '/depthwise_weights/read'
-    ]
-    self._AssertInputOpsAre(weights_quant, expected_inputs)
-    if delay and delay > 0:
-      output_op_name = scope + '/weights_quant/delayed_quant/Switch_1'
-    else:
-      output_op_name = scope + '/depthwise'
-    self._AssertOutputGoesToOps(weights_quant, graph, [output_op_name])
-
-    if with_bypass:
-      conv_quant = graph.get_operation_by_name(scope + '/conv_quant/' +
-                                               quantization_node_name)
-      self.assertEqual(conv_quant.type, quantization_node_name)
-      expected_inputs = [
-          scope + '/conv_quant/AssignMinEma',
-          scope + '/conv_quant/AssignMaxEma', scope + '/BiasAdd'
-      ]
-      self._AssertInputOpsAre(conv_quant, expected_inputs)
-      output_op_name = (scope + '/conv_quant/delayed_quant/Switch_1'
-                        if delay else 'test/Add')
-      self._AssertOutputGoesToOps(conv_quant, graph, [output_op_name])
-
-    act_quant = graph.get_operation_by_name('test/act_quant/' +
-                                            quantization_node_name)
-    self.assertEqual(act_quant.type, quantization_node_name)
-    expected_inputs = [
-        'test/act_quant/AssignMinEma', 'test/act_quant/AssignMaxEma',
-        'test/' + activation_op_name
-    ]
-    self._AssertInputOpsAre(act_quant, expected_inputs)
-    output_op_name = ('test/act_quant/delayed_quant/Switch_1'
-                      if delay else 'control_dependency')
-    self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
-
-  def testQuantize_DepthwiseConv2dWithoutBatchNorm(self):
-    self._RunWithoutBatchNormTestOverParameters(
-        self._TestQuantize_DepthwiseConv2dWithoutBatchNorm)
+    self._AssertCorrectQuantizedGraphWithoutBatchNorm(
+        graph, scope, 'DepthwiseConv2dNative', activation_op_name, with_bypass,
+        delay, use_resource)
 
   def _RunBatchNormTestOverParameters(self, test_fn):
     # TODO(suharshs): Use parameterized test once OSS TF supports it.
@@ -317,13 +306,88 @@ class QuantizeTest(test_util.TensorFlowTestCase):
         (array_ops.identity, 'Identity', True, 5000, True)
     ]
     for params in parameters_list:
-      test_fn(params[0], params[1], params[2], params[3], params[4])
+      # Test everything with resource variables and normal variables.
+      test_fn(params[0], params[1], params[2], params[3], params[4], False)
+      test_fn(params[0], params[1], params[2], params[3], params[4], True)
+
+  def _AssertCorrectQuantizedGraphWithBatchNorm(self, graph, scope, layer,
+                                                activation_op_name, with_bypass,
+                                                delay, use_resource):
+    quantization_node_name = 'FakeQuantWithMinMaxVars'
+    weights_quant = graph.get_operation_by_name(
+        scope + '/weights_quant/' + quantization_node_name)
+    self.assertEqual(weights_quant.type, quantization_node_name)
+    if use_resource:
+      expected_inputs = [
+          scope + '/weights_quant/FakeQuantWithMinMaxVars/ReadVariableOp',
+          scope + '/weights_quant/FakeQuantWithMinMaxVars/ReadVariableOp_1',
+      ]
+    else:
+      expected_inputs = [
+          scope + '/weights_quant/' + 'AssignMinLast',
+          scope + '/weights_quant/' + 'AssignMaxLast'
+      ]
+    expected_inputs.append(scope + '/mul_fold')
+
+    self._AssertInputOpsAre(weights_quant, expected_inputs)
+    if layer == 'DepthwiseConv2dNative':
+      output_op_name = scope + ('/weights_quant/delayed_quant/Switch_1'
+                                if delay else '/depthwise_Fold')
+    else:
+      output_op_name = scope + ('/weights_quant/delayed_quant/Switch_1'
+                                if delay else '/' + layer + '_Fold')
+    self._AssertOutputGoesToOps(weights_quant, graph, [output_op_name])
+
+    if with_bypass:
+      conv_quant = graph.get_operation_by_name(
+          scope + '/conv_quant/' + quantization_node_name)
+      self.assertEqual(conv_quant.type, quantization_node_name)
+
+      if use_resource:
+        expected_inputs = [
+            scope + '/conv_quant/FakeQuantWithMinMaxVars/ReadVariableOp',
+            scope + '/conv_quant/FakeQuantWithMinMaxVars/ReadVariableOp_1',
+        ]
+      else:
+        expected_inputs = [
+            scope + '/conv_quant/AssignMinEma',
+            scope + '/conv_quant/AssignMaxEma',
+        ]
+      expected_inputs.append(scope + '/add_fold')
+
+      self._AssertInputOpsAre(conv_quant, expected_inputs)
+      output_op_name = (
+          scope + '/conv_quant/delayed_quant/Switch_1' if delay else 'test/Add')
+      self._AssertOutputGoesToOps(conv_quant, graph, [output_op_name])
+
+    act_quant = graph.get_operation_by_name(
+        'test/act_quant/' + quantization_node_name)
+    self.assertEqual(act_quant.type, quantization_node_name)
+
+    if use_resource:
+      expected_inputs = [
+          'test/act_quant/FakeQuantWithMinMaxVars/ReadVariableOp',
+          'test/act_quant/FakeQuantWithMinMaxVars/ReadVariableOp_1',
+      ]
+    else:
+      expected_inputs = [
+          'test/act_quant/AssignMinEma',
+          'test/act_quant/AssignMaxEma',
+      ]
+    expected_inputs.append('test/' + activation_op_name)
+
+    self._AssertInputOpsAre(act_quant, expected_inputs)
+    output_op_name = ('test/act_quant/delayed_quant/Switch_1'
+                      if delay else 'control_dependency')
+    self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+    self._AssertIdempotent(graph)
 
   def testQuantize_Conv2dWithBatchNorm(self):
     self._RunBatchNormTestOverParameters(self._TestQuantize_Conv2dWithBatchNorm)
 
   def _TestQuantize_Conv2dWithBatchNorm(self, activation, activation_op_name,
-                                        with_bypass, delay, fused_batch_norm):
+                                        with_bypass, delay, fused_batch_norm,
+                                        use_resource):
     """Tests quantization: inputs -> Conv2d with batch norm -> Activation.
 
     Args:
@@ -334,9 +398,11 @@ class QuantizeTest(test_util.TensorFlowTestCase):
         inputs to just before Activation.
       delay: Int (optional), delay in number of steps until quantization starts.
       fused_batch_norm: Bool, when true use FusedBatchNorm.
+      use_resource: Bool, when true uses resource variables.
     """
     graph = ops.Graph()
     with graph.as_default():
+      variable_scope.get_variable_scope().set_use_resource(use_resource)
       batch_size, height, width, depth = 5, 128, 128, 3
       inputs = array_ops.zeros((batch_size, height, width, depth))
       stride = 1 if with_bypass else 2
@@ -353,7 +419,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
           normalizer_params=self._BatchNormParams(fused_batch_norm),
           scope=scope)
 
-      # Manually add a bypass (optionaly) and an activation.
+      # Manually add a bypass (optional) and an activation.
       if with_bypass:
         node = math_ops.add(inputs, node, name='test/Add')
 
@@ -364,52 +430,18 @@ class QuantizeTest(test_util.TensorFlowTestCase):
         array_ops.identity(node, name='control_dependency')
 
       fold_batch_norms.FoldBatchNorms(graph, is_training=True)
-
       quantize.Quantize(graph, True, quant_delay=delay)
 
-    quantization_node_name = 'FakeQuantWithMinMaxVars'
-    weights_quant = graph.get_operation_by_name(scope + '/weights_quant/' +
-                                                quantization_node_name)
-    self.assertEqual(weights_quant.type, quantization_node_name)
-    expected_inputs = [
-        scope + '/weights_quant/' + 'AssignMinLast',
-        scope + '/weights_quant/' + 'AssignMaxLast', scope + '/mul_fold'
-    ]
-    self._AssertInputOpsAre(weights_quant, expected_inputs)
-    output_op_name = scope + ('/weights_quant/delayed_quant/Switch_1'
-                              if delay else '/Conv2D_Fold')
-    self._AssertOutputGoesToOps(weights_quant, graph, [output_op_name])
-
-    if with_bypass:
-      conv_quant = graph.get_operation_by_name(scope + '/conv_quant/' +
-                                               quantization_node_name)
-      self.assertEqual(conv_quant.type, quantization_node_name)
-      expected_inputs = [
-          scope + '/conv_quant/AssignMinEma',
-          scope + '/conv_quant/AssignMaxEma', scope + '/add_fold'
-      ]
-      self._AssertInputOpsAre(conv_quant, expected_inputs)
-      output_op_name = (scope + '/conv_quant/delayed_quant/Switch_1'
-                        if delay else 'test/Add')
-      self._AssertOutputGoesToOps(conv_quant, graph, [output_op_name])
-
-    act_quant = graph.get_operation_by_name('test/act_quant/' +
-                                            quantization_node_name)
-    self.assertEqual(act_quant.type, quantization_node_name)
-    expected_inputs = [
-        'test/act_quant/AssignMinEma', 'test/act_quant/AssignMaxEma',
-        'test/' + activation_op_name
-    ]
-    self._AssertInputOpsAre(act_quant, expected_inputs)
-    output_op_name = ('test/act_quant/delayed_quant/Switch_1'
-                      if delay else 'control_dependency')
-    self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+      self._AssertCorrectQuantizedGraphWithBatchNorm(
+          graph, scope, 'Conv2D', activation_op_name, with_bypass, delay,
+          use_resource)
 
   def testQuantize_FCWithBatchNorm(self):
     self._RunBatchNormTestOverParameters(self._TestQuantize_FCWithBatchNorm)
 
   def _TestQuantize_FCWithBatchNorm(self, activation, activation_op_name,
-                                    with_bypass, delay, fused_batch_norm):
+                                    with_bypass, delay, fused_batch_norm,
+                                    use_resource):
     """Tests quantization: inputs -> FC with batch norm -> Activation.
 
     Args:
@@ -420,9 +452,11 @@ class QuantizeTest(test_util.TensorFlowTestCase):
         inputs to just before Activation.
       delay: Int (optional), delay in number of steps until quantization starts.
       fused_batch_norm: Bool, when true use FusedBatchNorm.
+      use_resource: Bool, when true uses resource variables.
     """
     graph = ops.Graph()
     with graph.as_default():
+      variable_scope.get_variable_scope().set_use_resource(use_resource)
       batch_size, depth = 5, 256
       inputs = array_ops.zeros((batch_size, depth))
       out_depth = 256 if with_bypass else 128
@@ -436,7 +470,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
           normalizer_params=self._BatchNormParams(fused_batch_norm),
           scope=scope)
 
-      # Manually add a bypass (optionaly) and an activation.
+      # Manually add a bypass (optional) and an activation.
       if with_bypass:
         node = math_ops.add(inputs, node, name='test/Add')
 
@@ -450,43 +484,9 @@ class QuantizeTest(test_util.TensorFlowTestCase):
 
       quantize.Quantize(graph, True, quant_delay=delay)
 
-    quantization_node_name = 'FakeQuantWithMinMaxVars'
-    weights_quant = graph.get_operation_by_name(scope + '/weights_quant/' +
-                                                quantization_node_name)
-    self.assertEqual(weights_quant.type, quantization_node_name)
-    expected_inputs = [
-        scope + '/weights_quant/' + 'AssignMinLast',
-        scope + '/weights_quant/' + 'AssignMaxLast', scope + '/mul_fold'
-    ]
-    self._AssertInputOpsAre(weights_quant, expected_inputs)
-    output_op_name = scope + ('/weights_quant/delayed_quant/Switch_1'
-                              if delay else '/MatMul_Fold')
-    self._AssertOutputGoesToOps(weights_quant, graph, [output_op_name])
-
-    if with_bypass:
-      conv_quant = graph.get_operation_by_name(scope + '/conv_quant/' +
-                                               quantization_node_name)
-      self.assertEqual(conv_quant.type, quantization_node_name)
-      expected_inputs = [
-          scope + '/conv_quant/AssignMinEma',
-          scope + '/conv_quant/AssignMaxEma', scope + '/add_fold'
-      ]
-      self._AssertInputOpsAre(conv_quant, expected_inputs)
-      output_op_name = (scope + '/conv_quant/delayed_quant/Switch_1'
-                        if delay else 'test/Add')
-      self._AssertOutputGoesToOps(conv_quant, graph, [output_op_name])
-
-    act_quant = graph.get_operation_by_name('test/act_quant/' +
-                                            quantization_node_name)
-    self.assertEqual(act_quant.type, quantization_node_name)
-    expected_inputs = [
-        'test/act_quant/AssignMinEma', 'test/act_quant/AssignMaxEma',
-        'test/' + activation_op_name
-    ]
-    self._AssertInputOpsAre(act_quant, expected_inputs)
-    output_op_name = ('test/act_quant/delayed_quant/Switch_1'
-                      if delay else 'control_dependency')
-    self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+    self._AssertCorrectQuantizedGraphWithBatchNorm(
+        graph, scope, 'MatMul', activation_op_name, with_bypass, delay,
+        use_resource)
 
   def testQuantize_DepthwiseConv2dWithBatchNorm(self):
     self._RunBatchNormTestOverParameters(
@@ -494,7 +494,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
 
   def _TestQuantize_DepthwiseConv2dWithBatchNorm(
       self, activation, activation_op_name, with_bypass, delay,
-      fused_batch_norm):
+      fused_batch_norm, use_resource):
     """Tests quantization: inputs -> DWConv2d with batch norm -> Activation.
 
     Args:
@@ -505,9 +505,11 @@ class QuantizeTest(test_util.TensorFlowTestCase):
         inputs to just before Activation.
       delay: Int (optional), delay in number of steps until quantization starts.
       fused_batch_norm: Bool, when true use FusedBatchNorm.
+      use_resource: Bool, when true uses resource variables.
     """
     graph = ops.Graph()
     with graph.as_default():
+      variable_scope.get_variable_scope().set_use_resource(use_resource)
       batch_size, height, width, depth = 5, 128, 128, 3
       inputs = array_ops.zeros((batch_size, height, width, depth))
       stride = 1 if with_bypass else 2
@@ -524,7 +526,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
           normalizer_params=self._BatchNormParams(fused_batch_norm),
           scope=scope)
 
-      # Manually add a bypass (optionaly) and an activation.
+      # Manually add a bypass (optional) and an activation.
       if with_bypass:
         node = math_ops.add(inputs, node, name='test/Add')
 
@@ -535,45 +537,21 @@ class QuantizeTest(test_util.TensorFlowTestCase):
         array_ops.identity(node, name='control_dependency')
 
       fold_batch_norms.FoldBatchNorms(graph, is_training=True)
-
       quantize.Quantize(graph, True, quant_delay=delay)
-    quantization_node_name = 'FakeQuantWithMinMaxVars'
-    weights_quant = graph.get_operation_by_name(scope + '/weights_quant/' +
-                                                quantization_node_name)
-    self.assertEqual(weights_quant.type, quantization_node_name)
-    expected_inputs = [
-        scope + '/weights_quant/' + 'AssignMinLast',
-        scope + '/weights_quant/' + 'AssignMaxLast', scope + '/mul_fold'
-    ]
-    self._AssertInputOpsAre(weights_quant, expected_inputs)
-    output_op_name = scope + ('/weights_quant/delayed_quant/Switch_1'
-                              if delay else '/depthwise_Fold')
-    self._AssertOutputGoesToOps(weights_quant, graph, [output_op_name])
 
-    if with_bypass:
-      conv_quant = graph.get_operation_by_name(scope + '/conv_quant/' +
-                                               quantization_node_name)
-      self.assertEqual(conv_quant.type, quantization_node_name)
-      expected_inputs = [
-          scope + '/conv_quant/AssignMinEma',
-          scope + '/conv_quant/AssignMaxEma', scope + '/add_fold'
-      ]
-      self._AssertInputOpsAre(conv_quant, expected_inputs)
-      output_op_name = (scope + '/conv_quant/delayed_quant/Switch_1'
-                        if delay else 'test/Add')
-      self._AssertOutputGoesToOps(conv_quant, graph, [output_op_name])
+      self._AssertCorrectQuantizedGraphWithBatchNorm(
+          graph, scope, 'DepthwiseConv2dNative', activation_op_name,
+          with_bypass, delay, use_resource)
 
-    act_quant = graph.get_operation_by_name('test/act_quant/' +
-                                            quantization_node_name)
-    self.assertEqual(act_quant.type, quantization_node_name)
-    expected_inputs = [
-        'test/act_quant/AssignMinEma', 'test/act_quant/AssignMaxEma',
-        'test/' + activation_op_name
-    ]
-    self._AssertInputOpsAre(act_quant, expected_inputs)
-    output_op_name = ('test/act_quant/delayed_quant/Switch_1'
-                      if delay else 'control_dependency')
-    self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+  def _AssertIdempotent(self, graph):
+    # Ensure that calling the rewrite again doesn't change the graph.
+    graph_def_before = str(graph.as_graph_def())
+    with graph.as_default():
+      # Re-run both rewrites on the already-rewritten graph.
+      fold_batch_norms.FoldBatchNorms(graph, is_training=True)
+      quantize.Quantize(graph, True)
+    graph_def_after = str(graph.as_graph_def())
+    self.assertEqual(graph_def_before, graph_def_after)
 
   def _BatchNormParams(self, fused=False):
     return {'center': True, 'scale': True, 'decay': 1.0 - 0.003, 'fused': fused}
@@ -587,7 +565,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
       stddev: Standard deviation of normal variable.
 
     Returns:
-      An initialized that initialzes with a truncated normal variable.
+      An initializer that initializes with a truncated normal variable.
     """
     return init_ops.truncated_normal_initializer(stddev=stddev)
 
diff --git a/tensorflow/contrib/quantize/python/quantize_test.py b/tensorflow/contrib/quantize/python/quantize_test.py
index ef59475167137e203db2f6ca7f43c7b8f1938060..bef58bad8d0a8aa2f68b66d6fa33800cefbafec0 100644
--- a/tensorflow/contrib/quantize/python/quantize_test.py
+++ b/tensorflow/contrib/quantize/python/quantize_test.py
@@ -135,6 +135,59 @@ class QuantizeTest(test_util.TensorFlowTestCase):
       self.assertTrue('FakeQuantWithMinMaxVars' in
                       [op.type for op in bias_add_op.outputs[0].consumers()])
 
+  def testPostActivationBypassQuantized(self):
+    self._RunTestOverParameters(self._TestPostActivationBypassQuantized)
+
+  def _TestPostActivationBypassQuantized(self, is_training):
+    graph = ops.Graph()
+    with graph.as_default():
+      batch_size, height, width, depth = 5, 128, 128, 3
+      input1 = array_ops.zeros((batch_size, height, width, depth))
+      input2 = array_ops.zeros((batch_size, height // 2, width // 2, 32))
+      conv = conv2d(
+          input1,
+          32, [5, 5],
+          stride=2,
+          padding='SAME',
+          weights_initializer=self._WeightInit(0.09),
+          activation_fn=array_ops.identity,
+          scope='test/test')
+      bypass_tensor = math_ops.add(conv, input2, name='test/add')
+      _ = array_ops.identity(bypass_tensor, name='test/output')
+
+      quantize.Quantize(graph, is_training, weight_bits=8, activation_bits=8)
+
+      # Ensure that the bypass node is preceded and followed by
+      # FakeQuantWithMinMaxVars operations.
+      self.assertTrue('FakeQuantWithMinMaxVars' in
+                      [c.type for c in bypass_tensor.consumers()])
+      self.assertTrue('FakeQuantWithMinMaxVars' in
+                      [i.op.type for i in bypass_tensor.op.inputs])
+
+  def testWithNameScope(self):
+    self._RunTestOverParameters(self._TestWithNameScope)
+
+  def _TestWithNameScope(self, is_training):
+    graph = ops.Graph()
+    with graph.as_default():
+      with graph.name_scope('name_scope'):
+        batch_size, height, width, depth = 5, 128, 128, 3
+        input1 = array_ops.zeros((batch_size, height, width, depth))
+        _ = conv2d(
+            input1,
+            32, [5, 5],
+            stride=2,
+            padding='SAME',
+            weights_initializer=self._WeightInit(0.09),
+            activation_fn=None,
+            scope='test')
+
+        quantize.Quantize(graph, is_training, weight_bits=8, activation_bits=8)
+
+    for op in graph.get_operations():
+      self.assertFalse(op.name.startswith('name_scope/name_scope/'),
+                       'Broken op: %s' % op.name)
+
   def _WeightInit(self, stddev):
     """Returns truncated normal variable initializer.
 
@@ -144,7 +197,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
       stddev: Standard deviation of normal variable.
 
     Returns:
-      An initialized that initialzes with a truncated normal variable.
+      An initializer that initializes with a truncated normal variable.
     """
     return init_ops.truncated_normal_initializer(stddev=stddev)
 
diff --git a/tensorflow/contrib/rnn/ops/gru_ops.cc b/tensorflow/contrib/rnn/ops/gru_ops.cc
index e91d1e8a80ed252e5f89e116fb0a325be67e3941..9c8e40851a0cc5bd7f37f94a62ecdef7248660c1 100644
--- a/tensorflow/contrib/rnn/ops/gru_ops.cc
+++ b/tensorflow/contrib/rnn/ops/gru_ops.cc
@@ -69,7 +69,7 @@ Element-wise dot product of a and b is represented by ab
 Element-wise dot product is represented by \circ
 Matrix multiplication is represented by *
 
-Baises are initialized with :
+Biases are initialized with:
 `b_ru` - constant_initializer(1.0)
 `b_c` - constant_initializer(0.0)
 
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
index 0e62b315b61cb3ceeb5cfd33bf5102a71abef83b..d41fc0b3ac1cee4eacc88cb0f41df1f9ee59e7c3 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
@@ -187,6 +187,8 @@ class RNNCellTest(test.TestCase):
               ],
               state_is_tuple=False)
           self.assertEqual(cell.dtype, None)
+          self.assertEqual("cell-0", cell._checkpoint_dependencies[0].name)
+          self.assertEqual("cell-1", cell._checkpoint_dependencies[1].name)
           g, out_m = cell(x, m)
           # Layer infers the input type.
           self.assertEqual(cell.dtype, dtype.name)
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py
index 57521c6a9ba0b2d66639017b09c541e270276323..de5df912921932056526e1e6dc5dbb905735f775 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py
@@ -869,7 +869,7 @@ class LSTMTest(test.TestCase):
     num_proj = 4
     max_length = 8
     sequence_length = [4, 6]
-    in_graph_mode = context.in_graph_mode()
+    in_graph_mode = not context.executing_eagerly()
     with self.test_session(graph=ops_lib.Graph()) as sess:
       initializer = init_ops.random_uniform_initializer(
           -0.01, 0.01, seed=self._seed)
@@ -934,8 +934,7 @@ class LSTMTest(test.TestCase):
       if in_graph_mode:
         self.assertAllEqual(outputs_static, outputs_dynamic)
       else:
-        self.assertAllEqual(
-            array_ops.stack(outputs_static).numpy(), outputs_dynamic.numpy())
+        self.assertAllEqual(array_ops.stack(outputs_static), outputs_dynamic)
       self.assertAllEqual(np.hstack(state_static), np.hstack(state_dynamic))
 
   @test_util.run_in_graph_and_eager_modes()
@@ -946,7 +945,7 @@ class LSTMTest(test.TestCase):
     num_proj = 4
     max_length = 8
     sequence_length = [4, 6]
-    in_graph_mode = context.in_graph_mode()
+    in_graph_mode = not context.executing_eagerly()
     with self.test_session(graph=ops_lib.Graph()) as sess:
       initializer = init_ops.random_uniform_initializer(
           -0.01, 0.01, seed=self._seed)
@@ -1022,10 +1021,9 @@ class LSTMTest(test.TestCase):
       if in_graph_mode:
         self.assertAllEqual(outputs_static, outputs_dynamic)
       else:
-        self.assertAllEqual(
-            array_ops.stack(outputs_static).numpy(), outputs_dynamic.numpy())
-        state_static = [s.numpy() for s in nest.flatten(state_static)]
-        state_dynamic = [s.numpy() for s in nest.flatten(state_dynamic)]
+        self.assertAllEqual(array_ops.stack(outputs_static), outputs_dynamic)
+        state_static = nest.flatten(state_static)
+        state_dynamic = nest.flatten(state_dynamic)
       self.assertAllEqual(np.hstack(state_static), np.hstack(state_dynamic))
 
   def _testDynamicEquivalentToStaticRNN(self, use_sequence_length):
@@ -1043,7 +1041,7 @@ class LSTMTest(test.TestCase):
     else:
       sequence_length = None
 
-    in_graph_mode = context.in_graph_mode()
+    in_graph_mode = not context.executing_eagerly()
 
     # TODO(b/68017812): Eager ignores operation seeds, so we need to create a
     # single cell and reuse it across the static and dynamic RNNs. Remove this
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/lstm_ops_test.py b/tensorflow/contrib/rnn/python/kernel_tests/lstm_ops_test.py
index 7957edf68cc8a1461fccfc2de93ad5250dc9fdb5..ffd24218944e150a32b1b915288ab1df90afb45c 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/lstm_ops_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/lstm_ops_test.py
@@ -54,7 +54,7 @@ def blocks_match(sess, use_peephole):
   initializer = init_ops.random_uniform_initializer(-0.01, 0.01, seed=19890212)
 
   with variable_scope.variable_scope("test", initializer=initializer):
-    # magic naming so that the cells pick up these variables and resuse them
+    # magic naming so that the cells pick up these variables and reuse them
     if use_peephole:
       wci = variable_scope.get_variable(
           "rnn/lstm_cell/w_i_diag", shape=[cell_size], dtype=dtypes.float32)
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py b/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py
index 7b883ebc5d7756f1bdf445f900500a4b89e6cffd..63fdd91d368d97007280871f3886e5649e6b2e86 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py
@@ -455,8 +455,8 @@ class RNNCellTest(test.TestCase):
         self.assertAllClose(np.concatenate(res[1], axis=1), expected_state)
 
   def testAttentionCellWrapperFailures(self):
-    with self.assertRaisesRegexp(TypeError,
-                                 "The parameter cell is not RNNCell."):
+    with self.assertRaisesRegexp(
+        TypeError, rnn_cell_impl.ASSERT_LIKE_RNNCELL_ERROR_REGEXP):
       contrib_rnn_cell.AttentionCellWrapper(None, 0)
 
     num_units = 8
@@ -878,7 +878,6 @@ class RNNCellTest(test.TestCase):
       shape = [2, 1]
       filter_size = [3]
       num_features = 1
-      batch_size = 2
       expected_state_c = np.array(
           [[[1.4375670191], [1.4375670191]], [[2.7542609292], [2.7542609292]]],
           dtype=np.float32)
@@ -912,7 +911,6 @@ class RNNCellTest(test.TestCase):
       shape = [2, 2, 1]
       filter_size = [3, 3]
       num_features = 1
-      batch_size = 2
       expected_state_c = np.array(
           [[[[1.4375670191], [1.4375670191]], [[1.4375670191], [1.4375670191]]],
            [[[2.7542609292], [2.7542609292]], [[2.7542609292], [2.7542609292]]
@@ -954,7 +952,6 @@ class RNNCellTest(test.TestCase):
       shape = [2, 2, 2, 1]
       filter_size = [3, 3, 3]
       num_features = 1
-      batch_size = 2
       expected_state_c = np.array(
           [[[[[1.4375670191], [1.4375670191]], [[1.4375670191], [1.4375670191]]
             ], [[[1.4375670191], [1.4375670191]], [[1.4375670191],
@@ -1031,57 +1028,92 @@ class RNNCellTest(test.TestCase):
     num_units = 4
     number_of_groups = 1
 
-    with self.test_session() as sess:
-      with variable_scope.variable_scope(
-          "root1", initializer=init_ops.constant_initializer(0.5)):
-        x = array_ops.ones([batch_size, num_units])
-        # When number_of_groups = 1, G-LSTM is equivalent to regular LSTM
-        gcell = contrib_rnn_cell.GLSTMCell(
-            num_units=num_units, number_of_groups=number_of_groups)
-        cell = rnn_cell.LSTMCell(num_units=num_units)
-        self.assertTrue(isinstance(gcell.state_size, tuple))
-        zero_state = gcell.zero_state(
-            batch_size=batch_size, dtype=dtypes.float32)
-        gh, gs = gcell(x, zero_state)
-        h, g = cell(x, zero_state)
+    # Try with input dimension equal to num_units or not.
+    for num_inputs in [num_units, num_units + number_of_groups]:
+      with self.test_session() as sess:
+        with variable_scope.variable_scope(
+            "root1_%d" % num_inputs,
+            initializer=init_ops.constant_initializer(0.5)):
+          x = array_ops.ones([batch_size, num_inputs])
+          # When number_of_groups = 1, G-LSTM is equivalent to regular LSTM
+          gcell = contrib_rnn_cell.GLSTMCell(
+              num_units=num_units, number_of_groups=number_of_groups)
+          cell = rnn_cell.LSTMCell(num_units=num_units)
+          self.assertTrue(isinstance(gcell.state_size, tuple))
+          zero_state = gcell.zero_state(
+              batch_size=batch_size, dtype=dtypes.float32)
+          gh, gs = gcell(x, zero_state)
+          h, g = cell(x, zero_state)
 
-        sess.run([variables.global_variables_initializer()])
-        glstm_result = sess.run([gh, gs])
-        lstm_result = sess.run([h, g])
+          sess.run([variables.global_variables_initializer()])
+          glstm_result = sess.run([gh, gs])
+          lstm_result = sess.run([h, g])
 
-        self.assertAllClose(glstm_result[0], lstm_result[0], 1e-5)
-        self.assertAllClose(glstm_result[1], lstm_result[1], 1e-5)
+          self.assertAllClose(glstm_result[0], lstm_result[0], 1e-5)
+          self.assertAllClose(glstm_result[1], lstm_result[1], 1e-5)
 
     # Test that G-LSTM subgroup act like corresponding sub-LSTMs
     batch_size = 2
     num_units = 4
     number_of_groups = 2
 
-    with self.test_session() as sess:
+    # Try with num_inputs equal to or not equal to num_units.
+    for num_inputs in [num_units, num_units + number_of_groups]:
+      with self.test_session() as sess:
+        with variable_scope.variable_scope(
+            "root2_%d" % num_inputs,
+            initializer=init_ops.constant_initializer(0.5)):
+          # input for G-LSTM with 2 groups
+          glstm_input = array_ops.ones([batch_size, num_inputs])
+          gcell = contrib_rnn_cell.GLSTMCell(
+              num_units=num_units, number_of_groups=number_of_groups)
+          gcell_zero_state = gcell.zero_state(
+              batch_size=batch_size, dtype=dtypes.float32)
+          gh, gs = gcell(glstm_input, gcell_zero_state)
+
+          # input for LSTM cell simulating single G-LSTM group
+          lstm_input = array_ops.ones(
+              [batch_size, int(num_inputs / number_of_groups)])
+          # Note the division by number_of_groups: this cell simulates a
+          # single G-LSTM group.
+          cell = rnn_cell.LSTMCell(num_units=int(num_units / number_of_groups))
+          cell_zero_state = cell.zero_state(
+              batch_size=batch_size, dtype=dtypes.float32)
+          h, g = cell(lstm_input, cell_zero_state)
+
+          sess.run([variables.global_variables_initializer()])
+          [gh_res, h_res] = sess.run([gh, h])
+          self.assertAllClose(gh_res[:, 0:int(num_units / number_of_groups)],
+                              h_res, 1e-5)
+          self.assertAllClose(gh_res[:, int(num_units / number_of_groups):],
+                              h_res, 1e-5)
+
+  def testGLSTMCellFailure(self):
+    batch_size = 2
+    num_units = 4
+    number_of_groups = 2
+    with self.test_session():
       with variable_scope.variable_scope(
-          "root2", initializer=init_ops.constant_initializer(0.5)):
-        # input for G-LSTM with 2 groups
-        glstm_input = array_ops.ones([batch_size, num_units])
+          "glstm_failure", initializer=init_ops.constant_initializer(0.5)):
         gcell = contrib_rnn_cell.GLSTMCell(
             num_units=num_units, number_of_groups=number_of_groups)
         gcell_zero_state = gcell.zero_state(
             batch_size=batch_size, dtype=dtypes.float32)
-        gh, gs = gcell(glstm_input, gcell_zero_state)
 
-        # input for LSTM cell simulating single G-LSTM group
-        lstm_input = array_ops.ones([batch_size, num_units / number_of_groups])
-        # note division by number_of_groups. This cell one simulates G-LSTM group
-        cell = rnn_cell.LSTMCell(num_units=int(num_units / number_of_groups))
-        cell_zero_state = cell.zero_state(
-            batch_size=batch_size, dtype=dtypes.float32)
-        h, g = cell(lstm_input, cell_zero_state)
+        # Try an input with statically-unknown innermost dimension.
+        glstm_input = array_ops.placeholder(
+            dtypes.float32, shape=[batch_size, None])
+        with self.assertRaisesRegexp(ValueError,
+                                     "input size must be statically known"):
+          gcell(glstm_input, gcell_zero_state)
 
-        sess.run([variables.global_variables_initializer()])
-        [gh_res, h_res] = sess.run([gh, h])
-        self.assertAllClose(gh_res[:, 0:int(num_units / number_of_groups)],
-                            h_res, 1e-5)
-        self.assertAllClose(gh_res[:, int(num_units / number_of_groups):],
-                            h_res, 1e-5)
+        # Try an input whose innermost dimension isn't divisible into groups.
+        glstm_input = array_ops.placeholder(
+            dtypes.float32, shape=[batch_size, 3])
+        with self.assertRaisesRegexp(
+            ValueError,
+            r"input size \(3\) must be divisible by number_of_groups \(2\)"):
+          gcell(glstm_input, gcell_zero_state)
 
 
 class LayerNormBasicLSTMCellTest(test.TestCase):
@@ -1168,7 +1200,7 @@ class LayerNormBasicLSTMCellTest(test.TestCase):
         h1 = array_ops.zeros([1, 2])
         state1 = rnn_cell.LSTMStateTuple(c1, h1)
         state = (state0, state1)
-        single_cell = lambda: contrib_rnn_cell.LayerNormBasicLSTMCell(2, layer_norm=False)
+        single_cell = lambda: contrib_rnn_cell.LayerNormBasicLSTMCell(2, layer_norm=False)  # pylint: disable=line-too-long
         cell = rnn_cell.MultiRNNCell([single_cell() for _ in range(2)])
         g, out_m = cell(x, state)
         sess.run([variables.global_variables_initializer()])
@@ -1200,7 +1232,7 @@ class LayerNormBasicLSTMCellTest(test.TestCase):
         self.assertAllClose(expected_state1_h, actual_state1_h, 1e-5)
 
       with variable_scope.variable_scope(
-          "other", initializer=init_ops.constant_initializer(0.5)) as vs:
+          "other", initializer=init_ops.constant_initializer(0.5)):
         x = array_ops.zeros(
             [1, 3])  # Test BasicLSTMCell with input_size != num_units.
         c = array_ops.zeros([1, 2])
@@ -1549,7 +1581,7 @@ class WeightNormLSTMCellTest(test.TestCase):
   """Compared cell output with pre-calculated values."""
 
   def _cell_output(self, cell):
-    """Calculate cell output"""
+    """Calculates cell output."""
 
     with self.test_session() as sess:
       init = init_ops.constant_initializer(0.5)
@@ -1576,7 +1608,7 @@ class WeightNormLSTMCellTest(test.TestCase):
     return actual_state_c, actual_state_h
 
   def testBasicCell(self):
-    """Tests cell w/o peepholes and w/o normalisation"""
+    """Tests cell w/o peepholes and w/o normalisation."""
 
     def cell():
       return contrib_rnn_cell.WeightNormLSTMCell(2,
@@ -1592,7 +1624,7 @@ class WeightNormLSTMCellTest(test.TestCase):
     self.assertAllClose(expected_h, actual_h, 1e-5)
 
   def testNonbasicCell(self):
-    """Tests cell with peepholes and w/o normalisation"""
+    """Tests cell with peepholes and w/o normalisation."""
 
     def cell():
       return contrib_rnn_cell.WeightNormLSTMCell(2,
@@ -1607,9 +1639,8 @@ class WeightNormLSTMCellTest(test.TestCase):
     self.assertAllClose(expected_c, actual_c, 1e-5)
     self.assertAllClose(expected_h, actual_h, 1e-5)
 
-
   def testBasicCellWithNorm(self):
-    """Tests cell w/o peepholes and with normalisation"""
+    """Tests cell w/o peepholes and with normalisation."""
 
     def cell():
       return contrib_rnn_cell.WeightNormLSTMCell(2,
@@ -1625,7 +1656,7 @@ class WeightNormLSTMCellTest(test.TestCase):
     self.assertAllClose(expected_h, actual_h, 1e-5)
 
   def testNonBasicCellWithNorm(self):
-    """Tests cell with peepholes and with normalisation"""
+    """Tests cell with peepholes and with normalisation."""
 
     def cell():
       return contrib_rnn_cell.WeightNormLSTMCell(2,
diff --git a/tensorflow/contrib/rnn/python/ops/core_rnn_cell.py b/tensorflow/contrib/rnn/python/ops/core_rnn_cell.py
index 8109ebc718353300f94536c5d7ae3332da584a1d..645f82624bf67b96ffc8520289b293b45f0e69e2 100644
--- a/tensorflow/contrib/rnn/python/ops/core_rnn_cell.py
+++ b/tensorflow/contrib/rnn/python/ops/core_rnn_cell.py
@@ -40,7 +40,6 @@ from tensorflow.python.util import nest
 
 # pylint: disable=protected-access,invalid-name
 RNNCell = rnn_cell_impl.RNNCell
-_like_rnncell = rnn_cell_impl._like_rnncell
 _WEIGHTS_VARIABLE_NAME = rnn_cell_impl._WEIGHTS_VARIABLE_NAME
 _BIAS_VARIABLE_NAME = rnn_cell_impl._BIAS_VARIABLE_NAME
 # pylint: enable=protected-access,invalid-name
@@ -221,8 +220,7 @@ class EmbeddingWrapper(RNNCell):
       ValueError: if embedding_classes is not positive.
     """
     super(EmbeddingWrapper, self).__init__(_reuse=reuse)
-    if not _like_rnncell(cell):
-      raise TypeError("The parameter cell is not RNNCell.")
+    rnn_cell_impl.assert_like_rnncell("cell", cell)
     if embedding_classes <= 0 or embedding_size <= 0:
       raise ValueError("Both embedding_classes and embedding_size must be > 0: "
                        "%d, %d." % (embedding_classes, embedding_size))
@@ -301,8 +299,7 @@ class InputProjectionWrapper(RNNCell):
     super(InputProjectionWrapper, self).__init__(_reuse=reuse)
     if input_size is not None:
       logging.warn("%s: The input_size parameter is deprecated.", self)
-    if not _like_rnncell(cell):
-      raise TypeError("The parameter cell is not RNNCell.")
+    rnn_cell_impl.assert_like_rnncell("cell", cell)
     self._cell = cell
     self._num_proj = num_proj
     self._activation = activation
@@ -356,8 +353,7 @@ class OutputProjectionWrapper(RNNCell):
       ValueError: if output_size is not positive.
     """
     super(OutputProjectionWrapper, self).__init__(_reuse=reuse)
-    if not _like_rnncell(cell):
-      raise TypeError("The parameter cell is not RNNCell.")
+    rnn_cell_impl.assert_like_rnncell("cell", cell)
     if output_size < 1:
       raise ValueError("Parameter output_size must be > 0: %d." % output_size)
     self._cell = cell
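
The hunks above replace the repeated `_like_rnncell` check-and-raise pattern with a single call to `rnn_cell_impl.assert_like_rnncell("cell", cell)`. The body of that helper is not part of this diff, so the following is only a minimal sketch of the general duck-typing pattern. The helper name `assert_cell_like`, the list of attributes checked, and the error wording are assumptions made for illustration, not TensorFlow's implementation.

```python
def assert_cell_like(arg_name, cell):
  """Raises TypeError unless `cell` looks like an RNN cell (hypothetical)."""
  # Assumed duck-typing criteria; the real helper may check something else.
  checks = [
      ("has an 'output_size' attribute", hasattr(cell, "output_size")),
      ("has a 'state_size' attribute", hasattr(cell, "state_size")),
      ("has a 'zero_state' attribute", hasattr(cell, "zero_state")),
      ("is callable", callable(cell)),
  ]
  failed = [desc for desc, ok in checks if not ok]
  if failed:
    raise TypeError(
        "The argument %r (%r) is not cell-like; expected that it %s." %
        (arg_name, cell, ", ".join(failed)))


class FakeCell(object):
  """Stand-in object (hypothetical) that satisfies every check above."""
  output_size = 4
  state_size = 4

  def zero_state(self, batch_size, dtype):
    return [0.0] * self.state_size

  def __call__(self, inputs, state):
    return inputs, state
```

Centralizing the check in one function lets every wrapper constructor report the offending argument by name, instead of each call site carrying its own slightly different message.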
diff --git a/tensorflow/contrib/rnn/python/ops/lstm_ops.py b/tensorflow/contrib/rnn/python/ops/lstm_ops.py
index 4eb4fbcd92f0d7cb3bee712862c8950a1971b632..9e61fc54d10c1b75786450060e428c73974760a7 100644
--- a/tensorflow/contrib/rnn/python/ops/lstm_ops.py
+++ b/tensorflow/contrib/rnn/python/ops/lstm_ops.py
@@ -480,8 +480,7 @@ class LSTMBlockWrapper(base_layer.Layer):
     """Run this LSTM on inputs, starting from the given state.
 
     Args:
-      inputs: `3-D` tensor with shape `[time_len, batch_size, input_size]`
-        or a list of `time_len` tensors of shape `[batch_size, input_size]`.
+      inputs: `3-D` tensor with shape `[time_len, batch_size, input_size]`.
       initial_state: a tuple `(initial_cell_state, initial_output)` with tensors
         of shape `[batch_size, self._num_units]`. If this is not provided, the
         cell is expected to create a zero initial state of type `dtype`.
diff --git a/tensorflow/contrib/rnn/python/ops/rnn_cell.py b/tensorflow/contrib/rnn/python/ops/rnn_cell.py
index a6c2d9cdbb2b6f61d59960f708000e945c6115e9..2f6ae9f3678e58dae67bf777991641b10e42ef94 100644
--- a/tensorflow/contrib/rnn/python/ops/rnn_cell.py
+++ b/tensorflow/contrib/rnn/python/ops/rnn_cell.py
@@ -534,7 +534,7 @@ class GridLSTMCell(rnn_cell_impl.RNNCell):
       initializer: (optional) The initializer to use for the weight and
         projection matrices, default None.
       num_unit_shards: (optional) int, default 1, How to split the weight
-        matrix. If > 1,the weight matrix is stored across num_unit_shards.
+        matrix. If > 1, the weight matrix is stored across num_unit_shards.
       forget_bias: (optional) float, default 1.0, The initial bias of the
         forget gates, used to reduce the scale of forgetting at the beginning
         of the training.
@@ -993,7 +993,7 @@ class BidirectionalGridLSTMCell(GridLSTMCell):
       initializer: (optional) The initializer to use for the weight and
         projection matrices, default None.
       num_unit_shards: (optional) int, default 1, How to split the weight
-        matrix. If > 1,the weight matrix is stored across num_unit_shards.
+        matrix. If > 1, the weight matrix is stored across num_unit_shards.
       forget_bias: (optional) float, default 1.0, The initial bias of the
         forget gates, used to reduce the scale of forgetting at the beginning
         of the training.
@@ -1143,8 +1143,7 @@ class AttentionCellWrapper(rnn_cell_impl.RNNCell):
           `state_is_tuple` is `False` or if attn_length is zero or less.
     """
     super(AttentionCellWrapper, self).__init__(_reuse=reuse)
-    if not rnn_cell_impl._like_rnncell(cell):  # pylint: disable=protected-access
-      raise TypeError("The parameter cell is not RNNCell.")
+    rnn_cell_impl.assert_like_rnncell("cell", cell)
     if nest.is_sequence(cell.state_size) and not state_is_tuple:
       raise ValueError(
           "Cell returns tuple of states, but the flag "
@@ -2059,16 +2058,19 @@ class ConvLSTMCell(rnn_cell_impl.RNNCell):
                initializers=None,
                name="conv_lstm_cell"):
     """Construct ConvLSTMCell.
+
     Args:
       conv_ndims: Convolution dimensionality (1, 2 or 3).
       input_shape: Shape of the input as int tuple, excluding the batch size.
       output_channels: int, number of output channels of the conv LSTM.
       kernel_shape: Shape of kernel as in tuple (of size 1,2 or 3).
-      use_bias: Use bias in convolutions.
+      use_bias: (bool) Use bias in convolutions.
       skip_connection: If set to `True`, concatenate the input to the
-      output of the conv LSTM. Default: `False`.
+        output of the conv LSTM. Default: `False`.
       forget_bias: Forget bias.
+      initializers: Unused.
       name: Name of the module.
+
     Raises:
       ValueError: If `skip_connection` is `True` and stride is different from 1
         or if `input_shape` is incompatible with `conv_ndims`.
@@ -2131,7 +2133,7 @@ class Conv1DLSTMCell(ConvLSTMCell):
 
   def __init__(self, name="conv_1d_lstm_cell", **kwargs):
     """Construct Conv1DLSTM. See `ConvLSTMCell` for more details."""
-    super(Conv1DLSTMCell, self).__init__(conv_ndims=1, **kwargs)
+    super(Conv1DLSTMCell, self).__init__(conv_ndims=1, name=name, **kwargs)
 
 
 class Conv2DLSTMCell(ConvLSTMCell):
@@ -2142,7 +2144,7 @@ class Conv2DLSTMCell(ConvLSTMCell):
 
   def __init__(self, name="conv_2d_lstm_cell", **kwargs):
     """Construct Conv2DLSTM. See `ConvLSTMCell` for more details."""
-    super(Conv2DLSTMCell, self).__init__(conv_ndims=2, **kwargs)
+    super(Conv2DLSTMCell, self).__init__(conv_ndims=2, name=name, **kwargs)
 
 
 class Conv3DLSTMCell(ConvLSTMCell):
@@ -2153,19 +2155,23 @@ class Conv3DLSTMCell(ConvLSTMCell):
 
   def __init__(self, name="conv_3d_lstm_cell", **kwargs):
     """Construct Conv3DLSTM. See `ConvLSTMCell` for more details."""
-    super(Conv3DLSTMCell, self).__init__(conv_ndims=3, **kwargs)
+    super(Conv3DLSTMCell, self).__init__(conv_ndims=3, name=name, **kwargs)
 
 
 def _conv(args, filter_size, num_features, bias, bias_start=0.0):
-  """convolution:
+  """Convolution.
+
   Args:
     args: a Tensor or a list of Tensors of dimension 3D, 4D or 5D,
     batch x n, Tensors.
     filter_size: int tuple of filter height and width.
     num_features: int, number of features.
+    bias: Whether to use biases in the convolution layer.
     bias_start: starting value to initialize the bias; 0 by default.
+
   Returns:
     A 3D, 4D, or 5D Tensor with shape [batch ... num_features]
+
   Raises:
     ValueError: if some of the arguments has unspecified or wrong shape.
   """
@@ -2225,6 +2231,13 @@ class GLSTMCell(rnn_cell_impl.RNNCell):
 
   O. Kuchaiev and B. Ginsburg
   "Factorization Tricks for LSTM Networks", ICLR 2017 workshop.
+
+  In brief, a G-LSTM cell consists of one LSTM sub-cell per group, where each
+  sub-cell operates on an evenly-sized sub-vector of the input and produces an
+  evenly-sized sub-vector of the output.  For example, a G-LSTM cell with 128
+  units and 4 groups consists of 4 LSTM sub-cells with 32 units each.  If that
+  G-LSTM cell is fed a 200-dim input, then each sub-cell receives a 50-dim part
+  of the input and produces a 32-dim part of the output.
   """
 
   def __init__(self,
@@ -2298,7 +2311,7 @@ class GLSTMCell(rnn_cell_impl.RNNCell):
     return self._output_size
 
   def _get_input_for_group(self, inputs, group_id, group_size):
-    """Slices inputs into groups to prepare for processing by cell's groups
+    """Slices inputs into groups to prepare for processing by cell's groups.
 
     Args:
       inputs: cell input or it's previous state,
@@ -2320,9 +2333,12 @@ class GLSTMCell(rnn_cell_impl.RNNCell):
     """Run one step of G-LSTM.
 
     Args:
-      inputs: input Tensor, 2D, [batch x num_units].
-      state: this must be a tuple of state Tensors, both `2-D`,
-      with column sizes `c_state` and `m_state`.
+      inputs: input Tensor, 2D, [batch x num_inputs].  num_inputs must be
+        statically-known and evenly divisible into groups.  The innermost
+        vectors of the inputs are split into evenly-sized sub-vectors and fed
+        into the per-group LSTM sub-cells.
+      state: this must be a tuple of state Tensors, both `2-D`, with column
+        sizes `c_state` and `m_state`.
 
     Returns:
       A tuple containing:
@@ -2337,11 +2353,24 @@ class GLSTMCell(rnn_cell_impl.RNNCell):
 
     Raises:
       ValueError: If input size cannot be inferred from inputs via
-        static shape inference.
+        static shape inference, or if the input shape is incompatible
+        with the number of groups.
     """
     (c_prev, m_prev) = state
 
     self._batch_size = inputs.shape[0].value or array_ops.shape(inputs)[0]
+
+    # The input size must be statically known.  Calculate and validate the
+    # per-group input size; a dynamic input size is rejected below.
+    input_size = inputs.shape[1].value
+    if input_size is None:
+      raise ValueError("input size must be statically known")
+    if input_size % self._number_of_groups != 0:
+      raise ValueError(
+          "input size (%d) must be divisible by number_of_groups (%d)" %
+          (input_size, self._number_of_groups))
+    input_group_size = int(input_size / self._number_of_groups)
+
     dtype = inputs.dtype
     scope = vs.get_variable_scope()
     with vs.variable_scope(scope, initializer=self._initializer):
@@ -2354,8 +2383,7 @@ class GLSTMCell(rnn_cell_impl.RNNCell):
         with vs.variable_scope("group%d" % group_id):
           x_g_id = array_ops.concat(
               [
-                  self._get_input_for_group(inputs, group_id,
-                                            self._group_shape[0]),
+                  self._get_input_for_group(inputs, group_id, input_group_size),
                   self._get_input_for_group(m_prev, group_id,
                                             self._group_shape[0])
               ],
@@ -2684,7 +2712,7 @@ class LayerNormLSTMCell(rnn_cell_impl.RNNCell):
 
 
 class SRUCell(rnn_cell_impl.LayerRNNCell):
-  """SRU, Simple Recurrent Unit
+  """SRU, Simple Recurrent Unit.
 
      Implementation based on
      Training RNNs as Fast as CNNs (cf. https://arxiv.org/abs/1709.02755).
@@ -2732,12 +2760,13 @@ class SRUCell(rnn_cell_impl.LayerRNNCell):
 
     input_depth = inputs_shape[1].value
 
+    # pylint: disable=protected-access
     self._kernel = self.add_variable(
         rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
         shape=[input_depth, 4 * self._num_units])
-
+    # pylint: enable=protected-access
     self._bias = self.add_variable(
-        rnn_cell_impl._BIAS_VARIABLE_NAME,
+        rnn_cell_impl._BIAS_VARIABLE_NAME,  # pylint: disable=protected-access
         shape=[2 * self._num_units],
         initializer=init_ops.constant_initializer(0.0, dtype=self.dtype))
 
@@ -2746,7 +2775,7 @@ class SRUCell(rnn_cell_impl.LayerRNNCell):
   def call(self, inputs, state):
     """Simple recurrent unit (SRU) with num_units cells."""
 
-    U = math_ops.matmul(inputs, self._kernel)
+    U = math_ops.matmul(inputs, self._kernel)  # pylint: disable=invalid-name
     x_bar, f_intermediate, r_intermediate, x_tx = array_ops.split(
         value=U, num_or_size_splits=4, axis=1)
 
@@ -2876,6 +2905,7 @@ class WeightNormLSTMCell(rnn_cell_impl.RNNCell):
     Args:
       args: a 2D Tensor or a list of 2D, batch x n, Tensors.
       output_size: int, second dimension of W[i].
+      norm: bool, whether to normalize the weights.
       bias: boolean, whether to add a bias term or not.
       bias_initializer: starting value to initialize the bias
         (default is all zeros).
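
The new `GLSTMCell` docstring describes a cell with 128 units and 4 groups that is fed a 200-dim input, and the `call` hunk shows each group concatenating its input slice with its slice of `m_prev`. The arithmetic can be sketched in plain Python as below. This is an illustration only: `split_into_groups` is a hypothetical helper, and it assumes each group takes a contiguous slice, which this diff does not show explicitly.

```python
def split_into_groups(vector, number_of_groups):
  """Splits a flat list into `number_of_groups` equal contiguous slices."""
  size = len(vector)
  if size % number_of_groups != 0:
    raise ValueError(
        "input size (%d) must be divisible by number_of_groups (%d)" %
        (size, number_of_groups))
  group_size = size // number_of_groups
  return [vector[g * group_size:(g + 1) * group_size]
          for g in range(number_of_groups)]


num_units, number_of_groups, num_inputs = 128, 4, 200
inputs = list(range(num_inputs))  # One row of a [batch x 200] input.
m_prev = list(range(num_units))   # One row of the previous output state.

input_groups = split_into_groups(inputs, number_of_groups)
state_groups = split_into_groups(m_prev, number_of_groups)
# Each sub-cell sees its input slice concatenated with its state slice.
sub_cell_inputs = [x + m for x, m in zip(input_groups, state_groups)]
print([len(x) for x in input_groups])     # [50, 50, 50, 50]
print([len(x) for x in sub_cell_inputs])  # [82, 82, 82, 82]
```

The input group size (50) and the state group size (32) differ, which is why the cell now computes `input_group_size` separately instead of reusing `self._group_shape[0]` for both slices.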
diff --git a/tensorflow/contrib/seq2seq/python/kernel_tests/attention_wrapper_test.py b/tensorflow/contrib/seq2seq/python/kernel_tests/attention_wrapper_test.py
index b427dff88b2d586ccf8c512bb498cdaf879ac781..c4139dde492a3c88ec3d5f19973314ac3fd81fb3 100644
--- a/tensorflow/contrib/seq2seq/python/kernel_tests/attention_wrapper_test.py
+++ b/tensorflow/contrib/seq2seq/python/kernel_tests/attention_wrapper_test.py
@@ -222,6 +222,9 @@ class AttentionWrapperTest(test.TestCase):
           self.assertEqual(
               (None, batch_size, None),
               tuple(state_alignment_history.get_shape().as_list()))
+        nest.assert_same_structure(
+            cell.state_size,
+            cell.zero_state(batch_size, dtypes.float32))
         # Remove the history from final_state for purposes of the
         # remainder of the tests.
         final_state = final_state._replace(alignment_history=())  # pylint: disable=protected-access
diff --git a/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py b/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py
index 926554031775202d7f7d9018cf6ae4efb34fe96b..178328619f087789df040489cd150ba018cc8d14 100644
--- a/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py
+++ b/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py
@@ -27,6 +27,7 @@ from tensorflow.contrib.seq2seq.python.ops import beam_search_ops
 from tensorflow.contrib.seq2seq.python.ops import decoder
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.layers import core as layers_core
 from tensorflow.python.ops import array_ops
@@ -70,6 +71,98 @@ class TestGatherTree(test.TestCase):
 
     self.assertAllEqual(expected_result, res_)
 
+  def _test_gather_tree_from_array(self,
+                                   depth_ndims=0,
+                                   merged_batch_beam=False):
+    array = np.array(
+        [[[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 0, 0]],
+         [[2, 3, 4], [5, 6, 7], [8, 9, 10], [11, 12, 0]]]).transpose([1, 0, 2])
+    parent_ids = np.array(
+        [[[0, 0, 0], [0, 1, 1], [2, 1, 2], [-1, -1, -1]],
+         [[0, 0, 0], [1, 1, 0], [2, 0, 1], [0, 1, 0]]]).transpose([1, 0, 2])
+    expected_array = np.array(
+        [[[2, 2, 2], [6, 5, 6], [7, 8, 9], [0, 0, 0]],
+         [[2, 3, 2], [7, 5, 7], [8, 9, 8], [11, 12, 0]]]).transpose([1, 0, 2])
+    sequence_length = [[3, 3, 3], [4, 4, 3]]
+
+    array = ops.convert_to_tensor(
+        array, dtype=dtypes.float32)
+    parent_ids = ops.convert_to_tensor(
+        parent_ids, dtype=dtypes.int32)
+    expected_array = ops.convert_to_tensor(
+        expected_array, dtype=dtypes.float32)
+
+    max_time = array_ops.shape(array)[0]
+    batch_size = array_ops.shape(array)[1]
+    beam_width = array_ops.shape(array)[2]
+
+    def _tile_in_depth(tensor):
+      # Generate higher rank tensors by stacking tensor and tensor + 1.
+      for _ in range(depth_ndims):
+        tensor = array_ops.stack([tensor, tensor + 1], -1)
+      return tensor
+
+    if merged_batch_beam:
+      array = array_ops.reshape(
+          array, [max_time, batch_size * beam_width])
+      expected_array = array_ops.reshape(
+          expected_array, [max_time, batch_size * beam_width])
+
+    if depth_ndims > 0:
+      array = _tile_in_depth(array)
+      expected_array = _tile_in_depth(expected_array)
+
+    sorted_array = beam_search_decoder.gather_tree_from_array(
+        array, parent_ids, sequence_length)
+
+    with self.test_session() as sess:
+      sorted_array = sess.run(sorted_array)
+      expected_array = sess.run(expected_array)
+      self.assertAllEqual(expected_array, sorted_array)
+
+  def test_gather_tree_from_array_scalar(self):
+    self._test_gather_tree_from_array()
+
+  def test_gather_tree_from_array_1d(self):
+    self._test_gather_tree_from_array(depth_ndims=1)
+
+  def test_gather_tree_from_array_1d_with_merged_batch_beam(self):
+    self._test_gather_tree_from_array(depth_ndims=1, merged_batch_beam=True)
+
+  def test_gather_tree_from_array_2d(self):
+    self._test_gather_tree_from_array(depth_ndims=2)
+
+
+class TestArrayShapeChecks(test.TestCase):
+
+  def _test_array_shape_dynamic_checks(self, static_shape, dynamic_shape,
+                                       batch_size, beam_width, is_valid=True):
+    t = array_ops.placeholder_with_default(
+        np.random.randn(*static_shape).astype(np.float32),
+        shape=dynamic_shape)
+
+    batch_size = array_ops.constant(batch_size)
+    check_op = beam_search_decoder._check_batch_beam(t, batch_size, beam_width)  # pylint: disable=protected-access
+
+    with self.test_session() as sess:
+      if is_valid:
+        sess.run(check_op)
+      else:
+        with self.assertRaises(errors.InvalidArgumentError):
+          sess.run(check_op)
+
+  def test_array_shape_dynamic_checks(self):
+    self._test_array_shape_dynamic_checks(
+        (8, 4, 5, 10), (None, None, 5, 10), 4, 5, is_valid=True)
+    self._test_array_shape_dynamic_checks(
+        (8, 20, 10), (None, None, 10), 4, 5, is_valid=True)
+    self._test_array_shape_dynamic_checks(
+        (8, 21, 10), (None, None, 10), 4, 5, is_valid=False)
+    self._test_array_shape_dynamic_checks(
+        (8, 4, 6, 10), (None, None, None, 10), 4, 5, is_valid=False)
+    self._test_array_shape_dynamic_checks(
+        (8, 4), (None, None), 4, 5, is_valid=False)
+
 
 class TestEosMasking(test.TestCase):
   """Tests EOS masking used in beam search."""
@@ -319,7 +412,8 @@ class TestLargeBeamStep(test.TestCase):
 
 class BeamSearchDecoderTest(test.TestCase):
 
-  def _testDynamicDecodeRNN(self, time_major, has_attention):
+  def _testDynamicDecodeRNN(self, time_major, has_attention,
+                            with_alignment_history=False):
     encoder_sequence_length = np.array([3, 2, 3, 1, 1])
     decoder_sequence_length = np.array([2, 0, 1, 2, 3])
     batch_size = 5
@@ -359,7 +453,7 @@ class BeamSearchDecoderTest(test.TestCase):
             cell=cell,
             attention_mechanism=attention_mechanism,
             attention_layer_size=attention_depth,
-            alignment_history=False)
+            alignment_history=with_alignment_history)
       cell_state = cell.zero_state(
           dtype=dtypes.float32, batch_size=batch_size_tensor * beam_width)
       if has_attention:
@@ -420,6 +514,12 @@ class BeamSearchDecoderTest(test.TestCase):
   def testDynamicDecodeRNNBatchMajorYesAttention(self):
     self._testDynamicDecodeRNN(time_major=False, has_attention=True)
 
+  def testDynamicDecodeRNNBatchMajorYesAttentionWithAlignmentHistory(self):
+    self._testDynamicDecodeRNN(
+        time_major=False,
+        has_attention=True,
+        with_alignment_history=True)
+
 
 if __name__ == '__main__':
   test.main()
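
The five cases in `test_array_shape_dynamic_checks` suggest which stacked shapes `_check_batch_beam` accepts: after the leading time dimension, either separate `[batch_size, beam_width]` dimensions or a single merged `batch_size * beam_width` dimension. The sketch below restates that rule on plain shape tuples. It is inferred from the test expectations only; `batch_beam_shape_ok` is a hypothetical name, and the real helper builds an op that is evaluated when the session runs rather than returning a Python bool.

```python
def batch_beam_shape_ok(shape, batch_size, beam_width):
  """Returns True if `shape` has a [batch, beam] or merged batch*beam layout."""
  if len(shape) < 2:
    return False
  # Merged layout: [max_time, batch_size * beam_width, ...].
  merged = shape[1] == batch_size * beam_width
  if len(shape) == 2:
    return merged
  # Split layout: [max_time, batch_size, beam_width, ...].
  split = shape[1] == batch_size and shape[2] == beam_width
  return merged or split


# The same shapes and expectations as the test cases above.
cases = [
    ((8, 4, 5, 10), True),
    ((8, 20, 10), True),
    ((8, 21, 10), False),
    ((8, 4, 6, 10), False),
    ((8, 4), False),
]
for shape, expected in cases:
  assert batch_beam_shape_ok(shape, 4, 5) == expected
```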
diff --git a/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py b/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
index 0a53fd66dbe4d28ea102773b9c5bae50b9d18e9c..9ff8a343f124193b56e3bf46624ab1db927301a0 100644
--- a/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
+++ b/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
@@ -1152,9 +1152,7 @@ class AttentionWrapper(rnn_cell_impl.RNNCell):
         is a list, and its length does not match that of `attention_layer_size`.
     """
     super(AttentionWrapper, self).__init__(name=name)
-    if not rnn_cell_impl._like_rnncell(cell):  # pylint: disable=protected-access
-      raise TypeError(
-          "cell must be an RNNCell, saw type: %s" % type(cell).__name__)
+    rnn_cell_impl.assert_like_rnncell("cell", cell)
     if isinstance(attention_mechanism, (list, tuple)):
       self._is_multi = True
       attention_mechanisms = attention_mechanism
@@ -1280,7 +1278,8 @@ class AttentionWrapper(rnn_cell_impl.RNNCell):
         attention_state=self._item_or_tuple(
             a.state_size for a in self._attention_mechanisms),
         alignment_history=self._item_or_tuple(
-            () for _ in self._attention_mechanisms))  # sometimes a TensorArray
+            a.alignments_size if self._alignment_history else ()
+            for a in self._attention_mechanisms))  # sometimes a TensorArray
 
   def zero_state(self, batch_size, dtype):
     """Return an initial (zero) state tuple for this `AttentionWrapper`.
@@ -1320,22 +1319,26 @@ class AttentionWrapper(rnn_cell_impl.RNNCell):
         cell_state = nest.map_structure(
             lambda s: array_ops.identity(s, name="checked_cell_state"),
             cell_state)
+      initial_alignments = [
+          attention_mechanism.initial_alignments(batch_size, dtype)
+          for attention_mechanism in self._attention_mechanisms]
       return AttentionWrapperState(
           cell_state=cell_state,
           time=array_ops.zeros([], dtype=dtypes.int32),
           attention=_zero_state_tensors(self._attention_layer_size, batch_size,
                                         dtype),
-          alignments=self._item_or_tuple(
-              attention_mechanism.initial_alignments(batch_size, dtype)
-              for attention_mechanism in self._attention_mechanisms),
+          alignments=self._item_or_tuple(initial_alignments),
           attention_state=self._item_or_tuple(
               attention_mechanism.initial_state(batch_size, dtype)
               for attention_mechanism in self._attention_mechanisms),
           alignment_history=self._item_or_tuple(
-              tensor_array_ops.TensorArray(dtype=dtype, size=0,
-                                           dynamic_size=True)
+              tensor_array_ops.TensorArray(
+                  dtype,
+                  size=0,
+                  dynamic_size=True,
+                  element_shape=alignment.shape)
               if self._alignment_history else ()
-              for _ in self._attention_mechanisms))
+              for alignment in initial_alignments))
 
   def call(self, inputs, state):
     """Perform a step of attention-wrapped RNN.
diff --git a/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py b/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py
index ed226239b860e2250072a28a5538b816642ec54b..7eb95e5a70de985dca0d4b565ba03bdf454b6161 100644
--- a/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py
+++ b/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py
@@ -59,8 +59,7 @@ class BasicDecoder(decoder.Decoder):
     Raises:
       TypeError: if `cell`, `helper` or `output_layer` have an incorrect type.
     """
-    if not rnn_cell_impl._like_rnncell(cell):  # pylint: disable=protected-access
-      raise TypeError("cell must be an RNNCell, received: %s" % type(cell))
+    rnn_cell_impl.assert_like_rnncell("cell", cell)
     if not isinstance(helper, helper_py.Helper):
       raise TypeError("helper must be a Helper, received: %s" % type(helper))
     if (output_layer is not None
diff --git a/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py b/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
index 554eb24e5260724a905b099091bf8aea461554cf..a26107b0d71497e2d21d735f02e2556873709cda 100644
--- a/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
+++ b/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
@@ -35,6 +35,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import rnn_cell_impl
 from tensorflow.python.ops import tensor_array_ops
+from tensorflow.python.platform import tf_logging
 from tensorflow.python.util import nest
 
 __all__ = [
@@ -121,14 +122,114 @@ def tile_batch(t, multiplier, name=None):
     return nest.map_structure(lambda t_: _tile_batch(t_, multiplier), t)
 
 
+def gather_tree_from_array(t, parent_ids, sequence_length):
+  """Calculates the full beams for `TensorArray`s.
+
+  Args:
+    t: A stacked `TensorArray` of size `max_time` that contains `Tensor`s of
+      shape `[batch_size, beam_width, s]` or `[batch_size * beam_width, s]`
+      where `s` is the depth shape.
+    parent_ids: The parent ids of shape `[max_time, batch_size, beam_width]`.
+    sequence_length: The sequence length of shape `[batch_size, beam_width]`.
+
+  Returns:
+    A `Tensor` which is a stacked `TensorArray` of the same size and type as
+    `t` and where beams are sorted in each `Tensor` according to `parent_ids`.
+  """
+  max_time = parent_ids.shape[0].value or array_ops.shape(parent_ids)[0]
+  batch_size = parent_ids.shape[1].value or array_ops.shape(parent_ids)[1]
+  beam_width = parent_ids.shape[2].value or array_ops.shape(parent_ids)[2]
+
+  # Generate beam ids that will be reordered by gather_tree.
+  beam_ids = array_ops.expand_dims(
+      array_ops.expand_dims(math_ops.range(beam_width), 0), 0)
+  beam_ids = array_ops.tile(beam_ids, [max_time, batch_size, 1])
+
+  mask = array_ops.sequence_mask(
+      sequence_length, maxlen=max_time, dtype=dtypes.int32)
+  mask = array_ops.transpose(mask, perm=[2, 0, 1])
+
+  # Use beam_width + 1 to mark the end of beam.
+  masked_beam_ids = (beam_ids * mask) + (1 - mask) * (beam_width + 1)
+
+  max_sequence_lengths = math_ops.to_int32(
+      math_ops.reduce_max(sequence_length, axis=1))
+  sorted_beam_ids = beam_search_ops.gather_tree(
+      step_ids=masked_beam_ids,
+      parent_ids=parent_ids,
+      max_sequence_lengths=max_sequence_lengths,
+      end_token=beam_width + 1)
+
+  # For out of range steps, simply copy the same beam.
+  sorted_beam_ids = array_ops.where(
+      math_ops.cast(mask, dtypes.bool), x=sorted_beam_ids, y=beam_ids)
+
+  # Generate indices for gather_nd.
+  time_ind = array_ops.tile(array_ops.reshape(
+      math_ops.range(max_time), [-1, 1, 1]), [1, batch_size, beam_width])
+  batch_ind = array_ops.tile(array_ops.reshape(
+      math_ops.range(batch_size), [-1, 1, 1]), [1, max_time, beam_width])
+  batch_ind = array_ops.transpose(batch_ind, perm=[1, 0, 2])
+  indices = array_ops.stack([time_ind, batch_ind, sorted_beam_ids], -1)
+
+  # Gather from a tensor with collapsed additional dimensions.
+  gather_from = t
+  final_shape = array_ops.shape(gather_from)
+  gather_from = array_ops.reshape(
+      gather_from, [max_time, batch_size, beam_width, -1])
+  ordered = array_ops.gather_nd(gather_from, indices)
+  ordered = array_ops.reshape(ordered, final_shape)
+
+  return ordered
+
+
 def _check_maybe(t):
-  if isinstance(t, tensor_array_ops.TensorArray):
-    raise TypeError(
-        "TensorArray state is not supported by BeamSearchDecoder: %s" % t.name)
   if t.shape.ndims is None:
     raise ValueError(
         "Expected tensor (%s) to have known rank, but ndims == None." % t)
 
+def _check_static_batch_beam_maybe(shape, batch_size, beam_width):
+  """Logs a warning and returns False if dimensions are known statically and
+  cannot be reshaped to [batch_size, beam_width, -1]; returns True otherwise.
+  """
+  reshaped_shape = tensor_shape.TensorShape([batch_size, beam_width, None])
+  if (batch_size is not None and shape[0].value is not None
+      and (shape[0] != batch_size * beam_width
+           or (shape.ndims >= 2 and shape[1].value is not None
+               and (shape[0] != batch_size or shape[1] != beam_width)))):
+    tf_logging.warn("TensorArray reordering expects elements to be "
+                    "reshapable to %s which is incompatible with the "
+                    "current shape %s. Consider setting "
+                    "reorder_tensor_arrays to False to disable TensorArray "
+                    "reordering during the beam search."
+                    % (reshaped_shape, shape))
+    return False
+  return True
+
+def _check_batch_beam(t, batch_size, beam_width):
+  """Returns an Assert operation checking that the elements of the stacked
+  TensorArray can be reshaped to [batch_size, beam_size, -1]. At this point,
+  the TensorArray elements have a known rank of at least 1.
+  """
+  error_message = ("TensorArray reordering expects elements to be "
+                   "reshapable to [batch_size, beam_size, -1] which is "
+                   "incompatible with the dynamic shape of %s elements. "
+                   "Consider setting reorder_tensor_arrays to False to disable "
+                   "TensorArray reordering during the beam search."
+                   % (t.name))
+  rank = t.shape.ndims
+  shape = array_ops.shape(t)
+  if rank == 2:
+    condition = math_ops.equal(shape[1], batch_size * beam_width)
+  else:
+    condition = math_ops.logical_or(
+        math_ops.equal(shape[1], batch_size * beam_width),
+        math_ops.logical_and(
+            math_ops.equal(shape[1], batch_size),
+            math_ops.equal(shape[2], beam_width)))
+  return control_flow_ops.Assert(condition, [error_message])
+
+
 
 class BeamSearchDecoder(decoder.Decoder):
   """BeamSearch sampling decoder.
@@ -173,7 +274,8 @@ class BeamSearchDecoder(decoder.Decoder):
                initial_state,
                beam_width,
                output_layer=None,
-               length_penalty_weight=0.0):
+               length_penalty_weight=0.0,
+               reorder_tensor_arrays=True):
     """Initialize the BeamSearchDecoder.
 
     Args:
@@ -188,6 +290,12 @@ class BeamSearchDecoder(decoder.Decoder):
         `tf.layers.Dense`.  Optional layer to apply to the RNN output prior
         to storing the result or sampling.
       length_penalty_weight: Float weight to penalize length. Disabled with 0.0.
+      reorder_tensor_arrays: If `True`, `TensorArray`s' elements within the cell
+        state will be reordered according to the beam search path. If the
+        `TensorArray` can be reordered, the stacked form will be returned.
+        Otherwise, the `TensorArray` will be returned as is. Set this flag to
+        `False` if the cell state contains `TensorArray`s that are not amenable
+        to reordering.
 
     Raises:
       TypeError: if `cell` is not an instance of `RNNCell`,
@@ -195,14 +303,14 @@ class BeamSearchDecoder(decoder.Decoder):
       ValueError: If `start_tokens` is not a vector or
         `end_token` is not a scalar.
     """
-    if not rnn_cell_impl._like_rnncell(cell):  # pylint: disable=protected-access
-      raise TypeError("cell must be an RNNCell, received: %s" % type(cell))
+    rnn_cell_impl.assert_like_rnncell("cell", cell)
     if (output_layer is not None and
         not isinstance(output_layer, layers_base.Layer)):
       raise TypeError(
           "output_layer must be a Layer, received: %s" % type(output_layer))
     self._cell = cell
     self._output_layer = output_layer
+    self._reorder_tensor_arrays = reorder_tensor_arrays
 
     if callable(embedding):
       self._embedding_fn = embedding
@@ -300,12 +408,13 @@ class BeamSearchDecoder(decoder.Decoder):
     """
     finished, start_inputs = self._finished, self._start_inputs
 
+    dtype = nest.flatten(self._initial_cell_state)[0].dtype
     log_probs = array_ops.one_hot(  # shape(batch_sz, beam_sz)
         array_ops.zeros([self._batch_size], dtype=dtypes.int32),
         depth=self._beam_width,
-        on_value=0.0,
-        off_value=-np.Inf,
-        dtype=nest.flatten(self._initial_cell_state)[0].dtype)
+        on_value=ops.convert_to_tensor(0.0, dtype=dtype),
+        off_value=ops.convert_to_tensor(-np.Inf, dtype=dtype),
+        dtype=dtype)
 
     initial_state = BeamSearchDecoderState(
         cell_state=self._initial_cell_state,
@@ -342,6 +451,11 @@ class BeamSearchDecoder(decoder.Decoder):
         outputs.parent_ids,
         max_sequence_lengths=max_sequence_lengths,
         end_token=self._end_token)
+    if self._reorder_tensor_arrays:
+      final_state = final_state._replace(cell_state=nest.map_structure(
+          lambda t: self._maybe_sort_array_beams(
+              t, outputs.parent_ids, final_state.lengths),
+          final_state.cell_state))
     outputs = FinalBeamSearchDecoderOutput(
         beam_search_decoder_output=outputs, predicted_ids=predicted_ids)
     return outputs, final_state
@@ -432,9 +546,10 @@ class BeamSearchDecoder(decoder.Decoder):
       returned unchanged.
 
     Raises:
-      TypeError: If `t` is an instance of `TensorArray`.
       ValueError: If the rank of `t` is not statically known.
     """
+    if isinstance(t, tensor_array_ops.TensorArray):
+      return t
     _check_maybe(t)
     if t.shape.ndims >= 1:
       return self._split_batch_beams(t, s)
@@ -455,15 +570,55 @@ class BeamSearchDecoder(decoder.Decoder):
       A reshaped version of t with shape `[batch_size, beam_width] + s`.
 
     Raises:
-      TypeError: If `t` is an instance of `TensorArray`.
       ValueError:  If the rank of `t` is not statically known.
     """
+    if isinstance(t, tensor_array_ops.TensorArray):
+      return t
     _check_maybe(t)
     if t.shape.ndims >= 2:
       return self._merge_batch_beams(t, s)
     else:
       return t
 
+  def _maybe_sort_array_beams(self, t, parent_ids, sequence_length):
+    """Maybe sorts beams within a `TensorArray`.
+
+    Args:
+      t: A `TensorArray` of size `max_time` that contains `Tensor`s of shape
+        `[batch_size, beam_width, s]` or `[batch_size * beam_width, s]` where
+        `s` is the depth shape.
+      parent_ids: The parent ids of shape `[max_time, batch_size, beam_width]`.
+      sequence_length: The sequence length of shape `[batch_size, beam_width]`.
+
+    Returns:
+      A `TensorArray` where beams are sorted in each `Tensor` or `t` itself if
+      it is not a `TensorArray` or does not meet shape requirements.
+    """
+    if not isinstance(t, tensor_array_ops.TensorArray):
+      return t
+    # pylint: disable=protected-access
+    if (not t._infer_shape or not t._element_shape
+        or t._element_shape[0].ndims is None
+        or t._element_shape[0].ndims < 1):
+      shape = (
+          t._element_shape[0] if t._infer_shape and t._element_shape
+          else tensor_shape.TensorShape(None))
+      tf_logging.warn("The TensorArray %s in the cell state is not amenable to "
+                      "sorting based on the beam search result. For a "
+                      "TensorArray to be sorted, its element shape must be "
+                      "defined and have at least a rank of 1, but saw shape: %s"
+                      % (t.handle.name, shape))
+      return t
+    shape = t._element_shape[0]
+    # pylint: enable=protected-access
+    if not _check_static_batch_beam_maybe(
+        shape, tensor_util.constant_value(self._batch_size), self._beam_width):
+      return t
+    t = t.stack()
+    with ops.control_dependencies(
+        [_check_batch_beam(t, self._batch_size, self._beam_width)]):
+      return gather_tree_from_array(t, parent_ids, sequence_length)
+
   def step(self, time, inputs, state, name=None):
     """Perform a decoding step.
 
@@ -570,7 +725,6 @@ def _beam_search_step(time, logits, next_cell_state, beam_state, batch_size,
 
   time = ops.convert_to_tensor(time, name="time")
   # During the first time step we only consider the initial beam
-  scores_shape = array_ops.shape(scores)
   scores_flat = array_ops.reshape(scores, [batch_size, -1])
 
   # Pick the next beams according to the specified successors function
@@ -759,6 +913,8 @@ def _maybe_tensor_gather_helper(gather_indices, gather_from, batch_size,
     output: Gathered tensor of shape tf.shape(gather_from)[:1+len(gather_shape)]
       or the original tensor if its dimensions are too small.
   """
+  if isinstance(gather_from, tensor_array_ops.TensorArray):
+    return gather_from
   _check_maybe(gather_from)
   if gather_from.shape.ndims >= len(gather_shape):
     return _tensor_gather_helper(
diff --git a/tensorflow/contrib/session_bundle/BUILD b/tensorflow/contrib/session_bundle/BUILD
index 67011c8fef6c4f54db2626ffe7ae1299bddbb352..75a753ed89a5ea13b7b79f480511979c38f321e3 100644
--- a/tensorflow/contrib/session_bundle/BUILD
+++ b/tensorflow/contrib/session_bundle/BUILD
@@ -1,9 +1,7 @@
 # Description:
 #   TensorFlow Serving session bundle.
 
-package(
-    default_visibility = ["//visibility:public"],
-)
+package(default_visibility = ["//visibility:public"])
 
 licenses(["notice"])  # Apache 2.0
 
diff --git a/tensorflow/contrib/slim/README.md b/tensorflow/contrib/slim/README.md
index 2d9df8f27ee98431f51fd39c168325b8f625dce9..40f484fd78302163ba36142dec057478fe899189 100644
--- a/tensorflow/contrib/slim/README.md
+++ b/tensorflow/contrib/slim/README.md
@@ -94,7 +94,7 @@ of thin wrapper functions in
 [variables.py](https://www.tensorflow.org/code/tensorflow/contrib/framework/python/ops/variables.py)
 which allow callers to easily define variables.
 
-For example, to create a `weight` variable, initialize it using a truncated
+For example, to create a `weights` variable, initialize it using a truncated
 normal distribution, regularize it with an `l2_loss` and place it on the `CPU`,
 one need only declare the following:
 
diff --git a/tensorflow/contrib/solvers/python/ops/least_squares.py b/tensorflow/contrib/solvers/python/ops/least_squares.py
index fb7c0eb649c5216736b239d1a423cdaf7079f582..6e164f53420675d149ded6c1f42ca87bd89b158c 100644
--- a/tensorflow/contrib/solvers/python/ops/least_squares.py
+++ b/tensorflow/contrib/solvers/python/ops/least_squares.py
@@ -33,7 +33,7 @@ def cgls(operator, rhs, tol=1e-6, max_iter=20, name="cgls"):
   r"""Conjugate gradient least squares solver.
 
   Solves a linear least squares problem \\(||A x - rhs||_2\\) for a single
-  righ-hand side, using an iterative, matrix-free algorithm where the action of
+  right-hand side, using an iterative, matrix-free algorithm where the action of
   the matrix A is represented by `operator`. The CGLS algorithm implicitly
   applies the symmetric conjugate gradient algorithm to the normal equations
   \\(A^* A x = A^* rhs\\). The iteration terminates when either
diff --git a/tensorflow/contrib/solvers/python/ops/linear_equations.py b/tensorflow/contrib/solvers/python/ops/linear_equations.py
index d791d467639b572e7831c1d1a582aa15585649b6..9305c6a11c4ec898c82553773e8e7277a54ab82e 100644
--- a/tensorflow/contrib/solvers/python/ops/linear_equations.py
+++ b/tensorflow/contrib/solvers/python/ops/linear_equations.py
@@ -41,7 +41,7 @@ def conjugate_gradient(operator,
   r"""Conjugate gradient solver.
 
   Solves a linear system of equations `A*x = rhs` for selfadjoint, positive
-  definite matrix `A` and righ-hand side vector `rhs`, using an iterative,
+  definite matrix `A` and right-hand side vector `rhs`, using an iterative,
   matrix-free algorithm where the action of the matrix A is represented by
   `operator`. The iteration terminates when either the number of iterations
   exceeds `max_iter` or when the residual norm has been reduced to `tol`
diff --git a/tensorflow/contrib/summary/BUILD b/tensorflow/contrib/summary/BUILD
index b58c83fdaf574fb349fac57c922f1178b7d13b66..80563c5e150dfb74ef11bc912e95345a1a015212 100644
--- a/tensorflow/contrib/summary/BUILD
+++ b/tensorflow/contrib/summary/BUILD
@@ -10,12 +10,6 @@ load(
     "tf_gen_op_wrapper_py",
 )
 
-tf_gen_op_wrapper_py(
-    name = "gen_summary_ops",
-    out = "gen_summary_ops.py",
-    deps = ["//tensorflow/core:summary_ops_op_lib"],
-)
-
 py_test(
     name = "summary_ops_test",
     srcs = ["summary_ops_test.py"],
@@ -61,7 +55,6 @@ py_library(
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:internal"],
     deps = [
-        ":gen_summary_ops",
         "//tensorflow/core:protos_all_py",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:constant_op",
@@ -72,6 +65,7 @@ py_library(
         "//tensorflow/python:math_ops",
         "//tensorflow/python:resource_variable_ops",
         "//tensorflow/python:summary_op_util",
+        "//tensorflow/python:summary_ops_gen",
         "//tensorflow/python:training",
         "//tensorflow/python:util",
         "//tensorflow/python/eager:context",
diff --git a/tensorflow/contrib/summary/summary_ops.py b/tensorflow/contrib/summary/summary_ops.py
index b6249fc92f712b21197c2167fb5d1c4af1f48ca5..bc763fe655edc455e2538e536d6efab314c8228c 100644
--- a/tensorflow/contrib/summary/summary_ops.py
+++ b/tensorflow/contrib/summary/summary_ops.py
@@ -26,7 +26,6 @@ import time
 
 import six
 
-from tensorflow.contrib.summary import gen_summary_ops
 from tensorflow.core.framework import graph_pb2
 from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
@@ -35,6 +34,7 @@ from tensorflow.python.framework import ops
 from tensorflow.python.layers import utils
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import gen_summary_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import summary_op_util
@@ -110,7 +110,7 @@ class SummaryWriter(object):
 
   def  __init__(self, resource):
     self._resource = resource
-    if context.in_eager_mode() and self._resource is not None:
+    if context.executing_eagerly() and self._resource is not None:
       self._resource_deleter = resource_variable_ops.EagerResourceDeleter(
           handle=self._resource, handle_device="cpu:0")
 
@@ -158,7 +158,7 @@ def initialize(
       @{tf.contrib.summary.SummaryWriter}.
     ValueError: If session wasn't passed and no default session.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return
   if context.context().summary_writer_resource is None:
     raise RuntimeError("No default tf.contrib.summary.SummaryWriter found")
@@ -269,7 +269,7 @@ def _make_summary_writer(name, factory, **kwargs):
   resource = gen_summary_ops.summary_writer(shared_name=name)
   # TODO(apassos): Consider doing this instead.
   # node = factory(resource, **kwargs)
-  # if not context.in_eager_mode():
+  # if not context.executing_eagerly():
   #   ops.get_default_session().run(node)
   ops.add_to_collection(_SUMMARY_WRITER_INIT_COLLECTION_NAME,
                         factory(resource, **kwargs))
@@ -295,7 +295,7 @@ def all_summary_ops():
   Returns:
     The summary ops.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return None
   return ops.get_collection(ops.GraphKeys._SUMMARY_COLLECTION)  # pylint: disable=protected-access
 
@@ -309,7 +309,7 @@ def summary_writer_initializer_op():
   Raises:
     RuntimeError: If in Eager mode.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError(
         "tf.contrib.summary.summary_writer_initializer_op is only "
         "supported in graph mode.")
@@ -328,8 +328,12 @@ def summary_writer_function(name, tensor, function, family=None):
   Returns:
     The result of writing the summary.
   """
+  name_scope = ops.get_name_scope()
+  if name_scope:
+    # Add a slash to allow reentering the name scope.
+    name_scope += "/"
   def record():
-    with summary_op_util.summary_scope(
+    with ops.name_scope(name_scope), summary_op_util.summary_scope(
         name, family, values=[tensor]) as (tag, scope):
       with ops.control_dependencies([function(tag, scope)]):
         return constant_op.constant(True)
@@ -477,7 +481,7 @@ def graph(param, step=None, name=None):
   Raises:
     TypeError: If `param` isn't already a @{tf.Tensor} in graph mode.
   """
-  if not context.in_eager_mode() and not isinstance(param, ops.Tensor):
+  if not context.executing_eagerly() and not isinstance(param, ops.Tensor):
     raise TypeError("graph() needs a tf.Tensor (e.g. tf.placeholder) in graph "
                     "mode, but was: %s" % type(param))
   writer = context.context().summary_writer_resource
diff --git a/tensorflow/contrib/summary/summary_ops_graph_test.py b/tensorflow/contrib/summary/summary_ops_graph_test.py
index 2b7806f80d020e0064b0f5cf32fd765a9ee993d1..3aba04540eba12092d884cca10e23546eb91c91d 100644
--- a/tensorflow/contrib/summary/summary_ops_graph_test.py
+++ b/tensorflow/contrib/summary/summary_ops_graph_test.py
@@ -85,6 +85,38 @@ class DbTest(summary_test_util.SummaryDbTest):
           self.assertEqual(len(events), 2)
           self.assertEqual(events[1].summary.value[0].tag, 'my_scalar')
 
+  def testScalarSummaryNameScope(self):
+    """Test that the enclosing name scope is prepended to the summary tag."""
+    with ops.Graph().as_default(), self.test_session() as sess:
+      global_step = training_util.get_or_create_global_step()
+      global_step.initializer.run()
+      with ops.device('/cpu:0'):
+        step_increment = state_ops.assign_add(global_step, 1)
+      sess.run(step_increment)  # Increment global step from 0 to 1
+
+      logdir = tempfile.mkdtemp()
+      with summary_ops.create_file_writer(logdir, max_queue=0,
+                                          name='t2').as_default():
+        with summary_ops.record_summaries_every_n_global_steps(2):
+          summary_ops.initialize()
+          with ops.name_scope('scope'):
+            summary_op = summary_ops.scalar('my_scalar', 2.0)
+
+          # Neither of these should produce a summary because
+          # global_step is 1 and "1 % 2 != 0"
+          sess.run(summary_ops.all_summary_ops())
+          sess.run(summary_op)
+          events = summary_test_util.events_from_logdir(logdir)
+          self.assertEqual(len(events), 1)
+
+          # Increment global step from 1 to 2 and check that the summary
+          # is now written
+          sess.run(step_increment)
+          sess.run(summary_ops.all_summary_ops())
+          events = summary_test_util.events_from_logdir(logdir)
+          self.assertEqual(len(events), 2)
+          self.assertEqual(events[1].summary.value[0].tag, 'scope/my_scalar')
+
   def testSummaryGraphModeCond(self):
     with ops.Graph().as_default(), self.test_session():
       training_util.get_or_create_global_step()
diff --git a/tensorflow/contrib/summary/summary_ops_test.py b/tensorflow/contrib/summary/summary_ops_test.py
index bb7215f879411e91a1c47b87f5caede63fffea74..c756f8b27055f9cf86a311e485d97745a3c7a95b 100644
--- a/tensorflow/contrib/summary/summary_ops_test.py
+++ b/tensorflow/contrib/summary/summary_ops_test.py
@@ -29,6 +29,7 @@ from tensorflow.core.framework import types_pb2
 from tensorflow.python.eager import function
 from tensorflow.python.eager import test
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import state_ops
@@ -107,6 +108,20 @@ class TargetTest(test_util.TensorFlowTestCase):
       self.assertEqual(len(events), 2)
       self.assertEqual(events[1].summary.value[0].tag, 'scalar')
 
+  def testSummaryNameScope(self):
+    training_util.get_or_create_global_step()
+    logdir = tempfile.mkdtemp()
+    with summary_ops.create_file_writer(
+        logdir, max_queue=0,
+        name='t2').as_default(), summary_ops.always_record_summaries():
+
+      with ops.name_scope('scope'):
+        summary_ops.scalar('scalar', 2.0)
+
+      events = summary_test_util.events_from_logdir(logdir)
+      self.assertEqual(len(events), 2)
+      self.assertEqual(events[1].summary.value[0].tag, 'scope/scalar')
+
   def testSummaryGlobalStep(self):
     step = training_util.get_or_create_global_step()
     logdir = tempfile.mkdtemp()
diff --git a/tensorflow/contrib/tensor_forest/README.md b/tensorflow/contrib/tensor_forest/README.md
index 8b24430c71c16c2ed6b2e1a530e19fbc9ebb1698..9e1491ea666b51ba0d367610778c659c543dacf6 100644
--- a/tensorflow/contrib/tensor_forest/README.md
+++ b/tensorflow/contrib/tensor_forest/README.md
@@ -116,7 +116,7 @@ a different `feature_bagging_fraction * num_features` sized subset of the
 input features.  Defaults to 1.0 (no feature bagging).
 
 * `base_random_seed`.  By default (`base_random_seed = 0`), the random number
-generator for each tree is seeded by the current time (in microseconds) when
+generator for each tree is seeded by a 64-bit random value when
 each tree is first created.  Using a non-zero value causes tree training to
 be deterministic, in that the i-th tree's random number generator is seeded
 with the value `base_random_seed + i`.
diff --git a/tensorflow/contrib/tensor_forest/kernels/data_spec.h b/tensorflow/contrib/tensor_forest/kernels/data_spec.h
index 0a3abe56dfc4f611ac8ed0815e4c74a639d2477e..bb33400214e5ef37be73b538455eecf5ae481db4 100644
--- a/tensorflow/contrib/tensor_forest/kernels/data_spec.h
+++ b/tensorflow/contrib/tensor_forest/kernels/data_spec.h
@@ -21,6 +21,7 @@
 
 #include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/platform/logging.h"
 
 namespace tensorflow {
 namespace tensorforest {
diff --git a/tensorflow/contrib/tensor_forest/kernels/v4/grow_stats.cc b/tensorflow/contrib/tensor_forest/kernels/v4/grow_stats.cc
index da600d34eacdf27514709240723e5bb730cfe7f0..63d4d9ba50603f65cc822ea74c97b923c29fea35 100644
--- a/tensorflow/contrib/tensor_forest/kernels/v4/grow_stats.cc
+++ b/tensorflow/contrib/tensor_forest/kernels/v4/grow_stats.cc
@@ -19,6 +19,7 @@
 #include "tensorflow/contrib/tensor_forest/kernels/tree_utils.h"
 #include "tensorflow/contrib/tensor_forest/kernels/v4/stat_utils.h"
 #include "tensorflow/core/lib/random/distribution_sampler.h"
+#include "tensorflow/core/lib/random/random.h"
 
 namespace tensorflow {
 namespace tensorforest {
@@ -122,9 +123,8 @@ ClassificationStats::ClassificationStats(const TensorForestParams& params,
     right_gini_.reset(new RunningGiniScores());
   }
 
-  uint64 time_seed = static_cast<uint64>(std::clock());
   single_rand_ = std::unique_ptr<random::PhiloxRandom>(
-      new random::PhiloxRandom(time_seed));
+      new random::PhiloxRandom(random::New64()));
   rng_ = std::unique_ptr<random::SimplePhilox>(
       new random::SimplePhilox(single_rand_.get()));
 }
diff --git a/tensorflow/contrib/tensor_forest/kernels/v4/input_data.h b/tensorflow/contrib/tensor_forest/kernels/v4/input_data.h
index c544a8c75e9bfe8fe6bbea8913e7be17d868bfef..95f75b4d7e6a961edf6b3da1dc1712e7ddaacf31 100644
--- a/tensorflow/contrib/tensor_forest/kernels/v4/input_data.h
+++ b/tensorflow/contrib/tensor_forest/kernels/v4/input_data.h
@@ -23,6 +23,7 @@
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_types.h"
 #include "tensorflow/core/lib/random/philox_random.h"
+#include "tensorflow/core/lib/random/random.h"
 #include "tensorflow/core/lib/random/simple_philox.h"
 
 namespace tensorflow {
@@ -44,18 +45,20 @@ class TensorDataSet {
     int column_count = 0;
     for (int i = 0; i < input_spec_.dense_size(); ++i) {
       for (int j = 0; j < input_spec_.dense(i).size(); ++j) {
-        decision_trees::FeatureId id;
-        id.mutable_id()->set_value(strings::StrCat(column_count));
-        available_features_.push_back(id);
         ++column_count;
       }
     }
+    available_features_.reserve(column_count);
+    decision_trees::FeatureId id;
+    for (int i = 0; i < column_count; i++) {
+      id.mutable_id()->set_value(strings::StrCat(i));
+      available_features_.emplace_back(id);
+    }
 
     // Set up the random number generator.
     if (split_sampling_random_seed_ == 0) {
-      uint64 time_seed = static_cast<uint64>(std::clock());
       single_rand_ = std::unique_ptr<random::PhiloxRandom>(
-          new random::PhiloxRandom(time_seed));
+          new random::PhiloxRandom(random::New64()));
     } else {
       single_rand_ = std::unique_ptr<random::PhiloxRandom>(
           new random::PhiloxRandom(split_sampling_random_seed_));
diff --git a/tensorflow/contrib/tensorboard/BUILD b/tensorflow/contrib/tensorboard/BUILD
index 2e0a46ffe432341a423ac159deb7745d9ef15374..d833744d0c7e85b9f336f60a3becfd043bc3821d 100644
--- a/tensorflow/contrib/tensorboard/BUILD
+++ b/tensorflow/contrib/tensorboard/BUILD
@@ -13,7 +13,6 @@ load("//tensorflow/core:platform/default/build_config.bzl", "tf_proto_library")
 tf_proto_library(
     name = "protos_all",
     srcs = glob(["**/*.proto"]),
-    go_api_version = 2,
     visibility = ["//visibility:public"],
 )
 
diff --git a/tensorflow/contrib/tensorrt/BUILD b/tensorflow/contrib/tensorrt/BUILD
index 65a0e903a74d066dcec6f2fdb70a22a0872b802f..906cc3f0344e7cb641589bd522e33d658150d3b5 100644
--- a/tensorflow/contrib/tensorrt/BUILD
+++ b/tensorflow/contrib/tensorrt/BUILD
@@ -47,7 +47,10 @@ tf_cuda_cc_test(
 
 tf_custom_op_library(
     name = "python/ops/_trt_engine_op.so",
-    srcs = ["ops/trt_engine_op.cc"],
+    srcs = [
+        "ops/trt_calib_op.cc",
+        "ops/trt_engine_op.cc",
+    ],
     deps = [
         ":trt_engine_op_kernel",
         ":trt_shape_function",
@@ -71,11 +74,19 @@ tf_cuda_library(
 
 cc_library(
     name = "trt_engine_op_kernel",
-    srcs = ["kernels/trt_engine_op.cc"],
-    hdrs = ["kernels/trt_engine_op.h"],
+    srcs = [
+        "kernels/trt_calib_op.cc",
+        "kernels/trt_engine_op.cc",
+    ],
+    hdrs = [
+        "kernels/trt_calib_op.h",
+        "kernels/trt_engine_op.h",
+    ],
     copts = tf_copts(),
+    visibility = ["//visibility:public"],
     deps = [
         ":trt_logging",
+        ":trt_resources",
         "//tensorflow/core:gpu_headers_lib",
         "//tensorflow/core:lib_proto_parsing",
         "//tensorflow/core:stream_executor_headers_lib",
@@ -87,7 +98,10 @@ cc_library(
 )
 
 tf_gen_op_libs(
-    op_lib_names = ["trt_engine_op"],
+    op_lib_names = [
+        "trt_engine_op",
+        "trt_calib_op",
+    ],
     deps = if_tensorrt([
         "@local_config_tensorrt//:nv_infer",
     ]),
@@ -107,7 +121,9 @@ tf_cuda_library(
 
 tf_gen_op_wrapper_py(
     name = "trt_engine_op",
+    gen_locally = True,
     deps = [
+        ":trt_calib_op_op_lib",
         ":trt_engine_op_op_lib",
         ":trt_logging",
         ":trt_shape_function",
@@ -139,6 +155,7 @@ py_library(
     deps = [
         ":trt_convert_py",
         ":trt_ops_py",
+        "//tensorflow/python:errors",
     ],
 )
 
@@ -171,6 +188,27 @@ tf_py_wrap_cc(
     ],
 )
 
+tf_cuda_library(
+    name = "trt_resources",
+    srcs = [
+        "resources/trt_int8_calibrator.cc",
+        "resources/trt_resource_manager.cc",
+    ],
+    hdrs = [
+        "resources/trt_int8_calibrator.h",
+        "resources/trt_resource_manager.h",
+        "resources/trt_resources.h",
+    ],
+    deps = [
+        ":trt_logging",
+        "//tensorflow/core:framework_headers_lib",
+        "//tensorflow/core:framework_lite",
+        "//tensorflow/core:lib_proto_parsing",
+    ] + if_tensorrt([
+        "@local_config_tensorrt//:nv_infer",
+    ]),
+)
+
 # Library for the node-level conversion portion of TensorRT operation creation
 tf_cuda_library(
     name = "trt_conversion",
@@ -185,6 +223,7 @@ tf_cuda_library(
     deps = [
         ":segment",
         ":trt_logging",
+        ":trt_resources",
         "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core:framework",
diff --git a/tensorflow/contrib/tensorrt/README.md b/tensorflow/contrib/tensorrt/README.md
index dfcce0fd00eedf3341850bbc23927dc3b2e2d2aa..6eafc1754ca5102c8adf04f00e33dc2f8ff970f6 100644
--- a/tensorflow/contrib/tensorrt/README.md
+++ b/tensorflow/contrib/tensorrt/README.md
@@ -1,40 +1,59 @@
-Using TensorRT in TensorFlow
-============================
+# Using TensorRT in TensorFlow
+
 
 This module provides necessary bindings and introduces TRT_engine_op
-operator that wraps a subgraph in TensorRT.
+operator that wraps a subgraph in TensorRT. This is still a work in progress
+but should be usable with most common graphs.
+
+## Compilation
 
-Compilation
------------
 
 In order to compile the module, you need to have a local TensorRT
-installation (libnvinfer.so and respective include files). During the
+installation ( libnvinfer.so and respective include files ). During the
 configuration step, TensorRT should be enabled and installation path
 should be set. If installed through package managers (deb,rpm),
 configure script should find the necessary components from the system
 automatically. If installed from tar packages, user has to set path to
 location where the library is installed during configuration.
 
-
-```
+```shell
 bazel build --config=cuda --config=opt //tensorflow/tools/pip_package:build_pip_package
 bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/
 ```
 
 After the installation of tensorflow package, TensorRT transformation
-will be available. An example use is shown below.
-
-```python
-import tensorflow as tf
-import tensorflow.contrib.tensorrt as trt
-#... create and train or load model
-gdef = sess.graph.as_graph_def()
-trt_gdef = trt.create_inference_graph(
-    gdef, #original graph_def
-    ["output"], #name of output node(s)
-    max_batch_size, #maximum batch size to run the inference
-    max_workspace_size_bytes) # max memory for TensorRT to use
-tf.reset_default_graph()
-tf.import_graph_def(graph_def=trt_gdef)
-#...... run inference
+will be available. An example can be found in the `test/test_tftrt.py` script.
+
+## Installing TensorRT 3.0.4
+
+In order to make use of TensorRT integration, you will need a local installation of TensorRT 3.0.4 from the [NVIDIA Developer website](https://developer.nvidia.com/tensorrt). Due to compiler compatibility, you will need to download and install the TensorRT 3.0.4 tarball for _Ubuntu 14.04_, i.e., **_TensorRT-3.0.4.Ubuntu-14.04.5.x86_64.cuda-9.0.cudnn7.0-tar.gz_**, even if you are using Ubuntu 16.04 or later.
+
+### Preparing TensorRT installation
+
+Once you have downloaded `TensorRT-3.0.4.Ubuntu-14.04.5.x86_64.cuda-9.0.cudnn7.0-tar.gz`, unpack it to an installation directory, referred to below as `<install_dir>`. Replace `<install_dir>` with the full path of the installation directory you choose in the commands below.
+
+```shell
+cd <install_dir> && tar -zxf /path/to/TensorRT-3.0.4.Ubuntu-14.04.5.x86_64.cuda-9.0.cudnn7.0-tar.gz
 ```
+
+After unpacking the binaries, you have several options to use them:
+
+#### To run TensorFlow as a user without superuser privileges
+
+As a regular user without sudo rights, add the TensorRT library directory to your `$LD_LIBRARY_PATH`:
+
+  ```shell
+   export LD_LIBRARY_PATH=<install_dir>/TensorRT-3.0.4/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
+  ```
+
+You are then ready to use the TensorFlow-TensorRT integration. `$LD_LIBRARY_PATH` must contain the path to the TensorRT installation for the integration to work. If you are using a VirtualEnv-like setup, you can add the command above to your `bin/activate` script or to your `.bashrc` script.
+
+#### To run TensorFlow as a superuser
+
+ When running as a superuser, such as in a container or via sudo, the `$LD_LIBRARY_PATH` approach above may not work. The following is preferred when the user has superuser privileges:
+
+  ```shell
+  echo "<install_dir>/TensorRT-3.0.4/lib" | sudo tee /etc/ld.so.conf.d/tensorrt304.conf && sudo ldconfig
+  ```
+
+  Please ensure that any existing deb package installation of TensorRT is removed before following these instructions to avoid package conflicts.
\ No newline at end of file
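The `${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}` expansion in the README's `export` command appends the previous value, preceded by a colon, only when the variable is already set and non-empty. This avoids a stray trailing colon, which the dynamic loader would treat as the current working directory. A minimal Python sketch of the same joining logic follows; `prepend_library_path` is a hypothetical helper written for illustration and is not part of TensorFlow or of this change.

```python
def prepend_library_path(new_dir, current):
    """Mimic NEW${VAR:+:${VAR}}: add ':' + current only if current is non-empty."""
    if current:
        # Variable was set and non-empty: keep the old entries after the new one.
        return new_dir + ":" + current
    # Variable was unset (None) or empty: no separator, no old value.
    return new_dir


if __name__ == "__main__":
    # "/opt/trt" stands in for <install_dir>; it is a placeholder path.
    trt_lib = "/opt/trt/TensorRT-3.0.4/lib"
    print(prepend_library_path(trt_lib, ""))
    print(prepend_library_path(trt_lib, "/usr/local/cuda/lib64"))
```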
diff --git a/tensorflow/contrib/tensorrt/__init__.py b/tensorflow/contrib/tensorrt/__init__.py
index fd551d70b4385b14b84b7b98a6d16b0c03733d38..140ad4828208ae4844a49bf664955b50cd9e51cd 100644
--- a/tensorflow/contrib/tensorrt/__init__.py
+++ b/tensorflow/contrib/tensorrt/__init__.py
@@ -18,6 +18,18 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-# pylint: disable=unused-import,wildcard-import
-from tensorflow.contrib.tensorrt.python import *
-# pylint: enable=unused-import,wildcard-import
+from tensorflow.python.framework import errors
+
+# pylint: disable=unused-import,wildcard-import,g-import-not-at-top
+try:
+  from tensorflow.contrib.tensorrt.python import *
+except errors.NotFoundError as e:
+  no_trt_message = (
+      '**** Failed to initialize TensorRT. This is either because the TensorRT'
+      ' installation path is not in LD_LIBRARY_PATH, or because you do not have'
+      ' it installed. If not installed, please go to'
+      ' https://developer.nvidia.com/tensorrt to download and install'
+      ' TensorRT ****')
+  print(no_trt_message)
+  raise e
+# pylint: enable=unused-import,wildcard-import,g-import-not-at-top
diff --git a/tensorflow/contrib/tensorrt/convert/convert_graph.cc b/tensorflow/contrib/tensorrt/convert/convert_graph.cc
index 970f8104736d95d09ea3ffabb07f84d8591a8f9c..ff8cc6374d40dc0b49721a784e25015c76541d03 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_graph.cc
+++ b/tensorflow/contrib/tensorrt/convert/convert_graph.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/contrib/tensorrt/convert/convert_graph.h"
 
+#include <list>
 #include <map>
 #include <set>
 #include <unordered_map>
@@ -48,16 +49,33 @@ namespace tensorrt {
 namespace convert {
 namespace {
 
-static bool IsTensorRTCandidate(const tensorflow::NodeDef& node_def) {
+bool IsTensorRTCandidate(const tensorflow::Node* node) {
   // LINT.IfChange
   // TODO(jie): Segmentation shouldn't associated with op name.
   //            Split it into a registration for each kernel.
   static const std::set<string> candidate_ops = {
-      "Identity", "Const", "Conv2D", "MaxPool", "BiasAdd", "Relu",
-      "Add",      "Mul",   "Sub",    "Rsqrt",   "Pad"  // "Placeholder" ,"Mean"
+      "Identity",
+      "Snapshot",
+      "Const",
+      "Conv2D",
+      "MaxPool",
+      "BiasAdd",
+      "Relu",
+      "Add",
+      "Mul",
+      "Sub",
+      "Rsqrt",
+      "Pad",
+      "Mean",
+      "AvgPool",
+      "ConcatV2",
+      "DepthwiseConv2dNative",
+      "FusedBatchNorm",
+      "FusedBatchNormV2",
+      // TODO(ben,jie): ...
   };
   // LINT.ThenChange(//tensorflow/contrib/tensorrt/convert/convert_nodes.h)
-  return candidate_ops.count(node_def.op());
+  return candidate_ops.count(node->type_string());
 }
 
 void GetSubGraphIncomingEdges(const tensorflow::Graph& graph,
@@ -67,8 +85,10 @@ void GetSubGraphIncomingEdges(const tensorflow::Graph& graph,
     const tensorflow::Node* node = graph.FindNodeId(node_id);
     for (const tensorflow::Edge* edge : node->in_edges()) {
       if (!subgraph_node_ids.count(edge->src()->id()) &&
-          !edge->src()->IsSource()) {
+          !edge->src()->IsSource() && !edge->IsControlEdge()) {
         incoming_edges->insert(edge);
+      } else {
+        VLOG(2) << edge->src()->name() << " -> " << node->name() << " N, ";
       }
     }
   }
@@ -81,8 +101,11 @@ void GetSubGraphOutgoingEdges(const tensorflow::Graph& graph,
     const tensorflow::Node* node = graph.FindNodeId(node_id);
     for (const tensorflow::Edge* edge : node->out_edges()) {
       if (!subgraph_node_ids.count(edge->dst()->id()) &&
-          !edge->dst()->IsSink()) {
+          !edge->dst()->IsSink() && !edge->IsControlEdge()) {
+        VLOG(2) << node->name() << " -> " << edge->dst()->name() << " Y, ";
         outgoing_edges->insert(edge);
+      } else {
+        VLOG(2) << node->name() << " -> " << edge->dst()->name() << " N, ";
       }
     }
   }
@@ -109,74 +132,150 @@ std::unordered_map<string, std::vector<int>> BuildTensorNameMap(
   }
   return result;
 }
-
-tensorflow::Status ConvertSubGraphToTensorRT(
-    const std::vector<string>& output_names,
-    const std::set<int>& subgraph_node_ids,
-    size_t max_batch_size,  // Max batch size that engine will be created for
-    // Max amount of memory that engine will be allowed to consume, in bytes
-    size_t max_workspace_size_bytes,
-    const tensorflow::grappler::GraphProperties& graph_properties,
-    tensorflow::Graph* graph) {
-  tensorflow::EdgeSet subgraph_incoming_edges;
-  GetSubGraphIncomingEdges(*graph, subgraph_node_ids, &subgraph_incoming_edges);
-
+// TODO(sami): convert references to pointers
+struct ConvertGraphParams {
+  ConvertGraphParams(
+      tensorflow::Graph& inp_graph,
+      const std::vector<string>& output_node_names,
+      const std::set<int>& subgraph_node_id_numbers,
+      size_t max_supported_batch_size, size_t max_consumed_workspace_size_bytes,
+      const tensorflow::grappler::GraphProperties& current_graph_properties,
+      std::unordered_map<string, std::pair<int, string>>* output_edges,
+      int engine_precision_mode)
+      : graph(inp_graph),
+        output_names(output_node_names),
+        subgraph_node_ids(subgraph_node_id_numbers),
+        max_batch_size(max_supported_batch_size),
+        max_workspace_size_bytes(max_consumed_workspace_size_bytes),
+        graph_properties(current_graph_properties),
+        output_edge_map(output_edges),
+        precision_mode(engine_precision_mode) {}
+  tensorflow::Graph& graph;
+  const std::vector<string>& output_names;
+  const std::set<int>& subgraph_node_ids;
+  size_t max_batch_size;
+  size_t max_workspace_size_bytes;
+  const tensorflow::grappler::GraphProperties& graph_properties;
+  std::unordered_map<string, std::pair<int, string>>* output_edge_map;
+  int precision_mode;
   std::vector<std::pair<int, int>> subgraph_inputs;
+  std::vector<std::pair<int, int>> subgraph_outputs;
+  tensorflow::EdgeSet subgraph_incoming_edges;
+  tensorflow::EdgeSet subgraph_outgoing_edges;
+};
 
-  // Collect inputs by looking for incoming edges
-  for (const tensorflow::Edge* edge : subgraph_incoming_edges) {
-    subgraph_inputs.push_back({edge->src()->id(), edge->src_output()});
+static tensorflow::Status FillSubGraphEdgeSets(ConvertGraphParams* p) {
+  GetSubGraphIncomingEdges(p->graph, p->subgraph_node_ids,
+                           &p->subgraph_incoming_edges);
+  for (const tensorflow::Edge* edge : p->subgraph_incoming_edges) {
+    p->subgraph_inputs.push_back({edge->src()->id(), edge->src_output()});
   }
+  auto output_name_to_index_map = BuildTensorNameMap(p->output_names);
   std::set<std::pair<int, int>> subgraph_outputs_set;
   // Collect outputs referenced from output_names
-  auto output_name_to_index_map = BuildTensorNameMap(output_names);
-  for (int node_id : subgraph_node_ids) {
-    tensorflow::Node* node = graph->FindNodeId(node_id);
+  for (int node_id : p->subgraph_node_ids) {
+    tensorflow::Node* node = p->graph.FindNodeId(node_id);
     if (output_name_to_index_map.count(node->name())) {
       for (int index : output_name_to_index_map.at(node->name())) {
         subgraph_outputs_set.insert({node_id, index});
       }
     }
   }
-  // Collect outputs referenced from outgoing edges
-  tensorflow::EdgeSet subgraph_outgoing_edges;
-  GetSubGraphOutgoingEdges(*graph, subgraph_node_ids, &subgraph_outgoing_edges);
-  for (const tensorflow::Edge* edge : subgraph_outgoing_edges) {
+  GetSubGraphOutgoingEdges(p->graph, p->subgraph_node_ids,
+                           &p->subgraph_outgoing_edges);
+  for (const tensorflow::Edge* edge : p->subgraph_outgoing_edges) {
     subgraph_outputs_set.insert({edge->src()->id(), edge->src_output()});
   }
-  // Impose an ordering on the outputs
-  std::vector<std::pair<int, int>> subgraph_outputs(
-      subgraph_outputs_set.begin(), subgraph_outputs_set.end());
-  // Build TensorRT node and add it to the graph
+  p->subgraph_outputs.reserve(subgraph_outputs_set.size());
+  p->subgraph_outputs.insert(p->subgraph_outputs.begin(),
+                             subgraph_outputs_set.begin(),
+                             subgraph_outputs_set.end());
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status GetCalibNode(ConvertGraphParams* params) {
+  TF_RETURN_IF_ERROR(FillSubGraphEdgeSets(params));
   tensorflow::NodeDef trt_node_def;
-  TF_RETURN_IF_ERROR(ConvertSubGraphToTensorRTNodeDef(
-      *graph, subgraph_node_ids, subgraph_inputs, subgraph_outputs,
-      max_batch_size, max_workspace_size_bytes, graph_properties,
-      &trt_node_def));
+  SubGraphParams s(params->graph, params->subgraph_node_ids,
+                   params->subgraph_inputs, params->subgraph_outputs,
+                   params->max_batch_size, params->max_workspace_size_bytes,
+                   params->graph_properties, params->output_edge_map,
+                   &trt_node_def, params->precision_mode);
+  TF_RETURN_IF_ERROR(InjectCalibrationNode(s));
   tensorflow::Status status;
-  tensorflow::Node* trt_node = graph->AddNode(trt_node_def, &status);
+  tensorflow::Node* trt_node = params->graph.AddNode(trt_node_def, &status);
+
+  TF_RETURN_IF_ERROR(status);
+
+  for (auto in_edge :
+       params->subgraph_incoming_edges) {  // loop over incoming edges and
+                                           // attach them to calib node
+    // tensorflow::Node* src_node = in_edge->src();
+    auto src_output = in_edge->src_output();
+    auto dst_node = in_edge->dst();
+    auto dst_input = in_edge->dst_input();
+    VLOG(1) << " update edge " << trt_node->name() << ":" << src_output
+            << " -> " << dst_node->name() << ":" << dst_input;
+    TF_RETURN_IF_ERROR(
+        params->graph.UpdateEdge(trt_node, src_output, dst_node, dst_input));
+  }
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertSubGraphToTensorRT(ConvertGraphParams* params) {
+  TF_RETURN_IF_ERROR(FillSubGraphEdgeSets(params));
+  tensorflow::NodeDef trt_node_def;
+
+  SubGraphParams s(params->graph, params->subgraph_node_ids,
+                   params->subgraph_inputs, params->subgraph_outputs,
+                   params->max_batch_size, params->max_workspace_size_bytes,
+                   params->graph_properties, params->output_edge_map,
+                   &trt_node_def, params->precision_mode);
+  TF_RETURN_IF_ERROR(ConvertSubGraphToTensorRTNodeDef(s));
+  tensorflow::Status status;
+  tensorflow::Node* trt_node = params->graph.AddNode(trt_node_def, &status);
+  TF_RETURN_IF_ERROR(status);  // trt_node is dereferenced below.
+  // AddNode does not wire edges.
+  // Re-map incoming edges to use the new TRT node instead of the orig subgraph
+  std::map<std::pair<int, int>, int> subgraph_edge_to_input_map;
+  for (size_t i = 0; i < params->subgraph_inputs.size(); ++i) {
+    subgraph_edge_to_input_map.insert({params->subgraph_inputs.at(i), i});
+  }
+  for (const tensorflow::Edge* edge : params->subgraph_incoming_edges) {
+    std::pair<int, int> old_src = {edge->src()->id(), edge->src_output()};
+    int new_src_output = subgraph_edge_to_input_map.at(old_src);
+    params->graph.AddEdge(edge->src(), edge->src_output(), trt_node,
+                          new_src_output);
+    params->graph.RemoveEdge(edge);
+  }
+
+  VLOG(2) << "new wiring edges: " << trt_node->in_edges().size();
+  for (const tensorflow::Edge* edge : trt_node->in_edges()) {
+    VLOG(2) << edge->src()->name() << " port: " << edge->src_output();
+  }
+
   TF_RETURN_IF_ERROR(status);
 
   // Re-map outgoing edges to use the new TRT node instead of the orig subgraph
   std::map<std::pair<int, int>, int> subgraph_edge_to_output_map;
-  for (size_t i = 0; i < subgraph_outputs.size(); ++i) {
-    subgraph_edge_to_output_map.insert({subgraph_outputs.at(i), i});
+  for (size_t i = 0; i < params->subgraph_outputs.size(); ++i) {
+    subgraph_edge_to_output_map.insert({params->subgraph_outputs.at(i), i});
   }
   TF_RETURN_IF_ERROR(status);
-  for (const tensorflow::Edge* edge : subgraph_outgoing_edges) {
+  for (const tensorflow::Edge* edge : params->subgraph_outgoing_edges) {
     std::pair<int, int> old_src = {edge->src()->id(), edge->src_output()};
     int new_src_output = subgraph_edge_to_output_map.at(old_src);
-    TF_RETURN_IF_ERROR(graph->UpdateEdge(trt_node, new_src_output, edge->dst(),
-                                         edge->dst_input()));
+    TF_RETURN_IF_ERROR(params->graph.UpdateEdge(
+        trt_node, new_src_output, edge->dst(), edge->dst_input()));
   }
   // Remove the original subgraph
-  for (int node_id : subgraph_node_ids) {
-    tensorflow::Node* node = graph->FindNodeId(node_id);
+  for (int node_id : params->subgraph_node_ids) {
+    tensorflow::Node* node = params->graph.FindNodeId(node_id);
     // Don't remove the input placeholders
     if (node->type_string() == "Placeholder") {
       continue;
     }
-    graph->RemoveNode(node);
+    params->graph.RemoveNode(node);
   }
   return tensorflow::Status::OK();
 }
@@ -194,12 +293,39 @@ tensorflow::Status BuildNodeMap(
 }
 
 }  // namespace
+tensorflow::Status ConvertCalibGraphToInferGraph(
+    const tensorflow::GraphDef& graph_def, tensorflow::GraphDef* infer_graph) {
+  VLOG(0) << "Starting Calib Conversion";
+  tensorflow::Graph graph(tensorflow::OpRegistry::Global());
+  TF_RETURN_IF_ERROR(tensorflow::ConvertGraphDefToGraph(
+      tensorflow::GraphConstructorOptions(), graph_def, &graph));
+  //  get calib nodes
+  std::vector<tensorflow::Node*> calib_nodes;
+  for (auto node : graph.op_nodes()) {
+    if (node->type_string() == "TRTCalibOp") {
+      VLOG(1) << "Found Calib Node";
+      calib_nodes.push_back(node);
+    }
+  }
+  VLOG(0) << "Num Calib nodes in graph= " << calib_nodes.size();
+  if (calib_nodes.size() == 0)
+    return tensorflow::errors::FailedPrecondition(
+        "Graph doesn't contain any calibration nodes."
+        " Please generate a calibration graph and run calibration first.");
+  for (auto n : calib_nodes) {
+    TF_RETURN_IF_ERROR(
+        tensorrt::convert::ConvertCalibrationNodeToEngineNode(graph, n));
+  }
+  graph.ToGraphDef(infer_graph);
+  return tensorflow::Status::OK();
+}
 
 tensorflow::Status ConvertGraphDefToTensorRT(
     const tensorflow::GraphDef& graph_def,
     const std::vector<string>& output_names, size_t max_batch_size,
-    size_t max_workspace_size_bytes, tensorflow::GraphDef* new_graph_def) {
-  // Optimization pass
+    size_t max_workspace_size_bytes, tensorflow::GraphDef* new_graph_def,
+    int precision_mode = FP32MODE, int minimum_segment_size = 3) {
+  // optimization pass
   tensorflow::grappler::GrapplerItem item;
   item.fetch = output_names;
   tensorflow::GraphDef gdef;
@@ -209,16 +335,23 @@ tensorflow::Status ConvertGraphDefToTensorRT(
   tensorflow::grappler::LayoutOptimizer optimizer;
   tensorflow::grappler::Cluster* cluster;
 
-  // Virtual cluster
+  // virtual cluster
   tensorflow::DeviceProperties device_properties;
+
   device_properties.set_type("GPU");
   device_properties.mutable_environment()->insert({"architecture", "6"});
   cluster =
       new tensorflow::grappler::VirtualCluster({{"/GPU:0", device_properties}});
 
+  // single machine
+  int num_cpu_cores = tensorflow::grappler::GetNumAvailableLogicalCPUCores();
+  int num_gpus = tensorflow::grappler::GetNumAvailableGPUs();
+  VLOG(2) << "cpu_cores: " << num_cpu_cores;
+  VLOG(2) << "gpus: " << num_gpus;
+
   TF_RETURN_IF_ERROR(optimizer.Optimize(cluster, item, &gdef));
 
-  // Constant folding
+  // constant folding
   item.graph = gdef;
   tensorflow::grappler::ConstantFolding fold(nullptr);
   TF_RETURN_IF_ERROR(fold.Optimize(nullptr, item, &gdef));
@@ -226,7 +359,6 @@ tensorflow::Status ConvertGraphDefToTensorRT(
   // AJ refactoring shape inference through grappler/GraphProperties.
   tensorflow::grappler::GraphProperties static_graph_properties(item);
   TF_RETURN_IF_ERROR(static_graph_properties.InferStatically(false));
-
   // Build full graph
   tensorflow::FunctionLibraryDefinition flib(tensorflow::OpRegistry::Global(),
                                              gdef.library());
@@ -243,7 +375,7 @@ tensorflow::Status ConvertGraphDefToTensorRT(
   }
 
   // TODO(sami): this should be passed as a knob!!!!
-  segment_options.minimum_segment_size = 2;
+  segment_options.minimum_segment_size = minimum_segment_size;
   tensorflow::tensorrt::segment::SegmentNodesVector segments;
   TF_RETURN_IF_ERROR(tensorrt::segment::SegmentGraph(
       gdef, IsTensorRTCandidate, segment_options, &segments));
@@ -252,14 +384,38 @@ tensorflow::Status ConvertGraphDefToTensorRT(
   }
   std::unordered_map<string, tensorflow::Node*> node_map;
   TF_RETURN_IF_ERROR(BuildNodeMap(graph, &node_map));
+  std::unordered_map<string, std::pair<int, string>> output_edge_map;
+  int count = 0;
+  float total_num_nodes_in_segments = 0.;
+  for (auto s : segments) {
+    total_num_nodes_in_segments += s.size();
+  }
   for (const std::set<string>& subgraph_node_names : segments) {
     std::set<int> subgraph_node_ids;
+    size_t max_mem_per_engine =
+        max_workspace_size_bytes *
+        ((float)subgraph_node_names.size() / total_num_nodes_in_segments);
+    std::stringstream oss;
     for (const string& node_name : subgraph_node_names) {
+      oss << " " << node_name;
       subgraph_node_ids.insert(node_map.at(node_name)->id());
     }
-    TF_RETURN_IF_ERROR(ConvertSubGraphToTensorRT(
-        output_names, subgraph_node_ids, max_batch_size,
-        max_workspace_size_bytes, static_graph_properties, &graph));
+    VLOG(2) << "Subgraph nodes" << oss.str();
+    ConvertGraphParams p(graph, output_names, subgraph_node_ids, max_batch_size,
+                         max_mem_per_engine, static_graph_properties,
+                         &output_edge_map, precision_mode);
+    if (precision_mode == INT8MODE) {
+      TF_RETURN_IF_ERROR(GetCalibNode(&p));
+    } else {
+      tensorflow::Status status = ConvertSubGraphToTensorRT(&p);
+      if (!status.ok()) {
+        LOG(WARNING) << "subgraph conversion error for subgraph_index:" << count
+                     << " due to: \"" << status.ToString()
+                     << "\" SKIPPING......( " << subgraph_node_names.size()
+                     << " nodes)";
+      }
+      count++;
+    }
   }
   graph.ToGraphDef(new_graph_def);
   return tensorflow::Status::OK();
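In `ConvertGraphDefToTensorRT` above, each segment's engine receives a share of `max_workspace_size_bytes` proportional to its node count relative to the total across all segments. The Python sketch below mirrors that arithmetic for illustration only. `split_workspace` is a hypothetical name, and the sketch uses Python's double-precision floats where the C++ code uses `float`, so results can differ slightly for large values.

```python
def split_workspace(max_workspace_size_bytes, segment_sizes):
    """Return one workspace budget (in bytes) per segment.

    Each share is proportional to the segment's node count and is
    truncated toward zero, like the float-to-size_t conversion in C++.
    """
    total_nodes = float(sum(segment_sizes))
    return [
        int(max_workspace_size_bytes * (size / total_nodes))
        for size in segment_sizes
    ]


if __name__ == "__main__":
    # Three segments with 1, 1 and 2 nodes sharing a 1 KiB budget.
    print(split_workspace(1024, [1, 1, 2]))
```

Because every share is truncated, the shares can add up to slightly less than the total budget, but never more.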
diff --git a/tensorflow/contrib/tensorrt/convert/convert_graph.h b/tensorflow/contrib/tensorrt/convert/convert_graph.h
index 154ad3f2e8fb0ae702448097fbdece510df30223..e01e4a5328061ad527b2dac6e2e4ef1559bd914d 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_graph.h
+++ b/tensorflow/contrib/tensorrt/convert/convert_graph.h
@@ -28,14 +28,20 @@ namespace tensorflow {
 namespace tensorrt {
 namespace convert {
 
+// This method converts an already generated calibration graph which was used in
+// calibration runs to an inference graph
+tensorflow::Status ConvertCalibGraphToInferGraph(
+    const tensorflow::GraphDef& graph_def, tensorflow::GraphDef* new_graph_def);
+
 // max_batch_size: maximum batch size which can be used for inference for
 //                 optimization targets inference run with max batch size.
-// max_workspace_size_bytes: The upper bound of memory allowence for
+// max_workspace_size_bytes: The upper bound of memory allowance for
 //                 engine building.
 tensorflow::Status ConvertGraphDefToTensorRT(
     const tensorflow::GraphDef& graph_def,
     const std::vector<string>& output_names, size_t max_batch_size,
-    size_t max_workspace_size_bytes, tensorflow::GraphDef* new_graph_def);
+    size_t max_workspace_size_bytes, tensorflow::GraphDef* new_graph_def,
+    int precision_mode, int minimum_segment_size);
 
 }  // namespace convert
 }  // namespace tensorrt
diff --git a/tensorflow/contrib/tensorrt/convert/convert_nodes.cc b/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
index 9ee717dd7fb1eff4a11fb104cf5806ec8ab853d2..e920a797fe428620ef62a2b67c07f35d85ef5211 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
+++ b/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
@@ -24,6 +24,10 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
+#include "tensorflow/contrib/tensorrt/log/trt_logger.h"
+#include "tensorflow/contrib/tensorrt/resources/trt_resource_manager.h"
+#include "tensorflow/contrib/tensorrt/resources/trt_resources.h"
+#include "tensorflow/core/framework/node_def.pb.h"  // NOLINT
 #include "tensorflow/core/framework/node_def_builder.h"
 #include "tensorflow/core/framework/tensor_shape.pb.h"  // NOLINT
 #include "tensorflow/core/framework/types.h"
@@ -32,6 +36,7 @@ limitations under the License.
 #include "tensorflow/core/graph/graph_constructor.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/tensor_coding.h"
@@ -39,7 +44,6 @@ limitations under the License.
 
 #if GOOGLE_CUDA
 #if GOOGLE_TENSORRT
-#include "tensorflow/contrib/tensorrt/log/trt_logger.h"
 #include "tensorrt/include/NvInfer.h"
 
 //  Check if the types are equal. Cast to int first so that failure log message
@@ -49,7 +53,8 @@ limitations under the License.
 namespace tensorflow {
 namespace tensorrt {
 namespace convert {
-
+using ::tensorflow::strings::StrAppend;
+using ::tensorflow::strings::StrCat;
 namespace {
 
 inline tensorflow::Status ConvertDType(tensorflow::DataType tf_dtype,
@@ -65,7 +70,8 @@ inline tensorflow::Status ConvertDType(tensorflow::DataType tf_dtype,
       *trt_dtype = nvinfer1::DataType::kHALF;
       break;
     default:
-      return tensorflow::errors::InvalidArgument("Unsupported data type");
+      return tensorflow::errors::InvalidArgument(
+          "Unsupported data type " + tensorflow::DataTypeString(tf_dtype));
   }
   return tensorflow::Status::OK();
 }
@@ -112,6 +118,18 @@ static std::vector<std::pair<int, int>> CreateSamePadding(
   return padding;
 }
 
+string GetCommonNameScope(const string& op_name_a, const string& op_name_b) {
+  size_t last_scope_separator = 0;
+  for (size_t i = 0; i < std::min(op_name_a.size(), op_name_b.size()); ++i) {
+    if (op_name_a[i] != op_name_b[i]) {
+      break;
+    } else if (op_name_a[i] == '/') {
+      last_scope_separator = i + 1;
+    }
+  }
+  return op_name_a.substr(0, last_scope_separator);
+}
+
 class TRT_ShapedWeights {
  public:
   TRT_ShapedWeights(tensorflow::DataType type, const void* values,
@@ -244,6 +262,11 @@ std::vector<int> TFAttrs::get<std::vector<int>>(string key) const {
   return std::vector<int>(attr.begin(), attr.end());
 }
 
+template <>
+std::vector<string> TFAttrs::get<std::vector<string>>(string key) const {
+  auto attr = this->at(key)->list().s();
+  return std::vector<string>(attr.begin(), attr.end());
+}
 template <>
 nvinfer1::Dims TFAttrs::get<nvinfer1::Dims>(string key) const {
   auto values = this->get<std::vector<int>>(key);
@@ -266,6 +289,17 @@ tensorflow::DataType TFAttrs::get<tensorflow::DataType>(string key) const {
   return this->at(key)->type();
 }
 
+template <>
+float TFAttrs::get<float>(string key) const {
+  return this->at(key)->f();
+}
+
+template <>
+bool TFAttrs::get<bool>(string key) const {
+  return this->at(key)->b();
+}
+
+// TODO(jie): reorder4 & reorder2 should be merged?
 template <typename T>
 void Reorder4(nvinfer1::DimsNCHW shape, const T* idata,
               nvinfer1::DimsNCHW istrides, T* odata,
@@ -283,29 +317,86 @@ void Reorder4(nvinfer1::DimsNCHW shape, const T* idata,
   }
 }
 
+template <typename T>
+void Reorder2(nvinfer1::DimsHW shape, const T* idata, nvinfer1::DimsHW istrides,
+              T* odata, nvinfer1::DimsHW ostrides) {
+  for (int h = 0; h < shape.h(); ++h) {
+    for (int w = 0; w < shape.w(); ++w) {
+      odata[h * ostrides.h() + w * ostrides.w()] =
+          idata[h * istrides.h() + w * istrides.w()];
+    }
+  }
+}
+
+// TODO(jie): fallback to tensorflow!!
+void ReorderCKtoKC(const TRT_ShapedWeights& iweights,
+                   TRT_ShapedWeights* oweights) {
+  int c = iweights.shape_.d[0];
+  int k = iweights.shape_.d[1];
+  oweights->shape_.d[0] = k;
+  oweights->shape_.d[1] = c;
+  nvinfer1::DimsHW istrides = {1, k};
+  nvinfer1::DimsHW ostrides = {c, 1};
+  switch (iweights.type_) {
+    case tensorflow::DataType::DT_FLOAT: {
+      Reorder2({k, c}, static_cast<float const*>(iweights.GetValues()),
+               istrides,
+               static_cast<float*>(const_cast<void*>(oweights->GetValues())),
+               ostrides);
+      break;
+    }
+    case tensorflow::DataType::DT_HALF: {
+      Reorder2({k, c}, static_cast<Eigen::half const*>(iweights.GetValues()),
+               istrides, static_cast<Eigen::half*>(
+                             const_cast<void*>(oweights->GetValues())),
+               ostrides);
+      break;
+    }
+    default:
+      LOG(FATAL) << "Unsupported type in reorder, expected fp32 or fp16 but got "
+                 << DataTypeString(iweights.type_);
+  }
+}
+
 void ReorderRSCKToKCRS(const TRT_ShapedWeights& iweights,
-                       TRT_ShapedWeights* oweights) {
+                       TRT_ShapedWeights* oweights, int num_groups) {
   CHECK_EQ(iweights.type_, oweights->type_);
   CHECK_EQ(iweights.size_bytes(), oweights->size_bytes());
   int r = iweights.shape_.d[0];
   int s = iweights.shape_.d[1];
-  int c = iweights.shape_.d[2];
-  int k = iweights.shape_.d[3];
-  oweights->shape_.d[0] = k;
-  oweights->shape_.d[1] = c;
+  // TRT requires GKcRS, while TF depthwise has RSCK
+  //   where c=1, C=G
+  VLOG(2) << "num_groups: " << num_groups;
+  int c = iweights.shape_.d[2] / num_groups;
+  VLOG(2) << "c" << iweights.shape_.d[2] << " then " << c;
+  int k = iweights.shape_.d[3] * num_groups;
+  VLOG(2) << "k" << iweights.shape_.d[3] << " then " << k;
+  oweights->shape_.d[0] = k / num_groups;
+  oweights->shape_.d[1] = c * num_groups;
   oweights->shape_.d[2] = r;
   oweights->shape_.d[3] = s;
   nvinfer1::DimsNCHW istrides = {1, k, s * k * c, c * k};
   nvinfer1::DimsNCHW ostrides = {c * r * s, r * s, s, 1};
   switch (iweights.type_) {
-    case tensorflow::DataType::DT_FLOAT:
+    case tensorflow::DataType::DT_FLOAT: {
       Reorder4({k, c, r, s}, static_cast<float const*>(iweights.GetValues()),
                istrides,
                static_cast<float*>(const_cast<void*>(oweights->GetValues())),
                ostrides);
       break;
+    }
+    case tensorflow::DataType::DT_HALF: {
+      Reorder4(
+          {k, c, r, s}, static_cast<Eigen::half const*>(iweights.GetValues()),
+          istrides,
+          static_cast<Eigen::half*>(const_cast<void*>(oweights->GetValues())),
+          ostrides);
+      break;
+    }
+
     default:
-      LOG(FATAL) << "!!!!!!!!!!!!!!!!!!!!!!!!broke!!!!!!!!!!!!";
+      LOG(FATAL) << "Unsupported type, expected fp32 or fp16 but got "
+                 << DataTypeString(iweights.type_);
   }
 }
 
@@ -323,12 +414,11 @@ inline std::shared_ptr<T> infer_object(T* obj) {
   return std::shared_ptr<T>(obj, InferDeleter());
 }
 
-// Logger for GIE info/warning/errors
 class Converter;
 
 using OpConverter =
     std::function<tensorflow::Status(Converter&, const tensorflow::NodeDef&,
-                                     std::vector<TRT_TensorOrWeights> const&,
+                                     const std::vector<TRT_TensorOrWeights>&,
                                      std::vector<TRT_TensorOrWeights>*)>;
 
 class Converter {
@@ -336,40 +426,67 @@ class Converter {
   std::unordered_map<string, OpConverter> op_registry_;
   nvinfer1::INetworkDefinition* trt_network_;
   std::list<std::vector<uint8_t>> temp_bufs_;
-
+  tensorflow::tensorrt::TRTWeightStore* weight_store_;
+  bool fp16_;
   void register_op_converters();
-
-  std::vector<TRT_TensorOrWeights> get_inputs(
-      const tensorflow::NodeDef& node_def) {
-    std::vector<TRT_TensorOrWeights> inputs;
-    for (const auto& input_name : node_def.input()) {
-      VLOG(2) << "Retrieve input: " << input_name;
-      inputs.push_back(trt_tensors_.at(input_name));
+  tensorflow::Status get_inputs(const tensorflow::NodeDef& node_def,
+                                std::vector<TRT_TensorOrWeights>* inputs) {
+    for (auto const& input_name : node_def.input()) {
+      /*************************************************************************
+       * TODO(jie) handle case 1) here
+       * Normalizes the inputs and extracts associated metadata:
+       * 1) Inputs can contain a colon followed by a suffix of characters.
+       *    That suffix may be a single number (e.g. inputName:1) or several
+       *    word characters separated from a number by a colon
+       *    (e.g. inputName:foo:1). The latter case is used to denote inputs
+       *    and outputs of functions.
+       * 2) Control dependency inputs contain a caret at the beginning; we
+       *    remove it and annotate the edge as a control dependency.
+       ************************************************************************/
+      string name = input_name[0] == '^' ? input_name.substr(1) : input_name;
+      auto first = name.find_first_of(':');
+      if (first != string::npos && first + 2 == name.size() &&
+          name[first + 1] == '0')
+        name.erase(first);
+
+      VLOG(2) << "retrieve input: " << name;
+      if (trt_tensors_.count(name)) {
+        inputs->push_back(trt_tensors_.at(name));
+      } else {
+        string str("Node ");
+        StrAppend(&str, node_def.name(), " should have an input named '", name,
+                  "' but it is not available");
+        LOG(WARNING) << "input: " << name << " not available for node at "
+                     << node_def.name();
+        return tensorflow::errors::InvalidArgument(str);
+      }
     }
-    return inputs;
+    return tensorflow::Status::OK();
   }
 
  public:
-  explicit Converter(nvinfer1::INetworkDefinition* trt_network)
-      : trt_network_(trt_network) {
+  explicit Converter(nvinfer1::INetworkDefinition* trt_network,
+                     tensorflow::tensorrt::TRTWeightStore* ws, bool fp16)
+      : trt_network_(trt_network), weight_store_(ws), fp16_(fp16) {
     this->register_op_converters();
   }
-
+  tensorflow::tensorrt::TRTWeightStore* weight_store() { return weight_store_; }
   TRT_ShapedWeights get_temp_weights(tensorflow::DataType type,
                                      nvinfer1::Dims shape) {
     TRT_ShapedWeights weights(type, nullptr, shape);
     // TODO(jie): check weights size_bytes. 0 means type error
-    temp_bufs_.push_back(std::vector<uint8_t>(weights.size_bytes()));
-    weights.SetValues(temp_bufs_.back().data());
+    weight_store_->store_.push_back(std::vector<uint8_t>(weights.size_bytes()));
+    weights.SetValues(weight_store_->store_.back().data());
     return weights;
   }
-
+  bool isFP16() { return fp16_; }
   TRT_ShapedWeights get_temp_weights_like(const TRT_ShapedWeights& weights) {
     return this->get_temp_weights(weights.type_, weights.shape_);
   }
 
   tensorflow::Status convert_node(const tensorflow::NodeDef& node_def) {
-    std::vector<TRT_TensorOrWeights> inputs = this->get_inputs(node_def);
+    std::vector<TRT_TensorOrWeights> inputs;
+    TF_RETURN_IF_ERROR(this->get_inputs(node_def, &inputs));
     string op = node_def.op();
     if (!op_registry_.count(op)) {
       return tensorflow::errors::Unimplemented(
@@ -382,7 +499,7 @@ class Converter {
       TRT_TensorOrWeights output = outputs.at(i);
       // TODO(jie): tf protobuf seems to be omitting the :0 suffix
       string output_name = node_def.name();
-      if (i != 0) output_name = output_name + ":" + std::to_string(i);
+      if (i != 0) output_name = StrCat(output_name, ":", i);
       if (output.is_tensor()) {
         output.tensor()->setName(output_name.c_str());
       }
@@ -434,6 +551,19 @@ class Converter {
   }
 };
 
+TRT_ShapedWeights ConvertFP32ToFP16(Converter& ctx,
+                                    const TRT_ShapedWeights& weights_src) {
+  auto dtype_new = tensorflow::DataType::DT_HALF;
+  TRT_ShapedWeights weights =
+      ctx.get_temp_weights(dtype_new, weights_src.shape_);
+  const float* src = static_cast<const float*>(weights_src.GetValues());
+  Eigen::half* dst = const_cast<Eigen::half*>(
+      static_cast<Eigen::half const*>(weights.GetValues()));
+  for (int64_t i = 0; i < weights_src.count(); i++) {
+    dst[i] = Eigen::half_impl::float_to_half_rtne(src[i]);
+  }
+  return weights;
+}
 // ****************************************************************************
 // Constant folding functions
 // TODO(jie): once optimizer kicks in, we should have done constant folding
@@ -448,7 +578,7 @@ struct LambdaFactory {
     switch (op) {
       case OP_CATEGORY::RSQRT: {
         VLOG(2) << "RSQRT GETS DONE";
-        return [](T t) -> T { return 1.0 / std::sqrt(t); };
+        return [](T t) -> T { return 1.0 / sqrt(t); };
       }
       case OP_CATEGORY::NEG:
         return [](T t) -> T { return -t; };
@@ -534,6 +664,22 @@ struct LambdaFactory {
   }
 };
 
+template <>
+std::function<Eigen::half(Eigen::half)> LambdaFactory::unary<Eigen::half>() {
+  switch (op) {
+    case OP_CATEGORY::RSQRT: {
+      VLOG(2) << "RSQRT GETS DONE";
+      return [](Eigen::half t) -> Eigen::half {
+        return Eigen::half(1.0 / sqrt(float(t)));
+      };
+    }
+    case OP_CATEGORY::NEG:
+      return [](Eigen::half t) -> Eigen::half { return -t; };
+    default:
+      VLOG(2) << "Not supported op for unary: " << static_cast<int>(op);
+      return nullptr;
+  }
+}
 tensorflow::Status UnaryCompute(const TRT_ShapedWeights& iweights,
                                 TRT_ShapedWeights* oweights,
                                 LambdaFactory unary_op) {
@@ -545,6 +691,14 @@ tensorflow::Status UnaryCompute(const TRT_ShapedWeights& iweights,
       std::transform(inp, inp + iweights.count(), oup, unary_op.unary<float>());
       break;
     }
+    case tensorflow::DataType::DT_HALF: {
+      auto inp = static_cast<Eigen::half const*>(iweights.GetValues());
+      auto oup =
+          static_cast<Eigen::half*>(const_cast<void*>(oweights->GetValues()));
+      std::transform(inp, inp + iweights.count(), oup,
+                     unary_op.unary<Eigen::half>());
+      break;
+    }
     default:
       return tensorflow::errors::Unimplemented(
           "Data type not supported: " +
@@ -588,6 +742,32 @@ tensorflow::Status BinaryCompute(const TRT_ShapedWeights& iweights_l,
       }
       break;
     }
+    case tensorflow::DataType::DT_HALF: {
+      auto inp_l = static_cast<const Eigen::half*>(iweights_l.GetValues());
+      auto inp_r = static_cast<const Eigen::half*>(iweights_r.GetValues());
+      auto oup =
+          static_cast<Eigen::half*>(const_cast<void*>(oweights->GetValues()));
+
+      if (iweights_l.count() != iweights_r.count()) {
+        // We only support broadcast of RankZero
+        if (iweights_l.count() == 1) {
+          VLOG(2) << "Broadcasting left scalar operand: " << (*inp_l);
+          std::transform(inp_r, inp_r + iweights_r.count(), oup,
+                         binary_op.broadcast_l<Eigen::half>(*inp_l));
+        } else if (iweights_r.count() == 1) {
+          VLOG(2) << "Broadcasting right scalar operand: " << (*inp_r);
+          std::transform(inp_l, inp_l + iweights_l.count(), oup,
+                         binary_op.broadcast_r<Eigen::half>(*inp_r));
+        } else {
+          return tensorflow::errors::Unimplemented(
+              "Binary op with non-rankZero broadcast not supported");
+        }
+      } else {
+        std::transform(inp_l, inp_l + iweights_l.count(), inp_r, oup,
+                       binary_op.binary<Eigen::half>());
+      }
+      break;
+    }
     default:
       return tensorflow::errors::Unimplemented(
           "Data type not supported: " +
@@ -599,7 +779,7 @@ tensorflow::Status BinaryCompute(const TRT_ShapedWeights& iweights_l,
 
 tensorflow::Status ConstantFoldUnary(
     Converter& ctx, const tensorflow::NodeDef& node_def,
-    std::vector<TRT_TensorOrWeights> const& inputs,
+    const std::vector<TRT_TensorOrWeights>& inputs,
     std::vector<TRT_TensorOrWeights>* outputs) {
   TRT_ShapedWeights weights_input = inputs.at(0).weights();
 
@@ -613,13 +793,12 @@ tensorflow::Status ConstantFoldUnary(
   CHECK_EQ(weights_input.type_,
            TFAttrs(node_def).get<tensorflow::DataType>("T"));
 
-  // Maybe I should do a switch
   LambdaFactory unary_op;
   if (node_def.op() == "Rsqrt") {
     // Compute rsqrt
     unary_op.op = LambdaFactory::OP_CATEGORY::RSQRT;
     auto ret = UnaryCompute(weights_input, &weights_output, unary_op);
-    // PAss the output
+    // Pass the output
     if (ret == tensorflow::Status::OK()) {
       outputs->push_back(TRT_TensorOrWeights(weights_output));
     }
@@ -631,11 +810,11 @@ tensorflow::Status ConstantFoldUnary(
 }
 
 // TODO(jie,ben) broadcast is needed yet not implemented
-// Let's get the simple stuff working first. Maybe we should fall bakc to TF
+// Let's get the simple stuff working first. Maybe we should fall back to TF
 //   approach for constant folding
 tensorflow::Status ConstantFoldBinary(
     Converter& ctx, const tensorflow::NodeDef& node_def,
-    std::vector<TRT_TensorOrWeights> const& inputs,
+    const std::vector<TRT_TensorOrWeights>& inputs,
     std::vector<TRT_TensorOrWeights>* outputs) {
   TRT_ShapedWeights weights_input_l = inputs.at(0).weights();
   TRT_ShapedWeights weights_input_r = inputs.at(1).weights();
@@ -648,12 +827,12 @@ tensorflow::Status ConstantFoldBinary(
         "Binary op implicit broadcast not supported: " + node_def.op());
 
   // TODO(jie): constant fold should really fall back to TF.
-  int nb_dims = weights_input_l.shape_.nbDims;
+  int num_dims = weights_input_l.shape_.nbDims;
   nvinfer1::Dims output_shape;
-  output_shape.nbDims = nb_dims;
-  VLOG(2) << "nb_dims: " << nb_dims
+  output_shape.nbDims = num_dims;
+  VLOG(2) << "num_dims: " << num_dims
           << ", the other: " << weights_input_r.shape_.nbDims;
-  for (int i = 0; i < nb_dims; i++) {
+  for (int i = 0; i < num_dims; i++) {
     if (weights_input_l.shape_.d[i] == weights_input_r.shape_.d[i]) {
       output_shape.d[i] = weights_input_l.shape_.d[i];
     } else if (weights_input_l.shape_.d[i] == 1 ||
@@ -678,7 +857,6 @@ tensorflow::Status ConstantFoldBinary(
   // Allocate output weights
   TRT_ShapedWeights weights_output = ctx.get_temp_weights(dtype, output_shape);
 
-  // Maybe I should do a switch
   LambdaFactory binary_op;
   if (node_def.op() == "Sub") {
     binary_op.op = LambdaFactory::OP_CATEGORY::SUB;
@@ -712,48 +890,94 @@ tensorflow::Status BinaryTensorOpWeight(
   // Maybe this part has to be moved into the block of rsqrt later
 
   // Check type consistency
-  auto dtype = TFAttrs(node_def).get<nvinfer1::DataType>("T");
-  CHECK_EQ_TYPE(tensor->getType(), dtype);  // Cast to int for error messages
   nvinfer1::DataType ttype;
-  TF_CHECK_OK(ConvertDType(weights.type_, &ttype));
-  CHECK_EQ_TYPE(ttype, dtype);  // Cast to int for error message
+  TF_RETURN_IF_ERROR(ConvertDType(weights.type_, &ttype));
 
   // Check scale mode
   auto dims_w = weights.shape_;
   auto dims_t = tensor->getDimensions();
 
-  // Default to channel-wise
+  // default to element-wise
   auto scale_mode = nvinfer1::ScaleMode::kELEMENTWISE;
 
+  // TODO(jie): maybe use a permutation instead to support more cases;
+  bool permutation_flag = false;
+
   if (weights.count() == 1) {
     VLOG(2) << "UNIFORM";
     scale_mode = nvinfer1::ScaleMode::kUNIFORM;
   } else {
-    // No broadcasting on Batch dimension;
-    assert(dims_w.d[0] == 1);
-
-    // Broadcasting on Channel dimension only allowed in kUNIFORM
-    assert(dims_w.d[1] == dims_t.d[0]);
-    assert(dims_w.nbDims == dims_t.nbDims);
-
-    // Default is element;
-    for (int i = 2; i < dims_w.nbDims; i++) {
-      if (dims_w.d[i] != dims_t.d[i - 1]) {
-        scale_mode = nvinfer1::ScaleMode::kCHANNEL;
-        break;
+    // no broadcasting on Batch dimension;
+    VLOG(2) << "WEIGHTS DIM: " << dims_w.nbDims
+            << " tensor DIM: " << dims_t.nbDims;
+    if (dims_w.nbDims == dims_t.nbDims + 1) {
+      if (dims_w.d[0] == 1) {
+        for (int i = 1; i < dims_w.nbDims; i++) {
+          dims_w.d[i - 1] = dims_w.d[i];
+        }
+        dims_w.nbDims--;
+      } else {
+        return tensorflow::errors::InvalidArgument(
+            "Binary op cannot operate on batch, " + node_def.name());
       }
     }
-    if (scale_mode == nvinfer1::ScaleMode::kELEMENTWISE) {
+
+    if (dims_w.nbDims == dims_t.nbDims && dims_w.d[0] == dims_t.d[0]) {
       scale_mode = nvinfer1::ScaleMode::kELEMENTWISE;
-      for (int i = 2; i < dims_w.nbDims; i++) {
-        if (dims_w.d[i] != 1)
-          return tensorflow::errors::InvalidArgument(
-              "Weight shape not compatible at, " + node_def.name());
+      // default is element;
+      for (int i = 1; i < dims_w.nbDims; i++) {
+        if (dims_w.d[i] != dims_t.d[i]) {
+          // if dimension does not match, switch back to channel;
+          VLOG(2) << "channel";
+          scale_mode = nvinfer1::ScaleMode::kCHANNEL;
+          break;
+        }
+      }
+      // if channel as candidate, validate it
+      if (scale_mode == nvinfer1::ScaleMode::kCHANNEL) {
+        for (int i = 1; i < dims_w.nbDims; i++) {
+          if (dims_w.d[i] != 1)
+            return tensorflow::errors::InvalidArgument(
+                "Weight shape not compatible at, " + node_def.name());
+        }
+      } else {
+        VLOG(2) << "elementwise";
+      }
+    } else if (dims_w.nbDims == 1 &&
+               dims_w.d[0] == dims_t.d[dims_t.nbDims - 1]) {
+      // channel wise and broadcast required;
+      permutation_flag = true;
+      scale_mode = nvinfer1::ScaleMode::kCHANNEL;
+    } else {
+      return tensorflow::errors::InvalidArgument(
+          "Weight shape not compatible at, " + node_def.name());
+    }
+  }
+
+  // transpose last dimension
+  std::vector<int> permutation(dims_t.nbDims + 1);
+  if (permutation_flag) {
+    if (scale_mode == nvinfer1::ScaleMode::kCHANNEL && dims_t.nbDims > 1) {
+      // we swap the last dimension into channel for trt.
+      // because of tensorflow default broadcasting rules.
+      for (int i = 0; i < static_cast<int>(permutation.size()); i++) {
+        permutation[i] = i;
       }
+      permutation[1] = dims_t.nbDims;
+      permutation[dims_t.nbDims] = 1;
+      tensor = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor),
+                                   permutation);
+    } else {
+      return tensorflow::errors::InvalidArgument(
+          "Transpose cannot be applied, " + node_def.name());
     }
   }
 
-  // Prepare weights
+  if (ctx.isFP16()) {
+    weights = ConvertFP32ToFP16(ctx, weights);
+  }
+
+  // prepare weights
   TRT_ShapedWeights shift_weights(weights.type_);
   TRT_ShapedWeights scale_weights(weights.type_);
   TRT_ShapedWeights power_weights(weights.type_);
@@ -779,88 +1003,24 @@ tensorflow::Status BinaryTensorOpWeight(
       scale_weights, power_weights);
 
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+  // transpose back dimension
+  if (permutation_flag) {
+    output_tensor = ctx.TransposeTensor(output_tensor, permutation);
+  }
 
   // Pass the output
   outputs->push_back(TRT_TensorOrWeights(output_tensor));
   return tensorflow::Status::OK();
 }
 
-tensorflow::Status BinaryTensorOpTensor(
-    Converter& ctx, const tensorflow::NodeDef& node_def,
-    const nvinfer1::ITensor* tensor_l, const nvinfer1::ITensor* tensor_r,
-    std::vector<TRT_TensorOrWeights>* outputs) {
-  static const std::unordered_map<string, nvinfer1::ElementWiseOperation> ops{
-      {"Add", nvinfer1::ElementWiseOperation::kSUM},
-      {"Mul", nvinfer1::ElementWiseOperation::kPROD},
-      // {"max", nvinfer1::ElementWiseOperation::kMAX},
-      // {"min", nvinfer1::ElementWiseOperation::kMIN},
-      {"Sub", nvinfer1::ElementWiseOperation::kSUB},
-      {"Div", nvinfer1::ElementWiseOperation::kDIV},
-  };
-
-  // FIXME assume type matches input weights
-  // Get trt type & shape
-  TFAttrs attrs(node_def);
-  // Maybe this part has to be moved into the block of rsqrt later
-  nvinfer1::DataType dtype = attrs.get<nvinfer1::DataType>("T");
-
-  // Check type consistency
-  CHECK_EQ_TYPE(tensor_l->getType(), dtype);
-  CHECK_EQ_TYPE(tensor_r->getType(), dtype);
-  auto op_pair = ops.find(node_def.op());
-  if (op_pair == ops.end())
-    return tensorflow::errors::Unimplemented("binary op: " + node_def.op() +
-                                             " not supported at: " +
-                                             node_def.name());
-
-  nvinfer1::IElementWiseLayer* layer = ctx.network()->addElementWise(
-      *const_cast<nvinfer1::ITensor*>(tensor_l),
-      *const_cast<nvinfer1::ITensor*>(tensor_r), op_pair->second);
-
-  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
-
-  // Pass the output
-  outputs->push_back(TRT_TensorOrWeights(output_tensor));
-  return tensorflow::Status::OK();
-}
+enum class ConvolutionType { DEFAULT, DEPTHWISE_CONV };
 
-tensorflow::Status ConvertPlaceholder(
+tensorflow::Status ConvertConv2DHelper(
     Converter& ctx, const tensorflow::NodeDef& node_def,
-    std::vector<TRT_TensorOrWeights> const& inputs,
-    std::vector<TRT_TensorOrWeights>* outputs) {
-  VLOG(2) << "Placeholder should have been replace already";
-  return tensorflow::errors::Unimplemented(", cannot convert Placeholder op");
-  // OK this make sense since we are supposed to replace it with input
-  TFAttrs attrs(node_def);
-  nvinfer1::DataType dtype = attrs.get<nvinfer1::DataType>("dtype");
-  nvinfer1::Dims dims = attrs.get<nvinfer1::Dims>("shape");
-
-  dims.nbDims--;
-  for (int i = 0; i < dims.nbDims; i++) dims.d[i] = dims.d[i + 1];
-
-  nvinfer1::ITensor* output =
-      ctx.network()->addInput(node_def.name().c_str(), dtype, dims);
-  if (!output) {
-    return tensorflow::errors::InvalidArgument("Failed to create Input layer");
-  }
-  outputs->push_back(TRT_TensorOrWeights(output));
-  return tensorflow::Status::OK();
-}
+    const std::vector<TRT_TensorOrWeights>& inputs,
+    std::vector<TRT_TensorOrWeights>* outputs, int group) {
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
 
-tensorflow::Status ConvertConv2D(Converter& ctx,
-                                 const tensorflow::NodeDef& node_def,
-                                 const std::vector<TRT_TensorOrWeights>& inputs,
-                                 std::vector<TRT_TensorOrWeights>* outputs) {
-  nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
-  // TODO(jie): handle NHWC/NCHW transpose;
-  TRT_ShapedWeights weights_rsck = inputs.at(1).weights();
-  TRT_ShapedWeights weights = ctx.get_temp_weights_like(weights_rsck);
-  ReorderRSCKToKCRS(weights_rsck, &weights);
-  TRT_ShapedWeights biases(weights.type_);
-  int noutput = weights.shape_.d[0];
-  nvinfer1::DimsHW kernel_size;
-  kernel_size.h() = weights.shape_.d[2];
-  kernel_size.w() = weights.shape_.d[3];
   TFAttrs attrs(node_def);
 
   int h_index = 2;
@@ -874,11 +1034,35 @@ tensorflow::Status ConvertConv2D(Converter& ctx,
     // TODO(jie): transpose it
   }
 
+  // tensor after transpose (NCHW)
+  auto tensor_dim = tensor->getDimensions();
+
+  int num_groups = group;
+  if (num_groups == 0)  // depthwise convolution
+    num_groups = tensor_dim.d[0];
+  VLOG(2) << "groups count: " << num_groups;
+
+  TRT_ShapedWeights weights_rsck = inputs.at(1).weights();
+  if (ctx.isFP16()) {
+    weights_rsck = ConvertFP32ToFP16(ctx, inputs.at(1).weights());
+  }
+
+  TRT_ShapedWeights weights = ctx.get_temp_weights_like(weights_rsck);
+  ReorderRSCKToKCRS(weights_rsck, &weights, num_groups);
+  TRT_ShapedWeights biases(weights.type_);
+  int noutput = weights.shape_.d[0] * num_groups;
+  nvinfer1::DimsHW kernel_size;
+  kernel_size.h() = weights.shape_.d[2];
+  kernel_size.w() = weights.shape_.d[3];
+  VLOG(2) << "kernel size: " << kernel_size.h() << ", " << kernel_size.w();
+
   // TODO(jie): stride. (NHWC/NCHW)
   auto tf_stride = attrs.get<std::vector<int>>("strides");
+  VLOG(2) << "h_index: " << h_index << ", w_index: " << w_index;
+  VLOG(2) << "stride: " << tf_stride[0] << ", " << tf_stride[1] << ", "
+          << tf_stride[2] << ", " << tf_stride[3];
   nvinfer1::DimsHW stride(tf_stride[h_index], tf_stride[w_index]);
 
-  auto tensor_dim = tensor->getDimensions();
   std::vector<std::pair<int, int>> padding;
   // TODO(jie): padding.
   if (attrs.get<string>("padding") == "SAME") {
@@ -919,10 +1103,11 @@ tensorflow::Status ConvertConv2D(Converter& ctx,
   layer->setStride(stride);
   layer->setPadding({padding[0].first, padding[1].first});
   layer->setName(node_def.name().c_str());
+  layer->setNbGroups(num_groups);
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
 
   auto dim_after = output_tensor->getDimensions();
-  VLOG(2) << "TENSOR out: " << dim_after.d[0] << ", " << dim_after.d[1]
+  VLOG(2) << "TENSOR out: " << dim_after.d[0] << ", " << dim_after.d[1] << ", "
           << dim_after.d[2] << ", " << dim_after.d[3];
 
   if (data_format == "NHWC") {
@@ -935,11 +1120,101 @@ tensorflow::Status ConvertConv2D(Converter& ctx,
   return tensorflow::Status::OK();
 }
 
+tensorflow::Status ConvertConv2DHelper(
+    Converter& ctx, const tensorflow::NodeDef& node_def,
+    const std::vector<TRT_TensorOrWeights>& inputs,
+    std::vector<TRT_TensorOrWeights>* outputs, ConvolutionType type) {
+  switch (type) {
+    case ConvolutionType::DEFAULT:
+      return ConvertConv2DHelper(ctx, node_def, inputs, outputs, 1);
+    case ConvolutionType::DEPTHWISE_CONV:
+      return ConvertConv2DHelper(ctx, node_def, inputs, outputs, 0);
+  }
+  return tensorflow::errors::Unimplemented("unsupported convolution type at, " +
+                                           node_def.name());
+}
+
+tensorflow::Status BinaryTensorOpTensor(
+    Converter& ctx, const tensorflow::NodeDef& node_def,
+    const nvinfer1::ITensor* tensor_l, const nvinfer1::ITensor* tensor_r,
+    std::vector<TRT_TensorOrWeights>* outputs) {
+  static const std::unordered_map<string, nvinfer1::ElementWiseOperation> ops{
+      {"Add", nvinfer1::ElementWiseOperation::kSUM},
+      {"Mul", nvinfer1::ElementWiseOperation::kPROD},
+      {"Sub", nvinfer1::ElementWiseOperation::kSUB},
+      {"Div", nvinfer1::ElementWiseOperation::kDIV},
+  };
+
+  // FIXME assume type matches input weights
+  // get trt type & shape
+  TFAttrs attrs(node_def);
+  // maybe this part has to be moved into the block of rsqrt later
+  nvinfer1::DataType dtype = attrs.get<nvinfer1::DataType>("T");
+
+  // check type consistency
+  CHECK_EQ_TYPE(tensor_l->getType(), dtype);
+  CHECK_EQ_TYPE(tensor_r->getType(), dtype);
+  auto op_pair = ops.find(node_def.op());
+  if (op_pair == ops.end())
+    return tensorflow::errors::Unimplemented(
+        "binary op: " + node_def.op() +
+        " not supported at: " + node_def.name());
+
+  nvinfer1::IElementWiseLayer* layer = ctx.network()->addElementWise(
+      *const_cast<nvinfer1::ITensor*>(tensor_l),
+      *const_cast<nvinfer1::ITensor*>(tensor_r), op_pair->second);
+
+  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+
+  // pass the output
+  outputs->push_back(TRT_TensorOrWeights(output_tensor));
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertPlaceholder(
+    Converter& ctx, const tensorflow::NodeDef& node_def,
+    const std::vector<TRT_TensorOrWeights>& inputs,
+    std::vector<TRT_TensorOrWeights>* outputs) {
+  VLOG(2) << "Placeholder should have been replaced already";
+  return tensorflow::errors::Unimplemented("cannot convert Placeholder op");
+  // OK, this makes sense since we are supposed to replace it with input
+  TFAttrs attrs(node_def);
+  nvinfer1::DataType dtype = attrs.get<nvinfer1::DataType>("dtype");
+  nvinfer1::Dims dims = attrs.get<nvinfer1::Dims>("shape");
+
+  dims.nbDims--;
+  for (int i = 0; i < dims.nbDims; i++) dims.d[i] = dims.d[i + 1];
+
+  nvinfer1::ITensor* output =
+      ctx.network()->addInput(node_def.name().c_str(), dtype, dims);
+  if (!output) {
+    return tensorflow::errors::InvalidArgument("Failed to create Input layer");
+  }
+  outputs->push_back(TRT_TensorOrWeights(output));
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertConv2D(Converter& ctx,
+                                 const tensorflow::NodeDef& node_def,
+                                 const std::vector<TRT_TensorOrWeights>& inputs,
+                                 std::vector<TRT_TensorOrWeights>* outputs) {
+  return ConvertConv2DHelper(ctx, node_def, inputs, outputs,
+                             ConvolutionType::DEFAULT);
+}
+
+tensorflow::Status ConvertConv2DDepthwise(
+    Converter& ctx, const tensorflow::NodeDef& node_def,
+    const std::vector<TRT_TensorOrWeights>& inputs,
+    std::vector<TRT_TensorOrWeights>* outputs) {
+  return ConvertConv2DHelper(ctx, node_def, inputs, outputs,
+                             ConvolutionType::DEPTHWISE_CONV);
+}
+
 tensorflow::Status ConvertPool(Converter& ctx,
                                const tensorflow::NodeDef& node_def,
-                               std::vector<TRT_TensorOrWeights> const& inputs,
+                               const std::vector<TRT_TensorOrWeights>& inputs,
                                std::vector<TRT_TensorOrWeights>* outputs) {
-  nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
   TFAttrs attrs(node_def);
 
   int h_index = 2;
@@ -957,6 +1232,8 @@ tensorflow::Status ConvertPool(Converter& ctx,
   // TODO(jie): support other pooling type
   if (node_def.op() == "MaxPool")
     type = nvinfer1::PoolingType::kMAX;
+  else if (node_def.op() == "AvgPool")
+    type = nvinfer1::PoolingType::kAVERAGE;
   else
     return tensorflow::errors::Unimplemented("Only supports Max pool");
 
@@ -1019,9 +1296,9 @@ tensorflow::Status ConvertPool(Converter& ctx,
 
 tensorflow::Status ConvertActivation(
     Converter& ctx, const tensorflow::NodeDef& node_def,
-    std::vector<TRT_TensorOrWeights> const& inputs,
+    const std::vector<TRT_TensorOrWeights>& inputs,
     std::vector<TRT_TensorOrWeights>* outputs) {
-  nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
   nvinfer1::IActivationLayer* layer = ctx.network()->addActivation(
       *const_cast<nvinfer1::ITensor*>(tensor), nvinfer1::ActivationType::kRELU);
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
@@ -1031,17 +1308,20 @@ tensorflow::Status ConvertActivation(
 
 tensorflow::Status ConvertScale(Converter& ctx,
                                 const tensorflow::NodeDef& node_def,
-                                std::vector<TRT_TensorOrWeights> const& inputs,
+                                const std::vector<TRT_TensorOrWeights>& inputs,
                                 std::vector<TRT_TensorOrWeights>* outputs) {
   if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
       !inputs.at(1).is_weights())
     return tensorflow::errors::Unimplemented(
         "Only supports tensor op weight for now, at " + node_def.name());
   // Implement tensor binaryOp weight [channel wise] for now;
-  nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
 
-  // TODO(jie): handle NHWC/NCHW transpose;
   TRT_ShapedWeights weights = inputs.at(1).weights();
+  if (ctx.isFP16()) {
+    weights = ConvertFP32ToFP16(ctx, inputs.at(1).weights());
+  }
+
   TRT_ShapedWeights empty_weights(weights.type_);
 
   TFAttrs attrs(node_def);
@@ -1055,12 +1335,29 @@ tensorflow::Status ConvertScale(Converter& ctx,
   } else {
     VLOG(2) << "NCHW !!!!";
   }
-  nvinfer1::IScaleLayer* layer = ctx.network()->addScale(
-      *const_cast<nvinfer1::ITensor*>(tensor), nvinfer1::ScaleMode::kCHANNEL,
-      weights, empty_weights, empty_weights);
 
-  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
-  if (data_format == "NHWC") {
+  auto dims = tensor->getDimensions();
+  VLOG(2) << "tensor dimensions: " << dims.nbDims;
+  for (int i = 0; i < dims.nbDims; i++) {
+    VLOG(2) << "i: " << dims.d[i];
+  }
+  dims = weights.shape_;
+  VLOG(2) << "weights dimensions: " << dims.nbDims;
+  for (int i = 0; i < dims.nbDims; i++) {
+    VLOG(2) << "i: " << dims.d[i];
+  }
+
+  nvinfer1::ScaleMode mode = nvinfer1::ScaleMode::kCHANNEL;
+  if (weights.shape_.d[0] == 1) {
+    mode = nvinfer1::ScaleMode::kUNIFORM;
+  }
+
+  nvinfer1::IScaleLayer* layer =
+      ctx.network()->addScale(*const_cast<nvinfer1::ITensor*>(tensor), mode,
+                              weights, empty_weights, empty_weights);
+
+  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+  if (data_format == "NHWC") {
     // TODO(jie): transpose it back!
     output_tensor = ctx.TransposeTensor(output_tensor, {0, 2, 3, 1});
   } else {
@@ -1072,7 +1369,7 @@ tensorflow::Status ConvertScale(Converter& ctx,
 
 tensorflow::Status ConvertConst(Converter& ctx,
                                 const tensorflow::NodeDef& node_def,
-                                std::vector<TRT_TensorOrWeights> const& inputs,
+                                const std::vector<TRT_TensorOrWeights>& inputs,
                                 std::vector<TRT_TensorOrWeights>* outputs) {
   const auto& weights_tensor = node_def.attr().at("value").tensor();
 
@@ -1091,22 +1388,96 @@ tensorflow::Status ConvertConst(Converter& ctx,
     VLOG(2) << "SCALAR!!!" << node_def.name();
     nvinfer1::Dims scalar_shape;
     if (tensor.dims() > 0) {
-      VLOG(2) << "Dimensions: " << tensor.dims();
-      weights = TRT_ShapedWeights(dtype, weights_tensor.float_val().data(),
-                                  GetTensorShape(tensor));
+      VLOG(2) << "dimensions: " << tensor.dims();
+      VLOG(2) << "size: " << weights_tensor.float_val_size();
+      scalar_shape = GetTensorShape(tensor);
+      for (int i = 0; i < scalar_shape.nbDims; i++)
+        VLOG(2) << scalar_shape.d[i];
+      if (GetShapeSize(scalar_shape) != weights_tensor.float_val_size()) {
+        if (weights_tensor.float_val_size() == 1 ||
+            scalar_shape.d[0] == weights_tensor.float_val_size()) {
+          scalar_shape.nbDims = 1;
+          // no dimension provided. flatten it
+          scalar_shape.d[0] = weights_tensor.float_val_size();
+          scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
+        } else {
+          LOG(WARNING) << "Broadcast on weights only supports kCHANNEL and"
+                       << " kUNIFORM, at: " << node_def.name();
+          string err_str("Broadcast method is not supported for '");
+          StrAppend(&err_str, node_def.name(), "' of type ", node_def.op());
+          return tensorflow::errors::InvalidArgument(err_str);
+        }
+      }
     } else {
       VLOG(2) << "Dimensions: " << tensor.dims();
       scalar_shape.nbDims = 1;
-      scalar_shape.d[0] = 1;
+      // no dimension provided. flatten it
+      scalar_shape.d[0] = weights_tensor.float_val_size();
+      scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
+      for (int i = 1; i < nvinfer1::Dims::MAX_DIMS; i++) {
+        scalar_shape.d[i] = 0;
+        scalar_shape.type[i] = nvinfer1::DimensionType::kSPATIAL;
+      }
+    }
+    size_t len_data = tensorflow::DataTypeSize(dtype);
+    for (int i = 0; i < scalar_shape.nbDims; i++) len_data *= scalar_shape.d[i];
+    ctx.weight_store()->store_.push_back(std::vector<uint8_t>(len_data));
+    void* dst = static_cast<void*>(&(ctx.weight_store()->store_.back()[0]));
+    std::vector<float> tensor_data(
+        weights_tensor.float_val().begin(),
+        weights_tensor.float_val()
+            .end());  //  make a local copy first to flatten
+    memcpy(dst, tensor_data.data(), len_data);  // store into weight store
+    weights = TRT_ShapedWeights(dtype, dst, scalar_shape);
+  } else if (!weights_tensor.int_val().empty()) {
+    VLOG(2) << "int!!!" << node_def.name();
+    nvinfer1::Dims scalar_shape;
+    if (tensor.dims() > 0) {
+      VLOG(2) << "dimensions: " << tensor.dims();
+      scalar_shape = GetTensorShape(tensor);
+      if (GetShapeSize(scalar_shape) != weights_tensor.int_val_size()) {
+        if (weights_tensor.int_val_size() == 1 ||
+            scalar_shape.d[0] == weights_tensor.int_val_size()) {
+          scalar_shape.nbDims = 1;
+          // no dimension provided. flatten it
+          scalar_shape.d[0] = weights_tensor.int_val_size();
+          scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
+        } else {
+          LOG(WARNING) << "Broadcast on weights only supports kCHANNEL and"
+                       << " kUNIFORM, at: " << node_def.name();
+          string err_str("Broadcast method is not supported for '");
+          StrAppend(&err_str, node_def.name(), "' of type ", node_def.op());
+          return tensorflow::errors::InvalidArgument(err_str);
+        }
+      }
+    } else {
+      VLOG(2) << "dimensions: " << tensor.dims();
+      scalar_shape.nbDims = 1;
+      // no dimension provided. flatten it
+      scalar_shape.d[0] = weights_tensor.int_val_size();
       scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
       for (int i = 1; i < nvinfer1::Dims::MAX_DIMS; i++) {
         scalar_shape.d[i] = 0;
         scalar_shape.type[i] = nvinfer1::DimensionType::kSPATIAL;
       }
-      weights = TRT_ShapedWeights(dtype, weights_tensor.float_val().data(),
-                                  scalar_shape);
     }
+    //  we should not have converted //if (ctx.isFP16()) {
+    size_t len_data = tensorflow::DataTypeSize(dtype);
+    for (int i = 0; i < scalar_shape.nbDims; i++) len_data *= scalar_shape.d[i];
+    size_t len_tensor = weights_tensor.int_val_size() * sizeof(int32);
+    len_data = std::max(len_data, len_tensor);
+    ctx.weight_store()->store_.push_back(std::vector<uint8_t>(len_data));
+    void* dst = static_cast<void*>(&(ctx.weight_store()->store_.back()[0]));
+    std::vector<int32> tensor_data(
+        weights_tensor.int_val().begin(),
+        weights_tensor.int_val().end());  //  make a local copy first to flatten
+                                          //  doesn't have to be contiguous
+    memcpy(dst, tensor_data.data(), len_tensor);  // store into weight store
+    weights = TRT_ShapedWeights(dtype, dst, scalar_shape);
   } else if (!weights_tensor.tensor_content().empty()) {
+    //  obsolete method.
+    //  After optimization path, we do not see weights in this format.
+    //  fp16 conversion technically should be needed here.
     VLOG(2) << "TENSOR!!!" << node_def.name();
     const auto& content = weights_tensor.tensor_content();
 
@@ -1130,7 +1501,7 @@ tensorflow::Status ConvertConst(Converter& ctx,
 
 tensorflow::Status ConvertIdentity(
     Converter& ctx, const tensorflow::NodeDef& node_def,
-    std::vector<TRT_TensorOrWeights> const& inputs,
+    const std::vector<TRT_TensorOrWeights>& inputs,
     std::vector<TRT_TensorOrWeights>* outputs) {
   outputs->push_back(inputs.at(0));
   return tensorflow::Status::OK();
@@ -1138,7 +1509,7 @@ tensorflow::Status ConvertIdentity(
 
 tensorflow::Status ConvertBinary(Converter& ctx,
                                  const tensorflow::NodeDef& node_def,
-                                 std::vector<TRT_TensorOrWeights> const& inputs,
+                                 const std::vector<TRT_TensorOrWeights>& inputs,
                                  std::vector<TRT_TensorOrWeights>* outputs) {
   if (inputs.size() != 2)
     return tensorflow::errors::FailedPrecondition(
@@ -1165,7 +1536,7 @@ tensorflow::Status ConvertBinary(Converter& ctx,
 
 tensorflow::Status ConvertUnary(Converter& ctx,
                                 const tensorflow::NodeDef& node_def,
-                                std::vector<TRT_TensorOrWeights> const& inputs,
+                                const std::vector<TRT_TensorOrWeights>& inputs,
                                 std::vector<TRT_TensorOrWeights>* outputs) {
   if (inputs.size() != 1)
     return tensorflow::errors::FailedPrecondition(
@@ -1183,7 +1554,7 @@ tensorflow::Status ConvertUnary(Converter& ctx,
 
 tensorflow::Status ConvertReduce(Converter& ctx,
                                  const tensorflow::NodeDef& node_def,
-                                 std::vector<TRT_TensorOrWeights> const& inputs,
+                                 const std::vector<TRT_TensorOrWeights>& inputs,
                                  std::vector<TRT_TensorOrWeights>* outputs) {
   if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
       !inputs.at(1).is_weights())
@@ -1191,7 +1562,7 @@ tensorflow::Status ConvertReduce(Converter& ctx,
         "Input expects tensor and weights, at" + node_def.name());
 
   // Implement tensor binaryOp weight [channel wise] for now;
-  nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
   auto dims = tensor->getDimensions();
   // Restore implicit batch dimension
   int nb_dims = dims.nbDims + 1;
@@ -1229,6 +1600,7 @@ tensorflow::Status ConvertReduce(Converter& ctx,
       return tensorflow::errors::InvalidArgument("TRT cannot reduce at 0, at" +
                                                  node_def.name());
     if (index_list_data[i] == 1) permuted_index = 1;
+
     idx_set.emplace(index_list_data[i]);
   }
 
@@ -1236,7 +1608,7 @@ tensorflow::Status ConvertReduce(Converter& ctx,
   nvinfer1::DimsHW pool_kernel;
   if (permuted_index == 1) {
     for (int i = 2; i < nb_dims; i++) {
-      if (idx_set.count(i)) {
+      if (idx_set.count(i) == 0) {
         permuted_index = i;
         break;
       }
@@ -1271,12 +1643,13 @@ tensorflow::Status ConvertReduce(Converter& ctx,
     output_tensor = ctx.TransposeTensor(
         const_cast<nvinfer1::ITensor*>(output_tensor), permutation_order);
   }
+  outputs->push_back(TRT_TensorOrWeights(output_tensor));
   return tensorflow::Status::OK();
 }
 
 tensorflow::Status ConvertPad(Converter& ctx,
                               const tensorflow::NodeDef& node_def,
-                              std::vector<TRT_TensorOrWeights> const& inputs,
+                              const std::vector<TRT_TensorOrWeights>& inputs,
                               std::vector<TRT_TensorOrWeights>* outputs) {
   if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
       !inputs.at(1).is_weights())
@@ -1284,7 +1657,7 @@ tensorflow::Status ConvertPad(Converter& ctx,
         "Input expects tensor and weights, at" + node_def.name());
 
   // Implement tensor binaryOp weight [channel wise] for now;
-  nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
   auto dims = tensor->getDimensions();
   // Restore implicit batch dimension
   int nb_dims = dims.nbDims + 1;
@@ -1371,19 +1744,318 @@ tensorflow::Status ConvertPad(Converter& ctx,
   return tensorflow::Status::OK();
 }
 
+tensorflow::Status ConvertConcat(Converter& ctx,
+                                 const tensorflow::NodeDef& node_def,
+                                 const std::vector<TRT_TensorOrWeights>& inputs,
+                                 std::vector<TRT_TensorOrWeights>* outputs) {
+  // not including the last input (axis) here
+  int input_size = static_cast<int>(inputs.size()) - 1;
+
+  if (!inputs.at(0).is_tensor())
+    return tensorflow::errors::InvalidArgument(
+        "Concat in TRT supports only tensor input, at " + node_def.name());
+
+  // We are retrieving the axis
+  TRT_ShapedWeights axis = inputs.at(input_size).weights();
+
+  TFAttrs attrs(node_def);
+  auto index_type = attrs.get<tensorflow::DataType>("Tidx");
+
+  // TODO(jie): handle data type
+  // Only expect to handle INT32 as index attributes for now
+  if (index_type != tensorflow::DataType::DT_INT32)
+    return tensorflow::errors::Unimplemented(
+        "Tidx supports only DT_INT32, at " + node_def.name());
+
+  int index = *(static_cast<int*>(const_cast<void*>(axis.GetValues())));
+
+  // TODO(jie): early termination with no-op (attr_size==1)
+
+  auto dim = inputs.at(0).tensor()->getDimensions();
+  // dimension check
+  if (index > dim.nbDims + 1)
+    return tensorflow::errors::InvalidArgument(
+        "Concatenate on axis out of dimension range, at " + node_def.name());
+
+  if (index == 0)
+    return tensorflow::errors::InvalidArgument(
+        "Concatenate on batch dimension not supported, at " + node_def.name());
+
+  // in case we need a permutation
+  std::vector<int> permutation_order(dim.nbDims + 1);
+
+  for (int i = 0; i < dim.nbDims + 1; i++) permutation_order[i] = i;
+
+  if (index != 1) {
+    permutation_order[1] = index - 1;
+    permutation_order[index - 1] = 1;
+  }
+
+  std::vector<nvinfer1::ITensor const*> inputs_vec;
+  // Shape check (all input tensors should have the same shape)
+  // starting from 0 since we are probably also doing transpose here;
+  for (int i = 0; i < input_size; i++) {
+    auto tensor_i = inputs.at(i).tensor();
+    auto dim_i = tensor_i->getDimensions();
+    if (dim_i.nbDims != dim.nbDims)
+      return tensorflow::errors::InvalidArgument(
+          "Concatenate receives inputs with inconsistent dimensions, at " +
+          node_def.name());
+
+    for (int j = 0; j < dim.nbDims; j++) {
+      // check dimension consistency on non-concatenate axis
+      if (j != index - 1 && dim_i.d[j] != dim.d[j])
+        return tensorflow::errors::InvalidArgument(
+            "Concatenate receives inputs with inconsistent shape, at " +
+            node_def.name());
+    }
+
+    // TRT does concatenation only on channel!
+    if (index != 1)
+      tensor_i = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor_i),
+                                     permutation_order);
+
+    inputs_vec.push_back(tensor_i);
+  }
+
+  // nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+  nvinfer1::IConcatenationLayer* layer = ctx.network()->addConcatenation(
+      const_cast<nvinfer1::ITensor* const*>(inputs_vec.data()),
+      inputs_vec.size());
+  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+
+  if (index != 1) {
+    output_tensor = ctx.TransposeTensor(output_tensor, permutation_order);
+  }
+  outputs->push_back(TRT_TensorOrWeights(output_tensor));
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertFusedBatchNorm(
+    Converter& ctx, const tensorflow::NodeDef& node_def,
+    const std::vector<TRT_TensorOrWeights>& inputs,
+    std::vector<TRT_TensorOrWeights>* outputs) {
+  TFAttrs attrs(node_def);
+  float epsilon = attrs.get<float>("epsilon");
+  auto data_format = attrs.get<string>("data_format");
+  if (data_format != "NCHW") {
+    return tensorflow::errors::Unimplemented(
+        "only data_format=NCHW is supported, at " + node_def.name());
+  }
+  bool is_training = attrs.get<bool>("is_training");
+  if (is_training) {
+    return tensorflow::errors::Unimplemented(
+        "only is_training=false is supported, at " + node_def.name());
+  }
+  nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+
+  //  Check parameter types
+  auto parameter_type = inputs.at(1).weights().type_;
+  if ((parameter_type != tensorflow::DataType::DT_FLOAT) &&
+      (parameter_type != tensorflow::DataType::DT_HALF)) {
+    return tensorflow::errors::Unimplemented(
+        "only float32 or float16 weight data type is supported, for node " +
+        node_def.name() + " got " + tensorflow::DataTypeString(parameter_type));
+  }
+  for (int i = 1; i < 5; i++) {
+    if (inputs.at(i).weights().type_ != parameter_type) {
+      return tensorflow::errors::Unimplemented(
+          "Inconsistent parameter type for batchnorm is not supported, at: " +
+          node_def.name());
+    }
+  }
+
+  TRT_ShapedWeights dummy_power_weights(parameter_type);
+  size_t nweight = 0;
+  for (int i = 1; i < 5; i++) {
+    nweight = std::max(nweight, (size_t)inputs.at(i).weights().count());
+  }
+  TRT_ShapedWeights* ptr_shape_weights = nullptr;
+  for (int i = 1; i < 5; i++) {
+    if (inputs.at(i).weights().count() == nweight) {
+      ptr_shape_weights =
+          const_cast<TRT_ShapedWeights*>(&(inputs.at(i).weights()));
+    } else if (inputs.at(i).weights().count() != 1) {
+      return tensorflow::errors::InvalidArgument(
+          "Inconsistent batchnorm parameter count, at: " + node_def.name());
+    }
+  }
+  //  We could technically have two weights with different shapes,
+  //  but that would require two addScale ops, arguably less performant.
+  TRT_ShapedWeights combined_scale_weights =
+      ctx.get_temp_weights_like(*ptr_shape_weights);
+  TRT_ShapedWeights combined_offset_weights =
+      ctx.get_temp_weights_like(*ptr_shape_weights);
+
+  const Eigen::half* cast_vals_array[4];
+  const float* vals_array[4];
+  for (int j = 0; j < 4; j++) {
+    cast_vals_array[j] =
+        static_cast<Eigen::half const*>(inputs.at(j + 1).weights().GetValues());
+    vals_array[j] =
+        static_cast<float const*>(inputs.at(j + 1).weights().GetValues());
+  }
+  Eigen::half* cast_combined_scale_vals = const_cast<Eigen::half*>(
+      static_cast<Eigen::half const*>(combined_scale_weights.GetValues()));
+  Eigen::half* cast_combined_offset_vals = const_cast<Eigen::half*>(
+      static_cast<Eigen::half const*>(combined_offset_weights.GetValues()));
+  float* combined_scale_vals = const_cast<float*>(
+      static_cast<float const*>(combined_scale_weights.GetValues()));
+  float* combined_offset_vals = const_cast<float*>(
+      static_cast<float const*>(combined_offset_weights.GetValues()));
+
+  for (size_t i = 0; i < nweight; ++i) {
+    float batchnorm_data[4];
+    for (int j = 0; j < 4; j++) {
+      if (inputs.at(j + 1).weights().count() != 1) {
+        if (parameter_type == tensorflow::DT_FLOAT) {
+          batchnorm_data[j] = vals_array[j][i];
+        } else if (parameter_type == tensorflow::DT_HALF) {
+          batchnorm_data[j] =
+              Eigen::half_impl::half_to_float(cast_vals_array[j][i]);
+        }
+      } else {
+        if (parameter_type == tensorflow::DT_FLOAT) {
+          batchnorm_data[j] = vals_array[j][0];
+        } else if (parameter_type == tensorflow::DT_HALF) {
+          batchnorm_data[j] =
+              Eigen::half_impl::half_to_float(cast_vals_array[j][0]);
+        }
+      }
+    }
+    float scale = batchnorm_data[0];
+    float offset = batchnorm_data[1];
+    float mean = batchnorm_data[2];
+    float variance = batchnorm_data[3];
+    float combined_scale_val = scale / sqrtf(variance + epsilon);
+    float combined_offset_val = offset - mean * combined_scale_val;
+    if (parameter_type == tensorflow::DT_FLOAT) {
+      combined_scale_vals[i] = combined_scale_val;
+      combined_offset_vals[i] = combined_offset_val;
+    } else if (parameter_type == tensorflow::DT_HALF) {
+      cast_combined_scale_vals[i] = Eigen::half(combined_scale_val);
+      cast_combined_offset_vals[i] = Eigen::half(combined_offset_val);
+    }
+  }
+
+  nvinfer1::ScaleMode mode = nweight == 1 ? nvinfer1::ScaleMode::kUNIFORM
+                                          : nvinfer1::ScaleMode::kCHANNEL;
+  nvinfer1::IScaleLayer* layer =
+      ctx.network()->addScale(*const_cast<nvinfer1::ITensor*>(tensor), mode,
+                              combined_offset_weights.GetWeightsForTRT(),
+                              combined_scale_weights.GetWeightsForTRT(),
+                              dummy_power_weights.GetWeightsForTRT());
+  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+  outputs->push_back(TRT_TensorOrWeights(output_tensor));
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertMatMul(Converter& ctx,
+                                 const tensorflow::NodeDef& node_def,
+                                 const std::vector<TRT_TensorOrWeights>& inputs,
+                                 std::vector<TRT_TensorOrWeights>* outputs) {
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
+
+  // TODO(jie): transpose!
+  TFAttrs attrs(node_def);
+
+  TRT_ShapedWeights weights_ck = inputs.at(1).weights();
+  TRT_ShapedWeights weights = ctx.get_temp_weights_like(weights_ck);
+  ReorderCKtoKC(weights_ck, &weights);
+  TRT_ShapedWeights biases(weights.type_);
+
+  int noutput = weights.shape_.d[0];
+
+  nvinfer1::IFullyConnectedLayer* layer = ctx.network()->addFullyConnected(
+      *const_cast<nvinfer1::ITensor*>(tensor), noutput, weights, biases);
+
+  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+  outputs->push_back(TRT_TensorOrWeights(output_tensor));
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertReshape(
+    Converter& ctx, const tensorflow::NodeDef& node_def,
+    const std::vector<TRT_TensorOrWeights>& inputs,
+    std::vector<TRT_TensorOrWeights>* outputs) {
+  if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
+      !inputs.at(1).is_weights())
+    return tensorflow::errors::InvalidArgument(
+        "Input expects tensor and weights, at " + node_def.name());
+
+  // Input 0 is the tensor to reshape; input 1 holds the new shape.
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
+  auto dims = tensor->getDimensions();
+  // restore implicit batch dimension
+
+  TRT_ShapedWeights shape = inputs.at(1).weights();
+
+  TFAttrs attrs(node_def);
+
+  auto padding_type = attrs.get<tensorflow::DataType>("Tshape");
+
+  if (shape.shape_.nbDims != 1)
+    return tensorflow::errors::InvalidArgument(
+        "reshape new shape is not 1 dimensional, at " + node_def.name());
+
+  // Only expect to handle INT32 as attributes for now
+  if (padding_type != tensorflow::DataType::DT_INT32)
+    return tensorflow::errors::Unimplemented(
+        "reshape new shape supports only DT_INT32, at " + node_def.name());
+
+  auto shape_data = static_cast<int*>(const_cast<void*>(shape.GetValues()));
+
+  if (shape_data[0] != -1)
+    return tensorflow::errors::InvalidArgument(
+        "reshape new shape first dimension is not -1, at " + node_def.name());
+
+  auto shape_num_dims = shape.shape_.d[0];
+  VLOG(2) << "shape dimensions: " << shape_num_dims;
+  int volume_w = 1;
+  for (int i = 1; i < shape.shape_.d[0]; i++) volume_w *= shape_data[i];
+
+  int volume_t = 1;
+  for (int i = 0; i < dims.nbDims; i++) volume_t *= dims.d[i];
+
+  VLOG(2) << "volume: " << volume_t << " volume weights: " << volume_w;
+  if (volume_w != volume_t)
+    return tensorflow::errors::InvalidArgument(
+        "volume does not agree between tensor and new shape, at " +
+        node_def.name());
+
+  nvinfer1::IShuffleLayer* layer =
+      ctx.network()->addShuffle(*const_cast<nvinfer1::ITensor*>(tensor));
+
+  nvinfer1::Dims reshape_dims;
+  VLOG(2) << "new dimension: " << shape_num_dims - 1;
+  reshape_dims.nbDims = shape_num_dims - 1;
+  for (int32_t i = 0; i < reshape_dims.nbDims; ++i) {
+    reshape_dims.d[i] = shape_data[i + 1];
+  }
+  layer->setReshapeDimensions(reshape_dims);
+  VLOG(2) << "new dimension: " << shape_num_dims - 1;
+
+  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+  auto dims_output = output_tensor->getDimensions();
+  VLOG(2) << "output tensor dimension: " << dims_output.nbDims;
+  outputs->push_back(TRT_TensorOrWeights(output_tensor));
+  return tensorflow::Status::OK();
+}
+
 void Converter::register_op_converters() {
   // vgg_16 slim implementation
   op_registry_["Placeholder"] = ConvertPlaceholder;
   op_registry_["Conv2D"] = ConvertConv2D;
+  op_registry_["DepthwiseConv2dNative"] = ConvertConv2DDepthwise;
   op_registry_["Relu"] = ConvertActivation;
   op_registry_["MaxPool"] = ConvertPool;
+  op_registry_["AvgPool"] = ConvertPool;
   // This could be really handled as ConvertBinary
   op_registry_["BiasAdd"] = ConvertScale;
   op_registry_["Const"] = ConvertConst;
-  // op_registry_["MatMul"] = ConvertFullyConnected;  // Not used in vgg
   // TODO(ben,jie): this is a temp hack.
   op_registry_["Identity"] = ConvertIdentity;  // Identity should be removed
-  // op_registry_["AvgPool"] = ConvertPool;
+  op_registry_["Snapshot"] = ConvertIdentity;  // Snapshot should be removed
 
   // resnet_50_v1 slim implementation
   op_registry_["Add"] = ConvertBinary;
@@ -1393,26 +2065,373 @@ void Converter::register_op_converters() {
   op_registry_["Mean"] = ConvertReduce;
   op_registry_["Pad"] = ConvertPad;
   // TODO(ben,jie): Add more ops
+
+  op_registry_["ConcatV2"] = ConvertConcat;
+  op_registry_["MatMul"] = ConvertMatMul;
+  op_registry_["Reshape"] = ConvertReshape;
+  op_registry_["FusedBatchNorm"] = ConvertFusedBatchNorm;
+  op_registry_["FusedBatchNormV2"] = ConvertFusedBatchNorm;
 }
 
 }  // namespace
+tensorflow::Status GetTensorRTGraph(tensorrt::convert::SubGraphParams& s) {
+  return tensorflow::errors::Unimplemented("Not implemented yet");
+}
+tensorflow::Status ConvertCalibrationNodeToEngineNode(
+    tensorflow::Graph& graph, tensorflow::Node* c_node) {
+  const auto ndef = c_node->def();
+
+  TFAttrs attrs(ndef);
+  std::vector<string> segment_nodes(
+      attrs.get<std::vector<string>>("segment_nodes"));
+  std::vector<string> output_nodes(
+      attrs.get<std::vector<string>>("segment_output_names"));
+  std::vector<string> input_names(
+      attrs.get<std::vector<string>>("input_names"));
+  string res_name = attrs.get<string>("resource_name");
+  VLOG(1) << "Node name " << c_node->name() << " res_name " << res_name;
+  string engine_name = "my_trt_op";
+  {
+    const auto node_id = tensorflow::str_util::Split(res_name, "_");
+    engine_name += node_id.back();
+  }
+  std::map<string, tensorflow::Node*> node_maps;
+
+  for (auto n : graph.op_nodes()) {
+    node_maps.insert({n->name(), n});
+  }
+  VLOG(1) << "Output Nodes:";
+  std::vector<tensorflow::DataType> out_types;
+  std::vector<const tensorflow::Edge*> out_edges;
+  for (auto& i : output_nodes) {
+    auto node_port = tensorflow::str_util::Split(i, ":");
+    VLOG(1) << " " << i << " in graph " << node_maps.count(i);
+    auto out_node_name = node_port.at(0);
+    if (node_port.size() > 1) {
+      VLOG(1) << "Multi port output " << node_port.at(0) << " "
+              << node_port.at(1) << " size=" << node_port.size();
+    }
+    auto node_it = node_maps.find(out_node_name);
+    if (node_it != node_maps.end()) {
+      tensorflow::Node* out_node = node_it->second;
+      int port = 0;
+      if (node_port.size() == 2) {
+        port = std::strtoul(node_port.at(1).c_str(), nullptr, 10);
+        out_types.push_back(out_node->output_type(port));
+      } else {
+        out_types.push_back(out_node->output_type(0));
+      }
+      for (auto out_edge : out_node->out_edges()) {
+        if (out_edge->src_output() == port) {
+          out_edges.push_back(out_edge);
+          break;
+        }
+      }
+    } else {
+      LOG(WARNING) << " couldn't find output node " << out_node_name;
+    }
+  }
+  VLOG(1) << "Input Nodes:";
+  for (auto& i : input_names) {
+    VLOG(1) << " " << i << " in graph " << node_maps.count(i);
+  }
+  auto trt_rm = tensorflow::tensorrt::TRTResourceManager::instance();
+  auto resmgr = trt_rm->getManager("TRTCalibOps");
+  tensorflow::tensorrt::TRTCalibrationResource* calib_res = nullptr;
+  auto status = resmgr->Lookup(res_name, res_name, &calib_res);
+  if (!status.ok() || !calib_res->calibrator_) {
+    return tensorflow::errors::FailedPrecondition(
+        "You must run calibration"
+        " and inference conversion in the same process");
+  }
+
+  calib_res->calibrator_->setDone();
+  calib_res->thr_->join();
+  delete calib_res->thr_;
+  if (!calib_res->engine_) {
+    LOG(ERROR) << "Calibration failed: engine does not exist. Did you run "
+                  "calibration graph?";
+    return tensorflow::errors::FailedPrecondition(
+        "Calibration graph needs to be executed on"
+        " calibration data before conversion to inference graph");
+  }
+  auto weight_rmgr = trt_rm->getManager("WeightStore");
+  TF_CHECK_OK(weight_rmgr->Delete<tensorflow::tensorrt::TRTWeightStore>(
+      res_name, res_name));
+  auto engine_plan = calib_res->engine_->serialize();
+  calib_res->engine_->destroy();
+  calib_res->network_->destroy();
+  calib_res->builder_->destroy();
+  calib_res->thr_ = nullptr;
+  calib_res->engine_ = nullptr;
+  calib_res->builder_ = nullptr;
+  tensorflow::NodeDefBuilder op_builder(engine_name, "TRTEngineOp");
+  std::vector<tensorflow::NodeDefBuilder::NodeOut> income_edges;
+  for (const auto in_edge : c_node->in_edges()) {
+    auto src = in_edge->src();
+    int dest_port = in_edge->dst_input();
+    income_edges.emplace_back(src->name(), in_edge->src_output(),
+                              c_node->input_type(dest_port));
+  }
+  tensorflow::gtl::ArraySlice<tensorflow::NodeDefBuilder::NodeOut> input_list(
+      income_edges);
+  op_builder.Input(input_list);
+  tensorflow::NodeDef engine_node;
+  const char* engine_plan_data = static_cast<const char*>(engine_plan->data());
+  string engine_plan_string(engine_plan_data,
+                            engine_plan_data + engine_plan->size());
+  status = op_builder.Attr("serialized_engine", engine_plan_string)
+               .Attr("input_nodes", input_names)
+               .Attr("output_nodes", output_nodes)
+               .Attr("OutT", out_types)
+               .Finalize(&engine_node);
+  if (!status.ok()) {
+    LOG(ERROR) << "Engine Node creation failed";
+    return status;
+  }
+  auto trt_engine_node = graph.AddNode(engine_node, &status);
+  TF_RETURN_IF_ERROR(status);
+  for (size_t i = 0; i < out_edges.size(); i++) {
+    VLOG(1) << "Connecting trt_engine_node output " << i << " with "
+            << out_edges.at(i)->dst()->name() << " port "
+            << out_edges.at(i)->dst_input();
+    TF_RETURN_IF_ERROR(graph.UpdateEdge(trt_engine_node, i,
+                                        out_edges.at(i)->dst(),
+                                        out_edges.at(i)->dst_input()));
+  }
+  VLOG(1) << "Segment nodes:";
+  for (auto& i : segment_nodes) {
+    VLOG(1) << " " << i << " in graph " << node_maps.count(i);
+    auto it = node_maps.find(i);
+    if (it != node_maps.end()) {
+      graph.RemoveNode(it->second);
+    }
+  }
+  graph.RemoveNode(c_node);
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status InjectCalibrationNode(tensorrt::convert::SubGraphParams& s) {
+  // Visit nodes in reverse topological order and construct the TRT network.
+
+  // Toposort
+  std::vector<tensorflow::Node*> order_vec;
+  tensorflow::GetPostOrder(s.graph, &order_vec);
+  // Select just the subgraph
+  std::list<tensorflow::Node*> order;
+  for (tensorflow::Node* node : order_vec) {
+    if (s.subgraph_node_ids.count(node->id())) {
+      order.push_front(node);  // we want topological order to construct the
+      // network layer by layer
+    }
+  }
+  // topological order is needed to build TRT network
+  static int static_id = 0;
+  string subgraph_name_scope;
+  if (!order.empty()) {
+    subgraph_name_scope = order.front()->name();
+  }
+  for (const tensorflow::Node* node : order) {
+    subgraph_name_scope = GetCommonNameScope(subgraph_name_scope, node->name());
+  }
+  // TODO(sami,ben,jie): proper naming!
+  string calib_op_name =
+      StrCat(subgraph_name_scope, "my_trt_calib_op_", static_id);
+  string engine_name = StrCat(subgraph_name_scope, "my_trt_op", static_id);
+  static_id++;
+  auto trt_rmgr = tensorflow::tensorrt::TRTResourceManager::instance();
+  auto op_rmgr = trt_rmgr->getManager("TRTCalibOps");
+  auto op_res = new tensorflow::tensorrt::TRTCalibrationResource();
+  TF_CHECK_OK(op_rmgr->Create(calib_op_name, calib_op_name, op_res));
+  op_res->logger_ = new tensorflow::tensorrt::Logger();
+  op_res->builder_ = nvinfer1::createInferBuilder(*(op_res->logger_));
+
+  if (!op_res->builder_) {
+    return tensorflow::errors::Internal(
+        "failed to create TensorRT builder object");
+  }
+
+  op_res->network_ = op_res->builder_->createNetwork();
+  if (!op_res->network_) {
+    return tensorflow::errors::Internal(
+        "failed to create TensorRT network object");
+  }
+
+  // Build the network
+  auto weight_rmgr = trt_rmgr->getManager("WeightStore");
+  auto ws = new tensorflow::tensorrt::TRTWeightStore();
+  TF_CHECK_OK(weight_rmgr->Create(calib_op_name, calib_op_name, ws));
+  Converter converter(op_res->network_, ws, s.precision_mode == FP16MODE);
+  std::vector<string> input_names;
+  std::vector<tensorflow::DataType> input_dtypes;
+  for (const std::pair<int, int>& input : s.input_inds) {
+    VLOG(2) << "parsing input. Node id= " << input.first;
+    int node_id = input.first;
+    int output_idx = input.second;
+    tensorflow::Node* node = s.graph.FindNodeId(node_id);
+    auto node_name = node->name();
+    input_names.push_back(node_name);  // insert original node name without port
+    // TODO(jie): alternative :)
+    if (!s.graph_properties.HasOutputProperties(node_name))
+      return tensorflow::errors::Internal("failed to find input node: " +
+                                          node_name);
+
+    auto op_info_vec = s.graph_properties.GetOutputProperties(node_name);
+    if (static_cast<int>(op_info_vec.size()) <= output_idx)
+      return tensorflow::errors::Internal(
+          "accessing output index of: ", output_idx, ", at node: ", node_name,
+          " with output entry from shape_map: ", op_info_vec.size());
+
+    auto op_info = op_info_vec.at(output_idx);
+
+    tensorflow::DataType tf_dtype = op_info.dtype();
+    input_dtypes.push_back(tf_dtype);
+
+    nvinfer1::DataType dtype(nvinfer1::DataType::kFLOAT);
+    auto type_status = ConvertDType(tf_dtype, &dtype);
+    if (type_status != tensorflow::Status::OK()) {
+      LOG(WARNING) << "Data type conversion for input '" << node_name
+                   << "' failed";
+      return type_status;
+    }
+    TF_CHECK_OK(ConvertDType(tf_dtype, &dtype));
+
+    VLOG(2) << "accessing output index of: " << output_idx
+            << ", at node: " << node_name
+            << " with output entry from shape_map: " << op_info_vec.size();
+
+    // TODO(ben,jie): update TRT input format/dimension
+    nvinfer1::DimsCHW input_dim_psuedo_chw;
+    for (int i = 0; i < 3; i++) input_dim_psuedo_chw.d[i] = 1;
+
+    for (int i = 1; i < op_info.shape().dim_size(); i++) {
+      VLOG(2) << "dimension: " << i
+              << " , size: " << op_info.shape().dim(i).size();
+      input_dim_psuedo_chw.d[i - 1] = op_info.shape().dim(i).size();
+    }
+
+    // TODO(ben,jie): proper way to restore input tensor name?
+    auto input_tensor_name = node_name;
+    if (output_idx != 0) input_tensor_name = StrCat(node_name, ":", output_idx);
+
+    nvinfer1::ITensor* input_tensor = converter.network()->addInput(
+        input_tensor_name.c_str(), dtype, input_dim_psuedo_chw);
+
+    if (!input_tensor)
+      return tensorflow::errors::InvalidArgument(
+          "Failed to create Input layer");
+    VLOG(2) << "input tensor name: " << input_tensor_name;
+
+    if (!converter.insert_input_tensor(input_tensor_name, input_tensor))
+      return tensorflow::errors::AlreadyExists(
+          "input tensor already exists for op: " + input_tensor_name);
+  }
+
+  VLOG(2) << "finished sorting";
+
+  for (const tensorflow::Node* node : order) {
+    const tensorflow::NodeDef& node_def = node->def();
+    VLOG(2) << "converting node: " << node_def.name() << " , " << node_def.op();
+    TF_RETURN_IF_ERROR(converter.convert_node(node_def));
+  }
+
+  VLOG(2) << "finished conversion";
+
+  // Gather output metadata
+  std::vector<string> output_names;
+  std::vector<tensorflow::DataType> output_dtypes;
+  int trt_engine_op_output_idx = 0;
+  for (const std::pair<int, int>& output : s.output_inds) {
+    int node_id = output.first;
+    int output_idx = output.second;
+    tensorflow::Node* node = s.graph.FindNodeId(node_id);
+    string op_name = node->name();
+    string tensor_name = op_name;
+
+    s.output_edge_map->insert(
+        {trt_engine_op_output_idx == 0
+             ? engine_name
+             : StrCat(engine_name, ":", trt_engine_op_output_idx),
+         {output_idx, tensor_name}});
+    trt_engine_op_output_idx++;
+    if (output_idx != 0) {
+      tensor_name = StrCat(tensor_name, ":", output_idx);
+    }
+    VLOG(1) << "output tensor name: " << tensor_name;
+    output_names.push_back(tensor_name);
+    auto tensor_or_weights = converter.get_tensor(tensor_name);
+    if (!tensor_or_weights.is_tensor()) {
+      return tensorflow::errors::InvalidArgument("Output node '" + tensor_name +
+                                                 "' is weights not tensor");
+    }
+    nvinfer1::ITensor* tensor = tensor_or_weights.tensor();
+    if (!tensor) {
+      return tensorflow::errors::NotFound("Output tensor not found: " +
+                                          tensor_name);
+    }
+    converter.network()->markOutput(*tensor);
+    tensorflow::DataType tf_dtype = node->output_type(output_idx);
+    output_dtypes.push_back(tf_dtype);
+    nvinfer1::DataType trt_dtype = nvinfer1::DataType::kFLOAT;
+    TF_RETURN_IF_ERROR(ConvertDType(tf_dtype, &trt_dtype));
+    tensor->setType(trt_dtype);
+  }
+
+  VLOG(2) << "finished output";
+
+  // Build the engine
+  op_res->builder_->setMaxBatchSize(s.max_batch_size);
+  op_res->builder_->setMaxWorkspaceSize(s.max_workspace_size_bytes);
+
+  // Build the TRT op
+  // TODO(sami,ben,jie): proper naming!
+  tensorflow::NodeDefBuilder op_builder(calib_op_name, "TRTCalibOp");
+  std::vector<tensorflow::NodeDefBuilder::NodeOut> income_edges;
+  for (size_t i = 0; i < input_names.size(); ++i) {
+    int output_idx = s.input_inds.at(i).second;
+    // We wired up the input here already; it is redundant to do it again in
+    // ConvertSubGraphToTensorRT (convert_graph.cc).
+    auto incoming_edge = tensorflow::NodeDefBuilder::NodeOut(
+        input_names.at(i), output_idx, input_dtypes.at(i));
+    VLOG(1) << calib_op_name << " input " << i << " = " << input_names.at(i)
+            << ":" << output_idx
+            << " dType= " << tensorflow::DataTypeString(input_dtypes.at(i));
+    income_edges.push_back(incoming_edge);
+  }
+  tensorflow::gtl::ArraySlice<tensorflow::NodeDefBuilder::NodeOut> input_list(
+      income_edges);
+  op_builder.Input(input_list);
+  std::vector<string> segment_names;
+  segment_names.reserve(s.subgraph_node_ids.size());
+  for (int i : s.subgraph_node_ids) {
+    auto node = s.graph.FindNodeId(i);
+    segment_names.push_back(node->name());
+  }
+  LOG(INFO) << "finished op preparation";
+
+  auto status = op_builder.Attr("segment_nodes", segment_names)
+                    .Attr("input_names", input_names)
+                    .Attr("segment_output_names", output_names)
+                    .Attr("resource_name", calib_op_name)
+                    .Finalize(s.trt_node);
+
+  LOG(INFO) << status.ToString();
+  LOG(INFO) << "finished op building";
+
+  return status;
+}
 
 tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
-    const tensorflow::Graph& graph, const std::set<int>& subgraph_node_ids,
-    const std::vector<std::pair<int, int>>& input_inds,
-    const std::vector<std::pair<int, int>>& output_inds, size_t max_batch_size,
-    size_t max_workspace_size_bytes,
-    const tensorflow::grappler::GraphProperties& graph_properties,
-    tensorflow::NodeDef* trt_node) {
+    tensorrt::convert::SubGraphParams& s) {
   // Visit nodes in reverse topological order and construct the TRT network.
 
   // Toposort
   std::vector<tensorflow::Node*> order_vec;
-  tensorflow::GetPostOrder(graph, &order_vec);
+  tensorflow::GetPostOrder(s.graph, &order_vec);
   // Select just the subgraph
   std::list<tensorflow::Node*> order;
   for (tensorflow::Node* node : order_vec) {
-    if (subgraph_node_ids.count(node->id())) {
+    if (s.subgraph_node_ids.count(node->id())) {
       // We want topological order to contstruct the
       // network layer by layer
       order.push_front(node);
@@ -1434,46 +2453,94 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
         "Failed to create TensorRT network object");
   }
 
+  string subgraph_name_scope;
+  if (!order.empty()) {
+    subgraph_name_scope = order.front()->name();
+  }
+  for (const tensorflow::Node* node : order) {
+    subgraph_name_scope = GetCommonNameScope(subgraph_name_scope, node->name());
+  }
+  static int static_id = 0;
+  // TODO(sami,ben,jie): proper naming!
+  string engine_name = StrCat(subgraph_name_scope, "my_trt_op");
+  engine_name = StrCat(engine_name, static_id++);
+  auto trt_rmgr = tensorflow::tensorrt::TRTResourceManager::instance();
+  auto weight_rmgr = trt_rmgr->getManager("WeightStore");
+  auto ws = new tensorflow::tensorrt::TRTWeightStore();
+  TF_CHECK_OK(weight_rmgr->Create(engine_name, engine_name, ws));
+
   // Build the network
-  Converter converter(trt_network.get());
+  Converter converter(trt_network.get(), ws, s.precision_mode == FP16MODE);
 
   std::vector<string> input_names;
   std::vector<tensorflow::DataType> input_dtypes;
-  for (std::pair<int, int> const& input : input_inds) {
+  for (const std::pair<int, int>& input : s.input_inds) {
+    VLOG(2) << "parsing input";
     int node_id = input.first;
     int output_idx = input.second;
-    tensorflow::Node* node = graph.FindNodeId(node_id);
+    tensorflow::Node* node = s.graph.FindNodeId(node_id);
     auto node_name = node->name();
-    input_names.push_back(node_name);  // Insert original node name without port
-    // TODO(jie): alternative :)
-    if (!graph_properties.HasOutputProperties(node_name))
-      return tensorflow::errors::Internal("Failed to find input node: " +
-                                          node_name);
+    // input_names must hold the input tensor name ("node" or "node:port")
+    // so that it matches the TRT binding name; a bare node name is only
+    // correct for output port 0.
+    auto tensor_name = node_name;
+    if (output_idx != 0) {
+      tensor_name = StrCat(tensor_name, ":", output_idx);
+    }
 
-    auto op_info_vec = graph_properties.GetOutputProperties(node_name);
-    if (static_cast<int>(op_info_vec.size()) < output_idx)
-      return tensorflow::errors::Internal(
-          "Accessing output index of: " + std::to_string(output_idx) +
-          ", at node: " + node_name + " with output entry from shape_map: " +
-          std::to_string(op_info_vec.size()));
+    VLOG(2) << "input name: " << node_name << " tensor_name: " << tensor_name
+            << " idx: " << output_idx;
 
-    auto op_info = op_info_vec.at(output_idx);
+    auto shape_inference_node_name = node_name;
+    auto shape_inference_output_idx = output_idx;
+    // rewire the shape inference to original node in the graph
+    if (s.output_edge_map->count(tensor_name)) {
+      shape_inference_node_name = s.output_edge_map->at(tensor_name).second;
+      shape_inference_output_idx = s.output_edge_map->at(tensor_name).first;
+    }
+    if (shape_inference_output_idx < 0) continue;
+    VLOG(2) << "shape inference name: " << shape_inference_node_name
+            << " idx: " << shape_inference_output_idx;
+
+    if (!s.graph_properties.HasOutputProperties(shape_inference_node_name))
+      return tensorflow::errors::Internal("failed to find input node: " +
+                                          shape_inference_node_name);
 
+    auto op_info_vec =
+        s.graph_properties.GetOutputProperties(shape_inference_node_name);
+    if (static_cast<int>(op_info_vec.size()) <= shape_inference_output_idx)
+      return tensorflow::errors::Internal(
+          "accessing output index of: ", shape_inference_output_idx,
+          ", at node: ", shape_inference_node_name,
+          " with output entry from shape_map: ", op_info_vec.size());
+
+    auto op_info = op_info_vec.at(shape_inference_output_idx);
     tensorflow::DataType tf_dtype = op_info.dtype();
     input_dtypes.push_back(tf_dtype);
 
     nvinfer1::DataType dtype(nvinfer1::DataType::kFLOAT);
-    TF_CHECK_OK(ConvertDType(tf_dtype, &dtype));
+    auto type_status = ConvertDType(tf_dtype, &dtype);
+    if (!type_status.ok()) {
+      LOG(WARNING) << "Type conversion failed for " << node_name;
+      return type_status;
+    }
 
-    VLOG(2) << "Accessing output index of: " << std::to_string(output_idx)
+    VLOG(2) << "Accessing output index of: " << output_idx
             << ", at node: " << node_name
-            << " with output entry from shape_map: "
-            << std::to_string(op_info_vec.size());
-
+            << " with output entry from shape_map: " << op_info_vec.size();
     // TODO(ben,jie): update TRT input format/dimension
     nvinfer1::DimsCHW input_dim_psuedo_chw;
     for (int i = 0; i < 3; i++) input_dim_psuedo_chw.d[i] = 1;
 
+    // TODO(jie): TRT 3.x only supports 4-dimensional input tensors.
+    //            Update the code once TRT 4.0 comes out.
+    if (op_info.shape().dim_size() != 4) {
+      string err_str = "Require 4 dimensional input.";
+      StrAppend(&err_str, " Got ", op_info.shape().dim_size(), " ",
+                shape_inference_node_name);
+      return tensorflow::errors::Unimplemented(err_str);
+    }
+
     for (int i = 1; i < op_info.shape().dim_size(); i++) {
       VLOG(2) << "dimension: " << i
               << " , size: " << op_info.shape().dim(i).size();
@@ -1482,9 +2549,11 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
 
     // TODO(ben,jie): proper way to restore input tensor name?
     auto input_tensor_name = node_name;
-    if (output_idx != 0)
-      input_tensor_name = node_name + ":" + std::to_string(output_idx);
+    if (output_idx != 0) {
+      input_tensor_name = StrCat(node_name, ":", output_idx);
+    }
 
+    input_names.push_back(input_tensor_name);
     nvinfer1::ITensor* input_tensor = converter.network()->addInput(
         input_tensor_name.c_str(), dtype, input_dim_psuedo_chw);
 
@@ -1511,20 +2580,28 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
   // Gather output metadata
   std::vector<string> output_names;
   std::vector<tensorflow::DataType> output_dtypes;
-  for (std::pair<int, int> const& output : output_inds) {
+  int trt_engine_op_output_idx = 0;
+  for (const std::pair<int, int>& output : s.output_inds) {
     int node_id = output.first;
     int output_idx = output.second;
-    tensorflow::Node* node = graph.FindNodeId(node_id);
+    tensorflow::Node* node = s.graph.FindNodeId(node_id);
     string op_name = node->name();
     string tensor_name = op_name;
+
+    s.output_edge_map->insert(
+        {trt_engine_op_output_idx == 0
+             ? engine_name
+             : StrCat(engine_name, ":", trt_engine_op_output_idx),
+         {output_idx, tensor_name}});
+    trt_engine_op_output_idx++;
     if (output_idx != 0)
-      tensor_name = tensor_name + ":" + std::to_string(output_idx);
+      tensorflow::strings::StrAppend(&tensor_name, ":", output_idx);
     VLOG(2) << "Output tensor name: " << tensor_name;
     output_names.push_back(tensor_name);
     auto tensor_or_weights = converter.get_tensor(tensor_name);
     if (!tensor_or_weights.is_tensor()) {
-      return tensorflow::errors::InvalidArgument(
-          "Output node is weights not tensor");
+      return tensorflow::errors::InvalidArgument("Output node '" + tensor_name +
+                                                 "' is weights not tensor");
     }
     nvinfer1::ITensor* tensor = tensor_or_weights.tensor();
     if (!tensor) {
@@ -1540,19 +2617,25 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
   }
 
   VLOG(2) << "Finished output";
-  // TODO(jie): static_id is not thread safe.
-  static int static_id = 0;
 
   // Build the engine
-  trt_builder->setMaxBatchSize(max_batch_size);
-  trt_builder->setMaxWorkspaceSize(max_workspace_size_bytes);
-  VLOG(0) << "Starting build engine " << static_id;
-  // TODO(ben,jie): half2 and int8 mode support
+  trt_builder->setMaxBatchSize(s.max_batch_size);
+  trt_builder->setMaxWorkspaceSize(s.max_workspace_size_bytes);
+  VLOG(0) << "Max batch size= " << s.max_batch_size
+          << " max workspace size= " << s.max_workspace_size_bytes;
+  if (s.precision_mode == FP16MODE) {
+    trt_builder->setHalf2Mode(true);
+    VLOG(0) << "Using FP16 precision mode";
+  }
+  LOG(INFO) << "starting build engine";
   string engine_plan_string;
   {
     auto trt_engine =
         infer_object(trt_builder->buildCudaEngine(*converter.network()));
     VLOG(0) << "Built network";
+    if (trt_engine.get() == nullptr) {
+      return tensorflow::errors::Internal("Engine building failure");
+    }
     auto engine_plan = infer_object(trt_engine->serialize());
     VLOG(0) << "Serialized engine";
     const char* engine_plan_data =
@@ -1560,18 +2643,20 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
     engine_plan_string =
         string(engine_plan_data, engine_plan_data + engine_plan->size());
   }
-
-  VLOG(0) << "Finished engine";
+  TF_RETURN_IF_ERROR(weight_rmgr->Delete<tensorflow::tensorrt::TRTWeightStore>(
+      engine_name, engine_name));
+  LOG(INFO) << "finished engine " << engine_name << " containing "
+            << s.subgraph_node_ids.size() << " nodes";
 
   // Build the TRT op
-  // TODO(sami,ben,jie): proper naming!
-  tensorflow::NodeDefBuilder op_builder(
-      tensorflow::strings::StrCat("my_trt_op", static_id++), "TRTEngineOp");
+  tensorflow::NodeDefBuilder op_builder(engine_name, "TRTEngineOp");
   std::vector<tensorflow::NodeDefBuilder::NodeOut> income_edges;
+  VLOG(2) << "input edge size: " << input_names.size();
   for (size_t i = 0; i < input_names.size(); ++i) {
-    int output_idx = input_inds.at(i).second;
-    // We wired up the input here already, it is redundant to do it again in
-    // ConvertSubGraphToTensorRT(convert_graph.cc)
+    VLOG(2) << "input edges: " << i << " " << input_names.at(i);
+    int output_idx = s.input_inds.at(i).second;
+    // We wired up the input here already; it is redundant to do it again in
+    // ConvertSubGraphToTensorRT (convert_graph.cc).
     auto incoming_edge = tensorflow::NodeDefBuilder::NodeOut(
         input_names.at(i), output_idx, input_dtypes.at(i));
     income_edges.push_back(incoming_edge);
@@ -1586,7 +2671,7 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
                     .Attr("input_nodes", input_names)
                     .Attr("output_nodes", output_names)
                     .Attr("OutT", output_dtypes)
-                    .Finalize(trt_node);
+                    .Finalize(s.trt_node);
 
   VLOG(0) << status.ToString() << " finished op building";
 
diff --git a/tensorflow/contrib/tensorrt/convert/convert_nodes.h b/tensorflow/contrib/tensorrt/convert/convert_nodes.h
index 2e7fd19566e1ed3719b932c7443a9c3f652b2d3e..954a1e72f8604371fc00e088a67b4d411314dda6 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_nodes.h
+++ b/tensorflow/contrib/tensorrt/convert/convert_nodes.h
@@ -17,6 +17,8 @@ limitations under the License.
 #define TENSORFLOW_CONTRIB_TENSORRT_CONVERT_CONVERT_NODES_H_
 
 #include <set>
+#include <string>
+#include <unordered_map>
 #include <utility>
 #include <vector>
 
@@ -32,16 +34,49 @@ namespace tensorflow {
 namespace tensorrt {
 namespace convert {
 
-tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
-    const tensorflow::Graph& graph, const std::set<int>& subgraph_node_ids,
-    const std::vector<std::pair<int, int>>&
-        input_inds,  // {node_id, output_idx}
-    const std::vector<std::pair<int, int>>&
-        output_inds,  // {node_id, output_idx}
-    size_t max_batch_size, size_t max_workspace_size_bytes,
-    const tensorflow::grappler::GraphProperties& graph_prop,
-    tensorflow::NodeDef* trt_node);
+const int FP32MODE = 0;
+const int FP16MODE = 1;
+const int INT8MODE = 2;
 
+struct SubGraphParams {
+  SubGraphParams(
+      tensorflow::Graph& inp_graph,
+      const std::set<int>& subgraph_node_id_numbers,
+      const std::vector<std::pair<int, int>>& input_indices,
+      const std::vector<std::pair<int, int>>& output_indices,
+      size_t max_supported_batch_size, size_t max_consumed_workspace_size_bytes,
+      const tensorflow::grappler::GraphProperties& current_graph_properties,
+      std::unordered_map<string, std::pair<int, string>>* output_edges,
+      tensorflow::NodeDef* constructed_trt_node,
+      int engine_precision_mode = FP32MODE)
+      : graph(inp_graph),
+        subgraph_node_ids(subgraph_node_id_numbers),
+        input_inds(input_indices),
+        output_inds(output_indices),
+        max_batch_size(max_supported_batch_size),
+        max_workspace_size_bytes(max_consumed_workspace_size_bytes),
+        graph_properties(current_graph_properties),
+        output_edge_map(output_edges),
+        trt_node(constructed_trt_node),
+        precision_mode(engine_precision_mode) {}
+
+  tensorflow::Graph& graph;
+  const std::set<int>& subgraph_node_ids;
+  const std::vector<std::pair<int, int>>& input_inds;   // {node_id, output_idx}
+  const std::vector<std::pair<int, int>>& output_inds;  // {node_id, output_idx}
+  size_t max_batch_size;
+  size_t max_workspace_size_bytes;
+  const tensorflow::grappler::GraphProperties& graph_properties;
+  std::unordered_map<string, std::pair<int, string>>* output_edge_map;
+  tensorflow::NodeDef* trt_node;
+  const int precision_mode;
+};
+
+// TODO(sami): Replace references with const reference or pointers
+tensorflow::Status ConvertSubGraphToTensorRTNodeDef(SubGraphParams& params);
+tensorflow::Status InjectCalibrationNode(SubGraphParams& params);
+tensorflow::Status ConvertCalibrationNodeToEngineNode(tensorflow::Graph& graph,
+                                                      tensorflow::Node* c_node);
 }  // namespace convert
 }  // namespace tensorrt
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc b/tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..aea44fd8a2fcc4c359a6cb0c98ae34711708326e
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc
@@ -0,0 +1,136 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/tensorrt/kernels/trt_calib_op.h"
+#include "tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h"
+#include "tensorflow/contrib/tensorrt/resources/trt_resource_manager.h"
+#include "tensorflow/contrib/tensorrt/resources/trt_resources.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor_shape.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/platform/stream_executor.h"
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
+#include "cuda/include/cuda_runtime_api.h"
+#include "tensorrt/include/NvInfer.h"
+
+namespace tensorflow {
+namespace tensorrt {
+
+TRTCalibOp::TRTCalibOp(OpKernelConstruction* context) : OpKernel(context) {
+  OP_REQUIRES_OK(context, context->GetAttr("segment_nodes", &segment_nodes_));
+  OP_REQUIRES_OK(context, context->GetAttr("input_names", &input_names_));
+  OP_REQUIRES_OK(context, context->GetAttr("resource_name", &resource_name_));
+}
+
+#define TYPECASE(dt, X, Y)                                                \
+  case dt: {                                                              \
+    return (void*)X->flat<tensorflow::EnumToDataType<dt>::Type>().data(); \
+  }
+
+void* GetTensorAddress(const Tensor* tensor_ptr) {
+  auto tensor_type = tensor_ptr->dtype();
+  switch (tensor_type) {
+    TYPECASE(tensorflow::DT_FLOAT, tensor_ptr, dest_ptr);
+    TYPECASE(tensorflow::DT_HALF, tensor_ptr, dest_ptr);
+    TYPECASE(tensorflow::DT_INT8, tensor_ptr, dest_ptr);
+    default: {
+      LOG(FATAL) << "Unsupported Data type "
+                 << tensorflow::DataTypeString(tensor_type);
+      return nullptr;
+    }
+  }
+}
+
+void TRTCalibOp::Compute(tensorflow::OpKernelContext* ctx) {
+  // TODO(aaroey): make sure ctx->resource_mgr() is used in future PR.
+  auto trt_rm = tensorflow::tensorrt::TRTResourceManager::instance();
+  auto res_mgr = trt_rm->getManager("TRTCalibOps");
+  tensorflow::tensorrt::TRTCalibrationResource* calib_res = nullptr;
+  auto status = res_mgr->Lookup(resource_name_, resource_name_, &calib_res);
+
+  if (!status.ok()) {
+    ctx->SetStatus(status);
+    return;
+  }
+  int num_inputs = ctx->num_inputs();
+  // On the first run, instantiate the calibrator.
+  if (calib_res->calibrator_ == nullptr) {
+    dev_tensors_.resize(num_inputs);
+    int batch_size = ctx->input(0).dim_size(0);
+    VLOG(1) << " Constructing calibrator";
+    for (int i = 0; i < num_inputs; i++) {
+      // allocate workspace on device for inputs
+      const tensorflow::Tensor& t = ctx->input(i);
+      OP_REQUIRES_OK(ctx,
+                     ctx->allocate_persistent(t.dtype(), t.shape(),
+                                              &dev_tensors_.at(i), nullptr));
+      const auto device_tensor = dev_tensors_.at(i).AccessTensor(ctx);
+      CHECK_EQ(t.TotalBytes(), device_tensor->TotalBytes());
+      void* device_address = GetTensorAddress(device_tensor);
+      device_buffers_.emplace(input_names_.at(i),
+                              std::pair<void*, size_t>(
+                                  device_address, device_tensor->TotalBytes()));
+    }
+
+    calib_res->calibrator_ =
+        new TRTInt8Calibrator(device_buffers_, batch_size, resource_name_);
+    string label(resource_name_);
+    calib_res->thr_ = new std::thread([calib_res, label]() {
+      VLOG(1) << "Starting calibration thread, Calibration Resource @ "
+              << calib_res;
+      calib_res->builder_->setInt8Calibrator(calib_res->calibrator_);
+      calib_res->builder_->setInt8Mode(true);
+      calib_res->engine_ = calib_res->builder_->buildCudaEngine(
+          *calib_res->network_);  // will loop until we terminate calibrator
+      VLOG(1) << "Calibration loop terminated " << label;
+    });
+    VLOG(1) << "initialized calibrator resource";
+  }  // calibrator initialized
+
+  // Pass input data to calibrator
+  std::unordered_map<string, void*> input_data;
+  for (int i = 0; i < num_inputs; i++) {
+    const Tensor& t = ctx->input(i);
+    void* data_address = GetTensorAddress(&t);
+    const auto device_tensor = dev_tensors_.at(i).AccessTensor(ctx);
+    CHECK_EQ(t.TotalBytes(),
+             device_tensor->TotalBytes());  // use the tensor so FW keeps it
+    input_data.emplace(input_names_.at(i), data_address);
+    ctx->set_output(i, t);
+  }
+  VLOG(2) << "Filled map for sending";
+  // Copied from cuda_kernel_helper, which seems usable only in *.cu.cc files.
+  const cudaStream_t* stream = CHECK_NOTNULL(
+      reinterpret_cast<const cudaStream_t*>(ctx->op_device_context()
+                                                ->stream()
+                                                ->implementation()
+                                                ->CudaStreamMemberHack()));
+  calib_res->calibrator_->setBatch(input_data, *stream);
+  VLOG(2) << "Passed calibration data";
+  // TODO(aaroey): make sure we wait for the completion of calibration on the
+  // last batch in future PR.
+}
+
+#undef TYPECASE
+
+REGISTER_KERNEL_BUILDER(Name("TRTCalibOp").Device(DEVICE_GPU), TRTCalibOp);
+
+}  // namespace tensorrt
+}  // namespace tensorflow
+#endif
+#endif
diff --git a/tensorflow/contrib/tensorrt/kernels/trt_calib_op.h b/tensorflow/contrib/tensorrt/kernels/trt_calib_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..23df9db32f077a080eaff7479fcbe90d6a504c42
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/kernels/trt_calib_op.h
@@ -0,0 +1,52 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_TENSORRT_KERNELS_TRT_CALIB_OP_H
+#define TENSORFLOW_CONTRIB_TENSORRT_KERNELS_TRT_CALIB_OP_H
+
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <utility>
+#include <vector>
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor_shape.h"
+#include "tensorflow/core/platform/types.h"
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
+namespace tensorflow {
+namespace tensorrt {
+// TODO(sami): Convert this to async kernel!
+class TRTCalibOp : public OpKernel {
+ public:
+  explicit TRTCalibOp(OpKernelConstruction* context);
+
+  void Compute(OpKernelContext* context) override;
+
+ private:
+  string resource_name_;
+  std::vector<string> segment_nodes_;
+  std::vector<string> input_names_;
+  std::vector<tensorflow::TensorShape> shapes_;
+  std::unordered_map<string, std::pair<void*, size_t>> device_buffers_;
+  std::vector<tensorflow::PersistentTensor> dev_tensors_;
+};
+}  // namespace tensorrt
+}  // namespace tensorflow
+#endif
+#endif
+#endif  // TENSORFLOW_CONTRIB_TENSORRT_KERNELS_TRT_CALIB_OP_H
diff --git a/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc b/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc
index 8efdf63ebebc4d7a199c60635ca64348d2b30505..b32371b642f38b0851955a4a3beab97b86e1f6a0 100644
--- a/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc
+++ b/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc
@@ -24,8 +24,12 @@ limitations under the License.
 #include "cuda/include/cuda_runtime_api.h"
 
 namespace tensorflow {
-namespace tensorrt {
 static ::tensorflow::tensorrt::Logger logger;
+namespace gpu = ::perftools::gputools;
+using IRuntime = nvinfer1::IRuntime;
+using Dims = nvinfer1::Dims;
+
+namespace tensorrt {
 
 TRTEngineOp::TRTEngineOp(OpKernelConstruction* context) : OpKernel(context) {
   // read serialized_engine
@@ -40,10 +44,21 @@ TRTEngineOp::TRTEngineOp(OpKernelConstruction* context) : OpKernel(context) {
   // TODO(samikama) runtime should be taken from a resourcemanager as well.
   // Only engine should be in the op and context and runtime should be taken
   // from resourcemanager
-  nvinfer1::IRuntime* infer = nvinfer1::createInferRuntime(logger);
+  // TODO(jie): cudaSetDevice makes sure the TRT engine is allocated on the
+  // same GPU where the input/output is located.
+  int gpu_id = context->device()->tensorflow_gpu_device_info()->gpu_id;
+  cudaSetDevice(gpu_id);
+  int device;
+  cudaGetDevice(&device);
+  if (gpu_id != device) LOG(FATAL) << "set device failed!";
+
+  // TODO(samikama) runtime should be taken from a resourcemanager as well.
+  // Only engine should be in the op and context and runtime should be taken
+  // from resourcemanager
+
+  IRuntime* infer = nvinfer1::createInferRuntime(logger);
   trt_engine_ptr_.reset(infer->deserializeCudaEngine(
       serialized_engine.c_str(), serialized_engine.size(), nullptr));
-
   trt_execution_context_ptr_.reset(trt_engine_ptr_->createExecutionContext());
   // Runtime is safe to delete after engine creation
   infer->destroy();
@@ -55,7 +70,6 @@ void TRTEngineOp::Compute(OpKernelContext* context) {
 
   size_t binding_index;
   int num_batch = 0;
-  bool valid = true;
   for (int i = 0; i < context->num_inputs(); i++) {
     // Grab the input tensor
     binding_index = trt_engine_ptr_->getBindingIndex(input_nodes_[i].c_str());
@@ -64,8 +78,12 @@ void TRTEngineOp::Compute(OpKernelContext* context) {
     const TensorShape& input_shape = input_tensor.shape();
     if (i == 0) {
       num_batch = input_shape.dim_size(0);
+      if (num_batch > trt_engine_ptr_->getMaxBatchSize()) {
+        LOG(FATAL) << "input tensor batch larger than max_batch_size: "
+                   << trt_engine_ptr_->getMaxBatchSize();
+      }
     } else if (num_batch != input_shape.dim_size(0)) {
-      valid = false;
+      LOG(FATAL) << "input data inconsistent batch size";
       break;
     }
     switch (trt_engine_ptr_->getBindingDataType(binding_index)) {
@@ -81,9 +99,6 @@ void TRTEngineOp::Compute(OpKernelContext* context) {
     }
   }
 
-  // Might want a different way to inform the user of batch size inconsistency
-  if (!valid) LOG(WARNING) << "input data inconsistent batch size";
-
   for (int i = 0; i < static_cast<int>(output_nodes_.size()); i++) {
     // This is bad that we have to reallocate output buffer every run.
     // Create an output tensor
@@ -126,9 +141,11 @@ void TRTEngineOp::Compute(OpKernelContext* context) {
                                                 ->implementation()
                                                 ->CudaStreamMemberHack()));
 
-  // execution handled by TF since we are getting stream from TF.
-  // it is safe for CPU pointer array (buffers) to go out of scope after enqueue
-  trt_execution_context_ptr_->enqueue(num_batch, &buffers[0], *stream, nullptr);
+  // TODO(jie): trt enqueue does not return error
+  auto ret = trt_execution_context_ptr_->enqueue(num_batch, &buffers[0],
+                                                 *stream, nullptr);
+  VLOG(2) << "enqueue returns: " << ret;
+  // sync should be done by TF.
 }
 
 REGISTER_KERNEL_BUILDER(Name("TRTEngineOp").Device(DEVICE_GPU), TRTEngineOp);
diff --git a/tensorflow/contrib/tensorrt/log/trt_logger.cc b/tensorflow/contrib/tensorrt/log/trt_logger.cc
index 7add8cb8b3d2a04206ee4174e79a1a4b86e05f30..dda0dc9e712eb726800abfb6084f4f708d04825b 100644
--- a/tensorflow/contrib/tensorrt/log/trt_logger.cc
+++ b/tensorflow/contrib/tensorrt/log/trt_logger.cc
@@ -27,19 +27,19 @@ void Logger::log(Severity severity, const char* msg) {
   // Suppress info-level messages
   switch (severity) {
     case Severity::kINFO: {  // Mark TRT info messages as debug!
-      VLOG(2) << msg;
+      VLOG(2) << name_ << " " << msg;
       break;
     }
     case Severity::kWARNING: {
-      LOG(WARNING) << msg;
+      LOG(WARNING) << name_ << " " << msg;
       break;
     }
     case Severity::kERROR: {
-      LOG(ERROR) << msg;
+      LOG(ERROR) << name_ << " " << msg;
       break;
     }
     case Severity::kINTERNAL_ERROR: {
-      LOG(FATAL) << msg;
+      LOG(FATAL) << name_ << " " << msg;
       break;
     }
     // This is useless for now. But would catch it in future if enum changes. It
diff --git a/tensorflow/contrib/tensorrt/log/trt_logger.h b/tensorflow/contrib/tensorrt/log/trt_logger.h
index d71f66b933a8068a6276a7e070755e0075543bb5..7f3544f8cfda8dce13881e1f8f4388b640e315f4 100644
--- a/tensorflow/contrib/tensorrt/log/trt_logger.h
+++ b/tensorflow/contrib/tensorrt/log/trt_logger.h
@@ -27,9 +27,11 @@ namespace tensorrt {
 
 // Logger for GIE info/warning/errors
 class Logger : public nvinfer1::ILogger {
- private:
+ public:
+  Logger(string name = "DefaultLogger") : name_(name) {}
   void log(nvinfer1::ILogger::Severity severity, const char* msg) override;
 
+ private:
   string name_;
 };
 
diff --git a/tensorflow/contrib/tensorrt/ops/trt_calib_op.cc b/tensorflow/contrib/tensorrt/ops/trt_calib_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4835e5065068ec7a59995eb7f6126b31aecf6704
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/ops/trt_calib_op.cc
@@ -0,0 +1,37 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/shape_inference.h"
+namespace tensorflow {
+
+REGISTER_OP("TRTCalibOp")
+    .Attr("segment_nodes: list(string)")         // names of the ops in segment
+    .Attr("segment_output_names: list(string)")  // names of the output ops in
+                                                 // segment
+    .Attr("input_names: list(string)")           // names of the inputs for
+                                                 // passing into tensorrt
+    .Attr("resource_name: string")
+    .Attr("InT: list({int8, float16, float32})")
+    .Input("in_tensor: InT")
+    .Output("out_tensor: InT")
+    .SetShapeFn([](tensorflow::shape_inference::InferenceContext* c) {
+      for (int i = 0; i < c->num_inputs(); i++) {
+        c->set_output(i, c->input(i));
+      }
+      return Status::OK();
+    });
+
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/tensorrt/python/__init__.py b/tensorflow/contrib/tensorrt/python/__init__.py
index 7e050a768ce97af1fc1d2c85cb52640b4c6a6a97..0b2321b5fc7bcbd53c01d1c97cafcfcb229a83ef 100644
--- a/tensorflow/contrib/tensorrt/python/__init__.py
+++ b/tensorflow/contrib/tensorrt/python/__init__.py
@@ -20,5 +20,6 @@ from __future__ import print_function
 
 # pylint: disable=unused-import,line-too-long
 from tensorflow.contrib.tensorrt.python.ops import trt_engine_op
+from tensorflow.contrib.tensorrt.python.trt_convert import calib_graph_to_infer_graph
 from tensorflow.contrib.tensorrt.python.trt_convert import create_inference_graph
 # pylint: enable=unused-import,line-too-long
diff --git a/tensorflow/contrib/tensorrt/python/trt_convert.py b/tensorflow/contrib/tensorrt/python/trt_convert.py
index 9454862f857ab743712ce409ff007de55e72a68e..338475d90ea55ab2c1bb8df77f27a71a4a36a5dd 100644
--- a/tensorflow/contrib/tensorrt/python/trt_convert.py
+++ b/tensorflow/contrib/tensorrt/python/trt_convert.py
@@ -20,11 +20,17 @@ from __future__ import print_function
 
 # pylint: disable=unused-import,line-too-long
 import six as _six
+from tensorflow.contrib.tensorrt.wrap_conversion import calib_convert
 from tensorflow.contrib.tensorrt.wrap_conversion import trt_convert
 from tensorflow.core.framework import graph_pb2
+from tensorflow.core.protobuf import rewriter_config_pb2
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import errors_impl as _impl
+from tensorflow.python.framework import meta_graph
 from tensorflow.python.framework import ops
+from tensorflow.python.grappler import tf_optimizer
+from tensorflow.python.util import compat
+# pylint: enable=unused-import,line-too-long
 
 
 # TODO(skama): get outputs from session when implemented as c++
@@ -32,22 +38,33 @@ from tensorflow.python.framework import ops
 def create_inference_graph(input_graph_def,
                            outputs,
                            max_batch_size=1,
-                           max_workspace_size_bytes=2 << 20):
-  """Python wrapper for the TRT transormation.
-
+                           max_workspace_size_bytes=2 << 20,
+                           precision_mode="FP32",
+                           minimum_segment_size=3):
+  """Python wrapper for the TRT transformation.
 
   Args:
     input_graph_def: GraphDef object containing a model to be transformed.
-    outputs: List of tensors or node names for the model outputs.
+    outputs: list of tensors or node names for the model outputs.
     max_batch_size: max size for the input batch
     max_workspace_size_bytes: parameter to control memory allocation (in Bytes)
+    precision_mode: one of 'FP32', 'FP16' and 'INT8'
+    minimum_segment_size: the minimum number of nodes required for a subgraph to
+      be replaced by TRTEngineOp.
 
   Returns:
     New GraphDef with TRTEngineOps placed in graph replacing subgraphs.
 
   Raises:
+    ValueError: if the provided precision mode is invalid.
     RuntimeError: if the returned status message is malformed.
   """
+  supported_precision_modes = {"FP32": 0, "FP16": 1, "INT8": 2}
+  if precision_mode.upper() not in supported_precision_modes:
+    raise ValueError(("precision mode '{}' is not supported. "
+                      "It should be one of {}").format(
+                          precision_mode, "{'FP32', 'FP16', 'INT8'}"))
+  mode = supported_precision_modes[precision_mode.upper()]
 
   def py2bytes(inp):
     return inp
@@ -83,7 +100,7 @@ def create_inference_graph(input_graph_def,
   # pair or strings where first one is encoded status and the second
   # one is the transformed graphs protobuf string.
   out = trt_convert(input_graph_def_str, out_names, max_batch_size,
-                    max_workspace_size_bytes)
+                    max_workspace_size_bytes, mode, minimum_segment_size)
   status = to_string(out[0])
   output_graph_def_string = out[1]
   del input_graph_def_str  # Save some memory
@@ -101,3 +118,46 @@ def create_inference_graph(input_graph_def,
   output_graph_def.ParseFromString(output_graph_def_string)
   del output_graph_def_string  # Save some memory
   return output_graph_def
+
+
+def calib_graph_to_infer_graph(calibration_graph_def):
+  """Convert an existing calibration graph to an inference graph.
+
+  Args:
+    calibration_graph_def: the calibration GraphDef object with calibration data
+  Returns:
+    New GraphDef with TRTEngineOps placed in graph replacing calibration nodes.
+  Raises:
+    RuntimeError: if the returned status message is malformed.
+  """
+
+  def py2string(inp):
+    return inp
+
+  def py3string(inp):
+    return inp.decode("utf-8")
+
+  if _six.PY2:
+    to_string = py2string
+  else:
+    to_string = py3string
+
+  graph_str = calibration_graph_def.SerializeToString()
+  out = calib_convert(graph_str)
+  status = to_string(out[0])
+  output_graph_def_string = out[1]
+  del graph_str  # Save some memory
+  if len(status) < 2:
+    raise _impl.UnknownError(None, None, status)
+  if status[:2] != "OK":
+    msg = status.split(";")
+    if len(msg) == 1:
+      raise RuntimeError("Status message is malformed {}".format(status))
+    # pylint: disable=protected-access
+    raise _impl._make_specific_exception(None, None, ";".join(msg[1:]),
+                                         int(msg[0]))
+    # pylint: enable=protected-access
+  output_graph_def = graph_pb2.GraphDef()
+  output_graph_def.ParseFromString(output_graph_def_string)
+  del output_graph_def_string  # Save some memory
+  return output_graph_def
diff --git a/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..dc7c93f869f5ef7c8eaa2a87eed26cfe69597fdb
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc
@@ -0,0 +1,129 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h"
+
+#include <atomic>
+#include <chrono>
+#include <unordered_map>
+
+#include "tensorflow/core/platform/logging.h"
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
+#include "cuda/include/cuda_runtime_api.h"
+
+namespace tensorflow {
+namespace tensorrt {
+
+// set the batch size before constructing the thread to execute engine
+int TRTInt8Calibrator::getBatchSize() const { return batch_size_; }
+
+TRTInt8Calibrator::TRTInt8Calibrator(
+    const std::unordered_map<string, std::pair<void*, size_t>>& dev_buffers,
+    int batch_size, string engine_name)
+    : batch_size_(batch_size),
+      done_(false),
+      dev_buffers_(dev_buffers),
+      calib_running_(false),
+      batch_is_set_(false),
+      engine_name_(engine_name) {}
+
+bool TRTInt8Calibrator::setBatch(const std::unordered_map<string, void*>& data,
+                                 const cudaStream_t stream) {
+  tensorflow::mutex_lock lock(cond_mtx_);
+  while ((calib_running_ || batch_is_set_) &&
+         !done_) {  // wait while calibration is running
+    cond_.wait(lock);
+  }
+  if (done_) return false;
+  CHECK(!calib_running_ && !batch_is_set_);
+  VLOG(1) << "Set Batch Waiting finished";
+  for (const auto& it : data) {
+    auto devptr = dev_buffers_.find(it.first);
+    if (devptr == dev_buffers_.end()) {
+      LOG(FATAL) << "FATAL " << engine_name_ << " input name '" << it.first
+                 << "' does not match with the buffer names";
+    }
+    const auto& d = devptr->second;
+
+    // TODO(aaroey): we should not use sync copy on default stream. Make sure
+    // stream->ThenMemcpy() is used in future PRs.
+    // TODO(sami,aaroey): Need to figure out a way to ensure synchronization
+    // between stream, perhaps using a tensor?
+    auto status = cudaMemcpyAsync(d.first, it.second, d.second,
+                                  cudaMemcpyDeviceToDevice, stream);
+    if (status != cudaSuccess) {
+      LOG(FATAL) << "cudaMemcpyAsync " << engine_name_ << " for '" << it.first
+                 << "' failed with " << status;
+    }
+  }
+
+  // TODO(sami,aaroey): Find an alternative way!
+  cudaStreamSynchronize(
+      stream);  // we have to wait for the stream before returning!
+  batch_is_set_ = true;
+  cond_.notify_all();
+  return true;
+}
+
+bool TRTInt8Calibrator::getBatch(void** bindings, const char** names,
+                                 int num_bindings) {
+  tensorflow::mutex_lock lock(cond_mtx_);
+  calib_running_ = false;
+  cond_.notify_all();
+  while (!batch_is_set_ && !done_) {  // wait until new batch arrives
+    cond_.wait(lock);
+
+  }
+  if (done_) {
+    return false;
+  }
+
+  for (int i = 0; i < num_bindings; i++) {
+    auto it = dev_buffers_.find(names[i]);
+    if (it == dev_buffers_.end()) {
+      LOG(FATAL) << "Calibration engine asked for unknown tensor name '"
+                 << names[i] << "' at position " << i;
+    }
+
+    bindings[i] = it->second.first;
+  }
+  batch_is_set_ = false;
+  calib_running_ = true;
+  return true;
+}
+
+const void* TRTInt8Calibrator::readCalibrationCache(std::size_t& length) {
+  return nullptr;
+}
+
+void TRTInt8Calibrator::setDone() {
+  tensorflow::mutex_lock lock(cond_mtx_);
+  done_ = true;
+  cond_.notify_all();
+}
+
+void TRTInt8Calibrator::writeCalibrationCache(const void* ptr,
+                                              std::size_t length) {}
+TRTInt8Calibrator::~TRTInt8Calibrator() {
+  VLOG(1) << "Destroying calibrator for " << engine_name_;
+}
+
+}  // namespace tensorrt
+}  // namespace tensorflow
+
+#endif
+#endif
diff --git a/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h
new file mode 100644
index 0000000000000000000000000000000000000000..d77aa2c5ab184756adaee38f88180b3c128ebe03
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h
@@ -0,0 +1,72 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_INT8_CALIBRATOR_H_
+#define TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_INT8_CALIBRATOR_H_
+
+#include <atomic>
+#include <string>
+#include <unordered_map>
+#include <utility>
+#include "tensorflow/core/platform/mutex.h"
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
+
+#include "cuda/include/cuda_runtime_api.h"
+#include "tensorrt/include/NvInfer.h"
+
+namespace tensorflow {
+namespace tensorrt {
+// This class provides a one-element queue to match TF's push model to
+// TRT's pull model for calibration. When TRT implements a means for
+// push calibration, this class should be updated accordingly.
+
+struct TRTInt8Calibrator : public nvinfer1::IInt8EntropyCalibrator {
+ public:
+  TRTInt8Calibrator(
+      const std::unordered_map<string, std::pair<void*, size_t>>& dev_buffers,
+      int batch_size, string engine_name);
+  int getBatchSize() const override;
+  bool getBatch(void* bindings[], const char* names[],
+                int num_bindings) override;
+  bool setBatch(const std::unordered_map<string, void*>& data,
+                const cudaStream_t stream);
+  void setDone();
+  const void* readCalibrationCache(std::size_t& length) override;
+  void writeCalibrationCache(const void* ptr, std::size_t length) override;
+  ~TRTInt8Calibrator();
+
+ private:
+  const int batch_size_;
+  tensorflow::mutex cond_mtx_;           // mutex for condition_variable
+  tensorflow::condition_variable cond_;  // condition variable to implement
+                                         // producer-consumer queue for
+                                         // calibration
+  bool done_;
+  const std::unordered_map<string, std::pair<void*, size_t>>
+      dev_buffers_;  // map to keep tensorrt input buffers and sizes keyed with
+                     // buffer names
+  bool calib_running_;
+  bool batch_is_set_;
+  string engine_name_;
+};
+
+}  // namespace tensorrt
+}  // namespace tensorflow
+
+#endif
+#endif
+#endif  // TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_INT8_CALIBRATOR_H_
diff --git a/tensorflow/contrib/tensorrt/resources/trt_resource_manager.cc b/tensorflow/contrib/tensorrt/resources/trt_resource_manager.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e663eed4dd6704e2f41bde1dfabd411e86669ecd
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_resource_manager.cc
@@ -0,0 +1,39 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/tensorrt/resources/trt_resource_manager.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace tensorflow {
+namespace tensorrt {
+
+std::shared_ptr<tensorflow::ResourceMgr>
+tensorflow::tensorrt::TRTResourceManager::getManager(const string& op_name) {
+  // mutex is held for lookup only. Most instantiations where mutex will be held
+  // longer will be during op creation and should be ok.
+  tensorflow::mutex_lock lock(map_mutex_);
+  auto s = managers_.find(op_name);
+  if (s == managers_.end()) {
+    auto it = managers_.emplace(
+        op_name, std::make_shared<tensorflow::ResourceMgr>(op_name));
+    VLOG(1) << "Returning a new manager " << op_name;
+    return it.first->second;
+  }
+  VLOG(1) << "Returning old manager " << op_name;
+  return s->second;
+}
+
+}  // namespace tensorrt
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/tensorrt/resources/trt_resource_manager.h b/tensorflow/contrib/tensorrt/resources/trt_resource_manager.h
new file mode 100644
index 0000000000000000000000000000000000000000..5f8ad491d3c13e8911b0b95c3e95e19afe4d59c0
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_resource_manager.h
@@ -0,0 +1,49 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_RESOURCE_MANAGER_H_
+#define TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_RESOURCE_MANAGER_H_
+#include <memory>
+
+#include <string>
+#include <unordered_map>
+#include "tensorflow/core/framework/resource_mgr.h"
+#include "tensorflow/core/platform/mutex.h"
+
+namespace tensorflow {
+namespace tensorrt {
+
+class TRTResourceManager {
+  TRTResourceManager() = default;
+
+ public:
+  static std::shared_ptr<TRTResourceManager> instance() {
+    static std::shared_ptr<TRTResourceManager> instance_(
+        new TRTResourceManager);
+    return instance_;
+  }
+  // Returns the manager for the given op, creating one if it doesn't exist.
+  std::shared_ptr<tensorflow::ResourceMgr> getManager(const string& op_name);
+
+ private:
+  std::unordered_map<string, std::shared_ptr<tensorflow::ResourceMgr>>
+      managers_;
+  tensorflow::mutex map_mutex_;
+};
+
+}  // namespace tensorrt
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_RESOURCE_MANAGER_H_
diff --git a/tensorflow/contrib/tensorrt/resources/trt_resources.h b/tensorflow/contrib/tensorrt/resources/trt_resources.h
new file mode 100644
index 0000000000000000000000000000000000000000..3c85968ae7acf5c5fc567be6805a5d226b1094c7
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_resources.h
@@ -0,0 +1,95 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRTRESOURCES_H_
+#define TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRTRESOURCES_H_
+
+#include <list>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+#include "tensorflow/contrib/tensorrt/log/trt_logger.h"
+#include "tensorflow/core/framework/resource_mgr.h"
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
+#include "tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h"
+#include "tensorrt/include/NvInfer.h"
+
+namespace tensorflow {
+namespace tensorrt {
+class TRTCalibrationResource : public tensorflow::ResourceBase {
+ public:
+  TRTCalibrationResource()
+      : calibrator_(nullptr),
+        builder_(nullptr),
+        network_(nullptr),
+        engine_(nullptr),
+        logger_(nullptr),
+        thr_(nullptr) {}
+  string DebugString() override {
+    std::stringstream oss;
+    oss << " Calibrator = " << std::hex << calibrator_ << std::dec << std::endl
+        << " Builder    = " << std::hex << builder_ << std::dec << std::endl
+        << " Network    = " << std::hex << network_ << std::dec << std::endl
+        << " Engine     = " << std::hex << engine_ << std::dec << std::endl
+        << " Logger     = " << std::hex << logger_ << std::dec << std::endl
+        << " Thread     = " << std::hex << thr_ << std::dec << std::endl;
+    return oss.str();
+  }
+  ~TRTCalibrationResource() {
+    VLOG(0) << "Destroying Calibration Resource " << std::endl << DebugString();
+  }
+  TRTInt8Calibrator* calibrator_;
+  nvinfer1::IBuilder* builder_;
+  nvinfer1::INetworkDefinition* network_;
+  nvinfer1::ICudaEngine* engine_;
+  tensorflow::tensorrt::Logger* logger_;
+  // TODO(sami): Use threadpool threads!
+  std::thread* thr_;
+};
+
+class TRTWeightStore : public tensorflow::ResourceBase {
+ public:
+  TRTWeightStore() {}
+  std::list<std::vector<uint8_t>> store_;
+  string DebugString() override {
+    std::stringstream oss;
+    size_t lenBytes = 0;
+    for (const auto& v : store_) {
+      lenBytes += v.size() * sizeof(uint8_t);
+    }
+    oss << " Number of entries     = " << store_.size() << std::endl
+        << " Total number of bytes = "
+        << store_.size() * sizeof(std::vector<uint8_t>) + lenBytes << std::endl;
+    return oss.str();
+  }
+  virtual ~TRTWeightStore() { VLOG(1) << "Destroying store" << DebugString(); }
+};
+
+class TRTEngineResource : public tensorflow::ResourceBase {
+ public:
+  TRTEngineResource() : runtime_(nullptr), ctx_(nullptr){};
+  string DebugString() override { return string(""); }
+  nvinfer1::IRuntime* runtime_;
+  nvinfer1::IExecutionContext* ctx_;
+};
+
+}  // namespace tensorrt
+}  // namespace tensorflow
+#endif  // GOOGLE_TENSORRT
+#endif  // GOOGLE_CUDA
+#endif  // TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRTRESOURCES_H_
diff --git a/tensorflow/contrib/tensorrt/segment/segment.cc b/tensorflow/contrib/tensorrt/segment/segment.cc
index 6193f0b0a13f6985d5fc8dd4c6fc09b15f72f139..8fc4697c513057c668d31a341cb13f60dc107e81 100644
--- a/tensorflow/contrib/tensorrt/segment/segment.cc
+++ b/tensorflow/contrib/tensorrt/segment/segment.cc
@@ -80,13 +80,20 @@ void ContractEdge(tensorflow::Edge* edge, tensorflow::Graph* graph,
   std::vector<const tensorflow::Edge*> in_edges(dst->in_edges().begin(),
                                                 dst->in_edges().end());
   for (const tensorflow::Edge* in_edge : in_edges) {
-    if (in_edge->src() != src) {
-      tensorflow::Edge* e = const_cast<tensorflow::Edge*>(in_edge);
-      if (e->src() == graph->source_node()) {
-        graph->AddEdge(e->src(), e->src_output(), src,
-                       tensorflow::Graph::kControlSlot);
-      } else {
-        graph->AddEdge(e->src(), e->src_output(), src, 0 /* input index */);
+    if (in_edge->IsControlEdge()) {
+      if (in_edge->src() != src) {
+        tensorflow::Edge* e = const_cast<tensorflow::Edge*>(in_edge);
+        graph->AddControlEdge(e->src(), src);
+      }
+    } else {
+      if (in_edge->src() != src) {
+        tensorflow::Edge* e = const_cast<tensorflow::Edge*>(in_edge);
+        if (e->src() == graph->source_node()) {
+          graph->AddEdge(e->src(), e->src_output(), src,
+                         tensorflow::Graph::kControlSlot);
+        } else {
+          graph->AddEdge(e->src(), e->src_output(), src, 0 /* input index */);
+        }
       }
     }
   }
@@ -94,12 +101,19 @@ void ContractEdge(tensorflow::Edge* edge, tensorflow::Graph* graph,
   std::vector<const tensorflow::Edge*> out_edges(dst->out_edges().begin(),
                                                  dst->out_edges().end());
   for (const tensorflow::Edge* out_edge : out_edges) {
-    tensorflow::Edge* e = const_cast<tensorflow::Edge*>(out_edge);
-    if (e->dst() == graph->sink_node()) {
-      graph->AddEdge(src, tensorflow::Graph::kControlSlot, e->dst(),
-                     e->dst_input());
+    if (out_edge->IsControlEdge()) {
+      tensorflow::Edge* e = const_cast<tensorflow::Edge*>(out_edge);
+      graph->AddControlEdge(src, e->dst());
     } else {
-      graph->AddEdge(src, 0 /* output index */, e->dst(), e->dst_input());
+      tensorflow::Edge* e = const_cast<tensorflow::Edge*>(out_edge);
+      if (e->dst() == graph->sink_node()) {
+        VLOG(1) << " edge to sink node " << src->name() << " -> "
+                << e->dst()->name();
+        graph->AddEdge(src, tensorflow::Graph::kControlSlot, e->dst(),
+                       e->dst_input());
+      } else {
+        graph->AddEdge(src, 0 /* output index */, e->dst(), e->dst_input());
+      }
     }
   }
 
@@ -118,7 +132,7 @@ void ContractEdge(tensorflow::Edge* edge, tensorflow::Graph* graph,
 
 tensorflow::Status SegmentGraph(
     const tensorflow::GraphDef& gdef,
-    const std::function<bool(const tensorflow::NodeDef&)>& candidate_fn,
+    const std::function<bool(const tensorflow::Node*)>& candidate_fn,
     const SegmentOptions& options, SegmentNodesVector* segments) {
   // Create a Graph representation of the GraphDef.
   tensorflow::FunctionLibraryDefinition flib(tensorflow::OpRegistry::Global(),
@@ -136,7 +150,7 @@ tensorflow::Status SegmentGraph(
   for (int i = 0; i < graph.num_node_ids(); ++i) {
     tensorflow::Node* node = graph.FindNodeId(i);
     if (options.exclude_node_list.count(node->name()) != 0 ||
-        !candidate_fn(node->def())) {
+        !candidate_fn(node)) {
       node = nullptr;
     }
     node_segments.emplace_back(node);
@@ -155,7 +169,7 @@ tensorflow::Status SegmentGraph(
 
   for (const tensorflow::Node* node : order) {
     // All output nodes of 'node' have been visited...
-    VLOG(2) << "Trying node " << node->name();
+    VLOG(2) << "Trying node " << node->name() << " id=" << node->id();
 
     // 'node' must be a TRT candidate...
     if (node_segments[node->id()].Value() == nullptr) {
@@ -169,8 +183,12 @@ tensorflow::Status SegmentGraph(
     while (true) {
       std::set<const tensorflow::Edge*> contract_edges;
       for (const tensorflow::Edge* out_edge : node->out_edges()) {
-        VLOG(2) << "... out node " << out_edge->dst()->name();
-
+        VLOG(2) << "... out node " << out_edge->dst()->name() << " ( "
+                << out_edge->dst()->id() << " <- " << node->id() << " )";
+        if (out_edge->IsControlEdge()) {
+          VLOG(2) << "... ... Control Edge, Skipping";
+          continue;
+        }
         // Out node must be TRT candidate...
         if (node_segments[out_edge->dst()->id()].Value() == nullptr) {
           VLOG(2) << "... ... not a TRT candidate";
@@ -196,7 +214,8 @@ tensorflow::Status SegmentGraph(
         const tensorflow::Node* src = contract_edge->src();
         const tensorflow::Node* dst = contract_edge->dst();
 
-        VLOG(2) << "Merge " << src->name() << " <- " << dst->name();
+        VLOG(2) << "Merge " << src->name() << " <- " << dst->name() << " ("
+                << src->id() << " <- " << dst->id() << ")";
         node_segments[src->id()].Merge(&node_segments[dst->id()]);
 
         // Contracting the edge leaves disconnected graph edges.
diff --git a/tensorflow/contrib/tensorrt/segment/segment.h b/tensorflow/contrib/tensorrt/segment/segment.h
index ee6e2b3ed26cd1fabc0e952d882d549046cd9a30..7e8685f44a8c8a20fd7159ee40a8835531e78e9f 100644
--- a/tensorflow/contrib/tensorrt/segment/segment.h
+++ b/tensorflow/contrib/tensorrt/segment/segment.h
@@ -20,10 +20,12 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/core/framework/graph.pb.h"
+#include "tensorflow/core/graph/graph.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/types.h"
 
 namespace tensorflow {
+
 namespace tensorrt {
 namespace segment {
 
@@ -46,7 +48,7 @@ struct SegmentOptions {
 // @return the status.
 tensorflow::Status SegmentGraph(
     const tensorflow::GraphDef& gdef,
-    const std::function<bool(const tensorflow::NodeDef&)>& candidate_fn,
+    const std::function<bool(const tensorflow::Node*)>& candidate_fn,
     const SegmentOptions& options, SegmentNodesVector* segments);
 
 }  // namespace segment
diff --git a/tensorflow/contrib/tensorrt/segment/segment_test.cc b/tensorflow/contrib/tensorrt/segment/segment_test.cc
index 74cbc5f2b376b76324eed06d251767da6f928e3e..7ddabec268d4ef7b5c679001e5fb99aa7d83aec0 100644
--- a/tensorflow/contrib/tensorrt/segment/segment_test.cc
+++ b/tensorflow/contrib/tensorrt/segment/segment_test.cc
@@ -35,7 +35,7 @@ class SegmentTest : public ::testing::Test {
   TF_Operation* Add(TF_Operation* l, TF_Operation* r, TF_Graph* graph,
                     TF_Status* s, const char* name);
 
-  std::function<bool(const NodeDef&)> MakeCandidateFn(
+  std::function<bool(const Node*)> MakeCandidateFn(
       const std::set<string>& node_names);
 
  protected:
@@ -60,10 +60,10 @@ bool SegmentTest::GetGraphDef(TF_Graph* graph,
   return ret;
 }
 
-std::function<bool(const NodeDef&)> SegmentTest::MakeCandidateFn(
+std::function<bool(const Node*)> SegmentTest::MakeCandidateFn(
     const std::set<string>& node_names) {
-  return [node_names](const NodeDef& node) -> bool {
-    return node_names.find(node.name()) != node_names.end();
+  return [node_names](const Node* node) -> bool {
+    return node_names.find(node->name()) != node_names.end();
   };
 }
 
diff --git a/tensorflow/contrib/tensorrt/test/test_tftrt.py b/tensorflow/contrib/tensorrt/test/test_tftrt.py
index c78f6f222457a875525e768eacc9a4ebf28ad504..ad01bedd8fa066e914b05b20dbc47d9aabe790d9 100644
--- a/tensorflow/contrib/tensorrt/test/test_tftrt.py
+++ b/tensorflow/contrib/tensorrt/test/test_tftrt.py
@@ -60,6 +60,7 @@ def get_simple_graph_def():
 
 
 def run_graph(gdef, dumm_inp):
+  """Run given graphdef once."""
   gpu_options = cpb2.GPUOptions(per_process_gpu_memory_fraction=0.50)
   ops.reset_default_graph()
   g = ops.Graph()
@@ -74,15 +75,65 @@ def run_graph(gdef, dumm_inp):
   return val
 
 
+# Use real data that is representative of the inference dataset
+# for calibration. For this test script it is random data.
+def run_calibration(gdef, dumm_inp):
+  """Run given calibration graph multiple times."""
+  gpu_options = cpb2.GPUOptions(per_process_gpu_memory_fraction=0.50)
+  ops.reset_default_graph()
+  g = ops.Graph()
+  with g.as_default():
+    inp, out = importer.import_graph_def(
+        graph_def=gdef, return_elements=["input", "output"])
+    inp = inp.outputs[0]
+    out = out.outputs[0]
+  with csess.Session(
+      config=cpb2.ConfigProto(gpu_options=gpu_options), graph=g) as sess:
+    # Run over real calibration data here. We are mimicking a calibration set
+    # of 30 different batches. Use as much calibration data as you want.
+    for _ in range(30):
+      val = sess.run(out, {inp: dumm_inp})
+  return val
+
+
 if "__main__" in __name__:
   inp_dims = (100, 24, 24, 2)
   dummy_input = np.random.random_sample(inp_dims)
-  gdef = get_simple_graph_def()
+  orig_graph = get_simple_graph_def()  # use a frozen graph for inference
   # Get optimized graph
-  trt_graph = trt.create_inference_graph(gdef, ["output"], inp_dims[0])
-  o1 = run_graph(gdef, dummy_input)
+  trt_graph = trt.create_inference_graph(
+      input_graph_def=orig_graph,
+      outputs=["output"],
+      max_batch_size=inp_dims[0],
+      max_workspace_size_bytes=1 << 25,
+      precision_mode="FP32",  # TRT Engine precision "FP32","FP16" or "INT8"
+      minimum_segment_size=2  # minimum number of nodes in an engine
+  )
+  o1 = run_graph(orig_graph, dummy_input)
   o2 = run_graph(trt_graph, dummy_input)
   o3 = run_graph(trt_graph, dummy_input)
   assert np.array_equal(o1, o2)
   assert np.array_equal(o3, o2)  # sanity check
+  fp16_graph = trt.create_inference_graph(
+      input_graph_def=orig_graph,
+      outputs=["output"],
+      max_batch_size=inp_dims[0],
+      max_workspace_size_bytes=1 << 25,
+      precision_mode="FP16",  # TRT Engine precision "FP32","FP16" or "INT8"
+      minimum_segment_size=2  # minimum number of nodes in an engine
+  )
+  int8_calib_gdef = trt.create_inference_graph(
+      input_graph_def=orig_graph,
+      outputs=["output"],
+      max_batch_size=inp_dims[0],
+      max_workspace_size_bytes=1 << 25,
+      precision_mode="INT8",  # TRT Engine precision "FP32","FP16" or "INT8"
+      minimum_segment_size=2  # minimum number of nodes in an engine
+  )
+  o4 = run_graph(fp16_graph, dummy_input)
+  _ = run_calibration(int8_calib_gdef, dummy_input)
+  int8_graph = trt.calib_graph_to_infer_graph(int8_calib_gdef)
+  o5 = run_graph(int8_graph, dummy_input)
+  assert np.allclose(o1, o4)
+  assert np.allclose(o1, o5)
   print("Pass")
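Note on the comparisons above: the FP32 outputs are checked with `np.array_equal`, while the FP16 and INT8 outputs are checked with `np.allclose`. The following is a minimal NumPy-only sketch of why an exact comparison is not expected to hold once precision is reduced. It does not use TensorRT; the float16 round trip is only a stand-in for a reduced-precision engine, and the `rtol`/`atol` values are illustrative assumptions, not values taken from the test.

```python
import numpy as np

# Hypothetical stand-in for a full-precision graph output.
reference = np.linspace(0.1, 0.9, 9).astype(np.float32)

# Emulate reduced precision by round-tripping the values through float16.
reduced = reference.astype(np.float16).astype(np.float32)

# Exact equality breaks as soon as any value is rounded.
exact_match = np.array_equal(reference, reduced)

# np.allclose defaults to rtol=1e-5 and atol=1e-8.
close_default = np.allclose(reference, reduced)

# Tolerances loosened to roughly float16 resolution (assumed values).
close_relaxed = np.allclose(reference, reduced, rtol=1e-2, atol=1e-3)

print(exact_match, close_default, close_relaxed)
```

Whether the default tolerances are adequate for a real converted graph depends on the network and the data, so any explicit tolerance should be chosen from measured output differences.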
diff --git a/tensorflow/contrib/tensorrt/trt_conversion.i b/tensorflow/contrib/tensorrt/trt_conversion.i
index d679945d569c1784448b6cb09c2f431b9cda56d7..46480e99a113afb34702b0ecd71468d4bdc83f98 100644
--- a/tensorflow/contrib/tensorrt/trt_conversion.i
+++ b/tensorflow/contrib/tensorrt/trt_conversion.i
@@ -64,13 +64,17 @@ PyObject* pair_helper(std::pair<string, string>* in) {
 %ignoreall
 %unignore tensorflow;
 %unignore trt_convert;
+%unignore calib_convert;
 
 %{
+
 std::pair<string, string> trt_convert(
     string graph_def_string,  // The serialized GraphDef string.
     std::vector<string> output_names,
     size_t max_batch_size,
-    size_t max_workspace_size_bytes
+    size_t max_workspace_size_bytes,
+    int precision_mode,
+    int minimum_segment_size
     // Unfortunately we can't use TF_Status here since it
     // is in c/c_api and brings in a lot of other libraries
     // which in turn declare ops. These ops are included
@@ -90,16 +94,64 @@ std::pair<string, string> trt_convert(
     return std::pair<string, string>{out_status, ""};
   }
 
+  if (precision_mode < 0 || precision_mode > 2) {
+    out_status = "InvalidArgument;Invalid precision_mode";
+    return std::pair<string, string>{out_status, ""};
+  }
   if (!output_names.size()) {
     out_status = "InvalidArgument;Size of the output_names vector is 0";
     return std::pair<string, string>{out_status, ""};
-    // return "";
   }
   tensorflow::GraphDef outGraph;
   tensorflow::Status conversion_status =
       tensorflow::tensorrt::convert::ConvertGraphDefToTensorRT(
           graph_def, output_names, max_batch_size, max_workspace_size_bytes,
-          &outGraph);
+          &outGraph, precision_mode, minimum_segment_size);
+  if (!conversion_status.ok()) {
+    auto retCode = (int)conversion_status.code();
+    char buff[2000];
+    snprintf(buff, 2000, "%d;%s", retCode,
+             conversion_status.error_message().c_str());
+    out_status = buff;
+    return std::pair<string, string>{out_status, ""};
+  }
+  string result;
+  if (!outGraph.SerializeToString(&result)) {
+    out_status = "InvalidArgument;Couldn't serialize output as a GraphDef";
+    return std::pair<string, string>{out_status, ""};
+  }
+  out_status = "OK;All good!";
+  return std::pair<string, string>{out_status, result};
+#else
+  // Returns FAILED_PRECONDITION.
+  return std::pair<string, string>{"9;TensorRT is not enabled!", ""};
+#endif  // GOOGLE_CUDA && GOOGLE_TENSORRT
+}
+
+std::pair<string, string> calib_convert(string graph_def_string  //  const tensorflow::GraphDef&
+    // Unfortunately we can't use TF_Status here since it
+    // is in c/c_api and brings in a lot of other libraries
+    // which in turn declare ops. These ops are included
+    // statically in our library and cause an abort when the
+    // module is loaded, due to double registration.
+    // Until TensorFlow properly exposes these headers,
+    // we have to work around this by returning a string
+    // and converting it to an exception on the Python side.
+    //,TF_Status* out_status) {
+) {
+#if GOOGLE_CUDA && GOOGLE_TENSORRT
+  string out_status;
+
+  tensorflow::GraphDef graph_def;
+  if (!graph_def.ParseFromString(graph_def_string)) {
+    out_status = "InvalidArgument;Couldn't interpret input as a GraphDef";
+    return std::pair<string, string>{out_status, ""};
+  }
+
+  tensorflow::GraphDef outGraph;
+  tensorflow::Status conversion_status =
+      tensorflow::tensorrt::convert::ConvertCalibGraphToInferGraph(graph_def,
+                                                                   &outGraph);
   if (!conversion_status.ok()) {
     auto retCode = (int)conversion_status.code();
     char buff[2000];
@@ -122,10 +174,13 @@ std::pair<string, string> trt_convert(
 }
 %}
 
+std::pair<string, string> calib_convert(string graph_def_string);
+
 std::pair<string, string> trt_convert(string graph_def_string,
                                       std::vector<string> output_names,
                                       size_t max_batch_size,
-                                      size_t max_workspace_size_bytes);
+                                      size_t max_workspace_size_bytes,
+                                      int precision_mode, int minimum_segment_size);
 
 
 %unignoreall
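Note on the status strings returned by `trt_convert` and `calib_convert`: both pack a status into the first element of the pair as `code;message`, for example `OK;All good!`, `InvalidArgument;...` or `9;TensorRT is not enabled!`. The sketch below shows one way a caller could unpack that convention. The helper names `parse_status` and `raise_on_error` are hypothetical and are not part of this change; the actual Python wrapper may handle the string differently.

```python
def parse_status(status):
    """Split a 'code;message' status string at the first semicolon."""
    code, sep, message = status.partition(";")
    if not sep:
        raise ValueError("malformed status string: %r" % (status,))
    return code, message


def raise_on_error(status):
    """Return the message for an 'OK' status, otherwise raise RuntimeError."""
    code, message = parse_status(status)
    if code != "OK":
        raise RuntimeError("conversion failed (%s): %s" % (code, message))
    return message


print(parse_status("9;TensorRT is not enabled!"))
```

Splitting only at the first semicolon keeps any later semicolons inside the error message intact.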
diff --git a/tensorflow/contrib/testing/python/framework/fake_summary_writer.py b/tensorflow/contrib/testing/python/framework/fake_summary_writer.py
index f2065c666255984c8ab770fc10f682b1eabad095..15a415df303df5be44e89c00005cb253ae2af286 100644
--- a/tensorflow/contrib/testing/python/framework/fake_summary_writer.py
+++ b/tensorflow/contrib/testing/python/framework/fake_summary_writer.py
@@ -18,6 +18,7 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.core.framework import summary_pb2
+from tensorflow.python.framework import test_util
 from tensorflow.python.summary.writer import writer
 from tensorflow.python.summary.writer import writer_cache
 
@@ -85,7 +86,11 @@ class FakeSummaryWriter(object):
     if expected_added_graphs is not None:
       test_case.assertEqual(expected_added_graphs, self._added_graphs)
     if expected_added_meta_graphs is not None:
-      test_case.assertEqual(expected_added_meta_graphs, self._added_meta_graphs)
+      test_case.assertEqual(len(expected_added_meta_graphs),
+                            len(self._added_meta_graphs))
+      for expected, actual in zip(expected_added_meta_graphs,
+                                  self._added_meta_graphs):
+        test_util.assert_meta_graph_protos_equal(test_case, expected, actual)
     if expected_session_logs is not None:
       test_case.assertEqual(expected_session_logs, self._added_session_logs)
 
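Note on the length check added above: `zip` stops at the shorter of its inputs, so an element-wise comparison alone would silently pass when one list has extra trailing entries. A minimal sketch of the same pattern in plain Python follows. `assert_pairwise_equal` and the `compare` callback are hypothetical names used for illustration only; they stand in for the test case assertions and `assert_meta_graph_protos_equal`.

```python
def assert_pairwise_equal(expected, actual, compare):
    """Check two sequences element by element with a custom comparison."""
    # Compare lengths first: zip() would otherwise ignore extra elements.
    if len(expected) != len(actual):
        raise AssertionError(
            "length mismatch: %d != %d" % (len(expected), len(actual)))
    for index, (want, got) in enumerate(zip(expected, actual)):
        if not compare(want, got):
            raise AssertionError("element %d differs" % index)


# Example comparison that ignores ordering inside each element.
same_members = lambda a, b: sorted(a) == sorted(b)
assert_pairwise_equal([[1, 2], [3]], [[2, 1], [3]], same_members)
```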
diff --git a/tensorflow/contrib/timeseries/examples/data/multivariate_periods.csv b/tensorflow/contrib/timeseries/examples/data/multivariate_periods.csv
index b49a0662c29b1d810f4be31ca1f318f0571f533e..9b15b4f0b26f11ac3281ca4206654872984628b6 100644
--- a/tensorflow/contrib/timeseries/examples/data/multivariate_periods.csv
+++ b/tensorflow/contrib/timeseries/examples/data/multivariate_periods.csv
@@ -1,100 +1,100 @@
-0,0.926906299771,1.99107237682,2.56546245685,3.07914768197,4.04839057867,1.,0.
-1,0.108010001864,1.41645361423,2.1686839775,2.94963962176,4.1263503303,1.,0.
-2,-0.800567600028,1.0172132907,1.96434754116,2.99885333086,4.04300485864,1.,0.
-3,0.0607042871898,0.719540073421,1.9765012584,2.89265588817,4.0951014426,1.,0.
-4,0.933712200629,0.28052120776,1.41018552514,2.69232603996,4.06481164223,1.,0.
-5,-0.171730652974,0.260054421028,1.48770816369,2.62199129293,4.44572807842,1.,0.
-6,-1.00180162933,0.333045158863,1.50006392277,2.88888309683,4.24755865606,1.,0.
-7,0.0580061875336,0.688929398826,1.56543458772,2.99840358953,4.52726873347,1.,0.
-8,0.764139447412,1.24704875327,1.77649279698,3.13578593851,4.63238922951,1.,0.
-9,-0.230331874785,1.47903998963,2.03547545751,3.20624030377,4.77980005228,1.,0.
-10,-1.03846045211,2.01133000781,2.31977503972,3.67951536251,5.09716775897,1.,0.
-11,0.188643592253,2.23285349038,2.68338482249,3.49817168611,5.24928239634,1.,0.
-12,0.91207302309,2.24244446841,2.71362604985,3.96332587625,5.37802271594,1.,0.
-13,-0.296588665881,2.02594634141,3.07733910479,3.99698324956,5.56365901394,1.,0.
-14,-0.959961476551,1.45078629833,3.18996420137,4.3763059609,5.65356015609,1.,0.
-15,0.46313530679,1.01141441548,3.4980215948,4.20224896882,5.88842247449,1.,0.
-16,0.929354125798,0.626635305936,3.70508262244,4.51791573544,5.73945973251,1.,0.
-17,-0.519110731957,0.269249223148,3.39866823332,4.46802003061,5.82768174382,1.,0.
-18,-0.924330981367,0.349602834684,3.21762413294,4.72803587499,5.94918925767,1.,0.
-19,0.253239387885,0.345158023497,3.11071425333,4.79311566935,5.9489259713,1.,0.
-20,0.637408390225,0.698996675371,3.25232492145,4.73814732384,5.9612010251,1.,0.
-21,-0.407396859412,1.17456342803,2.49526823723,4.59323415742,5.82501686811,1.,0.
-22,-0.967485452118,1.66655933642,2.47284606244,4.58316034754,5.88721406681,1.,0.
-23,0.474480867904,1.95018556323,2.0228950072,4.48651142819,5.8255943735,1.,0.
-24,1.04309652155,2.23519892356,1.91924131572,4.19094661783,5.87457348436,1.,0.
-25,-0.517861513772,2.12501967336,1.70266619979,4.05280882887,5.72160912899,1.,0.
-26,-0.945301585146,1.65464653549,1.81567174251,3.92309850635,5.58270493814,1.,0.
-27,0.501153868974,1.40600764889,1.53991387719,3.72853247942,5.60169001727,1.,0.
-28,0.972859524418,1.00344321868,1.5175642828,3.64092376655,5.10567722582,1.,0.
-29,-0.70553406135,0.465306263885,1.7038540803,3.33236870312,5.09182481555,1.,0.
-30,-0.946093634916,0.294539309453,1.88052827037,2.93011492669,4.97354922696,1.,0.
-31,0.47922123231,0.308465865031,2.03445883031,2.90772899045,4.86241793548,1.,0.
-32,0.754030014252,0.549752241167,2.46115815089,2.95063349534,4.71834614627,1.,0.
-33,-0.64875949826,0.894615488148,2.5922463381,2.81269864022,4.43480095104,1.,0.
-34,-0.757829951086,1.39123914261,2.69258079904,2.61834837315,4.36580046156,1.,0.
-35,0.565653301088,1.72360022693,2.97794913834,2.80403840334,4.27327248459,1.,0.
-36,0.867440092372,2.21100730052,3.38648090792,2.84057515729,4.12210169576,1.,0.
-37,-0.894567758095,2.17549105818,3.45532493329,2.90446025717,4.00251740584,1.,0.
-38,-0.715442356893,2.15105389965,3.52041791902,3.03650393392,4.12809249577,1.,0.
-39,0.80671703672,1.81504564517,3.60463324866,3.00747789871,3.98440762467,1.,0.
-40,0.527014790142,1.31803513865,3.43842186337,3.3332594663,4.03232406566,1.,0.
-41,-0.795936862129,0.847809114454,3.09875133548,3.52863155938,3.94883924909,1.,0.
-42,-0.610245806946,0.425530441018,2.92581949152,3.77238736123,4.27287245021,1.,0.
-43,0.611662279431,0.178432049837,2.48128214822,3.73212087883,4.17319013831,1.,0.
-44,0.650866553108,0.220341648392,2.41694642022,4.2609098519,4.27271645905,1.,0.
-45,-0.774156982023,0.632667602331,2.05474356052,4.32889204886,4.18029723271,1.,0.
-46,-0.714058448409,0.924562377599,1.75706135146,4.52492718422,4.3972678094,1.,0.
-47,0.889627293379,1.46207968841,1.78299357672,4.64466731095,4.56317887554,1.,0.
-48,0.520140662861,1.8996333843,1.41377633823,4.48899091177,4.78805049769,1.,0.
-49,-1.03816935616,2.08997002059,1.51218375351,4.84167764204,4.93026048606,1.,0.
-50,-0.40772951362,2.30878972136,1.44144415128,4.76854460997,5.01538444629,1.,0.
-51,0.792730684781,1.91367048509,1.58887384677,4.71739397335,5.25690012199,1.,0.
-52,0.371311881576,1.67565079528,1.81688563053,4.60353107555,5.44265822961,1.,0.
-53,-0.814398070371,1.13374634126,1.80328814859,4.72264252878,5.52674761122,1.,0.
-54,-0.469017949323,0.601244136627,2.29690896736,4.49859178859,5.54126153454,1.,0.
-55,0.871044371426,0.407597593794,2.7499112487,4.19060637761,5.57693767301,1.,0.
-56,0.523764933017,0.247705192709,3.09002071379,4.02095509006,5.80510362182,1.,0.
-57,-0.881326403531,0.31513103164,3.11358205718,3.96079100808,5.81000652365,1.,0.
-58,-0.357928025339,0.486163915865,3.17884556771,3.72634990659,5.85693642011,1.,0.
-59,0.853038779822,1.04218094475,3.45835384454,3.36703969978,5.9585988449,1.,0.
-60,0.435311516013,1.59715085283,3.63313338588,3.11276729421,5.93643818229,1.,0.
-61,-1.02703719138,1.92205832542,3.47606111735,3.06247155999,6.02106646259,1.,0.
-62,-0.246661325557,2.14653802542,3.29446326567,2.89936259181,5.67531541272,1.,0.
-63,1.02554736569,2.25943737733,3.07031591528,2.78176218013,5.78206328989,1.,0.
-64,0.337814475969,2.07589147224,2.80356226089,2.55888206331,5.7094075496,1.,0.
-65,-1.12023369929,1.25333011618,2.56497288445,2.77361359194,5.50799418376,1.,0.
-66,-0.178980246554,1.11937139901,2.51598681313,2.91438309151,5.47469577206,1.,0.
-67,0.97550951531,0.60553823137,2.11657741073,2.88081098981,5.37034999502,1.,0.
-68,0.136653357206,0.365828836075,1.97386033165,3.13217903204,5.07254490219,1.,0.
-69,-1.05607596951,0.153152115069,1.52110743825,3.01308794192,5.08902539125,1.,0.
-70,-0.13095280331,0.337113974483,1.52703079853,3.16687131599,4.86649398514,1.,0.
-71,1.07081057754,0.714247566736,1.53761382634,3.45151989484,4.75892309166,1.,0.
-72,0.0153410376082,1.24631231847,1.61690939161,3.85481994498,4.35683752832,1.,0.
-73,-0.912801257303,1.60791309476,1.8729264524,4.03037260012,4.36072588913,1.,0.
-74,-0.0894895640338,2.02535207407,1.93484909619,4.09557485132,4.35327025188,1.,0.
-75,0.978646999652,2.20085086625,2.09003440427,4.27542353033,4.1805058388,1.,0.
-76,-0.113312642876,2.2444100761,2.50789248839,4.4151861502,4.03267168136,1.,0.
-77,-1.00215099149,1.84305628445,2.61691237246,4.45425147595,3.81203553766,1.,0.
-78,-0.0183234614205,1.49573923116,2.99308471214,4.71134960112,4.0273804959,1.,0.
-79,1.0823738177,1.12211589848,3.27079386925,4.94288270502,4.01851068083,1.,0.
-80,0.124370187893,0.616474412808,3.4284236674,4.76942168327,3.9749536483,1.,0.
-81,-0.929423379352,0.290977090976,3.34131726136,4.78590392707,4.10190661656,1.,0.
-82,0.23766302648,0.155302052254,3.49779513794,4.64605656795,4.15571321107,1.,0.
-83,1.03531486192,0.359702776204,3.4880725919,4.48167586667,4.21134561991,1.,0.
-84,-0.261234571382,0.713877760378,3.42756426614,4.426443869,4.25208300527,1.,0.
-85,-1.03572442277,1.25001113691,2.96908341113,4.25500915322,4.25723010649,1.,0.
-86,0.380034261243,1.70543355622,2.73605932518,4.16703432307,4.63700400788,1.,0.
-87,1.03734873488,1.97544410562,2.55586572141,3.84976673263,4.55282864289,1.,0.
-88,-0.177344253372,2.22614526325,2.09565864891,3.77378097953,4.82577400298,1.,0.
-89,-0.976821526892,2.18385079177,1.78522284118,3.67768223554,5.06302440873,1.,0.
-90,0.264820472091,1.86981946157,1.50048403865,3.43619796921,5.05651761669,1.,0.
-91,1.05642344868,1.47568646076,1.51347671977,3.20898518885,5.50149047462,1.,0.
-92,-0.311607433358,1.04226467636,1.52089650905,3.02291865417,5.4889046232,1.,0.
-93,-0.724285777937,0.553052311957,1.48573560173,2.7365973598,5.72549174225,1.,0.
-94,0.519859192905,0.226520626591,1.61543723167,2.84102086852,5.69330622288,1.,0.
-95,1.0323195039,0.260873217055,1.81913034804,2.83951143848,5.90325028086,1.,0.
-96,-0.53285682538,0.387695521405,1.70935609313,2.57977050631,5.79579213161,1.,0.
-97,-0.975127997215,0.920948771589,2.51292643636,2.71004616612,5.87016469227,1.,0.
-98,0.540246804099,1.36445470181,2.61949412896,2.98482553485,6.02447664937,1.,0.
-99,0.987764008058,1.85581989607,2.84685706149,2.94760204892,6.0212151724,1.,0.
+0,0.926906299771,1.99107237682,2.56546245685,3.07914768197,4.04839057867,1.,0.,strkeya
+1,0.108010001864,1.41645361423,2.1686839775,2.94963962176,4.1263503303,1.,0.,strkeyb
+2,-0.800567600028,1.0172132907,1.96434754116,2.99885333086,4.04300485864,1.,0.,strkey
+3,0.0607042871898,0.719540073421,1.9765012584,2.89265588817,4.0951014426,1.,0.,strkey
+4,0.933712200629,0.28052120776,1.41018552514,2.69232603996,4.06481164223,1.,0.,strkey
+5,-0.171730652974,0.260054421028,1.48770816369,2.62199129293,4.44572807842,1.,0.,strkey
+6,-1.00180162933,0.333045158863,1.50006392277,2.88888309683,4.24755865606,1.,0.,strkey
+7,0.0580061875336,0.688929398826,1.56543458772,2.99840358953,4.52726873347,1.,0.,strkey
+8,0.764139447412,1.24704875327,1.77649279698,3.13578593851,4.63238922951,1.,0.,strkey
+9,-0.230331874785,1.47903998963,2.03547545751,3.20624030377,4.77980005228,1.,0.,strkey
+10,-1.03846045211,2.01133000781,2.31977503972,3.67951536251,5.09716775897,1.,0.,strkeyc
+11,0.188643592253,2.23285349038,2.68338482249,3.49817168611,5.24928239634,1.,0.,strkey
+12,0.91207302309,2.24244446841,2.71362604985,3.96332587625,5.37802271594,1.,0.,strkey
+13,-0.296588665881,2.02594634141,3.07733910479,3.99698324956,5.56365901394,1.,0.,strkey
+14,-0.959961476551,1.45078629833,3.18996420137,4.3763059609,5.65356015609,1.,0.,strkey
+15,0.46313530679,1.01141441548,3.4980215948,4.20224896882,5.88842247449,1.,0.,strkey
+16,0.929354125798,0.626635305936,3.70508262244,4.51791573544,5.73945973251,1.,0.,strkey
+17,-0.519110731957,0.269249223148,3.39866823332,4.46802003061,5.82768174382,1.,0.,strkey
+18,-0.924330981367,0.349602834684,3.21762413294,4.72803587499,5.94918925767,1.,0.,strkey
+19,0.253239387885,0.345158023497,3.11071425333,4.79311566935,5.9489259713,1.,0.,strkey
+20,0.637408390225,0.698996675371,3.25232492145,4.73814732384,5.9612010251,1.,0.,strkey
+21,-0.407396859412,1.17456342803,2.49526823723,4.59323415742,5.82501686811,1.,0.,strkey
+22,-0.967485452118,1.66655933642,2.47284606244,4.58316034754,5.88721406681,1.,0.,strkey
+23,0.474480867904,1.95018556323,2.0228950072,4.48651142819,5.8255943735,1.,0.,strkey
+24,1.04309652155,2.23519892356,1.91924131572,4.19094661783,5.87457348436,1.,0.,strkey
+25,-0.517861513772,2.12501967336,1.70266619979,4.05280882887,5.72160912899,1.,0.,strkey
+26,-0.945301585146,1.65464653549,1.81567174251,3.92309850635,5.58270493814,1.,0.,strkey
+27,0.501153868974,1.40600764889,1.53991387719,3.72853247942,5.60169001727,1.,0.,strkey
+28,0.972859524418,1.00344321868,1.5175642828,3.64092376655,5.10567722582,1.,0.,strkey
+29,-0.70553406135,0.465306263885,1.7038540803,3.33236870312,5.09182481555,1.,0.,strkey
+30,-0.946093634916,0.294539309453,1.88052827037,2.93011492669,4.97354922696,1.,0.,strkey
+31,0.47922123231,0.308465865031,2.03445883031,2.90772899045,4.86241793548,1.,0.,strkey
+32,0.754030014252,0.549752241167,2.46115815089,2.95063349534,4.71834614627,1.,0.,strkey
+33,-0.64875949826,0.894615488148,2.5922463381,2.81269864022,4.43480095104,1.,0.,strkey
+34,-0.757829951086,1.39123914261,2.69258079904,2.61834837315,4.36580046156,1.,0.,strkey
+35,0.565653301088,1.72360022693,2.97794913834,2.80403840334,4.27327248459,1.,0.,strkey
+36,0.867440092372,2.21100730052,3.38648090792,2.84057515729,4.12210169576,1.,0.,strkey
+37,-0.894567758095,2.17549105818,3.45532493329,2.90446025717,4.00251740584,1.,0.,strkeyd
+38,-0.715442356893,2.15105389965,3.52041791902,3.03650393392,4.12809249577,1.,0.,strkey
+39,0.80671703672,1.81504564517,3.60463324866,3.00747789871,3.98440762467,1.,0.,strkey
+40,0.527014790142,1.31803513865,3.43842186337,3.3332594663,4.03232406566,1.,0.,strkey
+41,-0.795936862129,0.847809114454,3.09875133548,3.52863155938,3.94883924909,1.,0.,strkey
+42,-0.610245806946,0.425530441018,2.92581949152,3.77238736123,4.27287245021,1.,0.,strkey
+43,0.611662279431,0.178432049837,2.48128214822,3.73212087883,4.17319013831,1.,0.,strkey
+44,0.650866553108,0.220341648392,2.41694642022,4.2609098519,4.27271645905,1.,0.,strkey
+45,-0.774156982023,0.632667602331,2.05474356052,4.32889204886,4.18029723271,1.,0.,strkey
+46,-0.714058448409,0.924562377599,1.75706135146,4.52492718422,4.3972678094,1.,0.,strkey
+47,0.889627293379,1.46207968841,1.78299357672,4.64466731095,4.56317887554,1.,0.,strkey
+48,0.520140662861,1.8996333843,1.41377633823,4.48899091177,4.78805049769,1.,0.,strkey
+49,-1.03816935616,2.08997002059,1.51218375351,4.84167764204,4.93026048606,1.,0.,strkey
+50,-0.40772951362,2.30878972136,1.44144415128,4.76854460997,5.01538444629,1.,0.,strkey
+51,0.792730684781,1.91367048509,1.58887384677,4.71739397335,5.25690012199,1.,0.,strkey
+52,0.371311881576,1.67565079528,1.81688563053,4.60353107555,5.44265822961,1.,0.,strkey
+53,-0.814398070371,1.13374634126,1.80328814859,4.72264252878,5.52674761122,1.,0.,strkey
+54,-0.469017949323,0.601244136627,2.29690896736,4.49859178859,5.54126153454,1.,0.,strkey
+55,0.871044371426,0.407597593794,2.7499112487,4.19060637761,5.57693767301,1.,0.,strkey
+56,0.523764933017,0.247705192709,3.09002071379,4.02095509006,5.80510362182,1.,0.,strkey
+57,-0.881326403531,0.31513103164,3.11358205718,3.96079100808,5.81000652365,1.,0.,strkey
+58,-0.357928025339,0.486163915865,3.17884556771,3.72634990659,5.85693642011,1.,0.,strkey
+59,0.853038779822,1.04218094475,3.45835384454,3.36703969978,5.9585988449,1.,0.,strkey
+60,0.435311516013,1.59715085283,3.63313338588,3.11276729421,5.93643818229,1.,0.,strkey
+61,-1.02703719138,1.92205832542,3.47606111735,3.06247155999,6.02106646259,1.,0.,strkey
+62,-0.246661325557,2.14653802542,3.29446326567,2.89936259181,5.67531541272,1.,0.,strkey
+63,1.02554736569,2.25943737733,3.07031591528,2.78176218013,5.78206328989,1.,0.,strkey
+64,0.337814475969,2.07589147224,2.80356226089,2.55888206331,5.7094075496,1.,0.,strkey
+65,-1.12023369929,1.25333011618,2.56497288445,2.77361359194,5.50799418376,1.,0.,strkey
+66,-0.178980246554,1.11937139901,2.51598681313,2.91438309151,5.47469577206,1.,0.,strkey
+67,0.97550951531,0.60553823137,2.11657741073,2.88081098981,5.37034999502,1.,0.,strkey
+68,0.136653357206,0.365828836075,1.97386033165,3.13217903204,5.07254490219,1.,0.,strkey
+69,-1.05607596951,0.153152115069,1.52110743825,3.01308794192,5.08902539125,1.,0.,strkey
+70,-0.13095280331,0.337113974483,1.52703079853,3.16687131599,4.86649398514,1.,0.,strkey
+71,1.07081057754,0.714247566736,1.53761382634,3.45151989484,4.75892309166,1.,0.,strkey
+72,0.0153410376082,1.24631231847,1.61690939161,3.85481994498,4.35683752832,1.,0.,strkey
+73,-0.912801257303,1.60791309476,1.8729264524,4.03037260012,4.36072588913,1.,0.,strkey
+74,-0.0894895640338,2.02535207407,1.93484909619,4.09557485132,4.35327025188,1.,0.,strkey
+75,0.978646999652,2.20085086625,2.09003440427,4.27542353033,4.1805058388,1.,0.,strkey
+76,-0.113312642876,2.2444100761,2.50789248839,4.4151861502,4.03267168136,1.,0.,strkey
+77,-1.00215099149,1.84305628445,2.61691237246,4.45425147595,3.81203553766,1.,0.,strkey
+78,-0.0183234614205,1.49573923116,2.99308471214,4.71134960112,4.0273804959,1.,0.,strkey
+79,1.0823738177,1.12211589848,3.27079386925,4.94288270502,4.01851068083,1.,0.,strkey
+80,0.124370187893,0.616474412808,3.4284236674,4.76942168327,3.9749536483,1.,0.,strkey
+81,-0.929423379352,0.290977090976,3.34131726136,4.78590392707,4.10190661656,1.,0.,strkey
+82,0.23766302648,0.155302052254,3.49779513794,4.64605656795,4.15571321107,1.,0.,strkey
+83,1.03531486192,0.359702776204,3.4880725919,4.48167586667,4.21134561991,1.,0.,strkey
+84,-0.261234571382,0.713877760378,3.42756426614,4.426443869,4.25208300527,1.,0.,strkey
+85,-1.03572442277,1.25001113691,2.96908341113,4.25500915322,4.25723010649,1.,0.,strkey
+86,0.380034261243,1.70543355622,2.73605932518,4.16703432307,4.63700400788,1.,0.,strkey
+87,1.03734873488,1.97544410562,2.55586572141,3.84976673263,4.55282864289,1.,0.,strkey
+88,-0.177344253372,2.22614526325,2.09565864891,3.77378097953,4.82577400298,1.,0.,strkey
+89,-0.976821526892,2.18385079177,1.78522284118,3.67768223554,5.06302440873,1.,0.,strkey
+90,0.264820472091,1.86981946157,1.50048403865,3.43619796921,5.05651761669,1.,0.,strkey
+91,1.05642344868,1.47568646076,1.51347671977,3.20898518885,5.50149047462,1.,0.,strkey
+92,-0.311607433358,1.04226467636,1.52089650905,3.02291865417,5.4889046232,1.,0.,strkey
+93,-0.724285777937,0.553052311957,1.48573560173,2.7365973598,5.72549174225,1.,0.,strkey
+94,0.519859192905,0.226520626591,1.61543723167,2.84102086852,5.69330622288,1.,0.,strkey
+95,1.0323195039,0.260873217055,1.81913034804,2.83951143848,5.90325028086,1.,0.,strkey
+96,-0.53285682538,0.387695521405,1.70935609313,2.57977050631,5.79579213161,1.,0.,strkey
+97,-0.975127997215,0.920948771589,2.51292643636,2.71004616612,5.87016469227,1.,0.,strkey
+98,0.540246804099,1.36445470181,2.61949412896,2.98482553485,6.02447664937,1.,0.,strkey
+99,0.987764008058,1.85581989607,2.84685706149,2.94760204892,6.0212151724,1.,0.,strkey
diff --git a/tensorflow/contrib/timeseries/examples/known_anomaly.py b/tensorflow/contrib/timeseries/examples/known_anomaly.py
index 7659dd308a7ee1b70d6688b85e4f6157ddee0540..e77628ddd390374d6336e3583e07ce03cdec7aea 100644
--- a/tensorflow/contrib/timeseries/examples/known_anomaly.py
+++ b/tensorflow/contrib/timeseries/examples/known_anomaly.py
@@ -46,12 +46,21 @@ def train_and_evaluate_exogenous(csv_file_name=_DATA_FILE, train_steps=300):
 
   # Indicate the format of our exogenous feature, in this case a string
   # representing a boolean value.
-  string_feature = tf.contrib.layers.sparse_column_with_keys(
-      column_name="is_changepoint", keys=["no", "yes"])
+  string_feature = tf.feature_column.categorical_column_with_vocabulary_list(
+      key="is_changepoint", vocabulary_list=["no", "yes"])
   # Specify the way this feature is presented to the model, here using a one-hot
   # encoding.
-  one_hot_feature = tf.contrib.layers.one_hot_column(
-      sparse_id_column=string_feature)
+  one_hot_feature = tf.feature_column.indicator_column(
+      categorical_column=string_feature)
+
+  def _exogenous_update_condition(times, features):
+    del times  # unused
+    # Make exogenous updates sparse by setting an update condition. This in
+    # effect allows missing exogenous features: if the condition evaluates to
+    # False, no update is performed. Otherwise we sometimes end up with "leaky"
+    # updates which add unnecessary uncertainty to the model even when there is
+    # no changepoint.
+    return tf.equal(tf.squeeze(features["is_changepoint"], axis=-1), "yes")
 
   estimator = tf.contrib.timeseries.StructuralEnsembleRegressor(
       periodicities=12,
@@ -60,13 +69,7 @@ def train_and_evaluate_exogenous(csv_file_name=_DATA_FILE, train_steps=300):
       cycle_num_latent_values=3,
       num_features=1,
       exogenous_feature_columns=[one_hot_feature],
-      # Make exogenous updates sparse by setting an update condition. This in
-      # effect allows missing exogenous features: if the condition evaluates to
-      # False, no update is performed. Otherwise we sometimes end up with
-      # "leaky" updates which add unnecessary uncertainty to the model even when
-      # there is no changepoint.
-      exogenous_update_condition=
-      lambda times, features: tf.equal(features["is_changepoint"], "yes"))
+      exogenous_update_condition=_exogenous_update_condition)
   reader = tf.contrib.timeseries.CSVReader(
       csv_file_name,
       # Indicate the format of our CSV file. First we have two standard columns,
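Note on `_exogenous_update_condition`: the new version squeezes the last axis of the feature before comparing it with `"yes"`, which suggests the feature carries a trailing dimension of size 1. The NumPy sketch below mirrors that shape handling outside TensorFlow. The batch and window sizes are made-up values for illustration, and NumPy is used here only as an analogue of `tf.squeeze` and `tf.equal`.

```python
import numpy as np

# Hypothetical input: 1 series, 4 time steps, trailing feature axis of size 1.
is_changepoint = np.array(["no", "yes", "no", "yes"])[None, :, None]

# Without the squeeze the comparison keeps the trailing axis.
unsqueezed_mask = is_changepoint == "yes"

# With the squeeze there is one boolean per (series, time step).
update_mask = np.squeeze(is_changepoint, axis=-1) == "yes"

print(unsqueezed_mask.shape, update_mask.shape)
```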
diff --git a/tensorflow/contrib/timeseries/examples/lstm.py b/tensorflow/contrib/timeseries/examples/lstm.py
index f37cafcc502dc9415db0829b9b067b862f87dca7..b1c7475442c58b9a190c818b752760a4fb4fe6f0 100644
--- a/tensorflow/contrib/timeseries/examples/lstm.py
+++ b/tensorflow/contrib/timeseries/examples/lstm.py
@@ -59,10 +59,10 @@ class _LSTMModel(ts_model.SequentialTimeSeriesModel):
       num_units: The number of units in the model's LSTMCell.
       num_features: The dimensionality of the time series (features per
         timestep).
-      exogenous_feature_columns: A list of tf.contrib.layers.FeatureColumn
-          objects representing features which are inputs to the model but are
-          not predicted by it. These must then be present for training,
-          evaluation, and prediction.
+      exogenous_feature_columns: A list of `tf.feature_column`s representing
+          features which are inputs to the model but are not predicted by
+          it. These must then be present for training, evaluation, and
+          prediction.
       dtype: The floating point data type to use.
     """
     super(_LSTMModel, self).__init__(
@@ -189,12 +189,16 @@ def train_and_predict(
     export_directory=None):
   """Train and predict using a custom time series model."""
   # Construct an Estimator from our LSTM model.
+  categorical_column = tf.feature_column.categorical_column_with_hash_bucket(
+      key="categorical_exogenous_feature", hash_bucket_size=16)
   exogenous_feature_columns = [
       # Exogenous features are not part of the loss, but can inform
       # predictions. In this example the features have no extra information, but
       # are included as an API example.
-      tf.contrib.layers.real_valued_column(
-          "2d_exogenous_feature", dimension=2)]
+      tf.feature_column.numeric_column(
+          "2d_exogenous_feature", shape=(2,)),
+      tf.feature_column.embedding_column(
+          categorical_column=categorical_column, dimension=10)]
   estimator = ts_estimators.TimeSeriesRegressor(
       model=_LSTMModel(num_features=5, num_units=128,
                        exogenous_feature_columns=exogenous_feature_columns),
@@ -205,7 +209,11 @@ def train_and_predict(
       csv_file_name,
       column_names=((tf.contrib.timeseries.TrainEvalFeatures.TIMES,)
                     + (tf.contrib.timeseries.TrainEvalFeatures.VALUES,) * 5
-                    + ("2d_exogenous_feature",) * 2))
+                    + ("2d_exogenous_feature",) * 2
+                    + ("categorical_exogenous_feature",)),
+      # Data types other than for `times` need to be specified if they aren't
+      # float32. In this case one of our exogenous features has string dtype.
+      column_dtypes=((tf.int64,) + (tf.float32,) * 7 + (tf.string,)))
   train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
       reader, batch_size=4, window_size=32)
   estimator.train(input_fn=train_input_fn, steps=training_steps)
@@ -215,7 +223,9 @@ def train_and_predict(
   predict_exogenous_features = {
       "2d_exogenous_feature": numpy.concatenate(
           [numpy.ones([1, 100, 1]), numpy.zeros([1, 100, 1])],
-          axis=-1)}
+          axis=-1),
+      "categorical_exogenous_feature": numpy.array(
+          ["strkey"] * 100)[None, :, None]}
   (predictions,) = tuple(estimator.predict(
       input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
           evaluation, steps=100,
@@ -226,20 +236,36 @@ def train_and_predict(
       [evaluation["mean"][0], predictions["mean"]], axis=0))
   all_times = numpy.concatenate([times, predictions["times"]], axis=0)
 
-  # Export the model in SavedModel format.
+  # Export the model in SavedModel format. We include a bit of extra boilerplate
+  # for "cold starting" as if we didn't have any state from the Estimator, which
+  # is the case when serving from a SavedModel. If Estimator output is
+  # available, the result of "Estimator.evaluate" can be passed directly to
+  # `tf.contrib.timeseries.saved_model_utils.predict_continuation` as the
+  # `continue_from` argument.
+  with tf.Graph().as_default():
+    filter_feature_tensors, _ = evaluation_input_fn()
+    with tf.train.MonitoredSession() as session:
+      # Fetch the series to "warm up" our state, which will allow us to make
+      # predictions for its future values. This is just a dictionary of times,
+      # values, and exogenous features mapping to numpy arrays. The use of an
+      # input_fn is just a convenience for the example; they can also be
+      # specified manually.
+      filter_features = session.run(filter_feature_tensors)
   if export_directory is None:
     export_directory = tempfile.mkdtemp()
   input_receiver_fn = estimator.build_raw_serving_input_receiver_fn()
   export_location = estimator.export_savedmodel(
       export_directory, input_receiver_fn)
-  # Predict using the SavedModel
+  # Warm up and predict using the SavedModel
   with tf.Graph().as_default():
     with tf.Session() as session:
       signatures = tf.saved_model.loader.load(
           session, [tf.saved_model.tag_constants.SERVING], export_location)
+      state = tf.contrib.timeseries.saved_model_utils.cold_start_filter(
+          signatures=signatures, session=session, features=filter_features)
       saved_model_output = (
           tf.contrib.timeseries.saved_model_utils.predict_continuation(
-              continue_from=evaluation, signatures=signatures,
+              continue_from=state, signatures=signatures,
               session=session, steps=100,
               exogenous_features=predict_exogenous_features))
       # The exported model gives the same results as the Estimator.predict()
diff --git a/tensorflow/contrib/timeseries/python/timeseries/estimators.py b/tensorflow/contrib/timeseries/python/timeseries/estimators.py
index f8355f366fe8e191ab570fd271bbe4a8bf71c73d..469cea4fd2fca65373eef85b1931a267e6e60238 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/estimators.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/estimators.py
@@ -18,8 +18,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.layers.python.layers import feature_column
-
 from tensorflow.contrib.timeseries.python.timeseries import ar_model
 from tensorflow.contrib.timeseries.python.timeseries import feature_keys
 from tensorflow.contrib.timeseries.python.timeseries import head as ts_head_lib
@@ -31,11 +29,15 @@ from tensorflow.contrib.timeseries.python.timeseries.state_space_models.filterin
 
 from tensorflow.python.estimator import estimator_lib
 from tensorflow.python.estimator.export import export_lib
+from tensorflow.python.feature_column import feature_column
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
+from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import parsing_ops
 from tensorflow.python.training import training as train
+from tensorflow.python.util import nest
 
 
 class TimeSeriesRegressor(estimator_lib.Estimator):
@@ -98,11 +100,11 @@ class TimeSeriesRegressor(estimator_lib.Estimator):
     def _serving_input_receiver_fn():
       """A receiver function to be passed to export_savedmodel."""
       placeholders = {}
-      placeholders[feature_keys.TrainEvalFeatures.TIMES] = (
-          array_ops.placeholder(
-              name=feature_keys.TrainEvalFeatures.TIMES,
-              dtype=dtypes.int64,
-              shape=[default_batch_size, default_series_length]))
+      time_placeholder = array_ops.placeholder(
+          name=feature_keys.TrainEvalFeatures.TIMES,
+          dtype=dtypes.int64,
+          shape=[default_batch_size, default_series_length])
+      placeholders[feature_keys.TrainEvalFeatures.TIMES] = time_placeholder
       # Values are only necessary when filtering. For prediction the default
       # value will be ignored.
       placeholders[feature_keys.TrainEvalFeatures.VALUES] = (
@@ -117,36 +119,57 @@ class TimeSeriesRegressor(estimator_lib.Estimator):
                   dtype=self._model.dtype),
               shape=(default_batch_size, default_series_length,
                      self._model.num_features)))
-      with ops.Graph().as_default():
-        # Default placeholders have only an unknown batch dimension. Make them
-        # in a separate graph, then splice in the series length to the shapes
-        # and re-create them in the outer graph.
-        exogenous_feature_shapes = {
-            key: (value.get_shape(), value.dtype) for key, value
-            in feature_column.make_place_holder_tensors_for_base_features(
-                self._model.exogenous_feature_columns).items()}
-      for feature_key, (batch_only_feature_shape, value_dtype) in (
-          exogenous_feature_shapes.items()):
-        batch_only_feature_shape = batch_only_feature_shape.with_rank_at_least(
-            1).as_list()
-        feature_shape = ([default_batch_size, default_series_length]
-                         + batch_only_feature_shape[1:])
-        placeholders[feature_key] = array_ops.placeholder(
-            dtype=value_dtype, name=feature_key, shape=feature_shape)
+      if self._model.exogenous_feature_columns:
+        with ops.Graph().as_default():
+          # Default placeholders have only an unknown batch dimension. Make them
+          # in a separate graph, then splice in the series length to the shapes
+          # and re-create them in the outer graph.
+          parsed_features = (
+              feature_column.make_parse_example_spec(
+                  self._model.exogenous_feature_columns))
+          placeholder_features = parsing_ops.parse_example(
+              serialized=array_ops.placeholder(
+                  shape=[None], dtype=dtypes.string),
+              features=parsed_features)
+          exogenous_feature_shapes = {
+              key: (value.get_shape(), value.dtype) for key, value
+              in placeholder_features.items()}
+        for feature_key, (batch_only_feature_shape, value_dtype) in (
+            exogenous_feature_shapes.items()):
+          batch_only_feature_shape = (
+              batch_only_feature_shape.with_rank_at_least(1).as_list())
+          feature_shape = ([default_batch_size, default_series_length]
+                           + batch_only_feature_shape[1:])
+          placeholders[feature_key] = array_ops.placeholder(
+              dtype=value_dtype, name=feature_key, shape=feature_shape)
       # Models may not know the shape of their state without creating some
       # variables/ops. Avoid polluting the default graph by making a new one. We
       # use only static metadata from the returned Tensors.
       with ops.Graph().as_default():
         self._model.initialize_graph()
-        model_start_state = self._model.get_start_state()
-      for prefixed_state_name, state_tensor in ts_head_lib.state_to_dictionary(
-          model_start_state).items():
+        # Evaluate the initial state as same-dtype "zero" values. These zero
+        # constants aren't used, but are necessary for feeding to
+        # placeholder_with_default for the "cold start" case where state is not
+        # fed to the model.
+        def _zeros_like_constant(tensor):
+          return tensor_util.constant_value(array_ops.zeros_like(tensor))
+        start_state = nest.map_structure(
+            _zeros_like_constant, self._model.get_start_state())
+      batch_size_tensor = array_ops.shape(time_placeholder)[0]
+      for prefixed_state_name, state in ts_head_lib.state_to_dictionary(
+          start_state).items():
         state_shape_with_batch = tensor_shape.TensorShape(
-            (default_batch_size,)).concatenate(state_tensor.get_shape())
-        placeholders[prefixed_state_name] = array_ops.placeholder(
+            (default_batch_size,)).concatenate(state.shape)
+        default_state_broadcast = array_ops.tile(
+            state[None, ...],
+            multiples=array_ops.concat(
+                [batch_size_tensor[None],
+                 array_ops.ones(len(state.shape), dtype=dtypes.int32)],
+                axis=0))
+        placeholders[prefixed_state_name] = array_ops.placeholder_with_default(
+            input=default_state_broadcast,
             name=prefixed_state_name,
-            shape=state_shape_with_batch,
-            dtype=state_tensor.dtype)
+            shape=state_shape_with_batch)
       return export_lib.ServingInputReceiver(placeholders, placeholders)
 
     return _serving_input_receiver_fn
@@ -333,11 +356,11 @@ class StructuralEnsembleRegressor(StateSpaceRegressor):
           determine the model size. Learning autoregressive coefficients
           typically requires more steps and a smaller step size than other
           components.
-      exogenous_feature_columns: A list of tf.contrib.layers.FeatureColumn
-          objects (for example tf.contrib.layers.embedding_column) corresponding
-          to exogenous features which provide extra information to the model but
-          are not part of the series to be predicted. Passed to
-          tf.contrib.layers.input_from_feature_columns.
+      exogenous_feature_columns: A list of `tf.feature_column`s (for example
+          `tf.feature_column.embedding_column`) corresponding to exogenous
+          features which provide extra information to the model but are not part
+          of the series to be predicted. Passed to
+          `tf.feature_column.input_layer`.
       exogenous_update_condition: A function taking two Tensor arguments,
           `times` (shape [batch size]) and `features` (a dictionary mapping
           exogenous feature keys to Tensors with shapes [batch size, ...]), and
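Review note on the `placeholder_with_default` change in `_serving_input_receiver_fn` above: the default state is built by adding a leading axis to each zero-valued state array and tiling it `batch_size` times, with a multiple of 1 for every other dimension. The sketch below reproduces only that shape arithmetic in plain Python. The helper names `tile_multiples` and `broadcast_state_shape` are hypothetical and are not part of the patch.

```python
def tile_multiples(batch_size, state_shape):
  """Multiples for tiling: the batch size first, then 1 per state dimension."""
  return [batch_size] + [1] * len(state_shape)


def broadcast_state_shape(batch_size, state_shape):
  """Shape obtained by tiling `state[None, ...]` along the new batch axis."""
  expanded = [1] + list(state_shape)  # Mirrors `state[None, ...]`.
  multiples = tile_multiples(batch_size, state_shape)
  return [dim * multiple for dim, multiple in zip(expanded, multiples)]


matrix_state_shape = broadcast_state_shape(4, (3, 2))
scalar_state_shape = broadcast_state_shape(4, ())
print(matrix_state_shape, scalar_state_shape)
```

In the patch itself the batch size is read dynamically from the times placeholder, so the default follows whatever batch is fed at serving time.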
diff --git a/tensorflow/contrib/timeseries/python/timeseries/feature_keys.py b/tensorflow/contrib/timeseries/python/timeseries/feature_keys.py
index 970b9aa8acd6f55db843a4e023052b122992baf4..56566ee2e3207abd81ef665da10f851c9dc98ccb 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/feature_keys.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/feature_keys.py
@@ -72,3 +72,4 @@ class SavedModelLabels(object):
   """Names of signatures exported with export_savedmodel."""
   PREDICT = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
   FILTER = "filter"
+  COLD_START_FILTER = "cold_start_filter"
diff --git a/tensorflow/contrib/timeseries/python/timeseries/head.py b/tensorflow/contrib/timeseries/python/timeseries/head.py
index 5c49e903abde6d7487d1ffdb83ff902ff6b63585..3d7e61529014ff5045c3b64fb945ceb9c902dd0d 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/head.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/head.py
@@ -96,8 +96,12 @@ class _TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acc
   def _train_ops(self, features):
     """Add training ops to the graph."""
     mode = estimator_lib.ModeKeys.TRAIN
-    with variable_scope.variable_scope("model"):
+    with variable_scope.variable_scope(
+        "model",
+        # Use ResourceVariables to avoid race conditions.
+        use_resource=True):
       model_outputs = self.create_loss(features, mode)
+
     train_op = optimizers.optimize_loss(
         model_outputs.loss,
         global_step=training_util.get_global_step(),
@@ -112,7 +116,7 @@ class _TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acc
   def _evaluate_ops(self, features):
     """Add ops for evaluation (aka filtering) to the graph."""
     mode = estimator_lib.ModeKeys.EVAL
-    with variable_scope.variable_scope("model"):
+    with variable_scope.variable_scope("model", use_resource=True):
       model_outputs = self.create_loss(features, mode)
     metrics = {}
     # Just output in-sample predictions for the last chunk seen
@@ -132,7 +136,7 @@ class _TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acc
 
   def _predict_ops(self, features):
     """Add ops for prediction to the graph."""
-    with variable_scope.variable_scope("model"):
+    with variable_scope.variable_scope("model", use_resource=True):
       prediction = self.model.predict(features=features)
     prediction[feature_keys.PredictionResults.TIMES] = features[
         feature_keys.PredictionFeatures.TIMES]
@@ -141,11 +145,17 @@ class _TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acc
 
   def _serving_ops(self, features):
     """Add ops for serving to the graph."""
-    with variable_scope.variable_scope("model"):
+    with variable_scope.variable_scope("model", use_resource=True):
       prediction_outputs = self.model.predict(features=features)
     with variable_scope.variable_scope("model", reuse=True):
       filtering_outputs = self.create_loss(
           features, estimator_lib.ModeKeys.EVAL)
+    with variable_scope.variable_scope("model", reuse=True):
+      no_state_features = {
+          k: v for k, v in features.items()
+          if not k.startswith(feature_keys.State.STATE_PREFIX)}
+      cold_filtering_outputs = self.create_loss(
+          no_state_features, estimator_lib.ModeKeys.EVAL)
     return estimator_lib.EstimatorSpec(
         mode=estimator_lib.ModeKeys.PREDICT,
         export_outputs={
@@ -153,7 +163,10 @@ class _TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acc
                 export_lib.PredictOutput(prediction_outputs),
             feature_keys.SavedModelLabels.FILTER:
                 export_lib.PredictOutput(
-                    state_to_dictionary(filtering_outputs.end_state))
+                    state_to_dictionary(filtering_outputs.end_state)),
+            feature_keys.SavedModelLabels.COLD_START_FILTER:
+                export_lib.PredictOutput(
+                    state_to_dictionary(cold_filtering_outputs.end_state))
         },
         # Likely unused, but it is necessary to return `predictions` to satisfy
         # the Estimator's error checking.
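Review note on the new `COLD_START_FILTER` signature above: it is built from the same features as the `FILTER` signature, minus every key that carries model state, so the model falls back to its start state. A minimal sketch of that filtering step in plain Python follows. The prefix value `"model_state"` and the feature keys are assumptions made for illustration; the real constant is `feature_keys.State.STATE_PREFIX`, whose value is not shown in this diff.

```python
# Assumed stand-in for feature_keys.State.STATE_PREFIX (not shown in the diff).
STATE_PREFIX = "model_state"


def strip_state_features(features, state_prefix=STATE_PREFIX):
  """Returns a copy of `features` without any model-state entries."""
  return {key: value for key, value in features.items()
          if not key.startswith(state_prefix)}


features = {
    "times": [[0, 1, 2]],
    "values": [[1.0, 2.0, 3.0]],
    "model_state_00": [[0.5]],
    "model_state_01": [[0.1]],
}
cold_start_features = strip_state_features(features)
print(sorted(cold_start_features))
```

The input dictionary is left untouched, which matters here because the same `features` are also used for the regular `FILTER` signature.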
diff --git a/tensorflow/contrib/timeseries/python/timeseries/input_pipeline.py b/tensorflow/contrib/timeseries/python/timeseries/input_pipeline.py
index 04225333b9377447f46d32663df76aece97a51e7..403c6e2cb4aeb665fb112b6322109a6a90f7a261 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/input_pipeline.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/input_pipeline.py
@@ -492,8 +492,7 @@ class CSVReader(ReaderBaseTimeSeriesParser):
       features_lists.setdefault(column_name, []).append(value)
     features = {}
     for column_name, values in features_lists.items():
-      if (len(values) == 1 and
-          column_name != feature_keys.TrainEvalFeatures.VALUES):
+      if column_name == feature_keys.TrainEvalFeatures.TIMES:
         features[column_name] = values[0]
       else:
         features[column_name] = array_ops.stack(values, axis=1)
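Review note on the `CSVReader` change above: only the times column is now passed through unstacked. Every other column name, including one that appears just once, is stacked and so keeps a trailing feature dimension. The sketch below mimics that grouping for a single CSV row, using Python lists in place of `array_ops.stack`. The function `group_row` and the sample column names are hypothetical.

```python
TIMES_KEY = "times"  # Stand-in for feature_keys.TrainEvalFeatures.TIMES.


def group_row(column_names, row, times_key=TIMES_KEY):
  """Groups one row's values by column name, as the patched reader does."""
  features_lists = {}
  for column_name, value in zip(column_names, row):
    features_lists.setdefault(column_name, []).append(value)
  features = {}
  for column_name, values in features_lists.items():
    if column_name == times_key:
      features[column_name] = values[0]
    else:
      # List stand-in for array_ops.stack(values, axis=1).
      features[column_name] = list(values)
  return features


grouped = group_row(
    ("times", "values", "values", "temperature"), (3, 1.0, 2.0, 20.5))
print(grouped)
```

Before the change, a single-occurrence column other than values would have been returned without the extra dimension.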
diff --git a/tensorflow/contrib/timeseries/python/timeseries/model.py b/tensorflow/contrib/timeseries/python/timeseries/model.py
index bac7d1ebf59b28d4688a3d1a69ecdc1fc12248e0..7644764a7459db3951fe9a2790389713dd412a8f 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/model.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/model.py
@@ -21,18 +21,17 @@ from __future__ import print_function
 import abc
 import collections
 
-from tensorflow.contrib import layers
-from tensorflow.contrib.layers import feature_column
-
 from tensorflow.contrib.timeseries.python.timeseries import math_utils
 from tensorflow.contrib.timeseries.python.timeseries.feature_keys import PredictionFeatures
 from tensorflow.contrib.timeseries.python.timeseries.feature_keys import TrainEvalFeatures
 
+from tensorflow.python.feature_column import feature_column
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import parsing_ops
 from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.ops import variable_scope
 
@@ -66,11 +65,11 @@ class TimeSeriesModel(object):
 
     Args:
       num_features: Number of features for the time series
-      exogenous_feature_columns: A list of tf.contrib.layers.FeatureColumn
-          objects (for example tf.contrib.layers.embedding_column) corresponding
-          to exogenous features which provide extra information to the model but
-          are not part of the series to be predicted. Passed to
-          tf.contrib.layers.input_from_feature_columns.
+      exogenous_feature_columns: A list of `tf.feature_column`s (for example
+          `tf.feature_column.embedding_column`) corresponding to exogenous
+          features which provide extra information to the model but are not
+          part of the series to be predicted. Passed to
+          `tf.feature_column.input_layer`.
       dtype: The floating point datatype to use.
     """
     if exogenous_feature_columns:
@@ -86,7 +85,7 @@ class TimeSeriesModel(object):
 
   @property
   def exogenous_feature_columns(self):
-    """`FeatureColumn` objects for features which are not predicted."""
+    """`tf.feature_column`s for features which are not predicted."""
     return self._exogenous_feature_columns
 
   # TODO(allenl): Move more of the generic machinery for generating and
@@ -265,11 +264,14 @@ class TimeSeriesModel(object):
     if not self._exogenous_feature_columns:
       return (0,)
     with ops.Graph().as_default():
-      placeholder_features = (
-          feature_column.make_place_holder_tensors_for_base_features(
+      parsed_features = (
+          feature_column.make_parse_example_spec(
               self._exogenous_feature_columns))
-      embedded = layers.input_from_feature_columns(
-          columns_to_tensors=placeholder_features,
+      placeholder_features = parsing_ops.parse_example(
+          serialized=array_ops.placeholder(shape=[None], dtype=dtypes.string),
+          features=parsed_features)
+      embedded = feature_column.input_layer(
+          features=placeholder_features,
           feature_columns=self._exogenous_feature_columns)
       return embedded.get_shape().as_list()[1:]
 
@@ -308,13 +310,13 @@ class TimeSeriesModel(object):
         # Avoid shape warnings when embedding "scalar" exogenous features (those
         # with only batch and window dimensions); input_from_feature_columns
         # expects input ranks to match the embedded rank.
-        if tensor.get_shape().ndims == 1:
+        if tensor.get_shape().ndims == 1 and tensor.dtype != dtypes.string:
           exogenous_features_single_batch_dimension[name] = tensor[:, None]
         else:
           exogenous_features_single_batch_dimension[name] = tensor
       embedded_exogenous_features_single_batch_dimension = (
-          layers.input_from_feature_columns(
-              columns_to_tensors=exogenous_features_single_batch_dimension,
+          feature_column.input_layer(
+              features=exogenous_features_single_batch_dimension,
               feature_columns=self._exogenous_feature_columns,
               trainable=True))
       exogenous_regressors = array_ops.reshape(
@@ -381,8 +383,8 @@ class SequentialTimeSeriesModel(TimeSeriesModel):
           may use _scale_back_data or _scale_back_variance to return predictions
           to the input scale.
       dtype: The floating point datatype to use.
-      exogenous_feature_columns: A list of tf.contrib.layers.FeatureColumn
-          objects. See `TimeSeriesModel`.
+      exogenous_feature_columns: A list of `tf.feature_column`s. See
+          `TimeSeriesModel`.
       exogenous_update_condition: A function taking two Tensor arguments `times`
           (shape [batch size]) and `features` (a dictionary mapping exogenous
           feature keys to Tensors with shapes [batch size, ...]) and returning a
diff --git a/tensorflow/contrib/timeseries/python/timeseries/saved_model_utils.py b/tensorflow/contrib/timeseries/python/timeseries/saved_model_utils.py
index 97f6d36a879532c12684ffdd700ef40b72750567..0461abdc19c08767114e3d26d1134ea4bc5481f8 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/saved_model_utils.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/saved_model_utils.py
@@ -15,6 +15,7 @@
 """Convenience functions for working with time series saved_models.
 
 @@predict_continuation
+@@cold_start_filter
 @@filter_continuation
 """
 
@@ -30,10 +31,12 @@ from tensorflow.contrib.timeseries.python.timeseries import model_utils as _mode
 from tensorflow.python.util.all_util import remove_undocumented
 
 
-def _colate_features_to_feeds_and_fetches(continue_from, signature, features,
-                                          graph):
+def _colate_features_to_feeds_and_fetches(signature, features, graph,
+                                          continue_from=None):
   """Uses a saved model signature to construct feed and fetch dictionaries."""
-  if _feature_keys.FilteringResults.STATE_TUPLE in continue_from:
+  if continue_from is None:
+    state_values = {}
+  elif _feature_keys.FilteringResults.STATE_TUPLE in continue_from:
     # We're continuing from an evaluation, so we need to unpack/flatten state.
     state_values = _head.state_to_dictionary(
         continue_from[_feature_keys.FilteringResults.STATE_TUPLE])
@@ -115,6 +118,55 @@ def predict_continuation(continue_from,
   return output
 
 
+def cold_start_filter(signatures, session, features):
+  """Perform filtering using an exported saved model.
+
+  Filtering refers to updating model state based on new observations.
+  Predictions based on the returned model state will be conditioned on these
+  observations.
+
+  Starts from the model's default/uninformed state.
+
+  Args:
+    signatures: The `MetaGraphDef` protocol buffer returned from
+      `tf.saved_model.loader.load`. Used to determine the names of Tensors to
+      feed and fetch.
+    session: The session to use. The session's graph must be the one into which
+      `tf.saved_model.loader.load` loaded the model.
+    features: A dictionary mapping keys to Numpy arrays, with several possible
+      shapes (requires keys `FilteringFeatures.TIMES` and
+      `FilteringFeatures.VALUES`):
+        Single example; `TIMES` is a scalar and `VALUES` is either a scalar or a
+          vector of length [number of features].
+        Sequence; `TIMES` is a vector of shape [series length], `VALUES` either
+          has shape [series length] (univariate) or [series length x number of
+          features] (multivariate).
+        Batch of sequences; `TIMES` has shape [batch size x series length],
+          `VALUES` has shape [batch size x series length] or [batch size x
+          series length x number of features].
+      In any case, `VALUES` and any exogenous features must have their shapes
+      prefixed by the shape of the value corresponding to the `TIMES` key.
+  Returns:
+    A dictionary containing model state updated to account for the observations
+    in `features`.
+  """
+  filter_signature = signatures.signature_def[
+      _feature_keys.SavedModelLabels.COLD_START_FILTER]
+  features = _input_pipeline._canonicalize_numpy_data(  # pylint: disable=protected-access
+      data=features,
+      require_single_batch=False)
+  output_tensors_by_name, feed_dict = _colate_features_to_feeds_and_fetches(
+      signature=filter_signature,
+      features=features,
+      graph=session.graph)
+  output = session.run(output_tensors_by_name, feed_dict=feed_dict)
+  # Make it easier to chain filter -> predict by keeping track of the current
+  # time.
+  output[_feature_keys.FilteringResults.TIMES] = features[
+      _feature_keys.FilteringFeatures.TIMES]
+  return output
+
+
 def filter_continuation(continue_from, signatures, session, features):
   """Perform filtering using an exported saved model.
 
@@ -124,8 +176,8 @@ def filter_continuation(continue_from, signatures, session, features):
 
   Args:
     continue_from: A dictionary containing the results of either an Estimator's
-      evaluate method or a previous filter_continuation. Used to determine the
-      model state to start filtering from.
+      evaluate method or a previous filter step (cold start or
+      continuation). Used to determine the model state to start filtering from.
     signatures: The `MetaGraphDef` protocol buffer returned from
       `tf.saved_model.loader.load`. Used to determine the names of Tensors to
       feed and fetch. Must be from the same model as `continue_from`.
diff --git a/tensorflow/contrib/timeseries/python/timeseries/state_space_models/state_space_model.py b/tensorflow/contrib/timeseries/python/timeseries/state_space_models/state_space_model.py
index 6257002647ed53bbde3ace11a6b45e4e2cdeb57d..951c6546d5fed77e0cfa98a4e774b804639d7dad 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/state_space_models/state_space_model.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/state_space_models/state_space_model.py
@@ -112,11 +112,11 @@ class StateSpaceModelConfiguration(
       exogenous_noise_decreases: If True, exogenous regressors can "set" model
           state, decreasing uncertainty. If both this parameter and
           exogenous_noise_increases are False, exogenous regressors are ignored.
-      exogenous_feature_columns: A list of tf.contrib.layers.FeatureColumn
-          objects (for example tf.contrib.layers.embedding_column) corresponding
-          to exogenous features which provide extra information to the model but
-          are not part of the series to be predicted. Passed to
-          tf.contrib.layers.input_from_feature_columns.
+      exogenous_feature_columns: A list of `tf.feature_column`s (for example
+          `tf.feature_column.embedding_column`) corresponding to exogenous
+          features which provide extra information to the model but are not part
+          of the series to be predicted. Passed to
+          `tf.feature_column.input_layer`.
       exogenous_update_condition: A function taking two Tensor arguments `times`
           (shape [batch size]) and `features` (a dictionary mapping exogenous
           feature keys to Tensors with shapes [batch size, ...]) and returning a
diff --git a/tensorflow/contrib/tpu/BUILD b/tensorflow/contrib/tpu/BUILD
index c48e84ddfaac8ac9c07e061847315eab3fd72152..eea19e9465e482dfd1ea9a144435c23a2ecf1467 100644
--- a/tensorflow/contrib/tpu/BUILD
+++ b/tensorflow/contrib/tpu/BUILD
@@ -24,6 +24,7 @@ cc_library(
     name = "all_ops",
     deps = [
         ":cross_replica_ops_op_lib",
+        ":host_compute_ops_op_lib",
         ":infeed_ops_op_lib",
         ":outfeed_ops_op_lib",
         ":replication_ops_op_lib",
@@ -69,6 +70,7 @@ py_library(
 tf_gen_op_libs(
     op_lib_names = [
         "cross_replica_ops",
+        "host_compute_ops",
         "infeed_ops",
         "outfeed_ops",
         "replication_ops",
@@ -78,6 +80,7 @@ tf_gen_op_libs(
     deps = [
         "//tensorflow/contrib/tpu/proto:tpu_embedding_config_proto_cc",
         "//tensorflow/core:lib_proto_parsing",
+        "//tensorflow/core:protos_all_cc",
     ],
 )
 
@@ -85,6 +88,7 @@ tf_custom_op_library(
     name = "python/ops/_tpu_ops.so",
     srcs = [
         "ops/cross_replica_ops.cc",
+        "ops/host_compute_ops.cc",
         "ops/infeed_ops.cc",
         "ops/outfeed_ops.cc",
         "ops/replication_ops.cc",
@@ -101,6 +105,7 @@ tf_gen_op_wrapper_py(
     name = "tpu_ops",
     deps = [
         ":cross_replica_ops_op_lib",
+        ":host_compute_ops_op_lib",
         ":infeed_ops_op_lib",
         ":outfeed_ops_op_lib",
         ":replication_ops_op_lib",
@@ -163,6 +168,7 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
+        ":datasets",
         ":profiler",
         ":tpu_py",
         "//tensorflow/contrib/tpu/proto:topology_proto_py",
@@ -181,6 +187,33 @@ py_library(
     ],
 )
 
+py_library(
+    name = "datasets",
+    srcs = [
+        "python/tpu/datasets.py",
+    ],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/contrib/data/python/ops:transformation_ops",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:function",
+        "//tensorflow/python:functional_ops",
+        "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow/python/data/ops:iterator_ops",
+        "//tensorflow/python/data/ops:readers",
+    ],
+)
+
+tf_py_test(
+    name = "datasets_test",
+    srcs = ["python/tpu/datasets_test.py"],
+    additional_deps = [
+        "//tensorflow/python:client_testlib",
+        ":datasets",
+    ],
+    grpc_enabled = True,
+)
+
 tf_py_test(
     name = "tpu_test",
     size = "small",
@@ -238,6 +271,17 @@ tf_py_test(
     ],
 )
 
+tf_py_test(
+    name = "tpu_estimator_signals_test",
+    size = "small",
+    srcs = ["python/tpu/tpu_estimator_signals_test.py"],
+    additional_deps = [
+        ":tpu_estimator",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:framework_test_lib",
+    ],
+)
+
 filegroup(
     name = "all_files",
     srcs = glob(
diff --git a/tensorflow/contrib/tpu/ops/host_compute_ops.cc b/tensorflow/contrib/tpu/ops/host_compute_ops.cc
new file mode 100644
index 0000000000000000000000000000000000000000..48aeb81ac1311d3acd4972810f0a27a382f8b136
--- /dev/null
+++ b/tensorflow/contrib/tpu/ops/host_compute_ops.cc
@@ -0,0 +1,64 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/common_shape_fns.h"
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/shape_inference.h"
+
+namespace tensorflow {
+
+REGISTER_OP("_XlaSendFromHost")
+    .Input("inputs: Tinputs")
+    .Input("dynamic_key: string")
+    .Attr("Tinputs: list(type) >= 0")
+    .Attr("key: string")
+    .Attr("device_ordinal: int")
+    .SetIsStateful()
+    .SetShapeFn(::tensorflow::shape_inference::NoOutputs)
+    .Doc(R"doc(
+A placeholder op for multiple values that will be sent from TensorFlow to a
+running XLA computation.
+
+inputs: A list of tensors that will be sent to the XLA computation.
+dynamic_key: The key sent at runtime by the compile node to identify which
+execution the transfer corresponds to.
+Tinputs: The element types of each element in `inputs`.
+key: A key that is unique in the computation and associates the send with the consumer in
+the XLA computation.
+device_ordinal: The device to use.
+)doc");
+
+REGISTER_OP("_XlaRecvAtHost")
+    .Input("dynamic_key: string")
+    .Output("outputs: Toutputs")
+    .Attr("Toutputs: list(type) >= 0")
+    .Attr("key: string")
+    .Attr("device_ordinal: int")
+    .SetIsStateful()
+    .SetShapeFn(::tensorflow::shape_inference::UnknownShape)
+    .Doc(R"doc(
+A placeholder op for multiple values that will be sent to TensorFlow from a
+running XLA computation.
+
+dynamic_key: The key sent at runtime by the compile node to identify which
+execution the transfer corresponds to.
+outputs: A list of tensors that will be received from the XLA computation.
+Toutputs: The element types of each element in `outputs`.
+key: A key that is unique in the computation and associates the send with the consumer in
+the XLA computation.
+device_ordinal: The device to use.
+)doc");
+
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/tpu/ops/tpu_configuration_ops.cc b/tensorflow/contrib/tpu/ops/tpu_configuration_ops.cc
index f8de8baa65339383c7f92284ee274a434f12f8c2..7bf5c21d0b526ee5e32448f75d39eca8add6d877 100644
--- a/tensorflow/contrib/tpu/ops/tpu_configuration_ops.cc
+++ b/tensorflow/contrib/tpu/ops/tpu_configuration_ops.cc
@@ -191,6 +191,7 @@ REGISTER_OP("ConfigureDistributedTPU")
     .Output("topology: string")
     .Attr("embedding_config: string = ''")
     .Attr("tpu_embedding_config: string = ''")
+    .Attr("is_global_init: bool = false")
     .SetIsStateful()
     .SetShapeFn(shape_inference::UnknownShape)
     .Doc(R"doc(
@@ -202,6 +203,7 @@ topology.
 tpu_embedding_config: Serialized tensorflow.tpu.TPUEmbeddingConfiguration that
 describes the embedding lookups of the program.
 embedding_config: Reserved. Do not use.
+is_global_init: Reserved. Do not use.
 )doc");
 
 REGISTER_OP("ShutdownDistributedTPU")
diff --git a/tensorflow/contrib/tpu/ops/tpu_embedding_ops.cc b/tensorflow/contrib/tpu/ops/tpu_embedding_ops.cc
index cc32a265286951a1e4d59228da6b3ac83a75c5e9..72d37f774cc518c559b5953561957a799a7da568 100644
--- a/tensorflow/contrib/tpu/ops/tpu_embedding_ops.cc
+++ b/tensorflow/contrib/tpu/ops/tpu_embedding_ops.cc
@@ -50,7 +50,7 @@ namespace tensorflow {
 // TPU Embeddings use dedicated ops to enforce Host/TPU consistency in the
 // state of embedding table variables. Before beginning training or inference,
 // the model must Load the optimizer parameters into the TPU memories. Before
-// saving a checkpoint, the model must Retreieve the parameters back into the
+// saving a checkpoint, the model must Retrieve the parameters back into the
 // host CPU memory.
 
 REGISTER_OP("TPUEmbeddingLoadGradientDescentParameters")
@@ -263,7 +263,7 @@ REGISTER_OP("TPUEmbeddingReceiveActivations")
     .SetIsStateful()
     .SetShapeFn(tpu_embedding_config_util::ActivationShapes)
     .Doc(R"doc(
-An op that receives embeddng activations on the TPU.
+An op that receives embedding activations on the TPU.
 
 The TPU system performs the embedding lookups and aggregations specified by
 the arguments to TPUEmbeddingEnqueueSparseBatch. The results of these
@@ -293,7 +293,7 @@ REGISTER_OP("TPUEmbeddingActivations")
 An op enabling differentiation of TPU Embeddings.
 
 This op simply returns its first input, which is assumed to have been sliced
-from the Tensors returnd by TPUEmbeddingDequeueActivations. The presence of this
+from the Tensors returned by TPUEmbeddingDequeueActivations. The presence of this
 op, and its first argument being a trainable Variable, enables automatic
 differentiation of graphs containing embeddings via the TPU Embedding Python
 libraries.
diff --git a/tensorflow/contrib/tpu/profiler/BUILD b/tensorflow/contrib/tpu/profiler/BUILD
index 198da0203a7d17249c4f50110713121b74d5ca4f..0a52d0b13b7c8749ad44377659714d297ffec3ee 100644
--- a/tensorflow/contrib/tpu/profiler/BUILD
+++ b/tensorflow/contrib/tpu/profiler/BUILD
@@ -18,7 +18,7 @@ filegroup(
     visibility = ["//tensorflow:__subpackages__"],
 )
 
-tf_proto_library_cc(
+tf_proto_library(
     name = "tpu_profiler_proto",
     srcs = ["tpu_profiler.proto"],
     has_services = 1,
@@ -98,16 +98,36 @@ tf_cc_test(
     ],
 )
 
-tf_proto_library_cc(
+tf_proto_library(
     name = "op_profile_proto",
     srcs = ["op_profile.proto"],
     cc_api_version = 2,
     visibility = ["//visibility:public"],
 )
 
-tf_proto_library_cc(
+tf_proto_library(
     name = "tf_op_stats_proto",
     srcs = ["tf_op_stats.proto"],
     cc_api_version = 2,
     visibility = ["//visibility:public"],
 )
+
+tf_proto_library(
+    name = "tpu_profiler_analysis_proto",
+    srcs = ["tpu_profiler_analysis.proto"],
+    has_services = 1,
+    cc_api_version = 2,
+    cc_grpc_version = 1,
+    protodeps = [":tpu_profiler_proto"] + tf_additional_all_protos(),
+    visibility = ["//visibility:public"],
+)
+
+py_library(
+    name = "tpu_profiler_analysis_pb2_grpc",
+    srcs = ["tpu_profiler_analysis_pb2_grpc.py"],
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+    deps = [
+        ":tpu_profiler_analysis_proto_py",
+    ],
+)
diff --git a/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc b/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc
index b1ef9fde37fe0647965f0818895be37d2d56d207..e6811d4ad204edb318638c698090479436f38ecd 100644
--- a/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc
+++ b/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc
@@ -29,6 +29,9 @@ limitations under the License.
 #include "tensorflow/contrib/tpu/profiler/version.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_util.h"
 #include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/io/path.h"
+#include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/platform/init_main.h"
 #include "tensorflow/core/util/command_line_flags.h"
@@ -62,10 +65,13 @@ Status ValidateHostPortPair(const string& host_port) {
 }
 
 ProfileResponse Profile(const string& service_addr, int duration_ms,
+                        const string& repository_root, const string& session_id,
                         const ProfileOptions& opts) {
   ProfileRequest request;
   request.set_duration_ms(duration_ms);
   request.set_max_events(kMaxEvents);
+  request.set_repository_root(repository_root);
+  request.set_session_id(session_id);
   request.add_tools("input_pipeline");
   request.add_tools("overview_page");
   *request.mutable_opts() = opts;
@@ -137,10 +143,17 @@ int main(int argc, char** argv) {
   opts.set_include_dataset_ops(FLAGS_include_dataset_ops);
   tensorflow::ProfileResponse response;
 
+  // Use the current timestamp as the run name.
+  tensorflow::string session_id =
+      tensorflow::tpu::GetCurrentTimeStampAsString();
+  constexpr char kProfilePluginDirectory[] = "plugins/profile/";
+  tensorflow::string repository_root =
+      ::tensorflow::io::JoinPath(FLAGS_logdir, kProfilePluginDirectory);
   while (true) {
     std::cout << "Starting to profile TPU traces for " << duration_ms << " ms. "
               << "Remaining attempt(s): " << remaining_attempts-- << std::endl;
-    response = tensorflow::tpu::Profile(FLAGS_service_addr, duration_ms, opts);
+    response = tensorflow::tpu::Profile(FLAGS_service_addr, duration_ms,
+                                        repository_root, session_id, opts);
     if (remaining_attempts <= 0 || !response.encoded_trace().empty()) break;
     std::cout << "No trace event is collected. Automatically retrying."
               << std::endl
@@ -158,10 +171,8 @@ int main(int argc, char** argv) {
     return 0;
   }
 
-  // Use the current timestamp as the run name.
-  tensorflow::string run = tensorflow::tpu::GetCurrentTimeStampAsString();
   TF_CHECK_OK(tensorflow::tpu::WriteTensorboardTPUProfile(
-      FLAGS_logdir, run, response, &std::cout));
+      FLAGS_logdir, session_id, response, &std::cout));
   // Print this at the end so that it's not buried in irrelevant LOG messages.
   std::cout
       << "NOTE: using the trace duration " << duration_ms << "ms." << std::endl
diff --git a/tensorflow/contrib/tpu/profiler/tf_op_stats.proto b/tensorflow/contrib/tpu/profiler/tf_op_stats.proto
index 2094294baad63ae73712c8648b588accd4551ef8..20ed7419fde36a0d112900093ed2f44c3af63d75 100644
--- a/tensorflow/contrib/tpu/profiler/tf_op_stats.proto
+++ b/tensorflow/contrib/tpu/profiler/tf_op_stats.proto
@@ -77,6 +77,8 @@ message StepInfoResult {
   // The infeed duration in picoseconds.
   // Can turn into a map if we want a variable number of ops.
   optional uint64 infeed_duration_ps = 3;
+  // The start time of this step in picoseconds.
+  optional uint64 begin_ps = 4;
 }
 
 // Result proto for a sequence of steps.
@@ -155,6 +157,54 @@ message RunEnvironmentResult {
   repeated HostDependentJobInfoResult host_dependent_job_info = 6;
 }
 
+// The types of host operations that are tracked.
+enum HostOp {
+  // Invalid host op.
+  kINVALIDHostOp = 0;
+  // Each host op type has two parts:
+  // (1) the stage where the op happens and (2) the op name.
+  // stage = Input Data Producer, op = Get Next Batch.
+  kInputDataProducerGetNextBatch = 1;
+  // stage = Input Data Producer, op = Session Run.
+  kInputDataProducerSessionRun = 2;
+  // stage = Input Data Producer, op = Forward Batch.
+  kInputDataProducerForwardBatch = 3;
+  // stage = Infeed Thread, op = Get Next Batch.
+  kInfeedThreadGetNextBatch = 4;
+  // stage = Infeed Thread, op = Session Run.
+  kInfeedThreadSessionRun = 5;
+  // stage = Infeed Thread, op = Forward Batch.
+  kInfeedThreadForwardBatch = 6;
+  // stage = Outfeed Thread, op = Get Next Batch.
+  kOutfeedThreadGetNextBatch = 7;
+  // stage = Outfeed Thread, op = Session Run.
+  kOutfeedThreadSessionRun = 8;
+  // stage = Outfeed Thread, op = Forward Batch.
+  kOutfeedThreadForwardBatch = 9;
+}
+
+// Result proto for the host ops per TPU step.
+message HostOpsPerTpuStep {
+  // Whether the data in this message is valid.
+  optional bool valid = 1 [default = false];
+  // The current TPU step number.
+  optional uint32 tpu_step_num = 2;
+  // The beginning time of the current TPU step on the device in picoseconds.
+  optional uint64 tpu_step_begin_ps = 3;
+  // The ending time of the current TPU step on the device in picoseconds.
+  optional uint64 tpu_step_end_ps = 4;
+  // For each possible host operation, maps to the difference between the TPU
+  // step number that the host op targets and the current TPU step number.
+  // The key is HostOp, value is the step difference.
+  map<int32, int32> step_diffs = 5;
+}
+
+// Result proto for the host ops for all TPU steps.
+message HostOpsResult {
+  // A sequence of HostOpsPerTpuStep (one for each TPU step)
+  repeated HostOpsPerTpuStep host_op_sequence = 1;
+}
+
 // Result proto for TfStatsHelper.
 message TfOpStats {
   // The result for the TF-metric database.
@@ -171,4 +221,8 @@ message TfOpStats {
   optional double matrix_unit_utilization_percent = 6;
   // The run environment of this profiling session.
   optional RunEnvironmentResult run_environment = 7;
+  // The result for the host operations.
+  optional HostOpsResult host_ops = 8;
+  // A map from core ID to name.
+  map<uint32, string> core_id_to_name_map = 9;
 }
diff --git a/tensorflow/contrib/tpu/profiler/tpu_profiler.proto b/tensorflow/contrib/tpu/profiler/tpu_profiler.proto
index f3f3302ceb3d27dbb21bdce753aeb2d7fcd77448..cddc3cd1b41d6e00409222170e69c429fe6f91f8 100644
--- a/tensorflow/contrib/tpu/profiler/tpu_profiler.proto
+++ b/tensorflow/contrib/tpu/profiler/tpu_profiler.proto
@@ -36,10 +36,17 @@ message ProfileRequest {
   // Optional profiling options that control how a TF session will be profiled.
   ProfileOptions opts = 4;
 
+  // The place where we will dump profile data. We will normally use
+  // MODEL_DIR/plugins/profile/ as our repository root.
+  string repository_root = 5;
+
+  // The user-provided profile session identifier.
+  string session_id = 6;
+
   // In future, the caller will indicate which TF session is being profiled, and
   // only data relating to that program will be returned. For now, we assume
   // all activity during the profiling period is relevant.
-  // next-field: 5
+  // next-field: 7
 }
 
 message ProfileToolData {
diff --git a/tensorflow/contrib/tpu/profiler/tpu_profiler_analysis.proto b/tensorflow/contrib/tpu/profiler/tpu_profiler_analysis.proto
new file mode 100644
index 0000000000000000000000000000000000000000..a4fc8d4e879eb85522f35663c9c628ecd5ef562c
--- /dev/null
+++ b/tensorflow/contrib/tpu/profiler/tpu_profiler_analysis.proto
@@ -0,0 +1,73 @@
+syntax = "proto3";
+package tensorflow;
+
+import "tensorflow/contrib/tpu/profiler/tpu_profiler.proto";
+
+message NewProfileSessionRequest {
+  ProfileRequest request = 1;
+  string repository_root = 2;
+  repeated string hosts = 3;
+}
+
+message NewProfileSessionResponse {
+  // Auxiliary error_message.
+  string error_message = 1;
+  // On success, the session identifier for future reference.
+  string session_id = 2;
+}
+
+message EnumProfileSessionsAndToolsRequest {
+  string repository_root = 1;
+}
+
+message ProfileSessionInfo {
+  string session_id = 1;
+  // Which tool data is available for consumption.
+  repeated string available_tools = 2;
+}
+
+message EnumProfileSessionsAndToolsResponse {
+  // Auxiliary error_message.
+  string error_message = 1;
+  // On success, information about the returned sessions is stored here.
+  repeated ProfileSessionInfo sessions = 2;
+}
+
+message ProfileSessionDataRequest {
+  string repository_root = 1;
+  string session_id = 2;
+  // Name of the tool whose data is requested.
+  string tool_name = 3;
+  // Tool-specific parameters, e.g. TraceViewer's viewport.
+  map<string, string> parameters = 4;
+}
+
+message ProfileSessionDataResponse {
+  // Auxiliary error_message.
+  string error_message = 1;
+
+  // Output format. e.g. "json" or "proto" or "blob"
+  string output_format = 2;
+
+  // TODO(jiesun): figure out whether to put bytes or oneof tool specific proto.
+  bytes output = 3;
+}
+////////////////////////////////////////////////////////////////////////////////
+// The TPUProfileAnalysis service provides an entry point for profiling TPUs and
+// for serving profiled data to TensorBoard through gRPC.
+////////////////////////////////////////////////////////////////////////////////
+service TPUProfileAnalysis {
+  // Starts a profiling session and blocks until it completes.
+  // The TPUProfileAnalysis service delegates this to the TPUProfiler service,
+  // stores the profiled data in the repository, then returns a status.
+  rpc NewSession(NewProfileSessionRequest) returns (NewProfileSessionResponse) {
+  }
+  // Enumerate existing sessions and return available profile tools.
+  rpc EnumSessions(EnumProfileSessionsAndToolsRequest)
+      returns (EnumProfileSessionsAndToolsResponse) {
+  }
+  // Retrieves a specific tool's data for a specific session.
+  rpc GetSessionToolData(ProfileSessionDataRequest)
+      returns (ProfileSessionDataResponse) {
+  }
+}
diff --git a/tensorflow/contrib/tpu/profiler/tpu_profiler_analysis_pb2_grpc.py b/tensorflow/contrib/tpu/profiler/tpu_profiler_analysis_pb2_grpc.py
new file mode 100644
index 0000000000000000000000000000000000000000..c28fef22a9d3736748b1b56135302d5ec7845720
--- /dev/null
+++ b/tensorflow/contrib/tpu/profiler/tpu_profiler_analysis_pb2_grpc.py
@@ -0,0 +1,138 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT!
+#
+# Do not use pylint on generated code.
+# pylint: disable=missing-docstring,g-short-docstring-punctuation,g-no-space-after-docstring-summary,invalid-name,line-too-long,unused-argument,g-doc-args
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import grpc
+
+from third_party.tensorflow.contrib.tpu.profiler import tpu_profiler_analysis_pb2 as third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2
+
+
+class TPUProfileAnalysisStub(object):
+  """//////////////////////////////////////////////////////////////////////////////
+
+  TPUProfileAnalysis service provide entry point for profiling TPU and for
+  serving profiled data to Tensorboard through GRPC
+  //////////////////////////////////////////////////////////////////////////////
+  """
+
+  def __init__(self, channel):
+    """Constructor.
+
+    Args:
+      channel: A grpc.Channel.
+    """
+    self.NewSession = channel.unary_unary(
+        '/tensorflow.TPUProfileAnalysis/NewSession',
+        request_serializer=
+        third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+        NewProfileSessionRequest.SerializeToString,
+        response_deserializer=
+        third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+        NewProfileSessionResponse.FromString,
+    )
+    self.EnumSessions = channel.unary_unary(
+        '/tensorflow.TPUProfileAnalysis/EnumSessions',
+        request_serializer=
+        third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+        EnumProfileSessionsAndToolsRequest.SerializeToString,
+        response_deserializer=
+        third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+        EnumProfileSessionsAndToolsResponse.FromString,
+    )
+    self.GetSessionToolData = channel.unary_unary(
+        '/tensorflow.TPUProfileAnalysis/GetSessionToolData',
+        request_serializer=
+        third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+        ProfileSessionDataRequest.SerializeToString,
+        response_deserializer=
+        third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+        ProfileSessionDataResponse.FromString,
+    )
+
+
+class TPUProfileAnalysisServicer(object):
+  """//////////////////////////////////////////////////////////////////////////////
+
+  TPUProfileAnalysis service provide entry point for profiling TPU and for
+  serving profiled data to Tensorboard through GRPC
+  //////////////////////////////////////////////////////////////////////////////
+  """
+
+  def NewSession(self, request, context):
+    """Starts a profiling session, blocks until it completes.
+    TPUProfileAnalysis service delegate this to TPUProfiler service.
+    Populate the profiled data in repository, then return status to caller.
+    """
+    context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+    context.set_details('Method not implemented!')
+    raise NotImplementedError('Method not implemented!')
+
+  def EnumSessions(self, request, context):
+    """Enumerate existing sessions and return available profile tools.
+    """
+    context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+    context.set_details('Method not implemented!')
+    raise NotImplementedError('Method not implemented!')
+
+  def GetSessionToolData(self, request, context):
+    """Retrieve specific tool's data for specific session.
+    """
+    context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+    context.set_details('Method not implemented!')
+    raise NotImplementedError('Method not implemented!')
+
+
+def add_TPUProfileAnalysisServicer_to_server(servicer, server):
+  rpc_method_handlers = {
+      'NewSession':
+          grpc.unary_unary_rpc_method_handler(
+              servicer.NewSession,
+              request_deserializer=
+              third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+              NewProfileSessionRequest.FromString,
+              response_serializer=
+              third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+              NewProfileSessionResponse.SerializeToString,
+          ),
+      'EnumSessions':
+          grpc.unary_unary_rpc_method_handler(
+              servicer.EnumSessions,
+              request_deserializer=
+              third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+              EnumProfileSessionsAndToolsRequest.FromString,
+              response_serializer=
+              third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+              EnumProfileSessionsAndToolsResponse.SerializeToString,
+          ),
+      'GetSessionToolData':
+          grpc.unary_unary_rpc_method_handler(
+              servicer.GetSessionToolData,
+              request_deserializer=
+              third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+              ProfileSessionDataRequest.FromString,
+              response_serializer=
+              third__party_dot_tensorflow_dot_contrib_dot_tpu_dot_profiler_dot_tpu__profiler__analysis__pb2.
+              ProfileSessionDataResponse.SerializeToString,
+          ),
+  }
+  generic_handler = grpc.method_handlers_generic_handler(
+      'tensorflow.TPUProfileAnalysis', rpc_method_handlers)
+  server.add_generic_rpc_handlers((generic_handler,))
diff --git a/tensorflow/contrib/tpu/python/tpu/datasets.py b/tensorflow/contrib/tpu/python/tpu/datasets.py
new file mode 100644
index 0000000000000000000000000000000000000000..465c668fd8b42f150892f8e4b52de76c6fe13fa9
--- /dev/null
+++ b/tensorflow/contrib/tpu/python/tpu/datasets.py
@@ -0,0 +1,184 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Library of Cloud TPU helper functions for data loading."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.data.python.ops import batching
+from tensorflow.contrib.data.python.ops import interleave_ops
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.ops import iterator_ops
+from tensorflow.python.data.ops import readers
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import function
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import functional_ops
+
+
+def _TextLineDataset(filename):
+  buffer_size = 8 * 1024 * 1024  # 8 MiB per file
+  dataset = readers.TextLineDataset(filename, buffer_size=buffer_size)
+  return dataset
+
+
+def _TFRecordDataset(filename):
+  buffer_size = 8 * 1024 * 1024  # 8 MiB per file
+  dataset = readers.TFRecordDataset(filename, buffer_size=buffer_size)
+  return dataset
+
+
+_FILETYPE_MAP = {
+    'tfrecord': _TFRecordDataset,
+    'textline': _TextLineDataset,
+    'text': _TextLineDataset,
+}
+
+
+def StreamingFilesDataset(files,
+                          filetype=None,
+                          file_reader_job=None,
+                          worker_job=None,
+                          num_epochs=None,
+                          filename_shuffle_buffer_size=None,
+                          num_parallel_reads=None,
+                          batch_transfer_size=None,
+                          sloppy=None):
+  """StreamingFilesDataset constructs a dataset to stream from workers (GCE VM).
+
+  Because Cloud TPUs are allocated over the network, a Cloud TPU cannot read
+  files local to your GCE VM. In order to train using files stored on your local
+  VM (e.g. on local SSD for extreme performance), use the StreamingFilesDataset
+  helper to generate a dataset to feed your Cloud TPU with files from your GCE
+  VM.
+
+  The resulting dataset may raise an OutOfRangeError if no files are found as a
+  result of the fileglob expansion.
+
+  Note: StreamingFilesDataset assumes that the session is using a
+  TPUClusterResolver and therefore has a worker and a coordinator job. By
+  default, file loading is done on the coordinator job.
+
+  Args:
+    files: A string glob to match files, or a `tf.data.Dataset` generating file
+      names.
+    filetype: A string ('tfrecord', 'textline', or 'text') or a single-argument
+      TensorFlow function that, given a filename, returns a dataset.
+    file_reader_job: An optional string that corresponds to the job that should
+      perform the file reads.
+    worker_job: An optional string that corresponds to the job that should
+      process the tensors (i.e. your GPU or TPU worker).
+    num_epochs: The number of epochs through the training set that should be
+      generated. By default, it will repeat infinitely.
+    filename_shuffle_buffer_size: An optional integer whose value controls the
+      shuffling of the file names. If you would like to read from the files in
+      the same order, set to 0 or False.
+    num_parallel_reads: An optional integer controlling the number of files to
+      read from concurrently. (Set to 1 for no parallelism.)
+    batch_transfer_size: An optional integer controlling the batching used to
+      amortize the remote function invocation overhead. Set to a very large
+      number to increase throughput. Set to a very small number to reduce memory
+      consumption. Set to False to skip batching.
+    sloppy: (Optional.) If `False`, read input data while maintaining a
+      deterministic order. (This may have significant performance impacts.)
+      Defaults to `True`.
+  Returns:
+    A `tf.data.Dataset` with an infinite stream of elements generated by a
+    parallel interleaving of the set of files matched (or generated) by `files`.
+    Elements have the output type of the dataset specified by `filetype`.
+
+  Raises:
+    ValueError: if any argument is not of the expected type.
+  """
+  if filetype is None:
+    filetype = 'tfrecord'
+
+  if isinstance(filetype, str):
+    if filetype not in _FILETYPE_MAP:
+      raise ValueError('Unexpected filetype: %s' % filetype)
+    reader_fn = _FILETYPE_MAP[filetype]
+  elif callable(filetype):
+    reader_fn = filetype
+  else:
+    raise ValueError('filetype should be a string or a callable')
+
+  file_reader_job = file_reader_job or 'coordinator'
+
+  worker_job = worker_job or 'worker'
+
+  if filename_shuffle_buffer_size is None:
+    filename_shuffle_buffer_size = 4096
+
+  num_parallel_reads = num_parallel_reads or 8
+
+  if batch_transfer_size is None:
+    batch_transfer_size = 256
+
+  if sloppy is None:
+    sloppy = True
+
+  with ops.device('/job:%s' % file_reader_job):
+    if isinstance(files, str):
+      source_dataset = dataset_ops.Dataset.list_files(files)
+    elif isinstance(files, dataset_ops.Dataset):
+      source_dataset = files
+    else:
+      raise ValueError('files was not a string or a dataset: %s' % files)
+
+    if filename_shuffle_buffer_size:
+      source_dataset = source_dataset.shuffle(
+          buffer_size=filename_shuffle_buffer_size)
+
+    # NOTE: We perform the `repeat` on the source dataset, because the output
+    # dataset does not currently have enough information to recreate an iterator
+    # over the source dataset when it reaches the end.
+    source_dataset = source_dataset.repeat(num_epochs)
+
+    source_dataset = source_dataset.apply(
+        interleave_ops.parallel_interleave(
+            reader_fn, cycle_length=num_parallel_reads, sloppy=sloppy))
+
+    if batch_transfer_size:
+      source_dataset = source_dataset.batch(batch_transfer_size)
+
+    source_dataset = source_dataset.prefetch(1)
+
+    source_iterator = source_dataset.make_one_shot_iterator()
+    source_handle = source_iterator.string_handle()
+
+  @function.Defun(dtypes.string)
+  def LoadingFunc(h):
+    remote_iterator = iterator_ops.Iterator.from_string_handle(
+        h, source_dataset.output_types, source_dataset.output_shapes)
+    return remote_iterator.get_next()
+
+  def MapFn(unused_input):
+    return functional_ops.remote_call(
+        args=[source_handle],
+        Tout=[dtypes.string],
+        f=LoadingFunc,
+        target='/job:%s/replica:0/task:0/cpu:0' % file_reader_job)
+
+  with ops.device('/job:%s' % worker_job):
+    output_dataset = dataset_ops.Dataset.range(2).repeat().map(
+        MapFn, num_parallel_calls=4 if sloppy else None)
+    output_dataset = output_dataset.prefetch(1)
+
+    if batch_transfer_size:
+      # Undo the batching used during the transfer.
+      output_dataset = output_dataset.apply(batching.unbatch()).prefetch(1)
+
+  return output_dataset
diff --git a/tensorflow/contrib/tpu/python/tpu/datasets_test.py b/tensorflow/contrib/tpu/python/tpu/datasets_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..918cf0ed8e513de0d4207f7d2aac61ad886c8288
--- /dev/null
+++ b/tensorflow/contrib/tpu/python/tpu/datasets_test.py
@@ -0,0 +1,181 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""TPU datasets tests."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+from tensorflow.contrib.tpu.python.tpu import datasets
+from tensorflow.core.protobuf import cluster_pb2
+from tensorflow.core.protobuf import config_pb2
+from tensorflow.python.client import session
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.ops import readers
+from tensorflow.python.lib.io import python_io
+from tensorflow.python.platform import test
+from tensorflow.python.training import server_lib
+from tensorflow.python.util import compat
+
+_NUM_FILES = 10
+_NUM_ENTRIES = 20
+
+
+class DatasetsTest(test.TestCase):
+
+  def setUp(self):
+    super(DatasetsTest, self).setUp()
+    self._coord = server_lib.Server.create_local_server()
+    self._worker = server_lib.Server.create_local_server()
+
+    self._cluster_def = cluster_pb2.ClusterDef()
+    worker_job = self._cluster_def.job.add()
+    worker_job.name = 'worker'
+    worker_job.tasks[0] = self._worker.target[len('grpc://'):]
+    coord_job = self._cluster_def.job.add()
+    coord_job.name = 'coordinator'
+    coord_job.tasks[0] = self._coord.target[len('grpc://'):]
+
+    session_config = config_pb2.ConfigProto(cluster_def=self._cluster_def)
+
+    self._sess = session.Session(self._worker.target, config=session_config)
+
+  def testTextLineDataset(self):
+    all_contents = []
+    for i in range(_NUM_FILES):
+      filename = os.path.join(self.get_temp_dir(), 'text_line.%d.txt' % i)
+      contents = []
+      for j in range(_NUM_ENTRIES):
+        contents.append(compat.as_bytes('%d: %d' % (i, j)))
+      with open(filename, 'wb') as f:
+        f.write(b'\n'.join(contents))
+      all_contents.extend(contents)
+
+    dataset = datasets.StreamingFilesDataset(
+        os.path.join(self.get_temp_dir(), 'text_line.*.txt'), filetype='text')
+
+    iterator = dataset.make_initializable_iterator()
+    self._sess.run(iterator.initializer)
+    get_next = iterator.get_next()
+
+    retrieved_values = []
+    for _ in range(4 * len(all_contents)):
+      retrieved_values.append(compat.as_bytes(self._sess.run(get_next)))
+
+    self.assertEqual(set(all_contents), set(retrieved_values))
+
+  def testTFRecordDataset(self):
+    all_contents = []
+    for i in range(_NUM_FILES):
+      filename = os.path.join(self.get_temp_dir(), 'tf_record.%d' % i)
+      writer = python_io.TFRecordWriter(filename)
+      for j in range(_NUM_ENTRIES):
+        record = compat.as_bytes('Record %d of file %d' % (j, i))
+        writer.write(record)
+        all_contents.append(record)
+      writer.close()
+
+    dataset = datasets.StreamingFilesDataset(
+        os.path.join(self.get_temp_dir(), 'tf_record*'), filetype='tfrecord')
+
+    iterator = dataset.make_initializable_iterator()
+    self._sess.run(iterator.initializer)
+    get_next = iterator.get_next()
+
+    retrieved_values = []
+    for _ in range(4 * len(all_contents)):
+      retrieved_values.append(compat.as_bytes(self._sess.run(get_next)))
+
+    self.assertEqual(set(all_contents), set(retrieved_values))
+
+  def testTFRecordDatasetFromDataset(self):
+    filenames = []
+    all_contents = []
+    for i in range(_NUM_FILES):
+      filename = os.path.join(self.get_temp_dir(), 'tf_record.%d' % i)
+      filenames.append(filename)
+      writer = python_io.TFRecordWriter(filename)
+      for j in range(_NUM_ENTRIES):
+        record = compat.as_bytes('Record %d of file %d' % (j, i))
+        writer.write(record)
+        all_contents.append(record)
+      writer.close()
+
+    filenames = dataset_ops.Dataset.from_tensor_slices(filenames)
+
+    dataset = datasets.StreamingFilesDataset(filenames, filetype='tfrecord')
+
+    iterator = dataset.make_initializable_iterator()
+    self._sess.run(iterator.initializer)
+    get_next = iterator.get_next()
+
+    retrieved_values = []
+    for _ in range(4 * len(all_contents)):
+      retrieved_values.append(compat.as_bytes(self._sess.run(get_next)))
+
+    self.assertEqual(set(all_contents), set(retrieved_values))
+
+  def testArbitraryReaderFunc(self):
+
+    def MakeRecord(i, j):
+      return compat.as_bytes('%04d-%04d' % (i, j))
+
+    record_bytes = len(MakeRecord(10, 200))
+
+    all_contents = []
+    for i in range(_NUM_FILES):
+      filename = os.path.join(self.get_temp_dir(), 'fixed_length.%d' % i)
+      with open(filename, 'wb') as f:
+        for j in range(_NUM_ENTRIES):
+          record = MakeRecord(i, j)
+          f.write(record)
+          all_contents.append(record)
+
+    def FixedLengthFile(filename):
+      return readers.FixedLengthRecordDataset(filename, record_bytes)
+
+    dataset = datasets.StreamingFilesDataset(
+        os.path.join(self.get_temp_dir(), 'fixed_length*'),
+        filetype=FixedLengthFile)
+
+    iterator = dataset.make_initializable_iterator()
+    self._sess.run(iterator.initializer)
+    get_next = iterator.get_next()
+
+    retrieved_values = []
+    for _ in range(4 * len(all_contents)):
+      retrieved_values.append(compat.as_bytes(self._sess.run(get_next)))
+
+    self.assertEqual(set(all_contents), set(retrieved_values))
+
+  def testUnexpectedFiletypeString(self):
+    with self.assertRaises(ValueError):
+      datasets.StreamingFilesDataset(
+          os.path.join(self.get_temp_dir(), '*'), filetype='foo')
+
+  def testUnexpectedFiletypeType(self):
+    with self.assertRaises(ValueError):
+      datasets.StreamingFilesDataset(
+          os.path.join(self.get_temp_dir(), '*'), filetype=3)
+
+  def testUnexpectedFilesType(self):
+    with self.assertRaises(ValueError):
+      datasets.StreamingFilesDataset(123, filetype='tfrecord')
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/tpu/python/tpu/device_assignment.py b/tensorflow/contrib/tpu/python/tpu/device_assignment.py
index bdd9b88af55fa4fb483ddbdbe5c51d7076cce675..726b2d248e3086e1882004827076ed3e563d960d 100644
--- a/tensorflow/contrib/tpu/python/tpu/device_assignment.py
+++ b/tensorflow/contrib/tpu/python/tpu/device_assignment.py
@@ -191,9 +191,9 @@ class DeviceAssignment(object):
       logical_core: A tuple of three integers which represents a logical core.
     Returns:
       A sorted list of the replicas that are attached to that task and
-      loical_core.
+      logical_core.
     Raises:
-      ValueError: If no replica exisis in the task which contains the logical
+      ValueError: If no replica exists in the task which contains the logical
       core.
     """
     try:
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_config.py b/tensorflow/contrib/tpu/python/tpu/tpu_config.py
index 644070218214643923b9ca3ee138615ec568e8b5..38b5ea23103730630ae8e1cdd7b9180a501013c5 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_config.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_config.py
@@ -26,6 +26,7 @@ import os
 import numpy as np
 
 from tensorflow.contrib.tpu.python.tpu import util as util_lib
+from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.estimator import run_config as run_config_lib
 from tensorflow.python.platform import tf_logging as logging
 
@@ -65,7 +66,7 @@ class TPUConfig(
       cores. This is required by model-parallelism which enables partitioning
       the model to multiple cores. For example, [2, 2, 1] means the model is
       partitioned across 4 cores which span two cores in both x and y
-      coordinates.  Please refer to ${tf.contrib.tpu.TopologyProto} for the
+      coordinates.  Please refer to @{tf.contrib.tpu.Topology} for the
       geometry of a TPU mesh.
     per_host_input_for_training: If `True`, `input_fn` is invoked Per-Host
       rather than Per-Core. With Per-Host input pipeline deployment, `input_fn`
@@ -140,6 +141,7 @@ class RunConfig(run_config_lib.RunConfig):
                tpu_config=None,
                evaluation_master=None,
                master=None,
+               cluster=None,
                **kwargs):
     """Constructs a RunConfig.
 
@@ -148,15 +150,26 @@ class RunConfig(run_config_lib.RunConfig):
       evaluation_master: a string. The address of the master to use for eval.
         Defaults to master if not set.
       master: a string. The address of the master to use for training.
+      cluster: a ClusterResolver
       **kwargs: keyword config parameters.
+
+    Raises:
+      ValueError: if cluster is not None and the provided session_config has a
+        cluster_def already.
     """
     super(RunConfig, self).__init__(**kwargs)
     self._tpu_config = tpu_config or TPUConfig()
+    self._cluster = cluster
 
-    # If user sets master and/or evaluation_master explicilty, including empty
+    # If user sets master and/or evaluation_master explicitly, including empty
     # string '', take it. Otherwise, take the values set by parent class.
     if master is not None:
+      if cluster is not None:
+        raise ValueError('Both master and cluster are set.')
       self._master = master
+    else:
+      if cluster:
+        self._master = cluster.master()
 
     if evaluation_master is not None:
       self._evaluation_master = evaluation_master
@@ -170,6 +183,20 @@ class RunConfig(run_config_lib.RunConfig):
       # evaluation_master to master, unless user overwrites it.
       self._evaluation_master = self._master
 
+    # Set the ClusterSpec to use
+    if cluster:
+      self._cluster_spec = cluster.cluster_spec()
+
+      # Merge the cluster_def into the ConfigProto.
+      if self._session_config is None:  # pylint: disable=access-member-before-definition
+        self._session_config = config_pb2.ConfigProto(allow_soft_placement=True)
+      if self._session_config.HasField('cluster_def'):
+        raise ValueError(
+            'You cannot provide a ClusterResolver and '
+            'session_config.cluster_def.')
+      self._session_config.cluster_def.CopyFrom(
+          self._cluster_spec.as_cluster_def())
+
   @property
   def evaluation_master(self):
     return self._evaluation_master
@@ -182,6 +209,10 @@ class RunConfig(run_config_lib.RunConfig):
   def tpu_config(self):
     return self._tpu_config
 
+  @property
+  def cluster(self):
+    return self._cluster
+
   def replace(self, **kwargs):
     if 'tpu_config' not in kwargs:
       return super(RunConfig, self).replace(**kwargs)
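Review note on the `RunConfig` change above: the precedence between `master` and `cluster` can be sketched in plain Python. This is a minimal illustration, not the TensorFlow implementation. `FakeClusterResolver`, `resolve_master`, and `parent_default` are hypothetical names introduced here; only the `master()` method and the error message come from the diff.

```python
class FakeClusterResolver(object):
  """Hypothetical stand-in for a ClusterResolver; not a TensorFlow class."""

  def __init__(self, master_address):
    self._master_address = master_address

  def master(self):
    return self._master_address


def resolve_master(master=None, cluster=None, parent_default=''):
  """Mirrors the precedence the diff applies when choosing the master."""
  if master is not None:
    # An explicit master, including the empty string, conflicts with cluster.
    if cluster is not None:
      raise ValueError('Both master and cluster are set.')
    return master
  if cluster:
    return cluster.master()
  # Neither was given: keep whatever the parent class already set.
  return parent_default
```

Under this reading, passing `master=''` together with a cluster also raises, because the check is `master is not None` rather than a truthiness test.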
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_context.py b/tensorflow/contrib/tpu/python/tpu/tpu_context.py
index c5c46ea741ea64ca37089431f8ed66cad7bc31fb..3bac2db77e95520a6c9c4c17658267a9a6588d94 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_context.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_context.py
@@ -39,7 +39,7 @@ class _TPUContext(object):
 
   This immutable object holds TPUEstimator config, train/eval batch size, and
   `TPUEstimator.use_tpu`, which is expected to be passed around. It also
-  provides utility functions, basded on the current state, to determine other
+  provides utility functions, based on the current state, to determine other
   information commonly required by TPU computation, such as TPU device names,
   TPU hosts, shard batch size, etc.
 
@@ -218,7 +218,7 @@ class _TPUContext(object):
         model, when mode == PREDICT. Only with this bool, we could
         tell whether user is calling the Estimator.predict or
         Estimator.export_savedmodel, which are running on TPU and CPU
-        respectively. Parent class Estimator does not distingush these two.
+        respectively. Parent class Estimator does not distinguish these two.
 
     Returns:
       bool, whether current input_fn or model_fn should be running on CPU.
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py b/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
index 1b2eda1caa0fa2779834d65b5a49121d9cc0af56..435473574451c2cd3c04ae8344bb8b74598c5d7c 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
@@ -25,6 +25,7 @@ import threading
 import time
 import traceback
 
+import numpy as np
 import six
 from six.moves import queue as Queue  # pylint: disable=redefined-builtin
 from six.moves import xrange  # pylint: disable=redefined-builtin
@@ -48,6 +49,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
@@ -61,6 +63,7 @@ from tensorflow.python.training import evaluation
 from tensorflow.python.training import session_run_hook
 from tensorflow.python.training import training
 from tensorflow.python.training import training_util
+from tensorflow.python.util import nest
 from tensorflow.python.util import tf_inspect
 
 _INITIAL_LOSS = 1e7
@@ -69,6 +72,7 @@ _TPU_ESTIMATOR = 'tpu_estimator'
 _ITERATIONS_PER_LOOP_VAR = 'iterations_per_loop'
 _BATCH_SIZE_KEY = 'batch_size'
 _CROSS_REPLICA_SUM_OP = 'CrossReplicaSum'
+_ONE_GIGABYTE = 1024 * 1024 * 1024
 
 _RESERVED_PARAMS_KEYS = [_BATCH_SIZE_KEY]
 
@@ -133,7 +137,7 @@ def _increase_eval_step_op(iterations_per_loop):
   """Returns an op to increase the eval step for TPU evaluation.
 
   Args:
-    iterations_per_loop: Tensor. The number of eval steps runnining in TPU
+    iterations_per_loop: Tensor. The number of eval steps running in TPU
         system before returning to CPU host for each `Session.run`.
 
   Returns:
@@ -605,17 +609,17 @@ class _StoppingPredictHook(session_run_hook.SessionRunHook):
       # batch. And we append one more batch to signal the system it should stop.
       # The data flow might look like
       #
-      #  batch   0: images, labels, stop = 0  (user provideded)
-      #  batch   1: images, labels, stop = 0  (user provideded)
+      #  batch   0: images, labels, stop = 0  (user provided)
+      #  batch   1: images, labels, stop = 0  (user provided)
       #  ...
-      #  batch  99: images, labels, stop = 0  (user provideded)
+      #  batch  99: images, labels, stop = 0  (user provided)
       #  batch 100: images, labels, stop = 1  (TPUEstimator appended)
       #
       # where the final batch (id = 100) is appended by TPUEstimator, so we
       # should drop it before returning the predictions to user.
       # To achieve that, we throw the OutOfRangeError in after_run. Once
       # Monitored Session sees this error in SessionRunHook.after_run, the
-      # "current" prediciton, i.e., batch with id=100, will be discarded
+      # "current" prediction, i.e., batch with id=100, will be discarded
       # immediately
       raise errors.OutOfRangeError(None, None, 'Stopped by stopping signal.')
 
@@ -676,8 +680,11 @@ def generate_per_host_enqueue_ops_fn_for_host(
         raise TypeError(
             'For mode PREDICT, `input_fn` must return `Dataset` instead of '
             '`features` and `labels`.')
+      if batch_axis is not None:
+        raise TypeError('For mode PREDICT, batch_axis is not supported yet.')
       inputs = _InputsWithStoppingSignals(
-          dataset=inputs.dataset, batch_size=ctx.batch_size_for_input_fn)
+          dataset=inputs.dataset, batch_size=ctx.batch_size_for_input_fn,
+          add_padding=True)
 
     if is_dataset:
       hooks.append(inputs.dataset_initializer_hook())
@@ -751,7 +758,7 @@ class _InputPipeline(object):
   2. (features, labels)
 
   Internally, form 1 is reformed to `(features, None)` as features and labels
-  are passed separatedly to underlying methods. For TPU training, TPUEstimator
+  are passed separately to underlying methods. For TPU training, TPUEstimator
   may expect multiple `features` and `labels` tuples one for each core.
 
   TPUEstimator allows various different structures for inputs (namely `features`
@@ -784,7 +791,8 @@ class _InputPipeline(object):
       def _extract_key_names(tensor_or_dict):
         if tensor_or_dict is None:
           return []
-        return tensor_or_dict.keys() if isinstance(tensor_or_dict, dict) else []
+        return sorted(tensor_or_dict.keys()) if isinstance(
+            tensor_or_dict, dict) else []
 
       # Extract structure.
       has_labels = labels is not None
@@ -1036,8 +1044,8 @@ class _ModelFnWrapper(object):
     self._params = params
     self._ctx = ctx
 
-  def call_without_tpu(self, features, labels):
-    return self._call_model_fn(features, labels)
+  def call_without_tpu(self, features, labels, is_export_mode):
+    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
 
   def convert_to_single_tpu_train_step(self, dequeue_fn):
     """Converts user provided model_fn` as a single train step on TPU.
@@ -1196,7 +1204,7 @@ class _ModelFnWrapper(object):
 
     return predict_step, host_calls, captured_scaffold_fn
 
-  def _call_model_fn(self, features, labels, is_export_mode=True):
+  def _call_model_fn(self, features, labels, is_export_mode=False):
     """Calls the model_fn with required parameters."""
     model_fn_args = util.fn_args(self._model_fn)
     kwargs = {}
@@ -1222,7 +1230,11 @@ class _ModelFnWrapper(object):
                        'required by TPUEstimator to pass batch size as '
                        'params[\'batch_size\']'.format(self._model_fn))
 
-    batch_size_for_model_fn = self._ctx.batch_size_for_model_fn
+    if is_export_mode:
+      batch_size_for_model_fn = None
+    else:
+      batch_size_for_model_fn = self._ctx.batch_size_for_model_fn
+
     if batch_size_for_model_fn is not None:
       params[_BATCH_SIZE_KEY] = batch_size_for_model_fn
 
@@ -1516,14 +1528,20 @@ class TPUEstimator(estimator_lib.Estimator):
   size when calling the `input_fn` and `model_fn`. Users should specify
   global batch size in constructor, and then get the batch size for each shard
   in `input_fn` and `model_fn` by `params['batch_size']`.
-  For training, `model_fn` gets per-core batch size; `input_fn` may get
-  per-core or per-host batch size depending on
-  `per_host_input_for_training` in `TPUConfig`.
-  For evaluation, `model_fn` gets per-core batch size and `input_fn` get
-  per-host batch size.
+
+  - For training, `model_fn` gets per-core batch size; `input_fn` may get
+    per-core or per-host batch size depending on `per_host_input_for_training`
+    in `TPUConfig` (See docstring for TPUConfig for details).
+
+  - For evaluation and prediction, `model_fn` gets per-core batch size and
+    `input_fn` gets per-host batch size.
+
+  Evaluation
+  ==========
 
   `model_fn` should return `TPUEstimatorSpec`, which expects the `eval_metrics`
   for TPU evaluation.
+
   `TPUEstimatorSpec.eval_metrics` is a tuple of `metric_fn` and `tensors`, where
   `tensors` could be a list of `Tensor`s or dict of names to `Tensor`s. (See
   `TPUEstimatorSpec` for details).  `metric_fn` takes the `tensors` and returns
@@ -1535,12 +1553,17 @@ class TPUEstimator(estimator_lib.Estimator):
   `train_batch_size` or `eval_batch_size` unmodified as `params['batch_size']`.
 
   Current limitations:
+  --------------------
 
-  1. TPU evaluation only works on single host.
-  2. `input_fn` for evaluation should not throw OutOfRange error for all
-  evaluation steps and all batches should have the same size.
+  1. TPU evaluation only works on a single host (one TPU worker).
+
+  2. `input_fn` for evaluation should **NOT** raise an end-of-input exception
+     (`OutOfRangeError` or `StopIteration`). In addition, all batches in all
+     evaluation steps should have the same size.
 
   Example (MNIST):
+  ----------------
+
   ```
   # The metric Fn which runs on CPU.
   def metric_fn(labels, logits):
@@ -1576,8 +1599,83 @@ class TPUEstimator(estimator_lib.Estimator):
           }))
   ```
 
-  Predict support on TPU is not yet implemented. So, `predict` and
-  `export_savedmodel` are executed on CPU, even if `use_tpu` is true.
+  Prediction
+  ==========
+
+  Prediction on TPU is an experimental feature to support large batch inference.
+  It is not designed for latency-critical systems. In addition, due to some
+  usability issues, for prediction with a small dataset, CPU `.predict`, i.e.,
+  creating a new `TPUEstimator` instance with `use_tpu=False`, might be more
+  convenient.
+
+  Note: In contrast to TPU training/evaluation, the `input_fn` for prediction
+  *should* raise an end-of-input exception (`OutOfRangeError` or
+  `StopIteration`), which serves as the stopping signal to `TPUEstimator`. To be
+  precise, the ops created by `input_fn` produce one batch of the data.
+  The `predict()` API processes one batch at a time. When reaching the end of
+  the data source, an end-of-input exception should be raised by one of these
+  operations. The user usually does not need to do this manually. As long as the
+  dataset is not repeated forever, the `tf.data` API will raise an end-of-input
+  exception automatically after the last batch has been produced.
+
+  Note: Estimator.predict returns a Python generator. Please consume all the
+  data from the generator so that TPUEstimator can shut down the TPU system
+  properly for the user.
+
+  Current limitations:
+  --------------------
+  1. TPU prediction only works on a single host (one TPU worker).
+
+  2. `input_fn` must return a `Dataset` instance rather than `features`. In
+  fact, .train() and .evaluate() also support Dataset as the return value.
+
+  Example (MNIST):
+  ----------------
+  ```
+  height = 32
+  width = 32
+  total_examples = 100
+
+  def predict_input_fn(params):
+    batch_size = params['batch_size']
+
+    images = tf.random_uniform(
+        [total_examples, height, width, 3], minval=-1, maxval=1)
+
+    dataset = tf.data.Dataset.from_tensor_slices(images)
+    dataset = dataset.map(lambda images: {'image': images})
+
+    dataset = dataset.batch(batch_size)
+    return dataset
+
+  def model_fn(features, labels, params, mode):
+    # Generate predictions, called 'output', from features['image']
+
+    if mode == tf.estimator.ModeKeys.PREDICT:
+      return tf.contrib.tpu.TPUEstimatorSpec(
+          mode=mode,
+          predictions={
+              'predictions': output,
+              'is_padding': features['is_padding']
+          })
+
+  tpu_est = TPUEstimator(
+      model_fn=model_fn,
+      ...,
+      predict_batch_size=16)
+
+  # Fully consume the generator so that TPUEstimator can shut down the TPU
+  # system.
+  for item in tpu_est.predict(input_fn=predict_input_fn):
+    # Filter out item if the `is_padding` is 1.
+    # Process the 'predictions'
+  ```
+
+  Exporting
+  =========
+
+  Exporting `SavedModel` support on TPU is not yet implemented. So,
+  `export_savedmodel` is executed on CPU, even if `use_tpu` is true.
   """
 
   def __init__(self,
@@ -1684,6 +1782,8 @@ class TPUEstimator(estimator_lib.Estimator):
         eval_batch_size, predict_batch_size,
         use_tpu)
 
+    self._is_input_fn_invoked = None
+
   def _create_global_step(self, graph):
     """Creates a global step suitable for TPUs.
 
@@ -1766,6 +1866,9 @@ class TPUEstimator(estimator_lib.Estimator):
     if 'mode' in input_fn_args:
       kwargs['mode'] = mode
 
+    # Records the fact that input_fn has been invoked.
+    self._is_input_fn_invoked = True
+
     with self._ctx.with_mode(mode) as ctx:
       # Setting the batch size in params first. This helps user to have same
       # input_fn for use_tpu=True/False.
@@ -1794,6 +1897,17 @@ class TPUEstimator(estimator_lib.Estimator):
 
       return _input_fn
 
+  def _validate_features_in_predict_input(self, result):
+    """Skip the validation.
+
+    For TPUEstimator, we do not need to check the result type. `_InputPipeline`
+    has a stronger check. Parent class's check generates a confusing warning.
+
+    Args:
+      result: `features` returned by input_fn.
+    """
+    pass
+
   def _augment_model_fn(self, model_fn, batch_axis):
     """Returns a new model_fn, which wraps the TPU support."""
 
@@ -1802,15 +1916,24 @@ class TPUEstimator(estimator_lib.Estimator):
       with self._ctx.with_mode(mode) as ctx:
         model_fn_wrapper = _ModelFnWrapper(model_fn, config, params, ctx)
 
-        # For export_savedmodel, input_fn is never passed to Estimator. So,
-        # if features is callable, it means it is the input_fn passed by
-        # TPUEstimator._call_input_fn. Then we can know if the mode == PREDICT,
-        # it implies, it is the .predict API, not export_savedmodel API.
-        is_export_mode = not callable(features)
+        if mode != model_fn_lib.ModeKeys.PREDICT:
+          is_export_mode = False
+        else:
+          # For export_savedmodel, input_fn is never passed to Estimator. So, by
+          # checking the self._is_input_fn_invoked bit, we can know, given the
+          # mode == PREDICT, it is the .predict API, not export_savedmodel API.
+          if self._is_input_fn_invoked:
+            is_export_mode = False
+          else:
+            is_export_mode = True
+
+        # Clear the bit.
+        self._is_input_fn_invoked = None
 
         if ctx.is_running_on_cpu(is_export_mode=is_export_mode):
           logging.info('Running %s on CPU', mode)
-          return model_fn_wrapper.call_without_tpu(features, labels)
+          return model_fn_wrapper.call_without_tpu(
+              features, labels, is_export_mode=is_export_mode)
 
         assert labels is None, '`labels` passed to `model_fn` must be `None`.'
         # TPUEstimator._call_input_fn passes `input_fn` as features to here.
@@ -1948,12 +2071,18 @@ class TPUEstimator(estimator_lib.Estimator):
           host_ops = host_call_ret['host_call']
 
         predictions = host_call_ret['predictions']
-        stopping_signals = host_call_ret['signals']
+        _verify_cross_hosts_transfer_size(
+            predictions, message=(
+                'The estimated size for TPUEstimatorSpec.predictions is too '
+                'large.'))
+        signals = host_call_ret['signals']
 
         with ops.control_dependencies(host_ops):
           host_ops = []  # Empty, we do do not need it anymore.
           scalar_stopping_signal = _StopSignals.as_scalar_stopping_signal(
-              stopping_signals)
+              signals)
+          predictions = _PaddingSignals.slice_tensor_or_dict(
+              predictions, signals)
 
         hooks = [
             _StoppingPredictHook(scalar_stopping_signal),
@@ -2248,20 +2377,19 @@ class _Inputs(object):
     return self._dataset
 
 
-# TODO(xiejw): Extend this to support final partial batch.
 class _InputsWithStoppingSignals(_Inputs):
   """Inputs with `_StopSignals` inserted into the dataset."""
 
-  def __init__(self, dataset, batch_size):
+  def __init__(self, dataset, batch_size, add_padding=False):
 
     assert dataset is not None
 
     user_provided_dataset = dataset.map(
         _InputsWithStoppingSignals.insert_stopping_signal(
-            stop=False, batch_size=batch_size))
+            stop=False, batch_size=batch_size, add_padding=add_padding))
     final_batch_dataset = dataset.take(1).map(
         _InputsWithStoppingSignals.insert_stopping_signal(
-            stop=True, batch_size=batch_size))
+            stop=True, batch_size=batch_size, add_padding=add_padding))
     dataset = user_provided_dataset.concatenate(final_batch_dataset).prefetch(2)
 
     super(_InputsWithStoppingSignals, self).__init__(dataset=dataset)
@@ -2291,7 +2419,7 @@ class _InputsWithStoppingSignals(_Inputs):
     return signals
 
   @staticmethod
-  def insert_stopping_signal(stop, batch_size):
+  def insert_stopping_signal(stop, batch_size, add_padding=False):
     """Inserts stopping_signal into dataset via _map_fn.
 
     Here we change the data structure in the dataset, such that the return value
@@ -2302,19 +2430,39 @@ class _InputsWithStoppingSignals(_Inputs):
     Args:
       stop: bool, state of current stopping signals.
       batch_size: int, batch size.
+      add_padding: bool, whether to pad the tensor to full batch size.
 
     Returns:
       A map_fn passed to dataset.map API.
     """
 
     def _map_fn(*args):
+      """The map fn to insert signals."""
+      if len(args) == 1:
+        # Unpack the single Tensor/dict argument as features. This is required
+        # when the input_fn returns no labels.
+        args = args[0]
       features, labels = _Inputs._parse_inputs(args)
       new_input_dict = {}
-      new_input_dict['features'] = features
-      if labels is not None:
-        new_input_dict['labels'] = labels
+
+      if add_padding:
+        padding_mask, features, labels = (
+            _PaddingSignals.pad_features_and_labels(
+                features, labels, batch_size))
+
+        new_input_dict['features'] = features
+        if labels is not None:
+          new_input_dict['labels'] = labels
+
+      else:
+        new_input_dict['features'] = features
+        if labels is not None:
+          new_input_dict['labels'] = labels
+        padding_mask = None
+
       new_input_dict['signals'] = _StopSignals(
-          stop=stop, batch_size=batch_size).as_dict()
+          stop=stop, batch_size=batch_size, padding_mask=padding_mask).as_dict()
+
       return new_input_dict
 
     return _map_fn
@@ -2323,23 +2471,28 @@ class _InputsWithStoppingSignals(_Inputs):
 class _StopSignals(object):
   """Signals class holding all logic to handle TPU stopping condition."""
 
-  NON_STOPPING_SIGNAL = 0.0
-  STOPPING_SIGNAL = 1.0
+  NON_STOPPING_SIGNAL = False
+  STOPPING_SIGNAL = True
 
-  def __init__(self, stop, batch_size):
+  def __init__(self, stop, batch_size, padding_mask=None):
     self._stop = stop
     self._batch_size = batch_size
+    self._padding_mask = padding_mask
 
   def as_dict(self):
+    """Returns the signals as Python dict."""
     shape = [self._batch_size, 1]
-    dtype = dtypes.float32
+    dtype = dtypes.bool
 
     if self._stop:
       stopping = array_ops.ones(shape=shape, dtype=dtype)
     else:
       stopping = array_ops.zeros(shape=shape, dtype=dtype)
 
-    return {'stopping': stopping}
+    signals = {'stopping': stopping}
+    if self._padding_mask is not None:
+      signals['padding_mask'] = self._padding_mask
+    return signals
 
   @staticmethod
   def as_scalar_stopping_signal(signals):
@@ -2347,7 +2500,118 @@ class _StopSignals(object):
 
   @staticmethod
   def should_stop(scalar_stopping_signal):
-    return scalar_stopping_signal >= _StopSignals.STOPPING_SIGNAL
+    if isinstance(scalar_stopping_signal, ops.Tensor):
+      # STOPPING_SIGNAL is a constant True. Here, the logical_and is just the TF
+      # way to express the bool check whether scalar_stopping_signal is True.
+      return math_ops.logical_and(
+          scalar_stopping_signal, _StopSignals.STOPPING_SIGNAL)
+    else:
+      # For non Tensor case, it is used in SessionRunHook. So, we cannot modify
+      # the graph anymore. Here, we use pure Python.
+      return bool(scalar_stopping_signal)
+
+
+class _PaddingSignals(object):
+  """Signals class holding all logic to handle padding."""
+
+  @staticmethod
+  def pad_features_and_labels(features, labels, batch_size):
+    """Pads out the batch dimension of features and labels."""
+    real_batch_size = array_ops.shape(
+        _PaddingSignals._find_any_tensor(features))[0]
+
+    batch_size_tensor = constant_op.constant(batch_size, dtypes.int32)
+
+    check_greater = check_ops.assert_greater_equal(
+        batch_size_tensor, real_batch_size,
+        data=(batch_size_tensor, real_batch_size),
+        message='The real batch size should not be greater than batch_size.')
+
+    with ops.control_dependencies([check_greater]):
+      missing_count = batch_size_tensor - real_batch_size
+
+    def pad_single_tensor(tensor):
+      """Pads out the batch dimension of a tensor to the complete batch_size."""
+      rank = len(tensor.shape)
+      assert rank > 0
+      padding = array_ops.stack([[0, missing_count]] + [[0, 0]] * (rank - 1))
+      padded_shape = (batch_size,) + tuple(tensor.shape[1:])
+      padded_tensor = array_ops.pad(tensor, padding)
+      padded_tensor.set_shape(padded_shape)
+      return padded_tensor
+
+    def nest_pad(tensor_or_dict):
+      return nest.map_structure(pad_single_tensor, tensor_or_dict)
+
+    features = nest_pad(features)
+    if labels is not None:
+      labels = nest_pad(labels)
+
+    padding_mask = _PaddingSignals._padding_mask(
+        real_batch_size, missing_count, batch_size)
+
+    return padding_mask, features, labels
+
+  @staticmethod
+  def slice_tensor_or_dict(tensor_or_dict, signals):
+    """Slice the real Tensors according to padding mask in signals."""
+
+    padding_mask = signals['padding_mask']
+    batch_size = array_ops.shape(padding_mask)[0]
+
+    def verify_batch_size(tensor):
+      check_batch_size = math_ops.equal(batch_size, tensor.shape[0])
+      with ops.control_dependencies([check_batch_size]):
+        return array_ops.identity(tensor)
+
+    def slice_single_tensor(tensor):
+      rank = len(tensor.shape)
+      assert rank > 0
+      real_batch_size = batch_size - math_ops.reduce_sum(padding_mask)
+      return verify_batch_size(tensor)[0:real_batch_size]
+
+    # As we split the Tensors to all TPU cores and concat them back, it is
+    # important to ensure the real data is placed before padded ones, i.e.,
+    # order is preserved. By that, the sliced padding mask should have all 0's.
+    # If this assertion failed, the slice logic here would not hold.
+    sliced_padding_mask = slice_single_tensor(padding_mask)
+    assert_padding_mask = math_ops.equal(
+        math_ops.reduce_sum(sliced_padding_mask), 0)
+
+    with ops.control_dependencies([assert_padding_mask]):
+      should_stop = _StopSignals.should_stop(
+          _StopSignals.as_scalar_stopping_signal(signals))
+
+    is_full_batch = math_ops.equal(math_ops.reduce_sum(padding_mask), 0)
+
+    def slice_fn(tensor):
+      # If the current batch is a full batch or part of the stopping signals,
+      # we skip the slice to save time.
+      return control_flow_ops.cond(
+          math_ops.logical_or(should_stop, is_full_batch),
+          (lambda: verify_batch_size(tensor)),
+          (lambda: slice_single_tensor(tensor)))
+
+    return nest.map_structure(slice_fn, tensor_or_dict)
+
+  @staticmethod
+  def _find_any_tensor(batch_features):
+    tensors = [x for x in nest.flatten(batch_features)
+               if isinstance(x, ops.Tensor)]
+    if not tensors:
+      raise ValueError('Cannot find any Tensor in features dict.')
+    return tensors[0]
+
+  @staticmethod
+  def _padding_mask(real_batch_size, missing_count, batch_size):
+    padding_mask = array_ops.concat(
+        [
+            array_ops.zeros((real_batch_size,), dtype=dtypes.int32),
+            array_ops.ones((missing_count,), dtype=dtypes.int32)
+        ],
+        axis=0)
+    padding_mask.set_shape((batch_size,))
+    return padding_mask
 
 
 class _SignalsHelper(object):
@@ -2368,3 +2632,21 @@ class _SignalsHelper(object):
   @staticmethod
   def as_tensor_list(signals):
     return [signals[key] for key in sorted(signals.iterkeys())]
+
+
+def _verify_cross_hosts_transfer_size(tensor_dict, message):
+  total_size = 0
+  tensor_structure = {}
+  for key, tensor in tensor_dict.items():
+    shape = tensor.shape
+    size = np.product(shape) * tensor.dtype.size
+    tensor_structure[key] = shape
+    total_size += size
+  if total_size >= _ONE_GIGABYTE:
+    raise ValueError(
+        '{} The transfer size is larger than the protobuf limit. Please '
+        'consider using Tensors with smaller shapes or reducing the batch '
+        'size. Given:\n'
+        '{}'.format(message, '\n'.join([
+            ' -- Key: {}, Shape: {}'.format(k, v)
+            for k, v in tensor_structure.items()])))
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_estimator_signals_test.py b/tensorflow/contrib/tpu/python/tpu/tpu_estimator_signals_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..3e90957e6dea7ff1777dd3e26cdf1c6fdb340dd3
--- /dev/null
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_estimator_signals_test.py
@@ -0,0 +1,291 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""TPU Estimator Signalling Tests."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tpu.python.tpu import tpu_estimator
+from tensorflow.python import data as dataset_lib
+from tensorflow.python.client import session
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops
+from tensorflow.python.platform import test
+
+
+def make_input_fn(num_samples):
+  a = np.linspace(0, 100.0, num=num_samples)
+  b = np.reshape(np.array(a, dtype=np.float32), (len(a), 1))
+
+  def input_fn(params):
+    batch_size = params['batch_size']
+    da1 = dataset_lib.Dataset.from_tensor_slices(a)
+    da2 = dataset_lib.Dataset.from_tensor_slices(b)
+
+    dataset = dataset_lib.Dataset.zip((da1, da2))
+    dataset = dataset.map(lambda fa, fb: {'a': fa, 'b': fb})
+    dataset = dataset.batch(batch_size)
+    return dataset
+  return input_fn, (a, b)
+
+
+def make_input_fn_with_labels(num_samples):
+  a = np.linspace(0, 100.0, num=num_samples)
+  b = np.reshape(np.array(a, dtype=np.float32), (len(a), 1))
+
+  def input_fn(params):
+    batch_size = params['batch_size']
+    da1 = dataset_lib.Dataset.from_tensor_slices(a)
+    da2 = dataset_lib.Dataset.from_tensor_slices(b)
+
+    dataset = dataset_lib.Dataset.zip((da1, da2))
+    dataset = dataset.map(lambda fa, fb: ({'a': fa}, fb))
+    dataset = dataset.batch(batch_size)
+    return dataset
+  return input_fn, (a, b)
+
+
+class TPUEstimatorStoppingSignalsTest(test.TestCase):
+
+  def test_normal_output_without_signals(self):
+    num_samples = 4
+    batch_size = 2
+
+    params = {'batch_size': batch_size}
+    input_fn, (a, b) = make_input_fn(num_samples=num_samples)
+
+    with ops.Graph().as_default():
+      dataset = input_fn(params)
+      features = dataset.make_one_shot_iterator().get_next()
+
+      # With tf.data.Dataset.batch, the batch dimension is None, i.e., dynamic.
+      self.assertIsNone(features['a'].shape.as_list()[0])
+
+      with session.Session() as sess:
+        result = sess.run(features)
+        self.assertAllEqual(a[:batch_size], result['a'])
+        self.assertAllEqual(b[:batch_size], result['b'])
+
+        # This run should work as num_samples / batch_size = 2.
+        result = sess.run(features)
+        self.assertAllEqual(a[batch_size:num_samples], result['a'])
+        self.assertAllEqual(b[batch_size:num_samples], result['b'])
+
+        with self.assertRaises(errors.OutOfRangeError):
+          # Given num_samples and batch_size, this run should fail.
+          sess.run(features)
+
+  def test_output_with_stopping_signals(self):
+    num_samples = 4
+    batch_size = 2
+
+    params = {'batch_size': batch_size}
+    input_fn, (a, b) = make_input_fn(num_samples=num_samples)
+
+    with ops.Graph().as_default():
+      dataset = input_fn(params)
+      inputs = tpu_estimator._InputsWithStoppingSignals(dataset, batch_size)
+      hook = inputs.dataset_initializer_hook()
+      features, _ = inputs.features_and_labels()
+      signals = inputs.signals()
+
+      # With tf.data.Dataset.batch, the batch dimension is None, i.e., dynamic.
+      self.assertIsNone(features['a'].shape.as_list()[0])
+
+      with session.Session() as sess:
+        hook.begin()
+        hook.after_create_session(sess, coord=None)
+
+        result, evaluated_signals = sess.run([features, signals])
+        self.assertAllEqual(a[:batch_size], result['a'])
+        self.assertAllEqual(b[:batch_size], result['b'])
+        self.assertAllEqual([[0.]] * batch_size, evaluated_signals['stopping'])
+
+        # This run should work as num_samples / batch_size = 2.
+        result, evaluated_signals = sess.run([features, signals])
+        self.assertAllEqual(a[batch_size:num_samples], result['a'])
+        self.assertAllEqual(b[batch_size:num_samples], result['b'])
+        self.assertAllEqual([[0.]] * batch_size, evaluated_signals['stopping'])
+
+        # This run should work, *but* see STOP ('1') as signals
+        _, evaluated_signals = sess.run([features, signals])
+        self.assertAllEqual([[1.]] * batch_size, evaluated_signals['stopping'])
+
+        with self.assertRaises(errors.OutOfRangeError):
+          sess.run(features)
+
+
+class TPUEstimatorStoppingSignalsWithPaddingTest(test.TestCase):
+
+  def test_num_samples_divisible_by_batch_size(self):
+    num_samples = 4
+    batch_size = 2
+
+    params = {'batch_size': batch_size}
+    input_fn, (a, b) = make_input_fn(num_samples=num_samples)
+
+    with ops.Graph().as_default():
+      dataset = input_fn(params)
+      inputs = tpu_estimator._InputsWithStoppingSignals(dataset, batch_size,
+                                                        add_padding=True)
+      hook = inputs.dataset_initializer_hook()
+      features, _ = inputs.features_and_labels()
+      signals = inputs.signals()
+
+      # With padding, all shapes are static now.
+      self.assertEqual(batch_size, features['a'].shape.as_list()[0])
+
+      with session.Session() as sess:
+        hook.begin()
+        hook.after_create_session(sess, coord=None)
+
+        result, evaluated_signals = sess.run([features, signals])
+        self.assertAllEqual(a[:batch_size], result['a'])
+        self.assertAllEqual(b[:batch_size], result['b'])
+        self.assertAllEqual([[0.]] * batch_size, evaluated_signals['stopping'])
+        self.assertAllEqual([0.] * batch_size,
+                            evaluated_signals['padding_mask'])
+
+        # This run should work as num_samples / batch_size = 2.
+        result, evaluated_signals = sess.run([features, signals])
+        self.assertAllEqual(a[batch_size:num_samples], result['a'])
+        self.assertAllEqual(b[batch_size:num_samples], result['b'])
+        self.assertAllEqual([[0.]] * batch_size, evaluated_signals['stopping'])
+        self.assertAllEqual([0.] * batch_size,
+                            evaluated_signals['padding_mask'])
+
+        # This run should work, *but* see STOP ('1') as signals
+        _, evaluated_signals = sess.run([features, signals])
+        self.assertAllEqual([[1.]] * batch_size, evaluated_signals['stopping'])
+
+        with self.assertRaises(errors.OutOfRangeError):
+          sess.run(features)
+
+  def test_num_samples_not_divisible_by_batch_size(self):
+    num_samples = 5
+    batch_size = 2
+
+    params = {'batch_size': batch_size}
+    input_fn, (a, b) = make_input_fn_with_labels(num_samples=num_samples)
+
+    with ops.Graph().as_default():
+      dataset = input_fn(params)
+      inputs = tpu_estimator._InputsWithStoppingSignals(dataset, batch_size,
+                                                        add_padding=True)
+      hook = inputs.dataset_initializer_hook()
+      features, labels = inputs.features_and_labels()
+      signals = inputs.signals()
+
+      # With padding, all shapes are static.
+      self.assertEqual(batch_size, features['a'].shape.as_list()[0])
+
+      with session.Session() as sess:
+        hook.begin()
+        hook.after_create_session(sess, coord=None)
+
+        evaluated_features, evaluated_labels, evaluated_signals = (
+            sess.run([features, labels, signals]))
+        self.assertAllEqual(a[:batch_size], evaluated_features['a'])
+        self.assertAllEqual(b[:batch_size], evaluated_labels)
+        self.assertAllEqual([[0.]] * batch_size, evaluated_signals['stopping'])
+        self.assertAllEqual([0.] * batch_size,
+                            evaluated_signals['padding_mask'])
+
+        # This run should work as num_samples / batch_size >= 2.
+        evaluated_features, evaluated_labels, evaluated_signals = (
+            sess.run([features, labels, signals]))
+        self.assertAllEqual(a[batch_size:2*batch_size], evaluated_features['a'])
+        self.assertAllEqual(b[batch_size:2*batch_size], evaluated_labels)
+        self.assertAllEqual([[0.]] * batch_size, evaluated_signals['stopping'])
+        self.assertAllEqual([0.] * batch_size,
+                            evaluated_signals['padding_mask'])
+
+        # This is the final partial batch.
+        evaluated_features, evaluated_labels, evaluated_signals = (
+            sess.run([features, labels, signals]))
+        real_batch_size = num_samples % batch_size
+
+        # Assert the real part.
+        self.assertAllEqual(a[2*batch_size:num_samples],
+                            evaluated_features['a'][:real_batch_size])
+        self.assertAllEqual(b[2*batch_size:num_samples],
+                            evaluated_labels[:real_batch_size])
+        # Assert the padded part.
+        self.assertAllEqual([0.0] * (batch_size - real_batch_size),
+                            evaluated_features['a'][real_batch_size:])
+        self.assertAllEqual([[0.0]] * (batch_size - real_batch_size),
+                            evaluated_labels[real_batch_size:])
+
+        self.assertAllEqual([[0.]] * batch_size, evaluated_signals['stopping'])
+
+        padding = ([.0] * real_batch_size
+                   + [1.] * (batch_size - real_batch_size))
+        self.assertAllEqual(padding, evaluated_signals['padding_mask'])
+
+        # This run should work, *but* see STOP ('1') as signals
+        _, evaluated_signals = sess.run([features, signals])
+        self.assertAllEqual([[1.]] * batch_size, evaluated_signals['stopping'])
+
+        with self.assertRaises(errors.OutOfRangeError):
+          sess.run(features)
+
+  def test_slice(self):
+    num_samples = 3
+    batch_size = 2
+
+    params = {'batch_size': batch_size}
+    input_fn, (a, b) = make_input_fn(num_samples=num_samples)
+
+    with ops.Graph().as_default():
+      dataset = input_fn(params)
+      inputs = tpu_estimator._InputsWithStoppingSignals(dataset, batch_size,
+                                                        add_padding=True)
+      hook = inputs.dataset_initializer_hook()
+      features, _ = inputs.features_and_labels()
+      signals = inputs.signals()
+
+      sliced_features = (
+          tpu_estimator._PaddingSignals.slice_tensor_or_dict(
+              features, signals))
+
+      with session.Session() as sess:
+        hook.begin()
+        hook.after_create_session(sess, coord=None)
+
+        result, evaluated_signals = sess.run([sliced_features, signals])
+        self.assertAllEqual(a[:batch_size], result['a'])
+        self.assertAllEqual(b[:batch_size], result['b'])
+        self.assertAllEqual([[0.]] * batch_size, evaluated_signals['stopping'])
+
+        # This is the final partial batch.
+        result, evaluated_signals = sess.run([sliced_features, signals])
+        self.assertEqual(1, len(result['a']))
+        self.assertAllEqual(a[batch_size:num_samples], result['a'])
+        self.assertAllEqual(b[batch_size:num_samples], result['b'])
+        self.assertAllEqual([[0.]] * batch_size, evaluated_signals['stopping'])
+
+        # This run should work, *but* see STOP ('1') as signals
+        _, evaluated_signals = sess.run([sliced_features, signals])
+        self.assertAllEqual([[1.]] * batch_size, evaluated_signals['stopping'])
+
+        with self.assertRaises(errors.OutOfRangeError):
+          sess.run(sliced_features)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/tpu/python/tpu/training_loop.py b/tensorflow/contrib/tpu/python/tpu/training_loop.py
index 3d7896127a99653167f164873331a2cc95f656e8..82a75d02552b7b013452945a76b16c2c2fb9fa82 100644
--- a/tensorflow/contrib/tpu/python/tpu/training_loop.py
+++ b/tensorflow/contrib/tpu/python/tpu/training_loop.py
@@ -170,7 +170,7 @@ def while_loop(condition, body, inputs=None, infeed_queue=None, name=None):
 
 
 def repeat(n, body, inputs=None, infeed_queue=None, name=None):
-  """Builds a training loop that executes a fixed number of interations.
+  """Builds a training loop that executes a fixed number of iterations.
 
   The set of loop-carried tensors correspond to `inputs`.
   `body` must be a function that takes and returns the values of the
diff --git a/tensorflow/contrib/training/BUILD b/tensorflow/contrib/training/BUILD
index 6db373d2d5e20ea7da449530b2730403c3bb64cc..6ae2f382528c37ae647b73ea01a7f88c07580c78 100644
--- a/tensorflow/contrib/training/BUILD
+++ b/tensorflow/contrib/training/BUILD
@@ -324,7 +324,6 @@ tf_proto_library(
     name = "protos_all",
     srcs = glob(["**/*.proto"]),
     cc_api_version = 2,
-    go_api_version = 2,
     java_api_version = 2,
     visibility = ["//visibility:public"],
 )
diff --git a/tensorflow/contrib/verbs/README.md b/tensorflow/contrib/verbs/README.md
index 58fed4e5cb4c24b0f21dfe9b99cf4c665d2591c7..4b6104a8b4d542b1d8a9cb3e48eeed4950d791cd 100644
--- a/tensorflow/contrib/verbs/README.md
+++ b/tensorflow/contrib/verbs/README.md
@@ -93,7 +93,7 @@ When the receiver receives the RDMA write, it will locate the relevant **RdmaTen
 
 1. When the sender receives a tensor request, the source tensor may or may not be ready yet. The situation is handled through a process of tag matching:
 	* If the request arrives before the tensor is ready, then a callback is put in a local table, and will be invoked once the tensor arrives.
-	* If the tensor is ready before the request arives, than the tensor is put in a local table. When the request arrives, it will invoke the callback immediately.
+	* If the tensor is ready before the request arrives, then the tensor is put in a local table. When the request arrives, it will invoke the callback immediately.
    In code it is done by calling **RecvLocalAsync()**, which receives the tensor's key, step-id, and the callback.
 2. When the callback is invoked, the relevant tensor is removed from the tag matching table. In the case where we need to send the tensor's meta-data, the **RdmaTensorResponse** will store a copy of the tensor until the re-request arrives.
 3. The sending of protocol messages (**RDMA_MESSAGE_TENSOR_REQUEST**, **RDMA_MESSAGE_META_DATA_RESPONSE** and **RDMA_MESSAGE_TENSOR_RE_REQUEST**) is done by the class **RdmaMessageBuffer**. All messages are sent using RDMA writes from/to fixed messages buffers. This implies that we cannot send on a specific channel more than one message at a time. In order to synchronize the messages, the **RdmaMessageBuffer** holds the a local and remote buffer statuses which can be either busy or idle. When a write is issued, both statuses will be changed to busy. When the write-complete event is received, the local status is changed to idle. When the write is received on the remote side, the remote side will parse the message, and return an ACK back to the sending side on which the sending side will update the remote status to idle. When both the local and remote statuses are idle, the next message can be sent.
diff --git a/tensorflow/contrib/verbs/patch_notes_verbs_with_0_copies.md b/tensorflow/contrib/verbs/patch_notes_verbs_with_0_copies.md
index 956b8f2147cf8154b6f1ade006d7bff194864c9b..da6fdd48e19e9d1503d1537926b1c464a0e77589 100644
--- a/tensorflow/contrib/verbs/patch_notes_verbs_with_0_copies.md
+++ b/tensorflow/contrib/verbs/patch_notes_verbs_with_0_copies.md
@@ -64,7 +64,7 @@ The protocol messages themselves will remain mostly unchanged at the first stage
 	* type - The message type.
 	* request_index - Request index.
 	* is_dead/data_type/tensor_shape/tensor_bytes - The up-to-date meta-data.
-* **RDMA_MESSAGE_BUFFER_RESPONSE** - (receiver ==> sender) Tensor re-requset after meta-data update and reallocation of result/proxy tensors.
+* **RDMA_MESSAGE_BUFFER_RESPONSE** - (receiver ==> sender) Tensor re-request after meta-data update and reallocation of result/proxy tensors.
 	* type - The message type.
 	* name (name_size) - Name of the requested tensor.
 	* step_id - Step ID.
diff --git a/tensorflow/core/BUILD b/tensorflow/core/BUILD
index 1893967cdd0034bcff52c84f4db0bf1e2e3334f4..2885a9f82355b617529132da7c289d0d6126f670 100644
--- a/tensorflow/core/BUILD
+++ b/tensorflow/core/BUILD
@@ -220,7 +220,6 @@ tf_proto_library(
     srcs = CORE_PROTO_SRCS + ADDITIONAL_CORE_PROTO_SRCS,
     cc_api_version = 2,
     default_header = True,
-    go_api_version = 2,
     j2objc_api_version = 1,
     java_api_version = 2,
     js_api_version = 2,
@@ -314,6 +313,7 @@ cc_library(
         "lib/gtl/optional.h",
         "lib/gtl/priority_queue_util.h",
         "lib/hash/crc32c.h",
+        "lib/hash/hash.h",
         "lib/histogram/histogram.h",
         "lib/io/buffered_inputstream.h",
         "lib/io/compression.h",
@@ -339,6 +339,7 @@ cc_library(
         "lib/strings/strcat.h",
         "lib/strings/stringprintf.h",
         "platform/abi.h",
+        "platform/context.h",
         "platform/cpu_feature_guard.h",
         "platform/cpu_info.h",
         "platform/dynamic_annotations.h",
@@ -593,6 +594,7 @@ cc_library(
         "platform/prefetch.h",
         "platform/thread_annotations.h",
         "platform/types.h",
+        "platform/cpu_info.h",
     ] + if_windows(["platform/windows/integral_types.h"]),
     visibility = ["//visibility:public"],
     deps =
@@ -632,6 +634,7 @@ tf_gen_op_libs(
         "random_ops",
         "remote_fused_graph_ops",
         "resource_variable_ops",
+        "scoped_allocator_ops",
         "sdca_ops",
         "set_ops",
         "script_ops",
@@ -685,6 +688,34 @@ cc_library(
     alwayslink = 1,
 )
 
+cc_library(
+    name = "cudnn_rnn_ops",
+    srcs = [
+        "ops/cudnn_rnn_ops.cc",
+    ],
+    linkstatic = 1,
+    visibility = ["//tensorflow:internal"],
+    deps = [
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:stream_executor",
+        "//tensorflow/core/kernels:bounds_check_lib",
+        "//third_party/eigen3",
+        "@farmhash_archive//:farmhash",
+    ],
+    alwayslink = 1,
+)
+
+tf_gen_op_libs(
+    op_lib_names = [
+        "cudnn_rnn_ops",
+    ],
+    deps = [
+        ":lib",
+    ],
+)
+
 cc_library(
     name = "ops",
     visibility = ["//visibility:public"],
@@ -697,6 +728,7 @@ cc_library(
         ":checkpoint_ops_op_lib",
         ":control_flow_ops_op_lib",
         ":ctc_ops_op_lib",
+        ":cudnn_rnn_ops_op_lib",
         ":data_flow_ops_op_lib",
         ":dataset_ops_op_lib",
         ":function_ops_op_lib",
@@ -715,11 +747,13 @@ cc_library(
         ":random_ops_op_lib",
         ":remote_fused_graph_ops_op_lib",
         ":resource_variable_ops_op_lib",
+        ":scoped_allocator_ops_op_lib",
         ":script_ops_op_lib",
         ":sdca_ops_op_lib",
         ":sendrecv_ops_op_lib",
         ":set_ops_op_lib",
         ":sparse_ops_op_lib",
+        ":summary_ops_op_lib",
         ":spectral_ops_op_lib",
         ":state_ops_op_lib",
         ":stateless_random_ops_op_lib",
@@ -835,6 +869,7 @@ cc_library(
         "//tensorflow/core/kernels:checkpoint_ops",
         "//tensorflow/core/kernels:control_flow_ops",
         "//tensorflow/core/kernels:ctc_ops",
+        "//tensorflow/core/kernels:cudnn_rnn_kernels",
         "//tensorflow/core/kernels:data_flow",
         "//tensorflow/core/kernels:dataset_ops",
         "//tensorflow/core/kernels:fake_quant_ops",
@@ -858,6 +893,7 @@ cc_library(
         "//tensorflow/core/kernels:remote_fused_graph_ops",
         "//tensorflow/core/kernels:required",
         "//tensorflow/core/kernels:resource_variable_ops",
+        "//tensorflow/core/kernels:scoped_allocator_ops",
         "//tensorflow/core/kernels:sdca_ops",
         "//tensorflow/core/kernels:set_kernels",
         "//tensorflow/core/kernels:sparse",
@@ -1034,6 +1070,7 @@ filegroup(
             "util/tensor_bundle/*.h",
             "util/tensor_bundle/*.cc",
             "common_runtime/gpu/**/*",
+            "common_runtime/eager/*",
             "common_runtime/gpu_device_factory.*",
         ],
     ),
@@ -1059,6 +1096,7 @@ filegroup(
             "**/*testlib*",
             "**/*main.cc",
             "common_runtime/gpu/**/*",
+            "common_runtime/eager/*",
             "common_runtime/gpu_device_factory.*",
             "graph/dot.*",
         ],
@@ -1402,6 +1440,13 @@ tf_pyclif_proto_library(
     visibility = ["//visibility:public"],
 )
 
+tf_pyclif_proto_library(
+    name = "protobuf/device_properties_pyclif",
+    proto_lib = ":protos_all_cc",
+    proto_srcfile = "protobuf/device_properties.proto",
+    visibility = ["//visibility:public"],
+)
+
 tf_pyclif_proto_library(
     name = "protobuf/meta_graph_pyclif",
     proto_lib = ":protos_all_cc",
@@ -1410,9 +1455,9 @@ tf_pyclif_proto_library(
 )
 
 tf_pyclif_proto_library(
-    name = "protobuf/device_properties_pyclif",
+    name = "protobuf/saved_model_pyclif",
     proto_lib = ":protos_all_cc",
-    proto_srcfile = "protobuf/device_properties.proto",
+    proto_srcfile = "protobuf/saved_model.proto",
     visibility = ["//visibility:public"],
 )
 
@@ -1518,6 +1563,7 @@ LIB_INTERNAL_PUBLIC_HEADERS = tf_additional_lib_hdrs() + [
     "lib/strings/base64.h",
     "lib/strings/ordered_code.h",
     "lib/strings/proto_text_util.h",
+    "lib/strings/proto_serialization.h",
     "lib/strings/scanner.h",
     "lib/wav/wav_io.h",
     "platform/demangle.h",
@@ -1663,6 +1709,25 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "tflite_portable_logging",
+    srcs = [],
+    hdrs = [
+        "lib/bfloat16/bfloat16.h",
+        "platform/default/integral_types.h",
+        "platform/default/logging.h",
+        "platform/logging.h",
+        "platform/macros.h",
+        "platform/platform.h",
+        "platform/types.h",
+    ],
+    copts = tf_copts(),
+    linkopts = ["-ldl"],
+    deps = [
+        "//tensorflow/core/platform/default/build_config:logging",
+    ],
+)
+
 cc_library(
     name = "android_jpeg_internal",
     srcs = if_android([
@@ -1854,6 +1919,13 @@ cc_header_only_library(
     ],
 )
 
+cc_header_only_library(
+    name = "core_cpu_headers_lib",
+    deps = [
+        ":core_cpu_lib",
+    ],
+)
+
 tf_cuda_library(
     name = "framework_internal_impl",
     srcs = FRAMEWORK_INTERNAL_PRIVATE_HEADERS + [
@@ -1919,7 +1991,7 @@ tf_cuda_library(
     ) + if_mkl(
         [
             "//third_party/mkl:intel_binary_blob",
-            "@mkl_dnn//:mkl_dnn",
+            "@mkl_dnn",
         ],
     ),
     alwayslink = 1,
@@ -1931,7 +2003,6 @@ cc_header_only_library(
     deps = [
         ":framework",
         ":reader_base",
-        "@nsync//:nsync_headers",
     ],
 )
 
@@ -2038,14 +2109,19 @@ tf_cuda_library(
 
 CORE_CPU_BASE_HDRS = GRAPH_HDRS + [
     "common_runtime/device.h",
+    "common_runtime/device_mgr.h",
+    "common_runtime/eval_const_tensor.h",
     "common_runtime/graph_runner.h",
     "common_runtime/shape_refiner.h",
     "framework/versions.h",
+    "common_runtime/process_function_library_runtime.h",
+    "common_runtime/function.h",
 ]
 
 tf_cuda_library(
     name = "core_cpu_base",
     srcs = [
+        "common_runtime/eval_const_tensor.cc",
         "common_runtime/shape_refiner.cc",
         "common_runtime/shape_refiner.h",
         "framework/versions.h",
@@ -2086,24 +2162,23 @@ CORE_CPU_LIB_HEADERS = CORE_CPU_BASE_HDRS + [
     "common_runtime/costmodel_manager.h",
     "common_runtime/debugger_state_interface.h",
     "common_runtime/device_factory.h",
-    "common_runtime/device_mgr.h",
     "common_runtime/device_set.h",
     "common_runtime/dma_helper.h",
     "common_runtime/eigen_thread_pool.h",
     "common_runtime/executor.h",
-    "common_runtime/function.h",
     "common_runtime/graph_optimizer.h",
     "common_runtime/local_device.h",
     "common_runtime/memory_types.h",
     "common_runtime/mkl_cpu_allocator.h",
     "common_runtime/optimization_registry.h",
     "common_runtime/pending_counts.h",
-    "common_runtime/process_function_library_runtime.h",
     "common_runtime/process_util.h",
     "common_runtime/profile_handler.h",
     "common_runtime/renamed_device.h",
     "common_runtime/rendezvous_mgr.h",
     "common_runtime/rendezvous_util.h",
+    "common_runtime/scoped_allocator.h",
+    "common_runtime/scoped_allocator_mgr.h",
     "common_runtime/session_factory.h",
     "common_runtime/placer.h",
     "common_runtime/stats_publisher_interface.h",
@@ -2134,6 +2209,7 @@ tf_cuda_library(
         "common_runtime/graph_runner.cc",
         "common_runtime/local_device.cc",
         "common_runtime/memory_types.cc",
+        "common_runtime/mkl_cpu_allocator.cc",
         "common_runtime/optimization_registry.cc",
         "common_runtime/parallel_concat_optimizer.cc",
         "common_runtime/placer.cc",
@@ -2142,6 +2218,8 @@ tf_cuda_library(
         "common_runtime/renamed_device.cc",
         "common_runtime/rendezvous_mgr.cc",
         "common_runtime/rendezvous_util.cc",
+        "common_runtime/scoped_allocator.cc",
+        "common_runtime/scoped_allocator_mgr.cc",
         "common_runtime/session.cc",
         "common_runtime/session_factory.cc",
         "common_runtime/session_options.cc",
@@ -2173,6 +2251,7 @@ tf_cuda_library(
     ] + if_mkl(
         [
             "//third_party/mkl:intel_binary_blob",
+            "@mkl_dnn",
         ],
     ),
     alwayslink = 1,
@@ -2217,14 +2296,12 @@ tf_cuda_library(
     ] + if_mkl(
         [
             "//third_party/mkl:intel_binary_blob",
-            "@mkl_dnn//:mkl_dnn",
+            "@mkl_dnn",
         ],
     ) + tf_additional_core_deps() + if_static([":core_cpu_impl"]),
     alwayslink = 1,
 )
 
-# This library is deprecated and no longer publicly available.
-# Do not add more uses of it.
 cc_library(
     name = "regexp_internal",
     hdrs = [
@@ -2867,6 +2944,23 @@ tf_cc_tests(
     ],
 )
 
+tf_cc_test(
+    name = "cudnn_rnn_ops_test_cc",
+    size = "small",
+    srcs = [
+        "ops/cudnn_rnn_ops_test.cc",
+    ],
+    deps = [
+        ":cudnn_rnn_ops",
+        "//tensorflow/core",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+    ],
+)
+
 tf_cc_test_mkl(
     name = "mkl_runtime_tests",
     size = "small",
@@ -3127,6 +3221,7 @@ tf_cc_test(
         ":core_cpu",
         ":core_cpu_internal",
         ":framework",
+        ":lib",
         ":test",
         ":test_main",
         ":testlib",
@@ -3175,6 +3270,7 @@ tf_cc_test(
         "//tensorflow/core/kernels:dense_update_ops",
         "//tensorflow/core/kernels:fifo_queue_op",
         "//tensorflow/core/kernels:function_ops",
+        "//tensorflow/core/kernels:identity_n_op",
         "//tensorflow/core/kernels:identity_op",
         "//tensorflow/core/kernels:matmul_op",
         "//tensorflow/core/kernels:ops_util",
@@ -3217,6 +3313,7 @@ tf_cc_test(
         "//tensorflow/core/kernels:fifo_queue_op",
         "//tensorflow/core/kernels:function_ops",
         "//tensorflow/core/kernels:identity_op",
+        "//tensorflow/core/kernels:identity_n_op",
         "//tensorflow/core/kernels:matmul_op",
         "//tensorflow/core/kernels:ops_util",
         "//tensorflow/core/kernels:queue_ops",
@@ -3291,6 +3388,43 @@ tf_cc_test(
     size = "small",
     srcs = ["common_runtime/function_test.cc"],
     linkstatic = tf_kernel_tests_linkstatic(),
+    tags = [
+        "manual",
+        "no_oss",
+    ],
+    deps = [
+        ":core",
+        ":core_cpu",
+        ":core_cpu_internal",
+        ":direct_session_internal",
+        ":framework",
+        ":framework_internal",
+        ":lib",
+        ":lib_internal",
+        ":ops",
+        ":protos_all_cc",
+        ":test",
+        ":test_main",
+        ":testlib",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/cc:cc_ops_internal",
+        "//tensorflow/cc:function_ops",
+        "//tensorflow/cc:functional_ops",
+        "//tensorflow/core/kernels:cast_op",
+        "//tensorflow/core/kernels:cwise_op",
+        "//tensorflow/core/kernels:function_ops",
+        "//tensorflow/core/kernels:matmul_op",
+        "//tensorflow/core/kernels:random_ops",
+        "//tensorflow/core/kernels:shape_ops",
+        "//third_party/eigen3",
+    ],
+)
+
+tf_cc_test(
+    name = "common_runtime_function_threadpool_test",
+    size = "small",
+    srcs = ["common_runtime/function_threadpool_test.cc"],
+    linkstatic = tf_kernel_tests_linkstatic(),
     deps = [
         ":core",
         ":core_cpu",
@@ -3319,6 +3453,21 @@ tf_cc_test(
     ],
 )
 
+tf_cc_test(
+    name = "common_runtime_scoped_allocator_mgr_test",
+    size = "small",
+    srcs = ["common_runtime/scoped_allocator_mgr_test.cc"],
+    linkstatic = tf_kernel_tests_linkstatic(),
+    deps = [
+        ":core_cpu",
+        ":core_cpu_internal",
+        ":framework",
+        ":lib",
+        ":test",
+        ":test_main",
+    ],
+)
+
 tf_cc_test_gpu(
     name = "gpu_allocator_retry_test",
     size = "medium",
@@ -3623,6 +3772,13 @@ filegroup(
         "lib/gif/testdata/optimized.gif",
         # BMP data
         "lib/bmp/testdata/lena.bmp",
+        # SSIM, PSNR data
+        "lib/ssim/testdata/checkerboard1.png",
+        "lib/ssim/testdata/checkerboard2.png",
+        "lib/ssim/testdata/checkerboard3.png",
+        "lib/psnr/testdata/cat_q20.jpg",
+        "lib/psnr/testdata/cat_q72.jpg",
+        "lib/psnr/testdata/cat_q95.jpg",
     ],
     visibility = ["//visibility:public"],
 )
diff --git a/tensorflow/core/api_def/base_api/api_def_CloseSummaryWriter.pbtxt b/tensorflow/core/api_def/base_api/api_def_CloseSummaryWriter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f6fd7d93169306fdf5ca62d27635e1f86f37bc4d
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_CloseSummaryWriter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CloseSummaryWriter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_CreateSummaryDbWriter.pbtxt b/tensorflow/core/api_def/base_api/api_def_CreateSummaryDbWriter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..28da46a0f8e452f65d06a13c4b0d0b03b2a75757
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_CreateSummaryDbWriter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CreateSummaryDbWriter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_CreateSummaryFileWriter.pbtxt b/tensorflow/core/api_def/base_api/api_def_CreateSummaryFileWriter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2ce2c4d37e5001681ffa733bf4726c6bea652029
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_CreateSummaryFileWriter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CreateSummaryFileWriter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_CudnnRNN.pbtxt b/tensorflow/core/api_def/base_api/api_def_CudnnRNN.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..daeb5fe9a223d7d1254725325921a28a7d165902
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_CudnnRNN.pbtxt
@@ -0,0 +1,36 @@
+op {
+  graph_op_name: "CudnnRNN"
+  summary: "An RNN backed by cuDNN."
+  description: <<END
+Computes the RNN from the input and initial states, with respect to the params
+buffer.
+
+rnn_mode: Indicates the type of the RNN model.
+input_mode: Indicates whether there is a linear projection between the input and
+  the actual computation before the first layer. 'skip_input' is only allowed
+  when input_size == num_units; 'auto_select' implies 'skip_input' when
+  input_size == num_units; otherwise, it implies 'linear_input'.
+direction: Indicates whether a bidirectional model will be used.
+  dir = (direction == bidirectional) ? 2 : 1
+dropout: dropout probability. When set to 0., dropout is disabled.
+seed: the 1st part of a seed to initialize dropout.
+seed2: the 2nd part of a seed to initialize dropout.
+input: a 3-D tensor with the shape of [seq_length, batch_size, input_size].
+input_h: a 3-D tensor with the shape of [num_layer * dir, batch_size,
+    num_units].
+input_c: For LSTM, a 3-D tensor with the shape of
+    [num_layer * dir, batch, num_units]. For other models, it is ignored.
+params: a 1-D tensor that contains the weights and biases in an opaque layout.
+    The size must be created through CudnnRNNParamsSize, and initialized
+    separately. Note that they might not be compatible across different
+    generations. So it is a good idea to save and restore
+output: a 3-D tensor with the shape of [seq_length, batch_size,
+    dir * num_units].
+output_h: the same shape as input_h.
+output_c: the same shape as input_c for LSTM. An empty tensor for other models.
+is_training: Indicates whether this operation is used for inference or
+  training.
+reserve_space: an opaque tensor that can be used in backprop calculation. It
+  is only produced if is_training is true.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_CudnnRNNBackprop.pbtxt b/tensorflow/core/api_def/base_api/api_def_CudnnRNNBackprop.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..075ec52648e37397c95cb5ad302dcc9d951caada
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_CudnnRNNBackprop.pbtxt
@@ -0,0 +1,45 @@
+op {
+  graph_op_name: "CudnnRNNBackprop"
+  summary: "Backprop step of CudnnRNN."
+  description: <<END
+Computes the backprop of both data and weights in an RNN.
+
+rnn_mode: Indicates the type of the RNN model.
+input_mode: Indicates whether there is a linear projection between the input and
+    the actual computation before the first layer. 'skip_input' is only allowed
+    when input_size == num_units; 'auto_select' implies 'skip_input' when
+    input_size == num_units; otherwise, it implies 'linear_input'.
+direction: Indicates whether a bidirectional model will be used.
+    dir = (direction == bidirectional) ? 2 : 1
+dropout: dropout probability. When set to 0., dropout is disabled.
+seed: the 1st part of a seed to initialize dropout.
+seed2: the 2nd part of a seed to initialize dropout.
+input: a 3-D tensor with the shape of [seq_length, batch_size, input_size].
+input_h: a 3-D tensor with the shape of [num_layer * dir, batch_size,
+    num_units].
+input_c: For LSTM, a 3-D tensor with the shape of
+    [num_layer * dir, batch, num_units]. For other models, it is ignored.
+params: a 1-D tensor that contains the weights and biases in an opaque layout.
+    The size must be created through CudnnRNNParamsSize, and initialized
+    separately. Note that they might not be compatible across different
+    generations. So it is a good idea to save and restore
+output: a 3-D tensor with the shape of [seq_length, batch_size,
+    dir * num_units].
+output_h: the same shape as input_h.
+output_c: the same shape as input_c for LSTM. An empty tensor for other models.
+output_backprop: A 3-D tensor with the same shape as output in the forward pass.
+output_h_backprop: A 3-D tensor with the same shape as output_h in the forward
+    pass.
+output_c_backprop: A 3-D tensor with the same shape as output_c in the forward
+    pass.
+reserve_space: The same reserve_space produced in the forward operation.
+input_backprop: The backprop to input in the forward pass. Has the same shape
+    as input.
+input_h_backprop: The backprop to input_h in the forward pass. Has the same
+    shape as input_h.
+input_c_backprop: The backprop to input_c in the forward pass. Has the same
+    shape as input_c.
+params_backprop: The backprop to the params buffer in the forward pass. Has the
+    same shape as params.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_CudnnRNNCanonicalToParams.pbtxt b/tensorflow/core/api_def/base_api/api_def_CudnnRNNCanonicalToParams.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..abf81a2071172c5b00fec662e1401a46fc49c450
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_CudnnRNNCanonicalToParams.pbtxt
@@ -0,0 +1,35 @@
+op {
+  graph_op_name: "CudnnRNNCanonicalToParams"
+  summary: "Converts CudnnRNN params from canonical form to usable form."
+  description: <<END
+Writes a set of weights into the opaque params buffer so they can be used in
+upcoming training or inferences.
+
+Note that the params buffer may not be compatible across different GPUs. So any
+save and restoration should be converted to and from the canonical weights and
+biases.
+
+num_layers: Specifies the number of layers in the RNN model.
+num_units: Specifies the size of the hidden state.
+input_size: Specifies the size of the input state.
+weights: the canonical form of weights that can be used for saving
+    and restoration. They are more likely to be compatible across different
+    generations.
+biases: the canonical form of biases that can be used for saving
+    and restoration. They are more likely to be compatible across different
+    generations.
+num_params: number of parameter sets for all layers.
+    Each layer may contain multiple parameter sets, with each set consisting of
+    a weight matrix and a bias vector.
+rnn_mode: Indicates the type of the RNN model.
+input_mode: Indicates whether there is a linear projection between the input and
+    the actual computation before the first layer. 'skip_input' is only allowed
+    when input_size == num_units; 'auto_select' implies 'skip_input' when
+    input_size == num_units; otherwise, it implies 'linear_input'.
+direction: Indicates whether a bidirectional model will be used.
+    dir = (direction == bidirectional) ? 2 : 1
+dropout: dropout probability. When set to 0., dropout is disabled.
+seed: the 1st part of a seed to initialize dropout.
+seed2: the 2nd part of a seed to initialize dropout.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_CudnnRNNParamsSize.pbtxt b/tensorflow/core/api_def/base_api/api_def_CudnnRNNParamsSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..31fb85d4fb3f59ae82737128cc88d2cbdbc996ea
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_CudnnRNNParamsSize.pbtxt
@@ -0,0 +1,27 @@
+op {
+  graph_op_name: "CudnnRNNParamsSize"
+  summary: "Computes size of weights that can be used by a Cudnn RNN model."
+  description: <<END
+Return the params size that can be used by the Cudnn RNN model. Subsequent
+weight allocation and initialization should use this size.
+
+num_layers: Specifies the number of layers in the RNN model.
+num_units: Specifies the size of the hidden state.
+input_size: Specifies the size of the input state.
+rnn_mode: Indicates the type of the RNN model.
+input_mode: Indicates whether there is a linear projection between the input and
+  the actual computation before the first layer. 'skip_input' is only allowed
+  when input_size == num_units; 'auto_select' implies 'skip_input' when
+  input_size == num_units; otherwise, it implies 'linear_input'.
+direction: Indicates whether a bidirectional model will be used.
+  dir = (direction == bidirectional) ? 2 : 1
+dropout: dropout probability. When set to 0., dropout is disabled.
+seed: the 1st part of a seed to initialize dropout.
+seed2: the 2nd part of a seed to initialize dropout.
+params_size: The size of the params buffer that should be allocated and
+  initialized for this RNN model. Note that this params buffer may not be
+  compatible across GPUs. Please use CudnnRNNParamsWeights and
+  CudnnRNNParamsBiases to save and restore them in a way that is compatible
+  across different runs.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_CudnnRNNParamsToCanonical.pbtxt b/tensorflow/core/api_def/base_api/api_def_CudnnRNNParamsToCanonical.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..47753bf8fcf9aa3b1d0938b974aa788f1e2c5df1
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_CudnnRNNParamsToCanonical.pbtxt
@@ -0,0 +1,35 @@
+op {
+  graph_op_name: "CudnnRNNParamsToCanonical"
+  summary: "Retrieves CudnnRNN params in canonical form."
+  description: <<END
+Retrieves a set of weights from the opaque params buffer that can be saved and
+restored in a way compatible with future runs.
+
+Note that the params buffer may not be compatible across different GPUs. So any
+save and restoration should be converted to and from the canonical weights and
+biases.
+
+num_layers: Specifies the number of layers in the RNN model.
+num_units: Specifies the size of the hidden state.
+input_size: Specifies the size of the input state.
+num_params: number of parameter sets for all layers.
+    Each layer may contain multiple parameter sets, with each set consisting of
+    a weight matrix and a bias vector.
+weights: the canonical form of weights that can be used for saving
+    and restoration. They are more likely to be compatible across different
+    generations.
+biases: the canonical form of biases that can be used for saving
+    and restoration. They are more likely to be compatible across different
+    generations.
+rnn_mode: Indicates the type of the RNN model.
+input_mode: Indicates whether there is a linear projection between the input and
+    the actual computation before the first layer. 'skip_input' is only allowed
+    when input_size == num_units; 'auto_select' implies 'skip_input' when
+    input_size == num_units; otherwise, it implies 'linear_input'.
+direction: Indicates whether a bidirectional model will be used.
+    dir = (direction == bidirectional) ? 2 : 1
+dropout: dropout probability. When set to 0., dropout is disabled.
+seed: the 1st part of a seed to initialize dropout.
+seed2: the 2nd part of a seed to initialize dropout.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxArgs.pbtxt b/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxArgs.pbtxt
index 561c86ddf68a4fb093d263d076fb6ccc8d408733..599bbce65f44a3c6798be2d5cee3b8f6f2e2635a 100644
--- a/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxArgs.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxArgs.pbtxt
@@ -6,7 +6,7 @@ Attributes `[min; max]` define the clamping range for the `inputs` data.
 `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
 when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
 then de-quantized and output as floats in `[min; max]` interval.
-`num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
+`num_bits` is the bitwidth of the quantization; between 2 and 16, inclusive.
 
 Quantization is called fake since the output is still in floating point.
 END
diff --git a/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVars.pbtxt b/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVars.pbtxt
index 2713c01b27f6bc45eb6117047243f06873d4dd87..1976ffb8aac29f6aaac307c6f126b78f17dcadb8 100644
--- a/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVars.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVars.pbtxt
@@ -8,7 +8,7 @@ and `max` to 'outputs' tensor of same shape as `inputs`.
 `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
 when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
 then de-quantized and output as floats in `[min; max]` interval.
-`num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
+`num_bits` is the bitwidth of the quantization; between 2 and 16, inclusive.
 
 This operation has a gradient and thus allows for training `min` and `max`
 values.
diff --git a/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVarsPerChannel.pbtxt b/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVarsPerChannel.pbtxt
index e293d4d084bc90f24ee0cc1111f750ddfa46465b..c0fac6a445895eb1ebbfd621d9afd49d99ed80f4 100644
--- a/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVarsPerChannel.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVarsPerChannel.pbtxt
@@ -9,7 +9,7 @@ to 'outputs' tensor of same shape as `inputs`.
 `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
 when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
 then de-quantized and output as floats in `[min; max]` interval.
-`num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
+`num_bits` is the bitwidth of the quantization; between 2 and 16, inclusive.
 
 This operation has a gradient and thus allows for training `min` and `max`
 values.
diff --git a/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVarsPerChannelGradient.pbtxt b/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVarsPerChannelGradient.pbtxt
index 8a4ab368b5a8c4d8ac756513da14796ff3a41551..2051903f6dae6bb51490dbec137e7cd7592fb6c6 100644
--- a/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVarsPerChannelGradient.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_FakeQuantWithMinMaxVarsPerChannelGradient.pbtxt
@@ -40,7 +40,7 @@ END
   attr {
     name: "num_bits"
     description: <<END
-The bitwidth of the quantization; between 2 and 8, inclusive.
+The bitwidth of the quantization; between 2 and 16, inclusive.
 END
   }
   attr {
diff --git a/tensorflow/core/api_def/base_api/api_def_FlushSummaryWriter.pbtxt b/tensorflow/core/api_def/base_api/api_def_FlushSummaryWriter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3ada43c9b8b5e25b72fa6e6d7b0a313965dd9d5a
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_FlushSummaryWriter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FlushSummaryWriter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_ImportEvent.pbtxt b/tensorflow/core/api_def/base_api/api_def_ImportEvent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d8813b58f3e53e5916edcabafc1fd28388fea8d8
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_ImportEvent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ImportEvent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_RegexReplace.pbtxt b/tensorflow/core/api_def/base_api/api_def_RegexReplace.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..70ad5219267fcc84368f072a6f5a122b6cc11a89
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_RegexReplace.pbtxt
@@ -0,0 +1,25 @@
+op {
+  graph_op_name: "RegexReplace"
+  in_arg {
+    name: "input"
+    description: "The text to be processed."
+  }
+  in_arg {
+    name: "pattern"
+    description: "The regular expression to match the input."
+  }
+  in_arg {
+    name: "rewrite"
+    description: "The rewrite to be applied to the matched expression."
+  }
+  out_arg {
+    name: "output"
+    description: "The text after applying pattern and rewrite."
+  }
+  attr {
+    name: "replace_global"
+    description: "If True, the replacement is global, otherwise the replacement\nis done only on the first match."
+  }
+  summary: "Replaces the match of pattern in input with rewrite."
+  description: "It follows the re2 syntax (https://github.com/google/re2/wiki/Syntax)"
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_SdcaOptimizer.pbtxt b/tensorflow/core/api_def/base_api/api_def_SdcaOptimizer.pbtxt
index b0b58ac00e6709922ed517ad2c9efebbedf450a3..9da0e124ebe02f1cfb6450b96471d7d9d146bd20 100644
--- a/tensorflow/core/api_def/base_api/api_def_SdcaOptimizer.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_SdcaOptimizer.pbtxt
@@ -97,8 +97,11 @@ END
   }
   attr {
     name: "adaptative"
+    default_value {
+      b: True
+    }
     description: <<END
-Whether to use Adapative SDCA for the inner loop.
+Whether to use Adaptive SDCA for the inner loop.
 END
   }
   attr {
diff --git a/tensorflow/core/api_def/base_api/api_def_SelfAdjointEig.pbtxt b/tensorflow/core/api_def/base_api/api_def_SelfAdjointEig.pbtxt
index 51d63eeb5695d6a428e990ba43e54102db58b58e..7be9a958ab55d27b4b9fe3dd023e44ae828e042c 100644
--- a/tensorflow/core/api_def/base_api/api_def_SelfAdjointEig.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_SelfAdjointEig.pbtxt
@@ -19,6 +19,7 @@ form square matrices, with the same constraints as the single matrix
 SelfAdjointEig.
 
 The result is a [..., M+1, M] matrix with [..., 0,:] containing the
-eigenvalues, and subsequent [...,1:, :] containing the eigenvectors.
+eigenvalues, and subsequent [...,1:, :] containing the eigenvectors. The eigenvalues
+are sorted in non-decreasing order.
 END
 }
diff --git a/tensorflow/core/api_def/base_api/api_def_SelfAdjointEigV2.pbtxt b/tensorflow/core/api_def/base_api/api_def_SelfAdjointEigV2.pbtxt
index 4a5e1252586ea8b3e03b2545e0d8646288ddc408..fae9e84fc85be06184b19308d87c90632347e2f6 100644
--- a/tensorflow/core/api_def/base_api/api_def_SelfAdjointEigV2.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_SelfAdjointEigV2.pbtxt
@@ -31,7 +31,8 @@ END
   summary: "Computes the eigen decomposition of one or more square self-adjoint matrices."
   description: <<END
 Computes the eigenvalues and (optionally) eigenvectors of each inner matrix in
-`input` such that `input[..., :, :] = v[..., :, :] * diag(e[..., :])`.
+`input` such that `input[..., :, :] = v[..., :, :] * diag(e[..., :])`. The eigenvalues
+are sorted in non-decreasing order.
 
 ```python
 # a is a tensor.
diff --git a/tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt b/tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9fabe7863e4bf89a09a9bfcc9ce6c0a00d8cc6db
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt
@@ -0,0 +1,18 @@
+op {
+  graph_op_name: "SlideDataset"
+  in_arg {
+    name: "window_size"
+    description: <<END
+A scalar representing the number of elements in the
+sliding window.
+END
+  }
+  in_arg {
+    name: "stride"
+    description: <<END
+A scalar representing the steps moving the sliding window
+forward in one iteration. It must be in `[1, window_size)`.
+END
+  }
+  summary: "Creates a dataset that passes a sliding window over `input_dataset`."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_SummaryWriter.pbtxt b/tensorflow/core/api_def/base_api/api_def_SummaryWriter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1fe57ecf195c85217bd174dbc503b28f26adade9
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_SummaryWriter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SummaryWriter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_WriteAudioSummary.pbtxt b/tensorflow/core/api_def/base_api/api_def_WriteAudioSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..520952cd4117867f61cd3c536b8a7cc5beeeab62
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_WriteAudioSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteAudioSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_WriteGraphSummary.pbtxt b/tensorflow/core/api_def/base_api/api_def_WriteGraphSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3653477b2067875ee772dedc5015bc550de1ec12
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_WriteGraphSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteGraphSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_WriteHistogramSummary.pbtxt b/tensorflow/core/api_def/base_api/api_def_WriteHistogramSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..26e1482630596d9ecb80917ff91adb2bd1131692
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_WriteHistogramSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteHistogramSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_WriteImageSummary.pbtxt b/tensorflow/core/api_def/base_api/api_def_WriteImageSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..78db8700f0c7231f24fd1db3a0eedbcc4f43deeb
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_WriteImageSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteImageSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_WriteScalarSummary.pbtxt b/tensorflow/core/api_def/base_api/api_def_WriteScalarSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7bae8638d258b6fdf217fbdbd8705369f57e0bb3
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_WriteScalarSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteScalarSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_WriteSummary.pbtxt b/tensorflow/core/api_def/base_api/api_def_WriteSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..db86883e21ea47a9e74788f8d76a166a235674da
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_WriteSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Abort.pbtxt b/tensorflow/core/api_def/python_api/api_def_Abort.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3f95aaf12c65383b1425fd4063a79afff63480a6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Abort.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Abort"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AccumulatorApplyGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_AccumulatorApplyGradient.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1e76d6dadcde5083dba8e2ef78740256fd45dc63
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AccumulatorApplyGradient.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AccumulatorApplyGradient"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AccumulatorNumAccumulated.pbtxt b/tensorflow/core/api_def/python_api/api_def_AccumulatorNumAccumulated.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fbe971ab2e221bd01e991f9c80e1d527736e59bf
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AccumulatorNumAccumulated.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AccumulatorNumAccumulated"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AccumulatorSetGlobalStep.pbtxt b/tensorflow/core/api_def/python_api/api_def_AccumulatorSetGlobalStep.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0047b25af6a52d02b5b4f1e87fa16fce56a90a29
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AccumulatorSetGlobalStep.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AccumulatorSetGlobalStep"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AccumulatorTakeGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_AccumulatorTakeGradient.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..860fbe124506c7db95e3f1603b9f3878a2d4b84b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AccumulatorTakeGradient.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AccumulatorTakeGradient"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AdjustContrast.pbtxt b/tensorflow/core/api_def/python_api/api_def_AdjustContrast.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0311ad92b7e40c67969bf14193a8b2f98659558a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AdjustContrast.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AdjustContrast"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AdjustHue.pbtxt b/tensorflow/core/api_def/python_api/api_def_AdjustHue.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b4411677118b002bd751a492aadf30b6fb0f4ac8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AdjustHue.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AdjustHue"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AdjustSaturation.pbtxt b/tensorflow/core/api_def/python_api/api_def_AdjustSaturation.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..893219e17a70c5d4fdd24b46986b6ed33303448c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AdjustSaturation.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AdjustSaturation"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Angle.pbtxt b/tensorflow/core/api_def/python_api/api_def_Angle.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..771e861fd17171ae886fabcf47218923ab5451b6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Angle.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Angle"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyAdadelta.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyAdadelta.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d8776b19f1a28dca7f9f067154d51cdb02599d79
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyAdadelta.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyAdadelta"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7e659c1bb30ab2f80a9bb010c55a4426b12f9d5b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyAdagradDA.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyAdagradDA.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d647c5eb0a23346f25407630d24d25321e3282a3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyAdagradDA.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyAdagradDA"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyAdam.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyAdam.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..66d9095c8fd0eb729f3b3c3ca5938ebe045723d7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyAdam.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyAdam"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyAddSign.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyAddSign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b7fe1aa6542e874c071bb3d069d50a37a4779754
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyAddSign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyAddSign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyCenteredRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyCenteredRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..56003c5e6fda60fb43c4a5784e2ccdf564ff1ed8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyCenteredRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyCenteredRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyFtrl.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyFtrl.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..680b3ef480f54da4e13331cc47589810ccecb54e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyFtrl.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyFtrl"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyFtrlV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyFtrlV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5ab3bb6efd2bd483a5223acbb9a16b0b9ab3d001
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyFtrlV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyFtrlV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..467bf7db558000d7a15bf83b33011690a34a107a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyMomentum.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyMomentum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7c3f0fef95f55c18170343d6f1cc081612dd68ee
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyMomentum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyMomentum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyPowerSign.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyPowerSign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f376b1dc6e531113d580406df26a5b30210b07ad
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyPowerSign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyPowerSign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyProximalAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyProximalAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0c6e2a4bb1ede274433bd297c7f39b2c58186923
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyProximalAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyProximalAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyProximalGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyProximalGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..90c1655fe913504f55b910d826e834e70a7c4acc
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyProximalGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyProximalGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..18cce1915a5eb25f68a227968c0819977d02467f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApproximateEqual.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApproximateEqual.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..707f6716f9604994f5d92651ecacc7608f10742d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApproximateEqual.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApproximateEqual"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ArgMax.pbtxt b/tensorflow/core/api_def/python_api/api_def_ArgMax.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4c23a432f2b747a8a406c6152d36fc0ba5f1118f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ArgMax.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ArgMax"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ArgMin.pbtxt b/tensorflow/core/api_def/python_api/api_def_ArgMin.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..daa14f638663ef42fc50667e1bb1d5236e3d3361
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ArgMin.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ArgMin"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Assign.pbtxt b/tensorflow/core/api_def/python_api/api_def_Assign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..34062ede91a836fa88f4680cdea91a5e23917158
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Assign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Assign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AssignAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_AssignAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4553c6c6e72f50d047310c7df723d7d6e4b8491c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AssignAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AssignAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AssignAddVariableOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_AssignAddVariableOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e30ec092e68bf9204b34e6d06b3b4c1cfbeab02b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AssignAddVariableOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AssignAddVariableOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AssignSub.pbtxt b/tensorflow/core/api_def/python_api/api_def_AssignSub.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..aec68d5c21e598a4dd28673bda698e71bd808f51
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AssignSub.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AssignSub"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AssignSubVariableOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_AssignSubVariableOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..81290a56ec1cd89b15a875c24ec31a20faa11bb8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AssignSubVariableOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AssignSubVariableOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AssignVariableOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_AssignVariableOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3ffa4a11c41f2cf38b9251a0fd030ea0fc511058
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AssignVariableOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AssignVariableOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AvgPool3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_AvgPool3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..cc16523a1567e8d7f2d0146c1c44d9ef11b6c6d5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AvgPool3D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "AvgPool3D"
+  endpoint {
+    name: "nn.avg_pool3d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4289c1daf96583943b8dfad84aeca3351657bee4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BatchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BatchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchMatrixBandPart.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchMatrixBandPart.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0a699e20506e47177722a57249f53cb1b80cf1b2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BatchMatrixBandPart.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BatchMatrixBandPart"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiag.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiag.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..40be51ecccd3d5f6ceb3a1dc245e925e666d8ac5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiag.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BatchMatrixDiag"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiagPart.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiagPart.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1ef78fa5ece9a6ff1147b70d4660769954e67301
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiagPart.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BatchMatrixDiagPart"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchMatrixSetDiag.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchMatrixSetDiag.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..644c1270a2385871d9dc4f429e39de4c0382c27a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BatchMatrixSetDiag.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BatchMatrixSetDiag"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BiasAddGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_BiasAddGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9226c6791c82fcddb1ed0d54db155d20ff44d18e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BiasAddGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BiasAddGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Bincount.pbtxt b/tensorflow/core/api_def/python_api/api_def_Bincount.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..551b51db26251cbf15b38ac3e48b5024fae0ec72
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Bincount.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Bincount"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BytesProducedStatsDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_BytesProducedStatsDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fcf541f9036baaef1590f06da0d7471b0558b4c7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BytesProducedStatsDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BytesProducedStatsDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CacheDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_CacheDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2bbb4ff9e3b08d0dd11c7444e5d00feb514e81c0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CacheDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CacheDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Cast.pbtxt b/tensorflow/core/api_def/python_api/api_def_Cast.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..428aa62c462a2361c519243a8b8a6bdb4f42cb9d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Cast.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Cast"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CholeskyGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_CholeskyGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3538afb2a7108763b1986615f8c738e0c3113c96
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CholeskyGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CholeskyGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CloseSummaryWriter.pbtxt b/tensorflow/core/api_def/python_api/api_def_CloseSummaryWriter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f6fd7d93169306fdf5ca62d27635e1f86f37bc4d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CloseSummaryWriter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CloseSummaryWriter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CompareAndBitpack.pbtxt b/tensorflow/core/api_def/python_api/api_def_CompareAndBitpack.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..493a7e48665954b647e99a5fe9d06a7a42755494
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CompareAndBitpack.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CompareAndBitpack"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ConcatenateDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ConcatenateDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c005a4da0f866c1d1106effabbaa22f1abecf422
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ConcatenateDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ConcatenateDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ConditionalAccumulator.pbtxt b/tensorflow/core/api_def/python_api/api_def_ConditionalAccumulator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a4663e8eb3a20b6d809242ec9a44376247b8b3ca
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ConditionalAccumulator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ConditionalAccumulator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ConsumeMutexLock.pbtxt b/tensorflow/core/api_def/python_api/api_def_ConsumeMutexLock.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9559947490e6fbcb88ab8a359e8416fd11a8b165
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ConsumeMutexLock.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ConsumeMutexLock"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ControlTrigger.pbtxt b/tensorflow/core/api_def/python_api/api_def_ControlTrigger.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..33941493af7bf038503dad2379821d0c36929cf8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ControlTrigger.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ControlTrigger"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2ae75d6da222d84245bb2a912942522eb52047bc
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv2D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Conv2D"
+  endpoint {
+    name: "nn.conv2d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropFilter.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropFilter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6f21d8c8802f9a18c9357dbe68d3c65407bff923
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropFilter.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Conv2DBackpropFilter"
+  endpoint {
+    name: "nn.conv2d_backprop_filter"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropInput.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropInput.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ea976799cbc73bc9164a15e781a051f03e14275b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropInput.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Conv2DBackpropInput"
+  endpoint {
+    name: "nn.conv2d_backprop_input"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ba8d178263c94574c0aaac8f1f24fb1424a50275
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv3D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Conv3D"
+  endpoint {
+    name: "nn.conv3d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilter.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..634545f427c906edd94297f9e5291be4021462ad
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Conv3DBackpropFilter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilterV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilterV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1da8ee3a25f36a0b44f6458a351854190fe7830f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilterV2.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Conv3DBackpropFilterV2"
+  endpoint {
+    name: "nn.conv3d_backprop_filter_v2"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInput.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInput.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e2b0a0d19f4a31b05771133c52446078a7e938c8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInput.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Conv3DBackpropInput"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInputV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInputV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4e5c4f74fe90c148e11be98a2e343a41511d1d1d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInputV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Conv3DBackpropInputV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CountUpTo.pbtxt b/tensorflow/core/api_def/python_api/api_def_CountUpTo.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f41be2f540d241776bee3fcb1bba496d4baebeab
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CountUpTo.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CountUpTo"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CreateSummaryDbWriter.pbtxt b/tensorflow/core/api_def/python_api/api_def_CreateSummaryDbWriter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..28da46a0f8e452f65d06a13c4b0d0b03b2a75757
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CreateSummaryDbWriter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CreateSummaryDbWriter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CreateSummaryFileWriter.pbtxt b/tensorflow/core/api_def/python_api/api_def_CreateSummaryFileWriter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2ce2c4d37e5001681ffa733bf4726c6bea652029
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CreateSummaryFileWriter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CreateSummaryFileWriter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradBoxes.pbtxt b/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradBoxes.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ac4449419314a1fe09e9a2b17e815a741b960b1d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradBoxes.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CropAndResizeGradBoxes"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradImage.pbtxt b/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradImage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..eecd0536f29bc189705c7a7311a79eb5ffff02dc
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradImage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CropAndResizeGradImage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CudnnRNN.pbtxt b/tensorflow/core/api_def/python_api/api_def_CudnnRNN.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b13586b63ba3418c452e44b0f007c42885498f9f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CudnnRNN.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CudnnRNN"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CudnnRNNBackprop.pbtxt b/tensorflow/core/api_def/python_api/api_def_CudnnRNNBackprop.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..81c4efc60b7f6338a0197e5898c6e7eddd5069bf
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CudnnRNNBackprop.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CudnnRNNBackprop"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CudnnRNNCanonicalToParams.pbtxt b/tensorflow/core/api_def/python_api/api_def_CudnnRNNCanonicalToParams.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..164a306034af2b85fb803b431367e337bc65b34f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CudnnRNNCanonicalToParams.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CudnnRNNCanonicalToParams"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CudnnRNNParamsSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_CudnnRNNParamsSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..00f97f05b11d3bdb049c55beba2fe9ce18e14ff0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CudnnRNNParamsSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CudnnRNNParamsSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CudnnRNNParamsToCanonical.pbtxt b/tensorflow/core/api_def/python_api/api_def_CudnnRNNParamsToCanonical.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..841bc0cf55e0800e3350f0eb68d37f42c788d79e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CudnnRNNParamsToCanonical.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CudnnRNNParamsToCanonical"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Cumprod.pbtxt b/tensorflow/core/api_def/python_api/api_def_Cumprod.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8f5e2f061bd281d22e91b0922899eda3a641d68f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Cumprod.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Cumprod"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Cumsum.pbtxt b/tensorflow/core/api_def/python_api/api_def_Cumsum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..715f26fcac2bb03b729d58f5c5f7cfe6802660fd
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Cumsum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Cumsum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DataFormatDimMap.pbtxt b/tensorflow/core/api_def/python_api/api_def_DataFormatDimMap.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..82a39cfc5981f14edfe39cee363abf169f89245e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DataFormatDimMap.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DataFormatDimMap"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DataFormatVecPermute.pbtxt b/tensorflow/core/api_def/python_api/api_def_DataFormatVecPermute.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9ec292df8f670cfbae6488545979354d751e5d41
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DataFormatVecPermute.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DataFormatVecPermute"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DatasetToSingleElement.pbtxt b/tensorflow/core/api_def/python_api/api_def_DatasetToSingleElement.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e3d34cc15be752b466aa03f6805cd687698f74fa
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DatasetToSingleElement.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DatasetToSingleElement"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DecodeCompressed.pbtxt b/tensorflow/core/api_def/python_api/api_def_DecodeCompressed.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f0b7539918617e866acdf4d4d88279e1aeeb7a14
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DecodeCompressed.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DecodeCompressed"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DenseToDenseSetOperation.pbtxt b/tensorflow/core/api_def/python_api/api_def_DenseToDenseSetOperation.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1c47ec09c5ee16d37ac57c211c2409cd8f8c6970
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DenseToDenseSetOperation.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DenseToDenseSetOperation"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DenseToSparseBatchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_DenseToSparseBatchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0a8e068afb744ce8b472111d19cf743d39ac44ef
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DenseToSparseBatchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DenseToSparseBatchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DenseToSparseSetOperation.pbtxt b/tensorflow/core/api_def/python_api/api_def_DenseToSparseSetOperation.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a30757df4d0326159c180c4be14309f9150fff00
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DenseToSparseSetOperation.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DenseToSparseSetOperation"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DepthToSpace.pbtxt b/tensorflow/core/api_def/python_api/api_def_DepthToSpace.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fd0766b36556d86b2fc99f8b2ac19480832546e1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DepthToSpace.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DepthToSpace"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DeserializeIterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_DeserializeIterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..170d37be4e268c2829a5fa01fcaa48be082c2e0e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DeserializeIterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DeserializeIterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DestroyResourceOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_DestroyResourceOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b9dde0080a8c3533a1b1837ddd4aaeb05e7a180e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DestroyResourceOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DestroyResourceOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Dilation2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_Dilation2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6d73ecf1bb06895017b2d2ac2a16c702681eb217
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Dilation2D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Dilation2D"
+  endpoint {
+    name: "nn.dilation2d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropFilter.pbtxt b/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropFilter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..feb9f083db691c55832997509ff6455a6584f486
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropFilter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Dilation2DBackpropFilter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropInput.pbtxt b/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropInput.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9a6b09f5cc653ba456bfb9fb66757c48963503e4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropInput.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Dilation2DBackpropInput"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Div.pbtxt b/tensorflow/core/api_def/python_api/api_def_Div.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8e5537c8bfea638b585c04c514264d930054fde5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Div.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Div"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_EnqueueInQueueDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_EnqueueInQueueDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..051cf14c0ec2b32779be8b9c297b93abd1bc1318
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_EnqueueInQueueDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "EnqueueInQueueDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Erf.pbtxt b/tensorflow/core/api_def/python_api/api_def_Erf.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..391167254edb69725c778e6319bf8a9f6038589f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Erf.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Erf"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FFT2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_FFT2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9ed1341dfe2d0c4f57e0fa3c2d14378bce452be3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FFT2D.pbtxt
@@ -0,0 +1,9 @@
+op {
+  graph_op_name: "FFT2D"
+  endpoint {
+    name: "spectral.fft2d"
+  }
+  endpoint {
+    name: "fft2d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FFT3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_FFT3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5a4e1d6adf9b9c2bf68c6375de6aebfdfcf5bfb3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FFT3D.pbtxt
@@ -0,0 +1,9 @@
+op {
+  graph_op_name: "FFT3D"
+  endpoint {
+    name: "spectral.fft3d"
+  }
+  endpoint {
+    name: "fft3d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FilterDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_FilterDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6f91b842181c769d0a2f921f1d7566c4d8522541
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FilterDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FilterDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FixedLengthRecordDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_FixedLengthRecordDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d0703471d38c94a8c37da6f0a65ebd165c23a820
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FixedLengthRecordDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FixedLengthRecordDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FlatMapDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_FlatMapDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9de61ac263cd82a0893aa2e27b9d7532490ca441
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FlatMapDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FlatMapDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FlushSummaryWriter.pbtxt b/tensorflow/core/api_def/python_api/api_def_FlushSummaryWriter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3ada43c9b8b5e25b72fa6e6d7b0a313965dd9d5a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FlushSummaryWriter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FlushSummaryWriter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..56409f32d8d58b923980f78b3662f196e7954e14
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FusedBatchNormGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGradV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGradV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f5a4200b76c884c0f24335df1716f85b0666b589
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGradV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FusedBatchNormGradV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FusedPadConv2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_FusedPadConv2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..03b5fdd5a11844af209c86d9ef8e362c4d286ea6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FusedPadConv2D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FusedPadConv2D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FusedResizeAndPadConv2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_FusedResizeAndPadConv2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..52165d9b4d991d1636cdc08d5cb2f9efe2f7754f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FusedResizeAndPadConv2D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FusedResizeAndPadConv2D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Gather.pbtxt b/tensorflow/core/api_def/python_api/api_def_Gather.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5f956930e0f5bc9a9160974ee4c4a177102942fa
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Gather.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Gather"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_GatherV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_GatherV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..029bc59b51cb5463579dbf867e3a1927cb3577f7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_GatherV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "GatherV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_GeneratorDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_GeneratorDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9dcfa0f7d210012aa5c2d43349239a953ea3739e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_GeneratorDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "GeneratorDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_GroupByWindowDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_GroupByWindowDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8d40208e613e6b7ee1522c2990afea1345cc5de1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_GroupByWindowDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "GroupByWindowDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IFFT2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_IFFT2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d6b36a314b8d8a197651ee3c68b1376a9bbed669
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IFFT2D.pbtxt
@@ -0,0 +1,9 @@
+op {
+  graph_op_name: "IFFT2D"
+  endpoint {
+    name: "spectral.ifft2d"
+  }
+  endpoint {
+    name: "ifft2d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IFFT3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_IFFT3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6def5b36da17766c5342703fcefe2b377028f330
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IFFT3D.pbtxt
@@ -0,0 +1,9 @@
+op {
+  graph_op_name: "IFFT3D"
+  endpoint {
+    name: "spectral.ifft3d"
+  }
+  endpoint {
+    name: "ifft3d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IRFFT.pbtxt b/tensorflow/core/api_def/python_api/api_def_IRFFT.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8fa74a4317fe635cb10ca226f5516834370275c2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IRFFT.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IRFFT"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IRFFT2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_IRFFT2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2021cad63911d2bafb159e1a1f2f11ed2a1d372e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IRFFT2D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IRFFT2D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IRFFT3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_IRFFT3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5d1eab6003ece3b1eed22200743a28de185d1299
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IRFFT3D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IRFFT3D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Identity.pbtxt b/tensorflow/core/api_def/python_api/api_def_Identity.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..00f2afde271ebade0ac7d1ae75dc9dff6f692ab5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Identity.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Identity"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Imag.pbtxt b/tensorflow/core/api_def/python_api/api_def_Imag.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5632fd4365f718ccf079e1c75b962b011c0253f6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Imag.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Imag"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ImmutableConst.pbtxt b/tensorflow/core/api_def/python_api/api_def_ImmutableConst.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..997013914b89fb489eaa3c8f96f001b093aa23e0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ImmutableConst.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ImmutableConst"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ImportEvent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ImportEvent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d8813b58f3e53e5916edcabafc1fd28388fea8d8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ImportEvent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ImportEvent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_InterleaveDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_InterleaveDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ef1b06b19cc6a0c62f6e9f451aceed8aeabed553
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_InterleaveDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "InterleaveDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Inv.pbtxt b/tensorflow/core/api_def/python_api/api_def_Inv.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ed58a276f69e46bbf3d14fbd4b921ad2f0d7a2df
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Inv.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Inv"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IsVariableInitialized.pbtxt b/tensorflow/core/api_def/python_api/api_def_IsVariableInitialized.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6a7b0789090c96fa2db968edbc885258aecf34d7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IsVariableInitialized.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IsVariableInitialized"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Iterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_Iterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a021db1534834a5c248e750b9e56e334a20d3949
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Iterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Iterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorFromStringHandle.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorFromStringHandle.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f9efe2d1446330aa78405329b017ce0c81d3a20c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorFromStringHandle.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorFromStringHandle"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorGetNext.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorGetNext.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f7066484ceaa2c0dce7a9ccba8c71838e79e85c0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorGetNext.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorGetNext"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorGetNextSync.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorGetNextSync.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d94edbc71de2295e4a83bda4a0616cbb6c3ebe41
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorGetNextSync.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorGetNextSync"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorSetStatsAggregator.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorSetStatsAggregator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..db51ae3873c7354dc7ce932b99b42edd12066757
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorSetStatsAggregator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorSetStatsAggregator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorToStringHandle.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorToStringHandle.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8a4251f76bd078903cbdf4b2d8419815dab2742e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorToStringHandle.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorToStringHandle"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_LatencyStatsDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_LatencyStatsDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..94bf6106ad8459767d31a345a17483b255dfc02b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_LatencyStatsDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "LatencyStatsDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_LoopCond.pbtxt b/tensorflow/core/api_def/python_api/api_def_LoopCond.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4cfa295b2a33446e3646fc1d000ecefd78d64291
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_LoopCond.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "LoopCond"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MakeIterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_MakeIterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..acc3342c9ba98d1e5022d99a17fe51c9f4af0ce6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MakeIterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MakeIterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapAndBatchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapAndBatchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..cffd2910fb404bc7f75e55e42b9ebba1635db134
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapAndBatchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapAndBatchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapClear.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapClear.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..67c1c3e2dd3191d9e37ea40e6d8cf00e5f888550
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapClear.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapClear"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0b1d2f2c730ff8b8b928fcd97c4fe3bdc704e470
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapIncompleteSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapIncompleteSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..db7921e13b97eebf09260f985b311175bf5b67a4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapIncompleteSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapIncompleteSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapPeek.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapPeek.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..85fab1722948b9b5e0d2e74794bd98b9dd7de37e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapPeek.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapPeek"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8b6ed1a0cf460ad9050af7b3bea7a2ef9bd5c1e4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapStage.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapStage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3ae70d5d5791ec58642fff759c53d56338670540
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapStage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapStage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapUnstage.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapUnstage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e5f92e37db41b2528961f1dde322e3a1938539b3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapUnstage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapUnstage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapUnstageNoKey.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapUnstageNoKey.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2c2a25db2139db94c3a541188fa17021a2492738
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapUnstageNoKey.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapUnstageNoKey"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MaxPool3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_MaxPool3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e8576c9ff2e0729235d9bca70c369536dacaa08e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MaxPool3D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "MaxPool3D"
+  endpoint {
+    name: "nn.max_pool3d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MaxPoolGradGradV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_MaxPoolGradGradV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..534cc90e41ea33fda876a907bc1dfe7eae1bcc15
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MaxPoolGradGradV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MaxPoolGradGradV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MaxPoolGradV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_MaxPoolGradV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e79f839686a425a5648f569d05a2cce60d46edcb
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MaxPoolGradV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MaxPoolGradV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MergeV2Checkpoints.pbtxt b/tensorflow/core/api_def/python_api/api_def_MergeV2Checkpoints.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ca9f74e0c19081446fdaa2d13413d2817e00f402
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MergeV2Checkpoints.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MergeV2Checkpoints"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Mod.pbtxt b/tensorflow/core/api_def/python_api/api_def_Mod.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..48d828ca72a5abfebf1815980e82e1a3f471c175
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Mod.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Mod"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Multinomial.pbtxt b/tensorflow/core/api_def/python_api/api_def_Multinomial.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9b654335806407602938e43850d2165e3c952032
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Multinomial.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Multinomial"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MutexLock.pbtxt b/tensorflow/core/api_def/python_api/api_def_MutexLock.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..74e6e1035771484adbfdabd1720c260a39e5f519
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MutexLock.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MutexLock"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MutexV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_MutexV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..013f42d8550cac92aad2539f766deb3e97abaeaf
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MutexV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MutexV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_NextIteration.pbtxt b/tensorflow/core/api_def/python_api/api_def_NextIteration.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..28ac301e4169fa4302124a3e554cae6f8f1e13db
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_NextIteration.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "NextIteration"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_NthElement.pbtxt b/tensorflow/core/api_def/python_api/api_def_NthElement.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ec838585103345748ef5332b032af7a522393fb0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_NthElement.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "NthElement"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OneShotIterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_OneShotIterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ee9d777b4e4c7104ea919bfe3fa6e48aa0928b20
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OneShotIterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OneShotIterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OnesLike.pbtxt b/tensorflow/core/api_def/python_api/api_def_OnesLike.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c058e5b1ab19790b8aa4049412f937282bd14abb
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OnesLike.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OnesLike"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapClear.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapClear.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b8276b964a58855e3ab92d026ebb0fc00e67f2e7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapClear.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapClear"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapIncompleteSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapIncompleteSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1ba6c5b2fc7f461d05cf944a1152d249de0217f0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapIncompleteSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapIncompleteSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapPeek.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapPeek.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8f0c7afd465358c35ea8c1fd3b33eb4d4a76ef87
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapPeek.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapPeek"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2e155726da6bd697ef422c53d96cec086df511b3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapStage.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapStage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6222c1fc4c998174b65e861ef1aeb4375d58c05e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapStage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapStage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstage.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5cca8d9f93d065271a71ea23bf953e73a1cd6e58
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapUnstage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstageNoKey.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstageNoKey.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d67b95b65b7e00475a1f8f422e3df529b3747ea0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstageNoKey.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapUnstageNoKey"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PaddedBatchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_PaddedBatchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c6223b3132ed0d6878995d3c5e657275fac0cc4f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PaddedBatchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PaddedBatchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ParallelDynamicStitch.pbtxt b/tensorflow/core/api_def/python_api/api_def_ParallelDynamicStitch.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a36ad273646a97aaadbe74718800e5fb1fc27dae
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ParallelDynamicStitch.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ParallelDynamicStitch"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ParallelInterleaveDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ParallelInterleaveDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..93cd5719feb613cd3de2e422e23cc3d690bdef08
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ParallelInterleaveDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ParallelInterleaveDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ParallelMapDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ParallelMapDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..09d200dd24c828af85d1505bb17086dbfa688ee8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ParallelMapDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ParallelMapDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ParseSingleExample.pbtxt b/tensorflow/core/api_def/python_api/api_def_ParseSingleExample.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4193bdd091e015eca8cca85034255d36ba27a67b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ParseSingleExample.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ParseSingleExample"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PlaceholderV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_PlaceholderV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a30360d2de4e36f47f3c7564db5ec9ca045034b9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PlaceholderV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PlaceholderV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PopulationCount.pbtxt b/tensorflow/core/api_def/python_api/api_def_PopulationCount.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d35550236a317c6581b6e8b91f8843b5cc24977f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PopulationCount.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PopulationCount"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PrefetchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_PrefetchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ec4e214eb5e082c8f732cbef9db69524c48d80a4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PrefetchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PrefetchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PrependFromQueueAndPaddedBatchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_PrependFromQueueAndPaddedBatchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..228c4047d2e0b7ddfec1d8cd4fad478aa6c4c1a7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PrependFromQueueAndPaddedBatchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PrependFromQueueAndPaddedBatchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PreventGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_PreventGradient.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9565f5632b9ffdbaf1879dc1c18092838143d06b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PreventGradient.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PreventGradient"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantize.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d2468f1b243f318dcc3a8fb45524c6b548f378fa
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizeAndDequantize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..15e181be20948128a7f970f024e6cc8dfe28c96c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizeAndDequantizeV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV3.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV3.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f1edc6f5faecfbedc0b9b873484b160551b0f2d2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV3.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizeAndDequantizeV3"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizeDownAndShrinkRange.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizeDownAndShrinkRange.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9a2a86d25dad916208e9a666b5ffaa15f1513c4a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizeDownAndShrinkRange.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizeDownAndShrinkRange"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizeV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizeV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..40673234ed02fa49601b83ce4f587b9051295315
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizeV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizeV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b952d6eccbb30df85582848f7f7e7869eea367a8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedBatchNormWithGlobalNormalization.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedBatchNormWithGlobalNormalization.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e009ada5535993ab5c6eefe8e0b8858e04735824
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedBatchNormWithGlobalNormalization.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedBatchNormWithGlobalNormalization"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedBiasAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedBiasAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3432962e593d6777a62723d25bffd22b8001cc68
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedBiasAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedBiasAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedConv2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedConv2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2409d12abeff922cca92f9ae609764a27f651356
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedConv2D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "QuantizedConv2D"
+  endpoint {
+    name: "nn.quantized_conv2d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedInstanceNorm.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedInstanceNorm.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..47a4931a05ab0f9f8b746103667c64f2cc27fbae
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedInstanceNorm.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedInstanceNorm"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedMatMul.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedMatMul.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3ca9d2ae0774cda244db4843e86372cbe40e2ecb
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedMatMul.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedMatMul"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedMul.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedMul.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c026fba194c0a0fa6208799248b7269d09a5623b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedMul.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedMul"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedRelu.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedRelu.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e5da4f25f0e51b1b73b9bb96c9b5b18c2ee54d60
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedRelu.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedRelu"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedRelu6.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedRelu6.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ef1e64831270955219d409c80f865a16713cbcc3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedRelu6.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedRelu6"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedReshape.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedReshape.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7e6d9ed718386f77c5f28ed164803e2d7f148eaf
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedReshape.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedReshape"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedResizeBilinear.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedResizeBilinear.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a8da4128c260644db022183683e2dc362d82d39e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedResizeBilinear.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedResizeBilinear"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QueueIsClosed.pbtxt b/tensorflow/core/api_def/python_api/api_def_QueueIsClosed.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f1d2ef63f1a8849befe42341e15c1630f730ec04
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QueueIsClosed.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QueueIsClosed"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QueueIsClosedV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_QueueIsClosedV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..07cf1a7497a40ff435f40eaaa31d22e8785bd20c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QueueIsClosedV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QueueIsClosedV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RFFT.pbtxt b/tensorflow/core/api_def/python_api/api_def_RFFT.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e9719255aebe6c665f9178c6b652230dd4542d13
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RFFT.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RFFT"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RFFT2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_RFFT2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1336a64408e8135284d9cafd6ca057572950bdf5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RFFT2D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RFFT2D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RFFT3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_RFFT3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..978b5814ff652246cca7630f9a2df22985bbb28e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RFFT3D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RFFT3D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RandomDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_RandomDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a5f6f8c6f1db344c480e2bd452362d977dc15000
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RandomDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RandomDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RandomPoissonV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_RandomPoissonV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8cc217c50ea74af0413a804c8e2b726b3e5f1a91
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RandomPoissonV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RandomPoissonV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RangeDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_RangeDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4cd8296b2233ac58c12e6573d2194f7d976d9137
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RangeDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RangeDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Rank.pbtxt b/tensorflow/core/api_def/python_api/api_def_Rank.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..05aa12f2fa238f540f653e42576899c3e1b799da
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Rank.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Rank"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ReadVariableOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ReadVariableOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e250b78effcd998b3d26804522858b2386df9b46
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ReadVariableOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ReadVariableOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Real.pbtxt b/tensorflow/core/api_def/python_api/api_def_Real.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..52a9089f4a75018cb1a6a551aecef7b1795e9f4f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Real.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Real"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RecordInput.pbtxt b/tensorflow/core/api_def/python_api/api_def_RecordInput.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..29f798050e7c868a13a13b3c123ecbc2c5f70de1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RecordInput.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RecordInput"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ReduceJoin.pbtxt b/tensorflow/core/api_def/python_api/api_def_ReduceJoin.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0fde5942abee41797db084e1b34b8202532db1a5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ReduceJoin.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ReduceJoin"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RefNextIteration.pbtxt b/tensorflow/core/api_def/python_api/api_def_RefNextIteration.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f9dfcf5e97e9fd4f0676cdb59503947c1a1972f9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RefNextIteration.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RefNextIteration"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RefSelect.pbtxt b/tensorflow/core/api_def/python_api/api_def_RefSelect.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8f9909aa86a861e5b1bfb95aa96e3fbd925f0c4a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RefSelect.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RefSelect"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RefSwitch.pbtxt b/tensorflow/core/api_def/python_api/api_def_RefSwitch.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..68b0f4a694aa1f059ca85b5218b40c99e1d21d28
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RefSwitch.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RefSwitch"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RemoteCall.pbtxt b/tensorflow/core/api_def/python_api/api_def_RemoteCall.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fc069d857d0b40bda75dedc4d881359419ce8b6b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RemoteCall.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RemoteCall"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RepeatDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_RepeatDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..be301da8386af0fbd98c9b02d2cfc0fe79178990
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RepeatDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RepeatDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RequantizationRange.pbtxt b/tensorflow/core/api_def/python_api/api_def_RequantizationRange.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e327595a3898c69e3b060a821345cb8d863b4587
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RequantizationRange.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RequantizationRange"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Requantize.pbtxt b/tensorflow/core/api_def/python_api/api_def_Requantize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f26f0611bae9544aab74f014690db9ceb4606241
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Requantize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Requantize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdadelta.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdadelta.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e0413a67a3c949e8d34311b26acd81a251100ca4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdadelta.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyAdadelta"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..52b8ba0b0e4db79a65fc47cc14a66f7469db6328
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagradDA.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagradDA.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..edfc0a733f84542601ce95a3bad1b99db629c2a1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagradDA.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyAdagradDA"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdam.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdam.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ca2713b533804e8ecc9ed76798744c0b07bb5e24
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdam.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyAdam"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyAddSign.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAddSign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..50dd6439536ae38c5de377d5636262f5cdb906a6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAddSign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyAddSign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyCenteredRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyCenteredRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..20592e38c812aeb45cd03bae300b5d6667aee7dd
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyCenteredRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyCenteredRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrl.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrl.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..72b49e09d6217ce9fdc110f91f4e10fe86124e09
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrl.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyFtrl"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrlV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrlV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..af1d24c344d81f94216c0517011b387ab93965eb
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrlV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyFtrlV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..75d6afd426a0bfcd324542e6fdae70bf7f4b53b9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyMomentum.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyMomentum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3e499cf72e254ed5d0e6e73da8f88a9de4392605
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyMomentum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyMomentum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyPowerSign.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyPowerSign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b23ad0d061a42a054e9116a25792f3d73a40caf1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyPowerSign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyPowerSign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6ad124c59005285d2c9a9f894d53147cf0823c86
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyProximalAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d684a5dd6720333e334b8afe462224437e32e248
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyProximalGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c4c20e1382f6759993511e7eb4fd846c63575611
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceCountUpTo.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceCountUpTo.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..87376b74475d3191dd0d2be3a80c8da57087a88b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceCountUpTo.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceCountUpTo"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceGather.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceGather.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..714ba4a7ca9a1f05dc34cefe9e2430ebdc6f284a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceGather.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceGather"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceScatterAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceScatterAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4d4601cafdeb8af4821887ac0d354b1a4a7844b9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceScatterAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceScatterAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceScatterNdUpdate.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceScatterNdUpdate.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..54c66708aeb1263b01ed90b792b1331e909416ec
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceScatterNdUpdate.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceScatterNdUpdate"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceScatterUpdate.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceScatterUpdate.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..30f885bee0cf27801b56918e82cdef8c644afe1b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceScatterUpdate.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceScatterUpdate"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdadelta.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdadelta.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a7e4dad13878112512fc7b915cb4fec9bd47b76a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdadelta.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyAdadelta"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1388da789c08d54bf61ace44791280586d0ec6aa
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagradDA.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagradDA.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c5beaa4f580fb22cd23a7bcefe9009f4168c36b4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagradDA.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyAdagradDA"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyCenteredRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyCenteredRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f3de3d93df3d658d02663537d6cc4c404a315d27
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyCenteredRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyCenteredRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrl.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrl.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f83833d3511cf3f42c0257cd61677684da86b35f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrl.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyFtrl"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrlV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrlV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..71adbb0bcd6f63e2de2ecf63d7c8d56265b29ae2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrlV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyFtrlV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyMomentum.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyMomentum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..28a19caaccfabc58444e5963ff1a1c6446e67255
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyMomentum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyMomentum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e8cda7f4edd1217baca8ed84b7c9ae96a22e3b6f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyProximalAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5fa1ade6696ba464a3da44b05f35c67e8ada4fc8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyProximalGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..86cc9a41ae9db89aa61b2225aeb15c42461dda45
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceStridedSliceAssign.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceStridedSliceAssign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ef6e19fea0d35ad6410f1001d3a683581bd545ea
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceStridedSliceAssign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceStridedSliceAssign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RestoreV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_RestoreV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..34d07239a1a18e85e2534db6607a89c12cc670a1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RestoreV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RestoreV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ReverseSequence.pbtxt b/tensorflow/core/api_def/python_api/api_def_ReverseSequence.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f3fc2578dfd2af59077f611bce137f98f6af38ee
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ReverseSequence.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ReverseSequence"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Roll.pbtxt b/tensorflow/core/api_def/python_api/api_def_Roll.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9cc919f36fef69b28fc20873f1a635f5dd1644cf
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Roll.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Roll"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Round.pbtxt b/tensorflow/core/api_def/python_api/api_def_Round.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..74428e2f58323e47ca672d50f5193bac2977b1f9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Round.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Round"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SaveV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_SaveV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..617897ee44e8351bd95d9f44ca2b660894617b88
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SaveV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SaveV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ScanDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ScanDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e71b655c22fbcbf1524433fc65a392e4d80c5c43
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ScanDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ScanDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ScatterNdNonAliasingAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_ScatterNdNonAliasingAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ecf71cd6257b5630566cb6fb92110f6f738f91f4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ScatterNdNonAliasingAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ScatterNdNonAliasingAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ScatterNdUpdate.pbtxt b/tensorflow/core/api_def/python_api/api_def_ScatterNdUpdate.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ccf4a9cce807c4cbc5fe2fdc1e2a7057a0bc5464
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ScatterNdUpdate.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ScatterNdUpdate"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ScatterUpdate.pbtxt b/tensorflow/core/api_def/python_api/api_def_ScatterUpdate.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e4c41c1226674fbb21899ea31be8668b0d8f6ece
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ScatterUpdate.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ScatterUpdate"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SerializeIterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_SerializeIterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..07d2f200fee0dee55cb813389f672a914a10e0f2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SerializeIterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SerializeIterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SetSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_SetSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ee9c71036bdcf6c3d64d468cfe5a4793e522335d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SetSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SetSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Shape.pbtxt b/tensorflow/core/api_def/python_api/api_def_Shape.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..bd7b5ad36c6e1d7f3292cbd4ca13a1242bf09e8c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Shape.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Shape"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ShapeN.pbtxt b/tensorflow/core/api_def/python_api/api_def_ShapeN.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b2dbe74b09689b6c4fb1c54640205c7281d23780
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ShapeN.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ShapeN"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ShuffleAndRepeatDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ShuffleAndRepeatDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7b0d2994f0711f440fb6623aa2322c86bd3859f8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ShuffleAndRepeatDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ShuffleAndRepeatDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ShuffleDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ShuffleDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8f0be9197adeb23b2d5047c5d69916df0e2c1eda
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ShuffleDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ShuffleDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Sign.pbtxt b/tensorflow/core/api_def/python_api/api_def_Sign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c2ee91dd12ed16ba27a9c4ae45b48194bc5a8b03
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Sign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Sign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Size.pbtxt b/tensorflow/core/api_def/python_api/api_def_Size.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7f76173a5d870910edead637d3493e75ba651b67
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Size.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Size"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SkipDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_SkipDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..96a551c5b6669a8d019e3c705507aba768ab9d21
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SkipDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SkipDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SpaceToDepth.pbtxt b/tensorflow/core/api_def/python_api/api_def_SpaceToDepth.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d56a7384eb9c75f2f90420a1a742733b364e770c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SpaceToDepth.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SpaceToDepth"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorApplyGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorApplyGradient.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5e158c9ca0ca4620cb18c7e98f969598180df7c0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorApplyGradient.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseAccumulatorApplyGradient"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorTakeGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorTakeGradient.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5326f23def4637a178e8af1aff972f4ad1d982c7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorTakeGradient.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseAccumulatorTakeGradient"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyAdadelta.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdadelta.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d30a8676e03172b852a1a3c6d50f77722ed25625
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdadelta.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyAdadelta"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..cb5ddef2128bbaa3239a279df614a4e3512dcf41
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagradDA.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagradDA.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c3b87b09536ee5d36f5d8b1c83b025e5857d13ab
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagradDA.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyAdagradDA"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyCenteredRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyCenteredRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..db4732873845424e21a14506b84723268a963eea
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyCenteredRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyCenteredRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrl.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrl.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..14e37b8ba209374d6478cc047229e55314edcd81
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrl.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyFtrl"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrlV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrlV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0d307af9b497739d76d1d875751ef16323a1d56b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrlV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyFtrlV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyMomentum.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyMomentum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ed34c0485d4b597dbe366ad69503be1393e079e8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyMomentum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyMomentum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ff2d3b673141a571b6b1d816bf29fb5ba9880232
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyProximalAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f342a611bb2aed3e4c9eae07a4802ed59bae76e8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyProximalGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7f337d50e5b0a995a2a3765782bf229d967ee9b4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseConditionalAccumulator.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseConditionalAccumulator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..bad4120795ef1fc6411a3bcacc209efe2e7b6841
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseConditionalAccumulator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseConditionalAccumulator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c5e7c9851f15b6748e48eadd26b26925fbe2ed94
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseDenseCwiseAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseDiv.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseDiv.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f72031cf684b4515bbf580203f2ea4e5714eab58
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseDiv.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseDenseCwiseDiv"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseMul.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseMul.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a87004ee5f9c9d445acb1d49138639d91da7b44f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseMul.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseDenseCwiseMul"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseReduceMax.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseReduceMax.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a885e97d238101857a3c45a11ba0dbacb1a21fd8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseReduceMax.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseReduceMax"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseReduceMaxSparse.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseReduceMaxSparse.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c7d697880178cc6252edd0672553d8bf44eb6bea
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseReduceMaxSparse.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseReduceMaxSparse"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseReduceSum.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseReduceSum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ee30d7aaf17bc122797394b61cb9eeba0387928d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseReduceSum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseReduceSum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseReduceSumSparse.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseReduceSumSparse.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0dd89fb4c50a955f75d38dd999467d467eea8d54
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseReduceSumSparse.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseReduceSumSparse"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentMean.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMean.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f12c2e207368d4c585137bfeac523e2fada0ed50
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMean.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentMean"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..771083cd513c4d3deba21aaa2abaa090362c5684
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentMeanGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanWithNumSegments.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanWithNumSegments.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fcb029535c36c6be0e7e08f36ac7b39a5e126df0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanWithNumSegments.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentMeanWithNumSegments"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtN.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtN.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7daaa81482ff7f9e84ee7ba8a3768d0b19ef38e0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtN.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentSqrtN"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0682a597bb038abebab4198a44079486b60eb799
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentSqrtNGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNWithNumSegments.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNWithNumSegments.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7311a093df9e593dbebac764cd28e55c556ae6da
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNWithNumSegments.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentSqrtNWithNumSegments"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentSum.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e7028efce268d3e6e80328908c3a9e2dfc3d4343
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentSum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentSumWithNumSegments.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSumWithNumSegments.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..81c2b8554e8f62e5ffffbdf410389776bbcf9035
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSumWithNumSegments.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentSumWithNumSegments"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSlice.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSlice.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..716f8781d0ce698deb9b1f70ade8f1988b418749
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSlice.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSlice"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSoftmax.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSoftmax.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fc29dd6513a9abce3f5ab39a3e8a5a79aed138d4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSoftmax.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSoftmax"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSparseMaximum.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSparseMaximum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0dbadc01edc539658e3066a9d720c71bc50ecae9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSparseMaximum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSparseMaximum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSparseMinimum.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSparseMinimum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0e3ffcbddf3bccdae0e28cc15527566ddf2ff03a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSparseMinimum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSparseMinimum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseTensorSliceDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseTensorSliceDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..19c0c7f199dfd24d24a56c3766733f9e55957c12
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseTensorSliceDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseTensorSliceDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseToSparseSetOperation.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseToSparseSetOperation.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..735ee18e149aca56cd82c9bb2b3fd8d3870a3188
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseToSparseSetOperation.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseToSparseSetOperation"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SqlDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_SqlDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2ab4c3e441dd51f50a2796ef9d6fa0d21b727ffa
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SqlDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SqlDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Sqrt.pbtxt b/tensorflow/core/api_def/python_api/api_def_Sqrt.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..59e2dfe8366813242337c9490d74ca317e525636
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Sqrt.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Sqrt"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Square.pbtxt b/tensorflow/core/api_def/python_api/api_def_Square.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7b39ae25fa062b4271dcc2aee6523847c97b1e4d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Square.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Square"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Stage.pbtxt b/tensorflow/core/api_def/python_api/api_def_Stage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..66de5901bc9b604d92693e2affb75ea8555bfc4e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Stage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Stage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StageClear.pbtxt b/tensorflow/core/api_def/python_api/api_def_StageClear.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f54a1c1c0428753d93cc19abbd4fbc961d8eb988
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StageClear.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StageClear"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StagePeek.pbtxt b/tensorflow/core/api_def/python_api/api_def_StagePeek.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..710394d30d34cece63c148f41b6a18a3e4d99b7b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StagePeek.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StagePeek"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StageSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_StageSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..472032ac42af197e32a437a112f6c39704193ad0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StageSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StageSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StatsAggregatorHandle.pbtxt b/tensorflow/core/api_def/python_api/api_def_StatsAggregatorHandle.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f7bed36602f40602313157c20677acbbf592d7be
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StatsAggregatorHandle.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StatsAggregatorHandle"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StatsAggregatorSummary.pbtxt b/tensorflow/core/api_def/python_api/api_def_StatsAggregatorSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8b1bab2440f1934f1fd0194b76b7907fb0fb142d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StatsAggregatorSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StatsAggregatorSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StridedSlice.pbtxt b/tensorflow/core/api_def/python_api/api_def_StridedSlice.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a55fa9887797ed9fa6900f9e77f9d1fa70de5aa2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StridedSlice.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StridedSlice"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StridedSliceAssign.pbtxt b/tensorflow/core/api_def/python_api/api_def_StridedSliceAssign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..bcf1df228e879fbde73fed7d0e955f67ea494663
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StridedSliceAssign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StridedSliceAssign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StridedSliceGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_StridedSliceGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..05d7d57511e8dd485d40a5168ca0866c4b6c481a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StridedSliceGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StridedSliceGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SummaryWriter.pbtxt b/tensorflow/core/api_def/python_api/api_def_SummaryWriter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1fe57ecf195c85217bd174dbc503b28f26adade9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SummaryWriter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SummaryWriter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TFRecordDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_TFRecordDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3c270ada3c219b03715e0cd651a4b56fe5ebc227
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TFRecordDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TFRecordDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TakeDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_TakeDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..711b335dc1926d32071637b3c986727c339736a3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TakeDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TakeDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TensorDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_TensorDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5bc3920c56360f2348805db1db79ab2b630f379d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TensorDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TensorDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TensorSliceDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_TensorSliceDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..89ad016483fa392a302915d588d32201237c717a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TensorSliceDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TensorSliceDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TextLineDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_TextLineDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..08d785191b6a4bddce2ac43fd4c0188b4d74548e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TextLineDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TextLineDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Transpose.pbtxt b/tensorflow/core/api_def/python_api/api_def_Transpose.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e22b6a040e4011f87b1c945ffd7df050bcbdea76
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Transpose.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Transpose"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Unstage.pbtxt b/tensorflow/core/api_def/python_api/api_def_Unstage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..65eb756b870d4a6b8d767de3876d9353a192c1d5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Unstage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Unstage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_VarHandleOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_VarHandleOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2c93a6db93cf62aab345bb9044e4acddd01da7d9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_VarHandleOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "VarHandleOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_VarIsInitializedOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_VarIsInitializedOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..de5d9850acb1a3adbc59e554aedd819d870c5442
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_VarIsInitializedOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "VarIsInitializedOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_VariableShape.pbtxt b/tensorflow/core/api_def/python_api/api_def_VariableShape.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9b317152ddf925b3f0b5b24c95bcb44bed6b718a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_VariableShape.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "VariableShape"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Where.pbtxt b/tensorflow/core/api_def/python_api/api_def_Where.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d4dd25a20655a036c0fba33c14133389a532bf8e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Where.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Where"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_WriteAudioSummary.pbtxt b/tensorflow/core/api_def/python_api/api_def_WriteAudioSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..520952cd4117867f61cd3c536b8a7cc5beeeab62
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_WriteAudioSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteAudioSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_WriteGraphSummary.pbtxt b/tensorflow/core/api_def/python_api/api_def_WriteGraphSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3653477b2067875ee772dedc5015bc550de1ec12
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_WriteGraphSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteGraphSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_WriteHistogramSummary.pbtxt b/tensorflow/core/api_def/python_api/api_def_WriteHistogramSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..26e1482630596d9ecb80917ff91adb2bd1131692
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_WriteHistogramSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteHistogramSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_WriteImageSummary.pbtxt b/tensorflow/core/api_def/python_api/api_def_WriteImageSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..78db8700f0c7231f24fd1db3a0eedbcc4f43deeb
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_WriteImageSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteImageSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_WriteScalarSummary.pbtxt b/tensorflow/core/api_def/python_api/api_def_WriteScalarSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7bae8638d258b6fdf217fbdbd8705369f57e0bb3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_WriteScalarSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteScalarSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_WriteSummary.pbtxt b/tensorflow/core/api_def/python_api/api_def_WriteSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..db86883e21ea47a9e74788f8d76a166a235674da
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_WriteSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "WriteSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ZipDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ZipDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..dd1459521ff70fc4b3adce7fbb1251b45106b439
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ZipDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ZipDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/common_runtime/accumulate_n_optimizer.cc b/tensorflow/core/common_runtime/accumulate_n_optimizer.cc
index 832a55f2556f46efe6a94fb62d0420330917faac..822d0065b6713dbc6692ed11b7a938a784b0d597 100644
--- a/tensorflow/core/common_runtime/accumulate_n_optimizer.cc
+++ b/tensorflow/core/common_runtime/accumulate_n_optimizer.cc
@@ -114,19 +114,43 @@ class AccumulateNV2RemovePass : public GraphOptimizationPass {
 
     const string accumulator_name =
         strings::StrCat(n->name(), "/Internal/Accumulator");
+    TensorShapeProto variable_shape;
+    variable_shape.add_dim()->set_size(0);
     TF_RETURN_IF_ERROR(make_node("TemporaryVariable")
-                           .Attr("shape", shape)
+                           .Attr("shape", variable_shape)
                            .Attr("dtype", dtype)
                            .Attr("var_name", accumulator_name)
                            .Finalize(g, &create_accumulator));
-    TF_RETURN_IF_ERROR(make_node("Const")
-                           .Attr("value", make_zeros(dtype, shape))
-                           .Attr("dtype", dtype)
-                           .Finalize(g, &initial_val));
+    if (PartialTensorShape(shape).IsFullyDefined()) {
+      // For fully defined shapes, make a constant zero tensor.
+      TF_RETURN_IF_ERROR(make_node("Const")
+                             .Attr("value", make_zeros(dtype, shape))
+                             .Attr("dtype", dtype)
+                             .Finalize(g, &initial_val));
+    } else {
+      // For partial shapes, use a Fill operation to create a zero tensor with
+      // the shape of the first input.
+      Node* shape_node;
+      TF_RETURN_IF_ERROR(
+          make_node("Shape")
+              .Input(data_edges[0]->src(), data_edges[0]->src_output())
+              .Finalize(g, &shape_node));
+      Node* zero;
+      TF_RETURN_IF_ERROR(
+          make_node("Const")
+              .Attr("value", make_zeros(dtype, TensorShapeProto()))
+              .Attr("dtype", dtype)
+              .Finalize(g, &zero));
+      TF_RETURN_IF_ERROR(make_node("Fill")
+                             .Input(shape_node)
+                             .Input(zero)
+                             .Finalize(g, &initial_val));
+    }
     TF_RETURN_IF_ERROR(make_node("Assign")
                            .Attr("T", dtype)
                            .Input(create_accumulator)  // ref: Ref(T)
                            .Input(initial_val)         // value: T
+                           .Attr("validate_shape", false)
                            .Finalize(g, &initialize_accumulator));
     for (int i = 0; i < data_edges.size(); ++i) {
       Node* assignAdd;
diff --git a/tensorflow/core/common_runtime/build_graph_options.cc b/tensorflow/core/common_runtime/build_graph_options.cc
index 811d459758a826c4c1b2d14cca7a96ea07d0a799..a9dc6ca6cda9443ae9737267aca5f361e492d22d 100644
--- a/tensorflow/core/common_runtime/build_graph_options.cc
+++ b/tensorflow/core/common_runtime/build_graph_options.cc
@@ -21,15 +21,15 @@ namespace tensorflow {
 
 string BuildGraphOptions::DebugString() const {
   string rv = "Feed endpoints: ";
-  for (auto& s : feed_endpoints) {
+  for (auto& s : callable_options.feed()) {
     strings::StrAppend(&rv, s, ", ");
   }
   strings::StrAppend(&rv, "\nFetch endpoints: ");
-  for (auto& s : fetch_endpoints) {
+  for (auto& s : callable_options.fetch()) {
     strings::StrAppend(&rv, s, ", ");
   }
   strings::StrAppend(&rv, "\nTarget nodes: ");
-  for (auto& s : target_nodes) {
+  for (auto& s : callable_options.target()) {
     strings::StrAppend(&rv, s, ", ");
   }
   return rv;
diff --git a/tensorflow/core/common_runtime/build_graph_options.h b/tensorflow/core/common_runtime/build_graph_options.h
index 5f0e8f170b9e9b0c6a3094e475fcc3bbf47756ea..5ca170e922ce348fb3f76b1129b22ca01804054c 100644
--- a/tensorflow/core/common_runtime/build_graph_options.h
+++ b/tensorflow/core/common_runtime/build_graph_options.h
@@ -19,25 +19,18 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/core/platform/types.h"
-#include "tensorflow/core/protobuf/debug.pb.h"
+#include "tensorflow/core/protobuf/config.pb.h"
 
 namespace tensorflow {
 
 struct BuildGraphOptions {
-  std::vector<string> feed_endpoints;
-  std::vector<string> fetch_endpoints;
-
-  // TODO(vrv): Remove this when we unify target_nodes and fetch_endpoint,
-  // the former via "ref" fetch_endpoints.
-  std::vector<string> target_nodes;
+  CallableOptions callable_options;
 
   // If `true`, uses Arg/Retval to implement feeds/fetches; otherwise
   // uses Recv/Send to implement feeds/fetches.
   // TODO(mrry): Remove this when the distributed runtime supports Arg/Retval.
   bool use_function_convention = false;
 
-  DebugOptions debug_options;
-
   string DebugString() const;
 };
 
diff --git a/tensorflow/core/common_runtime/constant_folding.h b/tensorflow/core/common_runtime/constant_folding.h
index b1e1fb831963bccb81731752ec76b9d5be123d9f..84598880bb20e74570fb79de8e9e0d75fa341658 100644
--- a/tensorflow/core/common_runtime/constant_folding.h
+++ b/tensorflow/core/common_runtime/constant_folding.h
@@ -22,6 +22,8 @@ limitations under the License.
 #include "tensorflow/core/graph/graph.h"
 #include "tensorflow/core/platform/env.h"
 
+// TODO(skyewm): can this be combined with EvaluateConstantTensor?
+
 namespace tensorflow {
 
 // This generator type is used to generate a name for the newly folded node
diff --git a/tensorflow/core/common_runtime/device_mgr.cc b/tensorflow/core/common_runtime/device_mgr.cc
index 1f0cc5e83bdb8520cb49387ea34e69491c71f61e..a77601ba79bf29998e1bb78b5e39dbdbd48434e9 100644
--- a/tensorflow/core/common_runtime/device_mgr.cc
+++ b/tensorflow/core/common_runtime/device_mgr.cc
@@ -94,8 +94,8 @@ Status DeviceMgr::LookupDevice(StringPiece name, Device** device) const {
     for (auto&& itr : device_map_) {
       device_names.push_back(itr.first);
     }
-    LOG(WARNING) << "Unknown device: " << name
-                 << " all devices: " << str_util::Join(device_names, ", ");
+    VLOG(1) << "Unknown device: " << name
+            << " all devices: " << str_util::Join(device_names, ", ");
     return errors::InvalidArgument(name, " unknown device.");
   }
   *device = iter->second;
diff --git a/tensorflow/core/common_runtime/direct_session.cc b/tensorflow/core/common_runtime/direct_session.cc
index ecbffcbf6c4030bde82f2abe0e7779bf9c5a9870..25cfb9e524cd12c92fc5edb01f0d4bed64fb872f 100644
--- a/tensorflow/core/common_runtime/direct_session.cc
+++ b/tensorflow/core/common_runtime/direct_session.cc
@@ -318,6 +318,7 @@ DirectSession::~DirectSession() {
   for (auto& it : executors_) {
     it.second.reset();
   }
+  callables_.clear();
   for (auto d : device_mgr_->ListDevices()) {
     d->op_segment()->RemoveHold(session_handle_);
   }
@@ -409,16 +410,21 @@ Status DirectSession::Run(const NamedTensorList& inputs,
 }
 
 Status DirectSession::CreateDebuggerState(
-    const DebugOptions& debug_options, int64 session_run_index,
-    int64 executor_step_index, const std::vector<string>& input_names,
-    const std::vector<string>& output_names,
-    const std::vector<string>& target_names,
+    const CallableOptions& callable_options, int64 global_step,
+    int64 session_run_index, int64 executor_step_index,
     std::unique_ptr<DebuggerStateInterface>* debugger_state) {
-  TF_RETURN_IF_ERROR(
-      DebuggerStateRegistry::CreateState(debug_options, debugger_state));
+  TF_RETURN_IF_ERROR(DebuggerStateRegistry::CreateState(
+      callable_options.run_options().debug_options(), debugger_state));
+  std::vector<string> input_names(callable_options.feed().begin(),
+                                  callable_options.feed().end());
+  std::vector<string> output_names(callable_options.fetch().begin(),
+                                   callable_options.fetch().end());
+  std::vector<string> target_names(callable_options.target().begin(),
+                                   callable_options.target().end());
+
   TF_RETURN_IF_ERROR(debugger_state->get()->PublishDebugMetadata(
-      debug_options.global_step(), session_run_index, executor_step_index,
-      input_names, output_names, target_names));
+      global_step, session_run_index, executor_step_index, input_names,
+      output_names, target_names));
   return Status::OK();
 }
 
@@ -433,84 +439,23 @@ Status DirectSession::DecorateAndPublishGraphForDebug(
   return Status::OK();
 }
 
-Status DirectSession::Run(const RunOptions& run_options,
-                          const NamedTensorList& inputs,
-                          const std::vector<string>& output_names,
-                          const std::vector<string>& target_nodes,
-                          std::vector<Tensor>* outputs,
-                          RunMetadata* run_metadata) {
-  TF_RETURN_IF_ERROR(CheckNotClosed());
-  direct_session_runs->GetCell()->IncrementBy(1);
-  {
-    mutex_lock l(graph_def_lock_);
-    if (!graph_created_) {
-      return errors::InvalidArgument(
-          "Session was not created with a graph before Run()!");
-    }
-  }
-
-  // Extract the inputs names for this run of the session.
-  std::vector<string> input_tensor_names;
-  input_tensor_names.reserve(inputs.size());
-  for (const auto& it : inputs) {
-    input_tensor_names.push_back(it.first);
-  }
-
-  if (run_options.inter_op_thread_pool() < 0 ||
-      run_options.inter_op_thread_pool() >= thread_pools_.size()) {
-    return errors::InvalidArgument("Invalid inter_op_thread_pool: ",
-                                   run_options.inter_op_thread_pool());
-  }
-  thread::ThreadPool* pool =
-      thread_pools_[run_options.inter_op_thread_pool()].first;
-
-  // Check if we already have an executor for these arguments.
-  ExecutorsAndKeys* executors_and_keys;
-  RunStateArgs run_state_args(run_options.debug_options());
-
-  Executor::Args args;
-  args.step_id = step_id_counter_.fetch_add(1);
-
-  TF_RETURN_IF_ERROR(GetOrCreateExecutors(input_tensor_names, output_names,
-                                          target_nodes, &executors_and_keys,
-                                          &run_state_args));
+Status DirectSession::RunInternal(int64 step_id, const RunOptions& run_options,
+                                  CallFrameInterface* call_frame,
+                                  ExecutorsAndKeys* executors_and_keys,
+                                  RunMetadata* run_metadata) {
   const int64 executor_step_count = executors_and_keys->step_count.fetch_add(1);
 
   std::unique_ptr<DebuggerStateInterface> debugger_state;
   if (!run_options.debug_options().debug_tensor_watch_opts().empty()) {
-    TF_RETURN_IF_ERROR(CreateDebuggerState(
-        run_options.debug_options(), args.step_id, executor_step_count,
-        input_tensor_names, output_names, target_nodes, &debugger_state));
-  }
-
-  // Configure a call frame for the step, which we use to feed and
-  // fetch values to and from the executors.
-  FunctionCallFrame call_frame(executors_and_keys->input_types,
-                               executors_and_keys->output_types);
-  gtl::InlinedVector<Tensor, 4> feed_args(inputs.size());
-  for (const auto& it : inputs) {
-    if (it.second.dtype() == DT_RESOURCE) {
-      Tensor tensor_from_handle;
-      TF_RETURN_IF_ERROR(
-          ResourceHandleToInputTensor(it.second, &tensor_from_handle));
-      feed_args[executors_and_keys->input_name_to_index[it.first]] =
-          tensor_from_handle;
-    } else {
-      feed_args[executors_and_keys->input_name_to_index[it.first]] = it.second;
-    }
-  }
-  const Status s = call_frame.SetArgs(feed_args);
-  if (errors::IsInternal(s)) {
-    return errors::InvalidArgument(s.error_message());
-  } else if (!s.ok()) {
-    return s;
+    TF_RETURN_IF_ERROR(
+        CreateDebuggerState(executors_and_keys->callable_options,
+                            run_options.debug_options().global_step(), step_id,
+                            executor_step_count, &debugger_state));
   }
 
   // Create a run state and start execution.
-  RunState run_state(args.step_id, &devices_);
+  RunState run_state(step_id, &devices_);
   run_state.rendez = new IntraProcessRendezvous(device_mgr_.get());
-  CancellationManager step_cancellation_manager;
-  args.call_frame = &call_frame;
 
   // Start parallel Executors.
   const size_t num_executors = executors_and_keys->items.size();
@@ -523,15 +468,15 @@ Status DirectSession::Run(const RunOptions& run_options,
         run_state.executors_done.Notify();
       });
 
+  Executor::Args args;
+  args.step_id = step_id;
+  args.call_frame = call_frame;
   args.rendezvous = run_state.rendez;
+  CancellationManager step_cancellation_manager;
   args.cancellation_manager = &step_cancellation_manager;
-
   args.session_state = &session_state_;
   args.tensor_store = &run_state.tensor_store;
   args.step_container = &run_state.step_container;
-  if (LogMemory::IsEnabled()) {
-    LogMemory::RecordStep(args.step_id, run_state_args.handle);
-  }
   args.sync_on_finish = sync_on_finish_;
 
   const bool do_trace = (run_options.trace_level() > RunOptions::NO_TRACE);
@@ -569,6 +514,14 @@ Status DirectSession::Run(const RunOptions& run_options,
     }
   }
 
+  if (run_options.inter_op_thread_pool() < 0 ||
+      run_options.inter_op_thread_pool() >= thread_pools_.size()) {
+    run_state.executors_done.Notify();
+    delete barrier;
+    return errors::InvalidArgument("Invalid inter_op_thread_pool: ",
+                                   run_options.inter_op_thread_pool());
+  }
+
   // Register this step with session's cancellation manager, so that
   // `Session::Close()` will cancel the step.
   const CancellationToken cancellation_token =
@@ -586,6 +539,9 @@ Status DirectSession::Run(const RunOptions& run_options,
     return errors::Cancelled("Run call was cancelled");
   }
 
+  thread::ThreadPool* pool =
+      thread_pools_[run_options.inter_op_thread_pool()].first;
+
   Executor::Args::Runner default_runner = [this,
                                            pool](Executor::Args::Closure c) {
     SchedClosure(pool, std::move(c));
@@ -628,6 +584,111 @@ Status DirectSession::Run(const RunOptions& run_options,
     TF_RETURN_IF_ERROR(run_state.status);
   }
 
+  // Save the output tensors of this run we choose to keep.
+  if (!run_state.tensor_store.empty()) {
+    TF_RETURN_IF_ERROR(run_state.tensor_store.SaveTensors(
+        {executors_and_keys->callable_options.fetch().begin(),
+         executors_and_keys->callable_options.fetch().end()},
+        &session_state_));
+  }
+
+  if (args.stats_collector) {
+    args.stats_collector->Finalize();
+  }
+
+  // Build and return the cost model as instructed.
+  if (update_cost_model) {
+    // Build the cost model
+    std::unordered_map<string, const Graph*> device_to_graph;
+    for (const PerPartitionExecutorsAndLib& partition :
+         executors_and_keys->items) {
+      const Graph* graph = partition.graph;
+      const string device = partition.flib->device()->name();
+      device_to_graph[device] = graph;
+    }
+
+    mutex_lock l(executor_lock_);
+    args.stats_collector->BuildCostModel(&cost_model_manager_, device_to_graph);
+
+    // Annotate stats onto the cost graph.
+    CostGraphDef* cost_graph = run_metadata->mutable_cost_graph();
+    for (const auto& item : executors_and_keys->items) {
+      TF_RETURN_IF_ERROR(
+          cost_model_manager_.AddToCostGraphDef(item.graph, cost_graph));
+    }
+  }
+
+  // If requested via RunOptions, output the partition graphs.
+  if (run_options.output_partition_graphs()) {
+    protobuf::RepeatedPtrField<GraphDef>* partition_graph_defs =
+        run_metadata->mutable_partition_graphs();
+    for (const PerPartitionExecutorsAndLib& exec_and_lib :
+         executors_and_keys->items) {
+      GraphDef* partition_graph_def = partition_graph_defs->Add();
+      exec_and_lib.graph->ToGraphDef(partition_graph_def);
+    }
+  }
+
+  return Status::OK();
+}
+
+Status DirectSession::Run(const RunOptions& run_options,
+                          const NamedTensorList& inputs,
+                          const std::vector<string>& output_names,
+                          const std::vector<string>& target_nodes,
+                          std::vector<Tensor>* outputs,
+                          RunMetadata* run_metadata) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  TF_RETURN_IF_ERROR(CheckGraphCreated("Run()"));
+  direct_session_runs->GetCell()->IncrementBy(1);
+
+  // Extract the input names for this run of the session.
+  std::vector<string> input_tensor_names;
+  input_tensor_names.reserve(inputs.size());
+  for (const auto& it : inputs) {
+    input_tensor_names.push_back(it.first);
+  }
+
+  // Check if we already have an executor for these arguments.
+  ExecutorsAndKeys* executors_and_keys;
+  RunStateArgs run_state_args(run_options.debug_options());
+
+  TF_RETURN_IF_ERROR(GetOrCreateExecutors(input_tensor_names, output_names,
+                                          target_nodes, &executors_and_keys,
+                                          &run_state_args));
+
+  // Configure a call frame for the step, which we use to feed and
+  // fetch values to and from the executors.
+  FunctionCallFrame call_frame(executors_and_keys->input_types,
+                               executors_and_keys->output_types);
+  gtl::InlinedVector<Tensor, 4> feed_args(inputs.size());
+  for (const auto& it : inputs) {
+    if (it.second.dtype() == DT_RESOURCE) {
+      Tensor tensor_from_handle;
+      TF_RETURN_IF_ERROR(
+          ResourceHandleToInputTensor(it.second, &tensor_from_handle));
+      feed_args[executors_and_keys->input_name_to_index[it.first]] =
+          tensor_from_handle;
+    } else {
+      feed_args[executors_and_keys->input_name_to_index[it.first]] = it.second;
+    }
+  }
+  const Status s = call_frame.SetArgs(feed_args);
+  if (errors::IsInternal(s)) {
+    return errors::InvalidArgument(s.error_message());
+  } else if (!s.ok()) {
+    return s;
+  }
+
+  const int64 step_id = step_id_counter_.fetch_add(1);
+
+  if (LogMemory::IsEnabled()) {
+    LogMemory::RecordStep(step_id, run_state_args.handle);
+  }
+
+  TF_RETURN_IF_ERROR(RunInternal(step_id, run_options, &call_frame,
+                                 executors_and_keys, run_metadata));
+
   // Receive outputs.
   if (outputs) {
     std::vector<Tensor> sorted_outputs;
@@ -667,45 +728,6 @@ Status DirectSession::Run(const RunOptions& run_options,
     }
   }
 
-  // Save the output tensors of this run we choose to keep.
-  TF_RETURN_IF_ERROR(
-      run_state.tensor_store.SaveTensors(output_names, &session_state_));
-  if (args.stats_collector) {
-    args.stats_collector->Finalize();
-  }
-
-  // Build and return the cost model as instructed.
-  mutex_lock l(executor_lock_);
-  if (update_cost_model) {
-    // Build the cost model
-    std::unordered_map<string, const Graph*> device_to_graph;
-    for (const PerPartitionExecutorsAndLib& partition :
-         executors_and_keys->items) {
-      const Graph* graph = partition.graph;
-      const string device = partition.flib->device()->name();
-      device_to_graph[device] = graph;
-    }
-    args.stats_collector->BuildCostModel(&cost_model_manager_, device_to_graph);
-
-    // annotate stats onto cost graph.
-    CostGraphDef* cost_graph = run_metadata->mutable_cost_graph();
-    for (const auto& item : executors_and_keys->items) {
-      TF_RETURN_IF_ERROR(
-          cost_model_manager_.AddToCostGraphDef(item.graph, cost_graph));
-    }
-  }
-
-  // If requested via RunOptions, output the partition graphs.
-  if (run_options.output_partition_graphs()) {
-    protobuf::RepeatedPtrField<GraphDef>* partition_graph_defs =
-        run_metadata->mutable_partition_graphs();
-    for (const PerPartitionExecutorsAndLib& exec_and_lib :
-         executors_and_keys->items) {
-      GraphDef* partition_graph_def = partition_graph_defs->Add();
-      exec_and_lib.graph->ToGraphDef(partition_graph_def);
-    }
-  }
-
   return Status::OK();
 }
 
@@ -714,13 +736,7 @@ Status DirectSession::PRunSetup(const std::vector<string>& input_names,
                                 const std::vector<string>& target_nodes,
                                 string* handle) {
   TF_RETURN_IF_ERROR(CheckNotClosed());
-  {
-    mutex_lock l(graph_def_lock_);
-    if (!graph_created_) {
-      return errors::InvalidArgument(
-          "Session was not created with a graph before PRunSetup()!");
-    }
-  }
+  TF_RETURN_IF_ERROR(CheckGraphCreated("PRunSetup()"));
 
   // RunOptions is not available in PRunSetup, so use thread pool 0.
   thread::ThreadPool* pool = thread_pools_[0].first;
@@ -1061,92 +1077,20 @@ Status DirectSession::CheckFetch(const NamedTensorList& feeds,
   return Status::OK();
 }
 
-Status DirectSession::GetOrCreateExecutors(
-    gtl::ArraySlice<string> inputs, gtl::ArraySlice<string> outputs,
-    gtl::ArraySlice<string> target_nodes, ExecutorsAndKeys** executors_and_keys,
+Status DirectSession::CreateExecutors(
+    const CallableOptions& callable_options,
+    std::unique_ptr<ExecutorsAndKeys>* out_executors_and_keys,
+    std::unique_ptr<FunctionInfo>* out_func_info,
     RunStateArgs* run_state_args) {
-  int64 handle_name_counter_value = -1;
-  if (LogMemory::IsEnabled() || run_state_args->is_partial_run) {
-    handle_name_counter_value = handle_name_counter_.fetch_add(1);
-  }
-
-  string debug_tensor_watches_summary;
-  if (!run_state_args->debug_options.debug_tensor_watch_opts().empty()) {
-    debug_tensor_watches_summary = SummarizeDebugTensorWatches(
-        run_state_args->debug_options.debug_tensor_watch_opts());
-  }
-
-  // Fast lookup path, no sorting.
-  const string key = strings::StrCat(
-      str_util::Join(inputs, ","), "->", str_util::Join(outputs, ","), "/",
-      str_util::Join(target_nodes, ","), "/", run_state_args->is_partial_run,
-      "/", debug_tensor_watches_summary);
-  // Set the handle, if it's needed to log memory or for partial run.
-  if (handle_name_counter_value >= 0) {
-    run_state_args->handle =
-        strings::StrCat(key, ";", handle_name_counter_value);
-  }
-
-  // See if we already have the executors for this run.
-  {
-    mutex_lock l(executor_lock_);  // could use reader lock
-    auto it = executors_.find(key);
-    if (it != executors_.end()) {
-      *executors_and_keys = it->second.get();
-      return Status::OK();
-    }
-  }
-
-  // Slow lookup path, the unsorted key missed the cache.
-  // Sort the inputs and outputs, and look up with the sorted key in case an
-  // earlier call used a different order of inputs and outputs.
-  //
-  // We could consider some other signature instead of sorting that
-  // preserves the same property to avoid the sort in the future.
-  std::vector<string> inputs_sorted(inputs.begin(), inputs.end());
-  std::sort(inputs_sorted.begin(), inputs_sorted.end());
-  std::vector<string> outputs_sorted(outputs.begin(), outputs.end());
-  std::sort(outputs_sorted.begin(), outputs_sorted.end());
-  std::vector<string> tn_sorted(target_nodes.begin(), target_nodes.end());
-  std::sort(tn_sorted.begin(), tn_sorted.end());
-
-  const string sorted_key = strings::StrCat(
-      str_util::Join(inputs_sorted, ","), "->",
-      str_util::Join(outputs_sorted, ","), "/", str_util::Join(tn_sorted, ","),
-      "/", run_state_args->is_partial_run, "/", debug_tensor_watches_summary);
-  // Set the handle, if its needed to log memory or for partial run.
-  if (handle_name_counter_value >= 0) {
-    run_state_args->handle =
-        strings::StrCat(sorted_key, ";", handle_name_counter_value);
-  }
-
-  // See if we already have the executors for this run.
-  {
-    mutex_lock l(executor_lock_);
-    auto it = executors_.find(sorted_key);
-    if (it != executors_.end()) {
-      *executors_and_keys = it->second.get();
-      // Insert this under the original key.
-      executors_.emplace(key, it->second);
-      return Status::OK();
-    }
-  }
-
-  // Nothing found, so create the executors and store in the cache.
   BuildGraphOptions options;
-  options.feed_endpoints = inputs_sorted;
-  options.fetch_endpoints = outputs_sorted;
-  options.target_nodes = tn_sorted;
+  options.callable_options = callable_options;
   options.use_function_convention = !run_state_args->is_partial_run;
-  if (!run_state_args->debug_options.debug_tensor_watch_opts().empty()) {
-    options.debug_options = run_state_args->debug_options;
-  }
 
   std::unique_ptr<FunctionInfo> func_info(new FunctionInfo);
-  std::shared_ptr<ExecutorsAndKeys> ek(new ExecutorsAndKeys);
+  std::unique_ptr<ExecutorsAndKeys> ek(new ExecutorsAndKeys);
+
+  ek->callable_options = callable_options;
 
-  // The executor_lock_ is intentionally released while executor is
-  // being created.
   std::unordered_map<string, std::unique_ptr<Graph>> graphs;
   TF_RETURN_IF_ERROR(CreateGraphs(options, &graphs, &func_info->flib_def,
                                   run_state_args, &ek->input_types,
@@ -1155,11 +1099,11 @@ Status DirectSession::GetOrCreateExecutors(
   if (run_state_args->is_partial_run) {
     ek->graph = std::move(run_state_args->graph);
     std::unordered_set<StringPiece, StringPieceHasher> names;
-    for (const string& input : inputs) {
+    for (const string& input : callable_options.feed()) {
       TensorId id(ParseTensorName(input));
       names.emplace(id.first);
     }
-    for (const string& output : outputs) {
+    for (const string& output : callable_options.fetch()) {
       TensorId id(ParseTensorName(output));
       names.emplace(id.first);
     }
@@ -1181,7 +1125,7 @@ Status DirectSession::GetOrCreateExecutors(
   }
   func_info->proc_flr.reset(new ProcessFunctionLibraryRuntime(
       device_mgr_.get(), options_.env, graph_def_version,
-      func_info->flib_def.get(), optimizer_opts));
+      func_info->flib_def.get(), optimizer_opts, thread_pools_[0].first));
 
   GraphOptimizer optimizer(optimizer_opts);
   for (auto iter = graphs.begin(); iter != graphs.end(); ++iter) {
@@ -1236,9 +1180,11 @@ Status DirectSession::GetOrCreateExecutors(
                        /*shape_map=*/nullptr);
 
     // EXPERIMENTAL: tfdbg inserts debug nodes in the graph.
-    if (!options.debug_options.debug_tensor_watch_opts().empty()) {
+    const DebugOptions& debug_options =
+        options.callable_options.run_options().debug_options();
+    if (!debug_options.debug_tensor_watch_opts().empty()) {
       TF_RETURN_IF_ERROR(DecorateAndPublishGraphForDebug(
-          options.debug_options, partition_graph.get(), params.device));
+          debug_options, partition_graph.get(), params.device));
     }
 
     TF_RETURN_IF_ERROR(EnsureMemoryTypes(DeviceType(device->device_type()),
@@ -1260,12 +1206,12 @@ Status DirectSession::GetOrCreateExecutors(
     // For regular `Run()`, we use the function calling convention, and so
     // maintain a mapping from input/output names to
     // argument/return-value ordinal index.
-    for (size_t i = 0; i < inputs_sorted.size(); ++i) {
-      const string& input = inputs_sorted[i];
+    for (int i = 0; i < callable_options.feed().size(); ++i) {
+      const string& input = callable_options.feed(i);
       ek->input_name_to_index[input] = i;
     }
-    for (size_t i = 0; i < outputs_sorted.size(); ++i) {
-      const string& output = outputs_sorted[i];
+    for (int i = 0; i < callable_options.fetch().size(); ++i) {
+      const string& output = callable_options.fetch(i);
       ek->output_name_to_index[output] = i;
     }
   } else {
@@ -1274,26 +1220,123 @@ Status DirectSession::GetOrCreateExecutors(
     //
     // We always use the first device as the device name portion of the
     // key, even if we're feeding another graph.
-    for (size_t i = 0; i < inputs_sorted.size(); ++i) {
-      const string& input = inputs_sorted[i];
+    for (int i = 0; i < callable_options.feed().size(); ++i) {
+      const string& input = callable_options.feed(i);
       ek->input_name_to_rendezvous_key[input] = GetRendezvousKey(
           input, device_set_.client_device()->attributes(), FrameAndIter(0, 0));
     }
-    for (size_t i = 0; i < outputs_sorted.size(); ++i) {
-      const string& output = outputs_sorted[i];
+    for (int i = 0; i < callable_options.fetch().size(); ++i) {
+      const string& output = callable_options.fetch(i);
       ek->output_name_to_rendezvous_key[output] =
           GetRendezvousKey(output, device_set_.client_device()->attributes(),
                            FrameAndIter(0, 0));
     }
   }
 
+  *out_executors_and_keys = std::move(ek);
+  *out_func_info = std::move(func_info);
+  return Status::OK();
+}
+
+Status DirectSession::GetOrCreateExecutors(
+    gtl::ArraySlice<string> inputs, gtl::ArraySlice<string> outputs,
+    gtl::ArraySlice<string> target_nodes, ExecutorsAndKeys** executors_and_keys,
+    RunStateArgs* run_state_args) {
+  int64 handle_name_counter_value = -1;
+  if (LogMemory::IsEnabled() || run_state_args->is_partial_run) {
+    handle_name_counter_value = handle_name_counter_.fetch_add(1);
+  }
+
+  string debug_tensor_watches_summary;
+  if (!run_state_args->debug_options.debug_tensor_watch_opts().empty()) {
+    debug_tensor_watches_summary = SummarizeDebugTensorWatches(
+        run_state_args->debug_options.debug_tensor_watch_opts());
+  }
+
+  // Fast lookup path, no sorting.
+  const string key = strings::StrCat(
+      str_util::Join(inputs, ","), "->", str_util::Join(outputs, ","), "/",
+      str_util::Join(target_nodes, ","), "/", run_state_args->is_partial_run,
+      "/", debug_tensor_watches_summary);
+  // Set the handle, if it's needed to log memory or for partial run.
+  if (handle_name_counter_value >= 0) {
+    run_state_args->handle =
+        strings::StrCat(key, ";", handle_name_counter_value);
+  }
+
+  // See if we already have the executors for this run.
+  {
+    mutex_lock l(executor_lock_);  // could use reader lock
+    auto it = executors_.find(key);
+    if (it != executors_.end()) {
+      *executors_and_keys = it->second.get();
+      return Status::OK();
+    }
+  }
+
+  // Slow lookup path, the unsorted key missed the cache.
+  // Sort the inputs and outputs, and look up with the sorted key in case an
+  // earlier call used a different order of inputs and outputs.
+  //
+  // We could consider some other signature instead of sorting that
+  // preserves the same property to avoid the sort in the future.
+  std::vector<string> inputs_sorted(inputs.begin(), inputs.end());
+  std::sort(inputs_sorted.begin(), inputs_sorted.end());
+  std::vector<string> outputs_sorted(outputs.begin(), outputs.end());
+  std::sort(outputs_sorted.begin(), outputs_sorted.end());
+  std::vector<string> tn_sorted(target_nodes.begin(), target_nodes.end());
+  std::sort(tn_sorted.begin(), tn_sorted.end());
+
+  const string sorted_key = strings::StrCat(
+      str_util::Join(inputs_sorted, ","), "->",
+      str_util::Join(outputs_sorted, ","), "/", str_util::Join(tn_sorted, ","),
+      "/", run_state_args->is_partial_run, "/", debug_tensor_watches_summary);
+  // Set the handle, if it's needed to log memory or for partial run.
+  if (handle_name_counter_value >= 0) {
+    run_state_args->handle =
+        strings::StrCat(sorted_key, ";", handle_name_counter_value);
+  }
+
+  // See if we already have the executors for this run.
+  {
+    mutex_lock l(executor_lock_);
+    auto it = executors_.find(sorted_key);
+    if (it != executors_.end()) {
+      *executors_and_keys = it->second.get();
+      // Insert this under the original key.
+      executors_.emplace(key, it->second);
+      return Status::OK();
+    }
+  }
+
+  // Nothing found, so create the executors and store in the cache.
+  // The executor_lock_ is intentionally released while executors are
+  // being created.
+  CallableOptions callable_options;
+  for (const string& input : inputs_sorted) {
+    callable_options.add_feed(input);
+  }
+  for (const string& output : outputs_sorted) {
+    callable_options.add_fetch(output);
+  }
+  for (const string& target : tn_sorted) {
+    callable_options.add_target(target);
+  }
+  *callable_options.mutable_run_options()->mutable_debug_options() =
+      run_state_args->debug_options;
+  std::unique_ptr<ExecutorsAndKeys> ek;
+  std::unique_ptr<FunctionInfo> func_info;
+  TF_RETURN_IF_ERROR(
+      CreateExecutors(callable_options, &ek, &func_info, run_state_args));
+
   // Reacquire the lock, try to insert into the map.
   mutex_lock l(executor_lock_);
   functions_.push_back(std::move(func_info));
 
   // Another thread may have created the entry before us, in which case we will
   // reuse the already created one.
-  auto insert_result = executors_.emplace(sorted_key, ek);
+  auto insert_result = executors_.emplace(
+      sorted_key, std::shared_ptr<ExecutorsAndKeys>(std::move(ek)));
   // Insert the value under the original key, so the fast path lookup will work
   // if the user uses the same order of inputs, outputs, and targets again.
   executors_.emplace(key, insert_result.first->second);
@@ -1332,19 +1375,19 @@ Status DirectSession::CreateGraphs(
         execution_state->BuildGraph(subgraph_options, &client_graph));
   }
 
-  if (subgraph_options.feed_endpoints.size() !=
+  if (subgraph_options.callable_options.feed_size() !=
       client_graph->feed_types.size()) {
     return errors::Internal(
         "Graph pruning failed: requested number of feed endpoints = ",
-        subgraph_options.feed_endpoints.size(),
+        subgraph_options.callable_options.feed_size(),
         " versus number of pruned feed endpoints = ",
         client_graph->feed_types.size());
   }
-  if (subgraph_options.fetch_endpoints.size() !=
+  if (subgraph_options.callable_options.fetch_size() !=
       client_graph->fetch_types.size()) {
     return errors::Internal(
         "Graph pruning failed: requested number of fetch endpoints = ",
-        subgraph_options.fetch_endpoints.size(),
+        subgraph_options.callable_options.fetch_size(),
         " versus number of pruned fetch endpoints = ",
         client_graph->fetch_types.size());
   }
@@ -1562,4 +1605,156 @@ void DirectSession::WaitForNotification(RunState* run_state,
   return Status::OK();
 }
 
+Status DirectSession::MakeCallable(const CallableOptions& callable_options,
+                                   CallableHandle* out_handle) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  TF_RETURN_IF_ERROR(CheckGraphCreated("MakeCallable()"));
+
+  if (!callable_options.run_options()
+           .debug_options()
+           .debug_tensor_watch_opts()
+           .empty()) {
+    return errors::Unimplemented(
+        "Debug options are not currently supported via the C++ MakeCallable "
+        "interface.");
+  }
+
+  std::unique_ptr<ExecutorsAndKeys> ek;
+  std::unique_ptr<FunctionInfo> func_info;
+  RunStateArgs run_state_args(callable_options.run_options().debug_options());
+  TF_RETURN_IF_ERROR(
+      CreateExecutors(callable_options, &ek, &func_info, &run_state_args));
+  {
+    mutex_lock l(callables_lock_);
+    *out_handle = next_callable_handle_++;
+    callables_[*out_handle] = {std::move(ek), std::move(func_info)};
+  }
+  return Status::OK();
+}
+
+class DirectSession::RunCallableCallFrame : public CallFrameInterface {
+ public:
+  RunCallableCallFrame(DirectSession* session,
+                       ExecutorsAndKeys* executors_and_keys,
+                       const std::vector<Tensor>* feed_tensors,
+                       std::vector<Tensor>* fetch_tensors)
+      : session_(session),
+        executors_and_keys_(executors_and_keys),
+        feed_tensors_(feed_tensors),
+        fetch_tensors_(fetch_tensors) {}
+
+  size_t num_args() const override {
+    return executors_and_keys_->input_types.size();
+  }
+  size_t num_retvals() const override {
+    return executors_and_keys_->output_types.size();
+  }
+
+  Status GetArg(int index, Tensor* val) const override {
+    if (index >= feed_tensors_->size()) {
+      return errors::Internal("Args index out of bounds: ", index);
+    } else if (executors_and_keys_->input_types[index] == DT_RESOURCE) {
+      TF_RETURN_IF_ERROR(
+          session_->ResourceHandleToInputTensor((*feed_tensors_)[index], val));
+    } else {
+      *val = (*feed_tensors_)[index];
+    }
+    return Status::OK();
+  }
+
+  Status SetRetval(int index, const Tensor& val) override {
+    if (index >= fetch_tensors_->size()) {
+      return errors::Internal("RetVal index out of bounds: ", index);
+    }
+    (*fetch_tensors_)[index] = val;
+    return Status::OK();
+  }
+
+ private:
+  DirectSession* const session_;                   // Not owned.
+  ExecutorsAndKeys* const executors_and_keys_;     // Not owned.
+  const std::vector<Tensor>* const feed_tensors_;  // Not owned.
+  std::vector<Tensor>* const fetch_tensors_;       // Not owned.
+};
+
+::tensorflow::Status DirectSession::RunCallable(
+    CallableHandle handle, const std::vector<Tensor>& feed_tensors,
+    std::vector<Tensor>* fetch_tensors, RunMetadata* run_metadata) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  TF_RETURN_IF_ERROR(CheckGraphCreated("RunCallable()"));
+  direct_session_runs->GetCell()->IncrementBy(1);
+
+  // Look up the executors and keys registered for this callable handle.
+  std::shared_ptr<ExecutorsAndKeys> executors_and_keys;
+  const int64 step_id = step_id_counter_.fetch_add(1);
+
+  {
+    tf_shared_lock l(callables_lock_);
+    if (handle >= next_callable_handle_) {
+      return errors::InvalidArgument("No such callable handle: ", handle);
+    }
+    executors_and_keys = callables_[handle].executors_and_keys;
+  }
+
+  if (!executors_and_keys) {
+    return errors::InvalidArgument(
+        "Attempted to run callable after handle was released: ", handle);
+  }
+
+  // NOTE(mrry): Debug options are not currently supported in the
+  // callable interface.
+  DebugOptions debug_options;
+  RunStateArgs run_state_args(debug_options);
+
+  // Configure a call frame for the step, which we use to feed and
+  // fetch values to and from the executors.
+  if (feed_tensors.size() != executors_and_keys->input_types.size()) {
+    return errors::InvalidArgument(
+        "Expected ", executors_and_keys->input_types.size(),
+        " feed tensors, but got ", feed_tensors.size());
+  }
+  if (fetch_tensors != nullptr) {
+    fetch_tensors->resize(executors_and_keys->output_types.size());
+  } else if (!executors_and_keys->output_types.empty()) {
+    return errors::InvalidArgument(
+        "`fetch_tensors` must be provided when the callable has one or more "
+        "outputs.");
+  }
+
+  // A specialized CallFrame implementation that takes advantage of the
+  // optimized RunCallable interface.
+
+  RunCallableCallFrame call_frame(this, executors_and_keys.get(), &feed_tensors,
+                                  fetch_tensors);
+
+  if (LogMemory::IsEnabled()) {
+    LogMemory::RecordStep(step_id, run_state_args.handle);
+  }
+
+  TF_RETURN_IF_ERROR(
+      RunInternal(step_id, executors_and_keys->callable_options.run_options(),
+                  &call_frame, executors_and_keys.get(), run_metadata));
+
+  return Status::OK();
+}
+
+::tensorflow::Status DirectSession::ReleaseCallable(CallableHandle handle) {
+  mutex_lock l(callables_lock_);
+  if (handle >= next_callable_handle_) {
+    return errors::InvalidArgument("No such callable handle: ", handle);
+  }
+  callables_.erase(handle);
+  return Status::OK();
+}
+
+DirectSession::Callable::~Callable() {
+  // We must delete the fields in this order, because the destructor
+  // of `executors_and_keys` will call into an object owned by
+  // `function_info` (in particular, when deleting a kernel, it relies
+  // on the `FunctionLibraryRuntime` to know if the kernel is stateful
+  // or not).
+  executors_and_keys.reset();
+  function_info.reset();
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/direct_session.h b/tensorflow/core/common_runtime/direct_session.h
index 45d765f8498e5e12eef3a47cd4a7ff0ad22aa495..6f9c1b980b30213373d82ded639fc2e614b3752d 100644
--- a/tensorflow/core/common_runtime/direct_session.h
+++ b/tensorflow/core/common_runtime/direct_session.h
@@ -107,6 +107,14 @@ class DirectSession : public Session {
     cost_model_manager_.ExportCostModels(cost_models);
   }
 
+  ::tensorflow::Status MakeCallable(const CallableOptions& callable_options,
+                                    CallableHandle* out_handle) override;
+  ::tensorflow::Status RunCallable(CallableHandle handle,
+                                   const std::vector<Tensor>& feed_tensors,
+                                   std::vector<Tensor>* fetch_tensors,
+                                   RunMetadata* run_metadata) override;
+  ::tensorflow::Status ReleaseCallable(CallableHandle handle) override;
+
  private:
   // We create one executor and its dependent library runtime for
   // every partition.
@@ -139,6 +147,8 @@ class DirectSession : public Session {
 
     DataTypeVector input_types;
     DataTypeVector output_types;
+
+    CallableOptions callable_options;
   };
 
   // A FunctionInfo object is created for every unique set of feeds/fetches.
@@ -206,6 +216,14 @@ class DirectSession : public Session {
       gtl::ArraySlice<string> target_nodes,
       ExecutorsAndKeys** executors_and_keys, RunStateArgs* run_state_args);
 
+  // Creates a set of executors to run the subgraph defined by
+  // `callable_options`.
+  ::tensorflow::Status CreateExecutors(
+      const CallableOptions& callable_options,
+      std::unique_ptr<ExecutorsAndKeys>* out_executors_and_keys,
+      std::unique_ptr<FunctionInfo>* out_func_info,
+      RunStateArgs* run_state_args);
+
   // Creates several graphs given the existing graph_def_ and the
   // input feeds and fetches, given 'devices'. The graphs share a common
   // function library 'flib_def'.
@@ -216,6 +234,11 @@ class DirectSession : public Session {
       RunStateArgs* run_state_args, DataTypeVector* input_types,
       DataTypeVector* output_types);
 
+  ::tensorflow::Status RunInternal(int64 step_id, const RunOptions& run_options,
+                                   CallFrameInterface* call_frame,
+                                   ExecutorsAndKeys* executors_and_keys,
+                                   RunMetadata* run_metadata);
+
   ::tensorflow::Status ExtendLocked(const GraphDef& graph)
       EXCLUSIVE_LOCKS_REQUIRED(graph_def_lock_);
 
@@ -257,11 +280,18 @@ class DirectSession : public Session {
     return ::tensorflow::Status::OK();
   }
 
+  ::tensorflow::Status CheckGraphCreated(const char* method) {
+    mutex_lock l(graph_def_lock_);
+    if (!graph_created_) {
+      return errors::InvalidArgument(
+          "Session was not created with a graph before ", method, "!");
+    }
+    return ::tensorflow::Status::OK();
+  }
+
   ::tensorflow::Status CreateDebuggerState(
-      const DebugOptions& debug_options, int64 session_run_index,
-      int64 executor_step_index, const std::vector<string>& input_names,
-      const std::vector<string>& output_names,
-      const std::vector<string>& target_names,
+      const CallableOptions& options, int64 global_step,
+      int64 session_run_index, int64 executor_step_index,
       std::unique_ptr<DebuggerStateInterface>* debugger_state);
 
   ::tensorflow::Status DecorateAndPublishGraphForDebug(
@@ -303,6 +333,16 @@ class DirectSession : public Session {
   std::unordered_map<string, std::shared_ptr<ExecutorsAndKeys>> executors_
       GUARDED_BY(executor_lock_);
 
+  class RunCallableCallFrame;
+  struct Callable {
+    std::shared_ptr<ExecutorsAndKeys> executors_and_keys;
+    std::shared_ptr<FunctionInfo> function_info;
+    ~Callable();
+  };
+  mutex callables_lock_;
+  int64 next_callable_handle_ GUARDED_BY(callables_lock_) = 0;
+  std::unordered_map<int64, Callable> callables_ GUARDED_BY(callables_lock_);
+
   // Holds mappings from handle to partial run state.
   std::unordered_map<string, std::unique_ptr<RunState>> partial_runs_
       GUARDED_BY(executor_lock_);
diff --git a/tensorflow/core/common_runtime/direct_session_test.cc b/tensorflow/core/common_runtime/direct_session_test.cc
index b75a4f76d94f704cf38a6c4657b6089a863c085f..ee3896061858bd65d03171b97cae0ec850f82ad9 100644
--- a/tensorflow/core/common_runtime/direct_session_test.cc
+++ b/tensorflow/core/common_runtime/direct_session_test.cc
@@ -23,6 +23,7 @@ limitations under the License.
 
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/common_runtime/function_testlib.h"
 #include "tensorflow/core/framework/allocator.h"
 #include "tensorflow/core/framework/graph.pb.h"
 #include "tensorflow/core/framework/op_kernel.h"
@@ -48,6 +49,22 @@ limitations under the License.
 namespace tensorflow {
 namespace {
 
+CallableOptions MakeCallableOptions(gtl::ArraySlice<string> feeds,
+                                    gtl::ArraySlice<string> fetches,
+                                    gtl::ArraySlice<string> targets) {
+  CallableOptions ret;
+  for (const string& feed : feeds) {
+    ret.add_feed(feed);
+  }
+  for (const string& fetch : fetches) {
+    ret.add_fetch(fetch);
+  }
+  for (const string& target : targets) {
+    ret.add_target(target);
+  }
+  return ret;
+}
+
 std::unique_ptr<Session> CreateSession() {
   SessionOptions options;
   (*options.config.mutable_device_count())["CPU"] = 2;
@@ -110,6 +127,53 @@ TEST_F(DirectSessionMinusAXTest, RunSimpleNetwork) {
   EXPECT_FLOAT_EQ(5.0, mat(0, 0));
 }
 
+TEST_F(DirectSessionMinusAXTest, RunSimpleNetwork_Callable) {
+  Initialize({3, 2, -1, 0});
+  auto session = CreateSession();
+  ASSERT_TRUE(session != nullptr);
+  TF_ASSERT_OK(session->Create(def_));
+  std::vector<std::pair<string, Tensor>> inputs;
+
+  // Run the test twice to ensure that the Make/Run/Release cycle is hermetic.
+  for (int i = 0; i < 2; ++i) {
+    // Request two targets: one fetch output and one non-fetched output.
+    Session::CallableHandle handle;
+    TF_ASSERT_OK(session->MakeCallable(
+        MakeCallableOptions({}, {y_ + ":0"}, {y_neg_}), &handle));
+
+    for (int j = 0; j < 2; ++j) {
+      std::vector<Tensor> outputs;
+      TF_ASSERT_OK(session->RunCallable(handle, {}, &outputs, nullptr));
+
+      ASSERT_EQ(1, outputs.size());
+      // The first output should be initialized and have the correct
+      // output.
+      auto mat = outputs[0].matrix<float>();
+      ASSERT_TRUE(outputs[0].IsInitialized());
+      EXPECT_FLOAT_EQ(5.0, mat(0, 0));
+    }
+
+    Status s = session->RunCallable(handle, {}, nullptr, nullptr);
+    EXPECT_TRUE(errors::IsInvalidArgument(s));
+    EXPECT_TRUE(StringPiece(s.error_message())
+                    .contains("`fetch_tensors` must be provided"));
+
+    TF_ASSERT_OK(session->ReleaseCallable(handle));
+
+    std::vector<Tensor> outputs;
+    s = session->RunCallable(handle, {}, &outputs, nullptr);
+    EXPECT_TRUE(errors::IsInvalidArgument(s));
+    EXPECT_TRUE(
+        StringPiece(s.error_message())
+            .contains("Attempted to run callable after handle was released"));
+
+    s = session->RunCallable(handle + 1, {}, &outputs, nullptr);
+    EXPECT_TRUE(errors::IsInvalidArgument(s));
+    EXPECT_TRUE(
+        StringPiece(s.error_message()).contains("No such callable handle"));
+  }
+}
+
 TEST_F(DirectSessionMinusAXTest, TestFeed) {
   Initialize({1, 2, 3, 4});
   auto session = CreateSession();
@@ -139,6 +203,39 @@ TEST_F(DirectSessionMinusAXTest, TestFeed) {
   EXPECT_FLOAT_EQ(39.0, mat(1, 0));
 }
 
+TEST_F(DirectSessionMinusAXTest, TestFeed_Callable) {
+  Initialize({1, 2, 3, 4});
+  auto session = CreateSession();
+  ASSERT_TRUE(session != nullptr);
+
+  TF_ASSERT_OK(session->Create(def_));
+
+  // Fill in the input and ask for the output
+  //
+  // Note that the input being fed is on the second device.
+  CallableOptions callable_options;
+  callable_options.add_feed(x_);
+  callable_options.add_fetch(y_ + ":0");
+  Session::CallableHandle handle;
+  TF_ASSERT_OK(session->MakeCallable(MakeCallableOptions({x_}, {y_ + ":0"}, {}),
+                                     &handle));
+  Tensor t(DT_FLOAT, TensorShape({2, 1}));
+  t.matrix<float>()(0, 0) = 5;
+  t.matrix<float>()(1, 0) = 6;
+  std::vector<Tensor> inputs = {t};
+  std::vector<Tensor> outputs;
+
+  // Run the callable
+  TF_ASSERT_OK(session->RunCallable(handle, inputs, &outputs, nullptr));
+
+  ASSERT_EQ(1, outputs.size());
+  auto mat = outputs[0].matrix<float>();
+
+  // Expect outputs to be: 1*5 + 2*6, 3*5 + 4*6
+  EXPECT_FLOAT_EQ(17.0, mat(0, 0));
+  EXPECT_FLOAT_EQ(39.0, mat(1, 0));
+}
+
 TEST_F(DirectSessionMinusAXTest, TestConcurrency) {
   Initialize({1, 2, 3, 4});
   auto session = CreateSession();
@@ -171,6 +268,39 @@ TEST_F(DirectSessionMinusAXTest, TestConcurrency) {
   delete tp;
 }
 
+TEST_F(DirectSessionMinusAXTest, TestConcurrency_Callable) {
+  Initialize({1, 2, 3, 4});
+  auto session = CreateSession();
+  ASSERT_TRUE(session != nullptr);
+  TF_ASSERT_OK(session->Create(def_));
+
+  // Create a thread pool on which to run the callable concurrently.
+  thread::ThreadPool* tp = new thread::ThreadPool(Env::Default(), "test", 4);
+
+  Session::CallableHandle handle;
+  TF_ASSERT_OK(
+      session->MakeCallable(MakeCallableOptions({}, {y_ + ":0"}, {}), &handle));
+
+  // Run the callable 1000 times in 4 different threads concurrently.
+  auto fn = [&session, handle]() {
+    for (int i = 0; i < 1000; ++i) {
+      std::vector<Tensor> outputs;
+      // Run the graph
+      TF_ASSERT_OK(session->RunCallable(handle, {}, &outputs, nullptr));
+      ASSERT_EQ(1, outputs.size());
+      auto mat = outputs[0].matrix<float>();
+      EXPECT_FLOAT_EQ(3.0, mat(0, 0));
+    }
+  };
+
+  for (int i = 0; i < 4; ++i) {
+    tp->Schedule(fn);
+  }
+
+  // Wait for the functions to finish.
+  delete tp;
+}
+
 TEST_F(DirectSessionMinusAXTest, TestPerSessionThreads) {
   Initialize({1, 2, 3, 4});
 
@@ -296,6 +426,38 @@ TEST_F(DirectSessionMinusAXTest, RunSimpleNetworkWithOpts) {
   EXPECT_EQ(run_metadata.step_stats().dev_stats_size(), 2);
 }
 
+TEST_F(DirectSessionMinusAXTest, RunSimpleNetworkWithOpts_Callable) {
+  Initialize({3, 2, -1, 0});
+  auto session = CreateSession();
+  ASSERT_TRUE(session != nullptr);
+  TF_ASSERT_OK(session->Create(def_));
+
+  // Request two targets: one fetch output and one non-fetched output.
+  Session::CallableHandle handle;
+  CallableOptions callable_options =
+      MakeCallableOptions({}, {y_ + ":0"}, {y_neg_});
+  callable_options.mutable_run_options()->set_trace_level(
+      RunOptions::FULL_TRACE);
+  TF_ASSERT_OK(session->MakeCallable(callable_options, &handle));
+
+  RunMetadata run_metadata;
+  EXPECT_EQ(run_metadata.step_stats().dev_stats_size(), 0);
+
+  std::vector<Tensor> outputs;
+  TF_ASSERT_OK(session->RunCallable(handle, {}, &outputs, &run_metadata));
+
+  ASSERT_EQ(1, outputs.size());
+  // The first output should be initialized and have the correct
+  // output.
+  auto mat = outputs[0].matrix<float>();
+  ASSERT_TRUE(outputs[0].IsInitialized());
+  EXPECT_FLOAT_EQ(5.0, mat(0, 0));
+
+  // Checks RunMetadata is well-formed
+  ASSERT_TRUE(run_metadata.has_step_stats());
+  EXPECT_EQ(run_metadata.step_stats().dev_stats_size(), 2);
+}
+
 TEST(DirectSessionTest, KeepsStateAcrossRunsOfSession) {
   GraphDef def;
   Graph g(OpRegistry::Global());
@@ -408,6 +570,89 @@ TEST(DirectSessionTest, MultipleFeedTest) {
   EXPECT_TRUE(StringPiece(s.error_message()).contains("fed more than once"));
 }
 
+TEST(DirectSessionTest, MultipleFeedTest_Callable) {
+  GraphDef def;
+  Graph g(OpRegistry::Global());
+
+  Tensor first_value(DT_FLOAT, TensorShape({}));
+  first_value.scalar<float>()() = 1.0;
+  Node* first_const = test::graph::Constant(&g, first_value);
+  Node* first_identity = test::graph::Identity(&g, first_const);
+
+  Tensor second_value(DT_FLOAT, TensorShape({}));
+  second_value.scalar<float>()() = 2.0;
+  Node* second_const = test::graph::Constant(&g, second_value);
+  Node* second_identity = test::graph::Identity(&g, second_const);
+
+  test::graph::ToGraphDef(&g, &def);
+
+  auto session = CreateSession();
+  ASSERT_TRUE(session != nullptr);
+  TF_ASSERT_OK(session->Create(def));
+
+  Session::CallableHandle handle;
+  std::vector<Tensor> outputs;
+
+  // Fetch without feeding.
+  TF_ASSERT_OK(session->MakeCallable(
+      MakeCallableOptions(
+          {}, {first_identity->name() + ":0", second_identity->name() + ":0"},
+          {}),
+      &handle));
+  TF_ASSERT_OK(session->RunCallable(handle, {}, &outputs, nullptr));
+  ASSERT_EQ(2, outputs.size());
+  ASSERT_EQ(1.0, outputs[0].flat<float>()(0));
+  ASSERT_EQ(2.0, outputs[1].flat<float>()(0));
+
+  TF_ASSERT_OK(session->MakeCallable(
+      MakeCallableOptions(
+          {}, {second_identity->name() + ":0", first_identity->name() + ":0"},
+          {}),
+      &handle));
+  TF_ASSERT_OK(session->RunCallable(handle, {}, &outputs, nullptr));
+  ASSERT_EQ(2, outputs.size());
+  ASSERT_EQ(2.0, outputs[0].flat<float>()(0));
+  ASSERT_EQ(1.0, outputs[1].flat<float>()(0));
+
+  Tensor value_11(DT_FLOAT, TensorShape({}));
+  value_11.scalar<float>()() = 11.0;
+  Tensor value_22(DT_FLOAT, TensorShape({}));
+  value_22.scalar<float>()() = 22.0;
+
+  // Feed [first_const, second_const]
+  TF_ASSERT_OK(session->MakeCallable(
+      MakeCallableOptions(
+          {first_const->name(), second_const->name()},
+          {first_identity->name() + ":0", second_identity->name() + ":0"}, {}),
+      &handle));
+  TF_ASSERT_OK(
+      session->RunCallable(handle, {value_11, value_22}, &outputs, nullptr));
+  ASSERT_EQ(2, outputs.size());
+  ASSERT_EQ(11.0, outputs[0].flat<float>()(0));
+  ASSERT_EQ(22.0, outputs[1].flat<float>()(0));
+
+  // Feed [second_const, first_const]
+  TF_ASSERT_OK(session->MakeCallable(
+      MakeCallableOptions(
+          {second_const->name(), first_const->name()},
+          {first_identity->name() + ":0", second_identity->name() + ":0"}, {}),
+      &handle));
+  TF_ASSERT_OK(
+      session->RunCallable(handle, {value_22, value_11}, &outputs, nullptr));
+  ASSERT_EQ(2, outputs.size());
+  ASSERT_EQ(11.0, outputs[0].flat<float>()(0));
+  ASSERT_EQ(22.0, outputs[1].flat<float>()(0));
+
+  // Feed [first_const, first_const]
+  Status s = session->MakeCallable(
+      MakeCallableOptions(
+          {first_const->name(), first_const->name()},
+          {first_identity->name() + ":0", second_identity->name() + ":0"}, {}),
+      &handle);
+  EXPECT_TRUE(errors::IsInvalidArgument(s));
+  EXPECT_TRUE(StringPiece(s.error_message()).contains("fed more than once"));
+}
+
 TEST(DirectSessionTest, FetchMultipleTimes) {
   Graph g(OpRegistry::Global());
   Tensor seven_tensor(DT_INT32, TensorShape());
@@ -694,6 +939,59 @@ TEST(DirectSessionTest, RunHandleTest) {
   ASSERT_TRUE(s.ok());
 }
 
+TEST(DirectSessionTest, RunHandleTest_Callable) {
+  GraphDef def;
+  Graph g(OpRegistry::Global());
+
+  Tensor value0(DT_FLOAT, TensorShape({}));
+  value0.scalar<float>()() = 1.0;
+  Node* const0 = test::graph::Constant(&g, value0);
+  Node* identity0 = test::graph::Identity(&g, const0);
+
+  Tensor value1(DT_FLOAT, TensorShape({}));
+  value1.scalar<float>()() = 2.0;
+  Node* const1 = test::graph::Constant(&g, value1);
+  Node* node3 = test::graph::Add(&g, identity0, const1);
+  Node* node4 = test::graph::Unary(&g, "GetSessionHandleV2", node3);
+
+  Tensor value2(DT_STRING, TensorShape({}));
+  Node* const2 = test::graph::Constant(&g, value2);
+  Node* node5 = test::graph::GetSessionTensor(&g, const2);
+  Node* node6 = test::graph::Add(&g, node5, const1);
+
+  Node* node7 = test::graph::Unary(&g, "DeleteSessionTensor", const2);
+
+  test::graph::ToGraphDef(&g, &def);
+
+  auto session = CreateSession();
+  ASSERT_TRUE(session != nullptr);
+  TF_ASSERT_OK(session->Create(def));
+
+  // First run call: Create a handle.
+  std::vector<Tensor> outputs;
+  Status s = session->Run({}, {node4->name() + ":0"}, {}, &outputs);
+  ASSERT_TRUE(s.ok());
+  ASSERT_EQ(1, outputs.size());
+
+  ResourceHandle resource_handle = outputs[0].scalar<ResourceHandle>()();
+  Tensor string_handle(DT_STRING, {});
+  string_handle.flat<string>().setConstant(resource_handle.name());
+
+  // Second run call: Use a handle.
+  std::vector<Tensor> outputs1;
+  s = session->Run({{const2->name(), string_handle}}, {node6->name() + ":0"},
+                   {}, &outputs1);
+  ASSERT_TRUE(s.ok());
+  ASSERT_EQ(1, outputs1.size());
+  ASSERT_EQ(5.0, outputs1[0].flat<float>()(0));
+
+  // Third run call: Delete a handle.
+  std::vector<Tensor> outputs2;
+  s = session->Run({{const2->name(), string_handle}}, {}, {node7->name()},
+                   &outputs2);
+  ASSERT_TRUE(s.ok());
+}
+
 TEST(DirectSessionTest, CreateGraphFailsWhenAssigningAFedVar) {
   Graph graph(OpRegistry::Global());
 
@@ -868,59 +1166,14 @@ TEST(DirectSessionTest, TestTimeoutCleanShutdown) {
   TF_ASSERT_OK(session->Close());
 }
 
-class BlockingOpState {
- public:
-  void AwaitState(int awaiting_state) {
-    mutex_lock ml(mu_);
-    while (state_ != awaiting_state) {
-      cv_.wait(ml);
-    }
-  }
-  void MoveToState(int expected_current, int next) {
-    mutex_lock ml(mu_);
-    CHECK_EQ(expected_current, state_);
-    state_ = next;
-    cv_.notify_all();
-  }
-
- private:
-  mutex mu_;
-  condition_variable cv_;
-  int state_ = 0;
-};
-static BlockingOpState* blocking_op_state = nullptr;
-
-// BlockingOp blocks on the global <blocking_op_state's> state,
-// and also updates it when it is unblocked and finishing computation.
-class BlockingOp : public OpKernel {
- public:
-  explicit BlockingOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
-  void Compute(OpKernelContext* ctx) override {
-    blocking_op_state->MoveToState(0, 1);
-    blocking_op_state->AwaitState(2);
-    blocking_op_state->MoveToState(2, 3);
-
-    Tensor* out = nullptr;
-    const Tensor& in = ctx->input(0);
-    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, in.shape(), &out));
-    out->flat<float>() = in.flat<float>();
-  }
-};
-REGISTER_KERNEL_BUILDER(Name("BlockingOp").Device(DEVICE_CPU), BlockingOp);
-REGISTER_OP("BlockingOp").Input("x: float").Output("y: float").Doc("");
-
 static void TestSessionInterOpThreadsImpl(bool use_function_lib,
                                           bool use_global_pools) {
+  using test::function::blocking_op_state;
+  using test::function::BlockingOpState;
+
   FunctionDefLibrary library_graph_def;
   if (use_function_lib) {
-    const string lib = R"proto(
-        signature: {
-          name: "BlockingOpFn" input_arg: { name: "x" type: DT_FLOAT }
-                               output_arg: { name: "y" type: DT_FLOAT }}
-        node_def: { name: "y" op: "BlockingOp" input: "x" }
-        ret: { key: "y" value: "y:y:0" } )proto";
-    CHECK(protobuf::TextFormat::ParseFromString(
-        lib, library_graph_def.add_function()));
+    *library_graph_def.add_function() = test::function::BlockingOpFn();
   }
 
   FunctionLibraryDefinition flib(OpRegistry::Global(), library_graph_def);
@@ -1153,6 +1406,11 @@ TEST(DirectSessionTest, TestDirectSessionRunClose) {
   EXPECT_EQ(t.scalar<float>()(), outputs[0].scalar<float>()());
   outputs.clear();
 
+  // Make a callable handle before closing the session.
+  Session::CallableHandle handle;
+  TF_ASSERT_OK(session->MakeCallable(
+      MakeCallableOptions({}, {}, {var_assign->name()}), &handle));
+
   // Close the session.
   TF_ASSERT_OK(session->Close());
 
@@ -1160,6 +1418,10 @@ TEST(DirectSessionTest, TestDirectSessionRunClose) {
   Status s = session->Run({} /* inputs */, {},
                           {var_assign->name()} /* target_nodes */, nullptr);
   EXPECT_EQ("Cancelled: Session has been closed.", s.ToString());
+
+  // Run the assign as a callable to verify that we get the same error.
+  s = session->RunCallable(handle, {}, nullptr, nullptr);
+  EXPECT_EQ("Cancelled: Session has been closed.", s.ToString());
 }
 
 TEST(DirectSessionTest, TestDirectSessionPRunClose) {
@@ -1261,7 +1523,8 @@ TEST(DirectSessionTest, LocalDeviceManager) {
 
 // A simple benchmark for the overhead of `DirectSession::Run()` calls
 // with varying numbers of feeds/fetches.
-void FeedFetchBenchmarkHelper(int iters, int num_feeds) {
+void FeedFetchBenchmarkHelper(int iters, int num_feeds,
+                              bool use_make_callable) {
   testing::StopTiming();
 
   Tensor value(DT_FLOAT, TensorShape());
@@ -1297,29 +1560,55 @@ void FeedFetchBenchmarkHelper(int iters, int num_feeds) {
   SessionOptions opts;
   std::unique_ptr<Session> session(NewSession(opts));
   TF_CHECK_OK(session->Create(gd));
-  {
-    // NOTE(mrry): Ignore the first run, which will incur the graph
-    // partitioning/pruning overhead and skew the results.
-    //
-    // Note that we should also optimize and monitor the overhead on
-    // the first run, which will impact application startup times, but
-    // that is not the object of study in this benchmark.
-    std::vector<Tensor> output_values;
-    TF_CHECK_OK(session->Run(inputs, outputs, {}, &output_values));
-  }
-  testing::StartTiming();
-  for (int i = 0; i < iters; ++i) {
-    std::vector<Tensor> output_values;
-    TF_CHECK_OK(session->Run(inputs, outputs, {}, &output_values));
+  if (use_make_callable) {
+    Session::CallableHandle handle;
+    CallableOptions callable_options;
+    std::vector<Tensor> input_tensors;
+    for (const auto& input : inputs) {
+      callable_options.add_feed(input.first);
+      input_tensors.push_back(input.second);
+    }
+    for (const string& output : outputs) {
+      callable_options.add_fetch(output);
+    }
+    TF_CHECK_OK(session->MakeCallable(callable_options, &handle));
+
+    testing::StartTiming();
+    for (int i = 0; i < iters; ++i) {
+      std::vector<Tensor> output_values;
+      TF_CHECK_OK(
+          session->RunCallable(handle, input_tensors, &output_values, nullptr));
+    }
+    testing::StopTiming();
+  } else {
+    {
+      // NOTE(mrry): Ignore the first run, which will incur the graph
+      // partitioning/pruning overhead and skew the results.
+      //
+      // Note that we should also optimize and monitor the overhead on
+      // the first run, which will impact application startup times, but
+      // that is not the object of study in this benchmark.
+      std::vector<Tensor> output_values;
+      TF_CHECK_OK(session->Run(inputs, outputs, {}, &output_values));
+    }
+    testing::StartTiming();
+    for (int i = 0; i < iters; ++i) {
+      std::vector<Tensor> output_values;
+      TF_CHECK_OK(session->Run(inputs, outputs, {}, &output_values));
+    }
+    testing::StopTiming();
   }
-  testing::StopTiming();
 }
 
 void BM_FeedFetch(int iters, int num_feeds) {
-  FeedFetchBenchmarkHelper(iters, num_feeds);
+  FeedFetchBenchmarkHelper(iters, num_feeds, /* use_make_callable */ false);
+}
+void BM_FeedFetchCallable(int iters, int num_feeds) {
+  FeedFetchBenchmarkHelper(iters, num_feeds, /* use_make_callable */ true);
 }
 
 BENCHMARK(BM_FeedFetch)->Arg(1)->Arg(2)->Arg(5)->Arg(10);
+BENCHMARK(BM_FeedFetchCallable)->Arg(1)->Arg(2)->Arg(5)->Arg(10);
 
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/eager/BUILD b/tensorflow/core/common_runtime/eager/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..de10b10b7e314138a2caae62a26b8868537cd435
--- /dev/null
+++ b/tensorflow/core/common_runtime/eager/BUILD
@@ -0,0 +1,108 @@
+package(
+    default_visibility = [
+        "//tensorflow:internal",
+        "//tensorflow_models:__subpackages__",
+    ],
+)
+
+licenses(["notice"])  # Apache 2.0
+
+load(
+    "//tensorflow:tensorflow.bzl",
+    "tf_cc_test",
+    "tf_cuda_library",
+)
+
+tf_cuda_library(
+    name = "eager_executor",
+    srcs = [
+        "eager_executor.cc",
+    ],
+    hdrs = [
+        "eager_executor.h",
+    ],
+    visibility = ["//tensorflow:internal"],
+    deps = [
+        "//tensorflow/core:core_cpu_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:framework_internal",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:protos_all_cc",
+    ],
+)
+
+tf_cuda_library(
+    name = "context",
+    srcs = [
+        "context.cc",
+    ],
+    hdrs = [
+        "context.h",
+    ],
+    visibility = ["//tensorflow:internal"],
+    deps = [
+        ":eager_executor",
+        ":kernel_and_device",
+        "//tensorflow/core:core_cpu_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:framework_internal",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:session_options",
+    ],
+)
+
+tf_cuda_library(
+    name = "kernel_and_device",
+    srcs = [
+        "kernel_and_device.cc",
+    ],
+    hdrs = [
+        "kernel_and_device.h",
+    ],
+    visibility = ["//tensorflow:internal"],
+    deps = [
+        "//tensorflow/core:core_cpu_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:framework_internal",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:protos_all_cc",
+    ],
+)
+
+tf_cc_test(
+    name = "kernel_and_device_test",
+    srcs = ["kernel_and_device_test.cc"],
+    deps = [
+        ":kernel_and_device",
+        "//tensorflow/c/eager:runtime",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/cc:client_session",
+        "//tensorflow/cc:ops",
+        "//tensorflow/cc:scope",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+# -----------------------------------------------------------------------------
+# Google-internal targets.
+
+filegroup(
+    name = "all_files",
+    srcs = glob(
+        ["**/*"],
+        exclude = [
+            "**/METADATA",
+            "**/OWNERS",
+        ],
+    ),
+    visibility = ["//tensorflow:__subpackages__"],
+)
diff --git a/tensorflow/core/common_runtime/eager/context.cc b/tensorflow/core/common_runtime/eager/context.cc
new file mode 100644
index 0000000000000000000000000000000000000000..0566329f185e7871f020b395890578e0084f3b8f
--- /dev/null
+++ b/tensorflow/core/common_runtime/eager/context.cc
@@ -0,0 +1,153 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/eager/context.h"
+
+namespace tensorflow {
+
+ContextDevicePlacementPolicy PlacementPolicy(
+    bool soft_placement, ContextDevicePlacementPolicy original_policy) {
+  if (!soft_placement) {
+    return original_policy;
+  }
+  if (original_policy == DEVICE_PLACEMENT_EXPLICIT ||
+      original_policy == DEVICE_PLACEMENT_SILENT_FOR_INT32) {
+    return DEVICE_PLACEMENT_SILENT;
+  }
+  return original_policy;
+}
+
+EagerContext::EagerContext(const SessionOptions& opts,
+                           ContextDevicePlacementPolicy default_policy,
+                           bool async, std::unique_ptr<DeviceMgr> device_mgr,
+                           Rendezvous* rendezvous)
+    : soft_placement_(opts.config.allow_soft_placement()),
+      policy_(PlacementPolicy(soft_placement_, default_policy)),
+      device_manager_(std::move(device_mgr)),
+      devices_(device_manager_->ListDevices()),
+      rendezvous_(rendezvous),
+      pflr_(new ProcessFunctionLibraryRuntime(device_manager_.get(), opts.env,
+                                              TF_GRAPH_DEF_VERSION,
+                                              &func_lib_def_, {})),
+      log_device_placement_(opts.config.log_device_placement()),
+      async_default_(async) {
+  if (async_default_) {
+    executor_.EnableAsync();
+  }
+
+  for (auto* device : devices_) {
+    devices_map_[device->name()] = device;
+  }
+}
+
+bool EagerContext::Async() const {
+  mutex_lock l(async_map_mu_);
+  return gtl::FindWithDefault(thread_local_async_, std::this_thread::get_id(),
+                              async_default_);
+}
+
+Status EagerContext::SetAsyncForThread(bool async) {
+  {
+    tensorflow::mutex_lock l(async_map_mu_);
+    thread_local_async_[std::this_thread::get_id()] = async;
+  }
+  if (async) {
+    executor_.EnableAsync();
+  } else {
+    // TODO(agarwal): Currently we add a wait here to handle cases where a
+    // sync op has a control dependency on an async op, and the latter has not
+    // executed yet. This wait can be removed by storing all the control
+    // inputs and waiting for them when executing ops.
+    return executor_.WaitForAllPendingNodes();
+  }
+  return Status::OK();
+}
+
+void EagerContext::ClearCaches() {
+  mutex_lock ml(cache_mu_);
+  gtl::STLDeleteValues(&kernel_cache_);
+}
+
+void EagerContext::SetThreadLocalDevicePlacementPolicy(
+    ContextDevicePlacementPolicy policy) {
+  mutex_lock ml(policy_map_mu_);
+  thread_local_policies_[std::this_thread::get_id()] = policy;
+}
+
+ContextDevicePlacementPolicy EagerContext::GetDevicePlacementPolicy() {
+  mutex_lock ml(policy_map_mu_);
+  auto policy_map_it = thread_local_policies_.find(std::this_thread::get_id());
+  if (policy_map_it != thread_local_policies_.end()) {
+    return policy_map_it->second;
+  }
+  return policy_;
+}
+
+EagerContext::~EagerContext() {
+  executor_.WaitForAllPendingNodes().IgnoreError();
+  ClearCaches();
+  rendezvous_->Unref();
+}
+
+bool EagerContext::FindFunctionByName(const string& name) {
+  mutex_lock l(functions_mu_);
+  return func_lib_def_.Find(name) != nullptr;
+}
+
+Status EagerContext::FindFunctionOpData(
+    const string& name, const tensorflow::OpRegistrationData** op_data) {
+  mutex_lock l(functions_mu_);
+  return func_lib_def_.LookUp(name, op_data);
+}
+
+const FunctionDef* EagerContext::FindFunctionDef(const string& name) {
+  mutex_lock l(functions_mu_);
+  return func_lib_def_.Find(name);
+}
+
+Status EagerContext::FindDeviceByName(const string& name, Device** result) {
+  auto it = devices_map_.find(name);
+  if (it == devices_map_.end()) {
+    return errors::InvalidArgument(name, " unknown device.");
+  }
+  *result = it->second;
+  return Status::OK();
+}
+
+Status EagerContext::AddFunctionDef(const FunctionDef& fdef) {
+  mutex_lock l(functions_mu_);
+  return func_lib_def_.AddFunctionDef(fdef);
+}
+
+KernelAndDevice* EagerContext::GetCachedKernel(Fprint128 cache_key) {
+  tf_shared_lock l(cache_mu_);
+  return gtl::FindPtrOrNull(kernel_cache_, cache_key);
+}
+
+void EagerContext::AddKernelToCache(Fprint128 cache_key,
+                                    KernelAndDevice* kernel) {
+  mutex_lock ml(cache_mu_);
+  gtl::InsertOrUpdate(&kernel_cache_, cache_key, kernel);
+}
+
+void EagerContext::SetShouldStoreMetadata(bool value) {
+  should_store_metadata_.store(value);
+  if (!value) {
+    mutex_lock ml(metadata_mu_);
+    run_metadata_.Clear();
+  }
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/eager/context.h b/tensorflow/core/common_runtime/eager/context.h
new file mode 100644
index 0000000000000000000000000000000000000000..bc97219dae532f058b63515be1ec24fcedf9bf10
--- /dev/null
+++ b/tensorflow/core/common_runtime/eager/context.h
@@ -0,0 +1,198 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_EAGER_CONTEXT_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_EAGER_CONTEXT_H_
+
+#include <algorithm>
+#include <cstddef>
+#include <map>
+#include <memory>
+#include <queue>
+#include <string>
+#include <vector>
+
+#include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/common_runtime/eager/eager_executor.h"
+#include "tensorflow/core/common_runtime/eager/kernel_and_device.h"
+#include "tensorflow/core/common_runtime/function.h"
+#include "tensorflow/core/common_runtime/rendezvous_mgr.h"
+#include "tensorflow/core/framework/rendezvous.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/lib/gtl/inlined_vector.h"
+#include "tensorflow/core/lib/gtl/map_util.h"
+#include "tensorflow/core/lib/gtl/stl_util.h"
+#include "tensorflow/core/platform/fingerprint.h"
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/thread_annotations.h"
+#include "tensorflow/core/public/session_options.h"
+#include "tensorflow/core/public/version.h"
+
+namespace tensorflow {
+
+// Note: eager/c_api.h has a copy of this enum. The two must be kept in sync.
+enum ContextDevicePlacementPolicy {
+  // Running operations with input tensors on the wrong device will fail. When
+  // soft placement is enabled, acts like DEVICE_PLACEMENT_SILENT.
+  DEVICE_PLACEMENT_EXPLICIT = 0,
+  // Copy the tensor to the right device but log a warning.
+  DEVICE_PLACEMENT_WARN = 1,
+  // Silently copy the tensor, which has a performance cost since the
+  // operation will be blocked till the copy completes.
+  DEVICE_PLACEMENT_SILENT = 2,
+  // Default placement policy which silently copies int32 tensors but not other
+  // dtypes. When soft placement is enabled, acts like
+  // DEVICE_PLACEMENT_SILENT.
+  DEVICE_PLACEMENT_SILENT_FOR_INT32 = 3,
+};
+
+ContextDevicePlacementPolicy PlacementPolicy(
+    bool soft_placement, ContextDevicePlacementPolicy original_policy);
+
+class EagerContext {
+ public:
+  explicit EagerContext(const SessionOptions& opts,
+                        ContextDevicePlacementPolicy default_policy, bool async,
+                        std::unique_ptr<DeviceMgr> device_mgr,
+                        Rendezvous* rendezvous);
+
+  ~EagerContext();
+
+  // Returns the function library runtime for the given device.
+  FunctionLibraryRuntime* func_lib(Device* d) const {
+    return pflr_->GetFLR(d->name());
+  }
+
+  // True if running in asynchronous mode.
+  bool Async() const;
+
+  EagerExecutor* Executor() { return &executor_; }
+
+  // Sets whether this thread should run in synchronous or asynchronous mode.
+  Status SetAsyncForThread(bool async);
+
+  // TODO(apassos) make this return a constant reference
+  gtl::FlatMap<string, Device*, StringPieceHasher>* device_map() {
+    return &devices_map_;
+  }
+
+  // TODO(apassos) make this return a constant reference
+  std::vector<Device*>* devices() { return &devices_; }
+
+  // Clears the kernel caches.
+  void ClearCaches();
+
+  // Sets the device placement policy for the current thread.
+  void SetThreadLocalDevicePlacementPolicy(ContextDevicePlacementPolicy policy);
+
+  // Returns the device placement policy for the current thread.
+  ContextDevicePlacementPolicy GetDevicePlacementPolicy();
+
+  Status AsyncWait() { return executor_.WaitForAllPendingNodes(); }
+
+  Status GetStatus() { return executor_.status(); }
+
+  void ClearAsyncError() { executor_.ClearError(); }
+
+  bool FindFunctionByName(const string& name);
+
+  Status FindFunctionOpData(const string& name,
+                            const tensorflow::OpRegistrationData** op_data);
+
+  const FunctionDef* FindFunctionDef(const string& name);
+
+  Status FindDeviceByName(const string& name, Device** result);
+
+  Device* HostCPU() { return devices_[0]; }
+
+  bool SoftPlacement() { return soft_placement_; }
+
+  uint64 NextId() { return executor_.NextId(); }
+
+  void ExecutorAdd(EagerNode* node) { executor_.Add(node); }
+
+  Status AddFunctionDef(const FunctionDef& fdef);
+
+  KernelAndDevice* GetCachedKernel(Fprint128 cache_key);
+
+  void AddKernelToCache(Fprint128 cache_key, KernelAndDevice* kernel);
+
+  bool LogDevicePlacement() { return log_device_placement_; }
+
+  Rendezvous* GetRendezvous() { return rendezvous_; }
+
+  mutex* FunctionsMu() { return &functions_mu_; }
+
+  tensorflow::DeviceMgr* device_mgr() { return device_manager_.get(); }
+
+  // TODO(apassos) remove the need for this
+  void ReleaseDeviceMgr() { device_manager_.release(); }
+
+  // TODO(apassos) clean up RunMetadata storage.
+  mutex* MetadataMu() { return &metadata_mu_; }
+  bool ShouldStoreMetadata() { return should_store_metadata_.load(); }
+  void SetShouldStoreMetadata(bool value);
+  RunMetadata* RunMetadataProto() { return &run_metadata_; }
+
+  FunctionLibraryDefinition* FuncLibDef() { return &func_lib_def_; }
+
+ private:
+  const bool soft_placement_;
+  const ContextDevicePlacementPolicy policy_;
+
+  // Note: we cannot use C++11 thread_local here as there is no concept of a
+  // thread-local-object-local variable in C++11.
+  mutex policy_map_mu_;
+  std::unordered_map<std::thread::id, ContextDevicePlacementPolicy>
+      thread_local_policies_ GUARDED_BY(policy_map_mu_);
+
+  std::unique_ptr<DeviceMgr> device_manager_;
+  // Devices owned by device_manager
+  std::vector<Device*> devices_;
+  // None of the devices in this map are owned.
+  gtl::FlatMap<string, Device*, StringPieceHasher> devices_map_;
+  Rendezvous* const rendezvous_;
+
+  mutex functions_mu_;
+  FunctionLibraryDefinition func_lib_def_ GUARDED_BY(functions_mu_){
+      OpRegistry::Global(), {}};
+
+  // Provides one FunctionLibraryRuntime per device; `func_lib(d)` returns the
+  // FunctionLibraryRuntime for device `d` by looking it up by name in this
+  // ProcessFunctionLibraryRuntime.
+  const std::unique_ptr<ProcessFunctionLibraryRuntime> pflr_;
+
+  mutex cache_mu_;
+  std::unordered_map<Fprint128, KernelAndDevice*, Fprint128Hasher> kernel_cache_
+      GUARDED_BY(cache_mu_);
+
+  // Whether we should compute RunMetadata.
+  std::atomic<bool> should_store_metadata_{false};
+  mutex metadata_mu_;
+  RunMetadata run_metadata_ GUARDED_BY(metadata_mu_);
+  const bool log_device_placement_;
+  // EagerExecutor for async execution.
+  EagerExecutor executor_;
+
+  // True if the default value for execution mode is async. Note that this value
+  // can be overridden per thread based on `thread_local_async_` overrides.
+  const bool async_default_;
+  mutable mutex async_map_mu_;
+  std::unordered_map<std::thread::id, bool> thread_local_async_
+      GUARDED_BY(async_map_mu_);
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_COMMON_RUNTIME_EAGER_CONTEXT_H_
diff --git a/tensorflow/core/common_runtime/eager/eager_executor.cc b/tensorflow/core/common_runtime/eager/eager_executor.cc
new file mode 100644
index 0000000000000000000000000000000000000000..b699036e9697576adca403d5919b341d8f919db0
--- /dev/null
+++ b/tensorflow/core/common_runtime/eager/eager_executor.cc
@@ -0,0 +1,152 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/eager/eager_executor.h"
+
+namespace tensorflow {
+
+EagerNode::EagerNode(tensorflow::uint64 id) : id(id) {}
+
+EagerExecutor::~EagerExecutor() {
+  tensorflow::mutex_lock l(node_queue_mutex_);
+  thread_done_ = true;
+  nodes_pending_.notify_all();
+}
+
+tensorflow::uint64 EagerExecutor::NextId() {
+  tensorflow::mutex_lock l(next_id_mutex_);
+  return next_id_++;
+}
+
+void EagerExecutor::EnableAsync() {
+  tensorflow::mutex_lock l(node_queue_mutex_);
+  if (thread_ == nullptr) {
+    thread_.reset(tensorflow::Env::Default()->StartThread(
+        tensorflow::ThreadOptions(), "eager_async_executor",
+        std::bind(&EagerExecutor::Run, this)));
+  }
+}
+
+void EagerExecutor::Add(EagerNode* node) {
+  tensorflow::mutex_lock l(node_queue_mutex_);
+  DCHECK(thread_) << "EnableAsync should have been called before Add";
+  if (!status_.ok()) {
+    delete node;
+    return;
+  }
+  int64 qlen = node_queue_.size();
+  if (qlen > 0) {
+    if (node_queue_.back()->id >= node->id) {
+      status_ = tensorflow::errors::InvalidArgument(
+          "Inserting EagerNode with non-increasing ids:",
+          node_queue_.back()->id, " vs ", node->id);
+      delete node;
+      return;
+    }
+    node_queue_.push(node);
+  } else {
+    node_queue_.push(node);
+    nodes_pending_.notify_all();
+  }
+}
+
+tensorflow::Status EagerExecutor::WaitFor(tensorflow::uint64 node_id) {
+  return WaitImpl(false, node_id);
+}
+
+tensorflow::Status EagerExecutor::WaitForAllPendingNodes() {
+  return WaitImpl(true, 0);
+}
+
+tensorflow::Status EagerExecutor::WaitImpl(bool wait_all,
+                                           tensorflow::uint64 node_id) {
+  tensorflow::condition_variable cond;
+  tensorflow::mutex_lock l(node_queue_mutex_);
+  // Don't wait if an error is already set.
+  if (!status_.ok()) return status_;
+  if (node_queue_.empty()) return tensorflow::Status::OK();
+  if (wait_all) {
+    node_id = node_queue_.back()->id;
+  } else if (node_id < node_queue_.front()->id) {
+    // Note that we are relying on the ops being dispatched sequentially from
+    // the queue.
+    return tensorflow::Status::OK();
+  }
+  node_done_notifications_.insert(std::make_pair(node_id, &cond));
+  cond.wait(l);
+  // Note that we could be woken up if an error occurs, even though the node has
+  // not actually executed.
+  return status_;
+}
+
+void EagerExecutor::ClearError() {
+  tensorflow::mutex_lock l(node_queue_mutex_);
+  if (status_.ok()) return;
+  // If an error was set, node_done_notifications_ and node_queue_ should have
+  // been cleared, and no new entries should have been added since.
+  DCHECK(node_done_notifications_.empty());
+  DCHECK(node_queue_.empty());
+  status_ = tensorflow::Status::OK();
+  nodes_pending_.notify_all();
+}
+
+tensorflow::Status EagerExecutor::status() {
+  tensorflow::mutex_lock l(node_queue_mutex_);
+  return status_;
+}
+
+void EagerExecutor::Run() {
+  while (true) {
+    std::unique_ptr<EagerNode> curr_node;
+    {
+      tensorflow::mutex_lock l(node_queue_mutex_);
+      while (node_queue_.empty() || !status_.ok()) {
+        if (thread_done_) return;
+        nodes_pending_.wait(l);
+      }
+      curr_node.reset(node_queue_.front());
+    }
+    tensorflow::Status status = curr_node->Run();
+    const bool ok = status.ok();
+    tensorflow::mutex_lock l(node_queue_mutex_);
+    node_queue_.pop();
+    if (!ok) {
+      status_ = status;
+      // TODO(agarwal): mark all affected handles as corrupted before clearing
+      // this queue.
+      // We remove any pending ops so that we don't try to execute them if
+      // ClearError is called.
+      while (!node_queue_.empty()) {
+        delete node_queue_.front();
+        node_queue_.pop();
+      }
+    }
+    if (!node_done_notifications_.empty()) {
+      tensorflow::uint64 node_id = curr_node->id;
+      // Note that we notify all waiting threads in case an error has occurred.
+      // These calling threads are responsible for checking status_ before
+      // proceeding.
+      const auto range = ok ? node_done_notifications_.equal_range(node_id)
+                            : std::make_pair(node_done_notifications_.begin(),
+                                             node_done_notifications_.end());
+      for (auto it = range.first; it != range.second; ++it) {
+        it->second->notify_all();
+      }
+      node_done_notifications_.erase(range.first, range.second);
+    }
+  }
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/eager/eager_executor.h b/tensorflow/core/common_runtime/eager/eager_executor.h
new file mode 100644
index 0000000000000000000000000000000000000000..021daeb21d2ecb033b8017f012a148dafa092c01
--- /dev/null
+++ b/tensorflow/core/common_runtime/eager/eager_executor.h
@@ -0,0 +1,138 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_EAGER_EAGER_EXECUTOR_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_EAGER_EAGER_EXECUTOR_H_
+
+#include <algorithm>
+#include <cstddef>
+#include <map>
+#include <memory>
+#include <queue>
+#include <string>
+#include <thread>
+#include <vector>
+
+#include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/common_runtime/function.h"
+#include "tensorflow/core/common_runtime/rendezvous_mgr.h"
+#include "tensorflow/core/framework/rendezvous.h"
+#include "tensorflow/core/lib/gtl/inlined_vector.h"
+#include "tensorflow/core/lib/gtl/map_util.h"
+#include "tensorflow/core/lib/gtl/stl_util.h"
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/thread_annotations.h"
+#include "tensorflow/core/public/version.h"
+
+namespace tensorflow {
+
+// A unit of execution for the EagerExecutor class below. Example subclasses
+// encapsulate execution of a TFE_Op, or copying a TFE_TensorHandle from one
+// device to another.
+class EagerNode {
+ public:
+  explicit EagerNode(uint64 id);
+
+  virtual ~EagerNode() {}
+
+  // Runs the computation corresponding to this node and blocks till the
+  // execution is done.
+  virtual Status Run() = 0;
+
+  // An id unique to the TFE_Context under which this node is created. Allocated
+  // monotonically.
+  const uint64 id;
+};
+
+// A class for handling async execution (see TFE_ContextSetAsync).
+// Note that this class is thread-safe.
+// TODO(agarwal): TFE_OpAddInput may currently block if it tries to access the
+// device of the input handle. Fix that.
+// TODO(agarwal): On error, mark all affected handles as corrupted.
+// TODO(agarwal): Implement support for control dependencies.
+// TODO(agarwal): Support out-of-order execution and dispatching multiple
+// EagerNode in parallel.
+// TODO(agarwal): Implement optimizations over EagerNode traces.
+class EagerExecutor {
+ public:
+  ~EagerExecutor();
+
+  // This is called whenever async mode is enabled. Note that it may be called
+  // multiple times as different calling threads may switch async mode on or off
+  // independently.
+  void EnableAsync();
+
+  // Helper function to create monotonically increasing ids unique to this
+  // object.
+  uint64 NextId();
+
+  // Schedules `node` for execution.
+  // Note that Add must be called in monotonically increasing order of node->id.
+  void Add(EagerNode* node);
+
+  // Causes the caller to block till node with id `node_id` has finished
+  // execution.
+  Status WaitFor(uint64 node_id);
+
+  // Blocks till all currently pending ops are done.
+  Status WaitForAllPendingNodes();
+
+  // Clears all currently set errors, which re-enables async execution.
+  void ClearError();
+
+  // Returns Status based on any errors that occurred during async execution.
+  Status status();
+
+ private:
+  // Starts execution of pending EagerNodes. This function loops till
+  // thread_done_ is set to true. If any errors are encountered, these are set
+  // inside `status_`. The loop blocks anytime there are no pending nodes, or if
+  // `status_` is not ok.
+  void Run();
+
+  Status WaitImpl(bool wait_all, uint64 node_id);
+
+  mutex node_queue_mutex_;
+
+  // Used to signal that some EagerNodes are pending execution.
+  condition_variable nodes_pending_ GUARDED_BY(node_queue_mutex_);
+
+  // Queue of pending EagerNodes.
+  std::queue<EagerNode*> node_queue_ GUARDED_BY(node_queue_mutex_);
+
+  // `status_` is set based on any errors raised during execution of an
+  // EagerNode. It remains set until ClearError is called.
+  Status status_ GUARDED_BY(node_queue_mutex_);
+
+  // Map from id of an EagerNode to condition_variables (not owned by the map).
+  // These condition_variables are notified and removed when that EagerNode is
+  // done executing, or if an error is found in execution of any EagerNode.
+  std::multimap<uint64, condition_variable*> node_done_notifications_
+      GUARDED_BY(node_queue_mutex_);
+
+  // Thread object that calls the `Run` method. Currently we use only one thread
+  // for executing the EagerNodes one-by-one.
+  std::unique_ptr<Thread> thread_ GUARDED_BY(node_queue_mutex_);
+
+  // Indicates that `thread_` should stop as soon as it is done executing the
+  // current EagerNode.
+  bool thread_done_ GUARDED_BY(node_queue_mutex_) = false;
+
+  mutex next_id_mutex_;
+  uint64 next_id_ GUARDED_BY(next_id_mutex_) = 1;
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_COMMON_RUNTIME_EAGER_EAGER_EXECUTOR_H_
diff --git a/tensorflow/core/common_runtime/eager/kernel_and_device.cc b/tensorflow/core/common_runtime/eager/kernel_and_device.cc
new file mode 100644
index 0000000000000000000000000000000000000000..0a4895a938a72a41cf3ad494ecca2ef9fb3e9648
--- /dev/null
+++ b/tensorflow/core/common_runtime/eager/kernel_and_device.cc
@@ -0,0 +1,132 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/eager/kernel_and_device.h"
+
+#include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/common_runtime/rendezvous_mgr.h"
+#include "tensorflow/core/framework/allocator.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/gtl/map_util.h"
+#include "tensorflow/core/lib/gtl/stl_util.h"
+#include "tensorflow/core/platform/fingerprint.h"
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/public/version.h"
+#include "tensorflow/core/util/tensor_slice_reader_cache.h"
+
+namespace tensorflow {
+
+// static
+Status KernelAndDevice::InitOp(Device* device, const NodeDef& ndef,
+                               KernelAndDevice* out) {
+  OpKernel* k = nullptr;
+  Status s = CreateOpKernel(device->device_type().c_str(), device,
+                            device->GetAllocator(AllocatorAttributes()),
+                            nullptr, ndef, TF_GRAPH_DEF_VERSION, &k);
+  out->device_ = device;
+  out->kernel_.reset(k);
+  out->flib_ = nullptr;
+  return s;
+}
+
+// static
+Status KernelAndDevice::Init(const NodeDef& ndef, FunctionLibraryRuntime* flib,
+                             KernelAndDevice* out) {
+  OpKernel* k = nullptr;
+  Status s = flib->CreateKernel(ndef, &k);
+  out->device_ = flib->device();
+  out->kernel_.reset(k);
+  out->flib_ = flib;
+  return s;
+}
+
+Status KernelAndDevice::Run(std::vector<Tensor>* input_tensors,
+                            std::vector<Tensor>* output_tensors,
+                            NodeExecStats* stats) {
+  gtl::InlinedVector<TensorValue, 4> inputs;
+  for (Tensor& t : *input_tensors) {
+    inputs.push_back(TensorValue(&t));
+  }
+
+  std::vector<AllocatorAttributes> out_attrs(kernel_->num_outputs());
+  for (size_t i = 0; i < out_attrs.size(); ++i) {
+    out_attrs[i].set_on_host(kernel_->output_memory_types()[i] ==
+                             tensorflow::HOST_MEMORY);
+  }
+
+  OpKernelContext::Params params;
+  params.device = device_;
+  params.frame_iter = FrameAndIter(0, 0);
+  params.inputs = &inputs;
+  params.op_kernel = kernel_.get();
+  params.resource_manager = device_->resource_manager();
+  params.output_attr_array = gtl::vector_as_array(&out_attrs);
+  params.function_library = flib_;
+  params.slice_reader_cache = &slice_reader_cache_;
+  params.rendezvous = rendez_;
+  if (stats != nullptr) {
+    params.track_allocations = true;
+  }
+  // TODO(apassos): use a thread pool.
+  std::function<void(std::function<void()>)> runner =
+      [](std::function<void()> f) { f(); };
+  params.runner = &runner;
+
+  OpKernelContext context(&params);
+
+  if (kernel_->def().op() == "_Recv") {
+    // TODO(apassos) do not special-case _Recv. Currently the GPU device fails
+    // if trying to run _Recv->Compute(), specifically checking for _Recv. To go
+    // around this we call _Recv->ComputeAsync, to mimic graph mode behavior.
+    AsyncOpKernel* async = kernel_->AsAsync();
+    Notification done;
+    device_->ComputeAsync(async, &context, [&done]() { done.Notify(); });
+    done.WaitForNotification();
+  } else {
+    device_->Compute(kernel_.get(), &context);
+  }
+  if (!context.status().ok()) return context.status();
+
+  output_tensors->clear();
+  for (int i = 0; i < context.num_outputs(); ++i) {
+    output_tensors->push_back(Tensor(*context.mutable_output(i)));
+  }
+  if (stats != nullptr) {
+    for (const auto& allocator_pair : context.wrapped_allocators()) {
+      AllocatorMemoryUsed* memory = stats->add_memory();
+      memory->set_allocator_name(allocator_pair.first->Name());
+      auto sizes = allocator_pair.second->GetSizes();
+      memory->set_total_bytes(std::get<0>(sizes));
+      memory->set_peak_bytes(std::get<1>(sizes));
+      memory->set_live_bytes(std::get<2>(sizes));
+
+      AllocatorStats allocator_stats;
+      allocator_pair.first->GetStats(&allocator_stats);
+      memory->set_allocator_bytes_in_use(allocator_stats.bytes_in_use);
+      allocator_pair.second->GetRecordsAndUnRef();
+    }
+    auto* ms = stats->mutable_memory_stats();
+    ms->set_temp_memory_size(context.temp_memory_allocated());
+    for (const auto& alloc_id : context.persistent_alloc_ids()) {
+      ms->mutable_persistent_tensor_alloc_ids()->Add(alloc_id);
+    }
+
+    ms->set_persistent_memory_size(context.persistent_memory_allocated());
+  }
+  return Status::OK();
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/eager/kernel_and_device.h b/tensorflow/core/common_runtime/eager/kernel_and_device.h
new file mode 100644
index 0000000000000000000000000000000000000000..46ec550c780aaa3cd5cad5f02a4dfe9a75572277
--- /dev/null
+++ b/tensorflow/core/common_runtime/eager/kernel_and_device.h
@@ -0,0 +1,85 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_EAGER_KERNEL_AND_DEVICE_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_EAGER_KERNEL_AND_DEVICE_H_
+
+// Support for eager execution of TensorFlow kernels.
+
+#include <memory>
+#include <unordered_map>
+
+#include "tensorflow/core/common_runtime/device.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/gtl/inlined_vector.h"
+#include "tensorflow/core/platform/fingerprint.h"
+#include "tensorflow/core/util/tensor_slice_reader_cache.h"
+
+namespace tensorflow {
+
+// KernelAndDevice encapsulates an instantiated kernel and the device it is on.
+//
+// Also see:
+// https://www.tensorflow.org/code/tensorflow/core/common_runtime/kernel_benchmark_testlib.h
+// and
+// https://www.tensorflow.org/code/tensorflow/core/kernels/ops_testutil.h
+class KernelAndDevice {
+ public:
+  // Populates 'out' with a kernel appropriate for 'ndef'.
+  //
+  // The provided FunctionLibraryRuntime MUST outlive all calls to
+  // Run() on the returned KernelAndDevice.
+  //
+  // TODO(ashankar): Figure out thread-safety concerns around
+  // FunctionLibraryRuntime (in particular, how the underlying
+  // FunctionLibraryDefinition might be mutated by another thread as new
+  // functions are registered with it).  Conservatively, thread-safe usage of
+  // the FunctionLibraryRuntime is pushed on to the caller (see locking in
+  // c_api.cc).
+  static Status Init(const NodeDef& ndef, FunctionLibraryRuntime* flib,
+                     KernelAndDevice* out);
+  // TODO(ashankar): Remove this
+  static Status InitOp(Device* device, const NodeDef& ndef,
+                       KernelAndDevice* out);
+
+  KernelAndDevice(tensorflow::Rendezvous* rendez)
+      : device_(nullptr), flib_(nullptr), rendez_(rendez) {}
+
+  // TODO(ashankar): Handle list-valued inputs.
+  Status Run(std::vector<Tensor>* inputs, std::vector<Tensor>* outputs,
+             NodeExecStats* stats);
+
+  const OpKernel* kernel() const { return kernel_.get(); }
+
+  Device* device() const { return device_; }
+
+  DataTypeVector* mutable_output_dtypes() { return &output_dtypes_; }
+  const DataTypeVector& output_dtypes() { return output_dtypes_; }
+
+ private:
+  std::unique_ptr<OpKernel> kernel_;
+  Device* device_;
+  FunctionLibraryRuntime* flib_;
+  checkpoint::TensorSliceReaderCacheWrapper slice_reader_cache_;
+  Rendezvous* rendez_;
+  DataTypeVector output_dtypes_;
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_COMMON_RUNTIME_EAGER_KERNEL_AND_DEVICE_H_
diff --git a/tensorflow/core/common_runtime/eager/kernel_and_device_test.cc b/tensorflow/core/common_runtime/eager/kernel_and_device_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..dd055c3c3eb6df3eb440a78b7a8d3e72ff9335bd
--- /dev/null
+++ b/tensorflow/core/common_runtime/eager/kernel_and_device_test.cc
@@ -0,0 +1,140 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/eager/kernel_and_device.h"
+
+#include <memory>
+#include <vector>
+
+#include "tensorflow/c/eager/runtime.h"
+#include "tensorflow/cc/client/client_session.h"
+#include "tensorflow/cc/framework/ops.h"
+#include "tensorflow/cc/framework/scope.h"
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/common_runtime/function.h"
+#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/platform/test_benchmark.h"
+#include "tensorflow/core/public/version.h"
+
+namespace tensorflow {
+namespace {
+
+class TestEnv {
+ public:
+  TestEnv() : flib_def_(OpRegistry::Global(), {}) {
+    Device* device =
+        DeviceFactory::NewDevice("CPU", {}, "/job:a/replica:0/task:0");
+    device_mgr_.reset(new DeviceMgr({device}));
+    flib_runtime_ = NewFunctionLibraryRuntime(device_mgr_.get(), Env::Default(),
+                                              device, TF_GRAPH_DEF_VERSION,
+                                              &flib_def_, nullptr, {}, nullptr);
+  }
+
+  FunctionLibraryRuntime* function_library_runtime() const {
+    return flib_runtime_.get();
+  }
+
+ private:
+  FunctionLibraryDefinition flib_def_;
+  std::unique_ptr<DeviceMgr> device_mgr_;
+  std::unique_ptr<FunctionLibraryRuntime> flib_runtime_;
+};
+
+void BM_CreateGraph(int iters) {
+  for (int i = 0; i < iters; ++i) {
+    Scope root = Scope::NewRootScope();
+    auto C = ops::Const(root, {{1.0, 2.0}, {3.0, 4.0}});
+    auto M = ops::MatMul(root, C, C);
+    TF_CHECK_OK(root.status());
+  }
+}
+BENCHMARK(BM_CreateGraph);
+
+void BM_RunGraph(int iters) {
+  tensorflow::testing::StopTiming();
+  Scope root = Scope::NewRootScope();
+  auto C = ops::Const(root, {{1.0, 2.0}, {3.0, 4.0}});
+  auto M = ops::MatMul(root, C, C);
+  SessionOptions opts;
+  opts.config.set_inter_op_parallelism_threads(1);
+  opts.config.set_intra_op_parallelism_threads(1);
+  ClientSession sess(root, opts);
+  std::vector<Tensor> outputs;
+  tensorflow::testing::StartTiming();
+  for (int i = 0; i < iters; ++i) {
+    outputs.clear();
+    TF_CHECK_OK(sess.Run({M}, &outputs));
+  }
+}
+BENCHMARK(BM_RunGraph);
+
+void BM_CreateAndDestroySession(int iters) {
+  tensorflow::testing::StopTiming();
+  Scope root = Scope::NewRootScope();
+  auto C = ops::Const(root, {{1.0, 2.0}, {3.0, 4.0}});
+  auto M = ops::MatMul(root, C, C);
+  tensorflow::testing::StartTiming();
+  for (int i = 0; i < iters; ++i) {
+    ClientSession sess(root);
+  }
+}
+BENCHMARK(BM_CreateAndDestroySession);
+
+void BM_KernelAndDeviceInit(int iters) {
+  tensorflow::testing::StopTiming();
+  NodeDef ndef(AttrBuilder("MatMul")
+                   .Set("T", DT_FLOAT)
+                   .Set("transpose_a", false)
+                   .Set("transpose_b", false)
+                   .NumInputs(2)
+                   .BuildNodeDef());
+  TestEnv env;
+  KernelAndDevice k(nullptr);
+  tensorflow::testing::StartTiming();
+  for (int i = 0; i < iters; ++i) {
+    TF_CHECK_OK(
+        KernelAndDevice::Init(ndef, env.function_library_runtime(), &k));
+  }
+}
+BENCHMARK(BM_KernelAndDeviceInit);
+
+void BM_KernelAndDeviceRun(int iters) {
+  tensorflow::testing::StopTiming();
+  Tensor t(Input({{1.0f, 2.0f}, {3.0f, 4.0f}}).tensor());
+  std::vector<Tensor> inputs;
+  inputs.push_back(t);
+  inputs.push_back(t);
+  std::vector<Tensor> outputs;
+  NodeDef ndef(AttrBuilder("MatMul")
+                   .Set("T", DT_FLOAT)
+                   .Set("transpose_a", false)
+                   .Set("transpose_b", false)
+                   .NumInputs(inputs.size())
+                   .BuildNodeDef());
+  TestEnv env;
+  KernelAndDevice kernel(nullptr);
+  TF_CHECK_OK(
+      KernelAndDevice::Init(ndef, env.function_library_runtime(), &kernel));
+  tensorflow::testing::StartTiming();
+  for (int i = 0; i < iters; ++i) {
+    TF_CHECK_OK(kernel.Run(&inputs, &outputs, nullptr));
+  }
+}
+BENCHMARK(BM_KernelAndDeviceRun);
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/eval_const_tensor.cc b/tensorflow/core/common_runtime/eval_const_tensor.cc
new file mode 100644
index 0000000000000000000000000000000000000000..c1542f1f57fd06a9b8fa0d6e36a5592772404c0e
--- /dev/null
+++ b/tensorflow/core/common_runtime/eval_const_tensor.cc
@@ -0,0 +1,361 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/eval_const_tensor.h"
+
+#include <deque>
+
+#include "tensorflow/core/common_runtime/graph_runner.h"
+#include "tensorflow/core/common_runtime/shape_refiner.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/versions.pb.h"
+#include "tensorflow/core/graph/graph.h"
+#include "tensorflow/core/kernels/bounds_check.h"
+
+namespace tensorflow {
+
+using shape_inference::InferenceContext;
+
+namespace {
+
+// Tries to infer tensor output based on the input shapes of the node. In some
+// cases, the shapes of the inputs are sufficient for inferring the contents of
+// the output tensor. For example, a Shape op with fully defined input shapes
+// can have its output tensor inferred.
+Status TryToInferTensorOutputFromInputShapes(const Edge& edge,
+                                             const ShapeRefiner& refiner,
+                                             Tensor* output, bool* success) {
+  *success = false;
+  const Node* node = edge.src();
+  InferenceContext* c = refiner.GetContext(node);
+  if (c == nullptr) {
+    return errors::FailedPrecondition("Node does not have context.");
+  }
+
+  if (node->type_string() == "Shape") {
+    // If input shapes to the shape op are fully defined,
+    // we can infer the shape op's output tensor.
+    bool fully_defined_inputs = c->FullyDefined(c->input(0));
+    if (fully_defined_inputs) {
+      int input_rank = c->Rank(c->input(0));
+      Tensor t(node->output_type(0), TensorShape({input_rank}));
+      if (node->output_type(0) == DT_INT32) {
+        auto flat = t.flat<int>();
+        for (int i = 0; i < input_rank; i++) {
+          int64 dimension = c->Value(c->Dim(c->input(0), i));
+          if (!FastBoundsCheck(dimension, std::numeric_limits<int32>::max())) {
+            return errors::InvalidArgument(
+                "Shape has output type int32, but dimension exceeds maximum "
+                "int32 value");
+          }
+          flat(i) = static_cast<int32>(dimension);
+        }
+      } else if (node->output_type(0) == DT_INT64) {
+        auto flat = t.flat<int64>();
+        for (int i = 0; i < input_rank; i++) {
+          flat(i) = c->Value(c->Dim(c->input(0), i));
+        }
+      } else {
+        return errors::FailedPrecondition(
+            "Shape has output type that is not int32 or int64");
+      }
+      *output = t;
+      *success = true;
+    }
+  } else if (node->type_string() == "Rank") {
+    bool rank_known = c->RankKnown(c->input(0));
+    if (rank_known) {
+      int32 input_rank = c->Rank(c->input(0));
+      Tensor t(node->output_type(0), TensorShape({}));
+      t.flat<int32>()(0) = input_rank;
+      *output = t;
+      *success = true;
+    }
+  } else if (node->type_string() == "Size") {
+    bool fully_defined_inputs = c->FullyDefined(c->input(0));
+    if (fully_defined_inputs) {
+      int32 rank = c->Rank(c->input(0));
+      Tensor t(node->output_type(0), TensorShape({}));
+      int64 size = 1;
+      for (int i = 0; i < rank; i++) {
+        size *= c->Value(c->Dim(c->input(0), i));
+      }
+      if (node->output_type(0) == DT_INT32) {
+        if (!FastBoundsCheck(size, std::numeric_limits<int32>::max())) {
+          return errors::InvalidArgument(
+              "Size has output type int32, but size exceeds maximum int32 "
+              "value");
+        }
+        t.flat<int32>()(0) = static_cast<int32>(size);
+      } else if (node->output_type(0) == DT_INT64) {
+        t.flat<int64>()(0) = size;
+      } else {
+        return errors::FailedPrecondition(
+            "Size has output type that is not int32 or int64");
+      }
+      *output = t;
+      *success = true;
+    }
+  }
+  return Status::OK();
+}
+
+// Extracts the subgraph ending at 'target_node' that is statically computable
+// and inserts into 'out_graph'. If statically computable, 'is_constant_graph'
+// will be set to true.
+Status ExtractConstantSubgraph(
+    const Node& target_node, const ShapeRefiner& refiner,
+    const std::unordered_map<string, Tensor>* cached_values, Graph* out_graph,
+    bool* is_constant_graph,
+    std::vector<std::pair<string, Tensor>>* const_inputs) {
+  *is_constant_graph = false;
+  std::unordered_set<string> const_inputs_added;
+
+  if (target_node.op_def().is_stateful()) {
+    return Status::OK();
+  }
+
+  if (IsMerge(&target_node)) {
+    return Status::OK();
+  }
+
+  if (target_node.type_string() == "PlaceholderWithDefault") {
+    return Status::OK();
+  }
+
+  // TODO(skyewm): should more of the filtering applied in input nodes below be
+  // applied to target_node here?
+
+  // Identify the possibly constant subgraph by recursively iterating backwards
+  // through the inputs to 'target_node' until we either 1) find an already
+  // existing input to our subgraph 'const_inputs', 2) discover our graph is not
+  // constant, or 3) hit a root node.
+
+  struct NodeAndRecursed {
+    Node* new_node = nullptr;
+    bool recursed = false;
+  };
+
+  std::map<const Node*, NodeAndRecursed> old_to_new_and_recursed;
+  Node* target_node_copy = out_graph->CopyNode(&target_node);
+  old_to_new_and_recursed[&target_node].new_node = target_node_copy;
+  old_to_new_and_recursed[&target_node].recursed = true;
+
+  // Add the target node's inputs to seed the recursion.
+  std::deque<const Edge*> edges_to_visit;
+  for (const Edge* e : target_node.in_edges()) {
+    // TODO(skyewm): control edges will be meaningful if/when we handle control
+    // flow (e.g. constants in cond branches are triggered via control edges).
+    if (e->IsControlEdge()) continue;
+    edges_to_visit.push_back(e);
+  }
+
+  *is_constant_graph = true;
+
+  // Iterate over the set of edges to visit (backwards).
+  while (!edges_to_visit.empty()) {
+    const Edge* current_edge = edges_to_visit.front();
+    edges_to_visit.pop_front();
+    Node* current_node = current_edge->src();
+
+    // If the node is stateful, assume the graph is not constant.
+    if (current_node->op_def().is_stateful()) {
+      *is_constant_graph = false;
+      return Status::OK();
+    }
+
+    // During construction or import from GraphConstructor, back edges may not
+    // be filled in. In addition, control flow constructs may depend on control
+    // edges which aren't handled by this method. Don't constant fold through
+    // merges at all for now.
+    if (IsMerge(current_node)) {
+      *is_constant_graph = false;
+      return Status::OK();
+    }
+
+    // Don't constant fold enter/exit currently either, as it's easy to end
+    // up with a partial frame.
+    if (IsEnter(current_node) || IsExit(current_node)) {
+      *is_constant_graph = false;
+      return Status::OK();
+    }
+
+    // Placeholders should never be constant folded because their outputs are
+    // fed by the user. Note that "Placeholder" nodes have no inputs so are
+    // handled below.
+    if (current_node->type_string() == "PlaceholderWithDefault") {
+      *is_constant_graph = false;
+      return Status::OK();
+    }
+
+    // If there is nothing more to recurse down, see if
+    // the generator node is a constant.
+    if (current_node->num_inputs() == 0) {
+      if (!current_node->IsConstant()) {
+        // Generator node is not a constant, so subgraph is not
+        // constant.
+        *is_constant_graph = false;
+        return Status::OK();
+      }
+    }
+
+    // Either the node is a constant, or the node is a potential
+    // intermediate node on the path from a constant.
+    //
+    // Add a copy of its node and a new edge to the new subgraph.
+
+    // Get or create the version of 'current_node' in the new graph.
+    Node* current_node_copy;
+    // This gets or creates the NodeAndRecursed entry for current_node.
+    NodeAndRecursed* node_and_recursed = &old_to_new_and_recursed[current_node];
+    if (node_and_recursed->new_node == nullptr) {
+      // First time processing this node.
+      current_node_copy = out_graph->CopyNode(current_node);
+      // Track the mapping from the original node to the new one.
+      node_and_recursed->new_node = current_node_copy;
+    } else {
+      current_node_copy = node_and_recursed->new_node;
+    }
+
+    // Add the edge to the destination node.
+    {
+      auto it = old_to_new_and_recursed.find(current_edge->dst());
+      if (it == old_to_new_and_recursed.end()) {
+        return errors::Internal(
+            "Could not find mapping from old to new copy of destination node: ",
+            current_edge->dst()->name());
+      }
+      Node* dst_copy = it->second.new_node;
+
+      out_graph->AddEdge(current_node_copy, current_edge->src_output(),
+                         dst_copy, current_edge->dst_input());
+    }
+
+    const string& output_tensor_name =
+        strings::StrCat(current_node->name(), ":", current_edge->src_output());
+
+    // Some tensor values can be inferred. For example, a shape op
+    // with input shapes fully defined can have its output tensor inferred.
+    Tensor tensor_inferred;
+    bool successfully_inferred_tensor = false;
+    TF_RETURN_IF_ERROR(TryToInferTensorOutputFromInputShapes(
+        *current_edge, refiner, &tensor_inferred,
+        &successfully_inferred_tensor));
+    if (successfully_inferred_tensor) {
+      const_inputs->emplace_back(output_tensor_name, tensor_inferred);
+      const_inputs_added.insert(output_tensor_name);
+      continue;
+    }
+
+    // If we have a copy of the input tensor materialized already,
+    // then add to the list of inputs to feed and do not recurse further.
+    if (cached_values != nullptr) {
+      auto it = cached_values->find(output_tensor_name);
+      if (it != cached_values->end() &&
+          const_inputs_added.count(output_tensor_name) == 0) {
+        const_inputs->emplace_back(output_tensor_name, it->second);
+        const_inputs_added.insert(output_tensor_name);
+        continue;
+      }
+    }
+
+    // If this node's inputs have not been processed already, do so now.
+    if (!node_and_recursed->recursed) {
+      node_and_recursed->recursed = true;
+      for (const Edge* e : current_node->in_edges()) {
+        if (e->IsControlEdge()) continue;
+        edges_to_visit.push_back(e);
+      }
+    }
+  }
+
+  return Status::OK();
+}
+
+}  // namespace
+
+Status EvaluateConstantTensor(OutputTensor tensor, const ShapeRefiner& refiner,
+                              const OpRegistryInterface& ops,
+                              int32 graph_def_version, bool* evaluated,
+                              Tensor* result, GraphRunner* graph_runner,
+                              std::unordered_map<string, Tensor>* cached_values,
+                              int64 max_cached_value_size,
+                              bool disable_constant_propagation) {
+  *evaluated = false;
+  const Node* src = tensor.node;
+
+  // Simple case: the source node is a constant
+  if (src->IsConstant()) {
+    if (result->FromProto(src->def().attr().at("value").tensor())) {
+      *evaluated = true;
+      return Status::OK();
+    }
+  }
+
+  if (disable_constant_propagation) {
+    return Status::OK();
+  }
+
+  bool is_constant_graph = false;
+  Graph subgraph(&ops);
+  auto versions = subgraph.versions();
+  versions.set_producer(graph_def_version);
+  subgraph.set_versions(versions);
+
+  std::vector<std::pair<string, Tensor>> const_inputs;
+  TF_RETURN_IF_ERROR(ExtractConstantSubgraph(*src, refiner, cached_values,
+                                             &subgraph, &is_constant_graph,
+                                             &const_inputs));
+  if (!is_constant_graph) {
+    return Status::OK();
+  }
+  const string output_tensor_name =
+      strings::StrCat(src->name(), ":", tensor.index);
+  std::vector<Tensor> outputs;
+
+  std::unique_ptr<GraphRunner> graph_runner_storage;
+  if (graph_runner == nullptr) {
+    // TODO(skyewm): Convert to std::make_unique when available.
+    graph_runner_storage.reset(new GraphRunner(Env::Default()));
+    graph_runner = graph_runner_storage.get();
+  }
+
+  // NOTE: we should pass in a function library runtime if we want
+  // to support constant-expression evaluation on functions.
+  Status s = graph_runner->Run(&subgraph, nullptr /* function_library */,
+                               const_inputs, {output_tensor_name}, &outputs);
+
+  // If any kernel in the constant graph is not registered
+  // in the process, GraphRunner::Run may fail, in which case
+  // we cannot propagate constants, so this is best-effort.
+  if (s.ok()) {
+    *result = outputs[0];
+    *evaluated = true;
+
+    // We memoize (small) constants evaluated so far, so
+    // ExtractConstantSubgraph can avoid extracting the full
+    // subgraph.  As we build up large graphs, this avoids
+    // repeated computation of the early parts of a constant
+    // graph.
+    if (cached_values != nullptr &&
+        outputs[0].TotalBytes() <= max_cached_value_size) {
+      (*cached_values)[output_tensor_name] = outputs[0];
+    }
+  }
+  return Status::OK();
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/eval_const_tensor.h b/tensorflow/core/common_runtime/eval_const_tensor.h
new file mode 100644
index 0000000000000000000000000000000000000000..fca5a235695ce121552656bced5f4001b146cfd0
--- /dev/null
+++ b/tensorflow/core/common_runtime/eval_const_tensor.h
@@ -0,0 +1,66 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_EVAL_CONST_TENSOR_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_EVAL_CONST_TENSOR_H_
+
+#include "tensorflow/core/graph/graph.h"
+#include "tensorflow/core/lib/core/status.h"
+
+// TODO(skyewm): can this be combined with ConstantFold?
+
+namespace tensorflow {
+
+class GraphRunner;
+class OpRegistryInterface;
+class ShapeRefiner;
+class Tensor;
+
+// Attempts to evaluate `tensor`. This will only be possible if `tensor` doesn't
+// depend on any graph inputs (this function is safe to call if this isn't the
+// case though).
+//
+// If the evaluation is successful, `evaluated` will be set to true and
+// `tensor`'s value returned in `result`. Otherwise `evaluated` will be set to
+// false. An error status is returned if something is wrong with the graph or
+// input. Note that `evaluated` may be set to false if Status::OK() is returned.
+//
+// Params:
+//   tensor - the tensor to be evaluated.
+//   refiner - used to fetch the InferenceContexts for nodes in the graph.
+//   ops - the OpRegistryInterface for the graph.
+//   graph_def_version - the producer version of the graph.
+//   evaluated - output param indicating whether evaluation was successful.
+//   result - output param containing the result if evaluated is true.
+//   graph_runner - optional. If not set, a GraphRunner will be created for
+//     evaluating tensor. This can be set to avoid creating a new GraphRunner
+//     for every call.
+//   cached_values - optional. This can be used to cache evaluated results
+//     across calls, to avoid evaluating the same parts of the graph multiple
+//     times.
+//   max_cached_value_size - optional. If `cached_values` is set, the maximum
+//     result size to cache.
+//   disable_constant_propagation - if true, only Const node values will be
+//     returned.
+Status EvaluateConstantTensor(
+    OutputTensor tensor, const ShapeRefiner& refiner,
+    const OpRegistryInterface& ops, int32 graph_def_version, bool* evaluated,
+    Tensor* result, GraphRunner* graph_runner = nullptr,
+    std::unordered_map<string, Tensor>* cached_values = nullptr,
+    int64 max_cached_value_size = 1024,
+    bool disable_constant_propagation = false);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_COMMON_RUNTIME_EVAL_CONST_TENSOR_H_
diff --git a/tensorflow/core/common_runtime/function.cc b/tensorflow/core/common_runtime/function.cc
index 3e937ceb640554be3a2578decdb336d0e58c197f..37c59a16f56809fe8d5f88c05b824bcbdcc7cf4e 100644
--- a/tensorflow/core/common_runtime/function.cc
+++ b/tensorflow/core/common_runtime/function.cc
@@ -34,6 +34,7 @@ limitations under the License.
 #include "tensorflow/core/graph/gradients.h"
 #include "tensorflow/core/graph/graph_constructor.h"
 #include "tensorflow/core/graph/optimizer_cse.h"
+#include "tensorflow/core/lib/core/threadpool.h"
 #include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/platform/macros.h"
 
@@ -141,6 +142,7 @@ class FunctionLibraryRuntimeImpl : public FunctionLibraryRuntime {
   FunctionLibraryRuntimeImpl(const DeviceMgr* dmgr, Env* env, Device* device,
                              int graph_def_version,
                              const FunctionLibraryDefinition* lib_def,
+                             thread::ThreadPool* default_thread_pool,
                              const OptimizerOptions& optimizer_options,
                              CustomKernelCreator custom_kernel_creator,
                              ProcessFunctionLibraryRuntime* parent);
@@ -194,6 +196,7 @@ class FunctionLibraryRuntimeImpl : public FunctionLibraryRuntime {
   const FunctionLibraryDefinition* const base_lib_def_;
   GraphOptimizer optimizer_;
   const CustomKernelCreator custom_kernel_creator_;
+  Executor::Args::Runner default_runner_;
   const string device_name_;
 
   std::function<Status(const string&, const OpDef**)> get_func_sig_;
@@ -243,6 +246,7 @@ class FunctionLibraryRuntimeImpl : public FunctionLibraryRuntime {
 FunctionLibraryRuntimeImpl::FunctionLibraryRuntimeImpl(
     const DeviceMgr* dmgr, Env* env, Device* device, int graph_def_version,
     const FunctionLibraryDefinition* lib_def,
+    thread::ThreadPool* default_thread_pool,
     const OptimizerOptions& optimizer_options,
     CustomKernelCreator custom_kernel_creator,
     ProcessFunctionLibraryRuntime* parent)
@@ -253,6 +257,7 @@ FunctionLibraryRuntimeImpl::FunctionLibraryRuntimeImpl(
       base_lib_def_(lib_def),
       optimizer_(optimizer_options),
       custom_kernel_creator_(std::move(custom_kernel_creator)),
+      default_runner_(nullptr),
       device_name_(device_ == nullptr
                        ? ProcessFunctionLibraryRuntime::kDefaultFLRDevice
                        : device_->name()),
@@ -264,6 +269,18 @@ FunctionLibraryRuntimeImpl::FunctionLibraryRuntimeImpl(
   create_kernel_ = [this](const NodeDef& ndef, OpKernel** kernel) {
     return CreateKernel(ndef, kernel);
   };
+  thread::ThreadPool* pool = nullptr;
+  if (device_ != nullptr) {
+    pool = device_->tensorflow_device_thread_pool();
+  }
+  if (pool == nullptr) {
+    pool = default_thread_pool;
+  }
+  if (pool != nullptr) {
+    default_runner_ = [pool](Executor::Args::Closure c) {
+      pool->Schedule(std::move(c));
+    };
+  }
 }
 
 FunctionLibraryRuntimeImpl::~FunctionLibraryRuntimeImpl() {
@@ -479,11 +496,26 @@ Status FunctionLibraryRuntimeImpl::Instantiate(
   InstantiateOptions options_copy(options);
   options_copy.target = device_name_;
   const string key = Canonicalize(function_name, attrs, options_copy);
-  *handle = parent_->GetHandle(key);
-  if (*handle != kInvalidHandle) {
+
+  {
     mutex_lock l(mu_);
-    items_[parent_->GetHandleOnDevice(device_name_, *handle)]->Ref();
-    return Status::OK();
+    *handle = parent_->GetHandle(key);
+    if (*handle != kInvalidHandle) {
+      FunctionLibraryRuntime::LocalHandle handle_on_device =
+          parent_->GetHandleOnDevice(device_name_, *handle);
+      if (handle_on_device == kInvalidLocalHandle) {
+        return errors::Internal("LocalHandle not found for handle ", *handle,
+                                ".");
+      }
+      auto item_handle = items_.find(handle_on_device);
+      if (item_handle == items_.end()) {
+        return errors::Internal("LocalHandle ", handle_on_device,
+                                " for handle ", *handle,
+                                " not found in items.");
+      }
+      item_handle->second->Ref();
+      return Status::OK();
+    }
   }
 
   Status s;
@@ -536,6 +568,7 @@ Status FunctionLibraryRuntimeImpl::ReleaseHandle(Handle handle) {
   }
 
   LocalHandle h = parent_->GetHandleOnDevice(device_name_, handle);
+  CHECK_NE(h, kInvalidLocalHandle);
   mutex_lock l(mu_);
   CHECK_EQ(1, items_.count(h));
   Item* item = items_[h];
@@ -768,6 +801,9 @@ void FunctionLibraryRuntimeImpl::Run(const Options& opts, Handle handle,
     return;
   }
 
+  if (run_opts.runner == nullptr) {
+    run_opts.runner = &default_runner_;
+  }
   DCHECK(run_opts.runner != nullptr);
 
   Executor::Args* exec_args = new Executor::Args;
@@ -854,6 +890,9 @@ void FunctionLibraryRuntimeImpl::Run(const Options& opts, Handle handle,
     done(s);
     return;
   }
+  if (run_opts.runner == nullptr) {
+    run_opts.runner = &default_runner_;
+  }
   DCHECK(run_opts.runner != nullptr);
 
   Executor::Args* exec_args = new Executor::Args;
@@ -942,21 +981,21 @@ void RegisterDefaultCustomKernelCreator(CustomKernelCreator cb) {
 std::unique_ptr<FunctionLibraryRuntime> NewFunctionLibraryRuntime(
     const DeviceMgr* device_mgr, Env* env, Device* device,
     int graph_def_version, const FunctionLibraryDefinition* lib_def,
-    const OptimizerOptions& optimizer_options,
+    thread::ThreadPool* thread_pool, const OptimizerOptions& optimizer_options,
     CustomKernelCreator custom_kernel_creator,
     ProcessFunctionLibraryRuntime* parent) {
   return std::unique_ptr<FunctionLibraryRuntime>(new FunctionLibraryRuntimeImpl(
-      device_mgr, env, device, graph_def_version, lib_def, optimizer_options,
-      std::move(custom_kernel_creator), parent));
+      device_mgr, env, device, graph_def_version, lib_def, thread_pool,
+      optimizer_options, std::move(custom_kernel_creator), parent));
 }
 
 std::unique_ptr<FunctionLibraryRuntime> NewFunctionLibraryRuntime(
     const DeviceMgr* device_mgr, Env* env, Device* device,
     int graph_def_version, const FunctionLibraryDefinition* lib_def,
-    const OptimizerOptions& optimizer_options,
+    thread::ThreadPool* thread_pool, const OptimizerOptions& optimizer_options,
     ProcessFunctionLibraryRuntime* parent) {
   return NewFunctionLibraryRuntime(device_mgr, env, device, graph_def_version,
-                                   lib_def, optimizer_options,
+                                   lib_def, thread_pool, optimizer_options,
                                    GetCustomCreatorSingleton()->Get(), parent);
 }
 
diff --git a/tensorflow/core/common_runtime/function.h b/tensorflow/core/common_runtime/function.h
index 477340d87a3c6ae311b3f5125af9b4fb0f66c6ad..a0f9fcae0aaf63c62ef194f5cb8e84d2d53b321a 100644
--- a/tensorflow/core/common_runtime/function.h
+++ b/tensorflow/core/common_runtime/function.h
@@ -55,7 +55,7 @@ void RegisterDefaultCustomKernelCreator(CustomKernelCreator cb);
 std::unique_ptr<FunctionLibraryRuntime> NewFunctionLibraryRuntime(
     const DeviceMgr* device_mgr, Env* env, Device* device,
     int graph_def_version, const FunctionLibraryDefinition* lib_def,
-    const OptimizerOptions& optimizer_options,
+    thread::ThreadPool* thread_pool, const OptimizerOptions& optimizer_options,
     CustomKernelCreator custom_kernel_creator,
     ProcessFunctionLibraryRuntime* parent);
 
@@ -65,7 +65,7 @@ std::unique_ptr<FunctionLibraryRuntime> NewFunctionLibraryRuntime(
 std::unique_ptr<FunctionLibraryRuntime> NewFunctionLibraryRuntime(
     const DeviceMgr* device_mgr, Env* env, Device* device,
     int graph_def_version, const FunctionLibraryDefinition* lib_def,
-    const OptimizerOptions& optimizer_options,
+    thread::ThreadPool* thread_pool, const OptimizerOptions& optimizer_options,
     ProcessFunctionLibraryRuntime* parent);
 
 // FunctionLibraryRuntime::GetFunctionBody returns a description of an
diff --git a/tensorflow/core/common_runtime/function_test.cc b/tensorflow/core/common_runtime/function_test.cc
index 63ad0d231c28a5af144b61e967a73e8ecfe6049a..d17ef4d4590e5932e43a0bb01fe1e05ab2c4f873 100644
--- a/tensorflow/core/common_runtime/function_test.cc
+++ b/tensorflow/core/common_runtime/function_test.cc
@@ -38,6 +38,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/notification.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/lib/core/threadpool.h"
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/public/session_options.h"
 #include "tensorflow/core/public/version.h"
@@ -135,7 +136,8 @@ TEST_F(FunctionTest, WXPlusB) {
 
 class FunctionLibraryRuntimeTest : public ::testing::Test {
  protected:
-  void Init(const std::vector<FunctionDef>& flib) {
+  void Init(const std::vector<FunctionDef>& flib,
+            thread::ThreadPool* default_thread_pool = nullptr) {
     SessionOptions options;
     auto* device_count = options.config.mutable_device_count();
     device_count->insert({"CPU", 3});
@@ -149,7 +151,7 @@ class FunctionLibraryRuntimeTest : public ::testing::Test {
     device_mgr_.reset(new DeviceMgr(devices_));
     pflr_.reset(new ProcessFunctionLibraryRuntime(
         device_mgr_.get(), Env::Default(), TF_GRAPH_DEF_VERSION, lib_def_.get(),
-        opts, nullptr /* cluster_flr */));
+        opts, default_thread_pool, nullptr /* cluster_flr */));
     flr0_ = pflr_->GetFLR("/job:localhost/replica:0/task:0/cpu:0");
     flr1_ = pflr_->GetFLR("/job:localhost/replica:0/task:0/cpu:1");
     flr2_ = pflr_->GetFLR("/job:localhost/replica:0/task:0/cpu:2");
@@ -158,16 +160,20 @@ class FunctionLibraryRuntimeTest : public ::testing::Test {
 
   Status Run(FunctionLibraryRuntime* flr, FunctionLibraryRuntime::Handle handle,
              FunctionLibraryRuntime::Options opts,
-             const std::vector<Tensor>& args, std::vector<Tensor*> rets) {
+             const std::vector<Tensor>& args, std::vector<Tensor*> rets,
+             bool add_runner = true) {
     std::atomic<int32> call_count(0);
     std::function<void(std::function<void()>)> runner =
         [&call_count](std::function<void()> fn) {
           ++call_count;
           test::function::FunctionTestSchedClosure(fn);
         };
-
+    if (add_runner) {
+      opts.runner = &runner;
+    } else {
+      opts.runner = nullptr;
+    }
     Notification done;
-    opts.runner = &runner;
     std::vector<Tensor> out;
     Status status;
     flr->Run(opts, handle, args, &out, [&status, &done](const Status& s) {
@@ -183,7 +189,9 @@ class FunctionLibraryRuntimeTest : public ::testing::Test {
       *rets[i] = out[i];
     }
 
-    EXPECT_GE(call_count, 1);  // Test runner is used.
+    if (add_runner) {
+      EXPECT_GE(call_count, 1);  // Test runner is used.
+    }
 
     return Status::OK();
   }
@@ -204,24 +212,25 @@ class FunctionLibraryRuntimeTest : public ::testing::Test {
   Status InstantiateAndRun(FunctionLibraryRuntime* flr, const string& name,
                            test::function::Attrs attrs,
                            const std::vector<Tensor>& args,
-                           std::vector<Tensor*> rets) {
+                           std::vector<Tensor*> rets, bool add_runner = true) {
     return InstantiateAndRun(flr, name, attrs,
                              FunctionLibraryRuntime::InstantiateOptions(), args,
-                             std::move(rets));
+                             std::move(rets), add_runner);
   }
 
   Status InstantiateAndRun(
       FunctionLibraryRuntime* flr, const string& name,
       test::function::Attrs attrs,
       const FunctionLibraryRuntime::InstantiateOptions& options,
-      const std::vector<Tensor>& args, std::vector<Tensor*> rets) {
+      const std::vector<Tensor>& args, std::vector<Tensor*> rets,
+      bool add_runner = true) {
     FunctionLibraryRuntime::Handle handle;
     Status status = flr->Instantiate(name, attrs, options, &handle);
     if (!status.ok()) {
       return status;
     }
     FunctionLibraryRuntime::Options opts;
-    status = Run(flr, handle, opts, args, rets);
+    status = Run(flr, handle, opts, args, rets, add_runner);
     if (!status.ok()) return status;
 
     // Release the handle and try running again. It should not succeed.
@@ -237,16 +246,20 @@ class FunctionLibraryRuntimeTest : public ::testing::Test {
   }
 
   Status Run(FunctionLibraryRuntime* flr, FunctionLibraryRuntime::Handle handle,
-             FunctionLibraryRuntime::Options opts, CallFrameInterface* frame) {
+             FunctionLibraryRuntime::Options opts, CallFrameInterface* frame,
+             bool add_runner = true) {
     std::atomic<int32> call_count(0);
     std::function<void(std::function<void()>)> runner =
         [&call_count](std::function<void()> fn) {
           ++call_count;
           test::function::FunctionTestSchedClosure(fn);
         };
-
+    if (add_runner) {
+      opts.runner = &runner;
+    } else {
+      opts.runner = nullptr;
+    }
     Notification done;
-    opts.runner = &runner;
     std::vector<Tensor> out;
     Status status;
     flr->Run(opts, handle, frame, [&status, &done](const Status& s) {
@@ -258,7 +271,9 @@ class FunctionLibraryRuntimeTest : public ::testing::Test {
       return status;
     }
 
-    EXPECT_GE(call_count, 1);  // Test runner is used.
+    if (add_runner) {
+      EXPECT_GE(call_count, 1);  // Test runner is used.
+    }
 
     return Status::OK();
   }
@@ -447,7 +462,7 @@ TEST_F(FunctionLibraryRuntimeTest, StateHandle) {
   {
     // Simple case: instantiating with no state_handle.
     for (int32 expected : {6, 4}) {
-      TF_CHECK_OK(Run(flr0_, handle, opts, {}, {&y}));
+      TF_CHECK_OK(Run(flr0_, handle, opts, {}, {&y}, true));
       test::ExpectTensorEqual<int>(y, test::AsTensor<int32>({expected}));
     }
   }
@@ -460,7 +475,7 @@ TEST_F(FunctionLibraryRuntimeTest, StateHandle) {
         Instantiate(flr0_, "RandomUniformWrapper", {}, &handle_non_isolated));
     EXPECT_EQ(handle, handle_non_isolated);
     for (int32 expected : {0, 1}) {
-      TF_CHECK_OK(Run(flr0_, handle_non_isolated, opts, {}, {&y}));
+      TF_CHECK_OK(Run(flr0_, handle_non_isolated, opts, {}, {&y}, true));
       test::ExpectTensorEqual<int>(y, test::AsTensor<int32>({expected}));
     }
   }
@@ -475,7 +490,7 @@ TEST_F(FunctionLibraryRuntimeTest, StateHandle) {
                             &handle_isolated));
     EXPECT_NE(handle, handle_isolated);
     for (int32 expected : {6, 4, 0, 1}) {
-      TF_CHECK_OK(Run(flr0_, handle_isolated, opts, {}, {&y}));
+      TF_CHECK_OK(Run(flr0_, handle_isolated, opts, {}, {&y}, true));
       test::ExpectTensorEqual<int>(y, test::AsTensor<int32>({expected}));
     }
   }
@@ -490,7 +505,7 @@ TEST_F(FunctionLibraryRuntimeTest, StateHandle) {
                             &handle_isolated));
     EXPECT_NE(handle, handle_isolated);
     for (int32 expected : {6, 4, 0, 1}) {
-      TF_CHECK_OK(Run(flr0_, handle_isolated, opts, {}, {&y}));
+      TF_CHECK_OK(Run(flr0_, handle_isolated, opts, {}, {&y}, true));
       test::ExpectTensorEqual<int>(y, test::AsTensor<int32>({expected}));
     }
   }
@@ -507,7 +522,7 @@ TEST_F(FunctionLibraryRuntimeTest, StateHandle) {
                               &handle_isolated));
       EXPECT_NE(handle, handle_isolated);
       for (int32 expected : {6, 4, 0, 1}) {
-        TF_CHECK_OK(Run(flr0_, handle_isolated, opts, {}, {&y}));
+        TF_CHECK_OK(Run(flr0_, handle_isolated, opts, {}, {&y}, true));
         test::ExpectTensorEqual<int>(y, test::AsTensor<int32>({expected}));
       }
       TF_CHECK_OK(flr0_->ReleaseHandle(handle_isolated));
@@ -1247,14 +1262,14 @@ TEST_F(FunctionLibraryRuntimeTest, CrossDevice) {
   opts.rendezvous = new IntraProcessRendezvous(device_mgr_.get());
   opts.source_device = "/device:CPU:1";
   // Run on flr1_, flr2_ and make sure that the device it ran on was cpu:1.
-  TF_CHECK_OK(Run(flr1_, handle, opts, {}, {&y}));
+  TF_CHECK_OK(Run(flr1_, handle, opts, {}, {&y}, true));
   test::ExpectTensorEqual<string>(
       y,
       test::AsTensor<string>({"/job:localhost/replica:0/task:0/device:CPU:1"},
                              TensorShape({})));
   opts.remote_execution = true;
   opts.source_device = "/job:localhost/replica:0/task:0/cpu:2";
-  TF_CHECK_OK(Run(flr2_, handle, opts, {}, {&y}));
+  TF_CHECK_OK(Run(flr2_, handle, opts, {}, {&y}, true));
   test::ExpectTensorEqual<string>(
       y,
       test::AsTensor<string>({"/job:localhost/replica:0/task:0/device:CPU:1"},
diff --git a/tensorflow/core/common_runtime/function_testlib.cc b/tensorflow/core/common_runtime/function_testlib.cc
index 87733ed2dbe931c6bb64fd065d2691072d4eced0..1720ee64c07ae744ebb6f0d4ac89738a4e4419e7 100644
--- a/tensorflow/core/common_runtime/function_testlib.cc
+++ b/tensorflow/core/common_runtime/function_testlib.cc
@@ -58,6 +58,59 @@ FunctionDef FindDevice() {
       {{{"device_name"}, "FindDeviceOp", {}, {}}});
 }
 
+void BlockingOpState::AwaitState(int awaiting_state) {
+  mutex_lock ml(mu_);
+  while (state_ != awaiting_state) {
+    cv_.wait(ml);
+  }
+}
+
+void BlockingOpState::MoveToState(int expected_current, int next) {
+  mutex_lock ml(mu_);
+  CHECK_EQ(expected_current, state_);
+  state_ = next;
+  cv_.notify_all();
+}
+
+BlockingOpState* blocking_op_state = nullptr;
+
+// BlockingOp moves the global blocking_op_state from 0 to 1, blocks until it
+// reaches 2, then moves it to 3 and copies its input to its output.
+class BlockingOp : public OpKernel {
+ public:
+  explicit BlockingOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+  void Compute(OpKernelContext* ctx) override {
+    blocking_op_state->MoveToState(0, 1);
+    blocking_op_state->AwaitState(2);
+    blocking_op_state->MoveToState(2, 3);
+
+    Tensor* out = nullptr;
+    const Tensor& in = ctx->input(0);
+    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, in.shape(), &out));
+    out->flat<float>() = in.flat<float>();
+  }
+};
+REGISTER_KERNEL_BUILDER(Name("BlockingOp").Device(DEVICE_CPU), BlockingOp);
+REGISTER_OP("BlockingOp")
+    .Input("x: float")
+    .Output("y: float")
+    .Doc("")
+    .SetShapeFn(shape_inference::UnknownShape);
+
+FunctionDef BlockingOpFn() {
+  return FDH::Define(
+      // Name
+      "BlockingOpFn",
+      // Args
+      {"x: float"},
+      // Return values
+      {"y: float"},
+      // Attr def
+      {},
+      // Nodes
+      {{{"y"}, "BlockingOp", {"x"}, {}}});
+}
+
 // TODO(phawkins): replace with C++ API for calling functions, when that exists.
 Output Call(Scope* scope, const string& op_name, const string& fn_name,
             gtl::ArraySlice<Input> inputs) {
diff --git a/tensorflow/core/common_runtime/function_testlib.h b/tensorflow/core/common_runtime/function_testlib.h
index 3ddb26de929dc19792142dffde345672aafaadce..fb967a612336ee28f9e5abf186a380203dd4f0c3 100644
--- a/tensorflow/core/common_runtime/function_testlib.h
+++ b/tensorflow/core/common_runtime/function_testlib.h
@@ -25,6 +25,22 @@ namespace function {
 // {} -> y:DT_STRING (device where this op runs).
 FunctionDef FindDevice();
 
+class BlockingOpState {
+ public:
+  void AwaitState(int awaiting_state);
+
+  void MoveToState(int expected_current, int next);
+
+ private:
+  mutex mu_;
+  condition_variable cv_;
+  int state_ = 0;
+};
+
+extern BlockingOpState* blocking_op_state;
+
+FunctionDef BlockingOpFn();
+
 // Adds a function call to the given scope and returns the output for the node.
 // TODO(phawkins): replace with C++ API for calling functions, when that exists.
 Output Call(Scope* scope, const string& op_name, const string& fn_name,
diff --git a/tensorflow/core/common_runtime/function_threadpool_test.cc b/tensorflow/core/common_runtime/function_threadpool_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..6223a4e648baeb9cd7a2595c74881cddbf9a6f0b
--- /dev/null
+++ b/tensorflow/core/common_runtime/function_threadpool_test.cc
@@ -0,0 +1,258 @@
+/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/function.h"
+
+#include <atomic>
+#include <utility>
+
+#include "tensorflow/cc/ops/array_ops_internal.h"
+#include "tensorflow/cc/ops/function_ops.h"
+#include "tensorflow/cc/ops/functional_ops.h"
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/common_runtime/device.h"
+#include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/common_runtime/executor.h"
+#include "tensorflow/core/common_runtime/function_testlib.h"
+#include "tensorflow/core/common_runtime/rendezvous_mgr.h"
+#include "tensorflow/core/common_runtime/step_stats_collector.h"
+#include "tensorflow/core/framework/function.h"
+#include "tensorflow/core/framework/function_testlib.h"
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/framework/versions.pb.h"
+#include "tensorflow/core/graph/graph_constructor.h"
+#include "tensorflow/core/lib/core/notification.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/lib/core/threadpool.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/public/session_options.h"
+#include "tensorflow/core/public/version.h"
+#include "tensorflow/core/util/equal_graph_def.h"
+
+namespace tensorflow {
+namespace {
+
+class FunctionLibraryRuntimeTest : public ::testing::Test {
+ protected:
+  void Init(const std::vector<FunctionDef>& flib,
+            thread::ThreadPool* default_thread_pool = nullptr) {
+    SessionOptions options;
+    auto* device_count = options.config.mutable_device_count();
+    device_count->insert({"CPU", 3});
+    TF_CHECK_OK(DeviceFactory::AddDevices(
+        options, "/job:localhost/replica:0/task:0", &devices_));
+
+    FunctionDefLibrary proto;
+    for (const auto& fdef : flib) *(proto.add_function()) = fdef;
+    lib_def_.reset(new FunctionLibraryDefinition(OpRegistry::Global(), proto));
+    OptimizerOptions opts;
+    device_mgr_.reset(new DeviceMgr(devices_));
+    pflr_.reset(new ProcessFunctionLibraryRuntime(
+        device_mgr_.get(), Env::Default(), TF_GRAPH_DEF_VERSION, lib_def_.get(),
+        opts, default_thread_pool, nullptr /* cluster_flr */));
+    flr0_ = pflr_->GetFLR("/job:localhost/replica:0/task:0/cpu:0");
+    flr1_ = pflr_->GetFLR("/job:localhost/replica:0/task:0/cpu:1");
+    flr2_ = pflr_->GetFLR("/job:localhost/replica:0/task:0/cpu:2");
+    fdef_lib_ = lib_def_->ToProto();
+  }
+
+  Status Run(FunctionLibraryRuntime* flr, FunctionLibraryRuntime::Handle handle,
+             FunctionLibraryRuntime::Options opts,
+             const std::vector<Tensor>& args, std::vector<Tensor*> rets,
+             bool add_runner = true) {
+    std::atomic<int32> call_count(0);
+    std::function<void(std::function<void()>)> runner =
+        [&call_count](std::function<void()> fn) {
+          ++call_count;
+          test::function::FunctionTestSchedClosure(fn);
+        };
+    if (add_runner) {
+      opts.runner = &runner;
+    } else {
+      opts.runner = nullptr;
+    }
+    Notification done;
+    std::vector<Tensor> out;
+    Status status;
+    flr->Run(opts, handle, args, &out, [&status, &done](const Status& s) {
+      status = s;
+      done.Notify();
+    });
+    done.WaitForNotification();
+    if (!status.ok()) {
+      return status;
+    }
+    CHECK_EQ(rets.size(), out.size());
+    for (size_t i = 0; i < rets.size(); ++i) {
+      *rets[i] = out[i];
+    }
+
+    if (add_runner) {
+      EXPECT_GE(call_count, 1);  // Test runner is used.
+    }
+
+    return Status::OK();
+  }
+
+  Status Instantiate(FunctionLibraryRuntime* flr, const string& name,
+                     test::function::Attrs attrs,
+                     FunctionLibraryRuntime::Handle* handle) {
+    return flr->Instantiate(name, attrs, handle);
+  }
+
+  Status Instantiate(FunctionLibraryRuntime* flr, const string& name,
+                     test::function::Attrs attrs,
+                     const FunctionLibraryRuntime::InstantiateOptions& options,
+                     FunctionLibraryRuntime::Handle* handle) {
+    return flr->Instantiate(name, attrs, options, handle);
+  }
+
+  Status InstantiateAndRun(FunctionLibraryRuntime* flr, const string& name,
+                           test::function::Attrs attrs,
+                           const std::vector<Tensor>& args,
+                           std::vector<Tensor*> rets, bool add_runner = true) {
+    return InstantiateAndRun(flr, name, attrs,
+                             FunctionLibraryRuntime::InstantiateOptions(), args,
+                             std::move(rets), add_runner);
+  }
+
+  Status InstantiateAndRun(
+      FunctionLibraryRuntime* flr, const string& name,
+      test::function::Attrs attrs,
+      const FunctionLibraryRuntime::InstantiateOptions& options,
+      const std::vector<Tensor>& args, std::vector<Tensor*> rets,
+      bool add_runner = true) {
+    FunctionLibraryRuntime::Handle handle;
+    Status status = flr->Instantiate(name, attrs, options, &handle);
+    if (!status.ok()) {
+      return status;
+    }
+    FunctionLibraryRuntime::Options opts;
+    status = Run(flr, handle, opts, args, rets, add_runner);
+    if (!status.ok()) return status;
+
+    // Release the handle and try running again. It should not succeed.
+    status = flr->ReleaseHandle(handle);
+    if (!status.ok()) return status;
+
+    Status status2 = Run(flr, handle, opts, args, std::move(rets));
+    EXPECT_TRUE(errors::IsInvalidArgument(status2));
+    EXPECT_TRUE(
+        StringPiece(status2.error_message()).contains("remote execution."));
+
+    return status;
+  }
+
+  Status Run(FunctionLibraryRuntime* flr, FunctionLibraryRuntime::Handle handle,
+             FunctionLibraryRuntime::Options opts, CallFrameInterface* frame,
+             bool add_runner = true) {
+    std::atomic<int32> call_count(0);
+    std::function<void(std::function<void()>)> runner =
+        [&call_count](std::function<void()> fn) {
+          ++call_count;
+          test::function::FunctionTestSchedClosure(fn);
+        };
+    if (add_runner) {
+      opts.runner = &runner;
+    } else {
+      opts.runner = nullptr;
+    }
+    Notification done;
+    std::vector<Tensor> out;
+    Status status;
+    flr->Run(opts, handle, frame, [&status, &done](const Status& s) {
+      status = s;
+      done.Notify();
+    });
+    done.WaitForNotification();
+    if (!status.ok()) {
+      return status;
+    }
+
+    if (add_runner) {
+      EXPECT_GE(call_count, 1);  // Test runner is used.
+    }
+
+    return Status::OK();
+  }
+
+  FunctionLibraryRuntime* flr0_;
+  FunctionLibraryRuntime* flr1_;
+  FunctionLibraryRuntime* flr2_;
+  std::vector<Device*> devices_;
+  std::unique_ptr<DeviceMgr> device_mgr_;
+  std::unique_ptr<FunctionLibraryDefinition> lib_def_;
+  std::unique_ptr<ProcessFunctionLibraryRuntime> pflr_;
+  FunctionDefLibrary fdef_lib_;
+};
+
+TEST_F(FunctionLibraryRuntimeTest, DefaultThreadpool) {
+  using test::function::blocking_op_state;
+  using test::function::BlockingOpState;
+
+  thread::ThreadPool* tp = new thread::ThreadPool(Env::Default(), "FLRTest", 1);
+  Init({test::function::BlockingOpFn(), test::function::XTimesTwo()}, tp);
+
+  auto x = test::AsScalar<float>(1.3);
+  Tensor y;
+  blocking_op_state = new BlockingOpState();
+
+  thread::ThreadPool* tp1 = new thread::ThreadPool(Env::Default(), "tp1", 5);
+  bool finished_running = false;
+  tp1->Schedule([&x, &y, &finished_running, this]() {
+    TF_CHECK_OK(InstantiateAndRun(flr0_, "BlockingOpFn", {}, {x}, {&y},
+                                  false /* add_runner */));
+    finished_running = true;
+  });
+
+  // InstantiateAndRun shouldn't finish because BlockingOpFn should be blocked.
+  EXPECT_FALSE(finished_running);
+
+  FunctionLibraryRuntime::Handle h;
+  TF_CHECK_OK(Instantiate(flr0_, "XTimesTwo", {{"T", DT_FLOAT}}, &h));
+
+  auto x1 = test::AsTensor<float>({1, 2, 3, 4});
+  std::atomic<int32> num_done(0);
+  FunctionLibraryRuntime::Options opts;
+  for (int i = 0; i < 4; ++i) {
+    tp1->Schedule([&h, &x1, &opts, &num_done, this]() {
+      Tensor y1;
+      TF_CHECK_OK(Run(flr0_, h, opts, {x1}, {&y1}, false /* add_runner */));
+      num_done.fetch_add(1);
+    });
+  }
+  // All 4 Run() calls should be blocked: the pool's single thread is occupied.
+  EXPECT_EQ(0, num_done.load());
+
+  blocking_op_state->AwaitState(1);
+  blocking_op_state->MoveToState(1, 2);
+  // Now the runner should be unblocked and all the other Run() calls should
+  // proceed.
+  blocking_op_state->AwaitState(3);
+  blocking_op_state->MoveToState(3, 0);
+  delete tp1;
+  EXPECT_TRUE(finished_running);
+  EXPECT_EQ(4, num_done.load());
+
+  delete blocking_op_state;
+  blocking_op_state = nullptr;
+  delete tp;
+}
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/gpu/gpu_device.cc b/tensorflow/core/common_runtime/gpu/gpu_device.cc
index 8357cc5a7201b3b590c6965648eed72116167459..52fd20e479918dea8fb07694e21a8db9fede8467 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_device.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_device.cc
@@ -840,6 +840,17 @@ void BaseGPUDevice::ReinitializeGpuDevice(OpKernelContext* context,
   }
 }
 
+Allocator* BaseGPUDevice::GetScopedAllocator(AllocatorAttributes attr,
+                                             int64 step_id) {
+  if (attr.scope_id > 0) {
+    return scoped_allocator_mgr_->GetContainer(step_id)->GetInstance(
+        attr.scope_id);
+  }
+  LOG(FATAL) << "Unexpected call to BaseGPUDevice::GetScopedAllocator "
+             << "attr.scope_id = " << attr.scope_id;
+  return gpu_allocator_;
+}
+
 const int BaseGPUDeviceFactory::InterconnectMap::kSameDeviceStrength = 1000;
 const int BaseGPUDeviceFactory::InterconnectMap::kStreamExecutorStrength = 1;
 
diff --git a/tensorflow/core/common_runtime/gpu/gpu_device.h b/tensorflow/core/common_runtime/gpu/gpu_device.h
index c88daa8ff87589a3fc48f4c7693d073d6adf9a5a..cc5c3881dd24fec24c027406d8e4577e81042433 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_device.h
+++ b/tensorflow/core/common_runtime/gpu/gpu_device.h
@@ -17,8 +17,8 @@ limitations under the License.
 #error This file must only be included when building with Cuda support
 #endif
 
-#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_GPU_DEVICE_H_
-#define TENSORFLOW_COMMON_RUNTIME_GPU_GPU_DEVICE_H_
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_GPU_GPU_DEVICE_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_GPU_GPU_DEVICE_H_
 
 #include <memory>
 #include <string>
@@ -33,6 +33,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/gpu/gpu_id_utils.h"
 #include "tensorflow/core/common_runtime/gpu_device_context.h"
 #include "tensorflow/core/common_runtime/local_device.h"
+#include "tensorflow/core/common_runtime/scoped_allocator_mgr.h"
 #include "tensorflow/core/framework/allocator.h"
 #include "tensorflow/core/framework/device_base.h"
 #include "tensorflow/core/framework/op_kernel.h"
@@ -68,7 +69,7 @@ class BaseGPUDevice : public LocalDevice {
       const TensorReferenceVector& tensor_refs) override;
 
   Status FillContextMap(const Graph* graph,
-                        DeviceContextMap* device_context_map);
+                        DeviceContextMap* device_context_map) override;
 
   void Compute(OpKernel* op_kernel, OpKernelContext* context) override;
 
@@ -95,11 +96,19 @@ class BaseGPUDevice : public LocalDevice {
   // corresponds to the cuda context.
   gpu::StreamExecutor* executor() const { return executor_; }
 
+  Allocator* GetScopedAllocator(AllocatorAttributes attr,
+                                int64 step_id) override;
+
+  ScopedAllocatorMgr* GetScopedAllocatorMgr() const override {
+    return scoped_allocator_mgr_.get();
+  }
+
  protected:
   Allocator* gpu_allocator_;  // not owned
   Allocator* cpu_allocator_;  // not owned
 
   gpu::StreamExecutor* executor_;  // not owned
+  std::unique_ptr<ScopedAllocatorMgr> scoped_allocator_mgr_;
 
  private:
   struct StreamGroup {
@@ -205,4 +214,4 @@ class BaseGPUDeviceFactory : public DeviceFactory {
 
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_COMMON_RUNTIME_GPU_GPU_DEVICE_H_
+#endif  // TENSORFLOW_CORE_COMMON_RUNTIME_GPU_GPU_DEVICE_H_
diff --git a/tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc b/tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc
index 2452efc77952a89dd7b01989f684ac04a8a5ca90..af6a59a85df1cf3dc6a78c4eb81b78a61d095954 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc
@@ -30,10 +30,6 @@ EventMgr::EventMgr(gpu::StreamExecutor* se, const GPUOptions& gpu_options)
       polling_active_delay_usecs_(gpu_options.polling_active_delay_usecs()
                                       ? gpu_options.polling_active_delay_usecs()
                                       : 10),
-      polling_inactive_delay_msecs_(
-          gpu_options.polling_inactive_delay_msecs()
-              ? gpu_options.polling_inactive_delay_msecs()
-              : 1),
       accumulated_stream_(nullptr),
       accumulated_tensors_(new TensorReferenceVector),
       accumulated_tensor_bytes_(0),
@@ -78,16 +74,22 @@ EventMgr::~EventMgr() {
 
 void EventMgr::StartPollingLoop() {
   CHECK(polling_stopped_ == nullptr);
-  stop_polling_.reset(new Notification);
+  {
+    mutex_lock l(mu_);
+    stop_polling_ = false;
+  }
   polling_stopped_.reset(new Notification);
   threadpool_.Schedule([this]() { PollLoop(); });
 }
 
 void EventMgr::StopPollingLoop() {
-  if (stop_polling_) {
-    stop_polling_->Notify();
+  if (polling_stopped_) {
+    {
+      mutex_lock l(mu_);
+      stop_polling_ = true;
+      events_pending_.notify_all();
+    }
     polling_stopped_->WaitForNotification();
-    stop_polling_.reset(nullptr);
     polling_stopped_.reset(nullptr);
   }
 }
@@ -121,28 +123,31 @@ void EventMgr::FlushAccumulatedTensors() {
   accumulated_stream_ = nullptr;
 }
 
-// A polling loop to detect completion of GPU events.  There's a
-// tradeoff between achieving low latency detection, which argues for
-// little delay between calls, and minimizing CPU use and lock
-// contention, which argue for longer delay.  The current strategy is
-// to poll frequently when the queue is non-empty, and infrequently
-// otherwise.
+// A polling loop to detect completion of GPU events.
+//
+// While one or more events are outstanding, poll for completed events.  When
+// no events are outstanding, sleep until one is enqueued.
 void EventMgr::PollLoop() {
-  bool queue_empty = false;
-  while (!stop_polling_->HasBeenNotified()) {
-    if (queue_empty) {
-      mutex_lock l(mu_);
-      WaitForMilliseconds(&l, &events_pending_, polling_inactive_delay_msecs_);
-    } else {
-      Env::Default()->SleepForMicroseconds(polling_active_delay_usecs_);
-    }
-    ToFreeVector to_free;
+  ToFreeVector to_free;
+  while (true) {
+    bool events_still_pending;
     {
       mutex_lock l(mu_);
+      if (stop_polling_) {
+        break;
+      }
+      if (used_events_.empty()) {
+        events_pending_.wait(l);
+      }
       PollEvents(true, &to_free);
-      queue_empty = used_events_.empty();
+      events_still_pending = !used_events_.empty();
     }
     FreeMemory(to_free);
+    to_free.clear();
+
+    if (events_still_pending) {
+      Env::Default()->SleepForMicroseconds(polling_active_delay_usecs_);
+    }
   }
   polling_stopped_->Notify();
 }
diff --git a/tensorflow/core/common_runtime/gpu/gpu_event_mgr.h b/tensorflow/core/common_runtime/gpu/gpu_event_mgr.h
index 9692b24084ab577c3d27ed32248d430fd0d65fa0..d23898e1f26a2e0c8363f6080c0b8e301ec7fd67 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_event_mgr.h
+++ b/tensorflow/core/common_runtime/gpu/gpu_event_mgr.h
@@ -94,7 +94,6 @@ class EventMgr {
   perftools::gputools::StreamExecutor* const exec_;
   const int64 deferred_bytes_threshold_;
   const int32 polling_active_delay_usecs_;
-  const int32 polling_inactive_delay_msecs_;
   mutex mu_;
   condition_variable events_pending_ GUARDED_BY(mu_);
 
@@ -180,7 +179,7 @@ class EventMgr {
   // A FIFO queue of InUse events and associated tensors.
   std::deque<InUse> used_events_ GUARDED_BY(mu_);
 
-  std::unique_ptr<Notification> stop_polling_;
+  bool stop_polling_ GUARDED_BY(mu_);
   std::unique_ptr<Notification> polling_stopped_;
 
   // The main PollLoop for the event manager runs in this threadpool.
diff --git a/tensorflow/core/common_runtime/gpu/gpu_id_manager.cc b/tensorflow/core/common_runtime/gpu/gpu_id_manager.cc
index 207afdca75642b14c1617c8abae4fd5e9916f020..7dfff3269cf91582adf783dcd15dd55d1c4e1451 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_id_manager.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_id_manager.cc
@@ -18,7 +18,10 @@ limitations under the License.
 #include <unordered_map>
 
 #include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/mutex.h"
 
 namespace tensorflow {
@@ -27,8 +30,8 @@ namespace {
 class TfToCudaGpuIdMap {
  public:
   static TfToCudaGpuIdMap* singleton() {
-    static auto* manager = new TfToCudaGpuIdMap;
-    return manager;
+    static auto* id_map = new TfToCudaGpuIdMap;
+    return id_map;
   }
 
   void InsertOrDie(TfGpuId tf_gpu_id, CudaGpuId cuda_gpu_id)
@@ -47,18 +50,41 @@ class TfToCudaGpuIdMap {
     }
   }
 
-  int32 FindOrDie(TfGpuId tf_gpu_id) const LOCKS_EXCLUDED(mu_) {
+  CudaGpuId FindOrDie(TfGpuId tf_gpu_id) const LOCKS_EXCLUDED(mu_) {
     mutex_lock lock(mu_);
+    return FindOrDieLocked(tf_gpu_id);
+  }
+
+  bool Find(TfGpuId tf_gpu_id, CudaGpuId* cuda_gpu_id) const
+      LOCKS_EXCLUDED(mu_) {
+    mutex_lock lock(mu_);
+    if (id_map_.count(tf_gpu_id.value()) == 0) return false;
+    *cuda_gpu_id = FindOrDieLocked(tf_gpu_id);
+    return true;
+  }
+
+ private:
+  TfToCudaGpuIdMap() = default;
+
+  CudaGpuId FindOrDieLocked(TfGpuId tf_gpu_id) const
+      EXCLUSIVE_LOCKS_REQUIRED(mu_) {
     auto result = id_map_.find(tf_gpu_id.value());
     CHECK(result != id_map_.end())
         << "Could not find the mapping for TfGpuId: " << tf_gpu_id;
-    return result->second;
+    return CudaGpuId(result->second);
+  }
+
+  void TestOnlyReset() LOCKS_EXCLUDED(mu_) {
+    mutex_lock lock(mu_);
+    id_map_.clear();
   }
 
- private:
   using IdMapType = std::unordered_map<int32, int32>;
   mutable mutex mu_;
   IdMapType id_map_ GUARDED_BY(mu_);
+
+  friend class ::tensorflow::GpuIdManager;
+  TF_DISALLOW_COPY_AND_ASSIGN(TfToCudaGpuIdMap);
 };
 }  // namespace
 
@@ -67,8 +93,20 @@ void GpuIdManager::InsertTfCudaGpuIdPair(TfGpuId tf_gpu_id,
   TfToCudaGpuIdMap::singleton()->InsertOrDie(tf_gpu_id, cuda_gpu_id);
 }
 
+Status GpuIdManager::TfToCudaGpuId(TfGpuId tf_gpu_id, CudaGpuId* cuda_gpu_id) {
+  if (TfToCudaGpuIdMap::singleton()->Find(tf_gpu_id, cuda_gpu_id)) {
+    return Status::OK();
+  }
+  return errors::NotFound("TF GPU device with id ", tf_gpu_id.value(),
+                          " was not registered");
+}
+
 CudaGpuId GpuIdManager::TfToCudaGpuId(TfGpuId tf_gpu_id) {
-  return CudaGpuId(TfToCudaGpuIdMap::singleton()->FindOrDie(tf_gpu_id));
+  return TfToCudaGpuIdMap::singleton()->FindOrDie(tf_gpu_id);
+}
+
+void GpuIdManager::TestOnlyReset() {
+  TfToCudaGpuIdMap::singleton()->TestOnlyReset();
 }
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/gpu/gpu_id_manager.h b/tensorflow/core/common_runtime/gpu/gpu_id_manager.h
index 33925d8c36f44a9d2c7abc8f2801f3f203bcb982..2b54cc184ca508b94e2a715642cdb13fe8a4c3e1 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_id_manager.h
+++ b/tensorflow/core/common_runtime/gpu/gpu_id_manager.h
@@ -17,15 +17,25 @@ limitations under the License.
 #define TENSORFLOW_CORE_COMMON_RUNTIME_GPU_GPU_ID_MANAGER_H_
 
 #include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/lib/core/status.h"
 
 namespace tensorflow {
 
-// Class that manages the translation between Tensorflow GPU ids and CUDA GPU
-// ids.
+// Class that maintains a map from TfGpuId to CudaGpuId, and manages the
+// translation between them.
 class GpuIdManager {
  public:
+  // Adds a mapping from tf_gpu_id to cuda_gpu_id.
   static void InsertTfCudaGpuIdPair(TfGpuId tf_gpu_id, CudaGpuId cuda_gpu_id);
+
+  // Gets the cuda_gpu_id associated with tf_gpu_id. Returns OK if found.
+  static Status TfToCudaGpuId(TfGpuId tf_gpu_id, CudaGpuId* cuda_gpu_id);
+  // Similar to the above version, but returns the result directly and
+  // CHECK-fails if no result is found.
   static CudaGpuId TfToCudaGpuId(TfGpuId tf_gpu_id);
+
+  // Clears the map. Used in unit tests only.
+  static void TestOnlyReset();
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/graph_execution_state.cc b/tensorflow/core/common_runtime/graph_execution_state.cc
index 33a5d60eb7ec4de829d3c0784f909ef42cf994d1..2f17af273ff8cdc83a112ef350fde88346c7e13d 100644
--- a/tensorflow/core/common_runtime/graph_execution_state.cc
+++ b/tensorflow/core/common_runtime/graph_execution_state.cc
@@ -73,6 +73,10 @@ GraphExecutionState::~GraphExecutionState() {
 /* static */ Status GraphExecutionState::MakeForBaseGraph(
     GraphDef* graph_def, const GraphExecutionStateOptions& options,
     std::unique_ptr<GraphExecutionState>* out_state) {
+#ifndef __ANDROID__
+  VLOG(1) << "Graph proto is " << graph_def->DebugString();
+#endif  // __ANDROID__
+
   std::unique_ptr<GraphExecutionState> ret(
       new GraphExecutionState(graph_def, options));
 
@@ -233,6 +237,42 @@ void GraphExecutionState::RestoreStatefulNodes(Graph* graph) {
   }
 }
 
+Status GraphExecutionState::PruneGraph(
+    const BuildGraphOptions& options, Graph* graph,
+    subgraph::RewriteGraphMetadata* out_rewrite_metadata) {
+  std::vector<std::unique_ptr<subgraph::PruneRewrite>> feed_rewrites;
+  feed_rewrites.reserve(options.callable_options.feed_size());
+  std::vector<std::unique_ptr<subgraph::PruneRewrite>> fetch_rewrites;
+  fetch_rewrites.reserve(options.callable_options.fetch_size());
+  const DeviceAttributes* device_info =
+      &device_set_->client_device()->attributes();
+  if (options.use_function_convention) {
+    for (int i = 0; i < options.callable_options.feed_size(); ++i) {
+      feed_rewrites.emplace_back(new subgraph::ArgFeedRewrite(
+          &options.callable_options.feed(i), device_info, i));
+    }
+    for (int i = 0; i < options.callable_options.fetch_size(); ++i) {
+      fetch_rewrites.emplace_back(new subgraph::RetvalFetchRewrite(
+          &options.callable_options.fetch(i), device_info, i));
+    }
+  } else {
+    for (const string& feed : options.callable_options.feed()) {
+      feed_rewrites.emplace_back(
+          new subgraph::RecvFeedRewrite(&feed, device_info));
+    }
+    for (const string& fetch : options.callable_options.fetch()) {
+      fetch_rewrites.emplace_back(
+          new subgraph::SendFetchRewrite(&fetch, device_info));
+    }
+  }
+  std::vector<string> target_node_names(
+      options.callable_options.target().begin(),
+      options.callable_options.target().end());
+  return subgraph::RewriteGraphForExecution(graph, feed_rewrites,
+                                            fetch_rewrites, target_node_names,
+                                            out_rewrite_metadata);
+}
+
 Status GraphExecutionState::InitBaseGraph(const BuildGraphOptions& options) {
   const GraphDef* graph_def = &original_graph_def_;
 
@@ -247,10 +287,8 @@ Status GraphExecutionState::InitBaseGraph(const BuildGraphOptions& options) {
       session_options_->config.graph_options().place_pruned_graph()) {
     // Rewrite the graph before placement.
     rewrite_metadata_.reset(new subgraph::RewriteGraphMetadata);
-    TF_RETURN_IF_ERROR(subgraph::RewriteGraphForExecution(
-        new_graph.get(), options.feed_endpoints, options.fetch_endpoints,
-        options.target_nodes, device_set_->client_device()->attributes(),
-        options.use_function_convention, rewrite_metadata_.get()));
+    TF_RETURN_IF_ERROR(
+        PruneGraph(options, new_graph.get(), rewrite_metadata_.get()));
   }
 
   // Save stateful placements before placing.
@@ -295,13 +333,16 @@ Status GraphExecutionState::OptimizeGraph(
     item.id = "tf_graph";
     graph_->ToGraphDef(&item.graph);
 
-    item.fetch = options.fetch_endpoints;
-    item.fetch.insert(item.fetch.end(), options.target_nodes.begin(),
-                      options.target_nodes.end());
+    item.fetch.insert(item.fetch.end(),
+                      options.callable_options.fetch().begin(),
+                      options.callable_options.fetch().end());
+    item.fetch.insert(item.fetch.end(),
+                      options.callable_options.target().begin(),
+                      options.callable_options.target().end());
 
-    if (!options.feed_endpoints.empty()) {
+    if (!options.callable_options.feed().empty()) {
       std::unordered_set<string> feeds;
-      for (const string& feed : options.feed_endpoints) {
+      for (const string& feed : options.callable_options.feed()) {
         TensorId id = ParseTensorName(feed);
         if (id.second != 0) {
           return errors::InvalidArgument("Unsupported feed: ", feed);
@@ -397,12 +438,7 @@ Status GraphExecutionState::BuildGraph(const BuildGraphOptions& options,
   subgraph::RewriteGraphMetadata rewrite_metadata;
   if (session_options_ == nullptr ||
       !session_options_->config.graph_options().place_pruned_graph()) {
-    // Extract the subset of the graph that needs to be run, adding feed/fetch
-    // ops as needed.
-    TF_RETURN_IF_ERROR(subgraph::RewriteGraphForExecution(
-        ng.get(), options.feed_endpoints, options.fetch_endpoints,
-        options.target_nodes, device_set_->client_device()->attributes(),
-        options.use_function_convention, &rewrite_metadata));
+    TF_RETURN_IF_ERROR(PruneGraph(options, ng.get(), &rewrite_metadata));
   } else {
     // This GraphExecutionState represents a graph that was
     // pruned when this was constructed, so we copy the metadata from
@@ -411,8 +447,10 @@ Status GraphExecutionState::BuildGraph(const BuildGraphOptions& options,
     rewrite_metadata = *rewrite_metadata_;
   }
 
-  CHECK_EQ(options.feed_endpoints.size(), rewrite_metadata.feed_types.size());
-  CHECK_EQ(options.fetch_endpoints.size(), rewrite_metadata.fetch_types.size());
+  CHECK_EQ(options.callable_options.feed_size(),
+           rewrite_metadata.feed_types.size());
+  CHECK_EQ(options.callable_options.fetch_size(),
+           rewrite_metadata.fetch_types.size());
 
   // Make a fresh copy of the function library for the client graph.
   std::unique_ptr<FunctionLibraryDefinition> flib(
diff --git a/tensorflow/core/common_runtime/graph_execution_state.h b/tensorflow/core/common_runtime/graph_execution_state.h
index 2312e1a89fd1fd5734fab4316c25ca2e39f16ae5..2154ef5bd3e09f69728360e62b435354ca33e160 100644
--- a/tensorflow/core/common_runtime/graph_execution_state.h
+++ b/tensorflow/core/common_runtime/graph_execution_state.h
@@ -177,6 +177,11 @@ class GraphExecutionState {
   void SaveStatefulNodes(Graph* graph);
   void RestoreStatefulNodes(Graph* graph);
 
+  // Extract the subset of the graph that needs to be run, adding feed/fetch
+  // ops as needed.
+  Status PruneGraph(const BuildGraphOptions& options, Graph* graph,
+                    subgraph::RewriteGraphMetadata* out_rewrite_metadata);
+
   Status OptimizeGraph(const BuildGraphOptions& options,
                        std::unique_ptr<Graph>* optimized_graph);
 
diff --git a/tensorflow/core/common_runtime/graph_runner.cc b/tensorflow/core/common_runtime/graph_runner.cc
index f1082a60030fb3c289de35b4cab397c527f8afca..1125d2a34a5adcde5153ea4f039d0bda3159deb4 100644
--- a/tensorflow/core/common_runtime/graph_runner.cc
+++ b/tensorflow/core/common_runtime/graph_runner.cc
@@ -97,7 +97,9 @@ class SimpleRendezvous : public Rendezvous {
 
 }  // namespace
 
-GraphRunner::GraphRunner(Env* env) : cpu_device_(GetCPUDevice(env)) {}
+GraphRunner::GraphRunner(Env* env)
+    : device_deleter_(GetCPUDevice(env)), device_(device_deleter_.get()) {}
+GraphRunner::GraphRunner(Device* device) : device_(device) {}
 
 GraphRunner::~GraphRunner() {}
 
@@ -105,17 +107,18 @@ Status GraphRunner::Run(Graph* graph, FunctionLibraryRuntime* function_library,
                         const NamedTensorList& inputs,
                         const std::vector<string>& output_names,
                         std::vector<Tensor>* outputs) {
-  if (cpu_device_ == nullptr) {
+  if (device_ == nullptr) {
     return errors::NotFound("Cannot find a device for GraphRunner.");
   }
 
   if (function_library && function_library->device() &&
-      function_library->device()->device_type() != cpu_device_->device_type()) {
-    // We are running on a CPU but the function library is for a non-CPU device,
-    // so just ignore the function_library.
+      function_library->device()->device_type() != device_->device_type()) {
+    // Mismatch between function_library's device_type and device_'s
+    // device_type, so just ignore the function_library.
     // TODO(matthewmurray) Can we create a new FunctionLibraryRuntime that is
-    // identical to function_library except that it uses CPU?
-    VLOG(1) << "Cannot run on CPU device with a function library for a "
+    // identical to function_library except that it uses the given 'device_'?
+    VLOG(1) << "Cannot run on: " << device_->device_type()
+            << " with a function library for a "
             << function_library->device()->device_type() << " device.";
     function_library = nullptr;
   }
@@ -146,8 +149,7 @@ Status GraphRunner::Run(Graph* graph, FunctionLibraryRuntime* function_library,
   subgraph::RewriteGraphMetadata metadata;
   TF_RETURN_IF_ERROR(subgraph::RewriteGraphForExecution(
       graph_to_run.get(), input_names, output_names, {} /* target nodes */,
-      cpu_device_->attributes(), false /* use_function_convention */,
-      &metadata));
+      device_->attributes(), false /* use_function_convention */, &metadata));
 
   // Create the local executor and the Rendezvous for fetching back the
   // constants.
@@ -158,13 +160,12 @@ Status GraphRunner::Run(Graph* graph, FunctionLibraryRuntime* function_library,
 
   LocalExecutorParams params;
   // The ownership of the output tensors are bound to this device's lifetime.
-  params.device = cpu_device_.get();
+  params.device = device_;
   params.function_library = function_library;
   const int producer = graph_to_run->versions().producer();
   params.create_kernel = [this, producer](const NodeDef& ndef,
                                           OpKernel** kernel) {
-    return CreateNonCachedKernel(cpu_device_.get(), nullptr, ndef, producer,
-                                 kernel);
+    return CreateNonCachedKernel(device_, nullptr, ndef, producer, kernel);
   };
   params.delete_kernel = [](OpKernel* kernel) { delete kernel; };
 
diff --git a/tensorflow/core/common_runtime/graph_runner.h b/tensorflow/core/common_runtime/graph_runner.h
index 1e4ae7722794ca527bcea023d992d92839ee46c9..1c4b2b719cd52a4cce16453601112e710b3c3e9d 100644
--- a/tensorflow/core/common_runtime/graph_runner.h
+++ b/tensorflow/core/common_runtime/graph_runner.h
@@ -36,12 +36,14 @@ namespace tensorflow {
 // This class is only meant for internal use where one needs to
 // partially evaluate inexpensive nodes in a graph, such as for shape
 // inference or for constant folding.  Because of its limited, simple
-// use-cases, it executes all computation on the CPU and is not meant
-// to be particularly lightweight, fast, or efficient.
+// use-cases, it executes all computation on the given device (CPU by default)
+// and is not meant to be particularly lightweight, fast, or efficient.
 class GraphRunner {
  public:
   // REQUIRES: `env` is not nullptr.
   GraphRunner(Env* env);
+  // REQUIRES: 'device' is not nullptr. Not owned.
+  GraphRunner(Device* device);
   ~GraphRunner();
 
   // Function semantics for `inputs`, `output_names` and `outputs`
@@ -59,7 +61,8 @@ class GraphRunner {
              std::vector<Tensor>* outputs);
 
  private:
-  std::unique_ptr<Device> cpu_device_;
+  std::unique_ptr<Device> device_deleter_;
+  Device* const device_;
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/memory_types.cc b/tensorflow/core/common_runtime/memory_types.cc
index 090a16ebeb10007261666aeb6491a1785dd2e5c4..116750fbfd60f74ff49390de56f659308aa50f5c 100644
--- a/tensorflow/core/common_runtime/memory_types.cc
+++ b/tensorflow/core/common_runtime/memory_types.cc
@@ -92,7 +92,7 @@ static Status ProcessMemoryTypes(
 
 Status ValidateMemoryTypes(const DeviceType& device_type, const Graph* g) {
   return ProcessMemoryTypes(
-      device_type, g, [g](const Edge* e, MemoryType sm, MemoryType dm) {
+      device_type, g, [](const Edge* e, MemoryType sm, MemoryType dm) {
         if (sm == dm) {
           return Status::OK();
         }
@@ -155,7 +155,7 @@ Status EnsureMemoryTypes(const DeviceType& device_type,
   };
   std::vector<Item> edges;
   TF_RETURN_IF_ERROR(ProcessMemoryTypes(
-      device_type, g, [g, &edges](const Edge* e, MemoryType sm, MemoryType dm) {
+      device_type, g, [&edges](const Edge* e, MemoryType sm, MemoryType dm) {
         if (sm == dm) {
           return Status::OK();
         }
diff --git a/tensorflow/core/common_runtime/mkl_cpu_allocator.cc b/tensorflow/core/common_runtime/mkl_cpu_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..829c19204af19119667fb455aad6505b388de94e
--- /dev/null
+++ b/tensorflow/core/common_runtime/mkl_cpu_allocator.cc
@@ -0,0 +1,24 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifdef INTEL_MKL
+
+#include "tensorflow/core/common_runtime/mkl_cpu_allocator.h"
+
+namespace tensorflow {
+
+}  // namespace tensorflow
+
+#endif  // INTEL_MKL
diff --git a/tensorflow/core/common_runtime/mkl_cpu_allocator.h b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
index fb092424bfc0b1bd8653e630b246b2749eb665fd..73abf18d97f6ca269ed554b44d6d9012c7bb173b 100644
--- a/tensorflow/core/common_runtime/mkl_cpu_allocator.h
+++ b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
@@ -50,10 +50,10 @@ class MklCPUAllocator : public VisitableAllocator {
   // Constructor and other standard functions
 
   /// Environment variable that user can set to upper bound on memory allocation
-  static constexpr const char* kMaxLimitStr = "TF_MKL_ALLOC_MAX_BYTES";
+  static inline constexpr const char* kMaxLimitStr = "TF_MKL_ALLOC_MAX_BYTES";
 
   /// Default upper limit on allocator size - 64GB
-  static const size_t kDefaultMaxLimit = 64LL << 30;
+  static constexpr size_t kDefaultMaxLimit = 64LL << 30;
 
   MklCPUAllocator() { TF_CHECK_OK(Initialize()); }
 
@@ -158,7 +158,7 @@ class MklCPUAllocator : public VisitableAllocator {
   static constexpr const char* kName = "mklcpu";
 
   /// The alignment that we need for the allocations
-  static const size_t kAlignment = 64;
+  static constexpr const size_t kAlignment = 64;
 
   VisitableAllocator* allocator_;  // owned by this class
 };
diff --git a/tensorflow/core/common_runtime/process_function_library_runtime.cc b/tensorflow/core/common_runtime/process_function_library_runtime.cc
index e205e34aa0f6afb1363d65bd23403d4b50f056eb..92fdcb404e7ecc20e3079f1b21c37492daa5b047 100644
--- a/tensorflow/core/common_runtime/process_function_library_runtime.cc
+++ b/tensorflow/core/common_runtime/process_function_library_runtime.cc
@@ -25,25 +25,40 @@ namespace tensorflow {
 
 const char ProcessFunctionLibraryRuntime::kDefaultFLRDevice[] = "null";
 
+Status ProcessFunctionLibraryRuntime::FunctionData::DistributedInit(
+    DistributedFunctionLibraryRuntime* parent, const string& function_name,
+    const FunctionLibraryDefinition& lib_def, AttrSlice attrs,
+    const FunctionLibraryRuntime::InstantiateOptions& options) {
+  mutex_lock l(mu_);
+  if (!init_started_) {
+    init_started_ = true;
+    init_result_ = parent->Instantiate(function_name, lib_def, attrs, options,
+                                       &local_handle_);
+  }
+  return init_result_;
+}
+
 ProcessFunctionLibraryRuntime::ProcessFunctionLibraryRuntime(
     const DeviceMgr* device_mgr, Env* env, int graph_def_version,
     const FunctionLibraryDefinition* lib_def,
     const OptimizerOptions& optimizer_options,
+    thread::ThreadPool* default_thread_pool,
     DistributedFunctionLibraryRuntime* parent)
     : device_mgr_(device_mgr),
       lib_def_(lib_def),
+      default_thread_pool_(default_thread_pool),
       next_handle_(0),
       parent_(parent) {
   if (device_mgr == nullptr) {
-    flr_map_[nullptr] =
-        NewFunctionLibraryRuntime(nullptr, env, nullptr, graph_def_version,
-                                  lib_def, optimizer_options, this);
+    flr_map_[nullptr] = NewFunctionLibraryRuntime(
+        nullptr, env, nullptr, graph_def_version, lib_def, default_thread_pool,
+        optimizer_options, this);
     return;
   }
   for (Device* d : device_mgr->ListDevices()) {
-    flr_map_[d] =
-        NewFunctionLibraryRuntime(device_mgr, env, d, graph_def_version,
-                                  lib_def, optimizer_options, this);
+    flr_map_[d] = NewFunctionLibraryRuntime(
+        device_mgr, env, d, graph_def_version, lib_def, default_thread_pool,
+        optimizer_options, this);
   }
 }
 
@@ -52,21 +67,23 @@ ProcessFunctionLibraryRuntime::ProcessFunctionLibraryRuntime(
     const FunctionLibraryDefinition* lib_def,
     const OptimizerOptions& optimizer_options,
     CustomKernelCreator custom_kernel_creator,
+    thread::ThreadPool* default_thread_pool,
     DistributedFunctionLibraryRuntime* parent)
     : device_mgr_(device_mgr),
       lib_def_(lib_def),
+      default_thread_pool_(default_thread_pool),
       next_handle_(0),
       parent_(parent) {
   if (device_mgr == nullptr) {
     flr_map_[nullptr] = NewFunctionLibraryRuntime(
-        nullptr, env, nullptr, graph_def_version, lib_def, optimizer_options,
-        std::move(custom_kernel_creator), this);
+        nullptr, env, nullptr, graph_def_version, lib_def, default_thread_pool,
+        optimizer_options, std::move(custom_kernel_creator), this);
     return;
   }
   for (Device* d : device_mgr->ListDevices()) {
     flr_map_[d] = NewFunctionLibraryRuntime(
-        device_mgr, env, d, graph_def_version, lib_def, optimizer_options,
-        custom_kernel_creator, this);
+        device_mgr, env, d, graph_def_version, lib_def, default_thread_pool,
+        optimizer_options, custom_kernel_creator, this);
   }
 }
 
@@ -145,7 +162,7 @@ FunctionLibraryRuntime* ProcessFunctionLibraryRuntime::GetFLR(
   Device* device = nullptr;
   if (device_name != kDefaultFLRDevice) {
     if (!device_mgr_->LookupDevice(device_name, &device).ok()) {
-      LOG(ERROR) << "Could not find device: " << device_name;
+      VLOG(1) << "Could not find device: " << device_name;
       return nullptr;
     }
   }
@@ -167,7 +184,8 @@ FunctionLibraryRuntime::Handle ProcessFunctionLibraryRuntime::AddHandle(
     if (function_data_.count(h) != 0) return h;
   }
   h = next_handle_;
-  function_data_.insert({h, FunctionData(device_name, local_handle)});
+  FunctionData* fd = new FunctionData(device_name, local_handle);
+  function_data_[h] = std::unique_ptr<FunctionData>(fd);
   table_[function_key] = h;
   next_handle_++;
   return h;
@@ -196,19 +214,19 @@ ProcessFunctionLibraryRuntime::GetHandleOnDevice(
   if (function_data_.count(handle) == 0) {
     return kInvalidLocalHandle;
   }
-  const FunctionData& function_data = function_data_[handle];
-  if (function_data.target_device != device_name) {
+  FunctionData* function_data = function_data_[handle].get();
+  if (function_data->target_device() != device_name) {
     return kInvalidLocalHandle;
   }
-  return function_data.local_handle;
+  return function_data->local_handle();
 }
 
 string ProcessFunctionLibraryRuntime::GetDeviceName(
     FunctionLibraryRuntime::Handle handle) {
   mutex_lock l(mu_);
   CHECK_EQ(1, function_data_.count(handle));
-  const FunctionData& function_data = function_data_[handle];
-  return function_data.target_device;
+  FunctionData* function_data = function_data_[handle].get();
+  return function_data->target_device();
 }
 
 Status ProcessFunctionLibraryRuntime::Instantiate(
@@ -225,11 +243,29 @@ Status ProcessFunctionLibraryRuntime::Instantiate(
         "Currently don't support instantiating functions on device: ",
         options.target);
   }
-  FunctionLibraryRuntime::Handle cluster_handle;
-  TF_RETURN_IF_ERROR(parent_->Instantiate(function_name, *lib_def_, attrs,
-                                          options, &cluster_handle));
-  string function_key = Canonicalize(function_name, attrs);
-  *handle = AddHandle(function_key, options.target, cluster_handle);
+  VLOG(1) << "ProcessFLR Instantiate: " << function_name
+          << " on: " << options.target;
+  string function_key = Canonicalize(function_name, attrs, options);
+  FunctionData* f;
+  {
+    mutex_lock l(mu_);
+    FunctionLibraryRuntime::Handle h =
+        gtl::FindWithDefault(table_, function_key, kInvalidHandle);
+    if (h == kInvalidHandle || function_data_.count(h) == 0) {
+      h = next_handle_;
+      FunctionData* fd = new FunctionData(options.target, kInvalidHandle);
+      function_data_[h] = std::unique_ptr<FunctionData>(fd);
+      table_[function_key] = h;
+      next_handle_++;
+    }
+    f = function_data_[h].get();
+    *handle = h;
+  }
+  TF_RETURN_IF_ERROR(
+      f->DistributedInit(parent_, function_name, *lib_def_, attrs, options));
+  VLOG(1) << "ProcessFLR Instantiate [success]: " << function_name
+          << " on: " << options.target << " with handle: " << *handle
+          << " (this: " << this << ")";
   return Status::OK();
 }
 
@@ -247,7 +283,7 @@ Status ProcessFunctionLibraryRuntime::ReleaseHandle(
   {
     mutex_lock l(mu_);
     CHECK_EQ(1, function_data_.count(handle)) << " handle: " << handle;
-    target_device = function_data_[handle].target_device;
+    target_device = function_data_[handle]->target_device();
   }
   flr = GetFLR(target_device);
   if (flr != nullptr) {
@@ -276,8 +312,8 @@ void ProcessFunctionLibraryRuntime::Run(
       done(errors::NotFound("Handle: ", handle, " not found."));
       return;
     }
-    target_device = function_data_[handle].target_device;
-    local_handle = function_data_[handle].local_handle;
+    target_device = function_data_[handle]->target_device();
+    local_handle = function_data_[handle]->local_handle();
   }
   flr = GetFLR(target_device);
   if (flr != nullptr) {
@@ -341,7 +377,8 @@ Status ProcessFunctionLibraryRuntime::Clone(
   out_lib_def->reset(new FunctionLibraryDefinition(*lib_def_));
   out_pflr->reset(new ProcessFunctionLibraryRuntime(
       device_mgr_, env, graph_def_version, out_lib_def->get(),
-      optimizer_options, std::move(custom_kernel_creator), parent_));
+      optimizer_options, std::move(custom_kernel_creator), default_thread_pool_,
+      parent_));
   return Status::OK();
 }
 
diff --git a/tensorflow/core/common_runtime/process_function_library_runtime.h b/tensorflow/core/common_runtime/process_function_library_runtime.h
index 0473e16d242814930a9de17c88d4851d0d73edbe..d69e8bc2a049e9d71ca4ef0298dfe0dc058f2c45 100644
--- a/tensorflow/core/common_runtime/process_function_library_runtime.h
+++ b/tensorflow/core/common_runtime/process_function_library_runtime.h
@@ -33,6 +33,7 @@ class ProcessFunctionLibraryRuntime {
       const DeviceMgr* device_mgr, Env* env, int graph_def_version,
       const FunctionLibraryDefinition* lib_def,
       const OptimizerOptions& optimizer_options,
+      thread::ThreadPool* thread_pool = nullptr,
       DistributedFunctionLibraryRuntime* parent = nullptr);
 
   // With `custom_kernel_creator`.
@@ -41,6 +42,7 @@ class ProcessFunctionLibraryRuntime {
                                 const FunctionLibraryDefinition* lib_def,
                                 const OptimizerOptions& optimizer_options,
                                 CustomKernelCreator custom_kernel_creator,
+                                thread::ThreadPool* thread_pool,
                                 DistributedFunctionLibraryRuntime* parent);
 
   // Sends `tensors_to_send` from `source_device` to `target_device` using
@@ -145,22 +147,41 @@ class ProcessFunctionLibraryRuntime {
 
   mutable mutex mu_;
 
-  struct FunctionData {
-    const string target_device;
-    const FunctionLibraryRuntime::LocalHandle local_handle;
-
+  class FunctionData {
+   public:
     FunctionData(const string& target_device,
                  FunctionLibraryRuntime::LocalHandle local_handle)
-        : target_device(target_device), local_handle(local_handle) {}
-    FunctionData() : FunctionData("", -1) {}
+        : target_device_(target_device), local_handle_(local_handle) {}
+
+    string target_device() { return target_device_; }
+
+    FunctionLibraryRuntime::LocalHandle local_handle() { return local_handle_; }
+
+    // Initializes the FunctionData object by potentially making an Instantiate
+    // call to the DistributedFunctionLibraryRuntime.
+    Status DistributedInit(
+        DistributedFunctionLibraryRuntime* parent, const string& function_name,
+        const FunctionLibraryDefinition& lib_def, AttrSlice attrs,
+        const FunctionLibraryRuntime::InstantiateOptions& options);
+
+   private:
+    mutex mu_;
+
+    const string target_device_;
+    FunctionLibraryRuntime::LocalHandle local_handle_ GUARDED_BY(mu_);
+    bool init_started_ GUARDED_BY(mu_) = false;
+    Status init_result_ GUARDED_BY(mu_);
+    Notification init_done_;
   };
 
   const DeviceMgr* const device_mgr_;
   const FunctionLibraryDefinition* lib_def_;
+  thread::ThreadPool* default_thread_pool_;
   // Holds all the function invocations here.
   std::unordered_map<string, FunctionLibraryRuntime::Handle> table_
       GUARDED_BY(mu_);
-  std::unordered_map<FunctionLibraryRuntime::Handle, FunctionData>
+  std::unordered_map<FunctionLibraryRuntime::Handle,
+                     std::unique_ptr<FunctionData>>
       function_data_ GUARDED_BY(mu_);
   std::unordered_map<Device*, std::unique_ptr<FunctionLibraryRuntime>> flr_map_;
   int next_handle_ GUARDED_BY(mu_);
diff --git a/tensorflow/core/common_runtime/process_function_library_runtime_test.cc b/tensorflow/core/common_runtime/process_function_library_runtime_test.cc
index 439ba1ce965ebe4addb525cd3d17d794feaecd1f..2da67b084a04067d56f66dfca208287aa04d7b46 100644
--- a/tensorflow/core/common_runtime/process_function_library_runtime_test.cc
+++ b/tensorflow/core/common_runtime/process_function_library_runtime_test.cc
@@ -19,9 +19,11 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/function_testlib.h"
 #include "tensorflow/core/common_runtime/rendezvous_mgr.h"
+#include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/framework/function_testlib.h"
 #include "tensorflow/core/framework/tensor_testutil.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/lib/core/threadpool.h"
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/public/session_options.h"
 #include "tensorflow/core/public/version.h"
@@ -29,8 +31,32 @@ limitations under the License.
 namespace tensorflow {
 namespace {
 
+class TestClusterFLR : public DistributedFunctionLibraryRuntime {
+ public:
+  TestClusterFLR() {}
+
+  Status Instantiate(const string& function_name,
+                     const FunctionLibraryDefinition& lib_def, AttrSlice attrs,
+                     const FunctionLibraryRuntime::InstantiateOptions& options,
+                     FunctionLibraryRuntime::LocalHandle* handle) {
+    mutex_lock l(mu_);
+    *handle = next_handle_;
+    next_handle_++;
+    return Status::OK();
+  }
+
+  void Run(const FunctionLibraryRuntime::Options& opts,
+           FunctionLibraryRuntime::LocalHandle handle,
+           gtl::ArraySlice<Tensor> args, std::vector<Tensor>* rets,
+           FunctionLibraryRuntime::DoneCallback done) {}
+
+ private:
+  mutex mu_;
+  int next_handle_ GUARDED_BY(mu_) = 0;
+};
+
 class ProcessFunctionLibraryRuntimeTest : public ::testing::Test {
- protected:
+ public:
   void Init(const std::vector<FunctionDef>& flib) {
     SessionOptions options;
     auto* device_count = options.config.mutable_device_count();
@@ -42,12 +68,20 @@ class ProcessFunctionLibraryRuntimeTest : public ::testing::Test {
     for (const auto& fdef : flib) *(proto.add_function()) = fdef;
     lib_def_.reset(new FunctionLibraryDefinition(OpRegistry::Global(), proto));
     OptimizerOptions opts;
+    cluster_flr_.reset(new TestClusterFLR());
     proc_flr_.reset(new ProcessFunctionLibraryRuntime(
         device_mgr_.get(), Env::Default(), TF_GRAPH_DEF_VERSION, lib_def_.get(),
-        opts, nullptr /* cluster_flr */));
+        opts, nullptr, cluster_flr_.get()));
     rendezvous_ = new IntraProcessRendezvous(device_mgr_.get());
   }
 
+  Status Instantiate(
+      const string& name, test::function::Attrs attrs,
+      const FunctionLibraryRuntime::InstantiateOptions& instantiate_opts,
+      FunctionLibraryRuntime::Handle* handle) {
+    return proc_flr_->Instantiate(name, attrs, instantiate_opts, handle);
+  }
+
   Status Run(const string& name, FunctionLibraryRuntime::Options opts,
              test::function::Attrs attrs,
              const FunctionLibraryRuntime::InstantiateOptions& instantiate_opts,
@@ -106,6 +140,7 @@ class ProcessFunctionLibraryRuntimeTest : public ::testing::Test {
   std::vector<Device*> devices_;
   std::unique_ptr<DeviceMgr> device_mgr_;
   std::unique_ptr<FunctionLibraryDefinition> lib_def_;
+  std::unique_ptr<TestClusterFLR> cluster_flr_;
   std::unique_ptr<ProcessFunctionLibraryRuntime> proc_flr_;
   IntraProcessRendezvous* rendezvous_;
 };
@@ -118,7 +153,7 @@ TEST_F(ProcessFunctionLibraryRuntimeTest, GetFLRNull) {
   std::unique_ptr<ProcessFunctionLibraryRuntime> proc_flr(
       new ProcessFunctionLibraryRuntime(
           nullptr /* device_mgr */, Env::Default(), TF_GRAPH_DEF_VERSION,
-          lib_def.get(), opts, nullptr /* cluster_flr */));
+          lib_def.get(), opts, nullptr, nullptr /* cluster_flr */));
   FunctionLibraryRuntime* flr =
       proc_flr->GetFLR(ProcessFunctionLibraryRuntime::kDefaultFLRDevice);
   EXPECT_NE(flr, nullptr);
@@ -250,5 +285,60 @@ TEST_F(ProcessFunctionLibraryRuntimeTest, MultipleCallsDiffDeviceFindDevice) {
   rendezvous_->Unref();
 }
 
+TEST_F(ProcessFunctionLibraryRuntimeTest, ClusterFLRSerialTest) {
+  Init({test::function::FindDevice()});
+  FunctionLibraryRuntime::Options opts;
+  opts.source_device = "/job:a/replica:0/task:0/cpu:0";
+  opts.rendezvous = rendezvous_;
+  opts.remote_execution = true;
+  FunctionLibraryRuntime::InstantiateOptions instantiate_opts;
+  instantiate_opts.target = "/job:b/replica:0/task:0/device:CPU:0";
+  FunctionLibraryRuntime::Handle h;
+  TF_CHECK_OK(Instantiate("FindDevice",
+                          {{"_target", "/job:b/replica:0/task:0/device:CPU:0"}},
+                          instantiate_opts, &h));
+  EXPECT_EQ(0, proc_flr_->GetHandleOnDevice(
+                   "/job:b/replica:0/task:0/device:CPU:0", h));
+  TF_CHECK_OK(Instantiate("FindDevice",
+                          {{"_target", "/job:b/replica:0/task:0/device:CPU:0"}},
+                          instantiate_opts, &h));
+  EXPECT_EQ(0, proc_flr_->GetHandleOnDevice(
+                   "/job:b/replica:0/task:0/device:CPU:0", h));
+  instantiate_opts.target = "/job:c/replica:0/task:0/device:CPU:0";
+  TF_CHECK_OK(Instantiate("FindDevice",
+                          {{"_target", "/job:c/replica:0/task:0/device:CPU:0"}},
+                          instantiate_opts, &h));
+  EXPECT_EQ(1, proc_flr_->GetHandleOnDevice(
+                   "/job:c/replica:0/task:0/device:CPU:0", h));
+  rendezvous_->Unref();
+}
+
+TEST_F(ProcessFunctionLibraryRuntimeTest, ClusterFLRParallelTest) {
+  Init({test::function::FindDevice()});
+  FunctionLibraryRuntime::Options opts;
+  opts.source_device = "/job:a/replica:0/task:0/cpu:0";
+  opts.rendezvous = rendezvous_;
+  opts.remote_execution = true;
+  FunctionLibraryRuntime::InstantiateOptions instantiate_opts;
+  instantiate_opts.target = "/job:b/replica:0/task:0/device:CPU:0";
+
+  thread::ThreadPool* tp = new thread::ThreadPool(Env::Default(), "test", 4);
+  auto fn = [this, &instantiate_opts]() {
+    FunctionLibraryRuntime::Handle h;
+    TF_CHECK_OK(Instantiate(
+        "FindDevice", {{"_target", "/job:b/replica:0/task:0/device:CPU:0"}},
+        instantiate_opts, &h));
+    EXPECT_EQ(0, proc_flr_->GetHandleOnDevice(
+                     "/job:b/replica:0/task:0/device:CPU:0", h));
+  };
+
+  for (int i = 0; i < 100; ++i) {
+    tp->Schedule(fn);
+  }
+  delete tp;
+
+  rendezvous_->Unref();
+}
+
 }  // anonymous namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/scoped_allocator.cc b/tensorflow/core/common_runtime/scoped_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..a26672b79dab87d66ac98a8436cc7e2df7473677
--- /dev/null
+++ b/tensorflow/core/common_runtime/scoped_allocator.cc
@@ -0,0 +1,210 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/common_runtime/scoped_allocator.h"
+#include "tensorflow/core/common_runtime/scoped_allocator_mgr.h"
+
+namespace tensorflow {
+
+ScopedAllocator::ScopedAllocator(const Tensor& backing_tensor, int32 scope_id,
+                                 const string& name,
+                                 const gtl::ArraySlice<Field>& fields,
+                                 int32 expected_call_count,
+                                 ScopedAllocatorContainer* container)
+    : backing_tensor_(backing_tensor),
+      tbuf_(backing_tensor_.buf_),
+      id_(scope_id),
+      name_(name),
+      container_(container),
+      fields_(fields.begin(), fields.end()),
+      expected_call_count_(expected_call_count),
+      live_alloc_count_(0) {
+  // Hold this until all aliases have been deallocated.
+  tbuf_->Ref();
+  // Hold this until all expected_calls have been made.
+  container->Ref();
+  CHECK_GE(tbuf_->size(), fields.back().offset + fields.back().bytes);
+}
+
+ScopedAllocator::~ScopedAllocator() {
+  mutex_lock l(mu_);
+  VLOG(1) << "~ScopedAllocator " << this << " tbuf_ " << tbuf_ << " data "
+          << static_cast<void*>(tbuf_->data());
+  // In the absence of incomplete graph execution situations
+  // (interruption by error status or control flow branch crossing
+  // ScopedAllocation region) we expect expected_call_count_ == 0 at
+  // exit.
+  if (VLOG_IS_ON(1)) {
+    if (expected_call_count_ > 0)
+      VLOG(1) << "expected_call_count_ = " << expected_call_count_
+              << " at deallocation";
+  }
+  if (tbuf_) tbuf_->Unref();
+}
+
+void* ScopedAllocator::AllocateRaw(int32 field_index, size_t num_bytes) {
+  VLOG(1) << "ScopedAllocator index " << id_ << " AllocateRaw "
+          << "field " << field_index << " num_bytes " << num_bytes;
+  mutex_lock l(mu_);
+  if (expected_call_count_ <= 0) {
+    LOG(ERROR) << "Scoped allocator " << name_
+               << " could not satisfy request for " << num_bytes
+               << " bytes, expected uses exhausted.";
+    return nullptr;
+  }
+
+  const int32 num_fields = static_cast<int32>(fields_.size());
+  if (field_index < 0 || field_index >= num_fields) {
+    LOG(ERROR) << "ScopedAllocator " << name_
+               << " received unexpected field number " << field_index;
+    return nullptr;
+  }
+
+  const Field& f = fields_[field_index];
+  if (num_bytes != f.bytes) {
+    LOG(ERROR) << "ScopedAllocator " << name_ << " got request for "
+               << num_bytes << " bytes from field " << field_index
+               << " which has precalculated size " << f.bytes << " and offset "
+               << f.offset;
+    return nullptr;
+  }
+
+  void* ptr = static_cast<void*>((tbuf_->template base<char>() + f.offset));
+
+  ++live_alloc_count_;
+  --expected_call_count_;
+  if (0 == expected_call_count_) {
+    for (auto& f : fields_) {
+      container_->Drop(f.scope_id, this);
+    }
+    container_->Drop(id_, this);
+    container_->Unref();
+    container_ = nullptr;
+  }
+  VLOG(1) << "AllocateRaw returning " << ptr;
+  return ptr;
+}
+
+void ScopedAllocator::DeallocateRaw(void* p) {
+  CHECK(VerifyPointer(p));
+
+  bool dead = false;
+  {
+    mutex_lock l(mu_);
+    CHECK_GT(live_alloc_count_, 0);
+    if (0 == --live_alloc_count_) {
+      if (0 == expected_call_count_) {
+        dead = true;
+      }
+    }
+  }
+  if (dead) {
+    delete this;
+  }
+}
+
+bool ScopedAllocator::VerifyPointer(const void* p) {
+  void* base = tbuf_->data();
+  CHECK_GE(p, base);
+  for (auto& f : fields_) {
+    void* f_ptr = static_cast<void*>(static_cast<char*>(base) + f.offset);
+    if (f_ptr == p) {
+      return true;
+      break;
+    }
+  }
+  VLOG(1) << "ScopedAllocator index " << id_ << " VerifyPointer for p=" << p
+          << " failed.";
+  return false;
+}
+
+bool ScopedAllocator::VerifyTensor(const Tensor* t) {
+  return VerifyPointer(t->buf_->data());
+}
+
+ScopedAllocatorInstance::ScopedAllocatorInstance(ScopedAllocator* sa,
+                                                 int32 field_index)
+    : scoped_allocator_(sa),
+      field_index_(field_index),
+      allocated_(false),
+      deallocated_(false),
+      in_table_(true) {
+  VLOG(1) << "new ScopedAllocatorInstance " << this << " on SA " << sa
+          << " field_index " << field_index;
+}
+
+void ScopedAllocatorInstance::DropFromTable() {
+  bool del = false;
+  {
+    mutex_lock l(mu_);
+    CHECK(in_table_);
+    in_table_ = false;
+    VLOG(2) << "ScopedAllocatorInstance::DropFromTable " << this
+            << " allocated_ " << allocated_ << " deallocated_ " << deallocated_
+            << " in_table_ " << in_table_;
+    // Single use is complete when it is allocated and deallocated.
+    // This check prevents a race between Allocating the tensor slice and
+    // Dropping it from the parent container's table.
+    if (allocated_ && deallocated_) {
+      del = true;
+    }
+  }
+  if (del) delete this;
+}
+
+void* ScopedAllocatorInstance::AllocateRaw(size_t alignment, size_t num_bytes) {
+  void* ptr = scoped_allocator_->AllocateRaw(field_index_, num_bytes);
+  {
+    mutex_lock l(mu_);
+    if (nullptr == ptr) {
+      VLOG(2) << "ScopedAllocatorInstance::AllocateRaw " << this
+              << " call to underlying ScopedAllocator unsuccessful,"
+              << " allocated_ " << allocated_ << " deallocated_ "
+              << deallocated_ << " in_table_ " << in_table_
+              << " returning nullptr.";
+    } else {
+      allocated_ = true;
+      VLOG(2) << "ScopedAllocatorInstance::AllocateRaw " << this
+              << " allocated_ " << allocated_ << " deallocated_ "
+              << deallocated_ << " in_table_ " << in_table_
+              << " returning ptr = " << ptr;
+    }
+  }
+  return ptr;
+}
+
+void ScopedAllocatorInstance::DeallocateRaw(void* p) {
+  scoped_allocator_->DeallocateRaw(p);
+  bool del = false;
+  {
+    mutex_lock l(mu_);
+    CHECK(allocated_);
+    deallocated_ = true;
+    VLOG(2) << "ScopedAllocatorInstance::DeallocateRaw " << this
+            << " allocated_ " << allocated_ << " deallocated_ " << deallocated_
+            << " in_table_ " << in_table_;
+    // Single use is now complete, but only delete this instance when it is
+    // no longer in a ScopedAllocatorContainer's table.
+    if (!in_table_) {
+      del = true;
+    }
+  }
+  if (del) delete this;
+}
+
+string ScopedAllocatorInstance::Name() {
+  return strings::StrCat(scoped_allocator_->name(), "_field_", field_index_);
+}
+
+}  // namespace tensorflow
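The deletion rule shared by `ScopedAllocatorInstance::DropFromTable` and `ScopedAllocatorInstance::DeallocateRaw` above can be read as a small state machine: the instance may be destroyed only once it has been allocated, deallocated, and dropped from the container's table, in whichever order the last two happen. The following is a minimal standalone sketch of that rule, not the TensorFlow implementation. `InstanceLifecycle` and its method names are hypothetical, `std::mutex` stands in for `tensorflow::mutex`, and the transitions return a bool instead of calling `delete this`.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical model of the three flags tracked by ScopedAllocatorInstance.
// Each transition returns true when the caller may destroy the object.
class InstanceLifecycle {
 public:
  // Corresponds to a successful AllocateRaw.
  void MarkAllocated() {
    std::lock_guard<std::mutex> l(mu_);
    allocated_ = true;
  }

  // Corresponds to DeallocateRaw: the single use is complete, but the object
  // must outlive its entry in the container's table.
  bool MarkDeallocated() {
    std::lock_guard<std::mutex> l(mu_);
    assert(allocated_);
    deallocated_ = true;
    return !in_table_;
  }

  // Corresponds to DropFromTable: the container has retired the scope_id.
  bool DropFromTable() {
    std::lock_guard<std::mutex> l(mu_);
    assert(in_table_);
    in_table_ = false;
    return allocated_ && deallocated_;
  }

 private:
  std::mutex mu_;
  bool allocated_ = false;
  bool deallocated_ = false;
  bool in_table_ = true;
};
```

Whichever of `MarkDeallocated` and `DropFromTable` runs second is the one that reports the object as destroyable, which is why both methods in the patch carry a deletion check.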
diff --git a/tensorflow/core/common_runtime/scoped_allocator.h b/tensorflow/core/common_runtime/scoped_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..71a7001217352cae2c374ad1f11a13482e940be2
--- /dev/null
+++ b/tensorflow/core/common_runtime/scoped_allocator.h
@@ -0,0 +1,124 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_SCOPED_ALLOCATOR_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_SCOPED_ALLOCATOR_H_
+
+#include "tensorflow/core/framework/allocator.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/lib/core/refcount.h"
+#include "tensorflow/core/platform/mutex.h"
+
+namespace tensorflow {
+class ScopedAllocatorContainer;
+class ScopedAllocatorInstance;
+
+// Manages a single backing tensor and a collection of aliases.
+class ScopedAllocator {
+ public:
+  static const int32 kInvalidId = 0;
+  static const size_t kMaxAlignment = 64;
+
+  // A subrange of the TensorBuffer associated with this object that
+  // will be the backing memory for one aliased tensor.
+  struct Field {
+    int32 scope_id;
+    size_t offset;
+    size_t bytes;
+  };
+  // Field index that refers to backing tensor, not any aliased field.
+  static const int32 kBackingIndex = -1;
+
+  // backing_tensor is expected to be newly allocated by a ScopedAllocatorOp
+  // instance.  It must be large enough to back all of the specified
+  // (offset, byte) ranges of the fields.
+  ScopedAllocator(const Tensor& backing_tensor, int32 scope_id,
+                  const string& name, const gtl::ArraySlice<Field>& fields,
+                  int32 expected_call_count,
+                  ScopedAllocatorContainer* container);
+
+  // Automatically deletes when last use expires, or when
+  // ScopedAllocatorContainer decides to delete.
+  ~ScopedAllocator() LOCKS_EXCLUDED(mu_);
+
+  // For debugging: returns true iff p is a pointer that could have
+  // been returned by AllocateRaw.
+  bool VerifyPointer(const void* p);
+  bool VerifyTensor(const Tensor* t);
+
+  const Tensor& tensor() const { return backing_tensor_; }
+
+  const string& name() const { return name_; }
+
+ private:
+  friend class ScopedAllocatorInstance;
+  // Only ScopedAllocatorInstances can call AllocateRaw and DeallocateRaw on a
+  // ScopedAllocator
+  void* AllocateRaw(int32 field_index, size_t num_bytes) LOCKS_EXCLUDED(mu_);
+  void DeallocateRaw(void* p) LOCKS_EXCLUDED(mu_);
+  Tensor backing_tensor_;
+  TensorBuffer* tbuf_;
+  int32 id_;
+  string name_;
+  ScopedAllocatorContainer* container_;
+  std::vector<Field> fields_;
+  mutex mu_;
+  int32 expected_call_count_ GUARDED_BY(mu_);
+  int32 live_alloc_count_ GUARDED_BY(mu_);
+};
+
+// An Allocator that will return a pointer into the backing buffer of
+// a previously allocated tensor, allowing creation of an alias
+// tensor.  There is a one-to-one mapping between the fields of a
+// ScopedAllocator and ScopedAllocatorInstances.  There is also a one-to-one
+// mapping between scope_ids and ScopedAllocatorInstances.  It should be
+// discarded immediately after a single use.
+class ScopedAllocatorInstance : public Allocator {
+ public:
+  explicit ScopedAllocatorInstance(ScopedAllocator* sa, int32 field_index);
+
+ private:
+  ~ScopedAllocatorInstance() { VLOG(1) << "~ScopedAllocatorInstance " << this; }
+
+ public:
+  // When a ScopedAllocatorContainer "Drops" a scope_id, it calls DropFromTable
+  // on the underlying ScopedAllocatorInstance.  If this instance has already
+  // deallocated the tensor slice, we can safely delete this.
+  void DropFromTable() LOCKS_EXCLUDED(mu_);
+  void* AllocateRaw(size_t alignment, size_t num_bytes)
+      LOCKS_EXCLUDED(mu_) override;
+  void* AllocateRaw(size_t alignment, size_t num_bytes,
+                    const AllocationAttributes& allocator_attr) override {
+    return AllocateRaw(alignment, num_bytes);
+  }
+  void DeallocateRaw(void* p) LOCKS_EXCLUDED(mu_) override;
+  bool TracksAllocationSizes() override { return false; }
+  bool ShouldAllocateEmptyTensors() override { return false; }
+  size_t RequestedSize(const void* ptr) override { return 0; }
+  size_t AllocatedSize(const void* ptr) override { return 0; }
+  int64 AllocationId(const void* ptr) override { return 0; }
+  size_t AllocatedSizeSlow(const void* ptr) override { return 0; }
+  string Name() override;
+
+ private:
+  mutex mu_;
+  ScopedAllocator* scoped_allocator_;
+  int32 field_index_;
+  bool allocated_ GUARDED_BY(mu_);
+  bool deallocated_ GUARDED_BY(mu_);
+  bool in_table_ GUARDED_BY(mu_);
+};
+
+}  // namespace tensorflow
+#endif  // TENSORFLOW_CORE_COMMON_RUNTIME_SCOPED_ALLOCATOR_H_
diff --git a/tensorflow/core/common_runtime/scoped_allocator_mgr.cc b/tensorflow/core/common_runtime/scoped_allocator_mgr.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e1f70404e32edaabca95f913cf0bb86080f8b411
--- /dev/null
+++ b/tensorflow/core/common_runtime/scoped_allocator_mgr.cc
@@ -0,0 +1,185 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/common_runtime/scoped_allocator_mgr.h"
+
+#include "tensorflow/core/common_runtime/scoped_allocator.h"
+#include "tensorflow/core/framework/allocator.h"
+
+namespace tensorflow {
+
+Status ScopedAllocatorContainer::AddScopedAllocator(
+    const Tensor& backing_tensor, int32 scope_id, const string& scope_name,
+    const gtl::ArraySlice<ScopedAllocator::Field>& fields,
+    int32 expected_call_count) {
+  VLOG(1) << "AddScopedAllocator " << mgr_->device_name()
+          << " step_id_=" << step_id_ << " scope_id=" << scope_id;
+  mutex_lock l(mu_);
+  // Ensure none of the new scope_ids are in use.
+  auto it = allocators_.find(scope_id);
+  if (it != allocators_.end()) {
+    return errors::Internal("Cannot create ScopedAllocator because scope_id ",
+                            scope_id, " for name ", scope_name,
+                            " already exists");
+  }
+  for (auto& f : fields) {
+    if (allocators_.find(f.scope_id) != allocators_.end()) {
+      return errors::Internal(
+          "Cannot create ScopedAllocator because field scope_id ", f.scope_id,
+          " for name ", scope_name, " already exists");
+    }
+  }
+  VLOG(2) << " container " << this << " step_id " << step_id_;
+  ScopedAllocator* sa = new ScopedAllocator(
+      backing_tensor, scope_id, scope_name, fields, expected_call_count, this);
+  allocators_[scope_id] =
+      ScopedAllocatorContainer::SAField(ScopedAllocator::kBackingIndex, sa);
+  VLOG(2) << "#fields " << fields.size();
+  for (int i = 0; i < fields.size(); ++i) {
+    const ScopedAllocator::Field& f = fields[i];
+    VLOG(2) << "Adding instance for " << mgr_->device_name()
+            << " scope_id=" << f.scope_id;
+    allocators_[f.scope_id] = ScopedAllocatorContainer::SAField(
+        i, new ScopedAllocatorInstance(sa, i));
+  }
+  return Status::OK();
+}
+
+ScopedAllocator* ScopedAllocatorContainer::GetAllocator(int32 scope_id) {
+  mutex_lock l(mu_);
+  auto it = allocators_.find(scope_id);
+  if (it != allocators_.end()) {
+    CHECK_EQ(ScopedAllocator::kBackingIndex, it->second.field_index);
+    return it->second.scoped_allocator;
+  } else {
+    LOG(ERROR) << "Failed to find ScopedAllocator for " << scope_id
+               << " in container for step " << step_id_ << " on "
+               << mgr_->device_name();
+    return nullptr;
+  }
+}
+
+ScopedAllocatorInstance* ScopedAllocatorContainer::GetInstance(int32 scope_id) {
+  VLOG(2) << "GetInstance " << scope_id << " step " << step_id_ << " on "
+          << mgr_->device_name();
+  mutex_lock l(mu_);
+  auto it = allocators_.find(scope_id);
+  if (it != allocators_.end()) {
+    return it->second.instance;
+  }
+  LOG(FATAL) << "Failed to find instance " << scope_id << " in container "
+             << step_id_ << " on " << mgr_->device_name();
+  return nullptr;
+}
+
+void ScopedAllocatorContainer::Drop(int32 scope_id, ScopedAllocator* sa) {
+  VLOG(2) << "Drop " << scope_id << " from container " << this << " step "
+          << step_id_ << " on " << mgr_->device_name();
+  mutex_lock l(mu_);
+  auto it = allocators_.find(scope_id);
+  if (it != allocators_.end()) {
+    if (it->second.field_index != ScopedAllocator::kBackingIndex) {
+      it->second.instance->DropFromTable();
+    }
+    allocators_.erase(it);
+  }
+}
+
+ScopedAllocatorContainer::~ScopedAllocatorContainer() {
+  VLOG(2) << "~ScopedAllocatorContainer " << this << " step " << step_id_
+          << " on " << mgr_->device_name();
+  mutex_lock l(mu_);
+  // In normal execution the table should be empty and all of its
+  // contents deleted via Drop.  When a step ends early
+  // (e.g. through abnormal termination) we need to clean up
+  // explicitly.  So long as graph execution of the associated step has
+  // completely terminated this should be safe.
+  for (auto& it : allocators_) {
+    if (it.second.field_index == ScopedAllocator::kBackingIndex) {
+      delete it.second.scoped_allocator;
+    } else {
+      it.second.instance->DropFromTable();
+    }
+  }
+}
+
+ScopedAllocatorMgr::~ScopedAllocatorMgr() {
+  mutex_lock l(mu_);
+  for (auto it : per_step_map_) {
+    // In normal execution the associated ScopedAllocatorContainer is
+    // empty and gone by the end of the step.  But in abnormal termination,
+    // such as when an error has interrupted execution or in a unittest,
+    // we need to remove all of its Refs here to avoid memory leaks.
+    // This is safe so long as graph execution has ceased.
+    while (!it.second->Unref()) {
+    }
+  }
+}
+
+void ScopedAllocatorMgr::Cleanup(int64 step_id) {
+  mutex_lock l(mu_);
+  auto it = per_step_map_.find(step_id);
+  if (it != per_step_map_.end()) {
+    it->second->Unref();
+    per_step_map_.erase(it);
+  }
+}
+
+ScopedAllocatorContainer* ScopedAllocatorMgr::GetContainer(int64 step_id) {
+  VLOG(2) << "GetContainer " << step_id << " on " << device_name();
+  ScopedAllocatorContainer* sac = nullptr;
+  mutex_lock l(mu_);
+  auto it = per_step_map_.find(step_id);
+  if (it == per_step_map_.end()) {
+    sac = new ScopedAllocatorContainer(this, step_id);
+    per_step_map_[step_id] = sac;
+  } else {
+    sac = it->second;
+  }
+  return sac;
+}
+
+Status ScopedAllocatorMgr::AddScopedAllocator(
+    const Tensor& backing_tensor, int64 step_id, int32 scope_id,
+    const string& scope_name,
+    const gtl::ArraySlice<ScopedAllocator::Field>& fields,
+    int32 expected_call_count) {
+  ScopedAllocatorContainer* sac = GetContainer(step_id);
+  return sac->AddScopedAllocator(backing_tensor, scope_id, scope_name, fields,
+                                 expected_call_count);
+}
+
+void ScopedAllocatorMgr::PopulateFields(
+    int32 scope_id, const gtl::ArraySlice<TensorShape>& shapes,
+    const DataType dtype, std::vector<ScopedAllocator::Field>* fields) {
+  const int32 num_fields = static_cast<int32>(shapes.size());
+  fields->resize(num_fields);
+  size_t offset = 0;
+  for (int32 i = 0; i < num_fields; ++i) {
+    size_t bytes = shapes[i].num_elements() * DataTypeSize(dtype);
+    (*fields)[i].scope_id = scope_id + 1 + i;
+    (*fields)[i].bytes = bytes;
+    (*fields)[i].offset = offset;
+    VLOG(1) << "field=" << i << " scope_id=" << (*fields)[i].scope_id
+            << " bytes=" << (*fields)[i].bytes
+            << " offset=" << (*fields)[i].offset;
+    offset += bytes;
+    size_t overshoot = offset % Allocator::kAllocatorAlignment;
+    if (overshoot > 0) {
+      offset += (Allocator::kAllocatorAlignment - overshoot);
+    }
+  }
+}
+
+}  // namespace tensorflow
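The offset computation in `ScopedAllocatorMgr::PopulateFields` above packs the fields back to back and rounds the start of each following field up to the allocator alignment. The following is a minimal standalone sketch of that layout rule under stated assumptions, not TensorFlow code. `FieldLayout` and `LayoutFields` are hypothetical names, the `scope_id` member is omitted, and the alignment is passed as a parameter rather than read from `Allocator::kAllocatorAlignment`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ScopedAllocator::Field, without scope_id.
struct FieldLayout {
  std::size_t offset;
  std::size_t bytes;
};

// Places each field at the current offset, then advances the offset past the
// field and rounds it up to the next multiple of `alignment`.
std::vector<FieldLayout> LayoutFields(
    const std::vector<std::size_t>& field_bytes, std::size_t alignment) {
  assert(alignment > 0);
  std::vector<FieldLayout> fields;
  fields.reserve(field_bytes.size());
  std::size_t offset = 0;
  for (std::size_t bytes : field_bytes) {
    fields.push_back({offset, bytes});
    offset += bytes;
    const std::size_t overshoot = offset % alignment;
    if (overshoot > 0) {
      offset += alignment - overshoot;
    }
  }
  return fields;
}
```

With field sizes of 2048, 36 and 2048 bytes (the float shapes {512}, {3, 3} and {2, 256} used in the tests) and an assumed alignment of 64, the second field ends at byte 2084, so the third field starts at the next multiple of 64.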
diff --git a/tensorflow/core/common_runtime/scoped_allocator_mgr.h b/tensorflow/core/common_runtime/scoped_allocator_mgr.h
new file mode 100644
index 0000000000000000000000000000000000000000..effc5f2d775336621a783d83b7dd5eece6d42292
--- /dev/null
+++ b/tensorflow/core/common_runtime/scoped_allocator_mgr.h
@@ -0,0 +1,107 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_SCOPED_ALLOCATOR_MGR_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_SCOPED_ALLOCATOR_MGR_H_
+
+#include <string>
+#include <unordered_map>
+
+#include "tensorflow/core/common_runtime/scoped_allocator.h"
+#include "tensorflow/core/lib/core/refcount.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/platform/mutex.h"
+
+namespace tensorflow {
+class ScopedAllocatorMgr;
+
+// At most one of these exists per <device, step_id> pair.
+// A Ref is held by every ScopedAllocator and also by the ScopedAllocatorMgr.
+class ScopedAllocatorContainer : public core::RefCounted {
+ public:
+  // Establishes a reachable ScopedAllocator.
+  Status AddScopedAllocator(
+      const Tensor& backing_tensor, int32 scope_id, const string& scope_name,
+      const gtl::ArraySlice<ScopedAllocator::Field>& fields,
+      int32 expected_call_count);
+
+  ScopedAllocatorInstance* GetInstance(int32 scope_id);
+  ScopedAllocator* GetAllocator(int32 scope_id);
+
+  // Retire the scope_id.
+  void Drop(int32 scope_id, ScopedAllocator* sa);
+
+ protected:
+  friend class ScopedAllocatorMgr;
+  ScopedAllocatorContainer(const ScopedAllocatorMgr* mgr, int64 step_id)
+      : mgr_(mgr), step_id_(step_id) {}
+  ~ScopedAllocatorContainer();
+
+ private:
+  const ScopedAllocatorMgr* mgr_;
+  int64 step_id_;
+  mutex mu_;
+  struct SAField {
+    int32 field_index;
+    union {
+      ScopedAllocator* scoped_allocator;
+      ScopedAllocatorInstance* instance;
+    };
+    SAField(int32 fi, ScopedAllocatorInstance* sai)
+        : field_index(fi), instance(sai) {}
+    SAField(int32 fi, ScopedAllocator* sa)
+        : field_index(fi), scoped_allocator(sa) {}
+    SAField()
+        : field_index(ScopedAllocator::kBackingIndex),
+          scoped_allocator(nullptr) {}
+  };
+  std::unordered_map<int32, SAField> allocators_ GUARDED_BY(mu_);
+};
+
+// At most one of these exists per device.
+class ScopedAllocatorMgr {
+ public:
+  explicit ScopedAllocatorMgr(const string& device_name)
+      : device_name_(device_name) {}
+  ~ScopedAllocatorMgr();
+
+  ScopedAllocatorContainer* GetContainer(int64 step_id);
+
+  // Establishes a reachable ScopedAllocator.
+  Status AddScopedAllocator(
+      const Tensor& backing_tensor, int64 step_id, int32 scope_id,
+      const string& scope_name,
+      const gtl::ArraySlice<ScopedAllocator::Field>& fields,
+      int32 expected_call_count);
+
+  void Cleanup(int64 step_id);
+
+  // Populate the bytes and offset members of Field.  Instance allocators get
+  // consecutive scope_id values following that of the base ScopedAllocator.
+  static void PopulateFields(int32 scope_id,
+                             const gtl::ArraySlice<TensorShape>& shapes,
+                             const DataType dtype,
+                             std::vector<ScopedAllocator::Field>* fields);
+
+  const string& device_name() const { return device_name_; }
+
+ private:
+  string device_name_;
+  mutex mu_;
+  std::unordered_map<int64, ScopedAllocatorContainer*> per_step_map_
+      GUARDED_BY(mu_);
+};
+
+}  // namespace tensorflow
+#endif  // TENSORFLOW_CORE_COMMON_RUNTIME_SCOPED_ALLOCATOR_MGR_H_
diff --git a/tensorflow/core/common_runtime/scoped_allocator_mgr_test.cc b/tensorflow/core/common_runtime/scoped_allocator_mgr_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..38e07e47f24d6808e858721f8e7832668a556164
--- /dev/null
+++ b/tensorflow/core/common_runtime/scoped_allocator_mgr_test.cc
@@ -0,0 +1,227 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/common_runtime/scoped_allocator_mgr.h"
+
+#include "tensorflow/core/common_runtime/dma_helper.h"
+#include "tensorflow/core/common_runtime/scoped_allocator.h"
+#include "tensorflow/core/framework/allocator.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace {
+
+class ScopedAllocatorMgrTest : public ::testing::Test {
+ public:
+  ScopedAllocatorMgrTest() : sam_("CPU0") {}
+
+  void InitTensor() {
+    backing_tensor_ = Tensor(cpu_allocator(), DT_FLOAT, backing_tensor_shape_);
+  }
+
+  void PopulateFields() {
+    ScopedAllocatorMgr::PopulateFields(scope_id_, fields_shapes_, DT_FLOAT,
+                                       &fields_);
+  }
+
+  Status AddScopedAllocator(int expected_use_count, int scope_id) {
+    VLOG(2) << "Adding ScopedAllocator step_id " << step_id_ << " scope_id "
+            << scope_id << " #fields " << fields_.size()
+            << " expected_use_count " << expected_use_count;
+    return sam_.AddScopedAllocator(backing_tensor_, step_id_, scope_id,
+                                   "tensor_shape_599", fields_,
+                                   expected_use_count);
+  }
+
+  Status PrepScopedAllocatorMgr(int expected_use_count) {
+    InitTensor();
+    PopulateFields();
+    return AddScopedAllocator(expected_use_count, scope_id_);
+  }
+
+  void SaveInstances(int num_instances) {
+    sa_instances_.clear();
+    sa_instances_.resize(num_instances);
+    ScopedAllocatorContainer* sac = sam_.GetContainer(step_id_);
+    for (int i = 0; i < num_instances; i++) {
+      sa_instances_[i] = sac->GetInstance(scope_id_ + 1 + i);
+    }
+  }
+
+  // For the specific case when the backing tensor is of shape
+  // {512 + 9 + 512 + 16} and the fields_shapes are {{512}, {3, 3}, {2, 256}},
+  // this method computes the padding between the second and third slice of the
+  // backing tensor.  This example is reused across multiple tests.
+  int AlignmentPadding() {
+    int alignment_padding =
+        (Allocator::kAllocatorAlignment -
+         (521 * sizeof(float)) % Allocator::kAllocatorAlignment) %
+        Allocator::kAllocatorAlignment;
+    return alignment_padding;
+  }
+
+  // Debug
+  void PrintShapes() {
+    VLOG(2) << "tensor_shape=" << backing_tensor_shape_.DebugString();
+    for (int i = 0; i < fields_shapes_.size(); i++) {
+      VLOG(2) << "fields_shapes[" << i
+              << "]=" << fields_shapes_[i].DebugString();
+    }
+  }
+
+ protected:
+  TensorShape backing_tensor_shape_;
+  Tensor backing_tensor_;
+  std::vector<TensorShape> fields_shapes_;
+  std::vector<ScopedAllocator::Field> fields_;
+  ScopedAllocatorMgr sam_;
+  const int step_id_ = 101;
+  const int scope_id_ = 599;
+  std::vector<ScopedAllocatorInstance*> sa_instances_;
+};
+
+TEST_F(ScopedAllocatorMgrTest, ContainerAllocation) {
+  ScopedAllocatorContainer* sac_101 = sam_.GetContainer(101);
+  EXPECT_TRUE(sac_101 != nullptr);
+  ScopedAllocatorContainer* sac_201 = sam_.GetContainer(201);
+  EXPECT_TRUE(sac_201 != nullptr);
+  EXPECT_NE(sac_101, sac_201);
+  ScopedAllocatorContainer* also_sac_101 = sam_.GetContainer(101);
+  EXPECT_EQ(sac_101, also_sac_101);
+  sam_.Cleanup(101);
+  // 201 should be cleaned up by the destructor.
+}
+
+TEST_F(ScopedAllocatorMgrTest, PopulateFields) {
+  backing_tensor_shape_ = TensorShape({512 + 9 + 512 + 16});
+  fields_shapes_ = std::vector<TensorShape>({{512}, {3, 3}, {2, 256}});
+  InitTensor();
+  PopulateFields();
+  EXPECT_EQ(0, fields_[0].offset);
+  EXPECT_EQ(512 * sizeof(float), fields_[0].bytes);
+  EXPECT_EQ(scope_id_ + 1, fields_[0].scope_id);
+  EXPECT_EQ(512 * sizeof(float), fields_[1].offset);
+  EXPECT_EQ(9 * sizeof(float), fields_[1].bytes);
+  EXPECT_EQ(scope_id_ + 2, fields_[1].scope_id);
+  EXPECT_EQ(521 * sizeof(float) + AlignmentPadding(), fields_[2].offset);
+  EXPECT_EQ(512 * sizeof(float), fields_[2].bytes);
+  EXPECT_EQ(scope_id_ + 3, fields_[2].scope_id);
+}
+
+TEST_F(ScopedAllocatorMgrTest, ContainerAddAllocator) {
+  backing_tensor_shape_ = TensorShape({1024});
+  fields_shapes_ = std::vector<TensorShape>({{512}, {512}});
+  Status s = PrepScopedAllocatorMgr(2);
+  EXPECT_TRUE(s.ok());
+  // Need to call Allocate and Deallocate in order to use up the expected uses
+  // for this allocator.  Save the instances for now.
+  SaveInstances(fields_shapes_.size());
+
+  s = AddScopedAllocator(2, scope_id_);
+  EXPECT_FALSE(s.ok());
+  fields_[0].scope_id = scope_id_ + 1;
+  s = AddScopedAllocator(2, scope_id_ + 3);
+  EXPECT_FALSE(s.ok());
+
+  // Clean up the instances by invoking allocate and deallocate.
+  void* ptr0 =
+      sa_instances_[0]->AllocateRaw(0 /* alignment */, 512 * sizeof(float));
+  void* ptr1 =
+      sa_instances_[1]->AllocateRaw(0 /* alignment */, 512 * sizeof(float));
+  sa_instances_[0]->DeallocateRaw(ptr0);
+  sa_instances_[1]->DeallocateRaw(ptr1);
+}
+
+TEST_F(ScopedAllocatorMgrTest, AllocatorSuccess) {
+  ScopedAllocatorContainer* sac = sam_.GetContainer(step_id_);
+  ScopedAllocator* other = sac->GetAllocator(scope_id_);
+  EXPECT_EQ(other, nullptr);
+  backing_tensor_shape_ = TensorShape({512 + 9 + 512 + 16});
+  fields_shapes_ = std::vector<TensorShape>({{512}, {3, 3}, {2, 256}});
+  Status s = PrepScopedAllocatorMgr(3);
+  other = sac->GetAllocator(scope_id_);
+
+  ScopedAllocatorInstance* inst0 = sac->GetInstance(scope_id_ + 1);
+  char* ptr0 = static_cast<char*>(inst0->AllocateRaw(0, 512 * sizeof(float)));
+  const char* base =
+      static_cast<const char*>(DMAHelper::base(&backing_tensor_));
+  EXPECT_EQ(ptr0, base);
+
+  ScopedAllocatorInstance* inst1 = sac->GetInstance(scope_id_ + 2);
+  char* ptr1 = static_cast<char*>(inst1->AllocateRaw(0, 9 * sizeof(float)));
+  EXPECT_EQ(ptr1, ptr0 + (512 * sizeof(float)));
+
+  ScopedAllocatorInstance* inst2 = sac->GetInstance(scope_id_ + 3);
+  char* ptr2 = static_cast<char*>(inst2->AllocateRaw(0, 512 * sizeof(float)));
+  EXPECT_EQ(ptr2, ptr1 + AlignmentPadding() + (9 * sizeof(float)));
+
+  // At this point the scopes should be gone from the container
+  EXPECT_EQ(nullptr, sac->GetAllocator(scope_id_));
+
+  // The ScopedAllocatorInstances automatically delete when their memory
+  // is returned and they are out of table.
+  inst0->DeallocateRaw(ptr0);
+  inst1->DeallocateRaw(ptr1);
+  inst2->DeallocateRaw(ptr2);
+}
+
+// ScopedAllocator initialization should fail because backing_tensor is not
+// large enough to hold all the fields
+TEST_F(ScopedAllocatorMgrTest, AllocatorInitFail) {
+  backing_tensor_shape_ = TensorShape({8});
+  InitTensor();
+  fields_.resize(1);
+  fields_[0].scope_id = scope_id_ + 1;
+  fields_[0].offset = 0;
+  fields_[0].bytes = backing_tensor_shape_.num_elements() * 2 * sizeof(float);
+  // fields[0].offset + fields[0].bytes is larger than the size of the backing
+  // tensor, so this check should fail
+  EXPECT_DEATH(Status s = AddScopedAllocator(1, scope_id_), "");
+}
+
+// ScopedAllocator allocation should fail because we called more times than
+// expected, or we deallocated a non-existent pointer, or we requested more
+// or less than the exact size of an instance buffer.
+TEST_F(ScopedAllocatorMgrTest, AllocatorFail) {
+  backing_tensor_shape_ = TensorShape({1024});
+  fields_shapes_ = std::vector<TensorShape>({{512}, {512}});
+  Status s = PrepScopedAllocatorMgr(2);
+  EXPECT_TRUE(s.ok());
+  // Save instances so that we can explicitly delete later on.  In normal
+  // operation the instances will be automatically deleted after single use, but
+  // in this test we are invoking the ScopedAllocator's Alloc/Dealloc interface,
+  // so we need to explicitly delete the instances to avoid a memory leak.
+  SaveInstances(fields_shapes_.size());
+
+  char* ptr0 =
+      static_cast<char*>(sa_instances_[0]->AllocateRaw(0, 512 * sizeof(float)));
+  VLOG(2) << "Should fail because we deallocate ptr="
+          << static_cast<void*>(ptr0 + 8) << " which we never allocated.";
+  EXPECT_DEATH(sa_instances_[0]->DeallocateRaw(ptr0 + 8), "");
+  VLOG(2) << "Should fail because we allocate smaller than the size of the "
+          << "field.";
+  EXPECT_EQ(nullptr, sa_instances_[1]->AllocateRaw(0, 256 * sizeof(float)));
+  VLOG(2) << "Should fail because we allocate larger than the size of the "
+          << "field.";
+  EXPECT_EQ(nullptr, sa_instances_[1]->AllocateRaw(0, 1024 * sizeof(float)));
+  void* ptr1 = sa_instances_[1]->AllocateRaw(0, 512 * sizeof(float));
+  VLOG(2) << "Should fail because we exceed expected_use_count.";
+  EXPECT_EQ(nullptr, sa_instances_[0]->AllocateRaw(0, 512 * sizeof(float)));
+  sa_instances_[0]->DeallocateRaw(ptr0);
+  sa_instances_[1]->DeallocateRaw(ptr1);
+}
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/shape_refiner.cc b/tensorflow/core/common_runtime/shape_refiner.cc
index 45cdab98e0642a3fbfee3dfa415696b98251600a..cef50be3b1566de9f05b14783212f90da3107fc6 100644
--- a/tensorflow/core/common_runtime/shape_refiner.cc
+++ b/tensorflow/core/common_runtime/shape_refiner.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include <unordered_set>
 #include <vector>
 
+#include "tensorflow/core/common_runtime/eval_const_tensor.h"
 #include "tensorflow/core/framework/common_shape_fns.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/tensor.h"
@@ -211,14 +212,14 @@ Status ShapeRefiner::AddNode(const Node* node) {
   // For each 'input' of this node, fetch the corresponding shape
   // from 'input's InferenceContext, and store into a vector
   // indexed by 'node's input.
-  std::vector<Node*> input_nodes(node->num_inputs());
+  std::vector<const Node*> input_nodes(node->num_inputs());
   std::vector<ShapeHandle> input_shapes(node->num_inputs());
   std::vector<std::unique_ptr<std::vector<ShapeAndType>>>
       input_handle_shapes_and_types(node->num_inputs());
   for (const Edge* e : node->in_edges()) {
     if (e->IsControlEdge()) continue;
 
-    Node* input = e->src();
+    const Node* input = e->src();
     auto it = node_to_context_.find(input);
     if (it == node_to_context_.end()) {
       return errors::FailedPrecondition(
@@ -407,301 +408,13 @@ Status ShapeRefiner::EvaluateConstantTensorForEdge(const Node* node,
                                                    int dst_idx, bool* evaluated,
                                                    Tensor* result) {
   *evaluated = false;
-
   const Edge* input_edge;
   TF_RETURN_IF_ERROR(node->input_edge(dst_idx, &input_edge));
-
-  // Simple case: the source node is a constant
-  const Node* src = input_edge->src();
-  if (src->IsConstant()) {
-    if (result->FromProto(src->def().attr().at("value").tensor())) {
-      *evaluated = true;
-      return Status::OK();
-    }
-  }
-
-  if (disable_constant_propagation_) {
-    return Status::OK();
-  }
-
-  bool is_constant_graph = false;
-  Graph subgraph(ops_registry_);
-  auto versions = subgraph.versions();
-  versions.set_producer(graph_def_version_);
-  subgraph.set_versions(versions);
-
-  // We identify the possibly constant subgraph to evaluate by
-  // recursively iterating backwards through the inputs to 'node'
-  // until we either 1) find an already existing input to our subgraph
-  // (filled in `const_inputs`), 2) Discover our graph is not constant,
-  // or 3) Hit a root node.
-  std::vector<std::pair<string, Tensor>> const_inputs;
-  TF_RETURN_IF_ERROR(ExtractConstantSubgraph(
-      input_edge->src(), &subgraph, &is_constant_graph, &const_inputs));
-  if (!is_constant_graph) {
-    return Status::OK();
-  }
-  const string output_tensor_name =
-      strings::StrCat(input_edge->src()->name(), ":", input_edge->src_output());
-  std::vector<Tensor> outputs;
-
-  // NOTE; we should pass in a function library runtime if we want
-  // to support constant-expression evaluation on functions.
-  Status s = graph_runner_.Run(&subgraph, nullptr /* function_library */,
-                               const_inputs, {output_tensor_name}, &outputs);
-
-  // If all kernels in the constant graph are not registered
-  // in the process, GraphRunner::Run may fail, in which case
-  // we cannot propagate constants, so this is best-effort.
-  if (s.ok()) {
-    *result = outputs[0];
-    *evaluated = true;
-
-    // We memoize (small) constants evaluated so far, so
-    // ExtractConstantSubgraph can avoid extracting the full
-    // subgraph.  As we build up large graphs, this avoids
-    // repeated computation of the early parts of a constant
-    // graph.
-    if (outputs[0].TotalBytes() <= kMaxTensorSize) {
-      const_tensor_map_[output_tensor_name] = outputs[0];
-    }
-  }
-  return Status::OK();
-}
-
-Status ShapeRefiner::TryToInferTensorOutputFromInputShapes(const Edge* edge,
-                                                           Tensor* output,
-                                                           bool* success) {
-  *success = false;
-  const Node* node = edge->src();
-  auto it = node_to_context_.find(node);
-  if (it == node_to_context_.end()) {
-    return errors::FailedPrecondition("Node does not have context.");
-  }
-  InferenceContext* c = it->second->get_context();
-
-  if (node->type_string() == "Shape") {
-    // If input shapes to the shape op are fully defined,
-    // we can infer the shape op's output tensor.
-    bool fully_defined_inputs = c->FullyDefined(c->input(0));
-    if (fully_defined_inputs) {
-      int input_rank = c->Rank(c->input(0));
-      Tensor t(node->output_type(0), TensorShape({input_rank}));
-      if (node->output_type(0) == DT_INT32) {
-        auto flat = t.flat<int>();
-        for (int i = 0; i < input_rank; i++) {
-          int64 dimension = c->Value(c->Dim(c->input(0), i));
-          if (!FastBoundsCheck(dimension, std::numeric_limits<int32>::max())) {
-            return errors::FailedPrecondition(
-                "Shape has output type int32, but dimension exceeds maximum "
-                "int32 value");
-          }
-          flat(i) = static_cast<int32>(dimension);
-        }
-      } else if (node->output_type(0) == DT_INT64) {
-        auto flat = t.flat<int64>();
-        for (int i = 0; i < input_rank; i++) {
-          flat(i) = c->Value(c->Dim(c->input(0), i));
-        }
-      } else {
-        return errors::FailedPrecondition(
-            "Shape has output type that is not int32 or int64");
-      }
-      *output = t;
-      *success = true;
-    }
-  } else if (node->type_string() == "Rank") {
-    bool rank_known = c->RankKnown(c->input(0));
-    if (rank_known) {
-      int32 input_rank = c->Rank(c->input(0));
-      Tensor t(node->output_type(0), TensorShape({}));
-      t.flat<int32>()(0) = input_rank;
-      *output = t;
-      *success = true;
-    }
-  } else if (node->type_string() == "Size") {
-    bool fully_defined_inputs = c->FullyDefined(c->input(0));
-    if (fully_defined_inputs) {
-      int32 rank = c->Rank(c->input(0));
-      Tensor t(node->output_type(0), TensorShape({}));
-      int64 size = 1;
-      for (int i = 0; i < rank; i++) {
-        size *= c->Value(c->Dim(c->input(0), i));
-      }
-      if (node->output_type(0) == DT_INT32) {
-        if (!FastBoundsCheck(size, std::numeric_limits<int32>::max())) {
-          return errors::FailedPrecondition(
-              "Size has output type int32, but size exceeds maximum int32 "
-              "value");
-        }
-        t.flat<int32>()(0) = static_cast<int32>(size);
-      } else if (node->output_type(0) == DT_INT64) {
-        t.flat<int64>()(0) = size;
-      } else {
-        return errors::FailedPrecondition(
-            "Size has output type that is not int32 or int64");
-      }
-      *output = t;
-      *success = true;
-    }
-  }
-  return Status::OK();
-}
-
-Status ShapeRefiner::ExtractConstantSubgraph(
-    Node* target_node, Graph* out_graph, bool* is_constant_graph,
-    std::vector<std::pair<string, Tensor>>* const_inputs) {
-  *is_constant_graph = false;
-  std::unordered_set<string> const_inputs_added;
-
-  if (target_node->op_def().is_stateful()) {
-    return Status::OK();
-  }
-
-  if (target_node->type_string() == "PlaceholderWithDefault") {
-    return Status::OK();
-  }
-
-  // TODO(skyewm): more of the filtering applied in input nodes below should be
-  // applied to target_node here
-
-  struct NodeAndRecursed {
-    Node* new_node = nullptr;
-    bool recursed = false;
-  };
-
-  std::map<Node*, NodeAndRecursed> old_to_new_and_recursed;
-  Node* target_node_copy = out_graph->CopyNode(target_node);
-  old_to_new_and_recursed[target_node].new_node = target_node_copy;
-  old_to_new_and_recursed[target_node].recursed = true;
-
-  // Add the target node's inputs to seed the recursion.
-  std::deque<const Edge*> edges_to_visit;
-  for (const Edge* e : target_node->in_edges()) {
-    // TODO(vrv): What do we do about control edges?  Based on our
-    // definition of a constant graph, we should be free to ignore
-    // control edges since the order in which a constant graph is
-    // executed should be the same regardless of when nodes run: we
-    // should only need to recurse down data edges.
-    if (e->IsControlEdge()) continue;
-    edges_to_visit.push_back(e);
-  }
-
-  *is_constant_graph = true;
-
-  // Iterate over the set of edges to visit (backwards).
-  while (!edges_to_visit.empty()) {
-    const Edge* current_edge = edges_to_visit.front();
-    edges_to_visit.pop_front();
-    Node* current_node = current_edge->src();
-
-    // If the node is stateful, assume the graph is not constant.
-    if (current_node->op_def().is_stateful()) {
-      *is_constant_graph = false;
-      return Status::OK();
-    }
-
-    // During construction or import from GraphConstructor, back edges may not
-    // be filled in.  Don't constant fold through merges at all for now.
-    if (IsMerge(current_node)) {
-      *is_constant_graph = false;
-      return Status::OK();
-    }
-
-    // Don't constant fold enter/exit currently either, as it's easy to end
-    // up with a partial frame.
-    if (IsEnter(current_node) || IsExit(current_node)) {
-      *is_constant_graph = false;
-      return Status::OK();
-    }
-
-    // Placeholders should never be constant folded because their outputs are
-    // fed by the user. Note that "Placeholder" nodes have no inputs so are
-    // handled below.
-    if (current_node->type_string() == "PlaceholderWithDefault") {
-      *is_constant_graph = false;
-      return Status::OK();
-    }
-
-    // If there is nothing more to recurse down, see if
-    // the generator node is a constant.
-    if (current_node->num_inputs() == 0) {
-      if (!current_node->IsConstant()) {
-        // Generator node is not a constant, so subgraph is not
-        // constant.
-        *is_constant_graph = false;
-        return Status::OK();
-      }
-    }
-
-    // Either the node is a constant, or the node is a potential
-    // intermediate node on the path from a constant.
-    //
-    // Add a copy of its node and a new edge to the new subgraph.
-
-    // Get or create the version of 'current_node' in the new graph.
-    Node* current_node_copy;
-    // This gets or creates the NodeAndRecursed entry for current_node.
-    NodeAndRecursed* node_and_recursed = &old_to_new_and_recursed[current_node];
-    if (node_and_recursed->new_node == nullptr) {
-      // First time processing this node.
-      current_node_copy = out_graph->CopyNode(current_node);
-      // Track the mapping from the original node to the new one.
-      node_and_recursed->new_node = current_node_copy;
-    } else {
-      current_node_copy = node_and_recursed->new_node;
-    }
-
-    // Add the edge to the destination node.
-    {
-      auto it = old_to_new_and_recursed.find(current_edge->dst());
-      if (it == old_to_new_and_recursed.end()) {
-        return errors::Internal(
-            "Could not find mapping from old to new copy of destination node: ",
-            current_edge->dst()->name());
-      }
-      Node* dst_copy = it->second.new_node;
-
-      out_graph->AddEdge(current_node_copy, current_edge->src_output(),
-                         dst_copy, current_edge->dst_input());
-    }
-
-    const string& output_tensor_name =
-        strings::StrCat(current_node->name(), ":", current_edge->src_output());
-
-    // Some tensor values can be inferred. For example, a shape op
-    // with input shapes fully defined can have its output tensor inferred.
-    Tensor tensor_inferred;
-    bool successfully_inferred_tensor = false;
-    TF_RETURN_IF_ERROR(TryToInferTensorOutputFromInputShapes(
-        current_edge, &tensor_inferred, &successfully_inferred_tensor));
-    if (successfully_inferred_tensor) {
-      const_inputs->emplace_back(output_tensor_name, tensor_inferred);
-      const_inputs_added.insert(output_tensor_name);
-      continue;
-    }
-
-    // If we have a copy of the input tensor materialized already,
-    // then add to the list of inputs to feed and do not recurse further.
-    auto it = const_tensor_map_.find(output_tensor_name);
-    if (it != const_tensor_map_.end() &&
-        const_inputs_added.count(output_tensor_name) == 0) {
-      const_inputs->emplace_back(output_tensor_name, it->second);
-      const_inputs_added.insert(output_tensor_name);
-      continue;
-    }
-
-    // If this node's inputs have not been processed already, do so now.
-    if (!node_and_recursed->recursed) {
-      node_and_recursed->recursed = true;
-      for (const Edge* e : current_node->in_edges()) {
-        if (e->IsControlEdge()) continue;
-        edges_to_visit.push_back(e);
-      }
-    }
-  }
-
-  return Status::OK();
+  OutputTensor tensor(input_edge->src(), input_edge->src_output());
+  return EvaluateConstantTensor(tensor, *this, *ops_registry_,
+                                graph_def_version_, evaluated, result,
+                                &graph_runner_, &const_tensor_map_,
+                                kMaxTensorSize, disable_constant_propagation_);
 }
 
 Status ShapeRefiner::ConstantPartialShape(InferenceContext* target_context,
diff --git a/tensorflow/core/common_runtime/shape_refiner.h b/tensorflow/core/common_runtime/shape_refiner.h
index 75eb5bf0d2972e6bccdd9c2c265f3494821210cc..d49c4373f0b8c8721bb5f151dfe4d8774fbceb50 100644
--- a/tensorflow/core/common_runtime/shape_refiner.h
+++ b/tensorflow/core/common_runtime/shape_refiner.h
@@ -215,20 +215,6 @@ class ShapeRefiner {
                                 bool keep_nested_shapes,
                                 ExtendedInferenceContext* outer_context);
 
-  // Tries to infer tensor output based on the input shapes of the node. In some
-  // cases, the shapes of the inputs are sufficient for inferring the contents
-  // of the output tensor. For example, a Shape op with fully defined input
-  // shapes can have its output tensor inferred.
-  Status TryToInferTensorOutputFromInputShapes(const Edge* edge, Tensor* output,
-                                               bool* success);
-
-  // Extracts the subgraph ending at 'node' that is statically
-  // computable and inserts into 'out_graph'. If statically computable,
-  // 'is_constant_graph' will be true.
-  Status ExtractConstantSubgraph(
-      Node* node, Graph* out_graph, bool* is_constant_graph,
-      std::vector<std::pair<string, Tensor>>* const_inputs) TF_MUST_USE_RESULT;
-
   Status EvaluateConstantTensorForEdge(const Node* node, int dst_idx,
                                        bool* evaluated, Tensor* result);
 
diff --git a/tensorflow/core/common_runtime/step_stats_collector.cc b/tensorflow/core/common_runtime/step_stats_collector.cc
index cb900db10af98496cfdfafa5a38296bfdc4e996b..f21536d586edcca2cec9257579db9ca616f36a6c 100644
--- a/tensorflow/core/common_runtime/step_stats_collector.cc
+++ b/tensorflow/core/common_runtime/step_stats_collector.cc
@@ -226,13 +226,14 @@ void StepStatsCollector::BuildCostModel(
       if (node) {
         for (int i = 0; i < stats.output_size(); ++i) {
           const auto& output = stats.output(i);
-          cm->RecordMaxMemorySize(node, i,
+          int output_slot = output.slot();
+          cm->RecordMaxMemorySize(node, output_slot,
                                   Bytes(output.tensor_description()
                                             .allocation_description()
                                             .allocated_bytes()),
-                                  stats.output(i).tensor_description().shape(),
-                                  node->output_types()[i]);
-          cm->RecordAllocationId(node, i,
+                                  output.tensor_description().shape(),
+                                  node->output_types()[output_slot]);
+          cm->RecordAllocationId(node, output_slot,
                                  output.tensor_description()
                                      .allocation_description()
                                      .allocation_id());
diff --git a/tensorflow/core/common_runtime/threadpool_device.cc b/tensorflow/core/common_runtime/threadpool_device.cc
index 5aa01376ab047e7613ba7403bb32859a83a09f5a..6d8de6a3c06a84d50c22f6337632ef89cc77e2c8 100644
--- a/tensorflow/core/common_runtime/threadpool_device.cc
+++ b/tensorflow/core/common_runtime/threadpool_device.cc
@@ -16,6 +16,8 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/threadpool_device.h"
 
 #include "tensorflow/core/common_runtime/local_device.h"
+#include "tensorflow/core/common_runtime/scoped_allocator.h"
+#include "tensorflow/core/common_runtime/scoped_allocator_mgr.h"
 #include "tensorflow/core/framework/allocator.h"
 #include "tensorflow/core/framework/allocator_registry.h"
 #include "tensorflow/core/framework/device_base.h"
@@ -40,7 +42,8 @@ ThreadPoolDevice::ThreadPoolDevice(const SessionOptions& options,
                                    Allocator* allocator)
     : LocalDevice(options, Device::BuildDeviceAttributes(
                                name, DEVICE_CPU, memory_limit, locality)),
-      allocator_(allocator) {}
+      allocator_(allocator),
+      scoped_allocator_mgr_(new ScopedAllocatorMgr(name)) {}
 
 ThreadPoolDevice::~ThreadPoolDevice() {}
 
@@ -65,6 +68,17 @@ Allocator* ThreadPoolDevice::GetAllocator(AllocatorAttributes attr) {
   return allocator_;
 }
 
+Allocator* ThreadPoolDevice::GetScopedAllocator(AllocatorAttributes attr,
+                                                int64 step_id) {
+  if (attr.scope_id > 0) {
+    return scoped_allocator_mgr_->GetContainer(step_id)->GetInstance(
+        attr.scope_id);
+  }
+  LOG(FATAL) << "Unexpected call to ThreadPoolDevice::GetScopedAllocator "
+             << "attr.scope_id = " << attr.scope_id;
+  return allocator_;
+}
+
 Status ThreadPoolDevice::MakeTensorFromProto(
     const TensorProto& tensor_proto, const AllocatorAttributes alloc_attrs,
     Tensor* tensor) {
diff --git a/tensorflow/core/common_runtime/threadpool_device.h b/tensorflow/core/common_runtime/threadpool_device.h
index 37cb745a0aa89b9aeae2c289d347faac2ae177dd..afc5d15ebc39883f3d24c91b42d86c46576883c0 100644
--- a/tensorflow/core/common_runtime/threadpool_device.h
+++ b/tensorflow/core/common_runtime/threadpool_device.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_COMMON_RUNTIME_THREADPOOL_DEVICE_H_
-#define TENSORFLOW_COMMON_RUNTIME_THREADPOOL_DEVICE_H_
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_THREADPOOL_DEVICE_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_THREADPOOL_DEVICE_H_
 
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/local_device.h"
@@ -31,6 +31,11 @@ class ThreadPoolDevice : public LocalDevice {
 
   void Compute(OpKernel* op_kernel, OpKernelContext* context) override;
   Allocator* GetAllocator(AllocatorAttributes attr) override;
+  Allocator* GetScopedAllocator(AllocatorAttributes attr,
+                                int64 step_id) override;
+  ScopedAllocatorMgr* GetScopedAllocatorMgr() const override {
+    return scoped_allocator_mgr_.get();
+  }
   Status MakeTensorFromProto(const TensorProto& tensor_proto,
                              const AllocatorAttributes alloc_attrs,
                              Tensor* tensor) override;
@@ -39,8 +44,9 @@ class ThreadPoolDevice : public LocalDevice {
 
  private:
   Allocator* allocator_;  // Not owned
+  std::unique_ptr<ScopedAllocatorMgr> scoped_allocator_mgr_;
 };
 
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_COMMON_RUNTIME_THREADPOOL_DEVICE_H_
+#endif  // TENSORFLOW_CORE_COMMON_RUNTIME_THREADPOOL_DEVICE_H_
diff --git a/tensorflow/core/debug/BUILD b/tensorflow/core/debug/BUILD
index 40cb8353cdccb4307f09b537ff7016e3dca5a8da..f6fe9edb022dce29286190e9948f385b933c5a07 100644
--- a/tensorflow/core/debug/BUILD
+++ b/tensorflow/core/debug/BUILD
@@ -298,6 +298,9 @@ tf_cc_test(
     size = "small",
     srcs = ["debug_grpc_io_utils_test.cc"],
     linkstatic = tf_kernel_tests_linkstatic(),
+    tags = [
+        "no_oss",  # b/73962011
+    ],
     deps = [
         ":debug_graph_utils",
         ":debug_grpc_testlib",
diff --git a/tensorflow/core/distributed_runtime/BUILD b/tensorflow/core/distributed_runtime/BUILD
index 9e152aa0823b67fceb7f103cc6e090f00870f88a..434626bd2da57ce4c4895017c0bb0abef58c6f44 100644
--- a/tensorflow/core/distributed_runtime/BUILD
+++ b/tensorflow/core/distributed_runtime/BUILD
@@ -595,6 +595,7 @@ tf_cc_test(
     srcs = ["recent_request_ids_test.cc"],
     deps = [
         ":recent_request_ids",
+        ":request_id",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
diff --git a/tensorflow/core/distributed_runtime/cluster_function_library_runtime.cc b/tensorflow/core/distributed_runtime/cluster_function_library_runtime.cc
index 3a8d5912369525253904bd700dfdc6e3eb26e0ae..000a03da5da3a75527ceec2d343451316d09d656 100644
--- a/tensorflow/core/distributed_runtime/cluster_function_library_runtime.cc
+++ b/tensorflow/core/distributed_runtime/cluster_function_library_runtime.cc
@@ -121,6 +121,8 @@ Status ClusterFunctionLibraryRuntime::Instantiate(
     const string& function_name, const FunctionLibraryDefinition& lib_def,
     AttrSlice attrs, const FunctionLibraryRuntime::InstantiateOptions& options,
     FunctionLibraryRuntime::LocalHandle* handle) {
+  VLOG(1) << "CFLR::Instantiate: " << function_name << " on " << options.target
+          << " (this: " << this << ")";
   WorkerInterface* wi =
       worker_session_->worker_cache->CreateWorker(options.target);
 
@@ -154,6 +156,9 @@ Status ClusterFunctionLibraryRuntime::Instantiate(
   *handle = function_data_.size();
   function_data_.push_back(FunctionData(resp.graph_handle(), options.target, wi,
                                         send_keys, recv_keys));
+  VLOG(1) << "CFLR::Instantiate: [Success] " << function_name << " on "
+          << options.target << " (this: " << this << ")"
+          << " with handle: " << *handle;
   return Status::OK();
 }
 
@@ -175,32 +180,33 @@ void ClusterFunctionLibraryRuntime::Run(
     return;
   }
 
-  RunGraphRequest req;
-  req.set_session_handle(worker_session_->session_name);
-  req.set_graph_handle(function_data->graph_handle);
+  RunGraphRequest* req = new RunGraphRequest;
+  req->set_session_handle(worker_session_->session_name);
+  req->set_graph_handle(function_data->graph_handle);
   // Borrowed from master_session.cc
   const uint64 step_id = (random::New64() & ((1uLL << 56) - 1)) | (1uLL << 56);
-  req.set_step_id(step_id);
+  req->set_step_id(step_id);
   int i = 0;
   for (const auto& send_key : function_data->send_keys) {
-    NamedTensorProto* send = req.add_send();
+    NamedTensorProto* send = req->add_send();
     send->set_name(send_key);
     args[i].AsProtoTensorContent(send->mutable_tensor());
     i++;
   }
   const std::vector<string>& recv_keys = function_data->recv_keys;
   for (const auto& recv_key : recv_keys) {
-    req.add_recv_key(recv_key);
+    req->add_recv_key(recv_key);
   }
 
   RunGraphResponse* resp = new RunGraphResponse();
   CallOptions* call_options = new CallOptions();
   wi->RunGraphAsync(
-      call_options, &req, resp,
-      [call_options, resp, rets, recv_keys, done](const Status& status) {
+      call_options, req, resp,
+      [call_options, req, resp, rets, recv_keys, done](const Status& status) {
         if (!status.ok()) {
           done(status);
           delete call_options;
+          delete req;
           delete resp;
           return;
         }
@@ -212,25 +218,28 @@ void ClusterFunctionLibraryRuntime::Run(
         for (const auto& recv_key : recv_keys) {
           TensorProto* tp = mapped_recvs[recv_key];
           if (tp == nullptr) {
+            done(errors::Internal("Could not find key: ", recv_key));
             delete call_options;
+            delete req;
             delete resp;
-            done(errors::Internal("Could not find key: ", recv_key));
             return;
           }
           Tensor t;
           if (t.FromProto(*tp)) {
             rets->push_back(t);
           } else {
-            delete call_options;
-            delete resp;
             done(errors::Internal("Could not convert tensor proto: ",
                                   tp->DebugString()));
+            delete call_options;
+            delete req;
+            delete resp;
             return;
           }
         }
+        done(status);
         delete call_options;
+        delete req;
         delete resp;
-        done(status);
       });
 }
 
diff --git a/tensorflow/core/distributed_runtime/graph_mgr.cc b/tensorflow/core/distributed_runtime/graph_mgr.cc
index 7878ebb5f06db0f64e9216250da2a79352274ab3..8447c55bf43d8b2f7b40685932474b075d6cf696 100644
--- a/tensorflow/core/distributed_runtime/graph_mgr.cc
+++ b/tensorflow/core/distributed_runtime/graph_mgr.cc
@@ -134,7 +134,8 @@ Status GraphMgr::InitItem(const string& session, const GraphDef& gdef,
 
   item->proc_flr.reset(new ProcessFunctionLibraryRuntime(
       device_mgr_, worker_env_->env, gdef.versions().producer(),
-      item->lib_def.get(), graph_options.optimizer_options(), cluster_flr));
+      item->lib_def.get(), graph_options.optimizer_options(),
+      worker_env_->compute_pool, cluster_flr));
 
   // Constructs the graph out of "gdef".
   Graph graph(OpRegistry::Global());
@@ -437,7 +438,7 @@ void GraphMgr::ExecuteAsync(const string& handle, const int64 step_id,
 
   StartParallelExecutors(handle, step_id, item, rendezvous, collector,
                          cost_graph, cancellation_manager,
-                         [this, item, rendezvous, done](const Status& s) {
+                         [item, rendezvous, done](const Status& s) {
                            done(s);
                            rendezvous->Unref();
                            item->Unref();
diff --git a/tensorflow/core/distributed_runtime/master_session.cc b/tensorflow/core/distributed_runtime/master_session.cc
index 878a1398c9d382a4b2018712ca9f9e48c11a9345..01da54fcb3ce8d5467bbc15d8db3bc970d8727b5 100644
--- a/tensorflow/core/distributed_runtime/master_session.cc
+++ b/tensorflow/core/distributed_runtime/master_session.cc
@@ -72,7 +72,7 @@ class MasterSession::ReffedClientGraph : public core::RefCounted {
         client_graph_(std::move(cg)),
         session_opts_(session_opts),
         is_partial_(is_partial),
-        debug_opts_(bopts.debug_options),
+        debug_opts_(bopts.callable_options.run_options().debug_options()),
         worker_cache_(worker_cache),
         should_deregister_(should_deregister) {
     VLOG(1) << "Created ReffedClientGraph for node with "
@@ -921,61 +921,70 @@ void MasterSession::ReffedClientGraph::DeregisterPartitions() {
   }
 }
 
+namespace {
+void CopyAndSortStrings(size_t size,
+                        const std::function<string(size_t)>& input_accessor,
+                        protobuf::RepeatedPtrField<string>* output) {
+  // Reserve space up front; elements are added directly to `output`.
+  output->Reserve(output->size() + static_cast<int>(size));
+  for (size_t i = 0; i < size; ++i) {
+    output->Add(input_accessor(i));
+  }
+  std::sort(output->begin(), output->end());
+}
+}  // namespace
+
 void BuildBuildGraphOptions(const RunStepRequestWrapper& req,
                             BuildGraphOptions* opts) {
-  for (size_t i = 0; i < req.num_feeds(); ++i) {
-    opts->feed_endpoints.push_back(req.feed_name(i));
-  }
-  for (size_t i = 0; i < req.num_fetches(); ++i) {
-    opts->fetch_endpoints.push_back(req.fetch_name(i));
-  }
-  for (size_t i = 0; i < req.num_targets(); ++i) {
-    opts->target_nodes.push_back(req.target_name(i));
-  }
+  CallableOptions* callable_opts = &opts->callable_options;
+  CopyAndSortStrings(req.num_feeds(),
+                     [&req](size_t i) { return req.feed_name(i); },
+                     callable_opts->mutable_feed());
+  CopyAndSortStrings(req.num_fetches(),
+                     [&req](size_t i) { return req.fetch_name(i); },
+                     callable_opts->mutable_fetch());
+  CopyAndSortStrings(req.num_targets(),
+                     [&req](size_t i) { return req.target_name(i); },
+                     callable_opts->mutable_target());
 
   if (!req.options().debug_options().debug_tensor_watch_opts().empty()) {
-    opts->debug_options = req.options().debug_options();
+    *callable_opts->mutable_run_options()->mutable_debug_options() =
+        req.options().debug_options();
   }
-
-  std::sort(opts->feed_endpoints.begin(), opts->feed_endpoints.end());
-  std::sort(opts->target_nodes.begin(), opts->target_nodes.end());
-  std::sort(opts->fetch_endpoints.begin(), opts->fetch_endpoints.end());
 }
 
 void BuildBuildGraphOptions(const PartialRunSetupRequest& req,
                             BuildGraphOptions* opts) {
-  for (const auto& feed : req.feed()) {
-    opts->feed_endpoints.push_back(feed);
-  }
-  for (const auto& fetch : req.fetch()) {
-    opts->fetch_endpoints.push_back(fetch);
-  }
-  for (const auto& target : req.target()) {
-    opts->target_nodes.push_back(target);
-  }
+  CallableOptions* callable_opts = &opts->callable_options;
+  CopyAndSortStrings(req.feed_size(), [&req](size_t i) { return req.feed(i); },
+                     callable_opts->mutable_feed());
+  CopyAndSortStrings(req.fetch_size(),
+                     [&req](size_t i) { return req.fetch(i); },
+                     callable_opts->mutable_fetch());
+  CopyAndSortStrings(req.target_size(),
+                     [&req](size_t i) { return req.target(i); },
+                     callable_opts->mutable_target());
 
   // TODO(cais): Add TFDBG support to partial runs.
-
-  std::sort(opts->feed_endpoints.begin(), opts->feed_endpoints.end());
-  std::sort(opts->target_nodes.begin(), opts->target_nodes.end());
-  std::sort(opts->fetch_endpoints.begin(), opts->fetch_endpoints.end());
 }
 
 uint64 HashBuildGraphOptions(const BuildGraphOptions& opts) {
   uint64 h = 0x2b992ddfa23249d6ull;
-  for (const string& name : opts.feed_endpoints) {
+  for (const string& name : opts.callable_options.feed()) {
     h = Hash64(name.c_str(), name.size(), h);
   }
-  for (const string& name : opts.target_nodes) {
+  for (const string& name : opts.callable_options.target()) {
     h = Hash64(name.c_str(), name.size(), h);
   }
-  for (const string& name : opts.fetch_endpoints) {
+  for (const string& name : opts.callable_options.fetch()) {
     h = Hash64(name.c_str(), name.size(), h);
   }
 
-  if (!opts.debug_options.debug_tensor_watch_opts().empty()) {
-    const string watch_summary = SummarizeDebugTensorWatches(
-        opts.debug_options.debug_tensor_watch_opts());
+  const DebugOptions& debug_options =
+      opts.callable_options.run_options().debug_options();
+  if (!debug_options.debug_tensor_watch_opts().empty()) {
+    const string watch_summary =
+        SummarizeDebugTensorWatches(debug_options.debug_tensor_watch_opts());
     h = Hash64(watch_summary.c_str(), watch_summary.size(), h);
   }
 
@@ -984,15 +993,15 @@ uint64 HashBuildGraphOptions(const BuildGraphOptions& opts) {
 
 string BuildGraphOptionsString(const BuildGraphOptions& opts) {
   string buf;
-  for (const string& name : opts.feed_endpoints) {
+  for (const string& name : opts.callable_options.feed()) {
     strings::StrAppend(&buf, " FdE: ", name);
   }
   strings::StrAppend(&buf, "\n");
-  for (const string& name : opts.target_nodes) {
+  for (const string& name : opts.callable_options.target()) {
     strings::StrAppend(&buf, " TN: ", name);
   }
   strings::StrAppend(&buf, "\n");
-  for (const string& name : opts.fetch_endpoints) {
+  for (const string& name : opts.callable_options.fetch()) {
     strings::StrAppend(&buf, " FeE: ", name);
   }
   strings::StrAppend(&buf, "\n");
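Note on the hunks above: both `BuildBuildGraphOptions` overloads now copy the feed, fetch, and target names in sorted order, and `HashBuildGraphOptions` folds them into one 64-bit value. One plausible reading is that sorting makes the hash independent of the order in which a client listed the names. The sketch below illustrates only that idea. `CopySorted`, `Fnv1a`, and `HashNames` are hypothetical stand-ins: FNV-1a replaces TensorFlow's `Hash64`, and `std::vector<std::string>` replaces the protobuf repeated field. None of this is the patch's own helper code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for CopyAndSortStrings: pulls `size` strings from an
// index-based accessor and returns them in sorted order.
std::vector<std::string> CopySorted(
    std::size_t size, const std::function<std::string(std::size_t)>& get) {
  std::vector<std::string> out;
  out.reserve(size);
  for (std::size_t i = 0; i < size; ++i) out.push_back(get(i));
  std::sort(out.begin(), out.end());
  return out;
}

// Seeded FNV-1a; an assumed replacement for Hash64(data, size, seed).
std::uint64_t Fnv1a(const std::string& s, std::uint64_t seed) {
  std::uint64_t h = seed;
  for (unsigned char c : s) {
    h ^= c;
    h *= 0x100000001b3ull;
  }
  return h;
}

// Chains the hash over feeds, then targets, then fetches, in the same order
// as the loops in HashBuildGraphOptions.
std::uint64_t HashNames(const std::vector<std::string>& feeds,
                        const std::vector<std::string>& targets,
                        const std::vector<std::string>& fetches) {
  std::uint64_t h = 0x2b992ddfa23249d6ull;  // Seed constant from the diff.
  for (const auto& name : feeds) h = Fnv1a(name, h);
  for (const auto& name : targets) h = Fnv1a(name, h);
  for (const auto& name : fetches) h = Fnv1a(name, h);
  return h;
}
```

With this shape, two requests that list the same names in a different order produce the same key once each list has passed through `CopySorted`.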
diff --git a/tensorflow/core/distributed_runtime/recent_request_ids.cc b/tensorflow/core/distributed_runtime/recent_request_ids.cc
index c30879406c6924aa85ad4bf8279b278eaf5d29fd..4f6866c5d154ba023b0923af67fe00a7a69b459d 100644
--- a/tensorflow/core/distributed_runtime/recent_request_ids.cc
+++ b/tensorflow/core/distributed_runtime/recent_request_ids.cc
@@ -15,6 +15,8 @@ limitations under the License.
 
 #include "tensorflow/core/distributed_runtime/recent_request_ids.h"
 
+#include <utility>
+
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
@@ -29,12 +31,14 @@ RecentRequestIds::RecentRequestIds(int num_tracked_request_ids)
 Status RecentRequestIds::TrackUnique(int64 request_id,
                                      const string& method_name,
                                      const protobuf::Message& request) {
-  mutex_lock l(mu_);
   if (request_id == 0) {
     // For backwards compatibility, allow all requests with request_id 0.
     return Status::OK();
   }
-  if (set_.count(request_id) > 0) {
+
+  mutex_lock l(mu_);
+  const bool inserted = set_.insert(request_id).second;
+  if (!inserted) {
     // Note: RecentRequestIds is not strict LRU because we don't update
     // request_id's age in the circular_buffer_ if it's tracked again. Strict
     // LRU is not useful here because returning this error will close the
@@ -49,7 +53,6 @@ Status RecentRequestIds::TrackUnique(int64 request_id,
   // when the buffer is not yet full.
   set_.erase(circular_buffer_[next_index_]);
   circular_buffer_[next_index_] = request_id;
-  set_.insert(request_id);
   next_index_ = (next_index_ + 1) % circular_buffer_.size();
   return Status::OK();
 }
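Note on the `TrackUnique` change above: the separate `count()` and `insert()` calls are folded into a single `insert()` whose `.second` result reports whether the id was new, and the lock is now taken only after the `request_id == 0` early return. The following is a minimal, single-threaded sketch of that pattern. `RecentIds` is a hypothetical class written for illustration, it omits the mutex, and it returns `bool` instead of a `Status`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical sketch of a bounded "recently seen ids" tracker.
class RecentIds {
 public:
  explicit RecentIds(std::size_t capacity) : ring_(capacity, 0) {
    assert(capacity > 0);
  }

  // Returns true if `id` was not tracked yet, false for a duplicate.
  bool TrackUnique(std::int64_t id) {
    if (id == 0) return true;  // Mirrors the diff: id 0 is always allowed.
    // A single hash operation both tests membership and inserts.
    if (!seen_.insert(id).second) return false;
    // Evict the oldest id. While the ring is not yet full the slot holds 0,
    // which is never inserted, so erase() is a harmless no-op.
    seen_.erase(ring_[next_]);
    ring_[next_] = id;
    next_ = (next_ + 1) % ring_.size();
    return true;
  }

 private:
  std::size_t next_ = 0;
  std::vector<std::int64_t> ring_;
  std::unordered_set<std::int64_t> seen_;
};
```

As in the original, this is not a strict LRU: a duplicate hit does not refresh the id's position in the ring.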
diff --git a/tensorflow/core/distributed_runtime/recent_request_ids.h b/tensorflow/core/distributed_runtime/recent_request_ids.h
index e8e45331dd5a26e2230bb92e8ce73888d3f28505..11cf937c94659d85e3dc88350f20e107a27fab62 100644
--- a/tensorflow/core/distributed_runtime/recent_request_ids.h
+++ b/tensorflow/core/distributed_runtime/recent_request_ids.h
@@ -16,11 +16,13 @@ limitations under the License.
 #ifndef TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RECENT_REQUEST_IDS_H_
 #define TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RECENT_REQUEST_IDS_H_
 
+#include <string>
+#include <unordered_set>
 #include <vector>
 
 #include "tensorflow/core/lib/core/status.h"
-#include "tensorflow/core/lib/gtl/flatset.h"
 #include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/platform/thread_annotations.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/protobuf/worker.pb.h"
@@ -64,7 +66,7 @@ class RecentRequestIds {
   // request_id.
   int next_index_ GUARDED_BY(mu_) = 0;
   std::vector<int64> circular_buffer_ GUARDED_BY(mu_);
-  gtl::FlatSet<int64> set_ GUARDED_BY(mu_);
+  std::unordered_set<int64> set_ GUARDED_BY(mu_);
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/recent_request_ids_test.cc b/tensorflow/core/distributed_runtime/recent_request_ids_test.cc
index 9a0facf5404bb4e6d0d57f55bcd1f2a4f4f99dba..8910a50e9cda691984d712ebfc5aea1d4f904d3f 100644
--- a/tensorflow/core/distributed_runtime/recent_request_ids_test.cc
+++ b/tensorflow/core/distributed_runtime/recent_request_ids_test.cc
@@ -17,8 +17,10 @@ limitations under the License.
 
 #include <algorithm>
 
+#include "tensorflow/core/distributed_runtime/request_id.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/platform/test_benchmark.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/protobuf/worker.pb.h"
 
@@ -93,4 +95,15 @@ TEST(RecentRequestIds, Ordered3) { TestOrdered(3); }
 TEST(RecentRequestIds, Ordered4) { TestOrdered(4); }
 TEST(RecentRequestIds, Ordered5) { TestOrdered(5); }
 
+void BM_TrackUnique(int iters) {
+  RecentRequestIds recent_request_ids(100000);
+  RecvTensorRequest request;
+  for (int i = 0; i < iters; ++i) {
+    TF_CHECK_OK(recent_request_ids.TrackUnique(GetUniqueRequestId(),
+                                               "BM_TrackUnique", request));
+  }
+}
+
+BENCHMARK(BM_TrackUnique);
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/rpc/BUILD b/tensorflow/core/distributed_runtime/rpc/BUILD
index dade26abc6a3c58f24c759ad863600a156985708..9dae1b9859787393464f4a45fc597be7fc41601c 100644
--- a/tensorflow/core/distributed_runtime/rpc/BUILD
+++ b/tensorflow/core/distributed_runtime/rpc/BUILD
@@ -259,6 +259,7 @@ cc_library(
     hdrs = ["grpc_serialization_traits.h"],
     deps = [
         "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc_unsecure",
     ],
 )
 
@@ -381,6 +382,7 @@ tf_cuda_library(
     data = [
         ":grpc_testlib_server",
     ],
+    visibility = ["//tensorflow:__subpackages__"],
     deps = [
         ":grpc_session",
         ":grpc_testlib_ops",
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_serialization_traits.h b/tensorflow/core/distributed_runtime/rpc/grpc_serialization_traits.h
index 730124c25e9a3e8d102a9dd39e4c4a17f2ce39d1..e7f5fb0c6ae24caa3ffe5039d5daddb771c4858d 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_serialization_traits.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_serialization_traits.h
@@ -18,6 +18,7 @@ limitations under the License.
 
 #include "grpc++/impl/codegen/proto_utils.h"
 #include "grpc++/support/slice.h"
+#include "grpc/grpc.h"
 
 namespace grpc {
 
@@ -30,13 +31,13 @@ class GrpcBufferWriter final
  public:
   explicit GrpcBufferWriter(grpc_byte_buffer** bp, int block_size)
       : block_size_(block_size), byte_count_(0), have_backup_(false) {
-    *bp = g_core_codegen_interface->grpc_raw_byte_buffer_create(NULL, 0);
+    *bp = grpc_raw_byte_buffer_create(NULL, 0);
     slice_buffer_ = &(*bp)->data.raw.slice_buffer;
   }
 
   ~GrpcBufferWriter() override {
     if (have_backup_) {
-      g_core_codegen_interface->grpc_slice_unref(backup_slice_);
+      grpc_slice_unref(backup_slice_);
     }
   }
 
@@ -45,24 +46,24 @@ class GrpcBufferWriter final
       slice_ = backup_slice_;
       have_backup_ = false;
     } else {
-      slice_ = g_core_codegen_interface->grpc_slice_malloc(block_size_);
+      slice_ = grpc_slice_malloc(block_size_);
     }
     *data = GRPC_SLICE_START_PTR(slice_);
     // On win x64, int is only 32bit
     GPR_CODEGEN_ASSERT(GRPC_SLICE_LENGTH(slice_) <= INT_MAX);
     byte_count_ += * size = (int)GRPC_SLICE_LENGTH(slice_);
-    g_core_codegen_interface->grpc_slice_buffer_add(slice_buffer_, slice_);
+    grpc_slice_buffer_add(slice_buffer_, slice_);
     return true;
   }
 
   void BackUp(int count) override {
-    g_core_codegen_interface->grpc_slice_buffer_pop(slice_buffer_);
+    grpc_slice_buffer_pop(slice_buffer_);
     if (count == block_size_) {
       backup_slice_ = slice_;
     } else {
-      backup_slice_ = g_core_codegen_interface->grpc_slice_split_tail(
-          &slice_, GRPC_SLICE_LENGTH(slice_) - count);
-      g_core_codegen_interface->grpc_slice_buffer_add(slice_buffer_, slice_);
+      backup_slice_ =
+          grpc_slice_split_tail(&slice_, GRPC_SLICE_LENGTH(slice_) - count);
+      grpc_slice_buffer_add(slice_buffer_, slice_);
     }
     // It's dangerous to keep an inlined grpc_slice as the backup slice, since
     // on a following Next() call, a reference will be returned to this slice
@@ -85,29 +86,12 @@ class GrpcBufferWriter final
 
 class GrpcBufferReader final
     : public ::grpc::protobuf::io::ZeroCopyInputStream {
-  typedef void (CoreCodegenInterface::*OldReaderInitAPI)(
-      grpc_byte_buffer_reader* reader, grpc_byte_buffer* buffer);
-  typedef int (CoreCodegenInterface::*NewReaderInitAPI)(
-      grpc_byte_buffer_reader* reader, grpc_byte_buffer* buffer);
-  void ReaderInit(OldReaderInitAPI ptr, grpc_byte_buffer_reader* reader,
-                  grpc_byte_buffer* buffer) {
-    (g_core_codegen_interface->*ptr)(reader, buffer);
-  }
-  void ReaderInit(NewReaderInitAPI ptr, grpc_byte_buffer_reader* reader,
-                  grpc_byte_buffer* buffer) {
-    int result = (g_core_codegen_interface->*ptr)(reader, buffer);
-    (void)result;
-  }
-
  public:
   explicit GrpcBufferReader(grpc_byte_buffer* buffer)
       : byte_count_(0), backup_count_(0) {
-    ReaderInit(&CoreCodegenInterface::grpc_byte_buffer_reader_init, &reader_,
-               buffer);
-  }
-  ~GrpcBufferReader() override {
-    g_core_codegen_interface->grpc_byte_buffer_reader_destroy(&reader_);
+    (void)grpc_byte_buffer_reader_init(&reader_, buffer);
   }
+  ~GrpcBufferReader() override { grpc_byte_buffer_reader_destroy(&reader_); }
 
   bool Next(const void** data, int* size) override {
     if (backup_count_ > 0) {
@@ -118,11 +102,10 @@ class GrpcBufferReader final
       backup_count_ = 0;
       return true;
     }
-    if (!g_core_codegen_interface->grpc_byte_buffer_reader_next(&reader_,
-                                                                &slice_)) {
+    if (!grpc_byte_buffer_reader_next(&reader_, &slice_)) {
       return false;
     }
-    g_core_codegen_interface->grpc_slice_unref(slice_);
+    grpc_slice_unref(slice_);
     *data = GRPC_SLICE_START_PTR(slice_);
     // On win x64, int is only 32bit
     GPR_CODEGEN_ASSERT(GRPC_SLICE_LENGTH(slice_) <= INT_MAX);
@@ -176,18 +159,18 @@ class UnlimitedSizeProtoSerializationTraits {
       return Status(StatusCode::INTERNAL, "Message length was negative");
     } else if (byte_size <=
                tensorflow_helper::kGrpcBufferWriterMaxBufferLength) {
-      grpc_slice slice = g_core_codegen_interface->grpc_slice_malloc(byte_size);
+      grpc_slice slice = grpc_slice_malloc(byte_size);
       GPR_CODEGEN_ASSERT(
           GRPC_SLICE_END_PTR(slice) ==
           msg.SerializeWithCachedSizesToArray(GRPC_SLICE_START_PTR(slice)));
-      *bp = g_core_codegen_interface->grpc_raw_byte_buffer_create(&slice, 1);
-      g_core_codegen_interface->grpc_slice_unref(slice);
-      return g_core_codegen_interface->ok();
+      *bp = grpc_raw_byte_buffer_create(&slice, 1);
+      grpc_slice_unref(slice);
+      return Status::OK;
     } else {
       tensorflow_helper::GrpcBufferWriter writer(
           bp, tensorflow_helper::kGrpcBufferWriterMaxBufferLength);
       return msg.SerializeToZeroCopyStream(&writer)
-                 ? g_core_codegen_interface->ok()
+                 ? Status::OK
                  : Status(StatusCode::INTERNAL, "Failed to serialize message");
     }
   }
@@ -197,7 +180,7 @@ class UnlimitedSizeProtoSerializationTraits {
     if (buffer == nullptr) {
       return Status(StatusCode::INTERNAL, "No payload");
     }
-    Status result = g_core_codegen_interface->ok();
+    Status result = Status::OK;
     {
       tensorflow_helper::GrpcBufferReader reader(buffer);
       ::grpc::protobuf::io::CodedInputStream decoder(&reader);
@@ -214,7 +197,7 @@ class UnlimitedSizeProtoSerializationTraits {
         result = Status(StatusCode::INTERNAL, "Did not read entire message");
       }
     }
-    g_core_codegen_interface->grpc_byte_buffer_destroy(buffer);
+    grpc_byte_buffer_destroy(buffer);
     return result;
   }
 };
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
index c4ac92d809627e7134b5d4ae694f9978cd5390b4..a6f4be3eaf69f40199e64c43dff443e886aa5aa1 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
@@ -106,7 +106,8 @@ GrpcServer::~GrpcServer() {
 Status GrpcServer::Init(
     ServiceInitFunction service_func,
     const RendezvousMgrCreationFunction& rendezvous_mgr_func,
-    const WorkerCreationFunction& worker_func) {
+    const WorkerCreationFunction& worker_func,
+    const StatsPublisherFactory& stats_factory) {
   mutex_lock l(mu_);
   CHECK_EQ(state_, NEW);
   master_env_.env = env_;
@@ -218,7 +219,7 @@ Status GrpcServer::Init(
   master_env_.ops = OpRegistry::Global();
   master_env_.worker_cache = worker_cache;
   master_env_.master_session_factory =
-      [config](
+      [config, stats_factory](
           SessionOptions options, const MasterEnv* env,
           std::unique_ptr<std::vector<std::unique_ptr<Device>>> remote_devs,
           std::unique_ptr<WorkerCacheInterface> worker_cache,
@@ -226,7 +227,7 @@ Status GrpcServer::Init(
         options.config.MergeFrom(config);
         return new MasterSession(options, env, std::move(remote_devs),
                                  std::move(worker_cache), std::move(device_set),
-                                 CreateNoOpStatsPublisher);
+                                 stats_factory);
       };
   master_env_.worker_cache_factory =
       [this](const WorkerCacheFactoryOptions& options,
@@ -241,6 +242,14 @@ Status GrpcServer::Init(
   return Status::OK();
 }
 
+Status GrpcServer::Init(
+    ServiceInitFunction service_func,
+    const RendezvousMgrCreationFunction& rendezvous_mgr_func,
+    const WorkerCreationFunction& worker_func) {
+  return Init(std::move(service_func), rendezvous_mgr_func, worker_func,
+              CreateNoOpStatsPublisher);
+}
+
 Status GrpcServer::Init(
     ServiceInitFunction service_func,
     const RendezvousMgrCreationFunction& rendezvous_mgr_func) {
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h
index 8b12ac1461d6b1fa3098197aa7697031a5d3075b..7c2f06f618a85c901ce7a7902cb8b1bc4e57be40 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h
@@ -22,6 +22,7 @@ limitations under the License.
 #include "grpc++/security/credentials.h"
 
 #include "tensorflow/core/common_runtime/process_util.h"
+#include "tensorflow/core/common_runtime/stats_publisher_interface.h"
 #include "tensorflow/core/distributed_runtime/master_env.h"
 #include "tensorflow/core/distributed_runtime/rpc/async_service_interface.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_channel.h"
@@ -68,6 +69,11 @@ class GrpcServer : public ServerInterface {
   const string target() const override;
 
  protected:
+  Status Init(ServiceInitFunction service_func,
+              const RendezvousMgrCreationFunction& rendezvous_mgr_func,
+              const WorkerCreationFunction& worker_func,
+              const StatsPublisherFactory& stats_factory);
+
   Status Init(ServiceInitFunction service_func,
               const RendezvousMgrCreationFunction& rendezvous_mgr_func,
               const WorkerCreationFunction& worker_func);
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_testlib.cc b/tensorflow/core/distributed_runtime/rpc/grpc_testlib.cc
index c237f2dce43b9391c340736f45166c9adc2a5b78..89f83f9f24d570d96704ea0b2d09da13147b1d6c 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_testlib.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_testlib.cc
@@ -57,7 +57,7 @@ Status TestCluster::MakeTestCluster(const SessionOptions& options, int n,
          tf_jobs, "--tf_job=localhost", strings::StrCat("--tf_task=", i),
          strings::StrCat("--num_cpus=", num_cpus),
          strings::StrCat("--num_gpus=", num_gpus)});
-    ret->subprocesses_.emplace_back(testing::CreateSubProcess(argv));
+    ret->subprocesses_.emplace_back(CreateSubProcess(argv));
     bool success = ret->subprocesses_[i]->Start();
     if (!success) {
       return errors::Internal("Could not start subprocess");
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_testlib.h b/tensorflow/core/distributed_runtime/rpc/grpc_testlib.h
index 4b3a03b1d708744bded25ff4d320979bb7eb38b2..d5baaae353a99b2681ae5e0873a4cef7161845f3 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_testlib.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_testlib.h
@@ -23,6 +23,7 @@ limitations under the License.
 #include "tensorflow/core/framework/device_attributes.pb.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/macros.h"
+#include "tensorflow/core/platform/subprocess.h"
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/public/session_options.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.h b/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.h
index 1a5e2edfb240198c50d3b5d00bec1127fceff725..2a2f7e3ffbef10f9f2997fc554f010d3f8689ca2 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.h
@@ -88,7 +88,7 @@ class SerializationTraits<tensorflow::TensorResponse>
     if (buffer == nullptr) {
       return Status(StatusCode::INTERNAL, "No payload");
     }
-    Status result = g_core_codegen_interface->ok();
+    Status result = Status::OK;
     if (result.ok()) {
       ::tensorflow::GrpcByteSource source(buffer);
       auto s = msg->ParseFrom(&source);
@@ -98,7 +98,7 @@ class SerializationTraits<tensorflow::TensorResponse>
                             "TensorResponse parse error", s.ToString()));
       }
     }
-    g_core_codegen_interface->grpc_byte_buffer_destroy(buffer);
+    grpc_byte_buffer_destroy(buffer);
     return result;
   }
 };
diff --git a/tensorflow/core/distributed_runtime/scheduler.cc b/tensorflow/core/distributed_runtime/scheduler.cc
index 9dae5b3b926fab14c2b36955436d3956baa29fdd..84036361971b73f9fb7fe990833d5018f6321e27 100644
--- a/tensorflow/core/distributed_runtime/scheduler.cc
+++ b/tensorflow/core/distributed_runtime/scheduler.cc
@@ -80,7 +80,7 @@ Microseconds SlackAnalysis::ComputeAsap(std::vector<Microseconds>* asap_times) {
   std::vector<int> pending_count(graph_->num_node_ids());
   InitializePending(graph_, &pending_count);
 
-  std::deque<Node*> queue;
+  std::deque<const Node*> queue;
   Node* srcNode = graph_->source_node();
   queue.push_back(srcNode);
   (*asap_times)[srcNode->id()] = 0;
@@ -92,7 +92,7 @@ Microseconds SlackAnalysis::ComputeAsap(std::vector<Microseconds>* asap_times) {
     for (const Edge* out_edge : curr->out_edges()) {
       // The time needed for 'out' to get its input from 'curr'.
       Microseconds copy_time(0);
-      Node* out = out_edge->dst();
+      const Node* out = out_edge->dst();
       if (!out_edge->IsControlEdge() &&
           curr->assigned_device_name() != out->assigned_device_name()) {
         // Add an arbitrary 10microsecs for each copy.
@@ -137,7 +137,7 @@ Microseconds SlackAnalysis::ComputeAlap(std::vector<Microseconds>* alap_times) {
     }
   }
 
-  std::deque<Node*> queue;
+  std::deque<const Node*> queue;
   Node* sinkNode = graph_->sink_node();
   queue.push_back(sinkNode);
   (*alap_times)[sinkNode->id()] = 0;
@@ -148,7 +148,7 @@ Microseconds SlackAnalysis::ComputeAlap(std::vector<Microseconds>* alap_times) {
     for (const Edge* in_edge : curr->in_edges()) {
       // The time needed for 'curr' to get its input from 'src'.
       Microseconds copy_time(0);
-      Node* src = in_edge->src();
+      const Node* src = in_edge->src();
       if (!in_edge->IsControlEdge() &&
           src->assigned_device_name() != curr->assigned_device_name()) {
         // TODO(yuanbyu): Use the real cost model
@@ -236,7 +236,7 @@ Microseconds GreedyScheduler::ComputeSchedule(
 
       for (const Edge* out_edge : event.node->out_edges()) {
         Microseconds copy_time(0);
-        Node* out = out_edge->dst();
+        const Node* out = out_edge->dst();
         if (!out_edge->IsControlEdge() &&
             event.node->assigned_device_name() != out->assigned_device_name()) {
           // TODO(yuanbyu): Use below with the real cost model.
@@ -277,11 +277,11 @@ Microseconds GreedyScheduler::ComputeSchedule(
   return max_completion;
 }
 
-Node* GreedyScheduler::GetNodeWithHighestPriority(
-    const std::vector<Node*>& nodes) {
-  Node* curr_node = nullptr;
+const Node* GreedyScheduler::GetNodeWithHighestPriority(
+    const std::vector<const Node*>& nodes) {
+  const Node* curr_node = nullptr;
   int64 curr_priority = kint64max;
-  for (Node* n : nodes) {
+  for (const Node* n : nodes) {
     if ((*priority_)[n->id()] < curr_priority) {
       curr_node = n;
       curr_priority = (*priority_)[n->id()];
diff --git a/tensorflow/core/distributed_runtime/scheduler.h b/tensorflow/core/distributed_runtime/scheduler.h
index ef87b9834dba50cf628a8c29c70b0266661d6227..bf9d0d1bec33284a44f69412477edb4a0963e8a1 100644
--- a/tensorflow/core/distributed_runtime/scheduler.h
+++ b/tensorflow/core/distributed_runtime/scheduler.h
@@ -57,11 +57,11 @@ class GreedyScheduler {
   struct Sim {
     int degree_parallelism;
     int num_running;
-    std::vector<Node*> ready_nodes;
+    std::vector<const Node*> ready_nodes;
   };
 
   struct Event {
-    Node* node;
+    const Node* node;
     Microseconds time;
     bool is_completion;
 
@@ -79,7 +79,7 @@ class GreedyScheduler {
 
  private:
   // Returns the ready node with the highest priority for a sim.
-  Node* GetNodeWithHighestPriority(const std::vector<Node*>& nodes);
+  const Node* GetNodeWithHighestPriority(const std::vector<const Node*>& nodes);
 
   const DeviceSet* devices_;
   const CostModel* cost_model_;
diff --git a/tensorflow/core/distributed_runtime/tensor_coding.cc b/tensorflow/core/distributed_runtime/tensor_coding.cc
index 34a4013547b5feef12b49198bff4e733f1b9e932..fe2d1a12934dde814344b70f52fbc972f74347e0 100644
--- a/tensorflow/core/distributed_runtime/tensor_coding.cc
+++ b/tensorflow/core/distributed_runtime/tensor_coding.cc
@@ -81,7 +81,7 @@ void TensorResponse::InitPartial(const RecvTensorResponse& response) {
 Status TensorResponse::ParseFrom(Source* source) {
   if (!on_host_) {
     protobuf::io::CodedInputStream input(source->contents());
-    input.SetTotalBytesLimit(INT_MAX);  // Unlimited
+    input.SetTotalBytesLimit(INT_MAX, INT_MAX);  // Unlimited
 
     // Pre-parse into local storage, then delegate to device.
     if (!meta_.ParseFromCodedStream(&input) || !input.ConsumedEntireMessage()) {
@@ -217,7 +217,7 @@ bool TensorResponse::ParseTensorSubmessage(
 
 bool TensorResponse::ParseFast(Source* source) {
   protobuf::io::CodedInputStream input(source->contents());
-  input.SetTotalBytesLimit(INT_MAX);  // Unlimited
+  input.SetTotalBytesLimit(INT_MAX, INT_MAX);  // Unlimited
   while (true) {
     auto p = input.ReadTagWithCutoff(127);
     int tag = GetTagFieldNumber(p.first);
diff --git a/tensorflow/core/distributed_runtime/worker.cc b/tensorflow/core/distributed_runtime/worker.cc
index 63455493671fcd1f4282bc804f8f2a521c056dce..598652fb981c24c9ef0329c3c2244fc2bdb8ea3e 100644
--- a/tensorflow/core/distributed_runtime/worker.cc
+++ b/tensorflow/core/distributed_runtime/worker.cc
@@ -215,7 +215,7 @@ void Worker::DoPartialRunGraph(CallOptions* opts,
   GraphMgr::NamedTensors in;
   GraphMgr::NamedTensors* out = new GraphMgr::NamedTensors;
   Status s = PrepareRunGraph(request, &in, out);
-  auto finish = [this, done, out, opts](const Status& s) {
+  auto finish = [done, out, opts](const Status& s) {
     opts->ClearCancelCallback();
     delete out;
     done(s);
@@ -247,7 +247,7 @@ void Worker::DoPartialRunGraph(CallOptions* opts,
     session->graph_mgr->ExecuteAsync(
         graph_handle, step_id, session.get(), request->exec_opts(),
         nullptr /* collector */, nullptr /* response */, cm, in,
-        [this, token, step_id, session, cm](Status s) {
+        [this, token, step_id, session](Status s) {
           {
             mutex_lock l(mu_);
             cancellation_manager_->DeregisterCallback(token);
diff --git a/tensorflow/core/framework/allocator.h b/tensorflow/core/framework/allocator.h
index 3ce1b612464291eceb6e08d9b0f2deca70cda27a..2c87156dca61188a3a8deabf9ad483c9180ccd55 100644
--- a/tensorflow/core/framework/allocator.h
+++ b/tensorflow/core/framework/allocator.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_FRAMEWORK_ALLOCATOR_H_
-#define TENSORFLOW_FRAMEWORK_ALLOCATOR_H_
+#ifndef TENSORFLOW_CORE_FRAMEWORK_ALLOCATOR_H_
+#define TENSORFLOW_CORE_FRAMEWORK_ALLOCATOR_H_
 
 #include <stdlib.h>
 
@@ -359,7 +359,12 @@ struct AllocatorAttributes {
   bool nic_compatible() const { return value & (0x1 << 1); }
   void set_gpu_compatible(bool v) { value |= (static_cast<int>(v) << 2); }
   bool gpu_compatible() const { return value & (0x1 << 2); }
-  void Merge(AllocatorAttributes other) { value |= other.value; }
+  void Merge(AllocatorAttributes other) {
+    value |= other.value;
+    scope_id = (scope_id > 0 && other.scope_id == 0)
+                   ? scope_id
+                   : ((scope_id == 0) ? other.scope_id : 0);
+  }
   // Returns true if the fields set in *this is a subset of or equal to
   // those set in other.
   bool IsEqualOrLessRestrictiveThan(const AllocatorAttributes& other) const {
@@ -371,6 +376,9 @@ struct AllocatorAttributes {
   // upper 8 bits in device-specific ways, and ops implemented for those
   // devices are responsible for setting those 8 bits appropriately.
   uint32 value = 0;
+  // EXPERIMENTAL: If this is greater than zero, then allocation is delegated to
+  // a named special-purpose allocator on the same device.
+  int32 scope_id = 0;
 };
 
 // Returns a trivial implementation of Allocator which uses the system
@@ -396,4 +404,4 @@ class SubAllocator {
 
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_FRAMEWORK_ALLOCATOR_H_
+#endif  // TENSORFLOW_CORE_FRAMEWORK_ALLOCATOR_H_
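Note on the new `AllocatorAttributes::Merge` above: besides OR-ing the flag bits, it now combines `scope_id`. Reading the expression as written, a nonzero id survives only when the other side is zero, and two nonzero ids (even equal ones) collapse to 0. The sketch below copies that expression into a hypothetical `Attrs` struct so the cases can be checked in isolation. `Attrs` and `MergedScope` are illustrative names, not TensorFlow types.

```cpp
#include <cstdint>

// Hypothetical miniature of AllocatorAttributes with only the two fields
// that Merge() touches.
struct Attrs {
  std::uint32_t value = 0;
  std::int32_t scope_id = 0;

  void Merge(Attrs other) {
    value |= other.value;
    // Same expression as in the diff.
    scope_id = (scope_id > 0 && other.scope_id == 0)
                   ? scope_id
                   : ((scope_id == 0) ? other.scope_id : 0);
  }
};

// Convenience helper: the scope_id that results from merging `b` into `a`.
std::int32_t MergedScope(std::int32_t a, std::int32_t b) {
  Attrs lhs;
  lhs.scope_id = a;
  Attrs rhs;
  rhs.scope_id = b;
  lhs.Merge(rhs);
  return lhs.scope_id;
}
```

Under this reading, `MergedScope(3, 0)` and `MergedScope(0, 5)` keep the nonzero id, while `MergedScope(3, 5)` yields 0.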
diff --git a/tensorflow/core/framework/attr_value_util.cc b/tensorflow/core/framework/attr_value_util.cc
index a1c39d2a7a78354239f2cdbb718160906b233ddd..ebb56d525e52e7351e4159dce44349ce0649921c 100644
--- a/tensorflow/core/framework/attr_value_util.cc
+++ b/tensorflow/core/framework/attr_value_util.cc
@@ -26,6 +26,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/hash/hash.h"
+#include "tensorflow/core/lib/strings/proto_serialization.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/platform/protobuf.h"
 
diff --git a/tensorflow/core/framework/bfloat16_test.cc b/tensorflow/core/framework/bfloat16_test.cc
index 17e6209f8e5ad5240dfc8ca1def75c178da45c27..206396a25ab784e93daa227bcf79fe608f5df706 100644
--- a/tensorflow/core/framework/bfloat16_test.cc
+++ b/tensorflow/core/framework/bfloat16_test.cc
@@ -37,19 +37,27 @@ float BinaryToFloat(uint32_t sign, uint32_t exponent, uint32_t high_mantissa,
 
 struct Bfloat16TestParam {
   float input;
-  float expected;
+  float expected_truncation;
+  float expected_rounding;
 };
 
 class Bfloat16Test : public ::testing::Test,
                      public ::testing::WithParamInterface<Bfloat16TestParam> {};
 
 TEST_P(Bfloat16Test, TruncateTest) {
-  bfloat16 a(GetParam().input);
+  bfloat16 truncated(GetParam().input);
   if (std::isnan(GetParam().input)) {
-    EXPECT_TRUE(std::isnan(float(a)) || std::isinf(float(a)));
+    EXPECT_TRUE(std::isnan(float(truncated)) || std::isinf(float(truncated)));
     return;
   }
-  EXPECT_EQ(GetParam().expected, float(a));
+  EXPECT_EQ(GetParam().expected_truncation, float(truncated));
+
+  bfloat16 rounded = bfloat16::round_to_bfloat16(GetParam().input);
+  if (std::isnan(GetParam().input)) {
+    EXPECT_TRUE(std::isnan(float(rounded)) || std::isinf(float(rounded)));
+    return;
+  }
+  EXPECT_EQ(GetParam().expected_rounding, float(rounded));
 }
 
 INSTANTIATE_TEST_CASE_P(
@@ -57,37 +65,48 @@ INSTANTIATE_TEST_CASE_P(
     ::testing::Values(
         Bfloat16TestParam{
             BinaryToFloat(0, 0b10000000, 0b1001000, 0b1111010111000011),
-            BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000)},
+            BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000),
+            BinaryToFloat(0, 0b10000000, 0b1001001, 0b0000000000000000)},
         Bfloat16TestParam{
             BinaryToFloat(1, 0b10000000, 0b1001000, 0b1111010111000011),
-            BinaryToFloat(1, 0b10000000, 0b1001000, 0b0000000000000000)},
+            BinaryToFloat(1, 0b10000000, 0b1001000, 0b0000000000000000),
+            BinaryToFloat(1, 0b10000000, 0b1001001, 0b0000000000000000)},
         Bfloat16TestParam{
             BinaryToFloat(0, 0b10000000, 0b1001000, 0b1000000000000000),
+            BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000),
             BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000)},
         Bfloat16TestParam{
             BinaryToFloat(0, 0b11111111, 0b0000000, 0b0000000000000001),
-            BinaryToFloat(0, 0b11111111, 0b0000000, 0b0000000000000000)},
+            BinaryToFloat(0, 0b11111111, 0b0000000, 0b0000000000000000),
+            BinaryToFloat(0, 0b11111111, 0b1000000, 0b0000000000000000)},
         Bfloat16TestParam{
             BinaryToFloat(0, 0b11111111, 0b1111111, 0b1111111111111111),
-            BinaryToFloat(0, 0b11111111, 0b1111111, 0b0000000000000000)},
+            BinaryToFloat(0, 0b11111111, 0b1111111, 0b0000000000000000),
+            BinaryToFloat(0, 0b11111111, 0b1000000, 0b0000000000000000)},
         Bfloat16TestParam{
             BinaryToFloat(1, 0b10000000, 0b1001000, 0b1100000000000000),
-            BinaryToFloat(1, 0b10000000, 0b1001000, 0b0000000000000000)},
+            BinaryToFloat(1, 0b10000000, 0b1001000, 0b0000000000000000),
+            BinaryToFloat(1, 0b10000000, 0b1001001, 0b0000000000000000)},
         Bfloat16TestParam{
+            BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000),
             BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000),
             BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000)},
         Bfloat16TestParam{
             BinaryToFloat(0, 0b10000000, 0b1001000, 0b0100000000000000),
+            BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000),
             BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000)},
         Bfloat16TestParam{
             BinaryToFloat(0, 0b10000000, 0b1001000, 0b1000000000000000),
+            BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000),
             BinaryToFloat(0, 0b10000000, 0b1001000, 0b0000000000000000)},
         Bfloat16TestParam{
             BinaryToFloat(0, 0b00000000, 0b1001000, 0b1000000000000000),
+            BinaryToFloat(0, 0b00000000, 0b1001000, 0b0000000000000000),
             BinaryToFloat(0, 0b00000000, 0b1001000, 0b0000000000000000)},
         Bfloat16TestParam{
             BinaryToFloat(0, 0b00000000, 0b1111111, 0b1100000000000000),
-            BinaryToFloat(0, 0b00000000, 0b1111111, 0b0000000000000000)}));
+            BinaryToFloat(0, 0b00000000, 0b1111111, 0b0000000000000000),
+            BinaryToFloat(0, 0b00000001, 0b0000000, 0b0000000000000000)}));
 
 TEST(Bfloat16Test, Conversion) {
   float a[100];
diff --git a/tensorflow/core/framework/cancellation.cc b/tensorflow/core/framework/cancellation.cc
index 9da4828bbad7b6333336dd1215441f5c5f62151a..1258e40c9344ab9b41b6c8e0e29dce5d8da3bbf9 100644
--- a/tensorflow/core/framework/cancellation.cc
+++ b/tensorflow/core/framework/cancellation.cc
@@ -89,6 +89,10 @@ bool CancellationManager::DeregisterCallback(CancellationToken token) {
   }
 }
 
-CancellationManager::~CancellationManager() { StartCancel(); }
+CancellationManager::~CancellationManager() {
+  if (!callbacks_.empty()) {
+    StartCancel();
+  }
+}
 
 }  // end namespace tensorflow
diff --git a/tensorflow/core/framework/dataset.h b/tensorflow/core/framework/dataset.h
index 6ab23d92a421df8b5fb9bcf637ad805d67577aa1..beaf0adbc5e0972faac96ac975c887c0080ec74f 100644
--- a/tensorflow/core/framework/dataset.h
+++ b/tensorflow/core/framework/dataset.h
@@ -305,6 +305,14 @@ class IteratorContext {
     return params_.allocator_getter(attrs);
   }
 
+  std::function<Allocator*(AllocatorAttributes)> allocator_getter() {
+    return params_.allocator_getter;
+  }
+
+  std::function<std::shared_ptr<StatsAggregator>()> stats_aggregator_getter() {
+    return params_.stats_aggregator_getter;
+  }
+
  private:
   Params params_;
 };
diff --git a/tensorflow/core/framework/device_base.h b/tensorflow/core/framework/device_base.h
index fb6d5c69e135c0263845cf71b93ac53bb2a359ed..52b9077d8cfd3e39c9404748d7a1051e238c36db 100644
--- a/tensorflow/core/framework/device_base.h
+++ b/tensorflow/core/framework/device_base.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_FRAMEWORK_DEVICE_BASE_H_
-#define TENSORFLOW_FRAMEWORK_DEVICE_BASE_H_
+#ifndef TENSORFLOW_CORE_FRAMEWORK_DEVICE_BASE_H_
+#define TENSORFLOW_CORE_FRAMEWORK_DEVICE_BASE_H_
 
 #include <memory>
 #include <string>
@@ -48,6 +48,7 @@ class Env;
 class EventMgr;
 class OpKernelContext;
 class ResourceMgr;
+class ScopedAllocatorMgr;
 class TensorProto;
 
 namespace thread {
@@ -179,6 +180,15 @@ class DeviceBase {
     return GetAllocator(attr);
   }
 
+  // Returns an Allocator prepared for use in particular places by graph
+  // optimization.
+  virtual Allocator* GetScopedAllocator(AllocatorAttributes attr,
+                                        int64 step_id) {
+    LOG(FATAL) << "Device does not implement GetScopedAllocator()";
+  }
+
+  virtual ScopedAllocatorMgr* GetScopedAllocatorMgr() const { return nullptr; }
+
   virtual const Eigen::ThreadPoolDevice* eigen_cpu_device() {
     CHECK(eigen_cpu_device_ != nullptr);
     return eigen_cpu_device_;
@@ -243,4 +253,4 @@ class DeviceBase {
 
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_FRAMEWORK_DEVICE_BASE_H_
+#endif  // TENSORFLOW_CORE_FRAMEWORK_DEVICE_BASE_H_
diff --git a/tensorflow/core/framework/function.proto b/tensorflow/core/framework/function.proto
index bd01e86da3a646b7e6331301a03e80e90d2ce6ee..72e3c438314bd80391c38c7fab4acc994eeb92a4 100644
--- a/tensorflow/core/framework/function.proto
+++ b/tensorflow/core/framework/function.proto
@@ -30,7 +30,8 @@ message FunctionDef {
   // Attributes specific to this function definition.
   map<string, AttrValue> attr = 5;
 
-  // NOTE: field id 2 deleted on Jan 11, 2016, GraphDef version 21.
+  // NOTE: field id 2 deleted on Jan 11, 2017, GraphDef version 21.
+  reserved 2;
 
   // In both of the following fields, there is the need to specify an
   // output that is used as either the input to another node (in
diff --git a/tensorflow/core/framework/graph_def_util.cc b/tensorflow/core/framework/graph_def_util.cc
index 1f670535d575e9bbc4196fb1f1e1c381d33ae204..896cb3cd7ffe45f2d528761403cfa4aaed902d96 100644
--- a/tensorflow/core/framework/graph_def_util.cc
+++ b/tensorflow/core/framework/graph_def_util.cc
@@ -53,6 +53,12 @@ Status ValidateExternalGraphDefSyntax(const GraphDef& graph_def) {
 Status AddDefaultAttrsToGraphDef(GraphDef* graph_def,
                                  const OpRegistryInterface& op_registry,
                                  int node_offset) {
+  return AddDefaultAttrsToGraphDef(graph_def, op_registry, node_offset, false);
+}
+
+Status AddDefaultAttrsToGraphDef(GraphDef* graph_def,
+                                 const OpRegistryInterface& op_registry,
+                                 int node_offset, bool skip_unknown_ops) {
   if (node_offset > graph_def->node_size()) {
     return errors::InvalidArgument(
         "Tried to add default attrs to GraphDef "
@@ -63,8 +69,12 @@ Status AddDefaultAttrsToGraphDef(GraphDef* graph_def,
   for (int i = node_offset; i < graph_def->node_size(); ++i) {
     NodeDef* node_def = graph_def->mutable_node(i);
     const OpDef* op_def;
-    TF_RETURN_IF_ERROR(op_registry.LookUpOpDef(node_def->op(), &op_def));
-    AddDefaultsToNodeDef(*op_def, node_def);
+    Status s = op_registry.LookUpOpDef(node_def->op(), &op_def);
+    if (s.ok()) {
+      AddDefaultsToNodeDef(*op_def, node_def);
+    } else if (!skip_unknown_ops) {
+      return s;
+    }
   }
 
   return Status::OK();
diff --git a/tensorflow/core/framework/graph_def_util.h b/tensorflow/core/framework/graph_def_util.h
index 0c6542a9f258a28f42a9caa9821ea3faf8010342..525e84a989fb0edbc8fd57ff3f3b0d0ed4b13e16 100644
--- a/tensorflow/core/framework/graph_def_util.h
+++ b/tensorflow/core/framework/graph_def_util.h
@@ -56,6 +56,12 @@ Status AddDefaultAttrsToGraphDef(GraphDef* graph_def,
                                  const OpRegistryInterface& op_registry,
                                  int node_offset);
 
+// Same as above, except that when skip_unknown_ops is true, nodes whose ops
+// are not found in op_registry are skipped instead of causing an error.
+Status AddDefaultAttrsToGraphDef(GraphDef* graph_def,
+                                 const OpRegistryInterface& op_registry,
+                                 int node_offset, bool skip_unknown_ops);
+
 // Remove attrs from 'graph_def' that have the default value according
 // to 'producer_op_registry', but don't exist according to
 // 'consumer_op_registry'. This can allow 'graph_def' to run on the
diff --git a/tensorflow/core/framework/numeric_types.h b/tensorflow/core/framework/numeric_types.h
index 4c38fbbe591a5d07ba4cbbea00dcbfb41ca2f403..dab53cba3e6343c82052c2997c2947b1d11bed59 100644
--- a/tensorflow/core/framework/numeric_types.h
+++ b/tensorflow/core/framework/numeric_types.h
@@ -24,6 +24,7 @@ limitations under the License.
 #include "third_party/eigen3/unsupported/Eigen/CXX11/FixedPoint"
 // clang-format on
 
+#include "tensorflow/core/lib/bfloat16/bfloat16.h"
 #include "tensorflow/core/platform/types.h"
 
 namespace tensorflow {
diff --git a/tensorflow/core/framework/op_def_util.cc b/tensorflow/core/framework/op_def_util.cc
index 2d035ab90d0f4493f6b6f572d0dd8550f5098e7e..c80802aad3a6b84572096f726d90133ac5536526 100644
--- a/tensorflow/core/framework/op_def_util.cc
+++ b/tensorflow/core/framework/op_def_util.cc
@@ -26,6 +26,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/lib/hash/hash.h"
+#include "tensorflow/core/lib/strings/proto_serialization.h"
 #include "tensorflow/core/lib/strings/scanner.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/mutex.h"
diff --git a/tensorflow/core/framework/op_kernel.cc b/tensorflow/core/framework/op_kernel.cc
index 8654437059ca449432e6381b9eb3c4ba15e56f48..9ec1c213c3b465fc6a42b076c97bf936066a6b40 100644
--- a/tensorflow/core/framework/op_kernel.cc
+++ b/tensorflow/core/framework/op_kernel.cc
@@ -282,8 +282,13 @@ OpKernelContext::~OpKernelContext() {
 }
 
 Allocator* OpKernelContext::get_allocator(AllocatorAttributes attr) {
-  Allocator* allocator =
-      params_->device->GetStepAllocator(attr, resource_manager());
+  Allocator* allocator = nullptr;
+  if (attr.scope_id > 0) {
+    allocator = params_->device->GetScopedAllocator(attr, step_id());
+    CHECK(allocator);
+  } else {
+    allocator = params_->device->GetStepAllocator(attr, resource_manager());
+  }
   if (track_allocations()) {
     mutex_lock lock(mu_);
     for (const auto& wrapped : wrapped_allocators_) {
diff --git a/tensorflow/core/framework/rendezvous.cc b/tensorflow/core/framework/rendezvous.cc
index 90756a4f2fceb366f2ec0eb991adc31dcf884d99..e84143f1b9ac2e2ef1b0d7121fd8167fe7328d1b 100644
--- a/tensorflow/core/framework/rendezvous.cc
+++ b/tensorflow/core/framework/rendezvous.cc
@@ -296,7 +296,9 @@ class LocalRendezvousImpl : public Rendezvous {
   Status status_ GUARDED_BY(mu_);
 
   ~LocalRendezvousImpl() override {
-    StartAbort(errors::Cancelled("LocalRendezvousImpl deleted"));
+    if (!table_.empty()) {
+      StartAbort(errors::Cancelled("LocalRendezvousImpl deleted"));
+    }
   }
 
   TF_DISALLOW_COPY_AND_ASSIGN(LocalRendezvousImpl);
diff --git a/tensorflow/core/framework/session_state.h b/tensorflow/core/framework/session_state.h
index 8fbe940f6aefe70cad3016ecc99528bae1b761dd..653a661dd234a9f9739c0fe7254dd0939ce63223 100644
--- a/tensorflow/core/framework/session_state.h
+++ b/tensorflow/core/framework/session_state.h
@@ -74,6 +74,12 @@ class TensorStore {
   Status SaveTensors(const std::vector<string>& output_names,
                      SessionState* session_state);
 
+  // Returns true if no tensors have been added to this store.
+  bool empty() {
+    mutex_lock l(lock_);
+    return tensors_.empty();
+  }
+
  private:
   mutex lock_;
 
diff --git a/tensorflow/core/framework/tensor.cc b/tensorflow/core/framework/tensor.cc
index 5d32b71628263fe89d6f54fd07b2fe18bbb55e53..e2111d60389d51702463f377602067ddc1bade08 100644
--- a/tensorflow/core/framework/tensor.cc
+++ b/tensorflow/core/framework/tensor.cc
@@ -884,6 +884,20 @@ bool Tensor::CanUseDMA() const {
 #undef CASE
 
 namespace {
+
+// StrCat and StrAppend don't support Eigen::half directly at the moment, and
+// we would like to keep them compatible with their absl counterparts, for ease
+// of migration. We could rely on errors::internal::PrepareForStrCat() but the
+// logic is so simple we can just replicate it here, where it is close to its
+// usage and easy to change later. And there's the extra benefit of not
+// accessing an 'internal' namespace.
+inline const strings::AlphaNum& PrintOneElement(const strings::AlphaNum& a) {
+  return a;
+}
+inline float PrintOneElement(const Eigen::half& h) {
+  return static_cast<float>(h);
+}
+
 // Print from left dim to right dim recursively.
 template <typename T>
 void PrintOneDim(int dim_index, const gtl::InlinedVector<int64, 4>& shape,
@@ -896,7 +910,7 @@ void PrintOneDim(int dim_index, const gtl::InlinedVector<int64, 4>& shape,
     for (int64 i = 0; i < element_count; i++) {
       if (*data_index >= limit) return;
       if (i > 0) strings::StrAppend(result, " ");
-      strings::StrAppend(result, data[(*data_index)++]);
+      strings::StrAppend(result, PrintOneElement(data[(*data_index)++]));
     }
     return;
   }
@@ -927,7 +941,7 @@ string SummarizeArray(int64 limit, int64 num_elts,
   if (shape.empty()) {
     for (int64 i = 0; i < limit; ++i) {
       if (i > 0) strings::StrAppend(&ret, " ");
-      strings::StrAppend(&ret, array[i]);
+      strings::StrAppend(&ret, PrintOneElement(array[i]));
     }
     if (num_elts > limit) strings::StrAppend(&ret, "...");
     return ret;
diff --git a/tensorflow/core/framework/tensor.h b/tensorflow/core/framework/tensor.h
index 62c42ba652356a5128d4a337e34a3b449781b445..4d10f7efb5d3aa2549912b043424c2d41dff27c0 100644
--- a/tensorflow/core/framework/tensor.h
+++ b/tensorflow/core/framework/tensor.h
@@ -482,6 +482,9 @@ class Tensor {
   friend class AutoReloadVariableOp;  // For access to set_shape
   friend class TensorTestHelper;      // For access to set_shape
   friend class OpKernelContext;       // For access to RefCountIsOne().
+  friend class ScopedAllocator;       // For access to buf_.
+  friend class XlaTensorBuffer;  // For access to the private constructor taking
+                                 // the buffer
   template <typename Device, typename T>
   friend class AssignVariableOp;  // For access to RefCountIsOne().
   template <typename Device, typename T>
diff --git a/tensorflow/core/framework/tensor_shape.h b/tensorflow/core/framework/tensor_shape.h
index fe2ba375aa0c5c50009b3155338cd8860070d47a..be7e740c335ced3ec6826e804090927962d57285 100644
--- a/tensorflow/core/framework/tensor_shape.h
+++ b/tensorflow/core/framework/tensor_shape.h
@@ -25,7 +25,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/lib/gtl/inlined_vector.h"
-#include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/platform/logging.h"
 
 namespace tensorflow {
@@ -271,6 +271,12 @@ class TensorShapeBase : public TensorShapeRep {
   friend Status MakeShapeHelper(const T*, int64, S*);
 };
 
+/// Outputs `TensorShapeBase` to `std::ostream`.
+template <typename Shape>
+std::ostream& operator<<(std::ostream& os, const TensorShapeBase<Shape>& tsb) {
+  return os << tsb.DebugString();
+}
+
 /// Represents the shape of a Tensor.
 ///
 /// A tensor's shape is denoted by its number of dimensions and a size for each
diff --git a/tensorflow/core/framework/tensor_shape_test.cc b/tensorflow/core/framework/tensor_shape_test.cc
index d7517bb311d517351f4dd2a59438780482485dff..6329aa6d8edf3795ed8018b7802661749683fe41 100644
--- a/tensorflow/core/framework/tensor_shape_test.cc
+++ b/tensorflow/core/framework/tensor_shape_test.cc
@@ -198,6 +198,13 @@ TEST(TensorShapeTest, DataType) {
   EXPECT_EQ(TensorShapeTestHelper::data_type(&s2), DT_INVALID);
 }
 
+TEST(TensorShapeTest, ostream) {
+  TensorShape s({10, 5, 4});
+  std::stringstream ss;
+  ss << s;
+  EXPECT_EQ(ss.str(), "[10,5,4]");
+}
+
 // -----------------------------------------------------------------------
 // An old implementation of TensorShape using a different representation,
 // preserved here in the unittest to allow us to have a randomized unittest
diff --git a/tensorflow/core/graph/costmodel.cc b/tensorflow/core/graph/costmodel.cc
index 4f3a6ec38cb88213c7127df41823bc16e9834d09..1df45d9b893fdb2807c5e6ab63dd4a8577d7feb6 100644
--- a/tensorflow/core/graph/costmodel.cc
+++ b/tensorflow/core/graph/costmodel.cc
@@ -427,7 +427,7 @@ static void AssignSizes(const Graph& g, CostModel* cost_model) {
     if (e->IsControlEdge()) {
       continue;
     }
-    Node* src = e->src();
+    const Node* src = e->src();
 
     // TODO(josh11b): Get an estimate from the Op
     Bytes size(1);
diff --git a/tensorflow/core/graph/graph.cc b/tensorflow/core/graph/graph.cc
index 9b56216f1f97a9598dd7ae8b70786e32bb7e0f4b..a7af5e2312af716ef25cb35c8f247d6feccb6d9c 100644
--- a/tensorflow/core/graph/graph.cc
+++ b/tensorflow/core/graph/graph.cc
@@ -339,7 +339,7 @@ Node* Graph::AddNode(const NodeDef& node_def, Status* status) {
   return node;
 }
 
-Node* Graph::CopyNode(Node* node) {
+Node* Graph::CopyNode(const Node* node) {
   DCHECK(!node->IsSource());
   DCHECK(!node->IsSink());
   Node* copy = AllocateNode(node->props_, node);
diff --git a/tensorflow/core/graph/graph.h b/tensorflow/core/graph/graph.h
index 9d96cd4654bbf1fd65c5135d6a8bdc271c6e9443..f7ca7d0620f4d483d31eaad66dfb8ef10dcab027 100644
--- a/tensorflow/core/graph/graph.h
+++ b/tensorflow/core/graph/graph.h
@@ -124,7 +124,8 @@ class Node {
   // Inputs requested by the NodeDef.  For the actual inputs, use in_edges.
   const protobuf::RepeatedPtrField<string>& requested_inputs() const;
 
-  // Get the neighboring nodes via edges either in or out of this node.
+  // Get the neighboring nodes via edges either in or out of this node.  This
+  // includes control edges.
   gtl::iterator_range<NeighborIter> in_nodes() const;
   gtl::iterator_range<NeighborIter> out_nodes() const;
   const EdgeSet& in_edges() const { return in_edges_; }
@@ -422,7 +423,7 @@ class Graph {
   // Copies *node, which may belong to another graph, to a new node,
   // which is returned.  Does not copy any edges.  *this owns the
   // returned instance.
-  Node* CopyNode(Node* node);
+  Node* CopyNode(const Node* node);
 
   // Removes a node from this graph, including all edges from or to it.
   // *node should not be accessed after calling this function.
diff --git a/tensorflow/core/graph/graph_constructor.cc b/tensorflow/core/graph/graph_constructor.cc
index 0629ff32d00cf7fad00c39f07810aa4a9d57f14f..76ee88e684dcbf2b687dcf8ea6433225b0293c54 100644
--- a/tensorflow/core/graph/graph_constructor.cc
+++ b/tensorflow/core/graph/graph_constructor.cc
@@ -568,13 +568,22 @@ Status GraphConstructor::ValidateShape(Node* node) {
   auto* ic = refiner_->GetContext(node);
   DCHECK(ic != nullptr)
       << "ShapeRefiner::AddNode() should have created the InferenceContext";
-  if (shape_attrs.size() != node->num_outputs()) {
+  if (shape_attrs.size() < node->num_outputs()) {
     return errors::InvalidArgument(
         "Node '", node->name(), "' has ", node->num_outputs(),
         " outputs but the ", kAttrName, " attribute specifies shapes for ",
         shape_attrs.size(), " outputs");
   }
-  for (int i = 0; i < shape_attrs.size(); ++i) {
+  // NOTE(skyewm): we don't raise an error here because some users depend on
+  // this behavior, even though it's unsafe.
+  // TODO(b/74619486): raise an error.
+  if (shape_attrs.size() > node->num_outputs()) {
+    LOG(WARNING) << "Node '" << node->name() << "' has " << node->num_outputs()
+                 << " outputs but the " << kAttrName
+                 << " attribute specifies shapes for " << shape_attrs.size()
+                 << " outputs. Output shapes may be inaccurate.";
+  }
+  for (int i = 0; i < node->num_outputs(); ++i) {
     const TensorShapeProto& p = shape_attrs[i];
     shape_inference::ShapeHandle h;
     Status s = ic->MakeShapeFromShapeProto(p, &h);
@@ -1271,7 +1280,7 @@ void CopyGraph(const Graph& src, Graph* dest) {
   dest->set_versions(src.versions());
 
   // Copy the nodes
-  std::unordered_map<Node*, Node*>
+  std::unordered_map<const Node*, Node*>
       node_map;  // "Node in src" -> "Node in *dest"
   node_map[src.source_node()] = dest->source_node();
   node_map[src.sink_node()] = dest->sink_node();
diff --git a/tensorflow/core/graph/graph_partition.cc b/tensorflow/core/graph/graph_partition.cc
index add80eda23d7887fb06902c0b123c03db8f4cccf..17a174101b2be479bea834a407544b3a74dc08cf 100644
--- a/tensorflow/core/graph/graph_partition.cc
+++ b/tensorflow/core/graph/graph_partition.cc
@@ -123,8 +123,8 @@ bool NeedSameDeviceSendRecv(const Edge* edge, const GraphInfo& info) {
     return false;
   }
 
-  Node* src = edge->src();
-  Node* dst = edge->dst();
+  const Node* src = edge->src();
+  const Node* dst = edge->dst();
   if (src->assigned_device_name() == dst->assigned_device_name()) {
     int src_port = edge->src_output();
     int dst_port = edge->dst_input();
@@ -141,7 +141,7 @@ bool NeedSameDeviceSendRecv(const Edge* edge, const GraphInfo& info) {
 
 // Return true iff (dst, dst_input) is specified on host memory.
 bool IsDstInputOnHost(const Edge* edge, const GraphInfo& info) {
-  Node* dst = edge->dst();
+  const Node* dst = edge->dst();
   int dst_port = edge->dst_input();
   if (info.device_types[dst->id()] != DEVICE_CPU) {
     if (edge->IsControlEdge()) return false;
diff --git a/tensorflow/core/graph/mkl_graph_util.h b/tensorflow/core/graph/mkl_graph_util.h
index 1b99d54e8e33fd5155913a78ee833343bf92b905..5f51d6083b1ae17d8c4dee2434f4b57de5f18d06 100644
--- a/tensorflow/core/graph/mkl_graph_util.h
+++ b/tensorflow/core/graph/mkl_graph_util.h
@@ -90,7 +90,7 @@ inline string GetMklOpName(const string& name) {
 // @input: name of the op
 // @input: T datatype to be used for checking op
 // @return: true if opname is registered as Mkl op; false otherwise
-static inline bool IsMklOp(const std::string& op_name, DataType T) {
+static inline bool IsMklOp(const string& op_name, DataType T) {
   string kernel = KernelsRegisteredForOp(op_name);
   bool result =
       kernel.find(kMklOpLabelPattern) != string::npos && (T == DT_FLOAT);
@@ -104,7 +104,7 @@ static inline bool IsMklOp(const std::string& op_name, DataType T) {
 // @input: T datatype to be used for checking op
 // @return: true if opname is registered as element-wise Mkl op;
 // false otherwise
-static inline bool IsMklElementWiseOp(const std::string& op_name, DataType T) {
+static inline bool IsMklElementWiseOp(const string& op_name, DataType T) {
   if (!IsMklOp(op_name, T)) {
     return false;
   }
diff --git a/tensorflow/core/graph/mkl_layout_pass.cc b/tensorflow/core/graph/mkl_layout_pass.cc
index 7d3be152991351533a6185ea088503032f720b47..1507b6eae26596686f279155c27925e0b5d80cf4 100644
--- a/tensorflow/core/graph/mkl_layout_pass.cc
+++ b/tensorflow/core/graph/mkl_layout_pass.cc
@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+// TODO(intel): Improve error handling in this file; instead of CHECK failing
+// all over the place, we should log an error and execute the original graph.
 #ifdef INTEL_MKL
 
 #include <algorithm>
@@ -1030,8 +1032,7 @@ void MklLayoutRewritePass::GetDummyMklTensorNode(std::unique_ptr<Graph>* g,
   TensorProto proto;
   proto.set_dtype(dt);
   uint8 zero[8] = {0, 0, 0, 0, 0, 0, 0, 0};
-  proto.set_tensor_content(const_cast<const void*>(static_cast<void*>(&zero)),
-                           8);
+  proto.set_tensor_content(string(reinterpret_cast<const char*>(zero), 8));
   TensorShape dummy_shape({8});
   dummy_shape.AsProto(proto.mutable_tensor_shape());
   TF_CHECK_OK(NodeBuilder((*g)->NewName("DMT"), "Const")
@@ -1144,7 +1145,8 @@ int MklLayoutRewritePass::SetUpContiguousInputs(
     // For that let's first find filter node that is 2nd input (slot 1)
     // of BackpropInput.
     Node* filter_node = nullptr;
-    old_node->input_node(kConv2DBackpropInputFilterInputSlotIdx, &filter_node);
+    TF_CHECK_OK(old_node->input_node(kConv2DBackpropInputFilterInputSlotIdx,
+                                     &filter_node));
     CHECK_NOTNULL(filter_node);
 
     // Now check which nodes receive from filter_node. Filter feeds as
@@ -1323,8 +1325,7 @@ void MklLayoutRewritePass::GetDummyWorkspaceTensorNode(
   TensorProto proto;
   proto.set_dtype(dt);
   float zero[1] = {0};
-  proto.set_tensor_content(const_cast<const void*>(static_cast<void*>(&zero)),
-                           4);
+  proto.set_tensor_content(string(reinterpret_cast<const char*>(zero), 4));
   TensorShape dummy_shape({1});
   dummy_shape.AsProto(proto.mutable_tensor_shape());
   TF_CHECK_OK(NodeBuilder((*g)->NewName("DMT"), "Const")
@@ -1829,7 +1830,7 @@ Status MklLayoutRewritePass::MergeNode(std::unique_ptr<Graph>* g, Node* succ,
 
     // Create node.
     Node* new_node;
-    nb.Finalize(&**g, &new_node);
+    TF_CHECK_OK(nb.Finalize(&**g, &new_node));
     CHECK_NOTNULL(new_node);
 
     // Set the Mkl layer label for this op.
@@ -2491,10 +2492,10 @@ class MklLayoutRewritePass : public GraphOptimizationPass {
                       mkl_op_registry::GetMklOpName(csinfo_.identity),
                       CopyAttrsDataType, AlwaysRewrite});
     rinfo_.push_back({csinfo_.lrn, mkl_op_registry::GetMklOpName(csinfo_.lrn),
-                      CopyAttrsLRN, AlwaysRewrite});
+                      CopyAttrsLRN, LrnRewrite});
     rinfo_.push_back({csinfo_.lrn_grad,
                       mkl_op_registry::GetMklOpName(csinfo_.lrn_grad),
-                      CopyAttrsLRN, AlwaysRewrite});
+                      CopyAttrsLRN, LrnRewrite});
     rinfo_.push_back({csinfo_.max_pool,
                       mkl_op_registry::GetMklOpName(csinfo_.max_pool),
                       CopyAttrsPooling, NonDepthBatchWisePoolRewrite});
@@ -2864,6 +2865,28 @@ class MklLayoutRewritePass : public GraphOptimizationPass {
     return false;
   }
 
+  // MKL DNN has an optimized LRN path only for depth_radius == 2; any other
+  // value takes its slow, unoptimized path. So we rewrite the Eigen node
+  // into an MKL DNN node only when depth_radius is 2, and otherwise keep the
+  // default Eigen op.
+  static bool LrnRewrite(const Node* n) {
+    CHECK_NOTNULL(n);
+
+    int depth_radius;
+    TF_CHECK_OK(GetNodeAttr(n->def(), "depth_radius", &depth_radius));
+
+    // Rewrite only when depth_radius is 2; otherwise fall through and keep
+    // the Eigen node.
+    if (depth_radius == 2) {
+      return true;
+    }
+    VLOG(1) << "LrnRewrite: The model sets depth_radius to a value other "
+            << "than 2, which is not optimized by Intel MKL, thus using the "
+            << "Eigen op for LRN";
+
+    return false;
+  }
+
   static bool AddNRewrite(const Node* n) {
     CHECK_NOTNULL(n);
 
@@ -3527,11 +3550,13 @@ void MklLayoutRewritePass::CopyAttrsConv2D(const Node* orig_node,
   string data_format;
   string padding;
   std::vector<int32> strides;
+  std::vector<int32> dilations;
   bool use_cudnn_on_gpu;
 
   // Get all attributes from old node.
   TF_CHECK_OK(GetNodeAttr(orig_node->def(), "T", &T));
   TF_CHECK_OK(GetNodeAttr(orig_node->def(), "strides", &strides));
+  TF_CHECK_OK(GetNodeAttr(orig_node->def(), "dilations", &dilations));
   TF_CHECK_OK(GetNodeAttr(orig_node->def(), "padding", &padding));
   TF_CHECK_OK(GetNodeAttr(orig_node->def(), "data_format", &data_format));
   TF_CHECK_OK(
@@ -3540,6 +3565,7 @@ void MklLayoutRewritePass::CopyAttrsConv2D(const Node* orig_node,
   // Add attributes to new node.
   nb->Attr("T", T);
   nb->Attr("strides", strides);
+  nb->Attr("dilations", dilations);
   nb->Attr("padding", padding);
   nb->Attr("data_format", data_format);
   nb->Attr("use_cudnn_on_gpu", use_cudnn_on_gpu);
@@ -3777,12 +3803,14 @@ Status MklLayoutRewritePass::MergeConv2DWithBiasAdd(std::unique_ptr<Graph>* g,
   DataType T_pred, T_succ;
   string padding;
   std::vector<int32> strides;
+  std::vector<int32> dilations;
   string data_format_pred, data_format_succ;
   bool use_cudnn_on_gnu;
   TF_CHECK_OK(GetNodeAttr(pred->def(), "T", &T_pred));
   TF_CHECK_OK(GetNodeAttr(succ->def(), "T", &T_succ));
   TF_CHECK_OK(GetNodeAttr(pred->def(), "padding", &padding));
   TF_CHECK_OK(GetNodeAttr(pred->def(), "strides", &strides));
+  TF_CHECK_OK(GetNodeAttr(pred->def(), "dilations", &dilations));
   TF_CHECK_OK(GetNodeAttr(pred->def(), "data_format", &data_format_pred));
   TF_CHECK_OK(GetNodeAttr(succ->def(), "data_format", &data_format_succ));
   TF_CHECK_OK(GetNodeAttr(pred->def(), "use_cudnn_on_gpu", &use_cudnn_on_gnu));
diff --git a/tensorflow/core/graph/node_builder.cc b/tensorflow/core/graph/node_builder.cc
index 138952dcb33e7b1e57cf013147581d20f509e85d..114962c0e4f2969fe539d5b168aaf62d577a7024 100644
--- a/tensorflow/core/graph/node_builder.cc
+++ b/tensorflow/core/graph/node_builder.cc
@@ -88,7 +88,7 @@ NodeBuilder& NodeBuilder::ControlInput(Node* src_node) {
 NodeBuilder& NodeBuilder::ControlInputs(gtl::ArraySlice<Node*> src_nodes) {
   control_inputs_.insert(control_inputs_.end(), src_nodes.begin(),
                          src_nodes.end());
-  for (Node* src_node : src_nodes) {
+  for (const Node* src_node : src_nodes) {
     def_builder_.ControlInput(src_node->name());
   }
   return *this;
@@ -127,7 +127,7 @@ Status NodeBuilder::Finalize(Graph* graph, Node** created_node) const {
   return Status::OK();
 }
 
-void NodeBuilder::AddIndexError(Node* node, int i) {
+void NodeBuilder::AddIndexError(const Node* node, int i) {
   if (node == nullptr) {
     errors_.emplace_back(
         strings::StrCat("Attempt to add nullptr Node to node with type ",
@@ -140,7 +140,7 @@ void NodeBuilder::AddIndexError(Node* node, int i) {
   }
 }
 
-bool NodeBuilder::GetOutputType(Node* node, int i, DataType* dt) {
+bool NodeBuilder::GetOutputType(const Node* node, int i, DataType* dt) {
   bool error;
   *dt = SafeGetOutput(node, i, &error);
   if (error) AddIndexError(node, i);
diff --git a/tensorflow/core/graph/node_builder.h b/tensorflow/core/graph/node_builder.h
index 86647a49c12085b6850a0e6d2622ef1bb58c513d..f6b7b5674b032cd2b19d69765e7c3b6b6613b3bd 100644
--- a/tensorflow/core/graph/node_builder.h
+++ b/tensorflow/core/graph/node_builder.h
@@ -120,7 +120,7 @@ class NodeBuilder {
   const OpDef& op_def() const { return def_builder_.op_def(); }
 
  private:
-  static DataType SafeGetOutput(Node* node, int i, bool* error) {
+  static DataType SafeGetOutput(const Node* node, int i, bool* error) {
     if (node != nullptr && i >= 0 && i < node->num_outputs()) {
       *error = false;
       return node->output_type(i);
@@ -131,11 +131,11 @@ class NodeBuilder {
   }
 
   // If SafeGetOutput indicates a range error, add it to errors_.
-  void AddIndexError(Node* node, int i);
+  void AddIndexError(const Node* node, int i);
 
   // Set *dt and returns true if i is in range. Combines
   // SafeGetOutput() and AddIndexError().
-  bool GetOutputType(Node* node, int i, DataType* dt);
+  bool GetOutputType(const Node* node, int i, DataType* dt);
 
   NodeDefBuilder def_builder_;
   std::vector<NodeOut> inputs_;
diff --git a/tensorflow/core/graph/optimizer_cse.cc b/tensorflow/core/graph/optimizer_cse.cc
index 6b452a1d5dca0a636264a3483e4ee9d027fd2e06..4073255db3f7cbcd697f3cb2781e04b3b01634c1 100644
--- a/tensorflow/core/graph/optimizer_cse.cc
+++ b/tensorflow/core/graph/optimizer_cse.cc
@@ -65,8 +65,8 @@ class OptimizerCSE {
 };
 
 static void FillInputs(const Node* n,
-                       gtl::InlinedVector<Node*, 4>* control_edges,
-                       gtl::InlinedVector<std::pair<Node*, int>, 4>* in) {
+                       gtl::InlinedVector<const Node*, 4>* control_edges,
+                       gtl::InlinedVector<std::pair<const Node*, int>, 4>* in) {
   DCHECK_EQ(in->size(), n->num_inputs());
   control_edges->clear();
   for (const Edge* e : n->in_edges()) {
@@ -96,8 +96,8 @@ size_t OptimizerCSE::NodeHash(const Node* n) {
 
   const int N_in = n->num_inputs();
   strings::StrAppend(&str_to_hash, N_in);
-  gtl::InlinedVector<Node*, 4> control_edges;
-  gtl::InlinedVector<std::pair<Node*, int>, 4> in(N_in);
+  gtl::InlinedVector<const Node*, 4> control_edges;
+  gtl::InlinedVector<std::pair<const Node*, int>, 4> in(N_in);
   FillInputs(n, &control_edges, &in);
   for (const auto& edge : in) {
     strings::StrAppend(&str_to_hash, edge.first->id(), edge.second);
@@ -147,10 +147,10 @@ bool OptimizerCSE::Equivalent(const Node* a, const Node* b,
   // Compare input sources
   if (a->num_inputs() != b->num_inputs()) return false;
   const int N_in = a->num_inputs();
-  gtl::InlinedVector<Node*, 4> a_control_edges;
-  gtl::InlinedVector<Node*, 4> b_control_edges;
-  gtl::InlinedVector<std::pair<Node*, int>, 4> a_in(N_in);
-  gtl::InlinedVector<std::pair<Node*, int>, 4> b_in(N_in);
+  gtl::InlinedVector<const Node*, 4> a_control_edges;
+  gtl::InlinedVector<const Node*, 4> b_control_edges;
+  gtl::InlinedVector<std::pair<const Node*, int>, 4> a_in(N_in);
+  gtl::InlinedVector<std::pair<const Node*, int>, 4> b_in(N_in);
   FillInputs(a, &a_control_edges, &a_in);
   FillInputs(b, &b_control_edges, &b_in);
   if (a_in != b_in) return false;
diff --git a/tensorflow/core/graph/subgraph.cc b/tensorflow/core/graph/subgraph.cc
index 2a08bf8ca019185cf9d13dd83ca1880c3741090f..193cf88aed3da8c871f457c02d8dbb714b926737 100644
--- a/tensorflow/core/graph/subgraph.cc
+++ b/tensorflow/core/graph/subgraph.cc
@@ -28,13 +28,13 @@ limitations under the License.
 #include "tensorflow/core/graph/algorithm.h"
 #include "tensorflow/core/graph/graph.h"
 #include "tensorflow/core/graph/graph_constructor.h"
-#include "tensorflow/core/graph/node_builder.h"
 #include "tensorflow/core/graph/tensor_id.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
 
 namespace tensorflow {
+namespace subgraph {
 
 // ----------------------------------------------------------------------------
 // Subgraph construction-related routines
@@ -44,6 +44,8 @@ namespace tensorflow {
 
 namespace {
 
+typedef std::unordered_map<StringPiece, Node*, StringPieceHasher> NameIndex;
+
 // Rewrite graph by replacing the output tensors specified in
 // "fed_outputs" with special feed nodes for each specified output
 // tensor, and removing any nodes that are now disconnected from the
@@ -53,59 +55,33 @@ namespace {
 // Return true on success.  On error, return false and sets *error to
 // an appropriate error message (and *g is left in an indeterminate
 // state).
-static Status FeedInputs(Graph* g, const DeviceAttributes& device_info,
-                         const gtl::ArraySlice<string>& fed_outputs,
-                         bool use_function_convention,
-                         subgraph::NameIndex* name_index,
-                         DataTypeVector* out_feed_types) {
+Status FeedInputs(
+    Graph* g, const std::vector<std::unique_ptr<PruneRewrite>>& feed_rewrites,
+    NameIndex* name_index, DataTypeVector* out_feed_types) {
   out_feed_types->clear();
-  out_feed_types->reserve(fed_outputs.size());
-  for (size_t i = 0; i < fed_outputs.size(); ++i) {
-    const string& t = fed_outputs[i];
+  out_feed_types->reserve(feed_rewrites.size());
+  for (size_t i = 0; i < feed_rewrites.size(); ++i) {
+    const string& t = feed_rewrites[i]->endpoint_name();
     TensorId id(ParseTensorName(t));
 
     auto iter = name_index->find(id.first);
     if (iter == name_index->end()) {
       return errors::NotFound("FeedInputs: unable to find feed output ", t);
     }
-    const Node* n = iter->second;
+    Node* n = iter->second;
     DCHECK_EQ(n->name(), id.first);
     if (id.second >= n->num_outputs()) {
       return errors::InvalidArgument(
           "FeedInputs: ", t, " should have output index < ", n->num_outputs());
     }
 
-    Node* recv_node;
-
-    if (!use_function_convention) {
-      TF_RETURN_IF_ERROR(
-          NodeBuilder(strings::StrCat("_recv_", id.first, "_", id.second),
-                      "_Recv")
-              .Attr("tensor_type", BaseType(n->output_type(id.second)))
-              .Attr("tensor_name", t)
-              .Attr("send_device", device_info.name())
-              .Attr("recv_device", device_info.name())
-              .Attr("send_device_incarnation",
-                    static_cast<int64>(device_info.incarnation()))
-              .Attr("client_terminated", true)
-              .Finalize(g, &recv_node));
-    } else {
-      // NOTE(mrry): We must include the index as part of the node
-      // name, because _Arg is a "stateful" kernel and therefore
-      // its name must uniquely identify a kernel instance across all
-      // graphs in the same session.
-      TF_RETURN_IF_ERROR(NodeBuilder(strings::StrCat("_arg_", id.first, "_",
-                                                     id.second, "_", i),
-                                     "_Arg")
-                             .Attr("T", BaseType(n->output_type(id.second)))
-                             .Attr("index", static_cast<int32>(i))
-                             .Finalize(g, &recv_node));
-    }
-    recv_node->set_assigned_device_name(device_info.name());
+    Node* feed_node;
+    TF_RETURN_IF_ERROR(
+        feed_rewrites[i]->AddNode(g, {n, id.second}, &feed_node));
 
     // Update name_index
-    (*name_index)[recv_node->name()] = recv_node;
-    g->AddControlEdge(g->source_node(), recv_node);
+    (*name_index)[feed_node->name()] = feed_node;
+    g->AddControlEdge(g->source_node(), feed_node);
 
     // Look through edges coming out of "n" for edges whose src_output() index
     // matches "output_index".  If found, replace the edges with a connection
@@ -119,7 +95,7 @@ static Status FeedInputs(Graph* g, const DeviceAttributes& device_info,
                   n->type_string() == "PlaceholderV2")) {
         // When feeding a Placeholder node, any outgoing control edges
         // will be replaced with a control edge from the replacement
-        // recv_node.
+        // feed_node.
         // TODO(josh11b,mrry): Come up with a more elegant way of addressing
         // the general version of this problem.
         to_remove.emplace_back(e);
@@ -128,10 +104,10 @@ static Status FeedInputs(Graph* g, const DeviceAttributes& device_info,
 
     for (const Edge* e : to_remove) {
       if (e->src_output() == id.second) {
-        g->AddEdge(recv_node, 0, e->dst(), e->dst_input());
+        g->AddEdge(feed_node, 0, e->dst(), e->dst_input());
       } else {
         CHECK_EQ(Graph::kControlSlot, e->src_output());
-        g->AddControlEdge(recv_node, e->dst());
+        g->AddControlEdge(feed_node, e->dst());
       }
       g->RemoveEdge(e);
     }
@@ -140,9 +116,61 @@ static Status FeedInputs(Graph* g, const DeviceAttributes& device_info,
   return Status::OK();
 }
 
-static bool AddNodeToTargets(const string& node_or_tensor_name,
-                             const subgraph::NameIndex& name_index,
-                             std::unordered_set<const Node*>* targets) {
+Status FetchOutputs(
+    Graph* g, const std::vector<std::unique_ptr<PruneRewrite>>& fetch_rewrites,
+    NameIndex* name_index, std::vector<Node*>* out_fetch_nodes,
+    DataTypeVector* out_fetch_types) {
+  out_fetch_nodes->clear();
+  out_fetch_nodes->reserve(fetch_rewrites.size());
+  for (size_t i = 0; i < fetch_rewrites.size(); ++i) {
+    const string& t = fetch_rewrites[i]->endpoint_name();
+
+    // Parse t into node_name and output_index.
+    TensorId id(ParseTensorName(t));
+
+    // Find node in graph with that name.
+    auto iter = name_index->find(id.first);
+    if (iter == name_index->end()) {
+      return errors::NotFound("FetchOutputs node ", t, ": not found");
+    }
+    Node* n = iter->second;
+    DCHECK_EQ(n->name(), id.first);
+    VLOG(2) << "Found fetch node for " << t;
+
+    // Validate output_index
+    if (n->num_outputs() == 0) {
+      return errors::InvalidArgument(
+          "Tried to fetch data for '", t,
+          "', which produces no output.  To run to a node but not fetch any "
+          "data, pass '",
+          t,
+          "' as an argument to the 'target_node_names' argument of the "
+          "Session::Run API.");
+    } else if (id.second >= n->num_outputs()) {
+      return errors::InvalidArgument("FetchOutputs ", t,
+                                     ": output index too large, must be < ",
+                                     n->num_outputs());
+    }
+
+    // Create the fetch Node and connect it up
+    Node* fetch_node;
+    TF_RETURN_IF_ERROR(
+        fetch_rewrites[i]->AddNode(g, {n, id.second}, &fetch_node));
+
+    // Update the index.
+    (*name_index)[fetch_node->name()] = fetch_node;
+
+    g->AddControlEdge(fetch_node, g->sink_node());
+    out_fetch_nodes->push_back(fetch_node);
+    out_fetch_types->push_back(BaseType(n->output_type(id.second)));
+  }
+
+  return Status::OK();
+}
+
+bool AddNodeToTargets(const string& node_or_tensor_name,
+                      const NameIndex& name_index,
+                      std::unordered_set<const Node*>* targets) {
   TensorId id = ParseTensorName(node_or_tensor_name);
   auto iter = name_index.find(id.first);
   if (iter == name_index.end()) {
@@ -154,9 +182,9 @@ static bool AddNodeToTargets(const string& node_or_tensor_name,
   return true;
 }
 
-static Status PruneForTargets(Graph* g, const subgraph::NameIndex& name_index,
-                              const std::vector<Node*>& fetch_nodes,
-                              const gtl::ArraySlice<string>& target_nodes) {
+Status PruneForTargets(Graph* g, const NameIndex& name_index,
+                       const std::vector<Node*>& fetch_nodes,
+                       const gtl::ArraySlice<string>& target_nodes) {
   string not_found;
   std::unordered_set<const Node*> targets;
   for (Node* n : fetch_nodes) {
@@ -183,108 +211,149 @@ static Status PruneForTargets(Graph* g, const subgraph::NameIndex& name_index,
 
 }  // namespace
 
-namespace subgraph {
+Status ArgFeedRewrite::AddNode(Graph* g, NodeBuilder::NodeOut feed_tensor,
+                               Node** out_node) {
+  // NOTE(mrry): We must include the index as part of the node
+  // name, because _Arg is a "stateful" kernel and therefore
+  // its name must uniquely identify a kernel instance across all
+  // graphs in the same session.
+  TF_RETURN_IF_ERROR(
+      NodeBuilder(strings::StrCat("_arg_", feed_tensor.node->name(), "_",
+                                  feed_tensor.index, "_", arg_index_),
+                  "_Arg")
+          .Attr("T", BaseType(feed_tensor.node->output_type(feed_tensor.index)))
+          .Attr("index", arg_index_)
+          .Finalize(g, out_node));
+  (*out_node)->set_assigned_device_name(device_info().name());
+  return Status::OK();
+}
 
-Status FetchOutputs(Graph* g, const DeviceAttributes& device_info,
-                    const gtl::ArraySlice<string>& fetch_outputs,
-                    bool use_function_convention, NameIndex* name_index,
-                    std::vector<Node*>* out_fetch_nodes,
-                    DataTypeVector* out_fetch_types) {
-  out_fetch_nodes->clear();
-  out_fetch_nodes->reserve(fetch_outputs.size());
-  for (size_t i = 0; i < fetch_outputs.size(); ++i) {
-    const string& t = fetch_outputs[i];
+Status RecvFeedRewrite::AddNode(Graph* g, NodeBuilder::NodeOut feed_tensor,
+                                Node** out_node) {
+  TF_RETURN_IF_ERROR(
+      NodeBuilder(strings::StrCat("_recv_", feed_tensor.node->name(), "_",
+                                  feed_tensor.index),
+                  "_Recv")
+          .Attr("tensor_type",
+                BaseType(feed_tensor.node->output_type(feed_tensor.index)))
+          .Attr("tensor_name", endpoint_name())
+          .Attr("send_device", device_info().name())
+          .Attr("recv_device", device_info().name())
+          .Attr("send_device_incarnation",
+                static_cast<int64>(device_info().incarnation()))
+          .Attr("client_terminated", true)
+          .Finalize(g, out_node));
+
+  (*out_node)->set_assigned_device_name(device_info().name());
+  return Status::OK();
+}
 
-    // Parse t into node_name and output_index.
-    TensorId id(ParseTensorName(t));
+Status RetvalFetchRewrite::AddNode(Graph* g, NodeBuilder::NodeOut fetch_tensor,
+                                   Node** out_node) {
+  // NOTE(mrry): We must include the index as part of the node
+  // name, because _Retval is a "stateful" kernel and therefore
+  // its name must uniquely identify a kernel instance across all
+  // graphs in the same session.
+  TF_RETURN_IF_ERROR(
+      NodeBuilder(strings::StrCat("_retval_", fetch_tensor.node->name(), "_",
+                                  fetch_tensor.index, "_", retval_index_),
+                  "_Retval")
+          .Input(fetch_tensor.node, fetch_tensor.index)
+          .Attr("T",
+                BaseType(fetch_tensor.node->output_type(fetch_tensor.index)))
+          .Attr("index", retval_index_)
+          .Finalize(g, out_node));
+  (*out_node)->set_assigned_device_name(device_info().name());
+  return Status::OK();
+}
 
-    // Find node in graph with that name.
-    auto iter = name_index->find(id.first);
-    if (iter == name_index->end()) {
-      return errors::NotFound("FetchOutputs node ", t, ": not found");
-    }
-    Node* n = iter->second;
-    DCHECK_EQ(n->name(), id.first);
-    VLOG(2) << "Found fetch node for " << t;
+Status SendFetchRewrite::AddNode(Graph* g, NodeBuilder::NodeOut fetch_tensor,
+                                 Node** out_node) {
+  TF_RETURN_IF_ERROR(
+      NodeBuilder(strings::StrCat("_send_", fetch_tensor.node->name(), "_",
+                                  fetch_tensor.index),
+                  "_Send")
+          .Input(fetch_tensor.node, fetch_tensor.index)
+          .Attr("tensor_name", endpoint_name())
+          .Attr("send_device", device_info().name())
+          .Attr("recv_device", device_info().name())
+          .Attr("send_device_incarnation",
+                static_cast<int64>(device_info().incarnation()))
+          .Attr("client_terminated", true)
+          .Finalize(g, out_node));
+  (*out_node)->set_assigned_device_name(device_info().name());
+  return Status::OK();
+}
 
-    // Validate output_index
-    if (n->num_outputs() == 0) {
-      return errors::InvalidArgument(
-          "Tried to fetch data for '", t,
-          "', which produces no output.  To run to a node but not fetch any "
-          "data, pass '",
-          t,
-          "' as an argument to the 'target_node_names' argument of the "
-          "Session::Run API.");
-    } else if (id.second >= n->num_outputs()) {
-      return errors::InvalidArgument("FetchOutputs ", t,
-                                     ": output index too large, must be < ",
-                                     n->num_outputs());
+Status RewriteGraphForExecution(
+    Graph* g, const gtl::ArraySlice<string>& fed_outputs,
+    const gtl::ArraySlice<string>& fetch_outputs,
+    const gtl::ArraySlice<string>& target_node_names,
+    const DeviceAttributes& device_info, bool use_function_convention,
+    RewriteGraphMetadata* out_metadata) {
+  std::vector<std::unique_ptr<PruneRewrite>> feed_rewrites;
+  feed_rewrites.reserve(fed_outputs.size());
+  if (use_function_convention) {
+    for (size_t i = 0; i < fed_outputs.size(); ++i) {
+      feed_rewrites.emplace_back(new ArgFeedRewrite(
+          &fed_outputs[i], &device_info, static_cast<int32>(i)));
     }
-
-    // Create the fetch Node and connect it up
-    Node* send_node;
-    if (!use_function_convention) {
-      TF_RETURN_IF_ERROR(
-          NodeBuilder(strings::StrCat("_send_", id.first, "_", id.second),
-                      "_Send")
-              .Input(n, id.second)
-              .Attr("tensor_name", t)
-              .Attr("send_device", device_info.name())
-              .Attr("recv_device", device_info.name())
-              .Attr("send_device_incarnation",
-                    static_cast<int64>(device_info.incarnation()))
-              .Attr("client_terminated", true)
-              .Finalize(g, &send_node));
-    } else {
-      // NOTE(mrry): We must include the index as part of the node
-      // name, because _Retval is a "stateful" kernel and therefore
-      // its name must uniquely identify a kernel instance across all
-      // graphs in the same session.
-      TF_RETURN_IF_ERROR(NodeBuilder(strings::StrCat("_retval_", id.first, "_",
-                                                     id.second, "_", i),
-                                     "_Retval")
-                             .Input(n, id.second)
-                             .Attr("T", BaseType(n->output_type(id.second)))
-                             .Attr("index", static_cast<int32>(i))
-                             .Finalize(g, &send_node));
+  } else {
+    for (const string& fed_output : fed_outputs) {
+      feed_rewrites.emplace_back(
+          new RecvFeedRewrite(&fed_output, &device_info));
     }
-    send_node->set_assigned_device_name(device_info.name());
-
-    // Update the index.
-    (*name_index)[send_node->name()] = send_node;
+  }
 
-    g->AddControlEdge(send_node, g->sink_node());
-    out_fetch_nodes->push_back(send_node);
-    out_fetch_types->push_back(BaseType(n->output_type(id.second)));
+  std::vector<std::unique_ptr<PruneRewrite>> fetch_rewrites;
+  fetch_rewrites.reserve(fetch_outputs.size());
+  if (use_function_convention) {
+    for (size_t i = 0; i < fetch_outputs.size(); ++i) {
+      fetch_rewrites.emplace_back(new RetvalFetchRewrite(
+          &fetch_outputs[i], &device_info, static_cast<int32>(i)));
+    }
+  } else {
+    for (const string& fetch_output : fetch_outputs) {
+      fetch_rewrites.emplace_back(
+          new SendFetchRewrite(&fetch_output, &device_info));
+    }
   }
 
-  return Status::OK();
+  return RewriteGraphForExecution(g, feed_rewrites, fetch_rewrites,
+                                  target_node_names, out_metadata);
+}
+
+namespace {
+template <typename StringContainer>
+std::vector<string> ConvertToVector(StringContainer field) {
+  return std::vector<string>(field.begin(), field.end());
 }
+}  // namespace
 
 Status RewriteGraphForExecution(
-    Graph* g, const gtl::ArraySlice<string>& fed_outputs,
-    const gtl::ArraySlice<string>& fetch_outputs,
+    Graph* g, const std::vector<std::unique_ptr<PruneRewrite>>& feed_rewrites,
+    const std::vector<std::unique_ptr<PruneRewrite>>& fetch_rewrites,
     const gtl::ArraySlice<string>& target_node_names,
-    const DeviceAttributes& device_info, bool use_function_convention,
     RewriteGraphMetadata* out_metadata) {
-  if (fetch_outputs.empty() && target_node_names.empty()) {
+  if (fetch_rewrites.empty() && target_node_names.empty()) {
     return errors::InvalidArgument(
         "Must specify at least one target to fetch or execute.");
   }
 
   std::unordered_set<string> endpoints;
-  for (const string& endpoint_name : fed_outputs) {
-    auto result = endpoints.insert(endpoint_name);
+  for (const auto& feed_rewrite : feed_rewrites) {
+    auto result = endpoints.insert(feed_rewrite->endpoint_name());
     if (!result.second) {
-      return errors::InvalidArgument("Endpoint \"", endpoint_name,
+      return errors::InvalidArgument("Endpoint \"",
+                                     feed_rewrite->endpoint_name(),
                                      "\" fed more than once.");
     }
   }
 
-  for (const auto& fetch : fetch_outputs) {
-    if (endpoints.count(fetch) > 0) {
-      return errors::InvalidArgument(fetch, " is both fed and fetched.");
+  for (const auto& fetch_rewrite : fetch_rewrites) {
+    if (endpoints.count(fetch_rewrite->endpoint_name()) > 0) {
+      return errors::InvalidArgument(fetch_rewrite->endpoint_name(),
+                                     " is both fed and fetched.");
     }
   }
 
@@ -297,19 +366,17 @@ Status RewriteGraphForExecution(
   }
 
   // Add the feeds.  This may replace nodes in the graph, including the nodes
-  // currently listed in "fetch_nodes".  We pass "name_index" so the index is
+  // currently listed in "fetch_rewrites".  We pass "name_index" so the index is
   // kept up to date.
-  if (!fed_outputs.empty()) {
-    TF_RETURN_IF_ERROR(FeedInputs(g, device_info, fed_outputs,
-                                  use_function_convention, &name_index,
-                                  &out_metadata->feed_types));
+  if (!feed_rewrites.empty()) {
+    TF_RETURN_IF_ERROR(
+        FeedInputs(g, feed_rewrites, &name_index, &out_metadata->feed_types));
   }
 
   // Add the fetch nodes, also updating "name_index".
   std::vector<Node*> fetch_nodes;
-  if (!fetch_outputs.empty()) {
-    TF_RETURN_IF_ERROR(FetchOutputs(g, device_info, fetch_outputs,
-                                    use_function_convention, &name_index,
+  if (!fetch_rewrites.empty()) {
+    TF_RETURN_IF_ERROR(FetchOutputs(g, fetch_rewrites, &name_index,
                                     &fetch_nodes, &out_metadata->fetch_types));
   }
 
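Reviewer sketch (not part of the patch): the endpoint checks that the generalized `RewriteGraphForExecution()` performs before touching the graph can be illustrated in isolation. This is a minimal sketch under stated assumptions: `ValidateEndpoints` is a hypothetical helper, it reports failures as a plain `std::string` instead of a `Status`, and it has no dependency on TensorFlow.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, for illustration only. Returns an empty string when
// the feeds, fetches, and targets are consistent, or a message describing
// the first problem found.
std::string ValidateEndpoints(const std::vector<std::string>& feeds,
                              const std::vector<std::string>& fetches,
                              const std::vector<std::string>& targets) {
  // There must be something to fetch or to run.
  if (fetches.empty() && targets.empty()) {
    return "Must specify at least one target to fetch or execute.";
  }

  // Each endpoint may be fed at most once.
  std::unordered_set<std::string> endpoints;
  for (const std::string& feed : feeds) {
    if (!endpoints.insert(feed).second) {
      return "Endpoint \"" + feed + "\" fed more than once.";
    }
  }

  // An endpoint cannot be both fed and fetched.
  for (const std::string& fetch : fetches) {
    if (endpoints.count(fetch) > 0) {
      return fetch + " is both fed and fetched.";
    }
  }
  return "";
}
```

In the patch itself the same checks read the names through `endpoint_name()` on each rewrite object, so they apply equally to custom `PruneRewrite` subclasses.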
diff --git a/tensorflow/core/graph/subgraph.h b/tensorflow/core/graph/subgraph.h
index 3c1f8870f57f6d585f795cc92c320927e1a29315..ba35846d937bfeeeab825be2a2897aa6f3a195b7 100644
--- a/tensorflow/core/graph/subgraph.h
+++ b/tensorflow/core/graph/subgraph.h
@@ -20,8 +20,10 @@ limitations under the License.
 
 #include "tensorflow/core/framework/device_attributes.pb.h"
 #include "tensorflow/core/graph/graph.h"
+#include "tensorflow/core/graph/node_builder.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
+#include "tensorflow/core/protobuf/config.pb.h"
 
 namespace tensorflow {
 namespace subgraph {
@@ -38,6 +40,37 @@ struct RewriteGraphMetadata {
   DataTypeVector fetch_types;
 };
 
+// Describes the action to take on a particular tensor endpoint (described by
+// a "<node_name>:<output_index>" pair) when pruning the graph.
+//
+// The `AddNode()` method must be overridden to describe this action. The method
+// will be invoked once during `RewriteGraphForExecution()` with the tensor
+// endpoint named by `endpoint_name`, and it may either create a single new
+// node, or fail with an error if the resulting graph would be invalid.
+class PruneRewrite {
+ public:
+  // `endpoint_name` and `device_info` must outlive this object.
+  PruneRewrite(const string* endpoint_name, const DeviceAttributes* device_info)
+      : endpoint_name_(endpoint_name), device_info_(device_info) {}
+  virtual ~PruneRewrite() {}
+
+  // Creates a new node whose output replaces the given `tensor` in graph `g`.
+  // The node will be assigned to the device named in `device_info`.
+  virtual Status AddNode(Graph* g, NodeBuilder::NodeOut tensor,
+                         Node** out_node) = 0;
+
+  // Returns the name of the tensor to which this rewrite applies.
+  const string& endpoint_name() { return *endpoint_name_; }
+
+ protected:
+  // The device on which the new node will be created.
+  const DeviceAttributes& device_info() { return *device_info_; }
+
+ private:
+  const string* const endpoint_name_;          // Not owned.
+  const DeviceAttributes* const device_info_;  // Not owned.
+};
+
 // Rewrite the graph structure of "*g" to deal with feeding node
 // outputs, fetching node outputs, and only running a subset of the
 // graph.  "fed_outputs" and "fetch_outputs" are both lists of
@@ -48,7 +81,7 @@ struct RewriteGraphMetadata {
 // In the resulting graph "*g", output edges in "fed_outputs" have
 // been redirected to special "_recv" nodes introduced into the graph.
 // If these fed nodes are not needed in order to compute the effects
-// of the nodes in "targets_nodes" and "fetch_outputs", then these may
+// of the nodes in "target_node_names" and "fetch_outputs", then these may
 // be omitted from the graph.
 //
 // In the resulting graph "*g", additional "_send" nodes are connected
@@ -71,19 +104,60 @@ Status RewriteGraphForExecution(
     const DeviceAttributes& device_info, bool use_function_convention,
     RewriteGraphMetadata* out_metadata);
 
-typedef std::unordered_map<StringPiece, Node*, StringPieceHasher> NameIndex;
+// A more general version of the above function that supports
+// customizable rewriting actions for each fed and fetched tensor.
+Status RewriteGraphForExecution(
+    Graph* g, const std::vector<std::unique_ptr<PruneRewrite>>& feed_rewrites,
+    const std::vector<std::unique_ptr<PruneRewrite>>& fetch_rewrites,
+    const gtl::ArraySlice<string>& target_node_names,
+    RewriteGraphMetadata* out_metadata);
 
-// Augment "*g" by adding special "fetch" nodes that connect to the
-// tensor outputs specified in "fetch_outputs" to retrieve the output
-// of the tensors.  The new nodes added are set up to execute on
-// "client_device_name", and are returned in "*fetch_nodes".
-//
-// Return OK on success.  On error, return false and sets *error to
-// an appropriate error message (and *g is left in an indeterminate
-// state).
-Status FetchOutputs(Graph* g, const DeviceAttributes& device_info,
-                    const gtl::ArraySlice<string>& fetch_outputs,
-                    NameIndex* name_index, std::vector<Node*>* fetch_nodes);
+/////////////////////////////////////////////////////////
+// Custom rewrite actions for fed and fetched tensors. //
+/////////////////////////////////////////////////////////
+
+// A rewrite action that adds an _Arg node for a fed tensor.
+class ArgFeedRewrite : public PruneRewrite {
+ public:
+  ArgFeedRewrite(const string* endpoint_name,
+                 const DeviceAttributes* device_info, int32 arg_index)
+      : PruneRewrite(endpoint_name, device_info), arg_index_(arg_index) {}
+  Status AddNode(Graph* g, NodeBuilder::NodeOut feed_tensor,
+                 Node** out_node) override;
+
+ private:
+  const int32 arg_index_;
+};
+
+// A rewrite action that adds a client-terminated _Recv node for a fed tensor.
+class RecvFeedRewrite : public PruneRewrite {
+ public:
+  using PruneRewrite::PruneRewrite;
+  Status AddNode(Graph* g, NodeBuilder::NodeOut feed_tensor,
+                 Node** out_node) override;
+};
+
+// A rewrite action that adds a _Retval node for a fetched tensor.
+class RetvalFetchRewrite : public PruneRewrite {
+ public:
+  RetvalFetchRewrite(const string* endpoint_name,
+                     const DeviceAttributes* device_info, int32 retval_index)
+      : PruneRewrite(endpoint_name, device_info), retval_index_(retval_index) {}
+  Status AddNode(Graph* g, NodeBuilder::NodeOut fetch_tensor,
+                 Node** out_node) override;
+
+ private:
+  const int32 retval_index_;
+};
+
+// A rewrite action that adds a client-terminated _Send node for a
+// fetched tensor.
+class SendFetchRewrite : public PruneRewrite {
+ public:
+  using PruneRewrite::PruneRewrite;
+  Status AddNode(Graph* g, NodeBuilder::NodeOut fetch_tensor,
+                 Node** out_node) override;
+};
 
 }  // namespace subgraph
 }  // namespace tensorflow
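Reviewer sketch (not part of the patch): the header above replaces a boolean `use_function_convention` switch with one polymorphic rewrite object per endpoint. The toy model below shows that shape without depending on TensorFlow. Every name in it (`ToyGraph`, `ToyRewrite`, `ToyArgRewrite`, `ToyRecvRewrite`, `MakeFeedRewrites`, `ApplyRewrites`) is a hypothetical stand-in, and the generated node names are simplified relative to the real `_Arg` and `_Recv` naming.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a graph: only the names of the nodes added so far.
struct ToyGraph {
  std::vector<std::string> node_names;
};

// Simplified analogue of PruneRewrite: one action per tensor endpoint.
class ToyRewrite {
 public:
  // `endpoint_name` must outlive this object.
  explicit ToyRewrite(const std::string* endpoint_name)
      : endpoint_name_(endpoint_name) {}
  virtual ~ToyRewrite() {}

  // Adds one node to `g` and reports its name through `out_name`.
  // Returns false on failure.
  virtual bool AddNode(ToyGraph* g, std::string* out_name) = 0;

  const std::string& endpoint_name() const { return *endpoint_name_; }

 private:
  const std::string* const endpoint_name_;  // Not owned.
};

// Analogue of ArgFeedRewrite: the node name carries the argument index.
class ToyArgRewrite : public ToyRewrite {
 public:
  ToyArgRewrite(const std::string* endpoint_name, int arg_index)
      : ToyRewrite(endpoint_name), arg_index_(arg_index) {}
  bool AddNode(ToyGraph* g, std::string* out_name) override {
    *out_name = "_arg_" + endpoint_name() + "_" + std::to_string(arg_index_);
    g->node_names.push_back(*out_name);
    return true;
  }

 private:
  const int arg_index_;
};

// Analogue of RecvFeedRewrite: no index is needed.
class ToyRecvRewrite : public ToyRewrite {
 public:
  using ToyRewrite::ToyRewrite;
  bool AddNode(ToyGraph* g, std::string* out_name) override {
    *out_name = "_recv_" + endpoint_name();
    g->node_names.push_back(*out_name);
    return true;
  }
};

// Converts the boolean flag into one rewrite object per endpoint.
// `endpoints` must outlive the returned rewrites.
std::vector<std::unique_ptr<ToyRewrite>> MakeFeedRewrites(
    const std::vector<std::string>& endpoints, bool use_function_convention) {
  std::vector<std::unique_ptr<ToyRewrite>> rewrites;
  rewrites.reserve(endpoints.size());
  for (std::size_t i = 0; i < endpoints.size(); ++i) {
    if (use_function_convention) {
      rewrites.emplace_back(
          new ToyArgRewrite(&endpoints[i], static_cast<int>(i)));
    } else {
      rewrites.emplace_back(new ToyRecvRewrite(&endpoints[i]));
    }
  }
  return rewrites;
}

// The driver no longer branches on the convention; it only calls AddNode().
bool ApplyRewrites(ToyGraph* g,
                   const std::vector<std::unique_ptr<ToyRewrite>>& rewrites) {
  assert(g != nullptr);
  for (const auto& rewrite : rewrites) {
    std::string added_name;
    if (!rewrite->AddNode(g, &added_name)) return false;
  }
  return true;
}
```

The design point is that a caller can supply a new subclass for a single endpoint without changing the driver loop, which is what the generalized `RewriteGraphForExecution()` overload makes possible.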
diff --git a/tensorflow/core/grappler/clusters/BUILD b/tensorflow/core/grappler/clusters/BUILD
index b8f8e13c9a6830658e2b53388e1f91fbc8a22eab..b653f902e857ce804f797a016ebde551bf3b6695 100644
--- a/tensorflow/core/grappler/clusters/BUILD
+++ b/tensorflow/core/grappler/clusters/BUILD
@@ -1,7 +1,12 @@
 licenses(["notice"])  # Apache 2.0
 
+load("//tensorflow:tensorflow.bzl", "if_cuda")
 load("//tensorflow:tensorflow.bzl", "tf_cc_test")
 load("//tensorflow:tensorflow.bzl", "tf_cuda_library")
+load(
+    "//tensorflow/core:platform/default/build_config_root.bzl",
+    "tf_cuda_tests_tags",
+)
 
 filegroup(
     name = "all_files",
@@ -26,13 +31,12 @@ config_setting(
 tf_cuda_library(
     name = "utils",
     srcs = ["utils.cc"],
-    hdrs = [
-        "utils.h",
-    ],
+    hdrs = ["utils.h"],
     visibility = ["//visibility:public"],
     deps = [
         "//third_party/eigen3",
         "//tensorflow/core:framework",
+        "//tensorflow/core:gpu_id",
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
     ] + select({
@@ -41,6 +45,21 @@ tf_cuda_library(
     }),
 )
 
+tf_cc_test(
+    name = "utils_test",
+    srcs = ["utils_test.cc"],
+    linkstatic = if_cuda(1, 0),
+    tags = tf_cuda_tests_tags(),
+    deps = [
+        ":utils",
+        "//tensorflow/core:gpu_id",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
 cc_library(
     name = "cluster",
     srcs = ["cluster.cc"],
@@ -104,6 +123,7 @@ cc_library(
         "//tensorflow/core:core_cpu_lib",
         "//tensorflow/core:direct_session",
         "//tensorflow/core:framework",
+        "//tensorflow/core:gpu_id",
         "//tensorflow/core:lib",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core/kernels:ops_util",
diff --git a/tensorflow/core/grappler/clusters/single_machine.cc b/tensorflow/core/grappler/clusters/single_machine.cc
index cc7f418d49816d64ffc51704d2f127a441815d7b..313ef90d81981ce43ab818873efa4da908e7dcfa 100644
--- a/tensorflow/core/grappler/clusters/single_machine.cc
+++ b/tensorflow/core/grappler/clusters/single_machine.cc
@@ -21,6 +21,8 @@ limitations under the License.
 #include "tensorflow/cc/training/queue_runner.h"
 #include "tensorflow/core/common_runtime/device.h"
 #include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_id_manager.h"
 #include "tensorflow/core/grappler/clusters/utils.h"
 #include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/core/kernels/ops_util.h"
@@ -80,13 +82,24 @@ Status SingleMachine::Provision() {
 
   std::vector<DeviceAttributes> devices;
   TF_RETURN_IF_ERROR(session_->ListDevices(&devices));
-  int gpu_id = 0;
   for (const auto& dev : devices) {
     DeviceProperties attr;
     if (dev.device_type() == "CPU") {
       attr = GetLocalCPUInfo();
     } else if (dev.device_type() == "GPU") {
-      attr = GetLocalGPUInfo(gpu_id++);
+      DeviceNameUtils::ParsedName parsed;
+      if (!DeviceNameUtils::ParseFullName(dev.name(), &parsed)) {
+        return errors::InvalidArgument(
+            strings::StrCat("Not able to parse GPU device name: ", dev.name()));
+      }
+      TfGpuId tf_gpu_id(parsed.id);
+      CudaGpuId cuda_gpu_id;
+      Status s = GpuIdManager::TfToCudaGpuId(tf_gpu_id, &cuda_gpu_id);
+      if (!s.ok()) {
+        return errors::Unavailable("Unknown TF GPU device with id ",
+                                   tf_gpu_id.value(), ": ", s.ToString());
+      }
+      attr = GetLocalGPUInfo(cuda_gpu_id);
     } else if (dev.device_type().find("XLA") == string::npos) {
       // Filter out the fake XLA devices to avoid double counting the actual
       // hardware resources that are available.
@@ -365,10 +378,15 @@ void SingleMachine::MergeCosts(CostGraphDef* graph_costs,
                                        init_costs.node_size() +
                                        queue_costs.node_size());
   std::unordered_set<string> nodes_seen;
+  int queue_costs_id_offset = graph_costs->node_size();
   for (const auto& node : graph_costs->node()) {
     nodes_seen.insert(node.name());
+    if (node.id() >= queue_costs_id_offset) {
+      queue_costs_id_offset = node.id() + 1;
+    }
   }
 
+  int init_costs_id_offset = queue_costs_id_offset + queue_costs.node_size();
   // The costs obtained by running the main graph could be more stable than
   // the one we get from the queue runners since the queue runners run
   // asynchronously.
@@ -376,7 +394,22 @@ void SingleMachine::MergeCosts(CostGraphDef* graph_costs,
     if (nodes_seen.find(node.name()) != nodes_seen.end()) {
       continue;
     }
-    graph_costs->add_node()->MergeFrom(node);
+
+    auto* new_node = graph_costs->add_node();
+    new_node->MergeFrom(node);
+
+    new_node->set_id(node.id() + queue_costs_id_offset);
+    if (new_node->id() >= init_costs_id_offset) {
+      init_costs_id_offset = new_node->id() + 1;
+    }
+
+    for (auto& input_info : *new_node->mutable_input_info()) {
+      input_info.set_preceding_node(input_info.preceding_node() +
+                                    queue_costs_id_offset);
+    }
+    for (auto& control_input : *new_node->mutable_control_input()) {
+      control_input += queue_costs_id_offset;
+    }
   }
 
   // Don't overwrite the costs with that generated during initialization since
@@ -385,7 +418,18 @@ void SingleMachine::MergeCosts(CostGraphDef* graph_costs,
     if (nodes_seen.find(node.name()) != nodes_seen.end()) {
       continue;
     }
-    graph_costs->add_node()->MergeFrom(node);
+
+    auto* new_node = graph_costs->add_node();
+    new_node->MergeFrom(node);
+
+    new_node->set_id(node.id() + init_costs_id_offset);
+    for (auto& input_info : *new_node->mutable_input_info()) {
+      input_info.set_preceding_node(input_info.preceding_node() +
+                                    init_costs_id_offset);
+    }
+    for (auto& control_input : *new_node->mutable_control_input()) {
+      control_input += init_costs_id_offset;
+    }
   }
 }
 
diff --git a/tensorflow/core/grappler/clusters/utils.cc b/tensorflow/core/grappler/clusters/utils.cc
index 607e10e1ab56b290c6b50f51d54570ff39204996..b54b34959a53b56022a449ca286ff0ba823f2aa5 100644
--- a/tensorflow/core/grappler/clusters/utils.cc
+++ b/tensorflow/core/grappler/clusters/utils.cc
@@ -27,6 +27,9 @@ limitations under the License.
 #include "include/libxsmm.h"
 #endif
 
+#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_id_manager.h"
+#include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/cpu_info.h"
@@ -66,36 +69,40 @@ DeviceProperties GetLocalCPUInfo() {
   return device;
 }
 
-DeviceProperties GetLocalGPUInfo(int gpu_id) {
+DeviceProperties GetLocalGPUInfo(CudaGpuId cuda_gpu_id) {
   DeviceProperties device;
   device.set_type("GPU");
 
 #if GOOGLE_CUDA
   cudaDeviceProp properties;
-  cudaError_t error = cudaGetDeviceProperties(&properties, gpu_id);
-  if (error == cudaSuccess) {
-    device.set_vendor("NVidia");
-    device.set_model(properties.name);
-    device.set_frequency(properties.clockRate * 1e-3);
-    device.set_num_cores(properties.multiProcessorCount);
-    device.set_num_registers(properties.regsPerMultiprocessor);
-    // For compute capability less than 5, l1 cache size is configurable to
-    // either 16 KB or 48 KB. We use the initial configuration 16 KB here. For
-    // compute capability larger or equal to 5, l1 cache (unified with texture
-    // cache) size is 24 KB. This number may need to be updated for future
-    // compute capabilities.
-    device.set_l1_cache_size((properties.major < 5) ? 16 * 1024 : 24 * 1024);
-    device.set_l2_cache_size(properties.l2CacheSize);
-    device.set_l3_cache_size(0);
-    device.set_shared_memory_size_per_multiprocessor(
-        properties.sharedMemPerMultiprocessor);
-    device.set_memory_size(properties.totalGlobalMem);
-    // 8 is the number of bits per byte. 2 is accounted for
-    // double data rate (DDR).
-    device.set_bandwidth(properties.memoryBusWidth / 8 *
-                         properties.memoryClockRate * 2);
+  cudaError_t error = cudaGetDeviceProperties(&properties, cuda_gpu_id.value());
+  if (error != cudaSuccess) {
+    device.set_type("UNKNOWN");
+    LOG(ERROR) << "Failed to get device properties, error code: " << error;
+    return device;
   }
 
+  device.set_vendor("NVIDIA");
+  device.set_model(properties.name);
+  device.set_frequency(properties.clockRate * 1e-3);
+  device.set_num_cores(properties.multiProcessorCount);
+  device.set_num_registers(properties.regsPerMultiprocessor);
+  // For compute capability less than 5, l1 cache size is configurable to
+  // either 16 KB or 48 KB. We use the initial configuration 16 KB here. For
+  // compute capability larger or equal to 5, l1 cache (unified with texture
+  // cache) size is 24 KB. This number may need to be updated for future
+  // compute capabilities.
+  device.set_l1_cache_size((properties.major < 5) ? 16 * 1024 : 24 * 1024);
+  device.set_l2_cache_size(properties.l2CacheSize);
+  device.set_l3_cache_size(0);
+  device.set_shared_memory_size_per_multiprocessor(
+      properties.sharedMemPerMultiprocessor);
+  device.set_memory_size(properties.totalGlobalMem);
+  // 8 is the number of bits per byte. 2 is accounted for
+  // double data rate (DDR).
+  device.set_bandwidth(properties.memoryBusWidth / 8 *
+                       properties.memoryClockRate * 2);
+
   (*device.mutable_environment())["architecture"] =
       strings::StrCat(properties.major, ".", properties.minor);
   (*device.mutable_environment())["cuda"] = strings::StrCat(CUDA_VERSION);
@@ -106,18 +113,26 @@ DeviceProperties GetLocalGPUInfo(int gpu_id) {
 }
 
 DeviceProperties GetDeviceInfo(const DeviceNameUtils::ParsedName& device) {
+  DeviceProperties unknown;
+  unknown.set_type("UNKNOWN");
+
   if (device.type == "CPU") {
     return GetLocalCPUInfo();
   } else if (device.type == "GPU") {
     if (device.has_id) {
-      return GetLocalGPUInfo(device.id);
+      TfGpuId tf_gpu_id(device.id);
+      CudaGpuId cuda_gpu_id;
+      Status s = GpuIdManager::TfToCudaGpuId(tf_gpu_id, &cuda_gpu_id);
+      if (!s.ok()) {
+        LOG(ERROR) << s;
+        return unknown;
+      }
+      return GetLocalGPUInfo(cuda_gpu_id);
     } else {
-      return GetLocalGPUInfo(0);
+      return GetLocalGPUInfo(CudaGpuId(0));
     }
   }
-  DeviceProperties result;
-  result.set_type("UNKNOWN");
-  return result;
+  return unknown;
 }
 
 }  // end namespace grappler
diff --git a/tensorflow/core/grappler/clusters/utils.h b/tensorflow/core/grappler/clusters/utils.h
index 191942040a1fdd276bb50f799ce314389c2cb0fe..df8e7dca44ad637aed8a6a2e87fc6e20bdf62606 100644
--- a/tensorflow/core/grappler/clusters/utils.h
+++ b/tensorflow/core/grappler/clusters/utils.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_GRAPPLER_CLUSTERS_UTILS_H_
 #define TENSORFLOW_GRAPPLER_CLUSTERS_UTILS_H_
 
+#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
 #include "tensorflow/core/protobuf/device_properties.pb.h"
 #include "tensorflow/core/util/device_name_utils.h"
 
@@ -27,7 +28,7 @@ DeviceProperties GetLocalCPUInfo();
 
 // Returns the DeviceProperties for the specified GPU attached to the server on
 // which grappler is running.
-DeviceProperties GetLocalGPUInfo(int gpu_id);
+DeviceProperties GetLocalGPUInfo(CudaGpuId cuda_gpu_id);
 
 // Returns the DeviceProperties of the specified device
 DeviceProperties GetDeviceInfo(const DeviceNameUtils::ParsedName& device);
diff --git a/tensorflow/core/grappler/clusters/utils_test.cc b/tensorflow/core/grappler/clusters/utils_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..74218adbac4eda3a7a780933b8116cfd2b7a1b18
--- /dev/null
+++ b/tensorflow/core/grappler/clusters/utils_test.cc
@@ -0,0 +1,100 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/clusters/utils.h"
+
+#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_id_manager.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/protobuf/device_properties.pb.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+TEST(UtilsTest, GetLocalGPUInfo) {
+  GpuIdManager::TestOnlyReset();
+#if GOOGLE_CUDA
+  LOG(INFO) << "CUDA is enabled.";
+  DeviceProperties properties;
+
+  // Invalid CUDA GPU ID.
+  properties = GetLocalGPUInfo(CudaGpuId(100));
+  EXPECT_EQ("UNKNOWN", properties.type());
+
+  // Succeeds for a valid CUDA GPU id.
+  properties = GetLocalGPUInfo(CudaGpuId(0));
+  EXPECT_EQ("GPU", properties.type());
+  EXPECT_EQ("NVIDIA", properties.vendor());
+#else
+  LOG(INFO) << "CUDA is not enabled.";
+  DeviceProperties properties;
+
+  properties = GetLocalGPUInfo(CudaGpuId(0));
+  EXPECT_EQ("GPU", properties.type());
+
+  properties = GetLocalGPUInfo(CudaGpuId(100));
+  EXPECT_EQ("GPU", properties.type());
+#endif
+}
+
+TEST(UtilsTest, GetDeviceInfo) {
+  GpuIdManager::TestOnlyReset();
+  DeviceNameUtils::ParsedName device;
+  DeviceProperties properties;
+
+  // Invalid type.
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("UNKNOWN", properties.type());
+
+  // Cpu info.
+  device.type = "CPU";
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("CPU", properties.type());
+
+  // No TF GPU id provided.
+  device.type = "GPU";
+  device.has_id = false;
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("GPU", properties.type());
+#if GOOGLE_CUDA
+  EXPECT_EQ("NVIDIA", properties.vendor());
+#endif
+
+  // TF to CUDA GPU id mapping entry doesn't exist.
+  device.has_id = true;
+  device.id = 0;
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("UNKNOWN", properties.type());
+
+#if GOOGLE_CUDA
+  // Invalid CUDA GPU id.
+  GpuIdManager::InsertTfCudaGpuIdPair(TfGpuId(0), CudaGpuId(100));
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("UNKNOWN", properties.type());
+
+  // Valid CUDA GPU id.
+  GpuIdManager::InsertTfCudaGpuIdPair(TfGpuId(1), CudaGpuId(0));
+  device.id = 1;
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("GPU", properties.type());
+  EXPECT_EQ("NVIDIA", properties.vendor());
+#endif
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/costs/BUILD b/tensorflow/core/grappler/costs/BUILD
index 0fe01e9c9e094ebfa7fd1e6200d775ef61775184..5336df1f51dbb5dd5f48857a088ece1b1a04dbb5 100644
--- a/tensorflow/core/grappler/costs/BUILD
+++ b/tensorflow/core/grappler/costs/BUILD
@@ -142,6 +142,7 @@ tf_cuda_library(
         "//third_party/eigen3",
         "//tensorflow/core:framework",
         "//tensorflow/core:graph",
+        "//tensorflow/core:gpu_id",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_proto_parsing",
         "//tensorflow/core:protos_all_cc",
diff --git a/tensorflow/core/grappler/costs/graph_properties.cc b/tensorflow/core/grappler/costs/graph_properties.cc
index 243ca9121c70d91631b474da62281bc56a476d8a..817247e3794ca3e165b2f0445ab164938577336f 100644
--- a/tensorflow/core/grappler/costs/graph_properties.cc
+++ b/tensorflow/core/grappler/costs/graph_properties.cc
@@ -1182,5 +1182,12 @@ GraphProperties::GetOutputProperties(const string& node_name) const {
   return missing_properties_;
 }
 
+void GraphProperties::ClearInputProperties(const string& node_name) {
+  input_properties_.erase(node_name);
+}
+void GraphProperties::ClearOutputProperties(const string& node_name) {
+  output_properties_.erase(node_name);
+}
+
 }  // end namespace grappler
 }  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/costs/graph_properties.h b/tensorflow/core/grappler/costs/graph_properties.h
index 6fc53a7f2e7da7bae7b6f49c7b32291c981fef53..8ff572fe4f826a9346d6822b2c84b51065986be0 100644
--- a/tensorflow/core/grappler/costs/graph_properties.h
+++ b/tensorflow/core/grappler/costs/graph_properties.h
@@ -29,9 +29,12 @@ namespace grappler {
 class SymbolicShapeRefiner;
 class TopoQueue;
 
-// A TensorFlow model to optimize.
-// Models are represented by the combination of a graph, one of more fetch
-// nodes, and potentially a set of nodes to feed.
+// Infers OpInfo::TensorProperties for graph node inputs/outputs.
+//
+// The typical use case is to infer tensor properties from a graph before
+// running an optimization pass. Nodes modified during the pass have to be
+// invalidated, to prevent further incorrect optimizations based on wrong shape
+// and data type properties.
 class GraphProperties {
  public:
   explicit GraphProperties(const GrapplerItem& item) : item_(item) {}
@@ -64,12 +67,11 @@ class GraphProperties {
       const string& node_name) const;
   const std::vector<OpInfo::TensorProperties>& GetOutputProperties(
       const string& node_name) const;
-
-  static void FillTensorPropertiesFromContext(
-      const shape_inference::ShapeHandle&, const DataType&,
-      shape_inference::InferenceContext*,
-      std::unordered_map<const shape_inference::Dimension*, int>* dim_ids,
-      OpInfo::TensorProperties*);
+  // Invalidate input/output properties for nodes modified during a graph
+  // optimization pass, to prevent further optimizations based on incorrect
+  // shape information.
+  void ClearInputProperties(const string& node_name);
+  void ClearOutputProperties(const string& node_name);
 
  private:
   // Merges shapes <shapes_and_types>, determined from an EnqueueV2 node, into
diff --git a/tensorflow/core/grappler/costs/graph_properties_test.cc b/tensorflow/core/grappler/costs/graph_properties_test.cc
index 5012069118fbe0b3d90d2e99690b2988c45a2843..284d9d409bb4d9439cf007e1692838667caff26a 100644
--- a/tensorflow/core/grappler/costs/graph_properties_test.cc
+++ b/tensorflow/core/grappler/costs/graph_properties_test.cc
@@ -113,6 +113,33 @@ TEST_F(GraphPropertiesTest, StaticProperties) {
   }
 }
 
+TEST_F(GraphPropertiesTest, ClearProperties) {
+  TrivialTestGraphInputYielder fake_input(4, 1, 10, false,
+                                          cluster_->GetDeviceNames());
+  GrapplerItem item;
+  CHECK(fake_input.NextItem(&item));
+
+  GraphProperties properties(item);
+  Status s = properties.InferStatically(true);
+  TF_CHECK_OK(s);
+
+  for (const auto& node : item.graph.node()) {
+    if (node.op() == "RandomStandardNormal") {
+      EXPECT_EQ(1, properties.GetInputProperties(node.name()).size());
+      const auto props = properties.GetOutputProperties(node.name());
+      properties.ClearOutputProperties(node.name());
+      const auto cleared_props = properties.GetOutputProperties(node.name());
+      EXPECT_TRUE(cleared_props.empty());
+    } else if (node.op() == "AddN") {
+      const auto in_props = properties.GetInputProperties(node.name());
+      EXPECT_EQ(1, in_props.size());
+      properties.ClearInputProperties(node.name());
+      const auto cleared_props = properties.GetInputProperties(node.name());
+      EXPECT_TRUE(cleared_props.empty());
+    }
+  }
+}
+
 TEST_F(GraphPropertiesTest, DynamicProperties) {
   TrivialTestGraphInputYielder fake_input(4, 1, 10, false,
                                           cluster_->GetDeviceNames());
diff --git a/tensorflow/core/grappler/costs/op_level_cost_estimator.cc b/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
index 29ef317e46f13bd64847fd898fcb2eb9fee67f1c..fdbc61f3f18087c40bf20716b503d3a53d37a47d 100644
--- a/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
+++ b/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/core/framework/attr_value.pb.h"
 #include "tensorflow/core/framework/attr_value_util.h"
 #include "tensorflow/core/framework/tensor_shape.pb.h"
+#include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/grappler/clusters/utils.h"
 
 namespace tensorflow {
@@ -46,6 +47,9 @@ constexpr char kShape[] = "Shape";
 constexpr char kSize[] = "Size";
 constexpr char kStopGradient[] = "StopGradient";
 constexpr char kPreventGradient[] = "PreventGradient";
+constexpr char kGather[] = "Gather";
+constexpr char kGatherV2[] = "GatherV2";
+constexpr char kSlice[] = "Slice";
 
 static const Costs::Duration kMinComputeTime(1);
 
@@ -167,6 +171,10 @@ OpLevelCostEstimator::OpLevelCostEstimator() {
 
       {kNoOp, wrap(&OpLevelCostEstimator::PredictNoOp)},
 
+      {kGather, wrap(&OpLevelCostEstimator::PredictGatherOrSlice)},
+      {kGatherV2, wrap(&OpLevelCostEstimator::PredictGatherOrSlice)},
+      {kSlice, wrap(&OpLevelCostEstimator::PredictGatherOrSlice)},
+
       {kPlaceholder, wrap(&OpLevelCostEstimator::PredictIdentity)},
       {kIdentity, wrap(&OpLevelCostEstimator::PredictIdentity)},
       {kRefIdentity, wrap(&OpLevelCostEstimator::PredictIdentity)},
@@ -184,107 +192,75 @@ OpLevelCostEstimator::OpLevelCostEstimator() {
       {kShape, wrap(&OpLevelCostEstimator::PredictMetadata)},
       {kSize, wrap(&OpLevelCostEstimator::PredictMetadata)}};
 
-  elementwise_ops_ = {
-      // Unary ops alphabetically sorted
-      {"Acos", Eigen::internal::functor_traits<
-                   Eigen::internal::scalar_acos_op<float>>::Cost},
-      {"Asin", Eigen::internal::functor_traits<
-                   Eigen::internal::scalar_asin_op<float>>::Cost},
-      {"Atan", Eigen::internal::functor_traits<
-                   Eigen::internal::scalar_atan_op<float>>::Cost},
-      {"Atan2", Eigen::internal::functor_traits<
-                    Eigen::internal::scalar_quotient_op<float>>::Cost +
-                    Eigen::internal::functor_traits<
-                        Eigen::internal::scalar_atan_op<float>>::Cost},
-      {"Ceil", Eigen::internal::functor_traits<
-                   Eigen::internal::scalar_ceil_op<float>>::Cost},
-      {"Cos", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_cos_op<float>>::Cost},
-      {"Erf", 1},
-      {"Erfc", 1},
-      {"Exp", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_exp_op<float>>::Cost},
-      {"Expm1", Eigen::internal::functor_traits<
-                    Eigen::internal::scalar_expm1_op<float>>::Cost},
-      {"Floor", Eigen::internal::functor_traits<
-                    Eigen::internal::scalar_floor_op<float>>::Cost},
-      {"Inv", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_inverse_op<float>>::Cost},
-      {"InvGrad", 1},
-      {"Lgamma", 1},
-      {"Log", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_log_op<float>>::Cost},
-      {"Log1p", Eigen::internal::functor_traits<
-                    Eigen::internal::scalar_log1p_op<float>>::Cost},
-      {"Neg", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_opposite_op<float>>::Cost},
-      {"Reciprocal", Eigen::internal::functor_traits<
-                         Eigen::internal::scalar_inverse_op<float>>::Cost},
-      {"Rint", 1},
-      {"Round", Eigen::internal::functor_traits<
-                    Eigen::internal::scalar_round_op<float>>::Cost},
-      {"Rsqrt", Eigen::internal::functor_traits<
-                    Eigen::internal::scalar_rsqrt_op<float>>::Cost},
-      {"Sqrt", Eigen::internal::functor_traits<
-                   Eigen::internal::scalar_sqrt_op<float>>::Cost},
-      {"Square", Eigen::internal::functor_traits<
-                     Eigen::internal::scalar_square_op<float>>::Cost},
-      {"Tanh", Eigen::internal::functor_traits<
-                   Eigen::internal::scalar_tanh_op<float>>::Cost},
-      {"Relu", Eigen::internal::functor_traits<
-                   Eigen::internal::scalar_max_op<float>>::Cost},
-      {"Sigmoid", Eigen::internal::functor_traits<
-                      Eigen::internal::scalar_sigmoid_op<float>>::Cost},
-      {"Sign", Eigen::internal::functor_traits<
-                   Eigen::internal::scalar_sign_op<float>>::Cost},
-      {"Sin", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_sin_op<float>>::Cost},
-      {"Tan", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_tan_op<float>>::Cost},
-      // Binary ops alphabetically sorted
-      {"Add", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_sum_op<float>>::Cost},
-      {"ApproximateEqual", 1},
-      {"BiasAdd", Eigen::internal::functor_traits<
-                      Eigen::internal::scalar_sum_op<float>>::Cost},
-      {"Div", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_quotient_op<float>>::Cost},
-      {"Equal", 1},
-      {"FloorDiv", Eigen::internal::functor_traits<
-                       Eigen::internal::scalar_quotient_op<float>>::Cost},
-      {"FloorMod", Eigen::internal::functor_traits<
-                       Eigen::internal::scalar_mod_op<float>>::Cost},
-      {"Greater", 1},
-      {"GreaterEqual", 1},
-      {"Less", 1},
-      {"LessEqual", 1},
-      {"LogicalAnd", Eigen::internal::functor_traits<
-                         Eigen::internal::scalar_boolean_and_op>::Cost},
-      {"LogicalNot", 1},
-      {"LogicalOr", Eigen::internal::functor_traits<
-                        Eigen::internal::scalar_boolean_or_op>::Cost},
-      {"Maximum", Eigen::internal::functor_traits<
-                      Eigen::internal::scalar_max_op<float>>::Cost},
-      {"Minimum", Eigen::internal::functor_traits<
-                      Eigen::internal::scalar_min_op<float>>::Cost},
-      {"Mod", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_mod_op<float>>::Cost},
-      {"Mul", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_product_op<float>>::Cost},
-      {"NotEqual", 1},
-      {"QuantizedAdd", Eigen::internal::functor_traits<
-                           Eigen::internal::scalar_sum_op<float>>::Cost},
-      {"QuantizedMul", Eigen::internal::functor_traits<
-                           Eigen::internal::scalar_product_op<float>>::Cost},
-      {"RealDiv", Eigen::internal::functor_traits<
-                      Eigen::internal::scalar_quotient_op<float>>::Cost},
-      {"SquareDifference", 1},
-      {"Sub", Eigen::internal::functor_traits<
-                  Eigen::internal::scalar_difference_op<float>>::Cost},
-      {"TruncateDiv", Eigen::internal::functor_traits<
-                          Eigen::internal::scalar_quotient_op<float>>::Cost},
-      {"TruncateMod", Eigen::internal::functor_traits<
-                          Eigen::internal::scalar_mod_op<float>>::Cost}};
+#define EIGEN_COST(X) Eigen::internal::functor_traits<Eigen::internal::X>::Cost
+
+  // Quantize = apply min and max bounds, multiply by scale factor and round.
+  const int quantize_v2_cost =
+      EIGEN_COST(scalar_product_op<float>) + EIGEN_COST(scalar_max_op<float>) +
+      EIGEN_COST(scalar_min_op<float>) + EIGEN_COST(scalar_round_op<float>);
+
+  elementwise_ops_ = {// Unary ops alphabetically sorted
+                      {"Acos", EIGEN_COST(scalar_acos_op<float>)},
+                      {"Asin", EIGEN_COST(scalar_asin_op<float>)},
+                      {"Atan", EIGEN_COST(scalar_atan_op<float>)},
+                      {"Atan2", EIGEN_COST(scalar_quotient_op<float>) +
+                                    EIGEN_COST(scalar_atan_op<float>)},
+                      {"Ceil", EIGEN_COST(scalar_ceil_op<float>)},
+                      {"Cos", EIGEN_COST(scalar_cos_op<float>)},
+                      {"Dequantize", EIGEN_COST(scalar_product_op<float>)},
+                      {"Erf", 1},
+                      {"Erfc", 1},
+                      {"Exp", EIGEN_COST(scalar_exp_op<float>)},
+                      {"Expm1", EIGEN_COST(scalar_expm1_op<float>)},
+                      {"Floor", EIGEN_COST(scalar_floor_op<float>)},
+                      {"Inv", EIGEN_COST(scalar_inverse_op<float>)},
+                      {"InvGrad", 1},
+                      {"Lgamma", 1},
+                      {"Log", EIGEN_COST(scalar_log_op<float>)},
+                      {"Log1p", EIGEN_COST(scalar_log1p_op<float>)},
+                      {"Neg", EIGEN_COST(scalar_opposite_op<float>)},
+                      {"QuantizeV2", quantize_v2_cost},
+                      {"Reciprocal", EIGEN_COST(scalar_inverse_op<float>)},
+                      {"Rint", 1},
+                      {"Round", EIGEN_COST(scalar_round_op<float>)},
+                      {"Rsqrt", EIGEN_COST(scalar_rsqrt_op<float>)},
+                      {"Sqrt", EIGEN_COST(scalar_sqrt_op<float>)},
+                      {"Square", EIGEN_COST(scalar_square_op<float>)},
+                      {"Tanh", EIGEN_COST(scalar_tanh_op<float>)},
+                      {"Relu", EIGEN_COST(scalar_max_op<float>)},
+                      {"Sigmoid", EIGEN_COST(scalar_sigmoid_op<float>)},
+                      {"Sign", EIGEN_COST(scalar_sign_op<float>)},
+                      {"Sin", EIGEN_COST(scalar_sin_op<float>)},
+                      {"Tan", EIGEN_COST(scalar_tan_op<float>)},
+                      // Binary ops alphabetically sorted
+                      {"Add", EIGEN_COST(scalar_sum_op<float>)},
+                      {"ApproximateEqual", 1},
+                      {"BiasAdd", EIGEN_COST(scalar_sum_op<float>)},
+                      {"Div", EIGEN_COST(scalar_quotient_op<float>)},
+                      {"Equal", 1},
+                      {"FloorDiv", EIGEN_COST(scalar_quotient_op<float>)},
+                      {"FloorMod", EIGEN_COST(scalar_mod_op<float>)},
+                      {"Greater", 1},
+                      {"GreaterEqual", 1},
+                      {"Less", 1},
+                      {"LessEqual", 1},
+                      {"LogicalAnd", EIGEN_COST(scalar_boolean_and_op)},
+                      {"LogicalNot", 1},
+                      {"LogicalOr", EIGEN_COST(scalar_boolean_or_op)},
+                      {"Maximum", EIGEN_COST(scalar_max_op<float>)},
+                      {"Minimum", EIGEN_COST(scalar_min_op<float>)},
+                      {"Mod", EIGEN_COST(scalar_mod_op<float>)},
+                      {"Mul", EIGEN_COST(scalar_product_op<float>)},
+                      {"NotEqual", 1},
+                      {"QuantizedAdd", EIGEN_COST(scalar_sum_op<float>)},
+                      {"QuantizedMul", EIGEN_COST(scalar_product_op<float>)},
+                      {"RealDiv", EIGEN_COST(scalar_quotient_op<float>)},
+                      {"SquareDifference", 1},
+                      {"Sub", EIGEN_COST(scalar_difference_op<float>)},
+                      {"TruncateDiv", EIGEN_COST(scalar_quotient_op<float>)},
+                      {"TruncateMod", EIGEN_COST(scalar_mod_op<float>)}};
+
+#undef EIGEN_COST
 
   // By default, use sum of memory_time and compute_time for execution_time.
   compute_memory_overlap_ = false;
@@ -411,28 +387,33 @@ Costs OpLevelCostEstimator::PredictCostOfAnUnknownOp(
 }
 
 Costs OpLevelCostEstimator::PredictOpCountBasedCost(
-    double operations, const OpInfo& op_features) const {
-  DeviceInfo device_perf = GetDeviceInfo(op_features.device());
-  if (device_perf.gigaops <= 0 || device_perf.gb_per_sec <= 0) {
-    VLOG(1) << "BAD DEVICE. Op:" << op_features.op()
-            << " device type:" << op_features.device().type()
-            << " device model:" << op_features.device().model();
-  }
+    double operations, const OpInfo& op_info) const {
+  bool unknown_shapes = false;
+  const double input_size = CalculateInputSize(op_info, &unknown_shapes);
+  const double output_size = CalculateOutputSize(op_info, &unknown_shapes);
+  const double total_io_bytes = input_size + output_size;
+  Costs costs = PredictOpCountBasedCost(operations, total_io_bytes, op_info);
+  costs.inaccurate = unknown_shapes;
+  costs.max_memory = output_size;
+  return costs;
+}
 
-  Costs::NanoSeconds compute_cost(std::ceil(operations / device_perf.gigaops));
-  VLOG(1) << "Op:" << op_features.op() << " GOps:" << operations / 1e9
-          << " Execution Time (ns):" << compute_cost.count();
+Costs OpLevelCostEstimator::PredictOpCountBasedCost(
+    double operations, double total_io_bytes, const OpInfo& op_info) const {
+  const DeviceInfo device_info = GetDeviceInfo(op_info.device());
+  if (device_info.gigaops <= 0 || device_info.gb_per_sec <= 0) {
+    VLOG(1) << "BAD DEVICE. Op:" << op_info.op()
+            << " device type:" << op_info.device().type()
+            << " device model:" << op_info.device().model();
+  }
 
-  bool found_unknown_shapes = false;
-  const double total_input_size =
-      CalculateInputSize(op_features, &found_unknown_shapes);
-  const double total_output_size =
-      CalculateOutputSize(op_features, &found_unknown_shapes);
-  const double total_io_size = total_input_size + total_output_size;
+  Costs::NanoSeconds compute_cost(std::ceil(operations / device_info.gigaops));
+  VLOG(1) << "Op:" << op_info.op() << " GOps:" << operations / 1e9
+          << " Compute Time (ns):" << compute_cost.count();
 
   Costs::NanoSeconds memory_cost(
-      std::ceil(total_io_size / device_perf.gb_per_sec));
-  VLOG(1) << "Op:" << op_features.op() << " Size (KB):" << (total_io_size) / 1e3
+      std::ceil(total_io_bytes / device_info.gb_per_sec));
+  VLOG(1) << "Op:" << op_info.op() << " Size (KB):" << (total_io_bytes) / 1e3
           << " Memory Time (ns):" << memory_cost.count();
 
   Costs costs;
@@ -443,8 +424,6 @@ Costs OpLevelCostEstimator::PredictOpCountBasedCost(
   } else {
     costs.execution_time = compute_cost + memory_cost;
   }
-  costs.inaccurate = found_unknown_shapes;
-  costs.max_memory = total_output_size;
   return costs;
 }
 
@@ -867,7 +846,7 @@ int64 OpLevelCostEstimator::CountConv2DBackpropFilterOperations(
 
 int64 OpLevelCostEstimator::CalculateTensorElementCount(
     const OpInfo::TensorProperties& tensor, bool* found_unknown_shapes) const {
-  VLOG(2) << "   with " << tensor.dtype() << " tensor of shape "
+  VLOG(2) << "   with " << DataTypeString(tensor.dtype()) << " tensor of shape "
           << tensor.shape().DebugString();
   int64 tensor_size = 1;
   int num_dims = std::max(1, tensor.shape().dim_size());
@@ -1028,5 +1007,39 @@ Costs OpLevelCostEstimator::PredictMetadata(const OpContext& op_context) const {
   return costs;
 }
 
+Costs OpLevelCostEstimator::PredictGatherOrSlice(
+    const OpContext& op_context) const {
+  // Gather & Slice ops can have a very large input, but only access a small
+  // part of it. For these ops, the output size determines the memory cost.
+  const auto& op_info = op_context.op_info;
+
+  bool unknown_shapes = false;
+
+  // Each output element is a copy of some element from the input.
+  // For the roofline estimate, we assume each copy has a unit cost.
+  const int64 op_count =
+      CalculateTensorElementCount(op_info.outputs(0), &unknown_shapes);
+
+  const double output_size = CalculateOutputSize(op_info, &unknown_shapes);
+  double input_size = output_size;
+  if (op_info.op() == "Slice") {
+    // Add the sizes of the 'begin' & 'size' tensors.
+    input_size +=
+        CalculateTensorElementCount(op_info.inputs(1), &unknown_shapes) +
+        CalculateTensorElementCount(op_info.inputs(2), &unknown_shapes);
+  } else {
+    // Assuming this is "Gather" or "GatherV2" op, add 'indices' size.
+    input_size +=
+        CalculateTensorElementCount(op_info.inputs(1), &unknown_shapes);
+  }
+
+  const double total_io = input_size + output_size;
+  Costs costs = PredictOpCountBasedCost(op_count, total_io, op_info);
+  costs.inaccurate = unknown_shapes;
+  costs.max_memory = output_size;
+
+  return costs;
+}
+
 }  // end namespace grappler
 }  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/costs/op_level_cost_estimator.h b/tensorflow/core/grappler/costs/op_level_cost_estimator.h
index 7bb530fe31a9f70d168ae16783fac7d487e5f12d..1b3babb2066afc2b794ff268929ebdd01ad61e89 100644
--- a/tensorflow/core/grappler/costs/op_level_cost_estimator.h
+++ b/tensorflow/core/grappler/costs/op_level_cost_estimator.h
@@ -51,10 +51,15 @@ class OpLevelCostEstimator {
   // Predict cost of an op for which no accurate estimator is defined.
   Costs PredictCostOfAnUnknownOp(const OpContext& op_context) const;
 
-  // Naive cost estimate based on operations divided by device ops/sec,
-  // and input/output tensor sizes.
-  Costs PredictOpCountBasedCost(double operations,
-                                const OpInfo& op_features) const;
+  // Naive cost estimate based on the given operations count and total
+  // input/output tensor sizes of the given op_info combined.
+  Costs PredictOpCountBasedCost(double operations, const OpInfo& op_info) const;
+
+  // Naive cost estimate based on the given operations count and the given total
+  // io size in bytes. Sizes of op_info inputs and outputs are not taken into
+  // consideration.
+  Costs PredictOpCountBasedCost(double operations, double total_io_bytes,
+                                const OpInfo& op_info) const;
 
   // This family of routines counts the number of operations to perform the
   // specified TensorFlow Op.
@@ -125,7 +130,7 @@ class OpLevelCostEstimator {
   // implementation just divides the operations to
   // perform the op (from the "Count" routines,
   // above) by the device peak operations per
-  // second. Override to supply a better estimate.
+  // second.
   // Implementation of costs other than
   // execution_time is optional, depending on the
   // device.
@@ -139,6 +144,7 @@ class OpLevelCostEstimator {
   Costs PredictVariable(const OpContext& op_context) const;
   Costs PredictBatchMatMul(const OpContext& op_context) const;
   Costs PredictMetadata(const OpContext& op_context) const;
+  Costs PredictGatherOrSlice(const OpContext& op_context) const;
 
   // Utility function for safe division. Returns 0
   // if rhs is 0 or negative.
diff --git a/tensorflow/core/grappler/costs/op_level_cost_estimator_test.cc b/tensorflow/core/grappler/costs/op_level_cost_estimator_test.cc
index 4790b9bab2c7d67e7a29d45aaf9f964c470c63df..f2a9615dfb7df328c983de4dd37f3d1f4ec7d704 100644
--- a/tensorflow/core/grappler/costs/op_level_cost_estimator_test.cc
+++ b/tensorflow/core/grappler/costs/op_level_cost_estimator_test.cc
@@ -55,28 +55,11 @@ OpContext DescribeMatMul(int m, int n, int l, int k) {
   return op_context;
 }
 
-// Returns an OpInfo for MatMul with unknown input shapes.
-OpContext DescribeMatMulUnknownShape() {
-  OpContext op_context;
-  SetCpuDevice(&op_context.op_info);
-  op_context.op_info.set_op("MatMul");
-
-  auto input = op_context.op_info.add_inputs();
-  auto shape = input->mutable_shape();
-  shape->set_unknown_rank(true);
-
-  input = op_context.op_info.add_inputs();
-  shape = input->mutable_shape();
-  shape->set_unknown_rank(true);
-
-  return op_context;
-}
-
 // Wrangles the minimum number of proto fields to set up an input of
 // arbitrary rank and type.
 void DescribeArbitraryRankInput(const std::vector<int>& dims, DataType dtype,
-                                OpInfo* op_features) {
-  auto input = op_features->add_inputs();
+                                OpInfo* op_info) {
+  auto input = op_info->add_inputs();
   input->set_dtype(dtype);
   auto shape = input->mutable_shape();
   for (auto d : dims) {
@@ -84,6 +67,18 @@ void DescribeArbitraryRankInput(const std::vector<int>& dims, DataType dtype,
   }
 }
 
+// Wrangles the minimum number of proto fields to set up an output of
+// arbitrary rank and type.
+void DescribeArbitraryRankOutput(const std::vector<int>& dims, DataType dtype,
+                                 OpInfo* op_info) {
+  auto output = op_info->add_outputs();
+  output->set_dtype(dtype);
+  auto shape = output->mutable_shape();
+  for (auto d : dims) {
+    shape->add_dim()->set_size(d);
+  }
+}
+
 // Returns an OpInfo for a BatchMatMul
 OpContext DescribeBatchMatMul(const std::vector<int>& dims_a,
                               const std::vector<int>& dims_b) {
@@ -200,6 +195,41 @@ class OpLevelCostEstimatorTest : public ::testing::Test {
   OpLevelCostEstimator estimator_;
 };
 
+TEST_F(OpLevelCostEstimatorTest, TestGatherCosts) {
+  OpContext op_context;
+  SetCpuDevice(&op_context.op_info);
+  op_context.op_info.set_op("Gather");
+
+  // Huge first input shouldn't affect Gather execution and memory costs.
+  DescribeArbitraryRankInput({10000000, 10}, DT_FLOAT, &op_context.op_info);
+  DescribeArbitraryRankInput({16}, DT_INT64, &op_context.op_info);
+  DescribeArbitraryRankOutput({16, 10}, DT_FLOAT, &op_context.op_info);
+
+  auto cost = estimator_.PredictCosts(op_context);
+  EXPECT_EQ(Costs::Duration(130), cost.memory_time);
+  EXPECT_EQ(Costs::Duration(16), cost.compute_time);
+  EXPECT_EQ(Costs::Duration(146), cost.execution_time);
+  EXPECT_FALSE(cost.inaccurate);
+}
+
+TEST_F(OpLevelCostEstimatorTest, TestSliceCosts) {
+  OpContext op_context;
+  SetCpuDevice(&op_context.op_info);
+  op_context.op_info.set_op("Slice");
+
+  // Huge first input shouldn't affect Slice execution and memory costs.
+  DescribeArbitraryRankInput({10000000, 10}, DT_FLOAT, &op_context.op_info);
+  DescribeArbitraryRankInput({2}, DT_INT64, &op_context.op_info);
+  DescribeArbitraryRankInput({2}, DT_INT64, &op_context.op_info);
+  DescribeArbitraryRankOutput({10, 10}, DT_FLOAT, &op_context.op_info);
+
+  auto cost = estimator_.PredictCosts(op_context);
+  EXPECT_EQ(Costs::Duration(81), cost.memory_time);
+  EXPECT_EQ(Costs::Duration(10), cost.compute_time);
+  EXPECT_EQ(Costs::Duration(91), cost.execution_time);
+  EXPECT_FALSE(cost.inaccurate);
+}
+
 TEST_F(OpLevelCostEstimatorTest, BiasAddExecutionTime) {
   auto cost = PredictCosts(DescribeBiasAdd(1000, 10));
   EXPECT_EQ(Costs::Duration(8400), cost.memory_time);
@@ -354,7 +384,7 @@ TEST_F(OpLevelCostEstimatorTest, GetTensorShapeProtoFromTensorProto) {
   TensorProto tensor_proto;
   TensorShapeProto tensor_shape_proto;
 
-  // Dimention larger than max value; should fail while converting to Tensor
+  // Dimension larger than max value; should fail while converting to Tensor
   // class.
   tensor_proto.mutable_tensor_shape()->add_dim()->set_size(255);
   EXPECT_FALSE(
diff --git a/tensorflow/core/grappler/costs/op_performance_data.proto b/tensorflow/core/grappler/costs/op_performance_data.proto
index 37f9ebd6a146c8c0089857c7a41ba863b4c2fb1f..5ef5fd927b8580b55539d98014ba1af4c079ae0d 100644
--- a/tensorflow/core/grappler/costs/op_performance_data.proto
+++ b/tensorflow/core/grappler/costs/op_performance_data.proto
@@ -24,6 +24,11 @@ import "tensorflow/core/framework/types.proto";
 import "tensorflow/core/framework/attr_value.proto";
 import "tensorflow/core/protobuf/device_properties.proto";
 
+// Description of the session when an op is run.
+message SessionInfo {
+  int64 intra_op_parallelism = 1;
+}
+
 // Description of an operation as well as the parameters expected to impact its
 // performance.
 message OpInfo {
@@ -46,6 +51,9 @@ message OpInfo {
 
   // Device on which the operation is run.
   DeviceProperties device = 4;
+
+  // Information about the session configs.
+  SessionInfo session_info = 6;
 }
 
 message NormalDistribution {
@@ -58,17 +66,13 @@ message LogNormalDistribution {
   double sigma = 2;
 }
 
-message SessionInfo {
-  int64 intra_op_parallelism = 1;
-}
-
 // Performance data for tensorflow operations
 message OpPerformance {
   // The op
   OpInfo op = 1;
 
   // Information about the session configs.
-  SessionInfo session_info = 12;
+  SessionInfo session_info = 12 [deprecated = true];
 
   // The node name (optional). Makes it easier to associate the performance data
   // with a specific graph node.
diff --git a/tensorflow/core/grappler/costs/utils.cc b/tensorflow/core/grappler/costs/utils.cc
index 602f69f12ea9d24ebd94da73a2a76d1992f3bfb1..076945d5c626b9609448e339fcbd96de3e9d137f 100644
--- a/tensorflow/core/grappler/costs/utils.cc
+++ b/tensorflow/core/grappler/costs/utils.cc
@@ -26,6 +26,8 @@ limitations under the License.
 #include "cuda/include/cudnn.h"
 #endif
 
+#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_id_manager.h"
 #include "tensorflow/core/framework/allocation_description.pb.h"
 #include "tensorflow/core/framework/attr_value.pb.h"
 #include "tensorflow/core/framework/op.h"
@@ -200,17 +202,25 @@ std::vector<OpInfo::TensorProperties> FindInputFeatures(
 }
 
 DeviceProperties GetDeviceInfo(const string& device_str) {
+  DeviceProperties unknown;
+  unknown.set_type("UNKNOWN");
+
   DeviceNameUtils::ParsedName parsed;
   if (DeviceNameUtils::ParseFullName(device_str, &parsed)) {
     if (parsed.type == "GPU") {
-      return GetLocalGPUInfo(parsed.id);
+      TfGpuId tf_gpu_id(parsed.id);
+      CudaGpuId cuda_gpu_id;
+      Status s = GpuIdManager::TfToCudaGpuId(tf_gpu_id, &cuda_gpu_id);
+      if (!s.ok()) {
+        LOG(ERROR) << s;
+        return unknown;
+      }
+      return GetLocalGPUInfo(cuda_gpu_id);
     } else if (parsed.type == "CPU") {
       return GetLocalCPUInfo();
     }
   }
-  DeviceProperties device;
-  device.set_type("UNKNOWN");
-  return device;
+  return unknown;
 }
 
 DeviceProperties GetDeviceInfo(const CostGraphDef::Node& node) {
diff --git a/tensorflow/core/grappler/grappler_item_builder.cc b/tensorflow/core/grappler/grappler_item_builder.cc
index 5ac52eefe1144e06f1e10f9c99dcef7591deb880..288587ce9b357d0056de428f5abc653cc4b91ea2 100644
--- a/tensorflow/core/grappler/grappler_item_builder.cc
+++ b/tensorflow/core/grappler/grappler_item_builder.cc
@@ -38,6 +38,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/op_types.h"
 #include "tensorflow/core/grappler/optimizers/model_pruner.h"
 #include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/lib/io/path.h"
 #include "tensorflow/core/platform/protobuf_internal.h"
 #include "tensorflow/core/protobuf/meta_graph.pb.h"
@@ -77,7 +78,7 @@ void InitializeTensor(DataType type, Tensor* tensor) {
 // correct optimizations.
 Status OptimizeGraph(const GraphDef& graph_def_arg, GraphDef* output_graph_def,
                      const ItemConfig& cfg) {
-  if (!cfg.apply_optimizations && !cfg.inline_functions) {
+  if (!cfg.apply_optimizations && !cfg.erase_noinline_attributes) {
     return Status::OK();
   }
 
@@ -87,7 +88,7 @@ Status OptimizeGraph(const GraphDef& graph_def_arg, GraphDef* output_graph_def,
   // Make a local copy of graph def, because we need to change some things.
   GraphDef graph_def(graph_def_arg);
 
-  if (cfg.inline_functions && cfg.erase_noinline_attributes) {
+  if (cfg.erase_noinline_attributes) {
     // TF optimizer doesn't inline functions with "_noinline" attribute,
     // so let's go over the function library and erase it.
     for (auto& func : *graph_def.mutable_library()->mutable_function()) {
@@ -112,7 +113,6 @@ Status OptimizeGraph(const GraphDef& graph_def_arg, GraphDef* output_graph_def,
   } else {
     optimizer_opts->set_opt_level(::tensorflow::OptimizerOptions_Level_L0);
   }
-  optimizer_opts->set_do_function_inlining(cfg.inline_functions);
 
   // Create the function library runtime.
   std::unique_ptr<ProcessFunctionLibraryRuntime> pflr(
@@ -138,7 +138,7 @@ Status OptimizeGraph(const GraphDef& graph_def_arg, GraphDef* output_graph_def,
   // The default values of attributes might have been stripped by the optimizer.
   // Add them back.
   return AddDefaultAttrsToGraphDef(output_graph_def, *graphptr->op_registry(),
-                                   0);
+                                   0, true);
 }
 
 // Applies the same graph pruning logic to the graph as Session.Run in TF.
@@ -152,6 +152,27 @@ Status PruneGraph(GrapplerItem* item) {
   return Status::OK();
 }
 
+// Replaces any unknown dimension in a shape with
+// cfg.placeholder_unknown_output_shape_dim if that value is non-negative.
+Status ReplaceUnknownShapeDim(const ItemConfig& cfg,
+                              const TensorShapeProto& shape_pb_in,
+                              TensorShapeProto* shape_pb_out,
+                              TensorShape* shape_out) {
+  std::vector<int32> dims;
+  for (const auto& dim_proto : shape_pb_in.dim()) {
+    if (cfg.placeholder_unknown_output_shape_dim >= 0 &&
+        dim_proto.size() == -1) {
+      dims.push_back(cfg.placeholder_unknown_output_shape_dim);
+      shape_pb_out->add_dim()->set_size(
+          cfg.placeholder_unknown_output_shape_dim);
+    } else {
+      dims.push_back(std::max<int32>(1, dim_proto.size()));
+      shape_pb_out->add_dim()->set_size(dim_proto.size());
+    }
+  }
+  return TensorShapeUtils::MakeShape(dims.data(), dims.size(), shape_out);
+}
+
 }  // namespace
 
 // static
@@ -168,12 +189,6 @@ std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
   // Fill in feed nodes from config, if any provided.
   for (const auto& feed_node : cfg.feed_nodes) {
     const string feed_name = NodeName(feed_node);
-    if (feed_name.empty()) {
-      LOG(ERROR) << "Invalid feed node name " << feed_node
-                 << ", skipping this input.";
-      return nullptr;
-    }
-    VLOG(1) << "Will use feed node " << feed_name;
     new_item->feed.emplace_back(feed_name, Tensor());
   }
 
@@ -182,17 +197,119 @@ std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
     const CollectionDef& nodes = meta_graph.collection_def().at("train_op");
     if (nodes.has_node_list()) {
       for (const auto& node : nodes.node_list().value()) {
-        const string name = NodeName(node);
-        if (name.empty()) {
-          LOG(ERROR) << "Invalid fetch node name " << node
-                     << ", skipping this input";
-          return nullptr;
+        new_item->fetch.push_back(NodeName(node));
+      }
+    }
+  }
+
+  // Detect feed and fetch nodes from signature defs. Signatures may share the
+  // same inputs or outputs.
+  std::unordered_set<string> signature_feed_nodes;
+  std::unordered_set<string> signature_fetch_nodes;
+  for (const auto& name_and_signature : meta_graph.signature_def()) {
+    for (const auto& name_and_input : name_and_signature.second.inputs()) {
+      const TensorInfo& input = name_and_input.second;
+      if (input.has_coo_sparse()) {
+        // Define the shapes following the comment of CooSparse.
+        // TODO(yuefengz): we probably want to use different dim values for the
+        // three tensors of a SparseTensor.
+        int64 dim = std::max(1, cfg.placeholder_unknown_output_shape_dim);
+        TensorShape shape_1d({dim});
+        TensorShape shape_2d({dim, dim});
+
+        if (gtl::InsertIfNotPresent(
+                &signature_feed_nodes,
+                NodeName(input.coo_sparse().values_tensor_name()))) {
+          Tensor value_tensor(input.dtype(), shape_1d);
+          InitializeTensor(input.dtype(), &value_tensor);
+          new_item->feed.emplace_back(
+              NodeName(input.coo_sparse().values_tensor_name()), value_tensor);
+        }
+        if (gtl::InsertIfNotPresent(
+                &signature_feed_nodes,
+                NodeName(input.coo_sparse().indices_tensor_name()))) {
+          Tensor indices_tensor(DT_INT64, shape_2d);
+          InitializeTensor(DT_INT64, &indices_tensor);
+          new_item->feed.emplace_back(
+              NodeName(input.coo_sparse().indices_tensor_name()),
+              indices_tensor);
+        }
+        if (gtl::InsertIfNotPresent(
+                &signature_feed_nodes,
+                NodeName(input.coo_sparse().dense_shape_tensor_name()))) {
+          Tensor dense_shape_tensor(DT_INT64, shape_1d);
+          InitializeTensor(DT_INT64, &dense_shape_tensor);
+          new_item->feed.emplace_back(
+              NodeName(input.coo_sparse().dense_shape_tensor_name()),
+              dense_shape_tensor);
+        }
+      } else {
+        if (gtl::InsertIfNotPresent(&signature_feed_nodes,
+                                    NodeName(input.name()))) {
+          TensorShape shape;
+          TensorShapeProto shape_proto;
+          Status s = ReplaceUnknownShapeDim(cfg, input.tensor_shape(),
+                                            &shape_proto, &shape);
+          if (!s.ok()) {
+            LOG(ERROR) << "Invalid shape for signature input " << input.name()
+                       << ": " << s << ", skipping this input";
+            return nullptr;
+          }
+
+          Tensor fake_input(input.dtype(), shape);
+          InitializeTensor(input.dtype(), &fake_input);
+          new_item->feed.emplace_back(NodeName(input.name()), fake_input);
         }
-        VLOG(1) << "Will use fetch node " << name;
-        new_item->fetch.push_back(name);
       }
     }
+    for (const auto& name_and_output : name_and_signature.second.outputs()) {
+      const TensorInfo& output = name_and_output.second;
+      if (output.has_coo_sparse()) {
+        if (gtl::InsertIfNotPresent(
+                &signature_fetch_nodes,
+                NodeName(output.coo_sparse().values_tensor_name()))) {
+          new_item->fetch.push_back(
+              NodeName(output.coo_sparse().values_tensor_name()));
+        }
+        if (gtl::InsertIfNotPresent(
+                &signature_fetch_nodes,
+                NodeName(output.coo_sparse().indices_tensor_name()))) {
+          new_item->fetch.push_back(
+              NodeName(output.coo_sparse().indices_tensor_name()));
+        }
+        if (gtl::InsertIfNotPresent(
+                &signature_fetch_nodes,
+                NodeName(output.coo_sparse().dense_shape_tensor_name()))) {
+          new_item->fetch.push_back(
+              NodeName(output.coo_sparse().dense_shape_tensor_name()));
+        }
+      } else {
+        if (gtl::InsertIfNotPresent(&signature_fetch_nodes,
+                                    NodeName(output.name()))) {
+          new_item->fetch.push_back(NodeName(output.name()));
+        }
+      }
+    }
+  }
+
+  for (const auto& feed : new_item->feed) {
+    if (feed.first.empty()) {
+      LOG(ERROR) << "Invalid feed node name, skipping this input";
+      return nullptr;
+    } else {
+      VLOG(1) << "Will use feed node " << feed.first;
+    }
   }
+
+  for (const auto& fetch : new_item->fetch) {
+    if (fetch.empty()) {
+      LOG(ERROR) << "Invalid fetch node name, skipping this input";
+      return nullptr;
+    } else {
+      VLOG(1) << "Will use fetch node " << fetch;
+    }
+  }
+
   if (new_item->fetch.empty()) {
     LOG(ERROR) << "Failed to detect the fetch node(s), skipping this input";
     return nullptr;
@@ -325,20 +442,8 @@ std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
       // shape is not empty if the shape is partially defined.
       TensorShape shape;
       TensorShapeProto shape_proto;
-      std::vector<int32> dims;
-      for (const auto& dim_proto : node.attr().at("shape").shape().dim()) {
-        if (cfg.placeholder_unknown_output_shape_dim >= 0 &&
-            dim_proto.size() == -1) {
-          dims.push_back(cfg.placeholder_unknown_output_shape_dim);
-          shape_proto.add_dim()->set_size(
-              cfg.placeholder_unknown_output_shape_dim);
-        } else {
-          dims.push_back(std::max<int32>(1, dim_proto.size()));
-          shape_proto.add_dim()->set_size(dim_proto.size());
-        }
-      }
-      Status make_shape_status =
-          TensorShapeUtils::MakeShape(dims.data(), dims.size(), &shape);
+      Status make_shape_status = ReplaceUnknownShapeDim(
+          cfg, node.attr().at("shape").shape(), &shape_proto, &shape);
       if (!make_shape_status.ok()) {
         LOG(ERROR) << "Invalid shape for placeholder " << node.name() << ": "
                    << make_shape_status << ", skipping this input";
@@ -378,7 +483,9 @@ std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
 
       if (cfg.feed_nodes.empty()) {
         // No specific feed nodes were given. Assume all placeholders are fed.
-        new_item->feed.emplace_back(node.name(), fake_input);
+        if (signature_feed_nodes.count(node.name()) == 0) {
+          new_item->feed.emplace_back(node.name(), fake_input);
+        }
       } else if (cfg.feed_nodes.count(node.name()) > 0) {
         // If specific feed nodes were given, only update their tensors.
         auto it = find_if(new_item->feed.begin(), new_item->feed.end(),
@@ -462,7 +569,7 @@ std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
       &new_item->graph,
       FunctionLibraryDefinition(OpRegistry::Global(),
                                 new_item->graph.library()),
-      0);
+      0, true);
   if (!attr_status.ok()) {
     LOG(ERROR) << "Failed to instantiate default attribute values: "
                << attr_status.error_message();
@@ -518,113 +625,5 @@ std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
   return new_item;
 }
 
-std::unique_ptr<GrapplerItem> GrapplerItemFromFunctionDef(
-    const FunctionDef& func,
-    const std::unordered_map<string, AttrValue>& func_attr,
-    const FunctionDefLibrary& library) {
-  if (func.signature().name().empty()) {
-    LOG(ERROR) << "function name must be specified.";
-    return nullptr;
-  }
-  std::unique_ptr<GrapplerItem> new_item(new GrapplerItem());
-  new_item->id = func.signature().name();
-
-  std::unordered_map<string, string> port_map;
-
-  // Add the function inputs as placeholder
-  for (const auto& inp : func.signature().input_arg()) {
-    NodeDef* ph = new_item->graph.add_node();
-    ph->set_name(inp.name());
-    ph->set_op("Placeholder");
-    if (inp.type() != DT_INVALID) {
-      (*ph->mutable_attr())["T"].set_type(inp.type());
-    } else {
-      auto it = func_attr.find(inp.type_attr());
-      if (it == func_attr.end()) {
-        LOG(ERROR) << "Unknown type attribute " << inp.type_attr()
-                   << " for function input " << inp.name();
-        return nullptr;
-      } else {
-        (*ph->mutable_attr())["T"] = it->second;
-      }
-    }
-    port_map[inp.name()] = inp.name();
-  }
-
-  // Add the function body to the graph.
-  FunctionLibraryDefinition func_def(OpRegistry::Global(), library);
-
-  for (const NodeDef& node : func.node_def()) {
-    NodeDef* new_node = new_item->graph.add_node();
-    *new_node = node;
-    // Replace the placeholder attribute values with the specified value.
-    for (auto& attr : *new_node->mutable_attr()) {
-      const string& ph_name = attr.second.placeholder();
-      auto it = func_attr.find(ph_name);
-      if (it != func_attr.end()) {
-        attr.second = it->second;
-      }
-    }
-
-    // Functions use a custom format to encode connectivity. Map these custom
-    // strings to regular ones.
-    const OpRegistrationData* registration;
-    Status status = func_def.LookUp(node.op(), &registration);
-    if (!status.ok()) {
-      LOG(ERROR) << "Op " << node.op() << " not registered: " << status;
-      return nullptr;
-    }
-
-    tensorflow::NameRangeMap inputs;
-    tensorflow::NameRangeMap outputs;
-    status = tensorflow::NameRangesForNode(node, registration->op_def, &inputs,
-                                           &outputs);
-    if (!status.ok()) {
-      LOG(ERROR) << "Op " << node.op() << " invalid: " << status;
-      return nullptr;
-    }
-    for (const auto& name_range : outputs) {
-      string port_prefix =
-          strings::StrCat(node.name(), ":", name_range.first, ":");
-      int index_start = name_range.second.first;
-      int index_end = name_range.second.second;
-      for (int i = index_start; i < index_end; ++i) {
-        string port_id = strings::StrCat(port_prefix, i - index_start);
-        string port_name = strings::StrCat(node.name(), ":", i);
-        port_map[port_id] = port_name;
-      }
-    }
-  }
-
-  for (auto& node : *new_item->graph.mutable_node()) {
-    // Rewrite the inputs to use the normal naming convention.
-    for (int i = 0; i < node.input_size(); ++i) {
-      const string& input = node.input(i);
-      if (IsControlInput(input)) {
-        // No need to remap control dependencies.
-        continue;
-      } else {
-        auto it = port_map.find(input);
-        if (it == port_map.end()) {
-          LOG(ERROR) << "Unknown input: " << input;
-          return nullptr;
-        }
-        node.set_input(i, it->second);
-      }
-    }
-  }
-
-  // Add the function outputs to the list of fetch nodes.
-  for (const auto& out : func.signature().output_arg()) {
-    new_item->fetch.emplace_back(out.name());
-  }
-  // Add the function inputs to the list of feeds.
-  for (const auto& inp : func.signature().input_arg()) {
-    new_item->feed.emplace_back(inp.name(), Tensor());
-  }
-
-  return new_item;
-}
-
 }  // end namespace grappler
 }  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/grappler_item_builder.h b/tensorflow/core/grappler/grappler_item_builder.h
index e892a3f556f7e9ccba91d5ce672a12d2eac49f5a..6d181e49e67acaae116c5f5af9365dba238994e8 100644
--- a/tensorflow/core/grappler/grappler_item_builder.h
+++ b/tensorflow/core/grappler/grappler_item_builder.h
@@ -40,8 +40,6 @@ struct ItemConfig {
   int placeholder_unknown_output_shape_dim = -1;
   // If true, does L1 optimizations.
   bool apply_optimizations = false;
-  // If true, does inlining.
-  bool inline_functions = false;
   // If true, erases all "_noinline" attributes from user-defined functions.
   // Has no effect if "inline_functions" is disabled.
   bool erase_noinline_attributes = false;
@@ -58,13 +56,6 @@ struct ItemConfig {
 std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
     const string& id, const MetaGraphDef& meta_graph, const ItemConfig& cfg);
 
-// Factory method for creating a GrapplerItem from a FunctionDef.
-// Returns nullptr if the given function def cannot be converted.
-std::unique_ptr<GrapplerItem> GrapplerItemFromFunctionDef(
-    const FunctionDef& func,
-    const std::unordered_map<string, AttrValue>& func_attr,
-    const FunctionDefLibrary& library);
-
 }  // end namespace grappler
 }  // end namespace tensorflow
 
diff --git a/tensorflow/core/grappler/grappler_item_builder_test.cc b/tensorflow/core/grappler/grappler_item_builder_test.cc
index 68437b60419f73419bca4467b409818bc0b11650..4b90bf3038df2900315aa32e32a6635d834e4403 100644
--- a/tensorflow/core/grappler/grappler_item_builder_test.cc
+++ b/tensorflow/core/grappler/grappler_item_builder_test.cc
@@ -35,96 +35,6 @@ namespace {
 
 class GrapplerItemBuilderTest : public ::testing::Test {};
 
-// Create a sample graph with a symbolic gradient for sum.
-void SampleSumSymbolicGradientGraphdef(
-    GraphDef *def, CollectionDef *fetches,
-    std::vector<string> *names_of_ops_of_inline) {
-  using namespace ::tensorflow::ops;  // NOLINT(build/namespaces)
-
-  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
-
-  auto dummy_variable = Variable(scope, {2, 2}, DT_FLOAT);
-  auto x = Const(scope, 1.0f);
-  auto y = Const(scope, 2);
-  auto z = Const(scope, 3.0f);
-  TF_ASSERT_OK(scope.status());
-
-  NameAttrList fn;
-  fn.set_name("Sum");
-  (*fn.mutable_attr())["T"].set_type(DT_FLOAT);
-  auto g0 = SymbolicGradient(scope, std::initializer_list<Input>{x, y, z},
-                             {DT_FLOAT, DT_INT32}, fn);
-
-  // TODO(bsteiner): we should rewrite the feed/fetch nodes to reflect the
-  // inlining that's done in the item builder
-  // fetches->mutable_node_list()->add_value(g0[0].name());
-  fetches->mutable_node_list()->add_value("SymbolicGradient/dx");
-  fetches->mutable_node_list()->add_value("SymbolicGradient/dy_reshaped");
-
-  TF_CHECK_OK(scope.ToGraphDef(def));
-
-  // Add names of the ops that replace the Mul symbolic gradient during
-  // inlining. This is for validation.
-  *names_of_ops_of_inline = {
-      "SymbolicGradient/dx",          "SymbolicGradient/tile_scaling",
-      "SymbolicGradient/dy_reshaped", "SymbolicGradient/y_shape",
-      "SymbolicGradient/x_shape",     "SymbolicGradient/stitch_idx0",
-      "SymbolicGradient/x_rank",      "SymbolicGradient/stitch_val1",
-      "SymbolicGradient/i_shape",     "SymbolicGradient/di",
-      "SymbolicGradient/zero",        "SymbolicGradient/one"};
-}
-
-std::unique_ptr<GrapplerItem> CreateGrapplerItem(const GraphDef &def,
-                                                 const CollectionDef &fetches) {
-  MetaGraphDef meta_def;
-  ItemConfig cfg;
-  cfg.inline_functions = true;
-  *meta_def.mutable_graph_def() = def;
-  (*meta_def.mutable_collection_def())["train_op"] = fetches;
-  return GrapplerItemFromMetaGraphDef("0", meta_def, cfg);
-}
-
-int CountSymbolicGradientOps(const std::unique_ptr<GrapplerItem> &item) {
-  int n_symb_grads = 0;
-  for (const auto &node : item->graph.node()) {
-    if (node.op() == FunctionLibraryDefinition::kGradientOp) {
-      n_symb_grads++;
-    }
-  }
-  return n_symb_grads;
-}
-
-int CountOpsWithNames(const std::unique_ptr<GrapplerItem> &item,
-                      const std::vector<string> &names) {
-  std::set<string> names_set(names.begin(), names.end());
-  int n_with_names = 0;
-  for (const auto &node : item->graph.node()) {
-    if (names_set.find(node.name()) != names_set.end()) {
-      n_with_names++;
-    }
-  }
-  return n_with_names;
-}
-
-TEST_F(GrapplerItemBuilderTest, SymbolicGradientInlining) {
-  // Create sample sum symbolic gradient graph.
-  GraphDef def;
-  CollectionDef fetches;
-  std::vector<string> ops_of_inline;
-  SampleSumSymbolicGradientGraphdef(&def, &fetches, &ops_of_inline);
-
-  // Create the inlined graph.
-  std::unique_ptr<GrapplerItem> with_inline = CreateGrapplerItem(def, fetches);
-
-  // For the inlined graph, there should be 0 symbolic gradient ops.
-  EXPECT_EQ(0, CountSymbolicGradientOps(with_inline));
-
-  // For the inlined graph, make sure all the required expanded op’s are in the
-  // graph.
-  EXPECT_EQ(ops_of_inline.size(),
-            CountOpsWithNames(with_inline, ops_of_inline));
-}
-
 TEST_F(GrapplerItemBuilderTest, AssetFilepathOverrideTest) {
   MetaGraphDef meta_graph;
 
@@ -273,210 +183,134 @@ TEST_F(GrapplerItemBuilderTest, GraphWithFunctions) {
   (*meta_graph.mutable_collection_def())["train_op"] = train_op;
 
   ItemConfig cfg;
-  cfg.inline_functions = false;
 
   std::unique_ptr<GrapplerItem> item =
       GrapplerItemFromMetaGraphDef("0", meta_graph, cfg);
   ASSERT_TRUE(item != nullptr);
 }
 
-TEST_F(GrapplerItemBuilderTest, FromSimpleFunctionDef) {
-  const Tensor kTwo = test::AsScalar<int64>(2);
-  FunctionDef func = FunctionDefHelper::Define(
-      // Name
-      "XTimesTwo",
-      // Args
-      {"x: T"},
-      // Return values
-      {"y: T"},
-      // Attr def
-      {"T: {float, double, int32, int64}"},
-      // Nodes
-      {
-          {{"two"}, "Const", {}, {{"value", kTwo}, {"dtype", DT_INT64}}},
-          {{"scale"}, "Cast", {"two"}, {{"SrcT", DT_INT64}, {"DstT", "$T"}}},
-          {{"y"}, "Mul", {"x", "scale"}, {{"T", "$T"}}},
-      });
+TEST_F(GrapplerItemBuilderTest, GraphWithCustomOps) {
+  MetaGraphDef meta_graph;
+  // y = CustomOp(x)
+  constexpr char device[] = "/cpu:0";
+  *meta_graph.mutable_graph_def() = test::function::GDef(
+      {test::function::NDef("x", "Const", {}, {{"dtype", DT_FLOAT}}, device),
+       test::function::NDef("y", "CustomOp", {"x"}, {{"T", DT_FLOAT}}, device)},
+      {});
 
-  std::unordered_map<string, AttrValue> func_attr;
-  func_attr["T"].set_type(DT_FLOAT);
-  FunctionDefLibrary library;
-  std::unique_ptr<GrapplerItem> item =
-      GrapplerItemFromFunctionDef(func, func_attr, library);
-  CHECK(item);
-  EXPECT_EQ("XTimesTwo", item->id);
-  EXPECT_EQ(4, item->graph.node_size());
-  EXPECT_EQ(std::vector<string>({"y"}), item->fetch);
-  EXPECT_EQ(1, item->feed.size());
-  EXPECT_EQ("x", item->feed[0].first);
+  CollectionDef train_op;
+  train_op.mutable_node_list()->add_value("y");
+  (*meta_graph.mutable_collection_def())["train_op"] = train_op;
 
-  for (const NodeDef &node : item->graph.node()) {
-    if (node.name() == "x") {
-      EXPECT_EQ("Placeholder", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
-      EXPECT_EQ(0, node.input_size());
-    } else if (node.name() == "two") {
-      EXPECT_EQ("Const", node.op());
-      EXPECT_EQ(0, node.input_size());
-    } else if (node.name() == "scale") {
-      EXPECT_EQ("Cast", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("DstT").type());
-      EXPECT_EQ(1, node.input_size());
-      EXPECT_EQ("two:0", node.input(0));
-    } else if (node.name() == "y") {
-      EXPECT_EQ("Mul", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("x", node.input(0));
-      EXPECT_EQ("scale:0", node.input(1));
-    }
-  }
+  ItemConfig cfg;
+
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromMetaGraphDef("0", meta_graph, cfg);
+  ASSERT_TRUE(item != nullptr);
 }
 
-TEST_F(GrapplerItemBuilderTest, FromFunctionDefWithMultiOutputNodes) {
-  // Gradient graph for the Subtract operation
-  std::vector<FunctionDefHelper::Node> nodes = {
-      {{"sx"}, "Shape", {"x"}},
-      {{"sy"}, "Shape", {"y"}},
-      {{"gx"}, "Identity", {"dz"}},
-      {{"gy"}, "Neg", {"dz"}},
-      {{"rx", "ry"}, "BroadcastGradientArgs", {"sx", "sy"}},
-      {{"sum_gx"}, "Sum", {"gx", "rx"}},
-      {{"dx"}, "Reshape", {"sum_gx", "sx"}},
-      {{"sum_gy"}, "Sum", {"gy", "ry"}},
-      {{"dy"}, "Reshape", {"sum_gy", "sy"}},
-  };
-
-  for (auto &n : nodes) {
-    // "BroadcastGradientArgs" doesn't need any attrs.
-    if (n.attr.empty() && n.op != "BroadcastGradientArgs") {
-      n.attr = {{"T", "$T"}};
-    }
-  }
-  FunctionDef func = FunctionDefHelper::Define(
-      // Name
-      "SubGrad",
-      // Arg defs
-      {"x: T", "y: T", "dz: T"},
-      // Ret val defs
-      {"dx: T", "dy: T"},
-      // Attr defs
-      {{"T: {half, float, double}"}},
-      // Nodes
-      nodes);
-
-  std::unordered_map<string, AttrValue> func_attr;
-  func_attr["T"].set_type(DT_FLOAT);
-  FunctionDefLibrary library;
+TEST_F(GrapplerItemBuilderTest, FromGraphWithSignatureDef) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  auto x = ops::Const(s.WithOpName("x"), 0);
+  auto y = ops::Const(s.WithOpName("y"), 1);
+  auto z = ops::Add(s.WithOpName("z"), x, y);
+
+  MetaGraphDef meta_graph;
+  TF_CHECK_OK(s.ToGraphDef(meta_graph.mutable_graph_def()));
+
+  TensorInfo input, output;
+  input.set_name("x");
+  input.set_dtype(DT_FLOAT);
+  output.set_name("z");
+  SignatureDef serving_signature;
+  (*serving_signature.mutable_inputs())["input"] = input;
+  (*serving_signature.mutable_outputs())["output"] = output;
+  (*meta_graph.mutable_signature_def())["serving"] = serving_signature;
+
+  // It should be able to dedup the input and output with the same names.
+  TensorInfo input2, output2;
+  input2.set_name("x");
+  input2.set_dtype(DT_FLOAT);
+  output2.set_name("z");
+  SignatureDef serving_signature2;
+  (*serving_signature2.mutable_inputs())["input2"] = input2;
+  (*serving_signature2.mutable_outputs())["output2"] = output2;
+  (*meta_graph.mutable_signature_def())["serving2"] = serving_signature2;
+
   std::unique_ptr<GrapplerItem> item =
-      GrapplerItemFromFunctionDef(func, func_attr, library);
-  CHECK(item);
-  EXPECT_EQ("SubGrad", item->id);
-  EXPECT_EQ(12, item->graph.node_size());
-  EXPECT_EQ(std::vector<string>({"dx", "dy"}), item->fetch);
-  EXPECT_EQ(3, item->feed.size());
-  EXPECT_EQ("x", item->feed[0].first);
-  EXPECT_EQ("y", item->feed[1].first);
-  EXPECT_EQ("dz", item->feed[2].first);
+      GrapplerItemFromMetaGraphDef("0", meta_graph, ItemConfig());
+  ASSERT_TRUE(item != nullptr);
 
-  for (const NodeDef &node : item->graph.node()) {
-    if (node.name() == "x" || node.name() == "y" || node.name() == "dz") {
-      EXPECT_EQ("Placeholder", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
-      EXPECT_EQ(0, node.input_size());
-    } else if (node.name() == "rx") {
-      EXPECT_EQ("BroadcastGradientArgs", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("sx:0", node.input(0));
-      EXPECT_EQ("sy:0", node.input(1));
-    } else if (node.name() == "sum_gx") {
-      EXPECT_EQ("Sum", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("gx:0", node.input(0));
-      EXPECT_EQ("rx:0", node.input(1));
-    } else if (node.name() == "sum_gy") {
-      EXPECT_EQ("Sum", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("gy:0", node.input(0));
-      EXPECT_EQ("rx:1", node.input(1));
-    }
-  }
+  EXPECT_EQ(item->feed.size(), 1);
+  EXPECT_EQ(item->fetch.size(), 1);
+  EXPECT_EQ(item->feed[0].first, "x");
+  EXPECT_EQ(item->fetch[0], "z");
 }
 
-TEST_F(GrapplerItemBuilderTest, FromFunctionDefWithNestedFuncs) {
-  FunctionDefLibrary library;
-  *library.add_function() = FunctionDefHelper::Define(
-      // Name
-      "Swap",
-      // Args
-      {"i0: T", "i1: T"},
-      // Return values
-      {"o0: T", "o1: T"},
-      // Attr def
-      {"T: {float, double}"},
-      // Nodes
-      {{{"o0"}, "Identity", {"i1"}, {{"T", "$T"}}},
-       {{"o1"}, "Identity", {"i0"}, {{"T", "$T"}}}});
-
-  FunctionDef func = FunctionDefHelper::Create(
-      // Name
-      "ManySwapsFirst",
-      // Args
-      {"x: float", "y: float"},
-      // Return values
-      {"o: float"},
-      // attr def
-      {},
-      // Nodes
-      // o = x*x + y*y.  Furthermore, The 1st swap depends on x2, and
-      // y2 depends on the 2nd swap.  The 2nd swap has data dependency
-      // on the 1st swap.
-      {{{"a0"}, "Swap", {"x", "y"}, {{"T", DT_FLOAT}}, {"x2"}},
-       {{"a1"}, "Swap", {"a0:o0:0", "a0:o1:0"}, {{"T", DT_FLOAT}}},
-       {{"x2"}, "Mul", {"x", "x"}, {{"T", DT_FLOAT}}},
-       {{"y2"}, "Mul", {"y", "y"}, {{"T", DT_FLOAT}}, {"a1"}},
-       {{"o"}, "Add", {"x2:z:0", "y2:z:0"}, {{"T", DT_FLOAT}}}},
-      {{"o", "o:z:0"}});
-
-  std::unordered_map<string, AttrValue> func_attr;
-  func_attr["T"].set_type(DT_FLOAT);
+TEST_F(GrapplerItemBuilderTest, FromGraphWithIncompleteSignatureDef) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  auto x = ops::Const(s.WithOpName("x"), 0);
+  auto y = ops::Const(s.WithOpName("y"), 1);
+
+  MetaGraphDef meta_graph;
+  TF_CHECK_OK(s.ToGraphDef(meta_graph.mutable_graph_def()));
+
+  CollectionDef train_op;
+  train_op.mutable_node_list()->add_value("y");
+  (*meta_graph.mutable_collection_def())["train_op"] = train_op;
+
+  TensorInfo input, output;
+  input.set_name("x");
+  input.set_dtype(DT_FLOAT);
+  // Its coo_sparse proto is incomplete.
+  output.mutable_coo_sparse()->set_values_tensor_name("z");
+  SignatureDef serving_signature;
+  (*serving_signature.mutable_inputs())["input"] = input;
+  (*serving_signature.mutable_outputs())["output"] = output;
+  (*meta_graph.mutable_signature_def())["serving"] = serving_signature;
+
   std::unique_ptr<GrapplerItem> item =
-      GrapplerItemFromFunctionDef(func, func_attr, library);
+      GrapplerItemFromMetaGraphDef("0", meta_graph, ItemConfig());
+  ASSERT_TRUE(item == nullptr);
+}
 
-  for (const NodeDef &node : item->graph.node()) {
-    if (node.name() == "x" || node.name() == "y") {
-      EXPECT_EQ("Placeholder", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
-      EXPECT_EQ(0, node.input_size());
-    } else if (node.name() == "a0") {
-      EXPECT_EQ("Swap", node.op());
-      EXPECT_EQ(3, node.input_size());
-      EXPECT_EQ("x", node.input(0));
-      EXPECT_EQ("y", node.input(1));
-      EXPECT_EQ("^x2", node.input(2));
-    } else if (node.name() == "a1") {
-      EXPECT_EQ("Swap", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("a0:0", node.input(0));
-      EXPECT_EQ("a0:1", node.input(1));
-    } else if (node.name() == "x2") {
-      EXPECT_EQ("Mul", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("x", node.input(0));
-      EXPECT_EQ("x", node.input(1));
-    } else if (node.name() == "y2") {
-      EXPECT_EQ("Mul", node.op());
-      EXPECT_EQ(3, node.input_size());
-      EXPECT_EQ("y", node.input(0));
-      EXPECT_EQ("y", node.input(1));
-      EXPECT_EQ("^a1", node.input(2));
-    } else if (node.name() == "o") {
-      EXPECT_EQ("Add", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("x2:0", node.input(0));
-      EXPECT_EQ("y2:0", node.input(1));
-    }
-  }
+TEST_F(GrapplerItemBuilderTest, FromGraphWithUnknownDimInSignatureInput) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  auto shape_1d = PartialTensorShape({-1});
+  auto x = ops::Placeholder(s.WithOpName("x"), DT_FLOAT,
+                            ops::Placeholder::Shape(shape_1d));
+  auto y = ops::Const(s.WithOpName("y"), static_cast<float>(1.0));
+  auto z = ops::Add(s.WithOpName("z"), x, y);
+
+  MetaGraphDef meta_graph;
+  TF_CHECK_OK(s.ToGraphDef(meta_graph.mutable_graph_def()));
+
+  TensorInfo input, output;
+  input.set_name("x");
+  input.set_dtype(DT_FLOAT);
+  shape_1d.AsProto(input.mutable_tensor_shape());
+  output.set_name("z");
+
+  SignatureDef serving_signature;
+  (*serving_signature.mutable_inputs())["input"] = input;
+  (*serving_signature.mutable_outputs())["output"] = output;
+  (*meta_graph.mutable_signature_def())["serving"] = serving_signature;
+
+  ItemConfig cfg;
+  cfg.placeholder_unknown_output_shape_dim = 64;
+  std::unique_ptr<GrapplerItem> item1 =
+      GrapplerItemFromMetaGraphDef("0", meta_graph, cfg);
+  ASSERT_TRUE(item1 != nullptr);
+
+  ASSERT_EQ(item1->feed.size(), 1);
+  EXPECT_EQ(item1->feed[0].second.NumElements(), 64);
+
+  std::unique_ptr<GrapplerItem> item2 =
+      GrapplerItemFromMetaGraphDef("0", meta_graph, ItemConfig());
+  ASSERT_TRUE(item2 != nullptr);
+
+  ASSERT_EQ(item2->feed.size(), 1);
+  EXPECT_EQ(item2->feed[0].second.NumElements(), 1);
 }
 
 }  // namespace
diff --git a/tensorflow/core/grappler/op_types.cc b/tensorflow/core/grappler/op_types.cc
index e225e99a9e43a5776d2a6315767f2b9ddb6bcd53..259168bb33e70a0a8ea58bd3df2d55a5e18d45a8 100644
--- a/tensorflow/core/grappler/op_types.cc
+++ b/tensorflow/core/grappler/op_types.cc
@@ -72,12 +72,20 @@ bool IsComplex(const NodeDef& node) { return node.op() == "Complex"; }
 
 bool IsComplexAbs(const NodeDef& node) { return node.op() == "ComplexAbs"; }
 
+bool IsConcat(const NodeDef& node) {
+  return node.op() == "Concat" || node.op() == "ConcatV2";
+}
+
 bool IsConcatOffset(const NodeDef& node) { return node.op() == "ConcatOffset"; }
 
 bool IsConstant(const NodeDef& node) { return node.op() == "Const"; }
 
 bool IsConj(const NodeDef& node) { return node.op() == "Conj"; }
 
+bool IsConjugateTranspose(const NodeDef& node) {
+  return node.op() == "ConjugateTranspose";
+}
+
 bool IsConv2D(const NodeDef& node) { return node.op() == "Conv2D"; }
 
 bool IsConv2DBackpropFilter(const NodeDef& node) {
@@ -144,6 +152,9 @@ bool IsHistogramSummary(const NodeDef& node) {
 
 bool IsIdentity(const NodeDef& node) {
   const auto& op = node.op();
+  if (op == "IdentityN" && node.attr().at("T").list().type_size() == 1) {
+    return true;
+  }
   return op == "Identity" || op == "RefIdentity";
 }
 
@@ -201,6 +212,8 @@ bool IsMod(const NodeDef& node) { return node.op() == "Mod"; }
 
 bool IsMul(const NodeDef& node) { return node.op() == "Mul"; }
 
+bool IsNeg(const NodeDef& node) { return node.op() == "Neg"; }
+
 bool IsNoOp(const NodeDef& node) { return node.op() == "NoOp"; }
 
 bool IsNotEqual(const NodeDef& node) { return node.op() == "NotEqual"; }
@@ -210,6 +223,8 @@ bool IsNextIteration(const NodeDef& node) {
   return op == "NextIteration" || op == "RefNextIteration";
 }
 
+bool IsPack(const NodeDef& node) { return node.op() == "Pack"; }
+
 bool IsPad(const NodeDef& node) {
   const auto& op = node.op();
   return op == "Pad" || op == "PadV2";
@@ -300,6 +315,19 @@ bool IsSquaredDifference(const NodeDef& node) {
 
 bool IsSqueeze(const NodeDef& node) { return node.op() == "Squeeze"; }
 
+bool IsStackOp(const NodeDef& node) {
+  return node.op() == "Stack" || node.op() == "StackV2";
+}
+bool IsStackCloseOp(const NodeDef& node) {
+  return node.op() == "StackClose" || node.op() == "StackCloseV2";
+}
+bool IsStackPushOp(const NodeDef& node) {
+  return node.op() == "StackPush" || node.op() == "StackPushV2";
+}
+bool IsStackPopOp(const NodeDef& node) {
+  return node.op() == "StackPop" || node.op() == "StackPopV2";
+}
+
 bool IsStopGradient(const NodeDef& node) {
   const auto& op = node.op();
   return op == "StopGradient" || op == "PreventGradient";
@@ -354,7 +382,8 @@ bool IsFreeOfSideEffect(const NodeDef& node) {
     return false;
   }
   const OpDef* op_def = nullptr;
-  Status status = OpRegistry::Global()->LookUpOpDef(node.op(), &op_def);
+  const string& op_name = node.op();
+  Status status = OpRegistry::Global()->LookUpOpDef(op_name, &op_def);
   if (!status.ok()) {
     return false;
   }
@@ -368,7 +397,8 @@ bool IsFreeOfSideEffect(const NodeDef& node) {
     }
   }
   // Some nodes do in-place updates on regular tensor inputs.
-  if (GetBoolAttr(node, "in_place") || GetBoolAttr(node, "inplace")) {
+  if (GetBoolAttr(node, "in_place") || GetBoolAttr(node, "inplace") ||
+      StringPiece(op_name).starts_with("Inplace")) {
     return false;
   }
   return true;
diff --git a/tensorflow/core/grappler/op_types.h b/tensorflow/core/grappler/op_types.h
index 1fa43a9b66b93c4dae1c30943b8466043af327ec..49e01f68e3747e941688b5a7084a34258320c3a0 100644
--- a/tensorflow/core/grappler/op_types.h
+++ b/tensorflow/core/grappler/op_types.h
@@ -40,6 +40,8 @@ bool IsCast(const NodeDef& node);
 bool IsComplex(const NodeDef& node);
 bool IsComplexAbs(const NodeDef& node);
 bool IsConj(const NodeDef& node);
+bool IsConjugateTranspose(const NodeDef& node);
+bool IsConcat(const NodeDef& node);
 bool IsConcatOffset(const NodeDef& node);
 bool IsConstant(const NodeDef& node);
 bool IsConv2D(const NodeDef& node);
@@ -84,7 +86,10 @@ bool IsMod(const NodeDef& node);
 bool IsMul(const NodeDef& node);
 bool IsMatMul(const NodeDef& node);
 bool IsNextIteration(const NodeDef& node);
+bool IsPack(const NodeDef& node);
 bool IsPad(const NodeDef& node);
+bool IsPack(const NodeDef& node);
+bool IsNeg(const NodeDef& node);
 bool IsNoOp(const NodeDef& node);
 bool IsNotEqual(const NodeDef& node);
 bool IsPlaceholder(const NodeDef& node);
@@ -118,6 +123,10 @@ bool IsSplitV(const NodeDef& node);
 bool IsSqrtGrad(const NodeDef& node);
 bool IsSquaredDifference(const NodeDef& node);
 bool IsSqueeze(const NodeDef& node);
+bool IsStackOp(const NodeDef& node);
+bool IsStackCloseOp(const NodeDef& node);
+bool IsStackPushOp(const NodeDef& node);
+bool IsStackPopOp(const NodeDef& node);
 bool IsStopGradient(const NodeDef& node);
 bool IsStridedSlice(const NodeDef& node);
 bool IsStridedSliceGrad(const NodeDef& node);
diff --git a/tensorflow/core/grappler/optimizers/BUILD b/tensorflow/core/grappler/optimizers/BUILD
index 50ba48ea7a98490d81329cde79842b0201fa0b6b..bd9608b369bd6d08bdf34727cda0b30d8b309999 100644
--- a/tensorflow/core/grappler/optimizers/BUILD
+++ b/tensorflow/core/grappler/optimizers/BUILD
@@ -1,6 +1,15 @@
 licenses(["notice"])  # Apache 2.0
 
 load("//tensorflow:tensorflow.bzl", "tf_cc_test")
+load("//tensorflow:tensorflow.bzl", "tf_cuda_cc_test")
+load("//tensorflow:tensorflow.bzl", "tf_kernel_library")
+load("@local_config_cuda//cuda:build_defs.bzl", "if_cuda")
+
+# Platform specific build config
+load(
+    "//tensorflow/core:platform/default/build_config.bzl",
+    "tf_protos_grappler",
+)
 
 filegroup(
     name = "all_files",
@@ -35,7 +44,7 @@ cc_library(
     ],
 )
 
-tf_cc_test(
+tf_cuda_cc_test(
     name = "static_schedule_test",
     srcs = ["static_schedule_test.cc"],
     deps = [
@@ -70,7 +79,7 @@ cc_library(
     ],
 )
 
-tf_cc_test(
+tf_cuda_cc_test(
     name = "auto_parallel_test",
     srcs = ["auto_parallel_test.cc"],
     deps = [
@@ -129,6 +138,48 @@ tf_cc_test(
     ],
 )
 
+cc_library(
+    name = "function_optimizer",
+    srcs = ["function_optimizer.cc"],
+    hdrs = [
+        "function_optimizer.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_optimizer",
+        "//tensorflow/core:core_cpu_base",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:op_types",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/utils:functions",
+    ],
+)
+
+tf_cuda_cc_test(
+    name = "function_optimizer_test",
+    srcs = ["function_optimizer_test.cc"],
+    deps = [
+        ":function_optimizer",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/cc:cc_ops_internal",
+        "//tensorflow/cc:functional_ops",
+        "//tensorflow/core:all_kernels",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:direct_session",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/utils:grappler_test",
+    ],
+)
+
 cc_library(
     name = "graph_rewriter",
     srcs = ["graph_rewriter.cc"],
@@ -157,6 +208,37 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "graph_optimizer_stage",
+    srcs = ["graph_optimizer_stage.cc"],
+    hdrs = ["graph_optimizer_stage.h"],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/costs:graph_properties",
+        "//tensorflow/core/grappler/utils:frame",
+    ],
+)
+
+tf_cuda_cc_test(
+    name = "graph_optimizer_stage_test",
+    size = "small",
+    srcs = ["graph_optimizer_stage_test.cc"],
+    deps = [
+        ":graph_optimizer_stage",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/costs:graph_properties",
+        "//tensorflow/core/grappler/utils:grappler_test",
+    ],
+)
+
 cc_library(
     name = "custom_graph_optimizer",
     hdrs = [
@@ -179,6 +261,7 @@ cc_library(
     deps = [
         ":constant_folding",
         ":graph_optimizer",
+        ":graph_optimizer_stage",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
@@ -191,7 +274,7 @@ cc_library(
     ],
 )
 
-tf_cc_test(
+tf_cuda_cc_test(
     name = "arithmetic_optimizer_test",
     size = "small",
     srcs = ["arithmetic_optimizer_test.cc"],
@@ -206,6 +289,7 @@ tf_cc_test(
         "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core/grappler/inputs:trivial_test_graph_input_yielder",
+        "//tensorflow/core/grappler/utils:grappler_test",
     ],
 )
 
@@ -231,7 +315,7 @@ cc_library(
     ],
 )
 
-tf_cc_test(
+tf_cuda_cc_test(
     name = "dependency_optimizer_test",
     size = "small",
     srcs = ["dependency_optimizer_test.cc"],
@@ -267,7 +351,7 @@ cc_library(
     ],
 )
 
-tf_cc_test(
+tf_cuda_cc_test(
     name = "model_pruner_test",
     srcs = ["model_pruner_test.cc"],
     deps = [
@@ -282,9 +366,36 @@ tf_cc_test(
     ],
 )
 
+tf_kernel_library(
+    name = "gpu_swapping_kernels",
+    srcs = [
+        "gpu_swapping_kernels.cc",
+    ],
+    deps = [
+        "//tensorflow/core:core_cpu_base",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
+    name = "gpu_swapping_ops",
+    srcs = [
+        "gpu_swapping_ops.cc",
+    ],
+    deps = [
+        "//tensorflow/core:core_cpu_base",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+    ],
+    alwayslink = 1,
+)
+
 cc_library(
     name = "memory_optimizer",
-    srcs = ["memory_optimizer.cc"],
+    srcs = [
+        "memory_optimizer.cc",
+    ],
     hdrs = [
         "memory_optimizer.h",
     ],
@@ -294,6 +405,7 @@ cc_library(
         ":graph_rewriter",
         ":static_schedule",
         "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core/grappler:graph_view",
         "//tensorflow/core/grappler:grappler_item",
@@ -304,12 +416,16 @@ cc_library(
         "//tensorflow/core/grappler/costs:graph_properties",
         "//tensorflow/core/grappler/utils:topological_sort",
         "//tensorflow/core/grappler/utils:traversal",
-    ],
+    ] + if_cuda([
+        ":gpu_swapping_kernels",
+        ":gpu_swapping_ops",
+    ]),
 )
 
-tf_cc_test(
+tf_cuda_cc_test(
     name = "memory_optimizer_test",
     srcs = ["memory_optimizer_test.cc"],
+    tags = ["no_cuda_on_cpu_tap"],  # Do not re-enable without actually testing.
     deps = [
         ":memory_optimizer",
         "//tensorflow/cc:cc_ops",
@@ -348,7 +464,7 @@ cc_library(
     ],
 )
 
-tf_cc_test(
+tf_cuda_cc_test(
     name = "layout_optimizer_test",
     srcs = ["layout_optimizer_test.cc"],
     deps = [
@@ -383,6 +499,7 @@ cc_library(
         ":custom_graph_optimizer",
         ":custom_graph_optimizer_registry",
         ":dependency_optimizer",
+        ":function_optimizer",
         ":graph_optimizer",
         ":layout_optimizer",
         ":loop_optimizer",
@@ -396,7 +513,7 @@ cc_library(
     ],
 )
 
-tf_cc_test(
+tf_cuda_cc_test(
     name = "meta_optimizer_test",
     srcs = ["meta_optimizer_test.cc"],
     deps = [
@@ -425,7 +542,7 @@ cc_library(
     ],
 )
 
-tf_cc_test(
+tf_cuda_cc_test(
     name = "custom_graph_optimizer_registry_test",
     size = "small",
     srcs = ["custom_graph_optimizer_registry_test.cc"],
@@ -446,6 +563,7 @@ cc_library(
     ],
     visibility = ["//visibility:public"],
     deps = [
+        ":constant_folding",
         ":graph_optimizer",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
@@ -455,12 +573,12 @@ cc_library(
         "//tensorflow/core/grappler:op_types",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core/grappler/costs:graph_properties",
+        "//tensorflow/core/grappler/utils:frame",
     ],
 )
 
-tf_cc_test(
+tf_cuda_cc_test(
     name = "loop_optimizer_test",
-    size = "small",
     srcs = ["loop_optimizer_test.cc"],
     deps = [
         ":loop_optimizer",
@@ -471,5 +589,30 @@ tf_cc_test(
         "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core/grappler/inputs:trivial_test_graph_input_yielder",
+        "//tensorflow/core/grappler/utils:grappler_test",
+    ],
+)
+
+cc_library(
+    name = "symbolic_shapes",
+    srcs = ["symbolic_shapes.cc"],
+    hdrs = ["symbolic_shapes.h"],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/core:framework",
+        "//tensorflow/core:protos_all_cc",
+    ] + tf_protos_grappler(),
+)
+
+tf_cc_test(
+    name = "symbolic_shapes_test",
+    srcs = ["symbolic_shapes_test.cc"],
+    deps = [
+        ":symbolic_shapes",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
     ],
 )
diff --git a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
index 709a434e40e887502cac1317870eb0db8e0c2910..bc004df60880c9da52aed0d8518c2bc4e02bb213 100644
--- a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
@@ -30,6 +30,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/op_types.h"
 #include "tensorflow/core/grappler/optimizers/constant_folding.h"
+#include "tensorflow/core/grappler/optimizers/graph_optimizer_stage.h"
 #include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/core/grappler/utils/frame.h"
 #include "tensorflow/core/lib/core/errors.h"
@@ -45,19 +46,6 @@ namespace tensorflow {
 namespace grappler {
 namespace {
 
-template <typename T>
-bool AreInversePermutations(const std::vector<T>& a, const std::vector<T>& b) {
-  if (a.size() != b.size()) {
-    return false;
-  }
-  for (int i = 0; i < a.size(); ++i) {
-    if (a[b[i]] != i) {
-      return false;
-    }
-  }
-  return true;
-}
-
 // Extract values from a Const op to `values`. Returns true if succeeds.
 template <typename T>
 bool ValuesFromConstNode(const NodeDef& node, std::vector<T>* values) {
@@ -210,30 +198,39 @@ bool IsNumberType(DataType dtype) { return kNumberTypes.Contains(dtype); }
 
 const char kOutputShapesAttr[] = "_output_shapes";
 
-PartialTensorShape GetInputShape(const string& input, const NodeMap& node_map) {
-  int output_pos;
-  string node_name = ParseNodeName(input, &output_pos);
-  const NodeDef* input_node = node_map.GetNode(node_name);
-  return input_node->attr().at(kOutputShapesAttr).list().shape(output_pos);
+// Shape is symbolically defined if it has a known rank, and each dimension is
+// defined, or is an unknown symbol (dim.size <= -2).
+bool ShapeIsSymbolicallyDefined(const TensorShapeProto& shape) {
+  return !shape.unknown_rank() &&
+         std::all_of(
+             shape.dim().begin(), shape.dim().end(),
+             [](const TensorShapeProto::Dim& dim) { return dim.size() != -1; });
+}
+
+bool ShapeIsSymbolicallyDefined(const OpInfo::TensorProperties& properties) {
+  return ShapeIsSymbolicallyDefined(properties.shape());
 }
 
-bool ShapesEqual(const string& input_x, const string& input_y,
-                 const NodeMap& node_map) {
-  PartialTensorShape x_shape = GetInputShape(input_x, node_map);
-  PartialTensorShape y_shape = GetInputShape(input_y, node_map);
-  if (x_shape.unknown_rank() || y_shape.unknown_rank() ||
-      x_shape.dims() != y_shape.dims()) {
+bool ShapesSymbolicallyEqual(const TensorShapeProto& left,
+                             const TensorShapeProto& right) {
+  if (left.unknown_rank() || right.unknown_rank() ||
+      left.dim_size() != right.dim_size()) {
     return false;
   }
-  for (int i = 0; i < x_shape.dims(); ++i) {
-    if (x_shape.dim_size(i) == -1 || y_shape.dim_size(i) == -1 ||
-        x_shape.dim_size(i) != y_shape.dim_size(i)) {
+  for (int i = 0; i < left.dim_size(); ++i) {
+    if (left.dim(i).size() == -1 || right.dim(i).size() == -1 ||
+        left.dim(i).size() != right.dim(i).size()) {
       return false;
     }
   }
   return true;
 }
 
+bool ShapesSymbolicallyEqual(const OpInfo::TensorProperties& left,
+                             const OpInfo::TensorProperties& right) {
+  return ShapesSymbolicallyEqual(left.shape(), right.shape());
+}
+
 // Returns whether `reshape` is an identity op. The tensor that `reshape`
 // reshapes is the `output_pos`-th output of node `input`.
 bool ReshapeIsIdentity(const NodeDef& reshape, const NodeDef& input,
@@ -292,6 +289,667 @@ NodeDef* GetTailOfValuePreservingChain(
                         is_value_preserving_non_branching);
 }
 
+// Graph optimizer context extension specific to ArithmeticOptimizer
+struct ArithmeticOptimizerContext {
+  explicit ArithmeticOptimizerContext(SetVector<NodeDef*>* nodes_to_simplify)
+      : nodes_to_simplify(nodes_to_simplify) {}
+  SetVector<NodeDef*>* nodes_to_simplify;
+};
+
+// Base class for a single arithmetic optimization, e.g. Bitcast optimization,
+// AddOps optimization, etc.
+class ArithmeticOptimizerStage : public GraphOptimizerStage<string> {
+ public:
+  explicit ArithmeticOptimizerStage(const string& name,
+                                    const GraphOptimizerContext& ctx,
+                                    const ArithmeticOptimizerContext ctx_ext)
+      : GraphOptimizerStage("ArithmeticOptimizer", name, ctx),
+        ctx_ext_(ctx_ext) {}
+  virtual ~ArithmeticOptimizerStage() = default;
+
+  // A simplification graph rewrite can create additional nodes that are inputs
+  // to the final simplified node; they can also be added to the arithmetic
+  // optimizer queue for further optimization.
+  void AddToOptimizationQueue(NodeDef* node) {
+    ctx_ext_.nodes_to_simplify->PushBack(node);
+  }
+
+  // TODO(ezhulenev): remove this method from ArithmeticOptimizer once all
+  // optimizations are migrated to stages
+  void AddFrameControlDeps(const NodeDef* old_node,
+                           const std::vector<NodeDef*>& new_nodes,
+                           const string& source_for_ctrl_dep,
+                           const std::vector<NodeDef*>& sinks_for_control_dep) {
+    const auto frame_it = ctx_.frame_map->find(old_node);
+    if (frame_it != ctx_.frame_map->end()) {
+      for (auto node : new_nodes) {
+        ctx_.frame_map->emplace(node, frame_it->second);
+      }
+      if (!source_for_ctrl_dep.empty() && !sinks_for_control_dep.empty()) {
+        const string ctrl_dep = ConstantFolding::AddControlDependency(
+            source_for_ctrl_dep, ctx_.optimized_graph, ctx_.node_map);
+        for (auto node : sinks_for_control_dep) {
+          MaybeAddControlInput(ctrl_dep, node, ctx_.optimized_graph,
+                               ctx_.node_map);
+        }
+      }
+    }
+  }
+
+ private:
+  // extended context required for ArithmeticOptimizer
+  const ArithmeticOptimizerContext ctx_ext_;
+};
+
+// Rewrite a tree of Add/AddN ops into a single AddN operation, consuming all
+// the original inputs of absorbed nodes.
+//
+// All nodes in an Add/AddN subgraph must have symbolically equal shapes. All
+// nodes must have the same device placement.
+//
+// Example:
+//                AddN_1
+//             /    |    \
+//          Add_1   z   Add_2       -> AddN(x, y, z, w, q, e)
+//          /  \        /  \
+//         x    y      w    Add_3
+//                          / \
+//                         q   e
+class AddOpsRewriteStage : public ArithmeticOptimizerStage {
+ public:
+  explicit AddOpsRewriteStage(const GraphOptimizerContext& ctx,
+                              const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("AddOpsRewrite", ctx, ctx_ext),
+        rewritten_nodes_() {}
+
+  ~AddOpsRewriteStage() override = default;
+
+  // Check if a node can become a root of AddOpsGroup
+  bool IsSupported(const NodeDef* node) const override {
+    // check basic preconditions
+    if (!IsRewritable(node)) {
+      return false;
+    }
+
+    // shape must be symbolically defined and all inputs compatible with it
+    OpInfo::TensorProperties properties;
+    Status has_properties = GetTensorProperties(node->name(), &properties);
+    return has_properties.ok() && ShapeIsSymbolicallyDefined(properties) &&
+           HasAllInputsOfSymbolicallyEqualShape(*node, properties);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    CHECK(IsSupported(node));
+    AddOpsGroup group;
+    TF_RETURN_IF_ERROR(CreateAddOpsGroup(node, &group));
+
+    if (!group.absorbed_nodes.empty() && !IsRewritten(group)) {
+      *simplified_node_name = RewriteAddOpsGroup(group);
+    }
+
+    return Status::OK();
+  }
+
+ private:
+  // Holds together an add ops subgraph that we want to rewrite together.
+  //
+  // For the graph above the AddOpsGroup will be:
+  //   root_node: AddN_1
+  //   absorbed_nodes: [Add_1, Add_2, Add_3]
+  //   inputs: [x, y, z, w, q, e]
+  struct AddOpsGroup {
+    const NodeDef* root_node;
+    TensorShapeProto root_shape;
+    // Add/AddN operations below the root level that were absorbed by this group
+    std::vector<NodeDef*> absorbed_nodes;
+    // Inputs of absorbed nodes that will be forwarded to rewritten AddN node
+    std::vector<string> inputs;
+  };
+
+  // Check if all inputs have symbolically equal shapes
+  bool HasAllInputsOfSymbolicallyEqualShape(
+      const NodeDef& node, const OpInfo::TensorProperties& properties) const {
+    const AddOpsRewriteStage* self = this;
+    return std::all_of(
+        node.input().begin(), node.input().end(),
+        [self, &properties](const string& input) {
+          OpInfo::TensorProperties input_properties;
+          Status has_input_properties =
+              self->GetTensorProperties(input, &input_properties);
+          return has_input_properties.ok() &&
+                 ShapesSymbolicallyEqual(properties, input_properties);
+        });
+  }
+
+  // TODO(ezhulenev): use GraphRewriter?
+  bool IsDrivenByControlDependency(const NodeDef& node) const {
+    return std::any_of(node.input().begin(), node.input().end(),
+                       IsControlInput);
+  }
+
+  // TODO(ezhulenev): use GraphRewriter?
+  bool DrivesControlDependency(const NodeDef& node) const {
+    int position;
+    for (const NodeDef* output : ctx_.node_map->GetOutputs(node.name())) {
+      for (int i = 0; i < output->input_size(); ++i) {
+        auto input = output->input(i);
+        string name = ParseNodeName(input, &position);
+        if (name == node.name() && /*control input*/ position < 0) {
+          return true;
+        }
+      }
+    }
+    return false;
+  }
+
+  // Check if a node can be absorbed by current AddOpsGroup
+  bool IsAbsorbableByAddOpsGroup(const string& name, const AddOpsGroup& group) {
+    NodeDef* node;
+    Status node_status = GetInputNode(name, &node);
+    if (!node_status.ok()) {
+      return false;
+    }
+    // check basic preconditions
+    if (!IsRewritable(node)) {
+      return false;
+    }
+    // with a single output data consumer (presumably if we reach this node from
+    // previously absorbed or a root node, it means that this node is not used
+    // as an input to any other op, outside of the group)
+    if (NumNonControlDataOutputs(*node, *ctx_.node_map) != 1) {
+      return false;
+    }
+    // must be on the same device as a root node
+    if (node->device() != group.root_node->device()) {
+      return false;
+    }
+    // All input shapes must be symbolically defined and equal to the node shape
+    OpInfo::TensorProperties properties;
+    Status has_properties = GetTensorProperties(name, &properties);
+    return has_properties.ok() &&
+           HasAllInputsOfSymbolicallyEqualShape(*node, properties);
+  }
+
+  // Node requirements both for a root node and an absorbed node
+  bool IsRewritable(const NodeDef* node) const {
+    // only Add or AddN can be a root node or an absorbed node
+    // TODO(ezhulenev): check if AccumulateNV2 can be supported too
+    if (!IsAdd(*node) && !IsAddN(*node)) {
+      return false;
+    }
+    // it must not be in a preserve set
+    if (ctx_.nodes_to_preserve->find(node->name()) !=
+        ctx_.nodes_to_preserve->end()) {
+      return false;
+    }
+    // it must not be a node created or absorbed by previous iteration
+    if (rewritten_nodes_.find(node->name()) != rewritten_nodes_.end()) {
+      return false;
+    }
+    // should not drive or be driven by control dependency
+    // TODO(ezhulenev): relax this condition for root node
+    return !(IsDrivenByControlDependency(*node) ||
+             DrivesControlDependency(*node));
+  }
+
+  // Check if the optimized group node name already exists. It might happen if
+  // the graph is optimized multiple times without pruning between invocations.
+  bool IsRewritten(const AddOpsGroup& group) const {
+    return ctx_.node_map->NodeExists(AddOpsGroupName(group));
+  }
+
+  // Create an AddOpsGroup with a root in a given node
+  Status CreateAddOpsGroup(const NodeDef* root_node, AddOpsGroup* group) {
+    OpInfo::TensorProperties root_node_output_properties;
+    TF_RETURN_IF_ERROR(
+        GetTensorProperties(root_node->name(), &root_node_output_properties));
+
+    group->root_node = root_node;
+    group->root_shape = root_node_output_properties.shape();
+
+    group->absorbed_nodes.reserve(root_node->input_size());
+    for (int i = 0; i < root_node->input_size(); ++i) {
+      TF_RETURN_IF_ERROR(AbsorbInputByAddOpsGroup(root_node->input(i), group));
+    }
+
+    return Status::OK();
+  }
+
+  Status AbsorbInputByAddOpsGroup(const string& input, AddOpsGroup* group) {
+    NodeDef* node;
+    TF_RETURN_IF_ERROR(GetInputNode(input, &node));
+
+    if (IsAbsorbableByAddOpsGroup(input, *group)) {
+      group->absorbed_nodes.push_back(node);
+      for (int i = 0; i < node->input_size(); ++i) {
+        TF_RETURN_IF_ERROR(AbsorbInputByAddOpsGroup(node->input(i), group));
+      }
+    } else {
+      // If node can't be absorbed, add it to AddOpsGroup input
+      group->inputs.push_back(input);
+    }
+    return Status::OK();
+  }
+
+  // New node for AddOpsGroup is added to the same scope as a root_node. All
+  // absorbed nodes are stripped of their scope, and only names are used in a
+  // new node name.
+  //
+  // Example: AddOpsGroup(root="a/b/c/Add_2", absorbed=["d/Add_1", "e/Add"])
+  //          node_name="a/b/c/AddOpsGroup_Add_2_Add_1_Add"
+  string AddOpsGroupName(const AddOpsGroup& group) const {
+    CHECK_NOTNULL(group.root_node);
+
+    auto root = ParseNodeScopeAndName(group.root_node->name());
+
+    std::vector<string> absorbed_node_names(group.absorbed_nodes.size());
+    std::transform(group.absorbed_nodes.begin(), group.absorbed_nodes.end(),
+                   absorbed_node_names.begin(),
+                   [](const NodeDef* node) { return node->name(); });
+
+    return OptimizedNodeName(root, absorbed_node_names);
+  }
+
+  // Create a new node for an AddOpsGroup and return its name.
+  string RewriteAddOpsGroup(const AddOpsGroup& group) {
+    CHECK_GT(group.absorbed_nodes.size(), 0)
+        << "AddOpsGroup must have non empty absorbed nodes";
+
+    // name for a new node constructed from AddOpsGroup
+    string node_name = AddOpsGroupName(group);
+
+    // copy attributes from a root node
+    DataType dtype = group.root_node->attr().at("T").type();
+
+    // add new AddN node
+    NodeDef* added_node = AddEmptyNode(node_name);
+    added_node->set_op("AddN");
+    added_node->set_device(group.root_node->device());
+    (*added_node->mutable_attr())["T"].set_type(dtype);
+    (*added_node->mutable_attr())["N"].set_i(group.inputs.size());
+
+    // all inputs of absorbed nodes are added to the new node
+    for (const string& input : group.inputs) {
+      ctx_.node_map->AddOutput(input, node_name);
+      added_node->add_input(input);
+    }
+
+    // Add frame dependencies that the original node might have had.
+    AddFrameControlDeps(group.root_node, {added_node}, "", {});
+
+    VLOG(1) << "Absorbed " << group.absorbed_nodes.size()
+            << " Add/AddN nodes from the graph";
+
+    // keep track of nodes that were created or absorbed as a part of rewrite
+    rewritten_nodes_.insert(node_name);
+    for (const NodeDef* absorbed : group.absorbed_nodes) {
+      rewritten_nodes_.insert(absorbed->name());
+    }
+
+    return node_name;
+  }
+
+  // keep nodes that were added or absorbed as a part of AddOpsGroup rewrite
+  std::unordered_set<string> rewritten_nodes_;
+};
+
+// Use the commutativity and (left- and right-) distributive property of
+// multiplication over addition to hoist common factors out of aggregate nodes
+// where all the inputs are Mul nodes. This pattern occurs frequently in
+// regularization terms for the gradients during training.
+//
+// For example, we can rewrite an expression of the form:
+//   AddN(Mul(x, y1), Mul(y2, x), Mul(x, y3), ... Mul(x, yn))
+// to the following:
+//   Mul(x, AddN(y1, y2, y3, ... yn))
+class HoistCommonFactorOutOfAggregation : public ArithmeticOptimizerStage {
+ public:
+  explicit HoistCommonFactorOutOfAggregation(
+      const GraphOptimizerContext& ctx,
+      const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("HoistCommonFactor", ctx, ctx_ext) {}
+  ~HoistCommonFactorOutOfAggregation() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsAggregate(*node) && NumNonControlInputs(*node) > 1 &&
+           !IsRewritten(node);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    CHECK(IsSupported(node));
+
+    std::set<string> common_factors;
+    TF_RETURN_IF_ERROR(GetCommonFactors(node, &common_factors));
+
+    if (common_factors.size() == 1) {
+      const string& common_factor = *common_factors.begin();
+
+      // Gather up the non-shared factors
+      bool shapes_match = true;
+      std::vector<string> unique_factors;
+      TF_RETURN_IF_ERROR(GetUniqueFactors(node, common_factor, &shapes_match,
+                                          &unique_factors));
+
+      if (shapes_match) {
+        NodeDef* input_0;
+        TF_RETURN_IF_ERROR(GetInputNode(node->input(0), &input_0));
+
+        // Use a copy of the first Mul node for the outer multiplication.
+        NodeDef* new_mul_node = AddCopyNode(OuterMulNodeName(node), input_0);
+        // And a copy of aggregation node as one of the inner operands
+        NodeDef* new_add_node = AddCopyNode(InnerAddNodeName(node), node);
+
+        new_mul_node->set_device(node->device());
+        new_mul_node->set_input(0, common_factor);
+        new_mul_node->set_input(1, new_add_node->name());
+
+        ctx_.node_map->AddOutput(common_factor, new_mul_node->name());
+        ctx_.node_map->AddOutput(new_add_node->name(), new_mul_node->name());
+
+        // Hoist non-shared factors up into the new AddN node.
+        for (int i = 0; i < unique_factors.size(); ++i) {
+          new_add_node->set_input(i, unique_factors[i]);
+        }
+
+        // Add frame dependencies that the original node might have had.
+        AddFrameControlDeps(node, {new_add_node, new_mul_node}, common_factor,
+                            {new_add_node});
+
+        // optimize new inner aggregation node
+        AddToOptimizationQueue(new_add_node);
+        // do not optimize the same node twice
+        rewritten_nodes_.insert(node->name());
+        *simplified_node_name = new_mul_node->name();
+      }
+    }
+    return Status::OK();
+  }
+
+ private:
+  // Get a name for new outer Mul node
+  string OuterMulNodeName(const NodeDef* node) const {
+    auto scope_and_name = ParseNodeScopeAndName(node->name());
+    return OptimizedNodeName(scope_and_name, "Mul");
+  }
+
+  // Get a name for new inner Add node
+  string InnerAddNodeName(const NodeDef* node) const {
+    auto scope_and_name = ParseNodeScopeAndName(node->name());
+    return OptimizedNodeName(scope_and_name, "Add");
+  }
+
+  // Determine the set of common factors if the input nodes are all Mul nodes.
+  Status GetCommonFactors(const NodeDef* node,
+                          std::set<string>* common_factors) const {
+    CHECK(common_factors->empty());
+
+    for (int i = 0; i < node->input_size(); ++i) {
+      if (i > 0 && common_factors->empty()) break;
+      if (IsControlInput(node->input(i))) break;
+
+      NodeDef* input;
+      TF_RETURN_IF_ERROR(GetInputNode(node->input(i), &input));
+
+      if (!IsMul(*input)) {
+        common_factors->clear();
+        break;
+      }
+
+      std::set<string> factors_i{input->input(0), input->input(1)};
+      if (i == 0) {
+        std::swap(*common_factors, factors_i);
+      } else {
+        std::set<string> intersection;
+        std::set_intersection(
+            factors_i.begin(), factors_i.end(), common_factors->begin(),
+            common_factors->end(),
+            std::inserter(intersection, intersection.begin()));
+        std::swap(*common_factors, intersection);
+      }
+    }
+    return Status::OK();
+  }
+
+  // Gather up the non-shared factors (the y's in the example).
+  // Unless the aggregation is Add, we have to make sure that all the y's
+  // have the same shape since the other aggregation ops do not support
+  // broadcasting.
+  Status GetUniqueFactors(const NodeDef* node, const string& common_factor,
+                          bool* shapes_match,
+                          std::vector<string>* unique_factors) const {
+    *shapes_match = true;
+    unique_factors->reserve(node->input_size());
+
+    for (int i = 0; i < node->input_size() && *shapes_match; ++i) {
+      const string& input = node->input(i);
+      if (IsControlInput(input)) {
+        break;
+      }
+      NodeDef* mul_node;
+      TF_RETURN_IF_ERROR(GetInputNode(input, &mul_node));
+      const int unique_factor_index =
+          mul_node->input(0) == common_factor ? 1 : 0;
+      unique_factors->push_back(mul_node->input(unique_factor_index));
+      if (i > 0 && !IsAdd(*node)) {
+        OpInfo::TensorProperties lhs;
+        OpInfo::TensorProperties rhs;
+        TF_RETURN_IF_ERROR(GetTensorProperties(unique_factors->front(), &lhs));
+        TF_RETURN_IF_ERROR(GetTensorProperties(unique_factors->back(), &rhs));
+        *shapes_match = ShapesSymbolicallyEqual(lhs, rhs);
+      }
+    }
+    return Status::OK();
+  }
+
+  bool IsRewritten(const NodeDef* node) const {
+    // if graph rewrite happens in multiple passes without graph pruning between
+    // them, it's possible that rewritten node already exists in a graph
+    return rewritten_nodes_.find(node->name()) != rewritten_nodes_.end() ||
+           ctx_.node_map->NodeExists(OuterMulNodeName(node));
+  }
+
+  // keep names of the nodes that were optimized by this stage
+  std::unordered_set<string> rewritten_nodes_;
+};
+
+// Removes identity transposes and pairs of transposes that cancel each other.
+class RemoveIdentityTranspose : public ArithmeticOptimizerStage {
+ public:
+  explicit RemoveIdentityTranspose(const GraphOptimizerContext& ctx,
+                                   const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("RemoveIdentityTranspose", ctx, ctx_ext) {}
+  ~RemoveIdentityTranspose() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsTranspose(*node) || IsConjugateTranspose(*node);
+  }
+
+  // TODO(rmlarsen): Forward control dependencies on the bypassed
+  // transpose nodes.
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    CHECK(IsSupported(node));
+
+    NodeDef* input;
+    TF_RETURN_IF_ERROR(GetInputNode(node->input(0), &input));
+    NodeDef* node_perm;
+    TF_RETURN_IF_ERROR(GetInputNode(node->input(1), &node_perm));
+    std::vector<int64> node_perm_values;
+    TF_RETURN_IF_ERROR(GetPermutation(*node_perm, &node_perm_values));
+
+    if (input->op() == node->op()) {
+      // Remove pairs of transposes that cancel each other.
+      NodeDef* input_perm;
+      TF_RETURN_IF_ERROR(GetInputNode(input->input(1), &input_perm));
+      std::vector<int64> input_perm_values;
+      TF_RETURN_IF_ERROR(GetPermutation(*input_perm, &input_perm_values));
+      if (AreInversePermutations(node_perm_values, input_perm_values)) {
+        *simplified_node_name = input->input(0);
+      }
+    } else {
+      // Remove simple identity transposes.
+      if (IsIdentityPermutation(node_perm_values)) {
+        *simplified_node_name = node->input(0);
+      }
+    }
+    return Status::OK();
+  }
+
+ private:
+  Status GetPermutation(const NodeDef& node_perm,
+                        std::vector<int64>* perm64) const {
+    std::vector<int> perm32;
+    if (ValuesFromConstNode(node_perm, &perm32)) {
+      perm64->reserve(perm32.size());
+      for (int val : perm32) {
+        perm64->push_back(static_cast<int64>(val));
+      }
+      return Status::OK();
+    }
+    if (ValuesFromConstNode(node_perm, perm64)) {
+      return Status::OK();
+    }
+    return errors::InvalidArgument("Couldn't extract permutation from ",
+                                   node_perm.name());
+  }
+
+  bool AreInversePermutations(const std::vector<int64>& a,
+                              const std::vector<int64>& b) {
+    if (a.size() != b.size()) {
+      return false;
+    }
+    for (int i = 0; i < a.size(); ++i) {
+      if (a[b[i]] != i) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  bool IsIdentityPermutation(const std::vector<int64>& perm) {
+    for (int64 i = 0; i < perm.size(); ++i) {
+      if (i != perm[i]) {
+        return false;
+      }
+    }
+    return true;
+  }
+};
+
+// Remove redundant Bitcasts.
+// 1) Remove Bitcast whose source type and destination type are equal
+// 2) Rewrite Bitcast(Bitcast(x, type1), type2) => Bitcast(x, type2)
+class RemoveRedundantBitcastStage : public ArithmeticOptimizerStage {
+ public:
+  explicit RemoveRedundantBitcastStage(
+      const GraphOptimizerContext& ctx,
+      const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("RemoveRedundantBitcast", ctx, ctx_ext) {}
+  ~RemoveRedundantBitcastStage() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsBitcast(*node);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    CHECK(IsSupported(node));
+
+    // Bypass Bitcast whose source type and destination type are equal.
+    if (GetSourceDataType(*node) == GetDestinationDataType(*node)) {
+      *simplified_node_name = node->input(0);
+      return Status::OK();
+    }
+
+    NodeDef* bitcast;
+    TF_RETURN_IF_ERROR(GetInputNode(node->name(), &bitcast));
+    NodeDef* operand;
+    TF_RETURN_IF_ERROR(GetInputNode(node->input(0), &operand));
+
+    if (IsBitcast(*operand)) {
+      // Bitcast(Bitcast(x, type1), type2) => Bitcast(x, type2)
+      bitcast->set_input(0, operand->input(0));
+      SetSourceDataType(GetSourceDataType(*operand), bitcast);
+      ctx_.node_map->UpdateInput(bitcast->name(), bitcast->input(0),
+                                 operand->input(0));
+      AddToOptimizationQueue(bitcast);
+      *simplified_node_name = bitcast->name();
+    }
+
+    return Status::OK();
+  }
+};
+
+// Remove Casts whose source type and destination type are equal.
+class RemoveRedundantCastStage : public ArithmeticOptimizerStage {
+ public:
+  explicit RemoveRedundantCastStage(const GraphOptimizerContext& ctx,
+                                    const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("RemoveRedundantCast", ctx, ctx_ext) {}
+  ~RemoveRedundantCastStage() override = default;
+
+  bool IsSupported(const NodeDef* node) const override { return IsCast(*node); }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    CHECK(IsSupported(node));
+    // Bypass Cast whose source type and destination type are equal.
+    if (GetSourceDataType(*node) == GetDestinationDataType(*node)) {
+      *simplified_node_name = node->input(0);
+    }
+    return Status::OK();
+  }
+};
+
+class RemoveNegationStage : public ArithmeticOptimizerStage {
+ public:
+  explicit RemoveNegationStage(const GraphOptimizerContext& ctx,
+                               const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("RemoveNegation", ctx, ctx_ext) {}
+  ~RemoveNegationStage() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsAdd(*node) || IsSub(*node);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    const string node_name = node->name();
+    NodeDef* x;
+    NodeDef* y;
+    TF_RETURN_IF_ERROR(GetInputNode(node->input(0), &x));
+    TF_RETURN_IF_ERROR(GetInputNode(node->input(1), &y));
+    bool updated = false;
+    if (IsAdd(*node)) {
+      if (IsNeg(*x)) {
+        // (-a) + b = b - a
+        node->set_op("Sub");
+        node->mutable_input()->SwapElements(0, 1);
+        node->set_input(1, x->input(0));
+        node->add_input(AsControlDependency(x->name()));
+        ctx_.node_map->AddOutput(NodeName(x->input(0)), node_name);
+        updated = true;
+      } else if (IsNeg(*y)) {
+        // a + (-b) = a - b
+        node->set_op("Sub");
+        node->set_input(1, y->input(0));
+        node->add_input(AsControlDependency(y->name()));
+        ctx_.node_map->AddOutput(NodeName(y->input(0)), node_name);
+        updated = true;
+      }
+    } else if (IsSub(*node)) {
+      if (IsNeg(*y)) {
+        // a - (-b) = a + b
+        node->set_op("Add");
+        node->set_input(1, y->input(0));
+        node->add_input(AsControlDependency(y->name()));
+        ctx_.node_map->AddOutput(NodeName(y->input(0)), node_name);
+        updated = true;
+      }
+    }
+    if (updated) {
+      AddToOptimizationQueue(node);
+    }
+    return Status::OK();
+  }
+};
+
 }  // namespace
 
 class UniqueNodes {
@@ -379,6 +1037,9 @@ bool UniqueNodes::SameNode(const NodeDef& node1, const NodeDef& node2) const {
   }
 
   // Compare attributes.
+  if (node1.attr().size() != node2.attr().size()) {
+    return false;
+  }
   for (const auto& attr1 : node1.attr()) {
     auto it = node2.attr().find(attr1.first);
     if (it == node2.attr().end()) {
@@ -516,6 +1177,8 @@ void ArithmeticOptimizer::AddFrameControlDeps(
   }
 }
 
+// TODO(ezhulenev): extract each individual simplify rewrite into separate
+// ArithmeticOptimizerStage
 string ArithmeticOptimizer::TrySimplifyAndReplaceUses(
     const NodeDef* node, SetVector<NodeDef*>* nodes_to_simplify) {
   // Remove involutions applied twice.
@@ -543,31 +1206,6 @@ string ArithmeticOptimizer::TrySimplifyAndReplaceUses(
     }
   }
 
-  // Remove inverse transposes.
-  if (node->op() == "Transpose" || node->op() == "ConjugateTranspose") {
-    NodeDef* input = node_map_->GetNode(node->input(0));
-    if (input->op() == node->op()) {
-      const NodeDef* node_perm = node_map_->GetNode(node->input(1));
-      const NodeDef* input_perm = node_map_->GetNode(input->input(1));
-      // Try 32-bit indices.
-      std::vector<int> node_perm_values;
-      std::vector<int> input_perm_values;
-      if (ValuesFromConstNode(*node_perm, &node_perm_values) &&
-          ValuesFromConstNode(*input_perm, &input_perm_values) &&
-          AreInversePermutations(node_perm_values, input_perm_values)) {
-        return input->input(0);
-      }
-      // Try 64-bit indices.
-      std::vector<int64> node_perm_values64;
-      std::vector<int64> input_perm_values64;
-      if (ValuesFromConstNode(*node_perm, &node_perm_values64) &&
-          ValuesFromConstNode(*input_perm, &input_perm_values64) &&
-          AreInversePermutations(node_perm_values64, input_perm_values64)) {
-        return input->input(0);
-      }
-    }
-  }
-
   if (node->op() == "Reshape") {
     //   Reshape
     //      ^
@@ -664,32 +1302,6 @@ string ArithmeticOptimizer::TrySimplifyAndReplaceUses(
     }
   }
 
-  if (node->op() == "Bitcast") {
-    NodeDef* bitcast = node_map_->GetNode(node->name());
-    // Bypass bitcasts whose source type and destination type are equal.
-    if (GetSourceDataType(*bitcast) == GetDestinationDataType(*bitcast)) {
-      return bitcast->input(0);
-    }
-
-    const NodeDef* operand = node_map_->GetNode(bitcast->input(0));
-    if (operand->op() == bitcast->op()) {
-      // Bitcast(Bitcast(x, type1), type2) => Bitcast(x, type2)
-      bitcast->set_input(0, operand->input(0));
-      SetSourceDataType(GetSourceDataType(*operand), bitcast);
-      node_map_->UpdateInput(bitcast->name(), bitcast->input(0),
-                             operand->input(0));
-      nodes_to_simplify->PushBack(bitcast);
-      return bitcast->name();
-    }
-  }
-
-  if (node->op() == "Cast") {
-    // Bypass casts whose source type and destination type are equal.
-    if (GetSourceDataType(*node) == GetDestinationDataType(*node)) {
-      return node->input(0);
-    }
-  }
-
   // Fold a multiply of a scalar into the following convolution. This folding
   // can jump across nodes that merely reorders data (such as reshape and
   // transpose). For example, we can optimize
@@ -858,98 +1470,6 @@ string ArithmeticOptimizer::TrySimplifyAndReplaceUses(
     }
   }
 
-  // Use the commutativity and (left- and right-) distributive property of
-  // multiplication over addition to hoist common factors out of aggregate nodes
-  // where all the inputs are Mul nodes. This pattern occurs frequently in
-  // regularization terms for the gradients during training.
-  // For example, we can rewrite an expression of the form:
-  //   AddN(Mul(x, y1), Mul(y2, x), Mul(x, y3), ... Mul(x, yn))
-  // to the following:
-  //   Mul(x, AddN(y1, y2, y3, ... yn))
-  if (IsAggregate(*node) && NumNonControlInputs(*node) > 1 &&
-      !OptimizedNodeExists(*node, "hoist_add") &&
-      !OptimizedNodeExists(*node, "hoist_mul")) {
-    // Determine the set of common factors if the input nodes are all Mul nodes.
-    std::set<string> common_factors;
-    for (int i = 0; i < node->input_size(); ++i) {
-      if (i > 0 && common_factors.empty()) {
-        break;
-      }
-      if (IsControlInput(node->input(i))) {
-        break;
-      }
-      const NodeDef* input = node_map_->GetNode(node->input(i));
-      if (input->op() == "Mul") {
-        std::set<string> factors_i{input->input(0), input->input(1)};
-        if (i == 0) {
-          std::swap(common_factors, factors_i);
-        } else {
-          std::set<string> intersection;
-          std::set_intersection(
-              factors_i.begin(), factors_i.end(), common_factors.begin(),
-              common_factors.end(),
-              std::inserter(intersection, intersection.begin()));
-          std::swap(common_factors, intersection);
-        }
-      } else {
-        common_factors.clear();
-      }
-    }
-    if (common_factors.size() == 1) {
-      const string& common_factor = *common_factors.begin();
-
-      // Gather up the non-shared factors (the y's in the example).
-      // Unless the aggregation is Add, we have to make sure that all the y's
-      // have the same shape since the other aggregation ops do not support
-      // broadcasting.
-      std::vector<string> unique_factors;
-      unique_factors.reserve(node->input_size());
-      bool shapes_match = true;
-      for (int i = 0; i < node->input_size() && shapes_match; ++i) {
-        const string& input = node->input(i);
-        if (IsControlInput(input)) {
-          break;
-        }
-        const NodeDef* mul_node = node_map_->GetNode(input);
-        const int unique_factor_index =
-            mul_node->input(0) == common_factor ? 1 : 0;
-        unique_factors.push_back(mul_node->input(unique_factor_index));
-        if (i > 0 && !IsAdd(*node)) {
-          shapes_match = ShapesEqual(unique_factors.front(),
-                                     unique_factors.back(), *node_map_);
-        }
-      }
-
-      if (shapes_match) {
-        // 1. Use a copy of the first Mul node for the outer multiplication.
-        NodeDef* new_mul_node = AddNode(OptimizedNodeName(*node, "hoist_mul"),
-                                        node_map_->GetNode(node->input(0)));
-        NodeDef* new_add_node = AddNode(*node, "hoist_add", /*copy_node=*/true);
-        new_mul_node->set_device(node->device());
-        new_mul_node->set_input(0, common_factor);
-        node_map_->AddOutput(common_factor, new_mul_node->name());
-        new_mul_node->set_input(1, new_add_node->name());
-        node_map_->AddOutput(new_add_node->name(), new_mul_node->name());
-
-        // 2. Hoist non-shared factors up into the new AddN node.
-        nodes_to_simplify->PushBack(new_add_node);
-        for (int i = 0; i < node->input_size(); ++i) {
-          const string& input = node->input(i);
-          if (IsControlInput(input)) {
-            break;
-          }
-          new_add_node->set_input(i, unique_factors[i]);
-        }
-
-        // 3. Add frame dependencies that the original node might have had.
-        AddFrameControlDeps(node, {new_add_node, new_mul_node}, common_factor,
-                            {new_add_node});
-
-        return new_mul_node->name();
-      }
-    }
-  }
-
   // Fold Transpose into matrix multiplication.
   if ((node->op() == "MatMul" || node->op() == "SparseMatMul" ||
        node->op() == "BatchMatMul") &&
@@ -1025,14 +1545,69 @@ Status ArithmeticOptimizer::SimplifyArithmeticOps() {
   for (int i = 0; i < optimized_graph_->node_size(); ++i) {
     nodes_to_simplify.PushBack(optimized_graph_->mutable_node(i));
   }
+
+  const GraphOptimizerContext ctx(&nodes_to_preserve_, optimized_graph_,
+                                  graph_properties_.get(), node_map_.get(),
+                                  &frame_map_);
+  const ArithmeticOptimizerContext ctx_ext(&nodes_to_simplify);
+
+  std::vector<std::unique_ptr<ArithmeticOptimizerStage>> stages;
+
+  if (options_.combine_add_to_addn) {
+    stages.push_back(std::unique_ptr<ArithmeticOptimizerStage>(
+        new AddOpsRewriteStage(ctx, ctx_ext)));
+  }
+  if (options_.hoist_common_factor_out_of_aggregation) {
+    stages.push_back(std::unique_ptr<ArithmeticOptimizerStage>(
+        new HoistCommonFactorOutOfAggregation(ctx, ctx_ext)));
+  }
+  if (options_.remove_identity_transpose) {
+    stages.push_back(std::unique_ptr<ArithmeticOptimizerStage>(
+        new RemoveIdentityTranspose(ctx, ctx_ext)));
+  }
+  if (options_.remove_redundant_bitcast) {
+    stages.push_back(std::unique_ptr<ArithmeticOptimizerStage>(
+        new RemoveRedundantBitcastStage(ctx, ctx_ext)));
+  }
+  if (options_.remove_redundant_cast) {
+    stages.push_back(std::unique_ptr<ArithmeticOptimizerStage>(
+        new RemoveRedundantCastStage(ctx, ctx_ext)));
+  }
+  if (options_.remove_negation) {
+    stages.push_back(std::unique_ptr<ArithmeticOptimizerStage>(
+        new RemoveNegationStage(ctx, ctx_ext)));
+  }
+
+  VLOG(1) << "Simplify arithmetic ops using " << stages.size()
+          << " arithmetic optimization stages";
+
   while (!nodes_to_simplify.Empty()) {
-    const NodeDef* node = nodes_to_simplify.PopBack();
-    const string simplified_tensor =
-        TrySimplifyAndReplaceUses(node, &nodes_to_simplify);
+    NodeDef* node = nodes_to_simplify.PopBack();
+
+    // TODO(ezhulenev): move all rewrites into separate stages
+    string simplified_tensor = "";
+    if (options_.enable_try_simplify_and_replace) {
+      simplified_tensor = TrySimplifyAndReplaceUses(node, &nodes_to_simplify);
+    }
+
+    // if it was not simplified try to run it through all configured stages
+    if (simplified_tensor.empty()) {
+      for (auto& stage : stages) {
+        if (stage->IsSupported(node)) {
+          TF_RETURN_IF_ERROR(stage->TrySimplify(node, &simplified_tensor));
+          if (!simplified_tensor.empty()) {
+            break;
+          }
+        }
+      }
+    }
+
+    // if it's still empty go to the next Node
     if (simplified_tensor.empty()) {
       continue;
     }
 
+    // re-wire consumers of an old node to the new one
     if (NodeName(simplified_tensor) != node->name()) {
       // Always consider simplified_tensor for further optimizations.
       NodeDef* simplified_node = node_map_->GetNode(simplified_tensor);
@@ -1087,6 +1662,7 @@ Status ArithmeticOptimizer::Optimize(Cluster* /*cluster*/,
   // Shapes are only needed in aggressive mode.
   graph_properties_.reset(new GraphProperties(item));
   TF_RETURN_IF_ERROR(graph_properties_->InferStatically(false));
+  // TODO(ezhulenev): Use GraphProperties to lookup tensor shapes directly
   TF_RETURN_IF_ERROR(graph_properties_->AnnotateOutputShapes(optimized_graph_));
 
   // Perform the optimizations.
diff --git a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.h b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.h
index afd538db408aa859a108e08b2de9efad635d515c..965f0e9ea25e1aac9ff266949bf6e20b5f0a30c9 100644
--- a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.h
+++ b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.h
@@ -32,9 +32,14 @@ constexpr char kArithmeticOptimizer[] = "ArithmeticOptimizer";
 // run a model.
 class ArithmeticOptimizer : public GraphOptimizer {
  public:
-  ArithmeticOptimizer() : opt_level_(RewriterConfig::ON) {}
+  ArithmeticOptimizer()
+      : opt_level_(RewriterConfig::ON),
+        options_(ArithmeticOptimizerOptions::Default(RewriterConfig::ON)) {}
+
   explicit ArithmeticOptimizer(RewriterConfig::Toggle opt_level)
-      : opt_level_(opt_level) {}
+      : opt_level_(opt_level),
+        options_(ArithmeticOptimizerOptions::Default(opt_level)) {}
+
   ~ArithmeticOptimizer() override {}
 
   string name() const override { return "arithmetic_optimizer"; };
@@ -46,6 +51,28 @@ class ArithmeticOptimizer : public GraphOptimizer {
                 const GraphDef& optimized_graph, double result) override;
 
  private:
+  friend class ArithmeticOptimizerTest;
+
+  // Granular control for arithmetic optimizer stages
+  struct ArithmeticOptimizerOptions {
+    // TODO(ezhulenev): flag to disable TrySimplifyAndReplaceUses in tests.
+    // Remove once all optimizers have been migrated to separate stages.
+    bool enable_try_simplify_and_replace = true;
+    bool combine_add_to_addn = false;
+    bool hoist_common_factor_out_of_aggregation = true;
+    bool remove_identity_transpose = true;
+    bool remove_redundant_bitcast = true;
+    bool remove_redundant_cast = true;
+    bool remove_negation = true;
+
+    // Choose which arithmetic optimizer stages will be enabled for a given
+    // optimization level by default.
+    static ArithmeticOptimizerOptions Default(
+        RewriterConfig::Toggle opt_level) {
+      return ArithmeticOptimizerOptions();
+    }
+  };
+
   // Returns true is a node with given name and the optimizer prefix already
   // exists.
   string OptimizedNodeName(const NodeDef& node, StringPiece suffix) const;
@@ -97,13 +124,14 @@ class ArithmeticOptimizer : public GraphOptimizer {
                                    SetVector<NodeDef*>* nodes_to_simplify);
 
   RewriterConfig::Toggle opt_level_;
+  ArithmeticOptimizerOptions options_;
 
-  bool fetch_nodes_known_;
+  bool fetch_nodes_known_ = false;
   std::unordered_set<string> nodes_to_preserve_;
   std::unique_ptr<NodeMap> node_map_;
   FrameMap frame_map_;
   std::unique_ptr<GraphProperties> graph_properties_;
-  GraphDef* optimized_graph_;  // Not owned.
+  GraphDef* optimized_graph_ = nullptr;  // Not owned.
 };
 
 }  // end namespace grappler
diff --git a/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc b/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc
index 2a82b250586783759608db75bf9e383f4b0322cb..3876486d8037c462f424bf7a8dfd6223fded1a4e 100644
--- a/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc
@@ -21,13 +21,31 @@ limitations under the License.
 #include "tensorflow/core/grappler/optimizers/constant_folding.h"
 #include "tensorflow/core/grappler/optimizers/model_pruner.h"
 #include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/grappler_test.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
 
 namespace tensorflow {
 namespace grappler {
+
 namespace {
 
+constexpr char kHoistFactorOptimizerMul[] =
+    "ArithmeticOptimizer/HoistCommonFactor_Mul_";
+
+constexpr char kHoistFactorOptimizerAdd[] =
+    "ArithmeticOptimizer/HoistCommonFactor_Add_";
+
+// Optimized name of outer Mul node by HoistCommonFactorOutOfAggregation
+string HoistMulName(const string& name) {
+  return AddPrefixToNodeName(name, kHoistFactorOptimizerMul, "");
+}
+
+// Optimized name of inner Add node by HoistCommonFactorOutOfAggregation
+string HoistAddName(const string& name) {
+  return AddPrefixToNodeName(name, kHoistFactorOptimizerAdd, "");
+}
+
 string OptimizedName(const string& name) {
   return AddPrefixToNodeName(name, kArithmeticOptimizer);
 }
@@ -46,8 +64,74 @@ void VerifyGraphsMatch(const GraphDef& original_graph,
     }
   }
 }
+}  // namespace
+
+class ArithmeticOptimizerTest : public GrapplerTest {
+ protected:
+  // Optimize a graph using ArithmeticOptimizer and prune all the nodes that no
+  // longer have any output consumers.
+  void OptimizeAndPrune(ArithmeticOptimizer* optimizer, GrapplerItem* item,
+                        GraphDef* output) {
+    TF_EXPECT_OK(optimizer->Optimize(nullptr, *item, output));
+    item->graph.Swap(output);
+    TF_EXPECT_OK(ModelPruner().Optimize(nullptr, *item, output));
+  }
+
+  // Run ArithmeticOptimizer twice to make sure the rewrite is idempotent.
+  void OptimizeTwice(ArithmeticOptimizer* optimizer, GrapplerItem* item,
+                     GraphDef* output) {
+    TF_EXPECT_OK(optimizer->Optimize(nullptr, *item, output));
+    item->graph.Swap(output);
+    TF_EXPECT_OK(optimizer->Optimize(nullptr, *item, output));
+  }
+
+  // TODO(ezhulenev): Make private. After migration to stages, each test
+  // should explicitly enable the optimizations it requires, for isolation.
+  void DisableAllStages(ArithmeticOptimizer* optimizer) {
+    ArithmeticOptimizer::ArithmeticOptimizerOptions options;
+    options.enable_try_simplify_and_replace = false;
+    options.combine_add_to_addn = false;
+    options.hoist_common_factor_out_of_aggregation = false;
+    options.remove_identity_transpose = false;
+    options.remove_redundant_bitcast = false;
+    options.remove_redundant_cast = false;
+    optimizer->options_ = options;
+  }
+
+  void DisableAddToAddNCombining(ArithmeticOptimizer* optimizer) {
+    optimizer->options_.combine_add_to_addn = false;
+  }
+
+  void EnableOnlyAddToAddNCombining(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.combine_add_to_addn = true;
+  }
+
+  void EnableOnlyHoistCommonFactor(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.hoist_common_factor_out_of_aggregation = true;
+  }
+
+  void EnableOnlyRemoveIdentityTranspose(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.remove_identity_transpose = true;
+  }
+
+  void EnableOnlyRemoveRedundantBitcast(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.remove_redundant_bitcast = true;
+  }
+
+  void EnableOnlyRemoveRedundantCast(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.remove_redundant_cast = true;
+  }
 
-class ArithmeticOptimizerTest : public ::testing::Test {};
+  void EnableOnlyRemoveNegation(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.remove_negation = true;
+  }
+};
 
 TEST_F(ArithmeticOptimizerTest, NoOp) {
   // This trivial graph is so basic there's nothing to optimize.
@@ -350,58 +434,68 @@ TEST_F(ArithmeticOptimizerTest, TrivialSumsRepeatedAdd) {
   for (int i = 0; i < item.graph.node_size(); ++i) {
     item.graph.mutable_node(i)->set_device(devices[i]);
   }
+
   ArithmeticOptimizer optimizer;
+  DisableAddToAddNCombining(&optimizer);
+
   GraphDef output;
-  Status status = optimizer.Optimize(nullptr, item, &output);
-  TF_EXPECT_OK(status);
-  // Run the optimizer twice to make sure the rewrite is idempotent.
-  item.graph.Swap(&output);
-  status = optimizer.Optimize(nullptr, item, &output);
-  TF_EXPECT_OK(status);
+  OptimizeTwice(&optimizer, &item, &output);
 
-  EXPECT_EQ(17, output.node_size());
-  // The graph gets optimized to
+  // We expect the following rewrite(s) to occur:
+  //
   // Mul(p,
-  //     Add(Add(Const(2), Const(2)),
-  //         Add(Const(2), Const(2))))
+  //     Add_6(Add_4(Const(2), Const(2)),
+  //           Add_5(Const(2), Const(2))))
+  NodeMap node_map(&output);
+
   EXPECT_EQ(17, output.node_size());
-  for (const auto& node : output.node()) {
-    if ("id" == node.name()) {
-      EXPECT_EQ(1, node.input_size());
-      EXPECT_EQ(OptimizedName("Add_6_hoist_mul"), node.input(0));
-    } else if (OptimizedName("Add_6_hoist_mul") == node.name()) {
-      EXPECT_EQ("Mul", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("Placeholder", node.input(0));
-      EXPECT_EQ(OptimizedName("Add_6_hoist_add"), node.input(1));
-    } else if (OptimizedName("Add_6_hoist_add") == node.name()) {
-      EXPECT_EQ("Add", node.op());
-      EXPECT_EQ(3, node.input_size());
-      EXPECT_EQ(OptimizedName("Add_4_hoist_add"), node.input(0));
-      EXPECT_EQ(OptimizedName("Add_5_hoist_add"), node.input(1));
-      EXPECT_EQ("^Placeholder", node.input(2));
-    } else if (OptimizedName("Add_4_hoist_add") == node.name()) {
-      EXPECT_EQ("Add", node.op());
-      EXPECT_EQ(3, node.input_size());
-      EXPECT_EQ(OptimizedName("Add_const"), node.input(0));
-      EXPECT_EQ(OptimizedName("Add_1_const"), node.input(1));
-      EXPECT_EQ("^Placeholder", node.input(2));
-    } else if (OptimizedName("Add_5_hoist_add") == node.name()) {
-      EXPECT_EQ("Add", node.op());
-      EXPECT_EQ(3, node.input_size());
-      EXPECT_EQ(OptimizedName("Add_const"), node.input(0));
-      EXPECT_EQ(OptimizedName("Add_1_const"), node.input(1));
-      EXPECT_EQ("^Placeholder", node.input(2));
-    } else if (OptimizedName("Add_const") == node.name()) {
-      EXPECT_EQ("Const", node.op());
-      EXPECT_EQ(1, node.input_size());
-      EXPECT_EQ("^Placeholder", node.input(0));
-    } else if (OptimizedName("Add_1_const") == node.name()) {
-      EXPECT_EQ("Const", node.op());
-      EXPECT_EQ(1, node.input_size());
-      EXPECT_EQ("^Placeholder", node.input(0));
-    }
-  }
+
+  const NodeDef* id_node = node_map.GetNode("id");
+  ASSERT_TRUE(id_node != nullptr);
+  EXPECT_EQ(1, id_node->input_size());
+  EXPECT_EQ(HoistMulName("Add_6"), id_node->input(0));
+
+  const NodeDef* mul_node = node_map.GetNode(HoistMulName("Add_6"));
+  ASSERT_TRUE(mul_node != nullptr);
+  EXPECT_EQ(2, mul_node->input_size());
+  EXPECT_EQ("Placeholder", mul_node->input(0));
+  EXPECT_EQ(HoistAddName("Add_6"), mul_node->input(1));
+
+  const NodeDef* add_6_node = node_map.GetNode(HoistAddName("Add_6"));
+  ASSERT_TRUE(add_6_node != nullptr);
+  EXPECT_EQ(3, add_6_node->input_size());
+  EXPECT_EQ(HoistAddName("Add_4"), add_6_node->input(0));
+  EXPECT_EQ(HoistAddName("Add_5"), add_6_node->input(1));
+  EXPECT_EQ("^Placeholder", add_6_node->input(2));
+
+  const NodeDef* add_4_node = node_map.GetNode(HoistAddName("Add_4"));
+  ASSERT_TRUE(add_4_node != nullptr);
+  EXPECT_EQ("Add", add_4_node->op());
+  EXPECT_EQ(3, add_4_node->input_size());
+  EXPECT_EQ(OptimizedName("Add_const"), add_4_node->input(0));
+  EXPECT_EQ(OptimizedName("Add_1_const"), add_4_node->input(1));
+  EXPECT_EQ("^Placeholder", add_4_node->input(2));
+
+  const NodeDef* add_5_node = node_map.GetNode(HoistAddName("Add_5"));
+  ASSERT_TRUE(add_5_node != nullptr);
+  EXPECT_EQ("Add", add_5_node->op());
+  EXPECT_EQ(3, add_5_node->input_size());
+  EXPECT_EQ(OptimizedName("Add_const"), add_5_node->input(0));
+  EXPECT_EQ(OptimizedName("Add_1_const"), add_5_node->input(1));
+  EXPECT_EQ("^Placeholder", add_5_node->input(2));
+
+  const NodeDef* add_const_node = node_map.GetNode(OptimizedName("Add_const"));
+  ASSERT_TRUE(add_const_node != nullptr);
+  EXPECT_EQ("Const", add_const_node->op());
+  EXPECT_EQ(1, add_const_node->input_size());
+  EXPECT_EQ("^Placeholder", add_const_node->input(0));
+
+  const NodeDef* add_1_const_node =
+      node_map.GetNode(OptimizedName("Add_1_const"));
+  ASSERT_TRUE(add_1_const_node != nullptr);
+  EXPECT_EQ("Const", add_1_const_node->op());
+  EXPECT_EQ(1, add_1_const_node->input_size());
+  EXPECT_EQ("^Placeholder", add_1_const_node->input(0));
 }
 
 TEST_F(ArithmeticOptimizerTest, HoistFactor) {
@@ -422,31 +516,46 @@ TEST_F(ArithmeticOptimizerTest, HoistFactor) {
                                    ops::Add(s.WithOpName("add"), mul1, mul2));
 
       GrapplerItem item;
+      item.fetch = {"id"};
       TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
       ArithmeticOptimizer optimizer;
+      EnableOnlyHoistCommonFactor(&optimizer);
+
       GraphDef output;
-      Status status = optimizer.Optimize(nullptr, item, &output);
-      TF_EXPECT_OK(status);
-      // Run the optimizer twice to make sure the rewrite is idempotent.
-      item.graph.Swap(&output);
-      status = optimizer.Optimize(nullptr, item, &output);
-      TF_EXPECT_OK(status);
+      OptimizeTwice(&optimizer, &item, &output);
+
+      // We expect the following rewrite(s) to occur:
+      //
+      //        Add                 Mul
+      //      /    \               /   \
+      //    Mul    Mul       ->   x    Add
+      //    / \    / \                 / \
+      //   x  y1  y2  x              y1   y2
+      //
+      // If the "root" op is AddN and the shapes do not match, this rewrite is
+      // not possible and the graph should stay intact.
+      NodeMap node_map(&output);
 
       if (use_addn && !matching_shapes) {
         VerifyGraphsMatch(item.graph, output, __LINE__);
       } else {
         EXPECT_EQ(9, output.node_size());
-        const NodeDef& new_add = output.node(8);
-        EXPECT_EQ(OptimizedName("add_hoist_add"), new_add.name());
-        EXPECT_EQ("y1", new_add.input(0));
-        EXPECT_EQ("y2", new_add.input(1));
-        const NodeDef& new_mul = output.node(7);
-        EXPECT_EQ(OptimizedName("add_hoist_mul"), new_mul.name());
-        EXPECT_EQ("x", new_mul.input(0));
-        EXPECT_EQ(OptimizedName("add_hoist_add"), new_mul.input(1));
-        const NodeDef& new_id = output.node(6);
-        EXPECT_EQ("id", new_id.name());
-        EXPECT_EQ(OptimizedName("add_hoist_mul"), new_id.input(0));
+
+        const NodeDef* new_add_node = node_map.GetNode(HoistAddName("add"));
+        ASSERT_TRUE(new_add_node != nullptr) << "Hoisted Add node not found";
+        EXPECT_EQ("y1", new_add_node->input(0));
+        EXPECT_EQ("y2", new_add_node->input(1));
+
+        const NodeDef* new_mul_node = node_map.GetNode(HoistMulName("add"));
+        ASSERT_TRUE(new_mul_node != nullptr) << "Hoisted Mul node not found";
+        EXPECT_EQ("x", new_mul_node->input(0));
+        EXPECT_EQ(new_add_node->name(), new_mul_node->input(1));
+
+        const NodeDef* id_node = node_map.GetNode("id");
+        ASSERT_TRUE(id_node != nullptr) << "Id node not found";
+        EXPECT_EQ("id", id_node->name());
+        EXPECT_EQ(HoistMulName("add"), id_node->input(0));
       }
     }
   }
@@ -630,9 +739,7 @@ TEST_F(ArithmeticOptimizerTest, IdentityReshape) {
   item.graph.Swap(&output);
   TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
 
-  EXPECT_EQ(0, std::count_if(
-                   output.node().begin(), output.node().end(),
-                   [](const NodeDef& node) { return node.op() == "Reshape"; }));
+  EXPECT_EQ(0, CountOpNodes(output, "Reshape"));
 }
 
 TEST_F(ArithmeticOptimizerTest, NotIdentityReshape) {
@@ -654,9 +761,7 @@ TEST_F(ArithmeticOptimizerTest, NotIdentityReshape) {
   item.graph.Swap(&output);
   TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
 
-  EXPECT_EQ(1, std::count_if(
-                   output.node().begin(), output.node().end(),
-                   [](const NodeDef& node) { return node.op() == "Reshape"; }));
+  EXPECT_EQ(1, CountOpNodes(output, "Reshape"));
 }
 
 TEST_F(ArithmeticOptimizerTest, NotIdentityReshapeTooManyUnknownDimSizes) {
@@ -676,9 +781,7 @@ TEST_F(ArithmeticOptimizerTest, NotIdentityReshapeTooManyUnknownDimSizes) {
   item.graph.Swap(&output);
   TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
 
-  EXPECT_EQ(1, std::count_if(
-                   output.node().begin(), output.node().end(),
-                   [](const NodeDef& node) { return node.op() == "Reshape"; }));
+  EXPECT_EQ(1, CountOpNodes(output, "Reshape"));
 }
 
 TEST_F(ArithmeticOptimizerTest, CombineReshapes) {
@@ -709,9 +812,7 @@ TEST_F(ArithmeticOptimizerTest, CombineReshapes) {
   item.graph.Swap(&output);
   TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
 
-  EXPECT_EQ(1, std::count_if(
-                   output.node().begin(), output.node().end(),
-                   [](const NodeDef& node) { return node.op() == "Reshape"; }));
+  EXPECT_EQ(1, CountOpNodes(output, "Reshape"));
 }
 
 TEST_F(ArithmeticOptimizerTest, ReorderTransposeCast) {
@@ -780,7 +881,7 @@ TEST_F(ArithmeticOptimizerTest, NoReorderTransposeCast) {
   EXPECT_EQ(1, num_transposes);
 }
 
-TEST_F(ArithmeticOptimizerTest, RemoveInverseTransposes) {
+TEST_F(ArithmeticOptimizerTest, RemoveIdentityTransposes) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
   Output inputs_shape =
       ops::Const(s.WithOpName("inputs_shape"), {8, 3, 28, 28}, {4});
@@ -788,30 +889,32 @@ TEST_F(ArithmeticOptimizerTest, RemoveInverseTransposes) {
       ops::RandomUniform(s.WithOpName("inputs"), inputs_shape, DT_FLOAT);
   Output perm1 = ops::Const(s.WithOpName("perm1"), {0, 2, 3, 1}, {4});
   Output perm2 = ops::Const(s.WithOpName("perm2"), {0, 3, 1, 2}, {4});
+  Output perm3 = ops::Const(s.WithOpName("perm3"), {0, 1, 2, 3}, {4});
   Output transpose1 = ops::Transpose(s.WithOpName("transpose1"), inputs, perm1);
   Output transpose2 =
       ops::Transpose(s.WithOpName("transpose2"), transpose1, perm2);
-  Output outputs = ops::Identity(s.WithOpName("outputs"), transpose2);
+  Output transpose3 = ops::Transpose(s.WithOpName("transpose3"), inputs, perm3);
+  Output id1 = ops::Identity(s.WithOpName("id1"), transpose2);
+  Output id2 = ops::Identity(s.WithOpName("id2"), transpose3);
 
   GrapplerItem item;
-  item.fetch = {"outputs"};
+  item.fetch = {"id1", "id2"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
-
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveIdentityTranspose(&optimizer);
+  OptimizeAndPrune(&optimizer, &item, &output);
 
   std::set<string> nodes_after_optimization;
   for (const NodeDef& node : output.node()) {
     nodes_after_optimization.insert(node.name());
   }
   EXPECT_EQ(nodes_after_optimization,
-            std::set<string>({"inputs_shape", "inputs", "outputs"}));
+            std::set<string>({"id1", "id2", "inputs_shape", "inputs"}));
 }
 
-TEST_F(ArithmeticOptimizerTest, RemoveInverseTransposesMultipleOutputs) {
+TEST_F(ArithmeticOptimizerTest, RemoveIdentityTransposesMultipleOutputs) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
   Output inputs_shape =
       ops::Const(s.WithOpName("inputs_shape"), {8, 9, 28, 28}, {4});
@@ -831,10 +934,9 @@ TEST_F(ArithmeticOptimizerTest, RemoveInverseTransposesMultipleOutputs) {
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
-
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveIdentityTranspose(&optimizer);
+  OptimizeAndPrune(&optimizer, &item, &output);
 
   for (const NodeDef& node : output.node()) {
     if (node.op() == "Concat") {
@@ -858,10 +960,11 @@ TEST_F(ArithmeticOptimizerTest, RemoveTransposesWithControlDependency) {
   GrapplerItem item;
   item.fetch = {"outputs"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveIdentityTranspose(&optimizer);
+  OptimizeAndPrune(&optimizer, &item, &output);
 
   NodeMap node_map(&output);
   const NodeDef* outputs_node = node_map.GetNode("outputs");
@@ -887,10 +990,9 @@ TEST_F(ArithmeticOptimizerTest, NotRemoveTransposes) {
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
-
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveIdentityTranspose(&optimizer);
+  OptimizeAndPrune(&optimizer, &item, &output);
 
   EXPECT_EQ(6, output.node_size());
 }
@@ -1105,10 +1207,10 @@ TEST_F(ArithmeticOptimizerTest, OptimizeMultipleMulTransposeConv) {
 
 TEST_F(ArithmeticOptimizerTest, CombineBitcasts) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
-  Output inputs =
-      ops::Placeholder(s, DT_UINT8, ops::Placeholder::Shape({2, 3}));
-  Output bc1 = ops::Bitcast(s, inputs, DT_QINT8);
-  Output bc2 = ops::Bitcast(s, bc1, DT_INT8);
+  Output inputs = ops::Placeholder(s.WithOpName("inputs"), DT_UINT8,
+                                   ops::Placeholder::Shape({2, 3}));
+  Output bc1 = ops::Bitcast(s.WithOpName("bc1"), inputs, DT_QINT8);
+  Output bc2 = ops::Bitcast(s.WithOpName("bc2"), bc1, DT_INT8);
   Output outputs = ops::Identity(s.WithOpName("outputs"), bc2);
 
   GrapplerItem item;
@@ -1116,18 +1218,22 @@ TEST_F(ArithmeticOptimizerTest, CombineBitcasts) {
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveRedundantBitcast(&optimizer);
 
-  EXPECT_EQ(1, std::count_if(
-                   output.node().begin(), output.node().end(),
-                   [](const NodeDef& node) { return node.op() == "Bitcast"; }));
+  OptimizeAndPrune(&optimizer, &item, &output);
+  NodeMap node_map(&output);
+
+  // Bitcasts combined into a single op and inputs redirected to updated Bitcast
+  EXPECT_EQ(3, output.node_size());
+  EXPECT_EQ(1, CountOpNodes(output, "Bitcast"));
+  EXPECT_TRUE(IsNodesDirectlyConnected(node_map, "inputs", "bc2"));
 }
 
 TEST_F(ArithmeticOptimizerTest, CombineAndRemoveBitcasts) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
-  Output inputs = ops::Placeholder(s, DT_INT8, ops::Placeholder::Shape({2, 3}));
+  Output inputs = ops::Placeholder(s.WithOpName("inputs"), DT_INT8,
+                                   ops::Placeholder::Shape({2, 3}));
   Output bc1 = ops::Bitcast(s, inputs, DT_QINT8);
   Output bc2 = ops::Bitcast(s, bc1, DT_INT8);
   Output outputs = ops::Identity(s.WithOpName("outputs"), bc2);
@@ -1135,35 +1241,339 @@ TEST_F(ArithmeticOptimizerTest, CombineAndRemoveBitcasts) {
   GrapplerItem item;
   item.fetch = {"outputs"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveRedundantBitcast(&optimizer);
+
+  OptimizeAndPrune(&optimizer, &item, &output);
+  NodeMap node_map(&output);
 
-  EXPECT_EQ(0, std::count_if(
-                   output.node().begin(), output.node().end(),
-                   [](const NodeDef& node) { return node.op() == "Bitcast"; }));
+  // Bitcasts removed and inputs redirected to outputs
+  EXPECT_EQ(2, output.node_size());
+  EXPECT_EQ(0, CountOpNodes(output, "Bitcast"));
+  EXPECT_TRUE(IsNodesDirectlyConnected(node_map, "inputs", "outputs"));
 }
 
 TEST_F(ArithmeticOptimizerTest, RemoveRedundantCast) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
-  Output inputs = ops::Placeholder(s, DT_INT8, ops::Placeholder::Shape({2, 3}));
+  Output inputs = ops::Placeholder(s.WithOpName("inputs"), DT_INT8,
+                                   ops::Placeholder::Shape({2, 3}));
   Output cast = ops::Cast(s, inputs, DT_INT8);
   Output outputs = ops::Identity(s.WithOpName("outputs"), cast);
 
   GrapplerItem item;
   item.fetch = {"outputs"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveRedundantCast(&optimizer);
 
-  EXPECT_EQ(0, std::count_if(
-                   output.node().begin(), output.node().end(),
-                   [](const NodeDef& node) { return node.op() == "Cast"; }));
+  OptimizeAndPrune(&optimizer, &item, &output);
+  NodeMap node_map(&output);
+
+  // Cast removed and inputs redirected to outputs
+  EXPECT_EQ(2, output.node_size());
+  EXPECT_EQ(0, CountOpNodes(output, "Cast"));
+  EXPECT_TRUE(IsNodesDirectlyConnected(node_map, "inputs", "outputs"));
+}
+
+TEST_F(ArithmeticOptimizerTest, AddOpsRewrite_AddOpsOfIdenticalShape) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  tensorflow::Scope sx = s.NewSubScope("x");
+  tensorflow::Scope sy = s.NewSubScope("y");
+
+  auto a = ops::Variable(s.WithOpName("a"), {2, 2}, DT_FLOAT);
+  auto b = ops::Variable(s.WithOpName("b"), {2, 2}, DT_FLOAT);
+  auto c = ops::Variable(s.WithOpName("c"), {2, 2}, DT_FLOAT);
+  auto add_ab = ops::Add(sx.WithOpName("Add_ab"), a, b);
+  auto add_abc = ops::Add(sy.WithOpName("Add_abc"), add_ab, c);
+
+  auto outputs = ops::Identity(s.WithOpName("outputs"), add_abc);
+
+  GrapplerItem item;
+  item.fetch = {"outputs"};
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
+  GraphDef output;
+  ArithmeticOptimizer optimizer;
+  EnableOnlyAddToAddNCombining(&optimizer);
+
+  OptimizeAndPrune(&optimizer, &item, &output);
+
+  // We expect the following rewrite(s) to occur:
+  //
+  //     +
+  //    / \
+  //   +   c      -->    AddN(a, b, c)
+  //  / \
+  // a   b
+  EXPECT_EQ(5, output.node_size());
+
+  NodeMap node_map(&output);
+
+  // check add tree was replaced with AddN
+  const NodeDef* collapsed_add =
+      node_map.GetNode("y/ArithmeticOptimizer/AddOpsRewrite_Add_abc_Add_ab");
+  ASSERT_TRUE(collapsed_add != nullptr);
+
+  EXPECT_EQ("AddN", collapsed_add->op());
+  EXPECT_EQ(3, collapsed_add->input_size());
+  EXPECT_EQ("a", collapsed_add->input(0));
+  EXPECT_EQ("b", collapsed_add->input(1));
+  EXPECT_EQ("c", collapsed_add->input(2));
+
+  // check output was re-wired to new node
+  const NodeDef* updated_outputs = node_map.GetNode("outputs");
+  ASSERT_TRUE(updated_outputs != nullptr);
+
+  EXPECT_EQ(collapsed_add->name(), updated_outputs->input(0));
+}
+
+TEST_F(ArithmeticOptimizerTest, AddOpsRewrite_MultiplePasses) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
+  auto a = ops::Variable(s.WithOpName("a"), {2, 2}, DT_FLOAT);
+  auto b = ops::Variable(s.WithOpName("b"), {2, 2}, DT_FLOAT);
+  auto c = ops::Variable(s.WithOpName("c"), {2, 2}, DT_FLOAT);
+  auto add_ab = ops::Add(s.WithOpName("Add_ab"), a, b);
+  auto add_abc = ops::Add(s.WithOpName("Add_abc"), add_ab, c);
+
+  auto x = ops::Variable(s.WithOpName("x"), {2, 2}, DT_FLOAT);
+  auto y = ops::Variable(s.WithOpName("y"), {2, 2}, DT_FLOAT);
+  auto z = ops::Variable(s.WithOpName("z"), {2, 2}, DT_FLOAT);
+  auto add_xy = ops::Add(s.WithOpName("Add_xy"), x, y);
+  auto add_xyz = ops::Add(s.WithOpName("Add_xyz"), add_xy, z);
+
+  auto mul = ops::Multiply(s.WithOpName("Mul"), add_abc, add_xyz);
+  auto outputs = ops::Identity(s.WithOpName("outputs"), mul);
+
+  GrapplerItem item;
+  item.fetch = {"outputs"};
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
+  GraphDef output;
+  ArithmeticOptimizer optimizer;
+  EnableOnlyAddToAddNCombining(&optimizer);
+
+  OptimizeAndPrune(&optimizer, &item, &output);
+
+  // We expect the following rewrite(s) to occur:
+  //
+  //         *
+  //      /     \
+  //     +       +                        *
+  //    / \     / \                    /     \
+  //   +   c   x   + -->    AddN(a, b, c)  AddN(x, y, z)
+  //  / \         / \
+  // a   b       y   z
+  EXPECT_EQ(10, output.node_size());
+
+  NodeMap node_map(&output);
+
+  // check left Add subtree replaced with AddN
+  const NodeDef* collapsed_left =
+      node_map.GetNode("ArithmeticOptimizer/AddOpsRewrite_Add_abc_Add_ab");
+  ASSERT_TRUE(collapsed_left != nullptr);
+
+  EXPECT_EQ("AddN", collapsed_left->op());
+  EXPECT_EQ(3, collapsed_left->input_size());
+  EXPECT_EQ("a", collapsed_left->input(0));
+  EXPECT_EQ("b", collapsed_left->input(1));
+  EXPECT_EQ("c", collapsed_left->input(2));
+
+  // check right Add subtree replaced with AddN
+  const NodeDef* collapsed_right =
+      node_map.GetNode("ArithmeticOptimizer/AddOpsRewrite_Add_xyz_Add_xy");
+  ASSERT_TRUE(collapsed_right != nullptr);
+
+  EXPECT_EQ("AddN", collapsed_right->op());
+  EXPECT_EQ(3, collapsed_right->input_size());
+  EXPECT_EQ("x", collapsed_right->input(0));
+  EXPECT_EQ("y", collapsed_right->input(1));
+  EXPECT_EQ("z", collapsed_right->input(2));
+
+  // check that Mul inputs re-wired to new Nodes
+  const NodeDef* updated_mul = node_map.GetNode("Mul");
+  ASSERT_TRUE(updated_mul != nullptr);
+
+  EXPECT_EQ("Mul", updated_mul->op());
+  EXPECT_EQ(2, updated_mul->input_size());
+  EXPECT_EQ(collapsed_left->name(), updated_mul->input(0));
+  EXPECT_EQ(collapsed_right->name(), updated_mul->input(1));
+}
+
+TEST_F(ArithmeticOptimizerTest, AddOpsRewrite_AddInputMultipleTimes) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
+  auto a = ops::Variable(s.WithOpName("a"), {2, 2}, DT_FLOAT);
+  auto b = ops::Variable(s.WithOpName("b"), {2, 2}, DT_FLOAT);
+  auto c = ops::Variable(s.WithOpName("c"), {2, 2}, DT_FLOAT);
+  auto add_ab = ops::Add(s.WithOpName("Add_ab"), a, b);
+  auto add_bc = ops::Add(s.WithOpName("Add_bc"), b, c);
+  auto add_all = ops::Add(s.WithOpName("Add_all"), add_ab, add_bc);
+  auto outputs = ops::Identity(s.WithOpName("outputs"), add_all);
+
+  GrapplerItem item;
+  item.fetch = {"outputs"};
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
+  GraphDef output;
+  ArithmeticOptimizer optimizer;
+  EnableOnlyAddToAddNCombining(&optimizer);
+
+  OptimizeAndPrune(&optimizer, &item, &output);
+
+  // We expect the following rewrite(s) to occur:
+  //
+  //     +
+  //    / \
+  //   +   +     -->    AddN(a, b, b, c)
+  //  / \ / \                   ^
+  // a   b   c                  b added twice!
+  EXPECT_EQ(5, output.node_size());
+
+  NodeMap node_map(&output);
+
+  // check Add tree replaced with AddN
+  const NodeDef* collapsed_add = node_map.GetNode(
+      "ArithmeticOptimizer/AddOpsRewrite_Add_all_Add_ab_Add_bc");
+  ASSERT_TRUE(collapsed_add != nullptr);
+
+  EXPECT_EQ("AddN", collapsed_add->op());
+  EXPECT_EQ(4, collapsed_add->input_size());
+  EXPECT_EQ("a", collapsed_add->input(0));
+  EXPECT_EQ("b", collapsed_add->input(1));
+  EXPECT_EQ("b", collapsed_add->input(2));
+  EXPECT_EQ("c", collapsed_add->input(3));
+}
+
+TEST_F(ArithmeticOptimizerTest, AddOpsRewrite_AddOpsOfSymbolicallyEqualShape) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
+  // unknown input shape propagated symbolically through the graph
+  auto input = ops::Variable(s.WithOpName("input"), {-1, 2}, DT_FLOAT);
+
+  // [a, b, c] have symbolically equal shapes
+  auto a = ops::Sqrt(s.WithOpName("a"), input);
+  auto b = ops::Square(s.WithOpName("b"), input);
+  auto c = ops::Round(s.WithOpName("c"), input);
+
+  // [add_ab, add_abc] shape must be inferred from inputs
+  auto add_ab = ops::Add(s.WithOpName("Add_ab"), a, b);
+  auto add_abc = ops::Add(s.WithOpName("Add_abc"), add_ab, c);
+
+  auto outputs = ops::Identity(s.WithOpName("outputs"), add_abc);
+
+  GrapplerItem item;
+  item.fetch = {"outputs"};
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
+  GraphDef output;
+  ArithmeticOptimizer optimizer;
+  EnableOnlyAddToAddNCombining(&optimizer);
+
+  OptimizeAndPrune(&optimizer, &item, &output);
+
+  // We expect the following rewrite(s) to occur:
+  //
+  //     +
+  //    / \
+  //   +   c      -->    AddN(a, b, c)
+  //  / \
+  // a   b
+  EXPECT_EQ(6, output.node_size());
+
+  NodeMap node_map(&output);
+
+  // check add tree was replaced with AddN
+  const NodeDef* collapsed_add =
+      node_map.GetNode("ArithmeticOptimizer/AddOpsRewrite_Add_abc_Add_ab");
+  ASSERT_TRUE(collapsed_add != nullptr);
+  EXPECT_EQ("AddN", collapsed_add->op());
+  EXPECT_EQ(3, collapsed_add->input_size());
+  EXPECT_EQ("a", collapsed_add->input(0));
+  EXPECT_EQ("b", collapsed_add->input(1));
+  EXPECT_EQ("c", collapsed_add->input(2));
+
+  // check output was re-wired to new node
+  const NodeDef* updated_outputs = node_map.GetNode("outputs");
+  ASSERT_TRUE(updated_outputs != nullptr);
+  EXPECT_EQ(collapsed_add->name(), updated_outputs->input(0));
+}
+
+TEST_F(ArithmeticOptimizerTest, RemoveNegation) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  auto x = ops::Variable(s.WithOpName("x"), {2, 2}, DT_FLOAT);
+  auto y = ops::Variable(s.WithOpName("y"), {2, 2}, DT_FLOAT);
+  Output neg_x = ops::Neg(s.WithOpName("Neg_x"), x);
+  Output neg_y = ops::Neg(s.WithOpName("Neg_y"), y);
+  Output add_x_y = ops::Add(s.WithOpName("Add_x_y"), x, y);
+  Output add_negx_y = ops::Add(s.WithOpName("Add_negx_y"), neg_x, y);
+  Output add_x_negy = ops::Add(s.WithOpName("Add_x_negy"), x, neg_y);
+  Output add_negx_negy = ops::Add(s.WithOpName("Add_negx_negy"), neg_x, neg_y);
+  Output sub_x_y = ops::Sub(s.WithOpName("Sub_x_y"), x, y);
+  Output sub_negx_y = ops::Sub(s.WithOpName("Sub_negx_y"), neg_x, y);
+  Output sub_x_negy = ops::Sub(s.WithOpName("Sub_x_negy"), x, neg_y);
+  Output sub_negx_negy = ops::Sub(s.WithOpName("Sub_negx_negy"), neg_x, neg_y);
+  auto add_all = ops::AddN(s.WithOpName("add_all"),
+                           {add_x_y, add_negx_y, add_x_negy, add_negx_negy,
+                            sub_x_y, sub_negx_y, sub_x_negy, sub_negx_negy});
+
+  GrapplerItem item;
+  item.fetch = {"add_all"};
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
+  GraphDef output;
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveNegation(&optimizer);
+  OptimizeAndPrune(&optimizer, &item, &output);
+
+  EXPECT_EQ(item.graph.node_size(), output.node_size());
+  int found = 0;
+  for (int i = 0; i < output.node_size(); ++i) {
+    const NodeDef& node = output.node(i);
+    if (node.name() == "Add_negx_y") {
+      ++found;
+      EXPECT_EQ("Sub", node.op());
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("y", node.input(0));
+      EXPECT_EQ("x", node.input(1));
+      EXPECT_EQ("^Neg_x", node.input(2));
+    } else if (node.name() == "Add_x_negy") {
+      ++found;
+      EXPECT_EQ("Sub", node.op());
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("y", node.input(1));
+      EXPECT_EQ("^Neg_y", node.input(2));
+    } else if (node.name() == "Add_negx_negy") {
+      ++found;
+      EXPECT_EQ("Sub", node.op());
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("Neg_y", node.input(0));
+      EXPECT_EQ("x", node.input(1));
+      EXPECT_EQ("^Neg_x", node.input(2));
+    } else if (node.name() == "Sub_x_negy") {
+      ++found;
+      EXPECT_EQ("Add", node.op());
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("y", node.input(1));
+      EXPECT_EQ("^Neg_y", node.input(2));
+    } else if (node.name() == "Sub_negx_negy") {
+      ++found;
+      EXPECT_EQ("Sub", node.op());
+      EXPECT_EQ(4, node.input_size());
+      EXPECT_EQ("y", node.input(0));
+      EXPECT_EQ("x", node.input(1));
+      EXPECT_EQ("^Neg_y", node.input(2));
+      EXPECT_EQ("^Neg_x", node.input(3));
+    }
+  }
+  EXPECT_EQ(5, found);
 }
 
-}  // namespace
 }  // namespace grappler
 }  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/constant_folding.cc b/tensorflow/core/grappler/optimizers/constant_folding.cc
index 182e03f04e205f4426db716b1ac29fe18c8acc7e..bdec73e69ecb3b1f2c3a9fc73bb0a7e5293efdd5 100644
--- a/tensorflow/core/grappler/optimizers/constant_folding.cc
+++ b/tensorflow/core/grappler/optimizers/constant_folding.cc
@@ -31,11 +31,14 @@ limitations under the License.
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/op_types.h"
 #include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/gtl/cleanup.h"
 #include "tensorflow/core/lib/gtl/inlined_vector.h"
 #include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/denormal.h"
 #include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/setround.h"
 #include "tensorflow/core/platform/tensor_coding.h"
 #include "tensorflow/core/public/version.h"
 #include "tensorflow/core/util/bcast.h"
@@ -51,7 +54,14 @@ class EigenThreadPoolWrapper : public Eigen::ThreadPoolInterface {
   explicit EigenThreadPoolWrapper(thread::ThreadPool* pool) : pool_(pool) {}
   ~EigenThreadPoolWrapper() override {}
   void Schedule(std::function<void()> fn) override {
-    pool_->Schedule(std::move(fn));
+    auto wrapped = [=]() {
+      // TensorFlow flushes denormals to zero and rounds to nearest, so we do
+      // the same here.
+      port::ScopedFlushDenormal flush;
+      port::ScopedSetRound round(FE_TONEAREST);
+      fn();
+    };
+    pool_->Schedule(std::move(wrapped));
   }
   int NumThreads() const override { return pool_->NumThreads(); }
   int CurrentThreadId() const override { return pool_->CurrentThreadId(); }
@@ -131,20 +141,20 @@ bool AllValuesAre(const TensorProto& tensor, const T& value) {
 // Add new_input as a control input to node if it does not already depend on it.
 // TODO(rmlarsen): Move the following two utility functions to utils.{h,cc} and
 // clean up code that should be using them.
-bool MaybeAddControlInput(const string& new_input, NodeDef* node,
+bool MaybeAddControlInput(const string& ctrl_input, NodeDef* node,
                           GraphDef* graph, NodeMap* node_map) {
   bool already_exists = false;
   for (const string& input : node->input()) {
-    if (input == new_input || AsControlDependency(input) == new_input) {
+    if (input == ctrl_input || AsControlDependency(input) == ctrl_input) {
       already_exists = true;
       break;
     }
   }
   if (!already_exists) {
     const string ctrl_dep =
-        ConstantFolding::AddControlDependency(new_input, graph, node_map);
+        ConstantFolding::AddControlDependency(ctrl_input, graph, node_map);
     node->add_input(ctrl_dep);
-    node_map->AddOutput(NodeName(new_input), node->name());
+    node_map->AddOutput(NodeName(ctrl_input), node->name());
   }
   return !already_exists;
 }
@@ -152,16 +162,27 @@ bool MaybeAddControlInput(const string& new_input, NodeDef* node,
 // Remove old_input as a control input to node.
 bool MaybeRemoveControlInput(const string& old_input, NodeDef* node,
                              GraphDef* graph, NodeMap* node_map) {
+  bool removed_input = false;
+  bool update_node_map = true;
+  const string old_input_ctrl_dep = AsControlDependency(NodeName(old_input));
   for (int i = 0; i < node->input_size(); ++i) {
     const string& input = node->input(i);
-    if (IsControlInput(input) && AsControlDependency(old_input) == input) {
-      node->mutable_input()->SwapElements(i, node->input_size() - 1);
-      node->mutable_input()->RemoveLast();
-      node_map->RemoveOutput(NodeName(old_input), node->name());
-      return true;
+    if (old_input_ctrl_dep == input) {
+      if (IsControlInput(input)) {
+        node->mutable_input()->SwapElements(i, node->input_size() - 1);
+        node->mutable_input()->RemoveLast();
+        removed_input = true;
+      } else {
+        // There is a non-control input from the same node.
+        // Don't remove the output from the NodeMap.
+        update_node_map = false;
+      }
     }
   }
-  return false;
+  if (update_node_map) {
+    node_map->RemoveOutput(NodeName(old_input), node->name());
+  }
+  return removed_input;
 }
 
 }  // namespace
@@ -224,44 +245,41 @@ string ConstantFolding::AddControlDependency(const string& input_name,
   }
 }
 
-Status ConvertShapeToConstant(const string& op, const DataType& type,
-                              const PartialTensorShape& shp, Tensor* value) {
+// Puts the given value into the tensor at the given "flat" index.
+static Status PutValueIntoTensor(const int64 value, const DataType& type,
+                                 const int index, Tensor* tensor) {
+  if (type == DT_INT32) {
+    if (value >= INT_MAX) {
+      return Status(error::INVALID_ARGUMENT, "int32 overflow");
+    }
+    tensor->flat<int32>()(index) = static_cast<int32>(value);
+  } else {
+    tensor->flat<int64>()(index) = value;
+  }
+  return Status::OK();
+}
+
+// Writes the given tensor shape into the given tensor.
+// Op is assumed to be Shape, ShapeN, Size or Rank.
+static Status ConvertShapeToConstant(const string& op, const DataType& type,
+                                     const PartialTensorShape& shp,
+                                     Tensor* tensor) {
   if (op == "Shape" || op == "ShapeN") {
-    *value = Tensor(type, TensorShape({shp.dims()}));
+    *tensor = Tensor(type, TensorShape({shp.dims()}));
     for (int i = 0; i < shp.dims(); ++i) {
-      if (type == DT_INT32) {
-        if (shp.dim_size(i) >= INT_MAX) {
-          return Status(error::INVALID_ARGUMENT, "Invalid dimension size");
-        }
-        value->flat<int32>()(i) = shp.dim_size(i);
-      } else {
-        value->flat<int64>()(i) = shp.dim_size(i);
-      }
+      TF_RETURN_IF_ERROR(PutValueIntoTensor(shp.dim_size(i), type, i, tensor));
     }
   } else if (op == "Size") {
     int64 size = 1;
     for (int i = 0; i < shp.dims(); ++i) {
       size *= shp.dim_size(i);
     }
-    *value = Tensor(type, TensorShape({}));
-    if (type == DT_INT32) {
-      if (size >= INT_MAX) {
-        return Status(error::INVALID_ARGUMENT, "Invalid dimension size");
-      }
-      value->flat<int32>()(0) = size;
-    } else {
-      value->flat<int64>()(0) = size;
-    }
+    *tensor = Tensor(type, TensorShape({}));
+    TF_RETURN_IF_ERROR(PutValueIntoTensor(size, type, 0, tensor));
   } else {
-    *value = Tensor(type, TensorShape({}));
-    if (type == DT_INT32) {
-      if (shp.dims() >= INT_MAX) {
-        return Status(error::INVALID_ARGUMENT, "Invalid dimension size");
-      }
-      value->flat<int32>()(0) = shp.dims();
-    } else {
-      value->flat<int64>()(0) = shp.dims();
-    }
+    CHECK_EQ(op, "Rank");
+    *tensor = Tensor(type, TensorShape({}));
+    TF_RETURN_IF_ERROR(PutValueIntoTensor(shp.dims(), type, 0, tensor));
   }
   return Status::OK();
 }
@@ -286,99 +304,129 @@ bool ConstantFolding::IsReallyConstant(const NodeDef& node) const {
   return feed_nodes_.find(node.name()) == feed_nodes_.end();
 }
 
+// Materialize the shapes using constants whenever possible.
 Status ConstantFolding::MaterializeShapes(const GraphProperties& properties) {
-  // We may add some nodes to the graph to encode control dependencies: there is
-  // no need to process these, so only iterate over the nodes of the input
-  // graph.
+  // We may add some nodes to the graph to encode control dependencies and hold
+  // the materialized shapes: there is no need to process these added nodes, so
+  // only iterate over the nodes of the input graph.
   const int node_count = graph_->node_size();
-  for (int i = 0; i < node_count; ++i) {
-    NodeDef& node = *graph_->mutable_node(i);
-    const string op = node.op();
+  for (int node_idx = 0; node_idx < node_count; ++node_idx) {
+    NodeDef* node = graph_->mutable_node(node_idx);
+    const string op = node->op();
     if (op != "Shape" && op != "Size" && op != "Rank" && op != "ShapeN") {
       continue;
     }
 
     const std::vector<OpInfo::TensorProperties>& output =
-        properties.GetOutputProperties(node.name());
+        properties.GetOutputProperties(node->name());
     const std::vector<OpInfo::TensorProperties>& input =
-        properties.GetInputProperties(node.name());
+        properties.GetInputProperties(node->name());
     if (input.empty() || output.empty()) {
       continue;
     }
+
     if (op == "Shape" || op == "Size" || op == "Rank") {
       CHECK_EQ(1, output.size());
       CHECK_EQ(1, input.size());
+
+      const DataType type = output[0].dtype();
+      CHECK(type == DT_INT32 || type == DT_INT64);
+      const PartialTensorShape shape(input[0].shape());
+
+      if ((op != "Rank" && !shape.IsFullyDefined()) ||
+          (op == "Rank" && shape.unknown_rank())) {
+        continue;
+      }
+
+      Tensor constant_value(type);
+      if (!ConvertShapeToConstant(op, type, shape, &constant_value).ok()) {
+        continue;
+      }
+
+      // Repurpose the existing node to be the constant.
+      // Device placement is preserved.
+      node->set_op("Const");
+      node->clear_attr();
+      (*node->mutable_attr())["dtype"].set_type(type);
+      constant_value.AsProtoTensorContent(
+          (*node->mutable_attr())["value"].mutable_tensor());
+
+      // Turn the data input into a control dependency: this is needed to
+      // ensure that the constant value will only be run in the
+      // cases where the shape/rank/size would have been run in
+      // the original graph.
+      string ctrl_dep =
+          AddControlDependency(node->input(0), graph_, node_map_.get());
+      node->set_input(0, ctrl_dep);
+      node_map_->AddOutput(NodeName(ctrl_dep), node->name());
+
+      // Done with the Shape/Size/Rank node, move to the next node.
+      continue;
     }
-    CHECK_EQ(input.size(), output.size());
 
-    for (int j = 0; j < output.size(); ++j) {
-      const DataType type = output[j].dtype();
+    // Handle ShapeN materialization case.
+    // It's possible that not all input tensors have known shapes.
+    CHECK_EQ(op, "ShapeN");
+    CHECK_EQ(input.size(), output.size());
+    const NodeDef* const shape_n_node = node;
+    for (int port_idx = 0; port_idx < output.size(); ++port_idx) {
+      const DataType type = output[port_idx].dtype();
       CHECK(type == DT_INT32 || type == DT_INT64);
-      const TensorShapeProto shape = input[j].shape();
-      // Materialize the shapes using constants whenever possible.
-      PartialTensorShape shp(shape);
-      if (shp.IsFullyDefined() || (!shp.unknown_rank() && op == "Rank")) {
-        Tensor value(type);
-        auto status = ConvertShapeToConstant(op, type, shp, &value);
-        if (!status.ok()) {
-          continue;
-        }
-        // We rewrite the existing node for the first const output and
-        // create new nodes for the remaining const outputs (Note that ShapeN
-        // could have multiple outputs).
-        if (op == "Shape" || op == "Size" || op == "Rank") {
-          // Replace the node with the corresponding constant.
-          node.set_op("Const");
-          node.clear_attr();
-          (*node.mutable_attr())["dtype"].set_type(type);
-          value.AsProtoTensorContent(
-              (*node.mutable_attr())["value"].mutable_tensor());
-
-          // Turn the data input into a control dependency: this is needed to
-          // ensure that the constant value will only be run in the
-          // cases where the shape/rank/size would have been run in
-          // the original graph. Additional inputs are extra control
-          string ctrl_dep =
-              AddControlDependency(node.input(0), graph_, node_map_.get());
-          node.set_input(0, ctrl_dep);
-          node_map_->AddOutput(NodeName(ctrl_dep), node.name());
-        } else {
-          auto outputs = node_map_->GetOutputs(node.name());
-          for (const auto& output : outputs) {
-            for (int k = 0; k < output->input_size(); ++k) {
-              int port;
-              string node_name = ParseNodeName(output->input(k), &port);
-              if (node_name == node.name() && port == j) {
-                // Create a const node as ShapeN's output if not already.
-                const string const_name =
-                    OptimizedNodeName(node, strings::StrCat("-matshapes-", j));
-                if (node_map_->GetNode(const_name) == nullptr) {
-                  NodeDef* added_node = graph_->add_node();
-                  added_node->set_name(const_name);
-                  added_node->set_op("Const");
-                  added_node->set_device(node.device());
-                  node_map_->AddNode(added_node->name(), added_node);
-                  (*added_node->mutable_attr())["dtype"].set_type(type);
-                  value.AsProtoTensorContent(
-                      (*added_node->mutable_attr())["value"].mutable_tensor());
-                  // We add a control dependency to the original ShapeN node,
-                  // so that the node will only be run if all inputs of the
-                  // original ShapeN node are run.
-                  string ctrl_dep = AddControlDependency(node.name(), graph_,
-                                                         node_map_.get());
-                  *added_node->add_input() = ctrl_dep;
-                  node_map_->AddOutput(NodeName(ctrl_dep), added_node->name());
-                }
-                node_map_->UpdateInput(output->name(),
-                                       NodeName(output->input(k)), const_name);
-                *output->mutable_input(k) = const_name;
-              }
+      const PartialTensorShape shape(input[port_idx].shape());
+      if (!shape.IsFullyDefined()) {
+        continue;
+      }
+      Tensor constant_value(type);
+      auto status = ConvertShapeToConstant(op, type, shape, &constant_value);
+      if (!status.ok()) {
+        continue;
+      }
+
+      // Find all nodes consuming this shape and connect them through the new
+      // constant node instead.
+      auto outputs = node_map_->GetOutputs(shape_n_node->name());
+      for (NodeDef* output : outputs) {
+        // Track whether there are any direct edges left between shape_n_node
+        // and this output node after the transformation.
+        bool direct_edges_exist = false;
+        for (int k = 0; k < output->input_size(); ++k) {
+          int port;
+          const string node_name = ParseNodeName(output->input(k), &port);
+          if (node_name == shape_n_node->name() && port == port_idx) {
+            // Create a const node for this ShapeN output if none exists yet.
+            const string const_name = OptimizedNodeName(
+                *shape_n_node, strings::StrCat("-matshapes-", port_idx));
+            if (node_map_->GetNode(const_name) == nullptr) {
+              NodeDef* added_node = graph_->add_node();
+              added_node->set_name(const_name);
+              added_node->set_op("Const");
+              added_node->set_device(shape_n_node->device());
+              node_map_->AddNode(added_node->name(), added_node);
+              (*added_node->mutable_attr())["dtype"].set_type(type);
+              constant_value.AsProtoTensorContent(
+                  (*added_node->mutable_attr())["value"].mutable_tensor());
+              // We add a control dependency to the original ShapeN node,
+              // so that the node will only be run if all inputs of the
+              // original ShapeN node are run.
+              string ctrl_dep = AddControlDependency(shape_n_node->name(),
+                                                     graph_, node_map_.get());
+              *added_node->add_input() = ctrl_dep;
+              node_map_->AddOutput(NodeName(ctrl_dep), added_node->name());
             }
+            *output->mutable_input(k) = const_name;
+            node_map_->AddOutput(const_name, output->name());
+          }
+          if (node_name == shape_n_node->name() && port != port_idx) {
+            direct_edges_exist = true;
           }
         }
+        if (!direct_edges_exist) {
+          node_map_->RemoveOutput(node->name(), output->name());
+        }
       }
     }
   }
+
   return Status::OK();
 }
 
@@ -679,7 +727,7 @@ bool ConstantFolding::IsFoldable(const NodeDef& node) const {
       nodes_whitelist_.find(node.name()) == nodes_whitelist_.end()) {
     return false;
   }
-  // Skip control flow nodes, they can't be folded
+  // Skip control flow nodes, they can't be folded.
   if (ModifiesFrameInfo(node)) {
     return false;
   }
@@ -688,12 +736,16 @@ bool ConstantFolding::IsFoldable(const NodeDef& node) const {
     return false;
   }
 
-  // Skips ops that don't benefit from folding.
-  const string& op = node.op();
+  // Don't fold stateful ops such as TruncatedNormal.
+  if (!IsFreeOfSideEffect(node)) {
+    return false;
+  }
 
-  if (op.find("Placeholder") == 0) {
+  // Skips ops that don't benefit from folding.
+  if (IsPlaceholder(node)) {
     return false;
   }
+  const string& op = node.op();
   if (op.find("Save") != string::npos || op.find("Restore") != string::npos ||
       op.find("Reader") != string::npos) {
     return false;
@@ -701,20 +753,17 @@ bool ConstantFolding::IsFoldable(const NodeDef& node) const {
   if (op.find("Quantized") != string::npos || op.find("Sparse") == 0) {
     return false;
   }
-  if (node.attr().count("_XlaCompile") > 0) {
+  if (node.attr().count("_XlaCompile") > 0 &&
+      node.attr().at("_XlaCompile").b()) {
     return false;
   }
 
-  // Don't fold stateful ops such as TruncatedNormal.
   const OpDef* op_def = nullptr;
   Status status = OpRegistry::Global()->LookUpOpDef(node.op(), &op_def);
   if (!status.ok()) {
     return false;
   }
-  if (op_def->is_stateful()) {
-    return false;
-  }
-
+  // Don't fold ops without outputs.
   if (op_def->output_arg_size() == 0) {
     return false;
   }
@@ -779,8 +828,11 @@ Status CreateConstantTensorAttrValue(DataType type, double value,
     SET_TENSOR_VAL_CASE(DT_FLOAT, float, float);
     SET_TENSOR_VAL_CASE(DT_DOUBLE, double, double);
     SET_TENSOR_VAL_CASE(DT_INT64, int64, int64);
+    SET_TENSOR_VAL_CASE(DT_UINT64, int64, int64);
     SET_TENSOR_VAL_CASE(DT_INT32, int32, int);
+    SET_TENSOR_VAL_CASE(DT_UINT32, int32, int);
     SET_TENSOR_VAL_CASE(DT_INT16, int32, int);
+    SET_TENSOR_VAL_CASE(DT_UINT16, int32, int);
     SET_TENSOR_VAL_CASE(DT_INT8, int32, int);
     SET_TENSOR_VAL_CASE(DT_UINT8, int32, int);
     SET_TENSOR_VAL_CASE(DT_BOOL, bool, bool);
@@ -843,10 +895,16 @@ Status ConstantFolding::CreateNodeDef(const string& name,
         POPULATE_TENSOR_PROTO(tensor, t, double, double);
       case DT_INT64:
         POPULATE_TENSOR_PROTO(tensor, t, int64, int64);
+      case DT_UINT64:
+        POPULATE_TENSOR_PROTO(tensor, t, uint64, int64);
       case DT_INT32:
         POPULATE_TENSOR_PROTO(tensor, t, int32, int);
+      case DT_UINT32:
+        POPULATE_TENSOR_PROTO(tensor, t, uint32, int);
       case DT_INT16:
         POPULATE_TENSOR_PROTO(tensor, t, int16, int);
+      case DT_UINT16:
+        POPULATE_TENSOR_PROTO(tensor, t, uint16, int);
       case DT_INT8:
         POPULATE_TENSOR_PROTO(tensor, t, int8, int);
       case DT_UINT8:
@@ -1033,7 +1091,7 @@ Status ConstantFolding::FoldNode(NodeDef* node, GraphDef* output_graph) {
       node_map_->AddOutput(node->name(), const_index->name());
 
       auto outputs = node_map_->GetOutputs(node->name());
-      for (auto& output : outputs) {
+      for (NodeDef* output : outputs) {
         for (int i = 0; i < output->input_size(); i++) {
           int port;
           string node_name = ParseNodeName(output->input(i), &port);
@@ -1124,7 +1182,7 @@ Status ConstantFolding::FoldNode(NodeDef* node, GraphDef* output_graph) {
 
   if (const_nodes.size() > 1) {
     auto outputs = node_map_->GetOutputs(node->name());
-    for (const auto& output : outputs) {
+    for (NodeDef* output : outputs) {
       for (int i = 0; i < output->input_size(); i++) {
         int port;
         string node_name = ParseNodeName(output->input(i), &port);
@@ -1166,9 +1224,8 @@ Status ConstantFolding::FoldGraph(GraphDef* output) {
   std::unordered_set<string> processed_nodes;
   std::deque<NodeDef*> queue;
   for (int i = 0; i < graph_->node_size(); i++) {
-    auto node = graph_->mutable_node(i);
-    if (IsFoldable(*node)) {
-      queue.push_back(node);
+    if (IsFoldable(graph_->node(i))) {
+      queue.push_back(graph_->mutable_node(i));
     }
   }
   while (!queue.empty()) {
@@ -1203,8 +1260,8 @@ Status ConstantFolding::FoldGraph(GraphDef* output) {
   int last = output->node_size() - 1;
   for (int i = output->node_size() - 1; i >= 0; --i) {
     const NodeDef& node = output->node(i);
-    auto outputs = node_map_->GetOutputs(node.name());
-    if (outputs.empty()) {
+    auto fanout = node_map_->GetOutputs(node.name());
+    if (fanout.empty()) {
       output->mutable_node()->SwapElements(i, last);
       last--;
     }
@@ -1216,8 +1273,8 @@ Status ConstantFolding::FoldGraph(GraphDef* output) {
     // If no fetch nodes is provided, we conservatively
     // keep all nodes in the original graph in case users need to fetch
     // their values.
-    auto outputs = node_map_->GetOutputs(node.name());
-    if (!outputs.empty() || !has_fetch_ ||
+    auto fanout = node_map_->GetOutputs(node.name());
+    if (!fanout.empty() || !has_fetch_ ||
         nodes_to_preserve_.find(node.name()) != nodes_to_preserve_.end()) {
       auto added_node = output->add_node();
       *added_node = node;
@@ -1322,6 +1379,10 @@ bool ConstantFolding::IsOnes(const NodeDef& node) const {
   if (node.op() == "OnesLike") {
     return true;
   }
+  if (node.op() == "Fill") {
+    NodeDef* values = node_map_->GetNode(NodeName(node.input(1)));
+    return values != nullptr && IsOnes(*values);
+  }
   if (node.op() != "Const") {
     return false;
   }
@@ -1331,14 +1392,14 @@ bool ConstantFolding::IsOnes(const NodeDef& node) const {
     //    IS_ONES_CASE(DT_HALF);
     IS_ONES_CASE(DT_FLOAT);
     IS_ONES_CASE(DT_DOUBLE);
+    IS_ONES_CASE(DT_COMPLEX64);
+    IS_ONES_CASE(DT_COMPLEX128);
     IS_ONES_CASE(DT_UINT8);
     IS_ONES_CASE(DT_INT8);
     IS_ONES_CASE(DT_UINT16);
     IS_ONES_CASE(DT_INT16);
     IS_ONES_CASE(DT_INT32);
     IS_ONES_CASE(DT_INT64);
-    IS_ONES_CASE(DT_COMPLEX64);
-    IS_ONES_CASE(DT_COMPLEX128);
     default:
       VLOG(1) << "Unsupported type " << DataTypeString(dtype);
       return false;
@@ -1353,6 +1414,10 @@ bool ConstantFolding::IsZeros(const NodeDef& node) const {
   if (node.op() == "ZerosLike") {
     return true;
   }
+  if (node.op() == "Fill") {
+    NodeDef* values = node_map_->GetNode(NodeName(node.input(1)));
+    return values != nullptr && IsZeros(*values);
+  }
   if (!IsConstant(node)) {
     return false;
   }
@@ -1362,14 +1427,14 @@ bool ConstantFolding::IsZeros(const NodeDef& node) const {
     //    IS_ZEROS_CASE(DT_HALF);
     IS_ZEROS_CASE(DT_FLOAT);
     IS_ZEROS_CASE(DT_DOUBLE);
+    IS_ZEROS_CASE(DT_COMPLEX64);
+    IS_ZEROS_CASE(DT_COMPLEX128);
     IS_ZEROS_CASE(DT_UINT8);
     IS_ZEROS_CASE(DT_INT8);
     IS_ZEROS_CASE(DT_UINT16);
     IS_ZEROS_CASE(DT_INT16);
     IS_ZEROS_CASE(DT_INT32);
     IS_ZEROS_CASE(DT_INT64);
-    IS_ZEROS_CASE(DT_COMPLEX64);
-    IS_ZEROS_CASE(DT_COMPLEX128);
     default:
       VLOG(1) << "Unsupported type " << DataTypeString(dtype);
       return false;
@@ -1434,6 +1499,17 @@ void ConstantFolding::ReplaceDivisionOfOnesByReciprocal(NodeDef* node,
   graph_modified_ = true;
 }
 
+void ConstantFolding::ReplaceSubtractionFromZeroByNegation(NodeDef* node,
+                                                           GraphDef* graph) {
+  node->set_op("Neg");
+  node->mutable_input()->SwapElements(0, 1);
+  const string ctrl_dep =
+      AddControlDependency(node->input(1), graph, node_map_.get());
+  node_map_->UpdateInput(node->name(), node->input(1), ctrl_dep);
+  node->set_input(1, ctrl_dep);
+  graph_modified_ = true;
+}
+
 Status ConstantFolding::ReplaceOperationWithConstant(
     double value, const TensorShapeProto& shape, NodeDef* node,
     GraphDef* graph) {
@@ -1459,25 +1535,219 @@ Status ConstantFolding::ReplaceOperationWithConstant(
   return Status::OK();
 }
 
-Status ConstantFolding::SimplifyGraph(GraphDef* output,
-                                      const GraphProperties& properties,
+Status ConstantFolding::SimplifyGraph(GraphDef* optimized_graph,
+                                      GraphProperties* properties,
                                       bool use_shape_info) {
   const bool is_aggressive = opt_level_ == RewriterConfig::AGGRESSIVE;
-  for (int i = 0; i < output->node_size(); ++i) {
-    NodeDef* node = output->mutable_node(i);
+  for (int i = 0; i < optimized_graph->node_size(); ++i) {
+    NodeDef* node = optimized_graph->mutable_node(i);
+
     // Remove Shuffle or Reverse op over scalar values.
     if (use_shape_info &&
         (IsShuffle(*node) || IsReverse(*node) || IsTranspose(*node))) {
       const auto& shape =
-          properties.GetInputProperties(node->name())[0].shape();
+          properties->GetInputProperties(node->name())[0].shape();
       // The node is replaceable iff
       // unknown_rank == false && (dim_size == 0 || all dims have size 1)
       bool replaceable = !shape.unknown_rank();
-      for (int j = 0; j < shape.dim_size(); ++j) {
+      for (int j = 0; replaceable && j < shape.dim_size(); ++j) {
         replaceable &= shape.dim(j).size() == 1;
       }
       if (replaceable) {
-        ReplaceOperationWithIdentity(0, node, output);
+        ReplaceOperationWithIdentity(0, node, optimized_graph);
+        continue;
+      }
+    }
+
+    if (use_shape_info && IsSlice(*node) &&
+        properties->GetInputProperties(node->name()).size() == 3) {
+      const auto& input = properties->GetInputProperties(node->name())[0];
+      const auto& b = properties->GetInputProperties(node->name())[1];
+      const auto& s = properties->GetInputProperties(node->name())[2];
+      if (TensorShape::IsValid(b.shape()) && b.has_value() &&
+          TensorShape::IsValid(s.shape()) && s.has_value()) {
+        Tensor begin(b.dtype(), b.shape());
+        if (!begin.FromProto(b.value())) {
+          return errors::InvalidArgument("Cannot parse tensor from proto: ",
+                                         b.value().DebugString());
+        }
+        Tensor size(s.dtype(), s.shape());
+        if (!size.FromProto(s.value())) {
+          return errors::InvalidArgument("Cannot parse tensor from proto: ",
+                                         s.value().DebugString());
+        }
+        // The node is replaceable iff unknown_rank == false &&
+        // begin == 0 && (size == -1 || size == input_shape) for all dimensions
+        bool replaceable = !input.shape().unknown_rank();
+        for (int j = 0; replaceable && j < input.shape().dim_size(); ++j) {
+          if (begin.dtype() == DT_INT32) {
+            replaceable &= begin.vec<int>()(j) == 0;
+          } else {
+            replaceable &= begin.vec<int64>()(j) == 0;
+          }
+          if (size.dtype() == DT_INT32) {
+            replaceable &= (size.vec<int>()(j) == -1 ||
+                            size.vec<int>()(j) == input.shape().dim(j).size());
+          } else {
+            replaceable &=
+                (size.vec<int64>()(j) == -1 ||
+                 size.vec<int64>()(j) == input.shape().dim(j).size());
+          }
+        }
+        if (replaceable) {
+          ReplaceOperationWithIdentity(0, node, optimized_graph);
+          continue;
+        }
+      }
+    }
+
+    if (use_shape_info && IsTile(*node) &&
+        properties->GetInputProperties(node->name()).size() == 2) {
+      const auto& m = properties->GetInputProperties(node->name())[1];
+      if (TensorShape::IsValid(m.shape()) && m.has_value()) {
+        Tensor multiplies(m.dtype(), m.shape());
+        if (!multiplies.FromProto(m.value())) {
+          return errors::InvalidArgument("Cannot parse tensor from proto: ",
+                                         m.value().DebugString());
+        }
+        // The node is replaceable iff all values in multiplies are 1.
+        bool replaceable = true;
+        if (multiplies.dtype() == DT_INT32) {
+          for (int j = 0; replaceable && j < multiplies.vec<int>().size();
+               ++j) {
+            replaceable &= multiplies.vec<int>()(j) == 1;
+          }
+        } else {
+          for (int j = 0; replaceable && j < multiplies.vec<int64>().size();
+               ++j) {
+            replaceable &= multiplies.vec<int64>()(j) == 1;
+          }
+        }
+        if (replaceable) {
+          ReplaceOperationWithIdentity(0, node, optimized_graph);
+          continue;
+        }
+      }
+    }
+
+    if (use_shape_info && IsPad(*node) &&
+        properties->GetInputProperties(node->name()).size() >= 2) {
+      const auto& p = properties->GetInputProperties(node->name())[1];
+      if (TensorShape::IsValid(p.shape()) && p.has_value()) {
+        Tensor paddings(p.dtype(), p.shape());
+        if (!paddings.FromProto(p.value())) {
+          return errors::InvalidArgument("Cannot parse tensor from proto: ",
+                                         p.value().DebugString());
+        }
+        // The node is replaceable iff all values in paddings are 0.
+        bool replaceable = true;
+        // The operation requires the paddings to be int32 values, so we
+        // don't check for int64.
+        const auto flatten = paddings.flat<int32>();
+        for (int j = 0; replaceable && j < flatten.size(); ++j) {
+          replaceable &= flatten(j) == 0;
+        }
+        if (replaceable) {
+          ReplaceOperationWithIdentity(0, node, optimized_graph);
+          continue;
+        }
+      }
+    }
+
+    if (use_shape_info && IsSqueeze(*node) &&
+        !properties->GetInputProperties(node->name()).empty()) {
+      // https://www.tensorflow.org/api_docs/python/tf/squeeze states that it
+      // is an error to squeeze a dimension that is not 1, so we only need to
+      // check whether every dimension of the input has size > 1.
+      const auto& shape =
+          properties->GetInputProperties(node->name())[0].shape();
+      // The node is replaceable iff
+      // unknown_rank == false && (dim_size == 0 || all dims have size > 1)
+      bool replaceable = !shape.unknown_rank();
+      for (int j = 0; replaceable && j < shape.dim_size(); ++j) {
+        replaceable &= shape.dim(j).size() > 1;
+      }
+      if (replaceable) {
+        ReplaceOperationWithIdentity(0, node, optimized_graph);
+        continue;
+      }
+    }
+
+    if (IsPack(*node) && NumNonControlInputs(*node) == 1 &&
+        !OptimizedNodeExists(*node, "_const_axis")) {
+      // Create constant axis node.
+      Tensor axis_t(DT_INT32, TensorShape({}));
+      NodeDef* axis_node = optimized_graph->add_node();
+      axis_node->set_name(OptimizedNodeName(*node, "_const_axis"));
+      const int axis = node->attr().at("axis").i();
+      if (!SetTensorValue(DT_INT32, axis, &axis_t).ok() ||
+          !CreateNodeDef(axis_node->name(), TensorValue(&axis_t), axis_node)
+               .ok()) {
+        continue;
+      }
+      // Add a control dependency to make sure axis_node is in the right frame.
+      const string ctrl_dep = ConstantFolding::AddControlDependency(
+          node->input(0), graph_, node_map_.get());
+      axis_node->add_input(ctrl_dep);
+      axis_node->set_device(node->device());
+      node->set_op("ExpandDims");
+      if (node->attr().count("axis") != 0) {
+        node->mutable_attr()->erase("axis");
+      }
+      if (node->attr().count("N") != 0) {
+        node->mutable_attr()->erase("N");
+      }
+      (*node->mutable_attr())["Tdim"].set_type(DT_INT32);
+      node->add_input(axis_node->name());
+      if (node->input_size() > 2) {
+        node->mutable_input()->SwapElements(1, node->input_size() - 1);
+      }
+      graph_modified_ = true;
+      continue;
+    }
+
+    // Move constants past Enter.
+    // TODO(rmlarsen): Reenable when we fix the root cause of b/76008022
+    if (opt_level_ == RewriterConfig::AGGRESSIVE && IsEnter(*node) &&
+        node->input_size() > 0) {
+      const string& node_name = node->name();
+      const NodeDef* input = node_map_->GetNode(node->input(0));
+      if (input != nullptr && IsReallyConstant(*input) &&
+          !OptimizedNodeExists(*input, "_enter")) {
+        auto fanouts = node_map_->GetOutputs(node_name);
+        // Find non-constant nodes that consume the output of *node.
+        std::vector<NodeDef*> consumers;
+        for (NodeDef* fanout : fanouts) {
+          if (!IsConstant(*fanout)) {
+            for (int i = 0; i < fanout->input_size(); ++i) {
+              if (fanout->input(i) == node_name) {
+                consumers.push_back(fanout);
+                break;
+              }
+            }
+          }
+        }
+        if (!consumers.empty()) {
+          NodeDef* new_node = optimized_graph->add_node();
+          *new_node = *input;
+          new_node->set_name(OptimizedNodeName(*input, "_enter"));
+          new_node->set_device(node->device());
+          new_node->clear_input();
+          new_node->add_input(AsControlDependency(node_name));
+          node_map_->AddNode(new_node->name(), new_node);
+          node_map_->AddOutput(node_name, new_node->name());
+          for (NodeDef* consumer : consumers) {
+            for (int i = 0; i < consumer->input_size(); ++i) {
+              if (consumer->input(i) == node_name) {
+                node_map_->UpdateInput(consumer->name(), node_name,
+                                       new_node->name());
+                consumer->set_input(i, new_node->name());
+              }
+            }
+          }
+          graph_modified_ = true;
+          continue;
+        }
       }
     }
 
@@ -1530,7 +1800,7 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
                     return n1->name() < n2->name();
                   });
         // Create constant false & true nodes.
-        NodeDef* false_node = output->add_node();
+        NodeDef* false_node = optimized_graph->add_node();
         false_node->set_name(OptimizedNodeName(*node, "_const_false"));
         if (!CreateNodeDef(false_node->name(), TensorValue(&false_t),
                            false_node)
@@ -1539,7 +1809,7 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
         }
         false_node->set_device(node->device());
 
-        NodeDef* true_node = output->add_node();
+        NodeDef* true_node = optimized_graph->add_node();
         true_node->set_name(OptimizedNodeName(*node, "_const_true"));
         if (!CreateNodeDef(true_node->name(), TensorValue(&true_t), true_node)
                  .ok()) {
@@ -1552,10 +1822,10 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
         const string false_port = node->name();
         const string true_port = strings::StrCat(node->name(), ":1");
         const string false_ctrl_dep =
-            AddControlDependency(false_port, output, node_map_.get());
+            AddControlDependency(false_port, optimized_graph, node_map_.get());
         false_node->add_input(false_ctrl_dep);
         const string true_ctrl_dep =
-            AddControlDependency(true_port, output, node_map_.get());
+            AddControlDependency(true_port, optimized_graph, node_map_.get());
         true_node->add_input(true_ctrl_dep);
 
         node_map_->AddNode(false_node->name(), false_node);
@@ -1598,7 +1868,7 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
       graph_modified_ = true;
       continue;
     }
-    if (use_shape_info && IsSimplifiableReshape(*node, properties)) {
+    if (use_shape_info && IsSimplifiableReshape(*node, *properties)) {
       DataType output_type = node->attr().at("T").type();
       node->set_op("Identity");
       node->clear_attr();
@@ -1616,8 +1886,8 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
     // Simplify arithmetic operations with ones or zeros.
     if (use_shape_info &&
         (is_mul || is_matmul || is_add || is_sub || is_any_div) &&
-        properties.HasInputProperties(node->name()) &&
-        properties.HasOutputProperties(node->name())) {
+        properties->HasInputProperties(node->name()) &&
+        properties->HasOutputProperties(node->name())) {
       const NodeDef* x = node_map_->GetNode(node->input(0));
       const NodeDef* y = node_map_->GetNode(node->input(1));
       if (x == nullptr || y == nullptr) {
@@ -1625,20 +1895,25 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
                                        node->DebugString());
       }
       const TensorShapeProto& output_shape =
-          properties.GetOutputProperties(node->name())[0].shape();
+          properties->GetOutputProperties(node->name())[0].shape();
 
       // Simplify element-wise multiplication by ones or addition/subtraction
       // of zeros.
       const TensorShapeProto& y_shape =
-          properties.GetInputProperties(node->name())[1].shape();
+          properties->GetInputProperties(node->name())[1].shape();
       const bool x_is_zero = IsZeros(*x);
-      const bool x_is_one = IsOnes(*x);
+      const bool x_is_one = x_is_zero ? false : IsOnes(*x);
       const bool y_matches_output_shape = ShapesEqual(output_shape, y_shape);
       if (y_matches_output_shape &&
           ((is_mul && x_is_one) || (is_add && x_is_zero))) {
-        // TODO(rmlarsen): Handle subtraction 0 - y.
         // 1 * y = y or 0 + y = y.
-        ReplaceOperationWithSnapshot(1, node, output);
+        ReplaceOperationWithSnapshot(1, node, optimized_graph);
+        continue;
+      }
+
+      if (y_matches_output_shape && (is_sub && x_is_zero)) {
+        // Replace 0 - y with Neg(y).
+        ReplaceSubtractionFromZeroByNegation(node, optimized_graph);
         continue;
       }
 
@@ -1646,20 +1921,20 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
       if (y_matches_output_shape && is_any_div && x_is_one) {
         DataType type = node->attr().at("T").type();
         if (DataTypeIsFloating(type) || DataTypeIsComplex(type)) {
-          ReplaceDivisionOfOnesByReciprocal(node, output);
+          ReplaceDivisionOfOnesByReciprocal(node, optimized_graph);
           continue;
         }
       }
 
       const TensorShapeProto& x_shape =
-          properties.GetInputProperties(node->name())[0].shape();
+          properties->GetInputProperties(node->name())[0].shape();
       const bool y_is_zero = IsZeros(*y);
-      const bool y_is_one = IsOnes(*y);
+      const bool y_is_one = y_is_zero ? false : IsOnes(*y);
       const bool x_matches_output_shape = ShapesEqual(output_shape, x_shape);
       if (x_matches_output_shape && (((is_mul || is_any_div) && y_is_one) ||
                                      ((is_add || is_sub) && y_is_zero))) {
         // x * 1 = x or x / 1 = x or x +/- 0 = x
-        ReplaceOperationWithSnapshot(0, node, output);
+        ReplaceOperationWithSnapshot(0, node, optimized_graph);
         continue;
       }
 
@@ -1672,18 +1947,18 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
           (is_mul || is_matmul || optimize_zeros_divided_by_y)) {
         const PartialTensorShape shp(output_shape);
         if (shp.IsFullyDefined()) {
-          TF_RETURN_IF_ERROR(
-              ReplaceOperationWithConstant(0, output_shape, node, output));
+          TF_RETURN_IF_ERROR(ReplaceOperationWithConstant(0, output_shape, node,
+                                                          optimized_graph));
           continue;
         }
         // Even if an input shape is only partially known, we may known that it
         // matches the output shape and thus forward the corresponding zero
         // input.
         if ((is_mul || is_any_div) && x_is_zero && x_matches_output_shape) {
-          ReplaceOperationWithIdentity(0, node, output);
+          ReplaceOperationWithIdentity(0, node, optimized_graph);
           continue;
         } else if (is_mul && y_is_zero && y_matches_output_shape) {
-          ReplaceOperationWithIdentity(1, node, output);
+          ReplaceOperationWithIdentity(1, node, optimized_graph);
           continue;
         }
       }
@@ -1708,7 +1983,7 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
         continue;
       }
       // Insert new reciprocal op and change node from Div to Mul.
-      NodeDef* reciprocal_node = output->add_node();
+      NodeDef* reciprocal_node = optimized_graph->add_node();
       reciprocal_node->set_name(OptimizedNodeName(*node, "_recip"));
       reciprocal_node->set_op("Reciprocal");
       reciprocal_node->set_device(node->device());
@@ -1721,6 +1996,7 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
       node_map_->UpdateOutput(node->name(), const_input,
                               reciprocal_node->name());
       graph_modified_ = true;
+      continue;
     }
 
     // Consider the transformation
@@ -1806,6 +2082,248 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
       std::swap(*node->mutable_input(parent_const_input),
                 *op_child_node->mutable_input(non_const_leaf_input));
       graph_modified_ = true;
+      continue;
+    }
+
+    // Partial constant propagation through IdentityN.
+    if (IsIdentityN(*node) && NumNonControlInputs(*node) > 0) {
+      const std::set<NodeDef*>& tmp = node_map_->GetOutputs(node->name());
+      const std::vector<NodeDef*> consumers(tmp.begin(), tmp.end());
+      bool updated_graph = false;
+      for (int input_idx = 0; input_idx < node->input_size(); ++input_idx) {
+        const string& input = node->input(input_idx);
+        if (IsControlInput(input)) {
+          break;
+        }
+        const NodeDef* input_node = node_map_->GetNode(NodeName(input));
+        if (input_node == nullptr) {
+          LOG(ERROR) << "Bad input: " << input;
+          break;
+        }
+        // Forward constant inputs to outputs and add a control dependency on
+        // the IdentityN node.
+        if (IsReallyConstant(*input_node)) {
+          // Update each consumer.
+          for (NodeDef* consumer : consumers) {
+            bool add_dep = false;
+            for (int consumer_input_idx = 0;
+                 consumer_input_idx < consumer->input_size();
+                 ++consumer_input_idx) {
+              const string& consumer_input =
+                  consumer->input(consumer_input_idx);
+              if (IsControlInput(consumer_input)) {
+                break;
+              }
+              int output_idx;
+              const string input_node_name =
+                  ParseNodeName(consumer_input, &output_idx);
+              if (input_node_name == node->name() && output_idx == input_idx) {
+                consumer->set_input(consumer_input_idx, input);
+                // We will keep the input from IdentityN through a control
+                // dependency, so we only need to add the consumer as an output
+                // for the constant input node.
+                node_map_->AddOutput(NodeName(input), consumer->name());
+                add_dep = true;
+              }
+            }
+            if (add_dep) {
+              consumer->add_input(AsControlDependency(node->name()));
+              updated_graph = true;
+            }
+          }
+        }
+      }
+
+      if (updated_graph) {
+        for (NodeDef* consumer : consumers) {
+          DedupControlInputs(consumer);
+        }
+        graph_modified_ = true;
+        continue;
+      }
+    }
+
+    // Partial constant folding for associative operators:
+    // Split AddN/AccumulateNV2 to enable partial
+    // folding of ops when more than one but not all inputs are constant.
+    // For AddN and AccumulateNV2, we may furthermore reorder inputs, since
+    // addition is commutative.
+    const int num_non_control_inputs = NumNonControlInputs(*node);
+    if (IsAggregate(*node) && IsCommutative(*node) &&
+        num_non_control_inputs > 2) {
+      const int num_control_inputs =
+          node->input_size() - num_non_control_inputs;
+      std::vector<int> const_inputs;
+      std::vector<int> nonconst_inputs;
+      for (int i = 0; i < node->input_size(); ++i) {
+        const string& input = node->input(i);
+        const NodeDef* input_node = node_map_->GetNode(NodeName(input));
+        CHECK(input_node != nullptr) << input;
+        if (!IsControlInput(input) && IsReallyConstant(*input_node)) {
+          const_inputs.push_back(i);
+        } else {
+          // Non-const and control inputs.
+          nonconst_inputs.push_back(i);
+        }
+      }
+      // Promote AccumulateNV2 with all constant inputs to AddN, since it is
+      // a fake node that cannot be constant folded by itself.
+      if (const_inputs.size() == num_non_control_inputs &&
+          node->op() == "AccumulateNV2") {
+        node->set_op("AddN");
+        node->mutable_attr()->erase("shape");
+        graph_modified_ = true;
+        continue;
+      }
+      const string new_node_name = OptimizedNodeName(
+          *node, strings::StrCat("_partial_split_", const_inputs.size()));
+      if (1 < const_inputs.size() &&
+          const_inputs.size() < num_non_control_inputs &&
+          !node_map_->NodeExists(new_node_name)) {
+        NodeDef* added_node = optimized_graph->add_node();
+        *added_node = *node;
+        // Always use AddN for the constant node: AccumulateNV2 is a fake node
+        // without a kernel, so it cannot be constant folded.
+        added_node->set_op("AddN");
+        added_node->mutable_attr()->erase("shape");
+        added_node->set_name(new_node_name);
+        node_map_->AddNode(added_node->name(), added_node);
+        added_node->clear_input();
+        for (int i : const_inputs) {
+          added_node->add_input(node->input(i));
+          node_map_->UpdateOutput(NodeName(node->input(i)), node->name(),
+                                  added_node->name());
+        }
+
+        // Overwrite the first const input with the added node.
+        node->set_input(const_inputs[0], added_node->name());
+        node_map_->AddOutput(added_node->name(), node->name());
+        nonconst_inputs.push_back(const_inputs[0]);
+        // Compact the remaining inputs to the original node.
+        std::sort(nonconst_inputs.begin(), nonconst_inputs.end());
+        int idx = 0;
+        for (int i : nonconst_inputs) {
+          if (idx != i) {
+            node->set_input(idx, node->input(i));
+          }
+          ++idx;
+        }
+        node->mutable_input()->DeleteSubrange(nonconst_inputs.size(),
+                                              const_inputs.size() - 1);
+        (*node->mutable_attr())["N"].set_i(node->input_size() -
+                                           num_control_inputs);
+        properties->ClearInputProperties(node->name());
+        (*added_node->mutable_attr())["N"].set_i(const_inputs.size());
+        graph_modified_ = true;
+        continue;
+      }
+    }
+
+    // Partial constant folding for Concat which is not commutative, so
+    // we have to preserve order and can only push consecutive runs of constant
+    // inputs into sub-nodes.
+    if (IsConcat(*node) && num_non_control_inputs > 3 &&
+        node->name().rfind("_partial_split_") == string::npos) {
+      int axis_arg = -1;
+      int begin = 0;
+      int end = num_non_control_inputs;
+      if (node->op() == "Concat") {
+        begin = 1;
+        axis_arg = 0;
+      } else if (node->op() == "ConcatV2") {
+        end = num_non_control_inputs - 1;
+        axis_arg = num_non_control_inputs - 1;
+      } else {
+        continue;
+      }
+
+      const NodeDef* axis_arg_node =
+          node_map_->GetNode(NodeName(node->input(axis_arg)));
+      if (axis_arg_node == nullptr || !IsReallyConstant(*axis_arg_node)) {
+        // We cannot constant fold Concat unless the axis argument is
+        // constant. Skip node.
+        continue;
+      }
+
+      // We search for consecutive runs of constant inputs in the range
+      // [begin, end) and push them down into child nodes.
+      std::vector<std::pair<int, int>> constant_input_runs;
+      int first = begin;
+      int last = begin;
+      while (last < end) {
+        while (first < end && !IsReallyConstant(*node_map_->GetNode(
+                                  NodeName(node->input(first))))) {
+          ++first;
+        }
+        // Invariant: node[first] is constant || first >= end.
+        last = first + 1;
+        while (last < end && IsReallyConstant(*node_map_->GetNode(
+                                 NodeName(node->input(last))))) {
+          ++last;
+        }
+        // Invariant: node[last] is not constant || last >= end
+        // Discard intervals shorter than 2 elements.
+        if (first < end && (last - first) > 1) {
+          constant_input_runs.emplace_back(first, last);
+        }
+        first = last;
+      }
+
+      // Skip if all inputs are constant, and let constant folding take over.
+      if (constant_input_runs.size() == 1 &&
+          constant_input_runs[0].first == begin &&
+          constant_input_runs[0].second == end) {
+        continue;
+      }
+      std::set<int> inputs_to_delete;
+      for (auto interval : constant_input_runs) {
+        // Push the constant inputs in the interval to a child node that can be
+        // constant folded.
+        const string new_node_name = OptimizedNodeName(
+            *node, strings::StrCat("_partial_split_", interval.first));
+        if (node_map_->NodeExists(new_node_name)) {
+          break;
+        }
+        NodeDef* added_node = optimized_graph->add_node();
+        *added_node = *node;
+        added_node->set_name(new_node_name);
+        node_map_->AddNode(added_node->name(), added_node);
+        added_node->clear_input();
+        for (int i = interval.first; i < interval.second; ++i) {
+          added_node->add_input(node->input(i));
+          node_map_->UpdateOutput(NodeName(node->input(i)), node->name(),
+                                  added_node->name());
+          if (i != interval.first) {
+            inputs_to_delete.insert(i);
+          }
+        }
+        added_node->add_input(node->input(axis_arg));
+        (*added_node->mutable_attr())["N"].set_i(interval.second -
+                                                 interval.first);
+        node_map_->AddOutput(NodeName(node->input(axis_arg)),
+                             added_node->name());
+
+        // Overwrite the first constant input with the result of the added
+        // child node.
+        node->set_input(interval.first, added_node->name());
+        node_map_->AddOutput(added_node->name(), node->name());
+      }
+      if (!constant_input_runs.empty()) {
+        graph_modified_ = true;
+        if (!inputs_to_delete.empty()) {
+          // Fix up the inputs to the original node.
+          std::vector<string> tmp(node->input().begin(), node->input().end());
+          node->clear_input();
+          for (int i = 0; i < tmp.size(); ++i) {
+            if (inputs_to_delete.find(i) == inputs_to_delete.end()) {
+              node->add_input(tmp[i]);
+            }
+          }
+          (*node->mutable_attr())["N"].set_i(node->input_size() - 1);
+          properties->ClearInputProperties(node->name());
+        }
+        continue;
+      }
     }
   }
 
@@ -1814,7 +2332,7 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
 
 Status ConstantFolding::RunOptimizationPass(Cluster* cluster,
                                             const GrapplerItem& item,
-                                            GraphDef* output) {
+                                            GraphDef* optimized_graph) {
   node_map_.reset(new NodeMap(graph_));
   nodes_whitelist_.clear();
   // Fold fetch nodes iff it has a single fanout. Note that if a fetch node
@@ -1843,16 +2361,20 @@ Status ConstantFolding::RunOptimizationPass(Cluster* cluster,
     TF_RETURN_IF_ERROR(MaterializeShapes(properties));
     TF_RETURN_IF_ERROR(MaterializeConstants(properties));
   }
-
-  TF_RETURN_IF_ERROR(FoldGraph(output));
-  node_map_.reset(new NodeMap(output));
-  TF_RETURN_IF_ERROR(SimplifyGraph(output, properties, can_use_shape_info));
+  TF_RETURN_IF_ERROR(FoldGraph(optimized_graph));
+  node_map_.reset(new NodeMap(optimized_graph));
+  TF_RETURN_IF_ERROR(
+      SimplifyGraph(optimized_graph, &properties, can_use_shape_info));
 
   return Status::OK();
 }
 
 Status ConstantFolding::Optimize(Cluster* cluster, const GrapplerItem& item,
-                                 GraphDef* output) {
+                                 GraphDef* optimized_graph) {
+  // TensorFlow flushes denormals to zero and rounds to nearest, so we do
+  // the same here.
+  port::ScopedFlushDenormal flush;
+  port::ScopedSetRound round(FE_TONEAREST);
   nodes_to_preserve_ = item.NodesToPreserve();
   for (const auto& feed : item.feed) {
     feed_nodes_.insert(NodeName(feed.first));
@@ -1864,20 +2386,20 @@ Status ConstantFolding::Optimize(Cluster* cluster, const GrapplerItem& item,
   }
 
   has_fetch_ = !item.fetch.empty();
-
   GrapplerItem item_to_optimize = item;
-  *output = item.graph;
+  *optimized_graph = item.graph;
   int64 node_count;
   do {
     graph_modified_ = false;
-    item_to_optimize.graph.Swap(output);
+    item_to_optimize.graph.Swap(optimized_graph);
     graph_ = &item_to_optimize.graph;
-    *output = GraphDef();
+    *optimized_graph = GraphDef();
     node_count = graph_->node_size();
-    TF_RETURN_IF_ERROR(RunOptimizationPass(cluster, item_to_optimize, output));
-  } while (graph_modified_ || output->node_size() != node_count);
-  *output->mutable_library() = item.graph.library();
-  *output->mutable_versions() = item.graph.versions();
+    TF_RETURN_IF_ERROR(
+        RunOptimizationPass(cluster, item_to_optimize, optimized_graph));
+  } while (graph_modified_ || optimized_graph->node_size() != node_count);
+  *optimized_graph->mutable_library() = item.graph.library();
+  *optimized_graph->mutable_versions() = item.graph.versions();
 
   return Status::OK();
 }
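
The Concat rewrite in the file above pushes consecutive runs of constant inputs into child nodes. The run-finding step can be illustrated in isolation. This is a minimal sketch under stated assumptions, not the code from this change: `FindConstantRuns` is a hypothetical helper, and the `is_const` vector stands in for the per-input `IsReallyConstant` lookups done through the node map.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper, not part of the change above.
// is_const[i] stands in for IsReallyConstant(input i).
// Returns half-open index ranges [first, last) of consecutive constant
// inputs within [begin, end), keeping only runs of two or more inputs.
std::vector<std::pair<int, int>> FindConstantRuns(
    const std::vector<bool>& is_const, int begin, int end) {
  std::vector<std::pair<int, int>> runs;
  int first = begin;
  while (first < end) {
    // Skip non-constant inputs.
    while (first < end && !is_const[first]) ++first;
    if (first >= end) break;
    // Extend the run while the inputs stay constant.
    int last = first + 1;
    while (last < end && is_const[last]) ++last;
    // Runs of a single constant give nothing to fold, so drop them.
    if (last - first > 1) runs.emplace_back(first, last);
    first = last;
  }
  return runs;
}
```

In the change itself, `begin` and `end` exclude the axis argument (the first input for `Concat`, the last for `ConcatV2`), and a single run covering the whole range is skipped so that ordinary constant folding can handle it.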
diff --git a/tensorflow/core/grappler/optimizers/constant_folding.h b/tensorflow/core/grappler/optimizers/constant_folding.h
index 232b2f9fa05d86877e681c8eff3606725cf6fdb9..b6645d335e3074f188e5ab67a27fb11975fa14ed 100644
--- a/tensorflow/core/grappler/optimizers/constant_folding.h
+++ b/tensorflow/core/grappler/optimizers/constant_folding.h
@@ -38,7 +38,7 @@ class ConstantFolding : public GraphOptimizer {
   static string AddControlDependency(const string& input_name, GraphDef* graph,
                                      NodeMap* node_map);
 
-  ConstantFolding(DeviceBase* cpu_device);
+  explicit ConstantFolding(DeviceBase* cpu_device);
   ConstantFolding(RewriterConfig::Toggle opt_level, DeviceBase* cpu_device);
 
   ~ConstantFolding() override {}
@@ -82,6 +82,7 @@ class ConstantFolding : public GraphOptimizer {
                                     GraphDef* graph);
   void ReplaceOperationWithSnapshot(int input_to_forward, NodeDef* node,
                                     GraphDef* graph);
+  void ReplaceSubtractionFromZeroByNegation(NodeDef* node, GraphDef* graph);
   Status ReplaceOperationWithConstant(double value,
                                       const TensorShapeProto& shape,
                                       NodeDef* node, GraphDef* graph);
@@ -91,7 +92,7 @@ class ConstantFolding : public GraphOptimizer {
   bool IsSimplifiableReduction(const NodeDef& node) const;
   bool IsSimplifiableReshape(const NodeDef& node,
                              const GraphProperties& properties) const;
-  Status SimplifyGraph(GraphDef* output, const GraphProperties& properties,
+  Status SimplifyGraph(GraphDef* output, GraphProperties* properties,
                        bool use_shape_info);
 
   Status RunOptimizationPass(Cluster* cluster, const GrapplerItem& item,
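
The newly declared `ReplaceSubtractionFromZeroByNegation` handles the `0 - y` case alongside the existing rewrites for a neutral left operand. The decision order in `SimplifyGraph` can be summarized as a small table. This is a hedged sketch only: `BinaryOp` and `RewriteWhenXIsNeutral` are hypothetical names, the returned strings are just labels for the resulting op, and the later cases in `SimplifyGraph` (neutral right operand, multiplication by zero) are deliberately left out.

```cpp
#include <string>

// Hypothetical op tags for this sketch; the real code uses predicates such
// as is_mul, is_add, is_sub and is_any_div.
enum class BinaryOp { kMul, kAdd, kSub, kDiv };

// Returns a label for the op the node would be rewritten to when the left
// operand x is a neutral element, or "" if none of these cases applies.
std::string RewriteWhenXIsNeutral(BinaryOp op, bool x_is_zero, bool x_is_one,
                                  bool y_matches_output_shape,
                                  bool type_is_float_or_complex) {
  // All of these rewrites forward y, so y must already have the output shape.
  if (!y_matches_output_shape) return "";
  // 1 * y = y or 0 + y = y.
  if ((op == BinaryOp::kMul && x_is_one) ||
      (op == BinaryOp::kAdd && x_is_zero)) {
    return "Snapshot";
  }
  // 0 - y = Neg(y).
  if (op == BinaryOp::kSub && x_is_zero) return "Neg";
  // 1 / y = Reciprocal(y), only for floating point or complex types.
  if (op == BinaryOp::kDiv && x_is_one && type_is_float_or_complex) {
    return "Reciprocal";
  }
  return "";
}
```

The `NeutralElement` test in this change checks the corresponding outcomes on a real graph, for example `Snapshot` for `mul4` and `Reciprocal` for `div2`.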
diff --git a/tensorflow/core/grappler/optimizers/constant_folding_test.cc b/tensorflow/core/grappler/optimizers/constant_folding_test.cc
index 219f3bd5ec2a1c15078972bdea69a7642bb4af46..914a9257ee0d0e86133264bdf9752e6375d39e9a 100644
--- a/tensorflow/core/grappler/optimizers/constant_folding_test.cc
+++ b/tensorflow/core/grappler/optimizers/constant_folding_test.cc
@@ -43,9 +43,9 @@ TEST_F(ConstantFoldingTest, SimpleFolding) {
   item.fetch.push_back("d");
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   EXPECT_EQ(1, output.node_size());
@@ -89,9 +89,9 @@ TEST_F(ConstantFoldingTest, AddTree) {
   item.fetch = {"add_parent", "mul_parent", "addmul_parent"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   // We expect the following rewrite(s) to occur:
@@ -152,7 +152,10 @@ TEST_F(ConstantFoldingTest, AddTree) {
 }
 
 TEST_F(ConstantFoldingTest, NeutralElement) {
-  for (bool use_const : {true, false}) {
+  const int kConst = 0;
+  const int kLike = 1;
+  const int kFill = 2;
+  for (int const_type : {kConst, kLike, kFill}) {
     tensorflow::Scope s = tensorflow::Scope::NewRootScope();
     Output x = ops::Placeholder(s.WithOpName("x"), DT_FLOAT,
                                 ops::Placeholder::Shape(TensorShape({2, 2})));
@@ -164,11 +167,19 @@ TEST_F(ConstantFoldingTest, NeutralElement) {
                                 ops::Placeholder::Shape(TensorShape({2, 3})));
     Output bias = ops::Placeholder(s.WithOpName("bias"), DT_FLOAT,
                                    ops::Placeholder::Shape(TensorShape({2})));
-    Output zeros = !use_const ? ops::ZerosLike(s.WithOpName("zeros"), x)
-                              : ops::Const(s.WithOpName("zeros"), 0.0f, {2, 2});
     Output zeros_1d = ops::Const(s.WithOpName("zeros_1d"), 0.0f, {2});
-    Output ones = !use_const ? ops::OnesLike(s.WithOpName("ones"), x)
-                             : ops::Const(s.WithOpName("ones"), 1.0f, {2, 2});
+    Output zeros_const = ops::Const(s.WithOpName("zeros_const"), 0.0f, {2, 2});
+    Output zeros_like = ops::ZerosLike(s.WithOpName("zeros_like"), x);
+    Output zeros_fill = ops::Fill(s.WithOpName("zeros_fill"), {2, 2}, 0.0f);
+    Output zeros = const_type == kConst
+                       ? zeros_const
+                       : (const_type == kLike ? zeros_like : zeros_fill);
+    Output ones_const = ops::Const(s.WithOpName("ones_const"), 1.0f, {2, 2});
+    Output ones_like = ops::OnesLike(s.WithOpName("ones_like"), x);
+    Output ones_fill = ops::Fill(s.WithOpName("ones_fill"), {2, 2}, 1.0f);
+    Output ones = const_type == kConst
+                      ? ones_const
+                      : (const_type == kLike ? ones_like : ones_fill);
     Output mul1 = ops::Mul(s.WithOpName("mul1"), x, zeros);
     Output mul2 = ops::Mul(s.WithOpName("mul2"), zeros, y);
     Output mul3 = ops::Mul(s.WithOpName("mul3"), x, ones);
@@ -187,19 +198,26 @@ TEST_F(ConstantFoldingTest, NeutralElement) {
     Output bias_add2 = ops::BiasAdd(s.WithOpName("bias_add2"), zeros, bias);
     Output sub1 = ops::Sub(s.WithOpName("sub1"), x, zeros);
     Output sub2 = ops::Sub(s.WithOpName("sub2"), zeros, y);
-    Output addn =
-        ops::AddN(s.WithOpName("addn"),
-                  {mul1, mul2, mul3, mul4, mul5, mul6, div1, div2, matmul1,
-                   matmul2, add1, add2, bias_add1, bias_add2, sub1, sub2});
+    Output stack =
+        ops::Stack(s.WithOpName("stack"),
+                   {mul1, mul2, mul3, mul4, mul5, mul6, div1, div2, matmul1,
+                    matmul2, add1, add2, bias_add1, bias_add2, sub1, sub2});
     GrapplerItem item;
     TF_CHECK_OK(s.ToGraphDef(&item.graph));
-    item.fetch = {"addn", "matmul3", "matmul4"};
+    item.fetch = {"stack", "matmul3", "matmul4"};
 
     ConstantFolding optimizer(nullptr /* cpu_device */);
     GraphDef output;
     Status status = optimizer.Optimize(nullptr, item, &output);
     TF_EXPECT_OK(status);
 
+    const string suffix =
+        (const_type == kConst ? "_const"
+                              : (const_type == kLike ? "_like" : "_fill"));
+    const string zeros_name = strings::StrCat("zeros", suffix);
+    const string ones_name = strings::StrCat("ones", suffix);
+    const string ctrl_zeros_name = strings::StrCat("^zeros", suffix);
+    const string ctrl_ones_name = strings::StrCat("^ones", suffix);
     EXPECT_EQ(27, output.node_size());
     for (int i = 0; i < output.node_size(); ++i) {
       const NodeDef& node = output.node(i);
@@ -207,19 +225,19 @@ TEST_F(ConstantFoldingTest, NeutralElement) {
       if (name == "mul1") {
         EXPECT_EQ("Const", node.op());
         EXPECT_EQ("^x", node.input(0));
-        EXPECT_EQ("^zeros", node.input(1));
+        EXPECT_EQ(ctrl_zeros_name, node.input(1));
       } else if (name == "mul2") {
         EXPECT_EQ("Const", node.op());
-        EXPECT_EQ("^zeros", node.input(0));
+        EXPECT_EQ(ctrl_zeros_name, node.input(0));
         EXPECT_EQ("^y", node.input(1));
       } else if (name == "mul3") {
         EXPECT_EQ("Snapshot", node.op());
         EXPECT_EQ("x", node.input(0));
-        EXPECT_EQ("^ones", node.input(1));
+        EXPECT_EQ(ctrl_ones_name, node.input(1));
       } else if (name == "mul4") {
         EXPECT_EQ("Snapshot", node.op());
         EXPECT_EQ("y", node.input(0));
-        EXPECT_EQ("^ones", node.input(1));
+        EXPECT_EQ(ctrl_ones_name, node.input(1));
       } else if (name == "mul5") {
         EXPECT_EQ("Const", node.op());
         EXPECT_EQ("^x", node.input(0));
@@ -231,23 +249,23 @@ TEST_F(ConstantFoldingTest, NeutralElement) {
       } else if (name == "div1") {
         EXPECT_EQ("Snapshot", node.op());
         EXPECT_EQ("x", node.input(0));
-        EXPECT_EQ("^ones", node.input(1));
+        EXPECT_EQ(ctrl_ones_name, node.input(1));
       } else if (name == "div2") {
         EXPECT_EQ("Reciprocal", node.op());
         EXPECT_EQ("y", node.input(0));
-        EXPECT_EQ("^ones", node.input(1));
+        EXPECT_EQ(ctrl_ones_name, node.input(1));
       } else if (name == "matmul1") {
         EXPECT_EQ("Const", node.op());
         EXPECT_EQ("^x", node.input(0));
-        EXPECT_EQ("^zeros", node.input(1));
+        EXPECT_EQ(ctrl_zeros_name, node.input(1));
       } else if (name == "matmul2") {
         EXPECT_EQ("Const", node.op());
-        EXPECT_EQ("^zeros", node.input(0));
+        EXPECT_EQ(ctrl_zeros_name, node.input(0));
         EXPECT_EQ("^y", node.input(1));
       } else if (name == "matmul3") {
         EXPECT_EQ("Const", node.op());
         EXPECT_EQ("^a", node.input(0));
-        EXPECT_EQ("^zeros", node.input(1));
+        EXPECT_EQ(ctrl_zeros_name, node.input(1));
         TensorProto t = node.attr().at("value").tensor();
         EXPECT_EQ(1, t.float_val_size());
         EXPECT_EQ(0, t.float_val(0));
@@ -256,7 +274,7 @@ TEST_F(ConstantFoldingTest, NeutralElement) {
         EXPECT_EQ(2, t.tensor_shape().dim(1).size());
       } else if (name == "matmul4") {
         EXPECT_EQ("Const", node.op());
-        EXPECT_EQ("^zeros", node.input(0));
+        EXPECT_EQ(ctrl_zeros_name, node.input(0));
         EXPECT_EQ("^b", node.input(1));
         TensorProto t = node.attr().at("value").tensor();
         EXPECT_EQ(1, t.float_val_size());
@@ -267,11 +285,11 @@ TEST_F(ConstantFoldingTest, NeutralElement) {
       } else if (name == "add1") {
         EXPECT_EQ("Snapshot", node.op());
         EXPECT_EQ("x", node.input(0));
-        EXPECT_EQ("^zeros", node.input(1));
+        EXPECT_EQ(ctrl_zeros_name, node.input(1));
       } else if (name == "add2") {
         EXPECT_EQ("Snapshot", node.op());
         EXPECT_EQ("y", node.input(0));
-        EXPECT_EQ("^zeros", node.input(1));
+        EXPECT_EQ(ctrl_zeros_name, node.input(1));
       } else if (name == "bias_add1") {
         EXPECT_EQ("Snapshot", node.op());
         EXPECT_EQ("x", node.input(0));
@@ -279,17 +297,16 @@ TEST_F(ConstantFoldingTest, NeutralElement) {
       } else if (name == "bias_add2") {
         // We don't eliminate this one, because it requires broadcasting.
         EXPECT_EQ("BiasAdd", node.op());
-        EXPECT_EQ("zeros", node.input(0));
+        EXPECT_EQ(zeros_name, node.input(0));
         EXPECT_EQ("bias", node.input(1));
       } else if (name == "sub1") {
         EXPECT_EQ("Snapshot", node.op());
         EXPECT_EQ("x", node.input(0));
-        EXPECT_EQ("^zeros", node.input(1));
+        EXPECT_EQ(ctrl_zeros_name, node.input(1));
       } else if (name == "sub2") {
-        // We don't handle this case yet.
-        EXPECT_EQ("Sub", node.op());
-        EXPECT_EQ("zeros", node.input(0));
-        EXPECT_EQ("y", node.input(1));
+        EXPECT_EQ("Neg", node.op());
+        EXPECT_EQ("y", node.input(0));
+        EXPECT_EQ(ctrl_zeros_name, node.input(1));
       }
       const std::set<string> square_zero_const{"mul1", "mul2",    "mul5",
                                                "mul6", "matmul1", "matmul2"};
@@ -415,7 +432,6 @@ TEST_F(ConstantFoldingTest, NeutralElement_PartialShape_UnknownOutputShape) {
   GraphDef output;
   Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
-  LOG(INFO) << output.DebugString();
 
   EXPECT_EQ(15, output.node_size());
   for (int i = 0; i < output.node_size(); ++i) {
@@ -509,9 +525,9 @@ TEST_F(ConstantFoldingTest, CreateConstNodes) {
 
   GrapplerItem item;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   EXPECT_EQ(24, output.node_size());
@@ -558,9 +574,9 @@ TEST_F(ConstantFoldingTest, FoldingNodeWithTwoOutputs) {
   item.fetch.push_back("f");
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   EXPECT_EQ(2, output.node_size());
@@ -599,9 +615,9 @@ TEST_F(ConstantFoldingTest, ControlDependencies) {
   item.fetch.push_back("e");
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   std::vector<string> expected_nodes = {"dflt", "p1", "p2", "e"};
@@ -642,9 +658,9 @@ TEST_F(ConstantFoldingTest, ControlDependenciesEmptyFetch) {
   GrapplerItem item;
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   std::vector<string> expected_nodes = {"dflt", "p1", "p2", "c",
@@ -699,9 +715,9 @@ TEST_F(ConstantFoldingTest, ControlDependenciesDeduplicate) {
   item.fetch.push_back("i2");
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   std::vector<string> expected_nodes = {"dflt", "p1", "p2", "i2"};
@@ -773,9 +789,9 @@ TEST_F(ConstantFoldingTest, VariableNumberOfOutputs) {
   }
 
   item.fetch = outputs;
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   int constant_folded = 0;
@@ -811,9 +827,9 @@ TEST_F(ConstantFoldingTest, ShapeMaterialization) {
   item.fetch.push_back("p2");
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   int found = 0;
@@ -850,9 +866,9 @@ TEST_F(ConstantFoldingTest, ShapeMaterializationEmptyFetch) {
   GrapplerItem item;
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   int found = 0;
@@ -904,9 +920,9 @@ TEST_F(ConstantFoldingTest, ShapeMaterializationShapeN) {
   GrapplerItem item;
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
   int found = 0;
   for (const auto& node : output.node()) {
@@ -948,6 +964,56 @@ TEST_F(ConstantFoldingTest, ShapeMaterializationShapeN) {
   EXPECT_EQ(9, found);
 }
 
+TEST_F(ConstantFoldingTest, ShapeMaterializationShapeN_MultipleOutputs) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+  Output v1 = ops::Variable(scope.WithOpName("v1"), {3, -1}, DT_FLOAT);
+  Output v2 = ops::Variable(scope.WithOpName("v2"), {4, 6}, DT_FLOAT);
+  auto s = ops::ShapeN(scope.WithOpName("s"), {v1, v2});
+  auto id_n = ops::IdentityN(scope.WithOpName("id_n"), {s[0], s[1]});
+  Output ia = ops::Identity(scope.WithOpName("ia"), id_n[0]);
+  Output ib = ops::Identity(scope.WithOpName("ib"), id_n[1]);
+
+  GrapplerItem item;
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+  item.fetch.push_back("ia");
+  item.fetch.push_back("ib");
+
+  ConstantFolding optimizer(nullptr /* cpu_device */);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  int found = 0;
+  for (const auto& node : output.node()) {
+    EXPECT_NE(AddPrefixToNodeName("s-matshapes-0", kConstantFoldingConst),
+              node.name());
+    if (node.name() == "s") {
+      ++found;
+      EXPECT_EQ("ShapeN", node.op());
+      EXPECT_EQ("v1", node.input(0));
+      EXPECT_EQ("v2", node.input(1));
+    }
+    if (node.name() == "id_n") {
+      ++found;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ("s", node.input(0));
+      EXPECT_EQ(AddPrefixToNodeName("s-matshapes-1", kConstantFoldingConst),
+                node.input(1));
+    }
+    if (node.name() == "ia") {
+      ++found;
+      EXPECT_EQ("id_n", node.input(0));
+    }
+    if (node.name() == "ib") {
+      ++found;
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ("^s", node.input(0));
+      EXPECT_EQ("^id_n", node.input(1));
+    }
+  }
+  EXPECT_EQ(4, found);
+}
+
 TEST_F(ConstantFoldingTest, SwitchNodesEmptyFetch) {
   tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
   ops::Variable v_in(scope.WithOpName("v_in"), {3}, DT_FLOAT);
@@ -973,9 +1039,9 @@ TEST_F(ConstantFoldingTest, SwitchNodesEmptyFetch) {
   GrapplerItem item;
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   std::set<string> present_nodes = {"v_in",     "v_ctrl",
@@ -1051,9 +1117,9 @@ TEST_F(ConstantFoldingTest, SwitchNodes) {
 
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
   std::set<string> present_nodes = {"v_in",     "v_ctrl",
                                     "switch",   "i",
@@ -1119,9 +1185,9 @@ TEST_F(ConstantFoldingTest, MergeNodes) {
   item.fetch = {"out1", "idx1", "out2", "idx2", "out3", "idx3"};
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   int found_nodes = 0;
@@ -1196,18 +1262,200 @@ TEST_F(ConstantFoldingTest, ShuffleReverseOnScalarRemoval) {
   item.fetch = {"out1", "out2"};
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef got;
-  Status status = fold.Optimize(nullptr, item, &got);
+  Status status = optimizer.Optimize(nullptr, item, &got);
   TF_EXPECT_OK(status);
 
   GraphDef want;
-  AddNode("in1", "VariableV2", {}, &want);
-  AddNode("in2", "VariableV2", {}, &want);
-  AddNode("s1", "Identity", {"in1"}, &want);
-  AddNode("s2", "Identity", {"in2", AsControlDependency("in1")}, &want);
-  AddNode("out1", "Add", {"s1", "s2"}, &want);
-  AddNode("out2", "Identity", {"s2"}, &want);
+  AddNode("in1", "VariableV2", {}, {}, &want);
+  AddNode("in2", "VariableV2", {}, {}, &want);
+  AddNode("s1", "Identity", {"in1"}, {}, &want);
+  AddNode("s2", "Identity", {"in2", AsControlDependency("in1")}, {}, &want);
+  AddNode("out1", "Add", {"s1", "s2"}, {}, &want);
+  AddNode("out2", "Identity", {"s2"}, {}, &want);
+
+  CompareGraphs(want, got);
+}
+
+TEST_F(ConstantFoldingTest, SliceWithSameDimensionRemoval) {
+  {  // size = {3, 5}
+    tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+
+    auto in1 = ops::Variable(scope.WithOpName("in1"), {3, 5}, DT_FLOAT);
+    auto begin = ops::Const(scope.WithOpName("begin"), {0, 0}, {2});
+    auto size = ops::Const(scope.WithOpName("size"), {3, 5}, {2});
+    Output in2 = ops::Variable(scope.WithOpName("in2"), {4, 6}, DT_FLOAT);
+    ops::Slice s1(scope.WithOpName("s1"), in1, begin, size);
+    ops::Slice s2(scope.WithOpName("s2"), in2, begin, size);
+
+    ops::Add out(scope.WithOpName("out"), s1, s2);
+
+    GrapplerItem item;
+    item.fetch = {"out"};
+    TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+
+    ConstantFolding optimizer(nullptr /* cpu_device */);
+    GraphDef got;
+    Status status = optimizer.Optimize(nullptr, item, &got);
+    TF_EXPECT_OK(status);
+
+    GraphDef want;
+    AddNode("in1", "VariableV2", {}, {}, &want);
+    AddNode("in2", "VariableV2", {}, {}, &want);
+    AddNode("begin", "Const", {}, {}, &want);
+    AddNode("size", "Const", {}, {}, &want);
+    AddNode("s1", "Identity",
+            {"in1", AsControlDependency("begin"), AsControlDependency("size")},
+            {}, &want);
+    AddNode("s2", "Slice", {"in2", "begin", "size"}, {}, &want);
+    AddNode("out", "Add", {"s1", "s2"}, {}, &want);
+
+    CompareGraphs(want, got);
+  }
+  {  // size = {-1, -1}
+    tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+
+    auto in1 =
+        ops::Variable(scope.WithOpName("in1"), {3, 5}, DataType::DT_FLOAT);
+    auto begin1 = ops::Const(scope.WithOpName("begin1"), {0, 0}, {2});
+    auto begin2 = ops::Const(scope.WithOpName("begin2"), {1, 1}, {2});
+    auto size = ops::Const(scope.WithOpName("size"), {-1, -1}, {2});
+    Output in2 =
+        ops::Variable(scope.WithOpName("in2"), {4, 6}, DataType::DT_FLOAT);
+    ops::Slice s1(scope.WithOpName("s1"), in1, begin1, size);
+    ops::Slice s2(scope.WithOpName("s2"), in2, begin2, size);
+
+    ops::Add out(scope.WithOpName("out"), s1, s2);
+
+    GrapplerItem item;
+    item.fetch = {"out"};
+    TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+
+    ConstantFolding optimizer(nullptr /* cpu_device */);
+    GraphDef got;
+    Status status = optimizer.Optimize(nullptr, item, &got);
+    TF_EXPECT_OK(status);
+
+    GraphDef want;
+    AddNode("in1", "VariableV2", {}, {}, &want);
+    AddNode("in2", "VariableV2", {}, {}, &want);
+    AddNode("begin1", "Const", {}, {}, &want);
+    AddNode("begin2", "Const", {}, {}, &want);
+    AddNode("size", "Const", {}, {}, &want);
+    AddNode("s1", "Identity",
+            {"in1", AsControlDependency("begin1"), AsControlDependency("size")},
+            {}, &want);
+    AddNode("s2", "Slice", {"in2", "begin2", "size"}, {}, &want);
+    AddNode("out", "Add", {"s1", "s2"}, {}, &want);
+
+    CompareGraphs(want, got);
+  }
+}
+
+TEST_F(ConstantFoldingTest, TileWithMultipliesBeingOne) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+
+  auto in1 = ops::Variable(scope.WithOpName("in1"), {4, 6}, DT_FLOAT);
+  auto in2 = ops::Variable(scope.WithOpName("in2"), {4, 3}, DT_FLOAT);
+  auto multiplies1 = ops::Const(scope.WithOpName("multiplies1"), {1, 1}, {2});
+  auto multiplies2 = ops::Const(scope.WithOpName("multiplies2"), {1, 2}, {2});
+
+  ops::Tile t1(scope.WithOpName("t1"), in1, multiplies1);
+  ops::Tile t2(scope.WithOpName("t2"), in2, multiplies2);
+
+  ops::Add out(scope.WithOpName("out"), t1, t2);
+
+  GrapplerItem item;
+  item.fetch = {"out"};
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+
+  ConstantFolding optimizer(nullptr /* cpu_device */);
+  GraphDef got;
+  Status status = optimizer.Optimize(nullptr, item, &got);
+  TF_EXPECT_OK(status);
+
+  GraphDef want;
+  AddNode("in1", "VariableV2", {}, {}, &want);
+  AddNode("in2", "VariableV2", {}, {}, &want);
+  AddNode("multiplies1", "Const", {}, {}, &want);
+  AddNode("multiplies2", "Const", {}, {}, &want);
+  AddNode("t1", "Identity", {"in1", AsControlDependency("multiplies1")}, {},
+          &want);
+  AddNode("t2", "Tile", {"in2", "multiplies2"}, {}, &want);
+  AddNode("out", "Add", {"t1", "t2"}, {}, &want);
+
+  CompareGraphs(want, got);
+}
+
+TEST_F(ConstantFoldingTest, PaddingWithZeroSize) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+
+  auto in1 = ops::Variable(scope.WithOpName("in1"), {4, 6}, DT_INT32);
+  auto in2 = ops::Variable(scope.WithOpName("in2"), {2, 2}, DT_INT32);
+  auto paddings1 =
+      ops::Const(scope.WithOpName("paddings1"), {0, 0, 0, 0}, {2, 2});
+  auto paddings2 =
+      ops::Const(scope.WithOpName("paddings2"), {1, 1, 2, 2}, {2, 2});
+  auto c1 = ops::Const(scope.WithOpName("c1"), 1);
+  auto c2 = ops::Const(scope.WithOpName("c2"), 1);
+
+  ops::PadV2 p1(scope.WithOpName("p1"), in1, paddings1, c1);
+  ops::PadV2 p2(scope.WithOpName("p2"), in2, paddings2, c2);
+
+  ops::Add out(scope.WithOpName("out"), p1, p2);
+
+  GrapplerItem item;
+  item.fetch = {"out"};
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+
+  ConstantFolding optimizer(nullptr /* cpu_device */);
+  GraphDef got;
+  Status status = optimizer.Optimize(nullptr, item, &got);
+  TF_EXPECT_OK(status);
+
+  GraphDef want;
+  AddNode("in1", "VariableV2", {}, {}, &want);
+  AddNode("in2", "VariableV2", {}, {}, &want);
+  AddNode("paddings1", "Const", {}, {}, &want);
+  AddNode("paddings2", "Const", {}, {}, &want);
+  AddNode("c1", "Const", {}, {}, &want);
+  AddNode("c2", "Const", {}, {}, &want);
+  AddNode("p1", "Identity",
+          {"in1", AsControlDependency("paddings1"), AsControlDependency("c1")},
+          {}, &want);
+  AddNode("p2", "PadV2", {"in2", "paddings2", "c2"}, {}, &want);
+  AddNode("out", "Add", {"p1", "p2"}, {}, &want);
+
+  CompareGraphs(want, got);
+}
+
+TEST_F(ConstantFoldingTest, SqueezeWithAllDimensionsGreaterThanOne) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+
+  auto in1 = ops::Variable(scope.WithOpName("in1"), {2, 3}, DT_INT32);
+  auto in2 = ops::Variable(scope.WithOpName("in2"), {1, 2, 3, 1}, DT_INT32);
+
+  ops::Squeeze s1(scope.WithOpName("s1"), in1);
+  ops::Squeeze s2(scope.WithOpName("s2"), in2);
+
+  ops::Add out(scope.WithOpName("out"), s1, s2);
+
+  GrapplerItem item;
+  item.fetch = {"out"};
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+
+  ConstantFolding optimizer(nullptr /* cpu_device */);
+  GraphDef got;
+  Status status = optimizer.Optimize(nullptr, item, &got);
+  TF_EXPECT_OK(status);
+
+  GraphDef want;
+  AddNode("in1", "VariableV2", {}, {}, &want);
+  AddNode("in2", "VariableV2", {}, {}, &want);
+  AddNode("s1", "Identity", {"in1"}, {}, &want);
+  AddNode("s2", "Squeeze", {"in2"}, {}, &want);
+  AddNode("out", "Add", {"s1", "s2"}, {}, &want);
 
   CompareGraphs(want, got);
 }
@@ -1228,9 +1476,9 @@ TEST_F(ConstantFoldingTest, NoOpReduction) {
   item.fetch.push_back("s");
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   bool found = false;
@@ -1287,9 +1535,9 @@ TEST_F(ConstantFoldingTest, NoOpReshape) {
   item.fetch = {"s1", "s2", "s3", "s4"};
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   int found = 0;
@@ -1334,9 +1582,9 @@ TEST_F(ConstantFoldingTest, Packing) {
   GrapplerItem item;
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   // Make sure that the representation of the folded constant is space
@@ -1369,14 +1617,14 @@ TEST_F(ConstantFoldingTest, MaterializeBroadcastGradientArgs) {
   GrapplerItem item;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   // Run a second time to make sure the optimization is idempotent.
   item.graph.Swap(&output);
-  status = fold.Optimize(nullptr, item, &output);
+  status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   int found = 0;
@@ -1430,14 +1678,14 @@ TEST_F(ConstantFoldingTest, MaterializeReductionIndices) {
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
   item.fetch.push_back("reshape");
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   // Run a second time to make sure the optimization is idempotent.
   item.graph.Swap(&output);
-  status = fold.Optimize(nullptr, item, &output);
+  status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   int found = 0;
@@ -1470,9 +1718,9 @@ TEST_F(ConstantFoldingTest, LargeConstant) {
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
   item.fetch.push_back("out");
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   // Make sure the diag node hasn't been folded, since it would use too much
@@ -1509,9 +1757,9 @@ TEST_F(ConstantFoldingTest, SwitchIdenticalInputs) {
   item.fetch.push_back("id_true");
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
-  ConstantFolding fold(nullptr /* cpu_device */);
+  ConstantFolding optimizer(nullptr /* cpu_device */);
   GraphDef output;
-  Status status = fold.Optimize(nullptr, item, &output);
+  Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
 
   EXPECT_EQ(6, output.node_size());
@@ -1548,8 +1796,338 @@ TEST_F(ConstantFoldingTest, SwitchIdenticalInputs) {
   EXPECT_EQ(6, found);
 }
 
+TEST_F(ConstantFoldingTest, PartialFolding_AssociativeAndCommutative) {
+  std::function<Output(const Scope&, InputList)> addn_fun =
+      [](const Scope& scope, InputList inputs) {
+        return ops::AddN(scope, inputs);
+      };
+  std::function<Output(const Scope&, InputList)> accumulate_fun =
+      [](const Scope& scope, InputList inputs) {
+        return ops::AccumulateNV2(scope, inputs, TensorShape({2, 2}));
+      };
+  for (bool use_add_n : {true, false}) {
+    auto fun = use_add_n ? addn_fun : accumulate_fun;
+    const string op_name = use_add_n ? "AddN" : "AccumulateNV2";
+    Scope s = Scope::NewRootScope();
+    Output x = ops::Placeholder(s.WithOpName("x"), DT_FLOAT,
+                                ops::Placeholder::Shape(TensorShape({2, 2})));
+    Output y = ops::Placeholder(s.WithOpName("y"), DT_FLOAT,
+                                ops::Placeholder::Shape(TensorShape({2, 2})));
+    Output z = ops::Placeholder(s.WithOpName("z"), DT_FLOAT,
+                                ops::Placeholder::Shape(TensorShape({2, 2})));
+    Output c1 = ops::Const(s.WithOpName("c1"), 1.0f, {2, 2});
+    Output c2 = ops::Const(s.WithOpName("c2"), 2.0f, {2, 2});
+    Output c3 = ops::Const(s.WithOpName("c3"), 3.0f, {2, 2});
+    Output acc0 = fun(s.WithOpName("acc0"), {c1, c2, c3});
+    Output acc1 = fun(s.WithOpName("acc1"), {x, y, z});
+    Output acc2 = fun(s.WithOpName("acc2"), {c1, x, y});
+    Output acc3 = fun(s.WithOpName("acc3"), {c1, c2, z});
+    Output acc4 = fun(s.WithOpName("acc4"), {c1, y, c2});
+    Output acc5 = fun(s.WithOpName("acc5"), {x, c1, c2});
+    Output acc6 = fun(s.WithOpName("acc6"), {x, c1, y, c2});
+    Output stack = ops::Stack(s.WithOpName("stack"),
+                              {acc0, acc1, acc2, acc3, acc4, acc5, acc6});
+
+    GrapplerItem item;
+    TF_CHECK_OK(s.ToGraphDef(&item.graph));
+    item.fetch = {"stack"};
+
+    ConstantFolding optimizer(nullptr /* cpu_device */);
+    GraphDef output;
+    Status status = optimizer.Optimize(nullptr, item, &output);
+    TF_EXPECT_OK(status);
+
+    EXPECT_EQ(16, output.node_size());
+    for (const NodeDef& node : output.node()) {
+      if (node.name() == "acc0") {
+        EXPECT_EQ("Const", node.op());
+      }
+      if (node.name() == "acc1") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(3, node.input_size());
+        EXPECT_EQ("x", node.input(0));
+        EXPECT_EQ("y", node.input(1));
+        EXPECT_EQ("z", node.input(2));
+      }
+      if (node.name() == "acc2") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(3, node.input_size());
+        EXPECT_EQ("c1", node.input(0));
+        EXPECT_EQ("x", node.input(1));
+        EXPECT_EQ("y", node.input(2));
+      }
+      if (node.name() == "acc3") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(2, node.input_size());
+        EXPECT_EQ("ConstantFolding/acc3_partial_split_2", node.input(0));
+        EXPECT_EQ("z", node.input(1));
+      }
+      if (node.name() == "acc4") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(2, node.input_size());
+        EXPECT_EQ("ConstantFolding/acc4_partial_split_2", node.input(0));
+        EXPECT_EQ("y", node.input(1));
+      }
+      if (node.name() == "acc5") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(2, node.input_size());
+        EXPECT_EQ("x", node.input(0));
+        EXPECT_EQ("ConstantFolding/acc5_partial_split_2", node.input(1));
+      }
+      if (node.name() == "acc6") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(3, node.input_size());
+        EXPECT_EQ("x", node.input(0));
+        EXPECT_EQ("ConstantFolding/acc6_partial_split_2", node.input(1));
+        EXPECT_EQ("y", node.input(2));
+      }
+      if (StringPiece(node.name()).starts_with("ConstantFolding/")) {
+        EXPECT_EQ("Const", node.op());
+      }
+    }
+
+    std::vector<string> fetch = {"acc0"};
+    auto tensors_expected = EvaluateNodes(item.graph, fetch);
+    auto tensors = EvaluateNodes(output, fetch);
+    EXPECT_EQ(1, tensors_expected.size());
+    EXPECT_EQ(1, tensors.size());
+    test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
+  }
+}
+
+TEST_F(ConstantFoldingTest, PartialFolding_Concat) {
+  Scope s = Scope::NewRootScope();
+  Output x = ops::Placeholder(s.WithOpName("x"), DT_FLOAT,
+                              ops::Placeholder::Shape(TensorShape({2, 2})));
+  Output y = ops::Placeholder(s.WithOpName("y"), DT_FLOAT,
+                              ops::Placeholder::Shape(TensorShape({2, 2})));
+  Output z = ops::Placeholder(s.WithOpName("z"), DT_FLOAT,
+                              ops::Placeholder::Shape(TensorShape({2, 2})));
+  Output axis = ops::Const(s.WithOpName("axis"), 0, {});
+  Output c1 = ops::Const(s.WithOpName("c1"), 1.0f, {2, 2});
+  Output c2 = ops::Const(s.WithOpName("c2"), 2.0f, {2, 2});
+  Output concat0 = ops::Concat(s.WithOpName("concat0"), {c1, c2, c1}, axis);
+  Output concat1 = ops::Concat(s.WithOpName("concat1"), {x, y, z}, axis);
+  Output concat2 = ops::Concat(s.WithOpName("concat2"), {c1, x, y}, axis);
+  Output concat3 = ops::Concat(s.WithOpName("concat3"), {c1, c2, z}, axis);
+  Output concat4 = ops::Concat(s.WithOpName("concat4"), {c1, y, c2}, axis);
+  Output concat5 = ops::Concat(s.WithOpName("concat5"), {x, c1, c2}, axis);
+  Output concat6 = ops::Concat(s.WithOpName("concat6"), {x, c1, y, c2}, axis);
+  Output concat7 = ops::Concat(s.WithOpName("concat7"), {x, y, c1, c2}, axis);
+  Output concat8 = ops::Concat(s.WithOpName("concat8"), {x, c1, c2, y}, axis);
+  Output concat9 = ops::Concat(s.WithOpName("concat9"), {c1, c2, x, y}, axis);
+
+  GrapplerItem item;
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+  item.fetch = {"concat0", "concat1", "concat2", "concat3", "concat4",
+                "concat5", "concat6", "concat7", "concat8", "concat9"};
+
+  ConstantFolding optimizer(nullptr /* cpu_device */);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+  // Run the optimizer twice to make sure the rewrite is idempotent.
+  item.graph.Swap(&output);
+  status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  EXPECT_EQ(21, output.node_size());
+  for (int i = 0; i < output.node_size(); ++i) {
+    const NodeDef& node = output.node(i);
+    if (node.name() == "concat0") {
+      EXPECT_EQ("Const", node.op());
+    } else if (node.name() == "concat3") {
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("ConstantFolding/concat3_partial_split_0", node.input(0));
+      EXPECT_EQ("z", node.input(1));
+      EXPECT_EQ("axis", node.input(2));
+    } else if (node.name() == "concat5") {
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("ConstantFolding/concat5_partial_split_1", node.input(1));
+      EXPECT_EQ("axis", node.input(2));
+    } else if (node.name() == "concat7") {
+      EXPECT_EQ(4, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("y", node.input(1));
+      EXPECT_EQ("ConstantFolding/concat7_partial_split_2", node.input(2));
+      EXPECT_EQ("axis", node.input(3));
+    } else if (node.name() == "concat8") {
+      EXPECT_EQ(4, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("ConstantFolding/concat8_partial_split_1", node.input(1));
+      EXPECT_EQ("y", node.input(2));
+      EXPECT_EQ("axis", node.input(3));
+    } else if (node.name() == "concat9") {
+      EXPECT_EQ(4, node.input_size());
+      EXPECT_EQ("ConstantFolding/concat9_partial_split_0", node.input(0));
+      EXPECT_EQ("x", node.input(1));
+      EXPECT_EQ("y", node.input(2));
+      EXPECT_EQ("axis", node.input(3));
+    } else if (StringPiece(node.name()).starts_with("ConstantFolding/")) {
+      EXPECT_EQ("Const", node.op());
+    } else {
+      EXPECT_EQ(item.graph.node(i).DebugString(), node.DebugString());
+    }
+  }
+
+  auto tensors_expected = EvaluateNodes(item.graph, {"concat0"});
+  auto tensors = EvaluateNodes(output, {"concat0"});
+  EXPECT_EQ(1, tensors_expected.size());
+  EXPECT_EQ(1, tensors.size());
+  test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
+}
+
+TEST_F(ConstantFoldingTest, PartialFolding_IdentityN) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+  Output x = ops::Placeholder(scope.WithOpName("x"), DT_FLOAT,
+                              ops::Placeholder::Shape(TensorShape({})));
+  Output c1 = ops::Const(scope.WithOpName("c1"), 1.0f, {2, 2});
+  Output c2 = ops::Const(scope.WithOpName("c2"), 2.0f, {2, 2});
+  auto id_n = ops::IdentityN(scope.WithOpName("id_n"), {c1, x, c2});
+  auto id0 = ops::Identity(scope.WithOpName("id0"), id_n[0]);
+  auto id1 = ops::Identity(scope.WithOpName("id1"), id_n[1]);
+  auto add0 = ops::Add(scope.WithOpName("add0"), id_n[0], id_n[1]);
+  auto add1 = ops::Add(scope.WithOpName("add1"), id_n[0], id_n[2]);
+
+  GrapplerItem item;
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+  item.fetch.push_back("id0");
+  item.fetch.push_back("id1");
+  item.fetch.push_back("add0");
+  item.fetch.push_back("add1");
+
+  ConstantFolding optimizer(nullptr /* cpu_device */);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  LOG(INFO) << output.DebugString();
+  TF_EXPECT_OK(status);
+  EXPECT_EQ(8, output.node_size());
+  for (const auto& node : output.node()) {
+    // id_n should remain unchanged.
+    if (node.name() == "id_n") {
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("c1", node.input(0));
+      EXPECT_EQ("x", node.input(1));
+      EXPECT_EQ("c2", node.input(2));
+    }
+    // id0 should be constant folded and have a control dependency from id_n.
+    if (node.name() == "id0") {
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("^id_n", node.input(0));
+    }
+    // id1 is unchanged.
+    if ("id1" == node.name()) {
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("id_n:1", node.input(0));
+    }
+
+    if ("add0" == node.name()) {
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("c1", node.input(0));
+      EXPECT_EQ("id_n:1", node.input(1));
+    }
+    // add1 should be constant folded and have a control dependency from id_n.
+    if ("add1" == node.name()) {
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("^id_n", node.input(0));
+    }
+  }
+}
+
+TEST_F(ConstantFoldingTest, TrivialPack) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+  Output x =
+      ops::RandomNormal(scope.WithOpName("x"), {2, 2}, DataType::DT_FLOAT);
+  Output y = ops::Const(scope.WithOpName("y"), {2.0f}, {});
+  auto stack =
+      ops::Stack(scope.WithOpName("stack").WithControlDependencies({y}), {x},
+                 ops::Stack::Axis(1));
+
+  GrapplerItem item;
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+  item.fetch.push_back("stack");
+
+  ConstantFolding optimizer(nullptr /* cpu_device */);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+  EXPECT_EQ(5, output.node_size());
+  for (const auto& node : output.node()) {
+    if (node.name() == "stack") {
+      EXPECT_EQ("stack", node.name());
+      EXPECT_EQ("ExpandDims", node.op());
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("ConstantFolding/stack_const_axis", node.input(1));
+      EXPECT_EQ("^y", node.input(2));
+    } else if (node.name() == "ConstantFolding/stack_const_axis") {
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("^x", node.input(0));
+    }
+  }
+
+  std::vector<string> fetch = {"stack"};
+  auto tensors_expected = EvaluateNodes(item.graph, fetch);
+  auto tensors = EvaluateNodes(output, fetch);
+  EXPECT_EQ(1, tensors_expected.size());
+  EXPECT_EQ(1, tensors.size());
+  EXPECT_EQ(tensors_expected[0].shape(), tensors[0].shape());
+}
+
+TEST_F(ConstantFoldingTest, Enter) {
+  GrapplerItem item;
+  AttrValue frame_name;
+  frame_name.set_s("foo");
+  AttrValue type;
+  type.set_type(DT_FLOAT);
+  AttrValue value;
+  Tensor value_tensor(DT_FLOAT, TensorShape({}));
+  value_tensor.flat<float>()(0) = 1;
+  value_tensor.AsProtoTensorContent(value.mutable_tensor());
+
+  GraphDef& graph = item.graph;
+  AddNode("x", "Placeholder", {}, {{"T", type}}, &graph);
+  AddNode("c1", "Const", {"^x"}, {{"value", value}, {"dtype", type}}, &graph);
+  AddNode("enter1", "Enter", {"x"}, {{"T", type}, {"frame_name", frame_name}},
+          &graph);
+  AddNode("enter2", "Enter", {"c1"}, {{"T", type}, {"frame_name", frame_name}},
+          &graph);
+  AddNode("id1", "Identity", {"enter1"}, {{"T", type}}, &graph);
+  AddNode("id2", "Identity", {"enter2"}, {{"T", type}}, &graph);
+  AddNode("id3", "Identity", {"enter2"}, {{"T", type}}, &graph);
+  item.fetch.push_back("id1");
+  item.fetch.push_back("id2");
+  item.fetch.push_back("id3");
+
+  ConstantFolding optimizer(RewriterConfig::AGGRESSIVE,
+                            nullptr /* cpu_device */);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+  // Run the optimizer twice to make sure the rewrite is idempotent.
+  item.graph.Swap(&output);
+  status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  EXPECT_EQ(7, output.node_size());
+  for (const NodeDef& node : output.node()) {
+    if (node.name() == "id1") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("enter1", node.input(0));
+    }
+    if (node.name() == "id2" || node.name() == "id3") {
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("^enter2", node.input(0));
+    }
+  }
+}
+
 }  // namespace
 }  // namespace grappler
 }  // namespace tensorflow
-
-//  LocalWords:  NewRootScope
diff --git a/tensorflow/core/grappler/optimizers/dependency_optimizer.cc b/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
index edb0db65e987318e1e64bf0288b6ef18a7b9d662..63bc19630de98a651e193558de8e99e11051e37c 100644
--- a/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
@@ -274,55 +274,58 @@ void DependencyOptimizer::OptimizeNode(int node_idx,
   //           +----------+             y --^> b
 
   if (is_noop || is_identity) {
+    if (is_identity && !SafeToRemoveIdentity(*node)) {
+      return;
+    }
+
     const auto& output_node_set = node_map_->GetOutputs(node_name);
     const std::vector<NodeDef*> output_nodes(output_node_set.begin(),
                                              output_node_set.end());
     const int num_outputs = output_nodes.size();
     const int num_inputs = node->input_size();
 
+    // Don't increase the number of edges in the graph.
     if (num_inputs * num_outputs > num_inputs + num_outputs) {
       return;
     }
     std::vector<NodeDef*> input_nodes;
     for (int i = 0; i < num_inputs; ++i) {
       NodeDef* input_node = node_map_->GetNode(node->input(i));
-      CHECK_NE(input_node, nullptr);
+      if (input_node == nullptr) {
+        LOG(ERROR) << "Invalid input " << node->input(i);
+        return;
+      }
       input_nodes.push_back(input_node);
     }
 
-    // Make sure that we don't increase the number of edges that cross
-    // device boundaries.
-    if ((num_inputs == 1 && num_outputs > 1 &&
-         input_nodes[0]->device() != node->device()) ||
-        (num_inputs > 1 && num_outputs == 1 &&
-         output_nodes[0]->device() != node->device())) {
+    // TODO(rmlarsen): Not all device crossings are equally expensive.
+    // Assign a cost to each based on device affinity and compute a
+    // cost before and after.
+    const string& node_dev = node->device();
+    int num_cross_in = 0;
+    for (NodeDef* input_node : input_nodes) {
+      num_cross_in += static_cast<int>(input_node->device() != node_dev);
+    }
+    int num_cross_out = 0;
+    for (NodeDef* output_node : output_nodes) {
+      num_cross_out += static_cast<int>(output_node->device() != node_dev);
+    }
+    if (is_identity && num_cross_in > 0 && num_cross_out > 0) {
+      // This identity node follows a device crossing, so it might be
+      // following a _Recv node after partitioning. Do not remove such nodes,
+      // unless they only have consumers on the same device as themselves.
       return;
     }
-    if (num_inputs == 2 && num_outputs == 2) {
-      const string& noop_dev = node->device();
-      const string& in0_dev = input_nodes[0]->device();
-      const string& in1_dev = input_nodes[1]->device();
-      const string& out0_dev = output_nodes[0]->device();
-      const string& out1_dev = output_nodes[1]->device();
-      const int num_cross_before = static_cast<int>(in0_dev != noop_dev) +
-                                   static_cast<int>(in1_dev != noop_dev) +
-                                   static_cast<int>(out0_dev != noop_dev) +
-                                   static_cast<int>(out1_dev != noop_dev);
-      const int num_cross_after = static_cast<int>(in0_dev != out0_dev) +
-                                  static_cast<int>(in0_dev != out1_dev) +
-                                  static_cast<int>(in1_dev != out0_dev) +
-                                  static_cast<int>(in1_dev != out1_dev);
-      if (num_cross_after > num_cross_before) {
-        return;
-      }
-      // To avoid potentially removing Identity nodes following _Recv nodes,
-      // we require that no device crossings occur in that case.
-      // TODO(rmlarsen): See if we can relax this condition.
-      if (is_identity && (num_cross_after > 0 || num_cross_before > 0)) {
-        return;
+    const int num_cross_before = num_cross_in + num_cross_out;
+    int num_cross_after = 0;
+    for (NodeDef* input_node : input_nodes) {
+      for (NodeDef* output_node : output_nodes) {
+        num_cross_after +=
+            static_cast<int>(input_node->device() != output_node->device());
       }
     }
-    if (is_identity && !SafeToRemoveIdentity(*node)) {
+    if (num_cross_after > num_cross_before) {
+      // Avoid increasing the number of device crossings.
       return;
     }
 
@@ -343,16 +346,23 @@ void DependencyOptimizer::OptimizeNode(int node_idx,
           CHECK(!IsControlInput(input_to_forward));
           for (int j = 0; j < consumer->input_size(); ++j) {
             const string& old_input = consumer->input(j);
-            if (old_input == node_name) {
-              new_input = input_to_forward;
-              node_map_->UpdateInput(consumer->name(), old_input, new_input);
-              consumer->set_input(j, new_input);
-              found_input = true;
-            } else if (old_input == AsControlDependency(NodeName(node_name))) {
-              new_input = AsControlDependency(NodeName(input_to_forward));
-              node_map_->UpdateInput(consumer->name(), old_input, new_input);
-              consumer->set_input(j, new_input);
-              found_input = true;
+            int old_input_pos;
+            string old_input_node_name =
+                ParseNodeName(old_input, &old_input_pos);
+            if (old_input_node_name == node_name) {
+              if (old_input_pos >= 0) {
+                // Regular input
+                new_input = input_to_forward;
+                node_map_->UpdateInput(consumer->name(), old_input, new_input);
+                consumer->set_input(j, new_input);
+                found_input = true;
+              } else {
+                // Control dependency
+                new_input = AsControlDependency(NodeName(input_to_forward));
+                node_map_->UpdateInput(consumer->name(), old_input, new_input);
+                consumer->set_input(j, new_input);
+                found_input = true;
+              }
             }
           }
           CHECK(found_input);
@@ -566,7 +576,9 @@ Status DependencyOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
       // Remove redundant control dependencies.
       TF_RETURN_IF_ERROR(TransitiveReduction());
     } else {
-      LOG(ERROR) << topo_sort_status.error_message();
+      LOG(ERROR) << "Iteration = " << iteration
+                 << ", topological sort failed with message: "
+                 << topo_sort_status.error_message();
     }
     // Turn nodes with only control outputs into NoOps, prune NoOp and Identity
     // nodes.
diff --git a/tensorflow/core/grappler/optimizers/dependency_optimizer.h b/tensorflow/core/grappler/optimizers/dependency_optimizer.h
index 61ed15479370614bc79c15b450039f0cbf30908d..b4db98125aa740b5d261e8f9ad0ea5bfd8102877 100644
--- a/tensorflow/core/grappler/optimizers/dependency_optimizer.h
+++ b/tensorflow/core/grappler/optimizers/dependency_optimizer.h
@@ -29,9 +29,8 @@ namespace grappler {
 // optimizations, such as removing nodes that are effectively noops.
 class DependencyOptimizer : public GraphOptimizer {
  public:
-  DependencyOptimizer() : opt_level_(RewriterConfig::ON) {}
-  explicit DependencyOptimizer(RewriterConfig::Toggle opt_level)
-      : opt_level_(opt_level) {}
+  DependencyOptimizer() {}
+  explicit DependencyOptimizer(RewriterConfig::Toggle opt_level) {}
   ~DependencyOptimizer() override {}
 
   string name() const override { return "dependency_optimizer"; };
@@ -63,7 +62,6 @@ class DependencyOptimizer : public GraphOptimizer {
   // Main driver of dependency optimizations.
   Status OptimizeDependencies();
 
-  RewriterConfig::Toggle opt_level_;
   bool fetch_nodes_known_;
   std::unordered_set<string> nodes_to_preserve_;
   std::unique_ptr<NodeMap> node_map_;
diff --git a/tensorflow/core/grappler/optimizers/dependency_optimizer_test.cc b/tensorflow/core/grappler/optimizers/dependency_optimizer_test.cc
index 33d6b992d21212fe325c642b87d3c3736185c445..cc1e142041c36fb645267c13b306d86639b2541e 100644
--- a/tensorflow/core/grappler/optimizers/dependency_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/dependency_optimizer_test.cc
@@ -515,6 +515,137 @@ TEST_F(DependencyOptimizerTest, ChangeToNoop_Identity) {
   }
 }
 
+TEST_F(DependencyOptimizerTest, IdentityInputs) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+  Output b = ops::Placeholder(scope.WithOpName("b"), DT_BOOL);
+  Output x = ops::RandomUniform(scope.WithOpName("x"), {1, 2}, DT_FLOAT);
+  auto s = ops::Switch(scope.WithOpName("s"), x, b);
+
+  // Identity nodes to be removed.
+  auto id_f = ops::Identity(scope.WithOpName("id_f"), s.output_false);
+  auto id_t = ops::Identity(scope.WithOpName("id_t"), s.output_true);
+
+  // Output
+  Output out1 = ops::Identity(scope.WithOpName("out1"), id_f);
+  Output out2 = ops::Identity(scope.WithOpName("out2"), id_t);
+
+  GrapplerItem item;
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+  item.fetch = {"out1", "out2"};
+
+  DependencyOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  EXPECT_EQ(6, output.node_size());
+  EXPECT_EQ("out1", output.node(4).name());
+  EXPECT_EQ(1, output.node(4).input_size());
+  EXPECT_EQ("s", output.node(4).input(0));
+
+  EXPECT_EQ("out2", output.node(5).name());
+  EXPECT_EQ(1, output.node(5).input_size());
+  EXPECT_EQ("s:1", output.node(5).input(0));
+}
+
+TEST_F(DependencyOptimizerTest, IdentityN) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+  Output b = ops::Placeholder(scope.WithOpName("b"), DT_BOOL);
+  Output x = ops::RandomUniform(scope.WithOpName("x"), {1, 2}, DT_FLOAT);
+  auto s = ops::Switch(scope.WithOpName("s"), x, b);
+
+  // IdentityN nodes to be removed.
+  auto id_f = ops::IdentityN(scope.WithOpName("id_f"), {s.output_false});
+  auto id_t = ops::IdentityN(scope.WithOpName("id_t"), {s.output_true});
+
+  // IdentityN node that can't be removed.
+  auto id_b =
+      ops::IdentityN(scope.WithOpName("id_b"), {s.output_false, s.output_true});
+
+  // Outputs
+  Output out1 = ops::Identity(scope.WithOpName("out1"), id_f[0]);
+  Output out2 = ops::Identity(scope.WithOpName("out2"), id_t[0]);
+  Output out3 = ops::Identity(scope.WithOpName("out3"), id_b[0]);
+  Output out4 = ops::Identity(scope.WithOpName("out4"), id_b[1]);
+
+  GrapplerItem item;
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+  item.fetch = {"out1", "out2", "out3", "out4"};
+
+  DependencyOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  EXPECT_EQ(9, output.node_size());
+  EXPECT_EQ("out1", output.node(5).name());
+  EXPECT_EQ(1, output.node(5).input_size());
+  EXPECT_EQ("s", output.node(5).input(0));
+
+  EXPECT_EQ("out2", output.node(6).name());
+  EXPECT_EQ(1, output.node(6).input_size());
+  EXPECT_EQ("s:1", output.node(6).input(0));
+
+  EXPECT_EQ("out3", output.node(7).name());
+  EXPECT_EQ(1, output.node(7).input_size());
+  EXPECT_EQ("id_b", output.node(7).input(0));
+
+  EXPECT_EQ("out4", output.node(8).name());
+  EXPECT_EQ(1, output.node(8).input_size());
+  EXPECT_EQ("id_b:1", output.node(8).input(0));
+}
+
+TEST_F(DependencyOptimizerTest,
+       Identity_DeviceCrossing_ConsumerOnDifferentDevice) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  Output x_on_1 =
+      ops::Const(s.WithOpName("x_on_1").WithDevice("/gpu:1"), {1.0f}, {});
+  Output one_on_3 =
+      ops::Const(s.WithOpName("one_on_3").WithDevice("/gpu:3"), {1.0f}, {});
+  Output x_on_2 =
+      ops::Identity(s.WithOpName("x_on_2").WithDevice("/gpu:2"), x_on_1);
+  Output result =
+      ops::Add(s.WithOpName("result").WithDevice("/gpu:3"), x_on_2, one_on_3);
+
+  GrapplerItem item;
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+  item.fetch = {"result"};
+  DependencyOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  VerifyGraphsEqual(item.graph, output, __FUNCTION__);
+}
+
+TEST_F(DependencyOptimizerTest, Identity_DeviceCrossing_ConsumerOnSameDevice) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  Output x_on_1 =
+      ops::Const(s.WithOpName("x_on_1").WithDevice("/gpu:1"), {1.0f}, {});
+  Output one_on_2 =
+      ops::Const(s.WithOpName("one_on_2").WithDevice("/gpu:2"), {1.0f}, {});
+  Output x_on_2 =
+      ops::Identity(s.WithOpName("x_on_2").WithDevice("/gpu:2"), x_on_1);
+  Output result =
+      ops::Add(s.WithOpName("result").WithDevice("/gpu:2"), x_on_2, one_on_2);
+
+  GrapplerItem item;
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+  item.fetch = {"result"};
+  DependencyOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+  LOG(INFO) << output.DebugString();
+  EXPECT_EQ(3, output.node_size());
+  for (const auto& node : output.node()) {
+    EXPECT_NE("x_on_2", node.name());
+    if (node.name() == "result") {
+      EXPECT_EQ("x_on_1", node.input(0));
+    }
+  }
+}
+
 }  // namespace
 }  // namespace grappler
 }  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/function_optimizer.cc b/tensorflow/core/grappler/optimizers/function_optimizer.cc
new file mode 100644
index 0000000000000000000000000000000000000000..2a6b8a325f223f1202203c880786d0cf015a496d
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/function_optimizer.cc
@@ -0,0 +1,344 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/function_optimizer.h"
+#include <unordered_map>
+#include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/common_runtime/function.h"
+#include "tensorflow/core/common_runtime/process_function_library_runtime.h"
+#include "tensorflow/core/framework/function.h"
+#include "tensorflow/core/framework/function.pb.h"
+#include "tensorflow/core/framework/graph_def_util.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/op_def.pb.h"
+#include "tensorflow/core/framework/versions.pb.h"
+#include "tensorflow/core/graph/graph_constructor.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/functions.h"
+
+namespace tensorflow {
+namespace grappler {
+
+Status InlineFunction(const NodeDef& node, const FunctionDef& func,
+                      const FunctionDefLibrary& library, GraphDef* graph) {
+  const std::unordered_map<string, AttrValue> attr(node.attr().begin(),
+                                                   node.attr().end());
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, attr, library);
+  if (!item) {
+    return errors::InvalidArgument("Failed to inline function ", node.op(),
+                                   " instantiated by ", node.name());
+  }
+
+  std::unordered_map<string, int> input_nodes;
+  for (int i = 0; i < func.signature().input_arg_size(); ++i) {
+    const OpDef::ArgDef& arg = func.signature().input_arg(i);
+    input_nodes[arg.name()] = i;
+  }
+
+  // Add an IdentityN op to hook the function inputs to: this ensures that
+  // they're all evaluated before the evaluation of the function body starts.
+  NodeDef* func_inputs = graph->add_node();
+  func_inputs->set_name(strings::StrCat(node.name(), "/", "inlined_inputs"));
+  func_inputs->set_op("IdentityN");
+  func_inputs->set_device(node.device());
+  *func_inputs->mutable_input() = node.input();
+  AttrValue::ListValue* type_list =
+      (*func_inputs->mutable_attr())["T"].mutable_list();
+  for (const OpDef::ArgDef& arg : func.signature().input_arg()) {
+    if (arg.type() != DT_INVALID) {
+      type_list->add_type(arg.type());
+    } else {
+      auto it = attr.find(arg.type_attr());
+      if (it == attr.end()) {
+        return errors::InvalidArgument("Invalid input argument ", arg.name(),
+                                       " for function ", node.op(),
+                                       " instantiated by ", node.name());
+      }
+      type_list->add_type(it->second.type());
+    }
+  }
+
+  for (NodeDef& func_body_node : *item->graph.mutable_node()) {
+    if (input_nodes.find(func_body_node.name()) != input_nodes.end()) {
+      // Turn input placeholders into identity nodes
+      if (IsPlaceholder(func_body_node)) {
+        func_body_node.set_op("Identity");
+      }
+      CHECK_EQ(0, func_body_node.input_size());
+      int input_id = input_nodes[func_body_node.name()];
+      func_body_node.add_input(
+          strings::StrCat(func_inputs->name(), ":", input_id));
+    } else {
+      // Update the input names if any.
+      for (string& input : *func_body_node.mutable_input()) {
+        input = AddPrefixToNodeName(input, node.name());
+      }
+      // If the node has no input, hook it up to the func_inputs node to
+      // ensure it runs in the same frame as the other nodes of the function
+      // body.
+      if (func_body_node.input_size() == 0) {
+        *func_body_node.add_input() = AsControlDependency(func_inputs->name());
+      }
+    }
+
+    // Add the node name as a prefix to avoid collisions after inlining
+    func_body_node.set_name(
+        strings::StrCat(node.name(), "/", func_body_node.name()));
+
+    // Make sure the node is placed
+    func_body_node.set_device(node.device());
+
+    // Move the node to the main graph
+    graph->add_node()->Swap(&func_body_node);
+  }
+
+  // Add an IdentityN op to hook the function outputs to: this ensures that the
+  // function body is fully evaluated before its fanout gets scheduled.
+  NodeDef* func_outputs = graph->add_node();
+  func_outputs->set_name(node.name());
+  func_outputs->set_op("IdentityN");
+  func_outputs->set_device(node.device());
+  type_list = (*func_outputs->mutable_attr())["T"].mutable_list();
+  for (int i = 0; i < func.signature().output_arg_size(); ++i) {
+    const OpDef::ArgDef& arg = func.signature().output_arg(i);
+    if (arg.type() != DT_INVALID) {
+      type_list->add_type(arg.type());
+    } else {
+      auto it = attr.find(arg.type_attr());
+      if (it == attr.end()) {
+        return errors::InvalidArgument("Invalid output argument ", arg.name(),
+                                       " for function ", node.op(),
+                                       " instantiated by ", node.name());
+      }
+      type_list->add_type(it->second.type());
+    }
+    // Use the fetch names since they take into account the output mapping.
+    func_outputs->add_input(strings::StrCat(node.name(), "/", item->fetch[i]));
+  }
+
+  return Status::OK();
+}
+
+class FakeCPUDevice : public Device {
+ public:
+  FakeCPUDevice(Env* env, const DeviceAttributes& attr) : Device(env, attr) {}
+  Status Sync() override { return Status::OK(); }
+};
+
+class SymbolicGradientEnv {
+ public:
+  SymbolicGradientEnv(int graph_version, const FunctionDefLibrary& library)
+      : graph_version_(graph_version), library_(library) {}
+
+  FunctionLibraryDefinition* function_library() {
+    InitializeIfNeeded();
+    return fld_.get();
+  }
+  FunctionLibraryRuntime* function_library_runtime() {
+    InitializeIfNeeded();
+    return flr_;
+  }
+
+ private:
+  // This initialization is expensive. Do it lazily to avoid paying for it
+  // unless it's needed.
+  void InitializeIfNeeded() {
+    if (flr_) {
+      return;
+    }
+    Env* env = Env::Default();
+    DeviceAttributes attr;
+    attr.set_name("/device:CPU:0");
+    attr.set_device_type("CPU");
+    FakeCPUDevice* dev = new FakeCPUDevice(env, attr);
+    std::vector<Device*> devices;
+    devices.push_back(dev);
+    dvc_mgr_.reset(new DeviceMgr(devices));
+    fld_.reset(new FunctionLibraryDefinition(OpRegistry::Global(), library_));
+    OptimizerOptions optimizer_opts;
+    optimizer_opts.set_do_function_inlining(true);
+    pflr_.reset(new ProcessFunctionLibraryRuntime(
+        dvc_mgr_.get(), env, graph_version_, fld_.get(), optimizer_opts));
+    flr_ = pflr_->GetFLR(dev->name());
+  }
+
+  const int graph_version_;
+  const FunctionDefLibrary& library_;
+  std::unique_ptr<DeviceMgr> dvc_mgr_;
+  std::unique_ptr<FunctionLibraryDefinition> fld_;
+  std::unique_ptr<ProcessFunctionLibraryRuntime> pflr_;
+  FunctionLibraryRuntime* flr_ = nullptr;
+};
+
+Status InlineSymbolicGradient(const NodeDef& node, SymbolicGradientEnv* env,
+                              GraphDef* inlined_graph) {
+  GraphDef graph_def;
+
+  // Create a node to anchor the gradient inputs
+  NodeDef* inlined_input = graph_def.add_node();
+  inlined_input->set_name("FunctionInputs");
+  inlined_input->set_op("IdentityN");
+  AttrValue::ListValue* type_list =
+      (*inlined_input->mutable_attr())["T"].mutable_list();
+  for (const auto& type : node.attr().at("Tin").list().type()) {
+    type_list->add_type(static_cast<DataType>(type));
+  }
+
+  // Add the gradient node
+  NodeDef* inlined = graph_def.add_node();
+  *inlined = node;
+  inlined->clear_input();
+  for (int i = 0; i < node.attr().at("Tin").list().type_size(); ++i) {
+    inlined->add_input(strings::StrCat(inlined_input->name(), ":", i));
+  }
+
+  // Create a node to anchor the gradient outputs
+  NodeDef* inlined_output = graph_def.add_node();
+  inlined_output->set_name("FunctionOutputs");
+  inlined_output->set_op("IdentityN");
+  type_list = (*inlined_output->mutable_attr())["T"].mutable_list();
+  for (const auto& type : node.attr().at("Tout").list().type()) {
+    type_list->add_type(static_cast<DataType>(type));
+  }
+  for (int i = 0; i < node.attr().at("Tout").list().type_size(); ++i) {
+    inlined_output->add_input(strings::StrCat(inlined->name(), ":", i));
+  }
+
+  // Convert the graphdef to a graph
+  GraphConstructorOptions graph_ctor_opts;
+  graph_ctor_opts.allow_internal_ops = true;
+  graph_ctor_opts.expect_device_spec = false;
+  Graph graph(env->function_library());
+  TF_RETURN_IF_ERROR(
+      ConvertGraphDefToGraph(graph_ctor_opts, graph_def, &graph));
+
+  // Recursively inline the functions until there is nothing more to inline. We
+  // should at least expand one function.
+  int counter = 0;
+  while (counter < 50 &&
+         ExpandInlineFunctions(env->function_library_runtime(), &graph)) {
+    ++counter;
+  }
+
+  GraphDef inlined_graph_def;
+  graph.ToGraphDef(&inlined_graph_def);
+
+  // Add the default values of attributes to the nodes that have been inlined.
+  TF_RETURN_IF_ERROR(AddDefaultAttrsToGraphDef(&inlined_graph_def,
+                                               *graph.op_registry(), 0, true));
+
+  // Add the inlined nodes to the graph
+  for (NodeDef& inlined_node : *inlined_graph_def.mutable_node()) {
+    if (inlined_node.name() == "FunctionOutputs") {
+      inlined_node.set_name(node.name());
+      for (int i = 0; i < inlined_node.input_size(); ++i) {
+        inlined_node.set_input(
+            i, AddPrefixToNodeName(inlined_node.input(i), node.name()));
+      }
+    } else if (inlined_node.name() == "FunctionInputs") {
+      inlined_node.set_name(
+          AddPrefixToNodeName(inlined_node.name(), node.name()));
+      inlined_node.clear_input();
+      for (int i = 0; i < node.input_size(); ++i) {
+        inlined_node.add_input(node.input(i));
+      }
+    } else {
+      inlined_node.set_name(
+          AddPrefixToNodeName(inlined_node.name(), node.name()));
+      for (int i = 0; i < inlined_node.input_size(); ++i) {
+        inlined_node.set_input(
+            i, AddPrefixToNodeName(inlined_node.input(i), node.name()));
+      }
+      // If the node has no input, hook it up to the function input node to make
+      // sure it runs in the same frame as the other nodes of the function body.
+      if (inlined_node.input_size() == 0) {
+        *inlined_node.add_input() = AsControlDependency(
+            AddPrefixToNodeName("FunctionInputs", node.name()));
+      }
+    }
+    inlined_node.set_device(node.device());
+    inlined_graph->add_node()->Swap(&inlined_node);
+  }
+
+  return Status::OK();
+}
+
+Status FunctionOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
+                                   GraphDef* optimized_graph) {
+  std::unordered_map<string, const FunctionDef*> functions;
+  for (const FunctionDef& func : item.graph.library().function()) {
+    // Don't inline functions marked as noinline
+    if (func.attr().count("_noinline") != 0) {
+      continue;
+    }
+    // Don't touch anything marked XLA to prevent XLA failures further down the
+    // road.
+    if (func.attr().count("_XlaCompile") > 0 &&
+        func.attr().at("_XlaCompile").b()) {
+      continue;
+    }
+    // Can't create IdentityN nodes with no input or output: skip these
+    // functions for now.
+    if (func.signature().input_arg_size() == 0 ||
+        func.signature().output_arg_size() == 0) {
+      continue;
+    }
+    functions[func.signature().name()] = &func;
+  }
+
+  // Nothing to do.
+  if (functions.empty()) {
+    *optimized_graph = item.graph;
+    return Status::OK();
+  }
+
+  SymbolicGradientEnv env(item.graph.versions().producer(),
+                          item.graph.library());
+
+  for (const NodeDef& node : item.graph.node()) {
+    if (node.op() == "SymbolicGradient") {
+      TF_RETURN_IF_ERROR(InlineSymbolicGradient(node, &env, optimized_graph));
+      continue;
+    }
+    auto it = functions.find(node.op());
+    if (it == functions.end()) {
+      *optimized_graph->add_node() = node;
+    } else {
+      TF_RETURN_IF_ERROR(InlineFunction(node, *it->second, item.graph.library(),
+                                        optimized_graph));
+    }
+  }
+
+  // TODO(bsteiner): specialize the implementation of functions that can't be
+  // inlined based on the context in which they're instantiated.
+
+  // TODO(bsteiner): trim the library to remove unused function definitions
+  *optimized_graph->mutable_versions() = item.graph.versions();
+  *optimized_graph->mutable_library() = item.graph.library();
+
+  return Status::OK();
+}
+
+void FunctionOptimizer::Feedback(Cluster* cluster, const GrapplerItem& item,
+                                 const GraphDef& optimized_graph,
+                                 double result) {
+  // Nothing to do for FunctionOptimizer.
+}
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/function_optimizer.h b/tensorflow/core/grappler/optimizers/function_optimizer.h
new file mode 100644
index 0000000000000000000000000000000000000000..41444e467364f83e7627477a7651203100e47d8a
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/function_optimizer.h
@@ -0,0 +1,44 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_GRAPPLER_OPTIMIZERS_FUNCTION_OPTIMIZER_H_
+#define TENSORFLOW_GRAPPLER_OPTIMIZERS_FUNCTION_OPTIMIZER_H_
+
+#include "tensorflow/core/grappler/optimizers/graph_optimizer.h"
+#include "tensorflow/core/protobuf/rewriter_config.pb.h"
+
+namespace tensorflow {
+namespace grappler {
+
+// Inlines function calls and SymbolicGradient nodes into the graph. Functions
+// marked "_noinline" or "_XlaCompile" are left as is.
+class FunctionOptimizer : public GraphOptimizer {
+ public:
+  FunctionOptimizer(RewriterConfig::Toggle opt_level) {}
+  ~FunctionOptimizer() override {}
+
+  string name() const override { return "function_optimizer"; }
+
+  Status Optimize(Cluster* cluster, const GrapplerItem& item,
+                  GraphDef* optimized_graph) override;
+
+  void Feedback(Cluster* cluster, const GrapplerItem& item,
+                const GraphDef& optimized_graph, double result) override;
+};
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_GRAPPLER_OPTIMIZERS_FUNCTION_OPTIMIZER_H_
diff --git a/tensorflow/core/grappler/optimizers/function_optimizer_test.cc b/tensorflow/core/grappler/optimizers/function_optimizer_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..52a1118080ab4e0529330514258ba1b498f4dbbb
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/function_optimizer_test.cc
@@ -0,0 +1,526 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/function_optimizer.h"
+#include "tensorflow/cc/ops/functional_ops.h"
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/framework/function_testlib.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/utils/grappler_test.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+class FunctionOptimizerTest : public GrapplerTest {};
+
+TEST_F(FunctionOptimizerTest, SimpleFunction) {
+  // Build a graph to compute y = XTimesTwo(x)
+  GrapplerItem item;
+  constexpr char device[] = "/device:CPU:0";
+  item.graph = test::function::GDef(
+      {test::function::NDef("x", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("y", "XTimesTwo", {"x"}, {{"T", DT_FLOAT}}, device),
+       test::function::NDef("z", "Identity", {"y"}, {{"T", DT_FLOAT}}, device)},
+      // FunctionLib
+      {
+          test::function::XTimesTwo(),
+      });
+
+  FunctionOptimizer optimizer(RewriterConfig::DEFAULT);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  int count = 0;
+  for (const NodeDef& node : output.node()) {
+    if (node.name() == "y/inlined_inputs") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+    } else if (node.name() == "y/x") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/inlined_inputs:0", node.input(0));
+    } else if (node.name() == "y/two") {
+      count++;
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("^y/inlined_inputs", node.input(0));
+    } else if (node.name() == "y/scale") {
+      count++;
+      EXPECT_EQ("Cast", node.op());
+      EXPECT_EQ(device, node.device());
+    } else if (node.name() == "y/y") {
+      count++;
+      EXPECT_EQ("Mul", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("y/x", node.input(0));
+      EXPECT_EQ("y/scale:0", node.input(1));
+    } else if (node.name() == "y") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/y:0", node.input(0));
+    } else if (node.name() == "z") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y", node.input(0));
+    }
+  }
+  EXPECT_EQ(7, count);
+
+  item.fetch = {"z"};
+  Tensor pi(DT_FLOAT, {});
+  pi.flat<float>()(0) = 3.14f;
+  item.feed.emplace_back("x", pi);
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+}
+
+TEST_F(FunctionOptimizerTest, FixedTypeFunction) {
+  // Create and instantiate a version of the XTimesTwo function that only
+  // accepts floats as inputs.
+  const Tensor kTwo = test::AsScalar<float>(2.0f);
+  FunctionDef x_times_two = FunctionDefHelper::Define(
+      // Name
+      "XTimesTwo",
+      // Args
+      {"x: float"},
+      // Return values
+      {"y: float"},
+      // Attr def
+      {},
+      // Nodes
+      {
+          {{"two"}, "Const", {}, {{"value", kTwo}, {"dtype", DT_FLOAT}}},
+          {{"y"}, "Mul", {"x", "two"}, {{"T", DT_FLOAT}}},
+      });
+
+  constexpr char device[] = "/device:CPU:0";
+  GrapplerItem item;
+  item.graph = test::function::GDef(
+      {test::function::NDef("x", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("y", "XTimesTwo", {"x"}, {}, device),
+       test::function::NDef("z", "Identity", {"y"}, {{"T", DT_FLOAT}}, device)},
+      // FunctionLib
+      {
+          x_times_two,
+      });
+
+  FunctionOptimizer optimizer(RewriterConfig::DEFAULT);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  int count = 0;
+  for (const NodeDef& node : output.node()) {
+    if (node.name() == "y/inlined_inputs") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+    } else if (node.name() == "y/x") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/inlined_inputs:0", node.input(0));
+    } else if (node.name() == "y/two") {
+      count++;
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("^y/inlined_inputs", node.input(0));
+      EXPECT_EQ(device, node.device());
+    } else if (node.name() == "y/y") {
+      count++;
+      EXPECT_EQ("Mul", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("y/x", node.input(0));
+      EXPECT_EQ("y/two:0", node.input(1));
+    } else if (node.name() == "y") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/y:0", node.input(0));
+    } else if (node.name() == "z") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y", node.input(0));
+    }
+  }
+  EXPECT_EQ(6, count);
+
+  item.fetch = {"z"};
+  Tensor pi(DT_FLOAT, {});
+  pi.flat<float>()(0) = 3.14f;
+  item.feed.emplace_back("x", pi);
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+}
+
+TEST_F(FunctionOptimizerTest, FunctionWithOutputMapping) {
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "Exp_func",
+      // Args
+      {"in: float"},
+      // Return values
+      {"out: float"},
+      // Attr def
+      {},
+      // Nodes
+      {{{"Linear_func"}, "Identity", {"in"}, {{"T", DT_FLOAT}}},
+       {{"Exp"}, "Exp", {"Linear_func:output:0"}, {{"T", DT_FLOAT}}}},
+      // Mapping
+      {{"out", "Exp:y:0"}});
+
+  GrapplerItem item;
+  constexpr char device[] = "/device:CPU:0";
+  item.graph = test::function::GDef(
+      {test::function::NDef("x", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("y", "Exp_func", {"x"}, {}, device),
+       test::function::NDef("z", "Identity", {"y"}, {{"T", DT_FLOAT}}, device)},
+      // FunctionLib
+      {
+          func,
+      });
+
+  FunctionOptimizer optimizer(RewriterConfig::DEFAULT);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  int count = 0;
+  for (const NodeDef& node : output.node()) {
+    if (node.name() == "y/inlined_inputs") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+    } else if (node.name() == "y/in") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/inlined_inputs:0", node.input(0));
+    } else if (node.name() == "y/Linear_func") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/in", node.input(0));
+    } else if (node.name() == "y/Exp") {
+      count++;
+      EXPECT_EQ("Exp", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/Linear_func:0", node.input(0));
+    } else if (node.name() == "y") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/Exp:0", node.input(0));
+    } else if (node.name() == "z") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y", node.input(0));
+    }
+  }
+  EXPECT_EQ(6, count);
+
+  item.fetch = {"z"};
+  Tensor pi(DT_FLOAT, {});
+  pi.flat<float>()(0) = 3.14f;
+  item.feed.emplace_back("x", pi);
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+}
+
+TEST_F(FunctionOptimizerTest, FunctionWithInputForwarding) {
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "ForwardInputs",
+      // Args
+      {"in0: float", "in1: float", "arg2: float", "arg3: int32", "arg4: float"},
+      // Return values
+      {"out0: float", "arg2: float", "arg3: int32"},
+      // Attr def
+      {},
+      // Nodes
+      {},
+      // Mapping
+      {{"out0", "in0"}, {"arg2", "arg2"}, {"arg3", "arg3"}});
+
+  GrapplerItem item;
+  constexpr char device[] = "/device:CPU:0";
+  item.graph = test::function::GDef(
+      {test::function::NDef("x0", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("x1", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("x2", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("x3", "Placeholder", {}, {{"dtype", DT_INT32}},
+                            device),
+       test::function::NDef("x4", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("y", "ForwardInputs",
+                            {"x0", "x1", "x2", "x3", "x4"}, {}, device),
+       test::function::NDef("z0", "Identity", {"y:0"}, {{"T", DT_FLOAT}},
+                            device),
+       test::function::NDef("z1", "Identity", {"y:1"}, {{"T", DT_FLOAT}},
+                            device),
+       test::function::NDef("z2", "Identity", {"y:2"}, {{"T", DT_INT32}},
+                            device)},
+      // FunctionLib
+      {
+          func,
+      });
+
+  FunctionOptimizer optimizer(RewriterConfig::DEFAULT);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  item.fetch = {"z0", "z1", "z2"};
+  Tensor in(DT_FLOAT, {});
+  in.flat<float>()(0) = 3.14f;
+  item.feed.emplace_back("x0", in);
+  in.flat<float>()(0) = 2.7f;
+  item.feed.emplace_back("x1", in);
+  in.flat<float>()(0) = 1.0f;
+  item.feed.emplace_back("x2", in);
+  in.flat<float>()(0) = -1.0f;
+  item.feed.emplace_back("x4", in);
+  Tensor in_int(DT_INT32, {});
+  in_int.flat<int>()(0) = 1234;
+  item.feed.emplace_back("x3", in_int);
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+  test::ExpectTensorEqual<float>(tensors_expected[1], tensors[1]);
+  test::ExpectTensorEqual<int>(tensors_expected[2], tensors[2]);
+}
+
+TEST_F(FunctionOptimizerTest, FunctionWithoutInput) {
+  const Tensor kTwo = test::AsScalar<int64>(2);
+  FunctionDef func = FunctionDefHelper::Define(
+      // Name
+      "GenerateTwo",
+      // Args
+      {},
+      // Return value
+      {"o: T"},
+      // Attr def
+      {"T: {float, double}"},
+      // Nodes
+      {{{"two"}, "Const", {}, {{"value", kTwo}, {"dtype", DT_INT64}}},
+       {{"o"}, "Cast", {"two"}, {{"SrcT", DT_INT64}, {"DstT", "$T"}}}});
+
+  GrapplerItem item;
+  constexpr char device[] = "/device:CPU:0";
+  item.graph = test::function::GDef(
+      {test::function::NDef("y", "GenerateTwo", {}, {}, device),
+       test::function::NDef("z", "Identity", {"y"}, {{"T", DT_FLOAT}}, device)},
+      // FunctionLib
+      {
+          func,
+      });
+
+  FunctionOptimizer optimizer(RewriterConfig::DEFAULT);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  // For now we won't inline the function.
+  EXPECT_EQ(item.graph.DebugString(), output.DebugString());
+}
+
+TEST_F(FunctionOptimizerTest, SymbolicGradients) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+
+  FunctionDef func = FunctionDefHelper::Define(
+      "TestFunc", {"x:float", "y:float"}, {"l:float"}, {},
+      {
+          {{"z"}, "Add", {"x", "y"}, {{"T", DT_FLOAT}}},
+          FunctionDefHelper::Const("zero", 0),
+          FunctionDefHelper::Const("one", 1),
+          {{"r"}, "Rank", {"z"}, {{"T", DT_FLOAT}}},
+          {{"indices"}, "Range", {"zero", "r", "one"}},
+          {{"l"}, "Sum", {"z", "indices"}, {{"T", DT_FLOAT}}},
+      });
+
+  auto x = ops::Const(scope, 1.0f);
+  auto y = ops::Const(scope, 2.0f);
+  auto dl = ops::Const(scope, 3.0f);
+
+  NameAttrList fn;
+  fn.set_name("TestFunc");
+  (*fn.mutable_attr())["T"].set_type(DT_FLOAT);
+  auto g0 = ops::SymbolicGradient(scope, std::initializer_list<Input>{x, y, dl},
+                                  {DT_FLOAT, DT_FLOAT}, fn);
+  auto out1 = ops::Identity(scope.WithOpName("out1"), g0.output[0]);
+  auto out2 = ops::Identity(scope.WithOpName("out2"), g0.output[1]);
+
+  GrapplerItem item;
+  TF_EXPECT_OK(scope.ToGraphDef(&item.graph));
+  *item.graph.mutable_library()->add_function() = func;
+
+  FunctionOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::vector<Tensor> expected = EvaluateNodes(item.graph, {"out1", "out2"});
+  std::vector<Tensor> optimized = EvaluateNodes(output, {"out1", "out2"});
+  test::ExpectTensorEqual<float>(expected[0], optimized[0]);
+  test::ExpectTensorEqual<float>(expected[1], optimized[1]);
+}
+
+TEST_F(FunctionOptimizerTest, SymbolicGradientsIdentity) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "Identity_func",
+      // Args
+      {"in: float"},
+      // Return values
+      {"out: float"},
+      // Attr def
+      {},
+      // Nodes
+      {{{"Identity"}, "Identity", {"in"}, {{"T", DT_FLOAT}}}},
+      // Mapping
+      {{"out", "Identity:output:0"}});
+
+  auto x = ops::Const(scope, 1.0f, {3, 5, 7});
+  auto z = ops::Const(scope, 3.0f, {3, 5, 7});
+
+  NameAttrList fn;
+  fn.set_name("Identity_func");
+  auto g0 = ops::SymbolicGradient(scope, std::initializer_list<Input>{x, z},
+                                  {DT_FLOAT}, fn);
+  auto out = ops::Identity(scope.WithOpName("out"), g0.output[0]);
+
+  GrapplerItem item;
+  TF_EXPECT_OK(scope.ToGraphDef(&item.graph));
+  *item.graph.mutable_library()->add_function() = func;
+
+  FunctionOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  EXPECT_EQ(13, output.node_size());
+  EXPECT_EQ("Const", output.node(0).name());
+  EXPECT_EQ("Const_1", output.node(1).name());
+  EXPECT_EQ("SymbolicGradient/FunctionInputs", output.node(2).name());
+  EXPECT_EQ("SymbolicGradient", output.node(3).name());
+  EXPECT_EQ("SymbolicGradient/SymbolicGradient/Identity",
+            output.node(4).name());
+  EXPECT_EQ("SymbolicGradient/Func/_0", output.node(5).name());
+  EXPECT_EQ("SymbolicGradient/Func/_1", output.node(6).name());
+  EXPECT_EQ("SymbolicGradient/Func/_2", output.node(7).name());
+  EXPECT_EQ("SymbolicGradient/SymbolicGradient/Func/_1/dx",
+            output.node(8).name());
+  EXPECT_EQ("SymbolicGradient/Func/_3", output.node(9).name());
+  EXPECT_EQ("SymbolicGradient/Func/_4", output.node(10).name());
+  EXPECT_EQ("SymbolicGradient/Func/_5", output.node(11).name());
+  EXPECT_EQ("out", output.node(12).name());
+  for (int i = 2; i < 4; ++i) {
+    EXPECT_EQ("IdentityN", output.node(i).op());
+  }
+  for (int i = 4; i < 11; ++i) {
+    EXPECT_EQ("Identity", output.node(i).op());
+  }
+
+  std::vector<Tensor> expected = EvaluateNodes(item.graph, {"out"});
+  std::vector<Tensor> optimized = EvaluateNodes(output, {"out"});
+  test::ExpectTensorEqual<float>(expected[0], optimized[0]);
+}
+
+TEST_F(FunctionOptimizerTest, SymbolicGradientsNoInlineFunc) {
+  FunctionDef func = FunctionDefHelper::Define(
+      "TestFunc", {"x:float", "y:float"}, {"l:float"}, {},
+      {
+          {{"z"}, "Add", {"x", "y"}, {{"T", DT_FLOAT}}},
+          FunctionDefHelper::Const("zero", 0),
+          FunctionDefHelper::Const("one", 1),
+          {{"r"}, "Rank", {"z"}, {{"T", DT_FLOAT}}},
+          {{"indices"}, "Range", {"zero", "r", "one"}},
+          {{"l"}, "Sum", {"z", "indices"}, {{"T", DT_FLOAT}}},
+      });
+  (*func.mutable_attr())["_noinline"].set_b(true);
+
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+  auto x = ops::Const(scope, 1.0f);
+  auto y = ops::Const(scope, 2.0f);
+  auto dl = ops::Const(scope, 3.0f);
+
+  NameAttrList fn;
+  fn.set_name("TestFunc");
+  (*fn.mutable_attr())["T"].set_type(DT_FLOAT);
+  auto g0 = ops::SymbolicGradient(scope, std::initializer_list<Input>{x, y, dl},
+                                  {DT_FLOAT, DT_FLOAT}, fn);
+  auto out1 = ops::Identity(scope.WithOpName("out1"), g0.output[0]);
+  auto out2 = ops::Identity(scope.WithOpName("out2"), g0.output[1]);
+
+  GrapplerItem item;
+  TF_EXPECT_OK(scope.ToGraphDef(&item.graph));
+  *item.graph.mutable_library()->add_function() = func;
+
+  FunctionOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  // The optimizer should succeed but the graphs should be the same.
+  TF_EXPECT_OK(status);
+  CompareGraphs(item.graph, output);
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/gpu_swapping_kernels.cc b/tensorflow/core/grappler/optimizers/gpu_swapping_kernels.cc
new file mode 100644
index 0000000000000000000000000000000000000000..1820af6844215475d2bfccba93891a52029218b2
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/gpu_swapping_kernels.cc
@@ -0,0 +1,88 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Op kernels used to swap data in and out of GPU memory.
+
+#include "tensorflow/core/common_runtime/device.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+namespace {
+
+class CopyFromGpuToHostKernel : public AsyncOpKernel {
+ public:
+  explicit CopyFromGpuToHostKernel(OpKernelConstruction* context)
+      : AsyncOpKernel(context) {}
+  void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
+    const Tensor& input = ctx->input(0);
+    OP_REQUIRES_ASYNC(
+        ctx, !ctx->input_alloc_attr(0).on_host(),
+        errors::Internal("The input tensor to the _CopyFromGpuToHost kernel "
+                         "must reside on the device."),
+        done);
+
+    AllocatorAttributes alloc_attrs;
+    alloc_attrs.set_gpu_compatible(true);
+    alloc_attrs.set_on_host(true);
+    Tensor* output;
+    OP_REQUIRES_OK_ASYNC(
+        ctx, ctx->allocate_output(0, input.shape(), &output, alloc_attrs),
+        done);
+
+    ctx->op_device_context()->CopyDeviceTensorToCPU(
+        &input, "CopyFromGpuToHost", static_cast<Device*>(ctx->device()),
+        output, [ctx, done](const Status& s) {
+          ctx->SetStatus(s);
+          done();
+        });
+  }
+};
+
+REGISTER_KERNEL_BUILDER(
+    Name("_CopyFromGpuToHost").Device(DEVICE_GPU).HostMemory("output"),
+    CopyFromGpuToHostKernel);
+
+class CopyFromHostToGpuKernel : public AsyncOpKernel {
+ public:
+  explicit CopyFromHostToGpuKernel(OpKernelConstruction* context)
+      : AsyncOpKernel(context) {}
+  void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
+    const Tensor& input = ctx->input(0);
+    OP_REQUIRES_ASYNC(
+        ctx, ctx->input_alloc_attr(0).on_host(),
+        errors::Internal("The input tensor to the _CopyFromHostToGpu kernel "
+                         "must reside on the host."),
+        done);
+
+    Tensor* output;
+    OP_REQUIRES_OK_ASYNC(ctx, ctx->allocate_output(0, input.shape(), &output),
+                         done);
+
+    ctx->op_device_context()->CopyCPUTensorToDevice(
+        &input, static_cast<Device*>(ctx->device()), output,
+        [ctx, done](const Status& s) {
+          ctx->SetStatus(s);
+          done();
+        });
+  }
+};
+
+REGISTER_KERNEL_BUILDER(
+    Name("_CopyFromHostToGpu").Device(DEVICE_GPU).HostMemory("input"),
+    CopyFromHostToGpuKernel);
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/gpu_swapping_ops.cc b/tensorflow/core/grappler/optimizers/gpu_swapping_ops.cc
new file mode 100644
index 0000000000000000000000000000000000000000..46828346da608a237528da2a2a8070c57946f762
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/gpu_swapping_ops.cc
@@ -0,0 +1,58 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Definition for the ops used to swap data in and out of GPU memory.
+
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/shape_inference.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+namespace {
+
+// The _CopyFromGpuToHost op copies its input tensor to the host. The input must
+// reside on GPU. The op itself must be placed on GPU.
+REGISTER_OP("_CopyFromGpuToHost")
+    .Input("input: T")
+    .Output("output: T")
+    .Attr("T: type")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      c->set_output(0, c->input(0));
+      auto* handle_data = c->input_handle_shapes_and_types(0);
+      if (handle_data != nullptr) {
+        c->set_output_handle_shapes_and_types(0, *handle_data);
+      }
+      return Status::OK();
+    })
+    .Doc("Copies the input tensor from the GPU to the host.");
+
+// The _CopyFromHostToGpu op copies its input tensor from the host to the GPU.
+// The input must reside on CPU. The op itself must be placed on GPU.
+REGISTER_OP("_CopyFromHostToGpu")
+    .Input("input: T")
+    .Output("output: T")
+    .Attr("T: type")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      c->set_output(0, c->input(0));
+      auto* handle_data = c->input_handle_shapes_and_types(0);
+      if (handle_data != nullptr) {
+        c->set_output_handle_shapes_and_types(0, *handle_data);
+      }
+      return Status::OK();
+    })
+    .Doc("Copies the input tensor from the host to the GPU.");
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/graph_optimizer_stage.cc b/tensorflow/core/grappler/optimizers/graph_optimizer_stage.cc
new file mode 100644
index 0000000000000000000000000000000000000000..7044705adee7c1be52b04e6556066546b17f944f
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/graph_optimizer_stage.cc
@@ -0,0 +1,120 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/graph_optimizer_stage.h"
+
+namespace tensorflow {
+namespace grappler {
+
+const NodeScopeAndName ParseNodeScopeAndName(const string& node_name) {
+  auto pos = node_name.find_last_of("/");
+  if (pos == string::npos) {
+    return {"", node_name};
+  } else {
+    return {node_name.substr(0, pos), node_name.substr(pos + 1)};
+  }
+}
+
+Status GetInputNode(const GraphOptimizerContext& ctx, const string& input,
+                    NodeDef** node) {
+  string node_name = NodeName(input);
+  NodeDef* node_by_name = ctx.node_map->GetNode(node_name);
+  if (node_by_name == nullptr) {
+    return errors::FailedPrecondition("Node ", node_name,
+                                      " doesn't exist in a node map");
+  }
+  *node = node_by_name;
+  return Status::OK();
+}
+
+Status GetTensorProperties(const GraphOptimizerContext& ctx,
+                           const string& tensor,
+                           OpInfo::TensorProperties* properties) {
+  int port;
+  string tensor_node_name = ParseNodeName(tensor, &port);
+  if (port < 0) {
+    return errors::InvalidArgument(
+        "Can't get tensor properties of control dependency ", tensor);
+  }
+
+  const auto& output_properties =
+      ctx.graph_properties->GetOutputProperties(tensor_node_name);
+  auto num_outputs = output_properties.size();
+
+  if (num_outputs == 0 || port > num_outputs - 1) {
+    return errors::InvalidArgument(
+        "Node ", tensor_node_name,
+        " is missing output properties at position :", port,
+        " (num_outputs=", num_outputs, ")");
+  }
+
+  properties->CopyFrom(output_properties[port]);
+  return Status::OK();
+}
+
+NodeDef* AddCopyNode(const GraphOptimizerContext& ctx, const string& name,
+                     const NodeDef* node_to_copy) {
+  CHECK(node_to_copy != nullptr);
+  CHECK(!ctx.node_map->NodeExists(name))
+      << "Node " << name << " already exists in a graph";
+  NodeDef* new_node = ctx.optimized_graph->add_node();
+  *new_node = *node_to_copy;
+  new_node->set_name(name);
+  ctx.node_map->AddNode(name, new_node);
+  return new_node;
+}
+
+NodeDef* AddEmptyNode(const GraphOptimizerContext& ctx, const string& name) {
+  CHECK(!ctx.node_map->NodeExists(name))
+      << "Node " << name << " already exists in a graph";
+  NodeDef* new_node = ctx.optimized_graph->add_node();
+  new_node->set_name(name);
+  ctx.node_map->AddNode(name, new_node);
+  return new_node;
+}
+
+const string MakeOptimizedNodeName(const NodeScopeAndName& node,
+                                   const string& sub_scope,
+                                   const string& prefix) {
+  CHECK(!sub_scope.empty() || !prefix.empty())
+      << "Either optimized node name prefix or sub-scope must be non-empty";
+  string optimized_node_name;
+  if (!node.scope.empty()) {
+    strings::StrAppend(&optimized_node_name, node.scope, "/");
+  }
+  if (!sub_scope.empty()) {
+    strings::StrAppend(&optimized_node_name, sub_scope, "/");
+  }
+  if (!prefix.empty()) {
+    strings::StrAppend(&optimized_node_name, prefix, "_");
+  }
+  strings::StrAppend(&optimized_node_name, node.name);
+  return optimized_node_name;
+}
+
+const string MakeOptimizedNodeName(const NodeScopeAndName& root,
+                                   const std::vector<string> node_names,
+                                   const string& sub_scope,
+                                   const string& prefix) {
+  string optimized_node_name = MakeOptimizedNodeName(root, sub_scope, prefix);
+  for (const string& node_name : node_names) {
+    auto name_and_scope = ParseNodeScopeAndName(node_name);
+    strings::StrAppend(&optimized_node_name, "_", name_and_scope.name);
+  }
+  return optimized_node_name;
+}
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/graph_optimizer_stage.h b/tensorflow/core/grappler/optimizers/graph_optimizer_stage.h
new file mode 100644
index 0000000000000000000000000000000000000000..be95c00d2dae30d4419cc45eec9f2f4855daeadd
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/graph_optimizer_stage.h
@@ -0,0 +1,185 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_GRAPPLER_OPTIMIZERS_OPTIMIZER_STAGE_H_
+#define TENSORFLOW_GRAPPLER_OPTIMIZERS_OPTIMIZER_STAGE_H_
+
+#include <unordered_map>
+#include <unordered_set>
+#include "tensorflow/core/grappler/costs/graph_properties.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/frame.h"
+
+namespace tensorflow {
+namespace grappler {
+
+struct NodeScopeAndName {
+  string scope;
+  string name;
+};
+
+// Parse scope and name: "a/b/c/Add_1" -> {"a/b/c", "Add_1"}
+const NodeScopeAndName ParseNodeScopeAndName(const string& node_name);
+
+// Context owned by GraphOptimizer, and passed to every stage at construction
+// time. Each optimizer stage is responsible for updating it according to the
+// changes it made to the graph.
+//
+// If an optimizer needs access to some helper class that is not present in this
+// context, consider creating an extension context, specific to that
+// optimizer (see example of ArithmeticOptimizerContext). GraphOptimizerContext
+// should only have members that are useful to almost all optimizers.
+struct GraphOptimizerContext {
+  GraphOptimizerContext(const std::unordered_set<string>* nodes_to_preserve,
+                        GraphDef* optimized_graph,
+                        GraphProperties* graph_properties, NodeMap* node_map,
+                        FrameMap* frame_map)
+      : nodes_to_preserve(nodes_to_preserve),
+        optimized_graph(optimized_graph),
+        graph_properties(graph_properties),
+        node_map(node_map),
+        frame_map(frame_map) {}
+
+  const std::unordered_set<string>* nodes_to_preserve;
+  GraphDef* optimized_graph;
+  GraphProperties* graph_properties;
+  NodeMap* node_map;
+  // TODO(ezhulenev): it seems that frame_map is only relevant for loop
+  // optimizer? Move it to loop-optimizer specific context extension.
+  FrameMap* frame_map;
+};
+
+Status GetInputNode(const GraphOptimizerContext& ctx, const string& input,
+                    NodeDef** node);
+Status GetTensorProperties(const GraphOptimizerContext& ctx,
+                           const string& tensor,
+                           OpInfo::TensorProperties* properties);
+
+NodeDef* AddCopyNode(const GraphOptimizerContext& ctx, const string& name,
+                     const NodeDef* node_to_copy);
+NodeDef* AddEmptyNode(const GraphOptimizerContext& ctx, const string& name);
+
+// WARNING:
+// Optimizer stage must try to re-use original nodes of a graph and
+// make all updates in place. This helps to make robust node placement
+// decisions. Create new nodes only if there is a reason for that.
+
+// Make a name for a new node obtained by optimizing a single node of the
+// original graph. The optimized node is placed under the original node scope.
+//
+// Node name uniqueness is guaranteed by the unique name of the original node
+// in the same scope.
+//
+// An empty sub_scope or prefix is ignored; at least one must be non-empty.
+//
+// Example: a/b/c/Add -> a/b/c/${sub_scope}/${prefix}_Add.
+const string MakeOptimizedNodeName(const NodeScopeAndName& node,
+                                   const string& sub_scope,
+                                   const string& prefix);
+// Make a name for a new node obtained by optimizing multiple nodes of the
+// original graph, starting from "root". The optimized node is placed under
+// the original scope of a "root" node.
+//
+// Example: [a/b/c/Add, x/y/z/Mul] -> a/b/c/${sub_scope}/${prefix}_Add_Mul
+const string MakeOptimizedNodeName(const NodeScopeAndName& root,
+                                   const std::vector<string> node_names,
+                                   const string& sub_scope,
+                                   const string& prefix);
+
+// Base class for multi-stage GraphOptimizers (ArithmeticOptimizer, etc...).
+//
+// If a graph optimizer consists of a large number of small independent
+// rewrites, each of them should be implemented as a separate stage.
+//
+// * Result:
+// Each graph optimizer chooses what result is reported by each stage
+// (e.g. each stage can fill in the name of optimized nodes, or have a more
+// complex result).
+template <typename Result>
+class GraphOptimizerStage {
+ public:
+  explicit GraphOptimizerStage(const string& optimizer_name,
+                               const string& stage_name,
+                               const GraphOptimizerContext& ctx)
+      : optimizer_name_(optimizer_name), stage_name_(stage_name), ctx_(ctx) {}
+  virtual ~GraphOptimizerStage() = default;
+
+  // Check if we should try to simplify node. Returning true doesn't
+  // guarantee that node will be simplified.
+  //
+  // Should implement just a basic sanity check, without any expensive graph
+  // traversals.
+  virtual bool IsSupported(const NodeDef* node) const = 0;
+
+  // Try to simplify the given node.
+  //
+  // Return an error status only if some precondition failed or the graph is
+  // invalid. In every other case return Status::OK(), even if nothing was
+  // simplified.
+  //
+  // Report the result using the output argument. Each GraphOptimizer can
+  // choose its own Result type.
+  // TODO(ezhulenev): if it turns out that the Result output parameter is not
+  // sufficiently useful (used with a reason by most optimizers), get rid of it,
+  // and remove the template parameter.
+  virtual Status TrySimplify(NodeDef* node, Result* result) = 0;
+
+  // Get a name for a new node, created by this stage, based on one or multiple
+  // nodes of an original graph.
+  const string OptimizedNodeName(const NodeScopeAndName& node) const {
+    return MakeOptimizedNodeName(node, optimizer_name_, stage_name_);
+  }
+  const string OptimizedNodeName(const NodeScopeAndName& root,
+                                 const std::vector<string>& nodes) const {
+    return MakeOptimizedNodeName(root, nodes, optimizer_name_, stage_name_);
+  }
+  const string OptimizedNodeName(const NodeScopeAndName& node,
+                                 const string& rewrite_rule) const {
+    const string prefix = strings::StrCat(stage_name_, "_", rewrite_rule);
+    return MakeOptimizedNodeName(node, optimizer_name_, prefix);
+  }
+
+  // Get a node by input name from a node map. Return an error if node was not
+  // found.
+  Status GetInputNode(const string& input, NodeDef** node) const {
+    return ::tensorflow::grappler::GetInputNode(ctx_, input, node);
+  }
+  // Look up tensor properties by name. The tensor name might have a non-zero
+  // port number. Return an error if the tensor node doesn't exist in a graph,
+  // or it doesn't have properties defined for the requested port.
+  Status GetTensorProperties(const string& tensor,
+                             OpInfo::TensorProperties* properties) const {
+    return ::tensorflow::grappler::GetTensorProperties(ctx_, tensor,
+                                                       properties);
+  }
+
+  NodeDef* AddCopyNode(const string& name, const NodeDef* node_to_copy) {
+    return ::tensorflow::grappler::AddCopyNode(ctx_, name, node_to_copy);
+  }
+  NodeDef* AddEmptyNode(const string& name) {
+    return ::tensorflow::grappler::AddEmptyNode(ctx_, name);
+  }
+
+ protected:  // Data members
+  const string optimizer_name_;
+  const string stage_name_;
+  const GraphOptimizerContext ctx_;
+};
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_GRAPPLER_OPTIMIZERS_OPTIMIZER_STAGE_H_
diff --git a/tensorflow/core/grappler/optimizers/graph_optimizer_stage_test.cc b/tensorflow/core/grappler/optimizers/graph_optimizer_stage_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..416327e6228431edbf0389f6135cdd028fda45dc
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/graph_optimizer_stage_test.cc
@@ -0,0 +1,168 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/graph_optimizer_stage.h"
+
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/grappler/costs/graph_properties.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+class GraphOptimizerStageTest : public ::testing::Test {};
+
+struct FakeResult {};
+
+// NoOp optimizer stage that supports all the node types and does nothing
+class FakeOptimizerStage : public GraphOptimizerStage<FakeResult> {
+ public:
+  explicit FakeOptimizerStage(const string& optimizer_name,
+                              const string& stage_name,
+                              const GraphOptimizerContext& ctx)
+      : GraphOptimizerStage(optimizer_name, stage_name, ctx) {}
+  ~FakeOptimizerStage() override = default;
+
+  bool IsSupported(const NodeDef* node) const override { return true; }
+  Status TrySimplify(NodeDef* node, FakeResult* result) override {
+    return Status::OK();
+  }
+};
+
+TEST_F(GraphOptimizerStageTest, ParseNodeNameAndScope_InRoot) {
+  const auto scope_and_name = ParseNodeScopeAndName("Add");
+  EXPECT_EQ("", scope_and_name.scope);
+  EXPECT_EQ("Add", scope_and_name.name);
+}
+
+TEST_F(GraphOptimizerStageTest, ParseNodeNameAndScope_InScope) {
+  const auto scope_and_name = ParseNodeScopeAndName("a/b/c/Add");
+  EXPECT_EQ("a/b/c", scope_and_name.scope);
+  EXPECT_EQ("Add", scope_and_name.name);
+}
+
+TEST_F(GraphOptimizerStageTest, OptimizedNodeName) {
+  GraphOptimizerContext ctx(/*nodes_to_preserve*/ nullptr,
+                            /*optimized_graph*/ nullptr,
+                            /*graph_properties*/ nullptr, /*node_map*/ nullptr,
+                            /*frame_map*/ nullptr);
+  FakeOptimizerStage stage("my_opt", "my_stg", ctx);
+
+  const auto node = ParseNodeScopeAndName("a/b/c/Add");
+
+  // Without rewrite rule
+  EXPECT_EQ("a/b/c/my_opt/my_stg_Add", stage.OptimizedNodeName(node));
+  EXPECT_EQ(
+      "a/b/c/my_opt/my_stg_Add_Mul_Sqrt",
+      stage.OptimizedNodeName(node, std::vector<string>({"Mul", "Sqrt"})));
+
+  // With rewrite rule
+  const string rewrite = "my_rewrite";
+  EXPECT_EQ("a/b/c/my_opt/my_stg_my_rewrite_Add",
+            stage.OptimizedNodeName(node, rewrite));
+}
+
+TEST_F(GraphOptimizerStageTest, GetInputNodeAndProperties) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
+  auto a = ops::Variable(s.WithOpName("a"), {2, 2}, DT_FLOAT);
+  auto b = ops::Variable(s.WithOpName("b"), {2, 2}, DT_FLOAT);
+  auto add = ops::Add(s.WithOpName("Add"), a, b);
+
+  GrapplerItem item;
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
+  GraphProperties properties(item);
+  TF_CHECK_OK(properties.InferStatically(/*assume_valid_feeds*/ false));
+
+  NodeMap node_map(&item.graph);
+
+  GraphOptimizerContext ctx(/*nodes_to_preserve*/ nullptr,
+                            /*optimized_graph*/ &item.graph,
+                            /*graph_properties*/ &properties,
+                            /*node_map*/ &node_map,
+                            /*frame_map*/ nullptr);
+  FakeOptimizerStage stage("my_opt", "my_stg", ctx);
+
+  NodeDef* add_node;
+  TF_CHECK_OK(stage.GetInputNode("Add", &add_node));
+  EXPECT_EQ("a", add_node->input(0));
+  EXPECT_EQ("b", add_node->input(1));
+
+  OpInfo::TensorProperties add_properties;
+  TF_CHECK_OK(stage.GetTensorProperties("Add", &add_properties));
+  EXPECT_EQ(DT_FLOAT, add_properties.dtype());
+
+  OpInfo::TensorProperties a_properties;
+  TF_CHECK_OK(stage.GetTensorProperties("a:0", &a_properties));
+  EXPECT_EQ(DT_FLOAT_REF, a_properties.dtype());
+
+  OpInfo::TensorProperties b_properties;
+  TF_CHECK_OK(stage.GetTensorProperties("b:0", &b_properties));
+  EXPECT_EQ(DT_FLOAT_REF, b_properties.dtype());
+}
+
+TEST_F(GraphOptimizerStageTest, AddNodes) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
+  auto a = ops::Variable(s.WithOpName("a"), {2, 2}, DT_FLOAT);
+  auto b = ops::Variable(s.WithOpName("b"), {2, 2}, DT_FLOAT);
+  auto add = ops::Add(s.WithOpName("Add"), a, b);
+
+  GrapplerItem item;
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
+  GraphProperties properties(item);
+  TF_CHECK_OK(properties.InferStatically(/*assume_valid_feeds*/ false));
+
+  NodeMap node_map(&item.graph);
+
+  GraphOptimizerContext ctx(/*nodes_to_preserve*/ nullptr,
+                            /*optimized_graph*/ &item.graph,
+                            /*graph_properties*/ &properties,
+                            /*node_map*/ &node_map,
+                            /*frame_map*/ nullptr);
+  FakeOptimizerStage stage("my_opt", "my_stg", ctx);
+
+  NodeDef* add_node;
+  TF_CHECK_OK(stage.GetInputNode("Add", &add_node));
+
+  // Add a new copy node
+  NodeDef* add_node_copy = stage.AddCopyNode("Add_1", add_node);
+  EXPECT_EQ("Add_1", add_node_copy->name());
+  EXPECT_EQ("Add", add_node_copy->op());
+  EXPECT_EQ("a", add_node_copy->input(0));
+  EXPECT_EQ("b", add_node_copy->input(1));
+
+  // It must be available for by-name lookup
+  NodeDef* add_node_copy_by_name;
+  TF_CHECK_OK(stage.GetInputNode("Add_1", &add_node_copy_by_name));
+  EXPECT_EQ(add_node_copy, add_node_copy_by_name);
+
+  // Add new empty node
+  NodeDef* empty_node = stage.AddEmptyNode("Add_2");
+  EXPECT_EQ("Add_2", empty_node->name());
+
+  // It must be available for by-name lookup
+  NodeDef* empty_node_by_name;
+  TF_CHECK_OK(stage.GetInputNode("Add_2", &empty_node_by_name));
+  EXPECT_EQ(empty_node, empty_node_by_name);
+}
+
+}  // namespace
+}  // end namespace grappler
+}  // end namespace tensorflow
\ No newline at end of file
diff --git a/tensorflow/core/grappler/optimizers/layout_optimizer.cc b/tensorflow/core/grappler/optimizers/layout_optimizer.cc
index 826f00209b15705f2a9b8b43f78134498a19d167..18e63f823b0ca6ee1011e8e98c43e46ad07c2995 100644
--- a/tensorflow/core/grappler/optimizers/layout_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/layout_optimizer.cc
@@ -301,10 +301,6 @@ bool IsComparisonOp(const NodeDef& node) {
   return is_compare;
 }
 
-bool IsLogicalOp(const NodeDef& node) {
-  return IsLogicalAnd(node) || IsLogicalNot(node) || IsLogicalOr(node);
-}
-
 bool IsReduceOp(const NodeDef& node) {
   return IsSum(node) || IsMean(node) || IsProd(node) || IsMax(node) ||
          IsMin(node) || IsAll(node) || IsAny(node);
@@ -355,7 +351,7 @@ std::vector<int> DataInputPos(const NodeDef& node) {
   if (IsBetainc(node) || IsSelect(node)) {
     return {0, 1, 2};
   }
-  if (IsShapeN(node) || IsIdentityN(node) || IsAddN(node)) {
+  if (IsShapeN(node) || IsIdentityN(node) || IsAddN(node) || IsMerge(node)) {
     return NonControlInputs(node);
   }
   if (IsConcat(node)) {
@@ -1160,9 +1156,11 @@ class AgnosticNodeProcessor : public NodeProcessor {
     std::set<string> ops_format_agnostic = GetOpsFormatAgnostic();
     std::deque<NodeDef*> queue;
     auto data_node_pos = DataInputPos(node);
+    std::unordered_set<string> visited;
     for (const auto& pos : data_node_pos) {
       auto input_node = node_map_->GetNode(node.input(pos));
       queue.push_back(input_node);
+      visited.insert(input_node->name());
     }
     // The code will exit this while loop in one iteration in most cases, as the
     // graph is already topologically sorted.
@@ -1181,7 +1179,10 @@ class AgnosticNodeProcessor : public NodeProcessor {
         auto current_node_pos = DataInputPos(*current_node);
         for (const auto& pos : current_node_pos) {
           auto input_node = node_map_->GetNode(current_node->input(pos));
-          queue.push_back(input_node);
+          if (visited.find(input_node->name()) == visited.end()) {
+            queue.push_back(input_node);
+            visited.insert(input_node->name());
+          }
         }
       }
     }
diff --git a/tensorflow/core/grappler/optimizers/layout_optimizer_test.cc b/tensorflow/core/grappler/optimizers/layout_optimizer_test.cc
index 5cb366df2dccee2260c6f407e992e73296712ccc..1c912fcaa251c576308a983ef351319053423a85 100644
--- a/tensorflow/core/grappler/optimizers/layout_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/layout_optimizer_test.cc
@@ -1102,6 +1102,34 @@ TEST_F(LayoutOptimizerTest, IdentityNWithInputsVectorAnd4D) {
   EXPECT_EQ(add_node->input(1),
             "identity_n-0-1-TransposeNCHWToNHWC-LayoutOptimizer");
 }
+
+TEST_F(LayoutOptimizerTest, LoopNoLiveLock) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  auto c = ops::Const(s.WithOpName("const"), 3.0f, {8, 3, 3, 2});
+  auto merge = ops::Merge(s.WithOpName("merge"), {c, c});
+  auto i0 = ops::Identity(s.WithOpName("i0"), merge.output);
+  ops::Variable v_ctrl(s.WithOpName("v_ctrl"), {}, DT_BOOL);
+  auto sw = ops::Switch(s.WithOpName("switch"), i0, v_ctrl);
+  auto next = ops::NextIteration(s.WithOpName("next"), sw.output_true);
+  auto conv = SimpleConv2D(&s, 4, 2, "VALID");
+  auto mul = ops::Mul(s.WithOpName("mul"), conv, sw.output_false);
+  GrapplerItem item;
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+  NodeMap node_map_original(&item.graph);
+  auto merge_node = node_map_original.GetNode("merge");
+  // Modify the graph to create a loop
+  merge_node->set_input(1, "next");
+  LayoutOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(virtual_cluster_.get(), item, &output);
+  NodeMap node_map(&output);
+  auto conv_node = node_map.GetNode("Conv2D");
+  EXPECT_EQ(conv_node->input(0),
+            "Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer");
+  auto mul_node = node_map.GetNode("mul");
+  EXPECT_EQ(mul_node->input(0),
+            "Conv2D-0-0-TransposeNCHWToNHWC-LayoutOptimizer");
+}
 }  // namespace
 }  // namespace grappler
 }  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/loop_optimizer.cc b/tensorflow/core/grappler/optimizers/loop_optimizer.cc
index 102526e22f4742cb90757a1daf55467dd16afc3e..a063dc33816e25c560a385e188203c9ad9bfe4cd 100644
--- a/tensorflow/core/grappler/optimizers/loop_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/loop_optimizer.cc
@@ -15,23 +15,472 @@ limitations under the License.
 
 #include "tensorflow/core/grappler/optimizers/loop_optimizer.h"
 
+#include <algorithm>
+#include <limits>
 #include <unordered_map>
 #include <unordered_set>
+#include <vector>
+#include <deque>
 
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/tensor_shape.pb.h"
+#include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/grappler/costs/graph_properties.h"
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/optimizers/constant_folding.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/frame.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/tensor_coding.h"
+#include "tensorflow/core/util/device_name_utils.h"
+#include "tensorflow/core/util/saved_tensor_slice_util.h"
+
+using tensorflow::strings::StrCat;
 
 namespace tensorflow {
 namespace grappler {
+namespace {
+
+std::vector<int> GetStackPushNodesToConvert(
+    const SimpleGraphView& graph_view,
+    const std::unordered_set<string>& nodes_to_preserve, int stack_node_idx) {
+  VLOG(1) << "Stack node: " << graph_view.graph()->node(stack_node_idx).name();
+  const std::unordered_set<string> op_types_to_traverse(
+      {"Stack", "StackV2", "Enter", "RefEnter", "Switch", "RefSwitch",
+       "Identity", "RefIdentity"});
+  std::vector<int> nodes_to_convert;
+  std::set<int> fanout;
+  graph_view.DepthFirstSearch(op_types_to_traverse, stack_node_idx, &fanout);
+  for (int fanout_idx : fanout) {
+    const NodeDef& fanout_node = graph_view.graph()->node(fanout_idx);
+    VLOG(1) << "Fanout " << fanout_idx << " : " << fanout_node.name();
+    if (IsStackPushOp(fanout_node)) {
+      nodes_to_convert.push_back(fanout_idx);
+    } else if (IsStackOp(fanout_node) || IsStackCloseOp(fanout_node) ||
+               op_types_to_traverse.find(fanout_node.op()) !=
+                   op_types_to_traverse.end()) {
+      continue;
+    } else if (!IsStackPopOp(fanout_node) ||
+               (!graph_view.outputs(fanout_idx).empty() ||
+                nodes_to_preserve.find(fanout_node.name()) !=
+                    nodes_to_preserve.end())) {
+      // The node is either a stack pop with consumers or something unexpected
+      // so we leave the graph alone.
+      nodes_to_convert.clear();
+      break;
+    }
+  }
+  return nodes_to_convert;
+}
+
+Status RemoveStackOps(const GrapplerItem& item, GraphDef* optimized_graph) {
+  const std::unordered_set<string> nodes_to_preserve = item.NodesToPreserve();
+  const GraphDef& graph = item.graph;
+  *optimized_graph = graph;
+  NodeMap node_map(optimized_graph);
+  SimpleGraphView graph_view;
+  TF_RETURN_IF_ERROR(graph_view.Initialize(graph));
+  for (int node_idx = 0; node_idx < graph.node_size(); ++node_idx) {
+    if (IsStackOp(graph.node(node_idx))) {
+      for (int push_node_idx : GetStackPushNodesToConvert(
+               graph_view, nodes_to_preserve, node_idx)) {
+        // We found push nodes without corresponding pops. Convert them to
+        // Identity passing the data through and add a control dependency from
+        // the op supplying the stack handle.
+        NodeDef* push_node = optimized_graph->mutable_node(push_node_idx);
+        VLOG(1) << "Converting " << push_node_idx << " : "
+                << push_node->DebugString();
+        if (push_node->attr().count("swap_memory") != 0) {
+          push_node->mutable_attr()->erase("swap_memory");
+        }
+        push_node->set_op("Identity");
+        push_node->mutable_input()->SwapElements(0, 1);
+        const string ctrl_dep = ConstantFolding::AddControlDependency(
+            push_node->input(1), optimized_graph, &node_map);
+        push_node->set_input(1, ctrl_dep);
+        VLOG(1) << "After converting: " << push_node->DebugString();
+      }
+    }
+  }
+  return Status::OK();
+}
+
+}  // namespace
+
+Status LoopOptimizer::LINMHandleInvariantEnter(NodeDef* node,
+                                               const int num_outputs) {
+  auto consumers = node_map_->GetOutputs(node->name());
+  std::vector<string> enter_control_inputs;
+  string enter_input;
+  for (auto& input : node->input()) {
+    if (IsControlInput(input)) {
+      enter_control_inputs.push_back(input);
+    } else {
+      enter_input = input;
+    }
+  }
+  for (auto* consumer : consumers) {
+    if (invariant_nodes_.count(consumer)) {
+      for (int i = 0; i < consumer->input_size(); ++i) {
+        if (NodeName(consumer->input(i)) == node->name()) {
+          consumer->set_input(i, enter_input);
+          node_map_->AddOutput(NodeName(enter_input), consumer->name());
+          node_map_->RemoveOutput(node->name(), consumer->name());
+        }
+      }
+      for (auto& control_input : enter_control_inputs) {
+        consumer->add_input(control_input);
+        node_map_->AddOutput(NodeName(control_input), consumer->name());
+      }
+    }
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::LINMHandleConst(NodeDef* node,
+    const int num_outputs, const int frame_id) {
+  NodeDef* const_node;
+  if (num_outputs == 0) {
+    // All successor nodes are invariant.
+    // Remove the control inputs from this frame to the const node
+    // when moving it out of the frame (into the parent frame).
+    const_node = node;
+    node_map_->RemoveInputs(node->name());
+    node->clear_input();
+  } else {
+    // Some successor nodes are variant.
+    // We have to keep the const node in the frame,
+    // so create a new one outside the frame (in the parent frame).
+    const_node = optimized_graph_->add_node();
+    const_node->set_name(AddPrefixToNodeName(node->name(), kLoopOptimizer));
+    const_node->set_op("Const");
+    const_node->set_device(node->device());
+    *const_node->mutable_attr() = node->attr();
+    node_map_->AddNode(const_node->name(), const_node);
+    auto consumers = node_map_->GetOutputs(node->name());
+    for (auto* consumer : consumers) {
+      if (invariant_nodes_.count(consumer)) {
+        for (int i = 0; i < consumer->input_size(); ++i) {
+          if (NodeName(consumer->input(i)) == node->name()) {
+            if (IsControlInput(consumer->input(i))) {
+              *consumer->mutable_input(i) = AsControlDependency(*const_node);
+            } else {
+              *consumer->mutable_input(i) = const_node->name();
+            }
+            node_map_->AddOutput(const_node->name(), consumer->name());
+            node_map_->RemoveOutput(node->name(), consumer->name());
+          }
+        }
+      }
+    }
+  }
+  // add a control input from the parent frame
+  auto parent_it = frame_parent_.find(frame_id);
+  if (parent_it != frame_parent_.end()) {
+    int parent_id = parent_it->second;
+    auto loop_cond_it = loop_cond_.find(parent_id);
+    if (loop_cond_it == loop_cond_.end()) {
+      return errors::InvalidArgument(
+          "Frame ", frame_id, " doesn't have a LoopCond node");
+    }
+    auto& loop_cond_name = loop_cond_it->second->name();
+    NodeDef* switch_node = nullptr;
+    for (auto* node : node_map_->GetOutputs(loop_cond_name)) {
+      if (node->op() == "Switch") {
+        switch_node = node;
+        break;
+      }
+    }
+    if (!switch_node) {
+      return errors::InvalidArgument(
+          "LoopCond node of Frame ", frame_id,
+          " doesn't connect to any Switch node");
+    }
+    string switch_output = StrCat(switch_node->name(), ":1");
+    const string ctrl_dep = ConstantFolding::AddControlDependency(
+        switch_output, optimized_graph_, node_map_.get());
+    const_node->add_input(ctrl_dep);
+    node_map_->AddOutput(NodeName(ctrl_dep), const_node->name());
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::LINMHandleInvariantNode(NodeDef* node,
+    const int num_outputs, const int frame_id) {
+  // Remove same-frame control inputs when moving this node out of the frame.
+  // Iterate backwards: the input swapped into slot i was already checked.
+  for (int i = node->input_size() - 1; i >= 0; --i) {
+    if (IsControlInput(node->input(i))) {
+      node->mutable_input()->SwapElements(i, node->input_size() - 1);
+      node->mutable_input()->RemoveLast();
+    }
+  }
+  if (num_outputs == 0) {
+    return Status::OK();
+  }
+
+  DataTypeVector input_types;
+  DataTypeVector output_types;
+  OpRegistryInterface* op_registry = OpRegistry::Global();
+  const OpRegistrationData* op_reg_data = nullptr;
+  TF_RETURN_IF_ERROR(
+      op_registry->LookUp(node->op(), &op_reg_data));
+  TF_RETURN_IF_ERROR(
+      InOutTypesForNode(*node, op_reg_data->op_def,
+                        &input_types, &output_types));
+
+  auto consumers = node_map_->GetOutputs(node->name());
+  string fname = invariant_enters_[frame_id][0]->attr().at("frame_name").s();
+  int piterations = invariant_enters_[frame_id][0]
+                    ->attr().at("parallel_iterations").i();
+  for (auto* consumer : consumers) {
+    if (!invariant_nodes_.count(consumer)) {
+      for (int i = 0; i < consumer->input_size(); ++i) {
+        int port;
+        string node_name = ParseNodeName(consumer->input(i), &port);
+        if (node_name != node->name()) {
+          continue;
+        }
+        if (port < 0) {
+          return errors::InvalidArgument(
+              "Invariant node should not have control outputs "
+              "to variant node");
+        }
+        DataType output_type = output_types[port];
+        NodeDef* new_enter = optimized_graph_->add_node();
+        new_enter->set_op("Enter");
+        new_enter->set_device(node->device());
+        new_enter->set_name(AddPrefixToNodeName(
+            StrCat(fname, "_enter_", new_enter_id_++), kLoopOptimizer));
+        AttrValue data_type;
+        data_type.set_type(output_type);
+        new_enter->mutable_attr()->insert({"T", data_type});
+        AttrValue frame_name;
+        frame_name.set_s(fname);
+        new_enter->mutable_attr()->insert({"frame_name", frame_name});
+        AttrValue is_const;
+        is_const.set_b(true);
+        new_enter->mutable_attr()->insert({"is_constant", is_const});
+        AttrValue parallel_iterations;
+        parallel_iterations.set_i(piterations);
+        new_enter->mutable_attr()->insert(
+            {"parallel_iterations", parallel_iterations});
+        new_enter->add_input(consumer->input(i));
+        *consumer->mutable_input(i) = new_enter->name();
+        node_map_->AddNode(new_enter->name(), new_enter);
+        node_map_->AddOutput(node->name(), new_enter->name());
+        node_map_->AddOutput(new_enter->name(), consumer->name());
+      }
+    }
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::MoveInvariantNodes(const int frame_id) {
+  for (auto iter = invariant_nodes_.begin();
+       iter != invariant_nodes_.end(); ++iter) {
+    auto* invariant_node = iter->first;
+    const int num_outputs = iter->second;
+    if (IsEnter(*invariant_node)) {
+      TF_RETURN_IF_ERROR(
+          LINMHandleInvariantEnter(invariant_node, num_outputs));
+    } else if (IsConstant(*invariant_node)) {
+      TF_RETURN_IF_ERROR(
+          LINMHandleConst(invariant_node, num_outputs, frame_id));
+    } else {
+      TF_RETURN_IF_ERROR(
+          LINMHandleInvariantNode(invariant_node, num_outputs, frame_id));
+    }
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::RevertInvariantNodes() {
+  std::deque<const NodeDef*> reverted_nodes;
+  for (auto iter = invariant_nodes_.begin(); iter != invariant_nodes_.end();) {
+    bool erased = false;
+    const auto* node = iter->first;
+    if (!IsConstant(*node) && !IsEnter(*node) && iter->second > 0) {
+      auto& consumers = node_map_->GetOutputs(node->name());
+      for (auto* consumer : consumers) {
+        if (!invariant_nodes_.count(consumer)) {
+          for (const auto& input : consumer->input()) {
+            if (IsControlInput(input) && NodeName(input) == node->name()) {
+              reverted_nodes.push_back(node);
+              invariant_nodes_.erase(iter++);
+              erased = true;
+              break;
+            }
+          }
+          if (erased) break;
+        }
+      }
+    }
+    if (!erased) ++iter;
+  }
+  while (!reverted_nodes.empty()) {
+    const auto* node = reverted_nodes.front();
+    reverted_nodes.pop_front();
+    std::set<NodeDef*> producers;
+    for (const auto& input : node->input()) {
+      auto* producer = node_map_->GetNode(input);
+      auto iter = invariant_nodes_.find(producer);
+      if (iter != invariant_nodes_.end()) {
+        if (IsControlInput(input) &&
+            !IsConstant(*producer) && !IsEnter(*producer)) {
+          reverted_nodes.push_back(producer);
+          invariant_nodes_.erase(iter);
+        } else {
+          producers.insert(producer);
+        }
+      }
+    }
+    for (auto* producer : producers) {
+      auto iter = invariant_nodes_.find(producer);
+      if (iter != invariant_nodes_.end()) {
+        ++iter->second;
+      }
+    }
+    for (auto* consumer : node_map_->GetOutputs(node->name())) {
+      auto iter = invariant_nodes_.find(consumer);
+      if (iter != invariant_nodes_.end()) {
+        reverted_nodes.push_back(consumer);
+        invariant_nodes_.erase(iter);
+      }
+    }
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::FindInvariantNodes(NodeDef* node) {
+  auto consumers = node_map_->GetOutputs(node->name());
+  invariant_nodes_.insert(std::make_pair(node, consumers.size()));
+  for (auto* consumer : consumers) {
+    if (invariant_nodes_.count(consumer) ||
+        ModifiesFrameInfo(*consumer)) {
+      continue;
+    }
+    bool is_invariant = true;
+    for (const auto& input : consumer->input()) {
+      if (!IsControlInput(input)) {
+        const string name = NodeName(input);
+        auto* producer = node_map_->GetNode(name);
+        if (!invariant_nodes_.count(producer)) {
+          if (IsConstant(*producer)) {
+            invariant_nodes_.insert(
+                std::make_pair(producer, node_map_->GetOutputs(name).size()));
+          } else {
+            is_invariant = false;
+            break;
+          }
+        }
+      }
+    }
+    if (is_invariant) {
+      std::set<NodeDef*> producers;
+      for (const auto& input : consumer->input()) {
+        auto* producer = node_map_->GetNode(input);
+        producers.insert(producer);
+      }
+      for (auto* producer : producers) {
+        auto iter = invariant_nodes_.find(producer);
+        if (iter != invariant_nodes_.end()) {
+          --iter->second;
+        }
+      }
+      TF_RETURN_IF_ERROR(FindInvariantNodes(consumer));
+    }
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::LoopInvariantNodeMotion() {
+  std::deque<int> worklist;
+  for (auto iter = frame_map_.begin(); iter != frame_map_.end(); ++iter) {
+    auto* node = iter->first;
+    auto& frame_ids = iter->second;
+    if (frame_ids.size() >= 3) {
+      for (unsigned int i = 1; i < frame_ids.size() - 1; ++i) {
+        frame_parent_[frame_ids[i]] = frame_ids[i - 1];
+        frame_children_[frame_ids[i]].insert(frame_ids[i + 1]);
+      }
+    }
+    if (frame_ids.size() >= 2) {
+      frame_children_[frame_ids[0]].insert(frame_ids[1]);
+      frame_parent_[frame_ids.back()] = frame_ids[frame_ids.size() - 2];
+    }
+    if (frame_ids.size() >= 1) {
+      frame_children_.insert(std::make_pair(frame_ids.back(), empty_set_));
+      if (node->op() == "LoopCond") {
+        if (loop_cond_.count(frame_ids.back())) {
+          return errors::InvalidArgument(
+              "Loop ", frame_ids.back(),
+              " has more than one LoopCond node: ", node->name(), " and ",
+              loop_cond_[frame_ids.back()]->name());
+        }
+        loop_cond_[frame_ids.back()] = node;
+      }
+      if (IsEnter(*node) && node->attr().at("is_constant").b()) {
+        invariant_enters_[frame_ids.back()].push_back(
+            const_cast<NodeDef*>(node));
+      }
+    }
+  }
+
+  for (auto it = frame_children_.begin(); it != frame_children_.end(); ++it) {
+    if (it->second.empty()) {
+      worklist.push_back(it->first);
+    }
+  }
+
+  while (!worklist.empty()) {
+    int frame_id = worklist.front();
+    new_enter_id_ = 0;
+    worklist.pop_front();
+    auto parent_it = frame_parent_.find(frame_id);
+    if (parent_it != frame_parent_.end()) {
+      int parent_id = parent_it->second;
+      frame_children_[parent_id].erase(frame_id);
+      if (frame_children_[parent_id].empty()) {
+        worklist.push_back(parent_id);
+      }
+    }
+
+    if (invariant_enters_[frame_id].empty()) {
+      continue;
+    }
+    invariant_nodes_.clear();
+    for (auto* enter : invariant_enters_[frame_id]) {
+      TF_RETURN_IF_ERROR(FindInvariantNodes(enter));
+    }
+
+    // Revert invariant nodes that have control outputs to variant nodes.
+    TF_RETURN_IF_ERROR(RevertInvariantNodes());
+
+    TF_RETURN_IF_ERROR(MoveInvariantNodes(frame_id));
+  }
+  return Status::OK();
+}
 
 Status LoopOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
                                GraphDef* optimized_graph) {
-  *optimized_graph = item.graph;
+
+  TF_RETURN_IF_ERROR(RemoveStackOps(item, optimized_graph));
+
+  if (opt_level_ == RewriterConfig::AGGRESSIVE) {
+    optimized_graph_ = optimized_graph;
+    // Set up helper data structures.
+    node_map_.reset(new NodeMap(optimized_graph_));
+    int num_frames;
+    TF_RETURN_IF_ERROR(IdentifyFramesWithNodeMap(*optimized_graph_, *node_map_,
+                                                 &frame_map_, &num_frames));
+    TF_RETURN_IF_ERROR(LoopInvariantNodeMotion());
+  }
 
   return Status::OK();
 }
diff --git a/tensorflow/core/grappler/optimizers/loop_optimizer.h b/tensorflow/core/grappler/optimizers/loop_optimizer.h
index 106d4628ae68f3c92ab597f903f96a6af8a64b8d..c1b0321e4e16f2c34a8016fe51068a79634a9617 100644
--- a/tensorflow/core/grappler/optimizers/loop_optimizer.h
+++ b/tensorflow/core/grappler/optimizers/loop_optimizer.h
@@ -17,13 +17,17 @@ limitations under the License.
 #define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_LOOP_OPTIMIZER_H_
 
 #include <unordered_set>
+#include "tensorflow/core/grappler/costs/graph_properties.h"
 #include "tensorflow/core/grappler/optimizers/graph_optimizer.h"
 #include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/frame.h"
 #include "tensorflow/core/protobuf/rewriter_config.pb.h"
 
 namespace tensorflow {
 namespace grappler {
 
+constexpr char kLoopOptimizer[] = "LoopOptimizer";
+
 class LoopOptimizer : public GraphOptimizer {
  public:
   LoopOptimizer() : opt_level_(RewriterConfig::ON) {}
@@ -40,7 +44,29 @@ class LoopOptimizer : public GraphOptimizer {
                 const GraphDef& optimized_graph, double result) override;
 
  private:
+  Status LoopInvariantNodeMotion();
+  Status FindInvariantNodes(NodeDef* node);
+  Status RevertInvariantNodes();
+  Status MoveInvariantNodes(const int frame_id);
+  Status LINMHandleInvariantNode(NodeDef* node, const int num_outputs,
+      const int frame_id);
+  Status LINMHandleConst(NodeDef* node, const int num_outputs,
+      const int frame_id);
+  Status LINMHandleInvariantEnter(NodeDef* node, const int num_outputs);
+
+  std::map<NodeDef*, int> invariant_nodes_;
+  std::set<int> empty_set_;
+  std::map<int, std::set<int>> frame_children_;
+  std::map<int, int> frame_parent_;
+  std::map<int, const NodeDef*> loop_cond_;
+  std::map<int, std::vector<NodeDef*>> invariant_enters_;
+  int new_enter_id_;
   RewriterConfig::Toggle opt_level_;
+
+  std::unique_ptr<NodeMap> node_map_;
+  FrameMap frame_map_;
+  std::unique_ptr<GraphProperties> graph_properties_;
+  GraphDef* optimized_graph_;  // Not owned.
 };
 
 }  // end namespace grappler
diff --git a/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc b/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc
index c09434f60916b9bf269b0f5006b8a3732afaa5fc..a0bd3351976ccbeddd8778281dbdc0c07bbd6455 100644
--- a/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/inputs/trivial_test_graph_input_yielder.h"
 #include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/grappler_test.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
 
@@ -26,7 +27,431 @@ namespace tensorflow {
 namespace grappler {
 namespace {
 
-class LoopOptimizerTest : public ::testing::Test {};
+class LoopOptimizerTest : public GrapplerTest {
+ protected:
+  // These helpers always set T=DT_FLOAT.
+  void AddEnterNode(const string& name, const string& frame,
+                    const bool is_constant, const int piterations,
+                    const std::vector<string>& inputs, GraphDef* graph) const {
+    std::vector<std::pair<string, AttrValue>> attributes;
+    AttrValue type;
+    type.set_type(DT_FLOAT);
+    attributes.emplace_back("T", type);
+    AttrValue frame_name;
+    frame_name.set_s(frame);
+    attributes.emplace_back("frame_name", frame_name);
+    AttrValue is_const;
+    is_const.set_b(is_constant);
+    attributes.emplace_back("is_constant", is_const);
+    AttrValue parallel_iterations;
+    parallel_iterations.set_i(piterations);
+    attributes.emplace_back("parallel_iterations", parallel_iterations);
+    AddNode(name, "Enter", inputs, attributes, graph);
+  }
+
+  void AddSimpleNode(const string& name, const string& op,
+                     const std::vector<string>& inputs, GraphDef* graph) const {
+    std::vector<std::pair<string, AttrValue>> attributes;
+    AttrValue type;
+    type.set_type(DT_FLOAT);
+    attributes.emplace_back("T", type);
+    AddNode(name, op, inputs, attributes, graph);
+  }
+};
+
+TEST_F(LoopOptimizerTest, Basic) {
+  GraphDef graph;
+  AddSimpleNode("In", "Identity", {}, &graph);
+  AddEnterNode("InvariantEnter", "while/while_context", true, 1, {"In"},
+               &graph);
+  AddSimpleNode("InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"},
+                &graph);
+  AddSimpleNode("VariantAdd", "Add", {"InvariantAdd", "Identity"}, &graph);
+  AddEnterNode("VariantEnter", "while/while_context", false, 1, {"In"}, &graph);
+  AddSimpleNode("Merge", "Merge", {"VariantEnter", "NextIteration"}, &graph);
+  AddSimpleNode("Less/y", "Const", {"^Identity"}, &graph);
+  AddSimpleNode("Less", "Less", {"VariantAdd", "Less/y"}, &graph);
+  AddSimpleNode("LoopCond", "LoopCond", {"Less"}, &graph);
+  AddSimpleNode("Switch", "Switch", {"Merge", "LoopCond"}, &graph);
+  AddSimpleNode("Identity", "Identity", {"Switch:1"}, &graph);
+  AddSimpleNode("NextIteration", "NextIteration", {"VariantAdd"}, &graph);
+  AddSimpleNode("Exit", "Exit", {"Switch"}, &graph);
+  AddSimpleNode("Out", "Identity", {"Exit"}, &graph);
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).back(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd")).back(), 0);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd")).back(), 0);
+}
+
+TEST_F(LoopOptimizerTest, Const) {
+  GraphDef graph;
+  AddSimpleNode("In", "Identity", {}, &graph);
+  AddEnterNode("InvariantEnter", "while/while_context", true, 1, {"In"},
+               &graph);
+  AddSimpleNode("Const", "Const", {"^Identity"}, &graph);
+  AddSimpleNode("InvariantAdd", "Add", {"InvariantEnter", "Const"}, &graph);
+  AddSimpleNode("VariantAdd", "Add", {"InvariantAdd", "Identity"}, &graph);
+  AddEnterNode("VariantEnter", "while/while_context", false, 1, {"In"}, &graph);
+  AddSimpleNode("Merge", "Merge", {"VariantEnter", "NextIteration"}, &graph);
+  AddSimpleNode("Less/y", "Const", {"^Identity"}, &graph);
+  AddSimpleNode("Less", "Less", {"VariantAdd", "Less/y"}, &graph);
+  AddSimpleNode("LoopCond", "LoopCond", {"Less"}, &graph);
+  AddSimpleNode("Switch", "Switch", {"Merge", "LoopCond"}, &graph);
+  AddSimpleNode("Identity", "Identity", {"Switch:1"}, &graph);
+  AddSimpleNode("NextIteration", "NextIteration", {"VariantAdd"}, &graph);
+  AddSimpleNode("Exit", "Exit", {"Switch"}, &graph);
+  AddSimpleNode("Out", "Identity", {"Exit"}, &graph);
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).back(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const")).back(), 0);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const")).size(), 0);
+}
+
+TEST_F(LoopOptimizerTest, ControlOutput) {
+  GraphDef graph;
+  AddSimpleNode("In", "Identity", {}, &graph);
+  AddEnterNode("InvariantEnter", "while/while_context", true, 1, {"In"},
+               &graph);
+  AddSimpleNode("InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"},
+                &graph);
+  AddSimpleNode("VariantAdd", "Add", {"InvariantAdd", "Identity"}, &graph);
+  AddEnterNode("VariantEnter", "while/while_context", false, 1, {"In"}, &graph);
+  AddSimpleNode("Merge", "Merge", {"VariantEnter", "NextIteration"}, &graph);
+  AddSimpleNode("Less/y", "Const", {"^Identity"}, &graph);
+  AddSimpleNode("Less", "Less", {"VariantAdd", "Less/y", "^InvariantAdd"},
+                &graph);
+  AddSimpleNode("LoopCond", "LoopCond", {"Less"}, &graph);
+  AddSimpleNode("Switch", "Switch", {"Merge", "LoopCond"}, &graph);
+  AddSimpleNode("Identity", "Identity", {"Switch:1"}, &graph);
+  AddSimpleNode("NextIteration", "NextIteration", {"VariantAdd"}, &graph);
+  AddSimpleNode("Exit", "Exit", {"Switch"}, &graph);
+  AddSimpleNode("Out", "Identity", {"Exit"}, &graph);
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).back(), 0);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).back(), 0);
+}
+
+TEST_F(LoopOptimizerTest, NestedLoop1) {
+  GraphDef graph;
+  AddSimpleNode("In", "Identity", {}, &graph);
+  AddEnterNode("InvariantEnter", "while/while_context", true, 1, {"In"},
+               &graph);
+  AddSimpleNode("InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"},
+                &graph);
+  AddSimpleNode("VariantAdd", "Add", {"InvariantAdd", "Identity"}, &graph);
+  AddEnterNode("VariantEnter", "while/while_context", false, 1, {"In"}, &graph);
+  AddSimpleNode("Merge", "Merge", {"VariantEnter", "NextIteration"}, &graph);
+  AddSimpleNode("Less/y", "Const", {"^Identity"}, &graph);
+  AddSimpleNode("Less", "Less", {"Exit2", "Less/y"}, &graph);
+  AddSimpleNode("LoopCond", "LoopCond", {"Less"}, &graph);
+  AddSimpleNode("Switch", "Switch", {"Merge", "LoopCond"}, &graph);
+  AddSimpleNode("Identity", "Identity", {"Switch:1"}, &graph);
+  AddSimpleNode("NextIteration", "NextIteration", {"Exit2"}, &graph);
+  AddSimpleNode("Exit", "Exit", {"Switch"}, &graph);
+  AddSimpleNode("Out", "Identity", {"Exit"}, &graph);
+
+  AddEnterNode("InvariantEnter2", "while/while/while_context", true, 1,
+               {"VariantAdd"}, &graph);
+  AddSimpleNode("InvariantAdd2", "Add", {"InvariantEnter2", "InvariantEnter2"},
+                &graph);
+  AddSimpleNode("VariantAdd2", "Add", {"InvariantAdd2", "Identity2"}, &graph);
+  AddEnterNode("VariantEnter2", "while/while/while_context", false, 1,
+               {"VariantEnter"}, &graph);
+  AddSimpleNode("Merge2", "Merge", {"VariantEnter2", "NextIteration2"}, &graph);
+  AddSimpleNode("Less2/y", "Const", {"^Identity2"}, &graph);
+  AddSimpleNode("Less2", "Less", {"VariantAdd2", "Less2/y"}, &graph);
+  AddSimpleNode("LoopCond2", "LoopCond", {"Less2"}, &graph);
+  AddSimpleNode("Switch2", "Switch", {"Merge2", "LoopCond2"}, &graph);
+  AddSimpleNode("Identity2", "Identity", {"Switch2:1"}, &graph);
+  AddSimpleNode("NextIteration2", "NextIteration", {"VariantAdd2"}, &graph);
+  AddSimpleNode("Exit2", "Exit", {"Switch2"}, &graph);
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).back(), 0);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 0);
+}
+
+TEST_F(LoopOptimizerTest, NestedLoop2) {
+  GraphDef graph;
+  AddSimpleNode("In", "Identity", {}, &graph);
+  AddEnterNode("InvariantEnter", "while/while_context", true, 1, {"In"},
+               &graph);
+  AddSimpleNode("InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"},
+                &graph);
+  AddSimpleNode("VariantAdd", "Add", {"InvariantAdd", "Identity"}, &graph);
+  AddEnterNode("VariantEnter", "while/while_context", false, 1, {"In"}, &graph);
+  AddSimpleNode("Merge", "Merge", {"VariantEnter", "NextIteration"}, &graph);
+  AddSimpleNode("Less/y", "Const", {"^Identity"}, &graph);
+  AddSimpleNode("Less", "Less", {"Exit2", "Less/y"}, &graph);
+  AddSimpleNode("LoopCond", "LoopCond", {"Less"}, &graph);
+  AddSimpleNode("Switch", "Switch", {"Merge", "LoopCond"}, &graph);
+  AddSimpleNode("Identity", "Identity", {"Switch:1"}, &graph);
+  AddSimpleNode("NextIteration", "NextIteration", {"Exit2"}, &graph);
+  AddSimpleNode("Exit", "Exit", {"Switch"}, &graph);
+  AddSimpleNode("Out", "Identity", {"Exit"}, &graph);
+
+  AddEnterNode("InvariantEnter2", "while/while/while_context", true, 1,
+               {"InvariantAdd"}, &graph);
+  AddSimpleNode("InvariantAdd2", "Add", {"InvariantEnter2", "InvariantEnter2"},
+                &graph);
+  AddSimpleNode("VariantAdd2", "Add", {"InvariantAdd2", "Identity2"}, &graph);
+  AddEnterNode("VariantEnter2", "while/while/while_context", false, 1,
+               {"VariantEnter"}, &graph);
+  AddSimpleNode("Merge2", "Merge", {"VariantEnter2", "NextIteration2"}, &graph);
+  AddSimpleNode("Less2/y", "Const", {"^Identity2"}, &graph);
+  AddSimpleNode("Less2", "Less", {"VariantAdd2", "Less2/y"}, &graph);
+  AddSimpleNode("LoopCond2", "LoopCond", {"Less2"}, &graph);
+  AddSimpleNode("Switch2", "Switch", {"Merge2", "LoopCond2"}, &graph);
+  AddSimpleNode("Identity2", "Identity", {"Switch2:1"}, &graph);
+  AddSimpleNode("NextIteration2", "NextIteration", {"VariantAdd2"}, &graph);
+  AddSimpleNode("Exit2", "Exit", {"Switch2"}, &graph);
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).back(), 1);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).back(), 1);
+}
+
+TEST_F(LoopOptimizerTest, NestedLoopConst1) {
+  GraphDef graph;
+  AddSimpleNode("In", "Identity", {}, &graph);
+  AddEnterNode("InvariantEnter", "while/while_context", true, 1, {"In"},
+               &graph);
+  AddSimpleNode("InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"},
+                &graph);
+  AddSimpleNode("VariantAdd", "Add", {"InvariantAdd", "Identity"}, &graph);
+  AddEnterNode("VariantEnter", "while/while_context", false, 1, {"In"}, &graph);
+  AddSimpleNode("Merge", "Merge", {"VariantEnter", "NextIteration"}, &graph);
+  AddSimpleNode("Less/y", "Const", {"^Identity"}, &graph);
+  AddSimpleNode("Less", "Less", {"Exit2", "Less/y"}, &graph);
+  AddSimpleNode("LoopCond", "LoopCond", {"Less"}, &graph);
+  AddSimpleNode("Switch", "Switch", {"Merge", "LoopCond"}, &graph);
+  AddSimpleNode("Identity", "Identity", {"Switch:1"}, &graph);
+  AddSimpleNode("NextIteration", "NextIteration", {"Exit2"}, &graph);
+  AddSimpleNode("Exit", "Exit", {"Switch"}, &graph);
+  AddSimpleNode("Out", "Identity", {"Exit"}, &graph);
+
+  AddEnterNode("InvariantEnter2", "while/while/while_context", true, 1,
+               {"VariantAdd"}, &graph);
+  AddSimpleNode("Const2", "Const", {"^Identity2"}, &graph);
+  AddSimpleNode("InvariantAdd2", "Add", {"InvariantEnter2", "Const2"}, &graph);
+  AddSimpleNode("VariantAdd2", "Add", {"InvariantAdd2", "Identity2"}, &graph);
+  AddEnterNode("VariantEnter2", "while/while/while_context", false, 1,
+               {"VariantEnter"}, &graph);
+  AddSimpleNode("Merge2", "Merge", {"VariantEnter2", "NextIteration2"}, &graph);
+  AddSimpleNode("Less2/y", "Const", {"^Identity2"}, &graph);
+  AddSimpleNode("Less2", "Less", {"VariantAdd2", "Less2/y"}, &graph);
+  AddSimpleNode("LoopCond2", "LoopCond", {"Less2"}, &graph);
+  AddSimpleNode("Switch2", "Switch", {"Merge2", "LoopCond2"}, &graph);
+  AddSimpleNode("Identity2", "Identity", {"Switch2:1"}, &graph);
+  AddSimpleNode("NextIteration2", "NextIteration", {"VariantAdd2"}, &graph);
+  AddSimpleNode("Exit2", "Exit", {"Switch2"}, &graph);
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).back(), 1);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).back(), 0);
+}
+
+TEST_F(LoopOptimizerTest, NestedLoopConst2) {
+  GraphDef graph;
+  AddSimpleNode("In", "Identity", {}, &graph);
+  AddEnterNode("InvariantEnter", "while/while_context", true, 1, {"In"},
+               &graph);
+  AddSimpleNode("InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"},
+                &graph);
+  AddSimpleNode("VariantAdd", "Add", {"InvariantAdd", "Identity"}, &graph);
+  AddEnterNode("VariantEnter", "while/while_context", false, 1, {"In"}, &graph);
+  AddSimpleNode("Merge", "Merge", {"VariantEnter", "NextIteration"}, &graph);
+  AddSimpleNode("Less/y", "Const", {"^Identity"}, &graph);
+  AddSimpleNode("Less", "Less", {"Exit2", "Less/y"}, &graph);
+  AddSimpleNode("LoopCond", "LoopCond", {"Less"}, &graph);
+  AddSimpleNode("Switch", "Switch", {"Merge", "LoopCond"}, &graph);
+  AddSimpleNode("Identity", "Identity", {"Switch:1"}, &graph);
+  AddSimpleNode("NextIteration", "NextIteration", {"Exit2"}, &graph);
+  AddSimpleNode("Exit", "Exit", {"Switch"}, &graph);
+  AddSimpleNode("Out", "Identity", {"Exit"}, &graph);
+
+  AddEnterNode("InvariantEnter2", "while/while/while_context", true, 1,
+               {"InvariantAdd"}, &graph);
+  AddSimpleNode("Const2", "Const", {"^Identity2"}, &graph);
+  AddSimpleNode("InvariantAdd2", "Add", {"InvariantEnter2", "Const2"}, &graph);
+  AddSimpleNode("VariantAdd2", "Add", {"InvariantAdd2", "Identity2"}, &graph);
+  AddEnterNode("VariantEnter2", "while/while/while_context", false, 1,
+               {"VariantEnter"}, &graph);
+  AddSimpleNode("Merge2", "Merge", {"VariantEnter2", "NextIteration2"}, &graph);
+  AddSimpleNode("Less2/y", "Const", {"^Identity2"}, &graph);
+  AddSimpleNode("Less2", "Less", {"VariantAdd2", "Less2/y"}, &graph);
+  AddSimpleNode("LoopCond2", "LoopCond", {"Less2"}, &graph);
+  AddSimpleNode("Switch2", "Switch", {"Merge2", "LoopCond2"}, &graph);
+  AddSimpleNode("Identity2", "Identity", {"Switch2:1"}, &graph);
+  AddSimpleNode("NextIteration2", "NextIteration", {"VariantAdd2"}, &graph);
+  AddSimpleNode("Exit2", "Exit", {"Switch2"}, &graph);
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).back(), 1);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).size(), 0);
+}
 
 void VerifyGraphsEqual(const GraphDef& original_graph,
                        const GraphDef& optimized_graph, const string& func) {
@@ -57,6 +482,87 @@ TEST_F(LoopOptimizerTest, NoOp) {
   VerifyGraphsEqual(item.graph, output, __FUNCTION__);
 }
 
+TEST_F(LoopOptimizerTest, RemovePush_NoOp) {
+  GrapplerItem item;
+  GraphDef& graph = item.graph;
+  AddSimpleNode("c", "Const", {}, &graph);
+  // Stack with corresponding push/pop.
+  AddSimpleNode("stack1", "StackV2", {}, &graph);
+  AddSimpleNode("push1", "StackPushV2", {"stack1", "c"}, &graph);
+  AddSimpleNode("pop1", "StackPopV2", {"stack1"}, &graph);
+  AddSimpleNode("id1", "Identity", {"pop1"}, &graph);
+  // Stack with corresponding push/pop behind Enter.
+  AddSimpleNode("stack2", "StackV2", {}, &graph);
+  AddEnterNode("enter2_c", "frame_name", false, 1, {"c"}, &graph);
+  AddEnterNode("enter2_stack2", "frame_name", false, 1, {"stack2"}, &graph);
+  AddSimpleNode("push2", "StackPushV2", {"enter2_stack2", "enter2_c"}, &graph);
+  AddSimpleNode("pop2", "StackPopV2", {"enter2_stack2"}, &graph);
+  AddSimpleNode("id2", "Identity", {"pop2"}, &graph);
+  // Stack with unexpected op type in fanout of Stack.
+  AddSimpleNode("stack3", "StackV2", {}, &graph);
+  AddSimpleNode("push3", "StackPushV2", {"stack3", "c"}, &graph);
+  AddSimpleNode("stop", "StopGradient", {"stack3"}, &graph);
+
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+  VerifyGraphsEqual(item.graph, output, __FUNCTION__);
+}
+
+TEST_F(LoopOptimizerTest, RemovePushWithoutMatchingPop) {
+  GrapplerItem item;
+  GraphDef& graph = item.graph;
+  AddSimpleNode("c", "Const", {}, &graph);
+  // Push without Pop.
+  AddSimpleNode("stack1", "StackV2", {}, &graph);
+  AddSimpleNode("push1", "StackPushV2", {"stack1", "c"}, &graph);
+  // Push without Pop behind Enter.
+  AddSimpleNode("stack2", "StackV2", {}, &graph);
+  AddEnterNode("enter_c", "frame_name", false, 1, {"c"}, &graph);
+  AddEnterNode("enter_stack2", "frame_name", false, 1, {"stack2"}, &graph);
+  AddSimpleNode("push2", "StackPushV2", {"enter_stack2", "enter_c"}, &graph);
+  // Pop without consumer.
+  AddSimpleNode("stack3", "StackV2", {}, &graph);
+  AddSimpleNode("push3", "StackPushV2", {"stack3", "c"}, &graph);
+  AddSimpleNode("pop3", "StackPopV2", {"stack3"}, &graph);
+  // Push for a Pop without consumer that is fetched should not be removed.
+  AddSimpleNode("stack4", "StackV2", {}, &graph);
+  AddSimpleNode("push4", "StackPushV2", {"stack4", "c"}, &graph);
+  AddSimpleNode("pop4", "StackPopV2", {"stack4"}, &graph);
+
+  item.fetch.push_back("pop4");
+
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  EXPECT_EQ(13, output.node_size());
+  for (int i = 0; i < output.node_size(); ++i) {
+    const NodeDef& node = output.node(i);
+    if (node.name() == "push1") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("c", node.input(0));
+      EXPECT_EQ("^stack1", node.input(1));
+    } else if (node.name() == "push2") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("enter_c", node.input(0));
+      EXPECT_EQ("^enter_stack2", node.input(1));
+    } else if (node.name() == "push3") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("c", node.input(0));
+      EXPECT_EQ("^stack3", node.input(1));
+    } else {
+      const NodeDef& orig_node = item.graph.node(i);
+      EXPECT_EQ(orig_node.ShortDebugString(), node.ShortDebugString());
+    }
+  }
+}
+
 }  // namespace
 }  // namespace grappler
 }  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/memory_optimizer.cc b/tensorflow/core/grappler/optimizers/memory_optimizer.cc
index dec4f04a1c4234a32ecd370defb4a1560022a3e3..27e9d2c78d0456e61d31f7f772172fb8d17a11ac 100644
--- a/tensorflow/core/grappler/optimizers/memory_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/memory_optimizer.cc
@@ -413,7 +413,7 @@ void RecomputeSubgraph(
 }
 
 void RecomputationRewritingPass(RewriterConfig::MemOptType optimization_level,
-                                const string& recomputation_targets_name_prefix,
+                                const string& recomputation_targets_name_scope,
                                 GraphDef* graph, const GrapplerItem& item) {
   if (optimization_level != RewriterConfig::RECOMPUTATION_HEURISTICS &&
       optimization_level != RewriterConfig::HEURISTICS &&
@@ -438,15 +438,14 @@ void RecomputationRewritingPass(RewriterConfig::MemOptType optimization_level,
     feeds.insert(NodeName(feed.first));
   }
   std::function<bool(const NodeDef&)> is_target =
-      [&recomputation_targets_name_prefix](const NodeDef& node) {
-        // Nodes whose inputs we may want to recompute. Typically targets will
-        // be gradients (recomputation_targets_name_prefix="gradients/"),
-        // although the prefix is configurable since gradients may be created
-        // in a name scope.
-        // TODO(allenl): Use a static schedule
-        // (grappler::EstimateEarliestExecutionTimes) to recompute only nodes
-        // whose outputs will sit around for a while.
-        return node.name().find(recomputation_targets_name_prefix) == 0;
+      [&recomputation_targets_name_scope](const NodeDef& node) {
+        // Nodes whose inputs we may want to recompute. This matches node names
+        // that contain recomputation_targets_name_scope as a name scope,
+        // meaning the name either begins with or contains the name scope.
+        // Defaults to "gradients/", which matches any node name that begins
+        // with "gradients/" or contains "/gradients/".
+        return node.name().find(recomputation_targets_name_scope) == 0 ||
+               node.name().find("/" + recomputation_targets_name_scope) != string::npos;
       };
 
   if (optimization_level == RewriterConfig::RECOMPUTATION_HEURISTICS ||
@@ -720,18 +719,19 @@ Status BuildSwapPair(NodeDef* node, int input_to_swap,
   // Force the tensor to be copied to cpu.
   NodeDef* swap_out_node = graph->add_node();
   swap_out_node->set_name(swap_out_name);
-  swap_out_node->set_op("Identity");
-  swap_out_node->set_device("/device:CPU:0");
+  swap_out_node->set_op("_CopyFromGpuToHost");
 
   // Force the tensor to be restored to the device.
   NodeDef* swap_in_node = graph->add_node();
   swap_in_node->set_name(swap_in_name);
-  swap_in_node->set_op("Identity");
+  swap_in_node->set_op("_CopyFromHostToGpu");
   *swap_in_node->add_input() = swap_out_node->name();
 
-  // Colocate the swap_in_ node with the node itself.
+  // Colocate the swap_out_ and swap_in_ nodes with the node itself.
+  swap_out_node->set_device(node->device());
   swap_in_node->set_device(node->device());
   string coloc_group = strings::StrCat("loc@", tensor_to_swap);
+  (*swap_out_node->mutable_attr())["_class"].mutable_list()->add_s(coloc_group);
   (*swap_in_node->mutable_attr())["_class"].mutable_list()->add_s(coloc_group);
   (*node->mutable_attr())["_class"].mutable_list()->add_s(coloc_group);
 
@@ -1224,8 +1224,8 @@ Status MemoryOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
   *optimized_graph = item.graph;
 
   RecomputationRewritingPass(optimization_level_,
-                             recomputation_targets_name_prefix_,
-                             optimized_graph, item);
+                             recomputation_targets_name_scope_, optimized_graph,
+                             item);
 
   GrapplerItem optimized_item(item, std::move(*optimized_graph));
   std::unordered_set<string> skip_list;
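
The name-scope matching done by the is_target lambda in RecomputationRewritingPass above can be sketched in isolation. This is a minimal sketch, not the TensorFlow implementation: MatchesNameScope is a hypothetical free function, and it uses std::string directly rather than TensorFlow's string alias. It assumes the scope ends with "/", as the default "gradients/" does.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of TensorFlow) mirroring the is_target
// predicate: a node matches when its name begins with the scope, or when the
// scope appears as a nested scope, i.e. preceded by '/'.
bool MatchesNameScope(const std::string& node_name, const std::string& scope) {
  return node_name.find(scope) == 0 ||
         node_name.find("/" + scope) != std::string::npos;
}
```

Under these assumptions, "gradients/MatMul" and "tower_0/gradients/MatMul" both match the scope "gradients/", while "my_gradients/MatMul" does not, because the scope there is neither at the start of the name nor preceded by "/".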
diff --git a/tensorflow/core/grappler/optimizers/memory_optimizer.h b/tensorflow/core/grappler/optimizers/memory_optimizer.h
index c3dd0c45c6c524ef850ce7cfb9f6543d22e783ec..5c555a26746b759500f3d778ce137d6d9bedb67b 100644
--- a/tensorflow/core/grappler/optimizers/memory_optimizer.h
+++ b/tensorflow/core/grappler/optimizers/memory_optimizer.h
@@ -27,14 +27,14 @@ class MemoryOptimizer : public GraphOptimizer {
  public:
   // optimization_level: Controls the level of autonomy for the memory
   //   optimizer. See RewriterConfig::memory_optimization.
-  // recomputation_targets_name_prefix: Name prefix for potential outputs of
+  // recomputation_targets_name_scope: Name scope for potential outputs of
   //   recomputations. See
-  //   RewriterConfig::memory_optimizer_target_node_name_prefix.
+  //   RewriterConfig::memory_optimizer_target_node_name_scope.
   explicit MemoryOptimizer(
       RewriterConfig::MemOptType optimization_level,
-      const string& recomputation_targets_name_prefix = "gradients/")
+      const string& recomputation_targets_name_scope = "gradients/")
       : optimization_level_(optimization_level),
-        recomputation_targets_name_prefix_(recomputation_targets_name_prefix) {}
+        recomputation_targets_name_scope_(recomputation_targets_name_scope) {}
   ~MemoryOptimizer() override {}
 
   string name() const override { return "memory_optimizer"; };
@@ -47,7 +47,7 @@ class MemoryOptimizer : public GraphOptimizer {
 
  private:
   RewriterConfig::MemOptType optimization_level_;
-  string recomputation_targets_name_prefix_;
+  string recomputation_targets_name_scope_;
 };
 
 }  // end namespace grappler
diff --git a/tensorflow/core/grappler/optimizers/memory_optimizer_test.cc b/tensorflow/core/grappler/optimizers/memory_optimizer_test.cc
index 5d7913e0c018ecf14cc09ab91d3a71125c720aa5..9595936e9e6158045a13ebede95d63b9291ca434 100644
--- a/tensorflow/core/grappler/optimizers/memory_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/memory_optimizer_test.cc
@@ -221,16 +221,20 @@ TEST_F(MemoryOptimizerTest, SimpleSwapping) {
   // Build a simple graph with an op that's marked for swapping.
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
 
-  Output a = ops::Variable(s.WithOpName("a"), {10, 10}, DT_FLOAT);
-  Output b = ops::AddN(s.WithOpName("b"), {a});
-  Output c = ops::AddN(s.WithOpName("c"), {b});
-  Output d = ops::AddN(s.WithOpName("d"), {c});
-  Output e = ops::AddN(s.WithOpName("e"), {b, d});
+  Output a =
+      ops::Variable(s.WithOpName("a").WithDevice("/gpu:0"), {10, 10}, DT_FLOAT);
+  Output b = ops::AddN(s.WithOpName("b").WithDevice("/gpu:0"), {a});
+  Output c = ops::AddN(s.WithOpName("c").WithDevice("/gpu:0"), {b});
+  Output d = ops::AddN(s.WithOpName("d").WithDevice("/gpu:0"), {c});
+  Output e = ops::AddN(s.WithOpName("e").WithDevice("/gpu:0"), {b, d});
+
+  Output constant = ops::Const(s.WithOpName("constant"), 0.0f, {10, 10});
+  Output init = ops::Assign(s.WithOpName("init"), a, constant);
 
   GrapplerItem item;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
-  EXPECT_EQ(5, item.graph.node_size());
+  EXPECT_EQ(7, item.graph.node_size());
   EXPECT_EQ(NodeName(e.name()), item.graph.node(4).name());
   AttrValue& val =
       (*item.graph.mutable_node(4)->mutable_attr())["_swap_to_host"];
@@ -243,32 +247,43 @@ TEST_F(MemoryOptimizerTest, SimpleSwapping) {
   Status status = optimizer.Optimize(cluster.get(), item, &output);
   TF_EXPECT_OK(status);
 
-  EXPECT_EQ(7, output.node_size());
-  const NodeDef& new_e = output.node(4);
+  EXPECT_EQ(9, output.node_size());
+  const NodeDef& new_e = output.node(6);
   EXPECT_EQ(NodeName(e.name()), new_e.name());
 
   EXPECT_EQ(2, new_e.input_size());
   EXPECT_EQ(NodeName(d.name()), new_e.input(1));
   EXPECT_EQ("swap_in_e_0", new_e.input(0));
 
-  const NodeDef& swap_out = output.node(5);
+  const NodeDef& swap_out = output.node(7);
   EXPECT_EQ("swap_out_e_0", swap_out.name());
+  EXPECT_EQ("_CopyFromGpuToHost", swap_out.op());
 
-  const NodeDef& swap_in = output.node(6);
+  const NodeDef& swap_in = output.node(8);
   EXPECT_EQ("swap_in_e_0", swap_in.name());
+  EXPECT_EQ("_CopyFromHostToGpu", swap_in.op());
 
   EXPECT_EQ(NodeName(b.name()), swap_out.input(0));
   EXPECT_EQ(NodeName(swap_out.name()), swap_in.input(0));
   EXPECT_EQ("^c", swap_in.input(1));
 
-  const NodeDef& new_c = output.node(2);
+  const NodeDef& new_c = output.node(4);
   EXPECT_EQ(NodeName(c.name()), new_c.name());
   EXPECT_EQ("^swap_out_e_0", new_c.input(1));
 
   // Run the optimizer a second time to ensure it's idempotent.
-  item.graph.Swap(&output);
-  status = optimizer.Optimize(cluster.get(), item, &output);
+  GrapplerItem item_copy(item, std::move(output));
+  status = optimizer.Optimize(cluster.get(), item_copy, &output);
   TF_EXPECT_OK(status);
+
+#if GOOGLE_CUDA
+  item.fetch = {"e"};
+  item.init_ops = {init.name()};
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+#endif
 }
 
 TEST_F(MemoryOptimizerTest, SwappingHeuristics) {
@@ -287,9 +302,13 @@ TEST_F(MemoryOptimizerTest, SwappingHeuristics) {
   Output h = ops::Exp(s.WithOpName("h").WithDevice("/gpu:0"), c);
   Output i = ops::Log(s.WithOpName("i").WithDevice("/gpu:0"), d);
 
+  Output constant = ops::Const(s.WithOpName("constant"), 0.0f, {128, 128, 8});
+  Output init = ops::Assign(s.WithOpName("init"), v, constant);
+
   GrapplerItem item;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
   item.fetch = {"e", "f", "g", "h", "i"};
+  item.init_ops = {init.name()};
 
   std::unique_ptr<VirtualCluster> cluster(CreateVirtualCluster());
 
@@ -308,6 +327,15 @@ TEST_F(MemoryOptimizerTest, SwappingHeuristics) {
       EXPECT_EQ("axis", node.input(4));
     }
   }
+
+#if GOOGLE_CUDA
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  for (int i = 0; i < item.fetch.size(); ++i) {
+    test::ExpectTensorEqual<float>(tensors_expected[i], tensors[i]);
+  }
+#endif
 }
 
 TEST_F(MemoryOptimizerTest, UnswappableInputs) {
@@ -325,9 +353,13 @@ TEST_F(MemoryOptimizerTest, UnswappableInputs) {
   Output e =
       ops::Concat(s.WithOpName("e").WithDevice("/gpu:0"), {b, c, d}, axis);
 
+  Output constant = ops::Const(s.WithOpName("constant"), 0.0f, {128, 128, 8});
+  Output init = ops::Assign(s.WithOpName("init"), v, constant);
+
   GrapplerItem item;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
   item.fetch = {"e"};
+  item.init_ops = {init.name()};
 
   std::unique_ptr<VirtualCluster> cluster(CreateVirtualCluster());
 
@@ -344,6 +376,13 @@ TEST_F(MemoryOptimizerTest, UnswappableInputs) {
       EXPECT_EQ("^swap_out_d_2", node.input(4));
     }
   }
+
+#if GOOGLE_CUDA
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+#endif
 }
 
 TEST_F(MemoryOptimizerTest, AccumulationRewrites) {
diff --git a/tensorflow/core/grappler/optimizers/meta_optimizer.cc b/tensorflow/core/grappler/optimizers/meta_optimizer.cc
index 7ae77207afc5be86a99bf8145025e0f18ef4af0f..6eb2bbc54717a9cdc6943e43f184f9eb9d9635e7 100644
--- a/tensorflow/core/grappler/optimizers/meta_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/meta_optimizer.cc
@@ -21,6 +21,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/optimizers/constant_folding.h"
 #include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
 #include "tensorflow/core/grappler/optimizers/dependency_optimizer.h"
+#include "tensorflow/core/grappler/optimizers/function_optimizer.h"
 #include "tensorflow/core/grappler/optimizers/graph_optimizer.h"
 #include "tensorflow/core/grappler/optimizers/layout_optimizer.h"
 #include "tensorflow/core/grappler/optimizers/loop_optimizer.h"
@@ -56,6 +57,9 @@ std::unique_ptr<GraphOptimizer> MetaOptimizer::NewOptimizer(
   if (optimizer == "pruning") {
     graph_optimizer.reset(new ModelPruner());
   }
+  if (optimizer == "function") {
+    graph_optimizer.reset(new FunctionOptimizer(cfg_.function_optimization()));
+  }
   if (optimizer == "constfold") {
     graph_optimizer.reset(new ConstantFolding(cpu_device_));
   }
@@ -73,13 +77,13 @@ std::unique_ptr<GraphOptimizer> MetaOptimizer::NewOptimizer(
     graph_optimizer.reset(
         new AutoParallel(cfg_.auto_parallel().num_replicas()));
   }
+  if (optimizer == "loop") {
+    graph_optimizer.reset(new LoopOptimizer(cfg_.loop_optimization()));
+  }
   if (optimizer == "dependency") {
     graph_optimizer.reset(
         new DependencyOptimizer(cfg_.dependency_optimization()));
   }
-  if (optimizer == "loop") {
-    graph_optimizer.reset(new LoopOptimizer(cfg_.loop_optimization()));
-  }
   return graph_optimizer;
 }
 
@@ -90,6 +94,10 @@ Status MetaOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
     if (!cfg_.disable_model_pruning()) {
       optimizers.push_back(std::unique_ptr<GraphOptimizer>(new ModelPruner()));
     }
+    if (cfg_.function_optimization() != RewriterConfig::OFF) {
+      optimizers.push_back(std::unique_ptr<GraphOptimizer>(
+          new FunctionOptimizer(cfg_.function_optimization())));
+    }
     if (cfg_.constant_folding() != RewriterConfig::OFF) {
       optimizers.push_back(std::unique_ptr<GraphOptimizer>(
           new ConstantFolding(cfg_.constant_folding(), cpu_device_)));
@@ -98,20 +106,20 @@ Status MetaOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
       optimizers.push_back(std::unique_ptr<GraphOptimizer>(
           new ArithmeticOptimizer(cfg_.arithmetic_optimization())));
     }
-    if (cfg_.dependency_optimization() != RewriterConfig::OFF) {
-      optimizers.push_back(std::unique_ptr<GraphOptimizer>(
-          new DependencyOptimizer(cfg_.dependency_optimization())));
-    }
     if (cfg_.loop_optimization() != RewriterConfig::OFF) {
       optimizers.push_back(std::unique_ptr<GraphOptimizer>(
           new LoopOptimizer(cfg_.loop_optimization())));
     }
+    if (cfg_.dependency_optimization() != RewriterConfig::OFF) {
+      optimizers.push_back(std::unique_ptr<GraphOptimizer>(
+          new DependencyOptimizer(cfg_.dependency_optimization())));
+    }
     if (cfg_.layout_optimizer() != RewriterConfig::OFF) {
       optimizers.push_back(
           std::unique_ptr<GraphOptimizer>(new LayoutOptimizer()));
     }
     if (cfg_.memory_optimization() != RewriterConfig::NO_MEM_OPT) {
-      if (cfg_.memory_optimizer_target_node_name_prefix().empty()) {
+      if (cfg_.memory_optimizer_target_node_name_scope().empty()) {
         optimizers.push_back(std::unique_ptr<GraphOptimizer>(
             // Use the default target node name prefix "gradients/"
             new MemoryOptimizer(cfg_.memory_optimization())));
@@ -119,7 +127,7 @@ Status MetaOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
         optimizers.push_back(
             std::unique_ptr<GraphOptimizer>(new MemoryOptimizer(
                 cfg_.memory_optimization(),
-                cfg_.memory_optimizer_target_node_name_prefix())));
+                cfg_.memory_optimizer_target_node_name_scope())));
       }
     }
     if (cfg_.auto_parallel().enable()) {
@@ -128,8 +136,8 @@ Status MetaOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
     }
   } else {
     const std::set<string> available_optimizers = {
-        "pruning",      "constfold",  "layout",     "memory",
-        "autoparallel", "arithmetic", "dependency", "loop"};
+        "pruning",      "function",   "constfold", "layout",    "memory",
+        "autoparallel", "arithmetic", "loop",      "dependency"};
     std::vector<string> custom_optimizer_names;
     for (const auto& optimizer_name : cfg_.optimizers()) {
       if (available_optimizers.find(optimizer_name) !=
@@ -223,10 +231,11 @@ void MetaOptimizer::Feedback(Cluster* cluster, const GrapplerItem& item,
 bool MetaOptimizerEnabled(const RewriterConfig& cfg) {
   return !cfg.disable_model_pruning() ||
          cfg.layout_optimizer() != RewriterConfig::OFF ||
+         cfg.function_optimization() != RewriterConfig::OFF ||
          cfg.constant_folding() != RewriterConfig::OFF ||
-         cfg.dependency_optimization() != RewriterConfig::OFF ||
-         cfg.loop_optimization() == RewriterConfig::ON ||
          cfg.arithmetic_optimization() != RewriterConfig::OFF ||
+         cfg.loop_optimization() != RewriterConfig::OFF ||
+         cfg.dependency_optimization() != RewriterConfig::OFF ||
          cfg.auto_parallel().enable() ||
          cfg.memory_optimization() != RewriterConfig::NO_MEM_OPT ||
          !cfg.optimizers().empty();
diff --git a/tensorflow/core/grappler/optimizers/symbolic_shapes.cc b/tensorflow/core/grappler/optimizers/symbolic_shapes.cc
new file mode 100644
index 0000000000000000000000000000000000000000..cfca2dc0d38480240c9b158ecbb3cc718bfa1ad2
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/symbolic_shapes.cc
@@ -0,0 +1,177 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/symbolic_shapes.h"
+#include "tensorflow/core/util/bcast.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+BCast::Vec ShapeDims(const TensorShapeProto& shape) {
+  BCast::Vec dims;
+  dims.reserve(shape.dim_size());
+  for (int i = 0; i < shape.dim_size(); ++i)
+    dims.push_back(shape.dim(i).size());
+  return dims;
+}
+
+}  // namespace
+
+bool IsKnown(const TensorShapeProto::Dim& dim) { return dim.size() >= 0; }
+
+bool IsKnownSymbolically(const TensorShapeProto::Dim& dim) {
+  return dim.size() <= -2;
+}
+
+bool IsUnknown(const TensorShapeProto::Dim& dim) { return dim.size() == -1; }
+
+bool ShapeIsSymbolicallyDefined(const TensorShapeProto& shape) {
+  return !shape.unknown_rank() &&
+         std::all_of(
+             shape.dim().begin(), shape.dim().end(),
+             [](const TensorShapeProto::Dim& dim) { return !IsUnknown(dim); });
+}
+
+bool ShapeIsSymbolicallyDefined(const OpInfo::TensorProperties& properties) {
+  return ShapeIsSymbolicallyDefined(properties.shape());
+}
+
+bool ShapesSymbolicallyEqual(const TensorShapeProto& left,
+                             const TensorShapeProto& right) {
+  if (left.unknown_rank() || right.unknown_rank() ||
+      left.dim_size() != right.dim_size()) {
+    return false;
+  }
+  for (int i = 0; i < left.dim_size(); ++i) {
+    const auto& ldim = left.dim(i);
+    const auto& rdim = right.dim(i);
+    if (IsUnknown(ldim) || IsUnknown(rdim) || ldim.size() != rdim.size()) {
+      return false;
+    }
+  }
+  return true;
+}
+
+bool ShapesSymbolicallyEqual(const OpInfo::TensorProperties& left,
+                             const OpInfo::TensorProperties& right) {
+  return ShapesSymbolicallyEqual(left.shape(), right.shape());
+}
+
+bool ShapesBroadcastable(const TensorShapeProto& left,
+                         const TensorShapeProto& right) {
+  if (!ShapeIsSymbolicallyDefined(left) || !ShapeIsSymbolicallyDefined(right)) {
+    return false;
+  }
+  BCast bcast(ShapeDims(left), ShapeDims(right),
+              /*fewer_dims_optimization*/ false);
+  return bcast.IsValid();
+}
+
+bool ShapesBroadcastable(const OpInfo::TensorProperties& left,
+                         const OpInfo::TensorProperties& right) {
+  return ShapesBroadcastable(left.shape(), right.shape());
+}
+
+bool CompareSymbolicallyShapedTensorSizes(const TensorShapeProto& left,
+                                          const TensorShapeProto& right) {
+  // if one of the ranks is unknown, it's impossible to compare tensor sizes
+  if (left.unknown_rank() || right.unknown_rank()) {
+    return false;
+  }
+
+  // Tensor size, computed as a product of defined dimensions
+  int64 left_defined_size = 1;
+  int64 right_defined_size = 1;
+
+  // Count how many times each unknown dimension appears on the left and right
+  std::unordered_map<int64, int64> left_unknown_dims;
+  std::unordered_map<int64, int64> right_unknown_dims;
+
+  // Assign unique id to every unknown dimension (-1). We are going to
+  // assign positive ids, because negative values are already used by
+  // symbolic dimensions.
+  int64 unknown_dim_id = 1;
+
+  // For each shape dimension update "defined tensor size", if shape is defined,
+  // or increment a counter for unknown dim.
+  auto process_dimensions =
+      [&unknown_dim_id](const TensorShapeProto& shape, int64* defined_size,
+                        std::unordered_map<int64, int64>* unknown_dims) {
+        for (int i = 0; i < shape.dim_size(); ++i) {
+          const auto& dim = shape.dim(i);
+          int64 dim_size = dim.size();
+          if (dim_size > 0) {
+            *defined_size *= dim_size;
+          } else if (IsUnknown(dim)) {
+            ++(*unknown_dims)[unknown_dim_id++];
+          } else if (IsKnownSymbolically(dim)) {
+            ++(*unknown_dims)[dim_size];
+          }
+        }
+      };
+
+  process_dimensions(left, &left_defined_size, &left_unknown_dims);
+  process_dimensions(right, &right_defined_size, &right_unknown_dims);
+
+  // Compute the union of unknown dimension ids that appear in either shape
+  std::set<int64> unknown_dims;
+  for (const auto& el : left_unknown_dims) unknown_dims.insert(el.first);
+  for (const auto& el : right_unknown_dims) unknown_dims.insert(el.first);
+
+  // Cancel unknown dimensions that appeared in both shapes
+  for (int64 unknown_dim : unknown_dims) {
+    int64 co_occurrence = std::min(left_unknown_dims[unknown_dim],
+                                   right_unknown_dims[unknown_dim]);
+    left_unknown_dims[unknown_dim] -= co_occurrence;
+    right_unknown_dims[unknown_dim] -= co_occurrence;
+  }
+
+  // Count unbalanced unknown dimensions
+  int64 left_unbalanced_unknown_dims = 0;
+  int64 right_unbalanced_unknown_dims = 0;
+  for (const auto& el : left_unknown_dims)
+    left_unbalanced_unknown_dims += el.second;
+  for (const auto& el : right_unknown_dims)
+    right_unbalanced_unknown_dims += el.second;
+
+  if (left_unbalanced_unknown_dims == 0 && right_unbalanced_unknown_dims == 0) {
+    // If unknown dimensions cancelled each other, compare tensor sizes
+    // represented by defined dimensions
+    return left_defined_size < right_defined_size;
+  }
+
+  if (left_defined_size <= right_defined_size &&
+      left_unbalanced_unknown_dims == 0 && right_unbalanced_unknown_dims > 0) {
+    // If the size of the 'left' tensor computed from defined dimensions is
+    // less or equal, and only the shape on the right has unbalanced unknown
+    // dimensions, the left tensor is strictly smaller (assuming that every
+    // unknown dimension size is larger than 1)
+    return true;
+  }
+
+  // In every other case, assuming that unknown dimensions can be arbitrarily
+  // large in size, we can't guarantee any ordering
+  return false;
+}
+
+bool CompareSymbolicallyShapedTensorSizes(
+    const OpInfo::TensorProperties& left,
+    const OpInfo::TensorProperties& right) {
+  return CompareSymbolicallyShapedTensorSizes(left.shape(), right.shape());
+}
+
+}  // end namespace grappler
+}  // end namespace tensorflow
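
The cancellation logic in CompareSymbolicallyShapedTensorSizes above can be illustrated with a standalone sketch. Everything named here is an assumption made for illustration: Dims is a hypothetical stand-in for TensorShapeProto (a plain vector of dimension sizes with known rank), and ProvablySmaller is not a TensorFlow function. As in the patch, -1 means an unknown dimension and values <= -2 are symbolic dimension ids. The unknown-rank case is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for a shape of known rank.
using Dims = std::vector<int64_t>;

// Returns true only if 'left' can be shown to hold fewer elements than 'right'.
bool ProvablySmaller(const Dims& left, const Dims& right) {
  int64_t left_size = 1, right_size = 1;
  std::map<int64_t, int64_t> left_unknown, right_unknown;
  int64_t next_unknown_id = 1;  // Positive ids: negatives are symbolic ids.

  auto process = [&next_unknown_id](const Dims& dims, int64_t* size,
                                    std::map<int64_t, int64_t>* unknown) {
    for (int64_t d : dims) {
      if (d > 0) {
        *size *= d;                         // Defined dimension.
      } else if (d == -1) {
        ++(*unknown)[next_unknown_id++];    // Each -1 is a distinct unknown.
      } else if (d <= -2) {
        ++(*unknown)[d];                    // Symbolic dimension, keyed by id.
      }
    }
  };
  process(left, &left_size, &left_unknown);
  process(right, &right_size, &right_unknown);

  // Cancel symbolic dimensions that occur on both sides.
  for (auto& entry : left_unknown) {
    auto it = right_unknown.find(entry.first);
    if (it == right_unknown.end()) continue;
    const int64_t common = std::min(entry.second, it->second);
    entry.second -= common;
    it->second -= common;
  }

  int64_t left_rest = 0, right_rest = 0;
  for (const auto& entry : left_unknown) left_rest += entry.second;
  for (const auto& entry : right_unknown) right_rest += entry.second;

  if (left_rest == 0 && right_rest == 0) return left_size < right_size;
  // Only the right side has leftover unknowns (assumed to be larger than 1).
  return left_size <= right_size && left_rest == 0 && right_rest > 0;
}
```

For example, with these definitions {1, -2, 32} is provably smaller than {-2, 32, 32} because the shared symbolic dimension -2 cancels and 32 < 1024, whereas {1, -1, 32} versus {1, -1, 32} gives false because two separate -1 dimensions are never assumed equal.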
diff --git a/tensorflow/core/grappler/optimizers/symbolic_shapes.h b/tensorflow/core/grappler/optimizers/symbolic_shapes.h
new file mode 100644
index 0000000000000000000000000000000000000000..a9dcf44e236281badfabaf5213ef09fd98bf0820
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/symbolic_shapes.h
@@ -0,0 +1,60 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_SYMBOLIC_SHAPES_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_SYMBOLIC_SHAPES_H_
+
+#include "tensorflow/core/framework/tensor_shape.pb.h"
+#include "tensorflow/core/grappler/costs/op_performance_data.pb.h"
+
+namespace tensorflow {
+namespace grappler {
+
+bool IsKnown(const TensorShapeProto::Dim& dim);
+bool IsKnownSymbolically(const TensorShapeProto::Dim& dim);
+bool IsUnknown(const TensorShapeProto::Dim& dim);
+
+// Shape is symbolically defined, if it has a known rank, and each dimension is
+// known (dim_size >= 0), or is a symbolic dimension size (dim_size <= -2).
+bool ShapeIsSymbolicallyDefined(const TensorShapeProto& shape);
+bool ShapeIsSymbolicallyDefined(const OpInfo::TensorProperties& properties);
+
+// Shapes are symbolically equal, if they have the same rank, they are
+// known or symbolically defined, and have matching dimensions.
+bool ShapesSymbolicallyEqual(const TensorShapeProto& left,
+                             const TensorShapeProto& right);
+bool ShapesSymbolicallyEqual(const OpInfo::TensorProperties& left,
+                             const OpInfo::TensorProperties& right);
+
+// Check if two shapes can be broadcast to each other. Both shapes must be at
+// least symbolically defined, and they must have a valid BCast instance.
+bool ShapesBroadcastable(const TensorShapeProto& left,
+                         const TensorShapeProto& right);
+bool ShapesBroadcastable(const OpInfo::TensorProperties& left,
+                         const OpInfo::TensorProperties& right);
+
+// Return true if we can prove that the tensor of size 'left' is smaller than
+// the tensor of size 'right'. Return false if it's larger or equal, or if
+// unknown dimensions or mismatched symbolic dimensions prevent a comparison.
+bool CompareSymbolicallyShapedTensorSizes(const TensorShapeProto& left,
+                                          const TensorShapeProto& right);
+bool CompareSymbolicallyShapedTensorSizes(
+    const OpInfo::TensorProperties& left,
+    const OpInfo::TensorProperties& right);
+
+}  // namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_SYMBOLIC_SHAPES_H_
diff --git a/tensorflow/core/grappler/optimizers/symbolic_shapes_test.cc b/tensorflow/core/grappler/optimizers/symbolic_shapes_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..5ef9f6592571062564d16e5fb282b1dd85d074ef
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/symbolic_shapes_test.cc
@@ -0,0 +1,95 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/symbolic_shapes.h"
+#include "tensorflow/core/framework/tensor_shape.pb.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+class SymbolicShapesTest : public ::testing::Test {
+ protected:
+  TensorShapeProto MakeUnknown() {
+    TensorShapeProto shape;
+    shape.set_unknown_rank(true);
+    return shape;
+  }
+
+  TensorShapeProto MakeShape(std::vector<int> dims) {
+    TensorShapeProto shape;
+    for (int dim_size : dims) {
+      TensorShapeProto::Dim dim;
+      dim.set_size(dim_size);
+      *shape.add_dim() = dim;
+    }
+    return shape;
+  }
+};
+
+bool operator<(const TensorShapeProto& lhs, const TensorShapeProto& rhs) {
+  return CompareSymbolicallyShapedTensorSizes(lhs, rhs);
+}
+
+TEST_F(SymbolicShapesTest, ShapeIsSymbolicallyDefined) {
+  EXPECT_FALSE(ShapeIsSymbolicallyDefined(MakeUnknown()));
+  EXPECT_FALSE(ShapeIsSymbolicallyDefined(MakeShape({-1, 2})));
+
+  EXPECT_TRUE(ShapeIsSymbolicallyDefined(MakeShape({1, 2})));
+  EXPECT_TRUE(ShapeIsSymbolicallyDefined(MakeShape({-2, 2})));
+}
+
+TEST_F(SymbolicShapesTest, ShapesSymbolicallyEqual) {
+  EXPECT_FALSE(ShapesSymbolicallyEqual(MakeUnknown(), MakeUnknown()));
+  EXPECT_FALSE(ShapesSymbolicallyEqual(MakeShape({-1, 2}), MakeShape({-1, 2})));
+  EXPECT_FALSE(ShapesSymbolicallyEqual(MakeShape({-2, 2}), MakeShape({-3, 2})));
+
+  EXPECT_TRUE(ShapesSymbolicallyEqual(MakeShape({1, 2}), MakeShape({1, 2})));
+  EXPECT_TRUE(ShapesSymbolicallyEqual(MakeShape({-2, 2}), MakeShape({-2, 2})));
+}
+
+TEST_F(SymbolicShapesTest, ShapesBroadcastable) {
+  EXPECT_FALSE(ShapesBroadcastable(MakeUnknown(), MakeUnknown()));
+  EXPECT_FALSE(ShapesBroadcastable(MakeShape({-2}), MakeShape({1, -3})));
+  EXPECT_FALSE(ShapesBroadcastable(MakeShape({-1, 2}), MakeShape({-1, 2})));
+  EXPECT_FALSE(ShapesBroadcastable(MakeShape({-2, 2}), MakeShape({-3, 2})));
+  EXPECT_FALSE(ShapesBroadcastable(MakeShape({-2, 4}), MakeShape({-2, 8})));
+
+  EXPECT_TRUE(ShapesBroadcastable(MakeShape({1, 2}), MakeShape({1, 2})));
+  EXPECT_TRUE(ShapesBroadcastable(MakeShape({-2, 2}), MakeShape({-2, 2})));
+  EXPECT_TRUE(ShapesBroadcastable(MakeShape({-2, 32}), MakeShape({-2, 1})));
+  EXPECT_TRUE(ShapesBroadcastable(MakeShape({-2, 1}), MakeShape({1, -2})));
+  EXPECT_TRUE(ShapesBroadcastable(MakeShape({-2, 1}), MakeShape({1, -3})));
+  EXPECT_TRUE(ShapesBroadcastable(MakeShape({-3}), MakeShape({-2, -3})));
+}
+
+TEST_F(SymbolicShapesTest, CompareSymbolicallyShapedTensorSizes) {
+  EXPECT_TRUE(MakeShape({1, 1, 32}) < MakeShape({32, 32}));
+  EXPECT_TRUE(MakeShape({1, 32, 32}) < MakeShape({2048}));
+  EXPECT_TRUE(MakeShape({1, -2, 32}) < MakeShape({-2, 32, 32}));
+  EXPECT_TRUE(MakeShape({1, 32, 32}) < MakeShape({-2, 32, 32}));
+  EXPECT_TRUE(MakeShape({1, 32, 32}) < MakeShape({-1, 32, 32}));
+  EXPECT_TRUE(MakeShape({1, -2, 32}) < MakeShape({-2, -2, 32}));
+
+  EXPECT_FALSE(MakeShape({1, -2, 32}) < MakeShape({-3, 32, 32}));
+  EXPECT_FALSE(MakeShape({1, -1, 32}) < MakeShape({1, -1, 32}));
+  EXPECT_FALSE(MakeShape({1, -1, 32}) < MakeShape({-1, -1, 32}));
+  EXPECT_FALSE(MakeShape({-1, -1, 32}) < MakeShape({1, -1, 32}));
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/utils.cc b/tensorflow/core/grappler/utils.cc
index 81bb5e6c3b26ebbed8cd1555c10d2dd6f2a47c12..829bfe9e3104273cbf9d3d0ddb0bef4a233708e7 100644
--- a/tensorflow/core/grappler/utils.cc
+++ b/tensorflow/core/grappler/utils.cc
@@ -40,6 +40,16 @@ bool SafeSetScalarTensorValue(double value, Tensor* tensor) {
   tensor->flat<T>()(0) = static_cast<T>(value);
   return true;
 }
+
+// Is 'node' an operator that consumes only the shape of its input, not the
+// data itself?
+// TODO(ezhulenev): move to op_types.h. Requires breaking a circular dependency.
+// TODO(ezhulenev): what about Identity passing tensor to Shape consumer?
+bool IsShapeConsumer(const NodeDef& node) {
+  const string& op = node.op();
+  return op == "Shape" || op == "ShapeN" || op == "Rank" || op == "Size";
+}
+
 }  // namespace
 
 NodeMap::NodeMap(GraphDef* graph) {
@@ -270,6 +280,22 @@ int NumNonControlOutputs(const NodeDef& node, const NodeMap& node_map) {
   return num_outputs;
 }
 
+int NumNonControlDataOutputs(const NodeDef& node, const NodeMap& node_map) {
+  int num_data_outputs = 0;
+  for (const NodeDef* output : node_map.GetOutputs(node.name())) {
+    if (IsShapeConsumer(*output)) continue;
+
+    for (int i = 0; i < output->input_size(); ++i) {
+      const string& input = output->input(i);
+      if (!IsControlInput(input) && NodeName(input) == node.name()) {
+        ++num_data_outputs;
+        break;
+      }
+    }
+  }
+  return num_data_outputs;
+}
+
 // Returns the data type in attribute `attr_name` of `node`. If that attribute
 // doesn't exist, returns DT_INVALID.
 DataType GetDataTypeFromAttr(const NodeDef& node, const string& attr_name) {
@@ -348,6 +374,7 @@ inline void STLSortAndRemoveDuplicates(T* v) {
 
 Status SimpleGraphView::Initialize(const GraphDef& graph, bool dedup_inputs,
                                    bool dedup_outputs) {
+  graph_ = &graph;
   const int num_nodes = graph.node_size();
   inputs_.clear();
   inputs_.resize(num_nodes);
@@ -394,6 +421,22 @@ Status SimpleGraphView::Initialize(const GraphDef& graph, bool dedup_inputs,
   return Status::OK();
 }
 
+void SimpleGraphView::DepthFirstSearch(
+    const std::unordered_set<string>& op_types_to_traverse, int node_idx,
+    std::set<int>* nodes_found) const {
+  if (nodes_found->find(node_idx) != nodes_found->end()) {
+    return;
+  }
+  nodes_found->insert(node_idx);
+  const string& op_type = graph_->node(node_idx).op();
+  if (op_types_to_traverse.find(op_type) == op_types_to_traverse.end()) {
+    return;
+  }
+  for (auto output_idx : this->outputs(node_idx)) {
+    DepthFirstSearch(op_types_to_traverse, output_idx, nodes_found);
+  }
+}
+
 string SimpleGraphView::PrintToString() const {
   string str;
   for (int i = 0; i < num_nodes(); ++i) {
diff --git a/tensorflow/core/grappler/utils.h b/tensorflow/core/grappler/utils.h
index 255319693a57a7cc493365a51d5d04d2893f08c5..7aa31939f58ae2556fc7ffa59ca337a3c162ca2e 100644
--- a/tensorflow/core/grappler/utils.h
+++ b/tensorflow/core/grappler/utils.h
@@ -144,6 +144,10 @@ int NumNonControlInputs(const NodeDef& node);
 // Number of connected non-control outputs.
 int NumNonControlOutputs(const NodeDef& node, const NodeMap& node_map);
 
+// Number of connected non-control data outputs (Ops that consume output tensor
+// data, not just its shape).
+int NumNonControlDataOutputs(const NodeDef& node, const NodeMap& node_map);
+
 // Removes redundant control inputs from node.
 void DedupControlInputs(NodeDef* node);
 
@@ -178,6 +182,7 @@ class SimpleGraphView {
   Status Initialize(const GraphDef& graph, bool dedup_inputs,
                     bool dedup_outputs);
 
+  const GraphDef* graph() const { return graph_; }
   inline int num_nodes() const { return index_to_name_.size(); }
   inline const int index(const string& node_name) const {
     const auto& it = name_to_index_.find(node_name);
@@ -194,9 +199,17 @@ class SimpleGraphView {
     return outputs_[node_idx];
   }
 
+  // Traverse the graph starting at `node_idx`, collecting indices of nodes
+  // visited in nodes_found. If a node has an op in `op_types_to_traverse`, the
+  // walk continues to its children. It is assumed that *graph_ was not modified
+  // after the call to Initialize().
+  void DepthFirstSearch(const std::unordered_set<string>& op_types_to_traverse,
+                        int node_idx, std::set<int>* nodes_found) const;
+
   string PrintToString() const;
 
  private:
+  const GraphDef* graph_;  // Not owned.
   std::vector<string> index_to_name_;
   std::unordered_map<string, int> name_to_index_;
   std::vector<gtl::InlinedVector<int, 4>> inputs_;
diff --git a/tensorflow/core/grappler/utils/BUILD b/tensorflow/core/grappler/utils/BUILD
index 0a9dbe22cfe3cd01c2c61661adcdd4839a957f03..939031c44b57e930b80fc7897be8e9f5e7906688 100644
--- a/tensorflow/core/grappler/utils/BUILD
+++ b/tensorflow/core/grappler/utils/BUILD
@@ -142,6 +142,54 @@ cc_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core:test",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:utils",
+    ],
+)
+
+tf_cc_test(
+    name = "grappler_test_test",
+    size = "small",
+    srcs = ["grappler_test_test.cc"],
+    deps = [
+        ":grappler_test",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:direct_session",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
         "//tensorflow/core/grappler:utils",
     ],
 )
+
+cc_library(
+    name = "functions",
+    srcs = [
+        "functions.cc",
+    ],
+    hdrs = ["functions.h"],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/core:framework",
+        "//tensorflow/core:framework_internal",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:utils",
+    ],
+)
+
+tf_cc_test(
+    name = "functions_test",
+    srcs = ["functions_test.cc"],
+    deps = [
+        ":functions",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/core:all_kernels",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+    ],
+)
diff --git a/tensorflow/core/grappler/utils/functions.cc b/tensorflow/core/grappler/utils/functions.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4f286ce1c8bc3df4065f39c1744600d457173c2e
--- /dev/null
+++ b/tensorflow/core/grappler/utils/functions.cc
@@ -0,0 +1,153 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/grappler/utils/functions.h"
+
+#include <unordered_map>
+
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/function.h"
+#include "tensorflow/core/framework/function.pb.h"
+#include "tensorflow/core/framework/graph_def_util.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/types.pb.h"
+#include "tensorflow/core/grappler/utils.h"
+
+namespace tensorflow {
+namespace grappler {
+
+std::unique_ptr<GrapplerItem> GrapplerItemFromFunctionDef(
+    const FunctionDef& func,
+    const std::unordered_map<string, AttrValue>& func_attr,
+    const FunctionDefLibrary& library) {
+  if (func.signature().name().empty()) {
+    LOG(ERROR) << "Function name must be specified.";
+    return nullptr;
+  }
+  std::unique_ptr<GrapplerItem> new_item(new GrapplerItem());
+  new_item->id = func.signature().name();
+
+  std::unordered_map<string, string> port_map;
+
+  // Add the function inputs as placeholders.
+  for (const auto& inp : func.signature().input_arg()) {
+    NodeDef* ph = new_item->graph.add_node();
+    ph->set_name(inp.name());
+    ph->set_op("Placeholder");
+    if (inp.type() != DT_INVALID) {
+      (*ph->mutable_attr())["T"].set_type(inp.type());
+    } else {
+      auto it = func_attr.find(inp.type_attr());
+      if (it == func_attr.end()) {
+        LOG(ERROR) << "Unknown type attribute " << inp.type_attr()
+                   << " for function input " << inp.name();
+        return nullptr;
+      } else {
+        (*ph->mutable_attr())["T"] = it->second;
+      }
+    }
+    port_map[inp.name()] = inp.name();
+  }
+
+  // Add the function body to the graph.
+  FunctionLibraryDefinition func_def(OpRegistry::Global(), library);
+
+  for (const NodeDef& node : func.node_def()) {
+    NodeDef* new_node = new_item->graph.add_node();
+    *new_node = node;
+    // Replace the placeholder attribute values with the specified value.
+    for (auto& attr : *new_node->mutable_attr()) {
+      const string& ph_name = attr.second.placeholder();
+      auto it = func_attr.find(ph_name);
+      if (it != func_attr.end()) {
+        attr.second = it->second;
+      }
+    }
+
+    // Functions use a custom format to encode connectivity. Map these custom
+    // strings to regular ones.
+    const OpRegistrationData* registration;
+    Status status = func_def.LookUp(node.op(), &registration);
+    if (!status.ok()) {
+      LOG(ERROR) << "Op " << node.op() << " not registered: " << status;
+      return nullptr;
+    }
+
+    tensorflow::NameRangeMap inputs;
+    tensorflow::NameRangeMap outputs;
+    status = tensorflow::NameRangesForNode(node, registration->op_def, &inputs,
+                                           &outputs);
+    if (!status.ok()) {
+      LOG(ERROR) << "Op " << node.op() << " invalid: " << status;
+      return nullptr;
+    }
+    for (const auto& name_range : outputs) {
+      string port_prefix =
+          strings::StrCat(node.name(), ":", name_range.first, ":");
+      int index_start = name_range.second.first;
+      int index_end = name_range.second.second;
+      for (int i = index_start; i < index_end; ++i) {
+        string port_id = strings::StrCat(port_prefix, i - index_start);
+        string port_name = strings::StrCat(node.name(), ":", i);
+        port_map[port_id] = port_name;
+      }
+    }
+  }
+
+  for (auto& node : *new_item->graph.mutable_node()) {
+    // Rewrite the inputs to use the normal naming convention.
+    for (int i = 0; i < node.input_size(); ++i) {
+      const string& input = node.input(i);
+      if (IsControlInput(input)) {
+        // No need to remap control dependencies.
+        continue;
+      } else {
+        auto it = port_map.find(input);
+        if (it == port_map.end()) {
+          LOG(ERROR) << "Unknown input: " << input;
+          return nullptr;
+        }
+        node.set_input(i, it->second);
+      }
+    }
+  }
+
+  // Add the function outputs to the list of fetch nodes, taking into account
+  // the output mapping if any.
+  for (const auto& out : func.signature().output_arg()) {
+    auto it = func.ret().find(out.name());
+    if (it != func.ret().end()) {
+      auto it2 = port_map.find(it->second);
+      if (it2 == port_map.end()) {
+        LOG(ERROR) << "Unknown output mapping: " << it->first << " to "
+                   << it->second;
+        return nullptr;
+      } else {
+        new_item->fetch.emplace_back(it2->second);
+      }
+    } else {
+      new_item->fetch.emplace_back(out.name());
+    }
+  }
+  // Add the function inputs to the list of feeds.
+  for (const auto& inp : func.signature().input_arg()) {
+    new_item->feed.emplace_back(inp.name(), Tensor());
+  }
+
+  return new_item;
+}
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/utils/functions.h b/tensorflow/core/grappler/utils/functions.h
new file mode 100644
index 0000000000000000000000000000000000000000..8f9b7d848a89435e1839e540f33d87213beb8a45
--- /dev/null
+++ b/tensorflow/core/grappler/utils/functions.h
@@ -0,0 +1,40 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_GRAPPLER_UTILS_FUNCTIONS_H_
+#define TENSORFLOW_GRAPPLER_UTILS_FUNCTIONS_H_
+
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/function.pb.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+
+namespace tensorflow {
+
+namespace grappler {
+
+// Factory method for creating a GrapplerItem from a FunctionDef.
+// Returns nullptr if the given function def cannot be converted.
+std::unique_ptr<GrapplerItem> GrapplerItemFromFunctionDef(
+    const FunctionDef& func,
+    const std::unordered_map<string, AttrValue>& func_attr,
+    const FunctionDefLibrary& library);
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_GRAPPLER_UTILS_FUNCTIONS_H_
diff --git a/tensorflow/core/grappler/utils/functions_test.cc b/tensorflow/core/grappler/utils/functions_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..6a7d766b1c6b49f8fc13b3b0294f3e3f8a74eb35
--- /dev/null
+++ b/tensorflow/core/grappler/utils/functions_test.cc
@@ -0,0 +1,350 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/utils/functions.h"
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/framework/function_testlib.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/node_def_util.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/protobuf/meta_graph.pb.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+class FunctionsTest : public ::testing::Test {};
+
+TEST_F(FunctionsTest, FromSimpleFunctionDef) {
+  const Tensor kTwo = test::AsScalar<int64>(2);
+  FunctionDef func = FunctionDefHelper::Define(
+      // Name
+      "XTimesTwo",
+      // Args
+      {"x: T"},
+      // Return values
+      {"y: T"},
+      // Attr def
+      {"T: {float, double, int32, int64}"},
+      // Nodes
+      {
+          {{"two"}, "Const", {}, {{"value", kTwo}, {"dtype", DT_INT64}}},
+          {{"scale"}, "Cast", {"two"}, {{"SrcT", DT_INT64}, {"DstT", "$T"}}},
+          {{"y"}, "Mul", {"x", "scale"}, {{"T", "$T"}}},
+      });
+
+  std::unordered_map<string, AttrValue> func_attr;
+  func_attr["T"].set_type(DT_FLOAT);
+  FunctionDefLibrary library;
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+  CHECK(item);
+  EXPECT_EQ("XTimesTwo", item->id);
+  EXPECT_EQ(4, item->graph.node_size());
+  EXPECT_EQ(std::vector<string>({"y:0"}), item->fetch);
+  EXPECT_EQ(1, item->feed.size());
+  EXPECT_EQ("x", item->feed[0].first);
+
+  for (const NodeDef &node : item->graph.node()) {
+    if (node.name() == "x") {
+      EXPECT_EQ("Placeholder", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(0, node.input_size());
+    } else if (node.name() == "two") {
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(0, node.input_size());
+    } else if (node.name() == "scale") {
+      EXPECT_EQ("Cast", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("DstT").type());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("two:0", node.input(0));
+    } else if (node.name() == "y") {
+      EXPECT_EQ("Mul", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("scale:0", node.input(1));
+    }
+  }
+}
+
+TEST_F(FunctionsTest, FromFunctionDefWithMultiOutputNodes) {
+  // Gradient graph for the Subtract operation
+  std::vector<FunctionDefHelper::Node> nodes = {
+      {{"sx"}, "Shape", {"x"}},
+      {{"sy"}, "Shape", {"y"}},
+      {{"gx"}, "Identity", {"dz"}},
+      {{"gy"}, "Neg", {"dz"}},
+      {{"rx", "ry"}, "BroadcastGradientArgs", {"sx", "sy"}},
+      {{"sum_gx"}, "Sum", {"gx", "rx"}},
+      {{"dx"}, "Reshape", {"sum_gx", "sx"}},
+      {{"sum_gy"}, "Sum", {"gy", "ry"}},
+      {{"dy"}, "Reshape", {"sum_gy", "sy"}},
+  };
+
+  for (auto &n : nodes) {
+    // "BroadcastGradientArgs" doesn't need any attrs.
+    if (n.attr.empty() && n.op != "BroadcastGradientArgs") {
+      n.attr = {{"T", "$T"}};
+    }
+  }
+  FunctionDef func = FunctionDefHelper::Define(
+      // Name
+      "SubGrad",
+      // Arg defs
+      {"x: T", "y: T", "dz: T"},
+      // Ret val defs
+      {"dx: T", "dy: T"},
+      // Attr defs
+      {{"T: {half, float, double}"}},
+      // Nodes
+      nodes);
+
+  std::unordered_map<string, AttrValue> func_attr;
+  func_attr["T"].set_type(DT_FLOAT);
+  FunctionDefLibrary library;
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+  CHECK(item);
+  EXPECT_EQ("SubGrad", item->id);
+  EXPECT_EQ(12, item->graph.node_size());
+  EXPECT_EQ(std::vector<string>({"dx:0", "dy:0"}), item->fetch);
+  EXPECT_EQ(3, item->feed.size());
+  EXPECT_EQ("x", item->feed[0].first);
+  EXPECT_EQ("y", item->feed[1].first);
+  EXPECT_EQ("dz", item->feed[2].first);
+
+  for (const NodeDef &node : item->graph.node()) {
+    if (node.name() == "x" || node.name() == "y" || node.name() == "dz") {
+      EXPECT_EQ("Placeholder", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(0, node.input_size());
+    } else if (node.name() == "rx") {
+      EXPECT_EQ("BroadcastGradientArgs", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("sx:0", node.input(0));
+      EXPECT_EQ("sy:0", node.input(1));
+    } else if (node.name() == "sum_gx") {
+      EXPECT_EQ("Sum", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("gx:0", node.input(0));
+      EXPECT_EQ("rx:0", node.input(1));
+    } else if (node.name() == "sum_gy") {
+      EXPECT_EQ("Sum", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("gy:0", node.input(0));
+      EXPECT_EQ("rx:1", node.input(1));
+    }
+  }
+}
+
+TEST_F(FunctionsTest, FromFunctionDefWithNestedFuncs) {
+  FunctionDefLibrary library;
+  *library.add_function() = FunctionDefHelper::Define(
+      // Name
+      "Swap",
+      // Args
+      {"i0: T", "i1: T"},
+      // Return values
+      {"o0: T", "o1: T"},
+      // Attr def
+      {"T: {float, double}"},
+      // Nodes
+      {{{"o0"}, "Identity", {"i1"}, {{"T", "$T"}}},
+       {{"o1"}, "Identity", {"i0"}, {{"T", "$T"}}}});
+
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "ManySwapsFirst",
+      // Args
+      {"x: float", "y: float"},
+      // Return values
+      {"o: float"},
+      // Attr def
+      {},
+      // Nodes
+      // o = x*x + y*y.  Furthermore, the 1st swap depends on x2, and
+      // y2 depends on the 2nd swap.  The 2nd swap has a data dependency
+      // on the 1st swap.
+      {{{"a0"}, "Swap", {"x", "y"}, {{"T", DT_FLOAT}}, {"x2"}},
+       {{"a1"}, "Swap", {"a0:o0:0", "a0:o1:0"}, {{"T", DT_FLOAT}}},
+       {{"x2"}, "Mul", {"x", "x"}, {{"T", DT_FLOAT}}},
+       {{"y2"}, "Mul", {"y", "y"}, {{"T", DT_FLOAT}}, {"a1"}},
+       {{"o"}, "Add", {"x2:z:0", "y2:z:0"}, {{"T", DT_FLOAT}}}},
+      // Output Mapping
+      {{"o", "o:z:0"}});
+
+  std::unordered_map<string, AttrValue> func_attr;
+  func_attr["T"].set_type(DT_FLOAT);
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+
+  for (const NodeDef &node : item->graph.node()) {
+    if (node.name() == "x" || node.name() == "y") {
+      EXPECT_EQ("Placeholder", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(0, node.input_size());
+    } else if (node.name() == "a0") {
+      EXPECT_EQ("Swap", node.op());
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("y", node.input(1));
+      EXPECT_EQ("^x2", node.input(2));
+    } else if (node.name() == "a1") {
+      EXPECT_EQ("Swap", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("a0:0", node.input(0));
+      EXPECT_EQ("a0:1", node.input(1));
+    } else if (node.name() == "x2") {
+      EXPECT_EQ("Mul", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("x", node.input(1));
+    } else if (node.name() == "y2") {
+      EXPECT_EQ("Mul", node.op());
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("y", node.input(0));
+      EXPECT_EQ("y", node.input(1));
+      EXPECT_EQ("^a1", node.input(2));
+    } else if (node.name() == "o") {
+      EXPECT_EQ("Add", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("x2:0", node.input(0));
+      EXPECT_EQ("y2:0", node.input(1));
+    }
+  }
+}
+
+TEST_F(FunctionsTest, FromFunctionDefWithOutputMappings) {
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "Exp_func",
+      // Args
+      {"in: float"},
+      // Return values
+      {"out: float"},
+      // Attr def
+      {},
+      // Nodes
+      {{{"Linear_func"}, "Identity", {"in"}, {{"T", DT_FLOAT}}},
+       {{"Exp"}, "Exp", {"Linear_func:output:0"}, {{"T", DT_FLOAT}}}},
+      // Mapping
+      {{"out", "Exp:y:0"}});
+
+  std::unordered_map<string, AttrValue> func_attr;
+  FunctionDefLibrary library;
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+
+  EXPECT_EQ(1, item->fetch.size());
+  EXPECT_EQ("Exp:0", item->fetch[0]);
+
+  for (const NodeDef &node : item->graph.node()) {
+    if (node.name() == "in") {
+      EXPECT_EQ("Placeholder", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(0, node.input_size());
+    } else if (node.name() == "Linear_func") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("in", node.input(0));
+    } else if (node.name() == "Exp") {
+      EXPECT_EQ("Exp", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("Linear_func:0", node.input(0));
+    }
+  }
+}
+
+TEST_F(FunctionsTest, FromFunctionDefWithInputForwarding) {
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "ForwardInputs",
+      // Args
+      {"in0: float", "in1: float", "arg2: float", "arg3: int32", "arg4: float"},
+      // Return values
+      {"out0: float", "arg2: float", "arg3: int32"},
+      // Attr def
+      {},
+      // Nodes
+      {},
+      // Mapping
+      {{"out0", "in0"}});
+
+  std::unordered_map<string, AttrValue> func_attr;
+  FunctionDefLibrary library;
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+
+  EXPECT_EQ(3, item->fetch.size());
+  EXPECT_EQ("in0", item->fetch[0]);
+  EXPECT_EQ("arg2", item->fetch[1]);
+  EXPECT_EQ("arg3", item->fetch[2]);
+
+  EXPECT_EQ(5, item->graph.node_size());
+  for (const NodeDef &node : item->graph.node()) {
+    EXPECT_TRUE(node.name() == "in0" || node.name() == "in1" ||
+                node.name() == "arg2" || node.name() == "arg3" ||
+                node.name() == "arg4");
+    EXPECT_EQ("Placeholder", node.op());
+    if (node.name() == "arg3") {
+      EXPECT_EQ(DT_INT32, node.attr().at("T").type());
+    } else {
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+    }
+  }
+}
+
+TEST_F(FunctionsTest, FromFunctionDefWithoutInput) {
+  const Tensor kTwo = test::AsScalar<int64>(2);
+  FunctionDef func = FunctionDefHelper::Define(
+      // Name
+      "GenerateTwo",
+      // Args
+      {},
+      // Return value
+      {"o: T"},
+      // Attr def
+      {"T: {float, double}"},
+      // Nodes
+      {{{"two"}, "Const", {}, {{"value", kTwo}, {"dtype", DT_INT64}}},
+       {{"o"}, "Cast", {"two"}, {{"SrcT", DT_INT64}, {"DstT", "$T"}}}});
+
+  std::unordered_map<string, AttrValue> func_attr;
+  func_attr["T"].set_type(DT_FLOAT);
+  FunctionDefLibrary library;
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+
+  EXPECT_EQ(0, item->feed.size());
+  EXPECT_EQ(1, item->fetch.size());
+  EXPECT_EQ("o:0", item->fetch[0]);
+
+  EXPECT_EQ(2, item->graph.node_size());
+  const NodeDef &two = item->graph.node(0);
+  EXPECT_EQ("two", two.name());
+  EXPECT_EQ(0, two.input_size());
+  const NodeDef &cast = item->graph.node(1);
+  EXPECT_EQ("o", cast.name());
+  EXPECT_EQ(1, cast.input_size());
+  EXPECT_EQ("two:0", cast.input(0));
+
+  std::cout << item->graph.DebugString() << std::endl;
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/utils/grappler_test.cc b/tensorflow/core/grappler/utils/grappler_test.cc
index fed46c05fb341097991844a0f7cedb98fe3543c6..1c15ea65b8ccd4e5b4c38c7649cc523d4b3ccce9 100644
--- a/tensorflow/core/grappler/utils/grappler_test.cc
+++ b/tensorflow/core/grappler/utils/grappler_test.cc
@@ -17,15 +17,30 @@ limitations under the License.
 #include <memory>
 #include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/protobuf/rewriter_config.pb.h"
 #include "tensorflow/core/public/session.h"
 
 namespace tensorflow {
 namespace grappler {
 
+GrapplerTest::GrapplerTest() {
+  // Turn off all the automatic optimizations to ensure that we run the graph
+  // exactly as it is given to us. This ensures that we can compare the results
+  // before and after manual optimization, without any of the automatic
+  // optimizations interfering in the comparison.
+  RewriterConfig* cfg =
+      options_.config.mutable_graph_options()->mutable_rewrite_options();
+  cfg->set_constant_folding(RewriterConfig::OFF);
+  cfg->set_arithmetic_optimization(RewriterConfig::OFF);
+  cfg->set_dependency_optimization(RewriterConfig::OFF);
+  cfg->set_loop_optimization(RewriterConfig::OFF);
+  cfg->set_function_optimization(RewriterConfig::OFF);
+  cfg->set_layout_optimizer(RewriterConfig::OFF);
+}
+
 std::vector<Tensor> GrapplerTest::EvaluateNodes(
-    const GraphDef& graph, const std::vector<string>& node_names) {
-  SessionOptions options;
-  std::unique_ptr<tensorflow::Session> session(NewSession(options));
+    const GraphDef& graph, const std::vector<string>& node_names) const {
+  std::unique_ptr<tensorflow::Session> session(NewSession(options_));
   TF_CHECK_OK(session->Create(graph));
   RunOptions run_options;
   std::vector<Tensor> output_tensors;
@@ -35,17 +50,40 @@ std::vector<Tensor> GrapplerTest::EvaluateNodes(
   return output_tensors;
 }
 
-void GrapplerTest::AddNode(const string& name, const string& op,
-                           const std::vector<string>& inputs, GraphDef* graph) {
-  auto* node = graph->add_node();
+std::vector<Tensor> GrapplerTest::EvaluateFetchNodes(
+    const GrapplerItem& item) const {
+  std::unique_ptr<tensorflow::Session> session(NewSession(options_));
+  TF_CHECK_OK(session->Create(item.graph));
+  RunOptions run_options;
+  if (!item.init_ops.empty()) {
+    std::vector<Tensor> dummy;
+    TF_CHECK_OK(
+        session->Run(run_options, {}, {}, item.init_ops, &dummy, nullptr));
+  }
+  std::vector<Tensor> output_tensors;
+  TF_CHECK_OK(session->Run(run_options, item.feed, item.fetch, {},
+                           &output_tensors, nullptr));
+  TF_CHECK_OK(session->Close());
+  return output_tensors;
+}
+
+NodeDef* GrapplerTest::AddNode(
+    const string& name, const string& op, const std::vector<string>& inputs,
+    const std::vector<std::pair<string, AttrValue>>& attributes,
+    GraphDef* graph) const {
+  NodeDef* node = graph->add_node();
   node->set_name(name);
   node->set_op(op);
-  for (const auto& input : inputs) {
+  for (const string& input : inputs) {
     node->add_input(input);
   }
+  for (const auto& attr : attributes) {
+    (*node->mutable_attr())[attr.first] = attr.second;
+  }
+  return node;
 }
 
-void GrapplerTest::CompareGraphs(GraphDef want, GraphDef got) {
+void GrapplerTest::CompareGraphs(GraphDef want, GraphDef got) const {
   auto comparator = [](const NodeDef& n1, const NodeDef& n2) -> bool {
     return n1.name() < n2.name();
   };
@@ -73,5 +111,20 @@ void GrapplerTest::CompareGraphs(GraphDef want, GraphDef got) {
   }
 }
 
+bool GrapplerTest::IsNodesDirectlyConnected(const NodeMap& node_map,
+                                            const string& src,
+                                            const string& dst, int position) {
+  const NodeDef* src_node = node_map.GetNode(src);
+  const NodeDef* dst_node = node_map.GetNode(dst);
+  EXPECT_TRUE(src_node != nullptr) << src << " node not found";
+  EXPECT_TRUE(dst_node != nullptr) << dst << " node not found";
+  return src_node && dst_node && dst_node->input(position) == src_node->name();
+}
+
+int GrapplerTest::CountOpNodes(const GraphDef& graph, const string& op) {
+  return std::count_if(graph.node().begin(), graph.node().end(),
+                       [&op](const NodeDef& node) { return node.op() == op; });
+}
+
 }  // namespace grappler
 }  // namespace tensorflow
diff --git a/tensorflow/core/grappler/utils/grappler_test.h b/tensorflow/core/grappler/utils/grappler_test.h
index 042b616aa46bb565dc7a2b236eb6c2d95158b9d1..e0c67381a454a7a6219ffe596831f203856b71ab 100644
--- a/tensorflow/core/grappler/utils/grappler_test.h
+++ b/tensorflow/core/grappler/utils/grappler_test.h
@@ -18,22 +18,43 @@ limitations under the License.
 
 #include <vector>
 
+#include "tensorflow/core/framework/attr_value.pb.h"
 #include "tensorflow/core/framework/graph.pb.h"
 #include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/public/session_options.h"
 
 namespace tensorflow {
 namespace grappler {
 
 class GrapplerTest : public ::testing::Test {
+ public:
+  GrapplerTest();
+
  protected:
-  std::vector<Tensor> EvaluateNodes(const GraphDef& graph,
-                                    const std::vector<string>& node_names);
+  std::vector<Tensor> EvaluateNodes(
+      const GraphDef& graph, const std::vector<string>& node_names) const;
+
+  std::vector<Tensor> EvaluateFetchNodes(const GrapplerItem& item) const;
+
+  NodeDef* AddNode(const string& name, const string& op,
+                   const std::vector<string>& inputs,
+                   const std::vector<std::pair<string, AttrValue>>& attributes,
+                   GraphDef* graph) const;
+
+  void CompareGraphs(GraphDef want, GraphDef got) const;
+
+  // Check if node 'src' is directly connected to the input($position) of 'dst'.
+  bool IsNodesDirectlyConnected(const NodeMap& node_map, const string& src,
+                                const string& dst, int position = 0);
 
-  void AddNode(const string& name, const string& op,
-               const std::vector<string>& inputs, GraphDef* graph);
+  // Count nodes of the given op-type in a graph.
+  int CountOpNodes(const GraphDef& graph, const string& op);
 
-  void CompareGraphs(GraphDef want, GraphDef got);
+ private:
+  SessionOptions options_;
 };
 
 }  // end namespace grappler
diff --git a/tensorflow/core/grappler/utils/grappler_test_test.cc b/tensorflow/core/grappler/utils/grappler_test_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..677fa5a79896cf82831812a90242d1e2dd9d302c
--- /dev/null
+++ b/tensorflow/core/grappler/utils/grappler_test_test.cc
@@ -0,0 +1,100 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/utils/grappler_test.h"
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+// TODO(ezhulenev): add tests for all methods in GrapplerTest
+class GrapplerTestTest : public GrapplerTest {};
+
+TEST_F(GrapplerTestTest, CompareIdenticalGraphs) {
+  tensorflow::Scope s1 = tensorflow::Scope::NewRootScope();
+  auto s1_a = ops::Variable(s1.WithOpName("a"), {2, 2}, DT_FLOAT);
+  auto s1_b = ops::Variable(s1.WithOpName("b"), {2, 2}, DT_FLOAT);
+  auto s1_add = ops::Add(s1.WithOpName("Add_1"), s1_a, s1_b);
+
+  tensorflow::Scope s2 = tensorflow::Scope::NewRootScope();
+  auto s2_a = ops::Variable(s2.WithOpName("a"), {2, 2}, DT_FLOAT);
+  auto s2_b = ops::Variable(s2.WithOpName("b"), {2, 2}, DT_FLOAT);
+  auto s2_add = ops::Add(s2.WithOpName("Add_1"), s2_a, s2_b);
+
+  GraphDef graph1;
+  TF_ASSERT_OK(s1.ToGraphDef(&graph1));
+
+  GraphDef graph2;
+  TF_ASSERT_OK(s2.ToGraphDef(&graph2));
+
+  CompareGraphs(graph1, graph2);
+}
+
+TEST_F(GrapplerTestTest, CheckNodesConnectivity) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
+  auto a = ops::Variable(s.WithOpName("a"), {2, 2}, DT_FLOAT);
+  auto b = ops::Variable(s.WithOpName("b"), {2, 2}, DT_FLOAT);
+  auto add_1 = ops::Add(s.WithOpName("Add_1"), a, b);
+  auto add_2 = ops::Add(s.WithOpName("Add_2"), add_1, b);
+
+  GraphDef graph;
+  TF_ASSERT_OK(s.ToGraphDef(&graph));
+
+  NodeMap node_map(&graph);
+
+  EXPECT_TRUE(IsNodesDirectlyConnected(node_map, "a", "Add_1", 0));
+  EXPECT_TRUE(IsNodesDirectlyConnected(node_map, "b", "Add_1", 1));
+  EXPECT_FALSE(IsNodesDirectlyConnected(node_map, "a", "Add_2", 0));
+  EXPECT_TRUE(IsNodesDirectlyConnected(node_map, "b", "Add_2", 1));
+}
+
+TEST_F(GrapplerTestTest, CountOpNodes) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
+  auto a = ops::Variable(s.WithOpName("a"), {2, 2}, DT_FLOAT);
+  auto b = ops::Variable(s.WithOpName("b"), {2, 2}, DT_FLOAT);
+  auto c = ops::Variable(s.WithOpName("c"), {2, 2}, DT_FLOAT);
+
+  auto add_ab = ops::Add(s.WithOpName("Add_ab"), a, b);
+  auto add_bc = ops::Add(s.WithOpName("Add_bc"), b, c);
+
+  auto mul_ab = ops::Mul(s.WithOpName("Mul_ab"), a, b);
+  auto mul_bc = ops::Mul(s.WithOpName("Mul_bc"), b, c);
+
+  InputList inputs{
+      Output(add_ab),
+      Output(add_bc),
+      Output(mul_ab),
+      Output(mul_bc),
+  };
+  auto add_all = ops::AddN(s.WithOpName("Add_all"), inputs);
+
+  GraphDef graph;
+  TF_ASSERT_OK(s.ToGraphDef(&graph));
+
+  EXPECT_EQ(2, CountOpNodes(graph, "Add"));
+  EXPECT_EQ(2, CountOpNodes(graph, "Mul"));
+  EXPECT_EQ(1, CountOpNodes(graph, "AddN"));
+  EXPECT_EQ(0, CountOpNodes(graph, "Transpose"));
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
\ No newline at end of file
diff --git a/tensorflow/core/grappler/utils_test.cc b/tensorflow/core/grappler/utils_test.cc
index eabce5b5ee7b037b7bc429abfa86ee8735bdbede..49a1996d25e78d17908b1eae04c9acbeb7e2c788 100644
--- a/tensorflow/core/grappler/utils_test.cc
+++ b/tensorflow/core/grappler/utils_test.cc
@@ -292,6 +292,47 @@ TEST_F(UtilsTest, DedupControlInputs) {
   EXPECT_EQ("gnu", foo.input(1));
 }
 
+TEST_F(UtilsTest, NumNonControlOutputs) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
+  //  *) Round node has a control dependency edge from Add, which
+  //     is not shown in this diagram (ASCII graphics limitation).
+  //
+  //   *Round    [Sqrt, Shape]
+  //      |           |
+  //      |   ctrl    |
+  //     Mul ------> Add
+  //     / \         / \
+  //    x   y       a   b
+  auto x = ops::Variable(s.WithOpName("x"), {1, 2}, DT_FLOAT);
+  auto y = ops::Variable(s.WithOpName("y"), {1, 2}, DT_FLOAT);
+  auto a = ops::Variable(s.WithOpName("a"), {1, 2}, DT_FLOAT);
+  auto b = ops::Variable(s.WithOpName("b"), {1, 2}, DT_FLOAT);
+
+  auto mul = ops::Multiply(s.WithOpName("mul"), x, y);
+  auto add = ops::Add(s.WithOpName("add").WithControlDependencies(mul), a, b);
+
+  auto shape = ops::Shape(s.WithOpName("shape"), add);
+  auto sqrt = ops::Sqrt(s.WithOpName("sqrt"), add);
+
+  auto round =
+      ops::Round(s.WithOpName("round").WithControlDependencies(add), mul);
+
+  GraphDef graph;
+  TF_CHECK_OK(s.ToGraphDef(&graph));
+  NodeMap node_map(&graph);
+
+  const NodeDef* add_node = node_map.GetNode("add");
+  ASSERT_TRUE(add_node != nullptr);
+
+  // [a, b] are the only non-control inputs
+  EXPECT_EQ(2, NumNonControlInputs(*add_node));
+  // [sqrt, shape] are the non-control outputs
+  EXPECT_EQ(2, NumNonControlOutputs(*add_node, node_map));
+  // sqrt is the only data output
+  EXPECT_EQ(1, NumNonControlDataOutputs(*add_node, node_map));
+}
+
 TEST_F(UtilsTest, DeleteNodes) {}
 
 }  // namespace
diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index 3426cf6e40371905b0e2dea8ba7afc840356faea..8d235e79c095e9c74caf3e66fb456ad09f95a714 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -879,7 +879,7 @@ tf_kernel_library(
     hdrs = ["transpose_op.h"],
     deps = ARRAY_DEPS + if_mkl([
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
+        "@mkl_dnn",
     ]),
 )
 
@@ -920,6 +920,22 @@ tf_kernel_library(
     ]) + ARRAY_DEPS,
 )
 
+tf_kernel_library(
+    name = "cudnn_rnn_kernels",
+    srcs = ["cudnn_rnn_ops.cc"],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/core:cudnn_rnn_ops_op_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:stream_executor",
+        "//tensorflow/core/kernels:bounds_check_lib",
+        "//third_party/eigen3",
+        "@farmhash_archive//:farmhash",
+    ],
+)
+
 tf_cc_test(
     name = "batch_norm_op_test",
     size = "small",
@@ -1666,6 +1682,43 @@ tf_kernel_library(
     ],
 )
 
+tf_kernel_library(
+    name = "scoped_allocator_ops",
+    prefix = "scoped_allocator_ops",
+    deps = [
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:scoped_allocator_ops_op_lib",
+    ],
+)
+
+tf_cuda_cc_test(
+    name = "scoped_allocator_ops_test",
+    srcs = ["scoped_allocator_ops_test.cc"],
+    linkstatic = tf_kernel_tests_linkstatic(),  # Required for benchmarking
+    deps = [
+        ":cwise_op",
+        ":dense_update_ops",
+        ":ops_testutil",
+        ":ops_util",
+        ":scoped_allocator_ops",
+        ":variable_ops",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:math_ops_op_lib",
+        "//tensorflow/core:proto_text",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+    ],
+)
+
 tf_kernel_library(
     name = "session_ops",
     prefix = "session_ops",
@@ -1951,6 +2004,7 @@ tf_kernel_library(
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
     ],
 )
 
@@ -2810,7 +2864,7 @@ tf_kernel_library(
         "//conditions:default": [],
     }) + if_mkl([
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
+        "@mkl_dnn",
     ]) + if_cuda([
         "//tensorflow/core/platform/default/build_config:cublas_plugin",
     ]),
@@ -4155,6 +4209,7 @@ cc_library(
         ":as_string_op",
         ":base64_ops",
         ":reduce_join_op",
+        ":regex_replace_op",
         ":string_join_op",
         ":string_split_op",
         ":string_to_hash_bucket_op",
@@ -4189,6 +4244,12 @@ tf_kernel_library(
     deps = STRING_DEPS,
 )
 
+tf_kernel_library(
+    name = "regex_replace_op",
+    prefix = "regex_replace_op",
+    deps = STRING_DEPS + ["@com_googlesource_code_re2//:re2"],
+)
+
 tf_kernel_library(
     name = "string_split_op",
     prefix = "string_split_op",
@@ -5034,6 +5095,7 @@ filegroup(
             # not used on Android. Those ops also do not compile if included,
             # unless we add the additional deps they need.
             "tf_record_reader_op.*",
+            "cudnn_rnn_ops.*",
             "lmdb_reader_op.*",
             "string_to_hash_bucket_op.*",
             "sdca_ops.*",
@@ -5063,6 +5125,7 @@ filegroup(
             "scatter_nd_op*",
             "mutex_ops.*",
             "batch_kernels.*",
+            "regex_replace_op.cc",
         ],
     ),
     visibility = ["//visibility:public"],
@@ -5128,7 +5191,6 @@ tf_kernel_library(
     srcs = [
         "dequantize_op.cc",
         "meta_support.cc",
-        "quantization_utils.cc",
         "quantize_down_and_shrink_range.cc",
         "quantize_op.cc",
         "quantized_activation_ops.cc",
@@ -5149,7 +5211,6 @@ tf_kernel_library(
     ],
     hdrs = [
         "meta_support.h",
-        "quantization_utils.h",
         "reference_gemm.h",
     ],
     deps = [
@@ -5160,6 +5221,7 @@ tf_kernel_library(
         ":image_resizer_state",
         ":ops_util",
         ":pooling_ops",
+        ":quantization_utils",
         "//tensorflow/core:array_ops_op_lib",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework",
@@ -5223,6 +5285,7 @@ tf_cc_test(
     name = "quantization_utils_test",
     srcs = ["quantization_utils_test.cc"],
     deps = [
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/core:array_ops_op_lib",
         "//tensorflow/core:core_cpu",
@@ -5285,6 +5348,7 @@ tf_cc_test(
     deps = [
         ":ops_testutil",
         ":ops_util",
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/core:array_ops_op_lib",
         "//tensorflow/core:framework",
@@ -5346,6 +5410,7 @@ tf_cc_test(
         ":math",
         ":ops_testutil",
         ":ops_util",
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/cc:cc_ops",
         "//tensorflow/cc:client_session",
@@ -5368,6 +5433,7 @@ tf_cc_test(
     deps = [
         ":ops_testutil",
         ":ops_util",
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/cc:cc_ops",
         "//tensorflow/cc:client_session",
@@ -5432,6 +5498,7 @@ tf_cc_test(
     deps = [
         ":ops_testutil",
         ":ops_util",
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/core:array_ops_op_lib",
         "//tensorflow/core:framework",
@@ -5452,6 +5519,7 @@ tf_cc_test(
     deps = [
         ":ops_testutil",
         ":ops_util",
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/core:array_ops_op_lib",
         "//tensorflow/core:framework",
@@ -5491,6 +5559,7 @@ tf_cc_test(
     deps = [
         ":ops_testutil",
         ":ops_util",
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/core:array_ops_op_lib",
         "//tensorflow/core:framework",
@@ -5547,6 +5616,7 @@ tf_cc_test(
         ":math",
         ":ops_testutil",
         ":ops_util",
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/cc:cc_ops",
         "//tensorflow/cc:client_session",
@@ -5569,6 +5639,7 @@ tf_cc_test(
     deps = [
         ":ops_testutil",
         ":ops_util",
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/core:array_ops_op_lib",
         "//tensorflow/core:framework",
@@ -5605,6 +5676,7 @@ tf_cc_test(
     deps = [
         ":ops_testutil",
         ":ops_util",
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/core:array_ops_op_lib",
         "//tensorflow/core:core_cpu",
@@ -5626,6 +5698,7 @@ tf_cc_test(
     deps = [
         ":batch_norm_op",
         ":ops_testutil",
+        ":quantization_utils",
         ":quantized_ops",
         "//tensorflow/core:array_ops_op_lib",
         "//tensorflow/core:core_cpu_internal",
@@ -5706,6 +5779,16 @@ tf_kernel_library(
     ],
 )
 
+cc_library(
+    name = "quantization_utils",
+    srcs = ["quantization_utils.cc"],
+    hdrs = ["quantization_utils.h"],
+    deps = [
+        "//tensorflow/core:framework",
+        "@gemmlowp",
+    ],
+)
+
 cc_library(
     name = "remote_fused_graph_execute_utils",
     srcs = [
@@ -5842,10 +5925,9 @@ tf_mkl_kernel_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:nn_ops_op_lib",
-    ] + if_mkl([
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -5859,10 +5941,9 @@ tf_mkl_kernel_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:nn_ops_op_lib",
-    ] + if_mkl([
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -5890,6 +5971,7 @@ tf_mkl_kernel_library(
     ],
     hdrs = ["mkl_pooling_ops_common.h"],
     deps = [
+        ":bounds_check",
         ":ops_util",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework",
@@ -5911,10 +5993,10 @@ tf_mkl_kernel_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:nn_ops_op_lib",
-    ] + if_mkl([
+        "//third_party/eigen3",
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -5928,19 +6010,18 @@ tf_mkl_kernel_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:nn_ops_op_lib",
-    ] + if_mkl([
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
     name = "mkl_fused_batch_norm_op",
     srcs = ["mkl_fused_batch_norm_op.cc"],
-    deps = NN_DEPS + if_mkl([
+    deps = NN_DEPS + [
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -5954,10 +6035,10 @@ tf_mkl_kernel_library(
 tf_mkl_kernel_library(
     name = "mkl_concat_op",
     prefix = "mkl_concat_op",
-    deps = ARRAY_DEPS + if_mkl([
+    deps = ARRAY_DEPS + [
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -5971,19 +6052,19 @@ tf_mkl_kernel_library(
 tf_mkl_kernel_library(
     name = "mkl_identity_op",
     prefix = "mkl_identity_op",
-    deps = ARRAY_DEPS + if_mkl([
+    deps = ARRAY_DEPS + [
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
     name = "mkl_lrn_op",
     prefix = "mkl_lrn_op",
-    deps = NN_DEPS + if_mkl([
+    deps = NN_DEPS + [
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -6081,7 +6162,6 @@ cc_library(
     srcs = [
         "cwise_ops_common.cc",
         "meta_support.cc",
-        "quantization_utils.cc",
     ],
     hdrs = [
         "cwise_ops.h",
@@ -6090,10 +6170,10 @@ cc_library(
         "cwise_ops_gpu_gradients.cu.h",
         "cwise_ops_gradients.h",
         "meta_support.h",
-        "quantization_utils.h",
     ],
     deps = [
         ":bounds_check",
+        ":quantization_utils",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//third_party/eigen3",
diff --git a/tensorflow/core/kernels/avgpooling_op.cc b/tensorflow/core/kernels/avgpooling_op.cc
index ec9cbc2a9b5d4c1ac6d91913fc015e139fa2a068..c581d1451f0e6740cbcf526e0fd8636ea925eb69 100644
--- a/tensorflow/core/kernels/avgpooling_op.cc
+++ b/tensorflow/core/kernels/avgpooling_op.cc
@@ -102,6 +102,9 @@ class AvgPoolingOp : public UnaryOp<T> {
   TensorFormat data_format_;
 };
 
+REGISTER_KERNEL_BUILDER(
+    Name("AvgPool").Device(DEVICE_CPU).TypeConstraint<double>("T"),
+    AvgPoolingOp<CPUDevice, double>);
 REGISTER_KERNEL_BUILDER(
     Name("AvgPool").Device(DEVICE_CPU).TypeConstraint<float>("T"),
     AvgPoolingOp<CPUDevice, float>);
@@ -189,6 +192,7 @@ namespace functor {
 
 DECLARE_GPU_SPEC(Eigen::half);
 DECLARE_GPU_SPEC(float);
+DECLARE_GPU_SPEC(double);
 #undef DECLARE_GPU_SPEC
 }  // namespace functor
 
@@ -198,6 +202,9 @@ REGISTER_KERNEL_BUILDER(
 REGISTER_KERNEL_BUILDER(
     Name("AvgPool").Device(DEVICE_GPU).TypeConstraint<float>("T"),
     AvgPoolingOp<GPUDevice, float>);
+REGISTER_KERNEL_BUILDER(
+    Name("AvgPool").Device(DEVICE_GPU).TypeConstraint<double>("T"),
+    AvgPoolingOp<GPUDevice, double>);
 #endif  // GOOGLE_CUDA
 
 // The operation to compute AvgPool gradients.
@@ -423,6 +430,12 @@ class AvgPoolingGradOp<GPUDevice, T> : public OpKernel {
   TensorFormat data_format_;
 };
 
+REGISTER_KERNEL_BUILDER(Name("AvgPoolGrad")
+                            .Device(DEVICE_GPU)
+                            .TypeConstraint<double>("T")
+                            .HostMemory("orig_input_shape")
+                            .Label("cudnn"),
+                        AvgPoolingGradOp<GPUDevice, double>);
 REGISTER_KERNEL_BUILDER(Name("AvgPoolGrad")
                             .Device(DEVICE_GPU)
                             .TypeConstraint<float>("T")
@@ -553,6 +566,11 @@ REGISTER_KERNEL_BUILDER(Name("AvgPoolGrad")
                             .TypeConstraint<float>("T")
                             .HostMemory("orig_input_shape"),
                         AvgPoolingGradOpCustomGPUKernel<float>);
+REGISTER_KERNEL_BUILDER(Name("AvgPoolGrad")
+                            .Device(DEVICE_GPU)
+                            .TypeConstraint<double>("T")
+                            .HostMemory("orig_input_shape"),
+                        AvgPoolingGradOpCustomGPUKernel<double>);
 REGISTER_KERNEL_BUILDER(Name("AvgPoolGrad")
                             .Device(DEVICE_GPU)
                             .TypeConstraint<Eigen::half>("T")
diff --git a/tensorflow/core/kernels/avgpooling_op_gpu.cu.cc b/tensorflow/core/kernels/avgpooling_op_gpu.cu.cc
index 6537b42f1ed8856a5f701023eb5fc55ded278ec8..35511d5c313fb4b3794d00bd685ec4249580daa3 100644
--- a/tensorflow/core/kernels/avgpooling_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/avgpooling_op_gpu.cu.cc
@@ -35,6 +35,7 @@ typedef Eigen::GpuDevice GPUDevice;
 
 DEFINE_GPU_KERNELS(Eigen::half)
 DEFINE_GPU_KERNELS(float)
+DEFINE_GPU_KERNELS(double)
 
 #undef DEFINE_GPU_KERNELS
 
@@ -99,6 +100,12 @@ bool RunAvePoolBackwardNHWC(const T* const top_diff, const int num,
   return d.ok();
 }
 
+template bool RunAvePoolBackwardNHWC(
+    const double* const top_diff, const int num, const int height,
+    const int width, const int channels, const int pooled_height,
+    const int pooled_width, const int kernel_h, const int kernel_w,
+    const int stride_h, const int stride_w, const int pad_t, const int pad_l,
+    double* const bottom_diff, const GPUDevice& d);
 template bool RunAvePoolBackwardNHWC(
     const float* const top_diff, const int num, const int height,
     const int width, const int channels, const int pooled_height,
diff --git a/tensorflow/core/kernels/batch_kernels.cc b/tensorflow/core/kernels/batch_kernels.cc
index 546e51be53cee1833e8e1d4a15ea9b5be8a31506..8c99ded0a89e8065f4a7112db3b14eb2b27010c1 100644
--- a/tensorflow/core/kernels/batch_kernels.cc
+++ b/tensorflow/core/kernels/batch_kernels.cc
@@ -146,7 +146,7 @@ Status SplitCPU(OpKernelContext* context, const Tensor& input,
     suffix_dim_size *= input.shape().dim_size(i);
   }
   auto input_reshaped =
-      input.shaped<T, 3>({1, input.shape().dim_size(0), suffix_dim_size});
+      input.shaped<T, 2>({input.shape().dim_size(0), suffix_dim_size});
 
   int64 position = 0;
   for (const int64 size : sizes) {
@@ -155,13 +155,13 @@ Status SplitCPU(OpKernelContext* context, const Tensor& input,
     Tensor output;
     TF_RETURN_IF_ERROR(
         context->allocate_temp(input.dtype(), output_shape, &output));
-    auto output_shaped = output.shaped<T, 3>({1, size, suffix_dim_size});
+    auto output_shaped = output.shaped<T, 2>({size, suffix_dim_size});
 
-    Eigen::DSizes<Eigen::DenseIndex, 3> slice_indices{0, position, 0};
-    Eigen::DSizes<Eigen::DenseIndex, 3> slice_sizes{1, size, suffix_dim_size};
-    functor::Split<CPUDevice, T>()(context->eigen_device<CPUDevice>(),
-                                   output_shaped, input_reshaped, slice_indices,
-                                   slice_sizes);
+    Eigen::DSizes<Eigen::DenseIndex, 2> slice_indices{position, 0};
+    Eigen::DSizes<Eigen::DenseIndex, 2> slice_sizes{size, suffix_dim_size};
+    functor::Split<CPUDevice, T, 2>()(context->eigen_device<CPUDevice>(),
+                                      output_shaped, input_reshaped,
+                                      slice_indices, slice_sizes);
 
     outputs->emplace_back(output);
 
diff --git a/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h b/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h
index 25c5f9cf424fdb286922548ea7ab0a35157a3502..339d792302dd96e7a157c1df893d3ea62080c51a 100644
--- a/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h
+++ b/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h
@@ -19,7 +19,6 @@ limitations under the License.
 #include <algorithm>
 #include <functional>
 #include <memory>
-#include <queue>
 #include <random>
 #include <unordered_map>
 #include <vector>
@@ -44,49 +43,31 @@ template <typename TaskType>
 class ASBSQueue;
 }  // namespace internal
 
-// EXPERIMENTAL: API MAY BE SUBJECTED TO SUDDEN CHANGES.
-//
 // Shared batch scheduler designed to minimize latency. The scheduler keeps
 // track of a number of queues (one per model or model version) which are
 // continuously enqueuing requests. The scheduler groups the requests into
 // batches which it periodically sends off for processing (see
-// shared_batch_scheduler.h for more details). The AdaptiveSharedBatchScheduler
-// prioritizes batches by age (i.e. the batch's oldest request) irrespective of
-// queue or batch size.
-//
-// The scheduling decision currently exists in two flavors, controlled by the
-// option use_in_flight_batches_implementation. It is expected that setting this
-// option to true will give universally better results; after a period of
-// testing to confirm, the old implementation will be removed.
+// shared_batch_scheduler.h for more details). AdaptiveSharedBatchScheduler
+// (ASBS) prioritizes batches primarily by age (i.e. the batch's oldest request)
+// along with a configurable preference for scheduling larger batches first.
 //
-// If use_in_flight_batches_implementation is set to true, the scheduler
-// limits the number of batches which can be processed concurrently.  If a new
-// batch is created, and the number of in flight batches is below the limit,
-// the next (i.e. oldest) batch is immediately scheduled.  Similarly, when a
-// batch finishes processing, the limit is rechecked, and another batch may be
-// scheduled.  To avoid the need to carefully tune the limit for workload,
-// model type, platform, etc, it is dynamically adjusted in order to provide the
-// lowest latency.
 //
-// If use_in_flight_batches_implementation is set to false, the scheduler will
-// process the oldest batch at an adjustable rate, regardless of batch size.
-// The user can provide feedback to help set this rate to achieve some goal
-// (i.e. minimize overall latency, limit cpu usage, etc). The rate (or rather,
-// the corresponding period) is adjusted each time a batch is processed, using
-// an exponentially weighted moving average to smooth noisy feedback:
-// ewma_feedback = ((N - 1) * ewma_feedback + feedback()) / N
-// period *= (1 + K * emwa_feedback)
+// ASBS tries to keep the system busy by maintaining an adjustable number of
+// concurrently processed batches.  If a new batch is created, and the number of
+// in flight batches is below the target, the next (i.e. oldest) batch is
+// immediately scheduled.  Similarly, when a batch finishes processing, the
+// target is rechecked, and another batch may be scheduled.  To avoid the need
+// to carefully tune the target for workload, model type, platform, etc, it is
+// dynamically adjusted in order to provide the lowest average latency.
 //
 // Some potential use cases:
 // Hardware Accelerators (GPUs & TPUs) - If some phase of batch processing
 //   involves serial processing by a device, from a latency perspective it is
 //   desirable to keep the device evenly loaded, avoiding the need to wait for
 //   the device to process prior batches.
-//   feedback = num_pending_on_device() - desired_pending.
 // CPU utilization - If the batch processing is cpu dominated, you can reap
 //   latency gains when underutilized by increasing the processing rate, but
 //   back the rate off when the load increases to avoid overload.
-//   feedback = cpu_rate() - desired_cpu_rate.
 
 template <typename TaskType>
 class AdaptiveSharedBatchScheduler
@@ -101,13 +82,24 @@ class AdaptiveSharedBatchScheduler
   struct Options {
     // The name to use for the pool of batch threads.
     string thread_pool_name = {"batch_threads"};
-    // Number of batch processing threads; equivalently the maximum number of
-    // concurrently running batches.
+    // Number of batch processing threads - the maximum value of
+    // in_flight_batches_limit_.  It is recommended that this value be set by
+    // running the system under load, observing the learned value for
+    // in_flight_batches_limit_, and setting this maximum to ~ 2x the value.
+    // Under low load, in_flight_batches_limit_ has no substantial effect on
+    // latency and therefore undergoes a random walk.  Unreasonably large values
+    // for num_batch_threads allow for large in_flight_batches_limit_, which
+    // will harm latency for some time once load increases again.
     int64 num_batch_threads = port::NumSchedulableCPUs();
+    // Although batch selection is primarily based on age, this parameter
+    // specifies a preference for larger batches.  A full batch will be
+    // scheduled before an older, nearly empty batch as long as the age gap is
+    // less than full_batch_scheduling_boost_micros.  The optimal value for this
+    // parameter should be on the order of the batch processing latency, but
+    // must be chosen carefully, as too large a value will harm tail latency.
+    int64 full_batch_scheduling_boost_micros = 0;
     // The environment to use (typically only overridden by test code).
     Env* env = Env::Default();
-    // Which implementation to use (described in class comments above).
-    bool use_in_flight_batches_implementation = false;
     // Initial limit for number of batches being concurrently processed.
     // Non-integer values correspond to probabilistic limits - i.e. a value of
     // 3.2 results in an actual cap of 3 80% of the time, and 4 20% of the time.
@@ -116,28 +108,6 @@ class AdaptiveSharedBatchScheduler
     // numbers will give less noisy latency measurements, but will be less
     // responsive to changes in workload.
     int64 batches_to_average_over = 1000;
-
-    // TODO(kte): remove the rate based implementation and corresponding options
-    // below once testing confirms the superiority of the in flight batches
-    // implementation.
-    // Initial batch scheduling period in microseconds. Will be altered for
-    // non-zero rate_feedback.
-    double initial_scheduling_period_micros = 500;
-    // Minimum batch scheduling period in microseconds. Recommend setting this
-    // value greater than 0, otherwise it may take a while to recover from a
-    // sustained time of negative scheduling_period_feedback (which may occur
-    // under low load).
-    double min_scheduling_period_micros = 100;
-    // Maximum batch scheduling period in microseconds.
-    double max_scheduling_period_micros = 10000;
-    // Feedback function used to modify the scheduling period each time a batch
-    // is scheduled.  Should return values roughly O(1), with positive values
-    // resulting in an increased period.
-    std::function<double()> scheduling_period_feedback{[] { return 0.; }};
-    // To handle potentially noisy scheduling_period_feedback, the period is
-    // adjusted using an exponentially weighted moving average over the previous
-    // feedback_smoothing_batches batches.  Must be greater than 0.
-    int64 feedback_smoothing_batches = 10;
   };
 
   // Ownership is shared between the caller of Create() and any queues created
@@ -171,17 +141,11 @@ class AdaptiveSharedBatchScheduler
 
   explicit AdaptiveSharedBatchScheduler(const Options& options);
 
-  // Batch scheduling function which runs every scheduling_period_ microseconds.
-  // Only used when options_.use_in_flight_batches_implementation == false.
-  void ProcessOneBatch();
-
   // Tracks processing latency and adjusts in_flight_batches_limit to minimize.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   void CallbackWrapper(const internal::ASBSBatch<TaskType>* batch,
                        BatchProcessor callback);
 
   // Schedules batch if in_flight_batches_limit_ is not met.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   void MaybeScheduleNextBatch() EXCLUSIVE_LOCKS_REQUIRED(mu_);
 
   // Notifies scheduler of non-empty batch which is eligible for processing.
@@ -194,17 +158,9 @@ class AdaptiveSharedBatchScheduler
 
   const Options options_;
 
-  struct BatchCompare {
-    bool operator()(const internal::ASBSBatch<TaskType>* a,
-                    const internal::ASBSBatch<TaskType>* b);
-  };
-
   // Collection of batches added by AddBatch, ordered by age. Owned by scheduler
   // until they are released for processing.
-  std::priority_queue<const internal::ASBSBatch<TaskType>*,
-                      std::vector<const internal::ASBSBatch<TaskType>*>,
-                      BatchCompare>
-      batches_ GUARDED_BY(mu_);
+  std::vector<const internal::ASBSBatch<TaskType>*> batches_ GUARDED_BY(mu_);
 
   // Unowned queues and callbacks added by AddQueue.
   std::unordered_map<const internal::ASBSQueue<TaskType>*, BatchProcessor>
@@ -212,41 +168,22 @@ class AdaptiveSharedBatchScheduler
 
   mutex mu_;
 
-  // Responsible for running ProcessOneBatch. PeriodicFunction was used in order
-  // to check for deletion so that the thread can be shut down.
-  // Only used when options_.use_in_flight_batches_implementation == false.
-  std::unique_ptr<PeriodicFunction> scheduling_thread_;
-
   // Responsible for running the batch processing callbacks.
   std::unique_ptr<thread::ThreadPool> batch_thread_pool_;
 
-  // Time interval in microseconds between successive ProcessOneBatch calls.
-  // Only used when options_.use_in_flight_batches_implementation == false.
-  double scheduling_period_;
-
-  // Exponentially weighted moving average of
-  // options_.scheduling_period_feedback() evaluated in each ProcessOneBatch
-  // call.
-  // Only used when options_.use_in_flight_batches_implementation == false.
-  double ewma_feedback_ = 0;
-
   // Limit on number of batches which can be concurrently processed.
   // Non-integer values correspond to probabilistic limits - i.e. a value of 3.2
   // results in an actual cap of 3 80% of the time, and 4 20% of the time.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   double in_flight_batches_limit_ GUARDED_BY(mu_);
 
   // Number of batches currently being processed.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   int64 in_flight_batches_ GUARDED_BY(mu_) = 0;
 
   // RNG engine and distribution.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   std::default_random_engine rand_engine_;
   std::uniform_real_distribution<double> rand_double_;
 
   // Fields controlling the dynamic adjustment of in_flight_batches_limit_.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   // Number of batches since the last in_flight_batches_limit_ adjustment.
   int64 batch_count_ GUARDED_BY(mu_) = 0;
   // Sum of processing latency for batches counted by batch_count_.
@@ -348,31 +285,10 @@ Status AdaptiveSharedBatchScheduler<TaskType>::Create(
     return errors::InvalidArgument("num_batch_threads must be positive; was ",
                                    options.num_batch_threads);
   }
-  if (options.min_scheduling_period_micros < 0) {
+  if (options.full_batch_scheduling_boost_micros < 0) {
     return errors::InvalidArgument(
-        "min_scheduling_period_micros must be >= 0; was ",
-        options.min_scheduling_period_micros);
-  }
-  if (options.min_scheduling_period_micros >
-      options.initial_scheduling_period_micros) {
-    return errors::InvalidArgument(
-        "initial_scheduling_period_micros (",
-        options.initial_scheduling_period_micros,
-        ") must be >= min_scheduling_period_micros (",
-        options.min_scheduling_period_micros, ")");
-  }
-  if (options.initial_scheduling_period_micros >
-      options.max_scheduling_period_micros) {
-    return errors::InvalidArgument(
-        "initial_scheduling_period_micros (",
-        options.initial_scheduling_period_micros,
-        ") must be <= max_scheduling_period_micros (",
-        options.max_scheduling_period_micros, ")");
-  }
-  if (options.feedback_smoothing_batches < 1) {
-    return errors::InvalidArgument(
-        "feedback_smoothing_batches must be positive; was ",
-        options.feedback_smoothing_batches);
+        "full_batch_scheduling_boost_micros can't be negative; was ",
+        options.full_batch_scheduling_boost_micros);
   }
   if (options.initial_in_flight_batches_limit > options.num_batch_threads) {
     return errors::InvalidArgument(
@@ -401,20 +317,12 @@ template <typename TaskType>
 AdaptiveSharedBatchScheduler<TaskType>::AdaptiveSharedBatchScheduler(
     const Options& options)
     : options_(options),
-      scheduling_period_(options.initial_scheduling_period_micros),
       in_flight_batches_limit_(options.initial_in_flight_batches_limit),
       rand_double_(0.0, 1.0) {
   std::random_device device;
   rand_engine_.seed(device());
-  PeriodicFunction::Options opts;
-  opts.thread_name_prefix = "scheduling_thread";
-  opts.env = GetEnv();
   batch_thread_pool_.reset(new thread::ThreadPool(
       GetEnv(), options.thread_pool_name, options.num_batch_threads));
-  if (!options.use_in_flight_batches_implementation) {
-    scheduling_thread_.reset(
-        new PeriodicFunction([this] { ProcessOneBatch(); }, 0, opts));
-  }
 }
 
 template <typename TaskType>
@@ -442,10 +350,8 @@ template <typename TaskType>
 void AdaptiveSharedBatchScheduler<TaskType>::AddBatch(
     const internal::ASBSBatch<TaskType>* batch) {
   mutex_lock l(mu_);
-  batches_.push(batch);
-  if (options_.use_in_flight_batches_implementation) {
-    MaybeScheduleNextBatch();
-  }
+  batches_.push_back(batch);
+  MaybeScheduleNextBatch();
 }
 
 template <typename TaskType>
@@ -462,10 +368,26 @@ void AdaptiveSharedBatchScheduler<TaskType>::MaybeScheduleNextBatch() {
   // Non-integer limit handled probabilistially.
   if (in_flight_batches_limit_ - in_flight_batches_ < 1 &&
       rand_double_(rand_engine_) >
-          (in_flight_batches_limit_ - in_flight_batches_))
+          in_flight_batches_limit_ - in_flight_batches_) {
     return;
-  const internal::ASBSBatch<TaskType>* batch = batches_.top();
-  batches_.pop();
+  }
+  auto best_it = batches_.begin();
+  double best_score =
+      (*best_it)->creation_time_micros() -
+      options_.full_batch_scheduling_boost_micros * (*best_it)->size() /
+          static_cast<double>((*best_it)->queue()->max_task_size());
+  for (auto it = batches_.begin() + 1; it != batches_.end(); ++it) {
+    const double score =
+        (*it)->creation_time_micros() -
+        options_.full_batch_scheduling_boost_micros * (*it)->size() /
+            static_cast<double>((*it)->queue()->max_task_size());
+    if (score < best_score) {
+      best_score = score;
+      best_it = it;
+    }
+  }
+  const internal::ASBSBatch<TaskType>* batch = *best_it;
+  batches_.erase(best_it);
   // Queue may destroy itself after ReleaseBatch is called.
   batch->queue()->ReleaseBatch(batch);
   batch_thread_pool_->Schedule(
@@ -523,51 +445,6 @@ void AdaptiveSharedBatchScheduler<TaskType>::CallbackWrapper(
   MaybeScheduleNextBatch();
 }
 
-template <typename TaskType>
-void AdaptiveSharedBatchScheduler<TaskType>::ProcessOneBatch() {
-  static const double kFeedbackMultiplier = .001;
-  const internal::ASBSBatch<TaskType>* batch = nullptr;
-  BatchProcessor callback;
-  const int64 start_time_micros = GetEnv()->NowMicros();
-  {
-    mutex_lock l(mu_);
-    if (!batches_.empty()) {
-      batch = batches_.top();
-      batches_.pop();
-      callback = queues_and_callbacks_[batch->queue()];
-    }
-  }
-  if (batch != nullptr) {
-    double feedback = options_.scheduling_period_feedback();
-    const int64 N = options_.feedback_smoothing_batches;
-    ewma_feedback_ = ((N - 1) * ewma_feedback_ + feedback) / N;
-    scheduling_period_ *= (1 + kFeedbackMultiplier * ewma_feedback_);
-    if (scheduling_period_ < options_.min_scheduling_period_micros) {
-      scheduling_period_ = options_.min_scheduling_period_micros;
-    } else if (scheduling_period_ > options_.max_scheduling_period_micros) {
-      scheduling_period_ = options_.max_scheduling_period_micros;
-    }
-    // Queue may destroy itself after ReleaseBatch is called.
-    batch->queue()->ReleaseBatch(batch);
-    batch_thread_pool_->Schedule([callback, batch] {
-      callback(std::unique_ptr<Batch<TaskType>>(
-          const_cast<internal::ASBSBatch<TaskType>*>(batch)));
-    });
-  }
-  const int64 sleep_time =
-      scheduling_period_ - (GetEnv()->NowMicros() - start_time_micros);
-  if (sleep_time > 0) {
-    GetEnv()->SleepForMicroseconds(sleep_time);
-  }
-}
-
-template <typename TaskType>
-bool AdaptiveSharedBatchScheduler<TaskType>::BatchCompare::operator()(
-    const internal::ASBSBatch<TaskType>* a,
-    const internal::ASBSBatch<TaskType>* b) {
-  return a->creation_time_micros() > b->creation_time_micros();
-}
-
 // ---------------- ASBSQueue ----------------
 
 namespace internal {
diff --git a/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler_test.cc b/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler_test.cc
index 8ae8ca02eca20b5d1184e6e588f013d59d10464a..1be0c1f5c65cad3245a9392e2a2db61cfb2dd904 100644
--- a/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler_test.cc
+++ b/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler_test.cc
@@ -64,59 +64,6 @@ std::unique_ptr<Thread> CreateFakeClockAdvancerThread(
       }));
 }
 
-TEST(AdaptiveSharedBatchSchedulerTest, Basic) {
-  for (const bool delete_scheduler_early : {false, true}) {
-    for (const bool delete_queue_1_early : {false, true}) {
-      int queue_0_tasks = 0;
-      auto queue_0_callback =
-          [&queue_0_tasks](std::unique_ptr<Batch<FakeTask>> batch) {
-            ASSERT_TRUE(batch->IsClosed());
-            EXPECT_GT(batch->num_tasks(), 0);
-            for (int i = 0; i < batch->num_tasks(); i++) {
-              queue_0_tasks += batch->task(i).size();
-            }
-          };
-      int queue_1_tasks = 0;
-      auto queue_1_callback =
-          [&queue_1_tasks](std::unique_ptr<Batch<FakeTask>> batch) {
-            ASSERT_TRUE(batch->IsClosed());
-            EXPECT_GT(batch->num_tasks(), 0);
-            for (int i = 0; i < batch->num_tasks(); i++) {
-              queue_1_tasks += batch->task(i).size();
-            }
-          };
-      {
-        std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
-        TF_ASSERT_OK(
-            AdaptiveSharedBatchScheduler<FakeTask>::Create({}, &scheduler));
-
-        // Create two queues.
-        std::unique_ptr<BatchScheduler<FakeTask>> queue_0;
-        TF_ASSERT_OK(scheduler->AddQueue({}, queue_0_callback, &queue_0));
-        std::unique_ptr<BatchScheduler<FakeTask>> queue_1;
-        TF_ASSERT_OK(scheduler->AddQueue({}, queue_1_callback, &queue_1));
-
-        if (delete_scheduler_early) {
-          // Delete our copy of the scheduler. The queues should keep it alive
-          // under the covers.
-          scheduler = nullptr;
-        }
-        // Submit tasks to the two queues, and (optionally) remove the queues.
-        TF_ASSERT_OK(ScheduleTask(1, queue_0.get()));
-        TF_ASSERT_OK(ScheduleTask(2, queue_1.get()));
-        TF_ASSERT_OK(ScheduleTask(3, queue_0.get()));
-        TF_ASSERT_OK(ScheduleTask(4, queue_1.get()));
-        if (delete_queue_1_early) {
-          queue_1 = nullptr;
-        }
-        TF_ASSERT_OK(ScheduleTask(5, queue_0.get()));
-      }
-      EXPECT_EQ(queue_0_tasks, 9);
-      EXPECT_EQ(queue_1_tasks, 6);
-    }
-  }
-}
-
 TEST(AdaptiveSharedBatchSchedulerTest, BadOptions) {
   using Scheduler = AdaptiveSharedBatchScheduler<FakeTask>;
   std::shared_ptr<Scheduler> scheduler;
@@ -124,24 +71,6 @@ TEST(AdaptiveSharedBatchSchedulerTest, BadOptions) {
   options.num_batch_threads = 0;
   EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
   options = Scheduler::Options();
-  options.min_scheduling_period_micros = 50;
-  options.max_scheduling_period_micros = 100;
-  options.initial_scheduling_period_micros = 1;
-  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
-  options = Scheduler::Options();
-  options.min_scheduling_period_micros = 50;
-  options.max_scheduling_period_micros = 100;
-  options.initial_scheduling_period_micros = 1000;
-  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
-  options = Scheduler::Options();
-  options.min_scheduling_period_micros = 100;
-  options.max_scheduling_period_micros = 50;
-  options.initial_scheduling_period_micros = 75;
-  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
-  options = Scheduler::Options();
-  options.feedback_smoothing_batches = 0;
-  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
-  options = Scheduler::Options();
   options.initial_in_flight_batches_limit = 0.5;
   EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
   options = Scheduler::Options();
@@ -153,301 +82,8 @@ TEST(AdaptiveSharedBatchSchedulerTest, BadOptions) {
   EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
 }
 
-TEST(AdaptiveSharedBatchSchedulerTest, ObeysQueueOptions) {
-  test_util::FakeClockEnv env(Env::Default());
-  Notification start_teardown, stop_teardown;
-  std::unique_ptr<Thread> teardown_thread =
-      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
-  {
-    AdaptiveSharedBatchScheduler<FakeTask>::Options options;
-    options.initial_scheduling_period_micros = 1000;
-    options.env = &env;
-    std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
-    TF_ASSERT_OK(
-        AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
-    std::unique_ptr<BatchScheduler<FakeTask>> queue_0;
-    std::unique_ptr<BatchScheduler<FakeTask>> queue_1;
-    int queue_0_tasks = 0;
-    int queue_1_tasks = 0;
-    auto queue_0_callback = [&queue_0_tasks,
-                             &env](std::unique_ptr<Batch<FakeTask>> batch) {
-      ASSERT_TRUE(batch->IsClosed());
-      EXPECT_GT(batch->num_tasks(), 0);
-      for (int i = 0; i < batch->num_tasks(); i++) {
-        queue_0_tasks += batch->task(i).size();
-      }
-      env.SleepForMicroseconds(1);
-    };
-    auto queue_1_callback = [&queue_1_tasks,
-                             &env](std::unique_ptr<Batch<FakeTask>> batch) {
-      ASSERT_TRUE(batch->IsClosed());
-      EXPECT_GT(batch->num_tasks(), 0);
-      for (int i = 0; i < batch->num_tasks(); i++) {
-        queue_1_tasks += batch->task(i).size();
-      }
-      env.SleepForMicroseconds(1);
-    };
-    AdaptiveSharedBatchScheduler<FakeTask>::QueueOptions queue_options;
-    queue_options.max_batch_size = 10;
-    queue_options.max_enqueued_batches = 0;
-    // Queue must have max_enqueued_batchs > 1.
-    EXPECT_FALSE(
-        scheduler->AddQueue(queue_options, queue_0_callback, &queue_0).ok());
-    queue_options.max_enqueued_batches = 2;
-    TF_ASSERT_OK(
-        scheduler->AddQueue(queue_options, queue_0_callback, &queue_0));
-    EXPECT_EQ(10, queue_0->max_task_size());
-    queue_options.max_batch_size = 0;
-    // Queue must have max_batch_size > 0.
-    EXPECT_FALSE(
-        scheduler->AddQueue(queue_options, queue_1_callback, &queue_1).ok());
-    queue_options.max_batch_size = 2;
-    queue_options.max_enqueued_batches = 1;
-    TF_ASSERT_OK(
-        scheduler->AddQueue(queue_options, queue_1_callback, &queue_1));
-
-    // Wait for scheduling_thread to sleep.
-    env.BlockUntilThreadsAsleep(1);
-    // Task larger than max_batch_size shouldn't schedule.
-    EXPECT_FALSE(ScheduleTask(15, queue_0.get()).ok());
-    TF_ASSERT_OK(ScheduleTask(5, queue_0.get()));
-    TF_ASSERT_OK(ScheduleTask(5, queue_0.get()));
-    env.AdvanceByMicroseconds(1);
-
-    // Task larger than max_batch_size shouldn't schedule.
-    EXPECT_FALSE(ScheduleTask(3, queue_1.get()).ok());
-    TF_ASSERT_OK(ScheduleTask(1, queue_1.get()));
-    TF_ASSERT_OK(ScheduleTask(1, queue_1.get()));
-    env.AdvanceByMicroseconds(1);
-    // Exceeds max_enqueued_batches, shouldn't schedule.
-    EXPECT_FALSE(ScheduleTask(1, queue_1.get()).ok());
-
-    TF_ASSERT_OK(ScheduleTask(5, queue_0.get()));
-    // Exceeds max_enqueued_batches, shouldn't schedule.
-    EXPECT_FALSE(ScheduleTask(6, queue_0.get()).ok());
-    TF_ASSERT_OK(ScheduleTask(4, queue_0.get()));
-
-    // Batches should be processed in order from oldest to newest.
-    env.AdvanceByMicroseconds(1000);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(queue_0_tasks, 10);
-    EXPECT_EQ(queue_1_tasks, 0);
-
-    env.AdvanceByMicroseconds(1000);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(queue_0_tasks, 10);
-    EXPECT_EQ(queue_1_tasks, 2);
-
-    env.AdvanceByMicroseconds(1000);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(queue_0_tasks, 19);
-    EXPECT_EQ(queue_1_tasks, 2);
-    start_teardown.Notify();
-  }
-  stop_teardown.Notify();
-}
-
-TEST(AdaptiveSharedBatchSchedulerTest, RateFeedback) {
-  test_util::FakeClockEnv env(Env::Default());
-  Notification start_teardown, stop_teardown;
-  std::unique_ptr<Thread> teardown_thread =
-      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
-  {
-    double feedback = 0;
-    AdaptiveSharedBatchScheduler<FakeTask>::Options options;
-    options.initial_scheduling_period_micros = 1000;
-    options.min_scheduling_period_micros = 200;
-    options.max_scheduling_period_micros = 2000;
-    options.env = &env;
-    options.scheduling_period_feedback = [&feedback] { return feedback; };
-    options.feedback_smoothing_batches = 1;
-    std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
-    TF_ASSERT_OK(
-        AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
-    std::unique_ptr<BatchScheduler<FakeTask>> queue;
-    int scheduled_items = 0;
-    auto queue_callback = [&scheduled_items,
-                           &env](std::unique_ptr<Batch<FakeTask>> batch) {
-      ASSERT_TRUE(batch->IsClosed());
-      EXPECT_GT(batch->num_tasks(), 0);
-      scheduled_items = 0;
-      for (int i = 0; i < batch->num_tasks(); i++) {
-        scheduled_items += batch->task(i).size();
-      }
-      env.SleepForMicroseconds(1);
-    };
-
-    TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
-
-    // Wait for scheduling_thread to sleep.
-    env.BlockUntilThreadsAsleep(1);
-    // Enqueue 6 batches.
-    for (int i = 0; i < 6; i++) {
-      TF_ASSERT_OK(ScheduleTask(900 + i, queue.get()));
-      env.AdvanceByMicroseconds(1);
-    }
-    feedback = -500;
-    env.AdvanceByMicroseconds(994);
-    env.BlockUntilThreadsAsleep(2);  // scheduling period = 500 usec.
-    EXPECT_EQ(scheduled_items, 900);
-    env.AdvanceByMicroseconds(500);
-    env.BlockUntilThreadsAsleep(2);  // scheduling period = 250 usec.
-    EXPECT_EQ(scheduled_items, 901);
-    feedback = 0;
-    env.AdvanceByMicroseconds(250);
-    env.BlockUntilThreadsAsleep(2);  // scheduling period = 250 usec.
-    EXPECT_EQ(scheduled_items, 902);
-    feedback = 10000;  // large feedback should hit max_scheduling_period.
-    env.AdvanceByMicroseconds(250);
-    env.BlockUntilThreadsAsleep(2);  // scheduling period = 2000 usec.
-    EXPECT_EQ(scheduled_items, 903);
-    feedback = -10000;  // large feedback should hit min_scheduling_period.
-    env.AdvanceByMicroseconds(1999);
-    // No callback scheduled, only scheduling thread sleeping.
-    env.BlockUntilThreadsAsleep(1);
-    EXPECT_EQ(scheduled_items, 903);
-    env.AdvanceByMicroseconds(1);
-    env.BlockUntilThreadsAsleep(2);  // scheduling period = 200 usec.
-    EXPECT_EQ(scheduled_items, 904);
-    env.AdvanceByMicroseconds(200);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(scheduled_items, 905);
-    start_teardown.Notify();
-  }
-  stop_teardown.Notify();
-}
-
-TEST(AdaptiveSharedBatchSchedulerTest, FeedbackSmoothing) {
-  test_util::FakeClockEnv env(Env::Default());
-  Notification start_teardown, stop_teardown;
-  std::unique_ptr<Thread> teardown_thread =
-      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
-  {
-    double feedback = 0;
-    AdaptiveSharedBatchScheduler<FakeTask>::Options options;
-    options.initial_scheduling_period_micros = 1000;
-    options.env = &env;
-    options.scheduling_period_feedback = [&feedback] { return feedback; };
-    options.feedback_smoothing_batches = 3;
-    std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
-    TF_ASSERT_OK(
-        AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
-    std::unique_ptr<BatchScheduler<FakeTask>> queue;
-    int scheduled_items = 0;
-    auto queue_callback = [&scheduled_items,
-                           &env](std::unique_ptr<Batch<FakeTask>> batch) {
-      ASSERT_TRUE(batch->IsClosed());
-      EXPECT_GT(batch->num_tasks(), 0);
-      scheduled_items = 0;
-      for (int i = 0; i < batch->num_tasks(); i++) {
-        scheduled_items += batch->task(i).size();
-      }
-      env.SleepForMicroseconds(1);
-    };
-
-    TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
-
-    // Wait for scheduling_thread to sleep.
-    env.BlockUntilThreadsAsleep(1);
-    // Enqueue 4 batches.
-    for (int i = 0; i < 4; i++) {
-      TF_ASSERT_OK(ScheduleTask(900 + i, queue.get()));
-      env.AdvanceByMicroseconds(1);
-    }
-    feedback = -300;
-    env.AdvanceByMicroseconds(996);
-    env.BlockUntilThreadsAsleep(2);
-    // ewma_feedback = 100, scheduling_period = 900.
-    EXPECT_EQ(scheduled_items, 900);
-    env.AdvanceByMicroseconds(899);
-    // No callback scheduled, only scheduling thread sleeping.
-    env.BlockUntilThreadsAsleep(1);
-    EXPECT_EQ(scheduled_items, 900);
-    env.AdvanceByMicroseconds(1);
-    env.BlockUntilThreadsAsleep(2);
-    // ewma_feedback = 167, scheduling_period = 750.
-    EXPECT_EQ(scheduled_items, 901);
-    env.AdvanceByMicroseconds(749);
-    // No callback scheduled, only scheduling thread sleeping.
-    env.BlockUntilThreadsAsleep(1);
-    EXPECT_EQ(scheduled_items, 901);
-    feedback = 1000 / 3.;
-    env.AdvanceByMicroseconds(1);
-    env.BlockUntilThreadsAsleep(2);
-    // emwa_feedback = 0, scheduling_period = 750.
-    EXPECT_EQ(scheduled_items, 902);
-    env.AdvanceByMicroseconds(749);
-    // No callback scheduled, only scheduling thread sleeping.
-    env.BlockUntilThreadsAsleep(1);
-    EXPECT_EQ(scheduled_items, 902);
-    env.AdvanceByMicroseconds(1);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(scheduled_items, 903);
-    start_teardown.Notify();
-  }
-  stop_teardown.Notify();
-}
-
-TEST(AdaptiveSharedBatchSchedulerTest, QueueCapacityInfo) {
-  test_util::FakeClockEnv env(Env::Default());
-  Notification start_teardown, stop_teardown;
-  std::unique_ptr<Thread> teardown_thread =
-      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
-  {
-    AdaptiveSharedBatchScheduler<FakeTask>::Options options;
-    options.initial_scheduling_period_micros = 1000;
-    options.env = &env;
-    std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
-    TF_ASSERT_OK(
-        AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
-    std::unique_ptr<BatchScheduler<FakeTask>> queue;
-    int scheduled_items = 0;
-    auto queue_callback = [&scheduled_items,
-                           &env](std::unique_ptr<Batch<FakeTask>> batch) {
-      ASSERT_TRUE(batch->IsClosed());
-      EXPECT_GT(batch->num_tasks(), 0);
-      scheduled_items = 0;
-      for (int i = 0; i < batch->num_tasks(); i++) {
-        scheduled_items += batch->task(i).size();
-      }
-      env.SleepForMicroseconds(1);
-    };
-    AdaptiveSharedBatchScheduler<FakeTask>::QueueOptions queue_options;
-    queue_options.max_batch_size = 10;
-    queue_options.max_enqueued_batches = 10;
-    TF_ASSERT_OK(scheduler->AddQueue(queue_options, queue_callback, &queue));
-
-    // Wait for scheduling_thread to sleep.
-    env.BlockUntilThreadsAsleep(1);
-    // Enqueue 3 tasks.
-    EXPECT_EQ(queue->NumEnqueuedTasks(), 0);
-    EXPECT_EQ(queue->SchedulingCapacity(), 100);
-    TF_ASSERT_OK(ScheduleTask(5, queue.get()));
-    EXPECT_EQ(queue->NumEnqueuedTasks(), 1);
-    EXPECT_EQ(queue->SchedulingCapacity(), 95);
-    env.AdvanceByMicroseconds(1);
-    TF_ASSERT_OK(ScheduleTask(6, queue.get()));
-    EXPECT_EQ(queue->NumEnqueuedTasks(), 2);
-    EXPECT_EQ(queue->SchedulingCapacity(), 84);
-    env.AdvanceByMicroseconds(1);
-    TF_ASSERT_OK(ScheduleTask(1, queue.get()));
-    EXPECT_EQ(queue->NumEnqueuedTasks(), 3);
-    EXPECT_EQ(queue->SchedulingCapacity(), 83);
-
-    env.AdvanceByMicroseconds(998);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(scheduled_items, 5);
-    env.AdvanceByMicroseconds(1000);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(scheduled_items, 7);
-    start_teardown.Notify();
-  }
-  stop_teardown.Notify();
-}
-
-TEST(AdaptiveSharedBatchSchedulerTest, InFlightBatchesImplementation) {
+TEST(AdaptiveSharedBatchSchedulerTest, InFlightBatchesLimit) {
   AdaptiveSharedBatchScheduler<FakeTask>::Options options;
-  options.use_in_flight_batches_implementation = true;
   options.initial_in_flight_batches_limit = 2;
   options.batches_to_average_over = 1000;
   mutex mu;
@@ -476,7 +112,7 @@ TEST(AdaptiveSharedBatchSchedulerTest, InFlightBatchesImplementation) {
   std::unique_ptr<BatchScheduler<FakeTask>> queue;
   TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
 
-  // Enqueue 3 batches.
+  // Enqueue 3 tasks, should result in 3 batches.
   for (int i = 0; i < 3; i++) {
     TF_ASSERT_OK(ScheduleTask(100, queue.get()));
   }
@@ -490,7 +126,6 @@ TEST(AdaptiveSharedBatchSchedulerTest, InFlightBatchesLimitTuning) {
   {
     AdaptiveSharedBatchScheduler<FakeTask>::Options options;
     options.env = &env;
-    options.use_in_flight_batches_implementation = true;
     options.initial_in_flight_batches_limit = 2;
     options.batches_to_average_over = 1;
     auto queue_callback = [&env](std::unique_ptr<Batch<FakeTask>> batch) {
@@ -544,6 +179,196 @@ TEST(AdaptiveSharedBatchSchedulerTest, InFlightBatchesLimitTuning) {
   }
   stop_teardown.Notify();
 }
+
+TEST(AdaptiveSharedBatchSchedulerTest, FullBatchSchedulingBoostMicros) {
+  test_util::FakeClockEnv env(Env::Default());
+  Notification start_teardown, stop_teardown;
+  std::unique_ptr<Thread> teardown_thread =
+      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
+  {
+    AdaptiveSharedBatchScheduler<FakeTask>::Options options;
+    options.env = &env;
+    options.initial_in_flight_batches_limit = 1;
+    options.batches_to_average_over = 1000;
+    options.full_batch_scheduling_boost_micros = 100;
+    mutex mu;
+    int processed_batches = 0;
+    Notification finish_processing;
+    auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                              std::unique_ptr<Batch<FakeTask>> batch) {
+      ASSERT_TRUE(batch->IsClosed());
+      finish_processing.WaitForNotification();
+      mutex_lock l(mu);
+      processed_batches++;
+      switch (processed_batches) {
+        case 1:
+          EXPECT_EQ(100, batch->size());
+          break;
+        case 2:
+          EXPECT_EQ(50, batch->size());
+          break;
+        case 3:
+          EXPECT_EQ(900, batch->size());
+          break;
+        case 4:
+          EXPECT_EQ(200, batch->size());
+          break;
+        default:
+          EXPECT_TRUE(false) << "Should only have 4 batches";
+      }
+    };
+    std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
+    TF_ASSERT_OK(
+        AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
+    AdaptiveSharedBatchScheduler<FakeTask>::QueueOptions queue_options;
+    std::unique_ptr<BatchScheduler<FakeTask>> queue1;
+    std::unique_ptr<BatchScheduler<FakeTask>> queue2;
+    queue_options.max_batch_size = 1000;
+    TF_ASSERT_OK(scheduler->AddQueue(queue_options, queue_callback, &queue1));
+    queue_options.max_batch_size = 100;
+    TF_ASSERT_OK(scheduler->AddQueue(queue_options, queue_callback, &queue2));
+
+    // First batch immediately processed.
+    TF_ASSERT_OK(ScheduleTask(100, queue1.get()));
+
+    TF_ASSERT_OK(ScheduleTask(100, queue1.get()));
+    env.AdvanceByMicroseconds(10);
+    TF_ASSERT_OK(ScheduleTask(100, queue1.get()));
+    env.AdvanceByMicroseconds(10);
+
+    TF_ASSERT_OK(ScheduleTask(50, queue2.get()));
+    env.AdvanceByMicroseconds(45);
+
+    TF_ASSERT_OK(ScheduleTask(900, queue1.get()));
+
+    // Second batch - creation time: 0, fullness: 0.2, sched score: -20
+    // Third batch - creation time: 20, fullness: 0.5, sched score: -30
+    // Fourth batch - creation time: 65, fullness: 0.9, sched score: -25
+
+    finish_processing.Notify();
+    start_teardown.Notify();
+  }
+  stop_teardown.Notify();
+}
+
+TEST(AdaptiveSharedBatchSchedulerTest, DeleteQueue) {
+  AdaptiveSharedBatchScheduler<FakeTask>::Options options;
+  options.initial_in_flight_batches_limit = 1;
+  options.batches_to_average_over = 1000;
+  mutex mu;
+  int processed_batches = 0;
+  Notification finish_processing;
+  auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                            std::unique_ptr<Batch<FakeTask>> batch) {
+    ASSERT_TRUE(batch->IsClosed());
+    EXPECT_GT(batch->num_tasks(), 0);
+    finish_processing.WaitForNotification();
+    mu.lock();
+    processed_batches++;
+    mu.unlock();
+  };
+
+  std::unique_ptr<Thread> queue_deleter;
+  std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
+  std::unique_ptr<BatchScheduler<FakeTask>> queue;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
+
+  // Enqueue 2 tasks, should result in 2 batches.
+  for (int i = 0; i < 2; i++) {
+    TF_ASSERT_OK(ScheduleTask(100, queue.get()));
+  }
+  // Delete queue, should be kept alive until empty.
+  queue_deleter.reset(Env::Default()->StartThread(
+      {}, "QueueDeleterThread", [&queue, &mu, &processed_batches] {
+        queue.reset();
+        mutex_lock l(mu);
+        EXPECT_EQ(processed_batches, 2);
+      }));
+  // Give queue_deleter thread time to delete queue.
+  Env::Default()->SleepForMicroseconds(1000);
+  finish_processing.Notify();
+}
+
+TEST(AdaptiveSharedBatchSchedulerTest, DeleteScheduler) {
+  AdaptiveSharedBatchScheduler<FakeTask>::Options options;
+  options.initial_in_flight_batches_limit = 1;
+  options.batches_to_average_over = 1000;
+  mutex mu;
+  int processed_batches = 0;
+  Notification finish_processing;
+  auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                            std::unique_ptr<Batch<FakeTask>> batch) {
+    ASSERT_TRUE(batch->IsClosed());
+    EXPECT_GT(batch->num_tasks(), 0);
+    finish_processing.WaitForNotification();
+    mu.lock();
+    processed_batches++;
+    mu.unlock();
+  };
+
+  std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
+  std::unique_ptr<BatchScheduler<FakeTask>> queue;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
+
+  // Enqueue 2 tasks, should result in 2 batches.
+  for (int i = 0; i < 2; i++) {
+    TF_ASSERT_OK(ScheduleTask(100, queue.get()));
+  }
+  // Delete scheduler, should be kept alive until queues are empty.
+  scheduler.reset();
+  finish_processing.Notify();
+  while (true) {
+    mutex_lock l(mu);
+    if (processed_batches == 2) break;
+  }
+}
+
+TEST(AdaptiveSharedBatchSchedulerTest, QueueCapacityInfo) {
+  AdaptiveSharedBatchScheduler<FakeTask>::Options options;
+  options.initial_in_flight_batches_limit = 1;
+  options.batches_to_average_over = 1000;
+  mutex mu;
+  int processed_batches = 0;
+  Notification finish_processing;
+  auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                            std::unique_ptr<Batch<FakeTask>> batch) {
+    ASSERT_TRUE(batch->IsClosed());
+    EXPECT_GT(batch->num_tasks(), 0);
+    mu.lock();
+    int batch_num = ++processed_batches;
+    mu.unlock();
+    if (batch_num == 1) {
+      finish_processing.WaitForNotification();
+    }
+  };
+  std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
+  std::unique_ptr<BatchScheduler<FakeTask>> queue;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
+
+  // Enqueue 2 tasks, should result in 2 batches.
+  for (int i = 0; i < 2; i++) {
+    TF_ASSERT_OK(ScheduleTask(100, queue.get()));
+  }
+  // First batch was immediately processed, no longer counts as enqueued.
+  EXPECT_EQ(queue->NumEnqueuedTasks(), 1);
+  EXPECT_EQ(queue->SchedulingCapacity(), 9 * 1000 + 900);
+  // Enqueue 2 more tasks, should fall in same batch.
+  TF_ASSERT_OK(ScheduleTask(100, queue.get()));
+  TF_ASSERT_OK(ScheduleTask(200, queue.get()));
+  EXPECT_EQ(queue->NumEnqueuedTasks(), 3);
+  EXPECT_EQ(queue->SchedulingCapacity(), 9 * 1000 + 600);
+  // Enqueue 1 more task, should create new batch.
+  TF_ASSERT_OK(ScheduleTask(700, queue.get()));
+  EXPECT_EQ(queue->NumEnqueuedTasks(), 4);
+  EXPECT_EQ(queue->SchedulingCapacity(), 8 * 1000 + 300);
+  finish_processing.Notify();
+}
 }  // namespace anonymous
 }  // namespace serving
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/check_numerics_op.cc b/tensorflow/core/kernels/check_numerics_op.cc
index 6040b2b3999770bdd9e39e5209b6b0a1918e1d8e..d3b67f4614e82d3efe851038f2ac8ba29a38521e 100644
--- a/tensorflow/core/kernels/check_numerics_op.cc
+++ b/tensorflow/core/kernels/check_numerics_op.cc
@@ -15,6 +15,8 @@ limitations under the License.
 
 // See docs in ../ops/array_ops.cc.
 
+#include "tensorflow/core/lib/bfloat16/bfloat16.h"
+
 #include <math.h>
 #include <algorithm>
 #include <numeric>
@@ -219,6 +221,7 @@ class CheckNumericsOp<GPUDevice, T> : public AsyncOpKernel {
       Name("CheckNumerics").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
       CheckNumericsOp<CPUDevice, T>);
 TF_CALL_half(REGISTER_CPU_KERNEL);
+TF_CALL_bfloat16(REGISTER_CPU_KERNEL);
 TF_CALL_float(REGISTER_CPU_KERNEL);
 TF_CALL_double(REGISTER_CPU_KERNEL);
 
diff --git a/tensorflow/core/kernels/concat_op.cc b/tensorflow/core/kernels/concat_op.cc
index 7011550f7e161c9727b8d31eff0917964b09044e..f16766315f2640ab7c42c077fc5156a3a825fbf9 100644
--- a/tensorflow/core/kernels/concat_op.cc
+++ b/tensorflow/core/kernels/concat_op.cc
@@ -18,7 +18,6 @@ limitations under the License.
 #include <limits>
 #include <vector>
 
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/tensor.h"
@@ -28,6 +27,7 @@ limitations under the License.
 #include "tensorflow/core/kernels/concat_lib.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/types.h"
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 
 namespace tensorflow {
 
@@ -53,17 +53,38 @@ class ConcatBaseOp : public OpKernel {
   void Compute(OpKernelContext* c) override {
     const Tensor* concat_dim_tensor;
     const char* axis_attribute_name =
-        AxisArgName == NAME_IS_AXIS
-            ? "axis"
-            : AxisArgName == NAME_IS_CONCAT_DIM ? "concat_dim" : "<invalid>";
+        AxisArgName == NAME_IS_AXIS ? "axis" : AxisArgName == NAME_IS_CONCAT_DIM
+                                                   ? "concat_dim"
+                                                   : "<invalid>";
     OP_REQUIRES_OK(c, c->input(axis_attribute_name, &concat_dim_tensor));
     OP_REQUIRES(c, IsLegacyScalar(concat_dim_tensor->shape()),
                 errors::InvalidArgument(
                     axis_attribute_name,
                     " tensor should be a scalar integer, but got shape ",
                     concat_dim_tensor->shape().DebugString()));
-    const int32 concat_dim =
-        internal::SubtleMustCopy(concat_dim_tensor->scalar<int32>()());
+    int64 concat_dim;
+    // For ConcatV2, "axis" may be int32 or int64; "concat_dim" must be int32.
+    if (AxisArgName == NAME_IS_AXIS) {
+      OP_REQUIRES(
+          c, (concat_dim_tensor->dtype() == DT_INT32 ||
+              concat_dim_tensor->dtype() == DT_INT64),
+          errors::InvalidArgument(axis_attribute_name,
+                                  " tensor should be int32 or int64, but got ",
+                                  concat_dim_tensor->dtype()));
+    } else {
+      OP_REQUIRES(c, (concat_dim_tensor->dtype() == DT_INT32),
+                  errors::InvalidArgument(axis_attribute_name,
+                                          " tensor should be int32, but got ",
+                                          concat_dim_tensor->dtype()));
+    }
+    if (concat_dim_tensor->dtype() == DT_INT32) {
+      concat_dim =
+          internal::SubtleMustCopy(concat_dim_tensor->scalar<int32>()());
+    } else {
+      concat_dim =
+          internal::SubtleMustCopy(concat_dim_tensor->scalar<int64>()());
+    }
+
     OpInputList values;
     OP_REQUIRES_OK(c, c->input_list("values", &values));
     const int N = values.size();
@@ -154,17 +175,16 @@ using ConcatOp = ConcatBaseOp<Device, T, NAME_IS_CONCAT_DIM>;
 template <typename Device, typename T>
 using ConcatV2Op = ConcatBaseOp<Device, T, NAME_IS_AXIS>;
 
-#define REGISTER_CONCAT(type)                                \
-  REGISTER_KERNEL_BUILDER(Name("Concat")                     \
-                              .Device(DEVICE_CPU)            \
-                              .TypeConstraint<type>("T")     \
-                              .HostMemory("concat_dim"),     \
-                          ConcatOp<CPUDevice, type>)         \
-  REGISTER_KERNEL_BUILDER(Name("ConcatV2")                   \
-                              .Device(DEVICE_CPU)            \
-                              .TypeConstraint<type>("T")     \
-                              .TypeConstraint<int32>("Tidx") \
-                              .HostMemory("axis"),           \
+#define REGISTER_CONCAT(type)                            \
+  REGISTER_KERNEL_BUILDER(Name("Concat")                 \
+                              .Device(DEVICE_CPU)        \
+                              .TypeConstraint<type>("T") \
+                              .HostMemory("concat_dim"), \
+                          ConcatOp<CPUDevice, type>)     \
+  REGISTER_KERNEL_BUILDER(Name("ConcatV2")               \
+                              .Device(DEVICE_CPU)        \
+                              .TypeConstraint<type>("T") \
+                              .HostMemory("axis"),       \
                           ConcatV2Op<CPUDevice, type>)
 
 TF_CALL_POD_STRING_TYPES(REGISTER_CONCAT);
@@ -178,17 +198,16 @@ REGISTER_CONCAT(qint32);
 
 #if GOOGLE_CUDA
 
-#define REGISTER_GPU(type)                                   \
-  REGISTER_KERNEL_BUILDER(Name("Concat")                     \
-                              .Device(DEVICE_GPU)            \
-                              .TypeConstraint<type>("T")     \
-                              .HostMemory("concat_dim"),     \
-                          ConcatOp<GPUDevice, type>)         \
-  REGISTER_KERNEL_BUILDER(Name("ConcatV2")                   \
-                              .Device(DEVICE_GPU)            \
-                              .TypeConstraint<type>("T")     \
-                              .TypeConstraint<int32>("Tidx") \
-                              .HostMemory("axis"),           \
+#define REGISTER_GPU(type)                               \
+  REGISTER_KERNEL_BUILDER(Name("Concat")                 \
+                              .Device(DEVICE_GPU)        \
+                              .TypeConstraint<type>("T") \
+                              .HostMemory("concat_dim"), \
+                          ConcatOp<GPUDevice, type>)     \
+  REGISTER_KERNEL_BUILDER(Name("ConcatV2")               \
+                              .Device(DEVICE_GPU)        \
+                              .TypeConstraint<type>("T") \
+                              .HostMemory("axis"),       \
                           ConcatV2Op<GPUDevice, type>)
 
 TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU);
@@ -212,7 +231,6 @@ REGISTER_KERNEL_BUILDER(Name("Concat")
 REGISTER_KERNEL_BUILDER(Name("ConcatV2")
                             .Device(DEVICE_GPU)
                             .TypeConstraint<int32>("T")
-                            .TypeConstraint<int32>("Tidx")
                             .HostMemory("values")
                             .HostMemory("axis")
                             .HostMemory("output"),
@@ -221,17 +239,16 @@ REGISTER_KERNEL_BUILDER(Name("ConcatV2")
 #endif  // GOOGLE_CUDA
 
 #ifdef TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL(type)                                  \
-  REGISTER_KERNEL_BUILDER(Name("Concat")                     \
-                              .Device(DEVICE_SYCL)           \
-                              .TypeConstraint<type>("T")     \
-                              .HostMemory("concat_dim"),     \
-                          ConcatOp<SYCLDevice, type>)        \
-  REGISTER_KERNEL_BUILDER(Name("ConcatV2")                   \
-                              .Device(DEVICE_SYCL)           \
-                              .TypeConstraint<type>("T")     \
-                              .TypeConstraint<int32>("Tidx") \
-                              .HostMemory("axis"),           \
+#define REGISTER_SYCL(type)                              \
+  REGISTER_KERNEL_BUILDER(Name("Concat")                 \
+                              .Device(DEVICE_SYCL)       \
+                              .TypeConstraint<type>("T") \
+                              .HostMemory("concat_dim"), \
+                          ConcatOp<SYCLDevice, type>)    \
+  REGISTER_KERNEL_BUILDER(Name("ConcatV2")               \
+                              .Device(DEVICE_SYCL)       \
+                              .TypeConstraint<type>("T") \
+                              .HostMemory("axis"),       \
                           ConcatV2Op<SYCLDevice, type>)
 
 TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL);
@@ -246,7 +263,6 @@ REGISTER_KERNEL_BUILDER(Name("Concat")
 REGISTER_KERNEL_BUILDER(Name("ConcatV2")
                             .Device(DEVICE_SYCL)
                             .TypeConstraint<int32>("T")
-                            .TypeConstraint<int32>("Tidx")
                             .HostMemory("values")
                             .HostMemory("axis")
                             .HostMemory("output"),
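Reviewer note on the concat_op.cc hunks above: the kernel now checks the index dtype at run time and widens it to a 64-bit `concat_dim`, instead of pinning `Tidx` to int32 at registration. The following is a minimal, self-contained sketch of that dispatch pattern. It is not TensorFlow code: `AxisScalar`, `AxisArg`, and `ReadAxis` are hypothetical names introduced only for illustration, and a `std::variant` stands in for the scalar tensor. It assumes a C++17 compiler.

```cpp
#include <cstdint>
#include <optional>
#include <variant>

// Hypothetical stand-in for a scalar index tensor: it holds either a 32-bit
// or a 64-bit integer, mirroring the two dtypes the kernel now accepts.
using AxisScalar = std::variant<std::int32_t, std::int64_t>;

// Which input the scalar came from (illustrative names, not TensorFlow's).
enum class AxisArg { kAxis, kConcatDim };

// Returns the axis widened to 64 bits, or std::nullopt when the dtype is not
// permitted for the given argument: "axis" accepts int32 or int64, while
// "concat_dim" accepts int32 only.
std::optional<std::int64_t> ReadAxis(const AxisScalar& value, AxisArg arg) {
  if (const auto* narrow = std::get_if<std::int32_t>(&value)) {
    return static_cast<std::int64_t>(*narrow);
  }
  if (arg != AxisArg::kAxis) return std::nullopt;
  return std::get<std::int64_t>(value);
}
```

In this sketch, validation and widening happen in one place, so everything downstream of `ReadAxis` only ever sees a 64-bit value. The patch achieves the same effect with an `OP_REQUIRES` dtype check followed by a branch on `DT_INT32`.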
diff --git a/tensorflow/core/kernels/constant_op.cc b/tensorflow/core/kernels/constant_op.cc
index fdb03a5aae8bd9dbe180e9b722fc5847c740801b..312c1a41d36245ae3ca5a09d2e76a430bc464953 100644
--- a/tensorflow/core/kernels/constant_op.cc
+++ b/tensorflow/core/kernels/constant_op.cc
@@ -105,7 +105,12 @@ REGISTER_KERNEL(GPU, int8);
 REGISTER_KERNEL(GPU, qint8);
 REGISTER_KERNEL(GPU, uint16);
 REGISTER_KERNEL(GPU, int16);
+REGISTER_KERNEL(GPU, qint16);
+REGISTER_KERNEL(GPU, quint16);
+REGISTER_KERNEL(GPU, uint32);
+REGISTER_KERNEL(GPU, qint32);
 REGISTER_KERNEL(GPU, int64);
+REGISTER_KERNEL(GPU, uint64);
 REGISTER_KERNEL(GPU, complex64);
 REGISTER_KERNEL(GPU, complex128);
 REGISTER_KERNEL(GPU, bool);
@@ -122,9 +127,15 @@ REGISTER_SYCL_KERNEL(SYCL, float);
 REGISTER_SYCL_KERNEL(SYCL, double);
 REGISTER_SYCL_KERNEL(SYCL, uint8);
 REGISTER_SYCL_KERNEL(SYCL, int8);
+REGISTER_SYCL_KERNEL(SYCL, qint8);
 REGISTER_SYCL_KERNEL(SYCL, uint16);
 REGISTER_SYCL_KERNEL(SYCL, int16);
+REGISTER_SYCL_KERNEL(SYCL, qint16);
+REGISTER_SYCL_KERNEL(SYCL, quint16);
+REGISTER_SYCL_KERNEL(SYCL, uint32);
+REGISTER_SYCL_KERNEL(SYCL, qint32);
 REGISTER_SYCL_KERNEL(SYCL, int64);
+REGISTER_SYCL_KERNEL(SYCL, uint64);
 REGISTER_SYCL_KERNEL(SYCL, bool);
 #undef REGISTER_SYCL_KERNEL
 #endif
diff --git a/tensorflow/core/kernels/conv_grad_filter_ops.cc b/tensorflow/core/kernels/conv_grad_filter_ops.cc
index e6ae59529107e529a9ccf7c790da0da62c90c199..66ee474ca3f72c283b2a300e90f7377a68911b7b 100644
--- a/tensorflow/core/kernels/conv_grad_filter_ops.cc
+++ b/tensorflow/core/kernels/conv_grad_filter_ops.cc
@@ -520,6 +520,7 @@ class Conv2DCustomBackpropFilterOp : public OpKernel {
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
+TF_CALL_double(REGISTER_CPU_KERNELS);
 #undef REGISTER_CPU_KERNELS
 
 // GPU definitions.
@@ -1017,11 +1018,17 @@ namespace functor {
       typename TTypes<T, 4, int>::Tensor out, TensorFormat data_format); \
   extern template struct PadInput<GPUDevice, T, int, 4>;
 
+DECLARE_GPU_SPEC(double);
 DECLARE_GPU_SPEC(float);
 DECLARE_GPU_SPEC(Eigen::half);
 #undef DECLARE_GPU_SPEC
 }  // namespace functor
 
+REGISTER_KERNEL_BUILDER(Name("Conv2DBackpropFilter")
+                            .Device(DEVICE_GPU)
+                            .TypeConstraint<double>("T")
+                            .HostMemory("filter_sizes"),
+                        Conv2DSlowBackpropFilterOp<GPUDevice, double>);
 REGISTER_KERNEL_BUILDER(Name("Conv2DBackpropFilter")
                             .Device(DEVICE_GPU)
                             .TypeConstraint<float>("T")
diff --git a/tensorflow/core/kernels/conv_grad_input_ops.cc b/tensorflow/core/kernels/conv_grad_input_ops.cc
index 15c55e4d9903b3bbd53e1b6e1c95571ef7834015..71ea0d5d720df3c8070bce81fc8608b438617220 100644
--- a/tensorflow/core/kernels/conv_grad_input_ops.cc
+++ b/tensorflow/core/kernels/conv_grad_input_ops.cc
@@ -592,6 +592,7 @@ class Conv2DCustomBackpropInputOp : public OpKernel {
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
+TF_CALL_double(REGISTER_CPU_KERNELS);
 #undef REGISTER_CPU_KERNELS
 
 // GPU definitions.
@@ -1090,11 +1091,17 @@ namespace functor {
       typename TTypes<T, 4, int>::Tensor out, TensorFormat data_format); \
   extern template struct PadInput<GPUDevice, T, int, 4>;
 
+DECLARE_GPU_SPEC(double);
 DECLARE_GPU_SPEC(float);
 DECLARE_GPU_SPEC(Eigen::half);
 #undef DECLARE_GPU_SPEC
 }  // namespace functor
 
+REGISTER_KERNEL_BUILDER(Name("Conv2DBackpropInput")
+                            .Device(DEVICE_GPU)
+                            .TypeConstraint<double>("T")
+                            .HostMemory("input_sizes"),
+                        Conv2DSlowBackpropInputOp<GPUDevice, double>);
 REGISTER_KERNEL_BUILDER(Name("Conv2DBackpropInput")
                             .Device(DEVICE_GPU)
                             .TypeConstraint<float>("T")
diff --git a/tensorflow/core/kernels/conv_ops.cc b/tensorflow/core/kernels/conv_ops.cc
index 47f6907c04b4e48607e66b5c9601cd9030fa9001..88843e4da78a867ea5b7c30d6cb43855fdefdd13 100644
--- a/tensorflow/core/kernels/conv_ops.cc
+++ b/tensorflow/core/kernels/conv_ops.cc
@@ -446,10 +446,11 @@ class Conv2DOp : public BinaryOp<T> {
 #if !defined(USE_GEMM_FOR_CONV)
 TF_CALL_half(REGISTER_CPU);
 TF_CALL_float(REGISTER_CPU);
+TF_CALL_double(REGISTER_CPU);
 #endif  // USE_GEMM_FOR_CONV
 
 // To be used inside depthwise_conv_op.cc.
-template class LaunchConv2DOp<CPUDevice, float>;
+template struct LaunchConv2DOp<CPUDevice, float>;
 
 #if GOOGLE_CUDA
 int64 GetCudnnWorkspaceLimit(const string& envvar_in_mb,
@@ -810,6 +811,7 @@ namespace functor {
       typename TTypes<T, 4, int>::Tensor out, TensorFormat data_format);     \
   extern template struct PadInput<GPUDevice, T, int, 4>
 
+DECLARE_GPU_SPEC(double);
 DECLARE_GPU_SPEC(float);
 DECLARE_GPU_SPEC(Eigen::half);
 #undef DECLARE_GPU_SPEC
@@ -822,6 +824,9 @@ REGISTER_KERNEL_BUILDER(
 REGISTER_KERNEL_BUILDER(
     Name("Conv2D").Device(DEVICE_GPU).TypeConstraint<float>("T"),
     Conv2DOp<GPUDevice, float>);
+REGISTER_KERNEL_BUILDER(
+    Name("Conv2D").Device(DEVICE_GPU).TypeConstraint<double>("T"),
+    Conv2DOp<GPUDevice, double>);
 
 // To be used inside depthwise_conv_op.cc.
 template class LaunchConv2DOp<GPUDevice, float>;
diff --git a/tensorflow/core/kernels/conv_ops_gpu_2.cu.cc b/tensorflow/core/kernels/conv_ops_gpu_2.cu.cc
index b5dd26a9e47578619945da21c85b4c5b40a55132..52859af950e3c346536acd246bafde830b405ee5 100644
--- a/tensorflow/core/kernels/conv_ops_gpu_2.cu.cc
+++ b/tensorflow/core/kernels/conv_ops_gpu_2.cu.cc
@@ -25,6 +25,9 @@ limitations under the License.
 namespace tensorflow {
 
 typedef Eigen::GpuDevice GPUDevice;
+template struct functor::InflatePadAndShuffle<GPUDevice, double, 4, int>;
+template struct functor::InflatePadAndShuffle<GPUDevice, double, 4,
+                                              Eigen::DenseIndex>;
 template struct functor::InflatePadAndShuffle<GPUDevice, float, 4, int>;
 template struct functor::InflatePadAndShuffle<GPUDevice, float, 4,
                                               Eigen::DenseIndex>;
diff --git a/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc b/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
index a376534badc73065e3ec01972dde85da7bbdb0f8..2503b475dc10e631863e06b1e4d6931928fb4321 100644
--- a/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
+++ b/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
@@ -1039,9 +1039,11 @@ template struct functor::SwapDimension0And2InTensor3<GPUDevice, double2,
                                                      /*conjugate=*/true>;
 
 // For 2d ops.
+template struct functor::TransformFilter<GPUDevice, double, int, 4>;
 template struct functor::TransformFilter<GPUDevice, float, int, 4>;
 template struct functor::TransformFilter<GPUDevice, Eigen::half, int, 4>;
 
+template struct functor::ReverseTransformFilter<GPUDevice, double, 4>;
 template struct functor::ReverseTransformFilter<GPUDevice, float, 4>;
 template struct functor::ReverseTransformFilter<GPUDevice, Eigen::half, 4>;
 
@@ -1054,6 +1056,7 @@ template struct functor::NCHWToNHWC<GPUDevice, float, 4>;
 template struct functor::NCHWToNHWC<GPUDevice, Eigen::half, 4>;
 
 template struct functor::PadInput<GPUDevice, int, int, 4>;
+template struct functor::PadInput<GPUDevice, double, int, 4>;
 template struct functor::PadInput<GPUDevice, float, int, 4>;
 template struct functor::PadInput<GPUDevice, Eigen::half, int, 4>;
 
diff --git a/tensorflow/core/kernels/conv_ops_test.cc b/tensorflow/core/kernels/conv_ops_test.cc
index 666bca265c95febf3753e71bf010a7caf95c0541..e2e166c02fe2ee7f1e61ccf1f0eb72aa03f9af07 100644
--- a/tensorflow/core/kernels/conv_ops_test.cc
+++ b/tensorflow/core/kernels/conv_ops_test.cc
@@ -401,7 +401,7 @@ class ConvOpTest : public OpsTestBase {
     // (1*0)+(4*5)+(7*6)+(2*0)+(5*9)+(8*10)+(3*0)+(6*0)+(9*0)=187
     // (1*5)+(4*6)+(7*7)+(2*9)+(5*10)+(8*11)+(3*0)+(6*0)+(9*0)=234
     // (1*6)+(4*7)+(7*8)+(2*10)+(5*11)+(8*12)+(3*0)+(6*0)+(9*0)=261
-    // (1*7)+(4*11)+(7*0)+(2*8)+(5*12)+(8*0)+(3*0)+(6*0)+(9*0)=121
+    // (1*7)+(4*8)+(7*0)+(2*11)+(5*12)+(8*0)+(3*0)+(6*0)+(9*0)=121
     // This means we should end up with this matrix:
     // |  105  |  150  |  183  |   95  |
     // |  235  |  312  |  357  |  178  |
diff --git a/tensorflow/contrib/cudnn_rnn/kernels/cudnn_rnn_ops.cc b/tensorflow/core/kernels/cudnn_rnn_ops.cc
similarity index 100%
rename from tensorflow/contrib/cudnn_rnn/kernels/cudnn_rnn_ops.cc
rename to tensorflow/core/kernels/cudnn_rnn_ops.cc
diff --git a/tensorflow/core/kernels/cwise_op_add_1.cc b/tensorflow/core/kernels/cwise_op_add_1.cc
index bf32c8a54b34586e43d34cf8890ed37fe64b8c34..9e4ffe950c9a88d22a3bfc081adc4703fd9e6b65 100644
--- a/tensorflow/core/kernels/cwise_op_add_1.cc
+++ b/tensorflow/core/kernels/cwise_op_add_1.cc
@@ -16,10 +16,10 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER5(BinaryOp, CPU, "Add", functor::add, float, Eigen::half, double, int32,
-          int64);
-REGISTER5(BinaryOp, CPU, "AddV2", functor::add, float, Eigen::half, double,
-          int32, int64);
+REGISTER6(BinaryOp, CPU, "Add", functor::add, float, Eigen::half, double, int32,
+          int64, bfloat16);
+REGISTER6(BinaryOp, CPU, "AddV2", functor::add, float, Eigen::half, double,
+          int32, int64, bfloat16);
 
 #if GOOGLE_CUDA
 REGISTER3(BinaryOp, GPU, "Add", functor::add, float, Eigen::half, double);
diff --git a/tensorflow/core/kernels/cwise_op_div.cc b/tensorflow/core/kernels/cwise_op_div.cc
index c71c756e4461d4ed36628ea8a4f8a0922896302c..b12652f7fba4ea8a9bd4ec18b79469ad69e79902 100644
--- a/tensorflow/core/kernels/cwise_op_div.cc
+++ b/tensorflow/core/kernels/cwise_op_div.cc
@@ -16,14 +16,14 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER5(BinaryOp, CPU, "Div", functor::div, float, Eigen::half, double,
-          complex64, complex128);
+REGISTER6(BinaryOp, CPU, "Div", functor::div, float, Eigen::half, double,
+          bfloat16, complex64, complex128);
 REGISTER5(BinaryOp, CPU, "Div", functor::safe_div, uint8, uint16, int16, int32,
           int64);
 REGISTER5(BinaryOp, CPU, "TruncateDiv", functor::safe_div, uint8, uint16, int16,
           int32, int64);
-REGISTER5(BinaryOp, CPU, "RealDiv", functor::div, float, Eigen::half, double,
-          complex64, complex128);
+REGISTER6(BinaryOp, CPU, "RealDiv", functor::div, float, Eigen::half, double,
+          bfloat16, complex64, complex128);
 #if GOOGLE_CUDA
 REGISTER9(BinaryOp, GPU, "Div", functor::div, float, Eigen::half, double, uint8,
           uint16, int16, int64, complex64, complex128);
diff --git a/tensorflow/core/kernels/cwise_op_isnan.cc b/tensorflow/core/kernels/cwise_op_isnan.cc
index aa180c247e7d01ef0f2898b4a50a71c3c3bc6941..707dc9e49ca1dd5b9872c2e5d3184e11eddd7a1f 100644
--- a/tensorflow/core/kernels/cwise_op_isnan.cc
+++ b/tensorflow/core/kernels/cwise_op_isnan.cc
@@ -16,7 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER3(UnaryOp, CPU, "IsNan", functor::isnan, float, Eigen::half, double);
+REGISTER4(UnaryOp, CPU, "IsNan", functor::isnan, float, Eigen::half, double,
+          bfloat16);
 
 #if GOOGLE_CUDA
 REGISTER3(UnaryOp, GPU, "IsNan", functor::isnan, float, Eigen::half, double);
diff --git a/tensorflow/core/kernels/cwise_op_less.cc b/tensorflow/core/kernels/cwise_op_less.cc
index 00cdecdbd184b84b6601eda76dd5dfded5aa1e1b..575968126fa82d585fcda9490da5cd69332366c6 100644
--- a/tensorflow/core/kernels/cwise_op_less.cc
+++ b/tensorflow/core/kernels/cwise_op_less.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER8(BinaryOp, CPU, "Less", functor::less, float, Eigen::half, double,
-          int32, int64, uint8, int8, int16);
+REGISTER9(BinaryOp, CPU, "Less", functor::less, float, Eigen::half, double,
+          bfloat16, int32, int64, uint8, int8, int16);
 #if GOOGLE_CUDA
 REGISTER7(BinaryOp, GPU, "Less", functor::less, float, Eigen::half, double,
           int64, uint8, int8, int16);
diff --git a/tensorflow/core/kernels/cwise_op_less_equal.cc b/tensorflow/core/kernels/cwise_op_less_equal.cc
index 11806c5fc774dc3a37abc733127e4b6660f27f9c..499200d0546ccf1d9119b63a9e552908de3d1ae1 100644
--- a/tensorflow/core/kernels/cwise_op_less_equal.cc
+++ b/tensorflow/core/kernels/cwise_op_less_equal.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER8(BinaryOp, CPU, "LessEqual", functor::less_equal, float, Eigen::half,
-          double, int32, int64, uint8, int8, int16);
+REGISTER9(BinaryOp, CPU, "LessEqual", functor::less_equal, float, Eigen::half,
+          bfloat16, double, int32, int64, uint8, int8, int16);
 #if GOOGLE_CUDA
 REGISTER7(BinaryOp, GPU, "LessEqual", functor::less_equal, float, Eigen::half,
           double, int64, uint8, int8, int16);
diff --git a/tensorflow/core/kernels/cwise_op_minimum.cc b/tensorflow/core/kernels/cwise_op_minimum.cc
index dff83df828f076a076a8f220d04974344d8ffafc..9bc37003879f077288dfc058996e9b0b4162d16e 100644
--- a/tensorflow/core/kernels/cwise_op_minimum.cc
+++ b/tensorflow/core/kernels/cwise_op_minimum.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER5(BinaryOp, CPU, "Minimum", functor::minimum, float, Eigen::half,
-          double, int32, int64);
+REGISTER6(BinaryOp, CPU, "Minimum", functor::minimum, float, Eigen::half,
+          bfloat16, double, int32, int64);
 #if GOOGLE_CUDA
 REGISTER4(BinaryOp, GPU, "Minimum", functor::minimum, float, Eigen::half,
           double, int64);
diff --git a/tensorflow/core/kernels/cwise_op_mul_1.cc b/tensorflow/core/kernels/cwise_op_mul_1.cc
index 0e8d2e37350dbbb942bd5ed6b16392b6288313fe..cff0407b83a4bafd27573325615322f92e594d46 100644
--- a/tensorflow/core/kernels/cwise_op_mul_1.cc
+++ b/tensorflow/core/kernels/cwise_op_mul_1.cc
@@ -17,8 +17,8 @@ limitations under the License.
 
 namespace tensorflow {
 
-REGISTER5(BinaryOp, CPU, "Mul", functor::mul, float, Eigen::half, double, uint8,
-          int32);
+REGISTER6(BinaryOp, CPU, "Mul", functor::mul, float, Eigen::half, double, uint8,
+          int32, bfloat16);
 #if defined(__ANDROID_TYPES_SLIM__)
 // We only register the first type when we have multi-argument calls in the
 // case where we're trying to reduce executable size, but it turns out that the
diff --git a/tensorflow/core/kernels/cwise_op_sqrt.cc b/tensorflow/core/kernels/cwise_op_sqrt.cc
index 497756133d05249141823481e6ef43b73a84660b..205070761f13cbfe6b509eea2d6b36c2f0f37f04 100644
--- a/tensorflow/core/kernels/cwise_op_sqrt.cc
+++ b/tensorflow/core/kernels/cwise_op_sqrt.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER5(UnaryOp, CPU, "Sqrt", functor::sqrt, float, Eigen::half, double,
-          complex64, complex128);
+REGISTER6(UnaryOp, CPU, "Sqrt", functor::sqrt, float, Eigen::half, double,
+          bfloat16, complex64, complex128);
 
 #if GOOGLE_CUDA
 REGISTER3(UnaryOp, GPU, "Sqrt", functor::sqrt, float, Eigen::half, double);
@@ -27,8 +27,8 @@ REGISTER3(UnaryOp, GPU, "Sqrt", functor::sqrt, float, Eigen::half, double);
 REGISTER2(UnaryOp, SYCL, "Sqrt", functor::sqrt, float, double);
 #endif  // TENSORFLOW_USE_SYCL
 
-REGISTER5(SimpleBinaryOp, CPU, "SqrtGrad", functor::sqrt_grad, float,
-          Eigen::half, double, complex64, complex128);
+REGISTER6(SimpleBinaryOp, CPU, "SqrtGrad", functor::sqrt_grad, float,
+          Eigen::half, bfloat16, double, complex64, complex128);
 #if GOOGLE_CUDA
 REGISTER3(SimpleBinaryOp, GPU, "SqrtGrad", functor::sqrt_grad, float,
           Eigen::half, double);
diff --git a/tensorflow/core/kernels/cwise_op_square.cc b/tensorflow/core/kernels/cwise_op_square.cc
index 7fc2f6bf08b2c825f471123e1ab58bd060f6070a..84f695ddc29d7f8d3afc12ea81515e80a8a75255 100644
--- a/tensorflow/core/kernels/cwise_op_square.cc
+++ b/tensorflow/core/kernels/cwise_op_square.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER7(UnaryOp, CPU, "Square", functor::square, float, Eigen::half, double,
-          int32, int64, complex64, complex128);
+REGISTER8(UnaryOp, CPU, "Square", functor::square, float, Eigen::half, double,
+          int32, int64, complex64, complex128, bfloat16);
 
 #if GOOGLE_CUDA
 REGISTER4(UnaryOp, GPU, "Square", functor::square, float, Eigen::half, double,
diff --git a/tensorflow/core/kernels/cwise_op_sub.cc b/tensorflow/core/kernels/cwise_op_sub.cc
index 025041946ac71f0e8f4724f9432d5e2901e348cc..eb27bddb78dfd8679b010f7f2cb67d2049a22a4b 100644
--- a/tensorflow/core/kernels/cwise_op_sub.cc
+++ b/tensorflow/core/kernels/cwise_op_sub.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER7(BinaryOp, CPU, "Sub", functor::sub, float, Eigen::half, double, int32,
-          int64, complex64, complex128);
+REGISTER8(BinaryOp, CPU, "Sub", functor::sub, float, Eigen::half, double, int32,
+          int64, bfloat16, complex64, complex128);
 #if !defined(__ANDROID_TYPES_SLIM__)
 // Sub op for int8, uint8, int16, uint16
 REGISTER4(BinaryOp, CPU, "Sub", functor::sub, int8, uint8, int16, uint16);
diff --git a/tensorflow/core/kernels/cwise_ops_common.h b/tensorflow/core/kernels/cwise_ops_common.h
index 8295fa939ee1aabf78a7d7b7f4677d851b407573..e32eccf547e07b71678abf0e75ac20973ecbf380 100644
--- a/tensorflow/core/kernels/cwise_ops_common.h
+++ b/tensorflow/core/kernels/cwise_ops_common.h
@@ -20,6 +20,8 @@ limitations under the License.
 
 #define EIGEN_USE_THREADS
 
+#include "tensorflow/core/lib/bfloat16/bfloat16.h"
+
 #ifdef TENSORFLOW_USE_SYCL
 #include "tensorflow/core/kernels/cwise_ops_sycl_common.h"
 #endif
diff --git a/tensorflow/core/kernels/data/BUILD b/tensorflow/core/kernels/data/BUILD
index 253399c1e4ec7fe8edeeeee161ef3413d1dbea09..01754ec21acd2196dd907747da45071022bcebc9 100644
--- a/tensorflow/core/kernels/data/BUILD
+++ b/tensorflow/core/kernels/data/BUILD
@@ -113,6 +113,19 @@ tf_kernel_library(
     ],
 )
 
+tf_kernel_library(
+    name = "slide_dataset_op",
+    srcs = ["slide_dataset_op.cc"],
+    deps = [
+        ":dataset",
+        "//tensorflow/core:dataset_ops_op_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core/kernels:batch_util",
+    ],
+)
+
 tf_kernel_library(
     name = "padded_batch_dataset_op",
     srcs = ["padded_batch_dataset_op.cc"],
@@ -162,6 +175,7 @@ tf_kernel_library(
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:dataset_ops_op_lib",
         "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
     ],
 )
@@ -537,6 +551,7 @@ tf_kernel_library(
         ":scan_dataset_op",
         ":shuffle_dataset_op",
         ":skip_dataset_op",
+        ":slide_dataset_op",
         ":sparse_tensor_slice_dataset_op",
         ":sql_dataset_ops",
         ":stats_aggregator_ops",
diff --git a/tensorflow/core/kernels/data/cache_dataset_ops.cc b/tensorflow/core/kernels/data/cache_dataset_ops.cc
index f0a2192826e051586e4999d729c24ed5495be0ea..4b4728dab68523aa54176bdce6222a7aa5f8e9d3 100644
--- a/tensorflow/core/kernels/data/cache_dataset_ops.cc
+++ b/tensorflow/core/kernels/data/cache_dataset_ops.cc
@@ -308,6 +308,21 @@ class CacheDatasetOp : public UnaryDatasetOpKernel {
             input_impl_(params.dataset->input_->MakeIterator(params.prefix)),
             cache_(new std::vector<std::vector<Tensor>>) {}
 
+      ~MemoryWriterIterator() override {
+        mutex_lock l(mu_);
+        if (cache_) {
+          LOG(ERROR)
+              << "The calling iterator did not fully read the dataset we were "
+                 "attempting to cache. In order to avoid unexpected truncation "
+                 "of the sequence, the current [partially cached] sequence "
+                 "will be dropped. This can occur if you have a sequence "
+                 "similar to `dataset.cache().take(k).repeat()`. Instead, swap "
+                 "the order (i.e. `dataset.take(k).cache().repeat()`)";
+          mutex_lock l2(dataset()->mu_);
+          dataset()->writer_iterator_created_ = false;
+        }
+      }
+
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
@@ -318,7 +333,7 @@ class CacheDatasetOp : public UnaryDatasetOpKernel {
           // Guard on cache_ to not crash if GetNext is called a second time
           // after *end_of_sequence == true
           if (cache_) {
-            mutex_lock l2(dataset()->mu_);
+            mutex_lock l(dataset()->mu_);
             DCHECK(dataset()->writer_iterator_created_);
             DCHECK(!dataset()->cache_);
             cache_.swap(dataset()->cache_);
diff --git a/tensorflow/core/kernels/data/filter_dataset_op.cc b/tensorflow/core/kernels/data/filter_dataset_op.cc
index d16b5b7d416b85695287ccbab4bc4398a222c139..186b1e1c6c5a3d5a4aa8e4eb31d474aac4156243 100644
--- a/tensorflow/core/kernels/data/filter_dataset_op.cc
+++ b/tensorflow/core/kernels/data/filter_dataset_op.cc
@@ -17,6 +17,7 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/kernels/data/captured_function.h"
 #include "tensorflow/core/kernels/data/dataset.h"
+#include "tensorflow/core/lib/gtl/cleanup.h"
 #include "tensorflow/core/lib/random/random.h"
 
 namespace tensorflow {
@@ -44,21 +45,45 @@ class FilterDatasetOp : public UnaryDatasetOpKernel {
       other_arguments.push_back(t);
     }
 
+    FunctionLibraryRuntime::Handle pred_handle;
+    OP_REQUIRES_OK(ctx,
+                   ctx->function_library()->Instantiate(
+                       func_.name(), AttrSlice(&func_.attr()), &pred_handle));
+    auto cleanup = gtl::MakeCleanup([ctx, pred_handle]() {
+      OP_REQUIRES_OK(ctx, ctx->function_library()->ReleaseHandle(pred_handle));
+    });
+
+    const FunctionBody* pred_body =
+        ctx->function_library()->GetFunctionBody(pred_handle);
+    OP_REQUIRES(ctx, pred_body->ret_nodes.size() == 1,
+                errors::InvalidArgument(
+                    "predicate function must have a single return value."));
+    Node* ret_node = pred_body->ret_nodes[0];
+    Node* ret_input_node;
+    OP_REQUIRES_OK(ctx, ret_node->input_node(0, &ret_input_node));
     std::unique_ptr<CapturedFunction> captured_func;
     OP_REQUIRES_OK(ctx, CapturedFunction::Create(
                             func_, std::move(other_arguments), &captured_func));
 
-    *output = new Dataset(ctx, input, func_, std::move(captured_func));
+    if (ret_input_node->def().op() == "_Arg") {
+      int32 index = -1;
+      OP_REQUIRES_OK(ctx, GetNodeAttr(ret_input_node->def(), "index", &index));
+      *output = new FilterTensorDataset(ctx, input, func_,
+                                        std::move(captured_func), index);
+    } else {
+      *output = new FilterFunctionDataset(ctx, input, func_,
+                                          std::move(captured_func));
+    }
   }
 
  private:
   const int graph_def_version_;
 
-  class Dataset : public GraphDatasetBase {
+  class FilterDatasetBase : public GraphDatasetBase {
    public:
-    Dataset(OpKernelContext* ctx, const DatasetBase* input,
-            const NameAttrList& func,
-            std::unique_ptr<CapturedFunction> captured_func)
+    FilterDatasetBase(OpKernelContext* ctx, const DatasetBase* input,
+                      const NameAttrList& func,
+                      std::unique_ptr<CapturedFunction> captured_func)
         : GraphDatasetBase(ctx),
           input_(input),
           func_(func),
@@ -66,7 +91,7 @@ class FilterDatasetOp : public UnaryDatasetOpKernel {
       input_->Ref();
     }
 
-    ~Dataset() override { input_->Unref(); }
+    ~FilterDatasetBase() override { input_->Unref(); }
 
     std::unique_ptr<IteratorBase> MakeIterator(
         const string& prefix) const override {
@@ -112,11 +137,15 @@ class FilterDatasetOp : public UnaryDatasetOpKernel {
       return Status::OK();
     }
 
+    virtual Status EvaluatePredicate(IteratorContext* ctx,
+                                     const std::vector<Tensor>& element,
+                                     bool* out_matched) const = 0;
+
    private:
-    class Iterator : public DatasetIterator<Dataset> {
+    class Iterator : public DatasetIterator<FilterDatasetBase> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
+          : DatasetIterator<FilterDatasetBase>(params),
             input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
 
       Status GetNextInternal(IteratorContext* ctx,
@@ -143,18 +172,8 @@ class FilterDatasetOp : public UnaryDatasetOpKernel {
             return Status::OK();
           }
 
-          // TODO(mrry): Avoid blocking a threadpool thread. We will need to
-          // stack-rip the iterators and use async kernels.
-          std::vector<Tensor> result;
-          TF_RETURN_IF_ERROR(dataset()->captured_func_->RunWithBorrowedArgs(
-              ctx, *out_tensors, &result));
-
-          if (result.size() != 1 || result[0].dtype() != DT_BOOL ||
-              result[0].NumElements() != 1) {
-            return errors::InvalidArgument(
-                "Filter predicate `f` must return a scalar bool.");
-          }
-          matched = result[0].scalar<bool>()();
+          TF_RETURN_IF_ERROR(
+              dataset()->EvaluatePredicate(ctx, *out_tensors, &matched));
           if (!matched) {
             // Clear the output tensor list since it didn't match.
             out_tensors->clear();
@@ -192,9 +211,61 @@ class FilterDatasetOp : public UnaryDatasetOpKernel {
 
     const DatasetBase* const input_;
     const NameAttrList func_;
+
+   protected:
     const std::unique_ptr<CapturedFunction> captured_func_;
   };
 
+  class FilterFunctionDataset : public FilterDatasetBase {
+   public:
+    using FilterDatasetBase::FilterDatasetBase;
+
+   protected:
+    Status EvaluatePredicate(IteratorContext* ctx,
+                             const std::vector<Tensor>& element,
+                             bool* out_matched) const override {
+      // TODO(mrry): Avoid blocking a threadpool thread. We will need to
+      // stack-rip the iterators and use async kernels.
+      std::vector<Tensor> result;
+      TF_RETURN_IF_ERROR(
+          captured_func_->RunWithBorrowedArgs(ctx, element, &result));
+
+      if (result.size() != 1 || result[0].dtype() != DT_BOOL ||
+          result[0].NumElements() != 1) {
+        return errors::InvalidArgument(
+            "Filter predicate `f` must return a scalar bool.");
+      }
+      *out_matched = result[0].scalar<bool>()();
+      return Status::OK();
+    }
+  };
+
+  class FilterTensorDataset : public FilterDatasetBase {
+   public:
+    FilterTensorDataset(OpKernelContext* ctx, const DatasetBase* input,
+                        const NameAttrList& func,
+                        std::unique_ptr<CapturedFunction> captured_func,
+                        int32 index)
+        : FilterDatasetBase(ctx, input, func, std::move(captured_func)),
+          index_(index) {}
+
+   protected:
+    Status EvaluatePredicate(IteratorContext* ctx,
+                             const std::vector<Tensor>& element,
+                             bool* out_matched) const override {
+      const Tensor& predicate = element[index_];
+      if (predicate.dtype() != DT_BOOL || predicate.NumElements() != 1) {
+        return errors::InvalidArgument(
+            "Filter predicate `f` must return a scalar bool.");
+      }
+      *out_matched = predicate.scalar<bool>()();
+      return Status::OK();
+    }
+
+   private:
+    const int32 index_;
+  };
+
  private:
   NameAttrList func_;
 };
diff --git a/tensorflow/core/kernels/data/iterator_ops.cc b/tensorflow/core/kernels/data/iterator_ops.cc
index d7d4ad5cf7f6d5a3386be524c7a227006da0b3f4..780f927a4f1b3a14cd1ed2acb1b41895d96e3950 100644
--- a/tensorflow/core/kernels/data/iterator_ops.cc
+++ b/tensorflow/core/kernels/data/iterator_ops.cc
@@ -141,14 +141,20 @@ class IteratorResource : public ResourceBase {
     std::vector<Tensor> outputs;
     GraphRunner graph_runner(ctx->env());
 
-    // Build a new FLR that knows about the functions in the graph.
-    std::shared_ptr<FunctionLibraryDefinition> flib_def(
-        new FunctionLibraryDefinition(
-            *ctx->function_library()->GetFunctionLibraryDefinition()));
+    // Build a new FLR that knows about the functions in the graph, and use
+    // it for all operations on the restored iterator.
+    // NOTE(mrry): We clone the existing FLR and use it in the GraphRunner
+    // because some of the OpKernels in the graph might call functions that are
+    // only defined in the loaded GraphDef.
+    FunctionLibraryRuntime* lib;
+    std::unique_ptr<DeviceMgr> device_mgr(nullptr);
+    std::unique_ptr<FunctionLibraryDefinition> flib_def(nullptr);
+    std::unique_ptr<ProcessFunctionLibraryRuntime> pflr(nullptr);
+    TF_RETURN_IF_ERROR(ctx->function_library()->Clone(&flib_def, &pflr, &lib));
     TF_RETURN_IF_ERROR(flib_def->AddLibrary(graph_def.library()));
 
     TF_RETURN_IF_ERROR(
-        graph_runner.Run(&graph, lib_, {}, {output_node}, &outputs));
+        graph_runner.Run(&graph, lib, {}, {output_node}, &outputs));
     TF_RETURN_IF_ERROR(GetDatasetFromVariantTensor(outputs[0], &dataset));
 
     TF_RETURN_IF_ERROR(set_iterator(dataset->MakeIterator("Iterator")));
@@ -158,9 +164,8 @@ class IteratorResource : public ResourceBase {
       IteratorContext::Params params;
       params.env = ctx->env();
       params.runner = *(ctx->runner());
-      params.function_library = flib_def;
-      params.lib = lib_;
-      DeviceBase* device = lib_->device();
+      params.lib = lib;
+      DeviceBase* device = lib->device();
       params.allocator_getter = [device](AllocatorAttributes attrs) {
         return device->GetAllocator(attrs);
       };
@@ -168,7 +173,10 @@ class IteratorResource : public ResourceBase {
 
       TF_RETURN_IF_ERROR(captured_iterator->Restore(&iter_ctx, reader));
       mutex_lock l(mu_);
+      device_mgr_ = std::move(device_mgr);
       lib_def_ = std::move(flib_def);
+      pflr_ = std::move(pflr);
+      lib_ = lib;
       return Status::OK();
     } else {
       return errors::FailedPrecondition(
@@ -859,9 +867,7 @@ class IteratorGetNextOp : public AsyncOpKernel {
     // inter-op thread pool thread, so we issue the call from the
     // owned thread pool.
     thread_pool_->Schedule(std::bind(
-        [this, ctx, iterator](DoneCallback done) {
-          core::ScopedUnref unref_iterator(iterator);
-
+        [ctx, iterator](DoneCallback done) {
           std::vector<Tensor> components;
           bool end_of_sequence = false;
 
@@ -878,17 +884,22 @@ class IteratorGetNextOp : public AsyncOpKernel {
           };
           IteratorContext iter_ctx(std::move(params));
 
-          OP_REQUIRES_OK_ASYNC(
-              ctx, iterator->GetNext(&iter_ctx, &components, &end_of_sequence),
-              done);
-          OP_REQUIRES_ASYNC(ctx, !end_of_sequence,
-                            errors::OutOfRange("End of sequence"), done);
-
-          for (int i = 0; i < components.size(); ++i) {
-            // TODO(mrry): Check that the shapes match the shape attrs.
-            ctx->set_output(i, components[i]);
+          Status s =
+              iterator->GetNext(&iter_ctx, &components, &end_of_sequence);
+          // NOTE(mrry): We must unref the iterator before calling `done()`, to
+          // avoid destruction races.
+          iterator->Unref();
+
+          if (!s.ok()) {
+            ctx->SetStatus(s);
+          } else if (end_of_sequence) {
+            ctx->SetStatus(errors::OutOfRange("End of sequence"));
+          } else {
+            for (int i = 0; i < components.size(); ++i) {
+              // TODO(mrry): Check that the shapes match the shape attrs.
+              ctx->set_output(i, components[i]);
+            }
           }
-
           done();
         },
         std::move(done)));
diff --git a/tensorflow/core/kernels/data/map_and_batch_dataset_op.cc b/tensorflow/core/kernels/data/map_and_batch_dataset_op.cc
index 9ce263732f6e6c907dfdc89692455daa5dca86d1..aaf4dc734183968359b03819bfc04ae544c8877a 100644
--- a/tensorflow/core/kernels/data/map_and_batch_dataset_op.cc
+++ b/tensorflow/core/kernels/data/map_and_batch_dataset_op.cc
@@ -66,12 +66,16 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
                 errors::InvalidArgument(
                     "num_parallel_batches must be greater than zero."));
 
+    bool drop_remainder;
+    OP_REQUIRES_OK(ctx,
+                   ParseScalarArgument(ctx, "drop_remainder", &drop_remainder));
+
     std::unique_ptr<CapturedFunction> captured_func;
     OP_REQUIRES_OK(ctx, CapturedFunction::Create(
                             func_, std::move(other_arguments), &captured_func));
 
     *output = new Dataset(input, batch_size, num_parallel_batches,
-                          output_types_, output_shapes_,
+                          drop_remainder, output_types_, output_shapes_,
                           std::move(captured_func), &ctx->eigen_cpu_device());
   }
 
@@ -79,13 +83,15 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
   class Dataset : public DatasetBase {
    public:
     Dataset(const DatasetBase* input, int64 batch_size,
-            int64 num_parallel_batches, const DataTypeVector& output_types,
+            int64 num_parallel_batches, bool drop_remainder,
+            const DataTypeVector& output_types,
             const std::vector<PartialTensorShape>& output_shapes,
             std::unique_ptr<CapturedFunction> captured_func,
             const Eigen::ThreadPoolDevice* device)
         : input_(input),
           batch_size_(batch_size),
           num_parallel_batches_(num_parallel_batches),
+          drop_remainder_(drop_remainder),
           output_types_(output_types),
           output_shapes_(output_shapes),
           captured_func_(std::move(captured_func)),
@@ -177,13 +183,21 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
           batch_results_[current_batch_index_].output.clear();
         } else {
           if (num_elements < dataset()->batch_size_) {
+            if (dataset()->drop_remainder_) {
+              // Deallocate tensors allocated for the output.
+              batch_results_[current_batch_index_].output.clear();
+              *end_of_sequence = true;
+              return Status::OK();
+            }
             const std::vector<Tensor>& output =
                 batch_results_[current_batch_index_].output;
             for (size_t i = 0; i < output.size(); ++i) {
               TensorShape component_shape(
                   batch_results_[current_batch_index_].output[i].shape());
               component_shape.set_dim(0, num_elements);
-              Tensor component(ctx->allocator({}), output[i].dtype(),
+              AllocatorAttributes attr;
+              attr.set_gpu_compatible(true);
+              Tensor component(ctx->allocator(attr), output[i].dtype(),
                                component_shape);
               TF_RETURN_IF_ERROR(
                   CopyPartialBatch(&component, output[i], num_elements));
@@ -255,7 +269,9 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
         for (size_t i = 0; i < num_components; ++i) {
           TensorShape component_shape({dataset()->batch_size_});
           component_shape.AppendShape(return_values[i].shape());
-          Tensor component(ctx->allocator({}), return_values[i].dtype(),
+          AllocatorAttributes attr;
+          attr.set_gpu_compatible(true);
+          Tensor component(ctx->allocator(attr), return_values[i].dtype(),
                            component_shape);
           batch_result->output.emplace_back(std::move(component));
         }
@@ -388,6 +404,7 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
     const NameAttrList func_;
     const int64 batch_size_;
     const int64 num_parallel_batches_;
+    const bool drop_remainder_;
     const DataTypeVector output_types_;
     const std::vector<PartialTensorShape> output_shapes_;
     const std::unique_ptr<CapturedFunction> captured_func_;
diff --git a/tensorflow/core/kernels/data/parallel_map_dataset_op.cc b/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
index 33053b1bd9d7878016ebaf96b75c5c4b30130c4b..7e373f25686899ec8599fc064f9cf7beb3ebfe95 100644
--- a/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
+++ b/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
@@ -318,7 +318,7 @@ class ParallelMapDatasetOp : public UnaryDatasetOpKernel {
 
         // Get the next input element.
         std::vector<Tensor> input_element;
-        bool end_of_input;
+        bool end_of_input = false;
         result->status =
             input_impl_->GetNext(ctx, &input_element, &end_of_input);
         if (end_of_input) {
diff --git a/tensorflow/core/kernels/data/slide_dataset_op.cc b/tensorflow/core/kernels/data/slide_dataset_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4f3537b6912bffb1c76d8a315790b4e5585a02d7
--- /dev/null
+++ b/tensorflow/core/kernels/data/slide_dataset_op.cc
@@ -0,0 +1,252 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/framework/partial_tensor_shape.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/kernels/batch_util.h"
+#include "tensorflow/core/kernels/data/dataset.h"
+
+namespace tensorflow {
+
+namespace {
+
+// See documentation in ../ops/dataset_ops.cc for a high-level
+// description of the following op.
+
+class SlideDatasetOp : public UnaryDatasetOpKernel {
+ public:
+  explicit SlideDatasetOp(OpKernelConstruction* ctx)
+      : UnaryDatasetOpKernel(ctx) {}
+
+  void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
+                   DatasetBase** output) override {
+    int64 window_size = 0;
+    int64 stride = 1;
+    OP_REQUIRES_OK(ctx,
+                   ParseScalarArgument<int64>(ctx, "window_size", &window_size));
+    OP_REQUIRES_OK(ctx,
+                   ParseScalarArgument<int64>(ctx, "stride", &stride));
+    OP_REQUIRES(
+        ctx, window_size > 0,
+        errors::InvalidArgument("Window size must be greater than zero."));
+    OP_REQUIRES(
+        ctx, stride > 0 && stride < window_size,
+        errors::InvalidArgument("Stride must be in [1, window_size)."));
+
+    *output = new Dataset(ctx, window_size, stride, input);
+  }
+
+ private:
+  class Dataset : public GraphDatasetBase {
+   public:
+    Dataset(OpKernelContext* ctx, int64 window_size, int64 stride, const DatasetBase* input)
+        : GraphDatasetBase(ctx), window_size_(window_size), stride_(stride), input_(input) {
+      input_->Ref();
+
+      const auto& input_shapes = input_->output_shapes();
+      output_shapes_.reserve(input_shapes.size());
+      for (const auto& input_shape : input_shapes) {
+        output_shapes_.emplace_back(
+            PartialTensorShape({-1}).Concatenate(input_shape));
+      }
+    }
+
+    ~Dataset() override { input_->Unref(); }
+
+    std::unique_ptr<IteratorBase> MakeIterator(
+        const string& prefix) const override {
+      return std::unique_ptr<IteratorBase>(new Iterator(
+          Iterator::Params{this, strings::StrCat(prefix, "::Slide")}));
+    }
+
+    const DataTypeVector& output_dtypes() const override {
+      return input_->output_dtypes();
+    }
+
+    const std::vector<PartialTensorShape>& output_shapes() const override {
+      return output_shapes_;
+    }
+
+    string DebugString() override {
+      return strings::StrCat("SlideDatasetOp(", window_size_, ", ", stride_, ")::Dataset");
+    }
+
+   protected:
+    Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
+                              Node** output) const override {
+      Node* input_graph_node = nullptr;
+      TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph_node));
+      Node* window_size = nullptr;
+      Node* stride = nullptr;
+      TF_RETURN_IF_ERROR(b->AddScalar(window_size_, &window_size));
+      TF_RETURN_IF_ERROR(b->AddScalar(stride_, &stride));
+      TF_RETURN_IF_ERROR(
+          b->AddDataset(this, {input_graph_node, window_size, stride}, output));
+      return Status::OK();
+    }
+
+   private:
+
+    class Iterator : public DatasetIterator<Dataset> {
+     public:
+      explicit Iterator(const Params& params)
+          : DatasetIterator<Dataset>(params),
+            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+
+      Status GetNextInternal(IteratorContext* ctx,
+                             std::vector<Tensor>* out_tensors,
+                             bool* end_of_sequence) override {
+        const int64 window_size = dataset()->window_size_;
+        const int64 stride = dataset()->stride_;
+        std::vector<std::vector<Tensor>> batch_elements;
+        {
+          mutex_lock l(mu_);
+          if (!input_impl_) {
+            *end_of_sequence = true;
+            return Status::OK();
+          }
+          batch_elements.reserve(window_size);
+          const bool first_call = cache_.empty();
+          if (first_call) {
+            cache_.reserve(window_size);
+          } else {
+            // Reuse the elements cached by the previous iteration.
+            cache_.swap(batch_elements);
+          }
+          // Fill up with new elements.
+          *end_of_sequence = false;
+          for (size_t i = batch_elements.size(); i < window_size && !*end_of_sequence;
+              ++i) {
+            std::vector<Tensor> batch_element_tuple;
+            TF_RETURN_IF_ERROR(input_impl_->GetNext(ctx, &batch_element_tuple,
+                                                    end_of_sequence));
+            if (!*end_of_sequence) {
+              batch_elements.push_back(std::move(batch_element_tuple));
+            } else {
+              input_impl_.reset();
+            }
+          }
+          // Drop the final window if it has fewer than window_size elements.
+          if (batch_elements.size() < window_size) {
+            DCHECK(*end_of_sequence);
+            return Status::OK();
+          }
+          // Cache the data used for the next iteration.
+          for (size_t i = stride; i < window_size; ++i) {
+            cache_.emplace_back(batch_elements[i]);
+          }
+        }
+
+        // Construct output tensors.
+        // The code below is copied from batch_dataset_op.cc.
+        const size_t num_tuple_components = batch_elements[0].size();
+        const int64 num_batch_elements = batch_elements.size();
+        for (size_t component_index = 0; component_index < num_tuple_components;
+             ++component_index) {
+          const Tensor& first_element = batch_elements[0][component_index];
+          TensorShape batch_component_shape({num_batch_elements});
+          batch_component_shape.AppendShape(first_element.shape());
+          Tensor batch_component(cpu_allocator(), first_element.dtype(),
+                                 batch_component_shape);
+          // Build the output tuple component by copying one slice
+          // from each input element in the batch.
+          for (size_t i = 0; i < num_batch_elements; ++i) {
+            if (batch_elements[i][component_index].shape() !=
+                first_element.shape()) {
+              return errors::InvalidArgument(
+                  "Cannot batch tensors with different shapes in component ",
+                  component_index, ". First element had shape ",
+                  first_element.shape().DebugString(), " and element ", i,
+                  " had shape ",
+                  batch_elements[i][component_index].shape().DebugString(),
+                  ".");
+            }
+            TF_RETURN_IF_ERROR(batch_util::CopyElementToSlice(
+                std::move(batch_elements[i][component_index]), &batch_component,
+                i));
+          }
+          out_tensors->emplace_back(std::move(batch_component));
+        }
+        *end_of_sequence = false;
+        return Status::OK();
+      }
+
+     protected:
+      Status SaveInternal(IteratorStateWriter* writer) override {
+        mutex_lock l(mu_);
+        if (!input_impl_) {
+          TF_RETURN_IF_ERROR(
+              writer->WriteScalar(full_name("input_impl_empty"), ""));
+        } else {
+          TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
+        }
+        // Save cache.
+        TF_RETURN_IF_ERROR(
+            writer->WriteScalar(strings::StrCat("cache_size"), cache_.size()));
+        for (int64 i = 0; i < cache_.size(); i++) {
+          TF_RETURN_IF_ERROR(writer->WriteScalar(
+              strings::StrCat("cache[", i, "]_size"), cache_[i].size()));
+          for (int64 j = 0; j < cache_[i].size(); j++) {
+            TF_RETURN_IF_ERROR(writer->WriteTensor(
+                strings::StrCat("cache[", i, "][", j, "]"), cache_[i][j]));
+          }
+        }
+        return Status::OK();
+      }
+
+      Status RestoreInternal(IteratorContext* ctx,
+                             IteratorStateReader* reader) override {
+        mutex_lock l(mu_);
+        if (!reader->Contains(full_name("input_impl_empty"))) {
+          TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
+        } else {
+          input_impl_.reset();
+        }
+        // Restore cache.
+        int64 cache_size;
+        TF_RETURN_IF_ERROR(
+            reader->ReadScalar(strings::StrCat("cache_size"), &cache_size));
+        cache_.resize(cache_size);
+        for (int64 i = 0; i < cache_size; i++) {
+          int64 vector_size;
+          TF_RETURN_IF_ERROR(reader->ReadScalar(
+              strings::StrCat("cache[", i, "]_size"), &vector_size));
+          cache_[i].resize(vector_size);
+          for (int64 j = 0; j < vector_size; j++) {
+            TF_RETURN_IF_ERROR(reader->ReadTensor(
+                strings::StrCat("cache[", i, "][", j, "]"), &cache_[i][j]));
+          }
+        }
+        return Status::OK();
+      }
+
+     private:
+      mutex mu_;
+      std::vector<std::vector<Tensor>> cache_ GUARDED_BY(mu_);
+      std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
+    };
+
+    const int64 window_size_;
+    const int64 stride_;
+    const DatasetBase* const input_;
+    std::vector<PartialTensorShape> output_shapes_;
+  };
+};
+
+REGISTER_KERNEL_BUILDER(Name("SlideDataset").Device(DEVICE_CPU),
+                        SlideDatasetOp);
+
+}  // namespace
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/stats_aggregator_ops.cc b/tensorflow/core/kernels/data/stats_aggregator_ops.cc
index 5a2dd9c43dbcbf5250d4dcd4bd803ed4979999e0..17103627e0749d14215bf28fec2489b110308526 100644
--- a/tensorflow/core/kernels/data/stats_aggregator_ops.cc
+++ b/tensorflow/core/kernels/data/stats_aggregator_ops.cc
@@ -47,7 +47,7 @@ class StatsAggregatorImpl : public StatsAggregator {
       Summary::Value* value = out_summary->add_value();
       value->set_tag(name);
       histogram.EncodeToProto(value->mutable_histo(),
-                              true /* preserve_zero_buckets */);
+                              false /* doesn't preserve zero buckets */);
     }
   }
 
diff --git a/tensorflow/core/kernels/data_format_ops.cc b/tensorflow/core/kernels/data_format_ops.cc
index fa67545a0dad0332cce55c173fc39ba25c055902..39ef8ee3ac429e1db96692eb9302616ed9ba61db 100644
--- a/tensorflow/core/kernels/data_format_ops.cc
+++ b/tensorflow/core/kernels/data_format_ops.cc
@@ -67,14 +67,8 @@ class DataFormatVecPermuteOp : public OpKernel {
     OP_REQUIRES_OK(context, context->GetAttr("src_format", &src_format));
     string dst_format;
     OP_REQUIRES_OK(context, context->GetAttr("dst_format", &dst_format));
-    OP_REQUIRES(context,
-                (src_format == "NHWC" && dst_format == "NCHW") ||
-                    (src_format == "NCHW" && dst_format == "NHWC"),
-                errors::InvalidArgument(strings::StrCat(
-                    "Current implementation only supports NCHW-to-NHWC and "
-                    "NHWC-to-NCHW format conversion; got source format ",
-                    src_format, " and destination format ", dst_format)));
-    nhwc_to_nchw_ = (src_format == "NHWC") ? true : false;
+    src_format_ = src_format;
+    dst_format_ = dst_format;
   }
 
   void Compute(OpKernelContext* context) override {
@@ -104,13 +98,34 @@ class DataFormatVecPermuteOp : public OpKernel {
     Tensor* output = nullptr;
     OP_REQUIRES_OK(context,
                    context->allocate_output(0, input.shape(), &output));
-    functor::DataFormatVecPermute<Device, T>()(
-        context->eigen_device<Device>(), input.flat<T>(), output->flat<T>(),
-        nhwc_to_nchw_);
+    // Support 1D and 2D cases.
+    Eigen::DSizes<Eigen::DenseIndex, 8> dst_idx;
+    ComputeDstIndex(input.dims(), &dst_idx);
+
+    functor::DataFormatVecPermute<Device, T>()(context->eigen_device<Device>(),
+                                               input.flat<T>(),
+                                               output->flat<T>(), dst_idx);
   }
 
  private:
-  bool nhwc_to_nchw_;
+  // Computes the destination index. Supports the 1D and 2D cases.
+  // Example: HWNC --> NHWC
+  // 1D: dst = [1, 2, 0, 3],
+  // 2D: dst = [2, 3, 4, 5, 0, 1, 6, 7]
+  void ComputeDstIndex(int num_dim, Eigen::DSizes<Eigen::DenseIndex, 8>* dst) {
+    for (int i = 0; i < src_format_.size(); ++i) {
+      for (int j = 0; j < dst_format_.size(); ++j) {
+        if (dst_format_[j] != src_format_[i]) continue;
+        // Found the dst index. Set output based on the number of dims.
+        for (int k = 0; k < num_dim; ++k) {
+          (*dst)[i * num_dim + k] = j * num_dim + k;
+        }
+      }
+    }
+  }
+
+  string src_format_;
+  string dst_format_;
 };
 
 #define REGISTER_KERNEL(T)                                                \
@@ -147,7 +162,8 @@ TF_CALL_int64(DECLARE_GPU_SPECS);
   template <>                                              \
   void DataFormatVecPermute<GPUDevice, T>::operator()(     \
       const GPUDevice& d, typename TTypes<T>::ConstFlat x, \
-      typename TTypes<T>::Vec y, bool nhwc_to_nchw);       \
+      typename TTypes<T>::Vec y,                           \
+      const Eigen::DSizes<Eigen::DenseIndex, 8>& dst_idx); \
   extern template struct DataFormatVecPermute<GPUDevice, T>;
 #define DECLARE_GPU_SPECS(T) DECLARE_GPU_SPEC(T);
 TF_CALL_int32(DECLARE_GPU_SPECS);
diff --git a/tensorflow/core/kernels/data_format_ops.h b/tensorflow/core/kernels/data_format_ops.h
index bf704cc35cf2ff18b38202db5d192b460b415fbb..2ccc919586551cefa887718481277d4a0e673dbb 100644
--- a/tensorflow/core/kernels/data_format_ops.h
+++ b/tensorflow/core/kernels/data_format_ops.h
@@ -40,7 +40,8 @@ struct DataFormatDimMap {
 };
 
 template <typename T>
-struct VecPermuteNHWCToNCHW {
+struct VecPermute {
+  VecPermute(const Eigen::DSizes<Eigen::DenseIndex, 8>& dst) : dst_(dst) {}
   Eigen::DSizes<Eigen::DenseIndex, 1> dimensions(
       typename TTypes<T>::ConstFlat input) const {
     Eigen::DSizes<Eigen::DenseIndex, 1> result;
@@ -50,63 +51,22 @@ struct VecPermuteNHWCToNCHW {
   template <typename Output, typename Device>
   void eval(typename TTypes<T>::ConstFlat input, Output& output,
             const Device& d) const {
-    if (input.size() == 8) {
-      output.template chip<0>(0).device(d) = input.template chip<0>(0);
-      output.template chip<0>(1).device(d) = input.template chip<0>(1);
-      output.template chip<0>(2).device(d) = input.template chip<0>(6);
-      output.template chip<0>(3).device(d) = input.template chip<0>(7);
-      output.template chip<0>(4).device(d) = input.template chip<0>(2);
-      output.template chip<0>(5).device(d) = input.template chip<0>(3);
-      output.template chip<0>(6).device(d) = input.template chip<0>(4);
-      output.template chip<0>(7).device(d) = input.template chip<0>(5);
-    } else {
-      output.template chip<0>(0).device(d) = input.template chip<0>(0);
-      output.template chip<0>(1).device(d) = input.template chip<0>(3);
-      output.template chip<0>(2).device(d) = input.template chip<0>(1);
-      output.template chip<0>(3).device(d) = input.template chip<0>(2);
+    for (int i = 0; i < input.size(); ++i) {
+      output.template chip<0>(dst_[i]).device(d) = input.template chip<0>(i);
     }
   }
-};
 
-template <typename T>
-struct VecPermuteNCHWToNHWC {
-  Eigen::DSizes<Eigen::DenseIndex, 1> dimensions(
-      typename TTypes<T>::ConstFlat input) const {
-    Eigen::DSizes<Eigen::DenseIndex, 1> result;
-    result[0] = input.dimension(0);
-    return result;
-  }
-  template <typename Output, typename Device>
-  void eval(typename TTypes<T>::ConstFlat input, Output& output,
-            const Device& d) const {
-    if (input.size() == 8) {
-      output.template chip<0>(0).device(d) = input.template chip<0>(0);
-      output.template chip<0>(1).device(d) = input.template chip<0>(1);
-      output.template chip<0>(2).device(d) = input.template chip<0>(4);
-      output.template chip<0>(3).device(d) = input.template chip<0>(5);
-      output.template chip<0>(4).device(d) = input.template chip<0>(6);
-      output.template chip<0>(5).device(d) = input.template chip<0>(7);
-      output.template chip<0>(6).device(d) = input.template chip<0>(2);
-      output.template chip<0>(7).device(d) = input.template chip<0>(3);
-    } else {
-      output.template chip<0>(0).device(d) = input.template chip<0>(0);
-      output.template chip<0>(1).device(d) = input.template chip<0>(2);
-      output.template chip<0>(2).device(d) = input.template chip<0>(3);
-      output.template chip<0>(3).device(d) = input.template chip<0>(1);
-    }
-  }
+ private:
+  Eigen::DSizes<Eigen::DenseIndex, 8> dst_;
 };
 
 // Functor used by DataFormatVecPermuteOp to do the computations.
 template <typename Device, typename T>
 struct DataFormatVecPermute {
   void operator()(const Device& d, typename TTypes<T>::ConstFlat x,
-                  typename TTypes<T>::Flat y, bool nhwc_to_nchw) {
-    if (nhwc_to_nchw) {
-      y.device(d) = x.customOp(VecPermuteNHWCToNCHW<T>());
-    } else {
-      y.device(d) = x.customOp(VecPermuteNCHWToNHWC<T>());
-    }
+                  typename TTypes<T>::Flat y,
+                  const Eigen::DSizes<Eigen::DenseIndex, 8>& dst) {
+    y.device(d) = x.customOp(VecPermute<T>(dst));
   }
 };
 
diff --git a/tensorflow/core/kernels/depthtospace_op.cc b/tensorflow/core/kernels/depthtospace_op.cc
index 39aa3e9eb0772076b25922520995174fc45efc4e..b74a09e2cb8b83d791d7ae5e9bbecaa07ead7266 100644
--- a/tensorflow/core/kernels/depthtospace_op.cc
+++ b/tensorflow/core/kernels/depthtospace_op.cc
@@ -187,6 +187,9 @@ TF_CALL_ALL_TYPES(REGISTER);
 REGISTER_KERNEL_BUILDER(
     Name("DepthToSpace").Device(DEVICE_GPU).TypeConstraint<float>("T"),
     DepthToSpaceOp<GPUDevice, float>);
+REGISTER_KERNEL_BUILDER(
+    Name("DepthToSpace").Device(DEVICE_GPU).TypeConstraint<Eigen::half>("T"),
+    DepthToSpaceOp<GPUDevice, Eigen::half>);
 REGISTER_KERNEL_BUILDER(
     Name("DepthToSpace").Device(DEVICE_GPU).TypeConstraint<qint8>("T"),
     DepthToSpaceOp<GPUDevice, qint8>);
diff --git a/tensorflow/core/kernels/depthtospace_op_gpu.cu.cc b/tensorflow/core/kernels/depthtospace_op_gpu.cu.cc
index 7a66285383368bb28dd3d0cd2fc6ff360eb82f5b..0656081177e8673bdc8e603a832d96a8884bff45 100644
--- a/tensorflow/core/kernels/depthtospace_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/depthtospace_op_gpu.cu.cc
@@ -158,6 +158,9 @@ struct DepthToSpaceOpFunctor<GPUDevice, T, FORMAT_NHWC> {
 
     const int total_count =
         batch_size * output_height * output_width * output_depth;
+    if (total_count == 0) {
+      return;
+    }
     CudaLaunchConfig config = GetCudaLaunchConfig(total_count, d);
     D2S_NHWC<<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
         config.virtual_thread_count, input.data(), block_size, batch_size,
@@ -188,6 +191,9 @@ struct DepthToSpaceOpFunctor<GPUDevice, T, FORMAT_NCHW> {
       const int output_width = output.dimension(3);
       const int output_depth_by_input_area = output_depth * input_area;
       const int total_count = batch_size * output_depth_by_input_area;
+      if (total_count == 0) {
+        return;
+      }
       CudaLaunchConfig config = GetCudaLaunchConfig(total_count, d);
       switch (block_size) {
         case 2:
@@ -213,6 +219,9 @@ struct DepthToSpaceOpFunctor<GPUDevice, T, FORMAT_NCHW> {
 
     // Other block sizes are processed by the generic kernel.
     const int total_count = batch_size * input_depth_by_input_area;
+    if (total_count == 0) {
+      return;
+    }
     auto config = GetCudaLaunchConfig(total_count, d);
     D2S_NCHW<<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
         config.virtual_thread_count, input.data(), block_size, input_width,
@@ -229,6 +238,12 @@ struct DepthToSpaceOpFunctor<GPUDevice, T, FORMAT_NCHW> {
 template struct functor::DepthToSpaceOpFunctor<GPUDevice, float, FORMAT_NCHW>;
 template struct functor::DepthToSpaceOpFunctor<GPUDevice, float, FORMAT_NHWC>;
 
+// Instantiate the GPU implementations for Eigen::half.
+template struct functor::DepthToSpaceOpFunctor<GPUDevice, Eigen::half,
+                                               FORMAT_NCHW>;
+template struct functor::DepthToSpaceOpFunctor<GPUDevice, Eigen::half,
+                                               FORMAT_NHWC>;
+
 // NCHW_VECT_C with 4 x qint8 can be treated as NCHW int32.
 template struct functor::DepthToSpaceOpFunctor<GPUDevice, int32, FORMAT_NCHW>;
 
diff --git a/tensorflow/core/kernels/depthwise_conv_op.cc b/tensorflow/core/kernels/depthwise_conv_op.cc
index c060b2e14d2f03f990af5267260bd88fa01a2c81..6dedb1a61ef47ccc1fa902e7f69ea21db3392f39 100644
--- a/tensorflow/core/kernels/depthwise_conv_op.cc
+++ b/tensorflow/core/kernels/depthwise_conv_op.cc
@@ -241,7 +241,7 @@ struct LaunchDepthwiseConvOp<CPUDevice, T> {
 };
 
 // Extern template instantiated in conv_ops.cc.
-extern template class LaunchConv2DOp<CPUDevice, float>;
+extern template struct LaunchConv2DOp<CPUDevice, float>;
 
 #if GOOGLE_CUDA
 
@@ -251,7 +251,7 @@ extern template struct LaunchDepthwiseConvOp<GPUDevice, float>;
 extern template struct LaunchDepthwiseConvOp<GPUDevice, double>;
 
 // Extern template instantiated in conv_ops.cc.
-extern template class LaunchConv2DOp<GPUDevice, float>;
+extern template struct LaunchConv2DOp<GPUDevice, float>;
 
 #endif
 
diff --git a/tensorflow/core/kernels/eigen_pooling.h b/tensorflow/core/kernels/eigen_pooling.h
index 896c9957616037da4ead2dbda8cb2393eaea226f..2f83780525090c90a0a9cfa3268115daa6fbc89b 100644
--- a/tensorflow/core/kernels/eigen_pooling.h
+++ b/tensorflow/core/kernels/eigen_pooling.h
@@ -334,7 +334,8 @@ struct AvgPoolMeanReducer {
   }
 
   template <typename Packet>
-  void reducePacketWithType(T, const Packet& p, Packet* accum) {
+  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void reducePacketWithType(
+      T, const Packet& p, Packet* accum) {
     Packet skip_mask =
         pequal(p, pset1<Packet>(-Eigen::NumTraits<T>::highest()));
     (*accum) = padd<Packet>(*accum, psel(p, pset1<Packet>(0), skip_mask));
@@ -480,11 +481,9 @@ SpatialAvgPooling(const Input& input, DenseIndex patchRows,
                              Eigen::type2index<3> > >::type reduction_dims;
 #endif
   return input
-      .extract_image_patches(
-          patchRows, patchCols, strideRows, strideCols, in_strideRows,
-          in_strideCols, padding_type,
-          -Eigen::NumTraits<typename internal::remove_const<
-              typename internal::traits<Input>::Scalar>::type>::highest())
+      .extract_image_patches(patchRows, patchCols, strideRows, strideCols,
+                             in_strideRows, in_strideCols, padding_type,
+                             -Eigen::NumTraits<CoeffReturnType>::highest())
       .reduce(reduction_dims, mean_with_nan)
       .reshape(post_reduce_dims);
 }
diff --git a/tensorflow/core/kernels/eigen_spatial_convolutions.h b/tensorflow/core/kernels/eigen_spatial_convolutions.h
index 1acbe3a658070222e99ff874815db9a6b07d4565..a4dff4b91c5c7a991b432f113cae2e29ecdcab31 100644
--- a/tensorflow/core/kernels/eigen_spatial_convolutions.h
+++ b/tensorflow/core/kernels/eigen_spatial_convolutions.h
@@ -797,6 +797,188 @@ struct gemm_pack_rhs<
   }
 };
 
+// Template specialization for packet_size = 2. We must special-case packet
+// blocks with nr > packet_size, e.g. PacketBlock<Packet2d, 4>.
+template <typename NewDimension, DenseIndex Rows, DenseIndex Cols,
+          typename ArgType, typename Device, typename Scalar, typename Index,
+          typename nocontract_t, typename contract_t, bool inner_dim_contiguous,
+          bool inner_dim_reordered, int Alignment, int nr>
+struct gemm_pack_rhs<
+    Scalar, Index,
+    TensorContractionSubMapper<
+        Scalar, Index, Rhs,
+        TensorEvaluator<
+            const TensorReshapingOp<
+                NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >,
+            Device>,
+        nocontract_t, contract_t, 2, inner_dim_contiguous, inner_dim_reordered,
+        Alignment>,
+    nr, ColMajor, false, false> {
+  typedef TensorContractionSubMapper<
+      Scalar, Index, Rhs,
+      TensorEvaluator<
+          const TensorReshapingOp<
+              NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >,
+          Device>,
+      nocontract_t, contract_t, 2, inner_dim_contiguous, inner_dim_reordered,
+      Alignment>
+      SubMapper;
+  typedef SubMapper DataMapper;
+
+  EIGEN_DEVICE_FUNC
+  static inline Index ceil_div(Index a, Index b) { return (a + b - 1) / b; }
+
+  EIGEN_DEVICE_FUNC
+  EIGEN_DONT_INLINE void operator()(Scalar* block, const DataMapper& rhs,
+                                    Index depth, Index cols, Index stride = 0,
+                                    Index offset = 0) const {
+    eigen_assert(stride == 0);
+    eigen_assert(offset == 0);
+
+    EIGEN_STATIC_ASSERT((nr == 4), YOU_MADE_A_PROGRAMMING_MISTAKE);
+    typedef typename packet_traits<Scalar>::type Packet;
+
+    const int packet_size = 2;
+    const Index packet_cols4 = (cols / 4) * 4;
+    const Index peeled_k = (depth / packet_size) * packet_size;
+    const bool non_standard_patches = rhs.nonStandardPatches();
+
+    for (Index j2 = 0; j2 < packet_cols4; j2 += 4) {
+      const SubMapper dm0 = rhs.getLinearMapper(0, j2 + 0);
+      const SubMapper dm1 = rhs.getLinearMapper(0, j2 + 1);
+      const SubMapper dm2 = rhs.getLinearMapper(0, j2 + 2);
+      const SubMapper dm3 = rhs.getLinearMapper(0, j2 + 3);
+
+      Index k = 0;
+      if (!non_standard_patches) {
+        const Index patch_depth = rhs.patchDepth();
+        if ((patch_depth % packet_size) == 0) {
+          const Index patch_cols = rhs.patchCols();
+          const Index patch_rows = rhs.patchRows();
+
+          const Index startCol = rhs.colOffset();
+          const Index max_cols = std::min<Index>(
+              ceil_div(peeled_k, patch_rows * patch_depth) + startCol,
+              patch_cols);
+
+          for (Index c = startCol; c < max_cols; ++c) {
+            eigen_assert(k < peeled_k);
+            const Index startRow = (c == startCol) ? rhs.rowOffset() : 0;
+            const Index max_rows = std::min<Index>(
+                ceil_div(peeled_k - c * patch_rows * patch_depth, patch_depth) +
+                    startRow,
+                patch_rows);
+
+            const bool pad_col0 = dm0.padCol(c);
+            const bool pad_col1 = dm1.padCol(c);
+            const bool pad_col2 = dm2.padCol(c);
+            const bool pad_col3 = dm3.padCol(c);
+            for (Index r = startRow; r < max_rows; ++r) {
+              eigen_assert(k < peeled_k);
+              const bool pad0 = pad_col0 || dm0.padRow(r);
+              const bool pad1 = pad_col1 || dm1.padRow(r);
+              const bool pad2 = pad_col2 || dm2.padRow(r);
+              const bool pad3 = pad_col3 || dm3.padRow(r);
+
+              const Index idx0 = dm0.baseIndex(r, c);
+              const Index idx1 = dm1.baseIndex(r, c);
+              const Index idx2 = dm2.baseIndex(r, c);
+              const Index idx3 = dm3.baseIndex(r, c);
+
+              const Index startDepth =
+                  ((c == startCol) && (r == startRow)) ? rhs.depthOffset() : 0;
+              const Index max_depth =
+                  std::min<Index>(peeled_k - c * patch_rows * patch_depth -
+                                      r * patch_depth + startDepth,
+                                  patch_depth);
+              eigen_assert((max_depth - startDepth) % packet_size == 0);
+              for (Index d = startDepth; d < max_depth; d += packet_size) {
+                eigen_assert(k < peeled_k);
+                PacketBlock<Packet, 2> kernel0;
+                PacketBlock<Packet, 2> kernel1;
+                kernel0.packet[0] = pad0 ? pset1<Packet>(Scalar(0))
+                                         : rhs.packetNoPadding(d, idx0);
+                kernel0.packet[1] = pad1 ? pset1<Packet>(Scalar(0))
+                                         : rhs.packetNoPadding(d, idx1);
+                kernel1.packet[0] = pad2 ? pset1<Packet>(Scalar(0))
+                                         : rhs.packetNoPadding(d, idx2);
+                kernel1.packet[1] = pad3 ? pset1<Packet>(Scalar(0))
+                                         : rhs.packetNoPadding(d, idx3);
+                ptranspose(kernel0);
+                ptranspose(kernel1);
+                pstoreu(block + 0 * packet_size, kernel0.packet[0]);
+                pstoreu(block + 1 * packet_size, kernel1.packet[0]);
+                pstoreu(block + 2 * packet_size, kernel0.packet[1]);
+                pstoreu(block + 3 * packet_size, kernel1.packet[1]);
+                block += 4 * packet_size;
+                k += packet_size;
+              }
+            }
+          }
+
+          for (; k < peeled_k; k += packet_size) {
+            PacketBlock<Packet, 2> kernel0;
+            PacketBlock<Packet, 2> kernel1;
+            kernel0.packet[0] = dm0.loadPacketFast(k);
+            kernel0.packet[1] = dm1.loadPacketFast(k);
+            kernel1.packet[0] = dm2.loadPacketFast(k);
+            kernel1.packet[1] = dm3.loadPacketFast(k);
+            ptranspose(kernel0);
+            ptranspose(kernel1);
+            pstoreu(block + 0 * packet_size, kernel0.packet[0]);
+            pstoreu(block + 1 * packet_size, kernel1.packet[0]);
+            pstoreu(block + 2 * packet_size, kernel0.packet[1]);
+            pstoreu(block + 3 * packet_size, kernel1.packet[1]);
+            block += 4 * packet_size;
+          }
+        } else {
+          for (; k < peeled_k; k += packet_size) {
+            PacketBlock<Packet, 2> kernel0;
+            PacketBlock<Packet, 2> kernel1;
+            kernel0.packet[0] = dm0.loadPacketStandard(k);
+            kernel0.packet[1] = dm1.loadPacketStandard(k);
+            kernel1.packet[0] = dm2.loadPacketStandard(k);
+            kernel1.packet[1] = dm3.loadPacketStandard(k);
+            ptranspose(kernel0);
+            ptranspose(kernel1);
+            pstoreu(block + 0 * packet_size, kernel0.packet[0]);
+            pstoreu(block + 1 * packet_size, kernel1.packet[0]);
+            pstoreu(block + 2 * packet_size, kernel0.packet[1]);
+            pstoreu(block + 3 * packet_size, kernel1.packet[1]);
+            block += 4 * packet_size;
+          }
+        }
+      }
+      if (!non_standard_patches) {
+        for (; k < depth; k++) {
+          block[0] = dm0.loadCoeffStandard(k);
+          block[1] = dm1.loadCoeffStandard(k);
+          block[2] = dm2.loadCoeffStandard(k);
+          block[3] = dm3.loadCoeffStandard(k);
+          block += 4;
+        }
+      } else {
+        for (; k < depth; k++) {
+          block[0] = dm0(k);
+          block[1] = dm1(k);
+          block[2] = dm2(k);
+          block[3] = dm3(k);
+          block += 4;
+        }
+      }
+    }
+
+    // Copy the remaining columns one at a time (nr == 1).
+    for (Index j2 = packet_cols4; j2 < cols; ++j2) {
+      const SubMapper dm0 = rhs.getLinearMapper(0, j2);
+      for (Index k = 0; k < depth; k++) {
+        *block = dm0(k);
+        block += 1;
+      }
+    }
+  }
+};
+
 // Special case for non-vectorized types such as float16.
 template <typename NewDimension, DenseIndex Rows, DenseIndex Cols,
           typename ArgType, typename Device, typename Scalar, typename Index,
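Review note on the `packet_size = 2` specialization above: the two `ptranspose` calls and the interleaved `pstoreu` order appear intended to produce the same block layout as the scalar tail loop (four column values per depth step). The sketch below emulates that with plain arrays so the equivalence can be checked by hand. All names are hypothetical, `double` pairs stand in for `Packet2d`, and padding and patch handling are left out on purpose.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Emulates ptranspose on a PacketBlock<Packet, 2> with 2-wide packets.
inline void Transpose2x2(double (&m)[2][2]) { std::swap(m[0][1], m[1][0]); }

// Scalar reference layout: for every depth index k, emit the k-th value of
// columns 0..3 in order. `cols` is assumed to hold exactly four columns of
// equal length.
std::vector<double> PackFourColumnsScalar(
    const std::vector<std::vector<double>>& cols) {
  std::vector<double> block;
  for (std::size_t k = 0; k < cols[0].size(); ++k) {
    for (std::size_t c = 0; c < 4; ++c) block.push_back(cols[c][k]);
  }
  return block;
}

// Packet-style layout: load two depth values per column, transpose the two
// 2x2 blocks, then store kernel0[0], kernel1[0], kernel0[1], kernel1[1].
std::vector<double> PackFourColumnsPacket2(
    const std::vector<std::vector<double>>& cols) {
  const std::size_t depth = cols[0].size();
  const std::size_t peeled_k = (depth / 2) * 2;
  std::vector<double> block;
  std::size_t k = 0;
  for (; k < peeled_k; k += 2) {
    double kernel0[2][2] = {{cols[0][k], cols[0][k + 1]},
                            {cols[1][k], cols[1][k + 1]}};
    double kernel1[2][2] = {{cols[2][k], cols[2][k + 1]},
                            {cols[3][k], cols[3][k + 1]}};
    Transpose2x2(kernel0);
    Transpose2x2(kernel1);
    for (int row = 0; row < 2; ++row) {
      block.push_back(kernel0[row][0]);
      block.push_back(kernel0[row][1]);
      block.push_back(kernel1[row][0]);
      block.push_back(kernel1[row][1]);
    }
  }
  // Scalar tail for an odd depth, as in the kernel's remainder loop.
  for (; k < depth; ++k) {
    for (std::size_t c = 0; c < 4; ++c) block.push_back(cols[c][k]);
  }
  return block;
}
```

Both functions should return identical vectors for any four equal-length columns, which is the property the specialization relies on when it mixes the packet path with the scalar remainder.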
diff --git a/tensorflow/core/kernels/function_ops.cc b/tensorflow/core/kernels/function_ops.cc
index a094ebe5e2d1d78ec8f5514dca7b7ebeec4e6b57..351aad72135da9c11dcef7ce4ff19cd158a50a1b 100644
--- a/tensorflow/core/kernels/function_ops.cc
+++ b/tensorflow/core/kernels/function_ops.cc
@@ -28,6 +28,7 @@ limitations under the License.
 #include "tensorflow/core/graph/gradients.h"
 #include "tensorflow/core/graph/graph_constructor.h"
 #include "tensorflow/core/platform/macros.h"
+#include "tensorflow/core/platform/tracing.h"
 #include "tensorflow/core/util/device_name_utils.h"
 
 namespace tensorflow {
@@ -307,11 +308,30 @@ class RemoteCallOp : public AsyncOpKernel {
     AttrValueMap attr_values = func_.attr();
     FunctionLibraryRuntime::InstantiateOptions instantiate_opts;
     instantiate_opts.target = target_device;
+
+    FunctionTarget function_target = {target_device, lib};
+
     FunctionLibraryRuntime::Handle handle;
-    OP_REQUIRES_OK_ASYNC(ctx,
-                         lib->Instantiate(func_.name(), AttrSlice(&attr_values),
-                                          instantiate_opts, &handle),
-                         done);
+    {
+      mutex_lock l(mu_);
+      auto cached_entry = handle_cache_.find(function_target);
+      if (cached_entry != handle_cache_.end()) {
+        handle = cached_entry->second;
+      } else {
+        VLOG(1) << "Instantiating " << func_.name() << " on " << target_device;
+        port::Tracing::TraceMe activity(strings::StrCat(
+            "RemoteCall: Instantiate: ", func_.name(), " on ", target_device));
+        OP_REQUIRES_OK_ASYNC(
+            ctx,
+            lib->Instantiate(func_.name(), AttrSlice(&attr_values),
+                             instantiate_opts, &handle),
+            done);
+        auto insert_result = handle_cache_.insert({function_target, handle});
+        CHECK(insert_result.second) << "Insert unsuccessful.";
+        VLOG(1) << "Instantiated " << func_.name() << " on " << target_device
+                << ", resulting in handle: " << handle << " flr: " << lib;
+      }
+    }
 
     OpInputList arguments;
     OP_REQUIRES_OK_ASYNC(ctx, ctx->input_list("args", &arguments), done);
@@ -330,22 +350,33 @@ class RemoteCallOp : public AsyncOpKernel {
       args.push_back(argument);
     }
     auto* rets = new std::vector<Tensor>;
-    lib->Run(opts, handle, args, rets, [rets, done, ctx](const Status& status) {
-      if (!status.ok()) {
-        ctx->SetStatus(status);
-      } else {
-        for (size_t i = 0; i < rets->size(); ++i) {
-          ctx->set_output(i, (*rets)[i]);
-        }
-      }
-      delete rets;
-      done();
-    });
+    auto* trace = new port::Tracing::TraceMe(strings::StrCat(
+        "RemoteCall: Run: ", func_.name(), " on ", target_device));
+    VLOG(1) << "Running " << func_.name() << " on " << target_device
+            << " with handle: " << handle;
+    lib->Run(opts, handle, args, rets,
+             [rets, trace, done, ctx](const Status& status) {
+               if (!status.ok()) {
+                 ctx->SetStatus(status);
+               } else {
+                 for (size_t i = 0; i < rets->size(); ++i) {
+                   ctx->set_output(i, (*rets)[i]);
+                 }
+               }
+               delete rets;
+               delete trace;
+               done();
+             });
   }
 
  private:
-  string target_;
   NameAttrList func_;
+
+  mutex mu_;
+  typedef std::pair<string, FunctionLibraryRuntime*> FunctionTarget;
+  std::map<FunctionTarget, FunctionLibraryRuntime::Handle> handle_cache_
+      GUARDED_BY(mu_);
+
   TF_DISALLOW_COPY_AND_ASSIGN(RemoteCallOp);
 };
 
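Review note on the `RemoteCallOp` change above: the new `handle_cache_` memoizes one function handle per (target device, runtime) pair under a mutex, so `Instantiate` runs only on a cache miss. The sketch below shows that pattern in isolation. It is a simplified illustration under stated assumptions: `HandleCacheSketch` is a hypothetical class, `int` stands in for `FunctionLibraryRuntime::Handle`, `const void*` for the runtime pointer, and `std::mutex` for TensorFlow's `mutex`.

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

class HandleCacheSketch {
 public:
  using Handle = int;
  // Key mirrors FunctionTarget: target device name plus runtime pointer.
  using Key = std::pair<std::string, const void*>;

  // Returns the cached handle for (target, lib). On a miss, calls
  // `instantiate` once while holding the lock and stores its result.
  Handle GetOrInstantiate(const std::string& target, const void* lib,
                          const std::function<Handle()>& instantiate) {
    std::lock_guard<std::mutex> lock(mu_);
    const Key key(target, lib);
    auto it = cache_.find(key);
    if (it != cache_.end()) return it->second;
    const Handle handle = instantiate();
    cache_.emplace(key, handle);
    return handle;
  }

 private:
  std::mutex mu_;
  std::map<Key, Handle> cache_;  // Guarded by mu_.
};
```

Holding the lock across instantiation keeps the sketch simple and avoids duplicate instantiation of the same target, at the cost of serializing concurrent first calls for different targets. The patch makes the same trade-off.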
diff --git a/tensorflow/core/kernels/hexagon/BUILD b/tensorflow/core/kernels/hexagon/BUILD
index 108d59db2c21cad6ff3136a0271f9ba1f7d7a237..7688305019cdbca94b2094c12380dab4162353d7 100644
--- a/tensorflow/core/kernels/hexagon/BUILD
+++ b/tensorflow/core/kernels/hexagon/BUILD
@@ -45,6 +45,7 @@ tf_cc_test(
         "//tensorflow/core:test_main",
         "//tensorflow/core:testlib",
         "//tensorflow/core/kernels:cwise_op",
+        "//tensorflow/core/kernels:quantization_utils",
         "//tensorflow/core/kernels:quantized_ops",
         "//tensorflow/core/kernels:reduction_ops",
         "//tensorflow/core/kernels:remote_fused_graph_execute_utils",
diff --git a/tensorflow/core/kernels/identity_op.cc b/tensorflow/core/kernels/identity_op.cc
index a18a72c66dc659ffd372c231524dbf038df6ac22..b5603fecd8bed458b1f686f4bb5efaefa33046b6 100644
--- a/tensorflow/core/kernels/identity_op.cc
+++ b/tensorflow/core/kernels/identity_op.cc
@@ -101,6 +101,10 @@ REGISTER_SYCL_HOST_KERNEL(bool);
   REGISTER_KERNEL_BUILDER(Name("DebugGradientIdentity")                     \
                               .Device(DEVICE_GPU)                           \
                               .TypeConstraint<type>("T"),                   \
+                          IdentityOp);                                      \
+  REGISTER_KERNEL_BUILDER(Name("PlaceholderWithDefault")                    \
+                              .Device(DEVICE_GPU)                           \
+                              .TypeConstraint<type>("dtype"),               \
                           IdentityOp)
 
 TF_CALL_NUMBER_TYPES_NO_INT32(REGISTER_GPU_KERNEL);
@@ -112,18 +116,24 @@ REGISTER_GPU_KERNEL(Variant);
 // A special GPU kernel for int32 and bool.
 // TODO(b/25387198): Also enable int32 in device memory. This kernel
 // registration requires all int32 inputs and outputs to be in host memory.
-#define REGISTER_GPU_HOST_KERNEL(type)                    \
-  REGISTER_KERNEL_BUILDER(Name("Identity")                \
-                              .Device(DEVICE_GPU)         \
-                              .HostMemory("input")        \
-                              .HostMemory("output")       \
-                              .TypeConstraint<type>("T"), \
-                          IdentityOp);                    \
-  REGISTER_KERNEL_BUILDER(Name("RefIdentity")             \
-                              .Device(DEVICE_GPU)         \
-                              .HostMemory("input")        \
-                              .HostMemory("output")       \
-                              .TypeConstraint<type>("T"), \
+#define REGISTER_GPU_HOST_KERNEL(type)                        \
+  REGISTER_KERNEL_BUILDER(Name("Identity")                    \
+                              .Device(DEVICE_GPU)             \
+                              .HostMemory("input")            \
+                              .HostMemory("output")           \
+                              .TypeConstraint<type>("T"),     \
+                          IdentityOp);                        \
+  REGISTER_KERNEL_BUILDER(Name("RefIdentity")                 \
+                              .Device(DEVICE_GPU)             \
+                              .HostMemory("input")            \
+                              .HostMemory("output")           \
+                              .TypeConstraint<type>("T"),     \
+                          IdentityOp);                        \
+  REGISTER_KERNEL_BUILDER(Name("PlaceholderWithDefault")      \
+                              .Device(DEVICE_GPU)             \
+                              .HostMemory("input")            \
+                              .HostMemory("output")           \
+                              .TypeConstraint<type>("dtype"), \
                           IdentityOp)
 
 REGISTER_GPU_HOST_KERNEL(int32);
diff --git a/tensorflow/core/kernels/mirror_pad_op.cc b/tensorflow/core/kernels/mirror_pad_op.cc
index 26e1082989f317a35d55826a466cb8d9ef306c4c..1c85c744fc8c0f9c3025bd15ac87752c8af56367 100644
--- a/tensorflow/core/kernels/mirror_pad_op.cc
+++ b/tensorflow/core/kernels/mirror_pad_op.cc
@@ -173,6 +173,7 @@ namespace functor {
   DECLARE_CPU_SPEC(T, int64, 5);
 
 TF_CALL_POD_TYPES(DECLARE_CPU_SPECS);
+TF_CALL_string(DECLARE_CPU_SPECS);
 
 #undef DECLARE_CPU_SPEC
 #undef DECLARE_CPU_SPECS
@@ -194,6 +195,7 @@ TF_CALL_POD_TYPES(DECLARE_CPU_SPECS);
 
 // Note that we do register for bool type, but not in the gradient op.
 TF_CALL_POD_TYPES(REGISTER_KERNEL);
+TF_CALL_string(REGISTER_KERNEL);
 #undef REGISTER_KERNEL
 
 #if GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/mirror_pad_op_cpu_impl.h b/tensorflow/core/kernels/mirror_pad_op_cpu_impl.h
index 6716a26fac2c77ee1ee5306cc26cf802585dcfc4..f27ca139c9d4a62114b9f7a261e1d7dc7f766123 100644
--- a/tensorflow/core/kernels/mirror_pad_op_cpu_impl.h
+++ b/tensorflow/core/kernels/mirror_pad_op_cpu_impl.h
@@ -29,6 +29,7 @@ using CpuDevice = Eigen::ThreadPoolDevice;
   template struct functor::MirrorPad<CpuDevice, T, int32, CPU_PROVIDED_IXDIM>; \
   template struct functor::MirrorPad<CpuDevice, T, int64, CPU_PROVIDED_IXDIM>;
 TF_CALL_POD_TYPES(DEFINE_CPU_SPECS);
+TF_CALL_string(DEFINE_CPU_SPECS);
 #undef DEFINE_CPU_SPECS
 
 #define DEFINE_CPU_SPECS(T)                                   \
diff --git a/tensorflow/core/kernels/mkl_concat_op.cc b/tensorflow/core/kernels/mkl_concat_op.cc
index f1f267e849aa39b43c153b857493160e0d103970..aa3ea890b04358d6176b44558fed014ef29259e3 100644
--- a/tensorflow/core/kernels/mkl_concat_op.cc
+++ b/tensorflow/core/kernels/mkl_concat_op.cc
@@ -519,9 +519,11 @@ class MklConcatOp : public OpKernel {
     mkl_tensor_tf_shape.AddDim(
         SIZE_OF_MKL_SERIAL_DATA(mkl_tensor_mkl_shape.GetDimension()));
     int tf_output_index = 0;
-    context->allocate_output(
+    // TODO(jktomer): replace this with OP_REQUIRES_OK and clean up this file
+    // to propagate the status up the call stack.
+    TF_CHECK_OK(context->allocate_output(
         GetTensorMetaDataIndex(tf_output_index, context->num_outputs()),
-        mkl_tensor_tf_shape, &mkl_tensor);
+        mkl_tensor_tf_shape, &mkl_tensor));
     mkl_tensor_mkl_shape.SerializeMklShape(
         mkl_tensor->flat<uint8>().data(),
         mkl_tensor->flat<uint8>().size() * sizeof(uint8));
@@ -549,9 +551,11 @@ class MklConcatOp : public OpKernel {
     mkl_tensor_tf_shape.AddDim(
         SIZE_OF_MKL_SERIAL_DATA(mkl_tensor_mkl_shape.GetDimension()));
     int tf_output_index = 0;
-    context->allocate_output(
+    // TODO(jktomer): replace this with OP_REQUIRES_OK and clean up this file
+    // to propagate the status up the call stack.
+    TF_CHECK_OK(context->allocate_output(
         GetTensorMetaDataIndex(tf_output_index, context->num_outputs()),
-        mkl_tensor_tf_shape, &mkl_tensor);
+        mkl_tensor_tf_shape, &mkl_tensor));
     mkl_tensor_mkl_shape.SerializeMklShape(
         mkl_tensor->flat<uint8>().data(),
         mkl_tensor->flat<uint8>().size() * sizeof(uint8));
diff --git a/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
index 25c2573741265d4d33c9c91474792be241dd3b32..d23027a54d169b5e597bd26a63f26d38a23239ae 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
@@ -79,8 +79,9 @@ class MklConv2DCustomBackpropBiasOp : public OpKernel {
     } else if (data_format_ == FORMAT_NHWC || data_format_ == FORMAT_NCHW) {
       mkl_context.c_size = GetTensorDim(input, data_format_, 'C');
     } else {
-      errors::InvalidArgument("Unknown format ",
-                              " Format must be either NCHW or NHWC. ");
+      context->CtxFailure(errors::InvalidArgument(
+          "Unknown format. ", "Format must be either NCHW or NHWC."));
+      return;
     }
     TensorShape output_shape{mkl_context.c_size};
 
diff --git a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
index 1401bc65a45bd80ed78230840cf0b9958b1f012e..e0706568b15204312445446a161d0aa9911f9e33 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
@@ -444,6 +444,7 @@ class MklConv2DCustomBackpropFilterOp
   ~MklConv2DCustomBackpropFilterOp() {}
 
  private:
+  const int kDilationH = 0, kDilationW = 1;
   void ValidateMklShapes(const MklDnnShape& input_mkl_shape,
                          const MklDnnShape& filter_mkl_shape,
                          const MklDnnShape& obp_mkl_shape) {
@@ -492,7 +493,9 @@ class MklConv2DCustomBackpropFilterOp
                        const convolution_forward::primitive_desc& conv_fwd_pd,
                        MklDnnData<T>* input, MklDnnData<T>* filter,
                        MklDnnData<T>* outbackprop, MklDnnData<T>* output,
-                       Tensor** output_tensor, const memory::dims& strides,
+                       Tensor** output_tensor,
+                       const memory::dims& strides,
+                       const memory::dims& dilations,
                        const memory::dims& padding_l,
                        const memory::dims& padding_r, padding_kind padding,
                        const memory::dims& bwd_output_dims,
@@ -518,31 +521,32 @@ class MklConv2DCustomBackpropFilterOp
       bias_grad->SetOpMemDesc(bias_grad_dims, memory::format::x);
     }
 
-    // Create convolution backward weights primitive.
-    auto bwd_desc =
-        (biasEnabled && (bias_grad != nullptr))
-            ? convolution_backward_weights::desc(
-                  convolution_direct, input->GetOpMemDesc(),
-                  output->GetOpMemDesc(), bias_grad->GetOpMemDesc(),
-                  outbackprop->GetOpMemDesc(), strides, padding_l, padding_r,
-                  padding)
-            : convolution_backward_weights::desc(
-                  convolution_direct, input->GetOpMemDesc(),
-                  output->GetOpMemDesc(), outbackprop->GetOpMemDesc(), strides,
-                  padding_l, padding_r, padding);
-
-    auto bwd_pd = convolution_backward_weights::primitive_desc(
-        bwd_desc, cpu_engine, conv_fwd_pd);
-
-    // Allocate output tensor.
-    AllocateOutputTensor(context, bwd_pd, bwd_output_dims, bwd_output_format,
-                         output_tensor);
-
-    CHECK_NOTNULL(*output_tensor);
-    // Set buffer handle using allocated output tensor.
-    output->SetUsrMemDataHandle(*output_tensor);
-
     if (biasEnabled && (bias_grad != nullptr)) {
+      // Create convolution backward weights with bias primitive.
+      // Use dilated convolution if either dilation rate is greater than zero.
+      auto bwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0) ?
+        convolution_backward_weights::desc(convolution_direct,
+                                  input->GetOpMemDesc(), output->GetOpMemDesc(),
+                                  bias_grad->GetOpMemDesc(),
+                                  outbackprop->GetOpMemDesc(), strides,
+                                  dilations, padding_l, padding_r, padding) :
+        convolution_backward_weights::desc(convolution_direct,
+                                  input->GetOpMemDesc(), output->GetOpMemDesc(),
+                                  bias_grad->GetOpMemDesc(),
+                                  outbackprop->GetOpMemDesc(),
+                                  strides, padding_l, padding_r, padding);
+      auto bwd_pd = convolution_backward_weights::primitive_desc(bwd_desc,
+                                                            cpu_engine,
+                                                            conv_fwd_pd);
+
+      // Allocate output tensor.
+      AllocateOutputTensor(context, bwd_pd, bwd_output_dims,
+                           bwd_output_format, output_tensor);
+
+      CHECK_NOTNULL(*output_tensor);
+      // Set buffer handle using allocated output tensor.
+      output->SetUsrMemDataHandle(*output_tensor);
+
       // Allocate bias_grad tensor
       TensorShape bias_grad_shape({depth});
       Tensor* bias_grad_tensor = nullptr;
@@ -553,11 +557,32 @@ class MklConv2DCustomBackpropFilterOp
           memory::desc({bias_grad_dims}, MklDnnType<T>(), memory::format::x);
       bias_grad->SetUsrMem(bias_grad_md, bias_grad_tensor);
       bias_grad->SetUsrMemDataHandle(bias_grad_tensor);
-    }
 
-    if (biasEnabled && (bias_grad != nullptr)) {
-      PrepareAndExecutePrimitive(bwd_pd, input, outbackprop, output, bias_grad);
+      PrepareAndExecutePrimitive(bwd_pd, input, outbackprop, output,
+                                  bias_grad);
     } else {
+      // Create convolution backward weights primitive.
+      // Use dilated convolution if either dilation rate is greater than zero.
+      auto bwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0) ?
+        convolution_backward_weights::desc(convolution_direct,
+                                  input->GetOpMemDesc(), output->GetOpMemDesc(),
+                                  outbackprop->GetOpMemDesc(), strides,
+                                  dilations, padding_l, padding_r, padding) :
+        convolution_backward_weights::desc(convolution_direct,
+                                  input->GetOpMemDesc(), output->GetOpMemDesc(),
+                                  outbackprop->GetOpMemDesc(),
+                                  strides, padding_l, padding_r, padding);
+      auto bwd_pd = convolution_backward_weights::primitive_desc(bwd_desc,
+                                                            cpu_engine,
+                                                            conv_fwd_pd);
+
+      // Allocate output tensor.
+      AllocateOutputTensor(context, bwd_pd, bwd_output_dims,
+                           bwd_output_format, output_tensor);
+
+      CHECK_NOTNULL(*output_tensor);
+      // Set buffer handle using allocated output tensor.
+      output->SetUsrMemDataHandle(*output_tensor);
       PrepareAndExecutePrimitive(bwd_pd, input, outbackprop, output);
     }
   }
diff --git a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
index eeed0095310280997ebb2ec3e848451df378c4fa..d203c04934131ee56fbca169d4c3e5e534d7986f 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
@@ -369,6 +369,7 @@ class MklConv2DCustomBackpropInputOp
  private:
   const int kInputIndex_Filter = 1, kInputIndex_InputSizes = 0,
             kInputIndex_OutBackProp = 2;
+  const int kDilationH = 0, kDilationW = 1;
   void ValidateMklShapes(const MklDnnShape& input_mkl_shape,
                          const MklDnnShape& filter_mkl_shape,
                          const MklDnnShape& obp_mkl_shape) {
@@ -419,7 +420,9 @@ class MklConv2DCustomBackpropInputOp
                        const convolution_forward::primitive_desc& conv_fwd_pd,
                        MklDnnData<T>* input, MklDnnData<T>* filter,
                        MklDnnData<T>* outbackprop, MklDnnData<T>* output,
-                       Tensor** output_tensor, const memory::dims& strides,
+                       Tensor** output_tensor,
+                       const memory::dims& strides,
+                       const memory::dims& dilations,
                        const memory::dims& padding_l,
                        const memory::dims& padding_r, padding_kind padding,
                        const memory::dims& bwd_output_dims,
@@ -432,9 +435,16 @@ class MklConv2DCustomBackpropInputOp
     CHECK_NOTNULL(output_tensor);
 
     // Create convolution backward data primitive.
-    auto bwd_desc = convolution_backward_data::desc(
-        convolution_direct, output->GetOpMemDesc(), filter->GetOpMemDesc(),
-        outbackprop->GetOpMemDesc(), strides, padding_l, padding_r, padding);
+    // Use dilated convolution if either dilation rate is greater than zero.
+    auto bwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0) ?
+        convolution_backward_data::desc(convolution_direct,
+                      output->GetOpMemDesc(), filter->GetOpMemDesc(),
+                      outbackprop->GetOpMemDesc(), strides,
+                      dilations, padding_l, padding_r, padding):
+        convolution_backward_data::desc(convolution_direct,
+                      output->GetOpMemDesc(), filter->GetOpMemDesc(),
+                      outbackprop->GetOpMemDesc(),
+                      strides, padding_l, padding_r, padding);
 
     auto bwd_pd = convolution_backward_data::primitive_desc(
         bwd_desc, cpu_engine, conv_fwd_pd);
diff --git a/tensorflow/core/kernels/mkl_conv_ops.cc b/tensorflow/core/kernels/mkl_conv_ops.cc
index 2953426d5824064952858124882126c154fe6725..f0818eb96daaab254823f066d7f8c78d913fa3c0 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_ops.cc
@@ -294,8 +294,10 @@ class MklConv2DOp : public OpKernel {
     mkl_filter_output_mkl_shape.SetMklLayout(mkl_context.prim_fwd,
                                              dnnResourceFilter);
 
-    size_t filter_sizes[4] = {filter.dim_size(0), filter.dim_size(1),
-                              filter.dim_size(2), filter.dim_size(3)};
+    size_t filter_sizes[4] = {static_cast<size_t>(filter.dim_size(0)),
+                              static_cast<size_t>(filter.dim_size(1)),
+                              static_cast<size_t>(filter.dim_size(2)),
+                              static_cast<size_t>(filter.dim_size(3))};
     mkl_filter_output_mkl_shape.SetTfLayout(filter.dims(), filter_sizes,
                                             mkl_context.filter_strides);
 
@@ -491,6 +493,7 @@ class MklConv2DOp : public OpKernel {
   ~MklConv2DOp() {}
 
   explicit MklConv2DOp(OpKernelConstruction* context) : OpKernel(context) {
+    OP_REQUIRES_OK(context, context->GetAttr("dilations", &dilations_));
     OP_REQUIRES_OK(context, context->GetAttr("strides", &strides_));
     string data_format;
     OP_REQUIRES_OK(context, context->GetAttr("data_format", &data_format));
@@ -507,6 +510,20 @@ class MklConv2DOp : public OpKernel {
         errors::InvalidArgument("Current implementation does not yet support "
                                 "strides in the batch and depth dimensions."));
     OP_REQUIRES_OK(context, context->GetAttr("padding", &padding_));
+    OP_REQUIRES(context, dilations_.size() == 4,
+                errors::InvalidArgument("Sliding window dilations field must "
+                                        "specify 4 dimensions"));
+    const int64 dilation_n = GetTensorDim(dilations_, data_format_, 'N');
+    const int64 dilation_c = GetTensorDim(dilations_, data_format_, 'C');
+    const int64 dilation_h = GetTensorDim(dilations_, data_format_, 'H');
+    const int64 dilation_w = GetTensorDim(dilations_, data_format_, 'W');
+    OP_REQUIRES(context, dilation_n == 1 && dilation_c == 1,
+                errors::InvalidArgument(
+                    "Current implementation does not yet support "
+                    "dilations in the batch and depth dimensions."));
+    OP_REQUIRES(
+        context, dilation_h > 0 && dilation_w > 0,
+        errors::InvalidArgument("Dilated rates should be larger than 0."));
   }
 
   void Compute(OpKernelContext* context) override {
@@ -528,17 +545,19 @@ class MklConv2DOp : public OpKernel {
       MklDnnData<T> filter(&cpu_engine);
       MklDnnData<T> output(&cpu_engine);
 
-      memory::dims src_dims, filter_dims, padding_l, padding_r, strides;
+      memory::dims src_dims, filter_dims, padding_l, padding_r,
+                   dilations, strides;
       memory::dims output_dims_tf_order, output_dims_mkl_order;
 
       // Get shapes of input tensors in MKL-DNN order
-      MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_);
+      MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_,
+                             dilations_);
       auto src_tf_shape = GetTfShape(context, kInputIndex_Src);
       auto filter_tf_shape = GetTfShape(context, kInputIndex_Filter);
       conv_utl.GetConvFwdSizesInMklOrder(
           src_tf_shape, filter_tf_shape, &src_dims, &filter_dims, &strides,
-          &output_dims_tf_order, &output_dims_mkl_order, &padding_l,
-          &padding_r);
+          &dilations, &output_dims_tf_order, &output_dims_mkl_order,
+          &padding_l, &padding_r);
       if (!context->status().ok()) return;
 
       // Check for corner case - if there is nothing to compute, return.
@@ -551,6 +570,7 @@ class MklConv2DOp : public OpKernel {
         //               Need semantics for Null MKL tensor
         MklDnnShape output_mkl_shape;
         output_mkl_shape.SetMklTensor(false);
+
         AllocateOutputSetMklShape(context, kOutputIndex_Dst, &output_tensor,
                                   src_tf_shape, output_mkl_shape);
 
@@ -594,55 +614,79 @@ class MklConv2DOp : public OpKernel {
       filter.SetOpMemDesc(filter_dims, memory::format::any);
       output.SetOpMemDesc(output_dims_mkl_order, memory::format::any);
 
-      // If bias is enabled, then do the same steps as above for bias.
+      // MKL-DNN dilation is zero-based: 0 means no dilation (TF uses 1).
+      dilations[kDilationH] -= 1;
+      dilations[kDilationW] -= 1;
+
       if (biasEnabled) {
-        MklDnnData<T> bias(&cpu_engine);
-        memory::dims bias_size;
-        conv_utl.GetBiasSizeInMklOrder(kInputIndex_Bias, &bias_size);
-        const Tensor& bias_tensor = MklGetInput(context, kInputIndex_Bias);
-        bias.SetUsrMem(bias_size, memory::format::x, &bias_tensor);
-        bias.SetOpMemDesc(bias_size, memory::format::any);
-
-        // Create convolution primitive with Bias.
-        auto conv_desc = convolution_forward::desc(
-            prop_kind::forward, convolution_direct, src.GetOpMemDesc(),
-            filter.GetOpMemDesc(), bias.GetOpMemDesc(), output.GetOpMemDesc(),
-            strides, padding_l, padding_r, TFPaddingToMklDnnPadding(padding_));
-
-        auto conv_prim_desc =
-            convolution_forward::primitive_desc(conv_desc, cpu_engine);
-        AllocateOutputTensor(context, conv_prim_desc, output_dims_mkl_order,
-                             tf_fmt, &output_tensor);
-        // Set data handle for output.
-        output.SetUsrMemDataHandle(output_tensor);
-
-        Tensor* filter_out_tensor = nullptr;
-        AllocateFilterOutputTensor(context, conv_prim_desc,
-                                   TFShapeToMklDnnDims(filter_tf_shape),
-                                   &filter_out_tensor);
-
-        PrepareAndExecuteNet(conv_prim_desc, &src, &filter, &bias, &output,
-                             filter_out_tensor);
+          // Set up user and op memory for the bias tensor.
+          MklDnnData<T> bias(&cpu_engine);
+          memory::dims bias_size;
+          conv_utl.GetBiasSizeInMklOrder(kInputIndex_Bias, &bias_size);
+          const Tensor& bias_tensor = MklGetInput(context, kInputIndex_Bias);
+          bias.SetUsrMem(bias_size, memory::format::x, &bias_tensor);
+          bias.SetOpMemDesc(bias_size, memory::format::any);
+
+          // Create convolution primitive with Bias.
+          // Use MKL-DNN dilated convolution if either dilation rate is > 0.
+          auto conv_desc = (dilations[kDilationH] > 0 ||
+              dilations[kDilationW] > 0) ?
+              convolution_forward::desc(prop_kind::forward,
+                      convolution_direct, src.GetOpMemDesc(),
+                      filter.GetOpMemDesc(), bias.GetOpMemDesc(),
+                      output.GetOpMemDesc(), strides, dilations,
+                      padding_l, padding_r,
+                      TFPaddingToMklDnnPadding(padding_)):
+              convolution_forward::desc(prop_kind::forward,
+                      convolution_direct, src.GetOpMemDesc(),
+                      filter.GetOpMemDesc(), bias.GetOpMemDesc(),
+                      output.GetOpMemDesc(), strides,
+                      padding_l, padding_r,
+                      TFPaddingToMklDnnPadding(padding_));
+
+          auto conv_prim_desc = convolution_forward::primitive_desc(conv_desc,
+                                                                  cpu_engine);
+          AllocateOutputTensor(context, conv_prim_desc,
+                               output_dims_mkl_order, tf_fmt, &output_tensor);
+          // Set data handle for output.
+          output.SetUsrMemDataHandle(output_tensor);
+
+          Tensor* filter_out_tensor = nullptr;
+          AllocateFilterOutputTensor(context, conv_prim_desc,
+                TFShapeToMklDnnDims(filter_tf_shape),
+                &filter_out_tensor);
+
+          PrepareAndExecuteNet(conv_prim_desc, &src, &filter, &bias, &output,
+                               filter_out_tensor);
       } else {
-        // Create convolution primitive without Bias.
-        auto conv_desc = convolution_forward::desc(
-            prop_kind::forward, convolution_direct, src.GetOpMemDesc(),
-            filter.GetOpMemDesc(), output.GetOpMemDesc(), strides, padding_l,
-            padding_r, TFPaddingToMklDnnPadding(padding_));
-
-        auto conv_prim_desc =
-            convolution_forward::primitive_desc(conv_desc, cpu_engine);
-        AllocateOutputTensor(context, conv_prim_desc, output_dims_mkl_order,
-                             tf_fmt, &output_tensor);
-        // Set data handle for output.
-        output.SetUsrMemDataHandle(output_tensor);
-
-        Tensor* filter_out_tensor = nullptr;
-        AllocateFilterOutputTensor(context, conv_prim_desc,
-                                   TFShapeToMklDnnDims(filter_tf_shape),
-                                   &filter_out_tensor);
-        PrepareAndExecuteNet(conv_prim_desc, &src, &filter, nullptr, &output,
-                             filter_out_tensor);
+          // Create convolution primitive without Bias.
+          // Use MKL-DNN dilated convolution if either dilation rate is > 0.
+          auto conv_desc = (dilations[kDilationH] > 0 ||
+            dilations[kDilationW] > 0) ?
+            convolution_forward::desc(prop_kind::forward,
+              convolution_direct, src.GetOpMemDesc(),
+              filter.GetOpMemDesc(), output.GetOpMemDesc(),
+              strides, dilations, padding_l, padding_r,
+              TFPaddingToMklDnnPadding(padding_)):
+          convolution_forward::desc(prop_kind::forward,
+              convolution_direct, src.GetOpMemDesc(),
+              filter.GetOpMemDesc(), output.GetOpMemDesc(),
+              strides, padding_l, padding_r,
+              TFPaddingToMklDnnPadding(padding_));
+
+          auto conv_prim_desc = convolution_forward::primitive_desc(conv_desc,
+                                                                  cpu_engine);
+          AllocateOutputTensor(context, conv_prim_desc, output_dims_mkl_order,
+                               tf_fmt, &output_tensor);
+          // Set data handle for output.
+          output.SetUsrMemDataHandle(output_tensor);
+
+          Tensor* filter_out_tensor = nullptr;
+          AllocateFilterOutputTensor(context, conv_prim_desc,
+                TFShapeToMklDnnDims(filter_tf_shape),
+                &filter_out_tensor);
+          PrepareAndExecuteNet(conv_prim_desc, &src, &filter,
+                              nullptr, &output, filter_out_tensor);
       }
     } catch (mkldnn::error& e) {
       string error_msg = "Status: " + std::to_string(e.status) +
@@ -656,10 +700,12 @@ class MklConv2DOp : public OpKernel {
 
  private:
   std::vector<int32> strides_;
+  std::vector<int32> dilations_;
   Padding padding_;
   TensorFormat data_format_;
   const int kInputIndex_Src = 0, kInputIndex_Filter = 1, kInputIndex_Bias = 2;
   const int kOutputIndex_Dst = 0, kOutputIndex_Filter = 1;
+  const int kDilationH = 0, kDilationW = 1;
 
   // Allocate output tensor.
   void AllocateOutputTensor(
diff --git a/tensorflow/core/kernels/mkl_conv_ops.h b/tensorflow/core/kernels/mkl_conv_ops.h
index 9dd88221a84671e1f69df13cca1b62b2ce65bb4e..7ca10db895c2224aef6b4306ad585d5b617c446c 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.h
+++ b/tensorflow/core/kernels/mkl_conv_ops.h
@@ -58,13 +58,16 @@ class MklDnnConvUtil {
  protected:
   OpKernelContext* context_;  // We don't own this.
   std::vector<int32> strides_;
+  std::vector<int32> dilations_;
   Padding padding_;
   TensorFormat data_format_;
 
  public:
   MklDnnConvUtil(OpKernelContext* context, const std::vector<int32>& strides,
-                 Padding pad, TensorFormat fm)
-      : context_(context), strides_(strides), padding_(pad), data_format_(fm) {}
+                 Padding pad, TensorFormat fm,
+                 const std::vector<int32>& dilations) :
+    context_(context), strides_(strides), dilations_(dilations),
+    padding_(pad), data_format_(fm) {}
 
   virtual ~MklDnnConvUtil() { context_ = nullptr; }
 
@@ -78,6 +81,16 @@ class MklDnnConvUtil {
     *strides = {stride_rows, stride_cols};
   }
 
+  // Calculate Convolution dilations
+  virtual inline void GetDilationsInMklOrder(memory::dims *dilations) {
+    // For now we take the dilation from the second and third dimensions only
+    // (we do not support dilation on the batch or depth dimension).
+    CHECK_NOTNULL(dilations);
+    int dilations_rows = GetTensorDim(dilations_, data_format_, 'H');
+    int dilations_cols = GetTensorDim(dilations_, data_format_, 'W');
+    *dilations = {dilations_rows, dilations_cols};
+  }
+
   // Calculate Convolution input size in MKL-DNN order. MKL-DNN
   // requires input in NCHW format. Function does not return anything.
   // But errors arising from sanity checks are returned in context's
@@ -213,7 +226,8 @@ class MklDnnConvUtil {
   // TODO(nhasabni): Add similar function for input and filter in MklShape.
   virtual inline void GetOutputAndPadSizeInMklOrder(
       const TensorShape& input_shape, const TensorShape& filter_shape,
-      const memory::dims& strides, memory::dims* output_dims_tf_order,
+      const memory::dims& strides, const memory::dims& dilations,
+      memory::dims* output_dims_tf_order,
       memory::dims* output_dims_mkl_order, memory::dims* pad_l,
       memory::dims* pad_r) {
     CHECK_NOTNULL(output_dims_tf_order);
@@ -232,6 +246,8 @@ class MklDnnConvUtil {
     // Stride is vector of 2 elements: {s_r, s_c}
     int stride_rows = strides[0];
     int stride_cols = strides[1];
+    int dilation_rows = dilations[0];
+    int dilation_cols = dilations[1];
 
     // Output batch is same as input batch.
     int out_batch = GetTensorDim(input_shape, data_format_, 'N');
@@ -241,11 +257,13 @@ class MklDnnConvUtil {
     int64 out_rows = 0, out_cols = 0;
     int64 pad_top = 0, pad_bottom = 0, pad_left, pad_right;
 
-    OP_REQUIRES_OK(context_, GetWindowedOutputSizeVerbose(
-                                 input_rows, filter_rows, stride_rows, padding_,
+    OP_REQUIRES_OK(context_,
+            GetWindowedOutputSizeVerboseV2(input_rows, filter_rows,
+                                 dilation_rows, stride_rows, padding_,
                                  &out_rows, &pad_top, &pad_bottom));
-    OP_REQUIRES_OK(context_, GetWindowedOutputSizeVerbose(
-                                 input_cols, filter_cols, stride_cols, padding_,
+    OP_REQUIRES_OK(context_,
+            GetWindowedOutputSizeVerboseV2(input_cols, filter_cols,
+                                 dilation_cols, stride_cols, padding_,
                                  &out_cols, &pad_left, &pad_right));
 
     // Tensorflow output is in data_format order. (NHWC or NCHW)
@@ -271,7 +289,8 @@ class MklDnnConvUtil {
   //
   // Function does not return anything, but sets error in context status.
   inline void GetOutputAndPadSizeInMklOrder(
-      size_t src_index, size_t filter_index, const memory::dims& strides,
+      size_t src_index, size_t filter_index,
+      const memory::dims& strides, const memory::dims& dilations,
       memory::dims* output_dims_tf_order, memory::dims* output_dims_mkl_order,
       memory::dims* pad_l, memory::dims* pad_r) {
     CHECK_NOTNULL(output_dims_tf_order);
@@ -286,9 +305,9 @@ class MklDnnConvUtil {
                 errors::InvalidArgument("input must be 4-dimensional",
                                         input_tf_shape.DebugString()));
 
-    GetOutputAndPadSizeInMklOrder(input_tf_shape, filter_tf_shape, strides,
-                                  output_dims_tf_order, output_dims_mkl_order,
-                                  pad_l, pad_r);
+    GetOutputAndPadSizeInMklOrder(input_tf_shape, filter_tf_shape,
+                                  strides, dilations, output_dims_tf_order,
+                                  output_dims_mkl_order, pad_l, pad_r);
   }
 
   // Wrapper function to calculate input, filter, and output sizes of
@@ -300,12 +319,14 @@ class MklDnnConvUtil {
   inline void GetConvFwdSizesInMklOrder(
       const TensorShape& input_shape, const TensorShape& filter_shape,
       memory::dims* input_dims, memory::dims* filter_dims,
-      memory::dims* strides, memory::dims* output_dims_tf_order,
+      memory::dims* strides, memory::dims *dilations,
+      memory::dims* output_dims_tf_order,
       memory::dims* output_dims_mkl_order, memory::dims* pad_l,
       memory::dims* pad_r) {
     CHECK_NOTNULL(input_dims);
     CHECK_NOTNULL(filter_dims);
     CHECK_NOTNULL(strides);
+    CHECK_NOTNULL(dilations);
     CHECK_NOTNULL(output_dims_tf_order);
     CHECK_NOTNULL(output_dims_mkl_order);
     CHECK_NOTNULL(pad_l);
@@ -316,7 +337,9 @@ class MklDnnConvUtil {
     GetFilterSizeInMklOrder(input_shape, filter_shape, filter_dims);
     if (!context_->status().ok()) return;
     GetStridesInMklOrder(strides);
-    GetOutputAndPadSizeInMklOrder(input_shape, filter_shape, *strides,
+    GetDilationsInMklOrder(dilations);
+    GetOutputAndPadSizeInMklOrder(input_shape, filter_shape,
+                                  *strides, *dilations,
                                   output_dims_tf_order, output_dims_mkl_order,
                                   pad_l, pad_r);
     if (!context_->status().ok()) return;
@@ -344,7 +367,21 @@ class MklConv2DBackpropCommonOp : public OpKernel {
         context, (stride_n == 1 && stride_c == 1),
         errors::InvalidArgument("Current implementation does not yet support "
                                 "strides in the batch and depth dimensions."));
-
+    OP_REQUIRES_OK(context, context->GetAttr("dilations", &dilations_));
+    OP_REQUIRES(context, dilations_.size() == 4,
+                errors::InvalidArgument("Sliding window dilations field must "
+                                        "specify 4 dimensions"));
+    int dilation_n = GetTensorDim(dilations_, data_format_, 'N');
+    int dilation_c = GetTensorDim(dilations_, data_format_, 'C');
+    int dilation_h = GetTensorDim(dilations_, data_format_, 'H');
+    int dilation_w = GetTensorDim(dilations_, data_format_, 'W');
+    OP_REQUIRES(context, (dilation_n == 1 && dilation_c == 1),
+                errors::InvalidArgument(
+                    "Current implementation does not yet support "
+                    "dilations in the batch and depth dimensions."));
+    OP_REQUIRES(
+        context, dilation_h > 0 && dilation_w > 0,
+        errors::InvalidArgument("Dilated rates should be larger than 0."));
     OP_REQUIRES_OK(context, context->GetAttr("padding", &padding_));
   }
 
@@ -406,15 +443,16 @@ class MklConv2DBackpropCommonOp : public OpKernel {
       // By default, all dims are in MKL order. Only dims in TF order
       // are those with prefix tf_order.
       memory::dims outbprop_dims, fwd_input_dims, fwd_filter_dims;
-      memory::dims padding_l, padding_r, strides, fwd_output_dims;
+      memory::dims padding_l, padding_r, dilations, strides, fwd_output_dims;
       memory::dims fwd_output_dims_tf_order;
 
       // Get forward convolution parameters.
-      MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_);
+      MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_,
+                             dilations_);
       conv_utl.GetConvFwdSizesInMklOrder(
           input_tf_shape, filter_tf_shape, &fwd_input_dims, &fwd_filter_dims,
-          &strides, &fwd_output_dims_tf_order, &fwd_output_dims, &padding_l,
-          &padding_r);
+          &strides, &dilations, &fwd_output_dims_tf_order, &fwd_output_dims,
+          &padding_l, &padding_r);
       if (!context->status().ok()) return;
 
       // Create Convolution forward descriptor since Convolution backward
@@ -437,10 +475,21 @@ class MklConv2DBackpropCommonOp : public OpKernel {
                                               memory::format::hwio);
       // Tensorflow Output of Conv2D is in data_format order.
       auto fwd_out_md = memory::desc(fwd_output_dims, MklDnnType<T>(), tf_fmt);
-      auto fwd_desc = convolution_forward::desc(
-          prop_kind::forward, convolution_direct, fwd_input_md, fwd_filter_md,
-          fwd_out_md, strides, padding_l, padding_r,
-          TFPaddingToMklDnnPadding(padding_));
+
+      const int kDilationH = 0, kDilationW = 1;
+      dilations[kDilationH] -= 1;
+      dilations[kDilationW] -= 1;
+      auto fwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0)?
+              convolution_forward::desc(prop_kind::forward,
+                     convolution_direct, fwd_input_md,
+                     fwd_filter_md, fwd_out_md,
+                     strides, dilations, padding_l, padding_r,
+                     TFPaddingToMklDnnPadding(padding_)) :
+              convolution_forward::desc(prop_kind::forward,
+                     convolution_direct, fwd_input_md,
+                     fwd_filter_md, fwd_out_md,
+                     strides, padding_l, padding_r,
+                     TFPaddingToMklDnnPadding(padding_));
       auto fwd_pd = convolution_forward::primitive_desc(fwd_desc, cpu_engine);
 
       // Create memory for user data. Describe how the inputs and outputs of
@@ -485,8 +534,9 @@ class MklConv2DBackpropCommonOp : public OpKernel {
 
       // Operator-specific call to create and execute primitive.
       CreatePrimitive(context, cpu_engine, fwd_pd, &input, &filter,
-                      &outbackprop, &output, &output_tensor, strides, padding_l,
-                      padding_r, TFPaddingToMklDnnPadding(padding_),
+                      &outbackprop, &output, &output_tensor,
+                      strides, dilations, padding_l, padding_r,
+                      TFPaddingToMklDnnPadding(padding_),
                       bwd_output_dims, bwd_output_format);
     } catch (mkldnn::error& e) {
       string error_msg = "Status: " + std::to_string(e.status) +
@@ -535,20 +585,21 @@ class MklConv2DBackpropCommonOp : public OpKernel {
   virtual memory::format GetOutputFormat(const memory::format data_format) = 0;
 
   /// Create and execute the primitive storing output in the output_tensor.
-  virtual void CreatePrimitive(
-      OpKernelContext* context, const engine& cpu_engine,
-      const convolution_forward::primitive_desc& conv_fwd_pd,
-      MklDnnData<T>* input, MklDnnData<T>* filter, MklDnnData<T>* outbackprop,
-      MklDnnData<T>* output, Tensor** output_tensor,
-      const memory::dims& strides, const memory::dims& padding_l,
-      const memory::dims& padding_r, padding_kind padding,
-      const memory::dims& bwd_output_dims,
-      memory::format bwd_output_format) = 0;
+  virtual void CreatePrimitive(OpKernelContext* context,
+    const engine& cpu_engine,
+    const convolution_forward::primitive_desc& conv_fwd_pd,
+    MklDnnData<T>* input, MklDnnData<T>* filter, MklDnnData<T>* outbackprop,
+    MklDnnData<T>* output, Tensor** output_tensor, const memory::dims& strides,
+    const memory::dims& dilations, const memory::dims& padding_l,
+    const memory::dims& padding_r, padding_kind padding,
+    const memory::dims& bwd_output_dims,
+    memory::format bwd_output_format) = 0;
 
   // Get the data_format {NCHW, NHWC}
   TensorFormat GetTFDataFormat() { return data_format_; }
 
  private:
+  std::vector<int32> dilations_;
   std::vector<int32> strides_;
   Padding padding_;
   TensorFormat data_format_;
diff --git a/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc b/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc
index 8313224d7fe3e2d307d3642ced5b277b95c85cdb..7c14a46ddea16cdc729d61d034fe55c61bc66acb 100644
--- a/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc
+++ b/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc
@@ -262,7 +262,6 @@ class MklFusedBatchNormOp : public OpKernel {
     }
 
     void MklCreateInputLayout(OpKernelContext* context) {
-      const Tensor& input = MklGetInput(context, 0);
       bool input_in_mkl_format = mkl_shape_input_shape.IsMklTensor();
       if (input_in_mkl_format) {
         mkl_lt_input =
@@ -934,7 +933,7 @@ class MklFusedBatchNormOp : public OpKernel {
   bool is_training_;
   T* mean_values_;
   T* variance_values_;
-  size_t depth_;  // batch normalization is done for per channel.
+  int depth_;  // Batch normalization is done per channel.
 
   void ExtractParams(OpKernelContext* context) {
     const Tensor& input = MklGetInput(context, 0);
@@ -1110,19 +1109,12 @@ class MklFusedBatchNormGradOp : public OpKernel {
         return;
       }
 
-      if (dnn_shape_src.IsMklTensor())
-        depth_ = dnn_shape_src.DimSize(MklDnnDims::Dim_C);
-      else
-        ExtractParams(context);
-
-      memory::format format_m;
       if (dnn_shape_src.IsMklTensor()) {
-        if (dnn_shape_src.IsTensorInNCHWFormat())
-          format_m = memory::format::nchw;
-        else
-          format_m = memory::format::nhwc;
+        depth_ = dnn_shape_src.DimSize(MklDnnDims::Dim_C);
+      } else if (dnn_shape_diff_dst.IsMklTensor()) {
+        depth_ = dnn_shape_diff_dst.DimSize(MklDnnDims::Dim_C);
       } else {
-        format_m = TFDataFormatToMklDnnDataFormat(tensor_format_);
+        ExtractParams(context);
       }
 
       MklDnnData<T> src(&cpu_engine);
@@ -1146,20 +1138,20 @@ class MklFusedBatchNormGradOp : public OpKernel {
         diff_dst_dims =
             TFShapeToMklDnnDimsInNCHW(diff_dst_tensor.shape(), tensor_format_);
 
-      // set src and diff_dst primitives
+      // set src and diff_dst primitives according to input layout
       memory::desc src_md({}, memory::data_undef, memory::format_undef);
       memory::desc diff_dst_md({}, memory::data_undef, memory::format_undef);
-      if (dnn_shape_src.IsMklTensor() || dnn_shape_diff_dst.IsMklTensor()) {
-        if (dnn_shape_src.IsMklTensor()) {
-          src_md = dnn_shape_src.GetMklLayout();
-          diff_dst_md = src_md;
-        } else {
-          diff_dst_md = dnn_shape_diff_dst.GetMklLayout();
-          src_md = diff_dst_md;
-        }
+      if (dnn_shape_src.IsMklTensor()) {
+        src_md = dnn_shape_src.GetMklLayout();
       } else {
-        src_md = memory::desc(src_dims, MklDnnType<T>(), format_m);
-        diff_dst_md = src_md;
+        src_md = memory::desc(src_dims, MklDnnType<T>(),
+                TFDataFormatToMklDnnDataFormat(tensor_format_));
+      }
+      if (dnn_shape_diff_dst.IsMklTensor()) {
+        diff_dst_md = dnn_shape_diff_dst.GetMklLayout();
+      } else {
+        diff_dst_md = memory::desc(diff_dst_dims, MklDnnType<T>(),
+                TFDataFormatToMklDnnDataFormat(tensor_format_));
       }
       src.SetUsrMem(src_md, &src_tensor);
       diff_dst.SetUsrMem(diff_dst_md, &diff_dst_tensor);
@@ -1211,28 +1203,64 @@ class MklFusedBatchNormGradOp : public OpKernel {
       // allocate diff_src tensor
       MklDnnShape dnn_shape_diff_src;
       TensorShape tf_shape_diff_src;
-      if (dnn_shape_src.IsMklTensor()) {
+
+      // MKL-DNN's BN primitive does not provide an API to fetch its internal
+      // format, so use common_md as the OpMem layout:
+      // src and diff_dst are reordered to common_md, and
+      // diff_src is created with common_md.
+      memory::desc common_md({}, memory::data_undef, memory::format_undef);
+      if (dnn_shape_src.IsMklTensor() || dnn_shape_diff_dst.IsMklTensor()) {
+        if (dnn_shape_src.IsMklTensor()) {
+          common_md = dnn_shape_src.GetMklLayout();
+        } else {
+          common_md = dnn_shape_diff_dst.GetMklLayout();
+        }
+      } else {
+        common_md = memory::desc(src_dims, MklDnnType<T>(),
+                TFDataFormatToMklDnnDataFormat(tensor_format_));
+      }
+      // If either src or diff_dst is in MKL layout,
+      // then diff_src is set to MKL layout as well.
+      if (dnn_shape_src.IsMklTensor() ||
+              dnn_shape_diff_dst.IsMklTensor()) {
         dnn_shape_diff_src.SetMklTensor(true);
-        auto diff_src_pd = bnrm_fwd_pd.dst_primitive_desc();
+        // set diff_src's mkl layout as common_md
+        auto diff_src_pd = memory::primitive_desc(common_md, cpu_engine);
         dnn_shape_diff_src.SetMklLayout(&diff_src_pd);
         dnn_shape_diff_src.SetElemType(MklDnnType<T>());
-        dnn_shape_diff_src.SetTfLayout(dnn_shape_src.GetDimension(), src_dims,
-                                       format_m);
-        dnn_shape_diff_src.SetTfDimOrder(dnn_shape_src.GetDimension(),
-                                         tensor_format_);
+        if (dnn_shape_src.IsMklTensor()) {
+          dnn_shape_diff_src.SetTfLayout(
+                  dnn_shape_src.GetDimension(),
+                  src_dims,
+                  dnn_shape_src.GetTfDataFormat());
+          dnn_shape_diff_src.SetTfDimOrder(
+                  dnn_shape_src.GetDimension(),
+                  tensor_format_);
+        } else {
+          dnn_shape_diff_src.SetTfLayout(
+                  dnn_shape_diff_dst.GetDimension(),
+                  src_dims,
+                  dnn_shape_diff_dst.GetTfDataFormat());
+          dnn_shape_diff_src.SetTfDimOrder(
+                  dnn_shape_diff_dst.GetDimension(),
+                  tensor_format_);
+        }
         tf_shape_diff_src.AddDim(diff_src_pd.get_size() / sizeof(T));
       } else {
         dnn_shape_diff_src.SetMklTensor(false);
+        // both src and diff_dst are TensorFlow layout,
+        // so it is OK to get TensorFlow shape.
         tf_shape_diff_src = src_tensor.shape();
       }
       AllocateOutputSetMklShape(context, kDiffSrcIndex, &diff_src_tensor,
                                 tf_shape_diff_src, dnn_shape_diff_src);
 
-      diff_src.SetUsrMem(src_md, diff_src_tensor);
+      // set diff_src
+      diff_src.SetUsrMem(common_md, diff_src_tensor);
 
       prop_kind pk = prop_kind::backward;
       auto bnrm_bwd_desc = batch_normalization_backward::desc(
-          pk, diff_src.GetUsrMemDesc(), src.GetUsrMemDesc(), epsilon_,
+          pk, common_md, common_md, epsilon_,
           /* for inference, specify use_global_stats
              1. on fwd prop, use mean and variance
                 provided as inputs
@@ -1245,11 +1273,16 @@ class MklFusedBatchNormGradOp : public OpKernel {
       auto bnrm_bwd_pd = batch_normalization_backward::primitive_desc(
           bnrm_bwd_desc, cpu_engine, bnrm_fwd_pd);
 
+      std::vector<primitive> net;
+      src.CheckReorderToOpMem(memory::primitive_desc(common_md,
+                                   cpu_engine), &net);
+      diff_dst.CheckReorderToOpMem(memory::primitive_desc(common_md,
+                                   cpu_engine), &net);
+
       auto bnrm_bwd_op = batch_normalization_backward(
           bnrm_bwd_pd, src.GetOpMem(), mean.GetOpMem(), variance.GetOpMem(),
           diff_dst.GetOpMem(), weights_m, diff_src.GetOpMem(), diff_weights_m);
 
-      std::vector<primitive> net;
       net.push_back(bnrm_bwd_op);
       stream(stream::kind::eager).submit(net).wait();
 
diff --git a/tensorflow/core/kernels/mkl_input_conversion_op.cc b/tensorflow/core/kernels/mkl_input_conversion_op.cc
index e9a2376b545fcec97e1ced5c592351203abadd69..d91f7107c5b1effdfa6c4c3b95b16bcf31750f42 100644
--- a/tensorflow/core/kernels/mkl_input_conversion_op.cc
+++ b/tensorflow/core/kernels/mkl_input_conversion_op.cc
@@ -442,12 +442,11 @@ class MklInputConversionOp : public OpKernel {
       auto input_tf_md = mkl_output_mkl_shape.GetTfLayout();
       tf_input.SetUsrMem(input_tf_md, tf_tensor);
 
-      // Create reorder between tensorflow layout and Mkl layout.
+      // Create reorder between tensorflow layout and Mkl layout if necessary
       std::vector<primitive> net;
-      CHECK_EQ(tf_input.CheckReorderToOpMem(
+      tf_input.CheckReorderToOpMem(
                    memory::primitive_desc(output_mkl_md, cpu_engine),
-                   tensor_out, &net),
-               true);
+                   tensor_out, &net);
       stream(stream::kind::eager).submit(net).wait();
 
       // -- The tensor in MKL format passes through --
diff --git a/tensorflow/core/kernels/mkl_lrn_op.cc b/tensorflow/core/kernels/mkl_lrn_op.cc
index 5f0a12a1fb9bff3086e05918e23b8396196eb389..282012c719fe3045e880ef0dc9027a50c0f23fec 100644
--- a/tensorflow/core/kernels/mkl_lrn_op.cc
+++ b/tensorflow/core/kernels/mkl_lrn_op.cc
@@ -88,7 +88,8 @@ class MklLRNOp : public OpKernel {
     OP_REQUIRES_OK(context, context->GetAttr("alpha", &alpha_));
     OP_REQUIRES_OK(context, context->GetAttr("beta", &beta_));
     workspace_enabled_ = false;
-    context->GetAttr("workspace_enabled", &workspace_enabled_);
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("workspace_enabled", &workspace_enabled_));
   }
 
   void Compute(OpKernelContext* context) override {
@@ -357,7 +358,8 @@ class MklLRNGradOp : public OpKernel {
     OP_REQUIRES_OK(context, context->GetAttr("alpha", &alpha_));
     OP_REQUIRES_OK(context, context->GetAttr("beta", &beta_));
     workspace_enabled_ = false;
-    context->GetAttr("workspace_enabled", &workspace_enabled_);
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("workspace_enabled", &workspace_enabled_));
   }
 
   void Compute(OpKernelContext* context) override {
@@ -535,7 +537,6 @@ class MklLRNGradOp : public OpKernel {
                                 Tensor* mkl_tmp_outimage_buf_tensor) {
       const Tensor& in_grads = MklGetInput(context, 0);
       const Tensor& in_image = MklGetInput(context, 1);
-      const Tensor& out_image = MklGetInput(context, 2);
       const Tensor& workspace = MklGetInput(
           context,
           3); /*Worskpsace is enabled, get the buffer to the workspace */
@@ -544,8 +545,6 @@ class MklLRNGradOp : public OpKernel {
           static_cast<const void*>(in_grads.flat<T>().data()));
       void* user_fwd_input = const_cast<void*>(
           static_cast<const void*>(in_image.flat<T>().data()));
-      void* user_fwd_output = const_cast<void*>(
-          static_cast<const void*>(out_image.flat<T>().data()));
       void* workspace_buffer = const_cast<void*>(
           static_cast<const void*>(workspace.flat<T>().data()));
 
diff --git a/tensorflow/core/kernels/mkl_maxpooling_op.cc b/tensorflow/core/kernels/mkl_maxpooling_op.cc
index 14607f26e0ccd1028dd62343000d90ac8451d7bb..ea537524b11ef1362ff08b79ae25ca6e7048a9cd 100644
--- a/tensorflow/core/kernels/mkl_maxpooling_op.cc
+++ b/tensorflow/core/kernels/mkl_maxpooling_op.cc
@@ -69,7 +69,8 @@ class MklMaxPoolingOp : public OpKernel {
     // We may not get this attribute for this node if it does not go through
     // graph rewrite pass. So we do not check for error while retrieving this
     // attribute value.
-    context->GetAttr("workspace_enabled", &workspace_enabled_);
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("workspace_enabled", &workspace_enabled_));
   }
 
   void Compute(OpKernelContext* context) override {
@@ -118,7 +119,6 @@ class MklMaxPoolingOp : public OpKernel {
                               mkl_out_shape);
 
     Tensor* workspace_tensor;
-    void* workspace_buf = nullptr;
 
     TensorShape workspace_shape;
     mkl_workspace_shape.SetMklTensor(false);
@@ -226,7 +226,8 @@ class MklMaxPoolingGradOp : public OpKernel {
     // We may not get this attribute for this node if it does not go through
     // graph rewrite pass. So we do not check for error while retrieving this
     // attribute value.
-    context->GetAttr("workspace_enabled", &workspace_enabled_);
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("workspace_enabled", &workspace_enabled_));
   }
 
   void Compute(OpKernelContext* context) override {
diff --git a/tensorflow/core/kernels/mkl_relu_op.cc b/tensorflow/core/kernels/mkl_relu_op.cc
index 51db3991e2a24f087771f571cd91fc9fbb26040b..0a0f69522fad9a30bc6b0536348652ba61b8468f 100644
--- a/tensorflow/core/kernels/mkl_relu_op.cc
+++ b/tensorflow/core/kernels/mkl_relu_op.cc
@@ -25,7 +25,6 @@ limitations under the License.
 
 #include "mkl_dnn.h"
 #include "mkl_dnn_types.h"
-#include "tensorflow/core/platform/default/logging.h"
 #include "tensorflow/core/util/mkl_util.h"
 
 #ifndef INTEL_MKL_ML
@@ -368,8 +367,11 @@ void MklReluGradOp<Device, T>::Compute(OpKernelContext* context) {
   mkl_context.MklCleanup();
 }
 
+
+
 #else  // INTEL_MKL_ML
 
+
 template <typename Device, typename T, algorithm alg_kind>
 class MklReluOpBase : public OpKernel {
  public:
@@ -390,7 +392,7 @@ class MklReluOpBase : public OpKernel {
 
       Tensor* dst_tensor = nullptr;
       if (src_tensor.dims() == 0) {
-        Compute_Scalar(context);
+        Compute_Scalar(context);  // scalar case doesn't use in-place operation
         return;
       }
 
@@ -435,11 +437,15 @@ class MklReluOpBase : public OpKernel {
         dnn_shape_dst.SetMklTensor(false);
         tf_shape_dst = src_tensor.shape();
       }
-      AllocateOutputSetMklShape(context, dst_index, &dst_tensor, tf_shape_dst,
-                                dnn_shape_dst);
+
+      // Allocate output and MklDnnShape tensors separately for possible
+      // in-place operation
+      OP_REQUIRES_OK(context, context->forward_input_or_allocate_output(
+                                      {src_index}, dst_index, tf_shape_dst, &dst_tensor));
+      AllocateOutputSetMklShape(context, dst_index, dnn_shape_dst);
 
       // Destination memory descriptor is same as source memory descriptor.
-      auto dst_md = src_md;
+      auto& dst_md = src_md;
       dst.SetUsrMem(dst_md, dst_tensor);
 
       // execute net
@@ -490,7 +496,7 @@ class MklReluGradOpBase : public OpKernel {
 
       int src_dims_size = src_tensor.dims();
       if (src_dims_size == 0) {
-        Compute_Scalar(context);
+        Compute_Scalar(context);  // scalar case doesn't use in-place operation
         return;
       }
 
@@ -579,21 +585,35 @@ class MklReluGradOpBase : public OpKernel {
       // allocate diff_src tensor
       MklDnnShape dnn_shape_diff_src;
       TensorShape tf_shape_diff_src;
-      if (dnn_shape_src.IsMklTensor()) {
+      if (dnn_shape_src.IsMklTensor() ||
+              dnn_shape_diff_dst.IsMklTensor()) {
         dnn_shape_diff_src.SetMklTensor(true);
         auto diff_src_pd = relu_bwd_pd.diff_src_primitive_desc();
         dnn_shape_diff_src.SetMklLayout(&diff_src_pd);
         dnn_shape_diff_src.SetElemType(MklDnnType<T>());
-        dnn_shape_diff_src.SetTfLayout(dnn_shape_src.GetDimension(),
-                                       dnn_shape_src.GetSizesAsMklDnnDims(),
-                                       dnn_shape_src.GetTfDataFormat());
+        if (dnn_shape_src.IsMklTensor()) {
+          dnn_shape_diff_src.SetTfLayout(dnn_shape_src.GetDimension(),
+                                         dnn_shape_src.GetSizesAsMklDnnDims(),
+                                         dnn_shape_src.GetTfDataFormat());
+        } else {
+          dnn_shape_diff_src.SetTfLayout(dnn_shape_diff_dst.GetDimension(),
+                                 dnn_shape_diff_dst.GetSizesAsMklDnnDims(),
+                                 dnn_shape_diff_dst.GetTfDataFormat());
+        }
         tf_shape_diff_src.AddDim(diff_src_pd.get_size() / sizeof(T));
       } else {
         dnn_shape_diff_src.SetMklTensor(false);
+        // both src and diff_dst are TensorFlow layout,
+        // so it is ok to get TensorFlow shape.
         tf_shape_diff_src = src_tensor.shape();
       }
-      AllocateOutputSetMklShape(context, diff_src_index, &diff_src_tensor,
-                                tf_shape_diff_src, dnn_shape_diff_src);
+
+      // Allocate diff_src and MklDnnShape tensors separately for possible
+      // in-place operation
+      OP_REQUIRES_OK(context, context->forward_input_or_allocate_output(
+                                      {diff_dst_index}, diff_src_index, tf_shape_diff_src,
+                                      &diff_src_tensor));
+      AllocateOutputSetMklShape(context, diff_src_index, dnn_shape_diff_src);
 
       // diff_src memory descriptor is same as memory descriptor for both
       // inputs.
diff --git a/tensorflow/core/kernels/mutex_ops.cc b/tensorflow/core/kernels/mutex_ops.cc
index b8b1fc7679d758f2855af33618620e00f1bbb7e1..ddb7a606c1a7f0264c7c4a9cbb2f97095d9fee01 100644
--- a/tensorflow/core/kernels/mutex_ops.cc
+++ b/tensorflow/core/kernels/mutex_ops.cc
@@ -127,7 +127,7 @@ class Mutex : public ResourceBase {
       }
     }
     thread_pool_->Schedule(std::bind(
-        [this, c, cm, cancelled,
+        [this, cm, cancelled,
          token](std::function<void(const Status& s, SharedLockReleaser&& lock)>
                     fn_) {
           bool local_locked;
@@ -173,7 +173,7 @@ class MutexLockOp : public AsyncOpKernel {
     OP_REQUIRES_OK_ASYNC(
         c,
         LookupOrCreateResource<Mutex>(c, HandleFromInput(c, 0), &mutex,
-                                      [this, c](Mutex** ptr) {
+                                      [c](Mutex** ptr) {
                                         *ptr = new Mutex(
                                             c, HandleFromInput(c, 0).name());
                                         return Status::OK();
@@ -186,11 +186,10 @@ class MutexLockOp : public AsyncOpKernel {
 
     mutex->AcquireAsync(
         c, std::bind(
-               [this, c, variant, mutex](DoneCallback done_,
-                                         // End of bound arguments.
-                                         const Status& s,
-                                         Mutex::SharedLockReleaser&& lock) {
-                 core::ScopedUnref unref(mutex);
+               [c, variant, mutex](DoneCallback done_,
+                                   // End of bound arguments.
+                                   const Status& s,
+                                   Mutex::SharedLockReleaser&& lock) {
                  VLOG(2) << "Finished locking mutex " << mutex
                          << " with lock: " << lock.shared_lock.get()
                          << " status: " << s.ToString();
@@ -199,6 +198,7 @@ class MutexLockOp : public AsyncOpKernel {
                  } else {
                    c->SetStatus(s);
                  }
+                 mutex->Unref();
                  done_();
                },
                std::move(done), std::placeholders::_1, std::placeholders::_2));
diff --git a/tensorflow/core/kernels/pad_op.cc b/tensorflow/core/kernels/pad_op.cc
index eff3e4d92cc3ecc3b172a970564225c8b204cdd4..41494f56c5ea6b099f8eb7e81d50c83269aa278f 100644
--- a/tensorflow/core/kernels/pad_op.cc
+++ b/tensorflow/core/kernels/pad_op.cc
@@ -70,7 +70,7 @@ class PadOp : public OpKernel {
             "The first dimension of paddings must be the rank of inputs",
             in1.shape().DebugString(), " ", in0.shape().DebugString()));
 
-    T pad_value(0);
+    T pad_value = T();
     if (context->num_inputs() == 3) {
       const Tensor& constant_values = context->input(2);
       OP_REQUIRES(
@@ -104,42 +104,147 @@ class PadOp : public OpKernel {
       return;
     }
 
-    Tensor* output = nullptr;
-    OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output));
+    TensorShape collapsed_input_shape;
+    TensorShape collapsed_output_shape;
+    Tensor collapsed_paddings;
+    if (fixed_dims > 1 &&
+        CollapseAdjacentNonPaddedDimensions(
+            in0.shape(), in1, output_shape, &collapsed_input_shape,
+            &collapsed_paddings, &collapsed_output_shape)) {
+      Tensor collapsed_input;
+      CHECK(collapsed_input.CopyFrom(in0, collapsed_input_shape));
+      Tensor collapsed_output;
+      AllocatorAttributes alloc_attrs;
+      alloc_attrs.set_on_host(context->input_memory_type(0) == HOST_MEMORY);
+      OP_REQUIRES_OK(context,
+                     context->allocate_temp(collapsed_input.dtype(),
+                                            collapsed_output_shape,
+                                            &collapsed_output, alloc_attrs));
+      const Tensor& collapsed_paddings_ref = collapsed_paddings;
+      typename TTypes<Tpadding>::ConstMatrix collapsed_paddings_matrix =
+          collapsed_paddings_ref.matrix<Tpadding>();
 
+      OperateWithVariableRank(context, collapsed_input_shape.dims(),
+                              collapsed_input, collapsed_paddings_matrix,
+                              pad_value, &collapsed_output);
+
+      Tensor output;
+      CHECK(output.CopyFrom(collapsed_output, output_shape));
+      context->set_output(0, output);
+    } else {
+      Tensor* output = nullptr;
+      OP_REQUIRES_OK(context,
+                     context->allocate_output(0, output_shape, &output));
+      OperateWithVariableRank(context, fixed_dims, in0, paddings, pad_value,
+                              output);
+    }
+  }
+
+ private:
+  // Collapses adjacent dimensions that are not padded to one dimension for
+  // speed. Returns true if any two dimensions are collapsed. For example,
+  //
+  //   Pad(input_shape=[8, 28, 28, 3],
+  //       paddings=[[0, 0], [0, 0], [0, 0], [0, 1]])
+  // is equivalent to
+  //   Pad(input_shape=[6272, 3],
+  //       paddings=[[0, 0], [0, 1]])
+  //
+  // input_shape: the original input shape.
+  // paddings_as_tensor: the original paddings.
+  // output_shape: the original output shape.
+  // collapsed_input_shape: the input shape after collapsing.
+  // collapsed_paddings_as_tensor: the paddings after collapsing.
+  // collapsed_output_shape: the output shape after collapsing.
+  static bool CollapseAdjacentNonPaddedDimensions(
+      const TensorShape& input_shape, const Tensor& paddings_as_tensor,
+      const TensorShape& output_shape, TensorShape* collapsed_input_shape,
+      Tensor* collapsed_paddings_as_tensor,
+      TensorShape* collapsed_output_shape) {
+    bool collapsed = false;
+    typename TTypes<Tpadding>::ConstMatrix paddings =
+        paddings_as_tensor.matrix<Tpadding>();
+    std::vector<std::pair<int, int>> collapsed_paddings;
+    int i = 0;
+    while (i < paddings.dimension(0)) {
+      if (paddings(i, 0) != 0 || paddings(i, 1) != 0) {
+        // If padded, copy the original dimension over.
+        collapsed_input_shape->InsertDim(collapsed_input_shape->dims(),
+                                         input_shape.dim_size(i));
+        collapsed_output_shape->InsertDim(collapsed_output_shape->dims(),
+                                          output_shape.dim_size(i));
+        collapsed_paddings.push_back({paddings(i, 0), paddings(i, 1)});
+        ++i;
+      } else {
+        // If not padded, find the next dimension that is padded and collapse
+        // all dimensions in between to one dimension.
+        int64 collapsed_input_dim_size = input_shape.dim_size(i);
+        int64 collapsed_output_dim_size = output_shape.dim_size(i);
+        ++i;
+        while (i < paddings.dimension(0) && paddings(i, 0) == 0 &&
+               paddings(i, 1) == 0) {
+          collapsed = true;
+          collapsed_input_dim_size *= input_shape.dim_size(i);
+          collapsed_output_dim_size *= output_shape.dim_size(i);
+          ++i;
+        }
+        collapsed_input_shape->InsertDim(collapsed_input_shape->dims(),
+                                         collapsed_input_dim_size);
+        collapsed_output_shape->InsertDim(collapsed_output_shape->dims(),
+                                          collapsed_output_dim_size);
+        collapsed_paddings.push_back({0, 0});
+      }
+    }
+
+    // Copy collapsed_paddings to collapsed_paddings_as_tensor.
+    *collapsed_paddings_as_tensor =
+        Tensor(paddings_as_tensor.dtype(),
+               TensorShape({static_cast<int64>(collapsed_paddings.size()), 2}));
+    auto collapsed_paddings_as_matrix =
+        collapsed_paddings_as_tensor->matrix<Tpadding>();
+    for (size_t i = 0; i < collapsed_paddings.size(); ++i) {
+      collapsed_paddings_as_matrix(i, 0) = collapsed_paddings[i].first;
+      collapsed_paddings_as_matrix(i, 1) = collapsed_paddings[i].second;
+    }
+    return collapsed;
+  }
+
+  void OperateWithVariableRank(OpKernelContext* context, int fixed_dims,
+                               const Tensor& input,
+                               typename TTypes<Tpadding>::ConstMatrix paddings,
+                               T pad_value, Tensor* output) {
     // Invoke the dims-specific implementation.
     switch (fixed_dims) {
       case 0:
-        Operate<0>(context, in0.tensor<T, 0>(), paddings, pad_value, output);
+        Operate<0>(context, input.tensor<T, 0>(), paddings, pad_value, output);
         break;
       case 1:
         // TODO(irving): Once Pad doesn't need a scalar special case,
         // change flat to tensor.  That is, once !allow_legacy_scalars().
-        Operate<1>(context, in0.flat<T>(), paddings, pad_value, output);
+        Operate<1>(context, input.flat<T>(), paddings, pad_value, output);
         break;
       case 2:
-        Operate<2>(context, in0.tensor<T, 2>(), paddings, pad_value, output);
+        Operate<2>(context, input.tensor<T, 2>(), paddings, pad_value, output);
         break;
       case 3:
-        Operate<3>(context, in0.tensor<T, 3>(), paddings, pad_value, output);
+        Operate<3>(context, input.tensor<T, 3>(), paddings, pad_value, output);
         break;
       case 4:
-        Operate<4>(context, in0.tensor<T, 4>(), paddings, pad_value, output);
+        Operate<4>(context, input.tensor<T, 4>(), paddings, pad_value, output);
         break;
       case 5:
-        Operate<5>(context, in0.tensor<T, 5>(), paddings, pad_value, output);
+        Operate<5>(context, input.tensor<T, 5>(), paddings, pad_value, output);
         break;
       case 6:
-        Operate<6>(context, in0.tensor<T, 6>(), paddings, pad_value, output);
+        Operate<6>(context, input.tensor<T, 6>(), paddings, pad_value, output);
         break;
       default:
         OP_REQUIRES(context, false,
                     errors::InvalidArgument("Only ranks up to 6 supported: ",
-                                            in0.shape().DebugString()));
+                                            input.shape().DebugString()));
     }
   }
 
- private:
   template <int Dims>
   void Operate(OpKernelContext* context,
                typename TTypes<T, Dims>::ConstTensor input,
@@ -186,6 +291,7 @@ class PadOp : public OpKernel {
                           PadOp<CPUDevice, type, int64>);
 
 TF_CALL_POD_TYPES(REGISTER_KERNEL);
+TF_CALL_string(REGISTER_KERNEL);
 #undef REGISTER_KERNEL
 
 #if GOOGLE_CUDA
@@ -215,6 +321,7 @@ namespace functor {
   DECLARE_GPU_SPEC(T, 6);
 
 TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPECS);
+TF_CALL_int8(DECLARE_GPU_SPECS);
 }  // namespace functor
 
 // Registration of the GPU implementations.
@@ -247,6 +354,7 @@ TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPECS);
                           PadOp<GPUDevice, T, int64>)
 
 TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNEL);
+TF_CALL_int8(REGISTER_GPU_KERNEL);
 
 // A special GPU kernel for int32.
 // TODO(b/25387198): Also enable int32 in device memory. This kernel
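Note on the `pad_op.cc` change above: the collapsing step in `CollapseAdjacentNonPaddedDimensions` can be sketched outside TensorFlow as below. This is a minimal illustration under stated assumptions, not the kernel's code: `CollapsedPad` and `CollapseUnpaddedDims` are hypothetical names, plain `std::vector`s stand in for `TensorShape` and the paddings tensor, and only the input shape is tracked (the kernel also collapses the output shape in the same way).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the collapsed result; not a TensorFlow type.
struct CollapsedPad {
  std::vector<int64_t> input_shape;
  std::vector<std::pair<int64_t, int64_t>> paddings;
  bool collapsed = false;  // true if at least two dimensions were merged
};

// Merges each run of adjacent unpadded dimensions into a single dimension.
CollapsedPad CollapseUnpaddedDims(
    const std::vector<int64_t>& input_shape,
    const std::vector<std::pair<int64_t, int64_t>>& paddings) {
  assert(input_shape.size() == paddings.size());
  CollapsedPad result;
  std::size_t i = 0;
  while (i < paddings.size()) {
    if (paddings[i].first != 0 || paddings[i].second != 0) {
      // Padded dimension: copy it over unchanged.
      result.input_shape.push_back(input_shape[i]);
      result.paddings.push_back(paddings[i]);
      ++i;
    } else {
      // Unpadded dimension: absorb every unpadded dimension that follows.
      int64_t merged = input_shape[i];
      ++i;
      while (i < paddings.size() && paddings[i].first == 0 &&
             paddings[i].second == 0) {
        result.collapsed = true;
        merged *= input_shape[i];
        ++i;
      }
      result.input_shape.push_back(merged);
      result.paddings.push_back({0, 0});
    }
  }
  return result;
}
```

With the example from the kernel's comment, an input shape of `[8, 28, 28, 3]` padded only in the last dimension collapses to `[6272, 3]`, so the rank-2 code path can be used instead of the rank-4 one.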
diff --git a/tensorflow/core/kernels/pad_op_gpu.cu.cc b/tensorflow/core/kernels/pad_op_gpu.cu.cc
index 613ad628251915951be7a99ce687ceeef89d7aef..8e13e19e2ee03557e51ab21dc813ed33b75210dc 100644
--- a/tensorflow/core/kernels/pad_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/pad_op_gpu.cu.cc
@@ -40,6 +40,7 @@ typedef Eigen::GpuDevice GPUDevice;
   DEFINE_GPU_PAD_SPECS(T, int64)
 
 TF_CALL_GPU_NUMBER_TYPES(DEFINE_GPU_SPECS);
+TF_CALL_int8(DEFINE_GPU_SPECS);
 
 }  // namespace tensorflow
 
diff --git a/tensorflow/core/kernels/random_op.cc b/tensorflow/core/kernels/random_op.cc
index 78ff7948fbf1b6406b2faca1d94acd7ea3325437..e37232539f2f32cba74cde354dade0efe8bf719a 100644
--- a/tensorflow/core/kernels/random_op.cc
+++ b/tensorflow/core/kernels/random_op.cc
@@ -495,6 +495,7 @@ class RandomGammaOp : public OpKernel {
                           RandomUniformIntOp<CPUDevice, IntType>);
 
 TF_CALL_half(REGISTER);
+TF_CALL_bfloat16(REGISTER);
 TF_CALL_float(REGISTER);
 TF_CALL_double(REGISTER);
 TF_CALL_int32(REGISTER_INT);
diff --git a/tensorflow/core/kernels/regex_replace_op.cc b/tensorflow/core/kernels/regex_replace_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..59ec854a79c90424966e4c7f19f8e5c10dfe17d4
--- /dev/null
+++ b/tensorflow/core/kernels/regex_replace_op.cc
@@ -0,0 +1,76 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <string>
+
+#include "re2/re2.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+
+class RegexReplaceOp : public OpKernel {
+ public:
+  explicit RegexReplaceOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("replace_global", &replace_global_));
+  }
+
+  void Compute(OpKernelContext* ctx) override {
+    const Tensor* input_tensor;
+    OP_REQUIRES_OK(ctx, ctx->input("input", &input_tensor));
+    const auto& input_flat = input_tensor->flat<string>();
+
+    const Tensor* pattern_tensor;
+    OP_REQUIRES_OK(ctx, ctx->input("pattern", &pattern_tensor));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(pattern_tensor->shape()),
+                errors::InvalidArgument("Pattern must be scalar, but received ",
+                                        pattern_tensor->shape().DebugString()));
+    const string pattern = pattern_tensor->flat<string>()(0);
+    const RE2 match(pattern);
+    OP_REQUIRES(ctx, match.ok(),
+                errors::InvalidArgument("Invalid pattern: ", pattern,
+                                        ", error: ", match.error()));
+
+    const Tensor* rewrite_tensor;
+    OP_REQUIRES_OK(ctx, ctx->input("rewrite", &rewrite_tensor));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(rewrite_tensor->shape()),
+                errors::InvalidArgument("Rewrite must be scalar, but received ",
+                                        rewrite_tensor->shape().DebugString()));
+    const string rewrite = rewrite_tensor->flat<string>()(0);
+
+    Tensor* output_tensor = nullptr;
+    OP_REQUIRES_OK(ctx, ctx->allocate_output("output", input_tensor->shape(),
+                                             &output_tensor));
+    auto output_flat = output_tensor->flat<string>();
+    for (size_t i = 0; i < input_flat.size(); ++i) {
+      output_flat(i) = input_flat(i);
+      if (replace_global_) {
+        RE2::GlobalReplace(&output_flat(i), match, rewrite);
+      } else {
+        RE2::Replace(&output_flat(i), match, rewrite);
+      }
+    }
+  }
+
+ private:
+  bool replace_global_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("RegexReplace").Device(DEVICE_CPU),
+                        RegexReplaceOp);
+
+}  // namespace tensorflow
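Note on the new `RegexReplaceOp` above: the difference between `replace_global` being true and false can be sketched with the standard library. This is only an illustration under stated assumptions: it uses `std::regex` instead of RE2, `RegexReplaceAll` is a hypothetical helper name, and the two libraries differ in details (for example, RE2 rewrite strings refer to groups as `\1` while `std::regex` uses `$1`, and an invalid pattern here throws `std::regex_error` rather than producing an `InvalidArgument` status).

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper mirroring the op's contract: one scalar pattern and one
// scalar rewrite are applied to every input string, and the output has the
// same number of elements as the input.
std::vector<std::string> RegexReplaceAll(const std::vector<std::string>& input,
                                         const std::string& pattern,
                                         const std::string& rewrite,
                                         bool replace_global) {
  const std::regex re(pattern);  // throws std::regex_error if invalid
  // format_first_only limits the replacement to the first match.
  const auto flags = replace_global
                         ? std::regex_constants::format_default
                         : std::regex_constants::format_first_only;
  std::vector<std::string> output;
  output.reserve(input.size());
  for (const std::string& s : input) {
    output.push_back(std::regex_replace(s, re, rewrite, flags));
  }
  assert(output.size() == input.size());
  return output;
}
```

As in the kernel, the pattern is compiled once per call and reused for every element rather than being recompiled inside the loop.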
diff --git a/tensorflow/core/kernels/relu_op_gpu.cu.cc b/tensorflow/core/kernels/relu_op_gpu.cu.cc
index ec09d8dfea519a70474dca7d3167ba20d3d16d69..6e46c979f33496f8da2c561683723728e28a610e 100644
--- a/tensorflow/core/kernels/relu_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/relu_op_gpu.cu.cc
@@ -19,15 +19,104 @@ limitations under the License.
 
 #include <stdio.h>
 
-#include "tensorflow/core/kernels/relu_op_functor.h"
-
+#include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/kernels/relu_op_functor.h"
+#include "tensorflow/core/util/cuda_kernel_helper.h"
+#include "tensorflow/core/util/cuda_launch_config.h"
 
 namespace tensorflow {
 
 typedef Eigen::GpuDevice GPUDevice;
 
+namespace functor {
+#ifdef TF_HAS_CUDA_FP16
+
+// This kernel computes ReluGrad by processing one half2, two fp16, at a time.
+// It effectively does: backprop = (feature > 0) ? gradient : 0
+// It also tries to use native half2 primitives as much as possible.
+__global__ void ReluGradHalfKernel(const Eigen::half* gradient,
+                                   const Eigen::half* feature,
+                                   Eigen::half* backprop, int32 count) {
+  int32 half2_count = count >> 1;
+  int32 index = blockIdx.x * blockDim.x + threadIdx.x;
+  const int32 total_device_threads = gridDim.x * blockDim.x;
+
+  while (index < half2_count) {
+    // The fast branch.
+    // One half2, two fp16, is fetched and processed at a time.
+    half2 gradient_h2 = reinterpret_cast<const half2*>(gradient)[index];
+    half2 feature_h2 = reinterpret_cast<const half2*>(feature)[index];
+    half2* p_backprop_h2 = reinterpret_cast<half2*>(backprop) + index;
+
+#if __CUDA_ARCH__ >= 530
+    // Fast path, when half2 primitives are available.
+    const half2 kZeroH2 = __float2half2_rn(0.f);
+    // mask = (feature > 0)
+    half2 mask_h2 = __hgt2(feature_h2, kZeroH2);
+    // backprop = mask * gradient
+    half2 backprop_h2 = __hmul2(mask_h2, gradient_h2);
+#else
+    // Fall back: convert half2 to float2 for processing.
+    float2 feature_f2 = __half22float2(feature_h2);
+    float2 gradient_f2 = __half22float2(gradient_h2);
+    float2 backprop_f2 = make_float2((feature_f2.x > 0) ? gradient_f2.x : 0,
+                                     (feature_f2.y > 0) ? gradient_f2.y : 0);
+    // Convert back to half2.
+    half2 backprop_h2 = __float22half2_rn(backprop_f2);
+#endif
+
+    // Write back the result.
+    *p_backprop_h2 = backprop_h2;
+
+    index += total_device_threads;
+  }
+
+  if ((count & 0x1) == 1 && index == half2_count) {
+    // If the total number of the elements is odd, process the last element.
+    Eigen::half grad_h = gradient[count - 1];
+    Eigen::half feature_h = feature[count - 1];
+
+    float grad_f = static_cast<float>(grad_h);
+    float feature_f = static_cast<float>(feature_h);
+    float backprop_f = (feature_f > 0) ? grad_f : 0;
+
+    Eigen::half backprop_h(backprop_f);
+    backprop[count - 1] = backprop_h;
+  }
+}
+
+template <typename Device>
+struct ReluGrad<Device, Eigen::half> {
+  // Computes ReluGrad backprop.
+  //
+  // gradient: gradient backpropagated to the Relu op.
+  // feature: either the inputs that were passed to the Relu, or its outputs
+  //           (using either one yields the same result here).
+  // backprop: gradient to backpropagate to the Relu inputs.
+  void operator()(const Device& d,
+                  typename TTypes<Eigen::half>::ConstTensor gradient,
+                  typename TTypes<Eigen::half>::ConstTensor feature,
+                  typename TTypes<Eigen::half>::Tensor backprop) {
+    // NOTE: When the activation is exactly zero, we do not propagate the
+    // associated gradient value. This allows the output of the Relu to be used,
+    // as well as its input.
+    int32 count = gradient.size();
+    if (count == 0) return;
+    int32 half2_count = Eigen::divup(count, 2);
+    const int32 kThreadInBlock = 512;
+    CudaLaunchConfig config = GetCudaLaunchConfigFixedBlockSize(
+        half2_count, d, ReluGradHalfKernel, 0, kThreadInBlock);
+    ReluGradHalfKernel<<<config.block_count, config.thread_per_block, 0,
+                         d.stream()>>>(gradient.data(), feature.data(),
+                                       backprop.data(), count);
+  }
+};
+
+#endif  // TF_HAS_CUDA_FP16
+}  // namespace functor
+
 // Definition of the GPU implementations declared in relu_op.cc.
 #define DEFINE_GPU_KERNELS(T)                       \
   template struct functor::Relu<GPUDevice, T>;      \
diff --git a/tensorflow/core/kernels/resource_variable_ops.cc b/tensorflow/core/kernels/resource_variable_ops.cc
index 2041fb90946860c5164da3cb448ff81d9f654e54..aecad0185fd6cd9574e29bb33f2707e04650aef4 100644
--- a/tensorflow/core/kernels/resource_variable_ops.cc
+++ b/tensorflow/core/kernels/resource_variable_ops.cc
@@ -351,7 +351,7 @@ class AssignVariableOp<Device, Variant> : public OpKernel {
     Var* variable = nullptr;
     OP_REQUIRES_OK(context, LookupOrCreateResource<Var>(
                                 context, HandleFromInput(context, 0), &variable,
-                                [this, context](Var** ptr) {
+                                [](Var** ptr) {
                                   // Created on host.
                                   *ptr = new Var(DT_VARIANT);
                                   return Status::OK();
@@ -374,7 +374,7 @@ class AssignVariableOp<Device, Variant> : public OpKernel {
       OP_REQUIRES_OK(context, VariantDeviceCopy(
                                   VariantDeviceCopyDirection::DEVICE_TO_DEVICE,
                                   elements_in(i), &elements_out(i), copy_fn));
-    };
+    }
   }
 
  private:
@@ -608,7 +608,7 @@ class ResourceScatterUpdateOp : public OpKernel {
                                 DataTypeString(DataTypeToEnum<Index>::v()),
                                 " indexing: ", N_big, " > ",
                                 std::numeric_limits<Index>::max()));
-    const Index N = static_cast<Index>(indices.NumElements());
+    const Index N = static_cast<Index>(N_big);
     OP_REQUIRES(
         c, params->dim_size(0) <= std::numeric_limits<Index>::max(),
         errors::InvalidArgument("params.shape[0] too large for ",
@@ -619,7 +619,13 @@ class ResourceScatterUpdateOp : public OpKernel {
     if (N > 0) {
       auto indices_flat = indices.flat<Index>();
       auto params_flat = params->flat_outer_dims<T>();
-      auto updates_flat = updates.shaped<T, 2>({N, updates.NumElements() / N});
+      int64 num_updates = updates.NumElements();
+      OP_REQUIRES(c, num_updates % N == 0,
+                  errors::InvalidArgument(
+                      "shape of indices (", indices.shape().DebugString(),
+                      ") is not compatible with the shape of updates (",
+                      updates.shape().DebugString(), ")"));
+      auto updates_flat = updates.shaped<T, 2>({N, num_updates / N});
 
       functor::ScatterFunctor<Device, T, Index, op> functor;
       const Index bad_i = functor(c, c->template eigen_device<Device>(),
diff --git a/tensorflow/core/kernels/scoped_allocator_ops.cc b/tensorflow/core/kernels/scoped_allocator_ops.cc
new file mode 100644
index 0000000000000000000000000000000000000000..d7b25ffad0408383a63d5127f85ce41f40890e87
--- /dev/null
+++ b/tensorflow/core/kernels/scoped_allocator_ops.cc
@@ -0,0 +1,216 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/dma_helper.h"
+#include "tensorflow/core/common_runtime/scoped_allocator.h"
+#include "tensorflow/core/common_runtime/scoped_allocator_mgr.h"
+#include "tensorflow/core/framework/allocator.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+
+class ScopedAllocatorOp : public OpKernel {
+ public:
+  explicit ScopedAllocatorOp(OpKernelConstruction* context)
+      : OpKernel(context) {
+    OP_REQUIRES_OK(context, context->GetAttr("T", &dtype_));
+    OP_REQUIRES_OK(context, context->GetAttr("shapes", &shapes_));
+    OP_REQUIRES_OK(context, context->GetAttr("sa_name", &name_));
+    OP_REQUIRES_OK(context, context->GetAttr("id", &id_));
+    OP_REQUIRES_OK(context, context->GetAttr("expected_call_count",
+                                             &expected_call_count_));
+    device_ = context->device();
+    // Precalculate the size of the backing tensor and the offsets of
+    // the subtensors to be allocated from it, taking into account
+    // alignment considerations.
+    ScopedAllocatorMgr::PopulateFields(id_, shapes_, dtype_, &fields_);
+    size_t num_bytes = fields_.back().offset + fields_.back().bytes;
+    num_elements_ = num_bytes / DataTypeSize(dtype_);
+    OP_REQUIRES(context, num_bytes % DataTypeSize(dtype_) == 0,
+                errors::InvalidArgument(
+                    "Number of bytes ", num_bytes,
+                    " must be divisible by size of datatype ", dtype_));
+  }
+
+  void Compute(OpKernelContext* context) override {
+    ScopedAllocatorMgr* sam = device_->GetScopedAllocatorMgr();
+    if (!sam) {
+      context->SetStatus(errors::Internal(
+          "ScopedAllocatorMgr not supported on device ", device_->name()));
+      return;
+    }
+    Tensor* backing_tensor = nullptr;
+    AllocatorAttributes attr = context->output_alloc_attr(0);
+    Status s =
+        context->allocate_output(0, {num_elements_}, &backing_tensor, attr);
+    VLOG(1) << "_ScopedAllocatorOp new backing tensor size "
+            << backing_tensor->TotalBytes() << " num_elements_ "
+            << num_elements_ << " buffer " << DMAHelper::buffer(backing_tensor)
+            << " base addr " << DMAHelper::base(backing_tensor);
+    if (s.ok()) {
+      s = sam->AddScopedAllocator(*backing_tensor, context->step_id(), id_,
+                                  name_, fields_, expected_call_count_);
+    }
+    if (!s.ok()) {
+      context->SetStatus(s);
+    }
+  }
+
+ private:
+  std::vector<TensorShape> shapes_;
+  DataType dtype_;
+  int64 num_elements_;
+  std::vector<ScopedAllocator::Field> fields_;
+  string name_;
+  int32 id_;
+  int32 expected_call_count_;
+  DeviceBase* device_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("_ScopedAllocator").Device(DEVICE_CPU),
+                        ScopedAllocatorOp);
+
+REGISTER_KERNEL_BUILDER(Name("_ScopedAllocator").Device(DEVICE_GPU),
+                        ScopedAllocatorOp);
+
+class ScopedAllocatorConcatOp : public OpKernel {
+ public:
+  explicit ScopedAllocatorConcatOp(OpKernelConstruction* context)
+      : OpKernel(context) {
+    OP_REQUIRES_OK(context, context->GetAttr("shape", &shape_));
+    OP_REQUIRES_OK(context, context->GetAttr("T", &dtype_));
+    // The attributes below are used only for debugging.
+    OP_REQUIRES_OK(context, context->GetAttr("sa_name", &name_));
+    OP_REQUIRES_OK(context, context->GetAttr("id", &id_));
+    device_ = context->device();
+  }
+
+  void Compute(OpKernelContext* context) override {
+    const Tensor& backing_tensor = context->input(0);
+    // Check that type matches.
+    OP_REQUIRES(
+        context, backing_tensor.dtype() == dtype_,
+        errors::InvalidArgument("Backing tensor type ", backing_tensor.dtype(),
+                                " does not match expected type ", dtype_));
+    // Check that backing tensor is at least as large as the shape of the
+    // output.
+    OP_REQUIRES(context, backing_tensor.NumElements() >= shape_.num_elements(),
+                errors::InvalidArgument("Backing tensor num elements ",
+                                        backing_tensor.NumElements(),
+                                        " is less than expected ",
+                                        shape_.num_elements()));
+    VLOG(1) << "_ScopedAllocatorConcatOp outputting backing tensor at "
+            << DMAHelper::base(&backing_tensor);
+    Tensor backing_copy(backing_tensor);
+    context->set_output(0, backing_copy);
+    const TensorBuffer* backing_buf = DMAHelper::buffer(&backing_copy);
+    const void* backing_tensor_lb = backing_buf->data();
+    const void* backing_tensor_ub = static_cast<const void*>(
+        static_cast<const char*>(backing_tensor_lb) + backing_buf->size());
+    // Check that all inputs lie entirely within the backing tensor.
+    for (int i = 1; i < context->num_inputs(); ++i) {
+      const TensorBuffer* input_buf = DMAHelper::buffer(&context->input(i));
+      const void* input_lb = input_buf->data();
+      OP_REQUIRES(
+          context, input_lb >= backing_tensor_lb,
+          errors::InvalidArgument("Lower bound check fail for input ", i,
+                                  " to node ", context->op_kernel().name()));
+      const void* input_ub = static_cast<const void*>(
+          static_cast<const char*>(input_lb) + input_buf->size());
+      OP_REQUIRES(
+          context, input_ub <= backing_tensor_ub,
+          errors::InvalidArgument("Upper bound check fail for input ", i,
+                                  " to node ", context->op_kernel().name()));
+    }
+  }
+
+ private:
+  TensorShape shape_;
+  DataType dtype_;
+  string name_;
+  int32 id_;
+  DeviceBase* device_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("_ScopedAllocatorConcat").Device(DEVICE_CPU),
+                        ScopedAllocatorConcatOp);
+
+REGISTER_KERNEL_BUILDER(Name("_ScopedAllocatorConcat").Device(DEVICE_GPU),
+                        ScopedAllocatorConcatOp);
+
+class ScopedAllocatorSplitOp : public OpKernel {
+ public:
+  explicit ScopedAllocatorSplitOp(OpKernelConstruction* context)
+      : OpKernel(context) {
+    OP_REQUIRES_OK(context, context->GetAttr("T", &dtype_));
+    // The attributes below are used only for debugging.
+    OP_REQUIRES_OK(context, context->GetAttr("sa_name", &name_));
+    OP_REQUIRES_OK(context, context->GetAttr("id", &id_));
+    device_ = context->device();
+  }
+
+  void Compute(OpKernelContext* context) override {
+    Tensor backing_copy(context->input(0));
+    // Check that type matches.
+    OP_REQUIRES(
+        context, backing_copy.dtype() == dtype_,
+        errors::InvalidArgument("Backing tensor type ", backing_copy.dtype(),
+                                " does not match expected type ", dtype_));
+    const TensorBuffer* backing_buf = DMAHelper::buffer(&backing_copy);
+    const void* backing_tensor_lb = backing_buf->data();
+    const void* backing_tensor_ub = static_cast<const void*>(
+        static_cast<const char*>(backing_tensor_lb) + backing_buf->size());
+    for (int i = 1; i < context->num_inputs(); ++i) {
+      VLOG(1) << "_ScopedAllocatorSplitOp assigning input " << i
+              << " to output " << i - 1 << " buf addr "
+              << DMAHelper::base(&context->input(i));
+      Tensor copy(context->input(i));
+      OP_REQUIRES(
+          context, copy.dtype() == dtype_,
+          errors::InvalidArgument("Input ", i, " tensor type ", copy.dtype(),
+                                  " does not match expected type ", dtype_));
+      context->set_output(i - 1, copy);
+      const TensorBuffer* input_buf = DMAHelper::buffer(&copy);
+      const void* input_lb = input_buf->data();
+      OP_REQUIRES(
+          context, input_lb >= backing_tensor_lb,
+          errors::InvalidArgument("Lower bound check fail for input ", i,
+                                  " to node ", context->op_kernel().name()));
+      const void* input_ub = static_cast<const void*>(
+          static_cast<const char*>(input_lb) + input_buf->size());
+      OP_REQUIRES(
+          context, input_ub <= backing_tensor_ub,
+          errors::InvalidArgument("Upper bound check fail for input ", i,
+                                  " to node ", context->op_kernel().name()));
+    }
+  }
+
+ private:
+  DataType dtype_;
+  string name_;
+  int32 id_;
+  DeviceBase* device_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("_ScopedAllocatorSplit").Device(DEVICE_CPU),
+                        ScopedAllocatorSplitOp);
+
+REGISTER_KERNEL_BUILDER(Name("_ScopedAllocatorSplit").Device(DEVICE_GPU),
+                        ScopedAllocatorSplitOp);
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/scoped_allocator_ops_test.cc b/tensorflow/core/kernels/scoped_allocator_ops_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..3d36c8b7d43748abb91a5ecd2edd22dada7ae9c6
--- /dev/null
+++ b/tensorflow/core/kernels/scoped_allocator_ops_test.cc
@@ -0,0 +1,296 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <vector>
+
+#include "tensorflow/core/common_runtime/dma_helper.h"
+#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
+#include "tensorflow/core/common_runtime/scoped_allocator.h"
+#include "tensorflow/core/common_runtime/scoped_allocator_mgr.h"
+#include "tensorflow/core/framework/fake_input.h"
+#include "tensorflow/core/framework/node_def_builder.h"
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor_shape.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/graph/graph.h"
+#include "tensorflow/core/graph/node_builder.h"
+#include "tensorflow/core/graph/testlib.h"
+#include "tensorflow/core/kernels/ops_testutil.h"
+#include "tensorflow/core/platform/test_benchmark.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+class ScopedAllocatorOpTest : public OpsTestBase {
+ protected:
+  void MakeOp(const gtl::ArraySlice<TensorShape>& shapes, DataType dtype,
+              const string& name, int32 id, int32 expected_call_count) {
+    TF_EXPECT_OK(NodeDefBuilder("scoped_allocator_op", "_ScopedAllocator")
+                     .Attr("T", dtype)
+                     .Attr("shapes", shapes)
+                     .Attr("sa_name", name)
+                     .Attr("id", id)
+                     .Attr("expected_call_count", expected_call_count)
+                     .Finalize(node_def()));
+    TF_EXPECT_OK(InitOp());
+    TF_ASSERT_OK(RunOpKernel());
+
+    // Allocate and deallocate the tensors so that memory is not leaked.
+    AllocatorAttributes attr;
+    Allocator* allocator;
+    for (size_t i = 0; i < shapes.size(); i++) {
+      attr.scope_id = id + i + 1;
+      allocator = device_->GetScopedAllocator(attr, context_->step_id());
+      Tensor temp(allocator, dtype, shapes[i]);
+    }
+  }
+};
+
+TEST_F(ScopedAllocatorOpTest, Simple) {
+  MakeOp({TensorShape({8})}, DT_FLOAT, "test", 120, 1);
+  MakeOp({TensorShape({32, 32})}, DT_DOUBLE, "test1", 130, 1);
+  MakeOp({TensorShape({64}), TensorShape({3, 3}), TensorShape({5, 5, 5})},
+         DT_HALF, "test2", 140, 3);
+  MakeOp({TensorShape({512}), TensorShape({64, 8})}, DT_UINT32, "test3", 150,
+         2);
+}
+
+// PrepOp is common to the ConcatOp and SplitOp tests.
+// It allocates a backing tensor that is large enough to hold all slices defined
+// by fields, creates ScopedAllocatorInstances for each field, allocates the
+// tensors, and assigns them as inputs to the op.
+// We won't use the AddInput* suite of functions from ops_testutil.h because
+// they allocate new tensors for each input.  We need to mimic what a
+// ScopedAllocator would do.
+void PrepOp(DataType dtype, int32 id,
+            const std::vector<TensorShape>& fields_shapes,
+            std::vector<ScopedAllocator::Field>* fields,
+            Tensor** backing_tensor, Allocator* allocator,
+            ScopedAllocatorMgr* sam, const string& op_name,
+            std::vector<Tensor>* tensors,
+            gtl::InlinedVector<TensorValue, 4>* inputs,
+            const DataTypeVector& input_types) {
+  ScopedAllocatorMgr::PopulateFields(id, fields_shapes, dtype, fields);
+  // We don't simply allocate a tensor with shape as backing_tensor_shape,
+  // because we need to account for padding in the fields.  We actually need a
+  // tensor of size at least (fields[-1].offset + fields[-1].bytes).
+  size_t num_bytes = fields->back().offset + fields->back().bytes;
+  int32_t num_elements = num_bytes / DataTypeSize(dtype);
+  CHECK_EQ(num_bytes % DataTypeSize(dtype), 0);
+
+  *backing_tensor = new Tensor(allocator, dtype, {num_elements});
+  int64 step_id = 10;
+  Status s = sam->AddScopedAllocator(**backing_tensor, step_id, id,
+                                     "sa_" + op_name + "_test", *fields,
+                                     fields_shapes.size());
+  TF_ASSERT_OK(s);
+
+  ScopedAllocatorContainer* sac = sam->GetContainer(step_id);
+  std::vector<ScopedAllocatorInstance*> sa_instances(fields_shapes.size(),
+                                                     nullptr);
+  for (size_t i = 0; i < fields_shapes.size(); i++) {
+    sa_instances[i] = sac->GetInstance(id + i + 1);
+    tensors->push_back(Tensor(sa_instances[i], dtype, fields_shapes[i]));
+  }
+  // Now add the tensor as an input to ScopedAllocator<op_name>Op.
+  // Order matters here, so first add the backing tensor, then the slices.
+  inputs->reserve(1 + tensors->size());
+  CHECK_GT(input_types.size(), inputs->size());
+  CHECK_EQ(input_types[inputs->size()], dtype);
+  inputs->push_back({nullptr, *backing_tensor});
+  for (size_t i = 0; i < tensors->size(); i++) {
+    CHECK_EQ(input_types[inputs->size()], dtype);
+    inputs->push_back({nullptr, &((*tensors)[i])});
+  }
+}
+
+class ScopedAllocatorConcatOpTest : public OpsTestBase {
+ protected:
+  void MakeOp(const TensorShape& shape, DataType dtype, const string& name,
+              int32 id, int32 num_tensors) {
+    TF_EXPECT_OK(
+        NodeDefBuilder("scoped_allocator_concat_op", "_ScopedAllocatorConcat")
+            .Attr("shape", shape)
+            .Attr("T", dtype)
+            .Attr("N", num_tensors)
+            .Attr("sa_name", name)
+            .Attr("id", id)
+            .Input(FakeInput(dtype))               // backing tensor
+            .Input(FakeInput(num_tensors, dtype))  // list of tensors
+            .Finalize(node_def()));
+    TF_EXPECT_OK(InitOp());
+  }
+
+  void ExecOp(DataType dtype, int32 id,
+              const std::vector<TensorShape>& fields_shapes) {
+    Tensor* backing_tensor = nullptr;
+    std::vector<Tensor> tensors;
+    std::vector<ScopedAllocator::Field> fields;
+    PrepOp(dtype, id, fields_shapes, &fields, &backing_tensor, allocator(),
+           device_->GetScopedAllocatorMgr(), "concat", &tensors, &inputs_,
+           input_types_);
+
+    TF_ASSERT_OK(RunOpKernel());
+
+    // Check input and output are same tensor.
+    const Tensor& input = context_->input(0);
+    OpOutputList output_list;
+    Status s = context_->output_list("output", &output_list);
+    TF_ASSERT_OK(s);
+    const Tensor& output = *(output_list[0]);
+    CHECK_EQ(DMAHelper::base(&input), DMAHelper::base(&output));
+    CHECK_EQ(input.dtype(), output.dtype());
+    CHECK_EQ(input.NumElements(), output.NumElements());
+
+    // Free the backing tensor which was allocated in PrepOp.
+    delete backing_tensor;
+  }
+};
+
+TEST_F(ScopedAllocatorConcatOpTest, Success1) {
+  MakeOp({32}, DT_FLOAT, "test", 120, 2);
+  ExecOp(DT_FLOAT, 120, {{16}, {16}});
+}
+
+TEST_F(ScopedAllocatorConcatOpTest, Success2) {
+  MakeOp({2, 2, 2}, DT_DOUBLE, "test", 120, 2);
+  ExecOp(DT_DOUBLE, 120, {{2, 2}, {2, 2}});
+}
+
+TEST_F(ScopedAllocatorConcatOpTest, Success3) {
+  MakeOp({3, 3, 3}, DT_HALF, "test", 120, 3);
+  ExecOp(DT_HALF, 120, {{3, 3}, {3, 3}, {3, 3}});
+}
+
+TEST_F(ScopedAllocatorConcatOpTest, FailDtypeCheck) {
+  MakeOp({8}, DT_FLOAT, "test", 120, 2);
+  EXPECT_DEATH(ExecOp(DT_DOUBLE, 120, {{4}, {4}}), "");
+}
+
+TEST_F(ScopedAllocatorConcatOpTest, FailNumElementsCheck) {
+  MakeOp({32}, DT_FLOAT, "test", 120, 2);
+  AddInputFromArray<float>({8}, {0, 1, 2, 3, 4, 5, 6, 7});
+  AddInputFromArray<float>({4}, {0, 1, 2, 3});
+  AddInputFromArray<float>({4}, {4, 5, 6, 7});
+  Status s = RunOpKernel();
+  EXPECT_EQ(s.code(), error::INVALID_ARGUMENT);
+}
+
+// This test should fail because the backing tensor and the input tensors are
+// unrelated, i.e. the inputs are not slices of the backing tensor.
+TEST_F(ScopedAllocatorConcatOpTest, FailBounds) {
+  MakeOp({8}, DT_DOUBLE, "test", 120, 2);
+  AddInputFromArray<double>({8}, {0, 1, 2, 3, 4, 5, 6, 7});
+  AddInputFromArray<double>({4}, {0, 1, 2, 3});
+  AddInputFromArray<double>({4}, {4, 5, 6, 7});
+  Status s = RunOpKernel();
+  EXPECT_EQ(s.code(), error::INVALID_ARGUMENT);
+}
+
+class ScopedAllocatorSplitOpTest : public OpsTestBase {
+ protected:
+  void BuildNodeDef(const TensorShape& shape, DataType dtype,
+                    const string& name, int32 id, int32 num_tensors) {
+    TF_EXPECT_OK(
+        NodeDefBuilder("scoped_allocator_split_op", "_ScopedAllocatorSplit")
+            .Attr("T", dtype)
+            .Attr("N", num_tensors)
+            .Attr("sa_name", name)
+            .Attr("id", id)
+            .Input(FakeInput(dtype))  // backing tensor and input
+            .Input(
+                FakeInput(num_tensors, dtype))  // list of subtensors to forward
+            .Finalize(node_def()));
+  }
+
+  void MakeOp(const TensorShape& shape, DataType dtype, const string& name,
+              int32 id, int32 num_tensors) {
+    BuildNodeDef(shape, dtype, name, id, num_tensors);
+    TF_EXPECT_OK(InitOp());
+  }
+
+  // Similar to ConcatOpTest, we add inputs that are allocated from
+  // ScopedAllocator so that the memory lines up nicely.
+  void ExecOp(DataType dtype, int32 id,
+              const std::vector<TensorShape>& fields_shapes) {
+    Tensor* backing_tensor = nullptr;
+    std::vector<Tensor> tensors;
+    std::vector<ScopedAllocator::Field> fields;
+    PrepOp(dtype, id, fields_shapes, &fields, &backing_tensor, allocator(),
+           device_->GetScopedAllocatorMgr(), "split", &tensors, &inputs_,
+           input_types_);
+
+    TF_ASSERT_OK(RunOpKernel());
+
+    // Check that outputs are slices of backing tensor.
+    const Tensor& input = context_->input(0);
+    const void* lower_limit = DMAHelper::base(&input);
+    const char* lower_limit_c =
+        static_cast<const char*>(lower_limit);  // for pointer arithmetic
+    OpOutputList output_list;
+    Status s = context_->output_list("output", &output_list);
+    TF_ASSERT_OK(s);
+    for (int i = 0; i < output_list.size(); i++) {
+      const Tensor& output = *(output_list[i]);
+      const void* expected_base =
+          static_cast<const void*>(lower_limit_c + fields[i].offset);
+      CHECK_EQ(output.dtype(), input.dtype());
+      CHECK_EQ(expected_base, DMAHelper::base(&output));
+      CHECK_EQ(output.NumElements(), fields_shapes[i].num_elements());
+    }
+
+    // Free the backing tensor which was allocated in PrepOp.
+    delete backing_tensor;
+  }
+};
+
+TEST_F(ScopedAllocatorSplitOpTest, Success1) {
+  MakeOp({32}, DT_FLOAT, "test", 120, 2);
+  ExecOp(DT_FLOAT, 120, {{16}, {16}});
+}
+
+TEST_F(ScopedAllocatorSplitOpTest, Success2) {
+  MakeOp({2, 2, 2}, DT_DOUBLE, "test", 120, 2);
+  ExecOp(DT_DOUBLE, 120, {{2, 2}, {2, 2}});
+}
+
+TEST_F(ScopedAllocatorSplitOpTest, Success3) {
+  MakeOp({3, 3, 3}, DT_HALF, "test", 120, 3);
+  ExecOp(DT_HALF, 120, {{3, 3}, {3, 3}, {3, 3}});
+}
+
+TEST_F(ScopedAllocatorSplitOpTest, FailNLessThan2) {
+  BuildNodeDef({4, 4}, DT_FLOAT, "test", 120, 1);
+  Status s = InitOp();
+  EXPECT_EQ(s.code(), error::INVALID_ARGUMENT);
+}
+
+TEST_F(ScopedAllocatorSplitOpTest, FailDtypeCheck) {
+  MakeOp({8}, DT_FLOAT, "test", 120, 2);
+  EXPECT_DEATH(ExecOp(DT_HALF, 120, {{4}, {4}}), "");
+}
+
+TEST_F(ScopedAllocatorSplitOpTest, FailBounds) {
+  MakeOp({8}, DT_DOUBLE, "test", 120, 2);
+  AddInputFromArray<double>({8}, {0, 1, 2, 3, 4, 5, 6, 7});
+  AddInputFromArray<double>({4}, {0, 1, 2, 3});
+  AddInputFromArray<double>({4}, {4, 5, 6, 7});
+  Status s = RunOpKernel();
+  EXPECT_EQ(s.code(), error::INVALID_ARGUMENT);
+}
+
+}  // end namespace tensorflow
diff --git a/tensorflow/core/kernels/sdca_internal.cc b/tensorflow/core/kernels/sdca_internal.cc
index 066a4b80a2bc6976a6c95ced2c5efecbef13eeba..5a389a6548797d4248b4255e6f3eba11d5439ab3 100644
--- a/tensorflow/core/kernels/sdca_internal.cc
+++ b/tensorflow/core/kernels/sdca_internal.cc
@@ -226,7 +226,7 @@ const ExampleStatistics Example::ComputeWxAndWeightedExampleNorm(
 }
 
 // Examples contains all the training examples that SDCA uses for a mini-batch.
-Status Examples::SampleAdaptativeProbabilities(
+Status Examples::SampleAdaptiveProbabilities(
     const int num_loss_partitions, const Regularizations& regularization,
     const ModelWeights& model_weights,
     const TTypes<float>::Matrix example_state_data,
diff --git a/tensorflow/core/kernels/sdca_internal.h b/tensorflow/core/kernels/sdca_internal.h
index 45915693ac6f0b4ad2d5f2aacebcd4aa34c03439..1665b1210ec568acd4871292df642286e307de2b 100644
--- a/tensorflow/core/kernels/sdca_internal.h
+++ b/tensorflow/core/kernels/sdca_internal.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_KERNELS_SDCA_INTERNAL_H_
-#define TENSORFLOW_KERNELS_SDCA_INTERNAL_H_
+#ifndef TENSORFLOW_CORE_KERNELS_SDCA_INTERNAL_H_
+#define TENSORFLOW_CORE_KERNELS_SDCA_INTERNAL_H_
 
 #define EIGEN_USE_THREADS
 
@@ -75,7 +75,7 @@ struct ExampleStatistics {
 
 class Regularizations {
  public:
-  Regularizations(){};
+  Regularizations() {}
 
   // Initialize() must be called immediately after construction.
   Status Initialize(OpKernelConstruction* const context) {
@@ -199,7 +199,7 @@ class FeatureWeightsDenseStorage {
   FeatureWeightsDenseStorage(const TTypes<const float>::Matrix nominals,
                              TTypes<float>::Matrix deltas)
       : nominals_(nominals), deltas_(deltas) {
-    CHECK(deltas.rank() > 1);
+    CHECK_GT(deltas.rank(), 1);
   }
 
   // Check if a feature index is with-in the bounds.
@@ -322,15 +322,15 @@ class Examples {
     return examples_.at(example_index);
   }
 
-  int sampled_index(const int id, const bool adaptative) const {
-    if (adaptative) return sampled_index_[id];
+  int sampled_index(const int id, const bool adaptive) const {
+    if (adaptive) return sampled_index_[id];
     return id;
   }
 
   // Adaptive SDCA in the current implementation only works for
   // binary classification, where the input argument for num_weight_vectors
   // is 1.
-  Status SampleAdaptativeProbabilities(
+  Status SampleAdaptiveProbabilities(
       const int num_loss_partitions, const Regularizations& regularization,
       const ModelWeights& model_weights,
       const TTypes<float>::Matrix example_state_data,
@@ -378,7 +378,7 @@ class Examples {
   // All examples in the batch.
   std::vector<Example> examples_;
 
-  // Adaptative sampling variables
+  // Adaptive sampling variables.
   std::vector<float> probabilities_;
   std::vector<int> sampled_index_;
   std::vector<int> sampled_count_;
@@ -391,4 +391,4 @@ class Examples {
 }  // namespace sdca
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_KERNELS_SDCA_INTERNAL_H_
+#endif  // TENSORFLOW_CORE_KERNELS_SDCA_INTERNAL_H_
diff --git a/tensorflow/core/kernels/sdca_ops.cc b/tensorflow/core/kernels/sdca_ops.cc
index dbe0177dda337a271433cd3bb4257026dc702364..5b63057f3f59d63ae8c7e696595c492fdbd31d6f 100644
--- a/tensorflow/core/kernels/sdca_ops.cc
+++ b/tensorflow/core/kernels/sdca_ops.cc
@@ -80,7 +80,7 @@ struct ComputeOptions {
           context, false,
           errors::InvalidArgument("Unsupported loss type: ", loss_type));
     }
-    OP_REQUIRES_OK(context, context->GetAttr("adaptative", &adaptative));
+    OP_REQUIRES_OK(context, context->GetAttr("adaptative", &adaptive));
     OP_REQUIRES_OK(
         context, context->GetAttr("num_sparse_features", &num_sparse_features));
     OP_REQUIRES_OK(context, context->GetAttr("num_sparse_features_with_values",
@@ -113,7 +113,7 @@ struct ComputeOptions {
   int num_dense_features = 0;
   int num_inner_iterations = 0;
   int num_loss_partitions = 0;
-  bool adaptative = false;
+  bool adaptive = false;
   Regularizations regularizations;
 };
 
@@ -147,9 +147,9 @@ void DoCompute(const ComputeOptions& options, OpKernelContext* const context) {
   OP_REQUIRES_OK(context, context->set_output("out_example_state_data",
                                               mutable_example_state_data_t));
 
-  if (options.adaptative) {
+  if (options.adaptive) {
     OP_REQUIRES_OK(context,
-                   examples.SampleAdaptativeProbabilities(
+                   examples.SampleAdaptiveProbabilities(
                        options.num_loss_partitions, options.regularizations,
                        model_weights, example_state_data, options.loss_updater,
                        /*num_weight_vectors =*/1));
@@ -163,7 +163,7 @@ void DoCompute(const ComputeOptions& options, OpKernelContext* const context) {
     // num_examples which is an int.
     for (int id = static_cast<int>(begin); id < end; ++id) {
       const int64 example_index =
-          examples.sampled_index(++atomic_index, options.adaptative);
+          examples.sampled_index(++atomic_index, options.adaptive);
       const Example& example = examples.example(example_index);
       const float dual = example_state_data(example_index, 0);
       const float example_weight = example.example_weight();
diff --git a/tensorflow/core/kernels/segment_reduction_ops.cc b/tensorflow/core/kernels/segment_reduction_ops.cc
index 6c4685a50a4139b9f33d22b409059f7c03fa2812..2fc73a3309d3f6c63546e0bdfa4216f012219496 100644
--- a/tensorflow/core/kernels/segment_reduction_ops.cc
+++ b/tensorflow/core/kernels/segment_reduction_ops.cc
@@ -679,7 +679,12 @@ class SparseSegmentReductionOpBase : public OpKernel {
     // we need to explicitly set missing indices to the default value.
     Tensor* output = nullptr;
     OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output));
-    if (num_indices == 0) return;
+    if (num_indices == 0) {
+      if (output_rows > 0) {
+        output->flat_outer_dims<T>().setConstant(default_value_);
+      }
+      return;
+    }
     OP_REQUIRES(context, output_rows > 0,
                 errors::InvalidArgument("segment ids must be >= 0"));
     auto output_flat = output->flat_outer_dims<T>();
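The hunk above makes the empty-indices path fill the freshly allocated output with the default value instead of returning it uninitialized. A minimal, framework-free sketch of that behaviour for a 1-D segment sum follows. `SegmentSumSketch` and its parameters are hypothetical names for illustration only and are not part of the TensorFlow kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D stand-in for the kernel: sums `values` into `output_rows`
// segments. Rows that receive no value keep `default_value`.
template <typename T>
std::vector<T> SegmentSumSketch(const std::vector<T>& values,
                                const std::vector<int>& segment_ids,
                                int output_rows, T default_value) {
  assert(values.size() == segment_ids.size());
  assert(output_rows >= 0);
  // Every row starts at the default, so the early return below never exposes
  // uninitialized memory (the case the hunk addresses).
  std::vector<T> output(static_cast<std::size_t>(output_rows), default_value);
  if (segment_ids.empty()) {
    return output;
  }
  std::vector<bool> seen(output.size(), false);
  for (std::size_t i = 0; i < values.size(); ++i) {
    assert(segment_ids[i] >= 0 && segment_ids[i] < output_rows);
    const std::size_t s = static_cast<std::size_t>(segment_ids[i]);
    if (!seen[s]) {
      output[s] = T();  // First hit: replace the default with a zero sum.
      seen[s] = true;
    }
    output[s] += values[i];
  }
  return output;
}
```

Calling it with empty `values` and `segment_ids` and `output_rows = 3` returns three copies of `default_value`, which corresponds to the `setConstant(default_value_)` branch in the hunk.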
diff --git a/tensorflow/core/kernels/segment_reduction_ops.h b/tensorflow/core/kernels/segment_reduction_ops.h
index fe0a2782f952386e673127776c8f20da3ab1e2d5..d0703d7576932c19933844ba43c6c00f357d1ba1 100644
--- a/tensorflow/core/kernels/segment_reduction_ops.h
+++ b/tensorflow/core/kernels/segment_reduction_ops.h
@@ -17,6 +17,13 @@ limitations under the License.
 #define THIRD_PARTY_TENSORFLOW_CORE_KERNELS_SEGMENT_REDUCTION_OPS_H_
 
 
+// This file requires the following include because it uses CudaAtomicMax:
+// #include "tensorflow/core/util/cuda_kernel_helper.h"
+
+// Unfortunately we can't add the #include, since it breaks compilation for
+// non-GPU targets. This only breaks in clang, because it's more strict for
+// template code and CudaAtomicMax is used in template context.
+
 // This file requires the following include because it uses CudaAtomicMax:
 // #include "tensorflow/core/util/cuda_kernel_helper.h"
 
diff --git a/tensorflow/core/kernels/sendrecv_ops.cc b/tensorflow/core/kernels/sendrecv_ops.cc
index 688e61fcadc3ad01b579f8dfc712af2d8032ee35..2f87057f4ef431a0ed4cac928ff21575d7af34a9 100644
--- a/tensorflow/core/kernels/sendrecv_ops.cc
+++ b/tensorflow/core/kernels/sendrecv_ops.cc
@@ -169,9 +169,10 @@ Rendezvous::DoneCallback make_recv_callback(OpKernelContext* ctx,
 }  // namespace
 
 void RecvOp::ComputeAsync(OpKernelContext* ctx, DoneCallback done) {
-  OP_REQUIRES(
+  OP_REQUIRES_ASYNC(
       ctx, ctx->rendezvous() != nullptr,
-      errors::Internal("Op kernel context needs to provide a rendezvous."));
+      errors::Internal("Op kernel context needs to provide a rendezvous."),
+      done);
 
   Rendezvous::Args args;
   args.device_context = ctx->op_device_context();
diff --git a/tensorflow/core/kernels/snapshot_op.cc b/tensorflow/core/kernels/snapshot_op.cc
index 50157d5d48f93bfe61cbac95246123ef0a7d446e..fe04dcf72e2aa73a0140338a8f207177048c7d0a 100644
--- a/tensorflow/core/kernels/snapshot_op.cc
+++ b/tensorflow/core/kernels/snapshot_op.cc
@@ -22,6 +22,26 @@ limitations under the License.
 
 namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
+typedef Eigen::GpuDevice GPUDevice;
+
+template <typename Device, typename Scalar>
+class SnapshotOp : public OpKernel {
+ public:
+  explicit SnapshotOp(OpKernelConstruction* context) : OpKernel(context) {}
+
+  void Compute(OpKernelContext* context) override {
+    const Tensor& input = context->input(0);
+    Tensor* output = nullptr;
+    // Try to use buffer forwarding to avoid an explicit copy.
+    OP_REQUIRES_OK(context, context->forward_input_or_allocate_output(
+                                {0}, 0, input.shape(), &output));
+    if (!output->SharesBufferWith(input)) {
+      functor::Snapshot<Device, Scalar> functor;
+      functor(context->eigen_device<Device>(), input.flat<Scalar>(),
+              output->flat<Scalar>());
+    }
+  }
+};
 
 #define REGISTER_KERNEL(TYPE)                                        \
   REGISTER_KERNEL_BUILDER(                                           \
@@ -31,6 +51,16 @@ typedef Eigen::ThreadPoolDevice CPUDevice;
 TF_CALL_POD_TYPES(REGISTER_KERNEL);
 #undef REGISTER_KERNEL
 
+#if GOOGLE_CUDA
+#define REGISTER_KERNEL(TYPE)                                        \
+  REGISTER_KERNEL_BUILDER(                                           \
+      Name("Snapshot").Device(DEVICE_GPU).TypeConstraint<TYPE>("T"), \
+      SnapshotOp<GPUDevice, TYPE>);
+
+TF_CALL_POD_TYPES(REGISTER_KERNEL);
+#undef REGISTER_KERNEL
+#endif
+
 #if TENSORFLOW_USE_SYCL
 typedef Eigen::SyclDevice SyclDevice;
 #define REGISTER_SYCL_KERNEL(TYPE)                                    \
diff --git a/tensorflow/core/kernels/snapshot_op.h b/tensorflow/core/kernels/snapshot_op.h
index b94834f15988a21ad41eefc8030b3da1a58875f8..a18065d42ba832d5b34f2dd534bc103c907310fe 100644
--- a/tensorflow/core/kernels/snapshot_op.h
+++ b/tensorflow/core/kernels/snapshot_op.h
@@ -26,29 +26,19 @@ limitations under the License.
 #include "tensorflow/core/framework/op_kernel.h"
 
 namespace tensorflow {
+namespace functor {
 
+// Functor used by SnapshotOp.
 template <typename Device, typename Scalar>
-class SnapshotOp : public OpKernel {
- public:
-  explicit SnapshotOp(OpKernelConstruction* context) : OpKernel(context) {}
-
-  void Compute(OpKernelContext* context) override {
-    const Tensor& input = context->input(0);
-    Tensor* output = nullptr;
-    // Try to use buffer forwarding to avoid an explicit copy.
-    OP_REQUIRES_OK(context, context->forward_input_or_allocate_output(
-                                {0}, 0, input.shape(), &output));
-    if (!output->SharesBufferWith(input)) {
-      // We had to allocate a new buffer since the refcount on the input was
-      // greater than 1. Copy the input to the new buffer.
-      const Device& device = context->eigen_device<Device>();
-      device.memcpy(output->template flat<Scalar>().data(),
-                    input.template flat<Scalar>().data(),
-                    input.NumElements() * sizeof(Scalar));
-    }
+struct Snapshot {
+  void operator()(const Device& device,
+                  typename TTypes<Scalar>::ConstTensor input,
+                  typename TTypes<Scalar>::Tensor output) {
+    device.memcpy(output.data(), input.data(), input.size() * sizeof(Scalar));
   }
 };
 
+}  // namespace functor
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_KERNELS_SNAPSHOT_OP_H_
diff --git a/tensorflow/core/kernels/snapshot_op_gpu.cu.cc b/tensorflow/core/kernels/snapshot_op_gpu.cu.cc
index 52070be838d65d21813dfe097db9c395ef5a8448..f1c0ed2eae0009d54053e7b3fb9d802afb3da1b6 100644
--- a/tensorflow/core/kernels/snapshot_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/snapshot_op_gpu.cu.cc
@@ -24,13 +24,11 @@ limitations under the License.
 namespace tensorflow {
 typedef Eigen::GpuDevice GPUDevice;
 
-#define REGISTER_KERNEL(TYPE)                                        \
-  REGISTER_KERNEL_BUILDER(                                           \
-      Name("Snapshot").Device(DEVICE_GPU).TypeConstraint<TYPE>("T"), \
-      SnapshotOp<GPUDevice, TYPE>);
+// Definition of the GPU implementations declared in snapshot_op.h.
+#define DEFINE_GPU_KERNELS(T)                      \
+  template struct functor::Snapshot<GPUDevice, T>;
 
-TF_CALL_POD_TYPES(REGISTER_KERNEL);
-#undef REGISTER_KERNEL
+TF_CALL_POD_TYPES(DEFINE_GPU_KERNELS);
 
 }  // namespace tensorflow
 
diff --git a/tensorflow/core/kernels/softmax_op_gpu.cu.cc b/tensorflow/core/kernels/softmax_op_gpu.cu.cc
index 1f4a82a7334e35c48af09c897895f79ee30e1ebd..130d693dbdf132515a7ffcfc0bc6c9631a5aee21 100644
--- a/tensorflow/core/kernels/softmax_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/softmax_op_gpu.cu.cc
@@ -33,8 +33,42 @@ namespace tensorflow {
 
 namespace {
 
+template <typename U, typename T>
+__device__ __host__ EIGEN_STRONG_INLINE
+    typename std::enable_if<!std::is_same<T, U>::value, U>::type
+    strict_cast(T t);
+
+template <typename U, typename T>
+__device__ __host__ EIGEN_STRONG_INLINE
+    typename std::enable_if<std::is_same<T, U>::value, U>::type
+    strict_cast(T t) {
+  return t;
+}
+
+template <>
+__device__ __host__ EIGEN_STRONG_INLINE float strict_cast<float, Eigen::half>(
+    Eigen::half t) {
+  return functor::HalfToFloat()(t);
+}
+
+template <>
+__device__ __host__ EIGEN_STRONG_INLINE Eigen::half
+strict_cast<Eigen::half, float>(float t) {
+  return functor::FloatToHalf()(t);
+}
+
 template <typename T>
-__global__ void GenerateNormalizedProb(const T* logits, const T* sum_probs,
+struct softmax_traits {
+  using accumulator_type = T;
+};
+
+template <>
+struct softmax_traits<Eigen::half> {
+  using accumulator_type = float;
+};
+
+template <typename T, typename U>
+__global__ void GenerateNormalizedProb(const T* logits, const U* sum_probs,
                                        const T* max_logits, T* output,
                                        const int num_rows, const int num_cols,
                                        const bool in_log_space) {
@@ -43,25 +77,33 @@ __global__ void GenerateNormalizedProb(const T* logits, const T* sum_probs,
   const int row = tid / num_cols;
   const int col = tid % num_cols;
 
+  // TODO(jamesqin): change to half2 load when inputs are Eigen::half.
+  U input = strict_cast<U>(logits[tid]);
+  U max_val = strict_cast<U>(ldg(max_logits + row));
+  U result;
+
   if (row < num_rows && col < num_cols) {
-    if (in_log_space)
-      output[tid] =
-          logits[tid] - ldg(max_logits + row) - log(ldg(sum_probs + row));
-    else
-      output[tid] =
-          exp(logits[tid] - ldg(max_logits + row)) / ldg(sum_probs + row);
+    if (in_log_space) {
+      result = input - max_val - log(ldg(sum_probs + row));
+    } else {
+      result = exp(input - max_val) / ldg(sum_probs + row);
+    }
+    output[tid] = strict_cast<T>(result);
   }
 }
 
-template <typename T>
+template <typename T, typename U>
 struct SubtractAndExpFunctor {
   __host__ __device__ SubtractAndExpFunctor(const T* logits,
                                             const T* max_logits,
                                             const int num_cols)
       : logits_(logits), max_logits_(max_logits), num_cols_(num_cols) {}
 
-  __host__ __device__ T operator()(const int gid) const {
-    return exp(logits_[gid] - ldg(max_logits_ + gid / num_cols_));
+  __host__ __device__ U operator()(const int gid) const {
+    // TODO(jamesqin): change to half2 load when inputs are Eigen::half.
+    const U diff =
+        strict_cast<U>(logits_[gid] - ldg(max_logits_ + gid / num_cols_));
+    return exp(diff);
   }
 
   const T* logits_;
@@ -80,7 +122,6 @@ void DoRowReduction(OpKernelContext* context, T* output, InputIter input,
   functor::ReduceImpl<T, Op, T*, InputIter, ReductionAxes>(
       context, output, input, 2, rows, cols, 1, 1, constants.kOne, op);
 }
-
 }  // namespace
 
 template <typename T>
@@ -108,8 +149,10 @@ class SoftmaxOpGPU : public OpKernel {
       OP_REQUIRES_OK(context,
                      context->allocate_temp(DataTypeToEnum<T>::value,
                                             softmax_out->shape(), &max_logits));
+
+      typedef typename softmax_traits<T>::accumulator_type acc_type;
       OP_REQUIRES_OK(context,
-                     context->allocate_temp(DataTypeToEnum<T>::value,
+                     context->allocate_temp(DataTypeToEnum<acc_type>::value,
                                             softmax_out->shape(), &sum_probs));
 
       DoRowReduction<T, cub::Max, const T*>(
@@ -120,25 +163,28 @@ class SoftmaxOpGPU : public OpKernel {
       const int numBlocks = Eigen::divup(rows * cols, numThreads);
 
       cub::CountingInputIterator<int> counting_iterator(0);
-      typedef cub::TransformInputIterator<T, SubtractAndExpFunctor<T>,
+      typedef cub::TransformInputIterator<acc_type,
+                                          SubtractAndExpFunctor<T, acc_type>,
                                           cub::CountingInputIterator<int>>
           InputIterType;
 
       InputIterType input_itr(
           counting_iterator,
-          SubtractAndExpFunctor<T>(
+          SubtractAndExpFunctor<T, acc_type>(
               reinterpret_cast<const T*>(logits_in_.flat<T>().data()),
               reinterpret_cast<const T*>(max_logits.flat<T>().data()), cols));
 
-      DoRowReduction<T, cub::Sum, InputIterType>(
-          context, const_cast<T*>(sum_probs.flat<T>().data()), input_itr, rows,
-          cols);
+      DoRowReduction<acc_type, cub::Sum, InputIterType>(
+          context, const_cast<acc_type*>(sum_probs.flat<acc_type>().data()),
+          input_itr, rows, cols);
 
-      GenerateNormalizedProb<<<numBlocks, numThreads, 0, cu_stream>>>(
-          reinterpret_cast<const T*>(logits_in_.flat<T>().data()),
-          reinterpret_cast<const T*>(sum_probs.flat<T>().data()),
-          reinterpret_cast<const T*>(max_logits.flat<T>().data()),
-          const_cast<T*>(softmax_out->flat<T>().data()), rows, cols, log_);
+      GenerateNormalizedProb<T, acc_type>
+          <<<numBlocks, numThreads, 0, cu_stream>>>(
+              reinterpret_cast<const T*>(logits_in_.flat<T>().data()),
+              reinterpret_cast<const acc_type*>(
+                  sum_probs.flat<acc_type>().data()),
+              reinterpret_cast<const T*>(max_logits.flat<T>().data()),
+              const_cast<T*>(softmax_out->flat<T>().data()), rows, cols, log_);
     }
   }
 
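The softmax change above keeps inputs and outputs in `T` but accumulates the row sum in a wider type chosen through a traits struct (`float` when `T` is `Eigen::half`). The host-only sketch below shows the same pattern under stated assumptions: standard C++ has no portable half type, so `float` widened to `double` stands in for `Eigen::half` widened to `float`, and the names `accumulator_traits_sketch` and `SoftmaxRowSketch` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Maps a storage type to the type used for accumulation. The primary template
// keeps them identical, as softmax_traits does in the diff.
template <typename T>
struct accumulator_traits_sketch {
  using accumulator_type = T;
};

// Assumption for illustration: float -> double plays the role of the diff's
// Eigen::half -> float specialization.
template <>
struct accumulator_traits_sketch<float> {
  using accumulator_type = double;
};

// Softmax over a single row. Values are widened before the max-subtracted
// exponentials are summed, and narrowed again only when written out.
template <typename T>
std::vector<T> SoftmaxRowSketch(const std::vector<T>& logits) {
  using Acc = typename accumulator_traits_sketch<T>::accumulator_type;
  assert(!logits.empty());
  const Acc max_logit =
      static_cast<Acc>(*std::max_element(logits.begin(), logits.end()));
  Acc sum = Acc(0);
  for (const T x : logits) {
    sum += std::exp(static_cast<Acc>(x) - max_logit);
  }
  std::vector<T> out;
  out.reserve(logits.size());
  for (const T x : logits) {
    const Acc p = std::exp(static_cast<Acc>(x) - max_logit) / sum;
    out.push_back(static_cast<T>(p));
  }
  return out;
}
```

Subtracting the row maximum first keeps `exp` from overflowing for large logits, and the wider accumulator limits rounding error when many small terms are summed.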
diff --git a/tensorflow/core/kernels/spacetodepth_op.cc b/tensorflow/core/kernels/spacetodepth_op.cc
index 23df1c35e5205ff3045643efee203aa501d4959d..e59adfc6acbeef3e2d309629121d308c6e228703 100644
--- a/tensorflow/core/kernels/spacetodepth_op.cc
+++ b/tensorflow/core/kernels/spacetodepth_op.cc
@@ -187,6 +187,9 @@ TF_CALL_ALL_TYPES(REGISTER);
 REGISTER_KERNEL_BUILDER(
     Name("SpaceToDepth").Device(DEVICE_GPU).TypeConstraint<float>("T"),
     SpaceToDepthOp<GPUDevice, float>);
+REGISTER_KERNEL_BUILDER(
+    Name("SpaceToDepth").Device(DEVICE_GPU).TypeConstraint<Eigen::half>("T"),
+    SpaceToDepthOp<GPUDevice, Eigen::half>);
 REGISTER_KERNEL_BUILDER(
     Name("SpaceToDepth").Device(DEVICE_GPU).TypeConstraint<qint8>("T"),
     SpaceToDepthOp<GPUDevice, qint8>);
diff --git a/tensorflow/core/kernels/spacetodepth_op_gpu.cu.cc b/tensorflow/core/kernels/spacetodepth_op_gpu.cu.cc
index a1a01e8813261592a0d9ea97d6f76a163d070ee4..f38459724abcb544252885b89dede635960b24b9 100644
--- a/tensorflow/core/kernels/spacetodepth_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/spacetodepth_op_gpu.cu.cc
@@ -154,6 +154,9 @@ struct SpaceToDepthOpFunctor<GPUDevice, T, FORMAT_NHWC> {
 
     const int total_count =
         batch_size * input_height * input_width * input_depth;
+    if (total_count == 0) {
+      return;
+    }
     CudaLaunchConfig config = GetCudaLaunchConfig(total_count, d);
     S2D_NHWC<<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
         config.virtual_thread_count, input.data(), block_size, batch_size,
@@ -184,6 +187,9 @@ struct SpaceToDepthOpFunctor<GPUDevice, T, FORMAT_NCHW> {
       const int input_width = input.dimension(3);
       const int input_depth_by_output_area = input_depth * output_area;
       const int total_count = batch_size * input_depth_by_output_area;
+      if (total_count == 0) {
+        return;
+      }
       CudaLaunchConfig config = GetCudaLaunchConfig(total_count, d);
       switch (block_size) {
         case 2:
@@ -209,6 +215,9 @@ struct SpaceToDepthOpFunctor<GPUDevice, T, FORMAT_NCHW> {
 
     // Other block sizes are processed by the generic kernel.
     const int total_count = batch_size * output_depth_by_output_area;
+    if (total_count == 0) {
+      return;
+    }
     CudaLaunchConfig config = GetCudaLaunchConfig(total_count, d);
     S2D_NCHW<<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
         config.virtual_thread_count, input.data(), block_size, output_width,
@@ -225,6 +234,12 @@ struct SpaceToDepthOpFunctor<GPUDevice, T, FORMAT_NCHW> {
 template struct functor::SpaceToDepthOpFunctor<GPUDevice, float, FORMAT_NCHW>;
 template struct functor::SpaceToDepthOpFunctor<GPUDevice, float, FORMAT_NHWC>;
 
+// Instantiate the GPU implementations for Eigen::half.
+template struct functor::SpaceToDepthOpFunctor<GPUDevice, Eigen::half,
+                                               FORMAT_NCHW>;
+template struct functor::SpaceToDepthOpFunctor<GPUDevice, Eigen::half,
+                                               FORMAT_NHWC>;
+
 // NCHW_VECT_C with 4 x qint8 can be treated as NCHW int32.
 template struct functor::SpaceToDepthOpFunctor<GPUDevice, int32, FORMAT_NCHW>;
 
diff --git a/tensorflow/core/kernels/sparse_cross_op.cc b/tensorflow/core/kernels/sparse_cross_op.cc
index 7cd4532ad63812d905ceb6b96291aa50293070ef..4b5df7aff0e9b345fb94f9f06a9906972448c048 100644
--- a/tensorflow/core/kernels/sparse_cross_op.cc
+++ b/tensorflow/core/kernels/sparse_cross_op.cc
@@ -327,7 +327,7 @@ class SparseCrossOp : public OpKernel {
 
     typename CrossTraits<HASHED_OUTPUT, InternalType>::Updater updater(
         output_start_indices, indices_out, values_out);
-    auto do_work = [this, &columns, crosser, updater](int64 begin, int64 end) {
+    auto do_work = [&columns, crosser, updater](int64 begin, int64 end) {
       for (int b = begin; b < end; b++) {
         ProductIterator<InternalType> product_iterator(columns, b);
         int64 cross_count = 0;
diff --git a/tensorflow/core/kernels/split_lib.h b/tensorflow/core/kernels/split_lib.h
index a08949e626cc8e5d4c3707b75a902d82b46c3376..bc1fa28f8f8f23085d89e5b98d57914de778ea0b 100644
--- a/tensorflow/core/kernels/split_lib.h
+++ b/tensorflow/core/kernels/split_lib.h
@@ -31,31 +31,31 @@ struct SplitCustom {
                   const Eigen::DSizes<Eigen::DenseIndex, 2>& slice_sizes);
 };
 
-template <typename Device, typename T>
+template <typename Device, typename T, int NDims>
 struct Split {
-  void operator()(const Device& d, typename TTypes<T, 3>::Tensor output,
-                  typename TTypes<T, 3>::ConstTensor input,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes);
+  void operator()(const Device& d, typename TTypes<T, NDims>::Tensor output,
+                  typename TTypes<T, NDims>::ConstTensor input,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes);
 };
 
-template <typename T>
-struct Split<Eigen::ThreadPoolDevice, T> {
+template <typename T, int NDims>
+struct Split<Eigen::ThreadPoolDevice, T, NDims> {
   void operator()(const Eigen::ThreadPoolDevice& d,
-                  typename TTypes<T, 3>::Tensor output,
-                  typename TTypes<T, 3>::ConstTensor input,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes);
+                  typename TTypes<T, NDims>::Tensor output,
+                  typename TTypes<T, NDims>::ConstTensor input,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes);
 };
 
 #ifdef TENSORFLOW_USE_SYCL
-template <typename T>
+template <typename T, int NDims>
-struct Split<Eigen::SyclDevice, T> {
+struct Split<Eigen::SyclDevice, T, NDims> {
   void operator()(const Eigen::SyclDevice& d,
-                  typename TTypes<T, 3>::Tensor output,
-                  typename TTypes<T, 3>::ConstTensor input,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes);
+                  typename TTypes<T, NDims>::Tensor output,
+                  typename TTypes<T, NDims>::ConstTensor input,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes);
 };
 #endif  // TENSORFLOW_USE_SYCL
 
diff --git a/tensorflow/core/kernels/split_lib_cpu.cc b/tensorflow/core/kernels/split_lib_cpu.cc
index 771c633b156edf7c7d9944fe95703a0e0cd9e981..a3060e4e90d8db6866bd0c56570beeef65ab58ce 100644
--- a/tensorflow/core/kernels/split_lib_cpu.cc
+++ b/tensorflow/core/kernels/split_lib_cpu.cc
@@ -24,12 +24,12 @@ limitations under the License.
 namespace tensorflow {
 namespace functor {
 
-template <typename T>
-void Split<Eigen::ThreadPoolDevice, T>::operator()(
-    const Eigen::ThreadPoolDevice& d, typename TTypes<T, 3>::Tensor output,
-    typename TTypes<T, 3>::ConstTensor input,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes) {
+template <typename T, int NDims>
+void Split<Eigen::ThreadPoolDevice, T, NDims>::operator()(
+    const Eigen::ThreadPoolDevice& d, typename TTypes<T, NDims>::Tensor output,
+    typename TTypes<T, NDims>::ConstTensor input,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes) {
   if (output.size() < 131072) {
     output = input.slice(slice_indices, slice_sizes);
   } else {
@@ -37,22 +37,26 @@ void Split<Eigen::ThreadPoolDevice, T>::operator()(
   }
 }
 
-#define DEFINE_CPU_KERNELS(T) template struct Split<Eigen::ThreadPoolDevice, T>;
+#define DEFINE_CPU_KERNELS(T)                           \
+  template struct Split<Eigen::ThreadPoolDevice, T, 2>; \
+  template struct Split<Eigen::ThreadPoolDevice, T, 3>;
 
 TF_CALL_ALL_TYPES(DEFINE_CPU_KERNELS)
 DEFINE_CPU_KERNELS(quint8)
 
 #ifdef TENSORFLOW_USE_SYCL
-template <typename T>
-void Split<Eigen::SyclDevice, T>::operator()(
-    const Eigen::SyclDevice& d, typename TTypes<T, 3>::Tensor output,
-    typename TTypes<T, 3>::ConstTensor input,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes) {
+template <typename T, int NDims>
+void Split<Eigen::SyclDevice, T, NDims>::operator()(
+    const Eigen::SyclDevice& d, typename TTypes<T, NDims>::Tensor output,
+    typename TTypes<T, NDims>::ConstTensor input,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes) {
   output.device(d) = input.slice(slice_indices, slice_sizes);
 }
 
-#define DEFINE_SYCL_KERNELS(T) template struct Split<Eigen::SyclDevice, T>;
+#define DEFINE_SYCL_KERNELS(T)                    \
+  template struct Split<Eigen::SyclDevice, T, 2>; \
+  template struct Split<Eigen::SyclDevice, T, 3>;
 
 TF_CALL_GPU_NUMBER_TYPES_NO_HALF(DEFINE_SYCL_KERNELS);
 #endif  // TENSORFLOW_USE_SYCL
diff --git a/tensorflow/core/kernels/split_lib_gpu.cu.cc b/tensorflow/core/kernels/split_lib_gpu.cu.cc
index 9f234fc0935be0662b0d8df1a6bd1c109ab24fd9..393818730bb4fe7fc6bba7f66b2cc96b12cab390 100644
--- a/tensorflow/core/kernels/split_lib_gpu.cu.cc
+++ b/tensorflow/core/kernels/split_lib_gpu.cu.cc
@@ -29,12 +29,12 @@ limitations under the License.
 namespace tensorflow {
 namespace functor {
 
-template <typename Device, typename T>
-void Split<Device, T>::operator()(
-    const Device& d, typename TTypes<T, 3>::Tensor output,
-    typename TTypes<T, 3>::ConstTensor input,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes) {
+template <typename Device, typename T, int NDims>
+void Split<Device, T, NDims>::operator()(
+    const Device& d, typename TTypes<T, NDims>::Tensor output,
+    typename TTypes<T, NDims>::ConstTensor input,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes) {
   To32Bit(output).device(d) = To32Bit(input).slice(slice_indices, slice_sizes);
 }
 
@@ -47,7 +47,9 @@ void SplitCustom<Device, T>::operator()(
   To32Bit(output).device(d) = To32Bit(input).slice(slice_indices, slice_sizes);
 }
 
-#define DEFINE_GPU_KERNELS(T) template struct Split<Eigen::GpuDevice, T>;
+#define DEFINE_GPU_KERNELS(T)                    \
+  template struct Split<Eigen::GpuDevice, T, 2>; \
+  template struct Split<Eigen::GpuDevice, T, 3>;
 
 TF_CALL_GPU_NUMBER_TYPES(DEFINE_GPU_KERNELS);
 TF_CALL_complex64(DEFINE_GPU_KERNELS);
diff --git a/tensorflow/core/kernels/split_op.cc b/tensorflow/core/kernels/split_op.cc
index 85f529326dbf5d9d5ae72156da05f08f805d1271..7cc3c532c95584a66cfab3f184ffef4028ee4bdb 100644
--- a/tensorflow/core/kernels/split_op.cc
+++ b/tensorflow/core/kernels/split_op.cc
@@ -121,6 +121,77 @@ class SplitOpBase : public OpKernel {
   }
 };
 
+template <typename T, typename InputReshapedType, int NDims>
+class SplitOpCPUImpl {
+ public:
+  template <typename MakeSizesType, typename ReshapeResultType>
+  void operator()(OpKernelContext* context,
+                  const InputReshapedType& input_reshaped,
+                  const TensorShape& input_shape, int32 split_dim,
+                  Eigen::DenseIndex prefix_dim_size,
+                  Eigen::DenseIndex split_dim_size,
+                  Eigen::DenseIndex suffix_dim_size,
+                  const MakeSizesType& make_sizes,
+                  const ReshapeResultType& reshape_result, int32 num_split,
+                  int64 split_dim_output_size) const {
+    const auto num_threads =
+        context->device()->tensorflow_cpu_worker_threads()->num_threads;
+    // TODO(jewillco): Tune heuristic further.
+    const auto input_element_count = input_shape.num_elements();
+    const bool use_parallelism_between_outputs =
+        (num_split >= 4 &&
+         input_element_count >= std::max(num_threads, num_split) * 4096 &&
+         input_element_count < num_split * 180 * 1024);
+    Eigen::DSizes<Eigen::DenseIndex, NDims> indices;
+    for (int i = 0; i < NDims; ++i) {
+      indices[i] = 0;
+    }
+    auto sizes = make_sizes(split_dim_output_size);
+    TensorShape output_shape(input_shape);
+    output_shape.set_dim(split_dim, split_dim_output_size);
+
+    auto range_output_func = [&indices, context, &output_shape, prefix_dim_size,
+                              split_dim_output_size, suffix_dim_size, &sizes,
+                              use_parallelism_between_outputs, &input_reshaped,
+                              &reshape_result](int64 start, int64 limit) {
+      for (int64 i = start; i < limit; ++i) {
+        Tensor* result = nullptr;
+        OP_REQUIRES_OK(context,
+                       context->allocate_output(i, output_shape, &result));
+        if (prefix_dim_size * split_dim_output_size * suffix_dim_size > 0) {
+          Eigen::DSizes<Eigen::DenseIndex, NDims> slice_indices;
+          Eigen::DSizes<Eigen::DenseIndex, NDims> slice_sizes;
+          for (int j = 0; j < NDims; ++j) {
+            slice_indices[j] =
+                (j == NDims - 2 ? i * split_dim_output_size : indices[j]);
+            slice_sizes[j] = sizes[j];
+          }
+
+          auto result_shaped = reshape_result(result, split_dim_output_size);
+
+          if (use_parallelism_between_outputs) {
+            // Use sequential implementation for single output.
+            result_shaped = input_reshaped.slice(slice_indices, slice_sizes);
+          } else {
+            // This implementation may be parallel internally.
+            functor::Split<CPUDevice, T, NDims>()(
+                context->eigen_device<CPUDevice>(), result_shaped,
+                input_reshaped, slice_indices, slice_sizes);
+          }
+        }
+      }
+    };
+    if (use_parallelism_between_outputs) {
+      // Run in parallel, disabling parallelism in functor.
+      context->device()->tensorflow_cpu_worker_threads()->workers->ParallelFor(
+          num_split, input_element_count / num_split, range_output_func);
+    } else {
+      // Run sequentially, but allow internal parallelism in functor.
+      range_output_func(0, num_split);
+    }
+  }
+};
+
 template <typename T>
 class SplitOpCPU : public SplitOpBase<CPUDevice, T> {
  public:
@@ -154,66 +225,37 @@ class SplitOpCPU : public SplitOpBase<CPUDevice, T> {
 
     std::tie(prefix_dim_size, split_dim_size, suffix_dim_size) =
         Base::template SetDims<Eigen::DenseIndex>(input_shape, split_dim);
-    auto input_reshaped =
-        input.shaped<T, 3>({prefix_dim_size, split_dim_size, suffix_dim_size});
 
     const int64 split_dim_output_size = split_dim_size / num_split;
-    TensorShape output_shape(input_shape);
-    output_shape.set_dim(split_dim, split_dim_output_size);
-
-    Eigen::DSizes<Eigen::DenseIndex, 3> indices{0, 0, 0};
-    const Eigen::DSizes<Eigen::DenseIndex, 3> sizes{
-        prefix_dim_size, split_dim_output_size, suffix_dim_size};
-
-    const auto num_threads =
-        context->device()->tensorflow_cpu_worker_threads()->num_threads;
-    // TODO(jewillco): Tune heuristic further.
-    const auto input_element_count = input_shape.num_elements();
-    const bool use_parallelism_between_outputs =
-        (num_split >= 4 &&
-         input_element_count >= std::max(num_threads, num_split) * 4096 &&
-         input_element_count < num_split * 180 * 1024);
-
-    auto range_output_func = [&indices, context, &output_shape, prefix_dim_size,
-                              split_dim_output_size, suffix_dim_size, &sizes,
-                              use_parallelism_between_outputs,
-                              &input_reshaped](int64 start, int64 limit) {
-      for (int64 i = start; i < limit; ++i) {
-        Tensor* result = nullptr;
-        OP_REQUIRES_OK(context,
-                       context->allocate_output(i, output_shape, &result));
-        if (prefix_dim_size * split_dim_output_size * suffix_dim_size > 0) {
-          Eigen::DSizes<Eigen::DenseIndex, 3> slice_indices;
-          Eigen::DSizes<Eigen::DenseIndex, 3> slice_sizes;
-          for (int j = 0; j < 3; ++j) {
-            slice_indices[j] =
-                (j == 1 ? i * split_dim_output_size : indices[j]);
-            slice_sizes[j] = sizes[j];
-          }
-
-          auto result_shaped = result->shaped<T, 3>(
-              {prefix_dim_size, split_dim_output_size, suffix_dim_size});
 
-          if (use_parallelism_between_outputs) {
-            // Use sequential implementation for single output.
-            result_shaped = input_reshaped.slice(slice_indices, slice_sizes);
-          } else {
-            // This implementation may be parallel internally.
-            functor::Split<CPUDevice, T>()(context->eigen_device<CPUDevice>(),
-                                           result_shaped, input_reshaped,
-                                           slice_indices, slice_sizes);
-          }
-        }
-      }
-    };
-    if (use_parallelism_between_outputs) {
-      // Run in parallel, disabling parallelism in functor.
-      Shard(num_split,
-            context->device()->tensorflow_cpu_worker_threads()->workers,
-            num_split, input_element_count / num_split, range_output_func);
+    if (prefix_dim_size == 1) {
+      auto input_reshaped =
+          input.shaped<T, 2>({split_dim_size, suffix_dim_size});
+      auto make_sizes = [&](Eigen::DenseIndex split_size) {
+        return Eigen::DSizes<Eigen::DenseIndex, 2>{split_size, suffix_dim_size};
+      };
+      auto reshape_result = [&](Tensor* result, Eigen::DenseIndex split_size) {
+        return result->shaped<T, 2>({split_size, suffix_dim_size});
+      };
+      SplitOpCPUImpl<T, decltype(input_reshaped), 2>{}(
+          context, input_reshaped, input_shape, split_dim, prefix_dim_size,
+          split_dim_size, suffix_dim_size, make_sizes, reshape_result,
+          num_split, split_dim_output_size);
     } else {
-      // Run sequentially, but allow internal parallelism in functor.
-      range_output_func(0, num_split);
+      auto input_reshaped = input.shaped<T, 3>(
+          {prefix_dim_size, split_dim_size, suffix_dim_size});
+      auto make_sizes = [&](Eigen::DenseIndex split_size) {
+        return Eigen::DSizes<Eigen::DenseIndex, 3>{prefix_dim_size, split_size,
+                                                   suffix_dim_size};
+      };
+      auto reshape_result = [&](Tensor* result, Eigen::DenseIndex split_size) {
+        return result->shaped<T, 3>(
+            {prefix_dim_size, split_size, suffix_dim_size});
+      };
+      SplitOpCPUImpl<T, decltype(input_reshaped), 3>{}(
+          context, input_reshaped, input_shape, split_dim, prefix_dim_size,
+          split_dim_size, suffix_dim_size, make_sizes, reshape_result,
+          num_split, split_dim_output_size);
     }
   }
 };
diff --git a/tensorflow/core/kernels/split_v_op.cc b/tensorflow/core/kernels/split_v_op.cc
index 7ff5df47d70fa8e47aabfb24e82874c146708ef1..5c19a45fb18abdacb5f89f623f9690b43bdfa1e5 100644
--- a/tensorflow/core/kernels/split_v_op.cc
+++ b/tensorflow/core/kernels/split_v_op.cc
@@ -55,8 +55,13 @@ class SplitVOpBase : public OpKernel {
     const Tensor& input = context->input(0);
     const TensorShape& input_shape = input.shape();
     const Tensor& split_tensor = context->input(1);
+    const Tensor& split_dim_tensor = context->input(2);
 
-    const int32 split_dim_orig = context->input(2).flat<int32>()(0);
+    OP_REQUIRES(context, split_dim_tensor.NumElements() == 1,
+                errors::InvalidArgument("split_dim_tensor must have "
+                                        "exactly one element."));
+
+    const int32 split_dim_orig = split_dim_tensor.flat<int32>()(0);
     const int32 split_dim =
         split_dim_orig < 0 ? split_dim_orig + input.dims() : split_dim_orig;
 
@@ -175,6 +180,77 @@ class SplitVOpBase : public OpKernel {
   }
 };
 
+template <typename T, typename Tlen, typename InputReshapedType, int NDims>
+class SplitVOpCPUImpl {
+ public:
+  template <typename MakeSizesType, typename ReshapeResultType>
+  void operator()(OpKernelContext* context,
+                  const InputReshapedType& input_reshaped,
+                  const std::vector<int64>& split_start_points,
+                  const TensorShape& input_shape, int32 split_dim,
+                  Eigen::DenseIndex prefix_dim_size,
+                  Eigen::DenseIndex split_dim_size,
+                  Eigen::DenseIndex suffix_dim_size,
+                  std::vector<Tlen>& split_sizes_vec,
+                  const MakeSizesType& make_sizes,
+                  const ReshapeResultType& reshape_result) const {
+    Eigen::DSizes<Eigen::DenseIndex, NDims> indices;
+    for (int i = 0; i < NDims; ++i) {
+      indices[i] = 0;
+    }
+    const auto num_threads =
+        context->device()->tensorflow_cpu_worker_threads()->num_threads;
+    // TODO(jewillco): Tune heuristic further.
+    const auto input_element_count = input_shape.num_elements();
+    const int num_split = split_start_points.size();
+    const bool use_parallelism_between_outputs =
+        (num_split >= 4 &&
+         input_element_count >= std::max(num_threads, num_split) * 4096 &&
+         input_element_count < num_split * 180 * 1024);
+
+    auto range_output_func = [&indices, context, &input_shape, split_dim,
+                              &split_sizes_vec, &split_start_points,
+                              use_parallelism_between_outputs, &input_reshaped,
+                              &make_sizes,
+                              &reshape_result](int64 start, int64 limit) {
+      for (int64 i = start; i < limit; ++i) {
+        TensorShape output_shape(input_shape);
+        output_shape.set_dim(split_dim, split_sizes_vec[i]);
+        Tensor* result = nullptr;
+        OP_REQUIRES_OK(context,
+                       context->allocate_output(i, output_shape, &result));
+
+        const auto sizes = make_sizes(split_sizes_vec[i]);
+
+        if (sizes.TotalSize() > 0) {
+          auto result_shaped = reshape_result(result, split_sizes_vec[i]);
+
+          auto current_indices = indices;
+          current_indices[NDims - 2] = split_start_points[i];
+          if (use_parallelism_between_outputs) {
+            // Use sequential implementation for single output.
+            result_shaped = input_reshaped.slice(current_indices, sizes);
+          } else {
+            // This implementation may be parallel internally.
+            functor::Split<CPUDevice, T, NDims>()(
+                context->eigen_device<CPUDevice>(), result_shaped,
+                input_reshaped, current_indices, sizes);
+          }
+        }
+      }
+    };
+    if (use_parallelism_between_outputs) {
+      // Run in parallel, disabling parallelism in functor.
+      Shard(num_split,
+            context->device()->tensorflow_cpu_worker_threads()->workers,
+            num_split, input_element_count / num_split, range_output_func);
+    } else {
+      // Run sequentially, but allow internal parallelism in functor.
+      range_output_func(0, num_split);
+    }
+  }
+};
+
 template <typename T, typename Tlen>
 class SplitVOpCPU : public SplitVOpBase<CPUDevice, T, Tlen> {
  public:
@@ -209,10 +285,6 @@ class SplitVOpCPU : public SplitVOpBase<CPUDevice, T, Tlen> {
 
     std::tie(prefix_dim_size, split_dim_size, suffix_dim_size) =
         Base::template SetDims<Eigen::DenseIndex>(input_shape, split_dim);
-    auto input_reshaped =
-        input.shaped<T, 3>({prefix_dim_size, split_dim_size, suffix_dim_size});
-
-    Eigen::DSizes<Eigen::DenseIndex, 3> indices{0, 0, 0};
     std::vector<int64> split_start_points(num_split);
     for (int i = 0; i < num_split; ++i) {
       if (i == 0) {
@@ -223,55 +295,34 @@ class SplitVOpCPU : public SplitVOpBase<CPUDevice, T, Tlen> {
       }
     }
 
-    const auto num_threads =
-        context->device()->tensorflow_cpu_worker_threads()->num_threads;
-    // TODO(jewillco): Tune heuristic further.
-    const auto input_element_count = input_shape.num_elements();
-    const bool use_parallelism_between_outputs =
-        (num_split >= 4 &&
-         input_element_count >= std::max(num_threads, num_split) * 4096 &&
-         input_element_count < num_split * 180 * 1024);
-
-    auto range_output_func = [&indices, context, &input_shape, prefix_dim_size,
-                              split_dim, &split_sizes_vec, &split_start_points,
-                              suffix_dim_size, use_parallelism_between_outputs,
-                              &input_reshaped](int64 start, int64 limit) {
-      for (int64 i = start; i < limit; ++i) {
-        TensorShape output_shape(input_shape);
-        output_shape.set_dim(split_dim, split_sizes_vec[i]);
-        Tensor* result = nullptr;
-        OP_REQUIRES_OK(context,
-                       context->allocate_output(i, output_shape, &result));
-
-        Eigen::DSizes<Eigen::DenseIndex, 3> sizes{
-            prefix_dim_size, split_sizes_vec[i], suffix_dim_size};
-
-        if (sizes.TotalSize() > 0) {
-          auto result_shaped = result->shaped<T, 3>(
-              {prefix_dim_size, split_sizes_vec[i], suffix_dim_size});
-
-          auto current_indices = indices;
-          current_indices[1] = split_start_points[i];
-          if (use_parallelism_between_outputs) {
-            // Use sequential implementation for single output.
-            result_shaped = input_reshaped.slice(current_indices, sizes);
-          } else {
-            // This implementation may be parallel internally.
-            functor::Split<CPUDevice, T>()(context->eigen_device<CPUDevice>(),
-                                           result_shaped, input_reshaped,
-                                           current_indices, sizes);
-          }
-        }
-      }
-    };
-    if (use_parallelism_between_outputs) {
-      // Run in parallel, disabling parallelism in functor.
-      Shard(num_split,
-            context->device()->tensorflow_cpu_worker_threads()->workers,
-            num_split, input_element_count / num_split, range_output_func);
+    if (prefix_dim_size == 1) {
+      auto input_reshaped =
+          input.shaped<T, 2>({split_dim_size, suffix_dim_size});
+      auto make_sizes = [&](Eigen::DenseIndex split_size) {
+        return Eigen::DSizes<Eigen::DenseIndex, 2>{split_size, suffix_dim_size};
+      };
+      auto reshape_result = [&](Tensor* result, Tlen split_size) {
+        return result->shaped<T, 2>({split_size, suffix_dim_size});
+      };
+      SplitVOpCPUImpl<T, Tlen, decltype(input_reshaped), 2>{}(
+          context, input_reshaped, split_start_points, input_shape, split_dim,
+          prefix_dim_size, split_dim_size, suffix_dim_size, split_sizes_vec,
+          make_sizes, reshape_result);
     } else {
-      // Run sequentially, but allow internal parallelism in functor.
-      range_output_func(0, num_split);
+      auto input_reshaped = input.shaped<T, 3>(
+          {prefix_dim_size, split_dim_size, suffix_dim_size});
+      auto make_sizes = [&](Eigen::DenseIndex split_size) {
+        return Eigen::DSizes<Eigen::DenseIndex, 3>{prefix_dim_size, split_size,
+                                                   suffix_dim_size};
+      };
+      auto reshape_result = [&](Tensor* result, Tlen split_size) {
+        return result->shaped<T, 3>(
+            {prefix_dim_size, split_size, suffix_dim_size});
+      };
+      SplitVOpCPUImpl<T, Tlen, decltype(input_reshaped), 3>{}(
+          context, input_reshaped, split_start_points, input_shape, split_dim,
+          prefix_dim_size, split_dim_size, suffix_dim_size, split_sizes_vec,
+          make_sizes, reshape_result);
     }
   }
 };
diff --git a/tensorflow/core/kernels/strided_slice_op.cc b/tensorflow/core/kernels/strided_slice_op.cc
index 7745effe2abe94ba73a2f0d761210e07c62e499c..1e3e92a68a05123bafad77348e6811a14c303301 100644
--- a/tensorflow/core/kernels/strided_slice_op.cc
+++ b/tensorflow/core/kernels/strided_slice_op.cc
@@ -109,17 +109,27 @@ class StridedSliceOp : public OpKernel {
     if (is_identity) {
       VLOG(1) << "Strided slice identity ";
       Tensor tmp;
-      CHECK(tmp.CopyFrom(input, final_shape));
+      OP_REQUIRES(context, tmp.CopyFrom(input, final_shape),
+                  errors::Internal("Copy failed"));
       context->set_output(0, tmp);
       return;
     }
 
     // Optimization #2, slice is memory contiguous (only occurs in dim 0)
     if (slice_dim0 && IsDim0SliceAligned<T>(input.shape(), begin[0], end[0])) {
-      CHECK_GE(input.dims(), 1);  // Otherwise, is_identity should be true.
+      OP_REQUIRES(context, input.dims() >= 1,
+                  errors::InvalidArgument(
+                      "Input must have rank at least 1, got: ", input.dims()));
+      // Otherwise, is_identity should be true.
       VLOG(1) << "Strided slice dim 0: " << input.shape().DebugString();
+      OP_REQUIRES(
+          context, begin[0] <= end[0],
+          errors::InvalidArgument("begin[0] (", begin[0],
+                                  ") must be <= end[0] (", end[0], ")"));
+      Tensor slice = input.Slice(begin[0], end[0]);
       Tensor tmp;
-      CHECK(tmp.CopyFrom(input.Slice(begin[0], end[0]), final_shape));
+      OP_REQUIRES(context, tmp.CopyFrom(slice, final_shape),
+                  errors::Internal("Copy failed"));
       context->set_output(0, tmp);
       return;
     }
@@ -238,7 +248,8 @@ class StridedSliceGradOp : public OpKernel {
 
     if (processing_shape.dims() == 0) {
       auto in = context->input(4);
-      CHECK(result->CopyFrom(in, processing_shape));
+      OP_REQUIRES(context, result->CopyFrom(in, processing_shape),
+                  errors::Internal("Copy failed"));
       return;
     }
 
diff --git a/tensorflow/core/kernels/tensor_array_ops.cc b/tensorflow/core/kernels/tensor_array_ops.cc
index af93d814ec06ff86c6c7eb3312d97224dee485f2..7ec26d95e6886d639d2dde5a61456898529be524 100644
--- a/tensorflow/core/kernels/tensor_array_ops.cc
+++ b/tensorflow/core/kernels/tensor_array_ops.cc
@@ -1104,9 +1104,9 @@ class TensorArrayUnpackOrScatterOp : public OpKernel {
       indices[1] = i;
 
       if (element_shape.num_elements() > 0) {
-        functor::Split<Device, T>()(ctx->eigen_device<Device>(),
-                                    tensor_value_i_t, tensor_value_t, indices,
-                                    sizes);
+        functor::Split<Device, T, 3>()(ctx->eigen_device<Device>(),
+                                       tensor_value_i_t, tensor_value_t,
+                                       indices, sizes);
       }
 
       write_values.push_back(persistent_tensor);
@@ -1295,9 +1295,9 @@ class TensorArraySplitOp : public OpKernel {
         auto tensor_value_i_t = tensor_value_i->shaped<T, 3>(
             {1, tensor_lengths_t(i), elements_per_row});
 
-        functor::Split<Device, T>()(ctx->eigen_device<Device>(),
-                                    tensor_value_i_t, tensor_value_t, indices,
-                                    sizes);
+        functor::Split<Device, T, 3>()(ctx->eigen_device<Device>(),
+                                       tensor_value_i_t, tensor_value_t,
+                                       indices, sizes);
       }
 
       write_values.push_back(persistent_tensor);
diff --git a/tensorflow/core/kernels/training_ops.cc b/tensorflow/core/kernels/training_ops.cc
index 233aa03c32333e62281cb8ab71828649b4fabe7e..f53c567c4da19e18dc2832fbc47ee27ee1c928d1 100644
--- a/tensorflow/core/kernels/training_ops.cc
+++ b/tensorflow/core/kernels/training_ops.cc
@@ -15,6 +15,8 @@ limitations under the License.
 
 #define EIGEN_USE_THREADS
 
+#include "tensorflow/core/lib/bfloat16/bfloat16.h"
+
 #include <algorithm>
 
 #include "tensorflow/core/framework/op_kernel.h"
@@ -494,6 +496,7 @@ class ApplyGradientDescentOp<SYCLDevice, T> : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -647,6 +650,7 @@ class ApplyAdadeltaOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -822,6 +826,7 @@ class SparseApplyAdadeltaOp : public OpKernel {
   REGISTER_KERNELS(T, int64);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -1107,6 +1112,7 @@ class ApplyAdagradOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -1360,6 +1366,7 @@ class SparseApplyAdagradOp : public OpKernel {
   REGISTER_KERNELS(T, int64);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -1961,6 +1968,7 @@ class ApplyFtrlOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -1982,6 +1990,7 @@ TF_CALL_double(REGISTER_CPU_KERNELS);
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2230,6 +2239,7 @@ class SparseApplyFtrlOp : public OpKernel {
   REGISTER_KERNELS(T, int64);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2254,6 +2264,7 @@ TF_CALL_double(REGISTER_CPU_KERNELS);
   REGISTER_KERNELS(T, int64);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2332,6 +2343,7 @@ class ApplyMomentumOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2471,6 +2483,7 @@ class SparseApplyMomentumOp : public OpKernel {
   REGISTER_KERNELS(T, int64);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2698,6 +2711,7 @@ class ApplyAdamOp<SYCLDevice, T> : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2937,6 +2951,7 @@ class ApplyCenteredRMSPropOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -3352,6 +3367,7 @@ class ApplyAddSignOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -3457,6 +3473,7 @@ class ApplyPowerSignOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
diff --git a/tensorflow/core/kernels/unpack_op.cc b/tensorflow/core/kernels/unpack_op.cc
index 764b6a252adf09c13511a01f95332857f46eee96..1e1647db5c1c41d6242cab87b0d8a8cf66d32a28 100644
--- a/tensorflow/core/kernels/unpack_op.cc
+++ b/tensorflow/core/kernels/unpack_op.cc
@@ -90,21 +90,21 @@ class UnpackOp : public OpKernel {
     }
 #endif  // TENSORFLOW_USE_SYCL
 
-    int64 before_dim = 1;
+    Eigen::DenseIndex before_dim = 1;
     for (int i = 0; i < axis; ++i) {
       before_dim *= input_shape.dim_size(i);
     }
 
-    int64 after_dim = 1;
+    Eigen::DenseIndex after_dim = 1;
     for (int i = axis + 1; i < input_shape.dims(); ++i) {
       after_dim *= input_shape.dim_size(i);
     }
-    const int64 axis_dim = input_shape.dim_size(axis);
+    const Eigen::DenseIndex axis_dim = input_shape.dim_size(axis);
 
     // Except for shape, unpack is a special case of split, so we reuse the
     // same computational kernels.
     auto input_reshaped =
-        input.shaped<T, 3>({1, before_dim, axis_dim * after_dim});
+        input.shaped<T, 2>({before_dim, axis_dim * after_dim});
 
     for (int i = 0; i < num; ++i) {
       Tensor* output;
@@ -112,12 +112,12 @@ class UnpackOp : public OpKernel {
                      context->allocate_output(i, output_shape, &output));
 
       if (output_shape.num_elements() > 0) {
-        auto output_shaped = output->shaped<T, 3>({1, before_dim, after_dim});
-        Eigen::DSizes<Eigen::DenseIndex, 3> indices{0, 0, i * after_dim};
-        Eigen::DSizes<Eigen::DenseIndex, 3> sizes{1, before_dim, after_dim};
-        functor::Split<Device, T>()(context->eigen_device<Device>(),
-                                    output_shaped, input_reshaped, indices,
-                                    sizes);
+        auto output_shaped = output->shaped<T, 2>({before_dim, after_dim});
+        Eigen::DSizes<Eigen::DenseIndex, 2> indices{0, i * after_dim};
+        Eigen::DSizes<Eigen::DenseIndex, 2> sizes{before_dim, after_dim};
+        functor::Split<Device, T, 2>()(context->eigen_device<Device>(),
+                                       output_shaped, input_reshaped, indices,
+                                       sizes);
       }
     }
   }
diff --git a/tensorflow/core/lib/bfloat16/bfloat16.h b/tensorflow/core/lib/bfloat16/bfloat16.h
index f9cca0ef2ab90c677e47d979a4636b3fc25ec919..126e5a17af42a36be31f4fa6698f55d02f8321a7 100644
--- a/tensorflow/core/lib/bfloat16/bfloat16.h
+++ b/tensorflow/core/lib/bfloat16/bfloat16.h
@@ -16,8 +16,12 @@ limitations under the License.
 #ifndef TENSORFLOW_CORE_LIB_BFLOAT16_BFLOAT16_H_
 #define TENSORFLOW_CORE_LIB_BFLOAT16_BFLOAT16_H_
 
+#include <cmath>
 #include <complex>
 
+// We need cpu_info.h here in order to pick up __BYTE_ORDER__.
+#include "tensorflow/core/platform/cpu_info.h"
+
 #ifdef __CUDACC__
 // All functions callable from CUDA code must be qualified with __device__
 #define B16_DEVICE_FUNC __host__ __device__
@@ -161,6 +165,192 @@ struct bfloat16 {
     return complex128(double(*this), double(0.0));
   }
 
+  union FP32 {
+    unsigned int u;
+    float f;
+  };
+
+  // Converts a floating-point value to bfloat16, using round-to-nearest-even
+  // as the rounding method.
+  // TODO(b/69266521): Add a truncate_to_bfloat16 function and make this
+  // function the default behavior.
+  // TODO: There is a slightly faster implementation (8% faster on CPU)
+  // than this (documented in cl/175987786), that is exponentially harder to
+  // understand and document. Switch to the faster version when converting to
+  // BF16 becomes compute-bound.
+  B16_DEVICE_FUNC static bfloat16 round_to_bfloat16(float v) {
+    uint32_t input;
+    FP32 f;
+    f.f = v;
+    input = f.u;
+    bfloat16 output;
+
+    if (float_isnan(v)) {
+      // If the value is a NaN, squash it to a qNaN with msb of fraction set,
+      // this makes sure after truncation we don't end up with an inf.
+      //
+      // qNaN magic: All exponent bits set + most significant bit of fraction
+      // set.
+      output.value = 0x7fc0;
+    } else {
+      // Fast rounding algorithm that rounds a half value to nearest even. This
+      // reduces expected error when we convert a large number of floats. Here
+      // is how it works:
+      //
+      // Definitions:
+      // To convert a float 32 to bfloat16, a float 32 can be viewed as 32 bits
+      // with the following tags:
+      //
+      // Sign |  Exp (8 bits) | Frac (23 bits)
+      //  S     EEEEEEEE         FFFFFFLRTTTTTTTTTTTTTTT
+      //
+      //  S: Sign bit.
+      //  E: Exponent bits.
+      //  F: First 6 bits of fraction.
+      //  L: Least significant bit of resulting bfloat16 if we truncate away the
+      //  rest of the float32. This is also the 7th bit of fraction
+      //  R: Rounding bit, 8th bit of fraction.
+      //  T: Sticky bits, rest of fraction, 15 bits.
+      //
+      // To round half to nearest even, there are three cases where we want to
+      // round down (simply truncate the remaining bits, which consist of the
+      // rounding bit and sticky bits) and two cases where we want to round up
+      // (truncate then add one to the result).
+      //
+      // The fast converting algorithm simply adds lsb (L) to 0x7fff (15 bits of
+      // 1s) as the rounding bias, adds the rounding bias to the input, then
+      // truncates the last 16 bits away.
+      //
+      // To understand how it works, we can analyze this algorithm case by case:
+      //
+      // 1. L = 0, R = 0:
+      //   Expect: round down, this is less than half value.
+      //
+      //   Algorithm:
+      //   - Rounding bias: 0x7fff + 0 = 0x7fff
+      //   - Adding rounding bias to input may create a carry, depending on
+      //   whether any of the T bits is set to 1.
+      //   - R may be set to 1 if there is a carry.
+      //   - L remains 0.
+      //   - Note that this case also handles Inf and -Inf, where all fraction
+      //   bits, including L, R and Ts are all 0. The output remains Inf after
+      //   this algorithm.
+      //
+      // 2. L = 1, R = 0:
+      //   Expect: round down, this is less than half value.
+      //
+      //   Algorithm:
+      //   - Rounding bias: 0x7fff + 1 = 0x8000
+      //   - Adding rounding bias to input doesn't change sticky bits but
+      //   adds 1 to rounding bit.
+      //   - L remains 1.
+      //
+      // 3. L = 0, R = 1, all of T are 0:
+      //   Expect: round down, this is exactly at half, the result is already
+      //   even (L=0).
+      //
+      //   Algorithm:
+      //   - Rounding bias: 0x7fff + 0 = 0x7fff
+      //   - Adding rounding bias to input sets all sticky bits to 1, but
+      //   doesn't create a carry.
+      //   - R remains 1.
+      //   - L remains 0.
+      //
+      // 4. L = 1, R = 1:
+      //   Expect: round up, this is at or above half; at exactly half, the
+      //   result needs to be rounded to the next even number.
+      //
+      //   Algorithm:
+      //   - Rounding bias: 0x7fff + 1 = 0x8000
+      //   - Adding rounding bias to input doesn't change sticky bits, but
+      //   creates a carry from rounding bit.
+      //   - The carry sets L to 0, creates another carry bit, and propagates
+      //   forward to the F bits.
+      //   - If all the F bits are 1, a carry then propagates to the exponent
+      //   bits, which then creates the minimum value with the next exponent
+      //   value. Note that we won't have the case where exponents are all 1,
+      //   since that's either a NaN (handled in the other if condition) or inf
+      //   (handled in case 1).
+      //
+      // 5. L = 0, R = 1, any of T is 1:
+      //   Expect: round up, this is greater than half.
+      //
+      //   Algorithm:
+      //   - Rounding bias: 0x7fff + 0 = 0x7fff
+      //   - Adding rounding bias to input creates a carry from sticky bits,
+      //   sets rounding bit to 0, then creates another carry.
+      //   - The second carry sets L to 1.
+      //
+      // Examples:
+      //
+      //  Exact half value that is already even:
+      //    Input:
+      //    Sign |  Exp (8 bit)     | Frac (first 7 bit) | Frac (last 16 bit)
+      //     S     E E E E E E E E      F F F F F F L     RTTTTTTTTTTTTTTT
+      //     0     0 0 0 0 0 0 0 0      0 0 0 0 0 1 0     1000000000000000
+      //
+      //     This falls into case 3. We truncate the remaining 16 bits and no
+      //     carry is created into F and L:
+      //
+      //    Output:
+      //    Sign |  Exp (8 bit)     | Frac (first 7 bit)
+      //     S     E E E E E E E E      F F F F F F L
+      //     0     0 0 0 0 0 0 0 0      0 0 0 0 0 1 0
+      //
+      //  Exact half value, round to next even number:
+      //    Input:
+      //    Sign |  Exp (8 bit)     | Frac (first 7 bit) | Frac (last 16 bit)
+      //     S     E E E E E E E E      F F F F F F L     RTTTTTTTTTTTTTTT
+      //     0     0 0 0 0 0 0 0 0      0 0 0 0 0 0 1     1000000000000000
+      //
+      //     This falls into case 4. We create a carry from R and T,
+      //     which then propagates into L and F:
+      //
+      //    Output:
+      //    Sign |  Exp (8 bit)     | Frac (first 7 bit)
+      //     S     E E E E E E E E      F F F F F F L
+      //     0     0 0 0 0 0 0 0 0      0 0 0 0 0 1 0
+      //
+      //
+      //  Max denormal value round to min normal value:
+      //    Input:
+      //    Sign |  Exp (8 bit)     | Frac (first 7 bit) | Frac (last 16 bit)
+      //     S     E E E E E E E E      F F F F F F L     RTTTTTTTTTTTTTTT
+      //     0     0 0 0 0 0 0 0 0      1 1 1 1 1 1 1     1111111111111111
+      //
+      //     This falls into case 4. We create a carry from R and T,
+      //     propagate into L and F, which then propagates into exponent
+      //     bits:
+      //
+      //    Output:
+      //    Sign |  Exp (8 bit)     | Frac (first 7 bit)
+      //     S     E E E E E E E E      F F F F F F L
+      //     0     0 0 0 0 0 0 0 1      0 0 0 0 0 0 0
+      //
+      //  Max normal value round to Inf:
+      //    Input:
+      //    Sign |  Exp (8 bit)     | Frac (first 7 bit) | Frac (last 16 bit)
+      //     S     E E E E E E E E      F F F F F F L     RTTTTTTTTTTTTTTT
+      //     0     1 1 1 1 1 1 1 0      1 1 1 1 1 1 1     1111111111111111
+      //
+      //     This falls into case 4. We create a carry from R and T,
+      //     propagate into L and F, which then propagates into exponent
+      //     bits:
+      //
+      //    Sign |  Exp (8 bit)     | Frac (first 7 bit)
+      //     S     E E E E E E E E      F F F F F F L
+      //     0     1 1 1 1 1 1 1 1      0 0 0 0 0 0 0
+      //
+      //
+      // Least significant bit of resulting bfloat16.
+      uint32_t lsb = (input >> 16) & 1;
+      uint32_t rounding_bias = 0x7fff + lsb;
+      input += rounding_bias;
+      output.value = static_cast<uint16_t>(input >> 16);
+    }
+    return output;
+  }
+
   static bfloat16 epsilon() {
     bfloat16 x;
     x.value = 0x3c00;  // 0x1.0p-7
@@ -173,7 +363,7 @@ struct bfloat16 {
   static const uint16_t NAN_VALUE = 0x7FC0;
 
  private:
-  B16_DEVICE_FUNC bool float_isnan(const float& x) {
+  B16_DEVICE_FUNC static bool float_isnan(const float& x) {
 #ifdef __CUDA_ARCH__
     return ::isnan(x);
 #else
@@ -271,6 +461,35 @@ struct hash<tensorflow::bfloat16> {
     return hash<float>()(static_cast<float>(v));
   }
 };
+
+using tensorflow::bfloat16;
+inline bool isinf(const bfloat16& a) { return std::isinf(float(a)); }
+inline bool isnan(const bfloat16& a) { return std::isnan(float(a)); }
+inline bool isfinite(const bfloat16& a) { return std::isfinite(float(a)); }
+inline bfloat16 abs(const bfloat16& a) { return bfloat16(std::abs(float(a))); }
+inline bfloat16 exp(const bfloat16& a) { return bfloat16(std::exp(float(a))); }
+inline bfloat16 log(const bfloat16& a) { return bfloat16(std::log(float(a))); }
+inline bfloat16 log10(const bfloat16& a) {
+  return bfloat16(std::log10(float(a)));
+}
+inline bfloat16 sqrt(const bfloat16& a) {
+  return bfloat16(std::sqrt(float(a)));
+}
+inline bfloat16 pow(const bfloat16& a, const bfloat16& b) {
+  return bfloat16(std::pow(float(a), float(b)));
+}
+inline bfloat16 sin(const bfloat16& a) { return bfloat16(std::sin(float(a))); }
+inline bfloat16 cos(const bfloat16& a) { return bfloat16(std::cos(float(a))); }
+inline bfloat16 tan(const bfloat16& a) { return bfloat16(std::tan(float(a))); }
+inline bfloat16 tanh(const bfloat16& a) {
+  return bfloat16(std::tanh(float(a)));
+}
+inline bfloat16 floor(const bfloat16& a) {
+  return bfloat16(std::floor(float(a)));
+}
+inline bfloat16 ceil(const bfloat16& a) {
+  return bfloat16(std::ceil(float(a)));
+}
 }  // namespace std
 
 #endif  // TENSORFLOW_CORE_LIB_BFLOAT16_BFLOAT16_H_
diff --git a/tensorflow/core/lib/core/error_codes.proto b/tensorflow/core/lib/core/error_codes.proto
index a7306c8cc1212118a22b25efc67b4589316b207d..b82d3891460cb416aecb9058f9bd9c2d2693c197 100644
--- a/tensorflow/core/lib/core/error_codes.proto
+++ b/tensorflow/core/lib/core/error_codes.proto
@@ -119,7 +119,7 @@ enum Code {
   // Operation is not implemented or not supported/enabled in this service.
   UNIMPLEMENTED = 12;
 
-  // Internal errors.  Means some invariants expected by underlying
+  // Internal errors.  Means some invariant expected by the underlying
   // system has been broken.  If you see one of these errors,
   // something is very broken.
   INTERNAL = 13;
diff --git a/tensorflow/core/lib/core/errors.h b/tensorflow/core/lib/core/errors.h
index 1fd62755d835bb282d79babb96700b5abcb0086b..1a0f4be2eaabcb750acec26b3983e2089f70c557 100644
--- a/tensorflow/core/lib/core/errors.h
+++ b/tensorflow/core/lib/core/errors.h
@@ -16,6 +16,8 @@ limitations under the License.
 #ifndef TENSORFLOW_LIB_CORE_ERRORS_H_
 #define TENSORFLOW_LIB_CORE_ERRORS_H_
 
+#include <sstream>
+
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
@@ -26,6 +28,33 @@ namespace errors {
 
 typedef ::tensorflow::error::Code Code;
 
+namespace internal {
+
+// The DECLARE_ERROR macro below only supports types that can be converted
+// into StrCat's AlphaNum. For the other types we rely on a slower path
+// through std::stringstream. To add support for a new type, it is enough to
+// make sure there is an operator<<() for it:
+//
+//   std::ostream& operator<<(std::ostream& os, const MyType& foo) {
+//     os << foo.ToString();
+//     return os;
+//   }
+// Eventually absl::strings will have native support for this and we will be
+// able to completely remove PrepareForStrCat().
+template <typename T>
+typename std::enable_if<!std::is_convertible<T, strings::AlphaNum>::value,
+                        string>::type
+PrepareForStrCat(const T& t) {
+  std::stringstream ss;
+  ss << t;
+  return ss.str();
+}
+inline const strings::AlphaNum& PrepareForStrCat(const strings::AlphaNum& a) {
+  return a;
+}
+
+}  // namespace internal
+
 // Append some context to an error message.  Each time we append
 // context put it on a new line, since it is possible for there
 // to be several layers of additional context.
@@ -61,8 +90,10 @@ void AppendToMessage(::tensorflow::Status* status, Args... args) {
 #define DECLARE_ERROR(FUNC, CONST)                                       \
   template <typename... Args>                                            \
   ::tensorflow::Status FUNC(Args... args) {                              \
-    return ::tensorflow::Status(::tensorflow::error::CONST,              \
-                                ::tensorflow::strings::StrCat(args...)); \
+    return ::tensorflow::Status(                                         \
+        ::tensorflow::error::CONST,                                      \
+        ::tensorflow::strings::StrCat(                                   \
+            ::tensorflow::errors::internal::PrepareForStrCat(args)...)); \
   }                                                                      \
   inline bool Is##FUNC(const ::tensorflow::Status& status) {             \
     return status.code() == ::tensorflow::error::CONST;                  \
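The overload pair added to errors.h above can be illustrated in isolation. What follows is a minimal stand-alone sketch, not the TensorFlow implementation: `AlphaNumLike`, `Prepare`, `Concat`, `MakeMessage`, and `Point` are hypothetical names standing in for `strings::AlphaNum`, `PrepareForStrCat`, `StrCat`, the `DECLARE_ERROR` body, and a user type. It assumes only that the fast-path type is implicitly constructible from the cheap-to-format arguments.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical stand-in for strings::AlphaNum: implicitly constructible
// from a few types that are cheap to format.
struct AlphaNumLike {
  std::string text;
  AlphaNumLike(int v) : text(std::to_string(v)) {}
  AlphaNumLike(const char* s) : text(s) {}
  AlphaNumLike(const std::string& s) : text(s) {}
};

// Slow path: participates in overload resolution only when T is NOT
// convertible to AlphaNumLike, and formats through operator<<.
template <typename T>
typename std::enable_if<!std::is_convertible<T, AlphaNumLike>::value,
                        std::string>::type
Prepare(const T& t) {
  std::stringstream ss;
  ss << t;
  return ss.str();
}

// Fast path: anything convertible is passed through unchanged.
inline const AlphaNumLike& Prepare(const AlphaNumLike& a) { return a; }

// Simplified stand-in for StrCat.
inline std::string Concat() { return std::string(); }
template <typename... Rest>
std::string Concat(const AlphaNumLike& first, const Rest&... rest) {
  return first.text + Concat(rest...);
}

// Mirrors the shape of the DECLARE_ERROR body: each argument is prepared
// individually before concatenation.
template <typename... Args>
std::string MakeMessage(Args... args) {
  return Concat(Prepare(args)...);
}

// A user type that only provides operator<<, so it takes the slow path.
struct Point {
  int x;
  int y;
};
inline std::ostream& operator<<(std::ostream& os, const Point& p) {
  return os << "(" << p.x << "," << p.y << ")";
}
```

With these definitions, `MakeMessage("p=", Point{1, 2}, " n=", 3)` formats the `Point` through `std::stringstream` and everything else through the fast path. The reference returned by the fast-path overload may be bound to a temporary, so it is only safe to use within the same full expression, as the sketch does.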
diff --git a/tensorflow/core/lib/core/stringpiece.cc b/tensorflow/core/lib/core/stringpiece.cc
index 29b727fc4463d933ceeb402c5dd92f3ea5b8a62a..5bd79778a66f65fbe3963b664a639a7c1b028237 100644
--- a/tensorflow/core/lib/core/stringpiece.cc
+++ b/tensorflow/core/lib/core/stringpiece.cc
@@ -17,14 +17,9 @@ limitations under the License.
 
 #include <algorithm>
 #include <iostream>
-#include "tensorflow/core/lib/hash/hash.h"
 
 namespace tensorflow {
 
-size_t StringPieceHasher::operator()(StringPiece s) const {
-  return Hash64(s.data(), s.size());
-}
-
 std::ostream& operator<<(std::ostream& o, StringPiece piece) {
   o.write(piece.data(), piece.size());
   return o;
diff --git a/tensorflow/core/lib/core/stringpiece.h b/tensorflow/core/lib/core/stringpiece.h
index caa9642774bebec05a28b7a0c2ea71d18d6ebd1a..79409cce4b492939c7a0758e1dc0c0f0d06cace8 100644
--- a/tensorflow/core/lib/core/stringpiece.h
+++ b/tensorflow/core/lib/core/stringpiece.h
@@ -35,8 +35,6 @@ limitations under the License.
 
 namespace tensorflow {
 
-struct StringPieceHasher;
-
 class StringPiece {
  public:
   typedef size_t size_type;
@@ -90,11 +88,13 @@ class StringPiece {
 
   size_t find(char c, size_t pos = 0) const;
   size_t rfind(char c, size_t pos = npos) const;
+  // DEPRECATED: Use tensorflow::str_util::StrContains instead.
   bool contains(StringPiece s) const;
 
   // Checks whether StringPiece starts with x and if so advances the beginning
   // of it to past the match.  It's basically a shortcut for starts_with
   // followed by remove_prefix.
+  // DEPRECATED: Use tensorflow::str_util::ConsumePrefix instead.
   bool Consume(StringPiece x) {
     if (starts_with(x)) {
       remove_prefix(x.size_);
@@ -115,10 +115,12 @@ class StringPiece {
   int compare(StringPiece b) const;
 
   // Return true iff "x" is a prefix of "*this"
+  // DEPRECATED: Use tensorflow::str_util::StartsWith instead.
   bool starts_with(StringPiece x) const {
     return ((size_ >= x.size_) && (memcmp(data_, x.data_, x.size_) == 0));
   }
   // Return true iff "x" is a suffix of "*this"
+  // DEPRECATED: Use tensorflow::str_util::EndsWith instead.
   bool ends_with(StringPiece x) const {
     return ((size_ >= x.size_) &&
             (memcmp(data_ + (size_ - x.size_), x.data_, x.size_) == 0));
@@ -131,10 +133,6 @@ class StringPiece {
   // Intentionally copyable
 };
 
-struct StringPieceHasher {
-  size_t operator()(StringPiece s) const;
-};
-
 inline bool operator==(StringPiece x, StringPiece y) {
   return ((x.size() == y.size()) &&
           (memcmp(x.data(), y.data(), x.size()) == 0));
diff --git a/tensorflow/core/lib/core/stringpiece_test.cc b/tensorflow/core/lib/core/stringpiece_test.cc
index 8f17b85b6d7941d7084ce4e142de4ad33f1e8202..d0dbeb6072c2d5e7f11d6ec47f6c2443a96dee47 100644
--- a/tensorflow/core/lib/core/stringpiece_test.cc
+++ b/tensorflow/core/lib/core/stringpiece_test.cc
@@ -65,74 +65,4 @@ TEST(StringPiece, Contains) {
   EXPECT_TRUE(!a.contains(d));
 }
 
-TEST(StringPieceHasher, Equality) {
-  StringPieceHasher hasher;
-
-  StringPiece s1("foo");
-  StringPiece s2("bar");
-  StringPiece s3("baz");
-  StringPiece s4("zot");
-
-  EXPECT_TRUE(hasher(s1) != hasher(s2));
-  EXPECT_TRUE(hasher(s1) != hasher(s3));
-  EXPECT_TRUE(hasher(s1) != hasher(s4));
-  EXPECT_TRUE(hasher(s2) != hasher(s3));
-  EXPECT_TRUE(hasher(s2) != hasher(s4));
-  EXPECT_TRUE(hasher(s3) != hasher(s4));
-
-  EXPECT_TRUE(hasher(s1) == hasher(s1));
-  EXPECT_TRUE(hasher(s2) == hasher(s2));
-  EXPECT_TRUE(hasher(s3) == hasher(s3));
-  EXPECT_TRUE(hasher(s4) == hasher(s4));
-}
-
-TEST(StringPieceHasher, HashMap) {
-  string s1("foo");
-  string s2("bar");
-  string s3("baz");
-
-  StringPiece p1(s1);
-  StringPiece p2(s2);
-  StringPiece p3(s3);
-
-  std::unordered_map<StringPiece, int, StringPieceHasher> map;
-
-  map.insert(std::make_pair(p1, 0));
-  map.insert(std::make_pair(p2, 1));
-  map.insert(std::make_pair(p3, 2));
-  EXPECT_EQ(map.size(), 3);
-
-  bool found[3] = {false, false, false};
-  for (auto const& val : map) {
-    int x = val.second;
-    EXPECT_TRUE(x >= 0 && x < 3);
-    EXPECT_TRUE(!found[x]);
-    found[x] = true;
-  }
-  EXPECT_EQ(found[0], true);
-  EXPECT_EQ(found[1], true);
-  EXPECT_EQ(found[2], true);
-
-  auto new_iter = map.find("zot");
-  EXPECT_TRUE(new_iter == map.end());
-
-  new_iter = map.find("bar");
-  EXPECT_TRUE(new_iter != map.end());
-
-  map.erase(new_iter);
-  EXPECT_EQ(map.size(), 2);
-
-  found[0] = false;
-  found[1] = false;
-  found[2] = false;
-  for (const auto& iter : map) {
-    int x = iter.second;
-    EXPECT_TRUE(x >= 0 && x < 3);
-    EXPECT_TRUE(!found[x]);
-    found[x] = true;
-  }
-  EXPECT_EQ(found[0], true);
-  EXPECT_EQ(found[1], false);
-  EXPECT_EQ(found[2], true);
-}
 }  // namespace tensorflow
diff --git a/tensorflow/core/lib/core/threadpool_test.cc b/tensorflow/core/lib/core/threadpool_test.cc
index 627ef5a892a35ec43d0c31220dcf046b4b8eda55..320f3ebb8328b23c5e0b10ae2effe1de2528246b 100644
--- a/tensorflow/core/lib/core/threadpool_test.cc
+++ b/tensorflow/core/lib/core/threadpool_test.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include <atomic>
 
+#include "tensorflow/core/platform/context.h"
 #include "tensorflow/core/platform/env.h"
 #include "tensorflow/core/platform/mutex.h"
 #include "tensorflow/core/platform/test.h"
@@ -35,6 +36,7 @@ TEST(ThreadPool, Empty) {
 }
 
 TEST(ThreadPool, DoWork) {
+  Context outer_context(ContextKind::kThread);
   for (int num_threads = 1; num_threads < kNumThreads; num_threads++) {
     fprintf(stderr, "Testing with %d threads\n", num_threads);
     const int kWorkItems = 15;
@@ -45,7 +47,9 @@ TEST(ThreadPool, DoWork) {
     {
       ThreadPool pool(Env::Default(), "test", num_threads);
       for (int i = 0; i < kWorkItems; i++) {
-        pool.Schedule([&work, i]() {
+        pool.Schedule([&outer_context, &work, i]() {
+          Context inner_context(ContextKind::kThread);
+          ASSERT_EQ(outer_context, inner_context);
           ASSERT_FALSE(work[i]);
           work[i] = true;
         });
@@ -58,6 +62,7 @@ TEST(ThreadPool, DoWork) {
 }
 
 TEST(ThreadPool, ParallelFor) {
+  Context outer_context(ContextKind::kThread);
   // Make ParallelFor use as many threads as possible.
   int64 kHugeCost = 1 << 30;
   for (int num_threads = 1; num_threads < kNumThreads; num_threads++) {
@@ -68,12 +73,15 @@ TEST(ThreadPool, ParallelFor) {
     for (int i = 0; i < kWorkItems; i++) {
       work[i] = false;
     }
-    pool.ParallelFor(kWorkItems, kHugeCost, [&work](int64 begin, int64 end) {
-      for (int64 i = begin; i < end; ++i) {
-        ASSERT_FALSE(work[i]);
-        work[i] = true;
-      }
-    });
+    pool.ParallelFor(kWorkItems, kHugeCost,
+                     [&outer_context, &work](int64 begin, int64 end) {
+                       Context inner_context(ContextKind::kThread);
+                       ASSERT_EQ(outer_context, inner_context);
+                       for (int64 i = begin; i < end; ++i) {
+                         ASSERT_FALSE(work[i]);
+                         work[i] = true;
+                       }
+                     });
     for (int i = 0; i < kWorkItems; i++) {
       ASSERT_TRUE(work[i]);
     }
@@ -167,5 +175,40 @@ static void BM_Parallel(int iters) {
 }
 BENCHMARK(BM_Parallel);
 
+static void BM_ParallelFor(int iters, int total, int cost_per_unit) {
+  ThreadPool pool(Env::Default(), "test", kNumThreads);
+  // Decrement count concurrently until 0.
+  std::atomic_int_fast32_t count(iters);
+  mutex done_lock;
+  condition_variable done;
+  bool done_flag = false;
+  for (int i = 0; i < iters; ++i) {
+    pool.ParallelFor(
+        total, cost_per_unit,
+        [&count, &done_lock, &done, &done_flag](int64 begin, int64 end) {
+          for (int64 i = begin; i < end; ++i) {
+            if (count.fetch_sub(1) == 1) {
+              mutex_lock l(done_lock);
+              done_flag = true;
+              done.notify_all();
+            }
+          }
+        });
+  }
+  mutex_lock l(done_lock);
+  if (!done_flag) {
+    done.wait(l);
+  }
+}
+BENCHMARK(BM_ParallelFor)
+    ->ArgPair(1 << 10, 1)
+    ->ArgPair(1 << 20, 1)
+    ->ArgPair(1 << 10, 1 << 10)
+    ->ArgPair(1 << 20, 1 << 10)
+    ->ArgPair(1 << 10, 1 << 20)
+    ->ArgPair(1 << 20, 1 << 20)
+    ->ArgPair(1 << 10, 1 << 30)
+    ->ArgPair(1 << 20, 1 << 30);
+
 }  // namespace thread
 }  // namespace tensorflow
diff --git a/tensorflow/core/lib/hash/hash.cc b/tensorflow/core/lib/hash/hash.cc
index ed9b4df37a043ec0c1241be088c93e031ed1f42e..dc9d300d00e596478553f0f296b0c4fcf8c85068 100644
--- a/tensorflow/core/lib/hash/hash.cc
+++ b/tensorflow/core/lib/hash/hash.cc
@@ -126,15 +126,4 @@ uint64 Hash64(const char* data, size_t n, uint64 seed) {
   return h;
 }
 
-bool SerializeToStringDeterministic(const protobuf::MessageLite& msg,
-                                    string* result) {
-  const size_t size = msg.ByteSizeLong();
-  *result = string(size, '\0');
-  protobuf::io::ArrayOutputStream array_stream(&(*result)[0], size);
-  protobuf::io::CodedOutputStream output_stream(&array_stream);
-  output_stream.SetSerializationDeterministic(true);
-  msg.SerializeWithCachedSizes(&output_stream);
-  return !output_stream.HadError() && size == output_stream.ByteCount();
-}
-
 }  // namespace tensorflow
diff --git a/tensorflow/core/lib/hash/hash.h b/tensorflow/core/lib/hash/hash.h
index 4d312ab7e830963671a8be9d4622a5b83488d295..ca05e6346e2045da52ac048f80716cf046b2fd26 100644
--- a/tensorflow/core/lib/hash/hash.h
+++ b/tensorflow/core/lib/hash/hash.h
@@ -24,7 +24,6 @@ limitations under the License.
 #include <string>
 
 #include "tensorflow/core/lib/core/stringpiece.h"
-#include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/platform/types.h"
 
 namespace tensorflow {
@@ -64,13 +63,6 @@ struct hash<T*> {
   }
 };
 
-template <>
-struct hash<bfloat16> {
-  size_t operator()(const bfloat16& t) const {
-    return std::hash<float>()(static_cast<float>(t));
-  }
-};
-
 template <>
 struct hash<string> {
   size_t operator()(const string& s) const {
@@ -84,6 +76,7 @@ struct hash<StringPiece> {
     return static_cast<size_t>(Hash64(sp.data(), sp.size()));
   }
 };
+using StringPieceHasher = ::tensorflow::hash<StringPiece>;
 
 template <typename T, typename U>
 struct hash<std::pair<T, U>> {
@@ -92,15 +85,6 @@ struct hash<std::pair<T, U>> {
   }
 };
 
-// Wrapper around protocol buffer serialization that requests deterministic
-// serialization, in particular for Map fields, which serialize in a random
-// order by default. Returns true on success.
-// Serialization is guaranteed to be deterministic for a given binary only.
-// See the following for more details:
-// https://github.com/google/protobuf/blob/a1bb147e96b6f74db6cdf3c3fcb00492472dbbfa/src/google/protobuf/io/coded_stream.h#L834
-bool SerializeToStringDeterministic(const protobuf::MessageLite& msg,
-                                    string* result);
-
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_LIB_HASH_HASH_H_
diff --git a/tensorflow/core/lib/hash/hash_test.cc b/tensorflow/core/lib/hash/hash_test.cc
index 0e5f6c6803389d78d648142992e9ad5b0d487d26..7d583131322dec509f06d2f4569c2cbc41e19e9b 100644
--- a/tensorflow/core/lib/hash/hash_test.cc
+++ b/tensorflow/core/lib/hash/hash_test.cc
@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include <map>
+#include <unordered_map>
 #include <vector>
 
 #include "tensorflow/core/lib/hash/hash.h"
@@ -81,4 +83,75 @@ static void BM_Hash32(int iters, int len) {
 }
 BENCHMARK(BM_Hash32)->Range(1, 1024);
 
+TEST(StringPieceHasher, Equality) {
+  StringPieceHasher hasher;
+
+  StringPiece s1("foo");
+  StringPiece s2("bar");
+  StringPiece s3("baz");
+  StringPiece s4("zot");
+
+  EXPECT_TRUE(hasher(s1) != hasher(s2));
+  EXPECT_TRUE(hasher(s1) != hasher(s3));
+  EXPECT_TRUE(hasher(s1) != hasher(s4));
+  EXPECT_TRUE(hasher(s2) != hasher(s3));
+  EXPECT_TRUE(hasher(s2) != hasher(s4));
+  EXPECT_TRUE(hasher(s3) != hasher(s4));
+
+  EXPECT_TRUE(hasher(s1) == hasher(s1));
+  EXPECT_TRUE(hasher(s2) == hasher(s2));
+  EXPECT_TRUE(hasher(s3) == hasher(s3));
+  EXPECT_TRUE(hasher(s4) == hasher(s4));
+}
+
+TEST(StringPieceHasher, HashMap) {
+  string s1("foo");
+  string s2("bar");
+  string s3("baz");
+
+  StringPiece p1(s1);
+  StringPiece p2(s2);
+  StringPiece p3(s3);
+
+  std::unordered_map<StringPiece, int, StringPieceHasher> map;
+
+  map.insert(std::make_pair(p1, 0));
+  map.insert(std::make_pair(p2, 1));
+  map.insert(std::make_pair(p3, 2));
+  EXPECT_EQ(map.size(), 3);
+
+  bool found[3] = {false, false, false};
+  for (auto const& val : map) {
+    int x = val.second;
+    EXPECT_TRUE(x >= 0 && x < 3);
+    EXPECT_TRUE(!found[x]);
+    found[x] = true;
+  }
+  EXPECT_EQ(found[0], true);
+  EXPECT_EQ(found[1], true);
+  EXPECT_EQ(found[2], true);
+
+  auto new_iter = map.find("zot");
+  EXPECT_TRUE(new_iter == map.end());
+
+  new_iter = map.find("bar");
+  EXPECT_TRUE(new_iter != map.end());
+
+  map.erase(new_iter);
+  EXPECT_EQ(map.size(), 2);
+
+  found[0] = false;
+  found[1] = false;
+  found[2] = false;
+  for (const auto& iter : map) {
+    int x = iter.second;
+    EXPECT_TRUE(x >= 0 && x < 3);
+    EXPECT_TRUE(!found[x]);
+    found[x] = true;
+  }
+  EXPECT_EQ(found[0], true);
+  EXPECT_EQ(found[1], false);
+  EXPECT_EQ(found[2], true);
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/lib/io/path.cc b/tensorflow/core/lib/io/path.cc
index 83f15e134d6f60c65a7523458353ffd62345b7cc..996fbf62e5c1736f9922fcb652f65259e985a7f1 100644
--- a/tensorflow/core/lib/io/path.cc
+++ b/tensorflow/core/lib/io/path.cc
@@ -27,9 +27,9 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/core/lib/strings/scanner.h"
-#include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
-#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/mutex.h"
 
 namespace tensorflow {
 namespace io {
diff --git a/tensorflow/core/lib/io/path.h b/tensorflow/core/lib/io/path.h
index 47bb2b998d637099b3ab788f7ce274f83e4fc646..818ba99888d041f016210292a7c0cf18ef7d0e41 100644
--- a/tensorflow/core/lib/io/path.h
+++ b/tensorflow/core/lib/io/path.h
@@ -16,7 +16,6 @@ limitations under the License.
 #ifndef TENSORFLOW_LIB_IO_PATH_H_
 #define TENSORFLOW_LIB_IO_PATH_H_
 
-#include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 
 namespace tensorflow {
diff --git a/tensorflow/core/lib/io/record_reader.cc b/tensorflow/core/lib/io/record_reader.cc
index 254fdf115da132343b8e6f176e67672a11281cd0..6de850bb20716e1f1ce73936134f561fe295ad8d 100644
--- a/tensorflow/core/lib/io/record_reader.cc
+++ b/tensorflow/core/lib/io/record_reader.cc
@@ -205,7 +205,9 @@ Status RecordReader::SkipNBytes(uint64 offset) {
     if (options_.buffer_size > 0) {
       TF_RETURN_IF_ERROR(input_stream_->SkipNBytes(offset));
     }
+#if !defined(IS_SLIM_BUILD)
   }
+#endif
   return Status::OK();
 }  // namespace io
 
diff --git a/tensorflow/core/lib/io/record_reader.h b/tensorflow/core/lib/io/record_reader.h
index 62dd2efb792988c4197cf7172b25ac34cdd77ed9..26278e03284df7584bb8a681dd0fb38ed20206b7 100644
--- a/tensorflow/core/lib/io/record_reader.h
+++ b/tensorflow/core/lib/io/record_reader.h
@@ -16,10 +16,10 @@ limitations under the License.
 #ifndef TENSORFLOW_LIB_IO_RECORD_READER_H_
 #define TENSORFLOW_LIB_IO_RECORD_READER_H_
 
-#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
-#if !defined(IS_SLIM_BUILD)
 #include "tensorflow/core/lib/io/inputstream_interface.h"
+#if !defined(IS_SLIM_BUILD)
 #include "tensorflow/core/lib/io/zlib_compression_options.h"
 #include "tensorflow/core/lib/io/zlib_inputstream.h"
 #endif  // IS_SLIM_BUILD
diff --git a/tensorflow/core/lib/psnr/testdata/cat_q20.jpg b/tensorflow/core/lib/psnr/testdata/cat_q20.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..d7b882a7a7b17ca6f77876d6f534c41c3c62a11a
Binary files /dev/null and b/tensorflow/core/lib/psnr/testdata/cat_q20.jpg differ
diff --git a/tensorflow/core/lib/psnr/testdata/cat_q72.jpg b/tensorflow/core/lib/psnr/testdata/cat_q72.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..2b5dd75ac9e391a92f29aebfcad0fd2079bc6029
Binary files /dev/null and b/tensorflow/core/lib/psnr/testdata/cat_q72.jpg differ
diff --git a/tensorflow/core/lib/psnr/testdata/cat_q95.jpg b/tensorflow/core/lib/psnr/testdata/cat_q95.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..7fa3c3157fbfa4f02bc5feb726e46b9a33cc2f2f
Binary files /dev/null and b/tensorflow/core/lib/psnr/testdata/cat_q95.jpg differ
diff --git a/tensorflow/core/lib/random/random_distributions.h b/tensorflow/core/lib/random/random_distributions.h
index 3fe1f9bc6cf06158df4811eaa177988b60890006..ad16dbf01fc402303c1ba33c2479d2f8f1bb039c 100644
--- a/tensorflow/core/lib/random/random_distributions.h
+++ b/tensorflow/core/lib/random/random_distributions.h
@@ -25,6 +25,7 @@ limitations under the License.
 #include <algorithm>
 
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+#include "tensorflow/core/lib/bfloat16/bfloat16.h"
 #include "tensorflow/core/lib/random/philox_random.h"
 
 namespace tensorflow {
@@ -32,6 +33,8 @@ namespace random {
 
 // Helper function to convert a 16-bit integer to a half between [0..1).
 PHILOX_DEVICE_INLINE Eigen::half Uint16ToHalf(uint16 x);
+// Helper function to convert a 16-bit integer to a bfloat16 between [0..1).
+PHILOX_DEVICE_INLINE bfloat16 Uint16ToGfloat16(uint16 x);
 // Helper function to convert a 32-bit integer to a float between [0..1).
 PHILOX_DEVICE_INLINE float Uint32ToFloat(uint32 x);
 // Helper function to convert two 32-bit integers to a double between [0..1).
@@ -75,6 +78,30 @@ class UniformDistribution<Generator, Eigen::half> {
   }
 };
 
+template <class Generator>
+class UniformDistribution<Generator, bfloat16> {
+ public:
+  // The number of elements that will be returned.
+  static const int kResultElementCount = Generator::kResultElementCount;
+  // Cost of generation of a single element (in cycles).
+  static const int kElementCost = 3;
+  // Indicate that this distribution may take variable number of samples
+  // during the runtime.
+  static const bool kVariableSamplesPerOutput = false;
+  typedef Array<bfloat16, kResultElementCount> ResultType;
+  typedef bfloat16 ResultElementType;
+
+  PHILOX_DEVICE_INLINE
+  ResultType operator()(Generator* gen) {
+    typename Generator::ResultType sample = (*gen)();
+    ResultType result;
+    for (int i = 0; i < kResultElementCount; ++i) {
+      result[i] = Uint16ToGfloat16(sample[i]);
+    }
+    return result;
+  }
+};
+
 template <class Generator>
 class UniformDistribution<Generator, float> {
  public:
@@ -305,6 +332,36 @@ class NormalDistribution<Generator, Eigen::half> {
   }
 };
 
+template <class Generator>
+class NormalDistribution<Generator, bfloat16> {
+ public:
+  // The number of elements that will be returned.
+  static const int kResultElementCount = Generator::kResultElementCount;
+  // Cost of generation of a single element (in cycles).
+  static const int kElementCost = 70;
+  // Indicate that this distribution may take variable number of samples
+  // during the runtime.
+  static const bool kVariableSamplesPerOutput = false;
+  typedef Array<bfloat16, kResultElementCount> ResultType;
+  typedef bfloat16 ResultElementType;
+
+  PHILOX_DEVICE_INLINE
+  ResultType operator()(Generator* gen) {
+    typename Generator::ResultType sample = (*gen)();
+    ResultType result;
+    static_assert(kResultElementCount % 2 == 0,
+                  "kResultElementCount should be an even number");
+    for (int i = 0; i < kResultElementCount; i += 2) {
+      float f[2];
+      // Box-Muller transform requires processing 2 elements at a time.
+      BoxMullerFloat(sample[i], sample[i + 1], &f[0], &f[1]);
+      result[i] = bfloat16(f[0]);
+      result[i + 1] = bfloat16(f[1]);
+    }
+    return result;
+  }
+};
+
 template <class Generator>
 class NormalDistribution<Generator, float> {
  public:
@@ -414,6 +471,48 @@ class TruncatedNormalDistribution<SingleSampleGenerator, Eigen::half> {
   }
 };
 
+template <class SingleSampleGenerator>
+class TruncatedNormalDistribution<SingleSampleGenerator, bfloat16> {
+ public:
+  // The number of elements that will be returned.
+  static const int kResultElementCount =
+      SingleSampleGenerator::kNativeElementCount;
+  // Cost of generation of a single element (in cycles).
+  static const int kElementCost = 90;
+  // Indicate that this distribution may take variable number of samples
+  // during the runtime.
+  static const bool kVariableSamplesPerOutput = true;
+  // The threshold where the normal distribution is truncated.
+  const float kTruncateValue = 2.0f;
+
+  typedef Array<bfloat16, kResultElementCount> ResultType;
+  typedef bfloat16 ResultElementType;
+
+  PHILOX_DEVICE_INLINE
+  ResultType operator()(SingleSampleGenerator* gen) {
+    ResultType results;
+    int index = 0;
+    while (true) {
+      // Repeatedly take samples from the normal distribution, until we have
+      // the desired number of elements that fall within the pre-defined cutoff
+      // threshold.
+      const uint32 x0 = (*gen)();
+      const uint32 x1 = (*gen)();
+      float f[2];
+      BoxMullerFloat(x0, x1, &f[0], &f[1]);
+
+      for (int i = 0; i < 2; ++i) {
+        if (Eigen::numext::abs(f[i]) < kTruncateValue) {
+          results[index++] = bfloat16(f[i]);
+          if (index >= kResultElementCount) {
+            return results;
+          }
+        }
+      }
+    }
+  }
+};
+
 // Partial specialization for float.
 template <class SingleSampleGenerator>
 class TruncatedNormalDistribution<SingleSampleGenerator, float> {
@@ -567,6 +666,27 @@ PHILOX_DEVICE_INLINE Eigen::half Uint16ToHalf(uint16 x) {
   return result - Eigen::half(1.0);
 }
 
+// Helper function to convert a 16-bit integer to a bfloat16 between [0..1).
+// This can create a uniform distribution of values between [0..1).
+PHILOX_DEVICE_INLINE bfloat16 Uint16ToGfloat16(uint16 x) {
+  // bfloat16 values are formatted as follows (MSB first):
+  //    sign(1) exponent(8) mantissa(7)
+  // Conceptually construct the following:
+  //    sign == 0
+  //    exponent == 127  -- an excess 127 representation of a zero exponent
+  //    mantissa == 7 random bits
+  const uint16 man = x & 0x7fu;  // 7 bit mantissa
+  const uint16 exp = static_cast<uint16>(127);
+  const uint16 val = (exp << 7) | man;
+
+  bfloat16 result;
+  memcpy(&result, &val, sizeof(val));
+  // The mantissa has an implicit leading 1, so the above code creates a value
+  // in [1, 2). The minus will not cause a rounding that makes the result 1.
+  // Instead it will just be close to 1.
+  return result - bfloat16(1.0);
+}
+
 // Helper function to convert an 32-bit integer to a float between [0..1).
 PHILOX_DEVICE_INLINE float Uint32ToFloat(uint32 x) {
   // IEEE754 floats are formatted as follows (MSB first):
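The bit construction used by `Uint16ToGfloat16` above can be checked without TensorFlow. The sketch below is a stand-alone approximation under stated assumptions: `Bfloat16BitsToFloat` and `Uint16ToUnitInterval` are hypothetical names, the bfloat16 pattern is widened to a `float` by placing it in the upper 16 bits of an IEEE-754 binary32 value, and the subtraction is done in `float` rather than in `bfloat16`. Every result is a multiple of 1/128 below 1, so it should be exactly representable in either type.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: interpret 16 bits as a bfloat16 pattern by placing
// them in the upper half of an IEEE-754 binary32 value.
inline float Bfloat16BitsToFloat(std::uint16_t bits) {
  const std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &wide, sizeof(f));
  return f;
}

// Sketch of the [0, 1) construction: keep 7 random mantissa bits, force the
// sign to 0 and the biased exponent to 127, then subtract 1.
inline float Uint16ToUnitInterval(std::uint16_t x) {
  const std::uint16_t man = x & 0x7fu;  // 7-bit mantissa
  const std::uint16_t exp = 127;        // biased exponent for 2^0
  const std::uint16_t val = static_cast<std::uint16_t>((exp << 7) | man);
  // val encodes a value in [1, 2); shifting by 1 gives [0, 1).
  return Bfloat16BitsToFloat(val) - 1.0f;
}
```

Only the low 7 bits of the input influence the result, so this construction yields 128 distinct values, from 0 up to 127/128.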
diff --git a/tensorflow/core/lib/random/random_distributions_test.cc b/tensorflow/core/lib/random/random_distributions_test.cc
index 85d68f456e1e27b7a62315f2b0a962843da87d52..8868672a10ae027415d81f76ef146d1a5f28bddd 100644
--- a/tensorflow/core/lib/random/random_distributions_test.cc
+++ b/tensorflow/core/lib/random/random_distributions_test.cc
@@ -37,6 +37,10 @@ namespace {
 // unit normal distribution, it should almost definitely never exceed 6.
 static constexpr float kZLimit = 6.0;
 
+// As bfloat16 has much less precision than float32, the z-value limit should
+// be larger than the one used for float32.
+static constexpr float kZLimitBfloat16 = 20.0;
+
 // A utility function to fill the given array with samples from the given
 // distribution, using the single adapter of the underlying generator
 template <class Distribution>
@@ -93,7 +97,7 @@ bool CheckSamplesMoments(const std::vector<T>& samples,
       // mode, given the large number of samples.
       moments_data[i] += moment;
       ++moments_sample_count_data[i];
-      moment *= samples_data[index];
+      moment *= static_cast<double>(samples_data[index]);
     }
   }
 
@@ -125,7 +129,7 @@ bool CheckSamplesMoments(const std::vector<T>& samples,
     const double z_test =
         fabs((moments[i] - moments_i_mean) / sqrt(total_variance));
 
-    if (z_test > z_limit) {
+    if (z_test > static_cast<double>(z_limit)) {
       LOG(ERROR) << "failing z_test:"
                  << " moment: " << i << " stride: " << stride
                  << " z_test: " << z_test << " z_limit: " << z_limit
@@ -252,6 +256,22 @@ void RandomParametersMomentsTest(int count, int max_moments,
   }
 }
 
+TEST(PhiloxRandomTest, UniformBfloat16MomentsTest) {
+  const std::vector<int> strides = {0, 1, 4, 17};
+  UniformMomentsTest<bfloat16>(1 << 20, 40, strides, bfloat16(kZLimitBfloat16));
+}
+
+TEST(PhiloxRandomTest, NormalBfloat16MomentsTest) {
+  const std::vector<int> strides = {0, 1, 4, 17};
+  NormalMomentsTest<bfloat16>(8 << 20, 25, strides, bfloat16(kZLimitBfloat16));
+}
+
+TEST(PhiloxRandomTest, RandomParametersBfloat16MomentsTest) {
+  const std::vector<int> strides = {0, 1, 4, 17};
+  RandomParametersMomentsTest<bfloat16>(1 << 20, 40, strides,
+                                        bfloat16(kZLimitBfloat16));
+}
+
 TEST(PhiloxRandomTest, UniformFloatMomentsTest) {
   const std::vector<int> strides = {0, 1, 4, 17};
   UniformMomentsTest<float>(1 << 20, 40, strides, kZLimit);
diff --git a/tensorflow/core/lib/ssim/testdata/checkerboard1.png b/tensorflow/core/lib/ssim/testdata/checkerboard1.png
new file mode 100644
index 0000000000000000000000000000000000000000..6ede4395e2c8d3eb530c074ff3ac1c1974a441ef
Binary files /dev/null and b/tensorflow/core/lib/ssim/testdata/checkerboard1.png differ
diff --git a/tensorflow/core/lib/ssim/testdata/checkerboard2.png b/tensorflow/core/lib/ssim/testdata/checkerboard2.png
new file mode 100644
index 0000000000000000000000000000000000000000..baaa43229074f737dcf5040d209e9fb9ecad989c
Binary files /dev/null and b/tensorflow/core/lib/ssim/testdata/checkerboard2.png differ
diff --git a/tensorflow/core/lib/ssim/testdata/checkerboard3.png b/tensorflow/core/lib/ssim/testdata/checkerboard3.png
new file mode 100644
index 0000000000000000000000000000000000000000..95fa3bbb3ee42673b2b7e52cf84e704f249fa26b
Binary files /dev/null and b/tensorflow/core/lib/ssim/testdata/checkerboard3.png differ
diff --git a/tensorflow/core/lib/strings/numbers.cc b/tensorflow/core/lib/strings/numbers.cc
index f5822fad8e3d3b8559d19c79ee2885e580ea3e11..516decc3c01742b9a9fbbef1239037b3a4005b2f 100644
--- a/tensorflow/core/lib/strings/numbers.cc
+++ b/tensorflow/core/lib/strings/numbers.cc
@@ -106,19 +106,22 @@ T locale_independent_strtonum(const char* str, const char** endptr) {
 
 namespace strings {
 
-char* FastInt32ToBufferLeft(int32 i, char* buffer) {
+size_t FastInt32ToBufferLeft(int32 i, char* buffer) {
   uint32 u = i;
+  size_t length = 0;
   if (i < 0) {
     *buffer++ = '-';
+    ++length;
     // We need to do the negation in modular (i.e., "unsigned")
     // arithmetic; MSVC++ apparently warns for plain "-u", so
     // we write the equivalent expression "0 - u" instead.
     u = 0 - u;
   }
-  return FastUInt32ToBufferLeft(u, buffer);
+  length += FastUInt32ToBufferLeft(u, buffer);
+  return length;
 }
 
-char* FastUInt32ToBufferLeft(uint32 i, char* buffer) {
+size_t FastUInt32ToBufferLeft(uint32 i, char* buffer) {
   char* start = buffer;
   do {
     *buffer++ = ((i % 10) + '0');
@@ -126,19 +129,22 @@ char* FastUInt32ToBufferLeft(uint32 i, char* buffer) {
   } while (i > 0);
   *buffer = 0;
   std::reverse(start, buffer);
-  return buffer;
+  return buffer - start;
 }
 
-char* FastInt64ToBufferLeft(int64 i, char* buffer) {
+size_t FastInt64ToBufferLeft(int64 i, char* buffer) {
   uint64 u = i;
+  size_t length = 0;
   if (i < 0) {
     *buffer++ = '-';
+    ++length;
     u = 0 - u;
   }
-  return FastUInt64ToBufferLeft(u, buffer);
+  length += FastUInt64ToBufferLeft(u, buffer);
+  return length;
 }
 
-char* FastUInt64ToBufferLeft(uint64 i, char* buffer) {
+size_t FastUInt64ToBufferLeft(uint64 i, char* buffer) {
   char* start = buffer;
   do {
     *buffer++ = ((i % 10) + '0');
@@ -146,19 +152,18 @@ char* FastUInt64ToBufferLeft(uint64 i, char* buffer) {
   } while (i > 0);
   *buffer = 0;
   std::reverse(start, buffer);
-  return buffer;
+  return buffer - start;
 }
 
 static const double kDoublePrecisionCheckMax = DBL_MAX / 1.000000000000001;
 
-char* DoubleToBuffer(double value, char* buffer) {
+size_t DoubleToBuffer(double value, char* buffer) {
   // DBL_DIG is 15 for IEEE-754 doubles, which are used on almost all
   // platforms these days.  Just in case some system exists where DBL_DIG
   // is significantly larger -- and risks overflowing our buffer -- we have
   // this assert.
   static_assert(DBL_DIG < 20, "DBL_DIG is too big");
 
-  bool full_precision_needed = true;
   if (std::abs(value) <= kDoublePrecisionCheckMax) {
     int snprintf_result =
         snprintf(buffer, kFastToBufferSize, "%.*g", DBL_DIG, value);
@@ -167,18 +172,20 @@ char* DoubleToBuffer(double value, char* buffer) {
     // larger than the precision we asked for.
     DCHECK(snprintf_result > 0 && snprintf_result < kFastToBufferSize);
 
-    full_precision_needed =
-        locale_independent_strtonum<double>(buffer, nullptr) != value;
+    if (locale_independent_strtonum<double>(buffer, nullptr) == value) {
+      // Round-tripping the string to double works; we're done.
+      return snprintf_result;
+    }
+    // else: full precision formatting needed. Fall through.
   }
 
-  if (full_precision_needed) {
-    int snprintf_result =
-        snprintf(buffer, kFastToBufferSize, "%.*g", DBL_DIG + 2, value);
+  int snprintf_result =
+      snprintf(buffer, kFastToBufferSize, "%.*g", DBL_DIG + 2, value);
 
-    // Should never overflow; see above.
-    DCHECK(snprintf_result > 0 && snprintf_result < kFastToBufferSize);
-  }
-  return buffer;
+  // Should never overflow; see above.
+  DCHECK(snprintf_result > 0 && snprintf_result < kFastToBufferSize);
+
+  return snprintf_result;
 }
 
 namespace {
@@ -325,7 +332,7 @@ bool safe_strtod(const char* str, double* value) {
   return *str != '\0' && *endptr == '\0';
 }
 
-char* FloatToBuffer(float value, char* buffer) {
+size_t FloatToBuffer(float value, char* buffer) {
   // FLT_DIG is 6 for IEEE-754 floats, which are used on almost all
   // platforms these days.  Just in case some system exists where FLT_DIG
   // is significantly larger -- and risks overflowing our buffer -- we have
@@ -347,7 +354,7 @@ char* FloatToBuffer(float value, char* buffer) {
     // Should never overflow; see above.
     DCHECK(snprintf_result > 0 && snprintf_result < kFastToBufferSize);
   }
-  return buffer;
+  return snprintf_result;
 }
 
 string FpToString(Fprint fp) {
diff --git a/tensorflow/core/lib/strings/numbers.h b/tensorflow/core/lib/strings/numbers.h
index 3c45b9027401999ba4e6c32005456312970cccba..6b7703be378cde5755a034252eba83e4be99bdc0 100644
--- a/tensorflow/core/lib/strings/numbers.h
+++ b/tensorflow/core/lib/strings/numbers.h
@@ -60,19 +60,18 @@ static const int kFastToBufferSize = 32;
 // the output.  The buffer should typically be at least kFastToBufferSize
 // bytes.
 //
-// Returns a pointer to the end of the string (i.e. the null character
-// terminating the string).
+// Returns the number of characters written, excluding the terminating null.
 // ----------------------------------------------------------------------
 
-char* FastInt32ToBufferLeft(int32 i, char* buffer);    // at least 12 bytes
-char* FastUInt32ToBufferLeft(uint32 i, char* buffer);  // at least 12 bytes
-char* FastInt64ToBufferLeft(int64 i, char* buffer);    // at least 22 bytes
-char* FastUInt64ToBufferLeft(uint64 i, char* buffer);  // at least 22 bytes
+size_t FastInt32ToBufferLeft(int32 i, char* buffer);    // at least 12 bytes
+size_t FastUInt32ToBufferLeft(uint32 i, char* buffer);  // at least 12 bytes
+size_t FastInt64ToBufferLeft(int64 i, char* buffer);    // at least 22 bytes
+size_t FastUInt64ToBufferLeft(uint64 i, char* buffer);  // at least 22 bytes
 
 // Required buffer size for DoubleToBuffer is kFastToBufferSize.
 // Required buffer size for FloatToBuffer is kFastToBufferSize.
-char* DoubleToBuffer(double i, char* buffer);
-char* FloatToBuffer(float i, char* buffer);
+size_t DoubleToBuffer(double value, char* buffer);
+size_t FloatToBuffer(float value, char* buffer);
 
 // Convert a 64-bit fingerprint value to an ASCII representation.
 string FpToString(Fprint fp);
diff --git a/tensorflow/core/lib/strings/proto_serialization.cc b/tensorflow/core/lib/strings/proto_serialization.cc
new file mode 100644
index 0000000000000000000000000000000000000000..5c1fbda2155492c00049f52ce12ae8da665cbda0
--- /dev/null
+++ b/tensorflow/core/lib/strings/proto_serialization.cc
@@ -0,0 +1,33 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/lib/strings/proto_serialization.h"
+
+#include "tensorflow/core/platform/logging.h"
+
+namespace tensorflow {
+
+bool SerializeToStringDeterministic(const protobuf::MessageLite& msg,
+                                    string* result) {
+  DCHECK_LE(msg.ByteSizeLong(), static_cast<size_t>(INT_MAX));
+  const int size = static_cast<int>(msg.ByteSizeLong());
+  *result = string(size, '\0');
+  protobuf::io::ArrayOutputStream array_stream(&(*result)[0], size);
+  protobuf::io::CodedOutputStream output_stream(&array_stream);
+  output_stream.SetSerializationDeterministic(true);
+  msg.SerializeWithCachedSizes(&output_stream);
+  return !output_stream.HadError() && size == output_stream.ByteCount();
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/lib/strings/proto_serialization.h b/tensorflow/core/lib/strings/proto_serialization.h
new file mode 100644
index 0000000000000000000000000000000000000000..6664928e2818c747268ec1c361acce6bcf6c862e
--- /dev/null
+++ b/tensorflow/core/lib/strings/proto_serialization.h
@@ -0,0 +1,33 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_LIB_STRINGS_PROTO_SERIALIZATION_H_
+#define TENSORFLOW_CORE_LIB_STRINGS_PROTO_SERIALIZATION_H_
+
+#include "tensorflow/core/platform/protobuf.h"
+
+namespace tensorflow {
+
+// Wrapper around protocol buffer serialization that requests deterministic
+// serialization, in particular for Map fields, which serialize in a random
+// order by default. Returns true on success.
+// Serialization is guaranteed to be deterministic for a given binary only.
+// See the following for more details:
+// https://github.com/google/protobuf/blob/a1bb147e96b6f74db6cdf3c3fcb00492472dbbfa/src/google/protobuf/io/coded_stream.h#L834
+bool SerializeToStringDeterministic(const protobuf::MessageLite& msg,
+                                    string* result);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_LIB_STRINGS_PROTO_SERIALIZATION_H_
diff --git a/tensorflow/core/lib/strings/str_util.cc b/tensorflow/core/lib/strings/str_util.cc
index d28857803d7ef1edd66ae6c1a6b81a7ed1dbce85..2c9e98357a1136876da57b5453f60490f4f8bb53 100644
--- a/tensorflow/core/lib/strings/str_util.cc
+++ b/tensorflow/core/lib/strings/str_util.cc
@@ -16,9 +16,11 @@ limitations under the License.
 #include "tensorflow/core/lib/strings/str_util.h"
 
 #include <ctype.h>
+#include <algorithm>
 #include <vector>
 #include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
+#include "tensorflow/core/platform/logging.h"
 
 namespace tensorflow {
 namespace str_util {
@@ -373,7 +375,7 @@ size_t RemoveWhitespaceContext(StringPiece* text) {
 }
 
 bool ConsumePrefix(StringPiece* s, StringPiece expected) {
-  if (s->starts_with(expected)) {
+  if (StartsWith(*s, expected)) {
     s->remove_prefix(expected.size());
     return true;
   }
@@ -381,7 +383,7 @@ bool ConsumePrefix(StringPiece* s, StringPiece expected) {
 }
 
 bool ConsumeSuffix(StringPiece* s, StringPiece expected) {
-  if (s->ends_with(expected)) {
+  if (EndsWith(*s, expected)) {
     s->remove_suffix(expected.size());
     return true;
   }
@@ -452,5 +454,22 @@ bool SplitAndParseAsFloats(StringPiece text, char delim,
                                     result);
 }
 
+bool StrContains(StringPiece haystack, StringPiece needle) {
+  return std::search(haystack.begin(), haystack.end(), needle.begin(),
+                     needle.end()) != haystack.end();
+}
+
+bool StartsWith(StringPiece text, StringPiece prefix) {
+  return prefix.empty() ||
+         (text.size() >= prefix.size() &&
+          memcmp(text.data(), prefix.data(), prefix.size()) == 0);
+}
+
+bool EndsWith(StringPiece text, StringPiece suffix) {
+  return suffix.empty() || (text.size() >= suffix.size() &&
+                            memcmp(text.data() + (text.size() - suffix.size()),
+                                   suffix.data(), suffix.size()) == 0);
+}
+
 }  // namespace str_util
 }  // namespace tensorflow
diff --git a/tensorflow/core/lib/strings/str_util.h b/tensorflow/core/lib/strings/str_util.h
index 44c52850fa99f7688fb496784a18b651c147bb8b..065871c1b4b05afc39da6bee13bda93359ddb913 100644
--- a/tensorflow/core/lib/strings/str_util.h
+++ b/tensorflow/core/lib/strings/str_util.h
@@ -20,7 +20,6 @@ limitations under the License.
 #include <string>
 #include <vector>
 #include "tensorflow/core/lib/core/stringpiece.h"
-#include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/types.h"
 
@@ -141,6 +140,21 @@ bool SplitAndParseAsInts(StringPiece text, char delim,
 bool SplitAndParseAsFloats(StringPiece text, char delim,
                            std::vector<float>* result);
 
+// StartsWith()
+//
+// Returns whether a given string `text` begins with `prefix`.
+bool StartsWith(StringPiece text, StringPiece prefix);
+
+// EndsWith()
+//
+// Returns whether a given string `text` ends with `suffix`.
+bool EndsWith(StringPiece text, StringPiece suffix);
+
+// StrContains()
+//
+// Returns whether a given string `haystack` contains the substring `needle`.
+bool StrContains(StringPiece haystack, StringPiece needle);
+
 // ------------------------------------------------------------------
 // Implementation details below
 template <typename T>
diff --git a/tensorflow/core/lib/strings/str_util_test.cc b/tensorflow/core/lib/strings/str_util_test.cc
index 6d461241f7e9c5a29064c015991039d5bf95a80f..63643c3e8ed935ecea2a3430b938985ac7df85bb 100644
--- a/tensorflow/core/lib/strings/str_util_test.cc
+++ b/tensorflow/core/lib/strings/str_util_test.cc
@@ -430,4 +430,56 @@ TEST(StringReplace, EmptyStringReplaceAll) {
   EXPECT_EQ("", str_util::StringReplace("", "a", "X", /*replace_all=*/true));
 }
 
+TEST(StartsWith, Basic) {
+  const string s1(
+      "123"
+      "\0"
+      "456",
+      7);
+  const StringPiece a("foobar");
+  const StringPiece b(s1);
+  const StringPiece e;
+  EXPECT_TRUE(str_util::StartsWith(a, a));
+  EXPECT_TRUE(str_util::StartsWith(a, "foo"));
+  EXPECT_TRUE(str_util::StartsWith(a, e));
+  EXPECT_TRUE(str_util::StartsWith(b, s1));
+  EXPECT_TRUE(str_util::StartsWith(b, b));
+  EXPECT_TRUE(str_util::StartsWith(b, e));
+  EXPECT_TRUE(str_util::StartsWith(e, ""));
+  EXPECT_FALSE(str_util::StartsWith(a, b));
+  EXPECT_FALSE(str_util::StartsWith(b, a));
+  EXPECT_FALSE(str_util::StartsWith(e, a));
+}
+
+TEST(EndsWith, Basic) {
+  const string s1(
+      "123"
+      "\0"
+      "456",
+      7);
+  const StringPiece a("foobar");
+  const StringPiece b(s1);
+  const StringPiece e;
+  EXPECT_TRUE(str_util::EndsWith(a, a));
+  EXPECT_TRUE(str_util::EndsWith(a, "bar"));
+  EXPECT_TRUE(str_util::EndsWith(a, e));
+  EXPECT_TRUE(str_util::EndsWith(b, s1));
+  EXPECT_TRUE(str_util::EndsWith(b, b));
+  EXPECT_TRUE(str_util::EndsWith(b, e));
+  EXPECT_TRUE(str_util::EndsWith(e, ""));
+  EXPECT_FALSE(str_util::EndsWith(a, b));
+  EXPECT_FALSE(str_util::EndsWith(b, a));
+  EXPECT_FALSE(str_util::EndsWith(e, a));
+}
+
+TEST(StrContains, Basic) {
+  StringPiece a("abcdefg");
+  StringPiece b("abcd");
+  StringPiece c("efg");
+  StringPiece d("gh");
+  EXPECT_TRUE(str_util::StrContains(a, b));
+  EXPECT_TRUE(str_util::StrContains(a, c));
+  EXPECT_FALSE(str_util::StrContains(a, d));
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/lib/strings/strcat.cc b/tensorflow/core/lib/strings/strcat.cc
index 5b1cff486dba46ab761762b3076610e60d636711..f140ec3d260efdb7b82234706c5a33584c38cbb2 100644
--- a/tensorflow/core/lib/strings/strcat.cc
+++ b/tensorflow/core/lib/strings/strcat.cc
@@ -20,16 +20,12 @@ limitations under the License.
 #include <stdio.h>
 #include <string.h>
 
-#include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/core/lib/gtl/stl_util.h"
 #include "tensorflow/core/platform/logging.h"
 
 namespace tensorflow {
 namespace strings {
 
-AlphaNum::AlphaNum(const Eigen::half &f)
-    : piece_(digits_, strlen(FloatToBuffer(static_cast<float>(f), digits_))) {}
-
 AlphaNum::AlphaNum(Hex hex) {
   char *const end = &digits_[kFastToBufferSize];
   char *writer = end;
diff --git a/tensorflow/core/lib/strings/strcat.h b/tensorflow/core/lib/strings/strcat.h
index 2bc14945cd0413751003c03c7f5255c300790321..fb2cd5bc7e5fb69650dfc2758b132d73e88375a9 100644
--- a/tensorflow/core/lib/strings/strcat.h
+++ b/tensorflow/core/lib/strings/strcat.h
@@ -27,10 +27,6 @@ limitations under the License.
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/types.h"
 
-namespace Eigen {
-struct half;
-}
-
 // The AlphaNum type was designed to be used as the parameter type for StrCat().
 // Any routine accepting either a string or a number may accept it.
 // The basic idea is that by accepting a "const AlphaNum &" as an argument
@@ -105,27 +101,23 @@ class AlphaNum {
   // A bool ctor would also convert incoming pointers (bletch).
 
   AlphaNum(int i32)  // NOLINT(runtime/explicit)
-      : piece_(digits_, FastInt32ToBufferLeft(i32, digits_) - &digits_[0]) {}
+      : piece_(digits_, FastInt32ToBufferLeft(i32, digits_)) {}
   AlphaNum(unsigned int u32)  // NOLINT(runtime/explicit)
-      : piece_(digits_, FastUInt32ToBufferLeft(u32, digits_) - &digits_[0]) {}
+      : piece_(digits_, FastUInt32ToBufferLeft(u32, digits_)) {}
   AlphaNum(long x)  // NOLINT(runtime/explicit)
-      : piece_(digits_, FastInt64ToBufferLeft(x, digits_) - &digits_[0]) {}
+      : piece_(digits_, FastInt64ToBufferLeft(x, digits_)) {}
   AlphaNum(unsigned long x)  // NOLINT(runtime/explicit)
-      : piece_(digits_, FastUInt64ToBufferLeft(x, digits_) - &digits_[0]) {}
+      : piece_(digits_, FastUInt64ToBufferLeft(x, digits_)) {}
   AlphaNum(long long int i64)  // NOLINT(runtime/explicit)
-      : piece_(digits_, FastInt64ToBufferLeft(i64, digits_) - &digits_[0]) {}
+      : piece_(digits_, FastInt64ToBufferLeft(i64, digits_)) {}
   AlphaNum(unsigned long long int u64)  // NOLINT(runtime/explicit)
-      : piece_(digits_, FastUInt64ToBufferLeft(u64, digits_) - &digits_[0]) {}
+      : piece_(digits_, FastUInt64ToBufferLeft(u64, digits_)) {}
 
   AlphaNum(float f)  // NOLINT(runtime/explicit)
-      : piece_(digits_, strlen(FloatToBuffer(f, digits_))) {}
-  AlphaNum(bfloat16 f)  // NOLINT(runtime/explicit)
-      : piece_(digits_, strlen(FloatToBuffer(static_cast<float>(f), digits_))) {
-  }
+      : piece_(digits_, FloatToBuffer(f, digits_)) {}
   AlphaNum(double f)  // NOLINT(runtime/explicit)
-      : piece_(digits_, strlen(DoubleToBuffer(f, digits_))) {}
+      : piece_(digits_, DoubleToBuffer(f, digits_)) {}
 
-  AlphaNum(const Eigen::half &f);  // NOLINT(runtime/explicit)
   AlphaNum(Hex hex);               // NOLINT(runtime/explicit)
 
   AlphaNum(const char *c_str) : piece_(c_str) {}   // NOLINT(runtime/explicit)
diff --git a/tensorflow/core/lib/strings/strcat_test.cc b/tensorflow/core/lib/strings/strcat_test.cc
index 7cb186e6375fae4d8a7140dd2f9ee6e7e64ddd1a..8cc64a6f0aecfd3dcce772b9a6c5c30ced86ba12 100644
--- a/tensorflow/core/lib/strings/strcat_test.cc
+++ b/tensorflow/core/lib/strings/strcat_test.cc
@@ -17,7 +17,6 @@ limitations under the License.
 
 #include <string>
 
-#include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/core/lib/strings/stringprintf.h"
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/platform/types.h"
@@ -131,11 +130,6 @@ TEST(StrCat, Basics) {
   result = tensorflow::strings::StrCat("A hundred K and a half squared is ", d);
   EXPECT_EQ(result, "A hundred K and a half squared is 10000100000.25");
 
-  Eigen::half h(10007.0f);
-  result =
-      tensorflow::strings::StrCat("Ten thousand seven is approximately ", h);
-  EXPECT_EQ(result, "Ten thousand seven is approximately 10008");
-
   result = tensorflow::strings::StrCat(1, 2, 333, 4444, 55555, 666666, 7777777,
                                        88888888, 999999999);
   EXPECT_EQ(result, "12333444455555666666777777788888888999999999");
diff --git a/tensorflow/core/lib/wav/wav_io.cc b/tensorflow/core/lib/wav/wav_io.cc
index 77d3c88998e655ec429469a8e4bc5535b95bf1b9..51b9c6cd82c4769cfd333e91177bc9b7ba5e38de 100644
--- a/tensorflow/core/lib/wav/wav_io.cc
+++ b/tensorflow/core/lib/wav/wav_io.cc
@@ -81,13 +81,38 @@ inline float Int16SampleToFloat(int16 data) {
   return data * kMultiplier;
 }
 
+}  // namespace
+
+// Handles moving the data index forward, validating the arguments, and avoiding
+// overflow or underflow.
+Status IncrementOffset(int old_offset, size_t increment, size_t max_size,
+                       int* new_offset) {
+  if (old_offset < 0) {
+    return errors::InvalidArgument("Negative offsets are not allowed: ",
+                                   old_offset);
+  }
+  if (old_offset > max_size) {
+    return errors::InvalidArgument("Initial offset is outside data range: ",
+                                   old_offset);
+  }
+  *new_offset = old_offset + increment;
+  if (*new_offset > max_size) {
+    return errors::InvalidArgument("Data too short when trying to read string");
+  }
+  // The input offset was checked above to be non-negative, so a negative value
+  // here means that the addition overflowed.
+  if (*new_offset < 0) {
+    return errors::InvalidArgument("Offset too large, overflowed: ",
+                                   *new_offset);
+  }
+  return Status::OK();
+}
+
 Status ExpectText(const string& data, const string& expected_text,
                   int* offset) {
-  const int new_offset = *offset + expected_text.size();
-  if (new_offset > data.size()) {
-    return errors::InvalidArgument("Data too short when trying to read ",
-                                   expected_text);
-  }
+  int new_offset;
+  TF_RETURN_IF_ERROR(
+      IncrementOffset(*offset, expected_text.size(), data.size(), &new_offset));
   const string found_text(data.begin() + *offset, data.begin() + new_offset);
   if (found_text != expected_text) {
     return errors::InvalidArgument("Header mismatch: Expected ", expected_text,
@@ -97,40 +122,16 @@ Status ExpectText(const string& data, const string& expected_text,
   return Status::OK();
 }
 
-template <class T>
-Status ReadValue(const string& data, T* value, int* offset) {
-  const int new_offset = *offset + sizeof(T);
-  if (new_offset > data.size()) {
-    return errors::InvalidArgument("Data too short when trying to read value");
-  }
-  if (port::kLittleEndian) {
-    memcpy(value, data.data() + *offset, sizeof(T));
-  } else {
-    *value = 0;
-    const uint8* data_buf =
-        reinterpret_cast<const uint8*>(data.data() + *offset);
-    int shift = 0;
-    for (int i = 0; i < sizeof(T); ++i, shift += 8) {
-      *value = *value | (data_buf[i] << shift);
-    }
-  }
-  *offset = new_offset;
-  return Status::OK();
-}
-
 Status ReadString(const string& data, int expected_length, string* value,
                   int* offset) {
-  const int new_offset = *offset + expected_length;
-  if (new_offset > data.size()) {
-    return errors::InvalidArgument("Data too short when trying to read string");
-  }
+  int new_offset;
+  TF_RETURN_IF_ERROR(
+      IncrementOffset(*offset, expected_length, data.size(), &new_offset));
   *value = string(data.begin() + *offset, data.begin() + new_offset);
   *offset = new_offset;
   return Status::OK();
 }
 
-}  // namespace
-
 Status EncodeAudioAsS16LEWav(const float* audio, size_t sample_rate,
                              size_t num_channels, size_t num_frames,
                              string* wav_string) {
@@ -272,6 +273,11 @@ Status DecodeLin16WaveAsFloatVector(const string& wav_string,
     TF_RETURN_IF_ERROR(ReadString(wav_string, 4, &chunk_id, &offset));
     uint32 chunk_size;
     TF_RETURN_IF_ERROR(ReadValue<uint32>(wav_string, &chunk_size, &offset));
+    if (chunk_size > std::numeric_limits<int32>::max()) {
+      return errors::InvalidArgument(
+          "WAV data chunk '", chunk_id, "' is too large: ", chunk_size,
+          " bytes, but the limit is ", std::numeric_limits<int32>::max());
+    }
     if (chunk_id == kDataChunkId) {
       if (was_data_found) {
         return errors::InvalidArgument("More than one data chunk found in WAV");
diff --git a/tensorflow/core/lib/wav/wav_io.h b/tensorflow/core/lib/wav/wav_io.h
index adca0ee3034a1a850707be1b64917931be321f97..f004524177eef56a876f7aa7cfe6bf80559f4335 100644
--- a/tensorflow/core/lib/wav/wav_io.h
+++ b/tensorflow/core/lib/wav/wav_io.h
@@ -21,6 +21,9 @@ limitations under the License.
 #include <string>
 #include <vector>
 
+#include "tensorflow/core/lib/core/casts.h"
+#include "tensorflow/core/lib/core/coding.h"
+#include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/types.h"
 
@@ -55,6 +58,36 @@ Status DecodeLin16WaveAsFloatVector(const string& wav_string,
                                     uint32* sample_count, uint16* channel_count,
                                     uint32* sample_rate);
 
+// Everything below here is only exposed publicly for testing purposes.
+
+// Handles moving the data index forward, validating the arguments, and avoiding
+// overflow or underflow.
+Status IncrementOffset(int old_offset, size_t increment, size_t max_size,
+                       int* new_offset);
+
+// Reads a typed numeric value from a stream of data. This function is only
+// exposed in the header for testing purposes: as a template, it must be
+// defined here so that tests can instantiate it.
+template <class T>
+Status ReadValue(const string& data, T* value, int* offset) {
+  int new_offset;
+  TF_RETURN_IF_ERROR(
+      IncrementOffset(*offset, sizeof(T), data.size(), &new_offset));
+  if (port::kLittleEndian) {
+    memcpy(value, data.data() + *offset, sizeof(T));
+  } else {
+    *value = 0;
+    const uint8* data_buf =
+        reinterpret_cast<const uint8*>(data.data() + *offset);
+    int shift = 0;
+    for (int i = 0; i < sizeof(T); ++i, shift += 8) {
+      *value = *value | (data_buf[i] << shift);
+    }
+  }
+  *offset = new_offset;
+  return Status::OK();
+}
+
 }  // namespace wav
 }  // namespace tensorflow
 
diff --git a/tensorflow/core/lib/wav/wav_io_test.cc b/tensorflow/core/lib/wav/wav_io_test.cc
index 40ddd94abef649285528b6ee177620eedb07df20..d8a83fc464b33d274aa4f8174132980275fd8598 100644
--- a/tensorflow/core/lib/wav/wav_io_test.cc
+++ b/tensorflow/core/lib/wav/wav_io_test.cc
@@ -25,6 +25,12 @@ limitations under the License.
 namespace tensorflow {
 namespace wav {
 
+// These are defined in wav_io.cc, and the signatures are here so we don't have
+// to expose them in the public header.
+Status ExpectText(const string& data, const string& expected_text, int* offset);
+Status ReadString(const string& data, int expected_length, string* value,
+                  int* offset);
+
 TEST(WavIO, BadArguments) {
   float audio[] = {0.0f, 0.1f, 0.2f, 0.3f, 0.4f, 0.5f};
   string result;
@@ -155,5 +161,318 @@ TEST(WavIO, BasicStereo) {
   EXPECT_EQ(expected, result);
 }
 
+// Test how chunk sizes larger than 2GB are handled. They are stored as
+// unsigned int32s, so there are lots of ways for conversions to confuse the
+// decoding logic. The expected behavior is to fail with an error, since such
+// large WAV files are not common and are unsupported by many readers.
+// See b/72655902.
+TEST(WavIO, ChunkSizeOverflow) {
+  std::vector<uint8> wav_data = {
+      'R', 'I', 'F', 'F',      // ChunkID
+      60, 0, 0, 0,             // ChunkSize: 36 + SubChunk2Size
+      'W', 'A', 'V', 'E',      // Format
+      'f', 'm', 't', ' ',      // Subchunk1ID
+      16, 0, 0, 0,             // Subchunk1Size
+      1, 0,                    // AudioFormat: 1=PCM
+      1, 0,                    // NumChannels
+      0x44, 0xac, 0, 0,        // SampleRate: 44100
+      0x88, 0x58, 0x1, 0,      // BytesPerSecond: SampleRate * NumChannels *
+                               //                 BitsPerSample/8
+      2, 0,                    // BytesPerSample: NumChannels * BitsPerSample/8
+      16, 0,                   // BitsPerSample
+      'd', 'a', 't', 'a',      // Subchunk2ID
+      8, 0, 0, 0,              // Subchunk2Size: NumSamples * NumChannels *
+                               //                BitsPerSample/8
+      0, 0,                    // Sample 1: 0
+      0xff, 0x7f,              // Sample 2: 32767 (saturated)
+      0, 0,                    // Sample 3: 0
+      0x00, 0x80,              // Sample 4: -32768 (saturated)
+      'f', 'o', 'o', 'o',      // Extra chunk ID (not 'data')
+      0xff, 0xff, 0xff, 0xf8,  // Chunk size that could cause an infinite loop.
+      0, 0,                    // Sample 1: 0
+      0xff, 0x7f,              // Sample 2: 32767 (saturated)
+      0, 0,                    // Sample 3: 0
+      0x00, 0x80,              // Sample 4: -32768 (saturated)
+  };
+  string wav_data_string(wav_data.begin(), wav_data.end());
+  std::vector<float> decoded_audio;
+  uint32 decoded_sample_count;
+  uint16 decoded_channel_count;
+  uint32 decoded_sample_rate;
+  Status decode_status = DecodeLin16WaveAsFloatVector(
+      wav_data_string, &decoded_audio, &decoded_sample_count,
+      &decoded_channel_count, &decoded_sample_rate);
+  EXPECT_FALSE(decode_status.ok());
+  EXPECT_TRUE(StringPiece(decode_status.error_message()).contains("too large"))
+      << decode_status.error_message();
+}
+
+TEST(WavIO, IncrementOffset) {
+  int new_offset = -1;
+  TF_EXPECT_OK(IncrementOffset(0, 10, 20, &new_offset));
+  EXPECT_EQ(10, new_offset);
+
+  new_offset = -1;
+  TF_EXPECT_OK(IncrementOffset(10, 4, 20, &new_offset));
+  EXPECT_EQ(14, new_offset);
+
+  new_offset = -1;
+  TF_EXPECT_OK(IncrementOffset(99, 1, 100, &new_offset));
+  EXPECT_EQ(100, new_offset);
+
+  new_offset = -1;
+  EXPECT_FALSE(IncrementOffset(-1, 1, 100, &new_offset).ok());
+
+  new_offset = -1;
+  EXPECT_FALSE(IncrementOffset(0, -1, 100, &new_offset).ok());
+
+  new_offset = -1;
+  EXPECT_FALSE(IncrementOffset(std::numeric_limits<int>::max(), 1,
+                               std::numeric_limits<int>::max(), &new_offset)
+                   .ok());
+
+  new_offset = -1;
+  EXPECT_FALSE(IncrementOffset(101, 1, 100, &new_offset).ok());
+}
+
+TEST(WavIO, ExpectText) {
+  std::vector<uint8> test_data = {
+      'E', 'x', 'p', 'e', 'c', 't', 'e', 'd',
+  };
+  string test_string(test_data.begin(), test_data.end());
+
+  int offset = 0;
+  TF_EXPECT_OK(ExpectText(test_string, "Expected", &offset));
+  EXPECT_EQ(8, offset);
+
+  offset = 0;
+  Status expect_status = ExpectText(test_string, "Unexpected", &offset);
+  EXPECT_FALSE(expect_status.ok());
+
+  offset = 0;
+  TF_EXPECT_OK(ExpectText(test_string, "Exp", &offset));
+  EXPECT_EQ(3, offset);
+  TF_EXPECT_OK(ExpectText(test_string, "ected", &offset));
+  EXPECT_EQ(8, offset);
+  expect_status = ExpectText(test_string, "foo", &offset);
+  EXPECT_FALSE(expect_status.ok());
+}
+
+TEST(WavIO, ReadString) {
+  std::vector<uint8> test_data = {
+      'E', 'x', 'p', 'e', 'c', 't', 'e', 'd',
+  };
+  string test_string(test_data.begin(), test_data.end());
+
+  int offset = 0;
+  string read_value;
+  TF_EXPECT_OK(ReadString(test_string, 2, &read_value, &offset));
+  EXPECT_EQ("Ex", read_value);
+  EXPECT_EQ(2, offset);
+
+  TF_EXPECT_OK(ReadString(test_string, 6, &read_value, &offset));
+  EXPECT_EQ("pected", read_value);
+  EXPECT_EQ(8, offset);
+
+  Status read_status = ReadString(test_string, 3, &read_value, &offset);
+  EXPECT_FALSE(read_status.ok());
+}
+
+TEST(WavIO, ReadValueInt8) {
+  std::vector<uint8> test_data = {0x00, 0x05, 0xff, 0x80};
+  string test_string(test_data.begin(), test_data.end());
+
+  int offset = 0;
+  int8 read_value;
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(0, read_value);
+  EXPECT_EQ(1, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(5, read_value);
+  EXPECT_EQ(2, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(-1, read_value);
+  EXPECT_EQ(3, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(-128, read_value);
+  EXPECT_EQ(4, offset);
+
+  Status read_status = ReadValue(test_string, &read_value, &offset);
+  EXPECT_FALSE(read_status.ok());
+}
+
+TEST(WavIO, ReadValueUInt8) {
+  std::vector<uint8> test_data = {0x00, 0x05, 0xff, 0x80};
+  string test_string(test_data.begin(), test_data.end());
+
+  int offset = 0;
+  uint8 read_value;
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(0, read_value);
+  EXPECT_EQ(1, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(5, read_value);
+  EXPECT_EQ(2, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(255, read_value);
+  EXPECT_EQ(3, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(128, read_value);
+  EXPECT_EQ(4, offset);
+
+  Status read_status = ReadValue(test_string, &read_value, &offset);
+  EXPECT_FALSE(read_status.ok());
+}
+
+TEST(WavIO, ReadValueInt16) {
+  std::vector<uint8> test_data = {
+      0x00, 0x00,  // 0
+      0xff, 0x00,  // 255
+      0x00, 0x01,  // 256
+      0xff, 0xff,  // -1
+      0x00, 0x80,  // -32768
+  };
+  string test_string(test_data.begin(), test_data.end());
+
+  int offset = 0;
+  int16 read_value;
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(0, read_value);
+  EXPECT_EQ(2, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(255, read_value);
+  EXPECT_EQ(4, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(256, read_value);
+  EXPECT_EQ(6, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(-1, read_value);
+  EXPECT_EQ(8, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(-32768, read_value);
+  EXPECT_EQ(10, offset);
+
+  Status read_status = ReadValue(test_string, &read_value, &offset);
+  EXPECT_FALSE(read_status.ok());
+}
+
+TEST(WavIO, ReadValueUInt16) {
+  std::vector<uint8> test_data = {
+      0x00, 0x00,  // 0
+      0xff, 0x00,  // 255
+      0x00, 0x01,  // 256
+      0xff, 0xff,  // 65535
+      0x00, 0x80,  // 32768
+  };
+  string test_string(test_data.begin(), test_data.end());
+
+  int offset = 0;
+  uint16 read_value;
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(0, read_value);
+  EXPECT_EQ(2, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(255, read_value);
+  EXPECT_EQ(4, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(256, read_value);
+  EXPECT_EQ(6, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(65535, read_value);
+  EXPECT_EQ(8, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(32768, read_value);
+  EXPECT_EQ(10, offset);
+
+  Status read_status = ReadValue(test_string, &read_value, &offset);
+  EXPECT_FALSE(read_status.ok());
+}
+
+TEST(WavIO, ReadValueInt32) {
+  std::vector<uint8> test_data = {
+      0x00, 0x00, 0x00, 0x00,  // 0
+      0xff, 0x00, 0x00, 0x00,  // 255
+      0x00, 0xff, 0x00, 0x00,  // 65280
+      0x00, 0x00, 0xff, 0x00,  // 16,711,680
+      0xff, 0xff, 0xff, 0xff,  // -1
+  };
+  string test_string(test_data.begin(), test_data.end());
+
+  int offset = 0;
+  int32 read_value;
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(0, read_value);
+  EXPECT_EQ(4, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(255, read_value);
+  EXPECT_EQ(8, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(65280, read_value);
+  EXPECT_EQ(12, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(16711680, read_value);
+  EXPECT_EQ(16, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(-1, read_value);
+  EXPECT_EQ(20, offset);
+
+  Status read_status = ReadValue(test_string, &read_value, &offset);
+  EXPECT_FALSE(read_status.ok());
+}
+
+TEST(WavIO, ReadValueUInt32) {
+  std::vector<uint8> test_data = {
+      0x00, 0x00, 0x00, 0x00,  // 0
+      0xff, 0x00, 0x00, 0x00,  // 255
+      0x00, 0xff, 0x00, 0x00,  // 65280
+      0x00, 0x00, 0xff, 0x00,  // 16,711,680
+      0xff, 0xff, 0xff, 0xff,  // 4,294,967,295
+  };
+  string test_string(test_data.begin(), test_data.end());
+
+  int offset = 0;
+  uint32 read_value;
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(0, read_value);
+  EXPECT_EQ(4, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(255, read_value);
+  EXPECT_EQ(8, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(65280, read_value);
+  EXPECT_EQ(12, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(16711680, read_value);
+  EXPECT_EQ(16, offset);
+
+  TF_EXPECT_OK(ReadValue(test_string, &read_value, &offset));
+  EXPECT_EQ(4294967295, read_value);
+  EXPECT_EQ(20, offset);
+
+  Status read_status = ReadValue(test_string, &read_value, &offset);
+  EXPECT_FALSE(read_status.ok());
+}
+
 }  // namespace wav
 }  // namespace tensorflow
diff --git a/tensorflow/core/ops/array_ops.cc b/tensorflow/core/ops/array_ops.cc
index 2fab62ea5cae6280554d2106f8f77d46017180e7..39b92464cb8f626d5581e7be1347d8c735f8277e 100644
--- a/tensorflow/core/ops/array_ops.cc
+++ b/tensorflow/core/ops/array_ops.cc
@@ -1168,7 +1168,9 @@ REGISTER_OP("Unique")
     .SetShapeFn([](InferenceContext* c) {
       c->set_output(0, c->Vector(InferenceContext::kUnknownDim));
       c->set_output(1, c->input(0));
-      return Status::OK();
+      // Assert that the input rank is 1.
+      ShapeHandle dummy;
+      return c->WithRank(c->input(0), 1, &dummy);
     });
 
 REGISTER_OP("UniqueV2")
@@ -1564,6 +1566,9 @@ REGISTER_OP("Tile")
       TF_RETURN_IF_ERROR(c->MakeShapeFromShapeTensor(1, &multiples));
       if (c->RankKnown(input)) {
         TF_RETURN_IF_ERROR(c->WithRank(multiples, c->Rank(input), &multiples));
+        ShapeHandle dummy;
+        TF_RETURN_IF_ERROR(
+            c->Merge(c->input(1), c->Vector(c->Rank(input)), &dummy));
       }
 
       if (!c->RankKnown(multiples)) {
diff --git a/tensorflow/core/ops/array_ops_test.cc b/tensorflow/core/ops/array_ops_test.cc
index 86d64635f4c1bc1c34407a517267758ce5cf60fc..cf5bb5ad849571c92f7dccf4d0fdc5780965567c 100644
--- a/tensorflow/core/ops/array_ops_test.cc
+++ b/tensorflow/core/ops/array_ops_test.cc
@@ -368,7 +368,11 @@ TEST(ArrayOpsTest, ShapeN_ShapeFn) {
 TEST(ArrayOpsTest, Unique_ShapeFn) {
   ShapeInferenceTestOp op("Unique");
   INFER_OK(op, "?", "[?];in0");
-  INFER_OK(op, "[1,2,3,?,5]", "[?];in0");
+  INFER_OK(op, "[5]", "[?];in0");
+  INFER_ERROR(
+      "Shape must be rank 1 but is rank 5 for '' (op: '') with input shapes: "
+      "[1,2,3,?,5].",
+      op, "[1,2,3,?,5]");
 }
 
 TEST(ArrayOpsTest, UniqueWithCounts_ShapeFn) {
diff --git a/tensorflow/core/ops/compat/ops_history.v1.pbtxt b/tensorflow/core/ops/compat/ops_history.v1.pbtxt
index dddde1624a4f4258beae212014302f2599879d75..b41826d6eb9c7b51c5e1ec37845af715e58780b6 100644
--- a/tensorflow/core/ops/compat/ops_history.v1.pbtxt
+++ b/tensorflow/core/ops/compat/ops_history.v1.pbtxt
@@ -10867,6 +10867,14 @@ op {
     }
   }
 }
+op {
+  name: "CloseSummaryWriter"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  is_stateful: true
+}
 op {
   name: "CompareAndBitpack"
   input_arg {
@@ -11603,17 +11611,13 @@ op {
   }
 }
 op {
-  name: "Conv2DBackpropFilter"
+  name: "Conv2D"
   input_arg {
     name: "input"
     type_attr: "T"
   }
   input_arg {
-    name: "filter_sizes"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "out_backprop"
+    name: "filter"
     type_attr: "T"
   }
   output_arg {
@@ -11626,7 +11630,9 @@ op {
     allowed_values {
       list {
         type: DT_HALF
+        type: DT_BFLOAT16
         type: DT_FLOAT
+        type: DT_DOUBLE
       }
     }
   }
@@ -11664,6 +11670,18 @@ op {
       }
     }
   }
+  attr {
+    name: "dilations"
+    type: "list(int)"
+    default_value {
+      list {
+        i: 1
+        i: 1
+        i: 1
+        i: 1
+      }
+    }
+  }
 }
 op {
   name: "Conv2DBackpropFilter"
@@ -11689,7 +11707,6 @@ op {
     allowed_values {
       list {
         type: DT_HALF
-        type: DT_BFLOAT16
         type: DT_FLOAT
       }
     }
@@ -11728,28 +11745,16 @@ op {
       }
     }
   }
-  attr {
-    name: "dilations"
-    type: "list(int)"
-    default_value {
-      list {
-        i: 1
-        i: 1
-        i: 1
-        i: 1
-      }
-    }
-  }
 }
 op {
-  name: "Conv2DBackpropInput"
+  name: "Conv2DBackpropFilter"
   input_arg {
-    name: "input_sizes"
-    type: DT_INT32
+    name: "input"
+    type_attr: "T"
   }
   input_arg {
-    name: "filter"
-    type_attr: "T"
+    name: "filter_sizes"
+    type: DT_INT32
   }
   input_arg {
     name: "out_backprop"
@@ -11765,6 +11770,7 @@ op {
     allowed_values {
       list {
         type: DT_HALF
+        type: DT_BFLOAT16
         type: DT_FLOAT
       }
     }
@@ -11803,16 +11809,28 @@ op {
       }
     }
   }
+  attr {
+    name: "dilations"
+    type: "list(int)"
+    default_value {
+      list {
+        i: 1
+        i: 1
+        i: 1
+        i: 1
+      }
+    }
+  }
 }
 op {
-  name: "Conv2DBackpropInput"
+  name: "Conv2DBackpropFilter"
   input_arg {
-    name: "input_sizes"
-    type: DT_INT32
+    name: "input"
+    type_attr: "T"
   }
   input_arg {
-    name: "filter"
-    type_attr: "T"
+    name: "filter_sizes"
+    type: DT_INT32
   }
   input_arg {
     name: "out_backprop"
@@ -11830,6 +11848,7 @@ op {
         type: DT_HALF
         type: DT_BFLOAT16
         type: DT_FLOAT
+        type: DT_DOUBLE
       }
     }
   }
@@ -11881,54 +11900,17 @@ op {
   }
 }
 op {
-  name: "Conv3D"
+  name: "Conv2DBackpropInput"
   input_arg {
-    name: "input"
-    type_attr: "T"
+    name: "input_sizes"
+    type: DT_INT32
   }
   input_arg {
     name: "filter"
     type_attr: "T"
   }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_FLOAT
-        type: DT_DOUBLE
-      }
-    }
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-    has_minimum: true
-    minimum: 5
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-}
-op {
-  name: "Conv3D"
-  input_arg {
-    name: "input"
-    type_attr: "T"
-  }
   input_arg {
-    name: "filter"
+    name: "out_backprop"
     type_attr: "T"
   }
   output_arg {
@@ -11940,16 +11922,21 @@ op {
     type: "type"
     allowed_values {
       list {
+        type: DT_HALF
         type: DT_FLOAT
-        type: DT_DOUBLE
       }
     }
   }
   attr {
     name: "strides"
     type: "list(int)"
-    has_minimum: true
-    minimum: 5
+  }
+  attr {
+    name: "use_cudnn_on_gpu"
+    type: "bool"
+    default_value {
+      b: true
+    }
   }
   attr {
     name: "padding"
@@ -11965,26 +11952,30 @@ op {
     name: "data_format"
     type: "string"
     default_value {
-      s: "NDHWC"
+      s: "NHWC"
     }
     allowed_values {
       list {
-        s: "NDHWC"
-        s: "NCDHW"
+        s: "NHWC"
+        s: "NCHW"
       }
     }
   }
 }
 op {
-  name: "Conv3D"
+  name: "Conv2DBackpropInput"
   input_arg {
-    name: "input"
-    type_attr: "T"
+    name: "input_sizes"
+    type: DT_INT32
   }
   input_arg {
     name: "filter"
     type_attr: "T"
   }
+  input_arg {
+    name: "out_backprop"
+    type_attr: "T"
+  }
   output_arg {
     name: "output"
     type_attr: "T"
@@ -11997,15 +11988,19 @@ op {
         type: DT_HALF
         type: DT_BFLOAT16
         type: DT_FLOAT
-        type: DT_DOUBLE
       }
     }
   }
   attr {
     name: "strides"
     type: "list(int)"
-    has_minimum: true
-    minimum: 5
+  }
+  attr {
+    name: "use_cudnn_on_gpu"
+    type: "bool"
+    default_value {
+      b: true
+    }
   }
   attr {
     name: "padding"
@@ -12021,12 +12016,12 @@ op {
     name: "data_format"
     type: "string"
     default_value {
-      s: "NDHWC"
+      s: "NHWC"
     }
     allowed_values {
       list {
-        s: "NDHWC"
-        s: "NCDHW"
+        s: "NHWC"
+        s: "NCHW"
       }
     }
   }
@@ -12039,16 +12034,15 @@ op {
         i: 1
         i: 1
         i: 1
-        i: 1
       }
     }
   }
 }
 op {
-  name: "Conv3DBackpropFilter"
+  name: "Conv2DBackpropInput"
   input_arg {
-    name: "input"
-    type_attr: "T"
+    name: "input_sizes"
+    type: DT_INT32
   }
   input_arg {
     name: "filter"
@@ -12067,6 +12061,8 @@ op {
     type: "type"
     allowed_values {
       list {
+        type: DT_HALF
+        type: DT_BFLOAT16
         type: DT_FLOAT
         type: DT_DOUBLE
       }
@@ -12075,8 +12071,13 @@ op {
   attr {
     name: "strides"
     type: "list(int)"
-    has_minimum: true
-    minimum: 5
+  }
+  attr {
+    name: "use_cudnn_on_gpu"
+    type: "bool"
+    default_value {
+      b: true
+    }
   }
   attr {
     name: "padding"
@@ -12088,71 +12089,40 @@ op {
       }
     }
   }
-  deprecation {
-    version: 10
-  }
-}
-op {
-  name: "Conv3DBackpropFilter"
-  input_arg {
-    name: "input"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "filter"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "out_backprop"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
   attr {
-    name: "T"
-    type: "type"
+    name: "data_format"
+    type: "string"
+    default_value {
+      s: "NHWC"
+    }
     allowed_values {
       list {
-        type: DT_HALF
-        type: DT_FLOAT
-        type: DT_DOUBLE
+        s: "NHWC"
+        s: "NCHW"
       }
     }
   }
   attr {
-    name: "strides"
+    name: "dilations"
     type: "list(int)"
-    has_minimum: true
-    minimum: 5
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
+    default_value {
       list {
-        s: "SAME"
-        s: "VALID"
+        i: 1
+        i: 1
+        i: 1
+        i: 1
       }
     }
   }
-  deprecation {
-    version: 10
-  }
 }
 op {
-  name: "Conv3DBackpropFilterV2"
+  name: "Conv3D"
   input_arg {
     name: "input"
     type_attr: "T"
   }
   input_arg {
-    name: "filter_sizes"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "out_backprop"
+    name: "filter"
     type_attr: "T"
   }
   output_arg {
@@ -12187,17 +12157,13 @@ op {
   }
 }
 op {
-  name: "Conv3DBackpropFilterV2"
+  name: "Conv3D"
   input_arg {
     name: "input"
     type_attr: "T"
   }
   input_arg {
-    name: "filter_sizes"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "out_backprop"
+    name: "filter"
     type_attr: "T"
   }
   output_arg {
@@ -12245,17 +12211,286 @@ op {
   }
 }
 op {
-  name: "Conv3DBackpropFilterV2"
+  name: "Conv3D"
   input_arg {
     name: "input"
     type_attr: "T"
   }
   input_arg {
-    name: "filter_sizes"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "out_backprop"
+    name: "filter"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_BFLOAT16
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "strides"
+    type: "list(int)"
+    has_minimum: true
+    minimum: 5
+  }
+  attr {
+    name: "padding"
+    type: "string"
+    allowed_values {
+      list {
+        s: "SAME"
+        s: "VALID"
+      }
+    }
+  }
+  attr {
+    name: "data_format"
+    type: "string"
+    default_value {
+      s: "NDHWC"
+    }
+    allowed_values {
+      list {
+        s: "NDHWC"
+        s: "NCDHW"
+      }
+    }
+  }
+  attr {
+    name: "dilations"
+    type: "list(int)"
+    default_value {
+      list {
+        i: 1
+        i: 1
+        i: 1
+        i: 1
+        i: 1
+      }
+    }
+  }
+}
+op {
+  name: "Conv3DBackpropFilter"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "filter"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "out_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "strides"
+    type: "list(int)"
+    has_minimum: true
+    minimum: 5
+  }
+  attr {
+    name: "padding"
+    type: "string"
+    allowed_values {
+      list {
+        s: "SAME"
+        s: "VALID"
+      }
+    }
+  }
+  deprecation {
+    version: 10
+  }
+}
+op {
+  name: "Conv3DBackpropFilter"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "filter"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "out_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "strides"
+    type: "list(int)"
+    has_minimum: true
+    minimum: 5
+  }
+  attr {
+    name: "padding"
+    type: "string"
+    allowed_values {
+      list {
+        s: "SAME"
+        s: "VALID"
+      }
+    }
+  }
+  deprecation {
+    version: 10
+  }
+}
+op {
+  name: "Conv3DBackpropFilterV2"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "filter_sizes"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "out_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "strides"
+    type: "list(int)"
+    has_minimum: true
+    minimum: 5
+  }
+  attr {
+    name: "padding"
+    type: "string"
+    allowed_values {
+      list {
+        s: "SAME"
+        s: "VALID"
+      }
+    }
+  }
+}
+op {
+  name: "Conv3DBackpropFilterV2"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "filter_sizes"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "out_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "strides"
+    type: "list(int)"
+    has_minimum: true
+    minimum: 5
+  }
+  attr {
+    name: "padding"
+    type: "string"
+    allowed_values {
+      list {
+        s: "SAME"
+        s: "VALID"
+      }
+    }
+  }
+  attr {
+    name: "data_format"
+    type: "string"
+    default_value {
+      s: "NDHWC"
+    }
+    allowed_values {
+      list {
+        s: "NDHWC"
+        s: "NCDHW"
+      }
+    }
+  }
+}
+op {
+  name: "Conv3DBackpropFilterV2"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "filter_sizes"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "out_backprop"
     type_attr: "T"
   }
   output_arg {
@@ -12822,6 +13057,54 @@ op {
     }
   }
 }
+op {
+  name: "CreateSummaryDbWriter"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "db_uri"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "experiment_name"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "run_name"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "user_name"
+    type: DT_STRING
+  }
+  is_stateful: true
+}
+op {
+  name: "CreateSummaryFileWriter"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "logdir"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "max_queue"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "flush_millis"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "filename_suffix"
+    type: DT_STRING
+  }
+  is_stateful: true
+}
 op {
   name: "CropAndResize"
   input_arg {
@@ -13224,6 +13507,582 @@ op {
     }
   }
 }
+op {
+  name: "CudnnRNN"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "input_h"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "input_c"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "params"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output_h"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output_c"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "reserve_space"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "rnn_mode"
+    type: "string"
+    default_value {
+      s: "lstm"
+    }
+    allowed_values {
+      list {
+        s: "rnn_relu"
+        s: "rnn_tanh"
+        s: "lstm"
+        s: "gru"
+      }
+    }
+  }
+  attr {
+    name: "input_mode"
+    type: "string"
+    default_value {
+      s: "linear_input"
+    }
+    allowed_values {
+      list {
+        s: "linear_input"
+        s: "skip_input"
+        s: "auto_select"
+      }
+    }
+  }
+  attr {
+    name: "direction"
+    type: "string"
+    default_value {
+      s: "unidirectional"
+    }
+    allowed_values {
+      list {
+        s: "unidirectional"
+        s: "bidirectional"
+      }
+    }
+  }
+  attr {
+    name: "dropout"
+    type: "float"
+    default_value {
+      f: 0
+    }
+  }
+  attr {
+    name: "seed"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  attr {
+    name: "seed2"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  attr {
+    name: "is_training"
+    type: "bool"
+    default_value {
+      b: true
+    }
+  }
+  is_stateful: true
+}
+op {
+  name: "CudnnRNNBackprop"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "input_h"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "input_c"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "params"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output_h"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output_c"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output_backprop"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output_h_backprop"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output_c_backprop"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "reserve_space"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "input_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "input_h_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "input_c_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "params_backprop"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "rnn_mode"
+    type: "string"
+    default_value {
+      s: "lstm"
+    }
+    allowed_values {
+      list {
+        s: "rnn_relu"
+        s: "rnn_tanh"
+        s: "lstm"
+        s: "gru"
+      }
+    }
+  }
+  attr {
+    name: "input_mode"
+    type: "string"
+    default_value {
+      s: "linear_input"
+    }
+    allowed_values {
+      list {
+        s: "linear_input"
+        s: "skip_input"
+        s: "auto_select"
+      }
+    }
+  }
+  attr {
+    name: "direction"
+    type: "string"
+    default_value {
+      s: "unidirectional"
+    }
+    allowed_values {
+      list {
+        s: "unidirectional"
+        s: "bidirectional"
+      }
+    }
+  }
+  attr {
+    name: "dropout"
+    type: "float"
+    default_value {
+      f: 0
+    }
+  }
+  attr {
+    name: "seed"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  attr {
+    name: "seed2"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  is_stateful: true
+}
+op {
+  name: "CudnnRNNCanonicalToParams"
+  input_arg {
+    name: "num_layers"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "num_units"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "input_size"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "weights"
+    type_attr: "T"
+    number_attr: "num_params"
+  }
+  input_arg {
+    name: "biases"
+    type_attr: "T"
+    number_attr: "num_params"
+  }
+  output_arg {
+    name: "params"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "num_params"
+    type: "int"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "rnn_mode"
+    type: "string"
+    default_value {
+      s: "lstm"
+    }
+    allowed_values {
+      list {
+        s: "rnn_relu"
+        s: "rnn_tanh"
+        s: "lstm"
+        s: "gru"
+      }
+    }
+  }
+  attr {
+    name: "input_mode"
+    type: "string"
+    default_value {
+      s: "linear_input"
+    }
+    allowed_values {
+      list {
+        s: "linear_input"
+        s: "skip_input"
+        s: "auto_select"
+      }
+    }
+  }
+  attr {
+    name: "direction"
+    type: "string"
+    default_value {
+      s: "unidirectional"
+    }
+    allowed_values {
+      list {
+        s: "unidirectional"
+        s: "bidirectional"
+      }
+    }
+  }
+  attr {
+    name: "dropout"
+    type: "float"
+    default_value {
+      f: 0
+    }
+  }
+  attr {
+    name: "seed"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  attr {
+    name: "seed2"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+}
+op {
+  name: "CudnnRNNParamsSize"
+  input_arg {
+    name: "num_layers"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "num_units"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "input_size"
+    type: DT_INT32
+  }
+  output_arg {
+    name: "params_size"
+    type_attr: "S"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "S"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "rnn_mode"
+    type: "string"
+    default_value {
+      s: "lstm"
+    }
+    allowed_values {
+      list {
+        s: "rnn_relu"
+        s: "rnn_tanh"
+        s: "lstm"
+        s: "gru"
+      }
+    }
+  }
+  attr {
+    name: "input_mode"
+    type: "string"
+    default_value {
+      s: "linear_input"
+    }
+    allowed_values {
+      list {
+        s: "linear_input"
+        s: "skip_input"
+        s: "auto_select"
+      }
+    }
+  }
+  attr {
+    name: "direction"
+    type: "string"
+    default_value {
+      s: "unidirectional"
+    }
+    allowed_values {
+      list {
+        s: "unidirectional"
+        s: "bidirectional"
+      }
+    }
+  }
+  attr {
+    name: "dropout"
+    type: "float"
+    default_value {
+      f: 0
+    }
+  }
+  attr {
+    name: "seed"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  attr {
+    name: "seed2"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+}
+op {
+  name: "CudnnRNNParamsToCanonical"
+  input_arg {
+    name: "num_layers"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "num_units"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "input_size"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "params"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "weights"
+    type_attr: "T"
+    number_attr: "num_params"
+  }
+  output_arg {
+    name: "biases"
+    type_attr: "T"
+    number_attr: "num_params"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "num_params"
+    type: "int"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "rnn_mode"
+    type: "string"
+    default_value {
+      s: "lstm"
+    }
+    allowed_values {
+      list {
+        s: "rnn_relu"
+        s: "rnn_tanh"
+        s: "lstm"
+        s: "gru"
+      }
+    }
+  }
+  attr {
+    name: "input_mode"
+    type: "string"
+    default_value {
+      s: "linear_input"
+    }
+    allowed_values {
+      list {
+        s: "linear_input"
+        s: "skip_input"
+        s: "auto_select"
+      }
+    }
+  }
+  attr {
+    name: "direction"
+    type: "string"
+    default_value {
+      s: "unidirectional"
+    }
+    allowed_values {
+      list {
+        s: "unidirectional"
+        s: "bidirectional"
+      }
+    }
+  }
+  attr {
+    name: "dropout"
+    type: "float"
+    default_value {
+      f: 0
+    }
+  }
+  attr {
+    name: "seed"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  attr {
+    name: "seed2"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+}
 op {
   name: "Cumprod"
   input_arg {
@@ -19441,32 +20300,40 @@ op {
   }
 }
 op {
-  name: "FloorMod"
+  name: "FloorMod"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "y"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "z"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+        type: DT_BFLOAT16
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+}
+op {
+  name: "FlushSummaryWriter"
   input_arg {
-    name: "x"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "y"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "z"
-    type_attr: "T"
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_INT32
-        type: DT_INT64
-        type: DT_BFLOAT16
-        type: DT_FLOAT
-        type: DT_DOUBLE
-      }
-    }
+    name: "writer"
+    type: DT_RESOURCE
   }
+  is_stateful: true
 }
 op {
   name: "FractionalAvgPool"
@@ -21770,6 +22637,18 @@ op {
     type: "string"
   }
 }
+op {
+  name: "ImportEvent"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "event"
+    type: DT_STRING
+  }
+  is_stateful: true
+}
 op {
   name: "InTopK"
   input_arg {
@@ -24307,6 +25186,10 @@ op {
     name: "num_parallel_batches"
     type: DT_INT64
   }
+  input_arg {
+    name: "drop_remainder"
+    type: DT_BOOL
+  }
   output_arg {
     name: "handle"
     type: DT_VARIANT
@@ -37666,6 +38549,32 @@ op {
   }
   allows_uninitialized_input: true
 }
+op {
+  name: "RegexReplace"
+  input_arg {
+    name: "input"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "pattern"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "rewrite"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "output"
+    type: DT_STRING
+  }
+  attr {
+    name: "replace_global"
+    type: "bool"
+    default_value {
+      b: true
+    }
+  }
+}
 op {
   name: "Relu"
   input_arg {
@@ -38209,6 +39118,38 @@ op {
     type: "func"
   }
 }
+op {
+  name: "RemoteCall"
+  input_arg {
+    name: "target"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "args"
+    type_list_attr: "Tin"
+  }
+  output_arg {
+    name: "output"
+    type_list_attr: "Tout"
+  }
+  attr {
+    name: "Tin"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "Tout"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "f"
+    type: "func"
+  }
+  is_stateful: true
+}
 op {
   name: "RemoteFusedGraphExecute"
   input_arg {
@@ -51464,6 +52405,37 @@ op {
     }
   }
 }
+op {
+  name: "SlideDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "window_size"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "stride"
+    type: DT_INT64
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "Snapshot"
   input_arg {
@@ -62094,6 +63066,28 @@ op {
     }
   }
 }
+op {
+  name: "SummaryWriter"
+  output_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  attr {
+    name: "shared_name"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  attr {
+    name: "container"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  is_stateful: true
+}
 op {
   name: "Svd"
   input_arg {
@@ -65354,6 +66348,59 @@ op {
     }
   }
 }
+op {
+  name: "UniqueWithCountsV2"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "axis"
+    type_attr: "Taxis"
+  }
+  output_arg {
+    name: "y"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "idx"
+    type_attr: "out_idx"
+  }
+  output_arg {
+    name: "count"
+    type_attr: "out_idx"
+  }
+  attr {
+    name: "T"
+    type: "type"
+  }
+  attr {
+    name: "Taxis"
+    type: "type"
+    default_value {
+      type: DT_INT64
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "out_idx"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
 op {
   name: "Unpack"
   input_arg {
@@ -65530,15 +66577,139 @@ op {
         type: DT_FLOAT
         type: DT_DOUBLE
         type: DT_INT32
-        type: DT_INT64
+        type: DT_INT64
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_UINT16
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+        type: DT_BFLOAT16
+      }
+    }
+  }
+  attr {
+    name: "Tindices"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "Tnumsegments"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
+op {
+  name: "UnsortedSegmentMax"
+  input_arg {
+    name: "data"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "segment_ids"
+    type_attr: "Tindices"
+  }
+  input_arg {
+    name: "num_segments"
+    type_attr: "Tnumsegments"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
         type: DT_UINT8
         type: DT_INT16
         type: DT_INT8
+        type: DT_INT64
+        type: DT_BFLOAT16
         type: DT_UINT16
         type: DT_HALF
         type: DT_UINT32
         type: DT_UINT64
+      }
+    }
+  }
+  attr {
+    name: "Tindices"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "Tnumsegments"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
+op {
+  name: "UnsortedSegmentMin"
+  input_arg {
+    name: "data"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "segment_ids"
+    type_attr: "Tindices"
+  }
+  input_arg {
+    name: "num_segments"
+    type_attr: "Tnumsegments"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_INT64
         type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
       }
     }
   }
@@ -65567,7 +66738,7 @@ op {
   }
 }
 op {
-  name: "UnsortedSegmentMax"
+  name: "UnsortedSegmentProd"
   input_arg {
     name: "data"
     type_attr: "T"
@@ -66242,6 +67413,39 @@ op {
   }
   is_stateful: true
 }
+op {
+  name: "WriteAudioSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "tensor"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "sample_rate"
+    type: DT_FLOAT
+  }
+  attr {
+    name: "max_outputs"
+    type: "int"
+    default_value {
+      i: 3
+    }
+    has_minimum: true
+    minimum: 1
+  }
+  is_stateful: true
+}
 op {
   name: "WriteFile"
   input_arg {
@@ -66253,6 +67457,180 @@ op {
     type: DT_STRING
   }
 }
+op {
+  name: "WriteGraphSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tensor"
+    type: DT_STRING
+  }
+  is_stateful: true
+}
+op {
+  name: "WriteHistogramSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "values"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    default_value {
+      type: DT_FLOAT
+    }
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_INT64
+        type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+      }
+    }
+  }
+  is_stateful: true
+}
+op {
+  name: "WriteImageSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "tensor"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "bad_color"
+    type: DT_UINT8
+  }
+  attr {
+    name: "max_images"
+    type: "int"
+    default_value {
+      i: 3
+    }
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "T"
+    type: "type"
+    default_value {
+      type: DT_FLOAT
+    }
+    allowed_values {
+      list {
+        type: DT_UINT8
+        type: DT_FLOAT
+        type: DT_HALF
+      }
+    }
+  }
+  is_stateful: true
+}
+op {
+  name: "WriteScalarSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "value"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_INT64
+        type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+      }
+    }
+  }
+  is_stateful: true
+}
+op {
+  name: "WriteSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tensor"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "summary_metadata"
+    type: DT_STRING
+  }
+  attr {
+    name: "T"
+    type: "type"
+  }
+  is_stateful: true
+}
 op {
   name: "ZerosLike"
   input_arg {
diff --git a/tensorflow/contrib/cudnn_rnn/ops/cudnn_rnn_ops.cc b/tensorflow/core/ops/cudnn_rnn_ops.cc
similarity index 53%
rename from tensorflow/contrib/cudnn_rnn/ops/cudnn_rnn_ops.cc
rename to tensorflow/core/ops/cudnn_rnn_ops.cc
index 1a79bf066c3a27e040099729fb079ee963f59270..37d70a22ef61ad4e31259dc3001db72ffcea7d93 100644
--- a/tensorflow/contrib/cudnn_rnn/ops/cudnn_rnn_ops.cc
+++ b/tensorflow/core/ops/cudnn_rnn_ops.cc
@@ -21,31 +21,6 @@ limitations under the License.
 namespace tensorflow {
 namespace {
 
-constexpr auto kCudnnRNNCommonInputs = R"doc(
-num_layers: Specifies the number of layers in the RNN model.
-num_units: Specifies the size of the hidden state.
-input_size: Specifies the size of the input state.
-)doc";
-
-constexpr auto kCudnnRNNCommonAttrs = R"doc(
-rnn_mode: Indicates the type of the RNN model.
-input_mode: Indicate whether there is a linear projection between the input and
-    The actual computation before the first layer. 'skip_input' is only allowed
-    when input_size == num_units; 'auto_select' implies 'skip_input' when
-    input_size == num_units; otherwise, it implies 'linear_input'.
-direction: Indicates whether a bidirectional model will be used.
-    dir = (direction == bidirectional) ? 2 : 1
-dropout: dropout probability. When set to 0., dropout is disabled.
-seed: the 1st part of a seed to initialize dropout.
-seed2: the 2nd part of a seed to initialize dropout.
-)doc";
-
-constexpr auto kCudnnRNNParamsBuffer = R"doc(
-Note that the params buffer may not be compatible across different GPUs. So any
-save and restoration should be converted to and from the canonical weights and
-biases.
-)doc";
-
 constexpr auto kRNNModeAttrs =
     "rnn_mode: {'rnn_relu', 'rnn_tanh', 'lstm', 'gru'} = 'lstm'";
 
@@ -56,21 +31,13 @@ constexpr auto kRNNInputModeAttrs =
 constexpr auto kRNNDirectionAttrs =
     "direction: {'unidirectional', 'bidirectional'} = 'unidirectional'";
 
-constexpr auto kCudnnRNNParamsCanonical = R"doc(
-weights: the canonical form of weights that can be used for saving
-    and restoration. They are more likely to be compatible across different
-    generations.
-biases: the canonical form of biases that can be used for saving
-    and restoration. They are more likely to be compatible across different
-    generations.
-)doc";
-
 }  // namespace
 
 using shape_inference::DimensionHandle;
 using shape_inference::InferenceContext;
 using shape_inference::ShapeHandle;
 
+
 REGISTER_OP("CudnnRNNParamsSize")
     .Input("num_layers: int32")
     .Input("num_units: int32")
@@ -87,38 +54,8 @@ REGISTER_OP("CudnnRNNParamsSize")
     .SetShapeFn([](InferenceContext* c) {
       c->set_output(0, c->Vector(1));
       return Status::OK();
-    })
-    .Doc(strings::StrCat(R"doc(
-Return the params size that can be used by the Cudnn RNN model. Subsequent
-weight allocation and initialization should use this size.
-)doc",
-                         kCudnnRNNCommonInputs, kCudnnRNNCommonAttrs,
-                         R"doc(
-params_size: The size of the params buffer that should be allocated and
-    initialized for this RNN model. Note that this params buffer may not be
-    compatible across GPUs. Please use CudnnRNNParamsWeights and
-    CudnnRNNParamsBiases to save and restore them in a way that is compatible
-    across different runs.
-)doc",
-                         kCudnnRNNParamsBuffer));
+    });
 
-static string CudnnRNNForwardTensors() {
-  return R"doc(
-input: a 3-D tensor with the shape of [seq_length, batch_size, input_size].
-input_h: a 3-D tensor with the shape of [num_layer * dir, batch_size,
-    num_units].
-input_c: For LSTM, a 3-D tensor with the shape of
-    [num_layer * dir, batch, num_units]. For other models, it is ignored.
-params: a 1-D tensor that contains the weights and biases in an opaque layout.
-    The size must be created through CudnnRNNParamsSize, and initialized
-    separately. Note that they might not be compatible across different
-    generations. So it is a good idea to save and restore
-output: a 3-D tensor with the shape of [seq_length, batch_size,
-    dir * num_units].
-output_h: the same shape has input_h.
-output_c: the same shape as input_c for LSTM. An empty tensor for other models.
-)doc";
-}
 
 REGISTER_OP("CudnnRNN")
     .Input("input: T")
@@ -160,18 +97,8 @@ REGISTER_OP("CudnnRNN")
       c->set_output(2, output_c_shape);
       c->set_output(3, c->UnknownShape());
       return Status::OK();
-    })
-    .Doc(strings::StrCat(R"doc(
-Computes the RNN from the input and initial states, with respect to the params
-buffer.
-)doc",
-                         kCudnnRNNCommonAttrs, CudnnRNNForwardTensors(),
-                         R"doc(
-is_training: Indicates whether this operation is used for inferenece or
-    training.
-reserve_space: an opaque tensor that can be used in backprop calculation. It
-    is only produced if is_training is false.
-)doc"));
+    });
+
 
 REGISTER_OP("CudnnRNNBackprop")
     .Input("input: T")
@@ -207,27 +134,8 @@ REGISTER_OP("CudnnRNNBackprop")
       c->set_output(2, input_c_shape);
       c->set_output(3, params_shape);
       return Status::OK();
-    })
-    .Doc(strings::StrCat(R"doc(
-Compute the backprop of both data and weights in a RNN.
-)doc",
-                         kCudnnRNNCommonAttrs, CudnnRNNForwardTensors(),
-                         R"doc(
-output_backprop: A 3-D tensor with the same shape as output in the forward pass.
-output_h_backprop: A 3-D tensor with the same shape as output_h in the forward
-    pass.
-output_c_backprop: A 3-D tensor with the same shape as output_c in the forward
-    pass.
-reserve_space: The same reserve_space produced in for forward operation.
-input_backprop: The backprop to input in the forward pass. Has the same shape
-    as input.
-input_h_backprop: The backprop to input_h in the forward pass. Has the same
-    shape as input_h.
-input_c_backprop: The backprop to input_c in the forward pass. Has the same
-    shape as input_c.
-params_backprop: The backprop to the params buffer in the forward pass. Has the
-    same shape as params.
-)doc"));
+    });
+
 
 REGISTER_OP("CudnnRNNParamsToCanonical")
     .Input("num_layers: int32")
@@ -259,17 +167,8 @@ REGISTER_OP("CudnnRNNParamsToCanonical")
         c->set_output(num_params + i, c->Vector(InferenceContext::kUnknownDim));
       }
       return Status::OK();
-    })
-    .Doc(strings::StrCat(R"doc(
-Retrieves a set of weights from the opaque params buffer that can be saved and
-restored in a way compatible with future runs.
-)doc",
-                         kCudnnRNNCommonInputs, kCudnnRNNParamsBuffer, R"doc(
-num_params: number of parameter sets for all layers.
-    Each layer may contain multiple parameter sets, with each set consisting of
-    a weight matrix and a bias vector.
-)doc",
-                         kCudnnRNNParamsCanonical, kCudnnRNNCommonAttrs));
+    });
+
 
 REGISTER_OP("CudnnRNNCanonicalToParams")
     .Input("num_layers: int32")
@@ -289,17 +188,6 @@ REGISTER_OP("CudnnRNNCanonicalToParams")
     .SetShapeFn([](InferenceContext* c) {
       c->set_output(0, c->Vector(InferenceContext::kUnknownDim));
       return Status::OK();
-    })
-    .Doc(strings::StrCat(R"doc(
-Writes a set of weights into the opaque params buffer so they can be used in
-upcoming training or inferences.
-)doc",
-                         kCudnnRNNCommonInputs, kCudnnRNNParamsCanonical,
-                         kCudnnRNNParamsBuffer, R"doc(
-num_params: number of parameter sets for all layers.
-    Each layer may contain multiple parameter sets, with each set consisting of
-    a weight matrix and a bias vector.
-)doc",
-                         kCudnnRNNCommonAttrs));
+    });
 
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/cudnn_rnn/ops/cudnn_rnn_ops_test.cc b/tensorflow/core/ops/cudnn_rnn_ops_test.cc
similarity index 100%
rename from tensorflow/contrib/cudnn_rnn/ops/cudnn_rnn_ops_test.cc
rename to tensorflow/core/ops/cudnn_rnn_ops_test.cc
diff --git a/tensorflow/core/ops/data_flow_ops.cc b/tensorflow/core/ops/data_flow_ops.cc
index 4f946fb3ca7608816180351b7753d01f13d469f2..3112f35da43d16d7a4cd4c1c8e017cab3366e070 100644
--- a/tensorflow/core/ops/data_flow_ops.cc
+++ b/tensorflow/core/ops/data_flow_ops.cc
@@ -668,13 +668,31 @@ REGISTER_OP("TensorArrayGatherV3")
     .Attr("dtype: type")
     .Attr("element_shape: shape = { unknown_rank: true }")
     .SetShapeFn([](InferenceContext* c) {
+      ShapeHandle indices;
       ShapeHandle unused;
       DimensionHandle unused_dim;
       TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 1, &unused));
-      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &unused));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &indices));
       TF_RETURN_IF_ERROR(c->WithValue(c->Dim(c->input(0), 0), 2, &unused_dim));
       TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 0, &unused));
-      return shape_inference::UnknownShape(c);
+      auto shapes = c->input_handle_shapes_and_types(0);
+      if (shapes != nullptr && !shapes->empty()) {
+        ShapeHandle tensor_shape = shapes->at(0).shape;
+        ShapeHandle output_shape;
+        TF_RETURN_IF_ERROR(
+            c->Concatenate(indices, tensor_shape, &output_shape));
+        c->set_output(0, output_shape);
+        return Status::OK();
+      } else {
+        PartialTensorShape p;
+        TF_RETURN_IF_ERROR(c->GetAttr("element_shape", &p));
+        ShapeHandle s;
+        TF_RETURN_IF_ERROR(c->MakeShapeFromPartialTensorShape(p, &s));
+        ShapeHandle output_shape;
+        TF_RETURN_IF_ERROR(c->Concatenate(indices, s, &output_shape));
+        c->set_output(0, output_shape);
+        return Status::OK();
+      }
     });
 
 REGISTER_OP("TensorArrayScatterV3")
@@ -685,12 +703,25 @@ REGISTER_OP("TensorArrayScatterV3")
     .Output("flow_out: float")
     .Attr("T: type")
     .SetShapeFn([](InferenceContext* c) {
+      ShapeHandle indices;
       ShapeHandle unused;
       DimensionHandle unused_dim;
       TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 1, &unused));
-      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &unused));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &indices));
       TF_RETURN_IF_ERROR(c->WithValue(c->Dim(c->input(0), 0), 2, &unused_dim));
       TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 0, &unused));
+      ShapeHandle value_shape;
+      // Assert that the length of the indices tensor is equal to the first
+      // dimension of the value tensor.
+      TF_RETURN_IF_ERROR(
+          c->MergePrefix(c->input(2), indices, &value_shape, &indices));
+      auto shapes = c->input_handle_shapes_and_types(0);
+      if (shapes != nullptr && !shapes->empty()) {
+        ShapeHandle tensor_shape = shapes->at(0).shape;
+        ShapeHandle fed_shape;
+        TF_RETURN_IF_ERROR(c->Subshape(value_shape, 1, &fed_shape));
+        TF_RETURN_IF_ERROR(c->Merge(tensor_shape, fed_shape, &fed_shape));
+      }
       return shape_inference::ScalarShape(c);
     });
 
diff --git a/tensorflow/core/ops/dataset_ops.cc b/tensorflow/core/ops/dataset_ops.cc
index bdbbf6d7c32014678d8ad171df03c29a4a44f422..e2453b9712f3a1141343f694f0e2c798aa39ed4a 100644
--- a/tensorflow/core/ops/dataset_ops.cc
+++ b/tensorflow/core/ops/dataset_ops.cc
@@ -1,4 +1,4 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -166,6 +166,7 @@ REGISTER_OP("MapAndBatchDataset")
     .Input("other_arguments: Targuments")
     .Input("batch_size: int64")
     .Input("num_parallel_batches: int64")
+    .Input("drop_remainder: bool")
     .Output("handle: variant")
     .Attr("f: func")
     .Attr("Targuments: list(type) >= 0")
@@ -265,6 +266,16 @@ REGISTER_OP("BatchDataset")
     .Attr("output_shapes: list(shape) >= 1")
     .SetShapeFn(shape_inference::ScalarShape);
 
+// TODO(mrry): move SlideDataset to contrib in the future.
+REGISTER_OP("SlideDataset")
+    .Input("input_dataset: variant")
+    .Input("window_size: int64")
+    .Input("stride: int64")
+    .Output("handle: variant")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn(shape_inference::ScalarShape);
+
 REGISTER_OP("PaddedBatchDataset")
     .Input("input_dataset: variant")
     .Input("batch_size: int64")
diff --git a/tensorflow/core/ops/functional_ops.cc b/tensorflow/core/ops/functional_ops.cc
index 9e18d20db65075e471862def3924811b260f8a08..4b21fac80aea76555959e8a202a73ccc833d0306 100644
--- a/tensorflow/core/ops/functional_ops.cc
+++ b/tensorflow/core/ops/functional_ops.cc
@@ -47,6 +47,7 @@ REGISTER_OP("RemoteCall")
     .Attr("Tin: list(type)")
     .Attr("Tout: list(type)")
     .Attr("f: func")
+    .SetIsStateful()
     .SetShapeFn(shape_inference::UnknownShape);
 
 REGISTER_OP("_If")
diff --git a/tensorflow/core/ops/list_ops.cc b/tensorflow/core/ops/list_ops.cc
index 3487c955cbb2b06bdb33000da549c0fc6e7f86e8..cad617638ff12cd1020276341fbe9f9b7aac97bc 100644
--- a/tensorflow/core/ops/list_ops.cc
+++ b/tensorflow/core/ops/list_ops.cc
@@ -135,10 +135,6 @@ REGISTER_OP("TensorListStack")
         }
         shape_inference::ShapeHandle ignored;
         TF_RETURN_IF_ERROR(c->Merge(s, list_shape_type.shape, &ignored));
-        if (!c->FullyDefined(s) || !c->FullyDefined(list_shape_type.shape)) {
-          return errors::InvalidArgument(
-              "Can only gather from a list with fully defined shapes.");
-        }
         s = list_shape_type.shape;
       }
       int expected_num_elements = -1;
diff --git a/tensorflow/core/ops/nn_ops.cc b/tensorflow/core/ops/nn_ops.cc
index 67481fd202b3c3b35033b72e4c1c5fd294d98696..1f4e9753c37bcea49af3f7b3f63be329ab7d6b80 100644
--- a/tensorflow/core/ops/nn_ops.cc
+++ b/tensorflow/core/ops/nn_ops.cc
@@ -266,7 +266,7 @@ REGISTER_OP("Conv2D")
     .Input("input: T")
     .Input("filter: T")
     .Output("output: T")
-    .Attr("T: {half, bfloat16, float}")
+    .Attr("T: {half, bfloat16, float, double}")
     .Attr("strides: list(int)")
     .Attr("use_cudnn_on_gpu: bool = true")
     .Attr(GetPaddingAttrString())
@@ -279,7 +279,7 @@ REGISTER_OP("Conv2DBackpropInput")
     .Input("filter: T")
     .Input("out_backprop: T")
     .Output("output: T")
-    .Attr("T: {half, bfloat16, float}")
+    .Attr("T: {half, bfloat16, float, double}")
     .Attr("strides: list(int)")
     .Attr("use_cudnn_on_gpu: bool = true")
     .Attr(GetPaddingAttrString())
@@ -301,7 +301,7 @@ REGISTER_OP("Conv2DBackpropFilter")
     .Input("filter_sizes: int32")
     .Input("out_backprop: T")
     .Output("output: T")
-    .Attr("T: {half, bfloat16, float}")
+    .Attr("T: {half, bfloat16, float, double}")
     .Attr("strides: list(int)")
     .Attr("use_cudnn_on_gpu: bool = true")
     .Attr(GetPaddingAttrString())
@@ -1498,6 +1498,7 @@ REGISTER_OP("_MklConv2D")
     .Attr("use_cudnn_on_gpu: bool = true")
     .Attr(GetPaddingAttrString())
     .Attr(GetConvnetDataFormatAttrString())
+    .Attr("dilations: list(int) = [1, 1, 1, 1]")
     .SetShapeFn(shape_inference::Conv2DShape)
     .Doc(R"doc(
 MKL version of Conv2D operator. Uses MKL DNN APIs to perform 2D convolution.
@@ -1516,6 +1517,7 @@ REGISTER_OP("__MklDummyConv2DWithBias")
     .Attr("use_cudnn_on_gpu: bool = true")
     .Attr(GetPaddingAttrString())
     .Attr(GetConvnetDataFormatAttrString())
+    .Attr("dilations: list(int) = [1, 1, 1, 1]")
     .Doc(R"doc(
 Dummy node that enables fusing Conv2D and BiasAdd operator for MKL. This node
 does not perform anything. It is just created as an intermediate output of
@@ -1541,6 +1543,7 @@ REGISTER_OP("_MklConv2DWithBias")
     .Attr("use_cudnn_on_gpu: bool = true")
     .Attr(GetPaddingAttrString())
     .Attr(GetConvnetDataFormatAttrString())
+    .Attr("dilations: list(int) = [1, 1, 1, 1]")
     .Doc(R"doc(
 MKL version of Conv2D and BiasAdd operator. Uses MKL DNN APIs to perform
 2D convolution and add Bias to the output of convolution.
@@ -1563,6 +1566,7 @@ REGISTER_OP("_MklConv2DBackpropFilter")
     .Attr("use_cudnn_on_gpu: bool = true")
     .Attr(GetPaddingAttrString())
     .Attr(GetConvnetDataFormatAttrString())
+    .Attr("dilations: list(int) = [1, 1, 1, 1]")
     .SetShapeFn([](InferenceContext* c) {
       ShapeHandle s;
       TF_RETURN_IF_ERROR(c->MakeShapeFromShapeTensor(1, &s));
@@ -1589,6 +1593,7 @@ REGISTER_OP("__MklDummyConv2DBackpropFilterWithBias")
     .Attr("use_cudnn_on_gpu: bool = true")
     .Attr(GetPaddingAttrString())
     .Attr(GetConvnetDataFormatAttrString())
+    .Attr("dilations: list(int) = [1, 1, 1, 1]")
     .SetShapeFn([](InferenceContext* c) {
       ShapeHandle input_shape;
       // Fetch the data_format attribute, which may not exist.
@@ -1633,6 +1638,7 @@ REGISTER_OP("_MklConv2DBackpropFilterWithBias")
     .Attr("use_cudnn_on_gpu: bool = true")
     .Attr(GetPaddingAttrString())
     .Attr(GetConvnetDataFormatAttrString())
+    .Attr("dilations: list(int) = [1, 1, 1, 1]")
     .SetShapeFn([](InferenceContext* c) {
       ShapeHandle input_shape;
       // Fetch the data_format attribute, which may not exist.
@@ -1668,6 +1674,7 @@ REGISTER_OP("_MklConv2DWithBiasBackpropBias")
     .Attr("T: {half, float, double}")
     .Attr("strides: list(int)")
     .Attr(GetConvnetDataFormatAttrString())
+    .Attr("dilations: list(int) = [1, 1, 1, 1]")
     .Doc(R"doc(
 MKL version of Conv2DBackpropBias. Uses MKL DNN APIs to compute the
 gradients of convolution with respect to the bias.
@@ -1690,6 +1697,7 @@ REGISTER_OP("_MklConv2DBackpropInput")
     .Attr("use_cudnn_on_gpu: bool = true")
     .Attr(GetPaddingAttrString())
     .Attr(GetConvnetDataFormatAttrString())
+    .Attr("dilations: list(int) = [1, 1, 1, 1]")
     .SetShapeFn([](InferenceContext* c) {
       ShapeHandle s;
       TF_RETURN_IF_ERROR(c->MakeShapeFromShapeTensor(0, &s));
@@ -2007,10 +2015,10 @@ REGISTER_OP("_MklFusedBatchNorm")
       TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 4, &x));
 
       bool is_training;
-      c->GetAttr("is_training", &is_training);
+      TF_RETURN_IF_ERROR(c->GetAttr("is_training", &is_training));
       int number_inputs = (is_training) ? 3 : 5;
       string data_format;
-      c->GetAttr("data_format", &data_format);
+      TF_RETURN_IF_ERROR(c->GetAttr("data_format", &data_format));
       DimensionHandle channel_dim =
           (data_format == "NHWC") ? c->Dim(x, 3) : c->Dim(x, 1);
 
@@ -2076,8 +2084,8 @@ REGISTER_OP("_MklFusedBatchNormGrad")
 
       bool is_training;
       string data_format;
-      c->GetAttr("is_training", &is_training);
-      c->GetAttr("data_format", &data_format);
+      TF_RETURN_IF_ERROR(c->GetAttr("is_training", &is_training));
+      TF_RETURN_IF_ERROR(c->GetAttr("data_format", &data_format));
       DimensionHandle channel_dim = (data_format == "NHWC")
                                         ? c->Dim(y_backprop, 3)
                                         : c->Dim(y_backprop, 1);
diff --git a/tensorflow/core/ops/ops.pbtxt b/tensorflow/core/ops/ops.pbtxt
index 55be0519a797f29994945d9c2fa44d27a5f0ad0f..af2c563489e7a92b6a88af181da3c5454d2be986 100644
--- a/tensorflow/core/ops/ops.pbtxt
+++ b/tensorflow/core/ops/ops.pbtxt
@@ -4384,6 +4384,14 @@ op {
     }
   }
 }
+op {
+  name: "CloseSummaryWriter"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  is_stateful: true
+}
 op {
   name: "CompareAndBitpack"
   input_arg {
@@ -4806,6 +4814,7 @@ op {
         type: DT_HALF
         type: DT_BFLOAT16
         type: DT_FLOAT
+        type: DT_DOUBLE
       }
     }
   }
@@ -4882,6 +4891,7 @@ op {
         type: DT_HALF
         type: DT_BFLOAT16
         type: DT_FLOAT
+        type: DT_DOUBLE
       }
     }
   }
@@ -4958,6 +4968,7 @@ op {
         type: DT_HALF
         type: DT_BFLOAT16
         type: DT_FLOAT
+        type: DT_DOUBLE
       }
     }
   }
@@ -5473,6 +5484,54 @@ op {
     }
   }
 }
+op {
+  name: "CreateSummaryDbWriter"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "db_uri"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "experiment_name"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "run_name"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "user_name"
+    type: DT_STRING
+  }
+  is_stateful: true
+}
+op {
+  name: "CreateSummaryFileWriter"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "logdir"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "max_queue"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "flush_millis"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "filename_suffix"
+    type: DT_STRING
+  }
+  is_stateful: true
+}
 op {
   name: "CropAndResize"
   input_arg {
@@ -5666,214 +5725,790 @@ op {
   }
 }
 op {
-  name: "Cumprod"
+  name: "CudnnRNN"
   input_arg {
-    name: "x"
+    name: "input"
     type_attr: "T"
   }
   input_arg {
-    name: "axis"
-    type_attr: "Tidx"
+    name: "input_h"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "input_c"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "params"
+    type_attr: "T"
   }
   output_arg {
-    name: "out"
+    name: "output"
     type_attr: "T"
   }
-  attr {
-    name: "exclusive"
-    type: "bool"
-    default_value {
-      b: false
-    }
+  output_arg {
+    name: "output_h"
+    type_attr: "T"
   }
-  attr {
-    name: "reverse"
-    type: "bool"
-    default_value {
-      b: false
-    }
+  output_arg {
+    name: "output_c"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "reserve_space"
+    type_attr: "T"
   }
   attr {
     name: "T"
     type: "type"
     allowed_values {
       list {
+        type: DT_HALF
         type: DT_FLOAT
         type: DT_DOUBLE
-        type: DT_INT32
-        type: DT_UINT8
-        type: DT_INT16
-        type: DT_INT8
-        type: DT_COMPLEX64
-        type: DT_INT64
-        type: DT_QINT8
-        type: DT_QUINT8
-        type: DT_QINT32
-        type: DT_BFLOAT16
-        type: DT_UINT16
-        type: DT_COMPLEX128
-        type: DT_HALF
-        type: DT_UINT32
-        type: DT_UINT64
       }
     }
   }
   attr {
-    name: "Tidx"
-    type: "type"
+    name: "rnn_mode"
+    type: "string"
     default_value {
-      type: DT_INT32
+      s: "lstm"
     }
     allowed_values {
       list {
-        type: DT_INT32
-        type: DT_INT64
+        s: "rnn_relu"
+        s: "rnn_tanh"
+        s: "lstm"
+        s: "gru"
       }
     }
   }
-}
-op {
-  name: "Cumsum"
-  input_arg {
-    name: "x"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "axis"
-    type_attr: "Tidx"
-  }
-  output_arg {
-    name: "out"
-    type_attr: "T"
-  }
-  attr {
-    name: "exclusive"
-    type: "bool"
-    default_value {
-      b: false
-    }
-  }
   attr {
-    name: "reverse"
-    type: "bool"
+    name: "input_mode"
+    type: "string"
     default_value {
-      b: false
+      s: "linear_input"
     }
-  }
-  attr {
-    name: "T"
-    type: "type"
     allowed_values {
       list {
-        type: DT_FLOAT
-        type: DT_DOUBLE
-        type: DT_INT32
-        type: DT_UINT8
-        type: DT_INT16
-        type: DT_INT8
-        type: DT_COMPLEX64
-        type: DT_INT64
-        type: DT_QINT8
-        type: DT_QUINT8
-        type: DT_QINT32
-        type: DT_BFLOAT16
-        type: DT_UINT16
-        type: DT_COMPLEX128
-        type: DT_HALF
-        type: DT_UINT32
-        type: DT_UINT64
+        s: "linear_input"
+        s: "skip_input"
+        s: "auto_select"
       }
     }
   }
   attr {
-    name: "Tidx"
-    type: "type"
+    name: "direction"
+    type: "string"
     default_value {
-      type: DT_INT32
+      s: "unidirectional"
     }
     allowed_values {
       list {
-        type: DT_INT32
-        type: DT_INT64
+        s: "unidirectional"
+        s: "bidirectional"
       }
     }
   }
-}
-op {
-  name: "DataFormatDimMap"
-  input_arg {
-    name: "x"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "y"
-    type_attr: "T"
-  }
   attr {
-    name: "T"
-    type: "type"
+    name: "dropout"
+    type: "float"
     default_value {
-      type: DT_INT32
+      f: 0
     }
-    allowed_values {
-      list {
-        type: DT_INT32
-        type: DT_INT64
-      }
+  }
+  attr {
+    name: "seed"
+    type: "int"
+    default_value {
+      i: 0
     }
   }
   attr {
-    name: "src_format"
-    type: "string"
+    name: "seed2"
+    type: "int"
     default_value {
-      s: "NHWC"
+      i: 0
     }
   }
   attr {
-    name: "dst_format"
-    type: "string"
+    name: "is_training"
+    type: "bool"
     default_value {
-      s: "NCHW"
+      b: true
     }
   }
+  is_stateful: true
 }
 op {
-  name: "DataFormatVecPermute"
+  name: "CudnnRNNBackprop"
   input_arg {
-    name: "x"
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "input_h"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "input_c"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "params"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output_h"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output_c"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output_backprop"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output_h_backprop"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "output_c_backprop"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "reserve_space"
     type_attr: "T"
   }
   output_arg {
-    name: "y"
+    name: "input_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "input_h_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "input_c_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "params_backprop"
     type_attr: "T"
   }
   attr {
     name: "T"
     type: "type"
-    default_value {
-      type: DT_INT32
-    }
     allowed_values {
       list {
-        type: DT_INT32
-        type: DT_INT64
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
       }
     }
   }
   attr {
-    name: "src_format"
+    name: "rnn_mode"
     type: "string"
     default_value {
-      s: "NHWC"
+      s: "lstm"
+    }
+    allowed_values {
+      list {
+        s: "rnn_relu"
+        s: "rnn_tanh"
+        s: "lstm"
+        s: "gru"
+      }
     }
   }
   attr {
-    name: "dst_format"
+    name: "input_mode"
     type: "string"
     default_value {
-      s: "NCHW"
+      s: "linear_input"
     }
-  }
+    allowed_values {
+      list {
+        s: "linear_input"
+        s: "skip_input"
+        s: "auto_select"
+      }
+    }
+  }
+  attr {
+    name: "direction"
+    type: "string"
+    default_value {
+      s: "unidirectional"
+    }
+    allowed_values {
+      list {
+        s: "unidirectional"
+        s: "bidirectional"
+      }
+    }
+  }
+  attr {
+    name: "dropout"
+    type: "float"
+    default_value {
+      f: 0
+    }
+  }
+  attr {
+    name: "seed"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  attr {
+    name: "seed2"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  is_stateful: true
+}
+op {
+  name: "CudnnRNNCanonicalToParams"
+  input_arg {
+    name: "num_layers"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "num_units"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "input_size"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "weights"
+    type_attr: "T"
+    number_attr: "num_params"
+  }
+  input_arg {
+    name: "biases"
+    type_attr: "T"
+    number_attr: "num_params"
+  }
+  output_arg {
+    name: "params"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "num_params"
+    type: "int"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "rnn_mode"
+    type: "string"
+    default_value {
+      s: "lstm"
+    }
+    allowed_values {
+      list {
+        s: "rnn_relu"
+        s: "rnn_tanh"
+        s: "lstm"
+        s: "gru"
+      }
+    }
+  }
+  attr {
+    name: "input_mode"
+    type: "string"
+    default_value {
+      s: "linear_input"
+    }
+    allowed_values {
+      list {
+        s: "linear_input"
+        s: "skip_input"
+        s: "auto_select"
+      }
+    }
+  }
+  attr {
+    name: "direction"
+    type: "string"
+    default_value {
+      s: "unidirectional"
+    }
+    allowed_values {
+      list {
+        s: "unidirectional"
+        s: "bidirectional"
+      }
+    }
+  }
+  attr {
+    name: "dropout"
+    type: "float"
+    default_value {
+      f: 0
+    }
+  }
+  attr {
+    name: "seed"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  attr {
+    name: "seed2"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+}
+op {
+  name: "CudnnRNNParamsSize"
+  input_arg {
+    name: "num_layers"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "num_units"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "input_size"
+    type: DT_INT32
+  }
+  output_arg {
+    name: "params_size"
+    type_attr: "S"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "S"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "rnn_mode"
+    type: "string"
+    default_value {
+      s: "lstm"
+    }
+    allowed_values {
+      list {
+        s: "rnn_relu"
+        s: "rnn_tanh"
+        s: "lstm"
+        s: "gru"
+      }
+    }
+  }
+  attr {
+    name: "input_mode"
+    type: "string"
+    default_value {
+      s: "linear_input"
+    }
+    allowed_values {
+      list {
+        s: "linear_input"
+        s: "skip_input"
+        s: "auto_select"
+      }
+    }
+  }
+  attr {
+    name: "direction"
+    type: "string"
+    default_value {
+      s: "unidirectional"
+    }
+    allowed_values {
+      list {
+        s: "unidirectional"
+        s: "bidirectional"
+      }
+    }
+  }
+  attr {
+    name: "dropout"
+    type: "float"
+    default_value {
+      f: 0
+    }
+  }
+  attr {
+    name: "seed"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  attr {
+    name: "seed2"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+}
+op {
+  name: "CudnnRNNParamsToCanonical"
+  input_arg {
+    name: "num_layers"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "num_units"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "input_size"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "params"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "weights"
+    type_attr: "T"
+    number_attr: "num_params"
+  }
+  output_arg {
+    name: "biases"
+    type_attr: "T"
+    number_attr: "num_params"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "num_params"
+    type: "int"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "rnn_mode"
+    type: "string"
+    default_value {
+      s: "lstm"
+    }
+    allowed_values {
+      list {
+        s: "rnn_relu"
+        s: "rnn_tanh"
+        s: "lstm"
+        s: "gru"
+      }
+    }
+  }
+  attr {
+    name: "input_mode"
+    type: "string"
+    default_value {
+      s: "linear_input"
+    }
+    allowed_values {
+      list {
+        s: "linear_input"
+        s: "skip_input"
+        s: "auto_select"
+      }
+    }
+  }
+  attr {
+    name: "direction"
+    type: "string"
+    default_value {
+      s: "unidirectional"
+    }
+    allowed_values {
+      list {
+        s: "unidirectional"
+        s: "bidirectional"
+      }
+    }
+  }
+  attr {
+    name: "dropout"
+    type: "float"
+    default_value {
+      f: 0
+    }
+  }
+  attr {
+    name: "seed"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+  attr {
+    name: "seed2"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+}
+op {
+  name: "Cumprod"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "axis"
+    type_attr: "Tidx"
+  }
+  output_arg {
+    name: "out"
+    type_attr: "T"
+  }
+  attr {
+    name: "exclusive"
+    type: "bool"
+    default_value {
+      b: false
+    }
+  }
+  attr {
+    name: "reverse"
+    type: "bool"
+    default_value {
+      b: false
+    }
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_COMPLEX64
+        type: DT_INT64
+        type: DT_QINT8
+        type: DT_QUINT8
+        type: DT_QINT32
+        type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_COMPLEX128
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+      }
+    }
+  }
+  attr {
+    name: "Tidx"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
+op {
+  name: "Cumsum"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "axis"
+    type_attr: "Tidx"
+  }
+  output_arg {
+    name: "out"
+    type_attr: "T"
+  }
+  attr {
+    name: "exclusive"
+    type: "bool"
+    default_value {
+      b: false
+    }
+  }
+  attr {
+    name: "reverse"
+    type: "bool"
+    default_value {
+      b: false
+    }
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_COMPLEX64
+        type: DT_INT64
+        type: DT_QINT8
+        type: DT_QUINT8
+        type: DT_QINT32
+        type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_COMPLEX128
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+      }
+    }
+  }
+  attr {
+    name: "Tidx"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
+op {
+  name: "DataFormatDimMap"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "y"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "src_format"
+    type: "string"
+    default_value {
+      s: "NHWC"
+    }
+  }
+  attr {
+    name: "dst_format"
+    type: "string"
+    default_value {
+      s: "NCHW"
+    }
+  }
+}
+op {
+  name: "DataFormatVecPermute"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "y"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "src_format"
+    type: "string"
+    default_value {
+      s: "NHWC"
+    }
+  }
+  attr {
+    name: "dst_format"
+    type: "string"
+    default_value {
+      s: "NCHW"
+    }
+  }
 }
 op {
   name: "DatasetToSingleElement"
@@ -8800,6 +9435,14 @@ op {
     }
   }
 }
+op {
+  name: "FlushSummaryWriter"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  is_stateful: true
+}
 op {
   name: "FractionalAvgPool"
   input_arg {
@@ -10367,6 +11010,18 @@ op {
     type: "string"
   }
 }
+op {
+  name: "ImportEvent"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "event"
+    type: DT_STRING
+  }
+  is_stateful: true
+}
 op {
   name: "InTopK"
   input_arg {
@@ -11883,6 +12538,10 @@ op {
     name: "num_parallel_batches"
     type: DT_INT64
   }
+  input_arg {
+    name: "drop_remainder"
+    type: DT_BOOL
+  }
   output_arg {
     name: "handle"
     type: DT_VARIANT
@@ -19353,6 +20012,32 @@ op {
   }
   allows_uninitialized_input: true
 }
+op {
+  name: "RegexReplace"
+  input_arg {
+    name: "input"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "pattern"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "rewrite"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "output"
+    type: DT_STRING
+  }
+  attr {
+    name: "replace_global"
+    type: "bool"
+    default_value {
+      b: true
+    }
+  }
+}
 op {
   name: "Relu"
   input_arg {
@@ -19515,6 +20200,7 @@ op {
     name: "f"
     type: "func"
   }
+  is_stateful: true
 }
 op {
   name: "RemoteFusedGraphExecute"
@@ -24292,6 +24978,37 @@ op {
     }
   }
 }
+op {
+  name: "SlideDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "window_size"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "stride"
+    type: DT_INT64
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "Snapshot"
   input_arg {
@@ -28632,6 +29349,28 @@ op {
     }
   }
 }
+op {
+  name: "SummaryWriter"
+  output_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  attr {
+    name: "shared_name"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  attr {
+    name: "container"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  is_stateful: true
+}
 op {
   name: "Svd"
   input_arg {
@@ -30916,49 +31655,226 @@ op {
   }
 }
 op {
-  name: "Unpack"
+  name: "UniqueWithCountsV2"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "axis"
+    type_attr: "Taxis"
+  }
+  output_arg {
+    name: "y"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "idx"
+    type_attr: "out_idx"
+  }
+  output_arg {
+    name: "count"
+    type_attr: "out_idx"
+  }
+  attr {
+    name: "T"
+    type: "type"
+  }
+  attr {
+    name: "Taxis"
+    type: "type"
+    default_value {
+      type: DT_INT64
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "out_idx"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
+op {
+  name: "Unpack"
+  input_arg {
+    name: "value"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+    number_attr: "num"
+  }
+  attr {
+    name: "num"
+    type: "int"
+    has_minimum: true
+  }
+  attr {
+    name: "T"
+    type: "type"
+  }
+  attr {
+    name: "axis"
+    type: "int"
+    default_value {
+      i: 0
+    }
+  }
+}
+op {
+  name: "UnravelIndex"
+  input_arg {
+    name: "indices"
+    type_attr: "Tidx"
+  }
+  input_arg {
+    name: "dims"
+    type_attr: "Tidx"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "Tidx"
+  }
+  attr {
+    name: "Tidx"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
+op {
+  name: "UnsortedSegmentMax"
+  input_arg {
+    name: "data"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "segment_ids"
+    type_attr: "Tindices"
+  }
+  input_arg {
+    name: "num_segments"
+    type_attr: "Tnumsegments"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_INT64
+        type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+      }
+    }
+  }
+  attr {
+    name: "Tindices"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "Tnumsegments"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
+op {
+  name: "UnsortedSegmentMin"
   input_arg {
-    name: "value"
+    name: "data"
     type_attr: "T"
   }
+  input_arg {
+    name: "segment_ids"
+    type_attr: "Tindices"
+  }
+  input_arg {
+    name: "num_segments"
+    type_attr: "Tnumsegments"
+  }
   output_arg {
     name: "output"
     type_attr: "T"
-    number_attr: "num"
-  }
-  attr {
-    name: "num"
-    type: "int"
-    has_minimum: true
   }
   attr {
     name: "T"
     type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_INT64
+        type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+      }
+    }
   }
   attr {
-    name: "axis"
-    type: "int"
-    default_value {
-      i: 0
+    name: "Tindices"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
     }
   }
-}
-op {
-  name: "UnravelIndex"
-  input_arg {
-    name: "indices"
-    type_attr: "Tidx"
-  }
-  input_arg {
-    name: "dims"
-    type_attr: "Tidx"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "Tidx"
-  }
   attr {
-    name: "Tidx"
+    name: "Tnumsegments"
     type: "type"
     default_value {
       type: DT_INT32
@@ -30972,7 +31888,7 @@ op {
   }
 }
 op {
-  name: "UnsortedSegmentMax"
+  name: "UnsortedSegmentProd"
   input_arg {
     name: "data"
     type_attr: "T"
@@ -31358,6 +32274,39 @@ op {
   }
   is_stateful: true
 }
+op {
+  name: "WriteAudioSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "tensor"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "sample_rate"
+    type: DT_FLOAT
+  }
+  attr {
+    name: "max_outputs"
+    type: "int"
+    default_value {
+      i: 3
+    }
+    has_minimum: true
+    minimum: 1
+  }
+  is_stateful: true
+}
 op {
   name: "WriteFile"
   input_arg {
@@ -31369,6 +32318,180 @@ op {
     type: DT_STRING
   }
 }
+op {
+  name: "WriteGraphSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tensor"
+    type: DT_STRING
+  }
+  is_stateful: true
+}
+op {
+  name: "WriteHistogramSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "values"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    default_value {
+      type: DT_FLOAT
+    }
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_INT64
+        type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+      }
+    }
+  }
+  is_stateful: true
+}
+op {
+  name: "WriteImageSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "tensor"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "bad_color"
+    type: DT_UINT8
+  }
+  attr {
+    name: "max_images"
+    type: "int"
+    default_value {
+      i: 3
+    }
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "T"
+    type: "type"
+    default_value {
+      type: DT_FLOAT
+    }
+    allowed_values {
+      list {
+        type: DT_UINT8
+        type: DT_FLOAT
+        type: DT_HALF
+      }
+    }
+  }
+  is_stateful: true
+}
+op {
+  name: "WriteScalarSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "value"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_INT64
+        type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+      }
+    }
+  }
+  is_stateful: true
+}
+op {
+  name: "WriteSummary"
+  input_arg {
+    name: "writer"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "step"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "tensor"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "summary_metadata"
+    type: DT_STRING
+  }
+  attr {
+    name: "T"
+    type: "type"
+  }
+  is_stateful: true
+}
 op {
   name: "ZerosLike"
   input_arg {
diff --git a/tensorflow/core/ops/scoped_allocator_ops.cc b/tensorflow/core/ops/scoped_allocator_ops.cc
new file mode 100644
index 0000000000000000000000000000000000000000..f053a53f4cf02e89fce53461bfda0cb23756287b
--- /dev/null
+++ b/tensorflow/core/ops/scoped_allocator_ops.cc
@@ -0,0 +1,81 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/common_shape_fns.h"
+#include "tensorflow/core/framework/op.h"
+
+namespace tensorflow {
+
+REGISTER_OP("_ScopedAllocator")
+    .Output("output: T")
+    .Attr("shapes: list(shape)")
+    .Attr("T: type")
+    .Attr("sa_name: string")
+    .Attr("id: int")
+    .Attr("expected_call_count: int")
+    .SetIsStateful()
+    .SetShapeFn(shape_inference::ExplicitShape)
+    .Doc(R"doc(
+Allocates a mutable tensor that becomes available to appropriately annotated
+downstream Ops as backing store for their output tensor allocations via the
+ScopedAllocatorMgr.
+Returns a reference to this value.
+
+This is an experimental op for internal use only.  It is possible to use this
+op in unsafe ways.
+)doc");
+
+REGISTER_OP("_ScopedAllocatorConcat")
+    .Output("output: T")
+    .Input("backing: T")
+    .Input("inputs: N * T")
+    .Attr("shape: shape")
+    .Attr("T: type")
+    .Attr("sa_name: string")
+    .Attr("id: int")
+    .Attr("N: int >= 2")
+    .SetIsStateful()
+    .SetShapeFn(shape_inference::ExplicitShape)
+    .Doc(R"doc(
+Acts like a Concat Op that merges multiple tensors into one, however it must
+only be used in conjunction with a ScopedAllocator which is backing the memory
+of all of its input tensors so that actually it just outputs a read-only
+reference to that ScopedAllocator's backing tensor.
+
+This is an experimental op for internal use only.  It is possible to use this
+op in unsafe ways.
+)doc");
+
+REGISTER_OP("_ScopedAllocatorSplit")
+    .Output("output: N * T")
+    .Input("concat: T")
+    .Input("split: N * T")
+    .Attr("T: type")
+    .Attr("sa_name: string")
+    .Attr("id: int")
+    .Attr("N: int >= 2")
+    .SetIsStateful()
+    .SetShapeFn(shape_inference::ExplicitShape)
+    .Doc(R"doc(
+Acts like a Concat Op that merges multiple tensors into one, however it must
+only be used in conjunction with a ScopedAllocator which is backing the memory
+of all of its input tensors so that actually it just outputs a read-only
+reference to that ScopedAllocator's backing tensor.
+
+This is an experimental op for internal use only.  It is possible to use this
+op in unsafe ways.
+)doc");
+
+}  // end namespace tensorflow
diff --git a/tensorflow/core/ops/string_ops.cc b/tensorflow/core/ops/string_ops.cc
index e4c5bcfb540660a609aca013b795d566e69f54a8..05f216a83e21030443379876ddd160f2ceba6d39 100644
--- a/tensorflow/core/ops/string_ops.cc
+++ b/tensorflow/core/ops/string_ops.cc
@@ -23,6 +23,20 @@ using shape_inference::DimensionHandle;
 using shape_inference::InferenceContext;
 using shape_inference::ShapeHandle;
 
+REGISTER_OP("RegexReplace")
+    .Input("input: string")
+    .Input("pattern: string")
+    .Input("rewrite: string")
+    .Output("output: string")
+    .Attr("replace_global: bool = true")
+    .SetShapeFn([](InferenceContext* c) {
+      ShapeHandle unused;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &unused));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 0, &unused));
+      c->set_output(0, c->input(0));
+      return Status::OK();
+    });
+
 REGISTER_OP("StringToHashBucketFast")
     .Input("input: string")
     .Output("output: int64")
diff --git a/tensorflow/core/ops/summary_ops.cc b/tensorflow/core/ops/summary_ops.cc
index aa7458f903cf76af660c04149ff50ac899987eac..742a221adcb101e3d2d152d60e343b666d3fb96b 100644
--- a/tensorflow/core/ops/summary_ops.cc
+++ b/tensorflow/core/ops/summary_ops.cc
@@ -22,15 +22,7 @@ REGISTER_OP("SummaryWriter")
     .Output("writer: resource")
     .Attr("shared_name: string = ''")
     .Attr("container: string = ''")
-    .SetShapeFn(shape_inference::ScalarShape)
-    .Doc(R"doc(
-Returns a handle to be used to access a summary writer.
-
-The summary writer is an in-graph resource which can be used by ops to write
-summaries to event files.
-
-writer: the summary writer resource. Scalar handle.
-)doc");
+    .SetShapeFn(shape_inference::ScalarShape);
 
 REGISTER_OP("CreateSummaryFileWriter")
     .Input("writer: resource")
@@ -38,17 +30,7 @@ REGISTER_OP("CreateSummaryFileWriter")
     .Input("max_queue: int32")
     .Input("flush_millis: int32")
     .Input("filename_suffix: string")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"doc(
-Creates a summary file writer accessible by the given resource handle.
-
-writer: A handle to the summary writer resource
-logdir: Directory where the event file will be written.
-max_queue: Size of the queue of pending events and summaries.
-flush_millis: How often, in milliseconds, to flush the pending events and
-  summaries to disk.
-filename_suffix: Every event file's name is suffixed with this suffix.
-)doc");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 REGISTER_OP("CreateSummaryDbWriter")
     .Input("writer: resource")
@@ -56,47 +38,15 @@ REGISTER_OP("CreateSummaryDbWriter")
     .Input("experiment_name: string")
     .Input("run_name: string")
     .Input("user_name: string")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"doc(
-Creates summary database writer accessible by given resource handle.
-
-This can be used to write tensors from the execution graph directly
-to a database. Only SQLite is supported right now. This function
-will create the schema if it doesn't exist. Entries in the Users,
-Experiments, and Runs tables will be created automatically if they
-don't already exist.
-
-writer: Handle to SummaryWriter resource to overwrite.
-db_uri: For example "file:/tmp/foo.sqlite".
-experiment_name: Can't contain ASCII control characters or <>. Case
-  sensitive. If empty, then the Run will not be associated with any
-  Experiment.
-run_name: Can't contain ASCII control characters or <>. Case sensitive.
-  If empty, then each Tag will not be associated with any Run.
-user_name: Must be valid as both a DNS label and Linux username. If
-  empty, then the Experiment will not be associated with any User.
-)doc");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 REGISTER_OP("FlushSummaryWriter")
     .Input("writer: resource")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"(
-Flushes the writer's unwritten events.
-
-writer: A handle to the summary writer resource.
-)");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 REGISTER_OP("CloseSummaryWriter")
     .Input("writer: resource")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"(
-Flushes and closes the summary writer.
-
-Also removes it from the resource manager. To reopen, use another
-CreateSummaryFileWriter op.
-
-writer: A handle to the summary writer resource.
-)");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 REGISTER_OP("WriteSummary")
     .Input("writer: resource")
@@ -105,31 +55,12 @@ REGISTER_OP("WriteSummary")
     .Input("tag: string")
     .Input("summary_metadata: string")
     .Attr("T: type")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"doc(
-Outputs a `Summary` protocol buffer with a tensor.
-
-writer: A handle to a summary writer.
-step: The step to write the summary for.
-tensor: A tensor to serialize.
-tag: The summary's tag.
-summary_metadata: Serialized SummaryMetadata protocol buffer containing
- plugin-related metadata for this summary.
-)doc");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 REGISTER_OP("ImportEvent")
     .Input("writer: resource")
     .Input("event: string")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"doc(
-Outputs a `tf.Event` protocol buffer.
-
-When CreateSummaryDbWriter is being used, this op can be useful for
-importing data from event logs.
-
-writer: A handle to a summary writer.
-event: A string containing a binary-encoded tf.Event proto.
-)doc");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 REGISTER_OP("WriteScalarSummary")
     .Input("writer: resource")
@@ -137,17 +68,7 @@ REGISTER_OP("WriteScalarSummary")
     .Input("tag: string")
     .Input("value: T")
     .Attr("T: realnumbertype")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"doc(
-Writes a `Summary` protocol buffer with scalar values.
-
-The input `tag` and `value` must have the scalars.
-
-writer: A handle to a summary writer.
-step: The step to write the summary for.
-tag: Tag for the summary.
-value: Value for the summary.
-)doc");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 REGISTER_OP("WriteHistogramSummary")
     .Input("writer: resource")
@@ -155,21 +76,7 @@ REGISTER_OP("WriteHistogramSummary")
     .Input("tag: string")
     .Input("values: T")
     .Attr("T: realnumbertype = DT_FLOAT")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"doc(
-Writes a `Summary` protocol buffer with a histogram.
-
-The generated
-[`Summary`](https://www.tensorflow.org/code/tensorflow/core/framework/summary.proto)
-has one summary value containing a histogram for `values`.
-
-This op reports an `InvalidArgument` error if any value is not finite.
-
-writer: A handle to a summary writer.
-step: The step to write the summary for.
-tag: Scalar.  Tag to use for the `Summary.Value`.
-values: Any shape. Values to use to build the histogram.
-)doc");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 REGISTER_OP("WriteImageSummary")
     .Input("writer: resource")
@@ -179,52 +86,7 @@ REGISTER_OP("WriteImageSummary")
     .Input("bad_color: uint8")
     .Attr("max_images: int >= 1 = 3")
     .Attr("T: {uint8, float, half} = DT_FLOAT")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"doc(
-Writes a `Summary` protocol buffer with images.
-
-The summary has up to `max_images` summary values containing images. The
-images are built from `tensor` which must be 4-D with shape `[batch_size,
-height, width, channels]` and where `channels` can be:
-
-*  1: `tensor` is interpreted as Grayscale.
-*  3: `tensor` is interpreted as RGB.
-*  4: `tensor` is interpreted as RGBA.
-
-The images have the same number of channels as the input tensor. For float
-input, the values are normalized one image at a time to fit in the range
-`[0, 255]`.  `uint8` values are unchanged.  The op uses two different
-normalization algorithms:
-
-*  If the input values are all positive, they are rescaled so the largest one
-   is 255.
-
-*  If any input value is negative, the values are shifted so input value 0.0
-   is at 127.  They are then rescaled so that either the smallest value is 0,
-   or the largest one is 255.
-
-The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
-build the `tag` of the summary values:
-
-*  If `max_images` is 1, the summary value tag is '*tag*/image'.
-*  If `max_images` is greater than 1, the summary value tags are
-   generated sequentially as '*tag*/image/0', '*tag*/image/1', etc.
-
-The `bad_color` argument is the color to use in the generated images for
-non-finite input values.  It is a `unit8` 1-D tensor of length `channels`.
-Each element must be in the range `[0, 255]` (It represents the value of a
-pixel in the output image).  Non-finite values in the input tensor are
-replaced by this tensor in the output image.  The default value is the color
-red.
-
-writer: A handle to a summary writer.
-step: The step to write the summary for.
-tag: Scalar. Used to build the `tag` attribute of the summary values.
-tensor: 4-D of shape `[batch_size, height, width, channels]` where
-  `channels` is 1, 3, or 4.
-max_images: Max number of batch elements to generate images for.
-bad_color: Color to use for pixels with non-finite values.
-)doc");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 REGISTER_OP("WriteAudioSummary")
     .Input("writer: resource")
@@ -233,41 +95,12 @@ REGISTER_OP("WriteAudioSummary")
     .Input("tensor: float")
     .Input("sample_rate: float")
     .Attr("max_outputs: int >= 1 = 3")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"doc(
-Writes a `Summary` protocol buffer with audio.
-
-The summary has up to `max_outputs` summary values containing audio. The
-audio is built from `tensor` which must be 3-D with shape `[batch_size,
-frames, channels]` or 2-D with shape `[batch_size, frames]`. The values are
-assumed to be in the range of `[-1.0, 1.0]` with a sample rate of `sample_rate`.
-
-The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
-build the `tag` of the summary values:
-
-*  If `max_outputs` is 1, the summary value tag is '*tag*/audio'.
-*  If `max_outputs` is greater than 1, the summary value tags are
-   generated sequentially as '*tag*/audio/0', '*tag*/audio/1', etc.
-
-writer: A handle to a summary writer.
-step: The step to write the summary for.
-tag: Scalar. Used to build the `tag` attribute of the summary values.
-tensor: 2-D of shape `[batch_size, frames]`.
-sample_rate: The sample rate of the signal in hertz.
-max_outputs: Max number of batch elements to generate audio for.
-)doc");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 REGISTER_OP("WriteGraphSummary")
     .Input("writer: resource")
     .Input("step: int64")
     .Input("tensor: string")
-    .SetShapeFn(shape_inference::NoOutputs)
-    .Doc(R"doc(
-Writes a `GraphDef` protocol buffer to a `SummaryWriter`.
-
-writer: Handle of `SummaryWriter`.
-step: The step to write the summary for.
-tensor: A scalar string of the serialized tf.GraphDef proto.
-)doc");
+    .SetShapeFn(shape_inference::NoOutputs);
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/BUILD b/tensorflow/core/platform/cloud/BUILD
index 9ba25dea4fb278cbfaf4080e21beef8a3e9de769..e43639e9c76da92dc6ea70430027f9f05da88d55 100644
--- a/tensorflow/core/platform/cloud/BUILD
+++ b/tensorflow/core/platform/cloud/BUILD
@@ -38,13 +38,24 @@ cc_library(
 
 cc_library(
     name = "file_block_cache",
-    srcs = ["file_block_cache.cc"],
     hdrs = ["file_block_cache.h"],
     copts = tf_copts(),
     visibility = ["//tensorflow:__subpackages__"],
     deps = ["//tensorflow/core:lib"],
 )
 
+cc_library(
+    name = "ram_file_block_cache",
+    srcs = ["ram_file_block_cache.cc"],
+    hdrs = ["ram_file_block_cache.h"],
+    copts = tf_copts(),
+    visibility = ["//visibility:public"],
+    deps = [
+        ":file_block_cache",
+        "//tensorflow/core:lib",
+    ],
+)
+
 cc_library(
     name = "gcs_dns_cache",
     srcs = ["gcs_dns_cache.cc"],
@@ -68,6 +79,18 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "gce_env_utils",
+    srcs = ["gce_env_utils.cc"],
+    hdrs = ["gce_env_utils.h"],
+    copts = tf_copts(),
+    visibility = ["//visibility:private"],
+    deps = [
+        "//tensorflow/core:framework_headers_lib",
+        "//tensorflow/core:lib_internal",
+    ],
+)
+
 cc_library(
     name = "gcs_file_system",
     srcs = ["gcs_file_system.cc"],
@@ -83,6 +106,7 @@ cc_library(
         ":gcs_throttle",
         ":google_auth_provider",
         ":http_request",
+        ":ram_file_block_cache",
         ":retrying_file_system",
         ":retrying_utils",
         ":time_util",
@@ -146,6 +170,7 @@ cc_library(
     visibility = ["//tensorflow:__subpackages__"],
     deps = [
         ":curl_http_request",
+        ":gce_env_utils",
         ":oauth_client",
         ":retrying_utils",
         "//tensorflow/core:lib",
@@ -231,6 +256,21 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "fake_env",
+    srcs = [
+        "fake_env.cc",
+    ],
+    hdrs = [
+        "fake_env.h",
+    ],
+    copts = tf_copts(),
+    deps = [
+        "//tensorflow/core:framework_headers_lib",
+        "//tensorflow/core:lib_internal",
+    ],
+)
+
 tf_cc_test(
     name = "expiring_lru_cache_test",
     size = "small",
@@ -245,12 +285,12 @@ tf_cc_test(
 )
 
 tf_cc_test(
-    name = "file_block_cache_test",
+    name = "ram_file_block_cache_test",
     size = "small",
-    srcs = ["file_block_cache_test.cc"],
+    srcs = ["ram_file_block_cache_test.cc"],
     deps = [
-        ":file_block_cache",
         ":now_seconds_env",
+        ":ram_file_block_cache",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:test",
@@ -336,6 +376,7 @@ tf_cc_test(
         "testdata/service_account_credentials.json",
     ],
     deps = [
+        ":fake_env",
         ":google_auth_provider",
         ":http_request_fake",
         ":oauth_client",
@@ -382,3 +423,15 @@ tf_cc_test(
         "//tensorflow/core:test_main",
     ],
 )
+
+tf_cc_test(
+    name = "gce_env_utils_test",
+    size = "small",
+    srcs = ["gcp_env_utils_test.cc"],
+    deps = [
+        ":fake_env",
+        ":gce_env_utils",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
diff --git a/tensorflow/core/platform/cloud/curl_http_request.cc b/tensorflow/core/platform/cloud/curl_http_request.cc
index 88a5d1e96dc2fcb7d12e2c0891d2f04d64bac594..1ac6a7531b0c0b6f26f2636032388dcaefe894ee 100644
--- a/tensorflow/core/platform/cloud/curl_http_request.cc
+++ b/tensorflow/core/platform/cloud/curl_http_request.cc
@@ -21,9 +21,12 @@ limitations under the License.
 #include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/lib/strings/scanner.h"
 #include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/public/version.h"
 
+#define CHECK_CURL_OK(expr) CHECK_EQ(expr, CURLE_OK)
+
 namespace tensorflow {
 
 namespace {
@@ -129,20 +132,21 @@ CurlHttpRequest::CurlHttpRequest(LibCurl* libcurl, Env* env)
   //       default in //third_party:curl.BUILD and can be customized via an
   //       environment variable.
 
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_VERBOSE, kVerboseOutput);
-  libcurl_->curl_easy_setopt(
+  CHECK_CURL_OK(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_VERBOSE, kVerboseOutput));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(
       curl_, CURLOPT_USERAGENT,
-      strings::StrCat("TensorFlow/", TF_VERSION_STRING).c_str());
+      strings::StrCat("TensorFlow/", TF_VERSION_STRING).c_str()));
   // Do not use signals for timeouts - does not work in multi-threaded programs.
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_NOSIGNAL, 1L);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_HTTP_VERSION,
-                             CURL_HTTP_VERSION_2_0);
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_NOSIGNAL, 1L));
+
+  // TODO(b/74351157): Enable HTTP/2.
 
   // Set up the progress meter.
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_NOPROGRESS, 0ULL);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_XFERINFODATA, this);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_XFERINFOFUNCTION,
-                             &CurlHttpRequest::ProgressCallback);
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_NOPROGRESS, 0ULL));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_XFERINFODATA, this));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_XFERINFOFUNCTION,
+                                           &CurlHttpRequest::ProgressCallback));
 
   // If response buffer is not set, libcurl will print results to stdout,
   // so we always set it.
@@ -175,13 +179,13 @@ void CurlHttpRequest::SetUri(const string& uri) {
   CheckNotSent();
   is_uri_set_ = true;
   uri_ = uri;
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_URL, uri.c_str());
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_URL, uri.c_str()));
 }
 
 void CurlHttpRequest::SetRange(uint64 start, uint64 end) {
   CheckNotSent();
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_RANGE,
-                             strings::StrCat(start, "-", end).c_str());
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(
+      curl_, CURLOPT_RANGE, strings::StrCat(start, "-", end).c_str()));
 }
 
 void CurlHttpRequest::AddHeader(const string& name, const string& value) {
@@ -206,11 +210,19 @@ void CurlHttpRequest::AddAuthBearerHeader(const string& auth_token) {
   }
 }
 
+void CurlHttpRequest::SetRequestStats(RequestStats* stats) {
+  CheckNotSent();
+  CHECK(stats_ == nullptr) << "SetRequestStats already called";
+  stats_ = stats;
+}
+
 void CurlHttpRequest::SetDeleteRequest() {
   CheckNotSent();
   CheckMethodNotSet();
   is_method_set_ = true;
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_CUSTOMREQUEST, "DELETE");
+  method_ = RequestMethod::kDelete;
+  CHECK_CURL_OK(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_CUSTOMREQUEST, "DELETE"));
 }
 
 Status CurlHttpRequest::SetPutFromFile(const string& body_filepath,
@@ -218,6 +230,7 @@ Status CurlHttpRequest::SetPutFromFile(const string& body_filepath,
   CheckNotSent();
   CheckMethodNotSet();
   is_method_set_ = true;
+  method_ = RequestMethod::kPut;
   if (put_body_) {
     fclose(put_body_);
   }
@@ -232,9 +245,9 @@ Status CurlHttpRequest::SetPutFromFile(const string& body_filepath,
 
   curl_headers_ = libcurl_->curl_slist_append(
       curl_headers_, strings::StrCat("Content-Length: ", size).c_str());
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_PUT, 1);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
-                             reinterpret_cast<void*>(put_body_));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_PUT, 1));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
+                                           reinterpret_cast<void*>(put_body_)));
   // Using the default CURLOPT_READFUNCTION, which is doing an fread() on the
   // FILE * userdata set with CURLOPT_READDATA.
   return Status::OK();
@@ -244,26 +257,28 @@ void CurlHttpRequest::SetPutEmptyBody() {
   CheckNotSent();
   CheckMethodNotSet();
   is_method_set_ = true;
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_PUT, 1);
-  curl_headers_ =
-      libcurl_->curl_slist_append(curl_headers_, "Content-Length: 0");
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
-                             &CurlHttpRequest::ReadCallback);
+  method_ = RequestMethod::kPut;
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_PUT, 1));
+  AddHeader("Content-Length", "0");
+  AddHeader("Transfer-Encoding", "identity");
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
+                                           reinterpret_cast<void*>(this)));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
+                                           &CurlHttpRequest::ReadCallback));
 }
 
 void CurlHttpRequest::SetPostFromBuffer(const char* buffer, size_t size) {
   CheckNotSent();
   CheckMethodNotSet();
   is_method_set_ = true;
+  method_ = RequestMethod::kPost;
   curl_headers_ = libcurl_->curl_slist_append(
       curl_headers_, strings::StrCat("Content-Length: ", size).c_str());
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_POST, 1);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
-                             &CurlHttpRequest::ReadCallback);
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_POST, 1));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
+                                           reinterpret_cast<void*>(this)));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
+                                           &CurlHttpRequest::ReadCallback));
   post_body_buffer_ = StringPiece(buffer, size);
 }
 
@@ -271,13 +286,14 @@ void CurlHttpRequest::SetPostEmptyBody() {
   CheckNotSent();
   CheckMethodNotSet();
   is_method_set_ = true;
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_POST, 1);
-  curl_headers_ =
-      libcurl_->curl_slist_append(curl_headers_, "Content-Length: 0");
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
-                             &CurlHttpRequest::ReadCallback);
+  method_ = RequestMethod::kPost;
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_POST, 1));
+  AddHeader("Content-Length", "0");
+  AddHeader("Transfer-Encoding", "identity");
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
+                                           reinterpret_cast<void*>(this)));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
+                                           &CurlHttpRequest::ReadCallback));
 }
 
 void CurlHttpRequest::SetResultBuffer(std::vector<char>* out_buffer) {
@@ -287,10 +303,10 @@ void CurlHttpRequest::SetResultBuffer(std::vector<char>* out_buffer) {
   out_buffer->clear();
   response_buffer_ = out_buffer;
 
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEFUNCTION,
-                             &CurlHttpRequest::WriteCallback);
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEDATA,
+                                           reinterpret_cast<void*>(this)));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEFUNCTION,
+                                           &CurlHttpRequest::WriteCallback));
 }
 
 void CurlHttpRequest::SetResultBufferDirect(char* buffer, size_t size) {
@@ -298,11 +314,10 @@ void CurlHttpRequest::SetResultBufferDirect(char* buffer, size_t size) {
   CheckNotSent();
 
   direct_response_ = DirectResponseState{buffer, size, 0};
-
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEFUNCTION,
-                             &CurlHttpRequest::WriteCallbackDirect);
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEDATA,
+                                           reinterpret_cast<void*>(this)));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(
+      curl_, CURLOPT_WRITEFUNCTION, &CurlHttpRequest::WriteCallbackDirect));
 }
 
 bool CurlHttpRequest::IsDirectResponse() const {
@@ -406,37 +421,58 @@ Status CurlHttpRequest::Send() {
   is_sent_ = true;
 
   if (curl_headers_) {
-    libcurl_->curl_easy_setopt(curl_, CURLOPT_HTTPHEADER, curl_headers_);
+    CHECK_CURL_OK(
+        libcurl_->curl_easy_setopt(curl_, CURLOPT_HTTPHEADER, curl_headers_));
   }
   if (resolve_list_) {
-    libcurl_->curl_easy_setopt(curl_, CURLOPT_RESOLVE, resolve_list_);
+    CHECK_CURL_OK(
+        libcurl_->curl_easy_setopt(curl_, CURLOPT_RESOLVE, resolve_list_));
   }
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_HEADERDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_HEADERFUNCTION,
-                             &CurlHttpRequest::HeaderCallback);
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_HEADERDATA,
+                                           reinterpret_cast<void*>(this)));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_HEADERFUNCTION,
+                                           &CurlHttpRequest::HeaderCallback));
 
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_TIMEOUT, request_timeout_secs_);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_CONNECTTIMEOUT,
-                             connect_timeout_secs_);
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_TIMEOUT,
+                                           request_timeout_secs_));
+  CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_CONNECTTIMEOUT,
+                                           connect_timeout_secs_));
 
   char error_buffer[CURL_ERROR_SIZE] = {0};
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_ERRORBUFFER, error_buffer);
+  CHECK_CURL_OK(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_ERRORBUFFER, error_buffer));
+
+  if (stats_ != nullptr) {
+    stats_->RecordRequest(this, uri_, method_);
+  }
 
-  const auto curl_result = libcurl_->curl_easy_perform(curl_);
+  const CURLcode curl_result = libcurl_->curl_easy_perform(curl_);
+  TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+      curl_result, "Performing request. Detailed error: ", error_buffer);
+
+  auto get_error_message = [this, curl_result, &error_buffer]() -> string {
+    StringPiece response = GetResponse();
+    string error_message = strings::StrCat(
+        "Error executing an HTTP request (HTTP response code ", response_code_,
+        ", error code ", curl_result, ", error message '", error_buffer, "')");
+    if (!response.empty()) {
+      return strings::StrCat(
+          error_message, ", response '",
+          response.substr(0,
+                          std::min(response.size(), response_to_error_limit_)),
+          "'");
+    }
+    return error_message;
+  };
 
   double written_size = 0;
-  libcurl_->curl_easy_getinfo(curl_, CURLINFO_SIZE_DOWNLOAD, &written_size);
-
-  libcurl_->curl_easy_getinfo(curl_, CURLINFO_RESPONSE_CODE, &response_code_);
+  CHECK_CURL_OK(libcurl_->curl_easy_getinfo(curl_, CURLINFO_SIZE_DOWNLOAD,
+                                            &written_size));
 
-  const auto& error_message = strings::StrCat(
-      "Error executing an HTTP request (HTTP response code ", response_code_,
-      ", error code ", curl_result, ", error message '", error_buffer, "')");
+  CHECK_CURL_OK(libcurl_->curl_easy_getinfo(curl_, CURLINFO_RESPONSE_CODE,
+                                            &response_code_));
 
   Status result;
-  StringPiece response = GetResponse();
-  string extended_error_message;
   switch (response_code_) {
     // The group of response codes indicating that the request achieved
     // the expected goal.
@@ -444,14 +480,9 @@ Status CurlHttpRequest::Send() {
     case 201:  // Created
     case 204:  // No Content
     case 206:  // Partial Content
-      if (curl_result != CURLE_OK) {
-        // This means the server executed the request successfully, but then
-        // something went wrong during the transmission of the response.
-        result = errors::Unavailable(error_message);
-      } else {
-        result = Status::OK();
-      }
+      result = Status::OK();
       break;
+
     case 416:  // Requested Range Not Satisfiable
       // The requested range had no overlap with the available range.
       // This doesn't indicate an error, but this does mean an empty response
@@ -463,27 +494,19 @@ Status CurlHttpRequest::Send() {
     // INVALID_ARGUMENT indicates a problem with how the request is constructed.
     case 400:  // Bad Request
     case 411:  // Length Required
-      result = errors::InvalidArgument(error_message);
+      result = errors::InvalidArgument(get_error_message());
       break;
 
     // PERMISSION_DENIED indicates an authentication or an authorization issue.
     case 401:  // Unauthorized
     case 403:  // Forbidden
-      if (!response.empty()) {
-        extended_error_message = strings::StrCat(
-            error_message, ", response ",
-            response.substr(
-                0, std::min(response.size(), response_to_error_limit_)));
-        result = errors::PermissionDenied(extended_error_message);
-      } else {
-        result = errors::PermissionDenied(error_message);
-      }
+      result = errors::PermissionDenied(get_error_message());
       break;
 
     // NOT_FOUND indicates that the requested resource does not exist.
     case 404:  // Not found
     case 410:  // Gone
-      result = errors::NotFound(error_message);
+      result = errors::NotFound(get_error_message());
       break;
 
     // FAILED_PRECONDITION indicates that the request failed because some
@@ -493,26 +516,35 @@ Status CurlHttpRequest::Send() {
     case 303:  // See Other
     case 304:  // Not Modified
     case 307:  // Temporary Redirect
-    case 308:  // Resume Incomplete
     case 412:  // Precondition Failed
     case 413:  // Payload Too Large
-      result = errors::FailedPrecondition(error_message);
+      result = errors::FailedPrecondition(get_error_message());
       break;
 
     // UNAVAILABLE indicates a problem that can go away if the request
-    // is just retried without any modification.
+    // is just retried without any modification. 308 return codes are intended
+    // for write requests that can be retried. See the documentation and the
+    // official library:
+    // https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload
+    // https://github.com/google/apitools/blob/master/apitools/base/py/transfer.py
+    case 308:  // Resume Incomplete
     case 409:  // Conflict
     case 429:  // Too Many Requests
     case 500:  // Internal Server Error
     case 502:  // Bad Gateway
     case 503:  // Service Unavailable
     default:   // All other HTTP response codes also should be retried.
-      result = errors::Unavailable(error_message);
+      result = errors::Unavailable(get_error_message());
       break;
   }
   if (!result.ok()) {
     response_buffer_->clear();
   }
+
+  if (stats_ != nullptr) {
+    stats_->RecordResponse(this, uri_, method_, result);
+  }
+
   return result;
 }
 
@@ -596,4 +628,12 @@ int CurlHttpRequest::ProgressCallback(void* this_object, curl_off_t dltotal,
   return 0;
 }
 
+Status CURLcodeToStatus(CURLcode code) {
+  // Return Unavailable so that callers retry by default. We should probably
+  // distinguish between permanent and temporary failures.
+  return errors::Unavailable("Error executing an HTTP request (error code ",
+                             code, ", error message '",
+                             curl_easy_strerror(code), "')");
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/curl_http_request.h b/tensorflow/core/platform/cloud/curl_http_request.h
index cfa26f2b795a6cc33aba308597c77088362f1e1b..e658948ab9b9dd95eee1889b390a48beb2f74f9e 100644
--- a/tensorflow/core/platform/cloud/curl_http_request.h
+++ b/tensorflow/core/platform/cloud/curl_http_request.h
@@ -75,6 +75,8 @@ class CurlHttpRequest : public HttpRequest {
   /// Sets the 'Authorization' header to the value of 'Bearer ' + auth_token.
   void AddAuthBearerHeader(const string& auth_token) override;
 
+  void SetRequestStats(RequestStats* stats) override;
+
   /// Makes the request a DELETE request.
   void SetDeleteRequest() override;
 
@@ -186,6 +188,8 @@ class CurlHttpRequest : public HttpRequest {
   curl_slist* curl_headers_ = nullptr;
   curl_slist* resolve_list_ = nullptr;
 
+  RequestStats* stats_ = nullptr;
+
   std::vector<char> default_response_buffer_;
 
   std::unordered_map<string, string> response_headers_;
@@ -213,6 +217,7 @@ class CurlHttpRequest : public HttpRequest {
 
   // Store the URI to help disambiguate requests when errors occur.
   string uri_;
+  RequestMethod method_ = RequestMethod::kGet;
 
   // Limit the size of a http response that is copied into an error message.
   const size_t response_to_error_limit_ = 500;
@@ -229,26 +234,28 @@ class LibCurl {
 
   virtual CURL* curl_easy_init() = 0;
   virtual CURLcode curl_easy_setopt(CURL* curl, CURLoption option,
-                                    uint64 param) = 0;
+                                    uint64 param) TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_setopt(CURL* curl, CURLoption option,
-                                    const char* param) = 0;
+                                    const char* param) TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_setopt(CURL* curl, CURLoption option,
-                                    void* param) = 0;
-  virtual CURLcode curl_easy_setopt(CURL* curl, CURLoption option,
-                                    size_t (*param)(void*, size_t, size_t,
-                                                    FILE*)) = 0;
+                                    void* param) TF_MUST_USE_RESULT = 0;
+  virtual CURLcode curl_easy_setopt(
+      CURL* curl, CURLoption option,
+      size_t (*param)(void*, size_t, size_t, FILE*)) TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_setopt(CURL* curl, CURLoption option,
                                     size_t (*param)(const void*, size_t, size_t,
-                                                    void*)) = 0;
+                                                    void*))
+      TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_setopt(
       CURL* curl, CURLoption option,
       int (*param)(void* clientp, curl_off_t dltotal, curl_off_t dlnow,
-                   curl_off_t ultotal, curl_off_t ulnow)) = 0;
-  virtual CURLcode curl_easy_perform(CURL* curl) = 0;
+                   curl_off_t ultotal,
+                   curl_off_t ulnow)) TF_MUST_USE_RESULT = 0;
+  virtual CURLcode curl_easy_perform(CURL* curl) TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_getinfo(CURL* curl, CURLINFO info,
-                                     uint64* value) = 0;
+                                     uint64* value) TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_getinfo(CURL* curl, CURLINFO info,
-                                     double* value) = 0;
+                                     double* value) TF_MUST_USE_RESULT = 0;
   virtual void curl_easy_cleanup(CURL* curl) = 0;
   virtual curl_slist* curl_slist_append(curl_slist* list, const char* str) = 0;
   virtual void curl_slist_free_all(curl_slist* list) = 0;
@@ -258,6 +265,17 @@ class LibCurl {
   virtual const char* curl_easy_strerror(CURLcode errornum) = 0;
 };
 
+Status CURLcodeToStatus(CURLcode code);
+
+#define TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(_code, ...)                    \
+  do {                                                                      \
+    if (_code != CURLE_OK) {                                                \
+      ::tensorflow::Status _status = ::tensorflow::CURLcodeToStatus(_code); \
+      ::tensorflow::errors::AppendToMessage(&_status, __VA_ARGS__);         \
+      return _status;                                                       \
+    }                                                                       \
+  } while (0)
+
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_CORE_PLATFORM_CLOUD_CURL_HTTP_REQUEST_H_
diff --git a/tensorflow/core/platform/cloud/curl_http_request_test.cc b/tensorflow/core/platform/cloud/curl_http_request_test.cc
index 86d26a028733c303b85390b0be8fb8808c6e082a..522b717568777b22865d38a214086b12e3525e48 100644
--- a/tensorflow/core/platform/cloud/curl_http_request_test.cc
+++ b/tensorflow/core/platform/cloud/curl_http_request_test.cc
@@ -346,7 +346,6 @@ TEST(CurlHttpRequestTest, GetRequest_Empty) {
 
 TEST(CurlHttpRequestTest, GetRequest_RangeOutOfBound) {
   FakeLibCurl libcurl("get response", 416);
-  libcurl.curl_easy_perform_result_ = CURLE_WRITE_ERROR;
   CurlHttpRequest http_request(&libcurl);
 
   std::vector<char> scratch;
@@ -377,10 +376,10 @@ TEST(CurlHttpRequestTest, GetRequest_503) {
   const auto& status = http_request.Send();
   EXPECT_EQ(error::UNAVAILABLE, status.code());
   EXPECT_EQ(
-      "Error executing an HTTP request (HTTP response code 503, "
-      "error code 23, error message '')",
+      "Error executing an HTTP request (error code 23, error message 'Failed "
+      "writing received data to disk/application')\n\tPerforming request. "
+      "Detailed error: ",
       status.error_message());
-  EXPECT_EQ(503, http_request.GetResponseCode());
 }
 
 TEST(CurlHttpRequestTest, GetRequest_HttpCode0) {
@@ -396,8 +395,9 @@ TEST(CurlHttpRequestTest, GetRequest_HttpCode0) {
   const auto& status = http_request.Send();
   EXPECT_EQ(error::UNAVAILABLE, status.code());
   EXPECT_EQ(
-      "Error executing an HTTP request (HTTP response code 0, "
-      "error code 28, error message 'Operation timed out')",
+      "Error executing an HTTP request (error code 28, error message 'Timeout "
+      "was reached')\n\tPerforming request. Detailed error: Operation timed "
+      "out",
       status.error_message());
   EXPECT_EQ(0, http_request.GetResponseCode());
 }
@@ -476,9 +476,10 @@ TEST(CurlHttpRequestTest, PutRequest_WithoutBody) {
   EXPECT_TRUE(libcurl.is_initialized_);
   EXPECT_EQ("http://www.testuri.com", libcurl.url_);
   EXPECT_EQ("", libcurl.custom_request_);
-  EXPECT_EQ(2, libcurl.headers_->size());
+  EXPECT_EQ(3, libcurl.headers_->size());
   EXPECT_EQ("Authorization: Bearer fake-bearer", (*libcurl.headers_)[0]);
   EXPECT_EQ("Content-Length: 0", (*libcurl.headers_)[1]);
+  EXPECT_EQ("Transfer-Encoding: identity", (*libcurl.headers_)[2]);
   EXPECT_TRUE(libcurl.is_put_);
   EXPECT_EQ("", libcurl.posted_content_);
 }
@@ -517,9 +518,10 @@ TEST(CurlHttpRequestTest, PostRequest_WithoutBody) {
   EXPECT_TRUE(libcurl.is_initialized_);
   EXPECT_EQ("http://www.testuri.com", libcurl.url_);
   EXPECT_EQ("", libcurl.custom_request_);
-  EXPECT_EQ(2, libcurl.headers_->size());
+  EXPECT_EQ(3, libcurl.headers_->size());
   EXPECT_EQ("Authorization: Bearer fake-bearer", (*libcurl.headers_)[0]);
   EXPECT_EQ("Content-Length: 0", (*libcurl.headers_)[1]);
+  EXPECT_EQ("Transfer-Encoding: identity", (*libcurl.headers_)[2]);
   EXPECT_TRUE(libcurl.is_post_);
   EXPECT_EQ("", libcurl.posted_content_);
 }
@@ -628,10 +630,214 @@ TEST(CurlHttpRequestTest, ProgressIsStuck) {
   auto status = http_request.Send();
   EXPECT_EQ(error::UNAVAILABLE, status.code());
   EXPECT_EQ(
-      "Error executing an HTTP request (HTTP response code 200, "
-      "error code 42, error message '')",
+      "Error executing an HTTP request (error code 42, error message "
+      "'Operation was aborted by an application callback')\n\tPerforming "
+      "request. Detailed error: ",
       status.error_message());
 }
 
+class TestStats : public HttpRequest::RequestStats {
+ public:
+  ~TestStats() override = default;
+
+  void RecordRequest(const HttpRequest* request, const string& uri,
+                     HttpRequest::RequestMethod method) override {
+    has_recorded_request_ = true;
+    record_request_request_ = request;
+    record_request_uri_ = uri;
+    record_request_method_ = method;
+  }
+
+  void RecordResponse(const HttpRequest* request, const string& uri,
+                      HttpRequest::RequestMethod method,
+                      const Status& result) override {
+    has_recorded_response_ = true;
+    record_response_request_ = request;
+    record_response_uri_ = uri;
+    record_response_method_ = method;
+    record_response_result_ = result;
+  }
+
+  const HttpRequest* record_request_request_ = nullptr;
+  string record_request_uri_ = "http://www.testuri.com";
+  HttpRequest::RequestMethod record_request_method_ =
+      HttpRequest::RequestMethod::kGet;
+
+  const HttpRequest* record_response_request_ = nullptr;
+  string record_response_uri_ = "http://www.testuri.com";
+  HttpRequest::RequestMethod record_response_method_ =
+      HttpRequest::RequestMethod::kGet;
+  Status record_response_result_;
+
+  bool has_recorded_request_ = false;
+  bool has_recorded_response_ = false;
+};
+
+class StatsTestFakeLibCurl : public FakeLibCurl {
+ public:
+  StatsTestFakeLibCurl(TestStats* stats, const string& response_content,
+                       uint64 response_code)
+      : FakeLibCurl(response_content, response_code), stats_(stats) {}
+  CURLcode curl_easy_perform(CURL* curl) override {
+    CHECK(!performed_request_);
+    performed_request_ = true;
+    stats_had_recorded_request_ = stats_->has_recorded_request_;
+    stats_had_recorded_response_ = stats_->has_recorded_response_;
+    return FakeLibCurl::curl_easy_perform(curl);
+  }
+
+  TestStats* stats_;
+  bool performed_request_ = false;
+  bool stats_had_recorded_request_ = false;
+  bool stats_had_recorded_response_ = false;
+};
+
+TEST(CurlHttpRequestTest, StatsGetSuccessful) {
+  TestStats stats;
+  StatsTestFakeLibCurl libcurl(&stats, "get response", 200);
+  CurlHttpRequest http_request(&libcurl);
+
+  std::vector<char> scratch;
+  scratch.insert(scratch.begin(), kTestContent.begin(), kTestContent.end());
+  scratch.reserve(100);
+
+  http_request.SetRequestStats(&stats);
+
+  http_request.SetUri("http://www.testuri.com");
+  http_request.AddAuthBearerHeader("fake-bearer");
+  http_request.SetRange(100, 199);
+  http_request.SetResultBuffer(&scratch);
+  TF_EXPECT_OK(http_request.Send());
+
+  EXPECT_EQ("get response", string(scratch.begin(), scratch.end()));
+
+  // Check interaction with stats.
+  ASSERT_TRUE(stats.has_recorded_request_);
+  EXPECT_EQ(&http_request, stats.record_request_request_);
+  EXPECT_EQ("http://www.testuri.com", stats.record_request_uri_);
+  EXPECT_EQ(HttpRequest::RequestMethod::kGet, stats.record_request_method_);
+
+  ASSERT_TRUE(stats.has_recorded_response_);
+  EXPECT_EQ(&http_request, stats.record_response_request_);
+  EXPECT_EQ("http://www.testuri.com", stats.record_response_uri_);
+  EXPECT_EQ(HttpRequest::RequestMethod::kGet, stats.record_response_method_);
+  TF_EXPECT_OK(stats.record_response_result_);
+
+  // Check interaction with libcurl.
+  EXPECT_TRUE(libcurl.performed_request_);
+  EXPECT_TRUE(libcurl.stats_had_recorded_request_);
+  EXPECT_FALSE(libcurl.stats_had_recorded_response_);
+}
+
+TEST(CurlHttpRequestTest, StatsGetNotFound) {
+  TestStats stats;
+  StatsTestFakeLibCurl libcurl(&stats, "get other response", 404);
+  CurlHttpRequest http_request(&libcurl);
+
+  std::vector<char> scratch;
+  scratch.insert(scratch.begin(), kTestContent.begin(), kTestContent.end());
+  scratch.reserve(100);
+
+  http_request.SetRequestStats(&stats);
+
+  http_request.SetUri("http://www.testuri.com");
+  http_request.AddAuthBearerHeader("fake-bearer");
+  http_request.SetRange(100, 199);
+  http_request.SetResultBuffer(&scratch);
+  Status s = http_request.Send();
+
+  // Check interaction with stats.
+  ASSERT_TRUE(stats.has_recorded_request_);
+  EXPECT_EQ(&http_request, stats.record_request_request_);
+  EXPECT_EQ("http://www.testuri.com", stats.record_request_uri_);
+  EXPECT_EQ(HttpRequest::RequestMethod::kGet, stats.record_request_method_);
+
+  ASSERT_TRUE(stats.has_recorded_response_);
+  EXPECT_EQ(&http_request, stats.record_response_request_);
+  EXPECT_EQ("http://www.testuri.com", stats.record_response_uri_);
+  EXPECT_EQ(HttpRequest::RequestMethod::kGet, stats.record_response_method_);
+  EXPECT_TRUE(errors::IsNotFound(stats.record_response_result_));
+  EXPECT_EQ(s, stats.record_response_result_);
+
+  // Check interaction with libcurl.
+  EXPECT_TRUE(libcurl.performed_request_);
+  EXPECT_TRUE(libcurl.stats_had_recorded_request_);
+  EXPECT_FALSE(libcurl.stats_had_recorded_response_);
+}
+
+TEST(CurlHttpRequestTest, StatsPost) {
+  TestStats stats;
+
+  FakeLibCurl libcurl("", 200);
+  CurlHttpRequest http_request(&libcurl);
+
+  http_request.SetRequestStats(&stats);
+
+  string content = "post body content";
+
+  http_request.SetUri("http://www.testuri.com");
+  http_request.SetPostFromBuffer(content.c_str(), content.size());
+  TF_EXPECT_OK(http_request.Send());
+
+  // Check interaction with stats.
+  ASSERT_TRUE(stats.has_recorded_request_);
+  EXPECT_EQ(&http_request, stats.record_request_request_);
+  EXPECT_EQ("http://www.testuri.com", stats.record_request_uri_);
+  EXPECT_EQ(HttpRequest::RequestMethod::kPost, stats.record_request_method_);
+
+  ASSERT_TRUE(stats.has_recorded_response_);
+  EXPECT_EQ(&http_request, stats.record_response_request_);
+  EXPECT_EQ("http://www.testuri.com", stats.record_response_uri_);
+  EXPECT_EQ(HttpRequest::RequestMethod::kPost, stats.record_response_method_);
+  TF_EXPECT_OK(stats.record_response_result_);
+}
+
+TEST(CurlHttpRequestTest, StatsDelete) {
+  TestStats stats;
+
+  FakeLibCurl libcurl("", 200);
+  CurlHttpRequest http_request(&libcurl);
+  http_request.SetRequestStats(&stats);
+  http_request.SetUri("http://www.testuri.com");
+  http_request.SetDeleteRequest();
+  TF_EXPECT_OK(http_request.Send());
+
+  // Check interaction with stats.
+  ASSERT_TRUE(stats.has_recorded_request_);
+  EXPECT_EQ(&http_request, stats.record_request_request_);
+  EXPECT_EQ("http://www.testuri.com", stats.record_request_uri_);
+  EXPECT_EQ(HttpRequest::RequestMethod::kDelete, stats.record_request_method_);
+
+  ASSERT_TRUE(stats.has_recorded_response_);
+  EXPECT_EQ(&http_request, stats.record_response_request_);
+  EXPECT_EQ("http://www.testuri.com", stats.record_response_uri_);
+  EXPECT_EQ(HttpRequest::RequestMethod::kDelete, stats.record_response_method_);
+  TF_EXPECT_OK(stats.record_response_result_);
+}
+
+TEST(CurlHttpRequestTest, StatsPut) {
+  TestStats stats;
+
+  FakeLibCurl libcurl("", 200);
+  CurlHttpRequest http_request(&libcurl);
+  http_request.SetRequestStats(&stats);
+  http_request.SetUri("http://www.testuri.com");
+  http_request.AddAuthBearerHeader("fake-bearer");
+  http_request.SetPutEmptyBody();
+  TF_EXPECT_OK(http_request.Send());
+
+  // Check interaction with stats.
+  ASSERT_TRUE(stats.has_recorded_request_);
+  EXPECT_EQ(&http_request, stats.record_request_request_);
+  EXPECT_EQ("http://www.testuri.com", stats.record_request_uri_);
+  EXPECT_EQ(HttpRequest::RequestMethod::kPut, stats.record_request_method_);
+
+  ASSERT_TRUE(stats.has_recorded_response_);
+  EXPECT_EQ(&http_request, stats.record_response_request_);
+  EXPECT_EQ("http://www.testuri.com", stats.record_response_uri_);
+  EXPECT_EQ(HttpRequest::RequestMethod::kPut, stats.record_response_method_);
+  TF_EXPECT_OK(stats.record_response_result_);
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/fake_env.cc b/tensorflow/core/platform/cloud/fake_env.cc
new file mode 100644
index 0000000000000000000000000000000000000000..221166839ed10fb80967f8b577abea9a0e0eaa86
--- /dev/null
+++ b/tensorflow/core/platform/cloud/fake_env.cc
@@ -0,0 +1,62 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/cloud/fake_env.h"
+
+namespace tensorflow {
+namespace test {
+
+Status FakeEnv::FakeRandomAccessFile::Read(uint64 offset, size_t n,
+                                           StringPiece* result,
+                                           char* scratch) const {
+  CHECK_EQ(offset, 0);
+  CHECK_EQ(n, 256);
+  Status s;
+  string platform;
+  switch (env_type_) {
+    case kGoogle: {
+      platform = "Google\n  ";
+      s = errors::OutOfRange("");
+      break;
+    }
+    case kGce: {
+      platform = "  Google Compute Engine\n  ";
+      s = errors::OutOfRange("");
+      break;
+    }
+    case kLocal: {
+      platform = "HP Linux Workstation";
+      s = Status::OK();
+      break;
+    }
+    case kBad: {
+      platform = "";
+      s = errors::Internal("Expected");
+      break;
+    }
+  }
+  platform.copy(scratch, platform.length());
+  *result = StringPiece(scratch, platform.length());
+  return s;
+}
+
+Status FakeEnv::NewRandomAccessFile(const string& fname,
+                                    std::unique_ptr<RandomAccessFile>* result) {
+  result->reset(new FakeRandomAccessFile(env_type_));
+  return Status::OK();
+}
+
+}  // namespace test
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/fake_env.h b/tensorflow/core/platform/cloud/fake_env.h
new file mode 100644
index 0000000000000000000000000000000000000000..7c162d9d66ccec516925224d24ace1763b95b1a1
--- /dev/null
+++ b/tensorflow/core/platform/cloud/fake_env.h
@@ -0,0 +1,60 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_PLATFORM_CLOUD_FAKE_ENV_H_
+#define TENSORFLOW_CORE_PLATFORM_CLOUD_FAKE_ENV_H_
+
+#include "tensorflow/core/platform/env.h"
+
+namespace tensorflow {
+namespace test {
+
+/// Env implementation that stubs out file reads and the current time.
+class FakeEnv : public EnvWrapper {
+ public:
+  enum EnvType {
+    kGoogle,
+    kGce,
+    kLocal,
+    kBad,
+  };
+
+  FakeEnv(EnvType env_type) : EnvWrapper(Env::Default()), env_type_(env_type) {}
+
+  class FakeRandomAccessFile : public RandomAccessFile {
+   public:
+    FakeRandomAccessFile(EnvType env_type) : env_type_(env_type) {}
+
+    Status Read(uint64 offset, size_t n, StringPiece* result,
+                char* scratch) const override;
+
+   private:
+    EnvType env_type_;
+  };
+
+  Status NewRandomAccessFile(
+      const string& fname, std::unique_ptr<RandomAccessFile>* result) override;
+
+  uint64 NowSeconds() override { return now; }
+  uint64 now = 10000;
+
+ private:
+  EnvType env_type_;
+};
+
+}  // namespace test
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_PLATFORM_CLOUD_FAKE_ENV_H_
diff --git a/tensorflow/core/platform/cloud/file_block_cache.h b/tensorflow/core/platform/cloud/file_block_cache.h
index 5c180e2332042af3ae938c2685ac416952b00187..da167882470bfa3d833faeb7f031fdd7064aba35 100644
--- a/tensorflow/core/platform/cloud/file_block_cache.h
+++ b/tensorflow/core/platform/cloud/file_block_cache.h
@@ -32,7 +32,7 @@ limitations under the License.
 
 namespace tensorflow {
 
-/// \brief An LRU block cache of file contents, keyed by {filename, offset}.
+/// \brief A block cache of file contents, keyed by {filename, offset}.
 ///
 /// This class should be shared by read-only random access files on a remote
 /// filesystem (e.g. GCS).
@@ -48,27 +48,7 @@ class FileBlockCache {
                                size_t* bytes_transferred)>
       BlockFetcher;
 
-  FileBlockCache(size_t block_size, size_t max_bytes, uint64 max_staleness,
-                 BlockFetcher block_fetcher, Env* env = Env::Default())
-      : block_size_(block_size),
-        max_bytes_(max_bytes),
-        max_staleness_(max_staleness),
-        block_fetcher_(block_fetcher),
-        env_(env) {
-    if (max_staleness_ > 0) {
-      pruning_thread_.reset(env_->StartThread(ThreadOptions(), "TF_prune_FBC",
-                                              [this] { Prune(); }));
-    }
-  }
-
-  ~FileBlockCache() {
-    if (pruning_thread_) {
-      stop_pruning_thread_.Notify();
-      // Destroying pruning_thread_ will block until Prune() receives the above
-      // notification and returns.
-      pruning_thread_.reset();
-    }
-  }
+  virtual ~FileBlockCache() {}
 
   /// Read `n` bytes from `filename` starting at `offset` into `out`. This
   /// method will return:
@@ -84,143 +64,22 @@ class FileBlockCache {
   ///    placed in `out`.
   /// 4) OK otherwise (i.e. the read succeeded, and at least one byte was placed
   ///    in `out`).
-  Status Read(const string& filename, size_t offset, size_t n, char* buffer,
-              size_t* bytes_transferred);
+  virtual Status Read(const string& filename, size_t offset, size_t n,
+                      char* buffer, size_t* bytes_transferred) = 0;
 
   /// Remove all cached blocks for `filename`.
-  void RemoveFile(const string& filename) LOCKS_EXCLUDED(mu_);
+  virtual void RemoveFile(const string& filename) = 0;
 
   /// Remove all cached data.
-  void Flush() LOCKS_EXCLUDED(mu_);
+  virtual void Flush() = 0;
 
   /// Accessors for cache parameters.
-  size_t block_size() const { return block_size_; }
-  size_t max_bytes() const { return max_bytes_; }
-  uint64 max_staleness() const { return max_staleness_; }
+  virtual size_t block_size() const = 0;
+  virtual size_t max_bytes() const = 0;
+  virtual uint64 max_staleness() const = 0;
 
   /// The current size (in bytes) of the cache.
-  size_t CacheSize() const LOCKS_EXCLUDED(mu_);
-
- private:
-  /// The size of the blocks stored in the LRU cache, as well as the size of the
-  /// reads from the underlying filesystem.
-  const size_t block_size_;
-  /// The maximum number of bytes (sum of block sizes) allowed in the LRU cache.
-  const size_t max_bytes_;
-  /// The maximum staleness of any block in the LRU cache, in seconds.
-  const uint64 max_staleness_;
-  /// The callback to read a block from the underlying filesystem.
-  const BlockFetcher block_fetcher_;
-  /// The Env from which we read timestamps.
-  Env* const env_;  // not owned
-
-  /// \brief The key type for the file block cache.
-  ///
-  /// The file block cache key is a {filename, offset} pair.
-  typedef std::pair<string, size_t> Key;
-
-  /// \brief The state of a block.
-  ///
-  /// A block begins in the CREATED stage. The first thread will attempt to read
-  /// the block from the filesystem, transitioning the state of the block to
-  /// FETCHING. After completing, if the read was successful the state should
-  /// be FINISHED. Otherwise the state should be ERROR. A subsequent read can
-  /// re-fetch the block if the state is ERROR.
-  enum class FetchState {
-    CREATED,
-    FETCHING,
-    FINISHED,
-    ERROR,
-  };
-
-  /// \brief A block of a file.
-  ///
-  /// A file block consists of the block data, the block's current position in
-  /// the LRU cache, the timestamp (seconds since epoch) at which the block
-  /// was cached, a coordination lock, and state & condition variables.
-  ///
-  /// Thread safety:
-  /// The iterator and timestamp fields should only be accessed while holding
-  /// the block-cache-wide mu_ instance variable. The state variable should only
-  /// be accessed while holding the Block's mu lock. The data vector should only
-  /// be accessed after state == FINISHED, and it should never be modified.
-  ///
-  /// In order to prevent deadlocks, never grab the block-cache-wide mu_ lock
-  /// AFTER grabbing any block's mu lock. It is safe to grab mu without locking
-  /// mu_.
-  struct Block {
-    /// The block data.
-    std::vector<char> data;
-    /// A list iterator pointing to the block's position in the LRU list.
-    std::list<Key>::iterator lru_iterator;
-    /// A list iterator pointing to the block's position in the LRA list.
-    std::list<Key>::iterator lra_iterator;
-    /// The timestamp (seconds since epoch) at which the block was cached.
-    uint64 timestamp;
-    /// Mutex to guard state variable
-    mutex mu;
-    /// The state of the block.
-    FetchState state GUARDED_BY(mu) = FetchState::CREATED;
-    /// Wait on cond_var if state is FETCHING.
-    condition_variable cond_var;
-  };
-
-  /// \brief The block map type for the file block cache.
-  ///
-  /// The block map is an ordered map from Key to Block.
-  typedef std::map<Key, std::shared_ptr<Block>> BlockMap;
-
-  /// Prune the cache by removing files with expired blocks.
-  void Prune() LOCKS_EXCLUDED(mu_);
-
-  bool BlockNotStale(const std::shared_ptr<Block>& block)
-      EXCLUSIVE_LOCKS_REQUIRED(mu_);
-
-  /// Look up a Key in the block cache.
-  std::shared_ptr<Block> Lookup(const Key& key) LOCKS_EXCLUDED(mu_);
-
-  Status MaybeFetch(const Key& key, const std::shared_ptr<Block>& block)
-      LOCKS_EXCLUDED(mu_);
-
-  /// Trim the block cache to make room for another entry.
-  void Trim() EXCLUSIVE_LOCKS_REQUIRED(mu_);
-
-  /// Update the LRU iterator for the block at `key`.
-  Status UpdateLRU(const Key& key, const std::shared_ptr<Block>& block)
-      LOCKS_EXCLUDED(mu_);
-
-  /// Remove all blocks of a file, with mu_ already held.
-  void RemoveFile_Locked(const string& filename) EXCLUSIVE_LOCKS_REQUIRED(mu_);
-
-  /// Remove the block `entry` from the block map and LRU list, and update the
-  /// cache size accordingly.
-  void RemoveBlock(BlockMap::iterator entry) EXCLUSIVE_LOCKS_REQUIRED(mu_);
-
-  /// The cache pruning thread that removes files with expired blocks.
-  std::unique_ptr<Thread> pruning_thread_;
-
-  /// Notification for stopping the cache pruning thread.
-  Notification stop_pruning_thread_;
-
-  /// Guards access to the block map, LRU list, and cached byte count.
-  mutable mutex mu_;
-
-  /// The block map (map from Key to Block).
-  BlockMap block_map_ GUARDED_BY(mu_);
-
-  /// The LRU list of block keys. The front of the list identifies the most
-  /// recently accessed block.
-  std::list<Key> lru_list_ GUARDED_BY(mu_);
-
-  /// The LRA (least recently added) list of block keys. The front of the list
-  /// identifies the most recently added block.
-  ///
-  /// Note: blocks are added to lra_list_ only after they have successfully been
-  /// fetched from the underlying block store.
-  std::list<Key> lra_list_ GUARDED_BY(mu_);
-
-  /// The combined number of bytes in all of the cached blocks.
-  size_t cache_size_ GUARDED_BY(mu_) = 0;
+  virtual size_t CacheSize() const = 0;
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/gce_env_utils.cc b/tensorflow/core/platform/cloud/gce_env_utils.cc
new file mode 100644
index 0000000000000000000000000000000000000000..d78374c4b87ee2a89d026050f7f3e2f7564c14c8
--- /dev/null
+++ b/tensorflow/core/platform/cloud/gce_env_utils.cc
@@ -0,0 +1,159 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/cloud/gce_env_utils.h"
+
+#if defined(PLATFORM_WINDOWS)
+#include <algorithm>
+#include <cctype>
+#include <iostream>
+#include <string>
+
+// The order of these includes is important; windows.h has to come first.
+// clang-format off
+#include <windows.h>   // NOLINT
+#include <tchar.h>     // NOLINT
+#include <shellapi.h>  // NOLINT
+// clang-format on
+#else
+#include "tensorflow/core/lib/gtl/stl_util.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#endif
+
+namespace tensorflow {
+
+constexpr char kExpectedGoogleProductName[] = "Google";
+constexpr char kExpectedGceProductName[] = "Google Compute Engine";
+
+constexpr char kWinCheckCommand[] = "powershell.exe";
+constexpr char kWinCheckCommandArgs[] =
+    "(Get-WmiObject -Class Win32_BIOS).Manufacturer";
+
+constexpr char kLinuxProductNameFile[] = "/sys/class/dmi/id/product_name";
+
+const size_t kBiosDataBufferSize = 256;
+
+namespace {
+
+#if defined(PLATFORM_WINDOWS)
+
+Status IsRunningOnWinGce(bool* is_running_under_gce) {
+  *is_running_under_gce = false;
+  SECURITY_ATTRIBUTES sa;
+  sa.nLength = sizeof(sa);
+  sa.lpSecurityDescriptor = NULL;
+  sa.bInheritHandle = TRUE;
+
+  // Handles to the read and write ends of the pipe connecting us
+  // to the child process running powershell. The output of this
+  // child process will be written to 'process_output_in' and read from
+  // 'process_output_out'.
+  HANDLE process_output_out = NULL;
+  HANDLE process_output_in = NULL;
+
+  // Create the actual pipe connecting us to the child process.
+  if (!CreatePipe(&process_output_out, &process_output_in, &sa, 0)) {
+    return errors::Internal("CreatePipe() failed");
+  }
+  if (!SetHandleInformation(process_output_out, HANDLE_FLAG_INHERIT, 0)) {
+    return errors::Internal("SetHandleInformation() failed");
+  }
+
+  PROCESS_INFORMATION pi;
+  STARTUPINFO si;
+  DWORD flags = CREATE_NO_WINDOW;
+  ZeroMemory(&pi, sizeof(pi));
+  ZeroMemory(&si, sizeof(si));
+  si.cb = sizeof(si);
+  si.dwFlags |= STARTF_USESTDHANDLES;
+  si.hStdInput = NULL;
+
+  // Connect the child's stdout and stderr to the pipe's write end.
+  si.hStdError = process_output_in;
+  si.hStdOutput = process_output_in;
+  // Execute (and wait for) the powershell command that reads the BIOS
+  // manufacturer via WMI.
+  TCHAR cmd[kBiosDataBufferSize];
+  snprintf(cmd, kBiosDataBufferSize, "%s %s", _T(kWinCheckCommand),
+           _T(kWinCheckCommandArgs));
+
+  if (!CreateProcess(NULL, cmd, NULL, NULL, TRUE, flags, NULL, NULL, &si,
+                     &pi)) {
+    return errors::Internal("CreateProcess() failed");
+  }
+
+  WaitForSingleObject(pi.hProcess, INFINITE);
+  CloseHandle(pi.hProcess);
+  CloseHandle(pi.hThread);
+
+  // Read data from the pipe. Note that we are reading only kBiosDataBufferSize
+  // chars. There might be technically more data than that but we are looking
+  // for Google product identifiers that are much shorter than
+  // kBiosDataBufferSize.
+  DWORD dwread = 0;
+  CHAR buffer[kBiosDataBufferSize];
+  if (!ReadFile(process_output_out, buffer, kBiosDataBufferSize, &dwread,
+                NULL)) {
+    return errors::Internal("Failed reading from the pipe.");
+  }
+  std::string output(buffer, dwread);  // buffer is not null-terminated.
+  // Trim leading and trailing whitespace.
+  output.erase(output.begin(),
+               std::find_if(output.begin(), output.end(),
+                            [](int ch) { return !std::isspace(ch); }));
+  output.erase(std::find_if(output.rbegin(), output.rend(),
+                            [](int ch) { return !std::isspace(ch); })
+                   .base(),
+               output.end());
+  *is_running_under_gce =
+      output == kExpectedGceProductName || output == kExpectedGoogleProductName;
+  return Status::OK();
+}
+
+#else
+
+Status IsRunningOnLinuxGce(Env* env, bool* is_running_under_gce) {
+  std::unique_ptr<RandomAccessFile> file;
+  TF_RETURN_IF_ERROR(env->NewRandomAccessFile(kLinuxProductNameFile, &file));
+  char buf[kBiosDataBufferSize + 1];
+  std::fill(buf, buf + kBiosDataBufferSize + 1, '\0');
+  StringPiece product_name;
+  const Status s = file->Read(0, kBiosDataBufferSize, &product_name, buf);
+  if (!s.ok() && !errors::IsOutOfRange(s)) {
+    // OutOfRange is tolerated because the BIOS file is shorter than the
+    // requested read size; any other error is returned to the caller.
+    return s;
+  }
+  str_util::RemoveLeadingWhitespace(&product_name);
+  str_util::RemoveTrailingWhitespace(&product_name);
+  *is_running_under_gce = (product_name == kExpectedGceProductName ||
+                           product_name == kExpectedGoogleProductName);
+  return Status::OK();
+}
+
+#endif
+
+}  // namespace
+
+Status IsRunningOnGce(Env* env, bool* is_running_under_gce) {
+  *is_running_under_gce = false;
+#if defined(PLATFORM_WINDOWS)
+  return IsRunningOnWinGce(is_running_under_gce);
+#else
+  return IsRunningOnLinuxGce(env, is_running_under_gce);
+#endif
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/gce_env_utils.h b/tensorflow/core/platform/cloud/gce_env_utils.h
new file mode 100644
index 0000000000000000000000000000000000000000..25aaeb7db348240f070a6eaa3c7045f6cf9ccb61
--- /dev/null
+++ b/tensorflow/core/platform/cloud/gce_env_utils.h
@@ -0,0 +1,29 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_PLATFORM_CLOUD_GCE_ENV_UTILS_H_
+#define TENSORFLOW_CORE_PLATFORM_CLOUD_GCE_ENV_UTILS_H_
+
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/platform/env.h"
+
+namespace tensorflow {
+
+// Check whether the current process is running under GCE.
+Status IsRunningOnGce(Env* env, bool* is_running_under_gce);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_PLATFORM_CLOUD_GCE_ENV_UTILS_H_
diff --git a/tensorflow/core/platform/cloud/gcp_env_utils_test.cc b/tensorflow/core/platform/cloud/gcp_env_utils_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..910397b52ba02c5e8d35376373f68a84ddefaf65
--- /dev/null
+++ b/tensorflow/core/platform/cloud/gcp_env_utils_test.cc
@@ -0,0 +1,53 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/cloud/gce_env_utils.h"
+
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/cloud/fake_env.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+
+namespace {
+
+TEST(GcpEnvUtils, IsRunningOnGce) {
+  {
+    test::FakeEnv env(test::FakeEnv::kGoogle);
+    bool is_running_on_gcp = false;
+    TF_EXPECT_OK(IsRunningOnGce(&env, &is_running_on_gcp));
+    EXPECT_TRUE(is_running_on_gcp);
+  }
+  {
+    test::FakeEnv env(test::FakeEnv::kGce);
+    bool is_running_on_gcp = false;
+    TF_EXPECT_OK(IsRunningOnGce(&env, &is_running_on_gcp));
+    EXPECT_TRUE(is_running_on_gcp);
+  }
+  {
+    test::FakeEnv env(test::FakeEnv::kLocal);
+    bool is_running_on_gcp = false;
+    TF_EXPECT_OK(IsRunningOnGce(&env, &is_running_on_gcp));
+    EXPECT_FALSE(is_running_on_gcp);
+  }
+  {
+    test::FakeEnv env(test::FakeEnv::kBad);
+    bool is_running_on_gcp = false;
+    EXPECT_TRUE(errors::IsInternal(IsRunningOnGce(&env, &is_running_on_gcp)));
+  }
+}
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/gcs_dns_cache_test.cc b/tensorflow/core/platform/cloud/gcs_dns_cache_test.cc
index 8be452ff44d03bf3a8a66b99b0e65f98da537d5f..237ce6b5e5afb16628c6570c0c60e81b6cc9b0be 100644
--- a/tensorflow/core/platform/cloud/gcs_dns_cache_test.cc
+++ b/tensorflow/core/platform/cloud/gcs_dns_cache_test.cc
@@ -36,7 +36,7 @@ class TestHttpRequest : public HttpRequest {
   }
 
   void AddAuthBearerHeader(const string& auth_token) override {}
-
+  void SetRequestStats(HttpRequest::RequestStats* stats) override {}
   void SetDeleteRequest() override {}
 
   Status SetPutFromFile(const string& body_filepath, size_t offset) override {
diff --git a/tensorflow/core/platform/cloud/gcs_file_system.cc b/tensorflow/core/platform/cloud/gcs_file_system.cc
index 01ca0d76bab2720513775ef33ff8670bd148c241..1691826483a3227ea00b3b37e82002f3ad8d5225 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system.cc
+++ b/tensorflow/core/platform/cloud/gcs_file_system.cc
@@ -36,6 +36,7 @@ limitations under the License.
 #include "tensorflow/core/platform/cloud/curl_http_request.h"
 #include "tensorflow/core/platform/cloud/file_block_cache.h"
 #include "tensorflow/core/platform/cloud/google_auth_provider.h"
+#include "tensorflow/core/platform/cloud/ram_file_block_cache.h"
 #include "tensorflow/core/platform/cloud/retrying_utils.h"
 #include "tensorflow/core/platform/cloud/time_util.h"
 #include "tensorflow/core/platform/env.h"
@@ -783,13 +784,13 @@ Status GcsFileSystem::NewRandomAccessFile(
 // A helper function to build a FileBlockCache for GcsFileSystem.
 std::unique_ptr<FileBlockCache> GcsFileSystem::MakeFileBlockCache(
     size_t block_size, size_t max_bytes, uint64 max_staleness) {
-  std::unique_ptr<FileBlockCache> file_block_cache(
-      new FileBlockCache(block_size, max_bytes, max_staleness,
-                         [this](const string& filename, size_t offset, size_t n,
-                                char* buffer, size_t* bytes_transferred) {
-                           return LoadBufferFromGCS(filename, offset, n, buffer,
-                                                    bytes_transferred);
-                         }));
+  std::unique_ptr<FileBlockCache> file_block_cache(new RamFileBlockCache(
+      block_size, max_bytes, max_staleness,
+      [this](const string& filename, size_t offset, size_t n, char* buffer,
+             size_t* bytes_transferred) {
+        return LoadBufferFromGCS(filename, offset, n, buffer,
+                                 bytes_transferred);
+      }));
   return file_block_cache;
 }
 
@@ -812,6 +813,10 @@ Status GcsFileSystem::LoadBufferFromGCS(const string& filename, size_t offset,
   request->SetResultBufferDirect(buffer, n);
   request->SetTimeouts(timeouts_.connect, timeouts_.idle, timeouts_.read);
 
+  if (stats_ != nullptr) {
+    stats_->RecordBlockLoadRequest(filename, offset);
+  }
+
   TF_RETURN_WITH_CONTEXT_IF_ERROR(request->Send(), " when reading gs://",
                                   bucket, "/", object);
 
@@ -820,6 +825,10 @@ Status GcsFileSystem::LoadBufferFromGCS(const string& filename, size_t offset,
   VLOG(1) << "Successful read of gs://" << bucket << "/" << object << " @ "
           << offset << " of size: " << bytes_read;
 
+  if (stats_ != nullptr) {
+    stats_->RecordBlockRetrieved(filename, offset, bytes_read);
+  }
+
   throttle_.RecordResponse(bytes_read);
 
   if (bytes_read < block_size()) {
@@ -1454,6 +1463,13 @@ void GcsFileSystem::FlushCaches() {
   matching_paths_cache_->Clear();
 }
 
+void GcsFileSystem::SetStats(GcsStatsInterface* stats) {
+  CHECK(stats_ == nullptr) << "SetStats() has already been called.";
+  CHECK(stats != nullptr);
+  stats_ = stats;
+  stats_->Init(this, &throttle_, file_block_cache_.get());
+}
+
 // Creates an HttpRequest and sets several parameters that are common to all
 // requests.  All code (in GcsFileSystem) that creates an HttpRequest should
 // go through this method, rather than directly using http_request_factory_.
@@ -1473,6 +1489,10 @@ Status GcsFileSystem::CreateHttpRequest(std::unique_ptr<HttpRequest>* request) {
                            additional_header_->second);
   }
 
+  if (stats_ != nullptr) {
+    new_request->SetRequestStats(stats_->HttpStats());
+  }
+
   if (!throttle_.AdmitRequest()) {
     return errors::Unavailable("Request throttled");
   }
diff --git a/tensorflow/core/platform/cloud/gcs_file_system.h b/tensorflow/core/platform/cloud/gcs_file_system.h
index e8edde8a445aad4c0310394d89480dc6ae445dfa..703c8d57784eae6dbfec04a031751f3e6c075435 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system.h
+++ b/tensorflow/core/platform/cloud/gcs_file_system.h
@@ -32,6 +32,39 @@ limitations under the License.
 
 namespace tensorflow {
 
+class GcsFileSystem;
+
+/// GcsStatsInterface allows for instrumentation of the GCS file system.
+///
+/// GcsStatsInterface and its subclasses must be safe to use from multiple
+/// threads concurrently.
+///
+/// WARNING! This is an experimental interface that may change or go away at any
+/// time.
+class GcsStatsInterface {
+ public:
+  /// Init is called by the GcsFileSystem immediately after being registered.
+  virtual void Init(GcsFileSystem* fs, GcsThrottle* throttle,
+                    const FileBlockCache* block_cache) = 0;
+
+  /// RecordBlockLoadRequest is called to record that a block load request is
+  /// about to be made.
+  virtual void RecordBlockLoadRequest(const string& file, size_t offset) = 0;
+
+  /// RecordBlockRetrieved is called once a block within the file has been
+  /// retrieved.
+  virtual void RecordBlockRetrieved(const string& file, size_t offset,
+                                    size_t bytes_transferred) = 0;
+
+  /// HttpStats is called to optionally provide a RequestStats listener
+  /// to be attached to every HTTP request made to the GCS API.
+  ///
+  /// HttpStats() may return nullptr.
+  virtual HttpRequest::RequestStats* HttpStats() = 0;
+
+  virtual ~GcsStatsInterface() = default;
+};
+
 /// Google Cloud Storage implementation of a file system.
 ///
 /// The clients should use RetryingGcsFileSystem defined below,
@@ -90,6 +123,9 @@ class GcsFileSystem : public FileSystem {
 
   void FlushCaches() override;
 
+  /// Set an object to collect runtime statistics from the GcsFileSystem.
+  void SetStats(GcsStatsInterface* stats);
+
   /// These accessors are mainly for testing purposes, to verify that the
   /// environment variables that control these parameters are handled correctly.
   size_t block_size() const { return file_block_cache_->block_size(); }
@@ -205,6 +241,8 @@ class GcsFileSystem : public FileSystem {
 
   TimeoutConfig timeouts_;
 
+  GcsStatsInterface* stats_ = nullptr;  // Not owned.
+
   /// The initial delay for exponential backoffs when retrying failed calls.
   const int64 initial_retry_delay_usec_ = 1000000L;
 
@@ -217,8 +255,16 @@ class GcsFileSystem : public FileSystem {
 /// Google Cloud Storage implementation of a file system with retry on failures.
 class RetryingGcsFileSystem : public RetryingFileSystem {
  public:
-  RetryingGcsFileSystem()
-      : RetryingFileSystem(std::unique_ptr<FileSystem>(new GcsFileSystem)) {}
+  RetryingGcsFileSystem() : RetryingGcsFileSystem(new GcsFileSystem) {}
+
+  void SetStats(GcsStatsInterface* stats) { underlying_->SetStats(stats); }
+
+ private:
+  explicit RetryingGcsFileSystem(GcsFileSystem* fs)
+      : RetryingFileSystem(std::unique_ptr<FileSystem>(fs)), underlying_(fs) {}
+
+  // TODO(b/74259157): Refactor RetryingFileSystem to avoid holding this ptr.
+  GcsFileSystem* underlying_;
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/gcs_file_system_test.cc b/tensorflow/core/platform/cloud/gcs_file_system_test.cc
index d452074ce312f98abe6b058ea56d2e0ce4cf047a..8516421614481cbb5e96cacd4b1f16aded883a91 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system_test.cc
+++ b/tensorflow/core/platform/cloud/gcs_file_system_test.cc
@@ -393,7 +393,7 @@ TEST(GcsFileSystemTest, NewWritableFile_ResumeUploadSucceeds) {
                            "Timeouts: 5 1 10\n"
                            "Header Content-Range: bytes */17\n"
                            "Put: yes\n",
-                           "", errors::FailedPrecondition("308"), nullptr,
+                           "", errors::Unavailable("308"), nullptr,
                            {{"Range", "0-10"}}, 308),
        new FakeHttpRequest("Uri: https://custom/upload/location\n"
                            "Auth Token: fake_token\n"
@@ -406,13 +406,26 @@ TEST(GcsFileSystemTest, NewWritableFile_ResumeUploadSucceeds) {
                            "Timeouts: 5 1 10\n"
                            "Header Content-Range: bytes */17\n"
                            "Put: yes\n",
-                           "", errors::FailedPrecondition("308"), nullptr,
+                           "", errors::Unavailable("308"), nullptr,
                            {{"Range", "bytes=0-12"}}, 308),
        new FakeHttpRequest("Uri: https://custom/upload/location\n"
                            "Auth Token: fake_token\n"
                            "Header Content-Range: bytes 13-16/17\n"
                            "Timeouts: 5 1 30\n"
                            "Put body: ent2\n",
+                           "", errors::Unavailable("308"), 308),
+       new FakeHttpRequest("Uri: https://custom/upload/location\n"
+                           "Auth Token: fake_token\n"
+                           "Timeouts: 5 1 10\n"
+                           "Header Content-Range: bytes */17\n"
+                           "Put: yes\n",
+                           "", errors::Unavailable("308"), nullptr,
+                           {{"Range", "bytes=0-14"}}, 308),
+       new FakeHttpRequest("Uri: https://custom/upload/location\n"
+                           "Auth Token: fake_token\n"
+                           "Header Content-Range: bytes 15-16/17\n"
+                           "Timeouts: 5 1 30\n"
+                           "Put body: t2\n",
                            "")});
   GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
                    std::unique_ptr<HttpRequest::Factory>(
@@ -521,14 +534,14 @@ TEST(GcsFileSystemTest, NewWritableFile_ResumeUploadAllAttemptsFail) {
                            "Put body: content1,content2\n",
                            "", errors::Unavailable("503"), 503)});
   for (int i = 0; i < 10; i++) {
-    requests.emplace_back(new FakeHttpRequest(
-        "Uri: https://custom/upload/location\n"
-        "Auth Token: fake_token\n"
-        "Timeouts: 5 1 10\n"
-        "Header Content-Range: bytes */17\n"
-        "Put: yes\n",
-        "", errors::FailedPrecondition("important HTTP error 308"), nullptr,
-        {{"Range", "0-10"}}, 308));
+    requests.emplace_back(
+        new FakeHttpRequest("Uri: https://custom/upload/location\n"
+                            "Auth Token: fake_token\n"
+                            "Timeouts: 5 1 10\n"
+                            "Header Content-Range: bytes */17\n"
+                            "Put: yes\n",
+                            "", errors::Unavailable("important HTTP error 308"),
+                            nullptr, {{"Range", "0-10"}}, 308));
     requests.emplace_back(new FakeHttpRequest(
         "Uri: https://custom/upload/location\n"
         "Auth Token: fake_token\n"
@@ -2608,5 +2621,74 @@ TEST(GcsFileSystemTest, CreateHttpRequest) {
   TF_EXPECT_OK(request->Send());
 }
 
+TEST(GcsFileSystemTest, NewRandomAccessFile_StatsRecording) {
+  class TestGcsStats : public GcsStatsInterface {
+   public:
+    void Init(GcsFileSystem* fs, GcsThrottle* throttle,
+              const FileBlockCache* block_cache) override {
+      CHECK(fs_ == nullptr);
+      CHECK(throttle_ == nullptr);
+      CHECK(block_cache_ == nullptr);
+
+      fs_ = fs;
+      throttle_ = throttle;
+      block_cache_ = block_cache;
+    }
+
+    void RecordBlockLoadRequest(const string& file, size_t offset) override {
+      block_load_request_file_ = file;
+    }
+
+    void RecordBlockRetrieved(const string& file, size_t offset,
+                              size_t bytes_transferred) override {
+      block_retrieved_file_ = file;
+      block_retrieved_bytes_transferred_ = bytes_transferred;
+    }
+
+    HttpRequest::RequestStats* HttpStats() override { return nullptr; }
+
+    GcsFileSystem* fs_ = nullptr;
+    GcsThrottle* throttle_ = nullptr;
+    const FileBlockCache* block_cache_ = nullptr;
+
+    string block_load_request_file_;
+    string block_retrieved_file_;
+    size_t block_retrieved_bytes_transferred_ = 0;
+  };
+
+  std::vector<HttpRequest*> requests({new FakeHttpRequest(
+      "Uri: https://storage.googleapis.com/bucket/random_access.txt\n"
+      "Auth Token: fake_token\n"
+      "Range: 0-5\n"
+      "Timeouts: 5 1 20\n",
+      "012345")});
+  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+                   std::unique_ptr<HttpRequest::Factory>(
+                       new FakeHttpRequestFactory(&requests)),
+                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
+                   0 /* stat cache max age */, 0 /* stat cache max entries */,
+                   0 /* matching paths cache max age */,
+                   0 /* matching paths cache max entries */,
+                   0 /* initial retry delay */, kTestTimeoutConfig,
+                   nullptr /* gcs additional header */);
+
+  TestGcsStats stats;
+  fs.SetStats(&stats);
+  EXPECT_EQ(stats.fs_, &fs);
+
+  std::unique_ptr<RandomAccessFile> file;
+  TF_EXPECT_OK(fs.NewRandomAccessFile("gs://bucket/random_access.txt", &file));
+
+  char scratch[6];
+  StringPiece result;
+
+  TF_EXPECT_OK(file->Read(0, sizeof(scratch), &result, scratch));
+  EXPECT_EQ("012345", result);
+
+  EXPECT_EQ("gs://bucket/random_access.txt", stats.block_load_request_file_);
+  EXPECT_EQ("gs://bucket/random_access.txt", stats.block_retrieved_file_);
+  EXPECT_EQ(6, stats.block_retrieved_bytes_transferred_);
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/gcs_throttle.cc b/tensorflow/core/platform/cloud/gcs_throttle.cc
index eb5f8958a37f45aeac1a836ca037f91931bb34a6..27dd06a6250ad457d0ec142c07d29a2358dddaee 100644
--- a/tensorflow/core/platform/cloud/gcs_throttle.cc
+++ b/tensorflow/core/platform/cloud/gcs_throttle.cc
@@ -26,10 +26,9 @@ GcsThrottle::GcsThrottle(EnvTime* env_time)
 
 bool GcsThrottle::AdmitRequest() {
   mutex_lock l(mu_);
-  if (!config_.enabled) return true;
   UpdateState();
   if (available_tokens_ < config_.tokens_per_request) {
-    return false;
+    return false || !config_.enabled;
   }
   available_tokens_ -= config_.tokens_per_request;
   return true;
@@ -37,7 +36,6 @@ bool GcsThrottle::AdmitRequest() {
 
 void GcsThrottle::RecordResponse(size_t num_bytes) {
   mutex_lock l(mu_);
-  if (!config_.enabled) return;
   UpdateState();
   available_tokens_ -= request_bytes_to_tokens(num_bytes);
 }
diff --git a/tensorflow/core/platform/cloud/gcs_throttle.h b/tensorflow/core/platform/cloud/gcs_throttle.h
index 1a89daef084e921f1ad8bd856cefcc62d0d7aa1c..97a858e3fecfbbd5175db2da2057836c7dd9f482 100644
--- a/tensorflow/core/platform/cloud/gcs_throttle.h
+++ b/tensorflow/core/platform/cloud/gcs_throttle.h
@@ -109,13 +109,24 @@ class GcsThrottle {
    * purpose of this function is to make available to monitoring or other
    * instrumentation the number of available tokens in the pool.
    */
-  inline int64 available_tokens() {
+  inline int64 available_tokens() LOCKS_EXCLUDED(mu_) {
     mutex_lock l(mu_);
-    if (!config_.enabled) return 0;
     UpdateState();
     return available_tokens_;
   }
 
+  /**
+   * is_enabled determines if the throttle is enabled.
+   *
+   * If !is_enabled(), AdmitRequest() will always return true. To enable the
+   * throttle, call SetConfig passing in a configuration that has enabled set to
+   * true.
+   */
+  bool is_enabled() LOCKS_EXCLUDED(mu_) {
+    mutex_lock l(mu_);
+    return config_.enabled;
+  }
+
  private:
   /**
    * UpdateState updates the available_tokens_ and last_updated_secs_ variables.
diff --git a/tensorflow/core/platform/cloud/gcs_throttle_test.cc b/tensorflow/core/platform/cloud/gcs_throttle_test.cc
index 694756022e37263a07f8215bf7496c9ca130fd58..57193ac4057550463b6bea29089bdd545f2f0a33 100644
--- a/tensorflow/core/platform/cloud/gcs_throttle_test.cc
+++ b/tensorflow/core/platform/cloud/gcs_throttle_test.cc
@@ -96,6 +96,24 @@ TEST_F(GcsThrottleTest, ReverseTime) {
   EXPECT_EQ(200000, throttle_.available_tokens());
 }
 
+TEST(GcsThrottleDisabledTest, Disabled) {
+  TestTime time;
+  GcsThrottle throttle(&time);
+  ASSERT_FALSE(throttle.is_enabled());  // Verify throttle is disabled.
+
+  EXPECT_EQ(0, throttle.available_tokens());
+  time.AdvanceSeconds(1);
+  EXPECT_EQ(100000, throttle.available_tokens());
+  EXPECT_TRUE(throttle.AdmitRequest());
+  EXPECT_EQ(99900, throttle.available_tokens());
+  time.AdvanceSeconds(1);
+  EXPECT_EQ(199900, throttle.available_tokens());
+  throttle.RecordResponse(128000000);  // 128 MB response.
+  EXPECT_LT(0, throttle.available_tokens());
+  // Admit request even without available tokens
+  EXPECT_TRUE(throttle.AdmitRequest());
+}
+
 }  // namespace
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/google_auth_provider.cc b/tensorflow/core/platform/cloud/google_auth_provider.cc
index 7e39b63e3e8e19b3ed9e05e5c49422b42774567c..0e8a620464a9d38b4dd4e3ee340f7cb26b2486ac 100644
--- a/tensorflow/core/platform/cloud/google_auth_provider.cc
+++ b/tensorflow/core/platform/cloud/google_auth_provider.cc
@@ -26,6 +26,7 @@ limitations under the License.
 #include "tensorflow/core/lib/io/path.h"
 #include "tensorflow/core/lib/strings/base64.h"
 #include "tensorflow/core/platform/cloud/curl_http_request.h"
+#include "tensorflow/core/platform/cloud/gce_env_utils.h"
 #include "tensorflow/core/platform/cloud/retrying_utils.h"
 #include "tensorflow/core/platform/env.h"
 
@@ -207,6 +208,16 @@ Status GoogleAuthProvider::GetTokenFromFiles() {
 }
 
 Status GoogleAuthProvider::GetTokenFromGce() {
+  if (!is_running_on_gce_.has_value()) {
+    bool is_running_on_gce = false;
+    TF_RETURN_IF_ERROR(IsRunningOnGce(env_, &is_running_on_gce));
+    is_running_on_gce_ = is_running_on_gce;
+  }
+  if (!is_running_on_gce_.value()) {
+    // Assume bucket is world-accessible. If not, the access will be rejected.
+    current_token_ = "";
+    return Status::OK();
+  }
   const auto get_token_from_gce = [this]() {
     std::unique_ptr<HttpRequest> request(http_request_factory_->Create());
     std::vector<char> response_buffer;
diff --git a/tensorflow/core/platform/cloud/google_auth_provider.h b/tensorflow/core/platform/cloud/google_auth_provider.h
index 00da25a9593a404a330f4cf5630ec29a3798a982..79a57ff2a0880bf9a0b9ec32442558ebced75793 100644
--- a/tensorflow/core/platform/cloud/google_auth_provider.h
+++ b/tensorflow/core/platform/cloud/google_auth_provider.h
@@ -17,6 +17,7 @@ limitations under the License.
 #define TENSORFLOW_CORE_PLATFORM_GOOGLE_AUTH_PROVIDER_H_
 
 #include <memory>
+#include "tensorflow/core/lib/gtl/optional.h"
 #include "tensorflow/core/platform/cloud/auth_provider.h"
 #include "tensorflow/core/platform/cloud/oauth_client.h"
 #include "tensorflow/core/platform/mutex.h"
@@ -46,7 +47,10 @@ class GoogleAuthProvider : public AuthProvider {
   /// standard gcloud tool's location.
   Status GetTokenFromFiles() EXCLUSIVE_LOCKS_REQUIRED(mu_);
 
-  /// Gets the bearer token from Google Compute Engine environment.
+  /// Gets the bearer token from Google Compute Engine environment. May return
+  /// an empty token if the current process is not running under GCE. If that
+  /// happens the caller will try to use the empty token and either succeed
+  /// if the resource is publicly accessible or fail with a permissions error.
   Status GetTokenFromGce() EXCLUSIVE_LOCKS_REQUIRED(mu_);
 
   /// Gets the bearer token from the systen env variable, for testing purposes.
@@ -57,6 +61,7 @@ class GoogleAuthProvider : public AuthProvider {
   Env* env_;
   mutex mu_;
   string current_token_ GUARDED_BY(mu_);
+  tensorflow::gtl::optional<bool> is_running_on_gce_ GUARDED_BY(mu_);
   uint64 expiration_timestamp_sec_ GUARDED_BY(mu_) = 0;
   // The initial delay for exponential backoffs when retrying failed calls.
   const int64 initial_retry_delay_usec_;
diff --git a/tensorflow/core/platform/cloud/google_auth_provider_test.cc b/tensorflow/core/platform/cloud/google_auth_provider_test.cc
index 4281c6c73738dbc0523e4715137b7fc171458eac..55829f84d92eb241f7875776c091bd4ba7ca7226 100644
--- a/tensorflow/core/platform/cloud/google_auth_provider_test.cc
+++ b/tensorflow/core/platform/cloud/google_auth_provider_test.cc
@@ -17,6 +17,7 @@ limitations under the License.
 #include <stdlib.h>
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/lib/io/path.h"
+#include "tensorflow/core/platform/cloud/fake_env.h"
 #include "tensorflow/core/platform/cloud/http_request_fake.h"
 #include "tensorflow/core/platform/test.h"
 
@@ -26,14 +27,6 @@ namespace {
 
 constexpr char kTestData[] = "core/platform/cloud/testdata/";
 
-class FakeEnv : public EnvWrapper {
- public:
-  FakeEnv() : EnvWrapper(Env::Default()) {}
-
-  uint64 NowSeconds() override { return now; }
-  uint64 now = 10000;
-};
-
 class FakeOAuthClient : public OAuthClient {
  public:
   Status GetTokenFromServiceAccountJson(
@@ -89,7 +82,7 @@ TEST_F(GoogleAuthProviderTest, EnvironmentVariable_Caching) {
   auto oauth_client = new FakeOAuthClient;
   std::vector<HttpRequest*> requests;
 
-  FakeEnv env;
+  test::FakeEnv env(test::FakeEnv::kGoogle);
   GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
                               std::unique_ptr<HttpRequest::Factory>(
                                   new FakeHttpRequestFactory(&requests)),
@@ -123,7 +116,7 @@ TEST_F(GoogleAuthProviderTest, GCloudRefreshToken) {
   auto oauth_client = new FakeOAuthClient;
   std::vector<HttpRequest*> requests;
 
-  FakeEnv env;
+  test::FakeEnv env(test::FakeEnv::kGoogle);
   GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
                               std::unique_ptr<HttpRequest::Factory>(
                                   new FakeHttpRequestFactory(&requests)),
@@ -169,7 +162,7 @@ TEST_F(GoogleAuthProviderTest, RunningOnGCE) {
                 "token_type":"Bearer"
               })")});
 
-  FakeEnv env;
+  test::FakeEnv env(test::FakeEnv::kGoogle);
   GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
                               std::unique_ptr<HttpRequest::Factory>(
                                   new FakeHttpRequestFactory(&requests)),
@@ -195,7 +188,7 @@ TEST_F(GoogleAuthProviderTest, OverrideForTesting) {
 
   auto oauth_client = new FakeOAuthClient;
   std::vector<HttpRequest*> empty_requests;
-  FakeEnv env;
+  test::FakeEnv env(test::FakeEnv::kGoogle);
   GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
                               std::unique_ptr<HttpRequest::Factory>(
                                   new FakeHttpRequestFactory(&empty_requests)),
@@ -215,7 +208,25 @@ TEST_F(GoogleAuthProviderTest, NothingAvailable) {
       "Header Metadata-Flavor: Google\n",
       "", errors::NotFound("404"), 404)});
 
-  FakeEnv env;
+  test::FakeEnv env(test::FakeEnv::kGoogle);
+  GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
+                              std::unique_ptr<HttpRequest::Factory>(
+                                  new FakeHttpRequestFactory(&requests)),
+                              &env, 0);
+
+  string token;
+  TF_EXPECT_OK(provider.GetToken(&token));
+  EXPECT_EQ("", token);
+}
+
+TEST_F(GoogleAuthProviderTest, AccessingPublicBucket) {
+  setenv("CLOUDSDK_CONFIG",
+         io::JoinPath(testing::TensorFlowSrcRoot(), kTestData).c_str(), 1);
+
+  auto oauth_client = new FakeOAuthClient;
+  std::vector<HttpRequest*> requests;
+
+  test::FakeEnv env(test::FakeEnv::kLocal);
   GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
                               std::unique_ptr<HttpRequest::Factory>(
                                   new FakeHttpRequestFactory(&requests)),
@@ -223,6 +234,8 @@ TEST_F(GoogleAuthProviderTest, NothingAvailable) {
 
   string token;
   TF_EXPECT_OK(provider.GetToken(&token));
+  // We are assuming we are accessing a public bucket (and we are not running
+  // on GCE), so an empty token is returned.
   EXPECT_EQ("", token);
 }
 
diff --git a/tensorflow/core/platform/cloud/http_request.h b/tensorflow/core/platform/cloud/http_request.h
index df8a5b86a0b9b3354514be69cb03dd6472e51e86..2343bca608a6bd812354d0e243429c67c261b3ed 100644
--- a/tensorflow/core/platform/cloud/http_request.h
+++ b/tensorflow/core/platform/cloud/http_request.h
@@ -47,6 +47,46 @@ class HttpRequest {
     virtual HttpRequest* Create() = 0;
   };
 
+  /// RequestMethod is used to capture what type of HTTP request is made and
+  /// is used in conjunction with RequestStats for instrumentation and
+  /// monitoring of HTTP requests and their responses.
+  enum class RequestMethod : char {
+    kGet,
+    kPost,
+    kPut,
+    kDelete,
+  };
+
+  /// RequestMethodName converts a RequestMethod to the canonical method string.
+  inline static const char* RequestMethodName(RequestMethod m) {
+    switch (m) {
+      case RequestMethod::kGet:
+        return "GET";
+      case RequestMethod::kPost:
+        return "POST";
+      case RequestMethod::kPut:
+        return "PUT";
+      case RequestMethod::kDelete:
+        return "DELETE";
+      default:
+        return "???";
+    }
+  }
+
+  /// RequestStats is a class that can be used to instrument an HTTP request.
+  class RequestStats {
+   public:
+    virtual ~RequestStats() = default;
+
+    /// RecordRequest is called right before a request is sent on the wire.
+    virtual void RecordRequest(const HttpRequest* request, const string& uri,
+                               RequestMethod method) = 0;
+
+    /// RecordResponse is called after the response has been received.
+    virtual void RecordResponse(const HttpRequest* request, const string& uri,
+                                RequestMethod method, const Status& result) = 0;
+  };
+
   HttpRequest() {}
   virtual ~HttpRequest() {}
 
@@ -73,6 +113,9 @@ class HttpRequest {
   /// Sets the 'Authorization' header to the value of 'Bearer ' + auth_token.
   virtual void AddAuthBearerHeader(const string& auth_token) = 0;
 
+  /// Sets the RequestStats object to use to record the request and response.
+  virtual void SetRequestStats(RequestStats* stats) = 0;
+
   /// Makes the request a DELETE request.
   virtual void SetDeleteRequest() = 0;
 
diff --git a/tensorflow/core/platform/cloud/file_block_cache.cc b/tensorflow/core/platform/cloud/ram_file_block_cache.cc
similarity index 89%
rename from tensorflow/core/platform/cloud/file_block_cache.cc
rename to tensorflow/core/platform/cloud/ram_file_block_cache.cc
index 6add1142a15fb69044828bd82a6d6e838959de08..55a5657a503a334866cad737bb11fe505e59699a 100644
--- a/tensorflow/core/platform/cloud/file_block_cache.cc
+++ b/tensorflow/core/platform/cloud/ram_file_block_cache.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/core/platform/cloud/file_block_cache.h"
+#include "tensorflow/core/platform/cloud/ram_file_block_cache.h"
 #include <cstring>
 #include <memory>
 #include "tensorflow/core/lib/gtl/cleanup.h"
@@ -21,7 +21,7 @@ limitations under the License.
 
 namespace tensorflow {
 
-bool FileBlockCache::BlockNotStale(const std::shared_ptr<Block>& block) {
+bool RamFileBlockCache::BlockNotStale(const std::shared_ptr<Block>& block) {
   mutex_lock l(block->mu);
   if (block->state != FetchState::FINISHED) {
     return true;  // No need to check for staleness.
@@ -30,7 +30,8 @@ bool FileBlockCache::BlockNotStale(const std::shared_ptr<Block>& block) {
   return env_->NowSeconds() - block->timestamp <= max_staleness_;
 }
 
-std::shared_ptr<FileBlockCache::Block> FileBlockCache::Lookup(const Key& key) {
+std::shared_ptr<RamFileBlockCache::Block> RamFileBlockCache::Lookup(
+    const Key& key) {
   mutex_lock lock(mu_);
   auto entry = block_map_.find(key);
   if (entry != block_map_.end()) {
@@ -55,15 +56,15 @@ std::shared_ptr<FileBlockCache::Block> FileBlockCache::Lookup(const Key& key) {
 }
 
 // Remove blocks from the cache until we do not exceed our maximum size.
-void FileBlockCache::Trim() {
+void RamFileBlockCache::Trim() {
   while (!lru_list_.empty() && cache_size_ > max_bytes_) {
     RemoveBlock(block_map_.find(lru_list_.back()));
   }
 }
 
 /// Move the block to the front of the LRU list if it isn't already there.
-Status FileBlockCache::UpdateLRU(const Key& key,
-                                 const std::shared_ptr<Block>& block) {
+Status RamFileBlockCache::UpdateLRU(const Key& key,
+                                    const std::shared_ptr<Block>& block) {
   mutex_lock lock(mu_);
   if (block->timestamp == 0) {
     // The block was evicted from another thread. Allow it to remain evicted.
@@ -92,8 +93,8 @@ Status FileBlockCache::UpdateLRU(const Key& key,
   return Status::OK();
 }
 
-Status FileBlockCache::MaybeFetch(const Key& key,
-                                  const std::shared_ptr<Block>& block) {
+Status RamFileBlockCache::MaybeFetch(const Key& key,
+                                     const std::shared_ptr<Block>& block) {
   bool downloaded_block = false;
   auto reconcile_state =
       gtl::MakeCleanup([this, &downloaded_block, &key, &block] {
@@ -151,11 +152,11 @@ Status FileBlockCache::MaybeFetch(const Key& key,
     }
   }
   return errors::Internal(
-      "Control flow should never reach the end of FileBlockCache::Fetch.");
+      "Control flow should never reach the end of RamFileBlockCache::Fetch.");
 }
 
-Status FileBlockCache::Read(const string& filename, size_t offset, size_t n,
-                            char* buffer, size_t* bytes_transferred) {
+Status RamFileBlockCache::Read(const string& filename, size_t offset, size_t n,
+                               char* buffer, size_t* bytes_transferred) {
   *bytes_transferred = 0;
   if (n == 0) {
     return Status::OK();
@@ -216,12 +217,12 @@ Status FileBlockCache::Read(const string& filename, size_t offset, size_t n,
   return Status::OK();
 }
 
-size_t FileBlockCache::CacheSize() const {
+size_t RamFileBlockCache::CacheSize() const {
   mutex_lock lock(mu_);
   return cache_size_;
 }
 
-void FileBlockCache::Prune() {
+void RamFileBlockCache::Prune() {
   while (!WaitForNotificationWithTimeout(&stop_pruning_thread_, 1000000)) {
     mutex_lock lock(mu_);
     uint64 now = env_->NowSeconds();
@@ -238,7 +239,7 @@ void FileBlockCache::Prune() {
   }
 }
 
-void FileBlockCache::Flush() {
+void RamFileBlockCache::Flush() {
   mutex_lock lock(mu_);
   block_map_.clear();
   lru_list_.clear();
@@ -246,12 +247,12 @@ void FileBlockCache::Flush() {
   cache_size_ = 0;
 }
 
-void FileBlockCache::RemoveFile(const string& filename) {
+void RamFileBlockCache::RemoveFile(const string& filename) {
   mutex_lock lock(mu_);
   RemoveFile_Locked(filename);
 }
 
-void FileBlockCache::RemoveFile_Locked(const string& filename) {
+void RamFileBlockCache::RemoveFile_Locked(const string& filename) {
   Key begin = std::make_pair(filename, 0);
   auto it = block_map_.lower_bound(begin);
   while (it != block_map_.end() && it->first.first == filename) {
@@ -261,7 +262,7 @@ void FileBlockCache::RemoveFile_Locked(const string& filename) {
   }
 }
 
-void FileBlockCache::RemoveBlock(BlockMap::iterator entry) {
+void RamFileBlockCache::RemoveBlock(BlockMap::iterator entry) {
   // This signals that the block is removed, and should not be inadvertently
   // reinserted into the cache in UpdateLRU.
   entry->second->timestamp = 0;
diff --git a/tensorflow/core/platform/cloud/ram_file_block_cache.h b/tensorflow/core/platform/cloud/ram_file_block_cache.h
new file mode 100644
index 0000000000000000000000000000000000000000..7fdd7b2e0294e1cf289a77464fb60e08bdb28da7
--- /dev/null
+++ b/tensorflow/core/platform/cloud/ram_file_block_cache.h
@@ -0,0 +1,229 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_PLATFORM_CLOUD_RAM_FILE_BLOCK_CACHE_H_
+#define TENSORFLOW_CORE_PLATFORM_CLOUD_RAM_FILE_BLOCK_CACHE_H_
+
+#include <functional>
+#include <list>
+#include <map>
+#include <memory>
+#include <string>
+#include <vector>
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/platform/cloud/file_block_cache.h"
+#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/notification.h"
+#include "tensorflow/core/platform/thread_annotations.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+/// \brief An LRU block cache of file contents, keyed by {filename, offset}.
+///
+/// This class should be shared by read-only random access files on a remote
+/// filesystem (e.g. GCS).
+class RamFileBlockCache : public FileBlockCache {
+ public:
+  /// The callback executed when a block is not found in the cache, and needs to
+  /// be fetched from the backing filesystem. This callback is provided when the
+  /// cache is constructed. The returned Status should be OK as long as the
+  /// read from the remote filesystem succeeded (similar to the semantics of the
+  /// read(2) system call).
+  typedef std::function<Status(const string& filename, size_t offset,
+                               size_t buffer_size, char* buffer,
+                               size_t* bytes_transferred)>
+      BlockFetcher;
+
+  RamFileBlockCache(size_t block_size, size_t max_bytes, uint64 max_staleness,
+                    BlockFetcher block_fetcher, Env* env = Env::Default())
+      : block_size_(block_size),
+        max_bytes_(max_bytes),
+        max_staleness_(max_staleness),
+        block_fetcher_(block_fetcher),
+        env_(env) {
+    if (max_staleness_ > 0) {
+      pruning_thread_.reset(env_->StartThread(ThreadOptions(), "TF_prune_FBC",
+                                              [this] { Prune(); }));
+    }
+  }
+
+  ~RamFileBlockCache() override {
+    if (pruning_thread_) {
+      stop_pruning_thread_.Notify();
+      // Destroying pruning_thread_ will block until Prune() receives the above
+      // notification and returns.
+      pruning_thread_.reset();
+    }
+  }
+
+  /// Read `n` bytes from `filename` starting at `offset` into `buffer`. This
+  /// method will return:
+  ///
+  /// 1) The error from the remote filesystem, if the read from the remote
+  ///    filesystem failed.
+  /// 2) FAILED_PRECONDITION if the read from the remote filesystem succeeded,
+  ///    but the read returned a partial block, and the LRU cache contained a
+  ///    block at a higher offset (indicating that the partial block should have
+  ///    been a full block).
+  /// 3) OUT_OF_RANGE if the read from the remote filesystem succeeded, but
+  ///    the file contents do not extend past `offset` and thus nothing was
+  ///    placed in `buffer`.
+  /// 4) OK otherwise (i.e. the read succeeded, and at least one byte was placed
+  ///    in `buffer`).
+  Status Read(const string& filename, size_t offset, size_t n, char* buffer,
+              size_t* bytes_transferred) override;
+
+  /// Remove all cached blocks for `filename`.
+  void RemoveFile(const string& filename) override LOCKS_EXCLUDED(mu_);
+
+  /// Remove all cached data.
+  void Flush() LOCKS_EXCLUDED(mu_) override;
+
+  /// Accessors for cache parameters.
+  size_t block_size() const override { return block_size_; }
+  size_t max_bytes() const override { return max_bytes_; }
+  uint64 max_staleness() const override { return max_staleness_; }
+
+  /// The current size (in bytes) of the cache.
+  size_t CacheSize() const override LOCKS_EXCLUDED(mu_);
+
+ private:
+  /// The size of the blocks stored in the LRU cache, as well as the size of the
+  /// reads from the underlying filesystem.
+  const size_t block_size_;
+  /// The maximum number of bytes (sum of block sizes) allowed in the LRU cache.
+  const size_t max_bytes_;
+  /// The maximum staleness of any block in the LRU cache, in seconds.
+  const uint64 max_staleness_;
+  /// The callback to read a block from the underlying filesystem.
+  const BlockFetcher block_fetcher_;
+  /// The Env from which we read timestamps.
+  Env* const env_;  // not owned
+
+  /// \brief The key type for the file block cache.
+  ///
+  /// The file block cache key is a {filename, offset} pair.
+  typedef std::pair<string, size_t> Key;
+
+  /// \brief The state of a block.
+  ///
+  /// A block begins in the CREATED state. The first thread will attempt to read
+  /// the block from the filesystem, transitioning the state of the block to
+  /// FETCHING. After completing, if the read was successful the state should
+  /// be FINISHED. Otherwise the state should be ERROR. A subsequent read can
+  /// re-fetch the block if the state is ERROR.
+  enum class FetchState {
+    CREATED,
+    FETCHING,
+    FINISHED,
+    ERROR,
+  };
+
+  /// \brief A block of a file.
+  ///
+  /// A file block consists of the block data, the block's current position in
+  /// the LRU cache, the timestamp (seconds since epoch) at which the block
+  /// was cached, a coordination lock, and state & condition variables.
+  ///
+  /// Thread safety:
+  /// The iterator and timestamp fields should only be accessed while holding
+  /// the block-cache-wide mu_ instance variable. The state variable should only
+  /// be accessed while holding the Block's mu lock. The data vector should only
+  /// be accessed after state == FINISHED, and it should never be modified.
+  ///
+  /// In order to prevent deadlocks, never grab the block-cache-wide mu_ lock
+  /// AFTER grabbing any block's mu lock. It is safe to grab mu without locking
+  /// mu_.
+  struct Block {
+    /// The block data.
+    std::vector<char> data;
+    /// A list iterator pointing to the block's position in the LRU list.
+    std::list<Key>::iterator lru_iterator;
+    /// A list iterator pointing to the block's position in the LRA list.
+    std::list<Key>::iterator lra_iterator;
+    /// The timestamp (seconds since epoch) at which the block was cached.
+    uint64 timestamp;
+    /// Mutex guarding the state variable.
+    mutex mu;
+    /// The state of the block.
+    FetchState state GUARDED_BY(mu) = FetchState::CREATED;
+    /// Wait on cond_var if state is FETCHING.
+    condition_variable cond_var;
+  };
+
+  /// \brief The block map type for the file block cache.
+  ///
+  /// The block map is an ordered map from Key to Block.
+  typedef std::map<Key, std::shared_ptr<Block>> BlockMap;
+
+  /// Prune the cache by removing files with expired blocks.
+  void Prune() LOCKS_EXCLUDED(mu_);
+
+  bool BlockNotStale(const std::shared_ptr<Block>& block)
+      EXCLUSIVE_LOCKS_REQUIRED(mu_);
+
+  /// Look up a Key in the block cache.
+  std::shared_ptr<Block> Lookup(const Key& key) LOCKS_EXCLUDED(mu_);
+
+  Status MaybeFetch(const Key& key, const std::shared_ptr<Block>& block)
+      LOCKS_EXCLUDED(mu_);
+
+  /// Trim the block cache to make room for another entry.
+  void Trim() EXCLUSIVE_LOCKS_REQUIRED(mu_);
+
+  /// Update the LRU iterator for the block at `key`.
+  Status UpdateLRU(const Key& key, const std::shared_ptr<Block>& block)
+      LOCKS_EXCLUDED(mu_);
+
+  /// Remove all blocks of a file, with mu_ already held.
+  void RemoveFile_Locked(const string& filename) EXCLUSIVE_LOCKS_REQUIRED(mu_);
+
+  /// Remove the block `entry` from the block map and LRU list, and update the
+  /// cache size accordingly.
+  void RemoveBlock(BlockMap::iterator entry) EXCLUSIVE_LOCKS_REQUIRED(mu_);
+
+  /// The cache pruning thread that removes files with expired blocks.
+  std::unique_ptr<Thread> pruning_thread_;
+
+  /// Notification for stopping the cache pruning thread.
+  Notification stop_pruning_thread_;
+
+  /// Guards access to the block map, LRU list, and cached byte count.
+  mutable mutex mu_;
+
+  /// The block map (map from Key to Block).
+  BlockMap block_map_ GUARDED_BY(mu_);
+
+  /// The LRU list of block keys. The front of the list identifies the most
+  /// recently accessed block.
+  std::list<Key> lru_list_ GUARDED_BY(mu_);
+
+  /// The LRA (least recently added) list of block keys. The front of the list
+  /// identifies the most recently added block.
+  ///
+  /// Note: blocks are added to lra_list_ only after they have successfully been
+  /// fetched from the underlying block store.
+  std::list<Key> lra_list_ GUARDED_BY(mu_);
+
+  /// The combined number of bytes in all of the cached blocks.
+  size_t cache_size_ GUARDED_BY(mu_) = 0;
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_PLATFORM_CLOUD_RAM_FILE_BLOCK_CACHE_H_
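The FetchState comment in the header above describes a small per-block state machine: the first reader moves a block from CREATED to FETCHING, concurrent readers wait on the block's condition variable, and a block left in ERROR may be re-fetched by a later read. The following is a minimal sketch of that protocol, not the implementation in ram_file_block_cache.cc. It uses std::mutex and std::condition_variable in place of the tensorflow wrappers, and the `fetch` callback and the bool return value are assumptions made for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

enum class FetchState { CREATED, FETCHING, FINISHED, ERROR };

// Reduced version of the Block struct: only the fields the state machine needs.
struct Block {
  std::vector<char> data;
  std::mutex mu;                          // guards `state`
  FetchState state = FetchState::CREATED;
  std::condition_variable cond_var;       // waited on while state == FETCHING
};

// Hypothetical helper: returns true once `block` holds data. `fetch` stands in
// for the BlockFetcher callback and reports success as a bool (an assumption).
bool MaybeFetch(Block* block,
                const std::function<bool(std::vector<char>*)>& fetch) {
  std::unique_lock<std::mutex> lock(block->mu);
  while (true) {
    switch (block->state) {
      case FetchState::FINISHED:
        return true;  // Another thread (or an earlier call) filled the block.
      case FetchState::FETCHING:
        // Someone else is reading; wait, then re-examine the state.
        block->cond_var.wait(lock);
        break;
      case FetchState::CREATED:
      case FetchState::ERROR: {
        // This thread becomes the fetcher. Drop the lock during the slow read.
        block->state = FetchState::FETCHING;
        lock.unlock();
        std::vector<char> fetched;
        const bool ok = fetch(&fetched);
        lock.lock();
        if (ok) block->data = std::move(fetched);
        block->state = ok ? FetchState::FINISHED : FetchState::ERROR;
        block->cond_var.notify_all();
        return ok;
      }
    }
  }
}
```

In this sketch a waiter that wakes up and finds ERROR retries the fetch itself, which is one reading of "a subsequent read can re-fetch the block"; the lock is released around `fetch` so that concurrent readers of the same block can reach the FETCHING branch and wait instead of blocking on the mutex.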
diff --git a/tensorflow/core/platform/cloud/file_block_cache_test.cc b/tensorflow/core/platform/cloud/ram_file_block_cache_test.cc
similarity index 92%
rename from tensorflow/core/platform/cloud/file_block_cache_test.cc
rename to tensorflow/core/platform/cloud/ram_file_block_cache_test.cc
index 596fdbf19eb03a70c5659d392db368b3cdb791fe..d555b682a624309172588c9279d650d436f5d5cd 100644
--- a/tensorflow/core/platform/cloud/file_block_cache_test.cc
+++ b/tensorflow/core/platform/cloud/ram_file_block_cache_test.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/core/platform/cloud/file_block_cache.h"
+#include "tensorflow/core/platform/cloud/ram_file_block_cache.h"
 #include <cstring>
 #include "tensorflow/core/lib/core/blocking_counter.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
@@ -25,8 +25,8 @@ limitations under the License.
 namespace tensorflow {
 namespace {
 
-Status ReadCache(FileBlockCache* cache, const string& filename, size_t offset,
-                 size_t n, std::vector<char>* out) {
+Status ReadCache(RamFileBlockCache* cache, const string& filename,
+                 size_t offset, size_t n, std::vector<char>* out) {
   out->clear();
   out->resize(n, 0);
   size_t bytes_transferred = 0;
@@ -37,7 +37,7 @@ Status ReadCache(FileBlockCache* cache, const string& filename, size_t offset,
   return status;
 }
 
-TEST(FileBlockCacheTest, PassThrough) {
+TEST(RamFileBlockCacheTest, PassThrough) {
   const string want_filename = "foo/bar";
   const size_t want_offset = 42;
   const size_t want_n = 1024;
@@ -54,9 +54,9 @@ TEST(FileBlockCacheTest, PassThrough) {
     return Status::OK();
   };
   // If block_size, max_bytes, or both are zero, the cache is a pass-through.
-  FileBlockCache cache1(1, 0, 0, fetcher);
-  FileBlockCache cache2(0, 1, 0, fetcher);
-  FileBlockCache cache3(0, 0, 0, fetcher);
+  RamFileBlockCache cache1(1, 0, 0, fetcher);
+  RamFileBlockCache cache2(0, 1, 0, fetcher);
+  RamFileBlockCache cache3(0, 0, 0, fetcher);
   std::vector<char> out;
   TF_EXPECT_OK(ReadCache(&cache1, want_filename, want_offset, want_n, &out));
   EXPECT_EQ(calls, 1);
@@ -66,7 +66,7 @@ TEST(FileBlockCacheTest, PassThrough) {
   EXPECT_EQ(calls, 3);
 }
 
-TEST(FileBlockCacheTest, BlockAlignment) {
+TEST(RamFileBlockCacheTest, BlockAlignment) {
   // Initialize a 256-byte buffer.  This is the file underlying the reads we'll
   // do in this test.
   const size_t size = 256;
@@ -89,7 +89,7 @@ TEST(FileBlockCacheTest, BlockAlignment) {
   for (size_t block_size = 2; block_size <= 4; block_size++) {
     // Make a cache of N-byte block size (1 block) and verify that reads of
     // varying offsets and lengths return correct data.
-    FileBlockCache cache(block_size, block_size, 0, fetcher);
+    RamFileBlockCache cache(block_size, block_size, 0, fetcher);
     for (size_t offset = 0; offset < 10; offset++) {
       for (size_t n = block_size - 2; n <= block_size + 2; n++) {
         std::vector<char> got;
@@ -117,7 +117,7 @@ TEST(FileBlockCacheTest, BlockAlignment) {
   }
 }
 
-TEST(FileBlockCacheTest, CacheHits) {
+TEST(RamFileBlockCacheTest, CacheHits) {
   const size_t block_size = 16;
   std::set<size_t> calls;
   auto fetcher = [&calls, block_size](const string& filename, size_t offset,
@@ -132,7 +132,7 @@ TEST(FileBlockCacheTest, CacheHits) {
     return Status::OK();
   };
   const uint32 block_count = 256;
-  FileBlockCache cache(block_size, block_count * block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, block_count * block_size, 0, fetcher);
   std::vector<char> out;
   out.resize(block_count, 0);
   // The cache has space for `block_count` blocks. The loop with i = 0 should
@@ -146,7 +146,7 @@ TEST(FileBlockCacheTest, CacheHits) {
   }
 }
 
-TEST(FileBlockCacheTest, OutOfRange) {
+TEST(RamFileBlockCacheTest, OutOfRange) {
   // Tests reads of a 24-byte file with block size 16.
   const size_t block_size = 16;
   const size_t file_size = 24;
@@ -172,7 +172,7 @@ TEST(FileBlockCacheTest, OutOfRange) {
     *bytes_transferred = bytes_to_copy;
     return Status::OK();
   };
-  FileBlockCache cache(block_size, block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, block_size, 0, fetcher);
   std::vector<char> out;
   // Reading the first 16 bytes should be fine.
   TF_EXPECT_OK(ReadCache(&cache, "", 0, block_size, &out));
@@ -191,7 +191,7 @@ TEST(FileBlockCacheTest, OutOfRange) {
   EXPECT_EQ(out.size(), file_size - block_size);
 }
 
-TEST(FileBlockCacheTest, Inconsistent) {
+TEST(RamFileBlockCacheTest, Inconsistent) {
   // Tests the detection of interrupted reads leading to partially filled blocks
   // where we expected complete blocks.
   const size_t block_size = 16;
@@ -205,7 +205,7 @@ TEST(FileBlockCacheTest, Inconsistent) {
     *bytes_transferred = 1;
     return Status::OK();
   };
-  FileBlockCache cache(block_size, 2 * block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, 2 * block_size, 0, fetcher);
   std::vector<char> out;
   // Read the second block; this should yield an OK status and a single byte.
   TF_EXPECT_OK(ReadCache(&cache, "", block_size, block_size, &out));
@@ -216,7 +216,7 @@ TEST(FileBlockCacheTest, Inconsistent) {
   EXPECT_EQ(status.code(), error::INTERNAL);
 }
 
-TEST(FileBlockCacheTest, LRU) {
+TEST(RamFileBlockCacheTest, LRU) {
   const size_t block_size = 16;
   std::list<size_t> calls;
   auto fetcher = [&calls, block_size](const string& filename, size_t offset,
@@ -233,7 +233,7 @@ TEST(FileBlockCacheTest, LRU) {
     return Status::OK();
   };
   const uint32 block_count = 2;
-  FileBlockCache cache(block_size, block_count * block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, block_count * block_size, 0, fetcher);
   std::vector<char> out;
   // Read blocks from the cache, and verify the LRU behavior based on the
   // fetcher calls that the cache makes.
@@ -265,7 +265,7 @@ TEST(FileBlockCacheTest, LRU) {
   TF_EXPECT_OK(ReadCache(&cache, "", 0, 1, &out));
 }
 
-TEST(FileBlockCacheTest, MaxStaleness) {
+TEST(RamFileBlockCacheTest, MaxStaleness) {
   int calls = 0;
   auto fetcher = [&calls](const string& filename, size_t offset, size_t n,
                           char* buffer, size_t* bytes_transferred) {
@@ -278,7 +278,7 @@ TEST(FileBlockCacheTest, MaxStaleness) {
   std::unique_ptr<NowSecondsEnv> env(new NowSecondsEnv);
   // Create a cache with max staleness of 2 seconds, and verify that it works as
   // expected.
-  FileBlockCache cache1(8, 16, 2 /* max staleness */, fetcher, env.get());
+  RamFileBlockCache cache1(8, 16, 2 /* max staleness */, fetcher, env.get());
   // Execute the first read to load the block.
   TF_EXPECT_OK(ReadCache(&cache1, "", 0, 1, &out));
   EXPECT_EQ(calls, 1);
@@ -294,7 +294,7 @@ TEST(FileBlockCacheTest, MaxStaleness) {
   // as expected.
   calls = 0;
   env->SetNowSeconds(0);
-  FileBlockCache cache2(8, 16, 0 /* max staleness */, fetcher, env.get());
+  RamFileBlockCache cache2(8, 16, 0 /* max staleness */, fetcher, env.get());
   // Execute the first read to load the block.
   TF_EXPECT_OK(ReadCache(&cache2, "", 0, 1, &out));
   EXPECT_EQ(calls, 1);
@@ -305,7 +305,7 @@ TEST(FileBlockCacheTest, MaxStaleness) {
   EXPECT_EQ(calls, 1);
 }
 
-TEST(FileBlockCacheTest, RemoveFile) {
+TEST(RamFileBlockCacheTest, RemoveFile) {
   int calls = 0;
   auto fetcher = [&calls](const string& filename, size_t offset, size_t n,
                           char* buffer, size_t* bytes_transferred) {
@@ -321,7 +321,7 @@ TEST(FileBlockCacheTest, RemoveFile) {
   };
   // This cache has space for 4 blocks; we'll read from two files.
   const size_t n = 3;
-  FileBlockCache cache(8, 32, 0, fetcher);
+  RamFileBlockCache cache(8, 32, 0, fetcher);
   std::vector<char> out;
   std::vector<char> a(n, 'a');
   std::vector<char> b(n, 'b');
@@ -367,7 +367,7 @@ TEST(FileBlockCacheTest, RemoveFile) {
   EXPECT_EQ(calls, 6);
 }
 
-TEST(FileBlockCacheTest, Prune) {
+TEST(RamFileBlockCacheTest, Prune) {
   int calls = 0;
   auto fetcher = [&calls](const string& filename, size_t offset, size_t n,
                           char* buffer, size_t* bytes_transferred) {
@@ -381,7 +381,7 @@ TEST(FileBlockCacheTest, Prune) {
   std::unique_ptr<NowSecondsEnv> env(new NowSecondsEnv);
   uint64 now = Env::Default()->NowSeconds();
   env->SetNowSeconds(now);
-  FileBlockCache cache(8, 32, 1 /* max staleness */, fetcher, env.get());
+  RamFileBlockCache cache(8, 32, 1 /* max staleness */, fetcher, env.get());
   // Read three blocks into the cache, and advance the timestamp by one second
   // with each read. Start with a block of "a" at the current timestamp `now`.
   TF_EXPECT_OK(ReadCache(&cache, "a", 0, 1, &out));
@@ -426,7 +426,7 @@ TEST(FileBlockCacheTest, Prune) {
   EXPECT_EQ(cache.CacheSize(), 0);
 }
 
-TEST(FileBlockCacheTest, ParallelReads) {
+TEST(RamFileBlockCacheTest, ParallelReads) {
   // This fetcher won't respond until either `callers` threads are calling it
   // concurrently (at which point it will respond with success to all callers),
   // or 10 seconds have elapsed (at which point it will respond with an error).
@@ -444,7 +444,7 @@ TEST(FileBlockCacheTest, ParallelReads) {
     return Status::OK();
   };
   const int block_size = 8;
-  FileBlockCache cache(block_size, 2 * callers * block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, 2 * callers * block_size, 0, fetcher);
   std::vector<std::unique_ptr<Thread>> threads;
   for (int i = 0; i < callers; i++) {
     threads.emplace_back(
@@ -461,7 +461,7 @@ TEST(FileBlockCacheTest, ParallelReads) {
   // executed, or 10 seconds have passed).
 }
 
-TEST(FileBlockCacheTest, CoalesceConcurrentReads) {
+TEST(RamFileBlockCacheTest, CoalesceConcurrentReads) {
   // Concurrent reads to the same file blocks should be de-duplicated.
   const size_t block_size = 16;
   int num_requests = 0;
@@ -479,7 +479,7 @@ TEST(FileBlockCacheTest, CoalesceConcurrentReads) {
     Env::Default()->SleepForMicroseconds(100000);  // 0.1 secs
     return Status::OK();
   };
-  FileBlockCache cache(block_size, block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, block_size, 0, fetcher);
   // Fork off thread for parallel read.
   std::unique_ptr<Thread> concurrent(
       Env::Default()->StartThread({}, "concurrent", [&cache, block_size] {
@@ -496,7 +496,7 @@ TEST(FileBlockCacheTest, CoalesceConcurrentReads) {
   EXPECT_EQ(1, num_requests);
 }
 
-TEST(FileBlockCacheTest, Flush) {
+TEST(RamFileBlockCacheTest, Flush) {
   int calls = 0;
   auto fetcher = [&calls](const string& filename, size_t offset, size_t n,
                           char* buffer, size_t* bytes_transferred) {
@@ -505,7 +505,7 @@ TEST(FileBlockCacheTest, Flush) {
     *bytes_transferred = n;
     return Status::OK();
   };
-  FileBlockCache cache(16, 32, 0, fetcher);
+  RamFileBlockCache cache(16, 32, 0, fetcher);
   std::vector<char> out;
   TF_EXPECT_OK(ReadCache(&cache, "", 0, 16, &out));
   TF_EXPECT_OK(ReadCache(&cache, "", 0, 16, &out));
diff --git a/tensorflow/core/platform/default/build_config.bzl b/tensorflow/core/platform/default/build_config.bzl
index 2102c5cca383b553c56fb3704596e3d1335c55c2..e01e076bcf279206ea821d2777a3d44755668f02 100644
--- a/tensorflow/core/platform/default/build_config.bzl
+++ b/tensorflow/core/platform/default/build_config.bzl
@@ -219,7 +219,7 @@ def tf_proto_library_cc(name, srcs = [], has_services = None,
                         cc_stubby_versions = None,
                         cc_grpc_version = None,
                         j2objc_api_version = 1,
-                        cc_api_version = 2, go_api_version = 2,
+                        cc_api_version = 2,
                         java_api_version = 2, py_api_version = 2,
                         js_api_version = 2, js_codegen = "jspb",
                         default_header = False):
@@ -280,7 +280,6 @@ def tf_proto_library(name, srcs = [], has_services = None,
                      visibility = [], testonly = 0,
                      cc_libs = [],
                      cc_api_version = 2, cc_grpc_version = None,
-                     go_api_version = 2,
                      j2objc_api_version = 1,
                      java_api_version = 2, py_api_version = 2,
                      js_api_version = 2, js_codegen = "jspb",
diff --git a/tensorflow/core/platform/default/context.h b/tensorflow/core/platform/default/context.h
index d8afeb47a9ca06e61a8c02962bc98d7797a279f7..682f64c26d7e3c4306df4139f0f48297e5c01a03 100644
--- a/tensorflow/core/platform/default/context.h
+++ b/tensorflow/core/platform/default/context.h
@@ -22,6 +22,8 @@ class Context {
  public:
   Context() {}
   Context(const ContextKind kind) {}
+
+  bool operator==(const Context& other) const { return true; }
 };
 
 class WithContext {
diff --git a/tensorflow/core/platform/default/logging.h b/tensorflow/core/platform/default/logging.h
index f0efa31d5576393e9d9bba6e39a454b2a33cddc3..2c134f1be931982930047850736d1d3a33fdffcc 100644
--- a/tensorflow/core/platform/default/logging.h
+++ b/tensorflow/core/platform/default/logging.h
@@ -64,11 +64,11 @@ class LogMessageFatal : public LogMessage {
 };
 
 #define _TF_LOG_INFO \
-  ::tensorflow::internal::LogMessage(__FILE__, __LINE__, tensorflow::INFO)
+  ::tensorflow::internal::LogMessage(__FILE__, __LINE__, ::tensorflow::INFO)
 #define _TF_LOG_WARNING \
-  ::tensorflow::internal::LogMessage(__FILE__, __LINE__, tensorflow::WARNING)
+  ::tensorflow::internal::LogMessage(__FILE__, __LINE__, ::tensorflow::WARNING)
 #define _TF_LOG_ERROR \
-  ::tensorflow::internal::LogMessage(__FILE__, __LINE__, tensorflow::ERROR)
+  ::tensorflow::internal::LogMessage(__FILE__, __LINE__, ::tensorflow::ERROR)
 #define _TF_LOG_FATAL \
   ::tensorflow::internal::LogMessageFatal(__FILE__, __LINE__)
 
diff --git a/tensorflow/core/platform/default/mutex.cc b/tensorflow/core/platform/default/mutex.cc
new file mode 100644
index 0000000000000000000000000000000000000000..79830a4738afeb2a19e3e3811be4c88e13f06090
--- /dev/null
+++ b/tensorflow/core/platform/default/mutex.cc
@@ -0,0 +1,89 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/mutex.h"
+#include <chrono>
+#include <condition_variable>
+#include "nsync_cv.h"
+#include "nsync_mu.h"
+
+namespace tensorflow {
+
+// Check that the external_mu_space struct used to reserve space for the mutex
+// in tensorflow::mutex is big enough.
+static_assert(sizeof(nsync::nsync_mu) <= sizeof(mutex::external_mu_space),
+              "tensorflow::mutex::external_mu_space needs to be bigger");
+
+// Cast a pointer to mutex::external_mu_space to a pointer to the mutex
+// representation.  This is done so that the header files for nsync_mu do not
+// need to be included in every file that uses tensorflow's mutex.
+static inline nsync::nsync_mu *mu_cast(mutex::external_mu_space *mu) {
+  return reinterpret_cast<nsync::nsync_mu *>(mu);
+}
+
+mutex::mutex() { nsync::nsync_mu_init(mu_cast(&mu_)); }
+
+void mutex::lock() { nsync::nsync_mu_lock(mu_cast(&mu_)); }
+
+bool mutex::try_lock() { return nsync::nsync_mu_trylock(mu_cast(&mu_)) != 0; }
+
+void mutex::unlock() { nsync::nsync_mu_unlock(mu_cast(&mu_)); }
+
+void mutex::lock_shared() { nsync::nsync_mu_rlock(mu_cast(&mu_)); }
+
+bool mutex::try_lock_shared() {
+  return nsync::nsync_mu_rtrylock(mu_cast(&mu_)) != 0;
+}
+
+void mutex::unlock_shared() { nsync::nsync_mu_runlock(mu_cast(&mu_)); }
+
+// Check that the external_cv_space struct used to reserve space for the
+// condition variable in tensorflow::condition_variable is big enough.
+static_assert(
+    sizeof(nsync::nsync_cv) <= sizeof(condition_variable::external_cv_space),
+    "tensorflow::condition_variable::external_cv_space needs to be bigger");
+
+// Cast a pointer to condition_variable::external_cv_space to a pointer to the
+// condition variable representation.  This is done so that the header files
+// for nsync_cv do not need to be included in every file that uses tensorflow's
+// condition_variable.
+static inline nsync::nsync_cv *cv_cast(
+    condition_variable::external_cv_space *cv) {
+  return reinterpret_cast<nsync::nsync_cv *>(cv);
+}
+
+condition_variable::condition_variable() {
+  nsync::nsync_cv_init(cv_cast(&cv_));
+}
+
+void condition_variable::wait(mutex_lock &lock) {
+  nsync::nsync_cv_wait(cv_cast(&cv_), mu_cast(&lock.mutex()->mu_));
+}
+
+std::cv_status condition_variable::wait_until_system_clock(
+    mutex_lock &lock,
+    const std::chrono::system_clock::time_point timeout_time) {
+  int r = nsync::nsync_cv_wait_with_deadline(
+      cv_cast(&cv_), mu_cast(&lock.mutex()->mu_), timeout_time, nullptr);
+  return r ? std::cv_status::timeout : std::cv_status::no_timeout;
+}
+
+void condition_variable::notify_one() { nsync::nsync_cv_signal(cv_cast(&cv_)); }
+
+void condition_variable::notify_all() {
+  nsync::nsync_cv_broadcast(cv_cast(&cv_));
+}
+
+}  // namespace tensorflow
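The new mutex.cc keeps the nsync headers out of mutex.h by reserving raw space in the public class and casting to the real type only inside the .cc file, with a static_assert guarding the size. The sketch below shows the same technique in isolation. `BackendCounter` is a hypothetical stand-in for a third-party type such as nsync::nsync_mu, and `Counter` and `backend_cast` are illustrative names that do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// Hypothetical third-party type whose header we want to hide from users.
struct BackendCounter {
  std::uint32_t value;
  std::uint32_t flags;
};

// "Header" side: only opaque storage is visible, no backend include needed.
class Counter {
 public:
  struct external_space {
    void* space[2];
  };

  Counter();
  void Increment();
  std::uint32_t Get() const;

 private:
  external_space storage_;
};

// "Source" side: verify that the reserved space is large and aligned enough.
static_assert(sizeof(BackendCounter) <= sizeof(Counter::external_space),
              "Counter::external_space needs to be bigger");
static_assert(alignof(BackendCounter) <= alignof(Counter::external_space),
              "Counter::external_space needs stricter alignment");

// Cast the opaque storage to the backend representation.
static inline BackendCounter* backend_cast(Counter::external_space* s) {
  return reinterpret_cast<BackendCounter*>(s);
}
static inline const BackendCounter* backend_cast(
    const Counter::external_space* s) {
  return reinterpret_cast<const BackendCounter*>(s);
}

// Construct the backend object inside the reserved storage.
Counter::Counter() { new (&storage_) BackendCounter{0, 0}; }

void Counter::Increment() { ++backend_cast(&storage_)->value; }

std::uint32_t Counter::Get() const { return backend_cast(&storage_)->value; }
```

The trade-off is the usual one for this pattern: callers no longer pay for the backend's headers, but every operation becomes an out-of-line call, and the reserved size has to be kept in sync with the backend by hand, which is what the static_assert checks. The alignment assertion is an addition in this sketch; the patch itself only checks size.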
diff --git a/tensorflow/core/platform/default/mutex.h b/tensorflow/core/platform/default/mutex.h
index 044c754e80bd0dee04c73e969c325a2aa4a89c31..a12d92795e14665f62222879488ab1cb89da8cfc 100644
--- a/tensorflow/core/platform/default/mutex.h
+++ b/tensorflow/core/platform/default/mutex.h
@@ -22,9 +22,8 @@ limitations under the License.
 #include <chrono>
 #include <condition_variable>
 #include <mutex>
-#include "nsync_cv.h"
-#include "nsync_mu.h"
 #include "tensorflow/core/platform/thread_annotations.h"
+
 namespace tensorflow {
 
 #undef mutex_lock
@@ -38,26 +37,26 @@ class condition_variable;
 // lock.
 class LOCKABLE mutex {
  public:
-  mutex() { nsync::nsync_mu_init(&mu_); }
-  // The default implementation of nsync_mutex is safe to use after the linker
-  // initializations
+  mutex();
+  // The default implementation of the underlying mutex is safe to use after
+  // the linker initialization to zero.
   explicit mutex(LinkerInitialized x) {}
 
-  void lock() EXCLUSIVE_LOCK_FUNCTION() { nsync::nsync_mu_lock(&mu_); }
-  bool try_lock() EXCLUSIVE_TRYLOCK_FUNCTION(true) {
-    return nsync::nsync_mu_trylock(&mu_) != 0;
-  };
-  void unlock() UNLOCK_FUNCTION() { nsync::nsync_mu_unlock(&mu_); }
+  void lock() EXCLUSIVE_LOCK_FUNCTION();
+  bool try_lock() EXCLUSIVE_TRYLOCK_FUNCTION(true);
+  void unlock() UNLOCK_FUNCTION();
+
+  void lock_shared() SHARED_LOCK_FUNCTION();
+  bool try_lock_shared() SHARED_TRYLOCK_FUNCTION(true);
+  void unlock_shared() UNLOCK_FUNCTION();
 
-  void lock_shared() SHARED_LOCK_FUNCTION() { nsync::nsync_mu_rlock(&mu_); }
-  bool try_lock_shared() SHARED_TRYLOCK_FUNCTION(true) {
-    return nsync::nsync_mu_rtrylock(&mu_) != 0;
+  struct external_mu_space {
+    void* space[2];
   };
-  void unlock_shared() UNLOCK_FUNCTION() { nsync::nsync_mu_runlock(&mu_); }
 
  private:
   friend class condition_variable;
-  nsync::nsync_mu mu_;
+  external_mu_space mu_;
 };
 
 // Mimic a subset of the std::unique_lock<tensorflow::mutex> functionality.
@@ -139,26 +138,29 @@ class SCOPED_LOCKABLE tf_shared_lock {
 // Mimic std::condition_variable.
 class condition_variable {
  public:
-  condition_variable() { nsync::nsync_cv_init(&cv_); }
+  condition_variable();
 
-  void wait(mutex_lock& lock) {
-    nsync::nsync_cv_wait(&cv_, &lock.mutex()->mu_);
-  }
+  void wait(mutex_lock& lock);
   template <class Rep, class Period>
   std::cv_status wait_for(mutex_lock& lock,
                           std::chrono::duration<Rep, Period> dur) {
-    int r = nsync::nsync_cv_wait_with_deadline(
-        &cv_, &lock.mutex()->mu_, std::chrono::system_clock::now() + dur,
-        nullptr);
-    return r ? std::cv_status::timeout : std::cv_status::no_timeout;
+    return wait_until_system_clock(lock,
+                                   std::chrono::system_clock::now() + dur);
   }
-  void notify_one() { nsync::nsync_cv_signal(&cv_); }
-  void notify_all() { nsync::nsync_cv_broadcast(&cv_); }
+  void notify_one();
+  void notify_all();
+
+  struct external_cv_space {
+    void* space[2];
+  };
 
  private:
   friend ConditionResult WaitForMilliseconds(mutex_lock* mu,
                                              condition_variable* cv, int64 ms);
-  nsync::nsync_cv cv_;
+  std::cv_status wait_until_system_clock(
+      mutex_lock& lock,
+      const std::chrono::system_clock::time_point timeout_time);
+  external_cv_space cv_;
 };
 
 inline ConditionResult WaitForMilliseconds(mutex_lock* mu,
diff --git a/tensorflow/core/platform/env.cc b/tensorflow/core/platform/env.cc
index 12509c250eab9047b869694e930bf523a975a4f8..b9a9ef85eb16e62bf9fecf01910fb98673e8cf7b 100644
--- a/tensorflow/core/platform/env.cc
+++ b/tensorflow/core/platform/env.cc
@@ -33,7 +33,6 @@ limitations under the License.
 #endif
 
 #include "tensorflow/core/lib/core/errors.h"
-#include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/lib/gtl/stl_util.h"
 #include "tensorflow/core/lib/io/path.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
diff --git a/tensorflow/core/platform/env.h b/tensorflow/core/platform/env.h
index 4ce4e0b4e024d50ae2bd081ec7b8b155060d2a4a..a7e9fcb17c4943c1005b98e4150e90f8b3d8585d 100644
--- a/tensorflow/core/platform/env.h
+++ b/tensorflow/core/platform/env.h
@@ -88,8 +88,8 @@ class Env {
   /// The ownership of the returned RandomAccessFile is passed to the caller
   /// and the object should be deleted when is not used. The file object
   /// shouldn't live longer than the Env object.
-  Status NewRandomAccessFile(const string& fname,
-                             std::unique_ptr<RandomAccessFile>* result);
+  virtual Status NewRandomAccessFile(const string& fname,
+                                     std::unique_ptr<RandomAccessFile>* result);
 
   /// \brief Creates an object that writes to a new file with the specified
   /// name.
@@ -291,10 +291,10 @@ class Env {
   virtual string FormatLibraryFileName(const string& name,
                                        const string& version) = 0;
 
- private:
   // Returns a possible list of local temporary directories.
-  void GetLocalTempDirectories(std::vector<string>* list);
+  virtual void GetLocalTempDirectories(std::vector<string>* list) = 0;
 
+ private:
   std::unique_ptr<FileSystemRegistry> file_system_registry_;
   TF_DISALLOW_COPY_AND_ASSIGN(Env);
   EnvTime* envTime = EnvTime::Default();
@@ -358,6 +358,10 @@ class EnvWrapper : public Env {
   }
 
  private:
+  void GetLocalTempDirectories(std::vector<string>* list) override {
+    target_->GetLocalTempDirectories(list);
+  }
+
   Env* target_;
 };
 
diff --git a/tensorflow/core/platform/file_system.cc b/tensorflow/core/platform/file_system.cc
index 271d73f5f1a7bd3e1301520aed09cbafd89c8ebc..5bc8606e28057d65522f23c58266494ec7d44c0b 100644
--- a/tensorflow/core/platform/file_system.cc
+++ b/tensorflow/core/platform/file_system.cc
@@ -19,15 +19,12 @@ limitations under the License.
 
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/threadpool.h"
-#include "tensorflow/core/lib/gtl/map_util.h"
-#include "tensorflow/core/lib/gtl/stl_util.h"
 #include "tensorflow/core/lib/io/path.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/env.h"
 #include "tensorflow/core/platform/file_system.h"
 #include "tensorflow/core/platform/platform.h"
-#include "tensorflow/core/platform/protobuf.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/platform/file_system.h b/tensorflow/core/platform/file_system.h
index 3085b6958fd921ae124b885107e807f0a02e1d9d..03c0c5ab51b6ddfb2c577a8c3483e3c898685735 100644
--- a/tensorflow/core/platform/file_system.h
+++ b/tensorflow/core/platform/file_system.h
@@ -27,7 +27,6 @@ limitations under the License.
 #include "tensorflow/core/platform/file_statistics.h"
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/platform.h"
-#include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/platform/types.h"
 
 #ifdef PLATFORM_WINDOWS
diff --git a/tensorflow/core/platform/posix/env.cc b/tensorflow/core/platform/posix/env.cc
index 8097624e09f81364071895ad114f26f93f4aab14..418874d3406200566cdd9a4c6141852b948413ff 100644
--- a/tensorflow/core/platform/posix/env.cc
+++ b/tensorflow/core/platform/posix/env.cc
@@ -118,6 +118,9 @@ class PosixEnv : public Env {
                                const string& version) override {
     return tensorflow::internal::FormatLibraryFileName(name, version);
   }
+
+ private:
+  void GetLocalTempDirectories(std::vector<string>* list) override;
 };
 
 }  // namespace
@@ -131,7 +134,7 @@ Env* Env::Default() {
 }
 #endif
 
-void Env::GetLocalTempDirectories(std::vector<string>* list) {
+void PosixEnv::GetLocalTempDirectories(std::vector<string>* list) {
   list->clear();
   // Directories, in order of preference. If we find a dir that
   // exists, we stop adding other less-preferred dirs
diff --git a/tensorflow/core/platform/posix/subprocess.cc b/tensorflow/core/platform/posix/subprocess.cc
index cefc66831a9b9fe11a170013e64ff7a1cd2e2bcd..a661c34ef01b1afba05dedf3d7c6c2aef86245fa 100644
--- a/tensorflow/core/platform/posix/subprocess.cc
+++ b/tensorflow/core/platform/posix/subprocess.cc
@@ -20,6 +20,8 @@ limitations under the License.
 #include <string.h>
 #include <sys/types.h>
 #include <sys/wait.h>
+#include <memory>
+#include <vector>
 
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/subprocess.h"
@@ -461,4 +463,12 @@ int SubProcess::Communicate(const string* stdin_input, string* stdout_output,
   return WaitInternal(&status) ? status : -1;
 }
 
+std::unique_ptr<SubProcess> CreateSubProcess(const std::vector<string>& argv) {
+  std::unique_ptr<SubProcess> proc(new SubProcess());
+  proc->SetProgram(argv[0], argv);
+  proc->SetChannelAction(CHAN_STDERR, ACTION_DUPPARENT);
+  proc->SetChannelAction(CHAN_STDOUT, ACTION_DUPPARENT);
+  return proc;
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/posix/test.cc b/tensorflow/core/platform/posix/test.cc
index a69127b3e88834458503012ad4d8d9334bba247a..28f7478a6d5a371079acef135eaab69ead1ebf4b 100644
--- a/tensorflow/core/platform/posix/test.cc
+++ b/tensorflow/core/platform/posix/test.cc
@@ -20,19 +20,10 @@ limitations under the License.
 
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
-#include "tensorflow/core/platform/subprocess.h"
 
 namespace tensorflow {
 namespace testing {
 
-std::unique_ptr<SubProcess> CreateSubProcess(const std::vector<string>& argv) {
-  std::unique_ptr<SubProcess> proc(new SubProcess());
-  proc->SetProgram(argv[0], argv);
-  proc->SetChannelAction(CHAN_STDERR, ACTION_DUPPARENT);
-  proc->SetChannelAction(CHAN_STDOUT, ACTION_DUPPARENT);
-  return proc;
-}
-
 int PickUnusedPortOrDie() { return internal::PickUnusedPortOrDie(); }
 
 string TensorFlowSrcRoot() {
diff --git a/tensorflow/core/platform/subprocess.h b/tensorflow/core/platform/subprocess.h
index dfdcf82173b774dee119f1fe9e84818e45d7b50c..dcc0c1a4ee33ff47beefa6c3f82c6954770e7036 100644
--- a/tensorflow/core/platform/subprocess.h
+++ b/tensorflow/core/platform/subprocess.h
@@ -16,6 +16,9 @@ limitations under the License.
 #ifndef TENSORFLOW_PLATFORM_SUBPROCESS_H_
 #define TENSORFLOW_PLATFORM_SUBPROCESS_H_
 
+#include <memory>
+#include <vector>
+
 namespace tensorflow {
 
 // Channel identifiers.
@@ -43,6 +46,12 @@ enum ChannelAction {
 // Supports spawning and killing child processes.
 class SubProcess;
 
+// Returns an object that represents a child process that will be
+// launched with the given command-line arguments `argv`. The process
+// must be explicitly started by calling the Start() method on the
+// returned object.
+std::unique_ptr<SubProcess> CreateSubProcess(const std::vector<string>& argv);
+
 }  // namespace tensorflow
 
 #include "tensorflow/core/platform/platform.h"
diff --git a/tensorflow/core/platform/test.h b/tensorflow/core/platform/test.h
index 295957c3d801cc959eeb3e60dd2a74587ee14197..99bae63edf8ae26fb51acde12dc1a4f8bcaf778c 100644
--- a/tensorflow/core/platform/test.h
+++ b/tensorflow/core/platform/test.h
@@ -21,7 +21,6 @@ limitations under the License.
 
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/platform.h"
-#include "tensorflow/core/platform/subprocess.h"
 #include "tensorflow/core/platform/types.h"
 
 // As of September 2016, we continue to attempt to avoid the use of gmock aka
@@ -49,12 +48,6 @@ string TensorFlowSrcRoot();
 // Returns the same value for the lifetime of the process.
 int RandomSeed();
 
-// Returns an object that represents a child process that will be
-// launched with the given command-line arguments `argv`. The process
-// must be explicitly started by calling the Start() method on the
-// returned object.
-std::unique_ptr<SubProcess> CreateSubProcess(const std::vector<string>& argv);
-
 // Returns an unused port number, for use in multi-process testing.
 // NOTE: This function is not thread-safe.
 int PickUnusedPortOrDie();
diff --git a/tensorflow/core/platform/types.h b/tensorflow/core/platform/types.h
index e2dd5b003f291b6ce88ebabe2d66114762bd2c57..6308e588470d75c2236113f7e19a27241f2f9224 100644
--- a/tensorflow/core/platform/types.h
+++ b/tensorflow/core/platform/types.h
@@ -31,12 +31,6 @@ limitations under the License.
 #error Define the appropriate PLATFORM_<foo> macro for this platform
 #endif
 
-#if defined(PLATFORM_WINDOWS)
-#include "tensorflow/core/platform/windows/cpu_info.h"
-#endif
-
-#include "tensorflow/core/lib/bfloat16/bfloat16.h"
-
 namespace tensorflow {
 
 // Define tensorflow::string to refer to appropriate platform specific type.
diff --git a/tensorflow/core/platform/windows/env.cc b/tensorflow/core/platform/windows/env.cc
index 41b264417071cadb5f70806b458ee2b46ebb2feb..2f54f423b2ee7d8842f2edab7f6bf29877fac173 100644
--- a/tensorflow/core/platform/windows/env.cc
+++ b/tensorflow/core/platform/windows/env.cc
@@ -160,6 +160,8 @@ class WindowsEnv : public Env {
   }
 
  private:
+  void GetLocalTempDirectories(std::vector<string>* list) override;
+
   typedef VOID(WINAPI* FnGetSystemTimePreciseAsFileTime)(LPFILETIME);
   FnGetSystemTimePreciseAsFileTime GetSystemTimePreciseAsFileTime_;
 };
@@ -174,7 +176,7 @@ Env* Env::Default() {
   return default_env;
 }
 
-void Env::GetLocalTempDirectories(std::vector<string>* list) {
+void WindowsEnv::GetLocalTempDirectories(std::vector<string>* list) {
   list->clear();
   // On windows we'll try to find a directory in this order:
   //   C:/Documents & Settings/whomever/TEMP (or whatever GetTempPath() is)
diff --git a/tensorflow/core/platform/windows/port.cc b/tensorflow/core/platform/windows/port.cc
index 582b232054b850a2ef5ab8f47c089eb35a7bb3cf..f3b27ea394d04770b612752328d5d571e6521cc6 100644
--- a/tensorflow/core/platform/windows/port.cc
+++ b/tensorflow/core/platform/windows/port.cc
@@ -25,6 +25,7 @@ limitations under the License.
 #endif
 
 #include <Windows.h>
+#include <shlwapi.h>
 
 #include "tensorflow/core/platform/cpu_info.h"
 #include "tensorflow/core/platform/demangle.h"
@@ -149,11 +150,16 @@ bool Snappy_Uncompress(const char* input, size_t length, char* output) {
 string Demangle(const char* mangled) { return mangled; }
 
 double NominalCPUFrequency() {
-#ifdef TENSORFLOW_USE_ABSL
-  return absl::base_internal::NominalCPUFrequency();
-#else
+  DWORD data;
+  DWORD data_size = sizeof(data);
+  #pragma comment(lib, "shlwapi.lib")  // For SHGetValue().
+  if (SUCCEEDED(
+          SHGetValueA(HKEY_LOCAL_MACHINE,
+                      "HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
+                      "~MHz", nullptr, &data, &data_size))) {
+    return data * 1e6;  // Value is MHz.
+  }
   return 1.0;
-#endif
 }
 
 int64 AvailableRam() {
diff --git a/tensorflow/core/platform/windows/subprocess.h b/tensorflow/core/platform/windows/subprocess.h
index 66ec44885d52195b807f4957aec6d590324b2975..f00471d484014d431665dbf0cb0d38ea82a14435 100644
--- a/tensorflow/core/platform/windows/subprocess.h
+++ b/tensorflow/core/platform/windows/subprocess.h
@@ -16,11 +16,21 @@ limitations under the License.
 #ifndef TENSORFLOW_PLATFORM_WINDOWS_SUBPROCESS_H_
 #define TENSORFLOW_PLATFORM_WINDOWS_SUBPROCESS_H_
 
+#include <memory>
+#include <vector>
+
+#include "tensorflow/core/platform/logging.h"
+
 namespace tensorflow {
 
 // SubProcess is not yet implemented for Windows.
 class SubProcess {};
 
+std::unique_ptr<SubProcess> CreateSubProcess(const std::vector<string>& argv) {
+  LOG(FATAL) << "CreateSubProcess NOT IMPLEMENTED for Windows yet ! ";
+  return nullptr;
+}
+
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_PLATFORM_WINDOWS_SUBPROCESS_H_
diff --git a/tensorflow/core/platform/windows/test.cc b/tensorflow/core/platform/windows/test.cc
index 584acad91b24fc6be9b93f71b7d44b0fba3cb2e8..ad2b7bc6ff6e6037a922352d34b628b2138d0712 100644
--- a/tensorflow/core/platform/windows/test.cc
+++ b/tensorflow/core/platform/windows/test.cc
@@ -22,11 +22,6 @@ limitations under the License.
 namespace tensorflow {
 namespace testing {
 
-std::unique_ptr<SubProcess> CreateSubProcess(const std::vector<string>& argv) {
-  LOG(FATAL) << "CreateSubProcess NOT IMPLEMENTED for Windows yet ! ";
-  return nullptr;
-}
-
 int PickUnusedPortOrDie() { return internal::PickUnusedPortOrDie(); }
 
 string TensorFlowSrcRoot() {
diff --git a/tensorflow/core/platform/windows/windows_file_system.cc b/tensorflow/core/platform/windows/windows_file_system.cc
index b6b3722caae4dc0cdc0ddff91be479ab91a744b2..682e46e0fcd0322ed34fa94d0ee5516cf9194a3b 100644
--- a/tensorflow/core/platform/windows/windows_file_system.cc
+++ b/tensorflow/core/platform/windows/windows_file_system.cc
@@ -382,7 +382,8 @@ Status WindowsFileSystem::NewReadOnlyMemoryRegionFromFile(
 
 Status WindowsFileSystem::FileExists(const string& fname) {
   constexpr int kOk = 0;
-  if (_access(TranslateName(fname).c_str(), kOk) == 0) {
+  std::wstring ws_translated_fname = Utf8ToWideChar(TranslateName(fname));
+  if (_waccess(ws_translated_fname.c_str(), kOk) == 0) {
     return Status::OK();
   }
   return errors::NotFound(fname, " not found");
diff --git a/tensorflow/core/profiler/BUILD b/tensorflow/core/profiler/BUILD
index 35d99930186381edbb80aa6485856e288f1dd568..5ce6f1046d3a812039106520d4883622c4df485b 100644
--- a/tensorflow/core/profiler/BUILD
+++ b/tensorflow/core/profiler/BUILD
@@ -57,7 +57,6 @@ tf_proto_library(
     name = "protos_all",
     srcs = glob(["**/*.proto"]),
     cc_api_version = 2,
-    go_api_version = 2,
     java_api_version = 2,
     protodeps = tf_additional_all_protos(),
     visibility = ["//visibility:public"],
diff --git a/tensorflow/core/protobuf/config.proto b/tensorflow/core/protobuf/config.proto
index 3606c5f127ce1f533d018e645b0a48c20e79cd8d..a3557e4721644dd2577e7b56077a4e7ef8030463 100644
--- a/tensorflow/core/protobuf/config.proto
+++ b/tensorflow/core/protobuf/config.proto
@@ -67,9 +67,7 @@ message GPUOptions {
   // set or set to 0, gets set to a non-zero default.
   int32 polling_active_delay_usecs = 6;
 
-  // In the event polling loop sleep this many millisconds between
-  // PollEvents calls, when the queue is empty.  If value is not
-  // set or set to 0, gets set to a non-zero default.
+  // This field is deprecated and ignored.
   int32 polling_inactive_delay_msecs = 7;
 
   // Force all tensors to be gpu_compatible. On a GPU-enabled TensorFlow,
@@ -410,3 +408,26 @@ message RunMetadata {
   // Graphs of the partitions executed by executors.
   repeated GraphDef partition_graphs = 3;
 }
+
+// Defines a subgraph in another `GraphDef` as a set of feed points and nodes
+// to be fetched or executed.
+//
+// Compare with the arguments to `Session::Run()`.
+message CallableOptions {
+  // Tensors to be fed in the callable. Each feed is the name of a tensor.
+  repeated string feed = 1;
+
+  // Fetches. A list of tensor names. The caller of the callable expects a
+  // tensor to be returned for each fetch[i] (see RunStepResponse.tensor). The
+  // order of specified fetches does not change the execution order.
+  repeated string fetch = 2;
+
+  // Target Nodes. A list of node names. The named nodes will be run by the
+  // callable but their outputs will not be returned.
+  repeated string target = 3;
+
+  // Options that will be applied to each run.
+  RunOptions run_options = 4;
+
+  // Next: 5
+}
diff --git a/tensorflow/core/protobuf/control_flow.proto b/tensorflow/core/protobuf/control_flow.proto
index 2c9476a08ad946e7f019475055397fcd6cfbbc5a..3c05b4f0e22e5ce2104980ad4fa52c8d8ad57070 100644
--- a/tensorflow/core/protobuf/control_flow.proto
+++ b/tensorflow/core/protobuf/control_flow.proto
@@ -17,6 +17,15 @@ message ValuesDef {
   map<string, string> external_values = 2;
 }
 
+// Container for any kind of control flow context. Any other control flow
+// contexts that are added below should also be added here.
+message ControlFlowContextDef {
+  oneof ctxt {
+    CondContextDef cond_ctxt = 1;
+    WhileContextDef while_ctxt = 2;
+  }
+}
+
 // Protocol buffer representing a CondContext object.
 message CondContextDef {
   // Name of the context.
@@ -33,6 +42,9 @@ message CondContextDef {
 
   // Values and external values in control flow context.
   ValuesDef values_def = 5;
+
+  // Contexts contained inside this context (e.g. nested conds).
+  repeated ControlFlowContextDef nested_contexts = 6;
 }
 
 // Protocol buffer representing a WhileContext object.
@@ -70,5 +82,8 @@ message WhileContextDef {
   // Optional name of the maximum_iterations tensor.
   string maximum_iterations_name = 11;
 
-  // Next available id: 12.
+  // Contexts contained inside this context (e.g. nested whiles).
+  repeated ControlFlowContextDef nested_contexts = 12;
+
+  // Next available id: 13.
 }
diff --git a/tensorflow/core/protobuf/rewriter_config.proto b/tensorflow/core/protobuf/rewriter_config.proto
index 504ed5d8190473f0df0219fd052b6b580c5929f6..fdf16aa1dafd3ff73c4eee6f893dc972020ca9a4 100644
--- a/tensorflow/core/protobuf/rewriter_config.proto
+++ b/tensorflow/core/protobuf/rewriter_config.proto
@@ -30,15 +30,22 @@ message RewriterConfig {
   }
 
   // Optimize tensor layouts (default is ON)
+  // e.g. This will try to use NCHW layout on GPU which is faster.
   Toggle layout_optimizer = 1;
   // Fold constants (default is ON)
+  // Statically infer the value of tensors when possible, and materialize the
+  // result using constants.
   Toggle constant_folding = 3;
   // Arithmetic optimizations (default is ON)
+  // e.g. Simplify arithmetic ops; merge ops with same value (like constants).
   Toggle arithmetic_optimization = 7;
   // Control dependency optimizations (default is ON).
+  // Remove redundant control dependencies, which may enable other optimization.
   Toggle dependency_optimization = 8;
-  // Loop optimizations (default is OFF).
+  // Loop optimizations (default is ON).
   Toggle loop_optimization = 9;
+  // Function optimizations (default is ON).
+  Toggle function_optimization = 10;
   // If true, don't remove unnecessary ops from the graph
   bool disable_model_pruning = 2;
 
@@ -49,12 +56,20 @@ message RewriterConfig {
     NO_MEM_OPT = 1;
     // Driven by manual op-level annotations.
     MANUAL = 2;
+
     // Driven by heuristics. The behavior of these heuristics is subject to
     // change. Currently includes an experimental recomputation and swapping
     // heuristics. Manual annotations are respected, but additional nodes are
     // selected automatically.
+
+    // Swapping heuristic will move a tensor from the GPU to the CPU and move
+    // it back when needed to reduce peak memory usage.
     SWAPPING_HEURISTICS = 4;
+    // Recomputation heuristics will recompute ops (such as Relu activation)
+    // during backprop instead of storing them, reducing peak memory usage.
     RECOMPUTATION_HEURISTICS = 5;
+    // Scheduling will split big ops such as AddN and try to enforce a schedule
+    // of the new computations that decreases peak memory usage.
     SCHEDULING_HEURISTICS = 6;
     // Use any combination of swapping and recomputation heuristics.
     HEURISTICS = 3;
@@ -63,16 +78,15 @@ message RewriterConfig {
   // effect on manually requested memory optimization passes in the optimizers
   // field.
   MemOptType memory_optimization = 4;
-  // The prefix for nodes which are valid outputs of recomputations. Inputs to
-  // nodes with this name prefix may be recomputed (subject either to manual
-  // annotation of those input nodes or to manual annotation and heuristics
-  // depending on memory_optimization), but the prefixed nodes themselves will
-  // not be recomputed. Typically this will be "gradients/", indicating that
-  // activations from the forward pass of a graph may be recomputed as inputs to
-  // gradients, but may be adjusted if gradients are inside a name scope or if
-  // inputs to non-gradients should be recomputed. Defaults to "gradients/" if
-  // empty or not set.
-  string memory_optimizer_target_node_name_prefix = 6;
+  // A node name scope for node names which are valid outputs of recomputations.
+  // Inputs to nodes that match this scope may be recomputed (subject either to
+  // manual annotation of those input nodes or to manual annotation and
+  // heuristics depending on memory_optimization), but the nodes themselves will
+  // not be recomputed. This matches any sub-scopes as well, meaning the scope
+  // can appear not just as a top-level scope. For example, if the value is
+  // "gradients/", the default, it will match the node names "gradients/foo"
+  // and "foo/gradients/bar", but not "foo_gradients/".
+  string memory_optimizer_target_node_name_scope = 6;
 
   // Configures AutoParallel optimization passes either through the
   // meta-optimizer or when manually specified through the optimizers field.
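Review note on `memory_optimizer_target_node_name_scope`: the new comment describes scope matching rather than plain prefix matching. The sketch below is one way to read that rule, assuming a scope matches when it appears as a whole name-scope component, either at the top level or nested under other scopes. `MatchesNameScope` is a hypothetical helper written for illustration; it is not taken from TensorFlow and may differ from what the memory optimizer actually does.

```cpp
#include <string>

// Hypothetical helper, not TensorFlow's implementation: returns true if
// `scope` (for example "gradients/") occurs in `node_name` as a complete
// name-scope component, at the top level or below other scopes.
bool MatchesNameScope(const std::string& node_name, const std::string& scope) {
  if (scope.empty()) return false;
  // Top-level scope, e.g. "gradients/foo".
  if (node_name.compare(0, scope.size(), scope) == 0) return true;
  // Nested scope, e.g. "foo/gradients/bar" contains "/gradients/".
  return node_name.find("/" + scope) != std::string::npos;
}
```

Under this reading, "foo_gradients/" is rejected because "gradients/" is preceded by `_` rather than by a `/` scope separator, which agrees with the example in the comment.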
diff --git a/tensorflow/core/public/session.h b/tensorflow/core/public/session.h
index 75ad50f6f2d59a8f4b8282d8e7b395e2323d62e1..d58c877cfd3a820ba6671433defe36693df539c7 100644
--- a/tensorflow/core/public/session.h
+++ b/tensorflow/core/public/session.h
@@ -195,6 +195,41 @@ class Session {
     return errors::Unimplemented(
         "LocalDeviceManager is not supported for this session.");
   }
+
+  /// \brief A handle to a subgraph, created with `Session::MakeCallable()`.
+  typedef int64 CallableHandle;
+
+  /// \brief Creates a `handle` for invoking the subgraph defined by
+  /// `callable_options`.
+  /// NOTE: This API is still experimental and may change.
+  virtual Status MakeCallable(const CallableOptions& callable_options,
+                              CallableHandle* out_handle) {
+    return errors::Unimplemented(
+        "MakeCallable is not supported for this session.");
+  }
+
+  /// \brief Invokes the subgraph named by `handle` with the given options and
+  /// input tensors.
+  ///
+  /// `feed_tensors` must follow the order of `CallableOptions::feed()`, and
+  /// `fetch_tensors` will follow the order of `CallableOptions::fetch()`, as
+  /// specified when this subgraph was created.
+  /// NOTE: This API is still experimental and may change.
+  virtual Status RunCallable(CallableHandle handle,
+                             const std::vector<Tensor>& feed_tensors,
+                             std::vector<Tensor>* fetch_tensors,
+                             RunMetadata* run_metadata) {
+    return errors::Unimplemented(
+        "RunCallable is not supported for this session.");
+  }
+
+  /// \brief Releases resources associated with the given `handle` in this
+  /// session.
+  /// NOTE: This API is still experimental and may change.
+  virtual Status ReleaseCallable(CallableHandle handle) {
+    return errors::Unimplemented(
+        "ReleaseCallable is not supported for this session.");
+  }
 };
 
 /// \brief Create a new session with the given options.
diff --git a/tensorflow/core/public/version.h b/tensorflow/core/public/version.h
index 7405e01e14494fb6e4e241f1a2b8bc33a4200fa7..40eebd1db0e44ae11f4174f93706743dbeae489c 100644
--- a/tensorflow/core/public/version.h
+++ b/tensorflow/core/public/version.h
@@ -19,7 +19,7 @@ limitations under the License.
 // TensorFlow uses semantic versioning, see http://semver.org/.
 
 #define TF_MAJOR_VERSION 1
-#define TF_MINOR_VERSION 6
+#define TF_MINOR_VERSION 7
 #define TF_PATCH_VERSION 0
 
 // TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
diff --git a/tensorflow/core/util/cuda_kernel_helper.h b/tensorflow/core/util/cuda_kernel_helper.h
index 3c59524cb6f85911544b8f2d7d3339e19af7f5b4..0ab875625ff617028c4bc53fa8ccba0488c3d0d1 100644
--- a/tensorflow/core/util/cuda_kernel_helper.h
+++ b/tensorflow/core/util/cuda_kernel_helper.h
@@ -21,6 +21,11 @@ limitations under the License.
 #include "tensorflow/core/util/cuda_device_functions.h"
 #include "tensorflow/core/util/cuda_launch_config.h"
 
+#if CUDA_VERSION >= 7050
+#include "cuda/include/cuda_fp16.h"
+#define TF_HAS_CUDA_FP16
+#endif
+
 // Deprecated, use 'for(int i : CudaGridRangeX(n))' instead.
 #define CUDA_1D_KERNEL_LOOP(i, n) \
   for (int i : ::tensorflow::CudaGridRangeX<int>(n))
diff --git a/tensorflow/core/util/events_writer.cc b/tensorflow/core/util/events_writer.cc
index 49507616ed8c6461f8d59d8899d93abb4ba58cd2..c50e329bda4b44cb5390081d889d81f231b031a5 100644
--- a/tensorflow/core/util/events_writer.cc
+++ b/tensorflow/core/util/events_writer.cc
@@ -122,9 +122,11 @@ Status EventsWriter::Flush() {
   CHECK(recordio_file_ != nullptr) << "Unexpected NULL file";
 
   TF_RETURN_WITH_CONTEXT_IF_ERROR(recordio_writer_->Flush(), "Failed to flush ",
-                                  num_outstanding_events_, " to ", filename_);
+                                  num_outstanding_events_, " events to ",
+                                  filename_);
   TF_RETURN_WITH_CONTEXT_IF_ERROR(recordio_file_->Sync(), "Failed to sync ",
-                                  num_outstanding_events_, " to ", filename_);
+                                  num_outstanding_events_, " events to ",
+                                  filename_);
 
   // The FileStillExists() condition is necessary because
   // recordio_writer_->Sync() can return OK even if the underlying
@@ -135,7 +137,8 @@ Status EventsWriter::Flush() {
   // disappearing file, in case for some file system File::Exists() is
   // false after File::Open() but before File::Sync().
   TF_RETURN_WITH_CONTEXT_IF_ERROR(FileStillExists(), "Failed to flush ",
-                                  num_outstanding_events_, " to ", filename_);
+                                  num_outstanding_events_, " events to ",
+                                  filename_);
   VLOG(1) << "Wrote " << num_outstanding_events_ << " events to disk.";
   num_outstanding_events_ = 0;
   return Status::OK();
diff --git a/tensorflow/core/util/mkl_util.h b/tensorflow/core/util/mkl_util.h
index 34db96075d45f690cffad44bcc08cdf17d6e68dc..9f58e40d94c6f50694583bc82057790040115de1 100644
--- a/tensorflow/core/util/mkl_util.h
+++ b/tensorflow/core/util/mkl_util.h
@@ -1579,10 +1579,10 @@ class MklDnnData {
   }
 
   /// Set function for data buffer of user memory primitive.
-  inline void* SetUsrMemDataHandle(void* data_buffer) {
+  inline void SetUsrMemDataHandle(void* data_buffer) {
     CHECK_NOTNULL(user_memory_);
     CHECK_NOTNULL(data_buffer);
-    return user_memory_->set_data_handle(data_buffer);
+    user_memory_->set_data_handle(data_buffer);
   }
 
   /// Set function for data buffer of user memory primitive.
diff --git a/tensorflow/core/util/port.cc b/tensorflow/core/util/port.cc
index d93b971f8567059e129d3349d44c50ebad5f607c..490c584dc5cf67cb765cdec583a078e7cf6686c1 100644
--- a/tensorflow/core/util/port.cc
+++ b/tensorflow/core/util/port.cc
@@ -39,4 +39,11 @@ bool CudaSupportsHalfMatMulAndConv() {
 #endif
 }
 
+bool IsMklEnabled() {
+#ifdef INTEL_MKL
+  return true;
+#else
+  return false;
+#endif
+}
 }  // end namespace tensorflow
diff --git a/tensorflow/core/util/port.h b/tensorflow/core/util/port.h
index ed65341711fff08db01315a662aa6e140e8451b2..981def9d22a029731366d6de0e3d2f5eefa0d8e1 100644
--- a/tensorflow/core/util/port.h
+++ b/tensorflow/core/util/port.h
@@ -25,6 +25,9 @@ bool IsGoogleCudaEnabled();
 // half-precision matrix multiplications and convolution operations.
 bool CudaSupportsHalfMatMulAndConv();
 
+// Returns true if INTEL_MKL is defined.
+bool IsMklEnabled();
+
 }  // end namespace tensorflow
 
 #endif  // TENSORFLOW_UTIL_PORT_H_
diff --git a/tensorflow/core/util/stat_summarizer.h b/tensorflow/core/util/stat_summarizer.h
index f7b63e86869c2713e08439ed1b0c0d343ad07451..79fa63723ec843e72364f09ad82617788ae9a901 100644
--- a/tensorflow/core/util/stat_summarizer.h
+++ b/tensorflow/core/util/stat_summarizer.h
@@ -186,7 +186,7 @@ class StatSummarizer {
   void Reset();
 
   // Returns number of runs.
-  int num_runs() const { return run_total_us_.count(); }
+  int num_runs() const { return static_cast<int>(run_total_us_.count()); }
 
   // Returns stats of total microseconds spent by all nodes in each run.
   const Stat<int64>& run_total_us() const { return run_total_us_; }
diff --git a/tensorflow/docs_src/about/index.md b/tensorflow/docs_src/about/index.md
index 5326b1e11012618261b85d13770c793bc05736bf..dc1e9af8763e0b55bbee936ec491fba75c6507fd 100644
--- a/tensorflow/docs_src/about/index.md
+++ b/tensorflow/docs_src/about/index.md
@@ -3,7 +3,6 @@
 This section provides a few documents about TensorFlow itself,
 including the following:
 
-  * @{$roadmap$Roadmap}, which summarizes upcoming additions to TensorFlow.
   * @{$uses$TensorFlow in Use}, which provides a link to our model zoo and
     lists some popular ways that TensorFlow is being used.
   * @{$bib$TensorFlow White Papers}, which provides abstracts of white papers
diff --git a/tensorflow/docs_src/about/leftnav_files b/tensorflow/docs_src/about/leftnav_files
index 28f039e9b5f8c948dae16f6bcb74d03b3a7804e7..63763b9d9c9d5d1c604035678e855f29925b408e 100644
--- a/tensorflow/docs_src/about/leftnav_files
+++ b/tensorflow/docs_src/about/leftnav_files
@@ -1,5 +1,4 @@
 index.md
-roadmap.md
 uses.md
 bib.md
 attribution.md
diff --git a/tensorflow/docs_src/api_guides/python/contrib.distributions.bijectors.md b/tensorflow/docs_src/api_guides/python/contrib.distributions.bijectors.md
index 0ce187b329bce38fe096f2640a09cc93c71f9543..e169897f31717d994a0229f1e1b485874d2b0572 100644
--- a/tensorflow/docs_src/api_guides/python/contrib.distributions.bijectors.md
+++ b/tensorflow/docs_src/api_guides/python/contrib.distributions.bijectors.md
@@ -28,6 +28,5 @@ To apply a `Bijector`, use `distributions.TransformedDistribution`.
 *   @{tf.contrib.distributions.bijectors.Inline}
 *   @{tf.contrib.distributions.bijectors.Invert}
 *   @{tf.contrib.distributions.bijectors.PowerTransform}
-*   @{tf.contrib.distributions.bijectors.SigmoidCentered}
 *   @{tf.contrib.distributions.bijectors.SoftmaxCentered}
 *   @{tf.contrib.distributions.bijectors.Softplus}
diff --git a/tensorflow/docs_src/community/index.md b/tensorflow/docs_src/community/index.md
index 8e67022648d4c7161b02072446371e6d7e7168e2..b706d9b2047a4ff9707772edb30bfd036bbffc24 100644
--- a/tensorflow/docs_src/community/index.md
+++ b/tensorflow/docs_src/community/index.md
@@ -5,6 +5,7 @@ This section contains the following documents:
   * @{$welcome$Welcome to the TensorFlow Community}, which explains how
     you can get involved, where to report issues, and where to join
     like-minded TensorFlow enthusiasts online.
+  * @{$roadmap$Roadmap}, which summarizes upcoming additions to TensorFlow.
   * @{$documentation$Writing TensorFlow Documentation}, which explains
     TensorFlow's documentation conventions.  If you are modifying
     TensorFlow source code or documentation, please read this guide.
diff --git a/tensorflow/docs_src/community/leftnav_files b/tensorflow/docs_src/community/leftnav_files
index c1595d3c955bb87120fe6a6c9185c58e9db1097e..fab35024ad63e09adba1298eab52f7904eca1007 100644
--- a/tensorflow/docs_src/community/leftnav_files
+++ b/tensorflow/docs_src/community/leftnav_files
@@ -1,5 +1,6 @@
 index.md
 welcome.md
+roadmap.md
 documentation.md
 style_guide.md
 benchmarks.md
diff --git a/tensorflow/docs_src/about/roadmap.md b/tensorflow/docs_src/community/roadmap.md
similarity index 98%
rename from tensorflow/docs_src/about/roadmap.md
rename to tensorflow/docs_src/community/roadmap.md
index 1f934acab69276d4c32393bb73632d978e0d15c3..a3170a10f2d12ed272ee1d32da679f25916994c6 100644
--- a/tensorflow/docs_src/about/roadmap.md
+++ b/tensorflow/docs_src/community/roadmap.md
@@ -75,8 +75,7 @@ across image recognition, speech, object detection, and
 ### Community and Partner Engagement
 #### Special Interest Groups: 
 * Mobilizing the community to work together in focused domains
-* [tf-distribute](https://groups.google.com/a/tensorflow.org/forum/#!forum/tf-distribute)
-: build and packaging of TensorFlow
+* [tf-distribute](https://groups.google.com/a/tensorflow.org/forum/#!forum/tf-distribute): build and packaging of TensorFlow
 * More to be identified and launched
 
 #### Community:
diff --git a/tensorflow/docs_src/community/welcome.md b/tensorflow/docs_src/community/welcome.md
index 9f6fe91b1490ef4ffe43acc877ecb83cc9121118..6d0458e678b5507fc722e2c3848e84ca2168e1e3 100644
--- a/tensorflow/docs_src/community/welcome.md
+++ b/tensorflow/docs_src/community/welcome.md
@@ -51,6 +51,8 @@ Europe:
 TensorFlow provides multiple communication paths.  To pick the right path,
 please read the following list carefully:
 
+  * For new release announcements and security updates, subscribe to
+    [announce@tensorflow.org](https://groups.google.com/a/tensorflow.org/forum/#!forum/announce).
   * To ask or answer technical questions about TensorFlow, use
     [Stack Overflow](https://stackoverflow.com/questions/tagged/tensorflow).
     For example, ask or search Stack Overflow about a particular error message
@@ -65,5 +67,5 @@ please read the following list carefully:
     on GitHub.  For example, use the issue tracker to request a
     new operation in TensorFlow.
   * To report vulnerabilities, please follow our
-    [vulnerability disclosure guidelines](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/SECURITY.md).
+    [vulnerability disclosure guidelines](https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md).
 
diff --git a/tensorflow/docs_src/extend/add_filesys.md b/tensorflow/docs_src/extend/add_filesys.md
index 06f11de4eb0ea7878b01cd37d994c5a40ec400be..bc0f662f0cf8054add41c4c677e369a9e1582343 100644
--- a/tensorflow/docs_src/extend/add_filesys.md
+++ b/tensorflow/docs_src/extend/add_filesys.md
@@ -225,7 +225,7 @@ it will use the `FooBarFileSystem` implementation.
 Next, you must build a shared object containing this implementation. An example
 of doing so using bazel's `cc_binary` rule can be found
 [here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/BUILD#L244),
-but you may use any build system to do so. See the section on @{$adding_an_op#build-the-op-library$building the op library} for similar
+but you may use any build system to do so. See the section on @{$adding_an_op#build_the_op_library$building the op library} for similar
 instructions.
 
 The result of building this target is a `.so` shared object file.
diff --git a/tensorflow/docs_src/extend/new_data_formats.md b/tensorflow/docs_src/extend/new_data_formats.md
index b3cc96804740991ada56a9b7f60439a63e9eb895..10e717c280f09c4f1bdfea9d0a2c8d3a00191734 100644
--- a/tensorflow/docs_src/extend/new_data_formats.md
+++ b/tensorflow/docs_src/extend/new_data_formats.md
@@ -167,7 +167,7 @@ REGISTER_KERNEL_BUILDER(Name("TextLineReader").Device(DEVICE_CPU),
 ```
 
 The last step is to add the Python wrapper.  You can either do this by
-@{$adding_an_op#building_the_op_library$compiling a dynamic library}
+@{$adding_an_op#build_the_op_library$compiling a dynamic library}
 or, if you are building TensorFlow from source, adding to `user_ops.py`.
 For the latter, you will import `tensorflow.python.ops.io_ops` in
 [`tensorflow/python/user_ops/user_ops.py`](https://www.tensorflow.org/code/tensorflow/python/user_ops/user_ops.py)
diff --git a/tensorflow/docs_src/get_started/checkpoints.md b/tensorflow/docs_src/get_started/checkpoints.md
index dfa2110e691167f54e6ea8b7a832f0a88d0ec41a..4aa07c7f2a0b56aa6de6f42e30c364c348753a39 100644
--- a/tensorflow/docs_src/get_started/checkpoints.md
+++ b/tensorflow/docs_src/get_started/checkpoints.md
@@ -154,7 +154,7 @@ classifier = tf.estimator.DNNClassifier(
 
 The first time you call an Estimator's `train` method, TensorFlow saves a
 checkpoint to the `model_dir`. Each subsequent call to the Estimator's
-`train`, `eval`, or `predict` method causes the following:
+`train`, `evaluate`, or `predict` method causes the following:
 
 1.  The Estimator builds the model's
     [graph](https://developers.google.com/machine-learning/glossary/#graph)
@@ -222,7 +222,7 @@ does not match the shape stored in checkpoint: [20]
 
 To run experiments in which you train and compare slightly different
 versions of a model, save a copy of the code that created each
-`model-dir`, possibly by creating a separate git branch for each version.
+`model_dir`, possibly by creating a separate git branch for each version.
 This separation will keep your checkpoints recoverable.
 
 ## Summary
diff --git a/tensorflow/docs_src/get_started/custom_estimators.md b/tensorflow/docs_src/get_started/custom_estimators.md
index 42a246678a054d637fea5a82a03ecb84ff412bd9..941c3e16905a9062b3081ad0af6bcbc1621a146b 100644
--- a/tensorflow/docs_src/get_started/custom_estimators.md
+++ b/tensorflow/docs_src/get_started/custom_estimators.md
@@ -164,9 +164,9 @@ To implement a typical model function, you must do the following:
 * [Define the model](#define_the_model).
 * Specify additional calculations for each of
   the [three different modes](#modes):
-  * [Predict](#predict)
-  * [Evaluate](#evaluate)
-  * [Train](#train)
+    * [Predict](#predict)
+    * [Evaluate](#evaluate)
+    * [Train](#train)
 
 ## Define the model
 
@@ -213,7 +213,7 @@ is connected to every node in the preceding layer.  Here's the relevant code:
 ```
 
 * The `units` parameter defines the number of output neurons in a given layer.
-* The `activation` parameter defines the [activation function](https://developers.google.com/machine-learning/glossary/#a) —
+* The `activation` parameter defines the [activation function](https://developers.google.com/machine-learning/glossary/#activation_function) —
   [Relu](https://developers.google.com/machine-learning/glossary/#ReLU) in this
   case.
 
@@ -546,8 +546,8 @@ In brief, here's what the three graphs tell you:
 
 * accuracy: The accuracy is recorded by the following two lines:
 
-  * `eval_metric_ops={'my_accuracy': accuracy})`, during evaluation.
-  * `tf.summary.scalar('accuracy', accuracy[1])`, during training.
+    * `eval_metric_ops={'my_accuracy': accuracy})`, during evaluation.
+    * `tf.summary.scalar('accuracy', accuracy[1])`, during training.
 
 These tensorboard graphs are one of the main reasons it's important to pass a
 `global_step` to your optimizer's `minimize` method. The model can't record
diff --git a/tensorflow/docs_src/get_started/premade_estimators.md b/tensorflow/docs_src/get_started/premade_estimators.md
index 6bffd2e065548a42eb726df34542ecc7480ad38d..e50d2f542037c8537f79a2ae53a2cbb3448243c6 100644
--- a/tensorflow/docs_src/get_started/premade_estimators.md
+++ b/tensorflow/docs_src/get_started/premade_estimators.md
@@ -397,9 +397,9 @@ predictions and their probabilities:
 
 
 ``` python
-for pred_dict, expec in zip(predictions, expected):
-    template = ('\nPrediction is "{}" ({:.1f}%), expected "{}"')
+template = ('\nPrediction is "{}" ({:.1f}%), expected "{}"')
 
+for pred_dict, expec in zip(predictions, expected):
     class_id = pred_dict['class_ids'][0]
     probability = pred_dict['probabilities'][class_id]
 
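The hunk above moves `template` out of the loop so the format string is defined once. A minimal, self-contained sketch of the resulting loop follows. The `SPECIES`, `expected`, and `predictions` values and the `format_predictions` wrapper are hypothetical stand-ins for the classifier output, not taken from the tutorial.

```python
# Hypothetical stand-ins for the label list, the expected labels, and the
# prediction dicts an Estimator would yield.
SPECIES = ['Setosa', 'Versicolor', 'Virginica']
expected = ['Setosa', 'Virginica']
predictions = [
    {'class_ids': [0], 'probabilities': [0.9, 0.07, 0.03]},
    {'class_ids': [2], 'probabilities': [0.1, 0.2, 0.7]},
]

# Defined once, outside the loop, as in the updated snippet.
template = ('\nPrediction is "{}" ({:.1f}%), expected "{}"')


def format_predictions(predictions, expected):
    """Return one formatted line per (prediction, expected label) pair."""
    lines = []
    for pred_dict, expec in zip(predictions, expected):
        class_id = pred_dict['class_ids'][0]
        probability = pred_dict['probabilities'][class_id]
        lines.append(
            template.format(SPECIES[class_id], 100 * probability, expec))
    return lines


for line in format_predictions(predictions, expected):
    print(line)
```

Wrapping the loop in a function is only a convenience for this sketch; the tutorial prints directly inside the loop.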
diff --git a/tensorflow/docs_src/install/install_c.md b/tensorflow/docs_src/install/install_c.md
index 818798555aec3a52bd5feb0c0e67d878a6dc41e4..9059b3f3b6f5e9fd6b3f7a46512577ad05848ba6 100644
--- a/tensorflow/docs_src/install/install_c.md
+++ b/tensorflow/docs_src/install/install_c.md
@@ -38,7 +38,7 @@ enable TensorFlow for C:
          OS="linux" # Change to "darwin" for macOS
          TARGET_DIRECTORY="/usr/local"
          curl -L \
-           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.6.0-rc1.tar.gz" |
+           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.7.0-rc1.tar.gz" |
            sudo tar -C $TARGET_DIRECTORY -xz
 
      The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_go.md b/tensorflow/docs_src/install/install_go.md
index 4c6dfa8dafe2042ea7b80498ca35a359f84ce854..2e47a6d2127ee7e06a2cc0d2d725145edea49b43 100644
--- a/tensorflow/docs_src/install/install_go.md
+++ b/tensorflow/docs_src/install/install_go.md
@@ -38,7 +38,7 @@ steps to install this library and enable TensorFlow for Go:
          TF_TYPE="cpu" # Change to "gpu" for GPU support
          TARGET_DIRECTORY='/usr/local'
          curl -L \
-           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.6.0-rc1.tar.gz" |
+           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.7.0-rc1.tar.gz" |
          sudo tar -C $TARGET_DIRECTORY -xz
 
      The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_java.md b/tensorflow/docs_src/install/install_java.md
index 527884863ea5104e60569008ea067b407e74d29b..eff066d2009c5191402a0e10b2534aa6df12f544 100644
--- a/tensorflow/docs_src/install/install_java.md
+++ b/tensorflow/docs_src/install/install_java.md
@@ -36,7 +36,7 @@ following to the project's `pom.xml` to use the TensorFlow Java APIs:
 <dependency>
   <groupId>org.tensorflow</groupId>
   <artifactId>tensorflow</artifactId>
-  <version>1.6.0-rc1</version>
+  <version>1.7.0-rc1</version>
 </dependency>
 ```
 
@@ -65,7 +65,7 @@ As an example, these steps will create a Maven project that uses TensorFlow:
                <dependency>
                  <groupId>org.tensorflow</groupId>
                  <artifactId>tensorflow</artifactId>
-                 <version>1.6.0-rc1</version>
+                 <version>1.7.0-rc1</version>
                </dependency>
              </dependencies>
          </project>
@@ -123,12 +123,12 @@ instead:
 <dependency>
   <groupId>org.tensorflow</groupId>
   <artifactId>libtensorflow</artifactId>
-  <version>1.6.0-rc1</version>
+  <version>1.7.0-rc1</version>
 </dependency>
 <dependency>
   <groupId>org.tensorflow</groupId>
   <artifactId>libtensorflow_jni_gpu</artifactId>
-  <version>1.6.0-rc1</version>
+  <version>1.7.0-rc1</version>
 </dependency>
 ```
 
@@ -147,7 +147,7 @@ refer to the simpler instructions above instead.
 Take the following steps to install TensorFlow for Java on Linux or macOS:
 
   1. Download
-     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.6.0-rc1.jar),
+     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.7.0-rc1.jar),
      which is the TensorFlow Java Archive (JAR).
 
   2. Decide whether you will run TensorFlow for Java on CPU(s) only or with
@@ -166,7 +166,7 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
          OS=$(uname -s | tr '[:upper:]' '[:lower:]')
          mkdir -p ./jni
          curl -L \
-           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.6.0-rc1.tar.gz" |
+           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.7.0-rc1.tar.gz" |
            tar -xz -C ./jni
 
 ### Install on Windows
@@ -174,10 +174,10 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
 Take the following steps to install TensorFlow for Java on Windows:
 
   1. Download
-     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.6.0-rc1.jar),
+     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.7.0-rc1.jar),
      which is the TensorFlow Java Archive (JAR).
   2. Download the following Java Native Interface (JNI) file appropriate for
-     [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.6.0-rc1.zip).
+     [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.7.0-rc1.zip).
   3. Extract this .zip file.
 
 
@@ -225,7 +225,7 @@ must be part of your `classpath`. For example, you can include the
 downloaded `.jar` in your `classpath` by using the `-cp` compilation flag
 as follows:
 
-<pre><b>javac -cp libtensorflow-1.6.0-rc1.jar HelloTF.java</b></pre>
+<pre><b>javac -cp libtensorflow-1.7.0-rc1.jar HelloTF.java</b></pre>
 
 
 ### Running
@@ -239,11 +239,11 @@ two files are available to the JVM:
 For example, the following command line executes the `HelloTF` program on Linux
 and macOS X:
 
-<pre><b>java -cp libtensorflow-1.6.0-rc1.jar:. -Djava.library.path=./jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.7.0-rc1.jar:. -Djava.library.path=./jni HelloTF</b></pre>
 
 And the following command line executes the `HelloTF` program on Windows:
 
-<pre><b>java -cp libtensorflow-1.6.0-rc1.jar;. -Djava.library.path=jni HelloTF</b></pre>d
+<pre><b>java -cp libtensorflow-1.7.0-rc1.jar;. -Djava.library.path=jni HelloTF</b></pre>
 
 If the program prints <tt>Hello from <i>version</i></tt>, you've successfully
 installed TensorFlow for Java and are ready to use the API.  If the program
diff --git a/tensorflow/docs_src/install/install_linux.md b/tensorflow/docs_src/install/install_linux.md
index e3e115d9f618265864363810acf96033882ad89d..8e46c9ee2052d8f9232d105f231cb28d4a51b8b9 100644
--- a/tensorflow/docs_src/install/install_linux.md
+++ b/tensorflow/docs_src/install/install_linux.md
@@ -31,49 +31,25 @@ If you are installing TensorFlow with GPU support using one of the
 mechanisms described in this guide, then the following NVIDIA software
 must be installed on your system:
 
-  * CUDA® Toolkit 9.0. For details, see
-    [NVIDIA's documentation](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/#axzz4VZnqTJ2A).
-    Ensure that you append the relevant Cuda pathnames to the
+  * [CUDA Toolkit 9.0](http://nvidia.com/cuda). For details, see
+    [NVIDIA's documentation](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
+    Ensure that you append the relevant CUDA pathnames to the
     `LD_LIBRARY_PATH` environment variable as described in the
     NVIDIA documentation.
-  * The NVIDIA drivers associated with CUDA Toolkit 9.0.
-  * cuDNN v7.0. For details, see
-    [NVIDIA's documentation](https://developer.nvidia.com/cudnn).
+  * [cuDNN SDK v7](http://developer.nvidia.com/cudnn). For details, see
+    [NVIDIA's documentation](http://docs.nvidia.com/deeplearning/sdk/cudnn-install/).
     Ensure that you create the `CUDA_HOME` environment variable as
     described in the NVIDIA documentation.
-  * GPU card with CUDA Compute Capability 3.0 or higher.  See
+  * GPU card with CUDA Compute Capability 3.0 or higher for building
+    from source and 3.5 or higher for our binaries. See
     [NVIDIA documentation](https://developer.nvidia.com/cuda-gpus) for
     a list of supported GPU cards.
-  * The libcupti-dev library, which is the NVIDIA CUDA Profile Tools Interface.
-    This library provides advanced profiling support. To install this library,
-    issue the following command for CUDA Toolkit >= 8.0:
-
-    <pre>
-    $ <b>sudo apt-get install cuda-command-line-tools</b>
-    </pre>
-
-    and add its path to your `LD_LIBRARY_PATH` environment variable:
-
-    <pre>
-    $ <b>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64</b>
-    </pre>
-
-    For CUDA Toolkit <= 7.5 do:
-
-    <pre>
-    $ <b>sudo apt-get install libcupti-dev</b>
-    </pre>
+  * [GPU drivers](http://nvidia.com/driver) supporting your version of the CUDA
+    Toolkit.
 
 If you have an earlier version of the preceding packages, please upgrade to
 the specified versions. If upgrading is not possible, then you may still run
-TensorFlow with GPU support, but only if you do the following:
-
-  * Install TensorFlow from sources as documented in
-    @{$install_sources$Installing TensorFlow from Sources}.
-  * Install or upgrade to at least the following NVIDIA versions:
-    * CUDA toolkit 7.0 or greater
-    * cuDNN v3 or greater
-    * GPU card with CUDA Compute Capability 3.0 or higher.
+TensorFlow with GPU support if you @{$install_sources$install TensorFlow from sources}.
 
 
 ## Determine how to install TensorFlow
@@ -148,7 +124,8 @@ Take the following steps to install TensorFlow with Virtualenv:
      commands:
 
      <pre>$ <b>source ~/tensorflow/bin/activate</b> # bash, sh, ksh, or zsh
-    $ <b>source ~/tensorflow/bin/activate.csh</b>  # csh or tcsh</pre>
+    $ <b>source ~/tensorflow/bin/activate.csh</b>  # csh or tcsh
+    $ <b>. ~/tensorflow/bin/activate.fish</b>  # fish</pre>
 
      The preceding <tt>source</tt> command should change your prompt
      to the following:
@@ -188,7 +165,7 @@ Take the following steps to install TensorFlow with Virtualenv:
      Virtualenv environment:
 
      <pre>(tensorflow)$ <b>pip3 install --upgrade \
-     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
+     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.7.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
 
 If you encounter installation problems, see
 [Common Installation Problems](#common_installation_problems).
@@ -293,7 +270,7 @@ take the following steps:
 
      <pre>
      $ <b>sudo pip3 install --upgrade \
-     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp34-cp34m-linux_x86_64.whl</b>
+     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.7.0rc1-cp34-cp34m-linux_x86_64.whl</b>
      </pre>
 
      If this step fails, see
@@ -356,24 +333,23 @@ where:
     to 6006.
   * <tt><i>TensorFlowCPUImage</i></tt> is required. It identifies the Docker
     container. Specify one of the following values:
-    * <tt>gcr.io/tensorflow/tensorflow</tt>, which is the TensorFlow CPU binary image.
-    * <tt>gcr.io/tensorflow/tensorflow:latest-devel</tt>, which is the latest
+    * <tt>tensorflow/tensorflow</tt>, which is the TensorFlow CPU binary image.
+    * <tt>tensorflow/tensorflow:latest-devel</tt>, which is the latest
       TensorFlow CPU Binary image plus source code.
-    * <tt>gcr.io/tensorflow/tensorflow:<i>version</i></tt>, which is the
+    * <tt>tensorflow/tensorflow:<i>version</i></tt>, which is the
       specified version (for example, 1.1.0rc1) of TensorFlow CPU binary image.
-    * <tt>gcr.io/tensorflow/tensorflow:<i>version</i>-devel</tt>, which is
+    * <tt>tensorflow/tensorflow:<i>version</i>-devel</tt>, which is
       the specified version (for example, 1.1.0rc1) of the TensorFlow GPU
       binary image plus source code.
 
-    <tt>gcr.io</tt> is the Google Container Registry. Note that some
-    TensorFlow images are also available at
+    TensorFlow images are available at
     [dockerhub](https://hub.docker.com/r/tensorflow/tensorflow/).
 
 For example, the following command launches the latest TensorFlow CPU binary image
 in a Docker container from which you can run TensorFlow programs in a shell:
 
 <pre>
-$ <b>docker run -it gcr.io/tensorflow/tensorflow bash</b>
+$ <b>docker run -it tensorflow/tensorflow bash</b>
 </pre>
 
 The following command also launches the latest TensorFlow CPU binary image in a
@@ -381,7 +357,7 @@ Docker container. However, in this Docker container, you can run TensorFlow
 programs in a Jupyter notebook:
 
 <pre>
-$ <b>docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow</b>
+$ <b>docker run -it -p 8888:8888 tensorflow/tensorflow</b>
 </pre>
 
 Docker will download the TensorFlow binary image the first time you launch it.
@@ -405,14 +381,14 @@ where:
     <tt><i>hostPort</i></tt> and <code><em>containerPort</em></code> to `8888`.
   * <i>TensorFlowGPUImage</i> specifies the Docker container. You must
     specify one of the following values:
-    * <tt>gcr.io/tensorflow/tensorflow:latest-gpu</tt>, which is the latest
+    * <tt>tensorflow/tensorflow:latest-gpu</tt>, which is the latest
       TensorFlow GPU binary image.
-    * <tt>gcr.io/tensorflow/tensorflow:latest-devel-gpu</tt>, which is
+    * <tt>tensorflow/tensorflow:latest-devel-gpu</tt>, which is
       the latest TensorFlow GPU Binary image plus source code.
-    * <tt>gcr.io/tensorflow/tensorflow:<i>version</i>-gpu</tt>, which is the
+    * <tt>tensorflow/tensorflow:<i>version</i>-gpu</tt>, which is the
       specified version (for example, 0.12.1) of the TensorFlow GPU
       binary image.
-    * <tt>gcr.io/tensorflow/tensorflow:<i>version</i>-devel-gpu</tt>, which is
+    * <tt>tensorflow/tensorflow:<i>version</i>-devel-gpu</tt>, which is
       the specified version (for example, 0.12.1) of the TensorFlow GPU
       binary image plus source code.
 
@@ -421,7 +397,7 @@ following command launches the latest TensorFlow GPU binary image in a
 Docker container from which you can run TensorFlow programs in a shell:
 
 <pre>
-$ <b>nvidia-docker run -it gcr.io/tensorflow/tensorflow:latest-gpu bash</b>
+$ <b>nvidia-docker run -it tensorflow/tensorflow:latest-gpu bash</b>
 </pre>
 
 The following command also launches the latest TensorFlow GPU binary image
@@ -429,13 +405,13 @@ in a Docker container. In this Docker container, you can run TensorFlow
 programs in a Jupyter notebook:
 
 <pre>
-$ <b>nvidia-docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow:latest-gpu</b>
+$ <b>nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu</b>
 </pre>
 
 The following command installs an older TensorFlow version (0.12.1):
 
 <pre>
-$ <b>nvidia-docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow:0.12.1-gpu</b>
+$ <b>nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:0.12.1-gpu</b>
 </pre>
 
 Docker will download the TensorFlow binary image the first time you launch it.
@@ -480,7 +456,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
 
      <pre>
      (tensorflow)$ <b>pip install --ignore-installed --upgrade \
-     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
+     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.7.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
 
 <a name="ValidateYourInstallation"></a>
 ## Validate your installation
@@ -505,7 +481,7 @@ If you installed through Docker, start a Docker container
 from which you can run bash. For example:
 
 <pre>
-$ <b>docker run -it gcr.io/tensorflow/tensorflow bash</b>
+$ <b>docker run -it tensorflow/tensorflow bash</b>
 </pre>
 
 
@@ -647,14 +623,14 @@ This section documents the relevant values for Linux installations.
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.7.0rc1-cp27-none-linux_x86_64.whl
 </pre>
 
 
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0rc1-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.7.0rc1-cp27-none-linux_x86_64.whl
 </pre>
 
 Note that GPU support requires the NVIDIA hardware and software described in
@@ -666,14 +642,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.7.0rc1-cp34-cp34m-linux_x86_64.whl
 </pre>
 
 
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0rc1-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.7.0rc1-cp34-cp34m-linux_x86_64.whl
 </pre>
 
 Note that GPU support requires the NVIDIA hardware and software described in
@@ -685,14 +661,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.7.0rc1-cp35-cp35m-linux_x86_64.whl
 </pre>
 
 
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0rc1-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.7.0rc1-cp35-cp35m-linux_x86_64.whl
 </pre>
 
 
@@ -704,14 +680,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.7.0rc1-cp36-cp36m-linux_x86_64.whl
 </pre>
 
 
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0rc1-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.7.0rc1-cp36-cp36m-linux_x86_64.whl
 </pre>
 
 
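The Linux wheel URLs listed above appear to follow a single pattern that varies by device type, package name, release, and Python tag. As a rough sketch, assuming that pattern holds, a hypothetical helper `linux_wheel_url` (not part of TensorFlow or pip) could assemble them. The tag table is copied from the URLs in this section.

```python
# Base URL inferred from the Linux wheel URLs listed in this section.
BASE_URL = 'https://storage.googleapis.com/tensorflow/linux'

# Python version -> (python tag, ABI tag), as they appear in the listed URLs.
WHEEL_TAGS = {
    '2.7': ('cp27', 'none'),
    '3.4': ('cp34', 'cp34m'),
    '3.5': ('cp35', 'cp35m'),
    '3.6': ('cp36', 'cp36m'),
}


def linux_wheel_url(python_version, gpu=False, tf_version='1.7.0rc1'):
    """Hypothetical helper: build the wheel URL for a Linux x86_64 install."""
    py_tag, abi_tag = WHEEL_TAGS[python_version]
    device = 'gpu' if gpu else 'cpu'
    package = 'tensorflow_gpu' if gpu else 'tensorflow'
    return '{}/{}/{}-{}-{}-{}-linux_x86_64.whl'.format(
        BASE_URL, device, package, tf_version, py_tag, abi_tag)


print(linux_wheel_url('3.4'))
print(linux_wheel_url('2.7', gpu=True))
```

The resulting string can be passed to `pip install --upgrade` in place of a hand-copied URL. An unsupported Python version raises `KeyError` rather than producing a URL that may not exist.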
diff --git a/tensorflow/docs_src/install/install_mac.md b/tensorflow/docs_src/install/install_mac.md
index 623ca6bb7919bf74fa9bcaad3184cdf0bcd9ccff..cb7250a16e9363b8416bff44fb7521b448b098d2 100644
--- a/tensorflow/docs_src/install/install_mac.md
+++ b/tensorflow/docs_src/install/install_mac.md
@@ -118,8 +118,8 @@ Take the following steps to install TensorFlow with Virtualenv:
      Python 2.7, the command to install
      TensorFlow in the active Virtualenv is as follows:
 
-     <pre> $ <b>pip install --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-any.whl</b></pre>
+     <pre> $ <b>pip3 install --upgrade \
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.7.0rc1-py3-none-any.whl</b></pre>
 
 If you encounter installation problems, see
 [Common Installation Problems](#common-installation-problems).
@@ -238,11 +238,11 @@ take the following steps:
      operating system and Python version. Find the appropriate
      value for <i>tfBinaryURL</i>
      [here](#the_url_of_the_tensorflow_python_package).  For example, if
-     you are installing TensorFlow for Mac OS and Python 2.7
+     you are installing TensorFlow for macOS and Python 3,
      issue the following command:
 
-     <pre> $ <b>sudo pip install --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-any.whl</b> </pre>
+     <pre> $ <b>sudo pip3 install --upgrade \
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.7.0rc1-py3-none-any.whl</b> </pre>
 
      If the preceding command fails, see
      [installation problems](#common-installation-problems).
@@ -292,24 +292,23 @@ where:
     to 6006.
   * <i>TensorFlowImage</i> is required. It identifies the Docker container.
     You must specify one of the following values:
-    * <code>gcr.io/tensorflow/tensorflow</code>: TensorFlow binary image.
-    * <code>gcr.io/tensorflow/tensorflow:latest-devel</code>: TensorFlow
+    * <code>tensorflow/tensorflow</code>: TensorFlow binary image.
+    * <code>tensorflow/tensorflow:latest-devel</code>: TensorFlow
       Binary image plus source code.
 
-<code>gcr.io</code> is the Google Container Registry. Note that some
-TensorFlow images are also available at
+The TensorFlow images are available at
 [dockerhub](https://hub.docker.com/r/tensorflow/tensorflow/).
 
 For example, the following command launches a TensorFlow CPU binary image
 in a Docker container from which you can run TensorFlow programs in a shell:
 
-<pre>$ <b>docker run -it gcr.io/tensorflow/tensorflow bash</b></pre>
+<pre>$ <b>docker run -it tensorflow/tensorflow bash</b></pre>
 
 The following command also launches a TensorFlow CPU binary image in a
 Docker container. However, in this Docker container, you can run
 TensorFlow programs in a Jupyter notebook:
 
-<pre>$ <b>docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow</b></pre>
+<pre>$ <b>docker run -it -p 8888:8888 tensorflow/tensorflow</b></pre>
 
 Docker will download the TensorFlow binary image the first time you launch it.
 
@@ -351,7 +350,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
      TensorFlow for Python 2.7:
 
      <pre> (<i>targetDirectory</i>)$ <b>pip install --ignore-installed --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-any.whl</b></pre>
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.7.0rc1-py2-none-any.whl</b></pre>
 
 
 <a name="ValidateYourInstallation"></a>
@@ -376,7 +375,7 @@ do the following:
 If you installed through Docker, start a Docker container that runs bash.
 For example:
 
-<pre>$ <b>docker run -it gcr.io/tensorflow/tensorflow bash</b></pre>
+<pre>$ <b>docker run -it tensorflow/tensorflow bash</b></pre>
 
 
 
@@ -513,18 +512,13 @@ RuntimeError: Broken toolchain: cannot link a simple C program</pre>
 ## The URL of the TensorFlow Python package
 
 A few installation mechanisms require the URL of the TensorFlow Python package.
-The value you specify depends on three factors:
-
-  * operating system
-  * Python version
-
-This section documents the relevant values for Mac OS installations.
+The value you specify depends on your Python version.
 
 ### Python 2.7
 
 
 <pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.7.0rc1-py2-none-any.whl
 </pre>
 
 
@@ -532,5 +526,5 @@ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-a
 
 
 <pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py3-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.7.0rc1-py3-none-any.whl
 </pre>
diff --git a/tensorflow/docs_src/install/install_sources.md b/tensorflow/docs_src/install/install_sources.md
index 8d83e9f1190ed307ca99d81168df7dfab51e4507..148f80efe25f12cfaef9df8a8edfaa700782dacd 100644
--- a/tensorflow/docs_src/install/install_sources.md
+++ b/tensorflow/docs_src/install/install_sources.md
@@ -133,30 +133,21 @@ The following NVIDIA <i>hardware</i> must be installed on your system:
 
 The following NVIDIA <i>software</i> must be installed on your system:
 
-  * NVIDIA's Cuda Toolkit (>= 7.0). We recommend version 9.0.
+  * [CUDA Toolkit](http://nvidia.com/cuda) (>= 7.0). We recommend version 9.0.
     For details, see
-    [NVIDIA's documentation](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/#axzz4VZnqTJ2A).
-    Ensure that you append the relevant Cuda pathnames to the
+    [NVIDIA's documentation](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
+    Ensure that you append the relevant CUDA pathnames to the
     `LD_LIBRARY_PATH` environment variable as described in the
     NVIDIA documentation.
-  * The NVIDIA drivers associated with NVIDIA's Cuda Toolkit.
-  * cuDNN (>= v3). We recommend version 6.0. For details, see
-    [NVIDIA's documentation](https://developer.nvidia.com/cudnn),
-    particularly the description of appending the appropriate pathname
-    to your `LD_LIBRARY_PATH` environment variable.
-
-Finally, you must also install `libcupti` which for Cuda Toolkit >= 8.0 you do via
-
-<pre> $ <b>sudo apt-get install cuda-command-line-tools</b> </pre>
-
-and add its path to your `LD_LIBRARY_PATH` environment variable:
-
-<pre> $ <b>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64</b> </pre>
-
-For Cuda Toolkit <= 7.5, you install `libcupti-dev` by invoking the following command:
-
-<pre> $ <b>sudo apt-get install libcupti-dev</b> </pre>
+  * [GPU drivers](http://nvidia.com/driver) supporting your version of the CUDA
+    Toolkit.
+  * [cuDNN SDK](http://developer.nvidia.com/cudnn) (>= v3). We recommend version 7.0. For details, see
+    [NVIDIA's documentation](http://docs.nvidia.com/deeplearning/sdk/cudnn-install/).
+  * [CUPTI](http://docs.nvidia.com/cuda/cupti/) ships with the CUDA Toolkit, but
+    you also need to append its path to the `LD_LIBRARY_PATH` environment
+    variable:
 
+    <pre> $ <b>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64</b> </pre>
 
 ### Next
 
@@ -240,8 +231,8 @@ such as compiler flags. You must run this script *prior* to
 creating the pip package and installing TensorFlow.
 
 If you wish to build TensorFlow with GPU, `configure` will ask
-you to specify the version numbers of Cuda and cuDNN. If several
-versions of Cuda or cuDNN are installed on your system, explicitly select
+you to specify the version numbers of CUDA and cuDNN. If several
+versions of CUDA or cuDNN are installed on your system, explicitly select
 the desired version instead of relying on the default.
 
 One of the questions that `configure` will ask is as follows:
@@ -289,12 +280,12 @@ Do you wish to build TensorFlow with CUDA support? [y/N] <b>Y</b>
 CUDA support will be enabled for TensorFlow
 Do you want to use clang as CUDA compiler? [y/N]
 nvcc will be used as CUDA compiler
-Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: <b>9.0</b>
+Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: <b>9.0</b>
 Please specify the location where CUDA 9.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
 Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
 Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: <b>7</b>
 Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
-Please specify a list of comma-separated Cuda compute capabilities you want to build with.
+Please specify a list of comma-separated CUDA compute capabilities you want to build with.
 You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
 Please note that each additional compute capability significantly increases your build time and binary size.
 [Default is: "3.5,5.2"]: <b>3.0</b>
@@ -304,14 +295,14 @@ Configuration finished
 </pre>
 
 If you told `configure` to build for GPU support, then `configure`
-will create a canonical set of symbolic links to the Cuda libraries
-on your system.  Therefore, every time you change the Cuda library paths,
+will create a canonical set of symbolic links to the CUDA libraries
+on your system.  Therefore, every time you change the CUDA library paths,
 you must rerun the `configure` script before re-invoking
 the <code>bazel build</code> command.
 
 Note the following:
 
-  * Although it is possible to build both Cuda and non-Cuda configs
+  * Although it is possible to build both CUDA and non-CUDA configs
     under the same source tree, we recommend running `bazel clean` when
     switching between these two configurations in the same source tree.
   * If you don't run the `configure` script *before* running the
@@ -359,10 +350,10 @@ Invoke `pip install` to install that pip package.
 The filename of the `.whl` file depends on your platform.
 For example, the following command will install the pip package
 
-for TensorFlow 1.6.0rc1 on Linux:
+for TensorFlow 1.7.0rc1 on Linux:
 
 <pre>
-$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.6.0rc1-py2-none-any.whl</b>
+$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.7.0rc1-py2-none-any.whl</b>
 </pre>
 
 ## Validate your installation
@@ -393,8 +384,7 @@ TensorFlow programs:
 
 <pre>Hello, TensorFlow!</pre>
 
-If you are new to TensorFlow, see @{$get_started/premade_estimators$Getting Started with
-TensorFlow}.
+If you are new to TensorFlow, see @{$get_started/premade_estimators$Getting Started with TensorFlow}.
 
 If the system outputs an error message instead of a greeting, see [Common
 installation problems](#common_installation_problems).
@@ -460,6 +450,8 @@ Stack Overflow and specify the `tensorflow` tag.
 **Linux**
 <table>
 <tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
+<tr><td>tensorflow-1.7.0rc1</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.10.0</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>tensorflow_gpu-1.7.0rc1</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.9.0</td><td>7</td><td>9</td></tr>
 <tr><td>tensorflow-1.6.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.9.0</td><td>N/A</td><td>N/A</td></tr>
 <tr><td>tensorflow_gpu-1.6.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.9.0</td><td>7</td><td>9</td></tr>
 <tr><td>tensorflow-1.5.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.8.0</td><td>N/A</td><td>N/A</td></tr>
@@ -479,6 +471,7 @@ Stack Overflow and specify the `tensorflow` tag.
 **Mac**
 <table>
 <tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
+<tr><td>tensorflow-1.7.0rc1</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.10.1</td><td>N/A</td><td>N/A</td></tr>
 <tr><td>tensorflow-1.6.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.8.1</td><td>N/A</td><td>N/A</td></tr>
 <tr><td>tensorflow-1.5.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.8.1</td><td>N/A</td><td>N/A</td></tr>
 <tr><td>tensorflow-1.4.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.5.4</td><td>N/A</td><td>N/A</td></tr>
@@ -493,6 +486,8 @@ Stack Overflow and specify the `tensorflow` tag.
 **Windows**
 <table>
 <tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
+<tr><td>tensorflow-1.7.0rc1</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>tensorflow_gpu-1.7.0rc1</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>7</td><td>9</td></tr>
 <tr><td>tensorflow-1.6.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
 <tr><td>tensorflow_gpu-1.6.0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>7</td><td>9</td></tr>
 <tr><td>tensorflow-1.5.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
diff --git a/tensorflow/docs_src/install/install_windows.md b/tensorflow/docs_src/install/install_windows.md
index dedf485f93d6fd6a8ce7b4465548cc998d307daa..2413bc9cfbbfd577ebd583be4da82994e8551c9e 100644
--- a/tensorflow/docs_src/install/install_windows.md
+++ b/tensorflow/docs_src/install/install_windows.md
@@ -17,7 +17,7 @@ You must choose one of the following types of TensorFlow to install:
     NVIDIA® GPU, you must install this version. Note that this version of
     TensorFlow is typically much easier to install (typically,
     in 5 or 10 minutes), so even if you have an NVIDIA GPU, we recommend
-    installing this version first.
+    installing this version first. Prebuilt binaries will use AVX instructions. 
   * **TensorFlow with GPU support**. TensorFlow programs typically run
     significantly faster on a GPU than on a CPU. Therefore, if your
     system has a NVIDIA® GPU meeting the prerequisites shown below
@@ -41,7 +41,8 @@ installed on your system:
     Note that cuDNN is typically installed in a different location from the
     other CUDA DLLs. Ensure that you add the directory where you installed
     the cuDNN DLL to your `%PATH%` environment variable.
-  * GPU card with CUDA Compute Capability 3.0 or higher.  See
+  * GPU card with CUDA Compute Capability 3.0 or higher for building
+    from source and 3.5 or higher for our binaries. See
     [NVIDIA documentation](https://developer.nvidia.com/cuda-gpus) for a
     list of supported GPU cards.
 
@@ -153,8 +154,7 @@ TensorFlow programs:
 
 <pre>Hello, TensorFlow!</pre>
 
-If you are new to TensorFlow, see @{$get_started/premade_estimators$Getting Started with
-TensorFlow}.
+If you are new to TensorFlow, see @{$get_started/premade_estimators$Getting Started with TensorFlow}.
 
 If the system outputs an error message instead of a greeting, see [Common
 installation problems](#common_installation_problems).
diff --git a/tensorflow/docs_src/mobile/android_build.md b/tensorflow/docs_src/mobile/android_build.md
index b5a1d5d7d1bf3b456ab24165e273969bdbd7bfca..08a5fbe41c87c88399682208c38bf7a892d8fc1a 100644
--- a/tensorflow/docs_src/mobile/android_build.md
+++ b/tensorflow/docs_src/mobile/android_build.md
@@ -90,8 +90,8 @@ using [ADB](https://developer.android.com/studio/command-line/adb.html). This
 requires some knowledge of build systems and Android developer tools, but we'll
 guide you through the basics here.
 
-- First, follow our instructions for @{$install/install_sources$installing from
-  sources}. This will also guide you through installing Bazel and cloning the
+- First, follow our instructions for @{$install/install_sources$installing from sources}.
+  This will also guide you through installing Bazel and cloning the
   TensorFlow code.
 
 - Download the Android [SDK](https://developer.android.com/studio/index.html)
diff --git a/tensorflow/docs_src/mobile/leftnav_files b/tensorflow/docs_src/mobile/leftnav_files
index ac50f528ba468d8a830c059539d3399f413f39c8..4cf134cc3c2c323405d769a5ced5d5a68f188203 100644
--- a/tensorflow/docs_src/mobile/leftnav_files
+++ b/tensorflow/docs_src/mobile/leftnav_files
@@ -2,6 +2,7 @@ index.md
 ### TensorFlow Lite
 tflite/index.md
 tflite/demo_android.md
+tflite/demo_ios.md
 >>>
 ### TensorFlow Mobile
 mobile_intro.md
diff --git a/tensorflow/docs_src/mobile/optimizing.md b/tensorflow/docs_src/mobile/optimizing.md
index 44cacff5dbbcb0685044c342184464b47a8ed090..ca9cb043e9282702044b0da26c27f5bf29141cae 100644
--- a/tensorflow/docs_src/mobile/optimizing.md
+++ b/tensorflow/docs_src/mobile/optimizing.md
@@ -290,8 +290,8 @@ run it on a 64-bit ARM device:
 
 You can interpret the results in exactly the same way as the desktop version
 above. If you have any trouble figuring out what the right input and output
-names and types are, take a look at the @{$mobile/prepare_models$Preparing
-models} page for details about detecting these for your model, and look at the
+names and types are, take a look at the @{$mobile/prepare_models$Preparing models}
+page for details about detecting these for your model, and look at the
 `summarize_graph` tool which may give you
 helpful information.
 
diff --git a/tensorflow/docs_src/mobile/prepare_models.md b/tensorflow/docs_src/mobile/prepare_models.md
index 360ee302aa96bc3a0b65eab7b39c3dacf56b42c0..8b22c04d872f18607c485775cb8f096f0a361995 100644
--- a/tensorflow/docs_src/mobile/prepare_models.md
+++ b/tensorflow/docs_src/mobile/prepare_models.md
@@ -60,7 +60,7 @@ and serialized as protocol buffers:
   the `NodeDef`, so if all the `Variable` weights are converted to `Const` nodes,
   then we only need a single `GraphDef` file to hold the model architecture and
   the weights. Freezing the graph handles the process of loading the
-  checkpoints, and then converts all Consts to Variables. You can then load the
+  checkpoints, and then converts all Variables to Consts. You can then load the
   resulting file in a single call, without having to restore variable values
   from checkpoints. One thing to watch out for with `GraphDef` files is that
   sometimes they’re stored in text format for easy inspection. These versions
diff --git a/tensorflow/docs_src/mobile/tflite/demo_android.md b/tensorflow/docs_src/mobile/tflite/demo_android.md
index 79b567897cb8a38ed2e27e73aa7e8fee95f718b8..c94b5597a673a7e68aed517b325b9719b3b73bbd 100644
--- a/tensorflow/docs_src/mobile/tflite/demo_android.md
+++ b/tensorflow/docs_src/mobile/tflite/demo_android.md
@@ -8,6 +8,9 @@ You'll need an Android device running Android 5.0 or higher to run the demo.
 To get you started working with TensorFlow Lite on Android, we'll walk you
 through building and deploying our TensorFlow demo app in Android Studio.
 
+Note: For a more detailed guide, see the
+[TFLite Codelab](https://codelabs.developers.google.com/codelabs/tensorflow-for-poets-2-tflite/index.html#0).
+
 It's also possible to build the demo app with Bazel, but we only recommend
 this for advanced users who are very familiar with the Bazel build
 environment. For more information on that, see our page [on Github](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite#building-tensorflow-lite-and-the-demo-app-from-source).
diff --git a/tensorflow/docs_src/mobile/tflite/demo_ios.md b/tensorflow/docs_src/mobile/tflite/demo_ios.md
new file mode 100644
index 0000000000000000000000000000000000000000..3ee9b1cbca6cfef98616bd33bbf91b756b4efa15
--- /dev/null
+++ b/tensorflow/docs_src/mobile/tflite/demo_ios.md
@@ -0,0 +1,68 @@
+# TensorFlow Lite Demo for iOS
+
+The TensorFlow Lite demo is a camera app that continuously classifies whatever
+it sees from your device's back camera, using a quantized MobileNet model. These
+instructions walk you through building and running the demo on an iOS device.
+
+## Prerequisites
+
+* You must have [Xcode](https://developer.apple.com/xcode/) installed and have a
+  valid Apple Developer ID, and have an iOS device set up and linked to your
+  developer account with all of the appropriate certificates. For these
+  instructions, we assume that you have already been able to build and deploy an
+  app to an iOS device with your current developer environment.
+
+* The demo app requires a camera and must be executed on a real iOS device. You
+  can build and run it with the iPhone Simulator, but it won't have any camera
+  information to classify.
+
+* You don't need to build the entire TensorFlow library to run the demo, but you
+  will need to clone the TensorFlow repository if you haven't already:
+
+        git clone https://github.com/tensorflow/tensorflow
+
+* You'll also need the Xcode command-line tools:
+
+        xcode-select --install
+
+    If this is a new install, you will need to run the Xcode application once to
+    agree to the license before continuing.
+
+## Building the iOS Demo App
+
+1. Install CocoaPods if you don't have it:
+
+        sudo gem install cocoapods
+
+2. Download the model files used by the demo app (this is done from inside the
+   cloned directory):
+
+        sh tensorflow/contrib/lite/examples/ios/download_models.sh
+
+3. Install the pod to generate the workspace file:
+
+        cd tensorflow/contrib/lite/examples/ios/camera
+        pod install
+
+    If you have installed this pod before and that command doesn't work, try
+
+        pod update
+
+    At the end of this step you should have a file called 
+    `tflite_camera_example.xcworkspace`.
+
+4. Open the project in Xcode by typing this on the command line:
+
+        open tflite_camera_example.xcworkspace
+
+    This launches Xcode if it isn't open already and opens the
+    `tflite_camera_example` project.
+
+5. Build and run the app in Xcode.
+
+    Note that as mentioned earlier, you must already have a device set up and
+    linked to your Apple Developer account in order to deploy the app on a
+    device.
+
+You'll have to grant permissions for the app to use the device's camera. Point
+the camera at various objects and enjoy seeing how the model classifies things!
diff --git a/tensorflow/docs_src/performance/leftnav_files b/tensorflow/docs_src/performance/leftnav_files
index 316f023f43dcfe781c7819d1681335267ddd5f76..d11a7e5d07c3e6cfa092e7ac11189ce6c272c1ad 100644
--- a/tensorflow/docs_src/performance/leftnav_files
+++ b/tensorflow/docs_src/performance/leftnav_files
@@ -2,6 +2,7 @@ performance_guide.md
 datasets_performance.md
 performance_models.md
 benchmarks.md
+quantization.md
 
 ### XLA
 xla/index.md
@@ -11,6 +12,3 @@ xla/jit.md
 xla/operation_semantics.md
 xla/shapes.md
 xla/tfcompile.md
-
-### Quantization
-quantization.md
diff --git a/tensorflow/docs_src/performance/performance_guide.md b/tensorflow/docs_src/performance/performance_guide.md
index cd47fc2803bc1429d28bd0ae4c2ad68e632a6f03..580a899ac4e4f5c3d97ce023f25083168fe00d01 100644
--- a/tensorflow/docs_src/performance/performance_guide.md
+++ b/tensorflow/docs_src/performance/performance_guide.md
@@ -78,7 +78,7 @@ training CIFAR-10 illustrates the use of the `tf.data` API along with
 The `tf.data` API utilizes C++ multi-threading and has a much lower overhead
 than the Python-based `queue_runner` that is limited by Python's multi-threading
 performance. A detailed performance guide for the `tf.data` API can be found
-[here](#datasets_performance).
+[here](@{$datasets_performance}).
 
 While feeding data using a `feed_dict` offers a high level of flexibility, in
 general `feed_dict` does not provide a scalable solution. If only a single GPU
diff --git a/tensorflow/docs_src/performance/quantization.md b/tensorflow/docs_src/performance/quantization.md
index 544274cab68934419e8601a4d9714d80335fca28..411889cb1c616130f809e6228cc692ba3f951d48 100644
--- a/tensorflow/docs_src/performance/quantization.md
+++ b/tensorflow/docs_src/performance/quantization.md
@@ -1,226 +1,253 @@
-# How to Quantize Neural Networks with TensorFlow
-
-When modern neural networks were being developed, the biggest challenge was
-getting them to work at all! That meant that accuracy and speed during training
-were the top priorities. Using floating point arithmetic was the easiest way to
-preserve accuracy, and GPUs were well-equipped to accelerate those calculations,
-so it's natural that not much attention was paid to other numerical formats.
-
-These days, we actually have a lot of models being deployed in commercial
-applications. The computation demands of training grow with the number of
-researchers, but the cycles needed for inference expand in proportion to users.
-That means pure inference efficiency has become a burning issue for a lot of
-teams.
-
-That is where quantization comes in. It's an umbrella term that covers a lot of
-different techniques to store numbers and perform calculations on them in more
-compact formats than 32-bit floating point. I am going to focus on eight-bit
-fixed point, for reasons I'll go into more detail on later.
-
-[TOC]
-
-## Why does Quantization Work?
-
-Training neural networks is done by applying many tiny nudges to the weights,
-and these small increments typically need floating point precision to work
-(though there are research efforts to use quantized representations here too).
-
-Taking a pre-trained model and running inference is very different. One of the
-magical qualities of deep networks is that they tend to cope very well with high
-levels of noise in their inputs. If you think about recognizing an object in a
-photo you've just taken, the network has to ignore all the CCD noise, lighting
-changes, and other non-essential differences between it and the training
-examples it's seen before, and focus on the important similarities instead. This
-ability means that they seem to treat low-precision calculations as just another
-source of noise, and still produce accurate results even with numerical formats
-that hold less information.
-
-## Why Quantize?
-
-Neural network models can take up a lot of space on disk, with the original
-AlexNet being over 200 MB in float format for example. Almost all of that size
-is taken up with the weights for the neural connections, since there are often
-many millions of these in a single model. Because they're all slightly different
-floating point numbers, simple compression formats like zip don't compress them
-well. They are arranged in large layers though, and within each layer the
-weights tend to be normally distributed within a certain range, for example -3.0
-to 6.0.
-
-The simplest motivation for quantization is to shrink file sizes by storing the
-min and max for each layer, and then compressing each float value to an
-eight-bit integer representing the closest real number in a linear set of 256
-within the range. For example with the -3.0 to 6.0 range, a 0 byte would
-represent -3.0, a 255 would stand for 6.0, and 128 would represent about 1.5.
-I'll go into the exact calculations later, since there's some subtleties, but
-this means you can get the benefit of a file on disk that's shrunk by 75%, and
-then convert back to float after loading so that your existing floating-point
-code can work without any changes.
-
-Another reason to quantize is to reduce the computational resources you need to
-do the inference calculations, by running them entirely with eight-bit inputs
-and outputs. This is a lot more difficult since it requires changes everywhere
-you do calculations, but offers a lot of potential rewards. Fetching eight-bit
-values only requires 25% of the memory bandwidth of floats, so you'll make much
-better use of caches and avoid bottlenecking on RAM access. You can also
-typically use SIMD operations that do many more operations per clock cycle. In
-some case you'll have a DSP chip available that can accelerate eight-bit
-calculations too, which can offer a lot of advantages.
-
-Moving calculations over to eight bit will help you run your models faster, and
-use less power (which is especially important on mobile devices). It also opens
-the door to a lot of embedded systems that can't run floating point code
-efficiently, so it can enable a lot of applications in the IoT world.
-
-## Why Not Train in Lower Precision Directly?
-
-There have been some experiments training at lower bit depths, but the results
-seem to indicate that you need higher than eight bit to handle the back
-propagation and gradients. That makes implementing the training more
-complicated, and so starting with inference made sense. We also already have a
-lot of float models already that we use and know well, so being able to convert
-them directly is very convenient.
-
-## How Can You Quantize Your Models?
-
-TensorFlow has production-grade support for eight-bit calculations built in. It
-also has a process for converting many models trained in floating-point over to
-equivalent graphs using quantized calculations for inference. For example,
-here's how you can translate the latest GoogLeNet model into a version that uses
-eight-bit computations:
-
-```sh
-curl -L "https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz" |
-  tar -C tensorflow/examples/label_image/data -xz
-bazel build tensorflow/tools/graph_transforms:transform_graph
-bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
-  --in_graph=tensorflow/examples/label_image/data/inception_v3_2016_08_28_frozen.pb \
-  --out_graph=/tmp/quantized_graph.pb \
-  --inputs=input \
-  --outputs=InceptionV3/Predictions/Reshape_1 \
-  --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,299,299,3")
-    remove_nodes(op=Identity, op=CheckNumerics) fold_constants(ignore_errors=true)
-    fold_batch_norms fold_old_batch_norms quantize_weights quantize_nodes
-    strip_unused_nodes sort_by_execution_order'
+# Fixed Point Quantization
+
+Quantization techniques store and calculate numbers in more compact formats.
+[TensorFlow Lite](/mobile/tflite/) adds quantization that uses an 8-bit fixed
+point representation.
+
+Since a challenge for modern neural networks is optimizing for high accuracy, the
+priority has been improving accuracy and speed during training. Using floating
+point arithmetic is an easy way to preserve accuracy and GPUs are designed to
+accelerate these calculations.
+
+However, as more machine learning models are deployed to mobile devices,
+inference efficiency has become a critical issue. Where the computational demand
+for *training* grows with the number of models trained on different
+architectures, the computational demand for *inference* grows in proportion to
+the number of users.
+
+## Quantization benefits
+
+
+Using 8-bit calculations helps your models run faster and use less power. This is
+especially important for mobile devices and embedded applications that can't run
+floating point code efficiently, for example, Internet of Things (IoT) and
+robotics devices. There are additional opportunities to extend this support to
+more backends and research lower precision networks.
+
+### Smaller file sizes {: .hide-from-toc}
+
+Neural network models require a lot of space on disk. For example, the original
+AlexNet requires over 200 MB for the float format—almost all of that for the
+model's millions of weights. Because the weights are slightly different
+floating point numbers, simple compression formats (like zip) perform poorly.
+
+Weights fall in large layers of numerical values. For each layer, weights tend to
+be normally distributed within a range. Quantization can shrink file sizes by
+storing the minimum and maximum weight for each layer, then compressing each
+weight's float value to an 8-bit integer representing the closest real number in
+a linear set of 256 within the range.
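
The per-layer compression described above can be sketched in plain Python. This is a minimal illustration, not TensorFlow code: the helper name `quantize_layer` and the sample weights are assumptions chosen for the example.

```python
import struct


def quantize_layer(weights):
    """Map each float linearly onto 0..255 between the layer's min and max."""
    w_min, w_max = min(weights), max(weights)
    # Width of one quantization step; fall back to 1.0 for a constant layer.
    scale = (w_max - w_min) / 255.0 or 1.0
    codes = bytes(min(255, max(0, round((w - w_min) / scale))) for w in weights)
    return w_min, w_max, codes


# Hypothetical weights for one layer.
weights = [-3.0, -1.2, 0.0, 1.5, 6.0]
w_min, w_max, codes = quantize_layer(weights)

# Storage cost: 4 bytes per float32 weight versus 1 byte per 8-bit code
# (plus two floats per layer for the stored range).
float_bytes = len(struct.pack("%df" % len(weights), *weights))
print(list(codes), float_bytes, len(codes))
```

Only the range endpoints are stored as floats, so for a large layer the stored size approaches one quarter of the float32 size.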
+
+### Faster inference {: .hide-from-toc}
+
+Since calculations are run entirely on 8-bit inputs and outputs, quantization
+reduces the computational resources needed for inference calculations. This is
+more involved, requiring changes to all floating point calculations, but results
+in a large speed-up for inference time.
+
+### Memory efficiency {: .hide-from-toc}
+
+Since fetching 8-bit values only requires 25% of the memory bandwidth of floats,
+more efficient caches avoid bottlenecks for RAM access. In many cases, the power
+consumption for running a neural network is dominated by memory access. The
+savings from using fixed-point 8-bit weights and activations are significant. 
+
+Typically, SIMD operations are available that run more operations per clock
+cycle. In some cases, a DSP chip is available that accelerates 8-bit calculations
+resulting in a massive speedup.
+
+## Fixed point quantization techniques
+
+The goal is to use the same precision for weights and activations during both
+training and inference. But an important difference is that training consists of
+a forward pass and a backward pass, while inference only uses a forward pass.
+When we train the model with quantization in the loop, we ensure that the forward
+pass matches precision for both training and inference.
+
+To minimize the loss in accuracy for fully fixed point models (weights and
+activations), train the model with quantization in the loop. This simulates
+quantization in the forward pass of a model so weights tend towards values that
+perform better during quantized inference. The backward pass uses quantized
+weights and activations and models quantization as a straight-through estimator
+(see Bengio et al., [2013](https://arxiv.org/abs/1308.3432)).
+
+Additionally, the minimum and maximum values for activations are determined
+during training. This allows a model trained with quantization in the loop to be
+converted to a fixed point inference model with little effort, eliminating the
+need for a separate calibration step.
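
As a rough illustration of simulating quantization in the forward pass and treating it as a straight-through estimator in the backward pass, here is a scalar sketch in plain Python. The function names are hypothetical, and the sketch leaves out details a real implementation may handle, such as adjusting the range so that zero is exactly representable.

```python
def fake_quantize(x, x_min, x_max, num_bits=8):
    """Forward pass: snap x to the nearest representable level, return a float."""
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    clamped = min(max(x, x_min), x_max)
    return x_min + round((clamped - x_min) / scale) * scale


def fake_quantize_grad(x, x_min, x_max):
    """Backward pass (straight-through): treat rounding as the identity.

    The gradient passes through unchanged inside the range and is zero
    where the forward pass clamped the value.
    """
    return 1.0 if x_min <= x <= x_max else 0.0


print(fake_quantize(0.3, -1.0, 1.0), fake_quantize_grad(0.3, -1.0, 1.0))
print(fake_quantize(5.0, -1.0, 1.0), fake_quantize_grad(5.0, -1.0, 1.0))
```

The output of `fake_quantize` stays a float, which is why training can keep using floating point values while still seeing the quantization error.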
+
+## Quantization training with TensorFlow
+
+TensorFlow can train models with quantization in the loop. Because training
+requires small gradient adjustments, floating point values are still used. To
+keep models as floating point while adding the quantization error in the training
+loop, @{$array_ops#Fake_quantization$fake quantization} nodes simulate the
+effect of quantization in the forward and backward passes.
+
+Since it's difficult to add these fake quantization operations to all the
+required locations in the model, there's a function available that rewrites the
+training graph. To create a fake quantized training graph:
+
+```
+# Build forward pass of model.
+loss = tf.losses.get_total_loss()
+
+# Call the training rewrite which rewrites the graph in-place with
+# FakeQuantization nodes and folds batchnorm for training. It is
+# often needed to fine tune a floating point model for quantization
+# with this training tool. When training from scratch, quant_delay
+# can be used to activate quantization only after training has converged
+# with the float graph, effectively fine-tuning the model.
+tf.contrib.quantize.create_training_graph(quant_delay=2000000)
+
+# Call backward pass optimizer as usual.
+optimizer = tf.train.GradientDescentOptimizer(learning_rate)
+optimizer.minimize(loss)
 ```
 
-This will produce a new model that runs the same operations as the original, but
-with eight bit calculations internally, and all weights quantized as well. If
-you look at the file size, you'll see it's about a quarter of the original (23MB
-versus 91MB). You can still run this model using exactly the same inputs and
-outputs though, and you should get equivalent results. Here's an example:
+The rewritten *eval graph* is non-trivially different from the *training graph*
+since the quantization ops affect the batch normalization step. Because of this,
+we've added a separate rewrite for the *eval graph*:
 
-```sh
-bazel build tensorflow/examples/label_image:label_image
-bazel-bin/tensorflow/examples/label_image/label_image \
---graph=/tmp/quantized_graph.pb \
+```
+# Build eval model
+logits = tf.nn.softmax_cross_entropy_with_logits(...)
+
+# Call the eval rewrite which rewrites the graph in-place with
+# FakeQuantization nodes and folds batchnorm for eval.
+tf.contrib.quantize.create_eval_graph()
+
+# Save the checkpoint and eval graph proto to disk for freezing
+# and providing to TFLite.
+with open(eval_graph_file, 'w') as f:
+  f.write(str(g.as_graph_def()))
+saver = tf.train.Saver()
+saver.save(sess, checkpoint_name)
+```
+
+Methods to rewrite the training and eval graphs are an active area of research
+and experimentation. Although rewrites and quantized training might not work or
+improve performance for all models, we are working to generalize these
+techniques.
+
+## Generating fully quantized models
+
+The previously demonstrated after-rewrite eval graph only *simulates*
+quantization. To generate real fixed point computations from a trained
+quantization model, convert it to a fixed point kernel. TensorFlow Lite supports
+this conversion from the graph resulting from `create_eval_graph`.
+
+First, create a frozen graph that will be the input for the TensorFlow Lite
+toolchain:
+
+```
+bazel build tensorflow/python/tools:freeze_graph && \
+  bazel-bin/tensorflow/python/tools/freeze_graph \
+  --input_graph=eval_graph_def.pb \
+  --input_checkpoint=checkpoint \
+  --output_graph=frozen_eval_graph.pb --output_node_names=outputs
 ```
 
-You'll see that this runs the newly-quantized graph, and outputs a very similar
-answer to the original.
-
-You can run the same process on your own models saved out as GraphDefs, with the
-input and output names adapted to those your network requires. I recommend that
-you run them through the freeze_graph script first, to convert checkpoints into
-constants stored in the file.
-
-## How Does the Quantization Process Work?
-
-We've implemented quantization by writing equivalent eight-bit versions of
-operations that are commonly used during inference. These include convolution,
-matrix multiplication, activation functions, pooling operations and
-concatenation. The conversion script first replaces all the individual ops it
-knows about with quantized equivalents. These are small sub-graphs that have
-conversion functions before and after to move the data between float and
-eight-bit. Below is an example of what they look like. First here's the original
-Relu operation, with float inputs and outputs:
-
-![Relu Diagram](https://www.tensorflow.org/images/quantization0.png)
-
-Then, this is the equivalent converted subgraph, still with float inputs and
-outputs, but with internal conversions so the calculations are done in eight
-bit.
-
-![Converted Diagram](https://www.tensorflow.org/images/quantization1.png)
-
-The min and max operations actually look at the values in the input float
-tensor, and then feeds them into the Dequantize operation that converts the
-tensor into eight-bits. There are more details on how the quantized representation
-works later on.
-
-Once the individual operations have been converted, the next stage is to remove
-unnecessary conversions to and from float. If there are consecutive sequences of
-operations that all have float equivalents, then there will be a lot of adjacent
-Dequantize/Quantize ops. This stage spots that pattern, recognizes that they
-cancel each other out, and removes them, like this:
-
-![Stripping Diagram](https://www.tensorflow.org/images/quantization2.png)
-
-Applied on a large scale to models where all of the operations have quantized
-equivalents, this gives a graph where all of the tensor calculations are done in
-eight bit, without having to convert to float.
-
-## What Representation is Used for Quantized Tensors?
-
-We approach converting floating-point arrays of numbers into eight-bit
-representations as a compression problem. We know that the weights and
-activation tensors in trained neural network models tend to have values that are
-distributed across comparatively small ranges (for example you might have -15 to
-+15 for weights, -500 to 1000 for activations on an image model, though the
-exact numbers will vary). We also know from experiment that neural nets tend to
-be very robust in the face of noise, and so the noise-like error produced by
-quantizing down to a small set of values will not hurt the precision of the
-overall results very much. We also want to pick a representation that's easy to
-perform calculations on, especially the large matrix multiplications that form
-the bulk of the work that's needed to run a model.
-
-These led us to pick a representation that has two floats to store the overall
-minimum and maximum values that are represented by the lowest and highest
-quantized value. Each entry in the quantized array represents a float value in
-that range, distributed linearly between the minimum and maximum. For example,
-if we have minimum = -10.0, and maximum = 30.0f, and an eight-bit array, here's
-what the quantized values represent:
+Provide this to the TensorFlow Lite Optimizing Converter (TOCO) to get a fully
+quantized TensorFlow Lite model:
 
 ```
-Quantized | Float
---------- | -----
-0         | -10.0
-255       | 30.0
-128       | 10.0
+bazel build tensorflow/contrib/lite/toco:toco && \
+  ./bazel-bin/tensorflow/contrib/lite/toco/toco \
+  --input_file=frozen_eval_graph.pb \
+  --output_file=tflite_model.tflite \
+  --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE \
+  --inference_type=QUANTIZED_UINT8 \
+  --input_shape="1,224,224,3" \
+  --input_array=input \
+  --output_array=outputs \
+  --std_value=127.5 --mean_value=127.5
 ```
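
To make the `--mean_value` and `--std_value` flags in the command above concrete, the sketch below assumes the convention `real_value = (quantized_value - mean_value) / std_value` for the uint8 input tensor. That formula is an assumption to check against the converter documentation for your version, and `input_to_real` is a hypothetical helper, not part of TOCO.

```python
def input_to_real(quantized_value, mean_value=127.5, std_value=127.5):
    """Real-valued input that a uint8 input value stands for (assumed convention)."""
    return (quantized_value - mean_value) / std_value


# Under this assumption, 127.5 / 127.5 maps the uint8 range 0..255 onto -1.0..1.0.
for q in (0, 128, 255):
    print(q, input_to_real(q))
```

If the model was trained on inputs scaled to a different range, the two values would need to change accordingly.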
 
-The advantages of this format are that it can represent arbitrary magnitudes of
-ranges, they don't have to be symmetrical, it can represent signed and unsigned
-values, and the linear spread makes doing multiplications straightforward. There
-are alternatives like [Song Han's code books](http://arxiv.org/pdf/1510.00149.pdf)
-that can use lower bit depths by non-linearly distributing the float values
-across the representation, but these tend to be more expensive to calculate on.
-
-The advantage of having a strong and clear definition of the quantized format is
-that it's always possible to convert back and forth from float for operations
-that aren't quantization-ready, or to inspect the tensors for debugging
-purposes. One implementation detail in TensorFlow that we're hoping to improve
-in the future is that the minimum and maximum float values need to be passed as
-separate tensors to the one holding the quantized values, so graphs can get a
-bit dense!
-
-The nice thing about the minimum and maximum ranges is that they can often be
-pre-calculated. Weight parameters are constants known at load time, so their
-ranges can also be stored as constants. We often know the ranges for inputs (for
-examples images are usually RGB values in the range 0.0 to 255.0), and many
-activation functions have known ranges too. This can avoid having to analyze the
-outputs of an operation to determine the range, which we need to do for math ops
-like convolution or matrix multiplication which produce 32-bit accumulated
-results from 8-bit inputs.
-
-## What's Next?
-
-We've found that we can get extremely good performance on mobile and embedded
-devices by using eight-bit arithmetic rather than floating-point. You can see
-the framework we use to optimize matrix multiplications at
-[gemmlowp](https://github.com/google/gemmlowp). We still need to apply all the
-lessons we've learned to the TensorFlow ops to get maximum performance on
-mobile, but we're actively working on that. Right now, this quantized
-implementation is a reasonably fast and accurate reference implementation that
-we're hoping will enable wider support for our eight-bit models on a wider
-variety of devices. We also hope that this demonstration will encourage the
-community to explore what's possible with low-precision neural networks.
+See the documentation for @{tf.contrib.quantize} and
+[TensorFlow Lite](/mobile/tflite/).
+
+## Quantized accuracy
+
+Fixed point [MobileNet](https://arxiv.org/abs/1704.04861) models are released with
+8-bit weights and activations. Using the rewriters, these models achieve the
+Top-1 accuracies listed in Table 1. For comparison, the floating point accuracies
+are listed for the same models. The code used to generate these models
+[is available](https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md)
+along with links to all of the pretrained mobilenet_v1 models.
+
+<figure>
+  <table>
+    <tr>
+      <th>Image Size</th>
+      <th>Depth</th>
+      <th>Top-1 Accuracy:<br>Floating point</th>
+      <th>Top-1 Accuracy:<br>Fixed point: 8 bit weights and activations</th>
+    </tr>
+    <tr><td>128</td><td>0.25</td><td>0.415</td><td>0.399</td></tr>
+    <tr><td>128</td><td>0.5</td><td>0.563</td><td>0.549</td></tr>
+    <tr><td>128</td><td>0.75</td><td>0.621</td><td>0.598</td></tr>
+    <tr><td>128</td><td>1</td><td>0.652</td><td>0.64</td></tr>
+    <tr><td>160</td><td>0.25</td><td>0.455</td><td>0.435</td></tr>
+    <tr><td>160</td><td>0.5</td><td>0.591</td><td>0.577</td></tr>
+    <tr><td>160</td><td>0.75</td><td>0.653</td><td>0.639</td></tr>
+    <tr><td>160</td><td>1</td><td>0.68</td><td>0.673</td></tr>
+    <tr><td>192</td><td>0.25</td><td>0.477</td><td>0.458</td></tr>
+    <tr><td>192</td><td>0.5</td><td>0.617</td><td>0.604</td></tr>
+    <tr><td>192</td><td>0.75</td><td>0.672</td><td>0.662</td></tr>
+    <tr><td>192</td><td>1</td><td>0.7</td><td>0.69</td></tr>
+    <tr><td>224</td><td>0.25</td><td>0.498</td><td>0.482</td></tr>
+    <tr><td>224</td><td>0.5</td><td>0.633</td><td>0.622</td></tr>
+    <tr><td>224</td><td>0.75</td><td>0.684</td><td>0.679</td></tr>
+    <tr><td>224</td><td>1</td><td>0.709</td><td>0.697</td></tr>
+  </table>
+  <figcaption>
+    <b>Table 1</b>: MobileNet Top-1 accuracy on Imagenet Validation dataset.
+  </figcaption>
+</figure>
+
+## Representation for quantized tensors
+
+TensorFlow approaches the conversion of floating-point arrays of numbers into
+8-bit representations as a compression problem. The weights and activation
+tensors in trained neural network models tend to have values that are distributed
+across comparatively small ranges (for example, -15 to +15 for weights or -500 to
+1000 for image model activations). And since neural nets tend to be robust to
+noise, the error introduced by quantizing to a small set of values keeps the
+precision of the overall results within an acceptable threshold. The chosen
+representation must also support fast calculations, especially the large matrix
+multiplications that make up the bulk of the computations while running a model.
+
+A quantized tensor is represented with two floats that store the overall minimum
+and maximum values corresponding to the lowest and highest quantized value. Each
+entry in the quantized array represents a float value in that range, distributed
+linearly between the minimum and maximum. For example, with a minimum of -10.0, a
+maximum of 30.0, and an 8-bit array, the quantized values represent the following:
+
+<figure>
+  <table>
+    <tr><th>Quantized</th><th>Float</th></tr>
+    <tr><td>0</td><td>-10.0</td></tr>
+    <tr><td>255</td><td>30.0</td></tr>
+    <tr><td>128</td><td>10.0</td></tr>
+  </table>
+  <figcaption>
+    <b>Table 2</b>: Example quantized value range
+  </figcaption>
+</figure>
+
+The advantages of this representation format are:
+
+* It efficiently represents ranges of arbitrary magnitude.
+* The values don't have to be symmetrical.
+* The format represents both signed and unsigned values.
+* The linear spread makes multiplications straightforward.
+
+Alternative techniques use lower bit depths by non-linearly distributing the
+float values across the representation, but currently are more expensive in terms
+of computation time. (See Han et al.,
+[2016](https://arxiv.org/abs/1510.00149).)
+
+The advantage of having a clear definition of the quantized format is that it's
+always possible to convert back and forth from fixed-point to floating-point for
+operations that aren't quantization-ready, or to inspect the tensors for
+debugging.
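
As an illustration of the min/max representation described in this section, here is a minimal Python sketch of the linear mapping between floats and 8-bit codes. The function names `quantize` and `dequantize` and the round-half-up rule are assumptions made for this example; they are not TensorFlow APIs and may not match how TensorFlow's kernels round.

```python
import math


def quantize(value, min_val, max_val, bits=8):
    """Map a float in [min_val, max_val] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1  # 255 for 8 bits
    clamped = min(max(value, min_val), max_val)  # out-of-range values saturate
    scaled = (clamped - min_val) / (max_val - min_val) * levels
    return int(math.floor(scaled + 0.5))  # round half up (assumption of this sketch)


def dequantize(code, min_val, max_val, bits=8):
    """Map an integer code back to the float value it represents."""
    levels = (1 << bits) - 1
    return min_val + code * (max_val - min_val) / levels


# The range used in Table 2.
lo, hi = -10.0, 30.0
codes = [quantize(v, lo, hi) for v in (-10.0, 10.0, 30.0)]
recovered = [dequantize(c, lo, hi) for c in codes]
```

With this range each step is 40/255, roughly 0.157, so code 128 decodes to about 10.08 rather than exactly 10.0; Table 2 shows the rounded value. Because both directions use the same two floats, converting back and forth only costs the rounding error of one step.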
diff --git a/tensorflow/docs_src/performance/xla/jit.md b/tensorflow/docs_src/performance/xla/jit.md
index d4dc3e57c8fb5ec2a979b6ba7ebe2a3b6c3a5f94..d9a979ccbd31773b9d227ff946486706844a8f81 100644
--- a/tensorflow/docs_src/performance/xla/jit.md
+++ b/tensorflow/docs_src/performance/xla/jit.md
@@ -157,7 +157,7 @@ to fuse Ops is visible by starting at `hlo_graph_0.dot` and viewing each diagram
 in succession.
 
 To Render the .dot file into a png, install
-[GraphViz](http://www.graphviz.org/Download..php) and run:
+[GraphViz](https://www.graphviz.org/download/) and run:
 
 ```shell
 dot -Tpng hlo_graph_80.dot -o hlo_graph_80.png
diff --git a/tensorflow/docs_src/performance/xla/operation_semantics.md b/tensorflow/docs_src/performance/xla/operation_semantics.md
index b0abf5fdd2e0d8c3c20ae4bcd8f185124028df04..5e39e710a0dba74dfd68a04367ce402362520590 100644
--- a/tensorflow/docs_src/performance/xla/operation_semantics.md
+++ b/tensorflow/docs_src/performance/xla/operation_semantics.md
@@ -45,27 +45,30 @@ feature dimension in `operand`), the operation calculates the gradients with
 respect to `operand`, `offset` and `scale` across all the other dimensions. The
 `feature_index` must be a valid index for the feature dimension in `operand`.
 
-The three gradients are defined by the following formulas:
+The three gradients are defined by the following formulas (assuming a
+4-dimensional tensor as `operand`, with \\(l\\) as the index of the feature
+dimension and \\(m\\), \\(w\\), \\(h\\) as the sizes of the other three dimensions):
 
-\\( \nabla x = \nabla y * \gamma * \sqrt{\sigma^2+\epsilon} \\)
+\\( coef_l = \frac{1}{mwh}\sum_{i=1}^m\sum_{j=1}^w\sum_{k=1}^h (\nabla y_{ijkl} * (x_{ijkl} - \mu_l) / (\sigma^2_{l}+\epsilon)) \\)
 
-\\( \nabla \gamma = sum(\nabla y * (x - \mu) * \sqrt{\sigma^2 + \epsilon}) \\)
+\\( \nabla x_{ijkl} = \gamma_{l} * (1/\sqrt{\sigma^2_{l}+\epsilon}) * [\nabla y_{ijkl} - mean(\nabla y) - (x_{ijkl} - \mu_{l}) * coef_l] \\)
 
-\\( \nabla \beta = sum(\nabla y) \\)
+\\( \nabla \beta_l = \sum_{i=1}^m\sum_{j=1}^w\sum_{k=1}^h \nabla y_{ijkl} \\)
+
+\\( \nabla \gamma_l = \sum_{i=1}^m\sum_{j=1}^w\sum_{k=1}^h \nabla y_{ijkl} * ((x_{ijkl} - \mu_l) / \sqrt{\sigma^2_{l}+\epsilon}) \\)
 
 The inputs `mean` and `variance` represents moments value
 across batch and spatial dimensions.
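
To make the formulas above concrete, here is a minimal pure-Python sketch for a single feature index. It assumes the batch and spatial dimensions have been flattened into one list, and it computes the mean and variance from `x` instead of taking them as inputs; the name `batch_norm_grad` and the default `eps` are hypothetical and this is not the XLA API.

```python
import math


def batch_norm_grad(x, grad_y, gamma, eps=1e-3):
    """Gradients for one feature; x and grad_y hold the m*w*h flattened values."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n  # biased variance over batch/spatial
    inv_std = 1.0 / math.sqrt(var + eps)
    mean_dy = sum(grad_y) / n
    # coef_l: mean of grad_y * (x - mu) / (var + eps)
    coef = sum(dy * (v - mu) / (var + eps) for dy, v in zip(grad_y, x)) / n
    grad_x = [gamma * inv_std * (dy - mean_dy - (v - mu) * coef)
              for dy, v in zip(grad_y, x)]
    grad_beta = sum(grad_y)
    grad_gamma = sum(dy * (v - mu) * inv_std for dy, v in zip(grad_y, x))
    return grad_x, grad_gamma, grad_beta


grad_x, grad_gamma, grad_beta = batch_norm_grad(
    x=[1.0, 2.0, 4.0, 7.0], grad_y=[0.5, -1.0, 0.25, 2.0], gamma=1.5)
```

A quick sanity check on any implementation of these formulas is that the entries of `grad_x` sum to zero up to rounding, since both the `mean` term and the `coef` term cancel when summed over the batch and spatial positions.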
 
 The output type is a tuple of three handles:
 
-|Outputs       | Type                    | Semantics                           |
-|------------- | ----------------------- | ------------------------------------|
-|`grad_operand`| `ComputationDataHandle` | gradient with respect to input      |
-:              :                         : `operand`                           :
-|`grad_scale`  | `ComputationDataHandle` | gradient with respect to input      |
-:              :                         : `scale`                             :
-|`grad_offset` | `ComputationDataHandle` | gradient with respect to input      |
-:              :                         : `offset`                            :
+|Outputs       | Type                    | Semantics                            |
+|------------- | ----------------------- | ------------------------------------ |
+|`grad_operand`| `ComputationDataHandle` | gradient with respect to input       |
+:              :                         : `operand` (\\( \nabla x\\))          :
+|`grad_scale`  | `ComputationDataHandle` | gradient with respect to input       |
+:              :                         : `scale` (\\( \nabla \gamma\\))       :
+|`grad_offset` | `ComputationDataHandle` | gradient with respect to input       |
+:              :                         : `offset` (\\( \nabla \beta\\))       :
 
 
 ## BatchNormInference
@@ -119,7 +122,7 @@ Normalizes an array across batch and spatial dimensions.
 | Arguments       | Type                    | Semantics                        |
 | --------------- | ----------------------- | -------------------------------- |
 | `operand`       | `ComputationDataHandle` | n dimensional array to be        |
-:                 :                         : normalized                       :
+:                 :                         : normalized (x)                   :
 | `scale`         | `ComputationDataHandle` | 1 dimensional array              |
 :                 :                         : (\\(\gamma\\))                   :
 | `offset`        | `ComputationDataHandle` | 1 dimensional array              |
@@ -254,7 +257,7 @@ the range between the minimum and maximum, else returns the minimum value if the
 operand is below this range or the maximum value if the operand is above this
 range.  That is, `clamp(a, x, b) =  min(max(a, x), b)`.
 
-All three arrays must be the same shape. Alternately, as a restricted form of
+All three arrays must be the same shape. Alternatively, as a restricted form of
 [broadcasting](broadcasting.md), `min` and/or `max` can be a scalar of type `T`.
 
 Example with scalar `min` and `max`:
@@ -1050,6 +1053,9 @@ For a more intuitive description, see the "Informal Description" section below.
 :                  :                         : indices of the slices we're     :
 :                  :                         : we're stitching together into   :
 :                  :                         : the output tensor.              :
+|`index_vector_dim`  | `int64`               | The dimension in                |
+:                  :                         : `gather_indices` that contains  :
+:                  :                         : the starting indices.           :
 |`output_window_dims` | `ArraySlice<int64>`  | The set of dimensions in the    |
 :                  :                         : output shape that are _window   :
 :                  :                         : dimensions_ (defined below).    :
@@ -1066,22 +1072,20 @@ For a more intuitive description, see the "Informal Description" section below.
 :                  :            : `output_window_dims`) and the window         :
 :                  :            : dimensions that are elided (via              :
 :                  :            : `elided_window_dims`).                       :
-|`gather_dims_to_operand_dims` | `ArraySlice<int64>` | A dimension map (the  |
+|`gather_dims_to_operand_dims` | `ArraySlice<int64>` | A dimension map (the    |
 :                  :            : array is interpreted as mapping `i` to       :
 :                  :            : `gather_dims_to_operand_dims[i]`)  from      :
 :                  :            : the gather indices in `gather_indices` to    :
 :                  :            : the operand index space.  It has to be       :
 :                  :            : one-to-one and total.                        :
 
-If `gather_indices` is a vector with `N` elements then we implicitly reshape it
-to a tensor of shape `[N,1]` before proceeding.
-
 For every index `Out` in the output tensor, we compute two things (more
 precisely described later):
 
-  - An index into the first `gather_indices.rank` - `1` dimensions of
-    `gather_indices`, which gives us a starting index of a slice, _operand
-    slice_, in the operand tensor.
+  - An index into `gather_indices.rank` - `1` dimensions of `gather_indices`,
+    which gives us a starting index of a slice, _operand slice_, in the operand
+    tensor.  These `gather_indices.rank` - `1` dimensions are all the dimensions
+    in `gather_indices` except `index_vector_dim`.
 
   - A _window index_ that has the same rank as the operand.  This index is
     composed of the values in `Out` at dimensions `output_window_dims`, embedded
@@ -1093,29 +1097,42 @@ should be present in the output at index `Out`.
 The output is a tensor of rank `output_window_dims.size` + `gather_indices.rank`
 - `1`.  Additionally, as a shorthand, we define `output_gather_dims` of type
 `ArraySlice<int64>` as the set of dimensions in the output shape but not in
-`output_window_dims`, in ascending order.  E.g. if the output tensor has rank 5,
-`output_window_dims` is {`2`, `4`} then `output_gather_dims` is {`0`, `1`, `3`}
+`output_window_dims`, in ascending order.  E.g. if the output tensor has rank
+`5` and `output_window_dims` is {`2`, `4`}, then `output_gather_dims` is {`0`,
+`1`, `3`}.
+
+If `index_vector_dim` is equal to `gather_indices.rank` we implicitly
+consider `gather_indices` to have a trailing `1` dimension (i.e. if
+`gather_indices` was of shape `[6,7]` and `index_vector_dim` is `2` then
+we implicitly consider the shape of `gather_indices` to be `[6,7,1]`).
 
 The bounds for the output tensor along dimension `i` is computed as follows:
 
   1. If `i` is present in `output_gather_dims` (i.e. is equal to
-    `output_gather_dims[k]` for some `k`) then we pick the corresponding
-    dimension bounds out of `gather_indices.shape` (i.e. pick
-    `gather_indices.shape.dims[k]`).
+     `output_gather_dims[k]` for some `k`) then we pick the corresponding
+     dimension bounds out of `gather_indices.shape`, skipping
+     `index_vector_dim` (i.e. pick `gather_indices.shape.dims`[`k`] if `k`
+     < `index_vector_dim` and `gather_indices.shape.dims`[`k`+`1`]
+     otherwise).
   2. If `i` is present in `output_window_dims` (i.e. equal to
-     `output_window_dims[k]` for some `k`) then we pick the corresponding bound
-     out of `window_bounds` after accounting for `elided_window_dims` (i.e. we
-     pick `adjusted_window_bounds[k]` where `adjusted_window_bounds` is
-     `window_bounds` with the bounds at indices `elided_window_dims` removed).
+     `output_window_dims`[`k`] for some `k`) then we pick the corresponding
+     bound out of `window_bounds` after accounting for `elided_window_dims`
+     (i.e. we pick `adjusted_window_bounds`[`k`] where `adjusted_window_bounds`
+     is `window_bounds` with the bounds at indices `elided_window_dims`
+     removed).
 
 The operand index `In` corresponding to an output index `Out` is computed as
 follows:
 
   1. Let `G` = { `Out`[`k`] for `k` in `output_gather_dims` }.  Use `G` to slice
-     out vector `S` such that `S`[`i`] = `gather_indices`[`G`, `i`].
-  2. Create an index, `S`<sub>`in`</sub>, into `operand` using `S` by scattering
-     `S` using the `gather_dims_to_operand_dims` map (`S`<sub>`in`</sub> is the
-     starting indices for _operand slice_ mentioned above.).  More precisely:
+     out vector `S` such that `S`[`i`] = `gather_indices`[Combine(`G`, `i`)]
+     where Combine(A, b) inserts b at position `index_vector_dim` into A.
+     Note that this is well defined even if `G` is empty -- if `G` is empty then
+     `S` = `gather_indices`.
+  2. Create an index, `S`<sub>`in`</sub>, into `operand` using `S` by
+     scattering `S` using the `gather_dims_to_operand_dims` map
+     (`S`<sub>`in`</sub> is the starting indices for _operand slice_ mentioned
+     above).  More precisely:
        1. `S`<sub>`in`</sub>[`gather_dims_to_operand_dims`[`k`]] = `S`[`k`] if `k` <
           `gather_dims_to_operand_dims.size`.
        2. `S`<sub>`in`</sub>[`_`] = `0` otherwise.
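
The role of `index_vector_dim` in the steps above can be sketched in plain Python. Nested lists stand in for tensors, and the helper names (`shape_of`, `lookup`, `combine`, `gather_bounds`, `start_index`) are hypothetical names chosen for this illustration, not part of XLA.

```python
def shape_of(nested):
    """Shape of a rectangular nested list, e.g. [[0, 1, 2], [3, 4, 5]] -> (2, 3)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


def lookup(nested, index):
    """Read the element of a nested list at a multi-dimensional index."""
    for k in index:
        nested = nested[k]
    return nested


def combine(g, i, index_vector_dim):
    """Combine(A, b): insert b at position index_vector_dim into the tuple A."""
    return g[:index_vector_dim] + (i,) + g[index_vector_dim:]


def gather_bounds(indices_shape, index_vector_dim):
    """Output bounds taken from gather_indices.shape, skipping index_vector_dim."""
    return tuple(d for k, d in enumerate(indices_shape) if k != index_vector_dim)


def start_index(gather_indices, g, index_vector_dim,
                gather_dims_to_operand_dims, operand_rank):
    """Return (S, S_in) for the gather index G."""
    shape = shape_of(gather_indices)
    if index_vector_dim == len(shape):
        # Implicit trailing 1 dimension: S has a single element.
        s = [lookup(gather_indices, g)]
    else:
        s = [lookup(gather_indices, combine(g, i, index_vector_dim))
             for i in range(shape[index_vector_dim])]
    s_in = [0] * operand_rank  # S_in[_] = 0 unless mapped below
    for k, operand_dim in enumerate(gather_dims_to_operand_dims):
        s_in[operand_dim] = s[k]
    return s, s_in


indices = [[0, 1, 2], [3, 4, 5]]  # gather_indices of shape [2, 3]
s_last, s_in_last = start_index(indices, (1,), 1, [0, 1, 2], 3)
s_first, s_in_first = start_index(indices, (2,), 0, [1, 0], 3)
s_scalar, s_in_scalar = start_index(indices, (1, 2), 2, [2], 3)
```

The same `gather_indices` array yields index vectors of length 3, 2, or 1 depending on which dimension is designated as `index_vector_dim`; the remaining dimensions are the ones that contribute bounds to the output shape.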
@@ -1136,7 +1153,12 @@ follows:
 `operand.rank` is `6` and `elided_window_dims` is {`0`, `2`} then
 `window_dims_to_operand_dims` is {`0`→`1`, `1`→`3`, `2`→`4`, `3`→`5`}.
 
-### Informal Description
+### Informal Description and Examples
+
+`index_vector_dim` is set to `gather_indices.rank` - `1` in all of the
+examples that follow.  More interesting values for `index_vector_dim`
+do not change the operation fundamentally, but make the visual representation
+more cumbersome.
 
 To get an intuition on how all of the above fits together, let's look at an
 example that gathers 5 slices of shape `[8,6]` from a `[16,11]` tensor.  The
diff --git a/tensorflow/docs_src/programmers_guide/datasets.md b/tensorflow/docs_src/programmers_guide/datasets.md
index d19200e80cdfe6620789ddd273647660c10b2a60..9ccdbde627e6b2415835f7c0771eca1afa04f7f8 100644
--- a/tensorflow/docs_src/programmers_guide/datasets.md
+++ b/tensorflow/docs_src/programmers_guide/datasets.md
@@ -18,11 +18,11 @@ The `tf.data` API introduces two new abstractions to TensorFlow:
   tensors representing the image data and a label. There are two distinct
   ways to create a dataset:
 
-  * Creating a **source** (e.g. `Dataset.from_tensor_slices()`) constructs a
+    * Creating a **source** (e.g. `Dataset.from_tensor_slices()`) constructs a
     dataset from
     one or more `tf.Tensor` objects.
 
-  * Applying a **transformation** (e.g. `Dataset.batch()`) constructs a dataset
+    * Applying a **transformation** (e.g. `Dataset.batch()`) constructs a dataset
     from one or more `tf.data.Dataset` objects.
 
 * A `tf.data.Iterator` provides the main way to extract elements from a
@@ -327,6 +327,35 @@ same op/node (created by `Iterator.get_next()`). Therefore,  evaluating *any* of
 these tensors will advance the iterator for all components. A typical consumer
 of an iterator will include all components in a single expression.
 
+### Saving iterator state
+
+The @{tf.contrib.data.make_saveable_from_iterator} function creates a
+`SaveableObject` from an iterator, which can be used to save and
+restore the current state of the iterator (and, effectively, the whole input
+pipeline). A saveable object thus created can be added to the @{tf.train.Saver}
+variables list or to the `tf.GraphKeys.SAVEABLE_OBJECTS` collection for saving and
+restoring in the same manner as a @{tf.Variable}. Refer to
+@{$saved_model$Saving and Restoring} for details on how to save and restore
+variables.
+
+```python
+# Create saveable object from iterator.
+saveable = tf.contrib.data.make_saveable_from_iterator(iterator)
+
+# Save the iterator state by adding it to the saveable objects collection.
+tf.add_to_collection(tf.GraphKeys.SAVEABLE_OBJECTS, saveable)
+saver = tf.train.Saver()
+
+with tf.Session() as sess:
+
+  if should_checkpoint:
+    saver.save(sess, path_to_checkpoint)
+
+# Restore the iterator state.
+with tf.Session() as sess:
+  saver.restore(sess, path_to_checkpoint)
+```
+
 ## Reading input data
 
 ### Consuming NumPy arrays
diff --git a/tensorflow/docs_src/programmers_guide/debugger.md b/tensorflow/docs_src/programmers_guide/debugger.md
index c8fdae6f60c33776b6d9a8c1a33666ce4ddb1cb2..d1399814ee862f5f7ecc3f448d51fb3724fa3447 100644
--- a/tensorflow/docs_src/programmers_guide/debugger.md
+++ b/tensorflow/docs_src/programmers_guide/debugger.md
@@ -23,8 +23,13 @@ debuggers such as Python's `pdb` due to TensorFlow's computation-graph paradigm.
 > installed using `pip install <your_version>.whl`, however curses on Windows
 > may not work as reliably as curses on Linux or Mac.
 
-This tutorial demonstrates how to use the **tfdbg** command-line interface
-(CLI) to debug the appearance of [`nan`s](https://en.wikipedia.org/wiki/NaN)
+> NOTE: This guide focuses on the command-line interface (CLI) of tfdbg. For a
+> guide on how to use the graphical user interface (GUI) of tfdbg, i.e., the
+> **TensorBoard Debugger Plugin**, please visit
+> [its README](https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md).
+
+This tutorial demonstrates how to use the **tfdbg** CLI to debug the appearance
+of [`nan`s](https://en.wikipedia.org/wiki/NaN)
 and [`inf`s](https://en.wikipedia.org/wiki/Infinity), a frequently-encountered
 type of bug in TensorFlow model development.
 The following example is for users who use the low-level
@@ -454,7 +459,7 @@ accuracy_score = classifier.evaluate(x=test_set.data,
 
 
 [debug_tflearn_iris.py](https://www.tensorflow.org/code/tensorflow/python/debug/examples/debug_tflearn_iris.py),
-based on {$tflearn$tf-learn's iris tutorial}, contains a full example of how to
+based on [tf-learn's iris tutorial](https://www.tensorflow.org/versions/r1.2/get_started/tflearn), contains a full example of how to
 use the tfdbg with `Estimator`s. To run this example, do:
 
 ```none
@@ -748,6 +753,7 @@ There are three possible workarounds or solutions:
    # For LocalCLIDebugHook
    hooks = [tf_debug.LocalCLIDebugHook(dump_root="/with/lots/of/space")]
    ```
+
    Make sure that the directory pointed to by dump_root is empty or nonexistent.
    tfdbg cleans up the dump directories before exiting.
 *  Reduce the batch size used during the runs.
@@ -806,3 +812,13 @@ sess.run(b)
 
 the constant-folding would not occur and `tfdbg` should show the intermediate
 tensor dumps.
+
+**Q**: Is there a GUI for tfdbg?
+
+**A**: Yes, the **TensorBoard Debugger Plugin** is the GUI of tfdbg.
+       It offers features such as inspection of the computation graph,
+       real-time visualization of tensor values, continuation to tensor
+       and conditional breakpoints, and tying tensors to their
+       graph-construction source code, all in the browser environment.
+       To get started, please visit
+       [its README](https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md).
diff --git a/tensorflow/docs_src/programmers_guide/faq.md b/tensorflow/docs_src/programmers_guide/faq.md
index 70931f2862de98cb1e934f85919d558a3b36304a..392ac6f7f12532c3efce5bec1917691f55c7bee5 100644
--- a/tensorflow/docs_src/programmers_guide/faq.md
+++ b/tensorflow/docs_src/programmers_guide/faq.md
@@ -159,8 +159,7 @@ available. These operations allow you to build sophisticated
 @{$reading_data$input pipelines}, at the cost of making the
 TensorFlow computation somewhat more complicated. See the how-to documentation
 for
-@{$reading_data#creating-threads-to-prefetch-using-queuerunner-objects$using
-`QueueRunner` objects to drive queues and readers}
+@{$reading_data#creating_threads_to_prefetch_using_queuerunner_objects$using `QueueRunner` objects to drive queues and readers}
 for more information on how to use them.
 
 ## Variables
@@ -273,7 +272,7 @@ Prefer predefined TensorFlow operations such as @{tf.decode_raw},
 
 If your data is not easily parsable with the built-in TensorFlow operations,
 consider converting it, offline, to a format that is easily parsable, such
-as ${tf.python_io.TFRecordWriter$`TFRecord`} format.
+as @{tf.python_io.TFRecordWriter$`TFRecord`} format.
 
 The more efficient method to customize the parsing behavior is to
 @{$adding_an_op$add a new op written in C++} that parses your
diff --git a/tensorflow/docs_src/programmers_guide/graphs.md b/tensorflow/docs_src/programmers_guide/graphs.md
index 9049a5a9f3d44e255188c6c41cdb12a619464379..e69b717432e6a8fab0085eb419dcbc0991cd9d28 100644
--- a/tensorflow/docs_src/programmers_guide/graphs.md
+++ b/tensorflow/docs_src/programmers_guide/graphs.md
@@ -210,9 +210,8 @@ with tf.device("/device:GPU:0"):
   # Operations created in this context will be pinned to the GPU.
   result = tf.matmul(weights, img)
 ```
-
-If you are deploying TensorFlow in a @{$deploy/distributed$typical distributed
-configuration}, you might specify the job name and task ID to place variables on
+If you are deploying TensorFlow in a @{$deploy/distributed$typical distributed configuration},
+you might specify the job name and task ID to place variables on
 a task in the parameter server job (`"/job:ps"`), and the other operations on
 task in the worker job (`"/job:worker"`):
 
@@ -336,20 +335,20 @@ described below.
   controls the behavior of the session. For example, some of the configuration
   options include:
 
-  * `allow_soft_placement`. Set this to `True` to enable a "soft" device
+    * `allow_soft_placement`. Set this to `True` to enable a "soft" device
     placement algorithm, which ignores @{tf.device} annotations that attempt
     to place CPU-only operations on a GPU device, and places them on the CPU
     instead.
 
-  * `cluster_def`. When using distributed TensorFlow, this option allows you
+    * `cluster_def`. When using distributed TensorFlow, this option allows you
     to specify what machines to use in the computation, and provide a mapping
     between job names, task indices, and network addresses. See
     @{tf.train.ClusterSpec.as_cluster_def} for details.
 
-  * `graph_options.optimizer_options`. Provides control over the optimizations
+    * `graph_options.optimizer_options`. Provides control over the optimizations
     that TensorFlow performs on your graph before executing it.
 
-  * `gpu_options.allow_growth`. Set this to `True` to change the GPU memory
+    * `gpu_options.allow_growth`. Set this to `True` to change the GPU memory
     allocator so that it gradually increases the amount of memory allocated,
     rather than allocating most of the memory at startup.
 
diff --git a/tensorflow/docs_src/programmers_guide/index.md b/tensorflow/docs_src/programmers_guide/index.md
index 7a5e90081d9145ca934929f0af11f2a40cb2dcae..e8c2fa6990c8ecfca1cfe76b3f813b4ae6917742 100644
--- a/tensorflow/docs_src/programmers_guide/index.md
+++ b/tensorflow/docs_src/programmers_guide/index.md
@@ -30,8 +30,12 @@ works. The units are as follows:
     can still be helpful.
   * @{$programmers_guide/saved_model}, which
     explains how to save and restore variables and models.
+
+## Accelerators
+
   * @{$using_gpu} explains how TensorFlow assigns operations to
     devices and how you can change the arrangement manually.
+  * @{$using_tpu} explains how to modify `Estimator` programs to run on a TPU.
 
 
 ## ML Concepts
diff --git a/tensorflow/docs_src/programmers_guide/saved_model.md b/tensorflow/docs_src/programmers_guide/saved_model.md
index f18d50b282400810b4869f78ba7f536ad5ea4798..55ee42dd6405db6bd34b064d71deaeb94839b0fa 100644
--- a/tensorflow/docs_src/programmers_guide/saved_model.md
+++ b/tensorflow/docs_src/programmers_guide/saved_model.md
@@ -1,38 +1,33 @@
-# Saving and Restoring
+# Save and Restore
 
-This document explains how to save and restore
-@{$variables$variables} and models.
+The @{tf.train.Saver} class provides methods to save and restore models. The
+@{tf.saved_model.simple_save} function is an easy way to build a
+@{tf.saved_model$saved model} suitable for serving.
+[Estimators](@{$programmers_guide/estimators}) automatically save and restore
+variables in the `model_dir`.
 
-Important: TensorFlow model files are code. Be careful with untrusted code.
-See [Using TensorFlow Securely](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/SECURITY.md)
-for details.
-
-## Saving and restoring variables
-
-A TensorFlow variable provides the best way to represent shared, persistent
-state manipulated by your program. (See @{$variables$Variables} for details.)
-This section explains how to save and restore variables.
-Note that Estimators automatically saves and restores variables
-(in the `model_dir`).
+## Save and restore variables
 
-The `tf.train.Saver` class provides methods for saving and restoring models.
-The `tf.train.Saver` constructor adds `save` and `restore` ops to the graph
-for all, or a specified list, of the variables in the graph.  The `Saver`
-object provides methods to run these ops, specifying paths for the checkpoint
-files to write to or read from.
+TensorFlow @{$variables} are the best way to represent shared, persistent state
+manipulated by your program. The `tf.train.Saver` constructor adds `save` and
+`restore` ops to the graph for all, or a specified list, of the variables in the
+graph.  The `Saver` object provides methods to run these ops, specifying paths
+for the checkpoint files to write to or read from.
 
-The saver will restore all variables already defined in your model. If you're
+`Saver` restores all variables already defined in your model. If you're
 loading a model without knowing how to build its graph (for example, if you're
 writing a generic program to load models), then read the
 [Overview of saving and restoring models](#models) section
 later in this document.
 
-TensorFlow saves variables in binary **checkpoint files** that,
-roughly speaking, map variable names to tensor values.
-
+TensorFlow saves variables in binary *checkpoint files* that map variable
+names to tensor values.
 
+Caution: TensorFlow model files are code. Be careful with untrusted code.
+See [Using TensorFlow Securely](https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md)
+for details.
 
-### Saving variables
+### Save variables
 
 Create a `Saver` with `tf.train.Saver()` to manage all variables in the
 model. For example, the following snippet demonstrates how to call the
@@ -64,9 +59,7 @@ with tf.Session() as sess:
   print("Model saved in path: %s" % save_path)
 ```
 
-
-
-### Restoring variables
+### Restore variables
 
 The `tf.train.Saver` object not only saves variables to checkpoint files, it
 also restores variables. Note that when you restore variables you do not have
@@ -95,14 +88,11 @@ with tf.Session() as sess:
   print("v2 : %s" % v2.eval())
 ```
 
-Notes:
-
-*  There is not a physical file called "/tmp/model.ckpt". It is the **prefix**
-   of filenames created for the checkpoint. Users only interact with the
-   prefix instead of physical checkpoint files.
+Note: There is no physical file called `/tmp/model.ckpt`. It is the *prefix* of
+the filenames created for the checkpoint. Users only interact with the prefix
+instead of physical checkpoint files.
 
-
-### Choosing which variables to save and restore
+### Choose variables to save and restore
 
 If you do not pass any arguments to `tf.train.Saver()`, the saver handles all
 variables in the graph.  Each variable is saved under the name that was passed
@@ -201,29 +191,42 @@ chkp.print_tensors_in_checkpoint_file("/tmp/model.ckpt", tensor_name='v2', all_t
 
 
 <a name="models"></a>
-## Overview of saving and restoring models
+## Save and restore models
 
-When you want to save and load variables, the graph, and the
-graph's metadata--basically, when you want to save or restore
-your model--we recommend using SavedModel.
-**SavedModel** is a language-neutral, recoverable, hermetic
-serialization format.  SavedModel enables higher-level systems
-and tools to produce, consume, and transform TensorFlow models.
-TensorFlow provides several mechanisms for interacting with
-SavedModel, including tf.saved_model APIs, Estimator APIs and a CLI.
+Use `SavedModel` to save and load your model—variables, the graph, and the
+graph's metadata. This is a language-neutral, recoverable, hermetic
+serialization format that enables higher-level systems and tools to produce,
+consume, and transform TensorFlow models. TensorFlow provides several ways to
+interact with `SavedModel`, including the @{tf.saved_model} APIs,
+@{tf.estimator.Estimator}, and a command-line interface.
 
 
-## APIs to build and load a SavedModel
+## Build and load a SavedModel
 
-This section focuses on the APIs for building and loading a SavedModel,
-particularly when using lower-level TensorFlow APIs.
+### Simple save
 
+The easiest way to create a `SavedModel` is to use the @{tf.saved_model.simple_save}
+function:
 
-### Building a SavedModel
+```python
+tf.saved_model.simple_save(session,
+                           export_dir,
+                           inputs={"x": x, "y": y},
+                           outputs={"z": z})
+```
 
-We provide a Python implementation of the SavedModel
-@{tf.saved_model.builder$builder}.
-The `SavedModelBuilder` class provides functionality to
+This configures the `SavedModel` so it can be loaded by
+[TensorFlow Serving](/serving/serving_basic) and supports the
+[Predict API](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/apis/predict.proto).
+To access the classify, regress, or multi-inference APIs, use the manual
+`SavedModel` builder APIs or an @{tf.estimator.Estimator}.
+
+### Manually build a SavedModel
+
+If your use case isn't covered by @{tf.saved_model.simple_save}, use the manual
+@{tf.saved_model.builder$builder APIs} to create a `SavedModel`.
+
+The @{tf.saved_model.builder.SavedModelBuilder} class provides functionality to
 save multiple `MetaGraphDef`s.  A **MetaGraph** is a dataflow graph, plus
 its associated variables, assets, and signatures.  A **`MetaGraphDef`**
 is the protocol buffer representation of a MetaGraph.  A **signature** is
@@ -253,16 +256,51 @@ with tf.Session(graph=tf.Graph()) as sess:
   builder.add_meta_graph_and_variables(sess,
                                        [tag_constants.TRAINING],
                                        signature_def_map=foo_signatures,
-                                       assets_collection=foo_assets)
+                                       assets_collection=foo_assets,
+                                       strip_default_attrs=True)
 ...
 # Add a second MetaGraphDef for inference.
 with tf.Session(graph=tf.Graph()) as sess:
   ...
-  builder.add_meta_graph([tag_constants.SERVING])
+  builder.add_meta_graph([tag_constants.SERVING], strip_default_attrs=True)
 ...
 builder.save()
 ```
 
+<a name="forward_compatibility"></a>
+#### Forward compatibility via `strip_default_attrs=True`
+
+Following the guidance below gives you forward compatibility only if the set of
+Ops has not changed.
+
+The @{tf.saved_model.builder.SavedModelBuilder$`SavedModelBuilder`} class allows
+users to control whether default-valued attributes must be stripped from the
+@{$extend/tool_developers#nodes$`NodeDefs`}
+while adding a meta graph to the SavedModel bundle. Both
+@{tf.saved_model.builder.SavedModelBuilder.add_meta_graph_and_variables$`SavedModelBuilder.add_meta_graph_and_variables`}
+and @{tf.saved_model.builder.SavedModelBuilder.add_meta_graph$`SavedModelBuilder.add_meta_graph`}
+methods accept a Boolean flag `strip_default_attrs` that controls this behavior.
+
+If `strip_default_attrs` is `False`, the exported @{tf.MetaGraphDef} will have
+the default-valued attributes in all its @{tf.NodeDef} instances.
+This can break forward compatibility with a sequence of events such as the
+following:
+
+*  An existing Op (`Foo`) is updated to include a new attribute (`T`) with a
+   default (`bool`) at version 101.
+*  A model producer such as a "trainer binary" picks up this change (version 101)
+   to the `OpDef` and re-exports an existing model that uses Op `Foo`.
+*  A model consumer (such as [TensorFlow Serving](/serving)) running an older
+   binary (version 100) doesn't have attribute `T` for Op `Foo`, but tries to
+   import this model. The model consumer doesn't recognize attribute `T` in a
+   `NodeDef` that uses Op `Foo` and therefore fails to load the model.
+*  By setting `strip_default_attrs` to `True`, model producers can strip away
+   any default-valued attributes in the `NodeDefs`. This helps ensure that newly
+   added attributes with defaults don't cause older model consumers to fail
+   loading models regenerated with newer training binaries.
+
+See [compatibility guidance](https://www.tensorflow.org/programmers_guide/version_compat)
+for more information.
 
 ### Loading a SavedModel in Python
 
@@ -288,7 +326,7 @@ with tf.Session(graph=tf.Graph()) as sess:
 ```
 
 
-### Loading a SavedModel in C++
+### Load a SavedModel in C++
 
 The C++ version of the SavedModel
 [loader](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/cc/saved_model/loader.h)
@@ -306,7 +344,7 @@ LoadSavedModel(session_options, run_options, export_dir, {kSavedModelTagTrain},
                &bundle);
 ```
 
-### Loading and Serving a SavedModel in TensorFlow Serving
+### Load and serve a SavedModel in TensorFlow Serving
 
 You can easily load and serve a SavedModel with the TensorFlow Serving Model
 Server binary. See [instructions](https://www.tensorflow.org/serving/setup#installing_using_apt-get)
@@ -362,7 +400,7 @@ defined in:
 
 After training an `Estimator` model, you may want to create a service
 from that model that takes requests and returns a result.  You can run such a
-service locally on your machine or deploy it scalably in the cloud.
+service locally on your machine or deploy it in the cloud.
 
 To prepare a trained Estimator for serving, you must export it in the standard
 SavedModel format. This section explains how to:
@@ -374,7 +412,7 @@ SavedModel format. This section explains how to:
 * Serve the model from a local server and request predictions.
 
 
-### Preparing serving inputs
+### Prepare serving inputs
 
 During training, an @{$premade_estimators#input_fn$`input_fn()`} ingests data
 and prepares it for use by the model.  At serving time, similarly, a
@@ -448,14 +486,15 @@ to expect and how to map them to your model's expected inputs.
 By contrast, the *output* portion of the signature is determined by the model.
 
 
-### Performing the export
+### Perform the export
 
 To export your trained Estimator, call
 @{tf.estimator.Estimator.export_savedmodel} with the export base path and
 the `serving_input_receiver_fn`.
 
 ```py
-estimator.export_savedmodel(export_dir_base, serving_input_receiver_fn)
+estimator.export_savedmodel(export_dir_base, serving_input_receiver_fn,
+                            strip_default_attrs=True)
 ```
 
 This method builds a new graph by first calling the
@@ -471,7 +510,7 @@ Session.
 > Note: It is your responsibility to garbage-collect old exports.
 > Otherwise, successive exports will accumulate under `export_dir_base`.
 
-### Specifying the outputs of a custom model
+### Specify the outputs of a custom model
 
 When writing a custom `model_fn`, you must populate the `export_outputs` element
 of the @{tf.estimator.EstimatorSpec} return value. This is a dict of
@@ -503,7 +542,7 @@ indicating which `SignatureDef` will be served when an inference request
 does not specify one.
 
 
-### Serving the exported model locally
+### Serve the exported model locally
 
 For local deployment, you can serve your model using
 [TensorFlow Serving](https://github.com/tensorflow/serving), an open-source project that loads a
@@ -522,7 +561,7 @@ bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 -
 Now you have a server listening for inference requests via gRPC on port 9000!
 
 
-### Requesting predictions from a local server
+### Request predictions from a local server
 
 The server responds to gRPC requests according to the
 [PredictionService](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/apis/prediction_service.proto#L15)
@@ -615,7 +654,7 @@ passing in sample inputs in various formats (for example, Python
 expressions) and then fetching the output.
 
 
-### Installing the SavedModel CLI
+### Install the SavedModel CLI
 
 Broadly speaking, you can install TensorFlow in either of the following
 two ways:
@@ -697,15 +736,15 @@ executing the computation graph later. For example:
 $ saved_model_cli show --dir \
 /tmp/saved_model_dir --tag_set serve --signature_def serving_default
 The given SavedModel SignatureDef contains the following input(s):
-inputs['x'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x:0
+  inputs['x'] tensor_info:
+      dtype: DT_FLOAT
+      shape: (-1, 1)
+      name: x:0
 The given SavedModel SignatureDef contains the following output(s):
-outputs['y'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y:0
+  outputs['y'] tensor_info:
+      dtype: DT_FLOAT
+      shape: (-1, 1)
+      name: y:0
 Method name is: tensorflow/serving/predict
 ```
 
@@ -717,32 +756,32 @@ $ saved_model_cli show --dir /tmp/saved_model_dir --all
 MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
 
 signature_def['classify_x2_to_y3']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x2:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['scores'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y3:0
-Method name is: tensorflow/serving/classify
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: x2:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['scores'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y3:0
+  Method name is: tensorflow/serving/classify
 
 ...
 
 signature_def['serving_default']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['x'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['y'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y:0
-Method name is: tensorflow/serving/predict
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['x'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: x:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['y'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y:0
+  Method name is: tensorflow/serving/predict
 ```
 
 
@@ -842,7 +881,7 @@ For example:
 `<input_key>=[{"age":[22,24],"education":["BS","MS"]}]`
 ```
 
-#### Save Output
+#### Save output
 
 By default, the SavedModel CLI writes output to stdout. If a directory is
 passed to `--outdir` option, the outputs will be saved as npy files named after
@@ -851,7 +890,7 @@ output tensor keys under the given directory.
 Use `--overwrite` to overwrite existing output files.
 
 
-#### TensorFlow Debugger (tfdbg) Integration
+#### TensorFlow Debugger (tfdbg) integration
 
 If `--tf_debug` option is set, the SavedModel CLI will use the
 TensorFlow Debugger (tfdbg) to watch the intermediate Tensors and runtime
@@ -958,6 +997,3 @@ of checkpoints and assets:
 
 Each graph is associated with a specific set of tags, which enables
 identification during a load or restore operation.
-
-
-
diff --git a/tensorflow/docs_src/programmers_guide/summaries_and_tensorboard.md b/tensorflow/docs_src/programmers_guide/summaries_and_tensorboard.md
index 05dfdfdc4d2257fc680e7fa99b666ef86e3bef09..fadfa03e78349801d69e0045991a8fa9a0a59df9 100644
--- a/tensorflow/docs_src/programmers_guide/summaries_and_tensorboard.md
+++ b/tensorflow/docs_src/programmers_guide/summaries_and_tensorboard.md
@@ -16,10 +16,17 @@ TensorBoard is fully configured, it looks like this:
   </iframe>
 </div>
 
-This tutorial is intended to get you started with simple TensorBoard usage.
-There are other resources available as well! The [TensorBoard's GitHub](https://github.com/tensorflow/tensorboard)
-has a lot more information on TensorBoard usage, including tips & tricks, and
-debugging information.
+This 30-minute tutorial is intended to get you started with simple TensorBoard
+usage. It assumes a basic understanding of TensorFlow.
+
+There are other resources available as well! The [TensorBoard GitHub](https://github.com/tensorflow/tensorboard)
+has a lot more information on using individual dashboards within TensorBoard
+including tips & tricks and debugging information.
+
+## Setup
+
+[Install TensorFlow](https://www.tensorflow.org/install/). Installing TensorFlow
+via pip should also automatically install TensorBoard.
 
 ## Serializing the data
 
@@ -76,7 +83,7 @@ data than you need, though. Instead, consider running the merged summary op
 every `n` steps.
 
 The code example below is a modification of the
-@{$layers$simple MNIST tutorial},
+[simple MNIST tutorial](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist.py),
 in which we have added some summary ops, and run them every ten steps. If you
 run this and then launch `tensorboard --logdir=/tmp/tensorflow/mnist`, you'll be able
 to visualize statistics, such as how the weights or accuracy varied during
@@ -214,4 +221,5 @@ corner. Each tab represents a set of serialized data that can be visualized.
 For in depth information on how to use the *graph* tab to visualize your graph,
 see @{$graph_viz$TensorBoard: Graph Visualization}.
 
-For more usage information on TensorBoard in general, see the [TensorBoard's GitHub](https://github.com/tensorflow/tensorboard).
+For more usage information on TensorBoard in general, see the
+[TensorBoard GitHub](https://github.com/tensorflow/tensorboard).
diff --git a/tensorflow/docs_src/programmers_guide/using_tpu.md b/tensorflow/docs_src/programmers_guide/using_tpu.md
index d74d7f3181c9cf44e6c97e13742db682858f4694..a9c2cb3e33d4817b9a35400dcce9227ddd635ff4 100644
--- a/tensorflow/docs_src/programmers_guide/using_tpu.md
+++ b/tensorflow/docs_src/programmers_guide/using_tpu.md
@@ -129,10 +129,9 @@ my_tpu_estimator = tf.contrib.tpu.TPUEstimator(
 Typically the `FLAGS` would be set by command line arguments. To switch from
 training locally to training on a cloud TPU you would need to:
 
-  1) Set `FLAGS.use_tpu` to `True`
-  1) Set `FLAGS.tpu_name` so the
-     `tf.contrib.cluster_resolver.TPUClusterResolver` can find it
-  1) Set `FLAGS.model_dir` to a Google Cloud Storage bucket url (`gs://`).
+* Set `FLAGS.use_tpu` to `True`
+* Set `FLAGS.tpu_name` so the `tf.contrib.cluster_resolver.TPUClusterResolver` can find it
+* Set `FLAGS.model_dir` to a Google Cloud Storage bucket URL (`gs://`).
 
 
 ## Optimizer
diff --git a/tensorflow/docs_src/programmers_guide/version_compat.md b/tensorflow/docs_src/programmers_guide/version_compat.md
index e6613cc69f8aedf344fa25b6564889e34cd9bf53..72e427c5f8f0f6581d528f4ead18699736eafd04 100644
--- a/tensorflow/docs_src/programmers_guide/version_compat.md
+++ b/tensorflow/docs_src/programmers_guide/version_compat.md
@@ -183,7 +183,7 @@ Our versioning scheme has three requirements:
 *   **Forward compatibility** to support scenarios where the producer of a
     graph or checkpoint is upgraded to a newer version of TensorFlow before
     the consumer.
-*   Enable evolving TensorFlow in incompatible ways. For example, removing Ops,
+*   Enable evolving TensorFlow in incompatible ways. For example, removing ops,
     adding attributes, and removing attributes.
 
 Note that while the `GraphDef` version mechanism is separate from the TensorFlow
@@ -245,32 +245,51 @@ contains a main data version which is treated as either `producer` or
     `TF_CHECKPOINT_VERSION_MIN_CONSUMER`, and
     `TF_CHECKPOINT_VERSION_MIN_PRODUCER`.
 
+### Add a new attribute with default to an existing op
+
+Following the guidance below gives you forward compatibility only if the set of
+ops has not changed:
+
+1. If forward compatibility is desired, set `strip_default_attrs` to `True`
+   while exporting the model using the
+   @{tf.saved_model.builder.SavedModelBuilder.add_meta_graph_and_variables$`add_meta_graph_and_variables`}
+   or @{tf.saved_model.builder.SavedModelBuilder.add_meta_graph$`add_meta_graph`}
+   method of the `SavedModelBuilder` class, or
+   @{tf.estimator.Estimator.export_savedmodel$`Estimator.export_savedmodel`}.
+2. This strips off the default-valued attributes at the time of
+   producing/exporting the models. This makes sure that the exported
+   @{tf.MetaGraphDef} does not contain the new op-attribute when the default
+   value is used.
+3. Having this control could allow out-of-date consumers (for example, serving
+   binaries that lag behind training binaries) to continue loading the models
+   and prevent interruptions in model serving.
+
 ### Evolving GraphDef versions
 
 This section explains how to use this versioning mechanism to make different
 types of changes to the `GraphDef` format.
 
-#### Add an Op
+#### Add an op
 
-Add the new Op to both consumers and producers at the same time, and do not
+Add the new op to both consumers and producers at the same time, and do not
 change any `GraphDef` versions. This type of change is automatically
 backward compatible, and does not impact forward compatibility plan since
 existing producer scripts will not suddenly use the new functionality.
 
-#### Add an Op and switch existing Python wrappers to use it
+#### Add an op and switch existing Python wrappers to use it
 
 1.  Implement new consumer functionality and increment the `GraphDef` version.
 2.  If it is possible to make the wrappers use the new functionality only in
     cases that did not work before, the wrappers can be updated now.
 3.  Change Python wrappers to use the new functionality. Do not increment
-    `min_consumer`, since models that do not use this Op should not break.
+    `min_consumer`, since models that do not use this op should not break.
 
-#### Remove or restrict an Op's functionality
+#### Remove or restrict an op's functionality
 
-1.  Fix all producer scripts (not TensorFlow itself) to not use the banned Op or
+1.  Fix all producer scripts (not TensorFlow itself) to not use the banned op or
     functionality.
 2.  Increment the `GraphDef` version and implement new consumer functionality
-    that bans the removed Op or functionality for GraphDefs at the new version
+    that bans the removed op or functionality for GraphDefs at the new version
     and above. If possible, make TensorFlow stop producing `GraphDefs` with the
     banned functionality. To do so, add the
     [`REGISTER_OP(...).Deprecated(deprecated_at_version,
@@ -279,15 +298,15 @@ existing producer scripts will not suddenly use the new functionality.
 4.  Increase `min_producer` to the GraphDef version from (2) and remove the
     functionality entirely.
 
-#### Change an Op's functionality
+#### Change an op's functionality
 
-1.  Add a new similar Op named `SomethingV2` or similar and go through the
+1.  Add a new similar op named `SomethingV2` or similar and go through the
     process of adding it and switching existing Python wrappers to use it, which
     may take three weeks if forward compatibility is desired.
-2.  Remove the old Op (Can only take place with a major version change due to
+2.  Remove the old op (Can only take place with a major version change due to
     backward compatibility).
-3.  Increase `min_consumer` to rule out consumers with the old Op, add back the
-    old Op as an alias for `SomethingV2`, and go through the process to switch
+3.  Increase `min_consumer` to rule out consumers with the old op, add back the
+    old op as an alias for `SomethingV2`, and go through the process to switch
     existing Python wrappers to use it.
 4.  Go through the process to remove `SomethingV2`.
 
@@ -295,6 +314,6 @@ existing producer scripts will not suddenly use the new functionality.
 
 1.  Bump the `GraphDef` version and add the bad version to `bad_consumers` for
     all new GraphDefs. If possible, add to `bad_consumers` only for GraphDefs
-    which contain a certain Op or similar.
+    which contain a certain op or similar.
 2.  If existing consumers have the bad version, push them out as soon as
     possible.
diff --git a/tensorflow/docs_src/tutorials/deep_cnn.md b/tensorflow/docs_src/tutorials/deep_cnn.md
index 679754020470dddfcffa76e62ca8f55a439ec4f5..6a4c9a9b0727208a158b1b57d13ca70290961ec2 100644
--- a/tensorflow/docs_src/tutorials/deep_cnn.md
+++ b/tensorflow/docs_src/tutorials/deep_cnn.md
@@ -268,7 +268,7 @@ in `cifar10_input.py`.
 
 `cifar10_train.py` periodically @{tf.train.Saver$saves}
 all model parameters in
-@{$variables#saving-and-restoring$checkpoint files}
+@{$programmers_guide/saved_model$checkpoint files}
 but it does *not* evaluate the model. The checkpoint file
 will be used by `cifar10_eval.py` to measure the predictive
 performance (see [Evaluating a Model](#evaluating-a-model) below).
diff --git a/tensorflow/docs_src/tutorials/image_retraining.md b/tensorflow/docs_src/tutorials/image_retraining.md
index df15bc0a9c3763aa51c2fc8cf36ce9fc3544ae68..93d7c86e42aa90d145d27b56edc0abfec7034686 100644
--- a/tensorflow/docs_src/tutorials/image_retraining.md
+++ b/tensorflow/docs_src/tutorials/image_retraining.md
@@ -115,7 +115,7 @@ process is progressing. The training's objective is to make the loss as small as
 possible, so you can tell if the learning is working by keeping an eye on
 whether the loss keeps trending downwards, ignoring the short-term noise.
 
-By default this script will run 4,000 training steps. Each step chooses ten
+By default this script will run 4,000 training steps. Each step chooses 100
 images at random from the training set, finds their bottlenecks from the cache,
 and feeds them into the final layer to get predictions. Those predictions are
 then compared against the actual labels to update the final layer's weights
@@ -349,31 +349,32 @@ results, but if you intend to deploy your model on mobile devices or other
 resource-constrained environments you may want to trade off a little accuracy
 for much smaller file sizes or faster speeds. To help with that, the
 [retrain.py script](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py)
-supports 32 different variations on the [Mobilenet architecture](https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html).
+supports different variations on the [Mobilenet architecture](https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html).
 
 These are a little less precise than Inception v3, but can result in far
-smaller file sizes (down to less than a megabyte) and can be many times faster
+smaller file sizes (a few megabytes) and can be many times faster
 to run. To train with one of these models, pass in the `--architecture` flag,
 for example:
 
 ```
 python tensorflow/examples/image_retraining/retrain.py \
-    --image_dir ~/flower_photos --architecture mobilenet_0.25_128_quantized
+    --image_dir ~/flower_photos --architecture mobilenet_0.25_128
 ```
 
-This will create a 941KB model file in `/tmp/output_graph.pb`, with 25% of the
-parameters of the full Mobilenet, taking 128x128 sized input images, and with
-its weights quantized down to eight bits on disk. You can choose '1.0', '0.75',
-'0.50', or '0.25' to control the number of weight parameters, and so the file
-size (and to some extent the speed), '224', '192', '160', or '128' for the input
-image size, with smaller sizes giving faster speeds, and an optional
-'_quantized' at the end to indicate whether the file should contain 8-bit or
-32-bit float weights.
+This will create a 1.9MB model file in `/tmp/output_graph.pb`, with only 25% of
+the number of neurons of the full Mobilenet, and trained to take 128x128 sized
+input images.
+
+You can choose '1.0', '0.75', '0.50', or '0.25' to control the number of
+neurons (activations of hidden layers); the number of weights (and hence to
+some extent the file size and speed) shrinks like the square of that fraction.
+You can choose '224', '192', '160', or '128' for the input image size,
+with smaller sizes giving faster speeds.
 
 The speed and size advantages come at a loss to accuracy of course, but for many
 purposes this isn't critical. They can also be somewhat offset with improved
 training data. For example, training with distortions allows me to get above 80%
-accuracy on the flower data set even with the 0.25/128/quantized graph above.
+accuracy on the flower data set even with the 0.25/128 graph above.
 
 If you're going to be using the Mobilenet models in label_image or your own
 programs, you'll need to feed in an image of the specified size converted to a
@@ -395,3 +396,9 @@ python tensorflow/examples/label_image/label_image.py \
 --input_mean=128 --input_std=128 \
 --image=$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg
 ```
+
+For more information on deploying the retrained model to a mobile device, see
+the [codelab version](https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/#0)
+of this tutorial, especially [part 2](https://codelabs.developers.google.com/codelabs/tensorflow-for-poets-2-tflite/#0), which describes
+[TensorFlow Lite](/mobile/tflite/) and the additional optimizations it offers
+(including quantization of model weights).
diff --git a/tensorflow/docs_src/tutorials/index.md b/tensorflow/docs_src/tutorials/index.md
index 8c697e48e550c4e425db33bab7257532d209ac7a..af01d3eaa12157f82c981de005708509f6652cca 100644
--- a/tensorflow/docs_src/tutorials/index.md
+++ b/tensorflow/docs_src/tutorials/index.md
@@ -10,7 +10,7 @@ these tutorials.
 
 These tutorials cover different aspects of image recognition:
 
-  * @{$layers}, which introduces convolutional neural networks (CNNs) and
+  * @{$layers$MNIST}, which introduces convolutional neural networks (CNNs) and
     demonstrates how to build a CNN in TensorFlow.
   * @{$image_recognition}, which introduces the field of image recognition and
     uses a pre-trained model (Inception) for recognizing images.
diff --git a/tensorflow/docs_src/tutorials/kernel_methods.md b/tensorflow/docs_src/tutorials/kernel_methods.md
index 63f408c2ca304d6345ffff459b799b011f8d8035..73e5c5105784ddc9729b8cea6cd31921572837e1 100644
--- a/tensorflow/docs_src/tutorials/kernel_methods.md
+++ b/tensorflow/docs_src/tutorials/kernel_methods.md
@@ -1,9 +1,9 @@
 # Improving Linear Models Using Explicit Kernel Methods
 
-Note: This document uses a deprecated version of ${tf.estimator},
-which has a ${tf.contrib.learn.estimator$different interface}.
+Note: This document uses a deprecated version of @{tf.estimator},
+which has a @{tf.contrib.learn.Estimator$different interface}.
 It also uses other `contrib` methods whose
-${$version_compat#not_covered$API may not be stable}.
+@{$version_compat#not_covered$API may not be stable}.
 
 In this tutorial, we demonstrate how combining (explicit) kernel methods with
 linear models can drastically increase the latters' quality of predictions
@@ -53,7 +53,7 @@ In order to feed data to a `tf.contrib.learn Estimator`, it is helpful to conver
 it to Tensors. For this, we will use an `input function` which adds Ops to the
 TensorFlow graph that, when executed, create mini-batches of Tensors to be used
 downstream. For more background on input functions, check
-@{$get_started/premade_estimators#input_fn$this section on input functions}.
+@{$get_started/premade_estimators#create_input_functions$this section on input functions}.
 In this example, we will use the `tf.train.shuffle_batch` Op which, besides
 converting numpy arrays to Tensors, allows us to specify the batch_size and
 whether to randomize the input every time the input_fn Ops are executed
diff --git a/tensorflow/docs_src/tutorials/layers.md b/tensorflow/docs_src/tutorials/layers.md
index 5111b16247e2b5c3410e69dcdf08318a35b18c2f..aeb746f29c28700e6cbd3e6da5721fd294b9cada 100644
--- a/tensorflow/docs_src/tutorials/layers.md
+++ b/tensorflow/docs_src/tutorials/layers.md
@@ -193,14 +193,14 @@ to calculate loss, configure the training op, and generate predictions. If
 you're already experienced with CNNs and @{$get_started/custom_estimators$TensorFlow `Estimator`s},
 and find the above code intuitive, you may want to skim these sections or just
 skip ahead to ["Training and Evaluating the CNN MNIST
-Classifier"](#training-and-evaluating-the-cnn-mnist-classifier).
+Classifier"](#training_and_evaluating_the_cnn_mnist_classifier).
 
 ### Input Layer
 
 The methods in the `layers` module for creating convolutional and pooling layers
-for two-dimensional image data expect input tensors to have a shape of
-<code>[<em>batch_size</em>, <em>image_width</em>, <em>image_height</em>,
-<em>channels</em>]</code>, defined as follows:
+for two-dimensional image data expect input tensors to have a `channels_last` shape of
+<code>[<em>batch_size</em>, <em>image_height</em>, <em>image_width</em>, <em>channels</em>]</code>
+or a `channels_first` shape of <code>[<em>batch_size</em>, <em>channels</em>, <em>image_height</em>, <em>image_width</em>]</code>, defined as follows:
 
 *   _`batch_size`_. Size of the subset of examples to use when performing
     gradient descent during training.
@@ -446,7 +446,7 @@ tf.nn.softmax(logits, name="softmax_tensor")
 
 > Note: We use the `name` argument to explicitly name this operation
 > `softmax_tensor`, so we can reference it later. (We'll set up logging for the
-> softmax values in ["Set Up a Logging Hook"](#set-up-a-logging-hook).
+> softmax values in ["Set Up a Logging Hook"](#set-up-a-logging-hook)).
 
 We compile our predictions in a dict, and return an `EstimatorSpec` object:
 
@@ -534,9 +534,8 @@ if mode == tf.estimator.ModeKeys.TRAIN:
 ```
 
 > Note: For a more in-depth look at configuring training ops for Estimator model
-> functions, see @{$get_started/custom_estimators#defining-the-training-op-for-the-model$"Defining
-> the training op for the model"} in the @{$get_started/custom_estimators$"Creating Estimations in
-> tf.estimator"} tutorial.
+> functions, see @{$get_started/custom_estimators#defining_the_training_op_for_the_model$"Defining the training op for the model"}
+> in the @{$get_started/custom_estimators$"Creating Estimators in tf.estimator"} tutorial.
 
 ### Add evaluation metrics
 
@@ -625,8 +624,8 @@ operation earlier when we generated the probabilities in `cnn_model_fn`.
 > Note: If you don't explicitly assign a name to an operation via the `name`
 > argument, TensorFlow will assign a default name. A couple easy ways to
 > discover the names applied to operations are to visualize your graph on
-> @{$graph_viz$TensorBoard}) or to enable the @{$debugger$TensorFlow Debugger
-> (tfdbg)}.
+> @{$graph_viz$TensorBoard} or to enable the
+> @{$programmers_guide/debugger$TensorFlow Debugger (tfdbg)}.
 
 Next, we create the `LoggingTensorHook`, passing `tensors_to_log` to the
 `tensors` argument. We set `every_n_iter=50`, which specifies that probabilities
diff --git a/tensorflow/docs_src/tutorials/recurrent_quickdraw.md b/tensorflow/docs_src/tutorials/recurrent_quickdraw.md
index e22536adb6f0b893602ff79612cfb01e10586a18..5d83fbe2a3709c0834f448cbc316453f80428dd1 100644
--- a/tensorflow/docs_src/tutorials/recurrent_quickdraw.md
+++ b/tensorflow/docs_src/tutorials/recurrent_quickdraw.md
@@ -38,8 +38,8 @@ To try the code for this tutorial:
 1.  [Download the data](#download-the-data) in `TFRecord` format from
     [here](http://download.tensorflow.org/data/quickdraw_tutorial_dataset_v1.tar.gz) and unzip it. More details about [how to
     obtain the original Quick, Draw!
-    data](#optional-download-the-full-quick-draw-data) and [how to convert that
-    to `TFRecord` files](#optional-converting-the-data) is available below.
+    data](#optional_download_the_full_quick_draw_data) and [how to convert that
+    to `TFRecord` files](#optional_converting_the_data) is available below.
 
 1.  Execute the tutorial code with the following command to train the RNN-based
     model described in this tutorial. Make sure to adjust the paths to point to
@@ -108,8 +108,9 @@ This download will take a while and download a bit more than 23GB of data.
 ### Optional: Converting the data
 
 To convert the `ndjson` files to
-@{$python/python_io#tfrecords_format_details$TFRecord} files containing
-${tf.train.Example} protos run the following command.
+@{$python/python_io#TFRecords_Format_Details$TFRecord} files containing
+[`tf.train.Example`](https://www.tensorflow.org/code/tensorflow/core/example/example.proto)
+protos, run the following command.
 
 ```shell
    python create_dataset.py --ndjson_path rnn_tutorial_data \
@@ -117,7 +118,7 @@ ${tf.train.Example} protos run the following command.
 ```
 
 This will store the data in 10 shards of
-@{$python/python_io#tfrecords_format_details$TFRecord} files with 10000 items
+@{$python/python_io#TFRecords_Format_Details$TFRecord} files with 10000 items
 per class for the training data and 1000 items per class as eval data.
 
 This conversion process is described in more detail in the following.
diff --git a/tensorflow/docs_src/tutorials/wide.md b/tensorflow/docs_src/tutorials/wide.md
index 005dc020f94f666da295f4ff0342fae858121012..27ce75a30dd2acd5925702611042270e767b0c73 100644
--- a/tensorflow/docs_src/tutorials/wide.md
+++ b/tensorflow/docs_src/tutorials/wide.md
@@ -74,8 +74,8 @@ Here's a list of columns available in the Census Income dataset:
 | relationship   | Categorical | Wife, Own-child, Husband,         |
 :                :             : Not-in-family, Other-relative,    :
 :                :             : Unmarried.                        :
-| race           | Categorical | White, Asian-Pac-Islander,        |
-:                :             : Amer-Indian-Eskimo, Other, Black. :
+| race           | Categorical | Amer-Indian-Eskimo, Asian-Pac-    |
+:                :             : Islander, Black, White, Other.    :
 | gender         | Categorical | Female, Male.                     |
 | capital_gain   | Continuous  | Capital gains recorded.           |
 | capital_loss   | Continuous  | Capital Losses recorded.          |
@@ -247,7 +247,7 @@ hours_per_week = tf.feature_column.numeric_column('hours_per_week')
 ### Making Continuous Features Categorical through Bucketization
 
 Sometimes the relationship between a continuous feature and the label is not
-linear. As an hypothetical example, a person's income may grow with age in the
+linear. As a hypothetical example, a person's income may grow with age in the
 early stage of one's career, then the growth may slow at some point, and finally
 the income decreases after retirement. In this scenario, using the raw `age` as
 a real-valued feature column might not be a good choice because the model can
@@ -361,6 +361,16 @@ The first line of the final output should be something like
 `accuracy: 0.83557522`, which means the accuracy is 83.6%. Feel free to try more
 features and transformations and see if you can do even better!
 
+After the model is evaluated, we can use it to predict whether an individual has an annual income of over
+50,000 dollars, given that individual's information as input.
+```python
+  pred_iter = model.predict(input_fn=lambda: input_fn(FLAGS.test_data, 1, False, 1))
+  for pred in pred_iter:
+    print(pred['classes'])
+```
+
+The prediction output looks like `[b'1']` or `[b'0']`, indicating whether or not the corresponding individual has an annual income of over 50,000 dollars.
+
 If you'd like to see a working end-to-end example, you can download our
 [example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py)
 and set the `model_type` flag to `wide`.
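Review note on the prediction snippet added to `wide.md`: the `classes` entry is a sequence of byte strings, so readers usually need to decode it before using it. The following is a minimal sketch in plain Python, assuming each prediction is a dict shaped like the `[b'1']` / `[b'0']` output described in the tutorial. The helper name `income_over_50k` and the stand-in list `fake_predictions` are hypothetical and only replace the iterator returned by `model.predict(...)` for illustration.

```python
def income_over_50k(pred):
    """Return True if the predicted class label is b'1' (income over 50,000)."""
    # pred['classes'] holds a single byte-string label, e.g. [b'1'].
    label = pred['classes'][0]
    return label.decode('utf-8') == '1'


# Hypothetical stand-in for the iterator returned by model.predict(...).
fake_predictions = [{'classes': [b'1']}, {'classes': [b'0']}]

results = [income_over_50k(p) for p in fake_predictions]
print(results)
```

In the tutorial itself, the same helper could be applied to each `pred` yielded by `pred_iter`.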
diff --git a/tensorflow/examples/android/AndroidManifest.xml b/tensorflow/examples/android/AndroidManifest.xml
index bb75431a1f8bab2951299520903aa6e043f8415e..5c47ce6b673e4c9d635b867c1ccdc679f67c6ae5 100644
--- a/tensorflow/examples/android/AndroidManifest.xml
+++ b/tensorflow/examples/android/AndroidManifest.xml
@@ -40,6 +40,7 @@
             <intent-filter>
                 <action android:name="android.intent.action.MAIN" />
                 <category android:name="android.intent.category.LAUNCHER" />
+                <category android:name="android.intent.category.LEANBACK_LAUNCHER" />
             </intent-filter>
         </activity>
 
@@ -49,6 +50,7 @@
             <intent-filter>
                 <action android:name="android.intent.action.MAIN" />
                 <category android:name="android.intent.category.LAUNCHER" />
+                <category android:name="android.intent.category.LEANBACK_LAUNCHER" />
             </intent-filter>
         </activity>
 
@@ -58,6 +60,7 @@
             <intent-filter>
                 <action android:name="android.intent.action.MAIN" />
                 <category android:name="android.intent.category.LAUNCHER" />
+                <category android:name="android.intent.category.LEANBACK_LAUNCHER" />
             </intent-filter>
         </activity>
 
@@ -67,6 +70,7 @@
             <intent-filter>
                 <action android:name="android.intent.action.MAIN" />
                 <category android:name="android.intent.category.LAUNCHER" />
+                <category android:name="android.intent.category.LEANBACK_LAUNCHER" />
             </intent-filter>
         </activity>
     </application>
diff --git a/tensorflow/examples/android/src/org/tensorflow/demo/CameraActivity.java b/tensorflow/examples/android/src/org/tensorflow/demo/CameraActivity.java
index 8bd4abb154a8f8c74f2195d4acbb99d3d5d498ea..429138abe5338e63d602ef6005e7607d21e1e357 100644
--- a/tensorflow/examples/android/src/org/tensorflow/demo/CameraActivity.java
+++ b/tensorflow/examples/android/src/org/tensorflow/demo/CameraActivity.java
@@ -351,6 +351,10 @@ public abstract class CameraActivity extends Activity
 
   protected void setFragment() {
     String cameraId = chooseCamera();
+    if (cameraId == null) {
+      Toast.makeText(this, "No Camera Detected", Toast.LENGTH_SHORT).show();
+      finish();
+    }
 
     Fragment fragment;
     if (useCamera2API) {
@@ -416,7 +420,8 @@ public abstract class CameraActivity extends Activity
 
   @Override
   public boolean onKeyDown(final int keyCode, final KeyEvent event) {
-    if (keyCode == KeyEvent.KEYCODE_VOLUME_DOWN || keyCode == KeyEvent.KEYCODE_VOLUME_UP) {
+    if (keyCode == KeyEvent.KEYCODE_VOLUME_DOWN || keyCode == KeyEvent.KEYCODE_VOLUME_UP
+            || keyCode == KeyEvent.KEYCODE_BUTTON_L1 || keyCode == KeyEvent.KEYCODE_DPAD_CENTER) {
       debug = !debug;
       requestRender();
       onSetDebug(debug);
diff --git a/tensorflow/examples/android/src/org/tensorflow/demo/StylizeActivity.java b/tensorflow/examples/android/src/org/tensorflow/demo/StylizeActivity.java
index 6a66ec3927be62f1f996eb18bb6c04ea66f43152..33ec65e9f73a1d04bcafdc09d1618b32e03b1dc0 100644
--- a/tensorflow/examples/android/src/org/tensorflow/demo/StylizeActivity.java
+++ b/tensorflow/examples/android/src/org/tensorflow/demo/StylizeActivity.java
@@ -16,8 +16,10 @@
 
 package org.tensorflow.demo;
 
+import android.app.UiModeManager;
 import android.content.Context;
 import android.content.res.AssetManager;
+import android.content.res.Configuration;
 import android.graphics.Bitmap;
 import android.graphics.Bitmap.Config;
 import android.graphics.BitmapFactory;
@@ -31,9 +33,11 @@ import android.graphics.Typeface;
 import android.media.ImageReader.OnImageAvailableListener;
 import android.os.Bundle;
 import android.os.SystemClock;
+import android.util.DisplayMetrics;
 import android.util.Size;
 import android.util.TypedValue;
 import android.view.Display;
+import android.view.KeyEvent;
 import android.view.MotionEvent;
 import android.view.View;
 import android.view.View.OnClickListener;
@@ -43,6 +47,7 @@ import android.widget.BaseAdapter;
 import android.widget.Button;
 import android.widget.GridView;
 import android.widget.ImageView;
+import android.widget.RelativeLayout;
 import android.widget.Toast;
 import java.io.IOException;
 import java.io.InputStream;
@@ -381,6 +386,27 @@ public class StylizeActivity extends CameraActivity implements OnImageAvailableL
     grid = (GridView) findViewById(R.id.grid_layout);
     grid.setAdapter(adapter);
     grid.setOnTouchListener(gridTouchAdapter);
+
+    // Change UI on Android TV
+    UiModeManager uiModeManager = (UiModeManager) getSystemService(UI_MODE_SERVICE);
+    if (uiModeManager.getCurrentModeType() == Configuration.UI_MODE_TYPE_TELEVISION) {
+      DisplayMetrics displayMetrics = new DisplayMetrics();
+      getWindowManager().getDefaultDisplay().getMetrics(displayMetrics);
+      int styleSelectorHeight = displayMetrics.heightPixels;
+      int styleSelectorWidth = displayMetrics.widthPixels - styleSelectorHeight;
+      RelativeLayout.LayoutParams layoutParams = new RelativeLayout.LayoutParams(styleSelectorWidth, ViewGroup.LayoutParams.MATCH_PARENT);
+
+      // Calculate the number of styles per row, so all the styles can show up without scrolling
+      int numOfStylePerRow = 3;
+      while (styleSelectorWidth / numOfStylePerRow * Math.ceil((float) (adapter.getCount() - 2) / numOfStylePerRow) > styleSelectorHeight) {
+        numOfStylePerRow++;
+      }
+      grid.setNumColumns(numOfStylePerRow);
+      layoutParams.addRule(RelativeLayout.ALIGN_PARENT_RIGHT);
+      grid.setLayoutParams(layoutParams);
+      adapter.buttons.clear();
+    }
+
     setStyle(adapter.items[0], 1.0f);
   }
 
@@ -602,4 +628,38 @@ public class StylizeActivity extends CameraActivity implements OnImageAvailableL
 
     borderedText.drawLines(canvas, 10, canvas.getHeight() - 10, lines);
   }
+
+  @Override
+  public boolean onKeyDown(int keyCode, KeyEvent event) {
+    int moveOffset = 0;
+    switch (keyCode) {
+      case KeyEvent.KEYCODE_DPAD_LEFT:
+        moveOffset = -1;
+        break;
+      case KeyEvent.KEYCODE_DPAD_RIGHT:
+        moveOffset = 1;
+        break;
+      case KeyEvent.KEYCODE_DPAD_UP:
+        moveOffset = -1 * grid.getNumColumns();
+        break;
+      case KeyEvent.KEYCODE_DPAD_DOWN:
+        moveOffset = grid.getNumColumns();
+        break;
+      default:
+        return super.onKeyDown(keyCode, event);
+    }
+
+    // Find the currently selected style, i.e. the one with the highest value.
+    int currentSelect = 0;
+    float highestValue = 0;
+    for (int i = 0; i < adapter.getCount(); i++) {
+      if (adapter.items[i].value > highestValue) {
+        currentSelect = i;
+        highestValue = adapter.items[i].value;
+      }
+    }
+    setStyle(adapter.items[(currentSelect + moveOffset + adapter.getCount()) % adapter.getCount()], 1);
+
+    return true;
+  }
 }
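Review note on the D-pad handling added to `StylizeActivity.onKeyDown`: the selection logic is an argmax over the style values followed by a wrap-around index step. The sketch below restates that arithmetic in Python, assuming the style values are given as a plain list. The function name `next_style_index` is hypothetical. Adding `count` before the modulo mirrors the Java code, where `%` can return a negative result for a negative left operand.

```python
def next_style_index(values, move_offset):
    """Return the index selected after moving `move_offset` slots from the
    style that currently has the highest value, wrapping around the ends."""
    count = len(values)
    current = 0
    highest = 0.0
    for i, value in enumerate(values):
        if value > highest:
            current = i
            highest = value
    # Adding count keeps the left operand non-negative for offsets >= -count,
    # which is what the Java version relies on.
    return (current + move_offset + count) % count


left_from_first = next_style_index([1.0, 0.0, 0.0, 0.0], -1)
right_from_last = next_style_index([0.0, 0.0, 0.0, 1.0], 1)
print(left_from_first, right_from_last)
```

Moving left from the first style lands on the last one, and moving right from the last style lands on the first, which matches the wrap-around behavior of the Java handler.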
diff --git a/tensorflow/examples/image_retraining/retrain.py b/tensorflow/examples/image_retraining/retrain.py
index 25e09fecbfd093e97899807b82a03f1116dbe5ff..99a71206acbd533ec8bc5a9644435eacad564cd4 100644
--- a/tensorflow/examples/image_retraining/retrain.py
+++ b/tensorflow/examples/image_retraining/retrain.py
@@ -75,13 +75,16 @@ python tensorflow/examples/image_retraining/retrain.py \
     --image_dir ~/flower_photos --architecture mobilenet_1.0_224
 ```
 
-Run quantized version of mobilenet:
+Run mobilenet, instrumented for quantization:
 
 ```bash
 python tensorflow/examples/image_retraining/retrain.py \
-    --image_dir ~/flower_photos/   --architecture mobilenet_1.0_224_quantized
+    --image_dir ~/flower_photos/   --architecture mobilenet_1.0_224_quant
 ```
 
+These instrumented models can be converted to fully quantized mobile models via
+TensorFlow Lite.
+
 There are 32 different Mobilenet models to choose from, with a variety of file
 size and latency options. The first number can be '1.0', '0.75', '0.50', or
 '0.25' to control the size, and the second controls the input image size, either
@@ -121,7 +124,6 @@ import numpy as np
 from six.moves import urllib
 import tensorflow as tf
 
-from tensorflow.contrib.quantize.python import quant_ops
 from tensorflow.python.framework import graph_util
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.platform import gfile
@@ -135,6 +137,9 @@ FLAGS = None
 # need to update these to reflect the values in the network you're using.
 MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1  # ~134M
 
+# The location where variable checkpoints will be stored.
+CHECKPOINT_NAME = '/tmp/_retrain_checkpoint'
+
 
 def create_image_lists(image_dir, testing_percentage, validation_percentage):
   """Builds a list of training images from the file system.
@@ -745,9 +750,9 @@ def variable_summaries(var):
     tf.summary.histogram('histogram', var)
 
 
-def add_final_training_ops(class_count, final_tensor_name, bottleneck_tensor,
-                           bottleneck_tensor_size, quantize_layer):
-  """Adds a new softmax and fully-connected layer for training.
+def add_final_retrain_ops(class_count, final_tensor_name, bottleneck_tensor,
+                          bottleneck_tensor_size, quantize_layer, is_training):
+  """Adds a new softmax and fully-connected layer for training and eval.
 
   We need to retrain the top layer to identify our new classes, so this function
   adds the right operations to the graph, along with some variables to hold the
@@ -763,7 +768,9 @@ def add_final_training_ops(class_count, final_tensor_name, bottleneck_tensor,
     bottleneck_tensor: The output of the main CNN graph.
     bottleneck_tensor_size: How many entries in the bottleneck vector.
     quantize_layer: Boolean, specifying whether the newly added layer should be
-        quantized.
+        instrumented for quantization.
+    is_training: Boolean, specifying whether the newly added layer is for
+        training or eval.
 
   Returns:
     The tensors for the training and cross entropy results, and tensors for the
@@ -778,50 +785,41 @@ def add_final_training_ops(class_count, final_tensor_name, bottleneck_tensor,
     ground_truth_input = tf.placeholder(
         tf.int64, [None], name='GroundTruthInput')
 
-  # Organizing the following ops as `final_training_ops` so they're easier
-  # to see in TensorBoard
-  layer_name = 'final_training_ops'
+  # Organizing the following ops so they are easier to see in TensorBoard.
+  layer_name = 'final_retrain_ops'
   with tf.name_scope(layer_name):
     with tf.name_scope('weights'):
       initial_value = tf.truncated_normal(
           [bottleneck_tensor_size, class_count], stddev=0.001)
       layer_weights = tf.Variable(initial_value, name='final_weights')
-      if quantize_layer:
-        quantized_layer_weights = quant_ops.MovingAvgQuantize(
-            layer_weights, is_training=True)
-        variable_summaries(quantized_layer_weights)
-
       variable_summaries(layer_weights)
+
     with tf.name_scope('biases'):
       layer_biases = tf.Variable(tf.zeros([class_count]), name='final_biases')
-      if quantize_layer:
-        quantized_layer_biases = quant_ops.MovingAvgQuantize(
-            layer_biases, is_training=True)
-        variable_summaries(quantized_layer_biases)
-
       variable_summaries(layer_biases)
 
     with tf.name_scope('Wx_plus_b'):
-      if quantize_layer:
-        logits = tf.matmul(bottleneck_input,
-                           quantized_layer_weights) + quantized_layer_biases
-        logits = quant_ops.MovingAvgQuantize(
-            logits,
-            init_min=-32.0,
-            init_max=32.0,
-            is_training=True,
-            num_bits=8,
-            narrow_range=False,
-            ema_decay=0.5)
-        tf.summary.histogram('pre_activations', logits)
-      else:
-        logits = tf.matmul(bottleneck_input, layer_weights) + layer_biases
-        tf.summary.histogram('pre_activations', logits)
+      logits = tf.matmul(bottleneck_input, layer_weights) + layer_biases
+      tf.summary.histogram('pre_activations', logits)
 
   final_tensor = tf.nn.softmax(logits, name=final_tensor_name)
 
+  # The tf.contrib.quantize functions rewrite the graph in place for
+  # quantization. The imported model graph has already been rewritten, so upon
+  # calling these rewrites, only the newly added final layer will be
+  # transformed.
+  if quantize_layer:
+    if is_training:
+      tf.contrib.quantize.create_training_graph()
+    else:
+      tf.contrib.quantize.create_eval_graph()
+
   tf.summary.histogram('activations', final_tensor)
 
+  # If this is an eval graph, we don't need to add loss ops or an optimizer.
+  if not is_training:
+    return None, None, bottleneck_input, ground_truth_input, final_tensor
+
   with tf.name_scope('cross_entropy'):
     cross_entropy_mean = tf.losses.sparse_softmax_cross_entropy(
         labels=ground_truth_input, logits=logits)
@@ -857,13 +855,91 @@ def add_evaluation_step(result_tensor, ground_truth_tensor):
   return evaluation_step, prediction
 
 
-def save_graph_to_file(sess, graph, graph_file_name):
+def run_final_eval(sess, model_info, class_count, image_lists, jpeg_data_tensor,
+                   decoded_image_tensor, resized_image_tensor,
+                   bottleneck_tensor):
+  """Runs a final evaluation on an eval graph using the test data set.
+
+  Args:
+    sess: Session for the train graph.
+    model_info: Model info dictionary from create_model_info()
+    class_count: Number of classes
+    image_lists: Dictionary of training images for each label.
+    jpeg_data_tensor: The layer to feed jpeg image data into.
+    decoded_image_tensor: The output of decoding and resizing the image.
+    resized_image_tensor: The input node of the recognition graph.
+    bottleneck_tensor: The bottleneck output layer of the CNN graph.
+  """
+  (sess, bottleneck_input, ground_truth_input, evaluation_step,
+   prediction) = build_eval_session(model_info, class_count)
+
+  test_bottlenecks, test_ground_truth, test_filenames = (
+      get_random_cached_bottlenecks(sess, image_lists, FLAGS.test_batch_size,
+                                    'testing', FLAGS.bottleneck_dir,
+                                    FLAGS.image_dir, jpeg_data_tensor,
+                                    decoded_image_tensor, resized_image_tensor,
+                                    bottleneck_tensor, FLAGS.architecture))
+  test_accuracy, predictions = sess.run(
+      [evaluation_step, prediction],
+      feed_dict={
+          bottleneck_input: test_bottlenecks,
+          ground_truth_input: test_ground_truth
+      })
+  tf.logging.info('Final test accuracy = %.1f%% (N=%d)' %
+                  (test_accuracy * 100, len(test_bottlenecks)))
+
+  if FLAGS.print_misclassified_test_images:
+    tf.logging.info('=== MISCLASSIFIED TEST IMAGES ===')
+    for i, test_filename in enumerate(test_filenames):
+      if predictions[i] != test_ground_truth[i]:
+        tf.logging.info('%70s  %s' % (test_filename,
+                                      list(image_lists.keys())[predictions[i]]))
+
+
+def build_eval_session(model_info, class_count):
+  """Builds a restored eval session without train operations for exporting.
+
+  Args:
+    model_info: Model info dictionary from create_model_info()
+    class_count: Number of classes
+
+  Returns:
+    Eval session containing the restored eval graph.
+    The bottleneck input, ground truth, eval step, and prediction tensors.
+  """
+  # If quantized, we need to create the correct eval graph for exporting.
+  eval_graph, bottleneck_tensor, _ = create_model_graph(model_info)
+
+  eval_sess = tf.Session(graph=eval_graph)
+  with eval_graph.as_default():
+    # Add the new layer for exporting.
+    (_, _, bottleneck_input,
+     ground_truth_input, final_tensor) = add_final_retrain_ops(
+         class_count, FLAGS.final_tensor_name, bottleneck_tensor,
+         model_info['bottleneck_tensor_size'], model_info['quantize_layer'],
+         False)
+
+    # Now we need to restore the values from the training graph to the eval
+    # graph.
+    tf.train.Saver().restore(eval_sess, CHECKPOINT_NAME)
+
+    evaluation_step, prediction = add_evaluation_step(final_tensor,
+                                                      ground_truth_input)
+
+  return (eval_sess, bottleneck_input, ground_truth_input, evaluation_step,
+          prediction)
+
+
+def save_graph_to_file(graph, graph_file_name, model_info, class_count):
+  """Saves a graph to file, creating a valid quantized one if necessary."""
+  sess, _, _, _, _ = build_eval_session(model_info, class_count)
+  graph = sess.graph
+
   output_graph_def = graph_util.convert_variables_to_constants(
       sess, graph.as_graph_def(), [FLAGS.final_tensor_name])
 
   with gfile.FastGFile(graph_file_name, 'wb') as f:
     f.write(output_graph_def.SerializeToString())
-  return
 
 
 def prepare_file_system():
@@ -916,11 +992,10 @@ def create_model_info(architecture):
       return None
     version_string = parts[1]
     if (version_string != '1.0' and version_string != '0.75' and
-        version_string != '0.50' and version_string != '0.25'):
+        version_string != '0.5' and version_string != '0.25'):
       tf.logging.error(
-          """"The Mobilenet version should be '1.0', '0.75', '0.50', or '0.25',
-  but found '%s' for architecture '%s'""",
-          version_string, architecture)
+          """The Mobilenet version should be '1.0', '0.75', '0.5', or '0.25',
+  but found '%s' for architecture '%s'""", version_string, architecture)
       return None
     size_string = parts[2]
     if (size_string != '224' and size_string != '192' and
@@ -933,35 +1008,26 @@ def create_model_info(architecture):
     if len(parts) == 3:
       is_quantized = False
     else:
-      if parts[3] != 'quantized':
+      if parts[3] != 'quant':
         tf.logging.error(
             "Couldn't understand architecture suffix '%s' for '%s'", parts[3],
             architecture)
         return None
       is_quantized = True
 
+    data_url = 'http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/'
+    model_name = 'mobilenet_v1_' + version_string + '_' + size_string
     if is_quantized:
-      data_url = 'http://download.tensorflow.org/models/mobilenet_v1_'
-      data_url += version_string + '_' + size_string + '_quantized_frozen.tgz'
-      bottleneck_tensor_name = 'MobilenetV1/Predictions/Reshape:0'
-      resized_input_tensor_name = 'Placeholder:0'
-      model_dir_name = ('mobilenet_v1_' + version_string + '_' + size_string +
-                        '_quantized_frozen')
-      model_base_name = 'quantized_frozen_graph.pb'
-
-    else:
-      data_url = 'http://download.tensorflow.org/models/mobilenet_v1_'
-      data_url += version_string + '_' + size_string + '_frozen.tgz'
-      bottleneck_tensor_name = 'MobilenetV1/Predictions/Reshape:0'
-      resized_input_tensor_name = 'input:0'
-      model_dir_name = 'mobilenet_v1_' + version_string + '_' + size_string
-      model_base_name = 'frozen_graph.pb'
+      model_name += '_quant'
+    data_url += model_name + '.tgz'
+    bottleneck_tensor_name = 'MobilenetV1/Predictions/Reshape:0'
+    resized_input_tensor_name = 'input:0'
+    model_file_name = model_name + '_frozen.pb'
 
     bottleneck_tensor_size = 1001
     input_width = int(size_string)
     input_height = int(size_string)
     input_depth = 3
-    model_file_name = os.path.join(model_dir_name, model_base_name)
     input_mean = 127.5
     input_std = 127.5
   else:
@@ -1011,43 +1077,45 @@ def add_jpeg_decoding(input_width, input_height, input_depth, input_mean,
   return jpeg_data, mul_image
 
 
-def export_model(sess, architecture, saved_model_dir):
+def export_model(model_info, class_count, saved_model_dir):
   """Exports model for serving.
 
   Args:
-    sess: Current active TensorFlow Session.
-    architecture: Model architecture.
+    model_info: The model info dictionary for the current model.
+    class_count: The number of classes.
     saved_model_dir: Directory in which to save exported model and variables.
   """
-  if architecture == 'inception_v3':
-    input_tensor = 'DecodeJpeg/contents:0'
-  elif architecture.startswith('mobilenet_'):
-    input_tensor = 'input:0'
-  else:
-    raise ValueError('Unknown architecture', architecture)
-  in_image = sess.graph.get_tensor_by_name(input_tensor)
-  inputs = {'image': tf.saved_model.utils.build_tensor_info(in_image)}
-
-  out_classes = sess.graph.get_tensor_by_name('final_result:0')
-  outputs = {'prediction': tf.saved_model.utils.build_tensor_info(out_classes)}
+  # The SavedModel should hold the eval graph.
+  sess, _, _, _, _ = build_eval_session(model_info, class_count)
+  graph = sess.graph
+  with graph.as_default():
+    input_tensor = model_info['resized_input_tensor_name']
+    in_image = sess.graph.get_tensor_by_name(input_tensor)
+    inputs = {'image': tf.saved_model.utils.build_tensor_info(in_image)}
+
+    out_classes = sess.graph.get_tensor_by_name('final_result:0')
+    outputs = {
+        'prediction': tf.saved_model.utils.build_tensor_info(out_classes)
+    }
 
-  signature = tf.saved_model.signature_def_utils.build_signature_def(
-      inputs=inputs,
-      outputs=outputs,
-      method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
+    signature = tf.saved_model.signature_def_utils.build_signature_def(
+        inputs=inputs,
+        outputs=outputs,
+        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
 
-  legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
+    legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
 
-  # Save out the SavedModel.
-  builder = tf.saved_model.builder.SavedModelBuilder(saved_model_dir)
-  builder.add_meta_graph_and_variables(
-      sess, [tf.saved_model.tag_constants.SERVING],
-      signature_def_map={
-          tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
-              signature
-      },
-      legacy_init_op=legacy_init_op)
-  builder.save()
+    # Save out the SavedModel.
+    builder = tf.saved_model.builder.SavedModelBuilder(saved_model_dir)
+    builder.add_meta_graph_and_variables(
+        sess, [tf.saved_model.tag_constants.SERVING],
+        signature_def_map={
+            tf.saved_model.signature_constants.
+            DEFAULT_SERVING_SIGNATURE_DEF_KEY:
+                signature
+        },
+        legacy_init_op=legacy_init_op)
+    builder.save()
 
 
 def main(_):
@@ -1064,11 +1132,6 @@ def main(_):
     tf.logging.error('Did not recognize architecture flag')
     return -1
 
-  # Set up the pre-trained graph.
-  maybe_download_and_extract(model_info['data_url'])
-  graph, bottleneck_tensor, resized_image_tensor = (
-      create_model_graph(model_info))
-
   # Look at the folder structure, and create lists of all the images.
   image_lists = create_image_lists(FLAGS.image_dir, FLAGS.testing_percentage,
                                    FLAGS.validation_percentage)
@@ -1087,6 +1150,19 @@ def main(_):
       FLAGS.flip_left_right, FLAGS.random_crop, FLAGS.random_scale,
       FLAGS.random_brightness)
 
+  # Set up the pre-trained graph.
+  maybe_download_and_extract(model_info['data_url'])
+  graph, bottleneck_tensor, resized_image_tensor = (
+      create_model_graph(model_info))
+
+  # Add the new layer that we'll be training.
+  with graph.as_default():
+    (train_step, cross_entropy, bottleneck_input,
+     ground_truth_input, final_tensor) = add_final_retrain_ops(
+         class_count, FLAGS.final_tensor_name, bottleneck_tensor,
+         model_info['bottleneck_tensor_size'], model_info['quantize_layer'],
+         True)
+
   with tf.Session(graph=graph) as sess:
     # Set up the image decoding sub-graph.
     jpeg_data_tensor, decoded_image_tensor = add_jpeg_decoding(
@@ -1110,15 +1186,8 @@ def main(_):
                         decoded_image_tensor, resized_image_tensor,
                         bottleneck_tensor, FLAGS.architecture)
 
-    # Add the new layer that we'll be training.
-    (train_step, cross_entropy, bottleneck_input, ground_truth_input,
-     final_tensor) = add_final_training_ops(
-         len(image_lists.keys()), FLAGS.final_tensor_name, bottleneck_tensor,
-         model_info['bottleneck_tensor_size'], model_info['quantize_layer'])
-
     # Create the operations we need to evaluate the accuracy of our new layer.
-    evaluation_step, prediction = add_evaluation_step(
-        final_tensor, ground_truth_input)
+    evaluation_step, _ = add_evaluation_step(final_tensor, ground_truth_input)
 
     # Merge all the summaries and write them out to the summaries_dir
     merged = tf.summary.merge_all()
@@ -1128,6 +1197,10 @@ def main(_):
     validation_writer = tf.summary.FileWriter(
         FLAGS.summaries_dir + '/validation')
 
+    # Create a train saver that is used to restore values into an eval graph
+    # when exporting models.
+    train_saver = tf.train.Saver()
+
     # Set up all our weights to their initial default values.
     init = tf.global_variables_initializer()
     sess.run(init)
@@ -1168,6 +1241,9 @@ def main(_):
                         (datetime.now(), i, train_accuracy * 100))
         tf.logging.info('%s: Step %d: Cross entropy = %f' %
                         (datetime.now(), i, cross_entropy_value))
+        # TODO(suharshs): Make this use an eval graph, to avoid quantization
+        # moving averages being updated by the validation set, though in
+        # practice this makes a negligible difference.
         validation_bottlenecks, validation_ground_truth, _ = (
             get_random_cached_bottlenecks(
                 sess, image_lists, FLAGS.validation_batch_size, 'validation',
@@ -1190,42 +1266,32 @@ def main(_):
 
       if (intermediate_frequency > 0 and (i % intermediate_frequency == 0)
           and i > 0):
+        # If we want to do an intermediate save, save a checkpoint of the train
+        # graph, to restore into the eval graph.
+        train_saver.save(sess, CHECKPOINT_NAME)
         intermediate_file_name = (FLAGS.intermediate_output_graphs_dir +
                                   'intermediate_' + str(i) + '.pb')
         tf.logging.info('Save intermediate result to : ' +
                         intermediate_file_name)
-        save_graph_to_file(sess, graph, intermediate_file_name)
+        save_graph_to_file(graph, intermediate_file_name, model_info,
+                           class_count)
+
+    # After training is complete, force one last save of the train checkpoint.
+    train_saver.save(sess, CHECKPOINT_NAME)
 
     # We've completed all our training, so run a final test evaluation on
     # some new images we haven't used before.
-    test_bottlenecks, test_ground_truth, test_filenames = (
-        get_random_cached_bottlenecks(
-            sess, image_lists, FLAGS.test_batch_size, 'testing',
-            FLAGS.bottleneck_dir, FLAGS.image_dir, jpeg_data_tensor,
-            decoded_image_tensor, resized_image_tensor, bottleneck_tensor,
-            FLAGS.architecture))
-    test_accuracy, predictions = sess.run(
-        [evaluation_step, prediction],
-        feed_dict={bottleneck_input: test_bottlenecks,
-                   ground_truth_input: test_ground_truth})
-    tf.logging.info('Final test accuracy = %.1f%% (N=%d)' %
-                    (test_accuracy * 100, len(test_bottlenecks)))
-
-    if FLAGS.print_misclassified_test_images:
-      tf.logging.info('=== MISCLASSIFIED TEST IMAGES ===')
-      for i, test_filename in enumerate(test_filenames):
-        if predictions[i] != test_ground_truth[i]:
-          tf.logging.info('%70s  %s' %
-                          (test_filename,
-                           list(image_lists.keys())[predictions[i]]))
+    run_final_eval(sess, model_info, class_count, image_lists, jpeg_data_tensor,
+                   decoded_image_tensor, resized_image_tensor,
+                   bottleneck_tensor)
 
     # Write out the trained graph and labels with the weights stored as
     # constants.
-    save_graph_to_file(sess, graph, FLAGS.output_graph)
+    save_graph_to_file(graph, FLAGS.output_graph, model_info, class_count)
     with gfile.FastGFile(FLAGS.output_labels, 'w') as f:
       f.write('\n'.join(image_lists.keys()) + '\n')
 
-    export_model(sess, FLAGS.architecture, FLAGS.saved_model_dir)
+    export_model(model_info, class_count, FLAGS.saved_model_dir)
 
 
 if __name__ == '__main__':
@@ -1406,8 +1472,9 @@ if __name__ == '__main__':
       form 'mobilenet_<parameter size>_<input_size>[_quantized]'. For example,
       'mobilenet_1.0_224' will pick a model that is 17 MB in size and takes 224
       pixel input images, while 'mobilenet_0.25_128_quantized' will choose a much
-      less accurate, but smaller and faster network that's 920 KB on disk and
-      takes 128x128 images. See https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html
+      smaller and less accurate model, taking 128x128 images, and instrumented
+      for eventual quantization via TensorFlow Lite.
+      See https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html
       for more information on Mobilenet.\
       """)
   parser.add_argument(
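Review note on the `create_model_info` changes in `retrain.py`: the Mobilenet branch now derives one model name from the architecture string and builds both the download URL and the frozen graph file name from it. The following is a minimal sketch of that naming logic in plain Python, assuming the suffix and URL prefix shown in the diff. The function name `mobilenet_download_info` is hypothetical, it raises `ValueError` where the script logs an error and returns `None`, and it leaves out the input-size check.

```python
BASE_URL = 'http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/'


def mobilenet_download_info(architecture):
    """Return (data_url, model_file_name) for a Mobilenet architecture string
    such as 'mobilenet_1.0_224' or 'mobilenet_0.25_128_quant'."""
    parts = architecture.split('_')
    if len(parts) not in (3, 4) or parts[0] != 'mobilenet':
        raise ValueError("Couldn't understand architecture name '%s'" %
                         architecture)
    version_string, size_string = parts[1], parts[2]
    # The diff changes the accepted spelling from '0.50' to '0.5'.
    if version_string not in ('1.0', '0.75', '0.5', '0.25'):
        raise ValueError("Unsupported Mobilenet version '%s'" % version_string)
    is_quantized = len(parts) == 4
    if is_quantized and parts[3] != 'quant':
        raise ValueError("Couldn't understand architecture suffix '%s'" %
                         parts[3])
    model_name = 'mobilenet_v1_' + version_string + '_' + size_string
    if is_quantized:
        model_name += '_quant'
    return BASE_URL + model_name + '.tgz', model_name + '_frozen.pb'


url, file_name = mobilenet_download_info('mobilenet_1.0_224_quant')
print(url)
print(file_name)
```

One consequence visible in this sketch is that the old `_quantized` suffix and the `0.50` version spelling are both rejected after this change.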
diff --git a/tensorflow/examples/image_retraining/retrain_test.py b/tensorflow/examples/image_retraining/retrain_test.py
index 8b8dd45fd72e3d29bdb7f6291cc53b912adf3644..fb7324c58ac1be60baad840207f31a61ec6182be 100644
--- a/tensorflow/examples/image_retraining/retrain_test.py
+++ b/tensorflow/examples/image_retraining/retrain_test.py
@@ -67,22 +67,52 @@ class ImageRetrainingTest(test_util.TensorFlowTestCase):
         self.assertIsNotNone(sess.graph.get_tensor_by_name('DistortResult:0'))
 
   @tf.test.mock.patch.object(retrain, 'FLAGS', learning_rate=0.01)
-  def testAddFinalTrainingOps(self, flags_mock):
+  def testAddFinalRetrainOps(self, flags_mock):
     with tf.Graph().as_default():
       with tf.Session() as sess:
         bottleneck = tf.placeholder(tf.float32, [1, 1024], name='bottleneck')
-        # Test creating final training op with quantization
-        retrain.add_final_training_ops(5, 'final', bottleneck, 1024, False)
+        # Test creating final training op without quantization.
+        retrain.add_final_retrain_ops(5, 'final', bottleneck, 1024, False,
+                                      False)
         self.assertIsNotNone(sess.graph.get_tensor_by_name('final:0'))
 
   @tf.test.mock.patch.object(retrain, 'FLAGS', learning_rate=0.01)
-  def testAddFinalTrainingOpsQuantized(self, flags_mock):
-    with tf.Graph().as_default():
+  def testAddFinalRetrainOpsQuantized(self, flags_mock):
+    # Ensure that the training and eval graphs for quantized models are
+    # correctly created.
+    with tf.Graph().as_default() as g:
+      with tf.Session() as sess:
+        bottleneck = tf.placeholder(tf.float32, [1, 1024], name='bottleneck')
+        # Test creating final training op with quantization, set is_training to
+        # true.
+        retrain.add_final_retrain_ops(5, 'final', bottleneck, 1024, True, True)
+        self.assertIsNotNone(sess.graph.get_tensor_by_name('final:0'))
+        found_fake_quant = 0
+        for op in g.get_operations():
+          if op.type == 'FakeQuantWithMinMaxVars':
+            found_fake_quant += 1
+            # Ensure that each FakeQuant operation has exactly 2 Assign
+            # operations among its inputs in the training graph
+            # (Assign[Min,Max]Last, Assign[Min,Max]Ema).
+            self.assertEqual(2,
+                             len([i for i in op.inputs if 'Assign' in i.name]))
+        self.assertEqual(found_fake_quant, 2)
+    with tf.Graph().as_default() as g:
       with tf.Session() as sess:
         bottleneck = tf.placeholder(tf.float32, [1, 1024], name='bottleneck')
-        # Test creating final training op with quantization
-        retrain.add_final_training_ops(5, 'final', bottleneck, 1024, True)
+        # Test creating final training op with quantization, set is_training to
+        # false.
+        retrain.add_final_retrain_ops(5, 'final', bottleneck, 1024, True, False)
         self.assertIsNotNone(sess.graph.get_tensor_by_name('final:0'))
+        found_fake_quant = 0
+        for op in g.get_operations():
+          if op.type == 'FakeQuantWithMinMaxVars':
+            found_fake_quant += 1
+            for i in op.inputs:
+              # Ensure that no input is an Assign operation, since this is the
+              # evaluation graph.
+              self.assertTrue('Assign' not in i.name)
+        self.assertEqual(found_fake_quant, 2)
 
   def testAddEvaluationStep(self):
     with tf.Graph().as_default():
diff --git a/tensorflow/examples/ios/README.md b/tensorflow/examples/ios/README.md
index 5bdaeb43ce143e36e78cfe301fd9b59e8b85b034..5d7bd36837b2a2c33ab4bc311a582c174666dcd5 100644
--- a/tensorflow/examples/ios/README.md
+++ b/tensorflow/examples/ios/README.md
@@ -119,11 +119,13 @@ rundown:
    `tensorflow/contrib/makefile/gen/lib` to the Library Search Paths setting.
 
  - You'll also need to add `libprotobuf.a` and `libprotobuf-lite.a` from
-   `tensorflow/contrib/makefile/gen/protobuf_ios/lib` to your _Build Stages_ and
-   _Library Search Paths_.
+   `tensorflow/contrib/makefile/gen/protobuf_ios/lib`
+   and `nsync.a` from `tensorflow/contrib/makefile/downloads/nsync/builds/lipo.ios.c++11`
+   to your _Build Stages_ and _Library Search Paths_.
 
  - The _Header Search_ paths needs to contain:
    - the root folder of tensorflow,
+   - `tensorflow/contrib/makefile/downloads/nsync/public`
    - `tensorflow/contrib/makefile/downloads/protobuf/src`
    - `tensorflow/contrib/makefile/downloads`,
    - `tensorflow/contrib/makefile/downloads/eigen`, and
diff --git a/tensorflow/examples/learn/mnist.py b/tensorflow/examples/learn/mnist.py
index 98819b20bfea5021d52e2c50b004bccdaf1f25e7..3ead8614b68959b95ccad43623d4df4a5c4665bd 100644
--- a/tensorflow/examples/learn/mnist.py
+++ b/tensorflow/examples/learn/mnist.py
@@ -61,8 +61,10 @@ def conv_model(features, labels, mode):
 
   # Densely connected layer with 1024 neurons.
   h_fc1 = tf.layers.dense(h_pool2_flat, 1024, activation=tf.nn.relu)
-  if mode == tf.estimator.ModeKeys.TRAIN:
-    h_fc1 = tf.layers.dropout(h_fc1, rate=0.5)
+  h_fc1 = tf.layers.dropout(
+      h_fc1,
+      rate=0.5,
+      training=(mode == tf.estimator.ModeKeys.TRAIN))
 
   # Compute logits (1 per class) and compute loss.
   logits = tf.layers.dense(h_fc1, N_DIGITS, activation=None)
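Note on the mnist.py hunk above: the mode is now passed to `tf.layers.dropout` through `training=` instead of guarding the call with an `if`. The following is a minimal, framework-free sketch of what such a flag typically controls (inverted dropout). The `dropout` helper and its `rng` parameter are hypothetical names used for illustration and are not TensorFlow's implementation.

```python
import random


def dropout(values, rate, training, rng=None):
    """Hypothetical inverted-dropout helper, for illustration only."""
    if not training or rate == 0.0:
        # Inference mode: return the inputs untouched.
        return list(values)
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    # Training mode: zero each element with probability `rate` and scale
    # the survivors by 1 / keep_prob so the expected value is preserved.
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]
```

With `training=False` the call is an identity, which is why a single unconditional call can serve both the training and evaluation graphs.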
diff --git a/tensorflow/examples/learn/resnet.py b/tensorflow/examples/learn/resnet.py
index 9542e552504580a6614f8bd2f43c38dfa795750f..c00de932a8707ad5717aaf1251cf5c88464a28b0 100755
--- a/tensorflow/examples/learn/resnet.py
+++ b/tensorflow/examples/learn/resnet.py
@@ -53,6 +53,8 @@ def res_net_model(features, labels, mode):
     ndim = int(sqrt(input_shape[1]))
     x = tf.reshape(x, [-1, ndim, ndim, 1])
 
+  training = (mode == tf.estimator.ModeKeys.TRAIN)
+
   # First convolution expands to 64 channels
   with tf.variable_scope('conv_layer1'):
     net = tf.layers.conv2d(
@@ -60,7 +62,7 @@ def res_net_model(features, labels, mode):
         filters=64,
         kernel_size=7,
         activation=tf.nn.relu)
-    net = tf.layers.batch_normalization(net)
+    net = tf.layers.batch_normalization(net, training=training)
 
   # Max pool
   net = tf.layers.max_pooling2d(
@@ -88,7 +90,7 @@ def res_net_model(features, labels, mode):
             kernel_size=1,
             padding='valid',
             activation=tf.nn.relu)
-        conv = tf.layers.batch_normalization(conv)
+        conv = tf.layers.batch_normalization(conv, training=training)
 
       with tf.variable_scope(name + '/conv_bottleneck'):
         conv = tf.layers.conv2d(
@@ -97,7 +99,7 @@ def res_net_model(features, labels, mode):
             kernel_size=3,
             padding='same',
             activation=tf.nn.relu)
-        conv = tf.layers.batch_normalization(conv)
+        conv = tf.layers.batch_normalization(conv, training=training)
 
       # 1x1 convolution responsible for restoring dimension
       with tf.variable_scope(name + '/conv_out'):
@@ -108,7 +110,7 @@ def res_net_model(features, labels, mode):
             kernel_size=1,
             padding='valid',
             activation=tf.nn.relu)
-        conv = tf.layers.batch_normalization(conv)
+        conv = tf.layers.batch_normalization(conv, training=training)
 
       # shortcut connections that turn the network into its counterpart
       # residual function (identity shortcut)
@@ -154,7 +156,7 @@ def res_net_model(features, labels, mode):
   loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
 
   # Create training op.
-  if mode == tf.estimator.ModeKeys.TRAIN:
+  if training:
     optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)
     train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
     return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
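Note on the resnet.py hunks above: `tf.layers.batch_normalization` defaults to `training=False`, so the old calls normalized with moving statistics even while training. The sketch below shows, under simplifying assumptions (one scalar feature, no learned scale or offset), how the two modes usually differ. `BatchNorm1D` is a hypothetical class written for this note and is not TensorFlow code.

```python
class BatchNorm1D:
    """Hypothetical single-feature batch norm, for illustration only."""

    def __init__(self, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Training: normalize with statistics of the current batch...
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ...and fold them into the moving averages used at inference.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Inference: normalize with the accumulated moving statistics.
            mean, var = self.moving_mean, self.moving_var
        scale = (var + self.epsilon) ** -0.5
        return [(x - mean) * scale for x in batch]
```

Calling the layer with `training=True` centers each batch on its own mean, while `training=False` reuses whatever statistics were accumulated earlier.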
diff --git a/tensorflow/examples/tutorials/word2vec/BUILD b/tensorflow/examples/tutorials/word2vec/BUILD
index 42d6355b4f06258a3c22d0ef324bb31880f2d9a3..bfcf4592690a1692db67090c9b6d4e1e4832c45f 100644
--- a/tensorflow/examples/tutorials/word2vec/BUILD
+++ b/tensorflow/examples/tutorials/word2vec/BUILD
@@ -13,6 +13,9 @@ py_binary(
         "word2vec_basic.py",
     ],
     srcs_version = "PY2AND3",
+    tags = [
+        "no-internal-py3",
+    ],
     deps = [
         "//tensorflow:tensorflow_py",
         "//third_party/py/numpy",
diff --git a/tensorflow/contrib/bayesflow/python/ops/optimizers.py b/tensorflow/experimental_api.py
similarity index 54%
rename from tensorflow/contrib/bayesflow/python/ops/optimizers.py
rename to tensorflow/experimental_api.py
index fb70628d1083836281e9327e83e109493276c64f..63a8aa9cb1dc130a7999c3b248815633998c4cd0 100644
--- a/tensorflow/contrib/bayesflow/python/ops/optimizers.py
+++ b/tensorflow/experimental_api.py
@@ -1,4 +1,4 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,25 +12,27 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Probabilistic optimizer modules.
 
-See ${python/contrib.bayesflow.optimizers}.
-"""
+# Bring the entire public TensorFlow interface into this
+# module.
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-# go/tf-wildcard-import
+# pylint: disable=g-bad-import-order
+from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
 # pylint: disable=wildcard-import
-from tensorflow.contrib.bayesflow.python.ops.sgld_optimizer import *
-from tensorflow.contrib.bayesflow.python.ops.variational_sgd_optimizer import *
+from tensorflow.tools.api.generator.api import *  # pylint: disable=redefined-builtin
 # pylint: enable=wildcard-import
-from tensorflow.python.util.all_util import remove_undocumented
 
-_allowed_symbols = [
-    'SGLDOptimizer',
-    'VariationalSGDOptimizer',
-]
+from tensorflow.python.util.lazy_loader import LazyLoader
+contrib = LazyLoader('contrib', globals(), 'tensorflow.contrib')
+del LazyLoader
 
-remove_undocumented(__name__, _allowed_symbols)
+from tensorflow.python.platform import flags  # pylint: disable=g-import-not-at-top
+app.flags = flags  # pylint: disable=undefined-variable
+
+del absolute_import
+del division
+del print_function
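Note on experimental_api.py above: `contrib` is bound to a `LazyLoader` so that `tensorflow.contrib` is only imported when it is first used. The sketch below approximates that pattern with the standard library alone. `LazyModule` and the `namespace` dict are hypothetical names, and the real `tensorflow.python.util.lazy_loader.LazyLoader` may differ in detail.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Hypothetical placeholder that imports `name` on first attribute use."""

    def __init__(self, local_name, parent_globals, name):
        super().__init__(name)
        self._local_name = local_name
        self._parent_globals = parent_globals

    def _load(self):
        module = importlib.import_module(self.__name__)
        # Replace the placeholder so later lookups hit the real module.
        self._parent_globals[self._local_name] = module
        # Copy attributes so existing references to the placeholder work.
        self.__dict__.update(module.__dict__)
        return module

    def __getattr__(self, item):
        # Only reached when normal lookup fails, i.e. before the first load.
        return getattr(self._load(), item)


# Usage: `colorsys` stands in for an expensive package such as contrib.
namespace = {}
namespace['colorsys'] = LazyModule('colorsys', namespace, 'colorsys')
```

Nothing is imported until an attribute such as `namespace['colorsys'].rgb_to_hsv` is touched. After that first access the entry in `namespace` is the real module.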
diff --git a/tensorflow/go/genop/internal/api_def_map.go b/tensorflow/go/genop/internal/api_def_map.go
index 07b689dbba23a3aa991983f3b373fa8445c673e1..8600452b476dee49292cbffe630026cf6077e22b 100644
--- a/tensorflow/go/genop/internal/api_def_map.go
+++ b/tensorflow/go/genop/internal/api_def_map.go
@@ -31,7 +31,7 @@ import (
 	"unsafe"
 
 	"github.com/golang/protobuf/proto"
-	pb "github.com/tensorflow/tensorflow/tensorflow/go/genop/internal/proto/tensorflow/core/framework"
+	pb "github.com/tensorflow/tensorflow/tensorflow/go/genop/internal/proto/tensorflow/core/framework_go_proto"
 )
 
 // Encapsulates a collection of API definitions.
diff --git a/tensorflow/go/genop/internal/genop.go b/tensorflow/go/genop/internal/genop.go
index 82f7510f2ed947e0a87e4d88cfce1ecaaa6362f8..fb8163121850cee36e1fcc652ca258b1fe2d42ff 100644
--- a/tensorflow/go/genop/internal/genop.go
+++ b/tensorflow/go/genop/internal/genop.go
@@ -47,7 +47,7 @@ import (
 	"unsafe"
 
 	"github.com/golang/protobuf/proto"
-	pb "github.com/tensorflow/tensorflow/tensorflow/go/genop/internal/proto/tensorflow/core/framework"
+	pb "github.com/tensorflow/tensorflow/tensorflow/go/genop/internal/proto/tensorflow/core/framework_go_proto"
 )
 
 // GenerateFunctionsForRegisteredOps writes a Go source code file to w
@@ -359,13 +359,13 @@ type attrWrapper struct {
 	api *pb.ApiDef_Attr
 }
 
-func (a *attrWrapper) Name() string             { return a.api.Name }
-func (a *attrWrapper) RenameTo() string         { return a.api.RenameTo }
-func (a *attrWrapper) Description() string      { return a.api.Description }
-func (a *attrWrapper) Type() string             { return a.op.Type }
-func (a *attrWrapper) IsListAttr() bool         { return isListAttr(a.op) }
-func (a *attrWrapper) HasMinimum() bool         { return a.op.HasMinimum }
-func (a *attrWrapper) Minimum() int64           { return a.op.Minimum }
+func (a *attrWrapper) Name() string              { return a.api.Name }
+func (a *attrWrapper) RenameTo() string          { return a.api.RenameTo }
+func (a *attrWrapper) Description() string       { return a.api.Description }
+func (a *attrWrapper) Type() string              { return a.op.Type }
+func (a *attrWrapper) IsListAttr() bool          { return isListAttr(a.op) }
+func (a *attrWrapper) HasMinimum() bool          { return a.op.HasMinimum }
+func (a *attrWrapper) Minimum() int64            { return a.op.Minimum }
 func (a *attrWrapper) DefaultValue() interface{} { return a.api.DefaultValue }
 
 type argWrapper struct {
diff --git a/tensorflow/go/genop/internal/genop_test.go b/tensorflow/go/genop/internal/genop_test.go
index b3a23dff102a690b1f7f08b675219929355f139f..d20d22e0c1502f92ade7ef5aa40985dce73b7552 100644
--- a/tensorflow/go/genop/internal/genop_test.go
+++ b/tensorflow/go/genop/internal/genop_test.go
@@ -22,7 +22,7 @@ import (
 	"testing"
 
 	"github.com/golang/protobuf/proto"
-	pb "github.com/tensorflow/tensorflow/tensorflow/go/genop/internal/proto/tensorflow/core/framework"
+	pb "github.com/tensorflow/tensorflow/tensorflow/go/genop/internal/proto/tensorflow/core/framework_go_proto"
 )
 
 // Creates an ApiDef based on opdef and applies overrides
diff --git a/tensorflow/go/op/wrappers.go b/tensorflow/go/op/wrappers.go
index d9e684a661f2690c9352baec0649fbf42fc79255..5ddd32ed484994704017fe7d7213c8e8a94291ac 100644
--- a/tensorflow/go/op/wrappers.go
+++ b/tensorflow/go/op/wrappers.go
@@ -38,188 +38,6 @@ func makeOutputList(op *tf.Operation, start int, output string) ([]tf.Output, in
 	return list, start + size, nil
 }
 
-// WriteImageSummaryAttr is an optional argument to WriteImageSummary.
-type WriteImageSummaryAttr func(optionalAttr)
-
-// WriteImageSummaryMaxImages sets the optional max_images attribute to value.
-//
-// value: Max number of batch elements to generate images for.
-// If not specified, defaults to 3
-//
-// REQUIRES: value >= 1
-func WriteImageSummaryMaxImages(value int64) WriteImageSummaryAttr {
-	return func(m optionalAttr) {
-		m["max_images"] = value
-	}
-}
-
-// Writes a `Summary` protocol buffer with images.
-//
-// The summary has up to `max_images` summary values containing images. The
-// images are built from `tensor` which must be 4-D with shape `[batch_size,
-// height, width, channels]` and where `channels` can be:
-//
-// *  1: `tensor` is interpreted as Grayscale.
-// *  3: `tensor` is interpreted as RGB.
-// *  4: `tensor` is interpreted as RGBA.
-//
-// The images have the same number of channels as the input tensor. For float
-// input, the values are normalized one image at a time to fit in the range
-// `[0, 255]`.  `uint8` values are unchanged.  The op uses two different
-// normalization algorithms:
-//
-// *  If the input values are all positive, they are rescaled so the largest one
-//    is 255.
-//
-// *  If any input value is negative, the values are shifted so input value 0.0
-//    is at 127.  They are then rescaled so that either the smallest value is 0,
-//    or the largest one is 255.
-//
-// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
-// build the `tag` of the summary values:
-//
-// *  If `max_images` is 1, the summary value tag is '*tag*/image'.
-// *  If `max_images` is greater than 1, the summary value tags are
-//    generated sequentially as '*tag*/image/0', '*tag*/image/1', etc.
-//
-// The `bad_color` argument is the color to use in the generated images for
-// non-finite input values.  It is a `unit8` 1-D tensor of length `channels`.
-// Each element must be in the range `[0, 255]` (It represents the value of a
-// pixel in the output image).  Non-finite values in the input tensor are
-// replaced by this tensor in the output image.  The default value is the color
-// red.
-//
-// Arguments:
-//	writer: A handle to a summary writer.
-//	step: The step to write the summary for.
-//	tag: Scalar. Used to build the `tag` attribute of the summary values.
-//	tensor: 4-D of shape `[batch_size, height, width, channels]` where
-// `channels` is 1, 3, or 4.
-//	bad_color: Color to use for pixels with non-finite values.
-//
-// Returns the created operation.
-func WriteImageSummary(scope *Scope, writer tf.Output, step tf.Output, tag tf.Output, tensor tf.Output, bad_color tf.Output, optional ...WriteImageSummaryAttr) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "WriteImageSummary",
-		Input: []tf.Input{
-			writer, step, tag, tensor, bad_color,
-		},
-		Attrs: attrs,
-	}
-	return scope.AddOperation(opspec)
-}
-
-// Outputs a `tf.Event` protocol buffer.
-//
-// When CreateSummaryDbWriter is being used, this op can be useful for
-// importing data from event logs.
-//
-// Arguments:
-//	writer: A handle to a summary writer.
-//	event: A string containing a binary-encoded tf.Event proto.
-//
-// Returns the created operation.
-func ImportEvent(scope *Scope, writer tf.Output, event tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "ImportEvent",
-		Input: []tf.Input{
-			writer, event,
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
-// Outputs a `Summary` protocol buffer with a tensor.
-//
-// Arguments:
-//	writer: A handle to a summary writer.
-//	step: The step to write the summary for.
-//	tensor: A tensor to serialize.
-//	tag: The summary's tag.
-//	summary_metadata: Serialized SummaryMetadata protocol buffer containing
-// plugin-related metadata for this summary.
-//
-// Returns the created operation.
-func WriteSummary(scope *Scope, writer tf.Output, step tf.Output, tensor tf.Output, tag tf.Output, summary_metadata tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "WriteSummary",
-		Input: []tf.Input{
-			writer, step, tensor, tag, summary_metadata,
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
-// Creates summary database writer accessible by given resource handle.
-//
-// This can be used to write tensors from the execution graph directly
-// to a database. Only SQLite is supported right now. This function
-// will create the schema if it doesn't exist. Entries in the Users,
-// Experiments, and Runs tables will be created automatically if they
-// don't already exist.
-//
-// Arguments:
-//	writer: Handle to SummaryWriter resource to overwrite.
-//	db_uri: For example "file:/tmp/foo.sqlite".
-//	experiment_name: Can't contain ASCII control characters or <>. Case
-// sensitive. If empty, then the Run will not be associated with any
-// Experiment.
-//	run_name: Can't contain ASCII control characters or <>. Case sensitive.
-// If empty, then each Tag will not be associated with any Run.
-//	user_name: Must be valid as both a DNS label and Linux username. If
-// empty, then the Experiment will not be associated with any User.
-//
-// Returns the created operation.
-func CreateSummaryDbWriter(scope *Scope, writer tf.Output, db_uri tf.Output, experiment_name tf.Output, run_name tf.Output, user_name tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "CreateSummaryDbWriter",
-		Input: []tf.Input{
-			writer, db_uri, experiment_name, run_name, user_name,
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
-// Creates a summary file writer accessible by the given resource handle.
-//
-// Arguments:
-//	writer: A handle to the summary writer resource
-//	logdir: Directory where the event file will be written.
-//	max_queue: Size of the queue of pending events and summaries.
-//	flush_millis: How often, in milliseconds, to flush the pending events and
-// summaries to disk.
-//	filename_suffix: Every event file's name is suffixed with this suffix.
-//
-// Returns the created operation.
-func CreateSummaryFileWriter(scope *Scope, writer tf.Output, logdir tf.Output, max_queue tf.Output, flush_millis tf.Output, filename_suffix tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "CreateSummaryFileWriter",
-		Input: []tf.Input{
-			writer, logdir, max_queue, flush_millis, filename_suffix,
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
 // FakeQuantWithMinMaxVarsPerChannelGradientAttr is an optional argument to FakeQuantWithMinMaxVarsPerChannelGradient.
 type FakeQuantWithMinMaxVarsPerChannelGradientAttr func(optionalAttr)
 
@@ -384,105 +202,113 @@ func FakeQuantWithMinMaxVarsGradient(scope *Scope, gradients tf.Output, inputs t
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// MutableHashTableOfTensorsV2Attr is an optional argument to MutableHashTableOfTensorsV2.
-type MutableHashTableOfTensorsV2Attr func(optionalAttr)
+// FakeQuantWithMinMaxArgsGradientAttr is an optional argument to FakeQuantWithMinMaxArgsGradient.
+type FakeQuantWithMinMaxArgsGradientAttr func(optionalAttr)
 
-// MutableHashTableOfTensorsV2Container sets the optional container attribute to value.
-//
-// value: If non-empty, this table is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func MutableHashTableOfTensorsV2Container(value string) MutableHashTableOfTensorsV2Attr {
+// FakeQuantWithMinMaxArgsGradientMin sets the optional min attribute to value.
+// If not specified, defaults to -6
+func FakeQuantWithMinMaxArgsGradientMin(value float32) FakeQuantWithMinMaxArgsGradientAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["min"] = value
 	}
 }
 
-// MutableHashTableOfTensorsV2SharedName sets the optional shared_name attribute to value.
-//
-// value: If non-empty, this table is shared under the given name across
-// multiple sessions.
-// If not specified, defaults to ""
-func MutableHashTableOfTensorsV2SharedName(value string) MutableHashTableOfTensorsV2Attr {
+// FakeQuantWithMinMaxArgsGradientMax sets the optional max attribute to value.
+// If not specified, defaults to 6
+func FakeQuantWithMinMaxArgsGradientMax(value float32) FakeQuantWithMinMaxArgsGradientAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["max"] = value
 	}
 }
 
-// MutableHashTableOfTensorsV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
-// If not specified, defaults to false
-func MutableHashTableOfTensorsV2UseNodeNameSharing(value bool) MutableHashTableOfTensorsV2Attr {
+// FakeQuantWithMinMaxArgsGradientNumBits sets the optional num_bits attribute to value.
+// If not specified, defaults to 8
+func FakeQuantWithMinMaxArgsGradientNumBits(value int64) FakeQuantWithMinMaxArgsGradientAttr {
 	return func(m optionalAttr) {
-		m["use_node_name_sharing"] = value
+		m["num_bits"] = value
 	}
 }
 
-// MutableHashTableOfTensorsV2ValueShape sets the optional value_shape attribute to value.
-// If not specified, defaults to <>
-func MutableHashTableOfTensorsV2ValueShape(value tf.Shape) MutableHashTableOfTensorsV2Attr {
+// FakeQuantWithMinMaxArgsGradientNarrowRange sets the optional narrow_range attribute to value.
+// If not specified, defaults to false
+func FakeQuantWithMinMaxArgsGradientNarrowRange(value bool) FakeQuantWithMinMaxArgsGradientAttr {
 	return func(m optionalAttr) {
-		m["value_shape"] = value
+		m["narrow_range"] = value
 	}
 }
 
-// Creates an empty hash table.
-//
-// This op creates a mutable hash table, specifying the type of its keys and
-// values. Each value must be a vector. Data can be inserted into the table using
-// the insert operations. It does not support the initialization operation.
+// Compute gradients for a FakeQuantWithMinMaxArgs operation.
 //
 // Arguments:
-//	key_dtype: Type of the table keys.
-//	value_dtype: Type of the table values.
+//	gradients: Backpropagated gradients above the FakeQuantWithMinMaxArgs operation.
+//	inputs: Values passed as inputs to the FakeQuantWithMinMaxArgs operation.
 //
-// Returns Handle to a table.
-func MutableHashTableOfTensorsV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...MutableHashTableOfTensorsV2Attr) (table_handle tf.Output) {
+// Returns Backpropagated gradients below the FakeQuantWithMinMaxArgs operation:
+// `gradients * (inputs >= min && inputs <= max)`.
+func FakeQuantWithMinMaxArgsGradient(scope *Scope, gradients tf.Output, inputs tf.Output, optional ...FakeQuantWithMinMaxArgsGradientAttr) (backprops tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MutableHashTableOfTensorsV2",
-
+		Type: "FakeQuantWithMinMaxArgsGradient",
+		Input: []tf.Input{
+			gradients, inputs,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceApplyProximalAdagradAttr is an optional argument to ResourceApplyProximalAdagrad.
-type ResourceApplyProximalAdagradAttr func(optionalAttr)
+// FakeQuantWithMinMaxArgsAttr is an optional argument to FakeQuantWithMinMaxArgs.
+type FakeQuantWithMinMaxArgsAttr func(optionalAttr)
 
-// ResourceApplyProximalAdagradUseLocking sets the optional use_locking attribute to value.
-//
-// value: If True, updating of the var and accum tensors will be protected by
-// a lock; otherwise the behavior is undefined, but may exhibit less contention.
+// FakeQuantWithMinMaxArgsMin sets the optional min attribute to value.
+// If not specified, defaults to -6
+func FakeQuantWithMinMaxArgsMin(value float32) FakeQuantWithMinMaxArgsAttr {
+	return func(m optionalAttr) {
+		m["min"] = value
+	}
+}
+
+// FakeQuantWithMinMaxArgsMax sets the optional max attribute to value.
+// If not specified, defaults to 6
+func FakeQuantWithMinMaxArgsMax(value float32) FakeQuantWithMinMaxArgsAttr {
+	return func(m optionalAttr) {
+		m["max"] = value
+	}
+}
+
+// FakeQuantWithMinMaxArgsNumBits sets the optional num_bits attribute to value.
+// If not specified, defaults to 8
+func FakeQuantWithMinMaxArgsNumBits(value int64) FakeQuantWithMinMaxArgsAttr {
+	return func(m optionalAttr) {
+		m["num_bits"] = value
+	}
+}
+
+// FakeQuantWithMinMaxArgsNarrowRange sets the optional narrow_range attribute to value.
 // If not specified, defaults to false
-func ResourceApplyProximalAdagradUseLocking(value bool) ResourceApplyProximalAdagradAttr {
+func FakeQuantWithMinMaxArgsNarrowRange(value bool) FakeQuantWithMinMaxArgsAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["narrow_range"] = value
 	}
 }
 
-// Update '*var' and '*accum' according to FOBOS with Adagrad learning rate.
-//
-// accum += grad * grad
-// prox_v = var - lr * grad * (1 / sqrt(accum))
-// var = sign(prox_v)/(1+lr*l2) * max{|prox_v|-lr*l1,0}
+// Fake-quantize the 'inputs' tensor of type float to the 'outputs' tensor of the same type.
 //
-// Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	l1: L1 regularization. Must be a scalar.
-//	l2: L2 regularization. Must be a scalar.
-//	grad: The gradient.
+// Attributes `[min; max]` define the clamping range for the `inputs` data.
+// `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
+// when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
+// then de-quantized and output as floats in `[min; max]` interval.
+// `num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
 //
-// Returns the created operation.
-func ResourceApplyProximalAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, grad tf.Output, optional ...ResourceApplyProximalAdagradAttr) (o *tf.Operation) {
+// Quantization is called fake since the output is still in floating point.
+func FakeQuantWithMinMaxArgs(scope *Scope, inputs tf.Output, optional ...FakeQuantWithMinMaxArgsAttr) (outputs tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -491,570 +317,705 @@ func ResourceApplyProximalAdagrad(scope *Scope, var_ tf.Output, accum tf.Output,
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyProximalAdagrad",
+		Type: "FakeQuantWithMinMaxArgs",
 		Input: []tf.Input{
-			var_, accum, lr, l1, l2, grad,
+			inputs,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MutableHashTableV2Attr is an optional argument to MutableHashTableV2.
-type MutableHashTableV2Attr func(optionalAttr)
-
-// MutableHashTableV2Container sets the optional container attribute to value.
+// Scatter `updates` into a new (initially zero) tensor according to `indices`.
 //
-// value: If non-empty, this table is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func MutableHashTableV2Container(value string) MutableHashTableV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// MutableHashTableV2SharedName sets the optional shared_name attribute to value.
+// Creates a new tensor by applying sparse `updates` to individual
+// values or slices within a zero tensor of the given `shape` according to
+// indices.  This operator is the inverse of the @{tf.gather_nd} operator which
+// extracts values or slices from a given tensor.
 //
-// value: If non-empty, this table is shared under the given name across
-// multiple sessions.
-// If not specified, defaults to ""
-func MutableHashTableV2SharedName(value string) MutableHashTableV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// MutableHashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
+// **WARNING**: The order in which updates are applied is nondeterministic, so the
+// output will be nondeterministic if `indices` contains duplicates.
 //
-// value: If true and shared_name is empty, the table is shared
-// using the node name.
-// If not specified, defaults to false
-func MutableHashTableV2UseNodeNameSharing(value bool) MutableHashTableV2Attr {
-	return func(m optionalAttr) {
-		m["use_node_name_sharing"] = value
-	}
-}
-
-// Creates an empty hash table.
+// `indices` is an integer tensor containing indices into a new tensor of shape
+// `shape`.  The last dimension of `indices` can be at most the rank of `shape`:
 //
-// This op creates a mutable hash table, specifying the type of its keys and
-// values. Each value must be a scalar. Data can be inserted into the table using
-// the insert operations. It does not support the initialization operation.
+//     indices.shape[-1] <= shape.rank
+//
+// The last dimension of `indices` corresponds to indices into elements
+// (if `indices.shape[-1] = shape.rank`) or slices
+// (if `indices.shape[-1] < shape.rank`) along dimension `indices.shape[-1]` of
+// `shape`.  `updates` is a tensor with shape
+//
+//     indices.shape[:-1] + shape[indices.shape[-1]:]
+//
+// The simplest form of scatter is to insert individual elements in a tensor by
+// index. For example, say we want to insert 4 scattered elements in a rank-1
+// tensor with 8 elements.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/ScatterNd1.png" alt>
+// </div>
+//
+// In Python, this scatter operation would look like this:
+//
+// ```python
+//     indices = tf.constant([[4], [3], [1], [7]])
+//     updates = tf.constant([9, 10, 11, 12])
+//     shape = tf.constant([8])
+//     scatter = tf.scatter_nd(indices, updates, shape)
+//     with tf.Session() as sess:
+//       print(sess.run(scatter))
+// ```
+//
+// The resulting tensor would look like this:
+//
+//     [0, 11, 0, 10, 9, 0, 0, 12]
+//
+// We can also insert entire slices of a higher rank tensor all at once. For
+// example, we could insert two slices in the first dimension of a rank-3
+// tensor with two matrices of new values.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/ScatterNd2.png" alt>
+// </div>
+//
+// In Python, this scatter operation would look like this:
+//
+// ```python
+//     indices = tf.constant([[0], [2]])
+//     updates = tf.constant([[[5, 5, 5, 5], [6, 6, 6, 6],
+//                             [7, 7, 7, 7], [8, 8, 8, 8]],
+//                            [[5, 5, 5, 5], [6, 6, 6, 6],
+//                             [7, 7, 7, 7], [8, 8, 8, 8]]])
+//     shape = tf.constant([4, 4, 4])
+//     scatter = tf.scatter_nd(indices, updates, shape)
+//     with tf.Session() as sess:
+//       print(sess.run(scatter))
+// ```
+//
+// The resulting tensor would look like this:
+//
+//     [[[5, 5, 5, 5], [6, 6, 6, 6], [7, 7, 7, 7], [8, 8, 8, 8]],
+//      [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
+//      [[5, 5, 5, 5], [6, 6, 6, 6], [7, 7, 7, 7], [8, 8, 8, 8]],
+//      [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]]
 //
 // Arguments:
-//	key_dtype: Type of the table keys.
-//	value_dtype: Type of the table values.
+//	indices: Index tensor.
+//	updates: Updates to scatter into output.
+//	shape: 1-D. The shape of the resulting tensor.
 //
-// Returns Handle to a table.
-func MutableHashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...MutableHashTableV2Attr) (table_handle tf.Output) {
+// Returns A new tensor with the given shape and updates applied according
+// to the indices.
+func ScatterNd(scope *Scope, indices tf.Output, updates tf.Output, shape tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "MutableHashTableV2",
-
-		Attrs: attrs,
+		Type: "ScatterNd",
+		Input: []tf.Input{
+			indices, updates, shape,
+		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MapUnstageNoKeyAttr is an optional argument to MapUnstageNoKey.
-type MapUnstageNoKeyAttr func(optionalAttr)
+// QuantizeAndDequantizeV2Attr is an optional argument to QuantizeAndDequantizeV2.
+type QuantizeAndDequantizeV2Attr func(optionalAttr)
 
-// MapUnstageNoKeyCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// QuantizeAndDequantizeV2SignedInput sets the optional signed_input attribute to value.
 //
-// REQUIRES: value >= 0
-func MapUnstageNoKeyCapacity(value int64) MapUnstageNoKeyAttr {
+// value: If the quantization is signed or unsigned.
+// If not specified, defaults to true
+func QuantizeAndDequantizeV2SignedInput(value bool) QuantizeAndDequantizeV2Attr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["signed_input"] = value
 	}
 }
 
-// MapUnstageNoKeyMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// QuantizeAndDequantizeV2NumBits sets the optional num_bits attribute to value.
 //
-// REQUIRES: value >= 0
-func MapUnstageNoKeyMemoryLimit(value int64) MapUnstageNoKeyAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// MapUnstageNoKeyContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func MapUnstageNoKeyContainer(value string) MapUnstageNoKeyAttr {
+// value: The bitwidth of the quantization.
+// If not specified, defaults to 8
+func QuantizeAndDequantizeV2NumBits(value int64) QuantizeAndDequantizeV2Attr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["num_bits"] = value
 	}
 }
 
-// MapUnstageNoKeySharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func MapUnstageNoKeySharedName(value string) MapUnstageNoKeyAttr {
+// QuantizeAndDequantizeV2RangeGiven sets the optional range_given attribute to value.
+//
+// value: If the range is given or should be computed from the tensor.
+// If not specified, defaults to false
+func QuantizeAndDequantizeV2RangeGiven(value bool) QuantizeAndDequantizeV2Attr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["range_given"] = value
 	}
 }
 
-// Op removes and returns a random (key, value)
+// Quantizes then dequantizes a tensor.
 //
-// from the underlying container.   If the underlying container
-// does not contain elements, the op will block until it does.
-func MapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataType, optional ...MapUnstageNoKeyAttr) (key tf.Output, values []tf.Output) {
+// This op simulates the precision loss from the quantized forward pass by:
+// 1. Quantizing the tensor to fixed point numbers, which should match the target
+//    quantization method when it is used in inference.
+// 2. Dequantizing it back to floating point numbers for the following ops, most
+//    likely matmul.
+//
+// There are different ways to quantize. This version does not use the full range
+// of the output type, choosing to elide the lowest possible value for symmetry
+// (e.g., output range is -127 to 127, not -128 to 127 for signed 8 bit
+// quantization), so that 0.0 maps to 0.
+//
+// To perform this op, we first find the range of values in our tensor. The range
+// we use is always centered on 0, so we find m such that
+//
+// 1. m = max(abs(input_min), abs(input_max)) if range_given is true,
+// 2. m = max(abs(min_elem(input)), abs(max_elem(input))) otherwise.
+//
+// Our input tensor range is then [-m, m].
+//
+// Next, we choose our fixed-point quantization buckets, [min_fixed, max_fixed].
+// If signed_input is true, this is
+//
+//   [min_fixed, max_fixed] =
+//       [-((1 << (num_bits - 1)) - 1), (1 << (num_bits - 1)) - 1].
+//
+// Otherwise, if signed_input is false, the fixed-point range is
+//
+//   [min_fixed, max_fixed] = [0, (1 << num_bits) - 1].
+//
+// From this we compute our scaling factor, s:
+//
+//   s = (max_fixed - min_fixed) / (2 * m).
+//
+// Now we can quantize and dequantize the elements of our tensor.  An element e
+// is transformed into e':
+//
+//   e' = (e * s).round_to_nearest() / s.
+//
+// Note that we have a different number of buckets in the signed vs. unsigned
+// cases.  For example, if num_bits == 8, we get 254 buckets in the signed case
+// vs. 255 in the unsigned case.
+//
+// For example, suppose num_bits = 8, signed_input is true, and m = 1.  Then
+//
+//   [min_fixed, max_fixed] = [-127, 127], and
+//   s = (127 + 127) / 2 = 127.
+//
+// Given the vector {-1, -0.5, 0, 0.3}, this is quantized to
+// {-127, -63, 0, 38}, and dequantized to {-1, -63.0/127, 0, 38.0/127}.
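+//
+// The arithmetic above can be sketched in plain Python. This is an
+// illustration under stated assumptions, not the op's implementation: the
+// helper name `quantize_dequantize` is hypothetical, m is assumed to be
+// non-zero, and Python's built-in `round` stands in for round_to_nearest, so
+// the sample input deliberately avoids values that land exactly on a tie.
+//
+// ```python
+// def quantize_dequantize(values, num_bits=8, signed_input=True,
+//                         input_min=None, input_max=None):
+//     # Without a given range, derive it from the data (range_given = false).
+//     if input_min is None or input_max is None:
+//         input_min, input_max = min(values), max(values)
+//     m = max(abs(input_min), abs(input_max))
+//     if signed_input:
+//         max_fixed = (1 << (num_bits - 1)) - 1
+//         min_fixed = -max_fixed
+//     else:
+//         min_fixed, max_fixed = 0, (1 << num_bits) - 1
+//     s = (max_fixed - min_fixed) / (2 * m)
+//     quantized = [round(e * s) for e in values]
+//     dequantized = [q / s for q in quantized]
+//     return quantized, dequantized
+//
+// quantized, dequantized = quantize_dequantize([-1.0, 0.0, 0.3])
+// print(quantized)    # [-127, 0, 38]
+// print(dequantized)
+// ```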
+//
+// Arguments:
+//	input: Tensor to quantize and then dequantize.
+//	input_min: If range_given, this is the min of the range, otherwise this input
+// will be ignored.
+//	input_max: If range_given, this is the max of the range, otherwise this input
+// will be ignored.
+func QuantizeAndDequantizeV2(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, optional ...QuantizeAndDequantizeV2Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapUnstageNoKey",
+		Type: "QuantizeAndDequantizeV2",
 		Input: []tf.Input{
-			indices,
+			input, input_min, input_max,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	key = op.Output(idx)
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("MapUnstageNoKey", err)
-		return
-	}
-	return key, values
+	return op.Output(0)
 }
 
-// HashTableV2Attr is an optional argument to HashTableV2.
-type HashTableV2Attr func(optionalAttr)
-
-// HashTableV2Container sets the optional container attribute to value.
+// Bitcasts a tensor from one type to another without copying data.
 //
-// value: If non-empty, this table is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func HashTableV2Container(value string) HashTableV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// HashTableV2SharedName sets the optional shared_name attribute to value.
-//
-// value: If non-empty, this table is shared under the given name across
-// multiple sessions.
-// If not specified, defaults to ""
-func HashTableV2SharedName(value string) HashTableV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// HashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
-//
-// value: If true and shared_name is empty, the table is shared
-// using the node name.
-// If not specified, defaults to false
-func HashTableV2UseNodeNameSharing(value bool) HashTableV2Attr {
-	return func(m optionalAttr) {
-		m["use_node_name_sharing"] = value
-	}
-}
-
-// Creates a non-initialized hash table.
+// Given a tensor `input`, this operation returns a tensor that has the same buffer
+// data as `input` with datatype `type`.
 //
-// This op creates a hash table, specifying the type of its keys and values.
-// Before using the table you will have to initialize it.  After initialization the
-// table will be immutable.
+// If the input datatype `T` is larger than the output datatype `type` then the
+// shape changes from [...] to [..., sizeof(`T`)/sizeof(`type`)].
 //
-// Arguments:
-//	key_dtype: Type of the table keys.
-//	value_dtype: Type of the table values.
+// If `T` is smaller than `type`, the operator requires that the rightmost
+// dimension be equal to sizeof(`type`)/sizeof(`T`). The shape then goes from
+// [..., sizeof(`type`)/sizeof(`T`)] to [...].
 //
-// Returns Handle to a table.
-func HashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...HashTableV2Attr) (table_handle tf.Output) {
+// *NOTE*: Bitcast is implemented as a low-level cast, so machines with different
+// endian orderings will give different results.
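+//
+// The shape rule for a larger-to-smaller cast can be illustrated with a small
+// plain-Python sketch using the standard `struct` module (not TensorFlow); the
+// helper name `bitcast_float32_to_uint8` is hypothetical. Each float32 becomes
+// 4 bytes, so a shape of [2] turns into [2, 4], and the byte order within each
+// row depends on the host machine:
+//
+// ```python
+// import struct
+//
+// def bitcast_float32_to_uint8(values):
+//     # "=f" packs one float32 in native byte order without padding.
+//     return [list(struct.pack("=f", v)) for v in values]
+//
+// rows = bitcast_float32_to_uint8([1.0, -2.5])
+// print(len(rows), len(rows[0]))  # 2 4
+// ```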
+func Bitcast(scope *Scope, input tf.Output, type_ tf.DataType) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"type": type_}
 	opspec := tf.OpSpec{
-		Type: "HashTableV2",
-
+		Type: "Bitcast",
+		Input: []tf.Input{
+			input,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Replaces the contents of the table with the specified keys and values.
-//
-// The tensor `keys` must be of the same type as the keys of the table.
-// The tensor `values` must be of the type of the table values.
+// Extract `patches` from `images` and put them in the "depth" output dimension.
 //
 // Arguments:
-//	table_handle: Handle to the table.
-//	keys: Any shape.  Keys to look up.
-//	values: Values to associate with keys.
+//	images: 4-D Tensor with shape `[batch, in_rows, in_cols, depth]`.
+//	ksizes: The size of the sliding window for each dimension of `images`.
+//	strides: 1-D of length 4. The distance between the centers of two consecutive
+// patches in the images. Must be: `[1, stride_rows, stride_cols, 1]`.
+//	rates: 1-D of length 4. Must be: `[1, rate_rows, rate_cols, 1]`. This is the
+// input stride, specifying how far two consecutive patch samples are in the
+// input. Equivalent to extracting patches with
+// `patch_sizes_eff = patch_sizes + (patch_sizes - 1) * (rates - 1)`, followed by
+// subsampling them spatially by a factor of `rates`. This is equivalent to
+// `rate` in dilated (a.k.a. Atrous) convolutions.
+//	padding: The type of padding algorithm to use.
 //
-// Returns the created operation.
-func LookupTableImportV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
+// We specify the size-related attributes as:
+//
+// ```python
+//       ksizes = [1, ksize_rows, ksize_cols, 1]
+//       strides = [1, strides_rows, strides_cols, 1]
+//       rates = [1, rates_rows, rates_cols, 1]
+// ```
+//
+// Returns 4-D Tensor with shape `[batch, out_rows, out_cols, ksize_rows *
+// ksize_cols * depth]` containing image patches with size
+// `ksize_rows x ksize_cols x depth` vectorized in the "depth" dimension. Note
+// `out_rows` and `out_cols` are the dimensions of the output patches.
+func ExtractImagePatches(scope *Scope, images tf.Output, ksizes []int64, strides []int64, rates []int64, padding string) (patches tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"ksizes": ksizes, "strides": strides, "rates": rates, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "LookupTableImportV2",
+		Type: "ExtractImagePatches",
 		Input: []tf.Input{
-			table_handle, keys, values,
+			images,
 		},
+		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MapPeekAttr is an optional argument to MapPeek.
-type MapPeekAttr func(optionalAttr)
+// SpaceToDepthAttr is an optional argument to SpaceToDepth.
+type SpaceToDepthAttr func(optionalAttr)
 
-// MapPeekCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func MapPeekCapacity(value int64) MapPeekAttr {
+// SpaceToDepthDataFormat sets the optional data_format attribute to value.
+// If not specified, defaults to "NHWC"
+func SpaceToDepthDataFormat(value string) SpaceToDepthAttr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["data_format"] = value
 	}
 }
 
-// MapPeekMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// SpaceToDepth for tensors of type T.
 //
-// REQUIRES: value >= 0
-func MapPeekMemoryLimit(value int64) MapPeekAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// MapPeekContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func MapPeekContainer(value string) MapPeekAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// MapPeekSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func MapPeekSharedName(value string) MapPeekAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Op peeks at the values at the specified key.  If the
+// Rearranges blocks of spatial data into depth. More specifically,
+// this op outputs a copy of the input tensor where values from the `height`
+// and `width` dimensions are moved to the `depth` dimension.
+// The attr `block_size` indicates the input block size.
 //
-// underlying container does not contain this key
-// this op will block until it does.
-func MapPeek(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...MapPeekAttr) (values []tf.Output) {
+//   * Non-overlapping blocks of size `block_size x block_size` are rearranged
+//     into depth at each location.
+//   * The depth of the output tensor is `block_size * block_size * input_depth`.
+//   * The Y, X coordinates within each block of the input become the high order
+//     component of the output channel index.
+//   * The input tensor's height and width must be divisible by block_size.
+//
+// The `data_format` attr specifies the layout of the input and output tensors
+// with the following options:
+//   "NHWC": `[ batch, height, width, channels ]`
+//   "NCHW": `[ batch, channels, height, width ]`
+//   "NCHW_VECT_C":
+//       `qint8 [ batch, channels / 4, height, width, 4 ]`
+//
+// It is useful to consider the operation as transforming a 6-D Tensor.
+// e.g. for data_format = NHWC,
+//      Each element in the input tensor can be specified via 6 coordinates,
+//      ordered by decreasing memory layout significance as:
+//      n,oY,bY,oX,bX,iC  (where n=batch index, oX, oY means X or Y coordinates
+//                         within the output image, bX, bY means coordinates
+//                         within the input block, iC means input channels).
+//      The output would be a transpose to the following layout:
+//      n,oY,oX,bY,bX,iC
+//
+// This operation is useful for resizing the activations between convolutions
+// (but keeping all data), e.g. instead of pooling. It is also useful for training
+// purely convolutional models.
+//
+// For example, given an input of shape `[1, 2, 2, 1]`, data_format = "NHWC" and
+// block_size = 2:
+//
+// ```
+// x = [[[[1], [2]],
+//       [[3], [4]]]]
+// ```
+//
+// This operation will output a tensor of shape `[1, 1, 1, 4]`:
+//
+// ```
+// [[[[1, 2, 3, 4]]]]
+// ```
+//
+// Here, the input has a batch of 1 and each batch element has shape `[2, 2, 1]`.
+// The corresponding output has a single spatial element (i.e. width and height
+// are both 1) with a depth of 4 channels (1 * block_size * block_size), so the
+// output element shape is `[1, 1, 4]`.
+//
+// For an input tensor with larger depth, here of shape `[1, 2, 2, 3]`, e.g.
+//
+// ```
+// x = [[[[1, 2, 3], [4, 5, 6]],
+//       [[7, 8, 9], [10, 11, 12]]]]
+// ```
+//
+// This operation, for block_size of 2, will return the following tensor of shape
+// `[1, 1, 1, 12]`
+//
+// ```
+// [[[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]]]
+// ```
+//
+// Similarly, for the following input of shape `[1, 4, 4, 1]`, and a block size of 2:
+//
+// ```
+// x = [[[[1],   [2],  [5],  [6]],
+//       [[3],   [4],  [7],  [8]],
+//       [[9],  [10], [13],  [14]],
+//       [[11], [12], [15],  [16]]]]
+// ```
+//
+// the operator will return the following tensor of shape `[1, 2, 2, 4]`:
+//
+// ```
+// x = [[[[1, 2, 3, 4],
+//        [5, 6, 7, 8]],
+//       [[9, 10, 11, 12],
+//        [13, 14, 15, 16]]]]
+// ```
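+//
+// The rearrangement in the examples above can be reproduced with a short
+// plain-Python sketch for the "NHWC" layout. It is only an illustration of
+// the indexing on nested lists, not the TensorFlow kernel; the helper name
+// `space_to_depth` is hypothetical, and it assumes height and width are
+// already divisible by block_size.
+//
+// ```python
+// def space_to_depth(x, block_size):
+//     # x is indexed as x[batch][height][width][depth].
+//     out = []
+//     for image in x:
+//         height, width = len(image), len(image[0])
+//         rows = []
+//         for oy in range(height // block_size):
+//             row = []
+//             for ox in range(width // block_size):
+//                 pixel = []
+//                 # Walk the block in Y-major order and append its channels.
+//                 for by in range(block_size):
+//                     for bx in range(block_size):
+//                         pixel.extend(image[oy * block_size + by][ox * block_size + bx])
+//                 row.append(pixel)
+//             rows.append(row)
+//         out.append(rows)
+//     return out
+//
+// x = [[[[1], [2], [5], [6]],
+//       [[3], [4], [7], [8]],
+//       [[9], [10], [13], [14]],
+//       [[11], [12], [15], [16]]]]
+// y = space_to_depth(x, 2)
+// print(y)
+// ```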
+//
+// Arguments:
+//
+//	block_size: The size of the spatial block.
+func SpaceToDepth(scope *Scope, input tf.Output, block_size int64, optional ...SpaceToDepthAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{"block_size": block_size}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapPeek",
+		Type: "SpaceToDepth",
 		Input: []tf.Input{
-			key, indices,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("MapPeek", err)
-		return
-	}
-	return values
+	return op.Output(0)
 }
 
-// Returns (x - y)(x - y) element-wise.
+// SpaceToBatch for 4-D tensors of type T.
 //
-// *NOTE*: `SquaredDifference` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func SquaredDifference(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SquaredDifference",
-		Input: []tf.Input{
-			x, y,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Forwards the input to the output.
+// This is a legacy version of the more general SpaceToBatchND.
 //
-// This operator represents the loop termination condition used by the
-// "pivot" switches of a loop.
+// Zero-pads and then rearranges (permutes) blocks of spatial data into batch.
+// More specifically, this op outputs a copy of the input tensor where values from
+// the `height` and `width` dimensions are moved to the `batch` dimension. After
+// the zero-padding, both `height` and `width` of the input must be divisible by the
+// block size.
 //
 // Arguments:
-//	input: A boolean scalar, representing the branch predicate of the Switch op.
+//	input: 4-D with shape `[batch, height, width, depth]`.
+//	paddings: 2-D tensor of non-negative integers with shape `[2, 2]`. It specifies
+//   the padding of the input with zeros across the spatial dimensions as follows:
 //
-// Returns The same tensor as `input`.
-func LoopCond(scope *Scope, input tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "LoopCond",
-		Input: []tf.Input{
-			input,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// QuantizedMulAttr is an optional argument to QuantizedMul.
-type QuantizedMulAttr func(optionalAttr)
-
-// QuantizedMulToutput sets the optional Toutput attribute to value.
-// If not specified, defaults to DT_QINT32
-func QuantizedMulToutput(value tf.DataType) QuantizedMulAttr {
-	return func(m optionalAttr) {
-		m["Toutput"] = value
-	}
-}
-
-// Returns x * y element-wise, working on quantized buffers.
+//       paddings = [[pad_top, pad_bottom], [pad_left, pad_right]]
 //
-// Arguments:
+//   The effective spatial dimensions of the zero-padded input tensor will be:
 //
+//       height_pad = pad_top + height + pad_bottom
+//       width_pad = pad_left + width + pad_right
 //
-//	min_x: The float value that the lowest quantized `x` value represents.
-//	max_x: The float value that the highest quantized `x` value represents.
-//	min_y: The float value that the lowest quantized `y` value represents.
-//	max_y: The float value that the highest quantized `y` value represents.
+// The attr `block_size` must be greater than one. It indicates the block size.
 //
-// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
+//   * Non-overlapping blocks of size `block_size x block_size` in the height and
+//     width dimensions are rearranged into the batch dimension at each location.
+//   * The batch of the output tensor is `batch * block_size * block_size`.
+//   * Both height_pad and width_pad must be divisible by block_size.
 //
-// *NOTE*: `QuantizedMul` supports limited forms of broadcasting. More about
-// broadcasting [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func QuantizedMul(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x tf.Output, min_y tf.Output, max_y tf.Output, optional ...QuantizedMulAttr) (z tf.Output, min_z tf.Output, max_z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "QuantizedMul",
-		Input: []tf.Input{
-			x, y, min_x, max_x, min_y, max_y,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
-}
-
-// QuantizedMatMulAttr is an optional argument to QuantizedMatMul.
-type QuantizedMatMulAttr func(optionalAttr)
-
-// QuantizedMatMulToutput sets the optional Toutput attribute to value.
-// If not specified, defaults to DT_QINT32
-func QuantizedMatMulToutput(value tf.DataType) QuantizedMatMulAttr {
-	return func(m optionalAttr) {
-		m["Toutput"] = value
-	}
-}
-
-// QuantizedMatMulTransposeA sets the optional transpose_a attribute to value.
+// The shape of the output will be:
 //
-// value: If true, `a` is transposed before multiplication.
-// If not specified, defaults to false
-func QuantizedMatMulTransposeA(value bool) QuantizedMatMulAttr {
-	return func(m optionalAttr) {
-		m["transpose_a"] = value
-	}
-}
-
-// QuantizedMatMulTransposeB sets the optional transpose_b attribute to value.
+//     [batch*block_size*block_size, height_pad/block_size, width_pad/block_size,
+//      depth]
 //
-// value: If true, `b` is transposed before multiplication.
-// If not specified, defaults to false
-func QuantizedMatMulTransposeB(value bool) QuantizedMatMulAttr {
-	return func(m optionalAttr) {
-		m["transpose_b"] = value
-	}
-}
-
-// QuantizedMatMulTactivation sets the optional Tactivation attribute to value.
+// Some examples:
 //
-// value: The type of output produced by activation function
-// following this operation.
-// If not specified, defaults to DT_QUINT8
-func QuantizedMatMulTactivation(value tf.DataType) QuantizedMatMulAttr {
-	return func(m optionalAttr) {
-		m["Tactivation"] = value
-	}
-}
-
-// Perform a quantized matrix multiplication of  `a` by the matrix `b`.
+// (1) For the following input of shape `[1, 2, 2, 1]` and block_size of 2:
 //
-// The inputs must be two-dimensional matrices and the inner dimension of
-// `a` (after being transposed if `transpose_a` is non-zero) must match the
-// outer dimension of `b` (after being transposed if `transposed_b` is
-// non-zero).
+// ```
+// x = [[[[1], [2]], [[3], [4]]]]
+// ```
 //
-// Arguments:
-//	a: Must be a two-dimensional tensor.
-//	b: Must be a two-dimensional tensor.
-//	min_a: The float value that the lowest quantized `a` value represents.
-//	max_a: The float value that the highest quantized `a` value represents.
-//	min_b: The float value that the lowest quantized `b` value represents.
-//	max_b: The float value that the highest quantized `b` value represents.
+// The output tensor has shape `[4, 1, 1, 1]` and value:
 //
-// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
-func QuantizedMatMul(scope *Scope, a tf.Output, b tf.Output, min_a tf.Output, max_a tf.Output, min_b tf.Output, max_b tf.Output, optional ...QuantizedMatMulAttr) (out tf.Output, min_out tf.Output, max_out tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "QuantizedMatMul",
-		Input: []tf.Input{
-			a, b, min_a, max_a, min_b, max_b,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
-}
-
-// A placeholder op that passes through `input` when its output is not fed.
+// ```
+// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
+// ```
 //
-// Arguments:
-//	input: The default value to produce when `output` is not fed.
-//	shape: The (possibly partial) shape of the tensor.
+// (2) For the following input of shape `[1, 2, 2, 3]` and block_size of 2:
 //
-// Returns A placeholder tensor that defaults to `input` if it is not fed.
-func PlaceholderWithDefault(scope *Scope, input tf.Output, shape tf.Shape) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"shape": shape}
-	opspec := tf.OpSpec{
-		Type: "PlaceholderWithDefault",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Returns the complex conjugate of a complex number.
+// ```
+// x = [[[[1, 2, 3], [4, 5, 6]],
+//       [[7, 8, 9], [10, 11, 12]]]]
+// ```
 //
-// Given a tensor `input` of complex numbers, this operation returns a tensor of
-// complex numbers that are the complex conjugate of each element in `input`. The
-// complex numbers in `input` must be of the form \\(a + bj\\), where *a* is the
-// real part and *b* is the imaginary part.
+// The output tensor has shape `[4, 1, 1, 3]` and value:
 //
-// The complex conjugate returned by this operation is of the form \\(a - bj\\).
+// ```
+// [[[[1, 2, 3]]], [[[4, 5, 6]]], [[[7, 8, 9]]], [[[10, 11, 12]]]]
+// ```
 //
-// For example:
+// (3) For the following input of shape `[1, 4, 4, 1]` and block_size of 2:
 //
 // ```
-// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
-// tf.conj(input) ==> [-2.25 - 4.75j, 3.25 - 5.75j]
+// x = [[[[1],   [2],  [3],  [4]],
+//       [[5],   [6],  [7],  [8]],
+//       [[9],  [10], [11],  [12]],
+//       [[13], [14], [15],  [16]]]]
 // ```
-func Conj(scope *Scope, input tf.Output) (output tf.Output) {
+//
+// The output tensor has shape `[4, 2, 2, 1]` and value:
+//
+// ```
+// x = [[[[1], [3]], [[9], [11]]],
+//      [[[2], [4]], [[10], [12]]],
+//      [[[5], [7]], [[13], [15]]],
+//      [[[6], [8]], [[14], [16]]]]
+// ```
+//
+// (4) For the following input of shape `[2, 2, 4, 1]` and block_size of 2:
+//
+// ```
+// x = [[[[1],   [2],  [3],  [4]],
+//       [[5],   [6],  [7],  [8]]],
+//      [[[9],  [10], [11],  [12]],
+//       [[13], [14], [15],  [16]]]]
+// ```
+//
+// The output tensor has shape `[8, 1, 2, 1]` and value:
+//
+// ```
+// x = [[[[1], [3]]], [[[9], [11]]], [[[2], [4]]], [[[10], [12]]],
+//      [[[5], [7]]], [[[13], [15]]], [[[6], [8]]], [[[14], [16]]]]
+// ```
+//
+// Among others, this operation is useful for reducing atrous convolution into
+// regular convolution.
+//
+func SpaceToBatch(scope *Scope, input tf.Output, paddings tf.Output, block_size int64) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"block_size": block_size}
 	opspec := tf.OpSpec{
-		Type: "Conj",
+		Type: "SpaceToBatch",
 		Input: []tf.Input{
-			input,
+			input, paddings,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceSparseApplyMomentumAttr is an optional argument to ResourceSparseApplyMomentum.
-type ResourceSparseApplyMomentumAttr func(optionalAttr)
-
-// ResourceSparseApplyMomentumUseLocking sets the optional use_locking attribute to value.
+// SpaceToBatch for N-D tensors of type T.
 //
-// value: If `True`, updating of the var and accum tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceSparseApplyMomentumUseLocking(value bool) ResourceSparseApplyMomentumAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
+// This operation divides "spatial" dimensions `[1, ..., M]` of the input into a
+// grid of blocks of shape `block_shape`, and interleaves these blocks with the
+// "batch" dimension (0) such that in the output, the spatial dimensions
+// `[1, ..., M]` correspond to the position within the grid, and the batch
+// dimension combines both the position within a spatial block and the original
+// batch position.  Prior to division into blocks, the spatial dimensions of the
+// input are optionally zero padded according to `paddings`.  See below for a
+// precise description.
+//
+// Arguments:
+//	input: N-D with shape `input_shape = [batch] + spatial_shape + remaining_shape`,
+// where spatial_shape has `M` dimensions.
+//	block_shape: 1-D with shape `[M]`, all values must be >= 1.
+//	paddings: 2-D with shape `[M, 2]`, all values must be >= 0.
+//   `paddings[i] = [pad_start, pad_end]` specifies the padding for input dimension
+//   `i + 1`, which corresponds to spatial dimension `i`.  It is required that
+//   `block_shape[i]` divides `input_shape[i + 1] + pad_start + pad_end`.
+//
+// This operation is equivalent to the following steps:
+//
+// 1. Zero-pad the start and end of dimensions `[1, ..., M]` of the
+//    input according to `paddings` to produce `padded` of shape `padded_shape`.
+//
+// 2. Reshape `padded` to `reshaped_padded` of shape:
+//
+//      [batch] +
+//      [padded_shape[1] / block_shape[0],
+//        block_shape[0],
+//       ...,
+//       padded_shape[M] / block_shape[M-1],
+//       block_shape[M-1]] +
+//      remaining_shape
+//
+// 3. Permute dimensions of `reshaped_padded` to produce
+//    `permuted_reshaped_padded` of shape:
+//
+//      block_shape +
+//      [batch] +
+//      [padded_shape[1] / block_shape[0],
+//       ...,
+//       padded_shape[M] / block_shape[M-1]] +
+//      remaining_shape
+//
+// 4. Reshape `permuted_reshaped_padded` to flatten `block_shape` into the batch
+//    dimension, producing an output tensor of shape:
+//
+//      [batch * prod(block_shape)] +
+//      [padded_shape[1] / block_shape[0],
+//       ...,
+//       padded_shape[M] / block_shape[M-1]] +
+//      remaining_shape
+//
+// Some examples:
+//
+// (1) For the following input of shape `[1, 2, 2, 1]`, `block_shape = [2, 2]`, and
+//     `paddings = [[0, 0], [0, 0]]`:
+//
+// ```
+// x = [[[[1], [2]], [[3], [4]]]]
+// ```
+//
+// The output tensor has shape `[4, 1, 1, 1]` and value:
+//
+// ```
+// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
+// ```
+//
+// (2) For the following input of shape `[1, 2, 2, 3]`, `block_shape = [2, 2]`, and
+//     `paddings = [[0, 0], [0, 0]]`:
+//
+// ```
+// x = [[[[1, 2, 3], [4, 5, 6]],
+//       [[7, 8, 9], [10, 11, 12]]]]
+// ```
+//
+// The output tensor has shape `[4, 1, 1, 3]` and value:
+//
+// ```
+// [[[[1, 2, 3]]], [[[4, 5, 6]]], [[[7, 8, 9]]], [[[10, 11, 12]]]]
+// ```
+//
+// (3) For the following input of shape `[1, 4, 4, 1]`, `block_shape = [2, 2]`, and
+//     `paddings = [[0, 0], [0, 0]]`:
+//
+// ```
+// x = [[[[1],   [2],  [3],  [4]],
+//       [[5],   [6],  [7],  [8]],
+//       [[9],  [10], [11],  [12]],
+//       [[13], [14], [15],  [16]]]]
+// ```
+//
+// The output tensor has shape `[4, 2, 2, 1]` and value:
+//
+// ```
+// x = [[[[1], [3]], [[9], [11]]],
+//      [[[2], [4]], [[10], [12]]],
+//      [[[5], [7]], [[13], [15]]],
+//      [[[6], [8]], [[14], [16]]]]
+// ```
+//
+// (4) For the following input of shape `[2, 2, 4, 1]`, block_shape = `[2, 2]`, and
+//     paddings = `[[0, 0], [2, 0]]`:
+//
+// ```
+// x = [[[[1],   [2],  [3],  [4]],
+//       [[5],   [6],  [7],  [8]]],
+//      [[[9],  [10], [11],  [12]],
+//       [[13], [14], [15],  [16]]]]
+// ```
+//
+// The output tensor has shape `[8, 1, 3, 1]` and value:
+//
+// ```
+// x = [[[[0], [1], [3]]], [[[0], [9], [11]]],
+//      [[[0], [2], [4]]], [[[0], [10], [12]]],
+//      [[[0], [5], [7]]], [[[0], [13], [15]]],
+//      [[[0], [6], [8]]], [[[0], [14], [16]]]]
+// ```
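+//
+// For the common case of two spatial dimensions (M = 2), the pad, reshape,
+// and permute steps above collapse into strided slicing. The following
+// plain-Python sketch is an illustration only, not the TensorFlow kernel; the
+// helper names `pad_image` and `space_to_batch_2d` are hypothetical, and the
+// sketch assumes the padded height and width are divisible by `block_shape`.
+//
+// ```python
+// def pad_image(image, paddings, depth):
+//     # Zero-pad one [height][width][depth] image on its spatial dimensions.
+//     (top, bottom), (left, right) = paddings
+//     width = left + len(image[0]) + right
+//     rows = [[[0] * depth for _ in range(width)] for _ in range(top)]
+//     for row in image:
+//         rows.append([[0] * depth for _ in range(left)]
+//                     + [list(p) for p in row]
+//                     + [[0] * depth for _ in range(right)])
+//     rows += [[[0] * depth for _ in range(width)] for _ in range(bottom)]
+//     return rows
+//
+// def space_to_batch_2d(x, block_shape, paddings):
+//     block_h, block_w = block_shape
+//     depth = len(x[0][0][0])
+//     padded = [pad_image(image, paddings, depth) for image in x]
+//     out = []
+//     # The position within a block varies slowest, the original batch fastest.
+//     for by in range(block_h):
+//         for bx in range(block_w):
+//             for image in padded:
+//                 out.append([row[bx::block_w] for row in image[by::block_h]])
+//     return out
+//
+// x = [[[[1], [2], [3], [4]],
+//       [[5], [6], [7], [8]]],
+//      [[[9], [10], [11], [12]],
+//       [[13], [14], [15], [16]]]]
+// y = space_to_batch_2d(x, [2, 2], [[0, 0], [2, 0]])
+// print(y[0])  # [[[0], [1], [3]]]
+// ```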
+//
+// Among others, this operation is useful for reducing atrous convolution into
+// regular convolution.
+func SpaceToBatchND(scope *Scope, input tf.Output, block_shape tf.Output, paddings tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SpaceToBatchND",
+		Input: []tf.Input{
+			input, block_shape, paddings,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// ResourceSparseApplyMomentumUseNesterov sets the optional use_nesterov attribute to value.
-//
-// value: If `True`, the tensor passed to compute grad will be
-// var - lr * momentum * accum, so in the end, the var you get is actually
-// var - lr * momentum * accum.
-// If not specified, defaults to false
-func ResourceSparseApplyMomentumUseNesterov(value bool) ResourceSparseApplyMomentumAttr {
+// ListDiffAttr is an optional argument to ListDiff.
+type ListDiffAttr func(optionalAttr)
+
+// ListDiffOutIdx sets the optional out_idx attribute to value.
+// If not specified, defaults to DT_INT32
+func ListDiffOutIdx(value tf.DataType) ListDiffAttr {
 	return func(m optionalAttr) {
-		m["use_nesterov"] = value
+		m["out_idx"] = value
 	}
 }
 
-// Update relevant entries in '*var' and '*accum' according to the momentum scheme.
+// Computes the difference between two lists of numbers or strings.
 //
-// Set use_nesterov = True if you want to use Nesterov momentum.
+// Given a list `x` and a list `y`, this operation returns a list `out` that
+// represents all values that are in `x` but not in `y`. The returned list `out`
+// is sorted in the same order that the numbers appear in `x` (duplicates are
+// preserved). This operation also returns a list `idx` that represents the
+// position of each `out` element in `x`. In other words:
 //
-// That is for rows we have grad for, we update var and accum as follows:
+// `out[i] = x[idx[i]] for i in [0, 1, ..., len(out) - 1]`
 //
-// accum = accum * momentum + grad
-// var -= lr * accum
+// For example, given this input:
+//
+// ```
+// x = [1, 2, 3, 4, 5, 6]
+// y = [1, 3, 5]
+// ```
+//
+// This operation would return:
+//
+// ```
+// out ==> [2, 4, 6]
+// idx ==> [1, 3, 5]
+// ```
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	lr: Learning rate. Must be a scalar.
-//	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var and accum.
-//	momentum: Momentum. Must be a scalar.
+//	x: 1-D. Values to keep.
+//	y: 1-D. Values to remove.
 //
-// Returns the created operation.
-func ResourceSparseApplyMomentum(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, grad tf.Output, indices tf.Output, momentum tf.Output, optional ...ResourceSparseApplyMomentumAttr) (o *tf.Operation) {
+// Returns 1-D. Values present in `x` but not in `y`. 1-D. Positions of `x` values preserved in `out`.
+func ListDiff(scope *Scope, x tf.Output, y tf.Output, optional ...ListDiffAttr) (out tf.Output, idx tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -1063,223 +1024,215 @@ func ResourceSparseApplyMomentum(scope *Scope, var_ tf.Output, accum tf.Output,
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyMomentum",
+		Type: "ListDiff",
 		Input: []tf.Input{
-			var_, accum, lr, grad, indices, momentum,
+			x, y,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
 }
 
-// Creates a sequence of numbers.
+// Inserts a dimension of 1 into a tensor's shape.
 //
-// This operation creates a sequence of numbers that begins at `start` and
-// extends by increments of `delta` up to but not including `limit`.
+// Given a tensor `input`, this operation inserts a dimension of 1 at the
+// dimension index `axis` of `input`'s shape. The dimension index `axis` starts at
+// zero; if you specify a negative number for `axis` it is counted backward from
+// the end.
 //
-// For example:
+// This operation is useful if you want to add a batch dimension to a single
+// element. For example, if you have a single image of shape `[height, width,
+// channels]`, you can make it a batch of 1 image with `expand_dims(image, 0)`,
+// which will make the shape `[1, height, width, channels]`.
+//
+// Other examples:
 //
 // ```
-// # 'start' is 3
-// # 'limit' is 18
-// # 'delta' is 3
-// tf.range(start, limit, delta) ==> [3, 6, 9, 12, 15]
+// # 't' is a tensor of shape [2]
+// shape(expand_dims(t, 0)) ==> [1, 2]
+// shape(expand_dims(t, 1)) ==> [2, 1]
+// shape(expand_dims(t, -1)) ==> [2, 1]
+//
+// # 't2' is a tensor of shape [2, 3, 5]
+// shape(expand_dims(t2, 0)) ==> [1, 2, 3, 5]
+// shape(expand_dims(t2, 2)) ==> [2, 3, 1, 5]
+// shape(expand_dims(t2, 3)) ==> [2, 3, 5, 1]
 // ```
 //
+// This operation requires that:
+//
+// `-1-input.dims() <= axis <= input.dims()`
+//
+// This operation is related to `squeeze()`, which removes dimensions of
+// size 1.
+//
 // Arguments:
-//	start: 0-D (scalar). First entry in the sequence.
-//	limit: 0-D (scalar). Upper limit of sequence, exclusive.
-//	delta: 0-D (scalar). Optional. Default is 1. Number that increments `start`.
 //
-// Returns 1-D.
-func Range(scope *Scope, start tf.Output, limit tf.Output, delta tf.Output) (output tf.Output) {
+//	axis: 0-D (scalar). Specifies the dimension index at which to
+// expand the shape of `input`. Must be in the range
+// `[-rank(input) - 1, rank(input)]`.
+//
+// Returns Contains the same data as `input`, but its shape has an additional
+// dimension of size 1 added.
+func ExpandDims(scope *Scope, input tf.Output, axis tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Range",
+		Type: "ExpandDims",
 		Input: []tf.Input{
-			start, limit, delta,
+			input, axis,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes gradients for SparseSegmentSqrtN.
+// Returns (x - y)(x - y) element-wise.
 //
-// Returns tensor "output" with same shape as grad, except for dimension 0 whose
-// value is output_dim0.
-//
-// Arguments:
-//	grad: gradient propagated to the SparseSegmentSqrtN op.
-//	indices: indices passed to the corresponding SparseSegmentSqrtN op.
-//	segment_ids: segment_ids passed to the corresponding SparseSegmentSqrtN op.
-//	output_dim0: dimension 0 of "data" passed to SparseSegmentSqrtN op.
-func SparseSegmentSqrtNGrad(scope *Scope, grad tf.Output, indices tf.Output, segment_ids tf.Output, output_dim0 tf.Output) (output tf.Output) {
+// *NOTE*: `SquaredDifference` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func SquaredDifference(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseSegmentSqrtNGrad",
+		Type: "SquaredDifference",
 		Input: []tf.Input{
-			grad, indices, segment_ids, output_dim0,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the mean along sparse segments of a tensor.
-//
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
+// Forwards the input to the output.
 //
-// Like `SegmentMean`, but `segment_ids` can have rank less than `data`'s first
-// dimension, selecting a subset of dimension 0, specified by `indices`.
+// This operator represents the loop termination condition used by the
+// "pivot" switches of a loop.
 //
 // Arguments:
+//	input: A boolean scalar, representing the branch predicate of the Switch op.
 //
-//	indices: A 1-D tensor. Has same rank as `segment_ids`.
-//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
-//
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SparseSegmentMean(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output) (output tf.Output) {
+// Returns The same tensor as `input`.
+func LoopCond(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseSegmentMean",
+		Type: "LoopCond",
 		Input: []tf.Input{
-			data, indices, segment_ids,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Pop the element at the top of the stack.
-//
-// Arguments:
-//	handle: The handle to a stack.
-//	elem_type: The type of the elem that is popped.
-//
-// Returns The tensor that is popped from the top of the stack.
-func StackPopV2(scope *Scope, handle tf.Output, elem_type tf.DataType) (elem tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"elem_type": elem_type}
-	opspec := tf.OpSpec{
-		Type: "StackPopV2",
-		Input: []tf.Input{
-			handle,
-		},
-		Attrs: attrs,
+// QuantizedMulAttr is an optional argument to QuantizedMul.
+type QuantizedMulAttr func(optionalAttr)
+
+// QuantizedMulToutput sets the optional Toutput attribute to value.
+// If not specified, defaults to DT_QINT32
+func QuantizedMulToutput(value tf.DataType) QuantizedMulAttr {
+	return func(m optionalAttr) {
+		m["Toutput"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Computes the sum along sparse segments of a tensor.
-//
-// Like `SparseSegmentSum`, but allows missing ids in `segment_ids`. If an id is
-// misisng, the `output` tensor at that position will be zeroed.
-//
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// For example:
-//
-// ```python
-// c = tf.constant([[1,2,3,4], [-1,-2,-3,-4], [5,6,7,8]])
+// Returns x * y element-wise, working on quantized buffers.
 //
-// tf.sparse_segment_sum_with_num_segments(
-//     c, tf.constant([0, 1]), tf.constant([0, 0]), num_segments=3)
-// # => [[0 0 0 0]
-// #     [0 0 0 0]
-// #     [0 0 0 0]]
+// Arguments:
 //
-// tf.sparse_segment_sum_with_num_segments(c,
-//                                         tf.constant([0, 1]),
-//                                         tf.constant([0, 2],
-//                                         num_segments=4))
-// # => [[ 1  2  3  4]
-// #     [ 0  0  0  0]
-// #     [-1 -2 -3 -4]
-// #     [ 0  0  0  0]]
-// ```
 //
-// Arguments:
+//	min_x: The float value that the lowest quantized `x` value represents.
+//	max_x: The float value that the highest quantized `x` value represents.
+//	min_y: The float value that the lowest quantized `y` value represents.
+//	max_y: The float value that the highest quantized `y` value represents.
 //
-//	indices: A 1-D tensor. Has same rank as `segment_ids`.
-//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
-//	num_segments: Should equal the number of distinct segment IDs.
+// Returns The float value that the lowest quantized output value represents. The float value that the highest quantized output value represents.
 //
-// Returns Has same shape as data, except for dimension 0 which
-// has size `num_segments`.
-func SparseSegmentSumWithNumSegments(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
+// *NOTE*: `QuantizedMul` supports limited forms of broadcasting. More about
+// broadcasting [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func QuantizedMul(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x tf.Output, min_y tf.Output, max_y tf.Output, optional ...QuantizedMulAttr) (z tf.Output, min_z tf.Output, max_z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SparseSegmentSumWithNumSegments",
+		Type: "QuantizedMul",
 		Input: []tf.Input{
-			data, indices, segment_ids, num_segments,
+			x, y, min_x, max_x, min_y, max_y,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// SparseToDenseAttr is an optional argument to SparseToDense.
-type SparseToDenseAttr func(optionalAttr)
+// QuantizedMatMulAttr is an optional argument to QuantizedMatMul.
+type QuantizedMatMulAttr func(optionalAttr)
 
-// SparseToDenseValidateIndices sets the optional validate_indices attribute to value.
-//
-// value: If true, indices are checked to make sure they are sorted in
-// lexicographic order and that there are no repeats.
-// If not specified, defaults to true
-func SparseToDenseValidateIndices(value bool) SparseToDenseAttr {
+// QuantizedMatMulToutput sets the optional Toutput attribute to value.
+// If not specified, defaults to DT_QINT32
+func QuantizedMatMulToutput(value tf.DataType) QuantizedMatMulAttr {
 	return func(m optionalAttr) {
-		m["validate_indices"] = value
+		m["Toutput"] = value
 	}
 }
 
-// Converts a sparse representation into a dense tensor.
-//
-// Builds an array `dense` with shape `output_shape` such that
-//
-// ```
-// # If sparse_indices is scalar
-// dense[i] = (i == sparse_indices ? sparse_values : default_value)
+// QuantizedMatMulTransposeA sets the optional transpose_a attribute to value.
 //
-// # If sparse_indices is a vector, then for each i
-// dense[sparse_indices[i]] = sparse_values[i]
+// value: If true, `a` is transposed before multiplication.
+// If not specified, defaults to false
+func QuantizedMatMulTransposeA(value bool) QuantizedMatMulAttr {
+	return func(m optionalAttr) {
+		m["transpose_a"] = value
+	}
+}
+
+// QuantizedMatMulTransposeB sets the optional transpose_b attribute to value.
 //
-// # If sparse_indices is an n by d matrix, then for each i in [0, n)
-// dense[sparse_indices[i][0], ..., sparse_indices[i][d-1]] = sparse_values[i]
-// ```
+// value: If true, `b` is transposed before multiplication.
+// If not specified, defaults to false
+func QuantizedMatMulTransposeB(value bool) QuantizedMatMulAttr {
+	return func(m optionalAttr) {
+		m["transpose_b"] = value
+	}
+}
+
+// QuantizedMatMulTactivation sets the optional Tactivation attribute to value.
 //
-// All other values in `dense` are set to `default_value`.  If `sparse_values` is a
-// scalar, all sparse indices are set to this single value.
+// value: The type of output produced by activation function
+// following this operation.
+// If not specified, defaults to DT_QUINT8
+func QuantizedMatMulTactivation(value tf.DataType) QuantizedMatMulAttr {
+	return func(m optionalAttr) {
+		m["Tactivation"] = value
+	}
+}
+
+// Perform a quantized matrix multiplication of `a` by the matrix `b`.
 //
-// Indices should be sorted in lexicographic order, and indices must not
-// contain any repeats. If `validate_indices` is true, these properties
-// are checked during execution.
+// The inputs must be two-dimensional matrices and the inner dimension of
+// `a` (after being transposed if `transpose_a` is true) must match the
+// outer dimension of `b` (after being transposed if `transpose_b` is
+// true).
 //
 // Arguments:
-//	sparse_indices: 0-D, 1-D, or 2-D.  `sparse_indices[i]` contains the complete
-// index where `sparse_values[i]` will be placed.
-//	output_shape: 1-D.  Shape of the dense output tensor.
-//	sparse_values: 1-D.  Values corresponding to each row of `sparse_indices`,
-// or a scalar value to be used for all sparse indices.
-//	default_value: Scalar value to set for indices not specified in
-// `sparse_indices`.
+//	a: Must be a two-dimensional tensor.
+//	b: Must be a two-dimensional tensor.
+//	min_a: The float value that the lowest quantized `a` value represents.
+//	max_a: The float value that the highest quantized `a` value represents.
+//	min_b: The float value that the lowest quantized `b` value represents.
+//	max_b: The float value that the highest quantized `b` value represents.
 //
-// Returns Dense output tensor of shape `output_shape`.
-func SparseToDense(scope *Scope, sparse_indices tf.Output, output_shape tf.Output, sparse_values tf.Output, default_value tf.Output, optional ...SparseToDenseAttr) (dense tf.Output) {
+// Returns The float value that the lowest quantized output value represents. The float value that the highest quantized output value represents.
+func QuantizedMatMul(scope *Scope, a tf.Output, b tf.Output, min_a tf.Output, max_a tf.Output, min_b tf.Output, max_b tf.Output, optional ...QuantizedMatMulAttr) (out tf.Output, min_out tf.Output, max_out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -1288,574 +1241,567 @@ func SparseToDense(scope *Scope, sparse_indices tf.Output, output_shape tf.Outpu
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseToDense",
+		Type: "QuantizedMatMul",
 		Input: []tf.Input{
-			sparse_indices, output_shape, sparse_values, default_value,
+			a, b, min_a, max_a, min_b, max_b,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Counts the number of occurrences of each value in an integer array.
-//
-// Outputs a vector with length `size` and the same dtype as `weights`. If
-// `weights` are empty, then index `i` stores the number of times the value `i` is
-// counted in `arr`. If `weights` are non-empty, then index `i` stores the sum of
-// the value in `weights` at each index where the corresponding value in `arr` is
-// `i`.
-//
-// Values in `arr` outside of the range [0, size) are ignored.
+// A placeholder op that passes through `input` when its output is not fed.
 //
 // Arguments:
-//	arr: int32 `Tensor`.
-//	size: non-negative int32 scalar `Tensor`.
-//	weights: is an int32, int64, float32, or float64 `Tensor` with the same
-// shape as `arr`, or a length-0 `Tensor`, in which case it acts as all weights
-// equal to 1.
+//	input: The default value to produce when `output` is not fed.
+//	shape: The (possibly partial) shape of the tensor.
 //
-// Returns 1D `Tensor` with length equal to `size`. The counts or summed weights for
-// each value in the range [0, size).
-func Bincount(scope *Scope, arr tf.Output, size tf.Output, weights tf.Output) (bins tf.Output) {
+// Returns A placeholder tensor that defaults to `input` if it is not fed.
+func PlaceholderWithDefault(scope *Scope, input tf.Output, shape tf.Shape) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"shape": shape}
 	opspec := tf.OpSpec{
-		Type: "Bincount",
+		Type: "PlaceholderWithDefault",
 		Input: []tf.Input{
-			arr, size, weights,
+			input,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the sum along sparse segments of a tensor.
+// Returns the complex conjugate of a complex number.
 //
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
+// Given a tensor `input` of complex numbers, this operation returns a tensor of
+// complex numbers that are the complex conjugate of each element in `input`. The
+// complex numbers in `input` must be of the form \\(a + bj\\), where *a* is the
+// real part and *b* is the imaginary part.
 //
-// Like `SegmentSum`, but `segment_ids` can have rank less than `data`'s first
-// dimension, selecting a subset of dimension 0, specified by `indices`.
+// The complex conjugate returned by this operation is of the form \\(a - bj\\).
 //
 // For example:
 //
-// ```python
-// c = tf.constant([[1,2,3,4], [-1,-2,-3,-4], [5,6,7,8]])
-//
-// # Select two rows, one segment.
-// tf.sparse_segment_sum(c, tf.constant([0, 1]), tf.constant([0, 0]))
-// # => [[0 0 0 0]]
-//
-// # Select two rows, two segment.
-// tf.sparse_segment_sum(c, tf.constant([0, 1]), tf.constant([0, 1]))
-// # => [[ 1  2  3  4]
-// #     [-1 -2 -3 -4]]
-//
-// # Select all rows, two segments.
-// tf.sparse_segment_sum(c, tf.constant([0, 1, 2]), tf.constant([0, 0, 1]))
-// # => [[0 0 0 0]
-// #     [5 6 7 8]]
-//
-// # Which is equivalent to:
-// tf.segment_sum(c, tf.constant([0, 0, 1]))
 // ```
-//
-// Arguments:
-//
-//	indices: A 1-D tensor. Has same rank as `segment_ids`.
-//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
-//
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SparseSegmentSum(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output) (output tf.Output) {
+// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
+// tf.conj(input) ==> [-2.25 - 4.75j, 3.25 - 5.75j]
+// ```
+func Conj(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseSegmentSum",
+		Type: "Conj",
 		Input: []tf.Input{
-			data, indices, segment_ids,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes hyperbolic sine of x element-wise.
-func Sinh(scope *Scope, x tf.Output) (y tf.Output) {
+// ResourceSparseApplyMomentumAttr is an optional argument to ResourceSparseApplyMomentum.
+type ResourceSparseApplyMomentumAttr func(optionalAttr)
+
+// ResourceSparseApplyMomentumUseLocking sets the optional use_locking attribute to value.
+//
+// value: If `True`, updating of the var and accum tensors will be protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceSparseApplyMomentumUseLocking(value bool) ResourceSparseApplyMomentumAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// ResourceSparseApplyMomentumUseNesterov sets the optional use_nesterov attribute to value.
+//
+// value: If `True`, the tensor passed to compute grad will be
+// var - lr * momentum * accum, so in the end, the var you get is actually
+// var - lr * momentum * accum.
+// If not specified, defaults to false
+func ResourceSparseApplyMomentumUseNesterov(value bool) ResourceSparseApplyMomentumAttr {
+	return func(m optionalAttr) {
+		m["use_nesterov"] = value
+	}
+}
+
+// Update relevant entries in '*var' and '*accum' according to the momentum scheme.
+//
+// Set use_nesterov = True if you want to use Nesterov momentum.
+//
+// That is for rows we have grad for, we update var and accum as follows:
+//
+// accum = accum * momentum + grad
+// var -= lr * accum
+//
+// Arguments:
+//	var_: Should be from a Variable().
+//	accum: Should be from a Variable().
+//	lr: Learning rate. Must be a scalar.
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var and accum.
+//	momentum: Momentum. Must be a scalar.
+//
+// Returns the created operation.
+func ResourceSparseApplyMomentum(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, grad tf.Output, indices tf.Output, momentum tf.Output, optional ...ResourceSparseApplyMomentumAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Sinh",
+		Type: "ResourceSparseApplyMomentum",
 		Input: []tf.Input{
-			x,
+			var_, accum, lr, grad, indices, momentum,
 		},
+		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Computes the sum along segments of a tensor.
-//
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// Computes a tensor such that
-// `(output[i] = sum_{j...} data[j...]` where the sum is over tuples `j...` such
-// that `segment_ids[j...] == i`.  Unlike `SegmentSum`, `segment_ids`
-// need not be sorted and need not cover all values in the full
-// range of valid values.
+// Creates a sequence of numbers.
 //
-// If the sum is empty for a given segment ID `i`, `output[i] = 0`.
-// If the given segment ID `i` is negative, the value is dropped and will not be
-// added to the sum of the segment.
+// This operation creates a sequence of numbers that begins at `start` and
+// extends by increments of `delta` up to but not including `limit`.
 //
-// `num_segments` should equal the number of distinct segment IDs.
+// For example:
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/UnsortedSegmentSum.png" alt>
-// </div>
+// ```
+// # 'start' is 3
+// # 'limit' is 18
+// # 'delta' is 3
+// tf.range(start, limit, delta) ==> [3, 6, 9, 12, 15]
+// ```
 //
 // Arguments:
+//	start: 0-D (scalar). First entry in the sequence.
+//	limit: 0-D (scalar). Upper limit of sequence, exclusive.
+//	delta: 0-D (scalar). Optional. Default is 1. Number that increments `start`.
 //
-//	segment_ids: A tensor whose shape is a prefix of `data.shape`.
-//
-//
-// Returns Has same shape as data, except for the first `segment_ids.rank`
-// dimensions, which are replaced with a single dimension which has size
-// `num_segments`.
-func UnsortedSegmentSum(scope *Scope, data tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
+// Returns 1-D.
+func Range(scope *Scope, start tf.Output, limit tf.Output, delta tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "UnsortedSegmentSum",
+		Type: "Range",
 		Input: []tf.Input{
-			data, segment_ids, num_segments,
+			start, limit, delta,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns which elements of x are finite.
+// Computes gradients for SparseSegmentSqrtN.
 //
-// @compatibility(numpy)
-// Equivalent to np.isfinite
-// @end_compatibility
-func IsFinite(scope *Scope, x tf.Output) (y tf.Output) {
+// Returns tensor "output" with same shape as grad, except for dimension 0 whose
+// value is output_dim0.
+//
+// Arguments:
+//	grad: gradient propagated to the SparseSegmentSqrtN op.
+//	indices: indices passed to the corresponding SparseSegmentSqrtN op.
+//	segment_ids: segment_ids passed to the corresponding SparseSegmentSqrtN op.
+//	output_dim0: dimension 0 of "data" passed to SparseSegmentSqrtN op.
+func SparseSegmentSqrtNGrad(scope *Scope, grad tf.Output, indices tf.Output, segment_ids tf.Output, output_dim0 tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "IsFinite",
+		Type: "SparseSegmentSqrtNGrad",
 		Input: []tf.Input{
-			x,
+			grad, indices, segment_ids, output_dim0,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MatMulAttr is an optional argument to MatMul.
-type MatMulAttr func(optionalAttr)
-
-// MatMulTransposeA sets the optional transpose_a attribute to value.
+// Computes the mean along sparse segments of a tensor.
 //
-// value: If true, "a" is transposed before multiplication.
-// If not specified, defaults to false
-func MatMulTransposeA(value bool) MatMulAttr {
-	return func(m optionalAttr) {
-		m["transpose_a"] = value
-	}
-}
-
-// MatMulTransposeB sets the optional transpose_b attribute to value.
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
 //
-// value: If true, "b" is transposed before multiplication.
-// If not specified, defaults to false
-func MatMulTransposeB(value bool) MatMulAttr {
-	return func(m optionalAttr) {
-		m["transpose_b"] = value
-	}
-}
-
-// Multiply the matrix "a" by the matrix "b".
+// Like `SegmentMean`, but `segment_ids` can have rank less than `data`'s first
+// dimension, selecting a subset of dimension 0, specified by `indices`.
 //
-// The inputs must be two-dimensional matrices and the inner dimension of
-// "a" (after being transposed if transpose_a is true) must match the
-// outer dimension of "b" (after being transposed if transposed_b is
-// true).
+// Arguments:
 //
-// *Note*: The default kernel implementation for MatMul on GPUs uses
-// cublas.
-func MatMul(scope *Scope, a tf.Output, b tf.Output, optional ...MatMulAttr) (product tf.Output) {
+//	indices: A 1-D tensor. Has same rank as `segment_ids`.
+//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `k`, the number of segments.
+func SparseSegmentMean(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "MatMul",
+		Type: "SparseSegmentMean",
 		Input: []tf.Input{
-			a, b,
+			data, indices, segment_ids,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Selects elements from `x` or `y`, depending on `condition`.
-//
-// The `x`, and `y` tensors must all have the same shape, and the
-// output will also have that shape.
-//
-// The `condition` tensor must be a scalar if `x` and `y` are scalars.
-// If `x` and `y` are vectors or higher rank, then `condition` must be either a
-// scalar, a vector with size matching the first dimension of `x`, or must have
-// the same shape as `x`.
-//
-// The `condition` tensor acts as a mask that chooses, based on the value at each
-// element, whether the corresponding element / row in the output should be
-// taken from `x` (if true) or `y` (if false).
-//
-// If `condition` is a vector and `x` and `y` are higher rank matrices, then
-// it chooses which row (outer dimension) to copy from `x` and `y`.
-// If `condition` has the same shape as `x` and `y`, then it chooses which
-// element to copy from `x` and `y`.
-//
-// For example:
-//
-// ```python
-// # 'condition' tensor is [[True,  False]
-// #                        [False, True]]
-// # 't' is [[1, 2],
-// #         [3, 4]]
-// # 'e' is [[5, 6],
-// #         [7, 8]]
-// select(condition, t, e)  # => [[1, 6], [7, 4]]
-//
-//
-// # 'condition' tensor is [True, False]
-// # 't' is [[1, 2],
-// #         [3, 4]]
-// # 'e' is [[5, 6],
-// #         [7, 8]]
-// select(condition, t, e) ==> [[1, 2],
-//                              [7, 8]]
-//
-// ```
+// Pop the element at the top of the stack.
 //
 // Arguments:
+//	handle: The handle to a stack.
+//	elem_type: The type of the elem that is popped.
 //
-//	x: = A `Tensor` which may have the same shape as `condition`.
-// If `condition` is rank 1, `x` may have higher rank,
-// but its first dimension must match the size of `condition`.
-//	y: = A `Tensor` with the same type and shape as `x`.
-//
-// Returns = A `Tensor` with the same type and shape as `x` and `y`.
-func Select(scope *Scope, condition tf.Output, x tf.Output, y tf.Output) (output tf.Output) {
+// Returns The tensor that is popped from the top of the stack.
+func StackPopV2(scope *Scope, handle tf.Output, elem_type tf.DataType) (elem tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"elem_type": elem_type}
 	opspec := tf.OpSpec{
-		Type: "Select",
+		Type: "StackPopV2",
 		Input: []tf.Input{
-			condition, x, y,
+			handle,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns the truth value of x OR y element-wise.
+// Computes the sum along sparse segments of a tensor.
 //
-// *NOTE*: `LogicalOr` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func LogicalOr(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Like `SparseSegmentSum`, but allows missing ids in `segment_ids`. If an id is
+// missing, the `output` tensor at that position will be zeroed.
+//
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
+//
+// For example:
+//
+// ```python
+// c = tf.constant([[1,2,3,4], [-1,-2,-3,-4], [5,6,7,8]])
+//
+// tf.sparse_segment_sum_with_num_segments(
+//     c, tf.constant([0, 1]), tf.constant([0, 0]), num_segments=3)
+// # => [[0 0 0 0]
+// #     [0 0 0 0]
+// #     [0 0 0 0]]
+//
+// tf.sparse_segment_sum_with_num_segments(c,
+//                                         tf.constant([0, 1]),
+//                                         tf.constant([0, 2]),
+//                                         num_segments=4)
+// # => [[ 1  2  3  4]
+// #     [ 0  0  0  0]
+// #     [-1 -2 -3 -4]
+// #     [ 0  0  0  0]]
+// ```
+//
+// Arguments:
+//
+//	indices: A 1-D tensor. Has same rank as `segment_ids`.
+//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
+//	num_segments: Should equal the number of distinct segment IDs.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `num_segments`.
+func SparseSegmentSumWithNumSegments(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "LogicalOr",
+		Type: "SparseSegmentSumWithNumSegments",
 		Input: []tf.Input{
-			x, y,
+			data, indices, segment_ids, num_segments,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Compute the regularized incomplete beta integral \\(I_x(a, b)\\).
+// SparseToDenseAttr is an optional argument to SparseToDense.
+type SparseToDenseAttr func(optionalAttr)
+
+// SparseToDenseValidateIndices sets the optional validate_indices attribute to value.
 //
-// The regularized incomplete beta integral is defined as:
+// value: If true, indices are checked to make sure they are sorted in
+// lexicographic order and that there are no repeats.
+// If not specified, defaults to true
+func SparseToDenseValidateIndices(value bool) SparseToDenseAttr {
+	return func(m optionalAttr) {
+		m["validate_indices"] = value
+	}
+}
+
+// Converts a sparse representation into a dense tensor.
 //
+// Builds an array `dense` with shape `output_shape` such that
 //
-// \\(I_x(a, b) = \frac{B(x; a, b)}{B(a, b)}\\)
+// ```
+// # If sparse_indices is scalar
+// dense[i] = (i == sparse_indices ? sparse_values : default_value)
 //
-// where
+// # If sparse_indices is a vector, then for each i
+// dense[sparse_indices[i]] = sparse_values[i]
 //
+// # If sparse_indices is an n by d matrix, then for each i in [0, n)
+// dense[sparse_indices[i][0], ..., sparse_indices[i][d-1]] = sparse_values[i]
+// ```
 //
-// \\(B(x; a, b) = \int_0^x t^{a-1} (1 - t)^{b-1} dt\\)
+// All other values in `dense` are set to `default_value`.  If `sparse_values` is a
+// scalar, all sparse indices are set to this single value.
 //
+// Indices should be sorted in lexicographic order, and indices must not
+// contain any repeats. If `validate_indices` is true, these properties
+// are checked during execution.
 //
-// is the incomplete beta function and \\(B(a, b)\\) is the *complete*
-// beta function.
-func Betainc(scope *Scope, a tf.Output, b tf.Output, x tf.Output) (z tf.Output) {
+// Arguments:
+//	sparse_indices: 0-D, 1-D, or 2-D.  `sparse_indices[i]` contains the complete
+// index where `sparse_values[i]` will be placed.
+//	output_shape: 1-D.  Shape of the dense output tensor.
+//	sparse_values: 1-D.  Values corresponding to each row of `sparse_indices`,
+// or a scalar value to be used for all sparse indices.
+//	default_value: Scalar value to set for indices not specified in
+// `sparse_indices`.
+//
+// Returns Dense output tensor of shape `output_shape`.
+func SparseToDense(scope *Scope, sparse_indices tf.Output, output_shape tf.Output, sparse_values tf.Output, default_value tf.Output, optional ...SparseToDenseAttr) (dense tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Betainc",
+		Type: "SparseToDense",
 		Input: []tf.Input{
-			a, b, x,
+			sparse_indices, output_shape, sparse_values, default_value,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the sum along sparse segments of a tensor divided by the sqrt of N.
-//
-// N is the size of the segment being reduced.
+// Counts the number of occurrences of each value in an integer array.
 //
-// Like `SparseSegmentSqrtN`, but allows missing ids in `segment_ids`. If an id is
-// misisng, the `output` tensor at that position will be zeroed.
+// Outputs a vector with length `size` and the same dtype as `weights`. If
+// `weights` are empty, then index `i` stores the number of times the value `i` is
+// counted in `arr`. If `weights` are non-empty, then index `i` stores the sum of
+// the value in `weights` at each index where the corresponding value in `arr` is
+// `i`.
 //
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
+// Values in `arr` outside of the range [0, size) are ignored.
 //
 // Arguments:
+//	arr: int32 `Tensor`.
+//	size: non-negative int32 scalar `Tensor`.
+//	weights: is an int32, int64, float32, or float64 `Tensor` with the same
+// shape as `arr`, or a length-0 `Tensor`, in which case it acts as all weights
+// equal to 1.
 //
-//	indices: A 1-D tensor. Has same rank as `segment_ids`.
-//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
-//	num_segments: Should equal the number of distinct segment IDs.
-//
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SparseSegmentSqrtNWithNumSegments(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
+// Returns 1D `Tensor` with length equal to `size`. The counts or summed weights for
+// each value in the range [0, size).
+func Bincount(scope *Scope, arr tf.Output, size tf.Output, weights tf.Output) (bins tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseSegmentSqrtNWithNumSegments",
+		Type: "Bincount",
 		Input: []tf.Input{
-			data, indices, segment_ids, num_segments,
+			arr, size, weights,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Compute the upper regularized incomplete Gamma function `Q(a, x)`.
+// Computes the sum along sparse segments of a tensor.
 //
-// The upper regularized incomplete Gamma function is defined as:
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
 //
-// \\(Q(a, x) = Gamma(a, x) / Gamma(a) = 1 - P(a, x)\\)
+// Like `SegmentSum`, but `segment_ids` can have rank less than `data`'s first
+// dimension, selecting a subset of dimension 0, specified by `indices`.
 //
-// where
+// For example:
 //
-// \\(Gamma(a, x) = int_{x}^{\infty} t^{a-1} exp(-t) dt\\)
+// ```python
+// c = tf.constant([[1,2,3,4], [-1,-2,-3,-4], [5,6,7,8]])
 //
-// is the upper incomplete Gama function.
+// # Select two rows, one segment.
+// tf.sparse_segment_sum(c, tf.constant([0, 1]), tf.constant([0, 0]))
+// # => [[0 0 0 0]]
 //
-// Note, above `P(a, x)` (`Igamma`) is the lower regularized complete
-// Gamma function.
-func Igammac(scope *Scope, a tf.Output, x tf.Output) (z tf.Output) {
+// # Select two rows, two segments.
+// tf.sparse_segment_sum(c, tf.constant([0, 1]), tf.constant([0, 1]))
+// # => [[ 1  2  3  4]
+// #     [-1 -2 -3 -4]]
+//
+// # Select all rows, two segments.
+// tf.sparse_segment_sum(c, tf.constant([0, 1, 2]), tf.constant([0, 0, 1]))
+// # => [[0 0 0 0]
+// #     [5 6 7 8]]
+//
+// # Which is equivalent to:
+// tf.segment_sum(c, tf.constant([0, 0, 1]))
+// ```
+//
+// Arguments:
+//
+//	indices: A 1-D tensor. Has same rank as `segment_ids`.
+//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `k`, the number of segments.
+func SparseSegmentSum(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Igammac",
+		Type: "SparseSegmentSum",
 		Input: []tf.Input{
-			a, x,
+			data, indices, segment_ids,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// LogUniformCandidateSamplerAttr is an optional argument to LogUniformCandidateSampler.
-type LogUniformCandidateSamplerAttr func(optionalAttr)
-
-// LogUniformCandidateSamplerSeed sets the optional seed attribute to value.
-//
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func LogUniformCandidateSamplerSeed(value int64) LogUniformCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
+// Computes hyperbolic sine of x element-wise.
+func Sinh(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// LogUniformCandidateSamplerSeed2 sets the optional seed2 attribute to value.
-//
-// value: An second seed to avoid seed collision.
-// If not specified, defaults to 0
-func LogUniformCandidateSamplerSeed2(value int64) LogUniformCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
+	opspec := tf.OpSpec{
+		Type: "Sinh",
+		Input: []tf.Input{
+			x,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Generates labels for candidate sampling with a log-uniform distribution.
-//
-// See explanations of candidate sampling and the data formats at
-// go/candidate-sampling.
-//
-// For each batch, this op picks a single set of sampled candidate labels.
-//
-// The advantages of sampling candidates per-batch are simplicity and the
-// possibility of efficient dense matrix multiplication. The disadvantage is that
-// the sampled candidates must be chosen independently of the context and of the
-// true labels.
-//
-// Arguments:
-//	true_classes: A batch_size * num_true matrix, in which each row contains the
-// IDs of the num_true target_classes in the corresponding original label.
-//	num_true: Number of true labels per context.
-//	num_sampled: Number of candidates to randomly sample.
-//	unique: If unique is true, we sample with rejection, so that all sampled
-// candidates in a batch are unique. This requires some approximation to
-// estimate the post-rejection sampling probabilities.
-//	range_max: The sampler will sample integers from the interval [0, range_max).
-//
-// Returns A vector of length num_sampled, in which each element is
-// the ID of a sampled candidate.A batch_size * num_true matrix, representing
-// the number of times each candidate is expected to occur in a batch
-// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
-// candidate representing the number of times the candidate is expected
-// to occur in a batch of sampled candidates.  If unique=true, then this is a
-// probability.
-func LogUniformCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...LogUniformCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
+// Computes rectified linear 6: `min(max(features, 0), 6)`.
+func Relu6(scope *Scope, features tf.Output) (activations tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "LogUniformCandidateSampler",
+		Type: "Relu6",
 		Input: []tf.Input{
-			true_classes,
+			features,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// ApproximateEqualAttr is an optional argument to ApproximateEqual.
-type ApproximateEqualAttr func(optionalAttr)
-
-// ApproximateEqualTolerance sets the optional tolerance attribute to value.
-// If not specified, defaults to 1e-05
-func ApproximateEqualTolerance(value float32) ApproximateEqualAttr {
-	return func(m optionalAttr) {
-		m["tolerance"] = value
-	}
-}
-
-// Returns the truth value of abs(x-y) < tolerance element-wise.
-func ApproximateEqual(scope *Scope, x tf.Output, y tf.Output, optional ...ApproximateEqualAttr) (z tf.Output) {
+// Computes the sum along segments of a tensor.
+//
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
+//
+// Computes a tensor such that
+// `output[i] = sum_{j...} data[j...]` where the sum is over tuples `j...` such
+// that `segment_ids[j...] == i`.  Unlike `SegmentSum`, `segment_ids`
+// need not be sorted and need not cover all values in the full
+// range of valid values.
+//
+// If the sum is empty for a given segment ID `i`, `output[i] = 0`.
+// If the given segment ID `i` is negative, the value is dropped and will not be
+// added to the sum of the segment.
+//
+// `num_segments` should equal the number of distinct segment IDs.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/UnsortedSegmentSum.png" alt>
+// </div>
+//
+// Arguments:
+//
+//	segment_ids: A tensor whose shape is a prefix of `data.shape`.
+//
+//
+// Returns Has same shape as data, except for the first `segment_ids.rank`
+// dimensions, which are replaced with a single dimension which has size
+// `num_segments`.
+func UnsortedSegmentSum(scope *Scope, data tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ApproximateEqual",
+		Type: "UnsortedSegmentSum",
 		Input: []tf.Input{
-			x, y,
+			data, segment_ids, num_segments,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns x / y element-wise.
+// Returns which elements of x are finite.
 //
-// *NOTE*: `Div` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Div(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// @compatibility(numpy)
+// Equivalent to np.isfinite
+// @end_compatibility
+func IsFinite(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Div",
+		Type: "IsFinite",
 		Input: []tf.Input{
-			x, y,
+			x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns x * y element-wise.
+// MatMulAttr is an optional argument to MatMul.
+type MatMulAttr func(optionalAttr)
+
+// MatMulTransposeA sets the optional transpose_a attribute to value.
 //
-// *NOTE*: `Multiply` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Mul(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Mul",
-		Input: []tf.Input{
-			x, y,
-		},
+// value: If true, "a" is transposed before multiplication.
+// If not specified, defaults to false
+func MatMulTransposeA(value bool) MatMulAttr {
+	return func(m optionalAttr) {
+		m["transpose_a"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// SparseReduceSumSparseAttr is an optional argument to SparseReduceSumSparse.
-type SparseReduceSumSparseAttr func(optionalAttr)
-
-// SparseReduceSumSparseKeepDims sets the optional keep_dims attribute to value.
+// MatMulTransposeB sets the optional transpose_b attribute to value.
 //
-// value: If true, retain reduced dimensions with length 1.
+// value: If true, "b" is transposed before multiplication.
 // If not specified, defaults to false
-func SparseReduceSumSparseKeepDims(value bool) SparseReduceSumSparseAttr {
+func MatMulTransposeB(value bool) MatMulAttr {
 	return func(m optionalAttr) {
-		m["keep_dims"] = value
+		m["transpose_b"] = value
 	}
 }
 
-// Computes the sum of elements across dimensions of a SparseTensor.
-//
-// This Op takes a SparseTensor and is the sparse counterpart to
-// `tf.reduce_sum()`.  In contrast to SparseReduceSum, this Op returns a
-// SparseTensor.
-//
-// Reduces `sp_input` along the dimensions given in `reduction_axes`.  Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `reduction_axes`. If `keep_dims` is true, the reduced dimensions are retained
-// with length 1.
+// Multiply the matrix "a" by the matrix "b".
 //
-// If `reduction_axes` has no entries, all dimensions are reduced, and a tensor
-// with a single element is returned.  Additionally, the axes can be negative,
-// which are interpreted according to the indexing rules in Python.
+// The inputs must be two-dimensional matrices and the inner dimension of
+// "a" (after being transposed if transpose_a is true) must match the
+// outer dimension of "b" (after being transposed if transpose_b is
+// true).
 //
-// Arguments:
-//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, possibly not in canonical ordering.
-//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
-//	input_shape: 1-D.  Shape of the input SparseTensor.
-//	reduction_axes: 1-D.  Length-`K` vector containing the reduction axes.
-func SparseReduceSumSparse(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output, reduction_axes tf.Output, optional ...SparseReduceSumSparseAttr) (output_indices tf.Output, output_values tf.Output, output_shape tf.Output) {
+// *Note*: The default kernel implementation for MatMul on GPUs uses
+// cublas.
+func MatMul(scope *Scope, a tf.Output, b tf.Output, optional ...MatMulAttr) (product tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -1864,309 +1810,252 @@ func SparseReduceSumSparse(scope *Scope, input_indices tf.Output, input_values t
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseReduceSumSparse",
+		Type: "MatMul",
 		Input: []tf.Input{
-			input_indices, input_values, input_shape, reduction_axes,
+			a, b,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// BiasAddAttr is an optional argument to BiasAdd.
-type BiasAddAttr func(optionalAttr)
-
-// BiasAddDataFormat sets the optional data_format attribute to value.
+// Selects elements from `x` or `y`, depending on `condition`.
 //
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the bias tensor will be added to the last dimension
-// of the value tensor.
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// The tensor will be added to "in_channels", the third-to-the-last
-//     dimension.
-// If not specified, defaults to "NHWC"
-func BiasAddDataFormat(value string) BiasAddAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Adds `bias` to `value`.
+// The `x` and `y` tensors must have the same shape, and the
+// output will also have that shape.
 //
-// This is a special case of `tf.add` where `bias` is restricted to be 1-D.
-// Broadcasting is supported, so `value` may have any number of dimensions.
+// The `condition` tensor must be a scalar if `x` and `y` are scalars.
+// If `x` and `y` are vectors or higher rank, then `condition` must be either a
+// scalar, a vector with size matching the first dimension of `x`, or must have
+// the same shape as `x`.
+//
+// The `condition` tensor acts as a mask that chooses, based on the value at each
+// element, whether the corresponding element / row in the output should be
+// taken from `x` (if true) or `y` (if false).
+//
+// If `condition` is a vector and `x` and `y` are higher rank matrices, then
+// it chooses which row (outer dimension) to copy from `x` and `y`.
+// If `condition` has the same shape as `x` and `y`, then it chooses which
+// element to copy from `x` and `y`.
+//
+// For example:
+//
+// ```python
+// # 'condition' tensor is [[True,  False]
+// #                        [False, True]]
+// # 't' is [[1, 2],
+// #         [3, 4]]
+// # 'e' is [[5, 6],
+// #         [7, 8]]
+// select(condition, t, e)  # => [[1, 6], [7, 4]]
+//
+//
+// # 'condition' tensor is [True, False]
+// # 't' is [[1, 2],
+// #         [3, 4]]
+// # 'e' is [[5, 6],
+// #         [7, 8]]
+// select(condition, t, e) ==> [[1, 2],
+//                              [7, 8]]
+//
+// ```
 //
 // Arguments:
-//	value: Any number of dimensions.
-//	bias: 1-D with size the last dimension of `value`.
 //
-// Returns Broadcasted sum of `value` and `bias`.
-func BiasAdd(scope *Scope, value tf.Output, bias tf.Output, optional ...BiasAddAttr) (output tf.Output) {
+//	x: A `Tensor` which may have the same shape as `condition`.
+// If `condition` is rank 1, `x` may have higher rank,
+// but its first dimension must match the size of `condition`.
+//	y: A `Tensor` with the same type and shape as `x`.
+//
+// Returns A `Tensor` with the same type and shape as `x` and `y`.
+func Select(scope *Scope, condition tf.Output, x tf.Output, y tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "BiasAdd",
+		Type: "Select",
 		Input: []tf.Input{
-			value, bias,
+			condition, x, y,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// BiasAddGradAttr is an optional argument to BiasAddGrad.
-type BiasAddGradAttr func(optionalAttr)
-
-// BiasAddGradDataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the bias tensor will be added to the last dimension
-// of the value tensor.
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// The tensor will be added to "in_channels", the third-to-the-last
-//     dimension.
-// If not specified, defaults to "NHWC"
-func BiasAddGradDataFormat(value string) BiasAddGradAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// The backward operation for "BiasAdd" on the "bias" tensor.
-//
-// It accumulates all the values from out_backprop into the feature dimension.
-// For NHWC data format, the feature dimension is the last. For NCHW data format,
-// the feature dimension is the third-to-last.
-//
-// Arguments:
-//	out_backprop: Any number of dimensions.
+// Returns the truth value of x OR y element-wise.
 //
-// Returns 1-D with size the feature dimension of `out_backprop`.
-func BiasAddGrad(scope *Scope, out_backprop tf.Output, optional ...BiasAddGradAttr) (output tf.Output) {
+// *NOTE*: `LogicalOr` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func LogicalOr(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "BiasAddGrad",
+		Type: "LogicalOr",
 		Input: []tf.Input{
-			out_backprop,
+			x, y,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns x + y element-wise.
+// Compute the regularized incomplete beta integral \\(I_x(a, b)\\).
 //
-// *NOTE*: `Add` supports broadcasting. `AddN` does not. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func AddV2(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "AddV2",
-		Input: []tf.Input{
-			x, y,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Returns x + y element-wise.
+// The regularized incomplete beta integral is defined as:
 //
-// *NOTE*: `Add` supports broadcasting. `AddN` does not. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Add(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Add",
-		Input: []tf.Input{
-			x, y,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// NthElementAttr is an optional argument to NthElement.
-type NthElementAttr func(optionalAttr)
-
-// NthElementReverse sets the optional reverse attribute to value.
 //
-// value: When set to True, find the nth-largest value in the vector and vice
-// versa.
-// If not specified, defaults to false
-func NthElementReverse(value bool) NthElementAttr {
-	return func(m optionalAttr) {
-		m["reverse"] = value
-	}
-}
-
-// Finds values of the `n`-th order statistic for the last dimension.
+// \\(I_x(a, b) = \frac{B(x; a, b)}{B(a, b)}\\)
 //
-// If the input is a vector (rank-1), finds the entries which is the nth-smallest
-// value in the vector and outputs their values as scalar tensor.
+// where
 //
-// For matrices (resp. higher rank input), computes the entries which is the
-// nth-smallest value in each row (resp. vector along the last dimension). Thus,
 //
-//     values.shape = input.shape[:-1]
+// \\(B(x; a, b) = \int_0^x t^{a-1} (1 - t)^{b-1} dt\\)
 //
-// Arguments:
-//	input: 1-D or higher with last dimension at least `n+1`.
-//	n: 0-D. Position of sorted vector to select along the last dimension (along
-// each row for matrices). Valid range of n is `[0, input.shape[:-1])`
 //
-// Returns The `n`-th order statistic along each last dimensional slice.
-func NthElement(scope *Scope, input tf.Output, n tf.Output, optional ...NthElementAttr) (values tf.Output) {
+// is the incomplete beta function and \\(B(a, b)\\) is the *complete*
+// beta function.
+func Betainc(scope *Scope, a tf.Output, b tf.Output, x tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "NthElement",
+		Type: "Betainc",
 		Input: []tf.Input{
-			input, n,
+			a, b, x,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the Max along segments of a tensor.
-//
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// This operator is similar to the [unsorted segment sum operator](../../../api_docs/python/math_ops.md#UnsortedSegmentSum).
-// Instead of computing the sum over segments, it computes the maximum
-// such that:
+// Computes the sum along sparse segments of a tensor divided by the sqrt of N.
 //
-// \\(output_i = \max_j data_j\\) where max is over `j` such
-// that `segment_ids[j] == i`.
+// N is the size of the segment being reduced.
 //
-// If the maximum is empty for a given segment ID `i`, it outputs the smallest possible value for specific numeric type,
-//  `output[i] = numeric_limits<T>::min()`.
+// Like `SparseSegmentSqrtN`, but allows missing ids in `segment_ids`. If an id is
+// missing, the `output` tensor at that position will be zeroed.
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/UnsortedSegmentMax.png" alt>
-// </div>
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
 //
 // Arguments:
 //
-//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
-// first dimension.
-//
+//	indices: A 1-D tensor. Has same rank as `segment_ids`.
+//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
+//	num_segments: Should equal the number of distinct segment IDs.
 //
 // Returns Has same shape as data, except for dimension 0 which
-// has size `num_segments`.
-func UnsortedSegmentMax(scope *Scope, data tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
+// has size `k`, the number of segments.
+func SparseSegmentSqrtNWithNumSegments(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "UnsortedSegmentMax",
+		Type: "SparseSegmentSqrtNWithNumSegments",
 		Input: []tf.Input{
-			data, segment_ids, num_segments,
+			data, indices, segment_ids, num_segments,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes exponential of x element-wise.  \\(y = e^x\\).
-func Exp(scope *Scope, x tf.Output) (y tf.Output) {
+// Compute the upper regularized incomplete Gamma function `Q(a, x)`.
+//
+// The upper regularized incomplete Gamma function is defined as:
+//
+// \\(Q(a, x) = Gamma(a, x) / Gamma(a) = 1 - P(a, x)\\)
+//
+// where
+//
+// \\(Gamma(a, x) = \int_{x}^{\infty} t^{a-1} exp(-t) dt\\)
+//
+// is the upper incomplete Gamma function.
+//
+// Note, above `P(a, x)` (`Igamma`) is the lower regularized incomplete
+// Gamma function.
+func Igammac(scope *Scope, a tf.Output, x tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Exp",
+		Type: "Igammac",
 		Input: []tf.Input{
-			x,
+			a, x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns an element-wise indication of the sign of a number.
-//
-// `y = sign(x) = -1` if `x < 0`; 0 if `x == 0`; 1 if `x > 0`.
+// LogUniformCandidateSamplerAttr is an optional argument to LogUniformCandidateSampler.
+type LogUniformCandidateSamplerAttr func(optionalAttr)
+
+// LogUniformCandidateSamplerSeed sets the optional seed attribute to value.
 //
-// For complex numbers, `y = sign(x) = x / |x|` if `x != 0`, otherwise `y = 0`.
-func Sign(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Sign",
-		Input: []tf.Input{
-			x,
-		},
+// value: If either seed or seed2 are set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func LogUniformCandidateSamplerSeed(value int64) LogUniformCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// QuantizedAddAttr is an optional argument to QuantizedAdd.
-type QuantizedAddAttr func(optionalAttr)
-
-// QuantizedAddToutput sets the optional Toutput attribute to value.
-// If not specified, defaults to DT_QINT32
-func QuantizedAddToutput(value tf.DataType) QuantizedAddAttr {
+// LogUniformCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+//
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func LogUniformCandidateSamplerSeed2(value int64) LogUniformCandidateSamplerAttr {
 	return func(m optionalAttr) {
-		m["Toutput"] = value
+		m["seed2"] = value
 	}
 }
 
-// Returns x + y element-wise, working on quantized buffers.
+// Generates labels for candidate sampling with a log-uniform distribution.
 //
-// Arguments:
+// See explanations of candidate sampling and the data formats at
+// go/candidate-sampling.
 //
+// For each batch, this op picks a single set of sampled candidate labels.
 //
-//	min_x: The float value that the lowest quantized `x` value represents.
-//	max_x: The float value that the highest quantized `x` value represents.
-//	min_y: The float value that the lowest quantized `y` value represents.
-//	max_y: The float value that the highest quantized `y` value represents.
+// The advantages of sampling candidates per-batch are simplicity and the
+// possibility of efficient dense matrix multiplication. The disadvantage is that
+// the sampled candidates must be chosen independently of the context and of the
+// true labels.
 //
-// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
+// Arguments:
+//	true_classes: A batch_size * num_true matrix, in which each row contains the
+// IDs of the num_true target_classes in the corresponding original label.
+//	num_true: Number of true labels per context.
+//	num_sampled: Number of candidates to randomly sample.
+//	unique: If unique is true, we sample with rejection, so that all sampled
+// candidates in a batch are unique. This requires some approximation to
+// estimate the post-rejection sampling probabilities.
+//	range_max: The sampler will sample integers from the interval [0, range_max).
 //
-// *NOTE*: `QuantizedAdd` supports limited forms of broadcasting. More about
-// broadcasting [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func QuantizedAdd(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x tf.Output, min_y tf.Output, max_y tf.Output, optional ...QuantizedAddAttr) (z tf.Output, min_z tf.Output, max_z tf.Output) {
+// Returns A vector of length num_sampled, in which each element is
+// the ID of a sampled candidate. A batch_size * num_true matrix, representing
+// the number of times each candidate is expected to occur in a batch
+// of sampled candidates. If unique=true, then this is a probability. A vector of length num_sampled, for each sampled
+// candidate representing the number of times the candidate is expected
+// to occur in a batch of sampled candidates.  If unique=true, then this is a
+// probability.
+func LogUniformCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...LogUniformCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QuantizedAdd",
+		Type: "LogUniformCandidateSampler",
 		Input: []tf.Input{
-			x, y, min_x, max_x, min_y, max_y,
+			true_classes,
 		},
 		Attrs: attrs,
 	}
@@ -2174,27 +2063,19 @@ func QuantizedAdd(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// ArgMinAttr is an optional argument to ArgMin.
-type ArgMinAttr func(optionalAttr)
+// ApproximateEqualAttr is an optional argument to ApproximateEqual.
+type ApproximateEqualAttr func(optionalAttr)
 
-// ArgMinOutputType sets the optional output_type attribute to value.
-// If not specified, defaults to DT_INT64
-func ArgMinOutputType(value tf.DataType) ArgMinAttr {
+// ApproximateEqualTolerance sets the optional tolerance attribute to value.
+// If not specified, defaults to 1e-05
+func ApproximateEqualTolerance(value float32) ApproximateEqualAttr {
 	return func(m optionalAttr) {
-		m["output_type"] = value
+		m["tolerance"] = value
 	}
 }
 
-// Returns the index with the smallest value across dimensions of a tensor.
-//
-// Note that in case of ties the identity of the return value is not guaranteed.
-//
-// Arguments:
-//
-//	dimension: int32 or int64, must be in the range `[-rank(input), rank(input))`.
-// Describes which dimension of the input Tensor to reduce across. For vectors,
-// use dimension = 0.
-func ArgMin(scope *Scope, input tf.Output, dimension tf.Output, optional ...ArgMinAttr) (output tf.Output) {
+// Returns the truth value of abs(x-y) < tolerance element-wise.
+func ApproximateEqual(scope *Scope, x tf.Output, y tf.Output, optional ...ApproximateEqualAttr) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -2203,9 +2084,9 @@ func ArgMin(scope *Scope, input tf.Output, dimension tf.Output, optional ...ArgM
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ArgMin",
+		Type: "ApproximateEqual",
 		Input: []tf.Input{
-			input, dimension,
+			x, y,
 		},
 		Attrs: attrs,
 	}
@@ -2213,179 +2094,221 @@ func ArgMin(scope *Scope, input tf.Output, dimension tf.Output, optional ...ArgM
 	return op.Output(0)
 }
 
-// Convert the quantized 'input' tensor into a lower-precision 'output', using the
+// Returns x / y element-wise.
 //
-// output range specified with 'requested_output_min' and 'requested_output_max'.
+// *NOTE*: `Div` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func Div(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Div",
+		Input: []tf.Input{
+			x, y,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Returns x * y element-wise.
 //
-// [input_min, input_max] are scalar floats that specify the range for the float
-// interpretation of the 'input' data. For example, if input_min is -1.0f and
-// input_max is 1.0f, and we are dealing with quint16 quantized data, then a 0
-// value in the 16-bit data should be interpreted as -1.0f, and a 65535 means 1.0f.
-//
-// Arguments:
-//
-//	input_min: The float value that the minimum quantized input value represents.
-//	input_max: The float value that the maximum quantized input value represents.
-//	requested_output_min: The float value that the minimum quantized output value represents.
-//	requested_output_max: The float value that the maximum quantized output value represents.
-//	out_type: The type of the output. Should be a lower bit depth than Tinput.
-//
-// Returns The requested_output_min value is copied into this output.The requested_output_max value is copied into this output.
-func Requantize(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, requested_output_min tf.Output, requested_output_max tf.Output, out_type tf.DataType) (output tf.Output, output_min tf.Output, output_max tf.Output) {
+// *NOTE*: `Mul` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func Mul(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"out_type": out_type}
 	opspec := tf.OpSpec{
-		Type: "Requantize",
+		Type: "Mul",
 		Input: []tf.Input{
-			input, input_min, input_max, requested_output_min, requested_output_max,
+			x, y,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// Computes the determinant of one or more square matrices.
+// BiasAddAttr is an optional argument to BiasAdd.
+type BiasAddAttr func(optionalAttr)
+
+// BiasAddDataFormat sets the optional data_format attribute to value.
 //
-// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices. The output is a tensor containing the determinants
-// for all input submatrices `[..., :, :]`.
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the bias tensor will be added to the last dimension
+// of the value tensor.
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// The tensor will be added to "in_channels", the third-to-the-last
+//     dimension.
+// If not specified, defaults to "NHWC"
+func BiasAddDataFormat(value string) BiasAddAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Adds `bias` to `value`.
+//
+// This is a special case of `tf.add` where `bias` is restricted to be 1-D.
+// Broadcasting is supported, so `value` may have any number of dimensions.
 //
 // Arguments:
-//	input: Shape is `[..., M, M]`.
+//	value: Any number of dimensions.
+//	bias: 1-D with size the last dimension of `value`.
 //
-// Returns Shape is `[...]`.
-func MatrixDeterminant(scope *Scope, input tf.Output) (output tf.Output) {
+// Returns Broadcasted sum of `value` and `bias`.
+func BiasAdd(scope *Scope, value tf.Output, bias tf.Output, optional ...BiasAddAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "MatrixDeterminant",
+		Type: "BiasAdd",
 		Input: []tf.Input{
-			input,
+			value, bias,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes sin of x element-wise.
-func Sin(scope *Scope, x tf.Output) (y tf.Output) {
+// SparseReduceSumSparseAttr is an optional argument to SparseReduceSumSparse.
+type SparseReduceSumSparseAttr func(optionalAttr)
+
+// SparseReduceSumSparseKeepDims sets the optional keep_dims attribute to value.
+//
+// value: If true, retain reduced dimensions with length 1.
+// If not specified, defaults to false
+func SparseReduceSumSparseKeepDims(value bool) SparseReduceSumSparseAttr {
+	return func(m optionalAttr) {
+		m["keep_dims"] = value
+	}
+}
+
+// Computes the sum of elements across dimensions of a SparseTensor.
+//
+// This Op takes a SparseTensor and is the sparse counterpart to
+// `tf.reduce_sum()`.  In contrast to SparseReduceSum, this Op returns a
+// SparseTensor.
+//
+// Reduces `sp_input` along the dimensions given in `reduction_axes`.  Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `reduction_axes`. If `keep_dims` is true, the reduced dimensions are retained
+// with length 1.
+//
+// If `reduction_axes` has no entries, all dimensions are reduced, and a tensor
+// with a single element is returned.  Additionally, the axes can be negative,
+// which are interpreted according to the indexing rules in Python.
+//
+// Arguments:
+//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, possibly not in canonical ordering.
+//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
+//	input_shape: 1-D.  Shape of the input SparseTensor.
+//	reduction_axes: 1-D.  Length-`K` vector containing the reduction axes.
+func SparseReduceSumSparse(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output, reduction_axes tf.Output, optional ...SparseReduceSumSparseAttr) (output_indices tf.Output, output_values tf.Output, output_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Sin",
+		Type: "SparseReduceSumSparse",
 		Input: []tf.Input{
-			x,
+			input_indices, input_values, input_shape, reduction_axes,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Computes the complementary error function of `x` element-wise.
-func Erfc(scope *Scope, x tf.Output) (y tf.Output) {
+// Returns x + y element-wise.
+//
+// *NOTE*: `Add` supports broadcasting. `AddN` does not. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func AddV2(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Erfc",
+		Type: "AddV2",
 		Input: []tf.Input{
-			x,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes Psi, the derivative of Lgamma (the log of the absolute value of
+// Returns x + y element-wise.
 //
-// `Gamma(x)`), element-wise.
-func Digamma(scope *Scope, x tf.Output) (y tf.Output) {
+// *NOTE*: `Add` supports broadcasting. `AddN` does not. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func Add(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Digamma",
+		Type: "Add",
 		Input: []tf.Input{
-			x,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Conv2DBackpropFilterAttr is an optional argument to Conv2DBackpropFilter.
-type Conv2DBackpropFilterAttr func(optionalAttr)
-
-// Conv2DBackpropFilterUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
-// If not specified, defaults to true
-func Conv2DBackpropFilterUseCudnnOnGpu(value bool) Conv2DBackpropFilterAttr {
-	return func(m optionalAttr) {
-		m["use_cudnn_on_gpu"] = value
-	}
-}
+// NthElementAttr is an optional argument to NthElement.
+type NthElementAttr func(optionalAttr)
 
-// Conv2DBackpropFilterDataFormat sets the optional data_format attribute to value.
+// NthElementReverse sets the optional reverse attribute to value.
 //
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func Conv2DBackpropFilterDataFormat(value string) Conv2DBackpropFilterAttr {
+// value: When set to True, find the nth-largest value in the vector and vice
+// versa.
+// If not specified, defaults to false
+func NthElementReverse(value bool) NthElementAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["reverse"] = value
 	}
 }
 
-// Conv2DBackpropFilterDilations sets the optional dilations attribute to value.
+// Finds values of the `n`-th order statistic for the last dimension.
 //
-// value: 1-D tensor of length 4.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each filter
-// element on that dimension. The dimension order is determined by the value of
-// `data_format`, see above for details. Dilations in the batch and depth
-// dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 >
-func Conv2DBackpropFilterDilations(value []int64) Conv2DBackpropFilterAttr {
-	return func(m optionalAttr) {
-		m["dilations"] = value
-	}
-}
-
-// Computes the gradients of convolution with respect to the filter.
+// If the input is a vector (rank-1), finds the entry that is the nth-smallest
+// value in the vector and outputs its value as a scalar tensor.
+//
+// For matrices (resp. higher rank input), computes the entry that is the
+// nth-smallest value in each row (resp. vector along the last dimension). Thus,
+//
+//     values.shape = input.shape[:-1]
 //
 // Arguments:
-//	input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
-//	filter_sizes: An integer vector representing the tensor shape of `filter`,
-// where `filter` is a 4-D
-// `[filter_height, filter_width, in_channels, out_channels]` tensor.
-//	out_backprop: 4-D with shape `[batch, out_height, out_width, out_channels]`.
-// Gradients w.r.t. the output of the convolution.
-//	strides: The stride of the sliding window for each dimension of the input
-// of the convolution. Must be in the same order as the dimension specified with
-// format.
-//	padding: The type of padding algorithm to use.
+//	input: 1-D or higher with last dimension at least `n+1`.
+//	n: 0-D. Position of sorted vector to select along the last dimension (along
+// each row for matrices). Valid range of n is `[0, input.shape[-1])`
 //
-// Returns 4-D with shape
-// `[filter_height, filter_width, in_channels, out_channels]`.  Gradient w.r.t.
-// the `filter` input of the convolution.
-func Conv2DBackpropFilter(scope *Scope, input tf.Output, filter_sizes tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...Conv2DBackpropFilterAttr) (output tf.Output) {
+// Returns The `n`-th order statistic along each last dimensional slice.
+func NthElement(scope *Scope, input tf.Output, n tf.Output, optional ...NthElementAttr) (values tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Conv2DBackpropFilter",
+		Type: "NthElement",
 		Input: []tf.Input{
-			input, filter_sizes, out_backprop,
+			input, n,
 		},
 		Attrs: attrs,
 	}
@@ -2393,31 +2316,54 @@ func Conv2DBackpropFilter(scope *Scope, input tf.Output, filter_sizes tf.Output,
 	return op.Output(0)
 }
 
-// Returns the number of work units this Reader has finished processing.
+// Computes the Max along segments of a tensor.
+//
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
+//
+// This operator is similar to the [unsorted segment sum operator](../../../api_docs/python/math_ops.md#UnsortedSegmentSum).
+// Instead of computing the sum over segments, it computes the maximum
+// such that:
+//
+// \\(output_i = \max_j data_j\\) where max is over `j` such
+// that `segment_ids[j] == i`.
+//
+// If the maximum is empty for a given segment ID `i`, it outputs the smallest possible value for the specific numeric type,
+//  `output[i] = numeric_limits<T>::min()`.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/UnsortedSegmentMax.png" alt>
+// </div>
 //
 // Arguments:
-//	reader_handle: Handle to a Reader.
-func ReaderNumWorkUnitsCompletedV2(scope *Scope, reader_handle tf.Output) (units_completed tf.Output) {
+//
+//	segment_ids: A 1-D tensor whose size is equal to the size of `data`'s
+// first dimension.
+//
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `num_segments`.
+func UnsortedSegmentMax(scope *Scope, data tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ReaderNumWorkUnitsCompletedV2",
+		Type: "UnsortedSegmentMax",
 		Input: []tf.Input{
-			reader_handle,
+			data, segment_ids, num_segments,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the log of the absolute value of `Gamma(x)` element-wise.
-func Lgamma(scope *Scope, x tf.Output) (y tf.Output) {
+// Computes exponential of x element-wise.  \\(y = e^x\\).
+func Exp(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Lgamma",
+		Type: "Exp",
 		Input: []tf.Input{
 			x,
 		},
@@ -2426,41 +2372,17 @@ func Lgamma(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
-// Computes the reverse mode backpropagated gradient of the Cholesky algorithm.
+// Returns an element-wise indication of the sign of a number.
 //
-// For an explanation see "Differentiation of the Cholesky algorithm" by
-// Iain Murray http://arxiv.org/abs/1602.07527.
+// `y = sign(x) = -1` if `x < 0`; 0 if `x == 0`; 1 if `x > 0`.
 //
-// Arguments:
-//	l: Output of batch Cholesky algorithm l = cholesky(A). Shape is `[..., M, M]`.
-// Algorithm depends only on lower triangular part of the innermost matrices of
-// this tensor.
-//	grad: df/dl where f is some scalar function. Shape is `[..., M, M]`.
-// Algorithm depends only on lower triangular part of the innermost matrices of
-// this tensor.
-//
-// Returns Symmetrized version of df/dA . Shape is `[..., M, M]`
-func CholeskyGrad(scope *Scope, l tf.Output, grad tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "CholeskyGrad",
-		Input: []tf.Input{
-			l, grad,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes inverse hyperbolic cosine of x element-wise.
-func Acosh(scope *Scope, x tf.Output) (y tf.Output) {
+// For complex numbers, `y = sign(x) = x / |x|` if `x != 0`, otherwise `y = 0`.
+func Sign(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Acosh",
+		Type: "Sign",
 		Input: []tf.Input{
 			x,
 		},
@@ -2469,35 +2391,32 @@ func Acosh(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
-// SerializeManySparseAttr is an optional argument to SerializeManySparse.
-type SerializeManySparseAttr func(optionalAttr)
+// QuantizedAddAttr is an optional argument to QuantizedAdd.
+type QuantizedAddAttr func(optionalAttr)
 
-// SerializeManySparseOutType sets the optional out_type attribute to value.
-//
-// value: The `dtype` to use for serialization; the supported types are `string`
-// (default) and `variant`.
-// If not specified, defaults to DT_STRING
-func SerializeManySparseOutType(value tf.DataType) SerializeManySparseAttr {
+// QuantizedAddToutput sets the optional Toutput attribute to value.
+// If not specified, defaults to DT_QINT32
+func QuantizedAddToutput(value tf.DataType) QuantizedAddAttr {
 	return func(m optionalAttr) {
-		m["out_type"] = value
+		m["Toutput"] = value
 	}
 }
 
-// Serialize an `N`-minibatch `SparseTensor` into an `[N, 3]` `Tensor` object.
+// Returns x + y element-wise, working on quantized buffers.
 //
-// The `SparseTensor` must have rank `R` greater than 1, and the first dimension
-// is treated as the minibatch dimension.  Elements of the `SparseTensor`
-// must be sorted in increasing order of this first dimension.  The serialized
-// `SparseTensor` objects going into each row of `serialized_sparse` will have
-// rank `R-1`.
+// Arguments:
 //
-// The minibatch size `N` is extracted from `sparse_shape[0]`.
 //
-// Arguments:
-//	sparse_indices: 2-D.  The `indices` of the minibatch `SparseTensor`.
-//	sparse_values: 1-D.  The `values` of the minibatch `SparseTensor`.
-//	sparse_shape: 1-D.  The `shape` of the minibatch `SparseTensor`.
-func SerializeManySparse(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...SerializeManySparseAttr) (serialized_sparse tf.Output) {
+//	min_x: The float value that the lowest quantized `x` value represents.
+//	max_x: The float value that the highest quantized `x` value represents.
+//	min_y: The float value that the lowest quantized `y` value represents.
+//	max_y: The float value that the highest quantized `y` value represents.
+//
+// Returns The float value that the lowest quantized output value represents. The float value that the highest quantized output value represents.
+//
+// *NOTE*: `QuantizedAdd` supports limited forms of broadcasting. More about
+// broadcasting [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func QuantizedAdd(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x tf.Output, min_y tf.Output, max_y tf.Output, optional ...QuantizedAddAttr) (z tf.Output, min_z tf.Output, max_z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -2506,66 +2425,48 @@ func SerializeManySparse(scope *Scope, sparse_indices tf.Output, sparse_values t
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SerializeManySparse",
+		Type: "QuantizedAdd",
 		Input: []tf.Input{
-			sparse_indices, sparse_values, sparse_shape,
+			x, y, min_x, max_x, min_y, max_y,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// TensorArrayV2Attr is an optional argument to TensorArrayV2.
-type TensorArrayV2Attr func(optionalAttr)
-
-// TensorArrayV2ElementShape sets the optional element_shape attribute to value.
-// If not specified, defaults to <unknown_rank:true >
-func TensorArrayV2ElementShape(value tf.Shape) TensorArrayV2Attr {
-	return func(m optionalAttr) {
-		m["element_shape"] = value
-	}
-}
-
-// TensorArrayV2DynamicSize sets the optional dynamic_size attribute to value.
-// If not specified, defaults to false
-func TensorArrayV2DynamicSize(value bool) TensorArrayV2Attr {
-	return func(m optionalAttr) {
-		m["dynamic_size"] = value
-	}
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// TensorArrayV2ClearAfterRead sets the optional clear_after_read attribute to value.
-// If not specified, defaults to true
-func TensorArrayV2ClearAfterRead(value bool) TensorArrayV2Attr {
-	return func(m optionalAttr) {
-		m["clear_after_read"] = value
-	}
-}
+// ArgMinAttr is an optional argument to ArgMin.
+type ArgMinAttr func(optionalAttr)
 
-// TensorArrayV2TensorArrayName sets the optional tensor_array_name attribute to value.
-// If not specified, defaults to ""
-func TensorArrayV2TensorArrayName(value string) TensorArrayV2Attr {
+// ArgMinOutputType sets the optional output_type attribute to value.
+// If not specified, defaults to DT_INT64
+func ArgMinOutputType(value tf.DataType) ArgMinAttr {
 	return func(m optionalAttr) {
-		m["tensor_array_name"] = value
+		m["output_type"] = value
 	}
 }
 
-// Deprecated. Use TensorArrayV3
+// Returns the index with the smallest value across dimensions of a tensor.
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayV3
-func TensorArrayV2(scope *Scope, size tf.Output, dtype tf.DataType, optional ...TensorArrayV2Attr) (handle tf.Output) {
+// Note that in case of ties the identity of the return value is not guaranteed.
+//
+// Arguments:
+//
+//	dimension: int32 or int64, must be in the range `[-rank(input), rank(input))`.
+// Describes which dimension of the input Tensor to reduce across. For vectors,
+// use dimension = 0.
+func ArgMin(scope *Scope, input tf.Output, dimension tf.Output, optional ...ArgMinAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayV2",
+		Type: "ArgMin",
 		Input: []tf.Input{
-			size,
+			input, dimension,
 		},
 		Attrs: attrs,
 	}
@@ -2573,77 +2474,86 @@ func TensorArrayV2(scope *Scope, size tf.Output, dtype tf.DataType, optional ...
 	return op.Output(0)
 }
 
-// Computes the mean along sparse segments of a tensor.
+// Convert the quantized 'input' tensor into a lower-precision 'output', using the
 //
-// Like `SparseSegmentMean`, but allows missing ids in `segment_ids`. If an id is
-// misisng, the `output` tensor at that position will be zeroed.
+// output range specified with 'requested_output_min' and 'requested_output_max'.
 //
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
+// [input_min, input_max] are scalar floats that specify the range for the float
+// interpretation of the 'input' data. For example, if input_min is -1.0f and
+// input_max is 1.0f, and we are dealing with quint16 quantized data, then a 0
+// value in the 16-bit data should be interpreted as -1.0f, and a 65535 means 1.0f.
 //
 // Arguments:
 //
-//	indices: A 1-D tensor. Has same rank as `segment_ids`.
-//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
-//	num_segments: Should equal the number of distinct segment IDs.
+//	input_min: The float value that the minimum quantized input value represents.
+//	input_max: The float value that the maximum quantized input value represents.
+//	requested_output_min: The float value that the minimum quantized output value represents.
+//	requested_output_max: The float value that the maximum quantized output value represents.
+//	out_type: The type of the output. Should be a lower bit depth than Tinput.
 //
-// Returns Has same shape as data, except for dimension 0 which has size
-// `num_segments`.
-func SparseSegmentMeanWithNumSegments(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
+// Returns The requested_output_min value is copied into this output. The requested_output_max value is copied into this output.
+func Requantize(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, requested_output_min tf.Output, requested_output_max tf.Output, out_type tf.DataType) (output tf.Output, output_min tf.Output, output_max tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"out_type": out_type}
 	opspec := tf.OpSpec{
-		Type: "SparseSegmentMeanWithNumSegments",
+		Type: "Requantize",
 		Input: []tf.Input{
-			data, indices, segment_ids, num_segments,
+			input, input_min, input_max, requested_output_min, requested_output_max,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Computes hyperbolic cosine of x element-wise.
-func Cosh(scope *Scope, x tf.Output) (y tf.Output) {
+// Computes the determinant of one or more square matrices.
+//
+// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices. The output is a tensor containing the determinants
+// for all input submatrices `[..., :, :]`.
+//
+// Arguments:
+//	input: Shape is `[..., M, M]`.
+//
+// Returns Shape is `[...]`.
+func MatrixDeterminant(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Cosh",
+		Type: "MatrixDeterminant",
 		Input: []tf.Input{
-			x,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a dataset that emits each dim-0 slice of `components` once.
-func TensorSliceDataset(scope *Scope, components []tf.Output, output_shapes []tf.Shape) (handle tf.Output) {
+// Computes sin of x element-wise.
+func Sin(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "TensorSliceDataset",
+		Type: "Sin",
 		Input: []tf.Input{
-			tf.OutputList(components),
+			x,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes natural logarithm of (1 + x) element-wise.
-//
-// I.e., \\(y = \log_e (1 + x)\\).
-func Log1p(scope *Scope, x tf.Output) (y tf.Output) {
+// Computes the complementary error function of `x` element-wise.
+func Erfc(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Log1p",
+		Type: "Erfc",
 		Input: []tf.Input{
 			x,
 		},
@@ -2652,50 +2562,282 @@ func Log1p(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
-// Computes rectified linear 6 gradients for a Relu6 operation.
-//
-// Arguments:
-//	gradients: The backpropagated gradients to the corresponding Relu6 operation.
-//	features: The features passed as input to the corresponding Relu6 operation, or
-// its output; using either one produces the same result.
+// Computes Psi, the derivative of Lgamma (the log of the absolute value of
 //
-// Returns The gradients:
-// `gradients * (features > 0) * (features < 6)`.
-func Relu6Grad(scope *Scope, gradients tf.Output, features tf.Output) (backprops tf.Output) {
+// `Gamma(x)`), element-wise.
+func Digamma(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Relu6Grad",
+		Type: "Digamma",
 		Input: []tf.Input{
-			gradients, features,
+			x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResizeBicubicAttr is an optional argument to ResizeBicubic.
-type ResizeBicubicAttr func(optionalAttr)
+// Conv2DBackpropFilterAttr is an optional argument to Conv2DBackpropFilter.
+type Conv2DBackpropFilterAttr func(optionalAttr)
 
-// ResizeBicubicAlignCorners sets the optional align_corners attribute to value.
+// Conv2DBackpropFilterUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
+// If not specified, defaults to true
+func Conv2DBackpropFilterUseCudnnOnGpu(value bool) Conv2DBackpropFilterAttr {
+	return func(m optionalAttr) {
+		m["use_cudnn_on_gpu"] = value
+	}
+}
+
+// Conv2DBackpropFilterDataFormat sets the optional data_format attribute to value.
 //
-// value: If true, rescale input by (new_height - 1) / (height - 1), which
-// exactly aligns the 4 corners of images and resized images. If false, rescale
-// by new_height / height. Treat similarly the width dimension.
-// If not specified, defaults to false
-func ResizeBicubicAlignCorners(value bool) ResizeBicubicAttr {
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func Conv2DBackpropFilterDataFormat(value string) Conv2DBackpropFilterAttr {
 	return func(m optionalAttr) {
-		m["align_corners"] = value
+		m["data_format"] = value
 	}
 }
 
-// Resize `images` to `size` using bicubic interpolation.
+// Conv2DBackpropFilterDilations sets the optional dilations attribute to value.
 //
-// Input images can be of different types but output images are always float.
+// value: 1-D tensor of length 4.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each filter
+// element on that dimension. The dimension order is determined by the value of
+// `data_format`, see above for details. Dilations in the batch and depth
+// dimensions must be 1.
+// If not specified, defaults to <i:1 i:1 i:1 i:1 >
+func Conv2DBackpropFilterDilations(value []int64) Conv2DBackpropFilterAttr {
+	return func(m optionalAttr) {
+		m["dilations"] = value
+	}
+}
+
+// Computes the gradients of convolution with respect to the filter.
 //
 // Arguments:
-//	images: 4-D with shape `[batch, height, width, channels]`.
+//	input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
+//	filter_sizes: An integer vector representing the tensor shape of `filter`,
+// where `filter` is a 4-D
+// `[filter_height, filter_width, in_channels, out_channels]` tensor.
+//	out_backprop: 4-D with shape `[batch, out_height, out_width, out_channels]`.
+// Gradients w.r.t. the output of the convolution.
+//	strides: The stride of the sliding window for each dimension of the input
+// of the convolution. Must be in the same order as the dimension specified with
+// format.
+//	padding: The type of padding algorithm to use.
+//
+// Returns 4-D with shape
+// `[filter_height, filter_width, in_channels, out_channels]`.  Gradient w.r.t.
+// the `filter` input of the convolution.
+func Conv2DBackpropFilter(scope *Scope, input tf.Output, filter_sizes tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...Conv2DBackpropFilterAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "Conv2DBackpropFilter",
+		Input: []tf.Input{
+			input, filter_sizes, out_backprop,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Returns the number of work units this Reader has finished processing.
+//
+// Arguments:
+//	reader_handle: Handle to a Reader.
+func ReaderNumWorkUnitsCompletedV2(scope *Scope, reader_handle tf.Output) (units_completed tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ReaderNumWorkUnitsCompletedV2",
+		Input: []tf.Input{
+			reader_handle,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes the log of the absolute value of `Gamma(x)` element-wise.
+func Lgamma(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Lgamma",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes the reverse mode backpropagated gradient of the Cholesky algorithm.
+//
+// For an explanation see "Differentiation of the Cholesky algorithm" by
+// Iain Murray http://arxiv.org/abs/1602.07527.
+//
+// Arguments:
+//	l: Output of batch Cholesky algorithm l = cholesky(A). Shape is `[..., M, M]`.
+// Algorithm depends only on lower triangular part of the innermost matrices of
+// this tensor.
+//	grad: df/dl where f is some scalar function. Shape is `[..., M, M]`.
+// Algorithm depends only on lower triangular part of the innermost matrices of
+// this tensor.
+//
+// Returns Symmetrized version of df/dA. Shape is `[..., M, M]`.
+func CholeskyGrad(scope *Scope, l tf.Output, grad tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "CholeskyGrad",
+		Input: []tf.Input{
+			l, grad,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes the mean along sparse segments of a tensor.
+//
+// Like `SparseSegmentMean`, but allows missing ids in `segment_ids`. If an id is
+// missing, the `output` tensor at that position will be zeroed.
+//
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
+//
+// Arguments:
+//
+//	indices: A 1-D tensor. Has same rank as `segment_ids`.
+//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
+//	num_segments: Should equal the number of distinct segment IDs.
+//
+// Returns Has same shape as data, except for dimension 0 which has size
+// `num_segments`.
+func SparseSegmentMeanWithNumSegments(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SparseSegmentMeanWithNumSegments",
+		Input: []tf.Input{
+			data, indices, segment_ids, num_segments,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes hyperbolic cosine of x element-wise.
+func Cosh(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Cosh",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Creates a dataset that emits each dim-0 slice of `components` once.
+func TensorSliceDataset(scope *Scope, components []tf.Output, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "TensorSliceDataset",
+		Input: []tf.Input{
+			tf.OutputList(components),
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes natural logarithm of (1 + x) element-wise.
+//
+// I.e., \\(y = \log_e (1 + x)\\).
+func Log1p(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Log1p",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes rectified linear 6 gradients for a Relu6 operation.
+//
+// Arguments:
+//	gradients: The backpropagated gradients to the corresponding Relu6 operation.
+//	features: The features passed as input to the corresponding Relu6 operation, or
+// its output; using either one produces the same result.
+//
+// Returns The gradients:
+// `gradients * (features > 0) * (features < 6)`.
+func Relu6Grad(scope *Scope, gradients tf.Output, features tf.Output) (backprops tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Relu6Grad",
+		Input: []tf.Input{
+			gradients, features,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// ResizeBicubicAttr is an optional argument to ResizeBicubic.
+type ResizeBicubicAttr func(optionalAttr)
+
+// ResizeBicubicAlignCorners sets the optional align_corners attribute to value.
+//
+// value: If true, rescale input by (new_height - 1) / (height - 1), which
+// exactly aligns the 4 corners of images and resized images. If false, rescale
+// by new_height / height. Treat similarly the width dimension.
+// If not specified, defaults to false
+func ResizeBicubicAlignCorners(value bool) ResizeBicubicAttr {
+	return func(m optionalAttr) {
+		m["align_corners"] = value
+	}
+}
+
+// Resize `images` to `size` using bicubic interpolation.
+//
+// Input images can be of different types but output images are always float.
+//
+// Arguments:
+//	images: 4-D with shape `[batch, height, width, channels]`.
 //	size: A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
 // new size for the images.
 //
@@ -2860,85 +3002,30 @@ func Rsqrt(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
-// Inserts a dimension of 1 into a tensor's shape.
+// MatrixInverseAttr is an optional argument to MatrixInverse.
+type MatrixInverseAttr func(optionalAttr)
+
+// MatrixInverseAdjoint sets the optional adjoint attribute to value.
+// If not specified, defaults to false
+func MatrixInverseAdjoint(value bool) MatrixInverseAttr {
+	return func(m optionalAttr) {
+		m["adjoint"] = value
+	}
+}
+
+// Computes the inverse of one or more square invertible matrices or their
 //
-// Given a tensor `input`, this operation inserts a dimension of 1 at the
-// dimension index `axis` of `input`'s shape. The dimension index `axis` starts at
-// zero; if you specify a negative number for `axis` it is counted backward from
-// the end.
+// adjoints (conjugate transposes).
 //
-// This operation is useful if you want to add a batch dimension to a single
-// element. For example, if you have a single image of shape `[height, width,
-// channels]`, you can make it a batch of 1 image with `expand_dims(image, 0)`,
-// which will make the shape `[1, height, width, channels]`.
+// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices. The output is a tensor of the same shape as the input
+// containing the inverse for all input submatrices `[..., :, :]`.
 //
-// Other examples:
+// The op uses LU decomposition with partial pivoting to compute the inverses.
 //
-// ```
-// # 't' is a tensor of shape [2]
-// shape(expand_dims(t, 0)) ==> [1, 2]
-// shape(expand_dims(t, 1)) ==> [2, 1]
-// shape(expand_dims(t, -1)) ==> [2, 1]
-//
-// # 't2' is a tensor of shape [2, 3, 5]
-// shape(expand_dims(t2, 0)) ==> [1, 2, 3, 5]
-// shape(expand_dims(t2, 2)) ==> [2, 3, 1, 5]
-// shape(expand_dims(t2, 3)) ==> [2, 3, 5, 1]
-// ```
-//
-// This operation requires that:
-//
-// `-1-input.dims() <= dim <= input.dims()`
-//
-// This operation is related to `squeeze()`, which removes dimensions of
-// size 1.
-//
-// Arguments:
-//
-//	axis: 0-D (scalar). Specifies the dimension index at which to
-// expand the shape of `input`. Must be in the range
-// `[-rank(input) - 1, rank(input)]`.
-//
-// Returns Contains the same data as `input`, but its shape has an additional
-// dimension of size 1 added.
-func ExpandDims(scope *Scope, input tf.Output, axis tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "ExpandDims",
-		Input: []tf.Input{
-			input, axis,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// MatrixInverseAttr is an optional argument to MatrixInverse.
-type MatrixInverseAttr func(optionalAttr)
-
-// MatrixInverseAdjoint sets the optional adjoint attribute to value.
-// If not specified, defaults to false
-func MatrixInverseAdjoint(value bool) MatrixInverseAttr {
-	return func(m optionalAttr) {
-		m["adjoint"] = value
-	}
-}
-
-// Computes the inverse of one or more square invertible matrices or their
-//
-// adjoints (conjugate transposes).
-//
-// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices. The output is a tensor of the same shape as the input
-// containing the inverse for all input submatrices `[..., :, :]`.
-//
-// The op uses LU decomposition with partial pivoting to compute the inverses.
-//
-// If a matrix is not invertible there is no guarantee what the op does. It
-// may detect the condition and raise an exception or it may simply return a
-// garbage result.
+// If a matrix is not invertible there is no guarantee what the op does. It
+// may detect the condition and raise an exception or it may simply return a
+// garbage result.
 //
 // Arguments:
 //	input: Shape is `[..., M, M]`.
@@ -3373,30 +3460,6 @@ func QuantizedAvgPool(scope *Scope, input tf.Output, min_input tf.Output, max_in
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Updates the table to associates keys with values.
-//
-// The tensor `keys` must be of the same type as the keys of the table.
-// The tensor `values` must be of the type of the table values.
-//
-// Arguments:
-//	table_handle: Handle to the table.
-//	keys: Any shape.  Keys to look up.
-//	values: Values to associate with keys.
-//
-// Returns the created operation.
-func LookupTableInsertV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "LookupTableInsertV2",
-		Input: []tf.Input{
-			table_handle, keys, values,
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
 // FractionalAvgPoolAttr is an optional argument to FractionalAvgPool.
 type FractionalAvgPoolAttr func(optionalAttr)
 
@@ -3678,45 +3741,6 @@ func MatrixDiag(scope *Scope, diagonal tf.Output) (output tf.Output) {
 	return op.Output(0)
 }
 
-// Says whether the targets are in the top `K` predictions.
-//
-// This outputs a `batch_size` bool array, an entry `out[i]` is `true` if the
-// prediction for the target class is among the top `k` predictions among
-// all predictions for example `i`. Note that the behavior of `InTopK` differs
-// from the `TopK` op in its handling of ties; if multiple classes have the
-// same prediction value and straddle the top-`k` boundary, all of those
-// classes are considered to be in the top `k`.
-//
-// More formally, let
-//
-//   \\(predictions_i\\) be the predictions for all classes for example `i`,
-//   \\(targets_i\\) be the target class for example `i`,
-//   \\(out_i\\) be the output for example `i`,
-//
-// $$out_i = predictions_{i, targets_i} \in TopKIncludingTies(predictions_i)$$
-//
-// Arguments:
-//	predictions: A `batch_size` x `classes` tensor.
-//	targets: A `batch_size` vector of class ids.
-//	k: Number of top elements to look at for computing precision.
-//
-// Returns Computed Precision at `k` as a `bool Tensor`.
-func InTopK(scope *Scope, predictions tf.Output, targets tf.Output, k int64) (precision tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"k": k}
-	opspec := tf.OpSpec{
-		Type: "InTopK",
-		Input: []tf.Input{
-			predictions, targets,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Given a quantized tensor described by (input, input_min, input_max), outputs a
 //
 // range that covers the actual values present in that tensor.  This op is
@@ -3978,254 +4002,104 @@ func IsNan(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
-// FractionalAvgPoolGradAttr is an optional argument to FractionalAvgPoolGrad.
-type FractionalAvgPoolGradAttr func(optionalAttr)
-
-// FractionalAvgPoolGradOverlapping sets the optional overlapping attribute to value.
-//
-// value: When set to True, it means when pooling, the values at the boundary
-// of adjacent pooling cells are used by both cells. For example:
-//
-// `index  0  1  2  3  4`
-//
-// `value  20 5  16 3  7`
-//
-// If the pooling sequence is [0, 2, 4], then 16, at index 2 will be used twice.
-// The result would be [41/3, 26/3] for fractional avg pooling.
-// If not specified, defaults to false
-func FractionalAvgPoolGradOverlapping(value bool) FractionalAvgPoolGradAttr {
-	return func(m optionalAttr) {
-		m["overlapping"] = value
-	}
-}
-
-// Computes gradient of the FractionalAvgPool function.
-//
-// Unlike FractionalMaxPoolGrad, we don't need to find arg_max for
-// FractionalAvgPoolGrad, we just need to evenly back-propagate each element of
-// out_backprop to those indices that form the same pooling cell. Therefore, we
-// just need to know the shape of original input tensor, instead of the whole
-// tensor.
+// Computes rectified linear gradients for a Relu operation.
 //
 // Arguments:
-//	orig_input_tensor_shape: Original input tensor shape for `fractional_avg_pool`
-//	out_backprop: 4-D with shape `[batch, height, width, channels]`.  Gradients
-// w.r.t. the output of `fractional_avg_pool`.
-//	row_pooling_sequence: row pooling sequence, form pooling region with
-// col_pooling_sequence.
-//	col_pooling_sequence: column pooling sequence, form pooling region with
-// row_pooling sequence.
+//	gradients: The backpropagated gradients to the corresponding Relu operation.
+//	features: The features passed as input to the corresponding Relu operation, OR
+// the outputs of that operation (both work equivalently).
 //
-// Returns 4-D.  Gradients w.r.t. the input of `fractional_avg_pool`.
-func FractionalAvgPoolGrad(scope *Scope, orig_input_tensor_shape tf.Output, out_backprop tf.Output, row_pooling_sequence tf.Output, col_pooling_sequence tf.Output, optional ...FractionalAvgPoolGradAttr) (output tf.Output) {
+// Returns `gradients * (features > 0)`.
+func ReluGrad(scope *Scope, gradients tf.Output, features tf.Output) (backprops tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "FractionalAvgPoolGrad",
+		Type: "ReluGrad",
 		Input: []tf.Input{
-			orig_input_tensor_shape, out_backprop, row_pooling_sequence, col_pooling_sequence,
+			gradients, features,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes gradients for the exponential linear (Elu) operation.
+// Computes the gradient of morphological 2-D dilation with respect to the input.
 //
 // Arguments:
-//	gradients: The backpropagated gradients to the corresponding Elu operation.
-//	outputs: The outputs of the corresponding Elu operation.
+//	input: 4-D with shape `[batch, in_height, in_width, depth]`.
+//	filter: 3-D with shape `[filter_height, filter_width, depth]`.
+//	out_backprop: 4-D with shape `[batch, out_height, out_width, depth]`.
+//	strides: 1-D of length 4. The stride of the sliding window for each dimension of
+// the input tensor. Must be: `[1, stride_height, stride_width, 1]`.
+//	rates: 1-D of length 4. The input stride for atrous morphological dilation.
+// Must be: `[1, rate_height, rate_width, 1]`.
+//	padding: The type of padding algorithm to use.
 //
-// Returns The gradients: `gradients * (outputs + 1)` if outputs < 0,
-// `gradients` otherwise.
-func EluGrad(scope *Scope, gradients tf.Output, outputs tf.Output) (backprops tf.Output) {
+// Returns 4-D with shape `[batch, in_height, in_width, depth]`.
+func Dilation2DBackpropInput(scope *Scope, input tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, rates []int64, padding string) (in_backprop tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"strides": strides, "rates": rates, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "EluGrad",
+		Type: "Dilation2DBackpropInput",
 		Input: []tf.Input{
-			gradients, outputs,
+			input, filter, out_backprop,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Converts each string in the input Tensor to its hash mod by a number of buckets.
+// CTCBeamSearchDecoderAttr is an optional argument to CTCBeamSearchDecoder.
+type CTCBeamSearchDecoderAttr func(optionalAttr)
+
+// CTCBeamSearchDecoderMergeRepeated sets the optional merge_repeated attribute to value.
 //
-// The hash function is deterministic on the content of the string within the
-// process.
+// value: If true, merge repeated classes in output.
+// If not specified, defaults to true
+func CTCBeamSearchDecoderMergeRepeated(value bool) CTCBeamSearchDecoderAttr {
+	return func(m optionalAttr) {
+		m["merge_repeated"] = value
+	}
+}
+
+// Performs beam search decoding on the logits given in input.
 //
-// Note that the hash function may change from time to time.
-// This functionality will be deprecated and it's recommended to use
-// `tf.string_to_hash_bucket_fast()` or `tf.string_to_hash_bucket_strong()`.
+// A note about the attribute merge_repeated: For the beam search decoder,
+// this means that if consecutive entries in a beam are the same, only
+// the first of these is emitted.  That is, when the top path is "A B B B B",
+// "A B" is returned if merge_repeated = True but "A B B B B" is
+// returned if merge_repeated = False.
 //
 // Arguments:
+//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
+//	sequence_length: A vector containing sequence lengths, size `(batch)`.
+//	beam_width: A scalar >= 0 (beam search beam width).
+//	top_paths: A scalar >= 0, <= beam_width (controls output size).
 //
-//	num_buckets: The number of buckets.
-//
-// Returns A Tensor of the same shape as the input `string_tensor`.
-func StringToHashBucket(scope *Scope, string_tensor tf.Output, num_buckets int64) (output tf.Output) {
+// Returns A list (length: top_paths) of indices matrices.  Matrix j,
+// size `(total_decoded_outputs[j] x 2)`, has indices of a
+// `SparseTensor<int64, 2>`.  The rows store: [batch, time]. A list (length: top_paths) of values vectors.  Vector j,
+// size `(length total_decoded_outputs[j])`, has the values of a
+// `SparseTensor<int64, 2>`.  The vector stores the decoded classes for beam j. A list (length: top_paths) of shape vectors.  Vector j,
+// size `(2)`, stores the shape of the decoded `SparseTensor[j]`.
+// Its values are: `[batch_size, max_decoded_length[j]]`. A matrix, shaped: `(batch_size x top_paths)`.  The
+// sequence log-probabilities.
+func CTCBeamSearchDecoder(scope *Scope, inputs tf.Output, sequence_length tf.Output, beam_width int64, top_paths int64, optional ...CTCBeamSearchDecoderAttr) (decoded_indices []tf.Output, decoded_values []tf.Output, decoded_shape []tf.Output, log_probability tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_buckets": num_buckets}
+	attrs := map[string]interface{}{"beam_width": beam_width, "top_paths": top_paths}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "StringToHashBucket",
+		Type: "CTCBeamSearchDecoder",
 		Input: []tf.Input{
-			string_tensor,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Creates a dataset that contains `count` elements from the `input_dataset`.
-//
-// Arguments:
-//
-//	count: A scalar representing the number of elements from the `input_dataset`
-// that should be taken. A value of `-1` indicates that all of `input_dataset`
-// is taken.
-//
-//
-func TakeDataset(scope *Scope, input_dataset tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
-	opspec := tf.OpSpec{
-		Type: "TakeDataset",
-		Input: []tf.Input{
-			input_dataset, count,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes rectified linear 6: `min(max(features, 0), 6)`.
-func Relu6(scope *Scope, features tf.Output) (activations tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Relu6",
-		Input: []tf.Input{
-			features,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes rectified linear gradients for a Relu operation.
-//
-// Arguments:
-//	gradients: The backpropagated gradients to the corresponding Relu operation.
-//	features: The features passed as input to the corresponding Relu operation, OR
-// the outputs of that operation (both work equivalently).
-//
-// Returns `gradients * (features > 0)`.
-func ReluGrad(scope *Scope, gradients tf.Output, features tf.Output) (backprops tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "ReluGrad",
-		Input: []tf.Input{
-			gradients, features,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes the gradient of morphological 2-D dilation with respect to the input.
-//
-// Arguments:
-//	input: 4-D with shape `[batch, in_height, in_width, depth]`.
-//	filter: 3-D with shape `[filter_height, filter_width, depth]`.
-//	out_backprop: 4-D with shape `[batch, out_height, out_width, depth]`.
-//	strides: 1-D of length 4. The stride of the sliding window for each dimension of
-// the input tensor. Must be: `[1, stride_height, stride_width, 1]`.
-//	rates: 1-D of length 4. The input stride for atrous morphological dilation.
-// Must be: `[1, rate_height, rate_width, 1]`.
-//	padding: The type of padding algorithm to use.
-//
-// Returns 4-D with shape `[batch, in_height, in_width, depth]`.
-func Dilation2DBackpropInput(scope *Scope, input tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, rates []int64, padding string) (in_backprop tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"strides": strides, "rates": rates, "padding": padding}
-	opspec := tf.OpSpec{
-		Type: "Dilation2DBackpropInput",
-		Input: []tf.Input{
-			input, filter, out_backprop,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// CTCBeamSearchDecoderAttr is an optional argument to CTCBeamSearchDecoder.
-type CTCBeamSearchDecoderAttr func(optionalAttr)
-
-// CTCBeamSearchDecoderMergeRepeated sets the optional merge_repeated attribute to value.
-//
-// value: If true, merge repeated classes in output.
-// If not specified, defaults to true
-func CTCBeamSearchDecoderMergeRepeated(value bool) CTCBeamSearchDecoderAttr {
-	return func(m optionalAttr) {
-		m["merge_repeated"] = value
-	}
-}
-
-// Performs beam search decoding on the logits given in input.
-//
-// A note about the attribute merge_repeated: For the beam search decoder,
-// this means that if consecutive entries in a beam are the same, only
-// the first of these is emitted.  That is, when the top path is "A B B B B",
-// "A B" is returned if merge_repeated = True but "A B B B B" is
-// returned if merge_repeated = False.
-//
-// Arguments:
-//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
-//	sequence_length: A vector containing sequence lengths, size `(batch)`.
-//	beam_width: A scalar >= 0 (beam search beam width).
-//	top_paths: A scalar >= 0, <= beam_width (controls output size).
-//
-// Returns A list (length: top_paths) of indices matrices.  Matrix j,
-// size `(total_decoded_outputs[j] x 2)`, has indices of a
-// `SparseTensor<int64, 2>`.  The rows store: [batch, time].A list (length: top_paths) of values vectors.  Vector j,
-// size `(length total_decoded_outputs[j])`, has the values of a
-// `SparseTensor<int64, 2>`.  The vector stores the decoded classes for beam j.A list (length: top_paths) of shape vector.  Vector j,
-// size `(2)`, stores the shape of the decoded `SparseTensor[j]`.
-// Its values are: `[batch_size, max_decoded_length[j]]`.A matrix, shaped: `(batch_size x top_paths)`.  The
-// sequence log-probabilities.
-func CTCBeamSearchDecoder(scope *Scope, inputs tf.Output, sequence_length tf.Output, beam_width int64, top_paths int64, optional ...CTCBeamSearchDecoderAttr) (decoded_indices []tf.Output, decoded_values []tf.Output, decoded_shape []tf.Output, log_probability tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"beam_width": beam_width, "top_paths": top_paths}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "CTCBeamSearchDecoder",
-		Input: []tf.Input{
-			inputs, sequence_length,
+			inputs, sequence_length,
 		},
 		Attrs: attrs,
 	}
@@ -4464,44 +4338,6 @@ func MaxPool(scope *Scope, input tf.Output, ksize []int64, strides []int64, padd
 	return op.Output(0)
 }
 
-// Bucketizes 'input' based on 'boundaries'.
-//
-// For example, if the inputs are
-//     boundaries = [0, 10, 100]
-//     input = [[-5, 10000]
-//              [150,   10]
-//              [5,    100]]
-//
-// then the output will be
-//     output = [[0, 3]
-//               [3, 2]
-//               [1, 3]]
-//
-// Arguments:
-//	input: Any shape of Tensor contains with int or float type.
-//	boundaries: A sorted list of floats gives the boundary of the buckets.
-//
-// Returns Same shape with 'input', each value of input replaced with bucket index.
-//
-// @compatibility(numpy)
-// Equivalent to np.digitize.
-// @end_compatibility
-func Bucketize(scope *Scope, input tf.Output, boundaries []float32) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"boundaries": boundaries}
-	opspec := tf.OpSpec{
-		Type: "Bucketize",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Computes gradients of the maxpooling function.
 //
 // Arguments:
@@ -4531,68 +4367,6 @@ func MaxPoolGradWithArgmax(scope *Scope, input tf.Output, grad tf.Output, argmax
 	return op.Output(0)
 }
 
-// FakeQuantWithMinMaxArgsGradientAttr is an optional argument to FakeQuantWithMinMaxArgsGradient.
-type FakeQuantWithMinMaxArgsGradientAttr func(optionalAttr)
-
-// FakeQuantWithMinMaxArgsGradientMin sets the optional min attribute to value.
-// If not specified, defaults to -6
-func FakeQuantWithMinMaxArgsGradientMin(value float32) FakeQuantWithMinMaxArgsGradientAttr {
-	return func(m optionalAttr) {
-		m["min"] = value
-	}
-}
-
-// FakeQuantWithMinMaxArgsGradientMax sets the optional max attribute to value.
-// If not specified, defaults to 6
-func FakeQuantWithMinMaxArgsGradientMax(value float32) FakeQuantWithMinMaxArgsGradientAttr {
-	return func(m optionalAttr) {
-		m["max"] = value
-	}
-}
-
-// FakeQuantWithMinMaxArgsGradientNumBits sets the optional num_bits attribute to value.
-// If not specified, defaults to 8
-func FakeQuantWithMinMaxArgsGradientNumBits(value int64) FakeQuantWithMinMaxArgsGradientAttr {
-	return func(m optionalAttr) {
-		m["num_bits"] = value
-	}
-}
-
-// FakeQuantWithMinMaxArgsGradientNarrowRange sets the optional narrow_range attribute to value.
-// If not specified, defaults to false
-func FakeQuantWithMinMaxArgsGradientNarrowRange(value bool) FakeQuantWithMinMaxArgsGradientAttr {
-	return func(m optionalAttr) {
-		m["narrow_range"] = value
-	}
-}
-
-// Compute gradients for a FakeQuantWithMinMaxArgs operation.
-//
-// Arguments:
-//	gradients: Backpropagated gradients above the FakeQuantWithMinMaxArgs operation.
-//	inputs: Values passed as inputs to the FakeQuantWithMinMaxArgs operation.
-//
-// Returns Backpropagated gradients below the FakeQuantWithMinMaxArgs operation:
-// `gradients * (inputs >= min && inputs <= max)`.
-func FakeQuantWithMinMaxArgsGradient(scope *Scope, gradients tf.Output, inputs tf.Output, optional ...FakeQuantWithMinMaxArgsGradientAttr) (backprops tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "FakeQuantWithMinMaxArgsGradient",
-		Input: []tf.Input{
-			gradients, inputs,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // AvgPool3DAttr is an optional argument to AvgPool3D.
 type AvgPool3DAttr func(optionalAttr)
 
@@ -4661,24 +4435,212 @@ func Mod(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	return op.Output(0)
 }
 
-// Computes square root of x element-wise.
-//
-// I.e., \\(y = \sqrt{x} = x^{1/2}\\).
-func Sqrt(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Sqrt",
-		Input: []tf.Input{
-			x,
-		},
+// DepthToSpaceAttr is an optional argument to DepthToSpace.
+type DepthToSpaceAttr func(optionalAttr)
+
+// DepthToSpaceDataFormat sets the optional data_format attribute to value.
+// If not specified, defaults to "NHWC"
+func DepthToSpaceDataFormat(value string) DepthToSpaceAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Computes the gradients of 3-D convolution with respect to the filter.
+// DepthToSpace for tensors of type T.
+//
+// Rearranges data from depth into blocks of spatial data.
+// This is the reverse transformation of SpaceToDepth. More specifically,
+// this op outputs a copy of the input tensor where values from the `depth`
+// dimension are moved in spatial blocks to the `height` and `width` dimensions.
+// The attr `block_size` indicates the input block size and how the data is moved.
+//
+//   * Chunks of data of size `block_size * block_size` from depth are rearranged
+//     into non-overlapping blocks of size `block_size x block_size`
+//   * The width of the output tensor is `input_width * block_size`, whereas the
+//     height is `input_height * block_size`.
+//   * The Y, X coordinates within each block of the output image are determined
+//     by the high order component of the input channel index.
+//   * The depth of the input tensor must be divisible by
+//     `block_size * block_size`.
+//
+// The `data_format` attr specifies the layout of the input and output tensors
+// with the following options:
+//   "NHWC": `[ batch, height, width, channels ]`
+//   "NCHW": `[ batch, channels, height, width ]`
+//   "NCHW_VECT_C":
+//       `qint8 [ batch, channels / 4, height, width, 4 ]`
+//
+// It is useful to consider the operation as transforming a 6-D Tensor.
+// e.g. for data_format = NHWC,
+//      Each element in the input tensor can be specified via 6 coordinates,
+//      ordered by decreasing memory layout significance as:
+//      n,iY,iX,bY,bX,oC  (where n=batch index, iX, iY means X or Y coordinates
+//                         within the input image, bX, bY means coordinates
+//                         within the output block, oC means output channels).
+//      The output would be the input transposed to the following layout:
+//      n,iY,bY,iX,bX,oC
+//
+// This operation is useful for resizing the activations between convolutions
+// (but keeping all data), e.g. instead of pooling. It is also useful for training
+// purely convolutional models.
+//
+// For example, given an input of shape `[1, 1, 1, 4]`, data_format = "NHWC" and
+// block_size = 2:
+//
+// ```
+// x = [[[[1, 2, 3, 4]]]]
+//
+// ```
+//
+// This operation will output a tensor of shape `[1, 2, 2, 1]`:
+//
+// ```
+//    [[[[1], [2]],
+//      [[3], [4]]]]
+// ```
+//
+// Here, the input has a batch of 1 and each batch element has shape `[1, 1, 4]`;
+// the corresponding output will have 2x2 elements and will have a depth of
+// 1 channel (1 = `4 / (block_size * block_size)`).
+// The output element shape is `[2, 2, 1]`.
+//
+// For an input tensor with larger depth, here of shape `[1, 1, 1, 12]`, e.g.
+//
+// ```
+// x = [[[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]]]
+// ```
+//
+// This operation, for block size of 2, will return the following tensor of shape
+// `[1, 2, 2, 3]`
+//
+// ```
+//    [[[[1, 2, 3], [4, 5, 6]],
+//      [[7, 8, 9], [10, 11, 12]]]]
+//
+// ```
+//
+// Similarly, for the following input of shape `[1, 2, 2, 4]`, and a block size of 2:
+//
+// ```
+// x =  [[[[1, 2, 3, 4],
+//        [5, 6, 7, 8]],
+//       [[9, 10, 11, 12],
+//        [13, 14, 15, 16]]]]
+// ```
+//
+// the operator will return the following tensor of shape `[1, 4, 4, 1]`:
+//
+// ```
+// x = [[[ [1],   [2],  [5],  [6]],
+//       [ [3],   [4],  [7],  [8]],
+//       [ [9],  [10], [13],  [14]],
+//       [ [11], [12], [15],  [16]]]]
+//
+// ```
+//
+// Arguments:
+//
+//	block_size: The size of the spatial block, same as in SpaceToDepth.
+func DepthToSpace(scope *Scope, input tf.Output, block_size int64, optional ...DepthToSpaceAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"block_size": block_size}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "DepthToSpace",
+		Input: []tf.Input{
+			input,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Conv3DBackpropInputV2Attr is an optional argument to Conv3DBackpropInputV2.
+type Conv3DBackpropInputV2Attr func(optionalAttr)
+
+// Conv3DBackpropInputV2DataFormat sets the optional data_format attribute to value.
+//
+// value: The data format of the input and output data. With the
+// default format "NDHWC", the data is stored in the order of:
+//     [batch, in_depth, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCDHW", in which case the data storage order is:
+//     [batch, in_channels, in_depth, in_height, in_width].
+// If not specified, defaults to "NDHWC"
+func Conv3DBackpropInputV2DataFormat(value string) Conv3DBackpropInputV2Attr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Conv3DBackpropInputV2Dilations sets the optional dilations attribute to value.
+//
+// value: 1-D tensor of length 5.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each
+// filter element on that dimension. The dimension order is determined by the
+// value of `data_format`, see above for details. Dilations in the batch and
+// depth dimensions must be 1.
+// If not specified, defaults to <i:1 i:1 i:1 i:1 i:1 >
+func Conv3DBackpropInputV2Dilations(value []int64) Conv3DBackpropInputV2Attr {
+	return func(m optionalAttr) {
+		m["dilations"] = value
+	}
+}
+
+// Computes the gradients of 3-D convolution with respect to the input.
+//
+// Arguments:
+//	input_sizes: An integer vector representing the tensor shape of `input`,
+// where `input` is a 5-D
+// `[batch, depth, rows, cols, in_channels]` tensor.
+//	filter: Shape `[depth, rows, cols, in_channels, out_channels]`.
+// `in_channels` must match between `input` and `filter`.
+//	out_backprop: Backprop signal of shape `[batch, out_depth, out_rows, out_cols,
+// out_channels]`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
+func Conv3DBackpropInputV2(scope *Scope, input_sizes tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...Conv3DBackpropInputV2Attr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "Conv3DBackpropInputV2",
+		Input: []tf.Input{
+			input_sizes, filter, out_backprop,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes square root of x element-wise.
+//
+// I.e., \\(y = \sqrt{x} = x^{1/2}\\).
+func Sqrt(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Sqrt",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes the gradients of 3-D convolution with respect to the filter.
 //
 // DEPRECATED at GraphDef version 10: Use Conv3DBackpropFilterV2
 //
@@ -4970,45 +4932,47 @@ func ReaderRestoreStateV2(scope *Scope, reader_handle tf.Output, state tf.Output
 	return scope.AddOperation(opspec)
 }
 
-// TensorArrayGatherV3Attr is an optional argument to TensorArrayGatherV3.
-type TensorArrayGatherV3Attr func(optionalAttr)
+// MaxPoolGradAttr is an optional argument to MaxPoolGrad.
+type MaxPoolGradAttr func(optionalAttr)
 
-// TensorArrayGatherV3ElementShape sets the optional element_shape attribute to value.
+// MaxPoolGradDataFormat sets the optional data_format attribute to value.
 //
-// value: The expected shape of an element, if known. Used to
-// validate the shapes of TensorArray elements. If this shape is not
-// fully specified, gathering zero-size TensorArrays is an error.
-// If not specified, defaults to <unknown_rank:true >
-func TensorArrayGatherV3ElementShape(value tf.Shape) TensorArrayGatherV3Attr {
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", with the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func MaxPoolGradDataFormat(value string) MaxPoolGradAttr {
 	return func(m optionalAttr) {
-		m["element_shape"] = value
+		m["data_format"] = value
 	}
 }
 
-// Gather specific elements from the TensorArray into output `value`.
-//
-// All elements selected by `indices` must have the same shape.
+// Computes gradients of the maxpooling function.
 //
 // Arguments:
-//	handle: The handle to a TensorArray.
-//	indices: The locations in the TensorArray from which to read tensor elements.
-//	flow_in: A float scalar that enforces proper chaining of operations.
-//	dtype: The type of the elem that is returned.
+//	orig_input: The original input tensor.
+//	orig_output: The original output tensor.
+//	grad: 4-D.  Gradients w.r.t. the output of `max_pool`.
+//	ksize: The size of the window for each dimension of the input tensor.
+//	strides: The stride of the sliding window for each dimension of the
+// input tensor.
+//	padding: The type of padding algorithm to use.
 //
-// Returns All of the elements in the TensorArray, concatenated along a new
-// axis (the new dimension 0).
-func TensorArrayGatherV3(scope *Scope, handle tf.Output, indices tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayGatherV3Attr) (value tf.Output) {
+// Returns Gradients w.r.t. the input to `max_pool`.
+func MaxPoolGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPoolGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayGatherV3",
+		Type: "MaxPoolGrad",
 		Input: []tf.Input{
-			handle, indices, flow_in,
+			orig_input, orig_output, grad,
 		},
 		Attrs: attrs,
 	}
@@ -5016,29 +4980,77 @@ func TensorArrayGatherV3(scope *Scope, handle tf.Output, indices tf.Output, flow
 	return op.Output(0)
 }
 
-// Converts each string in the input Tensor to its hash mod by a number of buckets.
-//
-// The hash function is deterministic on the content of the string within the
-// process and will never change. However, it is not suitable for cryptography.
-// This function may be used when CPU time is scarce and inputs are trusted or
-// unimportant. There is a risk of adversaries constructing inputs that all hash
-// to the same bucket. To prevent this problem, use a strong hash function with
-// `tf.string_to_hash_bucket_strong`.
+// CropAndResizeAttr is an optional argument to CropAndResize.
+type CropAndResizeAttr func(optionalAttr)
+
+// CropAndResizeMethod sets the optional method attribute to value.
 //
-// Arguments:
-//	input: The strings to assign a hash bucket.
-//	num_buckets: The number of buckets.
+// value: A string specifying the interpolation method. Only 'bilinear' is
+// supported for now.
+// If not specified, defaults to "bilinear"
+func CropAndResizeMethod(value string) CropAndResizeAttr {
+	return func(m optionalAttr) {
+		m["method"] = value
+	}
+}
+
+// CropAndResizeExtrapolationValue sets the optional extrapolation_value attribute to value.
 //
-// Returns A Tensor of the same shape as the input `string_tensor`.
-func StringToHashBucketFast(scope *Scope, input tf.Output, num_buckets int64) (output tf.Output) {
+// value: Value used for extrapolation, when applicable.
+// If not specified, defaults to 0
+func CropAndResizeExtrapolationValue(value float32) CropAndResizeAttr {
+	return func(m optionalAttr) {
+		m["extrapolation_value"] = value
+	}
+}
+
+// Extracts crops from the input image tensor and bilinearly resizes them.
+//
+// The crops are resized (possibly with aspect ratio change) to a common output size specified by `crop_size`. This
+// is more general than the `crop_to_bounding_box` op which extracts a fixed size
+// slice from the input image and does not allow resizing or aspect ratio change.
+//
+// Returns a tensor with `crops` from the input `image` at positions defined at the
+// bounding box locations in `boxes`. The cropped boxes are all resized (with
+// bilinear interpolation) to a fixed `size = [crop_height, crop_width]`. The
+// result is a 4-D tensor `[num_boxes, crop_height, crop_width, depth]`. The
+// resizing is corner aligned. In particular, if `boxes = [[0, 0, 1, 1]]`, the
+// method will give identical results to using `tf.image.resize_bilinear()`
+// with `align_corners=True`.
+//
+// Arguments:
+//	image: A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
+// Both `image_height` and `image_width` need to be positive.
+//	boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
+// specifies the coordinates of a box in the `box_ind[i]` image and is specified
+// in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
+// `y` is mapped to the image coordinate at `y * (image_height - 1)`, so that the
+// `[0, 1]` interval of normalized image height is mapped to
+// `[0, image_height - 1]` in image height coordinates. We do allow `y1` > `y2`, in
+// which case the sampled crop is an up-down flipped version of the original
+// image. The width dimension is treated similarly. Normalized coordinates
+// outside the `[0, 1]` range are allowed, in which case we use
+// `extrapolation_value` to extrapolate the input image values.
+//	box_ind: A 1-D tensor of shape `[num_boxes]` with int32 values in `[0, batch)`.
+// The value of `box_ind[i]` specifies the image that the `i`-th box refers to.
+//	crop_size: A 1-D tensor of 2 elements, `size = [crop_height, crop_width]`. All
+// cropped image patches are resized to this size. The aspect ratio of the image
+// content is not preserved. Both `crop_height` and `crop_width` need to be
+// positive.
+//
+// Returns A 4-D tensor of shape `[num_boxes, crop_height, crop_width, depth]`.
+func CropAndResize(scope *Scope, image tf.Output, boxes tf.Output, box_ind tf.Output, crop_size tf.Output, optional ...CropAndResizeAttr) (crops tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_buckets": num_buckets}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "StringToHashBucketFast",
+		Type: "CropAndResize",
 		Input: []tf.Input{
-			input,
+			image, boxes, box_ind, crop_size,
 		},
 		Attrs: attrs,
 	}
@@ -5046,168 +5058,232 @@ func StringToHashBucketFast(scope *Scope, input tf.Output, num_buckets int64) (o
 	return op.Output(0)
 }
 
-// Returns the max of x and y (i.e. x > y ? x : y) element-wise.
+// Fills empty rows in the input 2-D `SparseTensor` with a default value.
 //
-// *NOTE*: `Maximum` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Maximum(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// The input `SparseTensor` is represented via the tuple of inputs
+// (`indices`, `values`, `dense_shape`).  The output `SparseTensor` has the
+// same `dense_shape` but with indices `output_indices` and values
+// `output_values`.
+//
+// This op inserts a single entry for every row that doesn't have any values.
+// The index is created as `[row, 0, ..., 0]` and the inserted value
+// is `default_value`.
+//
+// For example, suppose `sp_input` has shape `[5, 6]` and non-empty values:
+//
+//     [0, 1]: a
+//     [0, 3]: b
+//     [2, 0]: c
+//     [3, 1]: d
+//
+// Rows 1 and 4 are empty, so the output will be of shape `[5, 6]` with values:
+//
+//     [0, 1]: a
+//     [0, 3]: b
+//     [1, 0]: default_value
+//     [2, 0]: c
+//     [3, 1]: d
+//     [4, 0]: default_value
+//
+// The output `SparseTensor` will be in row-major order and will have the
+// same shape as the input.
+//
+// This op also returns an indicator vector shaped `[dense_shape[0]]` such that
+//
+//     empty_row_indicator[i] = True iff row i was an empty row.
+//
+// And a reverse index map vector shaped `[indices.shape[0]]` that is used during
+// backpropagation,
+//
+//     reverse_index_map[j] = out_j s.t. indices[j, :] == output_indices[out_j, :]
+//
+// Arguments:
+//	indices: 2-D. the indices of the sparse tensor.
+//	values: 1-D. the values of the sparse tensor.
+//	dense_shape: 1-D. the shape of the sparse tensor.
+//	default_value: 0-D. default value to insert into location `[row, 0, ..., 0]`
+//   for rows missing from the input sparse tensor.
+//
+// Returns output_indices (2-D, the indices of the filled sparse tensor), output_values
+// (1-D, the values of the filled sparse tensor), empty_row_indicator (1-D, whether the dense row
+// was missing in the input sparse tensor), and reverse_index_map (1-D, a map from the input indices to the output indices).
+func SparseFillEmptyRows(scope *Scope, indices tf.Output, values tf.Output, dense_shape tf.Output, default_value tf.Output) (output_indices tf.Output, output_values tf.Output, empty_row_indicator tf.Output, reverse_index_map tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Maximum",
+		Type: "SparseFillEmptyRows",
 		Input: []tf.Input{
-			x, y,
+			indices, values, dense_shape, default_value,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3)
 }
 
-// Outputs all keys and values in the table.
+// Reverses specific dimensions of a tensor.
 //
-// Arguments:
-//	table_handle: Handle to the table.
+// Given a `tensor`, and a `bool` tensor `dims` representing the dimensions
+// of `tensor`, this operation reverses each dimension i of `tensor` where
+// `dims[i]` is `True`.
 //
+// `tensor` can have up to 8 dimensions. The number of dimensions
+// of `tensor` must equal the number of elements in `dims`. In other words:
 //
+// `rank(tensor) = size(dims)`
 //
-// Returns Vector of all keys present in the table.Tensor of all values in the table. Indexed in parallel with `keys`.
-func LookupTableExportV2(scope *Scope, table_handle tf.Output, Tkeys tf.DataType, Tvalues tf.DataType) (keys tf.Output, values tf.Output) {
+// For example:
+//
+// ```
+// # tensor 't' is [[[[ 0,  1,  2,  3],
+// #                  [ 4,  5,  6,  7],
+// #                  [ 8,  9, 10, 11]],
+// #                 [[12, 13, 14, 15],
+// #                  [16, 17, 18, 19],
+// #                  [20, 21, 22, 23]]]]
+// # tensor 't' shape is [1, 2, 3, 4]
+//
+// # 'dims' is [False, False, False, True]
+// reverse(t, dims) ==> [[[[ 3,  2,  1,  0],
+//                         [ 7,  6,  5,  4],
+//                         [11, 10,  9,  8]],
+//                        [[15, 14, 13, 12],
+//                         [19, 18, 17, 16],
+//                         [23, 22, 21, 20]]]]
+//
+// # 'dims' is [False, True, False, False]
+// reverse(t, dims) ==> [[[[12, 13, 14, 15],
+//                         [16, 17, 18, 19],
+//                         [20, 21, 22, 23]],
+//                        [[ 0,  1,  2,  3],
+//                         [ 4,  5,  6,  7],
+//                         [ 8,  9, 10, 11]]]]
+//
+// # 'dims' is [False, False, True, False]
+// reverse(t, dims) ==> [[[[8, 9, 10, 11],
+//                         [4, 5, 6, 7],
+//                         [0, 1, 2, 3]],
+//                        [[20, 21, 22, 23],
+//                         [16, 17, 18, 19],
+//                         [12, 13, 14, 15]]]]
+// ```
+//
+// Arguments:
+//	tensor: Up to 8-D.
+//	dims: 1-D. The dimensions to reverse.
+//
+// Returns The same shape as `tensor`.
+func Reverse(scope *Scope, tensor tf.Output, dims tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"Tkeys": Tkeys, "Tvalues": Tvalues}
 	opspec := tf.OpSpec{
-		Type: "LookupTableExportV2",
+		Type: "Reverse",
 		Input: []tf.Input{
-			table_handle,
+			tensor, dims,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Real-valued fast Fourier transform.
-//
-// Computes the 1-dimensional discrete Fourier transform of a real-valued signal
-// over the inner-most dimension of `input`.
+// Computes log softmax activations.
 //
-// Since the DFT of a real signal is Hermitian-symmetric, `RFFT` only returns the
-// `fft_length / 2 + 1` unique components of the FFT: the zero-frequency term,
-// followed by the `fft_length / 2` positive-frequency terms.
+// For each batch `i` and class `j` we have
 //
-// Along the axis `RFFT` is computed on, if `fft_length` is smaller than the
-// corresponding dimension of `input`, the dimension is cropped. If it is larger,
-// the dimension is padded with zeros.
+//     logsoftmax[i, j] = logits[i, j] - log(sum(exp(logits[i])))
 //
 // Arguments:
-//	input: A float32 tensor.
-//	fft_length: An int32 tensor of shape [1]. The FFT length.
-//
-// Returns A complex64 tensor of the same rank as `input`. The inner-most
-//   dimension of `input` is replaced with the `fft_length / 2 + 1` unique
-//   frequency components of its 1D Fourier transform.
+//	logits: 2-D with shape `[batch_size, num_classes]`.
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.rfft
-// @end_compatibility
-func RFFT(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
+// Returns Same shape as `logits`.
+func LogSoftmax(scope *Scope, logits tf.Output) (logsoftmax tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "RFFT",
+		Type: "LogSoftmax",
 		Input: []tf.Input{
-			input, fft_length,
+			logits,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ComplexAttr is an optional argument to Complex.
-type ComplexAttr func(optionalAttr)
-
-// ComplexTout sets the optional Tout attribute to value.
-// If not specified, defaults to DT_COMPLEX64
-func ComplexTout(value tf.DataType) ComplexAttr {
-	return func(m optionalAttr) {
-		m["Tout"] = value
-	}
-}
-
-// Converts two real numbers to a complex number.
+// Computes the inverse permutation of a tensor.
 //
-// Given a tensor `real` representing the real part of a complex number, and a
-// tensor `imag` representing the imaginary part of a complex number, this
-// operation returns complex numbers elementwise of the form \\(a + bj\\), where
-// *a* represents the `real` part and *b* represents the `imag` part.
+// This operation computes the inverse of an index permutation. It takes a 1-D
+// integer tensor `x`, which represents the indices of a zero-based array, and
+// swaps each value with its index position. In other words, for an output tensor
+// `y` and an input tensor `x`, this operation computes the following:
 //
-// The input tensors `real` and `imag` must have the same shape.
+// `y[x[i]] = i for i in [0, 1, ..., len(x) - 1]`
+//
+// The values must include 0. There can be no duplicate values or negative values.
 //
 // For example:
 //
 // ```
-// # tensor 'real' is [2.25, 3.25]
-// # tensor `imag` is [4.75, 5.75]
-// tf.complex(real, imag) ==> [[2.25 + 4.75j], [3.25 + 5.75j]]
+// # tensor `x` is [3, 4, 0, 2, 1]
+// invert_permutation(x) ==> [2, 4, 3, 0, 1]
 // ```
-func Complex(scope *Scope, real tf.Output, imag tf.Output, optional ...ComplexAttr) (out tf.Output) {
+//
+// Arguments:
+//	x: 1-D.
+//
+// Returns 1-D.
+func InvertPermutation(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "Complex",
+		Type: "InvertPermutation",
 		Input: []tf.Input{
-			real, imag,
+			x,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ImagAttr is an optional argument to Imag.
-type ImagAttr func(optionalAttr)
-
-// ImagTout sets the optional Tout attribute to value.
-// If not specified, defaults to DT_FLOAT
-func ImagTout(value tf.DataType) ImagAttr {
-	return func(m optionalAttr) {
-		m["Tout"] = value
-	}
-}
-
-// Returns the imaginary part of a complex number.
+// Gradient op for `MirrorPad` op. This op folds a mirror-padded tensor.
 //
-// Given a tensor `input` of complex numbers, this operation returns a tensor of
-// type `float` that is the imaginary part of each element in `input`. All
-// elements in `input` must be complex numbers of the form \\(a + bj\\), where *a*
-// is the real part and *b* is the imaginary part returned by this operation.
+// This operation folds the padded areas of `input` by `MirrorPad` according to the
+// `paddings` you specify. `paddings` must be the same as `paddings` argument
+// given to the corresponding `MirrorPad` op.
+//
+// The folded size of each dimension D of the output is:
+//
+// `input.dim_size(D) - paddings(D, 0) - paddings(D, 1)`
 //
 // For example:
 //
 // ```
-// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
-// tf.imag(input) ==> [4.75, 5.75]
+// # 't' is [[1, 2, 3], [4, 5, 6], [7, 8, 9]].
+// # 'paddings' is [[0, 1], [0, 1]].
+// # 'mode' is SYMMETRIC.
+// # rank of 't' is 2.
+// pad(t, paddings) ==> [[ 1,  5],
+//                       [11, 28]]
 // ```
-func Imag(scope *Scope, input tf.Output, optional ...ImagAttr) (output tf.Output) {
+//
+// Arguments:
+//	input: The input tensor to be folded.
+//	paddings: A two-column matrix specifying the padding sizes. The number of
+// rows must be the same as the rank of `input`.
+//	mode: The mode used in the `MirrorPad` op.
+//
+// Returns The folded tensor.
+func MirrorPadGrad(scope *Scope, input tf.Output, paddings tf.Output, mode string) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"mode": mode}
 	opspec := tf.OpSpec{
-		Type: "Imag",
+		Type: "MirrorPadGrad",
 		Input: []tf.Input{
-			input,
+			input, paddings,
 		},
 		Attrs: attrs,
 	}
@@ -5215,78 +5291,108 @@ func Imag(scope *Scope, input tf.Output, optional ...ImagAttr) (output tf.Output
 	return op.Output(0)
 }
 
-// Compute the Hurwitz zeta function \\(\zeta(x, q)\\).
+// BiasAddGradAttr is an optional argument to BiasAddGrad.
+type BiasAddGradAttr func(optionalAttr)
+
+// BiasAddGradDataFormat sets the optional data_format attribute to value.
 //
-// The Hurwitz zeta function is defined as:
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the bias tensor will be added to the last dimension
+// of the value tensor.
+// Alternatively, the format could be "NCHW", with the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// The tensor will be added to "in_channels", the third-to-the-last
+//     dimension.
+// If not specified, defaults to "NHWC"
+func BiasAddGradDataFormat(value string) BiasAddGradAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// The backward operation for "BiasAdd" on the "bias" tensor.
+//
+// It accumulates all the values from out_backprop into the feature dimension.
+// For NHWC data format, the feature dimension is the last. For NCHW data format,
+// the feature dimension is the third-to-last.
 //
+// Arguments:
+//	out_backprop: Any number of dimensions.
 //
-// \\(\zeta(x, q) = \sum_{n=0}^{\infty} (q + n)^{-x}\\)
-func Zeta(scope *Scope, x tf.Output, q tf.Output) (z tf.Output) {
+// Returns 1-D with size the feature dimension of `out_backprop`.
+func BiasAddGrad(scope *Scope, out_backprop tf.Output, optional ...BiasAddGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Zeta",
+		Type: "BiasAddGrad",
 		Input: []tf.Input{
-			x, q,
+			out_backprop,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// LRNGradAttr is an optional argument to LRNGrad.
-type LRNGradAttr func(optionalAttr)
+// FusedBatchNormV2Attr is an optional argument to FusedBatchNormV2.
+type FusedBatchNormV2Attr func(optionalAttr)
 
-// LRNGradDepthRadius sets the optional depth_radius attribute to value.
+// FusedBatchNormV2Epsilon sets the optional epsilon attribute to value.
 //
-// value: A depth radius.
-// If not specified, defaults to 5
-func LRNGradDepthRadius(value int64) LRNGradAttr {
+// value: A small float number added to the variance of x.
+// If not specified, defaults to 0.0001
+func FusedBatchNormV2Epsilon(value float32) FusedBatchNormV2Attr {
 	return func(m optionalAttr) {
-		m["depth_radius"] = value
+		m["epsilon"] = value
 	}
 }
 
-// LRNGradBias sets the optional bias attribute to value.
+// FusedBatchNormV2DataFormat sets the optional data_format attribute to value.
 //
-// value: An offset (usually > 0 to avoid dividing by 0).
-// If not specified, defaults to 1
-func LRNGradBias(value float32) LRNGradAttr {
+// value: The data format for x and y. Either "NHWC" (default) or "NCHW".
+// If not specified, defaults to "NHWC"
+func FusedBatchNormV2DataFormat(value string) FusedBatchNormV2Attr {
 	return func(m optionalAttr) {
-		m["bias"] = value
+		m["data_format"] = value
 	}
 }
 
-// LRNGradAlpha sets the optional alpha attribute to value.
+// FusedBatchNormV2IsTraining sets the optional is_training attribute to value.
 //
-// value: A scale factor, usually positive.
-// If not specified, defaults to 1
-func LRNGradAlpha(value float32) LRNGradAttr {
+// value: A bool value to indicate the operation is for training (default)
+// or inference.
+// If not specified, defaults to true
+func FusedBatchNormV2IsTraining(value bool) FusedBatchNormV2Attr {
 	return func(m optionalAttr) {
-		m["alpha"] = value
+		m["is_training"] = value
 	}
 }
 
-// LRNGradBeta sets the optional beta attribute to value.
+// Batch normalization.
 //
-// value: An exponent.
-// If not specified, defaults to 0.5
-func LRNGradBeta(value float32) LRNGradAttr {
-	return func(m optionalAttr) {
-		m["beta"] = value
-	}
-}
-
-// Gradients for Local Response Normalization.
+// Note that the size of 4D Tensors is defined by either "NHWC" or "NCHW".
+// The size of 1D Tensors matches the dimension C of the 4D Tensors.
 //
 // Arguments:
-//	input_grads: 4-D with shape `[batch, height, width, channels]`.
-//	input_image: 4-D with shape `[batch, height, width, channels]`.
-//	output_image: 4-D with shape `[batch, height, width, channels]`.
+//	x: A 4D Tensor for input data.
+//	scale: A 1D Tensor for scaling factor, to scale the normalized x.
+//	offset: A 1D Tensor for offset, to shift the normalized x.
+//	mean: A 1D Tensor for population mean. Used for inference only;
+// must be empty for training.
+//	variance: A 1D Tensor for population variance. Used for inference only;
+// must be empty for training.
 //
-// Returns The gradients for LRN.
-func LRNGrad(scope *Scope, input_grads tf.Output, input_image tf.Output, output_image tf.Output, optional ...LRNGradAttr) (output tf.Output) {
+// Returns A 4D Tensor for output data. A 1D Tensor for the computed batch mean, to be used by TensorFlow
+// to compute the running mean. A 1D Tensor for the computed batch variance, to be used by
+// TensorFlow to compute the running variance. A 1D Tensor for the computed batch mean, to be reused
+// in the gradient computation. A 1D Tensor for the computed batch variance (inverted variance
+// in the cuDNN case), to be reused in the gradient computation.
+func FusedBatchNormV2(scope *Scope, x tf.Output, scale tf.Output, offset tf.Output, mean tf.Output, variance tf.Output, optional ...FusedBatchNormV2Attr) (y tf.Output, batch_mean tf.Output, batch_variance tf.Output, reserve_space_1 tf.Output, reserve_space_2 tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -5295,54 +5401,56 @@ func LRNGrad(scope *Scope, input_grads tf.Output, input_image tf.Output, output_
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "LRNGrad",
+		Type: "FusedBatchNormV2",
 		Input: []tf.Input{
-			input_grads, input_image, output_image,
+			x, scale, offset, mean, variance,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
 }
 
-// AnyAttr is an optional argument to Any.
-type AnyAttr func(optionalAttr)
+// AvgPoolGradAttr is an optional argument to AvgPoolGrad.
+type AvgPoolGradAttr func(optionalAttr)
 
-// AnyKeepDims sets the optional keep_dims attribute to value.
+// AvgPoolGradDataFormat sets the optional data_format attribute to value.
 //
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func AnyKeepDims(value bool) AnyAttr {
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func AvgPoolGradDataFormat(value string) AvgPoolGradAttr {
 	return func(m optionalAttr) {
-		m["keep_dims"] = value
+		m["data_format"] = value
 	}
 }
 
-// Computes the "logical or" of elements across dimensions of a tensor.
-//
-// Reduces `input` along the dimensions given in `axis`. Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `axis`. If `keep_dims` is true, the reduced dimensions are
-// retained with length 1.
+// Computes gradients of the average pooling function.
 //
 // Arguments:
-//	input: The tensor to reduce.
-//	axis: The dimensions to reduce. Must be in the range
-// `[-rank(input), rank(input))`.
+//	orig_input_shape: 1-D.  Shape of the original input to `avg_pool`.
+//	grad: 4-D with shape `[batch, height, width, channels]`.  Gradients w.r.t.
+// the output of `avg_pool`.
+//	ksize: The size of the sliding window for each dimension of the input.
+//	strides: The stride of the sliding window for each dimension of the input.
+//	padding: The type of padding algorithm to use.
 //
-// Returns The reduced tensor.
-func Any(scope *Scope, input tf.Output, axis tf.Output, optional ...AnyAttr) (output tf.Output) {
+// Returns 4-D.  Gradients w.r.t. the input of `avg_pool`.
+func AvgPoolGrad(scope *Scope, orig_input_shape tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPoolGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Any",
+		Type: "AvgPoolGrad",
 		Input: []tf.Input{
-			input, axis,
+			orig_input_shape, grad,
 		},
 		Attrs: attrs,
 	}
@@ -5350,251 +5458,209 @@ func Any(scope *Scope, input tf.Output, axis tf.Output, optional ...AnyAttr) (ou
 	return op.Output(0)
 }
 
-// ResourceApplyFtrlAttr is an optional argument to ResourceApplyFtrl.
-type ResourceApplyFtrlAttr func(optionalAttr)
+// StageClearAttr is an optional argument to StageClear.
+type StageClearAttr func(optionalAttr)
 
-// ResourceApplyFtrlUseLocking sets the optional use_locking attribute to value.
+// StageClearCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// value: If `True`, updating of the var and accum tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceApplyFtrlUseLocking(value bool) ResourceApplyFtrlAttr {
+// REQUIRES: value >= 0
+func StageClearCapacity(value int64) StageClearAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["capacity"] = value
 	}
 }
 
-// Update '*var' according to the Ftrl-proximal scheme.
-//
-// accum_new = accum + grad * grad
-// linear += grad - (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var
-// quadratic = 1.0 / (accum_new^(lr_power) * lr) + 2 * l2
-// var = (sign(linear) * l1 - linear) / quadratic if |linear| > l1 else 0.0
-// accum = accum_new
-//
-// Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	linear: Should be from a Variable().
-//	grad: The gradient.
-//	lr: Scaling factor. Must be a scalar.
-//	l1: L1 regulariation. Must be a scalar.
-//	l2: L2 regulariation. Must be a scalar.
-//	lr_power: Scaling factor. Must be a scalar.
+// StageClearMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Returns the created operation.
-func ResourceApplyFtrl(scope *Scope, var_ tf.Output, accum tf.Output, linear tf.Output, grad tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, lr_power tf.Output, optional ...ResourceApplyFtrlAttr) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
+// REQUIRES: value >= 0
+func StageClearMemoryLimit(value int64) StageClearAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
 	}
-	attrs := map[string]interface{}{}
+}
+
+// StageClearContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func StageClearContainer(value string) StageClearAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// StageClearSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func StageClearSharedName(value string) StageClearAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op removes all elements in the underlying container.
+//
+// Returns the created operation.
+func StageClear(scope *Scope, dtypes []tf.DataType, optional ...StageClearAttr) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyFtrl",
-		Input: []tf.Input{
-			var_, accum, linear, grad, lr, l1, l2, lr_power,
-		},
+		Type: "StageClear",
+
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// RandomUniformAttr is an optional argument to RandomUniform.
-type RandomUniformAttr func(optionalAttr)
+// ComputeAccidentalHitsAttr is an optional argument to ComputeAccidentalHits.
+type ComputeAccidentalHitsAttr func(optionalAttr)
 
-// RandomUniformSeed sets the optional seed attribute to value.
+// ComputeAccidentalHitsSeed sets the optional seed attribute to value.
 //
-// value: If either `seed` or `seed2` are set to be non-zero, the random number
+// value: If either seed or seed2 are set to be non-zero, the random number
 // generator is seeded by the given seed.  Otherwise, it is seeded by a
 // random seed.
 // If not specified, defaults to 0
-func RandomUniformSeed(value int64) RandomUniformAttr {
+func ComputeAccidentalHitsSeed(value int64) ComputeAccidentalHitsAttr {
 	return func(m optionalAttr) {
 		m["seed"] = value
 	}
 }
 
-// RandomUniformSeed2 sets the optional seed2 attribute to value.
+// ComputeAccidentalHitsSeed2 sets the optional seed2 attribute to value.
 //
-// value: A second seed to avoid seed collision.
+// value: A second seed to avoid seed collision.
 // If not specified, defaults to 0
-func RandomUniformSeed2(value int64) RandomUniformAttr {
+func ComputeAccidentalHitsSeed2(value int64) ComputeAccidentalHitsAttr {
 	return func(m optionalAttr) {
 		m["seed2"] = value
 	}
 }
 
-// Outputs random values from a uniform distribution.
+// Computes the ids of the positions in sampled_candidates that match true_labels.
 //
-// The generated values follow a uniform distribution in the range `[0, 1)`. The
-// lower bound 0 is included in the range, while the upper bound 1 is excluded.
+// When doing log-odds NCE, the result of this op should be passed through a
+// SparseToDense op, then added to the logits of the sampled candidates. This has
+// the effect of 'removing' the sampled labels that match the true labels by
+// making the classifier sure that they are sampled labels.
 //
 // Arguments:
-//	shape: The shape of the output tensor.
-//	dtype: The type of the output.
+//	true_classes: The true_classes output of UnpackSparseLabels.
+//	sampled_candidates: The sampled_candidates output of CandidateSampler.
+//	num_true: Number of true labels per context.
 //
-// Returns A tensor of the specified shape filled with uniform random values.
-func RandomUniform(scope *Scope, shape tf.Output, dtype tf.DataType, optional ...RandomUniformAttr) (output tf.Output) {
+// Returns A vector of indices corresponding to rows of true_candidates. A vector of IDs of positions in sampled_candidates that match a true_label
+// for the row with the corresponding index in indices. A vector of the same length as indices and ids, in which each element
+// is -FLOAT_MAX.
+func ComputeAccidentalHits(scope *Scope, true_classes tf.Output, sampled_candidates tf.Output, num_true int64, optional ...ComputeAccidentalHitsAttr) (indices tf.Output, ids tf.Output, weights tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{"num_true": num_true}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "RandomUniform",
+		Type: "ComputeAccidentalHits",
 		Input: []tf.Input{
-			shape,
+			true_classes, sampled_candidates,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// AssertAttr is an optional argument to Assert.
-type AssertAttr func(optionalAttr)
+// TensorArrayGatherV3Attr is an optional argument to TensorArrayGatherV3.
+type TensorArrayGatherV3Attr func(optionalAttr)
 
-// AssertSummarize sets the optional summarize attribute to value.
+// TensorArrayGatherV3ElementShape sets the optional element_shape attribute to value.
 //
-// value: Print this many entries of each tensor.
-// If not specified, defaults to 3
-func AssertSummarize(value int64) AssertAttr {
+// value: The expected shape of an element, if known. Used to
+// validate the shapes of TensorArray elements. If this shape is not
+// fully specified, gathering zero-size TensorArrays is an error.
+// If not specified, defaults to <unknown_rank:true >
+func TensorArrayGatherV3ElementShape(value tf.Shape) TensorArrayGatherV3Attr {
 	return func(m optionalAttr) {
-		m["summarize"] = value
+		m["element_shape"] = value
 	}
 }
 
-// Asserts that the given condition is true.
+// Gather specific elements from the TensorArray into output `value`.
 //
-// If `condition` evaluates to false, print the list of tensors in `data`.
-// `summarize` determines how many entries of the tensors to print.
+// All elements selected by `indices` must have the same shape.
 //
 // Arguments:
-//	condition: The condition to evaluate.
-//	data: The tensors to print out when condition is false.
+//	handle: The handle to a TensorArray.
+//	indices: The locations in the TensorArray from which to read tensor elements.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//	dtype: The type of the element that is returned.
 //
-// Returns the created operation.
-func Assert(scope *Scope, condition tf.Output, data []tf.Output, optional ...AssertAttr) (o *tf.Operation) {
+// Returns All of the elements in the TensorArray, concatenated along a new
+// axis (the new dimension 0).
+func TensorArrayGatherV3(scope *Scope, handle tf.Output, indices tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayGatherV3Attr) (value tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Assert",
+		Type: "TensorArrayGatherV3",
 		Input: []tf.Input{
-			condition, tf.OutputList(data),
+			handle, indices, flow_in,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
-}
-
-// Computes element-wise population count (a.k.a. popcount, bitsum, bitcount).
-//
-// For each entry in `x`, calculates the number of `1` (on) bits in the binary
-// representation of that entry.
-//
-// **NOTE**: It is more efficient to first `tf.bitcast` your tensors into
-// `int32` or `int64` and perform the bitcount on the result, than to feed in
-// 8- or 16-bit inputs and then aggregate the resulting counts.
-func PopulationCount(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "PopulationCount",
-		Input: []tf.Input{
-			x,
-		},
-	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Split a `SparseTensor` into `num_split` tensors along one dimension.
-//
-// If the `shape[split_dim]` is not an integer multiple of `num_split`. Slices
-// `[0 : shape[split_dim] % num_split]` gets one extra dimension.
-// For example, if `split_dim = 1` and `num_split = 2` and the input is
-//
-//     input_tensor = shape = [2, 7]
-//     [    a   d e  ]
-//     [b c          ]
-//
-// Graphically the output tensors are:
-//
-//     output_tensor[0] = shape = [2, 4]
-//     [    a  ]
-//     [b c    ]
+// Converts each string in the input Tensor to its hash modulo the number of buckets.
 //
-//     output_tensor[1] = shape = [2, 3]
-//     [ d e  ]
-//     [      ]
+// The hash function is deterministic on the content of the string within the
+// process and will never change. However, it is not suitable for cryptography.
+// This function may be used when CPU time is scarce and inputs are trusted or
+// unimportant. There is a risk of adversaries constructing inputs that all hash
+// to the same bucket. To prevent this problem, use a strong hash function with
+// `tf.string_to_hash_bucket_strong`.
 //
 // Arguments:
-//	split_dim: 0-D.  The dimension along which to split.  Must be in the range
-// `[0, rank(shape))`.
-//	indices: 2-D tensor represents the indices of the sparse tensor.
-//	values: 1-D tensor represents the values of the sparse tensor.
-//	shape: 1-D. tensor represents the shape of the sparse tensor.
-// output indices: A list of 1-D tensors represents the indices of the output
-// sparse tensors.
-//	num_split: The number of ways to split.
+//	input: The strings to assign a hash bucket.
+//	num_buckets: The number of buckets.
 //
-// Returns A list of 1-D tensors represents the values of the output sparse
-// tensors.A list of 1-D tensors represents the shape of the output sparse
-// tensors.
-func SparseSplit(scope *Scope, split_dim tf.Output, indices tf.Output, values tf.Output, shape tf.Output, num_split int64) (output_indices []tf.Output, output_values []tf.Output, output_shape []tf.Output) {
+// Returns A Tensor of the same shape as `input`.
+func StringToHashBucketFast(scope *Scope, input tf.Output, num_buckets int64) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_split": num_split}
+	attrs := map[string]interface{}{"num_buckets": num_buckets}
 	opspec := tf.OpSpec{
-		Type: "SparseSplit",
+		Type: "StringToHashBucketFast",
 		Input: []tf.Input{
-			split_dim, indices, values, shape,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if output_indices, idx, err = makeOutputList(op, idx, "output_indices"); err != nil {
-		scope.UpdateErr("SparseSplit", err)
-		return
-	}
-	if output_values, idx, err = makeOutputList(op, idx, "output_values"); err != nil {
-		scope.UpdateErr("SparseSplit", err)
-		return
-	}
-	if output_shape, idx, err = makeOutputList(op, idx, "output_shape"); err != nil {
-		scope.UpdateErr("SparseSplit", err)
-		return
-	}
-	return output_indices, output_values, output_shape
+	return op.Output(0)
 }
 
-// Returns the truth value of (x < y) element-wise.
+// Returns the max of x and y (i.e. x > y ? x : y) element-wise.
 //
-// *NOTE*: `Less` supports broadcasting. More about broadcasting
+// *NOTE*: `Maximum` supports broadcasting. More about broadcasting
 // [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Less(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+func Maximum(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Less",
+		Type: "Maximum",
 		Input: []tf.Input{
 			x, y,
 		},
@@ -5603,173 +5669,141 @@ func Less(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	return op.Output(0)
 }
 
-// QuantizedReluXAttr is an optional argument to QuantizedReluX.
-type QuantizedReluXAttr func(optionalAttr)
-
-// QuantizedReluXOutType sets the optional out_type attribute to value.
-// If not specified, defaults to DT_QUINT8
-func QuantizedReluXOutType(value tf.DataType) QuantizedReluXAttr {
-	return func(m optionalAttr) {
-		m["out_type"] = value
-	}
-}
-
-// Computes Quantized Rectified Linear X: `min(max(features, 0), max_value)`
+// Real-valued fast Fourier transform.
 //
-// Arguments:
+// Computes the 1-dimensional discrete Fourier transform of a real-valued signal
+// over the inner-most dimension of `input`.
 //
+// Since the DFT of a real signal is Hermitian-symmetric, `RFFT` only returns the
+// `fft_length / 2 + 1` unique components of the FFT: the zero-frequency term,
+// followed by the `fft_length / 2` positive-frequency terms.
 //
-//	min_features: The float value that the lowest quantized value represents.
-//	max_features: The float value that the highest quantized value represents.
+// Along the axis `RFFT` is computed on, if `fft_length` is smaller than the
+// corresponding dimension of `input`, the dimension is cropped. If it is larger,
+// the dimension is padded with zeros.
 //
-// Returns Has the same output shape as "features".The float value that the lowest quantized value represents.The float value that the highest quantized value represents.
-func QuantizedReluX(scope *Scope, features tf.Output, max_value tf.Output, min_features tf.Output, max_features tf.Output, optional ...QuantizedReluXAttr) (activations tf.Output, min_activations tf.Output, max_activations tf.Output) {
+// Arguments:
+//	input: A float32 tensor.
+//	fft_length: An int32 tensor of shape [1]. The FFT length.
+//
+// Returns A complex64 tensor of the same rank as `input`. The inner-most
+//   dimension of `input` is replaced with the `fft_length / 2 + 1` unique
+//   frequency components of its 1D Fourier transform.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.rfft
+// @end_compatibility
+func RFFT(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "QuantizedReluX",
+		Type: "RFFT",
 		Input: []tf.Input{
-			features, max_value, min_features, max_features,
+			input, fft_length,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// SummaryWriterAttr is an optional argument to SummaryWriter.
-type SummaryWriterAttr func(optionalAttr)
+// LRNGradAttr is an optional argument to LRNGrad.
+type LRNGradAttr func(optionalAttr)
 
-// SummaryWriterSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func SummaryWriterSharedName(value string) SummaryWriterAttr {
+// LRNGradDepthRadius sets the optional depth_radius attribute to value.
+//
+// value: A depth radius.
+// If not specified, defaults to 5
+func LRNGradDepthRadius(value int64) LRNGradAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["depth_radius"] = value
 	}
 }
 
-// SummaryWriterContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func SummaryWriterContainer(value string) SummaryWriterAttr {
+// LRNGradBias sets the optional bias attribute to value.
+//
+// value: An offset (usually > 0 to avoid dividing by 0).
+// If not specified, defaults to 1
+func LRNGradBias(value float32) LRNGradAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["bias"] = value
 	}
 }
 
-// Returns a handle to be used to access a summary writer.
-//
-// The summary writer is an in-graph resource which can be used by ops to write
-// summaries to event files.
+// LRNGradAlpha sets the optional alpha attribute to value.
 //
-// Returns the summary writer resource. Scalar handle.
-func SummaryWriter(scope *Scope, optional ...SummaryWriterAttr) (writer tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "SummaryWriter",
-
-		Attrs: attrs,
+// value: A scale factor, usually positive.
+// If not specified, defaults to 1
+func LRNGradAlpha(value float32) LRNGradAttr {
+	return func(m optionalAttr) {
+		m["alpha"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Computes gradients for SparseSegmentMean.
-//
-// Returns tensor "output" with same shape as grad, except for dimension 0 whose
-// value is output_dim0.
+// LRNGradBeta sets the optional beta attribute to value.
 //
-// Arguments:
-//	grad: gradient propagated to the SparseSegmentMean op.
-//	indices: indices passed to the corresponding SparseSegmentMean op.
-//	segment_ids: segment_ids passed to the corresponding SparseSegmentMean op.
-//	output_dim0: dimension 0 of "data" passed to SparseSegmentMean op.
-func SparseSegmentMeanGrad(scope *Scope, grad tf.Output, indices tf.Output, segment_ids tf.Output, output_dim0 tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SparseSegmentMeanGrad",
-		Input: []tf.Input{
-			grad, indices, segment_ids, output_dim0,
-		},
+// value: An exponent.
+// If not specified, defaults to 0.5
+func LRNGradBeta(value float32) LRNGradAttr {
+	return func(m optionalAttr) {
+		m["beta"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Applies softmax to a batched N-D `SparseTensor`.
-//
-// The inputs represent an N-D SparseTensor  with logical shape `[..., B, C]`
-// (where `N >= 2`), and with indices sorted in the canonical lexicographic order.
-//
-// This op is equivalent to applying the normal `tf.nn.softmax()` to each innermost
-// logical submatrix with shape `[B, C]`, but with the catch that *the implicitly
-// zero elements do not participate*.  Specifically, the algorithm is equivalent
-// to the following:
-//
-//   (1) Applies `tf.nn.softmax()` to a densified view of each innermost submatrix
-//       with shape `[B, C]`, along the size-C dimension;
-//   (2) Masks out the original implicitly-zero locations;
-//   (3) Renormalizes the remaining elements.
-//
-// Hence, the `SparseTensor` result has exactly the same non-zero indices and
-// shape.
+// Gradients for Local Response Normalization.
 //
 // Arguments:
-//	sp_indices: 2-D.  `NNZ x R` matrix with the indices of non-empty values in a
-// SparseTensor, in canonical ordering.
-//	sp_values: 1-D.  `NNZ` non-empty values corresponding to `sp_indices`.
-//	sp_shape: 1-D.  Shape of the input SparseTensor.
+//	input_grads: 4-D with shape `[batch, height, width, channels]`.
+//	input_image: 4-D with shape `[batch, height, width, channels]`.
+//	output_image: 4-D with shape `[batch, height, width, channels]`.
 //
-// Returns 1-D.  The `NNZ` values for the result `SparseTensor`.
-func SparseSoftmax(scope *Scope, sp_indices tf.Output, sp_values tf.Output, sp_shape tf.Output) (output tf.Output) {
+// Returns The gradients for LRN.
+func LRNGrad(scope *Scope, input_grads tf.Output, input_image tf.Output, output_image tf.Output, optional ...LRNGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SparseSoftmax",
+		Type: "LRNGrad",
 		Input: []tf.Input{
-			sp_indices, sp_values, sp_shape,
+			input_grads, input_image, output_image,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// RandomPoissonAttr is an optional argument to RandomPoisson.
-type RandomPoissonAttr func(optionalAttr)
-
-// RandomPoissonSeed sets the optional seed attribute to value.
-// If not specified, defaults to 0
-func RandomPoissonSeed(value int64) RandomPoissonAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
-	}
-}
+// AnyAttr is an optional argument to Any.
+type AnyAttr func(optionalAttr)
 
-// RandomPoissonSeed2 sets the optional seed2 attribute to value.
-// If not specified, defaults to 0
-func RandomPoissonSeed2(value int64) RandomPoissonAttr {
+// AnyKeepDims sets the optional keep_dims attribute to value.
+//
+// value: If true, retain reduced dimensions with length 1.
+// If not specified, defaults to false
+func AnyKeepDims(value bool) AnyAttr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["keep_dims"] = value
 	}
 }
 
-// Use RandomPoissonV2 instead.
+// Computes the "logical or" of elements across dimensions of a tensor.
 //
-// DEPRECATED at GraphDef version 25: Replaced by RandomPoissonV2
-func RandomPoisson(scope *Scope, shape tf.Output, rate tf.Output, optional ...RandomPoissonAttr) (output tf.Output) {
+// Reduces `input` along the dimensions given in `axis`. Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `axis`. If `keep_dims` is true, the reduced dimensions are
+// retained with length 1.
+//
+// Arguments:
+//	input: The tensor to reduce.
+//	axis: The dimensions to reduce. Must be in the range
+// `[-rank(input), rank(input))`.
+//
+// Returns The reduced tensor.
+func Any(scope *Scope, input tf.Output, axis tf.Output, optional ...AnyAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -5778,9 +5812,9 @@ func RandomPoisson(scope *Scope, shape tf.Output, rate tf.Output, optional ...Ra
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "RandomPoisson",
+		Type: "Any",
 		Input: []tf.Input{
-			shape, rate,
+			input, axis,
 		},
 		Attrs: attrs,
 	}
@@ -5788,28 +5822,25 @@ func RandomPoisson(scope *Scope, shape tf.Output, rate tf.Output, optional ...Ra
 	return op.Output(0)
 }
 
-// ResourceSparseApplyFtrlV2Attr is an optional argument to ResourceSparseApplyFtrlV2.
-type ResourceSparseApplyFtrlV2Attr func(optionalAttr)
+// ResourceApplyFtrlAttr is an optional argument to ResourceApplyFtrl.
+type ResourceApplyFtrlAttr func(optionalAttr)
 
-// ResourceSparseApplyFtrlV2UseLocking sets the optional use_locking attribute to value.
+// ResourceApplyFtrlUseLocking sets the optional use_locking attribute to value.
 //
 // value: If `True`, updating of the var and accum tensors will be protected
 // by a lock; otherwise the behavior is undefined, but may exhibit less
 // contention.
 // If not specified, defaults to false
-func ResourceSparseApplyFtrlV2UseLocking(value bool) ResourceSparseApplyFtrlV2Attr {
+func ResourceApplyFtrlUseLocking(value bool) ResourceApplyFtrlAttr {
 	return func(m optionalAttr) {
 		m["use_locking"] = value
 	}
 }
 
-// Update relevant entries in '*var' according to the Ftrl-proximal scheme.
+// Update '*var' according to the Ftrl-proximal scheme.
 //
-// That is for rows we have grad for, we update var, accum and linear as follows:
-// grad_with_shrinkage = grad + 2 * l2_shrinkage * var
-// accum_new = accum + grad_with_shrinkage * grad_with_shrinkage
-// linear += grad_with_shrinkage +
-//     (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var
+// accum_new = accum + grad * grad
+// linear += grad - (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var
 // quadratic = 1.0 / (accum_new^(lr_power) * lr) + 2 * l2
 // var = (sign(linear) * l1 - linear) / quadratic if |linear| > l1 else 0.0
 // accum = accum_new
@@ -5819,15 +5850,13 @@ func ResourceSparseApplyFtrlV2UseLocking(value bool) ResourceSparseApplyFtrlV2At
 //	accum: Should be from a Variable().
 //	linear: Should be from a Variable().
 //	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var and accum.
 //	lr: Scaling factor. Must be a scalar.
-//	l1: L1 regularization. Must be a scalar.
-//	l2: L2 shrinkage regulariation. Must be a scalar.
-//
+//	l1: L1 regularization. Must be a scalar.
+//	l2: L2 regularization. Must be a scalar.
 //	lr_power: Scaling factor. Must be a scalar.
 //
 // Returns the created operation.
-func ResourceSparseApplyFtrlV2(scope *Scope, var_ tf.Output, accum tf.Output, linear tf.Output, grad tf.Output, indices tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, l2_shrinkage tf.Output, lr_power tf.Output, optional ...ResourceSparseApplyFtrlV2Attr) (o *tf.Operation) {
+func ResourceApplyFtrl(scope *Scope, var_ tf.Output, accum tf.Output, linear tf.Output, grad tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, lr_power tf.Output, optional ...ResourceApplyFtrlAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -5836,92 +5865,93 @@ func ResourceSparseApplyFtrlV2(scope *Scope, var_ tf.Output, accum tf.Output, li
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyFtrlV2",
+		Type: "ResourceApplyFtrl",
 		Input: []tf.Input{
-			var_, accum, linear, grad, indices, lr, l1, l2, l2_shrinkage, lr_power,
+			var_, accum, linear, grad, lr, l1, l2, lr_power,
 		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Associates the given iterator with the given statistics aggregator.
+// RandomUniformAttr is an optional argument to RandomUniform.
+type RandomUniformAttr func(optionalAttr)
+
+// RandomUniformSeed sets the optional seed attribute to value.
 //
-// Returns the created operation.
-func IteratorSetStatsAggregator(scope *Scope, iterator_handle tf.Output, stats_aggregator_handle tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "IteratorSetStatsAggregator",
-		Input: []tf.Input{
-			iterator_handle, stats_aggregator_handle,
-		},
+// value: If either `seed` or `seed2` are set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func RandomUniformSeed(value int64) RandomUniformAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
 	}
-	return scope.AddOperation(opspec)
 }
 
-// Returns element-wise smallest integer in not less than x.
-func Ceil(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Ceil",
-		Input: []tf.Input{
-			x,
-		},
+// RandomUniformSeed2 sets the optional seed2 attribute to value.
+//
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func RandomUniformSeed2(value int64) RandomUniformAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Computes the number of elements in the given table.
+// Outputs random values from a uniform distribution.
+//
+// The generated values follow a uniform distribution in the range `[0, 1)`. The
+// lower bound 0 is included in the range, while the upper bound 1 is excluded.
 //
 // Arguments:
-//	table_handle: Handle to the table.
+//	shape: The shape of the output tensor.
+//	dtype: The type of the output.
 //
-// Returns Scalar that contains number of elements in the table.
-func LookupTableSizeV2(scope *Scope, table_handle tf.Output) (size tf.Output) {
+// Returns A tensor of the specified shape filled with uniform random values.
+func RandomUniform(scope *Scope, shape tf.Output, dtype tf.DataType, optional ...RandomUniformAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtype": dtype}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "LookupTableSizeV2",
+		Type: "RandomUniform",
 		Input: []tf.Input{
-			table_handle,
+			shape,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResizeBilinearGradAttr is an optional argument to ResizeBilinearGrad.
-type ResizeBilinearGradAttr func(optionalAttr)
+// AssertAttr is an optional argument to Assert.
+type AssertAttr func(optionalAttr)
 
-// ResizeBilinearGradAlignCorners sets the optional align_corners attribute to value.
+// AssertSummarize sets the optional summarize attribute to value.
 //
-// value: If true, rescale grads by (orig_height - 1) / (height - 1), which
-// exactly aligns the 4 corners of grads and original_image. If false, rescale by
-// orig_height / height. Treat similarly the width dimension.
-// If not specified, defaults to false
-func ResizeBilinearGradAlignCorners(value bool) ResizeBilinearGradAttr {
+// value: Print this many entries of each tensor.
+// If not specified, defaults to 3
+func AssertSummarize(value int64) AssertAttr {
 	return func(m optionalAttr) {
-		m["align_corners"] = value
+		m["summarize"] = value
 	}
 }
 
-// Computes the gradient of bilinear interpolation.
+// Asserts that the given condition is true.
 //
-// Arguments:
-//	grads: 4-D with shape `[batch, height, width, channels]`.
-//	original_image: 4-D with shape `[batch, orig_height, orig_width, channels]`,
-// The image tensor that was resized.
+// If `condition` evaluates to false, print the list of tensors in `data`.
+// `summarize` determines how many entries of the tensors to print.
 //
-// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`.
-// Gradients with respect to the input image. Input image must have been
-// float or double.
-func ResizeBilinearGrad(scope *Scope, grads tf.Output, original_image tf.Output, optional ...ResizeBilinearGradAttr) (output tf.Output) {
+// Arguments:
+//	condition: The condition to evaluate.
+//	data: The tensors to print out when condition is false.
+//
+// Returns the created operation.
+func Assert(scope *Scope, condition tf.Output, data []tf.Output, optional ...AssertAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -5930,134 +5960,137 @@ func ResizeBilinearGrad(scope *Scope, grads tf.Output, original_image tf.Output,
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResizeBilinearGrad",
+		Type: "Assert",
 		Input: []tf.Input{
-			grads, original_image,
+			condition, tf.OutputList(data),
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Computes the sum along sparse segments of a tensor divided by the sqrt of N.
-//
-// N is the size of the segment being reduced.
-//
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// Arguments:
+// Computes element-wise population count (a.k.a. popcount, bitsum, bitcount).
 //
-//	indices: A 1-D tensor. Has same rank as `segment_ids`.
-//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
+// For each entry in `x`, calculates the number of `1` (on) bits in the binary
+// representation of that entry.
 //
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SparseSegmentSqrtN(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output) (output tf.Output) {
+// **NOTE**: It is more efficient to first `tf.bitcast` your tensors into
+// `int32` or `int64` and perform the bitcount on the result, than to feed in
+// 8- or 16-bit inputs and then aggregate the resulting counts.
+func PopulationCount(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseSegmentSqrtN",
+		Type: "PopulationCount",
 		Input: []tf.Input{
-			data, indices, segment_ids,
+			x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// StatelessTruncatedNormalAttr is an optional argument to StatelessTruncatedNormal.
-type StatelessTruncatedNormalAttr func(optionalAttr)
-
-// StatelessTruncatedNormalDtype sets the optional dtype attribute to value.
+// Split a `SparseTensor` into `num_split` tensors along one dimension.
 //
-// value: The type of the output.
-// If not specified, defaults to DT_FLOAT
-func StatelessTruncatedNormalDtype(value tf.DataType) StatelessTruncatedNormalAttr {
-	return func(m optionalAttr) {
-		m["dtype"] = value
-	}
-}
-
-// Outputs deterministic pseudorandom values from a truncated normal distribution.
+// If `shape[split_dim]` is not an integer multiple of `num_split`, slices
+// `[0 : shape[split_dim] % num_split]` get one extra element along `split_dim`.
+// For example, if `split_dim = 1` and `num_split = 2` and the input is
 //
-// The generated values follow a normal distribution with mean 0 and standard
-// deviation 1, except that values whose magnitude is more than 2 standard
-// deviations from the mean are dropped and re-picked.
+//     input_tensor = shape = [2, 7]
+//     [    a   d e  ]
+//     [b c          ]
 //
-// The outputs are a deterministic function of `shape` and `seed`.
+// Graphically the output tensors are:
+//
+//     output_tensor[0] = shape = [2, 4]
+//     [    a  ]
+//     [b c    ]
+//
+//     output_tensor[1] = shape = [2, 3]
+//     [ d e  ]
+//     [      ]
 //
 // Arguments:
-//	shape: The shape of the output tensor.
-//	seed: 2 seeds (shape [2]).
+//	split_dim: 0-D.  The dimension along which to split.  Must be in the range
+// `[0, rank(shape))`.
+//	indices: 2-D tensor representing the indices of the sparse tensor.
+//	values: 1-D tensor representing the values of the sparse tensor.
+//	shape: 1-D tensor representing the shape of the sparse tensor.
+// output indices: A list of 1-D tensors representing the indices of the output
+// sparse tensors.
+//	num_split: The number of ways to split.
 //
-// Returns Random values with specified shape.
-func StatelessTruncatedNormal(scope *Scope, shape tf.Output, seed tf.Output, optional ...StatelessTruncatedNormalAttr) (output tf.Output) {
+// Returns A list of 1-D tensors representing the values of the output sparse
+// tensors. A list of 1-D tensors representing the shape of the output sparse
+// tensors.
+func SparseSplit(scope *Scope, split_dim tf.Output, indices tf.Output, values tf.Output, shape tf.Output, num_split int64) (output_indices []tf.Output, output_values []tf.Output, output_shape []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"num_split": num_split}
 	opspec := tf.OpSpec{
-		Type: "StatelessTruncatedNormal",
+		Type: "SparseSplit",
 		Input: []tf.Input{
-			shape, seed,
+			split_dim, indices, values, shape,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if output_indices, idx, err = makeOutputList(op, idx, "output_indices"); err != nil {
+		scope.UpdateErr("SparseSplit", err)
+		return
+	}
+	if output_values, idx, err = makeOutputList(op, idx, "output_values"); err != nil {
+		scope.UpdateErr("SparseSplit", err)
+		return
+	}
+	if output_shape, idx, err = makeOutputList(op, idx, "output_shape"); err != nil {
+		scope.UpdateErr("SparseSplit", err)
+		return
+	}
+	return output_indices, output_values, output_shape
 }
 
-// RestoreSliceAttr is an optional argument to RestoreSlice.
-type RestoreSliceAttr func(optionalAttr)
+// RandomPoissonAttr is an optional argument to RandomPoisson.
+type RandomPoissonAttr func(optionalAttr)
 
-// RestoreSlicePreferredShard sets the optional preferred_shard attribute to value.
-//
-// value: Index of file to open first if multiple files match
-// `file_pattern`. See the documentation for `Restore`.
-// If not specified, defaults to -1
-func RestoreSlicePreferredShard(value int64) RestoreSliceAttr {
+// RandomPoissonSeed sets the optional seed attribute to value.
+// If not specified, defaults to 0
+func RandomPoissonSeed(value int64) RandomPoissonAttr {
 	return func(m optionalAttr) {
-		m["preferred_shard"] = value
+		m["seed"] = value
 	}
 }
 
-// Restores a tensor from checkpoint files.
-//
-// This is like `Restore` except that restored tensor can be listed as filling
-// only a slice of a larger tensor.  `shape_and_slice` specifies the shape of the
-// larger tensor and the slice that the restored tensor covers.
-//
-// The `shape_and_slice` input has the same format as the
-// elements of the `shapes_and_slices` input of the `SaveSlices` op.
-//
-// Arguments:
-//	file_pattern: Must have a single element. The pattern of the files from
-// which we read the tensor.
-//	tensor_name: Must have a single element. The name of the tensor to be
-// restored.
-//	shape_and_slice: Scalar. The shapes and slice specifications to use when
-// restoring a tensors.
-//	dt: The type of the tensor to be restored.
+// RandomPoissonSeed2 sets the optional seed2 attribute to value.
+// If not specified, defaults to 0
+func RandomPoissonSeed2(value int64) RandomPoissonAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// Use RandomPoissonV2 instead.
 //
-// Returns The restored tensor.
-func RestoreSlice(scope *Scope, file_pattern tf.Output, tensor_name tf.Output, shape_and_slice tf.Output, dt tf.DataType, optional ...RestoreSliceAttr) (tensor tf.Output) {
+// DEPRECATED at GraphDef version 25: Replaced by RandomPoissonV2
+func RandomPoisson(scope *Scope, shape tf.Output, rate tf.Output, optional ...RandomPoissonAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dt": dt}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "RestoreSlice",
+		Type: "RandomPoisson",
 		Input: []tf.Input{
-			file_pattern, tensor_name, shape_and_slice,
+			shape, rate,
 		},
 		Attrs: attrs,
 	}
@@ -6065,42 +6098,46 @@ func RestoreSlice(scope *Scope, file_pattern tf.Output, tensor_name tf.Output, s
 	return op.Output(0)
 }
 
-// UniqueWithCountsAttr is an optional argument to UniqueWithCounts.
-type UniqueWithCountsAttr func(optionalAttr)
+// ResourceSparseApplyFtrlV2Attr is an optional argument to ResourceSparseApplyFtrlV2.
+type ResourceSparseApplyFtrlV2Attr func(optionalAttr)
 
-// UniqueWithCountsOutIdx sets the optional out_idx attribute to value.
-// If not specified, defaults to DT_INT32
-func UniqueWithCountsOutIdx(value tf.DataType) UniqueWithCountsAttr {
+// ResourceSparseApplyFtrlV2UseLocking sets the optional use_locking attribute to value.
+//
+// value: If `True`, updating of the var and accum tensors will be protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceSparseApplyFtrlV2UseLocking(value bool) ResourceSparseApplyFtrlV2Attr {
 	return func(m optionalAttr) {
-		m["out_idx"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Finds unique elements in a 1-D tensor.
-//
-// This operation returns a tensor `y` containing all of the unique elements of `x`
-// sorted in the same order that they occur in `x`. This operation also returns a
-// tensor `idx` the same size as `x` that contains the index of each value of `x`
-// in the unique output `y`. Finally, it returns a third tensor `count` that
-// contains the count of each element of `y` in `x`. In other words:
-//
-// `y[idx[i]] = x[i] for i in [0, 1,...,rank(x) - 1]`
-//
-// For example:
+// Update relevant entries in '*var' according to the Ftrl-proximal scheme.
 //
-// ```
-// # tensor 'x' is [1, 1, 2, 4, 4, 4, 7, 8, 8]
-// y, idx, count = unique_with_counts(x)
-// y ==> [1, 2, 4, 7, 8]
-// idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
-// count ==> [2, 1, 3, 1, 2]
-// ```
+// That is, for rows for which we have grad, we update var, accum and linear as follows:
+// grad_with_shrinkage = grad + 2 * l2_shrinkage * var
+// accum_new = accum + grad_with_shrinkage * grad_with_shrinkage
+// linear += grad_with_shrinkage +
+//     (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var
+// quadratic = 1.0 / (accum_new^(lr_power) * lr) + 2 * l2
+// var = (sign(linear) * l1 - linear) / quadratic if |linear| > l1 else 0.0
+// accum = accum_new
 //
 // Arguments:
-//	x: 1-D.
+//	var_: Should be from a Variable().
+//	accum: Should be from a Variable().
+//	linear: Should be from a Variable().
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var and accum.
+//	lr: Scaling factor. Must be a scalar.
+//	l1: L1 regularization. Must be a scalar.
+//	l2: L2 shrinkage regularization. Must be a scalar.
 //
-// Returns 1-D.1-D.1-D.
-func UniqueWithCounts(scope *Scope, x tf.Output, optional ...UniqueWithCountsAttr) (y tf.Output, idx tf.Output, count tf.Output) {
+//	lr_power: Scaling factor. Must be a scalar.
+//
+// Returns the created operation.
+func ResourceSparseApplyFtrlV2(scope *Scope, var_ tf.Output, accum tf.Output, linear tf.Output, grad tf.Output, indices tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, l2_shrinkage tf.Output, lr_power tf.Output, optional ...ResourceSparseApplyFtrlV2Attr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -6109,123 +6146,63 @@ func UniqueWithCounts(scope *Scope, x tf.Output, optional ...UniqueWithCountsAtt
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "UniqueWithCounts",
+		Type: "ResourceSparseApplyFtrlV2",
 		Input: []tf.Input{
-			x,
+			var_, accum, linear, grad, indices, lr, l1, l2, l2_shrinkage, lr_power,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return scope.AddOperation(opspec)
 }
 
-// StatelessRandomNormalAttr is an optional argument to StatelessRandomNormal.
-type StatelessRandomNormalAttr func(optionalAttr)
-
-// StatelessRandomNormalDtype sets the optional dtype attribute to value.
+// Associates the given iterator with the given statistics aggregator.
 //
-// value: The type of the output.
-// If not specified, defaults to DT_FLOAT
-func StatelessRandomNormalDtype(value tf.DataType) StatelessRandomNormalAttr {
-	return func(m optionalAttr) {
-		m["dtype"] = value
-	}
-}
-
-// Outputs deterministic pseudorandom values from a normal distribution.
-//
-// The generated values will have mean 0 and standard deviation 1.
-//
-// The outputs are a deterministic function of `shape` and `seed`.
-//
-// Arguments:
-//	shape: The shape of the output tensor.
-//	seed: 2 seeds (shape [2]).
-//
-// Returns Random values with specified shape.
-func StatelessRandomNormal(scope *Scope, shape tf.Output, seed tf.Output, optional ...StatelessRandomNormalAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
+// Returns the created operation.
+func IteratorSetStatsAggregator(scope *Scope, iterator_handle tf.Output, stats_aggregator_handle tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
 	}
 	opspec := tf.OpSpec{
-		Type: "StatelessRandomNormal",
+		Type: "IteratorSetStatsAggregator",
 		Input: []tf.Input{
-			shape, seed,
+			iterator_handle, stats_aggregator_handle,
 		},
-		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Reshapes a quantized tensor as per the Reshape op.
-//
-// ```
-//
-// Arguments:
-//
-//	shape: Defines the shape of the output tensor.
-//	input_min: The minimum value of the input.
-//	input_max: The maximum value of the input.
+// DataFormatVecPermuteAttr is an optional argument to DataFormatVecPermute.
+type DataFormatVecPermuteAttr func(optionalAttr)
+
+// DataFormatVecPermuteSrcFormat sets the optional src_format attribute to value.
 //
-// Returns This value is copied from input_min.This value is copied from input_max.
-func QuantizedReshape(scope *Scope, tensor tf.Output, shape tf.Output, input_min tf.Output, input_max tf.Output) (output tf.Output, output_min tf.Output, output_max tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "QuantizedReshape",
-		Input: []tf.Input{
-			tensor, shape, input_min, input_max,
-		},
+// value: source data format.
+// If not specified, defaults to "NHWC"
+func DataFormatVecPermuteSrcFormat(value string) DataFormatVecPermuteAttr {
+	return func(m optionalAttr) {
+		m["src_format"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// GatherAttr is an optional argument to Gather.
-type GatherAttr func(optionalAttr)
-
-// GatherValidateIndices sets the optional validate_indices attribute to value.
-// If not specified, defaults to true
-func GatherValidateIndices(value bool) GatherAttr {
+// DataFormatVecPermuteDstFormat sets the optional dst_format attribute to value.
+//
+// value: destination data format.
+// If not specified, defaults to "NCHW"
+func DataFormatVecPermuteDstFormat(value string) DataFormatVecPermuteAttr {
 	return func(m optionalAttr) {
-		m["validate_indices"] = value
+		m["dst_format"] = value
 	}
 }
 
-// Gather slices from `params` according to `indices`.
-//
-// `indices` must be an integer tensor of any dimension (usually 0-D or 1-D).
-// Produces an output tensor with shape `indices.shape + params.shape[1:]` where:
-//
-// ```python
-//     # Scalar indices
-//     output[:, ..., :] = params[indices, :, ... :]
-//
-//     # Vector indices
-//     output[i, :, ..., :] = params[indices[i], :, ... :]
-//
-//     # Higher rank indices
-//     output[i, ..., j, :, ... :] = params[indices[i, ..., j], :, ..., :]
-// ```
+// Returns the permuted vector/tensor in the destination data format given the
 //
-// If `indices` is a permutation and `len(indices) == params.shape[0]` then
-// this operation will permute `params` accordingly.
+// one in the source data format.
 //
-// `validate_indices`: DEPRECATED. If this operation is assigned to CPU, values in
-// `indices` are always validated to be within range. If assigned to GPU,
-// out-of-bound indices result in safe but unspecified behavior, which may include
-// raising an error.
+// Arguments:
+//	x: Vector of size 4 or Tensor of shape (4, 2) in source data format.
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/Gather.png" alt>
-// </div>
-func Gather(scope *Scope, params tf.Output, indices tf.Output, optional ...GatherAttr) (output tf.Output) {
+// Returns Vector of size 4 or Tensor of shape (4, 2) in destination data format.
+func DataFormatVecPermute(scope *Scope, x tf.Output, optional ...DataFormatVecPermuteAttr) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -6234,9 +6211,9 @@ func Gather(scope *Scope, params tf.Output, indices tf.Output, optional ...Gathe
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Gather",
+		Type: "DataFormatVecPermute",
 		Input: []tf.Input{
-			params, indices,
+			x,
 		},
 		Attrs: attrs,
 	}
@@ -6244,110 +6221,76 @@ func Gather(scope *Scope, params tf.Output, indices tf.Output, optional ...Gathe
 	return op.Output(0)
 }
 
-// Returns the truth value of (x != y) element-wise.
-//
-// *NOTE*: `NotEqual` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func NotEqual(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Computes tan of x element-wise.
+func Tan(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "NotEqual",
+		Type: "Tan",
 		Input: []tf.Input{
-			x, y,
+			x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Inverse 3D real-valued fast Fourier transform.
-//
-// Computes the inverse 3-dimensional discrete Fourier transform of a real-valued
-// signal over the inner-most 3 dimensions of `input`.
+// Computes the sum along sparse segments of a tensor divided by the sqrt of N.
 //
-// The inner-most 3 dimensions of `input` are assumed to be the result of `RFFT3D`:
-// The inner-most dimension contains the `fft_length / 2 + 1` unique components of
-// the DFT of a real-valued signal. If `fft_length` is not provided, it is computed
-// from the size of the inner-most 3 dimensions of `input`. If the FFT length used
-// to compute `input` is odd, it should be provided since it cannot be inferred
-// properly.
+// N is the size of the segment being reduced.
 //
-// Along each axis `IRFFT3D` is computed on, if `fft_length` (or
-// `fft_length / 2 + 1` for the inner-most dimension) is smaller than the
-// corresponding dimension of `input`, the dimension is cropped. If it is larger,
-// the dimension is padded with zeros.
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
 //
 // Arguments:
-//	input: A complex64 tensor.
-//	fft_length: An int32 tensor of shape [3]. The FFT length for each dimension.
 //
-// Returns A float32 tensor of the same rank as `input`. The inner-most 3
-//   dimensions of `input` are replaced with the `fft_length` samples of their
-//   inverse 3D real Fourier transform.
+//	indices: A 1-D tensor. Has same rank as `segment_ids`.
+//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
 //
-// @compatibility(numpy)
-// Equivalent to np.irfftn with 3 dimensions.
-// @end_compatibility
-func IRFFT3D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
+// Returns Has same shape as data, except for dimension 0 which
+// has size `k`, the number of segments.
+func SparseSegmentSqrtN(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "IRFFT3D",
+		Type: "SparseSegmentSqrtN",
 		Input: []tf.Input{
-			input, fft_length,
+			data, indices, segment_ids,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// StringSplitAttr is an optional argument to StringSplit.
-type StringSplitAttr func(optionalAttr)
+// StatelessTruncatedNormalAttr is an optional argument to StatelessTruncatedNormal.
+type StatelessTruncatedNormalAttr func(optionalAttr)
 
-// StringSplitSkipEmpty sets the optional skip_empty attribute to value.
+// StatelessTruncatedNormalDtype sets the optional dtype attribute to value.
 //
-// value: A `bool`. If `True`, skip the empty strings from the result.
-// If not specified, defaults to true
-func StringSplitSkipEmpty(value bool) StringSplitAttr {
+// value: The type of the output.
+// If not specified, defaults to DT_FLOAT
+func StatelessTruncatedNormalDtype(value tf.DataType) StatelessTruncatedNormalAttr {
 	return func(m optionalAttr) {
-		m["skip_empty"] = value
+		m["dtype"] = value
 	}
 }
 
-// Split elements of `input` based on `delimiter` into a `SparseTensor`.
-//
-// Let N be the size of source (typically N will be the batch size). Split each
-// element of `input` based on `delimiter` and return a `SparseTensor`
-// containing the splitted tokens. Empty tokens are ignored.
-//
-// `delimiter` can be empty, or a string of split characters. If `delimiter` is an
-//  empty string, each element of `input` is split into individual single-byte
-//  character strings, including splitting of UTF-8 multibyte sequences. Otherwise
-//  every character of `delimiter` is a potential split point.
+// Outputs deterministic pseudorandom values from a truncated normal distribution.
 //
-// For example:
-//   N = 2, input[0] is 'hello world' and input[1] is 'a b c', then the output
-//   will be
+// The generated values follow a normal distribution with mean 0 and standard
+// deviation 1, except that values whose magnitude is more than 2 standard
+// deviations from the mean are dropped and re-picked.
 //
-//   indices = [0, 0;
-//              0, 1;
-//              1, 0;
-//              1, 1;
-//              1, 2]
-//   shape = [2, 3]
-//   values = ['hello', 'world', 'a', 'b', 'c']
+// The outputs are a deterministic function of `shape` and `seed`.
 //
 // Arguments:
-//	input: 1-D. Strings to split.
-//	delimiter: 0-D. Delimiter characters (bytes), or empty string.
+//	shape: The shape of the output tensor.
+//	seed: 2 seeds (shape [2]).
 //
-// Returns A dense matrix of int64 representing the indices of the sparse tensor.A vector of strings corresponding to the splited values.a length-2 vector of int64 representing the shape of the sparse
-// tensor, where the first value is N and the second value is the maximum number
-// of tokens in a single input entry.
-func StringSplit(scope *Scope, input tf.Output, delimiter tf.Output, optional ...StringSplitAttr) (indices tf.Output, values tf.Output, shape tf.Output) {
+// Returns Random values with specified shape.
+func StatelessTruncatedNormal(scope *Scope, shape tf.Output, seed tf.Output, optional ...StatelessTruncatedNormalAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -6356,98 +6299,93 @@ func StringSplit(scope *Scope, input tf.Output, delimiter tf.Output, optional ..
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "StringSplit",
+		Type: "StatelessTruncatedNormal",
 		Input: []tf.Input{
-			input, delimiter,
+			shape, seed,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// WriteAudioSummaryAttr is an optional argument to WriteAudioSummary.
-type WriteAudioSummaryAttr func(optionalAttr)
+// RestoreSliceAttr is an optional argument to RestoreSlice.
+type RestoreSliceAttr func(optionalAttr)
 
-// WriteAudioSummaryMaxOutputs sets the optional max_outputs attribute to value.
-//
-// value: Max number of batch elements to generate audio for.
-// If not specified, defaults to 3
+// RestoreSlicePreferredShard sets the optional preferred_shard attribute to value.
 //
-// REQUIRES: value >= 1
-func WriteAudioSummaryMaxOutputs(value int64) WriteAudioSummaryAttr {
+// value: Index of file to open first if multiple files match
+// `file_pattern`. See the documentation for `Restore`.
+// If not specified, defaults to -1
+func RestoreSlicePreferredShard(value int64) RestoreSliceAttr {
 	return func(m optionalAttr) {
-		m["max_outputs"] = value
+		m["preferred_shard"] = value
 	}
 }
 
-// Writes a `Summary` protocol buffer with audio.
-//
-// The summary has up to `max_outputs` summary values containing audio. The
-// audio is built from `tensor` which must be 3-D with shape `[batch_size,
-// frames, channels]` or 2-D with shape `[batch_size, frames]`. The values are
-// assumed to be in the range of `[-1.0, 1.0]` with a sample rate of `sample_rate`.
+// Restores a tensor from checkpoint files.
 //
-// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
-// build the `tag` of the summary values:
+// This is like `Restore` except that restored tensor can be listed as filling
+// only a slice of a larger tensor.  `shape_and_slice` specifies the shape of the
+// larger tensor and the slice that the restored tensor covers.
 //
-// *  If `max_outputs` is 1, the summary value tag is '*tag*/audio'.
-// *  If `max_outputs` is greater than 1, the summary value tags are
-//    generated sequentially as '*tag*/audio/0', '*tag*/audio/1', etc.
+// The `shape_and_slice` input has the same format as the
+// elements of the `shapes_and_slices` input of the `SaveSlices` op.
 //
 // Arguments:
-//	writer: A handle to a summary writer.
-//	step: The step to write the summary for.
-//	tag: Scalar. Used to build the `tag` attribute of the summary values.
-//	tensor: 2-D of shape `[batch_size, frames]`.
-//	sample_rate: The sample rate of the signal in hertz.
+//	file_pattern: Must have a single element. The pattern of the files from
+// which we read the tensor.
+//	tensor_name: Must have a single element. The name of the tensor to be
+// restored.
+//	shape_and_slice: Scalar. The shapes and slice specifications to use when
+// restoring a tensor.
+//	dt: The type of the tensor to be restored.
 //
-// Returns the created operation.
-func WriteAudioSummary(scope *Scope, writer tf.Output, step tf.Output, tag tf.Output, tensor tf.Output, sample_rate tf.Output, optional ...WriteAudioSummaryAttr) (o *tf.Operation) {
+// Returns The restored tensor.
+func RestoreSlice(scope *Scope, file_pattern tf.Output, tensor_name tf.Output, shape_and_slice tf.Output, dt tf.DataType, optional ...RestoreSliceAttr) (tensor tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dt": dt}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "WriteAudioSummary",
+		Type: "RestoreSlice",
 		Input: []tf.Input{
-			writer, step, tag, tensor, sample_rate,
+			file_pattern, tensor_name, shape_and_slice,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// ProdAttr is an optional argument to Prod.
-type ProdAttr func(optionalAttr)
+// ImagAttr is an optional argument to Imag.
+type ImagAttr func(optionalAttr)
 
-// ProdKeepDims sets the optional keep_dims attribute to value.
-//
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func ProdKeepDims(value bool) ProdAttr {
+// ImagTout sets the optional Tout attribute to value.
+// If not specified, defaults to DT_FLOAT
+func ImagTout(value tf.DataType) ImagAttr {
 	return func(m optionalAttr) {
-		m["keep_dims"] = value
+		m["Tout"] = value
 	}
 }
 
-// Computes the product of elements across dimensions of a tensor.
+// Returns the imaginary part of a complex number.
 //
-// Reduces `input` along the dimensions given in `axis`. Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `axis`. If `keep_dims` is true, the reduced dimensions are
-// retained with length 1.
+// Given a tensor `input` of complex numbers, this operation returns a tensor of
+// type `float` that is the imaginary part of each element in `input`. All
+// elements in `input` must be complex numbers of the form \\(a + bj\\), where *a*
+// is the real part and *b* is the imaginary part returned by this operation.
 //
-// Arguments:
-//	input: The tensor to reduce.
-//	axis: The dimensions to reduce. Must be in the range
-// `[-rank(input), rank(input))`.
+// For example:
 //
-// Returns The reduced tensor.
-func Prod(scope *Scope, input tf.Output, axis tf.Output, optional ...ProdAttr) (output tf.Output) {
+// ```
+// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
+// tf.imag(input) ==> [4.75, 5.75]
+// ```
+func Imag(scope *Scope, input tf.Output, optional ...ImagAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -6456,9 +6394,9 @@ func Prod(scope *Scope, input tf.Output, axis tf.Output, optional ...ProdAttr) (
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Prod",
+		Type: "Imag",
 		Input: []tf.Input{
-			input, axis,
+			input,
 		},
 		Attrs: attrs,
 	}
@@ -6466,33 +6404,34 @@ func Prod(scope *Scope, input tf.Output, axis tf.Output, optional ...ProdAttr) (
 	return op.Output(0)
 }
 
-// ResizeBilinearAttr is an optional argument to ResizeBilinear.
-type ResizeBilinearAttr func(optionalAttr)
+// ComplexAttr is an optional argument to Complex.
+type ComplexAttr func(optionalAttr)
 
-// ResizeBilinearAlignCorners sets the optional align_corners attribute to value.
-//
-// value: If true, rescale input by (new_height - 1) / (height - 1), which
-// exactly aligns the 4 corners of images and resized images. If false, rescale
-// by new_height / height. Treat similarly the width dimension.
-// If not specified, defaults to false
-func ResizeBilinearAlignCorners(value bool) ResizeBilinearAttr {
+// ComplexTout sets the optional Tout attribute to value.
+// If not specified, defaults to DT_COMPLEX64
+func ComplexTout(value tf.DataType) ComplexAttr {
 	return func(m optionalAttr) {
-		m["align_corners"] = value
+		m["Tout"] = value
 	}
 }
 
-// Resize `images` to `size` using bilinear interpolation.
+// Converts two real numbers to a complex number.
 //
-// Input images can be of different types but output images are always float.
+// Given a tensor `real` representing the real part of a complex number, and a
+// tensor `imag` representing the imaginary part of a complex number, this
+// operation returns complex numbers elementwise of the form \\(a + bj\\), where
+// *a* represents the `real` part and *b* represents the `imag` part.
 //
-// Arguments:
-//	images: 4-D with shape `[batch, height, width, channels]`.
-//	size: = A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
-// new size for the images.
+// The input tensors `real` and `imag` must have the same shape.
 //
-// Returns 4-D with shape
-// `[batch, new_height, new_width, channels]`.
-func ResizeBilinear(scope *Scope, images tf.Output, size tf.Output, optional ...ResizeBilinearAttr) (resized_images tf.Output) {
+// For example:
+//
+// ```
+// # tensor 'real' is [2.25, 3.25]
+// # tensor 'imag' is [4.75, 5.75]
+// tf.complex(real, imag) ==> [2.25 + 4.75j, 3.25 + 5.75j]
+// ```
+func Complex(scope *Scope, real tf.Output, imag tf.Output, optional ...ComplexAttr) (out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -6501,9 +6440,9 @@ func ResizeBilinear(scope *Scope, images tf.Output, size tf.Output, optional ...
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResizeBilinear",
+		Type: "Complex",
 		Input: []tf.Input{
-			images, size,
+			real, imag,
 		},
 		Attrs: attrs,
 	}
@@ -6511,189 +6450,167 @@ func ResizeBilinear(scope *Scope, images tf.Output, size tf.Output, optional ...
 	return op.Output(0)
 }
 
-// Computes softsign: `features / (abs(features) + 1)`.
-func Softsign(scope *Scope, features tf.Output) (activations tf.Output) {
+// UniqueWithCountsAttr is an optional argument to UniqueWithCounts.
+type UniqueWithCountsAttr func(optionalAttr)
+
+// UniqueWithCountsOutIdx sets the optional out_idx attribute to value.
+// If not specified, defaults to DT_INT32
+func UniqueWithCountsOutIdx(value tf.DataType) UniqueWithCountsAttr {
+	return func(m optionalAttr) {
+		m["out_idx"] = value
+	}
+}
+
+// Finds unique elements in a 1-D tensor.
+//
+// This operation returns a tensor `y` containing all of the unique elements of `x`
+// sorted in the same order that they occur in `x`. This operation also returns a
+// tensor `idx` the same size as `x` that contains the index of each value of `x`
+// in the unique output `y`. Finally, it returns a third tensor `count` that
+// contains the count of each element of `y` in `x`. In other words:
+//
+// `y[idx[i]] = x[i] for i in [0, 1, ..., size(x) - 1]`
+//
+// For example:
+//
+// ```
+// # tensor 'x' is [1, 1, 2, 4, 4, 4, 7, 8, 8]
+// y, idx, count = unique_with_counts(x)
+// y ==> [1, 2, 4, 7, 8]
+// idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
+// count ==> [2, 1, 3, 1, 2]
+// ```
+//
+// Arguments:
+//	x: 1-D.
+//
+// Returns three 1-D tensors: `y`, `idx`, and `count`.
+func UniqueWithCounts(scope *Scope, x tf.Output, optional ...UniqueWithCountsAttr) (y tf.Output, idx tf.Output, count tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Softsign",
+		Type: "UniqueWithCounts",
 		Input: []tf.Input{
-			features,
+			x,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// GenerateVocabRemappingAttr is an optional argument to GenerateVocabRemapping.
-type GenerateVocabRemappingAttr func(optionalAttr)
+// StatelessRandomNormalAttr is an optional argument to StatelessRandomNormal.
+type StatelessRandomNormalAttr func(optionalAttr)
 
-// GenerateVocabRemappingOldVocabSize sets the optional old_vocab_size attribute to value.
-//
-// value: Number of entries in the old vocab file to consider.  If -1,
-// use the entire old vocabulary.
-// If not specified, defaults to -1
+// StatelessRandomNormalDtype sets the optional dtype attribute to value.
 //
-// REQUIRES: value >= -1
-func GenerateVocabRemappingOldVocabSize(value int64) GenerateVocabRemappingAttr {
+// value: The type of the output.
+// If not specified, defaults to DT_FLOAT
+func StatelessRandomNormalDtype(value tf.DataType) StatelessRandomNormalAttr {
 	return func(m optionalAttr) {
-		m["old_vocab_size"] = value
+		m["dtype"] = value
 	}
 }
 
-// Given a path to new and old vocabulary files, returns a remapping Tensor of
-//
-// length `num_new_vocab`, where `remapping[i]` contains the row number in the old
-// vocabulary that corresponds to row `i` in the new vocabulary (starting at line
-// `new_vocab_offset` and up to `num_new_vocab` entities), or `-1` if entry `i`
-// in the new vocabulary is not in the old vocabulary.  The old vocabulary is
-// constrained to the first `old_vocab_size` entries if `old_vocab_size` is not the
-// default value of -1.
-//
-// `num_vocab_offset` enables
-// use in the partitioned variable case, and should generally be set through
-// examining partitioning info.  The format of the files should be a text file,
-// with each line containing a single entity within the vocabulary.
-//
-// For example, with `new_vocab_file` a text file containing each of the following
-// elements on a single line: `[f0, f1, f2, f3]`, old_vocab_file = [f1, f0, f3],
-// `num_new_vocab = 3, new_vocab_offset = 1`, the returned remapping would be
-// `[0, -1, 2]`.
+// Outputs deterministic pseudorandom values from a normal distribution.
 //
-// The op also returns a count of how many entries in the new vocabulary
-// were present in the old vocabulary, which is used to calculate the number of
-// values to initialize in a weight matrix remapping
+// The generated values will have mean 0 and standard deviation 1.
 //
-// This functionality can be used to remap both row vocabularies (typically,
-// features) and column vocabularies (typically, classes) from TensorFlow
-// checkpoints.  Note that the partitioning logic relies on contiguous vocabularies
-// corresponding to div-partitioned variables.  Moreover, the underlying remapping
-// uses an IndexTable (as opposed to an inexact CuckooTable), so client code should
-// use the corresponding index_table_from_file() as the FeatureColumn framework
-// does (as opposed to tf.feature_to_id(), which uses a CuckooTable).
+// The outputs are a deterministic function of `shape` and `seed`.
 //
 // Arguments:
-//	new_vocab_file: Path to the new vocab file.
-//	old_vocab_file: Path to the old vocab file.
-//	new_vocab_offset: How many entries into the new vocab file to start reading.
-//	num_new_vocab: Number of entries in the new vocab file to remap.
+//	shape: The shape of the output tensor.
+//	seed: 2 seeds (shape [2]).
 //
-// Returns A Tensor of length num_new_vocab where the element at index i
-// is equal to the old ID that maps to the new ID i.  This element is -1 for any
-// new ID that is not found in the old vocabulary.Number of new vocab entries found in old vocab.
-func GenerateVocabRemapping(scope *Scope, new_vocab_file tf.Output, old_vocab_file tf.Output, new_vocab_offset int64, num_new_vocab int64, optional ...GenerateVocabRemappingAttr) (remapping tf.Output, num_present tf.Output) {
+// Returns Random values with specified shape.
+func StatelessRandomNormal(scope *Scope, shape tf.Output, seed tf.Output, optional ...StatelessRandomNormalAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"new_vocab_offset": new_vocab_offset, "num_new_vocab": num_new_vocab}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "GenerateVocabRemapping",
+		Type: "StatelessRandomNormal",
 		Input: []tf.Input{
-			new_vocab_file, old_vocab_file,
+			shape, seed,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Assigns sparse updates to the variable referenced by `resource`.
-//
-// This operation computes
-//
-//     # Scalar indices
-//     ref[indices, ...] = updates[...]
-//
-//     # Vector indices (for each i)
-//     ref[indices[i], ...] = updates[i, ...]
+// Reshapes a quantized tensor as per the Reshape op.
 //
-//     # High rank indices (for each i, ..., j)
-//     ref[indices[i, ..., j], ...] = updates[i, ..., j, ...]
+//
 //
 // Arguments:
-//	resource: Should be from a `Variable` node.
-//	indices: A tensor of indices into the first dimension of `ref`.
-//	updates: A tensor of updated values to add to `ref`.
 //
-// Returns the created operation.
-func ResourceScatterUpdate(scope *Scope, resource tf.Output, indices tf.Output, updates tf.Output) (o *tf.Operation) {
+//	shape: Defines the shape of the output tensor.
+//	input_min: The minimum value of the input.
+//	input_max: The maximum value of the input.
+//
+// Returns the reshaped tensor, followed by `output_min` (copied from `input_min`) and `output_max` (copied from `input_max`).
+func QuantizedReshape(scope *Scope, tensor tf.Output, shape tf.Output, input_min tf.Output, input_max tf.Output) (output tf.Output, output_min tf.Output, output_max tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceScatterUpdate",
+		Type: "QuantizedReshape",
 		Input: []tf.Input{
-			resource, indices, updates,
+			tensor, shape, input_min, input_max,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// CumsumAttr is an optional argument to Cumsum.
-type CumsumAttr func(optionalAttr)
-
-// CumsumExclusive sets the optional exclusive attribute to value.
-//
-// value: If `True`, perform exclusive cumsum.
-// If not specified, defaults to false
-func CumsumExclusive(value bool) CumsumAttr {
-	return func(m optionalAttr) {
-		m["exclusive"] = value
-	}
-}
+// GatherAttr is an optional argument to Gather.
+type GatherAttr func(optionalAttr)
 
-// CumsumReverse sets the optional reverse attribute to value.
-//
-// value: A `bool` (default: False).
-// If not specified, defaults to false
-func CumsumReverse(value bool) CumsumAttr {
+// GatherValidateIndices sets the optional validate_indices attribute to value.
+// If not specified, defaults to true
+func GatherValidateIndices(value bool) GatherAttr {
 	return func(m optionalAttr) {
-		m["reverse"] = value
+		m["validate_indices"] = value
 	}
 }
 
-// Compute the cumulative sum of the tensor `x` along `axis`.
+// Gather slices from `params` according to `indices`.
 //
-// By default, this op performs an inclusive cumsum, which means that the first
-// element of the input is identical to the first element of the output:
-//
-// ```python
-// tf.cumsum([a, b, c])  # => [a, a + b, a + b + c]
-// ```
-//
-// By setting the `exclusive` kwarg to `True`, an exclusive cumsum is
-// performed instead:
+// `indices` must be an integer tensor of any dimension (usually 0-D or 1-D).
+// Produces an output tensor with shape `indices.shape + params.shape[1:]` where:
 //
 // ```python
-// tf.cumsum([a, b, c], exclusive=True)  # => [0, a, a + b]
-// ```
+//     # Scalar indices
+//     output[:, ..., :] = params[indices, :, ..., :]
 //
-// By setting the `reverse` kwarg to `True`, the cumsum is performed in the
-// opposite direction:
+//     # Vector indices
+//     output[i, :, ..., :] = params[indices[i], :, ..., :]
 //
-// ```python
-// tf.cumsum([a, b, c], reverse=True)  # => [a + b + c, b + c, c]
+//     # Higher rank indices
+//     output[i, ..., j, :, ..., :] = params[indices[i, ..., j], :, ..., :]
 // ```
 //
-// This is more efficient than using separate `tf.reverse` ops.
-//
-// The `reverse` and `exclusive` kwargs can also be combined:
+// If `indices` is a permutation and `len(indices) == params.shape[0]` then
+// this operation will permute `params` accordingly.
 //
-// ```python
-// tf.cumsum([a, b, c], exclusive=True, reverse=True)  # => [b + c, c, 0]
-// ```
+// `validate_indices`: DEPRECATED. If this operation is assigned to CPU, values in
+// `indices` are always validated to be within range. If assigned to GPU,
+// out-of-bound indices result in safe but unspecified behavior, which may include
+// raising an error.
 //
-// Arguments:
-//	x: A `Tensor`. Must be one of the following types: `float32`, `float64`,
-// `int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,
-// `complex128`, `qint8`, `quint8`, `qint32`, `half`.
-//	axis: A `Tensor` of type `int32` (default: 0). Must be in the range
-// `[-rank(x), rank(x))`.
-func Cumsum(scope *Scope, x tf.Output, axis tf.Output, optional ...CumsumAttr) (out tf.Output) {
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/Gather.png" alt>
+// </div>
+func Gather(scope *Scope, params tf.Output, indices tf.Output, optional ...GatherAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -6702,9 +6619,9 @@ func Cumsum(scope *Scope, x tf.Output, axis tf.Output, optional ...CumsumAttr) (
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Cumsum",
+		Type: "Gather",
 		Input: []tf.Input{
-			x, axis,
+			params, indices,
 		},
 		Attrs: attrs,
 	}
@@ -6712,363 +6629,339 @@ func Cumsum(scope *Scope, x tf.Output, axis tf.Output, optional ...CumsumAttr) (
 	return op.Output(0)
 }
 
-// QuantizedRelu6Attr is an optional argument to QuantizedRelu6.
-type QuantizedRelu6Attr func(optionalAttr)
-
-// QuantizedRelu6OutType sets the optional out_type attribute to value.
-// If not specified, defaults to DT_QUINT8
-func QuantizedRelu6OutType(value tf.DataType) QuantizedRelu6Attr {
-	return func(m optionalAttr) {
-		m["out_type"] = value
+// Returns the truth value of (x != y) element-wise.
+//
+// *NOTE*: `NotEqual` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func NotEqual(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "NotEqual",
+		Input: []tf.Input{
+			x, y,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Computes Quantized Rectified Linear 6: `min(max(features, 0), 6)`
+// Inverse 3D real-valued fast Fourier transform.
+//
+// Computes the inverse 3-dimensional discrete Fourier transform of a real-valued
+// signal over the inner-most 3 dimensions of `input`.
+//
+// The inner-most 3 dimensions of `input` are assumed to be the result of `RFFT3D`:
+// The inner-most dimension contains the `fft_length / 2 + 1` unique components of
+// the DFT of a real-valued signal. If `fft_length` is not provided, it is computed
+// from the size of the inner-most 3 dimensions of `input`. If the FFT length used
+// to compute `input` is odd, it should be provided since it cannot be inferred
+// properly.
+//
+// Along each axis `IRFFT3D` is computed on, if `fft_length` (or
+// `fft_length / 2 + 1` for the inner-most dimension) is smaller than the
+// corresponding dimension of `input`, the dimension is cropped. If it is larger,
+// the dimension is padded with zeros.
 //
 // Arguments:
+//	input: A complex64 tensor.
+//	fft_length: An int32 tensor of shape [3]. The FFT length for each dimension.
 //
-//	min_features: The float value that the lowest quantized value represents.
-//	max_features: The float value that the highest quantized value represents.
+// Returns A float32 tensor of the same rank as `input`. The inner-most 3
+//   dimensions of `input` are replaced with the `fft_length` samples of their
+//   inverse 3D real Fourier transform.
 //
-// Returns Has the same output shape as "features".The float value that the lowest quantized value represents.The float value that the highest quantized value represents.
-func QuantizedRelu6(scope *Scope, features tf.Output, min_features tf.Output, max_features tf.Output, optional ...QuantizedRelu6Attr) (activations tf.Output, min_activations tf.Output, max_activations tf.Output) {
+// @compatibility(numpy)
+// Equivalent to np.irfftn with 3 dimensions.
+// @end_compatibility
+func IRFFT3D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "QuantizedRelu6",
+		Type: "IRFFT3D",
 		Input: []tf.Input{
-			features, min_features, max_features,
+			input, fft_length,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// FixedLengthRecordReaderV2Attr is an optional argument to FixedLengthRecordReaderV2.
-type FixedLengthRecordReaderV2Attr func(optionalAttr)
+// StringSplitAttr is an optional argument to StringSplit.
+type StringSplitAttr func(optionalAttr)
 
-// FixedLengthRecordReaderV2HeaderBytes sets the optional header_bytes attribute to value.
+// StringSplitSkipEmpty sets the optional skip_empty attribute to value.
 //
-// value: Number of bytes in the header, defaults to 0.
-// If not specified, defaults to 0
-func FixedLengthRecordReaderV2HeaderBytes(value int64) FixedLengthRecordReaderV2Attr {
+// value: A `bool`. If `True`, skip the empty strings from the result.
+// If not specified, defaults to true
+func StringSplitSkipEmpty(value bool) StringSplitAttr {
 	return func(m optionalAttr) {
-		m["header_bytes"] = value
+		m["skip_empty"] = value
 	}
 }
 
-// FixedLengthRecordReaderV2FooterBytes sets the optional footer_bytes attribute to value.
+// Split elements of `input` based on `delimiter` into a `SparseTensor`.
 //
-// value: Number of bytes in the footer, defaults to 0.
-// If not specified, defaults to 0
-func FixedLengthRecordReaderV2FooterBytes(value int64) FixedLengthRecordReaderV2Attr {
-	return func(m optionalAttr) {
-		m["footer_bytes"] = value
-	}
-}
-
-// FixedLengthRecordReaderV2HopBytes sets the optional hop_bytes attribute to value.
+// Let N be the size of `input` (typically N will be the batch size). Split each
+// element of `input` based on `delimiter` and return a `SparseTensor`
+// containing the split tokens. Empty tokens are ignored by default (see `skip_empty`).
 //
-// value: Number of bytes to hop before each read. Default of 0 means using
-// record_bytes.
-// If not specified, defaults to 0
-func FixedLengthRecordReaderV2HopBytes(value int64) FixedLengthRecordReaderV2Attr {
-	return func(m optionalAttr) {
-		m["hop_bytes"] = value
-	}
-}
-
-// FixedLengthRecordReaderV2Container sets the optional container attribute to value.
+// `delimiter` can be empty, or a string of split characters. If `delimiter` is an
+// empty string, each element of `input` is split into individual single-byte
+// character strings, including splitting of UTF-8 multibyte sequences. Otherwise
+// every character of `delimiter` is a potential split point.
 //
-// value: If non-empty, this reader is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func FixedLengthRecordReaderV2Container(value string) FixedLengthRecordReaderV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// FixedLengthRecordReaderV2SharedName sets the optional shared_name attribute to value.
+// For example:
+//   N = 2, input[0] is 'hello world' and input[1] is 'a b c', then the output
+//   will be
 //
-// value: If non-empty, this reader is named in the given bucket
-// with this shared_name. Otherwise, the node name is used instead.
-// If not specified, defaults to ""
-func FixedLengthRecordReaderV2SharedName(value string) FixedLengthRecordReaderV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
+//   indices = [0, 0;
+//              0, 1;
+//              1, 0;
+//              1, 1;
+//              1, 2]
+//   shape = [2, 3]
+//   values = ['hello', 'world', 'a', 'b', 'c']
+//
+// Arguments:
+//	input: 1-D. Strings to split.
+//	delimiter: 0-D. Delimiter characters (bytes), or empty string.
+//
+// Returns A dense matrix of int64 representing the indices of the sparse tensor; a vector of
+// strings corresponding to the split values; and a length-2 vector of int64 representing the shape
+// of the sparse tensor, where the first value is N and the second value is the maximum number of tokens in a single input entry.
+func StringSplit(scope *Scope, input tf.Output, delimiter tf.Output, optional ...StringSplitAttr) (indices tf.Output, values tf.Output, shape tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "StringSplit",
+		Input: []tf.Input{
+			input, delimiter,
+		},
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// FixedLengthRecordReaderV2Encoding sets the optional encoding attribute to value.
+// ResizeBilinearAttr is an optional argument to ResizeBilinear.
+type ResizeBilinearAttr func(optionalAttr)
+
+// ResizeBilinearAlignCorners sets the optional align_corners attribute to value.
 //
-// value: The type of encoding for the file. Currently ZLIB and GZIP
-// are supported. Defaults to none.
-// If not specified, defaults to ""
-func FixedLengthRecordReaderV2Encoding(value string) FixedLengthRecordReaderV2Attr {
+// value: If true, rescale input by (new_height - 1) / (height - 1), which
+// exactly aligns the 4 corners of images and resized images. If false, rescale
+// by new_height / height. The width dimension is treated similarly.
+// If not specified, defaults to false
+func ResizeBilinearAlignCorners(value bool) ResizeBilinearAttr {
 	return func(m optionalAttr) {
-		m["encoding"] = value
+		m["align_corners"] = value
 	}
 }
 
-// A Reader that outputs fixed-length records from a file.
+// Resize `images` to `size` using bilinear interpolation.
+//
+// Input images can be of different types but output images are always float.
 //
 // Arguments:
-//	record_bytes: Number of bytes in the record.
+//	images: 4-D with shape `[batch, height, width, channels]`.
+//	size: A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
+// new size for the images.
 //
-// Returns The handle to reference the Reader.
-func FixedLengthRecordReaderV2(scope *Scope, record_bytes int64, optional ...FixedLengthRecordReaderV2Attr) (reader_handle tf.Output) {
+// Returns 4-D with shape
+// `[batch, new_height, new_width, channels]`.
+func ResizeBilinear(scope *Scope, images tf.Output, size tf.Output, optional ...ResizeBilinearAttr) (resized_images tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"record_bytes": record_bytes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FixedLengthRecordReaderV2",
-
+		Type: "ResizeBilinear",
+		Input: []tf.Input{
+			images, size,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// The gradient operator for the SparseAdd op.
-//
-// The SparseAdd op calculates A + B, where A, B, and the sum are all represented
-// as `SparseTensor` objects.  This op takes in the upstream gradient w.r.t.
-// non-empty values of the sum, and outputs the gradients w.r.t. the non-empty
-// values of A and B.
-//
-// Arguments:
-//	backprop_val_grad: 1-D with shape `[nnz(sum)]`.  The gradient with respect to
-// the non-empty values of the sum.
-//	a_indices: 2-D.  The `indices` of the `SparseTensor` A, size `[nnz(A), ndims]`.
-//	b_indices: 2-D.  The `indices` of the `SparseTensor` B, size `[nnz(B), ndims]`.
-//	sum_indices: 2-D.  The `indices` of the sum `SparseTensor`, size
-// `[nnz(sum), ndims]`.
-//
-// Returns 1-D with shape `[nnz(A)]`. The gradient with respect to the
-// non-empty values of A.1-D with shape `[nnz(B)]`. The gradient with respect to the
-// non-empty values of B.
-func SparseAddGrad(scope *Scope, backprop_val_grad tf.Output, a_indices tf.Output, b_indices tf.Output, sum_indices tf.Output) (a_val_grad tf.Output, b_val_grad tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SparseAddGrad",
-		Input: []tf.Input{
-			backprop_val_grad, a_indices, b_indices, sum_indices,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
-}
-
-// Computes atan of x element-wise.
-func Atan(scope *Scope, x tf.Output) (y tf.Output) {
+// Computes softsign: `features / (abs(features) + 1)`.
+func Softsign(scope *Scope, features tf.Output) (activations tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Atan",
+		Type: "Softsign",
 		Input: []tf.Input{
-			x,
+			features,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Encode audio data using the WAV file format.
-//
-// This operation will generate a string suitable to be saved out to create a .wav
-// audio file. It will be encoded in the 16-bit PCM format. It takes in float
-// values in the range -1.0f to 1.0f, and any outside that value will be clamped to
-// that range.
-//
-// `audio` is a 2-D float Tensor of shape `[length, channels]`.
-// `sample_rate` is a scalar Tensor holding the rate to use (e.g. 44100).
+// GenerateVocabRemappingAttr is an optional argument to GenerateVocabRemapping.
+type GenerateVocabRemappingAttr func(optionalAttr)
+
+// GenerateVocabRemappingOldVocabSize sets the optional old_vocab_size attribute to value.
 //
-// Arguments:
-//	audio: 2-D with shape `[length, channels]`.
-//	sample_rate: Scalar containing the sample frequency.
+// value: Number of entries in the old vocab file to consider.  If -1,
+// use the entire old vocabulary.
+// If not specified, defaults to -1
 //
-// Returns 0-D. WAV-encoded file contents.
-func EncodeWav(scope *Scope, audio tf.Output, sample_rate tf.Output) (contents tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "EncodeWav",
-		Input: []tf.Input{
-			audio, sample_rate,
-		},
+// REQUIRES: value >= -1
+func GenerateVocabRemappingOldVocabSize(value int64) GenerateVocabRemappingAttr {
+	return func(m optionalAttr) {
+		m["old_vocab_size"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Converts each string in the input Tensor to its hash mod by a number of buckets.
-//
-// The hash function is deterministic on the content of the string within the
-// process. The hash function is a keyed hash function, where attribute `key`
-// defines the key of the hash function. `key` is an array of 2 elements.
-//
-// A strong hash is important when inputs may be malicious, e.g. URLs with
-// additional components. Adversaries could try to make their inputs hash to the
-// same bucket for a denial-of-service attack or to skew the results. A strong
-// hash prevents this by making it difficult, if not infeasible, to compute inputs
-// that hash to the same bucket. This comes at a cost of roughly 4x higher compute
-// time than `tf.string_to_hash_bucket_fast`.
+// Given a path to new and old vocabulary files, returns a remapping Tensor of
 //
-// Arguments:
-//	input: The strings to assign a hash bucket.
-//	num_buckets: The number of buckets.
-//	key: The key for the keyed hash function passed as a list of two uint64
-// elements.
+// length `num_new_vocab`, where `remapping[i]` contains the row number in the old
+// vocabulary that corresponds to row `i` in the new vocabulary (starting at line
+// `new_vocab_offset` and up to `num_new_vocab` entities), or `-1` if entry `i`
+// in the new vocabulary is not in the old vocabulary.  The old vocabulary is
+// constrained to the first `old_vocab_size` entries if `old_vocab_size` is not the
+// default value of -1.
 //
-// Returns A Tensor of the same shape as the input `string_tensor`.
-func StringToHashBucketStrong(scope *Scope, input tf.Output, num_buckets int64, key []int64) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"num_buckets": num_buckets, "key": key}
-	opspec := tf.OpSpec{
-		Type: "StringToHashBucketStrong",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Generates values in an interval.
+// `new_vocab_offset` enables
+// use in the partitioned variable case, and should generally be set through
+// examining partitioning info.  Each file should be a text file,
+// with each line containing a single entity within the vocabulary.
 //
-// A sequence of `num` evenly-spaced values are generated beginning at `start`.
-// If `num > 1`, the values in the sequence increase by `stop - start / num - 1`,
-// so that the last one is exactly `stop`.
+// For example, with `new_vocab_file` a text file containing each of the following
+// elements on a single line: `[f0, f1, f2, f3]`, `old_vocab_file = [f1, f0, f3]`,
+// `num_new_vocab = 3, new_vocab_offset = 1`, the returned remapping would be
+// `[0, -1, 2]`.
 //
-// For example:
+// The op also returns a count of how many entries in the new vocabulary
+// were present in the old vocabulary, which is used to calculate the number of
+// values to initialize in a weight matrix remapping.
 //
-// ```
-// tf.linspace(10.0, 12.0, 3, name="linspace") => [ 10.0  11.0  12.0]
-// ```
+// This functionality can be used to remap both row vocabularies (typically,
+// features) and column vocabularies (typically, classes) from TensorFlow
+// checkpoints.  Note that the partitioning logic relies on contiguous vocabularies
+// corresponding to div-partitioned variables.  Moreover, the underlying remapping
+// uses an IndexTable (as opposed to an inexact CuckooTable), so client code should
+// use the corresponding index_table_from_file() as the FeatureColumn framework
+// does (as opposed to tf.feature_to_id(), which uses a CuckooTable).
 //
 // Arguments:
-//	start: First entry in the range.
-//	stop: Last entry in the range.
-//	num: Number of values to generate.
+//	new_vocab_file: Path to the new vocab file.
+//	old_vocab_file: Path to the old vocab file.
+//	new_vocab_offset: How many entries into the new vocab file to start reading.
+//	num_new_vocab: Number of entries in the new vocab file to remap.
 //
-// Returns 1-D. The generated values.
-func LinSpace(scope *Scope, start tf.Output, stop tf.Output, num tf.Output) (output tf.Output) {
+// Returns A Tensor of length num_new_vocab where the element at index i
+// is equal to the old ID that maps to the new ID i.  This element is -1 for any
+// new ID that is not found in the old vocabulary. Also returns the number of new vocab entries found in the old vocab.
+func GenerateVocabRemapping(scope *Scope, new_vocab_file tf.Output, old_vocab_file tf.Output, new_vocab_offset int64, num_new_vocab int64, optional ...GenerateVocabRemappingAttr) (remapping tf.Output, num_present tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"new_vocab_offset": new_vocab_offset, "num_new_vocab": num_new_vocab}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "LinSpace",
+		Type: "GenerateVocabRemapping",
 		Input: []tf.Input{
-			start, stop, num,
+			new_vocab_file, old_vocab_file,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// DestroyResourceOpAttr is an optional argument to DestroyResourceOp.
-type DestroyResourceOpAttr func(optionalAttr)
-
-// DestroyResourceOpIgnoreLookupError sets the optional ignore_lookup_error attribute to value.
+// Assigns sparse updates to the variable referenced by `resource`.
 //
-// value: whether to ignore the error when the resource
-// doesn't exist.
-// If not specified, defaults to true
-func DestroyResourceOpIgnoreLookupError(value bool) DestroyResourceOpAttr {
-	return func(m optionalAttr) {
-		m["ignore_lookup_error"] = value
-	}
-}
-
-// Deletes the resource specified by the handle.
+// This operation computes
 //
-// All subsequent operations using the resource will result in a NotFound
-// error status.
+//     # Scalar indices
+//     ref[indices, ...] = updates[...]
+//
+//     # Vector indices (for each i)
+//     ref[indices[i], ...] = updates[i, ...]
+//
+//     # High rank indices (for each i, ..., j)
+//     ref[indices[i, ..., j], ...] = updates[i, ..., j, ...]
 //
 // Arguments:
-//	resource: handle to the resource to delete.
+//	resource: Should be from a `Variable` node.
+//	indices: A tensor of indices into the first dimension of `ref`.
+//	updates: A tensor of updated values to assign to `ref`.
 //
 // Returns the created operation.
-func DestroyResourceOp(scope *Scope, resource tf.Output, optional ...DestroyResourceOpAttr) (o *tf.Operation) {
+func ResourceScatterUpdate(scope *Scope, resource tf.Output, indices tf.Output, updates tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "DestroyResourceOp",
+		Type: "ResourceScatterUpdate",
 		Input: []tf.Input{
-			resource,
+			resource, indices, updates,
 		},
-		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// CumprodAttr is an optional argument to Cumprod.
-type CumprodAttr func(optionalAttr)
+// CumsumAttr is an optional argument to Cumsum.
+type CumsumAttr func(optionalAttr)
 
-// CumprodExclusive sets the optional exclusive attribute to value.
+// CumsumExclusive sets the optional exclusive attribute to value.
 //
-// value: If `True`, perform exclusive cumprod.
+// value: If `True`, perform exclusive cumsum.
 // If not specified, defaults to false
-func CumprodExclusive(value bool) CumprodAttr {
+func CumsumExclusive(value bool) CumsumAttr {
 	return func(m optionalAttr) {
 		m["exclusive"] = value
 	}
 }
 
-// CumprodReverse sets the optional reverse attribute to value.
+// CumsumReverse sets the optional reverse attribute to value.
 //
 // value: A `bool` (default: False).
 // If not specified, defaults to false
-func CumprodReverse(value bool) CumprodAttr {
+func CumsumReverse(value bool) CumsumAttr {
 	return func(m optionalAttr) {
 		m["reverse"] = value
 	}
 }
 
-// Compute the cumulative product of the tensor `x` along `axis`.
+// Compute the cumulative sum of the tensor `x` along `axis`.
 //
-// By default, this op performs an inclusive cumprod, which means that the first
+// By default, this op performs an inclusive cumsum, which means that the first
 // element of the input is identical to the first element of the output:
 //
 // ```python
-// tf.cumprod([a, b, c])  # => [a, a * b, a * b * c]
+// tf.cumsum([a, b, c])  # => [a, a + b, a + b + c]
 // ```
 //
-// By setting the `exclusive` kwarg to `True`, an exclusive cumprod is
+// By setting the `exclusive` kwarg to `True`, an exclusive cumsum is
 // performed instead:
 //
 // ```python
-// tf.cumprod([a, b, c], exclusive=True)  # => [1, a, a * b]
+// tf.cumsum([a, b, c], exclusive=True)  # => [0, a, a + b]
 // ```
 //
-// By setting the `reverse` kwarg to `True`, the cumprod is performed in the
+// By setting the `reverse` kwarg to `True`, the cumsum is performed in the
 // opposite direction:
 //
 // ```python
-// tf.cumprod([a, b, c], reverse=True)  # => [a * b * c, b * c, c]
+// tf.cumsum([a, b, c], reverse=True)  # => [a + b + c, b + c, c]
 // ```
 //
 // This is more efficient than using separate `tf.reverse` ops.
@@ -7076,7 +6969,7 @@ func CumprodReverse(value bool) CumprodAttr {
 // The `reverse` and `exclusive` kwargs can also be combined:
 //
 // ```python
-// tf.cumprod([a, b, c], exclusive=True, reverse=True)  # => [b * c, c, 1]
+// tf.cumsum([a, b, c], exclusive=True, reverse=True)  # => [b + c, c, 0]
 // ```
 //
 // Arguments:
@@ -7085,7 +6978,7 @@ func CumprodReverse(value bool) CumprodAttr {
 // `complex128`, `qint8`, `quint8`, `qint32`, `half`.
 //	axis: A `Tensor` of type `int32` (default: 0). Must be in the range
 // `[-rank(x), rank(x))`.
-func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr) (out tf.Output) {
+func Cumsum(scope *Scope, x tf.Output, axis tf.Output, optional ...CumsumAttr) (out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -7094,7 +6987,7 @@ func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr)
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Cumprod",
+		Type: "Cumsum",
 		Input: []tf.Input{
 			x, axis,
 		},
@@ -7104,91 +6997,26 @@ func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr)
 	return op.Output(0)
 }
 
-// Computes the mean along segments of a tensor.
-//
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// Computes a tensor such that
-// \\(output_i = \frac{\sum_j data_j}{N}\\) where `mean` is
-// over `j` such that `segment_ids[j] == i` and `N` is the total number of
-// values summed.
-//
-// If the mean is empty for a given segment ID `i`, `output[i] = 0`.
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentMean.png" alt>
-// </div>
+// QuantizedRelu6Attr is an optional argument to QuantizedRelu6.
+type QuantizedRelu6Attr func(optionalAttr)
+
+// QuantizedRelu6OutType sets the optional out_type attribute to value.
+// If not specified, defaults to DT_QUINT8
+func QuantizedRelu6OutType(value tf.DataType) QuantizedRelu6Attr {
+	return func(m optionalAttr) {
+		m["out_type"] = value
+	}
+}
+
+// Computes Quantized Rectified Linear 6: `min(max(features, 0), 6)`
 //
 // Arguments:
 //
-//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
-// first dimension.  Values should be sorted and can be repeated.
-//
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SegmentMean(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SegmentMean",
-		Input: []tf.Input{
-			data, segment_ids,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// ResourceSparseApplyCenteredRMSPropAttr is an optional argument to ResourceSparseApplyCenteredRMSProp.
-type ResourceSparseApplyCenteredRMSPropAttr func(optionalAttr)
-
-// ResourceSparseApplyCenteredRMSPropUseLocking sets the optional use_locking attribute to value.
-//
-// value: If `True`, updating of the var, mg, ms, and mom tensors is
-// protected by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceSparseApplyCenteredRMSPropUseLocking(value bool) ResourceSparseApplyCenteredRMSPropAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
-	}
-}
-
-// Update '*var' according to the centered RMSProp algorithm.
-//
-// The centered RMSProp algorithm uses an estimate of the centered second moment
-// (i.e., the variance) for normalization, as opposed to regular RMSProp, which
-// uses the (uncentered) second moment. This often helps with training, but is
-// slightly more expensive in terms of computation and memory.
-//
-// Note that in dense implementation of this algorithm, mg, ms, and mom will
-// update even if the grad is zero, but in this sparse implementation, mg, ms,
-// and mom will not update in iterations during which the grad is zero.
-//
-// mean_square = decay * mean_square + (1-decay) * gradient ** 2
-// mean_grad = decay * mean_grad + (1-decay) * gradient
-// Delta = learning_rate * gradient / sqrt(mean_square + epsilon - mean_grad ** 2)
-//
-// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
-// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms + epsilon)
-// var <- var - mom
-//
-// Arguments:
-//	var_: Should be from a Variable().
-//	mg: Should be from a Variable().
-//	ms: Should be from a Variable().
-//	mom: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	rho: Decay rate. Must be a scalar.
-//
-//	epsilon: Ridge term. Must be a scalar.
-//	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var, ms and mom.
+//	min_features: The float value that the lowest quantized value represents.
+//	max_features: The float value that the highest quantized value represents.
 //
-// Returns the created operation.
-func ResourceSparseApplyCenteredRMSProp(scope *Scope, var_ tf.Output, mg tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyCenteredRMSPropAttr) (o *tf.Operation) {
+// Returns activations, which has the same output shape as "features"; min_activations, the float value that the lowest quantized value represents; and max_activations, the float value that the highest quantized value represents.
+func QuantizedRelu6(scope *Scope, features tf.Output, min_features tf.Output, max_features tf.Output, optional ...QuantizedRelu6Attr) (activations tf.Output, min_activations tf.Output, max_activations tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -7197,155 +7025,129 @@ func ResourceSparseApplyCenteredRMSProp(scope *Scope, var_ tf.Output, mg tf.Outp
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyCenteredRMSProp",
+		Type: "QuantizedRelu6",
 		Input: []tf.Input{
-			var_, mg, ms, mom, lr, rho, momentum, epsilon, grad, indices,
+			features, min_features, max_features,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Creates a dataset that batches `batch_size` elements from `input_dataset`.
-//
-// Arguments:
-//
-//	batch_size: A scalar representing the number of elements to accumulate in a
-// batch.
-//
+// FixedLengthRecordReaderV2Attr is an optional argument to FixedLengthRecordReaderV2.
+type FixedLengthRecordReaderV2Attr func(optionalAttr)
+
+// FixedLengthRecordReaderV2HeaderBytes sets the optional header_bytes attribute to value.
 //
-func BatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
-	opspec := tf.OpSpec{
-		Type: "BatchDataset",
-		Input: []tf.Input{
-			input_dataset, batch_size,
-		},
-		Attrs: attrs,
+// value: Number of bytes in the header, defaults to 0.
+// If not specified, defaults to 0
+func FixedLengthRecordReaderV2HeaderBytes(value int64) FixedLengthRecordReaderV2Attr {
+	return func(m optionalAttr) {
+		m["header_bytes"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Inverse fast Fourier transform.
-//
-// Computes the inverse 1-dimensional discrete Fourier transform over the
-// inner-most dimension of `input`.
-//
-// Arguments:
-//	input: A complex64 tensor.
-//
-// Returns A complex64 tensor of the same shape as `input`. The inner-most
-//   dimension of `input` is replaced with its inverse 1D Fourier transform.
+// FixedLengthRecordReaderV2FooterBytes sets the optional footer_bytes attribute to value.
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.ifft
-// @end_compatibility
-func IFFT(scope *Scope, input tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "IFFT",
-		Input: []tf.Input{
-			input,
-		},
+// value: Number of bytes in the footer, defaults to 0.
+// If not specified, defaults to 0
+func FixedLengthRecordReaderV2FooterBytes(value int64) FixedLengthRecordReaderV2Attr {
+	return func(m optionalAttr) {
+		m["footer_bytes"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// LRNAttr is an optional argument to LRN.
-type LRNAttr func(optionalAttr)
-
-// LRNDepthRadius sets the optional depth_radius attribute to value.
+// FixedLengthRecordReaderV2HopBytes sets the optional hop_bytes attribute to value.
 //
-// value: 0-D.  Half-width of the 1-D normalization window.
-// If not specified, defaults to 5
-func LRNDepthRadius(value int64) LRNAttr {
+// value: Number of bytes to hop before each read. Default of 0 means using
+// record_bytes.
+// If not specified, defaults to 0
+func FixedLengthRecordReaderV2HopBytes(value int64) FixedLengthRecordReaderV2Attr {
 	return func(m optionalAttr) {
-		m["depth_radius"] = value
+		m["hop_bytes"] = value
 	}
 }
 
-// LRNBias sets the optional bias attribute to value.
+// FixedLengthRecordReaderV2Container sets the optional container attribute to value.
 //
-// value: An offset (usually positive to avoid dividing by 0).
-// If not specified, defaults to 1
-func LRNBias(value float32) LRNAttr {
+// value: If non-empty, this reader is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func FixedLengthRecordReaderV2Container(value string) FixedLengthRecordReaderV2Attr {
 	return func(m optionalAttr) {
-		m["bias"] = value
+		m["container"] = value
 	}
 }
 
-// LRNAlpha sets the optional alpha attribute to value.
+// FixedLengthRecordReaderV2SharedName sets the optional shared_name attribute to value.
 //
-// value: A scale factor, usually positive.
-// If not specified, defaults to 1
-func LRNAlpha(value float32) LRNAttr {
+// value: If non-empty, this reader is named in the given bucket
+// with this shared_name. Otherwise, the node name is used instead.
+// If not specified, defaults to ""
+func FixedLengthRecordReaderV2SharedName(value string) FixedLengthRecordReaderV2Attr {
 	return func(m optionalAttr) {
-		m["alpha"] = value
+		m["shared_name"] = value
 	}
 }
 
-// LRNBeta sets the optional beta attribute to value.
+// FixedLengthRecordReaderV2Encoding sets the optional encoding attribute to value.
 //
-// value: An exponent.
-// If not specified, defaults to 0.5
-func LRNBeta(value float32) LRNAttr {
+// value: The type of encoding for the file. Currently ZLIB and GZIP
+// are supported. Defaults to none.
+// If not specified, defaults to ""
+func FixedLengthRecordReaderV2Encoding(value string) FixedLengthRecordReaderV2Attr {
 	return func(m optionalAttr) {
-		m["beta"] = value
+		m["encoding"] = value
 	}
 }
 
-// Local Response Normalization.
-//
-// The 4-D `input` tensor is treated as a 3-D array of 1-D vectors (along the last
-// dimension), and each vector is normalized independently.  Within a given vector,
-// each component is divided by the weighted, squared sum of inputs within
-// `depth_radius`.  In detail,
-//
-//     sqr_sum[a, b, c, d] =
-//         sum(input[a, b, c, d - depth_radius : d + depth_radius + 1] ** 2)
-//     output = input / (bias + alpha * sqr_sum) ** beta
-//
-// For details, see [Krizhevsky et al., ImageNet classification with deep
-// convolutional neural networks (NIPS 2012)](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks).
+// A Reader that outputs fixed-length records from a file.
 //
 // Arguments:
-//	input: 4-D.
-func LRN(scope *Scope, input tf.Output, optional ...LRNAttr) (output tf.Output) {
+//	record_bytes: Number of bytes in the record.
+//
+// Returns The handle to reference the Reader.
+func FixedLengthRecordReaderV2(scope *Scope, record_bytes int64, optional ...FixedLengthRecordReaderV2Attr) (reader_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"record_bytes": record_bytes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "LRN",
-		Input: []tf.Input{
-			input,
-		},
+		Type: "FixedLengthRecordReaderV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a dataset that zips together `input_datasets`.
-func ZipDataset(scope *Scope, input_datasets []tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// Converts each string in the input Tensor to its hash modulo the number of buckets.
+//
+// The hash function is deterministic on the content of the string within the
+// process.
+//
+// Note that the hash function may change from time to time.
+// This functionality will be deprecated and it's recommended to use
+// `tf.string_to_hash_bucket_fast()` or `tf.string_to_hash_bucket_strong()`.
+//
+// Arguments:
+//
+//	num_buckets: The number of buckets.
+//
+// Returns A Tensor of the same shape as the input `string_tensor`.
+func StringToHashBucket(scope *Scope, string_tensor tf.Output, num_buckets int64) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{"num_buckets": num_buckets}
 	opspec := tf.OpSpec{
-		Type: "ZipDataset",
+		Type: "StringToHashBucket",
 		Input: []tf.Input{
-			tf.OutputList(input_datasets),
+			string_tensor,
 		},
 		Attrs: attrs,
 	}
@@ -7353,201 +7155,158 @@ func ZipDataset(scope *Scope, input_datasets []tf.Output, output_types []tf.Data
 	return op.Output(0)
 }
 
-// Writes a `GraphDef` protocol buffer to a `SummaryWriter`.
+// Computes gradients for the exponential linear (Elu) operation.
 //
 // Arguments:
-//	writer: Handle of `SummaryWriter`.
-//	step: The step to write the summary for.
-//	tensor: A scalar string of the serialized tf.GraphDef proto.
+//	gradients: The backpropagated gradients to the corresponding Elu operation.
+//	outputs: The outputs of the corresponding Elu operation.
 //
-// Returns the created operation.
-func WriteGraphSummary(scope *Scope, writer tf.Output, step tf.Output, tensor tf.Output) (o *tf.Operation) {
+// Returns The gradients: `gradients * (outputs + 1)` if outputs < 0,
+// `gradients` otherwise.
+func EluGrad(scope *Scope, gradients tf.Output, outputs tf.Output) (backprops tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "WriteGraphSummary",
+		Type: "EluGrad",
 		Input: []tf.Input{
-			writer, step, tensor,
+			gradients, outputs,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// ResourceSparseApplyAdagradAttr is an optional argument to ResourceSparseApplyAdagrad.
-type ResourceSparseApplyAdagradAttr func(optionalAttr)
-
-// ResourceSparseApplyAdagradUseLocking sets the optional use_locking attribute to value.
+// Creates a dataset that contains `count` elements from the `input_dataset`.
 //
-// value: If `True`, updating of the var and accum tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceSparseApplyAdagradUseLocking(value bool) ResourceSparseApplyAdagradAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
-	}
-}
-
-// Update relevant entries in '*var' and '*accum' according to the adagrad scheme.
+// Arguments:
 //
-// That is for rows we have grad for, we update var and accum as follows:
-// accum += grad * grad
-// var -= lr * grad * (1 / sqrt(accum))
+//	count: A scalar representing the number of elements from the `input_dataset`
+// that should be taken. A value of `-1` indicates that all of `input_dataset`
+// is taken.
 //
-// Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	lr: Learning rate. Must be a scalar.
-//	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var and accum.
 //
-// Returns the created operation.
-func ResourceSparseApplyAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyAdagradAttr) (o *tf.Operation) {
+func TakeDataset(scope *Scope, input_dataset tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyAdagrad",
+		Type: "TakeDataset",
 		Input: []tf.Input{
-			var_, accum, lr, grad, indices,
+			input_dataset, count,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// 2D real-valued fast Fourier transform.
-//
-// Computes the 2-dimensional discrete Fourier transform of a real-valued signal
-// over the inner-most 2 dimensions of `input`.
-//
-// Since the DFT of a real signal is Hermitian-symmetric, `RFFT2D` only returns the
-// `fft_length / 2 + 1` unique components of the FFT for the inner-most dimension
-// of `output`: the zero-frequency term, followed by the `fft_length / 2`
-// positive-frequency terms.
+// The gradient operator for the SparseAdd op.
 //
-// Along each axis `RFFT2D` is computed on, if `fft_length` is smaller than the
-// corresponding dimension of `input`, the dimension is cropped. If it is larger,
-// the dimension is padded with zeros.
+// The SparseAdd op calculates A + B, where A, B, and the sum are all represented
+// as `SparseTensor` objects.  This op takes in the upstream gradient w.r.t.
+// non-empty values of the sum, and outputs the gradients w.r.t. the non-empty
+// values of A and B.
 //
 // Arguments:
-//	input: A float32 tensor.
-//	fft_length: An int32 tensor of shape [2]. The FFT length for each dimension.
-//
-// Returns A complex64 tensor of the same rank as `input`. The inner-most 2
-//   dimensions of `input` are replaced with their 2D Fourier transform. The
-//   inner-most dimension contains `fft_length / 2 + 1` unique frequency
-//   components.
+//	backprop_val_grad: 1-D with shape `[nnz(sum)]`.  The gradient with respect to
+// the non-empty values of the sum.
+//	a_indices: 2-D.  The `indices` of the `SparseTensor` A, size `[nnz(A), ndims]`.
+//	b_indices: 2-D.  The `indices` of the `SparseTensor` B, size `[nnz(B), ndims]`.
+//	sum_indices: 2-D.  The `indices` of the sum `SparseTensor`, size
+// `[nnz(sum), ndims]`.
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.rfft2
-// @end_compatibility
-func RFFT2D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
+// Returns a_val_grad, 1-D with shape `[nnz(A)]`: the gradient with respect to the
+// non-empty values of A; and b_val_grad, 1-D with shape `[nnz(B)]`: the gradient
+// with respect to the non-empty values of B.
+func SparseAddGrad(scope *Scope, backprop_val_grad tf.Output, a_indices tf.Output, b_indices tf.Output, sum_indices tf.Output) (a_val_grad tf.Output, b_val_grad tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "RFFT2D",
+		Type: "SparseAddGrad",
 		Input: []tf.Input{
-			input, fft_length,
+			backprop_val_grad, a_indices, b_indices, sum_indices,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// ResizeAreaAttr is an optional argument to ResizeArea.
-type ResizeAreaAttr func(optionalAttr)
-
-// ResizeAreaAlignCorners sets the optional align_corners attribute to value.
-//
-// value: If true, rescale input by (new_height - 1) / (height - 1), which
-// exactly aligns the 4 corners of images and resized images. If false, rescale
-// by new_height / height. Treat similarly the width dimension.
-// If not specified, defaults to false
-func ResizeAreaAlignCorners(value bool) ResizeAreaAttr {
-	return func(m optionalAttr) {
-		m["align_corners"] = value
+// Computes atan of x element-wise.
+func Atan(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Atan",
+		Input: []tf.Input{
+			x,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Resize `images` to `size` using area interpolation.
+// Encode audio data using the WAV file format.
 //
-// Input images can be of different types but output images are always float.
+// This operation will generate a string suitable to be saved out to create a .wav
+// audio file. It will be encoded in the 16-bit PCM format. It takes in float
+// values in the range -1.0f to 1.0f; any value outside that range is clamped to
+// the nearest bound.
 //
-// Each output pixel is computed by first transforming the pixel's footprint into
-// the input tensor and then averaging the pixels that intersect the footprint. An
-// input pixel's contribution to the average is weighted by the fraction of its
-// area that intersects the footprint.  This is the same as OpenCV's INTER_AREA.
+// `audio` is a 2-D float Tensor of shape `[length, channels]`.
+// `sample_rate` is a scalar Tensor holding the rate to use (e.g. 44100).
 //
 // Arguments:
-//	images: 4-D with shape `[batch, height, width, channels]`.
-//	size: = A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
-// new size for the images.
+//	audio: 2-D with shape `[length, channels]`.
+//	sample_rate: Scalar containing the sample frequency.
 //
-// Returns 4-D with shape
-// `[batch, new_height, new_width, channels]`.
-func ResizeArea(scope *Scope, images tf.Output, size tf.Output, optional ...ResizeAreaAttr) (resized_images tf.Output) {
+// Returns 0-D. WAV-encoded file contents.
+func EncodeWav(scope *Scope, audio tf.Output, sample_rate tf.Output) (contents tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ResizeArea",
+		Type: "EncodeWav",
 		Input: []tf.Input{
-			images, size,
+			audio, sample_rate,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// StatelessRandomUniformAttr is an optional argument to StatelessRandomUniform.
-type StatelessRandomUniformAttr func(optionalAttr)
-
-// StatelessRandomUniformDtype sets the optional dtype attribute to value.
-//
-// value: The type of the output.
-// If not specified, defaults to DT_FLOAT
-func StatelessRandomUniformDtype(value tf.DataType) StatelessRandomUniformAttr {
-	return func(m optionalAttr) {
-		m["dtype"] = value
-	}
-}
-
-// Outputs deterministic pseudorandom random values from a uniform distribution.
+// Converts each string in the input Tensor to its hash modulo the number of buckets.
 //
-// The generated values follow a uniform distribution in the range `[0, 1)`. The
-// lower bound 0 is included in the range, while the upper bound 1 is excluded.
+// The hash function is deterministic on the content of the string within the
+// process. The hash function is a keyed hash function, where attribute `key`
+// defines the key of the hash function. `key` is an array of 2 elements.
 //
-// The outputs are a deterministic function of `shape` and `seed`.
+// A strong hash is important when inputs may be malicious, e.g. URLs with
+// additional components. Adversaries could try to make their inputs hash to the
+// same bucket for a denial-of-service attack or to skew the results. A strong
+// hash prevents this by making it difficult, if not infeasible, to compute inputs
+// that hash to the same bucket. This comes at a cost of roughly 4x higher compute
+// time than `tf.string_to_hash_bucket_fast`.
 //
 // Arguments:
-//	shape: The shape of the output tensor.
-//	seed: 2 seeds (shape [2]).
+//	input: The strings to assign a hash bucket.
+//	num_buckets: The number of buckets.
+//	key: The key for the keyed hash function passed as a list of two uint64
+// elements.
 //
-// Returns Random values with specified shape.
-func StatelessRandomUniform(scope *Scope, shape tf.Output, seed tf.Output, optional ...StatelessRandomUniformAttr) (output tf.Output) {
+// Returns A Tensor of the same shape as `input`.
+func StringToHashBucketStrong(scope *Scope, input tf.Output, num_buckets int64, key []int64) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"num_buckets": num_buckets, "key": key}
 	opspec := tf.OpSpec{
-		Type: "StatelessRandomUniform",
+		Type: "StringToHashBucketStrong",
 		Input: []tf.Input{
-			shape, seed,
+			input,
 		},
 		Attrs: attrs,
 	}
@@ -7555,223 +7314,215 @@ func StatelessRandomUniform(scope *Scope, shape tf.Output, seed tf.Output, optio
 	return op.Output(0)
 }
 
-// AngleAttr is an optional argument to Angle.
-type AngleAttr func(optionalAttr)
-
-// AngleTout sets the optional Tout attribute to value.
-// If not specified, defaults to DT_FLOAT
-func AngleTout(value tf.DataType) AngleAttr {
-	return func(m optionalAttr) {
-		m["Tout"] = value
-	}
-}
-
-// Returns the argument of a complex number.
-//
-// Given a tensor `input` of complex numbers, this operation returns a tensor of
-// type `float` that is the argument of each element in `input`. All elements in
-// `input` must be complex numbers of the form \\(a + bj\\), where *a*
-// is the real part and *b* is the imaginary part.
+// Generates values in an interval.
 //
-// The argument returned by this operation is of the form \\(atan2(b, a)\\).
+// A sequence of `num` evenly-spaced values is generated beginning at `start`.
+// If `num > 1`, the values in the sequence increase by `(stop - start) / (num - 1)`,
+// so that the last one is exactly `stop`.
 //
 // For example:
 //
 // ```
-// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
-// tf.angle(input) ==> [2.0132, 1.056]
+// tf.linspace(10.0, 12.0, 3, name="linspace") => [ 10.0  11.0  12.0]
 // ```
 //
-// @compatibility(numpy)
-// Equivalent to np.angle.
-// @end_compatibility
-func Angle(scope *Scope, input tf.Output, optional ...AngleAttr) (output tf.Output) {
+// Arguments:
+//	start: First entry in the range.
+//	stop: Last entry in the range.
+//	num: Number of values to generate.
+//
+// Returns 1-D. The generated values.
+func LinSpace(scope *Scope, start tf.Output, stop tf.Output, num tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "Angle",
+		Type: "LinSpace",
 		Input: []tf.Input{
-			input,
+			start, stop, num,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// VarHandleOpAttr is an optional argument to VarHandleOp.
-type VarHandleOpAttr func(optionalAttr)
+// DestroyResourceOpAttr is an optional argument to DestroyResourceOp.
+type DestroyResourceOpAttr func(optionalAttr)
 
-// VarHandleOpContainer sets the optional container attribute to value.
+// DestroyResourceOpIgnoreLookupError sets the optional ignore_lookup_error attribute to value.
 //
-// value: the container this variable is placed in.
-// If not specified, defaults to ""
-func VarHandleOpContainer(value string) VarHandleOpAttr {
+// value: whether to ignore the error when the resource
+// doesn't exist.
+// If not specified, defaults to true
+func DestroyResourceOpIgnoreLookupError(value bool) DestroyResourceOpAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["ignore_lookup_error"] = value
 	}
 }
 
-// VarHandleOpSharedName sets the optional shared_name attribute to value.
+// Deletes the resource specified by the handle.
 //
-// value: the name by which this variable is referred to.
-// If not specified, defaults to ""
-func VarHandleOpSharedName(value string) VarHandleOpAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Creates a handle to a Variable resource.
+// All subsequent operations using the resource will result in a NotFound
+// error status.
 //
 // Arguments:
-//	dtype: the type of this variable. Must agree with the dtypes
-// of all ops using this variable.
-//	shape: The (possibly partially specified) shape of this variable.
-func VarHandleOp(scope *Scope, dtype tf.DataType, shape tf.Shape, optional ...VarHandleOpAttr) (resource tf.Output) {
+//	resource: handle to the resource to delete.
+//
+// Returns the created operation.
+func DestroyResourceOp(scope *Scope, resource tf.Output, optional ...DestroyResourceOpAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype, "shape": shape}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "VarHandleOp",
-
+		Type: "DestroyResourceOp",
+		Input: []tf.Input{
+			resource,
+		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Elementwise computes the bitwise XOR of `x` and `y`.
+// Applies softmax to a batched N-D `SparseTensor`.
 //
-// The result will have those bits set, that are different in `x` and `y`. The
-// computation is performed on the underlying representations of `x` and `y`.
-func BitwiseXor(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// The inputs represent an N-D SparseTensor with logical shape `[..., B, C]`
+// (where `N >= 2`), and with indices sorted in the canonical lexicographic order.
+//
+// This op is equivalent to applying the normal `tf.nn.softmax()` to each innermost
+// logical submatrix with shape `[B, C]`, but with the catch that *the implicitly
+// zero elements do not participate*.  Specifically, the algorithm is equivalent
+// to the following:
+//
+//   (1) Applies `tf.nn.softmax()` to a densified view of each innermost submatrix
+//       with shape `[B, C]`, along the size-C dimension;
+//   (2) Masks out the original implicitly-zero locations;
+//   (3) Renormalizes the remaining elements.
+//
+// Hence, the `SparseTensor` result has exactly the same non-zero indices and
+// shape.
+//
+// Arguments:
+//	sp_indices: 2-D.  `NNZ x R` matrix with the indices of non-empty values in a
+// SparseTensor, in canonical ordering.
+//	sp_values: 1-D.  `NNZ` non-empty values corresponding to `sp_indices`.
+//	sp_shape: 1-D.  Shape of the input SparseTensor.
+//
+// Returns 1-D.  The `NNZ` values for the result `SparseTensor`.
+func SparseSoftmax(scope *Scope, sp_indices tf.Output, sp_values tf.Output, sp_shape tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "BitwiseXor",
+		Type: "SparseSoftmax",
 		Input: []tf.Input{
-			x, y,
+			sp_indices, sp_values, sp_shape,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Deserialize `SparseTensor` objects.
+// Partitions `data` into `num_partitions` tensors using indices from `partitions`.
 //
-// The input `serialized_sparse` must have the shape `[?, ?, ..., ?, 3]` where
-// the last dimension stores serialized `SparseTensor` objects and the other N
-// dimensions (N >= 0) correspond to a batch. The ranks of the original
-// `SparseTensor` objects must all match. When the final `SparseTensor` is
-// created, its rank is the rank of the incoming `SparseTensor` objects plus N;
-// the sparse tensors have been concatenated along new dimensions, one for each
-// batch.
+// For each index tuple `js` of size `partitions.ndim`, the slice `data[js, ...]`
+// becomes part of `outputs[partitions[js]]`.  The slices with `partitions[js] = i`
+// are placed in `outputs[i]` in lexicographic order of `js`, and the first
+// dimension of `outputs[i]` is the number of entries in `partitions` equal to `i`.
+// In detail,
 //
-// The output `SparseTensor` object's shape values for the original dimensions
-// are the max across the input `SparseTensor` objects' shape values for the
-// corresponding dimensions. The new dimensions match the size of the batch.
+// ```python
+//     outputs[i].shape = [sum(partitions == i)] + data.shape[partitions.ndim:]
 //
-// The input `SparseTensor` objects' indices are assumed ordered in
-// standard lexicographic order.  If this is not the case, after this
-// step run `SparseReorder` to restore index ordering.
+//     outputs[i] = pack([data[js, ...] for js if partitions[js] == i])
+// ```
 //
-// For example, if the serialized input is a `[2 x 3]` matrix representing two
-// original `SparseTensor` objects:
+// `data.shape` must start with `partitions.shape`.
 //
-//     index = [ 0]
-//             [10]
-//             [20]
-//     values = [1, 2, 3]
-//     shape = [50]
+// For example:
 //
-// and
+// ```python
+//     # Scalar partitions.
+//     partitions = 1
+//     num_partitions = 2
+//     data = [10, 20]
+//     outputs[0] = []  # Empty with shape [0, 2]
+//     outputs[1] = [[10, 20]]
 //
-//     index = [ 2]
-//             [10]
-//     values = [4, 5]
-//     shape = [30]
+//     # Vector partitions.
+//     partitions = [0, 0, 1, 1, 0]
+//     num_partitions = 2
+//     data = [10, 20, 30, 40, 50]
+//     outputs[0] = [10, 20, 50]
+//     outputs[1] = [30, 40]
+// ```
 //
-// then the final deserialized `SparseTensor` will be:
+// See `dynamic_stitch` for an example on how to merge partitions back.
 //
-//     index = [0  0]
-//             [0 10]
-//             [0 20]
-//             [1  2]
-//             [1 10]
-//     values = [1, 2, 3, 4, 5]
-//     shape = [2 50]
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicPartition.png" alt>
+// </div>
 //
 // Arguments:
-//	serialized_sparse: The serialized `SparseTensor` objects. The last dimension
-// must have 3 columns.
-//	dtype: The `dtype` of the serialized `SparseTensor` objects.
-func DeserializeSparse(scope *Scope, serialized_sparse tf.Output, dtype tf.DataType) (sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output) {
+//
+//	partitions: Any shape.  Indices in the range `[0, num_partitions)`.
+//	num_partitions: The number of partitions to output.
+func DynamicPartition(scope *Scope, data tf.Output, partitions tf.Output, num_partitions int64) (outputs []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{"num_partitions": num_partitions}
 	opspec := tf.OpSpec{
-		Type: "DeserializeSparse",
+		Type: "DynamicPartition",
 		Input: []tf.Input{
-			serialized_sparse,
+			data, partitions,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if outputs, idx, err = makeOutputList(op, idx, "outputs"); err != nil {
+		scope.UpdateErr("DynamicPartition", err)
+		return
+	}
+	return outputs
 }
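The vector-partition case documented for `DynamicPartition` can be sketched in plain Go. This is an illustration only, assuming 1-D `data` and `partitions` of equal length with every index in `[0, numPartitions)`; `dynamicPartition` is a hypothetical helper, not the generated wrapper.

```go
package main

import "fmt"

// dynamicPartition is a hypothetical reference helper (not a TensorFlow API).
// It routes data[j] to outputs[partitions[j]], preserving the order of j.
// Assumes len(data) == len(partitions) and 0 <= partitions[j] < numPartitions.
func dynamicPartition(data, partitions []int, numPartitions int) [][]int {
	outputs := make([][]int, numPartitions)
	for i := range outputs {
		outputs[i] = []int{} // empty partitions stay empty, not nil
	}
	for j, p := range partitions {
		outputs[p] = append(outputs[p], data[j])
	}
	return outputs
}

func main() {
	data := []int{10, 20, 30, 40, 50}
	partitions := []int{0, 0, 1, 1, 0}
	fmt.Println(dynamicPartition(data, partitions, 2))
}
```

The example input matches the "Vector partitions" case in the op documentation.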
 
-// ResourceApplyRMSPropAttr is an optional argument to ResourceApplyRMSProp.
-type ResourceApplyRMSPropAttr func(optionalAttr)
+// ResourceApplyAdagradAttr is an optional argument to ResourceApplyAdagrad.
+type ResourceApplyAdagradAttr func(optionalAttr)
 
-// ResourceApplyRMSPropUseLocking sets the optional use_locking attribute to value.
+// ResourceApplyAdagradUseLocking sets the optional use_locking attribute to value.
 //
-// value: If `True`, updating of the var, ms, and mom tensors is protected
+// value: If `True`, updating of the var and accum tensors will be protected
 // by a lock; otherwise the behavior is undefined, but may exhibit less
 // contention.
 // If not specified, defaults to false
-func ResourceApplyRMSPropUseLocking(value bool) ResourceApplyRMSPropAttr {
+func ResourceApplyAdagradUseLocking(value bool) ResourceApplyAdagradAttr {
 	return func(m optionalAttr) {
 		m["use_locking"] = value
 	}
 }
 
-// Update '*var' according to the RMSProp algorithm.
-//
-// Note that in dense implementation of this algorithm, ms and mom will
-// update even if the grad is zero, but in this sparse implementation, ms
-// and mom will not update in iterations during which the grad is zero.
-//
-// mean_square = decay * mean_square + (1-decay) * gradient ** 2
-// Delta = learning_rate * gradient / sqrt(mean_square + epsilon)
+// Update '*var' according to the adagrad scheme.
 //
-// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
-// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms + epsilon)
-// var <- var - mom
+// accum += grad * grad
+// var -= lr * grad * (1 / sqrt(accum))
 //
 // Arguments:
 //	var_: Should be from a Variable().
-//	ms: Should be from a Variable().
-//	mom: Should be from a Variable().
+//	accum: Should be from a Variable().
 //	lr: Scaling factor. Must be a scalar.
-//	rho: Decay rate. Must be a scalar.
-//
-//	epsilon: Ridge term. Must be a scalar.
 //	grad: The gradient.
 //
 // Returns the created operation.
-func ResourceApplyRMSProp(scope *Scope, var_ tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyRMSPropAttr) (o *tf.Operation) {
+func ResourceApplyAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, grad tf.Output, optional ...ResourceApplyAdagradAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -7780,38 +7531,47 @@ func ResourceApplyRMSProp(scope *Scope, var_ tf.Output, ms tf.Output, mom tf.Out
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyRMSProp",
+		Type: "ResourceApplyAdagrad",
 		Input: []tf.Input{
-			var_, ms, mom, lr, rho, momentum, epsilon, grad,
+			var_, accum, lr, grad,
 		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
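The two-line Adagrad update documented for `ResourceApplyAdagrad` can be written out element-wise. A minimal sketch, assuming dense `float64` slices of equal length; `applyAdagrad` is a hypothetical helper, and the variable is named `v` because `var` is a Go keyword.

```go
package main

import (
	"fmt"
	"math"
)

// applyAdagrad is a hypothetical reference helper (not a TensorFlow API).
// It updates v and accum in place:
//   accum += grad * grad
//   v     -= lr * grad * (1 / sqrt(accum))
func applyAdagrad(v, accum, grad []float64, lr float64) {
	for i := range v {
		accum[i] += grad[i] * grad[i]
		v[i] -= lr * grad[i] * (1 / math.Sqrt(accum[i]))
	}
}

func main() {
	v := []float64{1}
	accum := []float64{0}
	applyAdagrad(v, accum, []float64{2}, 0.5)
	fmt.Println(v, accum)
}
```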
 
-// SizeAttr is an optional argument to Size.
-type SizeAttr func(optionalAttr)
+// ResourceApplyPowerSignAttr is an optional argument to ResourceApplyPowerSign.
+type ResourceApplyPowerSignAttr func(optionalAttr)
 
-// SizeOutType sets the optional out_type attribute to value.
-// If not specified, defaults to DT_INT32
-func SizeOutType(value tf.DataType) SizeAttr {
+// ResourceApplyPowerSignUseLocking sets the optional use_locking attribute to value.
+//
+// value: If `True`, updating of the var and m tensors is
+// protected by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyPowerSignUseLocking(value bool) ResourceApplyPowerSignAttr {
 	return func(m optionalAttr) {
-		m["out_type"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Returns the size of a tensor.
+// Update '*var' according to the PowerSign update.
 //
-// This operation returns an integer representing the number of elements in
-// `input`.
+// m_t <- beta * m_{t-1} + (1 - beta) * g
+// update <- exp(logbase * sign_decay * sign(g) * sign(m_t)) * g
+// variable <- variable - lr * update
 //
-// For example:
+// Arguments:
+//	var_: Should be from a Variable().
+//	m: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	logbase: Must be a scalar.
+//	sign_decay: Must be a scalar.
+//	beta: Must be a scalar.
+//	grad: The gradient.
 //
-// ```
-// # 't' is [[[1, 1,, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]]
-// size(t) ==> 12
-// ```
-func Size(scope *Scope, input tf.Output, optional ...SizeAttr) (output tf.Output) {
+// Returns the created operation.
+func ResourceApplyPowerSign(scope *Scope, var_ tf.Output, m tf.Output, lr tf.Output, logbase tf.Output, sign_decay tf.Output, beta tf.Output, grad tf.Output, optional ...ResourceApplyPowerSignAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -7820,78 +7580,76 @@ func Size(scope *Scope, input tf.Output, optional ...SizeAttr) (output tf.Output
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Size",
+		Type: "ResourceApplyPowerSign",
 		Input: []tf.Input{
-			input,
+			var_, m, lr, logbase, sign_decay, beta, grad,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
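The update rule in the `ResourceApplyPowerSign` comment can be sketched element-wise in plain Go. This assumes dense `float64` slices and that `sign(0)` is 0; `applyPowerSign` and `sign` are hypothetical helpers, not part of the generated wrappers.

```go
package main

import (
	"fmt"
	"math"
)

// sign returns -1, 0, or 1 according to the sign of x (assumed convention).
func sign(x float64) float64 {
	switch {
	case x > 0:
		return 1
	case x < 0:
		return -1
	}
	return 0
}

// applyPowerSign is a hypothetical reference helper (not a TensorFlow API).
// It updates v and m in place following the documented formulas:
//   m      <- beta * m + (1 - beta) * g
//   update <- exp(logbase * signDecay * sign(g) * sign(m)) * g
//   v      <- v - lr * update
func applyPowerSign(v, m, grad []float64, lr, logbase, signDecay, beta float64) {
	for i := range v {
		m[i] = beta*m[i] + (1-beta)*grad[i]
		update := math.Exp(logbase*signDecay*sign(grad[i])*sign(m[i])) * grad[i]
		v[i] -= lr * update
	}
}

func main() {
	v := []float64{1}
	m := []float64{0}
	applyPowerSign(v, m, []float64{1}, 0.5, 0, 1, 0.5)
	fmt.Println(v, m)
}
```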
 
-// ResourceScatterNdUpdateAttr is an optional argument to ResourceScatterNdUpdate.
-type ResourceScatterNdUpdateAttr func(optionalAttr)
+// CumprodAttr is an optional argument to Cumprod.
+type CumprodAttr func(optionalAttr)
 
-// ResourceScatterNdUpdateUseLocking sets the optional use_locking attribute to value.
+// CumprodExclusive sets the optional exclusive attribute to value.
 //
-// value: An optional bool. Defaults to True. If True, the assignment will
-// be protected by a lock; otherwise the behavior is undefined,
-// but may exhibit less contention.
-// If not specified, defaults to true
-func ResourceScatterNdUpdateUseLocking(value bool) ResourceScatterNdUpdateAttr {
+// value: If `True`, perform exclusive cumprod.
+// If not specified, defaults to false
+func CumprodExclusive(value bool) CumprodAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["exclusive"] = value
 	}
 }
 
-// Applies sparse `updates` to individual values or slices within a given
+// CumprodReverse sets the optional reverse attribute to value.
 //
-// variable according to `indices`.
+// value: If `True`, perform the cumprod in the opposite direction (default: False).
+// If not specified, defaults to false
+func CumprodReverse(value bool) CumprodAttr {
+	return func(m optionalAttr) {
+		m["reverse"] = value
+	}
+}
+
+// Compute the cumulative product of the tensor `x` along `axis`.
 //
-// `ref` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
+// By default, this op performs an inclusive cumprod, which means that the first
+// element of the input is identical to the first element of the output:
 //
-// `indices` must be integer tensor, containing indices into `ref`.
-// It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
+// ```python
+// tf.cumprod([a, b, c])  # => [a, a * b, a * b * c]
+// ```
 //
-// The innermost dimension of `indices` (with length `K`) corresponds to
-// indices into elements (if `K = P`) or slices (if `K < P`) along the `K`th
-// dimension of `ref`.
-//
-// `updates` is `Tensor` of rank `Q-1+P-K` with shape:
+// By setting the `exclusive` kwarg to `True`, an exclusive cumprod is
+// performed instead:
 //
-// ```
-// [d_0, ..., d_{Q-2}, ref.shape[K], ..., ref.shape[P-1]].
+// ```python
+// tf.cumprod([a, b, c], exclusive=True)  # => [1, a, a * b]
 // ```
 //
-// For example, say we want to update 4 scattered elements to a rank-1 tensor to
-// 8 elements. In Python, that update would look like this:
+// By setting the `reverse` kwarg to `True`, the cumprod is performed in the
+// opposite direction:
 //
 // ```python
-//     ref = tfe.Variable([1, 2, 3, 4, 5, 6, 7, 8])
-//     indices = tf.constant([[4], [3], [1] ,[7]])
-//     updates = tf.constant([9, 10, 11, 12])
-//     update = tf.scatter_nd_update(ref, indices, updates)
-//     with tf.Session() as sess:
-//       print sess.run(update)
+// tf.cumprod([a, b, c], reverse=True)  # => [a * b * c, b * c, c]
 // ```
 //
-// The resulting update to ref would look like this:
+// This is more efficient than using separate `tf.reverse` ops.
 //
-//     [1, 11, 3, 10, 9, 6, 7, 12]
+// The `reverse` and `exclusive` kwargs can also be combined:
 //
-// See @{tf.scatter_nd} for more details about how to make updates to
-// slices.
+// ```python
+// tf.cumprod([a, b, c], exclusive=True, reverse=True)  # => [b * c, c, 1]
+// ```
 //
 // Arguments:
-//	ref: A resource handle. Must be from a VarHandleOp.
-//	indices: A Tensor. Must be one of the following types: int32, int64.
-// A tensor of indices into ref.
-//	updates: A Tensor. Must have the same type as ref. A tensor of updated
-// values to add to ref.
-//
-// Returns the created operation.
-func ResourceScatterNdUpdate(scope *Scope, ref tf.Output, indices tf.Output, updates tf.Output, optional ...ResourceScatterNdUpdateAttr) (o *tf.Operation) {
+//	x: A `Tensor`. Must be one of the following types: `float32`, `float64`,
+// `int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,
+// `complex128`, `qint8`, `quint8`, `qint32`, `half`.
+//	axis: A `Tensor` of type `int32` (default: 0). Must be in the range
+// `[-rank(x), rank(x))`.
+func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr) (out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -7900,115 +7658,101 @@ func ResourceScatterNdUpdate(scope *Scope, ref tf.Output, indices tf.Output, upd
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceScatterNdUpdate",
+		Type: "Cumprod",
 		Input: []tf.Input{
-			ref, indices, updates,
+			x, axis,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
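The `exclusive` and `reverse` options documented for `Cumprod` can be checked against a small reference implementation. A minimal sketch for a 1-D input, assuming `float64` values; `cumprod` here is a hypothetical helper, not the generated wrapper.

```go
package main

import "fmt"

// cumprod is a hypothetical reference helper (not a TensorFlow API).
// It computes the cumulative product of x. With exclusive set, each output
// omits its own input element; with reverse set, accumulation starts from
// the end of x.
func cumprod(x []float64, exclusive, reverse bool) []float64 {
	n := len(x)
	out := make([]float64, n)
	acc := 1.0
	for k := 0; k < n; k++ {
		i := k
		if reverse {
			i = n - 1 - k
		}
		if exclusive {
			out[i] = acc // product of the elements visited before x[i]
			acc *= x[i]
		} else {
			acc *= x[i]
			out[i] = acc // product including x[i]
		}
	}
	return out
}

func main() {
	x := []float64{2, 3, 4}
	fmt.Println(cumprod(x, false, false))
	fmt.Println(cumprod(x, true, false))
	fmt.Println(cumprod(x, false, true))
	fmt.Println(cumprod(x, true, true))
}
```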
 
-// StageSizeAttr is an optional argument to StageSize.
-type StageSizeAttr func(optionalAttr)
-
-// StageSizeCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Computes the mean along segments of a tensor.
 //
-// REQUIRES: value >= 0
-func StageSizeCapacity(value int64) StageSizeAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// StageSizeMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
 //
-// REQUIRES: value >= 0
-func StageSizeMemoryLimit(value int64) StageSizeAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// StageSizeContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func StageSizeContainer(value string) StageSizeAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// StageSizeSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func StageSizeSharedName(value string) StageSizeAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Op returns the number of elements in the underlying container.
-func StageSize(scope *Scope, dtypes []tf.DataType, optional ...StageSizeAttr) (size tf.Output) {
+// Computes a tensor such that
+// \\(output_i = \frac{\sum_j data_j}{N}\\) where `mean` is
+// over `j` such that `segment_ids[j] == i` and `N` is the total number of
+// values summed.
+//
+// If the mean is empty for a given segment ID `i`, `output[i] = 0`.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentMean.png" alt>
+// </div>
+//
+// Arguments:
+//
+//	segment_ids: A 1-D tensor whose size is equal to the size of `data`'s
+// first dimension.  Values should be sorted and can be repeated.
+//
+// Returns a tensor with the same shape as data, except for dimension 0, which
+// has size `k`, the number of segments.
+func SegmentMean(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "StageSize",
-
-		Attrs: attrs,
+		Type: "SegmentMean",
+		Input: []tf.Input{
+			data, segment_ids,
+		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
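The segment semantics described for `SegmentMean` can be sketched for 1-D data. This assumes `segmentIDs` is sorted and non-negative, has the same length as `data`, and that the number of segments `k` is the largest ID plus one; `segmentMean` is a hypothetical helper, not the generated wrapper.

```go
package main

import "fmt"

// segmentMean is a hypothetical reference helper (not a TensorFlow API).
// output[i] is the mean of data[j] over all j with segmentIDs[j] == i.
// Segments with no entries produce 0, as the op documentation states.
func segmentMean(data []float64, segmentIDs []int) []float64 {
	if len(segmentIDs) == 0 {
		return nil
	}
	// IDs are assumed sorted, so the last one is the largest.
	k := segmentIDs[len(segmentIDs)-1] + 1
	sums := make([]float64, k)
	counts := make([]int, k)
	for j, id := range segmentIDs {
		sums[id] += data[j]
		counts[id]++
	}
	out := make([]float64, k)
	for i := range out {
		if counts[i] > 0 {
			out[i] = sums[i] / float64(counts[i])
		}
	}
	return out
}

func main() {
	fmt.Println(segmentMean([]float64{1, 3, 5, 7}, []int{0, 0, 1, 3}))
}
```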
 
-// NonMaxSuppressionAttr is an optional argument to NonMaxSuppression.
-type NonMaxSuppressionAttr func(optionalAttr)
+// ResourceSparseApplyCenteredRMSPropAttr is an optional argument to ResourceSparseApplyCenteredRMSProp.
+type ResourceSparseApplyCenteredRMSPropAttr func(optionalAttr)
 
-// NonMaxSuppressionIouThreshold sets the optional iou_threshold attribute to value.
+// ResourceSparseApplyCenteredRMSPropUseLocking sets the optional use_locking attribute to value.
 //
-// value: A float representing the threshold for deciding whether boxes
-// overlap too much with respect to IOU.
-// If not specified, defaults to 0.5
-func NonMaxSuppressionIouThreshold(value float32) NonMaxSuppressionAttr {
+// value: If `True`, updating of the var, mg, ms, and mom tensors is
+// protected by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceSparseApplyCenteredRMSPropUseLocking(value bool) ResourceSparseApplyCenteredRMSPropAttr {
 	return func(m optionalAttr) {
-		m["iou_threshold"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Greedily selects a subset of bounding boxes in descending order of score,
+// Update '*var' according to the centered RMSProp algorithm.
 //
-// pruning away boxes that have high intersection-over-union (IOU) overlap
-// with previously selected boxes.  Bounding boxes are supplied as
-// [y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any
-// diagonal pair of box corners and the coordinates can be provided as normalized
-// (i.e., lying in the interval [0, 1]) or absolute.  Note that this algorithm
-// is agnostic to where the origin is in the coordinate system.  Note that this
-// algorithm is invariant to orthogonal transformations and translations
-// of the coordinate system; thus translating or reflections of the coordinate
-// system result in the same boxes being selected by the algorithm.
-// The output of this operation is a set of integers indexing into the input
-// collection of bounding boxes representing the selected boxes.  The bounding
-// box coordinates corresponding to the selected indices can then be obtained
-// using the `tf.gather operation`.  For example:
-//   selected_indices = tf.image.non_max_suppression(
-//       boxes, scores, max_output_size, iou_threshold)
-//   selected_boxes = tf.gather(boxes, selected_indices)
+// The centered RMSProp algorithm uses an estimate of the centered second moment
+// (i.e., the variance) for normalization, as opposed to regular RMSProp, which
+// uses the (uncentered) second moment. This often helps with training, but is
+// slightly more expensive in terms of computation and memory.
+//
+// Note that in the dense implementation of this algorithm, mg, ms, and mom will
+// update even if the grad is zero, but in this sparse implementation, mg, ms,
+// and mom will not update in iterations during which the grad is zero.
+//
+// mean_square = decay * mean_square + (1-decay) * gradient ** 2
+// mean_grad = decay * mean_grad + (1-decay) * gradient
+// Delta = learning_rate * gradient / sqrt(mean_square + epsilon - mean_grad ** 2)
+//
+// mg <- rho * mg_{t-1} + (1-rho) * grad
+// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
+// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms - mg * mg + epsilon)
+// var <- var - mom
 //
 // Arguments:
-//	boxes: A 2-D float tensor of shape `[num_boxes, 4]`.
-//	scores: A 1-D float tensor of shape `[num_boxes]` representing a single
-// score corresponding to each box (each row of boxes).
-//	max_output_size: A scalar integer tensor representing the maximum number of
-// boxes to be selected by non max suppression.
+//	var_: Should be from a Variable().
+//	mg: Should be from a Variable().
+//	ms: Should be from a Variable().
+//	mom: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	rho: Decay rate. Must be a scalar.
 //
-// Returns A 1-D integer tensor of shape `[M]` representing the selected
-// indices from the boxes tensor, where `M <= max_output_size`.
-func NonMaxSuppression(scope *Scope, boxes tf.Output, scores tf.Output, max_output_size tf.Output, optional ...NonMaxSuppressionAttr) (selected_indices tf.Output) {
+//	epsilon: Ridge term. Must be a scalar.
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var, mg, ms and mom.
+//
+// Returns the created operation.
+func ResourceSparseApplyCenteredRMSProp(scope *Scope, var_ tf.Output, mg tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyCenteredRMSPropAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -8017,26 +7761,32 @@ func NonMaxSuppression(scope *Scope, boxes tf.Output, scores tf.Output, max_outp
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "NonMaxSuppression",
+		Type: "ResourceSparseApplyCenteredRMSProp",
 		Input: []tf.Input{
-			boxes, scores, max_output_size,
+			var_, mg, ms, mom, lr, rho, momentum, epsilon, grad, indices,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Creates a dataset that emits `components` as a tuple of tensors once.
-func TensorDataset(scope *Scope, components []tf.Output, output_shapes []tf.Shape) (handle tf.Output) {
+// Creates a dataset that batches `batch_size` elements from `input_dataset`.
+//
+// Arguments:
+//
+//	batch_size: A scalar representing the number of elements to accumulate in a
+// batch.
+//
+//
+func BatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_shapes": output_shapes}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "TensorDataset",
+		Type: "BatchDataset",
 		Input: []tf.Input{
-			tf.OutputList(components),
+			input_dataset, batch_size,
 		},
 		Attrs: attrs,
 	}
@@ -8044,74 +7794,94 @@ func TensorDataset(scope *Scope, components []tf.Output, output_shapes []tf.Shap
 	return op.Output(0)
 }
 
-// Component-wise multiplies a SparseTensor by a dense Tensor.
-//
-// The output locations corresponding to the implicitly zero elements in the sparse
-// tensor will be zero (i.e., will not take up storage space), regardless of the
-// contents of the dense tensor (even if it's +/-INF and that INF*0 == NaN).
+// Inverse fast Fourier transform.
 //
-// *Limitation*: this Op only broadcasts the dense side to the sparse side, but not
-// the other direction.
+// Computes the inverse 1-dimensional discrete Fourier transform over the
+// inner-most dimension of `input`.
 //
 // Arguments:
-//	sp_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, possibly not in canonical ordering.
-//	sp_values: 1-D.  `N` non-empty values corresponding to `sp_indices`.
-//	sp_shape: 1-D.  Shape of the input SparseTensor.
-//	dense: `R`-D.  The dense Tensor operand.
+//	input: A complex64 tensor.
 //
-// Returns 1-D.  The `N` values that are operated on.
-func SparseDenseCwiseMul(scope *Scope, sp_indices tf.Output, sp_values tf.Output, sp_shape tf.Output, dense tf.Output) (output tf.Output) {
+// Returns A complex64 tensor of the same shape as `input`. The inner-most
+//   dimension of `input` is replaced with its inverse 1D Fourier transform.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.ifft
+// @end_compatibility
+func IFFT(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseDenseCwiseMul",
+		Type: "IFFT",
 		Input: []tf.Input{
-			sp_indices, sp_values, sp_shape, dense,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceSparseApplyFtrlAttr is an optional argument to ResourceSparseApplyFtrl.
-type ResourceSparseApplyFtrlAttr func(optionalAttr)
+// LRNAttr is an optional argument to LRN.
+type LRNAttr func(optionalAttr)
 
-// ResourceSparseApplyFtrlUseLocking sets the optional use_locking attribute to value.
+// LRNDepthRadius sets the optional depth_radius attribute to value.
 //
-// value: If `True`, updating of the var and accum tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceSparseApplyFtrlUseLocking(value bool) ResourceSparseApplyFtrlAttr {
+// value: 0-D.  Half-width of the 1-D normalization window.
+// If not specified, defaults to 5
+func LRNDepthRadius(value int64) LRNAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["depth_radius"] = value
 	}
 }
 
-// Update relevant entries in '*var' according to the Ftrl-proximal scheme.
+// LRNBias sets the optional bias attribute to value.
 //
-// That is for rows we have grad for, we update var, accum and linear as follows:
-// accum_new = accum + grad * grad
-// linear += grad + (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var
-// quadratic = 1.0 / (accum_new^(lr_power) * lr) + 2 * l2
-// var = (sign(linear) * l1 - linear) / quadratic if |linear| > l1 else 0.0
-// accum = accum_new
+// value: An offset (usually positive to avoid dividing by 0).
+// If not specified, defaults to 1
+func LRNBias(value float32) LRNAttr {
+	return func(m optionalAttr) {
+		m["bias"] = value
+	}
+}
+
+// LRNAlpha sets the optional alpha attribute to value.
 //
-// Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	linear: Should be from a Variable().
-//	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var and accum.
-//	lr: Scaling factor. Must be a scalar.
-//	l1: L1 regularization. Must be a scalar.
-//	l2: L2 regularization. Must be a scalar.
-//	lr_power: Scaling factor. Must be a scalar.
+// value: A scale factor, usually positive.
+// If not specified, defaults to 1
+func LRNAlpha(value float32) LRNAttr {
+	return func(m optionalAttr) {
+		m["alpha"] = value
+	}
+}
+
+// LRNBeta sets the optional beta attribute to value.
 //
-// Returns the created operation.
-func ResourceSparseApplyFtrl(scope *Scope, var_ tf.Output, accum tf.Output, linear tf.Output, grad tf.Output, indices tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, lr_power tf.Output, optional ...ResourceSparseApplyFtrlAttr) (o *tf.Operation) {
+// value: An exponent.
+// If not specified, defaults to 0.5
+func LRNBeta(value float32) LRNAttr {
+	return func(m optionalAttr) {
+		m["beta"] = value
+	}
+}
+
+// Local Response Normalization.
+//
+// The 4-D `input` tensor is treated as a 3-D array of 1-D vectors (along the last
+// dimension), and each vector is normalized independently.  Within a given vector,
+// each component is divided by the weighted, squared sum of inputs within
+// `depth_radius`.  In detail,
+//
+//     sqr_sum[a, b, c, d] =
+//         sum(input[a, b, c, d - depth_radius : d + depth_radius + 1] ** 2)
+//     output = input / (bias + alpha * sqr_sum) ** beta
+//
+// For details, see [Krizhevsky et al., ImageNet classification with deep
+// convolutional neural networks (NIPS 2012)](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks).
+//
+// Arguments:
+//	input: 4-D.
+func LRN(scope *Scope, input tf.Output, optional ...LRNAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -8120,75 +7890,63 @@ func ResourceSparseApplyFtrl(scope *Scope, var_ tf.Output, accum tf.Output, line
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyFtrl",
+		Type: "LRN",
 		Input: []tf.Input{
-			var_, accum, linear, grad, indices, lr, l1, l2, lr_power,
+			input,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
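The normalization formula in the `LRN` comment can be sketched for a single depth vector, which is the unit the op normalizes independently. This assumes the window is clamped at the vector boundaries; `lrn` is a hypothetical helper, not the generated wrapper.

```go
package main

import (
	"fmt"
	"math"
)

// lrn is a hypothetical reference helper (not a TensorFlow API).
// For each position d it computes
//   sqrSum = sum(input[d-depthRadius : d+depthRadius+1] ** 2)
//   output = input[d] / (bias + alpha*sqrSum) ** beta
// The window is clamped to the bounds of input (an assumption of this sketch).
func lrn(input []float64, depthRadius int, bias, alpha, beta float64) []float64 {
	out := make([]float64, len(input))
	for d := range input {
		lo := d - depthRadius
		if lo < 0 {
			lo = 0
		}
		hi := d + depthRadius + 1
		if hi > len(input) {
			hi = len(input)
		}
		sqrSum := 0.0
		for _, v := range input[lo:hi] {
			sqrSum += v * v
		}
		out[d] = input[d] / math.Pow(bias+alpha*sqrSum, beta)
	}
	return out
}

func main() {
	fmt.Println(lrn([]float64{1, 2, 3}, 1, 1, 1, 1))
}
```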
 
-// Returns which elements of x are Inf.
-//
-// @compatibility(numpy)
-// Equivalent to np.isinf
-// @end_compatibility
-func IsInf(scope *Scope, x tf.Output) (y tf.Output) {
+// Creates a dataset that zips together `input_datasets`.
+func ZipDataset(scope *Scope, input_datasets []tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "IsInf",
+		Type: "ZipDataset",
 		Input: []tf.Input{
-			x,
+			tf.OutputList(input_datasets),
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceSparseApplyRMSPropAttr is an optional argument to ResourceSparseApplyRMSProp.
-type ResourceSparseApplyRMSPropAttr func(optionalAttr)
+// ResourceSparseApplyAdagradAttr is an optional argument to ResourceSparseApplyAdagrad.
+type ResourceSparseApplyAdagradAttr func(optionalAttr)
 
-// ResourceSparseApplyRMSPropUseLocking sets the optional use_locking attribute to value.
+// ResourceSparseApplyAdagradUseLocking sets the optional use_locking attribute to value.
 //
-// value: If `True`, updating of the var, ms, and mom tensors is protected
+// value: If `True`, updating of the var and accum tensors will be protected
 // by a lock; otherwise the behavior is undefined, but may exhibit less
 // contention.
 // If not specified, defaults to false
-func ResourceSparseApplyRMSPropUseLocking(value bool) ResourceSparseApplyRMSPropAttr {
+func ResourceSparseApplyAdagradUseLocking(value bool) ResourceSparseApplyAdagradAttr {
 	return func(m optionalAttr) {
 		m["use_locking"] = value
 	}
 }
 
-// Update '*var' according to the RMSProp algorithm.
-//
-// Note that in dense implementation of this algorithm, ms and mom will
-// update even if the grad is zero, but in this sparse implementation, ms
-// and mom will not update in iterations during which the grad is zero.
-//
-// mean_square = decay * mean_square + (1-decay) * gradient ** 2
-// Delta = learning_rate * gradient / sqrt(mean_square + epsilon)
+// Update relevant entries in '*var' and '*accum' according to the adagrad scheme.
 //
-// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
-// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms + epsilon)
-// var <- var - mom
+// That is, for the rows for which we have grad, we update var and accum as follows:
+// accum += grad * grad
+// var -= lr * grad * (1 / sqrt(accum))
 //
 // Arguments:
 //	var_: Should be from a Variable().
-//	ms: Should be from a Variable().
-//	mom: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	rho: Decay rate. Must be a scalar.
-//
-//	epsilon: Ridge term. Must be a scalar.
+//	accum: Should be from a Variable().
+//	lr: Learning rate. Must be a scalar.
 //	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var, ms and mom.
+//	indices: A vector of indices into the first dimension of var and accum.
 //
 // Returns the created operation.
-func ResourceSparseApplyRMSProp(scope *Scope, var_ tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyRMSPropAttr) (o *tf.Operation) {
+func ResourceSparseApplyAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyAdagradAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -8197,168 +7955,131 @@ func ResourceSparseApplyRMSProp(scope *Scope, var_ tf.Output, ms tf.Output, mom
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyRMSProp",
+		Type: "ResourceSparseApplyAdagrad",
 		Input: []tf.Input{
-			var_, ms, mom, lr, rho, momentum, epsilon, grad, indices,
+			var_, accum, lr, grad, indices,
 		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Returns the truth value of (x > y) element-wise.
+// 2D real-valued fast Fourier transform.
 //
-// *NOTE*: `Greater` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Greater(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Computes the 2-dimensional discrete Fourier transform of a real-valued signal
+// over the inner-most 2 dimensions of `input`.
+//
+// Since the DFT of a real signal is Hermitian-symmetric, `RFFT2D` only returns the
+// `fft_length / 2 + 1` unique components of the FFT for the inner-most dimension
+// of `output`: the zero-frequency term, followed by the `fft_length / 2`
+// positive-frequency terms.
+//
+// Along each axis `RFFT2D` is computed on, if `fft_length` is smaller than the
+// corresponding dimension of `input`, the dimension is cropped. If it is larger,
+// the dimension is padded with zeros.
+//
+// Arguments:
+//	input: A float32 tensor.
+//	fft_length: An int32 tensor of shape [2]. The FFT length for each dimension.
+//
+// Returns A complex64 tensor of the same rank as `input`. The inner-most 2
+//   dimensions of `input` are replaced with their 2D Fourier transform. The
+//   inner-most dimension contains `fft_length / 2 + 1` unique frequency
+//   components.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.rfft2
+// @end_compatibility
+func RFFT2D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Greater",
+		Type: "RFFT2D",
 		Input: []tf.Input{
-			x, y,
+			input, fft_length,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// SampleDistortedBoundingBoxAttr is an optional argument to SampleDistortedBoundingBox.
-type SampleDistortedBoundingBoxAttr func(optionalAttr)
+// ResizeAreaAttr is an optional argument to ResizeArea.
+type ResizeAreaAttr func(optionalAttr)
 
-// SampleDistortedBoundingBoxSeed sets the optional seed attribute to value.
+// ResizeAreaAlignCorners sets the optional align_corners attribute to value.
 //
-// value: If either `seed` or `seed2` are set to non-zero, the random number
-// generator is seeded by the given `seed`.  Otherwise, it is seeded by a random
-// seed.
-// If not specified, defaults to 0
-func SampleDistortedBoundingBoxSeed(value int64) SampleDistortedBoundingBoxAttr {
+// value: If true, rescale input by (new_height - 1) / (height - 1), which
+// exactly aligns the 4 corners of images and resized images. If false, rescale
+// by new_height / height. Treat similarly the width dimension.
+// If not specified, defaults to false
+func ResizeAreaAlignCorners(value bool) ResizeAreaAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["align_corners"] = value
 	}
 }
 
-// SampleDistortedBoundingBoxSeed2 sets the optional seed2 attribute to value.
+// Resize `images` to `size` using area interpolation.
 //
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func SampleDistortedBoundingBoxSeed2(value int64) SampleDistortedBoundingBoxAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
-	}
-}
-
-// SampleDistortedBoundingBoxMinObjectCovered sets the optional min_object_covered attribute to value.
+// Input images can be of different types but output images are always float.
 //
-// value: The cropped area of the image must contain at least this
-// fraction of any bounding box supplied. The value of this parameter should be
-// non-negative. In the case of 0, the cropped area does not need to overlap
-// any of the bounding boxes supplied.
-// If not specified, defaults to 0.1
-func SampleDistortedBoundingBoxMinObjectCovered(value float32) SampleDistortedBoundingBoxAttr {
-	return func(m optionalAttr) {
-		m["min_object_covered"] = value
-	}
-}
-
-// SampleDistortedBoundingBoxAspectRatioRange sets the optional aspect_ratio_range attribute to value.
+// Each output pixel is computed by first transforming the pixel's footprint into
+// the input tensor and then averaging the pixels that intersect the footprint. An
+// input pixel's contribution to the average is weighted by the fraction of its
+// area that intersects the footprint.  This is the same as OpenCV's INTER_AREA.
 //
-// value: The cropped area of the image must have an aspect ratio =
-// width / height within this range.
-// If not specified, defaults to <f:0.75 f:1.33 >
-func SampleDistortedBoundingBoxAspectRatioRange(value []float32) SampleDistortedBoundingBoxAttr {
-	return func(m optionalAttr) {
-		m["aspect_ratio_range"] = value
+// Arguments:
+//	images: 4-D with shape `[batch, height, width, channels]`.
+//	size: A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
+// new size for the images.
+//
+// Returns 4-D with shape
+// `[batch, new_height, new_width, channels]`.
+func ResizeArea(scope *Scope, images tf.Output, size tf.Output, optional ...ResizeAreaAttr) (resized_images tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
 	}
+	opspec := tf.OpSpec{
+		Type: "ResizeArea",
+		Input: []tf.Input{
+			images, size,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// SampleDistortedBoundingBoxAreaRange sets the optional area_range attribute to value.
+// StatelessRandomUniformAttr is an optional argument to StatelessRandomUniform.
+type StatelessRandomUniformAttr func(optionalAttr)
+
+// StatelessRandomUniformDtype sets the optional dtype attribute to value.
 //
-// value: The cropped area of the image must contain a fraction of the
-// supplied image within in this range.
-// If not specified, defaults to <f:0.05 f:1 >
-func SampleDistortedBoundingBoxAreaRange(value []float32) SampleDistortedBoundingBoxAttr {
+// value: The type of the output.
+// If not specified, defaults to DT_FLOAT
+func StatelessRandomUniformDtype(value tf.DataType) StatelessRandomUniformAttr {
 	return func(m optionalAttr) {
-		m["area_range"] = value
+		m["dtype"] = value
 	}
 }
 
-// SampleDistortedBoundingBoxMaxAttempts sets the optional max_attempts attribute to value.
-//
-// value: Number of attempts at generating a cropped region of the image
-// of the specified constraints. After `max_attempts` failures, return the entire
-// image.
-// If not specified, defaults to 100
-func SampleDistortedBoundingBoxMaxAttempts(value int64) SampleDistortedBoundingBoxAttr {
-	return func(m optionalAttr) {
-		m["max_attempts"] = value
-	}
-}
-
-// SampleDistortedBoundingBoxUseImageIfNoBoundingBoxes sets the optional use_image_if_no_bounding_boxes attribute to value.
-//
-// value: Controls behavior if no bounding boxes supplied.
-// If true, assume an implicit bounding box covering the whole input. If false,
-// raise an error.
-// If not specified, defaults to false
-func SampleDistortedBoundingBoxUseImageIfNoBoundingBoxes(value bool) SampleDistortedBoundingBoxAttr {
-	return func(m optionalAttr) {
-		m["use_image_if_no_bounding_boxes"] = value
-	}
-}
-
-// Generate a single randomly distorted bounding box for an image.
-//
-// Bounding box annotations are often supplied in addition to ground-truth labels
-// in image recognition or object localization tasks. A common technique for
-// training such a system is to randomly distort an image while preserving
-// its content, i.e. *data augmentation*. This Op outputs a randomly distorted
-// localization of an object, i.e. bounding box, given an `image_size`,
-// `bounding_boxes` and a series of constraints.
-//
-// The output of this Op is a single bounding box that may be used to crop the
-// original image. The output is returned as 3 tensors: `begin`, `size` and
-// `bboxes`. The first 2 tensors can be fed directly into `tf.slice` to crop the
-// image. The latter may be supplied to `tf.image.draw_bounding_boxes` to visualize
-// what the bounding box looks like.
-//
-// Bounding boxes are supplied and returned as `[y_min, x_min, y_max, x_max]`. The
-// bounding box coordinates are floats in `[0.0, 1.0]` relative to the width and
-// height of the underlying image.
-//
-// For example,
-//
-// ```python
-//     # Generate a single distorted bounding box.
-//     begin, size, bbox_for_draw = tf.image.sample_distorted_bounding_box(
-//         tf.shape(image),
-//         bounding_boxes=bounding_boxes)
-//
-//     # Draw the bounding box in an image summary.
-//     image_with_box = tf.image.draw_bounding_boxes(tf.expand_dims(image, 0),
-//                                                   bbox_for_draw)
-//     tf.summary.image('images_with_box', image_with_box)
+// Outputs deterministic pseudorandom values from a uniform distribution.
 //
-//     # Employ the bounding box to distort the image.
-//     distorted_image = tf.slice(image, begin, size)
-// ```
+// The generated values follow a uniform distribution in the range `[0, 1)`. The
+// lower bound 0 is included in the range, while the upper bound 1 is excluded.
 //
-// Note that if no bounding box information is available, setting
-// `use_image_if_no_bounding_boxes = true` will assume there is a single implicit
-// bounding box covering the whole image. If `use_image_if_no_bounding_boxes` is
-// false and no bounding boxes are supplied, an error is raised.
+// The outputs are a deterministic function of `shape` and `seed`.
 //
 // Arguments:
-//	image_size: 1-D, containing `[height, width, channels]`.
-//	bounding_boxes: 3-D with shape `[batch, N, 4]` describing the N bounding boxes
-// associated with the image.
+//	shape: The shape of the output tensor.
+//	seed: 2 seeds (shape [2]).
 //
-// Returns 1-D, containing `[offset_height, offset_width, 0]`. Provide as input to
-// `tf.slice`.1-D, containing `[target_height, target_width, -1]`. Provide as input to
-// `tf.slice`.3-D with shape `[1, 1, 4]` containing the distorted bounding box.
-// Provide as input to `tf.image.draw_bounding_boxes`.
-func SampleDistortedBoundingBox(scope *Scope, image_size tf.Output, bounding_boxes tf.Output, optional ...SampleDistortedBoundingBoxAttr) (begin tf.Output, size tf.Output, bboxes tf.Output) {
+// Returns Random values with specified shape.
+func StatelessRandomUniform(scope *Scope, shape tf.Output, seed tf.Output, optional ...StatelessRandomUniformAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -8367,282 +8088,284 @@ func SampleDistortedBoundingBox(scope *Scope, image_size tf.Output, bounding_box
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SampleDistortedBoundingBox",
+		Type: "StatelessRandomUniform",
 		Input: []tf.Input{
-			image_size, bounding_boxes,
+			shape, seed,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// Returns x / y element-wise for integer types.
-//
-// Truncation designates that negative numbers will round fractional quantities
-// toward zero. I.e. -7 / 5 = -1. This matches C semantics but it is different
-// than Python semantics. See `FloorDiv` for a division function that matches
-// Python Semantics.
-//
-// *NOTE*: `TruncateDiv` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func TruncateDiv(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "TruncateDiv",
-		Input: []tf.Input{
-			x, y,
-		},
+// AngleAttr is an optional argument to Angle.
+type AngleAttr func(optionalAttr)
+
+// AngleTout sets the optional Tout attribute to value.
+// If not specified, defaults to DT_FLOAT
+func AngleTout(value tf.DataType) AngleAttr {
+	return func(m optionalAttr) {
+		m["Tout"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Restores tensors from a V2 checkpoint.
+// Returns the argument of a complex number.
 //
-// For backward compatibility with the V1 format, this Op currently allows
-// restoring from a V1 checkpoint as well:
-//   - This Op first attempts to find the V2 index file pointed to by "prefix", and
-//     if found proceed to read it as a V2 checkpoint;
-//   - Otherwise the V1 read path is invoked.
-// Relying on this behavior is not recommended, as the ability to fall back to read
-// V1 might be deprecated and eventually removed.
+// Given a tensor `input` of complex numbers, this operation returns a tensor of
+// type `float` that is the argument of each element in `input`. All elements in
+// `input` must be complex numbers of the form \\(a + bj\\), where *a*
+// is the real part and *b* is the imaginary part.
 //
-// By default, restores the named tensors in full.  If the caller wishes to restore
-// specific slices of stored tensors, "shape_and_slices" should be non-empty
-// strings and correspondingly well-formed.
+// The argument returned by this operation is of the form \\(atan2(b, a)\\).
 //
-// Callers must ensure all the named tensors are indeed stored in the checkpoint.
+// For example:
 //
-// Arguments:
-//	prefix: Must have a single element.  The prefix of a V2 checkpoint.
-//	tensor_names: shape {N}.  The names of the tensors to be restored.
-//	shape_and_slices: shape {N}.  The slice specs of the tensors to be restored.
-// Empty strings indicate that they are non-partitioned tensors.
-//	dtypes: shape {N}.  The list of expected dtype for the tensors.  Must match
-// those stored in the checkpoint.
+// ```
+// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
+// tf.angle(input) ==> [2.0132, 1.056]
+// ```
 //
-// Returns shape {N}.  The restored tensors, whose shapes are read from the
-// checkpoint directly.
-func RestoreV2(scope *Scope, prefix tf.Output, tensor_names tf.Output, shape_and_slices tf.Output, dtypes []tf.DataType) (tensors []tf.Output) {
+// @compatibility(numpy)
+// Equivalent to np.angle.
+// @end_compatibility
+func Angle(scope *Scope, input tf.Output, optional ...AngleAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "RestoreV2",
+		Type: "Angle",
 		Input: []tf.Input{
-			prefix, tensor_names, shape_and_slices,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if tensors, idx, err = makeOutputList(op, idx, "tensors"); err != nil {
-		scope.UpdateErr("RestoreV2", err)
-		return
+	return op.Output(0)
+}
+
+// VarHandleOpAttr is an optional argument to VarHandleOp.
+type VarHandleOpAttr func(optionalAttr)
+
+// VarHandleOpContainer sets the optional container attribute to value.
+//
+// value: the container this variable is placed in.
+// If not specified, defaults to ""
+func VarHandleOpContainer(value string) VarHandleOpAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	return tensors
 }
 
-// Decode web-safe base64-encoded strings.
+// VarHandleOpSharedName sets the optional shared_name attribute to value.
 //
-// Input may or may not have padding at the end. See EncodeBase64 for padding.
-// Web-safe means that input must use - and _ instead of + and /.
+// value: the name by which this variable is referred to.
+// If not specified, defaults to ""
+func VarHandleOpSharedName(value string) VarHandleOpAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Creates a handle to a Variable resource.
 //
 // Arguments:
-//	input: Base64 strings to decode.
-//
-// Returns Decoded strings.
-func DecodeBase64(scope *Scope, input tf.Output) (output tf.Output) {
+//	dtype: the type of this variable. Must agree with the dtypes
+// of all ops using this variable.
+//	shape: The (possibly partially specified) shape of this variable.
+func VarHandleOp(scope *Scope, dtype tf.DataType, shape tf.Shape, optional ...VarHandleOpAttr) (resource tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtype": dtype, "shape": shape}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "DecodeBase64",
-		Input: []tf.Input{
-			input,
-		},
+		Type: "VarHandleOp",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Store the input tensor in the state of the current session.
-//
-// Arguments:
-//	value: The tensor to be stored.
+// Elementwise computes the bitwise XOR of `x` and `y`.
 //
-// Returns The handle for the tensor stored in the session state, represented
-// as a string.
-func GetSessionHandle(scope *Scope, value tf.Output) (handle tf.Output) {
+// The result will have those bits set that differ between `x` and `y`. The
+// computation is performed on the underlying representations of `x` and `y`.
+func BitwiseXor(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "GetSessionHandle",
+		Type: "BitwiseXor",
 		Input: []tf.Input{
-			value,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceSparseApplyProximalAdagradAttr is an optional argument to ResourceSparseApplyProximalAdagrad.
-type ResourceSparseApplyProximalAdagradAttr func(optionalAttr)
-
-// ResourceSparseApplyProximalAdagradUseLocking sets the optional use_locking attribute to value.
+// Deserialize `SparseTensor` objects.
 //
-// value: If True, updating of the var and accum tensors will be protected by
-// a lock; otherwise the behavior is undefined, but may exhibit less contention.
-// If not specified, defaults to false
-func ResourceSparseApplyProximalAdagradUseLocking(value bool) ResourceSparseApplyProximalAdagradAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
-	}
-}
-
-// Sparse update entries in '*var' and '*accum' according to FOBOS algorithm.
+// The input `serialized_sparse` must have the shape `[?, ?, ..., ?, 3]` where
+// the last dimension stores serialized `SparseTensor` objects and the other N
+// dimensions (N >= 0) correspond to a batch. The ranks of the original
+// `SparseTensor` objects must all match. When the final `SparseTensor` is
+// created, its rank is the rank of the incoming `SparseTensor` objects plus N;
+// the sparse tensors have been concatenated along new dimensions, one for each
+// batch.
 //
-// That is for rows we have grad for, we update var and accum as follows:
-// accum += grad * grad
-// prox_v = var
-// prox_v -= lr * grad * (1 / sqrt(accum))
-// var = sign(prox_v)/(1+lr*l2) * max{|prox_v|-lr*l1,0}
+// The output `SparseTensor` object's shape values for the original dimensions
+// are the max across the input `SparseTensor` objects' shape values for the
+// corresponding dimensions. The new dimensions match the size of the batch.
 //
-// Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	lr: Learning rate. Must be a scalar.
-//	l1: L1 regularization. Must be a scalar.
-//	l2: L2 regularization. Must be a scalar.
-//	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var and accum.
+// The input `SparseTensor` objects' indices are assumed ordered in
+// standard lexicographic order.  If this is not the case, after this
+// step run `SparseReorder` to restore index ordering.
 //
-// Returns the created operation.
-func ResourceSparseApplyProximalAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyProximalAdagradAttr) (o *tf.Operation) {
+// For example, if the serialized input is a `[2 x 3]` matrix representing two
+// original `SparseTensor` objects:
+//
+//     index = [ 0]
+//             [10]
+//             [20]
+//     values = [1, 2, 3]
+//     shape = [50]
+//
+// and
+//
+//     index = [ 2]
+//             [10]
+//     values = [4, 5]
+//     shape = [30]
+//
+// then the final deserialized `SparseTensor` will be:
+//
+//     index = [0  0]
+//             [0 10]
+//             [0 20]
+//             [1  2]
+//             [1 10]
+//     values = [1, 2, 3, 4, 5]
+//     shape = [2 50]
+//
+// Arguments:
+//	serialized_sparse: The serialized `SparseTensor` objects. The last dimension
+// must have 3 columns.
+//	dtype: The `dtype` of the serialized `SparseTensor` objects.
+func DeserializeSparse(scope *Scope, serialized_sparse tf.Output, dtype tf.DataType) (sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"dtype": dtype}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyProximalAdagrad",
+		Type: "DeserializeSparse",
 		Input: []tf.Input{
-			var_, accum, lr, l1, l2, grad, indices,
+			serialized_sparse,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
-}
-
-// Returns element-wise largest integer not greater than x.
-func Floor(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Floor",
-		Input: []tf.Input{
-			x,
-		},
-	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Computes the Gauss error function of `x` element-wise.
-func Erf(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Erf",
-		Input: []tf.Input{
-			x,
-		},
+// ResourceApplyRMSPropAttr is an optional argument to ResourceApplyRMSProp.
+type ResourceApplyRMSPropAttr func(optionalAttr)
+
+// ResourceApplyRMSPropUseLocking sets the optional use_locking attribute to value.
+//
+// value: If `True`, updating of the var, ms, and mom tensors is protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyRMSPropUseLocking(value bool) ResourceApplyRMSPropAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Reads the value of a variable.
+// Update '*var' according to the RMSProp algorithm.
 //
-// The tensor returned by this operation is immutable.
+// Note that in the dense implementation of this algorithm, ms and mom will
+// update even if the grad is zero, but in this sparse implementation, ms
+// and mom will not update in iterations during which the grad is zero.
 //
-// The value returned by this operation is guaranteed to be influenced by all the
-// writes on which this operation depends directly or indirectly, and to not be
-// influenced by any of the writes which depend directly or indirectly on this
-// operation.
+// mean_square = decay * mean_square + (1-decay) * gradient ** 2
+// Delta = learning_rate * gradient / sqrt(mean_square + epsilon)
+//
+// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
+// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms + epsilon)
+// var <- var - mom
 //
 // Arguments:
-//	resource: handle to the resource in which to store the variable.
-//	dtype: the dtype of the value.
-func ReadVariableOp(scope *Scope, resource tf.Output, dtype tf.DataType) (value tf.Output) {
+//	var_: Should be from a Variable().
+//	ms: Should be from a Variable().
+//	mom: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	rho: Decay rate. Must be a scalar.
+//
+//	epsilon: Ridge term. Must be a scalar.
+//	grad: The gradient.
+//
+// Returns the created operation.
+func ResourceApplyRMSProp(scope *Scope, var_ tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyRMSPropAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ReadVariableOp",
+		Type: "ResourceApplyRMSProp",
 		Input: []tf.Input{
-			resource,
+			var_, ms, mom, lr, rho, momentum, epsilon, grad,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// MaxPool3DGradAttr is an optional argument to MaxPool3DGrad.
-type MaxPool3DGradAttr func(optionalAttr)
+// SizeAttr is an optional argument to Size.
+type SizeAttr func(optionalAttr)
 
-// MaxPool3DGradDataFormat sets the optional data_format attribute to value.
-//
-// value: The data format of the input and output data. With the
-// default format "NDHWC", the data is stored in the order of:
-//     [batch, in_depth, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCDHW", the data storage order is:
-//     [batch, in_channels, in_depth, in_height, in_width].
-// If not specified, defaults to "NDHWC"
-func MaxPool3DGradDataFormat(value string) MaxPool3DGradAttr {
+// SizeOutType sets the optional out_type attribute to value.
+// If not specified, defaults to DT_INT32
+func SizeOutType(value tf.DataType) SizeAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["out_type"] = value
 	}
 }
 
-// Computes gradients of max pooling function.
+// Returns the size of a tensor.
 //
-// Arguments:
-//	orig_input: The original input tensor.
-//	orig_output: The original output tensor.
-//	grad: Output backprop of shape `[batch, depth, rows, cols, channels]`.
-//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
-// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
-func MaxPool3DGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPool3DGradAttr) (output tf.Output) {
+// This operation returns an integer representing the number of elements in
+// `input`.
+//
+// For example:
+//
+// ```
+// # 't' is [[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]
+// size(t) ==> 12
+// ```
+func Size(scope *Scope, input tf.Output, optional ...SizeAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MaxPool3DGrad",
+		Type: "Size",
 		Input: []tf.Input{
-			orig_input, orig_output, grad,
+			input,
 		},
 		Attrs: attrs,
 	}
@@ -8650,43 +8373,68 @@ func MaxPool3DGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, gr
 	return op.Output(0)
 }
 
-// SparseReduceSumAttr is an optional argument to SparseReduceSum.
-type SparseReduceSumAttr func(optionalAttr)
+// ResourceScatterNdUpdateAttr is an optional argument to ResourceScatterNdUpdate.
+type ResourceScatterNdUpdateAttr func(optionalAttr)
 
-// SparseReduceSumKeepDims sets the optional keep_dims attribute to value.
+// ResourceScatterNdUpdateUseLocking sets the optional use_locking attribute to value.
 //
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func SparseReduceSumKeepDims(value bool) SparseReduceSumAttr {
+// value: An optional bool. Defaults to True. If True, the assignment will
+// be protected by a lock; otherwise the behavior is undefined,
+// but may exhibit less contention.
+// If not specified, defaults to true
+func ResourceScatterNdUpdateUseLocking(value bool) ResourceScatterNdUpdateAttr {
 	return func(m optionalAttr) {
-		m["keep_dims"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Computes the sum of elements across dimensions of a SparseTensor.
+// Applies sparse `updates` to individual values or slices within a given
 //
-// This Op takes a SparseTensor and is the sparse counterpart to
-// `tf.reduce_sum()`.  In particular, this Op also returns a dense `Tensor`
-// instead of a sparse one.
+// variable according to `indices`.
 //
-// Reduces `sp_input` along the dimensions given in `reduction_axes`.  Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `reduction_axes`. If `keep_dims` is true, the reduced dimensions are retained
-// with length 1.
+// `ref` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
 //
-// If `reduction_axes` has no entries, all dimensions are reduced, and a tensor
-// with a single element is returned.  Additionally, the axes can be negative,
-// which are interpreted according to the indexing rules in Python.
+// `indices` must be an integer tensor containing indices into `ref`.
+// It must have shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
+//
+// The innermost dimension of `indices` (with length `K`) corresponds to
+// indices into elements (if `K = P`) or slices (if `K < P`) along the `K`th
+// dimension of `ref`.
+//
+// `updates` is a `Tensor` of rank `Q-1+P-K` with shape:
+//
+// ```
+// [d_0, ..., d_{Q-2}, ref.shape[K], ..., ref.shape[P-1]].
+// ```
+//
+// For example, say we want to update 4 scattered elements in a rank-1 tensor
+// with 8 elements. In Python, that update would look like this:
+//
+// ```python
+//     ref = tfe.Variable([1, 2, 3, 4, 5, 6, 7, 8])
+//     indices = tf.constant([[4], [3], [1] ,[7]])
+//     updates = tf.constant([9, 10, 11, 12])
+//     update = tf.scatter_nd_update(ref, indices, updates)
+//     with tf.Session() as sess:
+//       print(sess.run(update))
+// ```
+//
+// The resulting update to ref would look like this:
+//
+//     [1, 11, 3, 10, 9, 6, 7, 12]
+//
+// See @{tf.scatter_nd} for more details about how to make updates to
+// slices.
 //
 // Arguments:
-//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, possibly not in canonical ordering.
-//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
-//	input_shape: 1-D.  Shape of the input SparseTensor.
-//	reduction_axes: 1-D.  Length-`K` vector containing the reduction axes.
+//	ref: A resource handle. Must be from a VarHandleOp.
+//	indices: A Tensor. Must be one of the following types: int32, int64.
+// A tensor of indices into ref.
+//	updates: A Tensor. Must have the same type as ref. A tensor of updated
+// values to store in ref.
 //
-// Returns `R-K`-D.  The reduced Tensor.
-func SparseReduceSum(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output, reduction_axes tf.Output, optional ...SparseReduceSumAttr) (output tf.Output) {
+// Returns the created operation.
+func ResourceScatterNdUpdate(scope *Scope, ref tf.Output, indices tf.Output, updates tf.Output, optional ...ResourceScatterNdUpdateAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -8695,113 +8443,109 @@ func SparseReduceSum(scope *Scope, input_indices tf.Output, input_values tf.Outp
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseReduceSum",
+		Type: "ResourceScatterNdUpdate",
 		Input: []tf.Input{
-			input_indices, input_values, input_shape, reduction_axes,
+			ref, indices, updates,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Partitions `data` into `num_partitions` tensors using indices from `partitions`.
-//
-// For each index tuple `js` of size `partitions.ndim`, the slice `data[js, ...]`
-// becomes part of `outputs[partitions[js]]`.  The slices with `partitions[js] = i`
-// are placed in `outputs[i]` in lexicographic order of `js`, and the first
-// dimension of `outputs[i]` is the number of entries in `partitions` equal to `i`.
-// In detail,
+// SqueezeAttr is an optional argument to Squeeze.
+type SqueezeAttr func(optionalAttr)
+
+// SqueezeAxis sets the optional axis attribute to value.
 //
-// ```python
-//     outputs[i].shape = [sum(partitions == i)] + data.shape[partitions.ndim:]
+// value: If specified, only squeezes the dimensions listed. The dimension
+// index starts at 0. It is an error to squeeze a dimension that is not 1. Must
+// be in the range `[-rank(input), rank(input))`.
+// If not specified, defaults to <>
 //
-//     outputs[i] = pack([data[js, ...] for js if partitions[js] == i])
-// ```
+// REQUIRES: len(value) >= 0
+func SqueezeAxis(value []int64) SqueezeAttr {
+	return func(m optionalAttr) {
+		m["squeeze_dims"] = value
+	}
+}
+
+// Removes dimensions of size 1 from the shape of a tensor.
 //
-// `data.shape` must start with `partitions.shape`.
+// Given a tensor `input`, this operation returns a tensor of the same type with
+// all dimensions of size 1 removed. If you don't want to remove all size 1
+// dimensions, you can remove specific size 1 dimensions by specifying
+// `axis`.
 //
 // For example:
 //
-// ```python
-//     # Scalar partitions.
-//     partitions = 1
-//     num_partitions = 2
-//     data = [10, 20]
-//     outputs[0] = []  # Empty with shape [0, 2]
-//     outputs[1] = [[10, 20]]
-//
-//     # Vector partitions.
-//     partitions = [0, 0, 1, 1, 0]
-//     num_partitions = 2
-//     data = [10, 20, 30, 40, 50]
-//     outputs[0] = [10, 20, 50]
-//     outputs[1] = [30, 40]
+// ```
+// # 't' is a tensor of shape [1, 2, 1, 3, 1, 1]
+// shape(squeeze(t)) ==> [2, 3]
 // ```
 //
-// See `dynamic_stitch` for an example on how to merge partitions back.
+// Or, to remove specific size 1 dimensions:
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicPartition.png" alt>
-// </div>
+// ```
+// # 't' is a tensor of shape [1, 2, 1, 3, 1, 1]
+// shape(squeeze(t, [2, 4])) ==> [1, 2, 3, 1]
+// ```
 //
 // Arguments:
+//	input: The `input` to squeeze.
 //
-//	partitions: Any shape.  Indices in the range `[0, num_partitions)`.
-//	num_partitions: The number of partitions to output.
-func DynamicPartition(scope *Scope, data tf.Output, partitions tf.Output, num_partitions int64) (outputs []tf.Output) {
+// Returns a tensor that contains the same data as `input`, but has one or more dimensions of
+// size 1 removed.
+func Squeeze(scope *Scope, input tf.Output, optional ...SqueezeAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_partitions": num_partitions}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "DynamicPartition",
+		Type: "Squeeze",
 		Input: []tf.Input{
-			data, partitions,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if outputs, idx, err = makeOutputList(op, idx, "outputs"); err != nil {
-		scope.UpdateErr("DynamicPartition", err)
-		return
-	}
-	return outputs
+	return op.Output(0)
 }
 
-// ResourceApplyAdagradAttr is an optional argument to ResourceApplyAdagrad.
-type ResourceApplyAdagradAttr func(optionalAttr)
+// ResourceApplyAdadeltaAttr is an optional argument to ResourceApplyAdadelta.
+type ResourceApplyAdadeltaAttr func(optionalAttr)
 
-// ResourceApplyAdagradUseLocking sets the optional use_locking attribute to value.
+// ResourceApplyAdadeltaUseLocking sets the optional use_locking attribute to value.
 //
-// value: If `True`, updating of the var and accum tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
+// value: If True, updating of the var, accum and update_accum tensors will be protected by
+// a lock; otherwise the behavior is undefined, but may exhibit less contention.
 // If not specified, defaults to false
-func ResourceApplyAdagradUseLocking(value bool) ResourceApplyAdagradAttr {
+func ResourceApplyAdadeltaUseLocking(value bool) ResourceApplyAdadeltaAttr {
 	return func(m optionalAttr) {
 		m["use_locking"] = value
 	}
 }
 
-// Update '*var' according to the adagrad scheme.
+// Update '*var' according to the adadelta scheme.
 //
-// accum += grad * grad
-// var -= lr * grad * (1 / sqrt(accum))
+// accum = rho() * accum + (1 - rho()) * grad.square();
+// update = (update_accum + epsilon()).sqrt() * (accum + epsilon()).rsqrt() * grad;
+// update_accum = rho() * update_accum + (1 - rho()) * update.square();
+// var -= update;
 //
 // Arguments:
 //	var_: Should be from a Variable().
 //	accum: Should be from a Variable().
+//	accum_update: Should be from a Variable().
 //	lr: Scaling factor. Must be a scalar.
+//	rho: Decay factor. Must be a scalar.
+//	epsilon: Constant factor. Must be a scalar.
 //	grad: The gradient.
 //
 // Returns the created operation.
-func ResourceApplyAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, grad tf.Output, optional ...ResourceApplyAdagradAttr) (o *tf.Operation) {
+func ResourceApplyAdadelta(scope *Scope, var_ tf.Output, accum tf.Output, accum_update tf.Output, lr tf.Output, rho tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyAdadeltaAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -8810,179 +8554,238 @@ func ResourceApplyAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.O
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyAdagrad",
+		Type: "ResourceApplyAdadelta",
 		Input: []tf.Input{
-			var_, accum, lr, grad,
+			var_, accum, accum_update, lr, rho, epsilon, grad,
 		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Returns element-wise remainder of division. This emulates C semantics in that
+// NonMaxSuppressionAttr is an optional argument to NonMaxSuppression.
+type NonMaxSuppressionAttr func(optionalAttr)
+
+// NonMaxSuppressionIouThreshold sets the optional iou_threshold attribute to value.
 //
-// the result here is consistent with a truncating divide. E.g. `truncate(x / y) *
-// y + truncate_mod(x, y) = x`.
+// value: A float representing the threshold for deciding whether boxes
+// overlap too much with respect to IOU.
+// If not specified, defaults to 0.5
+func NonMaxSuppressionIouThreshold(value float32) NonMaxSuppressionAttr {
+	return func(m optionalAttr) {
+		m["iou_threshold"] = value
+	}
+}
+
+// Greedily selects a subset of bounding boxes in descending order of score,
 //
-// *NOTE*: `TruncateMod` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func TruncateMod(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// pruning away boxes that have high intersection-over-union (IOU) overlap
+// with previously selected boxes.  Bounding boxes are supplied as
+// [y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any
+// diagonal pair of box corners and the coordinates can be provided as normalized
+// (i.e., lying in the interval [0, 1]) or absolute.  Note that this algorithm
+// is agnostic to where the origin is in the coordinate system.  Note that this
+// algorithm is invariant to orthogonal transformations and translations
+// of the coordinate system; thus translations or reflections of the coordinate
+// system result in the same boxes being selected by the algorithm.
+// The output of this operation is a set of integers indexing into the input
+// collection of bounding boxes representing the selected boxes.  The bounding
+// box coordinates corresponding to the selected indices can then be obtained
+// using the `tf.gather` operation.  For example:
+//   selected_indices = tf.image.non_max_suppression(
+//       boxes, scores, max_output_size, iou_threshold)
+//   selected_boxes = tf.gather(boxes, selected_indices)
+//
+// Arguments:
+//	boxes: A 2-D float tensor of shape `[num_boxes, 4]`.
+//	scores: A 1-D float tensor of shape `[num_boxes]` representing a single
+// score corresponding to each box (each row of boxes).
+//	max_output_size: A scalar integer tensor representing the maximum number of
+// boxes to be selected by non max suppression.
+//
+// Returns A 1-D integer tensor of shape `[M]` representing the selected
+// indices from the boxes tensor, where `M <= max_output_size`.
+func NonMaxSuppression(scope *Scope, boxes tf.Output, scores tf.Output, max_output_size tf.Output, optional ...NonMaxSuppressionAttr) (selected_indices tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "TruncateMod",
+		Type: "NonMaxSuppression",
 		Input: []tf.Input{
-			x, y,
+			boxes, scores, max_output_size,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Inverse 2D real-valued fast Fourier transform.
-//
-// Computes the inverse 2-dimensional discrete Fourier transform of a real-valued
-// signal over the inner-most 2 dimensions of `input`.
+// Creates a dataset that emits `components` as a tuple of tensors once.
+func TensorDataset(scope *Scope, components []tf.Output, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "TensorDataset",
+		Input: []tf.Input{
+			tf.OutputList(components),
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Component-wise multiplies a SparseTensor by a dense Tensor.
 //
-// The inner-most 2 dimensions of `input` are assumed to be the result of `RFFT2D`:
-// The inner-most dimension contains the `fft_length / 2 + 1` unique components of
-// the DFT of a real-valued signal. If `fft_length` is not provided, it is computed
-// from the size of the inner-most 2 dimensions of `input`. If the FFT length used
-// to compute `input` is odd, it should be provided since it cannot be inferred
-// properly.
+// The output locations corresponding to the implicitly zero elements in the sparse
+// tensor will be zero (i.e., will not take up storage space), regardless of the
+// contents of the dense tensor (even if it is +/-INF, despite INF*0 == NaN).
 //
-// Along each axis `IRFFT2D` is computed on, if `fft_length` (or
-// `fft_length / 2 + 1` for the inner-most dimension) is smaller than the
-// corresponding dimension of `input`, the dimension is cropped. If it is larger,
-// the dimension is padded with zeros.
+// *Limitation*: this Op only broadcasts the dense side to the sparse side, but not
+// the other direction.
 //
 // Arguments:
-//	input: A complex64 tensor.
-//	fft_length: An int32 tensor of shape [2]. The FFT length for each dimension.
-//
-// Returns A float32 tensor of the same rank as `input`. The inner-most 2
-//   dimensions of `input` are replaced with the `fft_length` samples of their
-//   inverse 2D Fourier transform.
+//	sp_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, possibly not in canonical ordering.
+//	sp_values: 1-D.  `N` non-empty values corresponding to `sp_indices`.
+//	sp_shape: 1-D.  Shape of the input SparseTensor.
+//	dense: `R`-D.  The dense Tensor operand.
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.irfft2
-// @end_compatibility
-func IRFFT2D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
+// Returns 1-D.  The `N` values that are operated on.
+func SparseDenseCwiseMul(scope *Scope, sp_indices tf.Output, sp_values tf.Output, sp_shape tf.Output, dense tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "IRFFT2D",
+		Type: "SparseDenseCwiseMul",
 		Input: []tf.Input{
-			input, fft_length,
+			sp_indices, sp_values, sp_shape, dense,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Transforms a vector of brain.Example protos (as strings) into typed tensors.
+// ResourceSparseApplyFtrlAttr is an optional argument to ResourceSparseApplyFtrl.
+type ResourceSparseApplyFtrlAttr func(optionalAttr)
+
+// ResourceSparseApplyFtrlUseLocking sets the optional use_locking attribute to value.
 //
-// Arguments:
-//	serialized: A vector containing a batch of binary serialized Example protos.
-//	names: A vector containing the names of the serialized protos.
-// May contain, for example, table key (descriptive) names for the
-// corresponding serialized protos.  These are purely useful for debugging
-// purposes, and the presence of values here has no effect on the output.
-// May also be an empty vector if no names are available.
-// If non-empty, this vector must be the same length as "serialized".
-//	sparse_keys: A list of Nsparse string Tensors (scalars).
-// The keys expected in the Examples' features associated with sparse values.
-//	dense_keys: A list of Ndense string Tensors (scalars).
-// The keys expected in the Examples' features associated with dense values.
-//	dense_defaults: A list of Ndense Tensors (some may be empty).
-// dense_defaults[j] provides default values
-// when the example's feature_map lacks dense_key[j].  If an empty Tensor is
-// provided for dense_defaults[j], then the Feature dense_keys[j] is required.
-// The input type is inferred from dense_defaults[j], even when it's empty.
-// If dense_defaults[j] is not empty, and dense_shapes[j] is fully defined,
-// then the shape of dense_defaults[j] must match that of dense_shapes[j].
-// If dense_shapes[j] has an undefined major dimension (variable strides dense
-// feature), dense_defaults[j] must contain a single element:
-// the padding element.
-//	sparse_types: A list of Nsparse types; the data types of data in each Feature
-// given in sparse_keys.
-// Currently the ParseExample supports DT_FLOAT (FloatList),
-// DT_INT64 (Int64List), and DT_STRING (BytesList).
-//	dense_shapes: A list of Ndense shapes; the shapes of data in each Feature
-// given in dense_keys.
-// The number of elements in the Feature corresponding to dense_key[j]
-// must always equal dense_shapes[j].NumEntries().
-// If dense_shapes[j] == (D0, D1, ..., DN) then the shape of output
-// Tensor dense_values[j] will be (|serialized|, D0, D1, ..., DN):
-// The dense outputs are just the inputs row-stacked by batch.
-// This works for dense_shapes[j] = (-1, D1, ..., DN).  In this case
-// the shape of the output Tensor dense_values[j] will be
-// (|serialized|, M, D1, .., DN), where M is the maximum number of blocks
-// of elements of length D1 * .... * DN, across all minibatch entries
-// in the input.  Any minibatch entry with less than M blocks of elements of
-// length D1 * ... * DN will be padded with the corresponding default_value
-// scalar element along the second dimension.
-func ParseExample(scope *Scope, serialized tf.Output, names tf.Output, sparse_keys []tf.Output, dense_keys []tf.Output, dense_defaults []tf.Output, sparse_types []tf.DataType, dense_shapes []tf.Shape) (sparse_indices []tf.Output, sparse_values []tf.Output, sparse_shapes []tf.Output, dense_values []tf.Output) {
+// value: If `True`, updating of the var and accum tensors will be protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceSparseApplyFtrlUseLocking(value bool) ResourceSparseApplyFtrlAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// Update relevant entries in '*var' according to the Ftrl-proximal scheme.
+//
+// That is, for rows for which we have grad, we update var, accum and linear as follows:
+// accum_new = accum + grad * grad
+// linear += grad + (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var
+// quadratic = 1.0 / (accum_new^(lr_power) * lr) + 2 * l2
+// var = (sign(linear) * l1 - linear) / quadratic if |linear| > l1 else 0.0
+// accum = accum_new
+//
+// Arguments:
+//	var_: Should be from a Variable().
+//	accum: Should be from a Variable().
+//	linear: Should be from a Variable().
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var and accum.
+//	lr: Scaling factor. Must be a scalar.
+//	l1: L1 regularization. Must be a scalar.
+//	l2: L2 regularization. Must be a scalar.
+//	lr_power: Scaling factor. Must be a scalar.
+//
+// Returns the created operation.
+func ResourceSparseApplyFtrl(scope *Scope, var_ tf.Output, accum tf.Output, linear tf.Output, grad tf.Output, indices tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, lr_power tf.Output, optional ...ResourceSparseApplyFtrlAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"sparse_types": sparse_types, "dense_shapes": dense_shapes}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ParseExample",
+		Type: "ResourceSparseApplyFtrl",
 		Input: []tf.Input{
-			serialized, names, tf.OutputList(sparse_keys), tf.OutputList(dense_keys), tf.OutputList(dense_defaults),
+			var_, accum, linear, grad, indices, lr, l1, l2, lr_power,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
+	return scope.AddOperation(opspec)
+}
+
+// Returns which elements of x are Inf.
+//
+// @compatibility(numpy)
+// Equivalent to np.isinf
+// @end_compatibility
+func IsInf(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	var idx int
-	var err error
-	if sparse_indices, idx, err = makeOutputList(op, idx, "sparse_indices"); err != nil {
-		scope.UpdateErr("ParseExample", err)
-		return
-	}
-	if sparse_values, idx, err = makeOutputList(op, idx, "sparse_values"); err != nil {
-		scope.UpdateErr("ParseExample", err)
-		return
-	}
-	if sparse_shapes, idx, err = makeOutputList(op, idx, "sparse_shapes"); err != nil {
-		scope.UpdateErr("ParseExample", err)
-		return
-	}
-	if dense_values, idx, err = makeOutputList(op, idx, "dense_values"); err != nil {
-		scope.UpdateErr("ParseExample", err)
-		return
+	opspec := tf.OpSpec{
+		Type: "IsInf",
+		Input: []tf.Input{
+			x,
+		},
 	}
-	return sparse_indices, sparse_values, sparse_shapes, dense_values
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// VariableShapeAttr is an optional argument to VariableShape.
-type VariableShapeAttr func(optionalAttr)
+// ResourceSparseApplyRMSPropAttr is an optional argument to ResourceSparseApplyRMSProp.
+type ResourceSparseApplyRMSPropAttr func(optionalAttr)
 
-// VariableShapeOutType sets the optional out_type attribute to value.
-// If not specified, defaults to DT_INT32
-func VariableShapeOutType(value tf.DataType) VariableShapeAttr {
+// ResourceSparseApplyRMSPropUseLocking sets the optional use_locking attribute to value.
+//
+// value: If `True`, updating of the var, ms, and mom tensors is protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceSparseApplyRMSPropUseLocking(value bool) ResourceSparseApplyRMSPropAttr {
 	return func(m optionalAttr) {
-		m["out_type"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Returns the shape of the variable pointed to by `resource`.
+// Update '*var' according to the RMSProp algorithm.
 //
-// This operation returns a 1-D integer tensor representing the shape of `input`.
+// Note that in the dense implementation of this algorithm, ms and mom will
+// update even if the grad is zero, but in this sparse implementation, ms
+// and mom will not update in iterations during which the grad is zero.
 //
-// For example:
+// mean_square = decay * mean_square + (1-decay) * gradient ** 2
+// Delta = learning_rate * gradient / sqrt(mean_square + epsilon)
 //
-// ```
-// # 't' is [[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]
-// shape(t) ==> [2, 2, 3]
-// ```
-func VariableShape(scope *Scope, input tf.Output, optional ...VariableShapeAttr) (output tf.Output) {
+// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
+// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms + epsilon)
+// var <- var - mom
+//
+// Arguments:
+//	var_: Should be from a Variable().
+//	ms: Should be from a Variable().
+//	mom: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	rho: Decay rate. Must be a scalar.
+//
+//	epsilon: Ridge term. Must be a scalar.
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var, ms and mom.
+//
+// Returns the created operation.
+func ResourceSparseApplyRMSProp(scope *Scope, var_ tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyRMSPropAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -8991,298 +8794,350 @@ func VariableShape(scope *Scope, input tf.Output, optional ...VariableShapeAttr)
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "VariableShape",
+		Type: "ResourceSparseApplyRMSProp",
 		Input: []tf.Input{
-			input,
+			var_, ms, mom, lr, rho, momentum, epsilon, grad, indices,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Fills empty rows in the input 2-D `SparseTensor` with a default value.
-//
-// The input `SparseTensor` is represented via the tuple of inputs
-// (`indices`, `values`, `dense_shape`).  The output `SparseTensor` has the
-// same `dense_shape` but with indices `output_indices` and values
-// `output_values`.
-//
-// This op inserts a single entry for every row that doesn't have any values.
-// The index is created as `[row, 0, ..., 0]` and the inserted value
-// is `default_value`.
-//
-// For example, suppose `sp_input` has shape `[5, 6]` and non-empty values:
-//
-//     [0, 1]: a
-//     [0, 3]: b
-//     [2, 0]: c
-//     [3, 1]: d
-//
-// Rows 1 and 4 are empty, so the output will be of shape `[5, 6]` with values:
-//
-//     [0, 1]: a
-//     [0, 3]: b
-//     [1, 0]: default_value
-//     [2, 0]: c
-//     [3, 1]: d
-//     [4, 0]: default_value
-//
-// The output `SparseTensor` will be in row-major order and will have the
-// same shape as the input.
-//
-// This op also returns an indicator vector shaped `[dense_shape[0]]` such that
-//
-//     empty_row_indicator[i] = True iff row i was an empty row.
-//
-// And a reverse index map vector shaped `[indices.shape[0]]` that is used during
-// backpropagation,
-//
-//     reverse_index_map[j] = out_j s.t. indices[j, :] == output_indices[out_j, :]
-//
-// Arguments:
-//	indices: 2-D. the indices of the sparse tensor.
-//	values: 1-D. the values of the sparse tensor.
-//	dense_shape: 1-D. the shape of the sparse tensor.
-//	default_value: 0-D. default value to insert into location `[row, 0, ..., 0]`
-//   for rows missing from the input sparse tensor.
-// output indices: 2-D. the indices of the filled sparse tensor.
+// Returns the truth value of (x > y) element-wise.
 //
-// Returns 1-D. the values of the filled sparse tensor.1-D. whether the dense row was missing in the
-// input sparse tensor.1-D. a map from the input indices to the output indices.
-func SparseFillEmptyRows(scope *Scope, indices tf.Output, values tf.Output, dense_shape tf.Output, default_value tf.Output) (output_indices tf.Output, output_values tf.Output, empty_row_indicator tf.Output, reverse_index_map tf.Output) {
+// *NOTE*: `Greater` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func Greater(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseFillEmptyRows",
+		Type: "Greater",
 		Input: []tf.Input{
-			indices, values, dense_shape, default_value,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3)
+	return op.Output(0)
 }
 
-// Reverses specific dimensions of a tensor.
-//
-// Given a `tensor`, and a `bool` tensor `dims` representing the dimensions
-// of `tensor`, this operation reverses each dimension i of `tensor` where
-// `dims[i]` is `True`.
-//
-// `tensor` can have up to 8 dimensions. The number of dimensions
-// of `tensor` must equal the number of elements in `dims`. In other words:
+// SampleDistortedBoundingBoxAttr is an optional argument to SampleDistortedBoundingBox.
+type SampleDistortedBoundingBoxAttr func(optionalAttr)
+
+// SampleDistortedBoundingBoxSeed sets the optional seed attribute to value.
 //
-// `rank(tensor) = size(dims)`
+// value: If either `seed` or `seed2` is set to non-zero, the random number
+// generator is seeded by the given `seed`.  Otherwise, it is seeded by a random
+// seed.
+// If not specified, defaults to 0
+func SampleDistortedBoundingBoxSeed(value int64) SampleDistortedBoundingBoxAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
+	}
+}
+
+// SampleDistortedBoundingBoxSeed2 sets the optional seed2 attribute to value.
 //
-// For example:
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func SampleDistortedBoundingBoxSeed2(value int64) SampleDistortedBoundingBoxAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// SampleDistortedBoundingBoxMinObjectCovered sets the optional min_object_covered attribute to value.
 //
-// ```
-// # tensor 't' is [[[[ 0,  1,  2,  3],
-// #                  [ 4,  5,  6,  7],
-// #                  [ 8,  9, 10, 11]],
-// #                 [[12, 13, 14, 15],
-// #                  [16, 17, 18, 19],
-// #                  [20, 21, 22, 23]]]]
-// # tensor 't' shape is [1, 2, 3, 4]
+// value: The cropped area of the image must contain at least this
+// fraction of any bounding box supplied. The value of this parameter should be
+// non-negative. In the case of 0, the cropped area does not need to overlap
+// any of the bounding boxes supplied.
+// If not specified, defaults to 0.1
+func SampleDistortedBoundingBoxMinObjectCovered(value float32) SampleDistortedBoundingBoxAttr {
+	return func(m optionalAttr) {
+		m["min_object_covered"] = value
+	}
+}
+
+// SampleDistortedBoundingBoxAspectRatioRange sets the optional aspect_ratio_range attribute to value.
 //
-// # 'dims' is [False, False, False, True]
-// reverse(t, dims) ==> [[[[ 3,  2,  1,  0],
-//                         [ 7,  6,  5,  4],
-//                         [ 11, 10, 9, 8]],
-//                        [[15, 14, 13, 12],
-//                         [19, 18, 17, 16],
-//                         [23, 22, 21, 20]]]]
+// value: The cropped area of the image must have an aspect ratio =
+// width / height within this range.
+// If not specified, defaults to <f:0.75 f:1.33 >
+func SampleDistortedBoundingBoxAspectRatioRange(value []float32) SampleDistortedBoundingBoxAttr {
+	return func(m optionalAttr) {
+		m["aspect_ratio_range"] = value
+	}
+}
+
+// SampleDistortedBoundingBoxAreaRange sets the optional area_range attribute to value.
 //
-// # 'dims' is [False, True, False, False]
-// reverse(t, dims) ==> [[[[12, 13, 14, 15],
-//                         [16, 17, 18, 19],
-//                         [20, 21, 22, 23]
-//                        [[ 0,  1,  2,  3],
-//                         [ 4,  5,  6,  7],
-//                         [ 8,  9, 10, 11]]]]
+// value: The cropped area of the image must contain a fraction of the
+// supplied image within this range.
+// If not specified, defaults to <f:0.05 f:1 >
+func SampleDistortedBoundingBoxAreaRange(value []float32) SampleDistortedBoundingBoxAttr {
+	return func(m optionalAttr) {
+		m["area_range"] = value
+	}
+}
+
+// SampleDistortedBoundingBoxMaxAttempts sets the optional max_attempts attribute to value.
 //
-// # 'dims' is [False, False, True, False]
-// reverse(t, dims) ==> [[[[8, 9, 10, 11],
-//                         [4, 5, 6, 7],
-//                         [0, 1, 2, 3]]
-//                        [[20, 21, 22, 23],
-//                         [16, 17, 18, 19],
-//                         [12, 13, 14, 15]]]]
+// value: Number of attempts at generating a cropped region of the image
+// of the specified constraints. After `max_attempts` failures, return the entire
+// image.
+// If not specified, defaults to 100
+func SampleDistortedBoundingBoxMaxAttempts(value int64) SampleDistortedBoundingBoxAttr {
+	return func(m optionalAttr) {
+		m["max_attempts"] = value
+	}
+}
+
+// SampleDistortedBoundingBoxUseImageIfNoBoundingBoxes sets the optional use_image_if_no_bounding_boxes attribute to value.
+//
+// value: Controls behavior if no bounding boxes are supplied.
+// If true, assume an implicit bounding box covering the whole input. If false,
+// raise an error.
+// If not specified, defaults to false
+func SampleDistortedBoundingBoxUseImageIfNoBoundingBoxes(value bool) SampleDistortedBoundingBoxAttr {
+	return func(m optionalAttr) {
+		m["use_image_if_no_bounding_boxes"] = value
+	}
+}
+
+// Generate a single randomly distorted bounding box for an image.
+//
+// Bounding box annotations are often supplied in addition to ground-truth labels
+// in image recognition or object localization tasks. A common technique for
+// training such a system is to randomly distort an image while preserving
+// its content, i.e. *data augmentation*. This Op outputs a randomly distorted
+// localization of an object, i.e. bounding box, given an `image_size`,
+// `bounding_boxes` and a series of constraints.
+//
+// The output of this Op is a single bounding box that may be used to crop the
+// original image. The output is returned as 3 tensors: `begin`, `size` and
+// `bboxes`. The first 2 tensors can be fed directly into `tf.slice` to crop the
+// image. The latter may be supplied to `tf.image.draw_bounding_boxes` to visualize
+// what the bounding box looks like.
+//
+// Bounding boxes are supplied and returned as `[y_min, x_min, y_max, x_max]`. The
+// bounding box coordinates are floats in `[0.0, 1.0]` relative to the width and
+// height of the underlying image.
+//
+// For example,
+//
+// ```python
+//     # Generate a single distorted bounding box.
+//     begin, size, bbox_for_draw = tf.image.sample_distorted_bounding_box(
+//         tf.shape(image),
+//         bounding_boxes=bounding_boxes)
+//
+//     # Draw the bounding box in an image summary.
+//     image_with_box = tf.image.draw_bounding_boxes(tf.expand_dims(image, 0),
+//                                                   bbox_for_draw)
+//     tf.summary.image('images_with_box', image_with_box)
+//
+//     # Employ the bounding box to distort the image.
+//     distorted_image = tf.slice(image, begin, size)
 // ```
 //
+// Note that if no bounding box information is available, setting
+// `use_image_if_no_bounding_boxes = true` will assume there is a single implicit
+// bounding box covering the whole image. If `use_image_if_no_bounding_boxes` is
+// false and no bounding boxes are supplied, an error is raised.
+//
 // Arguments:
-//	tensor: Up to 8-D.
-//	dims: 1-D. The dimensions to reverse.
+//	image_size: 1-D, containing `[height, width, channels]`.
+//	bounding_boxes: 3-D with shape `[batch, N, 4]` describing the N bounding boxes
+// associated with the image.
 //
-// Returns The same shape as `tensor`.
-func Reverse(scope *Scope, tensor tf.Output, dims tf.Output) (output tf.Output) {
+// Returns:
+//	begin: 1-D, containing `[offset_height, offset_width, 0]`. Provide as input to
+// `tf.slice`.
+//	size: 1-D, containing `[target_height, target_width, -1]`. Provide as input to
+// `tf.slice`.
+//	bboxes: 3-D with shape `[1, 1, 4]` containing the distorted bounding box.
+// Provide as input to `tf.image.draw_bounding_boxes`.
+func SampleDistortedBoundingBox(scope *Scope, image_size tf.Output, bounding_boxes tf.Output, optional ...SampleDistortedBoundingBoxAttr) (begin tf.Output, size tf.Output, bboxes tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Reverse",
+		Type: "SampleDistortedBoundingBox",
 		Input: []tf.Input{
-			tensor, dims,
+			image_size, bounding_boxes,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Computes log softmax activations.
-//
-// For each batch `i` and class `j` we have
-//
-//     logsoftmax[i, j] = logits[i, j] - log(sum(exp(logits[i])))
+// Returns x / y element-wise for integer types.
 //
-// Arguments:
-//	logits: 2-D with shape `[batch_size, num_classes]`.
+// Truncation means that fractional quotients are rounded toward zero, including
+// for negative numbers, e.g. -7 / 5 = -1. This matches C semantics but differs
+// from Python semantics. See `FloorDiv` for a division function that matches
+// Python semantics.
 //
-// Returns Same shape as `logits`.
-func LogSoftmax(scope *Scope, logits tf.Output) (logsoftmax tf.Output) {
+// *NOTE*: `TruncateDiv` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func TruncateDiv(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "LogSoftmax",
+		Type: "TruncateDiv",
 		Input: []tf.Input{
-			logits,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
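
A minimal standalone sketch (not part of the generated wrappers) of the rounding rule described for `TruncateDiv`, using plain Go integers rather than tensors. The helper names `truncDiv` and `floorDiv` are hypothetical; the sketch relies only on the fact that Go's integer `/` rounds toward zero, as C does.

```go
package main

import "fmt"

// truncDiv rounds the quotient toward zero. Go's integer division already
// behaves this way, so this is the scalar analogue of TruncateDiv.
func truncDiv(x, y int) int {
	return x / y
}

// floorDiv rounds the quotient toward negative infinity, which is the
// Python-style behavior that the comment attributes to FloorDiv.
func floorDiv(x, y int) int {
	q := x / y
	// Adjust only when there is a remainder and the operands differ in sign.
	if x%y != 0 && (x < 0) != (y < 0) {
		q--
	}
	return q
}

func main() {
	fmt.Println(truncDiv(-7, 5), floorDiv(-7, 5)) // prints "-1 -2"
}
```

The two functions agree whenever the operands share a sign or the division is exact; they differ by one otherwise.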
 
-// Computes the inverse permutation of a tensor.
-//
-// This operation computes the inverse of an index permutation. It takes a 1-D
-// integer tensor `x`, which represents the indices of a zero-based array, and
-// swaps each value with its index position. In other words, for an output tensor
-// `y` and an input tensor `x`, this operation computes the following:
-//
-// `y[x[i]] = i for i in [0, 1, ..., len(x) - 1]`
+// Restores tensors from a V2 checkpoint.
 //
-// The values must include 0. There can be no duplicate values or negative values.
+// For backward compatibility with the V1 format, this Op currently allows
+// restoring from a V1 checkpoint as well:
+//   - This Op first attempts to find the V2 index file pointed to by "prefix" and,
+//     if found, proceeds to read it as a V2 checkpoint;
+//   - Otherwise the V1 read path is invoked.
+// Relying on this behavior is not recommended, as the ability to fall back to read
+// V1 might be deprecated and eventually removed.
 //
-// For example:
+// By default, restores the named tensors in full.  If the caller wishes to restore
+// specific slices of stored tensors, "shape_and_slices" should be non-empty
+// strings and correspondingly well-formed.
 //
-// ```
-// # tensor `x` is [3, 4, 0, 2, 1]
-// invert_permutation(x) ==> [2, 4, 3, 0, 1]
-// ```
+// Callers must ensure all the named tensors are indeed stored in the checkpoint.
 //
 // Arguments:
-//	x: 1-D.
+//	prefix: Must have a single element.  The prefix of a V2 checkpoint.
+//	tensor_names: shape {N}.  The names of the tensors to be restored.
+//	shape_and_slices: shape {N}.  The slice specs of the tensors to be restored.
+// Empty strings indicate that they are non-partitioned tensors.
+//	dtypes: shape {N}.  The list of expected dtypes for the tensors.  Must match
+// those stored in the checkpoint.
 //
-// Returns 1-D.
-func InvertPermutation(scope *Scope, x tf.Output) (y tf.Output) {
+// Returns shape {N}.  The restored tensors, whose shapes are read from the
+// checkpoint directly.
+func RestoreV2(scope *Scope, prefix tf.Output, tensor_names tf.Output, shape_and_slices tf.Output, dtypes []tf.DataType) (tensors []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	opspec := tf.OpSpec{
-		Type: "InvertPermutation",
+		Type: "RestoreV2",
 		Input: []tf.Input{
-			x,
+			prefix, tensor_names, shape_and_slices,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if tensors, idx, err = makeOutputList(op, idx, "tensors"); err != nil {
+		scope.UpdateErr("RestoreV2", err)
+		return
+	}
+	return tensors
 }
 
-// Gradient op for `MirrorPad` op. This op folds a mirror-padded tensor.
-//
-// This operation folds the padded areas of `input` by `MirrorPad` according to the
-// `paddings` you specify. `paddings` must be the same as `paddings` argument
-// given to the corresponding `MirrorPad` op.
+// Computes the maximum along segments of a tensor.
 //
-// The folded size of each dimension D of the output is:
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
 //
-// `input.dim_size(D) - paddings(D, 0) - paddings(D, 1)`
+// Computes a tensor such that
+// \\(output_i = \max_j(data_j)\\) where `max` is over `j` such
+// that `segment_ids[j] == i`.
 //
-// For example:
+// If the max is empty for a given segment ID `i`, `output[i] = 0`.
 //
-// ```
-// # 't' is [[1, 2, 3], [4, 5, 6], [7, 8, 9]].
-// # 'paddings' is [[0, 1]], [0, 1]].
-// # 'mode' is SYMMETRIC.
-// # rank of 't' is 2.
-// pad(t, paddings) ==> [[ 1,  5]
-//                       [11, 28]]
-// ```
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentMax.png" alt>
+// </div>
 //
 // Arguments:
-//	input: The input tensor to be folded.
-//	paddings: A two-column matrix specifying the padding sizes. The number of
-// rows must be the same as the rank of `input`.
-//	mode: The mode used in the `MirrorPad` op.
 //
-// Returns The folded tensor.
-func MirrorPadGrad(scope *Scope, input tf.Output, paddings tf.Output, mode string) (output tf.Output) {
+//	segment_ids: A 1-D tensor whose size is equal to the size of `data`'s
+// first dimension.  Values should be sorted and can be repeated.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `k`, the number of segments.
+func SegmentMax(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"mode": mode}
 	opspec := tf.OpSpec{
-		Type: "MirrorPadGrad",
+		Type: "SegmentMax",
 		Input: []tf.Input{
-			input, paddings,
+			data, segment_ids,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
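
A minimal standalone sketch (not part of the generated wrappers) of the segment-max rule above, restricted to 1-D `data` in plain Go. The function name `segmentMax` is hypothetical, and the sketch assumes that `segmentIDs` is sorted and non-negative and that the number of segments is the last ID plus one.

```go
package main

import "fmt"

// segmentMax returns out where out[i] is the maximum of data[j] over all j
// with segmentIDs[j] == i. Segments with no members keep the value 0,
// as the op comment states.
func segmentMax(data []float64, segmentIDs []int) []float64 {
	if len(data) == 0 {
		return nil
	}
	k := segmentIDs[len(segmentIDs)-1] + 1 // assumes sorted IDs
	out := make([]float64, k)
	seen := make([]bool, k)
	for j, v := range data {
		i := segmentIDs[j]
		if !seen[i] || v > out[i] {
			out[i] = v
			seen[i] = true
		}
	}
	return out
}

func main() {
	// Segment 2 has no members, so its output stays 0.
	fmt.Println(segmentMax([]float64{1, 3, 2, 5, 4}, []int{0, 0, 1, 3, 3})) // prints "[3 2 0 5]"
}
```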
 
-// Computes softmax cross entropy cost and gradients to backpropagate.
+// Creates a dataset that skips `count` elements from the `input_dataset`.
 //
-// Unlike `SoftmaxCrossEntropyWithLogits`, this operation does not accept
-// a matrix of label probabilities, but rather a single label per row
-// of features.  This label is considered to have probability 1.0 for the
-// given row.
+// Arguments:
 //
-// Inputs are the logits, not probabilities.
+//	count: A scalar representing the number of elements from the `input_dataset`
+// that should be skipped.  If count is -1, skips everything.
 //
-// Arguments:
-//	features: batch_size x num_classes matrix
-//	labels: batch_size vector with values in [0, num_classes).
-// This is the label for the given minibatch entry.
 //
-// Returns Per example loss (batch_size vector).backpropagated gradients (batch_size x num_classes matrix).
-func SparseSoftmaxCrossEntropyWithLogits(scope *Scope, features tf.Output, labels tf.Output) (loss tf.Output, backprop tf.Output) {
+func SkipDataset(scope *Scope, input_dataset tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "SparseSoftmaxCrossEntropyWithLogits",
+		Type: "SkipDataset",
 		Input: []tf.Input{
-			features, labels,
+			input_dataset, count,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Fast Fourier transform.
+// Computes hyperbolic tangent of `x` element-wise.
+func Tanh(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Tanh",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Decode web-safe base64-encoded strings.
 //
-// Computes the 1-dimensional discrete Fourier transform over the inner-most
-// dimension of `input`.
+// Input may or may not have padding at the end. See EncodeBase64 for padding.
+// Web-safe means that input must use - and _ instead of + and /.
 //
 // Arguments:
-//	input: A complex64 tensor.
-//
-// Returns A complex64 tensor of the same shape as `input`. The inner-most
-//   dimension of `input` is replaced with its 1D Fourier transform.
+//	input: Base64 strings to decode.
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.fft
-// @end_compatibility
-func FFT(scope *Scope, input tf.Output) (output tf.Output) {
+// Returns Decoded strings.
+func DecodeBase64(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "FFT",
+		Type: "DecodeBase64",
 		Input: []tf.Input{
 			input,
 		},
@@ -9291,35 +9146,60 @@ func FFT(scope *Scope, input tf.Output) (output tf.Output) {
 	return op.Output(0)
 }
 
-// ResourceSparseApplyAdagradDAAttr is an optional argument to ResourceSparseApplyAdagradDA.
-type ResourceSparseApplyAdagradDAAttr func(optionalAttr)
+// Store the input tensor in the state of the current session.
+//
+// Arguments:
+//	value: The tensor to be stored.
+//
+// Returns The handle for the tensor stored in the session state, represented
+// as a string.
+func GetSessionHandle(scope *Scope, value tf.Output) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "GetSessionHandle",
+		Input: []tf.Input{
+			value,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
 
-// ResourceSparseApplyAdagradDAUseLocking sets the optional use_locking attribute to value.
+// ResourceSparseApplyProximalAdagradAttr is an optional argument to ResourceSparseApplyProximalAdagrad.
+type ResourceSparseApplyProximalAdagradAttr func(optionalAttr)
+
+// ResourceSparseApplyProximalAdagradUseLocking sets the optional use_locking attribute to value.
 //
 // value: If True, updating of the var and accum tensors will be protected by
 // a lock; otherwise the behavior is undefined, but may exhibit less contention.
 // If not specified, defaults to false
-func ResourceSparseApplyAdagradDAUseLocking(value bool) ResourceSparseApplyAdagradDAAttr {
+func ResourceSparseApplyProximalAdagradUseLocking(value bool) ResourceSparseApplyProximalAdagradAttr {
 	return func(m optionalAttr) {
 		m["use_locking"] = value
 	}
 }
 
-// Update entries in '*var' and '*accum' according to the proximal adagrad scheme.
+// Sparsely updates entries in '*var' and '*accum' according to the FOBOS algorithm.
+//
+// That is, for rows we have grad for, we update var and accum as follows:
+// accum += grad * grad
+// prox_v = var
+// prox_v -= lr * grad * (1 / sqrt(accum))
+// var = sign(prox_v)/(1+lr*l2) * max{|prox_v|-lr*l1,0}
 //
 // Arguments:
 //	var_: Should be from a Variable().
-//	gradient_accumulator: Should be from a Variable().
-//	gradient_squared_accumulator: Should be from a Variable().
-//	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var and accum.
+//	accum: Should be from a Variable().
 //	lr: Learning rate. Must be a scalar.
 //	l1: L1 regularization. Must be a scalar.
 //	l2: L2 regularization. Must be a scalar.
-//	global_step: Training step number. Must be a scalar.
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var and accum.
 //
 // Returns the created operation.
-func ResourceSparseApplyAdagradDA(scope *Scope, var_ tf.Output, gradient_accumulator tf.Output, gradient_squared_accumulator tf.Output, grad tf.Output, indices tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, global_step tf.Output, optional ...ResourceSparseApplyAdagradDAAttr) (o *tf.Operation) {
+func ResourceSparseApplyProximalAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyProximalAdagradAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -9328,22 +9208,22 @@ func ResourceSparseApplyAdagradDA(scope *Scope, var_ tf.Output, gradient_accumul
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyAdagradDA",
+		Type: "ResourceSparseApplyProximalAdagrad",
 		Input: []tf.Input{
-			var_, gradient_accumulator, gradient_squared_accumulator, grad, indices, lr, l1, l2, global_step,
+			var_, accum, lr, l1, l2, grad, indices,
 		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Returns the truth value of NOT x element-wise.
-func LogicalNot(scope *Scope, x tf.Output) (y tf.Output) {
+// Returns element-wise largest integer not greater than x.
+func Floor(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "LogicalNot",
+		Type: "Floor",
 		Input: []tf.Input{
 			x,
 		},
@@ -9352,143 +9232,183 @@ func LogicalNot(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
-// 3D real-valued fast Fourier transform.
-//
-// Computes the 3-dimensional discrete Fourier transform of a real-valued signal
-// over the inner-most 3 dimensions of `input`.
-//
-// Since the DFT of a real signal is Hermitian-symmetric, `RFFT3D` only returns the
-// `fft_length / 2 + 1` unique components of the FFT for the inner-most dimension
-// of `output`: the zero-frequency term, followed by the `fft_length / 2`
-// positive-frequency terms.
-//
-// Along each axis `RFFT3D` is computed on, if `fft_length` is smaller than the
-// corresponding dimension of `input`, the dimension is cropped. If it is larger,
-// the dimension is padded with zeros.
-//
-// Arguments:
-//	input: A float32 tensor.
-//	fft_length: An int32 tensor of shape [3]. The FFT length for each dimension.
-//
-// Returns A complex64 tensor of the same rank as `input`. The inner-most 3
-//   dimensions of `input` are replaced with the their 3D Fourier transform. The
-//   inner-most dimension contains `fft_length / 2 + 1` unique frequency
-//   components.
-//
-// @compatibility(numpy)
-// Equivalent to np.fft.rfftn with 3 dimensions.
-// @end_compatibility
-func RFFT3D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
+// Computes the Gauss error function of `x` element-wise.
+func Erf(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "RFFT3D",
+		Type: "Erf",
 		Input: []tf.Input{
-			input, fft_length,
+			x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TensorArrayV3Attr is an optional argument to TensorArrayV3.
-type TensorArrayV3Attr func(optionalAttr)
+// OneHotAttr is an optional argument to OneHot.
+type OneHotAttr func(optionalAttr)
 
-// TensorArrayV3ElementShape sets the optional element_shape attribute to value.
+// OneHotAxis sets the optional axis attribute to value.
 //
-// value: The expected shape of an element, if known. Used to
-// validate the shapes of TensorArray elements. If this shape is not
-// fully specified, gathering zero-size TensorArrays is an error.
-// If not specified, defaults to <unknown_rank:true >
-func TensorArrayV3ElementShape(value tf.Shape) TensorArrayV3Attr {
+// value: The axis to fill (default: -1, a new inner-most axis).
+// If not specified, defaults to -1
+func OneHotAxis(value int64) OneHotAttr {
 	return func(m optionalAttr) {
-		m["element_shape"] = value
+		m["axis"] = value
 	}
 }
 
-// TensorArrayV3DynamicSize sets the optional dynamic_size attribute to value.
+// Returns a one-hot tensor.
 //
-// value: A boolean that determines whether writes to the TensorArray
-// are allowed to grow the size.  By default, this is not allowed.
-// If not specified, defaults to false
-func TensorArrayV3DynamicSize(value bool) TensorArrayV3Attr {
-	return func(m optionalAttr) {
-		m["dynamic_size"] = value
-	}
-}
-
-// TensorArrayV3ClearAfterRead sets the optional clear_after_read attribute to value.
+// The locations represented by indices in `indices` take value `on_value`,
+// while all other locations take value `off_value`.
 //
-// value: If true (default), Tensors in the TensorArray are cleared
-// after being read.  This disables multiple read semantics but allows early
-// release of memory.
-// If not specified, defaults to true
-func TensorArrayV3ClearAfterRead(value bool) TensorArrayV3Attr {
-	return func(m optionalAttr) {
-		m["clear_after_read"] = value
-	}
-}
-
-// TensorArrayV3IdenticalElementShapes sets the optional identical_element_shapes attribute to value.
+// If the input `indices` is rank `N`, the output will have rank `N+1`.
+// The new axis is created at dimension `axis` (default: the new axis is
+// appended at the end).
 //
-// value: If true (default is false), then all
-// elements in the TensorArray will be expected to have have identical shapes.
-// This allows certain behaviors, like dynamically checking for
-// consistent shapes on write, and being able to fill in properly
-// shaped zero tensors on stack -- even if the element_shape attribute
-// is not fully defined.
-// If not specified, defaults to false
-func TensorArrayV3IdenticalElementShapes(value bool) TensorArrayV3Attr {
-	return func(m optionalAttr) {
-		m["identical_element_shapes"] = value
-	}
-}
-
-// TensorArrayV3TensorArrayName sets the optional tensor_array_name attribute to value.
+// If `indices` is a scalar the output shape will be a vector of length `depth`.
 //
-// value: Overrides the name used for the temporary tensor_array
-// resource. Default value is the name of the 'TensorArray' op (which
-// is guaranteed unique).
-// If not specified, defaults to ""
-func TensorArrayV3TensorArrayName(value string) TensorArrayV3Attr {
-	return func(m optionalAttr) {
-		m["tensor_array_name"] = value
-	}
-}
-
-// An array of Tensors of given size.
+// If `indices` is a vector of length `features`, the output shape will be:
+// ```
+//   features x depth if axis == -1
+//   depth x features if axis == 0
+// ```
 //
-// Write data via Write and read via Read or Pack.
+// If `indices` is a matrix (batch) with shape `[batch, features]`,
+// the output shape will be:
+// ```
+//   batch x features x depth if axis == -1
+//   batch x depth x features if axis == 1
+//   depth x batch x features if axis == 0
+// ```
 //
-// Arguments:
-//	size: The size of the array.
-//	dtype: The type of the elements on the tensor_array.
 //
-// Returns The handle to the TensorArray.A scalar used to control gradient flow.
-func TensorArrayV3(scope *Scope, size tf.Output, dtype tf.DataType, optional ...TensorArrayV3Attr) (handle tf.Output, flow tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"dtype": dtype}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "TensorArrayV3",
-		Input: []tf.Input{
-			size,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
-}
+// Examples
+// =========
+//
+// Suppose that
+//
+// ```
+//   indices = [0, 2, -1, 1]
+//   depth = 3
+//   on_value = 5.0
+//   off_value = 0.0
+//   axis = -1
+// ```
+//
+// Then output is `[4 x 3]`:
+//
+//     ```output =
+//       [5.0 0.0 0.0]  // one_hot(0)
+//       [0.0 0.0 5.0]  // one_hot(2)
+//       [0.0 0.0 0.0]  // one_hot(-1)
+//       [0.0 5.0 0.0]  // one_hot(1)
+//     ```
+//
+// Suppose that
+//
+// ```
+//   indices = [0, 2, -1, 1]
+//   depth = 3
+//   on_value = 0.0
+//   off_value = 3.0
+//   axis = 0
+// ```
+//
+// Then output is `[3 x 4]`:
+//
+//     ```output =
+//       [0.0 3.0 3.0 3.0]
+//       [3.0 3.0 3.0 0.0]
+//       [3.0 0.0 3.0 3.0]
+//     //  ^                one_hot(0)
+//     //      ^            one_hot(2)
+//     //          ^        one_hot(-1)
+//     //              ^    one_hot(1)
+//     ```
+//
+// Suppose that
+//
+// ```
+//   indices = [[0, 2], [1, -1]]
+//   depth = 3
+//   on_value = 1.0
+//   off_value = 0.0
+//   axis = -1
+// ```
+//
+// Then output is `[2 x 2 x 3]`:
+//
+//     ```output =
+//       [
+//         [1.0, 0.0, 0.0]  // one_hot(0)
+//         [0.0, 0.0, 1.0]  // one_hot(2)
+//       ][
+//         [0.0, 1.0, 0.0]  // one_hot(1)
+//         [0.0, 0.0, 0.0]  // one_hot(-1)
+//       ]```
+//
+// Arguments:
+//	indices: A tensor of indices.
+//	depth: A scalar defining the depth of the one hot dimension.
+//	on_value: A scalar defining the value to fill in output when `indices[j] = i`.
+//	off_value: A scalar defining the value to fill in output when `indices[j] != i`.
+//
+// Returns The one-hot tensor.
+func OneHot(scope *Scope, indices tf.Output, depth tf.Output, on_value tf.Output, off_value tf.Output, optional ...OneHotAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "OneHot",
+		Input: []tf.Input{
+			indices, depth, on_value, off_value,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
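
A minimal standalone sketch (not part of the generated wrappers) of the one-hot rule for the simplest case described above: a vector of `indices` with `axis == -1`, giving a `features x depth` result. The function name `oneHot` is hypothetical and plain Go slices stand in for tensors.

```go
package main

import "fmt"

// oneHot returns a len(indices) x depth matrix. Row r holds onValue at column
// indices[r] and offValue elsewhere. An index outside [0, depth) yields a row
// filled entirely with offValue, as in the one_hot(-1) rows of the examples.
func oneHot(indices []int, depth int, onValue, offValue float64) [][]float64 {
	out := make([][]float64, len(indices))
	for r, idx := range indices {
		row := make([]float64, depth)
		for c := range row {
			row[c] = offValue
		}
		if idx >= 0 && idx < depth {
			row[idx] = onValue
		}
		out[r] = row
	}
	return out
}

func main() {
	// Mirrors the first example: indices = [0, 2, -1, 1], depth = 3.
	fmt.Println(oneHot([]int{0, 2, -1, 1}, 3, 5.0, 0.0)) // prints "[[5 0 0] [0 0 5] [0 0 0] [0 5 0]]"
}
```

Transposing this result gives the `depth x features` layout that the comment describes for `axis == 0`.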
 
-// MaxPool3DAttr is an optional argument to MaxPool3D.
-type MaxPool3DAttr func(optionalAttr)
+// Reads the value of a variable.
+//
+// The tensor returned by this operation is immutable.
+//
+// The value returned by this operation is guaranteed to be influenced by all the
+// writes on which this operation depends directly or indirectly, and to not be
+// influenced by any of the writes which depend directly or indirectly on this
+// operation.
+//
+// Arguments:
+//	resource: handle to the resource in which to store the variable.
+//	dtype: the dtype of the value.
+func ReadVariableOp(scope *Scope, resource tf.Output, dtype tf.DataType) (value tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"dtype": dtype}
+	opspec := tf.OpSpec{
+		Type: "ReadVariableOp",
+		Input: []tf.Input{
+			resource,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
 
-// MaxPool3DDataFormat sets the optional data_format attribute to value.
+// MaxPool3DGradAttr is an optional argument to MaxPool3DGrad.
+type MaxPool3DGradAttr func(optionalAttr)
+
+// MaxPool3DGradDataFormat sets the optional data_format attribute to value.
 //
 // value: The data format of the input and output data. With the
 // default format "NDHWC", the data is stored in the order of:
@@ -9496,24 +9416,24 @@ type MaxPool3DAttr func(optionalAttr)
 // Alternatively, the format could be "NCDHW", the data storage order is:
 //     [batch, in_channels, in_depth, in_height, in_width].
 // If not specified, defaults to "NDHWC"
-func MaxPool3DDataFormat(value string) MaxPool3DAttr {
+func MaxPool3DGradDataFormat(value string) MaxPool3DGradAttr {
 	return func(m optionalAttr) {
 		m["data_format"] = value
 	}
 }
 
-// Performs 3D max pooling on the input.
+// Computes gradients of max pooling function.
 //
 // Arguments:
-//	input: Shape `[batch, depth, rows, cols, channels]` tensor to pool over.
+//	orig_input: The original input tensor.
+//	orig_output: The original output tensor.
+//	grad: Output backprop of shape `[batch, depth, rows, cols, channels]`.
 //	ksize: 1-D tensor of length 5. The size of the window for each dimension of
 // the input tensor. Must have `ksize[0] = ksize[4] = 1`.
 //	strides: 1-D tensor of length 5. The stride of the sliding window for each
 // dimension of `input`. Must have `strides[0] = strides[4] = 1`.
 //	padding: The type of padding algorithm to use.
-//
-// Returns The max pooled output tensor.
-func MaxPool3D(scope *Scope, input tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPool3DAttr) (output tf.Output) {
+func MaxPool3DGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPool3DGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -9522,9 +9442,9 @@ func MaxPool3D(scope *Scope, input tf.Output, ksize []int64, strides []int64, pa
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MaxPool3D",
+		Type: "MaxPool3DGrad",
 		Input: []tf.Input{
-			input,
+			orig_input, orig_output, grad,
 		},
 		Attrs: attrs,
 	}
@@ -9532,28 +9452,54 @@ func MaxPool3D(scope *Scope, input tf.Output, ksize []int64, strides []int64, pa
 	return op.Output(0)
 }
 
-// Computes the gradients of 3-D convolution with respect to the input.
+// SparseReduceSumAttr is an optional argument to SparseReduceSum.
+type SparseReduceSumAttr func(optionalAttr)
+
+// SparseReduceSumKeepDims sets the optional keep_dims attribute to value.
 //
-// DEPRECATED at GraphDef version 10: Use Conv3DBackpropInputV2
+// value: If true, retain reduced dimensions with length 1.
+// If not specified, defaults to false
+func SparseReduceSumKeepDims(value bool) SparseReduceSumAttr {
+	return func(m optionalAttr) {
+		m["keep_dims"] = value
+	}
+}
+
+// Computes the sum of elements across dimensions of a SparseTensor.
+//
+// This Op takes a SparseTensor and is the sparse counterpart to
+// `tf.reduce_sum()`.  In particular, this Op also returns a dense `Tensor`
+// instead of a sparse one.
+//
+// Reduces `sp_input` along the dimensions given in `reduction_axes`.  Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `reduction_axes`. If `keep_dims` is true, the reduced dimensions are retained
+// with length 1.
+//
+// If `reduction_axes` has no entries, all dimensions are reduced, and a tensor
+// with a single element is returned.  Additionally, the axes can be negative,
+// which are interpreted according to the indexing rules in Python.
 //
 // Arguments:
-//	input: Shape `[batch, depth, rows, cols, in_channels]`.
-//	filter: Shape `[depth, rows, cols, in_channels, out_channels]`.
-// `in_channels` must match between `input` and `filter`.
-//	out_backprop: Backprop signal of shape `[batch, out_depth, out_rows, out_cols,
-// out_channels]`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
-func Conv3DBackpropInput(scope *Scope, input tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string) (output tf.Output) {
+//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, possibly not in canonical ordering.
+//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
+//	input_shape: 1-D.  Shape of the input SparseTensor.
+//	reduction_axes: 1-D.  Length-`K` vector containing the reduction axes.
+//
+// Returns `R-K`-D.  The reduced Tensor.
+func SparseReduceSum(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output, reduction_axes tf.Output, optional ...SparseReduceSumAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Conv3DBackpropInput",
+		Type: "SparseReduceSum",
 		Input: []tf.Input{
-			input, filter, out_backprop,
+			input_indices, input_values, input_shape, reduction_axes,
 		},
 		Attrs: attrs,
 	}
@@ -9561,123 +9507,166 @@ func Conv3DBackpropInput(scope *Scope, input tf.Output, filter tf.Output, out_ba
 	return op.Output(0)
 }
 
-// Inverse 2D fast Fourier transform.
-//
-// Computes the inverse 2-dimensional discrete Fourier transform over the
-// inner-most 2 dimensions of `input`.
-//
-// Arguments:
-//	input: A complex64 tensor.
+// Returns element-wise remainder of division. This emulates C semantics in that
 //
-// Returns A complex64 tensor of the same shape as `input`. The inner-most 2
-//   dimensions of `input` are replaced with their inverse 2D Fourier transform.
+// the result here is consistent with a truncating divide. E.g. `truncate(x / y) *
+// y + truncate_mod(x, y) = x`.
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.ifft2
-// @end_compatibility
-func IFFT2D(scope *Scope, input tf.Output) (output tf.Output) {
+// *NOTE*: `TruncateMod` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func TruncateMod(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "IFFT2D",
+		Type: "TruncateMod",
 		Input: []tf.Input{
-			input,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a tensor filled with a scalar value.
+// Inverse 2D real-valued fast Fourier transform.
 //
-// This operation creates a tensor of shape `dims` and fills it with `value`.
+// Computes the inverse 2-dimensional discrete Fourier transform of a real-valued
+// signal over the inner-most 2 dimensions of `input`.
 //
-// For example:
+// The inner-most 2 dimensions of `input` are assumed to be the result of `RFFT2D`:
+// The inner-most dimension contains the `fft_length / 2 + 1` unique components of
+// the DFT of a real-valued signal. If `fft_length` is not provided, it is computed
+// from the size of the inner-most 2 dimensions of `input`. If the FFT length used
+// to compute `input` is odd, it should be provided since it cannot be inferred
+// properly.
 //
-// ```
-// # Output tensor has shape [2, 3].
-// fill([2, 3], 9) ==> [[9, 9, 9]
-//                      [9, 9, 9]]
-// ```
+// Along each axis `IRFFT2D` is computed on, if `fft_length` (or
+// `fft_length / 2 + 1` for the inner-most dimension) is smaller than the
+// corresponding dimension of `input`, the dimension is cropped. If it is larger,
+// the dimension is padded with zeros.
 //
 // Arguments:
-//	dims: 1-D. Represents the shape of the output tensor.
-//	value: 0-D (scalar). Value to fill the returned tensor.
+//	input: A complex64 tensor.
+//	fft_length: An int32 tensor of shape [2]. The FFT length for each dimension.
+//
+// Returns A float32 tensor of the same rank as `input`. The inner-most 2
+//   dimensions of `input` are replaced with the `fft_length` samples of their
+//   inverse 2D Fourier transform.
 //
 // @compatibility(numpy)
-// Equivalent to np.full
+// Equivalent to np.fft.irfft2
 // @end_compatibility
-func Fill(scope *Scope, dims tf.Output, value tf.Output) (output tf.Output) {
+func IRFFT2D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Fill",
+		Type: "IRFFT2D",
 		Input: []tf.Input{
-			dims, value,
+			input, fft_length,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// 2D fast Fourier transform.
-//
-// Computes the 2-dimensional discrete Fourier transform over the inner-most
-// 2 dimensions of `input`.
+// DecodeJpegAttr is an optional argument to DecodeJpeg.
+type DecodeJpegAttr func(optionalAttr)
+
+// DecodeJpegChannels sets the optional channels attribute to value.
 //
-// Arguments:
-//	input: A complex64 tensor.
+// value: Number of color channels for the decoded image.
+// If not specified, defaults to 0
+func DecodeJpegChannels(value int64) DecodeJpegAttr {
+	return func(m optionalAttr) {
+		m["channels"] = value
+	}
+}
+
+// DecodeJpegRatio sets the optional ratio attribute to value.
 //
-// Returns A complex64 tensor of the same shape as `input`. The inner-most 2
-//   dimensions of `input` are replaced with their 2D Fourier transform.
+// value: Downscaling ratio.
+// If not specified, defaults to 1
+func DecodeJpegRatio(value int64) DecodeJpegAttr {
+	return func(m optionalAttr) {
+		m["ratio"] = value
+	}
+}
+
+// DecodeJpegFancyUpscaling sets the optional fancy_upscaling attribute to value.
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.fft2
-// @end_compatibility
-func FFT2D(scope *Scope, input tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
+// value: If true, use a slower but nicer upscaling of the
+// chroma planes (yuv420/422 only).
+// If not specified, defaults to true
+func DecodeJpegFancyUpscaling(value bool) DecodeJpegAttr {
+	return func(m optionalAttr) {
+		m["fancy_upscaling"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "FFT2D",
-		Input: []tf.Input{
-			input,
-		},
+}
+
+// DecodeJpegTryRecoverTruncated sets the optional try_recover_truncated attribute to value.
+//
+// value: If true, try to recover an image from truncated input.
+// If not specified, defaults to false
+func DecodeJpegTryRecoverTruncated(value bool) DecodeJpegAttr {
+	return func(m optionalAttr) {
+		m["try_recover_truncated"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// ResourceApplyProximalGradientDescentAttr is an optional argument to ResourceApplyProximalGradientDescent.
-type ResourceApplyProximalGradientDescentAttr func(optionalAttr)
+// DecodeJpegAcceptableFraction sets the optional acceptable_fraction attribute to value.
+//
+// value: The minimum required fraction of lines before a truncated
+// input is accepted.
+// If not specified, defaults to 1
+func DecodeJpegAcceptableFraction(value float32) DecodeJpegAttr {
+	return func(m optionalAttr) {
+		m["acceptable_fraction"] = value
+	}
+}
 
-// ResourceApplyProximalGradientDescentUseLocking sets the optional use_locking attribute to value.
+// DecodeJpegDctMethod sets the optional dct_method attribute to value.
 //
-// value: If True, the subtraction will be protected by a lock;
-// otherwise the behavior is undefined, but may exhibit less contention.
-// If not specified, defaults to false
-func ResourceApplyProximalGradientDescentUseLocking(value bool) ResourceApplyProximalGradientDescentAttr {
+// value: string specifying a hint about the algorithm used for
+// decompression.  Defaults to "" which maps to a system-specific
+// default.  Currently valid values are ["INTEGER_FAST",
+// "INTEGER_ACCURATE"].  The hint may be ignored (e.g., if the internal
+// JPEG library changes to a version that does not have that specific
+// option).
+// If not specified, defaults to ""
+func DecodeJpegDctMethod(value string) DecodeJpegAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["dct_method"] = value
 	}
 }
 
-// Update '*var' as FOBOS algorithm with fixed learning rate.
+// Decode a JPEG-encoded image to a uint8 tensor.
 //
-// prox_v = var - alpha * delta
-// var = sign(prox_v)/(1+alpha*l2) * max{|prox_v|-alpha*l1,0}
+// The attr `channels` indicates the desired number of color channels for the
+// decoded image.
+//
+// Accepted values are:
+//
+// *   0: Use the number of channels in the JPEG-encoded image.
+// *   1: Output a grayscale image.
+// *   3: Output an RGB image.
+//
+// If needed, the JPEG-encoded image is transformed to match the requested number
+// of color channels.
+//
+// The attr `ratio` allows downscaling the image by an integer factor during
+// decoding.  Allowed values are: 1, 2, 4, and 8.  This is much faster than
+// downscaling the image later.
+//
+//
+// This op also supports decoding PNGs and non-animated GIFs since the interface is
+// the same, though it is cleaner to use `tf.image.decode_image`.
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	alpha: Scaling factor. Must be a scalar.
-//	l1: L1 regularization. Must be a scalar.
-//	l2: L2 regularization. Must be a scalar.
-//	delta: The change.
+//	contents: 0-D.  The JPEG-encoded image.
 //
-// Returns the created operation.
-func ResourceApplyProximalGradientDescent(scope *Scope, var_ tf.Output, alpha tf.Output, l1 tf.Output, l2 tf.Output, delta tf.Output, optional ...ResourceApplyProximalGradientDescentAttr) (o *tf.Operation) {
+// Returns 3-D with shape `[height, width, channels]`.
+func DecodeJpeg(scope *Scope, contents tf.Output, optional ...DecodeJpegAttr) (image tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -9686,69 +9675,130 @@ func ResourceApplyProximalGradientDescent(scope *Scope, var_ tf.Output, alpha tf
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyProximalGradientDescent",
+		Type: "DecodeJpeg",
 		Input: []tf.Input{
-			var_, alpha, l1, l2, delta,
+			contents,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
-}
-
-// Computes the gradient for the sqrt of `x` wrt its input.
-//
-// Specifically, `grad = dy * 0.5 / y`, where `y = sqrt(x)`, and `dy`
-// is the corresponding input gradient.
-func SqrtGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SqrtGrad",
-		Input: []tf.Input{
-			y, dy,
-		},
-	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Get the value of the tensor specified by its handle.
+// Transforms a vector of brain.Example protos (as strings) into typed tensors.
 //
 // Arguments:
-//	handle: The handle for a tensor stored in the session state.
-//	dtype: The type of the output value.
-//
-// Returns The tensor for the given handle.
-func GetSessionTensor(scope *Scope, handle tf.Output, dtype tf.DataType) (value tf.Output) {
+//	serialized: A vector containing a batch of binary serialized Example protos.
+//	names: A vector containing the names of the serialized protos.
+// May contain, for example, table key (descriptive) names for the
+// corresponding serialized protos.  These are useful purely for debugging
+// purposes, and the presence of values here has no effect on the output.
+// May also be an empty vector if no names are available.
+// If non-empty, this vector must be the same length as "serialized".
+//	sparse_keys: A list of Nsparse string Tensors (scalars).
+// The keys expected in the Examples' features associated with sparse values.
+//	dense_keys: A list of Ndense string Tensors (scalars).
+// The keys expected in the Examples' features associated with dense values.
+//	dense_defaults: A list of Ndense Tensors (some may be empty).
+// dense_defaults[j] provides default values
+// when the example's feature_map lacks dense_keys[j].  If an empty Tensor is
+// provided for dense_defaults[j], then the Feature dense_keys[j] is required.
+// The input type is inferred from dense_defaults[j], even when it's empty.
+// If dense_defaults[j] is not empty, and dense_shapes[j] is fully defined,
+// then the shape of dense_defaults[j] must match that of dense_shapes[j].
+// If dense_shapes[j] has an undefined major dimension (variable strides dense
+// feature), dense_defaults[j] must contain a single element:
+// the padding element.
+//	sparse_types: A list of Nsparse types; the data types of data in each Feature
+// given in sparse_keys.
+// Currently ParseExample supports DT_FLOAT (FloatList),
+// DT_INT64 (Int64List), and DT_STRING (BytesList).
+//	dense_shapes: A list of Ndense shapes; the shapes of data in each Feature
+// given in dense_keys.
+// The number of elements in the Feature corresponding to dense_keys[j]
+// must always equal dense_shapes[j].NumEntries().
+// If dense_shapes[j] == (D0, D1, ..., DN) then the shape of output
+// Tensor dense_values[j] will be (|serialized|, D0, D1, ..., DN):
+// The dense outputs are just the inputs row-stacked by batch.
+// This works for dense_shapes[j] = (-1, D1, ..., DN).  In this case
+// the shape of the output Tensor dense_values[j] will be
+// (|serialized|, M, D1, ..., DN), where M is the maximum number of blocks
+// of elements of length D1 * ... * DN, across all minibatch entries
+// in the input.  Any minibatch entry with fewer than M blocks of elements of
+// length D1 * ... * DN will be padded with the corresponding default_value
+// scalar element along the second dimension.
+func ParseExample(scope *Scope, serialized tf.Output, names tf.Output, sparse_keys []tf.Output, dense_keys []tf.Output, dense_defaults []tf.Output, sparse_types []tf.DataType, dense_shapes []tf.Shape) (sparse_indices []tf.Output, sparse_values []tf.Output, sparse_shapes []tf.Output, dense_values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{"sparse_types": sparse_types, "dense_shapes": dense_shapes}
 	opspec := tf.OpSpec{
-		Type: "GetSessionTensor",
+		Type: "ParseExample",
 		Input: []tf.Input{
-			handle,
+			serialized, names, tf.OutputList(sparse_keys), tf.OutputList(dense_keys), tf.OutputList(dense_defaults),
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if sparse_indices, idx, err = makeOutputList(op, idx, "sparse_indices"); err != nil {
+		scope.UpdateErr("ParseExample", err)
+		return
+	}
+	if sparse_values, idx, err = makeOutputList(op, idx, "sparse_values"); err != nil {
+		scope.UpdateErr("ParseExample", err)
+		return
+	}
+	if sparse_shapes, idx, err = makeOutputList(op, idx, "sparse_shapes"); err != nil {
+		scope.UpdateErr("ParseExample", err)
+		return
+	}
+	if dense_values, idx, err = makeOutputList(op, idx, "dense_values"); err != nil {
+		scope.UpdateErr("ParseExample", err)
+		return
+	}
+	return sparse_indices, sparse_values, sparse_shapes, dense_values
 }
 
-// Returns x - y element-wise.
+// VariableShapeAttr is an optional argument to VariableShape.
+type VariableShapeAttr func(optionalAttr)
+
+// VariableShapeOutType sets the optional out_type attribute to value.
+// If not specified, defaults to DT_INT32
+func VariableShapeOutType(value tf.DataType) VariableShapeAttr {
+	return func(m optionalAttr) {
+		m["out_type"] = value
+	}
+}
+
+// Returns the shape of the variable pointed to by `input`.
 //
-// *NOTE*: `Subtract` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Sub(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// This operation returns a 1-D integer tensor representing the shape of `input`.
+//
+// For example:
+//
+// ```
+// # 't' is [[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]
+// shape(t) ==> [2, 2, 3]
+// ```
+func VariableShape(scope *Scope, input tf.Output, optional ...VariableShapeAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Sub",
+		Type: "VariableShape",
 		Input: []tf.Input{
-			x, y,
+			input,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
@@ -9756,21 +9806,25 @@ func Sub(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 
 // Computes softmax cross entropy cost and gradients to backpropagate.
 //
+// Unlike `SoftmaxCrossEntropyWithLogits`, this operation does not accept
+// a matrix of label probabilities, but rather a single label per row
+// of features.  This label is considered to have probability 1.0 for the
+// given row.
+//
 // Inputs are the logits, not probabilities.
 //
 // Arguments:
 //	features: batch_size x num_classes matrix
-//	labels: batch_size x num_classes matrix
-// The caller must ensure that each batch of labels represents a valid
-// probability distribution.
+//	labels: batch_size vector with values in [0, num_classes).
+// This is the label for the given minibatch entry.
 //
 // Returns Per example loss (batch_size vector).backpropagated gradients (batch_size x num_classes matrix).
-func SoftmaxCrossEntropyWithLogits(scope *Scope, features tf.Output, labels tf.Output) (loss tf.Output, backprop tf.Output) {
+func SparseSoftmaxCrossEntropyWithLogits(scope *Scope, features tf.Output, labels tf.Output) (loss tf.Output, backprop tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SoftmaxCrossEntropyWithLogits",
+		Type: "SparseSoftmaxCrossEntropyWithLogits",
 		Input: []tf.Input{
 			features, labels,
 		},
@@ -9779,61 +9833,63 @@ func SoftmaxCrossEntropyWithLogits(scope *Scope, features tf.Output, labels tf.O
 	return op.Output(0), op.Output(1)
 }
 
-// ReduceJoinAttr is an optional argument to ReduceJoin.
-type ReduceJoinAttr func(optionalAttr)
-
-// ReduceJoinKeepDims sets the optional keep_dims attribute to value.
+// Fast Fourier transform.
 //
-// value: If `True`, retain reduced dimensions with length `1`.
-// If not specified, defaults to false
-func ReduceJoinKeepDims(value bool) ReduceJoinAttr {
-	return func(m optionalAttr) {
-		m["keep_dims"] = value
+// Computes the 1-dimensional discrete Fourier transform over the inner-most
+// dimension of `input`.
+//
+// Arguments:
+//	input: A complex64 tensor.
+//
+// Returns A complex64 tensor of the same shape as `input`. The inner-most
+//   dimension of `input` is replaced with its 1D Fourier transform.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.fft
+// @end_compatibility
+func FFT(scope *Scope, input tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "FFT",
+		Input: []tf.Input{
+			input,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// ReduceJoinSeparator sets the optional separator attribute to value.
+// ResourceSparseApplyAdagradDAAttr is an optional argument to ResourceSparseApplyAdagradDA.
+type ResourceSparseApplyAdagradDAAttr func(optionalAttr)
+
+// ResourceSparseApplyAdagradDAUseLocking sets the optional use_locking attribute to value.
 //
-// value: The separator to use when joining.
-// If not specified, defaults to ""
-func ReduceJoinSeparator(value string) ReduceJoinAttr {
+// value: If True, updating of the var and accum tensors will be protected by
+// a lock; otherwise the behavior is undefined, but may exhibit less contention.
+// If not specified, defaults to false
+func ResourceSparseApplyAdagradDAUseLocking(value bool) ResourceSparseApplyAdagradDAAttr {
 	return func(m optionalAttr) {
-		m["separator"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Joins a string Tensor across the given dimensions.
-//
-// Computes the string join across dimensions in the given string Tensor of shape
-// `[d_0, d_1, ..., d_n-1]`.  Returns a new Tensor created by joining the input
-// strings with the given separator (default: empty string).  Negative indices are
-// counted backwards from the end, with `-1` being equivalent to `n - 1`.
-//
-// For example:
-//
-// ```python
-// # tensor `a` is [["a", "b"], ["c", "d"]]
-// tf.reduce_join(a, 0) ==> ["ac", "bd"]
-// tf.reduce_join(a, 1) ==> ["ab", "cd"]
-// tf.reduce_join(a, -2) = tf.reduce_join(a, 0) ==> ["ac", "bd"]
-// tf.reduce_join(a, -1) = tf.reduce_join(a, 1) ==> ["ab", "cd"]
-// tf.reduce_join(a, 0, keep_dims=True) ==> [["ac", "bd"]]
-// tf.reduce_join(a, 1, keep_dims=True) ==> [["ab"], ["cd"]]
-// tf.reduce_join(a, 0, separator=".") ==> ["a.c", "b.d"]
-// tf.reduce_join(a, [0, 1]) ==> ["acbd"]
-// tf.reduce_join(a, [1, 0]) ==> ["abcd"]
-// tf.reduce_join(a, []) ==> ["abcd"]
-// ```
+// Update entries in '*var' and '*accum' according to the proximal adagrad scheme.
 //
 // Arguments:
-//	inputs: The input to be joined.  All reduced indices must have non-zero size.
-//	reduction_indices: The dimensions to reduce over.  Dimensions are reduced in the
-// order specified.  Omitting `reduction_indices` is equivalent to passing
-// `[n-1, n-2, ..., 0]`.  Negative indices from `-n` to `-1` are supported.
+//	var_: Should be from a Variable().
+//	gradient_accumulator: Should be from a Variable().
+//	gradient_squared_accumulator: Should be from a Variable().
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var and accum.
+//	lr: Learning rate. Must be a scalar.
+//	l1: L1 regularization. Must be a scalar.
+//	l2: L2 regularization. Must be a scalar.
+//	global_step: Training step number. Must be a scalar.
 //
-// Returns Has shape equal to that of the input with reduced dimensions removed or
-// set to `1` depending on `keep_dims`.
-func ReduceJoin(scope *Scope, inputs tf.Output, reduction_indices tf.Output, optional ...ReduceJoinAttr) (output tf.Output) {
+// Returns the created operation.
+func ResourceSparseApplyAdagradDA(scope *Scope, var_ tf.Output, gradient_accumulator tf.Output, gradient_squared_accumulator tf.Output, grad tf.Output, indices tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, global_step tf.Output, optional ...ResourceSparseApplyAdagradDAAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -9842,88 +9898,133 @@ func ReduceJoin(scope *Scope, inputs tf.Output, reduction_indices tf.Output, opt
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ReduceJoin",
+		Type: "ResourceSparseApplyAdagradDA",
 		Input: []tf.Input{
-			inputs, reduction_indices,
+			var_, gradient_accumulator, gradient_squared_accumulator, grad, indices, lr, l1, l2, global_step,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Computes cos of x element-wise.
-func Cos(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Cos",
-		Input: []tf.Input{
-			x,
-		},
+// EncodeJpegAttr is an optional argument to EncodeJpeg.
+type EncodeJpegAttr func(optionalAttr)
+
+// EncodeJpegFormat sets the optional format attribute to value.
+//
+// value: Per pixel image format.
+// If not specified, defaults to ""
+func EncodeJpegFormat(value string) EncodeJpegAttr {
+	return func(m optionalAttr) {
+		m["format"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// FusedBatchNormGradAttr is an optional argument to FusedBatchNormGrad.
-type FusedBatchNormGradAttr func(optionalAttr)
+// EncodeJpegQuality sets the optional quality attribute to value.
+//
+// value: Quality of the compression from 0 to 100 (higher is better and slower).
+// If not specified, defaults to 95
+func EncodeJpegQuality(value int64) EncodeJpegAttr {
+	return func(m optionalAttr) {
+		m["quality"] = value
+	}
+}
 
-// FusedBatchNormGradEpsilon sets the optional epsilon attribute to value.
+// EncodeJpegProgressive sets the optional progressive attribute to value.
 //
-// value: A small float number added to the variance of x.
-// If not specified, defaults to 0.0001
-func FusedBatchNormGradEpsilon(value float32) FusedBatchNormGradAttr {
+// value: If True, create a JPEG that loads progressively (coarse to fine).
+// If not specified, defaults to false
+func EncodeJpegProgressive(value bool) EncodeJpegAttr {
 	return func(m optionalAttr) {
-		m["epsilon"] = value
+		m["progressive"] = value
 	}
 }
 
-// FusedBatchNormGradDataFormat sets the optional data_format attribute to value.
+// EncodeJpegOptimizeSize sets the optional optimize_size attribute to value.
 //
-// value: The data format for y_backprop, x, x_backprop.
-// Either "NHWC" (default) or "NCHW".
-// If not specified, defaults to "NHWC"
-func FusedBatchNormGradDataFormat(value string) FusedBatchNormGradAttr {
+// value: If True, spend CPU/RAM to reduce size with no quality change.
+// If not specified, defaults to false
+func EncodeJpegOptimizeSize(value bool) EncodeJpegAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["optimize_size"] = value
 	}
 }
 
-// FusedBatchNormGradIsTraining sets the optional is_training attribute to value.
+// EncodeJpegChromaDownsampling sets the optional chroma_downsampling attribute to value.
 //
-// value: A bool value to indicate the operation is for training (default)
-// or inference.
+// value: See http://en.wikipedia.org/wiki/Chroma_subsampling.
 // If not specified, defaults to true
-func FusedBatchNormGradIsTraining(value bool) FusedBatchNormGradAttr {
+func EncodeJpegChromaDownsampling(value bool) EncodeJpegAttr {
 	return func(m optionalAttr) {
-		m["is_training"] = value
+		m["chroma_downsampling"] = value
 	}
 }
 
-// Gradient for batch normalization.
+// EncodeJpegDensityUnit sets the optional density_unit attribute to value.
 //
-// Note that the size of 4D Tensors are defined by either "NHWC" or "NCHW".
-// The size of 1D Tensors matches the dimension C of the 4D Tensors.
+// value: Unit used to specify `x_density` and `y_density`:
+// pixels per inch (`'in'`) or centimeter (`'cm'`).
+// If not specified, defaults to "in"
+func EncodeJpegDensityUnit(value string) EncodeJpegAttr {
+	return func(m optionalAttr) {
+		m["density_unit"] = value
+	}
+}
+
+// EncodeJpegXDensity sets the optional x_density attribute to value.
+//
+// value: Horizontal pixels per density unit.
+// If not specified, defaults to 300
+func EncodeJpegXDensity(value int64) EncodeJpegAttr {
+	return func(m optionalAttr) {
+		m["x_density"] = value
+	}
+}
+
+// EncodeJpegYDensity sets the optional y_density attribute to value.
+//
+// value: Vertical pixels per density unit.
+// If not specified, defaults to 300
+func EncodeJpegYDensity(value int64) EncodeJpegAttr {
+	return func(m optionalAttr) {
+		m["y_density"] = value
+	}
+}
+
+// EncodeJpegXmpMetadata sets the optional xmp_metadata attribute to value.
+//
+// value: If not empty, embed this XMP metadata in the image header.
+// If not specified, defaults to ""
+func EncodeJpegXmpMetadata(value string) EncodeJpegAttr {
+	return func(m optionalAttr) {
+		m["xmp_metadata"] = value
+	}
+}
+
+// JPEG-encode an image.
+//
+// `image` is a 3-D uint8 Tensor of shape `[height, width, channels]`.
+//
+// The attr `format` can be used to override the color format of the encoded
+// output.  Values can be:
+//
+// *   `''`: Use a default format based on the number of channels in the image.
+// *   `grayscale`: Output a grayscale JPEG image.  The `channels` dimension
+//     of `image` must be 1.
+// *   `rgb`: Output an RGB JPEG image. The `channels` dimension
+//     of `image` must be 3.
+//
+// If `format` is not specified or is the empty string, a default format is picked
+// based on the number of channels in `image`:
+//
+// *   1: Output a grayscale image.
+// *   3: Output an RGB image.
 //
 // Arguments:
-//	y_backprop: A 4D Tensor for the gradient with respect to y.
-//	x: A 4D Tensor for input data.
-//	scale: A 1D Tensor for scaling factor, to scale the normalized x.
-//	reserve_space_1: When is_training is True, a 1D Tensor for the computed batch
-// mean to be reused in gradient computation. When is_training is
-// False, a 1D Tensor for the population mean to be reused in both
-// 1st and 2nd order gradient computation.
-//	reserve_space_2: When is_training is True, a 1D Tensor for the computed batch
-// variance (inverted variance in the cuDNN case) to be reused in
-// gradient computation. When is_training is False, a 1D Tensor
-// for the population variance to be reused in both 1st and 2nd
-// order gradient computation.
+//	image: 3-D with shape `[height, width, channels]`.
 //
-// Returns A 4D Tensor for the gradient with respect to x.A 1D Tensor for the gradient with respect to scale.A 1D Tensor for the gradient with respect to offset.Unused placeholder to match the mean input in FusedBatchNorm.Unused placeholder to match the variance input
-// in FusedBatchNorm.
-func FusedBatchNormGrad(scope *Scope, y_backprop tf.Output, x tf.Output, scale tf.Output, reserve_space_1 tf.Output, reserve_space_2 tf.Output, optional ...FusedBatchNormGradAttr) (x_backprop tf.Output, scale_backprop tf.Output, offset_backprop tf.Output, reserve_space_3 tf.Output, reserve_space_4 tf.Output) {
+// Returns 0-D. JPEG-encoded image.
+func EncodeJpeg(scope *Scope, image tf.Output, optional ...EncodeJpegAttr) (contents tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -9932,479 +10033,522 @@ func FusedBatchNormGrad(scope *Scope, y_backprop tf.Output, x tf.Output, scale t
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FusedBatchNormGrad",
+		Type: "EncodeJpeg",
 		Input: []tf.Input{
-			y_backprop, x, scale, reserve_space_1, reserve_space_2,
+			image,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
+	return op.Output(0)
 }
 
-// TopKAttr is an optional argument to TopK.
-type TopKAttr func(optionalAttr)
+// MultinomialAttr is an optional argument to Multinomial.
+type MultinomialAttr func(optionalAttr)
 
-// TopKSorted sets the optional sorted attribute to value.
+// MultinomialSeed sets the optional seed attribute to value.
 //
-// value: If true the resulting `k` elements will be sorted by the values in
-// descending order.
-// If not specified, defaults to true
-func TopKSorted(value bool) TopKAttr {
+// value: If either seed or seed2 is set to be non-zero, the internal random number
+// generator is seeded by the given seed.  Otherwise, a random seed is used.
+// If not specified, defaults to 0
+func MultinomialSeed(value int64) MultinomialAttr {
 	return func(m optionalAttr) {
-		m["sorted"] = value
+		m["seed"] = value
 	}
 }
 
-// Finds values and indices of the `k` largest elements for the last dimension.
-//
-// DEPRECATED at GraphDef version 7: Use TopKV2 instead
-//
-// If the input is a vector (rank-1), finds the `k` largest entries in the vector
-// and outputs their values and indices as vectors.  Thus `values[j]` is the
-// `j`-th largest entry in `input`, and its index is `indices[j]`.
-//
-// For matrices (resp. higher rank input), computes the top `k` entries in each
-// row (resp. vector along the last dimension).  Thus,
-//
-//     values.shape = indices.shape = input.shape[:-1] + [k]
-//
-// If two elements are equal, the lower-index element appears first.
+// MultinomialSeed2 sets the optional seed2 attribute to value.
 //
-// If `k` varies dynamically, use `TopKV2` below.
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func MultinomialSeed2(value int64) MultinomialAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// MultinomialOutputDtype sets the optional output_dtype attribute to value.
+// If not specified, defaults to DT_INT64
+func MultinomialOutputDtype(value tf.DataType) MultinomialAttr {
+	return func(m optionalAttr) {
+		m["output_dtype"] = value
+	}
+}
+
+// Draws samples from a multinomial distribution.
 //
 // Arguments:
-//	input: 1-D or higher with last dimension at least `k`.
-//	k: Number of top elements to look for along the last dimension (along each
-// row for matrices).
+//	logits: 2-D Tensor with shape `[batch_size, num_classes]`.  Each slice `[i, :]`
+// represents the unnormalized log probabilities for all classes.
+//	num_samples: 0-D.  Number of independent samples to draw for each row slice.
 //
-// Returns The `k` largest elements along each last dimensional slice.The indices of `values` within the last dimension of `input`.
-func TopK(scope *Scope, input tf.Output, k int64, optional ...TopKAttr) (values tf.Output, indices tf.Output) {
+// Returns 2-D Tensor with shape `[batch_size, num_samples]`.  Each slice `[i, :]`
+// contains the drawn class labels with range `[0, num_classes)`.
+func Multinomial(scope *Scope, logits tf.Output, num_samples tf.Output, optional ...MultinomialAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"k": k}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TopK",
+		Type: "Multinomial",
 		Input: []tf.Input{
-			input,
+			logits, num_samples,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Transforms a Tensor into a serialized TensorProto proto.
+// Returns the truth value of NOT x element-wise.
+func LogicalNot(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "LogicalNot",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// 3D real-valued fast Fourier transform.
+//
+// Computes the 3-dimensional discrete Fourier transform of a real-valued signal
+// over the inner-most 3 dimensions of `input`.
+//
+// Since the DFT of a real signal is Hermitian-symmetric, `RFFT3D` only returns the
+// `fft_length / 2 + 1` unique components of the FFT for the inner-most dimension
+// of `output`: the zero-frequency term, followed by the `fft_length / 2`
+// positive-frequency terms.
+//
+// Along each axis `RFFT3D` is computed on, if `fft_length` is smaller than the
+// corresponding dimension of `input`, the dimension is cropped. If it is larger,
+// the dimension is padded with zeros.
 //
 // Arguments:
-//	tensor: A Tensor of type `T`.
+//	input: A float32 tensor.
+//	fft_length: An int32 tensor of shape [3]. The FFT length for each dimension.
 //
-// Returns A serialized TensorProto proto of the input tensor.
-func SerializeTensor(scope *Scope, tensor tf.Output) (serialized tf.Output) {
+// Returns A complex64 tensor of the same rank as `input`. The inner-most 3
+//   dimensions of `input` are replaced with their 3D Fourier transform. The
+//   inner-most dimension contains `fft_length / 2 + 1` unique frequency
+//   components.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.rfftn with 3 dimensions.
+// @end_compatibility
+func RFFT3D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SerializeTensor",
+		Type: "RFFT3D",
 		Input: []tf.Input{
-			tensor,
+			input, fft_length,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MatrixSolveAttr is an optional argument to MatrixSolve.
-type MatrixSolveAttr func(optionalAttr)
+// TensorArrayV3Attr is an optional argument to TensorArrayV3.
+type TensorArrayV3Attr func(optionalAttr)
 
-// MatrixSolveAdjoint sets the optional adjoint attribute to value.
+// TensorArrayV3ElementShape sets the optional element_shape attribute to value.
 //
-// value: Boolean indicating whether to solve with `matrix` or its (block-wise)
-// adjoint.
+// value: The expected shape of an element, if known. Used to
+// validate the shapes of TensorArray elements. If this shape is not
+// fully specified, gathering zero-size TensorArrays is an error.
+// If not specified, defaults to <unknown_rank:true >
+func TensorArrayV3ElementShape(value tf.Shape) TensorArrayV3Attr {
+	return func(m optionalAttr) {
+		m["element_shape"] = value
+	}
+}
+
+// TensorArrayV3DynamicSize sets the optional dynamic_size attribute to value.
+//
+// value: A boolean that determines whether writes to the TensorArray
+// are allowed to grow the size.  By default, this is not allowed.
 // If not specified, defaults to false
-func MatrixSolveAdjoint(value bool) MatrixSolveAttr {
+func TensorArrayV3DynamicSize(value bool) TensorArrayV3Attr {
 	return func(m optionalAttr) {
-		m["adjoint"] = value
+		m["dynamic_size"] = value
 	}
 }
 
-// Solves systems of linear equations.
+// TensorArrayV3ClearAfterRead sets the optional clear_after_read attribute to value.
 //
-// `Matrix` is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices. `Rhs` is a tensor of shape `[..., M, K]`. The `output` is
-// a tensor shape `[..., M, K]`.  If `adjoint` is `False` then each output matrix
-// satisfies `matrix[..., :, :] * output[..., :, :] = rhs[..., :, :]`.
-// If `adjoint` is `True` then each output matrix satisfies
-// `adjoint(matrix[..., :, :]) * output[..., :, :] = rhs[..., :, :]`.
+// value: If true (default), Tensors in the TensorArray are cleared
+// after being read.  This disables multiple read semantics but allows early
+// release of memory.
+// If not specified, defaults to true
+func TensorArrayV3ClearAfterRead(value bool) TensorArrayV3Attr {
+	return func(m optionalAttr) {
+		m["clear_after_read"] = value
+	}
+}
+
+// TensorArrayV3IdenticalElementShapes sets the optional identical_element_shapes attribute to value.
+//
+// value: If true (default is false), then all
+// elements in the TensorArray will be expected to have identical shapes.
+// This allows certain behaviors, like dynamically checking for
+// consistent shapes on write, and being able to fill in properly
+// shaped zero tensors on stack -- even if the element_shape attribute
+// is not fully defined.
+// If not specified, defaults to false
+func TensorArrayV3IdenticalElementShapes(value bool) TensorArrayV3Attr {
+	return func(m optionalAttr) {
+		m["identical_element_shapes"] = value
+	}
+}
+
+// TensorArrayV3TensorArrayName sets the optional tensor_array_name attribute to value.
+//
+// value: Overrides the name used for the temporary tensor_array
+// resource. Default value is the name of the 'TensorArray' op (which
+// is guaranteed unique).
+// If not specified, defaults to ""
+func TensorArrayV3TensorArrayName(value string) TensorArrayV3Attr {
+	return func(m optionalAttr) {
+		m["tensor_array_name"] = value
+	}
+}
+
+// An array of Tensors of given size.
+//
+// Write data via Write and read via Read or Pack.
 //
 // Arguments:
-//	matrix: Shape is `[..., M, M]`.
-//	rhs: Shape is `[..., M, K]`.
+//	size: The size of the array.
+//	dtype: The type of the elements on the tensor_array.
 //
-// Returns Shape is `[..., M, K]`.
-func MatrixSolve(scope *Scope, matrix tf.Output, rhs tf.Output, optional ...MatrixSolveAttr) (output tf.Output) {
+// Returns The handle to the TensorArray. A scalar used to control gradient flow.
+func TensorArrayV3(scope *Scope, size tf.Output, dtype tf.DataType, optional ...TensorArrayV3Attr) (handle tf.Output, flow tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MatrixSolve",
+		Type: "TensorArrayV3",
 		Input: []tf.Input{
-			matrix, rhs,
+			size,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Looks up keys in a table, outputs the corresponding values.
-//
-// The tensor `keys` must of the same type as the keys of the table.
-// The output `values` is of the type of the table values.
+// MaxPool3DAttr is an optional argument to MaxPool3D.
+type MaxPool3DAttr func(optionalAttr)
+
+// MaxPool3DDataFormat sets the optional data_format attribute to value.
 //
-// The scalar `default_value` is the value output for keys not present in the
-// table. It must also be of the same type as the table values.
+// value: The data format of the input and output data. With the
+// default format "NDHWC", the data is stored in the order of:
+//     [batch, in_depth, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCDHW", in which case the data storage order is:
+//     [batch, in_channels, in_depth, in_height, in_width].
+// If not specified, defaults to "NDHWC"
+func MaxPool3DDataFormat(value string) MaxPool3DAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Performs 3D max pooling on the input.
 //
 // Arguments:
-//	table_handle: Handle to the table.
-//	keys: Any shape.  Keys to look up.
-//
+//	input: Shape `[batch, depth, rows, cols, channels]` tensor to pool over.
+//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
+// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
 //
-// Returns Same shape as `keys`.  Values found in the table, or `default_values`
-// for missing keys.
-func LookupTableFindV2(scope *Scope, table_handle tf.Output, keys tf.Output, default_value tf.Output) (values tf.Output) {
+// Returns The max pooled output tensor.
+func MaxPool3D(scope *Scope, input tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPool3DAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "LookupTableFindV2",
+		Type: "MaxPool3D",
 		Input: []tf.Input{
-			table_handle, keys, default_value,
+			input,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Inverse 3D fast Fourier transform.
+// Computes the gradients of 3-D convolution with respect to the input.
 //
-// Computes the inverse 3-dimensional discrete Fourier transform over the
-// inner-most 3 dimensions of `input`.
+// DEPRECATED at GraphDef version 10: Use Conv3DBackpropInputV2
 //
 // Arguments:
-//	input: A complex64 tensor.
-//
-// Returns A complex64 tensor of the same shape as `input`. The inner-most 3
-//   dimensions of `input` are replaced with their inverse 3D Fourier transform.
-//
-// @compatibility(numpy)
-// Equivalent to np.fft.ifftn with 3 dimensions.
-// @end_compatibility
-func IFFT3D(scope *Scope, input tf.Output) (output tf.Output) {
+//	input: Shape `[batch, depth, rows, cols, in_channels]`.
+//	filter: Shape `[depth, rows, cols, in_channels, out_channels]`.
+// `in_channels` must match between `input` and `filter`.
+//	out_backprop: Backprop signal of shape `[batch, out_depth, out_rows, out_cols,
+// out_channels]`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
+func Conv3DBackpropInput(scope *Scope, input tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "IFFT3D",
+		Type: "Conv3DBackpropInput",
 		Input: []tf.Input{
-			input,
+			input, filter, out_backprop,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Adds `bias` to `value`.
+// ResourceApplyProximalAdagradAttr is an optional argument to ResourceApplyProximalAdagrad.
+type ResourceApplyProximalAdagradAttr func(optionalAttr)
+
+// ResourceApplyProximalAdagradUseLocking sets the optional use_locking attribute to value.
 //
-// This is a deprecated version of BiasAdd and will be soon removed.
+// value: If True, updating of the var and accum tensors will be protected by
+// a lock; otherwise the behavior is undefined, but may exhibit less contention.
+// If not specified, defaults to false
+func ResourceApplyProximalAdagradUseLocking(value bool) ResourceApplyProximalAdagradAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// Update '*var' and '*accum' according to FOBOS with Adagrad learning rate.
 //
-// This is a special case of `tf.add` where `bias` is restricted to be 1-D.
-// Broadcasting is supported, so `value` may have any number of dimensions.
+// accum += grad * grad
+// prox_v = var - lr * grad * (1 / sqrt(accum))
+// var = sign(prox_v)/(1+lr*l2) * max{|prox_v|-lr*l1,0}
 //
 // Arguments:
-//	value: Any number of dimensions.
-//	bias: 1-D with size the last dimension of `value`.
+//	var_: Should be from a Variable().
+//	accum: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	l1: L1 regularization. Must be a scalar.
+//	l2: L2 regularization. Must be a scalar.
+//	grad: The gradient.
 //
-// Returns Broadcasted sum of `value` and `bias`.
-func BiasAddV1(scope *Scope, value tf.Output, bias tf.Output) (output tf.Output) {
+// Returns the created operation.
+func ResourceApplyProximalAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, grad tf.Output, optional ...ResourceApplyProximalAdagradAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "BiasAddV1",
+		Type: "ResourceApplyProximalAdagrad",
 		Input: []tf.Input{
-			value, bias,
+			var_, accum, lr, l1, l2, grad,
 		},
+		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Reverses specific dimensions of a tensor.
-//
-// NOTE `tf.reverse` has now changed behavior in preparation for 1.0.
-// `tf.reverse_v2` is currently an alias that will be deprecated before TF 1.0.
-//
-// Given a `tensor`, and a `int32` tensor `axis` representing the set of
-// dimensions of `tensor` to reverse. This operation reverses each dimension
-// `i` for which there exists `j` s.t. `axis[j] == i`.
-//
-// `tensor` can have up to 8 dimensions. The number of dimensions specified
-// in `axis` may be 0 or more entries. If an index is specified more than
-// once, a InvalidArgument error is raised.
-//
-// For example:
-//
-// ```
-// # tensor 't' is [[[[ 0,  1,  2,  3],
-// #                  [ 4,  5,  6,  7],
-// #                  [ 8,  9, 10, 11]],
-// #                 [[12, 13, 14, 15],
-// #                  [16, 17, 18, 19],
-// #                  [20, 21, 22, 23]]]]
-// # tensor 't' shape is [1, 2, 3, 4]
-//
-// # 'dims' is [3] or 'dims' is [-1]
-// reverse(t, dims) ==> [[[[ 3,  2,  1,  0],
-//                         [ 7,  6,  5,  4],
-//                         [ 11, 10, 9, 8]],
-//                        [[15, 14, 13, 12],
-//                         [19, 18, 17, 16],
-//                         [23, 22, 21, 20]]]]
-//
-// # 'dims' is '[1]' (or 'dims' is '[-3]')
-// reverse(t, dims) ==> [[[[12, 13, 14, 15],
-//                         [16, 17, 18, 19],
-//                         [20, 21, 22, 23]
-//                        [[ 0,  1,  2,  3],
-//                         [ 4,  5,  6,  7],
-//                         [ 8,  9, 10, 11]]]]
-//
-// # 'dims' is '[2]' (or 'dims' is '[-2]')
-// reverse(t, dims) ==> [[[[8, 9, 10, 11],
-//                         [4, 5, 6, 7],
-//                         [0, 1, 2, 3]]
-//                        [[20, 21, 22, 23],
-//                         [16, 17, 18, 19],
-//                         [12, 13, 14, 15]]]]
-// ```
-//
-// Arguments:
-//	tensor: Up to 8-D.
-//	axis: 1-D. The indices of the dimensions to reverse. Must be in the range
-// `[-rank(tensor), rank(tensor))`.
+// MutableHashTableOfTensorsV2Attr is an optional argument to MutableHashTableOfTensorsV2.
+type MutableHashTableOfTensorsV2Attr func(optionalAttr)
+
+// MutableHashTableOfTensorsV2Container sets the optional container attribute to value.
 //
-// Returns The same shape as `tensor`.
-func ReverseV2(scope *Scope, tensor tf.Output, axis tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
+// value: If non-empty, this table is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func MutableHashTableOfTensorsV2Container(value string) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "ReverseV2",
-		Input: []tf.Input{
-			tensor, axis,
-		},
+}
+
+// MutableHashTableOfTensorsV2SharedName sets the optional shared_name attribute to value.
+//
+// value: If non-empty, this table is shared under the given name across
+// multiple sessions.
+// If not specified, defaults to ""
+func MutableHashTableOfTensorsV2SharedName(value string) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// RealAttr is an optional argument to Real.
-type RealAttr func(optionalAttr)
+// MutableHashTableOfTensorsV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
+// If not specified, defaults to false
+func MutableHashTableOfTensorsV2UseNodeNameSharing(value bool) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["use_node_name_sharing"] = value
+	}
+}
 
-// RealTout sets the optional Tout attribute to value.
-// If not specified, defaults to DT_FLOAT
-func RealTout(value tf.DataType) RealAttr {
+// MutableHashTableOfTensorsV2ValueShape sets the optional value_shape attribute to value.
+// If not specified, defaults to <>
+func MutableHashTableOfTensorsV2ValueShape(value tf.Shape) MutableHashTableOfTensorsV2Attr {
 	return func(m optionalAttr) {
-		m["Tout"] = value
+		m["value_shape"] = value
 	}
 }
 
-// Returns the real part of a complex number.
+// Creates an empty hash table.
 //
-// Given a tensor `input` of complex numbers, this operation returns a tensor of
-// type `float` that is the real part of each element in `input`. All elements in
-// `input` must be complex numbers of the form \\(a + bj\\), where *a* is the real
-//  part returned by this operation and *b* is the imaginary part.
+// This op creates a mutable hash table, specifying the type of its keys and
+// values. Each value must be a vector. Data can be inserted into the table using
+// the insert operations. It does not support the initialization operation.
 //
-// For example:
+// Arguments:
+//	key_dtype: Type of the table keys.
+//	value_dtype: Type of the table values.
 //
-// ```
-// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
-// tf.real(input) ==> [-2.25, 3.25]
-// ```
-func Real(scope *Scope, input tf.Output, optional ...RealAttr) (output tf.Output) {
+// Returns Handle to a table.
+func MutableHashTableOfTensorsV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...MutableHashTableOfTensorsV2Attr) (table_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Real",
-		Input: []tf.Input{
-			input,
-		},
+		Type: "MutableHashTableOfTensorsV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// AudioSummaryAttr is an optional argument to AudioSummary.
-type AudioSummaryAttr func(optionalAttr)
-
-// AudioSummaryMaxOutputs sets the optional max_outputs attribute to value.
-//
-// value: Max number of batch elements to generate audio for.
-// If not specified, defaults to 3
-//
-// REQUIRES: value >= 1
-func AudioSummaryMaxOutputs(value int64) AudioSummaryAttr {
-	return func(m optionalAttr) {
-		m["max_outputs"] = value
-	}
-}
-
-// Outputs a `Summary` protocol buffer with audio.
-//
-// DEPRECATED at GraphDef version 15: Use AudioSummaryV2.
-//
-// The summary has up to `max_outputs` summary values containing audio. The
-// audio is built from `tensor` which must be 3-D with shape `[batch_size,
-// frames, channels]` or 2-D with shape `[batch_size, frames]`. The values are
-// assumed to be in the range of `[-1.0, 1.0]` with a sample rate of `sample_rate`.
-//
-// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
-// build the `tag` of the summary values:
+// Inverse 2D fast Fourier transform.
 //
-// *  If `max_outputs` is 1, the summary value tag is '*tag*/audio'.
-// *  If `max_outputs` is greater than 1, the summary value tags are
-//    generated sequentially as '*tag*/audio/0', '*tag*/audio/1', etc.
+// Computes the inverse 2-dimensional discrete Fourier transform over the
+// inner-most 2 dimensions of `input`.
 //
 // Arguments:
-//	tag: Scalar. Used to build the `tag` attribute of the summary values.
-//	tensor: 2-D of shape `[batch_size, frames]`.
-//	sample_rate: The sample rate of the signal in hertz.
+//	input: A complex64 tensor.
 //
-// Returns Scalar. Serialized `Summary` protocol buffer.
-func AudioSummary(scope *Scope, tag tf.Output, tensor tf.Output, sample_rate float32, optional ...AudioSummaryAttr) (summary tf.Output) {
+// Returns A complex64 tensor of the same shape as `input`. The inner-most 2
+//   dimensions of `input` are replaced with their inverse 2D Fourier transform.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.ifft2
+// @end_compatibility
+func IFFT2D(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"sample_rate": sample_rate}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "AudioSummary",
+		Type: "IFFT2D",
 		Input: []tf.Input{
-			tag, tensor,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// QrAttr is an optional argument to Qr.
-type QrAttr func(optionalAttr)
-
-// QrFullMatrices sets the optional full_matrices attribute to value.
+// Creates a tensor filled with a scalar value.
 //
-// value: If true, compute full-sized `q` and `r`. If false
-// (the default), compute only the leading `P` columns of `q`.
-// If not specified, defaults to false
-func QrFullMatrices(value bool) QrAttr {
-	return func(m optionalAttr) {
-		m["full_matrices"] = value
-	}
-}
-
-// Computes the QR decompositions of one or more matrices.
+// This operation creates a tensor of shape `dims` and fills it with `value`.
 //
-// Computes the QR decomposition of each inner matrix in `tensor` such that
-// `tensor[..., :, :] = q[..., :, :] * r[..., :,:])`
+// For example:
 //
-// ```python
-// # a is a tensor.
-// # q is a tensor of orthonormal matrices.
-// # r is a tensor of upper triangular matrices.
-// q, r = qr(a)
-// q_full, r_full = qr(a, full_matrices=True)
+// ```
+// # Output tensor has shape [2, 3].
+// fill([2, 3], 9) ==> [[9, 9, 9]
+//                      [9, 9, 9]]
 // ```
 //
 // Arguments:
-//	input: A tensor of shape `[..., M, N]` whose inner-most 2 dimensions
-// form matrices of size `[M, N]`. Let `P` be the minimum of `M` and `N`.
+//	dims: 1-D. Represents the shape of the output tensor.
+//	value: 0-D (scalar). Value to fill the returned tensor.
 //
-// Returns Orthonormal basis for range of `a`. If `full_matrices` is `False` then
-// shape is `[..., M, P]`; if `full_matrices` is `True` then shape is
-// `[..., M, M]`.Triangular factor. If `full_matrices` is `False` then shape is
-// `[..., P, N]`. If `full_matrices` is `True` then shape is `[..., M, N]`.
-func Qr(scope *Scope, input tf.Output, optional ...QrAttr) (q tf.Output, r tf.Output) {
+// @compatibility(numpy)
+// Equivalent to np.full
+// @end_compatibility
+func Fill(scope *Scope, dims tf.Output, value tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "Qr",
+		Type: "Fill",
 		Input: []tf.Input{
-			input,
+			dims, value,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Records the bytes size of each element of `input_dataset` in a StatsAggregator.
-func BytesProducedStatsDataset(scope *Scope, input_dataset tf.Output, tag tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// 2D fast Fourier transform.
+//
+// Computes the 2-dimensional discrete Fourier transform over the inner-most
+// 2 dimensions of `input`.
+//
+// Arguments:
+//	input: A complex64 tensor.
+//
+// Returns A complex64 tensor of the same shape as `input`. The inner-most 2
+//   dimensions of `input` are replaced with their 2D Fourier transform.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.fft2
+// @end_compatibility
+func FFT2D(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "BytesProducedStatsDataset",
+		Type: "FFT2D",
 		Input: []tf.Input{
-			input_dataset, tag,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceSparseApplyProximalGradientDescentAttr is an optional argument to ResourceSparseApplyProximalGradientDescent.
-type ResourceSparseApplyProximalGradientDescentAttr func(optionalAttr)
+// ResourceApplyProximalGradientDescentAttr is an optional argument to ResourceApplyProximalGradientDescent.
+type ResourceApplyProximalGradientDescentAttr func(optionalAttr)
 
-// ResourceSparseApplyProximalGradientDescentUseLocking sets the optional use_locking attribute to value.
+// ResourceApplyProximalGradientDescentUseLocking sets the optional use_locking attribute to value.
 //
 // value: If True, the subtraction will be protected by a lock;
 // otherwise the behavior is undefined, but may exhibit less contention.
 // If not specified, defaults to false
-func ResourceSparseApplyProximalGradientDescentUseLocking(value bool) ResourceSparseApplyProximalGradientDescentAttr {
+func ResourceApplyProximalGradientDescentUseLocking(value bool) ResourceApplyProximalGradientDescentAttr {
 	return func(m optionalAttr) {
 		m["use_locking"] = value
 	}
 }
 
-// Sparse update '*var' as FOBOS algorithm with fixed learning rate.
+// Update '*var' using the FOBOS algorithm with a fixed learning rate.
 //
-// That is for rows we have grad for, we update var as follows:
-// prox_v = var - alpha * grad
+// prox_v = var - alpha * delta
 // var = sign(prox_v)/(1+alpha*l2) * max{|prox_v|-alpha*l1,0}
 //
 // Arguments:
@@ -10412,11 +10556,10 @@ func ResourceSparseApplyProximalGradientDescentUseLocking(value bool) ResourceSp
 //	alpha: Scaling factor. Must be a scalar.
 //	l1: L1 regularization. Must be a scalar.
 //	l2: L2 regularization. Must be a scalar.
-//	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var and accum.
+//	delta: The change.
 //
 // Returns the created operation.
-func ResourceSparseApplyProximalGradientDescent(scope *Scope, var_ tf.Output, alpha tf.Output, l1 tf.Output, l2 tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyProximalGradientDescentAttr) (o *tf.Operation) {
+func ResourceApplyProximalGradientDescent(scope *Scope, var_ tf.Output, alpha tf.Output, l1 tf.Output, l2 tf.Output, delta tf.Output, optional ...ResourceApplyProximalGradientDescentAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -10425,53 +10568,49 @@ func ResourceSparseApplyProximalGradientDescent(scope *Scope, var_ tf.Output, al
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyProximalGradientDescent",
+		Type: "ResourceApplyProximalGradientDescent",
 		Input: []tf.Input{
-			var_, alpha, l1, l2, grad, indices,
+			var_, alpha, l1, l2, delta,
 		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// MeanAttr is an optional argument to Mean.
-type MeanAttr func(optionalAttr)
-
-// MeanKeepDims sets the optional keep_dims attribute to value.
+// Computes the gradient for the sqrt of `x` wrt its input.
 //
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func MeanKeepDims(value bool) MeanAttr {
-	return func(m optionalAttr) {
-		m["keep_dims"] = value
+// Specifically, `grad = dy * 0.5 / y`, where `y = sqrt(x)`, and `dy`
+// is the corresponding input gradient.
+func SqrtGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SqrtGrad",
+		Input: []tf.Input{
+			y, dy,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Computes the mean of elements across dimensions of a tensor.
-//
-// Reduces `input` along the dimensions given in `axis`. Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `axis`. If `keep_dims` is true, the reduced dimensions are
-// retained with length 1.
+// Get the value of the tensor specified by its handle.
 //
 // Arguments:
-//	input: The tensor to reduce.
-//	axis: The dimensions to reduce. Must be in the range
-// `[-rank(input), rank(input))`.
+//	handle: The handle for a tensor stored in the session state.
+//	dtype: The type of the output value.
 //
-// Returns The reduced tensor.
-func Mean(scope *Scope, input tf.Output, axis tf.Output, optional ...MeanAttr) (output tf.Output) {
+// Returns The tensor for the given handle.
+func GetSessionTensor(scope *Scope, handle tf.Output, dtype tf.DataType) (value tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"dtype": dtype}
 	opspec := tf.OpSpec{
-		Type: "Mean",
+		Type: "GetSessionTensor",
 		Input: []tf.Input{
-			input, axis,
+			handle,
 		},
 		Attrs: attrs,
 	}
@@ -10479,90 +10618,194 @@ func Mean(scope *Scope, input tf.Output, axis tf.Output, optional ...MeanAttr) (
 	return op.Output(0)
 }
 
-// InitializeTableFromTextFileV2Attr is an optional argument to InitializeTableFromTextFileV2.
-type InitializeTableFromTextFileV2Attr func(optionalAttr)
+// Returns x - y element-wise.
+//
+// *NOTE*: `Sub` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func Sub(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Sub",
+		Input: []tf.Input{
+			x, y,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
 
-// InitializeTableFromTextFileV2VocabSize sets the optional vocab_size attribute to value.
+// Computes softmax cross entropy cost and gradients to backpropagate.
 //
-// value: Number of elements of the file, use -1 if unknown.
-// If not specified, defaults to -1
+// Inputs are the logits, not probabilities.
 //
-// REQUIRES: value >= -1
-func InitializeTableFromTextFileV2VocabSize(value int64) InitializeTableFromTextFileV2Attr {
+// Arguments:
+//	features: batch_size x num_classes matrix
+//	labels: batch_size x num_classes matrix
+// The caller must ensure that each batch of labels represents a valid
+// probability distribution.
+//
+// Returns Per example loss (batch_size vector). Backpropagated gradients (batch_size x num_classes matrix).
+func SoftmaxCrossEntropyWithLogits(scope *Scope, features tf.Output, labels tf.Output) (loss tf.Output, backprop tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SoftmaxCrossEntropyWithLogits",
+		Input: []tf.Input{
+			features, labels,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
+}
+
+// ReduceJoinAttr is an optional argument to ReduceJoin.
+type ReduceJoinAttr func(optionalAttr)
+
+// ReduceJoinKeepDims sets the optional keep_dims attribute to value.
+//
+// value: If `True`, retain reduced dimensions with length `1`.
+// If not specified, defaults to false
+func ReduceJoinKeepDims(value bool) ReduceJoinAttr {
 	return func(m optionalAttr) {
-		m["vocab_size"] = value
+		m["keep_dims"] = value
 	}
 }
 
-// InitializeTableFromTextFileV2Delimiter sets the optional delimiter attribute to value.
+// ReduceJoinSeparator sets the optional separator attribute to value.
 //
-// value: Delimiter to separate fields in a line.
-// If not specified, defaults to "\t"
-func InitializeTableFromTextFileV2Delimiter(value string) InitializeTableFromTextFileV2Attr {
+// value: The separator to use when joining.
+// If not specified, defaults to ""
+func ReduceJoinSeparator(value string) ReduceJoinAttr {
 	return func(m optionalAttr) {
-		m["delimiter"] = value
+		m["separator"] = value
 	}
 }
 
-// Initializes a table from a text file.
+// Joins a string Tensor across the given dimensions.
 //
-// It inserts one key-value pair into the table for each line of the file.
-// The key and value is extracted from the whole line content, elements from the
-// split line based on `delimiter` or the line number (starting from zero).
-// Where to extract the key and value from a line is specified by `key_index` and
-// `value_index`.
+// Computes the string join across dimensions in the given string Tensor of shape
+// `[d_0, d_1, ..., d_n-1]`.  Returns a new Tensor created by joining the input
+// strings with the given separator (default: empty string).  Negative indices are
+// counted backwards from the end, with `-1` being equivalent to `n - 1`.
 //
-// - A value of -1 means use the line number(starting from zero), expects `int64`.
-// - A value of -2 means use the whole line content, expects `string`.
-// - A value >= 0 means use the index (starting at zero) of the split line based
-//   on `delimiter`.
+// For example:
+//
+// ```python
+// # tensor `a` is [["a", "b"], ["c", "d"]]
+// tf.reduce_join(a, 0) ==> ["ac", "bd"]
+// tf.reduce_join(a, 1) ==> ["ab", "cd"]
+// tf.reduce_join(a, -2) = tf.reduce_join(a, 0) ==> ["ac", "bd"]
+// tf.reduce_join(a, -1) = tf.reduce_join(a, 1) ==> ["ab", "cd"]
+// tf.reduce_join(a, 0, keep_dims=True) ==> [["ac", "bd"]]
+// tf.reduce_join(a, 1, keep_dims=True) ==> [["ab"], ["cd"]]
+// tf.reduce_join(a, 0, separator=".") ==> ["a.c", "b.d"]
+// tf.reduce_join(a, [0, 1]) ==> ["acbd"]
+// tf.reduce_join(a, [1, 0]) ==> ["abcd"]
+// tf.reduce_join(a, []) ==> ["abcd"]
+// ```
 //
 // Arguments:
-//	table_handle: Handle to a table which will be initialized.
-//	filename: Filename of a vocabulary text file.
-//	key_index: Column index in a line to get the table `key` values from.
-//	value_index: Column index that represents information of a line to get the table
-// `value` values from.
+//	inputs: The input to be joined.  All reduced indices must have non-zero size.
+//	reduction_indices: The dimensions to reduce over.  Dimensions are reduced in the
+// order specified.  Omitting `reduction_indices` is equivalent to passing
+// `[n-1, n-2, ..., 0]`.  Negative indices from `-n` to `-1` are supported.
 //
-// Returns the created operation.
-func InitializeTableFromTextFileV2(scope *Scope, table_handle tf.Output, filename tf.Output, key_index int64, value_index int64, optional ...InitializeTableFromTextFileV2Attr) (o *tf.Operation) {
+// Returns a Tensor with shape equal to that of the input with reduced dimensions removed or
+// set to `1` depending on `keep_dims`.
+func ReduceJoin(scope *Scope, inputs tf.Output, reduction_indices tf.Output, optional ...ReduceJoinAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"key_index": key_index, "value_index": value_index}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "InitializeTableFromTextFileV2",
+		Type: "ReduceJoin",
 		Input: []tf.Input{
-			table_handle, filename,
+			inputs, reduction_indices,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// QuantizedReluAttr is an optional argument to QuantizedRelu.
-type QuantizedReluAttr func(optionalAttr)
+// Computes cos of x element-wise.
+func Cos(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Cos",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
 
-// QuantizedReluOutType sets the optional out_type attribute to value.
-// If not specified, defaults to DT_QUINT8
-func QuantizedReluOutType(value tf.DataType) QuantizedReluAttr {
+// FusedBatchNormGradAttr is an optional argument to FusedBatchNormGrad.
+type FusedBatchNormGradAttr func(optionalAttr)
+
+// FusedBatchNormGradEpsilon sets the optional epsilon attribute to value.
+//
+// value: A small float number added to the variance of x.
+// If not specified, defaults to 0.0001
+func FusedBatchNormGradEpsilon(value float32) FusedBatchNormGradAttr {
 	return func(m optionalAttr) {
-		m["out_type"] = value
+		m["epsilon"] = value
 	}
 }
 
-// Computes Quantized Rectified Linear: `max(features, 0)`
+// FusedBatchNormGradDataFormat sets the optional data_format attribute to value.
 //
-// Arguments:
+// value: The data format for y_backprop, x, x_backprop.
+// Either "NHWC" (default) or "NCHW".
+// If not specified, defaults to "NHWC"
+func FusedBatchNormGradDataFormat(value string) FusedBatchNormGradAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// FusedBatchNormGradIsTraining sets the optional is_training attribute to value.
 //
-//	min_features: The float value that the lowest quantized value represents.
-//	max_features: The float value that the highest quantized value represents.
+// value: A bool value to indicate the operation is for training (default)
+// or inference.
+// If not specified, defaults to true
+func FusedBatchNormGradIsTraining(value bool) FusedBatchNormGradAttr {
+	return func(m optionalAttr) {
+		m["is_training"] = value
+	}
+}
+
+// Gradient for batch normalization.
 //
-// Returns Has the same output shape as "features".The float value that the lowest quantized value represents.The float value that the highest quantized value represents.
-func QuantizedRelu(scope *Scope, features tf.Output, min_features tf.Output, max_features tf.Output, optional ...QuantizedReluAttr) (activations tf.Output, min_activations tf.Output, max_activations tf.Output) {
+// Note that the size of 4D Tensors is defined by either "NHWC" or "NCHW".
+// The size of 1D Tensors matches the dimension C of the 4D Tensors.
+//
+// Arguments:
+//	y_backprop: A 4D Tensor for the gradient with respect to y.
+//	x: A 4D Tensor for input data.
+//	scale: A 1D Tensor for scaling factor, to scale the normalized x.
+//	reserve_space_1: When is_training is True, a 1D Tensor for the computed batch
+// mean to be reused in gradient computation. When is_training is
+// False, a 1D Tensor for the population mean to be reused in both
+// 1st and 2nd order gradient computation.
+//	reserve_space_2: When is_training is True, a 1D Tensor for the computed batch
+// variance (inverted variance in the cuDNN case) to be reused in
+// gradient computation. When is_training is False, a 1D Tensor
+// for the population variance to be reused in both 1st and 2nd
+// order gradient computation.
+//
+// Returns A 4D Tensor for the gradient with respect to x. A 1D Tensor for the gradient with respect to scale. A 1D Tensor for the gradient with respect to offset. Unused placeholder to match the mean input in FusedBatchNorm. Unused placeholder to match the variance input
+// in FusedBatchNorm.
+func FusedBatchNormGrad(scope *Scope, y_backprop tf.Output, x tf.Output, scale tf.Output, reserve_space_1 tf.Output, reserve_space_2 tf.Output, optional ...FusedBatchNormGradAttr) (x_backprop tf.Output, scale_backprop tf.Output, offset_backprop tf.Output, reserve_space_3 tf.Output, reserve_space_4 tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -10571,116 +10814,119 @@ func QuantizedRelu(scope *Scope, features tf.Output, min_features tf.Output, max
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QuantizedRelu",
+		Type: "FusedBatchNormGrad",
 		Input: []tf.Input{
-			features, min_features, max_features,
+			y_backprop, x, scale, reserve_space_1, reserve_space_2,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
 }
 
-// Reshapes a SparseTensor to represent values in a new dense shape.
+// TopKAttr is an optional argument to TopK.
+type TopKAttr func(optionalAttr)
+
+// TopKSorted sets the optional sorted attribute to value.
 //
-// This operation has the same semantics as reshape on the represented dense
-// tensor.  The `input_indices` are recomputed based on the requested `new_shape`.
+// value: If true the resulting `k` elements will be sorted by the values in
+// descending order.
+// If not specified, defaults to true
+func TopKSorted(value bool) TopKAttr {
+	return func(m optionalAttr) {
+		m["sorted"] = value
+	}
+}
+
+// Finds values and indices of the `k` largest elements for the last dimension.
 //
-// If one component of `new_shape` is the special value -1, the size of that
-// dimension is computed so that the total dense size remains constant.  At
-// most one component of `new_shape` can be -1.  The number of dense elements
-// implied by `new_shape` must be the same as the number of dense elements
-// originally implied by `input_shape`.
+// DEPRECATED at GraphDef version 7: Use TopKV2 instead
 //
-// Reshaping does not affect the order of values in the SparseTensor.
+// If the input is a vector (rank-1), finds the `k` largest entries in the vector
+// and outputs their values and indices as vectors.  Thus `values[j]` is the
+// `j`-th largest entry in `input`, and its index is `indices[j]`.
 //
-// If the input tensor has rank `R_in` and `N` non-empty values, and `new_shape`
-// has length `R_out`, then `input_indices` has shape `[N, R_in]`,
-// `input_shape` has length `R_in`, `output_indices` has shape `[N, R_out]`, and
-// `output_shape` has length `R_out`.
+// For matrices (resp. higher rank input), computes the top `k` entries in each
+// row (resp. vector along the last dimension).  Thus,
+//
+//     values.shape = indices.shape = input.shape[:-1] + [k]
+//
+// If two elements are equal, the lower-index element appears first.
+//
+// If `k` varies dynamically, use `TopKV2` below.
 //
 // Arguments:
-//	input_indices: 2-D.  `N x R_in` matrix with the indices of non-empty values in a
-// SparseTensor.
-//	input_shape: 1-D.  `R_in` vector with the input SparseTensor's dense shape.
-//	new_shape: 1-D.  `R_out` vector with the requested new dense shape.
+//	input: 1-D or higher with last dimension at least `k`.
+//	k: Number of top elements to look for along the last dimension (along each
+// row for matrices).
 //
-// Returns 2-D.  `N x R_out` matrix with the updated indices of non-empty
-// values in the output SparseTensor.1-D.  `R_out` vector with the full dense shape of the output
-// SparseTensor.  This is the same as `new_shape` but with any -1 dimensions
-// filled in.
-func SparseReshape(scope *Scope, input_indices tf.Output, input_shape tf.Output, new_shape tf.Output) (output_indices tf.Output, output_shape tf.Output) {
+// Returns The `k` largest elements along each last dimensional slice. The indices of `values` within the last dimension of `input`.
+func TopK(scope *Scope, input tf.Output, k int64, optional ...TopKAttr) (values tf.Output, indices tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"k": k}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SparseReshape",
+		Type: "TopK",
 		Input: []tf.Input{
-			input_indices, input_shape, new_shape,
+			input,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0), op.Output(1)
 }
 
-// Deprecated. Use TensorArraySplitV3
+// Compute the Hurwitz zeta function \\(\zeta(x, q)\\).
 //
-// DEPRECATED at GraphDef version 26: Use TensorArraySplitV3
-func TensorArraySplitV2(scope *Scope, handle tf.Output, value tf.Output, lengths tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+// The Hurwitz zeta function is defined as:
+//
+//
+// \\(\zeta(x, q) = \sum_{n=0}^{\infty} (q + n)^{-x}\\)
+func Zeta(scope *Scope, x tf.Output, q tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArraySplitV2",
+		Type: "Zeta",
 		Input: []tf.Input{
-			handle, value, lengths, flow_in,
+			x, q,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// PackAttr is an optional argument to Pack.
-type PackAttr func(optionalAttr)
+// ProdAttr is an optional argument to Prod.
+type ProdAttr func(optionalAttr)
 
-// PackAxis sets the optional axis attribute to value.
+// ProdKeepDims sets the optional keep_dims attribute to value.
 //
-// value: Dimension along which to pack.  Negative values wrap around, so the
-// valid range is `[-(R+1), R+1)`.
-// If not specified, defaults to 0
-func PackAxis(value int64) PackAttr {
+// value: If true, retain reduced dimensions with length 1.
+// If not specified, defaults to false
+func ProdKeepDims(value bool) ProdAttr {
 	return func(m optionalAttr) {
-		m["axis"] = value
+		m["keep_dims"] = value
 	}
 }
 
-// Packs a list of `N` rank-`R` tensors into one rank-`(R+1)` tensor.
-//
-// Packs the `N` tensors in `values` into a tensor with rank one higher than each
-// tensor in `values`, by packing them along the `axis` dimension.
-// Given a list of tensors of shape `(A, B, C)`;
-//
-// if `axis == 0` then the `output` tensor will have the shape `(N, A, B, C)`.
-// if `axis == 1` then the `output` tensor will have the shape `(A, N, B, C)`.
-// Etc.
-//
-// For example:
-//
-// ```
-// # 'x' is [1, 4]
-// # 'y' is [2, 5]
-// # 'z' is [3, 6]
-// pack([x, y, z]) => [[1, 4], [2, 5], [3, 6]]  # Pack along first dim.
-// pack([x, y, z], axis=1) => [[1, 2, 3], [4, 5, 6]]
-// ```
+// Computes the product of elements across dimensions of a tensor.
 //
-// This is the opposite of `unpack`.
+// Reduces `input` along the dimensions given in `axis`. Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `axis`. If `keep_dims` is true, the reduced dimensions are
+// retained with length 1.
 //
 // Arguments:
-//	values: Must be of same shape and type.
+//	input: The tensor to reduce.
+//	axis: The dimensions to reduce. Must be in the range
+// `[-rank(input), rank(input))`.
 //
-// Returns The packed tensor.
-func Pack(scope *Scope, values []tf.Output, optional ...PackAttr) (output tf.Output) {
+// Returns The reduced tensor.
+func Prod(scope *Scope, input tf.Output, axis tf.Output, optional ...ProdAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -10689,9 +10935,9 @@ func Pack(scope *Scope, values []tf.Output, optional ...PackAttr) (output tf.Out
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Pack",
+		Type: "Prod",
 		Input: []tf.Input{
-			tf.OutputList(values),
+			input, axis,
 		},
 		Attrs: attrs,
 	}
@@ -10699,86 +10945,114 @@ func Pack(scope *Scope, values []tf.Output, optional ...PackAttr) (output tf.Out
 	return op.Output(0)
 }
 
-// Reorders a SparseTensor into the canonical, row-major ordering.
-//
-// Note that by convention, all sparse ops preserve the canonical ordering along
-// increasing dimension number. The only time ordering can be violated is during
-// manual manipulation of the indices and values vectors to add entries.
+// FusedResizeAndPadConv2DAttr is an optional argument to FusedResizeAndPadConv2D.
+type FusedResizeAndPadConv2DAttr func(optionalAttr)
+
+// FusedResizeAndPadConv2DResizeAlignCorners sets the optional resize_align_corners attribute to value.
 //
-// Reordering does not affect the shape of the SparseTensor.
+// value: If true, rescale input by (new_height - 1) / (height - 1),
+// which exactly aligns the 4 corners of images and resized images. If false, rescale
+// by new_height / height. Treat similarly the width dimension.
+// If not specified, defaults to false
+func FusedResizeAndPadConv2DResizeAlignCorners(value bool) FusedResizeAndPadConv2DAttr {
+	return func(m optionalAttr) {
+		m["resize_align_corners"] = value
+	}
+}
+
+// Performs a resize and padding as a preprocess during a convolution.
 //
-// If the tensor has rank `R` and `N` non-empty values, `input_indices` has
-// shape `[N, R]`, input_values has length `N`, and input_shape has length `R`.
+// It's often possible to do spatial transformations more efficiently as part of
+// the packing stage of a convolution, so this op allows for an optimized
+// implementation where these stages are fused together. This prevents the need to
+// write out the intermediate results as whole tensors, reducing memory pressure,
+// and we can get some latency gains by merging the transformation calculations.
+// The data_format attribute for Conv2D isn't supported by this op, and defaults to
+// 'NHWC' order.
+// Internally this op uses a single per-graph scratch buffer, which means that it
+// will block if multiple versions are being run in parallel. This is because this
+// operator is primarily an optimization to minimize memory usage.
 //
 // Arguments:
-//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, possibly not in canonical ordering.
-//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
-//	input_shape: 1-D.  Shape of the input SparseTensor.
+//	input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
+//	size: A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
+// new size for the images.
+//	paddings: A two-column matrix specifying the padding sizes. The number of
+// rows must be the same as the rank of `input`.
+//	filter: 4-D with shape
+// `[filter_height, filter_width, in_channels, out_channels]`.
 //
-// Returns 2-D.  `N x R` matrix with the same indices as input_indices, but
-// in canonical row-major ordering.1-D.  `N` non-empty values corresponding to `output_indices`.
-func SparseReorder(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output) (output_indices tf.Output, output_values tf.Output) {
-	if scope.Err() != nil {
+//	strides: 1-D of length 4.  The stride of the sliding window for each dimension
+// of `input`. Must be in the same order as the dimension specified with format.
+//	padding: The type of padding algorithm to use.
+func FusedResizeAndPadConv2D(scope *Scope, input tf.Output, size tf.Output, paddings tf.Output, filter tf.Output, mode string, strides []int64, padding string, optional ...FusedResizeAndPadConv2DAttr) (output tf.Output) {
+	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"mode": mode, "strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SparseReorder",
+		Type: "FusedResizeAndPadConv2D",
 		Input: []tf.Input{
-			input_indices, input_values, input_shape,
+			input, size, paddings, filter,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Computes rectified linear: `max(features, 0)`.
-func Relu(scope *Scope, features tf.Output) (activations tf.Output) {
+// Transforms a Tensor into a serialized TensorProto proto.
+//
+// Arguments:
+//	tensor: A Tensor of type `T`.
+//
+// Returns A serialized TensorProto proto of the input tensor.
+func SerializeTensor(scope *Scope, tensor tf.Output) (serialized tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Relu",
+		Type: "SerializeTensor",
 		Input: []tf.Input{
-			features,
+			tensor,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceApplyAddSignAttr is an optional argument to ResourceApplyAddSign.
-type ResourceApplyAddSignAttr func(optionalAttr)
+// MatrixSolveAttr is an optional argument to MatrixSolve.
+type MatrixSolveAttr func(optionalAttr)
 
-// ResourceApplyAddSignUseLocking sets the optional use_locking attribute to value.
+// MatrixSolveAdjoint sets the optional adjoint attribute to value.
 //
-// value: If `True`, updating of the var and m tensors is
-// protected by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
+// value: Boolean indicating whether to solve with `matrix` or its (block-wise)
+// adjoint.
 // If not specified, defaults to false
-func ResourceApplyAddSignUseLocking(value bool) ResourceApplyAddSignAttr {
+func MatrixSolveAdjoint(value bool) MatrixSolveAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["adjoint"] = value
 	}
 }
 
-// Update '*var' according to the AddSign update.
+// Solves systems of linear equations.
 //
-// m_t <- beta1 * m_{t-1} + (1 - beta1) * g
-// update <- (alpha + sign_decay * sign(g) *sign(m)) * g
-// variable <- variable - lr_t * update
+// `Matrix` is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices. `Rhs` is a tensor of shape `[..., M, K]`. The `output` is
+// a tensor of shape `[..., M, K]`.  If `adjoint` is `False` then each output matrix
+// satisfies `matrix[..., :, :] * output[..., :, :] = rhs[..., :, :]`.
+// If `adjoint` is `True` then each output matrix satisfies
+// `adjoint(matrix[..., :, :]) * output[..., :, :] = rhs[..., :, :]`.
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	m: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	alpha: Must be a scalar.
-//	sign_decay: Must be a scalar.
-//	beta: Must be a scalar.
-//	grad: The gradient.
+//	matrix: Shape is `[..., M, M]`.
+//	rhs: Shape is `[..., M, K]`.
 //
-// Returns the created operation.
-func ResourceApplyAddSign(scope *Scope, var_ tf.Output, m tf.Output, lr tf.Output, alpha tf.Output, sign_decay tf.Output, beta tf.Output, grad tf.Output, optional ...ResourceApplyAddSignAttr) (o *tf.Operation) {
+// Returns Shape is `[..., M, K]`.
+func MatrixSolve(scope *Scope, matrix tf.Output, rhs tf.Output, optional ...MatrixSolveAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -10787,96 +11061,164 @@ func ResourceApplyAddSign(scope *Scope, var_ tf.Output, m tf.Output, lr tf.Outpu
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyAddSign",
+		Type: "MatrixSolve",
 		Input: []tf.Input{
-			var_, m, lr, alpha, sign_decay, beta, grad,
+			matrix, rhs,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// FractionalMaxPoolGradAttr is an optional argument to FractionalMaxPoolGrad.
-type FractionalMaxPoolGradAttr func(optionalAttr)
-
-// FractionalMaxPoolGradOverlapping sets the optional overlapping attribute to value.
+// Inverse 3D fast Fourier transform.
 //
-// value: When set to True, it means when pooling, the values at the boundary
-// of adjacent pooling cells are used by both cells. For example:
+// Computes the inverse 3-dimensional discrete Fourier transform over the
+// inner-most 3 dimensions of `input`.
 //
-// `index  0  1  2  3  4`
+// Arguments:
+//	input: A complex64 tensor.
 //
-// `value  20 5  16 3  7`
+// Returns A complex64 tensor of the same shape as `input`. The inner-most 3
+//   dimensions of `input` are replaced with their inverse 3D Fourier transform.
 //
-// If the pooling sequence is [0, 2, 4], then 16, at index 2 will be used twice.
-// The result would be [20, 16] for fractional max pooling.
-// If not specified, defaults to false
-func FractionalMaxPoolGradOverlapping(value bool) FractionalMaxPoolGradAttr {
-	return func(m optionalAttr) {
-		m["overlapping"] = value
+// @compatibility(numpy)
+// Equivalent to np.fft.ifftn with 3 dimensions.
+// @end_compatibility
+func IFFT3D(scope *Scope, input tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "IFFT3D",
+		Input: []tf.Input{
+			input,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Computes gradient of the FractionalMaxPool function.
+// Adds `bias` to `value`.
+//
+// This is a deprecated version of BiasAdd and will soon be removed.
+//
+// This is a special case of `tf.add` where `bias` is restricted to be 1-D.
+// Broadcasting is supported, so `value` may have any number of dimensions.
 //
 // Arguments:
-//	orig_input: Original input for `fractional_max_pool`
-//	orig_output: Original output for `fractional_max_pool`
-//	out_backprop: 4-D with shape `[batch, height, width, channels]`.  Gradients
-// w.r.t. the output of `fractional_max_pool`.
-//	row_pooling_sequence: row pooling sequence, form pooling region with
-// col_pooling_sequence.
-//	col_pooling_sequence: column pooling sequence, form pooling region with
-// row_pooling sequence.
+//	value: Any number of dimensions.
+//	bias: 1-D with size the last dimension of `value`.
 //
-// Returns 4-D.  Gradients w.r.t. the input of `fractional_max_pool`.
-func FractionalMaxPoolGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, out_backprop tf.Output, row_pooling_sequence tf.Output, col_pooling_sequence tf.Output, optional ...FractionalMaxPoolGradAttr) (output tf.Output) {
+// Returns Broadcasted sum of `value` and `bias`.
+func BiasAddV1(scope *Scope, value tf.Output, bias tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
+	opspec := tf.OpSpec{
+		Type: "BiasAddV1",
+		Input: []tf.Input{
+			value, bias,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Reverses specific dimensions of a tensor.
+//
+// NOTE `tf.reverse` has now changed behavior in preparation for 1.0.
+// `tf.reverse_v2` is currently an alias that will be deprecated before TF 1.0.
+//
+// Given a `tensor` and an `int32` tensor `axis` representing the set of
+// dimensions of `tensor` to reverse, this operation reverses each dimension
+// `i` for which there exists `j` s.t. `axis[j] == i`.
+//
+// `tensor` can have up to 8 dimensions. The number of dimensions specified
+// in `axis` may be 0 or more. If an index is specified more than
+// once, an InvalidArgument error is raised.
+//
+// For example:
+//
+// ```
+// # tensor 't' is [[[[ 0,  1,  2,  3],
+// #                  [ 4,  5,  6,  7],
+// #                  [ 8,  9, 10, 11]],
+// #                 [[12, 13, 14, 15],
+// #                  [16, 17, 18, 19],
+// #                  [20, 21, 22, 23]]]]
+// # tensor 't' shape is [1, 2, 3, 4]
+//
+// # 'dims' is '[3]' (or 'dims' is '[-1]')
+// reverse(t, dims) ==> [[[[ 3,  2,  1,  0],
+//                         [ 7,  6,  5,  4],
+//                         [11, 10,  9,  8]],
+//                        [[15, 14, 13, 12],
+//                         [19, 18, 17, 16],
+//                         [23, 22, 21, 20]]]]
+//
+// # 'dims' is '[1]' (or 'dims' is '[-3]')
+// reverse(t, dims) ==> [[[[12, 13, 14, 15],
+//                         [16, 17, 18, 19],
+//                         [20, 21, 22, 23]],
+//                        [[ 0,  1,  2,  3],
+//                         [ 4,  5,  6,  7],
+//                         [ 8,  9, 10, 11]]]]
+//
+// # 'dims' is '[2]' (or 'dims' is '[-2]')
+// reverse(t, dims) ==> [[[[8, 9, 10, 11],
+//                         [4, 5, 6, 7],
+//                         [0, 1, 2, 3]],
+//                        [[20, 21, 22, 23],
+//                         [16, 17, 18, 19],
+//                         [12, 13, 14, 15]]]]
+// ```
+//
+// Arguments:
+//	tensor: Up to 8-D.
+//	axis: 1-D. The indices of the dimensions to reverse. Must be in the range
+// `[-rank(tensor), rank(tensor))`.
+//
+// Returns The same shape as `tensor`.
+func ReverseV2(scope *Scope, tensor tf.Output, axis tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
 	opspec := tf.OpSpec{
-		Type: "FractionalMaxPoolGrad",
+		Type: "ReverseV2",
 		Input: []tf.Input{
-			orig_input, orig_output, out_backprop, row_pooling_sequence, col_pooling_sequence,
+			tensor, axis,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceApplyAdagradDAAttr is an optional argument to ResourceApplyAdagradDA.
-type ResourceApplyAdagradDAAttr func(optionalAttr)
+// RealAttr is an optional argument to Real.
+type RealAttr func(optionalAttr)
 
-// ResourceApplyAdagradDAUseLocking sets the optional use_locking attribute to value.
-//
-// value: If True, updating of the var and accum tensors will be protected by
-// a lock; otherwise the behavior is undefined, but may exhibit less contention.
-// If not specified, defaults to false
-func ResourceApplyAdagradDAUseLocking(value bool) ResourceApplyAdagradDAAttr {
+// RealTout sets the optional Tout attribute to value.
+// If not specified, defaults to DT_FLOAT
+func RealTout(value tf.DataType) RealAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["Tout"] = value
 	}
 }
 
-// Update '*var' according to the proximal adagrad scheme.
+// Returns the real part of a complex number.
 //
-// Arguments:
-//	var_: Should be from a Variable().
-//	gradient_accumulator: Should be from a Variable().
-//	gradient_squared_accumulator: Should be from a Variable().
-//	grad: The gradient.
-//	lr: Scaling factor. Must be a scalar.
-//	l1: L1 regularization. Must be a scalar.
-//	l2: L2 regularization. Must be a scalar.
-//	global_step: Training step number. Must be a scalar.
+// Given a tensor `input` of complex numbers, this operation returns a tensor of
+// type `float` that is the real part of each element in `input`. All elements in
+// `input` must be complex numbers of the form \\(a + bj\\), where *a* is the real
+//  part returned by this operation and *b* is the imaginary part.
 //
-// Returns the created operation.
-func ResourceApplyAdagradDA(scope *Scope, var_ tf.Output, gradient_accumulator tf.Output, gradient_squared_accumulator tf.Output, grad tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, global_step tf.Output, optional ...ResourceApplyAdagradDAAttr) (o *tf.Operation) {
+// For example:
+//
+// ```
+// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
+// tf.real(input) ==> [-2.25, 3.25]
+// ```
+func Real(scope *Scope, input tf.Output, optional ...RealAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -10885,151 +11227,173 @@ func ResourceApplyAdagradDA(scope *Scope, var_ tf.Output, gradient_accumulator t
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyAdagradDA",
+		Type: "Real",
 		Input: []tf.Input{
-			var_, gradient_accumulator, gradient_squared_accumulator, grad, lr, l1, l2, global_step,
+			input,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// SparseReduceMaxSparseAttr is an optional argument to SparseReduceMaxSparse.
-type SparseReduceMaxSparseAttr func(optionalAttr)
-
-// SparseReduceMaxSparseKeepDims sets the optional keep_dims attribute to value.
+// AudioSummaryAttr is an optional argument to AudioSummary.
+type AudioSummaryAttr func(optionalAttr)
+
+// AudioSummaryMaxOutputs sets the optional max_outputs attribute to value.
 //
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func SparseReduceMaxSparseKeepDims(value bool) SparseReduceMaxSparseAttr {
+// value: Max number of batch elements to generate audio for.
+// If not specified, defaults to 3
+//
+// REQUIRES: value >= 1
+func AudioSummaryMaxOutputs(value int64) AudioSummaryAttr {
 	return func(m optionalAttr) {
-		m["keep_dims"] = value
+		m["max_outputs"] = value
 	}
 }
 
-// Computes the max of elements across dimensions of a SparseTensor.
+// Outputs a `Summary` protocol buffer with audio.
 //
-// This Op takes a SparseTensor and is the sparse counterpart to
-// `tf.reduce_max()`.  In contrast to SparseReduceMax, this Op returns a
-// SparseTensor.
+// DEPRECATED at GraphDef version 15: Use AudioSummaryV2.
 //
-// Reduces `sp_input` along the dimensions given in `reduction_axes`.  Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `reduction_axes`. If `keep_dims` is true, the reduced dimensions are retained
-// with length 1.
+// The summary has up to `max_outputs` summary values containing audio. The
+// audio is built from `tensor` which must be 3-D with shape `[batch_size,
+// frames, channels]` or 2-D with shape `[batch_size, frames]`. The values are
+// assumed to be in the range of `[-1.0, 1.0]` with a sample rate of `sample_rate`.
 //
-// If `reduction_axes` has no entries, all dimensions are reduced, and a tensor
-// with a single element is returned.  Additionally, the axes can be negative,
-// which are interpreted according to the indexing rules in Python.
+// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
+// build the `tag` of the summary values:
+//
+// *  If `max_outputs` is 1, the summary value tag is '*tag*/audio'.
+// *  If `max_outputs` is greater than 1, the summary value tags are
+//    generated sequentially as '*tag*/audio/0', '*tag*/audio/1', etc.
 //
 // Arguments:
-//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, possibly not in canonical ordering.
-//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
-//	input_shape: 1-D.  Shape of the input SparseTensor.
-//	reduction_axes: 1-D.  Length-`K` vector containing the reduction axes.
-func SparseReduceMaxSparse(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output, reduction_axes tf.Output, optional ...SparseReduceMaxSparseAttr) (output_indices tf.Output, output_values tf.Output, output_shape tf.Output) {
+//	tag: Scalar. Used to build the `tag` attribute of the summary values.
+//	tensor: 2-D of shape `[batch_size, frames]`.
+//	sample_rate: The sample rate of the signal in hertz.
+//
+// Returns Scalar. Serialized `Summary` protocol buffer.
+func AudioSummary(scope *Scope, tag tf.Output, tensor tf.Output, sample_rate float32, optional ...AudioSummaryAttr) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"sample_rate": sample_rate}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseReduceMaxSparse",
+		Type: "AudioSummary",
 		Input: []tf.Input{
-			input_indices, input_values, input_shape, reduction_axes,
+			tag, tensor,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// Creates a dataset that emits the outputs of `input_dataset` `count` times.
+// QrAttr is an optional argument to Qr.
+type QrAttr func(optionalAttr)
+
+// QrFullMatrices sets the optional full_matrices attribute to value.
 //
-// Arguments:
+// value: If true, compute full-sized `q` and `r`. If false
+// (the default), compute only the leading `P` columns of `q`.
+// If not specified, defaults to false
+func QrFullMatrices(value bool) QrAttr {
+	return func(m optionalAttr) {
+		m["full_matrices"] = value
+	}
+}
+
+// Computes the QR decompositions of one or more matrices.
 //
-//	count: A scalar representing the number of times that `input_dataset` should
-// be repeated. A value of `-1` indicates that it should be repeated infinitely.
+// Computes the QR decomposition of each inner matrix in `tensor` such that
+// `tensor[..., :, :] = q[..., :, :] * r[..., :, :]`.
+//
+// ```python
+// # a is a tensor.
+// # q is a tensor of orthonormal matrices.
+// # r is a tensor of upper triangular matrices.
+// q, r = qr(a)
+// q_full, r_full = qr(a, full_matrices=True)
+// ```
 //
+// Arguments:
+//	input: A tensor of shape `[..., M, N]` whose inner-most 2 dimensions
+// form matrices of size `[M, N]`. Let `P` be the minimum of `M` and `N`.
 //
-func RepeatDataset(scope *Scope, input_dataset tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// Returns `q`, an orthonormal basis for the range of `input`. If `full_matrices` is
+// `False` then its shape is `[..., M, P]`; if `full_matrices` is `True` then its shape
+// is `[..., M, M]`. Also returns `r`, the triangular factor. If `full_matrices` is
+// `False` then its shape is `[..., P, N]`; if `True` then its shape is `[..., M, N]`.
+func Qr(scope *Scope, input tf.Output, optional ...QrAttr) (q tf.Output, r tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "RepeatDataset",
+		Type: "Qr",
 		Input: []tf.Input{
-			input_dataset, count,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// AddManySparseToTensorsMapAttr is an optional argument to AddManySparseToTensorsMap.
-type AddManySparseToTensorsMapAttr func(optionalAttr)
-
-// AddManySparseToTensorsMapContainer sets the optional container attribute to value.
-//
-// value: The container name for the `SparseTensorsMap` created by this op.
-// If not specified, defaults to ""
-func AddManySparseToTensorsMapContainer(value string) AddManySparseToTensorsMapAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+// Records the size in bytes of each element of `input_dataset` in a StatsAggregator.
+func BytesProducedStatsDataset(scope *Scope, input_dataset tf.Output, tag tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "BytesProducedStatsDataset",
+		Input: []tf.Input{
+			input_dataset, tag,
+		},
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// AddManySparseToTensorsMapSharedName sets the optional shared_name attribute to value.
+// ResourceSparseApplyProximalGradientDescentAttr is an optional argument to ResourceSparseApplyProximalGradientDescent.
+type ResourceSparseApplyProximalGradientDescentAttr func(optionalAttr)
+
+// ResourceSparseApplyProximalGradientDescentUseLocking sets the optional use_locking attribute to value.
 //
-// value: The shared name for the `SparseTensorsMap` created by this op.
-// If blank, the new Operation's unique name is used.
-// If not specified, defaults to ""
-func AddManySparseToTensorsMapSharedName(value string) AddManySparseToTensorsMapAttr {
+// value: If True, the subtraction will be protected by a lock;
+// otherwise the behavior is undefined, but may exhibit less contention.
+// If not specified, defaults to false
+func ResourceSparseApplyProximalGradientDescentUseLocking(value bool) ResourceSparseApplyProximalGradientDescentAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Add an `N`-minibatch `SparseTensor` to a `SparseTensorsMap`, return `N` handles.
-//
-// A `SparseTensor` of rank `R` is represented by three tensors: `sparse_indices`,
-// `sparse_values`, and `sparse_shape`, where
-//
-// ```sparse_indices.shape[1] == sparse_shape.shape[0] == R```
-//
-// An `N`-minibatch of `SparseTensor` objects is represented as a `SparseTensor`
-// having a first `sparse_indices` column taking values between `[0, N)`, where
-// the minibatch size `N == sparse_shape[0]`.
-//
-// The input `SparseTensor` must have rank `R` greater than 1, and the first
-// dimension is treated as the minibatch dimension.  Elements of the `SparseTensor`
-// must be sorted in increasing order of this first dimension.  The stored
-// `SparseTensor` objects pointed to by each row of the output `sparse_handles`
-// will have rank `R-1`.
+// Sparsely updates '*var' using the FOBOS algorithm with a fixed learning rate.
 //
-// The `SparseTensor` values can then be read out as part of a minibatch by passing
-// the given keys as vector elements to `TakeManySparseFromTensorsMap`.  To ensure
-// the correct `SparseTensorsMap` is accessed, ensure that the same
-// `container` and `shared_name` are passed to that Op.  If no `shared_name`
-// is provided here, instead use the *name* of the Operation created by calling
-// `AddManySparseToTensorsMap` as the `shared_name` passed to
-// `TakeManySparseFromTensorsMap`.  Ensure the Operations are colocated.
+// That is, for the rows for which we have grad, we update var as follows:
+// prox_v = var - alpha * grad
+// var = sign(prox_v)/(1+alpha*l2) * max{|prox_v|-alpha*l1,0}
 //
 // Arguments:
-//	sparse_indices: 2-D.  The `indices` of the minibatch `SparseTensor`.
-// `sparse_indices[:, 0]` must be ordered values in `[0, N)`.
-//	sparse_values: 1-D.  The `values` of the minibatch `SparseTensor`.
-//	sparse_shape: 1-D.  The `shape` of the minibatch `SparseTensor`.
-// The minibatch size `N == sparse_shape[0]`.
+//	var_: Should be from a Variable().
+//	alpha: Scaling factor. Must be a scalar.
+//	l1: L1 regularization. Must be a scalar.
+//	l2: L2 regularization. Must be a scalar.
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var.
 //
-// Returns 1-D.  The handles of the `SparseTensor` now stored in the
-// `SparseTensorsMap`.  Shape: `[N]`.
-func AddManySparseToTensorsMap(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...AddManySparseToTensorsMapAttr) (sparse_handles tf.Output) {
+// Returns the created operation.
+func ResourceSparseApplyProximalGradientDescent(scope *Scope, var_ tf.Output, alpha tf.Output, l1 tf.Output, l2 tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyProximalGradientDescentAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -11038,30 +11402,29 @@ func AddManySparseToTensorsMap(scope *Scope, sparse_indices tf.Output, sparse_va
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "AddManySparseToTensorsMap",
+		Type: "ResourceSparseApplyProximalGradientDescent",
 		Input: []tf.Input{
-			sparse_indices, sparse_values, sparse_shape,
+			var_, alpha, l1, l2, grad, indices,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// MinAttr is an optional argument to Min.
-type MinAttr func(optionalAttr)
+// MeanAttr is an optional argument to Mean.
+type MeanAttr func(optionalAttr)
 
-// MinKeepDims sets the optional keep_dims attribute to value.
+// MeanKeepDims sets the optional keep_dims attribute to value.
 //
 // value: If true, retain reduced dimensions with length 1.
 // If not specified, defaults to false
-func MinKeepDims(value bool) MinAttr {
+func MeanKeepDims(value bool) MeanAttr {
 	return func(m optionalAttr) {
 		m["keep_dims"] = value
 	}
 }
 
-// Computes the minimum of elements across dimensions of a tensor.
+// Computes the mean of elements across dimensions of a tensor.
 //
 // Reduces `input` along the dimensions given in `axis`. Unless
 // `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
@@ -11074,7 +11437,7 @@ func MinKeepDims(value bool) MinAttr {
 // `[-rank(input), rank(input))`.
 //
 // Returns The reduced tensor.
-func Min(scope *Scope, input tf.Output, axis tf.Output, optional ...MinAttr) (output tf.Output) {
+func Mean(scope *Scope, input tf.Output, axis tf.Output, optional ...MeanAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -11083,7 +11446,7 @@ func Min(scope *Scope, input tf.Output, axis tf.Output, optional ...MinAttr) (ou
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Min",
+		Type: "Mean",
 		Input: []tf.Input{
 			input, axis,
 		},
@@ -11093,323 +11456,208 @@ func Min(scope *Scope, input tf.Output, axis tf.Output, optional ...MinAttr) (ou
 	return op.Output(0)
 }
 
-// Shuffle dimensions of x according to a permutation.
+// InitializeTableFromTextFileV2Attr is an optional argument to InitializeTableFromTextFileV2.
+type InitializeTableFromTextFileV2Attr func(optionalAttr)
+
+// InitializeTableFromTextFileV2VocabSize sets the optional vocab_size attribute to value.
 //
-// The output `y` has the same rank as `x`. The shapes of `x` and `y` satisfy:
-//   `y.shape[i] == x.shape[perm[i]] for i in [0, 1, ..., rank(x) - 1]`
-func Transpose(scope *Scope, x tf.Output, perm tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Transpose",
-		Input: []tf.Input{
-			x, perm,
-		},
+// value: Number of elements of the file, use -1 if unknown.
+// If not specified, defaults to -1
+//
+// REQUIRES: value >= -1
+func InitializeTableFromTextFileV2VocabSize(value int64) InitializeTableFromTextFileV2Attr {
+	return func(m optionalAttr) {
+		m["vocab_size"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// DepthwiseConv2dNativeBackpropFilterAttr is an optional argument to DepthwiseConv2dNativeBackpropFilter.
-type DepthwiseConv2dNativeBackpropFilterAttr func(optionalAttr)
-
-// DepthwiseConv2dNativeBackpropFilterDataFormat sets the optional data_format attribute to value.
+// InitializeTableFromTextFileV2Delimiter sets the optional delimiter attribute to value.
 //
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, height, width, channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, channels, height, width].
-// If not specified, defaults to "NHWC"
-func DepthwiseConv2dNativeBackpropFilterDataFormat(value string) DepthwiseConv2dNativeBackpropFilterAttr {
+// value: Delimiter to separate fields in a line.
+// If not specified, defaults to "\t"
+func InitializeTableFromTextFileV2Delimiter(value string) InitializeTableFromTextFileV2Attr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["delimiter"] = value
 	}
 }
 
-// DepthwiseConv2dNativeBackpropFilterDilations sets the optional dilations attribute to value.
+// Initializes a table from a text file.
 //
-// value: 1-D tensor of length 4.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each filter
-// element on that dimension. The dimension order is determined by the value of
-// `data_format`, see above for details. Dilations in the batch and depth
-// dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 >
-func DepthwiseConv2dNativeBackpropFilterDilations(value []int64) DepthwiseConv2dNativeBackpropFilterAttr {
-	return func(m optionalAttr) {
-		m["dilations"] = value
-	}
-}
-
-// Computes the gradients of depthwise convolution with respect to the filter.
+// It inserts one key-value pair into the table for each line of the file.
+// The key and value are each extracted from the whole line content, from an element
+// of the line split on `delimiter`, or from the line number (starting from zero).
+// Where to extract the key and value from a line is specified by `key_index` and
+// `value_index`.
+//
+// - A value of -1 means use the line number (starting from zero), expects `int64`.
+// - A value of -2 means use the whole line content, expects `string`.
+// - A value >= 0 means use the index (starting at zero) of the split line based
+//   on `delimiter`.
 //
 // Arguments:
-//	input: 4-D with shape based on `data_format`.  For example, if
-// `data_format` is 'NHWC' then `input` is a 4-D `[batch, in_height,
-// in_width, in_channels]` tensor.
-//	filter_sizes: An integer vector representing the tensor shape of `filter`,
-// where `filter` is a 4-D
-// `[filter_height, filter_width, in_channels, depthwise_multiplier]` tensor.
-//	out_backprop: 4-D with shape  based on `data_format`.
-// For example, if `data_format` is 'NHWC' then
-// out_backprop shape is `[batch, out_height, out_width, out_channels]`.
-// Gradients w.r.t. the output of the convolution.
-//	strides: The stride of the sliding window for each dimension of the input
-// of the convolution.
-//	padding: The type of padding algorithm to use.
+//	table_handle: Handle to a table which will be initialized.
+//	filename: Filename of a vocabulary text file.
+//	key_index: Column index in a line to get the table `key` values from.
+//	value_index: Column index that identifies which part of a line to get the table
+// `value` values from.
 //
-// Returns 4-D with shape
-// `[filter_height, filter_width, in_channels, out_channels]`.  Gradient w.r.t.
-// the `filter` input of the convolution.
-func DepthwiseConv2dNativeBackpropFilter(scope *Scope, input tf.Output, filter_sizes tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...DepthwiseConv2dNativeBackpropFilterAttr) (output tf.Output) {
+// Returns the created operation.
+func InitializeTableFromTextFileV2(scope *Scope, table_handle tf.Output, filename tf.Output, key_index int64, value_index int64, optional ...InitializeTableFromTextFileV2Attr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	attrs := map[string]interface{}{"key_index": key_index, "value_index": value_index}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DepthwiseConv2dNativeBackpropFilter",
+		Type: "InitializeTableFromTextFileV2",
 		Input: []tf.Input{
-			input, filter_sizes, out_backprop,
+			table_handle, filename,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
+}
+
+// QuantizedReluAttr is an optional argument to QuantizedRelu.
+type QuantizedReluAttr func(optionalAttr)
+
+// QuantizedReluOutType sets the optional out_type attribute to value.
+// If not specified, defaults to DT_QUINT8
+func QuantizedReluOutType(value tf.DataType) QuantizedReluAttr {
+	return func(m optionalAttr) {
+		m["out_type"] = value
+	}
 }
 
-// Flushes the writer's unwritten events.
+// Computes Quantized Rectified Linear: `max(features, 0)`
 //
 // Arguments:
-//	writer: A handle to the summary writer resource.
 //
-// Returns the created operation.
-func FlushSummaryWriter(scope *Scope, writer tf.Output) (o *tf.Operation) {
+//	min_features: The float value that the lowest quantized value represents.
+//	max_features: The float value that the highest quantized value represents.
+//
+// Returns `activations`, which has the same shape as "features"; `min_activations`, the float value that the lowest quantized value represents; and `max_activations`, the float value that the highest quantized value represents.
+func QuantizedRelu(scope *Scope, features tf.Output, min_features tf.Output, max_features tf.Output, optional ...QuantizedReluAttr) (activations tf.Output, min_activations tf.Output, max_activations tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "FlushSummaryWriter",
+		Type: "QuantizedRelu",
 		Input: []tf.Input{
-			writer,
+			features, min_features, max_features,
 		},
+		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
-}
-
-// QuantizeV2Attr is an optional argument to QuantizeV2.
-type QuantizeV2Attr func(optionalAttr)
-
-// QuantizeV2Mode sets the optional mode attribute to value.
-// If not specified, defaults to "MIN_COMBINED"
-func QuantizeV2Mode(value string) QuantizeV2Attr {
-	return func(m optionalAttr) {
-		m["mode"] = value
-	}
-}
-
-// QuantizeV2RoundMode sets the optional round_mode attribute to value.
-// If not specified, defaults to "HALF_AWAY_FROM_ZERO"
-func QuantizeV2RoundMode(value string) QuantizeV2Attr {
-	return func(m optionalAttr) {
-		m["round_mode"] = value
-	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Quantize the 'input' tensor of type float to 'output' tensor of type 'T'.
-//
-// [min_range, max_range] are scalar floats that specify the range for
-// the 'input' data. The 'mode' attribute controls exactly which calculations are
-// used to convert the float values to their quantized equivalents.  The
-// 'round_mode' attribute controls which rounding tie-breaking algorithm is used
-// when rounding float values to their quantized equivalents.
-//
-// In 'MIN_COMBINED' mode, each value of the tensor will undergo the following:
-//
-// ```
-// out[i] = (in[i] - min_range) * range(T) / (max_range - min_range)
-// if T == qint8, out[i] -= (range(T) + 1) / 2.0
-// ```
-// here `range(T) = numeric_limits<T>::max() - numeric_limits<T>::min()`
-//
-// *MIN_COMBINED Mode Example*
-//
-// Assume the input is type float and has a possible range of [0.0, 6.0] and the
-// output type is quint8 ([0, 255]). The min_range and max_range values should be
-// specified as 0.0 and 6.0. Quantizing from float to quint8 will multiply each
-// value of the input by 255/6 and cast to quint8.
-//
-// If the output type was qint8 ([-128, 127]), the operation will additionally
-// subtract each value by 128 prior to casting, so that the range of values aligns
-// with the range of qint8.
-//
-// If the mode is 'MIN_FIRST', then this approach is used:
-//
-// ```
-// num_discrete_values = 1 << (# of bits in T)
-// range_adjust = num_discrete_values / (num_discrete_values - 1)
-// range = (range_max - range_min) * range_adjust
-// range_scale = num_discrete_values / range
-// quantized = round(input * range_scale) - round(range_min * range_scale) +
-//   numeric_limits<T>::min()
-// quantized = max(quantized, numeric_limits<T>::min())
-// quantized = min(quantized, numeric_limits<T>::max())
-// ```
-//
-// The biggest difference between this and MIN_COMBINED is that the minimum range
-// is rounded first, before it's subtracted from the rounded value. With
-// MIN_COMBINED, a small bias is introduced where repeated iterations of quantizing
-// and dequantizing will introduce a larger and larger error.
-//
-// *SCALED mode Example*
-//
-// `SCALED` mode matches the quantization approach used in
-// `QuantizeAndDequantize{V2|V3}`.
-//
-// If the mode is `SCALED`, we do not use the full range of the output type,
-// choosing to elide the lowest possible value for symmetry (e.g., output range is
-// -127 to 127, not -128 to 127 for signed 8 bit quantization), so that 0.0 maps to
-// 0.
-//
-// We first find the range of values in our tensor. The
-// range we use is always centered on 0, so we find m such that
-// ```c++
-//   m = max(abs(input_min), abs(input_max))
-// ```
-//
-// Our input tensor range is then `[-m, m]`.
-//
-// Next, we choose our fixed-point quantization buckets, `[min_fixed, max_fixed]`.
-// If T is signed, this is
-// ```
-//   num_bits = sizeof(T) * 8
-//   [min_fixed, max_fixed] =
-//       [-(1 << (num_bits - 1) - 1), (1 << (num_bits - 1)) - 1]
-// ```
+// Reshapes a SparseTensor to represent values in a new dense shape.
 //
-// Otherwise, if T is unsigned, the fixed-point range is
-// ```
-//   [min_fixed, max_fixed] = [0, (1 << num_bits) - 1]
-// ```
+// This operation has the same semantics as reshape on the represented dense
+// tensor.  The `input_indices` are recomputed based on the requested `new_shape`.
 //
-// From this we compute our scaling factor, s:
-// ```c++
-//   s = (max_fixed - min_fixed) / (2 * m)
-// ```
+// If one component of `new_shape` is the special value -1, the size of that
+// dimension is computed so that the total dense size remains constant.  At
+// most one component of `new_shape` can be -1.  The number of dense elements
+// implied by `new_shape` must be the same as the number of dense elements
+// originally implied by `input_shape`.
 //
-// Now we can quantize the elements of our tensor:
-// ```c++
-// result = round(input * s)
-// ```
+// Reshaping does not affect the order of values in the SparseTensor.
 //
-// One thing to watch out for is that the operator may choose to adjust the
-// requested minimum and maximum values slightly during the quantization process,
-// so you should always use the output ports as the range for further calculations.
-// For example, if the requested minimum and maximum values are close to equal,
-// they will be separated by a small epsilon value to prevent ill-formed quantized
-// buffers from being created. Otherwise, you can end up with buffers where all the
-// quantized values map to the same float value, which causes problems for
-// operations that have to perform further calculations on them.
+// If the input tensor has rank `R_in` and `N` non-empty values, and `new_shape`
+// has length `R_out`, then `input_indices` has shape `[N, R_in]`,
+// `input_shape` has length `R_in`, `output_indices` has shape `[N, R_out]`, and
+// `output_shape` has length `R_out`.
 //
 // Arguments:
+//	input_indices: 2-D.  `N x R_in` matrix with the indices of non-empty values in a
+// SparseTensor.
+//	input_shape: 1-D.  `R_in` vector with the input SparseTensor's dense shape.
+//	new_shape: 1-D.  `R_out` vector with the requested new dense shape.
 //
-//	min_range: The minimum scalar value possibly produced for the input.
-//	max_range: The maximum scalar value possibly produced for the input.
-//
-//
-// Returns The quantized data produced from the float input.The actual minimum scalar value used for the output.The actual maximum scalar value used for the output.
-func QuantizeV2(scope *Scope, input tf.Output, min_range tf.Output, max_range tf.Output, T tf.DataType, optional ...QuantizeV2Attr) (output tf.Output, output_min tf.Output, output_max tf.Output) {
+// Returns `output_indices`: 2-D.  `N x R_out` matrix with the updated indices of
+// non-empty values in the output SparseTensor. `output_shape`: 1-D.  `R_out` vector
+// with the full dense shape of the output SparseTensor.  This is the same as
+// `new_shape` but with any -1 dimensions filled in.
+func SparseReshape(scope *Scope, input_indices tf.Output, input_shape tf.Output, new_shape tf.Output) (output_indices tf.Output, output_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"T": T}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "QuantizeV2",
+		Type: "SparseReshape",
 		Input: []tf.Input{
-			input, min_range, max_range,
+			input_indices, input_shape, new_shape,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0), op.Output(1)
 }
 
-// Component-wise divides a SparseTensor by a dense Tensor.
-//
-// *Limitation*: this Op only broadcasts the dense side to the sparse side, but not
-// the other direction.
-//
-// Arguments:
-//	sp_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, possibly not in canonical ordering.
-//	sp_values: 1-D.  `N` non-empty values corresponding to `sp_indices`.
-//	sp_shape: 1-D.  Shape of the input SparseTensor.
-//	dense: `R`-D.  The dense Tensor operand.
+// Deprecated. Use TensorArraySplitV3
 //
-// Returns 1-D.  The `N` values that are operated on.
-func SparseDenseCwiseDiv(scope *Scope, sp_indices tf.Output, sp_values tf.Output, sp_shape tf.Output, dense tf.Output) (output tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArraySplitV3
+func TensorArraySplitV2(scope *Scope, handle tf.Output, value tf.Output, lengths tf.Output, flow_in tf.Output) (flow_out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseDenseCwiseDiv",
+		Type: "TensorArraySplitV2",
 		Input: []tf.Input{
-			sp_indices, sp_values, sp_shape, dense,
+			handle, value, lengths, flow_in,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceApplyMomentumAttr is an optional argument to ResourceApplyMomentum.
-type ResourceApplyMomentumAttr func(optionalAttr)
+// PackAttr is an optional argument to Pack.
+type PackAttr func(optionalAttr)
 
-// ResourceApplyMomentumUseLocking sets the optional use_locking attribute to value.
+// PackAxis sets the optional axis attribute to value.
 //
-// value: If `True`, updating of the var and accum tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceApplyMomentumUseLocking(value bool) ResourceApplyMomentumAttr {
+// value: Dimension along which to pack.  Negative values wrap around, so the
+// valid range is `[-(R+1), R+1)`.
+// If not specified, defaults to 0
+func PackAxis(value int64) PackAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["axis"] = value
 	}
 }
 
-// ResourceApplyMomentumUseNesterov sets the optional use_nesterov attribute to value.
+// Packs a list of `N` rank-`R` tensors into one rank-`(R+1)` tensor.
 //
-// value: If `True`, the tensor passed to compute grad will be
-// var - lr * momentum * accum, so in the end, the var you get is actually
-// var - lr * momentum * accum.
-// If not specified, defaults to false
-func ResourceApplyMomentumUseNesterov(value bool) ResourceApplyMomentumAttr {
-	return func(m optionalAttr) {
-		m["use_nesterov"] = value
-	}
-}
-
-// Update '*var' according to the momentum scheme. Set use_nesterov = True if you
+// Packs the `N` tensors in `values` into a tensor with rank one higher than each
+// tensor in `values`, by packing them along the `axis` dimension.
+// Given a list of `N` tensors, each of shape `(A, B, C)`:
 //
-// want to use Nesterov momentum.
+// if `axis == 0` then the `output` tensor will have the shape `(N, A, B, C)`.
+// if `axis == 1` then the `output` tensor will have the shape `(A, N, B, C)`.
+// Etc.
 //
-// accum = accum * momentum + grad
-// var -= lr * accum
+// For example:
+//
+// ```
+// # 'x' is [1, 4]
+// # 'y' is [2, 5]
+// # 'z' is [3, 6]
+// pack([x, y, z]) => [[1, 4], [2, 5], [3, 6]]  # Pack along first dim.
+// pack([x, y, z], axis=1) => [[1, 2, 3], [4, 5, 6]]
+// ```
+//
+// This is the opposite of `unpack`.
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	grad: The gradient.
-//	momentum: Momentum. Must be a scalar.
+//	values: Must be of same shape and type.
 //
-// Returns the created operation.
-func ResourceApplyMomentum(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, grad tf.Output, momentum tf.Output, optional ...ResourceApplyMomentumAttr) (o *tf.Operation) {
+// Returns The packed tensor.
+func Pack(scope *Scope, values []tf.Output, optional ...PackAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -11418,236 +11666,194 @@ func ResourceApplyMomentum(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyMomentum",
+		Type: "Pack",
 		Input: []tf.Input{
-			var_, accum, lr, grad, momentum,
+			tf.OutputList(values),
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MaxPoolGradGradAttr is an optional argument to MaxPoolGradGrad.
-type MaxPoolGradGradAttr func(optionalAttr)
-
-// MaxPoolGradGradDataFormat sets the optional data_format attribute to value.
+// Reorders a SparseTensor into the canonical, row-major ordering.
 //
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func MaxPoolGradGradDataFormat(value string) MaxPoolGradGradAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Computes second-order gradients of the maxpooling function.
+// Note that by convention, all sparse ops preserve the canonical ordering along
+// increasing dimension number. The only time ordering can be violated is during
+// manual manipulation of the indices and values vectors to add entries.
+//
+// Reordering does not affect the shape of the SparseTensor.
+//
+// If the tensor has rank `R` and `N` non-empty values, `input_indices` has
+// shape `[N, R]`, input_values has length `N`, and input_shape has length `R`.
 //
 // Arguments:
-//	orig_input: The original input tensor.
-//	orig_output: The original output tensor.
-//	grad: 4-D.  Gradients of gradients w.r.t. the input of `max_pool`.
-//	ksize: The size of the window for each dimension of the input tensor.
-//	strides: The stride of the sliding window for each dimension of the
-// input tensor.
-//	padding: The type of padding algorithm to use.
+//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, possibly not in canonical ordering.
+//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
+//	input_shape: 1-D.  Shape of the input SparseTensor.
 //
-// Returns Gradients of gradients w.r.t. the input to `max_pool`.
-func MaxPoolGradGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPoolGradGradAttr) (output tf.Output) {
+// Returns `output_indices`: 2-D.  `N x R` matrix with the same indices as `input_indices`, but
+// in canonical row-major ordering. `output_values`: 1-D.  `N` non-empty values corresponding to `output_indices`.
+func SparseReorder(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output) (output_indices tf.Output, output_values tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "MaxPoolGradGrad",
+		Type: "SparseReorder",
 		Input: []tf.Input{
-			orig_input, orig_output, grad,
+			input_indices, input_values, input_shape,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Returns the truth value of (x >= y) element-wise.
-//
-// *NOTE*: `GreaterEqual` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func GreaterEqual(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Computes rectified linear: `max(features, 0)`.
+func Relu(scope *Scope, features tf.Output) (activations tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "GreaterEqual",
+		Type: "Relu",
 		Input: []tf.Input{
-			x, y,
+			features,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Conv3DAttr is an optional argument to Conv3D.
-type Conv3DAttr func(optionalAttr)
-
-// Conv3DDataFormat sets the optional data_format attribute to value.
-//
-// value: The data format of the input and output data. With the
-// default format "NDHWC", the data is stored in the order of:
-//     [batch, in_depth, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCDHW", the data storage order is:
-//     [batch, in_channels, in_depth, in_height, in_width].
-// If not specified, defaults to "NDHWC"
-func Conv3DDataFormat(value string) Conv3DAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
+// ResourceApplyAddSignAttr is an optional argument to ResourceApplyAddSign.
+type ResourceApplyAddSignAttr func(optionalAttr)
 
-// Conv3DDilations sets the optional dilations attribute to value.
+// ResourceApplyAddSignUseLocking sets the optional use_locking attribute to value.
 //
-// value: 1-D tensor of length 5.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each
-// filter element on that dimension. The dimension order is determined by the
-// value of `data_format`, see above for details. Dilations in the batch and
-// depth dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 i:1 >
-func Conv3DDilations(value []int64) Conv3DAttr {
+// value: If `True`, updating of the var and m tensors is
+// protected by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyAddSignUseLocking(value bool) ResourceApplyAddSignAttr {
 	return func(m optionalAttr) {
-		m["dilations"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Computes a 3-D convolution given 5-D `input` and `filter` tensors.
-//
-// In signal processing, cross-correlation is a measure of similarity of
-// two waveforms as a function of a time-lag applied to one of them. This
-// is also known as a sliding dot product or sliding inner-product.
+// Update '*var' according to the AddSign update.
 //
-// Our Conv3D implements a form of cross-correlation.
+// m_t <- beta1 * m_{t-1} + (1 - beta1) * g
+// update <- (alpha + sign_decay * sign(g) * sign(m)) * g
+// variable <- variable - lr_t * update
 //
 // Arguments:
-//	input: Shape `[batch, in_depth, in_height, in_width, in_channels]`.
-//	filter: Shape `[filter_depth, filter_height, filter_width, in_channels,
-// out_channels]`. `in_channels` must match between `input` and `filter`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
-func Conv3D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, padding string, optional ...Conv3DAttr) (output tf.Output) {
+//	var_: Should be from a Variable().
+//	m: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	alpha: Must be a scalar.
+//	sign_decay: Must be a scalar.
+//	beta: Must be a scalar.
+//	grad: The gradient.
+//
+// Returns the created operation.
+func ResourceApplyAddSign(scope *Scope, var_ tf.Output, m tf.Output, lr tf.Output, alpha tf.Output, sign_decay tf.Output, beta tf.Output, grad tf.Output, optional ...ResourceApplyAddSignAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Conv3D",
+		Type: "ResourceApplyAddSign",
 		Input: []tf.Input{
-			input, filter,
+			var_, m, lr, alpha, sign_decay, beta, grad,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Adds up a SparseTensor and a dense Tensor, using these special rules:
+// FractionalMaxPoolGradAttr is an optional argument to FractionalMaxPoolGrad.
+type FractionalMaxPoolGradAttr func(optionalAttr)
+
+// FractionalMaxPoolGradOverlapping sets the optional overlapping attribute to value.
 //
-// (1) Broadcasts the dense side to have the same shape as the sparse side, if
-//     eligible;
-// (2) Then, only the dense values pointed to by the indices of the SparseTensor
-//     participate in the cwise addition.
+// value: When set to True, the values at the boundary of adjacent pooling
+// cells are used by both cells when pooling. For example:
 //
-// By these rules, the result is a logical SparseTensor with exactly the same
-// indices and shape, but possibly with different non-zero values.  The output of
-// this Op is the resultant non-zero values.
+// `index  0  1  2  3  4`
+//
+// `value  20 5  16 3  7`
+//
+// If the pooling sequence is [0, 2, 4], then 16, at index 2, will be used twice.
+// The result would be [20, 16] for fractional max pooling.
+// If not specified, defaults to false
+func FractionalMaxPoolGradOverlapping(value bool) FractionalMaxPoolGradAttr {
+	return func(m optionalAttr) {
+		m["overlapping"] = value
+	}
+}
+
+// Computes gradient of the FractionalMaxPool function.
 //
 // Arguments:
-//	sp_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, possibly not in canonical ordering.
-//	sp_values: 1-D.  `N` non-empty values corresponding to `sp_indices`.
-//	sp_shape: 1-D.  Shape of the input SparseTensor.
-//	dense: `R`-D.  The dense Tensor operand.
+//	orig_input: Original input for `fractional_max_pool`
+//	orig_output: Original output for `fractional_max_pool`
+//	out_backprop: 4-D with shape `[batch, height, width, channels]`.  Gradients
+// w.r.t. the output of `fractional_max_pool`.
+//	row_pooling_sequence: row pooling sequence; forms the pooling region together with
+// col_pooling_sequence.
+//	col_pooling_sequence: column pooling sequence; forms the pooling region together with
+// row_pooling_sequence.
 //
-// Returns 1-D.  The `N` values that are operated on.
-func SparseDenseCwiseAdd(scope *Scope, sp_indices tf.Output, sp_values tf.Output, sp_shape tf.Output, dense tf.Output) (output tf.Output) {
+// Returns 4-D.  Gradients w.r.t. the input of `fractional_max_pool`.
+func FractionalMaxPoolGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, out_backprop tf.Output, row_pooling_sequence tf.Output, col_pooling_sequence tf.Output, optional ...FractionalMaxPoolGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SparseDenseCwiseAdd",
+		Type: "FractionalMaxPoolGrad",
 		Input: []tf.Input{
-			sp_indices, sp_values, sp_shape, dense,
+			orig_input, orig_output, out_backprop, row_pooling_sequence, col_pooling_sequence,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Read an element from the TensorArray into output `value`.
-//
-// Arguments:
-//	handle: The handle to a TensorArray.
+// ResourceApplyAdagradDAAttr is an optional argument to ResourceApplyAdagradDA.
+type ResourceApplyAdagradDAAttr func(optionalAttr)
+
+// ResourceApplyAdagradDAUseLocking sets the optional use_locking attribute to value.
 //
-//	flow_in: A float scalar that enforces proper chaining of operations.
-//	dtype: The type of the elem that is returned.
-//
-// Returns The tensor that is read from the TensorArray.
-func TensorArrayReadV3(scope *Scope, handle tf.Output, index tf.Output, flow_in tf.Output, dtype tf.DataType) (value tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"dtype": dtype}
-	opspec := tf.OpSpec{
-		Type: "TensorArrayReadV3",
-		Input: []tf.Input{
-			handle, index, flow_in,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// EncodePngAttr is an optional argument to EncodePng.
-type EncodePngAttr func(optionalAttr)
-
-// EncodePngCompression sets the optional compression attribute to value.
-//
-// value: Compression level.
-// If not specified, defaults to -1
-func EncodePngCompression(value int64) EncodePngAttr {
+// value: If True, updating of the var and accum tensors will be protected by
+// a lock; otherwise the behavior is undefined, but may exhibit less contention.
+// If not specified, defaults to false
+func ResourceApplyAdagradDAUseLocking(value bool) ResourceApplyAdagradDAAttr {
 	return func(m optionalAttr) {
-		m["compression"] = value
+		m["use_locking"] = value
 	}
 }
 
-// PNG-encode an image.
-//
-// `image` is a 3-D uint8 or uint16 Tensor of shape `[height, width, channels]`
-// where `channels` is:
-//
-// *   1: for grayscale.
-// *   2: for grayscale + alpha.
-// *   3: for RGB.
-// *   4: for RGBA.
-//
-// The ZLIB compression level, `compression`, can be -1 for the PNG-encoder
-// default or a value from 0 to 9.  9 is the highest compression level, generating
-// the smallest output, but is slower.
+// Update '*var' according to the proximal adagrad scheme.
 //
 // Arguments:
-//	image: 3-D with shape `[height, width, channels]`.
+//	var_: Should be from a Variable().
+//	gradient_accumulator: Should be from a Variable().
+//	gradient_squared_accumulator: Should be from a Variable().
+//	grad: The gradient.
+//	lr: Scaling factor. Must be a scalar.
+//	l1: L1 regularization. Must be a scalar.
+//	l2: L2 regularization. Must be a scalar.
+//	global_step: Training step number. Must be a scalar.
 //
-// Returns 0-D. PNG-encoded image.
-func EncodePng(scope *Scope, image tf.Output, optional ...EncodePngAttr) (contents tf.Output) {
+// Returns the created operation.
+func ResourceApplyAdagradDA(scope *Scope, var_ tf.Output, gradient_accumulator tf.Output, gradient_squared_accumulator tf.Output, grad tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, global_step tf.Output, optional ...ResourceApplyAdagradDAAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -11656,48 +11862,50 @@ func EncodePng(scope *Scope, image tf.Output, optional ...EncodePngAttr) (conten
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "EncodePng",
+		Type: "ResourceApplyAdagradDA",
 		Input: []tf.Input{
-			image,
+			var_, gradient_accumulator, gradient_squared_accumulator, grad, lr, l1, l2, global_step,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// DataFormatVecPermuteAttr is an optional argument to DataFormatVecPermute.
-type DataFormatVecPermuteAttr func(optionalAttr)
+// SparseReduceMaxSparseAttr is an optional argument to SparseReduceMaxSparse.
+type SparseReduceMaxSparseAttr func(optionalAttr)
 
-// DataFormatVecPermuteSrcFormat sets the optional src_format attribute to value.
+// SparseReduceMaxSparseKeepDims sets the optional keep_dims attribute to value.
 //
-// value: source data format.
-// If not specified, defaults to "NHWC"
-func DataFormatVecPermuteSrcFormat(value string) DataFormatVecPermuteAttr {
+// value: If true, retain reduced dimensions with length 1.
+// If not specified, defaults to false
+func SparseReduceMaxSparseKeepDims(value bool) SparseReduceMaxSparseAttr {
 	return func(m optionalAttr) {
-		m["src_format"] = value
+		m["keep_dims"] = value
 	}
 }
 
-// DataFormatVecPermuteDstFormat sets the optional dst_format attribute to value.
+// Computes the max of elements across dimensions of a SparseTensor.
 //
-// value: destination data format.
-// If not specified, defaults to "NCHW"
-func DataFormatVecPermuteDstFormat(value string) DataFormatVecPermuteAttr {
-	return func(m optionalAttr) {
-		m["dst_format"] = value
-	}
-}
-
-// Returns the permuted vector/tensor in the destination data format given the
+// This Op takes a SparseTensor and is the sparse counterpart to
+// `tf.reduce_max()`.  In contrast to SparseReduceMax, this Op returns a
+// SparseTensor.
 //
-// one in the source data format.
+// Reduces `sp_input` along the dimensions given in `reduction_axes`.  Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `reduction_axes`. If `keep_dims` is true, the reduced dimensions are retained
+// with length 1.
 //
-// Arguments:
-//	x: Vector of size 4 or Tensor of shape (4, 2) in source data format.
+// If `reduction_axes` has no entries, all dimensions are reduced, and a tensor
+// with a single element is returned.  Additionally, the axes can be negative,
+// in which case they are interpreted according to the indexing rules in Python.
 //
-// Returns Vector of size 4 or Tensor of shape (4, 2) in destination data format.
-func DataFormatVecPermute(scope *Scope, x tf.Output, optional ...DataFormatVecPermuteAttr) (y tf.Output) {
+// Arguments:
+//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, possibly not in canonical ordering.
+//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
+//	input_shape: 1-D.  Shape of the input SparseTensor.
+//	reduction_axes: 1-D.  Length-`K` vector containing the reduction axes.
+func SparseReduceMaxSparse(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output, reduction_axes tf.Output, optional ...SparseReduceMaxSparseAttr) (output_indices tf.Output, output_values tf.Output, output_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -11706,155 +11914,155 @@ func DataFormatVecPermute(scope *Scope, x tf.Output, optional ...DataFormatVecPe
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DataFormatVecPermute",
+		Type: "SparseReduceMaxSparse",
 		Input: []tf.Input{
-			x,
+			input_indices, input_values, input_shape, reduction_axes,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Returns element-wise integer closest to x.
+// Creates a dataset that emits the outputs of `input_dataset` `count` times.
 //
-// If the result is midway between two representable values,
-// the even representable is chosen.
-// For example:
+// Arguments:
 //
-// ```
-// rint(-1.5) ==> -2.0
-// rint(0.5000001) ==> 1.0
-// rint([-1.7, -1.5, -0.2, 0.2, 1.5, 1.7, 2.0]) ==> [-2., -2., -0., 0., 2., 2., 2.]
-// ```
-func Rint(scope *Scope, x tf.Output) (y tf.Output) {
+//	count: A scalar representing the number of times that `input_dataset` should
+// be repeated. A value of `-1` indicates that it should be repeated infinitely.
+//
+//
+func RepeatDataset(scope *Scope, input_dataset tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "Rint",
+		Type: "RepeatDataset",
 		Input: []tf.Input{
-			x,
+			input_dataset, count,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// OrderedMapUnstageNoKeyAttr is an optional argument to OrderedMapUnstageNoKey.
-type OrderedMapUnstageNoKeyAttr func(optionalAttr)
-
-// OrderedMapUnstageNoKeyCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func OrderedMapUnstageNoKeyCapacity(value int64) OrderedMapUnstageNoKeyAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
+// AddManySparseToTensorsMapAttr is an optional argument to AddManySparseToTensorsMap.
+type AddManySparseToTensorsMapAttr func(optionalAttr)
 
-// OrderedMapUnstageNoKeyMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// AddManySparseToTensorsMapContainer sets the optional container attribute to value.
 //
-// REQUIRES: value >= 0
-func OrderedMapUnstageNoKeyMemoryLimit(value int64) OrderedMapUnstageNoKeyAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// OrderedMapUnstageNoKeyContainer sets the optional container attribute to value.
+// value: The container name for the `SparseTensorsMap` created by this op.
 // If not specified, defaults to ""
-func OrderedMapUnstageNoKeyContainer(value string) OrderedMapUnstageNoKeyAttr {
+func AddManySparseToTensorsMapContainer(value string) AddManySparseToTensorsMapAttr {
 	return func(m optionalAttr) {
 		m["container"] = value
 	}
 }
 
-// OrderedMapUnstageNoKeySharedName sets the optional shared_name attribute to value.
+// AddManySparseToTensorsMapSharedName sets the optional shared_name attribute to value.
+//
+// value: The shared name for the `SparseTensorsMap` created by this op.
+// If blank, the new Operation's unique name is used.
 // If not specified, defaults to ""
-func OrderedMapUnstageNoKeySharedName(value string) OrderedMapUnstageNoKeyAttr {
+func AddManySparseToTensorsMapSharedName(value string) AddManySparseToTensorsMapAttr {
 	return func(m optionalAttr) {
 		m["shared_name"] = value
 	}
 }
 
-// Op removes and returns the (key, value) element with the smallest
+// Add an `N`-minibatch `SparseTensor` to a `SparseTensorsMap`, return `N` handles.
 //
-// key from the underlying container.   If the underlying container
-// does not contain elements, the op will block until it does.
-func OrderedMapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataType, optional ...OrderedMapUnstageNoKeyAttr) (key tf.Output, values []tf.Output) {
+// A `SparseTensor` of rank `R` is represented by three tensors: `sparse_indices`,
+// `sparse_values`, and `sparse_shape`, where
+//
+// ```sparse_indices.shape[1] == sparse_shape.shape[0] == R```
+//
+// An `N`-minibatch of `SparseTensor` objects is represented as a `SparseTensor`
+// whose first `sparse_indices` column takes values in `[0, N)`, where
+// the minibatch size `N == sparse_shape[0]`.
+//
+// The input `SparseTensor` must have rank `R` greater than 1, and the first
+// dimension is treated as the minibatch dimension.  Elements of the `SparseTensor`
+// must be sorted in increasing order of this first dimension.  The stored
+// `SparseTensor` objects pointed to by each row of the output `sparse_handles`
+// will have rank `R-1`.
+//
+// The `SparseTensor` values can then be read out as part of a minibatch by passing
+// the given keys as vector elements to `TakeManySparseFromTensorsMap`.  To ensure
+// the correct `SparseTensorsMap` is accessed, ensure that the same
+// `container` and `shared_name` are passed to that Op.  If no `shared_name`
+// is provided here, instead use the *name* of the Operation created by calling
+// `AddManySparseToTensorsMap` as the `shared_name` passed to
+// `TakeManySparseFromTensorsMap`.  Ensure the Operations are colocated.
+//
+// Arguments:
+//	sparse_indices: 2-D.  The `indices` of the minibatch `SparseTensor`.
+// `sparse_indices[:, 0]` must be ordered values in `[0, N)`.
+//	sparse_values: 1-D.  The `values` of the minibatch `SparseTensor`.
+//	sparse_shape: 1-D.  The `shape` of the minibatch `SparseTensor`.
+// The minibatch size `N == sparse_shape[0]`.
+//
+// Returns 1-D.  The handles of the `SparseTensor` now stored in the
+// `SparseTensorsMap`.  Shape: `[N]`.
+func AddManySparseToTensorsMap(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...AddManySparseToTensorsMapAttr) (sparse_handles tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "OrderedMapUnstageNoKey",
+		Type: "AddManySparseToTensorsMap",
 		Input: []tf.Input{
-			indices,
+			sparse_indices, sparse_values, sparse_shape,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	key = op.Output(idx)
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("OrderedMapUnstageNoKey", err)
-		return
-	}
-	return key, values
+	return op.Output(0)
 }
 
-// MaxPool3DGradGradAttr is an optional argument to MaxPool3DGradGrad.
-type MaxPool3DGradGradAttr func(optionalAttr)
+// MinAttr is an optional argument to Min.
+type MinAttr func(optionalAttr)
 
-// MaxPool3DGradGradDataFormat sets the optional data_format attribute to value.
+// MinKeepDims sets the optional keep_dims attribute to value.
 //
-// value: The data format of the input and output data. With the
-// default format "NDHWC", the data is stored in the order of:
-//     [batch, in_depth, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCDHW", the data storage order is:
-//     [batch, in_channels, in_depth, in_height, in_width].
-// If not specified, defaults to "NDHWC"
-func MaxPool3DGradGradDataFormat(value string) MaxPool3DGradGradAttr {
+// value: If true, retain reduced dimensions with length 1.
+// If not specified, defaults to false
+func MinKeepDims(value bool) MinAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["keep_dims"] = value
 	}
 }
 
-// Computes second-order gradients of the maxpooling function.
+// Computes the minimum of elements across dimensions of a tensor.
 //
-// Arguments:
-//	orig_input: The original input tensor.
-//	orig_output: The original output tensor.
-//	grad: Output backprop of shape `[batch, depth, rows, cols, channels]`.
-//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
-// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
+// Reduces `input` along the dimensions given in `axis`. Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `axis`. If `keep_dims` is true, the reduced dimensions are
+// retained with length 1.
 //
-// Returns Gradients of gradients w.r.t. the input to `max_pool`.
-func MaxPool3DGradGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPool3DGradGradAttr) (output tf.Output) {
+// Arguments:
+//	input: The tensor to reduce.
+//	axis: The dimensions to reduce. Must be in the range
+// `[-rank(input), rank(input))`.
+//
+// Returns The reduced tensor.
+func Min(scope *Scope, input tf.Output, axis tf.Output, optional ...MinAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MaxPool3DGradGrad",
+		Type: "Min",
 		Input: []tf.Input{
-			orig_input, orig_output, grad,
+			input, axis,
 		},
 		Attrs: attrs,
 	}
@@ -11862,51 +12070,76 @@ func MaxPool3DGradGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output
 	return op.Output(0)
 }
 
-// Conv3DBackpropFilterV2Attr is an optional argument to Conv3DBackpropFilterV2.
-type Conv3DBackpropFilterV2Attr func(optionalAttr)
+// Shuffle dimensions of x according to a permutation.
+//
+// The output `y` has the same rank as `x`. The shapes of `x` and `y` satisfy:
+//   `y.shape[i] == x.shape[perm[i]] for i in [0, 1, ..., rank(x) - 1]`
+func Transpose(scope *Scope, x tf.Output, perm tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Transpose",
+		Input: []tf.Input{
+			x, perm,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
 
-// Conv3DBackpropFilterV2DataFormat sets the optional data_format attribute to value.
+// DepthwiseConv2dNativeBackpropFilterAttr is an optional argument to DepthwiseConv2dNativeBackpropFilter.
+type DepthwiseConv2dNativeBackpropFilterAttr func(optionalAttr)
+
+// DepthwiseConv2dNativeBackpropFilterDataFormat sets the optional data_format attribute to value.
 //
-// value: The data format of the input and output data. With the
-// default format "NDHWC", the data is stored in the order of:
-//     [batch, in_depth, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCDHW", the data storage order is:
-//     [batch, in_channels, in_depth, in_height, in_width].
-// If not specified, defaults to "NDHWC"
-func Conv3DBackpropFilterV2DataFormat(value string) Conv3DBackpropFilterV2Attr {
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, height, width, channels].
+// Alternatively, the format could be "NCHW", with the data storage order of:
+//     [batch, channels, height, width].
+// If not specified, defaults to "NHWC"
+func DepthwiseConv2dNativeBackpropFilterDataFormat(value string) DepthwiseConv2dNativeBackpropFilterAttr {
 	return func(m optionalAttr) {
 		m["data_format"] = value
 	}
 }
 
-// Conv3DBackpropFilterV2Dilations sets the optional dilations attribute to value.
+// DepthwiseConv2dNativeBackpropFilterDilations sets the optional dilations attribute to value.
 //
-// value: 1-D tensor of length 5.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each
-// filter element on that dimension. The dimension order is determined by the
-// value of `data_format`, see above for details. Dilations in the batch and
-// depth dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 i:1 >
-func Conv3DBackpropFilterV2Dilations(value []int64) Conv3DBackpropFilterV2Attr {
+// value: 1-D tensor of length 4.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each filter
+// element on that dimension. The dimension order is determined by the value of
+// `data_format`, see above for details. Dilations in the batch and depth
+// dimensions must be 1.
+// If not specified, defaults to <i:1 i:1 i:1 i:1 >
+func DepthwiseConv2dNativeBackpropFilterDilations(value []int64) DepthwiseConv2dNativeBackpropFilterAttr {
 	return func(m optionalAttr) {
 		m["dilations"] = value
 	}
 }
 
-// Computes the gradients of 3-D convolution with respect to the filter.
+// Computes the gradients of depthwise convolution with respect to the filter.
 //
 // Arguments:
-//	input: Shape `[batch, depth, rows, cols, in_channels]`.
+//	input: 4-D with shape based on `data_format`.  For example, if
+// `data_format` is 'NHWC' then `input` is a 4-D `[batch, in_height,
+// in_width, in_channels]` tensor.
 //	filter_sizes: An integer vector representing the tensor shape of `filter`,
-// where `filter` is a 5-D
-// `[filter_depth, filter_height, filter_width, in_channels, out_channels]`
-// tensor.
-//	out_backprop: Backprop signal of shape `[batch, out_depth, out_rows, out_cols,
-// out_channels]`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+// where `filter` is a 4-D
+// `[filter_height, filter_width, in_channels, depthwise_multiplier]` tensor.
+//	out_backprop: 4-D with shape based on `data_format`.
+// For example, if `data_format` is 'NHWC' then
+// out_backprop shape is `[batch, out_height, out_width, out_channels]`.
+// Gradients w.r.t. the output of the convolution.
+//	strides: The stride of the sliding window for each dimension of the input
+// of the convolution.
 //	padding: The type of padding algorithm to use.
-func Conv3DBackpropFilterV2(scope *Scope, input tf.Output, filter_sizes tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...Conv3DBackpropFilterV2Attr) (output tf.Output) {
+//
+// Returns 4-D with shape
+// `[filter_height, filter_width, in_channels, out_channels]`.  Gradient w.r.t.
+// the `filter` input of the convolution.
+func DepthwiseConv2dNativeBackpropFilter(scope *Scope, input tf.Output, filter_sizes tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...DepthwiseConv2dNativeBackpropFilterAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -11915,7 +12148,7 @@ func Conv3DBackpropFilterV2(scope *Scope, input tf.Output, filter_sizes tf.Outpu
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Conv3DBackpropFilterV2",
+		Type: "DepthwiseConv2dNativeBackpropFilter",
 		Input: []tf.Input{
 			input, filter_sizes, out_backprop,
 		},
@@ -11925,160 +12158,141 @@ func Conv3DBackpropFilterV2(scope *Scope, input tf.Output, filter_sizes tf.Outpu
 	return op.Output(0)
 }
 
-// Execute a sub graph on a remote processor.
-//
-// The graph specifications(such as graph itself, input tensors and output names)
-// are stored as a serialized protocol buffer of RemoteFusedGraphExecuteInfo
-// as serialized_remote_fused_graph_execute_info.
-// The specifications will be passed to a dedicated registered
-// remote fused graph executor.  The executor will send the graph specifications
-// to a remote processor and execute that graph.  The execution results
-// will be passed to consumer nodes as outputs of this node.
-//
-// Arguments:
-//	inputs: Arbitrary number of tensors with arbitrary data types
-//
-//	serialized_remote_fused_graph_execute_info: Serialized protocol buffer
-// of RemoteFusedGraphExecuteInfo which contains graph specifications.
+// Computes sigmoid of `x` element-wise.
 //
-// Returns Arbitrary number of tensors with arbitrary data types
-func RemoteFusedGraphExecute(scope *Scope, inputs []tf.Output, Toutputs []tf.DataType, serialized_remote_fused_graph_execute_info string) (outputs []tf.Output) {
+// Specifically, `y = 1 / (1 + exp(-x))`.
+func Sigmoid(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"Toutputs": Toutputs, "serialized_remote_fused_graph_execute_info": serialized_remote_fused_graph_execute_info}
 	opspec := tf.OpSpec{
-		Type: "RemoteFusedGraphExecute",
+		Type: "Sigmoid",
 		Input: []tf.Input{
-			tf.OutputList(inputs),
+			x,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if outputs, idx, err = makeOutputList(op, idx, "outputs"); err != nil {
-		scope.UpdateErr("RemoteFusedGraphExecute", err)
-		return
-	}
-	return outputs
+	return op.Output(0)
 }
 
-// ThreadUnsafeUnigramCandidateSamplerAttr is an optional argument to ThreadUnsafeUnigramCandidateSampler.
-type ThreadUnsafeUnigramCandidateSamplerAttr func(optionalAttr)
+// FusedBatchNormAttr is an optional argument to FusedBatchNorm.
+type FusedBatchNormAttr func(optionalAttr)
 
-// ThreadUnsafeUnigramCandidateSamplerSeed sets the optional seed attribute to value.
+// FusedBatchNormEpsilon sets the optional epsilon attribute to value.
 //
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func ThreadUnsafeUnigramCandidateSamplerSeed(value int64) ThreadUnsafeUnigramCandidateSamplerAttr {
+// value: A small float number added to the variance of x.
+// If not specified, defaults to 0.0001
+func FusedBatchNormEpsilon(value float32) FusedBatchNormAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["epsilon"] = value
 	}
 }
 
-// ThreadUnsafeUnigramCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+// FusedBatchNormDataFormat sets the optional data_format attribute to value.
 //
-// value: An second seed to avoid seed collision.
-// If not specified, defaults to 0
-func ThreadUnsafeUnigramCandidateSamplerSeed2(value int64) ThreadUnsafeUnigramCandidateSamplerAttr {
+// value: The data format for x and y. Either "NHWC" (default) or "NCHW".
+// If not specified, defaults to "NHWC"
+func FusedBatchNormDataFormat(value string) FusedBatchNormAttr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["data_format"] = value
 	}
 }
 
-// Generates labels for candidate sampling with a learned unigram distribution.
-//
-// See explanations of candidate sampling and the data formats at
-// go/candidate-sampling.
+// FusedBatchNormIsTraining sets the optional is_training attribute to value.
 //
-// For each batch, this op picks a single set of sampled candidate labels.
+// value: A bool value to indicate the operation is for training (default)
+// or inference.
+// If not specified, defaults to true
+func FusedBatchNormIsTraining(value bool) FusedBatchNormAttr {
+	return func(m optionalAttr) {
+		m["is_training"] = value
+	}
+}
+
+// Batch normalization.
 //
-// The advantages of sampling candidates per-batch are simplicity and the
-// possibility of efficient dense matrix multiplication. The disadvantage is that
-// the sampled candidates must be chosen independently of the context and of the
-// true labels.
+// Note that the size of 4D Tensors is defined by either "NHWC" or "NCHW".
+// The size of 1D Tensors matches the dimension C of the 4D Tensors.
 //
 // Arguments:
-//	true_classes: A batch_size * num_true matrix, in which each row contains the
-// IDs of the num_true target_classes in the corresponding original label.
-//	num_true: Number of true labels per context.
-//	num_sampled: Number of candidates to randomly sample.
-//	unique: If unique is true, we sample with rejection, so that all sampled
-// candidates in a batch are unique. This requires some approximation to
-// estimate the post-rejection sampling probabilities.
-//	range_max: The sampler will sample integers from the interval [0, range_max).
+//	x: A 4D Tensor for input data.
+//	scale: A 1D Tensor for scaling factor, to scale the normalized x.
+//	offset: A 1D Tensor for offset, to shift the normalized x.
+//	mean: A 1D Tensor for population mean. Used for inference only;
+// must be empty for training.
+//	variance: A 1D Tensor for population variance. Used for inference only;
+// must be empty for training.
 //
-// Returns A vector of length num_sampled, in which each element is
-// the ID of a sampled candidate.A batch_size * num_true matrix, representing
-// the number of times each candidate is expected to occur in a batch
-// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
-// candidate representing the number of times the candidate is expected
-// to occur in a batch of sampled candidates.  If unique=true, then this is a
-// probability.
-func ThreadUnsafeUnigramCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...ThreadUnsafeUnigramCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
+// Returns A 4D Tensor for output data. A 1D Tensor for the computed batch mean, to be used by TensorFlow
+// to compute the running mean. A 1D Tensor for the computed batch variance, to be used by
+// TensorFlow to compute the running variance. A 1D Tensor for the computed batch mean, to be reused
+// in the gradient computation. A 1D Tensor for the computed batch variance (inverted variance
+// in the cuDNN case), to be reused in the gradient computation.
+func FusedBatchNorm(scope *Scope, x tf.Output, scale tf.Output, offset tf.Output, mean tf.Output, variance tf.Output, optional ...FusedBatchNormAttr) (y tf.Output, batch_mean tf.Output, batch_variance tf.Output, reserve_space_1 tf.Output, reserve_space_2 tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ThreadUnsafeUnigramCandidateSampler",
+		Type: "FusedBatchNorm",
 		Input: []tf.Input{
-			true_classes,
+			x, scale, offset, mean, variance,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
 }
 
-// MaxPoolV2Attr is an optional argument to MaxPoolV2.
-type MaxPoolV2Attr func(optionalAttr)
+// RandomStandardNormalAttr is an optional argument to RandomStandardNormal.
+type RandomStandardNormalAttr func(optionalAttr)
 
-// MaxPoolV2DataFormat sets the optional data_format attribute to value.
+// RandomStandardNormalSeed sets the optional seed attribute to value.
 //
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func MaxPoolV2DataFormat(value string) MaxPoolV2Attr {
+// value: If either `seed` or `seed2` is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func RandomStandardNormalSeed(value int64) RandomStandardNormalAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["seed"] = value
 	}
 }
 
-// Performs max pooling on the input.
+// RandomStandardNormalSeed2 sets the optional seed2 attribute to value.
 //
-// Arguments:
-//	input: 4-D input to pool over.
-//	ksize: The size of the window for each dimension of the input tensor.
-//	strides: The stride of the sliding window for each dimension of the
-// input tensor.
-//	padding: The type of padding algorithm to use.
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func RandomStandardNormalSeed2(value int64) RandomStandardNormalAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// Outputs random values from a normal distribution.
 //
-// Returns The max pooled output tensor.
-func MaxPoolV2(scope *Scope, input tf.Output, ksize tf.Output, strides tf.Output, padding string, optional ...MaxPoolV2Attr) (output tf.Output) {
+// The generated values will have mean 0 and standard deviation 1.
+//
+// Arguments:
+//	shape: The shape of the output tensor.
+//	dtype: The type of the output.
+//
+// Returns A tensor of the specified shape filled with random normal values.
+func RandomStandardNormal(scope *Scope, shape tf.Output, dtype tf.DataType, optional ...RandomStandardNormalAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"padding": padding}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MaxPoolV2",
+		Type: "RandomStandardNormal",
 		Input: []tf.Input{
-			input, ksize, strides,
+			shape,
 		},
 		Attrs: attrs,
 	}
@@ -12086,71 +12300,84 @@ func MaxPoolV2(scope *Scope, input tf.Output, ksize tf.Output, strides tf.Output
 	return op.Output(0)
 }
 
-// Deprecated. Use TensorArrayReadV3
+// Component-wise divides a SparseTensor by a dense Tensor.
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayReadV3
-func TensorArrayReadV2(scope *Scope, handle tf.Output, index tf.Output, flow_in tf.Output, dtype tf.DataType) (value tf.Output) {
+// *Limitation*: this Op only broadcasts the dense side to the sparse side, not
+// in the other direction.
+//
+// Arguments:
+//	sp_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, possibly not in canonical ordering.
+//	sp_values: 1-D.  `N` non-empty values corresponding to `sp_indices`.
+//	sp_shape: 1-D.  Shape of the input SparseTensor.
+//	dense: `R`-D.  The dense Tensor operand.
+//
+// Returns 1-D.  The `N` values that are operated on.
+func SparseDenseCwiseDiv(scope *Scope, sp_indices tf.Output, sp_values tf.Output, sp_shape tf.Output, dense tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayReadV2",
+		Type: "SparseDenseCwiseDiv",
 		Input: []tf.Input{
-			handle, index, flow_in,
+			sp_indices, sp_values, sp_shape, dense,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Does nothing. Serves as a control trigger for scheduling.
+// FractionalAvgPoolGradAttr is an optional argument to FractionalAvgPoolGrad.
+type FractionalAvgPoolGradAttr func(optionalAttr)
+
+// FractionalAvgPoolGradOverlapping sets the optional overlapping attribute to value.
 //
-// Only useful as a placeholder for control edges.
+// value: When set to True, the values at the boundary of adjacent pooling
+// cells are used by both cells when pooling. For example:
 //
-// Returns the created operation.
-func ControlTrigger(scope *Scope) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "ControlTrigger",
+// `index  0  1  2  3  4`
+//
+// `value  20 5  16 3  7`
+//
+// If the pooling sequence is [0, 2, 4], then 16, at index 2, will be used twice.
+// The result would be [41/3, 26/3] for fractional avg pooling.
+// If not specified, defaults to false
+func FractionalAvgPoolGradOverlapping(value bool) FractionalAvgPoolGradAttr {
+	return func(m optionalAttr) {
+		m["overlapping"] = value
 	}
-	return scope.AddOperation(opspec)
 }
 
-// Batch normalization.
-//
-// DEPRECATED at GraphDef version 9: Use tf.nn.batch_normalization()
+// Computes gradient of the FractionalAvgPool function.
 //
-// This op is deprecated. Prefer `tf.nn.batch_normalization`.
+// Unlike FractionalMaxPoolGrad, FractionalAvgPoolGrad does not need to find
+// arg_max; it only needs to evenly back-propagate each element of
+// out_backprop to those indices that form the same pooling cell. Therefore, we
+// just need to know the shape of the original input tensor, instead of the whole
+// tensor.
 //
 // Arguments:
-//	t: A 4D input Tensor.
-//	m: A 1D mean Tensor with size matching the last dimension of t.
-// This is the first output from tf.nn.moments,
-// or a saved moving average thereof.
-//	v: A 1D variance Tensor with size matching the last dimension of t.
-// This is the second output from tf.nn.moments,
-// or a saved moving average thereof.
-//	beta: A 1D beta Tensor with size matching the last dimension of t.
-// An offset to be added to the normalized tensor.
-//	gamma: A 1D gamma Tensor with size matching the last dimension of t.
-// If "scale_after_normalization" is true, this tensor will be multiplied
-// with the normalized tensor.
-//	variance_epsilon: A small float number to avoid dividing by 0.
-//	scale_after_normalization: A bool indicating whether the resulted tensor
-// needs to be multiplied with gamma.
-func BatchNormWithGlobalNormalization(scope *Scope, t tf.Output, m tf.Output, v tf.Output, beta tf.Output, gamma tf.Output, variance_epsilon float32, scale_after_normalization bool) (result tf.Output) {
+//	orig_input_tensor_shape: Original input tensor shape for `fractional_avg_pool`
+//	out_backprop: 4-D with shape `[batch, height, width, channels]`.  Gradients
+// w.r.t. the output of `fractional_avg_pool`.
+//	row_pooling_sequence: row pooling sequence, forms pooling region with
+// col_pooling_sequence.
+//	col_pooling_sequence: column pooling sequence, forms pooling region with
+// row_pooling_sequence.
+//
+// Returns 4-D.  Gradients w.r.t. the input of `fractional_avg_pool`.
+func FractionalAvgPoolGrad(scope *Scope, orig_input_tensor_shape tf.Output, out_backprop tf.Output, row_pooling_sequence tf.Output, col_pooling_sequence tf.Output, optional ...FractionalAvgPoolGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"variance_epsilon": variance_epsilon, "scale_after_normalization": scale_after_normalization}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "BatchNormWithGlobalNormalization",
+		Type: "FractionalAvgPoolGrad",
 		Input: []tf.Input{
-			t, m, v, beta, gamma,
+			orig_input_tensor_shape, out_backprop, row_pooling_sequence, col_pooling_sequence,
 		},
 		Attrs: attrs,
 	}
@@ -12158,347 +12385,419 @@ func BatchNormWithGlobalNormalization(scope *Scope, t tf.Output, m tf.Output, v
 	return op.Output(0)
 }
 
-// MutableDenseHashTableV2Attr is an optional argument to MutableDenseHashTableV2.
-type MutableDenseHashTableV2Attr func(optionalAttr)
-
-// MutableDenseHashTableV2Container sets the optional container attribute to value.
+// Concatenates tensors along one dimension.
 //
-// value: If non-empty, this table is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func MutableDenseHashTableV2Container(value string) MutableDenseHashTableV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// MutableDenseHashTableV2SharedName sets the optional shared_name attribute to value.
+// Arguments:
+//	concat_dim: 0-D.  The dimension along which to concatenate.  Must be in the
+// range [0, rank(values)).
+//	values: The `N` Tensors to concatenate. Their ranks and types must match,
+// and their sizes must match in all dimensions except `concat_dim`.
 //
-// value: If non-empty, this table is shared under the given name across
-// multiple sessions.
-// If not specified, defaults to ""
-func MutableDenseHashTableV2SharedName(value string) MutableDenseHashTableV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
+// Returns A `Tensor` with the concatenation of values stacked along the
+// `concat_dim` dimension.  This tensor's shape matches that of `values` except
+// in `concat_dim` where it has the sum of the sizes.
+func Concat(scope *Scope, concat_dim tf.Output, values []tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// MutableDenseHashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
-// If not specified, defaults to false
-func MutableDenseHashTableV2UseNodeNameSharing(value bool) MutableDenseHashTableV2Attr {
-	return func(m optionalAttr) {
-		m["use_node_name_sharing"] = value
+	opspec := tf.OpSpec{
+		Type: "Concat",
+		Input: []tf.Input{
+			concat_dim, tf.OutputList(values),
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MutableDenseHashTableV2ValueShape sets the optional value_shape attribute to value.
-//
-// value: The shape of each value.
-// If not specified, defaults to <>
-func MutableDenseHashTableV2ValueShape(value tf.Shape) MutableDenseHashTableV2Attr {
-	return func(m optionalAttr) {
-		m["value_shape"] = value
-	}
-}
+// ResourceApplyMomentumAttr is an optional argument to ResourceApplyMomentum.
+type ResourceApplyMomentumAttr func(optionalAttr)
 
-// MutableDenseHashTableV2InitialNumBuckets sets the optional initial_num_buckets attribute to value.
+// ResourceApplyMomentumUseLocking sets the optional use_locking attribute to value.
 //
-// value: The initial number of hash table buckets. Must be a power
-// to 2.
-// If not specified, defaults to 131072
-func MutableDenseHashTableV2InitialNumBuckets(value int64) MutableDenseHashTableV2Attr {
+// value: If `True`, updating of the var and accum tensors will be protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyMomentumUseLocking(value bool) ResourceApplyMomentumAttr {
 	return func(m optionalAttr) {
-		m["initial_num_buckets"] = value
+		m["use_locking"] = value
 	}
 }
 
-// MutableDenseHashTableV2MaxLoadFactor sets the optional max_load_factor attribute to value.
+// ResourceApplyMomentumUseNesterov sets the optional use_nesterov attribute to value.
 //
-// value: The maximum ratio between number of entries and number of
-// buckets before growing the table. Must be between 0 and 1.
-// If not specified, defaults to 0.8
-func MutableDenseHashTableV2MaxLoadFactor(value float32) MutableDenseHashTableV2Attr {
+// value: If `True`, the tensor passed to compute grad will be
+// var - lr * momentum * accum, so in the end, the var you get is actually
+// var - lr * momentum * accum.
+// If not specified, defaults to false
+func ResourceApplyMomentumUseNesterov(value bool) ResourceApplyMomentumAttr {
 	return func(m optionalAttr) {
-		m["max_load_factor"] = value
+		m["use_nesterov"] = value
 	}
 }
 
-// Creates an empty hash table that uses tensors as the backing store.
+// Update '*var' according to the momentum scheme.
 //
-// It uses "open addressing" with quadratic reprobing to resolve
-// collisions.
+// Set use_nesterov = True if you want to use Nesterov momentum.
 //
-// This op creates a mutable hash table, specifying the type of its keys and
-// values. Each value must be a scalar. Data can be inserted into the table using
-// the insert operations. It does not support the initialization operation.
+// accum = accum * momentum + grad
+// var -= lr * accum
 //
 // Arguments:
-//	empty_key: The key used to represent empty key buckets internally. Must not
-// be used in insert or lookup operations.
-//	value_dtype: Type of the table values.
+//	var_: Should be from a Variable().
+//	accum: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	grad: The gradient.
+//	momentum: Momentum. Must be a scalar.
 //
-// Returns Handle to a table.
-func MutableDenseHashTableV2(scope *Scope, empty_key tf.Output, value_dtype tf.DataType, optional ...MutableDenseHashTableV2Attr) (table_handle tf.Output) {
+// Returns the created operation.
+func ResourceApplyMomentum(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, grad tf.Output, momentum tf.Output, optional ...ResourceApplyMomentumAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"value_dtype": value_dtype}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MutableDenseHashTableV2",
+		Type: "ResourceApplyMomentum",
 		Input: []tf.Input{
-			empty_key,
+			var_, accum, lr, grad, momentum,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Produces the max pool of the input tensor for quantized types.
+// MaxPoolGradGradAttr is an optional argument to MaxPoolGradGrad.
+type MaxPoolGradGradAttr func(optionalAttr)
+
+// MaxPoolGradGradDataFormat sets the optional data_format attribute to value.
+//
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func MaxPoolGradGradDataFormat(value string) MaxPoolGradGradAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Computes second-order gradients of the maxpooling function.
 //
 // Arguments:
-//	input: The 4D (batch x rows x cols x depth) Tensor to MaxReduce over.
-//	min_input: The float value that the lowest quantized input value represents.
-//	max_input: The float value that the highest quantized input value represents.
+//	orig_input: The original input tensor.
+//	orig_output: The original output tensor.
+//	grad: 4-D.  Gradients of gradients w.r.t. the input of `max_pool`.
 //	ksize: The size of the window for each dimension of the input tensor.
-// The length must be 4 to match the number of dimensions of the input.
-//	strides: The stride of the sliding window for each dimension of the input
-// tensor. The length must be 4 to match the number of dimensions of the input.
+//	strides: The stride of the sliding window for each dimension of the
+// input tensor.
 //	padding: The type of padding algorithm to use.
 //
-// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
-func QuantizedMaxPool(scope *Scope, input tf.Output, min_input tf.Output, max_input tf.Output, ksize []int64, strides []int64, padding string) (output tf.Output, min_output tf.Output, max_output tf.Output) {
+// Returns Gradients of gradients w.r.t. the input to `max_pool`.
+func MaxPoolGradGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPoolGradGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "QuantizedMaxPool",
+		Type: "MaxPoolGradGrad",
 		Input: []tf.Input{
-			input, min_input, max_input,
+			orig_input, orig_output, grad,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// Computes softplus: `log(exp(features) + 1)`.
-func Softplus(scope *Scope, features tf.Output) (activations tf.Output) {
+// Returns element-wise integer closest to x.
+//
+// If the result is midway between two representable values,
+// the even representable is chosen.
+// For example:
+//
+// ```
+// rint(-1.5) ==> -2.0
+// rint(0.5000001) ==> 1.0
+// rint([-1.7, -1.5, -0.2, 0.2, 1.5, 1.7, 2.0]) ==> [-2., -2., -0., 0., 2., 2., 2.]
+// ```
+func Rint(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Softplus",
+		Type: "Rint",
 		Input: []tf.Input{
-			features,
+			x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes exponential of x - 1 element-wise.
+// OrderedMapUnstageNoKeyAttr is an optional argument to OrderedMapUnstageNoKey.
+type OrderedMapUnstageNoKeyAttr func(optionalAttr)
+
+// OrderedMapUnstageNoKeyCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// I.e., \\(y = (\exp x) - 1\\).
-func Expm1(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Expm1",
-		Input: []tf.Input{
-			x,
-		},
+// REQUIRES: value >= 0
+func OrderedMapUnstageNoKeyCapacity(value int64) OrderedMapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Returns the number of records this Reader has produced.
+// OrderedMapUnstageNoKeyMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// This is the same as the number of ReaderRead executions that have
-// succeeded.
+// REQUIRES: value >= 0
+func OrderedMapUnstageNoKeyMemoryLimit(value int64) OrderedMapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// OrderedMapUnstageNoKeyContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func OrderedMapUnstageNoKeyContainer(value string) OrderedMapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// OrderedMapUnstageNoKeySharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func OrderedMapUnstageNoKeySharedName(value string) OrderedMapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op removes and returns the (key, value) element with the smallest key from the underlying container.
 //
-// Arguments:
-//	reader_handle: Handle to a Reader.
-func ReaderNumRecordsProducedV2(scope *Scope, reader_handle tf.Output) (records_produced tf.Output) {
+// If the underlying container does not contain elements, the op will block
+// until it does.
+func OrderedMapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataType, optional ...OrderedMapUnstageNoKeyAttr) (key tf.Output, values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ReaderNumRecordsProducedV2",
+		Type: "OrderedMapUnstageNoKey",
 		Input: []tf.Input{
-			reader_handle,
+			indices,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	key = op.Output(idx)
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("OrderedMapUnstageNoKey", err)
+		return
+	}
+	return key, values
 }
 
-// Computes the sum along segments of a tensor.
-//
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// Computes a tensor such that
-// \\(output_i = \sum_j data_j\\) where sum is over `j` such
-// that `segment_ids[j] == i`.
-//
-// If the sum is empty for a given segment ID `i`, `output[i] = 0`.
+// MaxPool3DGradGradAttr is an optional argument to MaxPool3DGradGrad.
+type MaxPool3DGradGradAttr func(optionalAttr)
+
+// MaxPool3DGradGradDataFormat sets the optional data_format attribute to value.
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentSum.png" alt>
-// </div>
+// value: The data format of the input and output data. With the
+// default format "NDHWC", the data is stored in the order of:
+//     [batch, in_depth, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCDHW", the data storage order is:
+//     [batch, in_channels, in_depth, in_height, in_width].
+// If not specified, defaults to "NDHWC"
+func MaxPool3DGradGradDataFormat(value string) MaxPool3DGradGradAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Computes second-order gradients of the maxpooling function.
 //
 // Arguments:
+//	orig_input: The original input tensor.
+//	orig_output: The original output tensor.
+//	grad: Output backprop of shape `[batch, depth, rows, cols, channels]`.
+//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
+// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
 //
-//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
-// first dimension.  Values should be sorted and can be repeated.
-//
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SegmentSum(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
+// Returns Gradients of gradients w.r.t. the input to `max_pool`.
+func MaxPool3DGradGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPool3DGradGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SegmentSum",
+		Type: "MaxPool3DGradGrad",
 		Input: []tf.Input{
-			data, segment_ids,
+			orig_input, orig_output, grad,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a dataset that emits the lines of one or more text files.
+// Conv3DBackpropFilterV2Attr is an optional argument to Conv3DBackpropFilterV2.
+type Conv3DBackpropFilterV2Attr func(optionalAttr)
+
+// Conv3DBackpropFilterV2DataFormat sets the optional data_format attribute to value.
+//
+// value: The data format of the input and output data. With the
+// default format "NDHWC", the data is stored in the order of:
+//     [batch, in_depth, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCDHW", the data storage order is:
+//     [batch, in_channels, in_depth, in_height, in_width].
+// If not specified, defaults to "NDHWC"
+func Conv3DBackpropFilterV2DataFormat(value string) Conv3DBackpropFilterV2Attr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Conv3DBackpropFilterV2Dilations sets the optional dilations attribute to value.
+//
+// value: 1-D tensor of length 5.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each
+// filter element on that dimension. The dimension order is determined by the
+// value of `data_format`, see above for details. Dilations in the batch and
+// depth dimensions must be 1.
+// If not specified, defaults to <i:1 i:1 i:1 i:1 i:1 >
+func Conv3DBackpropFilterV2Dilations(value []int64) Conv3DBackpropFilterV2Attr {
+	return func(m optionalAttr) {
+		m["dilations"] = value
+	}
+}
+
+// Computes the gradients of 3-D convolution with respect to the filter.
 //
 // Arguments:
-//	filenames: A scalar or a vector containing the name(s) of the file(s) to be
-// read.
-//	compression_type: A scalar containing either (i) the empty string (no
-// compression), (ii) "ZLIB", or (iii) "GZIP".
-//	buffer_size: A scalar containing the number of bytes to buffer.
-func TextLineDataset(scope *Scope, filenames tf.Output, compression_type tf.Output, buffer_size tf.Output) (handle tf.Output) {
+//	input: Shape `[batch, depth, rows, cols, in_channels]`.
+//	filter_sizes: An integer vector representing the tensor shape of `filter`,
+// where `filter` is a 5-D
+// `[filter_depth, filter_height, filter_width, in_channels, out_channels]`
+// tensor.
+//	out_backprop: Backprop signal of shape `[batch, out_depth, out_rows, out_cols,
+// out_channels]`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
+func Conv3DBackpropFilterV2(scope *Scope, input tf.Output, filter_sizes tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...Conv3DBackpropFilterV2Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "TextLineDataset",
+		Type: "Conv3DBackpropFilterV2",
 		Input: []tf.Input{
-			filenames, compression_type, buffer_size,
+			input, filter_sizes, out_backprop,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Checks whether a resource handle-based variable has been initialized.
+// Execute a sub graph on a remote processor.
+//
+// The graph specifications (such as the graph itself, input tensors and output names)
+// are stored as a serialized protocol buffer of RemoteFusedGraphExecuteInfo
+// as serialized_remote_fused_graph_execute_info.
+// The specifications will be passed to a dedicated registered
+// remote fused graph executor.  The executor will send the graph specifications
+// to a remote processor and execute that graph.  The execution results
+// will be passed to consumer nodes as outputs of this node.
 //
 // Arguments:
-//	resource: the input resource handle.
+//	inputs: Arbitrary number of tensors with arbitrary data types
 //
-// Returns a scalar boolean which is true if the variable has been
-// initialized.
-func VarIsInitializedOp(scope *Scope, resource tf.Output) (is_initialized tf.Output) {
+//	serialized_remote_fused_graph_execute_info: Serialized protocol buffer
+// of RemoteFusedGraphExecuteInfo which contains graph specifications.
+//
+// Returns Arbitrary number of tensors with arbitrary data types
+func RemoteFusedGraphExecute(scope *Scope, inputs []tf.Output, Toutputs []tf.DataType, serialized_remote_fused_graph_execute_info string) (outputs []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"Toutputs": Toutputs, "serialized_remote_fused_graph_execute_info": serialized_remote_fused_graph_execute_info}
 	opspec := tf.OpSpec{
-		Type: "VarIsInitializedOp",
+		Type: "RemoteFusedGraphExecute",
 		Input: []tf.Input{
-			resource,
+			tf.OutputList(inputs),
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Pads a tensor with zeros.
-//
-// This operation pads a `input` with zeros according to the `paddings` you
-// specify. `paddings` is an integer tensor with shape `[Dn, 2]`, where n is the
-// rank of `input`. For each dimension D of `input`, `paddings[D, 0]` indicates
-// how many zeros to add before the contents of `input` in that dimension, and
-// `paddings[D, 1]` indicates how many zeros to add after the contents of `input`
-// in that dimension.
-//
-// The padded size of each dimension D of the output is:
-//
-// `paddings(D, 0) + input.dim_size(D) + paddings(D, 1)`
-//
-// For example:
-//
-// ```
-// # 't' is [[1, 1], [2, 2]]
-// # 'paddings' is [[1, 1], [2, 2]]
-// # rank of 't' is 2
-// pad(t, paddings) ==> [[0, 0, 0, 0, 0, 0]
-//                       [0, 0, 1, 1, 0, 0]
-//                       [0, 0, 2, 2, 0, 0]
-//                       [0, 0, 0, 0, 0, 0]]
-// ```
-func Pad(scope *Scope, input tf.Output, paddings tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "Pad",
-		Input: []tf.Input{
-			input, paddings,
-		},
+	var idx int
+	var err error
+	if outputs, idx, err = makeOutputList(op, idx, "outputs"); err != nil {
+		scope.UpdateErr("RemoteFusedGraphExecute", err)
+		return
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return outputs
 }
 
-// SparseTensorDenseMatMulAttr is an optional argument to SparseTensorDenseMatMul.
-type SparseTensorDenseMatMulAttr func(optionalAttr)
-
-// SparseTensorDenseMatMulAdjointA sets the optional adjoint_a attribute to value.
-//
-// value: Use the adjoint of A in the matrix multiply.  If A is complex, this
-// is transpose(conj(A)).  Otherwise it's transpose(A).
-// If not specified, defaults to false
-func SparseTensorDenseMatMulAdjointA(value bool) SparseTensorDenseMatMulAttr {
-	return func(m optionalAttr) {
-		m["adjoint_a"] = value
-	}
-}
+// SerializeManySparseAttr is an optional argument to SerializeManySparse.
+type SerializeManySparseAttr func(optionalAttr)
 
-// SparseTensorDenseMatMulAdjointB sets the optional adjoint_b attribute to value.
+// SerializeManySparseOutType sets the optional out_type attribute to value.
 //
-// value: Use the adjoint of B in the matrix multiply.  If B is complex, this
-// is transpose(conj(B)).  Otherwise it's transpose(B).
-// If not specified, defaults to false
-func SparseTensorDenseMatMulAdjointB(value bool) SparseTensorDenseMatMulAttr {
+// value: The `dtype` to use for serialization; the supported types are `string`
+// (default) and `variant`.
+// If not specified, defaults to DT_STRING
+func SerializeManySparseOutType(value tf.DataType) SerializeManySparseAttr {
 	return func(m optionalAttr) {
-		m["adjoint_b"] = value
+		m["out_type"] = value
 	}
 }
 
-// Multiply SparseTensor (of rank 2) "A" by dense matrix "B".
+// Serialize an `N`-minibatch `SparseTensor` into an `[N, 3]` `Tensor` object.
 //
-// No validity checking is performed on the indices of A.  However, the following
-// input format is recommended for optimal behavior:
+// The `SparseTensor` must have rank `R` greater than 1, and the first dimension
+// is treated as the minibatch dimension.  Elements of the `SparseTensor`
+// must be sorted in increasing order of this first dimension.  The serialized
+// `SparseTensor` objects going into each row of `serialized_sparse` will have
+// rank `R-1`.
 //
-// if adjoint_a == false:
-//   A should be sorted in lexicographically increasing order.  Use SparseReorder
-//   if you're not sure.
-// if adjoint_a == true:
-//   A should be sorted in order of increasing dimension 1 (i.e., "column major"
-//   order instead of "row major" order).
+// The minibatch size `N` is extracted from `sparse_shape[0]`.
 //
 // Arguments:
-//	a_indices: 2-D.  The `indices` of the `SparseTensor`, size `[nnz, 2]` Matrix.
-//	a_values: 1-D.  The `values` of the `SparseTensor`, size `[nnz]` Vector.
-//	a_shape: 1-D.  The `shape` of the `SparseTensor`, size `[2]` Vector.
-//	b: 2-D.  A dense Matrix.
-func SparseTensorDenseMatMul(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b tf.Output, optional ...SparseTensorDenseMatMulAttr) (product tf.Output) {
+//	sparse_indices: 2-D.  The `indices` of the minibatch `SparseTensor`.
+//	sparse_values: 1-D.  The `values` of the minibatch `SparseTensor`.
+//	sparse_shape: 1-D.  The `shape` of the minibatch `SparseTensor`.
+func SerializeManySparse(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...SerializeManySparseAttr) (serialized_sparse tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -12507,9 +12806,9 @@ func SparseTensorDenseMatMul(scope *Scope, a_indices tf.Output, a_values tf.Outp
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseTensorDenseMatMul",
+		Type: "SerializeManySparse",
 		Input: []tf.Input{
-			a_indices, a_values, a_shape, b,
+			sparse_indices, sparse_values, sparse_shape,
 		},
 		Attrs: attrs,
 	}
@@ -12517,103 +12816,71 @@ func SparseTensorDenseMatMul(scope *Scope, a_indices tf.Output, a_values tf.Outp
 	return op.Output(0)
 }
 
-// Deserialize and concatenate `SparseTensors` from a serialized minibatch.
-//
-// The input `serialized_sparse` must be a string matrix of shape `[N x 3]` where
-// `N` is the minibatch size and the rows correspond to packed outputs of
-// `SerializeSparse`.  The ranks of the original `SparseTensor` objects
-// must all match.  When the final `SparseTensor` is created, it has rank one
-// higher than the ranks of the incoming `SparseTensor` objects
-// (they have been concatenated along a new row dimension).
-//
-// The output `SparseTensor` object's shape values for all dimensions but the
-// first are the max across the input `SparseTensor` objects' shape values
-// for the corresponding dimensions.  Its first shape value is `N`, the minibatch
-// size.
-//
-// The input `SparseTensor` objects' indices are assumed ordered in
-// standard lexicographic order.  If this is not the case, after this
-// step run `SparseReorder` to restore index ordering.
-//
-// For example, if the serialized input is a `[2 x 3]` matrix representing two
-// original `SparseTensor` objects:
-//
-//     index = [ 0]
-//             [10]
-//             [20]
-//     values = [1, 2, 3]
-//     shape = [50]
-//
-// and
-//
-//     index = [ 2]
-//             [10]
-//     values = [4, 5]
-//     shape = [30]
-//
-// then the final deserialized `SparseTensor` will be:
-//
-//     index = [0  0]
-//             [0 10]
-//             [0 20]
-//             [1  2]
-//             [1 10]
-//     values = [1, 2, 3, 4, 5]
-//     shape = [2 50]
-//
-// Arguments:
-//	serialized_sparse: 2-D, The `N` serialized `SparseTensor` objects.
-// Must have 3 columns.
-//	dtype: The `dtype` of the serialized `SparseTensor` objects.
-func DeserializeManySparse(scope *Scope, serialized_sparse tf.Output, dtype tf.DataType) (sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output) {
+// Computes inverse hyperbolic cosine of x element-wise.
+func Acosh(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
 	opspec := tf.OpSpec{
-		Type: "DeserializeManySparse",
+		Type: "Acosh",
 		Input: []tf.Input{
-			serialized_sparse,
+			x,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// StringJoinAttr is an optional argument to StringJoin.
-type StringJoinAttr func(optionalAttr)
+// TensorArrayV2Attr is an optional argument to TensorArrayV2.
+type TensorArrayV2Attr func(optionalAttr)
 
-// StringJoinSeparator sets the optional separator attribute to value.
-//
-// value: string, an optional join separator.
+// TensorArrayV2ElementShape sets the optional element_shape attribute to value.
+// If not specified, defaults to <unknown_rank:true >
+func TensorArrayV2ElementShape(value tf.Shape) TensorArrayV2Attr {
+	return func(m optionalAttr) {
+		m["element_shape"] = value
+	}
+}
+
+// TensorArrayV2DynamicSize sets the optional dynamic_size attribute to value.
+// If not specified, defaults to false
+func TensorArrayV2DynamicSize(value bool) TensorArrayV2Attr {
+	return func(m optionalAttr) {
+		m["dynamic_size"] = value
+	}
+}
+
+// TensorArrayV2ClearAfterRead sets the optional clear_after_read attribute to value.
+// If not specified, defaults to true
+func TensorArrayV2ClearAfterRead(value bool) TensorArrayV2Attr {
+	return func(m optionalAttr) {
+		m["clear_after_read"] = value
+	}
+}
+
+// TensorArrayV2TensorArrayName sets the optional tensor_array_name attribute to value.
 // If not specified, defaults to ""
-func StringJoinSeparator(value string) StringJoinAttr {
+func TensorArrayV2TensorArrayName(value string) TensorArrayV2Attr {
 	return func(m optionalAttr) {
-		m["separator"] = value
+		m["tensor_array_name"] = value
 	}
 }
 
-// Joins the strings in the given list of string tensors into one tensor;
-//
-// with the given separator (default is an empty separator).
+// Deprecated. Use TensorArrayV3
 //
-// Arguments:
-//	inputs: A list of string tensors.  The tensors must all have the same shape,
-// or be scalars.  Scalars may be mixed in; these will be broadcast to the shape
-// of non-scalar inputs.
-func StringJoin(scope *Scope, inputs []tf.Output, optional ...StringJoinAttr) (output tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArrayV3
+func TensorArrayV2(scope *Scope, size tf.Output, dtype tf.DataType, optional ...TensorArrayV2Attr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "StringJoin",
+		Type: "TensorArrayV2",
 		Input: []tf.Input{
-			tf.OutputList(inputs),
+			size,
 		},
 		Attrs: attrs,
 	}
@@ -12621,360 +12888,375 @@ func StringJoin(scope *Scope, inputs []tf.Output, optional ...StringJoinAttr) (o
 	return op.Output(0)
 }
 
-// Returns immutable tensor from memory region.
-//
-// The current implementation memmaps the tensor from a file.
+// ThreadUnsafeUnigramCandidateSamplerAttr is an optional argument to ThreadUnsafeUnigramCandidateSampler.
+type ThreadUnsafeUnigramCandidateSamplerAttr func(optionalAttr)
+
+// ThreadUnsafeUnigramCandidateSamplerSeed sets the optional seed attribute to value.
 //
-// Arguments:
-//	dtype: Type of the returned tensor.
-//	shape: Shape of the returned tensor.
-//	memory_region_name: Name of readonly memory region used by the tensor, see
-// NewReadOnlyMemoryRegionFromFile in tensorflow::Env.
-func ImmutableConst(scope *Scope, dtype tf.DataType, shape tf.Shape, memory_region_name string) (tensor tf.Output) {
-	if scope.Err() != nil {
-		return
+// value: If either seed or seed2 is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func ThreadUnsafeUnigramCandidateSamplerSeed(value int64) ThreadUnsafeUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
 	}
-	attrs := map[string]interface{}{"dtype": dtype, "shape": shape, "memory_region_name": memory_region_name}
-	opspec := tf.OpSpec{
-		Type: "ImmutableConst",
+}
 
-		Attrs: attrs,
+// ThreadUnsafeUnigramCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+//
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func ThreadUnsafeUnigramCandidateSamplerSeed2(value int64) ThreadUnsafeUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Inverse real-valued fast Fourier transform.
+// Generates labels for candidate sampling with a learned unigram distribution.
 //
-// Computes the inverse 1-dimensional discrete Fourier transform of a real-valued
-// signal over the inner-most dimension of `input`.
+// See explanations of candidate sampling and the data formats at
+// go/candidate-sampling.
 //
-// The inner-most dimension of `input` is assumed to be the result of `RFFT`: the
-// `fft_length / 2 + 1` unique components of the DFT of a real-valued signal. If
-// `fft_length` is not provided, it is computed from the size of the inner-most
-// dimension of `input` (`fft_length = 2 * (inner - 1)`). If the FFT length used to
-// compute `input` is odd, it should be provided since it cannot be inferred
-// properly.
+// For each batch, this op picks a single set of sampled candidate labels.
 //
-// Along the axis `IRFFT` is computed on, if `fft_length / 2 + 1` is smaller
-// than the corresponding dimension of `input`, the dimension is cropped. If it is
-// larger, the dimension is padded with zeros.
+// The advantages of sampling candidates per-batch are simplicity and the
+// possibility of efficient dense matrix multiplication. The disadvantage is that
+// the sampled candidates must be chosen independently of the context and of the
+// true labels.
 //
 // Arguments:
-//	input: A complex64 tensor.
-//	fft_length: An int32 tensor of shape [1]. The FFT length.
-//
-// Returns A float32 tensor of the same rank as `input`. The inner-most
-//   dimension of `input` is replaced with the `fft_length` samples of its inverse
-//   1D Fourier transform.
+//	true_classes: A batch_size * num_true matrix, in which each row contains the
+// IDs of the num_true target_classes in the corresponding original label.
+//	num_true: Number of true labels per context.
+//	num_sampled: Number of candidates to randomly sample.
+//	unique: If unique is true, we sample with rejection, so that all sampled
+// candidates in a batch are unique. This requires some approximation to
+// estimate the post-rejection sampling probabilities.
+//	range_max: The sampler will sample integers from the interval [0, range_max).
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.irfft
-// @end_compatibility
-func IRFFT(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
+// Returns A vector of length num_sampled, in which each element is
+// the ID of a sampled candidate. A batch_size * num_true matrix, representing
+// the number of times each candidate is expected to occur in a batch
+// of sampled candidates. If unique=true, then this is a probability. A vector of length num_sampled, for each sampled
+// candidate representing the number of times the candidate is expected
+// to occur in a batch of sampled candidates.  If unique=true, then this is a
+// probability.
+func ThreadUnsafeUnigramCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...ThreadUnsafeUnigramCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "IRFFT",
+		Type: "ThreadUnsafeUnigramCandidateSampler",
 		Input: []tf.Input{
-			input, fft_length,
+			true_classes,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Concatenates a list of `SparseTensor` along the specified dimension.
-//
-// Concatenation is with respect to the dense versions of these sparse tensors.
-// It is assumed that each input is a `SparseTensor` whose elements are ordered
-// along increasing dimension number.
-//
-// All inputs' shapes must match, except for the concat dimension.  The
-// `indices`, `values`, and `shapes` lists must have the same length.
-//
-// The output shape is identical to the inputs', except along the concat
-// dimension, where it is the sum of the inputs' sizes along that dimension.
-//
-// The output elements will be resorted to preserve the sort order along
-// increasing dimension number.
-//
-// This op runs in `O(M log M)` time, where `M` is the total number of non-empty
-// values across all inputs. This is due to the need for an internal sort in
-// order to concatenate efficiently across an arbitrary dimension.
-//
-// For example, if `concat_dim = 1` and the inputs are
-//
-//     sp_inputs[0]: shape = [2, 3]
-//     [0, 2]: "a"
-//     [1, 0]: "b"
-//     [1, 1]: "c"
-//
-//     sp_inputs[1]: shape = [2, 4]
-//     [0, 1]: "d"
-//     [0, 2]: "e"
-//
-// then the output will be
-//
-//     shape = [2, 7]
-//     [0, 2]: "a"
-//     [0, 4]: "d"
-//     [0, 5]: "e"
-//     [1, 0]: "b"
-//     [1, 1]: "c"
-//
-// Graphically this is equivalent to doing
+// MaxPoolV2Attr is an optional argument to MaxPoolV2.
+type MaxPoolV2Attr func(optionalAttr)
+
+// MaxPoolV2DataFormat sets the optional data_format attribute to value.
 //
-//     [    a] concat [  d e  ] = [    a   d e  ]
-//     [b c  ]        [       ]   [b c          ]
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func MaxPoolV2DataFormat(value string) MaxPoolV2Attr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Performs max pooling on the input.
 //
 // Arguments:
-//	indices: 2-D.  Indices of each input `SparseTensor`.
-//	values: 1-D.  Non-empty values of each `SparseTensor`.
-//	shapes: 1-D.  Shapes of each `SparseTensor`.
-//	concat_dim: Dimension to concatenate along. Must be in range [-rank, rank),
-// where rank is the number of dimensions in each input `SparseTensor`.
+//	input: 4-D input to pool over.
+//	ksize: The size of the window for each dimension of the input tensor.
+//	strides: The stride of the sliding window for each dimension of the
+// input tensor.
+//	padding: The type of padding algorithm to use.
 //
-// Returns 2-D.  Indices of the concatenated `SparseTensor`.1-D.  Non-empty values of the concatenated `SparseTensor`.1-D.  Shape of the concatenated `SparseTensor`.
-func SparseConcat(scope *Scope, indices []tf.Output, values []tf.Output, shapes []tf.Output, concat_dim int64) (output_indices tf.Output, output_values tf.Output, output_shape tf.Output) {
+// Returns The max pooled output tensor.
+func MaxPoolV2(scope *Scope, input tf.Output, ksize tf.Output, strides tf.Output, padding string, optional ...MaxPoolV2Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"concat_dim": concat_dim}
+	attrs := map[string]interface{}{"padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SparseConcat",
+		Type: "MaxPoolV2",
 		Input: []tf.Input{
-			tf.OutputList(indices), tf.OutputList(values), tf.OutputList(shapes),
+			input, ksize, strides,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// Generates sparse cross from a list of sparse and dense tensors.
-//
-// The op takes two lists, one of 2D `SparseTensor` and one of 2D `Tensor`, each
-// representing features of one feature column. It outputs a 2D `SparseTensor` with
-// the batchwise crosses of these features.
-//
-// For example, if the inputs are
-//
-//     inputs[0]: SparseTensor with shape = [2, 2]
-//     [0, 0]: "a"
-//     [1, 0]: "b"
-//     [1, 1]: "c"
-//
-//     inputs[1]: SparseTensor with shape = [2, 1]
-//     [0, 0]: "d"
-//     [1, 0]: "e"
+// Deprecated. Use TensorArrayReadV3
 //
-//     inputs[2]: Tensor [["f"], ["g"]]
+// DEPRECATED at GraphDef version 26: Use TensorArrayReadV3
+func TensorArrayReadV2(scope *Scope, handle tf.Output, index tf.Output, flow_in tf.Output, dtype tf.DataType) (value tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"dtype": dtype}
+	opspec := tf.OpSpec{
+		Type: "TensorArrayReadV2",
+		Input: []tf.Input{
+			handle, index, flow_in,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Does nothing. Serves as a control trigger for scheduling.
 //
-// then the output will be
+// Only useful as a placeholder for control edges.
 //
-//     shape = [2, 2]
-//     [0, 0]: "a_X_d_X_f"
-//     [1, 0]: "b_X_e_X_g"
-//     [1, 1]: "c_X_e_X_g"
+// Returns the created operation.
+func ControlTrigger(scope *Scope) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ControlTrigger",
+	}
+	return scope.AddOperation(opspec)
+}
+
+// Batch normalization.
 //
-// if hashed_output=true then the output will be
+// DEPRECATED at GraphDef version 9: Use tf.nn.batch_normalization()
 //
-//     shape = [2, 2]
-//     [0, 0]: FingerprintCat64(
-//                 Fingerprint64("f"), FingerprintCat64(
-//                     Fingerprint64("d"), Fingerprint64("a")))
-//     [1, 0]: FingerprintCat64(
-//                 Fingerprint64("g"), FingerprintCat64(
-//                     Fingerprint64("e"), Fingerprint64("b")))
-//     [1, 1]: FingerprintCat64(
-//                 Fingerprint64("g"), FingerprintCat64(
-//                     Fingerprint64("e"), Fingerprint64("c")))
+// This op is deprecated. Prefer `tf.nn.batch_normalization`.
 //
 // Arguments:
-//	indices: 2-D.  Indices of each input `SparseTensor`.
-//	values: 1-D.   values of each `SparseTensor`.
-//	shapes: 1-D.   Shapes of each `SparseTensor`.
-//	dense_inputs: 2-D.    Columns represented by dense `Tensor`.
-//	hashed_output: If true, returns the hash of the cross instead of the string.
-// This will allow us avoiding string manipulations.
-//	num_buckets: It is used if hashed_output is true.
-// output = hashed_value%num_buckets if num_buckets > 0 else hashed_value.
-//	hash_key: Specify the hash_key that will be used by the `FingerprintCat64`
-// function to combine the crosses fingerprints.
-//
-//
-//
-// Returns 2-D.  Indices of the concatenated `SparseTensor`.1-D.  Non-empty values of the concatenated or hashed
-// `SparseTensor`.1-D.  Shape of the concatenated `SparseTensor`.
-func SparseCross(scope *Scope, indices []tf.Output, values []tf.Output, shapes []tf.Output, dense_inputs []tf.Output, hashed_output bool, num_buckets int64, hash_key int64, out_type tf.DataType, internal_type tf.DataType) (output_indices tf.Output, output_values tf.Output, output_shape tf.Output) {
+//	t: A 4D input Tensor.
+//	m: A 1D mean Tensor with size matching the last dimension of t.
+// This is the first output from tf.nn.moments,
+// or a saved moving average thereof.
+//	v: A 1D variance Tensor with size matching the last dimension of t.
+// This is the second output from tf.nn.moments,
+// or a saved moving average thereof.
+//	beta: A 1D beta Tensor with size matching the last dimension of t.
+// An offset to be added to the normalized tensor.
+//	gamma: A 1D gamma Tensor with size matching the last dimension of t.
+// If "scale_after_normalization" is true, this tensor will be multiplied
+// with the normalized tensor.
+//	variance_epsilon: A small float number to avoid dividing by 0.
+//	scale_after_normalization: A bool indicating whether the resulting tensor
+// needs to be multiplied with gamma.
+func BatchNormWithGlobalNormalization(scope *Scope, t tf.Output, m tf.Output, v tf.Output, beta tf.Output, gamma tf.Output, variance_epsilon float32, scale_after_normalization bool) (result tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"hashed_output": hashed_output, "num_buckets": num_buckets, "hash_key": hash_key, "out_type": out_type, "internal_type": internal_type}
+	attrs := map[string]interface{}{"variance_epsilon": variance_epsilon, "scale_after_normalization": scale_after_normalization}
 	opspec := tf.OpSpec{
-		Type: "SparseCross",
+		Type: "BatchNormWithGlobalNormalization",
 		Input: []tf.Input{
-			tf.OutputList(indices), tf.OutputList(values), tf.OutputList(shapes), tf.OutputList(dense_inputs),
+			t, m, v, beta, gamma,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// ListDiffAttr is an optional argument to ListDiff.
-type ListDiffAttr func(optionalAttr)
+// MutableDenseHashTableV2Attr is an optional argument to MutableDenseHashTableV2.
+type MutableDenseHashTableV2Attr func(optionalAttr)
 
-// ListDiffOutIdx sets the optional out_idx attribute to value.
-// If not specified, defaults to DT_INT32
-func ListDiffOutIdx(value tf.DataType) ListDiffAttr {
+// MutableDenseHashTableV2Container sets the optional container attribute to value.
+//
+// value: If non-empty, this table is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func MutableDenseHashTableV2Container(value string) MutableDenseHashTableV2Attr {
 	return func(m optionalAttr) {
-		m["out_idx"] = value
+		m["container"] = value
 	}
 }
 
-// Computes the difference between two lists of numbers or strings.
+// MutableDenseHashTableV2SharedName sets the optional shared_name attribute to value.
 //
-// Given a list `x` and a list `y`, this operation returns a list `out` that
-// represents all values that are in `x` but not in `y`. The returned list `out`
-// is sorted in the same order that the numbers appear in `x` (duplicates are
-// preserved). This operation also returns a list `idx` that represents the
-// position of each `out` element in `x`. In other words:
+// value: If non-empty, this table is shared under the given name across
+// multiple sessions.
+// If not specified, defaults to ""
+func MutableDenseHashTableV2SharedName(value string) MutableDenseHashTableV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// MutableDenseHashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
+// If not specified, defaults to false
+func MutableDenseHashTableV2UseNodeNameSharing(value bool) MutableDenseHashTableV2Attr {
+	return func(m optionalAttr) {
+		m["use_node_name_sharing"] = value
+	}
+}
+
+// MutableDenseHashTableV2ValueShape sets the optional value_shape attribute to value.
 //
-// `out[i] = x[idx[i]] for i in [0, 1, ..., len(out) - 1]`
+// value: The shape of each value.
+// If not specified, defaults to <>
+func MutableDenseHashTableV2ValueShape(value tf.Shape) MutableDenseHashTableV2Attr {
+	return func(m optionalAttr) {
+		m["value_shape"] = value
+	}
+}
+
+// MutableDenseHashTableV2InitialNumBuckets sets the optional initial_num_buckets attribute to value.
 //
-// For example, given this input:
+// value: The initial number of hash table buckets. Must be a power
+// of 2.
+// If not specified, defaults to 131072
+func MutableDenseHashTableV2InitialNumBuckets(value int64) MutableDenseHashTableV2Attr {
+	return func(m optionalAttr) {
+		m["initial_num_buckets"] = value
+	}
+}
+
+// MutableDenseHashTableV2MaxLoadFactor sets the optional max_load_factor attribute to value.
 //
-// ```
-// x = [1, 2, 3, 4, 5, 6]
-// y = [1, 3, 5]
-// ```
+// value: The maximum ratio between number of entries and number of
+// buckets before growing the table. Must be between 0 and 1.
+// If not specified, defaults to 0.8
+func MutableDenseHashTableV2MaxLoadFactor(value float32) MutableDenseHashTableV2Attr {
+	return func(m optionalAttr) {
+		m["max_load_factor"] = value
+	}
+}
+
+// Creates an empty hash table that uses tensors as the backing store.
 //
-// This operation would return:
+// It uses "open addressing" with quadratic reprobing to resolve
+// collisions.
 //
-// ```
-// out ==> [2, 4, 6]
-// idx ==> [1, 3, 5]
-// ```
+// This op creates a mutable hash table, specifying the type of its keys and
+// values. Each value must be a scalar. Data can be inserted into the table using
+// the insert operations. It does not support the initialization operation.
 //
 // Arguments:
-//	x: 1-D. Values to keep.
-//	y: 1-D. Values to remove.
+//	empty_key: The key used to represent empty key buckets internally. Must not
+// be used in insert or lookup operations.
+//	value_dtype: Type of the table values.
 //
-// Returns 1-D. Values present in `x` but not in `y`.1-D. Positions of `x` values preserved in `out`.
-func ListDiff(scope *Scope, x tf.Output, y tf.Output, optional ...ListDiffAttr) (out tf.Output, idx tf.Output) {
+// Returns Handle to a table.
+func MutableDenseHashTableV2(scope *Scope, empty_key tf.Output, value_dtype tf.DataType, optional ...MutableDenseHashTableV2Attr) (table_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"value_dtype": value_dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ListDiff",
+		Type: "MutableDenseHashTableV2",
 		Input: []tf.Input{
-			x, y,
+			empty_key,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Adds up a `SparseTensor` and a dense `Tensor`, producing a dense `Tensor`.
+// StageSizeAttr is an optional argument to StageSize.
+type StageSizeAttr func(optionalAttr)
+
+// StageSizeCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// This Op does not require `a_indices` be sorted in standard lexicographic order.
+// REQUIRES: value >= 0
+func StageSizeCapacity(value int64) StageSizeAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
+	}
+}
+
+// StageSizeMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Arguments:
-//	a_indices: 2-D.  The `indices` of the `SparseTensor`, with shape `[nnz, ndims]`.
-//	a_values: 1-D.  The `values` of the `SparseTensor`, with shape `[nnz]`.
-//	a_shape: 1-D.  The `shape` of the `SparseTensor`, with shape `[ndims]`.
-//	b: `ndims`-D Tensor.  With shape `a_shape`.
-func SparseTensorDenseAdd(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b tf.Output) (output tf.Output) {
+// REQUIRES: value >= 0
+func StageSizeMemoryLimit(value int64) StageSizeAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// StageSizeContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func StageSizeContainer(value string) StageSizeAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// StageSizeSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func StageSizeSharedName(value string) StageSizeAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op returns the number of elements in the underlying container.
+func StageSize(scope *Scope, dtypes []tf.DataType, optional ...StageSizeAttr) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SparseTensorDenseAdd",
-		Input: []tf.Input{
-			a_indices, a_values, a_shape, b,
-		},
+		Type: "StageSize",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// SparseToSparseSetOperationAttr is an optional argument to SparseToSparseSetOperation.
-type SparseToSparseSetOperationAttr func(optionalAttr)
-
-// SparseToSparseSetOperationValidateIndices sets the optional validate_indices attribute to value.
-// If not specified, defaults to true
-func SparseToSparseSetOperationValidateIndices(value bool) SparseToSparseSetOperationAttr {
-	return func(m optionalAttr) {
-		m["validate_indices"] = value
-	}
-}
-
-// Applies set operation along last dimension of 2 `SparseTensor` inputs.
-//
-// See SetOperationOp::SetOperationFromContext for values of `set_operation`.
-//
-// If `validate_indices` is `True`, `SparseToSparseSetOperation` validates the
-// order and range of `set1` and `set2` indices.
-//
-// Input `set1` is a `SparseTensor` represented by `set1_indices`, `set1_values`,
-// and `set1_shape`. For `set1` ranked `n`, 1st `n-1` dimensions must be the same
-// as `set2`. Dimension `n` contains values in a set, duplicates are allowed but
-// ignored.
-//
-// Input `set2` is a `SparseTensor` represented by `set2_indices`, `set2_values`,
-// and `set2_shape`. For `set2` ranked `n`, 1st `n-1` dimensions must be the same
-// as `set1`. Dimension `n` contains values in a set, duplicates are allowed but
-// ignored.
-//
-// If `validate_indices` is `True`, this op validates the order and range of `set1`
-// and `set2` indices.
-//
-// Output `result` is a `SparseTensor` represented by `result_indices`,
-// `result_values`, and `result_shape`. For `set1` and `set2` ranked `n`, this
-// has rank `n` and the same 1st `n-1` dimensions as `set1` and `set2`. The `nth`
-// dimension contains the result of `set_operation` applied to the corresponding
-// `[0...n-1]` dimension of `set`.
+// Produces the max pool of the input tensor for quantized types.
 //
 // Arguments:
-//	set1_indices: 2D `Tensor`, indices of a `SparseTensor`. Must be in row-major
-// order.
-//	set1_values: 1D `Tensor`, values of a `SparseTensor`. Must be in row-major
-// order.
-//	set1_shape: 1D `Tensor`, shape of a `SparseTensor`. `set1_shape[0...n-1]` must
-// be the same as `set2_shape[0...n-1]`, `set1_shape[n]` is the
-// max set size across `0...n-1` dimensions.
-//	set2_indices: 2D `Tensor`, indices of a `SparseTensor`. Must be in row-major
-// order.
-//	set2_values: 1D `Tensor`, values of a `SparseTensor`. Must be in row-major
-// order.
-//	set2_shape: 1D `Tensor`, shape of a `SparseTensor`. `set2_shape[0...n-1]` must
-// be the same as `set1_shape[0...n-1]`, `set2_shape[n]` is the
-// max set size across `0...n-1` dimensions.
-//
+//	input: The 4D (batch x rows x cols x depth) Tensor to MaxReduce over.
+//	min_input: The float value that the lowest quantized input value represents.
+//	max_input: The float value that the highest quantized input value represents.
+//	ksize: The size of the window for each dimension of the input tensor.
+// The length must be 4 to match the number of dimensions of the input.
+//	strides: The stride of the sliding window for each dimension of the input
+// tensor. The length must be 4 to match the number of dimensions of the input.
+//	padding: The type of padding algorithm to use.
 //
-// Returns 2D indices of a `SparseTensor`.1D values of a `SparseTensor`.1D `Tensor` shape of a `SparseTensor`. `result_shape[0...n-1]` is
-// the same as the 1st `n-1` dimensions of `set1` and `set2`, `result_shape[n]`
-// is the max result set size across all `0...n-1` dimensions.
-func SparseToSparseSetOperation(scope *Scope, set1_indices tf.Output, set1_values tf.Output, set1_shape tf.Output, set2_indices tf.Output, set2_values tf.Output, set2_shape tf.Output, set_operation string, optional ...SparseToSparseSetOperationAttr) (result_indices tf.Output, result_values tf.Output, result_shape tf.Output) {
+// Returns the pooled `output` tensor, `min_output` (the float value that the lowest quantized output value represents), and `max_output` (the float value that the highest quantized output value represents).
+func QuantizedMaxPool(scope *Scope, input tf.Output, min_input tf.Output, max_input tf.Output, ksize []int64, strides []int64, padding string) (output tf.Output, min_output tf.Output, max_output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"set_operation": set_operation}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "SparseToSparseSetOperation",
+		Type: "QuantizedMaxPool",
 		Input: []tf.Input{
-			set1_indices, set1_values, set1_shape, set2_indices, set2_values, set2_shape,
+			input, min_input, max_input,
 		},
 		Attrs: attrs,
 	}
@@ -12982,271 +13264,276 @@ func SparseToSparseSetOperation(scope *Scope, set1_indices tf.Output, set1_value
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Computes numerical negative value element-wise.
-//
-// I.e., \\(y = -x\\).
-func Neg(scope *Scope, x tf.Output) (y tf.Output) {
+// Computes softplus: `log(exp(features) + 1)`.
+func Softplus(scope *Scope, features tf.Output) (activations tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Neg",
+		Type: "Softplus",
 		Input: []tf.Input{
-			x,
+			features,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Writes a `Summary` protocol buffer with a histogram.
-//
-// The generated
-// [`Summary`](https://www.tensorflow.org/code/tensorflow/core/framework/summary.proto)
-// has one summary value containing a histogram for `values`.
-//
-// This op reports an `InvalidArgument` error if any value is not finite.
-//
-// Arguments:
-//	writer: A handle to a summary writer.
-//	step: The step to write the summary for.
-//	tag: Scalar.  Tag to use for the `Summary.Value`.
-//	values: Any shape. Values to use to build the histogram.
+// Computes exponential of x - 1 element-wise.
 //
-// Returns the created operation.
-func WriteHistogramSummary(scope *Scope, writer tf.Output, step tf.Output, tag tf.Output, values tf.Output) (o *tf.Operation) {
+// I.e., \\(y = (\exp x) - 1\\).
+func Expm1(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "WriteHistogramSummary",
+		Type: "Expm1",
 		Input: []tf.Input{
-			writer, step, tag, values,
+			x,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Adds two `SparseTensor` objects to produce another `SparseTensor`.
+// Returns the number of records this Reader has produced.
 //
-// The input `SparseTensor` objects' indices are assumed ordered in standard
-// lexicographic order.  If this is not the case, before this step run
-// `SparseReorder` to restore index ordering.
-//
-// By default, if two values sum to zero at some index, the output `SparseTensor`
-// would still include that particular location in its index, storing a zero in the
-// corresponding value slot.  To override this, callers can specify `thresh`,
-// indicating that if the sum has a magnitude strictly smaller than `thresh`, its
-// corresponding value and index would then not be included.  In particular,
-// `thresh == 0` (default) means everything is kept and actual thresholding happens
-// only for a positive value.
-//
-// In the following shapes, `nnz` is the count after taking `thresh` into account.
+// This is the same as the number of ReaderRead executions that have
+// succeeded.
 //
 // Arguments:
-//	a_indices: 2-D.  The `indices` of the first `SparseTensor`, size `[nnz, ndims]` Matrix.
-//	a_values: 1-D.  The `values` of the first `SparseTensor`, size `[nnz]` Vector.
-//	a_shape: 1-D.  The `shape` of the first `SparseTensor`, size `[ndims]` Vector.
-//	b_indices: 2-D.  The `indices` of the second `SparseTensor`, size `[nnz, ndims]` Matrix.
-//	b_values: 1-D.  The `values` of the second `SparseTensor`, size `[nnz]` Vector.
-//	b_shape: 1-D.  The `shape` of the second `SparseTensor`, size `[ndims]` Vector.
-//	thresh: 0-D.  The magnitude threshold that determines if an output value/index
-// pair takes space.
-func SparseAdd(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b_indices tf.Output, b_values tf.Output, b_shape tf.Output, thresh tf.Output) (sum_indices tf.Output, sum_values tf.Output, sum_shape tf.Output) {
+//	reader_handle: Handle to a Reader.
+func ReaderNumRecordsProducedV2(scope *Scope, reader_handle tf.Output) (records_produced tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseAdd",
+		Type: "ReaderNumRecordsProducedV2",
 		Input: []tf.Input{
-			a_indices, a_values, a_shape, b_indices, b_values, b_shape, thresh,
+			reader_handle,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// OrderedMapPeekAttr is an optional argument to OrderedMapPeek.
-type OrderedMapPeekAttr func(optionalAttr)
-
-// OrderedMapPeekCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Computes the sum along segments of a tensor.
 //
-// REQUIRES: value >= 0
-func OrderedMapPeekCapacity(value int64) OrderedMapPeekAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// OrderedMapPeekMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
 //
-// REQUIRES: value >= 0
-func OrderedMapPeekMemoryLimit(value int64) OrderedMapPeekAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
+// Computes a tensor such that
+// \\(output_i = \sum_j data_j\\) where sum is over `j` such
+// that `segment_ids[j] == i`.
+//
+// If the sum is empty for a given segment ID `i`, `output[i] = 0`.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentSum.png" alt>
+// </div>
+//
+// Arguments:
+//
+//	segment_ids: A 1-D tensor whose size is equal to the size of `data`'s
+// first dimension.  Values should be sorted and can be repeated.
+//
+// Returns a tensor with the same shape as `data`, except for dimension 0, which
+// has size `k`, the number of segments.
+func SegmentSum(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// OrderedMapPeekContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func OrderedMapPeekContainer(value string) OrderedMapPeekAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+	opspec := tf.OpSpec{
+		Type: "SegmentSum",
+		Input: []tf.Input{
+			data, segment_ids,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// OrderedMapPeekSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func OrderedMapPeekSharedName(value string) OrderedMapPeekAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
+// Creates a dataset that emits the lines of one or more text files.
+//
+// Arguments:
+//	filenames: A scalar or a vector containing the name(s) of the file(s) to be
+// read.
+//	compression_type: A scalar containing either (i) the empty string (no
+// compression), (ii) "ZLIB", or (iii) "GZIP".
+//	buffer_size: A scalar containing the number of bytes to buffer.
+func TextLineDataset(scope *Scope, filenames tf.Output, compression_type tf.Output, buffer_size tf.Output) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "TextLineDataset",
+		Input: []tf.Input{
+			filenames, compression_type, buffer_size,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Op peeks at the values at the specified key.  If the
+// Checks whether a resource handle-based variable has been initialized.
 //
-// underlying container does not contain this key
-// this op will block until it does.   This Op is optimized for
-// performance.
-func OrderedMapPeek(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...OrderedMapPeekAttr) (values []tf.Output) {
+// Arguments:
+//	resource: the input resource handle.
+//
+// Returns a scalar boolean which is true if the variable has been
+// initialized.
+func VarIsInitializedOp(scope *Scope, resource tf.Output) (is_initialized tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "OrderedMapPeek",
+		Type: "VarIsInitializedOp",
 		Input: []tf.Input{
-			key, indices,
+			resource,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Pads a tensor with zeros.
+//
+// This operation pads `input` with zeros according to the `paddings` you
+// specify. `paddings` is an integer tensor with shape `[n, 2]`, where n is the
+// rank of `input`. For each dimension D of `input`, `paddings[D, 0]` indicates
+// how many zeros to add before the contents of `input` in that dimension, and
+// `paddings[D, 1]` indicates how many zeros to add after the contents of `input`
+// in that dimension.
+//
+// The padded size of each dimension D of the output is:
+//
+// `paddings(D, 0) + input.dim_size(D) + paddings(D, 1)`
+//
+// For example:
+//
+// ```
+// # 't' is [[1, 1], [2, 2]]
+// # 'paddings' is [[1, 1], [2, 2]]
+// # rank of 't' is 2
+// pad(t, paddings) ==> [[0, 0, 0, 0, 0, 0]
+//                       [0, 0, 1, 1, 0, 0]
+//                       [0, 0, 2, 2, 0, 0]
+//                       [0, 0, 0, 0, 0, 0]]
+// ```
+func Pad(scope *Scope, input tf.Output, paddings tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("OrderedMapPeek", err)
-		return
+	opspec := tf.OpSpec{
+		Type: "Pad",
+		Input: []tf.Input{
+			input, paddings,
+		},
 	}
-	return values
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// DecodeAndCropJpegAttr is an optional argument to DecodeAndCropJpeg.
-type DecodeAndCropJpegAttr func(optionalAttr)
-
-// DecodeAndCropJpegChannels sets the optional channels attribute to value.
+// Computes gradients for SparseSegmentMean.
 //
-// value: Number of color channels for the decoded image.
-// If not specified, defaults to 0
-func DecodeAndCropJpegChannels(value int64) DecodeAndCropJpegAttr {
-	return func(m optionalAttr) {
-		m["channels"] = value
+// Returns tensor "output" with same shape as grad, except for dimension 0 whose
+// value is output_dim0.
+//
+// Arguments:
+//	grad: gradient propagated to the SparseSegmentMean op.
+//	indices: indices passed to the corresponding SparseSegmentMean op.
+//	segment_ids: segment_ids passed to the corresponding SparseSegmentMean op.
+//	output_dim0: dimension 0 of "data" passed to SparseSegmentMean op.
+func SparseSegmentMeanGrad(scope *Scope, grad tf.Output, indices tf.Output, segment_ids tf.Output, output_dim0 tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SparseSegmentMeanGrad",
+		Input: []tf.Input{
+			grad, indices, segment_ids, output_dim0,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// DecodeAndCropJpegRatio sets the optional ratio attribute to value.
+// Returns the truth value of (x >= y) element-wise.
 //
-// value: Downscaling ratio.
-// If not specified, defaults to 1
-func DecodeAndCropJpegRatio(value int64) DecodeAndCropJpegAttr {
-	return func(m optionalAttr) {
-		m["ratio"] = value
+// *NOTE*: `GreaterEqual` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func GreaterEqual(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
+	opspec := tf.OpSpec{
+		Type: "GreaterEqual",
+		Input: []tf.Input{
+			x, y,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// DecodeAndCropJpegFancyUpscaling sets the optional fancy_upscaling attribute to value.
+// Conv3DAttr is an optional argument to Conv3D.
+type Conv3DAttr func(optionalAttr)
+
+// Conv3DDataFormat sets the optional data_format attribute to value.
 //
-// value: If true use a slower but nicer upscaling of the
-// chroma planes (yuv420/422 only).
-// If not specified, defaults to true
-func DecodeAndCropJpegFancyUpscaling(value bool) DecodeAndCropJpegAttr {
+// value: The data format of the input and output data. With the
+// default format "NDHWC", the data is stored in the order of:
+//     [batch, in_depth, in_height, in_width, in_channels].
+// Alternatively, with the format "NCDHW", the data storage order is:
+//     [batch, in_channels, in_depth, in_height, in_width].
+// If not specified, defaults to "NDHWC"
+func Conv3DDataFormat(value string) Conv3DAttr {
 	return func(m optionalAttr) {
-		m["fancy_upscaling"] = value
+		m["data_format"] = value
 	}
 }
 
-// DecodeAndCropJpegTryRecoverTruncated sets the optional try_recover_truncated attribute to value.
+// Conv3DDilations sets the optional dilations attribute to value.
 //
-// value: If true try to recover an image from truncated input.
-// If not specified, defaults to false
-func DecodeAndCropJpegTryRecoverTruncated(value bool) DecodeAndCropJpegAttr {
+// value: 1-D tensor of length 5.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each
+// filter element on that dimension. The dimension order is determined by the
+// value of `data_format`, see above for details. Dilations in the batch and
+// depth dimensions must be 1.
+// If not specified, defaults to [1, 1, 1, 1, 1]
+func Conv3DDilations(value []int64) Conv3DAttr {
 	return func(m optionalAttr) {
-		m["try_recover_truncated"] = value
+		m["dilations"] = value
 	}
 }
 
-// DecodeAndCropJpegAcceptableFraction sets the optional acceptable_fraction attribute to value.
-//
-// value: The minimum required fraction of lines before a truncated
-// input is accepted.
-// If not specified, defaults to 1
-func DecodeAndCropJpegAcceptableFraction(value float32) DecodeAndCropJpegAttr {
-	return func(m optionalAttr) {
-		m["acceptable_fraction"] = value
-	}
-}
-
-// DecodeAndCropJpegDctMethod sets the optional dct_method attribute to value.
-//
-// value: string specifying a hint about the algorithm used for
-// decompression.  Defaults to "" which maps to a system-specific
-// default.  Currently valid values are ["INTEGER_FAST",
-// "INTEGER_ACCURATE"].  The hint may be ignored (e.g., the internal
-// jpeg library changes to a version that does not have that specific
-// option.)
-// If not specified, defaults to ""
-func DecodeAndCropJpegDctMethod(value string) DecodeAndCropJpegAttr {
-	return func(m optionalAttr) {
-		m["dct_method"] = value
-	}
-}
-
-// Decode and Crop a JPEG-encoded image to a uint8 tensor.
-//
-// The attr `channels` indicates the desired number of color channels for the
-// decoded image.
-//
-// Accepted values are:
-//
-// *   0: Use the number of channels in the JPEG-encoded image.
-// *   1: output a grayscale image.
-// *   3: output an RGB image.
-//
-// If needed, the JPEG-encoded image is transformed to match the requested number
-// of color channels.
-//
-// The attr `ratio` allows downscaling the image by an integer factor during
-// decoding.  Allowed values are: 1, 2, 4, and 8.  This is much faster than
-// downscaling the image later.
+// Computes a 3-D convolution given 5-D `input` and `filter` tensors.
 //
+// In signal processing, cross-correlation is a measure of similarity of
+// two waveforms as a function of a time-lag applied to one of them. This
+// is also known as a sliding dot product or sliding inner-product.
 //
-// It is equivalent to a combination of decode and crop, but much faster by only
-// decoding partial jpeg image.
+// Our Conv3D implements a form of cross-correlation.
 //
 // Arguments:
-//	contents: 0-D.  The JPEG-encoded image.
-//	crop_window: 1-D.  The crop window: [crop_y, crop_x, crop_height, crop_width].
-//
-// Returns 3-D with shape `[height, width, channels]`..
-func DecodeAndCropJpeg(scope *Scope, contents tf.Output, crop_window tf.Output, optional ...DecodeAndCropJpegAttr) (image tf.Output) {
+//	input: Shape `[batch, in_depth, in_height, in_width, in_channels]`.
+//	filter: Shape `[filter_depth, filter_height, filter_width, in_channels,
+// out_channels]`. `in_channels` must match between `input` and `filter`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
+func Conv3D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, padding string, optional ...Conv3DAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodeAndCropJpeg",
+		Type: "Conv3D",
 		Input: []tf.Input{
-			contents, crop_window,
+			input, filter,
 		},
 		Attrs: attrs,
 	}
@@ -13254,274 +13541,197 @@ func DecodeAndCropJpeg(scope *Scope, contents tf.Output, crop_window tf.Output,
 	return op.Output(0)
 }
 
-// AllCandidateSamplerAttr is an optional argument to AllCandidateSampler.
-type AllCandidateSamplerAttr func(optionalAttr)
-
-// AllCandidateSamplerSeed sets the optional seed attribute to value.
-//
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func AllCandidateSamplerSeed(value int64) AllCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
-	}
-}
-
-// AllCandidateSamplerSeed2 sets the optional seed2 attribute to value.
-//
-// value: An second seed to avoid seed collision.
-// If not specified, defaults to 0
-func AllCandidateSamplerSeed2(value int64) AllCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
-	}
-}
-
-// Generates labels for candidate sampling with a learned unigram distribution.
-//
-// See explanations of candidate sampling and the data formats at
-// go/candidate-sampling.
-//
-// For each batch, this op picks a single set of sampled candidate labels.
-//
-// The advantages of sampling candidates per-batch are simplicity and the
-// possibility of efficient dense matrix multiplication. The disadvantage is that
-// the sampled candidates must be chosen independently of the context and of the
-// true labels.
-//
-// Arguments:
-//	true_classes: A batch_size * num_true matrix, in which each row contains the
-// IDs of the num_true target_classes in the corresponding original label.
-//	num_true: Number of true labels per context.
-//	num_sampled: Number of candidates to produce.
-//	unique: If unique is true, we sample with rejection, so that all sampled
-// candidates in a batch are unique. This requires some approximation to
-// estimate the post-rejection sampling probabilities.
+// Adds up a SparseTensor and a dense Tensor, using these special rules:
 //
-// Returns A vector of length num_sampled, in which each element is
-// the ID of a sampled candidate.A batch_size * num_true matrix, representing
-// the number of times each candidate is expected to occur in a batch
-// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
-// candidate representing the number of times the candidate is expected
-// to occur in a batch of sampled candidates.  If unique=true, then this is a
-// probability.
-func AllCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, optional ...AllCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "AllCandidateSampler",
-		Input: []tf.Input{
-			true_classes,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
-}
-
-// Returns the element-wise min of two SparseTensors.
+// (1) Broadcasts the dense side to have the same shape as the sparse side, if
+//     eligible;
+// (2) Then, only the dense values pointed to by the indices of the SparseTensor
+//     participate in the cwise addition.
 //
-// Assumes the two SparseTensors have the same shape, i.e., no broadcasting.
+// By these rules, the result is a logical SparseTensor with exactly the same
+// indices and shape, but possibly with different non-zero values.  The output of
+// this Op is the resultant non-zero values.
 //
 // Arguments:
-//	a_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, in the canonical lexicographic ordering.
-//	a_values: 1-D.  `N` non-empty values corresponding to `a_indices`.
-//	a_shape: 1-D.  Shape of the input SparseTensor.
-//	b_indices: counterpart to `a_indices` for the other operand.
-//	b_values: counterpart to `a_values` for the other operand; must be of the same dtype.
-//	b_shape: counterpart to `a_shape` for the other operand; the two shapes must be equal.
-//
-// Returns 2-D.  The indices of the output SparseTensor.1-D.  The values of the output SparseTensor.
-func SparseSparseMinimum(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b_indices tf.Output, b_values tf.Output, b_shape tf.Output) (output_indices tf.Output, output_values tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SparseSparseMinimum",
-		Input: []tf.Input{
-			a_indices, a_values, a_shape, b_indices, b_values, b_shape,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
-}
-
-// Constructs a tensor by tiling a given tensor.
-//
-// This operation creates a new tensor by replicating `input` `multiples` times.
-// The output tensor's i'th dimension has `input.dims(i) * multiples[i]` elements,
-// and the values of `input` are replicated `multiples[i]` times along the 'i'th
-// dimension. For example, tiling `[a b c d]` by `[2]` produces
-// `[a b c d a b c d]`.
+//	sp_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, possibly not in canonical ordering.
+//	sp_values: 1-D.  `N` non-empty values corresponding to `sp_indices`.
+//	sp_shape: 1-D.  Shape of the input SparseTensor.
+//	dense: `R`-D.  The dense Tensor operand.
 //
-// Arguments:
-//	input: 1-D or higher.
-//	multiples: 1-D. Length must be the same as the number of dimensions in `input`
-func Tile(scope *Scope, input tf.Output, multiples tf.Output) (output tf.Output) {
+// Returns 1-D.  The `N` values that are operated on.
+func SparseDenseCwiseAdd(scope *Scope, sp_indices tf.Output, sp_values tf.Output, sp_shape tf.Output, dense tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Tile",
+		Type: "SparseDenseCwiseAdd",
 		Input: []tf.Input{
-			input, multiples,
+			sp_indices, sp_values, sp_shape, dense,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Saves the input tensors to disk.
-//
-// The size of `tensor_names` must match the number of tensors in `data`. `data[i]`
-// is written to `filename` with name `tensor_names[i]`.
-//
-// See also `SaveSlices`.
+// Read an element from the TensorArray into output `value`.
 //
 // Arguments:
-//	filename: Must have a single element. The name of the file to which we write
-// the tensor.
-//	tensor_names: Shape `[N]`. The names of the tensors to be saved.
-//	data: `N` tensors to save.
-//
-// Returns the created operation.
-func Save(scope *Scope, filename tf.Output, tensor_names tf.Output, data []tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Save",
-		Input: []tf.Input{
-			filename, tensor_names, tf.OutputList(data),
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
-// Returns element-wise remainder of division. When `x < 0` xor `y < 0` is
+//	handle: The handle to a TensorArray.
 //
-// true, this follows Python semantics in that the result here is consistent
-// with a flooring divide. E.g. `floor(x / y) * y + mod(x, y) = x`.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//	dtype: The type of the elem that is returned.
 //
-// *NOTE*: `FloorMod` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func FloorMod(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Returns the tensor that is read from the TensorArray.
+func TensorArrayReadV3(scope *Scope, handle tf.Output, index tf.Output, flow_in tf.Output, dtype tf.DataType) (value tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtype": dtype}
 	opspec := tf.OpSpec{
-		Type: "FloorMod",
+		Type: "TensorArrayReadV3",
 		Input: []tf.Input{
-			x, y,
+			handle, index, flow_in,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TakeManySparseFromTensorsMapAttr is an optional argument to TakeManySparseFromTensorsMap.
-type TakeManySparseFromTensorsMapAttr func(optionalAttr)
+// QuantizeV2Attr is an optional argument to QuantizeV2.
+type QuantizeV2Attr func(optionalAttr)
 
-// TakeManySparseFromTensorsMapContainer sets the optional container attribute to value.
-//
-// value: The container name for the `SparseTensorsMap` read by this op.
-// If not specified, defaults to ""
-func TakeManySparseFromTensorsMapContainer(value string) TakeManySparseFromTensorsMapAttr {
+// QuantizeV2Mode sets the optional mode attribute to value.
+// If not specified, defaults to "MIN_COMBINED"
+func QuantizeV2Mode(value string) QuantizeV2Attr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["mode"] = value
 	}
 }
 
-// TakeManySparseFromTensorsMapSharedName sets the optional shared_name attribute to value.
-//
-// value: The shared name for the `SparseTensorsMap` read by this op.
-// It should not be blank; rather the `shared_name` or unique Operation name
-// of the Op that created the original `SparseTensorsMap` should be used.
-// If not specified, defaults to ""
-func TakeManySparseFromTensorsMapSharedName(value string) TakeManySparseFromTensorsMapAttr {
+// QuantizeV2RoundMode sets the optional round_mode attribute to value.
+// If not specified, defaults to "HALF_AWAY_FROM_ZERO"
+func QuantizeV2RoundMode(value string) QuantizeV2Attr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["round_mode"] = value
 	}
 }
 
-// Read `SparseTensors` from a `SparseTensorsMap` and concatenate them.
+// Quantize the 'input' tensor of type float to 'output' tensor of type 'T'.
 //
-// The input `sparse_handles` must be an `int64` matrix of shape `[N, 1]` where
-// `N` is the minibatch size and the rows correspond to the output handles of
-// `AddSparseToTensorsMap` or `AddManySparseToTensorsMap`.  The ranks of the
-// original `SparseTensor` objects that went into the given input ops must all
-// match.  When the final `SparseTensor` is created, it has rank one
-// higher than the ranks of the incoming `SparseTensor` objects
-// (they have been concatenated along a new row dimension on the left).
+// [min_range, max_range] are scalar floats that specify the range for
+// the 'input' data. The 'mode' attribute controls exactly which calculations are
+// used to convert the float values to their quantized equivalents.  The
+// 'round_mode' attribute controls which rounding tie-breaking algorithm is used
+// when rounding float values to their quantized equivalents.
 //
-// The output `SparseTensor` object's shape values for all dimensions but the
-// first are the max across the input `SparseTensor` objects' shape values
-// for the corresponding dimensions.  Its first shape value is `N`, the minibatch
-// size.
+// In 'MIN_COMBINED' mode, each value of the tensor will undergo the following:
 //
-// The input `SparseTensor` objects' indices are assumed ordered in
-// standard lexicographic order.  If this is not the case, after this
-// step run `SparseReorder` to restore index ordering.
+// ```
+// out[i] = (in[i] - min_range) * range(T) / (max_range - min_range)
+// if T == qint8, out[i] -= (range(T) + 1) / 2.0
+// ```
+// where `range(T) = numeric_limits<T>::max() - numeric_limits<T>::min()`
 //
-// For example, if the handles represent an input, which is a `[2, 3]` matrix
-// representing two original `SparseTensor` objects:
+// *MIN_COMBINED Mode Example*
+//
+// Assume the input is type float and has a possible range of [0.0, 6.0] and the
+// output type is quint8 ([0, 255]). The min_range and max_range values should be
+// specified as 0.0 and 6.0. Quantizing from float to quint8 will multiply each
+// value of the input by 255/6 and cast to quint8.
+//
+// If the output type is qint8 ([-128, 127]), the operation additionally
+// subtracts 128 from each value prior to casting, so that the range of values
+// aligns with the range of qint8.
+//
+// If the mode is 'MIN_FIRST', then this approach is used:
 //
 // ```
-//     index = [ 0]
-//             [10]
-//             [20]
-//     values = [1, 2, 3]
-//     shape = [50]
+// num_discrete_values = 1 << (# of bits in T)
+// range_adjust = num_discrete_values / (num_discrete_values - 1)
+// range = (range_max - range_min) * range_adjust
+// range_scale = num_discrete_values / range
+// quantized = round(input * range_scale) - round(range_min * range_scale) +
+//   numeric_limits<T>::min()
+// quantized = max(quantized, numeric_limits<T>::min())
+// quantized = min(quantized, numeric_limits<T>::max())
 // ```
 //
-// and
+// The biggest difference between this and MIN_COMBINED is that the minimum range
+// is rounded first, before it's subtracted from the rounded value. With
+// MIN_COMBINED, a small bias is introduced where repeated iterations of quantizing
+// and dequantizing will introduce a larger and larger error.
+//
+// *SCALED Mode Example*
+//
+// `SCALED` mode matches the quantization approach used in
+// `QuantizeAndDequantize{V2|V3}`.
+//
+// If the mode is `SCALED`, we do not use the full range of the output type,
+// choosing to elide the lowest possible value for symmetry (e.g., output range is
+// -127 to 127, not -128 to 127 for signed 8 bit quantization), so that 0.0 maps to
+// 0.
 //
+// We first find the range of values in our tensor. The
+// range we use is always centered on 0, so we find m such that
+// ```c++
+//   m = max(abs(input_min), abs(input_max))
 // ```
-//     index = [ 2]
-//             [10]
-//     values = [4, 5]
-//     shape = [30]
+//
+// Our input tensor range is then `[-m, m]`.
+//
+// Next, we choose our fixed-point quantization buckets, `[min_fixed, max_fixed]`.
+// If T is signed, this is
+// ```
+//   num_bits = sizeof(T) * 8
+//   [min_fixed, max_fixed] =
+//       [-((1 << (num_bits - 1)) - 1), (1 << (num_bits - 1)) - 1]
 // ```
 //
-// then the final `SparseTensor` will be:
+// Otherwise, if T is unsigned, the fixed-point range is
+// ```
+//   [min_fixed, max_fixed] = [0, (1 << num_bits) - 1]
+// ```
 //
+// From this we compute our scaling factor, s:
+// ```c++
+//   s = (max_fixed - min_fixed) / (2 * m)
 // ```
-//     index = [0  0]
-//             [0 10]
-//             [0 20]
-//             [1  2]
-//             [1 10]
-//     values = [1, 2, 3, 4, 5]
-//     shape = [2 50]
+//
+// Now we can quantize the elements of our tensor:
+// ```c++
+// result = round(input * s)
 // ```
 //
+// One thing to watch out for is that the operator may choose to adjust the
+// requested minimum and maximum values slightly during the quantization process,
+// so you should always use the output ports as the range for further calculations.
+// For example, if the requested minimum and maximum values are close to equal,
+// they will be separated by a small epsilon value to prevent ill-formed quantized
+// buffers from being created. Otherwise, you can end up with buffers where all the
+// quantized values map to the same float value, which causes problems for
+// operations that have to perform further calculations on them.
+//
 // Arguments:
-//	sparse_handles: 1-D, The `N` serialized `SparseTensor` objects.
-// Shape: `[N]`.
-//	dtype: The `dtype` of the `SparseTensor` objects stored in the
-// `SparseTensorsMap`.
 //
-// Returns 2-D.  The `indices` of the minibatch `SparseTensor`.1-D.  The `values` of the minibatch `SparseTensor`.1-D.  The `shape` of the minibatch `SparseTensor`.
-func TakeManySparseFromTensorsMap(scope *Scope, sparse_handles tf.Output, dtype tf.DataType, optional ...TakeManySparseFromTensorsMapAttr) (sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output) {
+//	min_range: The minimum scalar value possibly produced for the input.
+//	max_range: The maximum scalar value possibly produced for the input.
+//
+//
+// Returns The quantized data produced from the float input. The actual minimum scalar value used for the output. The actual maximum scalar value used for the output.
+func QuantizeV2(scope *Scope, input tf.Output, min_range tf.Output, max_range tf.Output, T tf.DataType, optional ...QuantizeV2Attr) (output tf.Output, output_min tf.Output, output_max tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{"T": T}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TakeManySparseFromTensorsMap",
+		Type: "QuantizeV2",
 		Input: []tf.Input{
-			sparse_handles,
+			input, min_range, max_range,
 		},
 		Attrs: attrs,
 	}
@@ -13529,251 +13739,282 @@ func TakeManySparseFromTensorsMap(scope *Scope, sparse_handles tf.Output, dtype
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Says whether the targets are in the top `K` predictions.
-//
-// This outputs a `batch_size` bool array, an entry `out[i]` is `true` if the
-// prediction for the target class is among the top `k` predictions among
-// all predictions for example `i`. Note that the behavior of `InTopK` differs
-// from the `TopK` op in its handling of ties; if multiple classes have the
-// same prediction value and straddle the top-`k` boundary, all of those
-// classes are considered to be in the top `k`.
-//
-// More formally, let
-//
-//   \\(predictions_i\\) be the predictions for all classes for example `i`,
-//   \\(targets_i\\) be the target class for example `i`,
-//   \\(out_i\\) be the output for example `i`,
-//
-// $$out_i = predictions_{i, targets_i} \in TopKIncludingTies(predictions_i)$$
-//
-// Arguments:
-//	predictions: A `batch_size` x `classes` tensor.
-//	targets: A `batch_size` vector of class ids.
-//	k: Number of top elements to look at for computing precision.
+// Returns the truth value of (x < y) element-wise.
 //
-// Returns Computed precision at `k` as a `bool Tensor`.
-func InTopKV2(scope *Scope, predictions tf.Output, targets tf.Output, k tf.Output) (precision tf.Output) {
+// *NOTE*: `Less` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func Less(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "InTopKV2",
+		Type: "Less",
 		Input: []tf.Input{
-			predictions, targets, k,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Assigns a new value to a variable.
-//
-// Any ReadVariableOp with a control dependency on this op is guaranteed to return
-// this value or a subsequent newer value of the variable.
+// QuantizedReluXAttr is an optional argument to QuantizedReluX.
+type QuantizedReluXAttr func(optionalAttr)
+
+// QuantizedReluXOutType sets the optional out_type attribute to value.
+// If not specified, defaults to DT_QUINT8
+func QuantizedReluXOutType(value tf.DataType) QuantizedReluXAttr {
+	return func(m optionalAttr) {
+		m["out_type"] = value
+	}
+}
+
+// Computes Quantized Rectified Linear X: `min(max(features, 0), max_value)`
 //
 // Arguments:
-//	resource: handle to the resource in which to store the variable.
-//	value: the value to set the new tensor to use.
 //
-// Returns the created operation.
-func AssignVariableOp(scope *Scope, resource tf.Output, value tf.Output) (o *tf.Operation) {
+//
+//	min_features: The float value that the lowest quantized value represents.
+//	max_features: The float value that the highest quantized value represents.
+//
+// Returns Has the same output shape as "features". The float value that the lowest quantized value represents. The float value that the highest quantized value represents.
+func QuantizedReluX(scope *Scope, features tf.Output, max_value tf.Output, min_features tf.Output, max_features tf.Output, optional ...QuantizedReluXAttr) (activations tf.Output, min_activations tf.Output, max_activations tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "AssignVariableOp",
+		Type: "QuantizedReluX",
 		Input: []tf.Input{
-			resource, value,
+			features, max_value, min_features, max_features,
 		},
+		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Returns a tensor of ones with the same shape and type as x.
+// WholeFileReaderV2Attr is an optional argument to WholeFileReaderV2.
+type WholeFileReaderV2Attr func(optionalAttr)
+
+// WholeFileReaderV2Container sets the optional container attribute to value.
 //
-// Arguments:
-//	x: a tensor of type T.
+// value: If non-empty, this reader is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func WholeFileReaderV2Container(value string) WholeFileReaderV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// WholeFileReaderV2SharedName sets the optional shared_name attribute to value.
 //
-// Returns a tensor of the same shape and type as x but filled with ones.
-func OnesLike(scope *Scope, x tf.Output) (y tf.Output) {
+// value: If non-empty, this reader is named in the given bucket
+// with this shared_name. Otherwise, the node name is used instead.
+// If not specified, defaults to ""
+func WholeFileReaderV2SharedName(value string) WholeFileReaderV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// A Reader that outputs the entire contents of a file as a value.
+//
+// To use, enqueue filenames in a Queue.  The output of ReaderRead will
+// be a filename (key) and the contents of that file (value).
+//
+// Returns The handle to reference the Reader.
+func WholeFileReaderV2(scope *Scope, optional ...WholeFileReaderV2Attr) (reader_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "OnesLike",
-		Input: []tf.Input{
-			x,
-		},
+		Type: "WholeFileReaderV2",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// The gradient of SparseFillEmptyRows.
-//
-// Takes vectors reverse_index_map, shaped `[N]`, and grad_values,
-// shaped `[N_full]`, where `N_full >= N` and copies data into either
-// `d_values` or `d_default_value`.  Here `d_values` is shaped `[N]` and
-// `d_default_value` is a scalar.
-//
-//   d_values[j] = grad_values[reverse_index_map[j]]
-//   d_default_value = sum_{k : 0 .. N_full - 1} (
-//      grad_values[k] * 1{k not in reverse_index_map})
+// Transforms a tf.Example proto (as a string) into typed tensors.
 //
 // Arguments:
-//	reverse_index_map: 1-D.  The reverse index map from SparseFillEmptyRows.
-//	grad_values: 1-D.  The gradients from backprop.
-//
-// Returns 1-D.  The backprop into values.0-D.  The backprop into default_value.
-func SparseFillEmptyRowsGrad(scope *Scope, reverse_index_map tf.Output, grad_values tf.Output) (d_values tf.Output, d_default_value tf.Output) {
+//	serialized: A vector containing a batch of binary serialized Example protos.
+//	dense_defaults: A list of Tensors (some may be empty), whose length matches
+// the length of `dense_keys`. dense_defaults[j] provides default values
+// when the example's feature_map lacks dense_key[j].  If an empty Tensor is
+// provided for dense_defaults[j], then the Feature dense_keys[j] is required.
+// The input type is inferred from dense_defaults[j], even when it's empty.
+// If dense_defaults[j] is not empty, and dense_shapes[j] is fully defined,
+// then the shape of dense_defaults[j] must match that of dense_shapes[j].
+// If dense_shapes[j] has an undefined major dimension (variable strides dense
+// feature), dense_defaults[j] must contain a single element:
+// the padding element.
+//	num_sparse: The number of sparse features to be parsed from the example. This
+// must match the lengths of `sparse_keys` and `sparse_types`.
+//	sparse_keys: A list of `num_sparse` strings.
+// The keys expected in the Examples' features associated with sparse values.
+//	dense_keys: The keys expected in the Examples' features associated with dense
+// values.
+//	sparse_types: A list of `num_sparse` types; the data types of data in each
+// Feature given in sparse_keys.
+// Currently the ParseSingleExample op supports DT_FLOAT (FloatList),
+// DT_INT64 (Int64List), and DT_STRING (BytesList).
+//	dense_shapes: The shapes of data in each Feature given in dense_keys.
+// The length of this list must match the length of `dense_keys`.  The
+// number of elements in the Feature corresponding to dense_key[j] must
+// always equal dense_shapes[j].NumEntries().  If dense_shapes[j] ==
+// (D0, D1, ..., DN) then the shape of output Tensor dense_values[j]
+// will be (D0, D1, ..., DN): In the case dense_shapes[j] = (-1, D1,
+// ..., DN), the shape of the output Tensor dense_values[j] will be (M,
+// D1, ..., DN), where M is the number of blocks of elements of length
+// D1 * ... * DN, in the input.
+func ParseSingleExample(scope *Scope, serialized tf.Output, dense_defaults []tf.Output, num_sparse int64, sparse_keys []string, dense_keys []string, sparse_types []tf.DataType, dense_shapes []tf.Shape) (sparse_indices []tf.Output, sparse_values []tf.Output, sparse_shapes []tf.Output, dense_values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"num_sparse": num_sparse, "sparse_keys": sparse_keys, "dense_keys": dense_keys, "sparse_types": sparse_types, "dense_shapes": dense_shapes}
 	opspec := tf.OpSpec{
-		Type: "SparseFillEmptyRowsGrad",
+		Type: "ParseSingleExample",
 		Input: []tf.Input{
-			reverse_index_map, grad_values,
+			serialized, tf.OutputList(dense_defaults),
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
-}
-
-// Computes scaled exponential linear: `scale * alpha * (exp(features) - 1)`
-//
-// if < 0, `scale * features` otherwise.
-//
-// See [Self-Normalizing Neural Networks](https://arxiv.org/abs/1706.02515)
-func Selu(scope *Scope, features tf.Output) (activations tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "Selu",
-		Input: []tf.Input{
-			features,
-		},
+	var idx int
+	var err error
+	if sparse_indices, idx, err = makeOutputList(op, idx, "sparse_indices"); err != nil {
+		scope.UpdateErr("ParseSingleExample", err)
+		return
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if sparse_values, idx, err = makeOutputList(op, idx, "sparse_values"); err != nil {
+		scope.UpdateErr("ParseSingleExample", err)
+		return
+	}
+	if sparse_shapes, idx, err = makeOutputList(op, idx, "sparse_shapes"); err != nil {
+		scope.UpdateErr("ParseSingleExample", err)
+		return
+	}
+	if dense_values, idx, err = makeOutputList(op, idx, "dense_values"); err != nil {
+		scope.UpdateErr("ParseSingleExample", err)
+		return
+	}
+	return sparse_indices, sparse_values, sparse_shapes, dense_values
 }
 
-// SetSizeAttr is an optional argument to SetSize.
-type SetSizeAttr func(optionalAttr)
+// QuantizedConv2DAttr is an optional argument to QuantizedConv2D.
+type QuantizedConv2DAttr func(optionalAttr)
 
-// SetSizeValidateIndices sets the optional validate_indices attribute to value.
-// If not specified, defaults to true
-func SetSizeValidateIndices(value bool) SetSizeAttr {
+// QuantizedConv2DOutType sets the optional out_type attribute to value.
+// If not specified, defaults to DT_QINT32
+func QuantizedConv2DOutType(value tf.DataType) QuantizedConv2DAttr {
 	return func(m optionalAttr) {
-		m["validate_indices"] = value
+		m["out_type"] = value
 	}
 }
 
-// Number of unique elements along last dimension of input `set`.
+// QuantizedConv2DDilations sets the optional dilations attribute to value.
 //
-// Input `set` is a `SparseTensor` represented by `set_indices`, `set_values`,
-// and `set_shape`. The last dimension contains values in a set, duplicates are
-// allowed but ignored.
+// value: 1-D tensor of length 4.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each
+// filter element on that dimension. The dimension order is determined by the
+// value of `data_format`, see above for details. Dilations in the batch and
+// depth dimensions must be 1.
+// If not specified, defaults to <i:1 i:1 i:1 i:1 >
+func QuantizedConv2DDilations(value []int64) QuantizedConv2DAttr {
+	return func(m optionalAttr) {
+		m["dilations"] = value
+	}
+}
+
+// Computes a 2D convolution given quantized 4D input and filter tensors.
 //
-// If `validate_indices` is `True`, this op validates the order and range of `set`
-// indices.
+// The inputs are quantized tensors where the lowest value represents the real
+// number of the associated minimum, and the highest represents the maximum.
+// This means that you can only interpret the quantized output in the same way, by
+// taking the returned minimum and maximum values into account.
 //
 // Arguments:
-//	set_indices: 2D `Tensor`, indices of a `SparseTensor`.
-//	set_values: 1D `Tensor`, values of a `SparseTensor`.
-//	set_shape: 1D `Tensor`, shape of a `SparseTensor`.
 //
-// Returns For `set` ranked `n`, this is a `Tensor` with rank `n-1`, and the same 1st
-// `n-1` dimensions as `set`. Each value is the number of unique elements in
-// the corresponding `[0...n-1]` dimension of `set`.
-func SetSize(scope *Scope, set_indices tf.Output, set_values tf.Output, set_shape tf.Output, optional ...SetSizeAttr) (size tf.Output) {
+//	filter: filter's input_depth dimension must match input's depth dimensions.
+//	min_input: The float value that the lowest quantized input value represents.
+//	max_input: The float value that the highest quantized input value represents.
+//	min_filter: The float value that the lowest quantized filter value represents.
+//	max_filter: The float value that the highest quantized filter value represents.
+//	strides: The stride of the sliding window for each dimension of the input
+// tensor.
+//	padding: The type of padding algorithm to use.
+//
+// Returns The float value that the lowest quantized output value represents. The float value that the highest quantized output value represents.
+func QuantizedConv2D(scope *Scope, input tf.Output, filter tf.Output, min_input tf.Output, max_input tf.Output, min_filter tf.Output, max_filter tf.Output, strides []int64, padding string, optional ...QuantizedConv2DAttr) (output tf.Output, min_output tf.Output, max_output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SetSize",
+		Type: "QuantizedConv2D",
 		Input: []tf.Input{
-			set_indices, set_values, set_shape,
+			input, filter, min_input, max_input, min_filter, max_filter,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes the sign and the log of the absolute value of the determinant of
-//
-// one or more square matrices.
-//
-// The input is a tensor of shape `[N, M, M]` whose inner-most 2 dimensions
-// form square matrices. The outputs are two tensors containing the signs and
-// absolute values of the log determinants for all N input submatrices
-// `[..., :, :]` such that the determinant = sign*exp(log_abs_determinant).
-// The log_abs_determinant is computed as det(P)*sum(log(diag(LU))) where LU
-// is the LU decomposition of the input and P is the corresponding
-// permutation matrix.
-//
-// Arguments:
-//	input: Shape is `[N, M, M]`.
-//
-// Returns The signs of the log determinants of the inputs. Shape is `[N]`.The logs of the absolute values of the determinants
-// of the N input matrices.  Shape is `[N]`.
-func LogMatrixDeterminant(scope *Scope, input tf.Output) (sign tf.Output, log_abs_determinant tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "LogMatrixDeterminant",
-		Input: []tf.Input{
-			input,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// SumAttr is an optional argument to Sum.
-type SumAttr func(optionalAttr)
+// ResourceGatherAttr is an optional argument to ResourceGather.
+type ResourceGatherAttr func(optionalAttr)
 
-// SumKeepDims sets the optional keep_dims attribute to value.
-//
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func SumKeepDims(value bool) SumAttr {
+// ResourceGatherValidateIndices sets the optional validate_indices attribute to value.
+// If not specified, defaults to true
+func ResourceGatherValidateIndices(value bool) ResourceGatherAttr {
 	return func(m optionalAttr) {
-		m["keep_dims"] = value
+		m["validate_indices"] = value
 	}
 }
 
-// Computes the sum of elements across dimensions of a tensor.
+// Gather slices from the variable pointed to by `resource` according to `indices`.
 //
-// Reduces `input` along the dimensions given in `axis`. Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `axis`. If `keep_dims` is true, the reduced dimensions are
-// retained with length 1.
+// `indices` must be an integer tensor of any dimension (usually 0-D or 1-D).
+// Produces an output tensor with shape `indices.shape + params.shape[1:]` where:
 //
-// Arguments:
-//	input: The tensor to reduce.
-//	axis: The dimensions to reduce. Must be in the range
-// `[-rank(input), rank(input))`.
+// ```python
+//     # Scalar indices
+//     output[:, ..., :] = params[indices, :, ..., :]
 //
-// Returns The reduced tensor.
-func Sum(scope *Scope, input tf.Output, axis tf.Output, optional ...SumAttr) (output tf.Output) {
+//     # Vector indices
+//     output[i, :, ..., :] = params[indices[i], :, ..., :]
+//
+//     # Higher rank indices
+//     output[i, ..., j, :, ..., :] = params[indices[i, ..., j], :, ..., :]
+// ```
+func ResourceGather(scope *Scope, resource tf.Output, indices tf.Output, dtype tf.DataType, optional ...ResourceGatherAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Sum",
+		Type: "ResourceGather",
 		Input: []tf.Input{
-			input, axis,
+			resource, indices,
 		},
 		Attrs: attrs,
 	}
@@ -13781,18 +14022,21 @@ func Sum(scope *Scope, input tf.Output, axis tf.Output, optional ...SumAttr) (ou
 	return op.Output(0)
 }
 
-// Delete the tensor specified by its handle in the session.
+// Delete the TensorArray from its resource container.
+//
+// This enables the user to close and release the resource in the middle
+// of a step/run.
 //
 // Arguments:
-//	handle: The handle for a tensor stored in the session state.
+//	handle: The handle to a TensorArray (output of TensorArray or TensorArrayGrad).
 //
 // Returns the created operation.
-func DeleteSessionTensor(scope *Scope, handle tf.Output) (o *tf.Operation) {
+func TensorArrayCloseV3(scope *Scope, handle tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "DeleteSessionTensor",
+		Type: "TensorArrayCloseV3",
 		Input: []tf.Input{
 			handle,
 		},
@@ -13800,461 +14044,396 @@ func DeleteSessionTensor(scope *Scope, handle tf.Output) (o *tf.Operation) {
 	return scope.AddOperation(opspec)
 }
 
-// L2 Loss.
+// Adds two `SparseTensor` objects to produce another `SparseTensor`.
 //
-// Computes half the L2 norm of a tensor without the `sqrt`:
+// The input `SparseTensor` objects' indices are assumed ordered in standard
+// lexicographic order.  If this is not the case, before this step run
+// `SparseReorder` to restore index ordering.
 //
-//     output = sum(t ** 2) / 2
+// By default, if two values sum to zero at some index, the output `SparseTensor`
+// would still include that particular location in its index, storing a zero in the
+// corresponding value slot.  To override this, callers can specify `thresh`,
+// indicating that if the sum has a magnitude strictly smaller than `thresh`, its
+// corresponding value and index would then not be included.  In particular,
+// `thresh == 0` (default) means everything is kept and actual thresholding happens
+// only for a positive value.
 //
-// Arguments:
-//	t: Typically 2-D, but may have any dimensions.
+// In the following shapes, `nnz` is the count after taking `thresh` into account.
 //
-// Returns 0-D.
-func L2Loss(scope *Scope, t tf.Output) (output tf.Output) {
+// Arguments:
+//	a_indices: 2-D.  The `indices` of the first `SparseTensor`, size `[nnz, ndims]` Matrix.
+//	a_values: 1-D.  The `values` of the first `SparseTensor`, size `[nnz]` Vector.
+//	a_shape: 1-D.  The `shape` of the first `SparseTensor`, size `[ndims]` Vector.
+//	b_indices: 2-D.  The `indices` of the second `SparseTensor`, size `[nnz, ndims]` Matrix.
+//	b_values: 1-D.  The `values` of the second `SparseTensor`, size `[nnz]` Vector.
+//	b_shape: 1-D.  The `shape` of the second `SparseTensor`, size `[ndims]` Vector.
+//	thresh: 0-D.  The magnitude threshold that determines if an output value/index
+// pair takes space.
+func SparseAdd(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b_indices tf.Output, b_values tf.Output, b_shape tf.Output, thresh tf.Output) (sum_indices tf.Output, sum_values tf.Output, sum_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "L2Loss",
+		Type: "SparseAdd",
 		Input: []tf.Input{
-			t,
+			a_indices, a_values, a_shape, b_indices, b_values, b_shape, thresh,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// DenseToSparseSetOperationAttr is an optional argument to DenseToSparseSetOperation.
-type DenseToSparseSetOperationAttr func(optionalAttr)
+// OrderedMapPeekAttr is an optional argument to OrderedMapPeek.
+type OrderedMapPeekAttr func(optionalAttr)
 
-// DenseToSparseSetOperationValidateIndices sets the optional validate_indices attribute to value.
-// If not specified, defaults to true
-func DenseToSparseSetOperationValidateIndices(value bool) DenseToSparseSetOperationAttr {
+// OrderedMapPeekCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func OrderedMapPeekCapacity(value int64) OrderedMapPeekAttr {
 	return func(m optionalAttr) {
-		m["validate_indices"] = value
+		m["capacity"] = value
 	}
 }
 
-// Applies set operation along last dimension of `Tensor` and `SparseTensor`.
-//
-// See SetOperationOp::SetOperationFromContext for values of `set_operation`.
-//
-// Input `set2` is a `SparseTensor` represented by `set2_indices`, `set2_values`,
-// and `set2_shape`. For `set2` ranked `n`, 1st `n-1` dimensions must be the same
-// as `set1`. Dimension `n` contains values in a set, duplicates are allowed but
-// ignored.
-//
-// If `validate_indices` is `True`, this op validates the order and range of `set2`
-// indices.
-//
-// Output `result` is a `SparseTensor` represented by `result_indices`,
-// `result_values`, and `result_shape`. For `set1` and `set2` ranked `n`, this
-// has rank `n` and the same 1st `n-1` dimensions as `set1` and `set2`. The `nth`
-// dimension contains the result of `set_operation` applied to the corresponding
-// `[0...n-1]` dimension of `set`.
-//
-// Arguments:
-//	set1: `Tensor` with rank `n`. 1st `n-1` dimensions must be the same as `set2`.
-// Dimension `n` contains values in a set, duplicates are allowed but ignored.
-//	set2_indices: 2D `Tensor`, indices of a `SparseTensor`. Must be in row-major
-// order.
-//	set2_values: 1D `Tensor`, values of a `SparseTensor`. Must be in row-major
-// order.
-//	set2_shape: 1D `Tensor`, shape of a `SparseTensor`. `set2_shape[0...n-1]` must
-// be the same as the 1st `n-1` dimensions of `set1`, `result_shape[n]` is the
-// max set size across `n-1` dimensions.
-//
+// OrderedMapPeekMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Returns 2D indices of a `SparseTensor`.1D values of a `SparseTensor`.1D `Tensor` shape of a `SparseTensor`. `result_shape[0...n-1]` is
-// the same as the 1st `n-1` dimensions of `set1` and `set2`, `result_shape[n]`
-// is the max result set size across all `0...n-1` dimensions.
-func DenseToSparseSetOperation(scope *Scope, set1 tf.Output, set2_indices tf.Output, set2_values tf.Output, set2_shape tf.Output, set_operation string, optional ...DenseToSparseSetOperationAttr) (result_indices tf.Output, result_values tf.Output, result_shape tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"set_operation": set_operation}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "DenseToSparseSetOperation",
-		Input: []tf.Input{
-			set1, set2_indices, set2_values, set2_shape,
-		},
-		Attrs: attrs,
+// REQUIRES: value >= 0
+func OrderedMapPeekMemoryLimit(value int64) OrderedMapPeekAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// FusedResizeAndPadConv2DAttr is an optional argument to FusedResizeAndPadConv2D.
-type FusedResizeAndPadConv2DAttr func(optionalAttr)
+// OrderedMapPeekContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func OrderedMapPeekContainer(value string) OrderedMapPeekAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
 
-// FusedResizeAndPadConv2DResizeAlignCorners sets the optional resize_align_corners attribute to value.
-//
-// value: If true, rescale input by (new_height - 1) / (height - 1),
-// which exactly aligns the 4 corners of images and resized images. If false, rescale
-// by new_height / height. Treat similarly the width dimension.
-// If not specified, defaults to false
-func FusedResizeAndPadConv2DResizeAlignCorners(value bool) FusedResizeAndPadConv2DAttr {
+// OrderedMapPeekSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func OrderedMapPeekSharedName(value string) OrderedMapPeekAttr {
 	return func(m optionalAttr) {
-		m["resize_align_corners"] = value
+		m["shared_name"] = value
 	}
 }
 
-// Performs a resize and padding as a preprocess during a convolution.
-//
-// It's often possible to do spatial transformations more efficiently as part of
-// the packing stage of a convolution, so this op allows for an optimized
-// implementation where these stages are fused together. This prevents the need to
-// write out the intermediate results as whole tensors, reducing memory pressure,
-// and we can get some latency gains by merging the transformation calculations.
-// The data_format attribute for Conv2D isn't supported by this op, and defaults to
-// 'NHWC' order.
-// Internally this op uses a single per-graph scratch buffer, which means that it
-// will block if multiple versions are being run in parallel. This is because this
-// operator is primarily an optimization to minimize memory usage.
-//
-// Arguments:
-//	input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
-//	size: A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
-// new size for the images.
-//	paddings: A two-column matrix specifying the padding sizes. The number of
-// rows must be the same as the rank of `input`.
-//	filter: 4-D with shape
-// `[filter_height, filter_width, in_channels, out_channels]`.
+// Op peeks at the values at the specified key.  If the
 //
-//	strides: 1-D of length 4.  The stride of the sliding window for each dimension
-// of `input`. Must be in the same order as the dimension specified with format.
-//	padding: The type of padding algorithm to use.
-func FusedResizeAndPadConv2D(scope *Scope, input tf.Output, size tf.Output, paddings tf.Output, filter tf.Output, mode string, strides []int64, padding string, optional ...FusedResizeAndPadConv2DAttr) (output tf.Output) {
+// underlying container does not contain this key,
+// this op will block until it does. This op is optimized for
+// performance.
+func OrderedMapPeek(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...OrderedMapPeekAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"mode": mode, "strides": strides, "padding": padding}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FusedResizeAndPadConv2D",
+		Type: "OrderedMapPeek",
 		Input: []tf.Input{
-			input, size, paddings, filter,
+			key, indices,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Subtracts a value from the current value of a variable.
-//
-// Any ReadVariableOp which depends directly or indirectly on this assign is
-// guaranteed to see the incremented value or a subsequent newer one.
-//
-// Outputs the incremented value, which can be used to totally order the
-// increments to this variable.
-//
-// Arguments:
-//	resource: handle to the resource in which to store the variable.
-//	value: the value by which the variable will be incremented.
-//
-// Returns the created operation.
-func AssignSubVariableOp(scope *Scope, resource tf.Output, value tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "AssignSubVariableOp",
-		Input: []tf.Input{
-			resource, value,
-		},
+	var idx int
+	var err error
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("OrderedMapPeek", err)
+		return
 	}
-	return scope.AddOperation(opspec)
+	return values
 }
 
-// RestoreAttr is an optional argument to Restore.
-type RestoreAttr func(optionalAttr)
+// DecodeAndCropJpegAttr is an optional argument to DecodeAndCropJpeg.
+type DecodeAndCropJpegAttr func(optionalAttr)
 
-// RestorePreferredShard sets the optional preferred_shard attribute to value.
+// DecodeAndCropJpegChannels sets the optional channels attribute to value.
 //
-// value: Index of file to open first if multiple files match
-// `file_pattern`.
-// If not specified, defaults to -1
-func RestorePreferredShard(value int64) RestoreAttr {
+// value: Number of color channels for the decoded image.
+// If not specified, defaults to 0
+func DecodeAndCropJpegChannels(value int64) DecodeAndCropJpegAttr {
 	return func(m optionalAttr) {
-		m["preferred_shard"] = value
+		m["channels"] = value
 	}
 }
 
-// Restores a tensor from checkpoint files.
-//
-// Reads a tensor stored in one or several files. If there are several files (for
-// instance because a tensor was saved as slices), `file_pattern` may contain
-// wildcard symbols (`*` and `?`) in the filename portion only, not in the
-// directory portion.
-//
-// If a `file_pattern` matches several files, `preferred_shard` can be used to hint
-// in which file the requested tensor is likely to be found. This op will first
-// open the file at index `preferred_shard` in the list of matching files and try
-// to restore tensors from that file.  Only if some tensors or tensor slices are
-// not found in that first file, then the Op opens all the files. Setting
-// `preferred_shard` to match the value passed as the `shard` input
-// of a matching `Save` Op may speed up Restore.  This attribute only affects
-// performance, not correctness.  The default value -1 means files are processed in
-// order.
-//
-// See also `RestoreSlice`.
-//
-// Arguments:
-//	file_pattern: Must have a single element. The pattern of the files from
-// which we read the tensor.
-//	tensor_name: Must have a single element. The name of the tensor to be
-// restored.
-//	dt: The type of the tensor to be restored.
+// DecodeAndCropJpegRatio sets the optional ratio attribute to value.
 //
-// Returns The restored tensor.
-func Restore(scope *Scope, file_pattern tf.Output, tensor_name tf.Output, dt tf.DataType, optional ...RestoreAttr) (tensor tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"dt": dt}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "Restore",
-		Input: []tf.Input{
-			file_pattern, tensor_name,
-		},
-		Attrs: attrs,
+// value: Downscaling ratio.
+// If not specified, defaults to 1
+func DecodeAndCropJpegRatio(value int64) DecodeAndCropJpegAttr {
+	return func(m optionalAttr) {
+		m["ratio"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// QuantizedResizeBilinearAttr is an optional argument to QuantizedResizeBilinear.
-type QuantizedResizeBilinearAttr func(optionalAttr)
+// DecodeAndCropJpegFancyUpscaling sets the optional fancy_upscaling attribute to value.
+//
+// value: If true, use a slower but nicer upscaling of the
+// chroma planes (yuv420/422 only).
+// If not specified, defaults to true
+func DecodeAndCropJpegFancyUpscaling(value bool) DecodeAndCropJpegAttr {
+	return func(m optionalAttr) {
+		m["fancy_upscaling"] = value
+	}
+}
 
-// QuantizedResizeBilinearAlignCorners sets the optional align_corners attribute to value.
+// DecodeAndCropJpegTryRecoverTruncated sets the optional try_recover_truncated attribute to value.
 //
-// value: If true, rescale input by (new_height - 1) / (height - 1), which
-// exactly aligns the 4 corners of images and resized images. If false, rescale
-// by new_height / height. Treat similarly the width dimension.
+// value: If true, try to recover an image from truncated input.
 // If not specified, defaults to false
-func QuantizedResizeBilinearAlignCorners(value bool) QuantizedResizeBilinearAttr {
+func DecodeAndCropJpegTryRecoverTruncated(value bool) DecodeAndCropJpegAttr {
 	return func(m optionalAttr) {
-		m["align_corners"] = value
+		m["try_recover_truncated"] = value
 	}
 }
 
-// Resize quantized `images` to `size` using quantized bilinear interpolation.
+// DecodeAndCropJpegAcceptableFraction sets the optional acceptable_fraction attribute to value.
 //
-// Input images and output images must be quantized types.
+// value: The minimum required fraction of lines before a truncated
+// input is accepted.
+// If not specified, defaults to 1
+func DecodeAndCropJpegAcceptableFraction(value float32) DecodeAndCropJpegAttr {
+	return func(m optionalAttr) {
+		m["acceptable_fraction"] = value
+	}
+}
+
+// DecodeAndCropJpegDctMethod sets the optional dct_method attribute to value.
 //
-// Arguments:
-//	images: 4-D with shape `[batch, height, width, channels]`.
-//	size: = A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
-// new size for the images.
-//
-//
-//
-// Returns 4-D with shape
-// `[batch, new_height, new_width, channels]`.
-func QuantizedResizeBilinear(scope *Scope, images tf.Output, size tf.Output, min tf.Output, max tf.Output, optional ...QuantizedResizeBilinearAttr) (resized_images tf.Output, out_min tf.Output, out_max tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "QuantizedResizeBilinear",
-		Input: []tf.Input{
-			images, size, min, max,
-		},
-		Attrs: attrs,
+// value: string specifying a hint about the algorithm used for
+// decompression.  Defaults to "" which maps to a system-specific
+// default.  Currently valid values are ["INTEGER_FAST",
+// "INTEGER_ACCURATE"].  The hint may be ignored (e.g., if the internal
+// JPEG library changes to a version that does not have that specific
+// option).
+// If not specified, defaults to ""
+func DecodeAndCropJpegDctMethod(value string) DecodeAndCropJpegAttr {
+	return func(m optionalAttr) {
+		m["dct_method"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Computes the minimum along segments of a tensor.
+// Decode and Crop a JPEG-encoded image to a uint8 tensor.
 //
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
+// The attr `channels` indicates the desired number of color channels for the
+// decoded image.
 //
-// Computes a tensor such that
-// \\(output_i = \min_j(data_j)\\) where `min` is over `j` such
-// that `segment_ids[j] == i`.
+// Accepted values are:
 //
-// If the min is empty for a given segment ID `i`, `output[i] = 0`.
+// *   0: Use the number of channels in the JPEG-encoded image.
+// *   1: Output a grayscale image.
+// *   3: Output an RGB image.
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentMin.png" alt>
-// </div>
+// If needed, the JPEG-encoded image is transformed to match the requested number
+// of color channels.
 //
-// Arguments:
+// The attr `ratio` allows downscaling the image by an integer factor during
+// decoding.  Allowed values are: 1, 2, 4, and 8.  This is much faster than
+// downscaling the image later.
 //
-//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
-// first dimension.  Values should be sorted and can be repeated.
 //
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SegmentMin(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
+// It is equivalent to a combination of decode and crop, but much faster because
+// only part of the JPEG image is decoded.
+//
+// Arguments:
+//	contents: 0-D.  The JPEG-encoded image.
+//	crop_window: 1-D.  The crop window: [crop_y, crop_x, crop_height, crop_width].
+//
+// Returns 3-D with shape `[height, width, channels]`.
+func DecodeAndCropJpeg(scope *Scope, contents tf.Output, crop_window tf.Output, optional ...DecodeAndCropJpegAttr) (image tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SegmentMin",
+		Type: "DecodeAndCropJpeg",
 		Input: []tf.Input{
-			data, segment_ids,
+			contents, crop_window,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// SdcaOptimizerAttr is an optional argument to SdcaOptimizer.
-type SdcaOptimizerAttr func(optionalAttr)
+// AllCandidateSamplerAttr is an optional argument to AllCandidateSampler.
+type AllCandidateSamplerAttr func(optionalAttr)
 
-// SdcaOptimizerAdaptative sets the optional adaptative attribute to value.
+// AllCandidateSamplerSeed sets the optional seed attribute to value.
 //
-// value: Whether to use Adapative SDCA for the inner loop.
-// If not specified, defaults to false
-func SdcaOptimizerAdaptative(value bool) SdcaOptimizerAttr {
+// value: If either seed or seed2 are set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func AllCandidateSamplerSeed(value int64) AllCandidateSamplerAttr {
 	return func(m optionalAttr) {
-		m["adaptative"] = value
+		m["seed"] = value
 	}
 }
 
-// Distributed version of Stochastic Dual Coordinate Ascent (SDCA) optimizer for
-//
-// linear models with L1 + L2 regularization. As global optimization objective is
-// strongly-convex, the optimizer optimizes the dual objective at each step. The
-// optimizer applies each update one example at a time. Examples are sampled
-// uniformly, and the optimizer is learning rate free and enjoys linear convergence
-// rate.
+// AllCandidateSamplerSeed2 sets the optional seed2 attribute to value.
 //
-// [Proximal Stochastic Dual Coordinate Ascent](http://arxiv.org/pdf/1211.2717v1.pdf).<br>
-// Shai Shalev-Shwartz, Tong Zhang. 2012
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func AllCandidateSamplerSeed2(value int64) AllCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// Generates labels for candidate sampling with a learned unigram distribution.
 //
-// $$Loss Objective = \sum f_{i} (wx_{i}) + (l2 / 2) * |w|^2 + l1 * |w|$$
+// See explanations of candidate sampling and the data formats at
+// go/candidate-sampling.
 //
-// [Adding vs. Averaging in Distributed Primal-Dual Optimization](http://arxiv.org/abs/1502.03508).<br>
-// Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan,
-// Peter Richtarik, Martin Takac. 2015
+// For each batch, this op picks a single set of sampled candidate labels.
 //
-// [Stochastic Dual Coordinate Ascent with Adaptive Probabilities](https://arxiv.org/abs/1502.08053).<br>
-// Dominik Csiba, Zheng Qu, Peter Richtarik. 2015
+// The advantages of sampling candidates per-batch are simplicity and the
+// possibility of efficient dense matrix multiplication. The disadvantage is that
+// the sampled candidates must be chosen independently of the context and of the
+// true labels.
 //
 // Arguments:
-//	sparse_example_indices: a list of vectors which contain example indices.
-//	sparse_feature_indices: a list of vectors which contain feature indices.
-//	sparse_feature_values: a list of vectors which contains feature value
-// associated with each feature group.
-//	dense_features: a list of matrices which contains the dense feature values.
-//	example_weights: a vector which contains the weight associated with each
-// example.
-//	example_labels: a vector which contains the label/target associated with each
-// example.
-//	sparse_indices: a list of vectors where each value is the indices which has
-// corresponding weights in sparse_weights. This field maybe omitted for the
-// dense approach.
-//	sparse_weights: a list of vectors where each value is the weight associated with
-// a sparse feature group.
-//	dense_weights: a list of vectors where the values are the weights associated
-// with a dense feature group.
-//	example_state_data: a list of vectors containing the example state data.
-//	loss_type: Type of the primal loss. Currently SdcaSolver supports logistic,
-// squared and hinge losses.
-//	l1: Symmetric l1 regularization strength.
-//	l2: Symmetric l2 regularization strength.
-//	num_loss_partitions: Number of partitions of the global loss function.
-//	num_inner_iterations: Number of iterations per mini-batch.
+//	true_classes: A batch_size * num_true matrix, in which each row contains the
+// IDs of the num_true target_classes in the corresponding original label.
+//	num_true: Number of true labels per context.
+//	num_sampled: Number of candidates to produce.
+//	unique: If unique is true, we sample with rejection, so that all sampled
+// candidates in a batch are unique. This requires some approximation to
+// estimate the post-rejection sampling probabilities.
 //
-// Returns a list of vectors containing the updated example state
-// data.a list of vectors where each value is the delta
-// weights associated with a sparse feature group.a list of vectors where the values are the delta
-// weights associated with a dense feature group.
-func SdcaOptimizer(scope *Scope, sparse_example_indices []tf.Output, sparse_feature_indices []tf.Output, sparse_feature_values []tf.Output, dense_features []tf.Output, example_weights tf.Output, example_labels tf.Output, sparse_indices []tf.Output, sparse_weights []tf.Output, dense_weights []tf.Output, example_state_data tf.Output, loss_type string, l1 float32, l2 float32, num_loss_partitions int64, num_inner_iterations int64, optional ...SdcaOptimizerAttr) (out_example_state_data tf.Output, out_delta_sparse_weights []tf.Output, out_delta_dense_weights []tf.Output) {
+// Returns sampled_candidates, a vector of length num_sampled in which each element
+// is the ID of a sampled candidate; true_expected_count, a batch_size * num_true
+// matrix representing the number of times each candidate is expected to occur in a
+// batch of sampled candidates (if unique=true, this is a probability); and
+// sampled_expected_count, a vector of length num_sampled giving, for each sampled
+// candidate, the number of times it is expected to occur in a batch of sampled
+// candidates (if unique=true, this is a probability).
+func AllCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, optional ...AllCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"loss_type": loss_type, "l1": l1, "l2": l2, "num_loss_partitions": num_loss_partitions, "num_inner_iterations": num_inner_iterations}
+	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SdcaOptimizer",
+		Type: "AllCandidateSampler",
 		Input: []tf.Input{
-			tf.OutputList(sparse_example_indices), tf.OutputList(sparse_feature_indices), tf.OutputList(sparse_feature_values), tf.OutputList(dense_features), example_weights, example_labels, tf.OutputList(sparse_indices), tf.OutputList(sparse_weights), tf.OutputList(dense_weights), example_state_data,
+			true_classes,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
+}
+
+// Saves the input tensors to disk.
+//
+// The size of `tensor_names` must match the number of tensors in `data`. `data[i]`
+// is written to `filename` with name `tensor_names[i]`.
+//
+// See also `SaveSlices`.
+//
+// Arguments:
+//	filename: Must have a single element. The name of the file to which we write
+// the tensor.
+//	tensor_names: Shape `[N]`. The names of the tensors to be saved.
+//	data: `N` tensors to save.
+//
+// Returns the created operation.
+func Save(scope *Scope, filename tf.Output, tensor_names tf.Output, data []tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	var idx int
-	var err error
-	out_example_state_data = op.Output(idx)
-	if out_delta_sparse_weights, idx, err = makeOutputList(op, idx, "out_delta_sparse_weights"); err != nil {
-		scope.UpdateErr("SdcaOptimizer", err)
-		return
+	opspec := tf.OpSpec{
+		Type: "Save",
+		Input: []tf.Input{
+			filename, tensor_names, tf.OutputList(data),
+		},
 	}
-	if out_delta_dense_weights, idx, err = makeOutputList(op, idx, "out_delta_dense_weights"); err != nil {
-		scope.UpdateErr("SdcaOptimizer", err)
+	return scope.AddOperation(opspec)
+}
+
+// Returns element-wise remainder of division.
+//
+// When `x < 0` xor `y < 0` is true, this follows Python semantics in that the result
+// is consistent with a flooring divide. E.g. `floor(x / y) * y + mod(x, y) = x`.
+//
+// *NOTE*: `FloorMod` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
+func FloorMod(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
 		return
 	}
-	return out_example_state_data, out_delta_sparse_weights, out_delta_dense_weights
+	opspec := tf.OpSpec{
+		Type: "FloorMod",
+		Input: []tf.Input{
+			x, y,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// SparseMatMulAttr is an optional argument to SparseMatMul.
-type SparseMatMulAttr func(optionalAttr)
+// SparseTensorDenseMatMulAttr is an optional argument to SparseTensorDenseMatMul.
+type SparseTensorDenseMatMulAttr func(optionalAttr)
 
-// SparseMatMulTransposeA sets the optional transpose_a attribute to value.
+// SparseTensorDenseMatMulAdjointA sets the optional adjoint_a attribute to value.
+//
+// value: Use the adjoint of A in the matrix multiply.  If A is complex, this
+// is transpose(conj(A)).  Otherwise it's transpose(A).
 // If not specified, defaults to false
-func SparseMatMulTransposeA(value bool) SparseMatMulAttr {
+func SparseTensorDenseMatMulAdjointA(value bool) SparseTensorDenseMatMulAttr {
 	return func(m optionalAttr) {
-		m["transpose_a"] = value
+		m["adjoint_a"] = value
 	}
 }
 
-// SparseMatMulTransposeB sets the optional transpose_b attribute to value.
+// SparseTensorDenseMatMulAdjointB sets the optional adjoint_b attribute to value.
+//
+// value: Use the adjoint of B in the matrix multiply.  If B is complex, this
+// is transpose(conj(B)).  Otherwise it's transpose(B).
 // If not specified, defaults to false
-func SparseMatMulTransposeB(value bool) SparseMatMulAttr {
+func SparseTensorDenseMatMulAdjointB(value bool) SparseTensorDenseMatMulAttr {
 	return func(m optionalAttr) {
-		m["transpose_b"] = value
+		m["adjoint_b"] = value
 	}
 }
 
-// SparseMatMulAIsSparse sets the optional a_is_sparse attribute to value.
-// If not specified, defaults to false
-func SparseMatMulAIsSparse(value bool) SparseMatMulAttr {
-	return func(m optionalAttr) {
-		m["a_is_sparse"] = value
-	}
-}
-
-// SparseMatMulBIsSparse sets the optional b_is_sparse attribute to value.
-// If not specified, defaults to false
-func SparseMatMulBIsSparse(value bool) SparseMatMulAttr {
-	return func(m optionalAttr) {
-		m["b_is_sparse"] = value
-	}
-}
-
-// Multiply matrix "a" by matrix "b".
+// Multiply SparseTensor (of rank 2) "A" by dense matrix "B".
 //
-// The inputs must be two-dimensional matrices and the inner dimension of "a" must
-// match the outer dimension of "b". This op is optimized for the case where at
-// least one of "a" or "b" is sparse. The breakeven for using this versus a dense
-// matrix multiply on one platform was 30% zero values in the sparse matrix.
+// No validity checking is performed on the indices of A.  However, the following
+// input format is recommended for optimal behavior:
 //
-// The gradient computation of this operation will only take advantage of sparsity
-// in the input gradient when that gradient comes from a Relu.
-func SparseMatMul(scope *Scope, a tf.Output, b tf.Output, optional ...SparseMatMulAttr) (product tf.Output) {
+// if adjoint_a == false:
+//   A should be sorted in lexicographically increasing order.  Use SparseReorder
+//   if you're not sure.
+// if adjoint_a == true:
+//   A should be sorted in order of increasing dimension 1 (i.e., "column major"
+//   order instead of "row major" order).
+//
+// Arguments:
+//	a_indices: 2-D.  The `indices` of the `SparseTensor`, size `[nnz, 2]` Matrix.
+//	a_values: 1-D.  The `values` of the `SparseTensor`, size `[nnz]` Vector.
+//	a_shape: 1-D.  The `shape` of the `SparseTensor`, size `[2]` Vector.
+//	b: 2-D.  A dense Matrix.
+func SparseTensorDenseMatMul(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b tf.Output, optional ...SparseTensorDenseMatMulAttr) (product tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -14263,9 +14442,9 @@ func SparseMatMul(scope *Scope, a tf.Output, b tf.Output, optional ...SparseMatM
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseMatMul",
+		Type: "SparseTensorDenseMatMul",
 		Input: []tf.Input{
-			a, b,
+			a_indices, a_values, a_shape, b,
 		},
 		Attrs: attrs,
 	}
@@ -14273,52 +14452,92 @@ func SparseMatMul(scope *Scope, a tf.Output, b tf.Output, optional ...SparseMatM
 	return op.Output(0)
 }
 
-// Computes the power of one value to another.
+// Deserialize and concatenate `SparseTensors` from a serialized minibatch.
 //
-// Given a tensor `x` and a tensor `y`, this operation computes \\(x^y\\) for
-// corresponding elements in `x` and `y`. For example:
+// The input `serialized_sparse` must be a string matrix of shape `[N x 3]` where
+// `N` is the minibatch size and the rows correspond to packed outputs of
+// `SerializeSparse`.  The ranks of the original `SparseTensor` objects
+// must all match.  When the final `SparseTensor` is created, it has rank one
+// higher than the ranks of the incoming `SparseTensor` objects
+// (they have been concatenated along a new row dimension).
 //
-// ```
-// # tensor 'x' is [[2, 2]], [3, 3]]
-// # tensor 'y' is [[8, 16], [2, 3]]
-// tf.pow(x, y) ==> [[256, 65536], [9, 27]]
-// ```
-func Pow(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// The output `SparseTensor` object's shape values for all dimensions but the
+// first are the max across the input `SparseTensor` objects' shape values
+// for the corresponding dimensions.  Its first shape value is `N`, the minibatch
+// size.
+//
+// The input `SparseTensor` objects' indices are assumed ordered in
+// standard lexicographic order.  If this is not the case, after this
+// step run `SparseReorder` to restore index ordering.
+//
+// For example, if the serialized input is a `[2 x 3]` matrix representing two
+// original `SparseTensor` objects:
+//
+//     index = [ 0]
+//             [10]
+//             [20]
+//     values = [1, 2, 3]
+//     shape = [50]
+//
+// and
+//
+//     index = [ 2]
+//             [10]
+//     values = [4, 5]
+//     shape = [30]
+//
+// then the final deserialized `SparseTensor` will be:
+//
+//     index = [0  0]
+//             [0 10]
+//             [0 20]
+//             [1  2]
+//             [1 10]
+//     values = [1, 2, 3, 4, 5]
+//     shape = [2 50]
+//
+// Arguments:
+//	serialized_sparse: 2-D, The `N` serialized `SparseTensor` objects.
+// Must have 3 columns.
+//	dtype: The `dtype` of the serialized `SparseTensor` objects.
+func DeserializeManySparse(scope *Scope, serialized_sparse tf.Output, dtype tf.DataType) (sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtype": dtype}
 	opspec := tf.OpSpec{
-		Type: "Pow",
+		Type: "DeserializeManySparse",
 		Input: []tf.Input{
-			x, y,
+			serialized_sparse,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// ShapeAttr is an optional argument to Shape.
-type ShapeAttr func(optionalAttr)
+// StringJoinAttr is an optional argument to StringJoin.
+type StringJoinAttr func(optionalAttr)
 
-// ShapeOutType sets the optional out_type attribute to value.
-// If not specified, defaults to DT_INT32
-func ShapeOutType(value tf.DataType) ShapeAttr {
+// StringJoinSeparator sets the optional separator attribute to value.
+//
+// value: string, an optional join separator.
+// If not specified, defaults to ""
+func StringJoinSeparator(value string) StringJoinAttr {
 	return func(m optionalAttr) {
-		m["out_type"] = value
+		m["separator"] = value
 	}
 }
 
-// Returns the shape of a tensor.
-//
-// This operation returns a 1-D integer tensor representing the shape of `input`.
+// Joins the strings in the given list of string tensors into one tensor;
 //
-// For example:
+// with the given separator (default is an empty separator).
 //
-// ```
-// # 't' is [[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]
-// shape(t) ==> [2, 2, 3]
-// ```
-func Shape(scope *Scope, input tf.Output, optional ...ShapeAttr) (output tf.Output) {
+// Arguments:
+//	inputs: A list of string tensors.  The tensors must all have the same shape,
+// or be scalars.  Scalars may be mixed in; these will be broadcast to the shape
+// of non-scalar inputs.
+func StringJoin(scope *Scope, inputs []tf.Output, optional ...StringJoinAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -14327,9 +14546,9 @@ func Shape(scope *Scope, input tf.Output, optional ...ShapeAttr) (output tf.Outp
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Shape",
+		Type: "StringJoin",
 		Input: []tf.Input{
-			input,
+			tf.OutputList(inputs),
 		},
 		Attrs: attrs,
 	}
@@ -14337,416 +14556,451 @@ func Shape(scope *Scope, input tf.Output, optional ...ShapeAttr) (output tf.Outp
 	return op.Output(0)
 }
 
-// Computes fingerprints of the input strings.
+// Returns immutable tensor from memory region.
 //
-// Arguments:
-//	input: vector of strings to compute fingerprints on.
+// The current implementation memmaps the tensor from a file.
 //
-// Returns a (N,2) shaped matrix where N is the number of elements in the input
-// vector. Each row contains the low and high parts of the fingerprint.
-func SdcaFprint(scope *Scope, input tf.Output) (output tf.Output) {
+// Arguments:
+//	dtype: Type of the returned tensor.
+//	shape: Shape of the returned tensor.
+//	memory_region_name: Name of readonly memory region used by the tensor, see
+// NewReadOnlyMemoryRegionFromFile in tensorflow::Env.
+func ImmutableConst(scope *Scope, dtype tf.DataType, shape tf.Shape, memory_region_name string) (tensor tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtype": dtype, "shape": shape, "memory_region_name": memory_region_name}
 	opspec := tf.OpSpec{
-		Type: "SdcaFprint",
-		Input: []tf.Input{
-			input,
-		},
+		Type: "ImmutableConst",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// RandomPoissonV2Attr is an optional argument to RandomPoissonV2.
-type RandomPoissonV2Attr func(optionalAttr)
-
-// RandomPoissonV2Seed sets the optional seed attribute to value.
-//
-// value: If either `seed` or `seed2` are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func RandomPoissonV2Seed(value int64) RandomPoissonV2Attr {
-	return func(m optionalAttr) {
-		m["seed"] = value
-	}
-}
-
-// RandomPoissonV2Seed2 sets the optional seed2 attribute to value.
+// Inverse real-valued fast Fourier transform.
 //
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func RandomPoissonV2Seed2(value int64) RandomPoissonV2Attr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
-	}
-}
-
-// RandomPoissonV2Dtype sets the optional dtype attribute to value.
-// If not specified, defaults to DT_INT64
-func RandomPoissonV2Dtype(value tf.DataType) RandomPoissonV2Attr {
-	return func(m optionalAttr) {
-		m["dtype"] = value
-	}
-}
-
-// Outputs random values from the Poisson distribution(s) described by rate.
+// Computes the inverse 1-dimensional discrete Fourier transform of a real-valued
+// signal over the inner-most dimension of `input`.
 //
-// This op uses two algorithms, depending on rate. If rate >= 10, then
-// the algorithm by Hormann is used to acquire samples via
-// transformation-rejection.
-// See http://www.sciencedirect.com/science/article/pii/0167668793909974.
+// The inner-most dimension of `input` is assumed to be the result of `RFFT`: the
+// `fft_length / 2 + 1` unique components of the DFT of a real-valued signal. If
+// `fft_length` is not provided, it is computed from the size of the inner-most
+// dimension of `input` (`fft_length = 2 * (inner - 1)`). If the FFT length used to
+// compute `input` is odd, it should be provided since it cannot be inferred
+// properly.
 //
-// Otherwise, Knuth's algorithm is used to acquire samples via multiplying uniform
-// random variables.
-// See Donald E. Knuth (1969). Seminumerical Algorithms. The Art of Computer
-// Programming, Volume 2. Addison Wesley
+// Along the axis `IRFFT` is computed on, if `fft_length / 2 + 1` is smaller
+// than the corresponding dimension of `input`, the dimension is cropped. If it is
+// larger, the dimension is padded with zeros.
 //
 // Arguments:
-//	shape: 1-D integer tensor. Shape of independent samples to draw from each
-// distribution described by the shape parameters given in rate.
-//	rate: A tensor in which each scalar is a "rate" parameter describing the
-// associated poisson distribution.
+//	input: A complex64 tensor.
+//	fft_length: An int32 tensor of shape [1]. The FFT length.
 //
-// Returns A tensor with shape `shape + shape(rate)`. Each slice
-// `[:, ..., :, i0, i1, ...iN]` contains the samples drawn for
-// `rate[i0, i1, ...iN]`.
-func RandomPoissonV2(scope *Scope, shape tf.Output, rate tf.Output, optional ...RandomPoissonV2Attr) (output tf.Output) {
+// Returns A float32 tensor of the same rank as `input`. The inner-most
+//   dimension of `input` is replaced with the `fft_length` samples of its inverse
+//   1D Fourier transform.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.irfft
+// @end_compatibility
+func IRFFT(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "RandomPoissonV2",
+		Type: "IRFFT",
 		Input: []tf.Input{
-			shape, rate,
+			input, fft_length,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MatrixTriangularSolveAttr is an optional argument to MatrixTriangularSolve.
-type MatrixTriangularSolveAttr func(optionalAttr)
-
-// MatrixTriangularSolveLower sets the optional lower attribute to value.
-//
-// value: Boolean indicating whether the innermost matrices in `matrix` are
-// lower or upper triangular.
-// If not specified, defaults to true
-func MatrixTriangularSolveLower(value bool) MatrixTriangularSolveAttr {
-	return func(m optionalAttr) {
-		m["lower"] = value
-	}
-}
-
-// MatrixTriangularSolveAdjoint sets the optional adjoint attribute to value.
+// Concatenates a list of `SparseTensor` along the specified dimension.
 //
-// value: Boolean indicating whether to solve with `matrix` or its (block-wise)
-//          adjoint.
+// Concatenation is with respect to the dense versions of these sparse tensors.
+// It is assumed that each input is a `SparseTensor` whose elements are ordered
+// along increasing dimension number.
 //
-// @compatibility(numpy)
-// Equivalent to np.linalg.triangular_solve
-// @end_compatibility
-// If not specified, defaults to false
-func MatrixTriangularSolveAdjoint(value bool) MatrixTriangularSolveAttr {
-	return func(m optionalAttr) {
-		m["adjoint"] = value
-	}
-}
-
-// Solves systems of linear equations with upper or lower triangular matrices by
+// All inputs' shapes must match, except for the concat dimension.  The
+// `indices`, `values`, and `shapes` lists must have the same length.
 //
-// backsubstitution.
+// The output shape is identical to the inputs', except along the concat
+// dimension, where it is the sum of the inputs' sizes along that dimension.
 //
-// `matrix` is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions form
-// square matrices. If `lower` is `True` then the strictly upper triangular part
-// of each inner-most matrix is assumed to be zero and not accessed.
-// If `lower` is False then the strictly lower triangular part of each inner-most
-// matrix is assumed to be zero and not accessed.
-// `rhs` is a tensor of shape `[..., M, K]`.
+// The output elements will be resorted to preserve the sort order along
+// increasing dimension number.
 //
-// The output is a tensor of shape `[..., M, K]`. If `adjoint` is
-// `True` then the innermost matrices in `output` satisfy matrix equations
-// `matrix[..., :, :] * output[..., :, :] = rhs[..., :, :]`.
-// If `adjoint` is `False` then the strictly then the  innermost matrices in
-// `output` satisfy matrix equations
-// `adjoint(matrix[..., i, k]) * output[..., k, j] = rhs[..., i, j]`.
+// This op runs in `O(M log M)` time, where `M` is the total number of non-empty
+// values across all inputs. This is due to the need for an internal sort in
+// order to concatenate efficiently across an arbitrary dimension.
+//
+// For example, if `concat_dim = 1` and the inputs are
+//
+//     sp_inputs[0]: shape = [2, 3]
+//     [0, 2]: "a"
+//     [1, 0]: "b"
+//     [1, 1]: "c"
+//
+//     sp_inputs[1]: shape = [2, 4]
+//     [0, 1]: "d"
+//     [0, 2]: "e"
+//
+// then the output will be
+//
+//     shape = [2, 7]
+//     [0, 2]: "a"
+//     [0, 4]: "d"
+//     [0, 5]: "e"
+//     [1, 0]: "b"
+//     [1, 1]: "c"
+//
+// Graphically this is equivalent to doing
+//
+//     [    a] concat [  d e  ] = [    a   d e  ]
+//     [b c  ]        [       ]   [b c          ]
 //
 // Arguments:
-//	matrix: Shape is `[..., M, M]`.
-//	rhs: Shape is `[..., M, K]`.
+//	indices: 2-D.  Indices of each input `SparseTensor`.
+//	values: 1-D.  Non-empty values of each `SparseTensor`.
+//	shapes: 1-D.  Shapes of each `SparseTensor`.
+//	concat_dim: Dimension to concatenate along. Must be in range [-rank, rank),
+// where rank is the number of dimensions in each input `SparseTensor`.
 //
-// Returns Shape is `[..., M, K]`.
-func MatrixTriangularSolve(scope *Scope, matrix tf.Output, rhs tf.Output, optional ...MatrixTriangularSolveAttr) (output tf.Output) {
+// Returns 2-D.  Indices of the concatenated `SparseTensor`. 1-D.  Non-empty values of the concatenated `SparseTensor`. 1-D.  Shape of the concatenated `SparseTensor`.
+func SparseConcat(scope *Scope, indices []tf.Output, values []tf.Output, shapes []tf.Output, concat_dim int64) (output_indices tf.Output, output_values tf.Output, output_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"concat_dim": concat_dim}
 	opspec := tf.OpSpec{
-		Type: "MatrixTriangularSolve",
+		Type: "SparseConcat",
 		Input: []tf.Input{
-			matrix, rhs,
+			tf.OutputList(indices), tf.OutputList(values), tf.OutputList(shapes),
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes inverse hyperbolic sine of x element-wise.
-func Asinh(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Asinh",
-		Input: []tf.Input{
-			x,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Creates a dataset with a range of values. Corresponds to python's xrange.
+// Generates sparse cross from a list of sparse and dense tensors.
+//
+// The op takes two lists, one of 2D `SparseTensor` and one of 2D `Tensor`, each
+// representing features of one feature column. It outputs a 2D `SparseTensor` with
+// the batchwise crosses of these features.
+//
+// For example, if the inputs are
+//
+//     inputs[0]: SparseTensor with shape = [2, 2]
+//     [0, 0]: "a"
+//     [1, 0]: "b"
+//     [1, 1]: "c"
+//
+//     inputs[1]: SparseTensor with shape = [2, 1]
+//     [0, 0]: "d"
+//     [1, 0]: "e"
+//
+//     inputs[2]: Tensor [["f"], ["g"]]
+//
+// then the output will be
+//
+//     shape = [2, 2]
+//     [0, 0]: "a_X_d_X_f"
+//     [1, 0]: "b_X_e_X_g"
+//     [1, 1]: "c_X_e_X_g"
+//
+// If `hashed_output` is true, then the output will be
+//
+//     shape = [2, 2]
+//     [0, 0]: FingerprintCat64(
+//                 Fingerprint64("f"), FingerprintCat64(
+//                     Fingerprint64("d"), Fingerprint64("a")))
+//     [1, 0]: FingerprintCat64(
+//                 Fingerprint64("g"), FingerprintCat64(
+//                     Fingerprint64("e"), Fingerprint64("b")))
+//     [1, 1]: FingerprintCat64(
+//                 Fingerprint64("g"), FingerprintCat64(
+//                     Fingerprint64("e"), Fingerprint64("c")))
 //
 // Arguments:
-//	start: corresponds to start in python's xrange().
-//	stop: corresponds to stop in python's xrange().
-//	step: corresponds to step in python's xrange().
+//	indices: 2-D.  Indices of each input `SparseTensor`.
+//	values: 1-D.  Values of each `SparseTensor`.
+//	shapes: 1-D.  Shapes of each `SparseTensor`.
+//	dense_inputs: 2-D.  Columns represented by dense `Tensor`.
+//	hashed_output: If true, returns the hash of the cross instead of the string.
+// This avoids string manipulations.
+//	num_buckets: Used only if hashed_output is true.
+// output = hashed_value%num_buckets if num_buckets > 0 else hashed_value.
+//	hash_key: Specify the hash_key that will be used by the `FingerprintCat64`
+// function to combine the crosses' fingerprints.
 //
 //
-func RangeDataset(scope *Scope, start tf.Output, stop tf.Output, step tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+//
+// Returns 2-D.  Indices of the concatenated `SparseTensor`. 1-D.  Non-empty values of the concatenated or hashed
+// `SparseTensor`. 1-D.  Shape of the concatenated `SparseTensor`.
+func SparseCross(scope *Scope, indices []tf.Output, values []tf.Output, shapes []tf.Output, dense_inputs []tf.Output, hashed_output bool, num_buckets int64, hash_key int64, out_type tf.DataType, internal_type tf.DataType) (output_indices tf.Output, output_values tf.Output, output_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{"hashed_output": hashed_output, "num_buckets": num_buckets, "hash_key": hash_key, "out_type": out_type, "internal_type": internal_type}
 	opspec := tf.OpSpec{
-		Type: "RangeDataset",
+		Type: "SparseCross",
 		Input: []tf.Input{
-			start, stop, step,
+			tf.OutputList(indices), tf.OutputList(values), tf.OutputList(shapes), tf.OutputList(dense_inputs),
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// DepthwiseConv2dNativeBackpropInputAttr is an optional argument to DepthwiseConv2dNativeBackpropInput.
-type DepthwiseConv2dNativeBackpropInputAttr func(optionalAttr)
-
-// DepthwiseConv2dNativeBackpropInputDataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, height, width, channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, channels, height, width].
-// If not specified, defaults to "NHWC"
-func DepthwiseConv2dNativeBackpropInputDataFormat(value string) DepthwiseConv2dNativeBackpropInputAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// DepthwiseConv2dNativeBackpropInputDilations sets the optional dilations attribute to value.
-//
-// value: 1-D tensor of length 4.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each filter
-// element on that dimension. The dimension order is determined by the value of
-// `data_format`, see above for details. Dilations in the batch and depth
-// dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 >
-func DepthwiseConv2dNativeBackpropInputDilations(value []int64) DepthwiseConv2dNativeBackpropInputAttr {
-	return func(m optionalAttr) {
-		m["dilations"] = value
-	}
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Computes the gradients of depthwise convolution with respect to the input.
+// Concatenates quantized tensors along one dimension.
 //
 // Arguments:
-//	input_sizes: An integer vector representing the shape of `input`, based
-// on `data_format`.  For example, if `data_format` is 'NHWC' then
-//  `input` is a 4-D `[batch, height, width, channels]` tensor.
-//	filter: 4-D with shape
-// `[filter_height, filter_width, in_channels, depthwise_multiplier]`.
-//	out_backprop: 4-D with shape  based on `data_format`.
-// For example, if `data_format` is 'NHWC' then
-// out_backprop shape is `[batch, out_height, out_width, out_channels]`.
-// Gradients w.r.t. the output of the convolution.
-//	strides: The stride of the sliding window for each dimension of the input
-// of the convolution.
-//	padding: The type of padding algorithm to use.
+//	concat_dim: 0-D.  The dimension along which to concatenate.  Must be in the
+// range [0, rank(values)).
+//	values: The `N` Tensors to concatenate. Their ranks and types must match,
+// and their sizes must match in all dimensions except `concat_dim`.
+//	input_mins: The minimum scalar values for each of the input tensors.
+//	input_maxes: The maximum scalar values for each of the input tensors.
 //
-// Returns 4-D with shape according to `data_format`.  For example, if
-// `data_format` is 'NHWC', output shape is `[batch, in_height,
-// in_width, in_channels]`.  Gradient w.r.t. the input of the
-// convolution.
-func DepthwiseConv2dNativeBackpropInput(scope *Scope, input_sizes tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...DepthwiseConv2dNativeBackpropInputAttr) (output tf.Output) {
+// Returns A `Tensor` with the concatenation of values stacked along the
+// `concat_dim` dimension.  This tensor's shape matches that of `values` except
+// in `concat_dim` where it has the sum of the sizes. The float value that the minimum quantized output value represents. The float value that the maximum quantized output value represents.
+func QuantizedConcat(scope *Scope, concat_dim tf.Output, values []tf.Output, input_mins []tf.Output, input_maxes []tf.Output) (output tf.Output, output_min tf.Output, output_max tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "DepthwiseConv2dNativeBackpropInput",
+		Type: "QuantizedConcat",
 		Input: []tf.Input{
-			input_sizes, filter, out_backprop,
+			concat_dim, tf.OutputList(values), tf.OutputList(input_mins), tf.OutputList(input_maxes),
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Adds sparse updates to the variable referenced by `resource`.
-//
-// This operation computes
+// Slice a `SparseTensor` based on the `start` and `size`.
 //
-//     # Scalar indices
-//     ref[indices, ...] += updates[...]
+// For example, if the input is
 //
-//     # Vector indices (for each i)
-//     ref[indices[i], ...] += updates[i, ...]
+//     input_tensor = shape = [2, 7]
+//     [    a   d e  ]
+//     [b c          ]
 //
-//     # High rank indices (for each i, ..., j)
-//     ref[indices[i, ..., j], ...] += updates[i, ..., j, ...]
+// Graphically the output tensors are:
 //
-// Duplicate entries are handled correctly: if multiple `indices` reference
-// the same location, their contributions add.
+//     sparse_slice([0, 0], [2, 4]) = shape = [2, 4]
+//     [    a  ]
+//     [b c    ]
 //
-// Requires `updates.shape = indices.shape + ref.shape[1:]`.
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src='https://www.tensorflow.org/images/ScatterAdd.png' alt>
-// </div>
+//     sparse_slice([0, 4], [2, 3]) = shape = [2, 3]
+//     [ d e  ]
+//     [      ]
 //
 // Arguments:
-//	resource: Should be from a `Variable` node.
-//	indices: A tensor of indices into the first dimension of `ref`.
-//	updates: A tensor of updated values to add to `ref`.
+//	indices: 2-D tensor representing the indices of the sparse tensor.
+//	values: 1-D tensor representing the values of the sparse tensor.
+//	shape: 1-D tensor representing the shape of the sparse tensor.
+//	start: 1-D tensor representing the start of the slice.
+//	size: 1-D tensor representing the size of the slice.
+// output indices: A list of 1-D tensors representing the indices of the output
+// sparse tensors.
 //
-// Returns the created operation.
-func ResourceScatterAdd(scope *Scope, resource tf.Output, indices tf.Output, updates tf.Output) (o *tf.Operation) {
+// Returns A list of 1-D tensors representing the values of the output sparse
+// tensors. A list of 1-D tensors representing the shape of the output sparse
+// tensors.
+func SparseSlice(scope *Scope, indices tf.Output, values tf.Output, shape tf.Output, start tf.Output, size tf.Output) (output_indices tf.Output, output_values tf.Output, output_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceScatterAdd",
+		Type: "SparseSlice",
 		Input: []tf.Input{
-			resource, indices, updates,
+			indices, values, shape, start, size,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Computes the gradient for the inverse of `x` wrt its input.
+// Adds up a `SparseTensor` and a dense `Tensor`, producing a dense `Tensor`.
 //
-// Specifically, `grad = -dy * y*y`, where `y = 1/x`, and `dy`
-// is the corresponding input gradient.
-func ReciprocalGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
+// This Op does not require `a_indices` to be sorted in standard lexicographic order.
+//
+// Arguments:
+//	a_indices: 2-D.  The `indices` of the `SparseTensor`, with shape `[nnz, ndims]`.
+//	a_values: 1-D.  The `values` of the `SparseTensor`, with shape `[nnz]`.
+//	a_shape: 1-D.  The `shape` of the `SparseTensor`, with shape `[ndims]`.
+//	b: `ndims`-D Tensor.  With shape `a_shape`.
+func SparseTensorDenseAdd(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ReciprocalGrad",
+		Type: "SparseTensorDenseAdd",
 		Input: []tf.Input{
-			y, dy,
+			a_indices, a_values, a_shape, b,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns the min of x and y (i.e. x < y ? x : y) element-wise.
+// Returns the set of files matching one or more glob patterns.
 //
-// *NOTE*: `Minimum` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Minimum(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Note that this routine only supports wildcard characters in the
+// basename portion of the pattern, not in the directory portion.
+//
+// Arguments:
+//	pattern: Shell wildcard pattern(s). Scalar or vector of type string.
+//
+// Returns A vector of matching filenames.
+func MatchingFiles(scope *Scope, pattern tf.Output) (filenames tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Minimum",
+		Type: "MatchingFiles",
 		Input: []tf.Input{
-			x, y,
+			pattern,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MfccAttr is an optional argument to Mfcc.
-type MfccAttr func(optionalAttr)
+// SparseToSparseSetOperationAttr is an optional argument to SparseToSparseSetOperation.
+type SparseToSparseSetOperationAttr func(optionalAttr)
 
-// MfccUpperFrequencyLimit sets the optional upper_frequency_limit attribute to value.
-//
-// value: The highest frequency to use when calculating the
-// ceptstrum.
-// If not specified, defaults to 4000
-func MfccUpperFrequencyLimit(value float32) MfccAttr {
+// SparseToSparseSetOperationValidateIndices sets the optional validate_indices attribute to value.
+// If not specified, defaults to true
+func SparseToSparseSetOperationValidateIndices(value bool) SparseToSparseSetOperationAttr {
 	return func(m optionalAttr) {
-		m["upper_frequency_limit"] = value
+		m["validate_indices"] = value
 	}
 }
 
-// MfccLowerFrequencyLimit sets the optional lower_frequency_limit attribute to value.
+// Applies set operation along last dimension of 2 `SparseTensor` inputs.
 //
-// value: The lowest frequency to use when calculating the
-// ceptstrum.
-// If not specified, defaults to 20
-func MfccLowerFrequencyLimit(value float32) MfccAttr {
-	return func(m optionalAttr) {
-		m["lower_frequency_limit"] = value
+// See SetOperationOp::SetOperationFromContext for values of `set_operation`.
+//
+// If `validate_indices` is `True`, `SparseToSparseSetOperation` validates the
+// order and range of `set1` and `set2` indices.
+//
+// Input `set1` is a `SparseTensor` represented by `set1_indices`, `set1_values`,
+// and `set1_shape`. For `set1` ranked `n`, 1st `n-1` dimensions must be the same
+// as `set2`. Dimension `n` contains values in a set; duplicates are allowed but
+// ignored.
+//
+// Input `set2` is a `SparseTensor` represented by `set2_indices`, `set2_values`,
+// and `set2_shape`. For `set2` ranked `n`, 1st `n-1` dimensions must be the same
+// as `set1`. Dimension `n` contains values in a set; duplicates are allowed but
+// ignored.
+//
+// If `validate_indices` is `True`, this op validates the order and range of `set1`
+// and `set2` indices.
+//
+// Output `result` is a `SparseTensor` represented by `result_indices`,
+// `result_values`, and `result_shape`. For `set1` and `set2` ranked `n`, this
+// has rank `n` and the same 1st `n-1` dimensions as `set1` and `set2`. The `nth`
+// dimension contains the result of `set_operation` applied to the corresponding
+// `[0...n-1]` dimension of `set`.
+//
+// Arguments:
+//	set1_indices: 2D `Tensor`, indices of a `SparseTensor`. Must be in row-major
+// order.
+//	set1_values: 1D `Tensor`, values of a `SparseTensor`. Must be in row-major
+// order.
+//	set1_shape: 1D `Tensor`, shape of a `SparseTensor`. `set1_shape[0...n-1]` must
+// be the same as `set2_shape[0...n-1]`, `set1_shape[n]` is the
+// max set size across `0...n-1` dimensions.
+//	set2_indices: 2D `Tensor`, indices of a `SparseTensor`. Must be in row-major
+// order.
+//	set2_values: 1D `Tensor`, values of a `SparseTensor`. Must be in row-major
+// order.
+//	set2_shape: 1D `Tensor`, shape of a `SparseTensor`. `set2_shape[0...n-1]` must
+// be the same as `set1_shape[0...n-1]`, `set2_shape[n]` is the
+// max set size across `0...n-1` dimensions.
+//
+//
+// Returns 2D indices of a `SparseTensor`. 1D values of a `SparseTensor`. 1D `Tensor` shape of a `SparseTensor`. `result_shape[0...n-1]` is
+// the same as the 1st `n-1` dimensions of `set1` and `set2`, `result_shape[n]`
+// is the max result set size across all `0...n-1` dimensions.
+func SparseToSparseSetOperation(scope *Scope, set1_indices tf.Output, set1_values tf.Output, set1_shape tf.Output, set2_indices tf.Output, set2_values tf.Output, set2_shape tf.Output, set_operation string, optional ...SparseToSparseSetOperationAttr) (result_indices tf.Output, result_values tf.Output, result_shape tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"set_operation": set_operation}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "SparseToSparseSetOperation",
+		Input: []tf.Input{
+			set1_indices, set1_values, set1_shape, set2_indices, set2_values, set2_shape,
+		},
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// MfccFilterbankChannelCount sets the optional filterbank_channel_count attribute to value.
+// Computes numerical negative value element-wise.
 //
-// value: Resolution of the Mel bank used internally.
-// If not specified, defaults to 40
-func MfccFilterbankChannelCount(value int64) MfccAttr {
+// I.e., \\(y = -x\\).
+func Neg(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Neg",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// FakeQuantWithMinMaxVarsAttr is an optional argument to FakeQuantWithMinMaxVars.
+type FakeQuantWithMinMaxVarsAttr func(optionalAttr)
+
+// FakeQuantWithMinMaxVarsNumBits sets the optional num_bits attribute to value.
+// If not specified, defaults to 8
+func FakeQuantWithMinMaxVarsNumBits(value int64) FakeQuantWithMinMaxVarsAttr {
 	return func(m optionalAttr) {
-		m["filterbank_channel_count"] = value
+		m["num_bits"] = value
 	}
 }
 
-// MfccDctCoefficientCount sets the optional dct_coefficient_count attribute to value.
-//
-// value: How many output channels to produce per time slice.
-// If not specified, defaults to 13
-func MfccDctCoefficientCount(value int64) MfccAttr {
+// FakeQuantWithMinMaxVarsNarrowRange sets the optional narrow_range attribute to value.
+// If not specified, defaults to false
+func FakeQuantWithMinMaxVarsNarrowRange(value bool) FakeQuantWithMinMaxVarsAttr {
 	return func(m optionalAttr) {
-		m["dct_coefficient_count"] = value
+		m["narrow_range"] = value
 	}
 }
 
-// Transforms a spectrogram into a form that's useful for speech recognition.
+// Fake-quantize the 'inputs' tensor of type float via global float scalars `min`
 //
-// Mel Frequency Cepstral Coefficients are a way of representing audio data that's
-// been effective as an input feature for machine learning. They are created by
-// taking the spectrum of a spectrogram (a 'cepstrum'), and discarding some of the
-// higher frequencies that are less significant to the human ear. They have a long
-// history in the speech recognition world, and https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
-// is a good resource to learn more.
+// and `max` to 'outputs' tensor of same shape as `inputs`.
 //
-// Arguments:
-//	spectrogram: Typically produced by the Spectrogram op, with magnitude_squared
-// set to true.
-//	sample_rate: How many samples per second the source audio used.
-func Mfcc(scope *Scope, spectrogram tf.Output, sample_rate tf.Output, optional ...MfccAttr) (output tf.Output) {
+// `[min; max]` define the clamping range for the `inputs` data.
+// `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
+// when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
+// then de-quantized and output as floats in `[min; max]` interval.
+// `num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
+//
+// This operation has a gradient and thus allows for training `min` and `max`
+// values.
+func FakeQuantWithMinMaxVars(scope *Scope, inputs tf.Output, min tf.Output, max tf.Output, optional ...FakeQuantWithMinMaxVarsAttr) (outputs tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -14755,9 +15009,9 @@ func Mfcc(scope *Scope, spectrogram tf.Output, sample_rate tf.Output, optional .
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Mfcc",
+		Type: "FakeQuantWithMinMaxVars",
 		Input: []tf.Input{
-			spectrogram, sample_rate,
+			inputs, min, max,
 		},
 		Attrs: attrs,
 	}
@@ -14765,199 +15019,154 @@ func Mfcc(scope *Scope, spectrogram tf.Output, sample_rate tf.Output, optional .
 	return op.Output(0)
 }
 
-// Returns the element-wise sum of a list of tensors.
-//
-// `tf.accumulate_n_v2` performs the same operation as `tf.add_n`, but does not
-// wait for all of its inputs to be ready before beginning to sum. This can
-// save memory if inputs are ready at different times, since minimum temporary
-// storage is proportional to the output size rather than the inputs size.
-//
-// Unlike the original `accumulate_n`, `accumulate_n_v2` is differentiable.
+// Returns the element-wise min of two SparseTensors.
 //
-// Returns a `Tensor` of same shape and type as the elements of `inputs`.
+// Assumes the two SparseTensors have the same shape, i.e., no broadcasting.
 //
 // Arguments:
-//	inputs: A list of `Tensor` objects, each with same shape and type.
-//	shape: Shape of elements of `inputs`.
-func AccumulateNV2(scope *Scope, inputs []tf.Output, shape tf.Shape) (sum tf.Output) {
+//	a_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, in the canonical lexicographic ordering.
+//	a_values: 1-D.  `N` non-empty values corresponding to `a_indices`.
+//	a_shape: 1-D.  Shape of the input SparseTensor.
+//	b_indices: counterpart to `a_indices` for the other operand.
+//	b_values: counterpart to `a_values` for the other operand; must be of the same dtype.
+//	b_shape: counterpart to `a_shape` for the other operand; the two shapes must be equal.
+//
+// Returns 2-D.  The indices of the output SparseTensor. 1-D.  The values of the output SparseTensor.
+func SparseSparseMinimum(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b_indices tf.Output, b_values tf.Output, b_shape tf.Output) (output_indices tf.Output, output_values tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"shape": shape}
 	opspec := tf.OpSpec{
-		Type: "AccumulateNV2",
+		Type: "SparseSparseMinimum",
 		Input: []tf.Input{
-			tf.OutputList(inputs),
+			a_indices, a_values, a_shape, b_indices, b_values, b_shape,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Convert the quantized 'input' tensor into a lower-precision 'output', using the
+// Constructs a tensor by tiling a given tensor.
 //
-// actual distribution of the values to maximize the usage of the lower bit depth
-// and adjusting the output min and max ranges accordingly.
-//
-// [input_min, input_max] are scalar floats that specify the range for the float
-// interpretation of the 'input' data. For example, if input_min is -1.0f and
-// input_max is 1.0f, and we are dealing with quint16 quantized data, then a 0
-// value in the 16-bit data should be interpreted as -1.0f, and a 65535 means 1.0f.
-//
-// This operator tries to squeeze as much precision as possible into an output with
-// a lower bit depth by calculating the actual min and max values found in the
-// data. For example, maybe that quint16 input has no values lower than 16,384 and
-// none higher than 49,152. That means only half the range is actually needed, all
-// the float interpretations are between -0.5f and 0.5f, so if we want to compress
-// the data into a quint8 output, we can use that range rather than the theoretical
-// -1.0f to 1.0f that is suggested by the input min and max.
-//
-// In practice, this is most useful for taking output from operations like
-// QuantizedMatMul that can produce higher bit-depth outputs than their inputs and
-// may have large potential output ranges, but in practice have a distribution of
-// input values that only uses a small fraction of the possible range. By feeding
-// that output into this operator, we can reduce it from 32 bits down to 8 with
-// minimal loss of accuracy.
+// This operation creates a new tensor by replicating `input` `multiples` times.
+// The output tensor's i'th dimension has `input.dims(i) * multiples[i]` elements,
+// and the values of `input` are replicated `multiples[i]` times along the 'i'th
+// dimension. For example, tiling `[a b c d]` by `[2]` produces
+// `[a b c d a b c d]`.
 //
 // Arguments:
-//
-//	input_min: The float value that the minimum quantized input value represents.
-//	input_max: The float value that the maximum quantized input value represents.
-//	out_type: The type of the output. Should be a lower bit depth than Tinput.
-//
-// Returns The float value that the minimum quantized output value represents.The float value that the maximum quantized output value represents.
-func QuantizeDownAndShrinkRange(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, out_type tf.DataType) (output tf.Output, output_min tf.Output, output_max tf.Output) {
+//	input: 1-D or higher.
+//	multiples: 1-D. Length must be the same as the number of dimensions in `input`
+func Tile(scope *Scope, input tf.Output, multiples tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"out_type": out_type}
 	opspec := tf.OpSpec{
-		Type: "QuantizeDownAndShrinkRange",
+		Type: "Tile",
 		Input: []tf.Input{
-			input, input_min, input_max,
+			input, multiples,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// RandomGammaAttr is an optional argument to RandomGamma.
-type RandomGammaAttr func(optionalAttr)
+// TakeManySparseFromTensorsMapAttr is an optional argument to TakeManySparseFromTensorsMap.
+type TakeManySparseFromTensorsMapAttr func(optionalAttr)
 
-// RandomGammaSeed sets the optional seed attribute to value.
+// TakeManySparseFromTensorsMapContainer sets the optional container attribute to value.
 //
-// value: If either `seed` or `seed2` are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func RandomGammaSeed(value int64) RandomGammaAttr {
+// value: The container name for the `SparseTensorsMap` read by this op.
+// If not specified, defaults to ""
+func TakeManySparseFromTensorsMapContainer(value string) TakeManySparseFromTensorsMapAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["container"] = value
 	}
 }
 
-// RandomGammaSeed2 sets the optional seed2 attribute to value.
+// TakeManySparseFromTensorsMapSharedName sets the optional shared_name attribute to value.
 //
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func RandomGammaSeed2(value int64) RandomGammaAttr {
+// value: The shared name for the `SparseTensorsMap` read by this op.
+// It should not be blank; rather the `shared_name` or unique Operation name
+// of the Op that created the original `SparseTensorsMap` should be used.
+// If not specified, defaults to ""
+func TakeManySparseFromTensorsMapSharedName(value string) TakeManySparseFromTensorsMapAttr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["shared_name"] = value
 	}
 }
 
-// Outputs random values from the Gamma distribution(s) described by alpha.
+// Read `SparseTensors` from a `SparseTensorsMap` and concatenate them.
 //
-// This op uses the algorithm by Marsaglia et al. to acquire samples via
-// transformation-rejection from pairs of uniform and normal random variables.
-// See http://dl.acm.org/citation.cfm?id=358414
+// The input `sparse_handles` must be an `int64` matrix of shape `[N, 1]` where
+// `N` is the minibatch size and the rows correspond to the output handles of
+// `AddSparseToTensorsMap` or `AddManySparseToTensorsMap`.  The ranks of the
+// original `SparseTensor` objects that went into the given input ops must all
+// match.  When the final `SparseTensor` is created, it has rank one
+// higher than the ranks of the incoming `SparseTensor` objects
+// (they have been concatenated along a new row dimension on the left).
 //
-// Arguments:
-//	shape: 1-D integer tensor. Shape of independent samples to draw from each
-// distribution described by the shape parameters given in alpha.
-//	alpha: A tensor in which each scalar is a "shape" parameter describing the
-// associated gamma distribution.
+// The output `SparseTensor` object's shape values for all dimensions but the
+// first are the max across the input `SparseTensor` objects' shape values
+// for the corresponding dimensions.  Its first shape value is `N`, the minibatch
+// size.
 //
-// Returns A tensor with shape `shape + shape(alpha)`. Each slice
-// `[:, ..., :, i0, i1, ...iN]` contains the samples drawn for
-// `alpha[i0, i1, ...iN]`. The dtype of the output matches the dtype of alpha.
-func RandomGamma(scope *Scope, shape tf.Output, alpha tf.Output, optional ...RandomGammaAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "RandomGamma",
-		Input: []tf.Input{
-			shape, alpha,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// QuantizedConv2DAttr is an optional argument to QuantizedConv2D.
-type QuantizedConv2DAttr func(optionalAttr)
-
-// QuantizedConv2DOutType sets the optional out_type attribute to value.
-// If not specified, defaults to DT_QINT32
-func QuantizedConv2DOutType(value tf.DataType) QuantizedConv2DAttr {
-	return func(m optionalAttr) {
-		m["out_type"] = value
-	}
-}
-
-// QuantizedConv2DDilations sets the optional dilations attribute to value.
+// The input `SparseTensor` objects' indices are assumed ordered in
+// standard lexicographic order.  If this is not the case, after this
+// step run `SparseReorder` to restore index ordering.
 //
-// value: 1-D tensor of length 4.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each
-// filter element on that dimension. The dimension order is determined by the
-// value of `data_format`, see above for details. Dilations in the batch and
-// depth dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 >
-func QuantizedConv2DDilations(value []int64) QuantizedConv2DAttr {
-	return func(m optionalAttr) {
-		m["dilations"] = value
-	}
-}
-
-// Computes a 2D convolution given quantized 4D input and filter tensors.
+// For example, if the handles represent an input, which is a `[2, 3]` matrix
+// representing two original `SparseTensor` objects:
 //
-// The inputs are quantized tensors where the lowest value represents the real
-// number of the associated minimum, and the highest represents the maximum.
-// This means that you can only interpret the quantized output in the same way, by
-// taking the returned minimum and maximum values into account.
+// ```
+//     index = [ 0]
+//             [10]
+//             [20]
+//     values = [1, 2, 3]
+//     shape = [50]
+// ```
 //
-// Arguments:
+// and
 //
-//	filter: filter's input_depth dimension must match input's depth dimensions.
-//	min_input: The float value that the lowest quantized input value represents.
-//	max_input: The float value that the highest quantized input value represents.
-//	min_filter: The float value that the lowest quantized filter value represents.
-//	max_filter: The float value that the highest quantized filter value represents.
-//	strides: The stride of the sliding window for each dimension of the input
-// tensor.
-//	padding: The type of padding algorithm to use.
+// ```
+//     index = [ 2]
+//             [10]
+//     values = [4, 5]
+//     shape = [30]
+// ```
 //
-// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
-func QuantizedConv2D(scope *Scope, input tf.Output, filter tf.Output, min_input tf.Output, max_input tf.Output, min_filter tf.Output, max_filter tf.Output, strides []int64, padding string, optional ...QuantizedConv2DAttr) (output tf.Output, min_output tf.Output, max_output tf.Output) {
+// then the final `SparseTensor` will be:
+//
+// ```
+//     index = [0  0]
+//             [0 10]
+//             [0 20]
+//             [1  2]
+//             [1 10]
+//     values = [1, 2, 3, 4, 5]
+//     shape = [2 50]
+// ```
+//
+// Arguments:
+//	sparse_handles: 1-D, The `N` serialized `SparseTensor` objects.
+// Shape: `[N]`.
+//	dtype: The `dtype` of the `SparseTensor` objects stored in the
+// `SparseTensorsMap`.
+//
+// Returns 2-D.  The `indices` of the minibatch `SparseTensor`. 1-D.  The `values` of the minibatch `SparseTensor`. 1-D.  The `shape` of the minibatch `SparseTensor`.
+func TakeManySparseFromTensorsMap(scope *Scope, sparse_handles tf.Output, dtype tf.DataType, optional ...TakeManySparseFromTensorsMapAttr) (sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QuantizedConv2D",
+		Type: "TakeManySparseFromTensorsMap",
 		Input: []tf.Input{
-			input, filter, min_input, max_input, min_filter, max_filter,
+			sparse_handles,
 		},
 		Attrs: attrs,
 	}
@@ -14965,270 +15174,240 @@ func QuantizedConv2D(scope *Scope, input tf.Output, filter tf.Output, min_input
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// ResourceGatherAttr is an optional argument to ResourceGather.
-type ResourceGatherAttr func(optionalAttr)
-
-// ResourceGatherValidateIndices sets the optional validate_indices attribute to value.
-// If not specified, defaults to true
-func ResourceGatherValidateIndices(value bool) ResourceGatherAttr {
-	return func(m optionalAttr) {
-		m["validate_indices"] = value
-	}
-}
-
-// Gather slices from the variable pointed to by `resource` according to `indices`.
+// Says whether the targets are in the top `K` predictions.
 //
-// `indices` must be an integer tensor of any dimension (usually 0-D or 1-D).
-// Produces an output tensor with shape `indices.shape + params.shape[1:]` where:
+// This outputs a `batch_size` bool array, an entry `out[i]` is `true` if the
+// prediction for the target class is among the top `k` predictions among
+// all predictions for example `i`. Note that the behavior of `InTopK` differs
+// from the `TopK` op in its handling of ties; if multiple classes have the
+// same prediction value and straddle the top-`k` boundary, all of those
+// classes are considered to be in the top `k`.
 //
-// ```python
-//     # Scalar indices
-//     output[:, ..., :] = params[indices, :, ... :]
+// More formally, let
 //
-//     # Vector indices
-//     output[i, :, ..., :] = params[indices[i], :, ... :]
+//   \\(predictions_i\\) be the predictions for all classes for example `i`,
+//   \\(targets_i\\) be the target class for example `i`,
+//   \\(out_i\\) be the output for example `i`,
 //
-//     # Higher rank indices
-//     output[i, ..., j, :, ... :] = params[indices[i, ..., j], :, ..., :]
-// ```
-func ResourceGather(scope *Scope, resource tf.Output, indices tf.Output, dtype tf.DataType, optional ...ResourceGatherAttr) (output tf.Output) {
+// $$out_i = predictions_{i, targets_i} \in TopKIncludingTies(predictions_i)$$
+//
+// Arguments:
+//	predictions: A `batch_size` x `classes` tensor.
+//	targets: A `batch_size` vector of class ids.
+//	k: Number of top elements to look at for computing precision.
+//
+// Returns Computed precision at `k` as a `bool Tensor`.
+func InTopKV2(scope *Scope, predictions tf.Output, targets tf.Output, k tf.Output) (precision tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ResourceGather",
+		Type: "InTopKV2",
 		Input: []tf.Input{
-			resource, indices,
+			predictions, targets, k,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Delete the TensorArray from its resource container.
+// Assigns a new value to a variable.
 //
-// This enables the user to close and release the resource in the middle
-// of a step/run.
+// Any ReadVariableOp with a control dependency on this op is guaranteed to return
+// this value or a subsequent newer value of the variable.
 //
 // Arguments:
-//	handle: The handle to a TensorArray (output of TensorArray or TensorArrayGrad).
+//	resource: handle to the resource in which to store the variable.
+//	value: the value to set the new tensor to use.
 //
 // Returns the created operation.
-func TensorArrayCloseV3(scope *Scope, handle tf.Output) (o *tf.Operation) {
+func AssignVariableOp(scope *Scope, resource tf.Output, value tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayCloseV3",
+		Type: "AssignVariableOp",
 		Input: []tf.Input{
-			handle,
+			resource, value,
 		},
 	}
 	return scope.AddOperation(opspec)
 }
 
-// RandomUniformIntAttr is an optional argument to RandomUniformInt.
-type RandomUniformIntAttr func(optionalAttr)
-
-// RandomUniformIntSeed sets the optional seed attribute to value.
+// Returns a tensor of ones with the same shape and type as x.
 //
-// value: If either `seed` or `seed2` are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func RandomUniformIntSeed(value int64) RandomUniformIntAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
-	}
-}
-
-// RandomUniformIntSeed2 sets the optional seed2 attribute to value.
+// Arguments:
+//	x: a tensor of type T.
 //
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func RandomUniformIntSeed2(value int64) RandomUniformIntAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
+// Returns a tensor of the same shape and type as x but filled with ones.
+func OnesLike(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "OnesLike",
+		Input: []tf.Input{
+			x,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Outputs random integers from a uniform distribution.
+// The gradient of SparseFillEmptyRows.
 //
-// The generated values are uniform integers in the range `[minval, maxval)`.
-// The lower bound `minval` is included in the range, while the upper bound
-// `maxval` is excluded.
+// Takes vectors reverse_index_map, shaped `[N]`, and grad_values,
+// shaped `[N_full]`, where `N_full >= N` and copies data into either
+// `d_values` or `d_default_value`.  Here `d_values` is shaped `[N]` and
+// `d_default_value` is a scalar.
 //
-// The random integers are slightly biased unless `maxval - minval` is an exact
-// power of two.  The bias is small for values of `maxval - minval` significantly
-// smaller than the range of the output (either `2^32` or `2^64`).
+//   d_values[j] = grad_values[reverse_index_map[j]]
+//   d_default_value = sum_{k : 0 .. N_full - 1} (
+//      grad_values[k] * 1{k not in reverse_index_map})
 //
 // Arguments:
-//	shape: The shape of the output tensor.
-//	minval: 0-D.  Inclusive lower bound on the generated integers.
-//	maxval: 0-D.  Exclusive upper bound on the generated integers.
+//	reverse_index_map: 1-D.  The reverse index map from SparseFillEmptyRows.
+//	grad_values: 1-D.  The gradients from backprop.
 //
-// Returns A tensor of the specified shape filled with uniform random integers.
-func RandomUniformInt(scope *Scope, shape tf.Output, minval tf.Output, maxval tf.Output, optional ...RandomUniformIntAttr) (output tf.Output) {
+// Returns 1-D.  The backprop into values. 0-D.  The backprop into default_value.
+func SparseFillEmptyRowsGrad(scope *Scope, reverse_index_map tf.Output, grad_values tf.Output) (d_values tf.Output, d_default_value tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "RandomUniformInt",
+		Type: "SparseFillEmptyRowsGrad",
 		Input: []tf.Input{
-			shape, minval, maxval,
+			reverse_index_map, grad_values,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// SkipgramAttr is an optional argument to Skipgram.
-type SkipgramAttr func(optionalAttr)
-
-// SkipgramWindowSize sets the optional window_size attribute to value.
+// Computes scaled exponential linear: `scale * alpha * (exp(features) - 1)`
 //
-// value: The number of words to predict to the left and right of the target.
-// If not specified, defaults to 5
-func SkipgramWindowSize(value int64) SkipgramAttr {
-	return func(m optionalAttr) {
-		m["window_size"] = value
-	}
-}
-
-// SkipgramMinCount sets the optional min_count attribute to value.
+// if `features < 0`, `scale * features` otherwise.
 //
-// value: The minimum number of word occurrences for it to be included in the
-// vocabulary.
-// If not specified, defaults to 5
-func SkipgramMinCount(value int64) SkipgramAttr {
-	return func(m optionalAttr) {
-		m["min_count"] = value
+// See [Self-Normalizing Neural Networks](https://arxiv.org/abs/1706.02515)
+func Selu(scope *Scope, features tf.Output) (activations tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Selu",
+		Input: []tf.Input{
+			features,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// SkipgramSubsample sets the optional subsample attribute to value.
-//
-// value: Threshold for word occurrence. Words that appear with higher
-// frequency will be randomly down-sampled. Set to 0 to disable.
-// If not specified, defaults to 0.001
-func SkipgramSubsample(value float32) SkipgramAttr {
+// SetSizeAttr is an optional argument to SetSize.
+type SetSizeAttr func(optionalAttr)
+
+// SetSizeValidateIndices sets the optional validate_indices attribute to value.
+// If not specified, defaults to true
+func SetSizeValidateIndices(value bool) SetSizeAttr {
 	return func(m optionalAttr) {
-		m["subsample"] = value
+		m["validate_indices"] = value
 	}
 }
 
-// Parses a text file and creates a batch of examples.
+// Number of unique elements along last dimension of input `set`.
 //
-// DEPRECATED at GraphDef version 19: Moving word2vec into tensorflow_models/tutorials and deprecating its ops here as a result
+// Input `set` is a `SparseTensor` represented by `set_indices`, `set_values`,
+// and `set_shape`. The last dimension contains values in a set, duplicates are
+// allowed but ignored.
+//
+// If `validate_indices` is `True`, this op validates the order and range of `set`
+// indices.
 //
 // Arguments:
-//	filename: The corpus's text file name.
-//	batch_size: The size of produced batch.
+//	set_indices: 2D `Tensor`, indices of a `SparseTensor`.
+//	set_values: 1D `Tensor`, values of a `SparseTensor`.
+//	set_shape: 1D `Tensor`, shape of a `SparseTensor`.
 //
-// Returns A vector of words in the corpus.Frequencies of words. Sorted in the non-ascending order.Number of words per epoch in the data file.The current epoch number.The total number of words processed so far.A vector of word ids.A vector of word ids.
-func Skipgram(scope *Scope, filename string, batch_size int64, optional ...SkipgramAttr) (vocab_word tf.Output, vocab_freq tf.Output, words_per_epoch tf.Output, current_epoch tf.Output, total_words_processed tf.Output, examples tf.Output, labels tf.Output) {
+// Returns For `set` ranked `n`, this is a `Tensor` with rank `n-1`, and the same 1st
+// `n-1` dimensions as `set`. Each value is the number of unique elements in
+// the corresponding `[0...n-1]` dimension of `set`.
+func SetSize(scope *Scope, set_indices tf.Output, set_values tf.Output, set_shape tf.Output, optional ...SetSizeAttr) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"filename": filename, "batch_size": batch_size}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Skipgram",
-
+		Type: "SetSize",
+		Input: []tf.Input{
+			set_indices, set_values, set_shape,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4), op.Output(5), op.Output(6)
+	return op.Output(0)
 }
 
-// StringToNumberAttr is an optional argument to StringToNumber.
-type StringToNumberAttr func(optionalAttr)
-
-// StringToNumberOutType sets the optional out_type attribute to value.
+// Computes the sign and the log of the absolute value of the determinant of
 //
-// value: The numeric type to interpret each string in `string_tensor` as.
-// If not specified, defaults to DT_FLOAT
-func StringToNumberOutType(value tf.DataType) StringToNumberAttr {
-	return func(m optionalAttr) {
-		m["out_type"] = value
-	}
-}
-
-// Converts each string in the input Tensor to the specified numeric type.
+// one or more square matrices.
 //
-// (Note that int32 overflow results in an error while float overflow
-// results in a rounded value.)
+// The input is a tensor of shape `[N, M, M]` whose inner-most 2 dimensions
+// form square matrices. The outputs are two tensors containing the signs and
+// logs of the absolute values of the determinants for all N input submatrices
+// `[..., :, :]` such that the determinant = sign*exp(log_abs_determinant).
+// The log_abs_determinant is computed as det(P)*sum(log(diag(LU))) where LU
+// is the LU decomposition of the input and P is the corresponding
+// permutation matrix.
 //
-// Returns A Tensor of the same shape as the input `string_tensor`.
-func StringToNumber(scope *Scope, string_tensor tf.Output, optional ...StringToNumberAttr) (output tf.Output) {
+// Arguments:
+//	input: Shape is `[N, M, M]`.
+//
+// Returns The signs of the determinants of the inputs. Shape is `[N]`. The logs of the absolute values of the determinants
+// of the N input matrices.  Shape is `[N]`.
+func LogMatrixDeterminant(scope *Scope, input tf.Output) (sign tf.Output, log_abs_determinant tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "StringToNumber",
+		Type: "LogMatrixDeterminant",
 		Input: []tf.Input{
-			string_tensor,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// ResourceApplyFtrlV2Attr is an optional argument to ResourceApplyFtrlV2.
-type ResourceApplyFtrlV2Attr func(optionalAttr)
+// SumAttr is an optional argument to Sum.
+type SumAttr func(optionalAttr)
 
-// ResourceApplyFtrlV2UseLocking sets the optional use_locking attribute to value.
+// SumKeepDims sets the optional keep_dims attribute to value.
 //
-// value: If `True`, updating of the var and accum tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
+// value: If true, retain reduced dimensions with length 1.
 // If not specified, defaults to false
-func ResourceApplyFtrlV2UseLocking(value bool) ResourceApplyFtrlV2Attr {
+func SumKeepDims(value bool) SumAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["keep_dims"] = value
 	}
 }
 
-// Update '*var' according to the Ftrl-proximal scheme.
+// Computes the sum of elements across dimensions of a tensor.
 //
-// grad_with_shrinkage = grad + 2 * l2_shrinkage * var
-// accum_new = accum + grad_with_shrinkage * grad_with_shrinkage
-// linear += grad_with_shrinkage +
-//     (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var
-// quadratic = 1.0 / (accum_new^(lr_power) * lr) + 2 * l2
-// var = (sign(linear) * l1 - linear) / quadratic if |linear| > l1 else 0.0
-// accum = accum_new
+// Reduces `input` along the dimensions given in `axis`. Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `axis`. If `keep_dims` is true, the reduced dimensions are
+// retained with length 1.
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	linear: Should be from a Variable().
-//	grad: The gradient.
-//	lr: Scaling factor. Must be a scalar.
-//	l1: L1 regulariation. Must be a scalar.
-//	l2: L2 shrinkage regulariation. Must be a scalar.
-//
-//	lr_power: Scaling factor. Must be a scalar.
+//	input: The tensor to reduce.
+//	axis: The dimensions to reduce. Must be in the range
+// `[-rank(input), rank(input))`.
 //
-// Returns the created operation.
-func ResourceApplyFtrlV2(scope *Scope, var_ tf.Output, accum tf.Output, linear tf.Output, grad tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, l2_shrinkage tf.Output, lr_power tf.Output, optional ...ResourceApplyFtrlV2Attr) (o *tf.Operation) {
+// Returns The reduced tensor.
+func Sum(scope *Scope, input tf.Output, axis tf.Output, optional ...SumAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -15237,375 +15416,431 @@ func ResourceApplyFtrlV2(scope *Scope, var_ tf.Output, accum tf.Output, linear t
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyFtrlV2",
+		Type: "Sum",
 		Input: []tf.Input{
-			var_, accum, linear, grad, lr, l1, l2, l2_shrinkage, lr_power,
+			input, axis,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// TruncatedNormalAttr is an optional argument to TruncatedNormal.
-type TruncatedNormalAttr func(optionalAttr)
-
-// TruncatedNormalSeed sets the optional seed attribute to value.
+// Delete the tensor specified by its handle in the session.
 //
-// value: If either `seed` or `seed2` are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func TruncatedNormalSeed(value int64) TruncatedNormalAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
-	}
-}
-
-// TruncatedNormalSeed2 sets the optional seed2 attribute to value.
+// Arguments:
+//	handle: The handle for a tensor stored in the session state.
 //
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func TruncatedNormalSeed2(value int64) TruncatedNormalAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
+// Returns the created operation.
+func DeleteSessionTensor(scope *Scope, handle tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "DeleteSessionTensor",
+		Input: []tf.Input{
+			handle,
+		},
 	}
+	return scope.AddOperation(opspec)
 }
 
-// Outputs random values from a truncated normal distribution.
+// L2 Loss.
 //
-// The generated values follow a normal distribution with mean 0 and standard
-// deviation 1, except that values whose magnitude is more than 2 standard
-// deviations from the mean are dropped and re-picked.
+// Computes half the L2 norm of a tensor without the `sqrt`:
+//
+//     output = sum(t ** 2) / 2
 //
 // Arguments:
-//	shape: The shape of the output tensor.
-//	dtype: The type of the output.
+//	t: Typically 2-D, but may have any dimensions.
 //
-// Returns A tensor of the specified shape filled with random truncated normal
-// values.
-func TruncatedNormal(scope *Scope, shape tf.Output, dtype tf.DataType, optional ...TruncatedNormalAttr) (output tf.Output) {
+// Returns 0-D.
+func L2Loss(scope *Scope, t tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "TruncatedNormal",
+		Type: "L2Loss",
 		Input: []tf.Input{
-			shape,
+			t,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// RandomShuffleAttr is an optional argument to RandomShuffle.
-type RandomShuffleAttr func(optionalAttr)
+// DenseToSparseSetOperationAttr is an optional argument to DenseToSparseSetOperation.
+type DenseToSparseSetOperationAttr func(optionalAttr)
 
-// RandomShuffleSeed sets the optional seed attribute to value.
-//
-// value: If either `seed` or `seed2` are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func RandomShuffleSeed(value int64) RandomShuffleAttr {
+// DenseToSparseSetOperationValidateIndices sets the optional validate_indices attribute to value.
+// If not specified, defaults to true
+func DenseToSparseSetOperationValidateIndices(value bool) DenseToSparseSetOperationAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["validate_indices"] = value
 	}
 }
 
-// RandomShuffleSeed2 sets the optional seed2 attribute to value.
+// Applies set operation along last dimension of `Tensor` and `SparseTensor`.
 //
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func RandomShuffleSeed2(value int64) RandomShuffleAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
-	}
-}
-
-// Randomly shuffles a tensor along its first dimension.
+// See SetOperationOp::SetOperationFromContext for values of `set_operation`.
 //
-//   The tensor is shuffled along dimension 0, such that each `value[j]` is mapped
-//   to one and only one `output[i]`. For example, a mapping that might occur for a
-//   3x2 tensor is:
+// Input `set2` is a `SparseTensor` represented by `set2_indices`, `set2_values`,
+// and `set2_shape`. For `set2` ranked `n`, 1st `n-1` dimensions must be the same
+// as `set1`. Dimension `n` contains values in a set, duplicates are allowed but
+// ignored.
 //
-// ```
-// [[1, 2],       [[5, 6],
-//  [3, 4],  ==>   [1, 2],
-//  [5, 6]]        [3, 4]]
-// ```
+// If `validate_indices` is `True`, this op validates the order and range of `set2`
+// indices.
+//
+// Output `result` is a `SparseTensor` represented by `result_indices`,
+// `result_values`, and `result_shape`. For `set1` and `set2` ranked `n`, this
+// has rank `n` and the same 1st `n-1` dimensions as `set1` and `set2`. The `nth`
+// dimension contains the result of `set_operation` applied to the corresponding
+// `[0...n-1]` dimension of `set`.
 //
 // Arguments:
-//	value: The tensor to be shuffled.
+//	set1: `Tensor` with rank `n`. 1st `n-1` dimensions must be the same as `set2`.
+// Dimension `n` contains values in a set, duplicates are allowed but ignored.
+//	set2_indices: 2D `Tensor`, indices of a `SparseTensor`. Must be in row-major
+// order.
+//	set2_values: 1D `Tensor`, values of a `SparseTensor`. Must be in row-major
+// order.
+//	set2_shape: 1D `Tensor`, shape of a `SparseTensor`. `set2_shape[0...n-1]` must
+// be the same as the 1st `n-1` dimensions of `set1`, `result_shape[n]` is the
+// max set size across `n-1` dimensions.
 //
-// Returns A tensor of same shape and type as `value`, shuffled along its first
-// dimension.
-func RandomShuffle(scope *Scope, value tf.Output, optional ...RandomShuffleAttr) (output tf.Output) {
+//
+// Returns 2D indices of a `SparseTensor`. 1D values of a `SparseTensor`. 1D `Tensor` shape of a `SparseTensor`. `result_shape[0...n-1]` is
+// the same as the 1st `n-1` dimensions of `set1` and `set2`, `result_shape[n]`
+// is the max result set size across all `0...n-1` dimensions.
+func DenseToSparseSetOperation(scope *Scope, set1 tf.Output, set2_indices tf.Output, set2_values tf.Output, set2_shape tf.Output, set_operation string, optional ...DenseToSparseSetOperationAttr) (result_indices tf.Output, result_values tf.Output, result_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"set_operation": set_operation}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "RandomShuffle",
+		Type: "DenseToSparseSetOperation",
 		Input: []tf.Input{
-			value,
+			set1, set2_indices, set2_values, set2_shape,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// OrderedMapIncompleteSizeAttr is an optional argument to OrderedMapIncompleteSize.
-type OrderedMapIncompleteSizeAttr func(optionalAttr)
-
-// OrderedMapIncompleteSizeCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Subtracts a value from the current value of a variable.
 //
-// REQUIRES: value >= 0
-func OrderedMapIncompleteSizeCapacity(value int64) OrderedMapIncompleteSizeAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// OrderedMapIncompleteSizeMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Any ReadVariableOp which depends directly or indirectly on this assign is
+// guaranteed to see the decremented value or a subsequent newer one.
 //
-// REQUIRES: value >= 0
-func OrderedMapIncompleteSizeMemoryLimit(value int64) OrderedMapIncompleteSizeAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
+// Outputs the decremented value, which can be used to totally order the
+// decrements to this variable.
+//
+// Arguments:
+//	resource: handle to the resource in which to store the variable.
+//	value: the value by which the variable will be decremented.
+//
+// Returns the created operation.
+func AssignSubVariableOp(scope *Scope, resource tf.Output, value tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// OrderedMapIncompleteSizeContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func OrderedMapIncompleteSizeContainer(value string) OrderedMapIncompleteSizeAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+	opspec := tf.OpSpec{
+		Type: "AssignSubVariableOp",
+		Input: []tf.Input{
+			resource, value,
+		},
 	}
+	return scope.AddOperation(opspec)
 }
 
-// OrderedMapIncompleteSizeSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func OrderedMapIncompleteSizeSharedName(value string) OrderedMapIncompleteSizeAttr {
+// RestoreAttr is an optional argument to Restore.
+type RestoreAttr func(optionalAttr)
+
+// RestorePreferredShard sets the optional preferred_shard attribute to value.
+//
+// value: Index of file to open first if multiple files match
+// `file_pattern`.
+// If not specified, defaults to -1
+func RestorePreferredShard(value int64) RestoreAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["preferred_shard"] = value
 	}
 }
 
-// Op returns the number of incomplete elements in the underlying container.
-func OrderedMapIncompleteSize(scope *Scope, dtypes []tf.DataType, optional ...OrderedMapIncompleteSizeAttr) (size tf.Output) {
+// Restores a tensor from checkpoint files.
+//
+// Reads a tensor stored in one or several files. If there are several files (for
+// instance because a tensor was saved as slices), `file_pattern` may contain
+// wildcard symbols (`*` and `?`) in the filename portion only, not in the
+// directory portion.
+//
+// If a `file_pattern` matches several files, `preferred_shard` can be used to hint
+// in which file the requested tensor is likely to be found. This op will first
+// open the file at index `preferred_shard` in the list of matching files and try
+// to restore tensors from that file.  The Op opens all the files only if some
+// tensors or tensor slices are not found in that first file. Setting
+// `preferred_shard` to match the value passed as the `shard` input
+// of a matching `Save` Op may speed up Restore.  This attribute only affects
+// performance, not correctness.  The default value -1 means files are processed in
+// order.
+//
+// See also `RestoreSlice`.
+//
+// Arguments:
+//	file_pattern: Must have a single element. The pattern of the files from
+// which we read the tensor.
+//	tensor_name: Must have a single element. The name of the tensor to be
+// restored.
+//	dt: The type of the tensor to be restored.
+//
+// Returns The restored tensor.
+func Restore(scope *Scope, file_pattern tf.Output, tensor_name tf.Output, dt tf.DataType, optional ...RestoreAttr) (tensor tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{"dt": dt}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "OrderedMapIncompleteSize",
-
+		Type: "Restore",
+		Input: []tf.Input{
+			file_pattern, tensor_name,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// DecodeRawAttr is an optional argument to DecodeRaw.
-type DecodeRawAttr func(optionalAttr)
+// QuantizedResizeBilinearAttr is an optional argument to QuantizedResizeBilinear.
+type QuantizedResizeBilinearAttr func(optionalAttr)
 
-// DecodeRawLittleEndian sets the optional little_endian attribute to value.
+// QuantizedResizeBilinearAlignCorners sets the optional align_corners attribute to value.
 //
-// value: Whether the input `bytes` are in little-endian order.
-// Ignored for `out_type` values that are stored in a single byte like
-// `uint8`.
-// If not specified, defaults to true
-func DecodeRawLittleEndian(value bool) DecodeRawAttr {
+// value: If true, rescale input by (new_height - 1) / (height - 1), which
+// exactly aligns the 4 corners of images and resized images. If false, rescale
+// by new_height / height. Treat similarly the width dimension.
+// If not specified, defaults to false
+func QuantizedResizeBilinearAlignCorners(value bool) QuantizedResizeBilinearAttr {
 	return func(m optionalAttr) {
-		m["little_endian"] = value
+		m["align_corners"] = value
 	}
 }
 
-// Reinterpret the bytes of a string as a vector of numbers.
+// Resize quantized `images` to `size` using quantized bilinear interpolation.
+//
+// Input images and output images must be quantized types.
 //
 // Arguments:
-//	bytes: All the elements must have the same length.
+//	images: 4-D with shape `[batch, height, width, channels]`.
+//	size: A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
+// new size for the images.
 //
 //
-// Returns A Tensor with one more dimension than the input `bytes`.  The
-// added dimension will have size equal to the length of the elements
-// of `bytes` divided by the number of bytes to represent `out_type`.
-func DecodeRaw(scope *Scope, bytes tf.Output, out_type tf.DataType, optional ...DecodeRawAttr) (output tf.Output) {
+//
+// Returns 4-D with shape
+// `[batch, new_height, new_width, channels]`.
+func QuantizedResizeBilinear(scope *Scope, images tf.Output, size tf.Output, min tf.Output, max tf.Output, optional ...QuantizedResizeBilinearAttr) (resized_images tf.Output, out_min tf.Output, out_max tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"out_type": out_type}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodeRaw",
+		Type: "QuantizedResizeBilinear",
 		Input: []tf.Input{
-			bytes,
+			images, size, min, max,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Copy a tensor setting everything outside a central band in each innermost matrix
-//
-// to zero.
-//
-// The `band` part is computed as follows:
-// Assume `input` has `k` dimensions `[I, J, K, ..., M, N]`, then the output is a
-// tensor with the same shape where
-//
-// `band[i, j, k, ..., m, n] = in_band(m, n) * input[i, j, k, ..., m, n]`.
-//
-// The indicator function
-//
-// `in_band(m, n) = (num_lower < 0 || (m-n) <= num_lower)) &&
-//                  (num_upper < 0 || (n-m) <= num_upper)`.
-//
-// For example:
-//
-// ```
-// # if 'input' is [[ 0,  1,  2, 3]
-//                  [-1,  0,  1, 2]
-//                  [-2, -1,  0, 1]
-//                  [-3, -2, -1, 0]],
+// Computes the minimum along segments of a tensor.
 //
-// tf.matrix_band_part(input, 1, -1) ==> [[ 0,  1,  2, 3]
-//                                        [-1,  0,  1, 2]
-//                                        [ 0, -1,  0, 1]
-//                                        [ 0,  0, -1, 0]],
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
 //
-// tf.matrix_band_part(input, 2, 1) ==> [[ 0,  1,  0, 0]
-//                                       [-1,  0,  1, 0]
-//                                       [-2, -1,  0, 1]
-//                                       [ 0, -2, -1, 0]]
-// ```
+// Computes a tensor such that
+// \\(output_i = \min_j(data_j)\\) where `min` is over `j` such
+// that `segment_ids[j] == i`.
 //
-// Useful special cases:
+// If the min is empty for a given segment ID `i`, `output[i] = 0`.
 //
-// ```
-//  tf.matrix_band_part(input, 0, -1) ==> Upper triangular part.
-//  tf.matrix_band_part(input, -1, 0) ==> Lower triangular part.
-//  tf.matrix_band_part(input, 0, 0) ==> Diagonal.
-// ```
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentMin.png" alt>
+// </div>
 //
 // Arguments:
-//	input: Rank `k` tensor.
-//	num_lower: 0-D tensor. Number of subdiagonals to keep. If negative, keep entire
-// lower triangle.
-//	num_upper: 0-D tensor. Number of superdiagonals to keep. If negative, keep
-// entire upper triangle.
 //
-// Returns Rank `k` tensor of the same shape as input. The extracted banded tensor.
-func MatrixBandPart(scope *Scope, input tf.Output, num_lower tf.Output, num_upper tf.Output) (band tf.Output) {
+//	segment_ids: A 1-D tensor whose size is equal to the size of `data`'s
+// first dimension.  Values should be sorted and can be repeated.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `k`, the number of segments.
+func SegmentMin(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "MatrixBandPart",
+		Type: "SegmentMin",
 		Input: []tf.Input{
-			input, num_lower, num_upper,
+			data, segment_ids,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// DecodeCompressedAttr is an optional argument to DecodeCompressed.
-type DecodeCompressedAttr func(optionalAttr)
+// SdcaOptimizerAttr is an optional argument to SdcaOptimizer.
+type SdcaOptimizerAttr func(optionalAttr)
 
-// DecodeCompressedCompressionType sets the optional compression_type attribute to value.
+// SdcaOptimizerAdaptative sets the optional adaptative attribute to value.
 //
-// value: A scalar containing either (i) the empty string (no
-// compression), (ii) "ZLIB", or (iii) "GZIP".
-// If not specified, defaults to ""
-func DecodeCompressedCompressionType(value string) DecodeCompressedAttr {
+// value: Whether to use Adaptive SDCA for the inner loop.
+// If not specified, defaults to false
+func SdcaOptimizerAdaptative(value bool) SdcaOptimizerAttr {
 	return func(m optionalAttr) {
-		m["compression_type"] = value
+		m["adaptative"] = value
 	}
 }
 
-// Decompress strings.
+// Distributed version of Stochastic Dual Coordinate Ascent (SDCA) optimizer for
 //
-// This op decompresses each element of the `bytes` input `Tensor`, which
-// is assumed to be compressed using the given `compression_type`.
+// linear models with L1 + L2 regularization. As the global optimization objective
+// is strongly convex, the optimizer optimizes the dual objective at each step. The
+// optimizer applies each update one example at a time. Examples are sampled
+// uniformly, and the optimizer is learning-rate free and enjoys a linear
+// convergence rate.
 //
-// The `output` is a string `Tensor` of the same shape as `bytes`,
-// each element containing the decompressed data from the corresponding
-// element in `bytes`.
+// [Proximal Stochastic Dual Coordinate Ascent](http://arxiv.org/pdf/1211.2717v1.pdf).<br>
+// Shai Shalev-Shwartz, Tong Zhang. 2012
+//
+// $$Loss Objective = \sum f_{i} (wx_{i}) + (l2 / 2) * |w|^2 + l1 * |w|$$
+//
+// [Adding vs. Averaging in Distributed Primal-Dual Optimization](http://arxiv.org/abs/1502.03508).<br>
+// Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan,
+// Peter Richtarik, Martin Takac. 2015
+//
+// [Stochastic Dual Coordinate Ascent with Adaptive Probabilities](https://arxiv.org/abs/1502.08053).<br>
+// Dominik Csiba, Zheng Qu, Peter Richtarik. 2015
 //
 // Arguments:
-//	bytes: A Tensor of string which is compressed.
+//	sparse_example_indices: a list of vectors which contain example indices.
+//	sparse_feature_indices: a list of vectors which contain feature indices.
+//	sparse_feature_values: a list of vectors which contain the feature values
+// associated with each feature group.
+//	dense_features: a list of matrices which contain the dense feature values.
+//	example_weights: a vector which contains the weight associated with each
+// example.
+//	example_labels: a vector which contains the label/target associated with each
+// example.
+//	sparse_indices: a list of vectors where each value is the indices which have
+// corresponding weights in sparse_weights. This field may be omitted for the
+// dense approach.
+//	sparse_weights: a list of vectors where each value is the weight associated with
+// a sparse feature group.
+//	dense_weights: a list of vectors where the values are the weights associated
+// with a dense feature group.
+//	example_state_data: a list of vectors containing the example state data.
+//	loss_type: Type of the primal loss. Currently SdcaSolver supports logistic,
+// squared and hinge losses.
+//	l1: Symmetric l1 regularization strength.
+//	l2: Symmetric l2 regularization strength.
+//	num_loss_partitions: Number of partitions of the global loss function.
+//	num_inner_iterations: Number of iterations per mini-batch.
 //
-// Returns A Tensor with the same shape as input `bytes`, uncompressed
-// from bytes.
-func DecodeCompressed(scope *Scope, bytes tf.Output, optional ...DecodeCompressedAttr) (output tf.Output) {
+// Returns a list of vectors containing the updated example state
+// data. A list of vectors where each value is the delta
+// weights associated with a sparse feature group. A list of vectors where the values are the delta
+// weights associated with a dense feature group.
+func SdcaOptimizer(scope *Scope, sparse_example_indices []tf.Output, sparse_feature_indices []tf.Output, sparse_feature_values []tf.Output, dense_features []tf.Output, example_weights tf.Output, example_labels tf.Output, sparse_indices []tf.Output, sparse_weights []tf.Output, dense_weights []tf.Output, example_state_data tf.Output, loss_type string, l1 float32, l2 float32, num_loss_partitions int64, num_inner_iterations int64, optional ...SdcaOptimizerAttr) (out_example_state_data tf.Output, out_delta_sparse_weights []tf.Output, out_delta_dense_weights []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"loss_type": loss_type, "l1": l1, "l2": l2, "num_loss_partitions": num_loss_partitions, "num_inner_iterations": num_inner_iterations}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodeCompressed",
+		Type: "SdcaOptimizer",
 		Input: []tf.Input{
-			bytes,
+			tf.OutputList(sparse_example_indices), tf.OutputList(sparse_feature_indices), tf.OutputList(sparse_feature_values), tf.OutputList(dense_features), example_weights, example_labels, tf.OutputList(sparse_indices), tf.OutputList(sparse_weights), tf.OutputList(dense_weights), example_state_data,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	out_example_state_data = op.Output(idx)
+	if out_delta_sparse_weights, idx, err = makeOutputList(op, idx, "out_delta_sparse_weights"); err != nil {
+		scope.UpdateErr("SdcaOptimizer", err)
+		return
+	}
+	if out_delta_dense_weights, idx, err = makeOutputList(op, idx, "out_delta_dense_weights"); err != nil {
+		scope.UpdateErr("SdcaOptimizer", err)
+		return
+	}
+	return out_example_state_data, out_delta_sparse_weights, out_delta_dense_weights
 }
 
-// WholeFileReaderV2Attr is an optional argument to WholeFileReaderV2.
-type WholeFileReaderV2Attr func(optionalAttr)
+// SparseMatMulAttr is an optional argument to SparseMatMul.
+type SparseMatMulAttr func(optionalAttr)
 
-// WholeFileReaderV2Container sets the optional container attribute to value.
-//
-// value: If non-empty, this reader is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func WholeFileReaderV2Container(value string) WholeFileReaderV2Attr {
+// SparseMatMulTransposeA sets the optional transpose_a attribute to value.
+// If not specified, defaults to false
+func SparseMatMulTransposeA(value bool) SparseMatMulAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["transpose_a"] = value
 	}
 }
 
-// WholeFileReaderV2SharedName sets the optional shared_name attribute to value.
-//
-// value: If non-empty, this reader is named in the given bucket
-// with this shared_name. Otherwise, the node name is used instead.
-// If not specified, defaults to ""
-func WholeFileReaderV2SharedName(value string) WholeFileReaderV2Attr {
+// SparseMatMulTransposeB sets the optional transpose_b attribute to value.
+// If not specified, defaults to false
+func SparseMatMulTransposeB(value bool) SparseMatMulAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["transpose_b"] = value
 	}
 }
 
-// A Reader that outputs the entire contents of a file as a value.
+// SparseMatMulAIsSparse sets the optional a_is_sparse attribute to value.
+// If not specified, defaults to false
+func SparseMatMulAIsSparse(value bool) SparseMatMulAttr {
+	return func(m optionalAttr) {
+		m["a_is_sparse"] = value
+	}
+}
+
+// SparseMatMulBIsSparse sets the optional b_is_sparse attribute to value.
+// If not specified, defaults to false
+func SparseMatMulBIsSparse(value bool) SparseMatMulAttr {
+	return func(m optionalAttr) {
+		m["b_is_sparse"] = value
+	}
+}
+
+// Multiply matrix "a" by matrix "b".
 //
-// To use, enqueue filenames in a Queue.  The output of ReaderRead will
-// be a filename (key) and the contents of that file (value).
+// The inputs must be two-dimensional matrices and the inner dimension of "a" must
+// match the outer dimension of "b". This op is optimized for the case where at
+// least one of "a" or "b" is sparse. The breakeven for using this versus a dense
+// matrix multiply on one platform was 30% zero values in the sparse matrix.
 //
-// Returns The handle to reference the Reader.
-func WholeFileReaderV2(scope *Scope, optional ...WholeFileReaderV2Attr) (reader_handle tf.Output) {
+// The gradient computation of this operation will only take advantage of sparsity
+// in the input gradient when that gradient comes from a Relu.
+func SparseMatMul(scope *Scope, a tf.Output, b tf.Output, optional ...SparseMatMulAttr) (product tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -15614,279 +15849,227 @@ func WholeFileReaderV2(scope *Scope, optional ...WholeFileReaderV2Attr) (reader_
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "WholeFileReaderV2",
-
+		Type: "SparseMatMul",
+		Input: []tf.Input{
+			a, b,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Transforms a tf.Example proto (as a string) into typed tensors.
+// Computes the power of one value to another.
 //
-// Arguments:
-//	serialized: A vector containing a batch of binary serialized Example protos.
-//	dense_defaults: A list of Tensors (some may be empty), whose length matches
-// the length of `dense_keys`. dense_defaults[j] provides default values
-// when the example's feature_map lacks dense_key[j].  If an empty Tensor is
-// provided for dense_defaults[j], then the Feature dense_keys[j] is required.
-// The input type is inferred from dense_defaults[j], even when it's empty.
-// If dense_defaults[j] is not empty, and dense_shapes[j] is fully defined,
-// then the shape of dense_defaults[j] must match that of dense_shapes[j].
-// If dense_shapes[j] has an undefined major dimension (variable strides dense
-// feature), dense_defaults[j] must contain a single element:
-// the padding element.
-//	num_sparse: The number of sparse features to be parsed from the example. This
-// must match the lengths of `sparse_keys` and `sparse_types`.
-//	sparse_keys: A list of `num_sparse` strings.
-// The keys expected in the Examples' features associated with sparse values.
-//	dense_keys: The keys expected in the Examples' features associated with dense
-// values.
-//	sparse_types: A list of `num_sparse` types; the data types of data in each
-// Feature given in sparse_keys.
-// Currently the ParseSingleExample op supports DT_FLOAT (FloatList),
-// DT_INT64 (Int64List), and DT_STRING (BytesList).
-//	dense_shapes: The shapes of data in each Feature given in dense_keys.
-// The length of this list must match the length of `dense_keys`.  The
-// number of elements in the Feature corresponding to dense_key[j] must
-// always equal dense_shapes[j].NumEntries().  If dense_shapes[j] ==
-// (D0, D1, ..., DN) then the shape of output Tensor dense_values[j]
-// will be (D0, D1, ..., DN): In the case dense_shapes[j] = (-1, D1,
-// ..., DN), the shape of the output Tensor dense_values[j] will be (M,
-// D1, .., DN), where M is the number of blocks of elements of length
-// D1 * .... * DN, in the input.
-func ParseSingleExample(scope *Scope, serialized tf.Output, dense_defaults []tf.Output, num_sparse int64, sparse_keys []string, dense_keys []string, sparse_types []tf.DataType, dense_shapes []tf.Shape) (sparse_indices []tf.Output, sparse_values []tf.Output, sparse_shapes []tf.Output, dense_values []tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"num_sparse": num_sparse, "sparse_keys": sparse_keys, "dense_keys": dense_keys, "sparse_types": sparse_types, "dense_shapes": dense_shapes}
-	opspec := tf.OpSpec{
-		Type: "ParseSingleExample",
-		Input: []tf.Input{
-			serialized, tf.OutputList(dense_defaults),
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if sparse_indices, idx, err = makeOutputList(op, idx, "sparse_indices"); err != nil {
-		scope.UpdateErr("ParseSingleExample", err)
-		return
-	}
-	if sparse_values, idx, err = makeOutputList(op, idx, "sparse_values"); err != nil {
-		scope.UpdateErr("ParseSingleExample", err)
-		return
-	}
-	if sparse_shapes, idx, err = makeOutputList(op, idx, "sparse_shapes"); err != nil {
-		scope.UpdateErr("ParseSingleExample", err)
-		return
-	}
-	if dense_values, idx, err = makeOutputList(op, idx, "dense_values"); err != nil {
-		scope.UpdateErr("ParseSingleExample", err)
-		return
-	}
-	return sparse_indices, sparse_values, sparse_shapes, dense_values
-}
-
-// Computes acos of x element-wise.
-func Acos(scope *Scope, x tf.Output) (y tf.Output) {
+// Given a tensor `x` and a tensor `y`, this operation computes \\(x^y\\) for
+// corresponding elements in `x` and `y`. For example:
+//
+// ```
+// # tensor 'x' is [[2, 2], [3, 3]]
+// # tensor 'y' is [[8, 16], [2, 3]]
+// tf.pow(x, y) ==> [[256, 65536], [9, 27]]
+// ```
+func Pow(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Acos",
+		Type: "Pow",
 		Input: []tf.Input{
-			x,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MaxPoolWithArgmaxAttr is an optional argument to MaxPoolWithArgmax.
-type MaxPoolWithArgmaxAttr func(optionalAttr)
+// ShapeAttr is an optional argument to Shape.
+type ShapeAttr func(optionalAttr)
 
-// MaxPoolWithArgmaxTargmax sets the optional Targmax attribute to value.
-// If not specified, defaults to DT_INT64
-func MaxPoolWithArgmaxTargmax(value tf.DataType) MaxPoolWithArgmaxAttr {
+// ShapeOutType sets the optional out_type attribute to value.
+// If not specified, defaults to DT_INT32
+func ShapeOutType(value tf.DataType) ShapeAttr {
 	return func(m optionalAttr) {
-		m["Targmax"] = value
+		m["out_type"] = value
 	}
 }
 
-// Performs max pooling on the input and outputs both max values and indices.
-//
-// The indices in `argmax` are flattened, so that a maximum value at position
-// `[b, y, x, c]` becomes flattened index
-// `((b * height + y) * width + x) * channels + c`.
+// Returns the shape of a tensor.
 //
-// The indices returned are always in `[0, height) x [0, width)` before flattening,
-// even if padding is involved and the mathematically correct answer is outside
-// (either negative or too large).  This is a bug, but fixing it is difficult to do
-// in a safe backwards compatible way, especially due to flattening.
+// This operation returns a 1-D integer tensor representing the shape of `input`.
 //
-// Arguments:
-//	input: 4-D with shape `[batch, height, width, channels]`.  Input to pool over.
-//	ksize: The size of the window for each dimension of the input tensor.
-//	strides: The stride of the sliding window for each dimension of the
-// input tensor.
-//	padding: The type of padding algorithm to use.
+// For example:
 //
-// Returns The max pooled output tensor.4-D.  The flattened indices of the max values chosen for each output.
-func MaxPoolWithArgmax(scope *Scope, input tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPoolWithArgmaxAttr) (output tf.Output, argmax tf.Output) {
+// ```
+// # 't' is [[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]
+// shape(t) ==> [2, 2, 3]
+// ```
+func Shape(scope *Scope, input tf.Output, optional ...ShapeAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MaxPoolWithArgmax",
+		Type: "Shape",
 		Input: []tf.Input{
 			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Transforms a serialized tensorflow.TensorProto proto into a Tensor.
+// Computes fingerprints of the input strings.
 //
 // Arguments:
-//	serialized: A scalar string containing a serialized TensorProto proto.
-//	out_type: The type of the serialized tensor.  The provided type must match the
-// type of the serialized tensor and no implicit conversion will take place.
+//	input: vector of strings to compute fingerprints on.
 //
-// Returns A Tensor of type `out_type`.
-func ParseTensor(scope *Scope, serialized tf.Output, out_type tf.DataType) (output tf.Output) {
+// Returns a (N,2) shaped matrix where N is the number of elements in the input
+// vector. Each row contains the low and high parts of the fingerprint.
+func SdcaFprint(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"out_type": out_type}
 	opspec := tf.OpSpec{
-		Type: "ParseTensor",
+		Type: "SdcaFprint",
 		Input: []tf.Input{
-			serialized,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MapClearAttr is an optional argument to MapClear.
-type MapClearAttr func(optionalAttr)
+// RandomPoissonV2Attr is an optional argument to RandomPoissonV2.
+type RandomPoissonV2Attr func(optionalAttr)
 
-// MapClearCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// RandomPoissonV2Seed sets the optional seed attribute to value.
 //
-// REQUIRES: value >= 0
-func MapClearCapacity(value int64) MapClearAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// MapClearMemoryLimit sets the optional memory_limit attribute to value.
+// value: If either `seed` or `seed2` is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
 // If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func MapClearMemoryLimit(value int64) MapClearAttr {
+func RandomPoissonV2Seed(value int64) RandomPoissonV2Attr {
 	return func(m optionalAttr) {
-		m["memory_limit"] = value
+		m["seed"] = value
 	}
 }
 
-// MapClearContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func MapClearContainer(value string) MapClearAttr {
+// RandomPoissonV2Seed2 sets the optional seed2 attribute to value.
+//
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func RandomPoissonV2Seed2(value int64) RandomPoissonV2Attr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["seed2"] = value
 	}
 }
 
-// MapClearSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func MapClearSharedName(value string) MapClearAttr {
+// RandomPoissonV2Dtype sets the optional dtype attribute to value.
+// If not specified, defaults to DT_INT64
+func RandomPoissonV2Dtype(value tf.DataType) RandomPoissonV2Attr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["dtype"] = value
 	}
 }
 
-// Op removes all elements in the underlying container.
+// Outputs random values from the Poisson distribution(s) described by rate.
 //
-// Returns the created operation.
-func MapClear(scope *Scope, dtypes []tf.DataType, optional ...MapClearAttr) (o *tf.Operation) {
+// This op uses two algorithms, depending on rate. If rate >= 10, then
+// the algorithm by Hormann is used to acquire samples via
+// transformation-rejection.
+// See http://www.sciencedirect.com/science/article/pii/0167668793909974.
+//
+// Otherwise, Knuth's algorithm is used to acquire samples via multiplying uniform
+// random variables.
+// See Donald E. Knuth (1969). Seminumerical Algorithms. The Art of Computer
+// Programming, Volume 2. Addison Wesley.
+//
+// Arguments:
+//	shape: 1-D integer tensor. Shape of independent samples to draw from each
+// distribution described by the rate parameters given in rate.
+//	rate: A tensor in which each scalar is a "rate" parameter describing the
+// associated Poisson distribution.
+//
+// Returns A tensor with shape `shape + shape(rate)`. Each slice
+// `[:, ..., :, i0, i1, ...iN]` contains the samples drawn for
+// `rate[i0, i1, ...iN]`.
+func RandomPoissonV2(scope *Scope, shape tf.Output, rate tf.Output, optional ...RandomPoissonV2Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapClear",
-
+		Type: "RandomPoissonV2",
+		Input: []tf.Input{
+			shape, rate,
+		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// DecodeCSVAttr is an optional argument to DecodeCSV.
-type DecodeCSVAttr func(optionalAttr)
-
-// DecodeCSVFieldDelim sets the optional field_delim attribute to value.
-//
-// value: char delimiter to separate fields in a record.
-// If not specified, defaults to ","
-func DecodeCSVFieldDelim(value string) DecodeCSVAttr {
-	return func(m optionalAttr) {
-		m["field_delim"] = value
-	}
-}
+// MatrixTriangularSolveAttr is an optional argument to MatrixTriangularSolve.
+type MatrixTriangularSolveAttr func(optionalAttr)
 
-// DecodeCSVUseQuoteDelim sets the optional use_quote_delim attribute to value.
+// MatrixTriangularSolveLower sets the optional lower attribute to value.
 //
-// value: If false, treats double quotation marks as regular
-// characters inside of the string fields (ignoring RFC 4180, Section 2,
-// Bullet 5).
+// value: Boolean indicating whether the innermost matrices in `matrix` are
+// lower or upper triangular.
 // If not specified, defaults to true
-func DecodeCSVUseQuoteDelim(value bool) DecodeCSVAttr {
+func MatrixTriangularSolveLower(value bool) MatrixTriangularSolveAttr {
 	return func(m optionalAttr) {
-		m["use_quote_delim"] = value
+		m["lower"] = value
 	}
 }
 
-// DecodeCSVNaValue sets the optional na_value attribute to value.
+// MatrixTriangularSolveAdjoint sets the optional adjoint attribute to value.
 //
-// value: Additional string to recognize as NA/NaN.
-// If not specified, defaults to ""
-func DecodeCSVNaValue(value string) DecodeCSVAttr {
+// value: Boolean indicating whether to solve with `matrix` or its (block-wise)
+//          adjoint.
+//
+// @compatibility(scipy)
+// Equivalent to scipy.linalg.solve_triangular
+// @end_compatibility
+// If not specified, defaults to false
+func MatrixTriangularSolveAdjoint(value bool) MatrixTriangularSolveAttr {
 	return func(m optionalAttr) {
-		m["na_value"] = value
+		m["adjoint"] = value
 	}
 }
 
-// Convert CSV records to tensors. Each column maps to one tensor.
+// Solves systems of linear equations with upper or lower triangular matrices by
 //
-// RFC 4180 format is expected for the CSV records.
-// (https://tools.ietf.org/html/rfc4180)
-// Note that we allow leading and trailing spaces with int or float field.
+// backsubstitution.
+//
+// `matrix` is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions form
+// square matrices. If `lower` is `True` then the strictly upper triangular part
+// of each inner-most matrix is assumed to be zero and not accessed.
+// If `lower` is `False` then the strictly lower triangular part of each inner-most
+// matrix is assumed to be zero and not accessed.
+// `rhs` is a tensor of shape `[..., M, K]`.
+//
+// The output is a tensor of shape `[..., M, K]`. If `adjoint` is
+// `False` then the innermost matrices in `output` satisfy matrix equations
+// `matrix[..., :, :] * output[..., :, :] = rhs[..., :, :]`.
+// If `adjoint` is `True` then the innermost matrices in
+// `output` satisfy matrix equations
+// `adjoint(matrix[..., i, k]) * output[..., k, j] = rhs[..., i, j]`.
 //
 // Arguments:
-//	records: Each string is a record/row in the csv and all records should have
-// the same format.
-//	record_defaults: One tensor per column of the input record, with either a
-// scalar default value for that column or empty if the column is required.
+//	matrix: Shape is `[..., M, M]`.
+//	rhs: Shape is `[..., M, K]`.
 //
-// Returns Each tensor will have the same shape as records.
-func DecodeCSV(scope *Scope, records tf.Output, record_defaults []tf.Output, optional ...DecodeCSVAttr) (output []tf.Output) {
+// Returns Shape is `[..., M, K]`.
+func MatrixTriangularSolve(scope *Scope, matrix tf.Output, rhs tf.Output, optional ...MatrixTriangularSolveAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -15895,103 +16078,198 @@ func DecodeCSV(scope *Scope, records tf.Output, record_defaults []tf.Output, opt
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodeCSV",
+		Type: "MatrixTriangularSolve",
 		Input: []tf.Input{
-			records, tf.OutputList(record_defaults),
+			matrix, rhs,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes inverse hyperbolic sine of x element-wise.
+func Asinh(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	var idx int
-	var err error
-	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("DecodeCSV", err)
-		return
+	opspec := tf.OpSpec{
+		Type: "Asinh",
+		Input: []tf.Input{
+			x,
+		},
 	}
-	return output
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Returns the rank of a tensor.
-//
-// This operation returns an integer representing the rank of `input`.
+// Creates a dataset with a range of values. Corresponds to Python's xrange.
 //
-// For example:
+// Arguments:
+//	start: corresponds to start in Python's xrange().
+//	stop: corresponds to stop in Python's xrange().
+//	step: corresponds to step in Python's xrange().
 //
-// ```
-// # 't' is [[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]
-// # shape of tensor 't' is [2, 2, 3]
-// rank(t) ==> 3
-// ```
 //
-// **Note**: The rank of a tensor is not the same as the rank of a matrix. The rank
-// of a tensor is the number of indices required to uniquely select each element
-// of the tensor. Rank is also known as "order", "degree", or "ndims."
-func Rank(scope *Scope, input tf.Output) (output tf.Output) {
+func RangeDataset(scope *Scope, start tf.Output, stop tf.Output, step tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "Rank",
+		Type: "RangeDataset",
 		Input: []tf.Input{
-			input,
+			start, stop, step,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Output a fact about factorials.
-func Fact(scope *Scope) (fact tf.Output) {
+// DepthwiseConv2dNativeBackpropInputAttr is an optional argument to DepthwiseConv2dNativeBackpropInput.
+type DepthwiseConv2dNativeBackpropInputAttr func(optionalAttr)
+
+// DepthwiseConv2dNativeBackpropInputDataFormat sets the optional data_format attribute to value.
+//
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, height, width, channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, channels, height, width].
+// If not specified, defaults to "NHWC"
+func DepthwiseConv2dNativeBackpropInputDataFormat(value string) DepthwiseConv2dNativeBackpropInputAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// DepthwiseConv2dNativeBackpropInputDilations sets the optional dilations attribute to value.
+//
+// value: 1-D tensor of length 4.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each filter
+// element on that dimension. The dimension order is determined by the value of
+// `data_format`, see above for details. Dilations in the batch and depth
+// dimensions must be 1.
+// If not specified, defaults to <i:1 i:1 i:1 i:1 >
+func DepthwiseConv2dNativeBackpropInputDilations(value []int64) DepthwiseConv2dNativeBackpropInputAttr {
+	return func(m optionalAttr) {
+		m["dilations"] = value
+	}
+}
+
+// Computes the gradients of depthwise convolution with respect to the input.
+//
+// Arguments:
+//	input_sizes: An integer vector representing the shape of `input`, based
+// on `data_format`.  For example, if `data_format` is 'NHWC' then
+//  `input` is a 4-D `[batch, height, width, channels]` tensor.
+//	filter: 4-D with shape
+// `[filter_height, filter_width, in_channels, depthwise_multiplier]`.
+//	out_backprop: 4-D with shape based on `data_format`.
+// For example, if `data_format` is 'NHWC' then
+// out_backprop shape is `[batch, out_height, out_width, out_channels]`.
+// Gradients w.r.t. the output of the convolution.
+//	strides: The stride of the sliding window for each dimension of the input
+// of the convolution.
+//	padding: The type of padding algorithm to use.
+//
+// Returns 4-D with shape according to `data_format`.  For example, if
+// `data_format` is 'NHWC', output shape is `[batch, in_height,
+// in_width, in_channels]`.  Gradient w.r.t. the input of the
+// convolution.
+func DepthwiseConv2dNativeBackpropInput(scope *Scope, input_sizes tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...DepthwiseConv2dNativeBackpropInputAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Fact",
+		Type: "DepthwiseConv2dNativeBackpropInput",
+		Input: []tf.Input{
+			input_sizes, filter, out_backprop,
+		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Makes its input available to the next iteration.
+// Adds sparse updates to the variable referenced by `resource`.
+//
+// This operation computes
+//
+//     # Scalar indices
+//     ref[indices, ...] += updates[...]
+//
+//     # Vector indices (for each i)
+//     ref[indices[i], ...] += updates[i, ...]
+//
+//     # High rank indices (for each i, ..., j)
+//     ref[indices[i, ..., j], ...] += updates[i, ..., j, ...]
+//
+// Duplicate entries are handled correctly: if multiple `indices` reference
+// the same location, their contributions add.
+//
+// Requires `updates.shape = indices.shape + ref.shape[1:]`.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src='https://www.tensorflow.org/images/ScatterAdd.png' alt>
+// </div>
 //
 // Arguments:
-//	data: The tensor to be made available to the next iteration.
+//	resource: Should be from a `Variable` node.
+//	indices: A tensor of indices into the first dimension of `ref`.
+//	updates: A tensor of updated values to add to `ref`.
 //
-// Returns The same tensor as `data`.
-func NextIteration(scope *Scope, data tf.Output) (output tf.Output) {
+// Returns the created operation.
+func ResourceScatterAdd(scope *Scope, resource tf.Output, indices tf.Output, updates tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "NextIteration",
+		Type: "ResourceScatterAdd",
 		Input: []tf.Input{
-			data,
+			resource, indices, updates,
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Creates a dataset that skips `count` elements from the `input_dataset`.
+// Says whether the targets are in the top `K` predictions.
 //
-// Arguments:
+// This outputs a `batch_size` bool array; an entry `out[i]` is `true` if the
+// prediction for the target class is among the top `k` predictions among
+// all predictions for example `i`. Note that the behavior of `InTopK` differs
+// from the `TopK` op in its handling of ties; if multiple classes have the
+// same prediction value and straddle the top-`k` boundary, all of those
+// classes are considered to be in the top `k`.
 //
-//	count: A scalar representing the number of elements from the `input_dataset`
-// that should be skipped.  If count is -1, skips everything.
+// More formally, let
 //
+//   \\(predictions_i\\) be the predictions for all classes for example `i`,
+//   \\(targets_i\\) be the target class for example `i`,
+//   \\(out_i\\) be the output for example `i`,
 //
-func SkipDataset(scope *Scope, input_dataset tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// $$out_i = predictions_{i, targets_i} \in TopKIncludingTies(predictions_i)$$
+//
+// Arguments:
+//	predictions: A `batch_size` x `classes` tensor.
+//	targets: A `batch_size` vector of class ids.
+//	k: Number of top elements to look at for computing precision.
+//
+// Returns Computed Precision at `k` as a `bool Tensor`.
+func InTopK(scope *Scope, predictions tf.Output, targets tf.Output, k int64) (precision tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{"k": k}
 	opspec := tf.OpSpec{
-		Type: "SkipDataset",
+		Type: "InTopK",
 		Input: []tf.Input{
-			input_dataset, count,
+			predictions, targets,
 		},
 		Attrs: attrs,
 	}
@@ -15999,284 +16277,309 @@ func SkipDataset(scope *Scope, input_dataset tf.Output, count tf.Output, output_
 	return op.Output(0)
 }
 
-// Computes hyperbolic tangent of `x` element-wise.
-func Tanh(scope *Scope, x tf.Output) (y tf.Output) {
+// Computes the gradient for the inverse of `x` wrt its input.
+//
+// Specifically, `grad = -dy * y*y`, where `y = 1/x`, and `dy`
+// is the corresponding input gradient.
+func ReciprocalGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Tanh",
+		Type: "ReciprocalGrad",
 		Input: []tf.Input{
-			x,
+			y, dy,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the maximum along segments of a tensor.
-//
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// Computes a tensor such that
-// \\(output_i = \max_j(data_j)\\) where `max` is over `j` such
-// that `segment_ids[j] == i`.
-//
-// If the max is empty for a given segment ID `i`, `output[i] = 0`.
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentMax.png" alt>
-// </div>
-//
-// Arguments:
-//
-//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
-// first dimension.  Values should be sorted and can be repeated.
+// Returns the min of x and y (i.e. x < y ? x : y) element-wise.
 //
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SegmentMax(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
+// *NOTE*: `Minimum` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func Minimum(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SegmentMax",
+		Type: "Minimum",
 		Input: []tf.Input{
-			data, segment_ids,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// AvgPoolGradAttr is an optional argument to AvgPoolGrad.
-type AvgPoolGradAttr func(optionalAttr)
+// MfccAttr is an optional argument to Mfcc.
+type MfccAttr func(optionalAttr)
 
-// AvgPoolGradDataFormat sets the optional data_format attribute to value.
+// MfccUpperFrequencyLimit sets the optional upper_frequency_limit attribute to value.
 //
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func AvgPoolGradDataFormat(value string) AvgPoolGradAttr {
+// value: The highest frequency to use when calculating the
+// cepstrum.
+// If not specified, defaults to 4000
+func MfccUpperFrequencyLimit(value float32) MfccAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["upper_frequency_limit"] = value
 	}
 }
 
-// Computes gradients of the average pooling function.
-//
-// Arguments:
-//	orig_input_shape: 1-D.  Shape of the original input to `avg_pool`.
-//	grad: 4-D with shape `[batch, height, width, channels]`.  Gradients w.r.t.
-// the output of `avg_pool`.
-//	ksize: The size of the sliding window for each dimension of the input.
-//	strides: The stride of the sliding window for each dimension of the input.
-//	padding: The type of padding algorithm to use.
+// MfccLowerFrequencyLimit sets the optional lower_frequency_limit attribute to value.
 //
-// Returns 4-D.  Gradients w.r.t. the input of `avg_pool`.
-func AvgPoolGrad(scope *Scope, orig_input_shape tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPoolGradAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "AvgPoolGrad",
-		Input: []tf.Input{
-			orig_input_shape, grad,
-		},
-		Attrs: attrs,
+// value: The lowest frequency to use when calculating the
+// cepstrum.
+// If not specified, defaults to 20
+func MfccLowerFrequencyLimit(value float32) MfccAttr {
+	return func(m optionalAttr) {
+		m["lower_frequency_limit"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// StageClearAttr is an optional argument to StageClear.
-type StageClearAttr func(optionalAttr)
-
-// StageClearCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// MfccFilterbankChannelCount sets the optional filterbank_channel_count attribute to value.
 //
-// REQUIRES: value >= 0
-func StageClearCapacity(value int64) StageClearAttr {
+// value: Resolution of the Mel bank used internally.
+// If not specified, defaults to 40
+func MfccFilterbankChannelCount(value int64) MfccAttr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["filterbank_channel_count"] = value
 	}
 }
 
-// StageClearMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// MfccDctCoefficientCount sets the optional dct_coefficient_count attribute to value.
 //
-// REQUIRES: value >= 0
-func StageClearMemoryLimit(value int64) StageClearAttr {
+// value: How many output channels to produce per time slice.
+// If not specified, defaults to 13
+func MfccDctCoefficientCount(value int64) MfccAttr {
 	return func(m optionalAttr) {
-		m["memory_limit"] = value
+		m["dct_coefficient_count"] = value
 	}
 }
 
-// StageClearContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func StageClearContainer(value string) StageClearAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+// Transforms a spectrogram into a form that's useful for speech recognition.
+//
+// Mel Frequency Cepstral Coefficients are a way of representing audio data that's
+// been effective as an input feature for machine learning. They are created by
+// taking the spectrum of a spectrogram (a 'cepstrum'), and discarding some of the
+// higher frequencies that are less significant to the human ear. They have a long
+// history in the speech recognition world, and https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
+// is a good resource to learn more.
+//
+// Arguments:
+//	spectrogram: Typically produced by the Spectrogram op, with magnitude_squared
+// set to true.
+//	sample_rate: How many samples per second the source audio used.
+func Mfcc(scope *Scope, spectrogram tf.Output, sample_rate tf.Output, optional ...MfccAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
 	}
+	opspec := tf.OpSpec{
+		Type: "Mfcc",
+		Input: []tf.Input{
+			spectrogram, sample_rate,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// StageClearSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func StageClearSharedName(value string) StageClearAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
+// Returns the element-wise sum of a list of tensors.
+//
+// `tf.accumulate_n_v2` performs the same operation as `tf.add_n`, but does not
+// wait for all of its inputs to be ready before beginning to sum. This can
+// save memory if inputs are ready at different times, since minimum temporary
+// storage is proportional to the output size rather than the input size.
+//
+// Unlike the original `accumulate_n`, `accumulate_n_v2` is differentiable.
+//
+// Returns a `Tensor` of same shape and type as the elements of `inputs`.
+//
+// Arguments:
+//	inputs: A list of `Tensor` objects, each with same shape and type.
+//	shape: Shape of elements of `inputs`.
+func AccumulateNV2(scope *Scope, inputs []tf.Output, shape tf.Shape) (sum tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"shape": shape}
+	opspec := tf.OpSpec{
+		Type: "AccumulateNV2",
+		Input: []tf.Input{
+			tf.OutputList(inputs),
+		},
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Op removes all elements in the underlying container.
+// Convert the quantized 'input' tensor into a lower-precision 'output', using the
 //
-// Returns the created operation.
-func StageClear(scope *Scope, dtypes []tf.DataType, optional ...StageClearAttr) (o *tf.Operation) {
+// actual distribution of the values to maximize the usage of the lower bit depth
+// and adjusting the output min and max ranges accordingly.
+//
+// [input_min, input_max] are scalar floats that specify the range for the float
+// interpretation of the 'input' data. For example, if input_min is -1.0f and
+// input_max is 1.0f, and we are dealing with quint16 quantized data, then a 0
+// value in the 16-bit data should be interpreted as -1.0f, and a 65535 means 1.0f.
+//
+// This operator tries to squeeze as much precision as possible into an output with
+// a lower bit depth by calculating the actual min and max values found in the
+// data. For example, maybe that quint16 input has no values lower than 16,384 and
+// none higher than 49,152. That means only half the range is actually needed, all
+// the float interpretations are between -0.5f and 0.5f, so if we want to compress
+// the data into a quint8 output, we can use that range rather than the theoretical
+// -1.0f to 1.0f that is suggested by the input min and max.
+//
+// In practice, this is most useful for taking output from operations like
+// QuantizedMatMul that can produce higher bit-depth outputs than their inputs and
+// may have large potential output ranges, but in practice have a distribution of
+// input values that only uses a small fraction of the possible range. By feeding
+// that output into this operator, we can reduce it from 32 bits down to 8 with
+// minimal loss of accuracy.
+//
+// Arguments:
+//
+//	input_min: The float value that the minimum quantized input value represents.
+//	input_max: The float value that the maximum quantized input value represents.
+//	out_type: The type of the output. Should be a lower bit depth than Tinput.
+//
+// Returns The float value that the minimum quantized output value represents. The float value that the maximum quantized output value represents.
+func QuantizeDownAndShrinkRange(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, out_type tf.DataType) (output tf.Output, output_min tf.Output, output_max tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"out_type": out_type}
 	opspec := tf.OpSpec{
-		Type: "StageClear",
-
+		Type: "QuantizeDownAndShrinkRange",
+		Input: []tf.Input{
+			input, input_min, input_max,
+		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// ComputeAccidentalHitsAttr is an optional argument to ComputeAccidentalHits.
-type ComputeAccidentalHitsAttr func(optionalAttr)
+// RandomGammaAttr is an optional argument to RandomGamma.
+type RandomGammaAttr func(optionalAttr)
 
-// ComputeAccidentalHitsSeed sets the optional seed attribute to value.
+// RandomGammaSeed sets the optional seed attribute to value.
 //
-// value: If either seed or seed2 are set to be non-zero, the random number
+// value: If either `seed` or `seed2` are set to be non-zero, the random number
 // generator is seeded by the given seed.  Otherwise, it is seeded by a
 // random seed.
 // If not specified, defaults to 0
-func ComputeAccidentalHitsSeed(value int64) ComputeAccidentalHitsAttr {
+func RandomGammaSeed(value int64) RandomGammaAttr {
 	return func(m optionalAttr) {
 		m["seed"] = value
 	}
 }
 
-// ComputeAccidentalHitsSeed2 sets the optional seed2 attribute to value.
+// RandomGammaSeed2 sets the optional seed2 attribute to value.
 //
-// value: An second seed to avoid seed collision.
+// value: A second seed to avoid seed collision.
 // If not specified, defaults to 0
-func ComputeAccidentalHitsSeed2(value int64) ComputeAccidentalHitsAttr {
+func RandomGammaSeed2(value int64) RandomGammaAttr {
 	return func(m optionalAttr) {
 		m["seed2"] = value
 	}
 }
 
-// Computes the ids of the positions in sampled_candidates that match true_labels.
+// Outputs random values from the Gamma distribution(s) described by alpha.
 //
-// When doing log-odds NCE, the result of this op should be passed through a
-// SparseToDense op, then added to the logits of the sampled candidates. This has
-// the effect of 'removing' the sampled labels that match the true labels by
-// making the classifier sure that they are sampled labels.
+// This op uses the algorithm by Marsaglia et al. to acquire samples via
+// transformation-rejection from pairs of uniform and normal random variables.
+// See http://dl.acm.org/citation.cfm?id=358414
 //
 // Arguments:
-//	true_classes: The true_classes output of UnpackSparseLabels.
-//	sampled_candidates: The sampled_candidates output of CandidateSampler.
-//	num_true: Number of true labels per context.
+//	shape: 1-D integer tensor. Shape of independent samples to draw from each
+// distribution described by the shape parameters given in alpha.
+//	alpha: A tensor in which each scalar is a "shape" parameter describing the
+// associated gamma distribution.
 //
-// Returns A vector of indices corresponding to rows of true_candidates.A vector of IDs of positions in sampled_candidates that match a true_label
-// for the row with the corresponding index in indices.A vector of the same length as indices and ids, in which each element
-// is -FLOAT_MAX.
-func ComputeAccidentalHits(scope *Scope, true_classes tf.Output, sampled_candidates tf.Output, num_true int64, optional ...ComputeAccidentalHitsAttr) (indices tf.Output, ids tf.Output, weights tf.Output) {
+// Returns A tensor with shape `shape + shape(alpha)`. Each slice
+// `[:, ..., :, i0, i1, ...iN]` contains the samples drawn for
+// `alpha[i0, i1, ...iN]`. The dtype of the output matches the dtype of alpha.
+func RandomGamma(scope *Scope, shape tf.Output, alpha tf.Output, optional ...RandomGammaAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_true": num_true}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ComputeAccidentalHits",
+		Type: "RandomGamma",
 		Input: []tf.Input{
-			true_classes, sampled_candidates,
+			shape, alpha,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
-}
-
-// Computes sigmoid of `x` element-wise.
-//
-// Specifically, `y = 1 / (1 + exp(-x))`.
-func Sigmoid(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Sigmoid",
-		Input: []tf.Input{
-			x,
-		},
-	}
-	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// RandomStandardNormalAttr is an optional argument to RandomStandardNormal.
-type RandomStandardNormalAttr func(optionalAttr)
+// RandomUniformIntAttr is an optional argument to RandomUniformInt.
+type RandomUniformIntAttr func(optionalAttr)
 
-// RandomStandardNormalSeed sets the optional seed attribute to value.
+// RandomUniformIntSeed sets the optional seed attribute to value.
 //
 // value: If either `seed` or `seed2` are set to be non-zero, the random number
 // generator is seeded by the given seed.  Otherwise, it is seeded by a
 // random seed.
 // If not specified, defaults to 0
-func RandomStandardNormalSeed(value int64) RandomStandardNormalAttr {
+func RandomUniformIntSeed(value int64) RandomUniformIntAttr {
 	return func(m optionalAttr) {
 		m["seed"] = value
 	}
 }
 
-// RandomStandardNormalSeed2 sets the optional seed2 attribute to value.
+// RandomUniformIntSeed2 sets the optional seed2 attribute to value.
 //
 // value: A second seed to avoid seed collision.
 // If not specified, defaults to 0
-func RandomStandardNormalSeed2(value int64) RandomStandardNormalAttr {
+func RandomUniformIntSeed2(value int64) RandomUniformIntAttr {
 	return func(m optionalAttr) {
 		m["seed2"] = value
 	}
 }
 
-// Outputs random values from a normal distribution.
+// Outputs random integers from a uniform distribution.
 //
-// The generated values will have mean 0 and standard deviation 1.
+// The generated values are uniform integers in the range `[minval, maxval)`.
+// The lower bound `minval` is included in the range, while the upper bound
+// `maxval` is excluded.
+//
+// The random integers are slightly biased unless `maxval - minval` is an exact
+// power of two.  The bias is small for values of `maxval - minval` significantly
+// smaller than the range of the output (either `2^32` or `2^64`).
 //
 // Arguments:
 //	shape: The shape of the output tensor.
-//	dtype: The type of the output.
+//	minval: 0-D.  Inclusive lower bound on the generated integers.
+//	maxval: 0-D.  Exclusive upper bound on the generated integers.
 //
-// Returns A tensor of the specified shape filled with random normal values.
-func RandomStandardNormal(scope *Scope, shape tf.Output, dtype tf.DataType, optional ...RandomStandardNormalAttr) (output tf.Output) {
+// Returns A tensor of the specified shape filled with uniform random integers.
+func RandomUniformInt(scope *Scope, shape tf.Output, minval tf.Output, maxval tf.Output, optional ...RandomUniformIntAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "RandomStandardNormal",
+		Type: "RandomUniformInt",
 		Input: []tf.Input{
-			shape,
+			shape, minval, maxval,
 		},
 		Attrs: attrs,
 	}
@@ -16284,147 +16587,143 @@ func RandomStandardNormal(scope *Scope, shape tf.Output, dtype tf.DataType, opti
 	return op.Output(0)
 }
 
-// FusedBatchNormAttr is an optional argument to FusedBatchNorm.
-type FusedBatchNormAttr func(optionalAttr)
+// SkipgramAttr is an optional argument to Skipgram.
+type SkipgramAttr func(optionalAttr)
 
-// FusedBatchNormEpsilon sets the optional epsilon attribute to value.
+// SkipgramWindowSize sets the optional window_size attribute to value.
 //
-// value: A small float number added to the variance of x.
-// If not specified, defaults to 0.0001
-func FusedBatchNormEpsilon(value float32) FusedBatchNormAttr {
+// value: The number of words to predict to the left and right of the target.
+// If not specified, defaults to 5
+func SkipgramWindowSize(value int64) SkipgramAttr {
 	return func(m optionalAttr) {
-		m["epsilon"] = value
+		m["window_size"] = value
 	}
 }
 
-// FusedBatchNormDataFormat sets the optional data_format attribute to value.
+// SkipgramMinCount sets the optional min_count attribute to value.
 //
-// value: The data format for x and y. Either "NHWC" (default) or "NCHW".
-// If not specified, defaults to "NHWC"
-func FusedBatchNormDataFormat(value string) FusedBatchNormAttr {
+// value: The minimum number of times a word must occur to be included in the
+// vocabulary.
+// If not specified, defaults to 5
+func SkipgramMinCount(value int64) SkipgramAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["min_count"] = value
 	}
 }
 
-// FusedBatchNormIsTraining sets the optional is_training attribute to value.
+// SkipgramSubsample sets the optional subsample attribute to value.
 //
-// value: A bool value to indicate the operation is for training (default)
-// or inference.
-// If not specified, defaults to true
-func FusedBatchNormIsTraining(value bool) FusedBatchNormAttr {
+// value: Threshold for word occurrence. Words that appear with higher
+// frequency will be randomly down-sampled. Set to 0 to disable.
+// If not specified, defaults to 0.001
+func SkipgramSubsample(value float32) SkipgramAttr {
 	return func(m optionalAttr) {
-		m["is_training"] = value
+		m["subsample"] = value
 	}
 }
 
-// Batch normalization.
+// Parses a text file and creates a batch of examples.
 //
-// Note that the size of 4D Tensors are defined by either "NHWC" or "NCHW".
-// The size of 1D Tensors matches the dimension C of the 4D Tensors.
+// DEPRECATED at GraphDef version 19: Moving word2vec into tensorflow_models/tutorials and deprecating its ops here as a result
 //
 // Arguments:
-//	x: A 4D Tensor for input data.
-//	scale: A 1D Tensor for scaling factor, to scale the normalized x.
-//	offset: A 1D Tensor for offset, to shift to the normalized x.
-//	mean: A 1D Tensor for population mean. Used for inference only;
-// must be empty for training.
-//	variance: A 1D Tensor for population variance. Used for inference only;
-// must be empty for training.
+//	filename: The corpus's text file name.
+//	batch_size: The size of the produced batch.
 //
-// Returns A 4D Tensor for output data.A 1D Tensor for the computed batch mean, to be used by TensorFlow
-// to compute the running mean.A 1D Tensor for the computed batch variance, to be used by
-// TensorFlow to compute the running variance.A 1D Tensor for the computed batch mean, to be reused
-// in the gradient computation.A 1D Tensor for the computed batch variance (inverted variance
-// in the cuDNN case), to be reused in the gradient computation.
-func FusedBatchNorm(scope *Scope, x tf.Output, scale tf.Output, offset tf.Output, mean tf.Output, variance tf.Output, optional ...FusedBatchNormAttr) (y tf.Output, batch_mean tf.Output, batch_variance tf.Output, reserve_space_1 tf.Output, reserve_space_2 tf.Output) {
+// Returns A vector of words in the corpus. Frequencies of words, sorted in non-ascending order. Number of words per epoch in the data file. The current epoch number. The total number of words processed so far. A vector of word ids. A vector of word ids.
+func Skipgram(scope *Scope, filename string, batch_size int64, optional ...SkipgramAttr) (vocab_word tf.Output, vocab_freq tf.Output, words_per_epoch tf.Output, current_epoch tf.Output, total_words_processed tf.Output, examples tf.Output, labels tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"filename": filename, "batch_size": batch_size}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FusedBatchNorm",
-		Input: []tf.Input{
-			x, scale, offset, mean, variance,
-		},
+		Type: "Skipgram",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4), op.Output(5), op.Output(6)
 }
 
-// Computes tan of x element-wise.
-func Tan(scope *Scope, x tf.Output) (y tf.Output) {
+// StringToNumberAttr is an optional argument to StringToNumber.
+type StringToNumberAttr func(optionalAttr)
+
+// StringToNumberOutType sets the optional out_type attribute to value.
+//
+// value: The numeric type to interpret each string in `string_tensor` as.
+// If not specified, defaults to DT_FLOAT
+func StringToNumberOutType(value tf.DataType) StringToNumberAttr {
+	return func(m optionalAttr) {
+		m["out_type"] = value
+	}
+}
+
+// Converts each string in the input Tensor to the specified numeric type.
+//
+// (Note that int32 overflow results in an error while float overflow
+// results in a rounded value.)
+//
+// Returns A Tensor of the same shape as the input `string_tensor`.
+func StringToNumber(scope *Scope, string_tensor tf.Output, optional ...StringToNumberAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Tan",
+		Type: "StringToNumber",
 		Input: []tf.Input{
-			x,
+			string_tensor,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// FusedBatchNormV2Attr is an optional argument to FusedBatchNormV2.
-type FusedBatchNormV2Attr func(optionalAttr)
-
-// FusedBatchNormV2Epsilon sets the optional epsilon attribute to value.
-//
-// value: A small float number added to the variance of x.
-// If not specified, defaults to 0.0001
-func FusedBatchNormV2Epsilon(value float32) FusedBatchNormV2Attr {
-	return func(m optionalAttr) {
-		m["epsilon"] = value
-	}
-}
-
-// FusedBatchNormV2DataFormat sets the optional data_format attribute to value.
-//
-// value: The data format for x and y. Either "NHWC" (default) or "NCHW".
-// If not specified, defaults to "NHWC"
-func FusedBatchNormV2DataFormat(value string) FusedBatchNormV2Attr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
+// ResourceApplyFtrlV2Attr is an optional argument to ResourceApplyFtrlV2.
+type ResourceApplyFtrlV2Attr func(optionalAttr)
 
-// FusedBatchNormV2IsTraining sets the optional is_training attribute to value.
+// ResourceApplyFtrlV2UseLocking sets the optional use_locking attribute to value.
 //
-// value: A bool value to indicate the operation is for training (default)
-// or inference.
-// If not specified, defaults to true
-func FusedBatchNormV2IsTraining(value bool) FusedBatchNormV2Attr {
+// value: If `True`, updating of the var and accum tensors will be protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyFtrlV2UseLocking(value bool) ResourceApplyFtrlV2Attr {
 	return func(m optionalAttr) {
-		m["is_training"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Batch normalization.
+// Update '*var' according to the Ftrl-proximal scheme.
 //
-// Note that the size of 4D Tensors are defined by either "NHWC" or "NCHW".
-// The size of 1D Tensors matches the dimension C of the 4D Tensors.
+// grad_with_shrinkage = grad + 2 * l2_shrinkage * var
+// accum_new = accum + grad_with_shrinkage * grad_with_shrinkage
+// linear += grad_with_shrinkage +
+//     (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var
+// quadratic = 1.0 / (accum_new^(lr_power) * lr) + 2 * l2
+// var = (sign(linear) * l1 - linear) / quadratic if |linear| > l1 else 0.0
+// accum = accum_new
 //
 // Arguments:
-//	x: A 4D Tensor for input data.
-//	scale: A 1D Tensor for scaling factor, to scale the normalized x.
-//	offset: A 1D Tensor for offset, to shift to the normalized x.
-//	mean: A 1D Tensor for population mean. Used for inference only;
-// must be empty for training.
-//	variance: A 1D Tensor for population variance. Used for inference only;
-// must be empty for training.
+//	var_: Should be from a Variable().
+//	accum: Should be from a Variable().
+//	linear: Should be from a Variable().
+//	grad: The gradient.
+//	lr: Scaling factor. Must be a scalar.
+//	l1: L1 regularization. Must be a scalar.
+//	l2: L2 shrinkage regularization. Must be a scalar.
 //
-// Returns A 4D Tensor for output data.A 1D Tensor for the computed batch mean, to be used by TensorFlow
-// to compute the running mean.A 1D Tensor for the computed batch variance, to be used by
-// TensorFlow to compute the running variance.A 1D Tensor for the computed batch mean, to be reused
-// in the gradient computation.A 1D Tensor for the computed batch variance (inverted variance
-// in the cuDNN case), to be reused in the gradient computation.
-func FusedBatchNormV2(scope *Scope, x tf.Output, scale tf.Output, offset tf.Output, mean tf.Output, variance tf.Output, optional ...FusedBatchNormV2Attr) (y tf.Output, batch_mean tf.Output, batch_variance tf.Output, reserve_space_1 tf.Output, reserve_space_2 tf.Output) {
+//	lr_power: Scaling factor. Must be a scalar.
+//
+// Returns the created operation.
+func ResourceApplyFtrlV2(scope *Scope, var_ tf.Output, accum tf.Output, linear tf.Output, grad tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, l2_shrinkage tf.Output, lr_power tf.Output, optional ...ResourceApplyFtrlV2Attr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -16433,69 +16732,64 @@ func FusedBatchNormV2(scope *Scope, x tf.Output, scale tf.Output, offset tf.Outp
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FusedBatchNormV2",
+		Type: "ResourceApplyFtrlV2",
 		Input: []tf.Input{
-			x, scale, offset, mean, variance,
+			var_, accum, linear, grad, lr, l1, l2, l2_shrinkage, lr_power,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
+	return scope.AddOperation(opspec)
 }
 
-// MultinomialAttr is an optional argument to Multinomial.
-type MultinomialAttr func(optionalAttr)
+// TruncatedNormalAttr is an optional argument to TruncatedNormal.
+type TruncatedNormalAttr func(optionalAttr)
 
-// MultinomialSeed sets the optional seed attribute to value.
+// TruncatedNormalSeed sets the optional seed attribute to value.
 //
-// value: If either seed or seed2 is set to be non-zero, the internal random number
-// generator is seeded by the given seed.  Otherwise, a random seed is used.
+// value: If either `seed` or `seed2` is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
 // If not specified, defaults to 0
-func MultinomialSeed(value int64) MultinomialAttr {
+func TruncatedNormalSeed(value int64) TruncatedNormalAttr {
 	return func(m optionalAttr) {
 		m["seed"] = value
 	}
 }
 
-// MultinomialSeed2 sets the optional seed2 attribute to value.
+// TruncatedNormalSeed2 sets the optional seed2 attribute to value.
 //
 // value: A second seed to avoid seed collision.
 // If not specified, defaults to 0
-func MultinomialSeed2(value int64) MultinomialAttr {
+func TruncatedNormalSeed2(value int64) TruncatedNormalAttr {
 	return func(m optionalAttr) {
 		m["seed2"] = value
 	}
 }
 
-// MultinomialOutputDtype sets the optional output_dtype attribute to value.
-// If not specified, defaults to DT_INT64
-func MultinomialOutputDtype(value tf.DataType) MultinomialAttr {
-	return func(m optionalAttr) {
-		m["output_dtype"] = value
-	}
-}
-
-// Draws samples from a multinomial distribution.
+// Outputs random values from a truncated normal distribution.
+//
+// The generated values follow a normal distribution with mean 0 and standard
+// deviation 1, except that values whose magnitude is more than 2 standard
+// deviations from the mean are dropped and re-picked.
 //
 // Arguments:
-//	logits: 2-D Tensor with shape `[batch_size, num_classes]`.  Each slice `[i, :]`
-// represents the unnormalized log probabilities for all classes.
-//	num_samples: 0-D.  Number of independent samples to draw for each row slice.
+//	shape: The shape of the output tensor.
+//	dtype: The type of the output.
 //
-// Returns 2-D Tensor with shape `[batch_size, num_samples]`.  Each slice `[i, :]`
-// contains the drawn class labels with range `[0, num_classes)`.
-func Multinomial(scope *Scope, logits tf.Output, num_samples tf.Output, optional ...MultinomialAttr) (output tf.Output) {
+// Returns A tensor of the specified shape filled with random truncated normal
+// values.
+func TruncatedNormal(scope *Scope, shape tf.Output, dtype tf.DataType, optional ...TruncatedNormalAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Multinomial",
+		Type: "TruncatedNormal",
 		Input: []tf.Input{
-			logits, num_samples,
+			shape,
 		},
 		Attrs: attrs,
 	}
@@ -16503,172 +16797,277 @@ func Multinomial(scope *Scope, logits tf.Output, num_samples tf.Output, optional
 	return op.Output(0)
 }
 
-// EncodeJpegAttr is an optional argument to EncodeJpeg.
-type EncodeJpegAttr func(optionalAttr)
+// RandomShuffleAttr is an optional argument to RandomShuffle.
+type RandomShuffleAttr func(optionalAttr)
 
-// EncodeJpegFormat sets the optional format attribute to value.
+// RandomShuffleSeed sets the optional seed attribute to value.
 //
-// value: Per pixel image format.
-// If not specified, defaults to ""
-func EncodeJpegFormat(value string) EncodeJpegAttr {
+// value: If either `seed` or `seed2` is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func RandomShuffleSeed(value int64) RandomShuffleAttr {
 	return func(m optionalAttr) {
-		m["format"] = value
+		m["seed"] = value
 	}
 }
 
-// EncodeJpegQuality sets the optional quality attribute to value.
+// RandomShuffleSeed2 sets the optional seed2 attribute to value.
 //
-// value: Quality of the compression from 0 to 100 (higher is better and slower).
-// If not specified, defaults to 95
-func EncodeJpegQuality(value int64) EncodeJpegAttr {
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func RandomShuffleSeed2(value int64) RandomShuffleAttr {
 	return func(m optionalAttr) {
-		m["quality"] = value
+		m["seed2"] = value
 	}
 }
 
-// EncodeJpegProgressive sets the optional progressive attribute to value.
+// Randomly shuffles a tensor along its first dimension.
 //
-// value: If True, create a JPEG that loads progressively (coarse to fine).
-// If not specified, defaults to false
-func EncodeJpegProgressive(value bool) EncodeJpegAttr {
-	return func(m optionalAttr) {
-		m["progressive"] = value
+// The tensor is shuffled along dimension 0, such that each `value[j]` is mapped
+// to one and only one `output[i]`. For example, a mapping that might occur for a
+// 3x2 tensor is:
+//
+// ```
+// [[1, 2],       [[5, 6],
+//  [3, 4],  ==>   [1, 2],
+//  [5, 6]]        [3, 4]]
+// ```
+//
+// Arguments:
+//	value: The tensor to be shuffled.
+//
+// Returns A tensor of the same shape and type as `value`, shuffled along its first
+// dimension.
+func RandomShuffle(scope *Scope, value tf.Output, optional ...RandomShuffleAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "RandomShuffle",
+		Input: []tf.Input{
+			value,
+		},
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// EncodeJpegOptimizeSize sets the optional optimize_size attribute to value.
+// OrderedMapIncompleteSizeAttr is an optional argument to OrderedMapIncompleteSize.
+type OrderedMapIncompleteSizeAttr func(optionalAttr)
+
+// OrderedMapIncompleteSizeCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// value: If True, spend CPU/RAM to reduce size with no quality change.
-// If not specified, defaults to false
-func EncodeJpegOptimizeSize(value bool) EncodeJpegAttr {
+// REQUIRES: value >= 0
+func OrderedMapIncompleteSizeCapacity(value int64) OrderedMapIncompleteSizeAttr {
 	return func(m optionalAttr) {
-		m["optimize_size"] = value
+		m["capacity"] = value
 	}
 }
 
-// EncodeJpegChromaDownsampling sets the optional chroma_downsampling attribute to value.
+// OrderedMapIncompleteSizeMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// value: See http://en.wikipedia.org/wiki/Chroma_subsampling.
-// If not specified, defaults to true
-func EncodeJpegChromaDownsampling(value bool) EncodeJpegAttr {
+// REQUIRES: value >= 0
+func OrderedMapIncompleteSizeMemoryLimit(value int64) OrderedMapIncompleteSizeAttr {
 	return func(m optionalAttr) {
-		m["chroma_downsampling"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// EncodeJpegDensityUnit sets the optional density_unit attribute to value.
-//
-// value: Unit used to specify `x_density` and `y_density`:
-// pixels per inch (`'in'`) or centimeter (`'cm'`).
-// If not specified, defaults to "in"
-func EncodeJpegDensityUnit(value string) EncodeJpegAttr {
+// OrderedMapIncompleteSizeContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func OrderedMapIncompleteSizeContainer(value string) OrderedMapIncompleteSizeAttr {
 	return func(m optionalAttr) {
-		m["density_unit"] = value
+		m["container"] = value
 	}
 }
 
-// EncodeJpegXDensity sets the optional x_density attribute to value.
-//
-// value: Horizontal pixels per density unit.
-// If not specified, defaults to 300
-func EncodeJpegXDensity(value int64) EncodeJpegAttr {
+// OrderedMapIncompleteSizeSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func OrderedMapIncompleteSizeSharedName(value string) OrderedMapIncompleteSizeAttr {
 	return func(m optionalAttr) {
-		m["x_density"] = value
+		m["shared_name"] = value
 	}
 }
 
-// EncodeJpegYDensity sets the optional y_density attribute to value.
+// Op returns the number of incomplete elements in the underlying container.
+func OrderedMapIncompleteSize(scope *Scope, dtypes []tf.DataType, optional ...OrderedMapIncompleteSizeAttr) (size tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "OrderedMapIncompleteSize",
+
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// DecodeRawAttr is an optional argument to DecodeRaw.
+type DecodeRawAttr func(optionalAttr)
+
+// DecodeRawLittleEndian sets the optional little_endian attribute to value.
 //
-// value: Vertical pixels per density unit.
-// If not specified, defaults to 300
-func EncodeJpegYDensity(value int64) EncodeJpegAttr {
+// value: Whether the input `bytes` are in little-endian order.
+// Ignored for `out_type` values that are stored in a single byte like
+// `uint8`.
+// If not specified, defaults to true
+func DecodeRawLittleEndian(value bool) DecodeRawAttr {
 	return func(m optionalAttr) {
-		m["y_density"] = value
+		m["little_endian"] = value
 	}
 }
 
-// EncodeJpegXmpMetadata sets the optional xmp_metadata attribute to value.
+// Reinterpret the bytes of a string as a vector of numbers.
 //
-// value: If not empty, embed this XMP metadata in the image header.
-// If not specified, defaults to ""
-func EncodeJpegXmpMetadata(value string) EncodeJpegAttr {
-	return func(m optionalAttr) {
-		m["xmp_metadata"] = value
+// Arguments:
+//	bytes: All the elements must have the same length.
+//
+//
+// Returns A Tensor with one more dimension than the input `bytes`.  The
+// added dimension will have size equal to the length of the elements
+// of `bytes` divided by the number of bytes to represent `out_type`.
+func DecodeRaw(scope *Scope, bytes tf.Output, out_type tf.DataType, optional ...DecodeRawAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"out_type": out_type}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "DecodeRaw",
+		Input: []tf.Input{
+			bytes,
+		},
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// JPEG-encode an image.
+// Copy a tensor setting everything outside a central band in each innermost matrix
 //
-// `image` is a 3-D uint8 Tensor of shape `[height, width, channels]`.
+// to zero.
 //
-// The attr `format` can be used to override the color format of the encoded
-// output.  Values can be:
+// The `band` part is computed as follows:
+// Assume `input` has `k` dimensions `[I, J, K, ..., M, N]`, then the output is a
+// tensor with the same shape where
 //
-// *   `''`: Use a default format based on the number of channels in the image.
-// *   `grayscale`: Output a grayscale JPEG image.  The `channels` dimension
-//     of `image` must be 1.
-// *   `rgb`: Output an RGB JPEG image. The `channels` dimension
-//     of `image` must be 3.
+// `band[i, j, k, ..., m, n] = in_band(m, n) * input[i, j, k, ..., m, n]`.
 //
-// If `format` is not specified or is the empty string, a default format is picked
-// in function of the number of channels in `image`:
+// The indicator function
 //
-// *   1: Output a grayscale image.
-// *   3: Output an RGB image.
+// `in_band(m, n) = (num_lower < 0 || (m-n) <= num_lower) &&
+//                  (num_upper < 0 || (n-m) <= num_upper)`.
+//
+// For example:
+//
+// ```
+// # if 'input' is [[ 0,  1,  2, 3]
+//                  [-1,  0,  1, 2]
+//                  [-2, -1,  0, 1]
+//                  [-3, -2, -1, 0]],
+//
+// tf.matrix_band_part(input, 1, -1) ==> [[ 0,  1,  2, 3]
+//                                        [-1,  0,  1, 2]
+//                                        [ 0, -1,  0, 1]
+//                                        [ 0,  0, -1, 0]],
+//
+// tf.matrix_band_part(input, 2, 1) ==> [[ 0,  1,  0, 0]
+//                                       [-1,  0,  1, 0]
+//                                       [-2, -1,  0, 1]
+//                                       [ 0, -2, -1, 0]]
+// ```
+//
+// Useful special cases:
+//
+// ```
+//  tf.matrix_band_part(input, 0, -1) ==> Upper triangular part.
+//  tf.matrix_band_part(input, -1, 0) ==> Lower triangular part.
+//  tf.matrix_band_part(input, 0, 0) ==> Diagonal.
+// ```
 //
 // Arguments:
-//	image: 3-D with shape `[height, width, channels]`.
+//	input: Rank `k` tensor.
+//	num_lower: 0-D tensor. Number of subdiagonals to keep. If negative, keep entire
+// lower triangle.
+//	num_upper: 0-D tensor. Number of superdiagonals to keep. If negative, keep
+// entire upper triangle.
 //
-// Returns 0-D. JPEG-encoded image.
-func EncodeJpeg(scope *Scope, image tf.Output, optional ...EncodeJpegAttr) (contents tf.Output) {
+// Returns Rank `k` tensor of the same shape as input. The extracted banded tensor.
+func MatrixBandPart(scope *Scope, input tf.Output, num_lower tf.Output, num_upper tf.Output) (band tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
+	opspec := tf.OpSpec{
+		Type: "MatrixBandPart",
+		Input: []tf.Input{
+			input, num_lower, num_upper,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes acos of x element-wise.
+func Acos(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
 	opspec := tf.OpSpec{
-		Type: "EncodeJpeg",
+		Type: "Acos",
 		Input: []tf.Input{
-			image,
+			x,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MaxPoolGradAttr is an optional argument to MaxPoolGrad.
-type MaxPoolGradAttr func(optionalAttr)
+// MaxPoolWithArgmaxAttr is an optional argument to MaxPoolWithArgmax.
+type MaxPoolWithArgmaxAttr func(optionalAttr)
 
-// MaxPoolGradDataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func MaxPoolGradDataFormat(value string) MaxPoolGradAttr {
+// MaxPoolWithArgmaxTargmax sets the optional Targmax attribute to value.
+// If not specified, defaults to DT_INT64
+func MaxPoolWithArgmaxTargmax(value tf.DataType) MaxPoolWithArgmaxAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["Targmax"] = value
 	}
 }
 
-// Computes gradients of the maxpooling function.
+// Performs max pooling on the input and outputs both max values and indices.
+//
+// The indices in `argmax` are flattened, so that a maximum value at position
+// `[b, y, x, c]` becomes flattened index
+// `((b * height + y) * width + x) * channels + c`.
+//
+// The indices returned are always in `[0, height) x [0, width)` before flattening,
+// even if padding is involved and the mathematically correct answer is outside
+// (either negative or too large).  This is a bug, but fixing it is difficult to do
+// in a safe backwards compatible way, especially due to flattening.
 //
 // Arguments:
-//	orig_input: The original input tensor.
-//	orig_output: The original output tensor.
-//	grad: 4-D.  Gradients w.r.t. the output of `max_pool`.
+//	input: 4-D with shape `[batch, height, width, channels]`.  Input to pool over.
 //	ksize: The size of the window for each dimension of the input tensor.
 //	strides: The stride of the sliding window for each dimension of the
 // input tensor.
 //	padding: The type of padding algorithm to use.
 //
-// Returns Gradients w.r.t. the input to `max_pool`.
-func MaxPoolGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPoolGradAttr) (output tf.Output) {
+// Returns The max pooled output tensor. 4-D.  The flattened indices of the max values chosen for each output.
+func MaxPoolWithArgmax(scope *Scope, input tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPoolWithArgmaxAttr) (output tf.Output, argmax tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -16677,87 +17076,33 @@ func MaxPoolGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MaxPoolGrad",
+		Type: "MaxPoolWithArgmax",
 		Input: []tf.Input{
-			orig_input, orig_output, grad,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// CropAndResizeAttr is an optional argument to CropAndResize.
-type CropAndResizeAttr func(optionalAttr)
-
-// CropAndResizeMethod sets the optional method attribute to value.
-//
-// value: A string specifying the interpolation method. Only 'bilinear' is
-// supported for now.
-// If not specified, defaults to "bilinear"
-func CropAndResizeMethod(value string) CropAndResizeAttr {
-	return func(m optionalAttr) {
-		m["method"] = value
-	}
-}
-
-// CropAndResizeExtrapolationValue sets the optional extrapolation_value attribute to value.
-//
-// value: Value used for extrapolation, when applicable.
-// If not specified, defaults to 0
-func CropAndResizeExtrapolationValue(value float32) CropAndResizeAttr {
-	return func(m optionalAttr) {
-		m["extrapolation_value"] = value
-	}
+	return op.Output(0), op.Output(1)
 }
 
-// Extracts crops from the input image tensor and bilinearly resizes them (possibly
-//
-// with aspect ratio change) to a common output size specified by `crop_size`. This
-// is more general than the `crop_to_bounding_box` op which extracts a fixed size
-// slice from the input image and does not allow resizing or aspect ratio change.
-//
-// Returns a tensor with `crops` from the input `image` at positions defined at the
-// bounding box locations in `boxes`. The cropped boxes are all resized (with
-// bilinear interpolation) to a fixed `size = [crop_height, crop_width]`. The
-// result is a 4-D tensor `[num_boxes, crop_height, crop_width, depth]`. The
-// resizing is corner aligned. In particular, if `boxes = [[0, 0, 1, 1]]`, the
-// method will give identical results to using `tf.image.resize_bilinear()`
-// with `align_corners=True`.
+// Transforms a serialized tensorflow.TensorProto proto into a Tensor.
 //
 // Arguments:
-//	image: A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
-// Both `image_height` and `image_width` need to be positive.
-//	boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
-// specifies the coordinates of a box in the `box_ind[i]` image and is specified
-// in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
-// `y` is mapped to the image coordinate at `y * (image_height - 1)`, so as the
-// `[0, 1]` interval of normalized image height is mapped to
-// `[0, image_height - 1]` in image height coordinates. We do allow `y1` > `y2`, in
-// which case the sampled crop is an up-down flipped version of the original
-// image. The width dimension is treated similarly. Normalized coordinates
-// outside the `[0, 1]` range are allowed, in which case we use
-// `extrapolation_value` to extrapolate the input image values.
-//	box_ind: A 1-D tensor of shape `[num_boxes]` with int32 values in `[0, batch)`.
-// The value of `box_ind[i]` specifies the image that the `i`-th box refers to.
-//	crop_size: A 1-D tensor of 2 elements, `size = [crop_height, crop_width]`. All
-// cropped image patches are resized to this size. The aspect ratio of the image
-// content is not preserved. Both `crop_height` and `crop_width` need to be
-// positive.
+//	serialized: A scalar string containing a serialized TensorProto proto.
+//	out_type: The type of the serialized tensor.  The provided type must match the
+// type of the serialized tensor and no implicit conversion will take place.
 //
-// Returns A 4-D tensor of shape `[num_boxes, crop_height, crop_width, depth]`.
-func CropAndResize(scope *Scope, image tf.Output, boxes tf.Output, box_ind tf.Output, crop_size tf.Output, optional ...CropAndResizeAttr) (crops tf.Output) {
+// Returns A Tensor of type `out_type`.
+func ParseTensor(scope *Scope, serialized tf.Output, out_type tf.DataType) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"out_type": out_type}
 	opspec := tf.OpSpec{
-		Type: "CropAndResize",
+		Type: "ParseTensor",
 		Input: []tf.Input{
-			image, boxes, box_ind, crop_size,
+			serialized,
 		},
 		Attrs: attrs,
 	}
@@ -16765,136 +17110,113 @@ func CropAndResize(scope *Scope, image tf.Output, boxes tf.Output, box_ind tf.Ou
 	return op.Output(0)
 }
 
-// ResourceApplyPowerSignAttr is an optional argument to ResourceApplyPowerSign.
-type ResourceApplyPowerSignAttr func(optionalAttr)
+// MapClearAttr is an optional argument to MapClear.
+type MapClearAttr func(optionalAttr)
 
-// ResourceApplyPowerSignUseLocking sets the optional use_locking attribute to value.
+// MapClearCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// value: If `True`, updating of the var and m tensors is
-// protected by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceApplyPowerSignUseLocking(value bool) ResourceApplyPowerSignAttr {
+// REQUIRES: value >= 0
+func MapClearCapacity(value int64) MapClearAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["capacity"] = value
 	}
 }
 
-// Update '*var' according to the AddSign update.
-//
-// m_t <- beta1 * m_{t-1} + (1 - beta1) * g
-// update <- exp(logbase * sign_decay * sign(g) * sign(m_t)) * g
-// variable <- variable - lr_t * update
-//
-// Arguments:
-//	var_: Should be from a Variable().
-//	m: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	logbase: Must be a scalar.
-//	sign_decay: Must be a scalar.
-//	beta: Must be a scalar.
-//	grad: The gradient.
+// MapClearMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Returns the created operation.
-func ResourceApplyPowerSign(scope *Scope, var_ tf.Output, m tf.Output, lr tf.Output, logbase tf.Output, sign_decay tf.Output, beta tf.Output, grad tf.Output, optional ...ResourceApplyPowerSignAttr) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "ResourceApplyPowerSign",
-		Input: []tf.Input{
-			var_, m, lr, logbase, sign_decay, beta, grad,
-		},
-		Attrs: attrs,
+// REQUIRES: value >= 0
+func MapClearMemoryLimit(value int64) MapClearAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
 	}
-	return scope.AddOperation(opspec)
 }
 
-// Deprecated. Disallowed in GraphDef version >= 2.
-//
-// DEPRECATED at GraphDef version 2: Use AdjustContrastv2 instead
-func AdjustContrast(scope *Scope, images tf.Output, contrast_factor tf.Output, min_value tf.Output, max_value tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
+// MapClearContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func MapClearContainer(value string) MapClearAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "AdjustContrast",
-		Input: []tf.Input{
-			images, contrast_factor, min_value, max_value,
-		},
+}
+
+// MapClearSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func MapClearSharedName(value string) MapClearAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Table initializer that takes two tensors for keys and values respectively.
-//
-// Arguments:
-//	table_handle: Handle to a table which will be initialized.
-//	keys: Keys of type Tkey.
-//	values: Values of type Tval.
+// Op removes all elements in the underlying container.
 //
 // Returns the created operation.
-func InitializeTableV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
+func MapClear(scope *Scope, dtypes []tf.DataType, optional ...MapClearAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "InitializeTableV2",
-		Input: []tf.Input{
-			table_handle, keys, values,
-		},
+		Type: "MapClear",
+
+		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
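The `MapClear*` setters above follow Go's functional-options pattern: each setter returns a closure that writes one key into the attribute map, and the op constructor applies the closures after seeding the required attributes. The sketch below reproduces that flow without TensorFlow. `ClearAttr`, `Capacity`, `SharedName`, and `buildAttrs` are hypothetical stand-ins, and the definition of `optionalAttr` as a string-keyed map is an assumption inferred from how the setters use it.

```go
package main

import "fmt"

// optionalAttr is assumed to be a string-keyed attribute map,
// matching how the generated setters index into it.
type optionalAttr map[string]interface{}

// ClearAttr is a hypothetical stand-in for MapClearAttr.
type ClearAttr func(optionalAttr)

// Capacity is a hypothetical stand-in for MapClearCapacity.
func Capacity(value int64) ClearAttr {
	return func(m optionalAttr) {
		m["capacity"] = value
	}
}

// SharedName is a hypothetical stand-in for MapClearSharedName.
func SharedName(value string) ClearAttr {
	return func(m optionalAttr) {
		m["shared_name"] = value
	}
}

// buildAttrs seeds the map with the required attribute and then
// applies each optional setter in order, as the wrapper body does.
func buildAttrs(dtypes []string, optional ...ClearAttr) map[string]interface{} {
	attrs := map[string]interface{}{"dtypes": dtypes}
	for _, a := range optional {
		a(attrs)
	}
	return attrs
}

func main() {
	attrs := buildAttrs([]string{"float"}, Capacity(8), SharedName("cache"))
	fmt.Println(len(attrs), attrs["capacity"], attrs["shared_name"])
}
```

Because setters run in argument order, a later setter for the same key overwrites an earlier one. Attributes that are never set stay out of the map; the defaults documented in the setter comments are presumably then applied by the op definition.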
 
-// PrintAttr is an optional argument to Print.
-type PrintAttr func(optionalAttr)
+// DecodeCSVAttr is an optional argument to DecodeCSV.
+type DecodeCSVAttr func(optionalAttr)
 
-// PrintMessage sets the optional message attribute to value.
+// DecodeCSVFieldDelim sets the optional field_delim attribute to value.
 //
-// value: A string, prefix of the error message.
-// If not specified, defaults to ""
-func PrintMessage(value string) PrintAttr {
+// value: char delimiter to separate fields in a record.
+// If not specified, defaults to ","
+func DecodeCSVFieldDelim(value string) DecodeCSVAttr {
 	return func(m optionalAttr) {
-		m["message"] = value
+		m["field_delim"] = value
 	}
 }
 
-// PrintFirstN sets the optional first_n attribute to value.
+// DecodeCSVUseQuoteDelim sets the optional use_quote_delim attribute to value.
 //
-// value: Only log `first_n` number of times. -1 disables logging.
-// If not specified, defaults to -1
-func PrintFirstN(value int64) PrintAttr {
+// value: If false, treats double quotation marks as regular
+// characters inside of the string fields (ignoring RFC 4180, Section 2,
+// Bullet 5).
+// If not specified, defaults to true
+func DecodeCSVUseQuoteDelim(value bool) DecodeCSVAttr {
 	return func(m optionalAttr) {
-		m["first_n"] = value
+		m["use_quote_delim"] = value
 	}
 }
 
-// PrintSummarize sets the optional summarize attribute to value.
+// DecodeCSVNaValue sets the optional na_value attribute to value.
 //
-// value: Only print this many entries of each tensor.
-// If not specified, defaults to 3
-func PrintSummarize(value int64) PrintAttr {
+// value: Additional string to recognize as NA/NaN.
+// If not specified, defaults to ""
+func DecodeCSVNaValue(value string) DecodeCSVAttr {
 	return func(m optionalAttr) {
-		m["summarize"] = value
+		m["na_value"] = value
 	}
 }
 
-// Prints a list of tensors.
+// Convert CSV records to tensors. Each column maps to one tensor.
 //
-// Passes `input` through to `output` and prints `data` when evaluating.
+// RFC 4180 format is expected for the CSV records.
+// (https://tools.ietf.org/html/rfc4180)
+// Note that leading and trailing spaces are allowed in int or float fields.
 //
 // Arguments:
-//	input: The tensor passed to `output`
-//	data: A list of tensors to print out when op is evaluated.
+//	records: Each string is a record/row in the csv and all records should have
+// the same format.
+//	record_defaults: One tensor per column of the input record, with either a
+// scalar default value for that column or empty if the column is required.
 //
-// Returns = The unmodified `input` tensor
-func Print(scope *Scope, input tf.Output, data []tf.Output, optional ...PrintAttr) (output tf.Output) {
+// Returns Each tensor will have the same shape as records.
+func DecodeCSV(scope *Scope, records tf.Output, record_defaults []tf.Output, optional ...DecodeCSVAttr) (output []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -16903,332 +17225,359 @@ func Print(scope *Scope, input tf.Output, data []tf.Output, optional ...PrintAtt
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Print",
+		Type: "DecodeCSV",
 		Input: []tf.Input{
-			input, tf.OutputList(data),
+			records, tf.OutputList(record_defaults),
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Outputs a `Summary` protocol buffer with a tensor and per-plugin data.
-//
-// Arguments:
-//	tag: A string attached to this summary. Used for organization in TensorBoard.
-//	tensor: A tensor to serialize.
-//	serialized_summary_metadata: A serialized SummaryMetadata proto. Contains plugin
-// data.
-func TensorSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, serialized_summary_metadata tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "TensorSummaryV2",
-		Input: []tf.Input{
-			tag, tensor, serialized_summary_metadata,
-		},
+	var idx int
+	var err error
+	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
+		scope.UpdateErr("DecodeCSV", err)
+		return
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return output
 }
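The column-per-tensor layout described for `DecodeCSV` can be illustrated with plain Go. This is a simplified sketch, not the op's implementation. `decodeColumns` is a hypothetical helper that keeps every field as a string, treats an empty field as missing, and substitutes the per-column default. Unlike the real op, it does not treat an empty default as "column required".

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// decodeColumns is a hypothetical helper: it parses each record with
// encoding/csv and returns one slice per column, replacing empty
// fields with the default for that column.
func decodeColumns(records []string, defaults []string) ([][]string, error) {
	cols := make([][]string, len(defaults))
	for _, rec := range records {
		fields, err := csv.NewReader(strings.NewReader(rec)).Read()
		if err != nil {
			return nil, err
		}
		if len(fields) != len(defaults) {
			return nil, fmt.Errorf("record %q has %d fields, want %d", rec, len(fields), len(defaults))
		}
		for i, f := range fields {
			if f == "" {
				f = defaults[i]
			}
			cols[i] = append(cols[i], f)
		}
	}
	return cols, nil
}

func main() {
	cols, err := decodeColumns([]string{"1,a", ",b"}, []string{"0", "x"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cols)
}
```

Each output column has one entry per input record, which mirrors the statement that every returned tensor has the same shape as `records`.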
 
-// Creates a dataset that asynchronously prefetches elements from `input_dataset`.
+// Returns the rank of a tensor.
 //
-// Arguments:
+// This operation returns an integer representing the rank of `input`.
 //
-//	buffer_size: The maximum number of elements to buffer in an iterator over
-// this dataset.
+// For example:
 //
+// ```
+// # 't' is [[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]
+// # shape of tensor 't' is [2, 2, 3]
+// rank(t) ==> 3
+// ```
 //
-func PrefetchDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// **Note**: The rank of a tensor is not the same as the rank of a matrix. The rank
+// of a tensor is the number of indices required to uniquely select each element
+// of the tensor. Rank is also known as "order", "degree", or "ndims."
+func Rank(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "PrefetchDataset",
+		Type: "Rank",
 		Input: []tf.Input{
-			input_dataset, buffer_size,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TensorSummaryAttr is an optional argument to TensorSummary.
-type TensorSummaryAttr func(optionalAttr)
+// Output a fact about factorials.
+func Fact(scope *Scope) (fact tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Fact",
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
 
-// TensorSummaryDescription sets the optional description attribute to value.
+// Makes its input available to the next iteration.
 //
-// value: A json-encoded SummaryDescription proto.
-// If not specified, defaults to ""
-func TensorSummaryDescription(value string) TensorSummaryAttr {
-	return func(m optionalAttr) {
-		m["description"] = value
+// Arguments:
+//	data: The tensor to be made available to the next iteration.
+//
+// Returns The same tensor as `data`.
+func NextIteration(scope *Scope, data tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "NextIteration",
+		Input: []tf.Input{
+			data,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// TensorSummaryLabels sets the optional labels attribute to value.
+// MapPeekAttr is an optional argument to MapPeek.
+type MapPeekAttr func(optionalAttr)
+
+// MapPeekCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// value: An unused list of strings.
-// If not specified, defaults to <>
-func TensorSummaryLabels(value []string) TensorSummaryAttr {
+// REQUIRES: value >= 0
+func MapPeekCapacity(value int64) MapPeekAttr {
 	return func(m optionalAttr) {
-		m["labels"] = value
+		m["capacity"] = value
 	}
 }
 
-// TensorSummaryDisplayName sets the optional display_name attribute to value.
+// MapPeekMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// value: An unused string.
+// REQUIRES: value >= 0
+func MapPeekMemoryLimit(value int64) MapPeekAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// MapPeekContainer sets the optional container attribute to value.
 // If not specified, defaults to ""
-func TensorSummaryDisplayName(value string) TensorSummaryAttr {
+func MapPeekContainer(value string) MapPeekAttr {
 	return func(m optionalAttr) {
-		m["display_name"] = value
+		m["container"] = value
 	}
 }
 
-// Outputs a `Summary` protocol buffer with a tensor.
-//
-// This op is being phased out in favor of TensorSummaryV2, which lets callers pass
-// a tag as well as a serialized SummaryMetadata proto string that contains
-// plugin-specific data. We will keep this op to maintain backwards compatibility.
+// MapPeekSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func MapPeekSharedName(value string) MapPeekAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op peeks at the values at the specified key.
 //
-// Arguments:
-//	tensor: A tensor to serialize.
-func TensorSummary(scope *Scope, tensor tf.Output, optional ...TensorSummaryAttr) (summary tf.Output) {
+// If the underlying container does not contain this key,
+// this op will block until it does.
+func MapPeek(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...MapPeekAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorSummary",
+		Type: "MapPeek",
 		Input: []tf.Input{
-			tensor,
+			key, indices,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("MapPeek", err)
+		return
+	}
+	return values
 }
 
-// Computes the gradient for the tanh of `x` wrt its input.
+// Looks up keys in a table, outputs the corresponding values.
 //
-// Specifically, `grad = dy * (1 - y*y)`, where `y = tanh(x)`, and `dy`
-// is the corresponding input gradient.
-func TanhGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
+// The tensor `keys` must be of the same type as the keys of the table.
+// The output `values` is of the type of the table values.
+//
+// The scalar `default_value` is the value output for keys not present in the
+// table. It must also be of the same type as the table values.
+//
+// Arguments:
+//	table_handle: Handle to the table.
+//	keys: Any shape.  Keys to look up.
+//
+//
+// Returns Same shape as `keys`.  Values found in the table, or `default_value`
+// for missing keys.
+func LookupTableFindV2(scope *Scope, table_handle tf.Output, keys tf.Output, default_value tf.Output) (values tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TanhGrad",
+		Type: "LookupTableFindV2",
 		Input: []tf.Input{
-			y, dy,
+			table_handle, keys, default_value,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
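The lookup-with-default behavior documented for `LookupTableFindV2` can be sketched with an ordinary Go map. `lookupFind` is a hypothetical helper for illustration only. It assumes string keys, `int64` values, and a flat slice of keys, whereas the op accepts keys of any shape.

```go
package main

import "fmt"

// lookupFind is a hypothetical helper: it returns one value per key,
// using defaultValue for keys that are absent from the table.
func lookupFind(table map[string]int64, keys []string, defaultValue int64) []int64 {
	values := make([]int64, len(keys))
	for i, k := range keys {
		if v, ok := table[k]; ok {
			values[i] = v
		} else {
			values[i] = defaultValue
		}
	}
	return values
}

func main() {
	table := map[string]int64{"a": 1, "b": 2}
	fmt.Println(lookupFind(table, []string{"a", "z", "b"}, -1))
}
```

The output has the same length as `keys`, matching the documented "same shape as `keys`" contract.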
 
-// Outputs a `Summary` protocol buffer with scalar values.
+// Bucketizes 'input' based on 'boundaries'.
 //
-// The input `tags` and `values` must have the same shape.  The generated summary
-// has a summary value for each tag-value pair in `tags` and `values`.
+// For example, if the inputs are
+//     boundaries = [0, 10, 100]
+//     input = [[-5, 10000]
+//              [150,   10]
+//              [5,    100]]
+//
+// then the output will be
+//     output = [[0, 3]
+//               [3, 2]
+//               [1, 3]]
 //
 // Arguments:
-//	tags: Tags for the summary.
-//	values: Same shape as `tags.  Values for the summary.
+//	input: A Tensor of any shape containing int or float values.
+//	boundaries: A sorted list of floats giving the boundaries of the buckets.
 //
-// Returns Scalar.  Serialized `Summary` protocol buffer.
-func ScalarSummary(scope *Scope, tags tf.Output, values tf.Output) (summary tf.Output) {
+// Returns Same shape as 'input', with each value of input replaced by its bucket index.
+//
+// @compatibility(numpy)
+// Equivalent to np.digitize.
+// @end_compatibility
+func Bucketize(scope *Scope, input tf.Output, boundaries []float32) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"boundaries": boundaries}
 	opspec := tf.OpSpec{
-		Type: "ScalarSummary",
+		Type: "Bucketize",
 		Input: []tf.Input{
-			tags, values,
+			input,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
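The bucket indices in the `Bucketize` comment can be reproduced with a binary search over the sorted boundaries. The sketch below is an illustration of the documented semantics, not the kernel itself. `bucketIndex` and `bucketize` are hypothetical names, and the rule "index = number of boundaries less than or equal to the value" is inferred from the worked example in the comment.

```go
package main

import (
	"fmt"
	"sort"
)

// bucketIndex returns how many boundaries are <= v, found as the
// position of the first boundary strictly greater than v.
// boundaries must be sorted in ascending order.
func bucketIndex(v float32, boundaries []float32) int {
	return sort.Search(len(boundaries), func(i int) bool {
		return boundaries[i] > v
	})
}

// bucketize applies bucketIndex to every element of a 2-D input,
// preserving the input's shape.
func bucketize(input [][]float32, boundaries []float32) [][]int {
	out := make([][]int, len(input))
	for i, row := range input {
		out[i] = make([]int, len(row))
		for j, v := range row {
			out[i][j] = bucketIndex(v, boundaries)
		}
	}
	return out
}

func main() {
	boundaries := []float32{0, 10, 100}
	input := [][]float32{{-5, 10000}, {150, 10}, {5, 100}}
	fmt.Println(bucketize(input, boundaries)) // prints [[0 3] [3 2] [1 3]]
}
```

A value equal to a boundary falls into the bucket above it (10 maps to 2, 100 maps to 3), which is what the example output in the comment shows.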
 
-// Outputs a `Summary` protocol buffer with a histogram.
+// EncodePngAttr is an optional argument to EncodePng.
+type EncodePngAttr func(optionalAttr)
+
+// EncodePngCompression sets the optional compression attribute to value.
 //
-// The generated
-// [`Summary`](https://www.tensorflow.org/code/tensorflow/core/framework/summary.proto)
-// has one summary value containing a histogram for `values`.
+// value: Compression level.
+// If not specified, defaults to -1
+func EncodePngCompression(value int64) EncodePngAttr {
+	return func(m optionalAttr) {
+		m["compression"] = value
+	}
+}
+
+// PNG-encode an image.
 //
-// This op reports an `InvalidArgument` error if any value is not finite.
+// `image` is a 3-D uint8 or uint16 Tensor of shape `[height, width, channels]`
+// where `channels` is:
+//
+// *   1: for grayscale.
+// *   2: for grayscale + alpha.
+// *   3: for RGB.
+// *   4: for RGBA.
+//
+// The ZLIB compression level, `compression`, can be -1 for the PNG-encoder
+// default or a value from 0 to 9.  9 is the highest compression level, generating
+// the smallest output, but is slower.
 //
 // Arguments:
-//	tag: Scalar.  Tag to use for the `Summary.Value`.
-//	values: Any shape. Values to use to build the histogram.
+//	image: 3-D with shape `[height, width, channels]`.
 //
-// Returns Scalar. Serialized `Summary` protocol buffer.
-func HistogramSummary(scope *Scope, tag tf.Output, values tf.Output) (summary tf.Output) {
+// Returns 0-D. PNG-encoded image.
+func EncodePng(scope *Scope, image tf.Output, optional ...EncodePngAttr) (contents tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "HistogramSummary",
+		Type: "EncodePng",
 		Input: []tf.Input{
-			tag, values,
+			image,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the number of elements in the given queue.
+// Updates the table to associate keys with values.
+//
+// The tensor `keys` must be of the same type as the keys of the table.
+// The tensor `values` must be of the type of the table values.
 //
 // Arguments:
-//	handle: The handle to a queue.
+//	table_handle: Handle to the table.
+//	keys: Any shape.  Keys to look up.
+//	values: Values to associate with keys.
 //
-// Returns The number of elements in the given queue.
-func QueueSizeV2(scope *Scope, handle tf.Output) (size tf.Output) {
+// Returns the created operation.
+func LookupTableInsertV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "QueueSizeV2",
+		Type: "LookupTableInsertV2",
 		Input: []tf.Input{
-			handle,
+			table_handle, keys, values,
+		},
+	}
+	return scope.AddOperation(opspec)
+}
+
+// Returns element-wise smallest integer not less than x.
+func Ceil(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Ceil",
+		Input: []tf.Input{
+			x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ImageSummaryAttr is an optional argument to ImageSummary.
-type ImageSummaryAttr func(optionalAttr)
-
-// ImageSummaryMaxImages sets the optional max_images attribute to value.
-//
-// value: Max number of batch elements to generate images for.
-// If not specified, defaults to 3
-//
-// REQUIRES: value >= 1
-func ImageSummaryMaxImages(value int64) ImageSummaryAttr {
-	return func(m optionalAttr) {
-		m["max_images"] = value
-	}
-}
-
-// ImageSummaryBadColor sets the optional bad_color attribute to value.
-//
-// value: Color to use for pixels with non-finite values.
-// If not specified, defaults to <dtype:DT_UINT8 tensor_shape:<dim:<size:4 > > int_val:255 int_val:0 int_val:0 int_val:255 >
-func ImageSummaryBadColor(value tf.Tensor) ImageSummaryAttr {
-	return func(m optionalAttr) {
-		m["bad_color"] = value
-	}
-}
-
-// Outputs a `Summary` protocol buffer with images.
-//
-// The summary has up to `max_images` summary values containing images. The
-// images are built from `tensor` which must be 4-D with shape `[batch_size,
-// height, width, channels]` and where `channels` can be:
-//
-// *  1: `tensor` is interpreted as Grayscale.
-// *  3: `tensor` is interpreted as RGB.
-// *  4: `tensor` is interpreted as RGBA.
-//
-// The images have the same number of channels as the input tensor. For float
-// input, the values are normalized one image at a time to fit in the range
-// `[0, 255]`.  `uint8` values are unchanged.  The op uses two different
-// normalization algorithms:
-//
-// *  If the input values are all positive, they are rescaled so the largest one
-//    is 255.
-//
-// *  If any input value is negative, the values are shifted so input value 0.0
-//    is at 127.  They are then rescaled so that either the smallest value is 0,
-//    or the largest one is 255.
-//
-// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
-// build the `tag` of the summary values:
-//
-// *  If `max_images` is 1, the summary value tag is '*tag*/image'.
-// *  If `max_images` is greater than 1, the summary value tags are
-//    generated sequentially as '*tag*/image/0', '*tag*/image/1', etc.
-//
-// The `bad_color` argument is the color to use in the generated images for
-// non-finite input values.  It is a `unit8` 1-D tensor of length `channels`.
-// Each element must be in the range `[0, 255]` (It represents the value of a
-// pixel in the output image).  Non-finite values in the input tensor are
-// replaced by this tensor in the output image.  The default value is the color
-// red.
+// Computes the number of elements in the given table.
 //
 // Arguments:
-//	tag: Scalar. Used to build the `tag` attribute of the summary values.
-//	tensor: 4-D of shape `[batch_size, height, width, channels]` where
-// `channels` is 1, 3, or 4.
+//	table_handle: Handle to the table.
 //
-// Returns Scalar. Serialized `Summary` protocol buffer.
-func ImageSummary(scope *Scope, tag tf.Output, tensor tf.Output, optional ...ImageSummaryAttr) (summary tf.Output) {
+// Returns Scalar that contains number of elements in the table.
+func LookupTableSizeV2(scope *Scope, table_handle tf.Output) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ImageSummary",
+		Type: "LookupTableSizeV2",
 		Input: []tf.Input{
-			tag, tensor,
+			table_handle,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// AudioSummaryV2Attr is an optional argument to AudioSummaryV2.
-type AudioSummaryV2Attr func(optionalAttr)
+// ResizeBilinearGradAttr is an optional argument to ResizeBilinearGrad.
+type ResizeBilinearGradAttr func(optionalAttr)
 
-// AudioSummaryV2MaxOutputs sets the optional max_outputs attribute to value.
-//
-// value: Max number of batch elements to generate audio for.
-// If not specified, defaults to 3
+// ResizeBilinearGradAlignCorners sets the optional align_corners attribute to value.
 //
-// REQUIRES: value >= 1
-func AudioSummaryV2MaxOutputs(value int64) AudioSummaryV2Attr {
+// value: If true, rescale grads by (orig_height - 1) / (height - 1), which
+// exactly aligns the 4 corners of grads and original_image. If false, rescale by
+// orig_height / height. Treat similarly the width dimension.
+// If not specified, defaults to false
+func ResizeBilinearGradAlignCorners(value bool) ResizeBilinearGradAttr {
 	return func(m optionalAttr) {
-		m["max_outputs"] = value
+		m["align_corners"] = value
 	}
 }
 
-// Outputs a `Summary` protocol buffer with audio.
-//
-// The summary has up to `max_outputs` summary values containing audio. The
-// audio is built from `tensor` which must be 3-D with shape `[batch_size,
-// frames, channels]` or 2-D with shape `[batch_size, frames]`. The values are
-// assumed to be in the range of `[-1.0, 1.0]` with a sample rate of `sample_rate`.
-//
-// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
-// build the `tag` of the summary values:
-//
-// *  If `max_outputs` is 1, the summary value tag is '*tag*/audio'.
-// *  If `max_outputs` is greater than 1, the summary value tags are
-//    generated sequentially as '*tag*/audio/0', '*tag*/audio/1', etc.
+// Computes the gradient of bilinear interpolation.
 //
 // Arguments:
-//	tag: Scalar. Used to build the `tag` attribute of the summary values.
-//	tensor: 2-D of shape `[batch_size, frames]`.
-//	sample_rate: The sample rate of the signal in hertz.
+//	grads: 4-D with shape `[batch, height, width, channels]`.
+//	original_image: 4-D with shape `[batch, orig_height, orig_width, channels]`,
+// The image tensor that was resized.
 //
-// Returns Scalar. Serialized `Summary` protocol buffer.
-func AudioSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, sample_rate tf.Output, optional ...AudioSummaryV2Attr) (summary tf.Output) {
+// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`.
+// Gradients with respect to the input image. Input image must have been
+// float or double.
+func ResizeBilinearGrad(scope *Scope, grads tf.Output, original_image tf.Output, optional ...ResizeBilinearGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -17237,9 +17586,9 @@ func AudioSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, sample_rate t
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "AudioSummaryV2",
+		Type: "ResizeBilinearGrad",
 		Input: []tf.Input{
-			tag, tensor, sample_rate,
+			grads, original_image,
 		},
 		Attrs: attrs,
 	}
@@ -17247,390 +17596,346 @@ func AudioSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, sample_rate t
 	return op.Output(0)
 }
 
-// AvgPoolAttr is an optional argument to AvgPool.
-type AvgPoolAttr func(optionalAttr)
-
-// AvgPoolDataFormat sets the optional data_format attribute to value.
+// Outputs all keys and values in the table.
 //
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func AvgPoolDataFormat(value string) AvgPoolAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Performs average pooling on the input.
+// Arguments:
+//	table_handle: Handle to the table.
 //
-// Each entry in `output` is the mean of the corresponding size `ksize`
-// window in `value`.
 //
-// Arguments:
-//	value: 4-D with shape `[batch, height, width, channels]`.
-//	ksize: The size of the sliding window for each dimension of `value`.
-//	strides: The stride of the sliding window for each dimension of `value`.
-//	padding: The type of padding algorithm to use.
 //
-// Returns The average pooled output tensor.
-func AvgPool(scope *Scope, value tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPoolAttr) (output tf.Output) {
+// Returns Vector of all keys present in the table. Tensor of all values in the table. Indexed in parallel with `keys`.
+func LookupTableExportV2(scope *Scope, table_handle tf.Output, Tkeys tf.DataType, Tvalues tf.DataType) (keys tf.Output, values tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"Tkeys": Tkeys, "Tvalues": Tvalues}
 	opspec := tf.OpSpec{
-		Type: "AvgPool",
+		Type: "LookupTableExportV2",
 		Input: []tf.Input{
-			value,
+			table_handle,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Merges summaries.
-//
-// This op creates a
-// [`Summary`](https://www.tensorflow.org/code/tensorflow/core/framework/summary.proto)
-// protocol buffer that contains the union of all the values in the input
-// summaries.
+// Replaces the contents of the table with the specified keys and values.
 //
-// When the Op is run, it reports an `InvalidArgument` error if multiple values
-// in the summaries to merge use the same tag.
+// The tensor `keys` must be of the same type as the keys of the table.
+// The tensor `values` must be of the type of the table values.
 //
 // Arguments:
-//	inputs: Can be of any shape.  Each must contain serialized `Summary` protocol
-// buffers.
+//	table_handle: Handle to the table.
+//	keys: Any shape.  Keys to look up.
+//	values: Values to associate with keys.
 //
-// Returns Scalar. Serialized `Summary` protocol buffer.
-func MergeSummary(scope *Scope, inputs []tf.Output) (summary tf.Output) {
+// Returns the created operation.
+func LookupTableImportV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "MergeSummary",
+		Type: "LookupTableImportV2",
 		Input: []tf.Input{
-			tf.OutputList(inputs),
+			table_handle, keys, values,
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Computes the gradient of morphological 2-D dilation with respect to the filter.
-//
-// Arguments:
-//	input: 4-D with shape `[batch, in_height, in_width, depth]`.
-//	filter: 3-D with shape `[filter_height, filter_width, depth]`.
-//	out_backprop: 4-D with shape `[batch, out_height, out_width, depth]`.
-//	strides: 1-D of length 4. The stride of the sliding window for each dimension of
-// the input tensor. Must be: `[1, stride_height, stride_width, 1]`.
-//	rates: 1-D of length 4. The input stride for atrous morphological dilation.
-// Must be: `[1, rate_height, rate_width, 1]`.
-//	padding: The type of padding algorithm to use.
+// MapUnstageNoKeyAttr is an optional argument to MapUnstageNoKey.
+type MapUnstageNoKeyAttr func(optionalAttr)
+
+// MapUnstageNoKeyCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// Returns 3-D with shape `[filter_height, filter_width, depth]`.
-func Dilation2DBackpropFilter(scope *Scope, input tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, rates []int64, padding string) (filter_backprop tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"strides": strides, "rates": rates, "padding": padding}
-	opspec := tf.OpSpec{
-		Type: "Dilation2DBackpropFilter",
-		Input: []tf.Input{
-			input, filter, out_backprop,
-		},
-		Attrs: attrs,
+// REQUIRES: value >= 0
+func MapUnstageNoKeyCapacity(value int64) MapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// AddSparseToTensorsMapAttr is an optional argument to AddSparseToTensorsMap.
-type AddSparseToTensorsMapAttr func(optionalAttr)
-
-// AddSparseToTensorsMapContainer sets the optional container attribute to value.
+// MapUnstageNoKeyMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// value: The container name for the `SparseTensorsMap` created by this op.
+// REQUIRES: value >= 0
+func MapUnstageNoKeyMemoryLimit(value int64) MapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// MapUnstageNoKeyContainer sets the optional container attribute to value.
 // If not specified, defaults to ""
-func AddSparseToTensorsMapContainer(value string) AddSparseToTensorsMapAttr {
+func MapUnstageNoKeyContainer(value string) MapUnstageNoKeyAttr {
 	return func(m optionalAttr) {
 		m["container"] = value
 	}
 }
 
-// AddSparseToTensorsMapSharedName sets the optional shared_name attribute to value.
-//
-// value: The shared name for the `SparseTensorsMap` created by this op.
-// If blank, the new Operation's unique name is used.
+// MapUnstageNoKeySharedName sets the optional shared_name attribute to value.
 // If not specified, defaults to ""
-func AddSparseToTensorsMapSharedName(value string) AddSparseToTensorsMapAttr {
+func MapUnstageNoKeySharedName(value string) MapUnstageNoKeyAttr {
 	return func(m optionalAttr) {
 		m["shared_name"] = value
 	}
 }
 
-// Add a `SparseTensor` to a `SparseTensorsMap` return its handle.
-//
-// A `SparseTensor` is represented by three tensors: `sparse_indices`,
-// `sparse_values`, and `sparse_shape`.
+// Op removes and returns a random (key, value)
 //
-// This operator takes the given `SparseTensor` and adds it to a container
-// object (a `SparseTensorsMap`).  A unique key within this container is generated
-// in the form of an `int64`, and this is the value that is returned.
-//
-// The `SparseTensor` can then be read out as part of a minibatch by passing
-// the key as a vector element to `TakeManySparseFromTensorsMap`.  To ensure
-// the correct `SparseTensorsMap` is accessed, ensure that the same
-// `container` and `shared_name` are passed to that Op.  If no `shared_name`
-// is provided here, instead use the *name* of the Operation created by calling
-// `AddSparseToTensorsMap` as the `shared_name` passed to
-// `TakeManySparseFromTensorsMap`.  Ensure the Operations are colocated.
-//
-// Arguments:
-//	sparse_indices: 2-D.  The `indices` of the `SparseTensor`.
-//	sparse_values: 1-D.  The `values` of the `SparseTensor`.
-//	sparse_shape: 1-D.  The `shape` of the `SparseTensor`.
-//
-// Returns 0-D.  The handle of the `SparseTensor` now stored in the
-// `SparseTensorsMap`.
-func AddSparseToTensorsMap(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...AddSparseToTensorsMapAttr) (sparse_handle tf.Output) {
+// from the underlying container.  If the underlying container
+// does not contain elements, the op will block until it does.
+func MapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataType, optional ...MapUnstageNoKeyAttr) (key tf.Output, values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "AddSparseToTensorsMap",
+		Type: "MapUnstageNoKey",
 		Input: []tf.Input{
-			sparse_indices, sparse_values, sparse_shape,
+			indices,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Writes a `Summary` protocol buffer with scalar values.
-//
-// The input `tag` and `value` must have the scalars.
-//
-// Arguments:
-//	writer: A handle to a summary writer.
-//	step: The step to write the summary for.
-//	tag: Tag for the summary.
-//	value: Value for the summary.
-//
-// Returns the created operation.
-func WriteScalarSummary(scope *Scope, writer tf.Output, step tf.Output, tag tf.Output, value tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "WriteScalarSummary",
-		Input: []tf.Input{
-			writer, step, tag, value,
-		},
+	var idx int
+	var err error
+	key = op.Output(idx)
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("MapUnstageNoKey", err)
+		return
 	}
-	return scope.AddOperation(opspec)
+	return key, values
 }
 
-// Computes the matrix exponential of one or more square matrices:
+// HashTableV2Attr is an optional argument to HashTableV2.
+type HashTableV2Attr func(optionalAttr)
+
+// HashTableV2Container sets the optional container attribute to value.
 //
-// exp(A) = \sum_{n=0}^\infty A^n/n!
+// value: If non-empty, this table is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func HashTableV2Container(value string) HashTableV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// HashTableV2SharedName sets the optional shared_name attribute to value.
 //
-// The exponential is computed using a combination of the scaling and squaring
-// method and the Pade approximation. Details can be founds in:
-// Nicholas J. Higham, "The scaling and squaring method for the matrix exponential
-// revisited," SIAM J. Matrix Anal. Applic., 26:1179-1193, 2005.
+// value: If non-empty, this table is shared under the given name across
+// multiple sessions.
+// If not specified, defaults to ""
+func HashTableV2SharedName(value string) HashTableV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// HashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
 //
-// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices. The output is a tensor of the same shape as the input
-// containing the exponential for all input submatrices `[..., :, :]`.
+// value: If true and shared_name is empty, the table is shared
+// using the node name.
+// If not specified, defaults to false
+func HashTableV2UseNodeNameSharing(value bool) HashTableV2Attr {
+	return func(m optionalAttr) {
+		m["use_node_name_sharing"] = value
+	}
+}
+
+// Creates a non-initialized hash table.
 //
-// Arguments:
-//	input: Shape is `[..., M, M]`.
+// This op creates a hash table, specifying the type of its keys and values.
+// Before using the table you will have to initialize it.  After initialization the
+// table will be immutable.
 //
-// Returns Shape is `[..., M, M]`.
+// Arguments:
+//	key_dtype: Type of the table keys.
+//	value_dtype: Type of the table values.
 //
-// @compatibility(scipy)
-// Equivalent to scipy.linalg.expm
-// @end_compatibility
-func MatrixExponential(scope *Scope, input tf.Output) (output tf.Output) {
+// Returns Handle to a table.
+func HashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...HashTableV2Attr) (table_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "MatrixExponential",
-		Input: []tf.Input{
-			input,
-		},
+		Type: "HashTableV2",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// QueueDequeueUpToV2Attr is an optional argument to QueueDequeueUpToV2.
-type QueueDequeueUpToV2Attr func(optionalAttr)
+// MutableHashTableV2Attr is an optional argument to MutableHashTableV2.
+type MutableHashTableV2Attr func(optionalAttr)
 
-// QueueDequeueUpToV2TimeoutMs sets the optional timeout_ms attribute to value.
+// MutableHashTableV2Container sets the optional container attribute to value.
 //
-// value: If the queue has fewer than n elements, this operation
-// will block for up to timeout_ms milliseconds.
-// Note: This option is not supported yet.
-// If not specified, defaults to -1
-func QueueDequeueUpToV2TimeoutMs(value int64) QueueDequeueUpToV2Attr {
+// value: If non-empty, this table is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func MutableHashTableV2Container(value string) MutableHashTableV2Attr {
 	return func(m optionalAttr) {
-		m["timeout_ms"] = value
+		m["container"] = value
 	}
 }
 
-// Dequeues `n` tuples of one or more tensors from the given queue.
-//
-// This operation is not supported by all queues.  If a queue does not support
-// DequeueUpTo, then an Unimplemented error is returned.
+// MutableHashTableV2SharedName sets the optional shared_name attribute to value.
 //
-// If the queue is closed and there are more than 0 but less than `n`
-// elements remaining, then instead of returning an OutOfRange error like
-// QueueDequeueMany, less than `n` elements are returned immediately.  If
-// the queue is closed and there are 0 elements left in the queue, then
-// an OutOfRange error is returned just like in QueueDequeueMany.
-// Otherwise the behavior is identical to QueueDequeueMany:
+// value: If non-empty, this table is shared under the given name across
+// multiple sessions.
+// If not specified, defaults to ""
+func MutableHashTableV2SharedName(value string) MutableHashTableV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// MutableHashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
 //
-// This operation concatenates queue-element component tensors along the
-// 0th dimension to make a single component tensor.  All of the components
-// in the dequeued tuple will have size n in the 0th dimension.
+// value: If true and shared_name is empty, the table is shared
+// using the node name.
+// If not specified, defaults to false
+func MutableHashTableV2UseNodeNameSharing(value bool) MutableHashTableV2Attr {
+	return func(m optionalAttr) {
+		m["use_node_name_sharing"] = value
+	}
+}
+
+// Creates an empty hash table.
 //
-// This operation has `k` outputs, where `k` is the number of components in
-// the tuples stored in the given queue, and output `i` is the ith
-// component of the dequeued tuple.
+// This op creates a mutable hash table, specifying the type of its keys and
+// values. Each value must be a scalar. Data can be inserted into the table using
+// the insert operations. It does not support the initialization operation.
 //
 // Arguments:
-//	handle: The handle to a queue.
-//	n: The number of tuples to dequeue.
-//	component_types: The type of each component in a tuple.
+//	key_dtype: Type of the table keys.
+//	value_dtype: Type of the table values.
 //
-// Returns One or more tensors that were dequeued as a tuple.
-func QueueDequeueUpToV2(scope *Scope, handle tf.Output, n tf.Output, component_types []tf.DataType, optional ...QueueDequeueUpToV2Attr) (components []tf.Output) {
+// Returns Handle to a table.
+func MutableHashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...MutableHashTableV2Attr) (table_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"component_types": component_types}
+	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QueueDequeueUpToV2",
-		Input: []tf.Input{
-			handle, n,
-		},
+		Type: "MutableHashTableV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
-		scope.UpdateErr("QueueDequeueUpToV2", err)
-		return
+	return op.Output(0)
+}
+
+// DequantizeAttr is an optional argument to Dequantize.
+type DequantizeAttr func(optionalAttr)
+
+// DequantizeMode sets the optional mode attribute to value.
+// If not specified, defaults to "MIN_COMBINED"
+func DequantizeMode(value string) DequantizeAttr {
+	return func(m optionalAttr) {
+		m["mode"] = value
 	}
-	return components
 }
 
-// Computes the Cholesky decomposition of one or more square matrices.
+// Dequantize the 'input' tensor into a float Tensor.
 //
-// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices.
+// [min_range, max_range] are scalar floats that specify the range for
+// the 'input' data. The 'mode' attribute controls exactly which calculations are
+// used to convert the quantized values back to their float equivalents.
 //
-// The input has to be symmetric and positive definite. Only the lower-triangular
-// part of the input will be used for this operation. The upper-triangular part
-// will not be read.
+// In 'MIN_COMBINED' mode, each value of the tensor will undergo the following:
 //
-// The output is a tensor of the same shape as the input
-// containing the Cholesky decompositions for all input submatrices `[..., :, :]`.
+// ```
+// if T == qint8, in[i] += (range(T) + 1)/ 2.0
+// out[i] = min_range + (in[i]* (max_range - min_range) / range(T))
+// ```
+// where `range(T) = numeric_limits<T>::max() - numeric_limits<T>::min()`
 //
-// **Note**: The gradient computation on GPU is faster for large matrices but
-// not for large batch dimensions when the submatrices are small. In this
-// case it might be faster to use the CPU.
+// *MIN_COMBINED Mode Example*
 //
-// Arguments:
-//	input: Shape is `[..., M, M]`.
+// If the input comes from a QuantizedRelu6, the output type is
+// quint8 (range of 0-255) but the possible range of QuantizedRelu6 is
+// 0-6.  The min_range and max_range values are therefore 0.0 and 6.0.
+// Dequantize on quint8 will take each value, cast to float, and multiply
+// by 6 / 255.
+// Note that if quantizedtype is qint8, the operation will additionally add
+// 128 to each value prior to casting.
 //
-// Returns Shape is `[..., M, M]`.
-func Cholesky(scope *Scope, input tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Cholesky",
-		Input: []tf.Input{
-			input,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Writes contents to the file at input filename. Creates file and recursively
+// If the mode is 'MIN_FIRST', then this approach is used:
 //
-// creates directory if not existing.
+// ```c++
+// num_discrete_values = 1 << (# of bits in T)
+// range_adjust = num_discrete_values / (num_discrete_values - 1)
+// range = (range_max - range_min) * range_adjust
+// range_scale = range / num_discrete_values
+// const double offset_input = static_cast<double>(input) - lowest_quantized;
+// result = range_min + ((input - numeric_limits<T>::min()) * range_scale)
+// ```
 //
-// Arguments:
-//	filename: scalar. The name of the file to which we write the contents.
-//	contents: scalar. The content to be written to the output file.
+// *SCALED mode Example*
 //
-// Returns the created operation.
-func WriteFile(scope *Scope, filename tf.Output, contents tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "WriteFile",
-		Input: []tf.Input{
-			filename, contents,
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
-// AllAttr is an optional argument to All.
-type AllAttr func(optionalAttr)
-
-// AllKeepDims sets the optional keep_dims attribute to value.
+// `SCALED` mode matches the quantization approach used in
+// `QuantizeAndDequantize{V2|V3}`.
 //
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func AllKeepDims(value bool) AllAttr {
-	return func(m optionalAttr) {
-		m["keep_dims"] = value
-	}
-}
-
-// Computes the "logical and" of elements across dimensions of a tensor.
+// If the mode is `SCALED`, we do not use the full range of the output type,
+// choosing to elide the lowest possible value for symmetry (e.g., output range is
+// -127 to 127, not -128 to 127 for signed 8 bit quantization), so that 0.0 maps to
+// 0.
 //
-// Reduces `input` along the dimensions given in `axis`. Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `axis`. If `keep_dims` is true, the reduced dimensions are
-// retained with length 1.
+// We first find the range of values in our tensor. The
+// range we use is always centered on 0, so we find m such that
+// ```c++
+//   m = max(abs(input_min), abs(input_max))
+// ```
+//
+// Our input tensor range is then `[-m, m]`.
+//
+// Next, we choose our fixed-point quantization buckets, `[min_fixed, max_fixed]`.
+// If T is signed, this is
+// ```
+//   num_bits = sizeof(T) * 8
+//   [min_fixed, max_fixed] =
+//       [-((1 << (num_bits - 1)) - 1), (1 << (num_bits - 1)) - 1]
+// ```
+//
+// Otherwise, if T is unsigned, the fixed-point range is
+// ```
+//   [min_fixed, max_fixed] = [0, (1 << num_bits) - 1]
+// ```
+//
+// From this we compute our scaling factor, s:
+// ```c++
+//   s = (2 * m) / (max_fixed - min_fixed)
+// ```
+//
+// Now we can dequantize the elements of our tensor:
+// ```c++
+// result = input * s
+// ```
 //
 // Arguments:
-//	input: The tensor to reduce.
-//	axis: The dimensions to reduce. Must be in the range
-// `[-rank(input), rank(input))`.
 //
-// Returns The reduced tensor.
-func All(scope *Scope, input tf.Output, axis tf.Output, optional ...AllAttr) (output tf.Output) {
+//	min_range: The minimum scalar value possibly produced for the input.
+//	max_range: The maximum scalar value possibly produced for the input.
+func Dequantize(scope *Scope, input tf.Output, min_range tf.Output, max_range tf.Output, optional ...DequantizeAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -17639,9 +17944,9 @@ func All(scope *Scope, input tf.Output, axis tf.Output, optional ...AllAttr) (ou
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "All",
+		Type: "Dequantize",
 		Input: []tf.Input{
-			input, axis,
+			input, min_range, max_range,
 		},
 		Attrs: attrs,
 	}
@@ -17649,88 +17954,105 @@ func All(scope *Scope, input tf.Output, axis tf.Output, optional ...AllAttr) (ou
 	return op.Output(0)
 }
 
-// Computes the Eigen Decomposition of a batch of square self-adjoint matrices.
-//
-// DEPRECATED at GraphDef version 11: Use SelfAdjointEigV2 instead.
-//
-// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices, with the same constraints as the single matrix
-// SelfAdjointEig.
-//
-// The result is a [..., M+1, M] matrix with [..., 0,:] containing the
-// eigenvalues, and subsequent [...,1:, :] containing the eigenvectors.
+// Flips all bits elementwise.
 //
-// Arguments:
-//	input: Shape is `[..., M, M]`.
+// The result will have exactly those bits set that are not set in `x`. The
+// computation is performed on the underlying representation of x.
+func Invert(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Invert",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Deprecated. Disallowed in GraphDef version >= 2.
 //
-// Returns Shape is `[..., M+1, M]`.
-func SelfAdjointEig(scope *Scope, input tf.Output) (output tf.Output) {
+// DEPRECATED at GraphDef version 2: Use AdjustContrastv2 instead
+func AdjustContrast(scope *Scope, images tf.Output, contrast_factor tf.Output, min_value tf.Output, max_value tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SelfAdjointEig",
+		Type: "AdjustContrast",
 		Input: []tf.Input{
-			input,
+			images, contrast_factor, min_value, max_value,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes softplus gradients for a softplus operation.
+// Table initializer that takes two tensors for keys and values respectively.
 //
 // Arguments:
-//	gradients: The backpropagated gradients to the corresponding softplus operation.
-//	features: The features passed as input to the corresponding softplus operation.
+//	table_handle: Handle to a table which will be initialized.
+//	keys: Keys of type Tkey.
+//	values: Values of type Tval.
 //
-// Returns The gradients: `gradients / (1 + exp(-features))`.
-func SoftplusGrad(scope *Scope, gradients tf.Output, features tf.Output) (backprops tf.Output) {
+// Returns the created operation.
+func InitializeTableV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SoftplusGrad",
+		Type: "InitializeTableV2",
 		Input: []tf.Input{
-			gradients, features,
+			table_handle, keys, values,
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// SelfAdjointEigV2Attr is an optional argument to SelfAdjointEigV2.
-type SelfAdjointEigV2Attr func(optionalAttr)
+// PrintAttr is an optional argument to Print.
+type PrintAttr func(optionalAttr)
 
-// SelfAdjointEigV2ComputeV sets the optional compute_v attribute to value.
+// PrintMessage sets the optional message attribute to value.
 //
-// value: If `True` then eigenvectors will be computed and returned in `v`.
-// Otherwise, only the eigenvalues will be computed.
-// If not specified, defaults to true
-func SelfAdjointEigV2ComputeV(value bool) SelfAdjointEigV2Attr {
+// value: A string, prefix of the error message.
+// If not specified, defaults to ""
+func PrintMessage(value string) PrintAttr {
 	return func(m optionalAttr) {
-		m["compute_v"] = value
+		m["message"] = value
 	}
 }
 
-// Computes the eigen decomposition of one or more square self-adjoint matrices.
+// PrintFirstN sets the optional first_n attribute to value.
 //
-// Computes the eigenvalues and (optionally) eigenvectors of each inner matrix in
-// `input` such that `input[..., :, :] = v[..., :, :] * diag(e[..., :])`.
+// value: Only log `first_n` number of times. -1 disables logging.
+// If not specified, defaults to -1
+func PrintFirstN(value int64) PrintAttr {
+	return func(m optionalAttr) {
+		m["first_n"] = value
+	}
+}
+
+// PrintSummarize sets the optional summarize attribute to value.
 //
-// ```python
-// # a is a tensor.
-// # e is a tensor of eigenvalues.
-// # v is a tensor of eigenvectors.
-// e, v = self_adjoint_eig(a)
-// e = self_adjoint_eig(a, compute_v=False)
-// ```
+// value: Only print this many entries of each tensor.
+// If not specified, defaults to 3
+func PrintSummarize(value int64) PrintAttr {
+	return func(m optionalAttr) {
+		m["summarize"] = value
+	}
+}
+
+// Prints a list of tensors.
+//
+// Passes `input` through to `output` and prints `data` when evaluating.
 //
 // Arguments:
-//	input: `Tensor` input of shape `[N, N]`.
+//	input: The tensor passed to `output`
+//	data: A list of tensors to print out when op is evaluated.
 //
-// Returns Eigenvalues. Shape is `[N]`.Eigenvectors. Shape is `[N, N]`.
-func SelfAdjointEigV2(scope *Scope, input tf.Output, optional ...SelfAdjointEigV2Attr) (e tf.Output, v tf.Output) {
+// Returns The unmodified `input` tensor.
+func Print(scope *Scope, input tf.Output, data []tf.Output, optional ...PrintAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -17739,190 +18061,103 @@ func SelfAdjointEigV2(scope *Scope, input tf.Output, optional ...SelfAdjointEigV
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SelfAdjointEigV2",
+		Type: "Print",
 		Input: []tf.Input{
-			input,
+			input, tf.OutputList(data),
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Adjust the saturation of one or more images.
-//
-// `images` is a tensor of at least 3 dimensions.  The last dimension is
-// interpretted as channels, and must be three.
-//
-// The input image is considered in the RGB colorspace. Conceptually, the RGB
-// colors are first mapped into HSV. A scale is then applied all the saturation
-// values, and then remapped back to RGB colorspace.
+// Outputs a `Summary` protocol buffer with a tensor and per-plugin data.
 //
 // Arguments:
-//	images: Images to adjust.  At least 3-D.
-//	scale: A float scale to add to the saturation.
-//
-// Returns The hue-adjusted image or images.
-func AdjustSaturation(scope *Scope, images tf.Output, scale tf.Output) (output tf.Output) {
+//	tag: A string attached to this summary. Used for organization in TensorBoard.
+//	tensor: A tensor to serialize.
+//	serialized_summary_metadata: A serialized SummaryMetadata proto. Contains plugin
+// data.
+func TensorSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, serialized_summary_metadata tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "AdjustSaturation",
+		Type: "TensorSummaryV2",
 		Input: []tf.Input{
-			images, scale,
+			tag, tensor, serialized_summary_metadata,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Elementwise computes the bitwise OR of `x` and `y`.
+// Creates a dataset that asynchronously prefetches elements from `input_dataset`.
 //
-// The result will have those bits set, that are set in `x`, `y` or both. The
-// computation is performed on the underlying representations of `x` and `y`.
-func BitwiseOr(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Arguments:
+//
+//	buffer_size: The maximum number of elements to buffer in an iterator over
+// this dataset.
+//
+//
+func PrefetchDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "BitwiseOr",
+		Type: "PrefetchDataset",
 		Input: []tf.Input{
-			x, y,
+			input_dataset, buffer_size,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MatrixSolveLsAttr is an optional argument to MatrixSolveLs.
-type MatrixSolveLsAttr func(optionalAttr)
+// TensorSummaryAttr is an optional argument to TensorSummary.
+type TensorSummaryAttr func(optionalAttr)
 
-// MatrixSolveLsFast sets the optional fast attribute to value.
-// If not specified, defaults to true
-func MatrixSolveLsFast(value bool) MatrixSolveLsAttr {
-	return func(m optionalAttr) {
-		m["fast"] = value
-	}
-}
-
-// Solves one or more linear least-squares problems.
-//
-// `matrix` is a tensor of shape `[..., M, N]` whose inner-most 2 dimensions
-// form real or complex matrices of size `[M, N]`. `Rhs` is a tensor of the same
-// type as `matrix` and shape `[..., M, K]`.
-// The output is a tensor shape `[..., N, K]` where each output matrix solves
-// each of the equations
-// `matrix[..., :, :]` * `output[..., :, :]` = `rhs[..., :, :]`
-// in the least squares sense.
-//
-// We use the following notation for (complex) matrix and right-hand sides
-// in the batch:
-//
-// `matrix`=\\(A \in \mathbb{C}^{m \times n}\\),
-// `rhs`=\\(B  \in \mathbb{C}^{m \times k}\\),
-// `output`=\\(X  \in \mathbb{C}^{n \times k}\\),
-// `l2_regularizer`=\\(\lambda \in \mathbb{R}\\).
-//
-// If `fast` is `True`, then the solution is computed by solving the normal
-// equations using Cholesky decomposition. Specifically, if \\(m \ge n\\) then
-// \\(X = (A^H A + \lambda I)^{-1} A^H B\\), which solves the least-squares
-// problem \\(X = \mathrm{argmin}_{Z \in \Re^{n \times k} } ||A Z - B||_F^2 +
-// \lambda ||Z||_F^2\\). If \\(m \lt n\\) then `output` is computed as
-// \\(X = A^H (A A^H + \lambda I)^{-1} B\\), which (for \\(\lambda = 0\\)) is the
-// minimum-norm solution to the under-determined linear system, i.e.
-// \\(X = \mathrm{argmin}_{Z \in \mathbb{C}^{n \times k} } ||Z||_F^2 \\),
-// subject to \\(A Z = B\\). Notice that the fast path is only numerically stable
-// when \\(A\\) is numerically full rank and has a condition number
-// \\(\mathrm{cond}(A) \lt \frac{1}{\sqrt{\epsilon_{mach} } }\\) or\\(\lambda\\) is
-// sufficiently large.
-//
-// If `fast` is `False` an algorithm based on the numerically robust complete
-// orthogonal decomposition is used. This computes the minimum-norm
-// least-squares solution, even when \\(A\\) is rank deficient. This path is
-// typically 6-7 times slower than the fast path. If `fast` is `False` then
-// `l2_regularizer` is ignored.
-//
-// Arguments:
-//	matrix: Shape is `[..., M, N]`.
-//	rhs: Shape is `[..., M, K]`.
-//	l2_regularizer: Scalar tensor.
-//
-// @compatibility(numpy)
-// Equivalent to np.linalg.lstsq
-// @end_compatibility
+// TensorSummaryDescription sets the optional description attribute to value.
 //
-// Returns Shape is `[..., N, K]`.
-func MatrixSolveLs(scope *Scope, matrix tf.Output, rhs tf.Output, l2_regularizer tf.Output, optional ...MatrixSolveLsAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "MatrixSolveLs",
-		Input: []tf.Input{
-			matrix, rhs, l2_regularizer,
-		},
-		Attrs: attrs,
+// value: A json-encoded SummaryDescription proto.
+// If not specified, defaults to ""
+func TensorSummaryDescription(value string) TensorSummaryAttr {
+	return func(m optionalAttr) {
+		m["description"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// SvdAttr is an optional argument to Svd.
-type SvdAttr func(optionalAttr)
-
-// SvdComputeUv sets the optional compute_uv attribute to value.
+// TensorSummaryLabels sets the optional labels attribute to value.
 //
-// value: If true, left and right singular vectors will be
-// computed and returned in `u` and `v`, respectively.
-// If false, `u` and `v` are not set and should never referenced.
-// If not specified, defaults to true
-func SvdComputeUv(value bool) SvdAttr {
+// value: An unused list of strings.
+// If not specified, defaults to <>
+func TensorSummaryLabels(value []string) TensorSummaryAttr {
 	return func(m optionalAttr) {
-		m["compute_uv"] = value
+		m["labels"] = value
 	}
 }
 
-// SvdFullMatrices sets the optional full_matrices attribute to value.
+// TensorSummaryDisplayName sets the optional display_name attribute to value.
 //
-// value: If true, compute full-sized `u` and `v`. If false
-// (the default), compute only the leading `P` singular vectors.
-// Ignored if `compute_uv` is `False`.
-// If not specified, defaults to false
-func SvdFullMatrices(value bool) SvdAttr {
+// value: An unused string.
+// If not specified, defaults to ""
+func TensorSummaryDisplayName(value string) TensorSummaryAttr {
 	return func(m optionalAttr) {
-		m["full_matrices"] = value
+		m["display_name"] = value
 	}
 }
 
-// Computes the singular value decompositions of one or more matrices.
-//
-// Computes the SVD of each inner matrix in `input` such that
-// `input[..., :, :] = u[..., :, :] * diag(s[..., :, :]) * transpose(v[..., :, :])`
+// Outputs a `Summary` protocol buffer with a tensor.
 //
-// ```python
-// # a is a tensor containing a batch of matrices.
-// # s is a tensor of singular values for each matrix.
-// # u is the tensor containing of left singular vectors for each matrix.
-// # v is the tensor containing of right singular vectors for each matrix.
-// s, u, v = svd(a)
-// s, _, _ = svd(a, compute_uv=False)
-// ```
+// This op is being phased out in favor of TensorSummaryV2, which lets callers pass
+// a tag as well as a serialized SummaryMetadata proto string that contains
+// plugin-specific data. We will keep this op to maintain backwards compatibility.
 //
 // Arguments:
-//	input: A tensor of shape `[..., M, N]` whose inner-most 2 dimensions
-// form matrices of size `[M, N]`. Let `P` be the minimum of `M` and `N`.
-//
-// Returns Singular values. Shape is `[..., P]`.Left singular vectors. If `full_matrices` is `False` then shape is
-// `[..., M, P]`; if `full_matrices` is `True` then shape is
-// `[..., M, M]`. Undefined if `compute_uv` is `False`.Left singular vectors. If `full_matrices` is `False` then shape is
-// `[..., N, P]`. If `full_matrices` is `True` then shape is `[..., N, N]`.
-// Undefined if `compute_uv` is false.
-func Svd(scope *Scope, input tf.Output, optional ...SvdAttr) (s tf.Output, u tf.Output, v tf.Output) {
+//	tensor: A tensor to serialize.
+func TensorSummary(scope *Scope, tensor tf.Output, optional ...TensorSummaryAttr) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -17931,175 +18166,173 @@ func Svd(scope *Scope, input tf.Output, optional ...SvdAttr) (s tf.Output, u tf.
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Svd",
+		Type: "TensorSummary",
 		Input: []tf.Input{
-			input,
+			tensor,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
-}
-
-// QueueEnqueueManyV2Attr is an optional argument to QueueEnqueueManyV2.
-type QueueEnqueueManyV2Attr func(optionalAttr)
-
-// QueueEnqueueManyV2TimeoutMs sets the optional timeout_ms attribute to value.
-//
-// value: If the queue is too full, this operation will block for up
-// to timeout_ms milliseconds.
-// Note: This option is not supported yet.
-// If not specified, defaults to -1
-func QueueEnqueueManyV2TimeoutMs(value int64) QueueEnqueueManyV2Attr {
-	return func(m optionalAttr) {
-		m["timeout_ms"] = value
-	}
+	return op.Output(0)
 }
 
-// Enqueues zero or more tuples of one or more tensors in the given queue.
-//
-// This operation slices each component tensor along the 0th dimension to
-// make multiple queue elements. All of the tuple components must have the
-// same size in the 0th dimension.
-//
-// The components input has k elements, which correspond to the components of
-// tuples stored in the given queue.
-//
-// N.B. If the queue is full, this operation will block until the given
-// elements have been enqueued (or 'timeout_ms' elapses, if specified).
-//
-// Arguments:
-//	handle: The handle to a queue.
-//	components: One or more tensors from which the enqueued tensors should
-// be taken.
+// Computes the gradient for the tanh of `x` wrt its input.
 //
-// Returns the created operation.
-func QueueEnqueueManyV2(scope *Scope, handle tf.Output, components []tf.Output, optional ...QueueEnqueueManyV2Attr) (o *tf.Operation) {
+// Specifically, `grad = dy * (1 - y*y)`, where `y = tanh(x)`, and `dy`
+// is the corresponding input gradient.
+func TanhGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "QueueEnqueueManyV2",
+		Type: "TanhGrad",
 		Input: []tf.Input{
-			handle, tf.OutputList(components),
+			y, dy,
 		},
-		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Computes the product along segments of a tensor.
-//
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// Computes a tensor such that
-// \\(output_i = \prod_j data_j\\) where the product is over `j` such
-// that `segment_ids[j] == i`.
-//
-// If the product is empty for a given segment ID `i`, `output[i] = 1`.
+// Outputs a `Summary` protocol buffer with scalar values.
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentProd.png" alt>
-// </div>
+// The input `tags` and `values` must have the same shape.  The generated summary
+// has a summary value for each tag-value pair in `tags` and `values`.
 //
 // Arguments:
+//	tags: Tags for the summary.
+//	values: Same shape as `tags`.  Values for the summary.
 //
-//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
-// first dimension.  Values should be sorted and can be repeated.
-//
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SegmentProd(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
+// Returns Scalar.  Serialized `Summary` protocol buffer.
+func ScalarSummary(scope *Scope, tags tf.Output, values tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SegmentProd",
+		Type: "ScalarSummary",
 		Input: []tf.Input{
-			data, segment_ids,
+			tags, values,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Converts one or more images from RGB to HSV.
+// Outputs a `Summary` protocol buffer with a histogram.
 //
-// Outputs a tensor of the same shape as the `images` tensor, containing the HSV
-// value of the pixels. The output is only well defined if the value in `images`
-// are in `[0,1]`.
+// The generated
+// [`Summary`](https://www.tensorflow.org/code/tensorflow/core/framework/summary.proto)
+// has one summary value containing a histogram for `values`.
 //
-// `output[..., 0]` contains hue, `output[..., 1]` contains saturation, and
-// `output[..., 2]` contains value. All HSV values are in `[0,1]`. A hue of 0
-// corresponds to pure red, hue 1/3 is pure green, and 2/3 is pure blue.
+// This op reports an `InvalidArgument` error if any value is not finite.
 //
 // Arguments:
-//	images: 1-D or higher rank. RGB data to convert. Last dimension must be size 3.
+//	tag: Scalar.  Tag to use for the `Summary.Value`.
+//	values: Any shape. Values to use to build the histogram.
 //
-// Returns `images` converted to HSV.
-func RGBToHSV(scope *Scope, images tf.Output) (output tf.Output) {
+// Returns Scalar. Serialized `Summary` protocol buffer.
+func HistogramSummary(scope *Scope, tag tf.Output, values tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "RGBToHSV",
+		Type: "HistogramSummary",
 		Input: []tf.Input{
-			images,
+			tag, values,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Does nothing. Only useful as a placeholder for control edges.
+// Computes the number of elements in the given queue.
 //
-// Returns the created operation.
-func NoOp(scope *Scope) (o *tf.Operation) {
+// Arguments:
+//	handle: The handle to a queue.
+//
+// Returns The number of elements in the given queue.
+func QueueSizeV2(scope *Scope, handle tf.Output) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "NoOp",
+		Type: "QueueSizeV2",
+		Input: []tf.Input{
+			handle,
+		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MergeV2CheckpointsAttr is an optional argument to MergeV2Checkpoints.
-type MergeV2CheckpointsAttr func(optionalAttr)
+// ImageSummaryAttr is an optional argument to ImageSummary.
+type ImageSummaryAttr func(optionalAttr)
 
-// MergeV2CheckpointsDeleteOldDirs sets the optional delete_old_dirs attribute to value.
+// ImageSummaryMaxImages sets the optional max_images attribute to value.
 //
-// value: see above.
-// If not specified, defaults to true
-func MergeV2CheckpointsDeleteOldDirs(value bool) MergeV2CheckpointsAttr {
+// value: Max number of batch elements to generate images for.
+// If not specified, defaults to 3
+//
+// REQUIRES: value >= 1
+func ImageSummaryMaxImages(value int64) ImageSummaryAttr {
 	return func(m optionalAttr) {
-		m["delete_old_dirs"] = value
+		m["max_images"] = value
 	}
 }
 
-// V2 format specific: merges the metadata files of sharded checkpoints.  The
+// ImageSummaryBadColor sets the optional bad_color attribute to value.
 //
-// result is one logical checkpoint, with one physical metadata file and renamed
-// data files.
+// value: Color to use for pixels with non-finite values.
+// If not specified, defaults to <dtype:DT_UINT8 tensor_shape:<dim:<size:4 > > int_val:255 int_val:0 int_val:0 int_val:255 >
+func ImageSummaryBadColor(value tf.Tensor) ImageSummaryAttr {
+	return func(m optionalAttr) {
+		m["bad_color"] = value
+	}
+}
+
+// Outputs a `Summary` protocol buffer with images.
 //
-// Intended for "grouping" multiple checkpoints in a sharded checkpoint setup.
+// The summary has up to `max_images` summary values containing images. The
+// images are built from `tensor` which must be 4-D with shape `[batch_size,
+// height, width, channels]` and where `channels` can be:
 //
-// If delete_old_dirs is true, attempts to delete recursively the dirname of each
-// path in the input checkpoint_prefixes.  This is useful when those paths are non
-// user-facing temporary locations.
+// *  1: `tensor` is interpreted as Grayscale.
+// *  3: `tensor` is interpreted as RGB.
+// *  4: `tensor` is interpreted as RGBA.
+//
+// The images have the same number of channels as the input tensor. For float
+// input, the values are normalized one image at a time to fit in the range
+// `[0, 255]`.  `uint8` values are unchanged.  The op uses two different
+// normalization algorithms:
+//
+// *  If the input values are all positive, they are rescaled so the largest one
+//    is 255.
+//
+// *  If any input value is negative, the values are shifted so input value 0.0
+//    is at 127.  They are then rescaled so that either the smallest value is 0,
+//    or the largest one is 255.
+//
+// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
+// build the `tag` of the summary values:
+//
+// *  If `max_images` is 1, the summary value tag is '*tag*/image'.
+// *  If `max_images` is greater than 1, the summary value tags are
+//    generated sequentially as '*tag*/image/0', '*tag*/image/1', etc.
+//
+// The `bad_color` argument is the color to use in the generated images for
+// non-finite input values.  It is a `uint8` 1-D tensor of length `channels`.
+// Each element must be in the range `[0, 255]` (It represents the value of a
+// pixel in the output image).  Non-finite values in the input tensor are
+// replaced by this tensor in the output image.  The default value is the color
+// red.
 //
 // Arguments:
-//	checkpoint_prefixes: prefixes of V2 checkpoints to merge.
-//	destination_prefix: scalar.  The desired final prefix.  Allowed to be the same
-// as one of the checkpoint_prefixes.
+//	tag: Scalar. Used to build the `tag` attribute of the summary values.
+//	tensor: 4-D of shape `[batch_size, height, width, channels]` where
+// `channels` is 1, 3, or 4.
 //
-// Returns the created operation.
-func MergeV2Checkpoints(scope *Scope, checkpoint_prefixes tf.Output, destination_prefix tf.Output, optional ...MergeV2CheckpointsAttr) (o *tf.Operation) {
+// Returns Scalar. Serialized `Summary` protocol buffer.
+func ImageSummary(scope *Scope, tag tf.Output, tensor tf.Output, optional ...ImageSummaryAttr) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -18108,182 +18341,225 @@ func MergeV2Checkpoints(scope *Scope, checkpoint_prefixes tf.Output, destination
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MergeV2Checkpoints",
+		Type: "ImageSummary",
 		Input: []tf.Input{
-			checkpoint_prefixes, destination_prefix,
+			tag, tensor,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Saves input tensors slices to disk.
-//
-// This is like `Save` except that tensors can be listed in the saved file as being
-// a slice of a larger tensor.  `shapes_and_slices` specifies the shape of the
-// larger tensor and the slice that this tensor covers. `shapes_and_slices` must
-// have as many elements as `tensor_names`.
+// AudioSummaryV2Attr is an optional argument to AudioSummaryV2.
+type AudioSummaryV2Attr func(optionalAttr)
+
+// AudioSummaryV2MaxOutputs sets the optional max_outputs attribute to value.
 //
-// Elements of the `shapes_and_slices` input must either be:
+// value: Max number of batch elements to generate audio for.
+// If not specified, defaults to 3
 //
-// *  The empty string, in which case the corresponding tensor is
-//    saved normally.
-// *  A string of the form `dim0 dim1 ... dimN-1 slice-spec` where the
-//    `dimI` are the dimensions of the larger tensor and `slice-spec`
-//    specifies what part is covered by the tensor to save.
+// REQUIRES: value >= 1
+func AudioSummaryV2MaxOutputs(value int64) AudioSummaryV2Attr {
+	return func(m optionalAttr) {
+		m["max_outputs"] = value
+	}
+}
+
+// Outputs a `Summary` protocol buffer with audio.
 //
-// `slice-spec` itself is a `:`-separated list: `slice0:slice1:...:sliceN-1`
-// where each `sliceI` is either:
+// The summary has up to `max_outputs` summary values containing audio. The
+// audio is built from `tensor` which must be 3-D with shape `[batch_size,
+// frames, channels]` or 2-D with shape `[batch_size, frames]`. The values are
+// assumed to be in the range of `[-1.0, 1.0]` with a sample rate of `sample_rate`.
 //
-// *  The string `-` meaning that the slice covers all indices of this dimension
-// *  `start,length` where `start` and `length` are integers.  In that
-//    case the slice covers `length` indices starting at `start`.
+// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
+// build the `tag` of the summary values:
 //
-// See also `Save`.
+// *  If `max_outputs` is 1, the summary value tag is '*tag*/audio'.
+// *  If `max_outputs` is greater than 1, the summary value tags are
+//    generated sequentially as '*tag*/audio/0', '*tag*/audio/1', etc.
 //
 // Arguments:
-//	filename: Must have a single element. The name of the file to which we write the
-// tensor.
-//	tensor_names: Shape `[N]`. The names of the tensors to be saved.
-//	shapes_and_slices: Shape `[N]`.  The shapes and slice specifications to use when
-// saving the tensors.
-//	data: `N` tensors to save.
+//	tag: Scalar. Used to build the `tag` attribute of the summary values.
+//	tensor: 2-D of shape `[batch_size, frames]`.
+//	sample_rate: The sample rate of the signal in hertz.
 //
-// Returns the created operation.
-func SaveSlices(scope *Scope, filename tf.Output, tensor_names tf.Output, shapes_and_slices tf.Output, data []tf.Output) (o *tf.Operation) {
+// Returns Scalar. Serialized `Summary` protocol buffer.
+func AudioSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, sample_rate tf.Output, optional ...AudioSummaryV2Attr) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SaveSlices",
+		Type: "AudioSummaryV2",
 		Input: []tf.Input{
-			filename, tensor_names, shapes_and_slices, tf.OutputList(data),
+			tag, tensor, sample_rate,
 		},
+		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// DenseToDenseSetOperationAttr is an optional argument to DenseToDenseSetOperation.
-type DenseToDenseSetOperationAttr func(optionalAttr)
+// AvgPoolAttr is an optional argument to AvgPool.
+type AvgPoolAttr func(optionalAttr)
 
-// DenseToDenseSetOperationValidateIndices sets the optional validate_indices attribute to value.
-// If not specified, defaults to true
-func DenseToDenseSetOperationValidateIndices(value bool) DenseToDenseSetOperationAttr {
+// AvgPoolDataFormat sets the optional data_format attribute to value.
+//
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func AvgPoolDataFormat(value string) AvgPoolAttr {
 	return func(m optionalAttr) {
-		m["validate_indices"] = value
+		m["data_format"] = value
 	}
 }
 
-// Applies set operation along last dimension of 2 `Tensor` inputs.
-//
-// See SetOperationOp::SetOperationFromContext for values of `set_operation`.
+// Performs average pooling on the input.
 //
-// Output `result` is a `SparseTensor` represented by `result_indices`,
-// `result_values`, and `result_shape`. For `set1` and `set2` ranked `n`, this
-// has rank `n` and the same 1st `n-1` dimensions as `set1` and `set2`. The `nth`
-// dimension contains the result of `set_operation` applied to the corresponding
-// `[0...n-1]` dimension of `set`.
+// Each entry in `output` is the mean of the corresponding size `ksize`
+// window in `value`.
 //
 // Arguments:
-//	set1: `Tensor` with rank `n`. 1st `n-1` dimensions must be the same as `set2`.
-// Dimension `n` contains values in a set, duplicates are allowed but ignored.
-//	set2: `Tensor` with rank `n`. 1st `n-1` dimensions must be the same as `set1`.
-// Dimension `n` contains values in a set, duplicates are allowed but ignored.
-//
+//	value: 4-D with shape `[batch, height, width, channels]`.
+//	ksize: The size of the sliding window for each dimension of `value`.
+//	strides: The stride of the sliding window for each dimension of `value`.
+//	padding: The type of padding algorithm to use.
 //
-// Returns 2D indices of a `SparseTensor`.1D values of a `SparseTensor`.1D `Tensor` shape of a `SparseTensor`. `result_shape[0...n-1]` is
-// the same as the 1st `n-1` dimensions of `set1` and `set2`, `result_shape[n]`
-// is the max result set size across all `0...n-1` dimensions.
-func DenseToDenseSetOperation(scope *Scope, set1 tf.Output, set2 tf.Output, set_operation string, optional ...DenseToDenseSetOperationAttr) (result_indices tf.Output, result_values tf.Output, result_shape tf.Output) {
+// Returns The average pooled output tensor.
+func AvgPool(scope *Scope, value tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPoolAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"set_operation": set_operation}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DenseToDenseSetOperation",
+		Type: "AvgPool",
 		Input: []tf.Input{
-			set1, set2,
+			value,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// Generate a sharded filename. The filename is printf formatted as
+// Merges summaries.
 //
-//    %s-%05d-of-%05d, basename, shard, num_shards.
-func ShardedFilename(scope *Scope, basename tf.Output, shard tf.Output, num_shards tf.Output) (filename tf.Output) {
+// This op creates a
+// [`Summary`](https://www.tensorflow.org/code/tensorflow/core/framework/summary.proto)
+// protocol buffer that contains the union of all the values in the input
+// summaries.
+//
+// When the Op is run, it reports an `InvalidArgument` error if multiple values
+// in the summaries to merge use the same tag.
+//
+// Arguments:
+//	inputs: Can be of any shape.  Each must contain serialized `Summary` protocol
+// buffers.
+//
+// Returns Scalar. Serialized `Summary` protocol buffer.
+func MergeSummary(scope *Scope, inputs []tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ShardedFilename",
+		Type: "MergeSummary",
 		Input: []tf.Input{
-			basename, shard, num_shards,
+			tf.OutputList(inputs),
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Generate a glob pattern matching all sharded file names.
-func ShardedFilespec(scope *Scope, basename tf.Output, num_shards tf.Output) (filename tf.Output) {
+// Computes the gradient of morphological 2-D dilation with respect to the filter.
+//
+// Arguments:
+//	input: 4-D with shape `[batch, in_height, in_width, depth]`.
+//	filter: 3-D with shape `[filter_height, filter_width, depth]`.
+//	out_backprop: 4-D with shape `[batch, out_height, out_width, depth]`.
+//	strides: 1-D of length 4. The stride of the sliding window for each dimension of
+// the input tensor. Must be: `[1, stride_height, stride_width, 1]`.
+//	rates: 1-D of length 4. The input stride for atrous morphological dilation.
+// Must be: `[1, rate_height, rate_width, 1]`.
+//	padding: The type of padding algorithm to use.
+//
+// Returns 3-D with shape `[filter_height, filter_width, depth]`.
+func Dilation2DBackpropFilter(scope *Scope, input tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, rates []int64, padding string) (filter_backprop tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"strides": strides, "rates": rates, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "ShardedFilespec",
+		Type: "Dilation2DBackpropFilter",
 		Input: []tf.Input{
-			basename, num_shards,
+			input, filter, out_backprop,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TextLineReaderV2Attr is an optional argument to TextLineReaderV2.
-type TextLineReaderV2Attr func(optionalAttr)
+// AddSparseToTensorsMapAttr is an optional argument to AddSparseToTensorsMap.
+type AddSparseToTensorsMapAttr func(optionalAttr)
 
-// TextLineReaderV2SkipHeaderLines sets the optional skip_header_lines attribute to value.
+// AddSparseToTensorsMapContainer sets the optional container attribute to value.
 //
-// value: Number of lines to skip from the beginning of every file.
-// If not specified, defaults to 0
-func TextLineReaderV2SkipHeaderLines(value int64) TextLineReaderV2Attr {
+// value: The container name for the `SparseTensorsMap` created by this op.
+// If not specified, defaults to ""
+func AddSparseToTensorsMapContainer(value string) AddSparseToTensorsMapAttr {
 	return func(m optionalAttr) {
-		m["skip_header_lines"] = value
+		m["container"] = value
 	}
 }
 
-// TextLineReaderV2Container sets the optional container attribute to value.
-//
-// value: If non-empty, this reader is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func TextLineReaderV2Container(value string) TextLineReaderV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// TextLineReaderV2SharedName sets the optional shared_name attribute to value.
+// AddSparseToTensorsMapSharedName sets the optional shared_name attribute to value.
 //
-// value: If non-empty, this reader is named in the given bucket
-// with this shared_name. Otherwise, the node name is used instead.
+// value: The shared name for the `SparseTensorsMap` created by this op.
+// If blank, the new Operation's unique name is used.
 // If not specified, defaults to ""
-func TextLineReaderV2SharedName(value string) TextLineReaderV2Attr {
+func AddSparseToTensorsMapSharedName(value string) AddSparseToTensorsMapAttr {
 	return func(m optionalAttr) {
 		m["shared_name"] = value
 	}
 }
 
-// A Reader that outputs the lines of a file delimited by '\n'.
+// Add a `SparseTensor` to a `SparseTensorsMap` and return its handle.
 //
-// Returns The handle to reference the Reader.
-func TextLineReaderV2(scope *Scope, optional ...TextLineReaderV2Attr) (reader_handle tf.Output) {
+// A `SparseTensor` is represented by three tensors: `sparse_indices`,
+// `sparse_values`, and `sparse_shape`.
+//
+// This operator takes the given `SparseTensor` and adds it to a container
+// object (a `SparseTensorsMap`).  A unique key within this container is generated
+// in the form of an `int64`, and this is the value that is returned.
+//
+// The `SparseTensor` can then be read out as part of a minibatch by passing
+// the key as a vector element to `TakeManySparseFromTensorsMap`.  To ensure
+// the correct `SparseTensorsMap` is accessed, ensure that the same
+// `container` and `shared_name` are passed to that Op.  If no `shared_name`
+// is provided here, instead use the *name* of the Operation created by calling
+// `AddSparseToTensorsMap` as the `shared_name` passed to
+// `TakeManySparseFromTensorsMap`.  Ensure the Operations are colocated.
+//
+// Arguments:
+//	sparse_indices: 2-D.  The `indices` of the `SparseTensor`.
+//	sparse_values: 1-D.  The `values` of the `SparseTensor`.
+//	sparse_shape: 1-D.  The `shape` of the `SparseTensor`.
+//
+// Returns 0-D.  The handle of the `SparseTensor` now stored in the
+// `SparseTensorsMap`.
+func AddSparseToTensorsMap(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...AddSparseToTensorsMapAttr) (sparse_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -18292,404 +18568,303 @@ func TextLineReaderV2(scope *Scope, optional ...TextLineReaderV2Attr) (reader_ha
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TextLineReaderV2",
-
+		Type: "AddSparseToTensorsMap",
+		Input: []tf.Input{
+			sparse_indices, sparse_values, sparse_shape,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// LoadAndRemapMatrixAttr is an optional argument to LoadAndRemapMatrix.
-type LoadAndRemapMatrixAttr func(optionalAttr)
-
-// LoadAndRemapMatrixMaxRowsInMemory sets the optional max_rows_in_memory attribute to value.
-//
-// value: The maximum number of rows to load from the checkpoint at
-// once. If less than or equal to 0, the entire matrix will be loaded into
-// memory. Setting this arg trades increased disk reads for lower memory usage.
-// If not specified, defaults to -1
-func LoadAndRemapMatrixMaxRowsInMemory(value int64) LoadAndRemapMatrixAttr {
-	return func(m optionalAttr) {
-		m["max_rows_in_memory"] = value
-	}
-}
-
-// Loads a 2-D (matrix) `Tensor` with name `old_tensor_name` from the checkpoint
-//
-// at `ckpt_path` and potentially reorders its rows and columns using the
-// specified remappings.
-//
-// Most users should use one of the wrapper initializers (such as
-// `tf.contrib.framework.load_and_remap_matrix_initializer`) instead of this
-// function directly.
-//
-// The remappings are 1-D tensors with the following properties:
-//
-// * `row_remapping` must have exactly `num_rows` entries. Row `i` of the output
-//   matrix will be initialized from the row corresponding to index
-//   `row_remapping[i]` in the old `Tensor` from the checkpoint.
-// * `col_remapping` must have either 0 entries (indicating that no column
-//   reordering is needed) or `num_cols` entries. If specified, column `j` of the
-//   output matrix will be initialized from the column corresponding to index
-//   `col_remapping[j]` in the old `Tensor` from the checkpoint.
-// * A value of -1 in either of the remappings signifies a "missing" entry. In that
-//   case, values from the `initializing_values` tensor will be used to fill that
-//   missing row or column. If `row_remapping` has `r` missing entries and
-//   `col_remapping` has `c` missing entries, then the following condition must be
-//   true:
-//
-// `(r * num_cols) + (c * num_rows) - (r * c) == len(initializing_values)`
+// Computes the matrix exponential of one or more square matrices:
 //
-// The remapping tensors can be generated using the GenerateVocabRemapping op.
+// exp(A) = \sum_{n=0}^\infty A^n/n!
 //
-// As an example, with row_remapping = [1, 0, -1], col_remapping = [0, 2, -1],
-// initializing_values = [0.5, -0.5, 0.25, -0.25, 42], and w(i, j) representing
-// the value from row i, column j of the old tensor in the checkpoint, the output
-// matrix will look like the following:
+// The exponential is computed using a combination of the scaling and squaring
+// method and the Pade approximation. Details can be found in:
+// Nicholas J. Higham, "The scaling and squaring method for the matrix exponential
+// revisited," SIAM J. Matrix Anal. Applic., 26:1179-1193, 2005.
 //
-// [[w(1, 0),  w(1, 2),  0.5],
-//  [w(0, 0),  w(0, 2), -0.5],
-//  [0.25,    -0.25,      42]]
+// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices. The output is a tensor of the same shape as the input
+// containing the exponential for all input submatrices `[..., :, :]`.
 //
 // Arguments:
-//	ckpt_path: Path to the TensorFlow checkpoint (version 2, `TensorBundle`) from
-// which the old matrix `Tensor` will be loaded.
-//	old_tensor_name: Name of the 2-D `Tensor` to load from checkpoint.
-//	row_remapping: An int `Tensor` of row remappings (generally created by
-// `generate_vocab_remapping`).  Even if no row remapping is needed, this must
-// still be an index-valued Tensor (e.g. [0, 1, 2, ...]), or a shifted
-// index-valued `Tensor` (e.g. [8, 9, 10, ...], for partitioned `Variables`).
-//	col_remapping: An int `Tensor` of column remappings (generally created by
-// `generate_vocab_remapping`).  May be a size-0 `Tensor` if only row remapping
-// is to be done (e.g. column ordering is the same).
-//	initializing_values: A float `Tensor` containing  values to fill in for cells
-// in the output matrix that are not loaded from the checkpoint. Length must be
-// exactly the same as the number of missing / new cells.
-//	num_rows: Number of rows (length of the 1st dimension) in the output matrix.
-//	num_cols: Number of columns (length of the 2nd dimension) in the output matrix.
+//	input: Shape is `[..., M, M]`.
 //
-// Returns Output matrix containing existing values loaded from the
-// checkpoint, and with any missing values filled in from initializing_values.
-func LoadAndRemapMatrix(scope *Scope, ckpt_path tf.Output, old_tensor_name tf.Output, row_remapping tf.Output, col_remapping tf.Output, initializing_values tf.Output, num_rows int64, num_cols int64, optional ...LoadAndRemapMatrixAttr) (output_matrix tf.Output) {
+// Returns Shape is `[..., M, M]`.
+//
+// @compatibility(scipy)
+// Equivalent to scipy.linalg.expm
+// @end_compatibility
+func MatrixExponential(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_rows": num_rows, "num_cols": num_cols}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "LoadAndRemapMatrix",
+		Type: "MatrixExponential",
 		Input: []tf.Input{
-			ckpt_path, old_tensor_name, row_remapping, col_remapping, initializing_values,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TFRecordReaderV2Attr is an optional argument to TFRecordReaderV2.
-type TFRecordReaderV2Attr func(optionalAttr)
+// QueueDequeueUpToV2Attr is an optional argument to QueueDequeueUpToV2.
+type QueueDequeueUpToV2Attr func(optionalAttr)
 
-// TFRecordReaderV2Container sets the optional container attribute to value.
+// QueueDequeueUpToV2TimeoutMs sets the optional timeout_ms attribute to value.
 //
-// value: If non-empty, this reader is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func TFRecordReaderV2Container(value string) TFRecordReaderV2Attr {
+// value: If the queue has fewer than n elements, this operation
+// will block for up to timeout_ms milliseconds.
+// Note: This option is not supported yet.
+// If not specified, defaults to -1
+func QueueDequeueUpToV2TimeoutMs(value int64) QueueDequeueUpToV2Attr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["timeout_ms"] = value
 	}
 }
 
-// TFRecordReaderV2SharedName sets the optional shared_name attribute to value.
+// Dequeues `n` tuples of one or more tensors from the given queue.
 //
-// value: If non-empty, this reader is named in the given bucket
-// with this shared_name. Otherwise, the node name is used instead.
-// If not specified, defaults to ""
-func TFRecordReaderV2SharedName(value string) TFRecordReaderV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// TFRecordReaderV2CompressionType sets the optional compression_type attribute to value.
-// If not specified, defaults to ""
-func TFRecordReaderV2CompressionType(value string) TFRecordReaderV2Attr {
-	return func(m optionalAttr) {
-		m["compression_type"] = value
-	}
-}
-
-// A Reader that outputs the records from a TensorFlow Records file.
+// This operation is not supported by all queues.  If a queue does not support
+// DequeueUpTo, then an Unimplemented error is returned.
 //
-// Returns The handle to reference the Reader.
-func TFRecordReaderV2(scope *Scope, optional ...TFRecordReaderV2Attr) (reader_handle tf.Output) {
+// If the queue is closed and there are more than 0 but fewer than `n`
+// elements remaining, then instead of returning an OutOfRange error like
+// QueueDequeueMany, fewer than `n` elements are returned immediately.  If
+// the queue is closed and there are 0 elements left in the queue, then
+// an OutOfRange error is returned just like in QueueDequeueMany.
+// Otherwise the behavior is identical to QueueDequeueMany:
+//
+// This operation concatenates queue-element component tensors along the
+// 0th dimension to make a single component tensor.  All of the components
+// in the dequeued tuple will have size n in the 0th dimension.
+//
+// This operation has `k` outputs, where `k` is the number of components in
+// the tuples stored in the given queue, and output `i` is the ith
+// component of the dequeued tuple.
+//
+// Arguments:
+//	handle: The handle to a queue.
+//	n: The number of tuples to dequeue.
+//	component_types: The type of each component in a tuple.
+//
+// Returns One or more tensors that were dequeued as a tuple.
+func QueueDequeueUpToV2(scope *Scope, handle tf.Output, n tf.Output, component_types []tf.DataType, optional ...QueueDequeueUpToV2Attr) (components []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"component_types": component_types}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TFRecordReaderV2",
-
+		Type: "QueueDequeueUpToV2",
+		Input: []tf.Input{
+			handle, n,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// QuantizeAndDequantizeV3Attr is an optional argument to QuantizeAndDequantizeV3.
-type QuantizeAndDequantizeV3Attr func(optionalAttr)
-
-// QuantizeAndDequantizeV3SignedInput sets the optional signed_input attribute to value.
-// If not specified, defaults to true
-func QuantizeAndDequantizeV3SignedInput(value bool) QuantizeAndDequantizeV3Attr {
-	return func(m optionalAttr) {
-		m["signed_input"] = value
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// QuantizeAndDequantizeV3RangeGiven sets the optional range_given attribute to value.
-// If not specified, defaults to true
-func QuantizeAndDequantizeV3RangeGiven(value bool) QuantizeAndDequantizeV3Attr {
-	return func(m optionalAttr) {
-		m["range_given"] = value
+	var idx int
+	var err error
+	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
+		scope.UpdateErr("QueueDequeueUpToV2", err)
+		return
 	}
+	return components
 }
 
-// Quantizes then dequantizes a tensor.
+// Computes the Cholesky decomposition of one or more square matrices.
 //
-// This is almost identical to QuantizeAndDequantizeV2, except that num_bits is a
-// tensor, so its value can change during training.
-func QuantizeAndDequantizeV3(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, num_bits tf.Output, optional ...QuantizeAndDequantizeV3Attr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "QuantizeAndDequantizeV3",
-		Input: []tf.Input{
-			input, input_min, input_max, num_bits,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// IdentityReaderV2Attr is an optional argument to IdentityReaderV2.
-type IdentityReaderV2Attr func(optionalAttr)
-
-// IdentityReaderV2Container sets the optional container attribute to value.
+// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices.
 //
-// value: If non-empty, this reader is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func IdentityReaderV2Container(value string) IdentityReaderV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// IdentityReaderV2SharedName sets the optional shared_name attribute to value.
+// The input has to be symmetric and positive definite. Only the lower-triangular
+// part of the input will be used for this operation. The upper-triangular part
+// will not be read.
 //
-// value: If non-empty, this reader is named in the given bucket
-// with this shared_name. Otherwise, the node name is used instead.
-// If not specified, defaults to ""
-func IdentityReaderV2SharedName(value string) IdentityReaderV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// A Reader that outputs the queued work as both the key and value.
+// The output is a tensor of the same shape as the input
+// containing the Cholesky decompositions for all input submatrices `[..., :, :]`.
 //
-// To use, enqueue strings in a Queue.  ReaderRead will take the front
-// work string and output (work, work).
+// **Note**: The gradient computation on GPU is faster for large matrices but
+// not for large batch dimensions when the submatrices are small. In this
+// case it might be faster to use the CPU.
 //
-// Returns The handle to reference the Reader.
-func IdentityReaderV2(scope *Scope, optional ...IdentityReaderV2Attr) (reader_handle tf.Output) {
+// Arguments:
+//	input: Shape is `[..., M, M]`.
+//
+// Returns Shape is `[..., M, M]`.
+func Cholesky(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "IdentityReaderV2",
-
-		Attrs: attrs,
+		Type: "Cholesky",
+		Input: []tf.Input{
+			input,
+		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceApplyGradientDescentAttr is an optional argument to ResourceApplyGradientDescent.
-type ResourceApplyGradientDescentAttr func(optionalAttr)
-
-// ResourceApplyGradientDescentUseLocking sets the optional use_locking attribute to value.
+// Writes contents to the file at input filename. Creates file and recursively
 //
-// value: If `True`, the subtraction will be protected by a lock;
-// otherwise the behavior is undefined, but may exhibit less contention.
-// If not specified, defaults to false
-func ResourceApplyGradientDescentUseLocking(value bool) ResourceApplyGradientDescentAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
-	}
-}
-
-// Update '*var' by subtracting 'alpha' * 'delta' from it.
+// creates the directory if it does not exist.
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	alpha: Scaling factor. Must be a scalar.
-//	delta: The change.
+//	filename: scalar. The name of the file to which we write the contents.
+//	contents: scalar. The content to be written to the output file.
 //
 // Returns the created operation.
-func ResourceApplyGradientDescent(scope *Scope, var_ tf.Output, alpha tf.Output, delta tf.Output, optional ...ResourceApplyGradientDescentAttr) (o *tf.Operation) {
+func WriteFile(scope *Scope, filename tf.Output, contents tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyGradientDescent",
+		Type: "WriteFile",
 		Input: []tf.Input{
-			var_, alpha, delta,
+			filename, contents,
 		},
-		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Returns the next record (key, value pair) produced by a Reader.
+// AllAttr is an optional argument to All.
+type AllAttr func(optionalAttr)
+
+// AllKeepDims sets the optional keep_dims attribute to value.
 //
-// Will dequeue from the input queue if necessary (e.g. when the
-// Reader needs to start reading from a new file since it has finished
-// with the previous file).
+// value: If true, retain reduced dimensions with length 1.
+// If not specified, defaults to false
+func AllKeepDims(value bool) AllAttr {
+	return func(m optionalAttr) {
+		m["keep_dims"] = value
+	}
+}
+
+// Computes the "logical and" of elements across dimensions of a tensor.
+//
+// Reduces `input` along the dimensions given in `axis`. Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `axis`. If `keep_dims` is true, the reduced dimensions are
+// retained with length 1.
 //
 // Arguments:
-//	reader_handle: Handle to a Reader.
-//	queue_handle: Handle to a Queue, with string work items.
+//	input: The tensor to reduce.
+//	axis: The dimensions to reduce. Must be in the range
+// `[-rank(input), rank(input))`.
 //
-// Returns A scalar.A scalar.
-func ReaderReadV2(scope *Scope, reader_handle tf.Output, queue_handle tf.Output) (key tf.Output, value tf.Output) {
+// Returns The reduced tensor.
+func All(scope *Scope, input tf.Output, axis tf.Output, optional ...AllAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ReaderReadV2",
+		Type: "All",
 		Input: []tf.Input{
-			reader_handle, queue_handle,
+			input, axis,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Returns up to `num_records` (key, value) pairs produced by a Reader.
+// Computes the Eigen Decomposition of a batch of square self-adjoint matrices.
 //
-// Will dequeue from the input queue if necessary (e.g. when the
-// Reader needs to start reading from a new file since it has finished
-// with the previous file).
-// It may return less than `num_records` even before the last batch.
+// DEPRECATED at GraphDef version 11: Use SelfAdjointEigV2 instead.
+//
+// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices, with the same constraints as the single matrix
+// SelfAdjointEig.
+//
+// The result is a `[..., M+1, M]` matrix with `[..., 0, :]` containing the
+// eigenvalues, and subsequent `[..., 1:, :]` containing the eigenvectors.
 //
 // Arguments:
-//	reader_handle: Handle to a `Reader`.
-//	queue_handle: Handle to a `Queue`, with string work items.
-//	num_records: number of records to read from `Reader`.
+//	input: Shape is `[..., M, M]`.
 //
-// Returns A 1-D tensor.A 1-D tensor.
-func ReaderReadUpToV2(scope *Scope, reader_handle tf.Output, queue_handle tf.Output, num_records tf.Output) (keys tf.Output, values tf.Output) {
+// Returns Shape is `[..., M+1, M]`.
+func SelfAdjointEig(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ReaderReadUpToV2",
+		Type: "SelfAdjointEig",
 		Input: []tf.Input{
-			reader_handle, queue_handle, num_records,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Restore a Reader to its initial clean state.
+// Computes softplus gradients for a softplus operation.
 //
 // Arguments:
-//	reader_handle: Handle to a Reader.
+//	gradients: The backpropagated gradients to the corresponding softplus operation.
+//	features: The features passed as input to the corresponding softplus operation.
 //
-// Returns the created operation.
-func ReaderResetV2(scope *Scope, reader_handle tf.Output) (o *tf.Operation) {
+// Returns The gradients: `gradients / (1 + exp(-features))`.
+func SoftplusGrad(scope *Scope, gradients tf.Output, features tf.Output) (backprops tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ReaderResetV2",
+		Type: "SoftplusGrad",
 		Input: []tf.Input{
-			reader_handle,
+			gradients, features,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// ResourceApplyAdamAttr is an optional argument to ResourceApplyAdam.
-type ResourceApplyAdamAttr func(optionalAttr)
+// SelfAdjointEigV2Attr is an optional argument to SelfAdjointEigV2.
+type SelfAdjointEigV2Attr func(optionalAttr)
 
-// ResourceApplyAdamUseLocking sets the optional use_locking attribute to value.
+// SelfAdjointEigV2ComputeV sets the optional compute_v attribute to value.
 //
-// value: If `True`, updating of the var, m, and v tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceApplyAdamUseLocking(value bool) ResourceApplyAdamAttr {
+// value: If `True` then eigenvectors will be computed and returned in `v`.
+// Otherwise, only the eigenvalues will be computed.
+// If not specified, defaults to true
+func SelfAdjointEigV2ComputeV(value bool) SelfAdjointEigV2Attr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["compute_v"] = value
 	}
 }
 
-// ResourceApplyAdamUseNesterov sets the optional use_nesterov attribute to value.
+// Computes the eigen decomposition of one or more square self-adjoint matrices.
 //
-// value: If `True`, uses the nesterov update.
-// If not specified, defaults to false
-func ResourceApplyAdamUseNesterov(value bool) ResourceApplyAdamAttr {
-	return func(m optionalAttr) {
-		m["use_nesterov"] = value
-	}
-}
-
-// Update '*var' according to the Adam algorithm.
+// Computes the eigenvalues and (optionally) eigenvectors of each inner matrix in
+// `input` such that `input[..., :, :] = v[..., :, :] * diag(e[..., :])`.
 //
-// lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
-// m_t <- beta1 * m_{t-1} + (1 - beta1) * g_t
-// v_t <- beta2 * v_{t-1} + (1 - beta2) * g_t * g_t
-// variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
+// ```python
+// # a is a tensor.
+// # e is a tensor of eigenvalues.
+// # v is a tensor of eigenvectors.
+// e, v = self_adjoint_eig(a)
+// e = self_adjoint_eig(a, compute_v=False)
+// ```
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	m: Should be from a Variable().
-//	v: Should be from a Variable().
-//	beta1_power: Must be a scalar.
-//	beta2_power: Must be a scalar.
-//	lr: Scaling factor. Must be a scalar.
-//	beta1: Momentum factor. Must be a scalar.
-//	beta2: Momentum factor. Must be a scalar.
-//	epsilon: Ridge term. Must be a scalar.
-//	grad: The gradient.
+//	input: `Tensor` input of shape `[N, N]`.
 //
-// Returns the created operation.
-func ResourceApplyAdam(scope *Scope, var_ tf.Output, m tf.Output, v tf.Output, beta1_power tf.Output, beta2_power tf.Output, lr tf.Output, beta1 tf.Output, beta2 tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyAdamAttr) (o *tf.Operation) {
+// Returns Eigenvalues. Shape is `[N]`. Eigenvectors. Shape is `[N, N]`.
+func SelfAdjointEigV2(scope *Scope, input tf.Output, optional ...SelfAdjointEigV2Attr) (e tf.Output, v tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -18698,85 +18873,121 @@ func ResourceApplyAdam(scope *Scope, var_ tf.Output, m tf.Output, v tf.Output, b
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyAdam",
+		Type: "SelfAdjointEigV2",
 		Input: []tf.Input{
-			var_, m, v, beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad,
+			input,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
 }
 
-// Store the input tensor in the state of the current session.
+// Adjust the saturation of one or more images.
+//
+// `images` is a tensor of at least 3 dimensions.  The last dimension is
+// interpreted as channels, and must be three.
+//
+// The input image is considered in the RGB colorspace. Conceptually, the RGB
+// colors are first mapped into HSV. A scale is then applied to all the saturation
+// values, which are then remapped back to the RGB colorspace.
 //
 // Arguments:
-//	value: The tensor to be stored.
+//	images: Images to adjust.  At least 3-D.
+//	scale: A float scale by which to multiply the saturation.
 //
-// Returns The handle for the tensor stored in the session state, represented
-// as a ResourceHandle object.
-func GetSessionHandleV2(scope *Scope, value tf.Output) (handle tf.Output) {
+// Returns The saturation-adjusted image or images.
+func AdjustSaturation(scope *Scope, images tf.Output, scale tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "GetSessionHandleV2",
+		Type: "AdjustSaturation",
 		Input: []tf.Input{
-			value,
+			images, scale,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns the set of files matching one or more glob patterns.
-//
-// Note that this routine only supports wildcard characters in the
-// basename portion of the pattern, not in the directory portion.
-//
-// Arguments:
-//	pattern: Shell wildcard pattern(s). Scalar or vector of type string.
+// Elementwise computes the bitwise OR of `x` and `y`.
 //
-// Returns A vector of matching filenames.
-func MatchingFiles(scope *Scope, pattern tf.Output) (filenames tf.Output) {
+// The result will have those bits set that are set in `x`, `y`, or both. The
+// computation is performed on the underlying representations of `x` and `y`.
+func BitwiseOr(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "MatchingFiles",
+		Type: "BitwiseOr",
 		Input: []tf.Input{
-			pattern,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResizeBicubicGradAttr is an optional argument to ResizeBicubicGrad.
-type ResizeBicubicGradAttr func(optionalAttr)
+// MatrixSolveLsAttr is an optional argument to MatrixSolveLs.
+type MatrixSolveLsAttr func(optionalAttr)
 
-// ResizeBicubicGradAlignCorners sets the optional align_corners attribute to value.
-//
-// value: If true, rescale grads by (orig_height - 1) / (height - 1), which
-// exactly aligns the 4 corners of grads and original_image. If false, rescale by
-// orig_height / height. Treat similarly the width dimension.
-// If not specified, defaults to false
-func ResizeBicubicGradAlignCorners(value bool) ResizeBicubicGradAttr {
+// MatrixSolveLsFast sets the optional fast attribute to value.
+// If not specified, defaults to true
+func MatrixSolveLsFast(value bool) MatrixSolveLsAttr {
 	return func(m optionalAttr) {
-		m["align_corners"] = value
+		m["fast"] = value
 	}
 }
 
-// Computes the gradient of bicubic interpolation.
+// Solves one or more linear least-squares problems.
+//
+// `matrix` is a tensor of shape `[..., M, N]` whose inner-most 2 dimensions
+// form real or complex matrices of size `[M, N]`. `rhs` is a tensor of the same
+// type as `matrix` and shape `[..., M, K]`.
+// The output is a tensor of shape `[..., N, K]` where each output matrix solves
+// each of the equations
+// `matrix[..., :, :]` * `output[..., :, :]` = `rhs[..., :, :]`
+// in the least squares sense.
+//
+// We use the following notation for (complex) matrix and right-hand sides
+// in the batch:
+//
+// `matrix`=\\(A \in \mathbb{C}^{m \times n}\\),
+// `rhs`=\\(B  \in \mathbb{C}^{m \times k}\\),
+// `output`=\\(X  \in \mathbb{C}^{n \times k}\\),
+// `l2_regularizer`=\\(\lambda \in \mathbb{R}\\).
+//
+// If `fast` is `True`, then the solution is computed by solving the normal
+// equations using Cholesky decomposition. Specifically, if \\(m \ge n\\) then
+// \\(X = (A^H A + \lambda I)^{-1} A^H B\\), which solves the least-squares
+// problem \\(X = \mathrm{argmin}_{Z \in \mathbb{C}^{n \times k} } ||A Z - B||_F^2 +
+// \lambda ||Z||_F^2\\). If \\(m \lt n\\) then `output` is computed as
+// \\(X = A^H (A A^H + \lambda I)^{-1} B\\), which (for \\(\lambda = 0\\)) is the
+// minimum-norm solution to the under-determined linear system, i.e.
+// \\(X = \mathrm{argmin}_{Z \in \mathbb{C}^{n \times k} } ||Z||_F^2 \\),
+// subject to \\(A Z = B\\). Notice that the fast path is only numerically stable
+// when \\(A\\) is numerically full rank and has a condition number
+// \\(\mathrm{cond}(A) \lt \frac{1}{\sqrt{\epsilon_{mach} } }\\) or \\(\lambda\\) is
+// sufficiently large.
+//
+// If `fast` is `False` an algorithm based on the numerically robust complete
+// orthogonal decomposition is used. This computes the minimum-norm
+// least-squares solution, even when \\(A\\) is rank deficient. This path is
+// typically 6-7 times slower than the fast path. If `fast` is `False` then
+// `l2_regularizer` is ignored.
 //
 // Arguments:
-//	grads: 4-D with shape `[batch, height, width, channels]`.
-//	original_image: 4-D with shape `[batch, orig_height, orig_width, channels]`,
-// The image tensor that was resized.
+//	matrix: Shape is `[..., M, N]`.
+//	rhs: Shape is `[..., M, K]`.
+//	l2_regularizer: Scalar tensor.
 //
-// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`.
-// Gradients with respect to the input image. Input image must have been
-// float or double.
-func ResizeBicubicGrad(scope *Scope, grads tf.Output, original_image tf.Output, optional ...ResizeBicubicGradAttr) (output tf.Output) {
+// @compatibility(numpy)
+// Equivalent to np.linalg.lstsq
+// @end_compatibility
+//
+// Returns Shape is `[..., N, K]`.
+func MatrixSolveLs(scope *Scope, matrix tf.Output, rhs tf.Output, l2_regularizer tf.Output, optional ...MatrixSolveLsAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -18785,9 +18996,9 @@ func ResizeBicubicGrad(scope *Scope, grads tf.Output, original_image tf.Output,
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResizeBicubicGrad",
+		Type: "MatrixSolveLs",
 		Input: []tf.Input{
-			grads, original_image,
+			matrix, rhs, l2_regularizer,
 		},
 		Attrs: attrs,
 	}
@@ -18795,31 +19006,57 @@ func ResizeBicubicGrad(scope *Scope, grads tf.Output, original_image tf.Output,
 	return op.Output(0)
 }
 
-// ResizeNearestNeighborAttr is an optional argument to ResizeNearestNeighbor.
-type ResizeNearestNeighborAttr func(optionalAttr)
+// SvdAttr is an optional argument to Svd.
+type SvdAttr func(optionalAttr)
 
-// ResizeNearestNeighborAlignCorners sets the optional align_corners attribute to value.
+// SvdComputeUv sets the optional compute_uv attribute to value.
 //
-// value: If true, rescale input by (new_height - 1) / (height - 1), which
-// exactly aligns the 4 corners of images and resized images. If false, rescale
-// by new_height / height. Treat similarly the width dimension.
+// value: If true, left and right singular vectors will be
+// computed and returned in `u` and `v`, respectively.
+// If false, `u` and `v` are not set and should never be referenced.
+// If not specified, defaults to true
+func SvdComputeUv(value bool) SvdAttr {
+	return func(m optionalAttr) {
+		m["compute_uv"] = value
+	}
+}
+
+// SvdFullMatrices sets the optional full_matrices attribute to value.
+//
+// value: If true, compute full-sized `u` and `v`. If false
+// (the default), compute only the leading `P` singular vectors.
+// Ignored if `compute_uv` is `False`.
 // If not specified, defaults to false
-func ResizeNearestNeighborAlignCorners(value bool) ResizeNearestNeighborAttr {
+func SvdFullMatrices(value bool) SvdAttr {
 	return func(m optionalAttr) {
-		m["align_corners"] = value
+		m["full_matrices"] = value
 	}
 }
 
-// Resize `images` to `size` using nearest neighbor interpolation.
+// Computes the singular value decompositions of one or more matrices.
+//
+// Computes the SVD of each inner matrix in `input` such that
+// `input[..., :, :] = u[..., :, :] * diag(s[..., :]) * transpose(v[..., :, :])`
+//
+// ```python
+// # a is a tensor containing a batch of matrices.
+// # s is a tensor of singular values for each matrix.
+// # u is the tensor containing the left singular vectors for each matrix.
+// # v is the tensor containing the right singular vectors for each matrix.
+// s, u, v = svd(a)
+// s, _, _ = svd(a, compute_uv=False)
+// ```
 //
 // Arguments:
-//	images: 4-D with shape `[batch, height, width, channels]`.
-//	size: = A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
-// new size for the images.
+//	input: A tensor of shape `[..., M, N]` whose inner-most 2 dimensions
+// form matrices of size `[M, N]`. Let `P` be the minimum of `M` and `N`.
 //
-// Returns 4-D with shape
-// `[batch, new_height, new_width, channels]`.
-func ResizeNearestNeighbor(scope *Scope, images tf.Output, size tf.Output, optional ...ResizeNearestNeighborAttr) (resized_images tf.Output) {
+// Returns Singular values. Shape is `[..., P]`. Left singular vectors. If `full_matrices` is `False` then shape is
+// `[..., M, P]`; if `full_matrices` is `True` then shape is
+// `[..., M, M]`. Undefined if `compute_uv` is `False`. Right singular vectors. If `full_matrices` is `False` then shape is
+// `[..., N, P]`. If `full_matrices` is `True` then shape is `[..., N, N]`.
+// Undefined if `compute_uv` is `False`.
+func Svd(scope *Scope, input tf.Output, optional ...SvdAttr) (s tf.Output, u tf.Output, v tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -18828,41 +19065,50 @@ func ResizeNearestNeighbor(scope *Scope, images tf.Output, size tf.Output, optio
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResizeNearestNeighbor",
+		Type: "Svd",
 		Input: []tf.Input{
-			images, size,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// ResizeNearestNeighborGradAttr is an optional argument to ResizeNearestNeighborGrad.
-type ResizeNearestNeighborGradAttr func(optionalAttr)
+// QueueEnqueueManyV2Attr is an optional argument to QueueEnqueueManyV2.
+type QueueEnqueueManyV2Attr func(optionalAttr)
 
-// ResizeNearestNeighborGradAlignCorners sets the optional align_corners attribute to value.
+// QueueEnqueueManyV2TimeoutMs sets the optional timeout_ms attribute to value.
 //
-// value: If true, rescale grads by (orig_height - 1) / (height - 1), which
-// exactly aligns the 4 corners of grads and original_image. If false, rescale by
-// orig_height / height. Treat similarly the width dimension.
-// If not specified, defaults to false
-func ResizeNearestNeighborGradAlignCorners(value bool) ResizeNearestNeighborGradAttr {
+// value: If the queue is too full, this operation will block for up
+// to timeout_ms milliseconds.
+// Note: This option is not supported yet.
+// If not specified, defaults to -1
+func QueueEnqueueManyV2TimeoutMs(value int64) QueueEnqueueManyV2Attr {
 	return func(m optionalAttr) {
-		m["align_corners"] = value
+		m["timeout_ms"] = value
 	}
 }
 
-// Computes the gradient of nearest neighbor interpolation.
+// Enqueues zero or more tuples of one or more tensors in the given queue.
+//
+// This operation slices each component tensor along the 0th dimension to
+// make multiple queue elements. All of the tuple components must have the
+// same size in the 0th dimension.
+//
+// The components input has k elements, which correspond to the components of
+// tuples stored in the given queue.
+//
+// N.B. If the queue is full, this operation will block until the given
+// elements have been enqueued (or 'timeout_ms' elapses, if specified).
 //
 // Arguments:
-//	grads: 4-D with shape `[batch, height, width, channels]`.
-//	size: = A 1-D int32 Tensor of 2 elements: `orig_height, orig_width`. The
-// original input size.
+//	handle: The handle to a queue.
+//	components: One or more tensors from which the enqueued tensors should
+// be taken.
 //
-// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`. Gradients
-// with respect to the input image.
-func ResizeNearestNeighborGrad(scope *Scope, grads tf.Output, size tf.Output, optional ...ResizeNearestNeighborGradAttr) (output tf.Output) {
+// Returns the created operation.
+func QueueEnqueueManyV2(scope *Scope, handle tf.Output, components []tf.Output, optional ...QueueEnqueueManyV2Attr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -18871,113 +19117,123 @@ func ResizeNearestNeighborGrad(scope *Scope, grads tf.Output, size tf.Output, op
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResizeNearestNeighborGrad",
+		Type: "QueueEnqueueManyV2",
 		Input: []tf.Input{
-			grads, size,
+			handle, tf.OutputList(components),
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// DecodeJpegAttr is an optional argument to DecodeJpeg.
-type DecodeJpegAttr func(optionalAttr)
-
-// DecodeJpegChannels sets the optional channels attribute to value.
+// Computes the product along segments of a tensor.
 //
-// value: Number of color channels for the decoded image.
-// If not specified, defaults to 0
-func DecodeJpegChannels(value int64) DecodeJpegAttr {
-	return func(m optionalAttr) {
-		m["channels"] = value
-	}
-}
-
-// DecodeJpegRatio sets the optional ratio attribute to value.
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
 //
-// value: Downscaling ratio.
-// If not specified, defaults to 1
-func DecodeJpegRatio(value int64) DecodeJpegAttr {
-	return func(m optionalAttr) {
-		m["ratio"] = value
+// Computes a tensor such that
+// \\(output_i = \prod_j data_j\\) where the product is over `j` such
+// that `segment_ids[j] == i`.
+//
+// If the product is empty for a given segment ID `i`, `output[i] = 1`.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentProd.png" alt>
+// </div>
+//
+// Arguments:
+//
+//	segment_ids: A 1-D tensor whose size is equal to the size of `data`'s
+// first dimension.  Values should be sorted and can be repeated.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `k`, the number of segments.
+func SegmentProd(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
+	opspec := tf.OpSpec{
+		Type: "SegmentProd",
+		Input: []tf.Input{
+			data, segment_ids,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// DecodeJpegFancyUpscaling sets the optional fancy_upscaling attribute to value.
+// Converts one or more images from RGB to HSV.
 //
-// value: If true use a slower but nicer upscaling of the
-// chroma planes (yuv420/422 only).
-// If not specified, defaults to true
-func DecodeJpegFancyUpscaling(value bool) DecodeJpegAttr {
-	return func(m optionalAttr) {
-		m["fancy_upscaling"] = value
-	}
-}
-
-// DecodeJpegTryRecoverTruncated sets the optional try_recover_truncated attribute to value.
+// Outputs a tensor of the same shape as the `images` tensor, containing the HSV
+// value of the pixels. The output is only well defined if the values in `images`
+// are in `[0,1]`.
 //
-// value: If true try to recover an image from truncated input.
-// If not specified, defaults to false
-func DecodeJpegTryRecoverTruncated(value bool) DecodeJpegAttr {
-	return func(m optionalAttr) {
-		m["try_recover_truncated"] = value
+// `output[..., 0]` contains hue, `output[..., 1]` contains saturation, and
+// `output[..., 2]` contains value. All HSV values are in `[0,1]`. A hue of 0
+// corresponds to pure red, hue 1/3 is pure green, and 2/3 is pure blue.
+//
+// Arguments:
+//	images: 1-D or higher rank. RGB data to convert. Last dimension must be size 3.
+//
+// Returns `images` converted to HSV.
+func RGBToHSV(scope *Scope, images tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "RGBToHSV",
+		Input: []tf.Input{
+			images,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// DecodeJpegAcceptableFraction sets the optional acceptable_fraction attribute to value.
+// Does nothing. Only useful as a placeholder for control edges.
 //
-// value: The minimum required fraction of lines before a truncated
-// input is accepted.
-// If not specified, defaults to 1
-func DecodeJpegAcceptableFraction(value float32) DecodeJpegAttr {
-	return func(m optionalAttr) {
-		m["acceptable_fraction"] = value
+// Returns the created operation.
+func NoOp(scope *Scope) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "NoOp",
 	}
+	return scope.AddOperation(opspec)
 }
 
-// DecodeJpegDctMethod sets the optional dct_method attribute to value.
+// MergeV2CheckpointsAttr is an optional argument to MergeV2Checkpoints.
+type MergeV2CheckpointsAttr func(optionalAttr)
+
+// MergeV2CheckpointsDeleteOldDirs sets the optional delete_old_dirs attribute to value.
 //
-// value: string specifying a hint about the algorithm used for
-// decompression.  Defaults to "" which maps to a system-specific
-// default.  Currently valid values are ["INTEGER_FAST",
-// "INTEGER_ACCURATE"].  The hint may be ignored (e.g., the internal
-// jpeg library changes to a version that does not have that specific
-// option.)
-// If not specified, defaults to ""
-func DecodeJpegDctMethod(value string) DecodeJpegAttr {
+// value: see above.
+// If not specified, defaults to true
+func MergeV2CheckpointsDeleteOldDirs(value bool) MergeV2CheckpointsAttr {
 	return func(m optionalAttr) {
-		m["dct_method"] = value
+		m["delete_old_dirs"] = value
 	}
 }
 
-// Decode a JPEG-encoded image to a uint8 tensor.
-//
-// The attr `channels` indicates the desired number of color channels for the
-// decoded image.
-//
-// Accepted values are:
-//
-// *   0: Use the number of channels in the JPEG-encoded image.
-// *   1: output a grayscale image.
-// *   3: output an RGB image.
-//
-// If needed, the JPEG-encoded image is transformed to match the requested number
-// of color channels.
+// V2 format specific: merges the metadata files of sharded checkpoints.
 //
-// The attr `ratio` allows downscaling the image by an integer factor during
-// decoding.  Allowed values are: 1, 2, 4, and 8.  This is much faster than
-// downscaling the image later.
+// The result is one logical checkpoint, with one physical metadata file and renamed
+// data files.
 //
+// Intended for "grouping" multiple checkpoints in a sharded checkpoint setup.
 //
-// This op also supports decoding PNGs and non-animated GIFs since the interface is
-// the same, though it is cleaner to use `tf.image.decode_image`.
+// If delete_old_dirs is true, attempts to recursively delete the dirname of each
+// path in the input checkpoint_prefixes.  This is useful when those paths are
+// non-user-facing temporary locations.
 //
 // Arguments:
-//	contents: 0-D.  The JPEG-encoded image.
+//	checkpoint_prefixes: prefixes of V2 checkpoints to merge.
+//	destination_prefix: scalar.  The desired final prefix.  Allowed to be the same
+// as one of the checkpoint_prefixes.
 //
-// Returns 3-D with shape `[height, width, channels]`..
-func DecodeJpeg(scope *Scope, contents tf.Output, optional ...DecodeJpegAttr) (image tf.Output) {
+// Returns the created operation.
+func MergeV2Checkpoints(scope *Scope, checkpoint_prefixes tf.Output, destination_prefix tf.Output, optional ...MergeV2CheckpointsAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -18986,322 +19242,313 @@ func DecodeJpeg(scope *Scope, contents tf.Output, optional ...DecodeJpegAttr) (i
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodeJpeg",
+		Type: "MergeV2Checkpoints",
 		Input: []tf.Input{
-			contents,
+			checkpoint_prefixes, destination_prefix,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// ExtractJpegShapeAttr is an optional argument to ExtractJpegShape.
-type ExtractJpegShapeAttr func(optionalAttr)
-
-// ExtractJpegShapeOutputType sets the optional output_type attribute to value.
+// Saves input tensor slices to disk.
 //
-// value: (Optional) The output type of the operation (int32 or int64).
-// Defaults to int32.
-// If not specified, defaults to DT_INT32
-func ExtractJpegShapeOutputType(value tf.DataType) ExtractJpegShapeAttr {
-	return func(m optionalAttr) {
-		m["output_type"] = value
-	}
-}
-
-// Extract the shape information of a JPEG-encoded image.
+// This is like `Save` except that tensors can be listed in the saved file as being
+// a slice of a larger tensor.  `shapes_and_slices` specifies the shape of the
+// larger tensor and the slice that this tensor covers. `shapes_and_slices` must
+// have as many elements as `tensor_names`.
 //
-// This op only parses the image header, so it is much faster than DecodeJpeg.
+// Elements of the `shapes_and_slices` input must either be:
+//
+// *  The empty string, in which case the corresponding tensor is
+//    saved normally.
+// *  A string of the form `dim0 dim1 ... dimN-1 slice-spec` where the
+//    `dimI` are the dimensions of the larger tensor and `slice-spec`
+//    specifies what part is covered by the tensor to save.
+//
+// `slice-spec` itself is a `:`-separated list: `slice0:slice1:...:sliceN-1`
+// where each `sliceI` is either:
+//
+// *  The string `-` meaning that the slice covers all indices of this dimension
+// *  `start,length` where `start` and `length` are integers.  In that
+//    case the slice covers `length` indices starting at `start`.
+//
+// See also `Save`.
 //
 // Arguments:
-//	contents: 0-D. The JPEG-encoded image.
+//	filename: Must have a single element. The name of the file to which we write the
+// tensor.
+//	tensor_names: Shape `[N]`. The names of the tensors to be saved.
+//	shapes_and_slices: Shape `[N]`.  The shapes and slice specifications to use when
+// saving the tensors.
+//	data: `N` tensors to save.
 //
-// Returns 1-D. The image shape with format [height, width, channels].
-func ExtractJpegShape(scope *Scope, contents tf.Output, optional ...ExtractJpegShapeAttr) (image_shape tf.Output) {
+// Returns the created operation.
+func SaveSlices(scope *Scope, filename tf.Output, tensor_names tf.Output, shapes_and_slices tf.Output, data []tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ExtractJpegShape",
+		Type: "SaveSlices",
 		Input: []tf.Input{
-			contents,
+			filename, tensor_names, shapes_and_slices, tf.OutputList(data),
 		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// PaddingFIFOQueueV2Attr is an optional argument to PaddingFIFOQueueV2.
-type PaddingFIFOQueueV2Attr func(optionalAttr)
-
-// PaddingFIFOQueueV2Shapes sets the optional shapes attribute to value.
-//
-// value: The shape of each component in a value. The length of this attr must
-// be either 0 or the same as the length of component_types.
-// Shapes of fixed rank but variable size are allowed by setting
-// any shape dimension to -1.  In this case, the inputs' shape may vary along
-// the given dimension, and DequeueMany will pad the given dimension with
-// zeros up to the maximum shape of all elements in the given batch.
-// If the length of this attr is 0, different queue elements may have
-// different ranks and shapes, but only one element may be dequeued at a time.
-// If not specified, defaults to <>
-//
-// REQUIRES: len(value) >= 0
-func PaddingFIFOQueueV2Shapes(value []tf.Shape) PaddingFIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shapes"] = value
 	}
+	return scope.AddOperation(opspec)
 }
 
-// PaddingFIFOQueueV2Capacity sets the optional capacity attribute to value.
-//
-// value: The upper bound on the number of elements in this queue.
-// Negative numbers mean no limit.
-// If not specified, defaults to -1
-func PaddingFIFOQueueV2Capacity(value int64) PaddingFIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
+// DenseToDenseSetOperationAttr is an optional argument to DenseToDenseSetOperation.
+type DenseToDenseSetOperationAttr func(optionalAttr)
 
-// PaddingFIFOQueueV2Container sets the optional container attribute to value.
-//
-// value: If non-empty, this queue is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func PaddingFIFOQueueV2Container(value string) PaddingFIFOQueueV2Attr {
+// DenseToDenseSetOperationValidateIndices sets the optional validate_indices attribute to value.
+// If not specified, defaults to true
+func DenseToDenseSetOperationValidateIndices(value bool) DenseToDenseSetOperationAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["validate_indices"] = value
 	}
 }
 
-// PaddingFIFOQueueV2SharedName sets the optional shared_name attribute to value.
+// Applies set operation along last dimension of 2 `Tensor` inputs.
 //
-// value: If non-empty, this queue will be shared under the given name
-// across multiple sessions.
-// If not specified, defaults to ""
-func PaddingFIFOQueueV2SharedName(value string) PaddingFIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// A queue that produces elements in first-in first-out order.
+// See SetOperationOp::SetOperationFromContext for values of `set_operation`.
 //
-// Variable-size shapes are allowed by setting the corresponding shape dimensions
-// to 0 in the shape attr.  In this case DequeueMany will pad up to the maximum
-// size of any given element in the minibatch.  See below for details.
+// Output `result` is a `SparseTensor` represented by `result_indices`,
+// `result_values`, and `result_shape`. For `set1` and `set2` ranked `n`, this
+// has rank `n` and the same 1st `n-1` dimensions as `set1` and `set2`. The `nth`
+// dimension contains the result of `set_operation` applied to the corresponding
+// `[0...n-1]` dimension of `set`.
 //
 // Arguments:
-//	component_types: The type of each component in a value.
+//	set1: `Tensor` with rank `n`. 1st `n-1` dimensions must be the same as `set2`.
+// Dimension `n` contains values in a set, duplicates are allowed but ignored.
+//	set2: `Tensor` with rank `n`. 1st `n-1` dimensions must be the same as `set1`.
+// Dimension `n` contains values in a set, duplicates are allowed but ignored.
 //
-// Returns The handle to the queue.
-func PaddingFIFOQueueV2(scope *Scope, component_types []tf.DataType, optional ...PaddingFIFOQueueV2Attr) (handle tf.Output) {
+//
+// Returns 2D indices of a `SparseTensor`. 1D values of a `SparseTensor`. 1D `Tensor` shape of a `SparseTensor`. `result_shape[0...n-1]` is
+// the same as the 1st `n-1` dimensions of `set1` and `set2`, `result_shape[n]`
+// is the max result set size across all `0...n-1` dimensions.
+func DenseToDenseSetOperation(scope *Scope, set1 tf.Output, set2 tf.Output, set_operation string, optional ...DenseToDenseSetOperationAttr) (result_indices tf.Output, result_values tf.Output, result_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"component_types": component_types}
+	attrs := map[string]interface{}{"set_operation": set_operation}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "PaddingFIFOQueueV2",
-
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// DecodePngAttr is an optional argument to DecodePng.
-type DecodePngAttr func(optionalAttr)
-
-// DecodePngChannels sets the optional channels attribute to value.
-//
-// value: Number of color channels for the decoded image.
-// If not specified, defaults to 0
-func DecodePngChannels(value int64) DecodePngAttr {
-	return func(m optionalAttr) {
-		m["channels"] = value
-	}
-}
-
-// DecodePngDtype sets the optional dtype attribute to value.
-// If not specified, defaults to DT_UINT8
-func DecodePngDtype(value tf.DataType) DecodePngAttr {
-	return func(m optionalAttr) {
-		m["dtype"] = value
+		Type: "DenseToDenseSetOperation",
+		Input: []tf.Input{
+			set1, set2,
+		},
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Decode a PNG-encoded image to a uint8 or uint16 tensor.
-//
-// The attr `channels` indicates the desired number of color channels for the
-// decoded image.
-//
-// Accepted values are:
-//
-// *   0: Use the number of channels in the PNG-encoded image.
-// *   1: output a grayscale image.
-// *   3: output an RGB image.
-// *   4: output an RGBA image.
-//
-// If needed, the PNG-encoded image is transformed to match the requested number
-// of color channels.
-//
-// This op also supports decoding JPEGs and non-animated GIFs since the interface
-// is the same, though it is cleaner to use `tf.image.decode_image`.
-//
-// Arguments:
-//	contents: 0-D.  The PNG-encoded image.
+// Generate a sharded filename. The filename is printf formatted as
 //
-// Returns 3-D with shape `[height, width, channels]`.
-func DecodePng(scope *Scope, contents tf.Output, optional ...DecodePngAttr) (image tf.Output) {
+//    %s-%05d-of-%05d, basename, shard, num_shards.
+func ShardedFilename(scope *Scope, basename tf.Output, shard tf.Output, num_shards tf.Output) (filename tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "DecodePng",
+		Type: "ShardedFilename",
 		Input: []tf.Input{
-			contents,
+			basename, shard, num_shards,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Decode the first frame of a GIF-encoded image to a uint8 tensor.
+// BatchToSpace for N-D tensors of type T.
 //
-// GIF with frame or transparency compression are not supported
-// convert animated GIF from compressed to uncompressed by:
+// This operation reshapes the "batch" dimension 0 into `M + 1` dimensions of shape
+// `block_shape + [batch]`, interleaves these blocks back into the grid defined by
+// the spatial dimensions `[1, ..., M]`, to obtain a result with the same rank as
+// the input.  The spatial dimensions of this intermediate result are then
+// optionally cropped according to `crops` to produce the output.  This is the
+// reverse of SpaceToBatch.  See below for a precise description.
 //
-//     convert $src.gif -coalesce $dst.gif
+// Arguments:
+//	input: N-D with shape `input_shape = [batch] + spatial_shape + remaining_shape`,
+// where spatial_shape has M dimensions.
+//	block_shape: 1-D with shape `[M]`, all values must be >= 1.
+//	crops: 2-D with shape `[M, 2]`, all values must be >= 0.
+//   `crops[i] = [crop_start, crop_end]` specifies the amount to crop from input
+//   dimension `i + 1`, which corresponds to spatial dimension `i`.  It is
+//   required that
+//   `crop_start[i] + crop_end[i] <= block_shape[i] * input_shape[i + 1]`.
 //
-// This op also supports decoding JPEGs and PNGs, though it is cleaner to use
-// `tf.image.decode_image`.
+// This operation is equivalent to the following steps:
 //
-// Arguments:
-//	contents: 0-D.  The GIF-encoded image.
+// 1. Reshape `input` to `reshaped` of shape:
+//      [block_shape[0], ..., block_shape[M-1],
+//       batch / prod(block_shape),
+//       input_shape[1], ..., input_shape[N-1]]
 //
-// Returns 4-D with shape `[num_frames, height, width, 3]`. RGB order
-func DecodeGif(scope *Scope, contents tf.Output) (image tf.Output) {
+// 2. Permute dimensions of `reshaped` to produce `permuted` of shape
+//      [batch / prod(block_shape),
+//
+//       input_shape[1], block_shape[0],
+//       ...,
+//       input_shape[M], block_shape[M-1],
+//
+//       input_shape[M+1], ..., input_shape[N-1]]
+//
+// 3. Reshape `permuted` to produce `reshaped_permuted` of shape
+//      [batch / prod(block_shape),
+//
+//       input_shape[1] * block_shape[0],
+//       ...,
+//       input_shape[M] * block_shape[M-1],
+//
+//       input_shape[M+1],
+//       ...,
+//       input_shape[N-1]]
+//
+// 4. Crop the start and end of dimensions `[1, ..., M]` of
+//    `reshaped_permuted` according to `crops` to produce the output of shape:
+//      [batch / prod(block_shape),
+//
+//       input_shape[1] * block_shape[0] - crops[0,0] - crops[0,1],
+//       ...,
+//       input_shape[M] * block_shape[M-1] - crops[M-1,0] - crops[M-1,1],
+//
+//       input_shape[M+1], ..., input_shape[N-1]]
+//
+// Some examples:
+//
+// (1) For the following input of shape `[4, 1, 1, 1]`, `block_shape = [2, 2]`, and
+//     `crops = [[0, 0], [0, 0]]`:
+//
+// ```
+// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
+// ```
+//
+// The output tensor has shape `[1, 2, 2, 1]` and value:
+//
+// ```
+// x = [[[[1], [2]], [[3], [4]]]]
+// ```
+//
+// (2) For the following input of shape `[4, 1, 1, 3]`, `block_shape = [2, 2]`, and
+//     `crops = [[0, 0], [0, 0]]`:
+//
+// ```
+// [[[[1, 2, 3]]], [[[4, 5, 6]]], [[[7, 8, 9]]], [[[10, 11, 12]]]]
+// ```
+//
+// The output tensor has shape `[1, 2, 2, 3]` and value:
+//
+// ```
+// x = [[[[1, 2, 3], [4, 5, 6]],
+//       [[7, 8, 9], [10, 11, 12]]]]
+// ```
+//
+// (3) For the following input of shape `[4, 2, 2, 1]`, `block_shape = [2, 2]`, and
+//     `crops = [[0, 0], [0, 0]]`:
+//
+// ```
+// x = [[[[1], [3]], [[9], [11]]],
+//      [[[2], [4]], [[10], [12]]],
+//      [[[5], [7]], [[13], [15]]],
+//      [[[6], [8]], [[14], [16]]]]
+// ```
+//
+// The output tensor has shape `[1, 4, 4, 1]` and value:
+//
+// ```
+// x = [[[[1],   [2],  [3],  [4]],
+//       [[5],   [6],  [7],  [8]],
+//       [[9],  [10], [11],  [12]],
+//       [[13], [14], [15],  [16]]]]
+// ```
+//
+// (4) For the following input of shape `[8, 1, 3, 1]`, `block_shape = [2, 2]`, and
+//     `crops = [[0, 0], [2, 0]]`:
+//
+// ```
+// x = [[[[0], [1], [3]]], [[[0], [9], [11]]],
+//      [[[0], [2], [4]]], [[[0], [10], [12]]],
+//      [[[0], [5], [7]]], [[[0], [13], [15]]],
+//      [[[0], [6], [8]]], [[[0], [14], [16]]]]
+// ```
+//
+// The output tensor has shape `[2, 2, 4, 1]` and value:
+//
+// ```
+// x = [[[[1],   [2],  [3],  [4]],
+//       [[5],   [6],  [7],  [8]]],
+//      [[[9],  [10], [11],  [12]],
+//       [[13], [14], [15],  [16]]]]
+// ```
+func BatchToSpaceND(scope *Scope, input tf.Output, block_shape tf.Output, crops tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodeGif",
+		Type: "BatchToSpaceND",
 		Input: []tf.Input{
-			contents,
+			input, block_shape, crops,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceApplyCenteredRMSPropAttr is an optional argument to ResourceApplyCenteredRMSProp.
-type ResourceApplyCenteredRMSPropAttr func(optionalAttr)
+// UnpackAttr is an optional argument to Unpack.
+type UnpackAttr func(optionalAttr)
 
-// ResourceApplyCenteredRMSPropUseLocking sets the optional use_locking attribute to value.
+// UnpackAxis sets the optional axis attribute to value.
 //
-// value: If `True`, updating of the var, mg, ms, and mom tensors is
-// protected by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceApplyCenteredRMSPropUseLocking(value bool) ResourceApplyCenteredRMSPropAttr {
+// value: Dimension along which to unpack.  Negative values wrap around, so the
+// valid range is `[-R, R)`.
+// If not specified, defaults to 0
+func UnpackAxis(value int64) UnpackAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["axis"] = value
 	}
 }
 
-// Update '*var' according to the centered RMSProp algorithm.
-//
-// The centered RMSProp algorithm uses an estimate of the centered second moment
-// (i.e., the variance) for normalization, as opposed to regular RMSProp, which
-// uses the (uncentered) second moment. This often helps with training, but is
-// slightly more expensive in terms of computation and memory.
+// Unpacks a given dimension of a rank-`R` tensor into `num` rank-`(R-1)` tensors.
 //
-// Note that in dense implementation of this algorithm, mg, ms, and mom will
-// update even if the grad is zero, but in this sparse implementation, mg, ms,
-// and mom will not update in iterations during which the grad is zero.
+// Unpacks `num` tensors from `value` by chipping it along the `axis` dimension.
+// For example, given a tensor of shape `(A, B, C, D)`:
 //
-// mean_square = decay * mean_square + (1-decay) * gradient ** 2
-// mean_grad = decay * mean_grad + (1-decay) * gradient
+// If `axis == 0` then the i'th tensor in `output` is the slice `value[i, :, :, :]`
+//   and each tensor in `output` will have shape `(B, C, D)`. (Note that the
+//   dimension unpacked along is gone, unlike `split`).
 //
-// Delta = learning_rate * gradient / sqrt(mean_square + epsilon - mean_grad ** 2)
+// If `axis == 1` then the i'th tensor in `output` is the slice `value[:, i, :, :]`
+//   and each tensor in `output` will have shape `(A, C, D)`.
+// Etc.
 //
-// mg <- rho * mg_{t-1} + (1-rho) * grad
-// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
-// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms - mg * mg + epsilon)
-// var <- var - mom
+// This is the opposite of `pack`.
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	mg: Should be from a Variable().
-//	ms: Should be from a Variable().
-//	mom: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	rho: Decay rate. Must be a scalar.
+//	value: 1-D or higher, with `axis` dimension size equal to `num`.
 //
-//	epsilon: Ridge term. Must be a scalar.
-//	grad: The gradient.
 //
-// Returns the created operation.
-func ResourceApplyCenteredRMSProp(scope *Scope, var_ tf.Output, mg tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyCenteredRMSPropAttr) (o *tf.Operation) {
+// Returns The list of tensors unpacked from `value`.
+func Unpack(scope *Scope, value tf.Output, num int64, optional ...UnpackAttr) (output []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"num": num}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyCenteredRMSProp",
+		Type: "Unpack",
 		Input: []tf.Input{
-			var_, mg, ms, mom, lr, rho, momentum, epsilon, grad,
+			value,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
-}
-
-// Returns a list of tensors with the same shapes and contents as the input
-//
-// tensors.
-//
-// This op can be used to override the gradient for complicated functions. For
-// example, suppose y = f(x) and we wish to apply a custom function g for backprop
-// such that dx = g(dy). In Python,
-//
-// ```python
-// with tf.get_default_graph().gradient_override_map(
-//     {'IdentityN': 'OverrideGradientWithG'}):
-//   y, _ = identity_n([f(x), x])
-//
-// @tf.RegisterGradient('OverrideGradientWithG')
-// def ApplyG(op, dy, _):
-//   return [None, g(dy)]  # Do not backprop to f(x).
-// ```
-func IdentityN(scope *Scope, input []tf.Output) (output []tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "IdentityN",
-		Input: []tf.Input{
-			tf.OutputList(input),
-		},
-	}
 	op := scope.AddOperation(opspec)
 	if scope.Err() != nil {
 		return
@@ -19309,182 +19556,111 @@ func IdentityN(scope *Scope, input []tf.Output) (output []tf.Output) {
 	var idx int
 	var err error
 	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("IdentityN", err)
+		scope.UpdateErr("Unpack", err)
 		return
 	}
 	return output
 }
 
-// Computes the gradient of the sigmoid of `x` wrt its input.
+// Increments variable pointed to by 'resource' until it reaches 'limit'.
 //
-// Specifically, `grad = dy * y * (1 - y)`, where `y = sigmoid(x)`, and
-// `dy` is the corresponding input gradient.
-func SigmoidGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
+// Arguments:
+//	resource: Should be from a scalar `Variable` node.
+//	limit: If incrementing ref would bring it above limit, instead generates an
+// 'OutOfRange' error.
+//
+//
+// Returns A copy of the input before increment. If nothing else modifies the
+// input, the values produced will all be distinct.
+func ResourceCountUpTo(scope *Scope, resource tf.Output, limit int64, T tf.DataType) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"limit": limit, "T": T}
 	opspec := tf.OpSpec{
-		Type: "SigmoidGrad",
+		Type: "ResourceCountUpTo",
 		Input: []tf.Input{
-			y, dy,
+			resource,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Convert one or more images from HSV to RGB.
-//
-// Outputs a tensor of the same shape as the `images` tensor, containing the RGB
-// value of the pixels. The output is only well defined if the value in `images`
-// are in `[0,1]`.
-//
-// See `rgb_to_hsv` for a description of the HSV encoding.
+// Delete the stack from its resource container.
 //
 // Arguments:
-//	images: 1-D or higher rank. HSV data to convert. Last dimension must be size 3.
+//	handle: The handle to a stack.
 //
-// Returns `images` converted to RGB.
-func HSVToRGB(scope *Scope, images tf.Output) (output tf.Output) {
+// Returns the created operation.
+func StackCloseV2(scope *Scope, handle tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "HSVToRGB",
+		Type: "StackCloseV2",
 		Input: []tf.Input{
-			images,
+			handle,
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// SampleDistortedBoundingBoxV2Attr is an optional argument to SampleDistortedBoundingBoxV2.
-type SampleDistortedBoundingBoxV2Attr func(optionalAttr)
-
-// SampleDistortedBoundingBoxV2Seed sets the optional seed attribute to value.
-//
-// value: If either `seed` or `seed2` are set to non-zero, the random number
-// generator is seeded by the given `seed`.  Otherwise, it is seeded by a random
-// seed.
-// If not specified, defaults to 0
-func SampleDistortedBoundingBoxV2Seed(value int64) SampleDistortedBoundingBoxV2Attr {
-	return func(m optionalAttr) {
-		m["seed"] = value
+// Generate a glob pattern matching all sharded file names.
+func ShardedFilespec(scope *Scope, basename tf.Output, num_shards tf.Output) (filename tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// SampleDistortedBoundingBoxV2Seed2 sets the optional seed2 attribute to value.
-//
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func SampleDistortedBoundingBoxV2Seed2(value int64) SampleDistortedBoundingBoxV2Attr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
+	opspec := tf.OpSpec{
+		Type: "ShardedFilespec",
+		Input: []tf.Input{
+			basename, num_shards,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// SampleDistortedBoundingBoxV2AspectRatioRange sets the optional aspect_ratio_range attribute to value.
-//
-// value: The cropped area of the image must have an aspect ratio =
-// width / height within this range.
-// If not specified, defaults to <f:0.75 f:1.33 >
-func SampleDistortedBoundingBoxV2AspectRatioRange(value []float32) SampleDistortedBoundingBoxV2Attr {
-	return func(m optionalAttr) {
-		m["aspect_ratio_range"] = value
-	}
-}
+// TextLineReaderV2Attr is an optional argument to TextLineReaderV2.
+type TextLineReaderV2Attr func(optionalAttr)
 
-// SampleDistortedBoundingBoxV2AreaRange sets the optional area_range attribute to value.
+// TextLineReaderV2SkipHeaderLines sets the optional skip_header_lines attribute to value.
 //
-// value: The cropped area of the image must contain a fraction of the
-// supplied image within in this range.
-// If not specified, defaults to <f:0.05 f:1 >
-func SampleDistortedBoundingBoxV2AreaRange(value []float32) SampleDistortedBoundingBoxV2Attr {
+// value: Number of lines to skip from the beginning of every file.
+// If not specified, defaults to 0
+func TextLineReaderV2SkipHeaderLines(value int64) TextLineReaderV2Attr {
 	return func(m optionalAttr) {
-		m["area_range"] = value
+		m["skip_header_lines"] = value
 	}
 }
 
-// SampleDistortedBoundingBoxV2MaxAttempts sets the optional max_attempts attribute to value.
+// TextLineReaderV2Container sets the optional container attribute to value.
 //
-// value: Number of attempts at generating a cropped region of the image
-// of the specified constraints. After `max_attempts` failures, return the entire
-// image.
-// If not specified, defaults to 100
-func SampleDistortedBoundingBoxV2MaxAttempts(value int64) SampleDistortedBoundingBoxV2Attr {
+// value: If non-empty, this reader is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func TextLineReaderV2Container(value string) TextLineReaderV2Attr {
 	return func(m optionalAttr) {
-		m["max_attempts"] = value
+		m["container"] = value
 	}
 }
 
-// SampleDistortedBoundingBoxV2UseImageIfNoBoundingBoxes sets the optional use_image_if_no_bounding_boxes attribute to value.
+// TextLineReaderV2SharedName sets the optional shared_name attribute to value.
 //
-// value: Controls behavior if no bounding boxes supplied.
-// If true, assume an implicit bounding box covering the whole input. If false,
-// raise an error.
-// If not specified, defaults to false
-func SampleDistortedBoundingBoxV2UseImageIfNoBoundingBoxes(value bool) SampleDistortedBoundingBoxV2Attr {
+// value: If non-empty, this reader is named in the given bucket
+// with this shared_name. Otherwise, the node name is used instead.
+// If not specified, defaults to ""
+func TextLineReaderV2SharedName(value string) TextLineReaderV2Attr {
 	return func(m optionalAttr) {
-		m["use_image_if_no_bounding_boxes"] = value
+		m["shared_name"] = value
 	}
 }
 
-// Generate a single randomly distorted bounding box for an image.
-//
-// Bounding box annotations are often supplied in addition to ground-truth labels
-// in image recognition or object localization tasks. A common technique for
-// training such a system is to randomly distort an image while preserving
-// its content, i.e. *data augmentation*. This Op outputs a randomly distorted
-// localization of an object, i.e. bounding box, given an `image_size`,
-// `bounding_boxes` and a series of constraints.
-//
-// The output of this Op is a single bounding box that may be used to crop the
-// original image. The output is returned as 3 tensors: `begin`, `size` and
-// `bboxes`. The first 2 tensors can be fed directly into `tf.slice` to crop the
-// image. The latter may be supplied to `tf.image.draw_bounding_boxes` to visualize
-// what the bounding box looks like.
-//
-// Bounding boxes are supplied and returned as `[y_min, x_min, y_max, x_max]`. The
-// bounding box coordinates are floats in `[0.0, 1.0]` relative to the width and
-// height of the underlying image.
-//
-// For example,
-//
-// ```python
-//     # Generate a single distorted bounding box.
-//     begin, size, bbox_for_draw = tf.image.sample_distorted_bounding_box(
-//         tf.shape(image),
-//         bounding_boxes=bounding_boxes)
-//
-//     # Draw the bounding box in an image summary.
-//     image_with_box = tf.image.draw_bounding_boxes(tf.expand_dims(image, 0),
-//                                                   bbox_for_draw)
-//     tf.summary.image('images_with_box', image_with_box)
-//
-//     # Employ the bounding box to distort the image.
-//     distorted_image = tf.slice(image, begin, size)
-// ```
-//
-// Note that if no bounding box information is available, setting
-// `use_image_if_no_bounding_boxes = true` will assume there is a single implicit
-// bounding box covering the whole image. If `use_image_if_no_bounding_boxes` is
-// false and no bounding boxes are supplied, an error is raised.
-//
-// Arguments:
-//	image_size: 1-D, containing `[height, width, channels]`.
-//	bounding_boxes: 3-D with shape `[batch, N, 4]` describing the N bounding boxes
-// associated with the image.
-//	min_object_covered: The cropped area of the image must contain at least this
-// fraction of any bounding box supplied. The value of this parameter should be
-// non-negative. In the case of 0, the cropped area does not need to overlap
-// any of the bounding boxes supplied.
+// A Reader that outputs the lines of a file delimited by '\n'.
 //
-// Returns 1-D, containing `[offset_height, offset_width, 0]`. Provide as input to
-// `tf.slice`.1-D, containing `[target_height, target_width, -1]`. Provide as input to
-// `tf.slice`.3-D with shape `[1, 1, 4]` containing the distorted bounding box.
-// Provide as input to `tf.image.draw_bounding_boxes`.
-func SampleDistortedBoundingBoxV2(scope *Scope, image_size tf.Output, bounding_boxes tf.Output, min_object_covered tf.Output, optional ...SampleDistortedBoundingBoxV2Attr) (begin tf.Output, size tf.Output, bboxes tf.Output) {
+// Returns The handle to reference the Reader.
+func TextLineReaderV2(scope *Scope, optional ...TextLineReaderV2Attr) (reader_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -19493,99 +19669,97 @@ func SampleDistortedBoundingBoxV2(scope *Scope, image_size tf.Output, bounding_b
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SampleDistortedBoundingBoxV2",
-		Input: []tf.Input{
-			image_size, bounding_boxes, min_object_covered,
-		},
+		Type: "TextLineReaderV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// ExtractGlimpseAttr is an optional argument to ExtractGlimpse.
-type ExtractGlimpseAttr func(optionalAttr)
+// LoadAndRemapMatrixAttr is an optional argument to LoadAndRemapMatrix.
+type LoadAndRemapMatrixAttr func(optionalAttr)
 
-// ExtractGlimpseCentered sets the optional centered attribute to value.
+// LoadAndRemapMatrixMaxRowsInMemory sets the optional max_rows_in_memory attribute to value.
 //
-// value: indicates if the offset coordinates are centered relative to
-// the image, in which case the (0, 0) offset is relative to the center
-// of the input images. If false, the (0,0) offset corresponds to the
-// upper left corner of the input images.
-// If not specified, defaults to true
-func ExtractGlimpseCentered(value bool) ExtractGlimpseAttr {
+// value: The maximum number of rows to load from the checkpoint at
+// once. If less than or equal to 0, the entire matrix will be loaded into
+// memory. Setting this arg trades increased disk reads for lower memory usage.
+// If not specified, defaults to -1
+func LoadAndRemapMatrixMaxRowsInMemory(value int64) LoadAndRemapMatrixAttr {
 	return func(m optionalAttr) {
-		m["centered"] = value
+		m["max_rows_in_memory"] = value
 	}
 }
 
-// ExtractGlimpseNormalized sets the optional normalized attribute to value.
+// Loads a 2-D (matrix) `Tensor` with name `old_tensor_name` from the checkpoint
 //
-// value: indicates if the offset coordinates are normalized.
-// If not specified, defaults to true
-func ExtractGlimpseNormalized(value bool) ExtractGlimpseAttr {
-	return func(m optionalAttr) {
-		m["normalized"] = value
-	}
-}
-
-// ExtractGlimpseUniformNoise sets the optional uniform_noise attribute to value.
+// at `ckpt_path` and potentially reorders its rows and columns using the
+// specified remappings.
 //
-// value: indicates if the noise should be generated using a
-// uniform distribution or a Gaussian distribution.
-// If not specified, defaults to true
-func ExtractGlimpseUniformNoise(value bool) ExtractGlimpseAttr {
-	return func(m optionalAttr) {
-		m["uniform_noise"] = value
-	}
-}
-
-// Extracts a glimpse from the input tensor.
+// Most users should use one of the wrapper initializers (such as
+// `tf.contrib.framework.load_and_remap_matrix_initializer`) instead of this
+// function directly.
 //
-// Returns a set of windows called glimpses extracted at location
-// `offsets` from the input tensor. If the windows only partially
-// overlaps the inputs, the non overlapping areas will be filled with
-// random noise.
+// The remappings are 1-D tensors with the following properties:
 //
-// The result is a 4-D tensor of shape `[batch_size, glimpse_height,
-// glimpse_width, channels]`. The channels and batch dimensions are the
-// same as that of the input tensor. The height and width of the output
-// windows are specified in the `size` parameter.
+// * `row_remapping` must have exactly `num_rows` entries. Row `i` of the output
+//   matrix will be initialized from the row corresponding to index
+//   `row_remapping[i]` in the old `Tensor` from the checkpoint.
+// * `col_remapping` must have either 0 entries (indicating that no column
+//   reordering is needed) or `num_cols` entries. If specified, column `j` of the
+//   output matrix will be initialized from the column corresponding to index
+//   `col_remapping[j]` in the old `Tensor` from the checkpoint.
+// * A value of -1 in either of the remappings signifies a "missing" entry. In that
+//   case, values from the `initializing_values` tensor will be used to fill that
+//   missing row or column. If `row_remapping` has `r` missing entries and
+//   `col_remapping` has `c` missing entries, then the following condition must be
+//   true:
 //
-// The argument `normalized` and `centered` controls how the windows are built:
+// `(r * num_cols) + (c * num_rows) - (r * c) == len(initializing_values)`
 //
-// * If the coordinates are normalized but not centered, 0.0 and 1.0
-//   correspond to the minimum and maximum of each height and width
-//   dimension.
-// * If the coordinates are both normalized and centered, they range from
-//   -1.0 to 1.0. The coordinates (-1.0, -1.0) correspond to the upper
-//   left corner, the lower right corner is located at (1.0, 1.0) and the
-//   center is at (0, 0).
-// * If the coordinates are not normalized they are interpreted as
-//   numbers of pixels.
+// The remapping tensors can be generated using the GenerateVocabRemapping op.
+//
+// As an example, with row_remapping = [1, 0, -1], col_remapping = [0, 2, -1],
+// initializing_values = [0.5, -0.5, 0.25, -0.25, 42], and w(i, j) representing
+// the value from row i, column j of the old tensor in the checkpoint, the output
+// matrix will look like the following:
+//
+// [[w(1, 0),  w(1, 2),  0.5],
+//  [w(0, 0),  w(0, 2), -0.5],
+//  [0.25,    -0.25,      42]]
 //
 // Arguments:
-//	input: A 4-D float tensor of shape `[batch_size, height, width, channels]`.
-//	size: A 1-D tensor of 2 elements containing the size of the glimpses
-// to extract.  The glimpse height must be specified first, following
-// by the glimpse width.
-//	offsets: A 2-D integer tensor of shape `[batch_size, 2]` containing
-// the y, x locations of the center of each window.
+//	ckpt_path: Path to the TensorFlow checkpoint (version 2, `TensorBundle`) from
+// which the old matrix `Tensor` will be loaded.
+//	old_tensor_name: Name of the 2-D `Tensor` to load from checkpoint.
+//	row_remapping: An int `Tensor` of row remappings (generally created by
+// `generate_vocab_remapping`).  Even if no row remapping is needed, this must
+// still be an index-valued Tensor (e.g. [0, 1, 2, ...]), or a shifted
+// index-valued `Tensor` (e.g. [8, 9, 10, ...], for partitioned `Variables`).
+//	col_remapping: An int `Tensor` of column remappings (generally created by
+// `generate_vocab_remapping`).  May be a size-0 `Tensor` if only row remapping
+// is to be done (e.g. column ordering is the same).
+//	initializing_values: A float `Tensor` containing values to fill in for cells
+// in the output matrix that are not loaded from the checkpoint. Length must be
+// exactly the same as the number of missing / new cells.
+//	num_rows: Number of rows (length of the 1st dimension) in the output matrix.
+//	num_cols: Number of columns (length of the 2nd dimension) in the output matrix.
 //
-// Returns A tensor representing the glimpses `[batch_size,
-// glimpse_height, glimpse_width, channels]`.
-func ExtractGlimpse(scope *Scope, input tf.Output, size tf.Output, offsets tf.Output, optional ...ExtractGlimpseAttr) (glimpse tf.Output) {
+// Returns Output matrix containing existing values loaded from the
+// checkpoint, and with any missing values filled in from initializing_values.
+func LoadAndRemapMatrix(scope *Scope, ckpt_path tf.Output, old_tensor_name tf.Output, row_remapping tf.Output, col_remapping tf.Output, initializing_values tf.Output, num_rows int64, num_cols int64, optional ...LoadAndRemapMatrixAttr) (output_matrix tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"num_rows": num_rows, "num_cols": num_cols}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ExtractGlimpse",
+		Type: "LoadAndRemapMatrix",
 		Input: []tf.Input{
-			input, size, offsets,
+			ckpt_path, old_tensor_name, row_remapping, col_remapping, initializing_values,
 		},
 		Attrs: attrs,
 	}
@@ -19593,17 +19767,52 @@ func ExtractGlimpse(scope *Scope, input tf.Output, size tf.Output, offsets tf.Ou
 	return op.Output(0)
 }
 
-// A container for an iterator resource.
+// TFRecordReaderV2Attr is an optional argument to TFRecordReaderV2.
+type TFRecordReaderV2Attr func(optionalAttr)
+
+// TFRecordReaderV2Container sets the optional container attribute to value.
 //
-// Returns A handle to the iterator that can be passed to a "MakeIterator"
-// or "IteratorGetNext" op.
-func Iterator(scope *Scope, shared_name string, container string, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// value: If non-empty, this reader is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func TFRecordReaderV2Container(value string) TFRecordReaderV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// TFRecordReaderV2SharedName sets the optional shared_name attribute to value.
+//
+// value: If non-empty, this reader is named in the given bucket
+// with this shared_name. Otherwise, the node name is used instead.
+// If not specified, defaults to ""
+func TFRecordReaderV2SharedName(value string) TFRecordReaderV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// TFRecordReaderV2CompressionType sets the optional compression_type attribute to value.
+// If not specified, defaults to ""
+func TFRecordReaderV2CompressionType(value string) TFRecordReaderV2Attr {
+	return func(m optionalAttr) {
+		m["compression_type"] = value
+	}
+}
+
+// A Reader that outputs the records from a TensorFlow Records file.
+//
+// Returns The handle to reference the Reader.
+func TFRecordReaderV2(scope *Scope, optional ...TFRecordReaderV2Attr) (reader_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"shared_name": shared_name, "container": container, "output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Iterator",
+		Type: "TFRecordReaderV2",
 
 		Attrs: attrs,
 	}
@@ -19611,48 +19820,41 @@ func Iterator(scope *Scope, shared_name string, container string, output_types [
 	return op.Output(0)
 }
 
-// ShuffleDatasetAttr is an optional argument to ShuffleDataset.
-type ShuffleDatasetAttr func(optionalAttr)
+// QuantizeAndDequantizeV3Attr is an optional argument to QuantizeAndDequantizeV3.
+type QuantizeAndDequantizeV3Attr func(optionalAttr)
 
-// ShuffleDatasetReshuffleEachIteration sets the optional reshuffle_each_iteration attribute to value.
-//
-// value: If true, each iterator over this dataset will be given
-// a different pseudorandomly generated seed, based on a sequence seeded by the
-// `seed` and `seed2` inputs. If false, each iterator will be given the same
-// seed, and repeated iteration over this dataset will yield the exact same
-// sequence of results.
+// QuantizeAndDequantizeV3SignedInput sets the optional signed_input attribute to value.
 // If not specified, defaults to true
-func ShuffleDatasetReshuffleEachIteration(value bool) ShuffleDatasetAttr {
+func QuantizeAndDequantizeV3SignedInput(value bool) QuantizeAndDequantizeV3Attr {
 	return func(m optionalAttr) {
-		m["reshuffle_each_iteration"] = value
+		m["signed_input"] = value
 	}
 }
 
-// Creates a dataset that shuffles elements from `input_dataset` pseudorandomly.
-//
-// Arguments:
-//
-//	buffer_size: The number of output elements to buffer in an iterator over
-// this dataset. Compare with the `min_after_dequeue` attr when creating a
-// `RandomShuffleQueue`.
-//	seed: A scalar seed for the random number generator. If either `seed` or
-// `seed2` is set to be non-zero, the random number generator is seeded
-// by the given seed.  Otherwise, a random seed is used.
-//	seed2: A second scalar seed to avoid seed collision.
-//
+// QuantizeAndDequantizeV3RangeGiven sets the optional range_given attribute to value.
+// If not specified, defaults to true
+func QuantizeAndDequantizeV3RangeGiven(value bool) QuantizeAndDequantizeV3Attr {
+	return func(m optionalAttr) {
+		m["range_given"] = value
+	}
+}
+
+// Quantizes then dequantizes a tensor.
 //
-func ShuffleDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, seed tf.Output, seed2 tf.Output, output_types []tf.DataType, output_shapes []tf.Shape, optional ...ShuffleDatasetAttr) (handle tf.Output) {
+// This is almost identical to QuantizeAndDequantizeV2, except that num_bits is a
+// tensor, so its value can change during training.
+func QuantizeAndDequantizeV3(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, num_bits tf.Output, optional ...QuantizeAndDequantizeV3Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ShuffleDataset",
+		Type: "QuantizeAndDequantizeV3",
 		Input: []tf.Input{
-			input_dataset, buffer_size, seed, seed2,
+			input, input_min, input_max, num_bits,
 		},
 		Attrs: attrs,
 	}
@@ -19660,69 +19862,77 @@ func ShuffleDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output
 	return op.Output(0)
 }
 
-// 3D fast Fourier transform.
+// IdentityReaderV2Attr is an optional argument to IdentityReaderV2.
+type IdentityReaderV2Attr func(optionalAttr)
+
+// IdentityReaderV2Container sets the optional container attribute to value.
 //
-// Computes the 3-dimensional discrete Fourier transform over the inner-most 3
-// dimensions of `input`.
+// value: If non-empty, this reader is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func IdentityReaderV2Container(value string) IdentityReaderV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// IdentityReaderV2SharedName sets the optional shared_name attribute to value.
 //
-// Arguments:
-//	input: A complex64 tensor.
+// value: If non-empty, this reader is named in the given bucket
+// with this shared_name. Otherwise, the node name is used instead.
+// If not specified, defaults to ""
+func IdentityReaderV2SharedName(value string) IdentityReaderV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// A Reader that outputs the queued work as both the key and value.
 //
-// Returns A complex64 tensor of the same shape as `input`. The inner-most 3
-//   dimensions of `input` are replaced with their 3D Fourier transform.
+// To use, enqueue strings in a Queue.  ReaderRead will take the front
+// work string and output (work, work).
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.fftn with 3 dimensions.
-// @end_compatibility
-func FFT3D(scope *Scope, input tf.Output) (output tf.Output) {
+// Returns The handle to reference the Reader.
+func IdentityReaderV2(scope *Scope, optional ...IdentityReaderV2Attr) (reader_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "FFT3D",
-		Input: []tf.Input{
-			input,
-		},
+		Type: "IdentityReaderV2",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// CropAndResizeGradBoxesAttr is an optional argument to CropAndResizeGradBoxes.
-type CropAndResizeGradBoxesAttr func(optionalAttr)
+// ResourceApplyGradientDescentAttr is an optional argument to ResourceApplyGradientDescent.
+type ResourceApplyGradientDescentAttr func(optionalAttr)
 
-// CropAndResizeGradBoxesMethod sets the optional method attribute to value.
+// ResourceApplyGradientDescentUseLocking sets the optional use_locking attribute to value.
 //
-// value: A string specifying the interpolation method. Only 'bilinear' is
-// supported for now.
-// If not specified, defaults to "bilinear"
-func CropAndResizeGradBoxesMethod(value string) CropAndResizeGradBoxesAttr {
+// value: If `True`, the subtraction will be protected by a lock;
+// otherwise the behavior is undefined, but may exhibit less contention.
+// If not specified, defaults to false
+func ResourceApplyGradientDescentUseLocking(value bool) ResourceApplyGradientDescentAttr {
 	return func(m optionalAttr) {
-		m["method"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Computes the gradient of the crop_and_resize op wrt the input boxes tensor.
+// Update '*var' by subtracting 'alpha' * 'delta' from it.
 //
 // Arguments:
-//	grads: A 4-D tensor of shape `[num_boxes, crop_height, crop_width, depth]`.
-//	image: A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
-// Both `image_height` and `image_width` need to be positive.
-//	boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
-// specifies the coordinates of a box in the `box_ind[i]` image and is specified
-// in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
-// `y` is mapped to the image coordinate at `y * (image_height - 1)`, so as the
-// `[0, 1]` interval of normalized image height is mapped to
-// `[0, image_height - 1] in image height coordinates. We do allow y1 > y2, in
-// which case the sampled crop is an up-down flipped version of the original
-// image. The width dimension is treated similarly. Normalized coordinates
-// outside the `[0, 1]` range are allowed, in which case we use
-// `extrapolation_value` to extrapolate the input image values.
-//	box_ind: A 1-D tensor of shape `[num_boxes]` with int32 values in `[0, batch)`.
-// The value of `box_ind[i]` specifies the image that the `i`-th box refers to.
+//	var_: Should be from a Variable().
+//	alpha: Scaling factor. Must be a scalar.
+//	delta: The change.
 //
-// Returns A 2-D tensor of shape `[num_boxes, 4]`.
-func CropAndResizeGradBoxes(scope *Scope, grads tf.Output, image tf.Output, boxes tf.Output, box_ind tf.Output, optional ...CropAndResizeGradBoxesAttr) (output tf.Output) {
+// Returns the created operation.
+func ResourceApplyGradientDescent(scope *Scope, var_ tf.Output, alpha tf.Output, delta tf.Output, optional ...ResourceApplyGradientDescentAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -19731,65 +19941,132 @@ func CropAndResizeGradBoxes(scope *Scope, grads tf.Output, image tf.Output, boxe
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "CropAndResizeGradBoxes",
+		Type: "ResourceApplyGradientDescent",
 		Input: []tf.Input{
-			grads, image, boxes, box_ind,
+			var_, alpha, delta,
 		},
 		Attrs: attrs,
 	}
+	return scope.AddOperation(opspec)
+}
+
+// Returns the next record (key, value pair) produced by a Reader.
+//
+// Will dequeue from the input queue if necessary (e.g. when the
+// Reader needs to start reading from a new file since it has finished
+// with the previous file).
+//
+// Arguments:
+//	reader_handle: Handle to a Reader.
+//	queue_handle: Handle to a Queue, with string work items.
+//
+// Returns the key (a scalar) and the value (a scalar).
+func ReaderReadV2(scope *Scope, reader_handle tf.Output, queue_handle tf.Output) (key tf.Output, value tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ReaderReadV2",
+		Input: []tf.Input{
+			reader_handle, queue_handle,
+		},
+	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Saves tensors in V2 checkpoint format.
+// Returns up to `num_records` (key, value) pairs produced by a Reader.
 //
-// By default, saves the named tensors in full.  If the caller wishes to save
-// specific slices of full tensors, "shape_and_slices" should be non-empty strings
-// and correspondingly well-formed.
+// Will dequeue from the input queue if necessary (e.g. when the
+// Reader needs to start reading from a new file since it has finished
+// with the previous file).
+// It may return fewer than `num_records` pairs even before the last batch.
 //
 // Arguments:
-//	prefix: Must have a single element. The prefix of the V2 checkpoint to which we
-// write the tensors.
-//	tensor_names: shape {N}. The names of the tensors to be saved.
-//	shape_and_slices: shape {N}.  The slice specs of the tensors to be saved.
-// Empty strings indicate that they are non-partitioned tensors.
-//	tensors: `N` tensors to save.
+//	reader_handle: Handle to a `Reader`.
+//	queue_handle: Handle to a `Queue`, with string work items.
+//	num_records: Number of records to read from `Reader`.
+//
+// Returns the keys (a 1-D tensor) and the values (a 1-D tensor).
+func ReaderReadUpToV2(scope *Scope, reader_handle tf.Output, queue_handle tf.Output, num_records tf.Output) (keys tf.Output, values tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ReaderReadUpToV2",
+		Input: []tf.Input{
+			reader_handle, queue_handle, num_records,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
+}
+
+// Restore a Reader to its initial clean state.
+//
+// Arguments:
+//	reader_handle: Handle to a Reader.
 //
 // Returns the created operation.
-func SaveV2(scope *Scope, prefix tf.Output, tensor_names tf.Output, shape_and_slices tf.Output, tensors []tf.Output) (o *tf.Operation) {
+func ReaderResetV2(scope *Scope, reader_handle tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SaveV2",
+		Type: "ReaderResetV2",
 		Input: []tf.Input{
-			prefix, tensor_names, shape_and_slices, tf.OutputList(tensors),
+			reader_handle,
 		},
 	}
 	return scope.AddOperation(opspec)
 }
 
-// StatsAggregatorHandleAttr is an optional argument to StatsAggregatorHandle.
-type StatsAggregatorHandleAttr func(optionalAttr)
+// ResourceApplyAdamAttr is an optional argument to ResourceApplyAdam.
+type ResourceApplyAdamAttr func(optionalAttr)
 
-// StatsAggregatorHandleContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func StatsAggregatorHandleContainer(value string) StatsAggregatorHandleAttr {
+// ResourceApplyAdamUseLocking sets the optional use_locking attribute to value.
+//
+// value: If `True`, updating of the var, m, and v tensors will be protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyAdamUseLocking(value bool) ResourceApplyAdamAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["use_locking"] = value
 	}
 }
 
-// StatsAggregatorHandleSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func StatsAggregatorHandleSharedName(value string) StatsAggregatorHandleAttr {
+// ResourceApplyAdamUseNesterov sets the optional use_nesterov attribute to value.
+//
+// value: If `True`, uses the Nesterov update.
+// If not specified, defaults to false
+func ResourceApplyAdamUseNesterov(value bool) ResourceApplyAdamAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["use_nesterov"] = value
 	}
 }
 
-// Creates a statistics manager resource.
-func StatsAggregatorHandle(scope *Scope, optional ...StatsAggregatorHandleAttr) (handle tf.Output) {
+// Update '*var' according to the Adam algorithm.
+//
+// lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
+// m_t <- beta1 * m_{t-1} + (1 - beta1) * g_t
+// v_t <- beta2 * v_{t-1} + (1 - beta2) * g_t * g_t
+// variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
+//
+// Arguments:
+//	var_: Should be from a Variable().
+//	m: Should be from a Variable().
+//	v: Should be from a Variable().
+//	beta1_power: Must be a scalar.
+//	beta2_power: Must be a scalar.
+//	lr: Scaling factor. Must be a scalar.
+//	beta1: Momentum factor. Must be a scalar.
+//	beta2: Momentum factor. Must be a scalar.
+//	epsilon: Ridge term. Must be a scalar.
+//	grad: The gradient.
+//
+// Returns the created operation.
+func ResourceApplyAdam(scope *Scope, var_ tf.Output, m tf.Output, v tf.Output, beta1_power tf.Output, beta2_power tf.Output, lr tf.Output, beta1 tf.Output, beta2 tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyAdamAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -19798,181 +20075,200 @@ func StatsAggregatorHandle(scope *Scope, optional ...StatsAggregatorHandleAttr)
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "StatsAggregatorHandle",
-
+		Type: "ResourceApplyAdam",
+		Input: []tf.Input{
+			var_, m, v, beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad,
+		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Greedily selects a subset of bounding boxes in descending order of score,
+// Store the input tensor in the state of the current session.
 //
-// pruning away boxes that have high intersection-over-union (IOU) overlap
-// with previously selected boxes.  Bounding boxes are supplied as
-// [y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any
-// diagonal pair of box corners and the coordinates can be provided as normalized
-// (i.e., lying in the interval [0, 1]) or absolute.  Note that this algorithm
-// is agnostic to where the origin is in the coordinate system.  Note that this
-// algorithm is invariant to orthogonal transformations and translations
-// of the coordinate system; thus translating or reflections of the coordinate
-// system result in the same boxes being selected by the algorithm.
+// Arguments:
+//	value: The tensor to be stored.
 //
-// The output of this operation is a set of integers indexing into the input
-// collection of bounding boxes representing the selected boxes.  The bounding
-// box coordinates corresponding to the selected indices can then be obtained
-// using the `tf.gather operation`.  For example:
+// Returns The handle for the tensor stored in the session state, represented
+// as a ResourceHandle object.
+func GetSessionHandleV2(scope *Scope, value tf.Output) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "GetSessionHandleV2",
+		Input: []tf.Input{
+			value,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// ResizeBicubicGradAttr is an optional argument to ResizeBicubicGrad.
+type ResizeBicubicGradAttr func(optionalAttr)
+
+// ResizeBicubicGradAlignCorners sets the optional align_corners attribute to value.
 //
-//   selected_indices = tf.image.non_max_suppression_v2(
-//       boxes, scores, max_output_size, iou_threshold)
-//   selected_boxes = tf.gather(boxes, selected_indices)
+// value: If true, rescale grads by (orig_height - 1) / (height - 1), which
+// exactly aligns the 4 corners of grads and original_image. If false, rescale by
+// orig_height / height. The width dimension is treated similarly.
+// If not specified, defaults to false
+func ResizeBicubicGradAlignCorners(value bool) ResizeBicubicGradAttr {
+	return func(m optionalAttr) {
+		m["align_corners"] = value
+	}
+}
+
+// Computes the gradient of bicubic interpolation.
 //
 // Arguments:
-//	boxes: A 2-D float tensor of shape `[num_boxes, 4]`.
-//	scores: A 1-D float tensor of shape `[num_boxes]` representing a single
-// score corresponding to each box (each row of boxes).
-//	max_output_size: A scalar integer tensor representing the maximum number of
-// boxes to be selected by non max suppression.
-//	iou_threshold: A 0-D float tensor representing the threshold for deciding whether
-// boxes overlap too much with respect to IOU.
+//	grads: 4-D with shape `[batch, height, width, channels]`.
+//	original_image: 4-D with shape `[batch, orig_height, orig_width, channels]`.
+// The image tensor that was resized.
 //
-// Returns A 1-D integer tensor of shape `[M]` representing the selected
-// indices from the boxes tensor, where `M <= max_output_size`.
-func NonMaxSuppressionV2(scope *Scope, boxes tf.Output, scores tf.Output, max_output_size tf.Output, iou_threshold tf.Output) (selected_indices tf.Output) {
+// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`.
+// Gradients with respect to the input image. Input image must have been
+// float or double.
+func ResizeBicubicGrad(scope *Scope, grads tf.Output, original_image tf.Output, optional ...ResizeBicubicGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "NonMaxSuppressionV2",
+		Type: "ResizeBicubicGrad",
 		Input: []tf.Input{
-			boxes, scores, max_output_size, iou_threshold,
+			grads, original_image,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Reshapes a tensor.
-//
-// Given `tensor`, this operation returns a tensor that has the same values
-// as `tensor` with shape `shape`.
-//
-// If one component of `shape` is the special value -1, the size of that dimension
-// is computed so that the total size remains constant.  In particular, a `shape`
-// of `[-1]` flattens into 1-D.  At most one component of `shape` can be -1.
+// ResizeNearestNeighborAttr is an optional argument to ResizeNearestNeighbor.
+type ResizeNearestNeighborAttr func(optionalAttr)
+
+// ResizeNearestNeighborAlignCorners sets the optional align_corners attribute to value.
 //
-// If `shape` is 1-D or higher, then the operation returns a tensor with shape
-// `shape` filled with the values of `tensor`. In this case, the number of elements
-// implied by `shape` must be the same as the number of elements in `tensor`.
+// value: If true, rescale input by (new_height - 1) / (height - 1), which
+// exactly aligns the 4 corners of images and resized images. If false, rescale
+// by new_height / height. The width dimension is treated similarly.
+// If not specified, defaults to false
+func ResizeNearestNeighborAlignCorners(value bool) ResizeNearestNeighborAttr {
+	return func(m optionalAttr) {
+		m["align_corners"] = value
+	}
+}
+
+// Resize `images` to `size` using nearest neighbor interpolation.
 //
-// For example:
+// Arguments:
+//	images: 4-D with shape `[batch, height, width, channels]`.
+//	size: A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
+// new size for the images.
 //
-// ```
-// # tensor 't' is [1, 2, 3, 4, 5, 6, 7, 8, 9]
-// # tensor 't' has shape [9]
-// reshape(t, [3, 3]) ==> [[1, 2, 3],
-//                         [4, 5, 6],
-//                         [7, 8, 9]]
-//
-// # tensor 't' is [[[1, 1], [2, 2]],
-// #                [[3, 3], [4, 4]]]
-// # tensor 't' has shape [2, 2, 2]
-// reshape(t, [2, 4]) ==> [[1, 1, 2, 2],
-//                         [3, 3, 4, 4]]
-//
-// # tensor 't' is [[[1, 1, 1],
-// #                 [2, 2, 2]],
-// #                [[3, 3, 3],
-// #                 [4, 4, 4]],
-// #                [[5, 5, 5],
-// #                 [6, 6, 6]]]
-// # tensor 't' has shape [3, 2, 3]
-// # pass '[-1]' to flatten 't'
-// reshape(t, [-1]) ==> [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6]
-//
-// # -1 can also be used to infer the shape
-//
-// # -1 is inferred to be 9:
-// reshape(t, [2, -1]) ==> [[1, 1, 1, 2, 2, 2, 3, 3, 3],
-//                          [4, 4, 4, 5, 5, 5, 6, 6, 6]]
-// # -1 is inferred to be 2:
-// reshape(t, [-1, 9]) ==> [[1, 1, 1, 2, 2, 2, 3, 3, 3],
-//                          [4, 4, 4, 5, 5, 5, 6, 6, 6]]
-// # -1 is inferred to be 3:
-// reshape(t, [ 2, -1, 3]) ==> [[[1, 1, 1],
-//                               [2, 2, 2],
-//                               [3, 3, 3]],
-//                              [[4, 4, 4],
-//                               [5, 5, 5],
-//                               [6, 6, 6]]]
-//
-// # tensor 't' is [7]
-// # shape `[]` reshapes to a scalar
-// reshape(t, []) ==> 7
-// ```
-//
-// Arguments:
-//
-//	shape: Defines the shape of the output tensor.
-func Reshape(scope *Scope, tensor tf.Output, shape tf.Output) (output tf.Output) {
+// Returns 4-D with shape
+// `[batch, new_height, new_width, channels]`.
+func ResizeNearestNeighbor(scope *Scope, images tf.Output, size tf.Output, optional ...ResizeNearestNeighborAttr) (resized_images tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Reshape",
+		Type: "ResizeNearestNeighbor",
 		Input: []tf.Input{
-			tensor, shape,
+			images, size,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a dataset that splits a SparseTensor into elements row-wise.
-func SparseTensorSliceDataset(scope *Scope, indices tf.Output, values tf.Output, dense_shape tf.Output) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SparseTensorSliceDataset",
-		Input: []tf.Input{
-			indices, values, dense_shape,
-		},
+// ResizeNearestNeighborGradAttr is an optional argument to ResizeNearestNeighborGrad.
+type ResizeNearestNeighborGradAttr func(optionalAttr)
+
+// ResizeNearestNeighborGradAlignCorners sets the optional align_corners attribute to value.
+//
+// value: If true, rescale grads by (orig_height - 1) / (height - 1), which
+// exactly aligns the 4 corners of grads and original_image. If false, rescale by
+// orig_height / height. The width dimension is treated similarly.
+// If not specified, defaults to false
+func ResizeNearestNeighborGradAlignCorners(value bool) ResizeNearestNeighborGradAttr {
+	return func(m optionalAttr) {
+		m["align_corners"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Returns x / y element-wise for real types.
+// Computes the gradient of nearest neighbor interpolation.
 //
-// If `x` and `y` are reals, this will return the floating-point division.
+// Arguments:
+//	grads: 4-D with shape `[batch, height, width, channels]`.
+//	size: A 1-D int32 Tensor of 2 elements: `orig_height, orig_width`. The
+// original input size.
 //
-// *NOTE*: `Div` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func RealDiv(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`. Gradients
+// with respect to the input image.
+func ResizeNearestNeighborGrad(scope *Scope, grads tf.Output, size tf.Output, optional ...ResizeNearestNeighborGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "RealDiv",
+		Type: "ResizeNearestNeighborGrad",
 		Input: []tf.Input{
-			x, y,
+			grads, size,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a dataset that concatenates `input_dataset` with `another_dataset`.
-func ConcatenateDataset(scope *Scope, input_dataset tf.Output, another_dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// ExtractJpegShapeAttr is an optional argument to ExtractJpegShape.
+type ExtractJpegShapeAttr func(optionalAttr)
+
+// ExtractJpegShapeOutputType sets the optional output_type attribute to value.
+//
+// value: (Optional) The output type of the operation (int32 or int64).
+// Defaults to int32.
+// If not specified, defaults to DT_INT32
+func ExtractJpegShapeOutputType(value tf.DataType) ExtractJpegShapeAttr {
+	return func(m optionalAttr) {
+		m["output_type"] = value
+	}
+}
+
+// Extract the shape information of a JPEG-encoded image.
+//
+// This op only parses the image header, so it is much faster than DecodeJpeg.
+//
+// Arguments:
+//	contents: 0-D. The JPEG-encoded image.
+//
+// Returns 1-D. The image shape with format [height, width, channels].
+func ExtractJpegShape(scope *Scope, contents tf.Output, optional ...ExtractJpegShapeAttr) (image_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ConcatenateDataset",
+		Type: "ExtractJpegShape",
 		Input: []tf.Input{
-			input_dataset, another_dataset,
+			contents,
 		},
 		Attrs: attrs,
 	}
@@ -19980,123 +20276,143 @@ func ConcatenateDataset(scope *Scope, input_dataset tf.Output, another_dataset t
 	return op.Output(0)
 }
 
-// Adds a value to the current value of a variable.
-//
-// Any ReadVariableOp which depends directly or indirectly on this assign is
-// guaranteed to see the incremented value or a subsequent newer one.
-//
-// Outputs the incremented value, which can be used to totally order the
-// increments to this variable.
+// PaddingFIFOQueueV2Attr is an optional argument to PaddingFIFOQueueV2.
+type PaddingFIFOQueueV2Attr func(optionalAttr)
+
+// PaddingFIFOQueueV2Shapes sets the optional shapes attribute to value.
 //
-// Arguments:
-//	resource: handle to the resource in which to store the variable.
-//	value: the value by which the variable will be incremented.
+// value: The shape of each component in a value. The length of this attr must
+// be either 0 or the same as the length of component_types.
+// Shapes of fixed rank but variable size are allowed by setting
+// any shape dimension to -1.  In this case, the inputs' shape may vary along
+// the given dimension, and DequeueMany will pad the given dimension with
+// zeros up to the maximum shape of all elements in the given batch.
+// If the length of this attr is 0, different queue elements may have
+// different ranks and shapes, but only one element may be dequeued at a time.
+// If not specified, defaults to <>
 //
-// Returns the created operation.
-func AssignAddVariableOp(scope *Scope, resource tf.Output, value tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
+// REQUIRES: len(value) >= 0
+func PaddingFIFOQueueV2Shapes(value []tf.Shape) PaddingFIFOQueueV2Attr {
+	return func(m optionalAttr) {
+		m["shapes"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "AssignAddVariableOp",
-		Input: []tf.Input{
-			resource, value,
-		},
+}
+
+// PaddingFIFOQueueV2Capacity sets the optional capacity attribute to value.
+//
+// value: The upper bound on the number of elements in this queue.
+// Negative numbers mean no limit.
+// If not specified, defaults to -1
+func PaddingFIFOQueueV2Capacity(value int64) PaddingFIFOQueueV2Attr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
 	}
-	return scope.AddOperation(opspec)
 }
 
-// Records the latency of producing `input_dataset` elements in a StatsAggregator.
-func LatencyStatsDataset(scope *Scope, input_dataset tf.Output, tag tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
+// PaddingFIFOQueueV2Container sets the optional container attribute to value.
+//
+// value: If non-empty, this queue is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func PaddingFIFOQueueV2Container(value string) PaddingFIFOQueueV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
-	opspec := tf.OpSpec{
-		Type: "LatencyStatsDataset",
-		Input: []tf.Input{
-			input_dataset, tag,
-		},
-		Attrs: attrs,
+}
+
+// PaddingFIFOQueueV2SharedName sets the optional shared_name attribute to value.
+//
+// value: If non-empty, this queue will be shared under the given name
+// across multiple sessions.
+// If not specified, defaults to ""
+func PaddingFIFOQueueV2SharedName(value string) PaddingFIFOQueueV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Convert JSON-encoded Example records to binary protocol buffer strings.
+// A queue that produces elements in first-in first-out order.
 //
-// This op translates a tensor containing Example records, encoded using
-// the [standard JSON
-// mapping](https://developers.google.com/protocol-buffers/docs/proto3#json),
-// into a tensor containing the same records encoded as binary protocol
-// buffers. The resulting tensor can then be fed to any of the other
-// Example-parsing ops.
+// Variable-size shapes are allowed by setting the corresponding shape dimensions
+// to -1 in the shape attr.  In this case DequeueMany will pad up to the maximum
+// size of any given element in the minibatch.  See below for details.
 //
 // Arguments:
-//	json_examples: Each string is a JSON object serialized according to the JSON
-// mapping of the Example proto.
+//	component_types: The type of each component in a value.
 //
-// Returns Each string is a binary Example protocol buffer corresponding
-// to the respective element of `json_examples`.
-func DecodeJSONExample(scope *Scope, json_examples tf.Output) (binary_examples tf.Output) {
+// Returns The handle to the queue.
+func PaddingFIFOQueueV2(scope *Scope, component_types []tf.DataType, optional ...PaddingFIFOQueueV2Attr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"component_types": component_types}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "DecodeJSONExample",
-		Input: []tf.Input{
-			json_examples,
-		},
+		Type: "PaddingFIFOQueueV2",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the grayscale dilation of 4-D `input` and 3-D `filter` tensors.
+// DecodePngAttr is an optional argument to DecodePng.
+type DecodePngAttr func(optionalAttr)
+
+// DecodePngChannels sets the optional channels attribute to value.
 //
-// The `input` tensor has shape `[batch, in_height, in_width, depth]` and the
-// `filter` tensor has shape `[filter_height, filter_width, depth]`, i.e., each
-// input channel is processed independently of the others with its own structuring
-// function. The `output` tensor has shape
-// `[batch, out_height, out_width, depth]`. The spatial dimensions of the output
-// tensor depend on the `padding` algorithm. We currently only support the default
-// "NHWC" `data_format`.
+// value: Number of color channels for the decoded image.
+// If not specified, defaults to 0
+func DecodePngChannels(value int64) DecodePngAttr {
+	return func(m optionalAttr) {
+		m["channels"] = value
+	}
+}
+
+// DecodePngDtype sets the optional dtype attribute to value.
+// If not specified, defaults to DT_UINT8
+func DecodePngDtype(value tf.DataType) DecodePngAttr {
+	return func(m optionalAttr) {
+		m["dtype"] = value
+	}
+}
+
+// Decode a PNG-encoded image to a uint8 or uint16 tensor.
 //
-// In detail, the grayscale morphological 2-D dilation is the max-sum correlation
-// (for consistency with `conv2d`, we use unmirrored filters):
+// The attr `channels` indicates the desired number of color channels for the
+// decoded image.
 //
-//     output[b, y, x, c] =
-//        max_{dy, dx} input[b,
-//                           strides[1] * y + rates[1] * dy,
-//                           strides[2] * x + rates[2] * dx,
-//                           c] +
-//                     filter[dy, dx, c]
+// Accepted values are:
 //
-// Max-pooling is a special case when the filter has size equal to the pooling
-// kernel size and contains all zeros.
+// *   0: Use the number of channels in the PNG-encoded image.
+// *   1: output a grayscale image.
+// *   3: output an RGB image.
+// *   4: output an RGBA image.
 //
-// Note on duality: The dilation of `input` by the `filter` is equal to the
-// negation of the erosion of `-input` by the reflected `filter`.
+// If needed, the PNG-encoded image is transformed to match the requested number
+// of color channels.
+//
+// This op also supports decoding JPEGs and non-animated GIFs since the interface
+// is the same, though it is cleaner to use `tf.image.decode_image`.
 //
 // Arguments:
-//	input: 4-D with shape `[batch, in_height, in_width, depth]`.
-//	filter: 3-D with shape `[filter_height, filter_width, depth]`.
-//	strides: The stride of the sliding window for each dimension of the input
-// tensor. Must be: `[1, stride_height, stride_width, 1]`.
-//	rates: The input stride for atrous morphological dilation. Must be:
-// `[1, rate_height, rate_width, 1]`.
-//	padding: The type of padding algorithm to use.
+//	contents: 0-D.  The PNG-encoded image.
 //
-// Returns 4-D with shape `[batch, out_height, out_width, depth]`.
-func Dilation2D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, rates []int64, padding string) (output tf.Output) {
+// Returns 3-D with shape `[height, width, channels]`.
+func DecodePng(scope *Scope, contents tf.Output, optional ...DecodePngAttr) (image tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "rates": rates, "padding": padding}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Dilation2D",
+		Type: "DecodePng",
 		Input: []tf.Input{
-			input, filter,
+			contents,
 		},
 		Attrs: attrs,
 	}
@@ -20104,344 +20420,411 @@ func Dilation2D(scope *Scope, input tf.Output, filter tf.Output, strides []int64
 	return op.Output(0)
 }
 
-// Converts the given variant tensor to an iterator and stores it in the given resource.
+// Decode the first frame of a GIF-encoded image to a uint8 tensor.
+//
+// GIFs with frame or transparency compression are not supported.
+// Convert an animated GIF from compressed to uncompressed with:
+//
+//     convert $src.gif -coalesce $dst.gif
+//
+// This op also supports decoding JPEGs and PNGs, though it is cleaner to use
+// `tf.image.decode_image`.
 //
 // Arguments:
-//	resource_handle: A handle to an iterator resource.
-//	serialized: A variant tensor storing the state of the iterator contained in the
-// resource.
+//	contents: 0-D.  The GIF-encoded image.
 //
-// Returns the created operation.
-func DeserializeIterator(scope *Scope, resource_handle tf.Output, serialized tf.Output) (o *tf.Operation) {
+// Returns 4-D with shape `[num_frames, height, width, 3]`. RGB order.
+func DecodeGif(scope *Scope, contents tf.Output) (image tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "DeserializeIterator",
+		Type: "DecodeGif",
 		Input: []tf.Input{
-			resource_handle, serialized,
+			contents,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// TensorArrayConcatV2Attr is an optional argument to TensorArrayConcatV2.
-type TensorArrayConcatV2Attr func(optionalAttr)
+// ResourceApplyCenteredRMSPropAttr is an optional argument to ResourceApplyCenteredRMSProp.
+type ResourceApplyCenteredRMSPropAttr func(optionalAttr)
 
-// TensorArrayConcatV2ElementShapeExcept0 sets the optional element_shape_except0 attribute to value.
-// If not specified, defaults to <unknown_rank:true >
-func TensorArrayConcatV2ElementShapeExcept0(value tf.Shape) TensorArrayConcatV2Attr {
+// ResourceApplyCenteredRMSPropUseLocking sets the optional use_locking attribute to value.
+//
+// value: If `True`, updating of the var, mg, ms, and mom tensors is
+// protected by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyCenteredRMSPropUseLocking(value bool) ResourceApplyCenteredRMSPropAttr {
 	return func(m optionalAttr) {
-		m["element_shape_except0"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Deprecated. Use TensorArrayConcatV3
-func TensorArrayConcatV2(scope *Scope, handle tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayConcatV2Attr) (value tf.Output, lengths tf.Output) {
+// Update '*var' according to the centered RMSProp algorithm.
+//
+// The centered RMSProp algorithm uses an estimate of the centered second moment
+// (i.e., the variance) for normalization, as opposed to regular RMSProp, which
+// uses the (uncentered) second moment. This often helps with training, but is
+// slightly more expensive in terms of computation and memory.
+//
+// Note that in the dense implementation of this algorithm, mg, ms, and mom will
+// update even if the grad is zero, but in this sparse implementation, mg, ms,
+// and mom will not update in iterations during which the grad is zero.
+//
+// mean_square = decay * mean_square + (1-decay) * gradient ** 2
+// mean_grad = decay * mean_grad + (1-decay) * gradient
+//
+// Delta = learning_rate * gradient / sqrt(mean_square + epsilon - mean_grad ** 2)
+//
+// mg <- rho * mg_{t-1} + (1-rho) * grad
+// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
+// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms - mg * mg + epsilon)
+// var <- var - mom
+//
+// Arguments:
+//	var_: Should be from a Variable().
+//	mg: Should be from a Variable().
+//	ms: Should be from a Variable().
+//	mom: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	rho: Decay rate. Must be a scalar.
+//
+//	epsilon: Ridge term. Must be a scalar.
+//	grad: The gradient.
+//
+// Returns the created operation.
+func ResourceApplyCenteredRMSProp(scope *Scope, var_ tf.Output, mg tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyCenteredRMSPropAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayConcatV2",
+		Type: "ResourceApplyCenteredRMSProp",
 		Input: []tf.Input{
-			handle, flow_in,
+			var_, mg, ms, mom, lr, rho, momentum, epsilon, grad,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return scope.AddOperation(opspec)
 }
 
-// Creates a dataset that batches and pads `batch_size` elements from the input.
+// Returns a list of tensors with the same shapes and contents as the input
 //
-// Arguments:
+// tensors.
 //
-//	batch_size: A scalar representing the number of elements to accumulate in a
-// batch.
-//	padded_shapes: A list of int64 tensors representing the desired padded shapes
-// of the corresponding output components. These shapes may be partially
-// specified, using `-1` to indicate that a particular dimension should be
-// padded to the maximum size of all batch elements.
-//	padding_values: A list of scalars containing the padding value to use for
-// each of the outputs.
+// This op can be used to override the gradient for complicated functions. For
+// example, suppose y = f(x) and we wish to apply a custom function g for backprop
+// such that dx = g(dy). In Python,
 //
-func PaddedBatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, padded_shapes []tf.Output, padding_values []tf.Output, output_shapes []tf.Shape) (handle tf.Output) {
+// ```python
+// with tf.get_default_graph().gradient_override_map(
+//     {'IdentityN': 'OverrideGradientWithG'}):
+//   y, _ = identity_n([f(x), x])
+//
+// @tf.RegisterGradient('OverrideGradientWithG')
+// def ApplyG(op, dy, _):
+//   return [None, g(dy)]  # Do not backprop to f(x).
+// ```
+func IdentityN(scope *Scope, input []tf.Output) (output []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "PaddedBatchDataset",
+		Type: "IdentityN",
 		Input: []tf.Input{
-			input_dataset, batch_size, tf.OutputList(padded_shapes), tf.OutputList(padding_values),
+			tf.OutputList(input),
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
+		scope.UpdateErr("IdentityN", err)
+		return
+	}
+	return output
 }
 
-// Creates a dataset that batches input elements into a SparseTensor.
-//
-// Arguments:
-//	input_dataset: A handle to an input dataset. Must have a single component.
-//	batch_size: A scalar representing the number of elements to accumulate in a
-// batch.
-//	row_shape: A vector representing the dense shape of each row in the produced
-// SparseTensor. The shape may be partially specified, using `-1` to indicate
-// that a particular dimension should use the maximum size of all batch elements.
-//
+// Computes the gradient of the sigmoid of `x` wrt its input.
 //
-func DenseToSparseBatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, row_shape tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// Specifically, `grad = dy * y * (1 - y)`, where `y = sigmoid(x)`, and
+// `dy` is the corresponding input gradient.
+func SigmoidGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "DenseToSparseBatchDataset",
+		Type: "SigmoidGrad",
 		Input: []tf.Input{
-			input_dataset, batch_size, row_shape,
+			y, dy,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Deprecated. Use TensorArrayGradV3
+// Convert one or more images from HSV to RGB.
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayGradV3
-func TensorArrayGradV2(scope *Scope, handle tf.Output, flow_in tf.Output, source string) (grad_handle tf.Output) {
+// Outputs a tensor of the same shape as the `images` tensor, containing the RGB
+// value of the pixels. The output is only well defined if the values in `images`
+// are in `[0,1]`.
+//
+// See `rgb_to_hsv` for a description of the HSV encoding.
+//
+// Arguments:
+//	images: 1-D or higher rank. HSV data to convert. Last dimension must be size 3.
+//
+// Returns `images` converted to RGB.
+func HSVToRGB(scope *Scope, images tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"source": source}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayGradV2",
+		Type: "HSVToRGB",
 		Input: []tf.Input{
-			handle, flow_in,
+			images,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceSparseApplyAdadeltaAttr is an optional argument to ResourceSparseApplyAdadelta.
-type ResourceSparseApplyAdadeltaAttr func(optionalAttr)
+// SampleDistortedBoundingBoxV2Attr is an optional argument to SampleDistortedBoundingBoxV2.
+type SampleDistortedBoundingBoxV2Attr func(optionalAttr)
 
-// ResourceSparseApplyAdadeltaUseLocking sets the optional use_locking attribute to value.
+// SampleDistortedBoundingBoxV2Seed sets the optional seed attribute to value.
 //
-// value: If True, updating of the var and accum tensors will be protected by
-// a lock; otherwise the behavior is undefined, but may exhibit less contention.
-// If not specified, defaults to false
-func ResourceSparseApplyAdadeltaUseLocking(value bool) ResourceSparseApplyAdadeltaAttr {
+// value: If either `seed` or `seed2` is set to non-zero, the random number
+// generator is seeded by the given `seed`.  Otherwise, it is seeded by a random
+// seed.
+// If not specified, defaults to 0
+func SampleDistortedBoundingBoxV2Seed(value int64) SampleDistortedBoundingBoxV2Attr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["seed"] = value
 	}
 }
 
-// var: Should be from a Variable().
-//
-// Arguments:
+// SampleDistortedBoundingBoxV2Seed2 sets the optional seed2 attribute to value.
 //
-//	accum: Should be from a Variable().
-//	accum_update: : Should be from a Variable().
-//	lr: Learning rate. Must be a scalar.
-//	rho: Decay factor. Must be a scalar.
-//	epsilon: Constant factor. Must be a scalar.
-//	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var and accum.
-//
-// Returns the created operation.
-func ResourceSparseApplyAdadelta(scope *Scope, var_ tf.Output, accum tf.Output, accum_update tf.Output, lr tf.Output, rho tf.Output, epsilon tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyAdadeltaAttr) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyAdadelta",
-		Input: []tf.Input{
-			var_, accum, accum_update, lr, rho, epsilon, grad, indices,
-		},
-		Attrs: attrs,
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func SampleDistortedBoundingBoxV2Seed2(value int64) SampleDistortedBoundingBoxV2Attr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
 	}
-	return scope.AddOperation(opspec)
 }
 
-// Identity op for gradient debugging.
+// SampleDistortedBoundingBoxV2AspectRatioRange sets the optional aspect_ratio_range attribute to value.
 //
-// This op is hidden from public in Python. It is used by TensorFlow Debugger to
-// register gradient tensors for gradient debugging.
-// This op operates on non-reference-type tensors.
-func DebugGradientIdentity(scope *Scope, input tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "DebugGradientIdentity",
-		Input: []tf.Input{
-			input,
-		},
+// value: The cropped area of the image must have an aspect ratio =
+// width / height within this range.
+// If not specified, defaults to <f:0.75 f:1.33 >
+func SampleDistortedBoundingBoxV2AspectRatioRange(value []float32) SampleDistortedBoundingBoxV2Attr {
+	return func(m optionalAttr) {
+		m["aspect_ratio_range"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Return substrings from `Tensor` of strings.
-//
-// For each string in the input `Tensor`, creates a substring starting at index
-// `pos` with a total length of `len`.
-//
-// If `len` defines a substring that would extend beyond the length of the input
-// string, then as many characters as possible are used.
-//
-// If `pos` is negative or specifies a character index larger than any of the input
-// strings, then an `InvalidArgumentError` is thrown.
-//
-// `pos` and `len` must have the same shape, otherwise a `ValueError` is thrown on
-// Op creation.
+// SampleDistortedBoundingBoxV2AreaRange sets the optional area_range attribute to value.
 //
-// *NOTE*: `Substr` supports broadcasting up to two dimensions. More about
-// broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+// value: The cropped area of the image must contain a fraction of the
+// supplied image within this range.
+// If not specified, defaults to <f:0.05 f:1 >
+func SampleDistortedBoundingBoxV2AreaRange(value []float32) SampleDistortedBoundingBoxV2Attr {
+	return func(m optionalAttr) {
+		m["area_range"] = value
+	}
+}
+
+// SampleDistortedBoundingBoxV2MaxAttempts sets the optional max_attempts attribute to value.
 //
-// ---
+// value: Number of attempts at generating a cropped region of the image
+// that satisfies the specified constraints. After `max_attempts` failures, return the entire
+// image.
+// If not specified, defaults to 100
+func SampleDistortedBoundingBoxV2MaxAttempts(value int64) SampleDistortedBoundingBoxV2Attr {
+	return func(m optionalAttr) {
+		m["max_attempts"] = value
+	}
+}
+
+// SampleDistortedBoundingBoxV2UseImageIfNoBoundingBoxes sets the optional use_image_if_no_bounding_boxes attribute to value.
 //
-// Examples
+// value: Controls behavior if no bounding boxes are supplied.
+// If true, assume an implicit bounding box covering the whole input. If false,
+// raise an error.
+// If not specified, defaults to false
+func SampleDistortedBoundingBoxV2UseImageIfNoBoundingBoxes(value bool) SampleDistortedBoundingBoxV2Attr {
+	return func(m optionalAttr) {
+		m["use_image_if_no_bounding_boxes"] = value
+	}
+}
+
+// Generate a single randomly distorted bounding box for an image.
 //
-// Using scalar `pos` and `len`:
+// Bounding box annotations are often supplied in addition to ground-truth labels
+// in image recognition or object localization tasks. A common technique for
+// training such a system is to randomly distort an image while preserving
+// its content, i.e. *data augmentation*. This Op outputs a randomly distorted
+// localization of an object, i.e. bounding box, given an `image_size`,
+// `bounding_boxes` and a series of constraints.
 //
-// ```python
-// input = [b'Hello', b'World']
-// position = 1
-// length = 3
+// The output of this Op is a single bounding box that may be used to crop the
+// original image. The output is returned as 3 tensors: `begin`, `size` and
+// `bboxes`. The first 2 tensors can be fed directly into `tf.slice` to crop the
+// image. The latter may be supplied to `tf.image.draw_bounding_boxes` to visualize
+// what the bounding box looks like.
 //
-// output = [b'ell', b'orl']
-// ```
+// Bounding boxes are supplied and returned as `[y_min, x_min, y_max, x_max]`. The
+// bounding box coordinates are floats in `[0.0, 1.0]` relative to the width and
+// height of the underlying image.
 //
-// Using `pos` and `len` with same shape as `input`:
+// For example,
 //
 // ```python
-// input = [[b'ten', b'eleven', b'twelve'],
-//          [b'thirteen', b'fourteen', b'fifteen'],
-//          [b'sixteen', b'seventeen', b'eighteen']]
-// position = [[1, 2, 3],
-//             [1, 2, 3],
-//             [1, 2, 3]]
-// length =   [[2, 3, 4],
-//             [4, 3, 2],
-//             [5, 5, 5]]
-//
-// output = [[b'en', b'eve', b'lve'],
-//           [b'hirt', b'urt', b'te'],
-//           [b'ixtee', b'vente', b'hteen']]
-// ```
-//
-// Broadcasting `pos` and `len` onto `input`:
-//
-// ```
-// input = [[b'ten', b'eleven', b'twelve'],
-//          [b'thirteen', b'fourteen', b'fifteen'],
-//          [b'sixteen', b'seventeen', b'eighteen'],
-//          [b'nineteen', b'twenty', b'twentyone']]
-// position = [1, 2, 3]
-// length =   [1, 2, 3]
-//
-// output = [[b'e', b'ev', b'lve'],
-//           [b'h', b'ur', b'tee'],
-//           [b'i', b've', b'hte'],
-//           [b'i', b'en', b'nty']]
-// ```
+//     # Generate a single distorted bounding box.
+//     begin, size, bbox_for_draw = tf.image.sample_distorted_bounding_box(
+//         tf.shape(image),
+//         bounding_boxes=bounding_boxes)
 //
-// Broadcasting `input` onto `pos` and `len`:
+//     # Draw the bounding box in an image summary.
+//     image_with_box = tf.image.draw_bounding_boxes(tf.expand_dims(image, 0),
+//                                                   bbox_for_draw)
+//     tf.summary.image('images_with_box', image_with_box)
 //
+//     # Employ the bounding box to distort the image.
+//     distorted_image = tf.slice(image, begin, size)
 // ```
-// input = b'thirteen'
-// position = [1, 5, 7]
-// length =   [3, 2, 1]
 //
-// output = [b'hir', b'ee', b'n']
-// ```
+// Note that if no bounding box information is available, setting
+// `use_image_if_no_bounding_boxes = true` will assume there is a single implicit
+// bounding box covering the whole image. If `use_image_if_no_bounding_boxes` is
+// false and no bounding boxes are supplied, an error is raised.
 //
 // Arguments:
-//	input: Tensor of strings
-//	pos: Scalar defining the position of first character in each substring
-//	len: Scalar defining the number of characters to include in each substring
+//	image_size: 1-D, containing `[height, width, channels]`.
+//	bounding_boxes: 3-D with shape `[batch, N, 4]` describing the N bounding boxes
+// associated with the image.
+//	min_object_covered: The cropped area of the image must contain at least this
+// fraction of any bounding box supplied. The value of this parameter should be
+// non-negative. In the case of 0, the cropped area does not need to overlap
+// any of the bounding boxes supplied.
 //
-// Returns Tensor of substrings
-func Substr(scope *Scope, input tf.Output, pos tf.Output, len tf.Output) (output tf.Output) {
+// Returns 1-D, containing `[offset_height, offset_width, 0]`. Provide as input to
+// `tf.slice`. 1-D, containing `[target_height, target_width, -1]`. Provide as input to
+// `tf.slice`. 3-D with shape `[1, 1, 4]` containing the distorted bounding box.
+// Provide as input to `tf.image.draw_bounding_boxes`.
+func SampleDistortedBoundingBoxV2(scope *Scope, image_size tf.Output, bounding_boxes tf.Output, min_object_covered tf.Output, optional ...SampleDistortedBoundingBoxV2Attr) (begin tf.Output, size tf.Output, bboxes tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Substr",
+		Type: "SampleDistortedBoundingBoxV2",
 		Input: []tf.Input{
-			input, pos, len,
+			image_size, bounding_boxes, min_object_covered,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Creates a Dataset that returns pseudorandom numbers.
-//
-// Arguments:
-//	seed: A scalar seed for the random number generator. If either seed or
-// seed2 is set to be non-zero, the random number generator is seeded
-// by the given seed.  Otherwise, a random seed is used.
-//	seed2: A second scalar seed to avoid seed collision.
-//
+// ExtractGlimpseAttr is an optional argument to ExtractGlimpse.
+type ExtractGlimpseAttr func(optionalAttr)
+
+// ExtractGlimpseCentered sets the optional centered attribute to value.
 //
-func RandomDataset(scope *Scope, seed tf.Output, seed2 tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
+// value: indicates if the offset coordinates are centered relative to
+// the image, in which case the (0, 0) offset is relative to the center
+// of the input images. If false, the (0,0) offset corresponds to the
+// upper left corner of the input images.
+// If not specified, defaults to true
+func ExtractGlimpseCentered(value bool) ExtractGlimpseAttr {
+	return func(m optionalAttr) {
+		m["centered"] = value
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
-	opspec := tf.OpSpec{
-		Type: "RandomDataset",
-		Input: []tf.Input{
-			seed, seed2,
-		},
-		Attrs: attrs,
+}
+
+// ExtractGlimpseNormalized sets the optional normalized attribute to value.
+//
+// value: indicates if the offset coordinates are normalized.
+// If not specified, defaults to true
+func ExtractGlimpseNormalized(value bool) ExtractGlimpseAttr {
+	return func(m optionalAttr) {
+		m["normalized"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Creates a dataset that shuffles and repeats elements from `input_dataset`
+// ExtractGlimpseUniformNoise sets the optional uniform_noise attribute to value.
 //
-// pseudorandomly.
+// value: indicates if the noise should be generated using a
+// uniform distribution or a Gaussian distribution.
+// If not specified, defaults to true
+func ExtractGlimpseUniformNoise(value bool) ExtractGlimpseAttr {
+	return func(m optionalAttr) {
+		m["uniform_noise"] = value
+	}
+}
+
+// Extracts a glimpse from the input tensor.
 //
-// Arguments:
+// Returns a set of windows called glimpses extracted at location
+// `offsets` from the input tensor. If the windows only partially
+// overlap the inputs, the non-overlapping areas will be filled with
+// random noise.
 //
-//	buffer_size: The number of output elements to buffer in an iterator over
-// this dataset. Compare with the `min_after_dequeue` attr when creating a
-// `RandomShuffleQueue`.
-//	seed: A scalar seed for the random number generator. If either `seed` or
-// `seed2` is set to be non-zero, the random number generator is seeded
-// by the given seed.  Otherwise, a random seed is used.
-//	seed2: A second scalar seed to avoid seed collision.
-//	count: A scalar representing the number of times the underlying dataset
-// should be repeated. The default is `-1`, which results in infinite repetition.
+// The result is a 4-D tensor of shape `[batch_size, glimpse_height,
+// glimpse_width, channels]`. The channels and batch dimensions are the
+// same as that of the input tensor. The height and width of the output
+// windows are specified in the `size` parameter.
 //
+// The arguments `normalized` and `centered` control how the windows are built:
 //
-func ShuffleAndRepeatDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, seed tf.Output, seed2 tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// * If the coordinates are normalized but not centered, 0.0 and 1.0
+//   correspond to the minimum and maximum of each height and width
+//   dimension.
+// * If the coordinates are both normalized and centered, they range from
+//   -1.0 to 1.0. The coordinates (-1.0, -1.0) correspond to the upper
+//   left corner, the lower right corner is located at (1.0, 1.0) and the
+//   center is at (0, 0).
+// * If the coordinates are not normalized they are interpreted as
+//   numbers of pixels.
+//
+// Arguments:
+//	input: A 4-D float tensor of shape `[batch_size, height, width, channels]`.
+//	size: A 1-D tensor of 2 elements containing the size of the glimpses
+// to extract.  The glimpse height must be specified first, followed
+// by the glimpse width.
+//	offsets: A 2-D integer tensor of shape `[batch_size, 2]` containing
+// the y, x locations of the center of each window.
+//
+// Returns A tensor representing the glimpses `[batch_size,
+// glimpse_height, glimpse_width, channels]`.
+func ExtractGlimpse(scope *Scope, input tf.Output, size tf.Output, offsets tf.Output, optional ...ExtractGlimpseAttr) (glimpse tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ShuffleAndRepeatDataset",
+		Type: "ExtractGlimpse",
 		Input: []tf.Input{
-			input_dataset, buffer_size, seed, seed2, count,
+			input, size, offsets,
 		},
 		Attrs: attrs,
 	}
@@ -20449,581 +20832,510 @@ func ShuffleAndRepeatDataset(scope *Scope, input_dataset tf.Output, buffer_size
 	return op.Output(0)
 }
 
-// Creates a dataset that caches elements from `input_dataset`.
-//
-// A CacheDataset will iterate over the input_dataset, and store tensors. If the
-// cache already exists, the cache will be used. If the cache is inappropriate
-// (e.g. cannot be opened, contains tensors of the wrong shape / size), an error
-// will the returned when used.
-//
-// Arguments:
-//
-//	filename: A path on the filesystem where we should cache the dataset. Note: this
-// will be a directory.
-//
+// A container for an iterator resource.
 //
-func CacheDataset(scope *Scope, input_dataset tf.Output, filename tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// Returns A handle to the iterator that can be passed to a "MakeIterator"
+// or "IteratorGetNext" op.
+func Iterator(scope *Scope, shared_name string, container string, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{"shared_name": shared_name, "container": container, "output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "CacheDataset",
-		Input: []tf.Input{
-			input_dataset, filename,
-		},
+		Type: "Iterator",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// PlaceholderAttr is an optional argument to Placeholder.
-type PlaceholderAttr func(optionalAttr)
+// ShuffleDatasetAttr is an optional argument to ShuffleDataset.
+type ShuffleDatasetAttr func(optionalAttr)
 
-// PlaceholderShape sets the optional shape attribute to value.
+// ShuffleDatasetReshuffleEachIteration sets the optional reshuffle_each_iteration attribute to value.
 //
-// value: (Optional) The shape of the tensor. If the shape has 0 dimensions, the
-// shape is unconstrained.
-// If not specified, defaults to <unknown_rank:true >
-func PlaceholderShape(value tf.Shape) PlaceholderAttr {
+// value: If true, each iterator over this dataset will be given
+// a different pseudorandomly generated seed, based on a sequence seeded by the
+// `seed` and `seed2` inputs. If false, each iterator will be given the same
+// seed, and repeated iteration over this dataset will yield the exact same
+// sequence of results.
+// If not specified, defaults to true
+func ShuffleDatasetReshuffleEachIteration(value bool) ShuffleDatasetAttr {
 	return func(m optionalAttr) {
-		m["shape"] = value
+		m["reshuffle_each_iteration"] = value
 	}
 }
 
-// A placeholder op for a value that will be fed into the computation.
-//
-// N.B. This operation will fail with an error if it is executed. It is
-// intended as a way to represent a value that will always be fed, and to
-// provide attrs that enable the fed value to be checked at runtime.
+// Creates a dataset that shuffles elements from `input_dataset` pseudorandomly.
 //
 // Arguments:
-//	dtype: The type of elements in the tensor.
 //
-// Returns A placeholder tensor that must be replaced using the feed mechanism.
-func Placeholder(scope *Scope, dtype tf.DataType, optional ...PlaceholderAttr) (output tf.Output) {
+//	buffer_size: The number of output elements to buffer in an iterator over
+// this dataset. Compare with the `min_after_dequeue` attr when creating a
+// `RandomShuffleQueue`.
+//	seed: A scalar seed for the random number generator. If either `seed` or
+// `seed2` is set to be non-zero, the random number generator is seeded
+// by the given seed.  Otherwise, a random seed is used.
+//	seed2: A second scalar seed to avoid seed collision.
+//
+//
+func ShuffleDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, seed tf.Output, seed2 tf.Output, output_types []tf.DataType, output_shapes []tf.Shape, optional ...ShuffleDatasetAttr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Placeholder",
-
+		Type: "ShuffleDataset",
+		Input: []tf.Input{
+			input_dataset, buffer_size, seed, seed2,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a dataset that executes a SQL query and emits rows of the result set.
+// 3D fast Fourier transform.
+//
+// Computes the 3-dimensional discrete Fourier transform over the inner-most 3
+// dimensions of `input`.
 //
 // Arguments:
-//	driver_name: The database type. Currently, the only supported type is 'sqlite'.
-//	data_source_name: A connection string to connect to the database.
-//	query: A SQL query to execute.
+//	input: A complex64 tensor.
 //
+// Returns A complex64 tensor of the same shape as `input`. The inner-most 3
+//   dimensions of `input` are replaced with their 3D Fourier transform.
 //
-func SqlDataset(scope *Scope, driver_name tf.Output, data_source_name tf.Output, query tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// @compatibility(numpy)
+// Equivalent to np.fft.fftn with 3 dimensions.
+// @end_compatibility
+func FFT3D(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "SqlDataset",
+		Type: "FFT3D",
 		Input: []tf.Input{
-			driver_name, data_source_name, query,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a dataset that emits the records from one or more binary files.
+// CropAndResizeGradBoxesAttr is an optional argument to CropAndResizeGradBoxes.
+type CropAndResizeGradBoxesAttr func(optionalAttr)
+
+// CropAndResizeGradBoxesMethod sets the optional method attribute to value.
+//
+// value: A string specifying the interpolation method. Only 'bilinear' is
+// supported for now.
+// If not specified, defaults to "bilinear"
+func CropAndResizeGradBoxesMethod(value string) CropAndResizeGradBoxesAttr {
+	return func(m optionalAttr) {
+		m["method"] = value
+	}
+}
+
+// Computes the gradient of the crop_and_resize op wrt the input boxes tensor.
 //
 // Arguments:
-//	filenames: A scalar or a vector containing the name(s) of the file(s) to be
-// read.
-//	header_bytes: A scalar representing the number of bytes to skip at the
-// beginning of a file.
-//	record_bytes: A scalar representing the number of bytes in each record.
-//	footer_bytes: A scalar representing the number of bytes to skip at the end
-// of a file.
-//	buffer_size: A scalar representing the number of bytes to buffer. Must be > 0.
-func FixedLengthRecordDataset(scope *Scope, filenames tf.Output, header_bytes tf.Output, record_bytes tf.Output, footer_bytes tf.Output, buffer_size tf.Output) (handle tf.Output) {
+//	grads: A 4-D tensor of shape `[num_boxes, crop_height, crop_width, depth]`.
+//	image: A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
+// Both `image_height` and `image_width` need to be positive.
+//	boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
+// specifies the coordinates of a box in the `box_ind[i]` image and is specified
+// in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
+// `y` is mapped to the image coordinate at `y * (image_height - 1)`, so that the
+// `[0, 1]` interval of normalized image height is mapped to
+// `[0, image_height - 1]` in image height coordinates. We do allow y1 > y2, in
+// which case the sampled crop is an up-down flipped version of the original
+// image. The width dimension is treated similarly. Normalized coordinates
+// outside the `[0, 1]` range are allowed, in which case we use
+// `extrapolation_value` to extrapolate the input image values.
+//	box_ind: A 1-D tensor of shape `[num_boxes]` with int32 values in `[0, batch)`.
+// The value of `box_ind[i]` specifies the image that the `i`-th box refers to.
+//
+// Returns A 2-D tensor of shape `[num_boxes, 4]`.
+func CropAndResizeGradBoxes(scope *Scope, grads tf.Output, image tf.Output, boxes tf.Output, box_ind tf.Output, optional ...CropAndResizeGradBoxesAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "FixedLengthRecordDataset",
+		Type: "CropAndResizeGradBoxes",
 		Input: []tf.Input{
-			filenames, header_bytes, record_bytes, footer_bytes, buffer_size,
+			grads, image, boxes, box_ind,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Slice a `SparseTensor` based on the `start` and `size`.
-//
-// For example, if the input is
-//
-//     input_tensor = shape = [2, 7]
-//     [    a   d e  ]
-//     [b c          ]
-//
-// Graphically the output tensors are:
-//
-//     sparse_slice([0, 0], [2, 4]) = shape = [2, 4]
-//     [    a  ]
-//     [b c    ]
+// Saves tensors in V2 checkpoint format.
 //
-//     sparse_slice([0, 4], [2, 3]) = shape = [2, 3]
-//     [ d e  ]
-//     [      ]
+// By default, saves the named tensors in full.  If the caller wishes to save
+// specific slices of full tensors, "shape_and_slices" should be non-empty strings
+// and correspondingly well-formed.
 //
 // Arguments:
-//	indices: 2-D tensor represents the indices of the sparse tensor.
-//	values: 1-D tensor represents the values of the sparse tensor.
-//	shape: 1-D. tensor represents the shape of the sparse tensor.
-//	start: 1-D. tensor represents the start of the slice.
-//	size: 1-D. tensor represents the size of the slice.
-// output indices: A list of 1-D tensors represents the indices of the output
-// sparse tensors.
+//	prefix: Must have a single element. The prefix of the V2 checkpoint to which we
+// write the tensors.
+//	tensor_names: shape {N}. The names of the tensors to be saved.
+//	shape_and_slices: shape {N}.  The slice specs of the tensors to be saved.
+// Empty strings indicate that they are non-partitioned tensors.
+//	tensors: `N` tensors to save.
 //
-// Returns A list of 1-D tensors represents the values of the output sparse
-// tensors.A list of 1-D tensors represents the shape of the output sparse
-// tensors.
-func SparseSlice(scope *Scope, indices tf.Output, values tf.Output, shape tf.Output, start tf.Output, size tf.Output) (output_indices tf.Output, output_values tf.Output, output_shape tf.Output) {
+// Returns the created operation.
+func SaveV2(scope *Scope, prefix tf.Output, tensor_names tf.Output, shape_and_slices tf.Output, tensors []tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseSlice",
+		Type: "SaveV2",
 		Input: []tf.Input{
-			indices, values, shape, start, size,
+			prefix, tensor_names, shape_and_slices, tf.OutputList(tensors),
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return scope.AddOperation(opspec)
 }
 
-// Concatenates quantized tensors along one dimension.
-//
-// Arguments:
-//	concat_dim: 0-D.  The dimension along which to concatenate.  Must be in the
-// range [0, rank(values)).
-//	values: The `N` Tensors to concatenate. Their ranks and types must match,
-// and their sizes must match in all dimensions except `concat_dim`.
-//	input_mins: The minimum scalar values for each of the input tensors.
-//	input_maxes: The maximum scalar values for each of the input tensors.
-//
-// Returns A `Tensor` with the concatenation of values stacked along the
-// `concat_dim` dimension.  This tensor's shape matches that of `values` except
-// in `concat_dim` where it has the sum of the sizes.The float value that the minimum quantized output value represents.The float value that the maximum quantized output value represents.
-func QuantizedConcat(scope *Scope, concat_dim tf.Output, values []tf.Output, input_mins []tf.Output, input_maxes []tf.Output) (output tf.Output, output_min tf.Output, output_max tf.Output) {
-	if scope.Err() != nil {
-		return
+// StatsAggregatorHandleAttr is an optional argument to StatsAggregatorHandle.
+type StatsAggregatorHandleAttr func(optionalAttr)
+
+// StatsAggregatorHandleContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func StatsAggregatorHandleContainer(value string) StatsAggregatorHandleAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "QuantizedConcat",
-		Input: []tf.Input{
-			concat_dim, tf.OutputList(values), tf.OutputList(input_mins), tf.OutputList(input_maxes),
-		},
+}
+
+// StatsAggregatorHandleSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func StatsAggregatorHandleSharedName(value string) StatsAggregatorHandleAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Gradients for batch normalization.
-//
-// DEPRECATED at GraphDef version 9: Use tf.nn.batch_normalization()
-//
-// This op is deprecated. See `tf.nn.batch_normalization`.
-//
-// Arguments:
-//	t: A 4D input Tensor.
-//	m: A 1D mean Tensor with size matching the last dimension of t.
-// This is the first output from tf.nn.moments,
-// or a saved moving average thereof.
-//	v: A 1D variance Tensor with size matching the last dimension of t.
-// This is the second output from tf.nn.moments,
-// or a saved moving average thereof.
-//	gamma: A 1D gamma Tensor with size matching the last dimension of t.
-// If "scale_after_normalization" is true, this Tensor will be multiplied
-// with the normalized Tensor.
-//	backprop: 4D backprop Tensor.
-//	variance_epsilon: A small float number to avoid dividing by 0.
-//	scale_after_normalization: A bool indicating whether the resulted tensor
-// needs to be multiplied with gamma.
-//
-// Returns 4D backprop tensor for input.1D backprop tensor for mean.1D backprop tensor for variance.1D backprop tensor for beta.1D backprop tensor for gamma.
-func BatchNormWithGlobalNormalizationGrad(scope *Scope, t tf.Output, m tf.Output, v tf.Output, gamma tf.Output, backprop tf.Output, variance_epsilon float32, scale_after_normalization bool) (dx tf.Output, dm tf.Output, dv tf.Output, db tf.Output, dg tf.Output) {
+// Creates a statistics manager resource.
+func StatsAggregatorHandle(scope *Scope, optional ...StatsAggregatorHandleAttr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"variance_epsilon": variance_epsilon, "scale_after_normalization": scale_after_normalization}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "BatchNormWithGlobalNormalizationGrad",
-		Input: []tf.Input{
-			t, m, v, gamma, backprop,
-		},
+		Type: "StatsAggregatorHandle",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
+	return op.Output(0)
 }
 
-// Creates a dataset that emits the records from one or more TFRecord files.
+// Greedily selects a subset of bounding boxes in descending order of score,
+//
+// pruning away boxes that have high intersection-over-union (IOU) overlap
+// with previously selected boxes.  Bounding boxes are supplied as
+// [y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any
+// diagonal pair of box corners and the coordinates can be provided as normalized
+// (i.e., lying in the interval [0, 1]) or absolute.  Note that this algorithm
+// is agnostic to where the origin is in the coordinate system.  Note that this
+// algorithm is invariant to orthogonal transformations and translations
+// of the coordinate system; thus translations or reflections of the coordinate
+// system result in the same boxes being selected by the algorithm.
+//
+// The output of this operation is a set of integers indexing into the input
+// collection of bounding boxes representing the selected boxes.  The bounding
+// box coordinates corresponding to the selected indices can then be obtained
+// using the `tf.gather` operation.  For example:
+//
+//   selected_indices = tf.image.non_max_suppression_v2(
+//       boxes, scores, max_output_size, iou_threshold)
+//   selected_boxes = tf.gather(boxes, selected_indices)
 //
 // Arguments:
-//	filenames: A scalar or vector containing the name(s) of the file(s) to be
-// read.
-//	compression_type: A scalar containing either (i) the empty string (no
-// compression), (ii) "ZLIB", or (iii) "GZIP".
-//	buffer_size: A scalar representing the number of bytes to buffer. A value of
-// 0 means no buffering will be performed.
-func TFRecordDataset(scope *Scope, filenames tf.Output, compression_type tf.Output, buffer_size tf.Output) (handle tf.Output) {
+//	boxes: A 2-D float tensor of shape `[num_boxes, 4]`.
+//	scores: A 1-D float tensor of shape `[num_boxes]` representing a single
+// score corresponding to each box (each row of boxes).
+//	max_output_size: A scalar integer tensor representing the maximum number of
+// boxes to be selected by non max suppression.
+//	iou_threshold: A 0-D float tensor representing the threshold for deciding whether
+// boxes overlap too much with respect to IOU.
+//
+// Returns A 1-D integer tensor of shape `[M]` representing the selected
+// indices from the boxes tensor, where `M <= max_output_size`.
+func NonMaxSuppressionV2(scope *Scope, boxes tf.Output, scores tf.Output, max_output_size tf.Output, iou_threshold tf.Output) (selected_indices tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TFRecordDataset",
+		Type: "NonMaxSuppressionV2",
 		Input: []tf.Input{
-			filenames, compression_type, buffer_size,
+			boxes, scores, max_output_size, iou_threshold,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// BatchToSpace for 4-D tensors of type T.
-//
-// This is a legacy version of the more general BatchToSpaceND.
-//
-// Rearranges (permutes) data from batch into blocks of spatial data, followed by
-// cropping. This is the reverse transformation of SpaceToBatch. More specifically,
-// this op outputs a copy of the input tensor where values from the `batch`
-// dimension are moved in spatial blocks to the `height` and `width` dimensions,
-// followed by cropping along the `height` and `width` dimensions.
-//
-// Arguments:
-//	input: 4-D tensor with shape
-// `[batch*block_size*block_size, height_pad/block_size, width_pad/block_size,
-//   depth]`. Note that the batch size of the input tensor must be divisible by
-// `block_size * block_size`.
-//	crops: 2-D tensor of non-negative integers with shape `[2, 2]`. It specifies
-// how many elements to crop from the intermediate result across the spatial
-// dimensions as follows:
-//
-//     crops = [[crop_top, crop_bottom], [crop_left, crop_right]]
-//
-//
-// Returns 4-D with shape `[batch, height, width, depth]`, where:
-//
-//       height = height_pad - crop_top - crop_bottom
-//       width = width_pad - crop_left - crop_right
-//
-// The attr `block_size` must be greater than one. It indicates the block size.
-//
-// Some examples:
-//
-// (1) For the following input of shape `[4, 1, 1, 1]` and block_size of 2:
-//
-// ```
-// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
-// ```
-//
-// The output tensor has shape `[1, 2, 2, 1]` and value:
+// Reshapes a tensor.
 //
-// ```
-// x = [[[[1], [2]], [[3], [4]]]]
-// ```
+// Given `tensor`, this operation returns a tensor that has the same values
+// as `tensor` with shape `shape`.
 //
-// (2) For the following input of shape `[4, 1, 1, 3]` and block_size of 2:
+// If one component of `shape` is the special value -1, the size of that dimension
+// is computed so that the total size remains constant.  In particular, a `shape`
+// of `[-1]` flattens into 1-D.  At most one component of `shape` can be -1.
 //
-// ```
-// [[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[10, 11, 12]]]
-// ```
+// If `shape` is 1-D or higher, then the operation returns a tensor with shape
+// `shape` filled with the values of `tensor`. In this case, the number of elements
+// implied by `shape` must be the same as the number of elements in `tensor`.
 //
-// The output tensor has shape `[1, 2, 2, 3]` and value:
+// For example:
 //
 // ```
-// x = [[[[1, 2, 3], [4, 5, 6]],
-//       [[7, 8, 9], [10, 11, 12]]]]
-// ```
-//
-// (3) For the following input of shape `[4, 2, 2, 1]` and block_size of 2:
+// # tensor 't' is [1, 2, 3, 4, 5, 6, 7, 8, 9]
+// # tensor 't' has shape [9]
+// reshape(t, [3, 3]) ==> [[1, 2, 3],
+//                         [4, 5, 6],
+//                         [7, 8, 9]]
 //
-// ```
-// x = [[[[1], [3]], [[9], [11]]],
-//      [[[2], [4]], [[10], [12]]],
-//      [[[5], [7]], [[13], [15]]],
-//      [[[6], [8]], [[14], [16]]]]
-// ```
+// # tensor 't' is [[[1, 1], [2, 2]],
+// #                [[3, 3], [4, 4]]]
+// # tensor 't' has shape [2, 2, 2]
+// reshape(t, [2, 4]) ==> [[1, 1, 2, 2],
+//                         [3, 3, 4, 4]]
 //
-// The output tensor has shape `[1, 4, 4, 1]` and value:
+// # tensor 't' is [[[1, 1, 1],
+// #                 [2, 2, 2]],
+// #                [[3, 3, 3],
+// #                 [4, 4, 4]],
+// #                [[5, 5, 5],
+// #                 [6, 6, 6]]]
+// # tensor 't' has shape [3, 2, 3]
+// # pass '[-1]' to flatten 't'
+// reshape(t, [-1]) ==> [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6]
 //
-// ```
-// x = [[[1],   [2],  [3],  [4]],
-//      [[5],   [6],  [7],  [8]],
-//      [[9],  [10], [11],  [12]],
-//      [[13], [14], [15],  [16]]]
-// ```
+// # -1 can also be used to infer the shape
 //
-// (4) For the following input of shape `[8, 1, 2, 1]` and block_size of 2:
+// # -1 is inferred to be 9:
+// reshape(t, [2, -1]) ==> [[1, 1, 1, 2, 2, 2, 3, 3, 3],
+//                          [4, 4, 4, 5, 5, 5, 6, 6, 6]]
+// # -1 is inferred to be 2:
+// reshape(t, [-1, 9]) ==> [[1, 1, 1, 2, 2, 2, 3, 3, 3],
+//                          [4, 4, 4, 5, 5, 5, 6, 6, 6]]
+// # -1 is inferred to be 3:
+// reshape(t, [ 2, -1, 3]) ==> [[[1, 1, 1],
+//                               [2, 2, 2],
+//                               [3, 3, 3]],
+//                              [[4, 4, 4],
+//                               [5, 5, 5],
+//                               [6, 6, 6]]]
 //
-// ```
-// x = [[[[1], [3]]], [[[9], [11]]], [[[2], [4]]], [[[10], [12]]],
-//      [[[5], [7]]], [[[13], [15]]], [[[6], [8]]], [[[14], [16]]]]
+// # tensor 't' is [7]
+// # shape `[]` reshapes to a scalar
+// reshape(t, []) ==> 7
 // ```
 //
-// The output tensor has shape `[2, 2, 4, 1]` and value:
+// Arguments:
 //
-// ```
-// x = [[[[1], [3]], [[5], [7]]],
-//      [[[2], [4]], [[10], [12]]],
-//      [[[5], [7]], [[13], [15]]],
-//      [[[6], [8]], [[14], [16]]]]
-// ```
-func BatchToSpace(scope *Scope, input tf.Output, crops tf.Output, block_size int64) (output tf.Output) {
+//	shape: Defines the shape of the output tensor.
+func Reshape(scope *Scope, tensor tf.Output, shape tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"block_size": block_size}
 	opspec := tf.OpSpec{
-		Type: "BatchToSpace",
+		Type: "Reshape",
 		Input: []tf.Input{
-			input, crops,
+			tensor, shape,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Makes a new iterator from the given `dataset` and stores it in `iterator`.
-//
-// This operation may be executed multiple times. Each execution will reset the
-// iterator in `iterator` to the first element of `dataset`.
-//
-// Returns the created operation.
-func MakeIterator(scope *Scope, dataset tf.Output, iterator tf.Output) (o *tf.Operation) {
+// Creates a dataset that splits a SparseTensor into elements row-wise.
+func SparseTensorSliceDataset(scope *Scope, indices tf.Output, values tf.Output, dense_shape tf.Output) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "MakeIterator",
+		Type: "SparseTensorSliceDataset",
 		Input: []tf.Input{
-			dataset, iterator,
+			indices, values, dense_shape,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Adjust the contrast of one or more images.
+// Returns x / y element-wise for real types.
 //
-// `images` is a tensor of at least 3 dimensions.  The last 3 dimensions are
-// interpreted as `[height, width, channels]`.  The other dimensions only
-// represent a collection of images, such as `[batch, height, width, channels].`
-//
-// Contrast is adjusted independently for each channel of each image.
-//
-// For each channel, the Op first computes the mean of the image pixels in the
-// channel and then adjusts each component of each pixel to
-// `(x - mean) * contrast_factor + mean`.
-//
-// Arguments:
-//	images: Images to adjust.  At least 3-D.
-//	contrast_factor: A float multiplier for adjusting contrast.
+// If `x` and `y` are reals, this will return the floating-point division.
 //
-// Returns The contrast-adjusted image or images.
-func AdjustContrastv2(scope *Scope, images tf.Output, contrast_factor tf.Output) (output tf.Output) {
+// *NOTE*: `RealDiv` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func RealDiv(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "AdjustContrastv2",
+		Type: "RealDiv",
 		Input: []tf.Input{
-			images, contrast_factor,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Gets the next output from the given iterator.
-func IteratorGetNext(scope *Scope, iterator tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
+// Creates a dataset that concatenates `input_dataset` with `another_dataset`.
+func ConcatenateDataset(scope *Scope, input_dataset tf.Output, another_dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "IteratorGetNext",
+		Type: "ConcatenateDataset",
 		Input: []tf.Input{
-			iterator,
+			input_dataset, another_dataset,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
-		scope.UpdateErr("IteratorGetNext", err)
-		return
-	}
-	return components
+	return op.Output(0)
 }
 
-// Outputs the single element from the given dataset.
+// Adds a value to the current value of a variable.
 //
-// Arguments:
-//	dataset: A handle to a dataset that contains a single element.
+// Any ReadVariableOp which depends directly or indirectly on this assign is
+// guaranteed to see the incremented value or a subsequent newer one.
 //
+// Outputs the incremented value, which can be used to totally order the
+// increments to this variable.
 //
+// Arguments:
+//	resource: handle to the resource in which to store the variable.
+//	value: the value by which the variable will be incremented.
 //
-// Returns The components of the single element of `input`.
-func DatasetToSingleElement(scope *Scope, dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
+// Returns the created operation.
+func AssignAddVariableOp(scope *Scope, resource tf.Output, value tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "DatasetToSingleElement",
+		Type: "AssignAddVariableOp",
 		Input: []tf.Input{
-			dataset,
+			resource, value,
 		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
 	}
-	var idx int
-	var err error
-	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
-		scope.UpdateErr("DatasetToSingleElement", err)
-		return
-	}
-	return components
+	return scope.AddOperation(opspec)
 }
 
-// Converts the given `resource_handle` representing an iterator to a string.
-//
-// Arguments:
-//	resource_handle: A handle to an iterator resource.
-//
-// Returns A string representation of the given handle.
-func IteratorToStringHandle(scope *Scope, resource_handle tf.Output) (string_handle tf.Output) {
+// Records the latency of producing `input_dataset` elements in a StatsAggregator.
+func LatencyStatsDataset(scope *Scope, input_dataset tf.Output, tag tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "IteratorToStringHandle",
+		Type: "LatencyStatsDataset",
 		Input: []tf.Input{
-			resource_handle,
+			input_dataset, tag,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ShapeNAttr is an optional argument to ShapeN.
-type ShapeNAttr func(optionalAttr)
-
-// ShapeNOutType sets the optional out_type attribute to value.
-// If not specified, defaults to DT_INT32
-func ShapeNOutType(value tf.DataType) ShapeNAttr {
-	return func(m optionalAttr) {
-		m["out_type"] = value
-	}
-}
-
-// Returns shape of tensors.
+// Convert JSON-encoded Example records to binary protocol buffer strings.
 //
-// This operation returns N 1-D integer tensors representing shape of `input[i]s`.
-func ShapeN(scope *Scope, input []tf.Output, optional ...ShapeNAttr) (output []tf.Output) {
+// This op translates a tensor containing Example records, encoded using
+// the [standard JSON
+// mapping](https://developers.google.com/protocol-buffers/docs/proto3#json),
+// into a tensor containing the same records encoded as binary protocol
+// buffers. The resulting tensor can then be fed to any of the other
+// Example-parsing ops.
+//
+// Arguments:
+//	json_examples: Each string is a JSON object serialized according to the JSON
+// mapping of the Example proto.
+//
+// Returns Each string is a binary Example protocol buffer corresponding
+// to the respective element of `json_examples`.
+func DecodeJSONExample(scope *Scope, json_examples tf.Output) (binary_examples tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ShapeN",
+		Type: "DecodeJSONExample",
 		Input: []tf.Input{
-			tf.OutputList(input),
+			json_examples,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("ShapeN", err)
-		return
-	}
-	return output
+	return op.Output(0)
 }
 
-// IteratorFromStringHandleAttr is an optional argument to IteratorFromStringHandle.
-type IteratorFromStringHandleAttr func(optionalAttr)
-
-// IteratorFromStringHandleOutputTypes sets the optional output_types attribute to value.
+// Computes the grayscale dilation of 4-D `input` and 3-D `filter` tensors.
 //
-// value: If specified, defines the type of each tuple component in an
-// element produced by the resulting iterator.
-// If not specified, defaults to <>
+// The `input` tensor has shape `[batch, in_height, in_width, depth]` and the
+// `filter` tensor has shape `[filter_height, filter_width, depth]`, i.e., each
+// input channel is processed independently of the others with its own structuring
+// function. The `output` tensor has shape
+// `[batch, out_height, out_width, depth]`. The spatial dimensions of the output
+// tensor depend on the `padding` algorithm. We currently only support the default
+// "NHWC" `data_format`.
 //
-// REQUIRES: len(value) >= 0
-func IteratorFromStringHandleOutputTypes(value []tf.DataType) IteratorFromStringHandleAttr {
-	return func(m optionalAttr) {
-		m["output_types"] = value
-	}
-}
-
-// IteratorFromStringHandleOutputShapes sets the optional output_shapes attribute to value.
+// In detail, the grayscale morphological 2-D dilation is the max-sum correlation
+// (for consistency with `conv2d`, we use unmirrored filters):
 //
-// value: If specified, defines the shape of each tuple component in an
-// element produced by the resulting iterator.
-// If not specified, defaults to <>
+//     output[b, y, x, c] =
+//        max_{dy, dx} input[b,
+//                           strides[1] * y + rates[1] * dy,
+//                           strides[2] * x + rates[2] * dx,
+//                           c] +
+//                     filter[dy, dx, c]
 //
-// REQUIRES: len(value) >= 0
-func IteratorFromStringHandleOutputShapes(value []tf.Shape) IteratorFromStringHandleAttr {
-	return func(m optionalAttr) {
-		m["output_shapes"] = value
-	}
-}
-
-// Converts the given string representing a handle to an iterator to a resource.
+// Max-pooling is a special case when the filter has size equal to the pooling
+// kernel size and contains all zeros.
+//
+// Note on duality: The dilation of `input` by the `filter` is equal to the
+// negation of the erosion of `-input` by the reflected `filter`.
 //
 // Arguments:
-//	string_handle: A string representation of the given handle.
+//	input: 4-D with shape `[batch, in_height, in_width, depth]`.
+//	filter: 3-D with shape `[filter_height, filter_width, depth]`.
+//	strides: The stride of the sliding window for each dimension of the input
+// tensor. Must be: `[1, stride_height, stride_width, 1]`.
+//	rates: The input stride for atrous morphological dilation. Must be:
+// `[1, rate_height, rate_width, 1]`.
+//	padding: The type of padding algorithm to use.
 //
-// Returns A handle to an iterator resource.
-func IteratorFromStringHandle(scope *Scope, string_handle tf.Output, optional ...IteratorFromStringHandleAttr) (resource_handle tf.Output) {
+// Returns 4-D with shape `[batch, out_height, out_width, depth]`.
+func Dilation2D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, rates []int64, padding string) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"strides": strides, "rates": rates, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "IteratorFromStringHandle",
+		Type: "Dilation2D",
 		Input: []tf.Input{
-			string_handle,
+			input, filter,
 		},
 		Attrs: attrs,
 	}
@@ -21031,257 +21343,311 @@ func IteratorFromStringHandle(scope *Scope, string_handle tf.Output, optional ..
 	return op.Output(0)
 }
 
-// Computes arctangent of `y/x` element-wise, respecting signs of the arguments.
+// Converts the given variant tensor to an iterator and stores it in the given resource.
 //
-// This is the angle \( \theta \in [-\pi, \pi] \) such that
-// \[ x = r \cos(\theta) \]
-// and
-// \[ y = r \sin(\theta) \]
-// where \(r = \sqrt(x^2 + y^2) \).
-func Atan2(scope *Scope, y tf.Output, x tf.Output) (z tf.Output) {
+// Arguments:
+//	resource_handle: A handle to an iterator resource.
+//	serialized: A variant tensor storing the state of the iterator contained in the
+// resource.
+//
+// Returns the created operation.
+func DeserializeIterator(scope *Scope, resource_handle tf.Output, serialized tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Atan2",
+		Type: "DeserializeIterator",
 		Input: []tf.Input{
-			y, x,
+			resource_handle, serialized,
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Return a tensor with the same shape and contents as the input tensor or value.
-func Identity(scope *Scope, input tf.Output) (output tf.Output) {
+// TensorArrayConcatV2Attr is an optional argument to TensorArrayConcatV2.
+type TensorArrayConcatV2Attr func(optionalAttr)
+
+// TensorArrayConcatV2ElementShapeExcept0 sets the optional element_shape_except0 attribute to value.
+// If not specified, defaults to <unknown_rank:true >
+func TensorArrayConcatV2ElementShapeExcept0(value tf.Shape) TensorArrayConcatV2Attr {
+	return func(m optionalAttr) {
+		m["element_shape_except0"] = value
+	}
+}
+
+// Deprecated. Use TensorArrayConcatV3
+func TensorArrayConcatV2(scope *Scope, handle tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayConcatV2Attr) (value tf.Output, lengths tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtype": dtype}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Identity",
+		Type: "TensorArrayConcatV2",
 		Input: []tf.Input{
-			input,
+			handle, flow_in,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Gather slices from `params` axis `axis` according to `indices`.
-//
-// `indices` must be an integer tensor of any dimension (usually 0-D or 1-D).
-// Produces an output tensor with shape `params.shape[:axis] + indices.shape +
-// params.shape[axis + 1:]` where:
-//
-// ```python
-//     # Scalar indices (output is rank(params) - 1).
-//     output[a_0, ..., a_n, b_0, ..., b_n] =
-//       params[a_0, ..., a_n, indices, b_0, ..., b_n]
-//
-//     # Vector indices (output is rank(params)).
-//     output[a_0, ..., a_n, i, b_0, ..., b_n] =
-//       params[a_0, ..., a_n, indices[i], b_0, ..., b_n]
-//
-//     # Higher rank indices (output is rank(params) + rank(indices) - 1).
-//     output[a_0, ..., a_n, i, ..., j, b_0, ... b_n] =
-//       params[a_0, ..., a_n, indices[i, ..., j], b_0, ..., b_n]
-// ```
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/Gather.png" alt>
-// </div>
+// Creates a dataset that batches and pads `batch_size` elements from the input.
 //
 // Arguments:
-//	params: The tensor from which to gather values. Must be at least rank
-// `axis + 1`.
-//	indices: Index tensor. Must be in range `[0, params.shape[axis])`.
-//	axis: The axis in `params` to gather `indices` from. Defaults to the first
-// dimension. Supports negative indexes.
 //
-// Returns Values from `params` gathered from indices given by `indices`, with
-// shape `params.shape[:axis] + indices.shape + params.shape[axis + 1:]`.
-func GatherV2(scope *Scope, params tf.Output, indices tf.Output, axis tf.Output) (output tf.Output) {
+//	batch_size: A scalar representing the number of elements to accumulate in a
+// batch.
+//	padded_shapes: A list of int64 tensors representing the desired padded shapes
+// of the corresponding output components. These shapes may be partially
+// specified, using `-1` to indicate that a particular dimension should be
+// padded to the maximum size of all batch elements.
+//	padding_values: A list of scalars containing the padding value to use for
+// each of the outputs.
+//
+func PaddedBatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, padded_shapes []tf.Output, padding_values []tf.Output, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "GatherV2",
+		Type: "PaddedBatchDataset",
 		Input: []tf.Input{
-			params, indices, axis,
+			input_dataset, batch_size, tf.OutputList(padded_shapes), tf.OutputList(padding_values),
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Converts the given `resource_handle` representing an iterator to a variant tensor.
+// Creates a dataset that batches input elements into a SparseTensor.
 //
 // Arguments:
-//	resource_handle: A handle to an iterator resource.
+//	input_dataset: A handle to an input dataset. Must have a single component.
+//	batch_size: A scalar representing the number of elements to accumulate in a
+// batch.
+//	row_shape: A vector representing the dense shape of each row in the produced
+// SparseTensor. The shape may be partially specified, using `-1` to indicate
+// that a particular dimension should use the maximum size of all batch elements.
 //
-// Returns A variant tensor storing the state of the iterator contained in the
-// resource.
-func SerializeIterator(scope *Scope, resource_handle tf.Output) (serialized tf.Output) {
+//
+func DenseToSparseBatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, row_shape tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "SerializeIterator",
+		Type: "DenseToSparseBatchDataset",
 		Input: []tf.Input{
-			resource_handle,
+			input_dataset, batch_size, row_shape,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// FIFOQueueV2Attr is an optional argument to FIFOQueueV2.
-type FIFOQueueV2Attr func(optionalAttr)
-
-// FIFOQueueV2Shapes sets the optional shapes attribute to value.
-//
-// value: The shape of each component in a value. The length of this attr must
-// be either 0 or the same as the length of component_types. If the length of
-// this attr is 0, the shapes of queue elements are not constrained, and
-// only one element may be dequeued at a time.
-// If not specified, defaults to <>
+// Deprecated. Use TensorArrayGradV3
 //
-// REQUIRES: len(value) >= 0
-func FIFOQueueV2Shapes(value []tf.Shape) FIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shapes"] = value
+// DEPRECATED at GraphDef version 26: Use TensorArrayGradV3
+func TensorArrayGradV2(scope *Scope, handle tf.Output, flow_in tf.Output, source string) (grad_handle tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// FIFOQueueV2Capacity sets the optional capacity attribute to value.
-//
-// value: The upper bound on the number of elements in this queue.
-// Negative numbers mean no limit.
-// If not specified, defaults to -1
-func FIFOQueueV2Capacity(value int64) FIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
+	attrs := map[string]interface{}{"source": source}
+	opspec := tf.OpSpec{
+		Type: "TensorArrayGradV2",
+		Input: []tf.Input{
+			handle, flow_in,
+		},
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// FIFOQueueV2Container sets the optional container attribute to value.
-//
-// value: If non-empty, this queue is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func FIFOQueueV2Container(value string) FIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
+// ResourceSparseApplyAdadeltaAttr is an optional argument to ResourceSparseApplyAdadelta.
+type ResourceSparseApplyAdadeltaAttr func(optionalAttr)
 
-// FIFOQueueV2SharedName sets the optional shared_name attribute to value.
+// ResourceSparseApplyAdadeltaUseLocking sets the optional use_locking attribute to value.
 //
-// value: If non-empty, this queue will be shared under the given name
-// across multiple sessions.
-// If not specified, defaults to ""
-func FIFOQueueV2SharedName(value string) FIFOQueueV2Attr {
+// value: If True, updating of the var and accum tensors will be protected by
+// a lock; otherwise the behavior is undefined, but may exhibit less contention.
+// If not specified, defaults to false
+func ResourceSparseApplyAdadeltaUseLocking(value bool) ResourceSparseApplyAdadeltaAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["use_locking"] = value
 	}
 }
 
-// A queue that produces elements in first-in first-out order.
+// var: Should be from a Variable().
 //
 // Arguments:
-//	component_types: The type of each component in a value.
 //
-// Returns The handle to the queue.
-func FIFOQueueV2(scope *Scope, component_types []tf.DataType, optional ...FIFOQueueV2Attr) (handle tf.Output) {
+//	accum: Should be from a Variable().
+//	accum_update: Should be from a Variable().
+//	lr: Learning rate. Must be a scalar.
+//	rho: Decay factor. Must be a scalar.
+//	epsilon: Constant factor. Must be a scalar.
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var and accum.
+//
+// Returns the created operation.
+func ResourceSparseApplyAdadelta(scope *Scope, var_ tf.Output, accum tf.Output, accum_update tf.Output, lr tf.Output, rho tf.Output, epsilon tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyAdadeltaAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"component_types": component_types}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FIFOQueueV2",
-
+		Type: "ResourceSparseApplyAdadelta",
+		Input: []tf.Input{
+			var_, accum, accum_update, lr, rho, epsilon, grad, indices,
+		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Produces a summary of any statistics recorded by the given statistics manager.
-func StatsAggregatorSummary(scope *Scope, iterator tf.Output) (summary tf.Output) {
+// Identity op for gradient debugging.
+//
+// This op is hidden from the public API in Python. It is used by TensorFlow Debugger to
+// register gradient tensors for gradient debugging.
+// This op operates on non-reference-type tensors.
+func DebugGradientIdentity(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "StatsAggregatorSummary",
+		Type: "DebugGradientIdentity",
 		Input: []tf.Input{
-			iterator,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Compute the pairwise cross product.
+// Return substrings from `Tensor` of strings.
 //
-// `a` and `b` must be the same shape; they can either be simple 3-element vectors,
-// or any shape where the innermost dimension is 3. In the latter case, each pair
-// of corresponding 3-element vectors is cross-multiplied independently.
+// For each string in the input `Tensor`, creates a substring starting at index
+// `pos` with a total length of `len`.
+//
+// If `len` defines a substring that would extend beyond the length of the input
+// string, then as many characters as possible are used.
+//
+// If `pos` is negative or specifies a character index larger than any of the input
+// strings, then an `InvalidArgumentError` is thrown.
+//
+// `pos` and `len` must have the same shape, otherwise a `ValueError` is thrown on
+// Op creation.
+//
+// *NOTE*: `Substr` supports broadcasting up to two dimensions. More about
+// broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+//
+// ---
+//
+// Examples
+//
+// Using scalar `pos` and `len`:
+//
+// ```python
+// input = [b'Hello', b'World']
+// position = 1
+// length = 3
+//
+// output = [b'ell', b'orl']
+// ```
+//
+// Using `pos` and `len` with same shape as `input`:
+//
+// ```python
+// input = [[b'ten', b'eleven', b'twelve'],
+//          [b'thirteen', b'fourteen', b'fifteen'],
+//          [b'sixteen', b'seventeen', b'eighteen']]
+// position = [[1, 2, 3],
+//             [1, 2, 3],
+//             [1, 2, 3]]
+// length =   [[2, 3, 4],
+//             [4, 3, 2],
+//             [5, 5, 5]]
+//
+// output = [[b'en', b'eve', b'lve'],
+//           [b'hirt', b'urt', b'te'],
+//           [b'ixtee', b'vente', b'hteen']]
+// ```
+//
+// Broadcasting `pos` and `len` onto `input`:
+//
+// ```python
+// input = [[b'ten', b'eleven', b'twelve'],
+//          [b'thirteen', b'fourteen', b'fifteen'],
+//          [b'sixteen', b'seventeen', b'eighteen'],
+//          [b'nineteen', b'twenty', b'twentyone']]
+// position = [1, 2, 3]
+// length =   [1, 2, 3]
+//
+// output = [[b'e', b'ev', b'lve'],
+//           [b'h', b'ur', b'tee'],
+//           [b'i', b've', b'hte'],
+//           [b'i', b'en', b'nty']]
+// ```
+//
+// Broadcasting `input` onto `pos` and `len`:
+//
+// ```python
+// input = b'thirteen'
+// position = [1, 5, 7]
+// length =   [3, 2, 1]
+//
+// output = [b'hir', b'ee', b'n']
+// ```
 //
 // Arguments:
-//	a: A tensor containing 3-element vectors.
-//	b: Another tensor, of same type and shape as `a`.
+//	input: Tensor of strings
+//	pos: Scalar defining the position of first character in each substring
+//	len: Scalar defining the number of characters to include in each substring
 //
-// Returns Pairwise cross product of the vectors in `a` and `b`.
-func Cross(scope *Scope, a tf.Output, b tf.Output) (product tf.Output) {
+// Returns Tensor of substrings
+func Substr(scope *Scope, input tf.Output, pos tf.Output, len tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Cross",
+		Type: "Substr",
 		Input: []tf.Input{
-			a, b,
+			input, pos, len,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Performs a padding as a preprocess during a convolution.
-//
-// Similar to FusedResizeAndPadConv2d, this op allows for an optimized
-// implementation where the spatial padding transformation stage is fused with the
-// im2col lookup, but in this case without the bilinear filtering required for
-// resizing. Fusing the padding prevents the need to write out the intermediate
-// results as whole tensors, reducing memory pressure, and we can get some latency
-// gains by merging the transformation calculations.
-// The data_format attribute for Conv2D isn't supported by this op, and 'NHWC'
-// order is used instead.
-// Internally this op uses a single per-graph scratch buffer, which means that it
-// will block if multiple versions are being run in parallel. This is because this
-// operator is primarily an optimization to minimize memory usage.
+// Creates a Dataset that returns pseudorandom numbers.
 //
 // Arguments:
-//	input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
-//	paddings: A two-column matrix specifying the padding sizes. The number of
-// rows must be the same as the rank of `input`.
-//	filter: 4-D with shape
-// `[filter_height, filter_width, in_channels, out_channels]`.
+//	seed: A scalar seed for the random number generator. If either seed or
+// seed2 is set to be non-zero, the random number generator is seeded
+// by the given seed.  Otherwise, a random seed is used.
+//	seed2: A second scalar seed to avoid seed collision.
 //
-//	strides: 1-D of length 4.  The stride of the sliding window for each dimension
-// of `input`. Must be in the same order as the dimension specified with format.
-//	padding: The type of padding algorithm to use.
-func FusedPadConv2D(scope *Scope, input tf.Output, paddings tf.Output, filter tf.Output, mode string, strides []int64, padding string) (output tf.Output) {
+//
+func RandomDataset(scope *Scope, seed tf.Output, seed2 tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"mode": mode, "strides": strides, "padding": padding}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "FusedPadConv2D",
+		Type: "RandomDataset",
 		Input: []tf.Input{
-			input, paddings, filter,
+			seed, seed2,
 		},
 		Attrs: attrs,
 	}
@@ -21289,73 +21655,32 @@ func FusedPadConv2D(scope *Scope, input tf.Output, paddings tf.Output, filter tf
 	return op.Output(0)
 }
 
-// Conv2DBackpropInputAttr is an optional argument to Conv2DBackpropInput.
-type Conv2DBackpropInputAttr func(optionalAttr)
-
-// Conv2DBackpropInputUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
-// If not specified, defaults to true
-func Conv2DBackpropInputUseCudnnOnGpu(value bool) Conv2DBackpropInputAttr {
-	return func(m optionalAttr) {
-		m["use_cudnn_on_gpu"] = value
-	}
-}
-
-// Conv2DBackpropInputDataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func Conv2DBackpropInputDataFormat(value string) Conv2DBackpropInputAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Conv2DBackpropInputDilations sets the optional dilations attribute to value.
+// Creates a dataset that shuffles and repeats elements from `input_dataset`
 //
-// value: 1-D tensor of length 4.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each filter
-// element on that dimension. The dimension order is determined by the value of
-// `data_format`, see above for details. Dilations in the batch and depth
-// dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 >
-func Conv2DBackpropInputDilations(value []int64) Conv2DBackpropInputAttr {
-	return func(m optionalAttr) {
-		m["dilations"] = value
-	}
-}
-
-// Computes the gradients of convolution with respect to the input.
+// pseudorandomly.
 //
 // Arguments:
-//	input_sizes: An integer vector representing the shape of `input`,
-// where `input` is a 4-D `[batch, height, width, channels]` tensor.
-//	filter: 4-D with shape
-// `[filter_height, filter_width, in_channels, out_channels]`.
-//	out_backprop: 4-D with shape `[batch, out_height, out_width, out_channels]`.
-// Gradients w.r.t. the output of the convolution.
-//	strides: The stride of the sliding window for each dimension of the input
-// of the convolution. Must be in the same order as the dimension specified with
-// format.
-//	padding: The type of padding algorithm to use.
 //
-// Returns 4-D with shape `[batch, in_height, in_width, in_channels]`.  Gradient
-// w.r.t. the input of the convolution.
-func Conv2DBackpropInput(scope *Scope, input_sizes tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...Conv2DBackpropInputAttr) (output tf.Output) {
+//	buffer_size: The number of output elements to buffer in an iterator over
+// this dataset. Compare with the `min_after_dequeue` attr when creating a
+// `RandomShuffleQueue`.
+//	seed: A scalar seed for the random number generator. If either `seed` or
+// `seed2` is set to be non-zero, the random number generator is seeded
+// by the given seed.  Otherwise, a random seed is used.
+//	seed2: A second scalar seed to avoid seed collision.
+//	count: A scalar representing the number of times the underlying dataset
+// should be repeated. The default is `-1`, which results in infinite repetition.
+//
+//
+func ShuffleAndRepeatDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, seed tf.Output, seed2 tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "Conv2DBackpropInput",
+		Type: "ShuffleAndRepeatDataset",
 		Input: []tf.Input{
-			input_sizes, filter, out_backprop,
+			input_dataset, buffer_size, seed, seed2, count,
 		},
 		Attrs: attrs,
 	}
@@ -21363,117 +21688,60 @@ func Conv2DBackpropInput(scope *Scope, input_sizes tf.Output, filter tf.Output,
 	return op.Output(0)
 }
 
-// Interleave the values from the `data` tensors into a single tensor.
-//
-// Builds a merged tensor such that
-//
-// ```python
-//     merged[indices[m][i, ..., j], ...] = data[m][i, ..., j, ...]
-// ```
-//
-// For example, if each `indices[m]` is scalar or vector, we have
-//
-// ```python
-//     # Scalar indices:
-//     merged[indices[m], ...] = data[m][...]
-//
-//     # Vector indices:
-//     merged[indices[m][i], ...] = data[m][i, ...]
-// ```
-//
-// Each `data[i].shape` must start with the corresponding `indices[i].shape`,
-// and the rest of `data[i].shape` must be constant w.r.t. `i`.  That is, we
-// must have `data[i].shape = indices[i].shape + constant`.  In terms of this
-// `constant`, the output shape is
-//
-//     merged.shape = [max(indices)] + constant
-//
-// Values are merged in order, so if an index appears in both `indices[m][i]` and
-// `indices[n][j]` for `(m,i) < (n,j)` the slice `data[n][j]` will appear in the
-// merged result. If you do not need this guarantee, ParallelDynamicStitch might
-// perform better on some devices.
-//
-// For example:
+// Creates a dataset that caches elements from `input_dataset`.
 //
-// ```python
-//     indices[0] = 6
-//     indices[1] = [4, 1]
-//     indices[2] = [[5, 2], [0, 3]]
-//     data[0] = [61, 62]
-//     data[1] = [[41, 42], [11, 12]]
-//     data[2] = [[[51, 52], [21, 22]], [[1, 2], [31, 32]]]
-//     merged = [[1, 2], [11, 12], [21, 22], [31, 32], [41, 42],
-//               [51, 52], [61, 62]]
-// ```
+// A CacheDataset will iterate over the input_dataset, and store tensors. If the
+// cache already exists, the cache will be used. If the cache is inappropriate
+// (e.g. cannot be opened, contains tensors of the wrong shape / size), an error
+// will be returned when used.
 //
-// This method can be used to merge partitions created by `dynamic_partition`
-// as illustrated on the following example:
+// Arguments:
 //
-// ```python
-//     # Apply function (increments x_i) on elements for which a certain condition
-//     # apply (x_i != -1 in this example).
-//     x=tf.constant([0.1, -1., 5.2, 4.3, -1., 7.4])
-//     condition_mask=tf.not_equal(x,tf.constant(-1.))
-//     partitioned_data = tf.dynamic_partition(
-//         x, tf.cast(condition_mask, tf.int32) , 2)
-//     partitioned_data[1] = partitioned_data[1] + 1.0
-//     condition_indices = tf.dynamic_partition(
-//         tf.range(tf.shape(x)[0]), tf.cast(condition_mask, tf.int32) , 2)
-//     x = tf.dynamic_stitch(condition_indices, partitioned_data)
-//     # Here x=[1.1, -1., 6.2, 5.3, -1, 8.4], the -1. values remain
-//     # unchanged.
-// ```
+//	filename: A path on the filesystem where we should cache the dataset. Note: this
+// will be a directory.
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicStitch.png" alt>
-// </div>
-func DynamicStitch(scope *Scope, indices []tf.Output, data []tf.Output) (merged tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "DynamicStitch",
-		Input: []tf.Input{
-			tf.OutputList(indices), tf.OutputList(data),
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Returns the truth value of (x == y) element-wise.
 //
-// *NOTE*: `Equal` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Equal(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+func CacheDataset(scope *Scope, input_dataset tf.Output, filename tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "Equal",
+		Type: "CacheDataset",
 		Input: []tf.Input{
-			x, y,
+			input_dataset, filename,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TensorArrayGatherV2Attr is an optional argument to TensorArrayGatherV2.
-type TensorArrayGatherV2Attr func(optionalAttr)
+// PlaceholderAttr is an optional argument to Placeholder.
+type PlaceholderAttr func(optionalAttr)
 
-// TensorArrayGatherV2ElementShape sets the optional element_shape attribute to value.
+// PlaceholderShape sets the optional shape attribute to value.
+//
+// value: (Optional) The shape of the tensor. If the shape has 0 dimensions, the
+// shape is unconstrained.
 // If not specified, defaults to <unknown_rank:true >
-func TensorArrayGatherV2ElementShape(value tf.Shape) TensorArrayGatherV2Attr {
+func PlaceholderShape(value tf.Shape) PlaceholderAttr {
 	return func(m optionalAttr) {
-		m["element_shape"] = value
+		m["shape"] = value
 	}
 }
 
-// Deprecated. Use TensorArrayGatherV3
+// A placeholder op for a value that will be fed into the computation.
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayGatherV3
-func TensorArrayGatherV2(scope *Scope, handle tf.Output, indices tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayGatherV2Attr) (value tf.Output) {
+// N.B. This operation will fail with an error if it is executed. It is
+// intended as a way to represent a value that will always be fed, and to
+// provide attrs that enable the fed value to be checked at runtime.
+//
+// Arguments:
+//	dtype: The type of elements in the tensor.
+//
+// Returns A placeholder tensor that must be replaced using the feed mechanism.
+func Placeholder(scope *Scope, dtype tf.DataType, optional ...PlaceholderAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -21482,297 +21750,223 @@ func TensorArrayGatherV2(scope *Scope, handle tf.Output, indices tf.Output, flow
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayGatherV2",
-		Input: []tf.Input{
-			handle, indices, flow_in,
-		},
+		Type: "Placeholder",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Interleave the values from the `data` tensors into a single tensor.
-//
-// Builds a merged tensor such that
-//
-// ```python
-//     merged[indices[m][i, ..., j], ...] = data[m][i, ..., j, ...]
-// ```
-//
-// For example, if each `indices[m]` is scalar or vector, we have
+// Creates a dataset that executes a SQL query and emits rows of the result set.
 //
-// ```python
-//     # Scalar indices:
-//     merged[indices[m], ...] = data[m][...]
+// Arguments:
+//	driver_name: The database type. Currently, the only supported type is 'sqlite'.
+//	data_source_name: A connection string to connect to the database.
+//	query: A SQL query to execute.
 //
-//     # Vector indices:
-//     merged[indices[m][i], ...] = data[m][i, ...]
-// ```
 //
-// Each `data[i].shape` must start with the corresponding `indices[i].shape`,
-// and the rest of `data[i].shape` must be constant w.r.t. `i`.  That is, we
-// must have `data[i].shape = indices[i].shape + constant`.  In terms of this
-// `constant`, the output shape is
-//
-//     merged.shape = [max(indices)] + constant
-//
-// Values may be merged in parallel, so if an index appears in both `indices[m][i]`
-// and `indices[n][j]`, the result may be invalid. This differs from the normal
-// DynamicStitch operator that defines the behavior in that case.
-//
-// For example:
-//
-// ```python
-//     indices[0] = 6
-//     indices[1] = [4, 1]
-//     indices[2] = [[5, 2], [0, 3]]
-//     data[0] = [61, 62]
-//     data[1] = [[41, 42], [11, 12]]
-//     data[2] = [[[51, 52], [21, 22]], [[1, 2], [31, 32]]]
-//     merged = [[1, 2], [11, 12], [21, 22], [31, 32], [41, 42],
-//               [51, 52], [61, 62]]
-// ```
-//
-// This method can be used to merge partitions created by `dynamic_partition`
-// as illustrated on the following example:
-//
-// ```python
-//     # Apply function (increments x_i) on elements for which a certain condition
-//     # apply (x_i != -1 in this example).
-//     x=tf.constant([0.1, -1., 5.2, 4.3, -1., 7.4])
-//     condition_mask=tf.not_equal(x,tf.constant(-1.))
-//     partitioned_data = tf.dynamic_partition(
-//         x, tf.cast(condition_mask, tf.int32) , 2)
-//     partitioned_data[1] = partitioned_data[1] + 1.0
-//     condition_indices = tf.dynamic_partition(
-//         tf.range(tf.shape(x)[0]), tf.cast(condition_mask, tf.int32) , 2)
-//     x = tf.dynamic_stitch(condition_indices, partitioned_data)
-//     # Here x=[1.1, -1., 6.2, 5.3, -1, 8.4], the -1. values remain
-//     # unchanged.
-// ```
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicStitch.png" alt>
-// </div>
-func ParallelDynamicStitch(scope *Scope, indices []tf.Output, data []tf.Output) (merged tf.Output) {
+func SqlDataset(scope *Scope, driver_name tf.Output, data_source_name tf.Output, query tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "ParallelDynamicStitch",
+		Type: "SqlDataset",
 		Input: []tf.Input{
-			tf.OutputList(indices), tf.OutputList(data),
+			driver_name, data_source_name, query,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the gradient for the inverse of `x` wrt its input.
+// Creates a dataset that emits the records from one or more binary files.
 //
-// Specifically, `grad = -dy * y*y`, where `y = 1/x`, and `dy`
-// is the corresponding input gradient.
-func InvGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
+// Arguments:
+//	filenames: A scalar or a vector containing the name(s) of the file(s) to be
+// read.
+//	header_bytes: A scalar representing the number of bytes to skip at the
+// beginning of a file.
+//	record_bytes: A scalar representing the number of bytes in each record.
+//	footer_bytes: A scalar representing the number of bytes to skip at the end
+// of a file.
+//	buffer_size: A scalar representing the number of bytes to buffer. Must be > 0.
+func FixedLengthRecordDataset(scope *Scope, filenames tf.Output, header_bytes tf.Output, record_bytes tf.Output, footer_bytes tf.Output, buffer_size tf.Output) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "InvGrad",
+		Type: "FixedLengthRecordDataset",
 		Input: []tf.Input{
-			y, dy,
+			filenames, header_bytes, record_bytes, footer_bytes, buffer_size,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// StridedSliceAttr is an optional argument to StridedSlice.
-type StridedSliceAttr func(optionalAttr)
-
-// StridedSliceBeginMask sets the optional begin_mask attribute to value.
+// Gradients for batch normalization.
 //
-// value: a bitmask where a bit i being 1 means to ignore the begin
-// value and instead use the largest interval possible. At runtime
-// begin[i] will be replaced with `[0, n-1) if `stride[i] > 0` or
-// `[-1, n-1]` if `stride[i] < 0`
-// If not specified, defaults to 0
-func StridedSliceBeginMask(value int64) StridedSliceAttr {
-	return func(m optionalAttr) {
-		m["begin_mask"] = value
-	}
-}
-
-// StridedSliceEndMask sets the optional end_mask attribute to value.
+// DEPRECATED at GraphDef version 9: Use tf.nn.batch_normalization()
 //
-// value: analogous to `begin_mask`
-// If not specified, defaults to 0
-func StridedSliceEndMask(value int64) StridedSliceAttr {
-	return func(m optionalAttr) {
-		m["end_mask"] = value
-	}
-}
-
-// StridedSliceEllipsisMask sets the optional ellipsis_mask attribute to value.
+// This op is deprecated. See `tf.nn.batch_normalization`.
 //
-// value: a bitmask where bit `i` being 1 means the `i`th
-// position is actually an ellipsis. One bit at most can be 1.
-// If `ellipsis_mask == 0`, then an implicit ellipsis mask of `1 << (m+1)`
-// is provided. This means that `foo[3:5] == foo[3:5, ...]`. An ellipsis
-// implicitly creates as many range specifications as necessary to fully
-// specify the sliced range for every dimension. For example for a 4-dimensional
-// tensor `foo` the slice `foo[2, ..., 5:8]` implies `foo[2, :, :, 5:8]`.
-// If not specified, defaults to 0
-func StridedSliceEllipsisMask(value int64) StridedSliceAttr {
-	return func(m optionalAttr) {
-		m["ellipsis_mask"] = value
-	}
-}
-
-// StridedSliceNewAxisMask sets the optional new_axis_mask attribute to value.
+// Arguments:
+//	t: A 4D input Tensor.
+//	m: A 1D mean Tensor with size matching the last dimension of t.
+// This is the first output from tf.nn.moments,
+// or a saved moving average thereof.
+//	v: A 1D variance Tensor with size matching the last dimension of t.
+// This is the second output from tf.nn.moments,
+// or a saved moving average thereof.
+//	gamma: A 1D gamma Tensor with size matching the last dimension of t.
+// If "scale_after_normalization" is true, this Tensor will be multiplied
+// with the normalized Tensor.
+//	backprop: 4D backprop Tensor.
+//	variance_epsilon: A small float number to avoid dividing by 0.
+//	scale_after_normalization: A bool indicating whether the resulting tensor
+// needs to be multiplied with gamma.
 //
-// value: a bitmask where bit `i` being 1 means the `i`th
-// specification creates a new shape 1 dimension. For example
-// `foo[:4, tf.newaxis, :2]` would produce a shape `(4, 1, 2)` tensor.
-// If not specified, defaults to 0
-func StridedSliceNewAxisMask(value int64) StridedSliceAttr {
-	return func(m optionalAttr) {
-		m["new_axis_mask"] = value
+// Returns 4D backprop tensor for input; 1D backprop tensor for mean; 1D backprop tensor for variance; 1D backprop tensor for beta; 1D backprop tensor for gamma.
+func BatchNormWithGlobalNormalizationGrad(scope *Scope, t tf.Output, m tf.Output, v tf.Output, gamma tf.Output, backprop tf.Output, variance_epsilon float32, scale_after_normalization bool) (dx tf.Output, dm tf.Output, dv tf.Output, db tf.Output, dg tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"variance_epsilon": variance_epsilon, "scale_after_normalization": scale_after_normalization}
+	opspec := tf.OpSpec{
+		Type: "BatchNormWithGlobalNormalizationGrad",
+		Input: []tf.Input{
+			t, m, v, gamma, backprop,
+		},
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
 }
 
-// StridedSliceShrinkAxisMask sets the optional shrink_axis_mask attribute to value.
+// Creates a dataset that emits the records from one or more TFRecord files.
 //
-// value: a bitmask where bit `i` implies that the `i`th
-// specification should shrink the dimensionality. begin and end
-// must imply a slice of size 1 in the dimension. For example in
-// python one might do `foo[:, 3, :]` which would result in
-// `shrink_axis_mask` being 2.
-// If not specified, defaults to 0
-func StridedSliceShrinkAxisMask(value int64) StridedSliceAttr {
-	return func(m optionalAttr) {
-		m["shrink_axis_mask"] = value
+// Arguments:
+//	filenames: A scalar or vector containing the name(s) of the file(s) to be
+// read.
+//	compression_type: A scalar containing either (i) the empty string (no
+// compression), (ii) "ZLIB", or (iii) "GZIP".
+//	buffer_size: A scalar representing the number of bytes to buffer. A value of
+// 0 means no buffering will be performed.
+func TFRecordDataset(scope *Scope, filenames tf.Output, compression_type tf.Output, buffer_size tf.Output) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "TFRecordDataset",
+		Input: []tf.Input{
+			filenames, compression_type, buffer_size,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Return a strided slice from `input`.
+// BatchToSpace for 4-D tensors of type T.
 //
-// Note, most python users will want to use the Python `Tensor.__getitem__`
-// or `Variable.__getitem__` rather than this op directly.
+// This is a legacy version of the more general BatchToSpaceND.
 //
-// The goal of this op is to produce a new tensor with a subset of
-// the elements from the `n` dimensional `input` tensor. The subset is chosen using
-// a sequence of `m` sparse range specifications encoded into the arguments
-// of this function. Note, in some cases
-// `m` could be equal to `n`, but this need not be the case. Each
-// range specification entry can be one of the following:
+// Rearranges (permutes) data from batch into blocks of spatial data, followed by
+// cropping. This is the reverse transformation of SpaceToBatch. More specifically,
+// this op outputs a copy of the input tensor where values from the `batch`
+// dimension are moved in spatial blocks to the `height` and `width` dimensions,
+// followed by cropping along the `height` and `width` dimensions.
 //
-// - An ellipsis (...). Ellipses are used to imply zero or more
-//   dimensions of full-dimension selection and are produced using
-//   `ellipsis_mask`. For example, `foo[...]` is the identity slice.
+// Arguments:
+//	input: 4-D tensor with shape
+// `[batch*block_size*block_size, height_pad/block_size, width_pad/block_size,
+//   depth]`. Note that the batch size of the input tensor must be divisible by
+// `block_size * block_size`.
+//	crops: 2-D tensor of non-negative integers with shape `[2, 2]`. It specifies
+// how many elements to crop from the intermediate result across the spatial
+// dimensions as follows:
 //
-// - A new axis. This is used to insert a new shape=1 dimension and is
-//   produced using `new_axis_mask`. For example, `foo[:, ...]` where
-//   `foo` is shape `(3, 4)` produces a `(1, 3, 4)` tensor.
+//     crops = [[crop_top, crop_bottom], [crop_left, crop_right]]
 //
 //
-// - A range `begin:end:stride`. This is used to specify how much to choose from
-//   a given dimension. `stride` can be any integer but 0.  `begin` is an integer
-//   which represents the index of the first value to select while `end` represents
-//   the index of the last value to select. The number of values selected in each
-//   dimension is `end - begin` if `stride > 0` and `begin - end` if `stride < 0`.
-//   `begin` and `end` can be negative where `-1` is the last element, `-2` is
-//   the second to last. `begin_mask` controls whether to replace the explicitly
-//   given `begin` with an implicit effective value of `0` if `stride > 0` and
-//   `-1` if `stride < 0`. `end_mask` is analogous but produces the number
-//   required to create the largest open interval. For example, given a shape
-//   `(3,)` tensor `foo[:]`, the effective `begin` and `end` are `0` and `3`. Do
-//   not assume this is equivalent to `foo[0:-1]` which has an effective `begin`
-//   and `end` of `0` and `2`. Another example is `foo[-2::-1]` which reverses the
-//   first dimension of a tensor while dropping the last two (in the original
-//   order elements). For example `foo = [1,2,3,4]; foo[-2::-1]` is `[4,3]`.
+// Returns 4-D with shape `[batch, height, width, depth]`, where:
 //
-// - A single index. This is used to keep only elements that have a given
-//   index. For example (`foo[2, :]` on a shape `(5,6)` tensor produces a
-//   shape `(6,)` tensor. This is encoded in `begin` and `end` and
-//   `shrink_axis_mask`.
+//       height = height_pad - crop_top - crop_bottom
+//       width = width_pad - crop_left - crop_right
 //
-// Each conceptual range specification is encoded in the op's argument. This
-// encoding is best understand by considering a non-trivial example. In
-// particular,
-// `foo[1, 2:4, None, ..., :-3:-1, :]` will be encoded as
+// The attr `block_size` must be greater than one. It indicates the block size.
+//
+// Some examples:
+//
+// (1) For the following input of shape `[4, 1, 1, 1]` and block_size of 2:
 //
 // ```
-// begin = [1, 2, x, x, 0, x] # x denotes don't care (usually 0)
-// end = [2, 4, x, x, -3, x]
-// strides = [1, 1, x, x, -1, 1]
-// begin_mask = 1<<4 | 1 << 5 = 48
-// end_mask = 1<<5 = 32
-// ellipsis_mask = 1<<3 = 8
-// new_axis_mask = 1<<2 4
-// shrink_axis_mask = 1<<0
+// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
 // ```
 //
-// In this case if `foo.shape` is (5, 5, 5, 5, 5, 5) the final shape of
-// the slice becomes (2, 1, 5, 5, 2, 5).
-// Let us walk step by step through each argument specification.
+// The output tensor has shape `[1, 2, 2, 1]` and value:
 //
-// 1.  The first argument in the example slice is turned into `begin = 1` and
-// `end = begin + 1 = 2`. To disambiguate from the original spec `2:4` we
-// also set the appropriate bit in `shrink_axis_mask`.
+// ```
+// x = [[[[1], [2]], [[3], [4]]]]
+// ```
 //
-// 2. `2:4` is contributes 2, 4, 1 to begin, end, and stride. All masks have
-// zero bits contributed.
+// (2) For the following input of shape `[4, 1, 1, 3]` and block_size of 2:
 //
-// 3. None is a synonym for `tf.newaxis`. This means insert a dimension of size 1
-// dimension in the final shape. Dummy values are contributed to begin,
-// end and stride, while the new_axis_mask bit is set.
+// ```
+// [[[[1, 2, 3]]], [[[4, 5, 6]]], [[[7, 8, 9]]], [[[10, 11, 12]]]]
+// ```
 //
-// 4. `...` grab the full ranges from as many dimensions as needed to
-// fully specify a slice for every dimension of the input shape.
+// The output tensor has shape `[1, 2, 2, 3]` and value:
 //
-// 5. `:-3:-1` shows the use of negative indices. A negative index `i` associated
-// with a dimension that has shape `s` is converted to a positive index
-// `s + i`. So `-1` becomes `s-1` (i.e. the last element). This conversion
-// is done internally so begin, end and strides receive x, -3, and -1.
-// The appropriate begin_mask bit is set to indicate the start range is the
-// full range (ignoring the x).
+// ```
+// x = [[[[1, 2, 3], [4, 5, 6]],
+//       [[7, 8, 9], [10, 11, 12]]]]
+// ```
 //
-// 6. `:` indicates that the entire contents of the corresponding dimension
-// is selected. This is equivalent to `::` or `0::1`. begin, end, and strides
-// receive 0, 0, and 1, respectively. The appropriate bits in `begin_mask` and
-// `end_mask` are also set.
+// (3) For the following input of shape `[4, 2, 2, 1]` and block_size of 2:
 //
-// *Requirements*:
-//   `0 != strides[i] for i in [0, m)`
-//   `ellipsis_mask must be a power of two (only one ellipsis)`
+// ```
+// x = [[[[1], [3]], [[9], [11]]],
+//      [[[2], [4]], [[10], [12]]],
+//      [[[5], [7]], [[13], [15]]],
+//      [[[6], [8]], [[14], [16]]]]
+// ```
 //
-// Arguments:
+// The output tensor has shape `[1, 4, 4, 1]` and value:
 //
-//	begin: `begin[k]` specifies the offset into the `k`th range specification.
-// The exact dimension this corresponds to will be determined by context.
-// Out-of-bounds values will be silently clamped. If the `k`th bit of
-// `begin_mask` then `begin[k]` is ignored and the full range of the
-// appropriate dimension is used instead. Negative values causes indexing
-// to start from the highest element e.g. If `foo==[1,2,3]` then `foo[-1]==3`.
-//	end: `end[i]` is like `begin` with the exception that `end_mask` is
-// used to determine full ranges.
-//	strides: `strides[i]` specifies the increment in the `i`th specification
-// after extracting a given element. Negative indices will reverse
-// the original order. Out or range values are
-// clamped to `[0,dim[i]) if slice[i]>0` or `[-1,dim[i]-1] if slice[i] < 0`
-func StridedSlice(scope *Scope, input tf.Output, begin tf.Output, end tf.Output, strides tf.Output, optional ...StridedSliceAttr) (output tf.Output) {
+// ```
+// x = [[[[1],   [2],  [3],  [4]],
+//       [[5],   [6],  [7],  [8]],
+//       [[9],  [10], [11],  [12]],
+//       [[13], [14], [15],  [16]]]]
+// ```
+//
+// (4) For the following input of shape `[8, 1, 2, 1]` and block_size of 2:
+//
+// ```
+// x = [[[[1], [3]]], [[[9], [11]]], [[[2], [4]]], [[[10], [12]]],
+//      [[[5], [7]]], [[[13], [15]]], [[[6], [8]]], [[[14], [16]]]]
+// ```
+//
+// The output tensor has shape `[2, 2, 4, 1]` and value:
+//
+// ```
+// x = [[[[1], [2], [3], [4]],
+//       [[5], [6], [7], [8]]],
+//      [[[9], [10], [11], [12]],
+//       [[13], [14], [15], [16]]]]
+// ```
+func BatchToSpace(scope *Scope, input tf.Output, crops tf.Output, block_size int64) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"block_size": block_size}
 	opspec := tf.OpSpec{
-		Type: "StridedSlice",
+		Type: "BatchToSpace",
 		Input: []tf.Input{
-			input, begin, end, strides,
+			input, crops,
 		},
 		Attrs: attrs,
 	}
@@ -21780,140 +21974,100 @@ func StridedSlice(scope *Scope, input tf.Output, begin tf.Output, end tf.Output,
 	return op.Output(0)
 }
 
-// PriorityQueueV2Attr is an optional argument to PriorityQueueV2.
-type PriorityQueueV2Attr func(optionalAttr)
-
-// PriorityQueueV2ComponentTypes sets the optional component_types attribute to value.
+// Makes a new iterator from the given `dataset` and stores it in `iterator`.
 //
-// value: The type of each component in a value.
-// If not specified, defaults to <>
+// This operation may be executed multiple times. Each execution will reset the
+// iterator in `iterator` to the first element of `dataset`.
 //
-// REQUIRES: len(value) >= 0
-func PriorityQueueV2ComponentTypes(value []tf.DataType) PriorityQueueV2Attr {
-	return func(m optionalAttr) {
-		m["component_types"] = value
+// Returns the created operation.
+func MakeIterator(scope *Scope, dataset tf.Output, iterator tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// PriorityQueueV2Capacity sets the optional capacity attribute to value.
-//
-// value: The upper bound on the number of elements in this queue.
-// Negative numbers mean no limit.
-// If not specified, defaults to -1
-func PriorityQueueV2Capacity(value int64) PriorityQueueV2Attr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
+	opspec := tf.OpSpec{
+		Type: "MakeIterator",
+		Input: []tf.Input{
+			dataset, iterator,
+		},
 	}
+	return scope.AddOperation(opspec)
 }
 
-// PriorityQueueV2Container sets the optional container attribute to value.
+// Adjust the contrast of one or more images.
 //
-// value: If non-empty, this queue is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func PriorityQueueV2Container(value string) PriorityQueueV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// PriorityQueueV2SharedName sets the optional shared_name attribute to value.
+// `images` is a tensor of at least 3 dimensions.  The last 3 dimensions are
+// interpreted as `[height, width, channels]`.  The other dimensions only
+// represent a collection of images, such as `[batch, height, width, channels]`.
 //
-// value: If non-empty, this queue will be shared under the given name
-// across multiple sessions.
-// If not specified, defaults to ""
-func PriorityQueueV2SharedName(value string) PriorityQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// A queue that produces elements sorted by the first component value.
+// Contrast is adjusted independently for each channel of each image.
 //
-// Note that the PriorityQueue requires the first component of any element
-// to be a scalar int64, in addition to the other elements declared by
-// component_types.  Therefore calls to Enqueue and EnqueueMany (resp. Dequeue
-// and DequeueMany) on a PriorityQueue will all require (resp. output) one extra
-// entry in their input (resp. output) lists.
+// For each channel, the Op first computes the mean of the image pixels in the
+// channel and then adjusts each component of each pixel to
+// `(x - mean) * contrast_factor + mean`.
 //
 // Arguments:
-//	shapes: The shape of each component in a value. The length of this attr must
-// be either 0 or the same as the length of component_types. If the length of
-// this attr is 0, the shapes of queue elements are not constrained, and
-// only one element may be dequeued at a time.
+//	images: Images to adjust.  At least 3-D.
+//	contrast_factor: A float multiplier for adjusting contrast.
 //
-// Returns The handle to the queue.
-func PriorityQueueV2(scope *Scope, shapes []tf.Shape, optional ...PriorityQueueV2Attr) (handle tf.Output) {
+// Returns The contrast-adjusted image or images.
+func AdjustContrastv2(scope *Scope, images tf.Output, contrast_factor tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"shapes": shapes}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "PriorityQueueV2",
-
-		Attrs: attrs,
+		Type: "AdjustContrastv2",
+		Input: []tf.Input{
+			images, contrast_factor,
+		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// UnstageAttr is an optional argument to Unstage.
-type UnstageAttr func(optionalAttr)
-
-// UnstageCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func UnstageCapacity(value int64) UnstageAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
+// Gets the next output from the given iterator.
+func IteratorGetNext(scope *Scope, iterator tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// UnstageMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func UnstageMemoryLimit(value int64) UnstageAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "IteratorGetNext",
+		Input: []tf.Input{
+			iterator,
+		},
+		Attrs: attrs,
 	}
-}
-
-// UnstageContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func UnstageContainer(value string) UnstageAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// UnstageSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func UnstageSharedName(value string) UnstageAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
+	var idx int
+	var err error
+	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
+		scope.UpdateErr("IteratorGetNext", err)
+		return
 	}
+	return components
 }
 
-// Op is similar to a lightweight Dequeue.
+// Outputs the single element from the given dataset.
 //
-// The basic functionality is similar to dequeue with many fewer
-// capabilities and options.  This Op is optimized for performance.
-func Unstage(scope *Scope, dtypes []tf.DataType, optional ...UnstageAttr) (values []tf.Output) {
+// Arguments:
+//	dataset: A handle to a dataset that contains a single element.
+//
+//
+//
+// Returns The components of the single element of `dataset`.
+func DatasetToSingleElement(scope *Scope, dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "Unstage",
-
+		Type: "DatasetToSingleElement",
+		Input: []tf.Input{
+			dataset,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
@@ -21922,106 +22076,69 @@ func Unstage(scope *Scope, dtypes []tf.DataType, optional ...UnstageAttr) (value
 	}
 	var idx int
 	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("Unstage", err)
+	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
+		scope.UpdateErr("DatasetToSingleElement", err)
 		return
 	}
-	return values
+	return components
 }
 
-// ArgMaxAttr is an optional argument to ArgMax.
-type ArgMaxAttr func(optionalAttr)
-
-// ArgMaxOutputType sets the optional output_type attribute to value.
-// If not specified, defaults to DT_INT64
-func ArgMaxOutputType(value tf.DataType) ArgMaxAttr {
-	return func(m optionalAttr) {
-		m["output_type"] = value
-	}
-}
-
-// Returns the index with the largest value across dimensions of a tensor.
-//
-// Note that in case of ties the identity of the return value is not guaranteed.
+// Converts the given `resource_handle` representing an iterator to a string.
 //
 // Arguments:
+//	resource_handle: A handle to an iterator resource.
 //
-//	dimension: int32 or int64, must be in the range `[-rank(input), rank(input))`.
-// Describes which dimension of the input Tensor to reduce across. For vectors,
-// use dimension = 0.
-func ArgMax(scope *Scope, input tf.Output, dimension tf.Output, optional ...ArgMaxAttr) (output tf.Output) {
+// Returns A string representation of the given handle.
+func IteratorToStringHandle(scope *Scope, resource_handle tf.Output) (string_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ArgMax",
+		Type: "IteratorToStringHandle",
 		Input: []tf.Input{
-			input, dimension,
+			resource_handle,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceStridedSliceAssignAttr is an optional argument to ResourceStridedSliceAssign.
-type ResourceStridedSliceAssignAttr func(optionalAttr)
-
-// ResourceStridedSliceAssignBeginMask sets the optional begin_mask attribute to value.
-// If not specified, defaults to 0
-func ResourceStridedSliceAssignBeginMask(value int64) ResourceStridedSliceAssignAttr {
-	return func(m optionalAttr) {
-		m["begin_mask"] = value
-	}
-}
-
-// ResourceStridedSliceAssignEndMask sets the optional end_mask attribute to value.
-// If not specified, defaults to 0
-func ResourceStridedSliceAssignEndMask(value int64) ResourceStridedSliceAssignAttr {
-	return func(m optionalAttr) {
-		m["end_mask"] = value
-	}
-}
-
-// ResourceStridedSliceAssignEllipsisMask sets the optional ellipsis_mask attribute to value.
-// If not specified, defaults to 0
-func ResourceStridedSliceAssignEllipsisMask(value int64) ResourceStridedSliceAssignAttr {
-	return func(m optionalAttr) {
-		m["ellipsis_mask"] = value
-	}
-}
+// IteratorFromStringHandleAttr is an optional argument to IteratorFromStringHandle.
+type IteratorFromStringHandleAttr func(optionalAttr)
 
-// ResourceStridedSliceAssignNewAxisMask sets the optional new_axis_mask attribute to value.
-// If not specified, defaults to 0
-func ResourceStridedSliceAssignNewAxisMask(value int64) ResourceStridedSliceAssignAttr {
+// IteratorFromStringHandleOutputTypes sets the optional output_types attribute to value.
+//
+// value: If specified, defines the type of each tuple component in an
+// element produced by the resulting iterator.
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func IteratorFromStringHandleOutputTypes(value []tf.DataType) IteratorFromStringHandleAttr {
 	return func(m optionalAttr) {
-		m["new_axis_mask"] = value
+		m["output_types"] = value
 	}
 }
 
-// ResourceStridedSliceAssignShrinkAxisMask sets the optional shrink_axis_mask attribute to value.
-// If not specified, defaults to 0
-func ResourceStridedSliceAssignShrinkAxisMask(value int64) ResourceStridedSliceAssignAttr {
+// IteratorFromStringHandleOutputShapes sets the optional output_shapes attribute to value.
+//
+// value: If specified, defines the shape of each tuple component in an
+// element produced by the resulting iterator.
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func IteratorFromStringHandleOutputShapes(value []tf.Shape) IteratorFromStringHandleAttr {
 	return func(m optionalAttr) {
-		m["shrink_axis_mask"] = value
+		m["output_shapes"] = value
 	}
 }
 
-// Assign `value` to the sliced l-value reference of `ref`.
-//
-// The values of `value` are assigned to the positions in the variable
-// `ref` that are selected by the slice parameters. The slice parameters
-// `begin, `end`, `strides`, etc. work exactly as in `StridedSlice`.
+// Converts the given string representing a handle to an iterator to a resource.
 //
-// NOTE this op currently does not support broadcasting and so `value`'s
-// shape must be exactly the shape produced by the slice of `ref`.
+// Arguments:
+//	string_handle: A string representation of the given handle.
 //
-// Returns the created operation.
-func ResourceStridedSliceAssign(scope *Scope, ref tf.Output, begin tf.Output, end tf.Output, strides tf.Output, value tf.Output, optional ...ResourceStridedSliceAssignAttr) (o *tf.Operation) {
+// Returns A handle to an iterator resource.
+func IteratorFromStringHandle(scope *Scope, string_handle tf.Output, optional ...IteratorFromStringHandleAttr) (resource_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -22030,515 +22147,470 @@ func ResourceStridedSliceAssign(scope *Scope, ref tf.Output, begin tf.Output, en
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceStridedSliceAssign",
+		Type: "IteratorFromStringHandle",
 		Input: []tf.Input{
-			ref, begin, end, strides, value,
+			string_handle,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// QueueEnqueueV2Attr is an optional argument to QueueEnqueueV2.
-type QueueEnqueueV2Attr func(optionalAttr)
-
-// QueueEnqueueV2TimeoutMs sets the optional timeout_ms attribute to value.
+// Computes arctangent of `y/x` element-wise, respecting signs of the arguments.
 //
-// value: If the queue is full, this operation will block for up to
-// timeout_ms milliseconds.
-// Note: This option is not supported yet.
-// If not specified, defaults to -1
-func QueueEnqueueV2TimeoutMs(value int64) QueueEnqueueV2Attr {
-	return func(m optionalAttr) {
-		m["timeout_ms"] = value
+// This is the angle \( \theta \in [-\pi, \pi] \) such that
+// \[ x = r \cos(\theta) \]
+// and
+// \[ y = r \sin(\theta) \]
+// where \(r = \sqrt{x^2 + y^2} \).
+func Atan2(scope *Scope, y tf.Output, x tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Atan2",
+		Input: []tf.Input{
+			y, x,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Enqueues a tuple of one or more tensors in the given queue.
-//
-// The components input has k elements, which correspond to the components of
-// tuples stored in the given queue.
-//
-// N.B. If the queue is full, this operation will block until the given
-// element has been enqueued (or 'timeout_ms' elapses, if specified).
-//
-// Arguments:
-//	handle: The handle to a queue.
-//	components: One or more tensors from which the enqueued tensors should be taken.
-//
-// Returns the created operation.
-func QueueEnqueueV2(scope *Scope, handle tf.Output, components []tf.Output, optional ...QueueEnqueueV2Attr) (o *tf.Operation) {
+// Return a tensor with the same shape and contents as the input tensor or value.
+func Identity(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "QueueEnqueueV2",
+		Type: "Identity",
 		Input: []tf.Input{
-			handle, tf.OutputList(components),
+			input,
 		},
-		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// QueueDequeueManyV2Attr is an optional argument to QueueDequeueManyV2.
-type QueueDequeueManyV2Attr func(optionalAttr)
-
-// QueueDequeueManyV2TimeoutMs sets the optional timeout_ms attribute to value.
+// Gather slices from `params` axis `axis` according to `indices`.
 //
-// value: If the queue has fewer than n elements, this operation
-// will block for up to timeout_ms milliseconds.
-// Note: This option is not supported yet.
-// If not specified, defaults to -1
-func QueueDequeueManyV2TimeoutMs(value int64) QueueDequeueManyV2Attr {
-	return func(m optionalAttr) {
-		m["timeout_ms"] = value
-	}
-}
-
-// Dequeues `n` tuples of one or more tensors from the given queue.
+// `indices` must be an integer tensor of any dimension (usually 0-D or 1-D).
+// Produces an output tensor with shape `params.shape[:axis] + indices.shape +
+// params.shape[axis + 1:]` where:
 //
-// If the queue is closed and there are fewer than `n` elements, then an
-// OutOfRange error is returned.
+// ```python
+//     # Scalar indices (output is rank(params) - 1).
+//     output[a_0, ..., a_n, b_0, ..., b_n] =
+//       params[a_0, ..., a_n, indices, b_0, ..., b_n]
 //
-// This operation concatenates queue-element component tensors along the
-// 0th dimension to make a single component tensor.  All of the components
-// in the dequeued tuple will have size `n` in the 0th dimension.
+//     # Vector indices (output is rank(params)).
+//     output[a_0, ..., a_n, i, b_0, ..., b_n] =
+//       params[a_0, ..., a_n, indices[i], b_0, ..., b_n]
 //
-// This operation has `k` outputs, where `k` is the number of components in
-// the tuples stored in the given queue, and output `i` is the ith
-// component of the dequeued tuple.
+//     # Higher rank indices (output is rank(params) + rank(indices) - 1).
+//     output[a_0, ..., a_n, i, ..., j, b_0, ..., b_n] =
+//       params[a_0, ..., a_n, indices[i, ..., j], b_0, ..., b_n]
+// ```
 //
-// N.B. If the queue is empty, this operation will block until `n` elements
-// have been dequeued (or 'timeout_ms' elapses, if specified).
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/Gather.png" alt>
+// </div>
 //
 // Arguments:
-//	handle: The handle to a queue.
-//	n: The number of tuples to dequeue.
-//	component_types: The type of each component in a tuple.
+//	params: The tensor from which to gather values. Must be at least rank
+// `axis + 1`.
+//	indices: Index tensor. Must be in range `[0, params.shape[axis])`.
+//	axis: The axis in `params` to gather `indices` from. Defaults to the first
+// dimension. Supports negative indexes.
 //
-// Returns One or more tensors that were dequeued as a tuple.
-func QueueDequeueManyV2(scope *Scope, handle tf.Output, n tf.Output, component_types []tf.DataType, optional ...QueueDequeueManyV2Attr) (components []tf.Output) {
+// Returns Values from `params` gathered from indices given by `indices`, with
+// shape `params.shape[:axis] + indices.shape + params.shape[axis + 1:]`.
+func GatherV2(scope *Scope, params tf.Output, indices tf.Output, axis tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"component_types": component_types}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "QueueDequeueManyV2",
+		Type: "GatherV2",
 		Input: []tf.Input{
-			handle, n,
+			params, indices, axis,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Converts the given `resource_handle` representing an iterator to a variant tensor.
+//
+// Arguments:
+//	resource_handle: A handle to an iterator resource.
+//
+// Returns A variant tensor storing the state of the iterator contained in the
+// resource.
+func SerializeIterator(scope *Scope, resource_handle tf.Output) (serialized tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	var idx int
-	var err error
-	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
-		scope.UpdateErr("QueueDequeueManyV2", err)
-		return
+	opspec := tf.OpSpec{
+		Type: "SerializeIterator",
+		Input: []tf.Input{
+			resource_handle,
+		},
 	}
-	return components
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// EncodeBase64Attr is an optional argument to EncodeBase64.
-type EncodeBase64Attr func(optionalAttr)
+// FIFOQueueV2Attr is an optional argument to FIFOQueueV2.
+type FIFOQueueV2Attr func(optionalAttr)
 
-// EncodeBase64Pad sets the optional pad attribute to value.
+// FIFOQueueV2Shapes sets the optional shapes attribute to value.
 //
-// value: Bool whether padding is applied at the ends.
-// If not specified, defaults to false
-func EncodeBase64Pad(value bool) EncodeBase64Attr {
+// value: The shape of each component in a value. The length of this attr must
+// be either 0 or the same as the length of component_types. If the length of
+// this attr is 0, the shapes of queue elements are not constrained, and
+// only one element may be dequeued at a time.
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func FIFOQueueV2Shapes(value []tf.Shape) FIFOQueueV2Attr {
 	return func(m optionalAttr) {
-		m["pad"] = value
+		m["shapes"] = value
 	}
 }
 
-// Encode strings into web-safe base64 format.
-//
-// Refer to the following article for more information on base64 format:
-// en.wikipedia.org/wiki/Base64. Base64 strings may have padding with '=' at the
-// end so that the encoded has length multiple of 4. See Padding section of the
-// link above.
-//
-// Web-safe means that the encoder uses - and _ instead of + and /.
-//
-// Arguments:
-//	input: Strings to be encoded.
+// FIFOQueueV2Capacity sets the optional capacity attribute to value.
 //
-// Returns Input strings encoded in base64.
-func EncodeBase64(scope *Scope, input tf.Output, optional ...EncodeBase64Attr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "EncodeBase64",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
+// value: The upper bound on the number of elements in this queue.
+// Negative numbers mean no limit.
+// If not specified, defaults to -1
+func FIFOQueueV2Capacity(value int64) FIFOQueueV2Attr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Deprecated. Use TensorArrayCloseV3
-//
-// DEPRECATED at GraphDef version 26: Use TensorArrayCloseV3
+// FIFOQueueV2Container sets the optional container attribute to value.
 //
-// Returns the created operation.
-func TensorArrayCloseV2(scope *Scope, handle tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "TensorArrayCloseV2",
-		Input: []tf.Input{
-			handle,
-		},
+// value: If non-empty, this queue is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func FIFOQueueV2Container(value string) FIFOQueueV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	return scope.AddOperation(opspec)
 }
 
-// CropAndResizeGradImageAttr is an optional argument to CropAndResizeGradImage.
-type CropAndResizeGradImageAttr func(optionalAttr)
-
-// CropAndResizeGradImageMethod sets the optional method attribute to value.
+// FIFOQueueV2SharedName sets the optional shared_name attribute to value.
 //
-// value: A string specifying the interpolation method. Only 'bilinear' is
-// supported for now.
-// If not specified, defaults to "bilinear"
-func CropAndResizeGradImageMethod(value string) CropAndResizeGradImageAttr {
+// value: If non-empty, this queue will be shared under the given name
+// across multiple sessions.
+// If not specified, defaults to ""
+func FIFOQueueV2SharedName(value string) FIFOQueueV2Attr {
 	return func(m optionalAttr) {
-		m["method"] = value
+		m["shared_name"] = value
 	}
 }
 
-// Computes the gradient of the crop_and_resize op wrt the input image tensor.
+// A queue that produces elements in first-in first-out order.
 //
 // Arguments:
-//	grads: A 4-D tensor of shape `[num_boxes, crop_height, crop_width, depth]`.
-//	boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
-// specifies the coordinates of a box in the `box_ind[i]` image and is specified
-// in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
-// `y` is mapped to the image coordinate at `y * (image_height - 1)`, so as the
-// `[0, 1]` interval of normalized image height is mapped to
-// `[0, image_height - 1] in image height coordinates. We do allow y1 > y2, in
-// which case the sampled crop is an up-down flipped version of the original
-// image. The width dimension is treated similarly. Normalized coordinates
-// outside the `[0, 1]` range are allowed, in which case we use
-// `extrapolation_value` to extrapolate the input image values.
-//	box_ind: A 1-D tensor of shape `[num_boxes]` with int32 values in `[0, batch)`.
-// The value of `box_ind[i]` specifies the image that the `i`-th box refers to.
-//	image_size: A 1-D tensor with value `[batch, image_height, image_width, depth]`
-// containing the original image size. Both `image_height` and `image_width` need
-// to be positive.
-//
+//	component_types: The type of each component in a value.
 //
-// Returns A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
-func CropAndResizeGradImage(scope *Scope, grads tf.Output, boxes tf.Output, box_ind tf.Output, image_size tf.Output, T tf.DataType, optional ...CropAndResizeGradImageAttr) (output tf.Output) {
+// Returns The handle to the queue.
+func FIFOQueueV2(scope *Scope, component_types []tf.DataType, optional ...FIFOQueueV2Attr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"T": T}
+	attrs := map[string]interface{}{"component_types": component_types}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "CropAndResizeGradImage",
-		Input: []tf.Input{
-			grads, boxes, box_ind, image_size,
-		},
+		Type: "FIFOQueueV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Reads and outputs the entire contents of the input filename.
-func ReadFile(scope *Scope, filename tf.Output) (contents tf.Output) {
+// Produces a summary of any statistics recorded by the given statistics manager.
+func StatsAggregatorSummary(scope *Scope, iterator tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ReadFile",
+		Type: "StatsAggregatorSummary",
 		Input: []tf.Input{
-			filename,
+			iterator,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Concatenates tensors along one dimension.
+// Compute the pairwise cross product.
+//
+// `a` and `b` must be the same shape; they can either be simple 3-element vectors,
+// or any shape where the innermost dimension is 3. In the latter case, each pair
+// of corresponding 3-element vectors is cross-multiplied independently.
 //
 // Arguments:
-//	values: List of `N` Tensors to concatenate. Their ranks and types must match,
-// and their sizes must match in all dimensions except `concat_dim`.
-//	axis: 0-D.  The dimension along which to concatenate.  Must be in the
-// range [-rank(values), rank(values)).
+//	a: A tensor containing 3-element vectors.
+//	b: Another tensor, of same type and shape as `a`.
 //
-// Returns A `Tensor` with the concatenation of values stacked along the
-// `concat_dim` dimension.  This tensor's shape matches that of `values` except
-// in `concat_dim` where it has the sum of the sizes.
-func ConcatV2(scope *Scope, values []tf.Output, axis tf.Output) (output tf.Output) {
+// Returns Pairwise cross product of the vectors in `a` and `b`.
+func Cross(scope *Scope, a tf.Output, b tf.Output) (product tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ConcatV2",
+		Type: "Cross",
 		Input: []tf.Input{
-			tf.OutputList(values), axis,
+			a, b,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Forwards the value of an available tensor from `inputs` to `output`.
-//
-// `Merge` waits for at least one of the tensors in `inputs` to become available.
-// It is usually combined with `Switch` to implement branching.
+// Performs padding as a preprocessing step during a convolution.
 //
-// `Merge` forwards the first tensor to become available to `output`, and sets
-// `value_index` to its index in `inputs`.
+// Similar to FusedResizeAndPadConv2d, this op allows for an optimized
+// implementation where the spatial padding transformation stage is fused with the
+// im2col lookup, but in this case without the bilinear filtering required for
+// resizing. Fusing the padding prevents the need to write out the intermediate
+// results as whole tensors, reducing memory pressure, and we can get some latency
+// gains by merging the transformation calculations.
+// The data_format attribute for Conv2D isn't supported by this op, and 'NHWC'
+// order is used instead.
+// Internally this op uses a single per-graph scratch buffer, which means that it
+// will block if multiple versions are being run in parallel. This is because this
+// operator is primarily an optimization to minimize memory usage.
 //
 // Arguments:
-//	inputs: The input tensors, exactly one of which will become available.
+//	input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
+//	paddings: A two-column matrix specifying the padding sizes. The number of
+// rows must be the same as the rank of `input`.
+//	filter: 4-D with shape
+// `[filter_height, filter_width, in_channels, out_channels]`.
 //
-// Returns Will be set to the available input tensor.The index of the chosen input tensor in `inputs`.
-func Merge(scope *Scope, inputs []tf.Output) (output tf.Output, value_index tf.Output) {
+//	strides: 1-D of length 4.  The stride of the sliding window for each dimension
+// of `input`. Must be in the same order as the dimension specified with format.
+//	padding: The type of padding algorithm to use.
+func FusedPadConv2D(scope *Scope, input tf.Output, paddings tf.Output, filter tf.Output, mode string, strides []int64, padding string) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"mode": mode, "strides": strides, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "Merge",
+		Type: "FusedPadConv2D",
 		Input: []tf.Input{
-			tf.OutputList(inputs),
+			input, paddings, filter,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// QueueCloseV2Attr is an optional argument to QueueCloseV2.
-type QueueCloseV2Attr func(optionalAttr)
+// Conv2DBackpropInputAttr is an optional argument to Conv2DBackpropInput.
+type Conv2DBackpropInputAttr func(optionalAttr)
 
-// QueueCloseV2CancelPendingEnqueues sets the optional cancel_pending_enqueues attribute to value.
+// Conv2DBackpropInputUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
+// If not specified, defaults to true
+func Conv2DBackpropInputUseCudnnOnGpu(value bool) Conv2DBackpropInputAttr {
+	return func(m optionalAttr) {
+		m["use_cudnn_on_gpu"] = value
+	}
+}
+
+// Conv2DBackpropInputDataFormat sets the optional data_format attribute to value.
 //
-// value: If true, all pending enqueue requests that are
-// blocked on the given queue will be canceled.
-// If not specified, defaults to false
-func QueueCloseV2CancelPendingEnqueues(value bool) QueueCloseV2Attr {
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func Conv2DBackpropInputDataFormat(value string) Conv2DBackpropInputAttr {
 	return func(m optionalAttr) {
-		m["cancel_pending_enqueues"] = value
+		m["data_format"] = value
 	}
 }
 
-// Closes the given queue.
+// Conv2DBackpropInputDilations sets the optional dilations attribute to value.
 //
-// This operation signals that no more elements will be enqueued in the
-// given queue. Subsequent Enqueue(Many) operations will fail.
-// Subsequent Dequeue(Many) operations will continue to succeed if
-// sufficient elements remain in the queue. Subsequent Dequeue(Many)
-// operations that would block will fail immediately.
+// value: 1-D tensor of length 4.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each filter
+// element on that dimension. The dimension order is determined by the value of
+// `data_format`, see above for details. Dilations in the batch and depth
+// dimensions must be 1.
+// If not specified, defaults to <i:1 i:1 i:1 i:1 >
+func Conv2DBackpropInputDilations(value []int64) Conv2DBackpropInputAttr {
+	return func(m optionalAttr) {
+		m["dilations"] = value
+	}
+}
+
+// Computes the gradients of convolution with respect to the input.
 //
 // Arguments:
-//	handle: The handle to a queue.
+//	input_sizes: An integer vector representing the shape of `input`,
+// where `input` is a 4-D `[batch, height, width, channels]` tensor.
+//	filter: 4-D with shape
+// `[filter_height, filter_width, in_channels, out_channels]`.
+//	out_backprop: 4-D with shape `[batch, out_height, out_width, out_channels]`.
+// Gradients w.r.t. the output of the convolution.
+//	strides: The stride of the sliding window for each dimension of the input
+// of the convolution. Must be in the same order as the dimension specified with
+// format.
+//	padding: The type of padding algorithm to use.
 //
-// Returns the created operation.
-func QueueCloseV2(scope *Scope, handle tf.Output, optional ...QueueCloseV2Attr) (o *tf.Operation) {
+// Returns 4-D with shape `[batch, in_height, in_width, in_channels]`.  Gradient
+// w.r.t. the input of the convolution.
+func Conv2DBackpropInput(scope *Scope, input_sizes tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...Conv2DBackpropInputAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QueueCloseV2",
+		Type: "Conv2DBackpropInput",
 		Input: []tf.Input{
-			handle,
+			input_sizes, filter, out_backprop,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
-}
-
-// Computes inverse hyperbolic tangent of x element-wise.
-func Atanh(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Atanh",
-		Input: []tf.Input{
-			x,
-		},
-	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns true if queue is closed.
+// Interleave the values from the `data` tensors into a single tensor.
 //
-// This operation returns true if the queue is closed and false if the queue
-// is open.
+// Builds a merged tensor such that
 //
-// Arguments:
-//	handle: The handle to a queue.
-func QueueIsClosedV2(scope *Scope, handle tf.Output) (is_closed tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "QueueIsClosedV2",
-		Input: []tf.Input{
-			handle,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Returns the batched diagonal part of a batched tensor.
+// ```python
+//     merged[indices[m][i, ..., j], ...] = data[m][i, ..., j, ...]
+// ```
 //
-// This operation returns a tensor with the `diagonal` part
-// of the batched `input`. The `diagonal` part is computed as follows:
+// For example, if each `indices[m]` is scalar or vector, we have
 //
-// Assume `input` has `k` dimensions `[I, J, K, ..., M, N]`, then the output is a
-// tensor of rank `k - 1` with dimensions `[I, J, K, ..., min(M, N)]` where:
+// ```python
+//     # Scalar indices:
+//     merged[indices[m], ...] = data[m][...]
 //
-// `diagonal[i, j, k, ..., n] = input[i, j, k, ..., n, n]`.
+//     # Vector indices:
+//     merged[indices[m][i], ...] = data[m][i, ...]
+// ```
 //
-// The input must be at least a matrix.
+// Each `data[i].shape` must start with the corresponding `indices[i].shape`,
+// and the rest of `data[i].shape` must be constant w.r.t. `i`.  That is, we
+// must have `data[i].shape = indices[i].shape + constant`.  In terms of this
+// `constant`, the output shape is
+//
+//     merged.shape = [max(indices) + 1] + constant
+//
+// Values are merged in order, so if an index appears in both `indices[m][i]` and
+// `indices[n][j]` for `(m,i) < (n,j)` the slice `data[n][j]` will appear in the
+// merged result. If you do not need this guarantee, ParallelDynamicStitch might
+// perform better on some devices.
 //
 // For example:
 //
+// ```python
+//     indices[0] = 6
+//     indices[1] = [4, 1]
+//     indices[2] = [[5, 2], [0, 3]]
+//     data[0] = [61, 62]
+//     data[1] = [[41, 42], [11, 12]]
+//     data[2] = [[[51, 52], [21, 22]], [[1, 2], [31, 32]]]
+//     merged = [[1, 2], [11, 12], [21, 22], [31, 32], [41, 42],
+//               [51, 52], [61, 62]]
 // ```
-// # 'input' is [[[1, 0, 0, 0]
-//                [0, 2, 0, 0]
-//                [0, 0, 3, 0]
-//                [0, 0, 0, 4]],
-//               [[5, 0, 0, 0]
-//                [0, 6, 0, 0]
-//                [0, 0, 7, 0]
-//                [0, 0, 0, 8]]]
-//
-// and input.shape = (2, 4, 4)
 //
-// tf.matrix_diag_part(input) ==> [[1, 2, 3, 4], [5, 6, 7, 8]]
+// This method can be used to merge partitions created by `dynamic_partition`
+// as illustrated on the following example:
 //
-// which has shape (2, 4)
+// ```python
+//     # Apply a function (increment x_i) to the elements for which a certain
+//     # condition holds (x_i != -1 in this example).
+//     x=tf.constant([0.1, -1., 5.2, 4.3, -1., 7.4])
+//     condition_mask=tf.not_equal(x,tf.constant(-1.))
+//     partitioned_data = tf.dynamic_partition(
+//         x, tf.cast(condition_mask, tf.int32) , 2)
+//     partitioned_data[1] = partitioned_data[1] + 1.0
+//     condition_indices = tf.dynamic_partition(
+//         tf.range(tf.shape(x)[0]), tf.cast(condition_mask, tf.int32) , 2)
+//     x = tf.dynamic_stitch(condition_indices, partitioned_data)
+//     # Here x=[1.1, -1., 6.2, 5.3, -1, 8.4], the -1. values remain
+//     # unchanged.
 // ```
 //
-// Arguments:
-//	input: Rank `k` tensor where `k >= 2`.
-//
-// Returns The extracted diagonal(s) having shape
-// `diagonal.shape = input.shape[:-2] + [min(input.shape[-2:])]`.
-func MatrixDiagPart(scope *Scope, input tf.Output) (diagonal tf.Output) {
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicStitch.png" alt>
+// </div>
+func DynamicStitch(scope *Scope, indices []tf.Output, data []tf.Output) (merged tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "MatrixDiagPart",
+		Type: "DynamicStitch",
 		Input: []tf.Input{
-			input,
+			tf.OutputList(indices), tf.OutputList(data),
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the absolute value of a tensor.
+// Returns the truth value of (x == y) element-wise.
 //
-// Given a tensor `x`, this operation returns a tensor containing the absolute
-// value of each element in `x`. For example, if x is an input element and y is
-// an output element, this operation computes \\(y = |x|\\).
-func Abs(scope *Scope, x tf.Output) (y tf.Output) {
+// *NOTE*: `Equal` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func Equal(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Abs",
+		Type: "Equal",
 		Input: []tf.Input{
-			x,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Flushes and closes the summary writer.
-//
-// Also removes it from the resource manager. To reopen, use another
-// CreateSummaryFileWriter op.
-//
-// Arguments:
-//	writer: A handle to the summary writer resource.
-//
-// Returns the created operation.
-func CloseSummaryWriter(scope *Scope, writer tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "CloseSummaryWriter",
-		Input: []tf.Input{
-			writer,
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
-// StackV2Attr is an optional argument to StackV2.
-type StackV2Attr func(optionalAttr)
+// TensorArrayGatherV2Attr is an optional argument to TensorArrayGatherV2.
+type TensorArrayGatherV2Attr func(optionalAttr)
 
-// StackV2StackName sets the optional stack_name attribute to value.
-//
-// value: Overrides the name used for the temporary stack resource. Default
-// value is the name of the 'Stack' op (which is guaranteed unique).
-// If not specified, defaults to ""
-func StackV2StackName(value string) StackV2Attr {
+// TensorArrayGatherV2ElementShape sets the optional element_shape attribute to value.
+// If not specified, defaults to <unknown_rank:true >
+func TensorArrayGatherV2ElementShape(value tf.Shape) TensorArrayGatherV2Attr {
 	return func(m optionalAttr) {
-		m["stack_name"] = value
+		m["element_shape"] = value
 	}
 }
 
-// A stack that produces elements in first-in last-out order.
-//
-// Arguments:
-//	max_size: The maximum size of the stack if non-negative. If negative, the stack
-// size is unlimited.
-//	elem_type: The type of the elements on the stack.
+// Deprecated. Use TensorArrayGatherV3
 //
-// Returns The handle to the stack.
-func StackV2(scope *Scope, max_size tf.Output, elem_type tf.DataType, optional ...StackV2Attr) (handle tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArrayGatherV3
+func TensorArrayGatherV2(scope *Scope, handle tf.Output, indices tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayGatherV2Attr) (value tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"elem_type": elem_type}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "StackV2",
+		Type: "TensorArrayGatherV2",
 		Input: []tf.Input{
-			max_size,
+			handle, indices, flow_in,
 		},
 		Attrs: attrs,
 	}
@@ -22546,566 +22618,464 @@ func StackV2(scope *Scope, max_size tf.Output, elem_type tf.DataType, optional .
 	return op.Output(0)
 }
 
-// OrderedMapStageAttr is an optional argument to OrderedMapStage.
-type OrderedMapStageAttr func(optionalAttr)
-
-// OrderedMapStageCapacity sets the optional capacity attribute to value.
+// Interleave the values from the `data` tensors into a single tensor.
 //
-// value: Maximum number of elements in the Staging Area. If > 0, inserts
-// on the container will block when the capacity is reached.
-// If not specified, defaults to 0
+// Builds a merged tensor such that
 //
-// REQUIRES: value >= 0
-func OrderedMapStageCapacity(value int64) OrderedMapStageAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// OrderedMapStageMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// ```python
+//     merged[indices[m][i, ..., j], ...] = data[m][i, ..., j, ...]
+// ```
 //
-// REQUIRES: value >= 0
-func OrderedMapStageMemoryLimit(value int64) OrderedMapStageAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// OrderedMapStageContainer sets the optional container attribute to value.
+// For example, if each `indices[m]` is scalar or vector, we have
 //
-// value: If non-empty, this queue is placed in the given container. Otherwise,
-// a default container is used.
-// If not specified, defaults to ""
-func OrderedMapStageContainer(value string) OrderedMapStageAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// OrderedMapStageSharedName sets the optional shared_name attribute to value.
+// ```python
+//     # Scalar indices:
+//     merged[indices[m], ...] = data[m][...]
 //
-// value: It is necessary to match this name to the matching Unstage Op.
-// If not specified, defaults to ""
-func OrderedMapStageSharedName(value string) OrderedMapStageAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Stage (key, values) in the underlying container which behaves like a ordered
+//     # Vector indices:
+//     merged[indices[m][i], ...] = data[m][i, ...]
+// ```
 //
-// associative container.   Elements are ordered by key.
+// Each `data[i].shape` must start with the corresponding `indices[i].shape`,
+// and the rest of `data[i].shape` must be constant w.r.t. `i`.  That is, we
+// must have `data[i].shape = indices[i].shape + constant`.  In terms of this
+// `constant`, the output shape is
 //
-// Arguments:
-//	key: int64
+//     merged.shape = [max(indices) + 1] + constant
 //
-//	values: a list of tensors
-// dtypes A list of data types that inserted values should adhere to.
+// Values may be merged in parallel, so if an index appears in both `indices[m][i]`
+// and `indices[n][j]`, the result may be invalid. This differs from the normal
+// DynamicStitch operator, which defines the behavior in that case.
 //
+// For example:
 //
-// Returns the created operation.
-func OrderedMapStage(scope *Scope, key tf.Output, indices tf.Output, values []tf.Output, dtypes []tf.DataType, optional ...OrderedMapStageAttr) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "OrderedMapStage",
-		Input: []tf.Input{
-			key, indices, tf.OutputList(values),
-		},
-		Attrs: attrs,
-	}
-	return scope.AddOperation(opspec)
-}
-
-// StackPushV2Attr is an optional argument to StackPushV2.
-type StackPushV2Attr func(optionalAttr)
-
-// StackPushV2SwapMemory sets the optional swap_memory attribute to value.
+// ```python
+//     indices[0] = 6
+//     indices[1] = [4, 1]
+//     indices[2] = [[5, 2], [0, 3]]
+//     data[0] = [61, 62]
+//     data[1] = [[41, 42], [11, 12]]
+//     data[2] = [[[51, 52], [21, 22]], [[1, 2], [31, 32]]]
+//     merged = [[1, 2], [11, 12], [21, 22], [31, 32], [41, 42],
+//               [51, 52], [61, 62]]
+// ```
 //
-// value: Swap `elem` to CPU. Default to false.
-// If not specified, defaults to false
-func StackPushV2SwapMemory(value bool) StackPushV2Attr {
-	return func(m optionalAttr) {
-		m["swap_memory"] = value
+// This method can be used to merge partitions created by `dynamic_partition`
+// as illustrated in the following example:
+//
+// ```python
+//     # Apply function (increments x_i) on elements for which a certain condition
+//     # applies (x_i != -1 in this example).
+//     x=tf.constant([0.1, -1., 5.2, 4.3, -1., 7.4])
+//     condition_mask=tf.not_equal(x,tf.constant(-1.))
+//     partitioned_data = tf.dynamic_partition(
+//         x, tf.cast(condition_mask, tf.int32) , 2)
+//     partitioned_data[1] = partitioned_data[1] + 1.0
+//     condition_indices = tf.dynamic_partition(
+//         tf.range(tf.shape(x)[0]), tf.cast(condition_mask, tf.int32) , 2)
+//     x = tf.dynamic_stitch(condition_indices, partitioned_data)
+//     # Here x=[1.1, -1., 6.2, 5.3, -1, 8.4], the -1. values remain
+//     # unchanged.
+// ```
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicStitch.png" alt>
+// </div>
+func ParallelDynamicStitch(scope *Scope, indices []tf.Output, data []tf.Output) (merged tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ParallelDynamicStitch",
+		Input: []tf.Input{
+			tf.OutputList(indices), tf.OutputList(data),
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Push an element onto the stack.
-//
-// Arguments:
-//	handle: The handle to a stack.
-//	elem: The tensor to be pushed onto the stack.
+// Computes the gradient for the inverse of `x` wrt its input.
 //
-// Returns The same tensor as the input 'elem'.
-func StackPushV2(scope *Scope, handle tf.Output, elem tf.Output, optional ...StackPushV2Attr) (output tf.Output) {
+// Specifically, `grad = -dy * y*y`, where `y = 1/x`, and `dy`
+// is the corresponding input gradient.
+func InvGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "StackPushV2",
+		Type: "InvGrad",
 		Input: []tf.Input{
-			handle, elem,
+			y, dy,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
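
Review note: the formula in the `InvGrad` comment, `grad = -dy * y*y` with `y = 1/x`, can be checked numerically. The sketch below is a scalar model only. The name `invGrad` is hypothetical and the finite-difference check is an added sanity test, not something the op itself performs.

```go
package main

import (
	"fmt"
	"math"
)

// invGrad models the documented formula for scalars: grad = -dy * y * y,
// where y = 1/x is the forward output and dy is the incoming gradient.
func invGrad(y, dy float64) float64 {
	return -dy * y * y
}

func main() {
	x := 2.0
	y := 1 / x
	analytic := invGrad(y, 1.0)

	// Central finite difference of f(x) = 1/x as an independent check.
	h := 1e-6
	numeric := (1/(x+h) - 1/(x-h)) / (2 * h)

	fmt.Printf("analytic=%.6f numeric=%.6f close=%v\n",
		analytic, numeric, math.Abs(analytic-numeric) < 1e-6)
}
```

Taking `y` rather than `x` as input lets the backward pass reuse the forward output instead of recomputing the reciprocal.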
 
-// FusedBatchNormGradV2Attr is an optional argument to FusedBatchNormGradV2.
-type FusedBatchNormGradV2Attr func(optionalAttr)
+// StridedSliceAttr is an optional argument to StridedSlice.
+type StridedSliceAttr func(optionalAttr)
 
-// FusedBatchNormGradV2Epsilon sets the optional epsilon attribute to value.
+// StridedSliceBeginMask sets the optional begin_mask attribute to value.
 //
-// value: A small float number added to the variance of x.
-// If not specified, defaults to 0.0001
-func FusedBatchNormGradV2Epsilon(value float32) FusedBatchNormGradV2Attr {
+// value: a bitmask where a bit i being 1 means to ignore the begin
+// value and instead use the largest interval possible. At runtime
+// begin[i] will be replaced with `[0, n-1)` if `stride[i] > 0` or
+// `[-1, n-1]` if `stride[i] < 0`
+// If not specified, defaults to 0
+func StridedSliceBeginMask(value int64) StridedSliceAttr {
 	return func(m optionalAttr) {
-		m["epsilon"] = value
+		m["begin_mask"] = value
 	}
 }
 
-// FusedBatchNormGradV2DataFormat sets the optional data_format attribute to value.
+// StridedSliceEndMask sets the optional end_mask attribute to value.
 //
-// value: The data format for y_backprop, x, x_backprop.
-// Either "NHWC" (default) or "NCHW".
-// If not specified, defaults to "NHWC"
-func FusedBatchNormGradV2DataFormat(value string) FusedBatchNormGradV2Attr {
+// value: analogous to `begin_mask`
+// If not specified, defaults to 0
+func StridedSliceEndMask(value int64) StridedSliceAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["end_mask"] = value
 	}
 }
 
-// FusedBatchNormGradV2IsTraining sets the optional is_training attribute to value.
+// StridedSliceEllipsisMask sets the optional ellipsis_mask attribute to value.
 //
-// value: A bool value to indicate the operation is for training (default)
-// or inference.
-// If not specified, defaults to true
-func FusedBatchNormGradV2IsTraining(value bool) FusedBatchNormGradV2Attr {
+// value: a bitmask where bit `i` being 1 means the `i`th
+// position is actually an ellipsis. One bit at most can be 1.
+// If `ellipsis_mask == 0`, then an implicit ellipsis mask of `1 << (m+1)`
+// is provided. This means that `foo[3:5] == foo[3:5, ...]`. An ellipsis
+// implicitly creates as many range specifications as necessary to fully
+// specify the sliced range for every dimension. For example for a 4-dimensional
+// tensor `foo` the slice `foo[2, ..., 5:8]` implies `foo[2, :, :, 5:8]`.
+// If not specified, defaults to 0
+func StridedSliceEllipsisMask(value int64) StridedSliceAttr {
 	return func(m optionalAttr) {
-		m["is_training"] = value
+		m["ellipsis_mask"] = value
 	}
 }
 
-// Gradient for batch normalization.
-//
-// Note that the size of 4D Tensors are defined by either "NHWC" or "NCHW".
-// The size of 1D Tensors matches the dimension C of the 4D Tensors.
-//
-// Arguments:
-//	y_backprop: A 4D Tensor for the gradient with respect to y.
-//	x: A 4D Tensor for input data.
-//	scale: A 1D Tensor for scaling factor, to scale the normalized x.
-//	reserve_space_1: When is_training is True, a 1D Tensor for the computed batch
-// mean to be reused in gradient computation. When is_training is
-// False, a 1D Tensor for the population mean to be reused in both
-// 1st and 2nd order gradient computation.
-//	reserve_space_2: When is_training is True, a 1D Tensor for the computed batch
-// variance (inverted variance in the cuDNN case) to be reused in
-// gradient computation. When is_training is False, a 1D Tensor
-// for the population variance to be reused in both 1st and 2nd
-// order gradient computation.
+// StridedSliceNewAxisMask sets the optional new_axis_mask attribute to value.
 //
-// Returns A 4D Tensor for the gradient with respect to x.A 1D Tensor for the gradient with respect to scale.A 1D Tensor for the gradient with respect to offset.Unused placeholder to match the mean input in FusedBatchNorm.Unused placeholder to match the variance input
-// in FusedBatchNorm.
-func FusedBatchNormGradV2(scope *Scope, y_backprop tf.Output, x tf.Output, scale tf.Output, reserve_space_1 tf.Output, reserve_space_2 tf.Output, optional ...FusedBatchNormGradV2Attr) (x_backprop tf.Output, scale_backprop tf.Output, offset_backprop tf.Output, reserve_space_3 tf.Output, reserve_space_4 tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
+// value: a bitmask where bit `i` being 1 means the `i`th
+// specification creates a new shape 1 dimension. For example
+// `foo[:4, tf.newaxis, :2]` would produce a shape `(4, 1, 2)` tensor.
+// If not specified, defaults to 0
+func StridedSliceNewAxisMask(value int64) StridedSliceAttr {
+	return func(m optionalAttr) {
+		m["new_axis_mask"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "FusedBatchNormGradV2",
-		Input: []tf.Input{
-			y_backprop, x, scale, reserve_space_1, reserve_space_2,
-		},
-		Attrs: attrs,
+}
+
+// StridedSliceShrinkAxisMask sets the optional shrink_axis_mask attribute to value.
+//
+// value: a bitmask where bit `i` implies that the `i`th
+// specification should shrink the dimensionality. begin and end
+// must imply a slice of size 1 in the dimension. For example in
+// python one might do `foo[:, 3, :]` which would result in
+// `shrink_axis_mask` being 2.
+// If not specified, defaults to 0
+func StridedSliceShrinkAxisMask(value int64) StridedSliceAttr {
+	return func(m optionalAttr) {
+		m["shrink_axis_mask"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
 }
 
-// Creates a TensorArray for storing the gradients of values in the given handle.
+// Return a strided slice from `input`.
 //
-// If the given TensorArray gradient already exists, returns a reference to it.
+// Note, most python users will want to use the Python `Tensor.__getitem__`
+// or `Variable.__getitem__` rather than this op directly.
 //
-// Locks the size of the original TensorArray by disabling its dynamic size flag.
+// The goal of this op is to produce a new tensor with a subset of
+// the elements from the `n` dimensional `input` tensor. The subset is chosen using
+// a sequence of `m` sparse range specifications encoded into the arguments
+// of this function. Note, in some cases
+// `m` could be equal to `n`, but this need not be the case. Each
+// range specification entry can be one of the following:
 //
-// **A note about the input flow_in:**
+// - An ellipsis (...). Ellipses are used to imply zero or more
+//   dimensions of full-dimension selection and are produced using
+//   `ellipsis_mask`. For example, `foo[...]` is the identity slice.
 //
-// The handle flow_in forces the execution of the gradient lookup to occur
-// only after certain other operations have occurred.  For example, when
-// the forward TensorArray is dynamically sized, writes to this TensorArray
-// may resize the object.  The gradient TensorArray is statically sized based
-// on the size of the forward TensorArray when this operation executes.
-// Furthermore, the size of the forward TensorArray is frozen by this call.
-// As a result, the flow is used to ensure that the call to generate the gradient
-// TensorArray only happens after all writes are executed.
+// - A new axis. This is used to insert a new shape=1 dimension and is
+//   produced using `new_axis_mask`. For example, `foo[tf.newaxis, ...]` where
+//   `foo` is shape `(3, 4)` produces a `(1, 3, 4)` tensor.
 //
-// In the case of dynamically sized TensorArrays, gradient computation should
-// only be performed on read operations that have themselves been chained via
-// flow to occur only after all writes have executed. That way the final size
-// of the forward TensorArray is known when this operation is called.
 //
-// **A note about the source attribute:**
+// - A range `begin:end:stride`. This is used to specify how much to choose from
+//   a given dimension. `stride` can be any integer but 0.  `begin` is an integer
+//   which represents the index of the first value to select while `end` represents
+//   the index of the last value to select. The number of values selected in each
+//   dimension is `end - begin` if `stride > 0` and `begin - end` if `stride < 0`.
+//   `begin` and `end` can be negative where `-1` is the last element, `-2` is
+//   the second to last. `begin_mask` controls whether to replace the explicitly
+//   given `begin` with an implicit effective value of `0` if `stride > 0` and
+//   `-1` if `stride < 0`. `end_mask` is analogous but produces the number
+//   required to create the largest open interval. For example, given a shape
+//   `(3,)` tensor `foo[:]`, the effective `begin` and `end` are `0` and `3`. Do
+//   not assume this is equivalent to `foo[0:-1]` which has an effective `begin`
+//   and `end` of `0` and `2`. Another example is `foo[-2::-1]` which reverses the
+//   first dimension of a tensor while dropping the last element (in the original
+//   order of elements). For example `foo = [1,2,3,4]; foo[-2::-1]` is `[3,2,1]`.
 //
-// TensorArray gradient calls use an accumulator TensorArray object.  If
-// multiple gradients are calculated and run in the same session, the multiple
-// gradient nodes may accidentally flow through the same accumulator TensorArray.
-// This double counts and generally breaks the TensorArray gradient flow.
+// - A single index. This is used to keep only elements that have a given
+//   index. For example, `foo[2, :]` on a shape `(5,6)` tensor produces a
+//   shape `(6,)` tensor. This is encoded in `begin` and `end` and
+//   `shrink_axis_mask`.
 //
-// The solution is to identify which gradient call this particular
-// TensorArray gradient is being called in.  This is performed by identifying
-// a unique string (e.g. "gradients", "gradients_1", ...) from the input
-// gradient Tensor's name.  This string is used as a suffix when creating
-// the TensorArray gradient object here (the attribute `source`).
+// Each conceptual range specification is encoded in the op's argument. This
+// encoding is best understood by considering a non-trivial example. In
+// particular,
+// `foo[1, 2:4, None, ..., :-3:-1, :]` will be encoded as
 //
-// The attribute `source` is added as a suffix to the forward TensorArray's
-// name when performing the creation / lookup, so that each separate gradient
-// calculation gets its own TensorArray accumulator.
+// ```
+// begin = [1, 2, x, x, 0, x] # x denotes don't care (usually 0)
+// end = [2, 4, x, x, -3, x]
+// strides = [1, 1, x, x, -1, 1]
+// begin_mask = 1<<4 | 1 << 5 = 48
+// end_mask = 1<<5 = 32
+// ellipsis_mask = 1<<3 = 8
+// new_axis_mask = 1<<2 = 4
+// shrink_axis_mask = 1<<0 = 1
+// ```
 //
-// Arguments:
-//	handle: The handle to the forward TensorArray.
-//	flow_in: A float scalar that enforces proper chaining of operations.
-//	source: The gradient source string, used to decide which gradient TensorArray
-// to return.
-func TensorArrayGradV3(scope *Scope, handle tf.Output, flow_in tf.Output, source string) (grad_handle tf.Output, flow_out tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"source": source}
-	opspec := tf.OpSpec{
-		Type: "TensorArrayGradV3",
-		Input: []tf.Input{
-			handle, flow_in,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
-}
-
-// Compare values of `input` to `threshold` and pack resulting bits into a `uint8`.
+// In this case if `foo.shape` is (5, 5, 5, 5, 5, 5) the final shape of
+// the slice becomes (2, 1, 5, 5, 2, 5).
+// Let us walk step by step through each argument specification.
 //
-// Each comparison returns a boolean `true` (if `input_value > threshold`)
-// or and `false` otherwise.
+// 1.  The first argument in the example slice is turned into `begin = 1` and
+// `end = begin + 1 = 2`. To disambiguate from the original spec `2:4` we
+// also set the appropriate bit in `shrink_axis_mask`.
 //
-// This operation is useful for Locality-Sensitive-Hashing (LSH) and other
-// algorithms that use hashing approximations of cosine and `L2` distances;
-// codes can be generated from an input via:
+// 2. `2:4` contributes 2, 4, 1 to begin, end, and stride. All masks have
+// zero bits contributed.
 //
-// ```python
-// codebook_size = 50
-// codebook_bits = codebook_size * 32
-// codebook = tf.get_variable('codebook', [x.shape[-1].value, codebook_bits],
-//                            dtype=x.dtype,
-//                            initializer=tf.orthogonal_initializer())
-// codes = compare_and_threshold(tf.matmul(x, codebook), threshold=0.)
-// codes = tf.bitcast(codes, tf.int32)  # go from uint8 to int32
-// # now codes has shape x.shape[:-1] + [codebook_size]
-// ```
+// 3. None is a synonym for `tf.newaxis`. This means insert a dimension of size 1
+// in the final shape. Dummy values are contributed to begin,
+// end and stride, while the new_axis_mask bit is set.
 //
-// **NOTE**: Currently, the innermost dimension of the tensor must be divisible
-// by 8.
+// 4. `...` grabs the full ranges from as many dimensions as needed to
+// fully specify a slice for every dimension of the input shape.
 //
-// Given an `input` shaped `[s0, s1, ..., s_n]`, the output is
-// a `uint8` tensor shaped `[s0, s1, ..., s_n / 8]`.
+// 5. `:-3:-1` shows the use of negative indices. A negative index `i` associated
+// with a dimension that has shape `s` is converted to a positive index
+// `s + i`. So `-1` becomes `s-1` (i.e. the last element). This conversion
+// is done internally so begin, end and strides receive x, -3, and -1.
+// The appropriate begin_mask bit is set to indicate the start range is the
+// full range (ignoring the x).
+//
+// 6. `:` indicates that the entire contents of the corresponding dimension
+// is selected. This is equivalent to `::` or `0::1`. begin, end, and strides
+// receive 0, 0, and 1, respectively. The appropriate bits in `begin_mask` and
+// `end_mask` are also set.
+//
+// *Requirements*:
+//   `0 != strides[i] for i in [0, m)`
+//   `ellipsis_mask must be a power of two (only one ellipsis)`
 //
 // Arguments:
-//	input: Values to compare against `threshold` and bitpack.
-//	threshold: Threshold to compare against.
 //
-// Returns The bitpacked comparisons.
-func CompareAndBitpack(scope *Scope, input tf.Output, threshold tf.Output) (output tf.Output) {
+//	begin: `begin[k]` specifies the offset into the `k`th range specification.
+// The exact dimension this corresponds to will be determined by context.
+// Out-of-bounds values will be silently clamped. If the `k`th bit of
+// `begin_mask` is set then `begin[k]` is ignored and the full range of the
+// appropriate dimension is used instead. Negative values cause indexing
+// to start from the highest element, e.g. if `foo==[1,2,3]` then `foo[-1]==3`.
+//	end: `end[i]` is like `begin` with the exception that `end_mask` is
+// used to determine full ranges.
+//	strides: `strides[i]` specifies the increment in the `i`th specification
+// after extracting a given element. Negative indices will reverse
+// the original order. Out of range values are
+// clamped to `[0,dim[i])` if `slice[i] > 0` or `[-1,dim[i]-1]` if `slice[i] < 0`
+func StridedSlice(scope *Scope, input tf.Output, begin tf.Output, end tf.Output, strides tf.Output, optional ...StridedSliceAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "CompareAndBitpack",
+		Type: "StridedSlice",
 		Input: []tf.Input{
-			input, threshold,
+			input, begin, end, strides,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Push an element onto the tensor_array.
+// PriorityQueueV2Attr is an optional argument to PriorityQueueV2.
+type PriorityQueueV2Attr func(optionalAttr)
+
+// PriorityQueueV2ComponentTypes sets the optional component_types attribute to value.
 //
-// Arguments:
-//	handle: The handle to a TensorArray.
-//	index: The position to write to inside the TensorArray.
-//	value: The tensor to write to the TensorArray.
-//	flow_in: A float scalar that enforces proper chaining of operations.
+// value: The type of each component in a value.
+// If not specified, defaults to <>
 //
-// Returns A float scalar that enforces proper chaining of operations.
-func TensorArrayWriteV3(scope *Scope, handle tf.Output, index tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "TensorArrayWriteV3",
-		Input: []tf.Input{
-			handle, index, value, flow_in,
-		},
+// REQUIRES: len(value) >= 0
+func PriorityQueueV2ComponentTypes(value []tf.DataType) PriorityQueueV2Attr {
+	return func(m optionalAttr) {
+		m["component_types"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Scatter the data from the input value into specific TensorArray elements.
-//
-// `indices` must be a vector, its length must match the first dim of `value`.
-//
-// Arguments:
-//	handle: The handle to a TensorArray.
-//	indices: The locations at which to write the tensor elements.
-//	value: The concatenated tensor to write to the TensorArray.
-//	flow_in: A float scalar that enforces proper chaining of operations.
+// PriorityQueueV2Capacity sets the optional capacity attribute to value.
 //
-// Returns A float scalar that enforces proper chaining of operations.
-func TensorArrayScatterV3(scope *Scope, handle tf.Output, indices tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "TensorArrayScatterV3",
-		Input: []tf.Input{
-			handle, indices, value, flow_in,
-		},
+// value: The upper bound on the number of elements in this queue.
+// Negative numbers mean no limit.
+// If not specified, defaults to -1
+func PriorityQueueV2Capacity(value int64) PriorityQueueV2Attr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// TensorArrayConcatV3Attr is an optional argument to TensorArrayConcatV3.
-type TensorArrayConcatV3Attr func(optionalAttr)
-
-// TensorArrayConcatV3ElementShapeExcept0 sets the optional element_shape_except0 attribute to value.
+// PriorityQueueV2Container sets the optional container attribute to value.
 //
-// value: The expected shape of an element, if known,
-// excluding the first dimension. Used to validate the shapes of
-// TensorArray elements. If this shape is not fully specified, concatenating
-// zero-size TensorArrays is an error.
-// If not specified, defaults to <unknown_rank:true >
-func TensorArrayConcatV3ElementShapeExcept0(value tf.Shape) TensorArrayConcatV3Attr {
+// value: If non-empty, this queue is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func PriorityQueueV2Container(value string) PriorityQueueV2Attr {
 	return func(m optionalAttr) {
-		m["element_shape_except0"] = value
+		m["container"] = value
 	}
 }
 
-// Concat the elements from the TensorArray into value `value`.
-//
-// Takes `T` elements of shapes
-//
-//   ```
-//   (n0 x d0 x d1 x ...), (n1 x d0 x d1 x ...), ..., (n(T-1) x d0 x d1 x ...)
-//   ```
-//
-// and concatenates them into a Tensor of shape:
+// PriorityQueueV2SharedName sets the optional shared_name attribute to value.
 //
-//   ```(n0 + n1 + ... + n(T-1) x d0 x d1 x ...)```
+// value: If non-empty, this queue will be shared under the given name
+// across multiple sessions.
+// If not specified, defaults to ""
+func PriorityQueueV2SharedName(value string) PriorityQueueV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// A queue that produces elements sorted by the first component value.
 //
-// All elements must have the same shape (excepting the first dimension).
+// Note that the PriorityQueue requires the first component of any element
+// to be a scalar int64, in addition to the other elements declared by
+// component_types.  Therefore calls to Enqueue and EnqueueMany (resp. Dequeue
+// and DequeueMany) on a PriorityQueue will all require (resp. output) one extra
+// entry in their input (resp. output) lists.
 //
 // Arguments:
-//	handle: The handle to a TensorArray.
-//	flow_in: A float scalar that enforces proper chaining of operations.
-//	dtype: The type of the elem that is returned.
+//	shapes: The shape of each component in a value. The length of this attr must
+// be either 0 or the same as the length of component_types. If the length of
+// this attr is 0, the shapes of queue elements are not constrained, and
+// only one element may be dequeued at a time.
 //
-// Returns All of the elements in the TensorArray, concatenated along the first
-// axis.A vector of the row sizes of the original T elements in the
-// value output.  In the example above, this would be the values:
-// `(n1, n2, ..., n(T-1))`.
-func TensorArrayConcatV3(scope *Scope, handle tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayConcatV3Attr) (value tf.Output, lengths tf.Output) {
+// Returns The handle to the queue.
+func PriorityQueueV2(scope *Scope, shapes []tf.Shape, optional ...PriorityQueueV2Attr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{"shapes": shapes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayConcatV3",
-		Input: []tf.Input{
-			handle, flow_in,
-		},
+		Type: "PriorityQueueV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// ParameterizedTruncatedNormalAttr is an optional argument to ParameterizedTruncatedNormal.
-type ParameterizedTruncatedNormalAttr func(optionalAttr)
+// UnstageAttr is an optional argument to Unstage.
+type UnstageAttr func(optionalAttr)
 
-// ParameterizedTruncatedNormalSeed sets the optional seed attribute to value.
-//
-// value: If either `seed` or `seed2` are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
+// UnstageCapacity sets the optional capacity attribute to value.
 // If not specified, defaults to 0
-func ParameterizedTruncatedNormalSeed(value int64) ParameterizedTruncatedNormalAttr {
+//
+// REQUIRES: value >= 0
+func UnstageCapacity(value int64) UnstageAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["capacity"] = value
 	}
 }
 
-// ParameterizedTruncatedNormalSeed2 sets the optional seed2 attribute to value.
-//
-// value: A second seed to avoid seed collision.
+// UnstageMemoryLimit sets the optional memory_limit attribute to value.
 // If not specified, defaults to 0
-func ParameterizedTruncatedNormalSeed2(value int64) ParameterizedTruncatedNormalAttr {
+//
+// REQUIRES: value >= 0
+func UnstageMemoryLimit(value int64) UnstageAttr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// Outputs random values from a normal distribution. The parameters may each be a
-//
-// scalar which applies to the entire output, or a vector of length shape[0] which
-// stores the parameters for each batch.
-//
-// Arguments:
-//	shape: The shape of the output tensor. Batches are indexed by the 0th dimension.
-//	means: The mean parameter of each batch.
-//	stdevs: The standard deviation parameter of each batch. Must be greater than 0.
-//	minvals: The minimum cutoff. May be -infinity.
-//	maxvals: The maximum cutoff. May be +infinity, and must be more than the minval
-// for each batch.
-//
-// Returns A matrix of shape num_batches x samples_per_batch, filled with random
-// truncated normal values using the parameters for each row.
-func ParameterizedTruncatedNormal(scope *Scope, shape tf.Output, means tf.Output, stdevs tf.Output, minvals tf.Output, maxvals tf.Output, optional ...ParameterizedTruncatedNormalAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
+// UnstageContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func UnstageContainer(value string) UnstageAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "ParameterizedTruncatedNormal",
-		Input: []tf.Input{
-			shape, means, stdevs, minvals, maxvals,
-		},
-		Attrs: attrs,
+}
+
+// UnstageSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func UnstageSharedName(value string) UnstageAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Returns a diagonal tensor with a given diagonal values.
-//
-// Given a `diagonal`, this operation returns a tensor with the `diagonal` and
-// everything else padded with zeros. The diagonal is computed as follows:
-//
-// Assume `diagonal` has dimensions [D1,..., Dk], then the output is a tensor of
-// rank 2k with dimensions [D1,..., Dk, D1,..., Dk] where:
-//
-// `output[i1,..., ik, i1,..., ik] = diagonal[i1, ..., ik]` and 0 everywhere else.
-//
-// For example:
-//
-// ```
-// # 'diagonal' is [1, 2, 3, 4]
-// tf.diag(diagonal) ==> [[1, 0, 0, 0]
-//                        [0, 2, 0, 0]
-//                        [0, 0, 3, 0]
-//                        [0, 0, 0, 4]]
-// ```
+// Op is similar to a lightweight Dequeue.
 //
-// Arguments:
-//	diagonal: Rank k tensor where k is at most 1.
-func Diag(scope *Scope, diagonal tf.Output) (output tf.Output) {
+// The basic functionality is similar to dequeue with many fewer
+// capabilities and options.  This Op is optimized for performance.
+func Unstage(scope *Scope, dtypes []tf.DataType, optional ...UnstageAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Diag",
-		Input: []tf.Input{
-			diagonal,
-		},
+		Type: "Unstage",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Split the data from the input value into TensorArray elements.
-//
-// Assuming that `lengths` takes on values
-//
-//   ```(n0, n1, ..., n(T-1))```
-//
-// and that `value` has shape
-//
-//   ```(n0 + n1 + ... + n(T-1) x d0 x d1 x ...)```,
-//
-// this splits values into a TensorArray with T tensors.
-//
-// TensorArray index t will be the subtensor of values with starting position
-//
-//   ```(n0 + n1 + ... + n(t-1), 0, 0, ...)```
-//
-// and having size
-//
-//   ```nt x d0 x d1 x ...```
-//
-// Arguments:
-//	handle: The handle to a TensorArray.
-//	value: The concatenated tensor to write to the TensorArray.
-//	lengths: The vector of lengths, how to split the rows of value into the
-// TensorArray.
-//	flow_in: A float scalar that enforces proper chaining of operations.
-//
-// Returns A float scalar that enforces proper chaining of operations.
-func TensorArraySplitV3(scope *Scope, handle tf.Output, value tf.Output, lengths tf.Output, flow_in tf.Output) (flow_out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "TensorArraySplitV3",
-		Input: []tf.Input{
-			handle, value, lengths, flow_in,
-		},
+	var idx int
+	var err error
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("Unstage", err)
+		return
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return values
 }
 
-// SerializeSparseAttr is an optional argument to SerializeSparse.
-type SerializeSparseAttr func(optionalAttr)
+// ArgMaxAttr is an optional argument to ArgMax.
+type ArgMaxAttr func(optionalAttr)
 
-// SerializeSparseOutType sets the optional out_type attribute to value.
-//
-// value: The `dtype` to use for serialization; the supported types are `string`
-// (default) and `variant`.
-// If not specified, defaults to DT_STRING
-func SerializeSparseOutType(value tf.DataType) SerializeSparseAttr {
+// ArgMaxOutputType sets the optional output_type attribute to value.
+// If not specified, defaults to DT_INT64
+func ArgMaxOutputType(value tf.DataType) ArgMaxAttr {
 	return func(m optionalAttr) {
-		m["out_type"] = value
+		m["output_type"] = value
 	}
 }
 
-// Serialize a `SparseTensor` into a `[3]` `Tensor` object.
+// Returns the index with the largest value across dimensions of a tensor.
+//
+// Note that in case of ties the identity of the return value is not guaranteed.
 //
 // Arguments:
-//	sparse_indices: 2-D.  The `indices` of the `SparseTensor`.
-//	sparse_values: 1-D.  The `values` of the `SparseTensor`.
-//	sparse_shape: 1-D.  The `shape` of the `SparseTensor`.
-func SerializeSparse(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...SerializeSparseAttr) (serialized_sparse tf.Output) {
+//
+//	dimension: int32 or int64, must be in the range `[-rank(input), rank(input))`.
+// Describes which dimension of the input Tensor to reduce across. For vectors,
+// use dimension = 0.
+func ArgMax(scope *Scope, input tf.Output, dimension tf.Output, optional ...ArgMaxAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -23114,9 +23084,9 @@ func SerializeSparse(scope *Scope, sparse_indices tf.Output, sparse_values tf.Ou
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SerializeSparse",
+		Type: "ArgMax",
 		Input: []tf.Input{
-			sparse_indices, sparse_values, sparse_shape,
+			input, dimension,
 		},
 		Attrs: attrs,
 	}
@@ -23124,329 +23094,299 @@ func SerializeSparse(scope *Scope, sparse_indices tf.Output, sparse_values tf.Ou
 	return op.Output(0)
 }
 
-// RandomShuffleQueueV2Attr is an optional argument to RandomShuffleQueueV2.
-type RandomShuffleQueueV2Attr func(optionalAttr)
-
-// RandomShuffleQueueV2Shapes sets the optional shapes attribute to value.
-//
-// value: The shape of each component in a value. The length of this attr must
-// be either 0 or the same as the length of component_types. If the length of
-// this attr is 0, the shapes of queue elements are not constrained, and
-// only one element may be dequeued at a time.
-// If not specified, defaults to <>
-//
-// REQUIRES: len(value) >= 0
-func RandomShuffleQueueV2Shapes(value []tf.Shape) RandomShuffleQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shapes"] = value
-	}
-}
+// ResourceStridedSliceAssignAttr is an optional argument to ResourceStridedSliceAssign.
+type ResourceStridedSliceAssignAttr func(optionalAttr)
 
-// RandomShuffleQueueV2Capacity sets the optional capacity attribute to value.
-//
-// value: The upper bound on the number of elements in this queue.
-// Negative numbers mean no limit.
-// If not specified, defaults to -1
-func RandomShuffleQueueV2Capacity(value int64) RandomShuffleQueueV2Attr {
+// ResourceStridedSliceAssignBeginMask sets the optional begin_mask attribute to value.
+// If not specified, defaults to 0
+func ResourceStridedSliceAssignBeginMask(value int64) ResourceStridedSliceAssignAttr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["begin_mask"] = value
 	}
 }
 
-// RandomShuffleQueueV2MinAfterDequeue sets the optional min_after_dequeue attribute to value.
-//
-// value: Dequeue will block unless there would be this
-// many elements after the dequeue or the queue is closed. This
-// ensures a minimum level of mixing of elements.
+// ResourceStridedSliceAssignEndMask sets the optional end_mask attribute to value.
 // If not specified, defaults to 0
-func RandomShuffleQueueV2MinAfterDequeue(value int64) RandomShuffleQueueV2Attr {
+func ResourceStridedSliceAssignEndMask(value int64) ResourceStridedSliceAssignAttr {
 	return func(m optionalAttr) {
-		m["min_after_dequeue"] = value
+		m["end_mask"] = value
 	}
 }
 
-// RandomShuffleQueueV2Seed sets the optional seed attribute to value.
-//
-// value: If either seed or seed2 is set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, a random seed is used.
+// ResourceStridedSliceAssignEllipsisMask sets the optional ellipsis_mask attribute to value.
 // If not specified, defaults to 0
-func RandomShuffleQueueV2Seed(value int64) RandomShuffleQueueV2Attr {
+func ResourceStridedSliceAssignEllipsisMask(value int64) ResourceStridedSliceAssignAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["ellipsis_mask"] = value
 	}
 }
 
-// RandomShuffleQueueV2Seed2 sets the optional seed2 attribute to value.
-//
-// value: A second seed to avoid seed collision.
+// ResourceStridedSliceAssignNewAxisMask sets the optional new_axis_mask attribute to value.
 // If not specified, defaults to 0
-func RandomShuffleQueueV2Seed2(value int64) RandomShuffleQueueV2Attr {
+func ResourceStridedSliceAssignNewAxisMask(value int64) ResourceStridedSliceAssignAttr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["new_axis_mask"] = value
 	}
 }
 
-// RandomShuffleQueueV2Container sets the optional container attribute to value.
-//
-// value: If non-empty, this queue is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func RandomShuffleQueueV2Container(value string) RandomShuffleQueueV2Attr {
+// ResourceStridedSliceAssignShrinkAxisMask sets the optional shrink_axis_mask attribute to value.
+// If not specified, defaults to 0
+func ResourceStridedSliceAssignShrinkAxisMask(value int64) ResourceStridedSliceAssignAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["shrink_axis_mask"] = value
 	}
 }
 
-// RandomShuffleQueueV2SharedName sets the optional shared_name attribute to value.
+// Assign `value` to the sliced l-value reference of `ref`.
 //
-// value: If non-empty, this queue will be shared under the given name
-// across multiple sessions.
-// If not specified, defaults to ""
-func RandomShuffleQueueV2SharedName(value string) RandomShuffleQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// A queue that randomizes the order of elements.
+// The values of `value` are assigned to the positions in the variable
+// `ref` that are selected by the slice parameters. The slice parameters
+// `begin`, `end`, `strides`, etc. work exactly as in `StridedSlice`.
 //
-// Arguments:
-//	component_types: The type of each component in a value.
+// NOTE: this op currently does not support broadcasting, so `value`'s
+// shape must be exactly the shape produced by the slice of `ref`.
 //
-// Returns The handle to the queue.
-func RandomShuffleQueueV2(scope *Scope, component_types []tf.DataType, optional ...RandomShuffleQueueV2Attr) (handle tf.Output) {
+// Returns the created operation.
+func ResourceStridedSliceAssign(scope *Scope, ref tf.Output, begin tf.Output, end tf.Output, strides tf.Output, value tf.Output, optional ...ResourceStridedSliceAssignAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"component_types": component_types}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "RandomShuffleQueueV2",
-
+		Type: "ResourceStridedSliceAssign",
+		Input: []tf.Input{
+			ref, begin, end, strides, value,
+		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Draw bounding boxes on a batch of images.
-//
-// Outputs a copy of `images` but draws on top of the pixels zero or more bounding
-// boxes specified by the locations in `boxes`. The coordinates of the each
-// bounding box in `boxes` are encoded as `[y_min, x_min, y_max, x_max]`. The
-// bounding box coordinates are floats in `[0.0, 1.0]` relative to the width and
-// height of the underlying image.
+// QueueEnqueueV2Attr is an optional argument to QueueEnqueueV2.
+type QueueEnqueueV2Attr func(optionalAttr)
+
+// QueueEnqueueV2TimeoutMs sets the optional timeout_ms attribute to value.
 //
-// For example, if an image is 100 x 200 pixels (height x width) and the bounding
-// box is `[0.1, 0.2, 0.5, 0.9]`, the upper-left and bottom-right coordinates of
-// the bounding box will be `(40, 10)` to `(100, 50)` (in (x,y) coordinates).
+// value: If the queue is full, this operation will block for up to
+// timeout_ms milliseconds.
+// Note: This option is not supported yet.
+// If not specified, defaults to -1
+func QueueEnqueueV2TimeoutMs(value int64) QueueEnqueueV2Attr {
+	return func(m optionalAttr) {
+		m["timeout_ms"] = value
+	}
+}
+
+// Enqueues a tuple of one or more tensors in the given queue.
 //
-// Parts of the bounding box may fall outside the image.
+// The components input has k elements, which correspond to the components of
+// tuples stored in the given queue.
+//
+// N.B. If the queue is full, this operation will block until the given
+// element has been enqueued (or 'timeout_ms' elapses, if specified).
 //
 // Arguments:
-//	images: 4-D with shape `[batch, height, width, depth]`. A batch of images.
-//	boxes: 3-D with shape `[batch, num_bounding_boxes, 4]` containing bounding
-// boxes.
+//	handle: The handle to a queue.
+//	components: One or more tensors from which the enqueued tensors should be taken.
 //
-// Returns 4-D with the same shape as `images`. The batch of input images with
-// bounding boxes drawn on the images.
-func DrawBoundingBoxes(scope *Scope, images tf.Output, boxes tf.Output) (output tf.Output) {
+// Returns the created operation.
+func QueueEnqueueV2(scope *Scope, handle tf.Output, components []tf.Output, optional ...QueueEnqueueV2Attr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "DrawBoundingBoxes",
+		Type: "QueueEnqueueV2",
 		Input: []tf.Input{
-			images, boxes,
+			handle, tf.OutputList(components),
 		},
+		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// LearnedUnigramCandidateSamplerAttr is an optional argument to LearnedUnigramCandidateSampler.
-type LearnedUnigramCandidateSamplerAttr func(optionalAttr)
+// QueueDequeueManyV2Attr is an optional argument to QueueDequeueManyV2.
+type QueueDequeueManyV2Attr func(optionalAttr)
 
-// LearnedUnigramCandidateSamplerSeed sets the optional seed attribute to value.
+// QueueDequeueManyV2TimeoutMs sets the optional timeout_ms attribute to value.
 //
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func LearnedUnigramCandidateSamplerSeed(value int64) LearnedUnigramCandidateSamplerAttr {
+// value: If the queue has fewer than n elements, this operation
+// will block for up to timeout_ms milliseconds.
+// Note: This option is not supported yet.
+// If not specified, defaults to -1
+func QueueDequeueManyV2TimeoutMs(value int64) QueueDequeueManyV2Attr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["timeout_ms"] = value
 	}
 }
 
-// LearnedUnigramCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+// Dequeues `n` tuples of one or more tensors from the given queue.
 //
-// value: An second seed to avoid seed collision.
-// If not specified, defaults to 0
-func LearnedUnigramCandidateSamplerSeed2(value int64) LearnedUnigramCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
-	}
-}
-
-// Generates labels for candidate sampling with a learned unigram distribution.
+// If the queue is closed and there are fewer than `n` elements, then an
+// OutOfRange error is returned.
 //
-// See explanations of candidate sampling and the data formats at
-// go/candidate-sampling.
+// This operation concatenates queue-element component tensors along the
+// 0th dimension to make a single component tensor.  All of the components
+// in the dequeued tuple will have size `n` in the 0th dimension.
 //
-// For each batch, this op picks a single set of sampled candidate labels.
+// This operation has `k` outputs, where `k` is the number of components in
+// the tuples stored in the given queue, and output `i` is the ith
+// component of the dequeued tuple.
 //
-// The advantages of sampling candidates per-batch are simplicity and the
-// possibility of efficient dense matrix multiplication. The disadvantage is that
-// the sampled candidates must be chosen independently of the context and of the
-// true labels.
+// N.B. If the queue is empty, this operation will block until `n` elements
+// have been dequeued (or 'timeout_ms' elapses, if specified).
 //
 // Arguments:
-//	true_classes: A batch_size * num_true matrix, in which each row contains the
-// IDs of the num_true target_classes in the corresponding original label.
-//	num_true: Number of true labels per context.
-//	num_sampled: Number of candidates to randomly sample.
-//	unique: If unique is true, we sample with rejection, so that all sampled
-// candidates in a batch are unique. This requires some approximation to
-// estimate the post-rejection sampling probabilities.
-//	range_max: The sampler will sample integers from the interval [0, range_max).
+//	handle: The handle to a queue.
+//	n: The number of tuples to dequeue.
+//	component_types: The type of each component in a tuple.
 //
-// Returns A vector of length num_sampled, in which each element is
-// the ID of a sampled candidate.A batch_size * num_true matrix, representing
-// the number of times each candidate is expected to occur in a batch
-// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
-// candidate representing the number of times the candidate is expected
-// to occur in a batch of sampled candidates.  If unique=true, then this is a
-// probability.
-func LearnedUnigramCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...LearnedUnigramCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
+// Returns One or more tensors that were dequeued as a tuple.
+func QueueDequeueManyV2(scope *Scope, handle tf.Output, n tf.Output, component_types []tf.DataType, optional ...QueueDequeueManyV2Attr) (components []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
+	attrs := map[string]interface{}{"component_types": component_types}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "LearnedUnigramCandidateSampler",
+		Type: "QueueDequeueManyV2",
 		Input: []tf.Input{
-			true_classes,
+			handle, n,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
-}
-
-// Computes gradients for the scaled exponential linear (Selu) operation.
-//
-// Arguments:
-//	gradients: The backpropagated gradients to the corresponding Selu operation.
-//	outputs: The outputs of the corresponding Selu operation.
-//
-// Returns The gradients: `gradients * (outputs + scale * alpha)`
-// if outputs < 0, `scale * gradients` otherwise.
-func SeluGrad(scope *Scope, gradients tf.Output, outputs tf.Output) (backprops tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "SeluGrad",
-		Input: []tf.Input{
-			gradients, outputs,
-		},
+	var idx int
+	var err error
+	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
+		scope.UpdateErr("QueueDequeueManyV2", err)
+		return
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return components
 }
 
-// Get the current size of the TensorArray.
+// EncodeBase64Attr is an optional argument to EncodeBase64.
+type EncodeBase64Attr func(optionalAttr)
+
+// EncodeBase64Pad sets the optional pad attribute to value.
+//
+// value: Whether padding is applied at the ends.
+// If not specified, defaults to false
+func EncodeBase64Pad(value bool) EncodeBase64Attr {
+	return func(m optionalAttr) {
+		m["pad"] = value
+	}
+}
+
+// Encode strings into web-safe base64 format.
+//
+// Refer to the following article for more information on base64 format:
+// en.wikipedia.org/wiki/Base64. Base64 strings may have padding with '=' at the
+// end so that the encoded string has a length that is a multiple of 4. See the
+// Padding section of the link above.
+//
+// Web-safe means that the encoder uses - and _ instead of + and /.
 //
 // Arguments:
-//	handle: The handle to a TensorArray (output of TensorArray or TensorArrayGrad).
-//	flow_in: A float scalar that enforces proper chaining of operations.
+//	input: Strings to be encoded.
 //
-// Returns The current size of the TensorArray.
-func TensorArraySizeV3(scope *Scope, handle tf.Output, flow_in tf.Output) (size tf.Output) {
+// Returns Input strings encoded in base64.
+func EncodeBase64(scope *Scope, input tf.Output, optional ...EncodeBase64Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "TensorArraySizeV3",
+		Type: "EncodeBase64",
 		Input: []tf.Input{
-			handle, flow_in,
+			input,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Deprecated. Use TensorArrayGradV3
+// Deprecated. Use TensorArrayCloseV3
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayWriteV3
-func TensorArrayWriteV2(scope *Scope, handle tf.Output, index tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArrayCloseV3
+//
+// Returns the created operation.
+func TensorArrayCloseV2(scope *Scope, handle tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayWriteV2",
+		Type: "TensorArrayCloseV2",
 		Input: []tf.Input{
-			handle, index, value, flow_in,
+			handle,
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// SparseReduceMaxAttr is an optional argument to SparseReduceMax.
-type SparseReduceMaxAttr func(optionalAttr)
+// CropAndResizeGradImageAttr is an optional argument to CropAndResizeGradImage.
+type CropAndResizeGradImageAttr func(optionalAttr)
 
-// SparseReduceMaxKeepDims sets the optional keep_dims attribute to value.
+// CropAndResizeGradImageMethod sets the optional method attribute to value.
 //
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func SparseReduceMaxKeepDims(value bool) SparseReduceMaxAttr {
+// value: A string specifying the interpolation method. Only 'bilinear' is
+// supported for now.
+// If not specified, defaults to "bilinear"
+func CropAndResizeGradImageMethod(value string) CropAndResizeGradImageAttr {
 	return func(m optionalAttr) {
-		m["keep_dims"] = value
+		m["method"] = value
 	}
 }
 
-// Computes the max of elements across dimensions of a SparseTensor.
-//
-// This Op takes a SparseTensor and is the sparse counterpart to
-// `tf.reduce_max()`.  In particular, this Op also returns a dense `Tensor`
-// instead of a sparse one.
-//
-// Reduces `sp_input` along the dimensions given in `reduction_axes`.  Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `reduction_axes`. If `keep_dims` is true, the reduced dimensions are retained
-// with length 1.
-//
-// If `reduction_axes` has no entries, all dimensions are reduced, and a tensor
-// with a single element is returned.  Additionally, the axes can be negative,
-// which are interpreted according to the indexing rules in Python.
+// Computes the gradient of the crop_and_resize op wrt the input image tensor.
 //
 // Arguments:
-//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, possibly not in canonical ordering.
-//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
-//	input_shape: 1-D.  Shape of the input SparseTensor.
-//	reduction_axes: 1-D.  Length-`K` vector containing the reduction axes.
+//	grads: A 4-D tensor of shape `[num_boxes, crop_height, crop_width, depth]`.
+//	boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
+// specifies the coordinates of a box in the `box_ind[i]` image and is specified
+// in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
+// `y` is mapped to the image coordinate at `y * (image_height - 1)`, so that the
+// `[0, 1]` interval of normalized image height is mapped to
+// `[0, image_height - 1]` in image height coordinates. We do allow y1 > y2, in
+// which case the sampled crop is an up-down flipped version of the original
+// image. The width dimension is treated similarly. Normalized coordinates
+// outside the `[0, 1]` range are allowed, in which case we use
+// `extrapolation_value` to extrapolate the input image values.
+//	box_ind: A 1-D tensor of shape `[num_boxes]` with int32 values in `[0, batch)`.
+// The value of `box_ind[i]` specifies the image that the `i`-th box refers to.
+//	image_size: A 1-D tensor with value `[batch, image_height, image_width, depth]`
+// containing the original image size. Both `image_height` and `image_width` need
+// to be positive.
 //
-// Returns `R-K`-D.  The reduced Tensor.
-func SparseReduceMax(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output, reduction_axes tf.Output, optional ...SparseReduceMaxAttr) (output tf.Output) {
+//
+// Returns A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
+func CropAndResizeGradImage(scope *Scope, grads tf.Output, boxes tf.Output, box_ind tf.Output, image_size tf.Output, T tf.DataType, optional ...CropAndResizeGradImageAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"T": T}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseReduceMax",
+		Type: "CropAndResizeGradImage",
 		Input: []tf.Input{
-			input_indices, input_values, input_shape, reduction_axes,
+			grads, boxes, box_ind, image_size,
 		},
 		Attrs: attrs,
 	}
@@ -23454,68 +23394,99 @@ func SparseReduceMax(scope *Scope, input_indices tf.Output, input_values tf.Outp
 	return op.Output(0)
 }
 
-// AsStringAttr is an optional argument to AsString.
-type AsStringAttr func(optionalAttr)
-
-// AsStringPrecision sets the optional precision attribute to value.
-//
-// value: The post-decimal precision to use for floating point numbers.
-// Only used if precision > -1.
-// If not specified, defaults to -1
-func AsStringPrecision(value int64) AsStringAttr {
-	return func(m optionalAttr) {
-		m["precision"] = value
+// Reads and outputs the entire contents of the input filename.
+func ReadFile(scope *Scope, filename tf.Output) (contents tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// AsStringScientific sets the optional scientific attribute to value.
-//
-// value: Use scientific notation for floating point numbers.
-// If not specified, defaults to false
-func AsStringScientific(value bool) AsStringAttr {
-	return func(m optionalAttr) {
-		m["scientific"] = value
+	opspec := tf.OpSpec{
+		Type: "ReadFile",
+		Input: []tf.Input{
+			filename,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// AsStringShortest sets the optional shortest attribute to value.
+// Concatenates tensors along one dimension.
 //
-// value: Use shortest representation (either scientific or standard) for
-// floating point numbers.
-// If not specified, defaults to false
-func AsStringShortest(value bool) AsStringAttr {
-	return func(m optionalAttr) {
-		m["shortest"] = value
+// Arguments:
+//	values: List of `N` Tensors to concatenate. Their ranks and types must match,
+// and their sizes must match in all dimensions except `axis`.
+//	axis: 0-D.  The dimension along which to concatenate.  Must be in the
+// range [-rank(values), rank(values)).
+//
+// Returns A `Tensor` with the concatenation of values stacked along the
+// `axis` dimension.  This tensor's shape matches that of `values` except
+// in `axis` where it has the sum of the sizes.
+func ConcatV2(scope *Scope, values []tf.Output, axis tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
+	opspec := tf.OpSpec{
+		Type: "ConcatV2",
+		Input: []tf.Input{
+			tf.OutputList(values), axis,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// AsStringWidth sets the optional width attribute to value.
+// Forwards the value of an available tensor from `inputs` to `output`.
 //
-// value: Pad pre-decimal numbers to this width.
-// Applies to both floating point and integer numbers.
-// Only used if width > -1.
-// If not specified, defaults to -1
-func AsStringWidth(value int64) AsStringAttr {
-	return func(m optionalAttr) {
-		m["width"] = value
+// `Merge` waits for at least one of the tensors in `inputs` to become available.
+// It is usually combined with `Switch` to implement branching.
+//
+// `Merge` forwards the first tensor to become available to `output`, and sets
+// `value_index` to its index in `inputs`.
+//
+// Arguments:
+//	inputs: The input tensors, exactly one of which will become available.
+//
+// Returns The available input tensor and the index of the chosen input tensor in `inputs`.
+func Merge(scope *Scope, inputs []tf.Output) (output tf.Output, value_index tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Merge",
+		Input: []tf.Input{
+			tf.OutputList(inputs),
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
 }
 
-// AsStringFill sets the optional fill attribute to value.
+// QueueCloseV2Attr is an optional argument to QueueCloseV2.
+type QueueCloseV2Attr func(optionalAttr)
+
+// QueueCloseV2CancelPendingEnqueues sets the optional cancel_pending_enqueues attribute to value.
 //
-// value: The value to pad if width > -1.  If empty, pads with spaces.
-// Another typical value is '0'.  String cannot be longer than 1 character.
-// If not specified, defaults to ""
-func AsStringFill(value string) AsStringAttr {
+// value: If true, all pending enqueue requests that are
+// blocked on the given queue will be canceled.
+// If not specified, defaults to false
+func QueueCloseV2CancelPendingEnqueues(value bool) QueueCloseV2Attr {
 	return func(m optionalAttr) {
-		m["fill"] = value
+		m["cancel_pending_enqueues"] = value
 	}
 }
 
-// Converts each entry in the given tensor to strings.  Supports many numeric
+// Closes the given queue.
 //
-// types and boolean.
-func AsString(scope *Scope, input tf.Output, optional ...AsStringAttr) (output tf.Output) {
+// This operation signals that no more elements will be enqueued in the
+// given queue. Subsequent Enqueue(Many) operations will fail.
+// Subsequent Dequeue(Many) operations will continue to succeed if
+// sufficient elements remain in the queue. Subsequent Dequeue(Many)
+// operations that would block will fail immediately.
+//
+// Arguments:
+//	handle: The handle to a queue.
+//
+// Returns the created operation.
+func QueueCloseV2(scope *Scope, handle tf.Output, optional ...QueueCloseV2Attr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -23524,378 +23495,381 @@ func AsString(scope *Scope, input tf.Output, optional ...AsStringAttr) (output t
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "AsString",
+		Type: "QueueCloseV2",
 		Input: []tf.Input{
-			input,
+			handle,
 		},
 		Attrs: attrs,
 	}
+	return scope.AddOperation(opspec)
+}
+
+// Computes inverse hyperbolic tangent of x element-wise.
+func Atanh(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Atanh",
+		Input: []tf.Input{
+			x,
+		},
+	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Deprecated. Use TensorArrayScatterV3
+// Returns true if queue is closed.
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayScatterV3
-func TensorArrayScatterV2(scope *Scope, handle tf.Output, indices tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+// This operation returns true if the queue is closed and false if the queue
+// is open.
+//
+// Arguments:
+//	handle: The handle to a queue.
+func QueueIsClosedV2(scope *Scope, handle tf.Output) (is_closed tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayScatterV2",
+		Type: "QueueIsClosedV2",
 		Input: []tf.Input{
-			handle, indices, value, flow_in,
+			handle,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Applies sparse addition to `input` using individual values or slices
+// Returns the batched diagonal part of a batched tensor.
 //
-// from `updates` according to indices `indices`.  The updates are non-aliasing:
-// `input` is only modified in-place if no other operations will use it.
-// Otherwise, a copy of `input` is made.  This operation has a gradient with
-// respect to both `input` and `updates`.
+// This operation returns a tensor with the `diagonal` part
+// of the batched `input`. The `diagonal` part is computed as follows:
 //
-// `input` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
+// Assume `input` has `k` dimensions `[I, J, K, ..., M, N]`, then the output is a
+// tensor of rank `k - 1` with dimensions `[I, J, K, ..., min(M, N)]` where:
 //
-// `indices` must be integer tensor, containing indices into `input`.
-// It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
+// `diagonal[i, j, k, ..., n] = input[i, j, k, ..., n, n]`.
 //
-// The innermost dimension of `indices` (with length `K`) corresponds to
-// indices into elements (if `K = P`) or `(P-K)`-dimensional slices
-// (if `K < P`) along the `K`th dimension of `input`.
+// The input must be at least a matrix.
 //
-// `updates` is `Tensor` of rank `Q-1+P-K` with shape:
+// For example:
 //
 // ```
-// [d_0, ..., d_{Q-2}, input.shape[K], ..., input.shape[P-1]].
-// ```
-//
-// For example, say we want to add 4 scattered elements to a rank-1 tensor to 8
-// elements. In Python, that addition would look like this:
-//
-//     input = tf.constant([1, 2, 3, 4, 5, 6, 7, 8])
-//     indices = tf.constant([[4], [3], [1], [7]])
-//     updates = tf.constant([9, 10, 11, 12])
-//     output = tf.scatter_nd_non_aliasing_add(input, indices, updates)
-//     with tf.Session() as sess:
-//       print(sess.run(output))
+// # 'input' is [[[1, 0, 0, 0]
+//                [0, 2, 0, 0]
+//                [0, 0, 3, 0]
+//                [0, 0, 0, 4]],
+//               [[5, 0, 0, 0]
+//                [0, 6, 0, 0]
+//                [0, 0, 7, 0]
+//                [0, 0, 0, 8]]]
 //
-// The resulting value `output` would look like this:
+// and input.shape = (2, 4, 4)
 //
-//     [1, 13, 3, 14, 14, 6, 7, 20]
+// tf.matrix_diag_part(input) ==> [[1, 2, 3, 4], [5, 6, 7, 8]]
 //
-// See @{tf.scatter_nd} for more details about how to make updates to slices.
+// which has shape (2, 4)
+// ```
 //
 // Arguments:
-//	input: A Tensor.
-//	indices: A Tensor. Must be one of the following types: `int32`, `int64`.
-// A tensor of indices into `input`.
-//	updates: A Tensor. Must have the same type as ref. A tensor of updated values
-// to add to `input`.
+//	input: Rank `k` tensor where `k >= 2`.
 //
-// Returns A `Tensor` with the same shape as `input`, containing values of `input`
-// updated with `updates`.
-func ScatterNdNonAliasingAdd(scope *Scope, input tf.Output, indices tf.Output, updates tf.Output) (output tf.Output) {
+// Returns The extracted diagonal(s) having shape
+// `diagonal.shape = input.shape[:-2] + [min(input.shape[-2:])]`.
+func MatrixDiagPart(scope *Scope, input tf.Output) (diagonal tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ScatterNdNonAliasingAdd",
+		Type: "MatrixDiagPart",
 		Input: []tf.Input{
-			input, indices, updates,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// FractionalMaxPoolAttr is an optional argument to FractionalMaxPool.
-type FractionalMaxPoolAttr func(optionalAttr)
+// Computes the absolute value of a tensor.
+//
+// Given a tensor `x`, this operation returns a tensor containing the absolute
+// value of each element in `x`. For example, if x is an input element and y is
+// an output element, this operation computes \\(y = |x|\\).
+func Abs(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Abs",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
 
-// FractionalMaxPoolPseudoRandom sets the optional pseudo_random attribute to value.
+// StackV2Attr is an optional argument to StackV2.
+type StackV2Attr func(optionalAttr)
+
+// StackV2StackName sets the optional stack_name attribute to value.
 //
-// value: When set to True, generates the pooling sequence in a
-// pseudorandom fashion, otherwise, in a random fashion. Check paper [Benjamin
-// Graham, Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) for
-// difference between pseudorandom and random.
-// If not specified, defaults to false
-func FractionalMaxPoolPseudoRandom(value bool) FractionalMaxPoolAttr {
+// value: Overrides the name used for the temporary stack resource. Default
+// value is the name of the 'Stack' op (which is guaranteed unique).
+// If not specified, defaults to ""
+func StackV2StackName(value string) StackV2Attr {
 	return func(m optionalAttr) {
-		m["pseudo_random"] = value
+		m["stack_name"] = value
 	}
 }
 
-// FractionalMaxPoolOverlapping sets the optional overlapping attribute to value.
+// A stack that produces elements in first-in last-out order.
 //
-// value: When set to True, it means when pooling, the values at the boundary
-// of adjacent pooling cells are used by both cells. For example:
+// Arguments:
+//	max_size: The maximum size of the stack if non-negative. If negative, the stack
+// size is unlimited.
+//	elem_type: The type of the elements on the stack.
 //
-// `index  0  1  2  3  4`
+// Returns The handle to the stack.
+func StackV2(scope *Scope, max_size tf.Output, elem_type tf.DataType, optional ...StackV2Attr) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"elem_type": elem_type}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "StackV2",
+		Input: []tf.Input{
+			max_size,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// OrderedMapStageAttr is an optional argument to OrderedMapStage.
+type OrderedMapStageAttr func(optionalAttr)
+
+// OrderedMapStageCapacity sets the optional capacity attribute to value.
 //
-// `value  20 5  16 3  7`
+// value: Maximum number of elements in the Staging Area. If > 0, inserts
+// on the container will block when the capacity is reached.
+// If not specified, defaults to 0
 //
-// If the pooling sequence is [0, 2, 4], then 16, at index 2 will be used twice.
-// The result would be [20, 16] for fractional max pooling.
-// If not specified, defaults to false
-func FractionalMaxPoolOverlapping(value bool) FractionalMaxPoolAttr {
+// REQUIRES: value >= 0
+func OrderedMapStageCapacity(value int64) OrderedMapStageAttr {
 	return func(m optionalAttr) {
-		m["overlapping"] = value
+		m["capacity"] = value
 	}
 }
 
-// FractionalMaxPoolDeterministic sets the optional deterministic attribute to value.
+// OrderedMapStageMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// value: When set to True, a fixed pooling region will be used when
-// iterating over a FractionalMaxPool node in the computation graph. Mainly used
-// in unit test to make FractionalMaxPool deterministic.
-// If not specified, defaults to false
-func FractionalMaxPoolDeterministic(value bool) FractionalMaxPoolAttr {
+// REQUIRES: value >= 0
+func OrderedMapStageMemoryLimit(value int64) OrderedMapStageAttr {
 	return func(m optionalAttr) {
-		m["deterministic"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// FractionalMaxPoolSeed sets the optional seed attribute to value.
+// OrderedMapStageContainer sets the optional container attribute to value.
 //
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func FractionalMaxPoolSeed(value int64) FractionalMaxPoolAttr {
+// value: If non-empty, this queue is placed in the given container. Otherwise,
+// a default container is used.
+// If not specified, defaults to ""
+func OrderedMapStageContainer(value string) OrderedMapStageAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["container"] = value
 	}
 }
 
-// FractionalMaxPoolSeed2 sets the optional seed2 attribute to value.
+// OrderedMapStageSharedName sets the optional shared_name attribute to value.
 //
-// value: An second seed to avoid seed collision.
-// If not specified, defaults to 0
-func FractionalMaxPoolSeed2(value int64) FractionalMaxPoolAttr {
+// value: It is necessary to match this name to the matching Unstage Op.
+// If not specified, defaults to ""
+func OrderedMapStageSharedName(value string) OrderedMapStageAttr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["shared_name"] = value
 	}
 }
 
-// Performs fractional max pooling on the input.
-//
-// Fractional max pooling is slightly different than regular max pooling.  In
-// regular max pooling, you downsize an input set by taking the maximum value of
-// smaller N x N subsections of the set (often 2x2), and try to reduce the set by
-// a factor of N, where N is an integer.  Fractional max pooling, as you might
-// expect from the word "fractional", means that the overall reduction ratio N
-// does not have to be an integer.
-//
-// The sizes of the pooling regions are generated randomly but are fairly uniform.
-// For example, let's look at the height dimension, and the constraints on the
-// list of rows that will be pool boundaries.
-//
-// First we define the following:
-//
-// 1.  input_row_length : the number of rows from the input set
-// 2.  output_row_length : which will be smaller than the input
-// 3.  alpha = input_row_length / output_row_length : our reduction ratio
-// 4.  K = floor(alpha)
-// 5.  row_pooling_sequence : this is the result list of pool boundary rows
+// Stage (key, values) in the underlying container which behaves like an ordered
 //
-// Then, row_pooling_sequence should satisfy:
+// associative container.   Elements are ordered by key.
 //
-// 1.  a[0] = 0 : the first value of the sequence is 0
-// 2.  a[end] = input_row_length : the last value of the sequence is the size
-// 3.  K <= (a[i+1] - a[i]) <= K+1 : all intervals are K or K+1 size
-// 4.  length(row_pooling_sequence) = output_row_length+1
+// Arguments:
+//	key: int64
 //
-// For more details on fractional max pooling, see this paper:
-// [Benjamin Graham, Fractional Max-Pooling](http://arxiv.org/abs/1412.6071)
+//	values: a list of tensors
+// dtypes A list of data types that inserted values should adhere to.
 //
-// Arguments:
-//	value: 4-D with shape `[batch, height, width, channels]`.
-//	pooling_ratio: Pooling ratio for each dimension of `value`, currently only
-// supports row and col dimension and should be >= 1.0. For example, a valid
-// pooling ratio looks like [1.0, 1.44, 1.73, 1.0]. The first and last elements
-// must be 1.0 because we don't allow pooling on batch and channels
-// dimensions. 1.44 and 1.73 are pooling ratio on height and width dimensions
-// respectively.
 //
-// Returns output tensor after fractional max pooling.row pooling sequence, needed to calculate gradient.column pooling sequence, needed to calculate gradient.
-func FractionalMaxPool(scope *Scope, value tf.Output, pooling_ratio []float32, optional ...FractionalMaxPoolAttr) (output tf.Output, row_pooling_sequence tf.Output, col_pooling_sequence tf.Output) {
+// Returns the created operation.
+func OrderedMapStage(scope *Scope, key tf.Output, indices tf.Output, values []tf.Output, dtypes []tf.DataType, optional ...OrderedMapStageAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"pooling_ratio": pooling_ratio}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FractionalMaxPool",
+		Type: "OrderedMapStage",
 		Input: []tf.Input{
-			value,
+			key, indices, tf.OutputList(values),
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return scope.AddOperation(opspec)
 }
 
-// Deprecated. Use TensorArraySizeV3
+// StackPushV2Attr is an optional argument to StackPushV2.
+type StackPushV2Attr func(optionalAttr)
+
+// StackPushV2SwapMemory sets the optional swap_memory attribute to value.
 //
-// DEPRECATED at GraphDef version 26: Use TensorArraySizeV3
-func TensorArraySizeV2(scope *Scope, handle tf.Output, flow_in tf.Output) (size tf.Output) {
+// value: Swap `elem` to CPU. Defaults to false.
+// If not specified, defaults to false
+func StackPushV2SwapMemory(value bool) StackPushV2Attr {
+	return func(m optionalAttr) {
+		m["swap_memory"] = value
+	}
+}
+
+// Push an element onto the stack.
+//
+// Arguments:
+//	handle: The handle to a stack.
+//	elem: The tensor to be pushed onto the stack.
+//
+// Returns The same tensor as the input 'elem'.
+func StackPushV2(scope *Scope, handle tf.Output, elem tf.Output, optional ...StackPushV2Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "TensorArraySizeV2",
+		Type: "StackPushV2",
 		Input: []tf.Input{
-			handle, flow_in,
+			handle, elem,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Conv2DAttr is an optional argument to Conv2D.
-type Conv2DAttr func(optionalAttr)
+// FusedBatchNormGradV2Attr is an optional argument to FusedBatchNormGradV2.
+type FusedBatchNormGradV2Attr func(optionalAttr)
 
-// Conv2DUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
-// If not specified, defaults to true
-func Conv2DUseCudnnOnGpu(value bool) Conv2DAttr {
+// FusedBatchNormGradV2Epsilon sets the optional epsilon attribute to value.
+//
+// value: A small float number added to the variance of x.
+// If not specified, defaults to 0.0001
+func FusedBatchNormGradV2Epsilon(value float32) FusedBatchNormGradV2Attr {
 	return func(m optionalAttr) {
-		m["use_cudnn_on_gpu"] = value
+		m["epsilon"] = value
 	}
 }
 
-// Conv2DDataFormat sets the optional data_format attribute to value.
+// FusedBatchNormGradV2DataFormat sets the optional data_format attribute to value.
 //
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, height, width, channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, channels, height, width].
+// value: The data format for y_backprop, x, x_backprop.
+// Either "NHWC" (default) or "NCHW".
 // If not specified, defaults to "NHWC"
-func Conv2DDataFormat(value string) Conv2DAttr {
+func FusedBatchNormGradV2DataFormat(value string) FusedBatchNormGradV2Attr {
 	return func(m optionalAttr) {
 		m["data_format"] = value
 	}
 }
 
-// Conv2DDilations sets the optional dilations attribute to value.
+// FusedBatchNormGradV2IsTraining sets the optional is_training attribute to value.
 //
-// value: 1-D tensor of length 4.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each
-// filter element on that dimension. The dimension order is determined by the
-// value of `data_format`, see above for details. Dilations in the batch and
-// depth dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 >
-func Conv2DDilations(value []int64) Conv2DAttr {
+// value: A bool value to indicate the operation is for training (default)
+// or inference.
+// If not specified, defaults to true
+func FusedBatchNormGradV2IsTraining(value bool) FusedBatchNormGradV2Attr {
 	return func(m optionalAttr) {
-		m["dilations"] = value
+		m["is_training"] = value
 	}
 }
 
-// Computes a 2-D convolution given 4-D `input` and `filter` tensors.
-//
-// Given an input tensor of shape `[batch, in_height, in_width, in_channels]`
-// and a filter / kernel tensor of shape
-// `[filter_height, filter_width, in_channels, out_channels]`, this op
-// performs the following:
-//
-// 1. Flattens the filter to a 2-D matrix with shape
-//    `[filter_height * filter_width * in_channels, output_channels]`.
-// 2. Extracts image patches from the input tensor to form a *virtual*
-//    tensor of shape `[batch, out_height, out_width,
-//    filter_height * filter_width * in_channels]`.
-// 3. For each patch, right-multiplies the filter matrix and the image patch
-//    vector.
-//
-// In detail, with the default NHWC format,
-//
-//     output[b, i, j, k] =
-//         sum_{di, dj, q} input[b, strides[1] * i + di, strides[2] * j + dj, q] *
-//                         filter[di, dj, q, k]
+// Gradient for batch normalization.
 //
-// Must have `strides[0] = strides[3] = 1`.  For the most common case of the same
-// horizontal and vertices strides, `strides = [1, stride, stride, 1]`.
+// Note that the sizes of 4D Tensors are defined by either "NHWC" or "NCHW".
+// The size of 1D Tensors matches the dimension C of the 4D Tensors.
 //
 // Arguments:
-//	input: A 4-D tensor. The dimension order is interpreted according to the value
-// of `data_format`, see below for details.
-//	filter: A 4-D tensor of shape
-// `[filter_height, filter_width, in_channels, out_channels]`
-//	strides: 1-D tensor of length 4.  The stride of the sliding window for each
-// dimension of `input`. The dimension order is determined by the value of
-// `data_format`, see below for details.
-//	padding: The type of padding algorithm to use.
+//	y_backprop: A 4D Tensor for the gradient with respect to y.
+//	x: A 4D Tensor for input data.
+//	scale: A 1D Tensor for scaling factor, to scale the normalized x.
+//	reserve_space_1: When is_training is True, a 1D Tensor for the computed batch
+// mean to be reused in gradient computation. When is_training is
+// False, a 1D Tensor for the population mean to be reused in both
+// 1st and 2nd order gradient computation.
+//	reserve_space_2: When is_training is True, a 1D Tensor for the computed batch
+// variance (inverted variance in the cuDNN case) to be reused in
+// gradient computation. When is_training is False, a 1D Tensor
+// for the population variance to be reused in both 1st and 2nd
+// order gradient computation.
 //
-// Returns A 4-D tensor. The dimension order is determined by the value of
-// `data_format`, see below for details.
-func Conv2D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, padding string, optional ...Conv2DAttr) (output tf.Output) {
+// Returns A 4D Tensor for the gradient with respect to x. A 1D Tensor for the gradient with respect to scale. A 1D Tensor for the gradient with respect to offset. Unused placeholder to match the mean input in FusedBatchNorm. Unused placeholder to match the variance input
+// in FusedBatchNorm.
+func FusedBatchNormGradV2(scope *Scope, y_backprop tf.Output, x tf.Output, scale tf.Output, reserve_space_1 tf.Output, reserve_space_2 tf.Output, optional ...FusedBatchNormGradV2Attr) (x_backprop tf.Output, scale_backprop tf.Output, offset_backprop tf.Output, reserve_space_3 tf.Output, reserve_space_4 tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Conv2D",
+		Type: "FusedBatchNormGradV2",
 		Input: []tf.Input{
-			input, filter,
+			y_backprop, x, scale, reserve_space_1, reserve_space_2,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// FakeQuantWithMinMaxArgsAttr is an optional argument to FakeQuantWithMinMaxArgs.
-type FakeQuantWithMinMaxArgsAttr func(optionalAttr)
-
-// FakeQuantWithMinMaxArgsMin sets the optional min attribute to value.
-// If not specified, defaults to -6
-func FakeQuantWithMinMaxArgsMin(value float32) FakeQuantWithMinMaxArgsAttr {
-	return func(m optionalAttr) {
-		m["min"] = value
-	}
-}
-
-// FakeQuantWithMinMaxArgsMax sets the optional max attribute to value.
-// If not specified, defaults to 6
-func FakeQuantWithMinMaxArgsMax(value float32) FakeQuantWithMinMaxArgsAttr {
-	return func(m optionalAttr) {
-		m["max"] = value
-	}
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
 }
 
-// FakeQuantWithMinMaxArgsNumBits sets the optional num_bits attribute to value.
-// If not specified, defaults to 8
-func FakeQuantWithMinMaxArgsNumBits(value int64) FakeQuantWithMinMaxArgsAttr {
-	return func(m optionalAttr) {
-		m["num_bits"] = value
-	}
-}
+// DecodeCompressedAttr is an optional argument to DecodeCompressed.
+type DecodeCompressedAttr func(optionalAttr)
 
-// FakeQuantWithMinMaxArgsNarrowRange sets the optional narrow_range attribute to value.
-// If not specified, defaults to false
-func FakeQuantWithMinMaxArgsNarrowRange(value bool) FakeQuantWithMinMaxArgsAttr {
+// DecodeCompressedCompressionType sets the optional compression_type attribute to value.
+//
+// value: A scalar containing either (i) the empty string (no
+// compression), (ii) "ZLIB", or (iii) "GZIP".
+// If not specified, defaults to ""
+func DecodeCompressedCompressionType(value string) DecodeCompressedAttr {
 	return func(m optionalAttr) {
-		m["narrow_range"] = value
+		m["compression_type"] = value
 	}
 }
 
-// Fake-quantize the 'inputs' tensor, type float to 'outputs' tensor of same type.
+// Decompress strings.
 //
-// Attributes `[min; max]` define the clamping range for the `inputs` data.
-// `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
-// when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
-// then de-quantized and output as floats in `[min; max]` interval.
-// `num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
+// This op decompresses each element of the `bytes` input `Tensor`, which
+// is assumed to be compressed using the given `compression_type`.
 //
-// Quantization is called fake since the output is still in floating point.
-func FakeQuantWithMinMaxArgs(scope *Scope, inputs tf.Output, optional ...FakeQuantWithMinMaxArgsAttr) (outputs tf.Output) {
+// The `output` is a string `Tensor` of the same shape as `bytes`,
+// each element containing the decompressed data from the corresponding
+// element in `bytes`.
+//
+// Arguments:
+//	bytes: A Tensor of strings that are compressed.
+//
+// Returns A Tensor with the same shape as input `bytes`, uncompressed
+// from bytes.
+func DecodeCompressed(scope *Scope, bytes tf.Output, optional ...DecodeCompressedAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -23904,9 +23878,9 @@ func FakeQuantWithMinMaxArgs(scope *Scope, inputs tf.Output, optional ...FakeQua
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FakeQuantWithMinMaxArgs",
+		Type: "DecodeCompressed",
 		Input: []tf.Input{
-			inputs,
+			bytes,
 		},
 		Attrs: attrs,
 	}
@@ -23914,540 +23888,493 @@ func FakeQuantWithMinMaxArgs(scope *Scope, inputs tf.Output, optional ...FakeQua
 	return op.Output(0)
 }
 
-// StageAttr is an optional argument to Stage.
-type StageAttr func(optionalAttr)
-
-// StageCapacity sets the optional capacity attribute to value.
+// Creates a TensorArray for storing the gradients of values in the given handle.
 //
-// value: Maximum number of elements in the Staging Area. If > 0, inserts
-// on the container will block when the capacity is reached.
-// If not specified, defaults to 0
+// If the given TensorArray gradient already exists, returns a reference to it.
 //
-// REQUIRES: value >= 0
-func StageCapacity(value int64) StageAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// StageMemoryLimit sets the optional memory_limit attribute to value.
+// Locks the size of the original TensorArray by disabling its dynamic size flag.
 //
-// value: The maximum number of bytes allowed for Tensors in the Staging Area.
-// If > 0, inserts will block until sufficient space is available.
-// If not specified, defaults to 0
+// **A note about the input flow_in:**
 //
-// REQUIRES: value >= 0
-func StageMemoryLimit(value int64) StageAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// StageContainer sets the optional container attribute to value.
+// The handle flow_in forces the execution of the gradient lookup to occur
+// only after certain other operations have occurred.  For example, when
+// the forward TensorArray is dynamically sized, writes to this TensorArray
+// may resize the object.  The gradient TensorArray is statically sized based
+// on the size of the forward TensorArray when this operation executes.
+// Furthermore, the size of the forward TensorArray is frozen by this call.
+// As a result, the flow is used to ensure that the call to generate the gradient
+// TensorArray only happens after all writes are executed.
 //
-// value: If non-empty, this queue is placed in the given container. Otherwise,
-// a default container is used.
-// If not specified, defaults to ""
-func StageContainer(value string) StageAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// StageSharedName sets the optional shared_name attribute to value.
+// In the case of dynamically sized TensorArrays, gradient computation should
+// only be performed on read operations that have themselves been chained via
+// flow to occur only after all writes have executed. That way the final size
+// of the forward TensorArray is known when this operation is called.
 //
-// value: It is necessary to match this name to the matching Unstage Op.
-// If not specified, defaults to ""
-func StageSharedName(value string) StageAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Stage values similar to a lightweight Enqueue.
+// **A note about the source attribute:**
 //
-// The basic functionality of this Op is similar to a queue with many
-// fewer capabilities and options.  This Op is optimized for performance.
+// TensorArray gradient calls use an accumulator TensorArray object.  If
+// multiple gradients are calculated and run in the same session, the multiple
+// gradient nodes may accidentally flow through the same accumulator TensorArray.
+// This double counts and generally breaks the TensorArray gradient flow.
 //
-// Arguments:
-//	values: a list of tensors
-// dtypes A list of data types that inserted values should adhere to.
+// The solution is to identify which gradient call this particular
+// TensorArray gradient is being called in.  This is performed by identifying
+// a unique string (e.g. "gradients", "gradients_1", ...) from the input
+// gradient Tensor's name.  This string is used as a suffix when creating
+// the TensorArray gradient object here (the attribute `source`).
 //
-// Returns the created operation.
-func Stage(scope *Scope, values []tf.Output, optional ...StageAttr) (o *tf.Operation) {
+// The attribute `source` is added as a suffix to the forward TensorArray's
+// name when performing the creation / lookup, so that each separate gradient
+// calculation gets its own TensorArray accumulator.
+//
+// Arguments:
+//	handle: The handle to the forward TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//	source: The gradient source string, used to decide which gradient TensorArray
+// to return.
+func TensorArrayGradV3(scope *Scope, handle tf.Output, flow_in tf.Output, source string) (grad_handle tf.Output, flow_out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"source": source}
 	opspec := tf.OpSpec{
-		Type: "Stage",
+		Type: "TensorArrayGradV3",
 		Input: []tf.Input{
-			tf.OutputList(values),
+			handle, flow_in,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
 }
 
-// StagePeekAttr is an optional argument to StagePeek.
-type StagePeekAttr func(optionalAttr)
-
-// StagePeekCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Compare values of `input` to `threshold` and pack resulting bits into a `uint8`.
 //
-// REQUIRES: value >= 0
-func StagePeekCapacity(value int64) StagePeekAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// StagePeekMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Each comparison returns a boolean `true` (if `input_value > threshold`)
+// or `false` otherwise.
 //
-// REQUIRES: value >= 0
-func StagePeekMemoryLimit(value int64) StagePeekAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// StagePeekContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func StagePeekContainer(value string) StagePeekAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// StagePeekSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func StagePeekSharedName(value string) StagePeekAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Op peeks at the values at the specified index.  If the
+// This operation is useful for Locality-Sensitive-Hashing (LSH) and other
+// algorithms that use hashing approximations of cosine and `L2` distances;
+// codes can be generated from an input via:
 //
-// underlying container does not contain sufficient elements
-// this op will block until it does.   This Op is optimized for
-// performance.
-func StagePeek(scope *Scope, index tf.Output, dtypes []tf.DataType, optional ...StagePeekAttr) (values []tf.Output) {
+// ```python
+// codebook_size = 50
+// codebook_bits = codebook_size * 32
+// codebook = tf.get_variable('codebook', [x.shape[-1].value, codebook_bits],
+//                            dtype=x.dtype,
+//                            initializer=tf.orthogonal_initializer())
+// codes = compare_and_threshold(tf.matmul(x, codebook), threshold=0.)
+// codes = tf.bitcast(codes, tf.int32)  # go from uint8 to int32
+// # now codes has shape x.shape[:-1] + [codebook_size]
+// ```
+//
+// **NOTE**: Currently, the innermost dimension of the tensor must be divisible
+// by 8.
+//
+// Given an `input` shaped `[s0, s1, ..., s_n]`, the output is
+// a `uint8` tensor shaped `[s0, s1, ..., s_n / 8]`.
+//
+// Arguments:
+//	input: Values to compare against `threshold` and bitpack.
+//	threshold: Threshold to compare against.
+//
+// Returns The bitpacked comparisons.
+func CompareAndBitpack(scope *Scope, input tf.Output, threshold tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "StagePeek",
+		Type: "CompareAndBitpack",
 		Input: []tf.Input{
-			index,
+			input, threshold,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("StagePeek", err)
-		return
-	}
-	return values
+	return op.Output(0)
 }
 
-// Conv3DBackpropInputV2Attr is an optional argument to Conv3DBackpropInputV2.
-type Conv3DBackpropInputV2Attr func(optionalAttr)
-
-// Conv3DBackpropInputV2DataFormat sets the optional data_format attribute to value.
+// Push an element onto the tensor_array.
 //
-// value: The data format of the input and output data. With the
-// default format "NDHWC", the data is stored in the order of:
-//     [batch, in_depth, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCDHW", the data storage order is:
-//     [batch, in_channels, in_depth, in_height, in_width].
-// If not specified, defaults to "NDHWC"
-func Conv3DBackpropInputV2DataFormat(value string) Conv3DBackpropInputV2Attr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Conv3DBackpropInputV2Dilations sets the optional dilations attribute to value.
+// Arguments:
+//	handle: The handle to a TensorArray.
+//	index: The position to write to inside the TensorArray.
+//	value: The tensor to write to the TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
 //
-// value: 1-D tensor of length 5.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each
-// filter element on that dimension. The dimension order is determined by the
-// value of `data_format`, see above for details. Dilations in the batch and
-// depth dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 i:1 >
-func Conv3DBackpropInputV2Dilations(value []int64) Conv3DBackpropInputV2Attr {
-	return func(m optionalAttr) {
-		m["dilations"] = value
+// Returns A float scalar that enforces proper chaining of operations.
+func TensorArrayWriteV3(scope *Scope, handle tf.Output, index tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "TensorArrayWriteV3",
+		Input: []tf.Input{
+			handle, index, value, flow_in,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Computes the gradients of 3-D convolution with respect to the input.
+// Scatter the data from the input value into specific TensorArray elements.
+//
+// `indices` must be a vector; its length must match the first dimension of `value`.
 //
 // Arguments:
-//	input_sizes: An integer vector representing the tensor shape of `input`,
-// where `input` is a 5-D
-// `[batch, depth, rows, cols, in_channels]` tensor.
-//	filter: Shape `[depth, rows, cols, in_channels, out_channels]`.
-// `in_channels` must match between `input` and `filter`.
-//	out_backprop: Backprop signal of shape `[batch, out_depth, out_rows, out_cols,
-// out_channels]`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
-func Conv3DBackpropInputV2(scope *Scope, input_sizes tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...Conv3DBackpropInputV2Attr) (output tf.Output) {
+//	handle: The handle to a TensorArray.
+//	indices: The locations at which to write the tensor elements.
+//	value: The concatenated tensor to write to the TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//
+// Returns A float scalar that enforces proper chaining of operations.
+func TensorArrayScatterV3(scope *Scope, handle tf.Output, indices tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "Conv3DBackpropInputV2",
+		Type: "TensorArrayScatterV3",
 		Input: []tf.Input{
-			input_sizes, filter, out_backprop,
+			handle, indices, value, flow_in,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// DepthToSpaceAttr is an optional argument to DepthToSpace.
-type DepthToSpaceAttr func(optionalAttr)
+// TensorArrayConcatV3Attr is an optional argument to TensorArrayConcatV3.
+type TensorArrayConcatV3Attr func(optionalAttr)
 
-// DepthToSpaceDataFormat sets the optional data_format attribute to value.
-// If not specified, defaults to "NHWC"
-func DepthToSpaceDataFormat(value string) DepthToSpaceAttr {
+// TensorArrayConcatV3ElementShapeExcept0 sets the optional element_shape_except0 attribute to value.
+//
+// value: The expected shape of an element, if known,
+// excluding the first dimension. Used to validate the shapes of
+// TensorArray elements. If this shape is not fully specified, concatenating
+// zero-size TensorArrays is an error.
+// If not specified, defaults to <unknown_rank:true >
+func TensorArrayConcatV3ElementShapeExcept0(value tf.Shape) TensorArrayConcatV3Attr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["element_shape_except0"] = value
 	}
 }
 
-// DepthToSpace for tensors of type T.
-//
-// Rearranges data from depth into blocks of spatial data.
-// This is the reverse transformation of SpaceToDepth. More specifically,
-// this op outputs a copy of the input tensor where values from the `depth`
-// dimension are moved in spatial blocks to the `height` and `width` dimensions.
-// The attr `block_size` indicates the input block size and how the data is moved.
-//
-//   * Chunks of data of size `block_size * block_size` from depth are rearranged
-//     into non-overlapping blocks of size `block_size x block_size`
-//   * The width the output tensor is `input_depth * block_size`, whereas the
-//     height is `input_height * block_size`.
-//   * The Y, X coordinates within each block of the output image are determined
-//     by the high order component of the input channel index.
-//   * The depth of the input tensor must be divisible by
-//     `block_size * block_size`.
-//
-// The `data_format` attr specifies the layout of the input and output tensors
-// with the following options:
-//   "NHWC": `[ batch, height, width, channels ]`
-//   "NCHW": `[ batch, channels, height, width ]`
-//   "NCHW_VECT_C":
-//       `qint8 [ batch, channels / 4, height, width, 4 ]`
-//
-// It is useful to consider the operation as transforming a 6-D Tensor.
-// e.g. for data_format = NHWC,
-//      Each element in the input tensor can be specified via 6 coordinates,
-//      ordered by decreasing memory layout significance as:
-//      n,iY,iX,bY,bX,oC  (where n=batch index, iX, iY means X or Y coordinates
-//                         within the input image, bX, bY means coordinates
-//                         within the output block, oC means output channels).
-//      The output would be the input transposed to the following layout:
-//      n,iY,bY,iX,bX,oC
-//
-// This operation is useful for resizing the activations between convolutions
-// (but keeping all data), e.g. instead of pooling. It is also useful for training
-// purely convolutional models.
-//
-// For example, given an input of shape `[1, 1, 1, 4]`, data_format = "NHWC" and
-// block_size = 2:
-//
-// ```
-// x = [[[[1, 2, 3, 4]]]]
-//
-// ```
-//
-// This operation will output a tensor of shape `[1, 2, 2, 1]`:
-//
-// ```
-//    [[[[1], [2]],
-//      [[3], [4]]]]
-// ```
-//
-// Here, the input has a batch of 1 and each batch element has shape `[1, 1, 4]`,
-// the corresponding output will have 2x2 elements and will have a depth of
-// 1 channel (1 = `4 / (block_size * block_size)`).
-// The output element shape is `[2, 2, 1]`.
-//
-// For an input tensor with larger depth, here of shape `[1, 1, 1, 12]`, e.g.
-//
-// ```
-// x = [[[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]]]
-// ```
-//
-// This operation, for block size of 2, will return the following tensor of shape
-// `[1, 2, 2, 3]`
-//
-// ```
-//    [[[[1, 2, 3], [4, 5, 6]],
-//      [[7, 8, 9], [10, 11, 12]]]]
-//
-// ```
+// Concat the elements from the TensorArray into value `value`.
 //
-// Similarly, for the following input of shape `[1 2 2 4]`, and a block size of 2:
+// Takes `T` elements of shapes
 //
-// ```
-// x =  [[[[1, 2, 3, 4],
-//        [5, 6, 7, 8]],
-//       [[9, 10, 11, 12],
-//        [13, 14, 15, 16]]]]
-// ```
+//   ```
+//   (n0 x d0 x d1 x ...), (n1 x d0 x d1 x ...), ..., (n(T-1) x d0 x d1 x ...)
+//   ```
 //
-// the operator will return the following tensor of shape `[1 4 4 1]`:
+// and concatenates them into a Tensor of shape:
 //
-// ```
-// x = [[[ [1],   [2],  [5],  [6]],
-//       [ [3],   [4],  [7],  [8]],
-//       [ [9],  [10], [13],  [14]],
-//       [ [11], [12], [15],  [16]]]]
+//   ```(n0 + n1 + ... + n(T-1) x d0 x d1 x ...)```
 //
-// ```
+// All elements must have the same shape (excepting the first dimension).
 //
 // Arguments:
+//	handle: The handle to a TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//	dtype: The type of the element that is returned.
 //
-//	block_size: The size of the spatial block, same as in Space2Depth.
-func DepthToSpace(scope *Scope, input tf.Output, block_size int64, optional ...DepthToSpaceAttr) (output tf.Output) {
+// Returns All of the elements in the TensorArray, concatenated along the first
+// axis. A vector of the row sizes of the original T elements in the
+// value output.  In the example above, this would be the values:
+// `(n0, n1, ..., n(T-1))`.
+func TensorArrayConcatV3(scope *Scope, handle tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayConcatV3Attr) (value tf.Output, lengths tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"block_size": block_size}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DepthToSpace",
+		Type: "TensorArrayConcatV3",
 		Input: []tf.Input{
-			input,
+			handle, flow_in,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// MapStageAttr is an optional argument to MapStage.
-type MapStageAttr func(optionalAttr)
+// ParameterizedTruncatedNormalAttr is an optional argument to ParameterizedTruncatedNormal.
+type ParameterizedTruncatedNormalAttr func(optionalAttr)
 
-// MapStageCapacity sets the optional capacity attribute to value.
-//
-// value: Maximum number of elements in the Staging Area. If > 0, inserts
-// on the container will block when the capacity is reached.
-// If not specified, defaults to 0
+// ParameterizedTruncatedNormalSeed sets the optional seed attribute to value.
 //
-// REQUIRES: value >= 0
-func MapStageCapacity(value int64) MapStageAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// MapStageMemoryLimit sets the optional memory_limit attribute to value.
+// value: If either `seed` or `seed2` is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
 // If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func MapStageMemoryLimit(value int64) MapStageAttr {
+func ParameterizedTruncatedNormalSeed(value int64) ParameterizedTruncatedNormalAttr {
 	return func(m optionalAttr) {
-		m["memory_limit"] = value
+		m["seed"] = value
 	}
 }
 
-// MapStageContainer sets the optional container attribute to value.
+// ParameterizedTruncatedNormalSeed2 sets the optional seed2 attribute to value.
 //
-// value: If non-empty, this queue is placed in the given container. Otherwise,
-// a default container is used.
-// If not specified, defaults to ""
-func MapStageContainer(value string) MapStageAttr {
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func ParameterizedTruncatedNormalSeed2(value int64) ParameterizedTruncatedNormalAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["seed2"] = value
 	}
 }
 
-// MapStageSharedName sets the optional shared_name attribute to value.
+// Outputs random values from a truncated normal distribution.
 //
-// value: It is necessary to match this name to the matching Unstage Op.
-// If not specified, defaults to ""
-func MapStageSharedName(value string) MapStageAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Stage (key, values) in the underlying container which behaves like a hashtable.
+// The parameters may each be a scalar which applies to the entire output, or a
+// vector of length shape[0] which stores the parameters for each batch.
 //
 // Arguments:
-//	key: int64
-//
-//	values: a list of tensors
-// dtypes A list of data types that inserted values should adhere to.
-//
+//	shape: The shape of the output tensor. Batches are indexed by the 0th dimension.
+//	means: The mean parameter of each batch.
+//	stdevs: The standard deviation parameter of each batch. Must be greater than 0.
+//	minvals: The minimum cutoff. May be -infinity.
+//	maxvals: The maximum cutoff. May be +infinity, and must be more than the minval
+// for each batch.
 //
-// Returns the created operation.
-func MapStage(scope *Scope, key tf.Output, indices tf.Output, values []tf.Output, dtypes []tf.DataType, optional ...MapStageAttr) (o *tf.Operation) {
+// Returns A matrix of shape num_batches x samples_per_batch, filled with random
+// truncated normal values using the parameters for each row.
+func ParameterizedTruncatedNormal(scope *Scope, shape tf.Output, means tf.Output, stdevs tf.Output, minvals tf.Output, maxvals tf.Output, optional ...ParameterizedTruncatedNormalAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapStage",
+		Type: "ParameterizedTruncatedNormal",
 		Input: []tf.Input{
-			key, indices, tf.OutputList(values),
+			shape, means, stdevs, minvals, maxvals,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MapUnstageAttr is an optional argument to MapUnstage.
-type MapUnstageAttr func(optionalAttr)
-
-// MapUnstageCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Returns a diagonal tensor with the given diagonal values.
 //
-// REQUIRES: value >= 0
-func MapUnstageCapacity(value int64) MapUnstageAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
+// Given a `diagonal`, this operation returns a tensor with the `diagonal` and
+// everything else padded with zeros. The diagonal is computed as follows:
+//
+// Assume `diagonal` has dimensions [D1,..., Dk], then the output is a tensor of
+// rank 2k with dimensions [D1,..., Dk, D1,..., Dk] where:
+//
+// `output[i1,..., ik, i1,..., ik] = diagonal[i1, ..., ik]` and 0 everywhere else.
+//
+// For example:
+//
+// ```
+// # 'diagonal' is [1, 2, 3, 4]
+// tf.diag(diagonal) ==> [[1, 0, 0, 0]
+//                        [0, 2, 0, 0]
+//                        [0, 0, 3, 0]
+//                        [0, 0, 0, 4]]
+// ```
+//
+// Arguments:
+//	diagonal: Rank k tensor where k is at most 1.
+func Diag(scope *Scope, diagonal tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Diag",
+		Input: []tf.Input{
+			diagonal,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MapUnstageMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Split the data from the input value into TensorArray elements.
 //
-// REQUIRES: value >= 0
-func MapUnstageMemoryLimit(value int64) MapUnstageAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
+// Assuming that `lengths` takes on values
+//
+//   ```(n0, n1, ..., n(T-1))```
+//
+// and that `value` has shape
+//
+//   ```(n0 + n1 + ... + n(T-1) x d0 x d1 x ...)```,
+//
+// this splits `value` into a TensorArray with T tensors.
+//
+// TensorArray index t will be the subtensor of `value` with starting position
+//
+//   ```(n0 + n1 + ... + n(t-1), 0, 0, ...)```
+//
+// and having size
+//
+//   ```nt x d0 x d1 x ...```
+//
+// Arguments:
+//	handle: The handle to a TensorArray.
+//	value: The concatenated tensor to write to the TensorArray.
+//	lengths: The vector of lengths specifying how to split the rows of value into
+// the TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//
+// Returns A float scalar that enforces proper chaining of operations.
+func TensorArraySplitV3(scope *Scope, handle tf.Output, value tf.Output, lengths tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// MapUnstageContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func MapUnstageContainer(value string) MapUnstageAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+	opspec := tf.OpSpec{
+		Type: "TensorArraySplitV3",
+		Input: []tf.Input{
+			handle, value, lengths, flow_in,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MapUnstageSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func MapUnstageSharedName(value string) MapUnstageAttr {
+// SerializeSparseAttr is an optional argument to SerializeSparse.
+type SerializeSparseAttr func(optionalAttr)
+
+// SerializeSparseOutType sets the optional out_type attribute to value.
+//
+// value: The `dtype` to use for serialization; the supported types are `string`
+// (default) and `variant`.
+// If not specified, defaults to DT_STRING
+func SerializeSparseOutType(value tf.DataType) SerializeSparseAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["out_type"] = value
 	}
 }
 
-// Op removes and returns the values associated with the key
+// Serialize a `SparseTensor` into a `[3]` `Tensor` object.
 //
-// from the underlying container.   If the underlying container
-// does not contain this key, the op will block until it does.
-func MapUnstage(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...MapUnstageAttr) (values []tf.Output) {
+// Arguments:
+//	sparse_indices: 2-D.  The `indices` of the `SparseTensor`.
+//	sparse_values: 1-D.  The `values` of the `SparseTensor`.
+//	sparse_shape: 1-D.  The `shape` of the `SparseTensor`.
+func SerializeSparse(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...SerializeSparseAttr) (serialized_sparse tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapUnstage",
+		Type: "SerializeSparse",
 		Input: []tf.Input{
-			key, indices,
+			sparse_indices, sparse_values, sparse_shape,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("MapUnstage", err)
-		return
-	}
-	return values
+	return op.Output(0)
 }
 
-// MapSizeAttr is an optional argument to MapSize.
-type MapSizeAttr func(optionalAttr)
+// RandomShuffleQueueV2Attr is an optional argument to RandomShuffleQueueV2.
+type RandomShuffleQueueV2Attr func(optionalAttr)
 
-// MapSizeCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// RandomShuffleQueueV2Shapes sets the optional shapes attribute to value.
 //
-// REQUIRES: value >= 0
-func MapSizeCapacity(value int64) MapSizeAttr {
+// value: The shape of each component in a value. The length of this attr must
+// be either 0 or the same as the length of component_types. If the length of
+// this attr is 0, the shapes of queue elements are not constrained, and
+// only one element may be dequeued at a time.
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func RandomShuffleQueueV2Shapes(value []tf.Shape) RandomShuffleQueueV2Attr {
+	return func(m optionalAttr) {
+		m["shapes"] = value
+	}
+}
+
+// RandomShuffleQueueV2Capacity sets the optional capacity attribute to value.
+//
+// value: The upper bound on the number of elements in this queue.
+// Negative numbers mean no limit.
+// If not specified, defaults to -1
+func RandomShuffleQueueV2Capacity(value int64) RandomShuffleQueueV2Attr {
 	return func(m optionalAttr) {
 		m["capacity"] = value
 	}
 }
 
-// MapSizeMemoryLimit sets the optional memory_limit attribute to value.
+// RandomShuffleQueueV2MinAfterDequeue sets the optional min_after_dequeue attribute to value.
+//
+// value: Dequeue will block unless there would be this
+// many elements after the dequeue or the queue is closed. This
+// ensures a minimum level of mixing of elements.
+// If not specified, defaults to 0
+func RandomShuffleQueueV2MinAfterDequeue(value int64) RandomShuffleQueueV2Attr {
+	return func(m optionalAttr) {
+		m["min_after_dequeue"] = value
+	}
+}
+
+// RandomShuffleQueueV2Seed sets the optional seed attribute to value.
+//
+// value: If either seed or seed2 is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, a random seed is used.
 // If not specified, defaults to 0
+func RandomShuffleQueueV2Seed(value int64) RandomShuffleQueueV2Attr {
+	return func(m optionalAttr) {
+		m["seed"] = value
+	}
+}
+
+// RandomShuffleQueueV2Seed2 sets the optional seed2 attribute to value.
 //
-// REQUIRES: value >= 0
-func MapSizeMemoryLimit(value int64) MapSizeAttr {
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func RandomShuffleQueueV2Seed2(value int64) RandomShuffleQueueV2Attr {
 	return func(m optionalAttr) {
-		m["memory_limit"] = value
+		m["seed2"] = value
 	}
 }
 
-// MapSizeContainer sets the optional container attribute to value.
+// RandomShuffleQueueV2Container sets the optional container attribute to value.
+//
+// value: If non-empty, this queue is placed in the given container.
+// Otherwise, a default container is used.
 // If not specified, defaults to ""
-func MapSizeContainer(value string) MapSizeAttr {
+func RandomShuffleQueueV2Container(value string) RandomShuffleQueueV2Attr {
 	return func(m optionalAttr) {
 		m["container"] = value
 	}
 }
 
-// MapSizeSharedName sets the optional shared_name attribute to value.
+// RandomShuffleQueueV2SharedName sets the optional shared_name attribute to value.
+//
+// value: If non-empty, this queue will be shared under the given name
+// across multiple sessions.
 // If not specified, defaults to ""
-func MapSizeSharedName(value string) MapSizeAttr {
+func RandomShuffleQueueV2SharedName(value string) RandomShuffleQueueV2Attr {
 	return func(m optionalAttr) {
 		m["shared_name"] = value
 	}
 }
 
-// Op returns the number of elements in the underlying container.
-func MapSize(scope *Scope, dtypes []tf.DataType, optional ...MapSizeAttr) (size tf.Output) {
+// A queue that randomizes the order of elements.
+//
+// Arguments:
+//	component_types: The type of each component in a value.
+//
+// Returns The handle to the queue.
+func RandomShuffleQueueV2(scope *Scope, component_types []tf.DataType, optional ...RandomShuffleQueueV2Attr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{"component_types": component_types}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapSize",
+		Type: "RandomShuffleQueueV2",
 
 		Attrs: attrs,
 	}
@@ -24455,245 +24382,291 @@ func MapSize(scope *Scope, dtypes []tf.DataType, optional ...MapSizeAttr) (size
 	return op.Output(0)
 }
 
-// MapIncompleteSizeAttr is an optional argument to MapIncompleteSize.
-type MapIncompleteSizeAttr func(optionalAttr)
-
-// MapIncompleteSizeCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Draw bounding boxes on a batch of images.
 //
-// REQUIRES: value >= 0
-func MapIncompleteSizeCapacity(value int64) MapIncompleteSizeAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// MapIncompleteSizeMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Outputs a copy of `images` but draws on top of the pixels zero or more bounding
+// boxes specified by the locations in `boxes`. The coordinates of each
+// bounding box in `boxes` are encoded as `[y_min, x_min, y_max, x_max]`. The
+// bounding box coordinates are floats in `[0.0, 1.0]` relative to the width and
+// height of the underlying image.
 //
-// REQUIRES: value >= 0
-func MapIncompleteSizeMemoryLimit(value int64) MapIncompleteSizeAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
+// For example, if an image is 100 x 200 pixels (height x width) and the bounding
+// box is `[0.1, 0.2, 0.5, 0.9]`, the upper-left and bottom-right coordinates of
+// the bounding box will be `(40, 10)` to `(180, 50)` (in (x,y) coordinates).
+//
+// Parts of the bounding box may fall outside the image.
+//
+// Arguments:
+//	images: 4-D with shape `[batch, height, width, depth]`. A batch of images.
+//	boxes: 3-D with shape `[batch, num_bounding_boxes, 4]` containing bounding
+// boxes.
+//
+// Returns 4-D with the same shape as `images`. The batch of input images with
+// bounding boxes drawn on the images.
+func DrawBoundingBoxes(scope *Scope, images tf.Output, boxes tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "DrawBoundingBoxes",
+		Input: []tf.Input{
+			images, boxes,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MapIncompleteSizeContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func MapIncompleteSizeContainer(value string) MapIncompleteSizeAttr {
+// LearnedUnigramCandidateSamplerAttr is an optional argument to LearnedUnigramCandidateSampler.
+type LearnedUnigramCandidateSamplerAttr func(optionalAttr)
+
+// LearnedUnigramCandidateSamplerSeed sets the optional seed attribute to value.
+//
+// value: If either seed or seed2 is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func LearnedUnigramCandidateSamplerSeed(value int64) LearnedUnigramCandidateSamplerAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["seed"] = value
 	}
 }
 
-// MapIncompleteSizeSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func MapIncompleteSizeSharedName(value string) MapIncompleteSizeAttr {
+// LearnedUnigramCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+//
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func LearnedUnigramCandidateSamplerSeed2(value int64) LearnedUnigramCandidateSamplerAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["seed2"] = value
 	}
 }
 
-// Op returns the number of incomplete elements in the underlying container.
-func MapIncompleteSize(scope *Scope, dtypes []tf.DataType, optional ...MapIncompleteSizeAttr) (size tf.Output) {
+// Generates labels for candidate sampling with a learned unigram distribution.
+//
+// See explanations of candidate sampling and the data formats at
+// go/candidate-sampling.
+//
+// For each batch, this op picks a single set of sampled candidate labels.
+//
+// The advantages of sampling candidates per-batch are simplicity and the
+// possibility of efficient dense matrix multiplication. The disadvantage is that
+// the sampled candidates must be chosen independently of the context and of the
+// true labels.
+//
+// Arguments:
+//	true_classes: A batch_size * num_true matrix, in which each row contains the
+// IDs of the num_true target_classes in the corresponding original label.
+//	num_true: Number of true labels per context.
+//	num_sampled: Number of candidates to randomly sample.
+//	unique: If unique is true, we sample with rejection, so that all sampled
+// candidates in a batch are unique. This requires some approximation to
+// estimate the post-rejection sampling probabilities.
+//	range_max: The sampler will sample integers from the interval [0, range_max).
+//
+// Returns A vector of length num_sampled, in which each element is
+// the ID of a sampled candidate. A batch_size * num_true matrix, representing
+// the number of times each candidate is expected to occur in a batch
+// of sampled candidates. If unique=true, then this is a probability. A vector
+// of length num_sampled, for each sampled candidate representing the number of
+// times the candidate is expected to occur in a batch of sampled candidates.
+// If unique=true, then this is a probability.
+func LearnedUnigramCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...LearnedUnigramCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapIncompleteSize",
-
+		Type: "LearnedUnigramCandidateSampler",
+		Input: []tf.Input{
+			true_classes,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// OrderedMapUnstageAttr is an optional argument to OrderedMapUnstage.
-type OrderedMapUnstageAttr func(optionalAttr)
-
-// OrderedMapUnstageCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Computes gradients for the scaled exponential linear (Selu) operation.
 //
-// REQUIRES: value >= 0
-func OrderedMapUnstageCapacity(value int64) OrderedMapUnstageAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
+// Arguments:
+//	gradients: The backpropagated gradients to the corresponding Selu operation.
+//	outputs: The outputs of the corresponding Selu operation.
+//
+// Returns The gradients: `gradients * (outputs + scale * alpha)`
+// if outputs < 0, `scale * gradients` otherwise.
+func SeluGrad(scope *Scope, gradients tf.Output, outputs tf.Output) (backprops tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
+	opspec := tf.OpSpec{
+		Type: "SeluGrad",
+		Input: []tf.Input{
+			gradients, outputs,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// OrderedMapUnstageMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Get the current size of the TensorArray.
 //
-// REQUIRES: value >= 0
-func OrderedMapUnstageMemoryLimit(value int64) OrderedMapUnstageAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
+// Arguments:
+//	handle: The handle to a TensorArray (output of TensorArray or TensorArrayGrad).
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//
+// Returns The current size of the TensorArray.
+func TensorArraySizeV3(scope *Scope, handle tf.Output, flow_in tf.Output) (size tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "TensorArraySizeV3",
+		Input: []tf.Input{
+			handle, flow_in,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// OrderedMapUnstageContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func OrderedMapUnstageContainer(value string) OrderedMapUnstageAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+// Deprecated. Use TensorArrayWriteV3
+//
+// DEPRECATED at GraphDef version 26: Use TensorArrayWriteV3
+func TensorArrayWriteV2(scope *Scope, handle tf.Output, index tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "TensorArrayWriteV2",
+		Input: []tf.Input{
+			handle, index, value, flow_in,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// OrderedMapUnstageSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func OrderedMapUnstageSharedName(value string) OrderedMapUnstageAttr {
+// SparseReduceMaxAttr is an optional argument to SparseReduceMax.
+type SparseReduceMaxAttr func(optionalAttr)
+
+// SparseReduceMaxKeepDims sets the optional keep_dims attribute to value.
+//
+// value: If true, retain reduced dimensions with length 1.
+// If not specified, defaults to false
+func SparseReduceMaxKeepDims(value bool) SparseReduceMaxAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["keep_dims"] = value
 	}
 }
 
-// Op removes and returns the values associated with the key
+// Computes the max of elements across dimensions of a SparseTensor.
 //
-// from the underlying container.   If the underlying container
-// does not contain this key, the op will block until it does.
-func OrderedMapUnstage(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...OrderedMapUnstageAttr) (values []tf.Output) {
+// This Op takes a SparseTensor and is the sparse counterpart to
+// `tf.reduce_max()`.  In particular, this Op also returns a dense `Tensor`
+// instead of a sparse one.
+//
+// Reduces `sp_input` along the dimensions given in `reduction_axes`.  Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `reduction_axes`. If `keep_dims` is true, the reduced dimensions are retained
+// with length 1.
+//
+// If `reduction_axes` has no entries, all dimensions are reduced, and a tensor
+// with a single element is returned.  Additionally, the axes can be negative,
+// which are interpreted according to the indexing rules in Python.
+//
+// Arguments:
+//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, possibly not in canonical ordering.
+//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
+//	input_shape: 1-D.  Shape of the input SparseTensor.
+//	reduction_axes: 1-D.  Length-`K` vector containing the reduction axes.
+//
+// Returns `R-K`-D.  The reduced Tensor.
+func SparseReduceMax(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output, reduction_axes tf.Output, optional ...SparseReduceMaxAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "OrderedMapUnstage",
+		Type: "SparseReduceMax",
 		Input: []tf.Input{
-			key, indices,
+			input_indices, input_values, input_shape, reduction_axes,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("OrderedMapUnstage", err)
-		return
-	}
-	return values
+	return op.Output(0)
 }
 
-// OrderedMapSizeAttr is an optional argument to OrderedMapSize.
-type OrderedMapSizeAttr func(optionalAttr)
+// AsStringAttr is an optional argument to AsString.
+type AsStringAttr func(optionalAttr)
 
-// OrderedMapSizeCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// AsStringPrecision sets the optional precision attribute to value.
 //
-// REQUIRES: value >= 0
-func OrderedMapSizeCapacity(value int64) OrderedMapSizeAttr {
+// value: The post-decimal precision to use for floating point numbers.
+// Only used if precision > -1.
+// If not specified, defaults to -1
+func AsStringPrecision(value int64) AsStringAttr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["precision"] = value
 	}
 }
 
-// OrderedMapSizeMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// AsStringScientific sets the optional scientific attribute to value.
 //
-// REQUIRES: value >= 0
-func OrderedMapSizeMemoryLimit(value int64) OrderedMapSizeAttr {
+// value: Use scientific notation for floating point numbers.
+// If not specified, defaults to false
+func AsStringScientific(value bool) AsStringAttr {
 	return func(m optionalAttr) {
-		m["memory_limit"] = value
+		m["scientific"] = value
 	}
 }
 
-// OrderedMapSizeContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func OrderedMapSizeContainer(value string) OrderedMapSizeAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// OrderedMapSizeSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func OrderedMapSizeSharedName(value string) OrderedMapSizeAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Op returns the number of elements in the underlying container.
-func OrderedMapSize(scope *Scope, dtypes []tf.DataType, optional ...OrderedMapSizeAttr) (size tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "OrderedMapSize",
-
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// CTCLossAttr is an optional argument to CTCLoss.
-type CTCLossAttr func(optionalAttr)
-
-// CTCLossPreprocessCollapseRepeated sets the optional preprocess_collapse_repeated attribute to value.
+// AsStringShortest sets the optional shortest attribute to value.
 //
-// value: Scalar, if true then repeated labels are
-// collapsed prior to the CTC calculation.
+// value: Use shortest representation (either scientific or standard) for
+// floating point numbers.
 // If not specified, defaults to false
-func CTCLossPreprocessCollapseRepeated(value bool) CTCLossAttr {
+func AsStringShortest(value bool) AsStringAttr {
 	return func(m optionalAttr) {
-		m["preprocess_collapse_repeated"] = value
+		m["shortest"] = value
 	}
 }
 
-// CTCLossCtcMergeRepeated sets the optional ctc_merge_repeated attribute to value.
+// AsStringWidth sets the optional width attribute to value.
 //
-// value: Scalar.  If set to false, *during* CTC calculation
-// repeated non-blank labels will not be merged and are interpreted as
-// individual labels.  This is a simplified version of CTC.
-// If not specified, defaults to true
-func CTCLossCtcMergeRepeated(value bool) CTCLossAttr {
+// value: Pad pre-decimal numbers to this width.
+// Applies to both floating point and integer numbers.
+// Only used if width > -1.
+// If not specified, defaults to -1
+func AsStringWidth(value int64) AsStringAttr {
 	return func(m optionalAttr) {
-		m["ctc_merge_repeated"] = value
+		m["width"] = value
 	}
 }
 
-// CTCLossIgnoreLongerOutputsThanInputs sets the optional ignore_longer_outputs_than_inputs attribute to value.
+// AsStringFill sets the optional fill attribute to value.
 //
-// value: Scalar. If set to true, during CTC
-// calculation, items that have longer output sequences than input sequences
-// are skipped: they don't contribute to the loss term and have zero-gradient.
-// If not specified, defaults to false
-func CTCLossIgnoreLongerOutputsThanInputs(value bool) CTCLossAttr {
+// value: The value to pad if width > -1.  If empty, pads with spaces.
+// Another typical value is '0'.  String cannot be longer than 1 character.
+// If not specified, defaults to ""
+func AsStringFill(value string) AsStringAttr {
 	return func(m optionalAttr) {
-		m["ignore_longer_outputs_than_inputs"] = value
+		m["fill"] = value
 	}
 }
 
-// Calculates the CTC Loss (log probability) for each batch entry.  Also calculates
-//
-// the gradient.  This class performs the softmax operation for you, so inputs
-// should be e.g. linear projections of outputs by an LSTM.
-//
-// Arguments:
-//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
-//	labels_indices: The indices of a `SparseTensor<int32, 2>`.
-// `labels_indices(i, :) == [b, t]` means `labels_values(i)` stores the id for
-// `(batch b, time t)`.
-//	labels_values: The values (labels) associated with the given batch and time.
-//	sequence_length: A vector containing sequence lengths (batch).
+// Converts each entry in the given tensor to strings.  Supports many numeric
 //
-// Returns A vector (batch) containing log-probabilities.The gradient of `loss`.  3-D, shape:
-// `(max_time x batch_size x num_classes)`.
-func CTCLoss(scope *Scope, inputs tf.Output, labels_indices tf.Output, labels_values tf.Output, sequence_length tf.Output, optional ...CTCLossAttr) (loss tf.Output, gradient tf.Output) {
+// types and boolean.
+func AsString(scope *Scope, input tf.Output, optional ...AsStringAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -24702,1059 +24675,1040 @@ func CTCLoss(scope *Scope, inputs tf.Output, labels_indices tf.Output, labels_va
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "CTCLoss",
+		Type: "AsString",
 		Input: []tf.Input{
-			inputs, labels_indices, labels_values, sequence_length,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
-}
-
-// CTCGreedyDecoderAttr is an optional argument to CTCGreedyDecoder.
-type CTCGreedyDecoderAttr func(optionalAttr)
-
-// CTCGreedyDecoderMergeRepeated sets the optional merge_repeated attribute to value.
-//
-// value: If True, merge repeated classes in output.
-// If not specified, defaults to false
-func CTCGreedyDecoderMergeRepeated(value bool) CTCGreedyDecoderAttr {
-	return func(m optionalAttr) {
-		m["merge_repeated"] = value
-	}
+	return op.Output(0)
 }
 
-// Performs greedy decoding on the logits given in inputs.
-//
-// A note about the attribute merge_repeated: if enabled, when
-// consecutive logits' maximum indices are the same, only the first of
-// these is emitted.  Labeling the blank '*', the sequence "A B B * B B"
-// becomes "A B B" if merge_repeated = True and "A B B B B" if
-// merge_repeated = False.
-//
-// Regardless of the value of merge_repeated, if the maximum index of a given
-// time and batch corresponds to the blank, index `(num_classes - 1)`, no new
-// element is emitted.
-//
-// Arguments:
-//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
-//	sequence_length: A vector containing sequence lengths, size `(batch_size)`.
+// Deprecated. Use TensorArrayScatterV3
 //
-// Returns Indices matrix, size `(total_decoded_outputs x 2)`,
-// of a `SparseTensor<int64, 2>`.  The rows store: [batch, time].Values vector, size: `(total_decoded_outputs)`,
-// of a `SparseTensor<int64, 2>`.  The vector stores the decoded classes.Shape vector, size `(2)`, of the decoded SparseTensor.
-// Values are: `[batch_size, max_decoded_length]`.Matrix, size `(batch_size x 1)`, containing sequence
-// log-probabilities.
-func CTCGreedyDecoder(scope *Scope, inputs tf.Output, sequence_length tf.Output, optional ...CTCGreedyDecoderAttr) (decoded_indices tf.Output, decoded_values tf.Output, decoded_shape tf.Output, log_probability tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArrayScatterV3
+func TensorArrayScatterV2(scope *Scope, handle tf.Output, indices tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "CTCGreedyDecoder",
+		Type: "TensorArrayScatterV2",
 		Input: []tf.Input{
-			inputs, sequence_length,
+			handle, indices, value, flow_in,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3)
+	return op.Output(0)
 }
 
-// Forwards `data` to the output port determined by `pred`.
+// Applies sparse addition to `input` using individual values or slices
 //
-// If `pred` is true, the `data` input is forwarded to `output_true`. Otherwise,
-// the data goes to `output_false`.
+// from `updates` according to indices `indices`.  The updates are non-aliasing:
+// `input` is only modified in-place if no other operations will use it.
+// Otherwise, a copy of `input` is made.  This operation has a gradient with
+// respect to both `input` and `updates`.
 //
-// See also `RefSwitch` and `Merge`.
+// `input` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
 //
-// Arguments:
-//	data: The tensor to be forwarded to the appropriate output.
-//	pred: A scalar that specifies which output port will receive data.
+// `indices` must be an integer tensor, containing indices into `input`.
+// It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
 //
-// Returns If `pred` is false, data will be forwarded to this output.If `pred` is true, data will be forwarded to this output.
-func Switch(scope *Scope, data tf.Output, pred tf.Output) (output_false tf.Output, output_true tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Switch",
-		Input: []tf.Input{
-			data, pred,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
-}
-
-// Add all input tensors element wise.
+// The innermost dimension of `indices` (with length `K`) corresponds to
+// indices into elements (if `K = P`) or `(P-K)`-dimensional slices
+// (if `K < P`) along the `K`th dimension of `input`.
+//
+// `updates` is a `Tensor` of rank `Q-1+P-K` with shape:
+//
+// ```
+// [d_0, ..., d_{Q-2}, input.shape[K], ..., input.shape[P-1]].
+// ```
+//
+// For example, say we want to add 4 scattered elements to a rank-1 tensor with 8
+// elements. In Python, that addition would look like this:
+//
+//     input = tf.constant([1, 2, 3, 4, 5, 6, 7, 8])
+//     indices = tf.constant([[4], [3], [1], [7]])
+//     updates = tf.constant([9, 10, 11, 12])
+//     output = tf.scatter_nd_non_aliasing_add(input, indices, updates)
+//     with tf.Session() as sess:
+//       print(sess.run(output))
+//
+// The resulting value `output` would look like this:
+//
+//     [1, 13, 3, 14, 14, 6, 7, 20]
+//
+// See @{tf.scatter_nd} for more details about how to make updates to slices.
 //
 // Arguments:
-//	inputs: Must all be the same size and shape.
-func AddN(scope *Scope, inputs []tf.Output) (sum tf.Output) {
+//	input: A Tensor.
+//	indices: A Tensor. Must be one of the following types: `int32`, `int64`.
+// A tensor of indices into `input`.
+//	updates: A Tensor. Must have the same type as `input`. A tensor of updated values
+// to add to `input`.
+//
+// Returns A `Tensor` with the same shape as `input`, containing values of `input`
+// updated with `updates`.
+func ScatterNdNonAliasingAdd(scope *Scope, input tf.Output, indices tf.Output, updates tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "AddN",
+		Type: "ScatterNdNonAliasingAdd",
 		Input: []tf.Input{
-			tf.OutputList(inputs),
+			input, indices, updates,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// EnterAttr is an optional argument to Enter.
-type EnterAttr func(optionalAttr)
+// FractionalMaxPoolAttr is an optional argument to FractionalMaxPool.
+type FractionalMaxPoolAttr func(optionalAttr)
 
-// EnterIsConstant sets the optional is_constant attribute to value.
+// FractionalMaxPoolPseudoRandom sets the optional pseudo_random attribute to value.
 //
-// value: If true, the output is constant within the child frame.
+// value: When set to True, generates the pooling sequence in a
+// pseudorandom fashion; otherwise, in a random fashion. See the paper [Benjamin
+// Graham, Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) for the
+// difference between pseudorandom and random.
 // If not specified, defaults to false
-func EnterIsConstant(value bool) EnterAttr {
+func FractionalMaxPoolPseudoRandom(value bool) FractionalMaxPoolAttr {
 	return func(m optionalAttr) {
-		m["is_constant"] = value
+		m["pseudo_random"] = value
 	}
 }
 
-// EnterParallelIterations sets the optional parallel_iterations attribute to value.
+// FractionalMaxPoolOverlapping sets the optional overlapping attribute to value.
 //
-// value: The number of iterations allowed to run in parallel.
-// If not specified, defaults to 10
-func EnterParallelIterations(value int64) EnterAttr {
+// value: When set to True, it means when pooling, the values at the boundary
+// of adjacent pooling cells are used by both cells. For example:
+//
+// `index  0  1  2  3  4`
+//
+// `value  20 5  16 3  7`
+//
+// If the pooling sequence is [0, 2, 4], then 16, at index 2, will be used twice.
+// The result would be [20, 16] for fractional max pooling.
+// If not specified, defaults to false
+func FractionalMaxPoolOverlapping(value bool) FractionalMaxPoolAttr {
 	return func(m optionalAttr) {
-		m["parallel_iterations"] = value
+		m["overlapping"] = value
 	}
 }
 
-// Creates or finds a child frame, and makes `data` available to the child frame.
-//
-// This op is used together with `Exit` to create loops in the graph.
-// The unique `frame_name` is used by the `Executor` to identify frames. If
-// `is_constant` is true, `output` is a constant in the child frame; otherwise
-// it may be changed in the child frame. At most `parallel_iterations` iterations
-// are run in parallel in the child frame.
+// FractionalMaxPoolDeterministic sets the optional deterministic attribute to value.
 //
-// Arguments:
-//	data: The tensor to be made available to the child frame.
-//	frame_name: The name of the child frame.
+// value: When set to True, a fixed pooling region will be used when
+// iterating over a FractionalMaxPool node in the computation graph. Mainly used
+// in unit tests to make FractionalMaxPool deterministic.
+// If not specified, defaults to false
+func FractionalMaxPoolDeterministic(value bool) FractionalMaxPoolAttr {
+	return func(m optionalAttr) {
+		m["deterministic"] = value
+	}
+}
+
+// FractionalMaxPoolSeed sets the optional seed attribute to value.
 //
-// Returns The same tensor as `data`.
-func Enter(scope *Scope, data tf.Output, frame_name string, optional ...EnterAttr) (output tf.Output) {
+// value: If either seed or seed2 is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func FractionalMaxPoolSeed(value int64) FractionalMaxPoolAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
+	}
+}
+
+// FractionalMaxPoolSeed2 sets the optional seed2 attribute to value.
+//
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func FractionalMaxPoolSeed2(value int64) FractionalMaxPoolAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// Performs fractional max pooling on the input.
+//
+// Fractional max pooling is slightly different than regular max pooling.  In
+// regular max pooling, you downsize an input set by taking the maximum value of
+// smaller N x N subsections of the set (often 2x2), and try to reduce the set by
+// a factor of N, where N is an integer.  Fractional max pooling, as you might
+// expect from the word "fractional", means that the overall reduction ratio N
+// does not have to be an integer.
+//
+// The sizes of the pooling regions are generated randomly but are fairly uniform.
+// For example, let's look at the height dimension, and the constraints on the
+// list of rows that will be pool boundaries.
+//
+// First we define the following:
+//
+// 1.  input_row_length : the number of rows from the input set
+// 2.  output_row_length : which will be smaller than the input
+// 3.  alpha = input_row_length / output_row_length : our reduction ratio
+// 4.  K = floor(alpha)
+// 5.  row_pooling_sequence : this is the result list of pool boundary rows
+//
+// Then, row_pooling_sequence should satisfy:
+//
+// 1.  a[0] = 0 : the first value of the sequence is 0
+// 2.  a[end] = input_row_length : the last value of the sequence is the size
+// 3.  K <= (a[i+1] - a[i]) <= K+1 : all intervals are K or K+1 size
+// 4.  length(row_pooling_sequence) = output_row_length+1
+//
+// For more details on fractional max pooling, see this paper:
+// [Benjamin Graham, Fractional Max-Pooling](http://arxiv.org/abs/1412.6071)
+//
+// Arguments:
+//	value: 4-D with shape `[batch, height, width, channels]`.
+//	pooling_ratio: Pooling ratio for each dimension of `value`, currently only
+// supports row and col dimension and should be >= 1.0. For example, a valid
+// pooling ratio looks like [1.0, 1.44, 1.73, 1.0]. The first and last elements
+// must be 1.0 because we don't allow pooling on batch and channels
+// dimensions. 1.44 and 1.73 are pooling ratio on height and width dimensions
+// respectively.
+//
+// Returns output tensor after fractional max pooling. Row pooling sequence, needed to calculate gradient. Column pooling sequence, needed to calculate gradient.
+func FractionalMaxPool(scope *Scope, value tf.Output, pooling_ratio []float32, optional ...FractionalMaxPoolAttr) (output tf.Output, row_pooling_sequence tf.Output, col_pooling_sequence tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"frame_name": frame_name}
+	attrs := map[string]interface{}{"pooling_ratio": pooling_ratio}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Enter",
+		Type: "FractionalMaxPool",
 		Input: []tf.Input{
-			data,
+			value,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Produce a string tensor that encodes the state of a Reader.
-//
-// Not all Readers support being serialized, so this can produce an
-// Unimplemented error.
+// Deprecated. Use TensorArraySizeV3
 //
-// Arguments:
-//	reader_handle: Handle to a Reader.
-func ReaderSerializeStateV2(scope *Scope, reader_handle tf.Output) (state tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArraySizeV3
+func TensorArraySizeV2(scope *Scope, handle tf.Output, flow_in tf.Output) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ReaderSerializeStateV2",
+		Type: "TensorArraySizeV2",
 		Input: []tf.Input{
-			reader_handle,
+			handle, flow_in,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Exits the current frame to its parent frame.
+// Conv2DAttr is an optional argument to Conv2D.
+type Conv2DAttr func(optionalAttr)
+
+// Conv2DUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
+// If not specified, defaults to true
+func Conv2DUseCudnnOnGpu(value bool) Conv2DAttr {
+	return func(m optionalAttr) {
+		m["use_cudnn_on_gpu"] = value
+	}
+}
+
+// Conv2DDataFormat sets the optional data_format attribute to value.
 //
-// Exit makes its input `data` available to the parent frame.
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, height, width, channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, channels, height, width].
+// If not specified, defaults to "NHWC"
+func Conv2DDataFormat(value string) Conv2DAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Conv2DDilations sets the optional dilations attribute to value.
+//
+// value: 1-D tensor of length 4.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each
+// filter element on that dimension. The dimension order is determined by the
+// value of `data_format`, see above for details. Dilations in the batch and
+// depth dimensions must be 1.
+// If not specified, defaults to <i:1 i:1 i:1 i:1 >
+func Conv2DDilations(value []int64) Conv2DAttr {
+	return func(m optionalAttr) {
+		m["dilations"] = value
+	}
+}
+
+// Computes a 2-D convolution given 4-D `input` and `filter` tensors.
+//
+// Given an input tensor of shape `[batch, in_height, in_width, in_channels]`
+// and a filter / kernel tensor of shape
+// `[filter_height, filter_width, in_channels, out_channels]`, this op
+// performs the following:
+//
+// 1. Flattens the filter to a 2-D matrix with shape
+//    `[filter_height * filter_width * in_channels, out_channels]`.
+// 2. Extracts image patches from the input tensor to form a *virtual*
+//    tensor of shape `[batch, out_height, out_width,
+//    filter_height * filter_width * in_channels]`.
+// 3. For each patch, right-multiplies the filter matrix and the image patch
+//    vector.
+//
+// In detail, with the default NHWC format,
+//
+//     output[b, i, j, k] =
+//         sum_{di, dj, q} input[b, strides[1] * i + di, strides[2] * j + dj, q] *
+//                         filter[di, dj, q, k]
+//
+// Must have `strides[0] = strides[3] = 1`.  For the most common case of the same
+// horizontal and vertical strides, `strides = [1, stride, stride, 1]`.
 //
 // Arguments:
-//	data: The tensor to be made available to the parent frame.
+//	input: A 4-D tensor. The dimension order is interpreted according to the value
+// of `data_format`, see below for details.
+//	filter: A 4-D tensor of shape
+// `[filter_height, filter_width, in_channels, out_channels]`
+//	strides: 1-D tensor of length 4.  The stride of the sliding window for each
+// dimension of `input`. The dimension order is determined by the value of
+// `data_format`, see below for details.
+//	padding: The type of padding algorithm to use.
 //
-// Returns The same tensor as `data`.
-func Exit(scope *Scope, data tf.Output) (output tf.Output) {
+// Returns A 4-D tensor. The dimension order is determined by the value of
+// `data_format`, see below for details.
+func Conv2D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, padding string, optional ...Conv2DAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Exit",
+		Type: "Conv2D",
 		Input: []tf.Input{
-			data,
+			input, filter,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns a copy of the input tensor.
-func Snapshot(scope *Scope, input tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Snapshot",
-		Input: []tf.Input{
-			input,
-		},
+// StageAttr is an optional argument to Stage.
+type StageAttr func(optionalAttr)
+
+// StageCapacity sets the optional capacity attribute to value.
+//
+// value: Maximum number of elements in the Staging Area. If > 0, inserts
+// on the container will block when the capacity is reached.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func StageCapacity(value int64) StageAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Scatter `updates` into a new (initially zero) tensor according to `indices`.
+// StageMemoryLimit sets the optional memory_limit attribute to value.
 //
-// Creates a new tensor by applying sparse `updates` to individual
-// values or slices within a zero tensor of the given `shape` according to
-// indices.  This operator is the inverse of the @{tf.gather_nd} operator which
-// extracts values or slices from a given tensor.
+// value: The maximum number of bytes allowed for Tensors in the Staging Area.
+// If > 0, inserts will block until sufficient space is available.
+// If not specified, defaults to 0
 //
-// **WARNING**: The order in which updates are applied is nondeterministic, so the
-// output will be nondeterministic if `indices` contains duplicates.
+// REQUIRES: value >= 0
+func StageMemoryLimit(value int64) StageAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// StageContainer sets the optional container attribute to value.
 //
-// `indices` is an integer tensor containing indices into a new tensor of shape
-// `shape`.  The last dimension of `indices` can be at most the rank of `shape`:
+// value: If non-empty, this queue is placed in the given container. Otherwise,
+// a default container is used.
+// If not specified, defaults to ""
+func StageContainer(value string) StageAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// StageSharedName sets the optional shared_name attribute to value.
 //
-//     indices.shape[-1] <= shape.rank
+// value: This name must match the shared_name of the corresponding Unstage Op.
+// If not specified, defaults to ""
+func StageSharedName(value string) StageAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Stage values similar to a lightweight Enqueue.
 //
-// The last dimension of `indices` corresponds to indices into elements
-// (if `indices.shape[-1] = shape.rank`) or slices
-// (if `indices.shape[-1] < shape.rank`) along dimension `indices.shape[-1]` of
-// `shape`.  `updates` is a tensor with shape
-//
-//     indices.shape[:-1] + shape[indices.shape[-1]:]
-//
-// The simplest form of scatter is to insert individual elements in a tensor by
-// index. For example, say we want to insert 4 scattered elements in a rank-1
-// tensor with 8 elements.
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/ScatterNd1.png" alt>
-// </div>
-//
-// In Python, this scatter operation would look like this:
-//
-// ```python
-//     indices = tf.constant([[4], [3], [1], [7]])
-//     updates = tf.constant([9, 10, 11, 12])
-//     shape = tf.constant([8])
-//     scatter = tf.scatter_nd(indices, updates, shape)
-//     with tf.Session() as sess:
-//       print(sess.run(scatter))
-// ```
-//
-// The resulting tensor would look like this:
-//
-//     [0, 11, 0, 10, 9, 0, 0, 12]
-//
-// We can also, insert entire slices of a higher rank tensor all at once. For
-// example, if we wanted to insert two slices in the first dimension of a
-// rank-3 tensor with two matrices of new values.
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/ScatterNd2.png" alt>
-// </div>
-//
-// In Python, this scatter operation would look like this:
-//
-// ```python
-//     indices = tf.constant([[0], [2]])
-//     updates = tf.constant([[[5, 5, 5, 5], [6, 6, 6, 6],
-//                             [7, 7, 7, 7], [8, 8, 8, 8]],
-//                            [[5, 5, 5, 5], [6, 6, 6, 6],
-//                             [7, 7, 7, 7], [8, 8, 8, 8]]])
-//     shape = tf.constant([4, 4, 4])
-//     scatter = tf.scatter_nd(indices, updates, shape)
-//     with tf.Session() as sess:
-//       print(sess.run(scatter))
-// ```
-//
-// The resulting tensor would look like this:
-//
-//     [[[5, 5, 5, 5], [6, 6, 6, 6], [7, 7, 7, 7], [8, 8, 8, 8]],
-//      [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
-//      [[5, 5, 5, 5], [6, 6, 6, 6], [7, 7, 7, 7], [8, 8, 8, 8]],
-//      [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]]
+// The basic functionality of this Op is similar to a queue with many
+// fewer capabilities and options.  This Op is optimized for performance.
 //
 // Arguments:
-//	indices: Index tensor.
-//	updates: Updates to scatter into output.
-//	shape: 1-D. The shape of the resulting tensor.
+//	values: a list of tensors
+// dtypes: A list of data types that inserted values should adhere to.
 //
-// Returns A new tensor with the given shape and updates applied according
-// to the indices.
-func ScatterNd(scope *Scope, indices tf.Output, updates tf.Output, shape tf.Output) (output tf.Output) {
+// Returns the created operation.
+func Stage(scope *Scope, values []tf.Output, optional ...StageAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ScatterNd",
+		Type: "Stage",
 		Input: []tf.Input{
-			indices, updates, shape,
+			tf.OutputList(values),
 		},
+		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// SpaceToDepthAttr is an optional argument to SpaceToDepth.
-type SpaceToDepthAttr func(optionalAttr)
+// StagePeekAttr is an optional argument to StagePeek.
+type StagePeekAttr func(optionalAttr)
 
-// SpaceToDepthDataFormat sets the optional data_format attribute to value.
-// If not specified, defaults to "NHWC"
-func SpaceToDepthDataFormat(value string) SpaceToDepthAttr {
+// StagePeekCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func StagePeekCapacity(value int64) StagePeekAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["capacity"] = value
 	}
 }
 
-// SpaceToDepth for tensors of type T.
-//
-// Rearranges blocks of spatial data, into depth. More specifically,
-// this op outputs a copy of the input tensor where values from the `height`
-// and `width` dimensions are moved to the `depth` dimension.
-// The attr `block_size` indicates the input block size.
-//
-//   * Non-overlapping blocks of size `block_size x block size` are rearranged
-//     into depth at each location.
-//   * The depth of the output tensor is `block_size * block_size * input_depth`.
-//   * The Y, X coordinates within each block of the input become the high order
-//     component of the output channel index.
-//   * The input tensor's height and width must be divisible by block_size.
-//
-// The `data_format` attr specifies the layout of the input and output tensors
-// with the following options:
-//   "NHWC": `[ batch, height, width, channels ]`
-//   "NCHW": `[ batch, channels, height, width ]`
-//   "NCHW_VECT_C":
-//       `qint8 [ batch, channels / 4, height, width, 4 ]`
-//
-// It is useful to consider the operation as transforming a 6-D Tensor.
-// e.g. for data_format = NHWC,
-//      Each element in the input tensor can be specified via 6 coordinates,
-//      ordered by decreasing memory layout significance as:
-//      n,oY,bY,oX,bX,iC  (where n=batch index, oX, oY means X or Y coordinates
-//                         within the output image, bX, bY means coordinates
-//                         within the input block, iC means input channels).
-//      The output would be a transpose to the following layout:
-//      n,oY,oX,bY,bX,iC
-//
-// This operation is useful for resizing the activations between convolutions
-// (but keeping all data), e.g. instead of pooling. It is also useful for training
-// purely convolutional models.
-//
-// For example, given an input of shape `[1, 2, 2, 1]`, data_format = "NHWC" and
-// block_size = 2:
-//
-// ```
-// x = [[[[1], [2]],
-//       [[3], [4]]]]
-// ```
-//
-// This operation will output a tensor of shape `[1, 1, 1, 4]`:
-//
-// ```
-// [[[[1, 2, 3, 4]]]]
-// ```
-//
-// Here, the input has a batch of 1 and each batch element has shape `[2, 2, 1]`,
-// the corresponding output will have a single element (i.e. width and height are
-// both 1) and will have a depth of 4 channels (1 * block_size * block_size).
-// The output element shape is `[1, 1, 4]`.
-//
-// For an input tensor with larger depth, here of shape `[1, 2, 2, 3]`, e.g.
-//
-// ```
-// x = [[[[1, 2, 3], [4, 5, 6]],
-//       [[7, 8, 9], [10, 11, 12]]]]
-// ```
-//
-// This operation, for block_size of 2, will return the following tensor of shape
-// `[1, 1, 1, 12]`
-//
-// ```
-// [[[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]]]
-// ```
-//
-// Similarly, for the following input of shape `[1 4 4 1]`, and a block size of 2:
-//
-// ```
-// x = [[[[1],   [2],  [5],  [6]],
-//       [[3],   [4],  [7],  [8]],
-//       [[9],  [10], [13],  [14]],
-//       [[11], [12], [15],  [16]]]]
-// ```
-//
-// the operator will return the following tensor of shape `[1 2 2 4]`:
-//
-// ```
-// x = [[[[1, 2, 3, 4],
-//        [5, 6, 7, 8]],
-//       [[9, 10, 11, 12],
-//        [13, 14, 15, 16]]]]
-// ```
+// StagePeekMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Arguments:
+// REQUIRES: value >= 0
+func StagePeekMemoryLimit(value int64) StagePeekAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// StagePeekContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func StagePeekContainer(value string) StagePeekAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// StagePeekSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func StagePeekSharedName(value string) StagePeekAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op peeks at the values at the specified index.  If the
 //
-//	block_size: The size of the spatial block.
-func SpaceToDepth(scope *Scope, input tf.Output, block_size int64, optional ...SpaceToDepthAttr) (output tf.Output) {
+// underlying container does not contain sufficient elements,
+// this op will block until it does. This Op is optimized for
+// performance.
+func StagePeek(scope *Scope, index tf.Output, dtypes []tf.DataType, optional ...StagePeekAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"block_size": block_size}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SpaceToDepth",
+		Type: "StagePeek",
 		Input: []tf.Input{
-			input,
+			index,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("StagePeek", err)
+		return
+	}
+	return values
 }
 
-// AbortAttr is an optional argument to Abort.
-type AbortAttr func(optionalAttr)
+// MapStageAttr is an optional argument to MapStage.
+type MapStageAttr func(optionalAttr)
 
-// AbortErrorMsg sets the optional error_msg attribute to value.
+// MapStageCapacity sets the optional capacity attribute to value.
 //
-// value: A string which is the message associated with the exception.
-// If not specified, defaults to ""
-func AbortErrorMsg(value string) AbortAttr {
+// value: Maximum number of elements in the Staging Area. If > 0, inserts
+// on the container will block when the capacity is reached.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func MapStageCapacity(value int64) MapStageAttr {
 	return func(m optionalAttr) {
-		m["error_msg"] = value
+		m["capacity"] = value
 	}
 }
 
-// AbortExitWithoutError sets the optional exit_without_error attribute to value.
-// If not specified, defaults to false
-func AbortExitWithoutError(value bool) AbortAttr {
+// MapStageMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func MapStageMemoryLimit(value int64) MapStageAttr {
 	return func(m optionalAttr) {
-		m["exit_without_error"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// Raise a exception to abort the process when called.
+// MapStageContainer sets the optional container attribute to value.
 //
-// If exit_without_error is true, the process will exit normally,
-// otherwise it will exit with a SIGABORT signal.
+// value: If non-empty, this queue is placed in the given container. Otherwise,
+// a default container is used.
+// If not specified, defaults to ""
+func MapStageContainer(value string) MapStageAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// MapStageSharedName sets the optional shared_name attribute to value.
+//
+// value: This name must match the shared_name of the corresponding Unstage Op.
+// If not specified, defaults to ""
+func MapStageSharedName(value string) MapStageAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Stage (key, values) in the underlying container, which behaves like a hashtable.
+//
+// Arguments:
+//	key: int64
+//
+//	values: a list of tensors
+//	dtypes: A list of data types that inserted values should adhere to.
 //
-// Returns nothing but an exception.
 //
 // Returns the created operation.
-func Abort(scope *Scope, optional ...AbortAttr) (o *tf.Operation) {
+func MapStage(scope *Scope, key tf.Output, indices tf.Output, values []tf.Output, dtypes []tf.DataType, optional ...MapStageAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Abort",
-
+		Type: "MapStage",
+		Input: []tf.Input{
+			key, indices, tf.OutputList(values),
+		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// UniformCandidateSamplerAttr is an optional argument to UniformCandidateSampler.
-type UniformCandidateSamplerAttr func(optionalAttr)
+// MapUnstageAttr is an optional argument to MapUnstage.
+type MapUnstageAttr func(optionalAttr)
 
-// UniformCandidateSamplerSeed sets the optional seed attribute to value.
-//
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
+// MapUnstageCapacity sets the optional capacity attribute to value.
 // If not specified, defaults to 0
-func UniformCandidateSamplerSeed(value int64) UniformCandidateSamplerAttr {
+//
+// REQUIRES: value >= 0
+func MapUnstageCapacity(value int64) MapUnstageAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["capacity"] = value
 	}
 }
 
-// UniformCandidateSamplerSeed2 sets the optional seed2 attribute to value.
-//
-// value: An second seed to avoid seed collision.
+// MapUnstageMemoryLimit sets the optional memory_limit attribute to value.
 // If not specified, defaults to 0
-func UniformCandidateSamplerSeed2(value int64) UniformCandidateSamplerAttr {
+//
+// REQUIRES: value >= 0
+func MapUnstageMemoryLimit(value int64) MapUnstageAttr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// Generates labels for candidate sampling with a uniform distribution.
-//
-// See explanations of candidate sampling and the data formats at
-// go/candidate-sampling.
-//
-// For each batch, this op picks a single set of sampled candidate labels.
-//
-// The advantages of sampling candidates per-batch are simplicity and the
-// possibility of efficient dense matrix multiplication. The disadvantage is that
-// the sampled candidates must be chosen independently of the context and of the
-// true labels.
-//
-// Arguments:
-//	true_classes: A batch_size * num_true matrix, in which each row contains the
-// IDs of the num_true target_classes in the corresponding original label.
-//	num_true: Number of true labels per context.
-//	num_sampled: Number of candidates to randomly sample.
-//	unique: If unique is true, we sample with rejection, so that all sampled
-// candidates in a batch are unique. This requires some approximation to
-// estimate the post-rejection sampling probabilities.
-//	range_max: The sampler will sample integers from the interval [0, range_max).
+// MapUnstageContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func MapUnstageContainer(value string) MapUnstageAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// MapUnstageSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func MapUnstageSharedName(value string) MapUnstageAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op removes and returns the values associated with the key
 //
-// Returns A vector of length num_sampled, in which each element is
-// the ID of a sampled candidate.A batch_size * num_true matrix, representing
-// the number of times each candidate is expected to occur in a batch
-// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
-// candidate representing the number of times the candidate is expected
-// to occur in a batch of sampled candidates.  If unique=true, then this is a
-// probability.
-func UniformCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...UniformCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
+// from the underlying container. If the underlying container
+// does not contain this key, the op will block until it does.
+func MapUnstage(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...MapUnstageAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "UniformCandidateSampler",
+		Type: "MapUnstage",
 		Input: []tf.Input{
-			true_classes,
+			key, indices,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("MapUnstage", err)
+		return
+	}
+	return values
 }
 
-// FixedUnigramCandidateSamplerAttr is an optional argument to FixedUnigramCandidateSampler.
-type FixedUnigramCandidateSamplerAttr func(optionalAttr)
+// MapSizeAttr is an optional argument to MapSize.
+type MapSizeAttr func(optionalAttr)
 
-// FixedUnigramCandidateSamplerVocabFile sets the optional vocab_file attribute to value.
+// MapSizeCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// value: Each valid line in this file (which should have a CSV-like format)
-// corresponds to a valid word ID. IDs are in sequential order, starting from
-// num_reserved_ids. The last entry in each line is expected to be a value
-// corresponding to the count or relative probability. Exactly one of vocab_file
-// and unigrams needs to be passed to this op.
-// If not specified, defaults to ""
-func FixedUnigramCandidateSamplerVocabFile(value string) FixedUnigramCandidateSamplerAttr {
+// REQUIRES: value >= 0
+func MapSizeCapacity(value int64) MapSizeAttr {
 	return func(m optionalAttr) {
-		m["vocab_file"] = value
+		m["capacity"] = value
 	}
 }
 
-// FixedUnigramCandidateSamplerDistortion sets the optional distortion attribute to value.
+// MapSizeMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// value: The distortion is used to skew the unigram probability distribution.
-// Each weight is first raised to the distortion's power before adding to the
-// internal unigram distribution. As a result, distortion = 1.0 gives regular
-// unigram sampling (as defined by the vocab file), and distortion = 0.0 gives
-// a uniform distribution.
-// If not specified, defaults to 1
-func FixedUnigramCandidateSamplerDistortion(value float32) FixedUnigramCandidateSamplerAttr {
+// REQUIRES: value >= 0
+func MapSizeMemoryLimit(value int64) MapSizeAttr {
 	return func(m optionalAttr) {
-		m["distortion"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// FixedUnigramCandidateSamplerNumReservedIds sets the optional num_reserved_ids attribute to value.
-//
-// value: Optionally some reserved IDs can be added in the range [0,
-// ..., num_reserved_ids) by the users. One use case is that a special unknown
-// word token is used as ID 0. These IDs will have a sampling probability of 0.
-// If not specified, defaults to 0
-func FixedUnigramCandidateSamplerNumReservedIds(value int64) FixedUnigramCandidateSamplerAttr {
+// MapSizeContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func MapSizeContainer(value string) MapSizeAttr {
 	return func(m optionalAttr) {
-		m["num_reserved_ids"] = value
+		m["container"] = value
 	}
 }
 
-// FixedUnigramCandidateSamplerNumShards sets the optional num_shards attribute to value.
-//
-// value: A sampler can be used to sample from a subset of the original range
-// in order to speed up the whole computation through parallelism. This parameter
-// (together with 'shard') indicates the number of partitions that are being
-// used in the overall computation.
-// If not specified, defaults to 1
-//
-// REQUIRES: value >= 1
-func FixedUnigramCandidateSamplerNumShards(value int64) FixedUnigramCandidateSamplerAttr {
+// MapSizeSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func MapSizeSharedName(value string) MapSizeAttr {
 	return func(m optionalAttr) {
-		m["num_shards"] = value
+		m["shared_name"] = value
 	}
 }
 
-// FixedUnigramCandidateSamplerShard sets the optional shard attribute to value.
-//
-// value: A sampler can be used to sample from a subset of the original range
-// in order to speed up the whole computation through parallelism. This parameter
-// (together with 'num_shards') indicates the particular partition number of a
-// sampler op, when partitioning is being used.
+// Op returns the number of elements in the underlying container.
+func MapSize(scope *Scope, dtypes []tf.DataType, optional ...MapSizeAttr) (size tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "MapSize",
+
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// MapIncompleteSizeAttr is an optional argument to MapIncompleteSize.
+type MapIncompleteSizeAttr func(optionalAttr)
+
+// MapIncompleteSizeCapacity sets the optional capacity attribute to value.
 // If not specified, defaults to 0
 //
 // REQUIRES: value >= 0
-func FixedUnigramCandidateSamplerShard(value int64) FixedUnigramCandidateSamplerAttr {
+func MapIncompleteSizeCapacity(value int64) MapIncompleteSizeAttr {
 	return func(m optionalAttr) {
-		m["shard"] = value
+		m["capacity"] = value
 	}
 }
 
-// FixedUnigramCandidateSamplerUnigrams sets the optional unigrams attribute to value.
+// MapIncompleteSizeMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// value: A list of unigram counts or probabilities, one per ID in sequential
-// order. Exactly one of vocab_file and unigrams should be passed to this op.
-// If not specified, defaults to <>
-func FixedUnigramCandidateSamplerUnigrams(value []float32) FixedUnigramCandidateSamplerAttr {
+// REQUIRES: value >= 0
+func MapIncompleteSizeMemoryLimit(value int64) MapIncompleteSizeAttr {
 	return func(m optionalAttr) {
-		m["unigrams"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// FixedUnigramCandidateSamplerSeed sets the optional seed attribute to value.
-//
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func FixedUnigramCandidateSamplerSeed(value int64) FixedUnigramCandidateSamplerAttr {
+// MapIncompleteSizeContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func MapIncompleteSizeContainer(value string) MapIncompleteSizeAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["container"] = value
 	}
 }
 
-// FixedUnigramCandidateSamplerSeed2 sets the optional seed2 attribute to value.
-//
-// value: An second seed to avoid seed collision.
-// If not specified, defaults to 0
-func FixedUnigramCandidateSamplerSeed2(value int64) FixedUnigramCandidateSamplerAttr {
+// MapIncompleteSizeSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func MapIncompleteSizeSharedName(value string) MapIncompleteSizeAttr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["shared_name"] = value
 	}
 }
 
-// Generates labels for candidate sampling with a learned unigram distribution.
-//
-// A unigram sampler could use a fixed unigram distribution read from a
-// file or passed in as an in-memory array instead of building up the distribution
-// from data on the fly. There is also an option to skew the distribution by
-// applying a distortion power to the weights.
-//
-// The vocabulary file should be in CSV-like format, with the last field
-// being the weight associated with the word.
-//
-// For each batch, this op picks a single set of sampled candidate labels.
-//
-// The advantages of sampling candidates per-batch are simplicity and the
-// possibility of efficient dense matrix multiplication. The disadvantage is that
-// the sampled candidates must be chosen independently of the context and of the
-// true labels.
-//
-// Arguments:
-//	true_classes: A batch_size * num_true matrix, in which each row contains the
-// IDs of the num_true target_classes in the corresponding original label.
-//	num_true: Number of true labels per context.
-//	num_sampled: Number of candidates to randomly sample.
-//	unique: If unique is true, we sample with rejection, so that all sampled
-// candidates in a batch are unique. This requires some approximation to
-// estimate the post-rejection sampling probabilities.
-//	range_max: The sampler will sample integers from the interval [0, range_max).
-//
-// Returns A vector of length num_sampled, in which each element is
-// the ID of a sampled candidate.A batch_size * num_true matrix, representing
-// the number of times each candidate is expected to occur in a batch
-// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
-// candidate representing the number of times the candidate is expected
-// to occur in a batch of sampled candidates.  If unique=true, then this is a
-// probability.
-func FixedUnigramCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...FixedUnigramCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
+// Op returns the number of incomplete elements in the underlying container.
+func MapIncompleteSize(scope *Scope, dtypes []tf.DataType, optional ...MapIncompleteSizeAttr) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FixedUnigramCandidateSampler",
-		Input: []tf.Input{
-			true_classes,
-		},
+		Type: "MapIncompleteSize",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// Elementwise computes the bitwise AND of `x` and `y`.
+// OrderedMapUnstageAttr is an optional argument to OrderedMapUnstage.
+type OrderedMapUnstageAttr func(optionalAttr)
+
+// OrderedMapUnstageCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// The result will have those bits set, that are set in both `x` and `y`. The
-// computation is performed on the underlying representations of `x` and `y`.
-func BitwiseAnd(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "BitwiseAnd",
-		Input: []tf.Input{
-			x, y,
-		},
+// REQUIRES: value >= 0
+func OrderedMapUnstageCapacity(value int64) OrderedMapUnstageAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Elementwise computes the bitwise left-shift of `x` and `y`.
+// OrderedMapUnstageMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// If `y` is negative, or greater than or equal to the width of `x` in bits the
-// result is implementation defined.
-func LeftShift(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
+// REQUIRES: value >= 0
+func OrderedMapUnstageMemoryLimit(value int64) OrderedMapUnstageAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "LeftShift",
-		Input: []tf.Input{
-			x, y,
-		},
+}
+
+// OrderedMapUnstageContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func OrderedMapUnstageContainer(value string) OrderedMapUnstageAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Elementwise computes the bitwise right-shift of `x` and `y`.
-//
-// Performs a logical shift for unsigned integer types, and an arithmetic shift
-// for signed integer types.
+// OrderedMapUnstageSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func OrderedMapUnstageSharedName(value string) OrderedMapUnstageAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op removes and returns the values associated with the key
 //
-// If `y` is negative, or greater than or equal to than the width of `x` in bits
-// the result is implementation defined.
-func RightShift(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// from the underlying container. If the underlying container
+// does not contain this key, the op will block until it does.
+func OrderedMapUnstage(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...OrderedMapUnstageAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "RightShift",
+		Type: "OrderedMapUnstage",
 		Input: []tf.Input{
-			x, y,
+			key, indices,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("OrderedMapUnstage", err)
+		return
+	}
+	return values
 }
 
-// Adjust the hue of one or more images.
-//
-// `images` is a tensor of at least 3 dimensions.  The last dimension is
-// interpretted as channels, and must be three.
-//
-// The input image is considered in the RGB colorspace. Conceptually, the RGB
-// colors are first mapped into HSV. A delta is then applied all the hue values,
-// and then remapped back to RGB colorspace.
+// OrderedMapSizeAttr is an optional argument to OrderedMapSize.
+type OrderedMapSizeAttr func(optionalAttr)
+
+// OrderedMapSizeCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// Arguments:
-//	images: Images to adjust.  At least 3-D.
-//	delta: A float delta to add to the hue.
+// REQUIRES: value >= 0
+func OrderedMapSizeCapacity(value int64) OrderedMapSizeAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
+	}
+}
+
+// OrderedMapSizeMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Returns The hue-adjusted image or images.
-func AdjustHue(scope *Scope, images tf.Output, delta tf.Output) (output tf.Output) {
+// REQUIRES: value >= 0
+func OrderedMapSizeMemoryLimit(value int64) OrderedMapSizeAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// OrderedMapSizeContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func OrderedMapSizeContainer(value string) OrderedMapSizeAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// OrderedMapSizeSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func OrderedMapSizeSharedName(value string) OrderedMapSizeAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op returns the number of elements in the underlying container.
+func OrderedMapSize(scope *Scope, dtypes []tf.DataType, optional ...OrderedMapSizeAttr) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "AdjustHue",
-		Input: []tf.Input{
-			images, delta,
-		},
+		Type: "OrderedMapSize",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// AvgPool3DGradAttr is an optional argument to AvgPool3DGrad.
-type AvgPool3DGradAttr func(optionalAttr)
+// ShapeNAttr is an optional argument to ShapeN.
+type ShapeNAttr func(optionalAttr)
 
-// AvgPool3DGradDataFormat sets the optional data_format attribute to value.
-//
-// value: The data format of the input and output data. With the
-// default format "NDHWC", the data is stored in the order of:
-//     [batch, in_depth, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCDHW", the data storage order is:
-//     [batch, in_channels, in_depth, in_height, in_width].
-// If not specified, defaults to "NDHWC"
-func AvgPool3DGradDataFormat(value string) AvgPool3DGradAttr {
+// ShapeNOutType sets the optional out_type attribute to value.
+// If not specified, defaults to DT_INT32
+func ShapeNOutType(value tf.DataType) ShapeNAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["out_type"] = value
 	}
 }
 
-// Computes gradients of average pooling function.
-//
-// Arguments:
-//	orig_input_shape: The original input dimensions.
-//	grad: Output backprop of shape `[batch, depth, rows, cols, channels]`.
-//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
-// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
+// Returns shape of tensors.
 //
-// Returns The backprop for input.
-func AvgPool3DGrad(scope *Scope, orig_input_shape tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPool3DGradAttr) (output tf.Output) {
+// This operation returns N 1-D integer tensors, where the i-th tensor represents the shape of `input[i]`.
+func ShapeN(scope *Scope, input []tf.Output, optional ...ShapeNAttr) (output []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "AvgPool3DGrad",
+		Type: "ShapeN",
 		Input: []tf.Input{
-			orig_input_shape, grad,
+			tf.OutputList(input),
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
+		scope.UpdateErr("ShapeN", err)
+		return
+	}
+	return output
 }
 
-// ParseSingleSequenceExampleAttr is an optional argument to ParseSingleSequenceExample.
-type ParseSingleSequenceExampleAttr func(optionalAttr)
+// UniformCandidateSamplerAttr is an optional argument to UniformCandidateSampler.
+type UniformCandidateSamplerAttr func(optionalAttr)
 
-// ParseSingleSequenceExampleContextSparseTypes sets the optional context_sparse_types attribute to value.
+// UniformCandidateSamplerSeed sets the optional seed attribute to value.
 //
-// value: A list of Ncontext_sparse types; the data types of data in
-// each context Feature given in context_sparse_keys.
-// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
-// DT_INT64 (Int64List), and DT_STRING (BytesList).
-// If not specified, defaults to <>
-//
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleContextSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
-	return func(m optionalAttr) {
-		m["context_sparse_types"] = value
-	}
-}
-
-// ParseSingleSequenceExampleFeatureListDenseTypes sets the optional feature_list_dense_types attribute to value.
-// If not specified, defaults to <>
-//
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleFeatureListDenseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+// value: If either seed or seed2 is set to a non-zero value, the random number
+// generator is seeded by the given seed. Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func UniformCandidateSamplerSeed(value int64) UniformCandidateSamplerAttr {
 	return func(m optionalAttr) {
-		m["feature_list_dense_types"] = value
+		m["seed"] = value
 	}
 }
 
-// ParseSingleSequenceExampleContextDenseShapes sets the optional context_dense_shapes attribute to value.
-//
-// value: A list of Ncontext_dense shapes; the shapes of data in
-// each context Feature given in context_dense_keys.
-// The number of elements in the Feature corresponding to context_dense_key[j]
-// must always equal context_dense_shapes[j].NumEntries().
-// The shape of context_dense_values[j] will match context_dense_shapes[j].
-// If not specified, defaults to <>
+// UniformCandidateSamplerSeed2 sets the optional seed2 attribute to value.
 //
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleContextDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func UniformCandidateSamplerSeed2(value int64) UniformCandidateSamplerAttr {
 	return func(m optionalAttr) {
-		m["context_dense_shapes"] = value
+		m["seed2"] = value
 	}
 }
 
-// ParseSingleSequenceExampleFeatureListSparseTypes sets the optional feature_list_sparse_types attribute to value.
-//
-// value: A list of Nfeature_list_sparse types; the data types
-// of data in each FeatureList given in feature_list_sparse_keys.
-// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
-// DT_INT64 (Int64List), and DT_STRING (BytesList).
-// If not specified, defaults to <>
+// Generates labels for candidate sampling with a uniform distribution.
 //
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleFeatureListSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
-	return func(m optionalAttr) {
-		m["feature_list_sparse_types"] = value
-	}
-}
-
-// ParseSingleSequenceExampleFeatureListDenseShapes sets the optional feature_list_dense_shapes attribute to value.
+// See explanations of candidate sampling and the data formats at
+// go/candidate-sampling.
 //
-// value: A list of Nfeature_list_dense shapes; the shapes of
-// data in each FeatureList given in feature_list_dense_keys.
-// The shape of each Feature in the FeatureList corresponding to
-// feature_list_dense_key[j] must always equal
-// feature_list_dense_shapes[j].NumEntries().
-// If not specified, defaults to <>
+// For each batch, this op picks a single set of sampled candidate labels.
 //
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleFeatureListDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
-	return func(m optionalAttr) {
-		m["feature_list_dense_shapes"] = value
-	}
-}
-
-// Transforms a scalar brain.SequenceExample proto (as strings) into typed tensors.
+// The advantages of sampling candidates per-batch are simplicity and the
+// possibility of efficient dense matrix multiplication. The disadvantage is that
+// the sampled candidates must be chosen independently of the context and of the
+// true labels.
 //
 // Arguments:
-//	serialized: A scalar containing a binary serialized SequenceExample proto.
-//	feature_list_dense_missing_assumed_empty: A vector listing the
-// FeatureList keys which may be missing from the SequenceExample.  If the
-// associated FeatureList is missing, it is treated as empty.  By default,
-// any FeatureList not listed in this vector must exist in the SequenceExample.
-//	context_sparse_keys: A list of Ncontext_sparse string Tensors (scalars).
-// The keys expected in the Examples' features associated with context_sparse
-// values.
-//	context_dense_keys: A list of Ncontext_dense string Tensors (scalars).
-// The keys expected in the SequenceExamples' context features associated with
-// dense values.
-//	feature_list_sparse_keys: A list of Nfeature_list_sparse string Tensors
-// (scalars).  The keys expected in the FeatureLists associated with sparse
-// values.
-//	feature_list_dense_keys: A list of Nfeature_list_dense string Tensors (scalars).
-// The keys expected in the SequenceExamples' feature_lists associated
-// with lists of dense values.
-//	context_dense_defaults: A list of Ncontext_dense Tensors (some may be empty).
-// context_dense_defaults[j] provides default values
-// when the SequenceExample's context map lacks context_dense_key[j].
-// If an empty Tensor is provided for context_dense_defaults[j],
-// then the Feature context_dense_keys[j] is required.
-// The input type is inferred from context_dense_defaults[j], even when it's
-// empty.  If context_dense_defaults[j] is not empty, its shape must match
-// context_dense_shapes[j].
-//	debug_name: A scalar containing the name of the serialized proto.
-// May contain, for example, table key (descriptive) name for the
-// corresponding serialized proto.  This is purely useful for debugging
-// purposes, and the presence of values here has no effect on the output.
-// May also be an empty scalar if no name is available.
-func ParseSingleSequenceExample(scope *Scope, serialized tf.Output, feature_list_dense_missing_assumed_empty tf.Output, context_sparse_keys []tf.Output, context_dense_keys []tf.Output, feature_list_sparse_keys []tf.Output, feature_list_dense_keys []tf.Output, context_dense_defaults []tf.Output, debug_name tf.Output, optional ...ParseSingleSequenceExampleAttr) (context_sparse_indices []tf.Output, context_sparse_values []tf.Output, context_sparse_shapes []tf.Output, context_dense_values []tf.Output, feature_list_sparse_indices []tf.Output, feature_list_sparse_values []tf.Output, feature_list_sparse_shapes []tf.Output, feature_list_dense_values []tf.Output) {
+//	true_classes: A batch_size * num_true matrix, in which each row contains the
+// IDs of the num_true target_classes in the corresponding original label.
+//	num_true: Number of true labels per context.
+//	num_sampled: Number of candidates to randomly sample.
+//	unique: If unique is true, we sample with rejection, so that all sampled
+// candidates in a batch are unique. This requires some approximation to
+// estimate the post-rejection sampling probabilities.
+//	range_max: The sampler will sample integers from the interval [0, range_max).
+//
+// Returns A vector of length num_sampled, in which each element is
+// the ID of a sampled candidate. A batch_size * num_true matrix, representing
+// the number of times each candidate is expected to occur in a batch
+// of sampled candidates. If unique=true, then this is a probability. A vector of length num_sampled, for each sampled
+// candidate representing the number of times the candidate is expected
+// to occur in a batch of sampled candidates.  If unique=true, then this is a
+// probability.
+func UniformCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...UniformCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ParseSingleSequenceExample",
+		Type: "UniformCandidateSampler",
 		Input: []tf.Input{
-			serialized, feature_list_dense_missing_assumed_empty, tf.OutputList(context_sparse_keys), tf.OutputList(context_dense_keys), tf.OutputList(feature_list_sparse_keys), tf.OutputList(feature_list_dense_keys), tf.OutputList(context_dense_defaults), debug_name,
+			true_classes,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if context_sparse_indices, idx, err = makeOutputList(op, idx, "context_sparse_indices"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if context_sparse_values, idx, err = makeOutputList(op, idx, "context_sparse_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if context_sparse_shapes, idx, err = makeOutputList(op, idx, "context_sparse_shapes"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if context_dense_values, idx, err = makeOutputList(op, idx, "context_dense_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if feature_list_sparse_indices, idx, err = makeOutputList(op, idx, "feature_list_sparse_indices"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if feature_list_sparse_values, idx, err = makeOutputList(op, idx, "feature_list_sparse_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if feature_list_sparse_shapes, idx, err = makeOutputList(op, idx, "feature_list_sparse_shapes"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if feature_list_dense_values, idx, err = makeOutputList(op, idx, "feature_list_dense_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	return context_sparse_indices, context_sparse_values, context_sparse_shapes, context_dense_values, feature_list_sparse_indices, feature_list_sparse_values, feature_list_sparse_shapes, feature_list_dense_values
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// DecodeWavAttr is an optional argument to DecodeWav.
-type DecodeWavAttr func(optionalAttr)
+// CTCLossAttr is an optional argument to CTCLoss.
+type CTCLossAttr func(optionalAttr)
 
-// DecodeWavDesiredChannels sets the optional desired_channels attribute to value.
+// CTCLossPreprocessCollapseRepeated sets the optional preprocess_collapse_repeated attribute to value.
 //
-// value: Number of sample channels wanted.
-// If not specified, defaults to -1
-func DecodeWavDesiredChannels(value int64) DecodeWavAttr {
+// value: Scalar, if true then repeated labels are
+// collapsed prior to the CTC calculation.
+// If not specified, defaults to false
+func CTCLossPreprocessCollapseRepeated(value bool) CTCLossAttr {
 	return func(m optionalAttr) {
-		m["desired_channels"] = value
+		m["preprocess_collapse_repeated"] = value
 	}
 }
 
-// DecodeWavDesiredSamples sets the optional desired_samples attribute to value.
+// CTCLossCtcMergeRepeated sets the optional ctc_merge_repeated attribute to value.
 //
-// value: Length of audio requested.
-// If not specified, defaults to -1
-func DecodeWavDesiredSamples(value int64) DecodeWavAttr {
+// value: Scalar.  If set to false, *during* CTC calculation
+// repeated non-blank labels will not be merged and are interpreted as
+// individual labels.  This is a simplified version of CTC.
+// If not specified, defaults to true
+func CTCLossCtcMergeRepeated(value bool) CTCLossAttr {
 	return func(m optionalAttr) {
-		m["desired_samples"] = value
+		m["ctc_merge_repeated"] = value
 	}
 }
 
-// Decode a 16-bit PCM WAV file to a float tensor.
-//
-// The -32768 to 32767 signed 16-bit values will be scaled to -1.0 to 1.0 in float.
-//
-// When desired_channels is set, if the input contains fewer channels than this
-// then the last channel will be duplicated to give the requested number, else if
-// the input has more channels than requested then the additional channels will be
-// ignored.
+// CTCLossIgnoreLongerOutputsThanInputs sets the optional ignore_longer_outputs_than_inputs attribute to value.
 //
-// If desired_samples is set, then the audio will be cropped or padded with zeroes
-// to the requested length.
+// value: Scalar. If set to true, during CTC
+// calculation, items that have longer output sequences than input sequences
+// are skipped: they don't contribute to the loss term and have zero-gradient.
+// If not specified, defaults to false
+func CTCLossIgnoreLongerOutputsThanInputs(value bool) CTCLossAttr {
+	return func(m optionalAttr) {
+		m["ignore_longer_outputs_than_inputs"] = value
+	}
+}
+
+// Calculates the CTC Loss (log probability) for each batch entry.
 //
-// The first output contains a Tensor with the content of the audio samples. The
-// lowest dimension will be the number of channels, and the second will be the
-// number of samples. For example, a ten-sample-long stereo WAV file should give an
-// output shape of [10, 2].
+// Also calculates the gradient.  This op performs the softmax operation for you, so inputs
+// should be e.g. linear projections of outputs by an LSTM.
 //
 // Arguments:
-//	contents: The WAV-encoded audio, usually from a file.
+//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
+//	labels_indices: The indices of a `SparseTensor<int32, 2>`.
+// `labels_indices(i, :) == [b, t]` means `labels_values(i)` stores the id for
+// `(batch b, time t)`.
+//	labels_values: The values (labels) associated with the given batch and time.
+//	sequence_length: A vector containing sequence lengths (batch).
 //
-// Returns 2-D with shape `[length, channels]`.Scalar holding the sample rate found in the WAV header.
-func DecodeWav(scope *Scope, contents tf.Output, optional ...DecodeWavAttr) (audio tf.Output, sample_rate tf.Output) {
+// Returns A vector (batch) containing log-probabilities. The gradient of `loss`.  3-D, shape:
+// `(max_time x batch_size x num_classes)`.
+func CTCLoss(scope *Scope, inputs tf.Output, labels_indices tf.Output, labels_values tf.Output, sequence_length tf.Output, optional ...CTCLossAttr) (loss tf.Output, gradient tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -25763,9 +25717,9 @@ func DecodeWav(scope *Scope, contents tf.Output, optional ...DecodeWavAttr) (aud
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodeWav",
+		Type: "CTCLoss",
 		Input: []tf.Input{
-			contents,
+			inputs, labels_indices, labels_values, sequence_length,
 		},
 		Attrs: attrs,
 	}
@@ -25773,40 +25727,41 @@ func DecodeWav(scope *Scope, contents tf.Output, optional ...DecodeWavAttr) (aud
 	return op.Output(0), op.Output(1)
 }
 
-// UniqueAttr is an optional argument to Unique.
-type UniqueAttr func(optionalAttr)
+// CTCGreedyDecoderAttr is an optional argument to CTCGreedyDecoder.
+type CTCGreedyDecoderAttr func(optionalAttr)
 
-// UniqueOutIdx sets the optional out_idx attribute to value.
-// If not specified, defaults to DT_INT32
-func UniqueOutIdx(value tf.DataType) UniqueAttr {
+// CTCGreedyDecoderMergeRepeated sets the optional merge_repeated attribute to value.
+//
+// value: If True, merge repeated classes in output.
+// If not specified, defaults to false
+func CTCGreedyDecoderMergeRepeated(value bool) CTCGreedyDecoderAttr {
 	return func(m optionalAttr) {
-		m["out_idx"] = value
+		m["merge_repeated"] = value
 	}
 }
 
-// Finds unique elements in a 1-D tensor.
+// Performs greedy decoding on the logits given in inputs.
 //
-// This operation returns a tensor `y` containing all of the unique elements of `x`
-// sorted in the same order that they occur in `x`. This operation also returns a
-// tensor `idx` the same size as `x` that contains the index of each value of `x`
-// in the unique output `y`. In other words:
-//
-// `y[idx[i]] = x[i] for i in [0, 1,...,rank(x) - 1]`
-//
-// For example:
+// A note about the attribute merge_repeated: if enabled, when
+// consecutive logits' maximum indices are the same, only the first of
+// these is emitted.  Labeling the blank '*', the sequence "A B B * B B"
+// becomes "A B B" if merge_repeated = True and "A B B B B" if
+// merge_repeated = False.
 //
-// ```
-// # tensor 'x' is [1, 1, 2, 4, 4, 4, 7, 8, 8]
-// y, idx = unique(x)
-// y ==> [1, 2, 4, 7, 8]
-// idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
-// ```
+// Regardless of the value of merge_repeated, if the maximum index of a given
+// time and batch corresponds to the blank, index `(num_classes - 1)`, no new
+// element is emitted.
 //
 // Arguments:
-//	x: 1-D.
+//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
+//	sequence_length: A vector containing sequence lengths, size `(batch_size)`.
 //
-// Returns 1-D.1-D.
-func Unique(scope *Scope, x tf.Output, optional ...UniqueAttr) (y tf.Output, idx tf.Output) {
+// Returns Indices matrix, size `(total_decoded_outputs x 2)`,
+// of a `SparseTensor<int64, 2>`.  The rows store: [batch, time]. Values vector, size: `(total_decoded_outputs)`,
+// of a `SparseTensor<int64, 2>`.  The vector stores the decoded classes. Shape vector, size `(2)`, of the decoded SparseTensor.
+// Values are: `[batch_size, max_decoded_length]`. Matrix, size `(batch_size x 1)`, containing sequence
+// log-probabilities.
+func CTCGreedyDecoder(scope *Scope, inputs tf.Output, sequence_length tf.Output, optional ...CTCGreedyDecoderAttr) (decoded_indices tf.Output, decoded_values tf.Output, decoded_shape tf.Output, log_probability tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -25815,530 +25770,619 @@ func Unique(scope *Scope, x tf.Output, optional ...UniqueAttr) (y tf.Output, idx
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Unique",
+		Type: "CTCGreedyDecoder",
 		Input: []tf.Input{
-			x,
+			inputs, sequence_length,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3)
 }
 
-// Concatenates a list of `N` tensors along the first dimension.
-//
-// The input tensors are all required to have size 1 in the first dimension.
-//
-// For example:
+// Forwards `data` to the output port determined by `pred`.
 //
-// ```
-// # 'x' is [[1, 4]]
-// # 'y' is [[2, 5]]
-// # 'z' is [[3, 6]]
-// parallel_concat([x, y, z]) => [[1, 4], [2, 5], [3, 6]]  # Pack along first dim.
-// ```
+// If `pred` is true, the `data` input is forwarded to `output_true`. Otherwise,
+// the data goes to `output_false`.
 //
-// The difference between concat and parallel_concat is that concat requires all
-// of the inputs be computed before the operation will begin but doesn't require
-// that the input shapes be known during graph construction.  Parallel concat
-// will copy pieces of the input into the output as they become available, in
-// some situations this can provide a performance benefit.
+// See also `RefSwitch` and `Merge`.
 //
 // Arguments:
-//	values: Tensors to be concatenated. All must have size 1 in the first dimension
-// and same shape.
-//	shape: the final shape of the result; should be equal to the shapes of any input
-// but with the number of input values in the first dimension.
+//	data: The tensor to be forwarded to the appropriate output.
+//	pred: A scalar that specifies which output port will receive data.
 //
-// Returns The concatenated tensor.
-func ParallelConcat(scope *Scope, values []tf.Output, shape tf.Shape) (output tf.Output) {
+// Returns If `pred` is false, data will be forwarded to this output. If `pred` is true, data will be forwarded to this output.
+func Switch(scope *Scope, data tf.Output, pred tf.Output) (output_false tf.Output, output_true tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"shape": shape}
 	opspec := tf.OpSpec{
-		Type: "ParallelConcat",
+		Type: "Switch",
 		Input: []tf.Input{
-			tf.OutputList(values),
+			data, pred,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Concatenates tensors along one dimension.
+// Adds all input tensors element-wise.
 //
 // Arguments:
-//	concat_dim: 0-D.  The dimension along which to concatenate.  Must be in the
-// range [0, rank(values)).
-//	values: The `N` Tensors to concatenate. Their ranks and types must match,
-// and their sizes must match in all dimensions except `concat_dim`.
-//
-// Returns A `Tensor` with the concatenation of values stacked along the
-// `concat_dim` dimension.  This tensor's shape matches that of `values` except
-// in `concat_dim` where it has the sum of the sizes.
-func Concat(scope *Scope, concat_dim tf.Output, values []tf.Output) (output tf.Output) {
+//	inputs: Must all be the same size and shape.
+func AddN(scope *Scope, inputs []tf.Output) (sum tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Concat",
+		Type: "AddN",
 		Input: []tf.Input{
-			concat_dim, tf.OutputList(values),
+			tf.OutputList(inputs),
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Compute the lower regularized incomplete Gamma function `Q(a, x)`.
-//
-// The lower regularized incomplete Gamma function is defined as:
-//
+// EnterAttr is an optional argument to Enter.
+type EnterAttr func(optionalAttr)
+
+// EnterIsConstant sets the optional is_constant attribute to value.
 //
-// \\(P(a, x) = gamma(a, x) / Gamma(a) = 1 - Q(a, x)\\)
+// value: If true, the output is constant within the child frame.
+// If not specified, defaults to false
+func EnterIsConstant(value bool) EnterAttr {
+	return func(m optionalAttr) {
+		m["is_constant"] = value
+	}
+}
+
+// EnterParallelIterations sets the optional parallel_iterations attribute to value.
 //
-// where
+// value: The number of iterations allowed to run in parallel.
+// If not specified, defaults to 10
+func EnterParallelIterations(value int64) EnterAttr {
+	return func(m optionalAttr) {
+		m["parallel_iterations"] = value
+	}
+}
+
+// Creates or finds a child frame, and makes `data` available to the child frame.
 //
-// \\(gamma(a, x) = int_{0}^{x} t^{a-1} exp(-t) dt\\)
+// This op is used together with `Exit` to create loops in the graph.
+// The unique `frame_name` is used by the `Executor` to identify frames. If
+// `is_constant` is true, `output` is a constant in the child frame; otherwise
+// it may be changed in the child frame. At most `parallel_iterations` iterations
+// are run in parallel in the child frame.
 //
-// is the lower incomplete Gamma function.
+// Arguments:
+//	data: The tensor to be made available to the child frame.
+//	frame_name: The name of the child frame.
 //
-// Note, above `Q(a, x)` (`Igammac`) is the upper regularized complete
-// Gamma function.
-func Igamma(scope *Scope, a tf.Output, x tf.Output) (z tf.Output) {
+// Returns The same tensor as `data`.
+func Enter(scope *Scope, data tf.Output, frame_name string, optional ...EnterAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"frame_name": frame_name}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Igamma",
+		Type: "Enter",
 		Input: []tf.Input{
-			a, x,
+			data,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes offsets of concat inputs within its output.
-//
-// For example:
-//
-// ```
-// # 'x' is [2, 2, 7]
-// # 'y' is [2, 3, 7]
-// # 'z' is [2, 5, 7]
-// concat_offset(2, [x, y, z]) => [0, 0, 0], [0, 2, 0], [0, 5, 0]
-// ```
+// Produce a string tensor that encodes the state of a Reader.
 //
-// This is typically used by gradient computations for a concat operation.
+// Not all Readers support being serialized, so this can produce an
+// Unimplemented error.
 //
 // Arguments:
-//	concat_dim: The dimension along which to concatenate.
-//	shape: The `N` int32 vectors representing shape of tensors being concatenated.
-//
-// Returns The `N` int32 vectors representing the starting offset
-// of input tensors within the concatenated output.
-func ConcatOffset(scope *Scope, concat_dim tf.Output, shape []tf.Output) (offset []tf.Output) {
+//	reader_handle: Handle to a Reader.
+func ReaderSerializeStateV2(scope *Scope, reader_handle tf.Output) (state tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ConcatOffset",
+		Type: "ReaderSerializeStateV2",
 		Input: []tf.Input{
-			concat_dim, tf.OutputList(shape),
+			reader_handle,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if offset, idx, err = makeOutputList(op, idx, "offset"); err != nil {
-		scope.UpdateErr("ConcatOffset", err)
-		return
-	}
-	return offset
+	return op.Output(0)
 }
 
-// Splits a tensor into `num_split` tensors along one dimension.
+// Exits the current frame to its parent frame.
+//
+// Exit makes its input `data` available to the parent frame.
 //
 // Arguments:
-//	axis: 0-D.  The dimension along which to split.  Must be in the range
-// `[-rank(value), rank(value))`.
-//	value: The tensor to split.
-//	num_split: The number of ways to split.  Must evenly divide
-// `value.shape[split_dim]`.
+//	data: The tensor to be made available to the parent frame.
 //
-// Returns They are identically shaped tensors, whose shape matches that of `value`
-// except along `axis`, where their sizes are
-// `values.shape[split_dim] / num_split`.
-func Split(scope *Scope, axis tf.Output, value tf.Output, num_split int64) (output []tf.Output) {
+// Returns The same tensor as `data`.
+func Exit(scope *Scope, data tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_split": num_split}
 	opspec := tf.OpSpec{
-		Type: "Split",
+		Type: "Exit",
 		Input: []tf.Input{
-			axis, value,
+			data,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("Split", err)
-		return
-	}
-	return output
+	return op.Output(0)
 }
 
-// Splits a tensor into `num_split` tensors along one dimension.
-//
-// Arguments:
-//	value: The tensor to split.
-//	size_splits: list containing the sizes of each output tensor along the split
-// dimension. Must sum to the dimension of value along split_dim.
-// Can contain one -1 indicating that dimension is to be inferred.
-//	axis: 0-D.  The dimension along which to split.  Must be in the range
-// `[-rank(value), rank(value))`.
-//
-//
-// Returns Tensors whose shape matches that of `value`
-// except along `axis`, where their sizes are
-// `size_splits[i]`.
-func SplitV(scope *Scope, value tf.Output, size_splits tf.Output, axis tf.Output, num_split int64) (output []tf.Output) {
+// Returns a copy of the input tensor.
+func Snapshot(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_split": num_split}
 	opspec := tf.OpSpec{
-		Type: "SplitV",
+		Type: "Snapshot",
 		Input: []tf.Input{
-			value, size_splits, axis,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("SplitV", err)
-		return
-	}
-	return output
+	return op.Output(0)
 }
 
-// Gives a guarantee to the TF runtime that the input tensor is a constant.
-//
-// The runtime is then free to make optimizations based on this.
+// AbortAttr is an optional argument to Abort.
+type AbortAttr func(optionalAttr)
+
+// AbortErrorMsg sets the optional error_msg attribute to value.
 //
-// Only accepts value typed tensors as inputs and rejects resource variable handles
-// as input.
+// value: A string which is the message associated with the exception.
+// If not specified, defaults to ""
+func AbortErrorMsg(value string) AbortAttr {
+	return func(m optionalAttr) {
+		m["error_msg"] = value
+	}
+}
+
+// AbortExitWithoutError sets the optional exit_without_error attribute to value.
+// If not specified, defaults to false
+func AbortExitWithoutError(value bool) AbortAttr {
+	return func(m optionalAttr) {
+		m["exit_without_error"] = value
+	}
+}
+
+// Raises an exception to abort the process when called.
 //
-// Returns the input tensor without modification.
-func GuaranteeConst(scope *Scope, input tf.Output) (output tf.Output) {
+// If exit_without_error is true, the process will exit normally,
+// otherwise it will exit with a SIGABRT signal.
+//
+// Returns nothing but an exception.
+//
+// Returns the created operation.
+func Abort(scope *Scope, optional ...AbortAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "GuaranteeConst",
+		Type: "Abort",
+
+		Attrs: attrs,
+	}
+	return scope.AddOperation(opspec)
+}
+
+// FixedUnigramCandidateSamplerAttr is an optional argument to FixedUnigramCandidateSampler.
+type FixedUnigramCandidateSamplerAttr func(optionalAttr)
+
+// FixedUnigramCandidateSamplerVocabFile sets the optional vocab_file attribute to value.
+//
+// value: Each valid line in this file (which should have a CSV-like format)
+// corresponds to a valid word ID. IDs are in sequential order, starting from
+// num_reserved_ids. The last entry in each line is expected to be a value
+// corresponding to the count or relative probability. Exactly one of vocab_file
+// and unigrams needs to be passed to this op.
+// If not specified, defaults to ""
+func FixedUnigramCandidateSamplerVocabFile(value string) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["vocab_file"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerDistortion sets the optional distortion attribute to value.
+//
+// value: The distortion is used to skew the unigram probability distribution.
+// Each weight is first raised to the distortion's power before adding to the
+// internal unigram distribution. As a result, distortion = 1.0 gives regular
+// unigram sampling (as defined by the vocab file), and distortion = 0.0 gives
+// a uniform distribution.
+// If not specified, defaults to 1
+func FixedUnigramCandidateSamplerDistortion(value float32) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["distortion"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerNumReservedIds sets the optional num_reserved_ids attribute to value.
+//
+// value: Optionally some reserved IDs can be added in the range [0,
+// ..., num_reserved_ids) by the users. One use case is that a special unknown
+// word token is used as ID 0. These IDs will have a sampling probability of 0.
+// If not specified, defaults to 0
+func FixedUnigramCandidateSamplerNumReservedIds(value int64) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["num_reserved_ids"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerNumShards sets the optional num_shards attribute to value.
+//
+// value: A sampler can be used to sample from a subset of the original range
+// in order to speed up the whole computation through parallelism. This parameter
+// (together with 'shard') indicates the number of partitions that are being
+// used in the overall computation.
+// If not specified, defaults to 1
+//
+// REQUIRES: value >= 1
+func FixedUnigramCandidateSamplerNumShards(value int64) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["num_shards"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerShard sets the optional shard attribute to value.
+//
+// value: A sampler can be used to sample from a subset of the original range
+// in order to speed up the whole computation through parallelism. This parameter
+// (together with 'num_shards') indicates the particular partition number of a
+// sampler op, when partitioning is being used.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func FixedUnigramCandidateSamplerShard(value int64) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["shard"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerUnigrams sets the optional unigrams attribute to value.
+//
+// value: A list of unigram counts or probabilities, one per ID in sequential
+// order. Exactly one of vocab_file and unigrams should be passed to this op.
+// If not specified, defaults to <>
+func FixedUnigramCandidateSamplerUnigrams(value []float32) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["unigrams"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerSeed sets the optional seed attribute to value.
+//
+// value: If either seed or seed2 is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func FixedUnigramCandidateSamplerSeed(value int64) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+//
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func FixedUnigramCandidateSamplerSeed2(value int64) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// Generates labels for candidate sampling with a learned unigram distribution.
+//
+// A unigram sampler could use a fixed unigram distribution read from a
+// file or passed in as an in-memory array instead of building up the distribution
+// from data on the fly. There is also an option to skew the distribution by
+// applying a distortion power to the weights.
+//
+// The vocabulary file should be in CSV-like format, with the last field
+// being the weight associated with the word.
+//
+// For each batch, this op picks a single set of sampled candidate labels.
+//
+// The advantages of sampling candidates per-batch are simplicity and the
+// possibility of efficient dense matrix multiplication. The disadvantage is that
+// the sampled candidates must be chosen independently of the context and of the
+// true labels.
+//
+// Arguments:
+//	true_classes: A batch_size * num_true matrix, in which each row contains the
+// IDs of the num_true target_classes in the corresponding original label.
+//	num_true: Number of true labels per context.
+//	num_sampled: Number of candidates to randomly sample.
+//	unique: If unique is true, we sample with rejection, so that all sampled
+// candidates in a batch are unique. This requires some approximation to
+// estimate the post-rejection sampling probabilities.
+//	range_max: The sampler will sample integers from the interval [0, range_max).
+//
+// Returns A vector of length num_sampled, in which each element is
+// the ID of a sampled candidate. A batch_size * num_true matrix, representing
+// the number of times each candidate is expected to occur in a batch
+// of sampled candidates. If unique=true, then this is a probability. A vector of length num_sampled, for each sampled
+// candidate representing the number of times the candidate is expected
+// to occur in a batch of sampled candidates.  If unique=true, then this is a
+// probability.
+func FixedUnigramCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...FixedUnigramCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "FixedUnigramCandidateSampler",
 		Input: []tf.Input{
-			input,
+			true_classes,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Returns a tensor of zeros with the same shape and type as x.
-//
-// Arguments:
-//	x: a tensor of type T.
+// Elementwise computes the bitwise AND of `x` and `y`.
 //
-// Returns a tensor of the same shape and type as x but filled with zeros.
-func ZerosLike(scope *Scope, x tf.Output) (y tf.Output) {
+// The result will have those bits set that are set in both `x` and `y`. The
+// computation is performed on the underlying representations of `x` and `y`.
+func BitwiseAnd(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ZerosLike",
+		Type: "BitwiseAnd",
 		Input: []tf.Input{
-			x,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Flips all bits elementwise.
+// Elementwise computes the bitwise left-shift of `x` and `y`.
 //
-// The result will have exactly those bits set, that are not set in `x`. The
-// computation is performed on the underlying representation of x.
-func Invert(scope *Scope, x tf.Output) (y tf.Output) {
+// If `y` is negative, or greater than or equal to the width of `x` in bits, the
+// result is implementation defined.
+func LeftShift(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Invert",
+		Type: "LeftShift",
 		Input: []tf.Input{
-			x,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// DequantizeAttr is an optional argument to Dequantize.
-type DequantizeAttr func(optionalAttr)
-
-// DequantizeMode sets the optional mode attribute to value.
-// If not specified, defaults to "MIN_COMBINED"
-func DequantizeMode(value string) DequantizeAttr {
-	return func(m optionalAttr) {
-		m["mode"] = value
+// Elementwise computes the bitwise right-shift of `x` and `y`.
+//
+// Performs a logical shift for unsigned integer types, and an arithmetic shift
+// for signed integer types.
+//
+// If `y` is negative, or greater than or equal to the width of `x` in bits,
+// the result is implementation defined.
+func RightShift(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "RightShift",
+		Input: []tf.Input{
+			x, y,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Dequantize the 'input' tensor into a float Tensor.
+// Adjust the hue of one or more images.
 //
-// [min_range, max_range] are scalar floats that specify the range for
-// the 'input' data. The 'mode' attribute controls exactly which calculations are
-// used to convert the float values to their quantized equivalents.
+// `images` is a tensor of at least 3 dimensions.  The last dimension is
+// interpreted as channels, and must be three.
 //
-// In 'MIN_COMBINED' mode, each value of the tensor will undergo the following:
-//
-// ```
-// if T == qint8, in[i] += (range(T) + 1)/ 2.0
-// out[i] = min_range + (in[i]* (max_range - min_range) / range(T))
-// ```
-// here `range(T) = numeric_limits<T>::max() - numeric_limits<T>::min()`
-//
-// *MIN_COMBINED Mode Example*
-//
-// If the input comes from a QuantizedRelu6, the output type is
-// quint8 (range of 0-255) but the possible range of QuantizedRelu6 is
-// 0-6.  The min_range and max_range values are therefore 0.0 and 6.0.
-// Dequantize on quint8 will take each value, cast to float, and multiply
-// by 6 / 255.
-// Note that if quantizedtype is qint8, the operation will additionally add
-// each value by 128 prior to casting.
-//
-// If the mode is 'MIN_FIRST', then this approach is used:
-//
-// ```c++
-// num_discrete_values = 1 << (# of bits in T)
-// range_adjust = num_discrete_values / (num_discrete_values - 1)
-// range = (range_max - range_min) * range_adjust
-// range_scale = range / num_discrete_values
-// const double offset_input = static_cast<double>(input) - lowest_quantized;
-// result = range_min + ((input - numeric_limits<T>::min()) * range_scale)
-// ```
-//
-// *SCALED mode Example*
-//
-// `SCALED` mode matches the quantization approach used in
-// `QuantizeAndDequantize{V2|V3}`.
-//
-// If the mode is `SCALED`, we do not use the full range of the output type,
-// choosing to elide the lowest possible value for symmetry (e.g., output range is
-// -127 to 127, not -128 to 127 for signed 8 bit quantization), so that 0.0 maps to
-// 0.
-//
-// We first find the range of values in our tensor. The
-// range we use is always centered on 0, so we find m such that
-// ```c++
-//   m = max(abs(input_min), abs(input_max))
-// ```
-//
-// Our input tensor range is then `[-m, m]`.
-//
-// Next, we choose our fixed-point quantization buckets, `[min_fixed, max_fixed]`.
-// If T is signed, this is
-// ```
-//   num_bits = sizeof(T) * 8
-//   [min_fixed, max_fixed] =
-//       [-(1 << (num_bits - 1) - 1), (1 << (num_bits - 1)) - 1]
-// ```
-//
-// Otherwise, if T is unsigned, the fixed-point range is
-// ```
-//   [min_fixed, max_fixed] = [0, (1 << num_bits) - 1]
-// ```
-//
-// From this we compute our scaling factor, s:
-// ```c++
-//   s = (2 * m) / (max_fixed - min_fixed)
-// ```
-//
-// Now we can dequantize the elements of our tensor:
-// ```c++
-// result = input * s
-// ```
+// The input image is considered in the RGB colorspace. Conceptually, the RGB
+// colors are first mapped into HSV. A delta is then applied to all the hue values,
+// and the result is mapped back to RGB colorspace.
 //
 // Arguments:
+//	images: Images to adjust.  At least 3-D.
+//	delta: A float delta to add to the hue.
 //
-//	min_range: The minimum scalar value possibly produced for the input.
-//	max_range: The maximum scalar value possibly produced for the input.
-func Dequantize(scope *Scope, input tf.Output, min_range tf.Output, max_range tf.Output, optional ...DequantizeAttr) (output tf.Output) {
+// Returns The hue-adjusted image or images.
+func AdjustHue(scope *Scope, images tf.Output, delta tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "Dequantize",
+		Type: "AdjustHue",
 		Input: []tf.Input{
-			input, min_range, max_range,
+			images, delta,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns the element-wise max of two SparseTensors.
-//
-// Assumes the two SparseTensors have the same shape, i.e., no broadcasting.
-//
-// Arguments:
-//	a_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, in the canonical lexicographic ordering.
-//	a_values: 1-D.  `N` non-empty values corresponding to `a_indices`.
-//	a_shape: 1-D.  Shape of the input SparseTensor.
-//	b_indices: counterpart to `a_indices` for the other operand.
-//	b_values: counterpart to `a_values` for the other operand; must be of the same dtype.
-//	b_shape: counterpart to `a_shape` for the other operand; the two shapes must be equal.
+// AvgPool3DGradAttr is an optional argument to AvgPool3DGrad.
+type AvgPool3DGradAttr func(optionalAttr)
+
+// AvgPool3DGradDataFormat sets the optional data_format attribute to value.
 //
-// Returns 2-D.  The indices of the output SparseTensor.1-D.  The values of the output SparseTensor.
-func SparseSparseMaximum(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b_indices tf.Output, b_values tf.Output, b_shape tf.Output) (output_indices tf.Output, output_values tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SparseSparseMaximum",
-		Input: []tf.Input{
-			a_indices, a_values, a_shape, b_indices, b_values, b_shape,
-		},
+// value: The data format of the input and output data. With the
+// default format "NDHWC", the data is stored in the order of:
+//     [batch, in_depth, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCDHW", in which case the data storage order is:
+//     [batch, in_channels, in_depth, in_height, in_width].
+// If not specified, defaults to "NDHWC"
+func AvgPool3DGradDataFormat(value string) AvgPool3DGradAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
 }
 
-// Returns a batched matrix tensor with new batched diagonal values.
-//
-// Given `input` and `diagonal`, this operation returns a tensor with the
-// same shape and values as `input`, except for the main diagonal of the
-// innermost matrices.  These will be overwritten by the values in `diagonal`.
-//
-// The output is computed as follows:
-//
-// Assume `input` has `k+1` dimensions `[I, J, K, ..., M, N]` and `diagonal` has
-// `k` dimensions `[I, J, K, ..., min(M, N)]`.  Then the output is a
-// tensor of rank `k+1` with dimensions `[I, J, K, ..., M, N]` where:
-//
-//   * `output[i, j, k, ..., m, n] = diagonal[i, j, k, ..., n]` for `m == n`.
-//   * `output[i, j, k, ..., m, n] = input[i, j, k, ..., m, n]` for `m != n`.
+// Computes gradients of average pooling function.
 //
 // Arguments:
-//	input: Rank `k+1`, where `k >= 1`.
-//	diagonal: Rank `k`, where `k >= 1`.
+//	orig_input_shape: The original input dimensions.
+//	grad: Output backprop of shape `[batch, depth, rows, cols, channels]`.
+//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
+// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
 //
-// Returns Rank `k+1`, with `output.shape = input.shape`.
-func MatrixSetDiag(scope *Scope, input tf.Output, diagonal tf.Output) (output tf.Output) {
+// Returns The backprop for input.
+func AvgPool3DGrad(scope *Scope, orig_input_shape tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPool3DGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "MatrixSetDiag",
+		Type: "AvgPool3DGrad",
 		Input: []tf.Input{
-			input, diagonal,
+			orig_input_shape, grad,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// EditDistanceAttr is an optional argument to EditDistance.
-type EditDistanceAttr func(optionalAttr)
+// ParseSingleSequenceExampleAttr is an optional argument to ParseSingleSequenceExample.
+type ParseSingleSequenceExampleAttr func(optionalAttr)
 
-// EditDistanceNormalize sets the optional normalize attribute to value.
+// ParseSingleSequenceExampleContextSparseTypes sets the optional context_sparse_types attribute to value.
 //
-// value: boolean (if true, edit distances are normalized by length of truth).
+// value: A list of Ncontext_sparse types; the data types of data in
+// each context Feature given in context_sparse_keys.
+// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
+// DT_INT64 (Int64List), and DT_STRING (BytesList).
+// If not specified, defaults to <>
 //
-// The output is:
-// If not specified, defaults to true
-func EditDistanceNormalize(value bool) EditDistanceAttr {
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleContextSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
 	return func(m optionalAttr) {
-		m["normalize"] = value
+		m["context_sparse_types"] = value
 	}
 }
 
-// Computes the (possibly normalized) Levenshtein Edit Distance.
-//
-// The inputs are variable-length sequences provided by SparseTensors
-//   (hypothesis_indices, hypothesis_values, hypothesis_shape)
-// and
-//   (truth_indices, truth_values, truth_shape).
+// ParseSingleSequenceExampleFeatureListDenseTypes sets the optional feature_list_dense_types attribute to value.
+// If not specified, defaults to <>
 //
-// The inputs are:
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleFeatureListDenseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["feature_list_dense_types"] = value
+	}
+}
+
+// ParseSingleSequenceExampleContextDenseShapes sets the optional context_dense_shapes attribute to value.
 //
-// Arguments:
-//	hypothesis_indices: The indices of the hypothesis list SparseTensor.
-// This is an N x R int64 matrix.
-//	hypothesis_values: The values of the hypothesis list SparseTensor.
-// This is an N-length vector.
-//	hypothesis_shape: The shape of the hypothesis list SparseTensor.
-// This is an R-length vector.
-//	truth_indices: The indices of the truth list SparseTensor.
-// This is an M x R int64 matrix.
-//	truth_values: The values of the truth list SparseTensor.
-// This is an M-length vector.
-//	truth_shape: truth indices, vector.
+// value: A list of Ncontext_dense shapes; the shapes of data in
+// each context Feature given in context_dense_keys.
+// The number of elements in the Feature corresponding to context_dense_key[j]
+// must always equal context_dense_shapes[j].NumEntries().
+// The shape of context_dense_values[j] will match context_dense_shapes[j].
+// If not specified, defaults to <>
 //
-// Returns A dense float tensor with rank R - 1.
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleContextDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["context_dense_shapes"] = value
+	}
+}
+
+// ParseSingleSequenceExampleFeatureListSparseTypes sets the optional feature_list_sparse_types attribute to value.
 //
-// For the example input:
+// value: A list of Nfeature_list_sparse types; the data types
+// of data in each FeatureList given in feature_list_sparse_keys.
+// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
+// DT_INT64 (Int64List), and DT_STRING (BytesList).
+// If not specified, defaults to <>
 //
-//     // hypothesis represents a 2x1 matrix with variable-length values:
-//     //   (0,0) = ["a"]
-//     //   (1,0) = ["b"]
-//     hypothesis_indices = [[0, 0, 0],
-//                           [1, 0, 0]]
-//     hypothesis_values = ["a", "b"]
-//     hypothesis_shape = [2, 1, 1]
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleFeatureListSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["feature_list_sparse_types"] = value
+	}
+}
+
+// ParseSingleSequenceExampleFeatureListDenseShapes sets the optional feature_list_dense_shapes attribute to value.
 //
-//     // truth represents a 2x2 matrix with variable-length values:
-//     //   (0,0) = []
-//     //   (0,1) = ["a"]
-//     //   (1,0) = ["b", "c"]
-//     //   (1,1) = ["a"]
-//     truth_indices = [[0, 1, 0],
-//                      [1, 0, 0],
-//                      [1, 0, 1],
-//                      [1, 1, 0]]
-//     truth_values = ["a", "b", "c", "a"]
-//     truth_shape = [2, 2, 2]
-//     normalize = true
+// value: A list of Nfeature_list_dense shapes; the shapes of
+// data in each FeatureList given in feature_list_dense_keys.
+// The shape of each Feature in the FeatureList corresponding to
+// feature_list_dense_key[j] must always equal
+// feature_list_dense_shapes[j].NumEntries().
+// If not specified, defaults to <>
 //
-// The output will be:
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleFeatureListDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["feature_list_dense_shapes"] = value
+	}
+}
+
+// Transforms a scalar brain.SequenceExample proto (as strings) into typed tensors.
 //
-//     // output is a 2x2 matrix with edit distances normalized by truth lengths.
-//     output = [[inf, 1.0],  // (0,0): no truth, (0,1): no hypothesis
-//               [0.5, 1.0]]  // (1,0): addition, (1,1): no hypothesis
-func EditDistance(scope *Scope, hypothesis_indices tf.Output, hypothesis_values tf.Output, hypothesis_shape tf.Output, truth_indices tf.Output, truth_values tf.Output, truth_shape tf.Output, optional ...EditDistanceAttr) (output tf.Output) {
+// Arguments:
+//	serialized: A scalar containing a binary serialized SequenceExample proto.
+//	feature_list_dense_missing_assumed_empty: A vector listing the
+// FeatureList keys which may be missing from the SequenceExample.  If the
+// associated FeatureList is missing, it is treated as empty.  By default,
+// any FeatureList not listed in this vector must exist in the SequenceExample.
+//	context_sparse_keys: A list of Ncontext_sparse string Tensors (scalars).
+// The keys expected in the Examples' features associated with context_sparse
+// values.
+//	context_dense_keys: A list of Ncontext_dense string Tensors (scalars).
+// The keys expected in the SequenceExamples' context features associated with
+// dense values.
+//	feature_list_sparse_keys: A list of Nfeature_list_sparse string Tensors
+// (scalars).  The keys expected in the FeatureLists associated with sparse
+// values.
+//	feature_list_dense_keys: A list of Nfeature_list_dense string Tensors (scalars).
+// The keys expected in the SequenceExamples' feature_lists associated
+// with lists of dense values.
+//	context_dense_defaults: A list of Ncontext_dense Tensors (some may be empty).
+// context_dense_defaults[j] provides default values
+// when the SequenceExample's context map lacks context_dense_key[j].
+// If an empty Tensor is provided for context_dense_defaults[j],
+// then the Feature context_dense_keys[j] is required.
+// The input type is inferred from context_dense_defaults[j], even when it's
+// empty.  If context_dense_defaults[j] is not empty, its shape must match
+// context_dense_shapes[j].
+//	debug_name: A scalar containing the name of the serialized proto.
+// May contain, for example, table key (descriptive) name for the
+// corresponding serialized proto.  This is purely useful for debugging
+// purposes, and the presence of values here has no effect on the output.
+// May also be an empty scalar if no name is available.
+func ParseSingleSequenceExample(scope *Scope, serialized tf.Output, feature_list_dense_missing_assumed_empty tf.Output, context_sparse_keys []tf.Output, context_dense_keys []tf.Output, feature_list_sparse_keys []tf.Output, feature_list_dense_keys []tf.Output, context_dense_defaults []tf.Output, debug_name tf.Output, optional ...ParseSingleSequenceExampleAttr) (context_sparse_indices []tf.Output, context_sparse_values []tf.Output, context_sparse_shapes []tf.Output, context_dense_values []tf.Output, feature_list_sparse_indices []tf.Output, feature_list_sparse_values []tf.Output, feature_list_sparse_shapes []tf.Output, feature_list_dense_values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -26347,518 +26391,463 @@ func EditDistance(scope *Scope, hypothesis_indices tf.Output, hypothesis_values
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "EditDistance",
+		Type: "ParseSingleSequenceExample",
 		Input: []tf.Input{
-			hypothesis_indices, hypothesis_values, hypothesis_shape, truth_indices, truth_values, truth_shape,
+			serialized, feature_list_dense_missing_assumed_empty, tf.OutputList(context_sparse_keys), tf.OutputList(context_dense_keys), tf.OutputList(feature_list_sparse_keys), tf.OutputList(feature_list_dense_keys), tf.OutputList(context_dense_defaults), debug_name,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if context_sparse_indices, idx, err = makeOutputList(op, idx, "context_sparse_indices"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if context_sparse_values, idx, err = makeOutputList(op, idx, "context_sparse_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if context_sparse_shapes, idx, err = makeOutputList(op, idx, "context_sparse_shapes"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if context_dense_values, idx, err = makeOutputList(op, idx, "context_dense_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if feature_list_sparse_indices, idx, err = makeOutputList(op, idx, "feature_list_sparse_indices"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if feature_list_sparse_values, idx, err = makeOutputList(op, idx, "feature_list_sparse_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if feature_list_sparse_shapes, idx, err = makeOutputList(op, idx, "feature_list_sparse_shapes"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if feature_list_dense_values, idx, err = makeOutputList(op, idx, "feature_list_dense_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	return context_sparse_indices, context_sparse_values, context_sparse_shapes, context_dense_values, feature_list_sparse_indices, feature_list_sparse_values, feature_list_sparse_shapes, feature_list_dense_values
 }
 
-// Gather slices from `params` into a Tensor with shape specified by `indices`.
-//
-// `indices` is an K-dimensional integer tensor, best thought of as a
-// (K-1)-dimensional tensor of indices into `params`, where each element defines a
-// slice of `params`:
-//
-//     output[i_0, ..., i_{K-2}] = params[indices[i0, ..., i_{K-2}]]
-//
-// Whereas in @{tf.gather} `indices` defines slices into the first
-// dimension of `params`, in `tf.gather_nd`, `indices` defines slices into the
-// first `N` dimensions of `params`, where `N = indices.shape[-1]`.
-//
-// The last dimension of `indices` can be at most the rank of
-// `params`:
-//
-//     indices.shape[-1] <= params.rank
-//
-// The last dimension of `indices` corresponds to elements
-// (if `indices.shape[-1] == params.rank`) or slices
-// (if `indices.shape[-1] < params.rank`) along dimension `indices.shape[-1]`
-// of `params`.  The output tensor has shape
-//
-//     indices.shape[:-1] + params.shape[indices.shape[-1]:]
-//
-// Some examples below.
-//
-// Simple indexing into a matrix:
-//
-// ```python
-//     indices = [[0, 0], [1, 1]]
-//     params = [['a', 'b'], ['c', 'd']]
-//     output = ['a', 'd']
-// ```
-//
-// Slice indexing into a matrix:
-//
-// ```python
-//     indices = [[1], [0]]
-//     params = [['a', 'b'], ['c', 'd']]
-//     output = [['c', 'd'], ['a', 'b']]
-// ```
-//
-// Indexing into a 3-tensor:
-//
-// ```python
-//     indices = [[1]]
-//     params = [[['a0', 'b0'], ['c0', 'd0']],
-//               [['a1', 'b1'], ['c1', 'd1']]]
-//     output = [[['a1', 'b1'], ['c1', 'd1']]]
-//
-//
-//     indices = [[0, 1], [1, 0]]
-//     params = [[['a0', 'b0'], ['c0', 'd0']],
-//               [['a1', 'b1'], ['c1', 'd1']]]
-//     output = [['c0', 'd0'], ['a1', 'b1']]
-//
-//
-//     indices = [[0, 0, 1], [1, 0, 1]]
-//     params = [[['a0', 'b0'], ['c0', 'd0']],
-//               [['a1', 'b1'], ['c1', 'd1']]]
-//     output = ['b0', 'b1']
-// ```
-//
-// Batched indexing into a matrix:
-//
-// ```python
-//     indices = [[[0, 0]], [[0, 1]]]
-//     params = [['a', 'b'], ['c', 'd']]
-//     output = [['a'], ['b']]
-// ```
-//
-// Batched slice indexing into a matrix:
+// DecodeWavAttr is an optional argument to DecodeWav.
+type DecodeWavAttr func(optionalAttr)
+
+// DecodeWavDesiredChannels sets the optional desired_channels attribute to value.
 //
-// ```python
-//     indices = [[[1]], [[0]]]
-//     params = [['a', 'b'], ['c', 'd']]
-//     output = [[['c', 'd']], [['a', 'b']]]
-// ```
+// value: Number of sample channels wanted.
+// If not specified, defaults to -1
+func DecodeWavDesiredChannels(value int64) DecodeWavAttr {
+	return func(m optionalAttr) {
+		m["desired_channels"] = value
+	}
+}
+
+// DecodeWavDesiredSamples sets the optional desired_samples attribute to value.
 //
-// Batched indexing into a 3-tensor:
+// value: Length of audio requested.
+// If not specified, defaults to -1
+func DecodeWavDesiredSamples(value int64) DecodeWavAttr {
+	return func(m optionalAttr) {
+		m["desired_samples"] = value
+	}
+}
+
+// Decode a 16-bit PCM WAV file to a float tensor.
 //
-// ```python
-//     indices = [[[1]], [[0]]]
-//     params = [[['a0', 'b0'], ['c0', 'd0']],
-//               [['a1', 'b1'], ['c1', 'd1']]]
-//     output = [[[['a1', 'b1'], ['c1', 'd1']]],
-//               [[['a0', 'b0'], ['c0', 'd0']]]]
+// The -32768 to 32767 signed 16-bit values will be scaled to -1.0 to 1.0 in float.
 //
-//     indices = [[[0, 1], [1, 0]], [[0, 0], [1, 1]]]
-//     params = [[['a0', 'b0'], ['c0', 'd0']],
-//               [['a1', 'b1'], ['c1', 'd1']]]
-//     output = [[['c0', 'd0'], ['a1', 'b1']],
-//               [['a0', 'b0'], ['c1', 'd1']]]
+// When desired_channels is set, if the input contains fewer channels than this
+// then the last channel will be duplicated to give the requested number, else if
+// the input has more channels than requested then the additional channels will be
+// ignored.
 //
+// If desired_samples is set, then the audio will be cropped or padded with zeroes
+// to the requested length.
 //
-//     indices = [[[0, 0, 1], [1, 0, 1]], [[0, 1, 1], [1, 1, 0]]]
-//     params = [[['a0', 'b0'], ['c0', 'd0']],
-//               [['a1', 'b1'], ['c1', 'd1']]]
-//     output = [['b0', 'b1'], ['d0', 'c1']]
-// ```
+// The first output contains a Tensor with the content of the audio samples. The
+// lowest dimension will be the number of channels, and the second will be the
+// number of samples. For example, a ten-sample-long stereo WAV file should give an
+// output shape of [10, 2].
 //
 // Arguments:
-//	params: The tensor from which to gather values.
-//	indices: Index tensor.
+//	contents: The WAV-encoded audio, usually from a file.
 //
-// Returns Values from `params` gathered from indices given by `indices`, with
-// shape `indices.shape[:-1] + params.shape[indices.shape[-1]:]`.
-func GatherNd(scope *Scope, params tf.Output, indices tf.Output) (output tf.Output) {
+// Returns 2-D with shape `[length, channels]`. Scalar holding the sample rate found in the WAV header.
+func DecodeWav(scope *Scope, contents tf.Output, optional ...DecodeWavAttr) (audio tf.Output, sample_rate tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "GatherNd",
+		Type: "DecodeWav",
 		Input: []tf.Input{
-			params, indices,
+			contents,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Eagerly executes a python function to compute func(input)->output. The
+// UniqueAttr is an optional argument to Unique.
+type UniqueAttr func(optionalAttr)
+
+// UniqueOutIdx sets the optional out_idx attribute to value.
+// If not specified, defaults to DT_INT32
+func UniqueOutIdx(value tf.DataType) UniqueAttr {
+	return func(m optionalAttr) {
+		m["out_idx"] = value
+	}
+}
+
+// Finds unique elements in a 1-D tensor.
 //
-// semantics of the input, output, and attributes are the same as those for
-// PyFunc.
-func EagerPyFunc(scope *Scope, input []tf.Output, token string, Tout []tf.DataType) (output []tf.Output) {
+// This operation returns a tensor `y` containing all of the unique elements of `x`
+// sorted in the same order that they occur in `x`. This operation also returns a
+// tensor `idx` the same size as `x` that contains the index of each value of `x`
+// in the unique output `y`. In other words:
+//
+// `y[idx[i]] = x[i] for i in [0, 1,...,size(x) - 1]`
+//
+// For example:
+//
+// ```
+// # tensor 'x' is [1, 1, 2, 4, 4, 4, 7, 8, 8]
+// y, idx = unique(x)
+// y ==> [1, 2, 4, 7, 8]
+// idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
+// ```
+//
+// Arguments:
+//	x: 1-D.
+//
+// Returns 1-D. 1-D.
+func Unique(scope *Scope, x tf.Output, optional ...UniqueAttr) (y tf.Output, idx tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"token": token, "Tout": Tout}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "EagerPyFunc",
+		Type: "Unique",
 		Input: []tf.Input{
-			tf.OutputList(input),
+			x,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("EagerPyFunc", err)
-		return
-	}
-	return output
+	return op.Output(0), op.Output(1)
 }
 
-// Stops gradient computation.
+// Concatenates a list of `N` tensors along the first dimension.
 //
-// When executed in a graph, this op outputs its input tensor as-is.
+// The input tensors are all required to have size 1 in the first dimension.
 //
-// When building ops to compute gradients, this op prevents the contribution of
-// its inputs to be taken into account.  Normally, the gradient generator adds ops
-// to a graph to compute the derivatives of a specified 'loss' by recursively
-// finding out inputs that contributed to its computation.  If you insert this op
-// in the graph it inputs are masked from the gradient generator.  They are not
-// taken into account for computing gradients.
+// For example:
 //
-// This is useful any time you want to compute a value with TensorFlow but need
-// to pretend that the value was a constant. Some examples include:
+// ```
+// # 'x' is [[1, 4]]
+// # 'y' is [[2, 5]]
+// # 'z' is [[3, 6]]
+// parallel_concat([x, y, z]) => [[1, 4], [2, 5], [3, 6]]  # Pack along first dim.
+// ```
 //
-// *  The *EM* algorithm where the *M-step* should not involve backpropagation
-//    through the output of the *E-step*.
-// *  Contrastive divergence training of Boltzmann machines where, when
-//    differentiating the energy function, the training must not backpropagate
-//    through the graph that generated the samples from the model.
-// *  Adversarial training, where no backprop should happen through the adversarial
-//    example generation process.
-func StopGradient(scope *Scope, input tf.Output) (output tf.Output) {
+// The difference between concat and parallel_concat is that concat requires all
+// of the inputs be computed before the operation will begin but doesn't require
+// that the input shapes be known during graph construction.  Parallel concat
+// will copy pieces of the input into the output as they become available; in
+// some situations this can provide a performance benefit.
+//
+// Arguments:
+//	values: Tensors to be concatenated. All must have size 1 in the first dimension
+// and same shape.
+//	shape: the final shape of the result; should be equal to the shapes of any input
+// but with the number of input values in the first dimension.
+//
+// Returns The concatenated tensor.
+func ParallelConcat(scope *Scope, values []tf.Output, shape tf.Shape) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"shape": shape}
 	opspec := tf.OpSpec{
-		Type: "StopGradient",
+		Type: "ParallelConcat",
 		Input: []tf.Input{
-			input,
+			tf.OutputList(values),
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes asin of x element-wise.
-func Asin(scope *Scope, x tf.Output) (y tf.Output) {
+// Compute the lower regularized incomplete Gamma function `P(a, x)`.
+//
+// The lower regularized incomplete Gamma function is defined as:
+//
+//
+// \\(P(a, x) = gamma(a, x) / Gamma(a) = 1 - Q(a, x)\\)
+//
+// where
+//
+// \\(gamma(a, x) = \\int_{0}^{x} t^{a-1} exp(-t) dt\\)
+//
+// is the lower incomplete Gamma function.
+//
+// Note, above `Q(a, x)` (`Igammac`) is the upper regularized incomplete
+// Gamma function.
+func Igamma(scope *Scope, a tf.Output, x tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Asin",
+		Type: "Igamma",
 		Input: []tf.Input{
-			x,
+			a, x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// PreventGradientAttr is an optional argument to PreventGradient.
-type PreventGradientAttr func(optionalAttr)
-
-// PreventGradientMessage sets the optional message attribute to value.
+// Computes offsets of concat inputs within its output.
 //
-// value: Will be printed in the error when anyone tries to differentiate
-// this operation.
-// If not specified, defaults to ""
-func PreventGradientMessage(value string) PreventGradientAttr {
-	return func(m optionalAttr) {
-		m["message"] = value
-	}
-}
-
-// An identity op that triggers an error if a gradient is requested.
+// For example:
 //
-// When executed in a graph, this op outputs its input tensor as-is.
+// ```
+// # 'x' is [2, 2, 7]
+// # 'y' is [2, 3, 7]
+// # 'z' is [2, 5, 7]
+// concat_offset(1, [x, y, z]) => [0, 0, 0], [0, 2, 0], [0, 5, 0]
+// ```
 //
-// When building ops to compute gradients, the TensorFlow gradient system
-// will return an error when trying to lookup the gradient of this op,
-// because no gradient must ever be registered for this function.  This
-// op exists to prevent subtle bugs from silently returning unimplemented
-// gradients in some corner cases.
+// This is typically used by gradient computations for a concat operation.
 //
 // Arguments:
-//	input: any tensor.
+//	concat_dim: The dimension along which to concatenate.
+//	shape: The `N` int32 vectors representing shape of tensors being concatenated.
 //
-// Returns the same input tensor.
-func PreventGradient(scope *Scope, input tf.Output, optional ...PreventGradientAttr) (output tf.Output) {
+// Returns The `N` int32 vectors representing the starting offset
+// of input tensors within the concatenated output.
+func ConcatOffset(scope *Scope, concat_dim tf.Output, shape []tf.Output) (offset []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "PreventGradient",
+		Type: "ConcatOffset",
 		Input: []tf.Input{
-			input,
+			concat_dim, tf.OutputList(shape),
 		},
-		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if offset, idx, err = makeOutputList(op, idx, "offset"); err != nil {
+		scope.UpdateErr("ConcatOffset", err)
+		return
+	}
+	return offset
 }
 
-// Checks a tensor for NaN and Inf values.
-//
-// When run, reports an `InvalidArgument` error if `tensor` has any values
-// that are not a number (NaN) or infinity (Inf). Otherwise, passes `tensor` as-is.
+// Splits a tensor into `num_split` tensors along one dimension.
 //
 // Arguments:
+//	axis: 0-D.  The dimension along which to split.  Must be in the range
+// `[-rank(value), rank(value))`.
+//	value: The tensor to split.
+//	num_split: The number of ways to split.  Must evenly divide
+// `value.shape[axis]`.
 //
-//	message: Prefix of the error message.
-func CheckNumerics(scope *Scope, tensor tf.Output, message string) (output tf.Output) {
+// Returns `num_split` identically shaped tensors, whose shape matches that of `value`
+// except along `axis`, where their sizes are
+// `value.shape[axis] / num_split`.
+func Split(scope *Scope, axis tf.Output, value tf.Output, num_split int64) (output []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"message": message}
+	attrs := map[string]interface{}{"num_split": num_split}
 	opspec := tf.OpSpec{
-		Type: "CheckNumerics",
+		Type: "Split",
 		Input: []tf.Input{
-			tensor,
+			axis, value,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
+		scope.UpdateErr("Split", err)
+		return
+	}
+	return output
 }
 
-// Shuffle dimensions of x according to a permutation and conjugate the result.
+// Splits a tensor into `num_split` tensors along one dimension.
 //
-// The output `y` has the same rank as `x`. The shapes of `x` and `y` satisfy:
-//   `y.shape[i] == x.shape[perm[i]] for i in [0, 1, ..., rank(x) - 1]`
-//   `y[i,j,k,...,s,t,u] == conj(x[perm[i], perm[j], perm[k],...,perm[s], perm[t], perm[u]])`
-func ConjugateTranspose(scope *Scope, x tf.Output, perm tf.Output) (y tf.Output) {
+// Arguments:
+//	value: The tensor to split.
+//	size_splits: list containing the sizes of each output tensor along the split
+// dimension. Must sum to the dimension of `value` along `axis`.
+// Can contain one -1 indicating that dimension is to be inferred.
+//	axis: 0-D.  The dimension along which to split.  Must be in the range
+// `[-rank(value), rank(value))`.
+//
+//
+// Returns Tensors whose shape matches that of `value`
+// except along `axis`, where their sizes are
+// `size_splits[i]`.
+func SplitV(scope *Scope, value tf.Output, size_splits tf.Output, axis tf.Output, num_split int64) (output []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"num_split": num_split}
 	opspec := tf.OpSpec{
-		Type: "ConjugateTranspose",
+		Type: "SplitV",
 		Input: []tf.Input{
-			x, perm,
+			value, size_splits, axis,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// UniqueV2Attr is an optional argument to UniqueV2.
-type UniqueV2Attr func(optionalAttr)
-
-// UniqueV2OutIdx sets the optional out_idx attribute to value.
-// If not specified, defaults to DT_INT32
-func UniqueV2OutIdx(value tf.DataType) UniqueV2Attr {
-	return func(m optionalAttr) {
-		m["out_idx"] = value
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
+		scope.UpdateErr("SplitV", err)
+		return
 	}
+	return output
 }
 
-// Finds unique elements in a 1-D tensor.
-//
-// This operation returns a tensor `y` containing all of the unique elements of `x`
-// sorted in the same order that they occur in `x`. This operation also returns a
-// tensor `idx` the same size as `x` that contains the index of each value of `x`
-// in the unique output `y`. In other words:
-//
-// `y[idx[i]] = x[i] for i in [0, 1,...,rank(x) - 1]`
-//
-// For example:
+// Gives a guarantee to the TF runtime that the input tensor is a constant.
 //
-// ```
-// # tensor 'x' is [1, 1, 2, 4, 4, 4, 7, 8, 8]
-// y, idx = unique(x)
-// y ==> [1, 2, 4, 7, 8]
-// idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
-// ```
+// The runtime is then free to make optimizations based on this.
 //
-// Arguments:
-//	x: A `Tensor`.
-//	axis: A `Tensor` of type `int64` (default: 0). The axis of the Tensor to
-// find the unique elements.
+// Only accepts value typed tensors as inputs and rejects resource variable handles
+// as input.
 //
-// Returns A `Tensor`. Unique elements along the `axis` of `Tensor` x.A 1-D Tensor. Has the same type as x that contains the index of each
-// value of x in the output y.
-func UniqueV2(scope *Scope, x tf.Output, axis tf.Output, optional ...UniqueV2Attr) (y tf.Output, idx tf.Output) {
+// Returns the input tensor without modification.
+func GuaranteeConst(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "UniqueV2",
+		Type: "GuaranteeConst",
 		Input: []tf.Input{
-			x, axis,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Return a slice from 'input'.
-//
-// The output tensor is a tensor with dimensions described by 'size'
-// whose values are extracted from 'input' starting at the offsets in
-// 'begin'.
-//
-// *Requirements*:
-//   0 <= begin[i] <= begin[i] + size[i] <= Di  for i in [0, n)
+// Returns a tensor of zeros with the same shape and type as x.
 //
 // Arguments:
+//	x: a tensor of type T.
 //
-//	begin: begin[i] specifies the offset into the 'i'th dimension of
-// 'input' to slice from.
-//	size: size[i] specifies the number of elements of the 'i'th dimension
-// of 'input' to slice. If size[i] is -1, all remaining elements in dimension
-// i are included in the slice (i.e. this is equivalent to setting
-// size[i] = input.dim_size(i) - begin[i]).
-func Slice(scope *Scope, input tf.Output, begin tf.Output, size tf.Output) (output tf.Output) {
+// Returns a tensor of the same shape and type as x but filled with zeros.
+func ZerosLike(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Slice",
+		Type: "ZerosLike",
 		Input: []tf.Input{
-			input, begin, size,
+			x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// StridedSliceGradAttr is an optional argument to StridedSliceGrad.
-type StridedSliceGradAttr func(optionalAttr)
-
-// StridedSliceGradBeginMask sets the optional begin_mask attribute to value.
-// If not specified, defaults to 0
-func StridedSliceGradBeginMask(value int64) StridedSliceGradAttr {
-	return func(m optionalAttr) {
-		m["begin_mask"] = value
-	}
-}
-
-// StridedSliceGradEndMask sets the optional end_mask attribute to value.
-// If not specified, defaults to 0
-func StridedSliceGradEndMask(value int64) StridedSliceGradAttr {
-	return func(m optionalAttr) {
-		m["end_mask"] = value
-	}
-}
+// QuantizedInstanceNormAttr is an optional argument to QuantizedInstanceNorm.
+type QuantizedInstanceNormAttr func(optionalAttr)
 
-// StridedSliceGradEllipsisMask sets the optional ellipsis_mask attribute to value.
-// If not specified, defaults to 0
-func StridedSliceGradEllipsisMask(value int64) StridedSliceGradAttr {
+// QuantizedInstanceNormOutputRangeGiven sets the optional output_range_given attribute to value.
+//
+// value: If True, `given_y_min`
+// and `given_y_max` are used as the output range. Otherwise,
+// the implementation computes the output range.
+// If not specified, defaults to false
+func QuantizedInstanceNormOutputRangeGiven(value bool) QuantizedInstanceNormAttr {
 	return func(m optionalAttr) {
-		m["ellipsis_mask"] = value
+		m["output_range_given"] = value
 	}
 }
 
-// StridedSliceGradNewAxisMask sets the optional new_axis_mask attribute to value.
+// QuantizedInstanceNormGivenYMin sets the optional given_y_min attribute to value.
+//
+// value: Output in `y_min` if `output_range_given` is True.
 // If not specified, defaults to 0
-func StridedSliceGradNewAxisMask(value int64) StridedSliceGradAttr {
+func QuantizedInstanceNormGivenYMin(value float32) QuantizedInstanceNormAttr {
 	return func(m optionalAttr) {
-		m["new_axis_mask"] = value
+		m["given_y_min"] = value
 	}
 }
 
-// StridedSliceGradShrinkAxisMask sets the optional shrink_axis_mask attribute to value.
+// QuantizedInstanceNormGivenYMax sets the optional given_y_max attribute to value.
+//
+// value: Output in `y_max` if `output_range_given` is True.
 // If not specified, defaults to 0
-func StridedSliceGradShrinkAxisMask(value int64) StridedSliceGradAttr {
+func QuantizedInstanceNormGivenYMax(value float32) QuantizedInstanceNormAttr {
 	return func(m optionalAttr) {
-		m["shrink_axis_mask"] = value
-	}
-}
-
-// Returns the gradient of `StridedSlice`.
-//
-// Since `StridedSlice` cuts out pieces of its `input` which is size
-// `shape`, its gradient will have the same shape (which is passed here
-// as `shape`). The gradient will be zero in any element that the slice
-// does not select.
-//
-// Arguments are the same as StridedSliceGrad with the exception that
-// `dy` is the input gradient to be propagated and `shape` is the
-// shape of `StridedSlice`'s `input`.
-func StridedSliceGrad(scope *Scope, shape tf.Output, begin tf.Output, end tf.Output, strides tf.Output, dy tf.Output, optional ...StridedSliceGradAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "StridedSliceGrad",
-		Input: []tf.Input{
-			shape, begin, end, strides, dy,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Returns the gradient of `Tile`.
-//
-// DEPRECATED at GraphDef version 3: TileGrad has been replaced with reduce_sum
-//
-// Since `Tile` takes an input and repeats the input `multiples` times
-// along each dimension, `TileGrad` takes in `multiples` and aggregates
-// each repeated tile of `input` into `output`.
-func TileGrad(scope *Scope, input tf.Output, multiples tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "TileGrad",
-		Input: []tf.Input{
-			input, multiples,
-		},
+		m["given_y_max"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// DataFormatDimMapAttr is an optional argument to DataFormatDimMap.
-type DataFormatDimMapAttr func(optionalAttr)
-
-// DataFormatDimMapSrcFormat sets the optional src_format attribute to value.
+// QuantizedInstanceNormVarianceEpsilon sets the optional variance_epsilon attribute to value.
 //
-// value: source data format.
-// If not specified, defaults to "NHWC"
-func DataFormatDimMapSrcFormat(value string) DataFormatDimMapAttr {
+// value: A small float number to avoid dividing by 0.
+// If not specified, defaults to 1e-05
+func QuantizedInstanceNormVarianceEpsilon(value float32) QuantizedInstanceNormAttr {
 	return func(m optionalAttr) {
-		m["src_format"] = value
+		m["variance_epsilon"] = value
 	}
 }
 
-// DataFormatDimMapDstFormat sets the optional dst_format attribute to value.
+// QuantizedInstanceNormMinSeparation sets the optional min_separation attribute to value.
 //
-// value: destination data format.
-// If not specified, defaults to "NCHW"
-func DataFormatDimMapDstFormat(value string) DataFormatDimMapAttr {
+// value: Minimum value of `y_max - y_min`
+// If not specified, defaults to 0.001
+func QuantizedInstanceNormMinSeparation(value float32) QuantizedInstanceNormAttr {
 	return func(m optionalAttr) {
-		m["dst_format"] = value
+		m["min_separation"] = value
 	}
 }
 
-// Returns the dimension index in the destination data format given the one in
-//
-// the source data format.
+// Quantized Instance normalization.
 //
 // Arguments:
-//	x: A Tensor with each element as a dimension index in source data format.
-// Must be in the range [-4, 4).
+//	x: A 4D input Tensor.
+//	x_min: The value represented by the lowest quantized input.
+//	x_max: The value represented by the highest quantized input.
 //
-// Returns A Tensor with each element as a dimension index in destination data format.
-func DataFormatDimMap(scope *Scope, x tf.Output, optional ...DataFormatDimMapAttr) (y tf.Output) {
+// Returns A 4D Tensor. The value represented by the lowest quantized output. The value represented by the highest quantized output.
+func QuantizedInstanceNorm(scope *Scope, x tf.Output, x_min tf.Output, x_max tf.Output, optional ...QuantizedInstanceNormAttr) (y tf.Output, y_min tf.Output, y_max tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -26867,227 +26856,185 @@ func DataFormatDimMap(scope *Scope, x tf.Output, optional ...DataFormatDimMapAtt
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DataFormatDimMap",
+		Type: "QuantizedInstanceNorm",
 		Input: []tf.Input{
-			x,
+			x, x_min, x_max,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Return the shape of s0 op s1 with broadcast.
-//
-// Given `s0` and `s1`, tensors that represent shapes, compute `r0`, the
-// broadcasted shape. `s0`, `s1` and `r0` are all integer vectors.
-func BroadcastArgs(scope *Scope, s0 tf.Output, s1 tf.Output) (r0 tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "BroadcastArgs",
-		Input: []tf.Input{
-			s0, s1,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Return the reduction indices for computing gradients of s0 op s1 with broadcast.
-//
-// This is typically used by gradient computations for a broadcasting operation.
-func BroadcastGradientArgs(scope *Scope, s0 tf.Output, s1 tf.Output) (r0 tf.Output, r1 tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "BroadcastGradientArgs",
-		Input: []tf.Input{
-			s0, s1,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Pads a tensor with mirrored values.
+// Returns the diagonal part of the tensor.
 //
-// This operation pads a `input` with mirrored values according to the `paddings`
-// you specify. `paddings` is an integer tensor with shape `[n, 2]`, where n is
-// the rank of `input`. For each dimension D of `input`, `paddings[D, 0]` indicates
-// how many values to add before the contents of `input` in that dimension, and
-// `paddings[D, 1]` indicates how many values to add after the contents of `input`
-// in that dimension. Both `paddings[D, 0]` and `paddings[D, 1]` must be no greater
-// than `input.dim_size(D)` (or `input.dim_size(D) - 1`) if `copy_border` is true
-// (if false, respectively).
+// This operation returns a tensor with the `diagonal` part
+// of the `input`. The `diagonal` part is computed as follows:
 //
-// The padded size of each dimension D of the output is:
+// Assume `input` has dimensions `[D1,..., Dk, D1,..., Dk]`, then the output is a
+// tensor of rank `k` with dimensions `[D1,..., Dk]` where:
 //
-// `paddings(D, 0) + input.dim_size(D) + paddings(D, 1)`
+// `diagonal[i1,..., ik] = input[i1, ..., ik, i1,..., ik]`.
 //
 // For example:
 //
 // ```
-// # 't' is [[1, 2, 3], [4, 5, 6]].
-// # 'paddings' is [[1, 1]], [2, 2]].
-// # 'mode' is SYMMETRIC.
-// # rank of 't' is 2.
-// pad(t, paddings) ==> [[2, 1, 1, 2, 3, 3, 2]
-//                       [2, 1, 1, 2, 3, 3, 2]
-//                       [5, 4, 4, 5, 6, 6, 5]
-//                       [5, 4, 4, 5, 6, 6, 5]]
+// # 'input' is [[1, 0, 0, 0]
+//               [0, 2, 0, 0]
+//               [0, 0, 3, 0]
+//               [0, 0, 0, 4]]
+//
+// tf.diag_part(input) ==> [1, 2, 3, 4]
 // ```
 //
 // Arguments:
-//	input: The input tensor to be padded.
-//	paddings: A two-column matrix specifying the padding sizes. The number of
-// rows must be the same as the rank of `input`.
-//	mode: Either `REFLECT` or `SYMMETRIC`. In reflect mode the padded regions
-// do not include the borders, while in symmetric mode the padded regions
-// do include the borders. For example, if `input` is `[1, 2, 3]` and `paddings`
-// is `[0, 2]`, then the output is `[1, 2, 3, 2, 1]` in reflect mode, and
-// it is `[1, 2, 3, 3, 2]` in symmetric mode.
+//	input: Rank k tensor where k is even and not zero.
 //
-// Returns The padded tensor.
-func MirrorPad(scope *Scope, input tf.Output, paddings tf.Output, mode string) (output tf.Output) {
+// Returns The extracted diagonal.
+func DiagPart(scope *Scope, input tf.Output) (diagonal tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"mode": mode}
 	opspec := tf.OpSpec{
-		Type: "MirrorPad",
+		Type: "DiagPart",
 		Input: []tf.Input{
-			input, paddings,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// A placeholder op for a value that will be fed into the computation.
-//
-// DEPRECATED at GraphDef version 23: Placeholder now behaves the same as PlaceholderV2.
+// Returns the element-wise max of two SparseTensors.
 //
-// N.B. This operation will fail with an error if it is executed. It is
-// intended as a way to represent a value that will always be fed, and to
-// provide attrs that enable the fed value to be checked at runtime.
+// Assumes the two SparseTensors have the same shape, i.e., no broadcasting.
 //
 // Arguments:
-//	dtype: The type of elements in the tensor.
-//	shape: The shape of the tensor. The shape can be any partially-specified
-// shape.  To be unconstrained, pass in a shape with unknown rank.
+//	a_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, in the canonical lexicographic ordering.
+//	a_values: 1-D.  `N` non-empty values corresponding to `a_indices`.
+//	a_shape: 1-D.  Shape of the input SparseTensor.
+//	b_indices: counterpart to `a_indices` for the other operand.
+//	b_values: counterpart to `a_values` for the other operand; must be of the same dtype.
+//	b_shape: counterpart to `a_shape` for the other operand; the two shapes must be equal.
 //
-// Returns A placeholder tensor that must be replaced using the feed mechanism.
-func PlaceholderV2(scope *Scope, dtype tf.DataType, shape tf.Shape) (output tf.Output) {
+// Returns 2-D.  The indices of the output SparseTensor. 1-D.  The values of the output SparseTensor.
+func SparseSparseMaximum(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b_indices tf.Output, b_values tf.Output, b_shape tf.Output) (output_indices tf.Output, output_values tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype, "shape": shape}
 	opspec := tf.OpSpec{
-		Type: "PlaceholderV2",
-
-		Attrs: attrs,
+		Type: "SparseSparseMaximum",
+		Input: []tf.Input{
+			a_indices, a_values, a_shape, b_indices, b_values, b_shape,
+		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// ResourceApplyAdadeltaAttr is an optional argument to ResourceApplyAdadelta.
-type ResourceApplyAdadeltaAttr func(optionalAttr)
-
-// ResourceApplyAdadeltaUseLocking sets the optional use_locking attribute to value.
+// Returns a batched matrix tensor with new batched diagonal values.
 //
-// value: If True, updating of the var, accum and update_accum tensors will be protected by
-// a lock; otherwise the behavior is undefined, but may exhibit less contention.
-// If not specified, defaults to false
-func ResourceApplyAdadeltaUseLocking(value bool) ResourceApplyAdadeltaAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
-	}
-}
-
-// Update '*var' according to the adadelta scheme.
+// Given `input` and `diagonal`, this operation returns a tensor with the
+// same shape and values as `input`, except for the main diagonal of the
+// innermost matrices.  These will be overwritten by the values in `diagonal`.
 //
-// accum = rho() * accum + (1 - rho()) * grad.square();
-// update = (update_accum + epsilon).sqrt() * (accum + epsilon()).rsqrt() * grad;
-// update_accum = rho() * update_accum + (1 - rho()) * update.square();
-// var -= update;
+// The output is computed as follows:
+//
+// Assume `input` has `k+1` dimensions `[I, J, K, ..., M, N]` and `diagonal` has
+// `k` dimensions `[I, J, K, ..., min(M, N)]`.  Then the output is a
+// tensor of rank `k+1` with dimensions `[I, J, K, ..., M, N]` where:
+//
+//   * `output[i, j, k, ..., m, n] = diagonal[i, j, k, ..., n]` for `m == n`.
+//   * `output[i, j, k, ..., m, n] = input[i, j, k, ..., m, n]` for `m != n`.
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	accum_update: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	rho: Decay factor. Must be a scalar.
-//	epsilon: Constant factor. Must be a scalar.
-//	grad: The gradient.
+//	input: Rank `k+1`, where `k >= 1`.
+//	diagonal: Rank `k`, where `k >= 1`.
 //
-// Returns the created operation.
-func ResourceApplyAdadelta(scope *Scope, var_ tf.Output, accum tf.Output, accum_update tf.Output, lr tf.Output, rho tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyAdadeltaAttr) (o *tf.Operation) {
+// Returns Rank `k+1`, with `output.shape = input.shape`.
+func MatrixSetDiag(scope *Scope, input tf.Output, diagonal tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyAdadelta",
+		Type: "MatrixSetDiag",
 		Input: []tf.Input{
-			var_, accum, accum_update, lr, rho, epsilon, grad,
+			input, diagonal,
 		},
-		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// SqueezeAttr is an optional argument to Squeeze.
-type SqueezeAttr func(optionalAttr)
+// EditDistanceAttr is an optional argument to EditDistance.
+type EditDistanceAttr func(optionalAttr)
 
-// SqueezeAxis sets the optional axis attribute to value.
+// EditDistanceNormalize sets the optional normalize attribute to value.
 //
-// value: If specified, only squeezes the dimensions listed. The dimension
-// index starts at 0. It is an error to squeeze a dimension that is not 1. Must
-// be in the range `[-rank(input), rank(input))`.
-// If not specified, defaults to <>
+// value: boolean (if true, edit distances are normalized by length of truth).
 //
-// REQUIRES: len(value) >= 0
-func SqueezeAxis(value []int64) SqueezeAttr {
+// The output is:
+// If not specified, defaults to true
+func EditDistanceNormalize(value bool) EditDistanceAttr {
 	return func(m optionalAttr) {
-		m["squeeze_dims"] = value
+		m["normalize"] = value
 	}
 }
 
-// Removes dimensions of size 1 from the shape of a tensor.
+// Computes the (possibly normalized) Levenshtein Edit Distance.
 //
-// Given a tensor `input`, this operation returns a tensor of the same type with
-// all dimensions of size 1 removed. If you don't want to remove all size 1
-// dimensions, you can remove specific size 1 dimensions by specifying
-// `axis`.
+// The inputs are variable-length sequences provided by SparseTensors
+//   (hypothesis_indices, hypothesis_values, hypothesis_shape)
+// and
+//   (truth_indices, truth_values, truth_shape).
 //
-// For example:
+// The inputs are:
 //
-// ```
-// # 't' is a tensor of shape [1, 2, 1, 3, 1, 1]
-// shape(squeeze(t)) ==> [2, 3]
-// ```
+// Arguments:
+//	hypothesis_indices: The indices of the hypothesis list SparseTensor.
+// This is an N x R int64 matrix.
+//	hypothesis_values: The values of the hypothesis list SparseTensor.
+// This is an N-length vector.
+//	hypothesis_shape: The shape of the hypothesis list SparseTensor.
+// This is an R-length vector.
+//	truth_indices: The indices of the truth list SparseTensor.
+// This is an M x R int64 matrix.
+//	truth_values: The values of the truth list SparseTensor.
+// This is an M-length vector.
+//	truth_shape: The shape of the truth list SparseTensor. This is an R-length vector.
 //
-// Or, to remove specific size 1 dimensions:
+// Returns A dense float tensor with rank R - 1.
+//
+// For the example input:
+//
+//     // hypothesis represents a 2x1 matrix with variable-length values:
+//     //   (0,0) = ["a"]
+//     //   (1,0) = ["b"]
+//     hypothesis_indices = [[0, 0, 0],
+//                           [1, 0, 0]]
+//     hypothesis_values = ["a", "b"]
+//     hypothesis_shape = [2, 1, 1]
 //
-// ```
-// # 't' is a tensor of shape [1, 2, 1, 3, 1, 1]
-// shape(squeeze(t, [2, 4])) ==> [1, 2, 3, 1]
-// ```
+//     // truth represents a 2x2 matrix with variable-length values:
+//     //   (0,0) = []
+//     //   (0,1) = ["a"]
+//     //   (1,0) = ["b", "c"]
+//     //   (1,1) = ["a"]
+//     truth_indices = [[0, 1, 0],
+//                      [1, 0, 0],
+//                      [1, 0, 1],
+//                      [1, 1, 0]]
+//     truth_values = ["a", "b", "c", "a"]
+//     truth_shape = [2, 2, 2]
+//     normalize = true
 //
-// Arguments:
-//	input: The `input` to squeeze.
+// The output will be:
 //
-// Returns Contains the same data as `input`, but has one or more dimensions of
-// size 1 removed.
-func Squeeze(scope *Scope, input tf.Output, optional ...SqueezeAttr) (output tf.Output) {
+//     // output is a 2x2 matrix with edit distances normalized by truth lengths.
+//     output = [[inf, 1.0],  // (0,0): no truth, (0,1): no hypothesis
+//               [0.5, 1.0]]  // (1,0): addition, (1,1): no hypothesis
+func EditDistance(scope *Scope, hypothesis_indices tf.Output, hypothesis_values tf.Output, hypothesis_shape tf.Output, truth_indices tf.Output, truth_values tf.Output, truth_shape tf.Output, optional ...EditDistanceAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -27096,9 +27043,9 @@ func Squeeze(scope *Scope, input tf.Output, optional ...SqueezeAttr) (output tf.
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Squeeze",
+		Type: "EditDistance",
 		Input: []tf.Input{
-			input,
+			hypothesis_indices, hypothesis_values, hypothesis_shape, truth_indices, truth_values, truth_shape,
 		},
 		Attrs: attrs,
 	}
@@ -27106,235 +27053,235 @@ func Squeeze(scope *Scope, input tf.Output, optional ...SqueezeAttr) (output tf.
 	return op.Output(0)
 }
 
-// SpaceToBatch for N-D tensors of type T.
-//
-// This operation divides "spatial" dimensions `[1, ..., M]` of the input into a
-// grid of blocks of shape `block_shape`, and interleaves these blocks with the
-// "batch" dimension (0) such that in the output, the spatial dimensions
-// `[1, ..., M]` correspond to the position within the grid, and the batch
-// dimension combines both the position within a spatial block and the original
-// batch position.  Prior to division into blocks, the spatial dimensions of the
-// input are optionally zero padded according to `paddings`.  See below for a
-// precise description.
+// Gather slices from `params` into a Tensor with shape specified by `indices`.
 //
-// Arguments:
-//	input: N-D with shape `input_shape = [batch] + spatial_shape + remaining_shape`,
-// where spatial_shape has `M` dimensions.
-//	block_shape: 1-D with shape `[M]`, all values must be >= 1.
-//	paddings: 2-D with shape `[M, 2]`, all values must be >= 0.
-//   `paddings[i] = [pad_start, pad_end]` specifies the padding for input dimension
-//   `i + 1`, which corresponds to spatial dimension `i`.  It is required that
-//   `block_shape[i]` divides `input_shape[i + 1] + pad_start + pad_end`.
+// `indices` is a K-dimensional integer tensor, best thought of as a
+// (K-1)-dimensional tensor of indices into `params`, where each element defines a
+// slice of `params`:
 //
-// This operation is equivalent to the following steps:
+//     output[i_0, ..., i_{K-2}] = params[indices[i_0, ..., i_{K-2}]]
 //
-// 1. Zero-pad the start and end of dimensions `[1, ..., M]` of the
-//    input according to `paddings` to produce `padded` of shape `padded_shape`.
+// Whereas in `tf.gather`, `indices` defines slices into the first
+// dimension of `params`, in `tf.gather_nd`, `indices` defines slices into the
+// first `N` dimensions of `params`, where `N = indices.shape[-1]`.
 //
-// 2. Reshape `padded` to `reshaped_padded` of shape:
+// The last dimension of `indices` can be at most the rank of
+// `params`:
 //
-//      [batch] +
-//      [padded_shape[1] / block_shape[0],
-//        block_shape[0],
-//       ...,
-//       padded_shape[M] / block_shape[M-1],
-//       block_shape[M-1]] +
-//      remaining_shape
+//     indices.shape[-1] <= params.rank
 //
-// 3. Permute dimensions of `reshaped_padded` to produce
-//    `permuted_reshaped_padded` of shape:
+// The last dimension of `indices` corresponds to elements
+// (if `indices.shape[-1] == params.rank`) or slices
+// (if `indices.shape[-1] < params.rank`) along dimension `indices.shape[-1]`
+// of `params`.  The output tensor has shape
 //
-//      block_shape +
-//      [batch] +
-//      [padded_shape[1] / block_shape[0],
-//       ...,
-//       padded_shape[M] / block_shape[M-1]] +
-//      remaining_shape
+//     indices.shape[:-1] + params.shape[indices.shape[-1]:]
 //
-// 4. Reshape `permuted_reshaped_padded` to flatten `block_shape` into the batch
-//    dimension, producing an output tensor of shape:
+// Some examples follow.
 //
-//      [batch * prod(block_shape)] +
-//      [padded_shape[1] / block_shape[0],
-//       ...,
-//       padded_shape[M] / block_shape[M-1]] +
-//      remaining_shape
+// Simple indexing into a matrix:
 //
-// Some examples:
+// ```python
+//     indices = [[0, 0], [1, 1]]
+//     params = [['a', 'b'], ['c', 'd']]
+//     output = ['a', 'd']
+// ```
 //
-// (1) For the following input of shape `[1, 2, 2, 1]`, `block_shape = [2, 2]`, and
-//     `paddings = [[0, 0], [0, 0]]`:
+// Slice indexing into a matrix:
 //
-// ```
-// x = [[[[1], [2]], [[3], [4]]]]
+// ```python
+//     indices = [[1], [0]]
+//     params = [['a', 'b'], ['c', 'd']]
+//     output = [['c', 'd'], ['a', 'b']]
 // ```
 //
-// The output tensor has shape `[4, 1, 1, 1]` and value:
+// Indexing into a 3-tensor:
 //
-// ```
-// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
-// ```
+// ```python
+//     indices = [[1]]
+//     params = [[['a0', 'b0'], ['c0', 'd0']],
+//               [['a1', 'b1'], ['c1', 'd1']]]
+//     output = [[['a1', 'b1'], ['c1', 'd1']]]
 //
-// (2) For the following input of shape `[1, 2, 2, 3]`, `block_shape = [2, 2]`, and
-//     `paddings = [[0, 0], [0, 0]]`:
 //
-// ```
-// x = [[[[1, 2, 3], [4, 5, 6]],
-//       [[7, 8, 9], [10, 11, 12]]]]
-// ```
+//     indices = [[0, 1], [1, 0]]
+//     params = [[['a0', 'b0'], ['c0', 'd0']],
+//               [['a1', 'b1'], ['c1', 'd1']]]
+//     output = [['c0', 'd0'], ['a1', 'b1']]
 //
-// The output tensor has shape `[4, 1, 1, 3]` and value:
 //
-// ```
-// [[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[10, 11, 12]]]
+//     indices = [[0, 0, 1], [1, 0, 1]]
+//     params = [[['a0', 'b0'], ['c0', 'd0']],
+//               [['a1', 'b1'], ['c1', 'd1']]]
+//     output = ['b0', 'b1']
 // ```
 //
-// (3) For the following input of shape `[1, 4, 4, 1]`, `block_shape = [2, 2]`, and
-//     `paddings = [[0, 0], [0, 0]]`:
+// Batched indexing into a matrix:
 //
-// ```
-// x = [[[[1],   [2],  [3],  [4]],
-//       [[5],   [6],  [7],  [8]],
-//       [[9],  [10], [11],  [12]],
-//       [[13], [14], [15],  [16]]]]
+// ```python
+//     indices = [[[0, 0]], [[0, 1]]]
+//     params = [['a', 'b'], ['c', 'd']]
+//     output = [['a'], ['b']]
 // ```
 //
-// The output tensor has shape `[4, 2, 2, 1]` and value:
+// Batched slice indexing into a matrix:
 //
-// ```
-// x = [[[[1], [3]], [[9], [11]]],
-//      [[[2], [4]], [[10], [12]]],
-//      [[[5], [7]], [[13], [15]]],
-//      [[[6], [8]], [[14], [16]]]]
+// ```python
+//     indices = [[[1]], [[0]]]
+//     params = [['a', 'b'], ['c', 'd']]
+//     output = [[['c', 'd']], [['a', 'b']]]
 // ```
 //
-// (4) For the following input of shape `[2, 2, 4, 1]`, block_shape = `[2, 2]`, and
-//     paddings = `[[0, 0], [2, 0]]`:
+// Batched indexing into a 3-tensor:
 //
-// ```
-// x = [[[[1],   [2],  [3],  [4]],
-//       [[5],   [6],  [7],  [8]]],
-//      [[[9],  [10], [11],  [12]],
-//       [[13], [14], [15],  [16]]]]
-// ```
+// ```python
+//     indices = [[[1]], [[0]]]
+//     params = [[['a0', 'b0'], ['c0', 'd0']],
+//               [['a1', 'b1'], ['c1', 'd1']]]
+//     output = [[[['a1', 'b1'], ['c1', 'd1']]],
+//               [[['a0', 'b0'], ['c0', 'd0']]]]
 //
-// The output tensor has shape `[8, 1, 3, 1]` and value:
+//     indices = [[[0, 1], [1, 0]], [[0, 0], [1, 1]]]
+//     params = [[['a0', 'b0'], ['c0', 'd0']],
+//               [['a1', 'b1'], ['c1', 'd1']]]
+//     output = [[['c0', 'd0'], ['a1', 'b1']],
+//               [['a0', 'b0'], ['c1', 'd1']]]
 //
-// ```
-// x = [[[[0], [1], [3]]], [[[0], [9], [11]]],
-//      [[[0], [2], [4]]], [[[0], [10], [12]]],
-//      [[[0], [5], [7]]], [[[0], [13], [15]]],
-//      [[[0], [6], [8]]], [[[0], [14], [16]]]]
+//
+//     indices = [[[0, 0, 1], [1, 0, 1]], [[0, 1, 1], [1, 1, 0]]]
+//     params = [[['a0', 'b0'], ['c0', 'd0']],
+//               [['a1', 'b1'], ['c1', 'd1']]]
+//     output = [['b0', 'b1'], ['d0', 'c1']]
 // ```
 //
-// Among others, this operation is useful for reducing atrous convolution into
-// regular convolution.
-func SpaceToBatchND(scope *Scope, input tf.Output, block_shape tf.Output, paddings tf.Output) (output tf.Output) {
+// Arguments:
+//	params: The tensor from which to gather values.
+//	indices: Index tensor.
+//
+// Returns Values from `params` gathered from indices given by `indices`, with
+// shape `indices.shape[:-1] + params.shape[indices.shape[-1]:]`.
+func GatherNd(scope *Scope, params tf.Output, indices tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "GatherNd",
+		Input: []tf.Input{
+			params, indices,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Eagerly executes a Python function to compute func(input)->output.
+//
+// The semantics of the input, output, and attributes are the same as those for
+// PyFunc.
+func EagerPyFunc(scope *Scope, input []tf.Output, token string, Tout []tf.DataType) (output []tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"token": token, "Tout": Tout}
+	opspec := tf.OpSpec{
+		Type: "EagerPyFunc",
+		Input: []tf.Input{
+			tf.OutputList(input),
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
+		scope.UpdateErr("EagerPyFunc", err)
+		return
+	}
+	return output
+}
+
+// Stops gradient computation.
+//
+// When executed in a graph, this op outputs its input tensor as-is.
+//
+// When building ops to compute gradients, this op prevents the contribution of
+// its inputs from being taken into account.  Normally, the gradient generator
+// adds ops to a graph to compute the derivatives of a specified 'loss' by
+// recursively finding out inputs that contributed to its computation.  If you
+// insert this op in the graph, its inputs are masked from the gradient
+// generator.  They are not taken into account for computing gradients.
+//
+// This is useful any time you want to compute a value with TensorFlow but need
+// to pretend that the value was a constant. Some examples include:
+//
+// *  The *EM* algorithm where the *M-step* should not involve backpropagation
+//    through the output of the *E-step*.
+// *  Contrastive divergence training of Boltzmann machines where, when
+//    differentiating the energy function, the training must not backpropagate
+//    through the graph that generated the samples from the model.
+// *  Adversarial training, where no backprop should happen through the adversarial
+//    example generation process.
+func StopGradient(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SpaceToBatchND",
+		Type: "StopGradient",
 		Input: []tf.Input{
-			input, block_shape, paddings,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// QuantizeAndDequantizeV2Attr is an optional argument to QuantizeAndDequantizeV2.
-type QuantizeAndDequantizeV2Attr func(optionalAttr)
-
-// QuantizeAndDequantizeV2SignedInput sets the optional signed_input attribute to value.
-//
-// value: If the quantization is signed or unsigned.
-// If not specified, defaults to true
-func QuantizeAndDequantizeV2SignedInput(value bool) QuantizeAndDequantizeV2Attr {
-	return func(m optionalAttr) {
-		m["signed_input"] = value
+// Computes asin of x element-wise.
+func Asin(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// QuantizeAndDequantizeV2NumBits sets the optional num_bits attribute to value.
-//
-// value: The bitwidth of the quantization.
-// If not specified, defaults to 8
-func QuantizeAndDequantizeV2NumBits(value int64) QuantizeAndDequantizeV2Attr {
-	return func(m optionalAttr) {
-		m["num_bits"] = value
+	opspec := tf.OpSpec{
+		Type: "Asin",
+		Input: []tf.Input{
+			x,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// QuantizeAndDequantizeV2RangeGiven sets the optional range_given attribute to value.
+// PreventGradientAttr is an optional argument to PreventGradient.
+type PreventGradientAttr func(optionalAttr)
+
+// PreventGradientMessage sets the optional message attribute to value.
 //
-// value: If the range is given or should be computed from the tensor.
-// If not specified, defaults to false
-func QuantizeAndDequantizeV2RangeGiven(value bool) QuantizeAndDequantizeV2Attr {
+// value: Will be printed in the error when anyone tries to differentiate
+// this operation.
+// If not specified, defaults to ""
+func PreventGradientMessage(value string) PreventGradientAttr {
 	return func(m optionalAttr) {
-		m["range_given"] = value
+		m["message"] = value
 	}
 }
 
-// Quantizes then dequantizes a tensor.
-//
-// This op simulates the precision loss from the quantized forward pass by:
-// 1. Quantizing the tensor to fixed point numbers, which should match the target
-//    quantization method when it is used in inference.
-// 2. Dequantizing it back to floating point numbers for the following ops, most
-//    likely matmul.
-//
-// There are different ways to quantize. This version does not use the full range
-// of the output type, choosing to elide the lowest possible value for symmetry
-// (e.g., output range is -127 to 127, not -128 to 127 for signed 8 bit
-// quantization), so that 0.0 maps to 0.
-//
-// To perform this op, we first find the range of values in our tensor. The range
-// we use is always centered on 0, so we find m such that
-//
-// 1. m = max(abs(input_min), abs(input_max)) if range_given is true,
-// 2. m = max(abs(min_elem(input)), abs(max_elem(input))) otherwise.
-//
-// Our input tensor range is then [-m, m].
-//
-// Next, we choose our fixed-point quantization buckets, [min_fixed, max_fixed].
-// If signed_input is true, this is
-//
-//   [min_fixed, max_fixed ] =
-//       [-(1 << (num_bits - 1) - 1), (1 << (num_bits - 1)) - 1].
-//
-// Otherwise, if signed_input is false, the fixed-point range is
-//
-//   [min_fixed, max_fixed] = [0, (1 << num_bits) - 1].
-//
-// From this we compute our scaling factor, s:
-//
-//   s = (max_fixed - min_fixed) / (2 * m).
-//
-// Now we can quantize and dequantize the elements of our tensor.  An element e
-// is transformed into e':
-//
-//   e' = (e * s).round_to_nearest() / s.
-//
-// Note that we have a different number of buckets in the signed vs. unsigned
-// cases.  For example, if num_bits == 8, we get 254 buckets in the signed case
-// vs. 255 in the unsigned case.
-//
-// For example, suppose num_bits = 8 and m = 1.  Then
+// An identity op that triggers an error if a gradient is requested.
 //
-//   [min_fixed, max_fixed] = [-127, 127], and
-//   s = (127 + 127) / 2 = 127.
+// When executed in a graph, this op outputs its input tensor as-is.
 //
-// Given the vector {-1, -0.5, 0, 0.3}, this is quantized to
-// {-127, -63, 0, 38}, and dequantized to {-1, -63.0/127, 0, 38.0/127}.
+// When building ops to compute gradients, the TensorFlow gradient system
+// will return an error when trying to look up the gradient of this op,
+// because no gradient must ever be registered for this function.  This
+// op exists to prevent subtle bugs from silently returning unimplemented
+// gradients in some corner cases.
 //
 // Arguments:
-//	input: Tensor to quantize and then dequantize.
-//	input_min: If range_given, this is the min of the range, otherwise this input
-// will be ignored.
-//	input_max: If range_given, this is the max of the range, otherwise this input
-// will be ignored.
-func QuantizeAndDequantizeV2(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, optional ...QuantizeAndDequantizeV2Attr) (output tf.Output) {
+//	input: any tensor.
+//
+// Returns the same input tensor.
+func PreventGradient(scope *Scope, input tf.Output, optional ...PreventGradientAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -27343,9 +27290,9 @@ func QuantizeAndDequantizeV2(scope *Scope, input tf.Output, input_min tf.Output,
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QuantizeAndDequantizeV2",
+		Type: "PreventGradient",
 		Input: []tf.Input{
-			input, input_min, input_max,
+			input,
 		},
 		Attrs: attrs,
 	}
@@ -27353,113 +27300,23 @@ func QuantizeAndDequantizeV2(scope *Scope, input tf.Output, input_min tf.Output,
 	return op.Output(0)
 }
 
-// SpaceToBatch for 4-D tensors of type T.
-//
-// This is a legacy version of the more general SpaceToBatchND.
+// Checks a tensor for NaN and Inf values.
 //
-// Zero-pads and then rearranges (permutes) blocks of spatial data into batch.
-// More specifically, this op outputs a copy of the input tensor where values from
-// the `height` and `width` dimensions are moved to the `batch` dimension. After
-// the zero-padding, both `height` and `width` of the input must be divisible by the
-// block size.
+// When run, reports an `InvalidArgument` error if `tensor` has any values
+// that are not a number (NaN) or infinity (Inf). Otherwise, passes `tensor` as-is.
 //
 // Arguments:
-//	input: 4-D with shape `[batch, height, width, depth]`.
-//	paddings: 2-D tensor of non-negative integers with shape `[2, 2]`. It specifies
-//   the padding of the input with zeros across the spatial dimensions as follows:
-//
-//       paddings = [[pad_top, pad_bottom], [pad_left, pad_right]]
-//
-//   The effective spatial dimensions of the zero-padded input tensor will be:
-//
-//       height_pad = pad_top + height + pad_bottom
-//       width_pad = pad_left + width + pad_right
-//
-// The attr `block_size` must be greater than one. It indicates the block size.
-//
-//   * Non-overlapping blocks of size `block_size x block size` in the height and
-//     width dimensions are rearranged into the batch dimension at each location.
-//   * The batch of the output tensor is `batch * block_size * block_size`.
-//   * Both height_pad and width_pad must be divisible by block_size.
-//
-// The shape of the output will be:
-//
-//     [batch*block_size*block_size, height_pad/block_size, width_pad/block_size,
-//      depth]
-//
-// Some examples:
-//
-// (1) For the following input of shape `[1, 2, 2, 1]` and block_size of 2:
-//
-// ```
-// x = [[[[1], [2]], [[3], [4]]]]
-// ```
-//
-// The output tensor has shape `[4, 1, 1, 1]` and value:
-//
-// ```
-// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
-// ```
-//
-// (2) For the following input of shape `[1, 2, 2, 3]` and block_size of 2:
-//
-// ```
-// x = [[[[1, 2, 3], [4, 5, 6]],
-//       [[7, 8, 9], [10, 11, 12]]]]
-// ```
-//
-// The output tensor has shape `[4, 1, 1, 3]` and value:
-//
-// ```
-// [[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[10, 11, 12]]]
-// ```
-//
-// (3) For the following input of shape `[1, 4, 4, 1]` and block_size of 2:
-//
-// ```
-// x = [[[[1],   [2],  [3],  [4]],
-//       [[5],   [6],  [7],  [8]],
-//       [[9],  [10], [11],  [12]],
-//       [[13], [14], [15],  [16]]]]
-// ```
-//
-// The output tensor has shape `[4, 2, 2, 1]` and value:
-//
-// ```
-// x = [[[[1], [3]], [[9], [11]]],
-//      [[[2], [4]], [[10], [12]]],
-//      [[[5], [7]], [[13], [15]]],
-//      [[[6], [8]], [[14], [16]]]]
-// ```
-//
-// (4) For the following input of shape `[2, 2, 4, 1]` and block_size of 2:
-//
-// ```
-// x = [[[[1],   [2],  [3],  [4]],
-//       [[5],   [6],  [7],  [8]]],
-//      [[[9],  [10], [11],  [12]],
-//       [[13], [14], [15],  [16]]]]
-// ```
-//
-// The output tensor has shape `[8, 1, 2, 1]` and value:
-//
-// ```
-// x = [[[[1], [3]]], [[[9], [11]]], [[[2], [4]]], [[[10], [12]]],
-//      [[[5], [7]]], [[[13], [15]]], [[[6], [8]]], [[[14], [16]]]]
-// ```
 //
-// Among others, this operation is useful for reducing atrous convolution into
-// regular convolution.
-//
-func SpaceToBatch(scope *Scope, input tf.Output, paddings tf.Output, block_size int64) (output tf.Output) {
+//	message: Prefix of the error message.
+func CheckNumerics(scope *Scope, tensor tf.Output, message string) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"block_size": block_size}
+	attrs := map[string]interface{}{"message": message}
 	opspec := tf.OpSpec{
-		Type: "SpaceToBatch",
+		Type: "CheckNumerics",
 		Input: []tf.Input{
-			input, paddings,
+			tensor,
 		},
 		Attrs: attrs,
 	}
@@ -27467,285 +27324,176 @@ func SpaceToBatch(scope *Scope, input tf.Output, paddings tf.Output, block_size
 	return op.Output(0)
 }
 
-// UnpackAttr is an optional argument to Unpack.
-type UnpackAttr func(optionalAttr)
-
-// UnpackAxis sets the optional axis attribute to value.
+// Shuffle dimensions of x according to a permutation and conjugate the result.
 //
-// value: Dimension along which to unpack.  Negative values wrap around, so the
-// valid range is `[-R, R)`.
-// If not specified, defaults to 0
-func UnpackAxis(value int64) UnpackAttr {
+// The output `y` has the same rank as `x`. The shapes of `x` and `y` satisfy:
+//   `y.shape[i] == x.shape[perm[i]] for i in [0, 1, ..., rank(x) - 1]`
+//   `y[i,j,k,...,s,t,u] == conj(x[perm[i], perm[j], perm[k],...,perm[s], perm[t], perm[u]])`
+func ConjugateTranspose(scope *Scope, x tf.Output, perm tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ConjugateTranspose",
+		Input: []tf.Input{
+			x, perm,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// UniqueV2Attr is an optional argument to UniqueV2.
+type UniqueV2Attr func(optionalAttr)
+
+// UniqueV2OutIdx sets the optional out_idx attribute to value.
+// If not specified, defaults to DT_INT32
+func UniqueV2OutIdx(value tf.DataType) UniqueV2Attr {
 	return func(m optionalAttr) {
-		m["axis"] = value
+		m["out_idx"] = value
 	}
 }
 
-// Unpacks a given dimension of a rank-`R` tensor into `num` rank-`(R-1)` tensors.
+// Finds unique elements in a 1-D tensor.
 //
-// Unpacks `num` tensors from `value` by chipping it along the `axis` dimension.
-// For example, given a tensor of shape `(A, B, C, D)`;
+// This operation returns a tensor `y` containing all of the unique elements of `x`
+// sorted in the same order that they occur in `x`. This operation also returns a
+// tensor `idx` the same size as `x` that contains the index of each value of `x`
+// in the unique output `y`. In other words:
 //
-// If `axis == 0` then the i'th tensor in `output` is the slice `value[i, :, :, :]`
-//   and each tensor in `output` will have shape `(B, C, D)`. (Note that the
-//   dimension unpacked along is gone, unlike `split`).
+// `y[idx[i]] = x[i] for i in [0, 1,...,size(x) - 1]`
 //
-// If `axis == 1` then the i'th tensor in `output` is the slice `value[:, i, :, :]`
-//   and each tensor in `output` will have shape `(A, C, D)`.
-// Etc.
+// For example:
 //
-// This is the opposite of `pack`.
+// ```
+// # tensor 'x' is [1, 1, 2, 4, 4, 4, 7, 8, 8]
+// y, idx = unique(x)
+// y ==> [1, 2, 4, 7, 8]
+// idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
+// ```
 //
 // Arguments:
-//	value: 1-D or higher, with `axis` dimension size equal to `num`.
-//
+//	x: A `Tensor`.
+//	axis: A `Tensor` of type `int64` (default: 0). The axis of the Tensor to
+// find the unique elements.
 //
-// Returns The list of tensors unpacked from `value`.
-func Unpack(scope *Scope, value tf.Output, num int64, optional ...UnpackAttr) (output []tf.Output) {
+// Returns A `Tensor`. Unique elements along the `axis` of `Tensor` x. A 1-D Tensor.
+// Has the same type as x that contains the index of each value of x in the output y.
+func UniqueV2(scope *Scope, x tf.Output, axis tf.Output, optional ...UniqueV2Attr) (y tf.Output, idx tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num": num}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Unpack",
+		Type: "UniqueV2",
 		Input: []tf.Input{
-			value,
+			x, axis,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("Unpack", err)
-		return
-	}
-	return output
+	return op.Output(0), op.Output(1)
 }
 
-// Increments variable pointed to by 'resource' until it reaches 'limit'.
+// Return a slice from 'input'.
 //
-// Arguments:
-//	resource: Should be from a scalar `Variable` node.
-//	limit: If incrementing ref would bring it above limit, instead generates an
-// 'OutOfRange' error.
+// The output tensor is a tensor with dimensions described by 'size'
+// whose values are extracted from 'input' starting at the offsets in
+// 'begin'.
 //
+// *Requirements*:
+//   0 <= begin[i] <= begin[i] + size[i] <= Di  for i in [0, n)
 //
-// Returns A copy of the input before increment. If nothing else modifies the
-// input, the values produced will all be distinct.
-func ResourceCountUpTo(scope *Scope, resource tf.Output, limit int64, T tf.DataType) (output tf.Output) {
+// Arguments:
+//
+//	begin: begin[i] specifies the offset into the 'i'th dimension of
+// 'input' to slice from.
+//	size: size[i] specifies the number of elements of the 'i'th dimension
+// of 'input' to slice. If size[i] is -1, all remaining elements in dimension
+// i are included in the slice (i.e. this is equivalent to setting
+// size[i] = input.dim_size(i) - begin[i]).
+func Slice(scope *Scope, input tf.Output, begin tf.Output, size tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"limit": limit, "T": T}
 	opspec := tf.OpSpec{
-		Type: "ResourceCountUpTo",
+		Type: "Slice",
 		Input: []tf.Input{
-			resource,
+			input, begin, size,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Delete the stack from its resource container.
-//
-// Arguments:
-//	handle: The handle to a stack.
-//
-// Returns the created operation.
-func StackCloseV2(scope *Scope, handle tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
+// StridedSliceGradAttr is an optional argument to StridedSliceGrad.
+type StridedSliceGradAttr func(optionalAttr)
+
+// StridedSliceGradBeginMask sets the optional begin_mask attribute to value.
+// If not specified, defaults to 0
+func StridedSliceGradBeginMask(value int64) StridedSliceGradAttr {
+	return func(m optionalAttr) {
+		m["begin_mask"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "StackCloseV2",
-		Input: []tf.Input{
-			handle,
-		},
+}
+
+// StridedSliceGradEndMask sets the optional end_mask attribute to value.
+// If not specified, defaults to 0
+func StridedSliceGradEndMask(value int64) StridedSliceGradAttr {
+	return func(m optionalAttr) {
+		m["end_mask"] = value
 	}
-	return scope.AddOperation(opspec)
 }
 
-// BatchToSpace for N-D tensors of type T.
-//
-// This operation reshapes the "batch" dimension 0 into `M + 1` dimensions of shape
-// `block_shape + [batch]`, interleaves these blocks back into the grid defined by
-// the spatial dimensions `[1, ..., M]`, to obtain a result with the same rank as
-// the input.  The spatial dimensions of this intermediate result are then
-// optionally cropped according to `crops` to produce the output.  This is the
-// reverse of SpaceToBatch.  See below for a precise description.
-//
-// Arguments:
-//	input: N-D with shape `input_shape = [batch] + spatial_shape + remaining_shape`,
-// where spatial_shape has M dimensions.
-//	block_shape: 1-D with shape `[M]`, all values must be >= 1.
-//	crops: 2-D with shape `[M, 2]`, all values must be >= 0.
-//   `crops[i] = [crop_start, crop_end]` specifies the amount to crop from input
-//   dimension `i + 1`, which corresponds to spatial dimension `i`.  It is
-//   required that
-//   `crop_start[i] + crop_end[i] <= block_shape[i] * input_shape[i + 1]`.
-//
-// This operation is equivalent to the following steps:
-//
-// 1. Reshape `input` to `reshaped` of shape:
-//      [block_shape[0], ..., block_shape[M-1],
-//       batch / prod(block_shape),
-//       input_shape[1], ..., input_shape[N-1]]
-//
-// 2. Permute dimensions of `reshaped` to produce `permuted` of shape
-//      [batch / prod(block_shape),
-//
-//       input_shape[1], block_shape[0],
-//       ...,
-//       input_shape[M], block_shape[M-1],
-//
-//       input_shape[M+1], ..., input_shape[N-1]]
-//
-// 3. Reshape `permuted` to produce `reshaped_permuted` of shape
-//      [batch / prod(block_shape),
-//
-//       input_shape[1] * block_shape[0],
-//       ...,
-//       input_shape[M] * block_shape[M-1],
-//
-//       input_shape[M+1],
-//       ...,
-//       input_shape[N-1]]
-//
-// 4. Crop the start and end of dimensions `[1, ..., M]` of
-//    `reshaped_permuted` according to `crops` to produce the output of shape:
-//      [batch / prod(block_shape),
-//
-//       input_shape[1] * block_shape[0] - crops[0,0] - crops[0,1],
-//       ...,
-//       input_shape[M] * block_shape[M-1] - crops[M-1,0] - crops[M-1,1],
-//
-//       input_shape[M+1], ..., input_shape[N-1]]
-//
-// Some examples:
-//
-// (1) For the following input of shape `[4, 1, 1, 1]`, `block_shape = [2, 2]`, and
-//     `crops = [[0, 0], [0, 0]]`:
-//
-// ```
-// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
-// ```
-//
-// The output tensor has shape `[1, 2, 2, 1]` and value:
-//
-// ```
-// x = [[[[1], [2]], [[3], [4]]]]
-// ```
-//
-// (2) For the following input of shape `[4, 1, 1, 3]`, `block_shape = [2, 2]`, and
-//     `crops = [[0, 0], [0, 0]]`:
-//
-// ```
-// [[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[10, 11, 12]]]
-// ```
-//
-// The output tensor has shape `[1, 2, 2, 3]` and value:
-//
-// ```
-// x = [[[[1, 2, 3], [4, 5, 6]],
-//       [[7, 8, 9], [10, 11, 12]]]]
-// ```
-//
-// (3) For the following input of shape `[4, 2, 2, 1]`, `block_shape = [2, 2]`, and
-//     `crops = [[0, 0], [0, 0]]`:
-//
-// ```
-// x = [[[[1], [3]], [[9], [11]]],
-//      [[[2], [4]], [[10], [12]]],
-//      [[[5], [7]], [[13], [15]]],
-//      [[[6], [8]], [[14], [16]]]]
-// ```
-//
-// The output tensor has shape `[1, 4, 4, 1]` and value:
-//
-// ```
-// x = [[[1],   [2],  [3],  [4]],
-//      [[5],   [6],  [7],  [8]],
-//      [[9],  [10], [11],  [12]],
-//      [[13], [14], [15],  [16]]]
-// ```
-//
-// (4) For the following input of shape `[8, 1, 3, 1]`, `block_shape = [2, 2]`, and
-//     `crops = [[0, 0], [2, 0]]`:
-//
-// ```
-// x = [[[[0], [1], [3]]], [[[0], [9], [11]]],
-//      [[[0], [2], [4]]], [[[0], [10], [12]]],
-//      [[[0], [5], [7]]], [[[0], [13], [15]]],
-//      [[[0], [6], [8]]], [[[0], [14], [16]]]]
-// ```
-//
-// The output tensor has shape `[2, 2, 4, 1]` and value:
-//
-// ```
-// x = [[[[1],   [2],  [3],  [4]],
-//       [[5],   [6],  [7],  [8]]],
-//      [[[9],  [10], [11],  [12]],
-//       [[13], [14], [15],  [16]]]]
-// ```
-func BatchToSpaceND(scope *Scope, input tf.Output, block_shape tf.Output, crops tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
+// StridedSliceGradEllipsisMask sets the optional ellipsis_mask attribute to value.
+// If not specified, defaults to 0
+func StridedSliceGradEllipsisMask(value int64) StridedSliceGradAttr {
+	return func(m optionalAttr) {
+		m["ellipsis_mask"] = value
+	}
+}
+
+// StridedSliceGradNewAxisMask sets the optional new_axis_mask attribute to value.
+// If not specified, defaults to 0
+func StridedSliceGradNewAxisMask(value int64) StridedSliceGradAttr {
+	return func(m optionalAttr) {
+		m["new_axis_mask"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "BatchToSpaceND",
-		Input: []tf.Input{
-			input, block_shape, crops,
-		},
+}
+
+// StridedSliceGradShrinkAxisMask sets the optional shrink_axis_mask attribute to value.
+// If not specified, defaults to 0
+func StridedSliceGradShrinkAxisMask(value int64) StridedSliceGradAttr {
+	return func(m optionalAttr) {
+		m["shrink_axis_mask"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Extract `patches` from `images` and put them in the "depth" output dimension.
-//
-// Arguments:
-//	images: 4-D Tensor with shape `[batch, in_rows, in_cols, depth]`.
-//	ksizes: The size of the sliding window for each dimension of `images`.
-//	strides: 1-D of length 4. How far the centers of two consecutive patches are in
-// the images. Must be: `[1, stride_rows, stride_cols, 1]`.
-//	rates: 1-D of length 4. Must be: `[1, rate_rows, rate_cols, 1]`. This is the
-// input stride, specifying how far two consecutive patch samples are in the
-// input. Equivalent to extracting patches with
-// `patch_sizes_eff = patch_sizes + (patch_sizes - 1) * (rates - 1)`, followed by
-// subsampling them spatially by a factor of `rates`. This is equivalent to
-// `rate` in dilated (a.k.a. Atrous) convolutions.
-//	padding: The type of padding algorithm to use.
-//
-// We specify the size-related attributes as:
+// Returns the gradient of `StridedSlice`.
 //
-// ```python
-//       ksizes = [1, ksize_rows, ksize_cols, 1]
-//       strides = [1, strides_rows, strides_cols, 1]
-//       rates = [1, rates_rows, rates_cols, 1]
-// ```
+// Since `StridedSlice` cuts out pieces of its `input` which is size
+// `shape`, its gradient will have the same shape (which is passed here
+// as `shape`). The gradient will be zero in any element that the slice
+// does not select.
 //
-// Returns 4-D Tensor with shape `[batch, out_rows, out_cols, ksize_rows *
-// ksize_cols * depth]` containing image patches with size
-// `ksize_rows x ksize_cols x depth` vectorized in the "depth" dimension. Note
-// `out_rows` and `out_cols` are the dimensions of the output patches.
-func ExtractImagePatches(scope *Scope, images tf.Output, ksizes []int64, strides []int64, rates []int64, padding string) (patches tf.Output) {
+// Arguments are the same as StridedSlice with the exception that
+// `dy` is the input gradient to be propagated and `shape` is the
+// shape of `StridedSlice`'s `input`.
+func StridedSliceGrad(scope *Scope, shape tf.Output, begin tf.Output, end tf.Output, strides tf.Output, dy tf.Output, optional ...StridedSliceGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"ksizes": ksizes, "strides": strides, "rates": rates, "padding": padding}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ExtractImagePatches",
+		Type: "StridedSliceGrad",
 		Input: []tf.Input{
-			images,
+			shape, begin, end, strides, dy,
 		},
 		Attrs: attrs,
 	}
@@ -27753,148 +27501,74 @@ func ExtractImagePatches(scope *Scope, images tf.Output, ksizes []int64, strides
 	return op.Output(0)
 }
 
-// Bitcasts a tensor from one type to another without copying data.
-//
-// Given a tensor `input`, this operation returns a tensor that has the same buffer
-// data as `input` with datatype `type`.
-//
-// If the input datatype `T` is larger than the output datatype `type` then the
-// shape changes from [...] to [..., sizeof(`T`)/sizeof(`type`)].
+// Returns the gradient of `Tile`.
 //
-// If `T` is smaller than `type`, the operator requires that the rightmost
-// dimension be equal to sizeof(`type`)/sizeof(`T`). The shape then goes from
-// [..., sizeof(`type`)/sizeof(`T`)] to [...].
+// DEPRECATED at GraphDef version 3: TileGrad has been replaced with reduce_sum
 //
-// *NOTE*: Bitcast is implemented as a low-level cast, so machines with different
-// endian orderings will give different results.
-func Bitcast(scope *Scope, input tf.Output, type_ tf.DataType) (output tf.Output) {
+// Since `Tile` takes an input and repeats the input `multiples` times
+// along each dimension, `TileGrad` takes in `multiples` and aggregates
+// each repeated tile of `input` into `output`.
+func TileGrad(scope *Scope, input tf.Output, multiples tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"type": type_}
 	opspec := tf.OpSpec{
-		Type: "Bitcast",
+		Type: "TileGrad",
 		Input: []tf.Input{
-			input,
+			input, multiples,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// OneHotAttr is an optional argument to OneHot.
-type OneHotAttr func(optionalAttr)
+// QuantizeAndDequantizeAttr is an optional argument to QuantizeAndDequantize.
+type QuantizeAndDequantizeAttr func(optionalAttr)
 
-// OneHotAxis sets the optional axis attribute to value.
-//
-// value: The axis to fill (default: -1, a new inner-most axis).
-// If not specified, defaults to -1
-func OneHotAxis(value int64) OneHotAttr {
+// QuantizeAndDequantizeSignedInput sets the optional signed_input attribute to value.
+// If not specified, defaults to true
+func QuantizeAndDequantizeSignedInput(value bool) QuantizeAndDequantizeAttr {
 	return func(m optionalAttr) {
-		m["axis"] = value
+		m["signed_input"] = value
 	}
 }
 
-// Returns a one-hot tensor.
-//
-// The locations represented by indices in `indices` take value `on_value`,
-// while all other locations take value `off_value`.
-//
-// If the input `indices` is rank `N`, the output will have rank `N+1`,
-// The new axis is created at dimension `axis` (default: the new axis is
-// appended at the end).
-//
-// If `indices` is a scalar the output shape will be a vector of length `depth`.
-//
-// If `indices` is a vector of length `features`, the output shape will be:
-// ```
-//   features x depth if axis == -1
-//   depth x features if axis == 0
-// ```
-//
-// If `indices` is a matrix (batch) with shape `[batch, features]`,
-// the output shape will be:
-// ```
-//   batch x features x depth if axis == -1
-//   batch x depth x features if axis == 1
-//   depth x batch x features if axis == 0
-// ```
-//
-//
-// Examples
-// =========
-//
-// Suppose that
-//
-// ```
-//   indices = [0, 2, -1, 1]
-//   depth = 3
-//   on_value = 5.0
-//   off_value = 0.0
-//   axis = -1
-// ```
-//
-// Then output is `[4 x 3]`:
-//
-//     ```output =
-//       [5.0 0.0 0.0]  // one_hot(0)
-//       [0.0 0.0 5.0]  // one_hot(2)
-//       [0.0 0.0 0.0]  // one_hot(-1)
-//       [0.0 5.0 0.0]  // one_hot(1)
-//     ```
-//
-// Suppose that
-//
-// ```
-//   indices = [0, 2, -1, 1]
-//   depth = 3
-//   on_value = 0.0
-//   off_value = 3.0
-//   axis = 0
-// ```
-//
-// Then output is `[3 x 4]`:
-//
-//     ```output =
-//       [0.0 3.0 3.0 3.0]
-//       [3.0 3.0 3.0 0.0]
-//       [3.0 3.0 3.0 3.0]
-//       [3.0 0.0 3.0 3.0]
-//     //  ^                one_hot(0)
-//     //      ^            one_hot(2)
-//     //          ^        one_hot(-1)
-//     //              ^    one_hot(1)
-//     ```
-// Suppose that
-//
-// ```
-//   indices = [[0, 2], [1, -1]]
-//   depth = 3
-//   on_value = 1.0
-//   off_value = 0.0
-//   axis = -1
-// ```
-//
-// Then output is `[2 x 2 x 3]`:
-//
-//     ```output =
-//       [
-//         [1.0, 0.0, 0.0]  // one_hot(0)
-//         [0.0, 0.0, 1.0]  // one_hot(2)
-//       ][
-//         [0.0, 1.0, 0.0]  // one_hot(1)
-//         [0.0, 0.0, 0.0]  // one_hot(-1)
-//       ]```
-//
-// Arguments:
-//	indices: A tensor of indices.
-//	depth: A scalar defining the depth of the one hot dimension.
-//	on_value: A scalar defining the value to fill in output when `indices[j] = i`.
-//	off_value: A scalar defining the value to fill in output when `indices[j] != i`.
+// QuantizeAndDequantizeNumBits sets the optional num_bits attribute to value.
+// If not specified, defaults to 8
+func QuantizeAndDequantizeNumBits(value int64) QuantizeAndDequantizeAttr {
+	return func(m optionalAttr) {
+		m["num_bits"] = value
+	}
+}
+
+// QuantizeAndDequantizeRangeGiven sets the optional range_given attribute to value.
+// If not specified, defaults to false
+func QuantizeAndDequantizeRangeGiven(value bool) QuantizeAndDequantizeAttr {
+	return func(m optionalAttr) {
+		m["range_given"] = value
+	}
+}
+
+// QuantizeAndDequantizeInputMin sets the optional input_min attribute to value.
+// If not specified, defaults to 0
+func QuantizeAndDequantizeInputMin(value float32) QuantizeAndDequantizeAttr {
+	return func(m optionalAttr) {
+		m["input_min"] = value
+	}
+}
+
+// QuantizeAndDequantizeInputMax sets the optional input_max attribute to value.
+// If not specified, defaults to 0
+func QuantizeAndDequantizeInputMax(value float32) QuantizeAndDequantizeAttr {
+	return func(m optionalAttr) {
+		m["input_max"] = value
+	}
+}
+
+// Use QuantizeAndDequantizeV2 instead.
 //
-// Returns The one-hot tensor.
-func OneHot(scope *Scope, indices tf.Output, depth tf.Output, on_value tf.Output, off_value tf.Output, optional ...OneHotAttr) (output tf.Output) {
+// DEPRECATED at GraphDef version 22: Replaced by QuantizeAndDequantizeV2
+func QuantizeAndDequantize(scope *Scope, input tf.Output, optional ...QuantizeAndDequantizeAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -27903,9 +27577,9 @@ func OneHot(scope *Scope, indices tf.Output, depth tf.Output, on_value tf.Output
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "OneHot",
+		Type: "QuantizeAndDequantize",
 		Input: []tf.Input{
-			indices, depth, on_value, off_value,
+			input,
 		},
 		Attrs: attrs,
 	}
@@ -28032,66 +27706,52 @@ func QueueDequeueV2(scope *Scope, handle tf.Output, component_types []tf.DataTyp
 //                   [2, 1, 1]]
 // ```
 func Where(scope *Scope, condition tf.Output) (index tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Where",
-		Input: []tf.Input{
-			condition,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// QuantizeAndDequantizeAttr is an optional argument to QuantizeAndDequantize.
-type QuantizeAndDequantizeAttr func(optionalAttr)
-
-// QuantizeAndDequantizeSignedInput sets the optional signed_input attribute to value.
-// If not specified, defaults to true
-func QuantizeAndDequantizeSignedInput(value bool) QuantizeAndDequantizeAttr {
-	return func(m optionalAttr) {
-		m["signed_input"] = value
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// QuantizeAndDequantizeNumBits sets the optional num_bits attribute to value.
-// If not specified, defaults to 8
-func QuantizeAndDequantizeNumBits(value int64) QuantizeAndDequantizeAttr {
-	return func(m optionalAttr) {
-		m["num_bits"] = value
+	opspec := tf.OpSpec{
+		Type: "Where",
+		Input: []tf.Input{
+			condition,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// QuantizeAndDequantizeRangeGiven sets the optional range_given attribute to value.
-// If not specified, defaults to false
-func QuantizeAndDequantizeRangeGiven(value bool) QuantizeAndDequantizeAttr {
-	return func(m optionalAttr) {
-		m["range_given"] = value
-	}
-}
+// DataFormatDimMapAttr is an optional argument to DataFormatDimMap.
+type DataFormatDimMapAttr func(optionalAttr)
 
-// QuantizeAndDequantizeInputMin sets the optional input_min attribute to value.
-// If not specified, defaults to 0
-func QuantizeAndDequantizeInputMin(value float32) QuantizeAndDequantizeAttr {
+// DataFormatDimMapSrcFormat sets the optional src_format attribute to value.
+//
+// value: source data format.
+// If not specified, defaults to "NHWC"
+func DataFormatDimMapSrcFormat(value string) DataFormatDimMapAttr {
 	return func(m optionalAttr) {
-		m["input_min"] = value
+		m["src_format"] = value
 	}
 }
 
-// QuantizeAndDequantizeInputMax sets the optional input_max attribute to value.
-// If not specified, defaults to 0
-func QuantizeAndDequantizeInputMax(value float32) QuantizeAndDequantizeAttr {
+// DataFormatDimMapDstFormat sets the optional dst_format attribute to value.
+//
+// value: destination data format.
+// If not specified, defaults to "NCHW"
+func DataFormatDimMapDstFormat(value string) DataFormatDimMapAttr {
 	return func(m optionalAttr) {
-		m["input_max"] = value
+		m["dst_format"] = value
 	}
 }
 
-// Use QuantizeAndDequantizeV2 instead.
+// Returns the dimension index in the destination data format given the one in
 //
-// DEPRECATED at GraphDef version 22: Replaced by QuantizeAndDequantizeV2
-func QuantizeAndDequantize(scope *Scope, input tf.Output, optional ...QuantizeAndDequantizeAttr) (output tf.Output) {
+// the source data format.
+//
+// Arguments:
+//	x: A Tensor with each element as a dimension index in source data format.
+// Must be in the range [-4, 4).
+//
+// Returns A Tensor with each element as a dimension index in destination data format.
+func DataFormatDimMap(scope *Scope, x tf.Output, optional ...DataFormatDimMapAttr) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -28100,9 +27760,9 @@ func QuantizeAndDequantize(scope *Scope, input tf.Output, optional ...QuantizeAn
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QuantizeAndDequantize",
+		Type: "DataFormatDimMap",
 		Input: []tf.Input{
-			input,
+			x,
 		},
 		Attrs: attrs,
 	}
@@ -28110,171 +27770,118 @@ func QuantizeAndDequantize(scope *Scope, input tf.Output, optional ...QuantizeAn
 	return op.Output(0)
 }
 
-// Returns the diagonal part of the tensor.
-//
-// This operation returns a tensor with the `diagonal` part
-// of the `input`. The `diagonal` part is computed as follows:
-//
-// Assume `input` has dimensions `[D1,..., Dk, D1,..., Dk]`, then the output is a
-// tensor of rank `k` with dimensions `[D1,..., Dk]` where:
-//
-// `diagonal[i1,..., ik] = input[i1, ..., ik, i1,..., ik]`.
-//
-// For example:
-//
-// ```
-// # 'input' is [[1, 0, 0, 0]
-//               [0, 2, 0, 0]
-//               [0, 0, 3, 0]
-//               [0, 0, 0, 4]]
-//
-// tf.diag_part(input) ==> [1, 2, 3, 4]
-// ```
-//
-// Arguments:
-//	input: Rank k tensor where k is even and not zero.
+// Return the shape of s0 op s1 with broadcast.
 //
-// Returns The extracted diagonal.
-func DiagPart(scope *Scope, input tf.Output) (diagonal tf.Output) {
+// Given `s0` and `s1`, tensors that represent shapes, compute `r0`, the
+// broadcasted shape. `s0`, `s1` and `r0` are all integer vectors.
+func BroadcastArgs(scope *Scope, s0 tf.Output, s1 tf.Output) (r0 tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "DiagPart",
+		Type: "BroadcastArgs",
 		Input: []tf.Input{
-			input,
+			s0, s1,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// QuantizedInstanceNormAttr is an optional argument to QuantizedInstanceNorm.
-type QuantizedInstanceNormAttr func(optionalAttr)
-
-// QuantizedInstanceNormOutputRangeGiven sets the optional output_range_given attribute to value.
+// Return the reduction indices for computing gradients of s0 op s1 with broadcast.
 //
-// value: If True, `given_y_min` and `given_y_min`
-// and `given_y_max` are used as the output range. Otherwise,
-// the implementation computes the output range.
-// If not specified, defaults to false
-func QuantizedInstanceNormOutputRangeGiven(value bool) QuantizedInstanceNormAttr {
-	return func(m optionalAttr) {
-		m["output_range_given"] = value
+// This is typically used by gradient computations for a broadcasting operation.
+func BroadcastGradientArgs(scope *Scope, s0 tf.Output, s1 tf.Output) (r0 tf.Output, r1 tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// QuantizedInstanceNormGivenYMin sets the optional given_y_min attribute to value.
-//
-// value: Output in `y_min` if `output_range_given` is True.
-// If not specified, defaults to 0
-func QuantizedInstanceNormGivenYMin(value float32) QuantizedInstanceNormAttr {
-	return func(m optionalAttr) {
-		m["given_y_min"] = value
+	opspec := tf.OpSpec{
+		Type: "BroadcastGradientArgs",
+		Input: []tf.Input{
+			s0, s1,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
 }
 
-// QuantizedInstanceNormGivenYMax sets the optional given_y_max attribute to value.
+// Pads a tensor with mirrored values.
 //
-// value: Output in `y_max` if `output_range_given` is True.
-// If not specified, defaults to 0
-func QuantizedInstanceNormGivenYMax(value float32) QuantizedInstanceNormAttr {
-	return func(m optionalAttr) {
-		m["given_y_max"] = value
-	}
-}
-
-// QuantizedInstanceNormVarianceEpsilon sets the optional variance_epsilon attribute to value.
+// This operation pads `input` with mirrored values according to the `paddings`
+// you specify. `paddings` is an integer tensor with shape `[n, 2]`, where n is
+// the rank of `input`. For each dimension D of `input`, `paddings[D, 0]` indicates
+// how many values to add before the contents of `input` in that dimension, and
+// `paddings[D, 1]` indicates how many values to add after the contents of `input`
+// in that dimension. Both `paddings[D, 0]` and `paddings[D, 1]` must be no greater
+// than `input.dim_size(D)` if `copy_border` is true, or no greater than
+// `input.dim_size(D) - 1` if it is false.
 //
-// value: A small float number to avoid dividing by 0.
-// If not specified, defaults to 1e-05
-func QuantizedInstanceNormVarianceEpsilon(value float32) QuantizedInstanceNormAttr {
-	return func(m optionalAttr) {
-		m["variance_epsilon"] = value
-	}
-}
-
-// QuantizedInstanceNormMinSeparation sets the optional min_separation attribute to value.
+// The padded size of each dimension D of the output is:
 //
-// value: Minimum value of `y_max - y_min`
-// If not specified, defaults to 0.001
-func QuantizedInstanceNormMinSeparation(value float32) QuantizedInstanceNormAttr {
-	return func(m optionalAttr) {
-		m["min_separation"] = value
-	}
-}
-
-// Quantized Instance normalization.
+// `paddings(D, 0) + input.dim_size(D) + paddings(D, 1)`
+//
+// For example:
+//
+// ```
+// # 't' is [[1, 2, 3], [4, 5, 6]].
+// # 'paddings' is [[1, 1], [2, 2]].
+// # 'mode' is SYMMETRIC.
+// # rank of 't' is 2.
+// pad(t, paddings) ==> [[2, 1, 1, 2, 3, 3, 2]
+//                       [2, 1, 1, 2, 3, 3, 2]
+//                       [5, 4, 4, 5, 6, 6, 5]
+//                       [5, 4, 4, 5, 6, 6, 5]]
+// ```
 //
 // Arguments:
-//	x: A 4D input Tensor.
-//	x_min: The value represented by the lowest quantized input.
-//	x_max: The value represented by the highest quantized input.
+//	input: The input tensor to be padded.
+//	paddings: A two-column matrix specifying the padding sizes. The number of
+// rows must be the same as the rank of `input`.
+//	mode: Either `REFLECT` or `SYMMETRIC`. In reflect mode the padded regions
+// do not include the borders, while in symmetric mode the padded regions
+// do include the borders. For example, if `input` is `[1, 2, 3]` and `paddings`
+// is `[0, 2]`, then the output is `[1, 2, 3, 2, 1]` in reflect mode, and
+// it is `[1, 2, 3, 3, 2]` in symmetric mode.
 //
-// Returns A 4D Tensor.The value represented by the lowest quantized output.The value represented by the highest quantized output.
-func QuantizedInstanceNorm(scope *Scope, x tf.Output, x_min tf.Output, x_max tf.Output, optional ...QuantizedInstanceNormAttr) (y tf.Output, y_min tf.Output, y_max tf.Output) {
+// Returns The padded tensor.
+func MirrorPad(scope *Scope, input tf.Output, paddings tf.Output, mode string) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"mode": mode}
 	opspec := tf.OpSpec{
-		Type: "QuantizedInstanceNorm",
+		Type: "MirrorPad",
 		Input: []tf.Input{
-			x, x_min, x_max,
+			input, paddings,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
-}
-
-// FakeQuantWithMinMaxVarsAttr is an optional argument to FakeQuantWithMinMaxVars.
-type FakeQuantWithMinMaxVarsAttr func(optionalAttr)
-
-// FakeQuantWithMinMaxVarsNumBits sets the optional num_bits attribute to value.
-// If not specified, defaults to 8
-func FakeQuantWithMinMaxVarsNumBits(value int64) FakeQuantWithMinMaxVarsAttr {
-	return func(m optionalAttr) {
-		m["num_bits"] = value
-	}
-}
-
-// FakeQuantWithMinMaxVarsNarrowRange sets the optional narrow_range attribute to value.
-// If not specified, defaults to false
-func FakeQuantWithMinMaxVarsNarrowRange(value bool) FakeQuantWithMinMaxVarsAttr {
-	return func(m optionalAttr) {
-		m["narrow_range"] = value
-	}
+	return op.Output(0)
 }
 
-// Fake-quantize the 'inputs' tensor of type float via global float scalars `min`
+// A placeholder op for a value that will be fed into the computation.
 //
-// and `max` to 'outputs' tensor of same shape as `inputs`.
+// DEPRECATED at GraphDef version 23: Placeholder now behaves the same as PlaceholderV2.
 //
-// `[min; max]` define the clamping range for the `inputs` data.
-// `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
-// when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
-// then de-quantized and output as floats in `[min; max]` interval.
-// `num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
+// N.B. This operation will fail with an error if it is executed. It is
+// intended as a way to represent a value that will always be fed, and to
+// provide attrs that enable the fed value to be checked at runtime.
 //
-// This operation has a gradient and thus allows for training `min` and `max`
-// values.
-func FakeQuantWithMinMaxVars(scope *Scope, inputs tf.Output, min tf.Output, max tf.Output, optional ...FakeQuantWithMinMaxVarsAttr) (outputs tf.Output) {
+// Arguments:
+//	dtype: The type of elements in the tensor.
+//	shape: The shape of the tensor. The shape can be any partially-specified
+// shape.  To be unconstrained, pass in a shape with unknown rank.
+//
+// Returns A placeholder tensor that must be replaced using the feed mechanism.
+func PlaceholderV2(scope *Scope, dtype tf.DataType, shape tf.Shape) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"dtype": dtype, "shape": shape}
 	opspec := tf.OpSpec{
-		Type: "FakeQuantWithMinMaxVars",
-		Input: []tf.Input{
-			inputs, min, max,
-		},
+		Type: "PlaceholderV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
diff --git a/tensorflow/java/maven/libtensorflow/pom.xml b/tensorflow/java/maven/libtensorflow/pom.xml
index d35bb4111271c11839a160517dc9695ead5b46e9..0b69a8cbe530a13dc35aad3a5c859f77f0deca2a 100644
--- a/tensorflow/java/maven/libtensorflow/pom.xml
+++ b/tensorflow/java/maven/libtensorflow/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.6.0-rc1</version>
+    <version>1.7.0-rc1</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow</artifactId>
diff --git a/tensorflow/java/maven/libtensorflow_jni/pom.xml b/tensorflow/java/maven/libtensorflow_jni/pom.xml
index d9ba1bbbfb91170257f64a56f47c6c980e8a9570..541876f7f5e4fadcbc9336f15b319389dcddbf51 100644
--- a/tensorflow/java/maven/libtensorflow_jni/pom.xml
+++ b/tensorflow/java/maven/libtensorflow_jni/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.6.0-rc1</version>
+    <version>1.7.0-rc1</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow_jni</artifactId>
diff --git a/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml b/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
index f6f532c2c10d0a4dad9fc2d7750ea708652000b1..d8933e5238149337b08e70b3f407385887aef0a0 100644
--- a/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
+++ b/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.6.0-rc1</version>
+    <version>1.7.0-rc1</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow_jni_gpu</artifactId>
diff --git a/tensorflow/java/maven/pom.xml b/tensorflow/java/maven/pom.xml
index 0a6b3d23d7d37515cf275e6a46842e32ada4fee1..6286fd73df6dec5643fceda8f6f652220d75e1a7 100644
--- a/tensorflow/java/maven/pom.xml
+++ b/tensorflow/java/maven/pom.xml
@@ -6,7 +6,7 @@
   <modelVersion>4.0.0</modelVersion>
   <groupId>org.tensorflow</groupId>
   <artifactId>parentpom</artifactId>
-  <version>1.6.0-rc1</version>
+  <version>1.7.0-rc1</version>
   <packaging>pom</packaging>
 
   <url>https://www.tensorflow.org</url>
diff --git a/tensorflow/java/maven/proto/pom.xml b/tensorflow/java/maven/proto/pom.xml
index 1d8e8723731f959c8142f0648fc805593d7beac8..4e881f5a631f0b2e389b31a9b24028902eac6301 100644
--- a/tensorflow/java/maven/proto/pom.xml
+++ b/tensorflow/java/maven/proto/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.6.0-rc1</version>
+    <version>1.7.0-rc1</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>proto</artifactId>
diff --git a/tensorflow/java/maven/tensorflow-android/pom-android.xml.template b/tensorflow/java/maven/tensorflow-android/pom-android.xml.template
index 5cbd0c898dc52ec5dfb72f0a2ac893d492a7d4be..37d2372d7b09f6f144e7abb145cb75bf98356615 100644
--- a/tensorflow/java/maven/tensorflow-android/pom-android.xml.template
+++ b/tensorflow/java/maven/tensorflow-android/pom-android.xml.template
@@ -20,10 +20,8 @@
 
   <properties>
     <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
-    <project.build.number>${build_number}</project.build.number>
     <project.build.commitid>${build_commit_id}</project.build.commitid>
     <project.build.type>${build_type}</project.build.type>
-    <project.build.url>${build_url}</project.build.url>
   </properties>
 
 </project>
diff --git a/tensorflow/java/maven/tensorflow-android/update.py b/tensorflow/java/maven/tensorflow-android/update.py
index 4ae666e4e5351f1bdaf79d1b5cfdb63b0f811e2b..2206d800ca1fe82c5596ff39e56518bc5aea6211 100644
--- a/tensorflow/java/maven/tensorflow-android/update.py
+++ b/tensorflow/java/maven/tensorflow-android/update.py
@@ -45,6 +45,9 @@ def get_json(url):
 
 def get_commit_id(build_info):
   """Fetch the git commit id from the build info json object."""
+  release_commit_id = build_info.get('build_commit_id')
+  if release_commit_id:
+    return release_commit_id
   actions = build_info.get('actions')
   build_data = next(
       a for a in actions
@@ -95,20 +98,12 @@ def main():
     release_prefix = 'https://storage.googleapis.com/tensorflow/libtensorflow'
     info_url = '%s/android_buildinfo-%s.json' % (release_prefix, args.version)
     aar_url = '%s/tensorflow-%s.aar' % (release_prefix, args.version)
-    build_type = 'release-matrix-android2'
+    build_type = 'release-android'
 
   # Retrieve build information
   build_info = get_json(info_url)
 
   # Check all required build info is present
-  if build_info.get('result') != 'SUCCESS':
-    raise ValueError('Invalid json: %s' % build_info)
-  build_url = build_info.get('url')
-  if not build_url:
-    raise ValueError('Missing url: %s' % build_info)
-  build_number = build_info.get('number')
-  if not build_number:
-    raise ValueError('Missing build number: %s' % build_info)
   build_commit_id = get_commit_id(build_info)
   if not build_commit_id:
     raise ValueError('Missing commit id: %s' % build_info)
@@ -119,9 +114,7 @@ def main():
     f.write(
         template.substitute({
             'build_commit_id': build_commit_id,
-            'build_number': build_number,
             'build_type': build_type,
-            'build_url': build_url,
             'version': args.version
         }))
 
diff --git a/tensorflow/java/maven/tensorflow/pom.xml b/tensorflow/java/maven/tensorflow/pom.xml
index 5c1b55085c5df1ec473a3f4e0bf750b236cfc264..d512a7eda9638d428e02beda442ba4d4db9adf62 100644
--- a/tensorflow/java/maven/tensorflow/pom.xml
+++ b/tensorflow/java/maven/tensorflow/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.6.0-rc1</version>
+    <version>1.7.0-rc1</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>tensorflow</artifactId>
diff --git a/tensorflow/python/BUILD b/tensorflow/python/BUILD
index 4c8c73548cac261131c2950d725ac41c6af3dab0..7bd05bb6e0b3a7b4aa2cd0b4cee7fc158d6e8bbf 100644
--- a/tensorflow/python/BUILD
+++ b/tensorflow/python/BUILD
@@ -58,6 +58,18 @@ py_library(
         "//tensorflow/tools/api/generator:__pkg__",
         "//tensorflow/tools/quantization:__pkg__",  # TODO(b/34059704): remove when fixed
     ],
+    deps = [":no_contrib"] + if_not_windows([
+        "//tensorflow/contrib:contrib_py",
+    ]),
+)
+
+py_library(
+    name = "no_contrib",
+    srcs = ["__init__.py"],
+    srcs_version = "PY2AND3",
+    visibility = [
+        "//tensorflow:__pkg__",
+    ],
     deps = [
         ":array_ops",
         ":bitwise_ops",
@@ -66,6 +78,7 @@ py_library(
         ":client_testlib",
         ":confusion_matrix",
         ":control_flow_ops",
+        ":cudnn_rnn_ops_gen",
         ":errors",
         ":framework",
         ":framework_for_generated_wrappers",
@@ -86,39 +99,38 @@ py_library(
         ":ops",
         ":platform",
         ":pywrap_tensorflow",
+        ":saver_test_utils",
         ":script_ops",
         ":session_ops",
         ":sets",
         ":sparse_ops",
         ":spectral_ops",
+        ":spectral_ops_test_util",
         ":standard_ops",
         ":state_ops",
         ":string_ops",
+        ":subscribe",
         ":summary",
         ":tensor_array_ops",
-        ":training",
-        ":saver_test_utils",
-        ":subscribe",
         ":test_ops",  # TODO: Break testing code out into separate rule.
-        ":tf_item",
         ":tf_cluster",
+        ":tf_item",
         ":tf_optimizer",
+        ":training",
         ":util",
         ":weights_broadcast_ops",
-        "//third_party/py/numpy",
         "//tensorflow/core:protos_all_py",
         "//tensorflow/python/data",
         "//tensorflow/python/estimator:estimator_py",
         "//tensorflow/python/feature_column:feature_column_py",
         "//tensorflow/python/keras",
-        "//tensorflow/python/ops/losses",
         "//tensorflow/python/ops/distributions",
         "//tensorflow/python/ops/linalg",
+        "//tensorflow/python/ops/losses",
         "//tensorflow/python/profiler",
         "//tensorflow/python/saved_model",
-    ] + if_not_windows([
-        "//tensorflow/contrib:contrib_py",
-    ]),
+        "//third_party/py/numpy",
+    ],
 )
 
 tf_py_build_info_genrule()
@@ -765,6 +777,31 @@ py_library(
     ],
 )
 
+py_library(
+    name = "smart_cond",
+    srcs = ["framework/smart_cond.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":control_flow_ops",
+        ":tensor_util",
+    ],
+)
+
+py_test(
+    name = "smart_cond_test",
+    size = "small",
+    srcs = ["framework/smart_cond_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":client_testlib",
+        ":constant_op",
+        ":framework_ops",
+        ":math_ops",
+        ":session",
+        ":smart_cond",
+    ],
+)
+
 py_library(
     name = "sparse_tensor",
     srcs = ["framework/sparse_tensor.py"],
@@ -1007,6 +1044,11 @@ cuda_py_tests(
         "//third_party/py/numpy",
         "//tensorflow/core:protos_all_py",
     ],
+    shard_count = 10,
+    tags = [
+        "noasan",
+        "optonly",
+    ],
 )
 
 py_test(
@@ -1023,7 +1065,7 @@ py_test(
 
 py_test(
     name = "framework_importer_test",
-    size = "medium",
+    size = "large",
     srcs = ["framework/importer_test.py"],
     main = "framework/importer_test.py",
     srcs_version = "PY2AND3",
@@ -1331,6 +1373,12 @@ tf_gen_op_wrapper_private_py(
     ],
 )
 
+tf_gen_op_wrapper_private_py(
+    name = "summary_ops_gen",
+    visibility = ["//tensorflow:__subpackages__"],
+    deps = ["//tensorflow/core:summary_ops_op_lib"],
+)
+
 tf_gen_op_wrapper_private_py(
     name = "audio_ops_gen",
     require_shape_functions = True,
@@ -1340,6 +1388,13 @@ tf_gen_op_wrapper_private_py(
     ],
 )
 
+tf_gen_op_wrapper_private_py(
+    name = "cudnn_rnn_ops_gen",
+    visibility = [
+        "//tensorflow:__subpackages__",
+    ],
+)
+
 tf_gen_op_wrapper_private_py(
     name = "candidate_sampling_ops_gen",
     visibility = ["//learning/brain/python/ops:__pkg__"],
@@ -1750,6 +1805,7 @@ py_library(
 py_library(
     name = "gradients",
     srcs = [
+        "ops/custom_gradient.py",
         "ops/gradients.py",
         "ops/gradients_impl.py",
     ],
@@ -1763,6 +1819,7 @@ py_library(
         ":control_flow_util",
         ":framework",
         ":framework_for_generated_wrappers",
+        ":framework_ops",
         ":functional_ops",
         ":image_grad",
         ":linalg_grad",
@@ -1775,6 +1832,9 @@ py_library(
         ":platform",
         ":spectral_grad",
         ":util",
+        "//tensorflow/python/eager:backprop",
+        "//tensorflow/python/eager:context",
+        "//tensorflow/python/eager:tape",
         "//third_party/py/numpy",
         "@six_archive//:six",
     ],
@@ -1817,13 +1877,16 @@ py_library(
         ":control_flow_ops",
         ":framework",
         ":framework_for_generated_wrappers",
+        ":gradients",
         ":image_ops_gen",
         ":math_ops",
+        ":nn",
         ":nn_ops_gen",
         ":random_ops",
         ":string_ops",
         ":util",
         ":variables",
+        "//third_party/py/numpy",
     ],
 )
 
@@ -2552,6 +2615,7 @@ py_library(
     srcs_version = "PY2AND3",
     deps = [
         ":user_ops_gen",
+        ":util",
         "@six_archive//:six",
     ],
 )
@@ -2857,6 +2921,7 @@ py_library(
     srcs = ["training/checkpointable.py"],
     srcs_version = "PY2AND3",
     deps = [
+        ":array_ops",
         ":dtypes",
         ":io_ops_gen",
         ":ops",
@@ -3050,7 +3115,6 @@ tf_proto_library(
             "framework/cpp_shape_inference.proto",
         ],
     ),
-    go_api_version = 2,
 )
 
 tf_proto_library_py(
@@ -3629,6 +3693,7 @@ py_test(
         ":framework_for_generated_wrappers",
         ":math_ops",
         ":state_ops_gen",
+        ":variable_scope",
         ":variables",
         "//tensorflow/core:protos_all_py",
     ],
@@ -3919,7 +3984,13 @@ py_test(
     size = "small",
     srcs = ["training/checkpoint_utils_test.py"],
     srcs_version = "PY2AND3",
-    tags = ["no_windows"],
+    tags = [
+        "manual",
+        "no_cuda_on_cpu_tap",
+        "no_oss",
+        "no_windows",
+        "notap",
+    ],
     deps = [
         ":client",
         ":client_testlib",
@@ -3928,6 +3999,7 @@ py_test(
         ":partitioned_variables",
         ":platform",
         ":pywrap_tensorflow",
+        ":resource_variable_ops",
         ":state_ops",
         ":training",
         ":variable_scope",
@@ -3957,6 +4029,25 @@ py_test(
     ],
 )
 
+py_test(
+    name = "warm_starting_util_test",
+    size = "small",
+    srcs = ["training/warm_starting_util_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":array_ops",
+        ":client_testlib",
+        ":dtypes",
+        ":framework_ops",
+        ":init_ops",
+        ":training",
+        ":variable_scope",
+        ":variables",
+        "//tensorflow/python/feature_column",
+        "//third_party/py/numpy",
+    ],
+)
+
 py_test(
     name = "monitored_session_test",
     size = "medium",
@@ -4047,6 +4138,7 @@ py_library(
         ":pywrap_tensorflow",
         ":summary_op_util",
         ":summary_ops",
+        ":summary_ops_gen",
         ":util",
         "//tensorflow/python/eager:context",
         "//third_party/py/numpy",
@@ -4091,6 +4183,7 @@ py_library(
         ":control_flow_ops",
         ":framework_for_generated_wrappers",
         ":platform",
+        ":smart_cond",
         ":tensor_util",
         ":util",
         ":variable_scope",
@@ -4711,6 +4804,7 @@ py_test(
     srcs_version = "PY2AND3",
     tags = [
         "grappler",
+        "no_cuda_on_cpu_tap",
         "no_pip",
     ],
     deps = [
diff --git a/tensorflow/python/__init__.py b/tensorflow/python/__init__.py
index 02ed5517ca895ab070a89f8810f77dadcff9212b..3346937904885c216d7a8de86fc6036604376173 100644
--- a/tensorflow/python/__init__.py
+++ b/tensorflow/python/__init__.py
@@ -99,6 +99,10 @@ from tensorflow.python.user_ops import user_ops
 from tensorflow.python.util import compat
 
 
+# Import cudnn rnn ops to make sure their ops are registered.
+from tensorflow.python.ops import gen_cudnn_rnn_ops as _
+
+
 # Import the names from python/training.py as train.Name.
 from tensorflow.python.training import training as train
 
@@ -139,6 +143,10 @@ from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import string_ops
 from tensorflow.python.ops import tensor_array_ops
 
+# Eager execution
+from tensorflow.python.eager.context import executing_eagerly
+from tensorflow.python.framework.ops import enable_eager_execution
+
 # Symbols whitelisted for export without documentation.
 # TODO(cwhipkey): review these and move to contrib, expose through
 # documentation, or remove.
@@ -198,13 +206,9 @@ tf_export('TensorInfo')(TensorInfo)
 _allowed_symbols.extend([
     'arg_max',
     'arg_min',
-    'mul',  # use tf.multiply instead.
-    'neg',  # use tf.negative instead.
-    'sub',  # use tf.subtract instead.
     'create_partitioned_variables',
     'deserialize_many_sparse',
     'lin_space',
-    'list_diff',  # Use tf.listdiff instead.
     'listdiff',  # Use tf.listdiff instead.
     'parse_single_sequence_example',
     'serialize_many_sparse',
@@ -294,6 +298,12 @@ _allowed_symbols.extend([
     'MONOLITHIC_BUILD',
 ])
 
+# Eager execution
+_allowed_symbols.extend([
+    'enable_eager_execution',
+    'executing_eagerly',
+])
+
 # Remove all extra symbols that don't have a docstring or are not explicitly
 # referenced in the whitelist.
 remove_undocumented(__name__, _allowed_symbols, [
diff --git a/tensorflow/python/client/session.py b/tensorflow/python/client/session.py
index f3c4fecdc0fde0436bea76cc774edaabe1bc07dd..da5dc6f5998bd6f63445dc3694e53d1032e3d1ab 100644
--- a/tensorflow/python/client/session.py
+++ b/tensorflow/python/client/session.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 import functools
 import re
 import threading
+import warnings
 
 import numpy as np
 
@@ -888,6 +889,8 @@ class BaseSession(SessionInterface):
       Either a single value if `fetches` is a single graph element, or
       a list of values if `fetches` is a list, or a dictionary with the
       same keys as `fetches` if that is a dictionary (described above).
+      The order in which `fetches` operations are evaluated inside the call
+      is undefined.
 
     Raises:
       RuntimeError: If this `Session` is in an invalid state (e.g. has been
@@ -1085,7 +1088,10 @@ class BaseSession(SessionInterface):
           if isinstance(subfeed_val, ops.Tensor):
             raise TypeError('The value of a feed cannot be a tf.Tensor object. '
                             'Acceptable feed values include Python scalars, '
-                            'strings, lists, numpy ndarrays, or TensorHandles.')
+                            'strings, lists, numpy ndarrays, or TensorHandles. '
+                            'For reference, the tensor object was ' +
+                            str(feed_val) + ' which was passed to the '
+                            'feed with key ' + str(feed) + '.')
 
           subfeed_dtype = subfeed_t.dtype.as_numpy_dtype
           if isinstance(subfeed_val, int) and _convert_to_numpy_obj(
@@ -1217,19 +1223,12 @@ class BaseSession(SessionInterface):
           compat.as_bytes(options.SerializeToString())) if options else None
       run_metadata_ptr = tf_session.TF_NewBuffer() if run_metadata else None
       try:
-        with errors.raise_exception_on_not_ok_status() as status:
-          if self._created_with_new_api:
-            results = tf_session.TF_SessionRun_wrapper(
-                self._session, options_ptr, {}, fetch_list, target_list,
-                run_metadata_ptr, status)
-          else:
-            results = tf_session.TF_Run(self._session, options_ptr, {},
-                                        fetch_list, target_list, status,
-                                        run_metadata_ptr)
-          if fetch_handler:
-            results = fetch_handler.build_results(self, results)
-          else:
-            results = results[0] if results else None
+        results = self._call_tf_sessionrun(
+            options_ptr, {}, fetch_list, target_list, run_metadata_ptr)
+        if fetch_handler:
+          results = fetch_handler.build_results(self, results)
+        else:
+          results = results[0] if results else None
         if run_metadata:
           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
           run_metadata.ParseFromString(compat.as_bytes(proto_data))
@@ -1250,13 +1249,7 @@ class BaseSession(SessionInterface):
       assert len(target_list) == 1
 
       def _single_operation_run():
-        with errors.raise_exception_on_not_ok_status() as status:
-          if self._created_with_new_api:
-            tf_session.TF_SessionRun_wrapper(self._session, None, {}, [],
-                                             target_list, None, status)
-          else:
-            tf_session.TF_Run(self._session, None, {}, [], target_list, status,
-                              None)
+        self._call_tf_sessionrun(None, {}, [], target_list, None)
 
       return _single_operation_run
     elif isinstance(fetches, ops.Tensor):
@@ -1266,13 +1259,7 @@ class BaseSession(SessionInterface):
       assert not target_list
 
       def _single_tensor_run():
-        with errors.raise_exception_on_not_ok_status() as status:
-          if self._created_with_new_api:
-            results = tf_session.TF_SessionRun_wrapper(
-                self._session, None, {}, fetch_list, [], None, status)
-          else:
-            results = tf_session.TF_Run(self._session, None, {}, fetch_list, [],
-                                        status, None)
+        results = self._call_tf_sessionrun(None, {}, fetch_list, [], None)
         return results[0]
 
       return _single_tensor_run
@@ -1280,13 +1267,8 @@ class BaseSession(SessionInterface):
       # In all other cases, we must use `fetch_handler` to build the
       # results for us.
       def _fetch_handler_run():
-        with errors.raise_exception_on_not_ok_status() as status:
-          if self._created_with_new_api:
-            results = tf_session.TF_SessionRun_wrapper(
-                self._session, None, {}, fetch_list, target_list, None, status)
-          else:
-            results = tf_session.TF_Run(self._session, None, {}, fetch_list,
-                                        target_list, status, None)
+        results = self._call_tf_sessionrun(
+            None, {}, fetch_list, target_list, None)
         return fetch_handler.build_results(self, results)
 
       return _fetch_handler_run
@@ -1326,35 +1308,22 @@ class BaseSession(SessionInterface):
       fetches = _name_list(fetch_list)
       targets = _name_list(target_list)
 
-    def _run_fn(session, feed_dict, fetch_list, target_list, options,
-                run_metadata):
+    def _run_fn(feed_dict, fetch_list, target_list, options, run_metadata):
       # Ensure any changes to the graph are reflected in the runtime.
       self._extend_graph()
-      with errors.raise_exception_on_not_ok_status() as status:
-        if self._created_with_new_api:
-          return tf_session.TF_SessionRun_wrapper(session, options, feed_dict,
-                                                  fetch_list, target_list,
-                                                  run_metadata, status)
-        else:
-          return tf_session.TF_Run(session, options, feed_dict, fetch_list,
-                                   target_list, status, run_metadata)
+      return self._call_tf_sessionrun(
+          options, feed_dict, fetch_list, target_list, run_metadata)
 
-    def _prun_fn(session, handle, feed_dict, fetch_list):
+    def _prun_fn(handle, feed_dict, fetch_list):
       if target_list:
         raise RuntimeError('partial_run() requires empty target_list.')
-      with errors.raise_exception_on_not_ok_status() as status:
-        if self._created_with_new_api:
-          return tf_session.TF_SessionPRun_wrapper(session, handle, feed_dict,
-                                                   fetch_list, status)
-        else:
-          return tf_session.TF_PRun(session, handle, feed_dict, fetch_list,
-                                    status)
+      return self._call_tf_sessionprun(handle, feed_dict, fetch_list)
 
     if handle is None:
-      return self._do_call(_run_fn, self._session, feeds, fetches, targets,
-                           options, run_metadata)
+      return self._do_call(_run_fn, feeds, fetches, targets, options,
+                           run_metadata)
     else:
-      return self._do_call(_prun_fn, self._session, handle, feeds, fetches)
+      return self._do_call(_prun_fn, handle, feeds, fetches)
 
   def _do_call(self, fn, *args):
     try:
@@ -1374,23 +1343,23 @@ class BaseSession(SessionInterface):
       raise type(e)(node_def, op, message)
 
   def _extend_graph(self):
-    # Nothing to do if we're using the new session interface
-    # TODO(skyewm): remove this function altogether eventually
     if self._created_with_new_api:
-      return
-
-    # Ensure any changes to the graph are reflected in the runtime.
-    with self._extend_lock:
-      if self._graph.version > self._current_version:
-        # pylint: disable=protected-access
-        graph_def, self._current_version = self._graph._as_graph_def(
-            from_version=self._current_version, add_shapes=self._add_shapes)
-        # pylint: enable=protected-access
-
+      with self._graph._lock:  # pylint: disable=protected-access
         with errors.raise_exception_on_not_ok_status() as status:
-          tf_session.TF_ExtendGraph(self._session,
-                                    graph_def.SerializeToString(), status)
-        self._opened = True
+          tf_session.ExtendSession(self._session, status)
+    else:
+      # Ensure any changes to the graph are reflected in the runtime.
+      with self._extend_lock:
+        if self._graph.version > self._current_version:
+          # pylint: disable=protected-access
+          graph_def, self._current_version = self._graph._as_graph_def(
+              from_version=self._current_version, add_shapes=self._add_shapes)
+          # pylint: enable=protected-access
+
+          with errors.raise_exception_on_not_ok_status() as status:
+            tf_session.TF_ExtendGraph(self._session,
+                                      graph_def.SerializeToString(), status)
+          self._opened = True
 
   # The threshold to run garbage collection to delete dead tensors.
   _DEAD_HANDLES_THRESHOLD = 10
@@ -1441,6 +1410,27 @@ class BaseSession(SessionInterface):
         feed_dict[feed_tensor] = np_val
       return handles
 
+  def _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list,
+                          run_metadata):
+    with errors.raise_exception_on_not_ok_status() as status:
+      if self._created_with_new_api:
+        return tf_session.TF_SessionRun_wrapper(
+            self._session, options, feed_dict, fetch_list, target_list,
+            run_metadata, status)
+      else:
+        return tf_session.TF_Run(
+            self._session, options, feed_dict, fetch_list, target_list,
+            status, run_metadata)
+
+  def _call_tf_sessionprun(self, handle, feed_dict, fetch_list):
+    with errors.raise_exception_on_not_ok_status() as status:
+      if self._created_with_new_api:
+        return tf_session.TF_SessionPRun_wrapper(
+            self._session, handle, feed_dict, fetch_list, status)
+      else:
+        return tf_session.TF_PRun(
+            self._session, handle, feed_dict, fetch_list, status)
+
 
 @tf_export('Session')
 class Session(BaseSession):
@@ -1637,6 +1627,9 @@ class InteractiveSession(BaseSession):
   ```
   """
 
+  _count_lock = threading.Lock()
+  _active_session_count = 0  # GUARDED_BY(_count_lock)
+
   def __init__(self, target='', graph=None, config=None):
     """Creates a new interactive TensorFlow session.
 
@@ -1665,6 +1658,19 @@ class InteractiveSession(BaseSession):
     config.graph_options.place_pruned_graph = True
 
     super(InteractiveSession, self).__init__(target, graph, config)
+    with InteractiveSession._count_lock:
+      if InteractiveSession._active_session_count > 0:
+        warnings.warn('An interactive session is already active. This can '
+                      'cause out-of-memory errors in some cases. You must '
+                      'explicitly call `InteractiveSession.close()` to release '
+                      'resources held by the other session(s).')
+      InteractiveSession._active_session_count += 1
+    # NOTE(mrry): We do not use `Session._closed` here because it has unhelpful
+    # semantics (in particular, it is not set to true if `Session.close()` is
+    # called on a session that has not been "opened" by running a step) and we
+    # cannot change those semantics without breaking existing code.
+    self._explicitly_closed = False
+
     self._default_session = self.as_default()
     self._default_session.enforce_nesting = False
     self._default_session.__enter__()
@@ -1677,6 +1683,14 @@ class InteractiveSession(BaseSession):
   def close(self):
     """Closes an `InteractiveSession`."""
     super(InteractiveSession, self).close()
+    with InteractiveSession._count_lock:
+      if not self._explicitly_closed:
+        InteractiveSession._active_session_count -= 1
+        self._explicitly_closed = True
+      else:
+        return
     if self._explicit_graph is not None:
       self._default_graph.__exit__(None, None, None)
+      self._default_graph = None
     self._default_session.__exit__(None, None, None)
+    self._default_session = None
diff --git a/tensorflow/python/client/session_test.py b/tensorflow/python/client/session_test.py
index 490572254b0be6a110ef06cea15d20d780f732cf..6e2640efd1d58ab524e42b62f62ad3d38f360c0e 100644
--- a/tensorflow/python/client/session_test.py
+++ b/tensorflow/python/client/session_test.py
@@ -22,13 +22,13 @@ import os
 import sys
 import threading
 import time
+import warnings
 
 import numpy as np
 import six
 from six.moves import xrange  # pylint: disable=redefined-builtin
 
 from tensorflow.core.framework import attr_value_pb2
-from tensorflow.core.framework import types_pb2
 from tensorflow.core.lib.core import error_codes_pb2
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.client import session
@@ -37,6 +37,7 @@ from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import function
+from tensorflow.python.framework import importer
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.framework import tensor_util
@@ -46,6 +47,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import data_flow_ops
 from tensorflow.python.ops import gen_control_flow_ops
+from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import math_ops
 # Import resource_variable_ops for the variables-to-tensor implicit conversion.
 from tensorflow.python.ops import resource_variable_ops  # pylint: disable=unused-import
@@ -63,6 +65,10 @@ ops.RegisterShape('ConstructionFails')(common_shapes.unknown_shape)
 @test_util.with_c_api
 class SessionTest(test_util.TensorFlowTestCase):
 
+  def setUp(self):
+    super(SessionTest, self).setUp()
+    warnings.simplefilter('always')
+
   def testUseExistingGraph(self):
     with ops.Graph().as_default() as g, ops.device('/cpu:0'):
       a = constant_op.constant(6.0, shape=[1, 1])
@@ -187,12 +193,10 @@ class SessionTest(test_util.TensorFlowTestCase):
       a = constant_op.constant(0.0, shape=[2, 3])
       # NOTE(mrry): The original_op is nonsense, but used here to test that the
       #   errors are reported correctly.
-      # pylint: disable=protected-access
       with sess.graph._original_op(a.op):
         b = array_ops.identity(a, name='id')
       with sess.graph._original_op(b.op):
         c = array_ops.placeholder(dtypes.float32)
-      # pylint: enable=protected-access
 
       def exc_predicate(e):
         return (e.op == c.op and e.op._original_op == b.op and
@@ -1052,6 +1056,43 @@ class SessionTest(test_util.TensorFlowTestCase):
       for t in threads:
         t.join()
 
+  def testParallelRunAndBuild(self):
+    with session.Session() as sess:
+      c = constant_op.constant(5.0)
+      stop = threading.Event()
+
+      def run_loop():
+        while not stop.is_set():
+          self.assertEqual(sess.run(c), 5.0)
+
+      threads = [self.checkedThread(target=run_loop) for _ in range(100)]
+      for t in threads:
+        t.start()
+
+      # Do some graph construction. Try to exercise non-trivial paths.
+      graph = ops.get_default_graph()
+      gdef = None
+      for _ in range(10):
+        x = array_ops.placeholder(dtype=dtypes.float32)
+        with ops.colocate_with(x):
+          y = array_ops.placeholder(dtype=dtypes.float32)
+        with ops.device('/cpu:0'):
+          z = control_flow_ops.while_loop(
+              lambda x, y: x < 10, lambda x, y: (x + 1, x * y), [x, y])
+        with graph._attr_scope({'_a': attr_value_pb2.AttrValue(b=False)}):
+          gradients_impl.gradients(z, [x, y])
+          if gdef is None:
+            gdef = graph.as_graph_def()
+          else:
+            # NOTE(skyewm): import_graph_def breaks the running threads without
+            # the C API enabled. This is not a regression so I didn't fix it.
+            if ops._USE_C_API:
+              importer.import_graph_def(gdef, name='import')
+
+      stop.set()
+      for t in threads:
+        t.join()
+
   def testRunFeedDict(self):
     with session.Session() as s:
       x = array_ops.zeros([2])
@@ -1153,6 +1194,33 @@ class SessionTest(test_util.TensorFlowTestCase):
       self.assertAllEqual([[24.0]], e.eval())
       sess.close()
 
+  def testMultipleInteractiveSessionsWarning(self):
+    # Reinitialize the global state to ensure that the expected warnings will
+    # be emitted.
+    session.InteractiveSession._active_session_count = 0  # pylint: disable=protected-access
+
+    sess = session.InteractiveSession()
+    sess.run(constant_op.constant(4.0))  # Run so that the session is "opened".
+    sess.close()
+    # Opening and closing interactive sessions serially should not warn.
+    with warnings.catch_warnings(record=True) as w:
+      sess = session.InteractiveSession()
+      sess.close()
+    self.assertEqual(0, len(w))
+
+    with warnings.catch_warnings(record=True) as w:
+      sess = session.InteractiveSession()
+    self.assertEqual(0, len(w))
+    with warnings.catch_warnings(record=True) as w:
+      sess2 = session.InteractiveSession()
+    self.assertEqual(1, len(w))
+    self.assertTrue('An interactive session is already active. This can cause '
+                    'out-of-memory errors in some cases. You must explicitly '
+                    'call `InteractiveSession.close()` to release resources '
+                    'held by the other session(s).' in str(w[0].message))
+    sess2.close()
+    sess.close()
+
   def testInteractivePlacePrunedGraph(self):
     sess = session.InteractiveSession()
 
@@ -1745,8 +1813,8 @@ class SessionTest(test_util.TensorFlowTestCase):
     # Ensure that errors from building the graph get propagated.
     data = array_ops.placeholder(dtypes.float32, shape=[])
     # pylint: disable=protected-access
-    enter_1 = gen_control_flow_ops._enter(data, 'foo_1', False)
-    enter_2 = gen_control_flow_ops._enter(data, 'foo_2', False)
+    enter_1 = gen_control_flow_ops.enter(data, 'foo_1', False)
+    enter_2 = gen_control_flow_ops.enter(data, 'foo_2', False)
     # pylint: enable=protected-access
     res = math_ops.add(enter_1, enter_2)
     with self.assertRaisesOpError('has inputs from different frames'):
@@ -1815,144 +1883,5 @@ class SessionTest(test_util.TensorFlowTestCase):
         sess.run(a, feed_dict={a: 1})
 
 
-class GraphMutationTest(test_util.TensorFlowTestCase):
-
-  def setUp(self):
-    self._original_use_c_api_value = ops._USE_C_API
-    ops._USE_C_API = True
-    super(GraphMutationTest, self).setUp()
-
-  def tearDown(self):
-    ops._USE_C_API = self._original_use_c_api_value
-    super(GraphMutationTest, self).tearDown()
-
-  def testUpdateInputAfterRunning(self):
-    with ops.Graph().as_default() as g:
-      a = constant_op.constant(1.0)
-      b = constant_op.constant(2.0)
-      c = a + b
-
-    with session.Session(graph=g) as sess:
-      self.assertAllEqual(3.0, sess.run(c))
-      c.op._update_input(1, a)  # pylint: disable=protected-access
-      with self.assertRaisesRegexp(
-          errors.FailedPreconditionError,
-          'add.*was changed by updating input tensor after it was run'):
-        sess.run(c)
-
-      # Check that running the graph with a new session is fine
-      with session.Session(graph=g) as sess2:
-        self.assertAllEqual(2.0, sess2.run(c))
-
-  def testSetDeviceAfterRunning(self):
-    with ops.Graph().as_default() as g:
-      a = constant_op.constant(1.0)
-      b = constant_op.constant(2.0)
-      c = a + b
-
-    with session.Session(graph=g) as sess:
-      self.assertAllEqual(3.0, sess.run(c))
-      c.op._set_device('/cpu:0')  # pylint: disable=protected-access
-      with self.assertRaisesRegexp(
-          errors.FailedPreconditionError,
-          'add.*was changed by setting device after it was run'):
-        sess.run(c)
-
-  def testSetAttrAfterRunning(self):
-    with ops.Graph().as_default() as g:
-      a = constant_op.constant(1.0, dtype=dtypes.float32)
-      b = math_ops.cast(a, dtypes.float64)
-
-    with session.Session(graph=g) as sess:
-      self.assertAllEqual(1.0, sess.run(b))
-      b.op._set_attr('DstT', attr_value_pb2.AttrValue(type=types_pb2.DT_FLOAT))
-      with self.assertRaisesRegexp(
-          errors.FailedPreconditionError,
-          'Cast.*was changed by setting attribute after it was run'):
-        sess.run(b)
-
-  def testRunModifyRun(self):
-    with ops.Graph().as_default() as g:
-      a = constant_op.constant(1.0)
-      b = constant_op.constant(2.0)
-      c = a + b
-
-      with session.Session(graph=g) as sess:
-        self.assertAllEqual(3.0, sess.run(c))
-
-        d = b + c
-        d.op._update_input(0, a)  # pylint: disable=protected-access
-        self.assertAllEqual(3.0, sess.run(c))
-        self.assertAllEqual(4.0, sess.run(d))
-
-  def testRunModifyRunTwoSessions(self):
-    with ops.Graph().as_default() as g:
-      a = constant_op.constant(1.0)
-      b = constant_op.constant(2.0)
-      c = a + b
-
-      with session.Session(graph=g) as sess1:
-        with session.Session(graph=g) as sess2:
-          self.assertAllEqual(3.0, sess1.run(c))
-          self.assertAllEqual(3.0, sess2.run(c))
-
-          d = b + c
-          d.op._update_input(0, a)  # pylint: disable=protected-access
-          self.assertAllEqual(3.0, sess2.run(c))
-          self.assertAllEqual(4.0, sess2.run(d))
-
-          d.op._update_input(0, b)  # pylint: disable=protected-access
-          self.assertAllEqual(3.0, sess1.run(c))
-          self.assertAllEqual(5.0, sess1.run(d))
-
-          with self.assertRaisesRegexp(
-              errors.FailedPreconditionError,
-              'add.*was changed by updating input tensor after it was run'):
-            sess2.run(c)
-
-  def testTwoSessionsOneRunBeforeModification(self):
-    with ops.Graph().as_default() as g, ops.device('/cpu:0'):
-      a = constant_op.constant(1.0)
-      b = constant_op.constant(2.0)
-      c = a + b
-
-    with session.Session(graph=g) as sess1:
-      with session.Session(graph=g) as sess2:
-        sess1.run(c)
-
-        c.op._set_device('/cpu:0')  # pylint: disable=protected-access
-
-        with self.assertRaisesRegexp(
-            errors.FailedPreconditionError,
-            'add.*was changed by setting device after it was run'):
-          sess1.run(c)
-
-        # sess2 was not run before modification
-        self.assertAllEqual(3.0, sess2.run(c))
-
-  def testTwoSessionsBothRunBeforeModification(self):
-    with ops.Graph().as_default() as g, ops.device('/cpu:0'):
-      a = constant_op.constant(1.0)
-      b = constant_op.constant(2.0)
-      c = a + b
-
-    with session.Session(graph=g) as sess1:
-      with session.Session(graph=g) as sess2:
-        sess1.run(c)
-        sess2.run(c)
-
-        c.op._set_device('/cpu:0')  # pylint: disable=protected-access
-
-        with self.assertRaisesRegexp(
-            errors.FailedPreconditionError,
-            'add.*was changed by setting device after it was run'):
-          sess1.run(c)
-
-        with self.assertRaisesRegexp(
-            errors.FailedPreconditionError,
-            'add.*was changed by setting device after it was run'):
-          sess2.run(c)
-
-
 if __name__ == '__main__':
   googletest.main()
diff --git a/tensorflow/python/client/tf_session.i b/tensorflow/python/client/tf_session.i
index f305cd271f98bea697ea8ff15be799d3e80db0bf..e88fc0c01a8bb7534f47e2a0389965c102bbad7b 100644
--- a/tensorflow/python/client/tf_session.i
+++ b/tensorflow/python/client/tf_session.i
@@ -720,6 +720,9 @@ def TF_Reset(target, containers=None, config=None):
 }
 
 %unignore SetRequireShapeInferenceFns;
+%unignore TF_TryEvaluateConstant_wrapper;
+%noexception TF_TryEvaluateConstant_wrapper;
+%unignore ExtendSession;
 
 %include "tensorflow/python/client/tf_session_helper.h"
 
diff --git a/tensorflow/python/client/tf_session_helper.cc b/tensorflow/python/client/tf_session_helper.cc
index 361dbc22b097a9bc82f656d7416b88c4a3a1ec2d..a8ab91749a86749a1eef25e2674634334682d0f3 100644
--- a/tensorflow/python/client/tf_session_helper.cc
+++ b/tensorflow/python/client/tf_session_helper.cc
@@ -493,4 +493,19 @@ std::vector<string> TF_ImportGraphDefResultsMissingUnusedInputMappings_wrapper(
   return input_strs;
 }
 
+PyObject* TF_TryEvaluateConstant_wrapper(TF_Graph* graph, TF_Output output,
+                                         TF_Status* status) {
+  TF_Tensor* result_tensor;
+  bool evaluated =
+      TF_TryEvaluateConstant(graph, output, &result_tensor, status);
+  if (!evaluated || TF_GetCode(status) != TF_OK) Py_RETURN_NONE;
+
+  Safe_TF_TensorPtr safe_result_tensor(result_tensor);
+  PyObject* out;
+  Status s = TF_TensorToPyArray(std::move(safe_result_tensor), &out);
+  Set_TF_Status_from_Status(status, s);
+  if (!s.ok()) Py_RETURN_NONE;
+  return out;
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/python/client/tf_session_helper.h b/tensorflow/python/client/tf_session_helper.h
index 29d5b28f40a7c07c199eec8c8cd85de626f6b068..83318dc178f6da3828a8dc41e81b7fc3e2e19e22 100644
--- a/tensorflow/python/client/tf_session_helper.h
+++ b/tensorflow/python/client/tf_session_helper.h
@@ -213,6 +213,11 @@ std::vector<int64_t> TF_GraphGetTensorShape_wrapper(TF_Graph* graph,
 std::vector<string> TF_ImportGraphDefResultsMissingUnusedInputMappings_wrapper(
     TF_ImportGraphDefResults* results);
 
+// If evaluation was possible, returns the numpy ndarray of the evaluated
+// result. Otherwise returns None.
+PyObject* TF_TryEvaluateConstant_wrapper(TF_Graph* graph, TF_Output output,
+                                         TF_Status* status);
+
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_PYTHON_CLIENT_TF_SESSION_HELPER_H_
diff --git a/tensorflow/python/client/timeline_test.py b/tensorflow/python/client/timeline_test.py
index 9641b8b7f2735e2e0477aec59edd539e999fa969..5e6b5acdb02e4c8c167485520a8d84ac43db7511 100644
--- a/tensorflow/python/client/timeline_test.py
+++ b/tensorflow/python/client/timeline_test.py
@@ -155,9 +155,12 @@ class TimelineTest(test.TestCase):
     ctf = step_analysis.chrome_trace.format_to_string()
     self._validateTrace(ctf)
     maximums = step_analysis.allocator_maximums
-    self.assertTrue('cpu' in maximums)
+    cpuname = 'cpu'
+    if 'mklcpu' in maximums:
+      cpuname = 'mkl' + cpuname
+    self.assertTrue(cpuname in maximums)
     cpu_max = maximums[
-        'cuda_host_bfc'] if 'cuda_host_bfc' in maximums else maximums['cpu']
+        'cuda_host_bfc'] if 'cuda_host_bfc' in maximums else maximums[cpuname]
     # At least num1 + num2, both float32s (4 bytes each)
     self.assertGreater(cpu_max.num_bytes, 8)
     self.assertGreater(cpu_max.timestamp, 0)
diff --git a/tensorflow/python/data/kernel_tests/cache_dataset_op_test.py b/tensorflow/python/data/kernel_tests/cache_dataset_op_test.py
index 02720a2e985914d3a6774dc6f64d1316890c46bf..25269dc810ae2e3107f8b5317496a35a8ff59d0c 100644
--- a/tensorflow/python/data/kernel_tests/cache_dataset_op_test.py
+++ b/tensorflow/python/data/kernel_tests/cache_dataset_op_test.py
@@ -297,6 +297,21 @@ class MemoryCacheDatasetTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(i2.get_next())
 
+  def testCacheTakeRepeat(self):
+    dataset = dataset_ops.Dataset.range(10).cache().take(5).repeat(2)
+    itr = dataset.make_one_shot_iterator()
+    n = itr.get_next()
+
+    expected_values = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
+
+    with self.test_session() as sess:
+      for i, expected in enumerate(expected_values):
+        self.assertEqual(expected, sess.run(n),
+                         "Unexpected value at index %s" % i)
+
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(n)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/data/kernel_tests/dataset_constructor_op_test.py b/tensorflow/python/data/kernel_tests/dataset_constructor_op_test.py
index 14627810b57f68fd96e3e3cc7b51b4fbf7365299..ea5b41e5d819743ad03f3148d654329aea51dab7 100644
--- a/tensorflow/python/data/kernel_tests/dataset_constructor_op_test.py
+++ b/tensorflow/python/data/kernel_tests/dataset_constructor_op_test.py
@@ -263,7 +263,7 @@ class DatasetConstructorTest(test.TestCase):
       for i in range(3):
         results = sess.run(get_next)
         for component, result_component in zip(
-            (zip(*components[:3])[i] + expected[i]), results):
+            (list(zip(*components[:3]))[i] + expected[i]), results):
           if sparse_tensor.is_sparse(component):
             self.assertSparseValuesEqual(component, result_component)
           else:
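Review note on the `list(zip(...))` change above: the sketch below is a minimal, standalone illustration of why the wrapper is needed on Python 3. The `components` data here is a hypothetical stand-in (the test's real `components` hold NumPy arrays), and the names `transposed`, `rows`, and `second_row` are illustrative only.

```python
# Hypothetical stand-in data; not the values used by the real test.
components = (
    [1, 2, 3],
    ["a", "b", "c"],
    [1.0, 2.0, 3.0],
)

transposed = zip(*components)

# On Python 3, `zip` returns a lazy iterator, so indexing it raises TypeError.
try:
    transposed[1]
    zip_is_indexable = True
except TypeError:
    zip_is_indexable = False

# Materializing the iterator first works on both Python 2 and Python 3.
rows = list(zip(*components))
second_row = rows[1]
print(second_row)
```

Wrapping in `list()` costs one extra copy of the transposed rows, which is presumably negligible for test-sized inputs.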
diff --git a/tensorflow/python/data/kernel_tests/filter_dataset_op_test.py b/tensorflow/python/data/kernel_tests/filter_dataset_op_test.py
index b9258b720edd4ecd620c61eed18f6f975cb7f439..4f2216f0a340acb582c2d09523b0c78af99bdd90 100644
--- a/tensorflow/python/data/kernel_tests/filter_dataset_op_test.py
+++ b/tensorflow/python/data/kernel_tests/filter_dataset_op_test.py
@@ -17,11 +17,15 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import time
+
 import numpy as np
 
+from tensorflow.python.client import session
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import functional_ops
@@ -156,6 +160,65 @@ class FilterDatasetTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
+  def testReturnComponent(self):
+    iterator = (
+        dataset_ops.Dataset.zip(
+            (dataset_ops.Dataset.range(10),
+             dataset_ops.Dataset.from_tensors(True).repeat(None)))
+        .filter(lambda x, y: y).make_initializable_iterator())
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(init_op)
+      for i in range(10):
+        self.assertEqual((i, True), sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+  def testParallelFilters(self):
+    dataset = dataset_ops.Dataset.range(10).filter(
+        lambda x: math_ops.equal(x % 2, 0))
+    iterators = [dataset.make_one_shot_iterator() for _ in range(10)]
+    next_elements = [iterator.get_next() for iterator in iterators]
+    with self.test_session() as sess:
+      self.assertEqual([0 for _ in range(10)], sess.run(next_elements))
+
+
+class FilterDatasetBenchmark(test.Benchmark):
+
+  def _benchmark(self, predicate, name):
+    with ops.Graph().as_default():
+      dataset = (
+          dataset_ops.Dataset.from_tensors(True).repeat(None).filter(predicate))
+      iterator = dataset.make_one_shot_iterator()
+      next_element = iterator.get_next()
+
+      with session.Session() as sess:
+        for _ in range(5):
+          sess.run(next_element.op)
+        deltas = []
+        for _ in range(100):
+          start = time.time()
+          for _ in range(100):
+            sess.run(next_element.op)
+          end = time.time()
+          deltas.append(end - start)
+
+        median_wall_time = np.median(deltas) / 100
+        print("Filter dataset using %s. Median wall time: %f" %
+              (name, median_wall_time))
+        self.report_benchmark(
+            iters=100,
+            wall_time=median_wall_time,
+            name="benchmark_filter_dataset_%s" % name)
+
+  def benchmarkSimpleFunction(self):
+    self._benchmark(array_ops.identity, "simple_function")
+
+  def benchmarkReturnComponentOptimization(self):
+    self._benchmark(lambda x: x, "return_component")
+
 
 if __name__ == "__main__":
   test.main()
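Review note on `FilterDatasetBenchmark._benchmark` above: the timing pattern (a few warm-up calls, then the median over rounds of batched calls, divided by the batch size) can be sketched without TensorFlow as follows. This is only an approximation of the idea under stated assumptions: `median_time_per_call` and its parameters are hypothetical names, and it uses `time.perf_counter` where the benchmark itself uses `time.time`.

```python
import statistics
import time


def median_time_per_call(fn, warmup=5, rounds=100, calls_per_round=100):
    """Return the median wall time of one `fn()` call, in seconds."""
    # Warm-up calls are excluded from timing so one-off setup cost is skipped.
    for _ in range(warmup):
        fn()
    deltas = []
    for _ in range(rounds):
        start = time.perf_counter()
        for _ in range(calls_per_round):
            fn()
        deltas.append(time.perf_counter() - start)
    # The median is less sensitive to occasional slow rounds than the mean.
    return statistics.median(deltas) / calls_per_round


per_call = median_time_per_call(lambda: sum(range(100)), rounds=20)
print("Median wall time per call: %f" % per_call)
```

Batching the calls inside each round keeps the timer overhead small relative to the measured work.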
diff --git a/tensorflow/python/data/kernel_tests/iterator_ops_test.py b/tensorflow/python/data/kernel_tests/iterator_ops_test.py
index 23c6d7385f8d4a12019fa514f349f2598d9629de..4a14a915bdb33f1ac6e8fc1839b32bc81fa8de05 100644
--- a/tensorflow/python/data/kernel_tests/iterator_ops_test.py
+++ b/tensorflow/python/data/kernel_tests/iterator_ops_test.py
@@ -22,6 +22,7 @@ import warnings
 
 import numpy as np
 
+from tensorflow.core.protobuf import cluster_pb2
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.client import session
 from tensorflow.python.data.ops import dataset_ops
@@ -44,6 +45,7 @@ from tensorflow.python.ops import script_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 from tensorflow.python.training import server_lib
+from tensorflow.python.util import compat
 
 
 class IteratorTest(test.TestCase):
@@ -63,8 +65,9 @@ class IteratorTest(test.TestCase):
 
   def testCapturingStateInOneShotRaisesException(self):
     var = variables.Variable(37.0, name="myvar")
-    dataset = (dataset_ops.Dataset.from_tensor_slices([0.0, 1.0, 2.0])
-               .map(lambda x: x + var))
+    dataset = (
+        dataset_ops.Dataset.from_tensor_slices([0.0, 1.0, 2.0])
+        .map(lambda x: x + var))
     with self.assertRaisesRegexp(
         ValueError, r"`Dataset.make_one_shot_iterator\(\)` does not support "
         "datasets that capture stateful objects.+myvar"):
@@ -78,8 +81,9 @@ class IteratorTest(test.TestCase):
     def _map_fn(x, y, z):
       return math_ops.square(x), math_ops.square(y), math_ops.square(z)
 
-    iterator = (dataset_ops.Dataset.from_tensor_slices(components).map(_map_fn)
-                .repeat(14).make_one_shot_iterator())
+    iterator = (
+        dataset_ops.Dataset.from_tensor_slices(components).map(_map_fn)
+        .repeat(14).make_one_shot_iterator())
     get_next = iterator.get_next()
 
     self.assertEqual([c.shape[1:] for c in components],
@@ -103,8 +107,9 @@ class IteratorTest(test.TestCase):
     def _map_fn(x, y, z):
       return math_ops.square(x), math_ops.square(y), math_ops.square(z)
 
-    iterator = (dataset_ops.Dataset.from_tensor_slices(tensor_components)
-                .map(_map_fn).repeat(14).make_one_shot_iterator())
+    iterator = (
+        dataset_ops.Dataset.from_tensor_slices(tensor_components)
+        .map(_map_fn).repeat(14).make_one_shot_iterator())
     get_next = iterator.get_next()
 
     self.assertEqual([c.shape[1:] for c in components],
@@ -125,10 +130,13 @@ class IteratorTest(test.TestCase):
                   np.array(37.0) * np.arange(7))
 
     def within_container():
+
       def _map_fn(x, y, z):
         return math_ops.square(x), math_ops.square(y), math_ops.square(z)
-      iterator = (dataset_ops.Dataset.from_tensor_slices(components)
-                  .map(_map_fn).repeat(14).make_one_shot_iterator())
+
+      iterator = (
+          dataset_ops.Dataset.from_tensor_slices(components)
+          .map(_map_fn).repeat(14).make_one_shot_iterator())
       return iterator.get_next()
 
     server = server_lib.Server.create_local_server()
@@ -159,8 +167,8 @@ class IteratorTest(test.TestCase):
 
     # Create a session with a single thread to ensure that the
     # one-shot iterator initializer does not deadlock.
-    config = config_pb2.ConfigProto(inter_op_parallelism_threads=1,
-                                    use_per_session_threads=True)
+    config = config_pb2.ConfigProto(
+        inter_op_parallelism_threads=1, use_per_session_threads=True)
     with session.Session(config=config) as sess:
       self.assertAllEqual([1, 4, 9], sess.run(next_element))
       with self.assertRaises(errors.OutOfRangeError):
@@ -169,6 +177,7 @@ class IteratorTest(test.TestCase):
     # Test with multiple threads invoking the one-shot iterator concurrently.
     with session.Session(config=config) as sess:
       results = []
+
       def consumer_thread():
         try:
           results.append(sess.run(next_element))
@@ -177,7 +186,8 @@ class IteratorTest(test.TestCase):
 
       num_threads = 8
       threads = [
-          self.checkedThread(consumer_thread) for _ in range(num_threads)]
+          self.checkedThread(consumer_thread) for _ in range(num_threads)
+      ]
       for t in threads:
         t.start()
       for t in threads:
@@ -205,24 +215,24 @@ class IteratorTest(test.TestCase):
         sess.run(next_element)
 
     with self.test_session() as sess:
+
       def consumer_thread():
         with self.assertRaisesRegexp(errors.InvalidArgumentError, "oops"):
           sess.run(next_element)
 
       num_threads = 8
       threads = [
-          self.checkedThread(consumer_thread) for _ in range(num_threads)]
+          self.checkedThread(consumer_thread) for _ in range(num_threads)
+      ]
       for t in threads:
         t.start()
       for t in threads:
         t.join()
 
   def testSimpleSharedResource(self):
-    components = (
-        np.array(1, dtype=np.int64),
-        np.array([1, 2, 3], dtype=np.int64),
-        np.array(37.0, dtype=np.float64)
-    )
+    components = (np.array(1, dtype=np.int64),
+                  np.array([1, 2, 3], dtype=np.int64),
+                  np.array(37.0, dtype=np.float64))
 
     server = server_lib.Server.create_local_server()
 
@@ -231,9 +241,10 @@ class IteratorTest(test.TestCase):
     # first session (initializing the iterator) is visible in the
     # second session.
     with ops.Graph().as_default():
-      iterator = (dataset_ops.Dataset.from_tensors(components)
-                  .map(lambda x, y, z: (x, y, z)).make_initializable_iterator(
-                      shared_name="shared_iterator"))
+      iterator = (
+          dataset_ops.Dataset.from_tensors(components)
+          .map(lambda x, y, z: (x, y, z)).make_initializable_iterator(
+              shared_name="shared_iterator"))
       init_op = iterator.initializer
       get_next = iterator.get_next()
 
@@ -269,8 +280,9 @@ class IteratorTest(test.TestCase):
 
   def testNotInitializedError(self):
     components = (np.array(1), np.array([1, 2, 3]), np.array(37.0))
-    iterator = (dataset_ops.Dataset.from_tensors(components)
-                .make_initializable_iterator())
+    iterator = (
+        dataset_ops.Dataset.from_tensors(components)
+        .make_initializable_iterator())
     get_next = iterator.get_next()
 
     with self.test_session() as sess:
@@ -320,8 +332,8 @@ class IteratorTest(test.TestCase):
   def testReinitializableIteratorStaticErrors(self):
     # Non-matching structure for types and shapes.
     with self.assertRaises(TypeError):
-      iterator = iterator_ops.Iterator.from_structure((dtypes.int64,
-                                                       dtypes.float64), [None])
+      iterator = iterator_ops.Iterator.from_structure(
+          (dtypes.int64, dtypes.float64), [None])
 
     # Test validation of dataset argument.
     iterator = iterator_ops.Iterator.from_structure((dtypes.int64,
@@ -337,18 +349,18 @@ class IteratorTest(test.TestCase):
     # Incompatible types.
     with self.assertRaises(TypeError):
       iterator.make_initializer(
-          dataset_ops.Dataset.from_tensors((constant_op.constant(
-              [1, 2, 3], dtype=dtypes.int32), constant_op.constant(
-                  [4., 5., 6., 7.], dtype=dtypes.float32))))
+          dataset_ops.Dataset.from_tensors(
+              (constant_op.constant([1, 2, 3], dtype=dtypes.int32),
+               constant_op.constant([4., 5., 6., 7.], dtype=dtypes.float32))))
 
     # Incompatible shapes.
     iterator = iterator_ops.Iterator.from_structure(
         (dtypes.int64, dtypes.float64), ([None], []))
     with self.assertRaises(TypeError):
       iterator.make_initializer(
-          dataset_ops.Dataset.from_tensors((constant_op.constant(
-              [1, 2, 3], dtype=dtypes.int64), constant_op.constant(
-                  [4., 5., 6., 7.], dtype=dtypes.float64))))
+          dataset_ops.Dataset.from_tensors(
+              (constant_op.constant([1, 2, 3], dtype=dtypes.int64),
+               constant_op.constant([4., 5., 6., 7.], dtype=dtypes.float64))))
 
   def testIteratorStringHandle(self):
     dataset_3 = dataset_ops.Dataset.from_tensor_slices([1, 2, 3])
@@ -370,33 +382,40 @@ class IteratorTest(test.TestCase):
       iterator_3_handle = sess.run(iterator_3.string_handle())
       iterator_4_handle = sess.run(iterator_4.string_handle())
 
-      self.assertEqual(
-          10, sess.run(next_element,
-                       feed_dict={handle_placeholder: iterator_4_handle}))
-      self.assertEqual(
-          1, sess.run(next_element,
-                      feed_dict={handle_placeholder: iterator_3_handle}))
-      self.assertEqual(
-          20, sess.run(next_element,
-                       feed_dict={handle_placeholder: iterator_4_handle}))
-      self.assertEqual(
-          2, sess.run(next_element,
-                      feed_dict={handle_placeholder: iterator_3_handle}))
-      self.assertEqual(
-          30, sess.run(next_element,
-                       feed_dict={handle_placeholder: iterator_4_handle}))
-      self.assertEqual(
-          3, sess.run(next_element,
-                      feed_dict={handle_placeholder: iterator_3_handle}))
-      self.assertEqual(
-          40, sess.run(next_element,
-                       feed_dict={handle_placeholder: iterator_4_handle}))
+      self.assertEqual(10,
+                       sess.run(
+                           next_element,
+                           feed_dict={handle_placeholder: iterator_4_handle}))
+      self.assertEqual(1,
+                       sess.run(
+                           next_element,
+                           feed_dict={handle_placeholder: iterator_3_handle}))
+      self.assertEqual(20,
+                       sess.run(
+                           next_element,
+                           feed_dict={handle_placeholder: iterator_4_handle}))
+      self.assertEqual(2,
+                       sess.run(
+                           next_element,
+                           feed_dict={handle_placeholder: iterator_3_handle}))
+      self.assertEqual(30,
+                       sess.run(
+                           next_element,
+                           feed_dict={handle_placeholder: iterator_4_handle}))
+      self.assertEqual(3,
+                       sess.run(
+                           next_element,
+                           feed_dict={handle_placeholder: iterator_3_handle}))
+      self.assertEqual(40,
+                       sess.run(
+                           next_element,
+                           feed_dict={handle_placeholder: iterator_4_handle}))
       with self.assertRaises(errors.OutOfRangeError):
-        sess.run(next_element,
-                 feed_dict={handle_placeholder: iterator_3_handle})
+        sess.run(
+            next_element, feed_dict={handle_placeholder: iterator_3_handle})
       with self.assertRaises(errors.OutOfRangeError):
-        sess.run(next_element,
-                 feed_dict={handle_placeholder: iterator_4_handle})
+        sess.run(
+            next_element, feed_dict={handle_placeholder: iterator_4_handle})
 
   def testIteratorStringHandleReuseTensorObject(self):
     dataset = dataset_ops.Dataset.from_tensor_slices([1, 2, 3])
@@ -427,8 +446,8 @@ class IteratorTest(test.TestCase):
     self.assertIsNot(handle_with_name, handle_with_same_name)
 
   def testIteratorStringHandleError(self):
-    dataset_int_scalar = (dataset_ops.Dataset.from_tensor_slices([1, 2,
-                                                                  3]).repeat())
+    dataset_int_scalar = (
+        dataset_ops.Dataset.from_tensor_slices([1, 2, 3]).repeat())
     dataset_float_vector = (dataset_ops.Dataset.from_tensors([1.0, 2.0, 3.0]))
 
     handle_placeholder = array_ops.placeholder(dtypes.string, shape=[])
@@ -522,6 +541,58 @@ class IteratorTest(test.TestCase):
                 target_placeholder: "/job:localhost/replica:0/task:0/cpu:1"
             })
 
+  def testRemoteIteratorUsingRemoteCallOpMultiWorkers(self):
+    s1 = server_lib.Server.create_local_server()
+    s2 = server_lib.Server.create_local_server()
+    s3 = server_lib.Server.create_local_server()
+
+    cluster_def = cluster_pb2.ClusterDef()
+    workers = cluster_def.job.add()
+    workers.name = "worker"
+    workers.tasks[0] = s1.target[len("grpc://"):]
+    workers.tasks[1] = s2.target[len("grpc://"):]
+    client = cluster_def.job.add()
+    client.name = "client"
+    client.tasks[0] = s3.target[len("grpc://"):]
+    config = config_pb2.ConfigProto(cluster_def=cluster_def)
+
+    worker_devices = [
+        "/job:worker/replica:0/task:%d/cpu:0" % i for i in range(2)
+    ]
+    itr_handles = []
+    for device in worker_devices:
+      with ops.device(device):
+        src = dataset_ops.Dataset.from_tensor_slices([device])
+        itr = src.make_one_shot_iterator()
+        itr_handles.append(itr.string_handle())
+
+    targets = dataset_ops.Dataset.from_tensor_slices(worker_devices)
+    handles = dataset_ops.Dataset.from_tensor_slices(itr_handles)
+
+    @function.Defun(dtypes.string)
+    def loading_func(h):
+      remote_itr = iterator_ops.Iterator.from_string_handle(
+          h, itr.output_types, itr.output_shapes)
+      return remote_itr.get_next()
+
+    def map_fn(target, handle):
+      return functional_ops.remote_call(
+          args=[handle], Tout=[dtypes.string], f=loading_func, target=target)
+
+    with ops.device("/job:client"):
+      client_dataset = dataset_ops.Dataset.zip((targets, handles)).map(map_fn)
+      itr = client_dataset.make_initializable_iterator()
+      n = itr.get_next()
+
+    with session.Session(s3.target, config=config) as sess:
+      sess.run(itr.initializer)
+      expected_values = worker_devices
+      for expected in expected_values:
+        self.assertEqual((compat.as_bytes(expected),), sess.run(n))
+
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(n)
+
   def testRemoteIteratorUsingRemoteCallOpDirectSessionGPUCPU(self):
     if not test_util.is_gpu_available():
       self.skipTest("No GPU available")
@@ -641,8 +712,7 @@ class IteratorTest(test.TestCase):
     with warnings.catch_warnings(record=True) as w:
       for _ in range(100):
         iterator.get_next()
-    self.assertEqual(100 - iterator_ops.GET_NEXT_CALL_WARNING_THRESHOLD,
-                     len(w))
+    self.assertEqual(100 - iterator_ops.GET_NEXT_CALL_WARNING_THRESHOLD, len(w))
     for warning in w:
       self.assertTrue(
           iterator_ops.GET_NEXT_CALL_WARNING_MESSAGE in str(warning.message))
diff --git a/tensorflow/python/data/kernel_tests/list_files_dataset_op_test.py b/tensorflow/python/data/kernel_tests/list_files_dataset_op_test.py
index 4e7691ee8144a19a62476281d86fb5df46dd3e4b..6442eb9ff554e61829796fb904342072d1846a32 100644
--- a/tensorflow/python/data/kernel_tests/list_files_dataset_op_test.py
+++ b/tensorflow/python/data/kernel_tests/list_files_dataset_op_test.py
@@ -46,8 +46,9 @@ class ListFilesDatasetOpTest(test.TestCase):
     dataset = dataset_ops.Dataset.list_files(path.join(self.tmp_dir, '*'))
     with self.test_session() as sess:
       itr = dataset.make_one_shot_iterator()
+      next_element = itr.get_next()
       with self.assertRaises(errors.OutOfRangeError):
-        sess.run(itr.get_next())
+        sess.run(next_element)
 
   def testSimpleDirectory(self):
     filenames = ['a', 'b', 'c']
@@ -56,13 +57,14 @@ class ListFilesDatasetOpTest(test.TestCase):
     dataset = dataset_ops.Dataset.list_files(path.join(self.tmp_dir, '*'))
     with self.test_session() as sess:
       itr = dataset.make_one_shot_iterator()
+      next_element = itr.get_next()
 
       full_filenames = []
       produced_filenames = []
       for filename in filenames:
         full_filenames.append(
             compat.as_bytes(path.join(self.tmp_dir, filename)))
-        produced_filenames.append(compat.as_bytes(sess.run(itr.get_next())))
+        produced_filenames.append(compat.as_bytes(sess.run(next_element)))
       self.assertItemsEqual(full_filenames, produced_filenames)
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(itr.get_next())
@@ -73,12 +75,13 @@ class ListFilesDatasetOpTest(test.TestCase):
 
     with self.test_session() as sess:
       itr = dataset.make_initializable_iterator()
+      next_element = itr.get_next()
       sess.run(
           itr.initializer,
           feed_dict={filename_placeholder: path.join(self.tmp_dir, '*')})
 
       with self.assertRaises(errors.OutOfRangeError):
-        sess.run(itr.get_next())
+        sess.run(next_element)
 
   def testSimpleDirectoryInitializer(self):
     filenames = ['a', 'b', 'c']
@@ -89,6 +92,7 @@ class ListFilesDatasetOpTest(test.TestCase):
 
     with self.test_session() as sess:
       itr = dataset.make_initializable_iterator()
+      next_element = itr.get_next()
       sess.run(
           itr.initializer,
           feed_dict={filename_placeholder: path.join(self.tmp_dir, '*')})
@@ -98,7 +102,7 @@ class ListFilesDatasetOpTest(test.TestCase):
       for filename in filenames:
         full_filenames.append(
             compat.as_bytes(path.join(self.tmp_dir, filename)))
-        produced_filenames.append(compat.as_bytes(sess.run(itr.get_next())))
+        produced_filenames.append(compat.as_bytes(sess.run(next_element)))
 
       self.assertItemsEqual(full_filenames, produced_filenames)
 
@@ -114,6 +118,7 @@ class ListFilesDatasetOpTest(test.TestCase):
 
     with self.test_session() as sess:
       itr = dataset.make_initializable_iterator()
+      next_element = itr.get_next()
       sess.run(
           itr.initializer,
           feed_dict={filename_placeholder: path.join(self.tmp_dir, '*.py')})
@@ -123,7 +128,7 @@ class ListFilesDatasetOpTest(test.TestCase):
       for filename in filenames[1:-1]:
         full_filenames.append(
             compat.as_bytes(path.join(self.tmp_dir, filename)))
-        produced_filenames.append(compat.as_bytes(sess.run(itr.get_next())))
+        produced_filenames.append(compat.as_bytes(sess.run(next_element)))
       self.assertItemsEqual(full_filenames, produced_filenames)
 
       with self.assertRaises(errors.OutOfRangeError):
@@ -138,6 +143,7 @@ class ListFilesDatasetOpTest(test.TestCase):
 
     with self.test_session() as sess:
       itr = dataset.make_initializable_iterator()
+      next_element = itr.get_next()
       sess.run(
           itr.initializer,
           feed_dict={filename_placeholder: path.join(self.tmp_dir, '*.py*')})
@@ -147,13 +153,44 @@ class ListFilesDatasetOpTest(test.TestCase):
       for filename in filenames[1:]:
         full_filenames.append(
             compat.as_bytes(path.join(self.tmp_dir, filename)))
-        produced_filenames.append(compat.as_bytes(sess.run(itr.get_next())))
+        produced_filenames.append(compat.as_bytes(sess.run(next_element)))
 
       self.assertItemsEqual(full_filenames, produced_filenames)
 
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(itr.get_next())
 
+  def testNoShuffle(self):
+    filenames = ['a', 'b', 'c']
+    self._touchTempFiles(filenames)
+
+    # Repeat the list twice and ensure that the order is the same each time.
+    # NOTE(mrry): This depends on an implementation detail of `list_files()`,
+    # which is that the list of files is captured when the iterator is
+    # initialized. Otherwise, or if e.g. the iterator were initialized more than
+    # once, it's possible that the non-determinism of `tf.matching_files()`
+    # would cause this test to fail. However, it serves as a useful confirmation
+    # that the `shuffle=False` argument is working as intended.
+    # TODO(b/73959787): Provide some ordering guarantees so that this test is
+    # more meaningful.
+    dataset = dataset_ops.Dataset.list_files(
+        path.join(self.tmp_dir, '*'), shuffle=False).repeat(2)
+    with self.test_session() as sess:
+      itr = dataset.make_one_shot_iterator()
+      next_element = itr.get_next()
+
+      full_filenames = []
+      produced_filenames = []
+      for filename in filenames * 2:
+        full_filenames.append(
+            compat.as_bytes(path.join(self.tmp_dir, filename)))
+        produced_filenames.append(compat.as_bytes(sess.run(next_element)))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+      self.assertItemsEqual(full_filenames, produced_filenames)
+      self.assertEqual(produced_filenames[:len(filenames)],
+                       produced_filenames[len(filenames):])
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py b/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py
index d7140088c310767d40bd2cf3413c899375acab15..1ddedfda4e1c9d6b6949f796be1870f167435763 100644
--- a/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py
+++ b/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py
@@ -21,6 +21,7 @@ import gzip
 import os
 import zlib
 
+from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.data.ops import readers
 from tensorflow.python.framework import constant_op
@@ -736,12 +737,43 @@ class TFRecordDatasetTest(test.TestCase):
     one_mebibyte = 2**20
     d = readers.TFRecordDataset(self.test_filenames, buffer_size=one_mebibyte)
     iterator = d.make_one_shot_iterator()
+    next_element = iterator.get_next()
     with self.test_session() as sess:
       for j in range(self._num_files):
         for i in range(self._num_records):
-          self.assertAllEqual(self._record(j, i), sess.run(iterator.get_next()))
+          self.assertAllEqual(self._record(j, i), sess.run(next_element))
       with self.assertRaises(errors.OutOfRangeError):
-        sess.run(iterator.get_next())
+        sess.run(next_element)
+
+  def testReadFromDatasetOfFiles(self):
+    files = dataset_ops.Dataset.from_tensor_slices(self.test_filenames)
+    d = readers.TFRecordDataset(files)
+    iterator = d.make_one_shot_iterator()
+    next_element = iterator.get_next()
+    with self.test_session() as sess:
+      for j in range(self._num_files):
+        for i in range(self._num_records):
+          self.assertAllEqual(self._record(j, i), sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testReadTenEpochsFromDatasetOfFilesInParallel(self):
+    files = dataset_ops.Dataset.from_tensor_slices(
+        self.test_filenames).repeat(10)
+    d = readers.TFRecordDataset(files, num_parallel_reads=4)
+    iterator = d.make_one_shot_iterator()
+    next_element = iterator.get_next()
+    expected = []
+    actual = []
+    with self.test_session() as sess:
+      for _ in range(10):
+        for j in range(self._num_files):
+          for i in range(self._num_records):
+            expected.append(self._record(j, i))
+            actual.append(sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+      self.assertEqual(sorted(expected), sorted(actual))
 
 
 if __name__ == "__main__":
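Review note on `testReadTenEpochsFromDatasetOfFilesInParallel` above: the test compares `sorted(expected)` with `sorted(actual)`, presumably because parallel reads may interleave records from different files. The sketch below illustrates that reasoning in plain Python. The `interleave` helper and the record values are hypothetical, and a round-robin order stands in for whatever order the real readers produce.

```python
import itertools


def interleave(streams):
    """Round-robin over `streams`, a stand-in for parallel readers."""
    for group in itertools.zip_longest(*streams):
        for item in group:
            if item is not None:
                yield item


# Two hypothetical files with two records each.
files = [[b"f0r0", b"f0r1"], [b"f1r0", b"f1r1"]]
expected = [record for records in files for record in records]
actual = list(interleave(files))

# The element order differs, but both lists hold the same records.
order_matches = actual == expected
contents_match = sorted(actual) == sorted(expected)
print(order_matches, contents_match)
```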
diff --git a/tensorflow/python/data/kernel_tests/shuffle_dataset_op_test.py b/tensorflow/python/data/kernel_tests/shuffle_dataset_op_test.py
index c089fb08c1082c1cf74d492796550980d6755591..5fcc48831f3ca744e015c92760f12ea4dbef2ff7 100644
--- a/tensorflow/python/data/kernel_tests/shuffle_dataset_op_test.py
+++ b/tensorflow/python/data/kernel_tests/shuffle_dataset_op_test.py
@@ -132,6 +132,33 @@ class ShuffleDatasetTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
+  def testSeedZero(self):
+    """Test for same behavior when the seed is a Python or Tensor zero."""
+    iterator = (
+        dataset_ops.Dataset.range(10).shuffle(10, seed=0)
+        .make_one_shot_iterator())
+    get_next = iterator.get_next()
+
+    elems = []
+    with self.test_session() as sess:
+      for _ in range(10):
+        elems.append(sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+    seed_placeholder = array_ops.placeholder(dtypes.int64, shape=[])
+    iterator = (
+        dataset_ops.Dataset.range(10).shuffle(10, seed=seed_placeholder)
+        .make_initializable_iterator())
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer, feed_dict={seed_placeholder: 0})
+      for elem in elems:
+        self.assertEqual(elem, sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
   def testDefaultArguments(self):
     components = [0, 1, 2, 3, 4]
     iterator = (dataset_ops.Dataset.from_tensor_slices(components).shuffle(5)
diff --git a/tensorflow/python/data/ops/BUILD b/tensorflow/python/data/ops/BUILD
index f12b358a7dc35c18338171e489fa88ba1a82d11b..3119ab003794cb9bc0c748dfeb47597e0877f5fd 100644
--- a/tensorflow/python/data/ops/BUILD
+++ b/tensorflow/python/data/ops/BUILD
@@ -23,6 +23,7 @@ py_library(
         "//tensorflow/python:tensor_util",
         "//tensorflow/python:util",
         "//tensorflow/python/data/util:nest",
+        "//tensorflow/python/data/util:random_seed",
         "//tensorflow/python/data/util:sparse",
         "//third_party/py/numpy",
     ],
@@ -34,6 +35,7 @@ py_library(
     srcs_version = "PY2AND3",
     deps = [
         ":dataset_ops",
+        "//tensorflow/python:array_ops",
         "//tensorflow/python:dataset_ops_gen",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
@@ -50,9 +52,11 @@ py_library(
         "//tensorflow/python:dataset_ops_gen",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
+        "//tensorflow/python:resource_variable_ops",
         "//tensorflow/python:tensor_shape",
         "//tensorflow/python/data/util:nest",
         "//tensorflow/python/data/util:sparse",
+        "//tensorflow/python/eager:context",
     ],
 )
 
diff --git a/tensorflow/python/data/ops/dataset_ops.py b/tensorflow/python/data/ops/dataset_ops.py
index 3fb1f8d5479fc461a8d1f509c5eec2d0ed4a44c9..c0a6283be433aba80eab2375cbaed6f187e3c4c3 100644
--- a/tensorflow/python/data/ops/dataset_ops.py
+++ b/tensorflow/python/data/ops/dataset_ops.py
@@ -26,16 +26,17 @@ import six
 
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import random_seed
 from tensorflow.python.data.util import sparse
 from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
-from tensorflow.python.framework import random_seed
 from tensorflow.python.framework import sparse_tensor as sparse_tensor_lib
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gen_dataset_ops
 from tensorflow.python.ops import gen_io_ops
 from tensorflow.python.ops import math_ops
@@ -90,7 +91,7 @@ class Dataset(object):
     Raises:
       RuntimeError: If eager execution is enabled.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError(
           "dataset.make_initializable_iterator is not supported when eager "
           "execution is enabled.")
@@ -110,11 +111,11 @@ class Dataset(object):
                                  self.output_types, self.output_shapes,
                                  self.output_classes)
 
-  def make_one_shot_iterator(self):
+  def __iter__(self):
     """Creates an `Iterator` for enumerating the elements of this dataset.
 
-    Note: The returned iterator will be initialized automatically.
-    A "one-shot" iterator does not currently support re-initialization.
+    The returned iterator implements the Python iterator protocol and therefore
+    can only be used in eager mode.
 
     Returns:
       An `Iterator` over the elements of this dataset.
@@ -122,10 +123,23 @@ class Dataset(object):
     Raises:
       RuntimeError: If eager execution is enabled.
     """
-    if context.in_eager_mode():
-      raise RuntimeError(
-          "dataset.make_one_shot_iterator is not supported when eager "
-          "execution is enabled.")
+    if context.executing_eagerly():
+      return iterator_ops.EagerIterator(self)
+    else:
+      raise RuntimeError("dataset.__iter__() is only supported when eager "
+                         "execution is enabled.")
+
+  def make_one_shot_iterator(self):
+    """Creates an `Iterator` for enumerating the elements of this dataset.
+
+    Note: The returned iterator will be initialized automatically.
+    A "one-shot" iterator does not currently support re-initialization.
+
+    Returns:
+      An `Iterator` over the elements of this dataset.
+    """
+    if context.executing_eagerly():
+      return iterator_ops.EagerIterator(self)
     # NOTE(mrry): We capture by value here to ensure that `_make_dataset()` is
     # a 0-argument function.
     @function.Defun(capture_by_value=True)
@@ -549,7 +563,7 @@ class Dataset(object):
 
     Args:
       buffer_size: A `tf.int64` scalar `tf.Tensor`, representing the
-        maximum number elements that will be buffered when prefetching.
+        maximum number of elements that will be buffered when prefetching.
 
     Returns:
       Dataset: A `Dataset`.
@@ -557,7 +571,7 @@ class Dataset(object):
     return PrefetchDataset(self, buffer_size)
 
   @staticmethod
-  def list_files(file_pattern):
+  def list_files(file_pattern, shuffle=None):
     """A dataset of all files matching a pattern.
 
     Example:
@@ -570,16 +584,31 @@ class Dataset(object):
         - /path/to/dir/b.py
         - /path/to/dir/c.py
 
-    NOTE: The order of the file names returned can be non-deterministic.
+    NOTE: The order of the file names returned can be non-deterministic even
+    when `shuffle` is `False`.
 
     Args:
       file_pattern: A string or scalar string `tf.Tensor`, representing
         the filename pattern that will be matched.
+      shuffle: (Optional.) If `True`, the file names will be shuffled randomly.
+        Defaults to `True`.
 
     Returns:
      Dataset: A `Dataset` of strings corresponding to file names.
     """
-    return Dataset.from_tensor_slices(gen_io_ops.matching_files(file_pattern))
+    # TODO(b/73959787): Add a `seed` argument and make the `shuffle=False`
+    # behavior deterministic (e.g. by sorting the filenames).
+    if shuffle is None:
+      shuffle = True
+    matching_files = gen_io_ops.matching_files(file_pattern)
+    dataset = Dataset.from_tensor_slices(matching_files)
+    if shuffle:
+      # NOTE(mrry): The shuffle buffer size must be greater than zero, but the
+      # list of files might be empty.
+      buffer_size = math_ops.maximum(
+          array_ops.shape(matching_files, out_type=dtypes.int64)[0], 1)
+      dataset = dataset.shuffle(buffer_size)
+    return dataset
 
   def repeat(self, count=None):
     """Repeats this dataset `count` times.
@@ -758,11 +787,31 @@ class Dataset(object):
   def padded_batch(self, batch_size, padded_shapes, padding_values=None):
     """Combines consecutive elements of this dataset into padded batches.
 
-    Like `Dataset.dense_to_sparse_batch()`, this method combines
-    multiple consecutive elements of this dataset, which might have
-    different shapes, into a single element. The tensors in the
-    resulting element have an additional outer dimension, and are
-    padded to the respective shape in `padded_shapes`.
+    This transformation combines multiple consecutive elements of the input
+    dataset into a single element. Like @{tf.data.Dataset.batch}, the tensors
+    in the resulting element have an additional outer dimension, which will be
+    `batch_size` for all but the last element, and `N % batch_size` for the
+    last element (where `N` is the number of elements in this dataset). Unlike
+    @{tf.data.Dataset.batch}, the elements may have different shapes for some
+    of their components, and this transformation will pad each component to
+    the respective shape in `padded_shapes`. The `padded_shapes` argument
+    determines the resulting shape for each dimension of each component in an
+    output element:
+
+    * If the dimension is a constant (e.g. `tf.Dimension(37)`), the component
+      will be padded out to that length in that dimension.
+    * If the dimension is unknown (e.g. `tf.Dimension(None)`), the component
+      will be padded out to the maximum length of all elements in that
+      dimension.
+
+    NOTE: If the number of elements (`N`) in this dataset is not an exact
+    multiple of `batch_size`, the final batch will contain smaller tensors
+    with shape `N % batch_size` in the batch dimension. If your program
+    depends on the batches having the same shape, consider using the
+    @{tf.contrib.data.padded_batch_and_drop_remainder} transformation instead.
+
+    See also @{tf.contrib.data.dense_to_sparse_batch}, which combines elements
+    that may have different shapes into a @{tf.SparseTensor}.
 
     Args:
       batch_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
@@ -1484,16 +1533,7 @@ class ShuffleDataset(Dataset):
     self._input_dataset = input_dataset
     self._buffer_size = ops.convert_to_tensor(
         buffer_size, dtype=dtypes.int64, name="buffer_size")
-    seed, seed2 = random_seed.get_seed(seed)
-    if seed is None:
-      self._seed = constant_op.constant(0, dtype=dtypes.int64, name="seed")
-    else:
-      self._seed = ops.convert_to_tensor(seed, dtype=dtypes.int64, name="seed")
-    if seed2 is None:
-      self._seed2 = constant_op.constant(0, dtype=dtypes.int64, name="seed2")
-    else:
-      self._seed2 = ops.convert_to_tensor(
-          seed2, dtype=dtypes.int64, name="seed2")
+    self._seed, self._seed2 = random_seed.get_seed(seed)
     if reshuffle_each_iteration is None:
       self._reshuffle_each_iteration = True
     else:
@@ -1910,47 +1950,13 @@ class FlatMapDataset(Dataset):
     return self._output_types
 
 
-class InterleaveDataset(Dataset):
+class InterleaveDataset(FlatMapDataset):
   """A `Dataset` that maps a function over its input and interleaves the result.
   """
 
   def __init__(self, input_dataset, map_func, cycle_length, block_length):
     """See `Dataset.interleave()` for details."""
-    super(InterleaveDataset, self).__init__()
-    self._input_dataset = input_dataset
-
-    @function.Defun(*nest.flatten(
-        sparse.as_dense_types(input_dataset.output_types,
-                              input_dataset.output_classes)))
-    def tf_map_func(*args):
-      """A wrapper for Defun that facilitates shape inference."""
-      # Pass in shape information from the input_dataset.
-      dense_shapes = sparse.as_dense_shapes(input_dataset.output_shapes,
-                                            input_dataset.output_classes)
-      for arg, shape in zip(args, nest.flatten(dense_shapes)):
-        arg.set_shape(shape)
-
-      nested_args = nest.pack_sequence_as(input_dataset.output_types, args)
-      nested_args = sparse.deserialize_sparse_tensors(
-          nested_args, input_dataset.output_types, input_dataset.output_shapes,
-          input_dataset.output_classes)
-      if _should_unpack_args(nested_args):
-        dataset = map_func(*nested_args)
-      else:
-        dataset = map_func(nested_args)
-
-      if not isinstance(dataset, Dataset):
-        raise TypeError("`map_func` must return a `Dataset` object.")
-
-      self._output_classes = dataset.output_classes
-      self._output_types = dataset.output_types
-      self._output_shapes = dataset.output_shapes
-
-      return dataset._as_variant_tensor()  # pylint: disable=protected-access
-
-    self._map_func = tf_map_func
-    self._map_func.add_to_graph(ops.get_default_graph())
-
+    super(InterleaveDataset, self).__init__(input_dataset, map_func)
     self._cycle_length = ops.convert_to_tensor(
         cycle_length, dtype=dtypes.int64, name="cycle_length")
     self._block_length = ops.convert_to_tensor(
@@ -1959,27 +1965,15 @@ class InterleaveDataset(Dataset):
   def _as_variant_tensor(self):
     return gen_dataset_ops.interleave_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
-        self._map_func.captured_inputs,
+        self._map_func.captured_inputs,  # pylint: disable=protected-access
         self._cycle_length,
         self._block_length,
-        f=self._map_func,
+        f=self._map_func,  # pylint: disable=protected-access
         output_types=nest.flatten(
             sparse.as_dense_types(self.output_types, self.output_classes)),
         output_shapes=nest.flatten(
             sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
 
-  @property
-  def output_classes(self):
-    return self._output_classes
-
-  @property
-  def output_shapes(self):
-    return self._output_shapes
-
-  @property
-  def output_types(self):
-    return self._output_types
-
 
 class FilterDataset(Dataset):
   """A `Dataset` that filters its input according to a predicate function."""
diff --git a/tensorflow/python/data/ops/iterator_ops.py b/tensorflow/python/data/ops/iterator_ops.py
index 4756ec74820bace5bea4e1f41ebe214420fe5c3d..d79b9d6011b6ebd00a47d572165cdbba8a31bd32 100644
--- a/tensorflow/python/data/ops/iterator_ops.py
+++ b/tensorflow/python/data/ops/iterator_ops.py
@@ -17,14 +17,18 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import threading
 import warnings
 
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import sparse
+from tensorflow.python.eager import context
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import gen_dataset_ops
+from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -412,3 +416,147 @@ class Iterator(object):
       of an element of this dataset.
     """
     return self._output_types
+
+
+_uid_counter = 0
+_uid_lock = threading.Lock()
+
+
+def _generate_shared_name(prefix):
+  with _uid_lock:
+    global _uid_counter
+    uid = _uid_counter
+    _uid_counter += 1
+  return "{}{}".format(prefix, uid)
+
+
+class EagerIterator(object):
+  """An iterator producing tf.Tensor objects from a tf.data.Dataset."""
+
+  def __init__(self, dataset):
+    """Creates a new iterator over the given dataset.
+
+    For example:
+    ```python
+    dataset = tf.data.Dataset.range(4)
+    for x in EagerIterator(dataset):
+      print(x)
+    ```
+
+    Tensors produced will be placed on the device on which this iterator object
+    was created.
+
+    Args:
+      dataset: A `tf.data.Dataset` object.
+
+    Raises:
+      RuntimeError: When invoked without eager execution enabled.
+    """
+
+    if not context.executing_eagerly():
+      raise RuntimeError(
+          "{} objects can only be used when eager execution is enabled, use "
+          "tf.data.Dataset.make_initializable_iterator or "
+          "tf.data.Dataset.make_one_shot_iterator for graph construction".
+          format(type(self)))
+    with ops.device("/device:CPU:0"):
+      ds_variant = dataset._as_variant_tensor()  # pylint: disable=protected-access
+      self._output_classes = dataset.output_classes
+      self._output_types = dataset.output_types
+      self._output_shapes = dataset.output_shapes
+      self._flat_output_types = nest.flatten(
+          sparse.as_dense_types(self._output_types, self._output_classes))
+      self._flat_output_shapes = nest.flatten(
+          sparse.as_dense_shapes(self._output_shapes, self._output_classes))
+      self._resource = gen_dataset_ops.iterator(
+          shared_name="",
+          container=_generate_shared_name("eageriterator"),
+          output_types=self._flat_output_types,
+          output_shapes=self._flat_output_shapes)
+      gen_dataset_ops.make_iterator(ds_variant, self._resource)
+      # Delete the resource when this object is deleted
+      self._resource_deleter = resource_variable_ops.EagerResourceDeleter(
+          handle=self._resource, handle_device="/device:CPU:0")
+    self._device = context.context().device_name
+
+  def __iter__(self):
+    return self
+
+  def __next__(self):  # For Python 3 compatibility
+    return self.next()
+
+  def _next_internal(self):
+    """Returns a nested structure of `tf.Tensor`s containing the next element.
+    """
+    with ops.device(self._device):
+      # TODO(ashankar): Consider removing this ops.device() contextmanager
+      # and instead mimic ops placement in graphs: Operations on resource
+      # handles execute on the same device as where the resource is placed.
+      # NOTE(mrry): Here we use the "_sync" variant of `iterator_get_next`
+      # because in eager mode this code will run synchronously on the calling
+      # thread. Therefore we do not need to make a defensive context switch
+      # to a background thread, and can achieve a small constant performance
+      # boost by invoking the iterator synchronously.
+      ret = gen_dataset_ops.iterator_get_next_sync(
+          self._resource,
+          output_types=self._flat_output_types,
+          output_shapes=self._flat_output_shapes)
+
+    return sparse.deserialize_sparse_tensors(
+        nest.pack_sequence_as(self._output_types, ret), self._output_types,
+        self._output_shapes, self._output_classes)
+
+  def next(self):
+    """Returns a nested structure of `tf.Tensor`s containing the next element.
+    """
+    try:
+      return self._next_internal()
+    except errors.OutOfRangeError:
+      raise StopIteration
+
+  @property
+  def output_classes(self):
+    """Returns the class of each component of an element of this iterator.
+
+    The expected values are `tf.Tensor` and `tf.SparseTensor`.
+
+    Returns:
+      A nested structure of Python `type` objects corresponding to each
+      component of an element of this iterator.
+    """
+    return self._output_classes
+
+  @property
+  def output_shapes(self):
+    """Returns the shape of each component of an element of this iterator.
+
+    Returns:
+      A nested structure of `tf.TensorShape` objects corresponding to each
+      component of an element of this iterator.
+    """
+    return self._output_shapes
+
+  @property
+  def output_types(self):
+    """Returns the type of each component of an element of this iterator.
+
+    Returns:
+      A nested structure of `tf.DType` objects corresponding to each component
+      of an element of this iterator.
+    """
+    return self._output_types
+
+  def get_next(self, name=None):
+    """Returns a nested structure of `tf.Tensor`s containing the next element.
+
+    Args:
+      name: (Optional.) A name for the created operation. Currently unused.
+
+    Returns:
+      A nested structure of `tf.Tensor` objects.
+
+    Raises:
+      `tf.errors.OutOfRangeError`: If the end of the dataset has been reached.
+    """
+    del name
+    return self._next_internal()
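
`EagerIterator` above exposes two ways to reach the end of a dataset: `get_next()` lets the out-of-range error propagate, while `next()` converts it to `StopIteration` so that `for` loops terminate cleanly. A minimal sketch of that adapter pattern, without TensorFlow, might look as follows. `EndOfSequence` and `SketchIterator` are hypothetical stand-ins for `tf.errors.OutOfRangeError` and the real class, and a plain list replaces the iterator resource.

```python
class EndOfSequence(Exception):
    """Hypothetical stand-in for tf.errors.OutOfRangeError."""


class SketchIterator(object):
    """Wraps a finite sequence behind the same two-level interface."""

    def __init__(self, values):
        self._values = list(values)
        self._pos = 0

    def _next_internal(self):
        # Signals exhaustion with a domain-specific error, as the op does.
        if self._pos >= len(self._values):
            raise EndOfSequence()
        value = self._values[self._pos]
        self._pos += 1
        return value

    def get_next(self):
        # Explicit API: the domain-specific error reaches the caller.
        return self._next_internal()

    def __iter__(self):
        return self

    def next(self):
        # Python iterator protocol: translate exhaustion to StopIteration.
        try:
            return self._next_internal()
        except EndOfSequence:
            raise StopIteration

    def __next__(self):  # Python 3 spelling of next()
        return self.next()


collected = list(SketchIterator(range(4)))
```

Keeping the translation in `next()` only means callers who poll with `get_next()` can still distinguish end-of-data from other failures.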
diff --git a/tensorflow/python/data/ops/readers.py b/tensorflow/python/data/ops/readers.py
index fa7601741b11f018e9b53ed3b77a7561be50d3f4..fe033f5546498d57dd98289d2cda1a8bbb1c7822 100644
--- a/tensorflow/python/data/ops/readers.py
+++ b/tensorflow/python/data/ops/readers.py
@@ -17,11 +17,14 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.python.data.ops.dataset_ops import Dataset
+from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import convert
+from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gen_dataset_ops
 from tensorflow.python.util.tf_export import tf_export
 
@@ -31,7 +34,7 @@ _DEFAULT_READER_BUFFER_SIZE_BYTES = 256 * 1024  # 256 KB
 
 
 @tf_export("data.TextLineDataset")
-class TextLineDataset(Dataset):
+class TextLineDataset(dataset_ops.Dataset):
   """A `Dataset` comprising lines from one or more text files."""
 
   def __init__(self, filenames, compression_type=None, buffer_size=None):
@@ -73,8 +76,7 @@ class TextLineDataset(Dataset):
     return dtypes.string
 
 
-@tf_export("data.TFRecordDataset")
-class TFRecordDataset(Dataset):
+class _TFRecordDataset(dataset_ops.Dataset):
   """A `Dataset` comprising records from one or more TFRecord files."""
 
   def __init__(self, filenames, compression_type=None, buffer_size=None):
@@ -87,7 +89,7 @@ class TFRecordDataset(Dataset):
       buffer_size: (Optional.) A `tf.int64` scalar representing the number of
         bytes in the read buffer. 0 means no buffering.
     """
-    super(TFRecordDataset, self).__init__()
+    super(_TFRecordDataset, self).__init__()
     # Force the type to string even if filenames is an empty list.
     self._filenames = ops.convert_to_tensor(
         filenames, dtypes.string, name="filenames")
@@ -118,8 +120,112 @@ class TFRecordDataset(Dataset):
     return dtypes.string
 
 
+class ParallelInterleaveDataset(dataset_ops.InterleaveDataset):
+  """A `Dataset` that maps a function over its input and flattens the result."""
+
+  def __init__(self, input_dataset, map_func, cycle_length, block_length,
+               sloppy, buffer_output_elements, prefetch_input_elements):
+    """See `tf.contrib.data.parallel_interleave()` for details."""
+    super(ParallelInterleaveDataset, self).__init__(input_dataset, map_func,
+                                                    cycle_length, block_length)
+    self._sloppy = ops.convert_to_tensor(
+        sloppy, dtype=dtypes.bool, name="sloppy")
+    self._buffer_output_elements = convert.optional_param_to_tensor(
+        "buffer_output_elements",
+        buffer_output_elements,
+        argument_default=2 * block_length)
+    self._prefetch_input_elements = convert.optional_param_to_tensor(
+        "prefetch_input_elements",
+        prefetch_input_elements,
+        argument_default=2 * cycle_length)
+
+  def _as_variant_tensor(self):
+    # pylint: disable=protected-access
+    return gen_dataset_ops.parallel_interleave_dataset(
+        self._input_dataset._as_variant_tensor(),
+        self._map_func.captured_inputs,
+        self._cycle_length,
+        self._block_length,
+        self._sloppy,
+        self._buffer_output_elements,
+        self._prefetch_input_elements,
+        f=self._map_func,
+        output_types=nest.flatten(
+            sparse.as_dense_types(self.output_types, self.output_classes)),
+        output_shapes=nest.flatten(
+            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+    # pylint: enable=protected-access
+
+
+@tf_export("data.TFRecordDataset")
+class TFRecordDataset(dataset_ops.Dataset):
+  """A `Dataset` comprising records from one or more TFRecord files."""
+
+  def __init__(self, filenames, compression_type=None, buffer_size=None,
+               num_parallel_reads=None):
+    """Creates a `TFRecordDataset` to read one or more TFRecord files.
+
+    NOTE: The `num_parallel_reads` argument can be used to improve performance
+    when reading from a remote filesystem.
+
+    Args:
+      filenames: A `tf.string` tensor or `tf.data.Dataset` containing one or
+        more filenames.
+      compression_type: (Optional.) A `tf.string` scalar evaluating to one of
+        `""` (no compression), `"ZLIB"`, or `"GZIP"`.
+      buffer_size: (Optional.) A `tf.int64` scalar representing the number of
+        bytes in the read buffer. 0 means no buffering.
+      num_parallel_reads: (Optional.) A `tf.int64` scalar representing the
+        number of files to read in parallel. Defaults to reading files
+        sequentially.
+
+    Raises:
+      TypeError: If any argument does not have the expected type.
+      ValueError: If any argument does not have the expected shape.
+    """
+    super(TFRecordDataset, self).__init__()
+    if isinstance(filenames, dataset_ops.Dataset):
+      if filenames.output_types != dtypes.string:
+        raise TypeError(
+            "`filenames` must be a `tf.data.Dataset` of `tf.string` elements.")
+      if not filenames.output_shapes.is_compatible_with(tensor_shape.scalar()):
+        raise ValueError(
+            "`filenames` must be a `tf.data.Dataset` of scalar `tf.string` "
+            "elements.")
+    else:
+      filenames = ops.convert_to_tensor(filenames, dtype=dtypes.string)
+      filenames = array_ops.reshape(filenames, [-1], name="flat_filenames")
+      filenames = dataset_ops.Dataset.from_tensor_slices(filenames)
+
+    def read_one_file(filename):
+      return _TFRecordDataset(filename, compression_type, buffer_size)
+
+    if num_parallel_reads is None:
+      self._impl = filenames.flat_map(read_one_file)
+    else:
+      self._impl = ParallelInterleaveDataset(
+          filenames, read_one_file, cycle_length=num_parallel_reads,
+          block_length=1, sloppy=False, buffer_output_elements=None,
+          prefetch_input_elements=None)
+
+  def _as_variant_tensor(self):
+    return self._impl._as_variant_tensor()  # pylint: disable=protected-access
+
+  @property
+  def output_classes(self):
+    return self._impl.output_classes
+
+  @property
+  def output_shapes(self):
+    return self._impl.output_shapes
+
+  @property
+  def output_types(self):
+    return self._impl.output_types
+
+
 @tf_export("data.FixedLengthRecordDataset")
-class FixedLengthRecordDataset(Dataset):
+class FixedLengthRecordDataset(dataset_ops.Dataset):
   """A `Dataset` of fixed-length records from one or more binary files."""
 
   def __init__(self,
diff --git a/tensorflow/python/data/util/BUILD b/tensorflow/python/data/util/BUILD
index e32c7b54a48dd887c2748897c3ce3661aab9f497..b1bdbdab37b63667b475c732df7a47d9e57f2b19 100644
--- a/tensorflow/python/data/util/BUILD
+++ b/tensorflow/python/data/util/BUILD
@@ -86,6 +86,30 @@ py_test(
     ],
 )
 
+py_library(
+    name = "random_seed",
+    srcs = ["random_seed.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/python:constant_op",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework",
+    ],
+)
+
+py_test(
+    name = "random_seed_test",
+    size = "small",
+    srcs = ["random_seed_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":random_seed",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:util",
+    ],
+)
+
 filegroup(
     name = "all_files",
     srcs = glob(
diff --git a/tensorflow/python/data/util/random_seed.py b/tensorflow/python/data/util/random_seed.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2c9d8672f94587fd3164f25f97b44a97526be07
--- /dev/null
+++ b/tensorflow/python/data/util/random_seed.py
@@ -0,0 +1,58 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Utilities for generating Tensor-valued random seeds."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import random_seed
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+
+
+def get_seed(seed):
+  """Returns the local seeds an operation should use given an op-specific seed.
+
+  See @{tf.get_seed} for more details. This wrapper adds support for the case
+  where `seed` may be a tensor.
+
+  Args:
+    seed: An integer or a @{tf.int64} scalar tensor.
+
+  Returns:
+    A tuple of two @{tf.int64} scalar tensors that should be used for the local
+    seed of the calling dataset.
+  """
+  seed, seed2 = random_seed.get_seed(seed)
+  if seed is None:
+    seed = constant_op.constant(0, dtype=dtypes.int64, name="seed")
+  else:
+    seed = ops.convert_to_tensor(seed, dtype=dtypes.int64, name="seed")
+  if seed2 is None:
+    seed2 = constant_op.constant(0, dtype=dtypes.int64, name="seed2")
+  else:
+    with ops.name_scope("seed2") as scope:
+      seed2 = ops.convert_to_tensor(seed2, dtype=dtypes.int64)
+      seed2 = array_ops.where(
+          math_ops.logical_and(
+              math_ops.equal(seed, 0), math_ops.equal(seed2, 0)),
+          constant_op.constant(2**31 - 1, dtype=dtypes.int64),
+          seed2,
+          name=scope)
+  return seed, seed2
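
The post-processing that `get_seed` above applies to the pair returned by the framework can be sketched with plain integers. This is an illustration of the wrapper's branching only, under the assumption that the inputs are Python ints or `None` rather than tensors; the function name `to_dataset_seeds` is hypothetical, and the upstream `random_seed.get_seed` logic is not reproduced.

```python
_MAXINT32 = 2**31 - 1


def to_dataset_seeds(seed, seed2):
    """Mirrors the wrapper's handling of a (seed, seed2) pair.

    A missing seed becomes 0. When seed2 is present and both values are 0,
    seed2 is replaced by 2**31 - 1, because the diff's test comments describe
    a (0, 0) output as nondeterministic.
    """
    if seed is None:
        seed = 0
    if seed2 is None:
        seed2 = 0
    elif seed == 0 and seed2 == 0:
        seed2 = _MAXINT32
    return seed, seed2


unseeded = to_dataset_seeds(None, None)
explicit_zero = to_dataset_seeds(0, 0)
ordinary = to_dataset_seeds(1, 1)
```

In the real wrapper the same check is expressed with `array_ops.where` because a tensor-valued seed cannot be compared in Python at graph-construction time.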
diff --git a/tensorflow/python/data/util/random_seed_test.py b/tensorflow/python/data/util/random_seed_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..33227e82afe6fe1c748693d107d4e9844abb8e09
--- /dev/null
+++ b/tensorflow/python/data/util/random_seed_test.py
@@ -0,0 +1,83 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for utilities for generating Tensor-valued random seeds."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.data.util import random_seed as data_random_seed
+from tensorflow.python.eager import context
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import random_seed
+from tensorflow.python.framework import test_util
+from tensorflow.python.platform import test
+
+
+class RandomSeedTest(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testRandomSeed(self):
+    zero_t = constant_op.constant(0, dtype=dtypes.int64, name='zero')
+    one_t = constant_op.constant(1, dtype=dtypes.int64, name='one')
+    intmax_t = constant_op.constant(
+        2**31 - 1, dtype=dtypes.int64, name='intmax')
+    test_cases = [
+        # Each test case is a tuple with input to get_seed:
+        # (input_graph_seed, input_op_seed)
+        # and output from get_seed:
+        # (output_graph_seed, output_op_seed)
+        ((None, None), (0, 0)),
+        ((None, 1), (random_seed.DEFAULT_GRAPH_SEED, 1)),
+        ((1, 1), (1, 1)),
+        ((0, 0), (0, 2**31 - 1)),  # Avoid nondeterministic (0, 0) output
+        ((2**31 - 1, 0), (0, 2**31 - 1)),  # Don't wrap to (0, 0) either
+        ((0, 2**31 - 1), (0, 2**31 - 1)),  # Wrapping for the other argument
+        # Once more, with tensor-valued arguments
+        ((None, one_t), (random_seed.DEFAULT_GRAPH_SEED, 1)),
+        ((1, one_t), (1, 1)),
+        ((0, zero_t), (0, 2**31 - 1)),  # Avoid nondeterministic (0, 0) output
+        ((2**31 - 1, zero_t), (0, 2**31 - 1)),  # Don't wrap to (0, 0) either
+        ((0, intmax_t), (0, 2**31 - 1)),  # Wrapping for the other argument
+    ]
+    for tc in test_cases:
+      tinput, toutput = tc[0], tc[1]
+      random_seed.set_random_seed(tinput[0])
+      g_seed, op_seed = data_random_seed.get_seed(tinput[1])
+      g_seed = self.evaluate(g_seed)
+      op_seed = self.evaluate(op_seed)
+      msg = 'test_case = {0}, got {1}, want {2}'.format(
+          tinput, (g_seed, op_seed), toutput)
+      self.assertEqual((g_seed, op_seed), toutput, msg=msg)
+      random_seed.set_random_seed(None)
+
+    if not context.executing_eagerly():
+      random_seed.set_random_seed(1)
+      tinput = (1, None)
+      toutput = (1, ops.get_default_graph()._last_id)  # pylint: disable=protected-access
+      random_seed.set_random_seed(tinput[0])
+      g_seed, op_seed = data_random_seed.get_seed(tinput[1])
+      g_seed = self.evaluate(g_seed)
+      op_seed = self.evaluate(op_seed)
+      msg = 'test_case = {0}, got {1}, want {2}'.format(
+          tinput, (g_seed, op_seed), toutput)
+      self.assertEqual((g_seed, op_seed), toutput, msg=msg)
+      random_seed.set_random_seed(None)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/debug/BUILD b/tensorflow/python/debug/BUILD
index 253588fc3b2986af3ab8c6be5b0b85f178c06336..512d292ee2ffa3e61cca0952c0d530c5ec9b3d2a 100644
--- a/tensorflow/python/debug/BUILD
+++ b/tensorflow/python/debug/BUILD
@@ -957,7 +957,7 @@ cuda_py_test(
 
 cuda_py_test(
     name = "session_debug_grpc_test",
-    size = "large",
+    size = "medium",
     srcs = ["lib/session_debug_grpc_test.py"],
     additional_deps = [
         ":debug_data",
@@ -967,7 +967,6 @@ cuda_py_test(
         ":grpc_wrapper",
         ":hooks",
         ":session_debug_testlib",
-        "//third_party/py/numpy",
         "//tensorflow/python:client",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:framework_for_generated_wrappers",
@@ -983,6 +982,29 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "grpc_large_data_test",
+    size = "medium",
+    srcs = ["lib/grpc_large_data_test.py"],
+    additional_deps = [
+        ":dumping_wrapper",
+        ":grpc_debug_test_server",
+        ":grpc_wrapper",
+        ":session_debug_testlib",
+        "//third_party/py/numpy",
+        "//tensorflow/python:client",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:platform_test",
+        "//tensorflow/python:variables",
+    ],
+    tags = [
+        "no_oss",  # Test flaky due to port collisions.
+        "no_windows",
+        "oss_serial",
+    ],
+)
+
 # TODO(cais): Run the test in OSS, perhaps through a sh_test.
 cuda_py_test(
     name = "dist_session_debug_grpc_test",
diff --git a/tensorflow/python/debug/README.md b/tensorflow/python/debug/README.md
index a2273b050bb1ecd5a35938c3de57fb8562f1d26d..269bbb19bdb898d1d81d0b9c618a284a437e68b9 100644
--- a/tensorflow/python/debug/README.md
+++ b/tensorflow/python/debug/README.md
@@ -37,12 +37,18 @@ models:
 * Association of nodes and tensors in graphs with Python source lines
 * Profiling of models at the level of graph nodes and Python source lines.
 (Omitted internal-only feature)
+* A [gRPC](https://grpc.io/)-based remote debugging protocol, which allows us to
+  build a browser-based graphical user interface (GUI) for TFDBG: the
+  [TensorBoard Debugger Plugin](https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md).
 
 ## How to use TFDBG?
 
 * For a walkthrough of TFDBG command-line interface, see https://www.tensorflow.org/programmers_guide/debugger.
+* For information on the web GUI of TFDBG (TensorBoard Debugger Plugin), see
+  [this README](https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md).
 * For programmatic use of the API of TFDBG, see https://www.tensorflow.org/api_docs/python/tfdbg.
 
+
 ## Related Publications
 
 * Cai, S., Breck E., Nielsen E., Salib M., Sculley D. (2016) TensorFlow Debugger:
diff --git a/tensorflow/python/debug/cli/curses_ui.py b/tensorflow/python/debug/cli/curses_ui.py
index bb52f9051250625836b0d7a0f8e30265d9b34e92..f66cefb427c9ccfa0769655415193e8d2535e53c 100644
--- a/tensorflow/python/debug/cli/curses_ui.py
+++ b/tensorflow/python/debug/cli/curses_ui.py
@@ -1185,6 +1185,22 @@ class CursesUI(base_ui.BaseUI):
       self._main_menu = None
       self._main_menu_pad = None
 
+  def _pad_line_end_with_whitespace(self, pad, row, line_end_x):
+    """Pad the whitespace at the end of a line with the default color pair.
+
+    Prevents spurious color pairs from appearing at the end of the lines in
+    certain text terminals.
+
+    Args:
+      pad: The curses pad object to operate on.
+      row: (`int`) row index.
+      line_end_x: (`int`) column index of the end of the line (beginning of
+        the whitespace).
+    """
+    if line_end_x < self._max_x - 2:
+      pad.addstr(row, line_end_x, " " * (self._max_x - 3 - line_end_x),
+                 self._default_color_pair)
+
   def _screen_add_line_to_output_pad(self, pad, row, txt, color_segments=None):
     """Render a line in a text pad.
 
@@ -1208,6 +1224,7 @@ class CursesUI(base_ui.BaseUI):
 
     if not color_segments:
       pad.addstr(row, 0, txt, self._default_color_pair)
+      self._pad_line_end_with_whitespace(pad, row, len(txt))
       return
 
     if not isinstance(color_segments, list):
@@ -1248,6 +1265,8 @@ class CursesUI(base_ui.BaseUI):
     for segment, color_pair in zip(all_segments, all_color_pairs):
       if segment[1] < self._max_x:
         pad.addstr(row, segment[0], txt[segment[0]:segment[1]], color_pair)
+    if all_segments:
+      self._pad_line_end_with_whitespace(pad, row, all_segments[-1][1])
 
   def _screen_scroll_output_pad(self, pad, viewport_top, viewport_left,
                                 screen_location_top, screen_location_left,
diff --git a/tensorflow/python/debug/lib/debug_gradients.py b/tensorflow/python/debug/lib/debug_gradients.py
index 16f51a4b32f711b97077643cec669bb8970e0b21..589a13db7f798aef3bb82dfbd442deabfbcf2a41 100644
--- a/tensorflow/python/debug/lib/debug_gradients.py
+++ b/tensorflow/python/debug/lib/debug_gradients.py
@@ -156,11 +156,12 @@ class GradientsDebugger(object):
     # TODO(cais): Implement value_stack.
     grad_debug_op_name = _tensor_to_grad_debug_op_name(input_tensor, self._uuid)
     # pylint: disable=protected-access
-    identity_op = (gen_array_ops._debug_gradient_ref_identity
-                   if input_tensor.dtype._is_ref_dtype
-                   else gen_array_ops._debug_gradient_identity)
-    debug_grad_identity = identity_op(input_tensor, name=grad_debug_op_name)
+    identity_op = (
+        gen_array_ops.debug_gradient_ref_identity
+        if input_tensor.dtype._is_ref_dtype else
+        gen_array_ops.debug_gradient_identity)
     # pylint: enable=protected-access
+    debug_grad_identity = identity_op(input_tensor, name=grad_debug_op_name)
     assert debug_grad_identity.dtype == input_tensor.dtype
     if debug_grad_identity.op.name != grad_debug_op_name:
       raise ValueError(
diff --git a/tensorflow/python/debug/lib/grpc_large_data_test.py b/tensorflow/python/debug/lib/grpc_large_data_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..5bc477a9baeb7116530fc9122b926458c1a6c08e
--- /dev/null
+++ b/tensorflow/python/debug/lib/grpc_large_data_test.py
@@ -0,0 +1,210 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for sending large-size data through tfdbg grpc channels.
+
+"Large-size data" includes large GraphDef protos and large Tensor protos.
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+from six.moves import xrange  # pylint: disable=redefined-builtin
+
+from tensorflow.python.debug.lib import grpc_debug_test_server
+from tensorflow.python.debug.lib import session_debug_testlib
+from tensorflow.python.debug.wrappers import framework
+from tensorflow.python.debug.wrappers import grpc_wrapper
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import googletest
+from tensorflow.python.platform import test
+from tensorflow.python.platform import tf_logging
+
+
+class LargeGraphAndLargeTensorsDebugTest(test_util.TensorFlowTestCase):
+
+  @classmethod
+  def setUpClass(cls):
+    (cls.debug_server_port, cls.debug_server_url, _, cls.debug_server_thread,
+     cls.debug_server
+    ) = grpc_debug_test_server.start_server_on_separate_thread(
+        dump_to_filesystem=False)
+    tf_logging.info("debug server url: %s", cls.debug_server_url)
+
+  @classmethod
+  def tearDownClass(cls):
+    cls.debug_server.stop_server().wait()
+    cls.debug_server_thread.join()
+
+  def tearDown(self):
+    ops.reset_default_graph()
+    self.debug_server.clear_data()
+
+  def testSendingLargeGraphDefsWorks(self):
+    with self.test_session(
+        use_gpu=True,
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
+      u = variables.Variable(42.0, name="original_u")
+      for _ in xrange(50 * 1000):
+        u = array_ops.identity(u)
+      sess.run(variables.global_variables_initializer())
+
+      def watch_fn(fetches, feeds):
+        del fetches, feeds
+        return framework.WatchOptions(
+            debug_ops=["DebugIdentity"],
+            node_name_regex_whitelist=r"original_u")
+      sess = grpc_wrapper.GrpcDebugWrapperSession(
+          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
+      self.assertAllClose(42.0, sess.run(u))
+
+      self.assertAllClose(
+          [42.0],
+          self.debug_server.debug_tensor_values["original_u:0:DebugIdentity"])
+      self.assertEqual(2 if test.is_gpu_available() else 1,
+                       len(self.debug_server.partition_graph_defs))
+      max_graph_def_size = max([
+          len(graph_def.SerializeToString())
+          for graph_def in self.debug_server.partition_graph_defs])
+      self.assertGreater(max_graph_def_size, 4 * 1024 * 1024)
+
+  def testSendingLargeFloatTensorWorks(self):
+    with self.test_session(
+        use_gpu=True,
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
+      u_init_val_array = list(xrange(1200 * 1024))
+      # Size: 4 * 1200 * 1024 = 4800k > 4M
+
+      u_init = constant_op.constant(
+          u_init_val_array, dtype=dtypes.float32, name="u_init")
+      u = variables.Variable(u_init, name="u")
+
+      def watch_fn(fetches, feeds):
+        del fetches, feeds  # Unused by this watch_fn.
+        return framework.WatchOptions(
+            debug_ops=["DebugIdentity"],
+            node_name_regex_whitelist=r"u_init")
+      sess = grpc_wrapper.GrpcDebugWrapperSession(
+          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
+      sess.run(u.initializer)
+
+      self.assertAllEqual(
+          u_init_val_array,
+          self.debug_server.debug_tensor_values["u_init:0:DebugIdentity"][0])
+
+  def testSendingStringTensorWithAlmostTooLargeStringsWorks(self):
+    with self.test_session(
+        use_gpu=True,
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
+      u_init_val = [
+          b"", b"spam", b"A" * 2500 * 1024, b"B" * 2500 * 1024, b"egg", b""]
+      u_init = constant_op.constant(
+          u_init_val, dtype=dtypes.string, name="u_init")
+      u = variables.Variable(u_init, name="u")
+
+      def watch_fn(fetches, feeds):
+        del fetches, feeds
+        return framework.WatchOptions(
+            debug_ops=["DebugIdentity"],
+            node_name_regex_whitelist=r"u_init")
+      sess = grpc_wrapper.GrpcDebugWrapperSession(
+          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
+      sess.run(u.initializer)
+
+      self.assertAllEqual(
+          u_init_val,
+          self.debug_server.debug_tensor_values["u_init:0:DebugIdentity"][0])
+
+  def testSendingLargeStringTensorWorks(self):
+    with self.test_session(
+        use_gpu=True,
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
+      strs_total_size_threshold = 5000 * 1024
+      cum_size = 0
+      u_init_val_array = []
+      while cum_size < strs_total_size_threshold:
+        strlen = np.random.randint(200)
+        u_init_val_array.append(b"A" * strlen)
+        cum_size += strlen
+
+      u_init = constant_op.constant(
+          u_init_val_array, dtype=dtypes.string, name="u_init")
+      u = variables.Variable(u_init, name="u")
+
+      def watch_fn(fetches, feeds):
+        del fetches, feeds
+        return framework.WatchOptions(
+            debug_ops=["DebugIdentity"],
+            node_name_regex_whitelist=r"u_init")
+      sess = grpc_wrapper.GrpcDebugWrapperSession(
+          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
+      sess.run(u.initializer)
+
+      self.assertAllEqual(
+          u_init_val_array,
+          self.debug_server.debug_tensor_values["u_init:0:DebugIdentity"][0])
+
+  def testSendingEmptyFloatTensorWorks(self):
+    with self.test_session(
+        use_gpu=True,
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
+      u_init = constant_op.constant(
+          [], dtype=dtypes.float32, shape=[0], name="u_init")
+      u = variables.Variable(u_init, name="u")
+
+      def watch_fn(fetches, feeds):
+        del fetches, feeds
+        return framework.WatchOptions(
+            debug_ops=["DebugIdentity"],
+            node_name_regex_whitelist=r"u_init")
+      sess = grpc_wrapper.GrpcDebugWrapperSession(
+          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
+      sess.run(u.initializer)
+
+      u_init_value = self.debug_server.debug_tensor_values[
+          "u_init:0:DebugIdentity"][0]
+      self.assertEqual(np.float32, u_init_value.dtype)
+      self.assertEqual(0, len(u_init_value))
+
+  def testSendingEmptyStringTensorWorks(self):
+    with self.test_session(
+        use_gpu=True,
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
+      u_init = constant_op.constant(
+          [], dtype=dtypes.string, shape=[0], name="u_init")
+      u = variables.Variable(u_init, name="u")
+
+      def watch_fn(fetches, feeds):
+        del fetches, feeds
+        return framework.WatchOptions(
+            debug_ops=["DebugIdentity"],
+            node_name_regex_whitelist=r"u_init")
+      sess = grpc_wrapper.GrpcDebugWrapperSession(
+          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
+      sess.run(u.initializer)
+
+      u_init_value = self.debug_server.debug_tensor_values[
+          "u_init:0:DebugIdentity"][0]
+      self.assertEqual(np.object, u_init_value.dtype)
+      self.assertEqual(0, len(u_init_value))
+
+
+if __name__ == "__main__":
+  googletest.main()
diff --git a/tensorflow/python/debug/lib/session_debug_file_test.py b/tensorflow/python/debug/lib/session_debug_file_test.py
index 1a6bedbbcbf94eb95e49d43e2d03c85b53bebb7b..ba0f15b4e2ff23295eae764088144a3d1b533f01 100644
--- a/tensorflow/python/debug/lib/session_debug_file_test.py
+++ b/tensorflow/python/debug/lib/session_debug_file_test.py
@@ -22,7 +22,6 @@ import shutil
 import tempfile
 
 from tensorflow.core.protobuf import config_pb2
-from tensorflow.core.protobuf import rewriter_config_pb2
 from tensorflow.python.client import session
 from tensorflow.python.debug.lib import debug_data
 from tensorflow.python.debug.lib import debug_utils
@@ -36,13 +35,6 @@ from tensorflow.python.platform import googletest
 
 class SessionDebugFileTest(session_debug_testlib.SessionDebugTestBase):
 
-  def _no_rewrite_session_config(self):
-    rewriter_config = rewriter_config_pb2.RewriterConfig(
-        disable_model_pruning=True,
-        arithmetic_optimization=rewriter_config_pb2.RewriterConfig.OFF)
-    graph_options = config_pb2.GraphOptions(rewrite_options=rewriter_config)
-    return config_pb2.ConfigProto(graph_options=graph_options)
-
   def _debug_urls(self, run_number=None):
     return ["file://%s" % self._debug_dump_dir(run_number=run_number)]
 
@@ -55,7 +47,8 @@ class SessionDebugFileTest(session_debug_testlib.SessionDebugTestBase):
   def testAllowsDifferentWatchesOnDifferentRuns(self):
     """Test watching different tensors on different runs of the same graph."""
 
-    with session.Session(config=self._no_rewrite_session_config()) as sess:
+    with session.Session(
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
       u_init_val = [[5.0, 3.0], [-1.0, 0.0]]
       v_init_val = [[2.0], [-1.0]]
 
diff --git a/tensorflow/python/debug/lib/session_debug_grpc_test.py b/tensorflow/python/debug/lib/session_debug_grpc_test.py
index b623ee31c5dc59894373ec7952e53acd0f6e1126..ff49b6954776264ccb2eceeceab7da5a881081f0 100644
--- a/tensorflow/python/debug/lib/session_debug_grpc_test.py
+++ b/tensorflow/python/debug/lib/session_debug_grpc_test.py
@@ -24,11 +24,9 @@ from __future__ import print_function
 import os
 import shutil
 
-import numpy as np
 from six.moves import xrange  # pylint: disable=redefined-builtin
 
 from tensorflow.core.protobuf import config_pb2
-from tensorflow.core.protobuf import rewriter_config_pb2
 from tensorflow.python.client import session
 from tensorflow.python.debug.lib import debug_data
 from tensorflow.python.debug.lib import debug_utils
@@ -38,28 +36,15 @@ from tensorflow.python.debug.wrappers import framework
 from tensorflow.python.debug.wrappers import grpc_wrapper
 from tensorflow.python.debug.wrappers import hooks
 from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
-from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import googletest
-from tensorflow.python.platform import test
-from tensorflow.python.platform import tf_logging
 from tensorflow.python.training import monitored_session
 
 
-def no_rewrite_session_config():
-  rewriter_config = rewriter_config_pb2.RewriterConfig(
-      disable_model_pruning=True,
-      arithmetic_optimization=rewriter_config_pb2.RewriterConfig.OFF,
-      dependency_optimization=rewriter_config_pb2.RewriterConfig.OFF)
-  graph_options = config_pb2.GraphOptions(rewrite_options=rewriter_config)
-  return config_pb2.ConfigProto(graph_options=graph_options)
-
-
 class GrpcDebugServerTest(test_util.TensorFlowTestCase):
 
   def testRepeatedRunServerRaisesException(self):
@@ -142,19 +127,22 @@ class SessionDebugGrpcTest(session_debug_testlib.SessionDebugTestBase):
       return os.path.join(self._dump_root, "run_%d" % run_number)
 
   def testConstructGrpcDebugWrapperSessionWithInvalidTypeRaisesException(self):
-    sess = session.Session(config=no_rewrite_session_config())
+    sess = session.Session(
+        config=session_debug_testlib.no_rewrite_session_config())
     with self.assertRaisesRegexp(
         TypeError, "Expected type str or list in grpc_debug_server_addresses"):
       grpc_wrapper.GrpcDebugWrapperSession(sess, 1337)
 
   def testConstructGrpcDebugWrapperSessionWithInvalidTypeRaisesException2(self):
-    sess = session.Session(config=no_rewrite_session_config())
+    sess = session.Session(
+        config=session_debug_testlib.no_rewrite_session_config())
     with self.assertRaisesRegexp(
         TypeError, "Expected type str in list grpc_debug_server_addresses"):
       grpc_wrapper.GrpcDebugWrapperSession(sess, ["localhost:1337", 1338])
 
   def testUseInvalidWatchFnTypeWithGrpcDebugWrapperSessionRaisesException(self):
-    sess = session.Session(config=no_rewrite_session_config())
+    sess = session.Session(
+        config=session_debug_testlib.no_rewrite_session_config())
     with self.assertRaises(TypeError):
       grpc_wrapper.GrpcDebugWrapperSession(
           sess, "localhost:%d" % self._server_port, watch_fn="foo")
@@ -164,7 +152,8 @@ class SessionDebugGrpcTest(session_debug_testlib.SessionDebugTestBase):
     v = variables.Variable(20.0, name="v")
     w = math_ops.multiply(u, v, name="w")
 
-    sess = session.Session(config=no_rewrite_session_config())
+    sess = session.Session(
+        config=session_debug_testlib.no_rewrite_session_config())
     sess.run(u.initializer)
     sess.run(v.initializer)
 
@@ -190,7 +179,8 @@ class SessionDebugGrpcTest(session_debug_testlib.SessionDebugTestBase):
     v = variables.Variable(20.0, name="v")
     w = math_ops.multiply(u, v, name="w")
 
-    sess = session.Session(config=no_rewrite_session_config())
+    sess = session.Session(
+        config=session_debug_testlib.no_rewrite_session_config())
     sess.run(u.initializer)
     sess.run(v.initializer)
 
@@ -223,7 +213,8 @@ class SessionDebugGrpcTest(session_debug_testlib.SessionDebugTestBase):
     v = variables.Variable(20.0, name="v")
     w = math_ops.multiply(u, v, name="w")
 
-    sess = session.Session(config=no_rewrite_session_config())
+    sess = session.Session(
+        config=session_debug_testlib.no_rewrite_session_config())
     sess.run(u.initializer)
     sess.run(v.initializer)
 
@@ -254,7 +245,8 @@ class SessionDebugGrpcTest(session_debug_testlib.SessionDebugTestBase):
     v = variables.Variable(20.0, name="v")
     w = math_ops.multiply(u, v, name="w")
 
-    sess = session.Session(config=no_rewrite_session_config())
+    sess = session.Session(
+        config=session_debug_testlib.no_rewrite_session_config())
     sess.run(u.initializer)
     sess.run(v.initializer)
 
@@ -298,7 +290,8 @@ class SessionDebugGrpcTest(session_debug_testlib.SessionDebugTestBase):
     v = variables.Variable(20.0, name="v")
     w = math_ops.multiply(u, v, name="w")
 
-    sess = session.Session(config=no_rewrite_session_config())
+    sess = session.Session(
+        config=session_debug_testlib.no_rewrite_session_config())
     sess.run(variables.global_variables_initializer())
 
     grpc_debug_hook = hooks.TensorBoardDebugHook(
@@ -324,168 +317,6 @@ class SessionDebugGrpcTest(session_debug_testlib.SessionDebugTestBase):
     hooks.GrpcDebugHook(["foo:42424"])
 
 
-class LargeGraphAndLargeTensorsDebugTest(test_util.TensorFlowTestCase):
-
-  @classmethod
-  def setUpClass(cls):
-    (cls.debug_server_port, cls.debug_server_url, _, cls.debug_server_thread,
-     cls.debug_server
-    ) = grpc_debug_test_server.start_server_on_separate_thread(
-        dump_to_filesystem=False)
-    tf_logging.info("debug server url: %s", cls.debug_server_url)
-
-  @classmethod
-  def tearDownClass(cls):
-    cls.debug_server.stop_server().wait()
-    cls.debug_server_thread.join()
-
-  def tearDown(self):
-    ops.reset_default_graph()
-    self.debug_server.clear_data()
-
-  def testSendingLargeGraphDefsWorks(self):
-    with self.test_session(
-        use_gpu=True, config=no_rewrite_session_config()) as sess:
-      u = variables.Variable(42.0, name="original_u")
-      for _ in xrange(50 * 1000):
-        u = array_ops.identity(u)
-      sess.run(variables.global_variables_initializer())
-
-      def watch_fn(fetches, feeds):
-        del fetches, feeds
-        return framework.WatchOptions(
-            debug_ops=["DebugIdentity"],
-            node_name_regex_whitelist=r"original_u")
-      sess = grpc_wrapper.GrpcDebugWrapperSession(
-          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
-      self.assertAllClose(42.0, sess.run(u))
-
-      self.assertAllClose(
-          [42.0],
-          self.debug_server.debug_tensor_values["original_u:0:DebugIdentity"])
-      self.assertEqual(2 if test.is_gpu_available() else 1,
-                       len(self.debug_server.partition_graph_defs))
-      max_graph_def_size = max([
-          len(graph_def.SerializeToString())
-          for graph_def in self.debug_server.partition_graph_defs])
-      self.assertGreater(max_graph_def_size, 4 * 1024 * 1024)
-
-  def testSendingLargeFloatTensorWorks(self):
-    with self.test_session(
-        use_gpu=True, config=no_rewrite_session_config()) as sess:
-      u_init_val_array = list(xrange(1200 * 1024))
-      # Size: 4 * 1200 * 1024 = 4800k > 4M
-
-      u_init = constant_op.constant(
-          u_init_val_array, dtype=dtypes.float32, name="u_init")
-      u = variables.Variable(u_init, name="u")
-
-      def watch_fn(fetches, feeds):
-        del fetches, feeds  # Unused by this watch_fn.
-        return framework.WatchOptions(
-            debug_ops=["DebugIdentity"],
-            node_name_regex_whitelist=r"u_init")
-      sess = grpc_wrapper.GrpcDebugWrapperSession(
-          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
-      sess.run(u.initializer)
-
-      self.assertAllEqual(
-          u_init_val_array,
-          self.debug_server.debug_tensor_values["u_init:0:DebugIdentity"][0])
-
-  def testSendingStringTensorWithAlmostTooLargeStringsWorks(self):
-    with self.test_session(
-        use_gpu=True, config=no_rewrite_session_config()) as sess:
-      u_init_val = [
-          b"", b"spam", b"A" * 2500 * 1024, b"B" * 2500 * 1024, b"egg", b""]
-      u_init = constant_op.constant(
-          u_init_val, dtype=dtypes.string, name="u_init")
-      u = variables.Variable(u_init, name="u")
-
-      def watch_fn(fetches, feeds):
-        del fetches, feeds
-        return framework.WatchOptions(
-            debug_ops=["DebugIdentity"],
-            node_name_regex_whitelist=r"u_init")
-      sess = grpc_wrapper.GrpcDebugWrapperSession(
-          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
-      sess.run(u.initializer)
-
-      self.assertAllEqual(
-          u_init_val,
-          self.debug_server.debug_tensor_values["u_init:0:DebugIdentity"][0])
-
-  def testSendingLargeStringTensorWorks(self):
-    with self.test_session(
-        use_gpu=True, config=no_rewrite_session_config()) as sess:
-      strs_total_size_threshold = 5000 * 1024
-      cum_size = 0
-      u_init_val_array = []
-      while cum_size < strs_total_size_threshold:
-        strlen = np.random.randint(200)
-        u_init_val_array.append(b"A" * strlen)
-        cum_size += strlen
-
-      u_init = constant_op.constant(
-          u_init_val_array, dtype=dtypes.string, name="u_init")
-      u = variables.Variable(u_init, name="u")
-
-      def watch_fn(fetches, feeds):
-        del fetches, feeds
-        return framework.WatchOptions(
-            debug_ops=["DebugIdentity"],
-            node_name_regex_whitelist=r"u_init")
-      sess = grpc_wrapper.GrpcDebugWrapperSession(
-          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
-      sess.run(u.initializer)
-
-      self.assertAllEqual(
-          u_init_val_array,
-          self.debug_server.debug_tensor_values["u_init:0:DebugIdentity"][0])
-
-  def testSendingEmptyFloatTensorWorks(self):
-    with self.test_session(
-        use_gpu=True, config=no_rewrite_session_config()) as sess:
-      u_init = constant_op.constant(
-          [], dtype=dtypes.float32, shape=[0], name="u_init")
-      u = variables.Variable(u_init, name="u")
-
-      def watch_fn(fetches, feeds):
-        del fetches, feeds
-        return framework.WatchOptions(
-            debug_ops=["DebugIdentity"],
-            node_name_regex_whitelist=r"u_init")
-      sess = grpc_wrapper.GrpcDebugWrapperSession(
-          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
-      sess.run(u.initializer)
-
-      u_init_value = self.debug_server.debug_tensor_values[
-          "u_init:0:DebugIdentity"][0]
-      self.assertEqual(np.float32, u_init_value.dtype)
-      self.assertEqual(0, len(u_init_value))
-
-  def testSendingEmptyStringTensorWorks(self):
-    with self.test_session(
-        use_gpu=True, config=no_rewrite_session_config()) as sess:
-      u_init = constant_op.constant(
-          [], dtype=dtypes.string, shape=[0], name="u_init")
-      u = variables.Variable(u_init, name="u")
-
-      def watch_fn(fetches, feeds):
-        del fetches, feeds
-        return framework.WatchOptions(
-            debug_ops=["DebugIdentity"],
-            node_name_regex_whitelist=r"u_init")
-      sess = grpc_wrapper.GrpcDebugWrapperSession(
-          sess, "localhost:%d" % self.debug_server_port, watch_fn=watch_fn)
-      sess.run(u.initializer)
-
-      u_init_value = self.debug_server.debug_tensor_values[
-          "u_init:0:DebugIdentity"][0]
-      self.assertEqual(np.object, u_init_value.dtype)
-      self.assertEqual(0, len(u_init_value))
-
-
 class SessionDebugConcurrentTest(
     session_debug_testlib.DebugConcurrentRunCallsTest):
 
@@ -548,7 +379,8 @@ class SessionDebugGrpcGatingTest(test_util.TensorFlowTestCase):
     self._server_2.clear_data()
 
   def testToggleEnableTwoDebugWatchesNoCrosstalkBetweenDebugNodes(self):
-    with session.Session(config=no_rewrite_session_config()) as sess:
+    with session.Session(
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
       v_1 = variables.Variable(50.0, name="v_1")
       v_2 = variables.Variable(-50.0, name="v_1")
       delta_1 = constant_op.constant(5.0, name="delta_1")
@@ -617,7 +449,8 @@ class SessionDebugGrpcGatingTest(test_util.TensorFlowTestCase):
                                         ("toggled_2", 0, "DebugIdentity")])
     self._servers_and_threads.append((server, server_thread))
 
-    with session.Session(config=no_rewrite_session_config()) as sess:
+    with session.Session(
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
       v_1 = variables.Variable(50.0, name="v_1")
       v_2 = variables.Variable(-50.0, name="v_1")
       # These two nodes have names that match those in the
@@ -656,7 +489,8 @@ class SessionDebugGrpcGatingTest(test_util.TensorFlowTestCase):
           self.assertEqual(0, len(server.debug_tensor_values))
 
   def testToggleEnableTwoDebugWatchesNoCrosstalkBetweenServers(self):
-    with session.Session(config=no_rewrite_session_config()) as sess:
+    with session.Session(
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
       v = variables.Variable(50.0, name="v")
       delta = constant_op.constant(5.0, name="delta")
       inc_v = state_ops.assign_add(v, delta, name="inc_v")
@@ -698,7 +532,8 @@ class SessionDebugGrpcGatingTest(test_util.TensorFlowTestCase):
           self.assertEqual(0, len(self._server_2.debug_tensor_values))
 
   def testToggleBreakpointsWorks(self):
-    with session.Session(config=no_rewrite_session_config()) as sess:
+    with session.Session(
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
       v_1 = variables.Variable(50.0, name="v_1")
       v_2 = variables.Variable(-50.0, name="v_2")
       delta_1 = constant_op.constant(5.0, name="delta_1")
@@ -755,7 +590,8 @@ class SessionDebugGrpcGatingTest(test_util.TensorFlowTestCase):
           self.assertSetEqual(set(), self._server_1.breakpoints)
 
   def testTensorBoardDebuggerWrapperToggleBreakpointsWorks(self):
-    with session.Session(config=no_rewrite_session_config()) as sess:
+    with session.Session(
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
       v_1 = variables.Variable(50.0, name="v_1")
       v_2 = variables.Variable(-50.0, name="v_2")
       delta_1 = constant_op.constant(5.0, name="delta_1")
@@ -827,7 +663,8 @@ class SessionDebugGrpcGatingTest(test_util.TensorFlowTestCase):
             self._server_1.query_source_file_line(__file__, 1)
 
   def testTensorBoardDebuggerWrapperDisablingTracebackSourceSendingWorks(self):
-    with session.Session(config=no_rewrite_session_config()) as sess:
+    with session.Session(
+        config=session_debug_testlib.no_rewrite_session_config()) as sess:
       v_1 = variables.Variable(50.0, name="v_1")
       v_2 = variables.Variable(-50.0, name="v_2")
       delta_1 = constant_op.constant(5.0, name="delta_1")
diff --git a/tensorflow/python/eager/BUILD b/tensorflow/python/eager/BUILD
index ab81d40148476735492890f608315b19eaa0a33f..5bedf9c6fdfcf630a72ef8a34bca38417b738d4f 100644
--- a/tensorflow/python/eager/BUILD
+++ b/tensorflow/python/eager/BUILD
@@ -42,7 +42,6 @@ py_library(
         ":backprop",
         ":context",
         ":core",
-        ":custom_gradient",
         ":execute",
         ":function",
         ":graph_callable",
@@ -103,7 +102,6 @@ cuda_py_test(
     additional_deps = [
         ":backprop",
         ":context",
-        ":custom_gradient",
         ":test",
         "//tensorflow/python:embedding_ops",
         "//tensorflow/python:array_ops",
@@ -206,21 +204,6 @@ cc_library(
     ],
 )
 
-py_library(
-    name = "custom_gradient",
-    srcs = ["custom_gradient.py"],
-    srcs_version = "PY2AND3",
-    visibility = ["//tensorflow:internal"],
-    deps = [
-        ":context",
-        ":tape",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:resource_variable_ops",
-        "//tensorflow/python:util",
-    ],
-)
-
 py_library(
     name = "graph_only_ops",
     srcs = ["graph_only_ops.py"],
@@ -364,7 +347,6 @@ py_test(
     deps = [
         ":backprop",
         ":context",
-        ":custom_gradient",
         ":test",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:constant_op",
diff --git a/tensorflow/python/eager/backprop.py b/tensorflow/python/eager/backprop.py
index 14bcc60006228eeaabea241ee18d960174a9dbea..06e11f6ef9985e078b98c5f1e3d9d0edeef2df56 100644
--- a/tensorflow/python/eager/backprop.py
+++ b/tensorflow/python/eager/backprop.py
@@ -18,7 +18,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import collections
 import functools
 import operator
 import threading
@@ -41,26 +40,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.util import nest
 from tensorflow.python.util import tf_inspect
-
-
-class _TensorCache(object):
-  """Simple cache which evicts items based on length in a FIFO manner."""
-
-  def __init__(self, max_items=256):
-    self._data = collections.OrderedDict()
-    self._max_items = max_items if max_items else 256
-
-  def put(self, key, value):
-    self._data[key] = value
-
-    if len(self._data) > self._max_items:
-      self._data.popitem(last=False)
-
-  def get(self, key):
-    return self._data.get(key, None)
-
-  def flush(self):
-    self._data = {}
+from tensorflow.python.util.tf_export import tf_export
 
 
 _op_attr_type_cache = {}
@@ -622,7 +602,7 @@ def _num_elements(grad):
   raise ValueError("`grad` not a Tensor or IndexedSlices.")
 
 
-_zeros_cache = _TensorCache()
+_zeros_cache = context._TensorCache()  # pylint: disable=protected-access
 
 
 def _fast_fill(value, shape, dtype):
@@ -658,64 +638,55 @@ _default_vspace = imperative_grad.VSpace(
     ones=_ones)
 
 
+@tf_export("GradientTape")
 class GradientTape(object):
-  """Records operations to use to compute gradients.
+  """Record operations for automatic differentiation.
 
-  Operations are recorded if:
-    - they happen in code marked by this context manager
-    - at least one of their inputs is being watched
+  Operations are recorded if they are executed within this context manager and
+  at least one of their inputs is being "watched".
 
-  Outputs of recorded operations are watched. Variables are automatically
-  watched and tensors can be manually watched by calling the watch method on the
-  context manager.
+  Variables (created by `tf.contrib.eager.Variable` or
+  @{tf.get_variable}) are automatically watched. Tensors
+  can be manually watched by invoking the `watch` method
+  on this context manager.
 
-  Example usage:
+  For example, consider the function `y = x * x`. The gradient at `x = 3.0` can
+  be computed as:
 
   ```python
+  x = tf.constant(3.)
   with tfe.GradientTape() as g:
-    x = tf.constant(3.0)
     g.watch(x)
     y = x * x
-  grad = g.gradient(y, [x])[0]
-  assert grad.numpy() == 6.0
+  grad = g.gradient(y, [x])[0]  # Will compute to 6.0
   ```
 
-  It is possible to use GradientTapes to compute higher-order derivatives as
-  follows:
+  GradientTapes can be nested to compute higher-order derivatives. For example,
 
   ```python
+  x = tf.constant(3.0)
   with tfe.GradientTape() as g:
-    x = tf.constant(3.0)
-    g.watch(x)
-    y = x * x
     with tfe.GradientTape() as gg:
-      gg.watch(y)
-      z = 2 * y
-    inner_grad = gg.gradient(z, [y])[0]
-    assert inner_grad.numpy() == 2
-    y = y + inner_grad
-  grad = g.gradient(y, [x])[0]
-  assert grad.numpy() == 6.0
+      gg.watch(x)
+      y = x * x
+    dy_dx = gg.gradient(y, [x])[0]     # Will compute to 6.0
+  d2y_dx2 = g.gradient(dy_dx, [x])[0]  # Will compute to 2.0
   ```
 
   By default, the resources held by a GradientTape are released as soon as
-  GradientTape.gradient() method is called. However, if one need to compute
-  multiple gradients over the same computation, she can create a persistent
-  GradientTape. Persistent tapes allow multiple calls to the gradient() method
-  and release resources when the tape object is destructed.
-
-  Example usage:
+  GradientTape.gradient() method is called. To compute multiple gradients over
+  the same computation, create a persistent gradient tape. This allows multiple
+  calls to the gradient() method, since resources are released only when the
+  tape object is garbage collected. For example:
 
   ```python
+  x = tf.constant(3.0)
   with tfe.GradientTape(persistent=True) as g:
-    x = tf.constant(3.0)
     g.watch(x)
     y = x * x
     z = y * y
-  dz_dx = g.gradient(z, [x])[0]
-  assert dz_dx.numpy() == 108.0   # 4*x^3 at x = 3
-  dy_dx = g.gradient(y, [x])[0]
-  assert dy_dx.numpy() == 6.0
+  dz_dx = g.gradient(z, [x])[0]  # 108.0 (4*x^3 at x = 3)
+  dy_dx = g.gradient(y, [x])[0]  # 6.0
   del g  # Drop the reference to the tape
   """
 
@@ -724,8 +695,8 @@ class GradientTape(object):
 
     Args:
       persistent: Boolean controlling whether a persistent gradient tape
-        is created. Must be True or False.
-
+        is created. False by default, which means at most one call can
+        be made to the gradient() method on this object.
     """
     self._tape = None
     self._persistent = persistent
@@ -741,7 +712,7 @@ class GradientTape(object):
     """Ensures that `tensor` is being traced by this tape.
 
     Args:
-      tensor: a Tensor or Variable a list of Tensors or Variables.
+      tensor: a Tensor or list of Tensors.
     """
     for t in nest.flatten(tensor):
       if isinstance(t, resource_variable_ops.ResourceVariable):
@@ -756,14 +727,14 @@ class GradientTape(object):
                        key=lambda v: v.handle._id))  # pylint: disable=protected-access
 
   def gradient(self, target, sources, output_gradients=None):
-    """Computes the gradient using information traced by the tape.
+    """Computes the gradient using operations recorded in this tape's context.
 
     Args:
-      target: the tensor to be differentiated.
-      sources: a list of Tensors or Variables, the target will be
-       differentiated with respect to the sources.
+      target: Tensor to be differentiated.
+      sources: a list of Tensors or Variables. `target` will be differentiated
+        against elements in `sources`.
       output_gradients: a list of gradients, one for each element of
-       target. Defaults to None.
+        target. Defaults to None.
 
     Returns:
       a list of Tensors (or IndexedSlices, or None), one for each element in
@@ -771,7 +742,7 @@ class GradientTape(object):
 
     Raises:
       RuntimeError: if called inside the context of the tape, or if called more
-       than once.
+       than once on a non-persistent tape.
     """
     if self._tape is None:
       raise RuntimeError("GradientTape.gradient can only be called once "
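Reviewer note on the `GradientTape` docstring above: the watch/record/release rules can be illustrated without TensorFlow. The following is a toy sketch only, assuming scalar values and a single multiply op; `Value`, `ToyTape`, and `mul` are hypothetical names for illustration and are not part of TensorFlow or of this change.

```python
class Value(object):
    """Scalar node; `parents` holds (parent, local_derivative) pairs."""

    def __init__(self, data, parents=()):
        self.data = data
        self.parents = parents


class ToyTape(object):
    """Toy stand-in for a gradient tape (hypothetical, not TensorFlow code)."""

    def __init__(self, persistent=False):
        self._persistent = persistent
        self._watched = []  # holds references so identity checks stay valid
        self._released = False

    def watch(self, value):
        self._watched.append(value)

    def _is_watched(self, value):
        return any(w is value for w in self._watched)

    def mul(self, a, b):
        out = Value(a.data * b.data)
        # Record the op only if at least one input is being watched.
        if self._is_watched(a) or self._is_watched(b):
            out.parents = ((a, b.data), (b, a.data))
            self.watch(out)  # outputs of recorded ops are watched too
        return out

    def gradient(self, target, source):
        if self._released:
            raise RuntimeError(
                "gradient() can only be called once on a non-persistent tape")
        if not self._persistent:
            self._released = True

        def backprop(node):
            # Chain rule: sum local derivatives times upstream derivatives.
            if node is source:
                return 1.0
            return sum(local * backprop(parent)
                       for parent, local in node.parents)

        return backprop(target)


x = Value(3.0)
tape = ToyTape(persistent=True)
tape.watch(x)
y = tape.mul(x, x)
z = tape.mul(y, y)
dz_dx = tape.gradient(z, x)  # 4 * x^3 at x = 3
dy_dx = tape.gradient(y, x)  # 2 * x at x = 3
```

With `persistent=False`, the second `gradient` call in this sketch raises `RuntimeError`, which mirrors the behaviour the updated `Raises:` section describes.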
diff --git a/tensorflow/python/eager/backprop_test.py b/tensorflow/python/eager/backprop_test.py
index 734558dee2b35813810341480eb38a4bace2936b..bca2928708d15addeda16d7d2e107326443ec11e 100644
--- a/tensorflow/python/eager/backprop_test.py
+++ b/tensorflow/python/eager/backprop_test.py
@@ -23,7 +23,6 @@ import numpy as np
 from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
-from tensorflow.python.eager import custom_gradient
 from tensorflow.python.eager import tape
 from tensorflow.python.eager import test
 from tensorflow.python.framework import constant_op
@@ -32,6 +31,7 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import custom_gradient
 from tensorflow.python.ops import embedding_ops
 from tensorflow.python.ops import gradients
 from tensorflow.python.ops import math_ops
@@ -115,6 +115,19 @@ class BackpropTest(test.TestCase):
     with self.assertRaises(RuntimeError):
       backprop.gradients_function(f)(constant_op.constant(1.0))
 
+  def testGradientsFunctionInCustomGradient(self):
+
+    @custom_gradient.custom_gradient
+    def f(x):
+      (y,) = backprop.gradients_function(lambda x: x * x)(x)
+
+      def grad(dy):
+        return [2 * dy]
+
+      return y, grad
+
+    self.assertAllEqual(f(1.0), 2.0)
+
   def testImplicitGradOverEmbeddingLookup(self):
     batch_size = 8
     embedding_size = 512
@@ -182,6 +195,19 @@ class BackpropTest(test.TestCase):
     g, = backprop.gradients_function(loss, [0])(logits, labels)
     self.assertAllEqual(g.numpy(), [[-0.5, 0.5]])
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testGradientWithinTapeBlock(self):
+    v1 = resource_variable_ops.ResourceVariable(1.)
+    self.evaluate(v1.initializer)
+    with backprop.GradientTape() as t:
+      loss = 2 * v1
+      with self.assertRaises(RuntimeError):
+        t.gradient(loss, [v1])
+    with backprop.GradientTape(persistent=True) as t:
+      loss = 2 * v1
+      grad = t.gradient(loss, [v1])
+    self.assertAllEqual(self.evaluate(grad[0]), 2.0)
+
   @test_util.assert_no_new_tensors
   def testSecondGrad(self):
 
@@ -343,6 +369,7 @@ class BackpropTest(test.TestCase):
     self.assertEqual(backprop.implicit_grad(f)()[0][0], None)
 
   @test_util.assert_no_new_tensors
+  @test_util.run_in_graph_and_eager_modes()
   def testGradientTape(self):
     with backprop.GradientTape() as g:
       x = constant_op.constant(3.0)
@@ -352,10 +379,10 @@ class BackpropTest(test.TestCase):
         gg.watch(y)
         z = 2 * y
       inner_grad = gg.gradient(z, [y])[0]
-      self.assertEqual(inner_grad.numpy(), 2.0)
+      self.assertEqual(self.evaluate(inner_grad), 2.0)
       y += inner_grad
     grad = g.gradient(y, [x])[0]
-    self.assertEqual(grad.numpy(), 6.0)
+    self.assertEqual(self.evaluate(grad), 6.0)
 
   @test_util.assert_no_new_tensors
   def testGradientTapeGradientCalledMultipleTimes(self):
@@ -370,6 +397,7 @@ class BackpropTest(test.TestCase):
       g.gradient(y, [x])
 
   @test_util.assert_no_new_tensors
+  @test_util.run_in_graph_and_eager_modes()
   def testPersistentTape(self):
     with backprop.GradientTape(persistent=True) as g:
       x = constant_op.constant(3.0)
@@ -377,12 +405,13 @@ class BackpropTest(test.TestCase):
       y = x * x
       z = y * y
     dz_dx = g.gradient(z, [x])[0]
-    self.assertEqual(dz_dx.numpy(), 4*3*3*3)
+    self.assertEqual(self.evaluate(dz_dx), 4 * 3 * 3 * 3)
     dy_dx = g.gradient(y, [x])[0]
-    self.assertEqual(dy_dx.numpy(), 2*3)
+    self.assertEqual(self.evaluate(dy_dx), 2 * 3)
     del g
 
   @test_util.assert_no_new_tensors
+  @test_util.run_in_graph_and_eager_modes()
   def testPersistentNestedTape(self):
     with backprop.GradientTape(persistent=True) as g:
       x = constant_op.constant(3.0)
@@ -393,22 +422,24 @@ class BackpropTest(test.TestCase):
         z = 2 * y
       for _ in range(2):
         inner_grad = gg.gradient(z, [y])[0]
-        self.assertEqual(inner_grad.numpy(), 2.0)
+        self.assertEqual(self.evaluate(inner_grad), 2.0)
       y += inner_grad
       del gg
     grad = g.gradient(y, [x])[0]
-    self.assertEqual(grad.numpy(), 6.0)
+    self.assertEqual(self.evaluate(grad), 6.0)
     grad = g.gradient(z, [x])[0]
-    self.assertEqual(grad.numpy(), 12.0)
+    self.assertEqual(self.evaluate(grad), 12.0)
     del g
 
   @test_util.assert_no_new_tensors
+  @test_util.run_in_graph_and_eager_modes()
   def testGradientTapeVariable(self):
     v = resource_variable_ops.ResourceVariable(1.0, name='v')
+    self.evaluate(v.initializer)
     with backprop.GradientTape() as g:
       y = v * v
     grad = g.gradient(y, [v])[0]
-    self.assertAllEqual(grad, 2.0)
+    self.assertAllEqual(self.evaluate(grad), 2.0)
 
   @test_util.assert_no_new_tensors
   def testEmptyParamsForValueAndGradFunction(self):
diff --git a/tensorflow/python/eager/benchmarks_test.py b/tensorflow/python/eager/benchmarks_test.py
index b56cbe80a7ab6b90d715187b0f0a44847038fc37..9ca5041c38ed07b39fd73b9f110ab06e8e903251 100644
--- a/tensorflow/python/eager/benchmarks_test.py
+++ b/tensorflow/python/eager/benchmarks_test.py
@@ -35,7 +35,6 @@ from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.eager import backprop  # pylint: disable=unused-import
 from tensorflow.python.eager import context
 from tensorflow.python.eager import core
-from tensorflow.python.eager import execute
 from tensorflow.python.eager import function
 from tensorflow.python.eager import test
 from tensorflow.python.framework import dtypes
@@ -56,11 +55,11 @@ def c_tfe_py_fastpath_execute(a,
                               transpose_b=False,
                               name=None):
   ctx = context.context()
-  assert not ctx.in_graph_mode(
+  assert ctx.executing_eagerly(
   ), "The prototype doesn't contain C code for graph construction"
   try:
     return pywrap_tensorflow.TFE_Py_FastPathExecute(
-        ctx._handle, ctx.device_name, "MatMul", execute.record_gradient, name,
+        ctx._handle, ctx.device_name, "MatMul", name,
         ctx._post_execution_callbacks, a, b, "transpose_a", transpose_a,
         "transpose_b", transpose_b)
   except core._NotOkStatusException as e:
@@ -83,16 +82,24 @@ class MicroBenchmarks(test.Benchmark):
     self._num_iters_2_by_2 = 30000
     self._num_iters_100_by_784 = 1000
 
-  def _run(self, func, num_iters):
+  def _run(self, func, num_iters, execution_mode=None):
     # call func to maybe warm up the GPU
-    func()
-    start = time.time()
-    for _ in xrange(num_iters):
+    ctx = context.context()
+    with ctx.execution_mode(execution_mode):
       func()
-    end = time.time()
-    mean_us = (end - start) * 1e6 / num_iters
-    self.report_benchmark(iters=num_iters, wall_time=mean_us,
-                          extras={"examples_per_sec": num_iters/(end-start)})
+      if execution_mode == context.ASYNC:
+        ctx.async_wait()
+      start = time.time()
+      for _ in xrange(num_iters):
+        func()
+      if execution_mode == context.ASYNC:
+        ctx.async_wait()
+      end = time.time()
+      mean_us = (end - start) * 1e6 / num_iters
+      self.report_benchmark(
+          iters=num_iters,
+          wall_time=mean_us,
+          extras={"examples_per_sec": num_iters / (end - start)})
 
   def benchmark_create_np_array(self):
     func = lambda: np.array([3.0])
@@ -237,13 +244,15 @@ class MicroBenchmarks(test.Benchmark):
     func = lambda: np.dot(a, b)
     self._run(func, num_iters)
 
-  def _benchmark_tf_matmul(self, m, transpose_b, num_iters):
+  def _benchmark_tf_matmul(self, m, transpose_b, num_iters,
+                           execution_mode=None):
     func = lambda: math_ops.matmul(m, m, transpose_b=transpose_b)
-    self._run(func, num_iters)
+    self._run(func, num_iters, execution_mode=execution_mode)
 
   def _benchmark_gen_math_ops_matmul(self, m, transpose_b, num_iters):
     def func():
-      gen_math_ops._mat_mul(m, m, transpose_b=transpose_b)
+      gen_math_ops.mat_mul(m, m, transpose_b=transpose_b)
+
     self._run(func, num_iters)
 
   def _benchmark_tfe_py_fastpath_execute_matmul(self, m, transpose_b,
@@ -267,14 +276,28 @@ class MicroBenchmarks(test.Benchmark):
 
     self._run(func, num_iters)
 
-  def _benchmark_defun_matmul(self, m, transpose_b, num_iters):
+  def _benchmark_defun_matmul(self,
+                              m,
+                              transpose_b,
+                              num_iters,
+                              execution_mode=None):
     f = function.defun(math_ops.matmul)
     func = lambda: f(m, m, transpose_b)
-    self._run(func, num_iters)
+    self._run(func, num_iters, execution_mode=execution_mode)
 
   def _benchmark_read_variable(self, m, num_iters):
     self._run(m.value, num_iters)
 
+  def _benchmark_matmul_read_variable(self, m, num_iters):
+    self._benchmark_gen_math_ops_matmul(
+        m, transpose_b=False, num_iters=num_iters)
+
+  def _benchmark_matmul_read_variable_with_tape(self, m, num_iters):
+    with backprop.GradientTape() as tape:
+      tape.watch(m)
+      self._benchmark_gen_math_ops_matmul(
+          m, transpose_b=False, num_iters=num_iters)
+
   def _benchmark_read_variable_with_tape(self, m, num_iters):
     with backprop.GradientTape() as tape:
       tape.watch(m)
@@ -291,6 +314,15 @@ class MicroBenchmarks(test.Benchmark):
       self._benchmark_tf_matmul(
           m, transpose_b=False, num_iters=self._num_iters_2_by_2)
 
+  def benchmark_tf_matmul_2_by_2_CPU_async(self):
+    with context.device(CPU):
+      m = self._m_2_by_2.cpu()
+      self._benchmark_tf_matmul(
+          m,
+          transpose_b=False,
+          num_iters=self._num_iters_2_by_2,
+          execution_mode=context.ASYNC)
+
   def benchmark_gen_math_ops_matmul_2_by_2_CPU(self):
     with context.device(CPU):
       m = self._m_2_by_2.cpu()
@@ -315,6 +347,15 @@ class MicroBenchmarks(test.Benchmark):
       self._benchmark_defun_matmul(
           m, transpose_b=False, num_iters=self._num_iters_2_by_2)
 
+  def benchmark_defun_matmul_2_by_2_CPU_async(self):
+    with context.device(CPU):
+      m = self._m_2_by_2.cpu()
+      self._benchmark_defun_matmul(
+          m,
+          transpose_b=False,
+          num_iters=self._num_iters_2_by_2,
+          execution_mode=context.ASYNC)
+
   def benchmark_tf_matmul_2_by_2_GPU(self):
     if not context.num_gpus():
       return
@@ -323,6 +364,17 @@ class MicroBenchmarks(test.Benchmark):
       self._benchmark_tf_matmul(
           m, transpose_b=False, num_iters=self._num_iters_2_by_2)
 
+  def benchmark_tf_matmul_2_by_2_GPU_async(self):
+    if not context.num_gpus():
+      return
+    with context.device(GPU):
+      m = self._m_2_by_2.gpu()
+      self._benchmark_tf_matmul(
+          m,
+          transpose_b=False,
+          num_iters=self._num_iters_2_by_2,
+          execution_mode=context.ASYNC)
+
   def benchmark_gen_math_ops_matmul_2_by_2_GPU(self):
     if not context.num_gpus():
       return
@@ -347,6 +399,17 @@ class MicroBenchmarks(test.Benchmark):
       self._benchmark_defun_matmul(
           m, transpose_b=False, num_iters=self._num_iters_2_by_2)
 
+  def benchmark_defun_matmul_2_by_2_GPU_async(self):
+    if not context.num_gpus():
+      return
+    with context.device(GPU):
+      m = self._m_2_by_2.gpu()
+      self._benchmark_defun_matmul(
+          m,
+          transpose_b=False,
+          num_iters=self._num_iters_2_by_2,
+          execution_mode=context.ASYNC)
+
   # Benchmarks for AA.T, A of dimension 100 by 784.
   def benchmark_np_matmul_100_by_784(self):
     self._benchmark_np_matmul(
@@ -360,6 +423,15 @@ class MicroBenchmarks(test.Benchmark):
       self._benchmark_tf_matmul(
           m, transpose_b=True, num_iters=self._num_iters_100_by_784)
 
+  def benchmark_tf_matmul_100_by_784_CPU_async(self):
+    with context.device(CPU):
+      m = self._m_100_by_784.cpu()
+      self._benchmark_tf_matmul(
+          m,
+          transpose_b=True,
+          num_iters=self._num_iters_100_by_784,
+          execution_mode=context.ASYNC)
+
   def benchmark_gen_math_ops_matmul_100_by_784_CPU(self):
     with context.device(CPU):
       m = self._m_100_by_784.cpu()
@@ -392,6 +464,17 @@ class MicroBenchmarks(test.Benchmark):
       self._benchmark_tf_matmul(
           m, transpose_b=True, num_iters=self._num_iters_100_by_784)
 
+  def benchmark_tf_matmul_100_by_784_GPU_async(self):
+    if not context.num_gpus():
+      return
+    with context.device(GPU):
+      m = self._m_100_by_784.gpu()
+      self._benchmark_tf_matmul(
+          m,
+          transpose_b=True,
+          num_iters=self._num_iters_100_by_784,
+          execution_mode=context.ASYNC)
+
   def benchmark_gen_math_ops_matmul_100_by_784_GPU(self):
     if not context.num_gpus():
       return
@@ -416,6 +499,17 @@ class MicroBenchmarks(test.Benchmark):
       self._benchmark_defun_matmul(
           m, transpose_b=True, num_iters=self._num_iters_100_by_784)
 
+  def benchmark_matmul_read_variable_op_2_by_2_CPU(self):
+    with context.device(CPU):
+      m = resource_variable_ops.ResourceVariable(self._m_2_by_2)
+      self._benchmark_matmul_read_variable(m, num_iters=self._num_iters_2_by_2)
+
+  def benchmark_matmul_read_variable_op_with_tape_2_by_2_CPU(self):
+    with context.device(CPU):
+      m = resource_variable_ops.ResourceVariable(self._m_2_by_2)
+      self._benchmark_matmul_read_variable_with_tape(
+          m, num_iters=self._num_iters_2_by_2)
+
   def benchmark_read_variable_op_2_by_2_CPU(self):
     with context.device(CPU):
       m = resource_variable_ops.ResourceVariable(self._m_2_by_2)
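Reviewer note on the reworked `_run` helper above: the timing pattern (one untimed warm-up call, then a wait before starting and before stopping the clock) can be sketched in plain Python. This is a hedged illustration, not the benchmark code itself; `run_benchmark` and its `wait` hook are hypothetical, with `wait` standing in for `ctx.async_wait()` in ASYNC mode.

```python
import time


def run_benchmark(func, num_iters, wait=None):
    """Returns the mean wall time per call, in microseconds.

    `wait` is a hypothetical hook that blocks until queued work has finished;
    pass None when calls complete synchronously.
    """
    func()  # warm-up call, excluded from the measurement
    if wait is not None:
        wait()  # drain the warm-up work before starting the clock
    start = time.time()
    for _ in range(num_iters):
        func()
    if wait is not None:
        wait()  # pending work must finish before the clock stops
    end = time.time()
    return (end - start) * 1e6 / num_iters


calls = []
waits = []
mean_us = run_benchmark(lambda: calls.append(1), num_iters=5,
                        wait=lambda: waits.append(1))
```

Without the second wait, an asynchronous run would only measure how long it takes to enqueue the operations, not to execute them.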
diff --git a/tensorflow/python/eager/context.py b/tensorflow/python/eager/context.py
index 0e9c21b221c64aaa445fde59514c7e50f8d8b773..6c9a14730c0db4bdf23fc10b23d63b758349bdc1 100644
--- a/tensorflow/python/eager/context.py
+++ b/tensorflow/python/eager/context.py
@@ -32,6 +32,7 @@ from tensorflow.python.framework import errors
 from tensorflow.python.util import compat
 from tensorflow.python.util import is_in_graph_mode
 from tensorflow.python.util import tf_contextlib
+from tensorflow.python.util.tf_export import tf_export
 
 GRAPH_MODE = 0
 EAGER_MODE = 1
@@ -52,6 +53,28 @@ DEVICE_PLACEMENT_WARN = pywrap_tensorflow.TFE_DEVICE_PLACEMENT_WARN
 DEVICE_PLACEMENT_SILENT = pywrap_tensorflow.TFE_DEVICE_PLACEMENT_SILENT
 DEVICE_PLACEMENT_SILENT_FOR_INT32 = (
     pywrap_tensorflow.TFE_DEVICE_PLACEMENT_SILENT_FOR_INT32)
+SYNC = 0
+ASYNC = 1
+
+
+class _TensorCache(object):
+  """Simple cache which evicts items based on length in a FIFO manner."""
+
+  def __init__(self, max_items=256):
+    self._data = collections.OrderedDict()
+    self._max_items = max_items if max_items else 256
+
+  def put(self, key, value):
+    self._data[key] = value
+
+    if len(self._data) > self._max_items:
+      self._data.popitem(last=False)
+
+  def get(self, key):
+    return self._data.get(key, None)
+
+  def flush(self):
+    self._data = collections.OrderedDict()
 
 
 # TODO(agarwal): better name ?
@@ -67,24 +90,36 @@ class _EagerContext(threading.local):
     self.recording_summaries = False
     self.summary_writer_resource = None
     self.scalar_cache = {}
+    self.ones_rank_cache = _TensorCache()
+    self.execution_mode = None
 
 
-ContextStackEntry = collections.namedtuple(
-    "ContextStackEntry", ["is_building_function", "enter_context_fn"])
+ContextSwitch = collections.namedtuple(
+    "ContextSwitch", ["is_building_function", "enter_context_fn"])
 
 
-class ContextStack(threading.local):
+# `_ContextSwitchStack` is a `threading.local` to match the semantics of
+# `DefaultGraphStack`, which is also a `threading.local`.
+class _ContextSwitchStack(threading.local):
   """A thread-local stack of context switches."""
 
-  def __init__(self):
-    super(ContextStack, self).__init__()
+  def __init__(self, eager):
+    super(_ContextSwitchStack, self).__init__()
     self.stack = []
+    if eager:
+      # Initialize the stack with a pointer to enter the eager context; this
+      # ensures that the fact that eager execution was enabled is propagated
+      # across threads, since (1) `enable_eager_execution` modifies a
+      # process-level flag (`_default_mode`) and (2) `__init__` is called each
+      # time a threading.local object is used in a separate thread.
+      self.push(is_building_function=False, enter_context_fn=eager_mode)
 
   def push(self, is_building_function, enter_context_fn):
     """Push metadata about a context switch onto the stack.
 
     A context switch can take one of two forms: installing a graph as the
-    default graph, or entering the eager context.
+    default graph, or entering the eager context. For each context switch,
+    we record whether or not the entered context is building a function.
 
     Args:
       is_building_function: (bool.) Whether the context is building a function.
@@ -93,7 +128,7 @@ class ContextStack(threading.local):
     """
 
     self.stack.append(
-        ContextStackEntry(is_building_function, enter_context_fn))
+        ContextSwitch(is_building_function, enter_context_fn))
 
   def pop(self):
     """Pop the stack."""
@@ -101,34 +136,49 @@ class ContextStack(threading.local):
     self.stack.pop()
 
 
-context_stack = ContextStack()
-
-
 # TODO(agarwal): rename to EagerContext / EagerRuntime ?
 # TODO(agarwal): consider keeping the corresponding Graph here.
 class Context(object):
   """Environment in which eager operations execute."""
 
-  def __init__(self, config=None, device_policy=None):
+  # TODO(agarwal): create and link in some documentation for `execution_mode`.
+  # pylint: disable=redefined-outer-name
+  def __init__(self, config=None, device_policy=None, execution_mode=None):
     """Creates a new Context.
 
     Args:
       config: (Optional.) A `ConfigProto` protocol buffer with configuration
-       options for the Context. Note that a lot of these options may be
-       currently unimplemented or irrelevant when eager execution is enabled.
+        options for the Context. Note that a lot of these options may be
+        currently unimplemented or irrelevant when eager execution is enabled.
       device_policy: (Optional.) What policy to use when trying to run an
-       operation on a device with inputs which are not on that device.
-       Valid values:
-         tfe.DEVICE_PLACEMENT_EXPLICIT: raises an error if the placement is not
-           correct.
-         tfe.DEVICE_PLACEMENT_WARN: copies the tensors which are not on the
+         operation on a device with inputs which are not on that device.
+         When set to None, an appropriate value will be picked automatically.
+         The value picked may change between TensorFlow releases.
+
+         Defaults to tf.contrib.eager.DEVICE_PLACEMENT_SILENT_FOR_INT32.
+         Valid values:
+         - tfe.DEVICE_PLACEMENT_EXPLICIT: raises an error if the placement is
+           not correct.
+         - tfe.DEVICE_PLACEMENT_WARN: copies the tensors which are not on the
            right device but raises a warning.
-         tfe.DEVICE_PLACEMENT_SILENT: silently copies the tensors. This might
+         - tfe.DEVICE_PLACEMENT_SILENT: silently copies the tensors. This might
            hide performance problems.
-         tfe.DEVICE_PLACEMENT_SILENT_FOR_INT32: silently copies int32 tensors,
+         - tfe.DEVICE_PLACEMENT_SILENT_FOR_INT32: silently copies int32 tensors,
            raising errors on the other ones.
+      execution_mode: (Optional.) Policy controlling how dispatched operations
+        are actually executed. When set to None, an appropriate value will be
+        picked automatically. The value picked may change between TensorFlow
+        releases.
+        Valid values:
+        - tf.contrib.eager.SYNC: executes each operation synchronously.
+        - tf.contrib.eager.ASYNC: executes each operation asynchronously. These
+          operations may return "non-ready" handles.
+
+    Raises:
+      ValueError: If execution_mode is not valid.
     """
     self._eager_context = _EagerContext()
+    self._context_switches = _ContextSwitchStack(self.executing_eagerly())
     self._context_handle = None
     self._context_devices = None
     self._post_execution_callbacks = []
@@ -136,6 +186,14 @@ class Context(object):
     self._seed = None
     self._initialize_lock = threading.Lock()
     self._device_policy = device_policy
+    if execution_mode not in (None, SYNC, ASYNC):
+      raise ValueError(
+          "execution_mode should be None/SYNC/ASYNC. Got %s" % execution_mode)
+    if execution_mode is None:
+      execution_mode = SYNC
+    self._execution_mode = execution_mode
+
+  # pylint: enable=redefined-outer-name
 
   def _set_global_seed(self, seed):
     """Set a global eager mode seed for random ops."""
@@ -173,6 +231,8 @@ class Context(object):
           if self._device_policy is not None:
             pywrap_tensorflow.TFE_ContextOptionsSetDevicePlacementPolicy(
                 opts, self._device_policy)
+          if self._execution_mode == ASYNC:
+            pywrap_tensorflow.TFE_ContextOptionsSetAsync(opts, True)
           self._context_handle = pywrap_tensorflow.TFE_NewContext(opts, status)
       finally:
         pywrap_tensorflow.TFE_DeleteContextOptions(opts)
@@ -231,26 +291,29 @@ class Context(object):
     old_mode = ctx.mode
     ctx.mode = mode
     if mode == EAGER_MODE:
-      context_stack.push(False, eager_mode)
+      # Entering graph mode does not provide us with sufficient information to
+      # record a context switch; graph-based context switches are only logged
+      # when a graph is registered as the default graph.
+      self.context_switches.push(False, eager_mode)
     try:
       yield
     finally:
       ctx.mode = old_mode
       if mode == EAGER_MODE:
-        context_stack.pop()
+        self.context_switches.pop()
 
-  def in_graph_mode(self):
-    """Returns True if current thread is in GRAPH mode."""
-    return self._eager_context.mode == GRAPH_MODE
-
-  def in_eager_mode(self):
-    """Returns True if current thread is in EAGER mode."""
+  def executing_eagerly(self):
+    """Returns True if current thread has eager execution enabled."""
     return self._eager_context.mode == EAGER_MODE
 
   def scalar_cache(self):
     """Per-device cache for scalars."""
     return self._eager_context.scalar_cache
 
+  def ones_rank_cache(self):
+    """Per-device cache for ones."""
+    return self._eager_context.ones_rank_cache
+
   @property
   def scope_name(self):
     """Returns scope name for the current thread."""
@@ -334,6 +397,43 @@ class Context(object):
     """List of the names of devices available to execute operations."""
     return self._devices
 
+  def get_execution_mode(self):
+    mode = self._eager_context.execution_mode
+    if mode is None:
+      mode = self._execution_mode
+    return mode
+
+  def set_execution_mode(self, mode):
+    """Sets execution mode for current thread."""
+    if mode not in (None, SYNC, ASYNC):
+      raise ValueError(
+          "Execution mode should be None/SYNC/ASYNC. Got %s" % mode)
+    if mode is None:
+      mode = SYNC
+    self._eager_context.execution_mode = mode
+    with errors.raise_exception_on_not_ok_status() as status:
+      pywrap_tensorflow.TFE_ContextSetAsyncForThread(self._handle,
+                                                     mode == ASYNC, status)
+
+  @tf_contextlib.contextmanager
+  def execution_mode(self, mode):
+    """Context manager for setting execution mode for current thread."""
+    old_mode = self.get_execution_mode()
+    try:
+      self.set_execution_mode(mode)
+      yield
+    finally:
+      self.set_execution_mode(old_mode)
+
+  def async_wait(self):
+    """Waits for ops dispatched in ASYNC mode to finish."""
+    with errors.raise_exception_on_not_ok_status() as status:
+      pywrap_tensorflow.TFE_ContextAsyncWait(self._handle, status)
+
+  def async_clear_error(self):
+    """Clears errors raised during ASYNC execution."""
+    pywrap_tensorflow.TFE_ContextAsyncClearError(self._handle)
+
   def num_gpus(self):
     """The number of GPUs available to execute operations."""
     self._initialize_handle_and_devices()
@@ -456,6 +556,11 @@ class Context(object):
     run_metadata.ParseFromString(compat.as_bytes(proto_data))
     return run_metadata
 
+  @property
+  def context_switches(self):
+    """Returns a stack of context switches."""
+    return self._context_switches
+
 _context = None
 _context_lock = threading.Lock()
 
@@ -497,23 +602,29 @@ def internal_operation_seed():
   return context()._internal_operation_seed()  # pylint: disable=protected-access
 
 
-def in_graph_mode():
-  """Returns True if current thread is in GRAPH mode for default context."""
-  return context().in_graph_mode()
+@tf_export("executing_eagerly")
+def executing_eagerly():
+  """Returns True if the current thread has eager execution enabled.
+
+  Eager execution is typically enabled via @{tf.enable_eager_execution},
+  but may also be enabled within the context of a Python function via
+  tf.contrib.eager.py_func.
+  """
+  return context().executing_eagerly()
 
 
 def in_eager_mode():
-  """Returns True if current thread is in EAGER mode for default context."""
-  return context().in_eager_mode()
+  """Use executing_eagerly() instead. This function will be removed."""
+  return executing_eagerly()
 
 
 def graph_mode():
-  """Context-manager to enable GRAPH mode for current thread."""
+  """Context-manager to disable eager execution for the current thread."""
   return context()._mode(GRAPH_MODE)  # pylint: disable=protected-access
 
 
 def eager_mode():
-  """Context-manager to enable EAGER mode for current thread."""
+  """Context-manager to enable eager execution for the current thread."""
   return context()._mode(EAGER_MODE)  # pylint: disable=protected-access
 
 
@@ -567,6 +678,26 @@ def list_devices():
   return context().devices()
 
 
+def set_execution_mode(mode):
+  """Sets execution mode for the current thread."""
+  context().set_execution_mode(mode)
+
+
+def execution_mode(mode):
+  """Context manager for setting execution mode for current thread."""
+  return context().execution_mode(mode)
+
+
+def async_wait():
+  """Waits for ops dispatched in ASYNC mode to finish."""
+  return context().async_wait()
+
+
+def async_clear_error():
+  """Clears errors raised during ASYNC execution mode."""
+  return context().async_clear_error()
+
+
 def num_gpus():
   """Get the number of available GPU devices.
 
@@ -606,4 +737,8 @@ def export_run_metadata():
 # (for example, enable_eager_execution in python/framework/ops.py),
 # but they do all import this file.  Note that IS_IN_GRAPH_MODE and
 # in_graph_mode are both parameterless functions.
-is_in_graph_mode.IS_IN_GRAPH_MODE = in_graph_mode
+def _tmp_in_graph_mode():
+  return not executing_eagerly()
+
+
+is_in_graph_mode.IS_IN_GRAPH_MODE = _tmp_in_graph_mode
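Note on the `context.py` changes above: `execution_mode` saves the current mode, switches to the requested one, and restores the saved mode in a `finally` block, so the previous mode comes back even if the body raises. The following is a minimal standalone sketch of that pattern, not TensorFlow code. `ToyContext` is a hypothetical stand-in for `Context`, and the numeric values of `SYNC` and `ASYNC` are placeholders chosen for illustration.

```python
import contextlib

# Placeholder mode constants for this sketch only.
SYNC = 0
ASYNC = 1


class ToyContext(object):
  """Hypothetical stand-in for Context; holds only an execution mode."""

  def __init__(self):
    self._mode = SYNC

  def get_execution_mode(self):
    return self._mode

  def set_execution_mode(self, mode):
    if mode not in (SYNC, ASYNC):
      raise ValueError("Unknown execution mode: %r" % (mode,))
    self._mode = mode

  @contextlib.contextmanager
  def execution_mode(self, mode):
    """Temporarily switches mode; restores the old mode on exit."""
    old_mode = self.get_execution_mode()
    try:
      self.set_execution_mode(mode)
      yield
    finally:
      # Runs on normal exit and when the body raises.
      self.set_execution_mode(old_mode)
```

Because the restore happens in `finally`, a caller does not need a separate `set_execution_mode(SYNC)` after the `with` block in this sketch.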
diff --git a/tensorflow/python/eager/core_test.py b/tensorflow/python/eager/core_test.py
index 0e40d8a5c0a582ab27d95735dd917e2a5daabe09..6ebf5b24819d48ba4a17d6059510eee7affe40ea 100644
--- a/tensorflow/python/eager/core_test.py
+++ b/tensorflow/python/eager/core_test.py
@@ -34,7 +34,9 @@ from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_resource_variable_ops
 from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import resource_variable_ops
 
 
 def execute(op_name, num_outputs, inputs, attrs=None):
@@ -55,13 +57,22 @@ class TFETest(test_util.TensorFlowTestCase):
 
   def testContext(self):
     ctx = context.Context()
-    self.assertFalse(ctx.in_graph_mode())
-    self.assertTrue(ctx.in_eager_mode())
+    self.assertTrue(ctx.executing_eagerly())
 
     self.assertEqual('', ctx.scope_name)
     ctx.scope_name = 'foo'
     self.assertEqual('foo', ctx.scope_name)
 
+    self.assertEqual(context.SYNC, ctx.get_execution_mode())
+    ctx.set_execution_mode(context.ASYNC)
+    self.assertEqual(context.ASYNC, ctx.get_execution_mode())
+    ctx.set_execution_mode(context.SYNC)
+    self.assertEqual(context.SYNC, ctx.get_execution_mode())
+    with ctx.execution_mode(context.ASYNC):
+      self.assertEqual(context.ASYNC, ctx.get_execution_mode())
+    ctx.set_execution_mode(context.SYNC)
+    self.assertEqual(context.SYNC, ctx.get_execution_mode())
+
     self.assertIsNone(ctx.summary_writer_resource)
     ctx.summary_writer_resource = 'mock'
     self.assertEqual('mock', ctx.summary_writer_resource)
@@ -112,19 +123,18 @@ class TFETest(test_util.TensorFlowTestCase):
     # available, when no device is explicitly provided)
     self.assertEqual(y.device, '/job:localhost/replica:0/task:0/device:CPU:0')
 
-  def testContextStackContainsEagerMode(self):
-    # Eager execution has been enabled, and no other context
-    # switch has occurred, so `context_stack` should contain
-    # exactly one entry.
-    self.assertEqual(len(context.context_stack.stack), 1)
-    stack_entry = context.context_stack.stack[0]
+  def testContextSwitchStackContainsEagerMode(self):
+    # Eager execution has been enabled, and no other context switch has
+    # occurred, so `context_switches` should contain exactly one entry.
+    self.assertEqual(len(context.context().context_switches.stack), 1)
+    switch = context.context().context_switches.stack[0]
 
     # The entry should log that eager mode was entered.
-    self.assertIs(stack_entry.enter_context_fn, context.eager_mode)
+    self.assertIs(switch.enter_context_fn, context.eager_mode)
 
     # It is not possible to build a graph function when eager execution
     # is enabled; the stack entry should reflect this fact.
-    self.assertFalse(stack_entry.is_building_function)
+    self.assertFalse(switch.is_building_function)
 
   def testInt32GPU(self):
     if not context.context().num_gpus():
@@ -148,9 +158,9 @@ class TFETest(test_util.TensorFlowTestCase):
 
     def get_context_values(ctx):
       return [
-          ctx.in_graph_mode(),
-          ctx.in_eager_mode(), ctx.scope_name, ctx.summary_writer_resource,
-          ctx.device_name, ctx.num_gpus()
+          ctx.executing_eagerly(), ctx.scope_name, ctx.summary_writer_resource,
+          ctx.device_name,
+          ctx.num_gpus()
       ]
 
     def get_values(ctx, values):
@@ -181,6 +191,18 @@ class TFETest(test_util.TensorFlowTestCase):
         attrs=('T', x.dtype.as_datatype_enum))[0].cpu().numpy()
     self.assertEqual(3, result)
 
+  def testResourceTensorPlacement(self):
+    if not context.context().num_gpus():
+      self.skipTest('No GPUs found')
+
+    with context.device('gpu:0'):
+      v = resource_variable_ops.ResourceVariable(1.0)
+    with context.device('cpu:0'):
+      # Check that the read op runs on the device that holds the handle (the
+      # GPU), even though the CPU device was requested.
+      self.assertAllEqual(
+          gen_resource_variable_ops.read_variable_op(v.handle, v.dtype), 1.0)
+
   def testCopyBetweenDevices(self):
     if not context.context().num_gpus():
       self.skipTest('No GPUs found')
@@ -195,6 +217,23 @@ class TFETest(test_util.TensorFlowTestCase):
     with self.assertRaises(RuntimeError):
       x.gpu(context.context().num_gpus() + 1)
 
+  def testCopyBetweenDevicesAsync(self):
+    if not context.context().num_gpus():
+      self.skipTest('No GPUs found')
+    with context.execution_mode(context.ASYNC):
+      x = constant_op.constant([[1., 2.], [3., 4.]])
+      x = x.cpu()
+      x = x.gpu()
+      x = x.gpu()
+      x = x.cpu()
+      context.async_wait()
+
+    # Invalid device
+    with self.assertRaises(RuntimeError):
+      x.gpu(context.context().num_gpus() + 1)
+      context.async_wait()
+    context.async_clear_error()
+
   def testCopyScope(self):
     if not context.context().num_gpus():
       self.skipTest('No GPUs found')
@@ -235,16 +274,49 @@ class TFETest(test_util.TensorFlowTestCase):
         attrs=('T', three.dtype.as_datatype_enum))[0]
     self.assertAllEqual(15, product)
 
+  def testExecuteBasicAsync(self):
+    with context.execution_mode(context.ASYNC):
+      three = constant_op.constant(3)
+      five = constant_op.constant(5)
+      product = execute(
+          b'Mul',
+          num_outputs=1,
+          inputs=[three, five],
+          attrs=('T', three.dtype.as_datatype_enum))[0]
+      self.assertAllEqual(15, product)
+    # Error: invalid arguments (MatMul rejects scalar inputs).
+    context.set_execution_mode(context.ASYNC)
+    with self.assertRaises(errors.InvalidArgumentError):
+      execute(
+          b'MatMul',
+          num_outputs=1,
+          inputs=[three, five],
+          attrs=('transpose_a', False, 'transpose_b', False, 'T',
+                 three.dtype.as_datatype_enum))
+      context.async_wait()
+    context.async_clear_error()
+    context.set_execution_mode(context.SYNC)
+
   def testExecuteTooManyNumOutputs(self):
     # num_outputs provided is 50, but only one output is produced.
-    # That should be okay.
     product = execute(
         b'Mul',
         num_outputs=50,
-        inputs=[constant_op.constant(3), constant_op.constant(5)],
+        inputs=[constant_op.constant(3),
+                constant_op.constant(5)],
         attrs=('T', dtypes.int32.as_datatype_enum))[0]
     self.assertAllEqual(15, product)
 
+  def testExecuteTooFewNumOutputs(self):
+    # num_outputs provided is 0, but one output is produced.
+    with self.assertRaises(errors.InvalidArgumentError):
+      _ = execute(
+          b'Mul',
+          num_outputs=0,
+          inputs=[constant_op.constant(3),
+                  constant_op.constant(5)],
+          attrs=('T', dtypes.int32.as_datatype_enum))[0]
+
   def testMatMulGPU(self):
     if not context.context().num_gpus():
       self.skipTest('No GPUs found')
@@ -532,5 +604,61 @@ class TFETest(test_util.TensorFlowTestCase):
       self.assertIsInstance(t, ops.EagerTensor)
 
 
+class SendRecvTest(test_util.TensorFlowTestCase):
+
+  cpu_device = '/job:localhost/replica:0/task:0/device:CPU:0'
+
+  def _send(self, tensor, tensor_name, to_device):
+    return execute(
+        b'_Send', num_outputs=0, inputs=[tensor],
+        attrs=('T', tensor.dtype.as_datatype_enum,
+               'tensor_name', tensor_name,
+               'send_device', tensor.device,
+               'send_device_incarnation', 0,
+               'recv_device', to_device,
+               'client_terminated', True))
+
+  def _recv(self, dtype, tensor_name, from_device):
+    device_name = context.context().device_name
+    if not device_name:
+      device_name = self.cpu_device
+    return execute(
+        b'_Recv', num_outputs=1, inputs=[],
+        attrs=('tensor_type', dtype.as_datatype_enum,
+               'tensor_name', tensor_name,
+               'send_device', from_device,
+               'send_device_incarnation', 0,
+               'recv_device', device_name,
+               'client_terminated', False))[0]
+
+  def testBasic(self):
+    t0 = constant_op.constant(1.0)
+    t1 = constant_op.constant(2.0)
+    self._send(t0, 't0', self.cpu_device)
+    self._send(t1, 't1', self.cpu_device)
+    self.assertAllEqual(
+        self._recv(dtypes.float32, 't0', self.cpu_device),
+        1.0)
+    self.assertAllEqual(
+        self._recv(dtypes.float32, 't1', self.cpu_device),
+        2.0)
+
+  def testLocalCrossDevice(self):
+    if not context.context().num_gpus():
+      self.skipTest('No GPUs found')
+    gpu_device_name = '/job:localhost/replica:0/task:0/device:GPU:0'
+    with ops.device('GPU:0'):
+      t0 = constant_op.constant(1.0)
+      self._send(t0, 't0', self.cpu_device)
+    self.assertAllEqual(
+        self._recv(dtypes.float32, 't0', gpu_device_name),
+        1.0)
+    self._send(constant_op.constant(2.0), 't1', gpu_device_name)
+    with ops.device('GPU:0'):
+      self.assertAllEqual(
+          self._recv(dtypes.float32, 't1', self.cpu_device),
+          2.0)
+
+
 if __name__ == '__main__':
   test.main()
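Note on the async tests above: in `testCopyBetweenDevicesAsync` and `testExecuteBasicAsync`, the failing op and `context.async_wait()` both sit inside the `assertRaises` block, and `context.async_clear_error()` is called afterwards. The toy model below sketches why that shape is useful: with deferred execution, a failure may only become visible at the wait point. This is an illustration under stated assumptions, not TensorFlow's implementation. `ToyAsyncExecutor` and its methods are hypothetical names, and the "error stays recorded until cleared" behavior is an assumption of the model.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyAsyncExecutor(object):
  """Hypothetical stand-in for an async op queue; not TensorFlow code."""

  def __init__(self):
    # A single worker keeps queued ops in dispatch order.
    self._pool = ThreadPoolExecutor(max_workers=1)
    self._pending = []
    self._error = None

  def dispatch(self, fn, *args):
    """Queues fn(*args) and returns immediately without reporting errors."""
    self._pending.append(self._pool.submit(fn, *args))

  def wait(self):
    """Blocks until queued ops finish; raises the first recorded error."""
    pending, self._pending = self._pending, []
    for future in pending:
      err = future.exception()  # Blocks until this op is done.
      if err is not None and self._error is None:
        self._error = err
    if self._error is not None:
      raise self._error

  def clear_error(self):
    """Forgets the recorded error so later waits can succeed."""
    self._error = None
```

In this model, `dispatch` never raises, so a test that expects a failure has to include the `wait()` call inside its error-checking block and then call `clear_error()` before moving on.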
diff --git a/tensorflow/python/eager/custom_gradient.py b/tensorflow/python/eager/custom_gradient.py
deleted file mode 100644
index 05460ff9968312528d87f5fc2ad0495b4da2ad1a..0000000000000000000000000000000000000000
--- a/tensorflow/python/eager/custom_gradient.py
+++ /dev/null
@@ -1,91 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Decorator to overrides the gradient for a function."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.python.eager import context
-from tensorflow.python.eager import tape
-from tensorflow.python.framework import ops as tf_ops
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import gen_array_ops
-from tensorflow.python.util import nest
-from tensorflow.python.util import tf_decorator
-
-
-def custom_gradient(f):
-  """Decorator to define a function with a custom gradient.
-
-  The input function is expected to return the tuple
-    (results, gradient_function).
-
-  The output function will return results while possibly recording the
-  gradient_function and inputs in the tape.
-
-  Args:
-    f: function to be decorated.
-
-  Returns:
-    decorated function.
-  """
-
-  def decorated(*args, **kwargs):
-    """Decorated function with custom gradient."""
-    if context.in_graph_mode():
-      if kwargs:
-        raise ValueError(
-            "custom_gradient in graph mode doesn't support keyword arguments.")
-      name = "CustomGradient-%s" % tf_ops.uid()
-      args = [tf_ops.convert_to_tensor(x) for x in args]
-      result, grad_fn = f(*args)
-      flat_result = nest.flatten(result)
-      all_tensors = flat_result + args
-
-      @tf_ops.RegisterGradient(name)
-      def internal_grad_fn(unused_op, *result_grads):  # pylint: disable=unused-variable
-        gradients = nest.flatten(grad_fn(*result_grads[:len(flat_result)]))
-        # Need to return one value per input to the IdentityN, so pad the
-        # gradients of the inputs of the custom_gradient function with the
-        # gradients of the outputs as well.
-        return ([None] * len(flat_result)) + gradients
-
-      with tf_ops.get_default_graph().gradient_override_map(
-          {"IdentityN": name}):
-        all_tensors = array_ops.identity_n(all_tensors)
-      return nest.pack_sequence_as(
-          structure=result, flat_sequence=all_tensors[:len(flat_result)])
-
-    input_tensors = [tf_ops.convert_to_tensor(x) for x in args]
-
-    with tape.stop_recording():
-      result, grad_fn = f(*args, **kwargs)
-      flat_result = nest.flatten(result)
-      # TODO(apassos) consider removing the identity below.
-      flat_result = [gen_array_ops.identity(x) for x in flat_result]
-
-    def actual_grad_fn(*outputs):
-      return nest.flatten(grad_fn(*outputs))
-
-    tape.record_operation(
-        f.__name__,
-        flat_result,
-        input_tensors,
-        actual_grad_fn)
-    flat_result = list(flat_result)
-    return nest.pack_sequence_as(result, flat_result)
-
-  return tf_decorator.make_decorator(f, decorated)
diff --git a/tensorflow/python/eager/function.py b/tensorflow/python/eager/function.py
index b3317bd3235f432220d9d5d135f1af18a6f43310..343012e552592a6f8bb1255118add3e938aa443c 100644
--- a/tensorflow/python/eager/function.py
+++ b/tensorflow/python/eager/function.py
@@ -36,6 +36,7 @@ from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes as dtypes_module
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.util import compat
@@ -111,7 +112,7 @@ def _convert_to_graph_tensor(value, dtype=None, name=None, as_ref=False):
   """
   del as_ref  # Unused.
 
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return value
 
   default_graph = ops.get_default_graph()
@@ -162,31 +163,15 @@ class CapturingGraph(ops.Graph):
       op_def=None,
       compute_shapes=True,
       compute_device=True):
-    # TODO(apassos) probably control flow has to be handled delicately here as
-    # in if a resource is accessed inside a control flow context we need the
-    # control dependency to point to something outside the context which is
-    # guaranteed to happen after the access.
-    #
     # TODO(apassos) this should do some form of alias analysis as ops which
     # forward the resources such as Identity and Switch can cause serialization
     # to fail.
-    resource_inputs = set()
-    control_inputs = set()
     for i, inp in enumerate(inputs):
       if inp.graph is not self:
         inputs[i] = capture_value(self.captures, inp, inp.dtype, inp.op.name)
-      inp = inputs[i]
-      if inp.dtype == dtypes_module.resource:
-        if inp.name in self._last_op_using_resource_tensor:
-          control_inputs.add(self._last_op_using_resource_tensor[inp.name])
-        resource_inputs.add(inp.name)
-    with self.control_dependencies(list(control_inputs)):
-      op = super(CapturingGraph, self).create_op(
-          op_type, inputs, dtypes, input_types, name, attrs, op_def,
-          compute_shapes, compute_device)
-    for name in resource_inputs:
-      self._last_op_using_resource_tensor[name] = op
-    return op
+    return super(CapturingGraph, self).create_op(
+        op_type, inputs, dtypes, input_types, name, attrs, op_def,
+        compute_shapes, compute_device)
 
 
 # TODO(apassos): it'd be really nice if we could scope this registration.
@@ -310,7 +295,7 @@ class _EagerDefinedFunction(object):
       proto_data = pywrap_tensorflow.TF_GetBuffer(buffer_)
     function_def = function_pb2.FunctionDef()
     function_def.ParseFromString(compat.as_bytes(proto_data))
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       _register(fn)
     self.definition = function_def
     self.name = function_def.signature.name
@@ -453,7 +438,14 @@ class GraphModeFunction(object):
     all_args = args + self._extra_inputs
     signature = self._forward_fdef.signature
     ctx = context.context()
-    if ctx.in_graph_mode():
+    if ctx.executing_eagerly():
+      outputs = execute.execute(
+          str(signature.name),
+          num_outputs=len(signature.output_arg),
+          inputs=all_args,
+          attrs=None,
+          ctx=ctx)
+    else:
       g = ops.get_default_graph()
       g._add_function(self._forward_fdef)  # pylint: disable=protected-access
       op = g.create_op(
@@ -468,13 +460,6 @@ class GraphModeFunction(object):
           outputs, (ops.Tensor, type(None))) else list(outputs)
       for i, s in enumerate(self._output_shapes):
         outputs[i].set_shape(s)
-    else:
-      outputs = execute.execute(
-          str(signature.name),
-          num_outputs=len(signature.output_arg),
-          inputs=all_args,
-          attrs=None,
-          ctx=ctx)
     real_outputs = outputs[:len(self._returns)]
     side_outputs = outputs[len(self._returns):]
 
@@ -545,7 +530,14 @@ class GraphModeFunction(object):
       return self._backprop_call(tensor_inputs)
 
     ctx = context.context()
-    if ctx.in_graph_mode():
+    if ctx.executing_eagerly():
+      result = execute.execute(
+          str(self._func_name),
+          num_outputs=self._num_outputs,
+          inputs=tensor_inputs + self._extra_inputs,
+          attrs=None,
+          ctx=ctx)
+    else:
       g = ops.get_default_graph()
       self.add_to_graph(g)
       signature = self._function_def.definition.signature
@@ -562,13 +554,6 @@ class GraphModeFunction(object):
         return op
       for i, s in enumerate(self._output_shapes):
         result[i].set_shape(s)
-    else:
-      result = execute.execute(
-          str(self._func_name),
-          num_outputs=self._num_outputs,
-          inputs=tensor_inputs + self._extra_inputs,
-          attrs=None,
-          ctx=ctx)
 
     return self._build_call_outputs(result)
 
@@ -636,13 +621,15 @@ def _defun_internal(name, func, args, kwds):
     for collection in curr_graph.collections:
       tmp_graph.get_collection_ref(collection)[:] = curr_graph.get_collection(
           collection)
-    with tmp_graph.as_default():
+    with tmp_graph.as_default(), AutomaticControlDependencies() as a:
       func_inputs = _get_defun_inputs(args)
 
       def convert(x):
         if x is None:
           return None
-        return ops.convert_to_tensor_or_indexed_slices(x)
+        x = ops.convert_to_tensor_or_indexed_slices(x)
+        x = a.mark_as_return(x)
+        return x
 
       with capture_tensors(captures):
         this_tape = tape.push_new_tape()
@@ -679,7 +666,7 @@ def _defun_internal(name, func, args, kwds):
                      if x not in all_ignored_ops)
   # Register any other functions defined in the graph
   # TODO(ashankar): Oh lord, forgive me for this lint travesty.
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     for f in tmp_graph._functions.values():  # pylint: disable=protected-access
       # TODO(ashankar): What about the gradient registry?
       _register(f._c_func)  # pylint: disable=protected-access
@@ -887,10 +874,39 @@ class AutomaticControlDependencies(object):
     self._returned_tensors = set()
 
   def mark_as_return(self, tensor):
+    """Acts like identity but marks the `Tensor` as a return value.
+
+    This will possibly return a copy of the `Tensor`. Usage:
+
+    ```
+      with AutomaticControlDependencies() as a:
+        ...
+        t = a.mark_as_return(t)
+      _ = ...(t...)  # i.e. it's safe to use t here
+    ```
+
+    Args:
+      tensor: the `Tensor` to be marked
+
+    Returns:
+      a copy of the `Tensor`.
+    """
+    if isinstance(tensor, ops.IndexedSlices):
+      values = array_ops.identity(tensor.values)
+      indices = array_ops.identity(tensor.indices)
+      self._returned_tensors.add(indices)
+      self._returned_tensors.add(values)
+      return ops.IndexedSlices(values, indices, dense_shape=tensor.dense_shape)
+    # We want to make the return values depend on the stateful operations, but
+    # we don't want to introduce a cycle, so we make the return value the result
+    # of a new identity operation that the stateful operations definitely don't
+    # depend on.
+    tensor = array_ops.identity(tensor)
     self._returned_tensors.add(tensor)
+    return tensor
 
   def __enter__(self):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return self
     # This code assumes no other thread is adding ops to the graph while
     # we're adding ops to the graph.
@@ -961,7 +977,7 @@ class AutomaticControlDependencies(object):
       merge_for_resource[o] = new_merge[0].op
 
   def __exit__(self, unused_type, unused_value, unused_traceback):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return
 
     if self._graph is not ops.get_default_graph():
@@ -1008,7 +1024,8 @@ class AutomaticControlDependencies(object):
     for op in new_operations:
       control_inputs = set()
       # Ensure stateful ops run
-      if self._graph._registered_ops[op.type].is_stateful:  # pylint: disable=protected-access
+      if (op.type not in self._graph._registered_ops  # pylint: disable=protected-access
+          or self._graph._registered_ops[op.type].is_stateful):  # pylint: disable=protected-access
         ops_which_must_run.add(op)
       # Ignore switches (they're handled separately)
       if op.type == "Switch" and op.inputs[0].dtype == dtypes_module.resource:
@@ -1044,9 +1061,10 @@ class AutomaticControlDependencies(object):
 
     # Ensure all ops which must run do run
     for r in self._returned_tensors:
-      r.op._add_control_inputs(  # pylint: disable=protected-access
-          [o for o in ops_which_must_run
-           if o._control_flow_context is r.op._control_flow_context])  # pylint: disable=protected-access
+      if ops_which_must_run:
+        r.op._add_control_inputs(  # pylint: disable=protected-access
+            [o for o in ops_which_must_run
+             if o._control_flow_context is r.op._control_flow_context])  # pylint: disable=protected-access
 
 
 def automatic_control_dependencies(f):
@@ -1066,8 +1084,7 @@ def automatic_control_dependencies(f):
   def wrapper(*args, **kwds):
     with AutomaticControlDependencies() as a:
       result = f(*args, **kwds)
-      for t in nest.flatten(result):
-        a.mark_as_return(t)
-      return result
+      result_flat = [a.mark_as_return(t) for t in nest.flatten(result)]
+      return nest.pack_sequence_as(result, result_flat)
 
   return tf_decorator.make_decorator(f, wrapper)
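Note on the `function.py` changes above: `mark_as_return` now returns a new tensor instead of marking its argument in place, so callers have to put the returned values back into the original result structure. The `automatic_control_dependencies` wrapper does this with `nest.flatten` followed by `nest.pack_sequence_as`. The sketch below shows the same flatten, map, repack pattern in plain Python. The helpers are simplified, hypothetical re-implementations that handle only lists and tuples; they are not the `nest` module.

```python
def flatten(structure):
  """Returns the leaves of nested lists/tuples in depth-first order."""
  if isinstance(structure, (list, tuple)):
    leaves = []
    for item in structure:
      leaves.extend(flatten(item))
    return leaves
  return [structure]


def pack_sequence_as(structure, flat_sequence):
  """Rebuilds the nesting of `structure` with leaves from flat_sequence."""
  leaves = iter(flat_sequence)

  def rebuild(node):
    if isinstance(node, (list, tuple)):
      # Preserve the container type (list or tuple) at each level.
      return type(node)(rebuild(item) for item in node)
    return next(leaves)

  return rebuild(structure)


def map_leaves(fn, structure):
  """Applies fn to every leaf and returns a structure of the same shape."""
  return pack_sequence_as(structure, [fn(x) for x in flatten(structure)])
```

Here `fn` plays the role of `a.mark_as_return`: each leaf is replaced by the value `fn` returns, and the caller receives the replaced leaves in the original nesting.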
diff --git a/tensorflow/python/eager/function_test.py b/tensorflow/python/eager/function_test.py
index 431d9388c0ee97eda197142ec97b9448d985b04b..fd1d2c25ffe50cb7afcae29b3d0b15635b6a57dd 100644
--- a/tensorflow/python/eager/function_test.py
+++ b/tensorflow/python/eager/function_test.py
@@ -37,6 +37,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
+from tensorflow.python.training import gradient_descent
 
 
 class FunctionTest(test.TestCase):
@@ -606,7 +607,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
         v.assign(v + 1)
         v.assign(2 * v)
         val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(), 4.0)
 
   def testCondMustRun(self):
@@ -626,7 +627,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
 
         control_flow_ops.cond(p, true_fn, false_fn)
         val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(feed_dict={p: False}), 5.0)
       self.assertAllEqual(val.eval(feed_dict={p: True}), 6.0)
 
@@ -647,7 +648,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
 
         control_flow_ops.cond(p, true_fn, false_fn)
         one = constant_op.constant(1.0)
-        c.mark_as_return(one)
+        one = c.mark_as_return(one)
       one.eval(feed_dict={p: False})
       self.assertAllEqual(v.read_value().eval(), 5.0)
       one.eval(feed_dict={p: True})
@@ -681,7 +682,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
         control_flow_ops.cond(p, true_fn, false_fn)
         with ops.name_scope('final'):
           val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(feed_dict={p: False, q: False}), 3.0)
       self.assertAllEqual(val.eval(feed_dict={p: False, q: True}), 6.0)
       self.assertAllEqual(val.eval(feed_dict={p: True, q: True}), 7.0)
@@ -703,7 +704,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
 
         control_flow_ops.cond(p, true_fn, false_fn)
         val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(feed_dict={p: False}), 5.0)
       self.assertAllEqual(val.eval(feed_dict={p: True}), 5.0)
 
@@ -724,7 +725,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
 
         control_flow_ops.cond(p, true_fn, false_fn)
         val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(feed_dict={p: False}), 6.0)
       self.assertAllEqual(val.eval(feed_dict={p: True}), 12.0)
 
@@ -745,7 +746,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
         control_flow_ops.cond(p, true_fn, false_fn)
         v.assign(v * 2)
         val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(feed_dict={p: False}), 10.0)
       self.assertAllEqual(val.eval(feed_dict={p: True}), 20.0)
 
@@ -762,6 +763,37 @@ class AutomaticControlDependenciesTest(test.TestCase):
 
       self.assertAllEqual(f().eval(), 4.0)
 
+  def testOptimizerInDefun(self):
+    def loss(v):
+      return v**2
+
+    optimizer = gradient_descent.GradientDescentOptimizer(learning_rate=1.0)
+
+    @function.defun
+    def train():
+      v = resource_variable_ops.ResourceVariable(1.0)
+      grad = backprop.implicit_grad(loss)(v)
+      optimizer.apply_gradients(grad)
+      return v.read_value()
+
+    value = train()
+    self.assertEqual(value.numpy(), -1.0)
+
+  def testOptimizerInDefunWithCapturedVariable(self):
+    v = resource_variable_ops.ResourceVariable(1.0)
+    def loss():
+      return v**2
+
+    optimizer = gradient_descent.GradientDescentOptimizer(learning_rate=1.0)
+
+    @function.defun
+    def train():
+      grad = backprop.implicit_grad(loss)()
+      optimizer.apply_gradients(grad)
+
+    train()
+    self.assertEqual(v.numpy(), -1.0)
+
 
 if __name__ == '__main__':
   test.main()
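Note on the optimizer tests above: both `testOptimizerInDefun` and `testOptimizerInDefunWithCapturedVariable` expect `-1.0` after one gradient descent step. The arithmetic can be checked by hand: the loss is `v**2`, its gradient is `2 * v`, and with `v = 1.0` and `learning_rate = 1.0` the update is `1.0 - 1.0 * 2.0`. The snippet below is a plain-Python check of that calculation; `loss_grad` is a hypothetical helper and no TensorFlow code is involved.

```python
def loss_grad(v):
  """Analytic gradient of the test's loss, v**2, with respect to v."""
  return 2.0 * v


v = 1.0              # Initial variable value used in the tests.
learning_rate = 1.0  # Learning rate passed to the optimizer in the tests.

# One plain gradient descent step: v <- v - learning_rate * dloss/dv.
v_after = v - learning_rate * loss_grad(v)
```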
diff --git a/tensorflow/python/eager/graph_callable.py b/tensorflow/python/eager/graph_callable.py
index 62106bf0e2809e3c056e4a357f3d05251b7dca68..ee5d87f0835a8e70e0ce14537a51ea5418db41b9 100644
--- a/tensorflow/python/eager/graph_callable.py
+++ b/tensorflow/python/eager/graph_callable.py
@@ -279,9 +279,12 @@ def _graph_callable_internal(func, shape_and_dtypes):
       # scope's view of which variables exist.
       variable_captures = _VariableCapturingScope()
       with variable_captures.initializing_scope(), function.capture_tensors(
-          captures):
+          captures), function.AutomaticControlDependencies() as a:
         func_outputs = func(*func_inputs)
-      outputs_list = nest.flatten(func_outputs)
+        outputs_list = nest.flatten(func_outputs)
+        for i, x in enumerate(outputs_list):
+          if x is not None:
+            outputs_list[i] = a.mark_as_return(x)
       if len(outputs_list) == 1 and outputs_list[0] is None:
         outputs_list = []
       output_shapes = [x.shape for x in outputs_list]
@@ -294,9 +297,12 @@ def _graph_callable_internal(func, shape_and_dtypes):
       # knows about all variables.
       tmp_graph.clear_resource_control_flow_state()
       with variable_captures.capturing_scope(), function.capture_tensors(
-          captures):
+          captures), function.AutomaticControlDependencies() as a:
         captured_outputs = func(*func_inputs)
       captured_outlist = nest.flatten(captured_outputs)
+      for i, x in enumerate(captured_outlist):
+        if x is not None:
+          captured_outlist[i] = a.mark_as_return(x)
       capturing_operations = tmp_graph.get_operations()[
           len(initializing_operations):]
 
@@ -400,7 +406,7 @@ def graph_callable(shape_and_dtypes):
     A callable graph object.
   """
   # TODO(alive,apassos): support initialized_value and friends from tf.Variable.
-  assert context.in_eager_mode(), (
+  assert context.executing_eagerly(), (
       "graph_callable can only be used when Eager execution is enabled.")
   def decorator(func):
     return tf_decorator.make_decorator(func,
diff --git a/tensorflow/python/eager/ops_test.py b/tensorflow/python/eager/ops_test.py
index f2e70341d975fb06bce7f2ce6cba7d8c3bc9826c..fc76ede4c502ae8b554c925a921e419bf003c40c 100644
--- a/tensorflow/python/eager/ops_test.py
+++ b/tensorflow/python/eager/ops_test.py
@@ -17,8 +17,10 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import threading
 import numpy as np
 
+from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.eager import context
 from tensorflow.python.eager import execute
 from tensorflow.python.eager import test
@@ -130,8 +132,12 @@ class OpsTest(test_util.TensorFlowTestCase):
                                    dtype=dtypes.int64)
     values = constant_op.constant([2, 3, 5, 7, 11])
     shape = constant_op.constant([2, 7], dtype=dtypes.int64)
-    result = sparse_ops.gen_sparse_ops._sparse_split(  # pylint: disable=protected-access
-        split_dim, indices, values, shape, num_split=2)
+    result = sparse_ops.gen_sparse_ops.sparse_split(
+        split_dim,
+        indices,
+        values,
+        shape,
+        num_split=2)
     output_indices, output_values, output_shape = result
     self.assertEqual(2, len(output_indices))
     self.assertEqual(2, len(output_values))
@@ -277,6 +283,25 @@ class OpsTest(test_util.TensorFlowTestCase):
       context._context = context.Context()
     # pylint: enable=protected-access
 
+  def testSoftPlacement(self):
+    if not context.context().num_gpus():
+      self.skipTest('No GPUs found')
+    # Temporarily replace the context
+    # pylint: disable=protected-access
+    del context._context
+    try:
+      context._context = context.Context(
+          device_policy=context.DEVICE_PLACEMENT_SILENT,
+          config=config_pb2.ConfigProto(allow_soft_placement=True))
+      cpu_tensor = constant_op.constant(1.0)
+      result = cpu_tensor + cpu_tensor
+      self.assertEqual(result.device,
+                       '/job:localhost/replica:0/task:0/device:GPU:0')
+    finally:
+      del context._context
+      context._context = context.Context()
+    # pylint: enable=protected-access
+
   def testRandomUniform(self):
     scalar_shape = constant_op.constant([], dtype=dtypes.int32)
 
@@ -352,6 +377,22 @@ class OpsTest(test_util.TensorFlowTestCase):
   def testNoOpIsNone(self):
     self.assertTrue(control_flow_ops.no_op() is None)
 
+  def testEagerContextPreservedAcrossThreads(self):
+    def init_fn():
+      self.assertTrue(context.executing_eagerly())
+      with ops.init_scope():
+        self.assertTrue(context.executing_eagerly())
+        context_switches = context.context().context_switches
+        self.assertEqual(len(context_switches.stack), 1)
+        self.assertFalse(context_switches.stack[0].is_building_function)
+        self.assertEqual(context_switches.stack[0].enter_context_fn,
+                         context.eager_mode)
+
+    self.assertTrue(context.executing_eagerly())
+    t1 = threading.Thread(target=init_fn)
+    t1.start()
+    t1.join()
+
 
 if __name__ == '__main__':
   test.main()
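
The `testEagerContextPreservedAcrossThreads` test above checks that a newly started thread still reports eager execution. Below is a minimal pure-Python sketch of that idea. `ModeState` and `check_in_new_thread` are hypothetical names used only for illustration, not TensorFlow APIs, and the sketch assumes per-thread state that falls back to a process-wide default; the real `context.Context` may be implemented differently.

```python
import threading


class ModeState(object):
    """Hypothetical stand-in for a context object holding per-thread mode."""

    def __init__(self, default_eager):
        self._default_eager = default_eager
        self._local = threading.local()

    def executing_eagerly(self):
        # A thread that never set its own mode inherits the default.
        return getattr(self._local, "eager", self._default_eager)

    def set_eager(self, value):
        # Only affects the calling thread.
        self._local.eager = value


def check_in_new_thread(state):
    """Runs executing_eagerly() in a fresh thread and returns its result."""
    results = []
    worker = threading.Thread(
        target=lambda: results.append(state.executing_eagerly()))
    worker.start()
    worker.join()
    return results[0]
```

With `ModeState(default_eager=True)`, a worker thread sees the eager default even if the main thread has overridden its own thread-local value.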
diff --git a/tensorflow/python/eager/python_eager_op_gen.cc b/tensorflow/python/eager/python_eager_op_gen.cc
index e6d03297e0b85856ff165af310149c79e494ab36..c2ce8efd7f70c6ba93b6d444f88ddbb9aa51ccdb 100644
--- a/tensorflow/python/eager/python_eager_op_gen.cc
+++ b/tensorflow/python/eager/python_eager_op_gen.cc
@@ -367,7 +367,7 @@ void GenEagerPythonOp::HandleGraphMode(const string& function_setup) {
   // Handle graph-mode case
   strings::StrAppend(&result_,
                      "  _ctx = _context.context()\n"
-                     "  if _ctx.in_graph_mode():\n",
+                     "  if not _ctx.executing_eagerly():\n",
                      function_setup,
                      "    _, _, _op = _op_def_lib._apply_op_helper(\n");
   AddBodyNoReturn("        ");
@@ -712,9 +712,9 @@ bool GenEagerPythonOp::AddEagerFallbackCode(
 }
 
 void GenEagerPythonOp::AddEagerFastPathExecute() {
-  string fastpath_execute_params = strings::StrCat(
-      "_ctx._handle, _ctx.device_name, \"", op_def_.name(), "\", ",
-      "_execute.record_gradient, name, _ctx._post_execution_callbacks");
+  string fastpath_execute_params =
+      strings::StrCat("_ctx._handle, _ctx.device_name, \"", op_def_.name(),
+                      "\", ", "name, _ctx._post_execution_callbacks");
   string fallback_params;
 
   for (int i = 0; i < api_def_.in_arg_size(); i++) {
@@ -955,10 +955,10 @@ from tensorflow.python.util.tf_export import tf_export
     if (api_def->visibility() == ApiDef::SKIP) {
       continue;
     }
-
     // An op is hidden if either its ApiDef visibility is HIDDEN
     // or it is in the hidden_ops list.
     bool is_hidden = api_def->visibility() == ApiDef::HIDDEN;
+    bool hidden_by_api_def = is_hidden;
     if (!is_hidden) {
       for (const string& hidden : hidden_ops) {
         if (op_def.name() == hidden) {
@@ -971,13 +971,22 @@ from tensorflow.python.util.tf_export import tf_export
     string function_name;
     python_op_gen_internal::GenerateLowerCaseOpName(op_def.name(),
                                                     &function_name);
-    if (is_hidden) function_name = strings::StrCat("_", function_name);
-
-    // When users create custom python wrappers, they may link in the
-    // default op registry by accident, and because they can't
-    // enumerate all 'hidden' symbols, this guard is to prevent
-    // instantiating a python reserved word in their wrapper.
-    if (python_op_gen_internal::IsPythonReserved(function_name)) {
+    bool is_reserved = python_op_gen_internal::IsPythonReserved(function_name);
+
+    // Prefix an op with an underscore if it is listed in hidden_ops, its name
+    // is reserved, or it is one of the exceptions in IsOpWithUnderscorePrefix.
+    // Do not add underscores to ops set to HIDDEN in ApiDef otherwise.
+    // TODO(annarev): don't prefix with underscores even if op is in hidden_ops.
+    if (is_hidden) {
+      if (!hidden_by_api_def || is_reserved ||
+          python_op_gen_internal::IsOpWithUnderscorePrefix(function_name)) {
+        function_name = strings::StrCat("_", function_name);
+      }
+    } else if (is_reserved) {
+      // When users create custom python wrappers, they may link in the
+      // default op registry by accident, and because they can't
+      // enumerate all 'hidden' symbols, this guard is to prevent
+      // instantiating a python reserved word in their wrapper.
       continue;
     }
 
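
The underscore-prefix decision in the hunk above can be restated as a small Python function. This is a sketch for reading the control flow, not the generator itself: `generated_function_name` and its boolean parameters are hypothetical, and the reserved-word and exception checks are passed in as flags rather than computed.

```python
def generated_function_name(name, hidden_by_api_def, in_hidden_ops,
                            is_reserved, has_prefix_exception):
    """Returns the wrapper name, or None if the op would be skipped.

    All parameters are illustrative stand-ins for the C++ locals:
      hidden_by_api_def: ApiDef visibility is HIDDEN.
      in_hidden_ops: op name appears in the hidden_ops list.
      is_reserved: lower-cased name is a Python reserved word.
      has_prefix_exception: IsOpWithUnderscorePrefix returned true.
    """
    is_hidden = hidden_by_api_def or in_hidden_ops
    if is_hidden:
        if (not hidden_by_api_def) or is_reserved or has_prefix_exception:
            return "_" + name
        # Hidden only through ApiDef: keep the plain name.
        return name
    if is_reserved:
        # Visible op whose name is reserved: no wrapper is emitted.
        return None
    return name
```

Under these assumptions, an op hidden only through its ApiDef keeps its plain name, while an op listed in `hidden_ops` still gets the leading underscore.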
diff --git a/tensorflow/python/eager/pywrap_tensor.cc b/tensorflow/python/eager/pywrap_tensor.cc
index 3ec2109d323b4f0b2a7e2de0ee13c3317f536a68..519814b979e00dd7c9df41eacbe1edc02c9d88e8 100644
--- a/tensorflow/python/eager/pywrap_tensor.cc
+++ b/tensorflow/python/eager/pywrap_tensor.cc
@@ -163,7 +163,7 @@ PyObject* PyIntFromDataType(TF_DataType l) {
 
 extern "C" {
 
-static const int kMaxEagerTensorParentSize = 32;
+static const int kMaxEagerTensorParentSize = 64;
 
 // TODO(agarwal): store context handle in EagerTensor.
 typedef struct EagerTensor {
@@ -186,6 +186,10 @@ typedef struct EagerTensor {
   // This stores `_keras_mask` object and is set by Tensorflow layers.
   PyObject* keras_mask;
 
+  // This stores `_tensor_shape`, a cached `TensorShape` object, and is set the
+  // first time that `_EagerTensorBase`'s `shape` property is called.
+  PyObject* tensor_shape;
+
   // We store a status object here as an optimization to avoid allocating a new
   // Status objects on different functions that operate on EagerTensor and need
   // to use a TF_Status object. However note that accesses to `status` are not
@@ -201,6 +205,8 @@ int EagerTensor_init(EagerTensor* self, PyObject* args, PyObject* kwds) {
   self->handle_data = Py_None;
   Py_INCREF(Py_None);
   self->keras_mask = Py_None;
+  Py_INCREF(Py_None);
+  self->tensor_shape = Py_None;
   self->status = TF_NewStatus();
   PyObject* value;
   PyObject* context = nullptr;
@@ -333,8 +339,11 @@ void EagerTensor_dealloc(EagerTensor* self) {
   TF_DeleteStatus(self->status);
   Py_DECREF(self->handle_data);
   Py_DECREF(self->keras_mask);
-  TFE_DeleteTensorHandle(self->handle);
-  self->handle = nullptr;
+  Py_DECREF(self->tensor_shape);
+  if (self->handle != nullptr) {
+    TFE_DeleteTensorHandle(self->handle);
+    self->handle = nullptr;
+  }
   // We have the global interpreter lock, so use this chance to perform delayed
   // refcount decrements.
   tensorflow::ClearDecrefCache();
@@ -420,6 +429,19 @@ static int EagerTensor_setkeras_mask(EagerTensor* self, PyObject* value,
   self->keras_mask = value;
   return 0;
 }
+
+static PyObject* EagerTensor_tensor_shape(EagerTensor* self, void* unused) {
+  Py_INCREF(self->tensor_shape);
+  return self->tensor_shape;
+}
+
+static int EagerTensor_settensor_shape(EagerTensor* self, PyObject* value,
+                                       void* unused) {
+  Py_DECREF(self->tensor_shape);
+  Py_INCREF(value);
+  self->tensor_shape = value;
+  return 0;
+}
 // Function `_copy_to_device`.
 static PyObject* EagerTensor_copy_to_device(EagerTensor* self, PyObject* args,
                                             PyObject* kwds) {
@@ -484,6 +506,9 @@ static PyGetSetDef EagerTensor_getseters[] = {
     {const_cast<char*>("_keras_mask"), (getter)EagerTensor_keras_mask,
      (setter)EagerTensor_setkeras_mask, const_cast<char*>("_keras_mask"),
      nullptr},
+    {const_cast<char*>("_tensor_shape"), (getter)EagerTensor_tensor_shape,
+     (setter)EagerTensor_settensor_shape, const_cast<char*>("_tensor_shape"),
+     nullptr},
     {nullptr} /* Sentinel */
 };
 
@@ -520,16 +545,11 @@ PyTypeObject* EagerTensorType = nullptr;
 
 #if PY_MAJOR_VERSION >= 3
 static PyType_Slot EagerTensor_Type_slots[] = {
-    Py_tp_dealloc,
-    reinterpret_cast<void*>(EagerTensor_dealloc),
-    Py_tp_methods,
-    reinterpret_cast<void*>(EagerTensor_methods),
-    Py_tp_getset,
-    reinterpret_cast<void*>(EagerTensor_getseters),
-    Py_tp_init,
-    reinterpret_cast<void*>(EagerTensor_init),
-    0,
-    nullptr,
+    {Py_tp_dealloc, reinterpret_cast<void*>(EagerTensor_dealloc)},
+    {Py_tp_methods, reinterpret_cast<void*>(EagerTensor_methods)},
+    {Py_tp_getset, reinterpret_cast<void*>(EagerTensor_getseters)},
+    {Py_tp_init, reinterpret_cast<void*>(EagerTensor_init)},
+    {0, nullptr},
 };
 
 PyType_Spec EagerTensor_Type_spec = {"EagerTensor", sizeof(EagerTensor), 0,
@@ -604,6 +624,8 @@ PyObject* EagerTensorFromHandle(TFE_TensorHandle* handle) {
     t->handle_data = Py_None;
     Py_INCREF(Py_None);
     t->keras_mask = Py_None;
+    Py_INCREF(Py_None);
+    t->tensor_shape = Py_None;
     t->handle = handle;
     t->status = TF_NewStatus();
   }
diff --git a/tensorflow/python/eager/pywrap_tfe.h b/tensorflow/python/eager/pywrap_tfe.h
index f9692a8910aa6354c7ed81c7e88aed882058f276..32d731d0f68910b8e41a57cb32ae60c3ea6742f7 100644
--- a/tensorflow/python/eager/pywrap_tfe.h
+++ b/tensorflow/python/eager/pywrap_tfe.h
@@ -51,6 +51,13 @@ void TFE_Py_Execute(TFE_Context* ctx, const char* device_name,
 // This function is not thread-safe.
 PyObject* TFE_Py_RegisterExceptionClass(PyObject* e);
 
+// Registers e as the type of the ResourceVariable class.
+// Returns Py_None if registration succeeds, else throws a TypeError and returns
+// NULL.
+//
+// This function is not thread-safe.
+PyObject* TFE_Py_RegisterResourceVariableType(PyObject* e);
+
 // Registers e as the Exception to be raised when the conditions of
 // TFE_Py_FastPathExecute_C have not been met. When this exception is set, it
 // is a signal to the calling code that it should fall back to the safer (and
@@ -160,13 +167,10 @@ PyObject* TFE_Py_TapeGradient(PyObject* tape, PyObject* vspace,
 //  Item 2: device_name: Name of the device on which to execute the operation,
 //          or NULL for automatic selection.
 //  Item 3: op_name: Name of the TensorFlow op to execute.
-//  Item 4: record_gradient_callback: Callback that records the gradient of the
-//          result. The callback takes (op_name, inputs, attrs, result, name)
-//          - all sequences and records the gradient.
-//  Item 5: name: An optional name for the operation.
-//  Item 6: List representing all callbacks to execute after successful
+//  Item 4: name: An optional name for the operation.
+//  Item 5: List representing all callbacks to execute after successful
 //  op execute.
-//  Item 7 onwards: inputs - This is a list of inputs followed by a list of
+//  Item 6 onwards: inputs - This is a list of inputs followed by a list of
 //        attrs. It is not necessary for type attrs to be present.
 //
 // This is named _C since there doesn't seem to be any way to make it visible
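
With `record_gradient_callback` removed, the documented positional layout shifts by one, so inputs now start at index 5. A rough Python sketch of that layout follows. `FastPathArgs` and `split_fast_path_args` are hypothetical helpers for illustration only; the field names follow the comment above and the parameters assembled in `AddEagerFastPathExecute`.

```python
import collections

# Hypothetical record mirroring the documented argument order.
FastPathArgs = collections.namedtuple(
    "FastPathArgs",
    ["ctx_handle", "device_name", "op_name", "name", "callbacks",
     "inputs_and_attrs"])

# Mirrors kFastPathExecuteInputStartIndex after this change.
INPUT_START_INDEX = 5


def split_fast_path_args(args):
    """Splits a flat argument tuple into fixed fields and trailing inputs."""
    if len(args) < INPUT_START_INDEX:
        raise ValueError(
            "expected at least %d arguments, got %d"
            % (INPUT_START_INDEX, len(args)))
    fixed = tuple(args[:INPUT_START_INDEX])
    rest = tuple(args[INPUT_START_INDEX:])
    return FastPathArgs(*fixed, inputs_and_attrs=rest)
```

A caller that still passes the old gradient callback in position 4 would have it read as `name`, which is why the generator and the start index change together in this patch.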
diff --git a/tensorflow/python/eager/pywrap_tfe_src.cc b/tensorflow/python/eager/pywrap_tfe_src.cc
index 30e08c8e6531739e3db66a94308e4ce2aff61f11..701f68b8f7a4b4342e7912c7a115f630b4cab627 100644
--- a/tensorflow/python/eager/pywrap_tfe_src.cc
+++ b/tensorflow/python/eager/pywrap_tfe_src.cc
@@ -31,12 +31,30 @@ limitations under the License.
 #include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/python/eager/pywrap_tensor.h"
+#include "tensorflow/python/lib/core/safe_ptr.h"
 
 using tensorflow::string;
 using tensorflow::strings::Printf;
 
 namespace {
 
+struct FastPathOpExecInfo {
+  TFE_Context* ctx;
+  const char* device_name;
+  // The op def of the main op being executed.
+  const tensorflow::OpDef* op_def;
+
+  bool run_callbacks;
+  bool run_post_exec_callbacks;
+  bool run_gradient_callback;
+
+  // The op name of the main op being executed.
+  PyObject* name;
+  // The op type name of the main op being executed.
+  PyObject* op_name;
+  PyObject* callbacks;
+};
+
 #define PARSE_VALUE(fn_name, type, check_fn, parse_fn)                       \
   bool fn_name(const string& key, PyObject* py_value, TF_Status* status,     \
                type* value) {                                                \
@@ -75,6 +93,34 @@ Py_ssize_t TensorShapeNumDims(PyObject* value) {
   return size;
 }
 
+bool IsInteger(PyObject* py_value) {
+#if PY_MAJOR_VERSION >= 3
+  return PyLong_Check(py_value);
+#else
+  return PyInt_Check(py_value);
+#endif
+}
+
+bool ParseDimensionValue(const string& key, PyObject* py_value,
+                         TF_Status* status, int64_t* value) {
+  if (IsInteger(py_value)) {
+    return ParseInt64Value(key, py_value, status, value);
+  }
+
+  tensorflow::Safe_PyObjectPtr dimension_value(
+      PyObject_GetAttrString(py_value, "_value"));
+  if (dimension_value == nullptr) {
+    TF_SetStatus(
+        status, TF_INVALID_ARGUMENT,
+        tensorflow::strings::StrCat("Expecting a Dimension for attr ", key,
+                                    ", got ", py_value->ob_type->tp_name)
+            .c_str());
+    return false;
+  }
+
+  return ParseInt64Value(key, dimension_value.get(), status, value);
+}
+
 bool ParseStringValue(const string& key, PyObject* py_value, TF_Status* status,
                       const char** value) {
   if (PyBytes_Check(py_value)) {
@@ -101,14 +147,6 @@ bool ParseBoolValue(const string& key, PyObject* py_value, TF_Status* status,
   return true;
 }
 
-bool IsInteger(PyObject* py_value) {
-#if PY_MAJOR_VERSION >= 3
-  return PyLong_Check(py_value);
-#else
-  return PyInt_Check(py_value);
-#endif
-}
-
 // The passed in py_value is expected to be an object of the python type
 // dtypes.DType or an int.
 bool ParseTypeValue(const string& key, PyObject* py_value, TF_Status* status,
@@ -117,18 +155,18 @@ bool ParseTypeValue(const string& key, PyObject* py_value, TF_Status* status,
     return ParseIntValue(key, py_value, status, value);
   }
 
-  PyObject* py_type_enum = PyObject_GetAttrString(py_value, "_type_enum");
+  tensorflow::Safe_PyObjectPtr py_type_enum(
+      PyObject_GetAttrString(py_value, "_type_enum"));
   if (py_type_enum == nullptr) {
+    TF_SetStatus(
+        status, TF_INVALID_ARGUMENT,
+        tensorflow::strings::StrCat("Expecting a DType.dtype for attr ", key,
+                                    ", got ", py_value->ob_type->tp_name)
+            .c_str());
     return false;
   }
 
-  if (!ParseIntValue(key, py_type_enum, status, value)) {
-    Py_DECREF(py_type_enum);
-    return false;
-  }
-
-  Py_DECREF(py_type_enum);
-  return true;
+  return ParseIntValue(key, py_type_enum.get(), status, value);
 }
 
 bool SetOpAttrList(
@@ -146,11 +184,11 @@ bool SetOpAttrList(
   const int num_values = PySequence_Size(py_list);
   if (attr_list_sizes != nullptr) (*attr_list_sizes)[key] = num_values;
 
-#define PARSE_LIST(c_type, parse_fn)                                \
-  std::unique_ptr<c_type[]> values(new c_type[num_values]);         \
-  for (int i = 0; i < num_values; ++i) {                            \
-    auto py_value = PySequence_ITEM(py_list, i);                    \
-    if (!parse_fn(key, py_value, status, &values[i])) return false; \
+#define PARSE_LIST(c_type, parse_fn)                                      \
+  std::unique_ptr<c_type[]> values(new c_type[num_values]);               \
+  for (int i = 0; i < num_values; ++i) {                                  \
+    tensorflow::Safe_PyObjectPtr py_value(PySequence_ITEM(py_list, i));   \
+    if (!parse_fn(key, py_value.get(), status, &values[i])) return false; \
   }
 
   if (type == TF_ATTR_STRING) {
@@ -175,9 +213,9 @@ bool SetOpAttrList(
     // dims across all the input lists.
     int total_dims = 0;
     for (int i = 0; i < num_values; ++i) {
-      auto py_value = PySequence_ITEM(py_list, i);
-      if (py_value != Py_None) {
-        if (!PySequence_Check(py_value)) {
+      tensorflow::Safe_PyObjectPtr py_value(PySequence_ITEM(py_list, i));
+      if (py_value.get() != Py_None) {
+        if (!PySequence_Check(py_value.get())) {
           TF_SetStatus(
               status, TF_INVALID_ARGUMENT,
               tensorflow::strings::StrCat(
@@ -186,7 +224,7 @@ bool SetOpAttrList(
                   .c_str());
           return false;
         }
-        const auto size = TensorShapeNumDims(py_value);
+        const auto size = TensorShapeNumDims(py_value.get());
         if (size >= 0) {
           total_dims += size;
         }
@@ -200,12 +238,12 @@ bool SetOpAttrList(
     std::unique_ptr<int[]> num_dims(new int[num_values]);
     int64_t* offset = buffer.get();
     for (int i = 0; i < num_values; ++i) {
-      auto py_value = PySequence_ITEM(py_list, i);
-      if (py_value == Py_None) {
+      tensorflow::Safe_PyObjectPtr py_value(PySequence_ITEM(py_list, i));
+      if (py_value.get() == Py_None) {
         dims[i] = nullptr;
         num_dims[i] = -1;
       } else {
-        const auto size = TensorShapeNumDims(py_value);
+        const auto size = TensorShapeNumDims(py_value.get());
         if (size == -1) {
           dims[i] = nullptr;
           num_dims[i] = -1;
@@ -214,10 +252,12 @@ bool SetOpAttrList(
         dims[i] = offset;
         num_dims[i] = size;
         for (int j = 0; j < size; ++j) {
-          auto inner_py_value = PySequence_ITEM(py_value, j);
-          if (inner_py_value == Py_None) {
+          tensorflow::Safe_PyObjectPtr inner_py_value(
+              PySequence_ITEM(py_value.get(), j));
+          if (inner_py_value.get() == Py_None) {
             *offset = -1;
-          } else if (!ParseInt64Value(key, inner_py_value, status, offset)) {
+          } else if (!ParseDimensionValue(key, inner_py_value.get(), status,
+                                          offset)) {
             return false;
           }
           ++offset;
@@ -238,21 +278,12 @@ bool SetOpAttrList(
   return true;
 }
 
-// This is only declared here since GetFunc makes a recursive call to
-// SetOpAttrScalarDefault.
-void SetOpAttrScalarDefault(
-    TFE_Context* ctx, TFE_Op* op, const tensorflow::AttrValue& default_value,
-    const char* attr_name,
-    tensorflow::gtl::FlatMap<string, tensorflow::int64>* attr_list_sizes,
-    TF_Status* status);
-
 TFE_Op* GetFunc(TFE_Context* ctx, const tensorflow::NameAttrList& func,
                 TF_Status* status) {
   TFE_Op* func_op = TFE_NewOp(ctx, func.name().data(), status);
   for (const auto& attr : func.attr()) {
     if (TF_GetCode(status) != TF_OK) return nullptr;
-    SetOpAttrScalarDefault(ctx, func_op, attr.second, attr.first.data(),
-                           nullptr, status);
+    SetOpAttrValueScalar(ctx, func_op, attr.second, attr.first.data(), status);
     if (TF_GetCode(status) != TF_OK) return nullptr;
   }
   return func_op;
@@ -398,10 +429,12 @@ bool SetOpAttrScalar(
       }
       std::unique_ptr<int64_t[]> dims(new int64_t[num_dims]);
       for (int i = 0; i < num_dims; ++i) {
-        auto inner_py_value = PySequence_ITEM(py_value, i);
-        if (inner_py_value == Py_None) {
+        tensorflow::Safe_PyObjectPtr inner_py_value(
+            PySequence_ITEM(py_value, i));
+        if (inner_py_value.get() == Py_None) {
           dims[i] = -1;
-        } else if (!ParseInt64Value(key, inner_py_value, status, &dims[i])) {
+        } else if (!ParseDimensionValue(key, inner_py_value.get(), status,
+                                        &dims[i])) {
           return false;
         }
       }
@@ -452,57 +485,9 @@ void SetOpAttrScalarDefault(
     const char* attr_name,
     tensorflow::gtl::FlatMap<string, tensorflow::int64>* attr_list_sizes,
     TF_Status* status) {
-  switch (default_value.value_case()) {
-    case tensorflow::AttrValue::kS:
-      TFE_OpSetAttrString(op, attr_name, default_value.s().data());
-      break;
-    case tensorflow::AttrValue::kI:
-      TFE_OpSetAttrInt(op, attr_name, static_cast<int64_t>(default_value.i()));
-      (*attr_list_sizes)[attr_name] = default_value.i();
-      break;
-    case tensorflow::AttrValue::kF:
-      TFE_OpSetAttrFloat(op, attr_name, default_value.f());
-      break;
-    case tensorflow::AttrValue::kB:
-      TFE_OpSetAttrBool(op, attr_name, default_value.b());
-      break;
-    case tensorflow::AttrValue::kType:
-      TFE_OpSetAttrType(op, attr_name,
-                        static_cast<TF_DataType>(default_value.type()));
-      break;
-    case tensorflow::AttrValue::kShape: {
-      const auto& tensor_shape = default_value.shape();
-      if (tensor_shape.unknown_rank()) {
-        TFE_OpSetAttrShape(op, attr_name, nullptr, -1, status);
-      } else {
-        const auto num_dims = tensor_shape.dim_size();
-        std::unique_ptr<int64_t[]> dims(new int64_t[num_dims]);
-        for (int i = 0; i < num_dims; ++i) {
-          dims[i] = tensor_shape.dim(i).size();
-        }
-        TFE_OpSetAttrShape(op, attr_name, dims.get(), num_dims, status);
-      }
-    } break;
-    case tensorflow::AttrValue::kFunc: {
-      const auto func_op = GetFunc(ctx, default_value.func(), status);
-      if (TF_GetCode(status) != TF_OK) return;
-      // TODO(nareshmodi): TFE_OpSetAttrFunction and TFE_OpSetAttrFunctionList
-      // require TFE_Op* and just convert it internally a NameAttrValue, so
-      // consider adding an overload to the C API to make this case easier.
-      TFE_OpSetAttrFunction(op, attr_name, func_op);
-    } break;
-    case tensorflow::AttrValue::kList:
-      TF_FALLTHROUGH_INTENDED;
-    case tensorflow::AttrValue::kTensor:
-      TF_FALLTHROUGH_INTENDED;
-    case tensorflow::AttrValue::kPlaceholder:
-      TF_FALLTHROUGH_INTENDED;
-    case tensorflow::AttrValue::VALUE_NOT_SET:
-      TF_SetStatus(
-          status, TF_UNIMPLEMENTED,
-          tensorflow::strings::StrCat("Unable to get setfor default value: ",
-                                      default_value.DebugString())
-              .data());
+  SetOpAttrValueScalar(ctx, op, default_value, attr_name, status);
+  if (default_value.value_case() == tensorflow::AttrValue::kI) {
+    (*attr_list_sizes)[attr_name] = default_value.i();
   }
 }
 
@@ -579,6 +564,8 @@ PyObject* fallback_exception_class = nullptr;
 // Python function that returns a backward_function.
 PyObject* backward_function_getter = nullptr;
 
+PyTypeObject* resource_variable_type = nullptr;
+
 tensorflow::mutex _uid_mutex(tensorflow::LINKER_INITIALIZED);
 tensorflow::int64 _uid GUARDED_BY(_uid_mutex) = 0;
 
@@ -627,11 +614,28 @@ PyObject* TFE_Py_RegisterExceptionClass(PyObject* e) {
                     "TFE_Py_RegisterExceptionClass: "
                     "Registered class should be subclass of Exception.");
     return nullptr;
-  } else {
-    Py_INCREF(e);
-    exception_class = e;
-    Py_RETURN_NONE;
   }
+
+  Py_INCREF(e);
+  exception_class = e;
+  Py_RETURN_NONE;
+}
+
+PyObject* TFE_Py_RegisterResourceVariableType(PyObject* e) {
+  if (!PyType_Check(e)) {
+    PyErr_SetString(
+        PyExc_TypeError,
+        "TFE_Py_RegisterResourceVariableType: Need to register a type.");
+    return nullptr;
+  }
+
+  if (resource_variable_type != nullptr) {
+    Py_DECREF(resource_variable_type);
+  }
+
+  Py_INCREF(e);
+  resource_variable_type = reinterpret_cast<PyTypeObject*>(e);
+  Py_RETURN_NONE;
 }
 
 PyObject* TFE_Py_RegisterFallbackExceptionClass(PyObject* e) {
@@ -1008,7 +1012,14 @@ static tensorflow::eager::TapeTensor TapeTensorFromTensor(PyObject* tensor) {
   if (EagerTensor_CheckExact(tensor)) {
     TFE_TensorHandle* t = EagerTensor_Handle(tensor);
     tensorflow::int64 id = EagerTensor_id(tensor);
-    return tensorflow::eager::TapeTensor{id, t->t.dtype(), t->t.shape()};
+    const tensorflow::Tensor* tensor = nullptr;
+    const tensorflow::Status status = t->Tensor(&tensor);
+    if (MaybeRaiseExceptionFromStatus(status, nullptr)) {
+      return tensorflow::eager::TapeTensor{id, t->dtype,
+                                           tensorflow::TensorShape({})};
+    } else {
+      return tensorflow::eager::TapeTensor{id, t->dtype, tensor->shape()};
+    }
   }
   tensorflow::int64 id = FastTensorId(tensor);
   if (PyErr_Occurred()) {
@@ -1312,6 +1323,16 @@ std::vector<PyObject*> MakeTensorList(PyObject* tensors) {
 PyObject* TFE_Py_TapeGradient(PyObject* tape, PyObject* vspace,
                               PyObject* target, PyObject* sources,
                               PyObject* output_gradients, TF_Status* status) {
+  TFE_Py_Tape* tape_obj = reinterpret_cast<TFE_Py_Tape*>(tape);
+  if (!tape_obj->tape->IsPersistent()) {
+    auto* tape_set = GetTapeSet();
+    if (tape_set->find(tape_obj) != tape_set->end()) {
+      PyErr_SetString(PyExc_RuntimeError,
+                      "Trying to call tape.gradient on a non-persistent tape "
+                      "while it is still active.");
+      return nullptr;
+    }
+  }
   PyVSpace c_vspace(vspace);
   if (!c_vspace.Initialize().ok()) {
     return nullptr;
@@ -1337,7 +1358,6 @@ PyObject* TFE_Py_TapeGradient(PyObject* tape, PyObject* vspace,
       Py_INCREF(tensor);
     }
   }
-  TFE_Py_Tape* tape_obj = reinterpret_cast<TFE_Py_Tape*>(tape);
   std::vector<PyObject*> result;
   status->status = tape_obj->tape->ComputeGradient(
       c_vspace, target_vec, sources_vec, outgrad_vec, &result);
@@ -1364,7 +1384,7 @@ PyObject* TFE_Py_TapeGradient(PyObject* tape, PyObject* vspace,
 }
 
 namespace {
-static const int kFastPathExecuteInputStartIndex = 6;
+static const int kFastPathExecuteInputStartIndex = 5;
 
 PyObject* GetPythonObjectFromString(const char* s) {
 #if PY_MAJOR_VERSION >= 3
@@ -1374,8 +1394,12 @@ PyObject* GetPythonObjectFromString(const char* s) {
 #endif
 }
 
-bool CheckEagerTensors(PyObject* seq, int start_index,
-                       const tensorflow::OpDef& op_def) {
+bool CheckResourceVariable(PyObject* item) {
+  return PyObject_TypeCheck(item, resource_variable_type);
+}
+
+bool CheckInputsOk(PyObject* seq, int start_index,
+                   const tensorflow::OpDef& op_def) {
   for (int i = 0; i < op_def.input_arg_size(); i++) {
     PyObject* item = PyTuple_GET_ITEM(seq, i + start_index);
     if (!op_def.input_arg(i).number_attr().empty() ||
@@ -1383,9 +1407,13 @@ bool CheckEagerTensors(PyObject* seq, int start_index,
       // This item should be a list input.
       if (!PyList_Check(item)) return false;
       for (Py_ssize_t j = 0; j < PyList_Size(item); j++) {
-        if (!EagerTensor_CheckExact(PyList_GET_ITEM(item, j))) return false;
+        PyObject* inner_item = PyList_GET_ITEM(item, j);
+        if (!EagerTensor_CheckExact(inner_item) &&
+            !CheckResourceVariable(inner_item)) {
+          return false;
+        }
       }
-    } else if (!EagerTensor_CheckExact(item)) {
+    } else if (!EagerTensor_CheckExact(item) && !CheckResourceVariable(item)) {
       return false;
     }
   }
@@ -1393,71 +1421,6 @@ bool CheckEagerTensors(PyObject* seq, int start_index,
   return true;
 }
 
-// Adds input and type attr to the op, and to the list of flattened
-// inputs/attrs.
-bool AddInputToOp(PyObject* input, const tensorflow::OpDef::ArgDef* input_arg,
-                  std::vector<PyObject*>* flattened_attrs,
-                  std::vector<PyObject*>* flattened_inputs, TFE_Op* op,
-                  TF_Status* status) {
-  TFE_TensorHandle* input_handle = EagerTensor_Handle(input);
-  if (input_arg != nullptr && !input_arg->type_attr().empty()) {
-    auto dtype = TFE_TensorHandleDataType(input_handle);
-    TFE_OpSetAttrType(op, input_arg->type_attr().data(), dtype);
-    if (flattened_attrs != nullptr) {
-      flattened_attrs->push_back(
-          GetPythonObjectFromString(input_arg->type_attr().data()));
-      flattened_attrs->push_back(PyLong_FromLong(dtype));
-    }
-  }
-
-  if (flattened_inputs != nullptr) {
-    flattened_inputs->push_back(input);
-  }
-  TFE_OpAddInput(op, input_handle, status);
-  if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) {
-    return false;
-  }
-  return true;
-}
-
-const tensorflow::OpDef* GetOpDef(PyObject* py_op_name) {
-  const char* op_name = TFE_GetPythonString(py_op_name);
-  if (op_name == nullptr) {
-    PyErr_SetString(PyExc_TypeError,
-                    Printf("expected a string for op_name, got %s instead",
-                           py_op_name->ob_type->tp_name)
-                        .c_str());
-    return nullptr;
-  }
-
-  const tensorflow::OpRegistrationData* op_reg_data = nullptr;
-  const tensorflow::Status lookup_status =
-      tensorflow::OpRegistry::Global()->LookUp(op_name, &op_reg_data);
-  if (MaybeRaiseExceptionFromStatus(lookup_status, nullptr)) {
-    return nullptr;
-  }
-  return &op_reg_data->op_def;
-}
-
-const char* GetDeviceName(PyObject* py_device_name) {
-  if (py_device_name != Py_None) {
-    return TFE_GetPythonString(py_device_name);
-  }
-  return nullptr;
-}
-
-bool RaiseIfNotPyList(PyObject* list, const string& attr_name) {
-  if (!PyList_Check(list)) {
-    PyErr_SetString(PyExc_TypeError,
-                    Printf("expected a list for attr %s, got %s instead",
-                           attr_name.data(), list->ob_type->tp_name)
-                        .data());
-
-    return false;
-  }
-  return true;
-}
-
 bool OpDoesntRequireOutput(const string& op_name) {
   static tensorflow::gtl::FlatSet<string>* ops_that_dont_require_outputs =
       new tensorflow::gtl::FlatSet<string>({
@@ -1582,7 +1545,6 @@ PyObject* RecordGradient(PyObject* op_name, PyObject* inputs, PyObject* attrs,
       break;
     }
   }
-
   if (!should_record) Py_RETURN_NONE;
 
   string c_op_name = TFE_GetPythonString(op_name);
@@ -1616,53 +1578,212 @@ PyObject* RecordGradient(PyObject* op_name, PyObject* inputs, PyObject* attrs,
   Py_RETURN_NONE;
 }
 
-bool RunCallbacks(bool run_gradient_callback, bool run_post_exec_callbacks,
-                  const tensorflow::OpDef* op_def, PyObject* args,
-                  const std::vector<PyObject*>& flattened_inputs,
-                  const std::vector<PyObject*>& flattened_attrs,
-                  PyObject* flattened_result, PyObject* op_name, PyObject* name,
-                  PyObject* record_gradient_callback, PyObject* callbacks) {
-  PyObject* inputs = PyTuple_New(flattened_inputs.size());
+void MaybeWatchVariable(PyObject* input) {
+  DCHECK(CheckResourceVariable(input));
+  DCHECK(PyObject_HasAttrString(input, "_trainable"));
+
+  tensorflow::Safe_PyObjectPtr trainable(
+      PyObject_GetAttrString(input, "_trainable"));
+  if (trainable.get() == Py_False) return;
+  TFE_Py_TapeSetWatchVariable(input);
+}
+
+bool ReadVariableOp(const FastPathOpExecInfo& parent_op_exec_info,
+                    PyObject* input, tensorflow::Safe_PyObjectPtr* output,
+                    TF_Status* status) {
+  MaybeWatchVariable(input);
+
+  TFE_Op* op = TFE_NewOp(parent_op_exec_info.ctx, "ReadVariableOp", status);
+  auto cleaner = tensorflow::gtl::MakeCleanup([op] { TFE_DeleteOp(op); });
+  if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) return false;
+
+  // Set dtype
+  DCHECK(PyObject_HasAttrString(input, "_dtype"));
+  tensorflow::Safe_PyObjectPtr dtype(PyObject_GetAttrString(input, "_dtype"));
+  int value;
+  if (!ParseTypeValue("_dtype", dtype.get(), status, &value)) {
+    return false;
+  }
+  TFE_OpSetAttrType(op, "dtype", static_cast<TF_DataType>(value));
+
+  TFE_OpSetDevice(op, parent_op_exec_info.device_name, status);
+  if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) return false;
+
+  // Get handle
+  tensorflow::Safe_PyObjectPtr handle(PyObject_GetAttrString(input, "_handle"));
+  if (!EagerTensor_CheckExact(handle.get())) return false;
+  TFE_OpAddInput(op, EagerTensor_Handle(handle.get()), status);
+  if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) return false;
+
+  int num_retvals = 1;
+  TFE_TensorHandle* output_handle;
+  TFE_Execute(op, &output_handle, &num_retvals, status);
+  if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) return false;
+
+  // Always create the py object (and correctly DECREF it) from the returned
+  // value, else the data will leak.
+  output->reset(EagerTensorFromHandle(output_handle));
+
+  // TODO(nareshmodi): Should we run post exec callbacks here?
+  if (parent_op_exec_info.run_gradient_callback) {
+    tensorflow::Safe_PyObjectPtr inputs(PyTuple_New(1));
+    PyTuple_SET_ITEM(inputs.get(), 0, handle.release());
+
+    tensorflow::Safe_PyObjectPtr outputs(PyTuple_New(1));
+    Py_INCREF(output->get());  // Keep alive: the tuple steals a reference.
+    PyTuple_SET_ITEM(outputs.get(), 0, output->get());
+
+    if (!RecordGradient(GetPythonObjectFromString("ReadVariableOp"),
+                        inputs.get(), Py_None, outputs.get(), Py_None)) {
+      return false;
+    }
+  }
+
+  return true;
+}
+
+// Supports only 2 cases at the moment:
+//  i) input is an EagerTensor
+//  ii) input is a ResourceVariable - in this case, the variable is read via
+//  ReadVariableOp and the resulting EagerTensor is returned.
+bool ConvertToTensor(const FastPathOpExecInfo& op_exec_info, PyObject* input,
+                     tensorflow::Safe_PyObjectPtr* output_handle,
+                     TF_Status* status) {
+  if (CheckResourceVariable(input)) {
+    return ReadVariableOp(op_exec_info, input, output_handle, status);
+  }
+
+  Py_INCREF(input);
+  output_handle->reset(input);
+
+  return true;
+}
+
+// Adds input and type attr to the op, and to the list of flattened
+// inputs/attrs.
+bool AddInputToOp(const FastPathOpExecInfo& op_exec_info, PyObject* input,
+                  const tensorflow::OpDef::ArgDef* input_arg,
+                  std::vector<tensorflow::Safe_PyObjectPtr>* flattened_attrs,
+                  std::vector<tensorflow::Safe_PyObjectPtr>* flattened_inputs,
+                  TFE_Op* op, TF_Status* status) {
+  // py_eager_tensor's ownership is transferred to flattened_inputs if it is
+  // required, else the object is destroyed and DECREF'd when the object goes
+  // out of scope in this function.
+  tensorflow::Safe_PyObjectPtr py_eager_tensor = nullptr;
+
+  if (!ConvertToTensor(op_exec_info, input, &py_eager_tensor, status)) {
+    return false;
+  }
+
+  TFE_TensorHandle* input_handle = EagerTensor_Handle(py_eager_tensor.get());
+
+  if (input_arg != nullptr && !input_arg->type_attr().empty()) {
+    auto dtype = TFE_TensorHandleDataType(input_handle);
+    TFE_OpSetAttrType(op, input_arg->type_attr().data(), dtype);
+    if (flattened_attrs != nullptr) {
+      flattened_attrs->emplace_back(
+          GetPythonObjectFromString(input_arg->type_attr().data()));
+      flattened_attrs->emplace_back(PyLong_FromLong(dtype));
+    }
+  }
+
+  if (flattened_inputs != nullptr) {
+    flattened_inputs->emplace_back(std::move(py_eager_tensor));
+  }
+
+  TFE_OpAddInput(op, input_handle, status);
+  if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) {
+    return false;
+  }
+
+  return true;
+}
+
+const tensorflow::OpDef* GetOpDef(PyObject* py_op_name) {
+  const char* op_name = TFE_GetPythonString(py_op_name);
+  if (op_name == nullptr) {
+    PyErr_SetString(PyExc_TypeError,
+                    Printf("expected a string for op_name, got %s instead",
+                           py_op_name->ob_type->tp_name)
+                        .c_str());
+    return nullptr;
+  }
+
+  const tensorflow::OpRegistrationData* op_reg_data = nullptr;
+  const tensorflow::Status lookup_status =
+      tensorflow::OpRegistry::Global()->LookUp(op_name, &op_reg_data);
+  if (MaybeRaiseExceptionFromStatus(lookup_status, nullptr)) {
+    return nullptr;
+  }
+  return &op_reg_data->op_def;
+}
+
+const char* GetDeviceName(PyObject* py_device_name) {
+  if (py_device_name != Py_None) {
+    return TFE_GetPythonString(py_device_name);
+  }
+  return nullptr;
+}
+
+bool RaiseIfNotPyList(PyObject* list, const string& attr_name) {
+  if (!PyList_Check(list)) {
+    PyErr_SetString(PyExc_TypeError,
+                    Printf("expected a list for attr %s, got %s instead",
+                           attr_name.data(), list->ob_type->tp_name)
+                        .data());
+
+    return false;
+  }
+  return true;
+}
+
+bool RunCallbacks(
+    const FastPathOpExecInfo& op_exec_info, PyObject* args,
+    const std::vector<tensorflow::Safe_PyObjectPtr>& flattened_inputs,
+    const std::vector<tensorflow::Safe_PyObjectPtr>& flattened_attrs,
+    PyObject* flattened_result) {
+  if (!op_exec_info.run_callbacks) return true;
+
+  tensorflow::Safe_PyObjectPtr inputs(PyTuple_New(flattened_inputs.size()));
   for (int i = 0; i < flattened_inputs.size(); i++) {
-    PyObject* input = flattened_inputs[i];
+    PyObject* input = flattened_inputs[i].get();
     Py_INCREF(input);
-    PyTuple_SET_ITEM(inputs, i, input);
+    PyTuple_SET_ITEM(inputs.get(), i, input);
   }
 
   int num_non_inferred_attrs = PyTuple_GET_SIZE(args) -
-                               op_def->input_arg_size() -
+                               op_exec_info.op_def->input_arg_size() -
                                kFastPathExecuteInputStartIndex;
   int num_attrs = flattened_attrs.size() + num_non_inferred_attrs;
-  PyObject* attrs = PyTuple_New(num_attrs);
+  tensorflow::Safe_PyObjectPtr attrs(PyTuple_New(num_attrs));
 
   for (int i = 0; i < num_non_inferred_attrs; i++) {
-    auto* attr = PyTuple_GET_ITEM(
-        args, kFastPathExecuteInputStartIndex + op_def->input_arg_size() + i);
+    auto* attr =
+        PyTuple_GET_ITEM(args, kFastPathExecuteInputStartIndex +
+                                   op_exec_info.op_def->input_arg_size() + i);
     Py_INCREF(attr);
-    PyTuple_SET_ITEM(attrs, i, attr);
+    PyTuple_SET_ITEM(attrs.get(), i, attr);
   }
   for (int i = num_non_inferred_attrs; i < num_attrs; i++) {
-    // Not INCREFing anything in flattened_attrs as each of those is a new
-    // reference, so allow the attrs tuple to steal the reference.
-    PyTuple_SET_ITEM(attrs, i, flattened_attrs.at(i - num_non_inferred_attrs));
+    PyObject* attr_or_name =
+        flattened_attrs.at(i - num_non_inferred_attrs).get();
+    Py_INCREF(attr_or_name);
+    PyTuple_SET_ITEM(attrs.get(), i, attr_or_name);
   }
 
-  PyObject* callback_args =
-      Py_BuildValue("OOOOO", op_name, inputs, attrs, flattened_result, name);
-
-  auto cleaner = tensorflow::gtl::MakeCleanup([inputs, attrs, callback_args] {
-    Py_DECREF(inputs);
-    Py_DECREF(attrs);
-    Py_DECREF(callback_args);
-  });
-
-  if (run_gradient_callback) {
-    RecordGradient(op_name, inputs, attrs, flattened_result, name);
+  if (op_exec_info.run_gradient_callback) {
+    if (!RecordGradient(op_exec_info.op_name, inputs.get(), attrs.get(),
+                        flattened_result, op_exec_info.name)) {
+      return false;
+    }
   }
 
-  if (run_post_exec_callbacks) {
-    for (Py_ssize_t i = 0; i < PyList_Size(callbacks); i++) {
-      PyObject* callback_fn = PyList_GET_ITEM(callbacks, i);
+  if (op_exec_info.run_post_exec_callbacks) {
+    tensorflow::Safe_PyObjectPtr callback_args(
+        Py_BuildValue("OOOOO", op_exec_info.op_name, inputs.get(), attrs.get(),
+                      flattened_result, op_exec_info.name));
+    for (Py_ssize_t i = 0; i < PyList_Size(op_exec_info.callbacks); i++) {
+      PyObject* callback_fn = PyList_GET_ITEM(op_exec_info.callbacks, i);
       if (!PyCallable_Check(callback_fn)) {
         PyErr_SetString(
             PyExc_TypeError,
@@ -1673,7 +1794,7 @@ bool RunCallbacks(bool run_gradient_callback, bool run_post_exec_callbacks,
         return false;
       }
       PyObject* callback_result =
-          PyObject_CallObject(callback_fn, callback_args);
+          PyObject_CallObject(callback_fn, callback_args.get());
       if (!callback_result) {
         return false;
       }
@@ -1697,15 +1818,30 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
     return nullptr;
   }
 
-  TFE_Context* ctx = reinterpret_cast<TFE_Context*>(
+  FastPathOpExecInfo op_exec_info;
+
+  op_exec_info.ctx = reinterpret_cast<TFE_Context*>(
       PyCapsule_GetPointer(PyTuple_GET_ITEM(args, 0), nullptr));
-  const char* device_name = GetDeviceName(PyTuple_GET_ITEM(args, 1));
-  PyObject* op_name = PyTuple_GET_ITEM(args, 2);
-  const tensorflow::OpDef* op_def = GetOpDef(op_name);
-  if (op_def == nullptr) return nullptr;
-  PyObject* record_gradient_callback = PyTuple_GET_ITEM(args, 3);
-  PyObject* name = PyTuple_GET_ITEM(args, 4);
-  PyObject* callbacks = PyTuple_GET_ITEM(args, 5);
+  op_exec_info.device_name = GetDeviceName(PyTuple_GET_ITEM(args, 1));
+  op_exec_info.op_name = PyTuple_GET_ITEM(args, 2);
+  op_exec_info.op_def = GetOpDef(op_exec_info.op_name);
+  if (op_exec_info.op_def == nullptr) return nullptr;
+  op_exec_info.name = PyTuple_GET_ITEM(args, 3);
+  op_exec_info.callbacks = PyTuple_GET_ITEM(args, 4);
+
+  const tensorflow::OpDef* op_def = op_exec_info.op_def;
+
+  // TODO(nareshmodi): Add a benchmark for the fast-path with gradient callbacks
+  // (similar to benchmark_tf_gradient_function_*). Also consider using an
+  // InlinedVector for flattened_attrs and flattened_inputs if the benchmarks
+  // point out problems with heap allocs.
+  op_exec_info.run_gradient_callback =
+      !*ThreadTapeIsStopped() && !GetTapeSet()->empty();
+  op_exec_info.run_post_exec_callbacks =
+      op_exec_info.callbacks != Py_None &&
+      PyList_Size(op_exec_info.callbacks) > 0;
+  op_exec_info.run_callbacks = op_exec_info.run_gradient_callback ||
+                               op_exec_info.run_post_exec_callbacks;
 
   if (args_size < kFastPathExecuteInputStartIndex + op_def->input_arg_size()) {
     PyErr_SetString(
@@ -1718,7 +1854,7 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
     return nullptr;
   }
 
-  if (!CheckEagerTensors(args, kFastPathExecuteInputStartIndex, *op_def)) {
+  if (!CheckInputsOk(args, kFastPathExecuteInputStartIndex, *op_def)) {
     RaiseFallbackException(
         "This function does not handle the case of the path where "
         "all inputs are not already EagerTensors.");
@@ -1726,7 +1862,7 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
   }
 
   TF_Status* status = TF_NewStatus();
-  TFE_Op* op = TFE_NewOp(ctx, op_def->name().c_str(), status);
+  TFE_Op* op = TFE_NewOp(op_exec_info.ctx, op_def->name().c_str(), status);
   auto cleaner = tensorflow::gtl::MakeCleanup([status, op] {
     TF_DeleteStatus(status);
     TFE_DeleteOp(op);
@@ -1753,8 +1889,8 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
     // OpRegistrationData.
     for (const auto& attr : op_def->attr()) {
       if (attr_name == attr.name()) {
-        SetOpAttrWithDefaults(ctx, op, attr, attr_name.data(), py_attr_value,
-                              &attr_list_sizes, status);
+        SetOpAttrWithDefaults(op_exec_info.ctx, op, attr, attr_name.data(),
+                              py_attr_value, &attr_list_sizes, status);
 
         if (TF_GetCode(status) != TF_OK) {
           RaiseFallbackException(TF_Message(status));
@@ -1766,34 +1902,28 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
     }
   }
 
-  TFE_OpSetDevice(op, device_name, status);
+  TFE_OpSetDevice(op, op_exec_info.device_name, status);
   if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) {
     return nullptr;
   }
 
-  // TODO(nareshmodi): Add a benchmark for the fast-path with gradient callbacks
-  // (similar to benchmark_tf_gradient_function_*). Also consider using an
-  // InlinedVector for flattened_attrs and flattened_inputs if the benchmarks
-  // point out problems with heap allocs.
-  bool run_gradient_callback = !*ThreadTapeIsStopped() &&
-                               !GetTapeSet()->empty() &&
-                               record_gradient_callback != Py_None;
-  bool run_post_exec_callbacks =
-      callbacks != Py_None && PyList_Size(callbacks) > 0;
-  bool run_callbacks = run_gradient_callback || run_post_exec_callbacks;
   // Flat attrs and inputs as required by the record_gradient call. The attrs
   // here only contain inferred attrs (non-inferred attrs are added directly
   // from the input args).
-  // All items in flattened_attrs contain new references.
-  // All items in flattened_inputs contain borrowed references.
+  // All items in flattened_attrs and flattened_inputs are Safe_PyObjectPtrs,
+  // so any code that hands one to a reference-stealing API must INCREF it
+  // first.
   // TODO(nareshmodi): figure out why PyList_New/PyList_Append don't work
   // directly.
-  std::unique_ptr<std::vector<PyObject*>> flattened_attrs = nullptr;
-  std::unique_ptr<std::vector<PyObject*>> flattened_inputs = nullptr;
+  std::unique_ptr<std::vector<tensorflow::Safe_PyObjectPtr>> flattened_attrs =
+      nullptr;
+  std::unique_ptr<std::vector<tensorflow::Safe_PyObjectPtr>> flattened_inputs =
+      nullptr;
 
-  if (run_callbacks) {
-    flattened_attrs.reset(new std::vector<PyObject*>);
-    flattened_inputs.reset(new std::vector<PyObject*>);
+  // TODO(nareshmodi): Encapsulate callbacks information into a struct.
+  if (op_exec_info.run_callbacks) {
+    flattened_attrs.reset(new std::vector<tensorflow::Safe_PyObjectPtr>);
+    flattened_inputs.reset(new std::vector<tensorflow::Safe_PyObjectPtr>);
   }
 
   // Add inferred attrs and inputs.
@@ -1813,16 +1943,16 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
       Py_ssize_t len = PyList_Size(input);
 
       TFE_OpSetAttrInt(op, input_arg.number_attr().data(), len);
-      if (run_callbacks) {
-        flattened_attrs->push_back(
+      if (op_exec_info.run_callbacks) {
+        flattened_attrs->emplace_back(
             GetPythonObjectFromString(input_arg.number_attr().data()));
-        flattened_attrs->push_back(PyLong_FromLong(len));
+        flattened_attrs->emplace_back(PyLong_FromLong(len));
       }
       attr_list_sizes[input_arg.number_attr()] = len;
 
       if (len > 0) {
         // First item adds the type attr.
-        if (!AddInputToOp(PyList_GET_ITEM(input, 0), &input_arg,
+        if (!AddInputToOp(op_exec_info, PyList_GET_ITEM(input, 0), &input_arg,
                           flattened_attrs.get(), flattened_inputs.get(), op,
                           status)) {
           return nullptr;
@@ -1830,7 +1960,8 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
 
         for (Py_ssize_t j = 1; j < len; j++) {
           // Since the list is homogeneous, we don't need to re-add the attr.
-          if (!AddInputToOp(PyList_GET_ITEM(input, j), nullptr /* input_arg */,
+          if (!AddInputToOp(op_exec_info, PyList_GET_ITEM(input, j),
+                            nullptr /* input_arg */,
                             nullptr /* flattened_attrs */,
                             flattened_inputs.get(), op, status)) {
             return nullptr;
@@ -1844,12 +1975,20 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
       Py_ssize_t len = PyList_Size(input);
       tensorflow::gtl::InlinedVector<TF_DataType, 4> attr_value(len);
       PyObject* py_attr_value = nullptr;
-      if (run_callbacks) {
+      if (op_exec_info.run_callbacks) {
         py_attr_value = PyTuple_New(len);
       }
       for (Py_ssize_t j = 0; j < len; j++) {
         PyObject* py_input = PyList_GET_ITEM(input, j);
-        TFE_TensorHandle* input_handle = EagerTensor_Handle(py_input);
+        tensorflow::Safe_PyObjectPtr py_eager_tensor;
+        if (!ConvertToTensor(op_exec_info, py_input, &py_eager_tensor,
+                             status)) {
+          return nullptr;
+        }
+
+        TFE_TensorHandle* input_handle =
+            EagerTensor_Handle(py_eager_tensor.get());
+
         attr_value[j] = TFE_TensorHandleDataType(input_handle);
 
         TFE_OpAddInput(op, input_handle, status);
@@ -1857,22 +1996,23 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
           return nullptr;
         }
 
-        if (run_callbacks) {
-          flattened_inputs->push_back(py_input);
+        if (op_exec_info.run_callbacks) {
+          flattened_inputs->emplace_back(std::move(py_eager_tensor));
 
           PyTuple_SET_ITEM(py_attr_value, j, PyLong_FromLong(attr_value[j]));
         }
       }
-      if (run_callbacks) {
-        flattened_attrs->push_back(GetPythonObjectFromString(attr_name.data()));
-        flattened_attrs->push_back(py_attr_value);
+      if (op_exec_info.run_callbacks) {
+        flattened_attrs->emplace_back(
+            GetPythonObjectFromString(attr_name.data()));
+        flattened_attrs->emplace_back(py_attr_value);
       }
       TFE_OpSetAttrTypeList(op, attr_name.data(), attr_value.data(),
                             attr_value.size());
       attr_list_sizes[attr_name] = len;
     } else {
       // The item is a single item.
-      if (!AddInputToOp(input, &input_arg, flattened_attrs.get(),
+      if (!AddInputToOp(op_exec_info, input, &input_arg, flattened_attrs.get(),
                         flattened_inputs.get(), op, status)) {
         return nullptr;
       }
@@ -1896,27 +2036,27 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
   Py_BEGIN_ALLOW_THREADS;
   TFE_Execute(op, retvals.data(), &num_retvals, status);
   Py_END_ALLOW_THREADS;
+
   if (TF_GetCode(status) != TF_OK) {
     // Augment the status with the op_name for easier debugging similar to
     // TFE_Py_Execute.
     TF_SetStatus(status, TF_GetCode(status),
-                 tensorflow::strings::StrCat(TF_Message(status), " [Op:",
-                                             TFE_GetPythonString(op_name), "]")
+                 tensorflow::strings::StrCat(
+                     TF_Message(status),
+                     " [Op:", TFE_GetPythonString(op_exec_info.op_name), "]")
                      .c_str());
 
     MaybeRaiseExceptionFromTFStatus(status, nullptr);
     return nullptr;
   }
 
-  PyObject* flat_result = PyList_New(num_retvals);
+  tensorflow::Safe_PyObjectPtr flat_result(PyList_New(num_retvals));
   for (int i = 0; i < num_retvals; ++i) {
-    PyList_SET_ITEM(flat_result, i, EagerTensorFromHandle(retvals[i]));
+    PyList_SET_ITEM(flat_result.get(), i, EagerTensorFromHandle(retvals[i]));
   }
 
-  if (run_callbacks &&
-      !RunCallbacks(run_gradient_callback, run_post_exec_callbacks, op_def,
-                    args, *flattened_inputs, *flattened_attrs, flat_result,
-                    op_name, name, record_gradient_callback, callbacks)) {
+  if (!RunCallbacks(op_exec_info, args, *flattened_inputs, *flattened_attrs,
+                    flat_result.get())) {
     return nullptr;
   }
 
@@ -1928,11 +2068,10 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
   if (op_def->output_arg_size() == 1) {
     if (!op_def->output_arg(0).number_attr().empty() ||
         !op_def->output_arg(0).type_list_attr().empty()) {
-      return flat_result;
+      return flat_result.release();
     } else {
-      auto* result = PyList_GET_ITEM(flat_result, 0);
+      auto* result = PyList_GET_ITEM(flat_result.get(), 0);
       Py_INCREF(result);
-      Py_DECREF(flat_result);
       return result;
     }
   }
@@ -1945,7 +2084,7 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
       int list_length = attr_list_sizes[op_def->output_arg(i).number_attr()];
       PyObject* inner_list = PyList_New(list_length);
       for (int j = 0; j < list_length; j++) {
-        PyObject* obj = PyList_GET_ITEM(flat_result, flat_result_index++);
+        PyObject* obj = PyList_GET_ITEM(flat_result.get(), flat_result_index++);
         Py_INCREF(obj);
         PyList_SET_ITEM(inner_list, j, obj);
       }
@@ -1954,18 +2093,17 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
       int list_length = attr_list_sizes[op_def->output_arg(i).type_list_attr()];
       PyObject* inner_list = PyList_New(list_length);
       for (int j = 0; j < list_length; j++) {
-        PyObject* obj = PyList_GET_ITEM(flat_result, flat_result_index++);
+        PyObject* obj = PyList_GET_ITEM(flat_result.get(), flat_result_index++);
         Py_INCREF(obj);
         PyList_SET_ITEM(inner_list, j, obj);
       }
       PyList_SET_ITEM(result, i, inner_list);
     } else {
-      PyObject* obj = PyList_GET_ITEM(flat_result, flat_result_index++);
+      PyObject* obj = PyList_GET_ITEM(flat_result.get(), flat_result_index++);
       Py_INCREF(obj);
       PyList_SET_ITEM(result, i, obj);
     }
   }
-  Py_DECREF(flat_result);
   return result;
 }
 
diff --git a/tensorflow/python/eager/pywrap_tfe_test.py b/tensorflow/python/eager/pywrap_tfe_test.py
index 49323e6640e664ef5f98b227964f9dd4e248ca39..faaae40b3f1ef02984a7a75c23ae4acae65ac335 100644
--- a/tensorflow/python/eager/pywrap_tfe_test.py
+++ b/tensorflow/python/eager/pywrap_tfe_test.py
@@ -21,13 +21,13 @@ from __future__ import print_function
 from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
-from tensorflow.python.eager import execute
 from tensorflow.python.eager import test
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
+from tensorflow.python.ops import resource_variable_ops
 
 
 class Tests(test.TestCase):
@@ -46,15 +46,28 @@ class Tests(test.TestCase):
     self.assertAllClose(
         math_ops.matmul(a_2_by_2, b_2_by_2),
         pywrap_tensorflow.TFE_Py_FastPathExecute(
-            ctx._handle, ctx.device_name, "MatMul", execute.record_gradient,
-            None, None, a_2_by_2, b_2_by_2, "transpose_a", False, "transpose_b",
-            False))
+            ctx._handle, ctx.device_name, "MatMul", None, None, a_2_by_2,
+            b_2_by_2, "transpose_a", False, "transpose_b", False))
     self.assertAllClose(
         math_ops.matmul(a_100_by_784, b_100_by_784, transpose_b=True),
         pywrap_tensorflow.TFE_Py_FastPathExecute(
-            ctx._handle, ctx.device_name, "MatMul", execute.record_gradient,
-            None, None, a_100_by_784, b_100_by_784, "transpose_a", False,
-            "transpose_b", True))
+            ctx._handle, ctx.device_name, "MatMul", None, None, a_100_by_784,
+            b_100_by_784, "transpose_a", False, "transpose_b", True))
+
+  @test_util.assert_no_new_tensors
+  @test_util.assert_no_garbage_created
+  def testFastpathExecute_ResourceVariableMatMulCorrectResponse(self):
+    ctx = context.context()
+    a_2_by_2 = constant_op.constant(1.0, shape=[2, 2])
+    m = resource_variable_ops.ResourceVariable(a_2_by_2)
+    x = pywrap_tensorflow.TFE_Py_FastPathExecute(
+        ctx._handle, ctx.device_name, "MatMul", None, None, m, m, "transpose_a",
+        False, "transpose_b", False)
+    y = pywrap_tensorflow.TFE_Py_FastPathExecute(
+        ctx._handle, ctx.device_name, "MatMul", None, None, a_2_by_2, a_2_by_2,
+        "transpose_a", False, "transpose_b", False)
+
+    self.assertAllEqual(x, y)
 
   @test_util.assert_no_new_tensors
   @test_util.assert_no_garbage_created
@@ -64,12 +77,27 @@ class Tests(test.TestCase):
       a_2_by_2 = constant_op.constant(1.0, shape=[2, 2])
       tape.watch(a_2_by_2)
       z = pywrap_tensorflow.TFE_Py_FastPathExecute(
-          ctx._handle, ctx.device_name, "MatMul", execute.record_gradient, None,
-          None, a_2_by_2, a_2_by_2, "transpose_a", False, "transpose_b", False)
+          ctx._handle, ctx.device_name, "MatMul", None, None, a_2_by_2,
+          a_2_by_2, "transpose_a", False, "transpose_b", False)
     dz_dy = tape.gradient(z, [a_2_by_2])[0]
     self.assertAllEqual(dz_dy.numpy(),
                         constant_op.constant(4.0, shape=[2, 2]).numpy())
 
+  @test_util.assert_no_new_tensors
+  @test_util.assert_no_garbage_created
+  def testFastpathExecute_ResourceVariableTapeWrite(self):
+    ctx = context.context()
+    with backprop.GradientTape(persistent=True) as tape:
+      a_2_by_2 = constant_op.constant(1.0, shape=[2, 2])
+      m = resource_variable_ops.ResourceVariable(a_2_by_2)
+      tape.watch(m)
+      z = pywrap_tensorflow.TFE_Py_FastPathExecute(
+          ctx._handle, ctx.device_name, "MatMul", None, None, m, m,
+          "transpose_a", False, "transpose_b", False)
+    dz_dy = tape.gradient(z, [m])[0]
+    self.assertAllEqual(dz_dy.numpy(),
+                        constant_op.constant(4.0, shape=[2, 2]).numpy())
+
   # Tests homogeneous list op
   @test_util.assert_no_new_tensors
   @test_util.assert_no_garbage_created
@@ -80,9 +108,9 @@ class Tests(test.TestCase):
 
     self.assertAllClose(
         math_ops.add_n([a_2_by_2, b_2_by_2]),
-        pywrap_tensorflow.TFE_Py_FastPathExecute(
-            ctx._handle, ctx.device_name, "AddN", execute.record_gradient, None,
-            None, [a_2_by_2, b_2_by_2]))
+        pywrap_tensorflow.TFE_Py_FastPathExecute(ctx._handle, ctx.device_name,
+                                                 "AddN", None, None,
+                                                 [a_2_by_2, b_2_by_2]))
 
   # Tests homogeneous list op
   @test_util.assert_no_new_tensors
@@ -96,8 +124,8 @@ class Tests(test.TestCase):
       tape.watch(a_2_by_2)
       tape.watch(b_2_by_2)
       z1 = pywrap_tensorflow.TFE_Py_FastPathExecute(
-          ctx._handle, ctx.device_name, "AddN", execute.record_gradient, None,
-          None, [a_2_by_2, b_2_by_2])
+          ctx._handle, ctx.device_name, "AddN", None, None,
+          [a_2_by_2, b_2_by_2])
       z2 = math_ops.add_n([a_2_by_2, b_2_by_2])
     dz1_dy = tape.gradient(z1, [a_2_by_2])[0]
     dz2_dy = tape.gradient(z2, [a_2_by_2])[0]
@@ -113,9 +141,9 @@ class Tests(test.TestCase):
 
     self.assertAllClose(
         array_ops.identity_n([a_2_by_2, b_2_by_2]),
-        pywrap_tensorflow.TFE_Py_FastPathExecute(
-            ctx._handle, ctx.device_name, "IdentityN", execute.record_gradient,
-            None, None, [a_2_by_2, b_2_by_2]))
+        pywrap_tensorflow.TFE_Py_FastPathExecute(ctx._handle, ctx.device_name,
+                                                 "IdentityN", None, None,
+                                                 [a_2_by_2, b_2_by_2]))
 
   # Tests heterogeneous list op
   @test_util.assert_no_new_tensors
@@ -129,8 +157,8 @@ class Tests(test.TestCase):
       tape.watch(a_2_by_2)
       tape.watch(b_2_by_2)
       z1 = pywrap_tensorflow.TFE_Py_FastPathExecute(
-          ctx._handle, ctx.device_name, "IdentityN", execute.record_gradient,
-          None, None, [a_2_by_2, b_2_by_2])
+          ctx._handle, ctx.device_name, "IdentityN", None, None,
+          [a_2_by_2, b_2_by_2])
       z2 = array_ops.identity_n([a_2_by_2, b_2_by_2])
     dz1_dy = tape.gradient(z1[0], [a_2_by_2])[0]
     dz2_dy = tape.gradient(z2[0], [a_2_by_2])[0]
@@ -141,28 +169,26 @@ class Tests(test.TestCase):
   def testFastpathExecute_InvalidInputs(self):
     a_2_by_2 = random_ops.random_uniform((2, 2))
     ctx = context.context()
-    assert not ctx.in_graph_mode(
+    assert ctx.executing_eagerly(
     ), "The prototype doesn't contain C code for graph construction"
     ctx_handle = ctx._handle  # pylint: disable=protected-access
 
     # Not enough base params
     with self.assertRaisesRegexp(ValueError,
-                                 "at least 6 items in the input tuple"):
+                                 "at least 5 items in the input tuple"):
       pywrap_tensorflow.TFE_Py_FastPathExecute(ctx_handle, ctx.device_name,
                                                "Identity")
 
     # Not enough inputs
     with self.assertRaisesRegexp(ValueError,
-                                 "Expected to be at least 7, was 6"):
-      pywrap_tensorflow.TFE_Py_FastPathExecute(
-          ctx_handle, ctx_handle, "Identity", backprop._record_gradient, None,
-          [])
+                                 "Expected to be at least 6, was 5"):
+      pywrap_tensorflow.TFE_Py_FastPathExecute(ctx_handle, ctx_handle,
+                                               "Identity", None, [])
 
     # Bad type
     with self.assertRaisesRegexp(TypeError, "expected a string for op_name"):
-      pywrap_tensorflow.TFE_Py_FastPathExecute(
-          ctx_handle, ctx.device_name, ctx_handle, backprop._record_gradient,
-          None, [], a_2_by_2)
+      pywrap_tensorflow.TFE_Py_FastPathExecute(ctx_handle, ctx.device_name,
+                                               ctx_handle, None, [], a_2_by_2)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/python/eager/tape_test.py b/tensorflow/python/eager/tape_test.py
index b490bac66db03b0a61a8852f45f1f558cccaf121..4326d5efa3d362e883815eb2d3dafb27df25afd4 100644
--- a/tensorflow/python/eager/tape_test.py
+++ b/tensorflow/python/eager/tape_test.py
@@ -21,11 +21,11 @@ from __future__ import print_function
 
 from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
-from tensorflow.python.eager import custom_gradient
 from tensorflow.python.eager import test
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import custom_gradient
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import math_ops
 # Importing nn_grad for the registration functions.
@@ -165,21 +165,6 @@ class TapeTest(test.TestCase):
     g, = backprop.gradients_function(fn, [0])(t)
     self.assertAllEqual(g, 1.0)
 
-  def testCustomGradientGraphMode(self):
-    with context.graph_mode(), self.test_session():
-
-      @custom_gradient.custom_gradient
-      def f(x):
-
-        def grad(dresult):
-          return dresult * 10.0
-
-        return x, grad
-
-      inp = constant_op.constant(1.0)
-      grad = gradients_impl.gradients(f(inp), inp)
-      self.assertAllEqual(grad[0].eval(), 10.0)
-
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/estimator/BUILD b/tensorflow/python/estimator/BUILD
index c519fd557a9319d6ef5522b26198e5b4202917fc..5afb5a7dd5d88715768fda985fcea34bc798e37f 100644
--- a/tensorflow/python/estimator/BUILD
+++ b/tensorflow/python/estimator/BUILD
@@ -7,6 +7,7 @@ package(
 licenses(["notice"])  # Apache 2.0
 
 load("//tensorflow:tensorflow.bzl", "py_test")
+load("//tensorflow:tensorflow.bzl", "cuda_py_test")
 
 filegroup(
     name = "all_files",
@@ -35,9 +36,9 @@ py_library(
         ":linear",
         ":model_fn",
         ":parsing_utils",
+        ":replicate_model_fn",
         ":run_config",
         ":training",
-        ":warm_starting_util",
         "//tensorflow/python:util",
     ],
 )
@@ -264,7 +265,6 @@ py_library(
         "//tensorflow/python:nn",
         "//tensorflow/python:partitioned_variables",
         "//tensorflow/python:summary",
-        "//tensorflow/python:training",
         "//tensorflow/python:variable_scope",
         "//tensorflow/python/feature_column",
         "//tensorflow/python/ops/losses",
@@ -278,12 +278,12 @@ py_library(
     srcs = ["canned/dnn_testing_utils.py"],
     srcs_version = "PY2AND3",
     deps = [
+        ":estimator",
         ":head",
         ":metric_keys",
         ":model_fn",
         ":numpy_io",
         ":prediction_keys",
-        ":warm_starting_util",
         "//tensorflow/core:protos_all_py",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:check_ops",
@@ -427,7 +427,6 @@ py_library(
         ":model_fn",
         ":run_config",
         ":util",
-        ":warm_starting_util",
         "//tensorflow/core:protos_all_py",
         "//tensorflow/python:client",
         "//tensorflow/python:control_flow_ops",
@@ -617,6 +616,7 @@ py_library(
         "//tensorflow/python:sparse_tensor",
         "//tensorflow/python:string_ops",
         "//tensorflow/python:summary",
+        "//tensorflow/python:training",
         "//tensorflow/python:weights_broadcast_ops",
         "//tensorflow/python/feature_column",
         "//tensorflow/python/ops/losses",
@@ -870,37 +870,65 @@ py_test(
 )
 
 py_library(
-    name = "warm_starting_util",
-    srcs = ["warm_starting_util.py"],
+    name = "replicate_model_fn",
+    srcs = [
+        "replicate_model_fn.py",
+    ],
     srcs_version = "PY2AND3",
     deps = [
+        ":export_output",
+        ":model_fn",
+        ":util",
+        "//tensorflow/core:protos_all_py",
         "//tensorflow/python:array_ops",
+        "//tensorflow/python:control_flow_ops",
+        "//tensorflow/python:device",
+        "//tensorflow/python:device_lib",
         "//tensorflow/python:framework_ops",
+        "//tensorflow/python:math_ops",
         "//tensorflow/python:platform",
+        "//tensorflow/python:sparse_ops",
+        "//tensorflow/python:sparse_tensor",
         "//tensorflow/python:state_ops",
         "//tensorflow/python:training",
         "//tensorflow/python:variable_scope",
-        "//tensorflow/python:variables",
-        "//tensorflow/python/feature_column",
+        "//tensorflow/python/ops/losses",
+        "@six_archive//:six",
     ],
 )
 
-py_test(
-    name = "warm_starting_util_test",
-    size = "small",
-    srcs = ["warm_starting_util_test.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        ":warm_starting_util",
+cuda_py_test(
+    name = "replicate_model_fn_test",
+    size = "medium",
+    srcs = ["replicate_model_fn_test.py"],
+    additional_deps = [
+        "//tensorflow/python/estimator",
+        ":dnn",
+        ":export_export",
+        ":export_output",
+        ":model_fn",
+        ":numpy_io",
+        ":optimizers",
+        ":prediction_keys",
+        "//tensorflow/python/feature_column",
+        "//tensorflow/python/ops/losses",
+        "//tensorflow/python/saved_model:signature_constants",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
-        "//tensorflow/python:dtypes",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:init_ops",
+        "//tensorflow/python:control_flow_ops",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:metrics",
+        "//tensorflow/python:platform",
+        "//tensorflow/python:summary",
         "//tensorflow/python:training",
         "//tensorflow/python:variable_scope",
         "//tensorflow/python:variables",
-        "//tensorflow/python/feature_column",
-        "//third_party/py/numpy",
+        ":replicate_model_fn",
+    ],
+    tags = [
+        "multi_gpu",
+        "noasan",  # flaky timeouts
     ],
 )
diff --git a/tensorflow/python/estimator/canned/baseline_test.py b/tensorflow/python/estimator/canned/baseline_test.py
index 18c955f5a0e998de983b31fc4cc595895e6bbcbd..7833df2052657114c9799417e1b9d96035b4c5ef 100644
--- a/tensorflow/python/estimator/canned/baseline_test.py
+++ b/tensorflow/python/estimator/canned/baseline_test.py
@@ -1071,11 +1071,13 @@ class BaselineClassifierEvaluationTest(test.TestCase):
           ops.GraphKeys.GLOBAL_STEP: 100,
           metric_keys.MetricKeys.LOSS_MEAN: 1.3133,
           metric_keys.MetricKeys.ACCURACY: 0.,
+          metric_keys.MetricKeys.PRECISION: 0.,
+          metric_keys.MetricKeys.RECALL: 0.,
           metric_keys.MetricKeys.PREDICTION_MEAN: 0.2689,
           metric_keys.MetricKeys.LABEL_MEAN: 1.,
           metric_keys.MetricKeys.ACCURACY_BASELINE: 1,
           metric_keys.MetricKeys.AUC: 0.,
-          metric_keys.MetricKeys.AUC_PR: 0.5,
+          metric_keys.MetricKeys.AUC_PR: 1.,
       }
     else:
       # Multi classes: loss = 1 * -log ( softmax(logits)[label] )
@@ -1132,11 +1134,13 @@ class BaselineClassifierEvaluationTest(test.TestCase):
           ops.GraphKeys.GLOBAL_STEP: 100,
           metric_keys.MetricKeys.LOSS_MEAN: expected_loss / 2,
           metric_keys.MetricKeys.ACCURACY: 0.5,
+          metric_keys.MetricKeys.PRECISION: 0.,
+          metric_keys.MetricKeys.RECALL: 0.,
           metric_keys.MetricKeys.PREDICTION_MEAN: 0.2689,
           metric_keys.MetricKeys.LABEL_MEAN: 0.5,
           metric_keys.MetricKeys.ACCURACY_BASELINE: 0.5,
           metric_keys.MetricKeys.AUC: 0.5,
-          metric_keys.MetricKeys.AUC_PR: 0.25,
+          metric_keys.MetricKeys.AUC_PR: 0.75,
       }
     else:
       # Expand logits since batch_size=2
@@ -1207,12 +1211,14 @@ class BaselineClassifierEvaluationTest(test.TestCase):
           ops.GraphKeys.GLOBAL_STEP: 100,
           metric_keys.MetricKeys.LOSS_MEAN: loss_mean,
           metric_keys.MetricKeys.ACCURACY: 2. / (1. + 2.),
+          metric_keys.MetricKeys.PRECISION: 0.,
+          metric_keys.MetricKeys.RECALL: 0.,
           metric_keys.MetricKeys.PREDICTION_MEAN: predictions_mean,
           metric_keys.MetricKeys.LABEL_MEAN: label_mean,
           metric_keys.MetricKeys.ACCURACY_BASELINE: (
               max(label_mean, 1-label_mean)),
           metric_keys.MetricKeys.AUC: 0.5,
-          metric_keys.MetricKeys.AUC_PR: 0.16666645,
+          metric_keys.MetricKeys.AUC_PR: 2. / (1. + 2.),
       }
     else:
       # Multi classes: unweighted_loss = 1 * -log ( soft_max(logits)[label] )
@@ -1542,4 +1548,3 @@ class BaselineLogitFnTest(test.TestCase):
 
 if __name__ == '__main__':
   test.main()
-
diff --git a/tensorflow/python/estimator/canned/dnn.py b/tensorflow/python/estimator/canned/dnn.py
index 7043da8de036e5be27d223271c37e065d9ffbcdd..6382622e0b5c72e5d3fcd9b9c6863968a425b86f 100644
--- a/tensorflow/python/estimator/canned/dnn.py
+++ b/tensorflow/python/estimator/canned/dnn.py
@@ -32,7 +32,6 @@ from tensorflow.python.ops import partitioned_variables
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.summary import summary
-from tensorflow.python.training import training_util
 from tensorflow.python.util.tf_export import tf_export
 
 # The default learning rate of 0.05 is a historical artifact of the initial
@@ -183,17 +182,11 @@ def _dnn_model_fn(features,
         input_layer_partitioner=input_layer_partitioner)
     logits = logit_fn(features=features, mode=mode)
 
-    def _train_op_fn(loss):
-      """Returns the op to optimize the loss."""
-      return optimizer.minimize(
-          loss,
-          global_step=training_util.get_global_step())
-
     return head.create_estimator_spec(
         features=features,
         mode=mode,
         labels=labels,
-        train_op_fn=_train_op_fn,
+        optimizer=optimizer,
         logits=logits)
 
 
diff --git a/tensorflow/python/estimator/canned/dnn_linear_combined_test.py b/tensorflow/python/estimator/canned/dnn_linear_combined_test.py
index 84675bf2a4a1655026bbba37c5d7a63d2f788c46..d275695eb319117cf94aefd7038ab5ee685e05a9 100644
--- a/tensorflow/python/estimator/canned/dnn_linear_combined_test.py
+++ b/tensorflow/python/estimator/canned/dnn_linear_combined_test.py
@@ -26,7 +26,7 @@ import six
 
 from tensorflow.core.example import example_pb2
 from tensorflow.core.example import feature_pb2
-from tensorflow.python.estimator import warm_starting_util
+from tensorflow.python.estimator import estimator
 from tensorflow.python.estimator.canned import dnn_linear_combined
 from tensorflow.python.estimator.canned import dnn_testing_utils
 from tensorflow.python.estimator.canned import linear_testing_utils
@@ -866,7 +866,7 @@ class DNNLinearCombinedWarmStartingTest(test.TestCase):
                 learning_rate=0.0),
             # The provided regular expression will only warm-start the deep
             # portion of the model.
-            warm_start_from=warm_starting_util.WarmStartSettings(
+            warm_start_from=estimator.WarmStartSettings(
                 ckpt_to_initialize_from=dnn_lc_classifier.model_dir,
                 vars_to_warm_start='.*(dnn).*')))
 
diff --git a/tensorflow/python/estimator/canned/dnn_testing_utils.py b/tensorflow/python/estimator/canned/dnn_testing_utils.py
index cbae43e4f7fef0271de20a4ec54449989455d4bd..44545c058c673d00f16c4276dc42cdbf4ca188e4 100644
--- a/tensorflow/python/estimator/canned/dnn_testing_utils.py
+++ b/tensorflow/python/estimator/canned/dnn_testing_utils.py
@@ -27,8 +27,8 @@ import six
 
 from tensorflow.core.framework import summary_pb2
 from tensorflow.python.client import session as tf_session
+from tensorflow.python.estimator import estimator
 from tensorflow.python.estimator import model_fn
-from tensorflow.python.estimator import warm_starting_util
 from tensorflow.python.estimator.canned import head as head_lib
 from tensorflow.python.estimator.canned import metric_keys
 from tensorflow.python.estimator.canned import prediction_keys
@@ -53,7 +53,7 @@ from tensorflow.python.summary.writer import writer_cache
 from tensorflow.python.training import checkpoint_utils
 from tensorflow.python.training import gradient_descent
 from tensorflow.python.training import monitored_session
-from tensorflow.python.training import optimizer
+from tensorflow.python.training import optimizer as optimizer_lib
 from tensorflow.python.training import saver
 from tensorflow.python.training import session_run_hook
 from tensorflow.python.training import training_util
@@ -134,7 +134,8 @@ def mock_head(testcase, hidden_units, logits_dimension, expected_logits):
       hidden_weights_names + hidden_biases_names +
       [LOGITS_WEIGHTS_NAME + '/part_0:0', LOGITS_BIASES_NAME + '/part_0:0'])
 
-  def _create_estimator_spec(features, mode, logits, labels, train_op_fn):
+  def _create_estimator_spec(
+      features, mode, logits, labels, train_op_fn=None, optimizer=None):
     del features, labels  # Not used.
     trainable_vars = ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
     testcase.assertItemsEqual(expected_var_names,
@@ -144,8 +145,12 @@ def mock_head(testcase, hidden_units, logits_dimension, expected_logits):
         expected_logits, logits, message='Failed for mode={}. '.format(mode))
     with ops.control_dependencies([assert_logits]):
       if mode == model_fn.ModeKeys.TRAIN:
+        if train_op_fn is not None:
+          train_op = train_op_fn(loss)
+        elif optimizer is not None:
+          train_op = optimizer.minimize(loss, global_step=None)
         return model_fn.EstimatorSpec(
-            mode=mode, loss=loss, train_op=train_op_fn(loss))
+            mode=mode, loss=loss, train_op=train_op)
       elif mode == model_fn.ModeKeys.EVAL:
         return model_fn.EstimatorSpec(mode=mode, loss=array_ops.identity(loss))
       elif mode == model_fn.ModeKeys.PREDICT:
@@ -203,8 +208,8 @@ def mock_optimizer(testcase, hidden_units, expected_loss=None):
       return control_flow_ops.no_op()
 
   optimizer_mock = test.mock.NonCallableMagicMock(
-      spec=optimizer.Optimizer,
-      wraps=optimizer.Optimizer(use_locking=False, name='my_optimizer'))
+      spec=optimizer_lib.Optimizer,
+      wraps=optimizer_lib.Optimizer(use_locking=False, name='my_optimizer'))
   optimizer_mock.minimize = test.mock.MagicMock(wraps=_minimize)
 
   return optimizer_mock
@@ -828,7 +833,7 @@ class BaseDNNWarmStartingTest(object):
         optimizer=gradient_descent.GradientDescentOptimizer(learning_rate=0.0),
         # The provided regular expression will only warm-start the city
         # embedding, not the kernels and biases of the hidden weights.
-        warm_start_from=warm_starting_util.WarmStartSettings(
+        warm_start_from=estimator.WarmStartSettings(
             ckpt_to_initialize_from=dnn_classifier.model_dir,
             vars_to_warm_start='.*(city).*'))
 
@@ -892,7 +897,7 @@ class BaseDNNWarmStartingTest(object):
         dimension=2)
     # We can create our VocabInfo object from the new and old occupation
     # FeatureColumn's.
-    occupation_vocab_info = warm_starting_util.VocabInfo(
+    occupation_vocab_info = estimator.VocabInfo(
         new_vocab=new_occupation.categorical_column.vocabulary_file,
         new_vocab_size=new_occupation.categorical_column.vocabulary_size,
         num_oov_buckets=new_occupation.categorical_column.num_oov_buckets,
@@ -907,7 +912,7 @@ class BaseDNNWarmStartingTest(object):
         feature_columns=[occupation],
         n_classes=4,
         optimizer=gradient_descent.GradientDescentOptimizer(learning_rate=0.0),
-        warm_start_from=warm_starting_util.WarmStartSettings(
+        warm_start_from=estimator.WarmStartSettings(
             ckpt_to_initialize_from=dnn_classifier.model_dir,
             var_name_to_vocab_info={
                 OCCUPATION_EMBEDDING_NAME: occupation_vocab_info
@@ -978,7 +983,7 @@ class BaseDNNWarmStartingTest(object):
         optimizer=gradient_descent.GradientDescentOptimizer(learning_rate=0.0),
         # The 'city' variable correspond to the 'locality' variable in the
         # previous model.
-        warm_start_from=warm_starting_util.WarmStartSettings(
+        warm_start_from=estimator.WarmStartSettings(
             ckpt_to_initialize_from=dnn_classifier.model_dir,
             var_name_to_prev_var_name={
                 CITY_EMBEDDING_NAME:
@@ -1035,13 +1040,16 @@ class BaseDNNClassifierEvaluateTest(object):
         metric_keys.MetricKeys.LOSS: expected_loss,
         metric_keys.MetricKeys.LOSS_MEAN: expected_loss / 2.,
         metric_keys.MetricKeys.ACCURACY: 0.5,
+        metric_keys.MetricKeys.PRECISION: 0.0,
+        metric_keys.MetricKeys.RECALL: 0.0,
         metric_keys.MetricKeys.PREDICTION_MEAN: 0.11105597,
         metric_keys.MetricKeys.LABEL_MEAN: 0.5,
         metric_keys.MetricKeys.ACCURACY_BASELINE: 0.5,
         # There is no good way to calculate AUC for only two data points. But
         # that is what the algorithm returns.
         metric_keys.MetricKeys.AUC: 0.5,
-        metric_keys.MetricKeys.AUC_PR: 0.25,
+        metric_keys.MetricKeys.AUC_PR: 0.75,
+
         ops.GraphKeys.GLOBAL_STEP: global_step
     }, dnn_classifier.evaluate(input_fn=_input_fn, steps=1))
 
diff --git a/tensorflow/python/estimator/canned/head.py b/tensorflow/python/estimator/canned/head.py
index 8d742a2c6147e86619d4c0aad59b69459384bd4d..c9635a9c27207cd5f1533188312df1c23ec9fd82 100644
--- a/tensorflow/python/estimator/canned/head.py
+++ b/tensorflow/python/estimator/canned/head.py
@@ -44,6 +44,7 @@ from tensorflow.python.ops import weights_broadcast_ops
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.saved_model import signature_constants
 from tensorflow.python.summary import summary
+from tensorflow.python.training import training_util
 
 _DEFAULT_SERVING_KEY = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
 
@@ -85,40 +86,39 @@ class _Head(object):
     ```python
     def _my_dnn_model_fn(features, labels, mode, params, config=None):
       # Optionally your callers can pass head to model_fn as a param.
-      head = tf.contrib.learn.regression_head(...)
-      input = tf.contrib.layers.input_from_feature_columns(features, ...)
-      last_hidden_layer_out = tf.contrib.layers.stack(
-          input, tf.contrib.layers.fully_connected, [1000, 500])
-      logits = tf.contrib.layers.fully_connected(
-          last_hidden_layer_out, head.logits_dimension, activation_fn=None)
-
-      def _train_op_fn(loss):
-        return optimizer.minimize(loss)
+      head = tf.contrib.estimator.regression_head(...)
+      inputs = tf.feature_column.input_layer(features, ...)
+      hidden_layer0 = tf.layers.dense(
+          inputs, units=1000, activation=tf.nn.relu)
+      hidden_layer1 = tf.layers.dense(
+          hidden_layer0, units=500, activation=tf.nn.relu)
+      logits = tf.layers.dense(
+          hidden_layer1, units=head.logits_dimension, activation=None)
 
       return head.create_estimator_spec(
           features=features,
           labels=labels,
           mode=mode,
           logits=logits,
-          train_op_fn=_train_op_fn)
+          optimizer=optimizer)
     ```
 
   There are cases where computing and applying gradients can not be meaningfully
-  captured with train_op_fn we support (for example, with sync optimizer). In
-  such case, you can take the responsibility on your own. Here is a common
-  use case,
+  captured with the optimizer or train_op_fn we support (for example, with a
+  sync optimizer). In such cases, you can take on that responsibility yourself.
+  Here is a common use case:
     ```python
     estimator_spec = head.create_estimator_spec(
         features=features,
         labels=labels,
         mode=mode,
         logits=logits,
-        train_op_fn=tf.contrib.learn.no_op_train_fn)
+        train_op_fn=lambda _: tf.no_op())
     if mode == model_fn.ModeKeys.TRAIN:
       optimizer = ...
       sync = tf.train.SyncReplicasOptimizer(opt=optimizer, ...)
-      update_op = tf.contrib.layers.optimize_loss(optimizer=sync,
-                                                  loss=estimator_spec.loss, ...)
+      update_op = sync.minimize(
+          estimator_spec.loss, global_step=tf.get_global_step())
       hooks = [sync.make_session_run_hook(is_chief)]
       ... update train_op and hooks in EstimatorSpec and return
     ```
@@ -172,10 +172,12 @@ class _Head(object):
     """
     raise NotImplementedError('Calling an abstract method.')
 
+  # TODO(b/65403806): By default, collect regularization_losses from
+  # GraphKeys.REGULARIZATION_LOSSES collection.
   @abc.abstractmethod
   def create_estimator_spec(
-      self, features, mode, logits, labels=None, train_op_fn=None,
-      regularization_losses=None):
+      self, features, mode, logits, labels=None, optimizer=None,
+      train_op_fn=None, regularization_losses=None):
     """Returns `EstimatorSpec` that a model_fn can return.
 
     Please note that,
@@ -186,10 +188,14 @@ class _Head(object):
       mode: Estimator's `ModeKeys`.
       logits: logits `Tensor` to be used by the head.
       labels: Labels `Tensor`, or `dict` of same.
+      optimizer: `Optimizer` instance to optimize the loss in TRAIN mode.
+        Namely, sets `train_op = optimizer.minimize(loss, global_step)`, which
+        updates variables and increments `global_step`.
       train_op_fn: Function that takes a scalar loss `Tensor` and returns an op
-        to optimize the model with the loss. This is used in TRAIN mode and
-        must not be None. None is allowed in other modes. If you want to
-        optimize loss yourself you can pass `no_op_train_fn` and then use
+        to optimize the model with the loss in TRAIN mode. Used if `optimizer`
+        is `None`. Exactly one of `train_op_fn` and `optimizer` must be set in
+        TRAIN mode. `None` is allowed in other modes. To optimize the loss
+        yourself, pass `lambda _: tf.no_op()` and then use
         EstimatorSpec.loss to compute and apply gradients.
       regularization_losses: A list of additional scalar losses to be added to
         the training loss, such as regularization losses.
@@ -694,8 +700,8 @@ class _MultiClassHeadWithSoftmaxCrossEntropyLoss(_Head):
         processed_labels=label_ids)
 
   def create_estimator_spec(
-      self, features, mode, logits, labels=None, train_op_fn=None,
-      regularization_losses=None):
+      self, features, mode, logits, labels=None, optimizer=None,
+      train_op_fn=None, regularization_losses=None):
     """Returns an `EstimatorSpec`.
 
     Args:
@@ -706,8 +712,11 @@ class _MultiClassHeadWithSoftmaxCrossEntropyLoss(_Head):
       labels: Labels integer or string `Tensor` with shape matching `logits`,
         namely `[D0, D1, ... DN, 1]` or `[D0, D1, ... DN]`. `labels` is
         required argument when `mode` equals `TRAIN` or `EVAL`.
+      optimizer: `Optimizer` instance to optimize the loss in TRAIN mode.
+        Namely, sets `train_op = optimizer.minimize(loss, global_step)`, which
+        updates variables and increments `global_step`.
       train_op_fn: Function that takes a scalar loss `Tensor` and returns
-        `train_op`. Required in TRAIN mode.
+        `train_op`. Used if `optimizer` is `None`.
       regularization_losses: A list of additional scalar losses to be added to
         the training loss, such as regularization losses. These losses are
         usually expressed as a batch average, so for best results users need to
@@ -717,7 +726,8 @@ class _MultiClassHeadWithSoftmaxCrossEntropyLoss(_Head):
     Returns:
       `EstimatorSpec`.
     Raises:
-      ValueError: If `train_op_fn` is `None` in TRAIN mode.
+      ValueError: If both `train_op_fn` and `optimizer` are `None` in TRAIN
+        mode, or if both are set.
     """
     with ops.name_scope(self._name, 'head'):
       logits = _check_logits_final_dim(logits, self.logits_dimension)
@@ -780,8 +790,16 @@ class _MultiClassHeadWithSoftmaxCrossEntropyLoss(_Head):
                 regularization_loss=regularization_loss))
 
       # Train.
-      if train_op_fn is None:
-        raise ValueError('train_op_fn cannot be None.')
+      if optimizer is not None:
+        if train_op_fn is not None:
+          raise ValueError('train_op_fn and optimizer cannot both be set.')
+        train_op = optimizer.minimize(
+            regularized_training_loss,
+            global_step=training_util.get_global_step())
+      elif train_op_fn is not None:
+        train_op = train_op_fn(regularized_training_loss)
+      else:
+        raise ValueError('train_op_fn and optimizer cannot both be None.')
       # Only summarize mean_loss for SUM reduction to preserve backwards
       # compatibility. Otherwise skip it to avoid unnecessary computation.
       if self._loss_reduction == losses.Reduction.SUM:
@@ -807,7 +825,7 @@ class _MultiClassHeadWithSoftmaxCrossEntropyLoss(_Head):
         mode=model_fn.ModeKeys.TRAIN,
         predictions=predictions,
         loss=regularized_training_loss,
-        train_op=train_op_fn(regularized_training_loss))
+        train_op=train_op)
 
 
 def _binary_logistic_head_with_sigmoid_cross_entropy_loss(
@@ -940,6 +958,18 @@ class _BinaryLogisticHeadWithSigmoidCrossEntropyLoss(_Head):
                   predictions=class_ids,
                   weights=weights,
                   name=keys.ACCURACY),
+          _summary_key(self._name, keys.PRECISION):
+              metrics_lib.precision(
+                  labels=labels,
+                  predictions=class_ids,
+                  weights=weights,
+                  name=keys.PRECISION),
+          _summary_key(self._name, keys.RECALL):
+              metrics_lib.recall(
+                  labels=labels,
+                  predictions=class_ids,
+                  weights=weights,
+                  name=keys.RECALL),
           _summary_key(self._name, keys.PREDICTION_MEAN):
               _predictions_mean(
                   predictions=logistic,
@@ -1027,8 +1057,8 @@ class _BinaryLogisticHeadWithSigmoidCrossEntropyLoss(_Head):
         processed_labels=labels)
 
   def create_estimator_spec(
-      self, features, mode, logits, labels=None, train_op_fn=None,
-      regularization_losses=None):
+      self, features, mode, logits, labels=None, optimizer=None,
+      train_op_fn=None, regularization_losses=None):
     """Returns an `EstimatorSpec`.
 
     Args:
@@ -1039,8 +1069,11 @@ class _BinaryLogisticHeadWithSigmoidCrossEntropyLoss(_Head):
       labels: Labels integer or string `Tensor` with shape matching `logits`,
         namely `[D0, D1, ... DN, 1]` or `[D0, D1, ... DN]`. `labels` is required
         argument when `mode` equals `TRAIN` or `EVAL`.
+      optimizer: `Optimizer` instance to optimize the loss in TRAIN mode.
+        Namely, sets `train_op = optimizer.minimize(loss, global_step)`, which
+        updates variables and increments `global_step`.
       train_op_fn: Function that takes a scalar loss `Tensor` and returns
-        `train_op`. Required in TRAIN mode.
+        `train_op`. Used if `optimizer` is `None`.
       regularization_losses: A list of additional scalar losses to be added to
         the training loss, such as regularization losses. These losses are
         usually expressed as a batch average, so for best results users need to
@@ -1050,7 +1083,8 @@ class _BinaryLogisticHeadWithSigmoidCrossEntropyLoss(_Head):
     Returns:
       `EstimatorSpec`.
     Raises:
-      ValueError: If `train_op_fn` is `None` in TRAIN mode.
+      ValueError: If both `train_op_fn` and `optimizer` are `None` in TRAIN
+        mode, or if both are set.
     """
     # Predict.
     with ops.name_scope(self._name, 'head'):
@@ -1122,8 +1156,16 @@ class _BinaryLogisticHeadWithSigmoidCrossEntropyLoss(_Head):
                 regularization_loss=regularization_loss))
 
       # Train.
-      if train_op_fn is None:
-        raise ValueError('train_op_fn can not be None.')
+      if optimizer is not None:
+        if train_op_fn is not None:
+          raise ValueError('train_op_fn and optimizer cannot both be set.')
+        train_op = optimizer.minimize(
+            regularized_training_loss,
+            global_step=training_util.get_global_step())
+      elif train_op_fn is not None:
+        train_op = train_op_fn(regularized_training_loss)
+      else:
+        raise ValueError('train_op_fn and optimizer cannot both be None.')
       # Only summarize mean_loss for SUM reduction to preserve backwards
       # compatibility. Otherwise skip it to avoid unnecessary computation.
       if self._loss_reduction == losses.Reduction.SUM:
@@ -1148,7 +1190,7 @@ class _BinaryLogisticHeadWithSigmoidCrossEntropyLoss(_Head):
         mode=model_fn.ModeKeys.TRAIN,
         predictions=predictions,
         loss=regularized_training_loss,
-        train_op=train_op_fn(regularized_training_loss))
+        train_op=train_op)
 
 
 def _regression_head_with_mean_squared_error_loss(
@@ -1277,8 +1319,8 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
         processed_labels=labels)
 
   def create_estimator_spec(
-      self, features, mode, logits, labels=None, train_op_fn=None,
-      regularization_losses=None):
+      self, features, mode, logits, labels=None, optimizer=None,
+      train_op_fn=None, regularization_losses=None):
     """Returns an `EstimatorSpec`.
 
     Args:
@@ -1290,8 +1332,11 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
         `[D0, D1, ... DN, logits_dimension]`. When `logits_dimension=1`, shape
         `[D0, D1, ... DN]` is also supported. `labels` is required argument when
         `mode` equals `TRAIN` or `EVAL`.
+      optimizer: `Optimizer` instance to optimize the loss in TRAIN mode.
+        Namely, sets `train_op = optimizer.minimize(loss, global_step)`, which
+        updates variables and increments `global_step`.
       train_op_fn: Function that takes a scalar loss `Tensor` and returns
-        `train_op`. Required in TRAIN mode.
+        `train_op`. Used if `optimizer` is `None`.
       regularization_losses: A list of additional scalar losses to be added to
         the training loss, such as regularization losses. These losses are
         usually expressed as a batch average, so for best results users need to
@@ -1301,7 +1346,8 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
     Returns:
       `EstimatorSpec`.
     Raises:
-      ValueError: If `train_op_fn` is `None` in TRAIN mode.
+      ValueError: If both `train_op_fn` and `optimizer` are `None` in TRAIN
+        mode, or if both are set.
     """
     # Predict.
     with ops.name_scope(self._name, 'head'):
@@ -1361,8 +1407,16 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
             eval_metric_ops=eval_metric_ops)
 
       # Train.
-      if train_op_fn is None:
-        raise ValueError('train_op_fn can not be None.')
+      if optimizer is not None:
+        if train_op_fn is not None:
+          raise ValueError('train_op_fn and optimizer cannot both be set.')
+        train_op = optimizer.minimize(
+            regularized_training_loss,
+            global_step=training_util.get_global_step())
+      elif train_op_fn is not None:
+        train_op = train_op_fn(regularized_training_loss)
+      else:
+        raise ValueError('train_op_fn and optimizer cannot both be None.')
       # Only summarize mean_loss for SUM reduction to preserve backwards
       # compatibility. Otherwise skip it to avoid unnecessary computation.
       if self._loss_reduction == losses.Reduction.SUM:
@@ -1387,7 +1441,7 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
         mode=model_fn.ModeKeys.TRAIN,
         predictions=predictions,
         loss=regularized_training_loss,
-        train_op=train_op_fn(regularized_training_loss))
+        train_op=train_op)
 
 
 def _assert_range(labels, n_classes, message=None):
diff --git a/tensorflow/python/estimator/canned/head_test.py b/tensorflow/python/estimator/canned/head_test.py
index a300f315c18f60e77f262a3b961c5ef6306bc235..fe6ee07529bc0314618a7cc85926dbb39660a352 100644
--- a/tensorflow/python/estimator/canned/head_test.py
+++ b/tensorflow/python/estimator/canned/head_test.py
@@ -300,7 +300,12 @@ class MultiClassHeadWithSoftmaxCrossEntropyLoss(test.TestCase):
     features = {'x': values_2x3}
 
     # Static shape.
-    with self.assertRaisesRegexp(ValueError, 'Dimensions must be equal'):
+    with self.assertRaisesRegexp(
+        ValueError,
+        r'Shape mismatch: The shape of labels \(received \(3,\)\) should equal '
+        r'the shape of logits except for the last dimension '
+        r'\(received \(2, 3\)\)\.'
+    ):
       head.create_loss(
           features=features,
           mode=model_fn.ModeKeys.EVAL,
@@ -837,6 +842,41 @@ class MultiClassHeadWithSoftmaxCrossEntropyLoss(test.TestCase):
           metric_keys.MetricKeys.LOSS_MEAN: expected_loss / 2,
       }, summary_str, tol)
 
+  def test_train_with_optimizer(self):
+    n_classes = 3
+    head = head_lib._multi_class_head_with_softmax_cross_entropy_loss(n_classes)
+
+    logits = np.array(((10, 0, 0), (0, 10, 0),), dtype=np.float32)
+    labels = np.array(((1,), (1,)), dtype=np.int64)
+    features = {'x': np.array(((42,),), dtype=np.int32)}
+    expected_train_result = 'my_train_op'
+
+    class _Optimizer(object):
+
+      def minimize(self, loss, global_step):
+        del global_step
+        return string_ops.string_join(
+            [constant_op.constant(expected_train_result),
+             string_ops.as_string(loss, precision=2)])
+
+    # loss = sum(cross_entropy(labels, logits)) = sum(10, 0) = 10.
+    expected_loss = 10.
+    spec = head.create_estimator_spec(
+        features=features,
+        mode=model_fn.ModeKeys.TRAIN,
+        logits=logits,
+        labels=labels,
+        optimizer=_Optimizer())
+
+    tol = 1e-2
+    with self.test_session() as sess:
+      _initialize_variables(self, spec.scaffold)
+      loss, train_result = sess.run((spec.loss, spec.train_op))
+      self.assertAllClose(expected_loss, loss, rtol=tol, atol=tol)
+      self.assertEqual(
+          six.b('{0:s}{1:.2f}'.format(expected_train_result, expected_loss)),
+          train_result)
+
   def test_train_summaries_with_head_name(self):
     n_classes = 3
     head = head_lib._multi_class_head_with_softmax_cross_entropy_loss(
@@ -1554,11 +1594,13 @@ class BinaryLogisticHeadWithSigmoidCrossEntropyLossTest(test.TestCase):
         # loss_mean = loss/2 = 41./2 = 20.5
         keys.LOSS_MEAN: 20.5,
         keys.ACCURACY: 1./2,
+        keys.PRECISION: 1.,
+        keys.RECALL: 1./2,
         keys.PREDICTION_MEAN: 1./2,
         keys.LABEL_MEAN: 2./2,
         keys.ACCURACY_BASELINE: 2./2,
         keys.AUC: 0.,
-        keys.AUC_PR: 0.74999905,
+        keys.AUC_PR: 1.,
     }
 
     # Assert spec contains expected tensors.
@@ -1597,11 +1639,13 @@ class BinaryLogisticHeadWithSigmoidCrossEntropyLossTest(test.TestCase):
     expected_metric_keys = [
         '{}/some_binary_head'.format(metric_keys.MetricKeys.LOSS_MEAN),
         '{}/some_binary_head'.format(metric_keys.MetricKeys.ACCURACY),
+        '{}/some_binary_head'.format(metric_keys.MetricKeys.PRECISION),
+        '{}/some_binary_head'.format(metric_keys.MetricKeys.RECALL),
         '{}/some_binary_head'.format(metric_keys.MetricKeys.PREDICTION_MEAN),
         '{}/some_binary_head'.format(metric_keys.MetricKeys.LABEL_MEAN),
         '{}/some_binary_head'.format(metric_keys.MetricKeys.ACCURACY_BASELINE),
         '{}/some_binary_head'.format(metric_keys.MetricKeys.AUC),
-        '{}/some_binary_head'.format(metric_keys.MetricKeys.AUC_PR)
+        '{}/some_binary_head'.format(metric_keys.MetricKeys.AUC_PR),
     ]
     self.assertItemsEqual(expected_metric_keys, spec.eval_metric_ops.keys())
 
@@ -1632,11 +1676,13 @@ class BinaryLogisticHeadWithSigmoidCrossEntropyLossTest(test.TestCase):
         keys.LOSS_MEAN: expected_unregularized_loss,
         keys.LOSS_REGULARIZATION: expected_regularization_loss,
         keys.ACCURACY: 1./2,
+        keys.PRECISION: 1.,
+        keys.RECALL: 1./2,
         keys.PREDICTION_MEAN: 1./2,
         keys.LABEL_MEAN: 2./2,
         keys.ACCURACY_BASELINE: 2./2,
         keys.AUC: 0.,
-        keys.AUC_PR: 0.75,
+        keys.AUC_PR: 1.,
     }
 
     # Assert predictions, loss, and metrics.
@@ -1737,11 +1783,13 @@ class BinaryLogisticHeadWithSigmoidCrossEntropyLossTest(test.TestCase):
     expected_metrics = {
         keys.LOSS_MEAN: 1.62652338 / 2.,
         keys.ACCURACY: 1./2,
+        keys.PRECISION: 1.,
+        keys.RECALL: .5,
         keys.PREDICTION_MEAN: 1./2,
         keys.LABEL_MEAN: 2./2,
         keys.ACCURACY_BASELINE: 2./2,
         keys.AUC: 0.,
-        keys.AUC_PR: 0.74999905,
+        keys.AUC_PR: 1.,
         keys.ACCURACY_AT_THRESHOLD % thresholds[0]: 1.,
         keys.PRECISION_AT_THRESHOLD % thresholds[0]: 1.,
         keys.RECALL_AT_THRESHOLD % thresholds[0]: 1.,
@@ -1929,6 +1977,39 @@ class BinaryLogisticHeadWithSigmoidCrossEntropyLossTest(test.TestCase):
           metric_keys.MetricKeys.LOSS_MEAN: 20.5,
       }, summary_str)
 
+  def test_train_with_optimizer(self):
+    head = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss()
+
+    logits = np.array(((45,), (-41,),), dtype=np.float32)
+    labels = np.array(((1,), (1,),), dtype=np.float64)
+    expected_train_result = b'my_train_op'
+    features = {'x': np.array(((42,),), dtype=np.float32)}
+    # loss = sum(cross_entropy(labels, logits)) = sum(0, 41) = 41
+    expected_loss = 41.
+
+    class _Optimizer(object):
+
+      def minimize(self, loss, global_step):
+        del global_step
+        with ops.control_dependencies((check_ops.assert_equal(
+            math_ops.to_float(expected_loss), math_ops.to_float(loss),
+            name='assert_loss'),)):
+          return constant_op.constant(expected_train_result)
+
+    # Create estimator spec.
+    spec = head.create_estimator_spec(
+        features=features,
+        mode=model_fn.ModeKeys.TRAIN,
+        logits=logits,
+        labels=labels,
+        optimizer=_Optimizer())
+
+    with self.test_session() as sess:
+      _initialize_variables(self, spec.scaffold)
+      loss, train_result = sess.run((spec.loss, spec.train_op))
+      self.assertAllClose(expected_loss, loss)
+      self.assertEqual(expected_train_result, train_result)
+
   def test_train_summaries_with_head_name(self):
     head = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(
         name='some_binary_head')
@@ -2182,13 +2263,15 @@ class BinaryLogisticHeadWithSigmoidCrossEntropyLossTest(test.TestCase):
         keys.LOSS_MEAN: 26.9615384615,
         # accuracy = (1*1 + .1*0 + 1.5*0)/(1 + .1 + 1.5) = 1/2.6 = .38461538461
         keys.ACCURACY: .38461538461,
+        keys.PRECISION: 1./2.5,
+        keys.RECALL: 1./1.1,
         # prediction_mean = (1*1 + .1*0 + 1.5*1)/(1 + .1 + 1.5) = 2.5/2.6
         #                 = .96153846153
         keys.PREDICTION_MEAN: .96153846153,
         keys.LABEL_MEAN: expected_label_mean,
         keys.ACCURACY_BASELINE: 1 - expected_label_mean,
         keys.AUC: .45454565,
-        keys.AUC_PR: .21923049,
+        keys.AUC_PR: .6737757325172424,
     }
 
     # Assert spec contains expected tensors.
@@ -2481,13 +2564,15 @@ class BinaryLogisticHeadWithSigmoidCrossEntropyLossTest(test.TestCase):
     expected_metrics = {
         keys.LOSS_MEAN: expected_loss / np.sum(weights),
         keys.ACCURACY: (1.*0. + 1.5*1. + 2.*1. + 2.5*0.) / np.sum(weights),
+        keys.PRECISION: 2.0/3.0,
+        keys.RECALL: 2.0/4.5,
         keys.PREDICTION_MEAN: (1.*1 + 1.5*0 + 2.*1 + 2.5*0) / np.sum(weights),
         keys.LABEL_MEAN: (1.*0 + 1.5*0 + 2.*1 + 2.5*1) / np.sum(weights),
         keys.ACCURACY_BASELINE: (1.*0 + 1.5*0 + 2.*1 + 2.5*1) / np.sum(weights),
         # We cannot reliably calculate AUC with only 4 data points, but the
         # values should not change because of backwards-compatibility.
         keys.AUC: 0.5222,
-        keys.AUC_PR: 0.5119,
+        keys.AUC_PR: 0.7341,
     }
 
     tol = 1e-2
@@ -3059,6 +3144,40 @@ class RegressionHeadWithMeanSquaredErrorLossTest(test.TestCase):
           metric_keys.MetricKeys.LOSS_MEAN: 6.5,
       }, summary_str)
 
+  def test_train_with_optimizer(self):
+    head = head_lib._regression_head_with_mean_squared_error_loss()
+    self.assertEqual(1, head.logits_dimension)
+
+    # Create estimator spec.
+    logits = np.array(((45,), (41,),), dtype=np.float32)
+    labels = np.array(((43.,), (44.,),), dtype=np.float64)
+    expected_train_result = b'my_train_op'
+    features = {'x': np.array(((42.,),), dtype=np.float32)}
+    # loss = (43-45)^2 + (44-41)^2 = 4 + 9 = 13
+    expected_loss = 13
+
+    class _Optimizer(object):
+
+      def minimize(self, loss, global_step):
+        del global_step
+        with ops.control_dependencies((check_ops.assert_equal(
+            math_ops.to_float(expected_loss), math_ops.to_float(loss),
+            name='assert_loss'),)):
+          return constant_op.constant(expected_train_result)
+
+    spec = head.create_estimator_spec(
+        features=features,
+        mode=model_fn.ModeKeys.TRAIN,
+        logits=logits,
+        labels=labels,
+        optimizer=_Optimizer())
+
+    with self.test_session() as sess:
+      _initialize_variables(self, spec.scaffold)
+      loss, train_result = sess.run((spec.loss, spec.train_op))
+      self.assertAllClose(expected_loss, loss)
+      self.assertEqual(expected_train_result, train_result)
+
   def test_train_summaries_with_head_name(self):
     head = head_lib._regression_head_with_mean_squared_error_loss(
         name='some_regression_head')
diff --git a/tensorflow/python/estimator/canned/linear.py b/tensorflow/python/estimator/canned/linear.py
index a2f24ef27044680fe93b176b5207593165d0d109..e7ec4179917a88703444f8aa835ed0359ff58a46 100644
--- a/tensorflow/python/estimator/canned/linear.py
+++ b/tensorflow/python/estimator/canned/linear.py
@@ -33,7 +33,6 @@ from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.summary import summary
 from tensorflow.python.training import ftrl
-from tensorflow.python.training import training_util
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -157,17 +156,11 @@ def _linear_model_fn(features, labels, mode, head, feature_columns, optimizer,
         units=head.logits_dimension, feature_columns=feature_columns)
     logits = logit_fn(features=features)
 
-    def _train_op_fn(loss):
-      """Returns the op to optimize the loss."""
-      return optimizer.minimize(
-          loss,
-          global_step=training_util.get_global_step())
-
     return head.create_estimator_spec(
         features=features,
         mode=mode,
         labels=labels,
-        train_op_fn=_train_op_fn,
+        optimizer=optimizer,
         logits=logits)
 
 
diff --git a/tensorflow/python/estimator/canned/linear_testing_utils.py b/tensorflow/python/estimator/canned/linear_testing_utils.py
index e88fcbbd2e0e3617dde428662e58b1d86c4eddd0..da3ce86999b32e081eb8f12bbd9f7a4599ed4eaa 100644
--- a/tensorflow/python/estimator/canned/linear_testing_utils.py
+++ b/tensorflow/python/estimator/canned/linear_testing_utils.py
@@ -31,7 +31,6 @@ from tensorflow.core.example import feature_pb2
 from tensorflow.python.client import session as tf_session
 from tensorflow.python.estimator import estimator
 from tensorflow.python.estimator import run_config
-from tensorflow.python.estimator import warm_starting_util
 from tensorflow.python.estimator.canned import linear
 from tensorflow.python.estimator.canned import metric_keys
 from tensorflow.python.estimator.export import export
@@ -1338,11 +1337,13 @@ class BaseLinearClassifierEvaluationTest(object):
           ops.GraphKeys.GLOBAL_STEP: 100,
           metric_keys.MetricKeys.LOSS_MEAN: 41.,
           metric_keys.MetricKeys.ACCURACY: 0.,
+          metric_keys.MetricKeys.PRECISION: 0.,
+          metric_keys.MetricKeys.RECALL: 0.,
           metric_keys.MetricKeys.PREDICTION_MEAN: 0.,
           metric_keys.MetricKeys.LABEL_MEAN: 1.,
           metric_keys.MetricKeys.ACCURACY_BASELINE: 1,
           metric_keys.MetricKeys.AUC: 0.,
-          metric_keys.MetricKeys.AUC_PR: 0.5,
+          metric_keys.MetricKeys.AUC_PR: 1.,
       }
     else:
       # Multi classes: loss = 1 * -log ( soft_max(logits)[label] )
@@ -1407,6 +1408,8 @@ class BaseLinearClassifierEvaluationTest(object):
           ops.GraphKeys.GLOBAL_STEP: 100,
           metric_keys.MetricKeys.LOSS_MEAN: expected_loss / 2,
           metric_keys.MetricKeys.ACCURACY: 0.,
+          metric_keys.MetricKeys.PRECISION: 0.,
+          metric_keys.MetricKeys.RECALL: 0.,
           metric_keys.MetricKeys.PREDICTION_MEAN: 0.5,
           metric_keys.MetricKeys.LABEL_MEAN: 0.5,
           metric_keys.MetricKeys.ACCURACY_BASELINE: 0.5,
@@ -1488,6 +1491,8 @@ class BaseLinearClassifierEvaluationTest(object):
           ops.GraphKeys.GLOBAL_STEP: 100,
           metric_keys.MetricKeys.LOSS_MEAN: loss_mean,
           metric_keys.MetricKeys.ACCURACY: 0.,
+          metric_keys.MetricKeys.PRECISION: 0.,
+          metric_keys.MetricKeys.RECALL: 0.,
           metric_keys.MetricKeys.PREDICTION_MEAN: predictions_mean,
           metric_keys.MetricKeys.LABEL_MEAN: label_mean,
           metric_keys.MetricKeys.ACCURACY_BASELINE: (
@@ -1968,7 +1973,7 @@ class BaseLinearWarmStartingTest(object):
         optimizer=gradient_descent.GradientDescentOptimizer(learning_rate=0.0),
         # The provided regular expression will only warm-start the age variable
         # and not the bias.
-        warm_start_from=warm_starting_util.WarmStartSettings(
+        warm_start_from=estimator.WarmStartSettings(
             ckpt_to_initialize_from=linear_classifier.model_dir,
             vars_to_warm_start='.*(age).*'))
 
@@ -2016,7 +2021,7 @@ class BaseLinearWarmStartingTest(object):
         vocabulary_size=len(new_vocab_list))
     # We can create our VocabInfo object from the new and old occupation
     # FeatureColumn's.
-    occupation_vocab_info = warm_starting_util.VocabInfo(
+    occupation_vocab_info = estimator.VocabInfo(
         new_vocab=new_occupation.vocabulary_file,
         new_vocab_size=new_occupation.vocabulary_size,
         num_oov_buckets=new_occupation.num_oov_buckets,
@@ -2030,7 +2035,7 @@ class BaseLinearWarmStartingTest(object):
         feature_columns=[occupation],
         n_classes=4,
         optimizer=gradient_descent.GradientDescentOptimizer(learning_rate=0.0),
-        warm_start_from=warm_starting_util.WarmStartSettings(
+        warm_start_from=estimator.WarmStartSettings(
             ckpt_to_initialize_from=linear_classifier.model_dir,
             var_name_to_vocab_info={
                 OCCUPATION_WEIGHT_NAME: occupation_vocab_info
@@ -2082,7 +2087,7 @@ class BaseLinearWarmStartingTest(object):
         optimizer=gradient_descent.GradientDescentOptimizer(learning_rate=0.0),
         # The 'age' variable correspond to the 'age_in_years' variable in the
         # previous model.
-        warm_start_from=warm_starting_util.WarmStartSettings(
+        warm_start_from=estimator.WarmStartSettings(
             ckpt_to_initialize_from=linear_classifier.model_dir,
             var_name_to_prev_var_name={
                 AGE_WEIGHT_NAME: AGE_WEIGHT_NAME.replace('age', 'age_in_years')
diff --git a/tensorflow/python/estimator/canned/metric_keys.py b/tensorflow/python/estimator/canned/metric_keys.py
index 44eb680939203fea67e3391326a6f1013f022ad5..f374d3154982e3b7cdc637e9e3606b3a2947cbf3 100644
--- a/tensorflow/python/estimator/canned/metric_keys.py
+++ b/tensorflow/python/estimator/canned/metric_keys.py
@@ -28,6 +28,8 @@ class MetricKeys(object):
   LOSS_REGULARIZATION = 'regularization_loss'
 
   ACCURACY = 'accuracy'
+  PRECISION = 'precision'
+  RECALL = 'recall'
   # This is the best the model could do by always predicting one class.
   # Should be < ACCURACY in a trained model.
   ACCURACY_BASELINE = 'accuracy_baseline'
diff --git a/tensorflow/python/estimator/estimator.py b/tensorflow/python/estimator/estimator.py
index 1167b3834eb6a79abf670f629ec2cbc37957d191..6a4132bca2cb9f14984b39462d00cf68e4e4ae62 100644
--- a/tensorflow/python/estimator/estimator.py
+++ b/tensorflow/python/estimator/estimator.py
@@ -19,6 +19,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import collections
 import copy
 import os
 import tempfile
@@ -35,7 +36,6 @@ from tensorflow.python.eager import context
 from tensorflow.python.estimator import model_fn as model_fn_lib
 from tensorflow.python.estimator import run_config
 from tensorflow.python.estimator import util
-from tensorflow.python.estimator import warm_starting_util
 from tensorflow.python.estimator.export.export import build_all_signature_defs
 from tensorflow.python.estimator.export.export import get_temp_export_dir
 from tensorflow.python.estimator.export.export import get_timestamped_export_dir
@@ -49,11 +49,13 @@ from tensorflow.python.saved_model import builder as saved_model_builder
 from tensorflow.python.saved_model import tag_constants
 from tensorflow.python.summary import summary
 from tensorflow.python.summary.writer import writer_cache
+from tensorflow.python.training import device_setter
 from tensorflow.python.training import evaluation
 from tensorflow.python.training import monitored_session
 from tensorflow.python.training import saver
 from tensorflow.python.training import training
 from tensorflow.python.training import training_util
+from tensorflow.python.training import warm_starting_util
 from tensorflow.python.util import compat
 from tensorflow.python.util import compat_internal
 from tensorflow.python.util import nest
@@ -137,8 +139,8 @@ class Estimator(object):
                  to configure Estimators from hyper parameter tuning.
           * `config`: Optional configuration object. Will receive what is passed
                  to Estimator in `config` parameter, or the default `config`.
-                 Allows updating things in your model_fn based on configuration
-                 such as `num_ps_replicas`, or `model_dir`.
+                 Allows updating things in your `model_fn` based on
+                 configuration such as `num_ps_replicas`, or `model_dir`.
 
         * Returns:
           `EstimatorSpec`
@@ -165,7 +167,7 @@ class Estimator(object):
       ValueError: if this is called via a subclass and if that class overrides
         a member of `Estimator`.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError(
           'Estimators are not supported when eager execution is enabled.')
 
@@ -216,8 +218,8 @@ class Estimator(object):
     self._params = copy.deepcopy(params or {})
 
     # pylint: disable=protected-access
-    self._warm_start_settings = (
-        warm_starting_util._get_default_warm_start_settings(warm_start_from))
+    self._warm_start_settings = _get_default_warm_start_settings(
+        warm_start_from)
     # pylint: enable=protected-access
 
   @property
@@ -299,11 +301,11 @@ class Estimator(object):
 
           * A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
             tuple (features, labels) with same constraints as below.
-          * A tuple (features, labels): Where features is a `Tensor` or a
-            dictionary of string feature name to `Tensor` and labels is a
+          * A tuple (features, labels): Where `features` is a `Tensor` or a
+            dictionary of string feature name to `Tensor` and `labels` is a
             `Tensor` or a dictionary of string label name to `Tensor`. Both
-            features and labels are consumed by `model_fn`. They should satisfy
-            the expectation of `model_fn` from inputs.
+            `features` and `labels` are consumed by `model_fn`. They should
+            satisfy the expectation of `model_fn` from inputs.
 
       hooks: List of `SessionRunHook` subclass instances. Used for callbacks
         inside the training loop.
@@ -379,11 +381,11 @@ class Estimator(object):
 
           * A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
             tuple (features, labels) with same constraints as below.
-          * A tuple (features, labels): Where features is a `Tensor` or a
-            dictionary of string feature name to `Tensor` and labels is a
+          * A tuple (features, labels): Where `features` is a `Tensor` or a
+            dictionary of string feature name to `Tensor` and `labels` is a
             `Tensor` or a dictionary of string label name to `Tensor`. Both
-            features and labels are consumed by `model_fn`. They should satisfy
-            the expectation of `model_fn` from inputs.
+            `features` and `labels` are consumed by `model_fn`. They should
+            satisfy the expectation of `model_fn` from inputs.
 
       steps: Number of steps for which to evaluate model. If `None`, evaluates
         until `input_fn` raises an end-of-input exception.
@@ -455,17 +457,17 @@ class Estimator(object):
       checkpoint_path: Path of a specific checkpoint to predict. If `None`, the
         latest checkpoint in `model_dir` is used.
       yield_single_examples: If False, yield the whole batch as returned by the
-        model_fn instead of decomposing the batch into individual elements. This
-        is useful if model_fn return some tensor with first dimension not
-        equal to the batch size
+        `model_fn` instead of decomposing the batch into individual elements.
+        This is useful if `model_fn` returns some tensors whose first dimension
+        is not equal to the batch size.
 
     Yields:
       Evaluated values of `predictions` tensors.
 
     Raises:
-      ValueError: Could not find a trained model in model_dir.
-      ValueError: if batch length of predictions are not same and
-        yield_single_examples is True.
+      ValueError: Could not find a trained model in `model_dir`.
+      ValueError: If batch length of predictions is not the same and
+        `yield_single_examples` is True.
       ValueError: If there is a conflict between `predict_keys` and
         `predictions`. For example if `predict_keys` is not `None` but
         `EstimatorSpec.predictions` is not a `dict`.
@@ -515,7 +517,7 @@ class Estimator(object):
     allowed_overrides = set([
         '_call_input_fn', '_create_global_step',
         '_convert_train_steps_to_hooks', '_convert_eval_steps_to_hooks',
-        '_tf_api_names'
+        '_tf_api_names', '_validate_features_in_predict_input'
     ])
     estimator_members = set([m for m in Estimator.__dict__.keys()
                              if not m.startswith('__')])
@@ -570,7 +572,7 @@ class Estimator(object):
       export_dir_base: A string containing a directory in which to create
         timestamped subdirectories containing exported SavedModels.
       serving_input_receiver_fn: A function that takes no argument and
-        returns a `ServingInputReceiver`.
+        returns a `ServingInputReceiver` or `TensorServingInputReceiver`.
       assets_extra: A dict specifying how to populate the assets.extra directory
         within the exported SavedModel, or `None` if no extra assets are needed.
       as_text: whether to write the SavedModel proto in text format.
@@ -668,11 +670,14 @@ class Estimator(object):
       # Unconditionally drop the label (the second element of result).
       result = result[0]
 
+    self._validate_features_in_predict_input(result)
+    return result, input_hooks
+
+  def _validate_features_in_predict_input(self, result):
     if not _has_dataset_or_queue_runner(result):
       logging.warning('Input graph does not use tf.data.Dataset or contain a '
                       'QueueRunner. That means predict yields forever. '
                       'This is probably a mistake.')
-    return result, input_hooks
 
   def _get_features_and_labels_from_input_fn(self, input_fn, mode):
     """Extracts the `features` and labels from return values of `input_fn`."""
@@ -720,7 +725,7 @@ class Estimator(object):
     """Creates the global step tensor in graph.
 
     The global step tensor must be an integer type with name 'global_step' and
-    be added to the collection ${tf.GraphKeys.GLOBAL_STEP}.
+    be added to the collection @{tf.GraphKeys.GLOBAL_STEP}.
 
     Args:
       graph: The graph in which to create the global step tensor.
@@ -826,7 +831,7 @@ class Estimator(object):
         logging.info('Warm-starting with WarmStartSettings: %s' %
                      (self._warm_start_settings,))
         # pylint: disable=protected-access
-        warm_starting_util._warm_start(self._warm_start_settings)
+        warm_starting_util.warm_start(*self._warm_start_settings)
         # pylint: enable=protected-access
       # Check if the user created a loss summary, and add one if they didn't.
       # We assume here that the summary is called 'loss'. If it is not, we will
@@ -844,7 +849,7 @@ class Estimator(object):
                   'loss': estimator_spec.loss,
                   'step': global_step_tensor
               },
-              every_n_iter=100)
+              every_n_iter=self._config.log_step_count_steps)
       ])
       worker_hooks.extend(estimator_spec.training_hooks)
 
@@ -1007,13 +1012,6 @@ def _get_replica_device_setter(config):
   Returns:
     A replica device setter, or None.
   """
-  ps_ops = [
-      'Variable', 'VariableV2', 'AutoReloadVariable', 'MutableHashTable',
-      'MutableHashTableV2', 'MutableHashTableOfTensors',
-      'MutableHashTableOfTensorsV2', 'MutableDenseHashTable',
-      'MutableDenseHashTableV2', 'VarHandleOp'
-  ]
-
   if config.task_type:
     worker_device = '/job:%s/task:%d' % (config.task_type, config.task_id)
   else:
@@ -1024,7 +1022,7 @@ def _get_replica_device_setter(config):
         ps_tasks=config.num_ps_replicas,
         worker_device=worker_device,
         merge_devices=True,
-        ps_ops=ps_ops,
+        ps_ops=list(device_setter.STANDARD_PS_OPS),
         cluster=config.cluster_spec)
   else:
     return None
@@ -1118,7 +1116,7 @@ def _write_dict_to_summary(output_dir,
       try:
         summ = summary_pb2.Summary.FromString(dictionary[key])
         for i, _ in enumerate(summ.value):
-          summ.value[i].tag = key
+          summ.value[i].tag = '%s/%d' % (key, i)
         summary_proto.value.extend(summ.value)
       except message.DecodeError:
         logging.warn('Skipping summary for %s, cannot parse string to Summary.',
@@ -1155,3 +1153,187 @@ class _DatasetInitializerHook(training.SessionRunHook):
   def after_create_session(self, session, coord):
     del coord
     session.run(self._initializer)
+
+VocabInfo = warm_starting_util.VocabInfo  # pylint: disable=invalid-name
+
+
+@tf_export('estimator.WarmStartSettings')
+class WarmStartSettings(
+    collections.namedtuple('WarmStartSettings', [
+        'ckpt_to_initialize_from',
+        'vars_to_warm_start',
+        'var_name_to_vocab_info',
+        'var_name_to_prev_var_name',
+    ])):
+  """Settings for warm-starting in Estimators.
+
+  Example use with the canned `DNNClassifier`:
+
+  ```
+  emb_vocab_file = tf.feature_column.embedding_column(
+      tf.feature_column.categorical_column_with_vocabulary_file(
+          "sc_vocab_file", "new_vocab.txt", vocabulary_size=100),
+      dimension=8)
+  emb_vocab_list = tf.feature_column.embedding_column(
+      tf.feature_column.categorical_column_with_vocabulary_list(
+          "sc_vocab_list", vocabulary_list=["a", "b"]),
+      dimension=8)
+  estimator = tf.estimator.DNNClassifier(
+    hidden_units=[128, 64], feature_columns=[emb_vocab_file, emb_vocab_list],
+    warm_start_from=ws)
+  ```
+
+  where `ws` could be defined as:
+
+  Warm-start all weights in the model (input layer and hidden weights).
+  Either the directory or a specific checkpoint can be provided (in the case
+  of the former, the latest checkpoint will be used):
+
+  ```
+  ws = WarmStartSettings(ckpt_to_initialize_from="/tmp")
+  ws = WarmStartSettings(ckpt_to_initialize_from="/tmp/model-1000")
+  ```
+
+  Warm-start only the embeddings (input layer):
+
+  ```
+  ws = WarmStartSettings(ckpt_to_initialize_from="/tmp",
+                         vars_to_warm_start=".*input_layer.*")
+  ```
+
+  Warm-start all weights, where the embedding parameters corresponding to
+  `sc_vocab_file` have a different vocab from the one used in the current
+  model:
+
+  ```
+  vocab_info = tf.estimator.VocabInfo(
+      new_vocab=sc_vocab_file.vocabulary_file,
+      new_vocab_size=sc_vocab_file.vocabulary_size,
+      num_oov_buckets=sc_vocab_file.num_oov_buckets,
+      old_vocab="old_vocab.txt"
+  )
+  ws = WarmStartSettings(
+      ckpt_to_initialize_from="/tmp",
+      var_name_to_vocab_info={
+          "input_layer/sc_vocab_file_embedding/embedding_weights": vocab_info
+      })
+  ```
+
+  Warm-start only `sc_vocab_file` embeddings (and no other variables), which
+  have a different vocab from the one used in the current model:
+
+  ```
+  vocab_info = tf.estimator.VocabInfo(
+      new_vocab=sc_vocab_file.vocabulary_file,
+      new_vocab_size=sc_vocab_file.vocabulary_size,
+      num_oov_buckets=sc_vocab_file.num_oov_buckets,
+      old_vocab="old_vocab.txt"
+  )
+  ws = WarmStartSettings(
+      ckpt_to_initialize_from="/tmp",
+      vars_to_warm_start=None,
+      var_name_to_vocab_info={
+          "input_layer/sc_vocab_file_embedding/embedding_weights": vocab_info
+      })
+  ```
+
+  Warm-start all weights, where the parameters corresponding to `sc_vocab_file`
+  have a different vocab from the one used in the current checkpoint, and only
+  100 of those entries were used:
+
+  ```
+  vocab_info = tf.estimator.VocabInfo(
+      new_vocab=sc_vocab_file.vocabulary_file,
+      new_vocab_size=sc_vocab_file.vocabulary_size,
+      num_oov_buckets=sc_vocab_file.num_oov_buckets,
+      old_vocab="old_vocab.txt",
+      old_vocab_size=100
+  )
+  ws = WarmStartSettings(
+      ckpt_to_initialize_from="/tmp",
+      var_name_to_vocab_info={
+          "input_layer/sc_vocab_file_embedding/embedding_weights": vocab_info
+      })
+  ```
+
+  Warm-start all weights, where the parameters corresponding to `sc_vocab_file`
+  have a different vocab from the one used in the current checkpoint, and the
+  parameters corresponding to `sc_vocab_list` have a different name from the
+  current checkpoint:
+
+  ```
+  vocab_info = tf.estimator.VocabInfo(
+      new_vocab=sc_vocab_file.vocabulary_file,
+      new_vocab_size=sc_vocab_file.vocabulary_size,
+      num_oov_buckets=sc_vocab_file.num_oov_buckets,
+      old_vocab="old_vocab.txt",
+      old_vocab_size=100
+  )
+  ws = WarmStartSettings(
+      ckpt_to_initialize_from="/tmp",
+      var_name_to_vocab_info={
+          "input_layer/sc_vocab_file_embedding/embedding_weights": vocab_info
+      },
+      var_name_to_prev_var_name={
+          "input_layer/sc_vocab_list_embedding/embedding_weights":
+              "old_tensor_name"
+      })
+  ```
+
+  Attributes:
+    ckpt_to_initialize_from: [Required] A string specifying the directory with
+      checkpoint file(s) or path to checkpoint from which to warm-start the
+      model parameters.
+    vars_to_warm_start: [Optional] A regular expression that captures which
+      variables to warm-start (see tf.get_collection).  Defaults to `'.*'`,
+      which warm-starts all variables.  If `None` is explicitly given, only
+      variables specified in `var_name_to_vocab_info` will be warm-started.
+    var_name_to_vocab_info: [Optional] Dict of variable names (strings) to
+      VocabInfo. The variable names should be "full" variables, not the names
+      of the partitions.  If not explicitly provided, the variable is assumed to
+      have no vocabulary.
+    var_name_to_prev_var_name: [Optional] Dict of variable names (strings) to
+      name of the previously-trained variable in `ckpt_to_initialize_from`. If
+      not explicitly provided, the name of the variable is assumed to be the
+      same between the previous checkpoint and the current model.
+  """
+
+  def __new__(cls,
+              ckpt_to_initialize_from,
+              vars_to_warm_start='.*',
+              var_name_to_vocab_info=None,
+              var_name_to_prev_var_name=None):
+    if not ckpt_to_initialize_from:
+      raise ValueError(
+          '`ckpt_to_initialize_from` MUST be set in WarmStartSettings')
+    return super(WarmStartSettings, cls).__new__(
+        cls,
+        ckpt_to_initialize_from,
+        vars_to_warm_start,
+        var_name_to_vocab_info or {},
+        var_name_to_prev_var_name or {},
+    )
+
+
+def _get_default_warm_start_settings(warm_start_from):
+  """Returns default WarmStartSettings.
+
+  Args:
+    warm_start_from: Either a string representing the filepath of a checkpoint
+      to initialize from, or an instance of WarmStartSettings.
+
+  Returns:
+    Either None or an instance of WarmStartSettings.
+
+  Raises:
+    ValueError: If warm_start_from is not None but is neither a string nor an
+      instance of WarmStartSettings.
+  """
+  if warm_start_from is None:
+    return None
+  if isinstance(warm_start_from, six.string_types):
+    return WarmStartSettings(ckpt_to_initialize_from=warm_start_from)
+  elif isinstance(warm_start_from, WarmStartSettings):
+    return warm_start_from
+  else:
+    raise ValueError('warm_start_from must be a string or a WarmStartSettings')
diff --git a/tensorflow/python/estimator/estimator_lib.py b/tensorflow/python/estimator/estimator_lib.py
index 01699e7399c4089281e9ece76e534e1f82692257..be8930b3cbcd89dbb31dffde0a7a5ecfb64fcd8b 100644
--- a/tensorflow/python/estimator/estimator_lib.py
+++ b/tensorflow/python/estimator/estimator_lib.py
@@ -30,6 +30,8 @@ from tensorflow.python.estimator.canned.linear import LinearRegressor
 from tensorflow.python.estimator.canned.parsing_utils import classifier_parse_example_spec
 from tensorflow.python.estimator.canned.parsing_utils import regressor_parse_example_spec
 from tensorflow.python.estimator.estimator import Estimator
+from tensorflow.python.estimator.estimator import VocabInfo
+from tensorflow.python.estimator.estimator import WarmStartSettings
 from tensorflow.python.estimator.export import export_lib as export
 from tensorflow.python.estimator.exporter import Exporter
 from tensorflow.python.estimator.exporter import FinalExporter
@@ -41,8 +43,6 @@ from tensorflow.python.estimator.run_config import RunConfig
 from tensorflow.python.estimator.training import EvalSpec
 from tensorflow.python.estimator.training import train_and_evaluate
 from tensorflow.python.estimator.training import TrainSpec
-from tensorflow.python.estimator.warm_starting_util import VocabInfo
-from tensorflow.python.estimator.warm_starting_util import WarmStartSettings
 
 
 from tensorflow.python.util.all_util import remove_undocumented
diff --git a/tensorflow/python/estimator/estimator_test.py b/tensorflow/python/estimator/estimator_test.py
index 7a0745b1d0d5ae932fa59be56a4952e82922a584..f4255091bf6c44916789a182e60e583171ad5e6b 100644
--- a/tensorflow/python/estimator/estimator_test.py
+++ b/tensorflow/python/estimator/estimator_test.py
@@ -48,6 +48,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import lookup_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import metrics as metrics_lib
 from tensorflow.python.ops import parsing_ops
 from tensorflow.python.ops import state_ops
@@ -1268,10 +1269,10 @@ class EstimatorEvaluateTest(test.TestCase):
       _, _ = features, labels
       global_step = training.get_global_step()
 
-      image = array_ops.zeros([1, 3, 3, 1])
+      image = array_ops.zeros([5, 3, 3, 1])
       eval_metric_ops = {
-          'image': (summary.image('image', image, max_outputs=1),
-                    constant_op.constant(1))
+          'foo': (summary.image('image', image, max_outputs=3),
+                  constant_op.constant(1))
       }
       return model_fn_lib.EstimatorSpec(
           mode,
@@ -1291,10 +1292,10 @@ class EstimatorEvaluateTest(test.TestCase):
     writer_cache.FileWriterCache.clear()
 
     # Get last evaluation Event written.
-    if check_eventfile_for_keyword('image', os.path.join(est.model_dir,
-                                                         'eval')):
-      return
-    self.fail('{} should be part of reported summaries.'.format('image'))
+    for key in ['foo/0', 'foo/1', 'foo/2']:
+      self.assertTrue(
+          check_eventfile_for_keyword(key, os.path.join(est.model_dir, 'eval')),
+          '{} should be part of reported summaries.'.format(key))
 
 
 class EstimatorPredictTest(test.TestCase):
@@ -1936,6 +1937,60 @@ class EstimatorExportTest(test.TestCase):
     # cleanup
     gfile.DeleteRecursively(tmpdir)
 
+  def test_export_savedmodel_tensor_features(self):
+    """Test that models accepting a single raw Tensor can be exported.
+
+    See https://github.com/tensorflow/tensorflow/issues/11674
+
+    If the model_fn and receiver_fn accept raw tensors rather than dictionaries
+    as input, export_savedmodel should be okay with that, too.
+
+    """
+
+    tmpdir = tempfile.mkdtemp()
+
+    def _input_fn_tensor_features():
+      t = array_ops.constant([1, 2, 3], dtype=dtypes.float32, shape=[1, 3])
+      return (t, None)
+
+    def _model_fn_tensor_features(features, labels, mode):
+      _ = labels
+      prediction = math_ops.matmul(features, features, transpose_b=True)
+
+      return model_fn_lib.EstimatorSpec(
+          mode,
+          predictions=prediction,
+          loss=constant_op.constant(1.),
+          train_op=state_ops.assign_add(training.get_global_step(), 1),
+          export_outputs={
+              'test': export_output.PredictOutput({'prediction': prediction})
+          })
+
+    def _serving_input_receiver_fn():
+      feat = array_ops.placeholder(dtype=dtypes.float32)
+      return export.TensorServingInputReceiver(
+          features=feat, receiver_tensors=feat)
+
+    est = estimator.Estimator(model_fn=_model_fn_tensor_features)
+    est.train(input_fn=_input_fn_tensor_features, steps=1)
+
+    # Perform the export.
+    export_dir_base = os.path.join(
+        compat.as_bytes(tmpdir), compat.as_bytes('export'))
+    export_dir = est.export_savedmodel(
+        export_dir_base, _serving_input_receiver_fn)
+
+    # Restore, to validate that the export was well-formed.
+    with ops.Graph().as_default() as graph:
+      with session.Session(graph=graph) as sess:
+        loader.load(sess, [tag_constants.SERVING], export_dir)
+        graph_ops = [x.name.lower() for x in graph.get_operations()]
+        self.assertTrue('const' in graph_ops)
+        self.assertTrue('matmul' in graph_ops)
+
+    # Clean up.
+    gfile.DeleteRecursively(tmpdir)
+
   def test_scaffold_is_used_for_saver(self):
     tmpdir = tempfile.mkdtemp()
 
diff --git a/tensorflow/python/estimator/export/export.py b/tensorflow/python/estimator/export/export.py
index 83251c79fc561e16ebddb638668b92b3c69b8af4..9206a4964b3b7a6e3cc1e0f9e965a197be78c4ba 100644
--- a/tensorflow/python/estimator/export/export.py
+++ b/tensorflow/python/estimator/export/export.py
@@ -21,17 +21,16 @@ from __future__ import print_function
 
 import collections
 import os
-import time
 
 import six
 
+from tensorflow.python.estimator import util
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import parsing_ops
-from tensorflow.python.platform import gfile
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.saved_model import signature_constants
 from tensorflow.python.saved_model import signature_def_utils
@@ -120,6 +119,62 @@ class ServingInputReceiver(collections.namedtuple(
         receiver_tensors_alternatives=receiver_tensors_alternatives)
 
 
+@tf_export('estimator.export.TensorServingInputReceiver')
+class TensorServingInputReceiver(collections.namedtuple(
+    'TensorServingInputReceiver',
+    ['features', 'receiver_tensors', 'receiver_tensors_alternatives'])):
+  """A return type for a serving_input_receiver_fn.
+
+  This is for use with models that expect a single `Tensor` or `SparseTensor`
+  as an input feature, as opposed to a dict of features.
+
+  The normal `ServingInputReceiver` always returns a feature dict, even if it
+  contains only one entry, and so can be used only with models that accept such
+  a dict.  For models that accept only a single raw feature, the
+  `serving_input_receiver_fn` provided to `Estimator.export_savedmodel()` should
+  return this `TensorServingInputReceiver` instead.  See:
+  https://github.com/tensorflow/tensorflow/issues/11674
+
+  Note that the receiver_tensors and receiver_tensors_alternatives arguments
+  will be automatically converted to the dict representation in either case,
+  because the SavedModel format requires each input `Tensor` to have a name
+  (provided by the dict key).
+
+  The expected return values are:
+    features: A single `Tensor` or `SparseTensor`, representing the feature
+      to be passed to the model.
+    receiver_tensors: a `Tensor`, or a dict of string to `Tensor`, specifying
+      input nodes where this receiver expects to be fed by default.  Typically,
+      this is a single placeholder expecting serialized `tf.Example` protos.
+    receiver_tensors_alternatives: a dict of string to additional
+      groups of receiver tensors, each of which may be a `Tensor` or a dict of
+      string to `Tensor`.  These named receiver tensor alternatives generate
+      additional serving signatures, which may be used to feed inputs at
+      different points within the input receiver subgraph.  A typical usage is
+      to allow feeding raw feature `Tensor`s *downstream* of the
+      tf.parse_example() op.  Defaults to None.
+  """
+
+  def __new__(cls, features, receiver_tensors,
+              receiver_tensors_alternatives=None):
+    if features is None:
+      raise ValueError('features must be defined.')
+    if not (isinstance(features, ops.Tensor)
+            or isinstance(features, sparse_tensor.SparseTensor)):
+      raise ValueError('feature must be a Tensor or SparseTensor.')
+
+    receiver = ServingInputReceiver(
+        features=features,
+        receiver_tensors=receiver_tensors,
+        receiver_tensors_alternatives=receiver_tensors_alternatives)
+
+    return super(TensorServingInputReceiver, cls).__new__(
+        cls,
+        features=receiver.features[_SINGLE_FEATURE_DEFAULT_NAME],
+        receiver_tensors=receiver.receiver_tensors,
+        receiver_tensors_alternatives=receiver.receiver_tensors_alternatives)
+
+
 @tf_export('estimator.export.build_parsing_serving_input_receiver_fn')
 def build_parsing_serving_input_receiver_fn(feature_spec,
                                             default_batch_size=None):
@@ -273,13 +328,6 @@ def _log_signature_report(signature_def_map, excluded_signatures):
     logging.warn('Export includes no default signature!')
 
 
-# When we create a timestamped directory, there is a small chance that the
-# directory already exists because another worker is also writing exports.
-# In this case we just wait one second to get a new timestamp and try again.
-# If this fails several times in a row, then something is seriously wrong.
-MAX_DIRECTORY_CREATION_ATTEMPTS = 10
-
-
 def get_timestamped_export_dir(export_dir_base):
   """Builds a path to a new subdirectory within the base directory.
 
@@ -298,25 +346,7 @@ def get_timestamped_export_dir(export_dir_base):
     RuntimeError: if repeated attempts fail to obtain a unique timestamped
       directory name.
   """
-  attempts = 0
-  while attempts < MAX_DIRECTORY_CREATION_ATTEMPTS:
-    export_timestamp = int(time.time())
-
-    export_dir = os.path.join(
-        compat.as_bytes(export_dir_base),
-        compat.as_bytes(str(export_timestamp)))
-    if not gfile.Exists(export_dir):
-      # Collisions are still possible (though extremely unlikely): this
-      # directory is not actually created yet, but it will be almost
-      # instantly on return from this function.
-      return export_dir
-    time.sleep(1)
-    attempts += 1
-    logging.warn(
-        'Export directory {} already exists; retrying (attempt {}/{})'.format(
-            export_dir, attempts, MAX_DIRECTORY_CREATION_ATTEMPTS))
-  raise RuntimeError('Failed to obtain a unique export directory name after '
-                     '{} attempts.'.format(MAX_DIRECTORY_CREATION_ATTEMPTS))
+  return util.get_timestamped_dir(export_dir_base)
 
 
 def get_temp_export_dir(timestamped_export_dir):
diff --git a/tensorflow/python/estimator/export/export_lib.py b/tensorflow/python/estimator/export/export_lib.py
index 99cd81d678bc04e7ed52de721a1fdf799c116795..226fc97fd3a3aefe61c4b88088873ce7489168c7 100644
--- a/tensorflow/python/estimator/export/export_lib.py
+++ b/tensorflow/python/estimator/export/export_lib.py
@@ -22,6 +22,7 @@ from __future__ import print_function
 from tensorflow.python.estimator.export.export import build_parsing_serving_input_receiver_fn
 from tensorflow.python.estimator.export.export import build_raw_serving_input_receiver_fn
 from tensorflow.python.estimator.export.export import ServingInputReceiver
+from tensorflow.python.estimator.export.export import TensorServingInputReceiver
 from tensorflow.python.estimator.export.export_output import ClassificationOutput
 from tensorflow.python.estimator.export.export_output import ExportOutput
 from tensorflow.python.estimator.export.export_output import PredictOutput
@@ -34,6 +35,7 @@ _allowed_symbols = [
     'build_parsing_serving_input_receiver_fn',
     'build_raw_serving_input_receiver_fn',
     'ServingInputReceiver',
+    'TensorServingInputReceiver',
     'ClassificationOutput',
     'ExportOutput',
     'PredictOutput',
diff --git a/tensorflow/python/estimator/export/export_test.py b/tensorflow/python/estimator/export/export_test.py
index 8442bf04accbd0bc15f5958069bf3060debd42bc..eb9688bc973666554b6057f5f546b9a2d18461d6 100644
--- a/tensorflow/python/estimator/export/export_test.py
+++ b/tensorflow/python/estimator/export/export_test.py
@@ -385,5 +385,67 @@ class ExportTest(test_util.TensorFlowTestCase):
     self.assertTrue(int(time_2) < int(time_3))
 
 
+class TensorServingReceiverTest(test_util.TensorFlowTestCase):
+
+  def test_tensor_serving_input_receiver_constructor(self):
+    features = constant_op.constant([0])
+    receiver_tensors = {
+        "example0": array_ops.placeholder(dtypes.string, name="example0"),
+        u"example1": array_ops.placeholder(dtypes.string, name="example1"),
+    }
+    r = export.TensorServingInputReceiver(features, receiver_tensors)
+    self.assertTrue(isinstance(r.features, ops.Tensor))
+    self.assertTrue(isinstance(r.receiver_tensors, dict))
+
+  def test_tensor_serving_input_receiver_sparse(self):
+    features = sparse_tensor.SparseTensor(
+        indices=[[0, 0]], values=[1], dense_shape=[1, 1])
+    receiver_tensors = {
+        "example0": array_ops.placeholder(dtypes.string, name="example0"),
+        u"example1": array_ops.placeholder(dtypes.string, name="example1"),
+    }
+    r = export.TensorServingInputReceiver(features, receiver_tensors)
+    self.assertTrue(isinstance(r.features, sparse_tensor.SparseTensor))
+    self.assertTrue(isinstance(r.receiver_tensors, dict))
+
+  def test_serving_input_receiver_features_invalid(self):
+    receiver_tensors = {
+        "example0": array_ops.placeholder(dtypes.string, name="example0"),
+        u"example1": array_ops.placeholder(dtypes.string, name="example1"),
+    }
+
+    with self.assertRaisesRegexp(ValueError, "features must be defined"):
+      export.TensorServingInputReceiver(
+          features=None,
+          receiver_tensors=receiver_tensors)
+
+    with self.assertRaisesRegexp(ValueError, "feature must be a Tensor"):
+      export.TensorServingInputReceiver(
+          features={"1": constant_op.constant([1])},
+          receiver_tensors=receiver_tensors)
+
+  def test_serving_input_receiver_receiver_tensors_invalid(self):
+    features = constant_op.constant([0])
+
+    with self.assertRaisesRegexp(
+        ValueError, "receiver_tensors must be defined"):
+      export.TensorServingInputReceiver(
+          features=features,
+          receiver_tensors=None)
+
+    with self.assertRaisesRegexp(
+        ValueError, "receiver_tensors keys must be strings"):
+      export.TensorServingInputReceiver(
+          features=features,
+          receiver_tensors={
+              1: array_ops.placeholder(dtypes.string, name="example0")})
+
+    with self.assertRaisesRegexp(
+        ValueError, "receiver_tensor example1 must be a Tensor"):
+      export.TensorServingInputReceiver(
+          features=features,
+          receiver_tensors={"example1": [1]})
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/estimator/replicate_model_fn.py b/tensorflow/python/estimator/replicate_model_fn.py
new file mode 100644
index 0000000000000000000000000000000000000000..144d89abf3444062927d9261301fe50f4a63b280
--- /dev/null
+++ b/tensorflow/python/estimator/replicate_model_fn.py
@@ -0,0 +1,824 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Utilities to replicate model_fn's over local GPUs.
+
+This file contains utilities for replicating `Estimator.model_fn` over
+GPUs.  A replicated version of a `model_fn` is returned that can subsequently
+be used with `Estimator`.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from collections import defaultdict
+from contextlib import contextmanager
+import copy
+
+import six
+
+from tensorflow.core.framework import node_def_pb2
+from tensorflow.python.client import device_lib
+from tensorflow.python.estimator import model_fn as model_fn_lib
+from tensorflow.python.estimator import util
+from tensorflow.python.estimator.export import export_output as export_output_lib
+from tensorflow.python.framework import device as framework_device
+from tensorflow.python.framework import ops as ops_lib
+from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import sparse_ops
+from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.ops.losses import losses
+from tensorflow.python.platform import tf_logging
+from tensorflow.python.training import device_setter as device_setter_lib
+from tensorflow.python.training import optimizer as optimizer_lib
+
+
+def _replicate_model_fn(model_fn,
+                        devices=None):
+  """Replicate `Estimator.model_fn` over GPUs.
+
+  The given `model_fn` specifies a single forward pass of a model.  To replicate
+  such a model over GPUs, each GPU gets its own instance of the forward pass
+  (a.k.a. a tower).  The input features and labels get sharded into chunks
+  that correspond to the number of GPUs.  Each tower computes a loss based
+  on its input.  For each such loss, gradients are computed.  After that, the
+  available losses are aggregated to form an aggregated loss.  Available
+  gradients are summed.  Then, the weights are updated using the specified
+  optimizer.
+
+  If `devices` is `None`, then all available GPUs are going to be used for
+  replication.  If no GPUs are available, then the model is going to be
+  placed on the CPU.
+
+  Two modes of local replication over available GPUs are supported:
+    1)  If exactly 1 GPU is detected, then variables and operations are placed
+        onto the GPU.
+    2)  If more than 1 GPU is detected, then variables are going to be placed on
+        the CPU.  Replicas of operations are placed on each individual GPU.
+
+  Here is an example of how one might use their `model_fn` to run over GPUs:
+    ```python
+       ...
+       def model_fn(...):  # See `model_fn` in `Estimator`.
+         loss = ...
+         optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
+         optimizer = tf.contrib.estimator._TowerOptimizer(optimizer)
+         if mode == tf.estimator.ModeKeys.TRAIN:
+           #  See the section below on `EstimatorSpec.train_op`.
+           return EstimatorSpec(mode=mode, loss=loss,
+                                train_op=optimizer.minimize(loss))
+
+         #  No change for `ModeKeys.EVAL` or `ModeKeys.PREDICT`.
+         return EstimatorSpec(...)
+       ...
+       classifier = tf.estimator.Estimator(
+         model_fn=tf.contrib.estimator.replicate_model_fn(model_fn))
+    ```
+
+  Please see `DNNClassifierIntegrationTest` for an example with a canned
+  Estimator.
+
+  On `EstimatorSpec.train_op`:
+  `model_fn` returns `EstimatorSpec.train_op` for
+  `tf.estimator.ModeKeys.TRAIN`. It is typically derived using an optimizer.
+  Towers are expected to populate it in the same way.  Gradients from all towers
+  are reduced and applied in the last tower.  To achieve that in the case of
+  multiple towers, `_TowerOptimizer` needs to be used.  See `_TowerOptimizer`.
+
+  On sharding input features and labels:
+  Input features and labels are split for consumption by each tower. They are
+  split across the dimension 0.  Features and labels need to be batch major.
+
+  On reduction algorithms:
+  Certain algorithms were chosen for aggregating results of computations on
+  multiple towers:
+    - Losses from all towers are reduced according to the `loss_reduction`
+      argument to `_TowerOptimizer`.
+    - Gradients from all towers are reduced according to the `loss_reduction`
+      for each trainable variable.
+    - `eval_metric_ops` are reduced per metric using `reduce_mean`.
+    - `EstimatorSpec.predictions` and `EstimatorSpec.export_outputs` are
+      reduced using concatenation.
+    - For all other fields of `EstimatorSpec` the values of the first tower
+      are taken.
+
+  On distribution of variables:
+  Variables are not duplicated between towers.  Instead, they are placed on a
+  single device as defined above and shared across towers.
+
+  On overhead:
+  If only one device is specified, then aggregation of loss and gradients
+  doesn't happen. Replication consists of placing `model_fn` onto the
+  specified device.
+
+  On current limitations:
+    - `predictions` are not supported for `ModeKeys.EVAL`.  They are required
+       for `tf.contrib.estimator.add_metrics`.
+
+  Args:
+    model_fn: `model_fn` as defined in `Estimator`.  See the section above about
+      the train_op argument of `EstimatorSpec`.
+    devices: Optional list of devices to replicate the model across.  This
+      argument can be used to replicate only on a subset of available GPUs.
+      If `None`, then all available GPUs are going to be used for replication.
+      If no GPUs are available, then the model is going to be placed on the CPU.
+
+  Returns:
+    A replicated version of the supplied `model_fn`.  The returned function
+      conforms to the requirements of `Estimator`'s `model_fn` and can be used
+      instead of the supplied `model_fn`.
+  """
+  return _replicate_model_fn_with_mode(
+      model_fn,
+      devices,
+      # TODO(isaprykin): Query the system configuration to choose modes other
+      # than `SHARED_LOCAL_PARAMETER_SERVER`, even though it is often
+      # appropriate.
+      mode=_VariableDistributionMode.SHARED_LOCAL_PARAMETER_SERVER)
+
+
+class _VariableDistributionMode(object):
+  """Modes for variable distribution used for forcing a particular one.
+
+  Forcing a mode is meant for performance experimentation purposes rather than
+  for general use cases.
+  """
+
+  SHARED_LOCAL_PARAMETER_SERVER = 1
+  """Variables are placed on a single device and shared across all devices.
+
+  Two ways to achieve this distribution over available GPUs are supported:
+    1)  If exactly 1 GPU is detected, then variables and operations are placed
+        onto the GPU.
+    2)  If more than 1 GPU is detected, then variables are going to be placed on
+        the CPU.  Replicas of operations are placed on each individual GPU.
+  """
+
+  SHARED_ROUND_ROBIN = 2
+  """Variables are placed on all devices in a round-robin fashion.
+
+  Every subsequent variable is placed on the next device.  There is only one
+  copy of each variable that is shared across all devices.
+  """
+
+
+def _replicate_model_fn_with_mode(
+    model_fn,
+    devices=None,
+    mode=_VariableDistributionMode.SHARED_LOCAL_PARAMETER_SERVER):
+  """A version of `replicate_model_fn` that allows specifying a `mode`."""
+  if not devices:
+    devices = _get_local_devices('GPU') or _get_local_devices('CPU')
+
+  is_a_single_gpu_case = len(devices) == 1 and 'GPU' in devices[0].upper()
+  consolidation_device = devices[0] if is_a_single_gpu_case else '/CPU:0'
+
+  ps_devices = [consolidation_device]
+  if mode == _VariableDistributionMode.SHARED_ROUND_ROBIN:
+    ps_devices = devices
+
+  tf_logging.info('Replicating the `model_fn` across {}.  Variables are going '
+                  'to be placed on {}.  Consolidation device is going to be {}.'
+                  .format(devices, ps_devices, consolidation_device))
+
+  def single_device_model_fn(features, labels, mode, params=None, config=None):
+    """`model_fn` on a single device without reduction overhead."""
+    return _get_loss_towers(
+        model_fn=model_fn,
+        mode=mode,
+        features=[features],
+        labels=[labels],
+        params=params,
+        config=config,
+        devices=devices,
+        local_ps_devices=ps_devices)[0]  # One device, so one spec is out.
+
+  def replicated_model_fn(features, labels, mode, params=None, config=None):
+    """Replicated version of `model_fn` to be used instead."""
+    feature_shards, label_shards = _split_batch(
+        features, labels, len(devices), device=consolidation_device)
+    tower_specs = _get_loss_towers(
+        model_fn=model_fn,
+        mode=mode,
+        features=feature_shards,
+        labels=label_shards,
+        params=params,
+        config=config,
+        devices=devices,
+        local_ps_devices=ps_devices)
+
+    if mode == model_fn_lib.ModeKeys.TRAIN:
+      train_op = _minimize_towers(tower_specs)
+      return _train_spec(
+          tower_specs, train_op, aggregation_device=consolidation_device)
+    elif mode == model_fn_lib.ModeKeys.EVAL:
+      return _eval_spec(tower_specs, aggregation_device=consolidation_device)
+    elif mode == model_fn_lib.ModeKeys.PREDICT:
+      return _predict_spec(tower_specs, aggregation_device=consolidation_device)
+
+  if len(devices) == 1:
+    return single_device_model_fn
+  else:
+    return replicated_model_fn
+
+
+class _TowerOptimizer(optimizer_lib.Optimizer):
+  """Gathers gradients from all towers and reduces them in the last one."""
+
+  COLLECTION_FOR_GRAPH_STATES = 'replicate_model_fn_graph_states'
+
+  def __init__(self, optimizer_or_optimizer_fn,
+               loss_reduction=losses.Reduction.SUM_OVER_BATCH_SIZE):
+    """Wrap an existing optimizer for gathering gradients across towers.
+
+    Each invocation of model_fn has to call the same optimizers in the same
+    order.
+
+    Multiple optimizers that use the same or different losses are supported.
+
+    If _TowerOptimizer is used but `replicate_model_fn` isn't, then no
+    aggregation will happen.  All calls will simply be forwarded to the
+    underlying optimizer. The behavior is similar if there is only one tower.
+
+    If _TowerOptimizer is used together with SyncReplicasOptimizer that wraps
+    the user's optimizer, then it's the SyncReplicasOptimizer that needs to be
+    wrapped with _TowerOptimizer.
+
+    Args:
+      optimizer_or_optimizer_fn: an instance of optimizer to wrap.  That
+        instance is going to be used for optimizer-specific logic.  This can
+        also be a no-argument function that returns such an optimizer instance.
+      loss_reduction: controls whether losses are summed or averaged.
+    """
+    self._optimizer_or_optimizer_fn = optimizer_or_optimizer_fn
+    self._loss_reduction = loss_reduction
+
+  @staticmethod
+  def has_been_used():
+    return _TowerOptimizer._graph_state().has_tower_optimizer_been_used
+
+  def get_slot(self, *args, **kwargs):
+    return self._get_optimizer().get_slot(*args, **kwargs)
+
+  def get_slot_names(self, *args, **kwargs):
+    return self._get_optimizer().get_slot_names(*args, **kwargs)
+
+  def get_name(self, *args, **kwargs):
+    return self._get_optimizer().get_name(*args, **kwargs)
+
+  def variables(self, *args, **kwargs):
+    return self._get_optimizer().variables(*args, **kwargs)
+
+  def compute_gradients(self, loss, *args, **kwargs):
+    """Compute gradients, but first, if needed, scale the loss."""
+    _TowerOptimizer._graph_state().set_loss_reduction(self._loss_reduction)
+    loss = _scale_loss(loss,
+                       self._loss_reduction,
+                       self._graph_state().number_of_towers)
+    return self._get_optimizer().compute_gradients(loss, *args, **kwargs)
+
+  def apply_gradients(self, grads_and_vars, global_step=None, **kwargs):
+    """Collect gradient updates to apply them in the last tower."""
+    if self._graph_state().number_of_towers == 1:
+      # Avoid the overhead of reduction if there's only one tower.
+      #
+      # There is assumed to be only one tower if aggregation-related methods
+      # were not called by `_get_loss_towers`, for example if the model_fn uses
+      # `_TowerOptimizer`, but `replicate_model_fn` isn't used.
+      return self._get_optimizer().apply_gradients(grads_and_vars, global_step,
+                                                   **kwargs)
+
+    self._graph_state().collect_gradients(grads_and_vars)
+
+    if not self._graph_state().is_the_last_tower:
+      with ops_lib.control_dependencies(_extract_tensors(grads_and_vars)):
+        return self._construct_no_op_train_op()
+    else:
+      # Gradients need to be gathered and applied in the scope of the first
+      # tower, so that the tensors are accessible via names without prefixes.
+      var_scope, name_scope = self._graph_state().scopes_of_the_first_tower
+      with variable_scope.variable_scope(var_scope):
+        with ops_lib.name_scope(name_scope):
+          return self._apply_gathered_gradients(global_step, **kwargs)
+
+  def _apply_gathered_gradients(self, global_step, **kwargs):
+    graph_state = self._graph_state()
+    optimizer = self._get_optimizer()
+
+    grad_lists = {}
+    for grad, var in graph_state.get_latest_gradients_from_all_towers():
+      if grad is not None:
+        grad_lists.setdefault(var, []).append(grad)
+
+    aggregated_grads = []
+    with ops_lib.name_scope('gradient_aggregating'):
+      for var, grads in six.iteritems(grad_lists):
+        grad = _compute_sum_on_device(grads, var.device)
+        aggregated_grads.append((grad, var))
+    return optimizer.apply_gradients(
+        aggregated_grads, global_step=global_step, **kwargs)
+
+  def _get_optimizer(self):
+    if callable(self._optimizer_or_optimizer_fn):
+      # If optimizer is given as a function then we need to wait till we are
+      # under the right graph context before constructing it.  That's why the
+      # optimizer is constructed in _get_optimizer() rather than __init__().
+      self._optimizer_or_optimizer_fn = self._optimizer_or_optimizer_fn()
+    self._graph_state().has_tower_optimizer_been_used = True
+    return self._optimizer_or_optimizer_fn
+
+  def _construct_no_op_train_op(self):
+    return control_flow_ops.no_op(name='train_op_placeholder')
+
+  @staticmethod
+  def _graph_state():
+    graph_states = ops_lib.get_default_graph().get_collection_ref(
+        _TowerOptimizer.COLLECTION_FOR_GRAPH_STATES)
+    if not graph_states:
+      graph_states.append(_TowerOptimizer._PerGraphState())
+    return graph_states[-1]
+
+  @staticmethod
+  def _did_towers_have_same_optimizer_calls():
+    graph_state = _TowerOptimizer._graph_state()
+    return graph_state.did_towers_have_same_optimizer_calls()
+
+  @staticmethod
+  def _clear_graph_state():
+    # Clearing the Graph collection will prevent _PerGraphState from being
+    # serialized.
+    ops_lib.get_default_graph().clear_collection(
+        _TowerOptimizer.COLLECTION_FOR_GRAPH_STATES)
+
+  class _PerGraphState(object):
+    """Gradient reduction related state of a TensorFlow graph."""
+
+    def __init__(self):
+      self._collected_grads_and_vars = defaultdict(list)
+      self._current_tower_index = 0
+      self._number_of_towers = 1
+      self._loss_reduction = None
+      # Scopes of the first tower that don't have a prefix:
+      self._variable_scope = None
+      self._name_scope = None
+      # If needed, alert that _TowerOptimizer needs to be used with model_fn.
+      self._has_tower_optimizer_been_used = False
+
+    def collect_gradients(self, grads_and_vars):
+      self._collected_grads_and_vars[self._current_tower_index].append(
+          grads_and_vars)
+
+    def get_latest_gradients_from_all_towers(self):
+      """Get gradients across towers for the last called optimizer."""
+      grads_and_vars = []
+      index_of_last_gradients = len(
+          self._collected_grads_and_vars[self._current_tower_index]) - 1
+      for tower_id in range(self._current_tower_index + 1):
+        grads_and_vars.extend(
+            self._collected_grads_and_vars[tower_id][index_of_last_gradients])
+      return grads_and_vars
+
+    def set_number_of_towers(self, number_of_towers):
+      self._number_of_towers = number_of_towers
+
+    def set_loss_reduction(self, loss_reduction):
+      self._loss_reduction = loss_reduction
+
+    @contextmanager
+    def tower(self, tower_id, var_scope, name_scope):
+      if tower_id == 0:
+        self._variable_scope = var_scope
+        self._name_scope = name_scope
+      self._current_tower_index = tower_id
+      yield
+
+    @property
+    def scopes_of_the_first_tower(self):
+      return self._variable_scope, self._name_scope
+
+    @property
+    def is_the_last_tower(self):
+      return self._current_tower_index == (self._number_of_towers - 1)
+
+    @property
+    def number_of_towers(self):
+      return self._number_of_towers
+
+    @property
+    def loss_reduction(self):
+      return self._loss_reduction
+
+    @property
+    def has_tower_optimizer_been_used(self):
+      return self._has_tower_optimizer_been_used
+
+    @has_tower_optimizer_been_used.setter
+    def has_tower_optimizer_been_used(self, value):
+      self._has_tower_optimizer_been_used = value
+
+    def did_towers_have_same_optimizer_calls(self):
+      total_number_of_grads = sum([
+          len(grads)
+          for _, grads in six.iteritems(self._collected_grads_and_vars)
+      ])
+      return total_number_of_grads % self._number_of_towers == 0
+
+
+def _get_local_devices(device_type):
+  local_device_protos = device_lib.list_local_devices()
+  return [
+      device.name
+      for device in local_device_protos
+      if device.device_type == device_type
+  ]
+
+
+def _split_batch(features, labels, number_of_shards, device):
+  """Split input features and labels into batches."""
+
+  def ensure_divisible_by_shards(sequence):
+    batch_size = ops_lib.convert_to_tensor(sequence).get_shape()[0]
+    if batch_size % number_of_shards != 0:
+      raise ValueError(
+          'Batch size {} needs to be divisible by the number of GPUs, which '
+          'is {}.'.format(batch_size, number_of_shards))
+
+  def split_dictionary(dictionary):
+    """Split a dictionary into shards."""
+    shards = [{} for _ in range(number_of_shards)]
+    for name, tensor in six.iteritems(dictionary):
+      if isinstance(tensor, sparse_tensor.SparseTensor):
+        for i, shard in enumerate(
+            sparse_ops.sparse_split(
+                sp_input=tensor, num_split=number_of_shards, axis=0)):
+          shards[i][name] = shard
+      else:
+        ensure_divisible_by_shards(tensor)
+        for i, shard in enumerate(array_ops.split(tensor, number_of_shards)):
+          shards[i][name] = shard
+    return shards
+
+  with ops_lib.name_scope('split_inputs'):
+    with ops_lib.device(device):
+      if isinstance(features, dict):
+        feature_shards = split_dictionary(features)
+      else:
+        ensure_divisible_by_shards(features)
+        feature_shards = array_ops.split(features, number_of_shards)
+
+      if labels is None:
+        label_shards = None
+      elif isinstance(labels, dict):
+        label_shards = split_dictionary(labels)
+      else:
+        ensure_divisible_by_shards(labels)
+        label_shards = array_ops.split(labels, number_of_shards)
+  return feature_shards, label_shards
+
+
+_DEFAULT_NAME_SCOPE_PATTERN = 'tower_{}'
+
+
+def _get_loss_towers(model_fn,
+                     mode,
+                     features,
+                     labels,
+                     params,
+                     config,
+                     devices,
+                     local_ps_devices,
+                     name_scope_pattern=_DEFAULT_NAME_SCOPE_PATTERN):
+  """Replicate the loss computation across devices."""
+  tower_specs = []
+
+  model_fn_args = util.fn_args(model_fn)
+  optional_params = {}
+  if 'params' in model_fn_args:
+    optional_params['params'] = copy.deepcopy(params)
+  if 'config' in model_fn_args:
+    optional_params['config'] = copy.deepcopy(config)
+
+  # pylint: disable=protected-access
+  round_robin_strategy = device_setter_lib._RoundRobinStrategy(
+      num_tasks=len(local_ps_devices))
+  _TowerOptimizer._graph_state().set_number_of_towers(len(devices))
+
+  for i, device in enumerate(devices):
+    is_the_first_tower = (i == 0)
+
+    device_setter = _local_device_setter(
+        worker_device=device,
+        ps_devices=local_ps_devices,
+        ps_strategy=round_robin_strategy)
+
+    # We would like to preserve the names of the variables and ops that the user
+    # might be relying on. Names without a prefix are going to resolve to
+    # variables and ops of the first tower.
+    name_scope = name_scope_pattern
+    if is_the_first_tower:
+      name_scope = ''
+
+    with variable_scope.variable_scope(
+        '', reuse=not is_the_first_tower) as var_scope:
+      with ops_lib.name_scope(name_scope.format(i)) as name_scope:
+        with _TowerOptimizer._graph_state().tower(
+            tower_id=i, var_scope=var_scope, name_scope=name_scope):
+          with ops_lib.device(device_setter):
+            labels_shard = None
+            if labels:
+              labels_shard = labels[i]
+
+            tower_spec = model_fn(
+                mode=mode,
+                features=features[i],
+                labels=labels_shard,
+                **optional_params)
+
+            if (tower_spec.train_op is not None and len(devices) > 1 and
+                not _TowerOptimizer.has_been_used()):
+              raise ValueError('Please wrap optimizers with _TowerOptimizer'
+                               ' in order to use replicate_model_fn with'
+                               ' multiple `devices`.')
+
+            # Scaling the loss here doesn't actually affect gradients.  Another
+            # instance of scaling happens inside the _TowerOptimizer.
+            tower_spec = _scale_tower_loss(
+                tower_spec,
+                _TowerOptimizer._graph_state().loss_reduction,
+                number_of_towers=len(devices))
+            tower_specs.append(tower_spec)
+
+  if not _TowerOptimizer._did_towers_have_same_optimizer_calls():
+    raise ValueError('Each invocation of model_fn was supposed to make the same'
+                     ' optimizer calls.')
+  _TowerOptimizer._clear_graph_state()
+  # pylint: enable=protected-access
+  return tower_specs
+
+
+def _local_device_setter(worker_device, ps_devices, ps_strategy):
+  """A device setter that distributes Var/Ops to PS/workers."""
+  ps_ops = ['Variable', 'VariableV2', 'VarHandleOp']
+
+  def local_device_chooser(op):
+    current_device = framework_device.DeviceSpec.from_string(op.device or '')
+
+    node_def = op if isinstance(op, node_def_pb2.NodeDef) else op.node_def
+    if node_def.op in ps_ops:
+      ps_device_spec = framework_device.DeviceSpec.from_string(
+          '{}'.format(ps_devices[ps_strategy(op)]))
+
+      ps_device_spec.merge_from(current_device)
+      return ps_device_spec.to_string()
+    else:
+      worker_device_spec = framework_device.DeviceSpec.from_string(
+          worker_device or '')
+      worker_device_spec.merge_from(current_device)
+      return worker_device_spec.to_string()
+
+  return local_device_chooser
+
+
+def _scale_tower_loss(tower_spec, loss_reduction, number_of_towers):
+  """Produce an EstimatorSpec with appropriately scaled loss."""
+  if tower_spec.loss is None:
+    return tower_spec
+
+  estimator_spec = _asdict(tower_spec)
+  estimator_spec['loss'] = _scale_loss(
+      tower_spec.loss,
+      loss_reduction,
+      number_of_towers,
+      reduced_loss_name='averaged_loss')
+  return model_fn_lib.EstimatorSpec(**estimator_spec)
+
+
+def _scale_loss(loss, loss_reduction, number_of_towers, reduced_loss_name=None):
+  """If needed, scale the loss down so that the sum over towers is a mean."""
+  if loss is None:
+    return None
+  if number_of_towers == 1:
+    return loss
+
+  if loss_reduction == losses.Reduction.NONE:
+    raise ValueError('Tower losses need to be reduced in some way, yet {} '
+                     'reduction is specified.'.format(loss_reduction))
+
+  if loss_reduction != losses.Reduction.SUM:
+    return math_ops.div(loss, 1.0 * number_of_towers, name=reduced_loss_name)
+  else:
+    return loss
+
+
+def _minimize_towers(tower_specs):
+  """`train_op` of the last tower applies aggregated gradients."""
+  return tower_specs[-1].train_op
+
+
+def _compute_sum_on_device(values, device, name=None):
+  with ops_lib.device(device):
+    if isinstance(values[0], ops_lib.IndexedSlices):
+      if name:
+        raise ValueError('The name {} is not expected to be given to '
+                         'IndexedSlices {}'.format(name, values))
+
+      values_concat = array_ops.concat([v.values for v in values], axis=0)
+      indices_concat = array_ops.concat([v.indices for v in values], axis=0)
+      return ops_lib.IndexedSlices(values_concat, indices_concat,
+                                   values[0].dense_shape)
+    else:
+      return math_ops.add_n(values, name=name)
+
+
+def _train_spec(tower_specs,
+                train_op,
+                aggregation_device,
+                aggregated_loss_name='loss'):
+  """Populate replicated EstimatorSpec for `ModeKeys.TRAIN`."""
+  # Spec of the last tower is used as the template for the final spec, because
+  # some `EstimatorSpec.training_hooks` rely on calls made in model_fn.  For
+  # example, `SyncReplicasOptimizerHook` validates the
+  # `SyncReplicasOptimizer.apply_gradients` call. `TowerEstimator` makes that
+  # call only in the last tower.
+  estimator_spec = _asdict(tower_specs[-1])
+  estimator_spec['mode'] = model_fn_lib.ModeKeys.TRAIN
+  estimator_spec['train_op'] = train_op
+  estimator_spec['loss'] = _compute_sum_on_device(
+      [spec.loss for spec in tower_specs], aggregation_device,
+      aggregated_loss_name)
+  return model_fn_lib.EstimatorSpec(**estimator_spec)
+
+
+def _eval_spec(tower_specs, aggregation_device, aggregated_loss_name='loss'):
+  """Populate replicated EstimatorSpec for `ModeKeys.EVAL`."""
+  estimator_spec = _asdict(tower_specs[0])
+  estimator_spec['mode'] = model_fn_lib.ModeKeys.EVAL
+  estimator_spec['loss'] = _compute_sum_on_device(
+      [spec.loss for spec in tower_specs], aggregation_device,
+      aggregated_loss_name)
+
+  update_ops = []
+  for tower_spec in tower_specs:
+    for name, (_, update_op) in six.iteritems(tower_spec.eval_metric_ops):
+      update_ops.append(update_op)
+
+  with ops_lib.control_dependencies(update_ops):
+    reduced_update_op = _reduce_metric_variables(len(tower_specs))
+
+  eval_metric_ops = {}
+  for name, (metric_tensor, _) in six.iteritems(tower_specs[0].eval_metric_ops):
+    eval_metric_ops[name] = (metric_tensor, reduced_update_op)
+  estimator_spec['eval_metric_ops'] = eval_metric_ops
+  return model_fn_lib.EstimatorSpec(**estimator_spec)
+
+
+def _reduce_metric_variables(number_of_towers):
+  """Aggregate local variables used in metrics into the first tower."""
+  if number_of_towers == 1:
+    return control_flow_ops.no_op(name='no_eval_metric_reduction')
+
+  metric_variables = ops_lib.get_collection(ops_lib.GraphKeys.METRIC_VARIABLES)
+  variables_per_tower = len(metric_variables) // number_of_towers
+
+  if len(metric_variables) % number_of_towers != 0:
+    raise ValueError(
+        'Different `EstimatorSpec.eval_metric_ops` across `model_fn()` calls.'
+        ' Expected {} local variables, but got {} instead.'.format(
+            variables_per_tower * number_of_towers, len(metric_variables)))
+
+  # `metric_variables` has the size of `variables_per_tower` x
+  #  number_of_towers.  Each tower is produced by calling the same model_fn.
+  #  First `variables_per_tower` correspond to the first tower.  Variable `j`
+  #  of the first tower has a replica at `j + variables_per_tower * i`, where
+  #  `i` is in `[1, number_of_towers)`.  We add values from replicas
+  #  to each variable of the first tower.  We then zero out replica values, so
+  #  that `_reduce_metric_variables` operation is idempotent.  If a metric
+  #  is then computed based on local variables from the first tower, then the
+  #  resulting metric is an estimate for all `number_of_towers` towers.
+  ops = []
+  for i in range(0, variables_per_tower):
+    next_replica_id = i + variables_per_tower
+    replicas = [
+        metric_variables[replica_id]
+        for replica_id in range(next_replica_id, len(metric_variables),
+                                variables_per_tower)
+    ]  #  `replicas` doesn't contain the first-tower variable.
+
+    reduce_op = state_ops.assign_add(metric_variables[i],
+                                     math_ops.add_n(replicas))
+
+    with ops_lib.control_dependencies([reduce_op]):
+      for replica in replicas:
+        zeros_for_replica = array_ops.zeros(
+            array_ops.shape(replica), dtype=replica.dtype)
+        zero_out_replica_op = state_ops.assign(replica, zeros_for_replica)
+        ops.append(zero_out_replica_op)
+
+  return control_flow_ops.group(*ops)
+
+
+def _predict_spec(tower_specs, aggregation_device):
+  """Populate replicated EstimatorSpec for `ModeKeys.PREDICT`."""
+  estimator_spec = _asdict(tower_specs[0])
+  estimator_spec['mode'] = model_fn_lib.ModeKeys.PREDICT
+
+  with ops_lib.device(aggregation_device):
+    estimator_spec['predictions'] = _concat_tensor_dicts(
+        *[tower_spec.predictions for tower_spec in tower_specs])
+
+    export_outputs_dict = _dict_concat(
+        *[tower_spec.export_outputs for tower_spec in tower_specs])
+
+    export_outputs = {}
+    for name, export_output_list in six.iteritems(export_outputs_dict):
+      if isinstance(export_output_list[0], export_output_lib.PredictOutput):
+        export_outputs[name] = export_output_lib.PredictOutput(
+            outputs=_concat_tensor_dicts(*[
+                export_output.outputs for export_output in export_output_list
+            ]))
+      elif isinstance(export_output_list[0],
+                      export_output_lib.RegressionOutput):
+        export_outputs[name] = export_output_lib.RegressionOutput(
+            value=array_ops.concat(
+                [export_output.value for export_output in export_output_list],
+                axis=0))
+      elif isinstance(export_output_list[0],
+                      export_output_lib.ClassificationOutput):
+        scores = None
+        if export_output_list[0].scores is not None:
+          scores = array_ops.concat(
+              [export_output.scores for export_output in export_output_list],
+              axis=0)
+
+        classes = None
+        if export_output_list[0].classes is not None:
+          classes = array_ops.stack(
+              [export_output.classes for export_output in export_output_list],
+              axis=0)
+
+        export_outputs[name] = export_output_lib.ClassificationOutput(
+            scores=scores, classes=classes)
+
+  estimator_spec['export_outputs'] = export_outputs
+  return model_fn_lib.EstimatorSpec(**estimator_spec)
+
+
+def _concat_tensor_dicts(*tensor_dicts):
+  return {
+      name: array_ops.concat(tensors, axis=0, name=name)
+      for name, tensors in six.iteritems(_dict_concat(*tensor_dicts))
+  }
+
+
+def _extract_tensors(tensors_and_vars):
+  tensors = []
+  for tensor_and_var in tensors_and_vars:
+    tensor, _ = tensor_and_var
+    if isinstance(tensor, ops_lib.IndexedSlices):
+      tensors.append(tensor.values)
+    elif tensor is not None:
+      tensors.append(tensor)
+  return tensors
+
+
+def _dict_concat(*dicts):
+  list_dict = {}
+  for d in dicts:
+    if d is None:
+      continue
+
+    for k, v in six.iteritems(d):
+      list_dict.setdefault(k, []).append(v)
+  return list_dict
+
+
+def _asdict(namedtuple):
+  """Returns a namedtuple as a dictionary.
+
+  This is required because `_asdict()` in Python 3.x.x is broken in classes
+  that inherit from `collections.namedtuple`. See
+  https://bugs.python.org/issue24931 for more details.
+
+  Args:
+    namedtuple: An object that inherits from `collections.namedtuple`.
+
+  Returns:
+    A dictionary version of the tuple.
+  """
+  return {k: getattr(namedtuple, k) for k in namedtuple._fields}
diff --git a/tensorflow/python/estimator/replicate_model_fn_test.py b/tensorflow/python/estimator/replicate_model_fn_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad1f9c02b92d7b1ce929494f4b6fbf636762a7fd
--- /dev/null
+++ b/tensorflow/python/estimator/replicate_model_fn_test.py
@@ -0,0 +1,1739 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for utilities that replicate `Estimator.model_fn` over GPUs."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import re
+import shutil
+import tempfile
+import numpy as np
+import six
+
+from tensorflow.python.estimator import estimator as estimator_lib
+from tensorflow.python.estimator import model_fn as model_fn_lib
+from tensorflow.python.estimator import replicate_model_fn
+from tensorflow.python.estimator.canned import dnn
+from tensorflow.python.estimator.canned import optimizers
+from tensorflow.python.estimator.canned import prediction_keys
+from tensorflow.python.estimator.export import export
+from tensorflow.python.estimator.export import export_output
+from tensorflow.python.estimator.inputs import numpy_io
+from tensorflow.python.feature_column import feature_column
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops as ops_lib
+from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import losses
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import metrics as metrics_lib
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.ops import variables
+from tensorflow.python.ops.losses import losses
+from tensorflow.python.platform import gfile
+from tensorflow.python.platform import test
+from tensorflow.python.saved_model import signature_constants
+from tensorflow.python.summary.writer import writer_cache
+from tensorflow.python.training import adam
+from tensorflow.python.training import device_setter
+from tensorflow.python.training import gradient_descent
+from tensorflow.python.training import training
+
+
+# TODO(isaprykin):  Parametrize all the tests on
+#   replicate_model_fn._VariableDistributionMode when it's supported.
+class DNNClassifierIntegrationTest(test_util.TensorFlowTestCase):
+
+  def setUp(self):
+    self._model_dir = tempfile.mkdtemp()
+
+  def test_complete_flow_with_public_version(self):
+    return self._complete_flow_with_mode(mode=None)
+
+  def test_complete_flow_with_mode_local_ps_server(self):
+    return self._complete_flow_with_mode(
+        replicate_model_fn._VariableDistributionMode.
+        SHARED_LOCAL_PARAMETER_SERVER)
+
+  def test_complete_flow_with_mode_round_robin(self):
+    return self._complete_flow_with_mode(
+        replicate_model_fn._VariableDistributionMode.SHARED_ROUND_ROBIN)
+
+  def _complete_flow_with_mode(self, mode):
+    n_classes = 3
+    input_dimension = 2
+    batch_size = 12
+
+    data = np.linspace(
+        0., n_classes - 1., batch_size * input_dimension, dtype=np.float32)
+    x_data = data.reshape(batch_size, input_dimension)
+    categorical_data = np.random.random_integers(
+        0, len(x_data), size=len(x_data))
+    y_data = np.reshape(self._as_label(data[:batch_size]), (batch_size, 1))
+    train_input_fn = numpy_io.numpy_input_fn(
+        x={'x': x_data,
+           'categories': categorical_data},
+        y=y_data,
+        batch_size=batch_size,
+        num_epochs=None,
+        shuffle=True)
+    eval_input_fn = numpy_io.numpy_input_fn(
+        x={'x': x_data,
+           'categories': categorical_data},
+        y=y_data,
+        batch_size=batch_size,
+        shuffle=False)
+    predict_input_fn = numpy_io.numpy_input_fn(
+        x={'x': x_data,
+           'categories': categorical_data},
+        batch_size=batch_size,
+        shuffle=False)
+
+    feature_columns = [
+        feature_column.numeric_column('x', shape=(input_dimension,)),
+        feature_column.embedding_column(
+            feature_column.categorical_column_with_vocabulary_list(
+                'categories',
+                vocabulary_list=np.linspace(
+                    0., len(x_data), len(x_data), dtype=np.int64)), 1)
+    ]
+
+    def optimizer_fn():
+      return optimizers.get_optimizer_instance('Adagrad', learning_rate=0.05)
+
+    estimator = dnn.DNNClassifier(
+        hidden_units=(2, 2),
+        # Adagrad is configured with `get_optimizer_instance`, so the function
+        # form of `TowerOptimizer.__init__` is used.
+        optimizer=replicate_model_fn._TowerOptimizer(
+            optimizer_fn, loss_reduction=losses.Reduction.SUM),
+        feature_columns=feature_columns,
+        n_classes=n_classes,
+        model_dir=self._model_dir)
+
+    if not mode:  # Use the public `replicate_model_fn`.
+      model_fn = replicate_model_fn._replicate_model_fn(
+          estimator.model_fn, devices=['/gpu:0', '/gpu:1', '/gpu:2'])
+    else:
+      model_fn = replicate_model_fn._replicate_model_fn_with_mode(
+          estimator.model_fn,
+          devices=['/gpu:0', '/gpu:1', '/gpu:2'],
+          mode=mode)
+
+    estimator = estimator_lib.Estimator(
+        model_fn=model_fn,
+        model_dir=estimator.model_dir,
+        config=estimator.config,
+        params=estimator.params)
+
+    num_steps = 10
+    estimator.train(train_input_fn, steps=num_steps)
+
+    scores = estimator.evaluate(eval_input_fn)
+    self.assertEqual(num_steps, scores[ops_lib.GraphKeys.GLOBAL_STEP])
+    self.assertIn('loss', six.iterkeys(scores))
+
+    predicted_proba = np.array([
+        x[prediction_keys.PredictionKeys.PROBABILITIES]
+        for x in estimator.predict(predict_input_fn)
+    ])
+    self.assertAllEqual((batch_size, n_classes), predicted_proba.shape)
+
+    feature_spec = feature_column.make_parse_example_spec(feature_columns)
+    serving_input_receiver_fn = export.build_parsing_serving_input_receiver_fn(
+        feature_spec)
+    export_dir = estimator.export_savedmodel(tempfile.mkdtemp(),
+                                             serving_input_receiver_fn)
+    self.assertTrue(gfile.Exists(export_dir))
+
+    # Nothing should be left in the graph so that it doesn't get serialized.
+    self.assertFalse(ops_lib.get_default_graph().get_collection_ref(
+        replicate_model_fn._TowerOptimizer.COLLECTION_FOR_GRAPH_STATES))
+
+  def _as_label(self, data_in_float):
+    return np.rint(data_in_float).astype(np.int64)
+
+  def tearDown(self):
+    if self._model_dir:
+      writer_cache.FileWriterCache.clear()
+      shutil.rmtree(self._model_dir)
+
+
+class ReplicateModelTest(test_util.TensorFlowTestCase):
+
+  def create_model_fn_with_loss_reduction(self, loss_reduction):
+
+    def model_fn(mode, features, labels, params):
+      c = variable_scope.get_variable(
+          'c',
+          initializer=constant_op.constant(10, dtype=dtypes.float64),
+          dtype=dtypes.float64)
+
+      predictions = math_ops.multiply(features, c)
+
+      loss = losses.absolute_difference(
+          labels=labels,
+          predictions=predictions,
+          reduction=losses.Reduction.SUM)
+      loss = math_ops.reduce_sum(loss)
+
+      metrics = {
+          'accuracy': metrics_lib.accuracy(labels, predictions),
+          'auc': metrics_lib.auc(labels, predictions)
+      }
+
+      optimizer = replicate_model_fn._TowerOptimizer(
+          gradient_descent.GradientDescentOptimizer(params['learning_rate']),
+          loss_reduction=loss_reduction)
+
+      return model_fn_lib.EstimatorSpec(
+          mode=mode,
+          loss=loss,
+          eval_metric_ops=metrics,
+          predictions={'probabilities': predictions},
+          train_op=optimizer.minimize(loss))
+
+    return model_fn
+
+  @property
+  def params(self):
+    params = {}
+    params['learning_rate'] = 1.0
+    return params
+
+  def test_train(self):
+    features = np.array([[1.0], [2.0]])
+    labels = np.array([[1.0], [2.0]])
+
+    with self.test_session() as session:
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+          devices=['/gpu:0', '/gpu:1'])
+      estimator_spec = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.TRAIN, self.params)
+      session.run(variables.global_variables_initializer())
+
+      # loss = feature * c - label
+      total_loss = (1.0 * 10 - 1.0) + (2.0 * 10 - 2.0)
+      self.assertEqual(total_loss, session.run(estimator_spec.loss))
+
+      # derivative of loss = (1*c - 1) + (2*c - 2) is 3.
+      # new value of c = 10 - learning rate * 3 = 7.0.
+      session.run(estimator_spec.train_op)
+      with variable_scope.variable_scope('', reuse=True):
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertEqual(7.0, session.run(c))
+
+  def test_train_with_mean_reduction(self):
+    features = np.array([[1.0], [2.0]])
+    labels = np.array([[1.0], [2.0]])
+
+    with self.test_session() as session:
+      # Add another trainable variable that doesn't produce a gradient to
+      # verify that None gradients are supported.
+      _ = variable_scope.get_variable(
+          'another_variable',
+          initializer=constant_op.constant(1, dtype=dtypes.float64),
+          dtype=dtypes.float64)
+
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.MEAN),
+          devices=['/gpu:0', '/gpu:1'])
+      estimator_spec = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.TRAIN, self.params)
+      session.run(variables.global_variables_initializer())
+
+      # loss = feature * c - label
+      total_loss = ((1.0 * 10 - 1.0) + (2.0 * 10 - 2.0)) / 2.0
+      self.assertEqual(total_loss, session.run(estimator_spec.loss))
+
+      # derivative of loss = (1*c - 1)/2 + (2*c - 2)/2 is 1.5.
+      # It's the same computation as without mean reduction, but the
+      # loss from every tower is scaled by 1/<number of towers>.
+      # new value of c = 10 - learning rate * 1.5 = 8.5
+      session.run(estimator_spec.train_op)
+      with variable_scope.variable_scope('', reuse=True):
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertEqual(8.5, session.run(c))
+
+  def test_train_two_steps_collected_gradients_are_reset_between_steps(self):
+    with ops_lib.Graph().as_default():
+      features = array_ops.placeholder(dtypes.float64)
+      labels = array_ops.placeholder(dtypes.float64)
+
+      feature_inputs = np.array([[1.0], [2.0]]), np.array([[1.5], [2.5]])
+      label_inputs = np.array([[1.0], [2.0]]), np.array([[1.5], [2.5]])
+
+      # loss = feature * c - label
+      expected_losses = ((1.0 * 10 - 1.0) + (2.0 * 10 - 2.0),
+                         (1.5 * 7.0 - 1.5) + (2.5 * 7.0 - 2.5))
+      # Derivative of the loss is 1.0 + 2.0 for the first step and 1.5 + 2.5
+      # for the second.
+      expected_c = 10.0 - 3.0, 7.0 - 4.0
+
+      with self.test_session() as session, variable_scope.variable_scope(
+          '', reuse=variable_scope.AUTO_REUSE):
+        replicated_model_fn = replicate_model_fn._replicate_model_fn(
+            self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+            devices=['/gpu:0', '/gpu:1'])
+        estimator_spec = replicated_model_fn(
+            features, labels, model_fn_lib.ModeKeys.TRAIN, self.params)
+        session.run(variables.global_variables_initializer())
+
+        for feature_input, label_input, loss, weight in zip(
+            feature_inputs, label_inputs, expected_losses, expected_c):
+          feeds = {features: feature_input, labels: label_input}
+
+          self.assertEqual(loss, session.run(estimator_spec.loss, feeds))
+
+          session.run(estimator_spec.train_op, feeds)
+          c = variable_scope.get_variable('c', dtype=dtypes.float64)
+          self.assertEqual(weight, session.run(c, feeds))
+
+  def test_eval(self):
+    features = np.array([[0.01], [0.002]])
+    labels = np.array([[0.01], [0.02]])
+
+    with self.test_session() as session:
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+          devices=['/gpu:0', '/gpu:1'])
+      estimator_spec = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.EVAL, self.params)
+      session.run(variables.local_variables_initializer())
+      session.run(variables.global_variables_initializer())
+
+      accuracy, a = estimator_spec.eval_metric_ops['accuracy']
+      auc, b = estimator_spec.eval_metric_ops['auc']
+
+      session.run([a, b])
+      accuracy = session.run(accuracy)
+      auc = session.run(auc)
+
+      # loss[i] = features[i] * 10 - labels[i].
+      # Accuracy is 0.0 (no match) in the first tower.
+      # Accuracy is 1.0 (match) in the second tower, since the feature
+      # times weight "c" happened to be equal to the label.
+      total_loss = ((0.01 * 10 - 0.01) + (0.002 * 10 - 0.02))
+
+      self.assertNear((0.0 + 1.0) / 2.0, accuracy, 0.01)
+      self.assertEqual(0, auc)
+      self.assertNear(total_loss, session.run(estimator_spec.loss), 0.01)
+
+  def test_eval_with_mean_reduction(self):
+    features = np.array([[0.01], [0.002]])
+    labels = np.array([[0.01], [0.02]])
+
+    with self.test_session() as session:
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.MEAN),
+          devices=['/gpu:0', '/gpu:1'])
+      estimator_spec = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.EVAL, self.params)
+      session.run(variables.local_variables_initializer())
+      session.run(variables.global_variables_initializer())
+
+      accuracy, a = estimator_spec.eval_metric_ops['accuracy']
+      auc, b = estimator_spec.eval_metric_ops['auc']
+
+      session.run([a, b])
+      accuracy = session.run(accuracy)
+      auc = session.run(auc)
+
+      # loss[i] = features[i] * 10 - labels[i].
+      # Accuracy is 0.0 (no match) in the first tower.
+      # Accuracy is 1.0 (match) in the second tower, since the feature
+      # times weight "c" happened to be equal to the label.
+      total_loss = ((0.01 * 10 - 0.01) + (0.002 * 10 - 0.02)) / 2.0
+
+      self.assertNear((0.0 + 1.0) / 2.0, accuracy, 0.01)
+      self.assertEqual(0, auc)
+      self.assertNear(total_loss, session.run(estimator_spec.loss), 0.01)
+
+  def test_predict(self):
+    features = np.array([[0.01], [0.002]])
+    labels = np.array([[0.01], [0.02]])
+
+    with self.test_session() as session:
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+          devices=['/gpu:0', '/gpu:1'])
+      estimator_spec = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.PREDICT, self.params)
+      session.run(variables.global_variables_initializer())
+
+      self.assertAllClose({
+          'probabilities': np.array([[0.1], [0.02]])
+      }, session.run(estimator_spec.predictions))
+
+  def test_train_single_tower(self):
+    features = np.array([[1.0], [2.0]])
+    labels = np.array([[1.0], [2.0]])
+
+    with self.test_session() as session:
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+          devices=['/gpu:0'])
+      estimator_spec = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.TRAIN, self.params)
+      session.run(variables.global_variables_initializer())
+
+      # loss = feature * c - label
+      total_loss = (1.0 * 10 - 1.0) + (2.0 * 10 - 2.0)
+      self.assertEqual(total_loss, session.run(estimator_spec.loss))
+
+      # loss' of c is 3.
+      # new value of c = 10 - learning rate * 3 = 7.0.
+      session.run(estimator_spec.train_op)
+      with variable_scope.variable_scope('', reuse=True):
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertEqual(7.0, session.run(c))
+
+  def test_eval_single_tower(self):
+    features = np.array([[0.01], [0.002]])
+    labels = np.array([[0.01], [0.02]])
+
+    with self.test_session() as session:
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+          devices=['/gpu:0'])
+      estimator_spec = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.EVAL, self.params)
+      session.run(variables.local_variables_initializer())
+      session.run(variables.global_variables_initializer())
+
+      accuracy, a = estimator_spec.eval_metric_ops['accuracy']
+      auc, b = estimator_spec.eval_metric_ops['auc']
+
+      session.run([a, b])
+      accuracy = session.run(accuracy)
+      auc = session.run(auc)
+
+      # Accuracy is 0.0 (no match) for the first example.
+      # Accuracy is 1.0 (match) for the second example, since the feature
+      # times weight "c" happened to be equal to the label.
+      total_loss = ((0.01 * 10 - 0.01) + (0.002 * 10 - 0.02))
+
+      self.assertNear((0.0 + 1.0) / 2.0, accuracy, 0.01)
+      self.assertEqual(0, auc)
+      self.assertNear(total_loss, session.run(estimator_spec.loss), 0.01)
+
+  def test_predict_single_tower(self):
+    features = np.array([[0.01], [0.002]])
+    labels = np.array([[0.01], [0.02]])
+
+    with self.test_session() as session:
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+          devices=['/gpu:0'])
+      estimator_spec = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.PREDICT, self.params)
+      session.run(variables.global_variables_initializer())
+
+      self.assertAllClose({
+          'probabilities': np.array([[0.1], [0.02]])
+      }, session.run(estimator_spec.predictions))
+
+  def test_batch_size_that_is_not_divisible_by_the_number_of_gpus(self):
+    features = np.array([[1.0], [2.0], [3.0]])
+    labels = np.array([[1.0], [2.0], [3.0]])
+
+    with self.assertRaisesRegexp(
+        ValueError, '.*Batch.+size.+needs.+to.+be.+divisible.+by.+GPUs.+'):
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+          devices=['/gpu:0', '/gpu:1'])
+      _ = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.TRAIN, self.params)
+
+  def test_unsupported_loss_reduction(self):
+    features = np.array([[1.0], [2.0], [3.0]])
+    labels = np.array([[1.0], [2.0], [3.0]])
+
+    with self.assertRaisesRegexp(ValueError,
+                                 '.+none.+reduction.+is.+specified.+'):
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.NONE),
+          devices=['/gpu:0', '/gpu:1', '/gpu:2'])
+      _ = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.TRAIN, self.params)
+
+  def test_places_on_gpu_with_upper_case_spelling(self):
+    features = np.array([[0.01], [0.002]])
+    labels = np.array([[0.01], [0.02]])
+
+    with self.test_session():
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+          devices=['/GPU:0'])
+      _ = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.TRAIN, self.params)
+
+      with variable_scope.variable_scope('', reuse=True):
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertEqual('/device:GPU:0', c.device)
+
+  def test_places_on_gpu_with_lower_case_spelling(self):
+    features = np.array([[0.01], [0.002]])
+    labels = np.array([[0.01], [0.02]])
+
+    with self.test_session():
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+          devices=['/gpu:0'])
+      _ = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.TRAIN, self.params)
+
+      with variable_scope.variable_scope('', reuse=True):
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertEqual('/device:GPU:0', c.device)
+
+
+class ReplicateAcrossASingleDeviceWithoutTowerOptimizer(
+    test_util.TensorFlowTestCase):
+
+  def model_fn(self, mode, features, labels, params):
+    c = variable_scope.get_variable(
+        'c',
+        initializer=constant_op.constant(10, dtype=dtypes.float64),
+        dtype=dtypes.float64)
+
+    predictions = math_ops.multiply(features, c)
+
+    loss = losses.absolute_difference(
+        labels=labels, predictions=predictions, reduction=losses.Reduction.SUM)
+    loss = math_ops.reduce_sum(loss)
+
+    metrics = {
+        'accuracy': metrics_lib.accuracy(labels, predictions),
+        'auc': metrics_lib.auc(labels, predictions)
+    }
+
+    optimizer = gradient_descent.GradientDescentOptimizer(
+        params['learning_rate'])
+
+    return model_fn_lib.EstimatorSpec(
+        mode=mode,
+        loss=loss,
+        eval_metric_ops=metrics,
+        predictions={'probabilities': predictions},
+        train_op=optimizer.minimize(loss))
+
+  @property
+  def params(self):
+    params = {}
+    params['learning_rate'] = 1.0
+    return params
+
+  def test_train_single_tower(self):
+    features = np.array([[1.0], [2.0]])
+    labels = np.array([[1.0], [2.0]])
+
+    with self.test_session() as session:
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.model_fn, devices=['/gpu:0'])
+      estimator_spec = replicated_model_fn(
+          features, labels, model_fn_lib.ModeKeys.TRAIN, self.params)
+      session.run(variables.global_variables_initializer())
+
+      # loss = feature * c - label
+      total_loss = (1.0 * 10 - 1.0) + (2.0 * 10 - 2.0)
+      self.assertEqual(total_loss, session.run(estimator_spec.loss))
+
+      # loss' of c is 3.
+      # new value of c = 10 - learning rate * 3 = 7.0.
+      session.run(estimator_spec.train_op)
+      with variable_scope.variable_scope('', reuse=True):
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertEqual(7.0, session.run(c))
+
+
+class UseTowerEstimatorWithoutReplication(test_util.TensorFlowTestCase):
+
+  def model_fn(self, mode, features, labels, params):
+    c = variable_scope.get_variable(
+        'c',
+        initializer=constant_op.constant(10, dtype=dtypes.float64),
+        dtype=dtypes.float64)
+
+    features = features['features']
+    predictions = math_ops.multiply(features, c)
+
+    loss = losses.absolute_difference(
+        labels=labels, predictions=predictions, reduction=losses.Reduction.SUM)
+    loss = math_ops.reduce_sum(loss)
+
+    metrics = {
+        'accuracy': metrics_lib.accuracy(labels, predictions),
+        'auc': metrics_lib.auc(labels, predictions)
+    }
+
+    optimizer = replicate_model_fn._TowerOptimizer(
+        gradient_descent.GradientDescentOptimizer(params['learning_rate']))
+
+    return model_fn_lib.EstimatorSpec(
+        mode=mode,
+        loss=loss,
+        eval_metric_ops=metrics,
+        predictions={'probabilities': predictions},
+        train_op=optimizer.minimize(loss))
+
+  @property
+  def params(self):
+    params = {}
+    params['learning_rate'] = 1.0
+    return params
+
+  def test_train_single_tower(self):
+    features = np.array([[1.0], [2.0]])
+    labels = np.array([[1.0], [2.0]])
+
+    train_input_fn = numpy_io.numpy_input_fn(
+        x={'features': features}, y=labels, batch_size=2, shuffle=False)
+
+    with self.test_session():
+      estimator = estimator_lib.Estimator(
+          model_fn=self.model_fn,
+          model_dir=tempfile.mkdtemp(),
+          params=self.params)
+      estimator.train(train_input_fn, steps=1)
+
+      self.assertEqual(7.0, estimator.get_variable_value('c'))
+
+
+class MakeSureSyncReplicasOptimizerWorks(test_util.TensorFlowTestCase):
+
+  def model_fn(self, mode, features, labels, params):
+    c = variable_scope.get_variable(
+        'c',
+        initializer=constant_op.constant(10, dtype=dtypes.float64),
+        dtype=dtypes.float64)
+
+    features = features['features']
+    predictions = math_ops.multiply(features, c)
+
+    loss = losses.absolute_difference(
+        labels=labels, predictions=predictions, reduction=losses.Reduction.SUM)
+    loss = math_ops.reduce_sum(loss)
+
+    metrics = {
+        'accuracy': metrics_lib.accuracy(labels, predictions),
+        'auc': metrics_lib.auc(labels, predictions)
+    }
+
+    optimizer = gradient_descent.GradientDescentOptimizer(
+        params['learning_rate'])
+    optimizer = training.SyncReplicasOptimizer(
+        optimizer, replicas_to_aggregate=1)
+    sync_hook = optimizer.make_session_run_hook(True)
+    optimizer = replicate_model_fn._TowerOptimizer(
+        optimizer, loss_reduction=losses.Reduction.SUM)
+
+    return model_fn_lib.EstimatorSpec(
+        mode=mode,
+        loss=loss,
+        eval_metric_ops=metrics,
+        training_hooks=[sync_hook],
+        predictions={'probabilities': predictions},
+        train_op=optimizer.minimize(
+            loss, global_step=training.get_global_step()))
+
+  @property
+  def params(self):
+    params = {}
+    params['learning_rate'] = 1.0
+    return params
+
+  def test_train_multiple_towers(self):
+    features = np.array([[1.0], [2.0]])
+    labels = np.array([[1.0], [2.0]])
+
+    train_input_fn = numpy_io.numpy_input_fn(
+        x={'features': features}, y=labels, batch_size=2, shuffle=False)
+
+    model_fn = replicate_model_fn._replicate_model_fn(
+        self.model_fn,
+        devices=['/gpu:0', '/gpu:1'])
+
+    estimator = estimator_lib.Estimator(
+        model_fn=model_fn, model_dir=tempfile.mkdtemp(), params=self.params)
+    estimator.train(train_input_fn, steps=1)
+
+    self.assertEqual(7.0, estimator.get_variable_value('c'))
+
+
+class ReplicateWithTwoOptimizersTest(test_util.TensorFlowTestCase):
+
+  def model_fn(self, mode, features, labels, params):
+    c = variable_scope.get_variable(
+        'c',
+        initializer=constant_op.constant(10, dtype=dtypes.float64),
+        dtype=dtypes.float64)
+
+    side_effects = variable_scope.get_variable(
+        'side_effects',
+        initializer=constant_op.constant(0, dtype=dtypes.float64),
+        dtype=dtypes.float64,
+        use_resource=True,
+        trainable=False)
+
+    predictions = math_ops.multiply(features, c)
+
+    loss = losses.absolute_difference(
+        labels=labels, predictions=predictions, reduction=losses.Reduction.SUM)
+    loss = math_ops.reduce_sum(loss)
+
+    metrics = {
+        'accuracy': metrics_lib.accuracy(labels, predictions),
+        'auc': metrics_lib.auc(labels, predictions)
+    }
+
+    first_optimizer = replicate_model_fn._TowerOptimizer(
+        gradient_descent.GradientDescentOptimizer(1.0),
+        loss_reduction=losses.Reduction.SUM)
+    second_optimizer = replicate_model_fn._TowerOptimizer(
+        adam.AdamOptimizer(1.0), loss_reduction=losses.Reduction.SUM)
+
+    with ops_lib.control_dependencies([side_effects.assign_add(1.0)]):
+      first_grads_and_vars = first_optimizer.compute_gradients(loss)
+
+    train_op = control_flow_ops.group(
+        [first_optimizer.apply_gradients(first_grads_and_vars),
+         second_optimizer.minimize(loss)])
+
+    return model_fn_lib.EstimatorSpec(
+        mode=mode,
+        loss=loss,
+        eval_metric_ops=metrics,
+        predictions={'probabilities': predictions},
+        train_op=train_op)
+
+  def test_train(self):
+    features = np.array([[1.0], [2.0]])
+    labels = np.array([[1.0], [2.0]])
+
+    with self.test_session() as session:
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.model_fn,
+          devices=['/gpu:0', '/gpu:1'])
+      estimator_spec = replicated_model_fn(features, labels,
+                                           model_fn_lib.ModeKeys.TRAIN, {})
+      session.run(variables.global_variables_initializer())
+
+      # loss = feature * c - label
+      total_loss = (1.0 * 10 - 1.0) + (2.0 * 10 - 2.0)
+      self.assertEqual(total_loss, session.run(estimator_spec.loss))
+
+      # loss' of c is 3.
+      # new value of c = 10 - learning rate * 3 = 7.0.
+      # Adam subtracts another ~1.
+      session.run(estimator_spec.train_op)
+      with variable_scope.variable_scope('', reuse=True):
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertNear(6.0, session.run(c), 0.000001)
+
+        side_effects = variable_scope.get_variable(
+            'side_effects', dtype=dtypes.float64)
+        self.assertNear(2.0, session.run(side_effects), 0.000001)
+
+
+class ReplicateWithTwoLossesAndOneOptimizer(test_util.TensorFlowTestCase):
+
+  def setUp(self):
+    self._should_skip_optimizer = False
+    self._towers_left_before_skipping_optimizer = -1
+
+  def incorrectly_skip_optimizer_for_tower(self, tower_number):
+    self._should_skip_optimizer = True
+    self._towers_left_before_skipping_optimizer = tower_number
+
+  def should_skip_optimizer(self):
+    if not self._should_skip_optimizer:
+      return False
+    if self._towers_left_before_skipping_optimizer == 0:
+      return True
+    else:
+      self._towers_left_before_skipping_optimizer -= 1
+      return False
+
+  def model_fn(self, mode, features, labels, params):
+    c = variable_scope.get_variable(
+        'c',
+        initializer=constant_op.constant(10, dtype=dtypes.float64),
+        dtype=dtypes.float64)
+    d = variable_scope.get_variable(
+        'd',
+        initializer=constant_op.constant(2, dtype=dtypes.float64),
+        dtype=dtypes.float64)
+
+    predictions = math_ops.multiply(features, c)
+
+    loss = losses.absolute_difference(
+        labels=labels, predictions=predictions, reduction=losses.Reduction.SUM)
+    loss = math_ops.reduce_sum(loss)
+
+    another_predictions = math_ops.multiply(features, d)
+    another_loss = losses.absolute_difference(
+        labels=labels,
+        predictions=another_predictions,
+        reduction=losses.Reduction.SUM)
+    another_loss = math_ops.reduce_sum(another_loss)
+
+    total_loss = math_ops.add(loss, another_loss)
+
+    metrics = {
+        'accuracy': metrics_lib.accuracy(labels, predictions),
+        'auc': metrics_lib.auc(labels, predictions)
+    }
+
+    train_ops = []
+
+    optimizer = replicate_model_fn._TowerOptimizer(
+        gradient_descent.GradientDescentOptimizer(1.0),
+        loss_reduction=losses.Reduction.SUM)
+    train_ops.append(optimizer.minimize(loss, var_list=[c]))
+    if not self.should_skip_optimizer():
+      another_optimizer = replicate_model_fn._TowerOptimizer(
+          gradient_descent.GradientDescentOptimizer(1.0),
+          loss_reduction=losses.Reduction.SUM)
+      train_ops.append(another_optimizer.minimize(another_loss, var_list=[d]))
+
+    train_op = control_flow_ops.group(train_ops)
+    return model_fn_lib.EstimatorSpec(
+        mode=mode,
+        loss=total_loss,
+        eval_metric_ops=metrics,
+        predictions={'probabilities': predictions},
+        train_op=train_op)
+
+  def test_train(self):
+    features = np.array([[1.0], [2.0]])
+    labels = np.array([[1.0], [2.0]])
+
+    with ops_lib.Graph().as_default(), self.test_session() as session:
+      replicated_model_fn = replicate_model_fn._replicate_model_fn(
+          self.model_fn,
+          devices=['/gpu:0', '/gpu:1'])
+      estimator_spec = replicated_model_fn(features, labels,
+                                           model_fn_lib.ModeKeys.TRAIN, {})
+      session.run(variables.global_variables_initializer())
+
+      # For each tower, loss = (feature * c - label) + (feature * d - label).
+      total_loss = (1.0 * 10 - 1.0 + 1.0 * 2.0 - 1.0) + (
+          2.0 * 10 - 2.0 + 2.0 * 2.0 - 2.0)
+      self.assertEqual(total_loss, session.run(estimator_spec.loss))
+
+      session.run(estimator_spec.train_op)
+
+      # loss' of c or loss' of d is 3.
+      # new value of c = 10 - learning rate * 3 = 7.0.
+      # new value of d = 2  - learning rate * 3 = -1.0.
+      with variable_scope.variable_scope('', reuse=True):
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertNear(7.0, session.run(c), 0.000001)
+        d = variable_scope.get_variable('d', dtype=dtypes.float64)
+        self.assertNear(-1.0, session.run(d), 0.000001)
+
+  def test_different_optimizer_calls_within_towers(self):
+    self.incorrectly_skip_optimizer_for_tower(1)
+
+    features = np.array([[1.0], [2.0]])
+    labels = np.array([[1.0], [2.0]])
+
+    with self.test_session(), ops_lib.Graph().as_default():
+      with self.assertRaisesRegexp(
+          ValueError, '.+was.+supposed.+to.+make.+same.+optimizer.+calls.+'):
+        replicated_model_fn = replicate_model_fn._replicate_model_fn(
+            self.model_fn, devices=['/gpu:0', '/gpu:1'])
+        _ = replicated_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN,
+                                {})
+
+
+class FailToWrapOptimizerInTheModelFn(test_util.TensorFlowTestCase):
+
+  def model_fn(self, mode, features, labels, params):
+    c = variable_scope.get_variable(
+        'c',
+        initializer=constant_op.constant(10, dtype=dtypes.float64),
+        dtype=dtypes.float64)
+
+    predictions = math_ops.multiply(features, c)
+
+    loss = losses.absolute_difference(
+        labels=labels, predictions=predictions, reduction=losses.Reduction.SUM)
+    loss = math_ops.reduce_sum(loss)
+
+    metrics = {
+        'accuracy': metrics_lib.accuracy(labels, predictions),
+        'auc': metrics_lib.auc(labels, predictions)
+    }
+
+    optimizer = gradient_descent.GradientDescentOptimizer(1.0)
+    train_op = optimizer.minimize(loss)
+
+    return model_fn_lib.EstimatorSpec(
+        mode=mode,
+        loss=loss,
+        eval_metric_ops=metrics,
+        predictions={'probabilities': predictions},
+        train_op=train_op)
+
+  def test_train(self):
+    features = np.array([[1.0], [2.0]])
+    labels = np.array([[1.0], [2.0]])
+
+    with self.test_session():
+      with self.assertRaisesRegexp(ValueError,
+                                   'Please.+wrap.+with.+TowerOptimizer'):
+        replicated_model_fn = replicate_model_fn._replicate_model_fn(
+            self.model_fn, devices=['/gpu:0', '/gpu:1'])
+        _ = replicated_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN,
+                                {})
+
+
+class GetLossTowersTest(test_util.TensorFlowTestCase):
+
+  def create_model_fn_with_loss_reduction(self, loss_reduction):
+
+    def model_fn(mode, features, labels, params):
+      del params
+      c = variable_scope.get_variable(
+          'c',
+          initializer=constant_op.constant(0.25, dtype=dtypes.float64),
+          dtype=dtypes.float64)
+
+      predictions = math_ops.add(np.array([0.1, 0.2, 0.3, features[0]]), c)
+      labels = np.array([0.1, 0.2, 0.3, labels[0]])
+
+      loss = losses.absolute_difference(
+          labels=labels,
+          predictions=predictions,
+          reduction=losses.Reduction.SUM)
+
+      optimizer = replicate_model_fn._TowerOptimizer(
+          gradient_descent.GradientDescentOptimizer(1.0),
+          loss_reduction)
+
+      return model_fn_lib.EstimatorSpec(
+          mode=mode,
+          loss=math_ops.reduce_sum(loss),
+          train_op=optimizer.minimize(loss))
+
+    return model_fn
+
+  def test_gradients_are_computed(self):
+    with self.test_session() as session:
+      tower_specs = replicate_model_fn._get_loss_towers(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.SUM),
+          mode=None,
+          features=[[0.6], [1.6]],
+          labels=[[0.6], [0.6]],
+          params=None,
+          config=None,
+          devices=['/gpu:0', '/gpu:1'],
+          local_ps_devices=['/gpu:0'],
+          name_scope_pattern='test_tower_{}')
+      session.run(variables.global_variables_initializer())
+
+      self.assertEqual(len(tower_specs), 2)
+
+      self.assertEqual('/device:GPU:0', tower_specs[0].loss.device)
+      self.assertEqual('Sum:0', tower_specs[0].loss.name)
+      self.assertEqual(1.0, session.run(tower_specs[0].loss))
+
+      self.assertEqual('/device:GPU:1', tower_specs[1].loss.device)
+      self.assertEqual('test_tower_1/Sum:0', tower_specs[1].loss.name)
+      # The second tower's feature is 1.0 bigger (1.6 vs 0.6), so its
+      # loss is also 1.0 bigger.
+      self.assertEqual(2.0, session.run(tower_specs[1].loss))
+
+      self.assertEqual(1, len(variables.global_variables()))
+      self.assertEqual(1, len(variables.trainable_variables()))
+
+      with variable_scope.variable_scope('', reuse=True):
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertEqual(0.25, session.run(c))
+
+  def test_gradients_are_computed_with_mean_reduction(self):
+    with self.test_session() as session:
+      tower_specs = replicate_model_fn._get_loss_towers(
+          self.create_model_fn_with_loss_reduction(losses.Reduction.MEAN),
+          mode=model_fn_lib.ModeKeys.EVAL,
+          features=[[0.6], [1.6]],
+          labels=[[0.6], [0.6]],
+          params=None,
+          config=None,
+          devices=['/gpu:0', '/gpu:1'],
+          local_ps_devices=['/gpu:0'],
+          name_scope_pattern='test_tower_{}')
+      session.run(variables.global_variables_initializer())
+
+      self.assertEqual(len(tower_specs), 2)
+
+      self.assertEqual('/device:GPU:0', tower_specs[0].loss.device)
+      self.assertEqual('averaged_loss:0', tower_specs[0].loss.name)
+      self.assertEqual(0.5, session.run(tower_specs[0].loss))
+
+      self.assertEqual('/device:GPU:1', tower_specs[1].loss.device)
+      self.assertEqual('test_tower_1/averaged_loss:0', tower_specs[1].loss.name)
+      # The second tower's feature is 1.0 bigger (1.6 vs 0.6), so its
+      # averaged loss is 0.5 bigger: 1.0 vs 0.5.
+      self.assertEqual(1.0, session.run(tower_specs[1].loss))
+
+      self.assertEqual(1, len(variables.global_variables()))
+      self.assertEqual(1, len(variables.trainable_variables()))
+
+      with variable_scope.variable_scope('', reuse=True):
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertEqual(0.25, session.run(c))
+
+  def test_variables_are_round_robined_correctly(self):
+    """Test that creates multiple variables and tests round-robin placement."""
+
+    def model_fn(mode, features, labels, params):
+      del params
+      for variable_name in ['a', 'b', 'c', 'd']:
+        c = variable_scope.get_variable(
+            variable_name,
+            initializer=constant_op.constant(0.25, dtype=dtypes.float64),
+            dtype=dtypes.float64)
+
+      predictions = math_ops.add(np.array([0.1, 0.2, 0.3, features[0]]), c)
+      labels = np.array([0.1, 0.2, 0.3, labels[0]])
+      loss = losses.absolute_difference(
+          labels=labels,
+          predictions=predictions,
+          reduction=losses.Reduction.SUM)
+      return model_fn_lib.EstimatorSpec(
+          mode=mode, loss=math_ops.reduce_sum(loss))
+
+    with self.test_session() as session:
+      tower_specs = replicate_model_fn._get_loss_towers(
+          model_fn,
+          mode=None,
+          features=[[0.6], [1.6], [2.6]],
+          labels=[[0.6], [0.6], [2.6]],
+          params=None,
+          config=None,
+          devices=['/gpu:0', '/gpu:1', '/gpu:3'],
+          local_ps_devices=['/gpu:0', '/gpu:1', '/gpu:3'],
+          name_scope_pattern='test_tower_{}')
+      session.run(variables.global_variables_initializer())
+
+      self.assertEqual(len(tower_specs), 3)
+      self.assertEqual('/device:GPU:0', tower_specs[0].loss.device)
+      self.assertEqual('/device:GPU:1', tower_specs[1].loss.device)
+      self.assertEqual('/device:GPU:3', tower_specs[2].loss.device)
+
+      with variable_scope.variable_scope('', reuse=True):
+        a = variable_scope.get_variable('a', dtype=dtypes.float64)
+        self.assertEqual('/device:GPU:0', a.device)
+        b = variable_scope.get_variable('b', dtype=dtypes.float64)
+        self.assertEqual('/device:GPU:1', b.device)
+        c = variable_scope.get_variable('c', dtype=dtypes.float64)
+        self.assertEqual('/device:GPU:3', c.device)
+        d = variable_scope.get_variable('d', dtype=dtypes.float64)
+        self.assertEqual('/device:GPU:0', d.device)
+
+
+class SplitBatchTest(test_util.TensorFlowTestCase):
+
+  def evaluate_shards(self, first_list, second_list):
+    evaluate_items = lambda x: x.eval()
+    return list(map(evaluate_items, first_list)), list(
+        map(evaluate_items, second_list))
+
+  def assertSparseValuesEqual(self, a, b):
+    self.assertAllEqual(a.indices, b.indices)
+    self.assertAllEqual(a.values, b.values)
+    self.assertAllEqual(a.dense_shape, b.dense_shape)
+
+  def test_simple_half_split(self):
+    with self.test_session():
+      features = [0.0, 1.0, 2.0, 3.0]
+      labels = [10.0, 11.0, 12.0, 13.0]
+      feature_shards, label_shards = replicate_model_fn._split_batch(
+          features, labels, 2, device='/gpu:0')
+
+      feature_shards, label_shards = self.evaluate_shards(
+          feature_shards, label_shards)
+
+      self.assertAllEqual([[0.0, 1.0], [2.0, 3.0]], feature_shards)
+      self.assertAllEqual([[10.0, 11.0], [12.0, 13.0]], label_shards)
+
+  def test_to_each_their_own(self):
+    with self.test_session():
+      features = [0.0, 1.0, 2.0, 3.0]
+      labels = [10.0, 11.0, 12.0, 13.0]
+      feature_shards, label_shards = replicate_model_fn._split_batch(
+          features, labels, 4, device='/gpu:0')
+
+      feature_shards, label_shards = self.evaluate_shards(
+          feature_shards, label_shards)
+
+      self.assertAllEqual([[0.0], [1.0], [2.0], [3.0]], feature_shards)
+      self.assertAllEqual([[10.0], [11.0], [12.0], [13.0]], label_shards)
+
+  def test_one_batch(self):
+    with self.test_session():
+      features = [0.0, 1.0, 2.0, 3.0]
+      labels = [10.0, 11.0, 12.0, 13.0]
+      feature_shards, label_shards = replicate_model_fn._split_batch(
+          features, labels, 1, device='/gpu:0')
+
+      feature_shards, label_shards = self.evaluate_shards(
+          feature_shards, label_shards)
+
+      self.assertAllEqual([[0.0, 1.0, 2.0, 3.0]], feature_shards)
+      self.assertAllEqual([[10.0, 11.0, 12.0, 13.0]], label_shards)
+
+  def test_half_split_in_dictionary(self):
+    with self.test_session():
+      features = {'first': [0.0, 1.0, 2.0, 3.0], 'second': [4.0, 5.0, 6.0, 7.0]}
+      labels = [10.0, 11.0, 12.0, 13.0]
+
+      feature_shards, label_shards = replicate_model_fn._split_batch(
+          features, labels, 2, device='/gpu:0')
+
+      self.assertAllEqual([0.0, 1.0], feature_shards[0]['first'].eval())
+      self.assertAllEqual([4.0, 5.0], feature_shards[0]['second'].eval())
+      self.assertAllEqual([2.0, 3.0], feature_shards[1]['first'].eval())
+      self.assertAllEqual([6.0, 7.0], feature_shards[1]['second'].eval())
+      self.assertAllEqual([10.0, 11.0], label_shards[0].eval())
+      self.assertAllEqual([12.0, 13.0], label_shards[1].eval())
+
+  def test_sparse_tensor_can_be_split_unevenly(self):
+    with self.test_session():
+      features = {
+          'x':
+              sparse_tensor.SparseTensor(
+                  indices=[[0, 0], [1, 2], [2, 2]],
+                  values=[1.0, 2.0, 3.0],
+                  dense_shape=[3, 4])
+      }
+      labels = np.array([[1.0], [2.0]])
+
+      feature_shards, label_shards = replicate_model_fn._split_batch(
+          features, labels, 2, device='/gpu:0')
+
+      self.assertSparseValuesEqual(
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [1, 2]], values=[1., 2.], dense_shape=[2, 4]),
+          feature_shards[0]['x'].eval())
+      self.assertSparseValuesEqual(
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 2]], values=[3.], dense_shape=[1, 4]),
+          feature_shards[1]['x'].eval())
+      self.assertAllEqual([[1.0]], label_shards[0].eval())
+      self.assertAllEqual([[2.0]], label_shards[1].eval())
+
+  def test_sparse_tensor_can_be_split_unevenly_repeated_row(self):
+    with self.test_session():
+      features = {
+          'x':
+              sparse_tensor.SparseTensor(
+                  indices=[[0, 0], [1, 0], [1, 1]],
+                  values=[1.0, 2.0, 3.0],
+                  dense_shape=[3, 4])
+      }
+      labels = np.array([[1.0], [2.0]])
+
+      feature_shards, label_shards = replicate_model_fn._split_batch(
+          features, labels, 2, device='/gpu:0')
+
+      self.assertSparseValuesEqual(
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [1, 0], [1, 1]],
+              values=[1., 2., 3.],
+              dense_shape=[2, 4]), feature_shards[0]['x'].eval())
+
+      second_batch = feature_shards[1]['x'].eval()
+      self.assertFalse(len(second_batch.indices))
+      self.assertFalse(len(second_batch.values))
+      self.assertAllEqual([1, 4], second_batch.dense_shape)
+      self.assertAllEqual([[1.0]], label_shards[0].eval())
+      self.assertAllEqual([[2.0]], label_shards[1].eval())
+
+  def test_one_batch_in_dictionary(self):
+    with self.test_session() as session:  # pylint: disable=unused-variable
+      features = {'first': [0.0, 1.0, 2.0, 3.0], 'second': [4.0, 5.0, 6.0, 7.0]}
+      labels = [10.0, 11.0, 12.0, 13.0]
+
+      feature_shards, label_shards = replicate_model_fn._split_batch(
+          features, labels, 1, device='/gpu:0')
+
+      self.assertAllEqual([0.0, 1.0, 2.0, 3.0],
+                          feature_shards[0]['first'].eval())
+      self.assertAllEqual([4.0, 5.0, 6.0, 7.0],
+                          feature_shards[0]['second'].eval())
+      self.assertAllEqual([10.0, 11.0, 12.0, 13.0], label_shards[0].eval())
+
+  def test_feature_and_label_dictionaries(self):
+    with self.test_session() as session:  # pylint: disable=unused-variable
+      features = {'first': [0.0, 1.0, 2.0, 3.0], 'second': [4.0, 5.0, 6.0, 7.0]}
+      labels = {'first': [10.0, 11.0], 'second': [12.0, 13.0]}
+
+      feature_shards, label_shards = replicate_model_fn._split_batch(
+          features, labels, 2, device='/gpu:0')
+
+      self.assertAllEqual([0.0, 1.0], feature_shards[0]['first'].eval())
+      self.assertAllEqual([4.0, 5.0], feature_shards[0]['second'].eval())
+      self.assertAllEqual([2.0, 3.0], feature_shards[1]['first'].eval())
+      self.assertAllEqual([6.0, 7.0], feature_shards[1]['second'].eval())
+      self.assertAllEqual([10.0], label_shards[0]['first'].eval())
+      self.assertAllEqual([12.0], label_shards[0]['second'].eval())
+      self.assertAllEqual([11.0], label_shards[1]['first'].eval())
+      self.assertAllEqual([13.0], label_shards[1]['second'].eval())
+
+
+class TrainSpecTest(test_util.TensorFlowTestCase):
+
+  expected_predictions = {}
+
+  def create_estimator_spec(self, loss):
+    return model_fn_lib.EstimatorSpec(
+        mode=model_fn_lib.ModeKeys.TRAIN,
+        loss=loss,
+        train_op=loss,  # Not used; currently required.
+        predictions=self.expected_predictions)
+
+  def create_constant_loss(self, loss_value):
+    return constant_op.constant(loss_value, dtype=dtypes.float64)
+
+  def test_example(self):
+    with self.test_session() as session:
+      tower_losses = list(map(self.create_constant_loss, [2, 4, 6]))
+      tower_specs = list(map(self.create_estimator_spec, tower_losses))
+
+      expected_train_op = tower_losses[1]
+
+      estimator_spec = replicate_model_fn._train_spec(
+          tower_specs, expected_train_op, aggregation_device='/gpu:0')
+
+      self.assertEqual(expected_train_op, estimator_spec.train_op)
+      self.assertEqual(2 + 4 + 6, session.run(estimator_spec.loss))
+      self.assertEqual(self.expected_predictions, estimator_spec.predictions)
+
+
+class EvalSpecTest(test_util.TensorFlowTestCase):
+
+  def create_estimator_spec(self, loss, metrics):
+    return model_fn_lib.EstimatorSpec(
+        mode=model_fn_lib.ModeKeys.EVAL, loss=loss, eval_metric_ops=metrics)
+
+  def create_constant_loss(self, loss_value):
+    return constant_op.constant(loss_value, dtype=dtypes.float64)
+
+  def create_eval_metrics(self, noise):
+    predictions = np.array([0.1, 0.2, 0.3, 0.6 + noise])
+    labels = np.array([0.1, 0.2, 0.3, 0.6])
+
+    metrics = {
+        'accuracy': metrics_lib.accuracy(labels, predictions),
+        'auc': metrics_lib.auc(labels, predictions)
+    }
+    return metrics
+
+  def test_example(self):
+    with self.test_session() as session:
+      tower_losses = map(self.create_constant_loss, [2, 4, 6])
+      tower_metrics = map(self.create_eval_metrics, [0, 0.2, 0.3])
+      tower_specs = [
+          self.create_estimator_spec(l, m)
+          for l, m in zip(tower_losses, tower_metrics)
+      ]
+      session.run(variables.local_variables_initializer())
+
+      estimator_spec = replicate_model_fn._eval_spec(
+          tower_specs, aggregation_device='/device:GPU:0')
+
+      accuracy, a = estimator_spec.eval_metric_ops['accuracy']
+      auc, b = estimator_spec.eval_metric_ops['auc']
+
+      self.assertEqual('/device:CPU:0', accuracy.device)
+      self.assertEqual('/device:CPU:0', auc.device)
+
+      session.run([a, b])
+      accuracy, auc = session.run([accuracy, auc])
+
+      self.assertNear((12 - 2) / 12, accuracy, 0.01)
+      self.assertEqual(0, auc)
+      self.assertEqual(2 + 4 + 6, session.run(estimator_spec.loss))
+
+  def test_handles_single_tower(self):
+    with self.test_session() as session:
+      tower_losses = map(self.create_constant_loss, [5])
+      tower_metrics = map(self.create_eval_metrics, [0.2])
+      tower_specs = [
+          self.create_estimator_spec(l, m)
+          for l, m in zip(tower_losses, tower_metrics)
+      ]
+      session.run(variables.local_variables_initializer())
+
+      estimator_spec = replicate_model_fn._eval_spec(
+          tower_specs, aggregation_device='/device:GPU:0')
+
+      accuracy, a = estimator_spec.eval_metric_ops['accuracy']
+      auc, b = estimator_spec.eval_metric_ops['auc']
+
+      self.assertEqual('/device:CPU:0', accuracy.device)
+      self.assertEqual('/device:CPU:0', auc.device)
+
+      session.run([a, b])
+      accuracy = session.run(accuracy)
+      auc = session.run(auc)
+
+      self.assertNear((4 - 1) / 4, accuracy, 0.01)
+      self.assertEqual(0, auc)
+      self.assertEqual(5, session.run(estimator_spec.loss))
+
+
+class PredictSpecTest(test_util.TensorFlowTestCase):
+
+  def model_fn(self, mode, features, labels, params):
+    c = variable_scope.get_variable(
+        'c',
+        initializer=constant_op.constant(0.25, dtype=dtypes.float64),
+        dtype=dtypes.float64)
+
+    predictions = math_ops.add(np.array([features[0], features[0]]), c)
+
+    return model_fn_lib.EstimatorSpec(
+        mode=model_fn_lib.ModeKeys.PREDICT,
+        predictions={
+            'probabilities': predictions
+        })
+
+  def test_example(self):
+    with self.test_session() as session:
+      tower_specs = replicate_model_fn._get_loss_towers(
+          self.model_fn,
+          mode=None,
+          features=[[0.1], [0.2]],
+          labels=[[], []],
+          params=None,
+          config=None,
+          devices=['/gpu:0', '/gpu:1'],
+          local_ps_devices=['/gpu:0'],
+      )
+      session.run(variables.global_variables_initializer())
+
+      estimator_spec = replicate_model_fn._predict_spec(
+          tower_specs, aggregation_device='/gpu:0')
+
+      self.assertEqual('/device:GPU:0',
+                       estimator_spec.predictions['probabilities'].device)
+      self.assertAllClose({
+          'probabilities': np.array([0.35, 0.35, 0.45, 0.45])
+      }, session.run(estimator_spec.predictions))
+
+
+class ReduceMetricVariablesTest(test_util.TensorFlowTestCase):
+
+  def create_metric_variable(self, initial_value, name):
+    return variable_scope.variable(
+        initial_value,
+        trainable=False,
+        collections=[ops_lib.GraphKeys.METRIC_VARIABLES],
+        validate_shape=True,
+        name=name)
+
+  def create_tower_metrics(self, tower_id):
+    with variable_scope.variable_scope('', reuse=(tower_id != 0)):
+      self.create_metric_variable(1.3 * (tower_id + 1), 'total')
+      self.create_metric_variable(2.3 * (tower_id + 1), 'count')
+      self.create_metric_variable(
+          np.array([3.3, 3.5, 3.7]) * (tower_id + 1), 'total')
+
+  def test_example(self):
+    with self.test_session() as session:
+      for tower_id in range(3):
+        self.create_tower_metrics(tower_id)
+
+      session.run(
+          variables.variables_initializer(
+              ops_lib.get_collection(ops_lib.GraphKeys.METRIC_VARIABLES)))
+
+      session.run(
+          replicate_model_fn._reduce_metric_variables(number_of_towers=3))
+
+      # 1st tower = 1.3, 2.3,  [3.3, 3.5, 3.7]
+      # 2nd tower = 2.6, 4.6,  [6.6, 7.0, 7.4]
+      # 3rd tower = 3.9, 6.9,  [9.9, 10.5, 11.1]
+      # Reduced =   7.8, 13.8, [19.8, 21.0, 22.2]
+      # Towers are accumulated in the first tower.
+      local_metrics = session.run(
+          ops_lib.get_collection(ops_lib.GraphKeys.METRIC_VARIABLES))
+
+      self.assertNear(7.8, local_metrics[0], 0.01)
+      self.assertNear(13.8, local_metrics[1], 0.01)
+      self.assertAllClose([19.8, 21., 22.2], local_metrics[2], 0.01)
+      self.assertNear(0.0, local_metrics[3], 0.01)
+      self.assertNear(0.0, local_metrics[4], 0.01)
+      self.assertAllClose([0.0, 0.0, 0.0], local_metrics[5], 0.01)
+      self.assertNear(0.0, local_metrics[6], 0.01)
+      self.assertNear(0.0, local_metrics[7], 0.01)
+      self.assertAllClose([0.0, 0.0, 0.0], local_metrics[8], 0.01)
+
+  def test_reduce_is_idempotent(self):
+    with self.test_session() as session:
+      for tower_id in range(3):
+        self.create_tower_metrics(tower_id)
+
+      session.run(
+          variables.variables_initializer(
+              ops_lib.get_collection(ops_lib.GraphKeys.METRIC_VARIABLES)))
+
+      for _ in range(20):
+        session.run(
+            replicate_model_fn._reduce_metric_variables(number_of_towers=3))
+
+      local_metrics = session.run(
+          ops_lib.get_collection(ops_lib.GraphKeys.METRIC_VARIABLES))
+
+      self.assertNear(7.8, local_metrics[0], 0.01)
+      self.assertNear(13.8, local_metrics[1], 0.01)
+      self.assertAllClose([19.8, 21., 22.2], local_metrics[2], 0.01)
+      self.assertNear(0.0, local_metrics[3], 0.01)
+      self.assertNear(0.0, local_metrics[4], 0.01)
+      self.assertAllClose([0.0, 0.0, 0.0], local_metrics[5], 0.01)
+      self.assertNear(0.0, local_metrics[6], 0.01)
+      self.assertNear(0.0, local_metrics[7], 0.01)
+      self.assertAllClose([0.0, 0.0, 0.0], local_metrics[8], 0.01)
+
+  def test_handles_single_tower(self):
+    with self.test_session() as session:
+      self.create_tower_metrics(0)
+      session.run(
+          variables.variables_initializer(
+              ops_lib.get_collection(ops_lib.GraphKeys.METRIC_VARIABLES)))
+
+      session.run(
+          replicate_model_fn._reduce_metric_variables(number_of_towers=1))
+
+      local_metrics = session.run(
+          ops_lib.get_collection(ops_lib.GraphKeys.METRIC_VARIABLES))
+
+      self.assertNear(1.3, local_metrics[0], 0.01)
+      self.assertNear(2.3, local_metrics[1], 0.01)
+      self.assertAllClose([3.3, 3.5, 3.7], local_metrics[2], 0.01)
+
+  def test_doesnt_accept_uneven_number_of_variables(self):
+    with self.test_session() as session:
+      for tower_id in range(3):
+        self.create_tower_metrics(tower_id)
+      self.create_metric_variable(-1.0, 'oddball')
+
+      session.run(
+          variables.variables_initializer(
+              ops_lib.get_collection(ops_lib.GraphKeys.METRIC_VARIABLES)))
+
+      with self.assertRaisesRegexp(
+          ValueError, '.+Expected.+local.+variables.+but.+got.+instead.+'):
+        session.run(
+            replicate_model_fn._reduce_metric_variables(number_of_towers=3))
+
+
+class MergeExportOutputsTest(test_util.TensorFlowTestCase):
+
+  def model_fn(self, mode, features, labels, params):
+    c = variable_scope.get_variable(
+        'c',
+        initializer=constant_op.constant(10, dtype=dtypes.float64),
+        dtype=dtypes.float64)
+
+    predictions = {'probabilities': math_ops.multiply(features, c)}
+    loss = losses.absolute_difference(
+        labels=labels,
+        predictions=predictions['probabilities'],
+        reduction=losses.Reduction.SUM)
+
+    metrics = {
+        'accuracy': metrics_lib.accuracy(labels, predictions['probabilities']),
+        'auc': metrics_lib.auc(labels, predictions['probabilities'])
+    }
+    tensor_string_repr = str(features)
+    classes = constant_op.constant(
+        re.search('(split_inputs/split:[0-9])', tensor_string_repr).group(1),
+        dtype=dtypes.string)
+
+    export_outputs = {
+        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
+            export_output.PredictOutput(predictions),
+        'classification_output':
+            export_output.ClassificationOutput(predictions['probabilities'],
+                                               classes),
+        'classification_scores':
+            export_output.ClassificationOutput(
+                scores=predictions['probabilities']),
+        'classification_classes':
+            export_output.ClassificationOutput(classes=classes),
+        'regression_output':
+            export_output.RegressionOutput(predictions['probabilities']),
+    }
+
+    return model_fn_lib.EstimatorSpec(
+        mode=mode,
+        loss=math_ops.reduce_sum(loss),
+        eval_metric_ops=metrics,
+        predictions=predictions,
+        export_outputs=export_outputs)
+
+  def replicate_estimator_spec(self, session):
+    features = np.array([0.01, 0.002])
+    labels = np.array([0.01, 0.02])
+
+    replicated_model_fn = replicate_model_fn._replicate_model_fn(
+        self.model_fn, devices=['/gpu:0', '/gpu:1'])
+    estimator_spec = replicated_model_fn(features, labels,
+                                         model_fn_lib.ModeKeys.PREDICT, {})
+    session.run(variables.global_variables_initializer())
+    return estimator_spec
+
+  def test_merge_predict_output(self):
+    with self.test_session() as session:
+      estimator_spec = self.replicate_estimator_spec(session)
+      self.assertAllClose(
+          {
+              'probabilities': np.array([0.1, 0.02])
+          },
+          session.run(estimator_spec.export_outputs[
+              signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY].outputs))
+
+  def test_merge_classification_output_scores_classes(self):
+    with self.test_session() as session:
+      estimator_spec = self.replicate_estimator_spec(session)
+      self.assertAllClose(
+          [0.1, 0.02],
+          session.run(
+              estimator_spec.export_outputs['classification_output'].scores))
+      self.assertAllEqual(
+          [b'split_inputs/split:0', b'split_inputs/split:1'],
+          session.run(
+              estimator_spec.export_outputs['classification_output'].classes))
+
+  def test_merge_classification_output_scores(self):
+    with self.test_session() as session:
+      estimator_spec = self.replicate_estimator_spec(session)
+      self.assertAllClose(
+          [0.1, 0.02],
+          session.run(
+              estimator_spec.export_outputs['classification_scores'].scores))
+      self.assertEqual(
+          None, estimator_spec.export_outputs['classification_scores'].classes)
+
+  def test_merge_classification_output_classes(self):
+    with self.test_session() as session:
+      estimator_spec = self.replicate_estimator_spec(session)
+      self.assertAllEqual(
+          [b'split_inputs/split:0', b'split_inputs/split:1'],
+          session.run(
+              estimator_spec.export_outputs['classification_classes'].classes))
+      self.assertEqual(
+          None, estimator_spec.export_outputs['classification_classes'].scores)
+
+  def test_merge_regression_output(self):
+    with self.test_session() as session:
+      estimator_spec = self.replicate_estimator_spec(session)
+      self.assertAllClose(
+          [0.1, 0.02],
+          session.run(estimator_spec.export_outputs['regression_output'].value))
+
+
+class GetLocalDevicesTest(test_util.TensorFlowTestCase):
+
+  def test_there_is_at_least_a_cpu(self):
+    self.assertTrue(replicate_model_fn._get_local_devices('CPU'))
+
+  def test_there_is_no_xpu(self):
+    self.assertFalse(
+        replicate_model_fn._get_local_devices('XPU'))  # XPU doesn't exist.
+
+  def test_whether_there_is_a_gpu(self):
+    if test.is_gpu_available():
+      self.assertTrue(len(replicate_model_fn._get_local_devices('GPU')))
+
+
+class LocalDeviceSetterTest(test_util.TensorFlowTestCase):
+
+  def test_vars_are_on_ps_but_ops_are_on_workers(self):
+    ps_devices = ['/device:GPU:3']
+    round_robin = device_setter._RoundRobinStrategy(num_tasks=len(ps_devices))
+
+    local_device_setter = replicate_model_fn._local_device_setter(
+        ps_devices=ps_devices,
+        ps_strategy=round_robin,
+        worker_device='/device:GPU:2')
+
+    with ops_lib.device(local_device_setter):
+      a = variables.Variable(0.01)
+      self.assertEqual('/device:GPU:3', a.device)
+
+      b = variables.Variable(0.02)
+      self.assertEqual('/device:GPU:3', b.device)
+
+      c = variables.Variable(0.03)
+      self.assertEqual('/device:GPU:3', c.device)
+
+      a_op = array_ops.concat(a, axis=0)
+      self.assertEqual('/device:GPU:2', a_op.device)
+
+      b_op = array_ops.concat(b, axis=0)
+      self.assertEqual('/device:GPU:2', b_op.device)
+
+  def test_round_robin_placement(self):
+    ps_devices = [
+        '/device:GPU:0', '/device:GPU:1', '/device:GPU:3', '/device:GPU:4'
+    ]
+    round_robin = device_setter._RoundRobinStrategy(num_tasks=len(ps_devices))
+
+    local_device_setter = replicate_model_fn._local_device_setter(
+        ps_devices=ps_devices,
+        ps_strategy=round_robin,
+        worker_device='/device:GPU:2')
+
+    with ops_lib.device(local_device_setter):
+      a = variables.Variable(0.01)
+      self.assertEqual('/device:GPU:0', a.device)
+
+      b = variables.Variable(0.02)
+      self.assertEqual('/device:GPU:1', b.device)
+
+      c = variables.Variable(0.03)
+      self.assertEqual('/device:GPU:3', c.device)
+
+      a_op = array_ops.concat(a, axis=0)
+      self.assertEqual('/device:GPU:2', a_op.device)
+
+      b_op = array_ops.concat(b, axis=0)
+      self.assertEqual('/device:GPU:2', b_op.device)
+
+      c = variables.Variable(0.03)
+      self.assertEqual('/device:GPU:4', c.device)
+
+      d = variables.Variable(0.03)
+      self.assertEqual('/device:GPU:0', d.device)
+
+      c_op = array_ops.concat(c, axis=0)
+      self.assertEqual('/device:GPU:2', c_op.device)
+
+
+class ComputeSumWithDevicePlacementTest(test_util.TensorFlowTestCase):
+
+  def test_vectors(self):
+    with self.test_session() as session:
+      total = replicate_model_fn._compute_sum_on_device(
+          [1.0, 2.0, 3.0, 4.0], device='/device:GPU:0', name='test_sum')
+
+      self.assertEqual('/device:GPU:0', total.device)
+      self.assertEqual('test_sum', total.op.name)
+      self.assertEqual(10.0, session.run(total))
+
+  def test_tensors(self):
+    with self.test_session() as session:
+      total = replicate_model_fn._compute_sum_on_device(
+          [[1.0, 2.0], [3.0, 4.0]], device='/device:GPU:0', name='test_sum')
+
+      self.assertEqual('/device:GPU:0', total.device)
+      self.assertEqual('test_sum', total.op.name)
+      self.assertAllEqual([4.0, 6.0], session.run(total))
+
+  def test_indexedslices(self):
+    with self.test_session() as session:
+      a = ops_lib.IndexedSlices(
+          constant_op.constant([1.0, 2.0]), [0, 1],
+          dense_shape=constant_op.constant([2]))
+      b = ops_lib.IndexedSlices(constant_op.constant([3.0, 4.0]), [0, 1])
+
+      total = replicate_model_fn._compute_sum_on_device(
+          [a, b], device='/device:GPU:0')
+
+      self.assertEqual('/device:GPU:0', total.device)
+      self.assertAllEqual([4.0, 6.0],
+                          session.run(ops_lib.convert_to_tensor(total)))
+
+  def test_indexedslices_higher_dimensions(self):
+    with self.test_session() as session:
+      a = ops_lib.IndexedSlices(
+          constant_op.constant([[1.0, 5.0], [2.0, 6.0]]), [0, 1],
+          dense_shape=constant_op.constant([2, 4]))
+      b = ops_lib.IndexedSlices(
+          constant_op.constant([[3.0, 7.0], [4.0, 8.0]]), [0, 1])
+
+      total = replicate_model_fn._compute_sum_on_device(
+          [a, b], device='/device:GPU:0')
+
+      self.assertEqual('/device:GPU:0', total.device)
+      self.assertAllEqual([[4.0, 12.0], [6.0, 14.0]],
+                          session.run(ops_lib.convert_to_tensor(total)))
+
+  def test_indexedslices_some_dont_overlap(self):
+    with self.test_session() as session:
+      a = ops_lib.IndexedSlices(
+          constant_op.constant([1.0, 2.0]), [0, 3],
+          dense_shape=constant_op.constant([4]))
+      b = ops_lib.IndexedSlices(constant_op.constant([3.0, 4.0]), [0, 1])
+
+      total = replicate_model_fn._compute_sum_on_device(
+          [a, b], device='/device:GPU:0')
+
+      self.assertEqual('/device:GPU:0', total.device)
+      self.assertAllEqual([4.0, 4.0, 0.0, 2.0],
+                          session.run(ops_lib.convert_to_tensor(total)))
+
+  def test_no_name_for_indexslices(self):
+    a = ops_lib.IndexedSlices(
+        constant_op.constant([1.0, 2.0]), [0, 1],
+        dense_shape=constant_op.constant([2]))
+    b = ops_lib.IndexedSlices(constant_op.constant([3.0, 4.0]), [0, 1])
+
+    with self.assertRaisesRegexp(ValueError, '.+name.+not.+expected.+'):
+      _ = replicate_model_fn._compute_sum_on_device(
+          [a, b], device='/device:GPU:0', name='cant_name_indexslices')
+
+
+class ConcatTensorDictsTest(test_util.TensorFlowTestCase):
+
+  def test_example(self):
+    tensor_dicts = [
+        {
+            'a': np.array([1.0, 2.0]),
+            'b': np.array([11.0]),
+            'c': np.array([21.0]),
+        },
+        {
+            'a': np.array([3.0]),
+            'b': np.array([12.0, 13.0]),
+        },
+        {
+            'b': np.array([14.0]),
+        },
+    ]
+
+    with self.test_session() as session:
+      self.assertAllClose({
+          'a': np.array([1.0, 2.0, 3.0]),
+          'b': np.array([11.0, 12.0, 13.0, 14.0]),
+          'c': np.array([21.0]),
+      }, session.run(replicate_model_fn._concat_tensor_dicts(*tensor_dicts)))
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/estimator/run_config.py b/tensorflow/python/estimator/run_config.py
index 3e021242c4cc914990c6b38736b8f725213b5b7e..820fda7765cafb49e3866f2b6700d42ffbf5dea0 100644
--- a/tensorflow/python/estimator/run_config.py
+++ b/tensorflow/python/estimator/run_config.py
@@ -345,7 +345,7 @@ class RunConfig(object):
       os.environ['TF_CONFIG'] = json.dumps(
           {'cluster': cluster,
            'task': {'type': 'worker', 'index': 1}})
-      config = ClusterConfig()
+      config = RunConfig()
       assert config.master == 'host4:2222'
       assert config.task_id == 1
       assert config.num_ps_replicas == 2
@@ -363,7 +363,7 @@ class RunConfig(object):
       os.environ['TF_CONFIG'] = json.dumps(
           {'cluster': cluster,
            'task': {'type': 'chief', 'index': 0}})
-      config = ClusterConfig()
+      config = RunConfig()
       assert config.master == 'host0:2222'
       assert config.task_id == 0
       assert config.num_ps_replicas == 2
@@ -381,7 +381,7 @@ class RunConfig(object):
       os.environ['TF_CONFIG'] = json.dumps(
           {'cluster': cluster,
            'task': {'type': 'evaluator', 'index': 0}})
-      config = ClusterConfig()
+      config = RunConfig()
       assert config.master == ''
       assert config.evaluator_master == ''
       assert config.task_id == 0
@@ -423,7 +423,7 @@ class RunConfig(object):
         to be saved. The default value of 10,000 hours effectively disables
         the feature.
       log_step_count_steps: The frequency, in number of global steps, that the
-        global step/sec will be logged during training.
+        global step/sec and the loss will be logged during training.
 
 
     Raises:
diff --git a/tensorflow/python/estimator/training.py b/tensorflow/python/estimator/training.py
index 2cc3331a15867e9a984847391857bf84baee7424..e38b765da52a7b6957a4fb8a02087c5d1fd5a781 100644
--- a/tensorflow/python/estimator/training.py
+++ b/tensorflow/python/estimator/training.py
@@ -128,9 +128,16 @@ class TrainSpec(
     """Creates a validated `TrainSpec` instance.
 
     Args:
-      input_fn: Training input function returning a tuple of:
-          features - `Tensor` or dictionary of string feature name to `Tensor`.
-          labels - `Tensor` or dictionary of `Tensor` with labels.
+      input_fn: A function that provides input data for training as minibatches.
+        See @{$get_started/premade_estimators#create_input_functions} for more
+        information. The function should construct and return one of
+        the following:
+          * A `tf.data.Dataset` object: Outputs of the `Dataset` object must
+            be a tuple (features, labels) with the same constraints as below.
+          * A tuple (features, labels): Where features is a `Tensor` or a
+            dictionary of string feature name to `Tensor` and labels is a
+            `Tensor` or a dictionary of string label name to `Tensor`.
+
       max_steps: Int. Positive number of total steps for which to train model.
         If `None`, train forever. The training `input_fn` is not expected to
         generate `OutOfRangeError` or `StopIteration` exceptions. See the
@@ -185,9 +192,16 @@ class EvalSpec(
     """Creates a validated `EvalSpec` instance.
 
     Args:
-      input_fn: Evaluation input function returning a tuple of:
-          features - `Tensor` or dictionary of string feature name to `Tensor`.
-          labels - `Tensor` or dictionary of `Tensor` with labels.
+      input_fn: A function that constructs the input data for evaluation.
+        See @{$get_started/premade_estimators#create_input_functions} for more
+        information. The function should construct and return one of
+        the following:
+          * A `tf.data.Dataset` object: Outputs of the `Dataset` object must
+            be a tuple (features, labels) with the same constraints as below.
+          * A tuple (features, labels): Where features is a `Tensor` or a
+            dictionary of string feature name to `Tensor` and labels is a
+            `Tensor` or a dictionary of string label name to `Tensor`.
+
       steps: Int. Positive number of steps for which to evaluate model. If
         `None`, evaluates until `input_fn` raises an end-of-input exception.
         See `Estimator.evaluate` for details.
diff --git a/tensorflow/python/estimator/util.py b/tensorflow/python/estimator/util.py
index 3ce8eea84b6bf601ce89dfaa7d8e3a5d193468b3..bb4bdd3fdfb2e19dbc1c581d7771f2e1ac4442ba 100644
--- a/tensorflow/python/estimator/util.py
+++ b/tensorflow/python/estimator/util.py
@@ -20,7 +20,12 @@ from __future__ import division
 from __future__ import print_function
 
 import functools
+import os
+import time
 
+from tensorflow.python.platform import gfile
+from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.util import compat
 from tensorflow.python.util import tf_decorator
 from tensorflow.python.util import tf_inspect
 
@@ -56,3 +61,48 @@ def fn_args(fn):
     if _is_bounded_method(fn):
       args.remove('self')
   return tuple(args)
+
+
+# When we create a timestamped directory, there is a small chance that the
+# directory already exists because another process is also creating these
+# directories. In this case we just wait one second to get a new timestamp and
+# try again. If this fails several times in a row, then something is seriously
+# wrong.
+MAX_DIRECTORY_CREATION_ATTEMPTS = 10
+
+
+def get_timestamped_dir(dir_base):
+  """Builds a path to a new subdirectory within the base directory.
+
+  The subdirectory will be named using the current time.
+  This guarantees monotonically increasing directory numbers even across
+  multiple runs of the pipeline.
+  The timestamp used is the number of seconds since epoch UTC.
+
+  Args:
+    dir_base: A string containing a directory to create the subdirectory under.
+
+  Returns:
+    The full path of the new subdirectory (which is not actually created yet).
+
+  Raises:
+    RuntimeError: if repeated attempts fail to obtain a unique timestamped
+      directory name.
+  """
+  attempts = 0
+  while attempts < MAX_DIRECTORY_CREATION_ATTEMPTS:
+    timestamp = int(time.time())
+
+    result_dir = os.path.join(
+        compat.as_bytes(dir_base), compat.as_bytes(str(timestamp)))
+    if not gfile.Exists(result_dir):
+      # Collisions are still possible (though extremely unlikely): this
+      # directory is not actually created yet, but it will be almost
+      # instantly on return from this function.
+      return result_dir
+    time.sleep(1)
+    attempts += 1
+    logging.warn('Directory {} already exists; retrying (attempt {}/{})'.format(
+        result_dir, attempts, MAX_DIRECTORY_CREATION_ATTEMPTS))
+  raise RuntimeError('Failed to obtain a unique export directory name after '
+                     '{} attempts.'.format(MAX_DIRECTORY_CREATION_ATTEMPTS))
diff --git a/tensorflow/python/feature_column/BUILD b/tensorflow/python/feature_column/BUILD
index a758f8a4fc4898713772c4e919acda48b0f6ad0b..238a90b67d9d0039c25a6f3800aad25a2db9e36f 100644
--- a/tensorflow/python/feature_column/BUILD
+++ b/tensorflow/python/feature_column/BUILD
@@ -74,7 +74,10 @@ py_test(
     srcs = ["feature_column_test.py"],
     data = [":vocabulary_testdata"],
     srcs_version = "PY2AND3",
-    tags = ["no_pip"],
+    tags = [
+        "no_cuda_on_cpu_tap",
+        "no_pip",
+    ],
     deps = [
         ":feature_column",
         ":feature_column_py",
diff --git a/tensorflow/python/feature_column/feature_column.py b/tensorflow/python/feature_column/feature_column.py
index c416881c3119c160d28f4b8e37cd2aeb22f239a6..7d99fcb3e79318c2fecabaa9bdd0347aa67cf309 100644
--- a/tensorflow/python/feature_column/feature_column.py
+++ b/tensorflow/python/feature_column/feature_column.py
@@ -16,7 +16,7 @@
 
 FeatureColumns provide a high level abstraction for ingesting and representing
 features. FeatureColumns are also the primary way of encoding features for
-canned ${tf.estimator.Estimator}s.
+canned @{tf.estimator.Estimator}s.
 
 When using FeatureColumns with `Estimators`, the type of feature column you
 should choose depends on (1) the feature type and (2) the model type.
@@ -1626,7 +1626,7 @@ class _FeatureColumn(object):
 
     It is used for get_parsing_spec for `tf.parse_example`. Returned spec is a
     dict from keys ('string') to `VarLenFeature`, `FixedLenFeature`, and other
-    supported objects. Please check documentation of ${tf.parse_example} for all
+    supported objects. Please check documentation of @{tf.parse_example} for all
     supported spec objects.
 
     Let's say a Feature column depends on raw feature ('raw') and another
@@ -1677,7 +1677,7 @@ class _DenseColumn(_FeatureColumn):
       weight_collections: List of graph collections to which Variables (if any
         will be created) are added.
       trainable: If `True` also add variables to the graph collection
-        `GraphKeys.TRAINABLE_VARIABLES` (see ${tf.Variable}).
+        `GraphKeys.TRAINABLE_VARIABLES` (see @{tf.Variable}).
 
     Returns:
       `Tensor` of shape [batch_size] + `_variable_shape`.
@@ -1735,7 +1735,7 @@ class _CategoricalColumn(_FeatureColumn):
   WARNING: Do not subclass this layer unless you know what you are doing:
   the API is subject to future changes.
 
-  A categorical feature typically handled with a ${tf.SparseTensor} of IDs.
+  A categorical feature is typically handled with a @{tf.SparseTensor} of IDs.
   """
   __metaclass__ = abc.ABCMeta
 
@@ -1770,7 +1770,7 @@ class _CategoricalColumn(_FeatureColumn):
       weight_collections: List of graph collections to which variables (if any
         will be created) are added.
       trainable: If `True` also add variables to the graph collection
-        `GraphKeys.TRAINABLE_VARIABLES` (see ${tf.get_variable}).
+        `GraphKeys.TRAINABLE_VARIABLES` (see @{tf.get_variable}).
     """
     pass
 
@@ -1804,6 +1804,21 @@ def _create_categorical_column_weighted_sum(
       name='weighted_sum')
 
 
+class _SequenceDenseColumn(_FeatureColumn):
+  """Represents dense sequence data."""
+
+  __metaclass__ = abc.ABCMeta
+
+  TensorSequenceLengthPair = collections.namedtuple(  # pylint: disable=invalid-name
+      'TensorSequenceLengthPair', ['dense_tensor', 'sequence_length'])
+
+  @abc.abstractmethod
+  def _get_sequence_dense_tensor(
+      self, inputs, weight_collections=None, trainable=None):
+    """Returns a `TensorSequenceLengthPair`."""
+    pass
+
+
 class _LazyBuilder(object):
   """Handles caching of transformations while building the model.
 
@@ -1874,12 +1889,12 @@ class _LazyBuilder(object):
       self._feature_tensors[key] = feature_tensor
       return feature_tensor
 
-    if not isinstance(key, (str, _FeatureColumn)):
-      raise TypeError('"key" must be either a "str" or "_FeatureColumn". '
-                      'Provided: {}'.format(key))
+    if isinstance(key, str):
+      raise ValueError('Feature {} is not in features dictionary.'.format(key))
 
     if not isinstance(key, _FeatureColumn):
-      raise ValueError('Feature {} is not in features dictionary.'.format(key))
+      raise TypeError('"key" must be either a "str" or "_FeatureColumn". '
+                      'Provided: {}'.format(key))
 
     column = key
     logging.debug('Transforming feature_column %s.', column)
@@ -2152,7 +2167,7 @@ class _BucketizedColumn(_DenseColumn, _CategoricalColumn,
 
 
 class _EmbeddingColumn(
-    _DenseColumn,
+    _DenseColumn, _SequenceDenseColumn,
     collections.namedtuple('_EmbeddingColumn', (
         'categorical_column', 'dimension', 'combiner', 'initializer',
         'ckpt_to_load_from', 'tensor_name_in_ckpt', 'max_norm', 'trainable'
@@ -2178,7 +2193,9 @@ class _EmbeddingColumn(
       self._shape = tensor_shape.vector(self.dimension)
     return self._shape
 
-  def _get_dense_tensor(self, inputs, weight_collections=None, trainable=None):
+  def _get_dense_tensor_internal(
+      self, inputs, weight_collections=None, trainable=None):
+    """Private method that follows the signature of _get_dense_tensor."""
     # Get sparse IDs and weights.
     sparse_tensors = self.categorical_column._get_sparse_tensors(  # pylint: disable=protected-access
         inputs, weight_collections=weight_collections, trainable=trainable)
@@ -2210,6 +2227,43 @@ class _EmbeddingColumn(
         name='%s_weights' % self.name,
         max_norm=self.max_norm)
 
+  def _get_dense_tensor(self, inputs, weight_collections=None, trainable=None):
+    if isinstance(self.categorical_column, _SequenceCategoricalColumn):
+      raise ValueError(
+          'In embedding_column: {}. '
+          'categorical_column must not be of type _SequenceCategoricalColumn. '
+          'Suggested fix A: If you wish to use input_layer, use a '
+          'non-sequence categorical_column_with_*. '
+          'Suggested fix B: If you wish to create sequence input, use '
+          'sequence_input_layer instead of input_layer. '
+          'Given (type {}): {}'.format(
+              self.name, type(self.categorical_column),
+              self.categorical_column))
+    return self._get_dense_tensor_internal(
+        inputs=inputs, weight_collections=weight_collections,
+        trainable=trainable)
+
+  def _get_sequence_dense_tensor(
+      self, inputs, weight_collections=None, trainable=None):
+    if not isinstance(self.categorical_column, _SequenceCategoricalColumn):
+      raise ValueError(
+          'In embedding_column: {}. '
+          'categorical_column must be of type _SequenceCategoricalColumn '
+          'to use sequence_input_layer. '
+          'Suggested fix: Use one of sequence_categorical_column_with_*. '
+          'Given (type {}): {}'.format(
+              self.name, type(self.categorical_column),
+              self.categorical_column))
+    dense_tensor = self._get_dense_tensor_internal(  # pylint: disable=protected-access
+        inputs=inputs,
+        weight_collections=weight_collections,
+        trainable=trainable)
+    sparse_tensors = self.categorical_column._get_sparse_tensors(inputs)  # pylint: disable=protected-access
+    sequence_length = _sequence_length_from_sparse_tensor(
+        sparse_tensors.id_tensor)
+    return _SequenceDenseColumn.TensorSequenceLengthPair(
+        dense_tensor=dense_tensor, sequence_length=sequence_length)
+
 
 class _SharedEmbeddingColumn(
     _DenseColumn,
@@ -2890,7 +2944,7 @@ def _prune_invalid_ids(sparse_ids, sparse_weights):
   return sparse_ids, sparse_weights
 
 
-class _IndicatorColumn(_DenseColumn,
+class _IndicatorColumn(_DenseColumn, _SequenceDenseColumn,
                        collections.namedtuple('_IndicatorColumn',
                                               ['categorical_column'])):
   """Represents a one-hot column for use in deep networks.
@@ -2966,15 +3020,53 @@ class _IndicatorColumn(_DenseColumn,
 
     Returns:
       Dense `Tensor` created within `_transform_feature`.
+
+    Raises:
+      ValueError: If `categorical_column` is a `_SequenceCategoricalColumn`.
     """
     # Do nothing with weight_collections and trainable since no variables are
     # created in this function.
     del weight_collections
     del trainable
+    if isinstance(self.categorical_column, _SequenceCategoricalColumn):
+      raise ValueError(
+          'In indicator_column: {}. '
+          'categorical_column must not be of type _SequenceCategoricalColumn. '
+          'Suggested fix A: If you wish to use input_layer, use a '
+          'non-sequence categorical_column_with_*. '
+          'Suggested fix B: If you wish to create sequence input, use '
+          'sequence_input_layer instead of input_layer. '
+          'Given (type {}): {}'.format(
+              self.name, type(self.categorical_column),
+              self.categorical_column))
     # Feature has been already transformed. Return the intermediate
     # representation created by _transform_feature.
     return inputs.get(self)
 
+  def _get_sequence_dense_tensor(
+      self, inputs, weight_collections=None, trainable=None):
+    # Do nothing with weight_collections and trainable since no variables are
+    # created in this function.
+    del weight_collections
+    del trainable
+    if not isinstance(self.categorical_column, _SequenceCategoricalColumn):
+      raise ValueError(
+          'In indicator_column: {}. '
+          'categorical_column must be of type _SequenceCategoricalColumn '
+          'to use sequence_input_layer. '
+          'Suggested fix: Use one of sequence_categorical_column_with_*. '
+          'Given (type {}): {}'.format(
+              self.name, type(self.categorical_column),
+              self.categorical_column))
+    # Feature has been already transformed. Return the intermediate
+    # representation created by _transform_feature.
+    dense_tensor = inputs.get(self)
+    sparse_tensors = self.categorical_column._get_sparse_tensors(inputs)  # pylint: disable=protected-access
+    sequence_length = _sequence_length_from_sparse_tensor(
+        sparse_tensors.id_tensor)
+    return _SequenceDenseColumn.TensorSequenceLengthPair(
+        dense_tensor=dense_tensor, sequence_length=sequence_length)
+
 
 def _verify_static_batch_size_equality(tensors, columns):
   # bath_size is a tf.Dimension object.
@@ -2990,3 +3082,68 @@ def _verify_static_batch_size_equality(tensors, columns):
             'Batch size of columns ({}, {}): ({}, {})'.format(
                 columns[bath_size_column_index].name, columns[i].name,
                 expected_batch_size, tensors[i].shape[0]))
+
+
+def _sequence_length_from_sparse_tensor(sp_tensor, num_elements=1):
+  """Returns a [batch_size] Tensor with per-example sequence length."""
+  with ops.name_scope(None, 'sequence_length') as name_scope:
+    row_ids = sp_tensor.indices[:, 0]
+    column_ids = sp_tensor.indices[:, 1]
+    column_ids += array_ops.ones_like(column_ids)
+    seq_length = math_ops.to_int64(
+        math_ops.segment_max(column_ids, segment_ids=row_ids) / num_elements)
+    # If the last n rows do not have ids, seq_length will have shape
+    # [batch_size - n]. Pad the remaining values with zeros.
+    n_pad = array_ops.shape(sp_tensor)[:1] - array_ops.shape(seq_length)[:1]
+    padding = array_ops.zeros(n_pad, dtype=seq_length.dtype)
+    return array_ops.concat([seq_length, padding], axis=0, name=name_scope)
+
+
+class _SequenceCategoricalColumn(
+    _CategoricalColumn,
+    collections.namedtuple(
+        '_SequenceCategoricalColumn', ['categorical_column'])):
+  """Represents sequences of categorical data."""
+
+  @property
+  def name(self):
+    return self.categorical_column.name
+
+  @property
+  def _parse_example_spec(self):
+    return self.categorical_column._parse_example_spec  # pylint: disable=protected-access
+
+  def _transform_feature(self, inputs):
+    return self.categorical_column._transform_feature(inputs)  # pylint: disable=protected-access
+
+  @property
+  def _num_buckets(self):
+    return self.categorical_column._num_buckets  # pylint: disable=protected-access
+
+  def _get_sparse_tensors(self, inputs, weight_collections=None,
+                          trainable=None):
+    sparse_tensors = self.categorical_column._get_sparse_tensors(inputs)  # pylint: disable=protected-access
+    id_tensor = sparse_tensors.id_tensor
+    weight_tensor = sparse_tensors.weight_tensor
+    # Expands final dimension, so that embeddings are not combined during
+    # embedding lookup.
+    check_id_rank = check_ops.assert_equal(
+        array_ops.rank(id_tensor), 2,
+        data=[
+            'Column {} expected ID tensor of rank 2. '.format(self.name),
+            'id_tensor shape: ', array_ops.shape(id_tensor)])
+    with ops.control_dependencies([check_id_rank]):
+      id_tensor = sparse_ops.sparse_reshape(
+          id_tensor,
+          shape=array_ops.concat([id_tensor.dense_shape, [1]], axis=0))
+    if weight_tensor is not None:
+      check_weight_rank = check_ops.assert_equal(
+          array_ops.rank(weight_tensor), 2,
+          data=[
+              'Column {} expected weight tensor of rank 2. '.format(self.name),
+              'weight_tensor shape: ', array_ops.shape(weight_tensor)])
+      with ops.control_dependencies([check_weight_rank]):
+        weight_tensor = sparse_ops.sparse_reshape(
+            weight_tensor,
+            shape=array_ops.concat([weight_tensor.dense_shape, [1]], axis=0))
+    return _CategoricalColumn.IdWeightPair(id_tensor, weight_tensor)
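Reviewer note: the length computation in `_sequence_length_from_sparse_tensor` may be easier to follow in plain Python. The sketch below is only an illustration, not the TensorFlow implementation: `sequence_lengths`, `indices`, and `batch_size` are hypothetical names, and the sparse tensor is assumed to be given as a list of `(row, column)` index pairs.

```python
def sequence_lengths(indices, batch_size, num_elements=1):
    """Per-example sequence length from 2-D sparse indices (row, column)."""
    # Rows that contain no ids keep length 0, which mirrors the zero padding
    # the diff appends for trailing empty rows.
    lengths = [0] * batch_size
    for row, col in indices:
        # A row's length is its largest column index plus one, divided by the
        # number of elements per time step (integer division, as an assumption
        # standing in for the to_int64 cast).
        lengths[row] = max(lengths[row], (col + 1) // num_elements)
    return lengths


if __name__ == "__main__":
    # Row 0 has ids in columns 0 and 1, row 1 in column 0, row 2 is empty.
    print(sequence_lengths([(0, 0), (0, 1), (1, 0)], batch_size=3))  # [2, 1, 0]
```

The real function gets the same per-row maximum from `math_ops.segment_max` and then pads the result, because the segment op only produces entries up to the last row that has an id.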
diff --git a/tensorflow/python/framework/constant_op.py b/tensorflow/python/framework/constant_op.py
index d3d8c9c154fbfcc9613acce4e1bdab7df2e7d56d..782b505d6c1d0b576b7734f088c4d2c9625f4be2 100644
--- a/tensorflow/python/framework/constant_op.py
+++ b/tensorflow/python/framework/constant_op.py
@@ -181,7 +181,7 @@ def constant(value, dtype=None, shape=None, name="Const", verify_shape=False):
     TypeError: if shape is incorrectly specified or unsupported.
   """
   ctx = context.context()
-  if not ctx.in_graph_mode():
+  if ctx.executing_eagerly():
     t = convert_to_eager_tensor(value, ctx, dtype)
     if shape is None:
       return t
diff --git a/tensorflow/python/framework/dtypes.py b/tensorflow/python/framework/dtypes.py
index 99ae8b24f11c4955379ae532ba7b921ebec63385..0edae92fd4a86e7d10a180ce64364d3ea552bf60 100644
--- a/tensorflow/python/framework/dtypes.py
+++ b/tensorflow/python/framework/dtypes.py
@@ -343,7 +343,9 @@ tf_export("uint8").export_constant(__name__, "uint8")
 uint16 = DType(types_pb2.DT_UINT16)
 tf_export("uint16").export_constant(__name__, "uint16")
 uint32 = DType(types_pb2.DT_UINT32)
+tf_export("uint32").export_constant(__name__, "uint32")
 uint64 = DType(types_pb2.DT_UINT64)
+tf_export("uint64").export_constant(__name__, "uint64")
 int16 = DType(types_pb2.DT_INT16)
 tf_export("int16").export_constant(__name__, "int16")
 int8 = DType(types_pb2.DT_INT8)
diff --git a/tensorflow/python/framework/framework_lib.py b/tensorflow/python/framework/framework_lib.py
index 3172f3c2c3d259d2c3f2b340b101aef043d0fc33..392a4f65c6e62c3cb70f8e02a9b24f015a09f649 100644
--- a/tensorflow/python/framework/framework_lib.py
+++ b/tensorflow/python/framework/framework_lib.py
@@ -48,6 +48,7 @@
 ## Graph collections
 
 @@add_to_collection
+@@add_to_collections
 @@get_collection
 @@get_collection_ref
 @@GraphKeys
@@ -92,6 +93,7 @@ from tensorflow.python.framework.ops import get_default_graph
 from tensorflow.python.framework.ops import reset_default_graph
 from tensorflow.python.framework.ops import GraphKeys
 from tensorflow.python.framework.ops import add_to_collection
+from tensorflow.python.framework.ops import add_to_collections
 from tensorflow.python.framework.ops import get_collection
 from tensorflow.python.framework.ops import get_collection_ref
 from tensorflow.python.framework.ops import convert_to_tensor
diff --git a/tensorflow/python/framework/function.py b/tensorflow/python/framework/function.py
index caa604999c2fad4ce111d910a77e4b99399c11ca..14d72d8a3de7e22bee4f9961c2f66044c217f641 100644
--- a/tensorflow/python/framework/function.py
+++ b/tensorflow/python/framework/function.py
@@ -489,10 +489,10 @@ class _DefinedFunction(object):
 
     # Adds this function into 'g'.
     # pylint: disable=protected-access
-    if context.in_graph_mode():
-      g._add_function(self)
-    else:
+    if context.executing_eagerly():
       context.context().add_function_def(self.definition)
+    else:
+      g._add_function(self)
     # pylint: enable=protected-access
 
     # Ensures related sub-routines are defined in 'g', too.
diff --git a/tensorflow/python/framework/function_test.py b/tensorflow/python/framework/function_test.py
index 52052ba77d42fa91692e7699f49898d0c01c22be..65ca801cbe922b36e3bc72bc2fbcd88f66aa5290 100644
--- a/tensorflow/python/framework/function_test.py
+++ b/tensorflow/python/framework/function_test.py
@@ -193,7 +193,7 @@ class FunctionTest(test.TestCase):
 
     @function.Defun(dtypes.float32, dtypes.float32)
     def XSquarePlusOneGrad(x, dy):
-      dx = functional_ops._symbolic_gradient(
+      dx = functional_ops.symbolic_gradient(
           input=[x, dy], Tout=[dtypes.float32], f="XSquarePlusOneFn", name="dx")
       return dx
 
@@ -295,7 +295,7 @@ class FunctionTest(test.TestCase):
       # gradient function is (x, y, dz) -> (dx, dy).  dx's shape
       # should be the same as x's; and dy's shape should be the same
       # as y's.
-      dx, dy = functional_ops._symbolic_gradient(
+      dx, dy = functional_ops.symbolic_gradient(
           input=[x, y, dz], Tout=[dtypes.float32] * 2, f="Foo")
       self.assertEqual(x.get_shape(), dx.get_shape())
       self.assertEqual(y.get_shape(), dy.get_shape())
diff --git a/tensorflow/python/framework/graph_util_impl.py b/tensorflow/python/framework/graph_util_impl.py
index 5a543317e665a940841714fd72d834a430f8406a..910364364c8be84b1a629dbdaae5e69443d07e75 100644
--- a/tensorflow/python/framework/graph_util_impl.py
+++ b/tensorflow/python/framework/graph_util_impl.py
@@ -235,7 +235,7 @@ def convert_variables_to_constants(sess,
   variable_names = []
   variable_dict_names = []
   for node in inference_graph.node:
-    if node.op in ["Variable", "VariableV2"]:
+    if node.op in ["Variable", "VariableV2", "VarHandleOp"]:
       variable_name = node.name
       if ((variable_names_whitelist is not None and
            variable_name not in variable_names_whitelist) or
@@ -243,7 +243,10 @@ def convert_variables_to_constants(sess,
            variable_name in variable_names_blacklist)):
         continue
       variable_dict_names.append(variable_name)
-      variable_names.append(variable_name + ":0")
+      if node.op == "VarHandleOp":
+        variable_names.append(variable_name + "/Read/ReadVariableOp:0")
+      else:
+        variable_names.append(variable_name + ":0")
   if variable_names:
     returned_variables = sess.run(variable_names)
   else:
@@ -266,6 +269,17 @@ def convert_variables_to_constants(sess,
               tensor=tensor_util.make_tensor_proto(
                   data, dtype=dtype.type, shape=data.shape)))
       how_many_converted += 1
+    elif input_node.op == "ReadVariableOp" and (
+        input_node.input[0] in found_variables):
+      # The preceding branch converts all VarHandleOps of ResourceVariables to
+      # constants, so we need to convert the associated ReadVariableOps to
+      # Identity ops.
+      output_node.op = "Identity"
+      output_node.name = input_node.name
+      output_node.input.extend([input_node.input[0]])
+      output_node.attr["T"].CopyFrom(input_node.attr["dtype"])
+      if "_class" in input_node.attr:
+        output_node.attr["_class"].CopyFrom(input_node.attr["_class"])
     else:
       output_node.CopyFrom(input_node)
     output_graph_def.node.extend([output_node])
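Reviewer note: the change to `convert_variables_to_constants` depends on which tensor is fetched for each kind of variable node. A minimal sketch of that naming rule follows, assuming (as the diff does) that a resource variable's value is read through a `Read/ReadVariableOp` node under the variable's name. `fetch_name` is a hypothetical helper for illustration and is not part of TensorFlow.

```python
def fetch_name(op_type, node_name):
    """Returns the tensor name to fetch for a variable node of the given op type."""
    if op_type == "VarHandleOp":
        # Resource variables expose a handle, so the value comes from the
        # associated read op rather than from the node's own output.
        return node_name + "/Read/ReadVariableOp:0"
    # "Variable" and "VariableV2" nodes produce the value as output 0.
    return node_name + ":0"


if __name__ == "__main__":
    print(fetch_name("VariableV2", "variable_node"))
    print(fetch_name("VarHandleOp", "variable_node"))
```

Once the handle node has been replaced by a constant, the diff turns each matching `ReadVariableOp` into an `Identity` op so that downstream consumers keep their input names.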
diff --git a/tensorflow/python/framework/graph_util_test.py b/tensorflow/python/framework/graph_util_test.py
index 0421837d49de753d642aed59d1524619a243dcb8..b618152b0256fd043dc7259960d867278ba55b0a 100644
--- a/tensorflow/python/framework/graph_util_test.py
+++ b/tensorflow/python/framework/graph_util_test.py
@@ -32,6 +32,7 @@ from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import gen_state_ops
 from tensorflow.python.ops import math_ops  # pylint: disable=unused-import
 from tensorflow.python.ops import math_ops as math_ops_lib
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 
@@ -47,46 +48,46 @@ class DeviceFunctionsTest(test.TestCase):
 
   def testTwoDeviceFunctions(self):
     with ops.Graph().as_default() as g:
-      var_0 = gen_state_ops._variable(
+      var_0 = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="var_0",
           container="",
           shared_name="")
       with g.device(test_device_func_pin_variable_to_cpu):
-        var_1 = gen_state_ops._variable(
+        var_1 = gen_state_ops.variable(
             shape=[1],
             dtype=dtypes.float32,
             name="var_1",
             container="",
             shared_name="")
-      var_2 = gen_state_ops._variable(
+      var_2 = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="var_2",
           container="",
           shared_name="")
-      var_3 = gen_state_ops._variable(
+      var_3 = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="var_3",
           container="",
           shared_name="")
       with g.device(test_device_func_pin_variable_to_cpu):
-        var_4 = gen_state_ops._variable(
+        var_4 = gen_state_ops.variable(
             shape=[1],
             dtype=dtypes.float32,
             name="var_4",
             container="",
             shared_name="")
         with g.device("/device:GPU:0"):
-          var_5 = gen_state_ops._variable(
+          var_5 = gen_state_ops.variable(
               shape=[1],
               dtype=dtypes.float32,
               name="var_5",
               container="",
               shared_name="")
-        var_6 = gen_state_ops._variable(
+        var_6 = gen_state_ops.variable(
             shape=[1],
             dtype=dtypes.float32,
             name="var_6",
@@ -226,52 +227,62 @@ class DeviceFunctionsTest(test.TestCase):
                          constant_graph_def.library)
 
   def testConvertVariablesToConsts(self):
-    with ops.Graph().as_default():
-      variable_node = variables.Variable(1.0, name="variable_node")
-      _ = variables.Variable(1.0, name="unused_variable_node")
-      output_node = math_ops_lib.multiply(
-          variable_node, 2.0, name="output_node")
-      with session.Session() as sess:
-        init = variables.initialize_variables([variable_node])
-        sess.run(init)
-        output = sess.run(output_node)
-        self.assertNear(2.0, output, 0.00001)
-        variable_graph_def = sess.graph.as_graph_def()
-        # First get the constant_graph_def when variable_names_whitelist is set,
-        # note that if variable_names_whitelist is not set an error will be
-        # thrown because unused_variable_node is not initialized.
-        constant_graph_def = graph_util.convert_variables_to_constants(
-            sess,
-            variable_graph_def, ["output_node"],
-            variable_names_whitelist=set(["variable_node"]))
+    self._test_variable_to_const_conversion(use_resource=False)
 
-        # Then initialize the unused variable, and get another
-        # constant_graph_def when variable_names_whitelist is not set.
-        sess.run(variables.global_variables_initializer())
-        constant_graph_def_without_variable_whitelist = (
-            graph_util.convert_variables_to_constants(sess, variable_graph_def,
-                                                      ["output_node"]))
-
-        # The unused variable should be cleared so the two graphs should be
-        # equivalent.
-        self.assertEqual(
-            str(constant_graph_def),
-            str(constant_graph_def_without_variable_whitelist))
-
-        # Test variable name black list. This should result in the variable not
-        # being a const.
-        sess.run(variables.global_variables_initializer())
-        constant_graph_def_with_blacklist = (
-            graph_util.convert_variables_to_constants(
-                sess,
-                variable_graph_def, ["output_node"],
-                variable_names_blacklist=set(["variable_node"])))
-        variable_node = None
-        for node in constant_graph_def_with_blacklist.node:
-          if node.name == "variable_node":
-            variable_node = node
-        self.assertIsNotNone(variable_node)
-        self.assertEqual(variable_node.op, "VariableV2")
+  def testConvertResourceVariablesToConsts(self):
+    self._test_variable_to_const_conversion(use_resource=True)
+
+  def _test_variable_to_const_conversion(self, use_resource):
+    with ops.Graph().as_default():
+      with variable_scope.variable_scope("", use_resource=use_resource):
+        variable_node = variable_scope.get_variable(
+            "variable_node", initializer=1.0)
+        another_variable = variable_scope.get_variable(
+            "unused_variable_node", initializer=1.0)
+        output_node = math_ops_lib.multiply(
+            variable_node, 2.0, name="output_node")
+        with session.Session() as sess:
+          sess.run(variable_node.initializer)
+          output = sess.run(output_node)
+          self.assertNear(2.0, output, 0.00001)
+          variable_graph_def = sess.graph.as_graph_def()
+          # First get the constant_graph_def when variable_names_whitelist is
+          # set. Note that if variable_names_whitelist is not set, an error
+          # will be thrown because unused_variable_node is not initialized.
+          constant_graph_def = graph_util.convert_variables_to_constants(
+              sess,
+              variable_graph_def, ["output_node"],
+              variable_names_whitelist=set(["variable_node"]))
+
+          # Then initialize the unused variable, and get another
+          # constant_graph_def when variable_names_whitelist is not set.
+          sess.run(another_variable.initializer)
+          constant_graph_def_without_variable_whitelist = (
+              graph_util.convert_variables_to_constants(
+                  sess, variable_graph_def, ["output_node"]))
+
+          # The unused variable should be cleared so the two graphs should be
+          # equivalent.
+          self.assertEqual(
+              str(constant_graph_def),
+              str(constant_graph_def_without_variable_whitelist))
+
+          # Test variable name black list. This should result in the variable
+          # not being a const.
+          constant_graph_def_with_blacklist = (
+              graph_util.convert_variables_to_constants(
+                  sess,
+                  variable_graph_def, ["output_node"],
+                  variable_names_blacklist=set(["variable_node"])))
+          variable_node = None
+          for node in constant_graph_def_with_blacklist.node:
+            if node.name == "variable_node":
+              variable_node = node
+          self.assertIsNotNone(variable_node)
+          if use_resource:
+            self.assertEqual(variable_node.op, "VarHandleOp")
+          else:
+            self.assertEqual(variable_node.op, "VariableV2")
 
     # Now we make sure the variable is now a constant, and that the graph still
     # produces the expected result.
@@ -279,8 +290,9 @@ class DeviceFunctionsTest(test.TestCase):
       _ = importer.import_graph_def(constant_graph_def, name="")
       self.assertEqual(4, len(constant_graph_def.node))
       for node in constant_graph_def.node:
-        self.assertNotEqual("Variable", node.op)
-        self.assertNotEqual("VariableV2", node.op)
+        self.assertNotIn(
+            node.op,
+            ["Variable", "VariableV2", "VarHandleOp", "ReadVariableOp"])
       with session.Session() as sess:
         output_node = sess.graph.get_tensor_by_name("output_node:0")
         output = sess.run(output_node)
diff --git a/tensorflow/python/framework/importer.py b/tensorflow/python/framework/importer.py
index 6ecc1a40ae14760dd39242aaf595b32a9decdc9f..4ea34d7bb2831845aec1f40fcdb7f64a8f8c438a 100644
--- a/tensorflow/python/framework/importer.py
+++ b/tensorflow/python/framework/importer.py
@@ -301,14 +301,17 @@ def _ProcessNewOps(graph):
   colocation_pairs = {}
 
   for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
+    original_device = new_op.device
+    new_op._set_device('')  # pylint: disable=protected-access
     colocation_names = _GetColocationNames(new_op)
     if colocation_names:
       colocation_pairs[new_op] = colocation_names
-      # Don't apply this op's device function, since colocation constraints
-      # override device functions. Note that this op's device may still be set
-      # by the loop below.
+      # Don't set a device for this op, since colocation constraints override
+      # device functions and the original device. Note that this op's device may
+      # still be set by the loop below.
+      # TODO(skyewm): why does it override the original device?
     else:
-      with _MaybeDevice(new_op.device):
+      with _MaybeDevice(original_device):
         graph._apply_device_functions(new_op)  # pylint: disable=protected-access
 
   # The following loop populates the device field of ops that are colocated
@@ -475,32 +478,39 @@ def import_graph_def(graph_def,
     _PopulateTFImportGraphDefOptions(options, prefix, input_map,
                                      return_elements)
 
-    with c_api_util.tf_buffer(graph_def.SerializeToString()) as serialized:
-      try:
-        with errors.raise_exception_on_not_ok_status() as status:
-          results = c_api.TF_GraphImportGraphDefWithResults(
-              graph._c_graph, serialized, options, status)  # pylint: disable=protected-access
-      except errors.InvalidArgumentError as e:
-        # Convert to ValueError for backwards compatibility.
-        raise ValueError(str(e))
-
-    _ProcessNewOps(graph)
+    # _ProcessNewOps mutates the new operations. _lock ensures a Session.run
+    # call cannot occur between creating the TF_Operations in the
+    # TF_GraphImportGraphDefWithResults call and mutating them in
+    # _ProcessNewOps.
+    with graph._lock:  # pylint: disable=protected-access
+      with c_api_util.tf_buffer(graph_def.SerializeToString()) as serialized:
+        try:
+          with errors.raise_exception_on_not_ok_status() as status:
+            results = c_api.TF_GraphImportGraphDefWithResults(
+                graph._c_graph, serialized, options, status)  # pylint: disable=protected-access
+        except errors.InvalidArgumentError as e:
+          # Convert to ValueError for backwards compatibility.
+          raise ValueError(str(e))
+
+      # Create _DefinedFunctions for any imported functions.
+      #
+      # We do this by creating _DefinedFunctions directly from `graph_def`, and
+      # adding them to `graph`. Adding an existing function to a TF_Graph is a
+      # no-op, so this only has the effect of updating the Python state (usually
+      # _DefinedFunction.add_to_graph also adds the function to the TF_Graph).
+      #
+      # TODO(skyewm): fetch the TF_Functions directly from the TF_Graph
+      # TODO(skyewm): avoid sending serialized FunctionDefs back to the TF_Graph
+      # TODO(b/74620627): move this after _ProcessNewOps outside the lock once
+      # _USE_C_SHAPES is removed.
+      if graph_def.library and graph_def.library.function:
+        # pylint: disable=protected-access
+        functions = function._from_library(graph_def.library)
+        for f in functions:
+          f.add_to_graph(graph)
+        # pylint: enable=protected-access
 
-    # Create _DefinedFunctions for any imported functions.
-    #
-    # We do this by creating _DefinedFunctions directly from `graph_def`, and
-    # adding them to `graph`. Adding an existing function to a TF_Graph is a
-    # no-op, so this only has the effect of updating the Python state (usually
-    # _DefinedFunction.add_to_graph also adds the function to the TF_Graph).
-    #
-    # TODO(skyewm): fetch the TF_Functions directly from the TF_Graph
-    # TODO(skyewm): avoid sending serialized FunctionDefs back to the TF_Graph
-    if graph_def.library and graph_def.library.function:
-      # pylint: disable=protected-access
-      functions = function._from_library(graph_def.library)
-      for f in functions:
-        f.add_to_graph(graph)
-      # pylint: enable=protected-access
+      _ProcessNewOps(graph)
 
     # Treat input mappings that don't appear in the graph as an error, because
     # they are likely to be due to a typo.
diff --git a/tensorflow/python/framework/importer_test.py b/tensorflow/python/framework/importer_test.py
index bf5d9fe0936882c242198bdc7118f9f3a4e79260..6593b1718434fd2035133f65aa08b17774e9e806 100644
--- a/tensorflow/python/framework/importer_test.py
+++ b/tensorflow/python/framework/importer_test.py
@@ -680,6 +680,49 @@ class ImportGraphDefTest(test.TestCase):
           "list { s: 'loc:@imported_graph/A' }",
           b.node_def.attr["_class"])
 
+  def testColocationAndDevice(self):
+    # A and B are colocated, device set on A.
+    original_graph_def = self._MakeGraphDef("""
+          node { name: 'A' op: 'None' device: '/device:CPU:0' attr {
+            key: '_class'
+            value { list { s: 'loc:@A' } }
+          } }
+          node { name: 'B' op: 'None'  attr {
+            key: '_class'
+            value { list { s: 'loc:@A' } }
+          } }""")
+
+    with ops.Graph().as_default():
+      a, b = importer.import_graph_def(original_graph_def,
+                                       return_elements=["A", "B"],
+                                       name="")
+      self.assertEqual(a.device, "/device:CPU:0")
+      self.assertEqual(b.device, "/device:CPU:0")
+      self.assertEqual(a.colocation_groups(), [b"loc:@A"])
+      self.assertEqual(b.colocation_groups(), [b"loc:@A"])
+
+    # A and B are colocated, device set on B.
+    original_graph_def = self._MakeGraphDef("""
+          node { name: 'A' op: 'None' attr {
+            key: '_class'
+            value { list { s: 'loc:@A' } }
+          } }
+          node { name: 'B' op: 'None' device: '/device:CPU:0' attr {
+            key: '_class'
+            value { list { s: 'loc:@A' } }
+          } }""")
+
+    with ops.Graph().as_default():
+      a, b = importer.import_graph_def(original_graph_def,
+                                       return_elements=["A", "B"],
+                                       name="")
+      # TODO(skyewm): this behavior seems inconsistent with the above. Why is
+      # B's device ignored?
+      self.assertEqual(a.device, "")
+      self.assertEqual(b.device, "")
+      self.assertEqual(a.colocation_groups(), [b"loc:@A"])
+      self.assertEqual(b.colocation_groups(), [b"loc:@A"])
+
   def testColocationWithDeviceFn(self):
     original_graph_def = self._MakeGraphDef("""
           node { name: 'A' op: 'None' attr {
diff --git a/tensorflow/python/framework/meta_graph.py b/tensorflow/python/framework/meta_graph.py
index 4c1bd736d727e974375ad9008a579361137fb9d6..4bb9941bb778ae9b4022ef9d376cb031223ddb1c 100644
--- a/tensorflow/python/framework/meta_graph.py
+++ b/tensorflow/python/framework/meta_graph.py
@@ -695,7 +695,7 @@ def import_scoped_meta_graph(meta_graph_or_file,
   Raises:
     ValueError: If the graph_def contains unbound inputs.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise ValueError("Exporting/importing meta graphs is not supported when "
                      "eager execution is enabled.")
   if isinstance(meta_graph_or_file, meta_graph_pb2.MetaGraphDef):
@@ -856,7 +856,7 @@ def export_scoped_meta_graph(filename=None,
   Raises:
     ValueError: When the `GraphDef` is larger than 2GB.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise ValueError("Exporting/importing meta graphs is not supported when "
                      "Eager Execution is enabled.")
   graph = graph or ops.get_default_graph()
diff --git a/tensorflow/python/framework/meta_graph_test.py b/tensorflow/python/framework/meta_graph_test.py
index 19dcd6a1b34741290b2578d93b79883c103fdb1b..21963d0beee398da8e90c2c829b2d4607ec6cc42 100644
--- a/tensorflow/python/framework/meta_graph_test.py
+++ b/tensorflow/python/framework/meta_graph_test.py
@@ -905,20 +905,6 @@ class ExportImportAcrossScopesTest(test.TestCase):
       with variable_scope.variable_scope("importA/keepA"):
         graph_fn(use_resource=use_resource)
 
-      if use_resource:
-        # Bringing in collections that contain ResourceVariables will adds ops
-        # to the graph the first time a variable is encountered, so mimic the
-        # same behavior.
-        seen_variables = set()
-        for collection_key in sorted([
-            ops.GraphKeys.GLOBAL_VARIABLES,
-            ops.GraphKeys.TRAINABLE_VARIABLES,
-        ]):
-          for var in expected_graph.get_collection(collection_key):
-            if var not in seen_variables:
-              var._read_variable_op()
-              seen_variables.add(var)
-
     result = meta_graph.export_scoped_meta_graph(graph=imported_graph)[0]
     expected = meta_graph.export_scoped_meta_graph(graph=expected_graph)[0]
 
diff --git a/tensorflow/python/framework/ops.py b/tensorflow/python/framework/ops.py
index 5a14ea417626265b8cee4bbec025f2bb9f5d4307..93edaa0cf0094c402196ec4ace9c69303d805bd3 100644
--- a/tensorflow/python/framework/ops.py
+++ b/tensorflow/python/framework/ops.py
@@ -63,6 +63,7 @@ from tensorflow.python.util.tf_export import tf_export
 # in code or via the environment variable. This will be removed once all
 # functionality is supported and there's no performance penalty with it enabled.
 _USE_C_API = os.getenv("TF_C_API_GRAPH_CONSTRUCTION", "0") is not "0"
+_USE_C_SHAPES = os.getenv("TF_C_API_GRAPH_CONSTRUCTION_SHAPES", "0") != "0"
 
 
 def tensor_id(tensor):
@@ -369,7 +370,7 @@ class Tensor(_TensorLike):
 
     """
     graph = self._op._graph._c_graph # pylint: disable=protected-access
-    if graph:
+    if graph and _USE_C_SHAPES:
       with errors.raise_exception_on_not_ok_status() as status:
         num_dims = c_api.TF_GraphGetTensorNumDims(graph, self._as_tf_output(),
                                                   status)
@@ -395,10 +396,10 @@ class Tensor(_TensorLike):
         "Tensor._shape cannot be assigned, use Tensor.set_shape instead.")
 
   def __iter__(self):
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       raise TypeError(
-          "`Tensor` objects are not iterable when eager execution is not "
-          "enabled. To iterate over this tensor use `tf.map_fn`.")
+          "Tensor objects are not iterable when eager execution is not "
+          "enabled. To iterate over this tensor use tf.map_fn.")
     shape = self._shape_tuple()
     if shape is None:
       raise TypeError("Cannot iterate over a tensor with unknown shape.")
@@ -466,9 +467,13 @@ class Tensor(_TensorLike):
       ValueError: If `shape` is not compatible with the current shape of
         this tensor.
     """
-    if not self._op._graph._c_graph:  # pylint: disable=protected-access # ASIM
+    if not _USE_C_SHAPES:
       self._shape_val = self._shape_val.merge_with(shape)
-      return
+
+    if not self._op._graph._c_graph: return  # pylint: disable=protected-access
+
+    # Update C shape even if _USE_C_SHAPES = False, since we still want
+    # set_shape to be reflected in the C API graph for when we run it.
     if not isinstance(shape, tensor_shape.TensorShape):
       shape = tensor_shape.TensorShape(shape)
     dim_list = []
@@ -772,7 +777,7 @@ class _EagerTensorBase(Tensor):
       six.raise_from(core._status_to_exception(e.code, e.message), None)
 
     # Record the copy on tape and define backprop copy as well.
-    if not context.in_graph_mode():
+    if context.executing_eagerly():
       self_device = self.device
       def grad_fun(dresult):
         return [dresult._copy(device_name=self_device)]
@@ -782,7 +787,11 @@ class _EagerTensorBase(Tensor):
 
   @property
   def shape(self):
-    return tensor_shape.TensorShape(self._shape_tuple())
+    if self._tensor_shape is None:  # pylint: disable=access-member-before-definition
+      # `_tensor_shape` is declared and defined in the definition of
+      # `EagerTensor`, in C.
+      self._tensor_shape = tensor_shape.TensorShape(self._shape_tuple())
+    return self._tensor_shape
 
   def get_shape(self):
     """Alias of Tensor.shape."""
@@ -829,41 +838,51 @@ class _EagerTensorBase(Tensor):
   def set_shape(self, shape):
     if not self.shape.is_compatible_with(shape):
       raise ValueError(
-          "EagerTensor's shape %s is not compatible with supplied shape %s" %
+          "Tensor's shape %s is not compatible with supplied shape %s" %
           (self.shape, shape))
 
   # Methods not supported / implemented for Eager Tensors.
   @property
   def op(self):
-    raise AttributeError("op not supported for Eager Tensors.")
+    raise AttributeError(
+        "Tensor.op is meaningless when eager execution is enabled.")
 
   @property
   def graph(self):
-    raise AttributeError("graph not supported for Eager Tensors.")
+    raise AttributeError(
+        "Tensor.graph is meaningless when eager execution is enabled.")
 
   @property
   def name(self):
-    raise AttributeError("name not supported for Eager Tensors.")
+    raise AttributeError(
+        "Tensor.name is meaningless when eager execution is enabled.")
 
   @property
   def value_index(self):
-    raise AttributeError("value_index not supported for Eager Tensors.")
+    raise AttributeError(
+        "Tensor.value_index is meaningless when eager execution is enabled.")
 
   def consumers(self):
-    raise NotImplementedError("consumers not supported for Eager Tensors.")
+    raise NotImplementedError(
+        "Tensor.consumers is meaningless when eager execution is enabled.")
 
   def _add_consumer(self, consumer):
-    raise NotImplementedError("_add_consumer not supported for Eager Tensors.")
+    raise NotImplementedError(
+        "_add_consumer not supported when eager execution is enabled.")
 
   def _as_node_def_input(self):
     raise NotImplementedError(
-        "_as_node_def_input not supported for Eager Tensors.")
+        "_as_node_def_input not supported when eager execution is enabled.")
 
   def _as_tf_output(self):
-    raise NotImplementedError("_as_tf_output not supported for Eager Tensors.")
+    raise NotImplementedError(
+        "_as_tf_output not supported when eager execution is enabled.")
 
   def eval(self, feed_dict=None, session=None):
-    raise NotImplementedError("eval not supported for Eager Tensors.")
+    raise NotImplementedError(
+        "eval is not supported when eager execution is enabled, "
+        "is .numpy() what you're looking for?"
+    )
 
 
 # This call creates an EagerTensor class, as a subclass of _EagerTensorBase, and
@@ -989,7 +1008,7 @@ def internal_convert_to_tensor(value,
 
   """
   if ctx is None: ctx = context.context()
-  if ctx.in_eager_mode():
+  if ctx.executing_eagerly():
     # Fast path for EagerTensors that don't need any conversion.
     if isinstance(value, EagerTensor):
       # Note that we don't check that value's dtype matches the dtype
@@ -1897,7 +1916,8 @@ class Operation(object):
     tensor._add_consumer(self)  # pylint: disable=protected-access
     self._recompute_node_def()
 
-  def _update_input(self, index, tensor):
+  # TODO(skyewm): Remove `update_dtype` when we enable the C API.
+  def _update_input(self, index, tensor, update_dtype=True):
     """Update the input to this operation at the given index.
 
     NOTE: This is for TF internal use only. Please don't use it.
@@ -1905,6 +1925,7 @@ class Operation(object):
     Args:
       index: the index of the input to update.
       tensor: the Tensor to be used as the input at the given index.
+      update_dtype: If `False`, the type for this input is not updated.
 
     Raises:
       TypeError: if tensor is not a Tensor,
@@ -1924,7 +1945,8 @@ class Operation(object):
     else:
       self._inputs_val[index].consumers().remove(self)
       self._inputs_val[index] = tensor
-      self._input_types_val[index] = tensor.dtype
+      if update_dtype:
+        self._input_types_val[index] = tensor.dtype
       tensor._add_consumer(self)  # pylint: disable=protected-access
       self._recompute_node_def()
 
@@ -2486,7 +2508,7 @@ def _set_shapes_for_outputs(op):
 
 def set_shapes_for_outputs(op):
   """Set the shapes for op's outputs."""
-  if op._c_op:  # pylint: disable=protected-access
+  if op._c_op and _USE_C_SHAPES:  # pylint: disable=protected-access
     return _set_shapes_for_outputs_c_api(op)
   else:
     return _set_shapes_for_outputs(op)
@@ -2690,21 +2712,24 @@ class Graph(object):
 
   def __init__(self):
     """Creates a new, empty Graph."""
-    # Protects the core state that may be accessed by multiple readers.
-    # Only state that can be returned via public accessors (`as_graph_def()`,
-    # `get_operations()`, `as_graph_element()`, `get_collection()`, and
-    # `get_collection_ref()`) is by the lock. Thread-safety is provided on a
-    # best-effort basis to support buggy programs, and is not guaranteed by the
-    # public `tf.Graph` API.
+    # Protects core state that can be returned via public accessors, as well as
+    # synchronizes Session.run calls with methods that create and mutate ops
+    # (e.g. Graph.create_op()). This synchronization is necessary because it's
+    # illegal to modify an operation after it's been run. Thread-safety is
+    # provided on a best-effort basis to support buggy programs, and is not
+    # guaranteed by the public `tf.Graph` API.
+    #
+    # The lock must be reentrant because create_op can be called recursively due
+    # to control flow. Without a reentrant lock, many methods would also need a
+    # "locked" version or parameter (including generated code).
+    #
     # NOTE(mrry): This does not protect the various stacks. A warning will
     # be reported if these are used from multiple threads
-    self._lock = threading.Lock()
+    self._lock = threading.RLock()
     self._nodes_by_id = dict()  # GUARDED_BY(self._lock)
     self._next_id_counter = 0  # GUARDED_BY(self._lock)
     self._nodes_by_name = dict()  # GUARDED_BY(self._lock)
     self._version = 0  # GUARDED_BY(self._lock)
-    # Current name stack: uniquified names
-    self._name_stack = ""
     # Maps a name used in the graph to the next id to use for that name.
     self._names_in_use = {}
     self._stack_state_is_thread_local = False
@@ -2776,7 +2801,6 @@ class Graph(object):
       c_api.SetRequireShapeInferenceFns(self._c_graph, False)
     else:
       self._scoped_c_graph = None
-    self._variable_creator_stack = []
 
   # TODO(apassos) remove once the C API is used by default.
   def _use_c_api_hack(self):
@@ -2817,17 +2841,26 @@ class Graph(object):
   # frozen, and this functionality is still not ready for public visibility.
   @tf_contextlib.contextmanager
   def _variable_creator_scope(self, creator):
+    # This step makes a copy of the existing stack, and it also initializes
+    # self._thread_local._variable_creator_stack if it doesn't exist yet.
     old = list(self._variable_creator_stack)
-    self._variable_creator_stack.append(creator)
+    self._thread_local._variable_creator_stack.append(creator)
     try:
       yield
     finally:
-      self._variable_creator_stack = old
+      self._thread_local._variable_creator_stack = old
 
   # Note: this method is private because the API of tf.Graph() is public and
   # frozen, and this functionality is still not ready for public visibility.
-  def _get_variable_creator_stack(self):
-    return list(self._variable_creator_stack)
+  @property
+  def _variable_creator_stack(self):
+    if not hasattr(self._thread_local, "_variable_creator_stack"):
+      self._thread_local._variable_creator_stack = []
+    return list(self._thread_local._variable_creator_stack)
+
+  @_variable_creator_stack.setter
+  def _variable_creator_stack(self, variable_creator_stack):
+    self._thread_local._variable_creator_stack = variable_creator_stack
 
   def _extract_stack(self):
     """A lightweight, extensible re-implementation of traceback.extract_stack.
@@ -3259,17 +3292,34 @@ class Graph(object):
 
     input_ops = set([t.op for t in inputs])
     control_inputs = self._control_dependencies_for_inputs(input_ops)
-    ret = Operation(
-        node_def,
-        self,
-        inputs=inputs,
-        output_types=dtypes,
-        control_inputs=control_inputs,
-        input_types=input_types,
-        original_op=self._default_original_op,
-        op_def=op_def)
-    self._create_op_helper(ret, compute_shapes=compute_shapes,
-                           compute_device=compute_device)
+    # _create_op_helper mutates the new Operation. _lock ensures a Session.run
+    # call cannot occur between creating and mutating the op.
+    with self._lock:
+      ret = Operation(
+          node_def,
+          self,
+          inputs=inputs,
+          output_types=dtypes,
+          control_inputs=control_inputs,
+          input_types=input_types,
+          original_op=self._default_original_op,
+          op_def=op_def)
+
+      # TODO(vrv): Instead of eagerly filling in shape property for every op,
+      # only populate the shape when requested.
+      #
+      # TODO(skyewm): unlike in the original Python implementation, the C API
+      # always computes shape information (even for function calls, which the
+      # original Python shape inference code doesn't handle). Deprecate the
+      # compute_shapes argument.
+      #
+      # TODO(b/74620627): move this back to _create_op_helper once _USE_C_SHAPES
+      # is removed.
+      if (ret._c_op and _USE_C_SHAPES) or compute_shapes:  # pylint: disable=protected-access
+        set_shapes_for_outputs(ret)
+
+      self._create_op_helper(ret, compute_shapes=compute_shapes,
+                             compute_device=compute_device)
     return ret
 
   def _create_op_from_tf_operation(self, c_op, compute_device=True):
@@ -3301,15 +3351,6 @@ class Graph(object):
 
   def _create_op_helper(self, op, compute_shapes=True, compute_device=True):
     """Common logic for creating an op in this graph."""
-    # TODO(vrv): Instead of eagerly filling in shape property for every op, only
-    # populate the shape when requested.
-    #
-    # TODO(skyewm): unlike in the original Python implementation, the C API
-    # always computes shape information (even for function calls, which the
-    # original Python shape inference code doesn't handle). Deprecate the
-    # compute_shapes argument.
-    if op._c_op or compute_shapes:  # pylint: disable=protected-access
-      set_shapes_for_outputs(op)
     # TODO(b/XXXX): move to Operation.__init__ once _USE_C_API flag is removed.
     self._add_op(op)
 
@@ -3414,6 +3455,12 @@ class Graph(object):
     ]
 
     for op in new_ops:
+      # Operations created by the C API always retrieve shapes from the C API so
+      # we preserve the shapes of ops created in import_graph_def (from the
+      # "_output_shapes" attr of the imported NodeDef).
+      # TODO(b/74620627): move this back to _create_op_helper once _USE_C_SHAPES
+      # is removed.
+      _set_shapes_for_outputs_c_api(op)
       new_control_inputs = self._control_dependencies_for_inputs(op.inputs)
       # pylint: disable=protected-access
       op._add_control_inputs(new_control_inputs)
@@ -3861,6 +3908,17 @@ class Graph(object):
     finally:
       self._default_original_op = old_original_op
 
+  @property
+  def _name_stack(self):
+    # This may be called from a thread where name_stack doesn't yet exist.
+    if not hasattr(self._thread_local, "_name_stack"):
+      self._thread_local._name_stack = ""
+    return self._thread_local._name_stack
+
+  @_name_stack.setter
+  def _name_stack(self, name_stack):
+    self._thread_local._name_stack = name_stack
+
   # pylint: disable=g-doc-return-or-yield,line-too-long
   @tf_contextlib.contextmanager
   def name_scope(self, name):
@@ -4777,15 +4835,15 @@ def device(device_name_or_function):
   Raises:
     RuntimeError: If eager execution is enabled and a function is passed in.
   """
-  if context.in_graph_mode():
-    return get_default_graph().device(device_name_or_function)
-  else:
+  if context.executing_eagerly():
     # TODO(agarwal): support device functions in EAGER mode.
     if callable(device_name_or_function):
       raise RuntimeError(
           "tf.device does not support functions when eager execution "
           "is enabled.")
     return context.device(device_name_or_function)
+  else:
+    return get_default_graph().device(device_name_or_function)
 
 
 @tf_export("container")
@@ -4804,13 +4862,20 @@ def container(container_name):
 
 @tf_export("colocate_with")
 def colocate_with(op, ignore_existing=False):
-  if context.in_graph_mode():
-    return get_default_graph().colocate_with(op, ignore_existing)
-  else:
+  if context.executing_eagerly():
     if op is not None:
       return device(op.device)
     else:
       return _NullContextmanager()
+  else:
+    default_graph = get_default_graph()
+    if isinstance(op, EagerTensor):
+      if default_graph.building_function:
+        op = internal_convert_to_tensor(op)
+      else:
+        raise ValueError("Encountered an Eager-defined Tensor during graph "
+                         "construction, but a function was not being built.")
+    return default_graph.colocate_with(op, ignore_existing)
 
 
 @tf_export("control_dependencies")
@@ -4820,20 +4885,29 @@ def control_dependencies(control_inputs):
   See @{tf.Graph.control_dependencies}
   for more details.
 
+  When eager execution is enabled, any callable object in the `control_inputs`
+  list will be called.
+
   Args:
     control_inputs: A list of `Operation` or `Tensor` objects which
       must be executed or computed before running the operations
       defined in the context.  Can also be `None` to clear the control
-      dependencies.
+      dependencies. If eager execution is enabled, any callable object in the
+      `control_inputs` list will be called.
 
   Returns:
    A context manager that specifies control dependencies for all
    operations constructed within the context.
   """
-  if context.in_graph_mode():
-    return get_default_graph().control_dependencies(control_inputs)
-  else:
+  if context.executing_eagerly():
+    if control_inputs:
+      # Execute any pending callables.
+      for control in control_inputs:
+        if callable(control):
+          control()
     return _NullContextmanager()
+  else:
+    return get_default_graph().control_dependencies(control_inputs)
 
 
 class _DefaultStack(threading.local):
@@ -5054,11 +5128,12 @@ class _DefaultGraphStack(_DefaultStack):  # pylint: disable=protected-access
   @tf_contextlib.contextmanager
   def get_controller(self, default):
     try:
-      context.context_stack.push(default.building_function, default.as_default)
+      context.context().context_switches.push(default.building_function,
+                                              default.as_default)
       with super(_DefaultGraphStack, self).get_controller(default) as g:
         yield g
     finally:
-      context.context_stack.pop()
+      context.context().context_switches.pop()
 
 
 _default_graph_stack = _DefaultGraphStack()
@@ -5084,87 +5159,130 @@ def init_scope():
         graph function. Here, a context is defined as either a graph or an eager
         context. Every context switch, i.e., every installation of a graph as
         the default graph and every switch into eager mode, is logged in a
-        thread-local stack called the `context_stack`; the log entry for a
+        thread-local stack called `context_switches`; the log entry for a
         context switch is popped from the stack when the context is exited.
-        Entering an `init_scope` is equivalent to crawling up the
-        `context_stack`, finding the first context that is not building a graph
-        function, and entering it. A caveat is that if graph mode is enabled
-        but the default graph stack is empty, then entering an `init_scope`
-        will simply install a fresh graph as the default one.
+        Entering an `init_scope` is equivalent to crawling up
+        `context_switches`, finding the first context that is not building a
+        graph function, and entering it. A caveat is that if graph mode is
+        enabled but the default graph stack is empty, then entering an
+        `init_scope` will simply install a fresh graph as the default one.
 
     (3) The gradient tape is paused while the scope is active.
   """
   # pylint: enable=g-doc-return-or-yield,line-too-long
 
-  in_graph_mode = context.in_graph_mode()
-  # Retrieve the active name scope: entering an `init_scope` preserves
-  # the name scope of the current context.
-  if in_graph_mode:
+  if context.executing_eagerly():
+    # Fastpath.
+    with tape.stop_recording():
+      yield
+  else:
+    # Retrieve the active name scope: entering an `init_scope` preserves
+    # the name scope of the current context.
     default_graph = get_default_graph()
     scope = default_graph.get_name_scope()
-  else:
-    scope = context.context().scope_name
-  if scope and scope[-1] != '/':
-    # Names that end with trailing slashes are treated by `name_scope` as
-    # absolute.
-    scope = scope + '/'
-
-  outer_context = None
-  if in_graph_mode and not _default_graph_stack.stack:
-    outer_context = default_graph.as_default
-  else:
-    for stack_entry in reversed(context.context_stack.stack):
-      if not stack_entry.is_building_function:
-        outer_context = stack_entry.enter_context_fn
-        break
+    if scope and scope[-1] != '/':
+      # Names that end with trailing slashes are treated by `name_scope` as
+      # absolute.
+      scope = scope + '/'
+
+    outer_context = None
+    if not _default_graph_stack.stack:
+      # If the default graph stack is empty, then we cannot be building a
+      # function. Install the global graph (which, in this case, is also the
+      # default graph) as the outer context.
+      if default_graph.building_function:
+        raise RuntimeError("The global graph is building a function.")
+      outer_context = default_graph.as_default
+    else:
+      # Find a context that is not building a function.
+      for stack_entry in reversed(context.context().context_switches.stack):
+        if not stack_entry.is_building_function:
+          outer_context = stack_entry.enter_context_fn
+          break
+
+      if outer_context is None:
+        # As a last resort, obtain the global default graph; this graph doesn't
+        # necessarily live on the graph stack (and hence it doesn't necessarily
+        # live on the context stack), but it is stored in the graph stack's
+        # encapsulating object.
+        outer_context = _default_graph_stack._GetGlobalDefaultGraph().as_default  # pylint: disable=protected-access
 
-  if outer_context is None:
-    raise AssertionError("All graphs are building functions, and no "
+    if outer_context is None:
+      # Sanity check; this shouldn't be triggered.
+      raise RuntimeError("All graphs are building functions, and no "
                          "eager context was previously active.")
 
-  try:
     with outer_context(), name_scope(scope), control_dependencies(
         None), tape.stop_recording():
       yield
-  finally:
-    pass
 
 
-def enable_eager_execution(config=None, device_policy=None):
-  """Enables, for the rest of the lifetime of this program, eager execution.
+@tf_export("enable_eager_execution")
+def enable_eager_execution(config=None, device_policy=None,
+                           execution_mode=None):
+  """Enables eager execution for the lifetime of this program.
 
-  If not called immediately on startup risks creating breakage and bugs.
+  Eager execution provides an imperative interface to TensorFlow. With eager
+  execution enabled, TensorFlow functions execute operations immediately (as
+  opposed to adding to a graph to be executed later in a @{tf.Session}) and
+  return concrete values (as opposed to symbolic references to a node in a
+  computational graph).
 
-  Example:
+  For example:
   ```python
-  tfe.enable_eager_execution()
+  tf.enable_eager_execution()
 
   # After eager execution is enabled, operations are executed as they are
-  # defined and `Tensor`s hold concrete values, which can be accessed as
-  # `numpy.ndarray`s through the `numpy()` method.
+  # defined and Tensor objects hold concrete values, which can be accessed as
+  # numpy.ndarray objects through the numpy() method.
   assert tf.multiply(6, 7).numpy() == 42
   ```
 
+  Eager execution cannot be enabled after TensorFlow APIs have been used to
+  create or execute graphs. It is typically recommended to invoke this function
+  at program startup and not in a library (as most libraries should be usable
+  both with and without eager execution).
+
   Args:
-    config: (Optional.) A `ConfigProto` protocol buffer with configuration
-     options for the Context. Note that a lot of these options may be
-     currently unimplemented or irrelevant when eager execution is enabled.
-    device_policy: (Optional.) What policy to use when trying to run an
-     operation on a device with inputs which are not on that device.
+    config: (Optional.) A @{tf.ConfigProto} to use to configure the environment
+      in which operations are executed. Note that @{tf.ConfigProto} is also
+      used to configure graph execution (via @{tf.Session}) and many options
+      within `tf.ConfigProto` are not implemented (or are irrelevant) when
+      eager execution is enabled.
+    device_policy: (Optional.) Policy controlling how operations requiring
+     inputs on a specific device (e.g., GPU 0) handle inputs on a different
+     device (e.g., GPU 1 or CPU). When set to None, an appropriate value will be
+     picked automatically. The value picked may change between TensorFlow
+     releases.
      Valid values:
-       tfe.DEVICE_PLACEMENT_EXPLICIT: raises an error if the placement is not
-         correct.
-       tfe.DEVICE_PLACEMENT_WARN: copies the tensors which are not on the
-         right device but raises a warning.
-       tfe.DEVICE_PLACEMENT_SILENT: silently copies the tensors. This might
-         hide performance problems.
-       tfe.DEVICE_PLACEMENT_SILENT_FOR_INT32: silently copies int32 tensors,
-         raising errors on the other ones.
+
+      - tf.contrib.eager.DEVICE_PLACEMENT_EXPLICIT: raises an error if the
+        placement is not correct.
+
+      - tf.contrib.eager.DEVICE_PLACEMENT_WARN: copies the tensors which are not
+        on the right device but logs a warning.
+
+      - tf.contrib.eager.DEVICE_PLACEMENT_SILENT: silently copies the tensors.
+        Note that this may hide performance problems as there is no notification
+        provided when operations are blocked on the tensor being copied between
+        devices.
+
+      - tf.contrib.eager.DEVICE_PLACEMENT_SILENT_FOR_INT32: silently copies
+        int32 tensors, raising errors on the other ones.
+    execution_mode: (Optional.) Policy controlling how operations dispatched are
+      actually executed. When set to None, an appropriate value will be picked
+      automatically. The value picked may change between TensorFlow releases.
+      Valid values:
+
+        - tf.contrib.eager.SYNC: executes each operation synchronously.
+
+        - tf.contrib.eager.ASYNC: executes each operation asynchronously. These
+          operations may return "non-ready" handles.
 
   Raises:
-    ValueError: If trying to create a context after using graph operations
-     or if trying to create a context with nontrivial options which differ
-     from those of the existing context.
+    ValueError: If eager execution is enabled after creating/executing a
+     TensorFlow graph, or if options provided conflict with a previous call
+     to this function.
   """
   if config is not None and not isinstance(config, config_pb2.ConfigProto):
     raise TypeError(
@@ -5174,8 +5292,12 @@ def enable_eager_execution(config=None, device_policy=None):
                            context.DEVICE_PLACEMENT_SILENT,
                            context.DEVICE_PLACEMENT_SILENT_FOR_INT32):
     raise ValueError(
-        "device_policy must be one of None, tfe.DEVICE_PLACEMENT_*"
+        "device_policy must be one of None, tf.contrib.eager.DEVICE_PLACEMENT_*"
     )
+  if execution_mode not in (None, context.SYNC, context.ASYNC):
+    raise ValueError(
+        "execution_mode must be one of None, tf.contrib.eager.SYNC, "
+        "tf.contrib.eager.ASYNC")
   # pylint: disable=protected-access
   if context._default_mode == context.GRAPH_MODE:
     graph_mode_has_been_used = (
@@ -5183,30 +5305,29 @@ def enable_eager_execution(config=None, device_policy=None):
         _default_graph_stack._global_default_graph is not None)
     if graph_mode_has_been_used:
       raise ValueError(
-          "tfe.enable_eager_execution has to be called at program startup.")
+          "tf.enable_eager_execution must be called at program startup.")
   context._default_mode = context.EAGER_MODE
   if context._context is None:
-    context._context = context.Context(config=config,
-                                       device_policy=device_policy)
-    if context.context_stack.stack:
-      raise AssertionError("Invariant violated: The context stack must "
-                           "be empty when eager execution is enabled.")
-    # Log that eager execution has been enabled by pushing an entry onto the
-    # context stack; this entry won't ever be popped, as it's impossible to
-    # disable eager execution
-    context.context_stack.push(False, context.eager_mode)
-  elif ((config is not None and config is not context._context._config)
-        or (device_policy is not None
-            and device_policy is not context._context._device_policy)):
+    context._context = context.Context(
+        config=config,
+        device_policy=device_policy,
+        execution_mode=execution_mode)
+  elif ((config is not None and config is not context._context._config) or
+        (device_policy is not None and
+         device_policy is not context._context._device_policy) or
+        (execution_mode is not None and
+         execution_mode is not context._context._execution_mode)):
     raise ValueError("Trying to change the options of an active eager"
                      " execution. Context config: %s, specified config:"
-                     " %s. Context device policy: %s; specified device"
-                     " policy: %s." % (config, context._context._config,
-                                       device_policy,
-                                       context._context._device_policy))
+                     " %s. Context device policy: %s, specified device"
+                     " policy: %s. Context execution mode: %s,"
+                     " specified execution mode: %s." %
+                     (context._context._config, config,
+                      context._context._device_policy, device_policy,
+                      context._context._execution_mode, execution_mode))
   else:
     raise ValueError(
-        "tfe.enable_eager_execution has to be called at program startup.")
+        "tf.enable_eager_execution must be called at program startup.")
 
 
 def eager_run(main=None, argv=None):
@@ -5290,6 +5411,8 @@ def get_name_scope():
   Returns:
     A string representing the current name scope.
   """
+  if context.executing_eagerly():
+    return context.context().scope_name.rstrip("/")
   return get_default_graph().get_name_scope()
 
 
@@ -5544,7 +5667,7 @@ def add_to_collection(name, value):
   """
   get_default_graph().add_to_collection(name, value)
 
-
+@tf_export("add_to_collections")
 def add_to_collections(names, value):
   """Wrapper for `Graph.add_to_collections()` using the default graph.
 
@@ -5666,7 +5789,7 @@ class name_scope(object):  # pylint: disable=invalid-name
     self._default_name = default_name
     self._values = values
     self._ctx = context.context()
-    self._in_eager_mode = self._ctx.in_eager_mode()
+    self._in_eager_mode = self._ctx.executing_eagerly()
 
   def __enter__(self):
     """Start the scope block.
@@ -5845,10 +5968,11 @@ def get_from_proto_function(collection_name):
 
 
 def _assert_collection_is_ok(collection_name):
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     if collection_name in GraphKeys._VARIABLE_COLLECTIONS:  # pylint: disable=protected-access
-      raise ValueError("When Eager Execution is enabled, variable "
-                       "collections are not supported.")
+      raise ValueError(
+          "variable collections are not supported when eager execution is enabled."
+      )
 
 
 def _operation_conversion_error(op, dtype=None, name=None, as_ref=False):
diff --git a/tensorflow/python/framework/ops_test.py b/tensorflow/python/framework/ops_test.py
index a141fe6340c70efde84411db4efb1f80cb0a61c5..aa51391871f4c12d34b86311cc5b8ea9aabd5434 100644
--- a/tensorflow/python/framework/ops_test.py
+++ b/tensorflow/python/framework/ops_test.py
@@ -763,6 +763,7 @@ class CreateOpFromTFOperationTest(test_util.TensorFlowTestCase):
     self.assertEqual(g.get_operation_by_name("myop"), op)
     self.assertEqual(g.get_tensor_by_name("myop:0"), op.outputs[0])
 
+  @test_util.enable_c_shapes
   def testShape(self):
     g = ops.Graph()
     with g.as_default():
@@ -1555,6 +1556,35 @@ class MultithreadedGraphStateTest(test_util.TensorFlowTestCase):
              input: "^ColocateWithMe_2" }
     """, gd)
 
+  def testNameStack(self):
+
+    class NameSettingThread(self.TestThread):
+
+      def run(self):
+        with g.name_scope("foo"):
+          op1 = g.create_op("FloatOutput", [], [dtypes.float32])
+          self.has_mutated_graph.set()
+          self.should_continue.wait()
+          self.should_continue.clear()
+          op2 = g.create_op("FloatOutput", [], [dtypes.float32])
+          self.result = (op1, op2)
+
+    g = ops.Graph()
+    threads = [NameSettingThread(g, i) for i in range(3)]
+    for t in threads:
+      t.start()
+      t.has_mutated_graph.wait()
+      t.has_mutated_graph.clear()
+
+    for t in threads:
+      t.should_continue.set()
+      t.join()
+
+    suffixes = ["", "_1", "_2"]
+    for t, s in zip(threads, suffixes):
+      self.assertEquals("foo" + s + "/FloatOutput", t.result[0].name)
+      self.assertEquals("foo" + s + "/FloatOutput_1", t.result[1].name)
+
 
 @test_util.with_c_api
 class ObjectWithName(object):
@@ -1763,7 +1793,13 @@ class ControlDependenciesTest(test_util.TensorFlowTestCase):
       return constant_op.constant(2.0)
     future.calls = 0
 
-    if context.in_graph_mode():
+    if context.executing_eagerly():
+      a = constant_op.constant(1.0)
+      b = future
+      with ops.control_dependencies([a, b]):
+        c = constant_op.constant(3.0)
+      self.assertEqual(future.calls, 1)
+    else:
       g = ops.Graph()
       with g.as_default():
         a = constant_op.constant(1.0)
@@ -1772,12 +1808,6 @@ class ControlDependenciesTest(test_util.TensorFlowTestCase):
           c = constant_op.constant(3.0)
       self.assertEqual(c.op.control_inputs, [a.op, b.op])
       self.assertEqual(future.calls, 1)
-    else:
-      a = constant_op.constant(1.0)
-      b = future()
-      with ops.control_dependencies([a, b]):
-        c = constant_op.constant(3.0)
-      self.assertEqual(future.calls, 1)
 
   def testBasicWithConversion(self):
     g = ops.Graph()
@@ -2150,19 +2180,11 @@ class InitScopeTest(test_util.TensorFlowTestCase):
           with ops.init_scope():
             # Because g is building a function, init_scope should
             # escape out to the eager context.
-            self.assertTrue(context.in_eager_mode())
+            self.assertTrue(context.executing_eagerly())
           # g should be reinstated as the default graph, and the
           # graph context should be re-entered.
           self.assertIs(g, ops.get_default_graph())
-          self.assertTrue(context.in_graph_mode())
-
-  def testAllGraphsBuildingFunctionsRaisesError(self):
-    g = ops.Graph()
-    g._building_function = True  # pylint: disable=protected-access
-    with g.as_default():
-      with self.assertRaises(AssertionError):
-        with ops.init_scope():
-          pass
+          self.assertFalse(context.executing_eagerly())
 
   def testStaysInEagerWhenOnlyEagerContextActive(self):
     with context.eager_mode():
@@ -2241,6 +2263,29 @@ class InitScopeTest(test_util.TensorFlowTestCase):
       self.assertEqual(4, int(compiled_outer(inner=compiled_inner)))
       self.assertEqual(7, int(compiled_outer(inner=compiled_inner)))
 
+  def testFallsBackToGlobalGraphWhenAllGraphsAreBuildingFunctions(self):
+    with context.graph_mode():
+      ops.reset_default_graph()
+      # This doesn't push anything onto the graph stack, but it does
+      # set the stack's global graph.
+      global_graph = ops.get_default_graph()
+      fn_graph = ops.Graph()
+
+      # pylint: disable=protected-access
+      fn_graph._building_function = True
+      self.assertEqual(len(ops._default_graph_stack.stack), 0)
+      with fn_graph.as_default():
+        self.assertEqual(len(ops._default_graph_stack.stack), 1)
+        with ops.init_scope():
+          self.assertGreater(len(ops._default_graph_stack.stack), 1)
+          dummy = constant_op.constant(1.0)
+        self.assertEqual(len(ops._default_graph_stack.stack), 1)
+      # Note that the global graph is _not_ on the graph stack.
+      self.assertEqual(len(ops._default_graph_stack.stack), 0)
+      # Ensure that `dummy` was added to the global graph.
+      self.assertEqual(global_graph, dummy.graph)
+      # pylint: enable=protected-access
+
   def testInstallsDefaultGraphWhenGraphStackIsEmptyInGraphMode(self):
     with context.graph_mode():
       # pylint: disable=protected-access
@@ -2262,12 +2307,13 @@ class InitScopeTest(test_util.TensorFlowTestCase):
     with context.eager_mode():
       def foo():
         with ops.name_scope("inner"), ops.init_scope():
-          if context.in_graph_mode():
-            self.assertEqual(ops.get_name_scope(), "inner")
-          else:
+          if context.executing_eagerly():
             # A trailing slash is always appended when eager execution is
             # enabled.
             self.assertEqual(context.context().scope_name, "inner/")
+          else:
+            self.assertEqual(ops.get_name_scope(), "inner")
+
       foo()
       self.assertEqual(ops.get_name_scope(), "")
       foo_compiled = eager_function.defun(foo)
@@ -2877,7 +2923,7 @@ class OutputTypesTest(test_util.TensorFlowTestCase):
     with g.as_default():
       x = constant_op.constant([1, 1, 2, 4, 4, 4, 7, 8, 8],
                                dtype=dtypes.double)
-      y, _ = gen_array_ops._unique(x)
+      y, _ = gen_array_ops.unique(x)
       self.assertEqual([types_pb2.DT_DOUBLE, types_pb2.DT_INT32],
                        y.op._output_types)  # pylint: disable=protected-access
 
@@ -2902,6 +2948,9 @@ class EnableEagerExecutionTest(test_util.TensorFlowTestCase):
     with self.assertRaisesRegexp(ValueError, "device_policy must be one of"):
       c = config_pb2.ConfigProto()
       ops.enable_eager_execution(c, c)
+    with self.assertRaisesRegexp(ValueError, "execution_mode must be one of"):
+      c = config_pb2.ConfigProto()
+      ops.enable_eager_execution(c, execution_mode=c)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/python/framework/python_op_gen.cc b/tensorflow/python/framework/python_op_gen.cc
index c95149d177990e364c3d6b9daeae5dc535cf0070..9850f0becc69ff1f53b70f0ad2296aead8b5152c 100644
--- a/tensorflow/python/framework/python_op_gen.cc
+++ b/tensorflow/python/framework/python_op_gen.cc
@@ -75,6 +75,33 @@ bool IsPythonReserved(const string& s) {
   return kPythonReserved->count(s) > 0;
 }
 
+bool IsOpWithUnderscorePrefix(const string& s) {
+  static const std::set<string>* const kUnderscoreOps = new std::set<string>(
+      {// Lowercase built-in functions and types in Python, from:
+       // [x for x in dir(__builtins__) if x[0].islower()] except "round".
+       // These need to be excluded so they don't conflict with actual built-in
+       // functions since we use '*' imports.
+       "abs", "all", "any", "apply", "bin", "bool", "buffer", "bytearray",
+       "bytes", "callable", "chr", "classmethod", "cmp", "coerce", "compile",
+       "complex", "copyright", "credits", "delattr", "dict", "dir", "divmod",
+       "enumerate", "eval", "execfile", "exit", "file", "filter", "float",
+       "format", "frozenset", "getattr", "globals", "hasattr", "hash", "help",
+       "hex", "id", "input", "int", "intern", "isinstance", "issubclass",
+       "iter", "len", "license", "list", "locals", "long", "map", "max",
+       "memoryview", "min", "next", "object", "oct", "open", "ord", "pow",
+       "print", "property", "quit", "range", "raw_input", "reduce", "reload",
+       "repr", "reversed", "set", "setattr", "slice", "sorted", "staticmethod",
+       "str", "sum", "super", "tuple", "type", "unichr", "unicode", "vars",
+       "xrange", "zip",
+       // These have the same name as ops defined in Python and might be used
+       // incorrectly depending on order of '*' imports.
+       // TODO(annarev): reduce usage of '*' imports and remove these from the
+       // list.
+       "fused_batch_norm", "histogram_fixed_width", "stack",
+       "batch_norm_with_global_normalization"});
+  return kUnderscoreOps->count(s) > 0;
+}
+
 string AvoidPythonReserved(const string& s) {
   if (IsPythonReserved(s)) return strings::StrCat(s, "_");
   return s;
@@ -816,6 +843,7 @@ from tensorflow.python.util.tf_export import tf_export
     // An op is hidden if either its ApiDef visibility is HIDDEN
     // or it is in the hidden_ops list.
     bool is_hidden = api_def->visibility() == ApiDef::HIDDEN;
+    bool hidden_by_api_def = is_hidden;
     if (!is_hidden) {
       for (const string& hidden : hidden_ops) {
         if (op_def.name() == hidden) {
@@ -828,13 +856,22 @@ from tensorflow.python.util.tf_export import tf_export
     string function_name;
     python_op_gen_internal::GenerateLowerCaseOpName(op_def.name(),
                                                     &function_name);
-    if (is_hidden) function_name = strings::StrCat("_", function_name);
-
-    // When users create custom python wrappers, they may link in the
-    // default op registry by accident, and because they can't
-    // enumerate all 'hidden' symbols, this guard is to prevent
-    // instantiating a python reserved word in their wrapper.
-    if (python_op_gen_internal::IsPythonReserved(function_name)) {
+    bool is_reserved = python_op_gen_internal::IsPythonReserved(function_name);
+
+    // Prefix an op with an underscore if it is listed in hidden_ops, its name
+    // is reserved, or it is one of the exceptions in IsOpWithUnderscorePrefix.
+    // Otherwise, do not add underscores to ops set to HIDDEN in ApiDef.
+    // TODO(annarev): don't prefix with underscores even if op is in hidden_ops.
+    if (is_hidden) {
+      if (!hidden_by_api_def || is_reserved ||
+          python_op_gen_internal::IsOpWithUnderscorePrefix(function_name)) {
+        function_name = strings::StrCat("_", function_name);
+      }
+    } else if (is_reserved) {
+      // When users create custom python wrappers, they may link in the
+      // default op registry by accident, and because they can't
+      // enumerate all 'hidden' symbols, this guard is to prevent
+      // instantiating a python reserved word in their wrapper.
       continue;
     }
 
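The naming rule introduced in this hunk can be read as a small decision table. The sketch below restates it in Python as an illustration only: `generated_name`, `PYTHON_RESERVED`, and `UNDERSCORE_OPS` are hypothetical names, and the two sets are heavily abbreviated stand-ins for what `IsPythonReserved` and `IsOpWithUnderscorePrefix` actually check.

```python
# Hypothetical, abbreviated stand-ins for the sets the generator consults.
PYTHON_RESERVED = {"print", "lambda", "assert"}
UNDERSCORE_OPS = {"abs", "stack", "fused_batch_norm"}


def generated_name(name, hidden_by_api_def, in_hidden_ops):
    """Return the wrapper name for an op, or None if no wrapper is emitted."""
    is_hidden = hidden_by_api_def or in_hidden_ops
    is_reserved = name in PYTHON_RESERVED
    if is_hidden:
        # Hidden ops keep a leading underscore only when they come from the
        # hidden_ops list, clash with a reserved word, or are listed exceptions.
        if not hidden_by_api_def or is_reserved or name in UNDERSCORE_OPS:
            return "_" + name
        return name
    if is_reserved:
        # Visible ops whose names clash with reserved words are skipped.
        return None
    return name
```

Under these assumptions, an op hidden only through its ApiDef (for example `unique`) keeps its plain name, while an op listed in `hidden_ops` still gets the underscore.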
diff --git a/tensorflow/python/framework/python_op_gen_internal.h b/tensorflow/python/framework/python_op_gen_internal.h
index 4319e5a7820b33283df8153fdc76e0e567813a17..e0cfb05f4bdf8afd09957c62a9ba3af1fd0882a6 100644
--- a/tensorflow/python/framework/python_op_gen_internal.h
+++ b/tensorflow/python/framework/python_op_gen_internal.h
@@ -29,6 +29,9 @@ namespace python_op_gen_internal {
 // Returns true if s is a Python keyword or built-in.
 bool IsPythonReserved(const string& s);
 
+// Whether the op should be prefixed with underscore.
+bool IsOpWithUnderscorePrefix(const string& s);
+
 // Add a _ to the end of s if necessary to avoid a Python keyword or built-in.
 string AvoidPythonReserved(const string& s);
 
diff --git a/tensorflow/python/framework/random_seed.py b/tensorflow/python/framework/random_seed.py
index 1e74a790a3fb0c72b7c0fb1127ffac95f386d85e..b724432e00b0d11de86a0fff9ff31758ad36479f 100644
--- a/tensorflow/python/framework/random_seed.py
+++ b/tensorflow/python/framework/random_seed.py
@@ -52,20 +52,20 @@ def get_seed(op_seed):
     A tuple of two integers that should be used for the local seed of this
     operation.
   """
-  is_graph_mode = context.in_graph_mode()
+  eager = context.executing_eagerly()
 
-  if is_graph_mode:
-    global_seed = ops.get_default_graph().seed
-  else:
+  if eager:
     global_seed = context.global_seed()
+  else:
+    global_seed = ops.get_default_graph().seed
 
   if global_seed is not None:
     if op_seed is None:
       # pylint: disable=protected-access
-      if is_graph_mode:
-        op_seed = ops.get_default_graph()._last_id
-      else:
+      if eager:
         op_seed = context.internal_operation_seed()
+      else:
+        op_seed = ops.get_default_graph()._last_id
 
     seeds = _truncate_seed(global_seed), _truncate_seed(op_seed)
   else:
@@ -176,7 +176,7 @@ def set_random_seed(seed):
   Args:
     seed: integer.
   """
-  if context.in_graph_mode():
-    ops.get_default_graph().seed = seed
-  else:
+  if context.executing_eagerly():
     context.set_global_seed(seed)
+  else:
+    ops.get_default_graph().seed = seed
diff --git a/tensorflow/python/framework/random_seed_test.py b/tensorflow/python/framework/random_seed_test.py
index b4c98ab8b289c850c6171425167bb17606a4162d..194492268631abfa911bd45f13a302c09a2c8bda 100644
--- a/tensorflow/python/framework/random_seed_test.py
+++ b/tensorflow/python/framework/random_seed_test.py
@@ -40,13 +40,13 @@ class RandomSeedTest(test.TestCase):
         ((2**31 - 1, 0), (0, 2**31 - 1)),  # Don't wrap to (0, 0) either
         ((0, 2**31 - 1), (0, 2**31 - 1)),  # Wrapping for the other argument
     ]
-    if context.in_graph_mode():
-      # 0 will be the default_graph._lastid.
-      test_cases.append(((1, None), (1, 0)))
-    else:
+    if context.executing_eagerly():
       # operation seed is random number generated based on global seed.
       # it's not tested due to possibility of platform or version difference.
       pass
+    else:
+      # 0 will be the default_graph._last_id.
+      test_cases.append(((1, None), (1, 0)))
     for tc in test_cases:
       tinput, toutput = tc[0], tc[1]
       random_seed.set_random_seed(tinput[0])
diff --git a/tensorflow/python/framework/smart_cond.py b/tensorflow/python/framework/smart_cond.py
new file mode 100644
index 0000000000000000000000000000000000000000..c7ff23e4ff809ed7bc57259fa3ec9feb921b5a71
--- /dev/null
+++ b/tensorflow/python/framework/smart_cond.py
@@ -0,0 +1,123 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""smart_cond and related utilities."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python import pywrap_tensorflow as c_api
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import control_flow_ops
+
+
+def smart_cond(pred, true_fn=None, false_fn=None, name=None):
+  """Return either `true_fn()` if predicate `pred` is true else `false_fn()`.
+
+  If `pred` is a bool or has a constant value, we return either `true_fn()`
+  or `false_fn()`, otherwise we use `tf.cond` to dynamically route to both.
+
+  Arguments:
+    pred: A scalar determining whether to return the result of `true_fn` or
+      `false_fn`.
+    true_fn: The callable to be performed if pred is true.
+    false_fn: The callable to be performed if pred is false.
+    name: Optional name prefix when using `tf.cond`.
+
+  Returns:
+    Tensors returned by the call to either `true_fn` or `false_fn`.
+
+  Raises:
+    TypeError: If `true_fn` or `false_fn` is not callable.
+  """
+  if not callable(true_fn):
+    raise TypeError("`true_fn` must be callable.")
+  if not callable(false_fn):
+    raise TypeError("`false_fn` must be callable.")
+
+  pred_value = smart_constant_value(pred)
+  if pred_value is not None:
+    if pred_value:
+      return true_fn()
+    else:
+      return false_fn()
+  else:
+    return control_flow_ops.cond(pred, true_fn=true_fn, false_fn=false_fn,
+                                 name=name)
+
+
+def smart_constant_value(pred):
+  """Return the bool value for `pred`, or None if `pred` had a dynamic value.
+
+  Arguments:
+    pred: A scalar, either a Python bool or tensor.
+
+  Returns:
+    True or False if `pred` has a constant boolean value, None otherwise.
+
+  Raises:
+    TypeError: If `pred` is not a Tensor or bool.
+  """
+  if pred in {0, 1}:  # Accept 1/0 as valid boolean values
+    pred_value = bool(pred)
+  elif isinstance(pred, bool):
+    pred_value = pred
+  elif isinstance(pred, ops.Tensor):
+    pred_value = tensor_util.constant_value(pred)
+    # TODO(skyewm): consider folding this into tensor_util.constant_value when
+    # _USE_C_API is removed (there may be performance and correctness bugs, so I
+    # wanted to limit the change hidden behind _USE_C_API).
+    # pylint: disable=protected-access
+    if pred_value is None and ops._USE_C_API:
+      with errors.raise_exception_on_not_ok_status() as status:
+        pred_value = c_api.TF_TryEvaluateConstant_wrapper(
+            pred.graph._c_graph, pred._as_tf_output(), status)
+    # pylint: enable=protected-access
+
+  else:
+    raise TypeError("`pred` must be a Tensor, or a Python bool, or 1 or 0. "
+                    "Found instead: %s" % pred)
+  return pred_value
+
+
+def smart_case(pred_fn_pairs, default=None, exclusive=False, name="smart_case"):
+  """Like tf.case, except attempts to statically evaluate predicates.
+
+  If any predicate in `pred_fn_pairs` is a bool or has a constant value, the
+  associated callable will be called or omitted depending on its value.
+  Otherwise this functions like tf.case.
+
+  Args:
+    pred_fn_pairs: Dict or list of pairs of a boolean scalar tensor and a
+                   callable which returns a list of tensors.
+    default: Optional callable that returns a list of tensors.
+    exclusive: True iff at most one predicate is allowed to evaluate to `True`.
+    name: A name for this operation (optional).
+
+  Returns:
+    The tensors returned by the first pair whose predicate evaluated to True, or
+    those returned by `default` if none does.
+
+  Raises:
+    TypeError: If `pred_fn_pairs` is not a list/dictionary.
+    TypeError: If `pred_fn_pairs` is a list but does not contain 2-tuples.
+    TypeError: If `fns[i]` is not callable for any i, or `default` is not
+               callable.
+  """
+  return control_flow_ops._case_helper(  # pylint: disable=protected-access
+      smart_cond, pred_fn_pairs, default, exclusive, name,
+      allow_python_preds=True)
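The dispatch performed by `smart_cond` can be illustrated without TensorFlow. The sketch below is not the module's implementation: `static_value`, `smart_cond_sketch`, and the `dynamic_cond` parameter are hypothetical names, and only Python bools and the integers 0 and 1 are treated as statically known, whereas the real `smart_constant_value` also tries to fold constant tensors.

```python
def static_value(pred):
    """Return a bool if `pred` is statically known, else None.

    Assumption: only Python bools and the ints 0/1 count as static here.
    """
    if isinstance(pred, bool):
        return pred
    if isinstance(pred, int) and pred in (0, 1):
        return bool(pred)
    return None


def smart_cond_sketch(pred, true_fn, false_fn, dynamic_cond):
    """Call one branch directly when `pred` is static, else defer.

    `dynamic_cond` stands in for a runtime conditional such as tf.cond.
    """
    if not callable(true_fn):
        raise TypeError("`true_fn` must be callable.")
    if not callable(false_fn):
        raise TypeError("`false_fn` must be callable.")
    value = static_value(pred)
    if value is not None:
        # Only the selected branch is ever invoked.
        return true_fn() if value else false_fn()
    return dynamic_cond(pred, true_fn, false_fn)
```

The point of the design is that the branch not taken is never called when the predicate is known up front, which is what `testEval` and `testMix` in the accompanying tests rely on.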
diff --git a/tensorflow/python/framework/smart_cond_test.py b/tensorflow/python/framework/smart_cond_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..1170a41c99995ae875e58a2d5491e05bc1e40df6
--- /dev/null
+++ b/tensorflow/python/framework/smart_cond_test.py
@@ -0,0 +1,166 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.client import session
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import smart_cond
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import googletest
+
+
+def raise_exception():
+  raise RuntimeError("did not expect to be called")
+
+
+@test_util.with_c_api
+class SmartCondTest(test_util.TensorFlowTestCase):
+
+  def testTrue(self):
+    with ops.Graph().as_default():
+      with session.Session():
+        x = constant_op.constant(2)
+        y = constant_op.constant(5)
+        z = smart_cond.smart_cond(True, lambda: math_ops.multiply(x, 16),
+                                  lambda: math_ops.multiply(y, 5))
+        self.assertEqual(z.eval(), 32)
+
+  def testFalse(self):
+    with ops.Graph().as_default():
+      with session.Session():
+        x = constant_op.constant(4)
+        y = constant_op.constant(3)
+        z = smart_cond.smart_cond(False, lambda: math_ops.multiply(x, 16),
+                                  lambda: math_ops.multiply(y, 3))
+        self.assertEqual(z.eval(), 9)
+
+  def testUnknown(self):
+    with ops.Graph().as_default():
+      with session.Session():
+        x = array_ops.placeholder(dtype=dtypes.int32)
+        y = smart_cond.smart_cond(x > 0, lambda: constant_op.constant(1),
+                                  lambda: constant_op.constant(2))
+        self.assertEqual(y.eval(feed_dict={x: 1}), 1)
+        self.assertEqual(y.eval(feed_dict={x: -1}), 2)
+
+  def testEval(self):
+    # Constant expression evaluation only works with the C API enabled.
+    if not ops._USE_C_API: return
+
+    with ops.Graph().as_default():
+      with session.Session():
+        x = constant_op.constant(1)
+        y = constant_op.constant(2)
+        # x * y > 0 can be evaluated at graph construction time, so the false
+        # branch shouldn't be evaluated at all.
+        z = smart_cond.smart_cond(x * y > 0, lambda: constant_op.constant(1),
+                                  raise_exception)
+        self.assertEqual(z.eval(feed_dict={x: 1}), 1)
+
+  def testPlaceholderWithDefault(self):
+    with ops.Graph().as_default():
+      with session.Session():
+        x = array_ops.placeholder_with_default(1, shape=())
+        y = smart_cond.smart_cond(x > 0, lambda: constant_op.constant(1),
+                                  lambda: constant_op.constant(2))
+        self.assertEqual(y.eval(), 1)
+        self.assertEqual(y.eval(feed_dict={x: -1}), 2)
+
+  def testMissingArg1(self):
+    with ops.Graph().as_default():
+      with session.Session():
+        x = constant_op.constant(1)
+        with self.assertRaises(TypeError):
+          smart_cond.smart_cond(True, false_fn=lambda: x)
+
+  def testMissingArg2(self):
+    with ops.Graph().as_default():
+      with session.Session():
+        x = constant_op.constant(1)
+        with self.assertRaises(TypeError):
+          smart_cond.smart_cond(True, lambda: x)
+
+
+@test_util.with_c_api
+class SmartCaseTest(test_util.TensorFlowTestCase):
+
+  def testTrue(self):
+    x = array_ops.placeholder(dtype=dtypes.int32, shape=[])
+    conditions = [(True, lambda: constant_op.constant(1)),
+                  (x == 0, raise_exception)]
+    y = smart_cond.smart_case(conditions, default=raise_exception,
+                              exclusive=False)
+    z = smart_cond.smart_case(conditions, default=raise_exception,
+                              exclusive=True)
+    with session.Session() as sess:
+      # No feed_dict necessary
+      self.assertEqual(sess.run(y), 1)
+      self.assertEqual(sess.run(z), 1)
+
+  def testFalse(self):
+    conditions = [(False, raise_exception)]
+    y = smart_cond.smart_case(conditions,
+                              default=lambda: constant_op.constant(1),
+                              exclusive=False)
+    z = smart_cond.smart_case(conditions,
+                              default=lambda: constant_op.constant(1),
+                              exclusive=True)
+    with session.Session() as sess:
+      self.assertEqual(sess.run(y), 1)
+      self.assertEqual(sess.run(z), 1)
+
+  def testMix(self):
+    # Constant expression evaluation only works with the C API enabled.
+    if not ops._USE_C_API: return
+
+    x = array_ops.placeholder(dtype=dtypes.int32, shape=[])
+    y = constant_op.constant(10)
+    conditions = [(x > 1, lambda: constant_op.constant(1)),
+                  (y < 1, raise_exception),
+                  (False, raise_exception),
+                  (True, lambda: constant_op.constant(3))]
+    z = smart_cond.smart_case(conditions, default=raise_exception)
+    with session.Session() as sess:
+      self.assertEqual(sess.run(z, feed_dict={x: 2}), 1)
+      self.assertEqual(sess.run(z, feed_dict={x: 0}), 3)
+
+
+@test_util.with_c_api
+class SmartConstantValueTest(test_util.TensorFlowTestCase):
+
+  # TODO(skyewm): this is essentially a regression test for
+  # TF_TryEvaluateConstant, and is not really a valid smart_constant_value test
+  # (smart_constant_value is only supposed to return bools). Move the
+  # TF_TryEvaluateConstant call to tensor_util.constant_value and make this a
+  # constant_value test instead.
+  def testCond(self):
+    with ops.Graph().as_default():
+      pred = array_ops.placeholder_with_default(True, shape=())
+      x = control_flow_ops.cond(pred,
+                                lambda: constant_op.constant(1),
+                                lambda: constant_op.constant(2))
+      self.assertIsNone(smart_cond.smart_constant_value(x))
+
+
+if __name__ == "__main__":
+  googletest.main()
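The `tensor_shape.py` changes in this patch add reflected operators to `Dimension` and make some forward operators return `NotImplemented` when the other operand cannot be converted. A minimal sketch of that pattern follows, using a hypothetical toy class `Dim` rather than the real `Dimension`; the conversion rules are simplified assumptions.

```python
class Dim(object):
    """Toy stand-in for Dimension: an int value, or None when unknown."""

    def __init__(self, value):
        if isinstance(value, Dim):
            value = value.value
        if value is not None and not isinstance(value, int):
            raise TypeError("not convertible to Dim: %r" % (value,))
        self.value = value

    def __index__(self):
        # Lets a known Dim act as a repeat count, e.g. [4] * Dim(3).
        return self.value

    def __mul__(self, other):
        try:
            other = Dim(other)
        except TypeError:
            # Let Python try the other operand's own multiplication.
            return NotImplemented
        if self.value is None or other.value is None:
            return Dim(None)
        return Dim(self.value * other.value)

    def __rmul__(self, other):
        # Multiplication is commutative, so reuse __mul__.
        return self * other

    def __rsub__(self, other):
        # Called for `other - self` when `other` is not a Dim.
        other = Dim(other)
        if self.value is None or other.value is None:
            return Dim(None)
        return Dim(other.value - self.value)
```

With this shape, `2 * Dim(12)` reaches `__rmul__`, `13 - Dim(12)` reaches `__rsub__`, and `Dim(12) * [4]` falls through to list repetition because `__mul__` returns `NotImplemented`.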
diff --git a/tensorflow/python/framework/tensor_shape.py b/tensorflow/python/framework/tensor_shape.py
index 222071cb9e87aa0fdd9788d1c72df4c66ea61547..af2a5b1a7ef9a70c0baf5d02257951803a7a76fa 100644
--- a/tensorflow/python/framework/tensor_shape.py
+++ b/tensorflow/python/framework/tensor_shape.py
@@ -156,7 +156,7 @@ class Dimension(object):
     ```
 
     Args:
-      other: Another Dimension.
+      other: Another Dimension, or a value accepted by `as_dimension`.
 
     Returns:
       A Dimension whose value is the sum of `self` and `other`.
@@ -167,6 +167,17 @@ class Dimension(object):
     else:
       return Dimension(self._value + other.value)
 
+  def __radd__(self, other):
+    """Returns the sum of `other` and `self`.
+
+    Args:
+      other: Another Dimension, or a value accepted by `as_dimension`.
+
+    Returns:
+      A Dimension whose value is the sum of `self` and `other`.
+    """
+    return self + other
+
   def __sub__(self, other):
     """Returns the subtraction of `other` from `self`.
 
@@ -180,10 +191,10 @@ class Dimension(object):
     ```
 
     Args:
-      other: Another Dimension.
+      other: Another Dimension, or a value accepted by `as_dimension`.
 
     Returns:
-      A Dimension whose value is the subtraction of sum of `other` from `self`.
+      A Dimension whose value is the subtraction of `other` from `self`.
     """
     other = as_dimension(other)
     if self._value is None or other.value is None:
@@ -191,6 +202,21 @@ class Dimension(object):
     else:
       return Dimension(self._value - other.value)
 
+  def __rsub__(self, other):
+    """Returns the subtraction of `self` from `other`.
+
+    Args:
+      other: Another Dimension, or a value accepted by `as_dimension`.
+
+    Returns:
+      A Dimension whose value is the subtraction of `self` from `other`.
+    """
+    other = as_dimension(other)
+    if self._value is None or other.value is None:
+      return Dimension(None)
+    else:
+      return Dimension(other.value - self._value)
+
   def __mul__(self, other):
     """Returns the product of `self` and `other`.
 
@@ -204,17 +230,32 @@ class Dimension(object):
     ```
 
     Args:
-      other: Another Dimension.
+      other: Another Dimension, or a value accepted by `as_dimension`.
 
     Returns:
       A Dimension whose value is the product of `self` and `other`.
     """
-    other = as_dimension(other)
+    try:
+      other = as_dimension(other)
+    except (TypeError, ValueError):
+      return NotImplemented
+
     if self._value is None or other.value is None:
       return Dimension(None)
     else:
       return Dimension(self._value * other.value)
 
+  def __rmul__(self, other):
+    """Returns the product of `self` and `other`.
+
+    Args:
+      other: Another Dimension, or a value accepted by `as_dimension`.
+
+    Returns:
+      A Dimension whose value is the product of `self` and `other`.
+    """
+    return self * other
+
   def __floordiv__(self, other):
     """Returns the quotient of `self` and `other` rounded down.
 
@@ -228,17 +269,35 @@ class Dimension(object):
     ```
 
     Args:
-      other: Another `Dimension`.
+      other: Another Dimension, or a value accepted by `as_dimension`.
 
     Returns:
       A `Dimension` whose value is the integer quotient of `self` and `other`.
     """
-    other = as_dimension(other)
+    try:
+      other = as_dimension(other)
+    except (TypeError, ValueError):
+      return NotImplemented
     if self._value is None or other.value is None:
       return Dimension(None)
     else:
       return Dimension(self._value // other.value)
 
+  def __rfloordiv__(self, other):
+    """Returns the quotient of `other` and `self` rounded down.
+
+    Args:
+      other: Another Dimension, or a value accepted by `as_dimension`.
+
+    Returns:
+      A `Dimension` whose value is the integer quotient of `other` and `self`.
+    """
+    other = as_dimension(other)
+    if self._value is None or other.value is None:
+      return Dimension(None)
+    else:
+      return Dimension(other.value // self._value)
+
   def __div__(self, other):
     """DEPRECATED: Use `__floordiv__` via `x // y` instead.
 
@@ -256,7 +315,7 @@ class Dimension(object):
     return self // other
 
   def __mod__(self, other):
-    """Returns `self` modulo `other.
+    """Returns `self` modulo `other`.
 
     Dimension moduli are computed as follows:
 
@@ -268,17 +327,35 @@ class Dimension(object):
     ```
 
     Args:
-      other: Another Dimension.
+      other: Another Dimension, or a value accepted by `as_dimension`.
 
     Returns:
       A Dimension whose value is `self` modulo `other`.
     """
-    other = as_dimension(other)
+    try:
+      other = as_dimension(other)
+    except (TypeError, ValueError):
+      return NotImplemented
     if self._value is None or other.value is None:
       return Dimension(None)
     else:
       return Dimension(self._value % other.value)
 
+  def __rmod__(self, other):
+    """Returns `other` modulo `self`.
+
+    Args:
+      other: Another Dimension, or a value accepted by `as_dimension`.
+
+    Returns:
+      A Dimension whose value is `other` modulo `self`.
+    """
+    try:
+      other = as_dimension(other)
+    except (TypeError, ValueError):
+      return NotImplemented
+    return other % self
+
   def __lt__(self, other):
     """Returns True if `self` is known to be less than `other`.
 
@@ -456,6 +533,7 @@ class TensorShape(object):
       else:
         # Got a list of dimensions
         self._dims = [as_dimension(d) for d in dims_iter]
+    self._ndims = None
 
   def __repr__(self):
     return "TensorShape(%r)" % self._dims
@@ -473,19 +551,26 @@ class TensorShape(object):
     """Returns a list of Dimensions, or None if the shape is unspecified."""
     return self._dims
 
+  @dims.setter
+  def dims(self, dims):
+    self._dims = dims
+    self._ndims = None
+
   @property
   def ndims(self):
     """Returns the rank of this shape, or None if it is unspecified."""
     if self._dims is None:
       return None
     else:
-      return len(self._dims)
+      if self._ndims is None:
+        self._ndims = len(self._dims)
+      return self._ndims
 
   def __len__(self):
     """Returns the rank of this shape, or raises ValueError if unspecified."""
     if self._dims is None:
       raise ValueError("Cannot take the length of Shape with unknown rank.")
-    return len(self._dims)
+    return self.ndims
 
   def __bool__(self):
     """Returns True if this shape contains non-zero information."""
diff --git a/tensorflow/python/framework/tensor_shape_test.py b/tensorflow/python/framework/tensor_shape_test.py
index fffd86c7a6241b8be92ad33852da244ab9b5284d..4e8ce4d889c4ef0c6e56806587a64e8f9be7e10a 100644
--- a/tensorflow/python/framework/tensor_shape_test.py
+++ b/tensorflow/python/framework/tensor_shape_test.py
@@ -34,12 +34,20 @@ class DimensionTest(test_util.TensorFlowTestCase):
     self.assertEqual(tensor_shape.Dimension(15),
                      dim + tensor_shape.Dimension(3))
     self.assertEqual(tensor_shape.Dimension(15), dim + 3)
+    self.assertEqual(tensor_shape.Dimension(15), 3 + dim)
+    self.assertEqual(tensor_shape.Dimension(9), dim - 3)
+    self.assertEqual(tensor_shape.Dimension(1), 13 - dim)
     self.assertEqual(tensor_shape.Dimension(24),
                      dim * tensor_shape.Dimension(2))
     self.assertEqual(tensor_shape.Dimension(24), dim * 2)
+    self.assertEqual(tensor_shape.Dimension(24), 2 * dim)
+    self.assertEqual([4] * 12, [4] * dim)
+    self.assertEqual(12 * [4], dim * [4])
+    self.assertEqual(tensor_shape.Dimension(24), 2 * dim)
     self.assertEqual(
         tensor_shape.Dimension(6), dim // tensor_shape.Dimension(2))
     self.assertEqual(tensor_shape.Dimension(6), dim // 2)
+    self.assertEqual(tensor_shape.Dimension(0), 2 // dim)
     self.assertEqual(tensor_shape.Dimension(12),
                      dim.merge_with(tensor_shape.Dimension(12)))
     self.assertEqual(tensor_shape.Dimension(12), dim.merge_with(12))
@@ -176,6 +184,14 @@ class DimensionTest(test_util.TensorFlowTestCase):
     self.assertEqual(str(tensor_shape.Dimension(7)), "7")
     self.assertEqual(str(tensor_shape.Dimension(None)), "?")
 
+  def testMod(self):
+    four = tensor_shape.Dimension(4)
+    nine = tensor_shape.Dimension(9)
+    self.assertEqual(nine % four, 1)
+    # test both __mod__ and __rmod__.
+    self.assertEqual(nine % 4, 1)
+    self.assertEqual(4 % nine, 4)
+
 
 class ShapeTest(test_util.TensorFlowTestCase):
 
diff --git a/tensorflow/python/framework/tensor_spec.py b/tensorflow/python/framework/tensor_spec.py
index a0411bc3d9b4b2b87e5a31e9f201154f28ccf1cc..6676cfcaa334e02208d9ec346de7d266c4700f24 100644
--- a/tensorflow/python/framework/tensor_spec.py
+++ b/tensorflow/python/framework/tensor_spec.py
@@ -65,6 +65,11 @@ class TensorSpec(object):
     else:
       raise ValueError("`tensor` should be a tf.Tensor")
 
+  @classmethod
+  def is_bounded(cls):
+    del cls
+    return False
+
   @property
   def shape(self):
     """Returns the `TensorShape` that represents the shape of the tensor."""
@@ -80,6 +85,16 @@ class TensorSpec(object):
     """Returns the name of the described tensor."""
     return self._name
 
+  @property
+  def is_discrete(self):
+    """Whether spec is discrete."""
+    return self.dtype.is_integer
+
+  @property
+  def is_continuous(self):
+    """Whether spec is continuous."""
+    return self.dtype.is_floating
+
   def is_compatible_with(self, spec_or_tensor):
     """True if the shape and dtype of `spec_or_tensor` are compatible."""
     return (self._dtype.is_compatible_with(spec_or_tensor.dtype) and
@@ -95,6 +110,9 @@ class TensorSpec(object):
   def __ne__(self, other):
     return not self == other
 
+  def __reduce__(self):
+    return TensorSpec, (self._shape, self._dtype, self._name)
+
 
 class BoundedTensorSpec(TensorSpec):
   """A `TensorSpec` that specifies minimum and maximum values.
@@ -163,19 +181,16 @@ class BoundedTensorSpec(TensorSpec):
     self._maximum = np.array(maximum, dtype=self.dtype.as_numpy_dtype())
     self._maximum.setflags(write=False)
 
+  @classmethod
+  def is_bounded(cls):
+    del cls
+    return True
+
   @classmethod
   def from_spec(cls, spec):
     dtype = dtypes.as_dtype(spec.dtype)
-    if dtype in [dtypes.float64, dtypes.float32]:
-      # Avoid under/over-flow for `dtype.maximum - dtype.minimum`.
-      low = dtype.min / 2
-      high = dtype.max / 2
-    else:
-      low = dtype.min
-      high = dtype.max
-
-    minimum = getattr(spec, "minimum", low)
-    maximum = getattr(spec, "maximum", high)
+    minimum = getattr(spec, "minimum", dtype.min)
+    maximum = getattr(spec, "maximum", dtype.max)
     return BoundedTensorSpec(spec.shape, dtype, minimum, maximum, spec.name)
 
   @property
@@ -198,4 +213,7 @@ class BoundedTensorSpec(TensorSpec):
     return (tensor_spec_eq and np.allclose(self.minimum, other.minimum) and
             np.allclose(self.maximum, other.maximum))
 
+  def __reduce__(self):
+    return BoundedTensorSpec, (self._shape, self._dtype, self._minimum,
+                               self._maximum, self._name)
 
diff --git a/tensorflow/python/framework/tensor_spec_test.py b/tensorflow/python/framework/tensor_spec_test.py
index 54ca4d9a19c2e1c879c05cfb828085951bdd8444..2e9e43e12279fe833d640d4163c5474c398e70cd 100644
--- a/tensorflow/python/framework/tensor_spec_test.py
+++ b/tensorflow/python/framework/tensor_spec_test.py
@@ -18,6 +18,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import pickle
+
 import numpy as np
 
 from tensorflow.python.framework import constant_op
@@ -127,6 +129,26 @@ class TensorSpecTest(test_util.TensorFlowTestCase):
     self.assertEqual(bounded_spec.dtype, spec.dtype)
     self.assertEqual(bounded_spec.name, spec.name)
 
+  def testIsDiscrete(self):
+    discrete_spec = tensor_spec.TensorSpec((1, 2), dtypes.int32)
+    continuous_spec = tensor_spec.TensorSpec((1, 2), dtypes.float32)
+    self.assertTrue(discrete_spec.is_discrete)
+    self.assertFalse(continuous_spec.is_discrete)
+
+  def testIsContinuous(self):
+    discrete_spec = tensor_spec.TensorSpec((1, 2), dtypes.int32)
+    continuous_spec = tensor_spec.TensorSpec((1, 2), dtypes.float32)
+    self.assertFalse(discrete_spec.is_continuous)
+    self.assertTrue(continuous_spec.is_continuous)
+
+  def testIsBounded(self):
+    unbounded_spec = tensor_spec.TensorSpec((1, 2), dtypes.int32)
+    self.assertFalse(unbounded_spec.is_bounded())
+
+  def testSerialization(self):
+    desc = tensor_spec.TensorSpec([1, 5], dtypes.float32, "test")
+    self.assertEqual(pickle.loads(pickle.dumps(desc)), desc)
+
 
 class BoundedTensorSpecTest(test_util.TensorFlowTestCase):
 
@@ -138,6 +160,11 @@ class BoundedTensorSpecTest(test_util.TensorFlowTestCase):
     with self.assertRaisesRegexp(ValueError, "not compatible"):
       tensor_spec.BoundedTensorSpec((3, 5), dtypes.uint8, 0, (1, 1, 1))
 
+  def testIsBounded(self):
+    bounded_spec = tensor_spec.BoundedTensorSpec(
+        (1, 2), dtypes.int32, minimum=0, maximum=1)
+    self.assertTrue(bounded_spec.is_bounded())
+
   def testMinimumMaximumAttributes(self):
     spec = tensor_spec.BoundedTensorSpec(
         (1, 2, 3), dtypes.float32, 0, (5, 5, 5))
@@ -222,6 +249,10 @@ class BoundedTensorSpecTest(test_util.TensorFlowTestCase):
     self.assertEqual(spec.dtype.max, bounded_spec.maximum)
     self.assertEqual(spec.name, bounded_spec.name)
 
+  def testSerialization(self):
+    desc = tensor_spec.BoundedTensorSpec([1, 5], dtypes.float32, -1, 1, "test")
+    self.assertEqual(pickle.loads(pickle.dumps(desc)), desc)
+
 
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/python/framework/tensor_util.py b/tensorflow/python/framework/tensor_util.py
index 27afaa074a6becd5c8b7db94be59e8da1611c13a..984bcecdfe05efd79bdf218197c410b14abe3516 100644
--- a/tensorflow/python/framework/tensor_util.py
+++ b/tensorflow/python/framework/tensor_util.py
@@ -559,16 +559,16 @@ def MakeNdarray(tensor):
   if tensor.tensor_content:
     return (np.frombuffer(tensor.tensor_content, dtype=dtype).copy()
             .reshape(shape))
-  elif tensor_dtype == dtypes.float16:
+  elif tensor_dtype == dtypes.float16 or tensor_dtype == dtypes.bfloat16:
     # the half_val field of the TensorProto stores the binary representation
     # of the fp16: we need to reinterpret this as a proper float16
     if len(tensor.half_val) == 1:
       tmp = np.array(tensor.half_val[0], dtype=np.uint16)
-      tmp.dtype = np.float16
+      tmp.dtype = tensor_dtype.as_numpy_dtype
       return np.repeat(tmp, num_elements).reshape(shape)
     else:
       tmp = np.fromiter(tensor.half_val, dtype=np.uint16)
-      tmp.dtype = np.float16
+      tmp.dtype = tensor_dtype.as_numpy_dtype
       return tmp.reshape(shape)
   elif tensor_dtype == dtypes.float32:
     if len(tensor.float_val) == 1:
@@ -586,8 +586,7 @@ def MakeNdarray(tensor):
       return np.fromiter(tensor.double_val, dtype=dtype).reshape(shape)
   elif tensor_dtype in [
       dtypes.int32, dtypes.uint8, dtypes.uint16, dtypes.int16, dtypes.int8,
-      dtypes.qint32, dtypes.quint8, dtypes.qint8, dtypes.qint16, dtypes.quint16,
-      dtypes.bfloat16
+      dtypes.qint32, dtypes.quint8, dtypes.qint8, dtypes.qint16, dtypes.quint16
   ]:
     if len(tensor.int_val) == 1:
       return np.repeat(np.array(tensor.int_val[0], dtype=dtype),
@@ -829,7 +828,7 @@ def constant_value_as_shape(tensor):  # pylint: disable=invalid-name
   Returns:
     A `TensorShape` based on the constant value of the given `tensor`.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return tensor_shape.as_shape(
         [dim if dim != -1 else None for dim in tensor.numpy()])
 
diff --git a/tensorflow/python/framework/tensor_util_test.py b/tensorflow/python/framework/tensor_util_test.py
index bea0ee34fd4900cc9d4d5d52348ba4512368e81f..35fff80c61b98e7603d3b7b5df3cabdb59059a72 100644
--- a/tensorflow/python/framework/tensor_util_test.py
+++ b/tensorflow/python/framework/tensor_util_test.py
@@ -235,6 +235,26 @@ class TensorUtilTest(test.TestCase):
     self.assertEquals(np.float16, a.dtype)
     self.assertAllClose(np.array([10.0, 20.0], dtype=np.float16), a)
 
+  def testBfloat16(self):
+    test_type = dtypes.bfloat16.as_numpy_dtype
+    t = tensor_util.make_tensor_proto(np.array([10.0, 20.0], dtype=test_type))
+    # 10.0: 16672 = 010000010(130) 0100000: (1+0/2+1/4) * 2^(130-127)
+    # 20.0: 16800 = 010000011(131) 0100000: (1+0/2+1/4) * 2^(131-127)
+    self.assertProtoEquals("""
+      dtype: DT_BFLOAT16
+      tensor_shape {
+        dim {
+          size: 2
+        }
+      }
+      half_val: 16672
+      half_val: 16800
+      """, t)
+
+    a = tensor_util.MakeNdarray(t)
+    self.assertEquals(test_type, a.dtype)
+    self.assertAllClose(np.array([10.0, 20.0], dtype=test_type), a)
+
   def testInt(self):
     t = tensor_util.make_tensor_proto(10)
     self.assertProtoEquals("""
@@ -768,7 +788,7 @@ class ConstantValueTest(test.TestCase):
     self.assertAllClose(np_val, tensor_util.constant_value(tf_val))
 
   def testUnknown(self):
-    tf_val = gen_state_ops._variable(
+    tf_val = gen_state_ops.variable(
         shape=[3, 4, 7],
         dtype=dtypes.float32,
         name="tf_val",
diff --git a/tensorflow/python/framework/test_util.py b/tensorflow/python/framework/test_util.py
index 7389730d91cf9fd35c861ad85040c79108e5eb77..43106b6e598d464b15d0fe00265ccec906fff9a7 100644
--- a/tensorflow/python/framework/test_util.py
+++ b/tensorflow/python/framework/test_util.py
@@ -53,9 +53,11 @@ from tensorflow.python.eager import tape  # pylint: disable=unused-import
 from tensorflow.python.framework import device as pydev
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
+from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import importer
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import random_seed
+from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import versions
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import resource_variable_ops
@@ -205,6 +207,10 @@ def CudaSupportsHalfMatMulAndConv():
   return pywrap_tensorflow.CudaSupportsHalfMatMulAndConv()
 
 
+def IsMklEnabled():
+  return pywrap_tensorflow.IsMklEnabled()
+
+
 def InstallStackTraceHandler():
   pywrap_tensorflow.InstallStacktraceHandler()
 
@@ -331,6 +337,8 @@ def _use_c_api_wrapper(fn, use_c_api, *args, **kwargs):
     # Make sure default graph reflects prev_value in case next test doesn't call
     # reset_default_graph().
     ops.reset_default_graph()
+
+
 # pylint: disable=protected-access
 
 
@@ -403,6 +411,31 @@ def enable_c_api(fn):
   return wrapper
 
 
+def enable_c_shapes(fn):
+  """Decorator for enabling C shapes on a test.
+
+  Note this enables the C shapes after running the test class's setup/teardown
+  methods.
+
+  Args:
+    fn: the function to be wrapped
+
+  Returns:
+    The wrapped function
+  """
+
+  def wrapper(*args, **kwargs):
+    prev_value = ops._USE_C_SHAPES
+    # Only use C shapes if the C API is already enabled.
+    ops._USE_C_SHAPES = ops._USE_C_API
+    try:
+      fn(*args, **kwargs)
+    finally:
+      ops._USE_C_SHAPES = prev_value
+
+  return wrapper
+
+
 # This decorator is a hacky way to run all the test methods in a decorated
 # class with and without C API enabled.
 # TODO(iga): Remove this and its uses once we switch to using C API by default.
@@ -422,7 +455,8 @@ def with_c_api(cls):
   # If the C API is already enabled, don't do anything. Some tests break if the
   # same test is run twice, so this allows us to turn on the C API by default
   # without breaking these tests.
-  if ops._USE_C_API: return cls
+  if ops._USE_C_API:
+    return cls
 
   for name, value in cls.__dict__.copy().items():
     if callable(value) and name.startswith("test"):
@@ -430,6 +464,35 @@ def with_c_api(cls):
   return cls
 
 
+def assert_no_new_pyobjects_executing_eagerly(f):
+  """Decorator for asserting that no new Python objects persist after a test.
+
+  Runs the test multiple times executing eagerly, first as a warmup and then
+  several times to let objects accumulate. The warmup helps ignore caches which
+  do not grow as the test is run repeatedly.
+
+  Useful for checking that there are no missing Py_DECREFs in the C exercised by
+  a bit of Python.
+  """
+
+  def decorator(self, **kwargs):
+    """Warms up, gets an object count, runs the test, checks for new objects."""
+    with context.eager_mode():
+      gc.disable()
+      f(self, **kwargs)
+      gc.collect()
+      previous_count = len(gc.get_objects())
+      for _ in range(3):
+        f(self, **kwargs)
+      gc.collect()
+      # There should be no new Python objects hanging around.
+      new_count = len(gc.get_objects())
+      self.assertEqual(previous_count, new_count)
+      gc.enable()
+
+  return decorator
+
+
 def assert_no_new_tensors(f):
   """Decorator for asserting that no new Tensors persist after a test.
 
@@ -451,15 +514,17 @@ def assert_no_new_tensors(f):
   def decorator(self, **kwargs):
     """Finds existing Tensors, runs the test, checks for new Tensors."""
 
-    def _is_tensor(obj):
+    def _is_tensorflow_object(obj):
       try:
-        return (isinstance(obj, ops.Tensor) or
-                isinstance(obj, variables.Variable))
+        return isinstance(obj,
+                          (ops.Tensor, variables.Variable,
+                           tensor_shape.Dimension, tensor_shape.TensorShape))
       except ReferenceError:
         # If the object no longer exists, we don't care about it.
         return False
 
-    tensors_before = set(id(obj) for obj in gc.get_objects() if _is_tensor(obj))
+    tensors_before = set(
+        id(obj) for obj in gc.get_objects() if _is_tensorflow_object(obj))
     outside_graph_key = ops.get_default_graph()._graph_key
     with ops.Graph().as_default():
       # Run the test in a new graph so that collections get cleared when it's
@@ -469,11 +534,12 @@ def assert_no_new_tensors(f):
     # Make an effort to clear caches, which would otherwise look like leaked
     # Tensors.
     backprop._zeros_cache.flush()
+    context.get_default_context().ones_rank_cache().flush()
     context.get_default_context().scalar_cache().clear()
     gc.collect()
     tensors_after = [
         obj for obj in gc.get_objects()
-        if _is_tensor(obj) and id(obj) not in tensors_before
+        if _is_tensorflow_object(obj) and id(obj) not in tensors_before
     ]
     if tensors_after:
       raise AssertionError(("%d Tensors not deallocated after test: %s" % (
@@ -512,18 +578,18 @@ def assert_no_garbage_created(f):
           "likely due to a reference cycle. New objects in cycle(s):")
       for i, obj in enumerate(gc.garbage[previous_garbage:]):
         try:
-          logging.error(
-              "Object %d of %d" % (i, len(gc.garbage) - previous_garbage))
+          logging.error("Object %d of %d", i,
+                        len(gc.garbage) - previous_garbage)
+
           def _safe_object_str(obj):
             return "<%s %d>" % (obj.__class__.__name__, id(obj))
-          logging.error("  Object type: %s" % (_safe_object_str(obj),))
-          logging.error("  Referrer types: %s" % (
-              ', '.join([_safe_object_str(ref)
-                         for ref in gc.get_referrers(obj)]),))
-          logging.error("  Referent types: %s" % (
-              ', '.join([_safe_object_str(ref)
-                         for ref in gc.get_referents(obj)]),))
-          logging.error("  Object attribute names: %s" % (dir(obj),))
+
+          logging.error("  Object type: %s", _safe_object_str(obj))
+          logging.error("  Referrer types: %s", ", ".join(
+              [_safe_object_str(ref) for ref in gc.get_referrers(obj)]))
+          logging.error("  Referent types: %s", ", ".join(
+              [_safe_object_str(ref) for ref in gc.get_referents(obj)]))
+          logging.error("  Object attribute names: %s", dir(obj))
           logging.error("  Object __str__:")
           logging.error(obj)
           logging.error("  Object __repr__:")
@@ -645,15 +711,23 @@ def is_gpu_available(cuda_only=False, min_cuda_compute_capability=None):
       return 0, 0
     return int(match.group(1)), int(match.group(2))
 
-  for local_device in device_lib.list_local_devices():
-    if local_device.device_type == "GPU":
-      if (min_cuda_compute_capability is None or
-          compute_capability_from_device_desc(local_device.physical_device_desc)
-          >= min_cuda_compute_capability):
+  try:
+    for local_device in device_lib.list_local_devices():
+      if local_device.device_type == "GPU":
+        if (min_cuda_compute_capability is None or
+            compute_capability_from_device_desc(
+                local_device.physical_device_desc) >=
+            min_cuda_compute_capability):
+          return True
+      if local_device.device_type == "SYCL" and not cuda_only:
         return True
-    if local_device.device_type == "SYCL" and not cuda_only:
-      return True
-  return False
+    return False
+  except errors_impl.NotFoundError as e:
+    if not all([x in str(e) for x in ["CUDA", "not find"]]):
+      raise e
+    else:
+      logging.error(str(e))
+      return False
 
 
 @contextlib.contextmanager
@@ -812,7 +886,7 @@ class TensorFlowTestCase(googletest.TestCase):
     Returns:
       tensors numpy values.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return self._eval_helper(tensors)
     else:
       sess = ops.get_default_session()
@@ -842,9 +916,9 @@ class TensorFlowTestCase(googletest.TestCase):
 
     Use the `use_gpu` and `force_gpu` options to control where ops are run. If
     `force_gpu` is True, all ops are pinned to `/device:GPU:0`. Otherwise, if
-    `use_gpu`
-    is True, TensorFlow tries to run as many ops on the GPU as possible. If both
-    `force_gpu and `use_gpu` are False, all ops are pinned to the CPU.
+    `use_gpu` is True, TensorFlow tries to run as many ops on the GPU as
+    possible. If both `force_gpu` and `use_gpu` are False, all ops are pinned to
+    the CPU.
 
     Example:
     ```python
@@ -1196,9 +1270,9 @@ class TensorFlowTestCase(googletest.TestCase):
             msg="Mismatched value: a%s is different from b%s." % (path_str,
                                                                   path_str))
       except TypeError as e:
-        msg = "Error: a%s has %s, but b%s has %s" % (
-            path_str, type(a), path_str, type(b))
-        e.args = ((e.args[0] + ' : ' + msg,) + e.args[1:])
+        msg = "Error: a%s has %s, but b%s has %s" % (path_str, type(a),
+                                                     path_str, type(b))
+        e.args = ((e.args[0] + " : " + msg,) + e.args[1:])
         raise
 
   def assertAllClose(self, a, b, rtol=1e-6, atol=1e-6, msg=None):
@@ -1378,8 +1452,7 @@ class TensorFlowTestCase(googletest.TestCase):
     """
     device1 = pydev.canonical_name(device1)
     device2 = pydev.canonical_name(device2)
-    self.assertEqual(device1, device2,
-                     "Devices %s and %s are not equal. %s" % 
+    self.assertEqual(device1, device2, "Devices %s and %s are not equal. %s" %
                      (device1, device2, msg))
 
   # Fix Python 3 compatibility issues
diff --git a/tensorflow/python/framework/test_util_test.py b/tensorflow/python/framework/test_util_test.py
index a717eb39513ac3369ae133b6090ff82597f12eb7..02ffa93baee5c643ebdceaa274710f9d58e6eecb 100644
--- a/tensorflow/python/framework/test_util_test.py
+++ b/tensorflow/python/framework/test_util_test.py
@@ -82,6 +82,14 @@ class TestUtilTest(test_util.TensorFlowTestCase):
     else:
       print("GoogleCuda is disabled")
 
+  def testIsMklEnabled(self):
+    # This test doesn't assert anything.
+    # It ensures the py wrapper function is generated correctly.
+    if test_util.IsMklEnabled():
+      print("MKL is enabled")
+    else:
+      print("MKL is disabled")
+
   def testAssertProtoEqualsStr(self):
 
     graph_str = "node { name: 'w1' op: 'params' }"
@@ -440,6 +448,26 @@ class GarbageCollectionTest(test_util.TensorFlowTestCase):
 
     LeakedTensorTest().test_has_no_leak()
 
+  def test_no_new_objects_decorator(self):
+
+    class LeakedObjectTest(object):
+
+      def __init__(inner_self):  # pylint: disable=no-self-argument
+        inner_self.assertEqual = self.assertEqual  # pylint: disable=invalid-name
+        inner_self.accumulation = []
+
+      @test_util.assert_no_new_pyobjects_executing_eagerly
+      def test_has_leak(self):
+        self.accumulation.append([1.])
+
+      @test_util.assert_no_new_pyobjects_executing_eagerly
+      def test_has_no_leak(self):
+        self.not_accumulating = [1.]
+
+    with self.assertRaises(AssertionError):
+      LeakedObjectTest().test_has_leak()
+
+    LeakedObjectTest().test_has_no_leak()
 
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/python/grappler/hierarchical_controller.py b/tensorflow/python/grappler/hierarchical_controller.py
index b06fb3c6d0666659031863b90212e9456d044c14..c0866c1069ac7f7e25cbd12cb5a490e2ed5e4bec 100644
--- a/tensorflow/python/grappler/hierarchical_controller.py
+++ b/tensorflow/python/grappler/hierarchical_controller.py
@@ -258,9 +258,11 @@ class HierarchicalController(Controller):
             "attn_w_2", [self.hparams.hidden_size, self.hparams.hidden_size])
         variable_scope.get_variable("attn_v", [self.hparams.hidden_size, 1])
     seq2seq_input_layer = array_ops.placeholder_with_default(
-        array_ops.zeros([1, self.num_groups, self.group_emb_size],
+        array_ops.zeros([self.hparams.num_children,
+                         self.num_groups,
+                         self.group_emb_size],
                         dtypes.float32),
-        shape=(1, self.num_groups, self.group_emb_size))
+        shape=(self.hparams.num_children, self.num_groups, self.group_emb_size))
     self.seq2seq_input_layer = seq2seq_input_layer
 
   def compute_reward(self, run_time):
@@ -585,12 +587,29 @@ class HierarchicalController(Controller):
     """Approximating the blocks of a TF graph from a graph_def.
 
     Args:
-      grouping_actions: grouping predictions
+      grouping_actions: grouping predictions.
       verbose: print stuffs.
 
     Returns:
       groups: list of groups.
     """
+    groups = [
+        self._create_group_embeddings(grouping_actions, i, verbose) for
+        i in range(self.hparams.num_children)
+    ]
+    return np.stack(groups, axis=0)
+
+  def _create_group_embeddings(self, grouping_actions, child_id, verbose=False):
+    """Approximating the blocks of a TF graph from a graph_def for each child.
+
+    Args:
+      grouping_actions: grouping predictions.
+      child_id: child_id for the group.
+      verbose: whether to print progress messages.
+
+    Returns:
+      groups: group embedding for the child_id.
+    """
     if verbose:
       print("Processing input_graph")
 
@@ -599,13 +618,13 @@ class HierarchicalController(Controller):
     dag_matrix = np.zeros([self.num_groups, self.num_groups], dtype=np.float32)
     for op in self.important_ops:
       topo_op_index = self.name_to_topo_order_index[op.name]
-      # TODO(agoldie) child_id
-      group_index = grouping_actions[0][topo_op_index]
+      group_index = grouping_actions[child_id][topo_op_index]
       for output_op in self.get_node_fanout(op):
         if output_op.name not in self.important_op_names:
           continue
-        output_group_index = grouping_actions[0][self.name_to_topo_order_index[
-            output_op.name]]
+        output_group_index = (
+            grouping_actions[child_id][self.name_to_topo_order_index[
+                output_op.name]])
         dag_matrix[group_index, output_group_index] += 1.0
     num_connections = np.sum(dag_matrix)
     num_intra_group_connections = dag_matrix.trace()
@@ -648,7 +667,8 @@ class HierarchicalController(Controller):
         ],
         dtype=np.float32)
     for op_index, op in enumerate(self.important_ops):
-      group_index = grouping_actions[0][self.name_to_topo_order_index[op.name]]
+      group_index = grouping_actions[child_id][
+          self.name_to_topo_order_index[op.name]]
       type_name = str(op.op)
       type_index = self.type_dict[type_name]
       group_embedding[group_index, type_index] += 1
@@ -675,7 +695,7 @@ class HierarchicalController(Controller):
           shape=[num_children, self.num_groups],
           trainable=False)
 
-    x = array_ops.tile(self.seq2seq_input_layer, [num_children, 1, 1])
+    x = self.seq2seq_input_layer
     last_c, last_h, attn_mem = self.encode(x)
     actions, log_probs = {}, {}
     actions["sample"], log_probs["sample"] = (
@@ -988,8 +1008,7 @@ class HierarchicalController(Controller):
   def generate_placement(self, grouping, sess):
     controller_ops = self.ops["controller"]
     feed_seq2seq_input_dict = {}
-    feed_seq2seq_input_dict[self.seq2seq_input_layer] = np.expand_dims(
-        grouping, axis=0)
+    feed_seq2seq_input_dict[self.seq2seq_input_layer] = grouping
     sess.run(
         controller_ops["y_preds"]["sample"], feed_dict=feed_seq2seq_input_dict)
 
diff --git a/tensorflow/python/grappler/item.i b/tensorflow/python/grappler/item.i
index 9a84c60b04029a64ed35a01f045a6eec5e492504..593d38206d127978f1982a0f2cc22e17daee1a3d 100644
--- a/tensorflow/python/grappler/item.i
+++ b/tensorflow/python/grappler/item.i
@@ -83,7 +83,6 @@ static GItem TF_NewItem(
   tensorflow::grappler::ItemConfig cfg;
   cfg.ignore_user_placement = ignore_user_placement;
   cfg.ignore_colocation = ignore_colocation;
-  cfg.inline_functions = true;
   std::unique_ptr<tensorflow::grappler::GrapplerItem> item =
       tensorflow::grappler::GrapplerItemFromMetaGraphDef("item", meta_graph, cfg);
   if (!item) {
diff --git a/tensorflow/python/grappler/item_test.py b/tensorflow/python/grappler/item_test.py
index 7c3efd6249cbdaa2675632f7fc8e25fb88658a24..c40de9da0abca3bb99a82a1456261f45b1c45c99 100644
--- a/tensorflow/python/grappler/item_test.py
+++ b/tensorflow/python/grappler/item_test.py
@@ -111,7 +111,7 @@ class ItemTest(test.TestCase):
     with ops.Graph().as_default() as g:
       c = constant_op.constant([10])
       v = variables.Variable([3], dtype=dtypes.int32)
-      i = gen_array_ops._ref_identity(v)
+      i = gen_array_ops.ref_identity(v)
       a = state_ops.assign(i, c)
       train_op = ops.get_collection_ref(ops.GraphKeys.TRAIN_OP)
       train_op.append(a)
diff --git a/tensorflow/python/grappler/layout_optimizer_test.py b/tensorflow/python/grappler/layout_optimizer_test.py
index 0f5150174049250e86bbac0a49eb998339058326..5a84b16a23f567fba6d08aaefd3b816a76907735 100644
--- a/tensorflow/python/grappler/layout_optimizer_test.py
+++ b/tensorflow/python/grappler/layout_optimizer_test.py
@@ -321,7 +321,7 @@ class LayoutOptimizerTest(test.TestCase):
       conv = _two_layer_model(x)
       dim = array_ops.placeholder(dtype='int32')
       sizes = constant_op.constant([50, 10, 4], shape=[3])
-      split = gen_array_ops._split_v(
+      split = gen_array_ops.split_v(
           value=conv, size_splits=sizes, axis=dim, num_split=3)
       output = math_ops.reduce_sum(split[0])
 
@@ -896,7 +896,7 @@ class LayoutOptimizerTest(test.TestCase):
       add = math_ops.add(conv, conv)
       mean = math_ops.reduce_mean(conv)
       condition = math_ops.less(conv, mean)
-      select = gen_math_ops._select(condition, conv, add)
+      select = gen_math_ops.select(condition, conv, add)
       output = array_ops.identity(select)
 
       with session.Session(config=_get_config(False)) as sess:
@@ -926,7 +926,7 @@ class LayoutOptimizerTest(test.TestCase):
       conv = _two_layer_model(x)
       add = math_ops.add(conv, conv)
       condition = array_ops.placeholder(dtype='bool')
-      select = gen_math_ops._select(condition, conv, add)
+      select = gen_math_ops.select(condition, conv, add)
       output = array_ops.identity(select)
 
       condition_val = np.zeros((1, 7, 7, 64))
@@ -957,7 +957,7 @@ class LayoutOptimizerTest(test.TestCase):
       conv = _two_layer_model(x)
       add = math_ops.add(conv, conv)
       condition = constant_op.constant(True)
-      select = gen_math_ops._select(condition, conv, add)
+      select = gen_math_ops.select(condition, conv, add)
       output = array_ops.identity(select)
 
       with session.Session(config=_get_config(False)) as sess:
@@ -1023,7 +1023,7 @@ class LayoutOptimizerTest(test.TestCase):
       conv = _two_layer_model(x)
       ksize = constant_op.constant([1, 2, 3, 1], shape=[4])
       strides = array_ops.placeholder(dtype='int32', shape=[4])
-      max_pool = gen_nn_ops._max_pool_v2(conv, ksize, strides, 'VALID')
+      max_pool = gen_nn_ops.max_pool_v2(conv, ksize, strides, 'VALID')
       output = array_ops.identity(max_pool)
 
       strides_val = [1, 3, 2, 1]
diff --git a/tensorflow/python/grappler/memory_optimizer_test.py b/tensorflow/python/grappler/memory_optimizer_test.py
index 948911f099674af4c6dd19bfdac75e5fc1f75c78..4df959ce04169395589aeebaef9e3e7839e2300c 100644
--- a/tensorflow/python/grappler/memory_optimizer_test.py
+++ b/tensorflow/python/grappler/memory_optimizer_test.py
@@ -162,7 +162,8 @@ class MemoryOptimizerRecomputeTest(test.TestCase):
             arithmetic_optimization=rewriter_config_pb2.RewriterConfig.OFF,
             memory_optimization=rewriter_config_pb2.RewriterConfig.
             RECOMPUTATION_HEURISTICS,
-            memory_optimizer_target_node_name_prefix='optimizer/gradients/'),
+            # Checks that name scope "gradients/" also matches sub-scopes.
+            memory_optimizer_target_node_name_scope='gradients/'),
         original_metagraph)
     self.assertGreater(
         len(rewritten_graph_def.node),
@@ -176,6 +177,35 @@ class MemoryOptimizerRecomputeTest(test.TestCase):
         len([node for node in rewritten_graph_def.node
              if 'Recomputed/' in node.name]))
 
+  def testRewritingNameScopedGradientNamesScope(self):
+    """Tests that no rewriting occurs when the name scope matches no node."""
+    (original_metagraph, _, _,
+     _) = self._GetMetaGraph(optimizer_scope_name='foo/bar')
+    rewritten_graph_def = tf_optimizer.OptimizeGraph(
+        rewriter_config_pb2.RewriterConfig(
+            disable_model_pruning=True,
+            constant_folding=rewriter_config_pb2.RewriterConfig.OFF,
+            dependency_optimization=rewriter_config_pb2.RewriterConfig.OFF,
+            layout_optimizer=rewriter_config_pb2.RewriterConfig.OFF,
+            arithmetic_optimization=rewriter_config_pb2.RewriterConfig.OFF,
+            memory_optimization=rewriter_config_pb2.RewriterConfig.
+            RECOMPUTATION_HEURISTICS,
+            # This should not match anything.
+            memory_optimizer_target_node_name_scope='r/gradients/'),
+        original_metagraph)
+    self.assertEqual(
+        len(rewritten_graph_def.node), len(original_metagraph.graph_def.node))
+    self.assertEqual(0,
+                     len([
+                         node for node in original_metagraph.graph_def.node
+                         if 'Recomputed/' in node.name
+                     ]))
+    self.assertEqual(0,
+                     len([
+                         node for node in rewritten_graph_def.node
+                         if 'Recomputed/' in node.name
+                     ]))
+
   def _GetMemoryOptimizerSessionConfig(self):
     rewrite_options = rewriter_config_pb2.RewriterConfig(
         disable_model_pruning=True,
diff --git a/tensorflow/python/grappler/model_analyzer.cc b/tensorflow/python/grappler/model_analyzer.cc
index d23eb811ac2b0a6a8802979b4d966b5617c8a8d9..5a76cdd8fb29361cd800dea60cb9ebc0e39f6487 100644
--- a/tensorflow/python/grappler/model_analyzer.cc
+++ b/tensorflow/python/grappler/model_analyzer.cc
@@ -26,9 +26,10 @@ namespace grappler {
 
 ModelAnalyzer::ModelAnalyzer(const GrapplerItem& item) : item_(item) {}
 
-Status ModelAnalyzer::GenerateReport(bool debug, std::ostream& os) {
+Status ModelAnalyzer::GenerateReport(bool debug, bool assume_valid_feeds,
+                                     std::ostream& os) {
   GraphProperties properties(item_);
-  TF_RETURN_IF_ERROR(properties.InferStatically(false));
+  TF_RETURN_IF_ERROR(properties.InferStatically(assume_valid_feeds));
 
   for (const auto& node : item_.MainOpsFanin()) {
     PrintNodeInfo(node, properties, debug, os);
diff --git a/tensorflow/python/grappler/model_analyzer.h b/tensorflow/python/grappler/model_analyzer.h
index 5bc551927d88db723e21b29903d6f5b941048139..97ffafabe1f785e3b2c3044143b8fb8006b59225 100644
--- a/tensorflow/python/grappler/model_analyzer.h
+++ b/tensorflow/python/grappler/model_analyzer.h
@@ -31,7 +31,7 @@ class GraphProperties;
 class ModelAnalyzer {
  public:
   explicit ModelAnalyzer(const GrapplerItem& item);
-  Status GenerateReport(bool debug, std::ostream& os);
+  Status GenerateReport(bool debug, bool assume_valid_feeds, std::ostream& os);
 
  private:
   void PrintNodeInfo(const NodeDef* node, const GraphProperties& properties,
diff --git a/tensorflow/python/grappler/model_analyzer.i b/tensorflow/python/grappler/model_analyzer.i
index 7c3a692d0efc501341ff1dff3cf24b8a4830ec84..4955780764be802b9e4be3598bf114b227757194 100644
--- a/tensorflow/python/grappler/model_analyzer.i
+++ b/tensorflow/python/grappler/model_analyzer.i
@@ -40,7 +40,8 @@ limitations under the License.
 %}
 
 %{
-string GenerateModelReport(const tensorflow::MetaGraphDef& metagraph, bool debug) {
+string GenerateModelReport(const tensorflow::MetaGraphDef& metagraph,
+                           bool assume_valid_feeds, bool debug) {
   tensorflow::grappler::ItemConfig cfg;
   cfg.apply_optimizations = false;
   std::unique_ptr<tensorflow::grappler::GrapplerItem> item =
@@ -53,10 +54,11 @@ string GenerateModelReport(const tensorflow::MetaGraphDef& metagraph, bool debug
   tensorflow::grappler::ModelAnalyzer analyzer(*item);
 
   std::stringstream os;
-  analyzer.GenerateReport(debug, os);
+  analyzer.GenerateReport(debug, assume_valid_feeds, os);
   return os.str();
 }
 
 %}
 
-string GenerateModelReport(const tensorflow::MetaGraphDef& metagraph, bool debug);
+string GenerateModelReport(const tensorflow::MetaGraphDef& metagraph,
+                           bool assume_valid_feeds, bool debug);
diff --git a/tensorflow/python/grappler/model_analyzer.py b/tensorflow/python/grappler/model_analyzer.py
index 535889e1c4034952562a05e4d044fcafeddbc0ca..98cdc5785011dcebbaaf43704772b3de00c9d6ca 100644
--- a/tensorflow/python/grappler/model_analyzer.py
+++ b/tensorflow/python/grappler/model_analyzer.py
@@ -22,11 +22,12 @@ from tensorflow.python import pywrap_tensorflow as tf_wrap
 from tensorflow.python.framework import errors
 
 
-def GenerateModelReport(metagraph, debug=False):
+def GenerateModelReport(metagraph, assume_valid_feeds=True, debug=False):
   """Report what's known statically about each node in the provided metagraph.
 
   Args:
     metagraph: A TensorFlow MetaGraphDef.
+    assume_valid_feeds: If True, assume the shapes of fed nodes are valid.
     debug: Add some information useful for debugging.
 
   Returns:
@@ -34,6 +35,6 @@ def GenerateModelReport(metagraph, debug=False):
   """
   with errors.raise_exception_on_not_ok_status():
     ret_from_swig = tf_wrap.GenerateModelReport(metagraph.SerializeToString(),
-                                                debug)
+                                                assume_valid_feeds, debug)
 
   return ret_from_swig
diff --git a/tensorflow/python/grappler/tf_optimizer.i b/tensorflow/python/grappler/tf_optimizer.i
index de9326ccfc1653c2afd0833dcdca2cc4bfdabed5..39ca71e99af06c19fb7fe5bf185c29106729f5e9 100644
--- a/tensorflow/python/grappler/tf_optimizer.i
+++ b/tensorflow/python/grappler/tf_optimizer.i
@@ -98,7 +98,6 @@ PyObject* TF_OptimizeGraph(
       const tensorflow::MetaGraphDef& metagraph,
       bool verbose, const string& graph_id, TF_Status* out_status) {
     tensorflow::grappler::ItemConfig item_config;
-    item_config.inline_functions = false;
     item_config.apply_optimizations = false;
     item_config.ignore_user_placement = false;
     std::unique_ptr<tensorflow::grappler::GrapplerItem> grappler_item =
diff --git a/tensorflow/python/keras/BUILD b/tensorflow/python/keras/BUILD
index a98d08f92892cd5f923833a2059ce7e89ebba1aa..711106d2db78f533bc41c25dba995a18e0875fed 100755
--- a/tensorflow/python/keras/BUILD
+++ b/tensorflow/python/keras/BUILD
@@ -8,6 +8,12 @@ exports_files(["LICENSE"])
 package(default_visibility = ["//visibility:public"])
 
 load("//tensorflow:tensorflow.bzl", "py_test")
+load("//tensorflow:tensorflow.bzl", "cuda_py_test")
+
+config_setting(
+    name = "empty_condition",
+    values = {"define": "UNUSED=unused"},
+)
 
 py_library(
     name = "keras",
@@ -45,7 +51,10 @@ py_library(
         "_impl/keras/engine/saving.py",
         "_impl/keras/engine/sequential.py",
         "_impl/keras/engine/training.py",
+        "_impl/keras/engine/training_arrays.py",
         "_impl/keras/engine/training_eager.py",
+        "_impl/keras/engine/training_generator.py",
+        "_impl/keras/engine/training_utils.py",
         "_impl/keras/estimator.py",
         "_impl/keras/initializers.py",
         "_impl/keras/layers/__init__.py",
@@ -78,8 +87,8 @@ py_library(
         "_impl/keras/utils/generic_utils.py",
         "_impl/keras/utils/io_utils.py",
         "_impl/keras/utils/layer_utils.py",
+        "_impl/keras/utils/multi_gpu_utils.py",
         "_impl/keras/utils/np_utils.py",
-        "_impl/keras/utils/training_utils.py",
         "_impl/keras/utils/vis_utils.py",
         "_impl/keras/wrappers/__init__.py",
         "_impl/keras/wrappers/scikit_learn.py",
@@ -123,7 +132,11 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     visibility = ["//visibility:public"],
-    deps = [
+    deps = select({
+        ":empty_condition": [],
+        "//conditions:default": [],
+    }) + [
+        "@six_archive//:six",
         "//tensorflow/core:protos_all_py",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:check_ops",
@@ -162,7 +175,6 @@ py_library(
         "//tensorflow/python/estimator",
         "//tensorflow/python/estimator:model_fn",
         "//tensorflow/python/saved_model",
-        "@six_archive//:six",
     ],
 )
 
@@ -601,6 +613,7 @@ py_test(
         "no_windows",
         "noasan",  # times out
         "notsan",
+        "optonly",  # times out
     ],
     deps = [
         ":keras",
@@ -645,16 +658,17 @@ py_test(
     ],
 )
 
-py_test(
-    name = "training_utils_test",
-    size = "medium",
-    srcs = ["_impl/keras/utils/training_utils_test.py"],
-    srcs_version = "PY2AND3",
-    tags = ["multi_gpu"],
-    deps = [
+cuda_py_test(
+    name = "multi_gpu_utils_test",
+    srcs = ["_impl/keras/utils/multi_gpu_utils_test.py"],
+    additional_deps = [
         ":keras",
-        "//tensorflow/python:client_testlib",
         "//third_party/py/numpy",
+        "//tensorflow/python:client_testlib",
+    ],
+    tags = [
+        "guitar",
+        "multi_gpu",
     ],
 )
 
@@ -763,6 +777,9 @@ py_test(
     size = "small",
     srcs = ["_impl/keras/engine/topology_test.py"],
     srcs_version = "PY2AND3",
+    tags = [
+        "no-internal-py3",
+    ],
     deps = [
         ":keras",
         "//tensorflow/python:client_testlib",
diff --git a/tensorflow/python/keras/_impl/keras/__init__.py b/tensorflow/python/keras/_impl/keras/__init__.py
index b63907b2e60acfc80ee411b9193b2829f0224c3e..53f5d31e9c5b861c551a7a9ca3700c383ea679d7 100644
--- a/tensorflow/python/keras/_impl/keras/__init__.py
+++ b/tensorflow/python/keras/_impl/keras/__init__.py
@@ -40,4 +40,4 @@ from tensorflow.python.keras._impl.keras.layers import Input
 from tensorflow.python.keras._impl.keras.models import Model
 from tensorflow.python.keras._impl.keras.models import Sequential
 
-__version__ = '2.1.4-tf'
+__version__ = '2.1.5-tf'
diff --git a/tensorflow/python/keras/_impl/keras/backend.py b/tensorflow/python/keras/_impl/keras/backend.py
index a2db05f6cfd0c20fef3b18832834d06990b7a512..7baf27642a475eb3a09687a1d19a6ed05de046e9 100644
--- a/tensorflow/python/keras/_impl/keras/backend.py
+++ b/tensorflow/python/keras/_impl/keras/backend.py
@@ -55,10 +55,10 @@ from tensorflow.python.ops import tensor_array_grad  # pylint: disable=unused-im
 from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.ops import variables as variables_module
 from tensorflow.python.training import moving_averages
+from tensorflow.python.util import tf_contextlib
 from tensorflow.python.util import tf_inspect
 from tensorflow.python.util.tf_export import tf_export
 
-
 py_all = all
 py_sum = sum
 
@@ -343,7 +343,7 @@ def learning_phase():
   Returns:
       Learning phase (scalar integer tensor or Python integer).
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     if 'eager' not in _GRAPH_LEARNING_PHASES:
       # Fallback to inference mode as default.
       return 0
@@ -369,13 +369,42 @@ def set_learning_phase(value):
   """
   global _GRAPH_LEARNING_PHASES  # pylint: disable=global-variable-not-assigned
   if value not in {0, 1}:
-    raise ValueError('Expected learning phase to be ' '0 or 1.')
-  if context.in_eager_mode():
+    raise ValueError('Expected learning phase to be 0 or 1.')
+  if context.executing_eagerly():
     _GRAPH_LEARNING_PHASES['eager'] = value
   else:
     _GRAPH_LEARNING_PHASES[ops.get_default_graph()] = value
 
 
+@tf_contextlib.contextmanager
+def learning_phase_scope(value):
+  """Provides a scope within which the learning phase is equal to `value`.
+
+  The learning phase gets restored to its original value upon exiting the scope.
+
+  Arguments:
+      value: Learning phase value, either 0 or 1 (integers).
+
+  Yields:
+      The provided value.
+
+  Raises:
+      ValueError: if `value` is neither `0` nor `1`.
+  """
+  if value not in {0, 1}:
+    raise ValueError('Expected learning phase to be 0 or 1.')
+  previous_value = learning_phase()
+  try:
+    set_learning_phase(value)
+    yield value
+  finally:
+    # Restore learning phase to initial value.
+    if context.executing_eagerly():
+      _GRAPH_LEARNING_PHASES['eager'] = previous_value
+    else:
+      _GRAPH_LEARNING_PHASES[ops.get_default_graph()] = previous_value
+
+
 @tf_export('keras.backend.get_session')
 def get_session():
   """Returns the TF session to be used by the backend.
@@ -394,8 +423,9 @@ def get_session():
       A TensorFlow session.
   """
   global _SESSION
-  if ops.get_default_session() is not None:
-    session = ops.get_default_session()
+  default_session = ops.get_default_session()
+  if default_session is not None:
+    session = default_session
   else:
     if _SESSION is None:
       if not os.environ.get('OMP_NUM_THREADS'):
@@ -466,7 +496,7 @@ def _is_current_explicit_device(device_type):
   """
   device_type = device_type.upper()
   if device_type not in ['CPU', 'GPU']:
-    raise ValueError('device_type should be either "CPU" or "GPU".')
+    raise ValueError('`device_type` should be either "CPU" or "GPU".')
   device = _get_current_tf_device()
   return device is not None and device.device_type == device_type.upper()
 
@@ -2596,7 +2626,7 @@ def get_value(x):
   Returns:
       A Numpy array.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return x.numpy()
   return x.eval(session=get_session())
 
@@ -2611,7 +2641,7 @@ def batch_get_value(tensors):
   Returns:
       A list of Numpy arrays.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return [x.numpy() for x in tensors]
   if tensors:
     return get_session().run(tensors)
@@ -2629,7 +2659,7 @@ def set_value(x, value):
           (of the same shape).
   """
   value = np.asarray(value, dtype=dtype(x))
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     x.assign(value)
   else:
     tf_dtype = dtypes_module.as_dtype(x.dtype.name.split('_')[0])
@@ -2652,7 +2682,7 @@ def batch_set_value(tuples):
       tuples: a list of tuples `(tensor, value)`.
           `value` should be a Numpy array.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     for x, value in tuples:
       x.assign(np.asarray(value, dtype=dtype(x)))
   else:
@@ -2749,7 +2779,7 @@ class Function(object):
       self.updates_op = control_flow_ops.group(*updates_ops)
     self.name = name
     # additional tensor substitutions
-    self.feed_dict = session_kwargs.pop('feed_dict', {})
+    self.feed_dict = session_kwargs.pop('feed_dict', None)
     # additional operations
     self.fetches = session_kwargs.pop('fetches', [])
     if not isinstance(self.fetches, list):
@@ -2759,8 +2789,15 @@ class Function(object):
   def __call__(self, inputs):
     if not isinstance(inputs, (list, tuple)):
       raise TypeError('`inputs` should be a list or tuple.')
-    feed_dict = self.feed_dict.copy()
+
+    if self.feed_dict:
+      feed_dict = self.feed_dict.copy()
+    else:
+      feed_dict = {}
+
     for tensor, value in zip(self.inputs, inputs):
+      if value is None:
+        continue
       if is_sparse(tensor):
         sparse_coo = value.tocoo()
         indices = np.concatenate((np.expand_dims(sparse_coo.row, 1),
@@ -3087,7 +3124,7 @@ def rnn(step_function,
   outputs_shape[1] = inputs_shape[1]
   outputs.set_shape(outputs_shape)
 
-  if not context.in_eager_mode():
+  if not context.executing_eagerly():
     last_output._uses_learning_phase = uses_learning_phase
   return last_output, outputs, new_states
 
@@ -3336,7 +3373,7 @@ def categorical_crossentropy(target, output, from_logits=False):
         target * math_ops.log(output),
         axis=len(output.get_shape()) - 1)
   else:
-    return nn.softmax_cross_entropy_with_logits(labels=target, logits=output)
+    return nn.softmax_cross_entropy_with_logits_v2(labels=target, logits=output)
 
 
 @tf_export('keras.backend.sparse_categorical_crossentropy')
@@ -3478,7 +3515,7 @@ def l2_normalize(x, axis=None):
   Returns:
       A tensor.
   """
-  return nn.l2_normalize(x, dim=axis)
+  return nn.l2_normalize(x, axis=axis)
 
 
 @tf_export('keras.backend.in_top_k')
diff --git a/tensorflow/python/keras/_impl/keras/backend_test.py b/tensorflow/python/keras/_impl/keras/backend_test.py
index f29ca49378bc43385b9e90d3f1cefb7937df64cd..fb4b2a0e1dc06c904d4b93038840dbf688d42ed4 100644
--- a/tensorflow/python/keras/_impl/keras/backend_test.py
+++ b/tensorflow/python/keras/_impl/keras/backend_test.py
@@ -128,6 +128,22 @@ class BackendUtilsTest(test.TestCase):
       sess.run(variables.global_variables_initializer())
       sess.run(y, feed_dict={x: np.random.random((2, 3))})
 
+  def test_learning_phase_scope(self):
+    with self.test_session():
+      initial_learning_phase = keras.backend.learning_phase()
+      with keras.backend.learning_phase_scope(1) as lp:
+        self.assertEqual(lp, 1)
+        self.assertEqual(keras.backend.learning_phase(), 1)
+      self.assertEqual(keras.backend.learning_phase(), initial_learning_phase)
+      with keras.backend.learning_phase_scope(0) as lp:
+        self.assertEqual(lp, 0)
+        self.assertEqual(keras.backend.learning_phase(), 0)
+      self.assertEqual(keras.backend.learning_phase(), initial_learning_phase)
+      with self.assertRaises(ValueError):
+        with keras.backend.learning_phase_scope(None):
+          pass
+      self.assertEqual(keras.backend.learning_phase(), initial_learning_phase)
+
   def test_int_shape(self):
     x = keras.backend.placeholder(shape=(3, 4))
     self.assertEqual(keras.backend.int_shape(x), (3, 4))
diff --git a/tensorflow/python/keras/_impl/keras/callbacks.py b/tensorflow/python/keras/_impl/keras/callbacks.py
index f6c466142522927135d66f73f9f5c697671649ec..deb1e8867dba3d52816ebda02bd9a3bf2ec7bc09 100644
--- a/tensorflow/python/keras/_impl/keras/callbacks.py
+++ b/tensorflow/python/keras/_impl/keras/callbacks.py
@@ -778,16 +778,24 @@ class TensorBoard(Callback):
         while i < val_size:
           step = min(self.batch_size, val_size - i)
           batch_val = []
-          batch_val.append(val_data[0][i:i + step])
-          batch_val.append(val_data[1][i:i + step])
-          batch_val.append(val_data[2][i:i + step])
+          batch_val.append(val_data[0][i:i + step]
+                           if val_data[0] is not None else None)
+          batch_val.append(val_data[1][i:i + step]
+                           if val_data[1] is not None else None)
+          batch_val.append(val_data[2][i:i + step]
+                           if val_data[2] is not None else None)
           if self.model.uses_learning_phase:
             # do not slice the learning phase
-            batch_val = [x[i:i + step] for x in val_data[:-1]]
+            batch_val = [x[i:i + step] if x is not None else None
+                         for x in val_data[:-1]]
             batch_val.append(val_data[-1])
           else:
-            batch_val = [x[i:i + step] for x in val_data]
-          feed_dict = dict(zip(tensors, batch_val))
+            batch_val = [x[i:i + step] if x is not None else None
+                         for x in val_data]
+          feed_dict = {}
+          for key, val in zip(tensors, batch_val):
+            if val is not None:
+              feed_dict[key] = val
           result = self.sess.run([self.merged], feed_dict=feed_dict)
           summary_str = result[0]
           self.writer.add_summary(summary_str, epoch)
diff --git a/tensorflow/python/keras/_impl/keras/datasets/fashion_mnist.py b/tensorflow/python/keras/_impl/keras/datasets/fashion_mnist.py
index b9ae41a0d4d0e8d9df70e3fc1952e81c5f57e8d9..508e95f719a02977960b80c283495ced642293c5 100644
--- a/tensorflow/python/keras/_impl/keras/datasets/fashion_mnist.py
+++ b/tensorflow/python/keras/_impl/keras/datasets/fashion_mnist.py
@@ -24,8 +24,10 @@ import os
 import numpy as np
 
 from tensorflow.python.keras._impl.keras.utils.data_utils import get_file
+from tensorflow.python.util.tf_export import tf_export
 
 
+@tf_export('keras.datasets.fashion_mnist.load_data')
 def load_data():
   """Loads the Fashion-MNIST dataset.
 
diff --git a/tensorflow/python/keras/_impl/keras/engine/base_layer.py b/tensorflow/python/keras/_impl/keras/engine/base_layer.py
index 142325041bf4d2f8a564adcf867f3b719435f0ba..5615241ae3077102ef40f9c0619161964a62a335 100644
--- a/tensorflow/python/keras/_impl/keras/engine/base_layer.py
+++ b/tensorflow/python/keras/_impl/keras/engine/base_layer.py
@@ -237,12 +237,13 @@ class Layer(tf_base_layers.Layer):
     """
     # Actually call the layer (optionally building it).
     output = super(Layer, self).__call__(inputs, **kwargs)
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return output
 
-    # Un-built subclassed network: build it
-    if hasattr(self, '_set_inputs') and not self.inputs:
-      self._set_inputs(inputs, training=kwargs.get('training'))
+    if hasattr(self, '_symbolic_set_inputs') and not self.inputs:
+      # Subclassed network: explicitly set metadata normally set by a call to
+      # self._set_inputs().
+      self._symbolic_set_inputs(inputs, output)
 
     # Update learning phase info.
     output_tensors = generic_utils.to_list(output)
diff --git a/tensorflow/python/keras/_impl/keras/engine/input_layer.py b/tensorflow/python/keras/_impl/keras/engine/input_layer.py
index 8f9ea6f7a40e49ec45dfaeb14f807cd9c7db65c9..b51dd8a2189d0c8542c84dfeac9be0d72b96ff1b 100644
--- a/tensorflow/python/keras/_impl/keras/engine/input_layer.py
+++ b/tensorflow/python/keras/_impl/keras/engine/input_layer.py
@@ -28,6 +28,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.util.tf_export import tf_export
 
 
+@tf_export('keras.layers.InputLayer')
 class InputLayer(base_layer.Layer):
   """Layer to be used as an entry point into a Network (a graph of layers).
 
@@ -92,7 +93,7 @@ class InputLayer(base_layer.Layer):
       else:
         batch_input_shape = None
 
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         # In eager mode, create a temporary placeholder to call the layer on.
         input_tensor = tf_base_layers._DeferredTensor(  # pylint: disable=protected-access
             shape=batch_input_shape,
diff --git a/tensorflow/python/keras/_impl/keras/engine/network.py b/tensorflow/python/keras/_impl/keras/engine/network.py
index 453cc8f8b7268376f48f07f5c8cf788a0aa10ddc..bf8239043874794c6617937cfc9c619d743502a9 100644
--- a/tensorflow/python/keras/_impl/keras/engine/network.py
+++ b/tensorflow/python/keras/_impl/keras/engine/network.py
@@ -38,6 +38,7 @@ from tensorflow.python.keras._impl.keras.utils.layer_utils import print_summary
 from tensorflow.python.layers import base as tf_base_layers
 from tensorflow.python.layers import utils as tf_layers_util
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpointable
 from tensorflow.python.util import nest
 from tensorflow.python.util import tf_inspect
 
@@ -98,11 +99,11 @@ class Network(base_layer.Layer):
     self._losses = []   # Used in symbolic mode only.
     self._scope = None  # Never used.
     self._reuse = None  # Never used.
-    if context.in_eager_mode:
+    if context.executing_eagerly():
       self._graph = None
     else:
       self._graph = ops.get_default_graph()  # Used in symbolic mode only.
-        # A Network does not create weights of its own, thus has no dtype.
+      # A Network does not create weights of its own, thus has no dtype.
     self._dtype = None
 
     # All layers in order of horizontal graph traversal.
@@ -125,7 +126,7 @@ class Network(base_layer.Layer):
       self.outputs = [outputs]
 
     # User-prodived argument validation.
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       # Check that all inputs/outputs are DeferredTensors.
       for tensor in self.inputs:
         if not isinstance(tensor, tf_base_layers._DeferredTensor):  # pylint: disable=protected-access
@@ -274,7 +275,7 @@ class Network(base_layer.Layer):
         self._feed_input_names.append(layer.name)
         self._feed_input_shapes.append(K.int_shape(self.inputs[i]))
         # layer.input gives an error in eager mode
-        if context.in_graph_mode():
+        if not context.executing_eagerly():
           self._feed_inputs.append(layer.input)
     for layer in self._output_layers:
       self.output_names.append(layer.name)
@@ -302,6 +303,13 @@ class Network(base_layer.Layer):
       if not is_graph_network:
         if value not in self._layers:
           self._layers.append(value)
+    if isinstance(value, checkpointable.CheckpointableBase):
+      # Layer (and therefore Network/Model) inherit from CheckpointableBase
+      # rather than Checkpointable, which means there is no Checkpointable
+      # __setattr__ override (it would be a performance issue for functional
+      # layers). Therefore Model tracks Checkpointable objects itself.
+      self._track_checkpointable(
+          checkpointable=value, name=name, overwrite=True)
     super(Network, self).__setattr__(name, value)
 
   def add_variable(self, name, shape, dtype=None, initializer=None,
@@ -309,7 +317,7 @@ class Network(base_layer.Layer):
     raise NotImplementedError('`add_variable` is not supported on Networks.')
 
   def add_loss(self, *args, **kwargs):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise NotImplementedError('`add_loss` is not supported on Networks '
                                 'when eager execution is enabled.')
     super(Network, self).add_loss(*args, **kwargs)
@@ -388,7 +396,7 @@ class Network(base_layer.Layer):
     if cache_key in self._output_mask_cache:
       return self._output_mask_cache[cache_key]
     else:
-      _, output_masks = self._run_internal_graph(inputs, masks)
+      _, output_masks = self._run_internal_graph(inputs, mask=masks)
       return output_masks
 
   @property
@@ -398,6 +406,7 @@ class Network(base_layer.Layer):
   def get_layer(self, name=None, index=None):
     """Retrieves a layer based on either its name (unique) or index.
 
+    If `name` and `index` are both provided, `index` will take precedence.
     Indices are based on order of horizontal graph traversal (bottom-up).
 
     Arguments:
@@ -429,7 +438,7 @@ class Network(base_layer.Layer):
 
   @property
   def updates(self):
-    """Retrieve the network's updates.
+    """Retrieves the network's updates.
 
     Will only include updates that are either
     unconditional, or conditional on inputs to this model
@@ -475,7 +484,7 @@ class Network(base_layer.Layer):
     Returns:
         A list of update ops.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return []
 
     if not self.trainable and not self.stateful:
@@ -487,7 +496,10 @@ class Network(base_layer.Layer):
 
     # `updates` might contain irrelevant updates, so it needs to be filtered
     # with respect to inputs the model has been called on.
-    relevant_inputs = self.inputs or []
+    if self.inputs:
+      relevant_inputs = self.inputs[:]
+    else:
+      relevant_inputs = []
     for i in range(1, len(self._inbound_nodes)):
       inputs = self.get_input_at(i)
       if isinstance(inputs, list):
@@ -506,7 +518,7 @@ class Network(base_layer.Layer):
 
   @property
   def losses(self):
-    """Retrieve the network's losses.
+    """Retrieves the network's losses.
 
     Will only include losses that are either
     unconditional, or conditional on inputs to this model
@@ -519,10 +531,13 @@ class Network(base_layer.Layer):
     losses = []
     for layer in self.layers:
       losses += layer.losses
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return losses
 
-    relevant_inputs = self.inputs or []
+    if self.inputs:
+      relevant_inputs = self.inputs[:]
+    else:
+      relevant_inputs = []
     for i in range(1, len(self._inbound_nodes)):
       inputs = self.get_input_at(i)
       if isinstance(inputs, list):
@@ -586,7 +601,7 @@ class Network(base_layer.Layer):
     return specs
 
   def call(self, inputs, training=None, mask=None):
-    """Call the model on new inputs.
+    """Calls the model on new inputs.
 
     In this case `call` just reapplies
     all ops in the graph to the new inputs
@@ -609,7 +624,7 @@ class Network(base_layer.Layer):
     else:
       masks = nest.flatten(mask)
 
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       # Try to retrieve cached outputs if the layer has already been called
       # on these exact inputs.
       cache_key = (tf_layers_util.object_list_uid(inputs)
@@ -815,7 +830,7 @@ class Network(base_layer.Layer):
               else:
                 output_masks = [None for _ in range(len(output_tensors))]
 
-            if context.in_graph_mode():
+            if not context.executing_eagerly():
               if layer.activity_regularizer is not None:
                 regularization_losses = [
                     layer.activity_regularizer(x) for x in output_tensors
@@ -845,7 +860,7 @@ class Network(base_layer.Layer):
       if output_masks is not None:
         output_masks = output_masks[0]
 
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       # Update cache;
       # keys are based on ids on input tensors and inputs masks.
       cache_key = (tf_layers_util.object_list_uid(inputs)
@@ -1016,7 +1031,7 @@ class Network(base_layer.Layer):
           layer(input_tensors, **kwargs)
 
     def process_layer(layer_data):
-      """Deserialize a layer, then call it on appropriate inputs.
+      """Deserializes a layer, then calls it on appropriate inputs.
 
       Arguments:
           layer_data: layer config dict.
@@ -1073,7 +1088,7 @@ class Network(base_layer.Layer):
     return cls(inputs=input_tensors, outputs=output_tensors, name=name)
 
   def save(self, filepath, overwrite=True, include_optimizer=True):
-    """Save the model to a single HDF5 file.
+    """Saves the model to a single HDF5 file.
 
     The savefile includes:
         - The model architecture, allowing to re-instantiate the model.
@@ -1179,7 +1194,7 @@ class Network(base_layer.Layer):
         saving.load_weights_from_hdf5_group(f, self.layers)
 
   def _updated_config(self):
-    """Util hared between different serialization methods.
+    """Util shared between different serialization methods.
 
     Returns:
         Model config with Keras version information added.
@@ -1319,7 +1334,7 @@ def _make_node_key(layer_name, node_index):
 
 
 def _map_graph_network(inputs, outputs):
-  """Validate a network's topology and gather its layers and nodes.
+  """Validates a network's topology and gathers its layers and nodes.
 
   Arguments:
     inputs: List of input tensors.
diff --git a/tensorflow/python/keras/_impl/keras/engine/saving.py b/tensorflow/python/keras/_impl/keras/engine/saving.py
index 52522e693511b010d0501651e594d346984c41e3..2ad06ca4fdcd55c12ba3ba192751f2f05aacc7ec 100644
--- a/tensorflow/python/keras/_impl/keras/engine/saving.py
+++ b/tensorflow/python/keras/_impl/keras/engine/saving.py
@@ -35,6 +35,7 @@ from tensorflow.python.util.tf_export import tf_export
 # pylint: disable=g-import-not-at-top
 try:
   import h5py
+  HDF5_OBJECT_HEADER_LIMIT = 64512
 except ImportError:
   h5py = None
 
@@ -47,7 +48,7 @@ except ImportError:
 
 @tf_export('keras.models.save_model')
 def save_model(model, filepath, overwrite=True, include_optimizer=True):
-  """Save a model to a HDF5 file.
+  """Saves a model to a HDF5 file.
 
   The saved model contains:
       - the model's configuration (topology)
@@ -74,7 +75,7 @@ def save_model(model, filepath, overwrite=True, include_optimizer=True):
     raise ImportError('`save_model` requires h5py.')
 
   def get_json_type(obj):
-    """Serialize any object to a JSON-serializable structure.
+    """Serializes any object to a JSON-serializable structure.
 
     Arguments:
         obj: the object to serialize
@@ -358,34 +359,6 @@ def model_from_json(json_string, custom_objects=None):
   return deserialize(config, custom_objects=custom_objects)
 
 
-def save_weights_to_hdf5_group(f, layers):
-  from tensorflow.python.keras._impl.keras import __version__ as keras_version  # pylint: disable=g-import-not-at-top
-
-  f.attrs['layer_names'] = [layer.name.encode('utf8') for layer in layers]
-  f.attrs['backend'] = K.backend().encode('utf8')
-  f.attrs['keras_version'] = str(keras_version).encode('utf8')
-
-  for layer in layers:
-    g = f.create_group(layer.name)
-    symbolic_weights = layer.weights
-    weight_values = K.batch_get_value(symbolic_weights)
-    weight_names = []
-    for i, (w, val) in enumerate(zip(symbolic_weights, weight_values)):
-      if hasattr(w, 'name') and w.name:
-        name = str(w.name)
-      else:
-        name = 'param_' + str(i)
-      weight_names.append(name.encode('utf8'))
-    g.attrs['weight_names'] = weight_names
-    for name, val in zip(weight_names, weight_values):
-      param_dset = g.create_dataset(name, val.shape, dtype=val.dtype)
-      if not val.shape:
-        # scalar
-        param_dset[()] = val
-      else:
-        param_dset[:] = val
-
-
 def preprocess_weights_for_loading(layer,
                                    weights,
                                    original_keras_version=None,
@@ -549,9 +522,140 @@ def preprocess_weights_for_loading(layer,
       # split the bias into half and merge
       weights[2] = bias[:units * 4] + bias[units * 4:]
 
+  return convert_rnn_weights(layer, weights)
+
+
+def convert_rnn_weights(layer, weights):
+  """Converts weights for RNN layers between native and CuDNN format.
+
+  Input kernels for each gate are transposed and converted between Fortran
+  and C layout, recurrent kernels are transposed. For LSTM biases are summed/
+  split in half, for GRU biases are reshaped.
+
+  Weights can be converted in both directions between `LSTM` and `CuDNNLSTM`
+  and between `CuDNNGRU` and `GRU(reset_after=True)`. Default `GRU` is not
+  compatible with `CuDNNGRU`.
+
+  For missing biases in `LSTM`/`GRU` (`use_bias=False`) no conversion is made.
+
+  Arguments:
+      layer: Target layer instance.
+      weights: List of source weights values (input kernels, recurrent
+          kernels, [biases]) (Numpy arrays).
+
+  Returns:
+      A list of converted weights values (Numpy arrays).
+
+  Raises:
+      ValueError: for incompatible GRU layer/weights or incompatible biases
+  """
+
+  def transform_kernels(kernels, func, n_gates):
+    """Transforms kernel for each gate separately using given function.
+
+    Arguments:
+        kernels: Stacked array of kernels for individual gates.
+        func: Function applied to kernel of each gate.
+        n_gates: Number of gates (4 for LSTM, 3 for GRU).
+    Returns:
+        Stacked array of transformed kernels.
+    """
+    return np.hstack([func(k) for k in np.hsplit(kernels, n_gates)])
+
+  def transpose_input(from_cudnn):
+    """Makes a function that transforms input kernels from/to CuDNN format.
+
+    It keeps the shape, but changes between the layout (Fortran/C). E.g.:
+
+    ```
+    Keras                 CuDNN
+    [[0, 1, 2],  <--->  [[0, 2, 4],
+     [3, 4, 5]]          [1, 3, 5]]
+    ```
+
+    It can be passed to `transform_kernels()`.
+
+    Arguments:
+        from_cudnn: `True` if source weights are in CuDNN format, `False`
+            if they're in plain Keras format.
+    Returns:
+        Function that converts input kernel to the other format.
+    """
+    order = 'F' if from_cudnn else 'C'
+
+    def transform(kernel):
+      return kernel.T.reshape(kernel.shape, order=order)
+
+    return transform
+
+  target_class = layer.__class__.__name__
+
+  # convert the weights between CuDNNLSTM and LSTM
+  if target_class in ['LSTM', 'CuDNNLSTM'] and len(weights) == 3:
+    # determine if we're loading a CuDNNLSTM layer
+    # from the number of bias weights:
+    # CuDNNLSTM has (units * 8) weights; while LSTM has (units * 4)
+    # if there's no bias weight in the file, skip this conversion
+    units = weights[1].shape[0]
+    bias_shape = weights[2].shape
+    n_gates = 4
+
+    if bias_shape == (2 * units * n_gates,):
+      source = 'CuDNNLSTM'
+    elif bias_shape == (units * n_gates,):
+      source = 'LSTM'
+    else:
+      raise ValueError('Invalid bias shape: ' + str(bias_shape))
+
+    def convert_lstm_weights(weights, from_cudnn=True):
+      # Transpose (and reshape) input and recurrent kernels.
+      kernels = transform_kernels(weights[0], transpose_input(from_cudnn),
+                                  n_gates)
+      recurrent_kernels = transform_kernels(weights[1], lambda k: k.T, n_gates)
+      if from_cudnn:  # Merge input and recurrent biases into a single set.
+        biases = np.sum(np.split(weights[2], 2, axis=0), axis=0)
+      else:
+        # Split single set of biases evenly to two sets.
+        biases = np.tile(0.5 * weights[2], 2)
+      return [kernels, recurrent_kernels, biases]
+
+    if source != target_class:
+      weights = convert_lstm_weights(weights, from_cudnn=source == 'CuDNNLSTM')
+
+  # TODO(fchollet): add feature after GRU is refactored:
+  # convert the weights between `CuDNNGRU` and `GRU(reset_after=True)`
   return weights
 
 
+def save_weights_to_hdf5_group(f, layers):
+  from tensorflow.python.keras._impl.keras import __version__ as keras_version  # pylint: disable=g-import-not-at-top
+
+  save_attributes_to_hdf5_group(
+      f, 'layer_names', [layer.name.encode('utf8') for layer in layers])
+  f.attrs['backend'] = K.backend().encode('utf8')
+  f.attrs['keras_version'] = str(keras_version).encode('utf8')
+
+  for layer in layers:
+    g = f.create_group(layer.name)
+    symbolic_weights = layer.weights
+    weight_values = K.batch_get_value(symbolic_weights)
+    weight_names = []
+    for i, (w, val) in enumerate(zip(symbolic_weights, weight_values)):
+      if hasattr(w, 'name') and w.name:
+        name = str(w.name)
+      else:
+        name = 'param_' + str(i)
+      weight_names.append(name.encode('utf8'))
+    save_attributes_to_hdf5_group(g, 'weight_names', weight_names)
+    for name, val in zip(weight_names, weight_values):
+      param_dset = g.create_dataset(name, val.shape, dtype=val.dtype)
+      if not val.shape:
+        # scalar
+        param_dset[()] = val
+      else:
+        param_dset[:] = val
+
+
 def load_weights_from_hdf5_group(f, layers):
   """Implements topological (order-based) weight loading.
 
@@ -578,11 +682,11 @@ def load_weights_from_hdf5_group(f, layers):
     if weights:
       filtered_layers.append(layer)
 
-  layer_names = [n.decode('utf8') for n in f.attrs['layer_names']]
+  layer_names = load_attributes_from_hdf5_group(f, 'layer_names')
   filtered_layer_names = []
   for name in layer_names:
     g = f[name]
-    weight_names = [n.decode('utf8') for n in g.attrs['weight_names']]
+    weight_names = load_attributes_from_hdf5_group(g, 'weight_names')
     if weight_names:
       filtered_layer_names.append(name)
   layer_names = filtered_layer_names
@@ -597,7 +701,7 @@ def load_weights_from_hdf5_group(f, layers):
   weight_value_tuples = []
   for k, name in enumerate(layer_names):
     g = f[name]
-    weight_names = [n.decode('utf8') for n in g.attrs['weight_names']]
+    weight_names = load_attributes_from_hdf5_group(g, 'weight_names')
     weight_values = [g[weight_name] for weight_name in weight_names]
     layer = filtered_layers[k]
     symbolic_weights = layer.weights
@@ -640,7 +744,7 @@ def load_weights_from_hdf5_group_by_name(f, layers):
     original_backend = None
 
   # New file format.
-  layer_names = [n.decode('utf8') for n in f.attrs['layer_names']]
+  layer_names = load_attributes_from_hdf5_group(f, 'layer_names')
 
   # Reverse index of layer name to list of layers with name.
   index = {}
@@ -653,7 +757,7 @@ def load_weights_from_hdf5_group_by_name(f, layers):
   weight_value_tuples = []
   for k, name in enumerate(layer_names):
     g = f[name]
-    weight_names = [n.decode('utf8') for n in g.attrs['weight_names']]
+    weight_names = load_attributes_from_hdf5_group(g, 'weight_names')
     weight_values = [g[weight_name] for weight_name in weight_names]
 
     for layer in index.get(name, []):
@@ -669,3 +773,72 @@ def load_weights_from_hdf5_group_by_name(f, layers):
       for i in range(len(weight_values)):
         weight_value_tuples.append((symbolic_weights[i], weight_values[i]))
   K.batch_set_value(weight_value_tuples)
+
+
+def save_attributes_to_hdf5_group(group, name, data):
+  """Saves attributes (data) of the specified name into the HDF5 group.
+
+  This method works around a limitation of HDF5 files, which cannot store
+  attribute data larger than HDF5_OBJECT_HEADER_LIMIT bytes.
+
+  Arguments:
+      group: A pointer to a HDF5 group.
+      name: A name of the attributes to save.
+      data: Attributes data to store.
+
+  Raises:
+    RuntimeError: If any single attribute is too large to be saved.
+  """
+  # Check that no item in `data` is larger than `HDF5_OBJECT_HEADER_LIMIT`
+  # because in that case even chunking the array would not make the saving
+  # possible.
+  bad_attributes = [x for x in data if len(x) > HDF5_OBJECT_HEADER_LIMIT]
+
+  # Expecting this to never be true.
+  if bad_attributes:
+    raise RuntimeError('The following attributes cannot be saved to HDF5 '
+                       'file because they are larger than %d bytes: %s' %
+                       (HDF5_OBJECT_HEADER_LIMIT,
+                        ', '.join([x for x in bad_attributes])))
+
+  data_npy = np.asarray(data)
+
+  num_chunks = 1
+  chunked_data = np.array_split(data_npy, num_chunks)
+
+  # This will never loop forever thanks to the test above.
+  while any([x.nbytes > HDF5_OBJECT_HEADER_LIMIT for x in chunked_data]):
+    num_chunks += 1
+    chunked_data = np.array_split(data_npy, num_chunks)
+
+  if num_chunks > 1:
+    for chunk_id, chunk_data in enumerate(chunked_data):
+      group.attrs['%s%d' % (name, chunk_id)] = chunk_data
+  else:
+    group.attrs[name] = data
+
+
+def load_attributes_from_hdf5_group(group, name):
+  """Loads attributes of the specified name from the HDF5 group.
+
+  This method works around a limitation
+  of HDF5 files, which cannot store
+  attribute data larger than HDF5_OBJECT_HEADER_LIMIT bytes.
+
+  Arguments:
+      group: A pointer to a HDF5 group.
+      name: A name of the attributes to load.
+
+  Returns:
+      data: Attributes data.
+  """
+  if name in group.attrs:
+    data = [n.decode('utf8') for n in group.attrs[name]]
+  else:
+    data = []
+    chunk_id = 0
+    while '%s%d' % (name, chunk_id) in group.attrs:
+      data.extend(
+          [n.decode('utf8') for n in group.attrs['%s%d' % (name, chunk_id)]])
+      chunk_id += 1
+  return data
diff --git a/tensorflow/python/keras/_impl/keras/engine/saving_test.py b/tensorflow/python/keras/_impl/keras/engine/saving_test.py
index bdb17641b0d26bc227b142d9302dc1da9637c506..4a18cc2e119d7cfb3f15da593d4944abd445905b 100644
--- a/tensorflow/python/keras/_impl/keras/engine/saving_test.py
+++ b/tensorflow/python/keras/_impl/keras/engine/saving_test.py
@@ -370,6 +370,92 @@ class TestWholeModelSaving(test.TestCase):
     self.assertAllClose(mean, model.layers[1].arguments['mu'])
     self.assertAllClose(std, model.layers[1].arguments['std'])
 
+  def test_saving_model_with_long_layer_names(self):
+    if h5py is None:
+      return  # Skip test if models cannot be saved.
+
+    with self.test_session():
+      # This layer name will make the `layer_names` HDF5 attribute blow
+      # out of proportion. Note that it fits into the internal HDF5
+      # attribute memory limit on its own but because h5py converts
+      # the list of layer names into a numpy array, which uses the same
+      # amount of memory for every item, it increases the memory
+      # requirements substantially.
+      x = keras.Input(shape=(2,), name='input_' + ('x' * (2**15)))
+      f = x
+      for i in range(4):
+        f = keras.layers.Dense(2, name='dense_%d' % (i,))(f)
+      model = keras.Model(inputs=[x], outputs=[f])
+      model.compile(loss='mse', optimizer='adam', metrics=['acc'])
+
+      x = np.random.random((1, 2))
+      y = np.random.random((1, 2))
+      model.train_on_batch(x, y)
+      out = model.predict(x)
+
+      fd, fname = tempfile.mkstemp('.h5')
+      keras.models.save_model(model, fname)
+      model = keras.models.load_model(fname)
+
+      # Check that the HDF5 file contains a chunked array
+      # of layer names.
+      with h5py.File(fname, 'r') as h5file:
+        num_names_arrays = len([attr for attr in h5file['model_weights'].attrs
+                                if attr.startswith('layer_names')])
+      # The chunking of the layer names array should have happened.
+      self.assertGreater(num_names_arrays, 0)
+      out2 = model.predict(x)
+      self.assertAllClose(out, out2, atol=1e-05)
+
+      # Cleanup
+      os.close(fd)
+      os.remove(fname)
+
+  def test_saving_model_with_long_weights_names(self):
+    if h5py is None:
+      return  # Skip test if models cannot be saved.
+
+    with self.test_session():
+      x = keras.Input(shape=(2,), name='nested_model_input')
+      f = x
+      for i in range(4):
+        f = keras.layers.Dense(2, name='nested_model_dense_%d' % (i,))(f)
+      # This layer name will make the `weight_names`
+      # HDF5 attribute blow out of proportion.
+      f = keras.layers.Dense(2, name='nested_model_output' + ('x' * (2**15)))(f)
+      nested_model = keras.Model(inputs=[x], outputs=[f], name='nested_model')
+
+      x = keras.Input(shape=(2,), name='outer_model_input')
+      f = nested_model(x)
+      f = keras.layers.Dense(2, name='outer_model_output')(f)
+
+      model = keras.Model(inputs=[x], outputs=[f])
+      model.compile(loss='mse', optimizer='adam', metrics=['acc'])
+
+      x = np.random.random((1, 2))
+      y = np.random.random((1, 2))
+      model.train_on_batch(x, y)
+      out = model.predict(x)
+
+      fd, fname = tempfile.mkstemp('.h5')
+      keras.models.save_model(model, fname)
+      model = keras.models.load_model(fname)
+
+      # Check that the HDF5 file contains a chunked array
+      # of weight names.
+      with h5py.File(fname, 'r') as h5file:
+        num_weight_arrays = len(
+            [attr for attr in h5file['model_weights']['nested_model'].attrs
+             if attr.startswith('weight_names')])
+      # The chunking of the weight names array should have happened.
+      self.assertGreater(num_weight_arrays, 0)
+      out2 = model.predict(x)
+      self.assertAllClose(out, out2, atol=1e-05)
+
+      # Cleanup
+      os.close(fd)
+      os.remove(fname)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/_impl/keras/engine/topology_test.py b/tensorflow/python/keras/_impl/keras/engine/topology_test.py
index 04434323d6a9f8e12ad8f45c1e83819dfa8b3b96..b50277c8fff917d77694903c989fd02ea98b1711 100644
--- a/tensorflow/python/keras/_impl/keras/engine/topology_test.py
+++ b/tensorflow/python/keras/_impl/keras/engine/topology_test.py
@@ -531,7 +531,9 @@ class TopologyConstructionTest(test.TestCase):
 
       e = keras.layers.Input(shape=(32,), name='input_e')
       f = keras.layers.Input(shape=(32,), name='input_f')
+      self.assertEqual(len(model.inputs), 2)
       g, h = model([e, f])
+      self.assertEqual(len(model.inputs), 2)
       self.assertEqual(g.name, 'model/dense_2/BiasAdd:0')
 
       self.assertListEqual(g.get_shape().as_list(), c.get_shape().as_list())
@@ -713,7 +715,9 @@ class TopologyConstructionTest(test.TestCase):
 
     j = keras.layers.Input(shape=(32,), name='input_j')
     k = keras.layers.Input(shape=(32,), name='input_k')
+    self.assertEqual(len(model.inputs), 2)
     m, n = model([j, k])
+    self.assertEqual(len(model.inputs), 2)
     tf_model = keras.models.Model([j, k], [m, n])
 
     j_tf = array_ops.placeholder(dtype=dtypes.float32, shape=(None, 32))
@@ -751,7 +755,17 @@ class TopologyConstructionTest(test.TestCase):
       def compute_mask(self, inputs, mask=None):
         return array_ops.ones_like(inputs)
 
-    if context.in_graph_mode():
+    if context.executing_eagerly():
+      a = constant_op.constant([2] * 32)
+      mask = constant_op.constant([0, 1] * 16)
+      a._keras_mask = mask
+      b = MaskedLayer().apply(a)
+      self.assertTrue(hasattr(b, '_keras_mask'))
+      self.assertAllEqual(
+          self.evaluate(array_ops.ones_like(mask)),
+          self.evaluate(getattr(b, '_keras_mask')))
+      self.assertAllEqual(self.evaluate(a * mask), self.evaluate(b))
+    else:
       x = keras.Input(shape=(32,))
       y = MaskedLayer()(x)  # pylint: disable=not-callable
       network = keras.engine.Network(x, y)
@@ -765,15 +779,6 @@ class TopologyConstructionTest(test.TestCase):
       x_2 = array_ops.placeholder(dtype='float32', shape=(None, 32))
       y_2 = network(x_2)
       self.assertEqual(y_2.get_shape().as_list(), [None, 32])
-    else:
-      a = constant_op.constant([2] * 32)
-      mask = constant_op.constant([0, 1] * 16)
-      a._keras_mask = mask
-      b = MaskedLayer().apply(a)
-      self.assertTrue(hasattr(b, '_keras_mask'))
-      self.assertAllEqual(self.evaluate(array_ops.ones_like(mask)),
-                          self.evaluate(getattr(b, '_keras_mask')))
-      self.assertAllEqual(self.evaluate(a * mask), self.evaluate(b))
 
   def test_activity_regularization_with_model_composition(self):
 
@@ -881,13 +886,13 @@ class DeferredModeTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes()
   def testSimpleNetworkBuilding(self):
     inputs = keras.engine.Input(shape=(32,))
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       self.assertIsInstance(inputs, tf_base_layers._DeferredTensor)
       self.assertEqual(inputs.dtype.name, 'float32')
       self.assertEqual(inputs.shape.as_list(), [None, 32])
 
     x = keras.layers.Dense(2)(inputs)
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       self.assertIsInstance(x, tf_base_layers._DeferredTensor)
       self.assertEqual(x.dtype.name, 'float32')
       self.assertEqual(x.shape.as_list(), [None, 2])
@@ -896,7 +901,7 @@ class DeferredModeTest(test.TestCase):
     network = keras.engine.Network(inputs, outputs)
     self.assertIsInstance(network, keras.engine.Network)
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       # It should be possible to call such a network on EagerTensors.
       inputs = constant_op.constant(
           np.random.random((10, 32)).astype('float32'))
@@ -921,7 +926,7 @@ class DeferredModeTest(test.TestCase):
     c = keras.layers.Dense(2)(c)
 
     network = keras.engine.Network([input_a, input_b], [a, c])
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       a_val = constant_op.constant(
           np.random.random((10, 32)).astype('float32'))
       b_val = constant_op.constant(
diff --git a/tensorflow/python/keras/_impl/keras/engine/training.py b/tensorflow/python/keras/_impl/keras/engine/training.py
index 57451ad4704d1935d0217a7d8d8e0f3995170f43..4acb41553eda2e07962f6ac510f08988a5adb90c 100644
--- a/tensorflow/python/keras/_impl/keras/engine/training.py
+++ b/tensorflow/python/keras/_impl/keras/engine/training.py
@@ -18,500 +18,28 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import copy
-
 import numpy as np
 
 from tensorflow.python.eager import context
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.keras._impl.keras import backend as K
-from tensorflow.python.keras._impl.keras import callbacks as cbks
 from tensorflow.python.keras._impl.keras import losses
 from tensorflow.python.keras._impl.keras import metrics as metrics_module
 from tensorflow.python.keras._impl.keras import optimizers
+from tensorflow.python.keras._impl.keras.engine import training_arrays
 from tensorflow.python.keras._impl.keras.engine import training_eager
+from tensorflow.python.keras._impl.keras.engine import training_generator
+from tensorflow.python.keras._impl.keras.engine import training_utils
 from tensorflow.python.keras._impl.keras.engine.base_layer import Layer
 from tensorflow.python.keras._impl.keras.engine.network import Network
-from tensorflow.python.keras._impl.keras.utils.data_utils import GeneratorEnqueuer
-from tensorflow.python.keras._impl.keras.utils.data_utils import OrderedEnqueuer
-from tensorflow.python.keras._impl.keras.utils.data_utils import Sequence
-from tensorflow.python.keras._impl.keras.utils.generic_utils import make_batches
-from tensorflow.python.keras._impl.keras.utils.generic_utils import Progbar
 from tensorflow.python.keras._impl.keras.utils.generic_utils import slice_arrays
 from tensorflow.python.layers.base import _DeferredTensor
+from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import optimizer as tf_optimizer_module
 from tensorflow.python.util.tf_export import tf_export
 
-try:
-  from scipy.sparse import issparse  # pylint: disable=g-import-not-at-top
-except ImportError:
-  issparse = None
-
-
-def _standardize_input_data(data,
-                            names,
-                            shapes=None,
-                            check_batch_axis=True,
-                            exception_prefix=''):
-  """Normalizes inputs and targets provided by users.
-
-  Users may pass data as a list of arrays, dictionary of arrays,
-  or as a single array. We normalize this to an ordered list of
-  arrays (same order as `names`), while checking that the provided
-  arrays have shapes that match the network's expectations.
-
-  Arguments:
-      data: User-provided input data (polymorphic).
-      names: List of expected array names.
-      shapes: Optional list of expected array shapes.
-      check_batch_axis: Boolean; whether to check that
-          the batch axis of the arrays matches the expected
-          value found in `shapes`.
-      exception_prefix: String prefix used for exception formatting.
-
-  Returns:
-      List of standardized input arrays (one array per model input).
-
-  Raises:
-      ValueError: in case of improperly formatted user-provided data.
-  """
-  if not names:
-    if data is not None and hasattr(data, '__len__') and len(data):
-      raise ValueError('Error when checking model ' + exception_prefix + ': '
-                       'expected no data, but got:', data)
-    return []
-  if data is None:
-    return [None for _ in range(len(names))]
-
-  if isinstance(data, dict):
-    try:
-      data = [
-          data[x].values
-          if data[x].__class__.__name__ == 'DataFrame' else data[x]
-          for x in names
-      ]
-    except KeyError as e:
-      raise ValueError('No data provided for "' + e.args[0] + '". Need data '
-                       'for each key in: ' + str(names))
-  elif isinstance(data, list):
-    if isinstance(data[0], list):
-      data = [np.asarray(d) for d in data]
-    elif len(names) == 1 and isinstance(data[0], (float, int)):
-      data = [np.asarray(data)]
-    else:
-      data = [
-          x.values if x.__class__.__name__ == 'DataFrame' else x for x in data
-      ]
-  else:
-    data = data.values if data.__class__.__name__ == 'DataFrame' else data
-    data = [data]
-  data = [
-      np.expand_dims(x, 1) if x is not None and x.ndim == 1 else x for x in data
-  ]
-
-  if len(data) != len(names):
-    if data and hasattr(data[0], 'shape'):
-      raise ValueError('Error when checking model ' + exception_prefix +
-                       ': the list of Numpy arrays that you are passing to '
-                       'your model is not the size the model expected. '
-                       'Expected to see ' + str(len(names)) + ' array(s), '
-                       'but instead got the following list of ' +
-                       str(len(data)) + ' arrays: ' + str(data)[:200] + '...')
-    elif len(names) > 1:
-      raise ValueError(
-          'Error when checking model ' + exception_prefix +
-          ': you are passing a list as input to your model, '
-          'but the model expects a list of ' + str(len(names)) +
-          ' Numpy arrays instead. The list you passed was: ' + str(data)[:200])
-    elif len(data) == 1 and not hasattr(data[0], 'shape'):
-      raise TypeError('Error when checking model ' + exception_prefix +
-                      ': data should be a Numpy array, or list/dict of '
-                      'Numpy arrays. Found: ' + str(data)[:200] + '...')
-    elif len(names) == 1:
-      data = [np.asarray(data)]
-
-  # Check shapes compatibility.
-  if shapes:
-    for i in range(len(names)):
-      if shapes[i] is not None:
-        data_shape = data[i].shape
-        shape = shapes[i]
-        if data[i].ndim != len(shape):
-          raise ValueError('Error when checking ' + exception_prefix +
-                           ': expected ' + names[i] + ' to have ' +
-                           str(len(shape)) + ' dimensions, but got array '
-                           'with shape ' + str(data_shape))
-        if not check_batch_axis:
-          data_shape = data_shape[1:]
-          shape = shape[1:]
-        for dim, ref_dim in zip(data_shape, shape):
-          if ref_dim != dim and ref_dim:
-            raise ValueError(
-                'Error when checking ' + exception_prefix + ': expected ' +
-                names[i] + ' to have shape ' + str(shape) +
-                ' but got array with shape ' + str(data_shape))
-  return data
-
-
-def _standardize_sample_or_class_weights(x_weight, output_names, weight_type):
-  """Maps `sample_weight` or `class_weight` to model outputs.
-
-  Arguments:
-      x_weight: User-provided `sample_weight` or `class_weight` argument.
-      output_names: List of output names (strings) in the model.
-      weight_type: A string used purely for exception printing.
-
-  Returns:
-      A list of `sample_weight` or `class_weight` where there are exactly
-          one element per model output.
-
-  Raises:
-      ValueError: In case of invalid user-provided argument.
-  """
-  if x_weight is None or len(x_weight) == 0:  # pylint: disable=g-explicit-length-test
-    return [None for _ in output_names]
-  if len(output_names) == 1:
-    if isinstance(x_weight, list) and len(x_weight) == 1:
-      return x_weight
-    if isinstance(x_weight, dict) and output_names[0] in x_weight:
-      return [x_weight[output_names[0]]]
-    else:
-      return [x_weight]
-  if isinstance(x_weight, list):
-    if len(x_weight) != len(output_names):
-      raise ValueError('Provided `' + weight_type + '` was a list of ' +
-                       str(len(x_weight)) + ' elements, but the model has ' +
-                       str(len(output_names)) + ' outputs. '
-                       'You should provide one `' + weight_type + '`'
-                       'array per model output.')
-    return x_weight
-  if isinstance(x_weight, dict):
-    x_weights = []
-    for name in output_names:
-      x_weights.append(x_weight.get(name))
-    return x_weights
-  else:
-    raise TypeError(
-        'The model has multiple outputs, so `' + weight_type + '` '
-        'should be either a list or a dict. '
-        'Provided `' + weight_type + '` type not understood: ' + str(x_weight))
-
-
-def _standardize_class_weights(class_weight, output_names):
-  return _standardize_sample_or_class_weights(class_weight, output_names,
-                                              'class_weight')
-
-
-def _standardize_sample_weights(sample_weight, output_names):
-  return _standardize_sample_or_class_weights(sample_weight, output_names,
-                                              'sample_weight')
-
-
-def _check_array_lengths(inputs, targets, weights=None):
-  """Does user input validation for numpy arrays.
-
-  Arguments:
-      inputs: list of Numpy arrays of inputs.
-      targets: list of Numpy arrays of targets.
-      weights: list of Numpy arrays of sample weights.
-
-  Raises:
-      ValueError: in case of incorrectly formatted data.
-  """
-
-  def set_of_lengths(x):
-    # return a set with the variation between
-    # different shapes, with None => 0
-    if x is None:
-      return {0}
-    else:
-      return set([0 if y is None else y.shape[0] for y in x])
-
-  set_x = set_of_lengths(inputs)
-  set_y = set_of_lengths(targets)
-  set_w = set_of_lengths(weights)
-  if len(set_x) > 1:
-    raise ValueError('All input arrays (x) should have '
-                     'the same number of samples. Got array shapes: ' +
-                     str([x.shape for x in inputs]))
-  if len(set_y) > 1:
-    raise ValueError('All target arrays (y) should have '
-                     'the same number of samples. Got array shapes: ' +
-                     str([y.shape for y in targets]))
-  if set_x and set_y and list(set_x)[0] != list(set_y)[0]:
-    raise ValueError('Input arrays should have '
-                     'the same number of samples as target arrays. '
-                     'Found ' + str(list(set_x)[0]) + ' input samples '
-                     'and ' + str(list(set_y)[0]) + ' target samples.')
-  if len(set_w) > 1:
-    raise ValueError('All sample_weight arrays should have '
-                     'the same number of samples. Got array shapes: ' +
-                     str([w.shape for w in weights]))
-  if set_y and set_w and list(set_y)[0] != list(set_w)[0]:
-    raise ValueError('Sample_weight arrays should have '
-                     'the same number of samples as target arrays. Got ' +
-                     str(list(set_y)[0]) + ' input samples and ' +
-                     str(list(set_w)[0]) + ' target samples.')
-
-
-def _check_loss_and_target_compatibility(targets, loss_fns, output_shapes):
-  """Does validation on the compatibility of targets and loss functions.
-
-  This helps prevent users from using loss functions incorrectly.
-
-  Arguments:
-      targets: list of Numpy arrays of targets.
-      loss_fns: list of loss functions.
-      output_shapes: list of shapes of model outputs.
-
-  Raises:
-      ValueError: if a loss function or target array
-          is incompatible with an output.
-  """
-  key_losses = {
-      losses.mean_squared_error, losses.binary_crossentropy,
-      losses.categorical_crossentropy
-  }
-  for y, loss, shape in zip(targets, loss_fns, output_shapes):
-    if y is None or loss is None:
-      continue
-    if loss is losses.categorical_crossentropy:
-      if y.shape[-1] == 1:
-        raise ValueError('You are passing a target array of shape ' + str(
-            y.shape) + ' while using as loss `categorical_crossentropy`. '
-                         '`categorical_crossentropy` expects '
-                         'targets to be binary matrices (1s and 0s) '
-                         'of shape (samples, classes). '
-                         'If your targets are integer classes, '
-                         'you can convert them to the expected format via:\n'
-                         '```\n'
-                         'from keras.utils import to_categorical\n'
-                         'y_binary = to_categorical(y_int)\n'
-                         '```\n'
-                         '\n'
-                         'Alternatively, you can use the loss function '
-                         '`sparse_categorical_crossentropy` instead, '
-                         'which does expect integer targets.')
-    if loss in key_losses:
-      for target_dim, out_dim in zip(y.shape[1:], shape[1:]):
-        if out_dim is not None and target_dim != out_dim:
-          raise ValueError('A target array with shape ' + str(y.shape) +
-                           ' was passed for an output of shape ' + str(shape) +
-                           ' while using as loss `' + loss.__name__ + '`. '
-                           'This loss expects '
-                           'targets to have the same shape '
-                           'as the output.')
-
-
-def _collect_metrics(metrics, output_names):
-  """Maps metric functions to model outputs.
-
-  Arguments:
-      metrics: a list or dict of metric functions.
-      output_names: a list of the names (strings) of model outputs.
-
-  Returns:
-      A list (one entry per model output) of lists of metric functions.
-      For instance, if the model has 2 outputs, and for the first output
-      we want to compute "binary_accuracy" and "binary_crossentropy",
-      and just "binary_accuracy" for the second output,
-      the list would look like:
-          `[[binary_accuracy, binary_crossentropy], [binary_accuracy]]`
-
-  Raises:
-      TypeError: if an incorrect type is passed for the `metrics` argument.
-  """
-  if not metrics:
-    return [[] for _ in output_names]
-  if isinstance(metrics, list):
-    # we then apply all metrics to all outputs.
-    return [copy.copy(metrics) for _ in output_names]
-  elif isinstance(metrics, dict):
-    nested_metrics = []
-    for name in output_names:
-      output_metrics = metrics.get(name, [])
-      if not isinstance(output_metrics, list):
-        output_metrics = [output_metrics]
-      nested_metrics.append(output_metrics)
-    return nested_metrics
-  else:
-    raise TypeError('Type of `metrics` argument not understood. '
-                    'Expected a list or dictionary, found: ' + str(metrics))
-
-
-def _batch_shuffle(index_array, batch_size):
-  """Shuffles an array in a batch-wise fashion.
-
-  Useful for shuffling HDF5 arrays
-  (where one cannot access arbitrary indices).
-
-  Arguments:
-      index_array: array of indices to be shuffled.
-      batch_size: integer.
-
-  Returns:
-      The `index_array` array, shuffled in a batch-wise fashion.
-  """
-  batch_count = int(len(index_array) / batch_size)
-  # to reshape we need to be cleanly divisible by batch size
-  # we stash extra items and reappend them after shuffling
-  last_batch = index_array[batch_count * batch_size:]
-  index_array = index_array[:batch_count * batch_size]
-  index_array = index_array.reshape((batch_count, batch_size))
-  np.random.shuffle(index_array)
-  index_array = index_array.flatten()
-  return np.append(index_array, last_batch)
-
-
-def _weighted_masked_objective(fn):
-  """Adds support for masking and sample-weighting to an objective function.
-
-  It transforms an objective function `fn(y_true, y_pred)`
-  into a sample-weighted, cost-masked objective function
-  `fn(y_true, y_pred, weights, mask)`.
-
-  Arguments:
-      fn: The objective function to wrap,
-          with signature `fn(y_true, y_pred)`.
-
-  Returns:
-      A function with signature `fn(y_true, y_pred, weights, mask)`.
-  """
-  if fn is None:
-    return None
-
-  def weighted(y_true, y_pred, weights, mask=None):
-    """Wrapper function.
-
-    Arguments:
-        y_true: `y_true` argument of `fn`.
-        y_pred: `y_pred` argument of `fn`.
-        weights: Weights tensor.
-        mask: Mask tensor.
-
-    Returns:
-        Scalar tensor.
-    """
-    # score_array has ndim >= 2
-    score_array = fn(y_true, y_pred)
-    if mask is not None:
-      # Cast the mask to floatX to avoid float64 upcasting in theano
-      mask = K.cast(mask, K.floatx())
-      # mask should have the same shape as score_array
-      score_array *= mask
-      #  the loss per batch should be proportional
-      #  to the number of unmasked samples.
-      score_array /= K.mean(mask)
-
-    # apply sample weighting
-    if weights is not None:
-      # reduce score_array to same ndim as weight array
-      ndim = K.ndim(score_array)
-      weight_ndim = K.ndim(weights)
-      score_array = K.mean(score_array, axis=list(range(weight_ndim, ndim)))
-      score_array *= weights
-      score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
-    return K.mean(score_array)
-
-  return weighted
-
-
-def _standardize_weights(y,
-                         sample_weight=None,
-                         class_weight=None,
-                         sample_weight_mode=None):
-  """Performs sample weight validation and standardization.
-
-  Everything gets normalized to a single sample-wise (or timestep-wise)
-  weight array.
-
-  Arguments:
-      y: Numpy array of model targets to be weighted.
-      sample_weight: User-provided `sample_weight` argument.
-      class_weight: User-provided `class_weight` argument.
-      sample_weight_mode: One of `None` or `"temporal"`.
-          `"temporal"` indicated that we expect 2D weight data
-          that will be applied to the last 2 dimensions of
-          the targets (i.e. we are weighting timesteps, not samples).
-
-  Returns:
-      A numpy array of target weights, one entry per sample to weight.
-
-  Raises:
-      ValueError: In case of invalid user-provided arguments.
-  """
-  if sample_weight_mode is not None:
-    if sample_weight_mode != 'temporal':
-      raise ValueError('"sample_weight_mode '
-                       'should be None or "temporal". '
-                       'Found: ' + str(sample_weight_mode))
-    if len(y.shape) < 3:
-      raise ValueError('Found a sample_weight array for '
-                       'an input with shape ' + str(y.shape) + '. '
-                       'Timestep-wise sample weighting (use of '
-                       'sample_weight_mode="temporal") is restricted to '
-                       'outputs that are at least 3D, i.e. that have '
-                       'a time dimension.')
-    if sample_weight is not None and len(sample_weight.shape) != 2:
-      raise ValueError('Found a sample_weight array with shape ' +
-                       str(sample_weight.shape) + '. '
-                       'In order to use timestep-wise sample weighting, '
-                       'you should pass a 2D sample_weight array.')
-  else:
-    if sample_weight is not None and len(sample_weight.shape) != 1:
-      raise ValueError('Found a sample_weight array with shape ' +
-                       str(sample_weight.shape) + '. '
-                       'In order to use timestep-wise sample weights, '
-                       'you should specify '
-                       'sample_weight_mode="temporal" '
-                       'in compile(). If you just mean to use '
-                       'sample-wise weights, make sure your '
-                       'sample_weight array is 1D.')
-
-  if sample_weight is not None:
-    if len(sample_weight.shape) > len(y.shape):
-      raise ValueError(
-          'Found a sample_weight with shape' + str(sample_weight.shape) + '.'
-          'Expected sample_weight with rank '
-          'less than or equal to ' + str(len(y.shape)))
-
-    if y.shape[:sample_weight.ndim] != sample_weight.shape:
-      raise ValueError(
-          'Found a sample_weight array with shape ' + str(sample_weight.shape) +
-          ' for an input with shape ' + str(y.shape) + '. '
-          'sample_weight cannot be broadcast.')
-    return sample_weight
-  elif isinstance(class_weight, dict):
-    if len(y.shape) > 2:
-      raise ValueError('`class_weight` not supported for '
-                       '3+ dimensional targets.')
-    if y.shape[1] > 1:
-      y_classes = np.argmax(y, axis=1)
-    elif y.shape[1] == 1:
-      y_classes = np.reshape(y, y.shape[0])
-    else:
-      y_classes = y
-
-    weights = np.asarray(
-        [class_weight[cls] for cls in y_classes if cls in class_weight])
-
-    if len(weights) != len(y_classes):
-      # subtract the sets to pick all missing classes
-      existing_classes = set(y_classes)
-      existing_class_weight = set(class_weight.keys())
-      raise ValueError('`class_weight` must contain all classes in the data.'
-                       ' The classes %s exist in the data but not in '
-                       '`class_weight`.' %
-                       (existing_classes - existing_class_weight))
-    return weights
-  else:
-    if sample_weight_mode is None:
-      return np.ones((y.shape[0],), dtype=K.floatx())
-    else:
-      return np.ones((y.shape[0], y.shape[1]), dtype=K.floatx())
-
 
 @tf_export('keras.models.Model', 'keras.Model')
 class Model(Network):
@@ -634,7 +162,7 @@ class Model(Network):
             `optimizer`, `loss`, `metrics` or `sample_weight_mode`.
     """
     loss = loss or {}
-    if context.in_eager_mode() and  not isinstance(
+    if context.executing_eagerly() and not isinstance(
         optimizer, (tf_optimizer_module.Optimizer, optimizers.TFOptimizer)):
       raise ValueError('Only TF native optimizers are supported in Eager mode.')
 
@@ -642,13 +170,13 @@ class Model(Network):
     self.loss = loss
     self.metrics = metrics or []
     self.loss_weights = loss_weights
-    if context.in_eager_mode() and sample_weight_mode is not None:
+    if context.executing_eagerly() and sample_weight_mode is not None:
       raise ValueError('sample_weight_mode is not supported in Eager mode.')
     self.sample_weight_mode = sample_weight_mode
-    if context.in_eager_mode() and weighted_metrics is not None:
+    if context.executing_eagerly() and weighted_metrics is not None:
       raise ValueError('weighted_metrics is not supported in Eager mode.')
     self.weighted_metrics = weighted_metrics
-    if context.in_eager_mode() and target_tensors is not None:
+    if context.executing_eagerly() and target_tensors is not None:
       raise ValueError('target_tensors is not supported in Eager mode.')
     self.target_tensors = target_tensors
 
@@ -688,7 +216,8 @@ class Model(Network):
       loss_functions = [loss_function for _ in range(len(self.outputs))]
     self.loss_functions = loss_functions
 
-    weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions]
+    weighted_losses = [training_utils.weighted_masked_objective(fn)
+                       for fn in loss_functions]
     skip_target_indices = []
     skip_target_weighing_indices = []
     self._feed_outputs = []
@@ -701,7 +230,7 @@ class Model(Network):
         skip_target_weighing_indices.append(i)
 
     # Prepare output masks.
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       masks = self.compute_mask(self.inputs, mask=None)
       if masks is None:
         masks = [None for _ in self.outputs]
@@ -735,9 +264,9 @@ class Model(Network):
     self.loss_weights_list = loss_weights_list
 
     # initialization for Eager mode execution
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       if target_tensors is not None:
-        raise ValueError('target_tensors are not currently supported in Eager'
+        raise ValueError('target_tensors are not currently supported in Eager '
                          'mode.')
       self.total_loss = None
       self.metrics_tensors = []
@@ -745,7 +274,8 @@ class Model(Network):
       for i in range(len(self.outputs)):
         if len(self.outputs) > 1:
           self.metrics_names.append(self.output_names[i] + '_loss')
-      self.nested_metrics = _collect_metrics(metrics, self.output_names)
+      self.nested_metrics = training_utils.collect_metrics(metrics,
+                                                           self.output_names)
       self._feed_sample_weight_modes = []
       for i in range(len(self.outputs)):
         self._feed_sample_weight_modes.append(None)
@@ -862,12 +392,12 @@ class Model(Network):
           sample_weights.append(None)
         else:
           if sample_weight_mode == 'temporal':
-            sample_weights.append(
-                K.placeholder(ndim=2, name=name + '_sample_weights'))
+            sample_weights.append(array_ops.placeholder_with_default(
+                [[1.]], shape=[None, None], name=name + '_sample_weights'))
             sample_weight_modes.append('temporal')
           else:
-            sample_weights.append(
-                K.placeholder(ndim=1, name=name + '_sample_weights'))
+            sample_weights.append(array_ops.placeholder_with_default(
+                [1.], shape=[None], name=name + '_sample_weights'))
             sample_weight_modes.append(None)
     self.sample_weight_modes = sample_weight_modes
     self._feed_sample_weight_modes = []
@@ -915,9 +445,9 @@ class Model(Network):
 
     # List of same size as output_names.
     # contains tuples (metrics for output, names of metrics).
-    nested_metrics = _collect_metrics(metrics, self.output_names)
-    nested_weighted_metrics = _collect_metrics(weighted_metrics,
-                                               self.output_names)
+    nested_metrics = training_utils.collect_metrics(metrics, self.output_names)
+    nested_weighted_metrics = training_utils.collect_metrics(weighted_metrics,
+                                                             self.output_names)
     self.metrics_updates = []
     self.stateful_metric_names = []
     with K.name_scope('metrics'):
@@ -963,11 +493,13 @@ class Model(Network):
                 suffix = 'acc'
               elif metric in ('crossentropy', 'ce'):
                 suffix = 'ce'
-              weighted_metric_fn = _weighted_masked_objective(metric_fn)
+              weighted_metric_fn = training_utils.weighted_masked_objective(
+                  metric_fn)
               metric_name = metric_name_prefix + suffix
             else:
               metric_fn = metrics_module.get(metric)
-              weighted_metric_fn = _weighted_masked_objective(metric_fn)
+              weighted_metric_fn = training_utils.weighted_masked_objective(
+                  metric_fn)
               # Get metric name as string
               if hasattr(metric_fn, 'name'):
                 metric_name = metric_fn.name
@@ -1105,451 +637,6 @@ class Model(Network):
           name='predict_function',
           **kwargs)
 
-  def _check_num_samples(self,
-                         ins,
-                         batch_size=None,
-                         steps=None,
-                         steps_name='steps'):
-    """Determine the number of samples provided for training and evaluation.
-
-    The number of samples is not defined when running with `steps`,
-    in which case the number of samples is set to `None`.
-
-    Arguments:
-        ins: List of tensors to be fed to the Keras function.
-        batch_size: Integer batch size or `None` if not defined.
-        steps: Total number of steps (batches of samples)
-            before declaring `_predict_loop` finished.
-            Ignored with the default value of `None`.
-        steps_name: The public API's parameter name for `steps`.
-
-    Raises:
-        ValueError: when `steps` is `None` and the attribute `ins.shape`
-        does not exist. Also raises ValueError when `steps` is not `None`
-        and `batch_size` is not `None` because they are mutually
-        exclusive.
-
-    Returns:
-        When steps is `None`, returns the number of samples to be
-        processed based on the size of the first dimension of the
-        first input numpy array. When steps is not `None` and
-        `batch_size` is `None`, returns `None`.
-
-    Raises:
-        ValueError: In case of invalid arguments.
-    """
-    if steps is not None:
-      num_samples = None
-      if batch_size is not None:
-        raise ValueError(
-            'If ' + steps_name + ' is set, the `batch_size` must be None.')
-    elif ins and hasattr(ins[0], 'shape'):
-      num_samples = ins[0].shape[0]
-    else:
-      raise ValueError(
-          'Either the input data should have '
-          'a defined shape, or ' + steps_name + ' should be specified.')
-    return num_samples
-
-  def _fit_loop(self,
-                f,
-                ins,
-                out_labels=None,
-                batch_size=None,
-                epochs=100,
-                verbose=1,
-                callbacks=None,
-                val_f=None,
-                val_ins=None,
-                shuffle=True,
-                callback_metrics=None,
-                initial_epoch=0,
-                steps_per_epoch=None,
-                validation_steps=None):
-    """Abstract fit function for `f(ins)`.
-
-    Assume that f returns a list, labeled by out_labels.
-
-    Arguments:
-        f: Keras function returning a list of tensors
-        ins: List of tensors to be fed to `f`
-        out_labels: List of strings, display names of
-            the outputs of `f`
-        batch_size: Integer batch size or None if unknown.
-        epochs: Number of times to iterate over the data
-        verbose: Verbosity mode, 0, 1 or 2
-        callbacks: List of callbacks to be called during training
-        val_f: Keras function to call for validation
-        val_ins: List of tensors to be fed to `val_f`
-        shuffle: Whether to shuffle the data at the beginning of each epoch
-        callback_metrics: List of strings, the display names of the metrics
-            passed to the callbacks. They should be the
-            concatenation of list the display names of the outputs of
-             `f` and the list of display names of the outputs of `f_val`.
-        initial_epoch: Epoch at which to start training
-            (useful for resuming a previous training run)
-        steps_per_epoch: Total number of steps (batches of samples)
-            before declaring one epoch finished and starting the
-            next epoch. Ignored with the default value of `None`.
-        validation_steps: Number of steps to run validation for
-            (only if doing validation from data tensors).
-            Ignored with the default value of `None`.
-
-    Returns:
-        `History` object.
-
-    Raises:
-        ValueError: in case of invalid arguments.
-    """
-    do_validation = False
-    if val_f and val_ins:
-      do_validation = True
-      if verbose and ins and hasattr(ins[0], 'shape') and hasattr(
-          val_ins[0], 'shape'):
-        print('Train on %d samples, validate on %d samples' %
-              (ins[0].shape[0], val_ins[0].shape[0]))
-    if validation_steps:
-      do_validation = True
-      if steps_per_epoch is None:
-        raise ValueError('Can only use `validation_steps` '
-                         'when doing step-wise '
-                         'training, i.e. `steps_per_epoch` '
-                         'must be set.')
-
-    num_train_samples = self._check_num_samples(
-        ins, batch_size, steps_per_epoch, 'steps_per_epoch')
-    if num_train_samples is not None:
-      index_array = np.arange(num_train_samples)
-
-    self.history = cbks.History()
-    all_callbacks = [cbks.BaseLogger(
-        stateful_metrics=self.stateful_metric_names)]
-    if verbose:
-      if steps_per_epoch is not None:
-        count_mode = 'steps'
-      else:
-        count_mode = 'samples'
-      all_callbacks.append(
-          cbks.ProgbarLogger(
-              count_mode, stateful_metrics=self.stateful_metric_names))
-    all_callbacks += (callbacks or []) + [self.history]
-    callbacks = cbks.CallbackList(all_callbacks)
-    out_labels = out_labels or []
-
-    # it's possible to callback a different model than self
-    # (used by Sequential models)
-    if hasattr(self, 'callback_model') and self.callback_model:
-      callback_model = self.callback_model
-    else:
-      callback_model = self
-
-    callbacks.set_model(callback_model)
-
-    callbacks.set_params({
-        'batch_size': batch_size,
-        'epochs': epochs,
-        'steps': steps_per_epoch,
-        'samples': num_train_samples,
-        'verbose': verbose,
-        'do_validation': do_validation,
-        'metrics': callback_metrics or [],
-    })
-    callbacks.on_train_begin()
-    callback_model.stop_training = False
-    for cbk in callbacks:
-      cbk.validation_data = val_ins
-
-    # To prevent a slowdown, we find beforehand the arrays that need conversion.
-    feed = self._feed_inputs + self._feed_targets + self._feed_sample_weights
-    indices_for_conversion_to_dense = []
-    for i in range(len(feed)):
-      if issparse is not None and issparse(ins[i]) and not K.is_sparse(feed[i]):
-        indices_for_conversion_to_dense.append(i)
-
-    for epoch in range(initial_epoch, epochs):
-      # Reset stateful metrics
-      for m in self.metrics:
-        if isinstance(m, Layer):
-          m.reset_states()
-      # Update callbacks
-      callbacks.on_epoch_begin(epoch)
-      epoch_logs = {}
-      if steps_per_epoch is not None:
-        for step_index in range(steps_per_epoch):
-          batch_logs = {}
-          batch_logs['batch'] = step_index
-          batch_logs['size'] = 1
-          callbacks.on_batch_begin(step_index, batch_logs)
-          outs = f(ins)
-
-          if not isinstance(outs, list):
-            outs = [outs]
-          for l, o in zip(out_labels, outs):
-            batch_logs[l] = o
-
-          callbacks.on_batch_end(step_index, batch_logs)
-          if callback_model.stop_training:
-            break
-
-        if do_validation:
-          val_outs = self._test_loop(
-              val_f,
-              val_ins,
-              batch_size=batch_size,
-              steps=validation_steps,
-              verbose=0)
-          if not isinstance(val_outs, list):
-            val_outs = [val_outs]
-          # Same labels assumed.
-          for l, o in zip(out_labels, val_outs):
-            epoch_logs['val_' + l] = o
-      else:
-        if shuffle == 'batch':
-          index_array = _batch_shuffle(index_array, batch_size)
-        elif shuffle:
-          np.random.shuffle(index_array)
-
-        batches = make_batches(num_train_samples, batch_size)
-
-        for batch_index, (batch_start, batch_end) in enumerate(batches):
-          batch_ids = index_array[batch_start:batch_end]
-          try:
-            if isinstance(ins[-1], float):
-              # Do not slice the training phase flag.
-              ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
-            else:
-              ins_batch = slice_arrays(ins, batch_ids)
-          except TypeError:
-            raise TypeError('TypeError while preparing batch. '
-                            'If using HDF5 input data, '
-                            'pass shuffle="batch".')
-          batch_logs = {}
-          batch_logs['batch'] = batch_index
-          batch_logs['size'] = len(batch_ids)
-          callbacks.on_batch_begin(batch_index, batch_logs)
-          for i in indices_for_conversion_to_dense:
-            ins_batch[i] = ins_batch[i].toarray()
-
-          outs = f(ins_batch)
-          if not isinstance(outs, list):
-            outs = [outs]
-          for l, o in zip(out_labels, outs):
-            batch_logs[l] = o
-
-          callbacks.on_batch_end(batch_index, batch_logs)
-          if callback_model.stop_training:
-            break
-
-          if batch_index == len(batches) - 1:  # Last batch.
-            if do_validation:
-              val_outs = self._test_loop(
-                  val_f, val_ins, batch_size=batch_size, verbose=0)
-              if not isinstance(val_outs, list):
-                val_outs = [val_outs]
-              # Same labels assumed.
-              for l, o in zip(out_labels, val_outs):
-                epoch_logs['val_' + l] = o
-      callbacks.on_epoch_end(epoch, epoch_logs)
-      if callback_model.stop_training:
-        break
-    callbacks.on_train_end()
-    return self.history
-
-  def _predict_loop(self, f, ins, batch_size=32, verbose=0, steps=None):
-    """Abstract method to loop over some data in batches.
-
-    Arguments:
-        f: Keras function returning a list of tensors.
-        ins: list of tensors to be fed to `f`.
-        batch_size: integer batch size.
-        verbose: verbosity mode.
-        steps: Total number of steps (batches of samples)
-            before declaring `_predict_loop` finished.
-            Ignored with the default value of `None`.
-
-    Returns:
-        Array of predictions (if the model has a single output)
-        or list of arrays of predictions
-        (if the model has multiple outputs).
-    """
-    if hasattr(self, 'metrics'):
-      for m in self.metrics:
-        if isinstance(m, Layer):
-          m.reset_states()
-
-    num_samples = self._check_num_samples(ins, batch_size, steps, 'steps')
-    if verbose == 1:
-      if steps is not None:
-        progbar = Progbar(target=steps,
-                          stateful_metrics=self.stateful_metric_names)
-      else:
-        progbar = Progbar(target=num_samples,
-                          stateful_metrics=self.stateful_metric_names)
-
-    indices_for_conversion_to_dense = []
-    for i in range(len(self._feed_inputs)):
-      if (issparse is not None and issparse(ins[i]) and
-          not K.is_sparse(self._feed_inputs[i])):
-        indices_for_conversion_to_dense.append(i)
-
-    if steps is not None:
-      # Step-based predictions.
-      # Since we do not know how many samples
-      # we will see, we cannot pre-allocate
-      # the returned Numpy arrays.
-      # Instead, we store one array per batch seen
-      # and concatenate them upon returning.
-      unconcatenated_outs = []
-      for step in range(steps):
-        batch_outs = f(ins)
-        if not isinstance(batch_outs, list):
-          batch_outs = [batch_outs]
-        if step == 0:
-          for batch_out in batch_outs:
-            unconcatenated_outs.append([])
-        for i, batch_out in enumerate(batch_outs):
-          unconcatenated_outs[i].append(batch_out)
-        if verbose == 1:
-          progbar.update(step + 1)
-      if len(unconcatenated_outs) == 1:
-        return np.concatenate(unconcatenated_outs[0], axis=0)
-      return [
-          np.concatenate(unconcatenated_outs[i], axis=0)
-          for i in range(len(unconcatenated_outs))
-      ]
-    else:
-      # Sample-based predictions.
-      outs = []
-      batches = make_batches(num_samples, batch_size)
-      index_array = np.arange(num_samples)
-      for batch_index, (batch_start, batch_end) in enumerate(batches):
-        batch_ids = index_array[batch_start:batch_end]
-        if ins and isinstance(ins[-1], float):
-          # Do not slice the training phase flag.
-          ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
-        else:
-          ins_batch = slice_arrays(ins, batch_ids)
-        for i in indices_for_conversion_to_dense:
-          ins_batch[i] = ins_batch[i].toarray()
-
-        batch_outs = f(ins_batch)
-        if not isinstance(batch_outs, list):
-          batch_outs = [batch_outs]
-        if batch_index == 0:
-          # Pre-allocate the results arrays.
-          for batch_out in batch_outs:
-            shape = (num_samples,) + batch_out.shape[1:]
-            outs.append(np.zeros(shape, dtype=batch_out.dtype))
-        for i, batch_out in enumerate(batch_outs):
-          outs[i][batch_start:batch_end] = batch_out
-        if verbose == 1:
-          progbar.update(batch_end)
-      if len(outs) == 1:
-        return outs[0]
-      return outs
-
-  def _test_loop(self, f, ins, batch_size=None, verbose=0, steps=None):
-    """Abstract method to loop over some data in batches.
-
-    Arguments:
-        f: Keras function returning a list of tensors.
-        ins: list of tensors to be fed to `f`.
-        batch_size: integer batch size or `None`.
-        verbose: verbosity mode.
-        steps: Total number of steps (batches of samples)
-            before declaring predictions finished.
-            Ignored with the default value of `None`.
-
-    Returns:
-        Scalar loss (if the model has a single output and no metrics)
-        or list of scalars (if the model has multiple outputs
-        and/or metrics). The attribute `model.metrics_names` will give you
-        the display labels for the scalar outputs.
-    """
-    if hasattr(self, 'metrics'):
-      for m in self.metrics:
-        if isinstance(m, Layer):
-          m.reset_states()
-      stateful_metric_indices = [
-          i for i, name in enumerate(self.metrics_names)
-          if str(name) in self.stateful_metric_names
-      ]
-    else:
-      stateful_metric_indices = []
-
-    num_samples = self._check_num_samples(ins, batch_size, steps, 'steps')
-    outs = []
-    if verbose == 1:
-      if steps is not None:
-        progbar = Progbar(target=steps)
-      else:
-        progbar = Progbar(target=num_samples)
-
-    # To prevent a slowdown, we find beforehand the arrays that need conversion.
-    feed = self._feed_inputs + self._feed_targets + self._feed_sample_weights
-    indices_for_conversion_to_dense = []
-    for i in range(len(feed)):
-      if issparse is not None and issparse(ins[i]) and not K.is_sparse(feed[i]):
-        indices_for_conversion_to_dense.append(i)
-
-    if steps is not None:
-      for step in range(steps):
-        batch_outs = f(ins)
-        if isinstance(batch_outs, list):
-          if step == 0:
-            for _ in enumerate(batch_outs):
-              outs.append(0.)
-          for i, batch_out in enumerate(batch_outs):
-            if i in stateful_metric_indices:
-              outs[i] = batch_out
-            else:
-              outs[i] += batch_out
-        else:
-          if step == 0:
-            outs.append(0.)
-          outs[0] += batch_outs
-        if verbose == 1:
-          progbar.update(step + 1)
-      for i in range(len(outs)):
-        if i not in stateful_metric_indices:
-          outs[i] /= steps
-    else:
-      batches = make_batches(num_samples, batch_size)
-      index_array = np.arange(num_samples)
-      for batch_index, (batch_start, batch_end) in enumerate(batches):
-        batch_ids = index_array[batch_start:batch_end]
-        if isinstance(ins[-1], float):
-          # Do not slice the training phase flag.
-          ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
-        else:
-          ins_batch = slice_arrays(ins, batch_ids)
-        for i in indices_for_conversion_to_dense:
-          ins_batch[i] = ins_batch[i].toarray()
-
-        batch_outs = f(ins_batch)
-
-        if isinstance(batch_outs, list):
-          if batch_index == 0:
-            for batch_out in enumerate(batch_outs):
-              outs.append(0.)
-          for i, batch_out in enumerate(batch_outs):
-            if i in stateful_metric_indices:
-              outs[i] = batch_out
-            else:
-              outs[i] += batch_out * len(batch_ids)
-        else:
-          if batch_index == 0:
-            outs.append(0.)
-          outs[0] += batch_outs * len(batch_ids)
-        if verbose == 1:
-          progbar.update(batch_end)
-      for i in range(len(outs)):
-        if i not in stateful_metric_indices:
-          outs[i] /= num_samples
-    if len(outs) == 1:
-      return outs[0]
-    return outs
-
   def _standardize_user_data(self,
                              x,
                              y=None,
@@ -1651,13 +738,13 @@ class Model(Network):
                              'TensorFlow tensors. '
                              'You passed: x=' + str(x) + '; y=' + str(y))
 
-        if context.in_graph_mode():
+        if context.executing_eagerly():
+          target_tensors = None
+        else:
           # Handle target tensors if any passed.
           if not isinstance(y, (list, tuple)):
             y = [y]
           target_tensors = [v for v in y if tensor_util.is_tensor(v)]
-        else:
-          target_tensors = None
         self.compile(optimizer=self.optimizer,
                      loss=self.loss,
                      metrics=self.metrics,
@@ -1674,7 +761,7 @@ class Model(Network):
     # What follows is input validation and standardization to list format,
     # in the case where all inputs are value arrays.
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       # In eager mode, do not do shape validation.
       feed_input_names = self.input_names
       feed_input_shapes = None
@@ -1689,7 +776,7 @@ class Model(Network):
       feed_input_shapes = self._feed_input_shapes
 
     # Standardize the inputs.
-    x = _standardize_input_data(
+    x = training_utils.standardize_input_data(
         x,
         feed_input_names,
         feed_input_shapes,
@@ -1697,7 +784,7 @@ class Model(Network):
         exception_prefix='input')
 
     if y is not None:
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         feed_output_names = self.output_names
         feed_output_shapes = None
         # Sample weighting not supported in this case.
@@ -1728,7 +815,7 @@ class Model(Network):
             feed_output_shapes.append(output_shape)
 
       # Standardize the outputs.
-      y = _standardize_input_data(
+      y = training_utils.standardize_input_data(
           y,
           feed_output_names,
           feed_output_shapes,
@@ -1737,21 +824,21 @@ class Model(Network):
 
       # Generate sample-wise weight values given the `sample_weight` and
       # `class_weight` arguments.
-      sample_weights = _standardize_sample_weights(sample_weight,
-                                                   feed_output_names)
-      class_weights = _standardize_class_weights(class_weight,
-                                                 feed_output_names)
+      sample_weights = training_utils.standardize_sample_weights(
+          sample_weight, feed_output_names)
+      class_weights = training_utils.standardize_class_weights(
+          class_weight, feed_output_names)
       sample_weights = [
-          _standardize_weights(ref, sw, cw, mode)
+          training_utils.standardize_weights(ref, sw, cw, mode)
           for (ref, sw, cw, mode) in zip(y, sample_weights, class_weights,
                                          feed_sample_weight_modes)
       ]
       # Check that all arrays have the same length.
-      _check_array_lengths(x, y, sample_weights)
-      if self._is_graph_network and not context.in_eager_mode():
+      training_utils.check_array_lengths(x, y, sample_weights)
+      if self._is_graph_network and not context.executing_eagerly():
         # Additional checks to avoid users mistakenly using improper loss fns.
-        _check_loss_and_target_compatibility(y, self._feed_loss_fns,
-                                             feed_output_shapes)
+        training_utils.check_loss_and_target_compatibility(
+            y, self._feed_loss_fns, feed_output_shapes)
     else:
       y = []
       sample_weights = []
@@ -1787,11 +874,20 @@ class Model(Network):
         whether to build the model's graph in inference mode (False), training
         mode (True), or using the Keras learning phase (None).
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       self._eager_set_inputs(inputs)
     else:
       self._symbolic_set_inputs(inputs, training=training)
 
+  def _set_scope(self, scope=None):
+    """Modify the Layer scope creation logic to create ResourceVariables."""
+    super(Model, self)._set_scope(scope=scope)
+    # Subclassed Models create ResourceVariables by default. This makes it
+    # easier to use Models in an eager/graph agnostic way (since eager execution
+    # always uses ResourceVariables).
+    if not self._is_graph_network:
+      self._scope.set_use_resource(True)
+
   def _eager_set_inputs(self, inputs):
     """Set model's input and output specs based on the input data received.
 
@@ -1807,7 +903,7 @@ class Model(Network):
     Raises:
       ValueError: If the model's inputs are already set.
     """
-    assert context.in_eager_mode()
+    assert context.executing_eagerly()
     if self.inputs:
       raise ValueError('Model inputs are already set.')
     # On-the-fly setting of model inputs/outputs as DeferredTensors,
@@ -1836,14 +932,17 @@ class Model(Network):
         'output_%d' % (i + 1) for i in range(len(dummy_output_values))]
     self.built = True
 
-  def _symbolic_set_inputs(self, inputs, training=None):
-    """Set model's inputs based on the input data received from the user.
+  def _symbolic_set_inputs(self, inputs, outputs=None, training=None):
+    """Set model's inputs and output specs based on the input data received.
 
     This is to be used for Model subclasses, which do not know at instantiation
     time what their inputs look like.
 
     Args:
       inputs: Argument `x` (input data) passed by the user upon first model use.
+      outputs: None, a data tensor, or a list of data tensors. If None, the
+        outputs will be determined by invoking self.call(), otherwise the
+        provided value will be used.
       training: Boolean or None. Only relevant in symbolic mode. Specifies
         whether to build the model's graph in inference mode (False), training
         mode (True), or using the Keras learning phase (None).
@@ -1851,7 +950,7 @@ class Model(Network):
     Raises:
       ValueError: If the model's inputs are already set.
     """
-    assert context.in_graph_mode()
+    assert not context.executing_eagerly()
     if self.inputs:
       raise ValueError('Model inputs are already set.')
 
@@ -1893,17 +992,18 @@ class Model(Network):
           self._feed_input_names.append(name)
           self._feed_input_shapes.append(K.int_shape(v))
 
-    # Obtain symbolic outputs by calling the model.
-    if len(self.inputs) == 1:
-      if self._expects_training_arg:
-        outputs = self.call(self.inputs[0], training=training)
-      else:
-        outputs = self.call(self.inputs[0])
-    else:
-      if self._expects_training_arg:
-        outputs = self.call(self.inputs, training=training)
+    if outputs is None:
+      # Obtain symbolic outputs by calling the model.
+      if len(self.inputs) == 1:
+        if self._expects_training_arg:
+          outputs = self.call(self.inputs[0], training=training)
+        else:
+          outputs = self.call(self.inputs[0])
       else:
-        outputs = self.call(self.inputs)
+        if self._expects_training_arg:
+          outputs = self.call(self.inputs, training=training)
+        else:
+          outputs = self.call(self.inputs)
     if isinstance(outputs, (list, tuple)):
       outputs = list(outputs)
     else:
@@ -2049,10 +1149,7 @@ class Model(Network):
         class_weight=class_weight,
         batch_size=batch_size)
     # Prepare validation data.
-    do_validation = False
-    val_ins = []
     if validation_data:
-      do_validation = True
       if len(validation_data) == 2:
         val_x, val_y = validation_data  # pylint: disable=unpacking-non-sequence
         val_sample_weight = None
@@ -2070,13 +1167,8 @@ class Model(Network):
           val_y,
           sample_weight=val_sample_weight,
           batch_size=batch_size)
-      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-        val_ins = val_x + val_y + val_sample_weights + [0.]
-      else:
-        val_ins = val_x + val_y + val_sample_weights
 
     elif validation_split and 0. < validation_split < 1.:
-      do_validation = True
       if hasattr(x[0], 'shape'):
         split_at = int(x[0].shape[0] * (1. - validation_split))
       else:
@@ -2085,77 +1177,44 @@ class Model(Network):
       y, val_y = (slice_arrays(y, 0, split_at), slice_arrays(y, split_at))
       sample_weights, val_sample_weights = (slice_arrays(
           sample_weights, 0, split_at), slice_arrays(sample_weights, split_at))
-      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-        val_ins = val_x + val_y + val_sample_weights + [0.]
-      else:
-        val_ins = val_x + val_y + val_sample_weights
-
     elif validation_steps:
-      do_validation = True
-      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-        val_ins = [0.]
-
-    # Prepare input arrays and training function.
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + y + sample_weights + [1.]
+      val_x = []
+      val_y = []
+      val_sample_weights = []
     else:
-      ins = x + y + sample_weights
-
-    # Prepare display labels.
-    out_labels = self.metrics_names
-
-    if context.in_eager_mode():
-      if do_validation:
-        callback_metrics = copy.copy(out_labels) + [
-            'val_' + n for n in out_labels
-        ]
-      else:
-        callback_metrics = copy.copy(out_labels)
+      val_x = None
+      val_y = None
+      val_sample_weights = None
 
+    if context.executing_eagerly():
       return training_eager.fit_loop(
           self,
-          ins,
-          out_labels=out_labels,
+          inputs=x,
+          targets=y,
+          sample_weights=sample_weights,
           batch_size=batch_size,
           epochs=epochs,
           verbose=verbose,
           callbacks=callbacks,
-          val_ins=val_ins,
+          val_inputs=val_x,
+          val_targets=val_y,
+          val_sample_weights=val_sample_weights,
           shuffle=shuffle,
-          callback_metrics=callback_metrics,
           initial_epoch=initial_epoch,
           steps_per_epoch=steps_per_epoch,
           validation_steps=validation_steps)
     else:
-      self._make_train_function()
-      f = self.train_function
-
-      if do_validation:
-        if context.in_graph_mode():
-          self._make_test_function()
-          val_f = self.test_function
-        else:
-          val_f = None
-        callback_metrics = copy.copy(out_labels) + [
-            'val_' + n for n in out_labels
-        ]
-      else:
-        val_f = None
-        callback_metrics = copy.copy(out_labels)
-
-      # Delegate logic to `_fit_loop`.
-      return self._fit_loop(
-          f,
-          ins,
-          out_labels=out_labels,
+      return training_arrays.fit_loop(
+          self, x, y,
+          sample_weights=sample_weights,
           batch_size=batch_size,
           epochs=epochs,
           verbose=verbose,
           callbacks=callbacks,
-          val_f=val_f,
-          val_ins=val_ins,
+          val_inputs=val_x,
+          val_targets=val_y,
+          val_sample_weights=val_sample_weights,
           shuffle=shuffle,
-          callback_metrics=callback_metrics,
           initial_epoch=initial_epoch,
           steps_per_epoch=steps_per_epoch,
           validation_steps=validation_steps)
@@ -2229,20 +1288,15 @@ class Model(Network):
         y,
         sample_weight=sample_weight,
         batch_size=batch_size)
-    # Prepare inputs, delegate logic to `_test_loop`.
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + y + sample_weights + [0.]
-    else:
-      ins = x + y + sample_weights
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return training_eager.test_loop(
-          self, ins, batch_size=batch_size, verbose=verbose, steps=steps)
+          self, inputs=x, targets=y, sample_weights=sample_weights,
+          batch_size=batch_size, verbose=verbose, steps=steps)
     else:
-      self._make_test_function()
-      f = self.test_function
-      return self._test_loop(
-          f, ins, batch_size=batch_size, verbose=verbose, steps=steps)
+      return training_arrays.test_loop(
+          self, inputs=x, targets=y, sample_weights=sample_weights,
+          batch_size=batch_size, verbose=verbose, steps=steps)
 
   def predict(self, x, batch_size=None, verbose=0, steps=None):
     """Generates output predictions for the input samples.
@@ -2276,21 +1330,12 @@ class Model(Network):
                        'argument.')
     x, _, _ = self._standardize_user_data(x)
 
-    # Prepare inputs, delegate logic to `_predict_loop`.
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + [0.]
-    else:
-      ins = x
-
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return training_eager.predict_loop(
-          self, ins, batch_size=batch_size, verbose=verbose, steps=steps)
+          self, x, batch_size=batch_size, verbose=verbose, steps=steps)
     else:
-      self._make_predict_function()
-      f = self.predict_function
-
-      return self._predict_loop(
-          f, ins, batch_size=batch_size, verbose=verbose, steps=steps)
+      return training_arrays.predict_loop(
+          self, x, batch_size=batch_size, verbose=verbose, steps=steps)
 
   def train_on_batch(self, x, y, sample_weight=None, class_weight=None):
     """Runs a single gradient update on a single batch of data.
@@ -2327,20 +1372,24 @@ class Model(Network):
         and/or metrics). The attribute `model.metrics_names` will give you
         the display labels for the scalar outputs.
 
+    Raises:
+      ValueError: In case of invalid user-provided arguments.
     """
     x, y, sample_weights = self._standardize_user_data(
         x,
         y,
         sample_weight=sample_weight,
         class_weight=class_weight)
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + y + sample_weights + [1.]
-    else:
-      ins = x + y + sample_weights
 
-    if context.in_eager_mode():
-      outputs = training_eager.train_on_batch(self, ins)
+    if context.executing_eagerly():
+      outputs = training_eager.train_on_batch(
+          self, x, y, sample_weights=sample_weights)
     else:
+      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
+        ins = x + y + sample_weights + [1]
+      else:
+        ins = x + y + sample_weights
+
       self._make_train_function()
       outputs = self.train_function(ins)
 
@@ -2377,18 +1426,19 @@ class Model(Network):
         the display labels for the scalar outputs.
 
     Raises:
-        ValueError: in case of invalid arguments.
+        ValueError: In case of invalid user-provided arguments.
     """
     x, y, sample_weights = self._standardize_user_data(
         x, y, sample_weight=sample_weight)
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + y + sample_weights + [0.]
-    else:
-      ins = x + y + sample_weights
 
-    if context.in_eager_mode():
-      outputs = training_eager.test_on_batch(self, ins)
+    if context.executing_eagerly():
+      outputs = training_eager.test_on_batch(
+          self, x, y, sample_weights=sample_weights)
     else:
+      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
+        ins = x + y + sample_weights + [0]
+      else:
+        ins = x + y + sample_weights
       self._make_test_function()
       outputs = self.test_function(ins)
 
@@ -2408,26 +1458,19 @@ class Model(Network):
     """
     x, _, _ = self._standardize_user_data(x)
 
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + [0.]
-    else:
-      ins = x
-
-    if context.in_eager_mode():
-      ins_batch_converted = []
-      for ib in ins:
-        ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
-
-      eager_model_inputs = []
-      for i in range(len(self.inputs)):
-        eager_model_inputs.append(ins_batch_converted[i])
+    if context.executing_eagerly():
+      inputs = [ops.convert_to_tensor(val, dtype=K.floatx()) for val in x]
+      return self(inputs)  # pylint: disable=not-callable
 
-      outs = self(eager_model_inputs)  # pylint: disable=not-callable
-      return outs
+    if not context.executing_eagerly():
+      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
+        ins = x + [0]
+      else:
+        ins = x
 
-    if context.in_graph_mode():
       self._make_predict_function()
       outputs = self.predict_function(ins)
+
       if len(outputs) == 1:
         return outputs[0]
       return outputs
@@ -2499,20 +1542,19 @@ class Model(Network):
         max_queue_size: Integer. Maximum size for the generator queue.
             If unspecified, `max_queue_size` will default to 10.
         workers: Integer. Maximum number of processes to spin up
-            when using process based threading.
+            when using process-based threading.
             If unspecified, `workers` will default to 1. If 0, will
             execute the generator on the main thread.
-        use_multiprocessing: Boolean. If True, use process based threading.
-            If unspecified, `workers` will default to False.
-            Note that because
-            this implementation relies on multiprocessing,
-            you should not pass
-            non picklable arguments to the generator
-            as they can't be passed
-            easily to children processes.
-        shuffle: Whether to shuffle the order of the batches at
+        use_multiprocessing: Boolean.
+            If `True`, use process-based threading.
+            If unspecified, `use_multiprocessing` will default to `False`.
+            Note that because this implementation relies on multiprocessing,
+            you should not pass non-picklable arguments to the generator
+            as they can't be passed easily to child processes.
+        shuffle: Boolean. Whether to shuffle the order of the batches at
             the beginning of each epoch. Only used with instances
-            of `Sequence` (keras.utils.Sequence).
+            of `Sequence` (`keras.utils.Sequence`).
+            Has no effect when `steps_per_epoch` is not `None`.
         initial_epoch: Epoch at which to start training
             (useful for resuming a previous training run)
 
@@ -2543,213 +1585,21 @@ class Model(Network):
       raise NotImplementedError(
           '`fit_generator` is not yet enabled for Model subclasses')
 
-    wait_time = 0.01  # in seconds
-    epoch = initial_epoch
-
-    do_validation = bool(validation_data)
-    self._make_train_function()
-    if do_validation:
-      self._make_test_function()
-
-    is_sequence = isinstance(generator, Sequence)
-    if not is_sequence and use_multiprocessing and workers > 1:
-      logging.warning(
-          UserWarning('Using a generator with `use_multiprocessing=True`'
-                      ' and multiple workers may duplicate your data.'
-                      ' Please consider using the`keras.utils.Sequence'
-                      ' class.'))
-    if steps_per_epoch is None:
-      if is_sequence:
-        steps_per_epoch = len(generator)
-      else:
-        raise ValueError('`steps_per_epoch=None` is only valid for a'
-                         ' generator based on the `keras.utils.Sequence`'
-                         ' class. Please specify `steps_per_epoch` or use'
-                         ' the `keras.utils.Sequence` class.')
-
-    # python 2 has 'next', 3 has '__next__'
-    # avoid any explicit version checks
-    val_gen = (
-        hasattr(validation_data, 'next') or
-        hasattr(validation_data, '__next__') or
-        isinstance(validation_data, Sequence))
-    if (val_gen and not isinstance(validation_data, Sequence) and
-        not validation_steps):
-      raise ValueError('`validation_steps=None` is only valid for a'
-                       ' generator based on the `keras.utils.Sequence`'
-                       ' class. Please specify `validation_steps` or use'
-                       ' the `keras.utils.Sequence` class.')
-
-    # Prepare display labels.
-    out_labels = self.metrics_names
-    callback_metrics = out_labels + ['val_%s' % n for n in out_labels]
-
-    # prepare callbacks
-    self.history = cbks.History()
-    callbacks = [cbks.BaseLogger()] + (callbacks or []) + [self.history]
-    if verbose:
-      callbacks += [cbks.ProgbarLogger(count_mode='steps')]
-    callbacks = cbks.CallbackList(callbacks)
-
-    # it's possible to callback a different model than self:
-    if hasattr(self, 'callback_model') and self.callback_model:
-      callback_model = self.callback_model
-    else:
-      callback_model = self
-    callbacks.set_model(callback_model)
-    callbacks.set_params({
-        'epochs': epochs,
-        'steps': steps_per_epoch,
-        'verbose': verbose,
-        'do_validation': do_validation,
-        'metrics': callback_metrics,
-    })
-    callbacks.on_train_begin()
-
-    enqueuer = None
-    val_enqueuer = None
-
-    try:
-      if do_validation:
-        if val_gen:
-          if workers > 0:
-            if isinstance(validation_data, Sequence):
-              val_enqueuer = OrderedEnqueuer(
-                  validation_data, use_multiprocessing=use_multiprocessing)
-              if validation_steps is None:
-                validation_steps = len(validation_data)
-            else:
-              val_enqueuer = GeneratorEnqueuer(
-                  validation_data,
-                  use_multiprocessing=use_multiprocessing,
-                  wait_time=wait_time)
-            val_enqueuer.start(workers=workers, max_queue_size=max_queue_size)
-            validation_generator = val_enqueuer.get()
-          else:
-            validation_generator = validation_data
-        else:
-          if len(validation_data) == 2:
-            val_x, val_y = validation_data  # pylint: disable=unpacking-non-sequence
-            val_sample_weight = None
-          elif len(validation_data) == 3:
-            val_x, val_y, val_sample_weight = validation_data  # pylint: disable=unpacking-non-sequence
-          else:
-            raise ValueError(
-                '`validation_data` should be a tuple '
-                '`(val_x, val_y, val_sample_weight)` '
-                'or `(val_x, val_y)`. Found: ' + str(validation_data))
-          val_x, val_y, val_sample_weights = self._standardize_user_data(
-              val_x, val_y, val_sample_weight)
-          val_data = val_x + val_y + val_sample_weights
-          if self.uses_learning_phase and not isinstance(
-              K.learning_phase(), int):
-            val_data += [0.]
-          for cbk in callbacks:
-            cbk.validation_data = val_data
-
-      if workers > 0:
-        if is_sequence:
-          enqueuer = OrderedEnqueuer(
-              generator,
-              use_multiprocessing=use_multiprocessing,
-              shuffle=shuffle)
-        else:
-          enqueuer = GeneratorEnqueuer(
-              generator,
-              use_multiprocessing=use_multiprocessing,
-              wait_time=wait_time)
-        enqueuer.start(workers=workers, max_queue_size=max_queue_size)
-        output_generator = enqueuer.get()
-      else:
-        output_generator = generator
-
-      callback_model.stop_training = False
-      # Construct epoch logs.
-      epoch_logs = {}
-      while epoch < epochs:
-        callbacks.on_epoch_begin(epoch)
-        steps_done = 0
-        batch_index = 0
-        while steps_done < steps_per_epoch:
-          generator_output = next(output_generator)
-
-          if not hasattr(generator_output, '__len__'):
-            raise ValueError('Output of generator should be '
-                             'a tuple `(x, y, sample_weight)` '
-                             'or `(x, y)`. Found: ' + str(generator_output))
-
-          if len(generator_output) == 2:
-            x, y = generator_output
-            sample_weight = None
-          elif len(generator_output) == 3:
-            x, y, sample_weight = generator_output
-          else:
-            raise ValueError('Output of generator should be '
-                             'a tuple `(x, y, sample_weight)` '
-                             'or `(x, y)`. Found: ' + str(generator_output))
-          # build batch logs
-          batch_logs = {}
-          if isinstance(x, list):
-            batch_size = x[0].shape[0]
-          elif isinstance(x, dict):
-            batch_size = list(x.values())[0].shape[0]
-          else:
-            batch_size = x.shape[0]
-          batch_logs['batch'] = batch_index
-          batch_logs['size'] = batch_size
-          callbacks.on_batch_begin(batch_index, batch_logs)
-
-          outs = self.train_on_batch(
-              x, y, sample_weight=sample_weight, class_weight=class_weight)
-
-          if not isinstance(outs, list):
-            outs = [outs]
-          for l, o in zip(out_labels, outs):
-            batch_logs[l] = o
-
-          callbacks.on_batch_end(batch_index, batch_logs)
-
-          batch_index += 1
-          steps_done += 1
-
-          # Epoch finished.
-          if steps_done >= steps_per_epoch and do_validation:
-            if val_gen:
-              val_outs = self.evaluate_generator(
-                  validation_generator, validation_steps, workers=0)
-            else:
-              # No need for try/except because
-              # data has already been validated.
-              val_outs = self.evaluate(
-                  val_x,
-                  val_y,
-                  batch_size=batch_size,
-                  sample_weight=val_sample_weights,
-                  verbose=0)
-            if not isinstance(val_outs, list):
-              val_outs = [val_outs]
-            # Same labels assumed.
-            for l, o in zip(out_labels, val_outs):
-              epoch_logs['val_' + l] = o
-
-          if callback_model.stop_training:
-            break
-
-        callbacks.on_epoch_end(epoch, epoch_logs)
-        epoch += 1
-        if callback_model.stop_training:
-          break
-
-    finally:
-      try:
-        if enqueuer is not None:
-          enqueuer.stop()
-      finally:
-        if val_enqueuer is not None:
-          val_enqueuer.stop()
-
-    callbacks.on_train_end()
-    return self.history
+    return training_generator.fit_generator(
+        self,
+        generator,
+        steps_per_epoch=steps_per_epoch,
+        epochs=epochs,
+        verbose=verbose,
+        callbacks=callbacks,
+        validation_data=validation_data,
+        validation_steps=validation_steps,
+        class_weight=class_weight,
+        max_queue_size=max_queue_size,
+        workers=workers,
+        use_multiprocessing=use_multiprocessing,
+        shuffle=shuffle,
+        initial_epoch=initial_epoch)
 
   def evaluate_generator(self,
                          generator,
@@ -2774,16 +1624,15 @@ class Model(Network):
             the `len(generator)` as a number of steps.
         max_queue_size: maximum size for the generator queue
         workers: Integer. Maximum number of processes to spin up
-            when using process based threading.
+            when using process-based threading.
             If unspecified, `workers` will default to 1. If 0, will
             execute the generator on the main thread.
-        use_multiprocessing: if True, use process based threading.
-            Note that because
-            this implementation relies on multiprocessing,
-            you should not pass
-            non picklable arguments to the generator
-            as they can't be passed
-            easily to children processes.
+        use_multiprocessing: Boolean.
+            If `True`, use process-based threading.
+            If unspecified, `use_multiprocessing` will default to `False`.
+            Note that because this implementation relies on multiprocessing,
+            you should not pass non-picklable arguments to the generator
+            as they can't be passed easily to child processes.
 
     Returns:
         Scalar test loss (if the model has a single output and no metrics)
@@ -2802,87 +1651,13 @@ class Model(Network):
       raise NotImplementedError(
           '`evaluate_generator` is not yet enabled for Model subclasses')
 
-    self._make_test_function()
-
-    steps_done = 0
-    wait_time = 0.01
-    all_outs = []
-    batch_sizes = []
-    is_sequence = isinstance(generator, Sequence)
-    if not is_sequence and use_multiprocessing and workers > 1:
-      logging.warning(
-          UserWarning('Using a generator with `use_multiprocessing=True`'
-                      ' and multiple workers may duplicate your data.'
-                      ' Please consider using the`keras.utils.Sequence'
-                      ' class.'))
-    if steps is None:
-      if is_sequence:
-        steps = len(generator)
-      else:
-        raise ValueError('`steps=None` is only valid for a generator'
-                         ' based on the `keras.utils.Sequence` class.'
-                         ' Please specify `steps` or use the'
-                         ' `keras.utils.Sequence` class.')
-    enqueuer = None
-
-    try:
-      if workers > 0:
-        if is_sequence:
-          enqueuer = OrderedEnqueuer(
-              generator, use_multiprocessing=use_multiprocessing)
-        else:
-          enqueuer = GeneratorEnqueuer(
-              generator,
-              use_multiprocessing=use_multiprocessing,
-              wait_time=wait_time)
-        enqueuer.start(workers=workers, max_queue_size=max_queue_size)
-        output_generator = enqueuer.get()
-      else:
-        output_generator = generator
-
-      while steps_done < steps:
-        generator_output = next(output_generator)
-        if not hasattr(generator_output, '__len__'):
-          raise ValueError('Output of generator should be a tuple '
-                           '(x, y, sample_weight) '
-                           'or (x, y). Found: ' + str(generator_output))
-        if len(generator_output) == 2:
-          x, y = generator_output
-          sample_weight = None
-        elif len(generator_output) == 3:
-          x, y, sample_weight = generator_output
-        else:
-          raise ValueError('Output of generator should be a tuple '
-                           '(x, y, sample_weight) '
-                           'or (x, y). Found: ' + str(generator_output))
-        outs = self.test_on_batch(x, y, sample_weight=sample_weight)
-
-        if isinstance(x, list):
-          batch_size = x[0].shape[0]
-        elif isinstance(x, dict):
-          batch_size = list(x.values())[0].shape[0]
-        else:
-          batch_size = x.shape[0]
-        if batch_size == 0:
-          raise ValueError('Received an empty batch. '
-                           'Batches should at least contain one item.')
-        all_outs.append(outs)
-
-        steps_done += 1
-        batch_sizes.append(batch_size)
-
-    finally:
-      if enqueuer is not None:
-        enqueuer.stop()
-
-    if not isinstance(outs, list):
-      return np.average(np.asarray(all_outs), weights=batch_sizes)
-    else:
-      averages = []
-      for i in range(len(outs)):
-        averages.append(
-            np.average([out[i] for out in all_outs], weights=batch_sizes))
-      return averages
+    return training_generator.evaluate_generator(
+        self,
+        generator,
+        steps=steps,
+        max_queue_size=max_queue_size,
+        workers=workers,
+        use_multiprocessing=use_multiprocessing)
 
   def predict_generator(self,
                         generator,
@@ -2907,16 +1682,15 @@ class Model(Network):
             the `len(generator)` as a number of steps.
         max_queue_size: Maximum size for the generator queue.
         workers: Integer. Maximum number of processes to spin up
-            when using process based threading.
+            when using process-based threading.
             If unspecified, `workers` will default to 1. If 0, will
             execute the generator on the main thread.
-        use_multiprocessing: If `True`, use process based threading.
-            Note that because
-            this implementation relies on multiprocessing,
-            you should not pass
-            non picklable arguments to the generator
-            as they can't be passed
-            easily to children processes.
+        use_multiprocessing: Boolean.
+            If `True`, use process-based threading.
+            If unspecified, `use_multiprocessing` will default to `False`.
+            Note that because this implementation relies on multiprocessing,
+            you should not pass non-picklable arguments to the generator
+            as they can't be passed easily to child processes.
         verbose: verbosity mode, 0 or 1.
 
     Returns:
@@ -2930,88 +1704,11 @@ class Model(Network):
       raise NotImplementedError(
           '`predict_generator` is not yet enabled for Model subclasses')
 
-    self._make_predict_function()
-
-    steps_done = 0
-    wait_time = 0.01
-    all_outs = []
-    is_sequence = isinstance(generator, Sequence)
-    if not is_sequence and use_multiprocessing and workers > 1:
-      logging.warning(
-          UserWarning('Using a generator with `use_multiprocessing=True`'
-                      ' and multiple workers may duplicate your data.'
-                      ' Please consider using the`keras.utils.Sequence'
-                      ' class.'))
-    if steps is None:
-      if is_sequence:
-        steps = len(generator)
-      else:
-        raise ValueError('`steps=None` is only valid for a generator'
-                         ' based on the `keras.utils.Sequence` class.'
-                         ' Please specify `steps` or use the'
-                         ' `keras.utils.Sequence` class.')
-    enqueuer = None
-
-    try:
-      if workers > 0:
-        if is_sequence:
-          enqueuer = OrderedEnqueuer(
-              generator, use_multiprocessing=use_multiprocessing)
-        else:
-          enqueuer = GeneratorEnqueuer(
-              generator,
-              use_multiprocessing=use_multiprocessing,
-              wait_time=wait_time)
-        enqueuer.start(workers=workers, max_queue_size=max_queue_size)
-        output_generator = enqueuer.get()
-      else:
-        output_generator = generator
-
-      if verbose == 1:
-        progbar = Progbar(target=steps)
-
-      while steps_done < steps:
-        generator_output = next(output_generator)
-        if isinstance(generator_output, tuple):
-          # Compatibility with the generators
-          # used for training.
-          if len(generator_output) == 2:
-            x, _ = generator_output
-          elif len(generator_output) == 3:
-            x, _, _ = generator_output
-          else:
-            raise ValueError('Output of generator should be '
-                             'a tuple `(x, y, sample_weight)` '
-                             'or `(x, y)`. Found: ' + str(generator_output))
-        else:
-          # Assumes a generator that only
-          # yields inputs (not targets and sample weights).
-          x = generator_output
-
-        outs = self.predict_on_batch(x)
-        if not isinstance(outs, list):
-          outs = [outs]
-
-        if not all_outs:
-          for out in outs:
-            all_outs.append([])
-
-        for i, out in enumerate(outs):
-          all_outs[i].append(out)
-        steps_done += 1
-        if verbose == 1:
-          progbar.update(steps_done)
-
-    finally:
-      if enqueuer is not None:
-        enqueuer.stop()
-
-    if len(all_outs) == 1:
-      if steps_done == 1:
-        return all_outs[0][0]
-      else:
-        return np.concatenate(all_outs[0])
-    if steps_done == 1:
-      return [out[0] for out in all_outs]
-    else:
-      return [np.concatenate(out) for out in all_outs]
+    return training_generator.predict_generator(
+        self,
+        generator,
+        steps=steps,
+        max_queue_size=max_queue_size,
+        workers=workers,
+        use_multiprocessing=use_multiprocessing,
+        verbose=verbose)
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_arrays.py b/tensorflow/python/keras/_impl/keras/engine/training_arrays.py
new file mode 100644
index 0000000000000000000000000000000000000000..18116e3a14d6b1365f1a9db1a23243cd07763a62
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/training_arrays.py
@@ -0,0 +1,488 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Part of the Keras training engine related to plain array data.
+"""
+# pylint: disable=protected-access
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import copy
+
+import numpy as np
+
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras import callbacks as cbks
+from tensorflow.python.keras._impl.keras.engine import training_utils
+from tensorflow.python.keras._impl.keras.engine.base_layer import Layer
+from tensorflow.python.keras._impl.keras.utils.generic_utils import make_batches
+from tensorflow.python.keras._impl.keras.utils.generic_utils import Progbar
+from tensorflow.python.keras._impl.keras.utils.generic_utils import slice_arrays
+
+try:
+  from scipy.sparse import issparse  # pylint: disable=g-import-not-at-top
+except ImportError:
+  issparse = None
+
+
+def fit_loop(model,
+             inputs,
+             targets,
+             sample_weights=None,
+             batch_size=None,
+             epochs=100,
+             verbose=1,
+             callbacks=None,
+             val_inputs=None,
+             val_targets=None,
+             val_sample_weights=None,
+             shuffle=True,
+             callback_metrics=None,
+             initial_epoch=0,
+             steps_per_epoch=None,
+             validation_steps=None):
+  """Abstract fit function for arrays of data.
+
+  Arguments:
+      model: Keras Model instance.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
+      batch_size: Integer batch size or None if unknown.
+      epochs: Number of times to iterate over the data
+      verbose: Verbosity mode, 0, 1 or 2
+      callbacks: List of callbacks to be called during training
+      val_inputs: List of input arrays.
+      val_targets: List of target arrays.
+      val_sample_weights: Optional list of sample weight arrays.
+      shuffle: Whether to shuffle the data at the beginning of each epoch
+      callback_metrics: List of strings, the display names of the metrics
+          passed to the callbacks. They should be the
+          concatenation of the list of display names of the outputs of
+          `f` and the list of display names of the outputs of `f_val`.
+      initial_epoch: Epoch at which to start training
+          (useful for resuming a previous training run)
+      steps_per_epoch: Total number of steps (batches of samples)
+          before declaring one epoch finished and starting the
+          next epoch. Ignored with the default value of `None`.
+      validation_steps: Number of steps to run validation for
+          (only if doing validation from data tensors).
+          Ignored with the default value of `None`.
+
+  Returns:
+      `History` object.
+
+  Raises:
+      ValueError: in case of invalid arguments.
+  """
+  model._make_train_function()
+  f = model.train_function
+
+  sample_weights = sample_weights or []
+  val_sample_weights = val_sample_weights or []
+  if model.uses_learning_phase and not isinstance(K.learning_phase(), int):
+    ins = inputs + targets + sample_weights + [1]
+    if val_inputs:
+      val_ins = val_inputs + val_targets + val_sample_weights + [1]
+  else:
+    ins = inputs + targets + sample_weights
+    if val_inputs:
+      val_ins = val_inputs + val_targets + val_sample_weights
+  if not val_inputs:
+    val_ins = []
+
+  do_validation = False
+  if val_inputs:
+    do_validation = True
+    if verbose and inputs and hasattr(inputs[0], 'shape') and hasattr(
+        val_inputs[0], 'shape'):
+      print('Train on %d samples, validate on %d samples' %
+            (inputs[0].shape[0], val_inputs[0].shape[0]))
+  if validation_steps:
+    do_validation = True
+    if steps_per_epoch is None:
+      raise ValueError('Can only use `validation_steps` '
+                       'when doing step-wise '
+                       'training, i.e. `steps_per_epoch` '
+                       'must be set.')
+
+  out_labels = model.metrics_names
+  if do_validation:
+    callback_metrics = copy.copy(out_labels) + [
+        'val_' + n for n in out_labels
+    ]
+  else:
+    callback_metrics = copy.copy(out_labels)
+
+  num_train_samples = training_utils.check_num_samples(
+      ins, batch_size, steps_per_epoch, 'steps_per_epoch')
+  if num_train_samples is not None:
+    index_array = np.arange(num_train_samples)
+
+  model.history = cbks.History()
+  all_callbacks = [cbks.BaseLogger(
+      stateful_metrics=model.stateful_metric_names)]
+  if verbose:
+    if steps_per_epoch is not None:
+      count_mode = 'steps'
+    else:
+      count_mode = 'samples'
+    all_callbacks.append(
+        cbks.ProgbarLogger(
+            count_mode, stateful_metrics=model.stateful_metric_names))
+  all_callbacks += (callbacks or []) + [model.history]
+  callbacks = cbks.CallbackList(all_callbacks)
+  out_labels = out_labels or []
+
+  # It's possible to run callbacks on a different model than `model`
+  # (used by Sequential models)
+  if hasattr(model, 'callback_model') and model.callback_model:
+    callback_model = model.callback_model
+  else:
+    callback_model = model
+
+  callbacks.set_model(callback_model)
+
+  callbacks.set_params({
+      'batch_size': batch_size,
+      'epochs': epochs,
+      'steps': steps_per_epoch,
+      'samples': num_train_samples,
+      'verbose': verbose,
+      'do_validation': do_validation,
+      'metrics': callback_metrics or [],
+  })
+  callbacks.on_train_begin()
+  callback_model.stop_training = False
+  for cbk in callbacks:
+    cbk.validation_data = val_ins
+
+  # To prevent a slowdown, we find beforehand the arrays that need conversion.
+  feed = model._feed_inputs + model._feed_targets + model._feed_sample_weights
+  indices_for_conversion_to_dense = []
+  for i in range(len(feed)):
+    if issparse is not None and issparse(ins[i]) and not K.is_sparse(feed[i]):
+      indices_for_conversion_to_dense.append(i)
+
+  for epoch in range(initial_epoch, epochs):
+    # Reset stateful metrics
+    for m in model.metrics:
+      if isinstance(m, Layer):
+        m.reset_states()
+    # Update callbacks
+    callbacks.on_epoch_begin(epoch)
+    epoch_logs = {}
+    if steps_per_epoch is not None:
+      for step_index in range(steps_per_epoch):
+        batch_logs = {}
+        batch_logs['batch'] = step_index
+        batch_logs['size'] = 1
+        callbacks.on_batch_begin(step_index, batch_logs)
+        outs = f(ins)
+
+        if not isinstance(outs, list):
+          outs = [outs]
+        for l, o in zip(out_labels, outs):
+          batch_logs[l] = o
+
+        callbacks.on_batch_end(step_index, batch_logs)
+        if callback_model.stop_training:
+          break
+
+      if do_validation:
+        val_outs = test_loop(
+            model,
+            val_inputs,
+            val_targets,
+            sample_weights=val_sample_weights,
+            batch_size=batch_size,
+            steps=validation_steps,
+            verbose=0)
+        if not isinstance(val_outs, list):
+          val_outs = [val_outs]
+        # Same labels assumed.
+        for l, o in zip(out_labels, val_outs):
+          epoch_logs['val_' + l] = o
+    else:
+      if shuffle == 'batch':
+        index_array = training_utils.batch_shuffle(index_array, batch_size)
+      elif shuffle:
+        np.random.shuffle(index_array)
+
+      batches = make_batches(num_train_samples, batch_size)
+
+      for batch_index, (batch_start, batch_end) in enumerate(batches):
+        batch_ids = index_array[batch_start:batch_end]
+        try:
+          if isinstance(ins[-1], int):
+            # Do not slice the training phase flag.
+            ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
+          else:
+            ins_batch = slice_arrays(ins, batch_ids)
+        except TypeError:
+          raise TypeError('TypeError while preparing batch. '
+                          'If using HDF5 input data, '
+                          'pass shuffle="batch".')
+        batch_logs = {}
+        batch_logs['batch'] = batch_index
+        batch_logs['size'] = len(batch_ids)
+        callbacks.on_batch_begin(batch_index, batch_logs)
+        for i in indices_for_conversion_to_dense:
+          ins_batch[i] = ins_batch[i].toarray()
+
+        outs = f(ins_batch)
+        if not isinstance(outs, list):
+          outs = [outs]
+        for l, o in zip(out_labels, outs):
+          batch_logs[l] = o
+
+        callbacks.on_batch_end(batch_index, batch_logs)
+        if callback_model.stop_training:
+          break
+
+        if batch_index == len(batches) - 1:  # Last batch.
+          if do_validation:
+            val_outs = test_loop(
+                model,
+                val_inputs,
+                val_targets,
+                sample_weights=val_sample_weights,
+                batch_size=batch_size,
+                verbose=0)
+            if not isinstance(val_outs, list):
+              val_outs = [val_outs]
+            # Same labels assumed.
+            for l, o in zip(out_labels, val_outs):
+              epoch_logs['val_' + l] = o
+    callbacks.on_epoch_end(epoch, epoch_logs)
+    if callback_model.stop_training:
+      break
+  callbacks.on_train_end()
+  return model.history
+
+
+def predict_loop(model, inputs, batch_size=32, verbose=0, steps=None):
+  """Abstract method to loop over some data in batches.
+
+  Arguments:
+      model: Keras Model instance.
+      inputs: list of tensors to be fed to `f`.
+      batch_size: integer batch size.
+      verbose: verbosity mode.
+      steps: Total number of steps (batches of samples)
+          before declaring `predict_loop` finished.
+          Ignored with the default value of `None`.
+
+  Returns:
+      Array of predictions (if the model has a single output)
+      or list of arrays of predictions
+      (if the model has multiple outputs).
+  """
+  model._make_predict_function()
+  f = model.predict_function
+
+  if model.uses_learning_phase and not isinstance(K.learning_phase(), int):
+    ins = inputs + [0]
+  else:
+    ins = inputs
+
+  num_samples = training_utils.check_num_samples(
+      inputs, batch_size, steps, 'steps')
+  if verbose == 1:
+    if steps is not None:
+      progbar = Progbar(target=steps)
+    else:
+      progbar = Progbar(target=num_samples)
+
+  indices_for_conversion_to_dense = []
+  for i in range(len(model._feed_inputs)):
+    if (issparse is not None and issparse(inputs[i]) and
+        not K.is_sparse(model._feed_inputs[i])):
+      indices_for_conversion_to_dense.append(i)
+
+  if steps is not None:
+    # Step-based predictions.
+    # Since we do not know how many samples
+    # we will see, we cannot pre-allocate
+    # the returned Numpy arrays.
+    # Instead, we store one array per batch seen
+    # and concatenate them upon returning.
+    unconcatenated_outs = []
+    for step in range(steps):
+      batch_outs = f(ins)
+      if not isinstance(batch_outs, list):
+        batch_outs = [batch_outs]
+      if step == 0:
+        for batch_out in batch_outs:
+          unconcatenated_outs.append([])
+      for i, batch_out in enumerate(batch_outs):
+        unconcatenated_outs[i].append(batch_out)
+      if verbose == 1:
+        progbar.update(step + 1)
+    if len(unconcatenated_outs) == 1:
+      return np.concatenate(unconcatenated_outs[0], axis=0)
+    return [
+        np.concatenate(unconcatenated_outs[i], axis=0)
+        for i in range(len(unconcatenated_outs))
+    ]
+  else:
+    # Sample-based predictions.
+    outs = []
+    batches = make_batches(num_samples, batch_size)
+    index_array = np.arange(num_samples)
+    for batch_index, (batch_start, batch_end) in enumerate(batches):
+      batch_ids = index_array[batch_start:batch_end]
+      if ins and isinstance(ins[-1], int):
+        # Do not slice the training phase flag.
+        ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
+      else:
+        ins_batch = slice_arrays(ins, batch_ids)
+      for i in indices_for_conversion_to_dense:
+        ins_batch[i] = ins_batch[i].toarray()
+
+      batch_outs = f(ins_batch)
+      if not isinstance(batch_outs, list):
+        batch_outs = [batch_outs]
+      if batch_index == 0:
+        # Pre-allocate the results arrays.
+        for batch_out in batch_outs:
+          shape = (num_samples,) + batch_out.shape[1:]
+          outs.append(np.zeros(shape, dtype=batch_out.dtype))
+      for i, batch_out in enumerate(batch_outs):
+        outs[i][batch_start:batch_end] = batch_out
+      if verbose == 1:
+        progbar.update(batch_end)
+    if len(outs) == 1:
+      return outs[0]
+    return outs
+
+
+def test_loop(model, inputs, targets,
+              sample_weights=None,
+              batch_size=None,
+              verbose=0,
+              steps=None):
+  """Abstract method to loop over some data in batches.
+
+  Arguments:
+      model: Keras Model instance.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
+      batch_size: integer batch size or `None`.
+      verbose: verbosity mode.
+      steps: Total number of steps (batches of samples)
+          before declaring evaluation finished.
+          Ignored with the default value of `None`.
+
+  Returns:
+      Scalar loss (if the model has a single output and no metrics)
+      or list of scalars (if the model has multiple outputs
+      and/or metrics). The attribute `model.metrics_names` will give you
+      the display labels for the scalar outputs.
+  """
+  model._make_test_function()
+  f = model.test_function
+
+  sample_weights = sample_weights or []
+  if model.uses_learning_phase and not isinstance(K.learning_phase(), int):
+    ins = inputs + targets + sample_weights + [0]
+  else:
+    ins = inputs + targets + sample_weights
+
+  if hasattr(model, 'metrics'):
+    for m in model.metrics:
+      if isinstance(m, Layer):
+        m.reset_states()
+    stateful_metric_indices = [
+        i for i, name in enumerate(model.metrics_names)
+        if str(name) in model.stateful_metric_names
+    ]
+  else:
+    stateful_metric_indices = []
+
+  num_samples = training_utils.check_num_samples(
+      ins, batch_size, steps, 'steps')
+  outs = []
+  if verbose == 1:
+    if steps is not None:
+      progbar = Progbar(target=steps)
+    else:
+      progbar = Progbar(target=num_samples)
+
+  # To prevent a slowdown, we find beforehand the arrays that need conversion.
+  feed = model._feed_inputs + model._feed_targets + model._feed_sample_weights
+  indices_for_conversion_to_dense = []
+  for i in range(len(feed)):
+    if issparse is not None and issparse(ins[i]) and not K.is_sparse(feed[i]):
+      indices_for_conversion_to_dense.append(i)
+
+  if steps is not None:
+    for step in range(steps):
+      batch_outs = f(ins)
+      if isinstance(batch_outs, list):
+        if step == 0:
+          for _ in enumerate(batch_outs):
+            outs.append(0.)
+        for i, batch_out in enumerate(batch_outs):
+          if i in stateful_metric_indices:
+            outs[i] = batch_out
+          else:
+            outs[i] += batch_out
+      else:
+        if step == 0:
+          outs.append(0.)
+        outs[0] += batch_outs
+      if verbose == 1:
+        progbar.update(step + 1)
+    for i in range(len(outs)):
+      if i not in stateful_metric_indices:
+        outs[i] /= steps
+  else:
+    batches = make_batches(num_samples, batch_size)
+    index_array = np.arange(num_samples)
+    for batch_index, (batch_start, batch_end) in enumerate(batches):
+      batch_ids = index_array[batch_start:batch_end]
+      if isinstance(ins[-1], int):
+        # Do not slice the training phase flag.
+        ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
+      else:
+        ins_batch = slice_arrays(ins, batch_ids)
+      for i in indices_for_conversion_to_dense:
+        ins_batch[i] = ins_batch[i].toarray()
+
+      batch_outs = f(ins_batch)
+
+      if isinstance(batch_outs, list):
+        if batch_index == 0:
+          for _ in enumerate(batch_outs):
+            outs.append(0.)
+        for i, batch_out in enumerate(batch_outs):
+          if i in stateful_metric_indices:
+            outs[i] = batch_out
+          else:
+            outs[i] += batch_out * len(batch_ids)
+      else:
+        if batch_index == 0:
+          outs.append(0.)
+        outs[0] += batch_outs * len(batch_ids)
+      if verbose == 1:
+        progbar.update(batch_end)
+    for i in range(len(outs)):
+      if i not in stateful_metric_indices:
+        outs[i] /= num_samples
+  if len(outs) == 1:
+    return outs[0]
+  return outs
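Note on the sample-based branch of `test_loop` above: each batch result is multiplied by `len(batch_ids)` and the running total is divided by `num_samples`, so a short final batch does not skew the average. The sketch below illustrates that arithmetic only, assuming plain NumPy. `make_batches_sketch` and `weighted_mean_loss` are hypothetical names, not part of this change, and `np.mean` over a slice stands in for the call to `f(ins_batch)`.

```python
import numpy as np


def make_batches_sketch(size, batch_size):
  """Hypothetical stand-in: (start, end) index pairs covering `size` samples."""
  num_batches = int(np.ceil(size / float(batch_size)))
  return [(i * batch_size, min(size, (i + 1) * batch_size))
          for i in range(num_batches)]


def weighted_mean_loss(per_sample_loss, batch_size):
  """Averages per-batch mean losses, weighting each batch by its size."""
  num_samples = len(per_sample_loss)
  total = 0.
  for start, end in make_batches_sketch(num_samples, batch_size):
    # Stand-in for `f(ins_batch)`: the mean loss over this batch.
    batch_mean = np.mean(per_sample_loss[start:end])
    total += batch_mean * (end - start)
  return total / num_samples


losses = np.array([1., 2., 3., 4., 5.])
result = weighted_mean_loss(losses, batch_size=2)
```

With five samples and a batch size of 2, the last batch holds a single sample. Weighting by batch size recovers the mean over all samples, whereas an unweighted mean of the three batch means would not.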
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_eager.py b/tensorflow/python/keras/_impl/keras/engine/training_eager.py
index 282dd0dc0dbd440455cfade6952eda669aeaf2df..67858a578c5c95b3099e1e6713f3287748fc861f 100644
--- a/tensorflow/python/keras/_impl/keras/engine/training_eager.py
+++ b/tensorflow/python/keras/_impl/keras/engine/training_eager.py
@@ -12,20 +12,25 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Keras training and evaluation routines.
+"""Keras training and evaluation routines for eager execution.
 """
 # pylint: disable=protected-access
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+
+import copy
+
 import numpy as np
+
 from tensorflow.python.eager.backprop import GradientTape
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_util
-from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras import backend
 from tensorflow.python.keras._impl.keras import callbacks as cbks
 from tensorflow.python.keras._impl.keras import losses
 from tensorflow.python.keras._impl.keras import metrics as metrics_module
+from tensorflow.python.keras._impl.keras.engine import training_utils
 from tensorflow.python.keras._impl.keras.utils.generic_utils import make_batches
 from tensorflow.python.keras._impl.keras.utils.generic_utils import Progbar
 from tensorflow.python.keras._impl.keras.utils.generic_utils import slice_arrays
@@ -55,7 +60,7 @@ def _get_metrics_info(metric, internal_output_shapes=None, loss_func=None):
 
 
 def _eager_loss_fn(outputs, targets, loss_fn, output_name):
-  with K.name_scope(output_name + '_loss'):
+  with backend.name_scope(output_name + '_loss'):
     loss = loss_fn(targets, outputs)
   return loss
 
@@ -83,7 +88,7 @@ def _eager_metrics_fn(model, outputs, targets):
     output_metrics = model.nested_metrics[i]
     for nested_output_metric in output_metrics:
       metric_name, metric_fn = _get_metrics_info(
-          nested_output_metric, K.int_shape(model.outputs[i]),
+          nested_output_metric, backend.int_shape(model.outputs[i]),
           model.loss_functions[i])
 
       if len(model.output_names) > 1:
@@ -91,23 +96,23 @@ def _eager_metrics_fn(model, outputs, targets):
         if metric_name not in model.metrics_names:
           model.metrics_names.append(metric_name)
 
-      with K.name_scope(metric_name):
+      with backend.name_scope(metric_name):
         metric_result = metric_fn(outputs[i], targets[i])
         metric_names.append(metric_name)
-        metric_results.append(K.mean(metric_result))
+        metric_results.append(backend.mean(metric_result))
 
   return metric_names, metric_results
 
 
-def _model_loss(model, inputs, targets, training=False):
+def _model_loss(model, inputs, targets, sample_weights=None, training=False):
   """Calculates the loss for a given model.
 
   Arguments:
-     model: The model on which metrics are being calculated.
-     inputs: The inputs of the given model. This is typically the mini batch of
-              data that is fed to the model.
-     targets: The predictions or targets of the given model.
-     training: Whether the model should be run in inference or training mode.
+      model: The model on which metrics are being calculated.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
+      training: Whether the model should be run in inference or training mode.
 
   Returns:
      Returns the model output, total loss and loss value calculated using the
@@ -132,33 +137,22 @@ def _model_loss(model, inputs, targets, training=False):
     targets = [targets]
 
   loss_metrics = []
-  with K.name_scope('loss'):
+  with backend.name_scope('loss'):
     for i, loss_fn in enumerate(model.loss_functions):
-      # compute the loss
-      output_loss = _eager_loss_fn(outs[i], targets[i], loss_fn,
-                                   model.output_names[i])
-      loss_metrics.append(K.mean(output_loss))
+      if sample_weights:
+        weights = sample_weights[i]
+      else:
+        weights = None
 
+      # TODO(fchollet): support masking; in practice `_keras_mask` is never
+      # set in this context currently.
       mask = outs[i]._keras_mask
-      # adapted from weighted_loss_fn
-      if mask is not None:
-        # mask should have the same shape as output_loss
-        output_loss *= mask
-        #  the loss per batch should be proportional
-        #  to the number of unmasked samples.
-        output_loss /= K.mean(mask)
-
-      # adapted from weighted_loss_fn
-      # apply sample weighting
-      if model.sample_weights:
-        # reduce score_array to same ndim as weight array
-        ndim = K.ndim(output_loss)
-        weight_ndim = K.ndim(model.sample_weights)
-        output_loss = K.mean(output_loss, axis=list(range(weight_ndim, ndim)))
-        output_loss *= model.sample_weights
-        output_loss /= K.mean(K.cast(K.not_equal(model.sample_weights, 0),
-                                     K.floatx()))
-        output_loss = K.mean(output_loss)
+
+      weighted_masked_fn = training_utils.weighted_masked_objective(loss_fn)
+      with backend.name_scope(model.output_names[i] + '_loss'):
+        output_loss = weighted_masked_fn(
+            outs[i], targets[i], weights, mask=mask)
+      loss_metrics.append(backend.mean(output_loss))
 
       loss_weight = model.loss_weights_list[i]
       if total_loss is None:
@@ -166,7 +160,7 @@ def _model_loss(model, inputs, targets, training=False):
       else:
         total_loss += loss_weight * output_loss
 
-    total_loss = K.mean(total_loss)
+    total_loss = backend.mean(total_loss)
     # Add regularization losses
     custom_losses = []
     for layer in model.layers:
@@ -179,16 +173,20 @@ def _model_loss(model, inputs, targets, training=False):
   return outs, total_loss, loss_metrics
 
 
-def _process_single_batch(eager_model_inputs, eager_model_outputs, model,
+def _process_single_batch(model,
+                          inputs,
+                          targets,
+                          sample_weights=None,
                           training=False):
   """Calculate the loss and gradient for one input batch.
 
      The model weights are updated if training is set to True.
 
   Arguments:
-      eager_model_inputs: Input batch data.
-      eager_model_outputs: Output batch data.
       model: Model whose loss has to be calculated.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
       training: The boolean represents if the weights of the model are updated.
               'fit' methods will set this to True while 'evaluate' methods will
               set this to False.
@@ -199,81 +197,81 @@ def _process_single_batch(eager_model_inputs, eager_model_outputs, model,
   Raises:
       ValueError: If the model has no loss to optimize.
   """
-  K.set_learning_phase(training)
-  with GradientTape() as tape:
-    outs, loss, loss_metrics = _model_loss(model, eager_model_inputs,
-                                           eager_model_outputs,
-                                           training=training)
-    if loss is None:
-      raise ValueError('The model cannot be run '
-                       'because it has no loss to optimize.')
-  if training:
-    if not model._collected_trainable_weights:
-      logging.warning('The list of trainable weights is empty. Make sure that '
-                      'you are not setting model.trainable to False before '
-                      'compiling the model.')
-    else:
-      grads = tape.gradient(loss, model._collected_trainable_weights)
-      model.optimizer.apply_gradients(zip(grads,
-                                          model._collected_trainable_weights))
-  return outs, loss, loss_metrics
+  with backend.learning_phase_scope(1 if training else 0):
+    with GradientTape() as tape:
+      outs, loss, loss_metrics = _model_loss(model, inputs, targets,
+                                             sample_weights=sample_weights,
+                                             training=training)
+      if loss is None:
+        raise ValueError('The model cannot be run '
+                         'because it has no loss to optimize.')
+    if training:
+      if not model._collected_trainable_weights:
+        logging.warning('The list of trainable weights is empty. Make sure that'
+                        ' you are not setting model.trainable to False before '
+                        'compiling the model.')
+      else:
+        grads = tape.gradient(loss, model._collected_trainable_weights)
+        model.optimizer.apply_gradients(zip(grads,
+                                            model._collected_trainable_weights))
+    return outs, loss, loss_metrics
 
 
-def train_on_batch(model, ins):
+def train_on_batch(model, inputs, targets, sample_weights=None):
   """Calculates the loss and gradient updates for one input batch.
 
   Arguments:
-      model: Given model on which loss and gradients are calculated.
-      ins: Input and output batch numpy arrays.
+      model: Model whose loss has to be calculated.
+      inputs: Input batch data.
+      targets: Target batch data.
+      sample_weights: Sample weight batch data.
 
   Returns:
       total loss and the loss associated with each output.
   """
-  ins_batch_converted = []
-  for ib in ins:
-    ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
-  eager_model_inputs = []
-  eager_model_outputs = []
-  for i in range(len(model.inputs)):
-    eager_model_inputs.append(ins_batch_converted[i])
-  for i in range(len(model.inputs), len(ins_batch_converted)):
-    eager_model_outputs.append(ins_batch_converted[i])
+  inputs = [
+      ops.convert_to_tensor(val, dtype=backend.floatx()) for val in inputs]
+  targets = [
+      ops.convert_to_tensor(val, dtype=backend.floatx()) for val in targets]
+  sample_weights = None if sample_weights is None else [
+      ops.convert_to_tensor(val, dtype=backend.floatx())
+      if val is not None else None for val in sample_weights]
   outs, loss, _ = _process_single_batch(
-      eager_model_inputs, eager_model_outputs, model, training=True)
+      model, inputs, targets, sample_weights=sample_weights, training=True)
   if not isinstance(outs, list):
     outs = [outs]
   _, metrics_results = _eager_metrics_fn(
-      model, outs, eager_model_outputs)
+      model, outs, targets)
   if not isinstance(loss, list):
     loss = [loss]
   return loss + metrics_results
 
 
-def test_on_batch(model, ins):
+def test_on_batch(model, inputs, targets, sample_weights=None):
   """Calculates the loss for one input batch.
 
   Arguments:
-      model: Given model on which loss is calculated.
-      ins: Input and output batch numpy arrays.
+      model: Model whose loss has to be calculated.
+      inputs: Input batch data.
+      targets: Target batch data.
+      sample_weights: Sample weight batch data.
 
   Returns:
       total loss, loss and metrics associated with each output.
   """
-  ins_batch_converted = []
-  for ib in ins:
-    ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
-  eager_model_inputs = []
-  eager_model_outputs = []
-  for i in range(len(model.inputs)):
-    eager_model_inputs.append(ins_batch_converted[i])
-  for i in range(len(model.inputs), len(ins_batch_converted)):
-    eager_model_outputs.append(ins_batch_converted[i])
+  inputs = [
+      ops.convert_to_tensor(val, dtype=backend.floatx()) for val in inputs]
+  targets = [
+      ops.convert_to_tensor(val, dtype=backend.floatx()) for val in targets]
+  sample_weights = None if sample_weights is None else [
+      ops.convert_to_tensor(val, dtype=backend.floatx())
+      if val is not None else None for val in sample_weights]
   outs, loss, loss_metrics = _process_single_batch(
-      eager_model_inputs, eager_model_outputs, model, training=False)
+      model, inputs, targets, sample_weights=sample_weights, training=False)
   if not isinstance(outs, list):
     outs = [outs]
   metric_names, metrics_results = _eager_metrics_fn(
-      model, outs, eager_model_outputs)
+      model, outs, targets)
   model.metrics_names.append(metric_names)
   if not isinstance(loss, list):
     loss = [loss]
@@ -282,32 +280,35 @@ def test_on_batch(model, ins):
 
 def fit_loop(
     model,
-    ins,
-    out_labels=None,
+    inputs,
+    targets,
+    sample_weights=None,
+    val_inputs=None,
+    val_targets=None,
+    val_sample_weights=None,
     batch_size=None,
     epochs=100,
     verbose=1,
     callbacks=None,
-    val_ins=None,
     shuffle=True,
     callback_metrics=None,
     initial_epoch=0,
     steps_per_epoch=None,
     validation_steps=None):
-  """Abstract fit function for `f(ins)`.
-
-  Assume that f returns a list, labeled by out_labels.
+  """Abstract fit function for eager execution.
 
   Arguments:
       model: Instance of the model that is being executed in Eager mode.
-      ins: List of tensors to be fed to `f`
-      out_labels: List of strings, display names of
-          the outputs of `f`
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
+      val_inputs: Input data for validation.
+      val_targets: Target data for validation.
+      val_sample_weights: Sample weight data for validation.
       batch_size: Integer batch size or None if unknown.
       epochs: Number of times to iterate over the data
       verbose: Verbosity mode, 0, 1 or 2
       callbacks: List of callbacks to be called during training
-      val_ins: List of tensors to be fed to `val_f`
       shuffle: Whether to shuffle the data at the beginning of each epoch
       callback_metrics: List of strings, the display names of the metrics
           passed to the callbacks. They should be the
@@ -328,165 +329,196 @@ def fit_loop(
     ValueError: In case of invalid argument values.
   """
   # Required for Eager mode
-  K.set_learning_phase(True)
-
-  do_validation = False
-  if val_ins:
-    do_validation = True
-    if (verbose and ins and hasattr(ins[0], 'shape') and
-        hasattr(val_ins[0], 'shape')):
-      print('Train on %d samples, validate on %d samples' %
-            (ins[0].shape[0], val_ins[0].shape[0]))
-  if validation_steps:
-    if steps_per_epoch is None:
-      raise ValueError('Can only use `validation_steps` when doing step-wise '
-                       'training, i.e. `steps_per_epoch` must be set.')
-    do_validation = True
-
-  num_train_samples = model._check_num_samples(
-      ins, batch_size, steps_per_epoch, 'steps_per_epoch')
-
-  if num_train_samples is not None:
-    index_array = np.arange(num_train_samples)
-
-  model.history = cbks.History()
-  callbacks = [cbks.BaseLogger()] + (callbacks or []) + [model.history]
-  if verbose:
-    if steps_per_epoch is not None:
-      count_mode = 'steps'
+  with backend.learning_phase_scope(1):
+    do_validation = False
+    if val_inputs:
+      do_validation = True
+      if (verbose and inputs and hasattr(inputs[0], 'shape') and
+          hasattr(val_inputs[0], 'shape')):
+        print('Train on %d samples, validate on %d samples' %
+              (inputs[0].shape[0], val_inputs[0].shape[0]))
+    if validation_steps:
+      if steps_per_epoch is None:
+        raise ValueError('Can only use `validation_steps` when doing step-wise '
+                         'training, i.e. `steps_per_epoch` must be set.')
+      do_validation = True
+
+    out_labels = model.metrics_names
+    if do_validation:
+      callback_metrics = copy.copy(out_labels) + [
+          'val_' + n for n in out_labels
+      ]
     else:
-      count_mode = 'samples'
-    callbacks += [cbks.ProgbarLogger(count_mode)]
-  callbacks = cbks.CallbackList(callbacks)
-  out_labels = out_labels or []
-
-  # it's possible to callback a different model than self
-  # (used by Sequential models)
-  if hasattr(model, 'callback_model') and model.callback_model:
-    callback_model = model.callback_model
-  else:
-    callback_model = model
-
-  callbacks.set_model(callback_model)
-
-  callbacks.set_params({
-      'batch_size': batch_size,
-      'epochs': epochs,
-      'steps': steps_per_epoch,
-      'samples': num_train_samples,
-      'verbose': verbose,
-      'do_validation': do_validation,
-      'metrics': callback_metrics or [],
-  })
-  callbacks.on_train_begin()
-  callback_model.stop_training = False
-  for cbk in callbacks:
-    cbk.validation_data = val_ins
-
-  for epoch in range(initial_epoch, epochs):
-    callbacks.on_epoch_begin(epoch)
-    epoch_logs = {}
-    if shuffle == 'batch':
-      index_array = model._batch_shuffle(index_array, batch_size)
-    elif shuffle:
-      np.random.shuffle(index_array)
-
-    batches = make_batches(num_train_samples, batch_size)
+      callback_metrics = copy.copy(out_labels)
 
-    for batch_index, (batch_start, batch_end) in enumerate(batches):
-      batch_ids = index_array[batch_start:batch_end]
-      try:
-        if isinstance(ins[-1], float):
-          # Do not slice the training phase flag.
-          ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
-        else:
-          ins_batch = slice_arrays(ins, batch_ids)
-      except TypeError:
-        raise TypeError('TypeError while preparing batch. '
-                        'If using HDF5 input data, '
-                        'pass shuffle="batch".')
-      batch_logs = {}
-      batch_logs['batch'] = batch_index
-      batch_logs['size'] = len(batch_ids)
-
-      callbacks.on_batch_begin(batch_index, batch_logs)
-
-      ins_batch_converted = []
-      for ib in ins_batch:
-        ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
-      eager_model_inputs = []
-      eager_model_outputs = []
-      for i in range(len(model.inputs)):
-        eager_model_inputs.append(ins_batch_converted[i])
-
-      for i in range(len(model.inputs), len(ins_batch_converted)):
-        eager_model_outputs.append(ins_batch_converted[i])
-
-      outs, loss, loss_metrics = _process_single_batch(eager_model_inputs,
-                                                       eager_model_outputs,
-                                                       model,
-                                                       training=True)
-
-      if not isinstance(outs, list):
-        outs = [outs]
-
-      for l, o in zip(out_labels, outs):
-        batch_logs[l] = o
-      # Required for Eager mode
-      metrics_names, metrics_results = _eager_metrics_fn(model, outs,
-                                                         eager_model_outputs)
-      batch_logs['loss'] = tensor_util.constant_value(K.mean(loss))
-
-      # TODO(anjalisridhar): Move this to compile to avoid duplicate code.
-      # In graph mode we set the metric names in compile. However in
-      # Eager mode we calculate the metrics for each batch in fit_loop.
-      # We could calculate the metric names and functions in compile.
-      # This would avoid setting the callback parameters separately.
-      # We need to do this for the first iteration alone
-      for m in metrics_names:
-        if m not in callback_metrics:
-          callback_metrics.append(m)
-
-      callbacks.set_params({
-          'batch_size': batch_size,
-          'epochs': epochs,
-          'steps': steps_per_epoch,
-          'samples': num_train_samples,
-          'verbose': verbose,
-          'do_validation': do_validation,
-          'metrics': callback_metrics or [],
-      })
-
-      for k, v in zip(model.metrics_names,
-                      [K.mean(loss)] + loss_metrics + metrics_results):
-        batch_logs[k] = tensor_util.constant_value(v)
-
-      callbacks.on_batch_end(batch_index, batch_logs)
+    if sample_weights:
+      feed_data = inputs + targets + sample_weights
+    else:
+      feed_data = inputs + targets
+    num_train_samples = training_utils.check_num_samples(
+        feed_data,
+        batch_size=batch_size,
+        steps=steps_per_epoch,
+        steps_name='steps_per_epoch')
+
+    if num_train_samples is not None:
+      index_array = np.arange(num_train_samples)
+
+    model.history = cbks.History()
+    callbacks = [cbks.BaseLogger()] + (callbacks or []) + [model.history]
+    if verbose:
+      if steps_per_epoch is not None:
+        count_mode = 'steps'
+      else:
+        count_mode = 'samples'
+      callbacks += [cbks.ProgbarLogger(count_mode)]
+    callbacks = cbks.CallbackList(callbacks)
+
+    # It's possible to call back a different model than self
+    # (used by Sequential models).
+    if hasattr(model, 'callback_model') and model.callback_model:
+      callback_model = model.callback_model
+    else:
+      callback_model = model
+
+    callbacks.set_model(callback_model)
+
+    callbacks.set_params({
+        'batch_size': batch_size,
+        'epochs': epochs,
+        'steps': steps_per_epoch,
+        'samples': num_train_samples,
+        'verbose': verbose,
+        'do_validation': do_validation,
+        'metrics': callback_metrics or [],
+    })
+    callbacks.on_train_begin()
+    callback_model.stop_training = False
+    for cbk in callbacks:
+      if not val_inputs:
+        cbk.validation_data = []
+      elif val_sample_weights:
+        cbk.validation_data = val_inputs + val_targets + val_sample_weights
+      else:
+        cbk.validation_data = val_inputs + val_targets
+
+    for epoch in range(initial_epoch, epochs):
+      callbacks.on_epoch_begin(epoch)
+      epoch_logs = {}
+      if shuffle == 'batch':
+        index_array = model._batch_shuffle(index_array, batch_size)
+      elif shuffle:
+        np.random.shuffle(index_array)
+
+      batches = make_batches(num_train_samples, batch_size)
+
+      for batch_index, (batch_start, batch_end) in enumerate(batches):
+        batch_ids = index_array[batch_start:batch_end]
+        try:
+          inputs_batch = slice_arrays(inputs, batch_ids)
+          targets_batch = slice_arrays(targets, batch_ids)
+          if sample_weights:
+            sample_weights_batch = slice_arrays(sample_weights, batch_ids)
+          else:
+            sample_weights_batch = None
+        except TypeError:
+          raise TypeError('TypeError while preparing batch. '
+                          'If using HDF5 input data, '
+                          'pass shuffle="batch".')
+        batch_logs = {}
+        batch_logs['batch'] = batch_index
+        batch_logs['size'] = len(batch_ids)
+
+        callbacks.on_batch_begin(batch_index, batch_logs)
+
+        inputs_batch = [
+            ops.convert_to_tensor(val, dtype=backend.floatx())
+            for val in inputs_batch]
+        targets_batch = [
+            ops.convert_to_tensor(val, dtype=backend.floatx())
+            for val in targets_batch]
+        if sample_weights:
+          sample_weights_batch = [
+              ops.convert_to_tensor(val, dtype=backend.floatx())
+              if val is not None else None
+              for val in sample_weights_batch]
+
+        outs, loss, loss_metrics = _process_single_batch(
+            model,
+            inputs_batch,
+            targets_batch,
+            sample_weights=sample_weights_batch,
+            training=True)
+
+        if not isinstance(outs, list):
+          outs = [outs]
+
+        for l, o in zip(out_labels, outs):
+          batch_logs[l] = o
+        # Required for Eager mode
+        metrics_names, metrics_results = _eager_metrics_fn(
+            model, outs, targets_batch)
+        batch_logs['loss'] = tensor_util.constant_value(backend.mean(loss))
+
+        # TODO(anjalisridhar): Move this to compile to avoid duplicate code.
+        # In graph mode we set the metric names in compile. However in
+        # Eager mode we calculate the metrics for each batch in fit_loop.
+        # We could calculate the metric names and functions in compile.
+        # This would avoid setting the callback parameters separately.
+        # This only needs to happen on the first iteration.
+        for m in metrics_names:
+          if m not in callback_metrics:
+            callback_metrics.append(m)
+
+        callbacks.set_params({
+            'batch_size': batch_size,
+            'epochs': epochs,
+            'steps': steps_per_epoch,
+            'samples': num_train_samples,
+            'verbose': verbose,
+            'do_validation': do_validation,
+            'metrics': callback_metrics or [],
+        })
+
+        for k, v in zip(model.metrics_names,
+                        [backend.mean(loss)] + loss_metrics + metrics_results):
+          batch_logs[k] = tensor_util.constant_value(v)
+
+        callbacks.on_batch_end(batch_index, batch_logs)
+        if callback_model.stop_training:
+          break
+
+        if batch_index == len(batches) - 1:  # Last batch.
+          if do_validation:
+            val_outs = test_loop(
+                model, val_inputs, val_targets,
+                sample_weights=val_sample_weights,
+                batch_size=batch_size,
+                verbose=0)
+            if not isinstance(val_outs, list):
+              val_outs = [val_outs]
+            # Same labels assumed.
+            for l, o in zip(out_labels, val_outs):
+              epoch_logs['val_' + l] = o
+      callbacks.on_epoch_end(epoch, epoch_logs)
       if callback_model.stop_training:
         break
+    callbacks.on_train_end()
+    return model.history
+
 
-      if batch_index == len(batches) - 1:  # Last batch.
-        if do_validation:
-          val_outs = test_loop(
-              model, val_ins, batch_size=batch_size, verbose=0)
-          if not isinstance(val_outs, list):
-            val_outs = [val_outs]
-          # Same labels assumed.
-          for l, o in zip(out_labels, val_outs):
-            epoch_logs['val_' + l] = o
-    callbacks.on_epoch_end(epoch, epoch_logs)
-    if callback_model.stop_training:
-      break
-  callbacks.on_train_end()
-  return model.history
-
-
-def test_loop(model, ins, batch_size=None, verbose=0, steps=None):
+def test_loop(model, inputs, targets,
+              sample_weights=None,
+              batch_size=None,
+              verbose=0,
+              steps=None):
   """Abstract method to loop over some data in batches.
 
   Arguments:
       model: Model instance that is being evaluated in Eager mode.
-      ins: list of tensors to be fed to `f`.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
       batch_size: integer batch size or `None`.
       verbose: verbosity mode.
       steps: Total number of steps (batches of samples)
@@ -499,69 +531,79 @@ def test_loop(model, ins, batch_size=None, verbose=0, steps=None):
       and/or metrics). The attribute `model.metrics_names` will give you
       the display labels for the scalar outputs.
   """
-  K.set_learning_phase(False)
-  num_samples = model._check_num_samples(ins, batch_size, steps, 'steps')
-  outs = []
-  if verbose == 1:
-    progbar = Progbar(target=num_samples)
-  batches = make_batches(num_samples, batch_size)
-  index_array = np.arange(num_samples)
-  for batch_index, (batch_start, batch_end) in enumerate(batches):
-    batch_ids = index_array[batch_start:batch_end]
-    if isinstance(ins[-1], float):
-      # Do not slice the training phase flag.
-      ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
-    else:
-      ins_batch = slice_arrays(ins, batch_ids)
-
-    ins_batch_converted = []
-    for ib in ins_batch:
-      ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
-
-    eager_model_inputs = []
-    eager_model_outputs = []
-    for i in range(len(model.inputs)):
-      eager_model_inputs.append(ins_batch_converted[i])
-
-    for i in range(len(model.inputs), len(ins_batch_converted)):
-      eager_model_outputs.append(ins_batch_converted[i])
-
-    loss_outs, loss, loss_metrics = _model_loss(model, eager_model_inputs,
-                                                eager_model_outputs,
-                                                training=False)
-    _, metrics_results = _eager_metrics_fn(model, loss_outs,
-                                           eager_model_outputs)
-    batch_outs = []
-    for _, v in zip(model.metrics_names,
-                    [K.mean(loss)] + loss_metrics + metrics_results):
-      batch_outs.append(tensor_util.constant_value(v))
-
-    if isinstance(batch_outs, list):
-      if batch_index == 0:
-        for batch_out in enumerate(batch_outs):
+  with backend.learning_phase_scope(0):
+    feed_data = inputs + targets
+    if sample_weights:
+      feed_data += sample_weights
+    num_samples = training_utils.check_num_samples(
+        feed_data, batch_size=batch_size, steps=steps, steps_name='steps')
+    outs = []
+    if verbose == 1:
+      progbar = Progbar(target=num_samples)
+    batches = make_batches(num_samples, batch_size)
+    index_array = np.arange(num_samples)
+    for batch_index, (batch_start, batch_end) in enumerate(batches):
+      batch_ids = index_array[batch_start:batch_end]
+      inputs_batch = slice_arrays(inputs, batch_ids)
+      targets_batch = slice_arrays(targets, batch_ids)
+      if sample_weights:
+        sample_weights_batch = slice_arrays(sample_weights, batch_ids)
+      else:
+        sample_weights_batch = None
+
+      inputs_batch = [
+          ops.convert_to_tensor(val, dtype=backend.floatx())
+          for val in inputs_batch]
+      targets_batch = [
+          ops.convert_to_tensor(val, dtype=backend.floatx())
+          for val in targets_batch]
+      if sample_weights:
+        sample_weights_batch = [
+            ops.convert_to_tensor(val, dtype=backend.floatx())
+            if val is not None else None
+            for val in sample_weights_batch]
+
+      loss_outs, loss, loss_metrics = _model_loss(
+          model,
+          inputs_batch,
+          targets_batch,
+          sample_weights=sample_weights_batch,
+          training=False)
+      _, metrics_results = _eager_metrics_fn(model, loss_outs, targets_batch)
+      batch_outs = []
+      for _, v in zip(model.metrics_names,
+                      [backend.mean(loss)] + loss_metrics + metrics_results):
+        batch_outs.append(tensor_util.constant_value(v))
+
+      if isinstance(batch_outs, list):
+        if batch_index == 0:
+          for _ in batch_outs:
+            outs.append(0.)
+        for i, batch_out in enumerate(batch_outs):
+          outs[i] += batch_out * len(batch_ids)
+      else:
+        if batch_index == 0:
           outs.append(0.)
-      for i, batch_out in enumerate(batch_outs):
-        outs[i] += batch_out * len(batch_ids)
-    else:
-      if batch_index == 0:
-        outs.append(0.)
-      outs[0] += batch_outs * len(batch_ids)
+        outs[0] += batch_outs * len(batch_ids)
 
-    if verbose == 1:
-      progbar.update(batch_end)
-  for i in range(len(outs)):
-    outs[i] /= num_samples
-  if len(outs) == 1:
-    return outs[0]
-  return outs
+      if verbose == 1:
+        progbar.update(batch_end)
+    for i in range(len(outs)):
+      outs[i] /= num_samples
+    if len(outs) == 1:
+      return outs[0]
+    return outs
 
 
-def predict_loop(model, ins, batch_size=32, verbose=0, steps=None):
+def predict_loop(model, inputs,
+                 batch_size=32,
+                 verbose=0,
+                 steps=None):
   """Abstract method to loop over some data in batches.
 
   Arguments:
       model:
-      ins: list of tensors to be fed to `f`.
+      inputs: List of input arrays.
       batch_size: integer batch size.
       verbose: verbosity mode.
       steps: Total number of steps (batches of samples)
@@ -573,57 +615,50 @@ def predict_loop(model, ins, batch_size=32, verbose=0, steps=None):
       or list of arrays of predictions
       (if the model has multiple outputs).
   """
-  K.set_learning_phase(False)
-  num_samples = model._check_num_samples(ins, batch_size, steps, 'steps')
-  if verbose == 1:
-    if steps is not None:
-      progbar = Progbar(target=steps)
-    else:
-      progbar = Progbar(target=num_samples)
-
-  outs = []
-  batches = make_batches(num_samples, batch_size)
-  index_array = np.arange(num_samples)
-  for batch_index, (batch_start, batch_end) in enumerate(batches):
-    batch_ids = index_array[batch_start:batch_end]
-    if ins and isinstance(ins[-1], float):
-      # Do not slice the training phase flag.
-      ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
-    else:
-      ins_batch = slice_arrays(ins, batch_ids)
+  with backend.learning_phase_scope(0):
+    num_samples = training_utils.check_num_samples(
+        inputs, batch_size, steps, 'steps')
+    if verbose == 1:
+      if steps is not None:
+        progbar = Progbar(target=steps)
+      else:
+        progbar = Progbar(target=num_samples)
 
-    ins_batch_converted = []
-    for ib in ins_batch:
-      ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
+    outs = []
+    batches = make_batches(num_samples, batch_size)
+    index_array = np.arange(num_samples)
+    for batch_index, (batch_start, batch_end) in enumerate(batches):
+      batch_ids = index_array[batch_start:batch_end]
+      inputs_batch = slice_arrays(inputs, batch_ids)
 
-    eager_model_inputs = []
-    for i in range(len(model.inputs)):
-      eager_model_inputs.append(ins_batch_converted[i])
+      inputs_batch = [
+          ops.convert_to_tensor(val, dtype=backend.floatx())
+          for val in inputs_batch]
 
-    if len(eager_model_inputs) == 1:
-      if model._expects_training_arg:
-        batch_outs = model.call(eager_model_inputs[0], training=False)
-      else:
-        batch_outs = model.call(eager_model_inputs[0])
-    else:
-      if model._expects_training_arg:
-        batch_outs = model.call(eager_model_inputs, training=False)
+      if len(inputs_batch) == 1:
+        if model._expects_training_arg:
+          batch_outs = model.call(inputs_batch[0], training=False)
+        else:
+          batch_outs = model.call(inputs_batch[0])
       else:
-        batch_outs = model.call(eager_model_inputs)
-
-    if not isinstance(batch_outs, list):
-      batch_outs = [batch_outs]
-    if batch_index == 0:
-      # Pre-allocate the results arrays.
-      for batch_out in batch_outs:
-        dims = batch_out.shape[1:].dims
-        dims_list = [d.value for d in dims]
-        shape = (num_samples,) + tuple(dims_list)
-        outs.append(np.zeros(shape, dtype=batch_out.dtype.as_numpy_dtype))
-    for i, batch_out in enumerate(batch_outs):
-      outs[i][batch_start:batch_end] = batch_out
-    if verbose == 1:
-      progbar.update(batch_end)
-  if len(outs) == 1:
-    return outs[0]
-  return outs
+        if model._expects_training_arg:
+          batch_outs = model.call(inputs_batch, training=False)
+        else:
+          batch_outs = model.call(inputs_batch)
+
+      if not isinstance(batch_outs, list):
+        batch_outs = [batch_outs]
+      if batch_index == 0:
+        # Pre-allocate the results arrays.
+        for batch_out in batch_outs:
+          dims = batch_out.shape[1:].dims
+          dims_list = [d.value for d in dims]
+          shape = (num_samples,) + tuple(dims_list)
+          outs.append(np.zeros(shape, dtype=batch_out.dtype.as_numpy_dtype))
+      for i, batch_out in enumerate(batch_outs):
+        outs[i][batch_start:batch_end] = batch_out
+      if verbose == 1:
+        progbar.update(batch_end)
+    if len(outs) == 1:
+      return outs[0]
+    return outs
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_eager_test.py b/tensorflow/python/keras/_impl/keras/engine/training_eager_test.py
index 3d94b7537ff354351d3f4adde6e0819ce66ea377..8848b393d5e602e564cb357c32a937eaabd68203 100644
--- a/tensorflow/python/keras/_impl/keras/engine/training_eager_test.py
+++ b/tensorflow/python/keras/_impl/keras/engine/training_eager_test.py
@@ -24,9 +24,7 @@ import numpy as np
 from tensorflow.python.framework import ops
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
-from tensorflow.python.keras._impl.keras.utils.generic_utils import slice_arrays
 from tensorflow.python.platform import test
-from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training.rmsprop import RMSPropOptimizer
 
 
@@ -316,10 +314,9 @@ class LossWeightingTest(test.TestCase):
   def test_class_weights(self):
     num_classes = 5
     batch_size = 5
-    epochs = 5
     weighted_class = 3
-    train_samples = 3000
-    test_samples = 3000
+    train_samples = 300
+    test_samples = 300
     input_dim = 5
 
     model = keras.models.Sequential()
@@ -344,16 +341,16 @@ class LossWeightingTest(test.TestCase):
     test_ids = np.where(int_y_test == np.array(weighted_class))[0]
 
     class_weight = dict([(i, 1.) for i in range(num_classes)])
-    class_weight[weighted_class] = 2.
+    class_weight[weighted_class] = 4.
 
     sample_weight = np.ones((y_train.shape[0]))
-    sample_weight[int_y_train == weighted_class] = 2.
+    sample_weight[int_y_train == weighted_class] = 4.
 
     model.fit(
         x_train,
         y_train,
         batch_size=batch_size,
-        epochs=epochs // 3,
+        epochs=2,
         verbose=0,
         class_weight=class_weight,
         validation_data=(x_train, y_train, sample_weight))
@@ -361,14 +358,14 @@ class LossWeightingTest(test.TestCase):
         x_train,
         y_train,
         batch_size=batch_size,
-        epochs=epochs // 2,
+        epochs=2,
         verbose=0,
         class_weight=class_weight)
     model.fit(
         x_train,
         y_train,
         batch_size=batch_size,
-        epochs=epochs // 2,
+        epochs=2,
         verbose=0,
         class_weight=class_weight,
         validation_split=0.1)
@@ -383,10 +380,9 @@ class LossWeightingTest(test.TestCase):
   def test_sample_weights(self):
     num_classes = 5
     batch_size = 5
-    epochs = 5
     weighted_class = 3
-    train_samples = 3000
-    test_samples = 3000
+    train_samples = 300
+    test_samples = 300
     input_dim = 5
 
     model = keras.models.Sequential()
@@ -407,23 +403,23 @@ class LossWeightingTest(test.TestCase):
     y_train = keras.utils.to_categorical(y_train, num_classes)
 
     class_weight = dict([(i, 1.) for i in range(num_classes)])
-    class_weight[weighted_class] = 2.
+    class_weight[weighted_class] = 4.
 
     sample_weight = np.ones((y_train.shape[0]))
-    sample_weight[int_y_train == weighted_class] = 2.
+    sample_weight[int_y_train == weighted_class] = 4.
 
     model.fit(
         x_train,
         y_train,
         batch_size=batch_size,
-        epochs=epochs // 3,
+        epochs=2,
         verbose=0,
         sample_weight=sample_weight)
     model.fit(
         x_train,
         y_train,
         batch_size=batch_size,
-        epochs=epochs // 3,
+        epochs=2,
         verbose=0,
         sample_weight=sample_weight,
         validation_split=0.1)
@@ -536,215 +532,6 @@ class LossWeightingTest(test.TestCase):
       model.fit(x_np, [y_np, y_np], epochs=1, sample_weight={'1': bad_w_np})
 
 
-class TestDynamicTrainability(test.TestCase):
-
-  def test_trainable_warning(self):
-    x = np.random.random((5, 3))
-    y = np.random.random((5, 2))
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(2, input_dim=3))
-    model.trainable = False
-    model.compile(RMSPropOptimizer(learning_rate=0.001), 'mse')
-    model.trainable = True
-    with test.mock.patch.object(logging, 'warning') as mock_log:
-      model.train_on_batch(x, y)
-      self.assertRegexpMatches(str(mock_log.call_args),
-                               'trainable weights is empty')
-
-  def test_trainable_argument(self):
-    x = np.random.random((5, 3))
-    y = np.random.random((5, 2))
-
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(2, input_dim=3, trainable=False))
-    model.compile(RMSPropOptimizer(learning_rate=0.001), 'mse')
-    out = model.predict(x)
-    with test.mock.patch.object(logging, 'warning') as mock_log:
-      model.train_on_batch(x, y)
-      self.assertRegexpMatches(str(mock_log.call_args),
-                               'trainable weights is empty')
-    out_2 = model.predict(x)
-    self.assertAllClose(out, out_2)
-
-    # test with nesting
-    inputs = keras.layers.Input(shape=(3,))
-    output = model(inputs)
-    model = keras.models.Model(inputs, output)
-    model.compile(RMSPropOptimizer(learning_rate=0.001), 'mse')
-    out = model.predict(x)
-    with test.mock.patch.object(logging, 'warning') as mock_log:
-      model.train_on_batch(x, y)
-      self.assertRegexpMatches(str(mock_log.call_args),
-                               'trainable weights is empty')
-    out_2 = model.predict(x)
-    self.assertAllClose(out, out_2)
-
-  def test_layer_trainability_switch(self):
-    # with constructor argument, in Sequential
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(2, trainable=False, input_dim=1))
-    self.assertListEqual(model.trainable_weights, [])
-
-    # by setting the `trainable` argument, in Sequential
-    model = keras.models.Sequential()
-    layer = keras.layers.Dense(2, input_dim=1)
-    model.add(layer)
-    self.assertListEqual(model.trainable_weights, layer.trainable_weights)
-    layer.trainable = False
-    self.assertListEqual(model.trainable_weights, [])
-
-    # with constructor argument, in Model
-    x = keras.layers.Input(shape=(1,))
-    y = keras.layers.Dense(2, trainable=False)(x)
-    model = keras.models.Model(x, y)
-    self.assertListEqual(model.trainable_weights, [])
-
-    # by setting the `trainable` argument, in Model
-    x = keras.layers.Input(shape=(1,))
-    layer = keras.layers.Dense(2)
-    y = layer(x)
-    model = keras.models.Model(x, y)
-    self.assertListEqual(model.trainable_weights, layer.trainable_weights)
-    layer.trainable = False
-    self.assertListEqual(model.trainable_weights, [])
-
-  def test_model_trainability_switch(self):
-    # a non-trainable model has no trainable weights
-    x = keras.layers.Input(shape=(1,))
-    y = keras.layers.Dense(2)(x)
-    model = keras.models.Model(x, y)
-    model.trainable = False
-    self.assertListEqual(model.trainable_weights, [])
-
-    # same for Sequential
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(2, input_dim=1))
-    model.trainable = False
-    self.assertListEqual(model.trainable_weights, [])
-
-  def test_nested_model_trainability(self):
-
-    # a Sequential inside a Model
-    inner_model = keras.models.Sequential()
-    inner_model.add(keras.layers.Dense(2, input_dim=1))
-
-    x = keras.layers.Input(shape=(1,))
-    y = inner_model(x)
-    outer_model = keras.models.Model(x, y)
-    self.assertListEqual(outer_model.trainable_weights,
-                         inner_model.trainable_weights)
-    inner_model.trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-    inner_model.trainable = True
-    inner_model.layers[-1].trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-
-    # a Sequential inside a Sequential
-    inner_model = keras.models.Sequential()
-    inner_model.add(keras.layers.Dense(2, input_dim=1))
-    outer_model = keras.models.Sequential()
-    outer_model.add(inner_model)
-    self.assertListEqual(outer_model.trainable_weights,
-                         inner_model.trainable_weights)
-    inner_model.trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-    inner_model.trainable = True
-    inner_model.layers[-1].trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-
-    # a Model inside a Model
-    x = keras.layers.Input(shape=(1,))
-    y = keras.layers.Dense(2)(x)
-    inner_model = keras.models.Model(x, y)
-    x = keras.layers.Input(shape=(1,))
-    y = inner_model(x)
-    outer_model = keras.models.Model(x, y)
-    self.assertListEqual(outer_model.trainable_weights,
-                         inner_model.trainable_weights)
-    inner_model.trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-    inner_model.trainable = True
-    inner_model.layers[-1].trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-
-    # a Model inside a Sequential
-    x = keras.layers.Input(shape=(1,))
-    y = keras.layers.Dense(2)(x)
-    inner_model = keras.models.Model(x, y)
-    outer_model = keras.models.Sequential()
-    outer_model.add(inner_model)
-    self.assertListEqual(outer_model.trainable_weights,
-                         inner_model.trainable_weights)
-    inner_model.trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-    inner_model.trainable = True
-    inner_model.layers[-1].trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-
-
-class TestTrainingUtils(test.TestCase):
-
-  def test_check_array_lengths(self):
-    keras.engine.training._check_array_lengths(None, None, None)
-    a_np = np.random.random((4, 3, 3))
-    keras.engine.training._check_array_lengths(a_np, a_np, a_np)
-    keras.engine.training._check_array_lengths(
-        [a_np, a_np], [a_np, a_np], [a_np, a_np])
-    keras.engine.training._check_array_lengths([None], [None], [None])
-
-    b_np = np.random.random((3, 4))
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths(a_np, None, None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths(a_np, a_np, None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], [None], None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], [b_np], None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], None, [b_np])
-
-  def test_slice_arrays(self):
-    input_a = np.random.random((10, 3))
-    slice_arrays(None)
-    slice_arrays(input_a, 0)
-    slice_arrays(input_a, 0, 1)
-    slice_arrays(input_a, stop=2)
-    input_a = [None, [1, 1], None, [1, 1]]
-    slice_arrays(input_a, 0)
-    slice_arrays(input_a, 0, 1)
-    slice_arrays(input_a, stop=2)
-    input_a = [None]
-    slice_arrays(input_a, 0)
-    slice_arrays(input_a, 0, 1)
-    slice_arrays(input_a, stop=2)
-    input_a = None
-    slice_arrays(input_a, 0)
-    slice_arrays(input_a, 0, 1)
-    slice_arrays(input_a, stop=2)
-
-  def test_fit_with_BatchNorm(self):
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(10, input_dim=4))
-    model.add(keras.layers.BatchNormalization())
-    model.add(keras.layers.Activation('tanh'))
-    model.add(keras.layers.Dropout(0.2))
-
-    input_a_np = np.random.random((10, 4))
-    output_b_np = np.random.random((10, 10))
-
-    model.compile(loss='binary_crossentropy', optimizer=RMSPropOptimizer(0.001))
-    model.fit(input_a_np, output_b_np, epochs=1, batch_size=5, verbose=0)
-
-  def test_fit_with_regularization(self):
-    model = keras.models.Sequential()
-    with self.assertRaises(ValueError):
-      model.add(
-          keras.layers.Dense(4, input_dim=3,
-                             kernel_regularizer=keras.regularizers.l2(0.01),
-                             activity_regularizer=keras.regularizers.l1(0.01)))
-
-
 if __name__ == '__main__':
   # Bazel sets these environment variables to very long paths.
   # Tempfile uses them to create long paths, and in turn multiprocessing
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_generator.py b/tensorflow/python/keras/_impl/keras/engine/training_generator.py
new file mode 100644
index 0000000000000000000000000000000000000000..58b5bc39c10ea06f680eb030e14ecd19a3888588
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/training_generator.py
@@ -0,0 +1,436 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Part of the Keras training engine related to Python generators of array data.
+"""
+# pylint: disable=protected-access
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras import callbacks as cbks
+from tensorflow.python.keras._impl.keras.utils.data_utils import GeneratorEnqueuer
+from tensorflow.python.keras._impl.keras.utils.data_utils import OrderedEnqueuer
+from tensorflow.python.keras._impl.keras.utils.data_utils import Sequence
+from tensorflow.python.keras._impl.keras.utils.generic_utils import Progbar
+from tensorflow.python.platform import tf_logging as logging
+
+
+def fit_generator(model,
+                  generator,
+                  steps_per_epoch=None,
+                  epochs=1,
+                  verbose=1,
+                  callbacks=None,
+                  validation_data=None,
+                  validation_steps=None,
+                  class_weight=None,
+                  max_queue_size=10,
+                  workers=1,
+                  use_multiprocessing=False,
+                  shuffle=True,
+                  initial_epoch=0):
+  """See docstring for `Model.fit_generator`."""
+  wait_time = 0.01  # in seconds
+  epoch = initial_epoch
+
+  do_validation = bool(validation_data)
+  model._make_train_function()
+  if do_validation:
+    model._make_test_function()
+
+  is_sequence = isinstance(generator, Sequence)
+  if not is_sequence and use_multiprocessing and workers > 1:
+    logging.warning(
+        UserWarning('Using a generator with `use_multiprocessing=True`'
+                    ' and multiple workers may duplicate your data.'
+                    ' Please consider using the `keras.utils.Sequence`'
+                    ' class.'))
+  if steps_per_epoch is None:
+    if is_sequence:
+      steps_per_epoch = len(generator)
+    else:
+      raise ValueError('`steps_per_epoch=None` is only valid for a'
+                       ' generator based on the `keras.utils.Sequence`'
+                       ' class. Please specify `steps_per_epoch` or use'
+                       ' the `keras.utils.Sequence` class.')
+
+  # python 2 has 'next', 3 has '__next__'
+  # avoid any explicit version checks
+  val_gen = (
+      hasattr(validation_data, 'next') or
+      hasattr(validation_data, '__next__') or
+      isinstance(validation_data, Sequence))
+  if (val_gen and not isinstance(validation_data, Sequence) and
+      not validation_steps):
+    raise ValueError('`validation_steps=None` is only valid for a'
+                     ' generator based on the `keras.utils.Sequence`'
+                     ' class. Please specify `validation_steps` or use'
+                     ' the `keras.utils.Sequence` class.')
+
+  # Prepare display labels.
+  out_labels = model.metrics_names
+  callback_metrics = out_labels + ['val_%s' % n for n in out_labels]
+
+  # prepare callbacks
+  model.history = cbks.History()
+  callbacks = [cbks.BaseLogger()] + (callbacks or []) + [model.history]
+  if verbose:
+    callbacks += [cbks.ProgbarLogger(count_mode='steps')]
+  callbacks = cbks.CallbackList(callbacks)
+
+  # It's possible to call back a different model than `model`:
+  if hasattr(model, 'callback_model') and model.callback_model:
+    callback_model = model.callback_model
+  else:
+    callback_model = model
+  callbacks.set_model(callback_model)
+  callbacks.set_params({
+      'epochs': epochs,
+      'steps': steps_per_epoch,
+      'verbose': verbose,
+      'do_validation': do_validation,
+      'metrics': callback_metrics,
+  })
+  callbacks.on_train_begin()
+
+  enqueuer = None
+  val_enqueuer = None
+
+  try:
+    if do_validation and not val_gen:
+      # Prepare data for validation
+      if len(validation_data) == 2:
+        val_x, val_y = validation_data  # pylint: disable=unpacking-non-sequence
+        val_sample_weight = None
+      elif len(validation_data) == 3:
+        val_x, val_y, val_sample_weight = validation_data  # pylint: disable=unpacking-non-sequence
+      else:
+        raise ValueError(
+            '`validation_data` should be a tuple '
+            '`(val_x, val_y, val_sample_weight)` '
+            'or `(val_x, val_y)`. Found: ' + str(validation_data))
+      val_x, val_y, val_sample_weights = model._standardize_user_data(
+          val_x, val_y, val_sample_weight)
+      val_data = val_x + val_y + val_sample_weights
+      if model.uses_learning_phase and not isinstance(K.learning_phase(), int):
+        val_data += [0.]
+      for cbk in callbacks:
+        cbk.validation_data = val_data
+
+    if workers > 0:
+      if is_sequence:
+        enqueuer = OrderedEnqueuer(
+            generator,
+            use_multiprocessing=use_multiprocessing,
+            shuffle=shuffle)
+      else:
+        enqueuer = GeneratorEnqueuer(
+            generator,
+            use_multiprocessing=use_multiprocessing,
+            wait_time=wait_time)
+      enqueuer.start(workers=workers, max_queue_size=max_queue_size)
+      output_generator = enqueuer.get()
+    else:
+      if is_sequence:
+        output_generator = iter(generator)
+      else:
+        output_generator = generator
+
+    callback_model.stop_training = False
+    # Construct epoch logs.
+    epoch_logs = {}
+    while epoch < epochs:
+      callbacks.on_epoch_begin(epoch)
+      steps_done = 0
+      batch_index = 0
+      while steps_done < steps_per_epoch:
+        generator_output = next(output_generator)
+
+        if not hasattr(generator_output, '__len__'):
+          raise ValueError('Output of generator should be '
+                           'a tuple `(x, y, sample_weight)` '
+                           'or `(x, y)`. Found: ' + str(generator_output))
+
+        if len(generator_output) == 2:
+          x, y = generator_output
+          sample_weight = None
+        elif len(generator_output) == 3:
+          x, y, sample_weight = generator_output
+        else:
+          raise ValueError('Output of generator should be '
+                           'a tuple `(x, y, sample_weight)` '
+                           'or `(x, y)`. Found: ' + str(generator_output))
+        # build batch logs
+        batch_logs = {}
+        if isinstance(x, list):
+          batch_size = x[0].shape[0]
+        elif isinstance(x, dict):
+          batch_size = list(x.values())[0].shape[0]
+        else:
+          batch_size = x.shape[0]
+        batch_logs['batch'] = batch_index
+        batch_logs['size'] = batch_size
+        callbacks.on_batch_begin(batch_index, batch_logs)
+
+        outs = model.train_on_batch(
+            x, y, sample_weight=sample_weight, class_weight=class_weight)
+
+        if not isinstance(outs, list):
+          outs = [outs]
+        for l, o in zip(out_labels, outs):
+          batch_logs[l] = o
+
+        callbacks.on_batch_end(batch_index, batch_logs)
+
+        batch_index += 1
+        steps_done += 1
+
+        # Epoch finished.
+        if steps_done >= steps_per_epoch and do_validation:
+          if val_gen:
+            val_outs = evaluate_generator(
+                model,
+                validation_data,
+                validation_steps,
+                workers=workers,
+                use_multiprocessing=use_multiprocessing,
+                max_queue_size=max_queue_size)
+          else:
+            # No need for try/except because
+            # data has already been validated.
+            val_outs = model.evaluate(
+                val_x,
+                val_y,
+                batch_size=batch_size,
+                sample_weight=val_sample_weights,
+                verbose=0)
+          if not isinstance(val_outs, list):
+            val_outs = [val_outs]
+          # Same labels assumed.
+          for l, o in zip(out_labels, val_outs):
+            epoch_logs['val_' + l] = o
+
+        if callback_model.stop_training:
+          break
+
+      callbacks.on_epoch_end(epoch, epoch_logs)
+      epoch += 1
+      if callback_model.stop_training:
+        break
+
+  finally:
+    try:
+      if enqueuer is not None:
+        enqueuer.stop()
+    finally:
+      if val_enqueuer is not None:
+        val_enqueuer.stop()
+
+  callbacks.on_train_end()
+  return model.history
+
+
+def evaluate_generator(model,
+                       generator,
+                       steps=None,
+                       max_queue_size=10,
+                       workers=1,
+                       use_multiprocessing=False):
+  """See docstring for `Model.evaluate_generator`."""
+  model._make_test_function()
+
+  steps_done = 0
+  wait_time = 0.01
+  all_outs = []
+  batch_sizes = []
+  is_sequence = isinstance(generator, Sequence)
+  if not is_sequence and use_multiprocessing and workers > 1:
+    logging.warning(
+        UserWarning('Using a generator with `use_multiprocessing=True`'
+                    ' and multiple workers may duplicate your data.'
+                    ' Please consider using the `keras.utils.Sequence`'
+                    ' class.'))
+  if steps is None:
+    if is_sequence:
+      steps = len(generator)
+    else:
+      raise ValueError('`steps=None` is only valid for a generator'
+                       ' based on the `keras.utils.Sequence` class.'
+                       ' Please specify `steps` or use the'
+                       ' `keras.utils.Sequence` class.')
+  enqueuer = None
+
+  try:
+    if workers > 0:
+      if is_sequence:
+        enqueuer = OrderedEnqueuer(
+            generator, use_multiprocessing=use_multiprocessing)
+      else:
+        enqueuer = GeneratorEnqueuer(
+            generator,
+            use_multiprocessing=use_multiprocessing,
+            wait_time=wait_time)
+      enqueuer.start(workers=workers, max_queue_size=max_queue_size)
+      output_generator = enqueuer.get()
+    else:
+      if is_sequence:
+        output_generator = iter(generator)
+      else:
+        output_generator = generator
+
+    while steps_done < steps:
+      generator_output = next(output_generator)
+      if not hasattr(generator_output, '__len__'):
+        raise ValueError('Output of generator should be a tuple '
+                         '(x, y, sample_weight) '
+                         'or (x, y). Found: ' + str(generator_output))
+      if len(generator_output) == 2:
+        x, y = generator_output
+        sample_weight = None
+      elif len(generator_output) == 3:
+        x, y, sample_weight = generator_output
+      else:
+        raise ValueError('Output of generator should be a tuple '
+                         '(x, y, sample_weight) '
+                         'or (x, y). Found: ' + str(generator_output))
+      outs = model.test_on_batch(x, y, sample_weight=sample_weight)
+
+      if isinstance(x, list):
+        batch_size = x[0].shape[0]
+      elif isinstance(x, dict):
+        batch_size = list(x.values())[0].shape[0]
+      else:
+        batch_size = x.shape[0]
+      if batch_size == 0:
+        raise ValueError('Received an empty batch. '
+                         'Batches should at least contain one item.')
+      all_outs.append(outs)
+
+      steps_done += 1
+      batch_sizes.append(batch_size)
+
+  finally:
+    if enqueuer is not None:
+      enqueuer.stop()
+
+  if not isinstance(outs, list):
+    return np.average(np.asarray(all_outs), weights=batch_sizes)
+  else:
+    averages = []
+    for i in range(len(outs)):
+      averages.append(
+          np.average([out[i] for out in all_outs], weights=batch_sizes))
+    return averages
+
+
+def predict_generator(model,
+                      generator,
+                      steps=None,
+                      max_queue_size=10,
+                      workers=1,
+                      use_multiprocessing=False,
+                      verbose=0):
+  """See docstring for `Model.predict_generator`."""
+  model._make_predict_function()
+
+  steps_done = 0
+  wait_time = 0.01
+  all_outs = []
+  is_sequence = isinstance(generator, Sequence)
+  if not is_sequence and use_multiprocessing and workers > 1:
+    logging.warning(
+        UserWarning('Using a generator with `use_multiprocessing=True`'
+                    ' and multiple workers may duplicate your data.'
+                    ' Please consider using the `keras.utils.Sequence`'
+                    ' class.'))
+  if steps is None:
+    if is_sequence:
+      steps = len(generator)
+    else:
+      raise ValueError('`steps=None` is only valid for a generator'
+                       ' based on the `keras.utils.Sequence` class.'
+                       ' Please specify `steps` or use the'
+                       ' `keras.utils.Sequence` class.')
+  enqueuer = None
+
+  try:
+    if workers > 0:
+      if is_sequence:
+        enqueuer = OrderedEnqueuer(
+            generator, use_multiprocessing=use_multiprocessing)
+      else:
+        enqueuer = GeneratorEnqueuer(
+            generator,
+            use_multiprocessing=use_multiprocessing,
+            wait_time=wait_time)
+      enqueuer.start(workers=workers, max_queue_size=max_queue_size)
+      output_generator = enqueuer.get()
+    else:
+      if is_sequence:
+        output_generator = iter(generator)
+      else:
+        output_generator = generator
+
+    if verbose == 1:
+      progbar = Progbar(target=steps)
+
+    while steps_done < steps:
+      generator_output = next(output_generator)
+      if isinstance(generator_output, tuple):
+        # Compatibility with the generators
+        # used for training.
+        if len(generator_output) == 2:
+          x, _ = generator_output
+        elif len(generator_output) == 3:
+          x, _, _ = generator_output
+        else:
+          raise ValueError('Output of generator should be '
+                           'a tuple `(x, y, sample_weight)` '
+                           'or `(x, y)`. Found: ' + str(generator_output))
+      else:
+        # Assumes a generator that only
+        # yields inputs (not targets and sample weights).
+        x = generator_output
+
+      outs = model.predict_on_batch(x)
+      if not isinstance(outs, list):
+        outs = [outs]
+
+      if not all_outs:
+        for out in outs:
+          all_outs.append([])
+
+      for i, out in enumerate(outs):
+        all_outs[i].append(out)
+      steps_done += 1
+      if verbose == 1:
+        progbar.update(steps_done)
+
+  finally:
+    if enqueuer is not None:
+      enqueuer.stop()
+
+  if len(all_outs) == 1:
+    if steps_done == 1:
+      return all_outs[0][0]
+    else:
+      return np.concatenate(all_outs[0])
+  if steps_done == 1:
+    return [out[0] for out in all_outs]
+  else:
+    return [np.concatenate(out) for out in all_outs]
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_test.py b/tensorflow/python/keras/_impl/keras/engine/training_test.py
index 9651eb9f14f1275dc79c8d3b1fb54690772086a1..fd91dbba52ff7d152335514085ef3b057ae5eec4 100644
--- a/tensorflow/python/keras/_impl/keras/engine/training_test.py
+++ b/tensorflow/python/keras/_impl/keras/engine/training_test.py
@@ -25,7 +25,7 @@ import numpy as np
 
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
-from tensorflow.python.keras._impl.keras.engine.training import _weighted_masked_objective
+from tensorflow.python.keras._impl.keras.engine.training_utils import weighted_masked_objective
 from tensorflow.python.keras._impl.keras.utils.generic_utils import slice_arrays
 from tensorflow.python.platform import test
 
@@ -340,20 +340,21 @@ class TrainingTest(test.TestCase):
     if scipy_sparse is None:
       return
 
-    test_inputs = [
-        scipy_sparse.random(6, 3, density=0.25).tocsr() for _ in range(2)]
-    test_outputs = [
-        scipy_sparse.random(6, i, density=0.25).tocsr() for i in range(3, 5)]
-    in1 = keras.layers.Input(shape=(3,))
-    in2 = keras.layers.Input(shape=(3,))
-    out1 = keras.layers.Dropout(0.5, name='dropout')(in1)
-    out2 = keras.layers.Dense(4, name='dense_1')(in2)
-    model = keras.Model([in1, in2], [out1, out2])
-    model.predict(test_inputs, batch_size=2)
-    model.compile('rmsprop', 'mse')
-    model.fit(test_inputs, test_outputs,
-              epochs=1, batch_size=2, validation_split=0.5)
-    model.evaluate(test_inputs, test_outputs, batch_size=2)
+    with self.test_session():
+      test_inputs = [
+          scipy_sparse.random(6, 3, density=0.25).tocsr() for _ in range(2)]
+      test_outputs = [
+          scipy_sparse.random(6, i, density=0.25).tocsr() for i in range(3, 5)]
+      in1 = keras.layers.Input(shape=(3,))
+      in2 = keras.layers.Input(shape=(3,))
+      out1 = keras.layers.Dropout(0.5, name='dropout')(in1)
+      out2 = keras.layers.Dense(4, name='dense_1')(in2)
+      model = keras.Model([in1, in2], [out1, out2])
+      model.predict(test_inputs, batch_size=2)
+      model.compile('rmsprop', 'mse')
+      model.fit(test_inputs, test_outputs,
+                epochs=1, batch_size=2, validation_split=0.5)
+      model.evaluate(test_inputs, test_outputs, batch_size=2)
 
   def test_that_trainable_disables_updates(self):
     val_a = np.random.random((10, 4))
@@ -705,7 +706,7 @@ class LossMaskingTest(test.TestCase):
 
   def test_loss_masking(self):
     with self.test_session():
-      weighted_loss = _weighted_masked_objective(keras.losses.get('mae'))
+      weighted_loss = weighted_masked_objective(keras.losses.get('mae'))
       shape = (3, 4, 2)
       x = np.arange(24).reshape(shape)
       y = 2 * x
@@ -876,9 +877,9 @@ class TestGeneratorMethods(test.TestCase):
 
     def custom_generator():
       batch_size = 10
-      n_samples = 50
+      num_samples = 50
       while True:
-        batch_index = np.random.randint(0, n_samples - batch_size)
+        batch_index = np.random.randint(0, num_samples - batch_size)
         start = batch_index
         end = start + batch_size
         x = arr_data[start: end]
@@ -957,9 +958,9 @@ class TestGeneratorMethods(test.TestCase):
 
     def custom_generator():
       batch_size = 10
-      n_samples = 50
+      num_samples = 50
       while True:
-        batch_index = np.random.randint(0, n_samples - batch_size)
+        batch_index = np.random.randint(0, num_samples - batch_size)
         start = batch_index
         end = start + batch_size
         x = arr_data[start: end]
@@ -1033,28 +1034,66 @@ class TestGeneratorMethods(test.TestCase):
                                  max_queue_size=10,
                                  use_multiprocessing=False)
 
+  def test_training_with_sequences(self):
+
+    class DummySequence(keras.utils.Sequence):
+
+      def __getitem__(self, idx):
+        return np.zeros([10, 2]), np.ones([10])
+
+      def __len__(self):
+        return 10
+
+    arr_data = np.random.random((50, 2))
+    arr_labels = np.random.random((50,))
+    arr_sample_weights = np.random.random((50,))
+
+    def custom_generator():
+      batch_size = 10
+      num_samples = 50
+      while True:
+        batch_index = np.random.randint(0, num_samples - batch_size)
+        start = batch_index
+        end = start + batch_size
+        x = arr_data[start: end]
+        y = arr_labels[start: end]
+        w = arr_sample_weights[start: end]
+        yield x, y, w
+
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(1, input_shape=(2,)))
+      model.compile(loss='mse', optimizer='sgd')
+
+      model.fit_generator(DummySequence(),
+                          steps_per_epoch=10,
+                          validation_data=custom_generator(),
+                          validation_steps=1,
+                          max_queue_size=10,
+                          workers=0,
+                          use_multiprocessing=True)
+      model.fit_generator(DummySequence(),
+                          steps_per_epoch=10,
+                          validation_data=custom_generator(),
+                          validation_steps=1,
+                          max_queue_size=10,
+                          workers=0,
+                          use_multiprocessing=False)
+
 
 class TestTrainingUtils(test.TestCase):
 
   def test_check_array_lengths(self):
-    keras.engine.training._check_array_lengths(None, None, None)
+    keras.engine.training_utils.check_array_lengths(None, None, None)
     a_np = np.random.random((4, 3, 3))
-    keras.engine.training._check_array_lengths(a_np, a_np, a_np)
-    keras.engine.training._check_array_lengths(
+    keras.engine.training_utils.check_array_lengths(a_np, a_np, a_np)
+    keras.engine.training_utils.check_array_lengths(
         [a_np, a_np], [a_np, a_np], [a_np, a_np])
-    keras.engine.training._check_array_lengths([None], [None], [None])
+    keras.engine.training_utils.check_array_lengths([None], [None], [None])
 
     b_np = np.random.random((3, 4))
     with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths(a_np, None, None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths(a_np, a_np, None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], [None], None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], [b_np], None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], None, [b_np])
+      keras.engine.training_utils.check_array_lengths([a_np], [b_np], None)
 
   def test_slice_arrays(self):
     input_a = np.random.random((10, 3))
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_utils.py b/tensorflow/python/keras/_impl/keras/engine/training_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..105638ce1087e8668b49b6653a847667e8f9157d
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/training_utils.py
@@ -0,0 +1,534 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Training-related utilities.
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import copy
+
+import numpy as np
+
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras import losses
+
+
+def check_num_samples(ins,
+                      batch_size=None,
+                      steps=None,
+                      steps_name='steps'):
+  """Determine the number of samples provided for training and evaluation.
+
+  The number of samples is not defined when running with `steps`,
+  in which case the number of samples is set to `None`.
+
+  Arguments:
+      ins: List of tensors to be fed to the Keras function.
+      batch_size: Integer batch size or `None` if not defined.
+      steps: Total number of steps (batches of samples)
+          before declaring `_predict_loop` finished.
+          Ignored with the default value of `None`.
+      steps_name: The public API's parameter name for `steps`.
+
+  Returns:
+      When `steps` is `None`, returns the number of samples to be
+      processed based on the size of the first dimension of the
+      first input numpy array. When `steps` is not `None` and
+      `batch_size` is `None`, returns `None`.
+
+  Raises:
+      ValueError: In case of invalid arguments:
+          - `steps` is `None` and `ins` is empty, or the first
+            element of `ins` has no `shape` attribute, so the
+            number of samples cannot be inferred.
+          - `steps` is not `None` and `batch_size` is not `None`.
+            The two arguments are mutually exclusive, so at most
+            one of them may be set.
+  """
+  if steps is not None:
+    num_samples = None
+    if batch_size is not None:
+      raise ValueError(
+          'If ' + steps_name + ' is set, the `batch_size` must be None.')
+  elif ins and hasattr(ins[0], 'shape'):
+    num_samples = ins[0].shape[0]
+  else:
+    raise ValueError(
+        'Either the input data should have '
+        'a defined shape, or ' + steps_name + ' should be specified.')
+  return num_samples
+
+
+def standardize_input_data(data,
+                           names,
+                           shapes=None,
+                           check_batch_axis=True,
+                           exception_prefix=''):
+  """Normalizes inputs and targets provided by users.
+
+  Users may pass data as a list of arrays, dictionary of arrays,
+  or as a single array. We normalize this to an ordered list of
+  arrays (same order as `names`), while checking that the provided
+  arrays have shapes that match the network's expectations.
+
+  Arguments:
+      data: User-provided input data (polymorphic).
+      names: List of expected array names.
+      shapes: Optional list of expected array shapes.
+      check_batch_axis: Boolean; whether to check that
+          the batch axis of the arrays matches the expected
+          value found in `shapes`.
+      exception_prefix: String prefix used for exception formatting.
+
+  Returns:
+      List of standardized input arrays (one array per model input).
+
+  Raises:
+      ValueError: in case of improperly formatted user-provided data.
+  """
+  if not names:
+    if data is not None and hasattr(data, '__len__') and len(data):
+      raise ValueError('Error when checking model ' + exception_prefix + ': '
+                       'expected no data, but got:', data)
+    return []
+  if data is None:
+    return [None for _ in range(len(names))]
+
+  if isinstance(data, dict):
+    try:
+      data = [
+          data[x].values
+          if data[x].__class__.__name__ == 'DataFrame' else data[x]
+          for x in names
+      ]
+    except KeyError as e:
+      raise ValueError('No data provided for "' + e.args[0] + '". Need data '
+                       'for each key in: ' + str(names))
+  elif isinstance(data, list):
+    if isinstance(data[0], list):
+      data = [np.asarray(d) for d in data]
+    elif len(names) == 1 and isinstance(data[0], (float, int)):
+      data = [np.asarray(data)]
+    else:
+      data = [
+          x.values if x.__class__.__name__ == 'DataFrame' else x for x in data
+      ]
+  else:
+    data = data.values if data.__class__.__name__ == 'DataFrame' else data
+    data = [data]
+  data = [
+      np.expand_dims(x, 1) if x is not None and x.ndim == 1 else x for x in data
+  ]
+
+  if len(data) != len(names):
+    if data and hasattr(data[0], 'shape'):
+      raise ValueError('Error when checking model ' + exception_prefix +
+                       ': the list of Numpy arrays that you are passing to '
+                       'your model is not the size the model expected. '
+                       'Expected to see ' + str(len(names)) + ' array(s), '
+                       'but instead got the following list of ' +
+                       str(len(data)) + ' arrays: ' + str(data)[:200] + '...')
+    elif len(names) > 1:
+      raise ValueError(
+          'Error when checking model ' + exception_prefix +
+          ': you are passing a list as input to your model, '
+          'but the model expects a list of ' + str(len(names)) +
+          ' Numpy arrays instead. The list you passed was: ' + str(data)[:200])
+    elif len(data) == 1 and not hasattr(data[0], 'shape'):
+      raise TypeError('Error when checking model ' + exception_prefix +
+                      ': data should be a Numpy array, or list/dict of '
+                      'Numpy arrays. Found: ' + str(data)[:200] + '...')
+    elif len(names) == 1:
+      data = [np.asarray(data)]
+
+  # Check shapes compatibility.
+  if shapes:
+    for i in range(len(names)):
+      if shapes[i] is not None:
+        data_shape = data[i].shape
+        shape = shapes[i]
+        if data[i].ndim != len(shape):
+          raise ValueError('Error when checking ' + exception_prefix +
+                           ': expected ' + names[i] + ' to have ' +
+                           str(len(shape)) + ' dimensions, but got array '
+                           'with shape ' + str(data_shape))
+        if not check_batch_axis:
+          data_shape = data_shape[1:]
+          shape = shape[1:]
+        for dim, ref_dim in zip(data_shape, shape):
+          if ref_dim != dim and ref_dim:
+            raise ValueError(
+                'Error when checking ' + exception_prefix + ': expected ' +
+                names[i] + ' to have shape ' + str(shape) +
+                ' but got array with shape ' + str(data_shape))
+  return data
+
+
+def standardize_sample_or_class_weights(x_weight, output_names, weight_type):
+  """Maps `sample_weight` or `class_weight` to model outputs.
+
+  Arguments:
+      x_weight: User-provided `sample_weight` or `class_weight` argument.
+      output_names: List of output names (strings) in the model.
+      weight_type: A string used purely for exception printing.
+
+  Returns:
+      A list of `sample_weight` or `class_weight` with exactly
+          one element per model output.
+
+  Raises:
+      ValueError: In case of invalid user-provided argument.
+  """
+  if x_weight is None or len(x_weight) == 0:  # pylint: disable=g-explicit-length-test
+    return [None for _ in output_names]
+  if len(output_names) == 1:
+    if isinstance(x_weight, list) and len(x_weight) == 1:
+      return x_weight
+    if isinstance(x_weight, dict) and output_names[0] in x_weight:
+      return [x_weight[output_names[0]]]
+    else:
+      return [x_weight]
+  if isinstance(x_weight, list):
+    if len(x_weight) != len(output_names):
+      raise ValueError('Provided `' + weight_type + '` was a list of ' +
+                       str(len(x_weight)) + ' elements, but the model has ' +
+                       str(len(output_names)) + ' outputs. '
+                       'You should provide one `' + weight_type + '` '
+                       'array per model output.')
+    return x_weight
+  if isinstance(x_weight, dict):
+    x_weights = []
+    for name in output_names:
+      x_weights.append(x_weight.get(name))
+    return x_weights
+  else:
+    raise TypeError(
+        'The model has multiple outputs, so `' + weight_type + '` '
+        'should be either a list or a dict. '
+        'Provided `' + weight_type + '` type not understood: ' + str(x_weight))
+
+
+def standardize_class_weights(class_weight, output_names):
+  return standardize_sample_or_class_weights(class_weight, output_names,
+                                             'class_weight')
+
+
+def standardize_sample_weights(sample_weight, output_names):
+  return standardize_sample_or_class_weights(sample_weight, output_names,
+                                             'sample_weight')
+
+
+def check_array_lengths(inputs, targets, weights=None):
+  """Does user input validation for numpy arrays.
+
+  Arguments:
+      inputs: list of Numpy arrays of inputs.
+      targets: list of Numpy arrays of targets.
+      weights: list of Numpy arrays of sample weights.
+
+  Raises:
+      ValueError: in case of incorrectly formatted data.
+  """
+
+  def set_of_lengths(x):
+    # return a set with the variation between
+    # different shapes, with None => 0
+    if x is None:
+      return set()
+    else:
+      return set([y.shape[0] for y in x if y is not None])
+
+  set_x = set_of_lengths(inputs)
+  set_y = set_of_lengths(targets)
+  set_w = set_of_lengths(weights)
+  if len(set_x) > 1:
+    raise ValueError('All input arrays (x) should have '
+                     'the same number of samples. Got array shapes: ' +
+                     str([x.shape for x in inputs]))
+  if len(set_y) > 1:
+    raise ValueError('All target arrays (y) should have '
+                     'the same number of samples. Got array shapes: ' +
+                     str([y.shape for y in targets]))
+  if set_x and set_y and list(set_x)[0] != list(set_y)[0]:
+    raise ValueError('Input arrays should have '
+                     'the same number of samples as target arrays. '
+                     'Found ' + str(list(set_x)[0]) + ' input samples '
+                     'and ' + str(list(set_y)[0]) + ' target samples.')
+  if len(set_w) > 1:
+    raise ValueError('All sample_weight arrays should have '
+                     'the same number of samples. Got array shapes: ' +
+                     str([w.shape for w in weights]))
+  if set_y and set_w and list(set_y)[0] != list(set_w)[0]:
+    raise ValueError('Sample_weight arrays should have '
+                     'the same number of samples as target arrays. Got ' +
+                     str(list(set_y)[0]) + ' target samples and ' +
+                     str(list(set_w)[0]) + ' sample_weight samples.')
+
+
+def check_loss_and_target_compatibility(targets, loss_fns, output_shapes):
+  """Does validation on the compatibility of targets and loss functions.
+
+  This helps prevent users from using loss functions incorrectly. This check
+  is purely for UX purposes.
+
+  Arguments:
+      targets: list of Numpy arrays of targets.
+      loss_fns: list of loss functions.
+      output_shapes: list of shapes of model outputs.
+
+  Raises:
+      ValueError: if a loss function or target array
+          is incompatible with an output.
+  """
+  key_losses = {
+      losses.mean_squared_error, losses.binary_crossentropy,
+      losses.categorical_crossentropy
+  }
+  for y, loss, shape in zip(targets, loss_fns, output_shapes):
+    if y is None or loss is None or tensor_util.is_tensor(y):
+      continue
+    if loss is losses.categorical_crossentropy:
+      if y.shape[-1] == 1:
+        raise ValueError('You are passing a target array of shape ' + str(
+            y.shape) + ' while using as loss `categorical_crossentropy`. '
+                         '`categorical_crossentropy` expects '
+                         'targets to be binary matrices (1s and 0s) '
+                         'of shape (samples, classes). '
+                         'If your targets are integer classes, '
+                         'you can convert them to the expected format via:\n'
+                         '```\n'
+                         'from keras.utils import to_categorical\n'
+                         'y_binary = to_categorical(y_int)\n'
+                         '```\n'
+                         '\n'
+                         'Alternatively, you can use the loss function '
+                         '`sparse_categorical_crossentropy` instead, '
+                         'which does expect integer targets.')
+    if loss in key_losses:
+      for target_dim, out_dim in zip(y.shape[1:], shape[1:]):
+        if out_dim is not None and target_dim != out_dim:
+          raise ValueError('A target array with shape ' + str(y.shape) +
+                           ' was passed for an output of shape ' + str(shape) +
+                           ' while using as loss `' + loss.__name__ + '`. '
+                           'This loss expects '
+                           'targets to have the same shape '
+                           'as the output.')
+
+
+def collect_metrics(metrics, output_names):
+  """Maps metric functions to model outputs.
+
+  Arguments:
+      metrics: a list or dict of metric functions.
+      output_names: a list of the names (strings) of model outputs.
+
+  Returns:
+      A list (one entry per model output) of lists of metric functions.
+      For instance, if the model has 2 outputs, and for the first output
+      we want to compute "binary_accuracy" and "binary_crossentropy",
+      and just "binary_accuracy" for the second output,
+      the list would look like:
+          `[[binary_accuracy, binary_crossentropy], [binary_accuracy]]`
+
+  Raises:
+      TypeError: if an incorrect type is passed for the `metrics` argument.
+  """
+  if not metrics:
+    return [[] for _ in output_names]
+  if isinstance(metrics, list):
+    # we then apply all metrics to all outputs.
+    return [copy.copy(metrics) for _ in output_names]
+  elif isinstance(metrics, dict):
+    nested_metrics = []
+    for name in output_names:
+      output_metrics = metrics.get(name, [])
+      if not isinstance(output_metrics, list):
+        output_metrics = [output_metrics]
+      nested_metrics.append(output_metrics)
+    return nested_metrics
+  else:
+    raise TypeError('Type of `metrics` argument not understood. '
+                    'Expected a list or dictionary, found: ' + str(metrics))
+
+
+def batch_shuffle(index_array, batch_size):
+  """Shuffles an array in a batch-wise fashion.
+
+  Useful for shuffling HDF5 arrays
+  (where one cannot access arbitrary indices).
+
+  Arguments:
+      index_array: array of indices to be shuffled.
+      batch_size: integer.
+
+  Returns:
+      The `index_array` array, shuffled in a batch-wise fashion.
+  """
+  batch_count = int(len(index_array) / batch_size)
+  # to reshape we need to be cleanly divisible by batch size
+  # we stash extra items and reappend them after shuffling
+  last_batch = index_array[batch_count * batch_size:]
+  index_array = index_array[:batch_count * batch_size]
+  index_array = index_array.reshape((batch_count, batch_size))
+  np.random.shuffle(index_array)
+  index_array = index_array.flatten()
+  return np.append(index_array, last_batch)
+
+
+def weighted_masked_objective(fn):
+  """Adds support for masking and sample-weighting to an objective function.
+
+  It transforms an objective function `fn(y_true, y_pred)`
+  into a sample-weighted, cost-masked objective function
+  `fn(y_true, y_pred, weights, mask)`.
+
+  Arguments:
+      fn: The objective function to wrap,
+          with signature `fn(y_true, y_pred)`.
+
+  Returns:
+      A function with signature `fn(y_true, y_pred, weights, mask)`.
+  """
+  if fn is None:
+    return None
+
+  def weighted(y_true, y_pred, weights, mask=None):
+    """Wrapper function.
+
+    Arguments:
+        y_true: `y_true` argument of `fn`.
+        y_pred: `y_pred` argument of `fn`.
+        weights: Weights tensor.
+        mask: Mask tensor.
+
+    Returns:
+        Scalar tensor.
+    """
+    # score_array has ndim >= 2
+    score_array = fn(y_true, y_pred)
+    if mask is not None:
+      # Cast the mask to floatX to avoid float64 upcasting in theano
+      mask = K.cast(mask, K.floatx())
+      # mask should have the same shape as score_array
+      score_array *= mask
+      #  the loss per batch should be proportional
+      #  to the number of unmasked samples.
+      score_array /= K.mean(mask)
+
+    # apply sample weighting
+    if weights is not None:
+      # reduce score_array to same ndim as weight array
+      ndim = K.ndim(score_array)
+      weight_ndim = K.ndim(weights)
+      score_array = K.mean(score_array, axis=list(range(weight_ndim, ndim)))
+      score_array *= weights
+      score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
+    return K.mean(score_array)
+
+  return weighted
+
+
+def standardize_weights(y,
+                        sample_weight=None,
+                        class_weight=None,
+                        sample_weight_mode=None):
+  """Performs sample weight validation and standardization.
+
+  Everything gets normalized to a single sample-wise (or timestep-wise)
+  weight array.
+
+  Arguments:
+      y: Numpy array of model targets to be weighted.
+      sample_weight: User-provided `sample_weight` argument.
+      class_weight: User-provided `class_weight` argument.
+      sample_weight_mode: One of `None` or `"temporal"`.
+          `"temporal"` indicates that we expect 2D weight data
+          that will be applied to the last 2 dimensions of
+          the targets (i.e. we are weighting timesteps, not samples).
+
+  Returns:
+      A numpy array of target weights, one entry per sample to weight.
+
+  Raises:
+      ValueError: In case of invalid user-provided arguments.
+  """
+  if sample_weight_mode is not None:
+    if sample_weight_mode != 'temporal':
+      raise ValueError('`sample_weight_mode` '
+                       'should be None or "temporal". '
+                       'Found: ' + str(sample_weight_mode))
+    if len(y.shape) < 3:
+      raise ValueError('Found a sample_weight array for '
+                       'an input with shape ' + str(y.shape) + '. '
+                       'Timestep-wise sample weighting (use of '
+                       'sample_weight_mode="temporal") is restricted to '
+                       'outputs that are at least 3D, i.e. that have '
+                       'a time dimension.')
+    if sample_weight is not None and len(sample_weight.shape) != 2:
+      raise ValueError('Found a sample_weight array with shape ' +
+                       str(sample_weight.shape) + '. '
+                       'In order to use timestep-wise sample weighting, '
+                       'you should pass a 2D sample_weight array.')
+  else:
+    if sample_weight is not None and len(sample_weight.shape) != 1:
+      raise ValueError('Found a sample_weight array with shape ' +
+                       str(sample_weight.shape) + '. '
+                       'In order to use timestep-wise sample weights, '
+                       'you should specify '
+                       'sample_weight_mode="temporal" '
+                       'in compile(). If you just mean to use '
+                       'sample-wise weights, make sure your '
+                       'sample_weight array is 1D.')
+
+  if sample_weight is not None:
+    if len(sample_weight.shape) > len(y.shape):
+      raise ValueError(
+          'Found a sample_weight with shape ' + str(sample_weight.shape) +
+          '. Expected sample_weight with rank '
+          'less than or equal to ' + str(len(y.shape)))
+
+    if y.shape[:sample_weight.ndim] != sample_weight.shape:
+      raise ValueError(
+          'Found a sample_weight array with shape ' + str(sample_weight.shape) +
+          ' for an input with shape ' + str(y.shape) + '. '
+          'sample_weight cannot be broadcast.')
+    return sample_weight
+  elif isinstance(class_weight, dict):
+    if len(y.shape) > 2:
+      raise ValueError('`class_weight` not supported for '
+                       '3+ dimensional targets.')
+    if y.shape[1] > 1:
+      y_classes = np.argmax(y, axis=1)
+    elif y.shape[1] == 1:
+      y_classes = np.reshape(y, y.shape[0])
+    else:
+      y_classes = y
+
+    weights = np.asarray(
+        [class_weight[cls] for cls in y_classes if cls in class_weight])
+
+    if len(weights) != len(y_classes):
+      # subtract the sets to pick all missing classes
+      existing_classes = set(y_classes)
+      existing_class_weight = set(class_weight.keys())
+      raise ValueError('`class_weight` must contain all classes in the data.'
+                       ' The classes %s exist in the data but not in '
+                       '`class_weight`.' %
+                       (existing_classes - existing_class_weight))
+    return weights
+  else:
+    return None
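Note on the `class_weight` branch of `standardize_weights` above: it turns a per-class weight dict into one weight per sample. The sketch below mirrors that mapping with plain NumPy so it can be run outside TensorFlow. The helper name `class_weights_to_sample_weights` is hypothetical and not part of this change, and unlike the patched function it checks `ndim` explicitly so that 1D targets are also accepted (an assumption made for the sketch).

```python
import numpy as np


def class_weights_to_sample_weights(y, class_weight):
  """Hypothetical stand-in for the `class_weight` branch of the patch."""
  y = np.asarray(y)
  if y.ndim > 2:
    raise ValueError('`class_weight` not supported for '
                     '3+ dimensional targets.')
  if y.ndim == 2 and y.shape[1] > 1:
    # One-hot (or probability) rows: the class is the argmax of each row.
    y_classes = np.argmax(y, axis=1)
  elif y.ndim == 2 and y.shape[1] == 1:
    # Column vector of integer classes: flatten to shape (samples,).
    y_classes = y.reshape(y.shape[0])
  else:
    # Assumption for this sketch: 1D targets already hold class indices.
    y_classes = y
  classes = y_classes.tolist()
  missing = set(classes) - set(class_weight)
  if missing:
    raise ValueError('`class_weight` must contain all classes in the data. '
                     'Missing: %s' % sorted(missing))
  return np.asarray([class_weight[cls] for cls in classes])


one_hot = np.array([[1, 0], [0, 1], [0, 1]])
weights = class_weights_to_sample_weights(one_hot, {0: 1.0, 1: 2.5})
print(weights)
```

Each sample receives the weight of its class, and a class that appears in the targets but not in the dict is reported as an error instead of being silently dropped.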
diff --git a/tensorflow/python/keras/_impl/keras/estimator.py b/tensorflow/python/keras/_impl/keras/estimator.py
index 0bf5bd41dc915fbecbce4c3a6191e925612dbebb..081f25e91452582afabee1e43f6e15f7e5d0013d 100644
--- a/tensorflow/python/keras/_impl/keras/estimator.py
+++ b/tensorflow/python/keras/_impl/keras/estimator.py
@@ -25,11 +25,15 @@ from tensorflow.python.client import session
 from tensorflow.python.estimator import estimator as estimator_lib
 from tensorflow.python.estimator import export as export_lib
 from tensorflow.python.estimator import model_fn as model_fn_lib
+from tensorflow.python.estimator import run_config as run_config_lib
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import random_seed
 from tensorflow.python.framework import sparse_tensor as sparse_tensor_lib
 from tensorflow.python.keras._impl.keras import backend as K
 from tensorflow.python.keras._impl.keras import models
+from tensorflow.python.keras._impl.keras import optimizers
+from tensorflow.python.keras._impl.keras.engine.base_layer import Layer
+from tensorflow.python.keras._impl.keras.engine.network import Network
 from tensorflow.python.keras._impl.keras.utils.generic_utils import CustomObjectScope
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import metrics as metrics_module
@@ -50,36 +54,174 @@ def _cast_tensor_to_floatx(x):
     return math_ops.cast(x, K.floatx())
 
 
-def _create_ordered_io(keras_model, estimator_io_dict, is_input=True):
+def _create_ordered_io(keras_model, estimator_io, is_input=True):
   """Create a list of tensors from IO dictionary based on Keras IO order.
 
   Args:
-    keras_model: an instance of compiled keras model.
-    estimator_io_dict: features or labels dictionary from model_fn.
+    keras_model: An instance of compiled keras model.
+    estimator_io: The features or labels (dict or plain array) from model_fn.
     is_input: True if dictionary is for inputs.
 
   Returns:
-    a list of tensors based on Keras IO order.
+    A list of tensors based on Keras IO order.
 
   Raises:
     ValueError: if dictionary keys cannot be found in Keras model input_names
       or output_names.
   """
-  if is_input:
-    keras_io_names = keras_model.input_names
+  if isinstance(estimator_io, (list, tuple)):
+    # Case currently not supported by most built-in input_fn,
+    # but it's good to have for sanity
+    return [_cast_tensor_to_floatx(x) for x in estimator_io]
+  elif isinstance(estimator_io, dict):
+    if is_input:
+      if keras_model._is_graph_network:
+        keras_io_names = keras_model.input_names
+      else:
+        keras_io_names = [
+            'input_%d' % i for i in range(1, len(estimator_io) + 1)]
+    else:
+      if keras_model._is_graph_network:
+        keras_io_names = keras_model.output_names
+      else:
+        keras_io_names = [
+            'output_%d' % i for i in range(1, len(estimator_io) + 1)]
+
+    for key in estimator_io:
+      if key not in keras_io_names:
+        raise ValueError(
+            'Cannot find %s with name "%s" in Keras Model. '
+            'It needs to match one '
+            'of the following: %s' % ('input' if is_input else 'output', key,
+                                      ', '.join(keras_io_names)))
+    tensors = [_cast_tensor_to_floatx(estimator_io[io_name])
+               for io_name in keras_io_names]
+    return tensors
   else:
-    keras_io_names = keras_model.output_names
+    # Plain array.
+    return _cast_tensor_to_floatx(estimator_io)
 
-  for key in estimator_io_dict:
-    if key not in keras_io_names:
-      raise ValueError(
-          'Cannot find %s with name "%s" in Keras Model. It needs to match '
-          'one of the following: %s' % ('input' if is_input else 'output', key,
-                                        ', '.join(keras_io_names)))
-  tensors = []
-  for io_name in keras_io_names:
-    tensors.append(_cast_tensor_to_floatx(estimator_io_dict[io_name]))
-  return tensors
+
+def _in_place_subclassed_model_reset(model):
+  """Substitute for model cloning that works for subclassed models.
+
+  Subclassed models cannot be cloned because their topology is not serializable.
+  To "instantiate" an identical model in a new TF graph, we reuse the original
+  model object, but we clear its state.
+
+  After calling this function on a model instance, you can use the model instance
+  as if it were a model clone (in particular you can use it in a new graph).
+
+  This method clears the state of the input model. It is thus destructive.
+  However the original state can be restored fully by calling
+  `_in_place_subclassed_model_state_restoration`.
+
+  Args:
+    model: Instance of a Keras model created via subclassing.
+
+  Raises:
+    ValueError: In case the model uses a subclassed model as inner layer.
+  """
+  assert not model._is_graph_network  # Only makes sense for subclassed networks
+  # Retrieve all layers tracked by the model as well as their attribute names
+  attributes_cache = {}
+  for name in dir(model):
+    try:
+      value = getattr(model, name)
+    except (AttributeError, ValueError, TypeError):
+      continue
+    if isinstance(value, Layer):
+      attributes_cache[name] = value
+      assert value in model._layers
+    elif isinstance(value, (list, tuple)) and name not in ('layers', '_layers'):
+      # Handle case: list/tuple of layers (also tracked by the Network API).
+      if value and all(isinstance(val, Layer) for val in value):
+        raise ValueError('We do not support the use of list-of-layers '
+                         'attributes in subclassed models used with '
+                         '`model_to_estimator` at this time. Found list '
+                         'attribute: %s' % name)
+
+  # Replace layers on the model with fresh layers
+  layers_to_names = {value: key for key, value in attributes_cache.items()}
+  original_layers = model._layers[:]
+  model._layers = []
+  for layer in original_layers:  # We preserve layer order.
+    config = layer.get_config()
+    # This will not work for nested subclassed models used as layers.
+    # This would be theoretically possible to support, but would add complexity.
+    # Only do it if users complain.
+    if isinstance(layer, Network) and not layer._is_graph_network:
+      raise ValueError('We do not support the use of nested subclassed models '
+                       'in `model_to_estimator` at this time. Found nested '
+                       'model: %s' % layer)
+    fresh_layer = layer.__class__.from_config(config)
+    name = layers_to_names[layer]
+    setattr(model, name, fresh_layer)
+
+  # Cache original model build attributes (in addition to layers)
+  if (not hasattr(model, '_original_attributes_cache') or
+      model._original_attributes_cache is None):
+    if model.built:
+      attributes_to_cache = [
+          'inputs',
+          'outputs',
+          '_feed_outputs',
+          '_feed_output_names',
+          '_feed_output_shapes',
+          '_feed_loss_fns',
+          'loss_weights_list',
+          'targets',
+          '_feed_targets',
+          'sample_weight_modes',
+          'weighted_metrics',
+          'metrics_names',
+          'metrics_tensors',
+          'metrics_updates',
+          'stateful_metric_names',
+          'total_loss',
+          'sample_weights',
+          '_feed_sample_weights',
+          'train_function',
+          'test_function',
+          'predict_function',
+          '_collected_trainable_weights',
+          '_feed_inputs',
+          '_feed_input_names',
+          '_feed_input_shapes',
+          'optimizer',
+      ]
+      for name in attributes_to_cache:
+        attributes_cache[name] = getattr(model, name)
+  model._original_attributes_cache = attributes_cache
+
+  # Reset built state
+  model.built = False
+  model.inputs = None
+  model.outputs = None
+
+
+def _in_place_subclassed_model_state_restoration(model):
+  """Restores the original state of a model after it was "reset".
+
+  This undoes the action of `_in_place_subclassed_model_reset`.
+
+  Args:
+    model: Instance of a Keras model created via subclassing, on which
+      `_in_place_subclassed_model_reset` was previously called.
+  """
+  assert not model._is_graph_network
+  # Restore layers and build attributes
+  if (hasattr(model, '_original_attributes_cache') and
+      model._original_attributes_cache is not None):
+    model._layers = []
+    for name, value in model._original_attributes_cache.items():
+      setattr(model, name, value)
+    model._original_attributes_cache = None
+  else:
+    # Restore to the state of a never-called model.
+    model.built = False
+    model.inputs = None
+    model.outputs = None
 
 
 def _clone_and_build_model(mode,
@@ -93,8 +235,8 @@ def _clone_and_build_model(mode,
     mode: training mode.
     keras_model: an instance of compiled keras model.
     custom_objects: Dictionary for custom objects.
-    features:
-    labels:
+    features: Dict of tensors.
+    labels: Dict of tensors, or single tensor instance.
 
   Returns:
     The newly built model.
@@ -102,33 +244,49 @@ def _clone_and_build_model(mode,
   # Set to True during training, False for inference.
   K.set_learning_phase(mode == model_fn_lib.ModeKeys.TRAIN)
 
-  # Clone keras model.
-  input_tensors = None if features is None else _create_ordered_io(
-      keras_model, features)
-  if custom_objects:
-    with CustomObjectScope(custom_objects):
+  # Get list of inputs.
+  if features is None:
+    input_tensors = None
+  else:
+    input_tensors = _create_ordered_io(keras_model,
+                                       estimator_io=features,
+                                       is_input=True)
+  # Get list of outputs.
+  if labels is None:
+    target_tensors = None
+  elif isinstance(labels, dict):
+    target_tensors = _create_ordered_io(keras_model,
+                                        estimator_io=labels,
+                                        is_input=False)
+  else:
+    target_tensors = [
+        _cast_tensor_to_floatx(
+            sparse_tensor_lib.convert_to_tensor_or_sparse_tensor(labels))
+    ]
+
+  if keras_model._is_graph_network:
+    if custom_objects:
+      with CustomObjectScope(custom_objects):
+        model = models.clone_model(keras_model, input_tensors=input_tensors)
+    else:
       model = models.clone_model(keras_model, input_tensors=input_tensors)
   else:
-    model = models.clone_model(keras_model, input_tensors=input_tensors)
+    model = keras_model
+    _in_place_subclassed_model_reset(model)
+    if input_tensors is not None:
+      model._set_inputs(input_tensors)
 
   # Compile/Build model
-  if mode is model_fn_lib.ModeKeys.PREDICT and not model.built:
-    model.build()
+  if mode is model_fn_lib.ModeKeys.PREDICT:
+    if isinstance(model, models.Sequential):
+      model.build()
   else:
-    optimizer_config = keras_model.optimizer.get_config()
-    optimizer = keras_model.optimizer.__class__.from_config(optimizer_config)
-    optimizer.iterations = training_util.get_or_create_global_step()
-
-    # Get list of outputs.
-    if labels is None:
-      target_tensors = None
-    elif isinstance(labels, dict):
-      target_tensors = _create_ordered_io(keras_model, labels, is_input=False)
+    if isinstance(keras_model.optimizer, optimizers.TFOptimizer):
+      optimizer = keras_model.optimizer
     else:
-      target_tensors = [
-          _cast_tensor_to_floatx(
-              sparse_tensor_lib.convert_to_tensor_or_sparse_tensor(labels))
-      ]
+      optimizer_config = keras_model.optimizer.get_config()
+      optimizer = keras_model.optimizer.__class__.from_config(optimizer_config)
+    optimizer.iterations = training_util.get_or_create_global_step()
 
     model.compile(
         optimizer,
@@ -168,10 +326,14 @@ def _create_keras_model_fn(keras_model, custom_objects=None):
 
     # Set loss and metric only during train and evaluate.
     if mode is not model_fn_lib.ModeKeys.PREDICT:
-      model._make_train_function()  # pylint: disable=protected-access
+      if mode is model_fn_lib.ModeKeys.TRAIN:
+        model._make_train_function()  # pylint: disable=protected-access
+      else:
+        model._make_test_function()  # pylint: disable=protected-access
       loss = model.total_loss
 
       if model.metrics:
+        # TODO(fchollet): support stateful metrics
         eval_metric_ops = {}
         # When each metric maps to an output
         if isinstance(model.metrics, dict):
@@ -195,6 +357,10 @@ def _create_keras_model_fn(keras_model, custom_objects=None):
     if mode is model_fn_lib.ModeKeys.TRAIN:
       train_op = model.train_function.updates_op
 
+    if not model._is_graph_network:
+      # Reset model state to original state,
+      # to avoid `model_fn` being destructive for the initial model argument.
+      _in_place_subclassed_model_state_restoration(keras_model)
     return model_fn_lib.EstimatorSpec(
         mode=mode,
         predictions=predictions,
@@ -274,10 +440,11 @@ def model_to_estimator(keras_model=None,
   """
   if (not keras_model) and (not keras_model_path):
     raise ValueError(
-        'Either keras_model or keras_model_path needs to be provided.')
+        'Either `keras_model` or `keras_model_path` needs to be provided.')
   if keras_model and keras_model_path:
     raise ValueError(
-        'Please specity either keras_model or keras_model_path but not both.')
+        'Please specify either `keras_model` or `keras_model_path`, '
+        'but not both.')
 
   if not keras_model:
     if keras_model_path.startswith(
@@ -288,18 +455,42 @@ def model_to_estimator(keras_model=None,
     logging.info('Loading models from %s', keras_model_path)
     keras_model = models.load_model(keras_model_path)
   else:
-    logging.info('Using the Keras model from memory.')
+    logging.info('Using the Keras model provided.')
     keras_model = keras_model
 
-  if not hasattr(keras_model, 'optimizer'):
+  if not hasattr(keras_model, 'optimizer') or not keras_model.optimizer:
     raise ValueError(
-        'Given keras model has not been compiled yet. Please compile first '
-        'before creating the estimator.')
+        'The given Keras model has not been compiled yet. Please compile first '
+        'before calling `model_to_estimator`.')
+
+  if isinstance(config, dict):
+    config = run_config_lib.RunConfig(**config)
 
-  keras_weights = keras_model.get_weights()
   keras_model_fn = _create_keras_model_fn(keras_model, custom_objects)
-  est = estimator_lib.Estimator(
+  estimator = estimator_lib.Estimator(
       keras_model_fn, model_dir=model_dir, config=config)
-  # TODO(yifeif): move checkpoint initialization to scaffold.init_fn
-  _save_first_checkpoint(keras_model, est, custom_objects, keras_weights)
-  return est
+
+  # Pass the config into keras backend's default session.
+  with session.Session(config=estimator._session_config) as sess:
+    K.set_session(sess)
+
+  keras_weights = keras_model.get_weights()
+  if keras_model._is_graph_network:
+    # TODO(yifeif): move checkpoint initialization to scaffold.init_fn
+    _save_first_checkpoint(keras_model,
+                           estimator,
+                           custom_objects,
+                           keras_weights)
+  elif keras_model.built:
+    logging.warning('You are creating an Estimator from a Keras model '
+                    'manually subclassed from `Model`, that was '
+                    'already called on some inputs (and thus already had '
+                    'weights). We are currently unable to preserve '
+                    'the model\'s state (its weights) '
+                    'as part of the estimator '
+                    'in this case. Be warned that the estimator '
+                    'has been created using '
+                    'a freshly initialized version of your model.\n'
+                    'Note that this doesn\'t affect the state of the '
+                    'model instance you passed as `keras_model` argument.')
+  return estimator
diff --git a/tensorflow/python/keras/_impl/keras/estimator_test.py b/tensorflow/python/keras/_impl/keras/estimator_test.py
index 88dd14b856a4ee9dfbee61d6fd1bdb96af24b50c..e076dc25b16900636313f0ddd85a61b8d917fc91 100644
--- a/tensorflow/python/keras/_impl/keras/estimator_test.py
+++ b/tensorflow/python/keras/_impl/keras/estimator_test.py
@@ -24,6 +24,7 @@ import tempfile
 
 import numpy as np
 
+from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.estimator import run_config as run_config_lib
 from tensorflow.python.estimator.inputs import numpy_io
 from tensorflow.python.framework import test_util
@@ -33,6 +34,7 @@ from tensorflow.python.keras._impl.keras.applications import mobilenet
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
 from tensorflow.python.summary.writer import writer_cache
+from tensorflow.python.training import rmsprop
 
 
 try:
@@ -63,12 +65,42 @@ def simple_functional_model():
   return model
 
 
-def get_resource_for_simple_model(is_sequential=True, is_evaluate=False):
-  model = simple_sequential_model(
-  ) if is_sequential else simple_functional_model()
-  if is_sequential:
+def simple_subclassed_model():
+
+  class SimpleModel(keras.Model):
+
+    def __init__(self):
+      super(SimpleModel, self).__init__()
+      self.dense1 = keras.layers.Dense(16, activation='relu')
+      self.dp = keras.layers.Dropout(0.1)
+      self.dense2 = keras.layers.Dense(_NUM_CLASS, activation='softmax')
+
+    def call(self, inputs):
+      x = self.dense1(inputs)
+      x = self.dp(x)
+      return self.dense2(x)
+
+  return SimpleModel()
+
+
+def get_resource_for_simple_model(model_type='sequential',
+                                  is_evaluate=False,):
+  if model_type == 'sequential':
+    model = simple_sequential_model()
     model.build()
-  input_name = model.input_names[0]
+  elif model_type == 'subclass':
+    model = simple_subclassed_model()
+  else:
+    assert model_type == 'functional'
+    model = simple_functional_model()
+
+  if model_type == 'subclass':
+    input_name = 'input_1'
+    output_name = 'output_1'
+  else:
+    input_name = model.input_names[0]
+    output_name = model.output_names[0]
+
   np.random.seed(_RANDOM_SEED)
   (x_train, y_train), (x_test, y_test) = testing_utils.get_test_data(
       train_samples=_TRAIN_SIZE,
@@ -79,17 +111,19 @@ def get_resource_for_simple_model(is_sequential=True, is_evaluate=False):
   y_test = keras.utils.to_categorical(y_test)
 
   train_input_fn = numpy_io.numpy_input_fn(
-      x={input_name: x_train},
-      y=y_train,
+      x=randomize_io_type(x_train, input_name),
+      y=randomize_io_type(y_train, output_name),
       shuffle=False,
       num_epochs=None,
       batch_size=16)
 
   evaluate_input_fn = numpy_io.numpy_input_fn(
-      x={input_name: x_test}, y=y_test, num_epochs=1, shuffle=False)
+      x=randomize_io_type(x_test, input_name),
+      y=randomize_io_type(y_test, output_name),
+      num_epochs=1, shuffle=False)
 
   predict_input_fn = numpy_io.numpy_input_fn(
-      x={input_name: x_test}, num_epochs=1, shuffle=False)
+      x=randomize_io_type(x_test, input_name), num_epochs=1, shuffle=False)
 
   inference_input_fn = evaluate_input_fn if is_evaluate else predict_input_fn
 
@@ -97,6 +131,14 @@ def get_resource_for_simple_model(is_sequential=True, is_evaluate=False):
                                      y_test), train_input_fn, inference_input_fn
 
 
+def randomize_io_type(array, name):
+  switch = np.random.random()
+  if switch > 0.5:
+    return array
+  else:
+    return {name: array}
+
+
 def multi_inputs_multi_outputs_model():
   # test multi-input layer
   a = keras.layers.Input(shape=(16,), name='input_a')
@@ -133,10 +175,10 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
       gfile.DeleteRecursively(self._base_dir)
 
   def test_train(self):
-    for is_sequential in [True, False]:
+    for model_type in ['sequential', 'functional']:
       keras_model, (_, _), (
           _, _), train_input_fn, eval_input_fn = get_resource_for_simple_model(
-              is_sequential=is_sequential, is_evaluate=True)
+              model_type=model_type, is_evaluate=True)
       keras_model.compile(
           loss='categorical_crossentropy',
           optimizer='rmsprop',
@@ -154,10 +196,87 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
       writer_cache.FileWriterCache.clear()
       gfile.DeleteRecursively(self._config.model_dir)
 
+  def test_train_with_tf_optimizer(self):
+    for model_type in ['sequential', 'functional']:
+      keras_model, (_, _), (
+          _, _), train_input_fn, eval_input_fn = get_resource_for_simple_model(
+              model_type=model_type, is_evaluate=True)
+      keras_model.compile(
+          loss='categorical_crossentropy',
+          optimizer=rmsprop.RMSPropOptimizer(1e-3),
+          metrics=['mse', keras.metrics.categorical_accuracy])
+
+      with self.test_session():
+        est_keras = keras.estimator.model_to_estimator(
+            keras_model=keras_model,
+            # Also use dict config argument to get test coverage for that line.
+            config={
+                'tf_random_seed': _RANDOM_SEED,
+                'model_dir': self._base_dir,
+            })
+        before_eval_results = est_keras.evaluate(
+            input_fn=eval_input_fn, steps=1)
+        est_keras.train(input_fn=train_input_fn, steps=_TRAIN_SIZE / 16)
+        after_eval_results = est_keras.evaluate(input_fn=eval_input_fn, steps=1)
+        self.assertLess(after_eval_results['loss'], before_eval_results['loss'])
+
+      writer_cache.FileWriterCache.clear()
+      gfile.DeleteRecursively(self._config.model_dir)
+
+  def test_train_with_subclassed_model(self):
+    keras_model, (_, _), (
+        _, _), train_input_fn, eval_input_fn = get_resource_for_simple_model(
+            model_type='subclass', is_evaluate=True)
+    keras_model.compile(
+        loss='categorical_crossentropy',
+        optimizer=rmsprop.RMSPropOptimizer(1e-3),
+        metrics=['mse', keras.metrics.categorical_accuracy])
+
+    with self.test_session():
+      est_keras = keras.estimator.model_to_estimator(
+          keras_model=keras_model, config=self._config)
+      est_keras.train(input_fn=train_input_fn, steps=_TRAIN_SIZE / 16)
+      before_eval_results = est_keras.evaluate(
+          input_fn=eval_input_fn, steps=1)
+      est_keras.train(input_fn=train_input_fn, steps=_TRAIN_SIZE / 16)
+      after_eval_results = est_keras.evaluate(input_fn=eval_input_fn, steps=1)
+      self.assertLess(after_eval_results['loss'], before_eval_results['loss'])
+
+  def test_train_with_subclassed_model_with_existing_state(self):
+    keras_model, (_, _), (
+        _, _), train_input_fn, eval_input_fn = get_resource_for_simple_model(
+            model_type='subclass', is_evaluate=True)
+    keras_model.compile(
+        loss='categorical_crossentropy',
+        optimizer=rmsprop.RMSPropOptimizer(1e-3),
+        metrics=['mse', keras.metrics.categorical_accuracy])
+
+    with self.test_session():
+      # Create state
+      keras_model.train_on_batch(np.random.random((10,) + _INPUT_SIZE),
+                                 np.random.random((10, _NUM_CLASS)))
+      original_preds = keras_model.predict(np.ones((10,) + _INPUT_SIZE))
+
+      est_keras = keras.estimator.model_to_estimator(
+          keras_model=keras_model, config=self._config)
+      est_keras.train(input_fn=train_input_fn, steps=_TRAIN_SIZE / 16)
+      before_eval_results = est_keras.evaluate(
+          input_fn=eval_input_fn, steps=1)
+      est_keras.train(input_fn=train_input_fn, steps=_TRAIN_SIZE / 16)
+      after_eval_results = est_keras.evaluate(input_fn=eval_input_fn, steps=1)
+      self.assertLess(after_eval_results['loss'], before_eval_results['loss'])
+
+      # Check that original model state was not altered
+      preds = keras_model.predict(np.ones((10,) + _INPUT_SIZE))
+      self.assertAllClose(original_preds, preds, atol=1e-5)
+      # Check that the original model compilation did not break
+      keras_model.train_on_batch(np.random.random((10,) + _INPUT_SIZE),
+                                 np.random.random((10, _NUM_CLASS)))
+
   def test_evaluate(self):
     keras_model, (x_train, y_train), (
         x_test, y_test), _, eval_input_fn = get_resource_for_simple_model(
-            is_sequential=False, is_evaluate=True)
+            model_type='functional', is_evaluate=True)
 
     with self.test_session():
       metrics = [
@@ -199,7 +318,7 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
     # Check that predict on a pretrained model yield the same result.
     keras_model, (x_train, y_train), (
         x_test, _), _, pred_input_fn = get_resource_for_simple_model(
-            is_sequential=True, is_evaluate=False)
+            model_type='sequential', is_evaluate=False)
 
     with self.test_session():
       keras_model.compile(
@@ -261,7 +380,7 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
 
     keras_model, (x_train, y_train), (
         x_test, _), _, pred_input_fn = get_resource_for_simple_model(
-            is_sequential=False, is_evaluate=False)
+            model_type='functional', is_evaluate=False)
 
     with self.test_session():
       keras_model.compile(
@@ -377,6 +496,22 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
             keras_model=keras_model,
             model_dir=tempfile.mkdtemp(dir=self._base_dir))
 
+  def test_gpu_config(self):
+    keras_model, (_, _), (_, _), _, _ = get_resource_for_simple_model()
+    keras_model.compile(
+        loss='categorical_crossentropy',
+        optimizer='rmsprop',
+        metrics=['mse', keras.metrics.categorical_accuracy])
+
+    gpu_options = config_pb2.GPUOptions(per_process_gpu_memory_fraction=0.3)
+    sess_config = config_pb2.ConfigProto(gpu_options=gpu_options)
+    self._config._session_config = sess_config
+    keras.estimator.model_to_estimator(
+        keras_model=keras_model, config=self._config)
+    self.assertEqual(keras.backend.get_session()
+                     ._config.gpu_options.per_process_gpu_memory_fraction,
+                     gpu_options.per_process_gpu_memory_fraction)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/_impl/keras/layers/convolutional_test.py b/tensorflow/python/keras/_impl/keras/layers/convolutional_test.py
index c612e97a9d67f7398c78a7da1107f8e067bf9371..f4a134b96cec0385cb24a208f3403db944b68edc 100644
--- a/tensorflow/python/keras/_impl/keras/layers/convolutional_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/convolutional_test.py
@@ -553,7 +553,7 @@ class ZeroPaddingTest(test.TestCase):
       layer = keras.layers.ZeroPadding1D(padding=2)
       layer.build(shape)
       output = layer(keras.backend.variable(inputs))
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         np_output = output.numpy()
       else:
         np_output = keras.backend.eval(output)
@@ -564,7 +564,7 @@ class ZeroPaddingTest(test.TestCase):
       layer = keras.layers.ZeroPadding1D(padding=(1, 2))
       layer.build(shape)
       output = layer(keras.backend.variable(inputs))
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         np_output = output.numpy()
       else:
         np_output = keras.backend.eval(output)
@@ -610,7 +610,7 @@ class ZeroPaddingTest(test.TestCase):
             padding=(2, 2), data_format=data_format)
         layer.build(inputs.shape)
         output = layer(keras.backend.variable(inputs))
-        if context.in_eager_mode():
+        if context.executing_eagerly():
           np_output = output.numpy()
         else:
           np_output = keras.backend.eval(output)
@@ -629,7 +629,7 @@ class ZeroPaddingTest(test.TestCase):
             padding=((1, 2), (3, 4)), data_format=data_format)
         layer.build(inputs.shape)
         output = layer(keras.backend.variable(inputs))
-        if context.in_eager_mode():
+        if context.executing_eagerly():
           np_output = output.numpy()
         else:
           np_output = keras.backend.eval(output)
@@ -683,7 +683,7 @@ class ZeroPaddingTest(test.TestCase):
       layer = keras.layers.ZeroPadding3D(padding=(2, 2, 2))
       layer.build(inputs.shape)
       output = layer(keras.backend.variable(inputs))
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         np_output = output.numpy()
       else:
         np_output = keras.backend.eval(output)
@@ -737,7 +737,7 @@ class UpSamplingTest(test.TestCase):
                 size=(length_row, length_col), data_format=data_format)
             layer.build(inputs.shape)
             output = layer(keras.backend.variable(inputs))
-            if context.in_eager_mode():
+            if context.executing_eagerly():
               np_output = output.numpy()
             else:
               np_output = keras.backend.eval(output)
@@ -790,7 +790,7 @@ class UpSamplingTest(test.TestCase):
                   data_format=data_format)
               layer.build(inputs.shape)
               output = layer(keras.backend.variable(inputs))
-              if context.in_eager_mode():
+              if context.executing_eagerly():
                 np_output = output.numpy()
               else:
                 np_output = keras.backend.eval(output)
@@ -865,7 +865,7 @@ class CroppingTest(test.TestCase):
             cropping=cropping, data_format=data_format)
         layer.build(inputs.shape)
         output = layer(keras.backend.variable(inputs))
-        if context.in_eager_mode():
+        if context.executing_eagerly():
           np_output = output.numpy()
         else:
           np_output = keras.backend.eval(output)
@@ -892,7 +892,7 @@ class CroppingTest(test.TestCase):
             cropping=cropping, data_format=data_format)
         layer.build(inputs.shape)
         output = layer(keras.backend.variable(inputs))
-        if context.in_eager_mode():
+        if context.executing_eagerly():
           np_output = output.numpy()
         else:
           np_output = keras.backend.eval(output)
@@ -937,7 +937,7 @@ class CroppingTest(test.TestCase):
                 cropping=cropping, data_format=data_format)
             layer.build(inputs.shape)
             output = layer(keras.backend.variable(inputs))
-            if context.in_eager_mode():
+            if context.executing_eagerly():
               np_output = output.numpy()
             else:
               np_output = keras.backend.eval(output)
@@ -954,7 +954,7 @@ class CroppingTest(test.TestCase):
                                     cropping[2][0]:-cropping[2][1], :]
             np.testing.assert_allclose(np_output, expected_out)
 
-     # test incorrect use
+    # test incorrect use
     with self.assertRaises(ValueError):
       keras.layers.Cropping3D(cropping=(1, 1))
     with self.assertRaises(ValueError):
diff --git a/tensorflow/python/keras/_impl/keras/layers/core.py b/tensorflow/python/keras/_impl/keras/layers/core.py
index 50a197c80c3d97f47a071a24297301dddf78a27e..73e4f15f7e259211c892fdc663e14dcb14aec58d 100644
--- a/tensorflow/python/keras/_impl/keras/layers/core.py
+++ b/tensorflow/python/keras/_impl/keras/layers/core.py
@@ -124,7 +124,7 @@ class Dropout(tf_core_layers.Dropout, Layer):
       training = K.learning_phase()
     output = super(Dropout, self).call(inputs, training=training)
     # EagerTensor object has no attribute _uses_learning_phase
-    if not context.in_eager_mode() and training is K.learning_phase():
+    if not context.executing_eagerly() and training is K.learning_phase():
       output._uses_learning_phase = True  # pylint: disable=protected-access
     return output
 
diff --git a/tensorflow/python/keras/_impl/keras/layers/normalization.py b/tensorflow/python/keras/_impl/keras/layers/normalization.py
index 0dedd5e8daa2974038c90ae2e8c68ca6516ba725..3b44b20bf822429351002c0f81fe8f9596d595d3 100644
--- a/tensorflow/python/keras/_impl/keras/layers/normalization.py
+++ b/tensorflow/python/keras/_impl/keras/layers/normalization.py
@@ -111,7 +111,7 @@ class BatchNormalization(tf_normalization_layers.BatchNormalization, Layer):
     if training is None:
       training = K.learning_phase()
     output = super(BatchNormalization, self).call(inputs, training=training)
-    if context.in_graph_mode() and training is K.learning_phase():
+    if not context.executing_eagerly() and training is K.learning_phase():
       output._uses_learning_phase = True  # pylint: disable=protected-access
     return output
 
diff --git a/tensorflow/python/keras/_impl/keras/layers/pooling_test.py b/tensorflow/python/keras/_impl/keras/layers/pooling_test.py
index 70049f0976b7170005183bb4b028079b39a23afb..bb003c1dddf80e2a745c1268a3a7d045f4e8b036 100644
--- a/tensorflow/python/keras/_impl/keras/layers/pooling_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/pooling_test.py
@@ -105,7 +105,7 @@ class Pooling2DTest(test.TestCase):
 
     # This part of the test can only run on GPU but doesn't appear
     # to be properly assigned to a GPU when running in eager mode.
-    if not context.in_eager_mode():
+    if not context.executing_eagerly():
       # Only runs on GPU with CUDA, channels_first is not supported on CPU.
       # TODO(b/62340061): Support channels_first on CPU.
       if test.is_gpu_available(cuda_only=True):
diff --git a/tensorflow/python/keras/_impl/keras/layers/recurrent.py b/tensorflow/python/keras/_impl/keras/layers/recurrent.py
index 0264c7ae0119b36261a0a5467576c47a12a30801..791f9b311300ed05591083d551c040eb25ac8e22 100644
--- a/tensorflow/python/keras/_impl/keras/layers/recurrent.py
+++ b/tensorflow/python/keras/_impl/keras/layers/recurrent.py
@@ -546,8 +546,8 @@ class RNN(Layer):
         raise ValueError('The initial state or constants of an RNN'
                          ' layer cannot be specified with a mix of'
                          ' Keras tensors and non-Keras tensors'
-                         '(a "Keras tensor" is a tensor that was'
-                         'returned by a Keras layer, or by `Input`)')
+                         ' (a "Keras tensor" is a tensor that was'
+                         ' returned by a Keras layer, or by `Input`)')
 
     if is_keras_tensor:
       # Compute the full input spec, including state and constants
@@ -936,7 +936,7 @@ class SimpleRNNCell(Layer):
 
     # Properly set learning phase on output tensor.
     if 0 < self.dropout + self.recurrent_dropout:
-      if training is None and not context.in_eager_mode():
+      if training is None and not context.executing_eagerly():
         # This would be harmless to set in eager mode, but eager tensors
         # disallow setting arbitrary attributes.
         output._uses_learning_phase = True
@@ -1384,7 +1384,7 @@ class GRUCell(Layer):
       hh = self.activation(x_h + recurrent_h)
     h = z * h_tm1 + (1 - z) * hh
     if 0 < self.dropout + self.recurrent_dropout:
-      if training is None and not context.in_eager_mode():
+      if training is None and not context.executing_eagerly():
         # This would be harmless to set in eager mode, but eager tensors
         # disallow setting arbitrary attributes.
         h._uses_learning_phase = True
@@ -1877,7 +1877,7 @@ class LSTMCell(Layer):
 
     h = o * self.activation(c)
     if 0 < self.dropout + self.recurrent_dropout:
-      if training is None and not context.in_eager_mode():
+      if training is None and not context.executing_eagerly():
         # This would be harmless to set in eager mode, but eager tensors
         # disallow setting arbitrary attributes.
         h._uses_learning_phase = True
diff --git a/tensorflow/python/keras/_impl/keras/model_subclassing_test.py b/tensorflow/python/keras/_impl/keras/model_subclassing_test.py
index 3d71a620fcb34d21c41f920eed99b1fe22668899..58b144365be6cd8ea5b2ea82e275eacdee6b6c84 100644
--- a/tensorflow/python/keras/_impl/keras/model_subclassing_test.py
+++ b/tensorflow/python/keras/_impl/keras/model_subclassing_test.py
@@ -174,19 +174,18 @@ class ModelSubclassingTest(test.TestCase):
     num_samples = 100
     input_dim = 50
 
-    with self.test_session():
-      model = SimpleTestModel(num_classes=num_classes,
-                              use_dp=True,
-                              use_bn=True)
-      model.compile(loss='mse',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=['acc'])
+    model = SimpleTestModel(num_classes=num_classes,
+                            use_dp=True,
+                            use_bn=True)
+    model.compile(loss='mse',
+                  optimizer=RMSPropOptimizer(learning_rate=0.001),
+                  metrics=['acc'])
 
-      x = np.ones((num_samples, input_dim))
-      y = np.zeros((num_samples, num_classes))
+    x = np.ones((num_samples, input_dim))
+    y = np.zeros((num_samples, num_classes))
 
-      model.fit(x, y, epochs=2, batch_size=32, verbose=0)
-      _ = model.evaluate(x, y, verbose=0)
+    model.fit(x, y, epochs=2, batch_size=32, verbose=0)
+    _ = model.evaluate(x, y, verbose=0)
 
   @test_util.run_in_graph_and_eager_modes()
   def test_multi_io_workflow_with_np_arrays(self):
@@ -194,21 +193,20 @@ class ModelSubclassingTest(test.TestCase):
     num_samples = 1000
     input_dim = 50
 
-    with self.test_session():
-      model = MultiIOTestModel(num_classes=num_classes,
-                               use_dp=True,
-                               use_bn=True)
-      model.compile(loss='mse',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=['acc'])
+    model = MultiIOTestModel(num_classes=num_classes,
+                             use_dp=True,
+                             use_bn=True)
+    model.compile(loss='mse',
+                  optimizer=RMSPropOptimizer(learning_rate=0.001),
+                  metrics=['acc'])
 
-      x1 = np.ones((num_samples, input_dim))
-      x2 = np.ones((num_samples, input_dim))
-      y1 = np.zeros((num_samples, num_classes[0]))
-      y2 = np.zeros((num_samples, num_classes[1]))
+    x1 = np.ones((num_samples, input_dim))
+    x2 = np.ones((num_samples, input_dim))
+    y1 = np.zeros((num_samples, num_classes[0]))
+    y2 = np.zeros((num_samples, num_classes[1]))
 
-      model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0)
-      _ = model.evaluate([x1, x2], [y1, y2], verbose=0)
+    model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0)
+    _ = model.evaluate([x1, x2], [y1, y2], verbose=0)
 
   def test_single_io_workflow_with_tensors(self):
 
@@ -321,14 +319,13 @@ class ModelSubclassingTest(test.TestCase):
     x = np.ones((num_samples, input_dim))
     y = np.ones((num_samples, input_dim))
 
-    with self.test_session():
-      model = BNNet()
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      y_ref = model.predict(x)
+    model = BNNet()
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    y_ref = model.predict(x)
 
-      model.train_on_batch(x, y)
-      y_new = model.predict(x)
-      self.assertGreater(np.sum(np.abs(y_ref - y_new)), 0.1)
+    model.train_on_batch(x, y)
+    y_new = model.predict(x)
+    self.assertGreater(np.sum(np.abs(y_ref - y_new)), 0.1)
 
   @test_util.run_in_graph_and_eager_modes()
   def test_training_and_inference_behavior(self):
@@ -350,14 +347,13 @@ class ModelSubclassingTest(test.TestCase):
         x = self.dp(inputs)
         return self.dense(x)
 
-    with self.test_session():
-      model = DPNet()
-      x = np.ones((num_samples, input_dim))
-      y = model.predict(x)
-      self.assertEqual(np.sum(y), np.sum(x))
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      loss = model.train_on_batch(x, y)
-      self.assertGreater(loss, 0.1)
+    model = DPNet()
+    x = np.ones((num_samples, input_dim))
+    y = model.predict(x)
+    self.assertEqual(np.sum(y), np.sum(x))
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    loss = model.train_on_batch(x, y)
+    self.assertGreater(loss, 0.1)
 
   @test_util.run_in_graph_and_eager_modes()
   def test_training_methods(self):
@@ -373,21 +369,20 @@ class ModelSubclassingTest(test.TestCase):
     y1 = np.zeros((num_samples, num_classes[0]))
     y2 = np.zeros((num_samples, num_classes[1]))
 
-    with self.test_session():
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0)
-      model.fit({'input_1': x1, 'input_2': x2},
-                {'output_1': y1, 'output_2': y2},
-                epochs=2, batch_size=32)
-      model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0,
-                validation_data=([x1, x2], [y1, y2]))
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0)
+    model.fit({'input_1': x1, 'input_2': x2},
+              {'output_1': y1, 'output_2': y2},
+              epochs=2, batch_size=32)
+    model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0,
+              validation_data=([x1, x2], [y1, y2]))
 
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      model.train_on_batch([x1, x2], [y1, y2])
-      model.train_on_batch({'input_1': x1, 'input_2': x2},
-                           {'output_1': y1, 'output_2': y2})
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    model.train_on_batch([x1, x2], [y1, y2])
+    model.train_on_batch({'input_1': x1, 'input_2': x2},
+                         {'output_1': y1, 'output_2': y2})
 
   @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
   def test_inference_methods(self):
@@ -402,17 +397,16 @@ class ModelSubclassingTest(test.TestCase):
     y1 = np.zeros((num_samples, num_classes[0]))
     y2 = np.zeros((num_samples, num_classes[1]))
 
-    with self.test_session():
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      model.evaluate([x1, x2], [y1, y2])
-      model.test_on_batch([x1, x2], [y1, y2])
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    model.evaluate([x1, x2], [y1, y2])
+    model.test_on_batch([x1, x2], [y1, y2])
 
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.predict([x1, x2])
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.predict([x1, x2])
 
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.predict_on_batch([x1, x2])
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.predict_on_batch([x1, x2])
 
   @test_util.run_in_graph_and_eager_modes()
   def test_trainable_mutation(self):
@@ -435,26 +429,25 @@ class ModelSubclassingTest(test.TestCase):
     y1 = np.zeros((num_samples, num_classes[0]))
     y2 = np.zeros((num_samples, num_classes[1]))
 
-    with self.test_session():
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0)
-      y_ref_1, y_ref_2 = model.predict([x1, x2])
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0)
+    y_ref_1, y_ref_2 = model.predict([x1, x2])
 
-      fd, fname = tempfile.mkstemp('.h5')
-      model.save_weights(fname)
+    fd, fname = tempfile.mkstemp('.h5')
+    model.save_weights(fname)
 
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      # need to build the model before loading weights
-      # (otherwise no weights to load)
-      model._set_inputs([x1, x2])
-      model.load_weights(fname)
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    # need to build the model before loading weights
+    # (otherwise no weights to load)
+    model._set_inputs([x1, x2])
+    model.load_weights(fname)
 
-      y1, y2 = model.predict([x1, x2])
-      self.assertAllClose(y_ref_1, y1, atol=1e-5)
-      self.assertAllClose(y_ref_2, y2, atol=1e-5)
-      os.close(fd)
-      os.remove(fname)
+    y1, y2 = model.predict([x1, x2])
+    self.assertAllClose(y_ref_1, y1, atol=1e-5)
+    self.assertAllClose(y_ref_2, y2, atol=1e-5)
+    os.close(fd)
+    os.remove(fname)
 
   @test_util.run_in_graph_and_eager_modes()
   def test_summary(self):
@@ -488,23 +481,22 @@ class ModelSubclassingTest(test.TestCase):
     num_samples = 100
     input_dim = 50
 
-    with self.test_session():
-      model = NestedTestModel1(num_classes=num_classes)
-      model.compile(loss='mse',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=['acc'])
+    model = NestedTestModel1(num_classes=num_classes)
+    model.compile(loss='mse',
+                  optimizer=RMSPropOptimizer(learning_rate=0.001),
+                  metrics=['acc'])
 
-      x = np.ones((num_samples, input_dim))
-      y = np.zeros((num_samples, num_classes))
+    x = np.ones((num_samples, input_dim))
+    y = np.zeros((num_samples, num_classes))
 
-      model.fit(x, y, epochs=2, batch_size=32, verbose=0)
-      _ = model.evaluate(x, y, verbose=0)
+    model.fit(x, y, epochs=2, batch_size=32, verbose=0)
+    _ = model.evaluate(x, y, verbose=0)
 
-      self.assertEqual(len(model.weights), 8 + len(model.test_net.weights))
-      self.assertEqual(len(model.non_trainable_weights),
-                       2 + len(model.test_net.non_trainable_weights))
-      self.assertEqual(len(model.trainable_weights),
-                       6 + len(model.test_net.trainable_weights))
+    self.assertEqual(len(model.weights), 8 + len(model.test_net.weights))
+    self.assertEqual(len(model.non_trainable_weights),
+                     2 + len(model.test_net.non_trainable_weights))
+    self.assertEqual(len(model.trainable_weights),
+                     6 + len(model.test_net.trainable_weights))
 
   @test_util.run_in_graph_and_eager_modes()
   def test_graph_nested_in_subclass(self):
@@ -512,23 +504,22 @@ class ModelSubclassingTest(test.TestCase):
     num_samples = 100
     input_dim = 50
 
-    with self.test_session():
-      model = NestedTestModel2(num_classes=num_classes)
-      model.compile(loss='mse',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=['acc'])
+    model = NestedTestModel2(num_classes=num_classes)
+    model.compile(loss='mse',
+                  optimizer=RMSPropOptimizer(learning_rate=0.001),
+                  metrics=['acc'])
 
-      x = np.ones((num_samples, input_dim))
-      y = np.zeros((num_samples, num_classes))
+    x = np.ones((num_samples, input_dim))
+    y = np.zeros((num_samples, num_classes))
 
-      model.fit(x, y, epochs=2, batch_size=32, verbose=0)
-      _ = model.evaluate(x, y, verbose=0)
+    model.fit(x, y, epochs=2, batch_size=32, verbose=0)
+    _ = model.evaluate(x, y, verbose=0)
 
-      self.assertEqual(len(model.weights), 8 + len(model.test_net.weights))
-      self.assertEqual(len(model.non_trainable_weights),
-                       2 + len(model.test_net.non_trainable_weights))
-      self.assertEqual(len(model.trainable_weights),
-                       6 + len(model.test_net.trainable_weights))
+    self.assertEqual(len(model.weights), 8 + len(model.test_net.weights))
+    self.assertEqual(len(model.non_trainable_weights),
+                     2 + len(model.test_net.non_trainable_weights))
+    self.assertEqual(len(model.trainable_weights),
+                     6 + len(model.test_net.trainable_weights))
 
   @test_util.run_in_graph_and_eager_modes()
   def test_subclass_nested_in_graph(self):
@@ -536,22 +527,21 @@ class ModelSubclassingTest(test.TestCase):
     num_samples = 100
     input_dim = 50
 
-    with self.test_session():
-      model = get_nested_model_3(input_dim=input_dim, num_classes=num_classes)
-      model.compile(loss='mse',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=['acc'])
+    model = get_nested_model_3(input_dim=input_dim, num_classes=num_classes)
+    model.compile(loss='mse',
+                  optimizer=RMSPropOptimizer(learning_rate=0.001),
+                  metrics=['acc'])
 
-      x = np.ones((num_samples, input_dim))
-      y = np.zeros((num_samples, num_classes))
+    x = np.ones((num_samples, input_dim))
+    y = np.zeros((num_samples, num_classes))
 
-      model.fit(x, y, epochs=2, batch_size=32, verbose=0)
-      _ = model.evaluate(x, y, verbose=0)
+    model.fit(x, y, epochs=2, batch_size=32, verbose=0)
+    _ = model.evaluate(x, y, verbose=0)
 
-      self.assertEqual(len(model.weights), 16)
-      self.assertEqual(
-          len(model.non_trainable_weights), 4)
-      self.assertEqual(len(model.trainable_weights), 12)
+    self.assertEqual(len(model.weights), 16)
+    self.assertEqual(
+        len(model.non_trainable_weights), 4)
+    self.assertEqual(len(model.trainable_weights), 12)
 
   @test_util.run_in_graph_and_eager_modes()
   def test_support_for_manual_training_arg(self):
@@ -575,14 +565,13 @@ class ModelSubclassingTest(test.TestCase):
         x = self.dp(inputs, training=training)
         return self.dense(x)
 
-    with self.test_session():
-      model = DPNet()
-      x = np.ones((10, 10))
-      y = model.predict(x)
-      self.assertEqual(np.sum(y), np.sum(x))
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      loss = model.train_on_batch(x, y)
-      self.assertGreater(loss, 0.1)
+    model = DPNet()
+    x = np.ones((10, 10))
+    y = model.predict(x)
+    self.assertEqual(np.sum(y), np.sum(x))
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    loss = model.train_on_batch(x, y)
+    self.assertGreater(loss, 0.1)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/keras/_impl/keras/optimizers.py b/tensorflow/python/keras/_impl/keras/optimizers.py
index 6520128c5b65451aef20ec9626245fba5ef29927..b715d722b98b9db3bdf0985da0954356a2facdfe 100644
--- a/tensorflow/python/keras/_impl/keras/optimizers.py
+++ b/tensorflow/python/keras/_impl/keras/optimizers.py
@@ -95,7 +95,26 @@ class Optimizer(object):
     raise NotImplementedError
 
   def get_gradients(self, loss, params):
+    """Returns gradients of `loss` with respect to `params`.
+
+    Arguments:
+        loss: Loss tensor.
+        params: List of variables.
+
+    Returns:
+        List of gradient tensors.
+
+    Raises:
+        ValueError: In case any gradient cannot be computed (e.g. if gradient
+          function not implemented).
+    """
     grads = K.gradients(loss, params)
+    if None in grads:
+      raise ValueError('An operation has `None` for gradient. '
+                       'Please make sure that all of your ops have a '
+                       'gradient defined (i.e. are differentiable). '
+                       'Common ops without gradient: '
+                       'K.argmax, K.round, K.eval.')
     if hasattr(self, 'clipnorm') and self.clipnorm > 0:
       norm = K.sqrt(sum([K.sum(K.square(g)) for g in grads]))
       grads = [clip_norm(g, self.clipnorm, norm) for g in grads]
@@ -120,6 +139,11 @@ class Optimizer(object):
         ValueError: in case of incompatible weight shapes.
     """
     params = self.weights
+    if len(params) != len(weights):
+      raise ValueError(
+          'Length of the specified weight list (' + str(len(weights)) +
+          ') does not match the number of weights '
+          'of the optimizer (' + str(len(params)) + ')')
     weight_value_tuples = []
     param_values = K.batch_get_value(params)
     for pv, p, w in zip(param_values, params, weights):
diff --git a/tensorflow/python/keras/_impl/keras/preprocessing/image.py b/tensorflow/python/keras/_impl/keras/preprocessing/image.py
index d12f10863921ee7d635930f34e8bc701c89864e8..6299445c34b99f20d7ae5090fc979d0ac2611109 100644
--- a/tensorflow/python/keras/_impl/keras/preprocessing/image.py
+++ b/tensorflow/python/keras/_impl/keras/preprocessing/image.py
@@ -43,6 +43,7 @@ except ImportError:
 
 
 try:
+  from PIL import ImageEnhance
   from PIL import Image as pil_image
 except ImportError:
   pil_image = None
@@ -227,6 +228,32 @@ def random_channel_shift(x, intensity, channel_axis=0):
   return x
 
 
+@tf_export('keras.preprocessing.image.random_brightness')
+def random_brightness(x, brightness_range):
+  """Performs a random adjustment of brightness of a Numpy image tensor.
+
+  Arguments:
+      x: Input tensor. Must be 3D.
+      brightness_range: Tuple of floats; range to pick a brightness value from.
+
+  Returns:
+      Brightness adjusted Numpy image tensor.
+
+  Raises:
+      ValueError: if `brightness_range` doesn't contain exactly two values.
+  """
+  if len(brightness_range) != 2:
+    raise ValueError('`brightness_range` should be a tuple or list of two '
+                     'floats. Received arg: ', brightness_range)
+
+  x = array_to_img(x)
+  x = ImageEnhance.Brightness(x)
+  u = np.random.uniform(brightness_range[0], brightness_range[1])
+  x = x.enhance(u)
+  x = img_to_array(x)
+  return x
+
+
 def transform_matrix_offset_center(matrix, x, y):
   o_x = float(x) / 2 + 0.5
   o_y = float(y) / 2 + 0.5
@@ -265,7 +292,7 @@ def apply_transform(x,
           x_channel,
           final_affine_matrix,
           final_offset,
-          order=0,
+          order=1,
           mode=fill_mode,
           cval=cval) for x_channel in x
   ]
@@ -436,6 +463,7 @@ class ImageDataGenerator(object):
       rotation_range: degrees (0 to 180).
       width_shift_range: fraction of total width, if < 1, or pixels if >= 1.
       height_shift_range: fraction of total height, if < 1, or pixels if >= 1.
+      brightness_range: tuple of two floats; range to pick brightness from.
       shear_range: shear intensity (shear angle in degrees).
       zoom_range: amount of zoom. if scalar z, zoom will be randomly picked
           in the range [1-z, 1+z]. A sequence of two can be passed instead
@@ -469,6 +497,8 @@ class ImageDataGenerator(object):
           It defaults to the `image_data_format` value found in your
           Keras config file at `~/.keras/keras.json`.
           If you never set it, then it will be "channels_last".
+      validation_split: fraction of images reserved for validation (strictly
+        between 0 and 1).
   """
 
   def __init__(self,
@@ -481,6 +511,7 @@ class ImageDataGenerator(object):
                rotation_range=0.,
                width_shift_range=0.,
                height_shift_range=0.,
+               brightness_range=None,
                shear_range=0.,
                zoom_range=0.,
                channel_shift_range=0.,
@@ -490,7 +521,8 @@ class ImageDataGenerator(object):
                vertical_flip=False,
                rescale=None,
                preprocessing_function=None,
-               data_format=None):
+               data_format=None,
+               validation_split=0.0):
     if data_format is None:
       data_format = K.image_data_format()
     self.featurewise_center = featurewise_center
@@ -502,6 +534,7 @@ class ImageDataGenerator(object):
     self.rotation_range = rotation_range
     self.width_shift_range = width_shift_range
     self.height_shift_range = height_shift_range
+    self.brightness_range = brightness_range
     self.shear_range = shear_range
     self.zoom_range = zoom_range
     self.channel_shift_range = channel_shift_range
@@ -526,6 +559,10 @@ class ImageDataGenerator(object):
       self.channel_axis = 3
       self.row_axis = 1
       self.col_axis = 2
+    if validation_split and not 0 < validation_split < 1:
+      raise ValueError('`validation_split` must be strictly between 0 and 1. '
+                       'Received arg: ', validation_split)
+    self.validation_split = validation_split
 
     self.mean = None
     self.std = None
@@ -574,7 +611,8 @@ class ImageDataGenerator(object):
            seed=None,
            save_to_dir=None,
            save_prefix='',
-           save_format='png'):
+           save_format='png',
+           subset=None):
     return NumpyArrayIterator(
         x,
         y,
@@ -585,7 +623,8 @@ class ImageDataGenerator(object):
         data_format=self.data_format,
         save_to_dir=save_to_dir,
         save_prefix=save_prefix,
-        save_format=save_format)
+        save_format=save_format,
+        subset=subset)
 
   def flow_from_directory(self,
                           directory,
@@ -600,6 +639,7 @@ class ImageDataGenerator(object):
                           save_prefix='',
                           save_format='png',
                           follow_links=False,
+                          subset=None,
                           interpolation='nearest'):
     return DirectoryIterator(
         directory,
@@ -616,6 +656,7 @@ class ImageDataGenerator(object):
         save_prefix=save_prefix,
         save_format=save_format,
         follow_links=follow_links,
+        subset=subset,
         interpolation=interpolation)
 
   def standardize(self, x):
@@ -628,7 +669,7 @@ class ImageDataGenerator(object):
         The inputs, normalized.
     """
     if self.preprocessing_function:
-      x = self.preprocessing_function(x)
+      x = self.image_data_generator.preprocessing_function(x)
     if self.rescale:
       x *= self.rescale
     if self.samplewise_center:
@@ -762,6 +803,9 @@ class ImageDataGenerator(object):
       if np.random.random() < 0.5:
         x = flip_axis(x, img_row_axis)
 
+    if self.brightness_range is not None:
+      x = random_brightness(x, self.brightness_range)
+
     return x
 
   def fit(self, x, augment=False, rounds=1, seed=None):
@@ -828,12 +872,10 @@ class ImageDataGenerator(object):
         raise ImportError('Scipy is required for zca_whitening.')
 
       flat_x = np.reshape(x, (x.shape[0], x.shape[1] * x.shape[2] * x.shape[3]))
-      num_examples = flat_x.shape[0]
-      _, s, vt = linalg.svd(flat_x / np.sqrt(num_examples))
-      s_expand = np.hstack(
-          (s, np.zeros(vt.shape[0] - num_examples, dtype=flat_x.dtype)))
-      self.principal_components = (
-          vt.T / np.sqrt(s_expand**2 + self.zca_epsilon)).dot(vt)
+      sigma = np.dot(flat_x.T, flat_x) / flat_x.shape[0]
+      u, s, _ = linalg.svd(sigma)
+      s_inv = 1. / np.sqrt(s[np.newaxis] + self.zca_epsilon)
+      self.principal_components = (u * s_inv).dot(u.T)
 
 
 @tf_export('keras.preprocessing.image.Iterator')
@@ -947,6 +989,8 @@ class NumpyArrayIterator(Iterator):
           images (if `save_to_dir` is set).
       save_format: Format to use for saving sample images
           (if `save_to_dir` is set).
+      subset: Subset of data (`"training"` or `"validation"`) if
+          validation_split is set in ImageDataGenerator.
   """
 
   def __init__(self,
@@ -959,17 +1003,29 @@ class NumpyArrayIterator(Iterator):
                data_format=None,
                save_to_dir=None,
                save_prefix='',
-               save_format='png'):
+               save_format='png',
+               subset=None):
     if y is not None and len(x) != len(y):
-      raise ValueError('X (images tensor) and y (labels) '
+      raise ValueError('`x` (images tensor) and `y` (labels) '
                        'should have the same length. '
-                       'Found: X.shape = %s, y.shape = %s' %
+                       'Found: x.shape = %s, y.shape = %s' %
                        (np.asarray(x).shape, np.asarray(y).shape))
-
+    if subset is not None:
+      if subset not in {'training', 'validation'}:
+        raise ValueError('Invalid subset name:', subset,
+                         '; expected "training" or "validation".')
+      split_idx = int(len(x) * image_data_generator.validation_split)
+      if subset == 'validation':
+        x = x[:split_idx]
+        if y is not None:
+          y = y[:split_idx]
+      else:
+        x = x[split_idx:]
+        if y is not None:
+          y = y[split_idx:]
     if data_format is None:
       data_format = K.image_data_format()
     self.x = np.asarray(x, dtype=K.floatx())
-
     if self.x.ndim != 4:
       raise ValueError('Input data in `NumpyArrayIterator` '
                        'should have rank 4. You passed an array '
@@ -1032,8 +1088,7 @@ class NumpyArrayIterator(Iterator):
     return self._get_batches_of_transformed_samples(index_array)
 
 
-def _count_valid_files_in_directory(directory, white_list_formats,
-                                    follow_links):
+def _iter_valid_files(directory, white_list_formats, follow_links):
   """Count files with extension in `white_list_formats` contained in directory.
 
   Arguments:
@@ -1043,29 +1098,54 @@ def _count_valid_files_in_directory(directory, white_list_formats,
           the files to be counted.
       follow_links: boolean.
 
-  Returns:
-      the count of files with extension in `white_list_formats` contained in
-      the directory.
+  Yields:
+      tuple of (root, filename) with extension in `white_list_formats`.
   """
 
   def _recursive_list(subpath):
     return sorted(
-        os.walk(subpath, followlinks=follow_links), key=lambda tpl: tpl[0])
+        os.walk(subpath, followlinks=follow_links), key=lambda x: x[0])
 
-  samples = 0
-  for _, _, files in _recursive_list(directory):
-    for fname in files:
-      is_valid = False
+  for root, _, files in _recursive_list(directory):
+    for fname in sorted(files):
       for extension in white_list_formats:
+        if fname.lower().endswith('.tiff'):
+          logging.warning(
+              'Using \'.tiff\' files with multiple bands will cause '
+              'distortion. Please verify your output.')
         if fname.lower().endswith('.' + extension):
-          is_valid = True
-          break
-      if is_valid:
-        samples += 1
-  return samples
+          yield root, fname
 
 
-def _list_valid_filenames_in_directory(directory, white_list_formats,
+def _count_valid_files_in_directory(directory, white_list_formats, split,
+                                    follow_links):
+  """Count files with extension in `white_list_formats` contained in directory.
+
+  Arguments:
+      directory: absolute path to the directory
+          containing files to be counted
+      white_list_formats: set of strings containing allowed extensions for
+          the files to be counted.
+      split: tuple of floats (e.g. `(0.2, 0.6)`) to only take into
+          account a certain fraction of files in each directory.
+          E.g.: `split=(0.6, 1.0)` would only account for last 40 percent
+          of images in each directory.
+      follow_links: boolean.
+
+  Returns:
+      the count of files with extension in `white_list_formats` contained in
+      the directory.
+  """
+  num_files = len(
+      list(_iter_valid_files(directory, white_list_formats, follow_links)))
+  if split:
+    start, stop = int(split[0] * num_files), int(split[1] * num_files)
+  else:
+    start, stop = 0, num_files
+  return stop - start
+
+
+def _list_valid_filenames_in_directory(directory, white_list_formats, split,
                                        class_indices, follow_links):
   """List paths of files in `subdir` with extensions in `white_list_formats`.
 
@@ -1075,6 +1155,10 @@ def _list_valid_filenames_in_directory(directory, white_list_formats,
             `class_indices`.
       white_list_formats: set of strings containing allowed extensions for
           the files to be counted.
+      split: tuple of floats (e.g. `(0.2, 0.6)`) to only take into
+          account a certain fraction of files in each directory.
+          E.g.: `split=(0.6, 1.0)` would only account for last 40 percent
+          of images in each directory.
       class_indices: dictionary mapping a class name to its index.
       follow_links: boolean.
 
@@ -1084,27 +1168,26 @@ def _list_valid_filenames_in_directory(directory, white_list_formats,
           `directory`'s parent (e.g., if `directory` is "dataset/class1",
           the filenames will be ["class1/file1.jpg", "class1/file2.jpg", ...]).
   """
-
-  def _recursive_list(subpath):
-    return sorted(
-        os.walk(subpath, followlinks=follow_links), key=lambda tpl: tpl[0])
+  dirname = os.path.basename(directory)
+  if split:
+    num_files = len(
+        list(_iter_valid_files(directory, white_list_formats, follow_links)))
+    start, stop = int(split[0] * num_files), int(split[1] * num_files)
+    valid_files = list(
+        _iter_valid_files(directory, white_list_formats,
+                          follow_links))[start:stop]
+  else:
+    valid_files = _iter_valid_files(directory, white_list_formats, follow_links)
 
   classes = []
   filenames = []
-  subdir = os.path.basename(directory)
-  basedir = os.path.dirname(directory)
-  for root, _, files in _recursive_list(directory):
-    for fname in sorted(files):
-      is_valid = False
-      for extension in white_list_formats:
-        if fname.lower().endswith('.' + extension):
-          is_valid = True
-          break
-      if is_valid:
-        classes.append(class_indices[subdir])
-        # add filename relative to directory
-        absolute_path = os.path.join(root, fname)
-        filenames.append(os.path.relpath(absolute_path, basedir))
+  for root, fname in valid_files:
+    classes.append(class_indices[dirname])
+    absolute_path = os.path.join(root, fname)
+    relative_path = os.path.join(dirname,
+                                 os.path.relpath(absolute_path, directory))
+    filenames.append(relative_path)
+
   return classes, filenames
 
 
@@ -1144,6 +1227,8 @@ class DirectoryIterator(Iterator):
           images (if `save_to_dir` is set).
       save_format: Format to use for saving sample images
           (if `save_to_dir` is set).
+      subset: Subset of data (`"training"` or `"validation"`) if
+          validation_split is set in ImageDataGenerator.
       interpolation: Interpolation method used to resample the image if the
           target size is different from that of the loaded image.
           Supported methods are "nearest", "bilinear", and "bicubic".
@@ -1167,6 +1252,7 @@ class DirectoryIterator(Iterator):
                save_prefix='',
                save_format='png',
                follow_links=False,
+               subset=None,
                interpolation='nearest'):
     if data_format is None:
       data_format = K.image_data_format()
@@ -1200,7 +1286,20 @@ class DirectoryIterator(Iterator):
     self.save_format = save_format
     self.interpolation = interpolation
 
-    white_list_formats = {'png', 'jpg', 'jpeg', 'bmp', 'ppm'}
+    if subset is not None:
+      validation_split = self.image_data_generator.validation_split
+      if subset == 'validation':
+        split = (0, validation_split)
+      elif subset == 'training':
+        split = (validation_split, 1)
+      else:
+        raise ValueError('Invalid subset name: ', subset,
+                         '; expected "training" or "validation"')
+    else:
+      split = None
+    self.subset = subset
+
+    white_list_formats = {'png', 'jpg', 'jpeg', 'bmp', 'ppm', 'tif', 'tiff'}
 
     # first, count the number of samples and classes
     self.samples = 0
@@ -1217,7 +1316,8 @@ class DirectoryIterator(Iterator):
     function_partial = partial(
         _count_valid_files_in_directory,
         white_list_formats=white_list_formats,
-        follow_links=follow_links)
+        follow_links=follow_links,
+        split=split)
     self.samples = sum(
         pool.map(function_partial,
                  (os.path.join(directory, subdir) for subdir in classes)))
@@ -1233,14 +1333,15 @@ class DirectoryIterator(Iterator):
     i = 0
     for dirpath in (os.path.join(directory, subdir) for subdir in classes):
       results.append(
-          pool.apply_async(
-              _list_valid_filenames_in_directory,
-              (dirpath, white_list_formats, self.class_indices, follow_links)))
+          pool.apply_async(_list_valid_filenames_in_directory,
+                           (dirpath, white_list_formats, split,
+                            self.class_indices, follow_links)))
     for res in results:
       classes, filenames = res.get()
       self.classes[i:i + len(classes)] = classes
       self.filenames += filenames
       i += len(classes)
+
     pool.close()
     pool.join()
     super(DirectoryIterator, self).__init__(self.samples, batch_size, shuffle,
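Reviewer note: the `subset`/`split` handling added to `DirectoryIterator` above can be summarized by the standalone sketch below. It is a simplified illustration, not the code from this change: `subset_to_split` and `select_split` are hypothetical helper names, and a plain list of file names stands in for the directory walk.

```python
def subset_to_split(subset, validation_split):
    """Map a subset name to a (start, stop) fraction pair, or None."""
    if subset is None:
        return None
    if subset == 'validation':
        return (0, validation_split)
    if subset == 'training':
        return (validation_split, 1)
    raise ValueError('Invalid subset name: %r; expected "training" or '
                     '"validation"' % (subset,))


def select_split(files, split):
    """Return the part of the sorted `files` covered by `split`."""
    files = sorted(files)
    if split:
        # Fractions are truncated to integer indices, as in the patch.
        start = int(split[0] * len(files))
        stop = int(split[1] * len(files))
        return files[start:stop]
    return files


# Hypothetical file names for a single class directory.
files = ['img-%02d.jpg' % i for i in range(10)]
validation = select_split(files, subset_to_split('validation', 0.25))
training = select_split(files, subset_to_split('training', 0.25))
```

Because the patch calls the counting and listing helpers once per class directory, the truncation by `int()` happens per directory rather than over the whole dataset.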
diff --git a/tensorflow/python/keras/_impl/keras/preprocessing/image_test.py b/tensorflow/python/keras/_impl/keras/preprocessing/image_test.py
index c0790b5a5140193b18907d9375530f4f06e137da..001fee91f9ed609c0b3cd88d4079e75c0e585b02 100644
--- a/tensorflow/python/keras/_impl/keras/preprocessing/image_test.py
+++ b/tensorflow/python/keras/_impl/keras/preprocessing/image_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 import os
 import shutil
+import tempfile
 
 import numpy as np
 
@@ -74,6 +75,7 @@ class TestImage(test.TestCase):
           shear_range=0.5,
           zoom_range=0.2,
           channel_shift_range=0.,
+          brightness_range=(1, 5),
           fill_mode='nearest',
           cval=0.5,
           horizontal_flip=True,
@@ -92,6 +94,47 @@ class TestImage(test.TestCase):
         self.assertEqual(x.shape[1:], images.shape[1:])
         break
 
+  def test_image_data_generator_with_validation_split(self):
+    if PIL is None:
+      return  # Skip test if PIL is not available.
+
+    for test_images in _generate_test_images():
+      img_list = []
+      for im in test_images:
+        img_list.append(keras.preprocessing.image.img_to_array(im)[None, ...])
+
+      images = np.vstack(img_list)
+      generator = keras.preprocessing.image.ImageDataGenerator(
+          validation_split=0.5)
+      seq = generator.flow(
+          images,
+          np.arange(images.shape[0]),
+          shuffle=False,
+          batch_size=3,
+          subset='validation')
+      _, y = seq[0]
+      self.assertEqual(list(y), [0, 1, 2])
+      seq = generator.flow(
+          images,
+          np.arange(images.shape[0]),
+          shuffle=False,
+          batch_size=3,
+          subset='training')
+      _, y2 = seq[0]
+      self.assertEqual(list(y2), [4, 5, 6])
+
+      with self.assertRaises(ValueError):
+        generator.flow(
+            images,
+            np.arange(images.shape[0]),
+            shuffle=False,
+            batch_size=3,
+            subset='foo')
+
+  def test_image_data_generator_with_split_value_error(self):
+    with self.assertRaises(ValueError):
+      keras.preprocessing.image.ImageDataGenerator(validation_split=5)
+
   def test_image_data_generator_invalid_data(self):
     generator = keras.preprocessing.image.ImageDataGenerator(
         featurewise_center=True,
@@ -202,9 +245,80 @@ class TestImage(test.TestCase):
     # check number of classes and images
     self.assertEqual(len(dir_iterator.class_indices), num_classes)
     self.assertEqual(len(dir_iterator.classes), count)
-    self.assertEqual(sorted(dir_iterator.filenames), sorted(filenames))
+    self.assertEqual(set(dir_iterator.filenames), set(filenames))
     _ = dir_iterator.next()
 
+  def directory_iterator_with_validation_split_test_helper(
+      self, validation_split):
+    if PIL is None:
+      return  # Skip test if PIL is not available.
+
+    num_classes = 2
+    tmp_folder = tempfile.mkdtemp(prefix='test_images')
+
+    # create folders and subfolders
+    paths = []
+    for cl in range(num_classes):
+      class_directory = 'class-{}'.format(cl)
+      classpaths = [
+          class_directory,
+          os.path.join(class_directory, 'subfolder-1'),
+          os.path.join(class_directory, 'subfolder-2'),
+          os.path.join(class_directory, 'subfolder-1', 'sub-subfolder')
+      ]
+      for path in classpaths:
+        os.mkdir(os.path.join(tmp_folder, path))
+      paths.append(classpaths)
+
+    # save the images in the paths
+    count = 0
+    filenames = []
+    for test_images in _generate_test_images():
+      for im in test_images:
+        # rotate image class
+        im_class = count % num_classes
+        # rotate subfolders
+        classpaths = paths[im_class]
+        filename = os.path.join(classpaths[count % len(classpaths)],
+                                'image-{}.jpg'.format(count))
+        filenames.append(filename)
+        im.save(os.path.join(tmp_folder, filename))
+        count += 1
+
+    # create iterator
+    generator = keras.preprocessing.image.ImageDataGenerator(
+        validation_split=validation_split)
+
+    with self.assertRaises(ValueError):
+      generator.flow_from_directory(tmp_folder, subset='foo')
+
+    num_validation = int(count * validation_split)
+    num_training = count - num_validation
+    train_iterator = generator.flow_from_directory(
+        tmp_folder, subset='training')
+    self.assertEqual(train_iterator.samples, num_training)
+
+    valid_iterator = generator.flow_from_directory(
+        tmp_folder, subset='validation')
+    self.assertEqual(valid_iterator.samples, num_validation)
+
+    # check number of classes and images
+    self.assertEqual(len(train_iterator.class_indices), num_classes)
+    self.assertEqual(len(train_iterator.classes), num_training)
+    self.assertEqual(
+        len(set(train_iterator.filenames) & set(filenames)), num_training)
+
+    shutil.rmtree(tmp_folder)
+
+  def test_directory_iterator_with_validation_split_25_percent(self):
+    self.directory_iterator_with_validation_split_test_helper(0.25)
+
+  def test_directory_iterator_with_validation_split_40_percent(self):
+    self.directory_iterator_with_validation_split_test_helper(0.40)
+
+  def test_directory_iterator_with_validation_split_50_percent(self):
+    self.directory_iterator_with_validation_split_test_helper(0.50)
+
   def test_img_utils(self):
     if PIL is None:
       return  # Skip test if PIL is not available.
@@ -241,6 +355,41 @@ class TestImage(test.TestCase):
     x = keras.preprocessing.image.img_to_array(img, data_format='channels_last')
     self.assertEqual(x.shape, (height, width, 1))
 
+  def test_batch_standardize(self):
+    if PIL is None:
+      return  # Skip test if PIL is not available.
+
+    # ImageDataGenerator.standardize should work on batches
+    for test_images in _generate_test_images():
+      img_list = []
+      for im in test_images:
+        img_list.append(keras.preprocessing.image.img_to_array(im)[None, ...])
+
+      images = np.vstack(img_list)
+      generator = keras.preprocessing.image.ImageDataGenerator(
+          featurewise_center=True,
+          samplewise_center=True,
+          featurewise_std_normalization=True,
+          samplewise_std_normalization=True,
+          zca_whitening=True,
+          rotation_range=90.,
+          width_shift_range=0.1,
+          height_shift_range=0.1,
+          shear_range=0.5,
+          zoom_range=0.2,
+          channel_shift_range=0.,
+          brightness_range=(1, 5),
+          fill_mode='nearest',
+          cval=0.5,
+          horizontal_flip=True,
+          vertical_flip=True)
+      generator.fit(images, augment=True)
+
+      transformed = np.copy(images)
+      for i, im in enumerate(transformed):
+        transformed[i] = generator.random_transform(im)
+      transformed = generator.standardize(transformed)
+
   def test_img_transforms(self):
     x = np.random.random((3, 200, 200))
     _ = keras.preprocessing.image.random_rotation(x, 20)
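Reviewer note: the expectations in `test_image_data_generator_with_validation_split` follow from the index arithmetic that `NumpyArrayIterator` now applies. The sketch below reproduces that arithmetic with plain lists; `split_arrays` is a hypothetical name, and the sample count of eight is an assumption chosen so that the first training label is 4, as the test expects.

```python
def split_arrays(x, y, validation_split, subset):
    """Slice `x` and `y` into the requested subset (sketch of the patch logic)."""
    if subset not in ('training', 'validation'):
        raise ValueError('Invalid subset name: %r; expected "training" or '
                         '"validation".' % (subset,))
    split_idx = int(len(x) * validation_split)
    if subset == 'validation':
        # The validation subset is taken from the front of the data.
        return x[:split_idx], (y[:split_idx] if y is not None else None)
    # The training subset is everything after the split index.
    return x[split_idx:], (y[split_idx:] if y is not None else None)


# Assumed data: eight samples labelled 0..7.
labels = list(range(8))
samples = ['image-%d' % i for i in labels]
_, y_val = split_arrays(samples, labels, 0.5, 'validation')
_, y_train = split_arrays(samples, labels, 0.5, 'training')
```

With `shuffle=False` and `batch_size=3`, the first batch of each subset would then carry the first three labels of `y_val` and `y_train` respectively.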
diff --git a/tensorflow/python/keras/_impl/keras/preprocessing/sequence.py b/tensorflow/python/keras/_impl/keras/preprocessing/sequence.py
index a423d96d3d8578df347b7ee36fb53dfd335e0d65..e68c171d9c7e33d7e932f5d5b7f15859faa2348b 100644
--- a/tensorflow/python/keras/_impl/keras/preprocessing/sequence.py
+++ b/tensorflow/python/keras/_impl/keras/preprocessing/sequence.py
@@ -22,6 +22,8 @@ import random
 
 import numpy as np
 from six.moves import range  # pylint: disable=redefined-builtin
+
+from tensorflow.python.keras._impl.keras.utils.data_utils import Sequence
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -32,29 +34,40 @@ def pad_sequences(sequences,
                   padding='pre',
                   truncating='pre',
                   value=0.):
-  """Pads each sequence to the same length (length of the longest sequence).
+  """Pads sequences to the same length.
+
+  This function transforms a list of
+  `num_samples` sequences (lists of integers)
+  into a 2D Numpy array of shape `(num_samples, num_timesteps)`.
+  `num_timesteps` is either the `maxlen` argument if provided,
+  or the length of the longest sequence otherwise.
+
+  Sequences that are shorter than `num_timesteps`
+  are padded with `value`.
 
-  If maxlen is provided, any sequence longer
-  than maxlen is truncated to maxlen.
-  Truncation happens off either the beginning (default) or
-  the end of the sequence.
+  Sequences longer than `num_timesteps` are truncated
+  so that they fit the desired length.
+  The position where padding or truncation happens is determined by
+  the arguments `padding` and `truncating`, respectively.
 
-  Supports post-padding and pre-padding (default).
+  Pre-padding is the default.
 
   Arguments:
-      sequences: list of lists where each element is a sequence
-      maxlen: int, maximum length
-      dtype: type to cast the resulting sequence.
-      padding: 'pre' or 'post', pad either before or after each sequence.
-      truncating: 'pre' or 'post', remove values from sequences larger than
-          maxlen either in the beginning or in the end of the sequence
-      value: float, value to pad the sequences to the desired value.
+      sequences: List of lists, where each element is a sequence.
+      maxlen: Int, maximum length of all sequences.
+      dtype: Type of the output sequences.
+      padding: String, 'pre' or 'post':
+          pad either before or after each sequence.
+      truncating: String, 'pre' or 'post':
+          remove values from sequences longer than
+          `maxlen`, either at the beginning or at the end of the sequences.
+      value: Float, padding value.
 
   Returns:
-      x: numpy array with dimensions (number_of_sequences, maxlen)
+      x: Numpy array with shape `(len(sequences), maxlen)`
 
   Raises:
-      ValueError: in case of invalid values for `truncating` or `padding`,
+      ValueError: In case of invalid values for `truncating` or `padding`,
           or in case of invalid shape for a `sequences` entry.
   """
   if not hasattr(sequences, '__len__'):
@@ -92,10 +105,9 @@ def pad_sequences(sequences,
     # check `trunc` has expected shape
     trunc = np.asarray(trunc, dtype=dtype)
     if trunc.shape[1:] != sample_shape:
-      raise ValueError(
-          'Shape of sample %s of sequence at position %s is different from '
-          'expected shape %s'
-          % (trunc.shape[1:], idx, sample_shape))
+      raise ValueError('Shape of sample %s of sequence at position %s '
+                       'is different from expected shape %s' %
+                       (trunc.shape[1:], idx, sample_shape))
 
     if padding == 'post':
       x[idx, :len(trunc)] = trunc
@@ -110,22 +122,26 @@ def pad_sequences(sequences,
 def make_sampling_table(size, sampling_factor=1e-5):
   """Generates a word rank-based probabilistic sampling table.
 
-  This generates an array where the ith element
-  is the probability that a word of rank i would be sampled,
-  according to the sampling distribution used in word2vec.
+  Used for generating the `sampling_table` argument for `skipgrams`.
+  `sampling_table[i]` is the probability of sampling
+  the i-th most common word in a dataset
+  (more common words should be sampled less frequently, for balance).
 
-  The word2vec formula is:
-      p(word) = min(1, sqrt(word.frequency/sampling_factor) /
-      (word.frequency/sampling_factor))
+  The sampling probabilities are generated according
+  to the sampling distribution used in word2vec:
+
+  `p(word) = min(1, sqrt(word_frequency / sampling_factor) / (word_frequency /
+  sampling_factor))`
 
   We assume that the word frequencies follow Zipf's law (s=1) to derive
   a numerical approximation of frequency(rank):
-     frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))
-      where gamma is the Euler-Mascheroni constant.
+
+  `frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`
+  where `gamma` is the Euler-Mascheroni constant.
 
   Arguments:
-      size: int, number of possible words to sample.
-      sampling_factor: the sampling factor in the word2vec formula.
+      size: Int, number of possible words to sample.
+      sampling_factor: The sampling factor in the word2vec formula.
 
   Returns:
       A 1D Numpy array of length `size` where the ith entry
@@ -151,30 +167,37 @@ def skipgrams(sequence,
               seed=None):
   """Generates skipgram word pairs.
 
-  Takes a sequence (list of indexes of words),
-  returns couples of [word_index, other_word index] and labels (1s or 0s),
-  where label = 1 if 'other_word' belongs to the context of 'word',
-  and label=0 if 'other_word' is randomly sampled
+  This function transforms a sequence of word indexes (list of integers)
+  into tuples of words of the form:
+
+  - (word, word in the same window), with label 1 (positive samples).
+  - (word, random word from the vocabulary), with label 0 (negative samples).
+
+  Read more about Skipgram in this gnomic paper by Mikolov et al.:
+  [Efficient Estimation of Word Representations in
+  Vector Space](http://arxiv.org/pdf/1301.3781v3.pdf)
 
   Arguments:
-      sequence: a word sequence (sentence), encoded as a list
+      sequence: A word sequence (sentence), encoded as a list
           of word indices (integers). If using a `sampling_table`,
           word indices are expected to match the rank
           of the words in a reference dataset (e.g. 10 would encode
           the 10-th most frequently occurring token).
           Note that index 0 is expected to be a non-word and will be skipped.
-      vocabulary_size: int. maximum possible word index + 1
-      window_size: int. actually half-window.
-          The window of a word wi will be [i-window_size, i+window_size+1]
-      negative_samples: float >= 0. 0 for no negative (=random) samples.
-          1 for same number as positive samples. etc.
-      shuffle: whether to shuffle the word couples before returning them.
+      vocabulary_size: Int, maximum possible word index + 1
+      window_size: Int, size of sampling windows (technically half-window).
+          The window of a word `w_i` will be
+          `[i - window_size, i + window_size + 1]`.
+      negative_samples: Float >= 0. 0 for no negative (i.e. random) samples.
+          1 for same number as positive samples.
+      shuffle: Whether to shuffle the word couples before returning them.
       categorical: bool. if False, labels will be
-          integers (eg. [0, 1, 1 .. ]),
-          if True labels will be categorical eg. [[1,0],[0,1],[0,1] .. ]
+          integers (e.g. `[0, 1, 1 .. ]`),
+          if `True`, labels will be categorical, e.g.
+          `[[1,0],[0,1],[0,1] .. ]`.
       sampling_table: 1D array of size `vocabulary_size` where the entry i
           encodes the probability to sample a word of rank i.
-      seed: random seed.
+      seed: Random seed.
 
   Returns:
       couples, labels: where `couples` are int pairs and
@@ -234,9 +257,9 @@ def _remove_long_seq(maxlen, seq, label):
   """Removes sequences that exceed the maximum length.
 
   Arguments:
-      maxlen: int, maximum length
-      seq: list of lists where each sublist is a sequence
-      label: list where each element is an integer
+      maxlen: Int, maximum length of the output sequences.
+      seq: List of lists, where each sublist is a sequence.
+      label: List where each element is an integer.
 
   Returns:
       new_seq, new_label: shortened lists for `seq` and `label`.
@@ -247,3 +270,120 @@ def _remove_long_seq(maxlen, seq, label):
       new_seq.append(x)
       new_label.append(y)
   return new_seq, new_label
+
+
+@tf_export('keras.preprocessing.sequence.TimeseriesGenerator')
+class TimeseriesGenerator(Sequence):
+  """Utility class for generating batches of temporal data.
+
+  This class takes in a sequence of data-points gathered at
+  equal intervals, along with time series parameters such as
+  stride, length of history, etc., to produce batches for
+  training/validation.
+
+  Arguments:
+      data: Indexable generator (such as list or Numpy array)
+          containing consecutive data points (timesteps).
+          The data should be at least 2D, and axis 0 is expected
+          to be the time dimension.
+      targets: Targets corresponding to timesteps in `data`.
+          It should have the same length as `data`.
+      length: Length of the output sequences (in number of timesteps).
+      sampling_rate: Period between successive individual timesteps
+          within sequences. For rate `r`, timesteps
+          `data[i - length]`, `data[i - length + r]`, ... (up to, but not
+          including, `data[i]`) are used to create a sample sequence.
+      stride: Period between successive output sequences.
+          For stride `s`, consecutive output samples would
+          be centered around `data[i]`, `data[i+s]`, `data[i+2*s]`, etc.
+      start_index, end_index: Data points earlier than `start_index`
+          or later than `end_index` will not be used in the output sequences.
+          This is useful to reserve part of the data for test or validation.
+      shuffle: Whether to shuffle output samples,
+          or instead draw them in chronological order.
+      reverse: Boolean: if `True`, timesteps in each output sample will be
+          in reverse chronological order.
+      batch_size: Number of timeseries samples in each batch
+          (except maybe the last one).
+
+  Returns:
+      A [Sequence](/utils/#sequence) instance.
+
+  Examples:
+
+  ```python
+  from keras.preprocessing.sequence import TimeseriesGenerator
+  import numpy as np
+
+  data = np.array([[i] for i in range(50)])
+  targets = np.array([[i] for i in range(50)])
+
+  data_gen = TimeseriesGenerator(data, targets,
+                                 length=10, sampling_rate=2,
+                                 batch_size=2)
+  assert len(data_gen) == 20
+
+  batch_0 = data_gen[0]
+  x, y = batch_0
+  assert np.array_equal(x,
+                        np.array([[[0], [2], [4], [6], [8]],
+                                  [[1], [3], [5], [7], [9]]]))
+  assert np.array_equal(y,
+                        np.array([[10], [11]]))
+  ```
+  """
+
+  def __init__(self,
+               data,
+               targets,
+               length,
+               sampling_rate=1,
+               stride=1,
+               start_index=0,
+               end_index=None,
+               shuffle=False,
+               reverse=False,
+               batch_size=128):
+    self.data = data
+    self.targets = targets
+    self.length = length
+    self.sampling_rate = sampling_rate
+    self.stride = stride
+    self.start_index = start_index + length
+    if end_index is None:
+      end_index = len(data) - 1
+    self.end_index = end_index
+    self.shuffle = shuffle
+    self.reverse = reverse
+    self.batch_size = batch_size
+
+  def __len__(self):
+    length = int(
+        np.ceil((self.end_index - self.start_index) /
+                (self.batch_size * self.stride)))
+    return length if length >= 0 else 0
+
+  def _empty_batch(self, num_rows):
+    samples_shape = [num_rows, self.length // self.sampling_rate]
+    samples_shape.extend(self.data.shape[1:])
+    targets_shape = [num_rows]
+    targets_shape.extend(self.targets.shape[1:])
+    return np.empty(samples_shape), np.empty(targets_shape)
+
+  def __getitem__(self, index):
+    if self.shuffle:
+      rows = np.random.randint(
+          self.start_index, self.end_index, size=self.batch_size)
+    else:
+      i = self.start_index + self.batch_size * self.stride * index
+      rows = np.arange(i, min(i + self.batch_size * self.stride,
+                              self.end_index), self.stride)
+
+    samples, targets = self._empty_batch(len(rows))
+    for j in range(len(rows)):
+      indices = range(rows[j] - self.length, rows[j], self.sampling_rate)
+      samples[j] = self.data[indices]
+      targets[j] = self.targets[rows[j]]
+    if self.reverse:
+      return samples[:, ::-1, ...], targets
+    return samples, targets
diff --git a/tensorflow/python/keras/_impl/keras/preprocessing/sequence_test.py b/tensorflow/python/keras/_impl/keras/preprocessing/sequence_test.py
index 4529e6e94fc42661fb0474c1a827863ddb654776..b9bfdd000484665e8771f4bccef59738e5c26120 100644
--- a/tensorflow/python/keras/_impl/keras/preprocessing/sequence_test.py
+++ b/tensorflow/python/keras/_impl/keras/preprocessing/sequence_test.py
@@ -84,15 +84,91 @@ class TestSequence(test.TestCase):
     couples, labels = keras.preprocessing.sequence.skipgrams(
         np.arange(3), vocabulary_size=3)
     for couple in couples:
-      assert couple[0] in [0, 1, 2] and couple[1] in [0, 1, 2]
+      self.assertIn(couple[0], [0, 1, 2])
+      self.assertIn(couple[1], [0, 1, 2])
 
     # test window size and categorical labels
     couples, labels = keras.preprocessing.sequence.skipgrams(
         np.arange(5), vocabulary_size=5, window_size=1, categorical=True)
     for couple in couples:
-      assert couple[0] - couple[1] <= 3
+      self.assertLessEqual(couple[0] - couple[1], 3)
     for l in labels:
-      assert len(l) == 2
+      self.assertEqual(len(l), 2)
+
+  def test_TimeseriesGenerator(self):
+    data = np.array([[i] for i in range(50)])
+    targets = np.array([[i] for i in range(50)])
+
+    data_gen = keras.preprocessing.sequence.TimeseriesGenerator(
+        data, targets, length=10, sampling_rate=2, batch_size=2)
+    self.assertEqual(len(data_gen), 20)
+    self.assertAllClose(data_gen[0][0],
+                        np.array([[[0], [2], [4], [6], [8]], [[1], [3], [5],
+                                                              [7], [9]]]))
+    self.assertAllClose(data_gen[0][1], np.array([[10], [11]]))
+    self.assertAllClose(data_gen[1][0],
+                        np.array([[[2], [4], [6], [8], [10]], [[3], [5], [7],
+                                                               [9], [11]]]))
+    self.assertAllClose(data_gen[1][1], np.array([[12], [13]]))
+
+    data_gen = keras.preprocessing.sequence.TimeseriesGenerator(
+        data, targets, length=10, sampling_rate=2, reverse=True, batch_size=2)
+    self.assertEqual(len(data_gen), 20)
+    self.assertAllClose(data_gen[0][0],
+                        np.array([[[8], [6], [4], [2], [0]], [[9], [7], [5],
+                                                              [3], [1]]]))
+    self.assertAllClose(data_gen[0][1], np.array([[10], [11]]))
+
+    data_gen = keras.preprocessing.sequence.TimeseriesGenerator(
+        data, targets, length=10, sampling_rate=2, shuffle=True, batch_size=1)
+    batch = data_gen[0]
+    r = batch[1][0][0]
+    self.assertAllClose(batch[0],
+                        np.array([[[r - 10], [r - 8], [r - 6], [r - 4],
+                                   [r - 2]]]))
+    self.assertAllClose(batch[1], np.array([
+        [r],
+    ]))
+
+    data_gen = keras.preprocessing.sequence.TimeseriesGenerator(
+        data, targets, length=10, sampling_rate=2, stride=2, batch_size=2)
+    self.assertEqual(len(data_gen), 10)
+    self.assertAllClose(data_gen[1][0],
+                        np.array([[[4], [6], [8], [10], [12]], [[6], [8], [10],
+                                                                [12], [14]]]))
+    self.assertAllClose(data_gen[1][1], np.array([[14], [16]]))
+
+    data_gen = keras.preprocessing.sequence.TimeseriesGenerator(
+        data,
+        targets,
+        length=10,
+        sampling_rate=2,
+        start_index=10,
+        end_index=30,
+        batch_size=2)
+    self.assertEqual(len(data_gen), 5)
+    self.assertAllClose(data_gen[0][0],
+                        np.array([[[10], [12], [14], [16], [18]],
+                                  [[11], [13], [15], [17], [19]]]))
+    self.assertAllClose(data_gen[0][1], np.array([[20], [21]]))
+
+    data = np.array([np.random.random_sample((1, 2, 3, 4)) for i in range(50)])
+    targets = np.array([np.random.random_sample((3, 2, 1)) for i in range(50)])
+    data_gen = keras.preprocessing.sequence.TimeseriesGenerator(
+        data,
+        targets,
+        length=10,
+        sampling_rate=2,
+        start_index=10,
+        end_index=30,
+        batch_size=2)
+
+    self.assertEqual(len(data_gen), 5)
+    self.assertAllClose(data_gen[0][0],
+                        np.array(
+                            [np.array(data[10:19:2]),
+                             np.array(data[11:20:2])]))
+    self.assertAllClose(data_gen[0][1], np.array([targets[20], targets[21]]))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/keras/_impl/keras/preprocessing/text.py b/tensorflow/python/keras/_impl/keras/preprocessing/text.py
index 1e3828ccf1e3bf9c443691e1c1da5697bedb4653..f652f318f3d6dae20b1113a50cd02930abb851af 100644
--- a/tensorflow/python/keras/_impl/keras/preprocessing/text.py
+++ b/tensorflow/python/keras/_impl/keras/preprocessing/text.py
@@ -91,6 +91,7 @@ def one_hot(text,
       text, n, hash_function=hash, filters=filters, lower=lower, split=split)
 
 
+@tf_export('keras.preprocessing.text.hashing_trick')
 def hashing_trick(text,
                   n,
                   hash_function=None,
@@ -187,21 +188,27 @@ class Tokenizer(object):
     self.document_count = 0
     self.char_level = char_level
     self.oov_token = oov_token
+    self.index_docs = {}
 
   def fit_on_texts(self, texts):
     """Updates internal vocabulary based on a list of texts.
 
+    In the case where texts contains lists, we assume each entry of the lists
+    to be a token.
+
     Required before using `texts_to_sequences` or `texts_to_matrix`.
 
     Arguments:
         texts: can be a list of strings,
-            or a generator of strings (for memory-efficiency)
+            a generator of strings (for memory-efficiency),
+            or a list of lists of strings.
     """
-    self.document_count = 0
     for text in texts:
       self.document_count += 1
-      seq = text if self.char_level else text_to_word_sequence(
-          text, self.filters, self.lower, self.split)
+      if self.char_level or isinstance(text, list):
+        seq = text
+      else:
+        seq = text_to_word_sequence(text, self.filters, self.lower, self.split)
       for w in seq:
         if w in self.word_counts:
           self.word_counts[w] += 1
@@ -226,7 +233,6 @@ class Tokenizer(object):
       if i is None:
         self.word_index[self.oov_token] = len(self.word_index) + 1
 
-    self.index_docs = {}
     for w, c in list(self.word_docs.items()):
       self.index_docs[self.word_index[w]] = c
 
@@ -240,8 +246,7 @@ class Tokenizer(object):
         sequences: A list of sequence.
             A "sequence" is a list of integer word indices.
     """
-    self.document_count = len(sequences)
-    self.index_docs = {}
+    self.document_count += len(sequences)
     for seq in sequences:
       seq = set(seq)
       for i in seq:
@@ -268,7 +273,11 @@ class Tokenizer(object):
     return res
 
   def texts_to_sequences_generator(self, texts):
-    """Transforms each text in texts in a sequence of integers.
+    """Transforms each text in `texts` into a sequence of integers.
+
+    Each item in `texts` can also be a list,
+    in which case we assume each item of that list
+    to be a token.
 
     Only top "num_words" most frequent words will be taken into account.
     Only words known by the tokenizer will be taken into account.
@@ -281,8 +290,10 @@ class Tokenizer(object):
     """
     num_words = self.num_words
     for text in texts:
-      seq = text if self.char_level else text_to_word_sequence(
-          text, self.filters, self.lower, self.split)
+      if self.char_level or isinstance(text, list):
+        seq = text
+      else:
+        seq = text_to_word_sequence(text, self.filters, self.lower, self.split)
       vect = []
       for w in seq:
         i = self.word_index.get(w)
diff --git a/tensorflow/python/keras/_impl/keras/preprocessing/text_test.py b/tensorflow/python/keras/_impl/keras/preprocessing/text_test.py
index a934e331c4a14d9bd170258b6b6183e6a15bb561..c6a267e57e4e2dc04156483d1cf85a42a78eb395 100644
--- a/tensorflow/python/keras/_impl/keras/preprocessing/text_test.py
+++ b/tensorflow/python/keras/_impl/keras/preprocessing/text_test.py
@@ -1,3 +1,4 @@
+# -*- coding: utf-8 -*-
 # Copyright 2016 The TensorFlow Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -80,17 +81,52 @@ class TestText(test.TestCase):
     x_train = ['This text has only known words']
     x_test = ['This text has some unknown words']  # 2 OOVs: some, unknown
 
-    # Defalut, without OOV flag
+    # Default, without OOV flag
     tokenizer = keras.preprocessing.text.Tokenizer()
     tokenizer.fit_on_texts(x_train)
     x_test_seq = tokenizer.texts_to_sequences(x_test)
-    assert len(x_test_seq[0]) == 4  # discards 2 OOVs
+    self.assertEqual(len(x_test_seq[0]), 4)  # discards 2 OOVs
 
     # With OOV feature
     tokenizer = keras.preprocessing.text.Tokenizer(oov_token='<unk>')
     tokenizer.fit_on_texts(x_train)
     x_test_seq = tokenizer.texts_to_sequences(x_test)
-    assert len(x_test_seq[0]) == 6  # OOVs marked in place
+    self.assertEqual(len(x_test_seq[0]), 6)  # OOVs marked in place
+
+  def test_sequential_fit(self):
+    texts = [
+        'The cat sat on the mat.', 'The dog sat on the log.',
+        'Dogs and cats living together.'
+    ]
+    word_sequences = [['The', 'cat', 'is', 'sitting'],
+                      ['The', 'dog', 'is', 'standing']]
+    tokenizer = keras.preprocessing.text.Tokenizer()
+    tokenizer.fit_on_texts(texts)
+    tokenizer.fit_on_texts(word_sequences)
+
+    self.assertEqual(tokenizer.document_count, 5)
+
+    tokenizer.texts_to_matrix(texts)
+    tokenizer.texts_to_matrix(word_sequences)
+
+  def test_text_to_word_sequence(self):
+    text = 'hello! ? world!'
+    seq = keras.preprocessing.text.text_to_word_sequence(text)
+    self.assertEqual(seq, ['hello', 'world'])
+
+  def test_text_to_word_sequence_unicode(self):
+    text = u'ali! veli? kırk dokuz elli'
+    seq = keras.preprocessing.text.text_to_word_sequence(text)
+    self.assertEqual(seq, [u'ali', u'veli', u'kırk', u'dokuz', u'elli'])
+
+  def test_tokenizer_unicode(self):
+    texts = [
+        u'ali veli kırk dokuz elli', u'ali veli kırk dokuz elli veli kırk dokuz'
+    ]
+    tokenizer = keras.preprocessing.text.Tokenizer(num_words=5)
+    tokenizer.fit_on_texts(texts)
+
+    self.assertEqual(len(tokenizer.word_counts), 5)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/keras/_impl/keras/utils/__init__.py b/tensorflow/python/keras/_impl/keras/utils/__init__.py
index 370ae0dd0f0d00059f1b0cc79459abe75c8ca494..0c9f19a0c8dcf3bf929e102b31679a03b27728f7 100644
--- a/tensorflow/python/keras/_impl/keras/utils/__init__.py
+++ b/tensorflow/python/keras/_impl/keras/utils/__init__.py
@@ -31,8 +31,8 @@ from tensorflow.python.keras._impl.keras.utils.generic_utils import serialize_ke
 from tensorflow.python.keras._impl.keras.utils.io_utils import HDF5Matrix
 from tensorflow.python.keras._impl.keras.utils.layer_utils import convert_all_kernels_in_model
 from tensorflow.python.keras._impl.keras.utils.layer_utils import print_summary
+from tensorflow.python.keras._impl.keras.utils.multi_gpu_utils import multi_gpu_model
 from tensorflow.python.keras._impl.keras.utils.np_utils import normalize
 from tensorflow.python.keras._impl.keras.utils.np_utils import to_categorical
-from tensorflow.python.keras._impl.keras.utils.training_utils import multi_gpu_model
 from tensorflow.python.keras._impl.keras.utils.vis_utils import plot_model
 
diff --git a/tensorflow/python/keras/_impl/keras/utils/data_utils.py b/tensorflow/python/keras/_impl/keras/utils/data_utils.py
index e87c8f48ef0967d561db1ab841a669d783f9b1ec..4c49544c6a63c4e5a0b79d31b074ad352c512bfa 100644
--- a/tensorflow/python/keras/_impl/keras/utils/data_utils.py
+++ b/tensorflow/python/keras/_impl/keras/utils/data_utils.py
@@ -393,6 +393,16 @@ class Sequence(object):
     """
     pass
 
+  def __iter__(self):
+    """Creates an infinite generator that iterates over the Sequence.
+
+    Yields:
+      Sequence items.
+    """
+    while True:
+      for item in (self[i] for i in range(len(self))):
+        yield item
+
 
 # Global variables to be shared across processes
 _SHARED_SEQUENCES = {}
@@ -400,6 +410,11 @@ _SHARED_SEQUENCES = {}
 _SEQUENCE_COUNTER = None
 
 
+def init_pool(seqs):
+  global _SHARED_SEQUENCES
+  _SHARED_SEQUENCES = seqs
+
+
 def get_index(uid, i):
   """Get the value from the Sequence `uid` at index `i`.
 
@@ -532,9 +547,11 @@ class OrderedEnqueuer(SequenceEnqueuer):
             (when full, workers could block on `put()`)
     """
     if self.use_multiprocessing:
-      self.executor_fn = lambda: multiprocessing.Pool(workers)
+      self.executor_fn = lambda seqs: multiprocessing.Pool(  # pylint: disable=g-long-lambda
+          workers, initializer=init_pool, initargs=(seqs,))
     else:
-      self.executor_fn = lambda: ThreadPool(workers)
+      # No initializer is needed: threads share the parent's memory.
+      self.executor_fn = lambda _: ThreadPool(workers)
     self.workers = workers
     self.queue = queue.Queue(max_queue_size)
     self.stop_signal = threading.Event()
@@ -557,7 +574,7 @@ class OrderedEnqueuer(SequenceEnqueuer):
       if self.shuffle:
         random.shuffle(sequence)
 
-      with closing(self.executor_fn()) as executor:
+      with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
         for i in sequence:
           if self.stop_signal.is_set():
             return
diff --git a/tensorflow/python/keras/_impl/keras/utils/generic_utils.py b/tensorflow/python/keras/_impl/keras/utils/generic_utils.py
index 5196bf17400c33d876daa430a9d3d5b4f4b491a1..3bbe87f92d8f7eac27033344550ca65397eab986 100644
--- a/tensorflow/python/keras/_impl/keras/utils/generic_utils.py
+++ b/tensorflow/python/keras/_impl/keras/utils/generic_utils.py
@@ -490,8 +490,8 @@ def slice_arrays(arrays, start=None, stop=None):
   if arrays is None:
     return [None]
   if isinstance(start, list) and stop is not None:
-    raise ValueError('The stop argument has to be None if the value of start is'
-                     'a list.')
+    raise ValueError('The stop argument has to be None if the value of start '
+                     'is a list.')
   elif isinstance(arrays, list):
     if hasattr(start, '__len__'):
       # hdf5 datasets only support list objects as indices
diff --git a/tensorflow/python/keras/_impl/keras/utils/training_utils.py b/tensorflow/python/keras/_impl/keras/utils/multi_gpu_utils.py
similarity index 98%
rename from tensorflow/python/keras/_impl/keras/utils/training_utils.py
rename to tensorflow/python/keras/_impl/keras/utils/multi_gpu_utils.py
index ce7402e9d279278eaaf5aab58a3973eec6de8e99..231ace2a0b4a4f25cebf06a5216cf3d30aadc49b 100644
--- a/tensorflow/python/keras/_impl/keras/utils/training_utils.py
+++ b/tensorflow/python/keras/_impl/keras/utils/multi_gpu_utils.py
@@ -125,7 +125,7 @@ def multi_gpu_model(model, gpus):
     if gpus <= 1:
       raise ValueError('For multi-gpu usage to be effective, '
                        'call `multi_gpu_model` with `gpus >= 2`. '
-                       'Received: `gpus=%d`' % gpus)
+                       'Received: `gpus=%s`' % gpus)
     num_gpus = gpus
     target_gpu_ids = range(num_gpus)
 
@@ -136,7 +136,7 @@ def multi_gpu_model(model, gpus):
   ]
   for device in target_devices:
     if device not in available_devices:
-      raise ValueError('To call `multi_gpu_model` with `gpus=%d`, '
+      raise ValueError('To call `multi_gpu_model` with `gpus=%s`, '
                        'we expect the following devices to be available: %s. '
                        'However this machine only has: %s. '
                        'Try reducing `gpus`.' % (gpus, target_devices,
diff --git a/tensorflow/python/keras/_impl/keras/utils/training_utils_test.py b/tensorflow/python/keras/_impl/keras/utils/multi_gpu_utils_test.py
similarity index 64%
rename from tensorflow/python/keras/_impl/keras/utils/training_utils_test.py
rename to tensorflow/python/keras/_impl/keras/utils/multi_gpu_utils_test.py
index 12354c49ca72cddc0f395bcfcfabab18c1189227..0a38d6b5228fe791ce14adc7e37e0b7a6926fadf 100644
--- a/tensorflow/python/keras/_impl/keras/utils/training_utils_test.py
+++ b/tensorflow/python/keras/_impl/keras/utils/multi_gpu_utils_test.py
@@ -19,21 +19,34 @@ from __future__ import print_function
 
 import numpy as np
 
-
+from tensorflow.python import data
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.platform import test
 
 
+def check_if_compatible_devices(gpus=2):
+  available_devices = [
+      keras.utils.multi_gpu_utils._normalize_device_name(name)
+      for name in keras.utils.multi_gpu_utils._get_available_devices()
+  ]
+  if '/gpu:%d' % (gpus - 1) not in available_devices:
+    return False
+  return True
+
+
 class TestMultiGPUModel(test.TestCase):
 
-  def multi_gpu_test_simple_model(self):
+  def test_multi_gpu_test_simple_model(self):
     gpus = 2
     num_samples = 1000
     input_dim = 10
     output_dim = 1
     hidden_dim = 10
     epochs = 2
-    target_gpu_id = [0, 2, 4]
+    target_gpu_id = [0, 1]
+
+    if not check_if_compatible_devices(gpus=gpus):
+      return
 
     with self.test_session():
       model = keras.models.Sequential()
@@ -47,12 +60,11 @@ class TestMultiGPUModel(test.TestCase):
       parallel_model = keras.utils.multi_gpu_model(model, gpus=gpus)
       parallel_model.compile(loss='mse', optimizer='rmsprop')
       parallel_model.fit(x, y, epochs=epochs)
-
       parallel_model = keras.utils.multi_gpu_model(model, gpus=target_gpu_id)
       parallel_model.compile(loss='mse', optimizer='rmsprop')
       parallel_model.fit(x, y, epochs=epochs)
 
-  def multi_gpu_test_multi_io_model(self):
+  def test_multi_gpu_test_multi_io_model(self):
     gpus = 2
     num_samples = 1000
     input_dim_a = 10
@@ -61,7 +73,10 @@ class TestMultiGPUModel(test.TestCase):
     output_dim_b = 2
     hidden_dim = 10
     epochs = 2
-    target_gpu_id = [0, 2, 4]
+    target_gpu_id = [0, 1]
+
+    if not check_if_compatible_devices(gpus=gpus):
+      return
 
     with self.test_session():
       input_a = keras.Input((input_dim_a,))
@@ -86,7 +101,10 @@ class TestMultiGPUModel(test.TestCase):
       parallel_model.compile(loss='mse', optimizer='rmsprop')
       parallel_model.fit([a_x, b_x], [a_y, b_y], epochs=epochs)
 
-  def multi_gpu_test_invalid_devices(self):
+  def test_multi_gpu_test_invalid_devices(self):
+    if not check_if_compatible_devices(gpus=2):
+      return
+
     with self.test_session():
       input_shape = (1000, 10)
       model = keras.models.Sequential()
@@ -115,3 +133,53 @@ class TestMultiGPUModel(test.TestCase):
       with self.assertRaises(ValueError):
         parallel_model = keras.utils.multi_gpu_model(model, gpus=[0])
         parallel_model.fit(x, y, epochs=2)
+
+  def test_nested_model_with_tensor_input(self):
+    gpus = 2
+    input_dim = 10
+    shape = (input_dim,)
+    num_samples = 16
+    num_classes = 10
+
+    if not check_if_compatible_devices(gpus=gpus):
+      return
+
+    with self.test_session():
+      input_shape = (num_samples,) + shape
+      x_train = np.random.randint(0, 255, input_shape)
+      y_train = np.random.randint(0, num_classes, (input_shape[0],))
+      keras.backend.set_learning_phase(True)
+
+      y_train = keras.utils.to_categorical(y_train, num_classes)
+
+      x_train = x_train.astype('float32')
+      y_train = y_train.astype('float32')
+
+      dataset = data.Dataset.from_tensor_slices((x_train, y_train))
+      dataset = dataset.repeat()
+      dataset = dataset.batch(4)
+      iterator = dataset.make_one_shot_iterator()
+
+      inputs, targets = iterator.get_next()
+
+      input_tensor = keras.layers.Input(tensor=inputs)
+
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(3,
+                                   input_shape=(input_dim,)))
+      model.add(keras.layers.Dense(num_classes))
+
+      output = model(input_tensor)
+      outer_model = keras.Model(input_tensor, output)
+      parallel_model = keras.utils.multi_gpu_model(outer_model, gpus=gpus)
+
+      parallel_model.compile(
+          loss='categorical_crossentropy',
+          optimizer=keras.optimizers.RMSprop(lr=0.0001, decay=1e-6),
+          metrics=['accuracy'],
+          target_tensors=[targets])
+      parallel_model.fit(epochs=1, steps_per_epoch=3)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/keras/_impl/keras/utils/vis_utils.py b/tensorflow/python/keras/_impl/keras/utils/vis_utils.py
index 45c1b92075c50956fee004409e98898411e83d27..4761cece82c727e4962d0374f8efb80dfaeac3c6 100644
--- a/tensorflow/python/keras/_impl/keras/utils/vis_utils.py
+++ b/tensorflow/python/keras/_impl/keras/utils/vis_utils.py
@@ -120,7 +120,7 @@ def model_to_dot(model, show_shapes=False, show_layer_names=True, rankdir='TB'):
     layer_id = str(id(layer))
     for i, node in enumerate(layer._inbound_nodes):
       node_key = layer.name + '_ib-' + str(i)
-      if node_key in model._container_nodes:
+      if node_key in model._network_nodes:  # pylint: disable=protected-access
         for inbound_layer in node.inbound_layers:
           inbound_layer_id = str(id(inbound_layer))
           layer_id = str(id(layer))
diff --git a/tensorflow/python/keras/datasets/fashion_mnist/__init__.py b/tensorflow/python/keras/datasets/fashion_mnist/__init__.py
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..7f5ddecc4707334d52ebf4966f2ec6141cce0d46 100644
--- a/tensorflow/python/keras/datasets/fashion_mnist/__init__.py
+++ b/tensorflow/python/keras/datasets/fashion_mnist/__init__.py
@@ -0,0 +1,25 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Fashion-MNIST dataset."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.keras._impl.keras.datasets.fashion_mnist import load_data
+
+del absolute_import
+del division
+del print_function
diff --git a/tensorflow/python/keras/preprocessing/image/__init__.py b/tensorflow/python/keras/preprocessing/image/__init__.py
index b96e7675527041d3952b049f5f431d3df36eea4c..6aba5fc8252e1acf604a89a4e66c2a7db080aa73 100644
--- a/tensorflow/python/keras/preprocessing/image/__init__.py
+++ b/tensorflow/python/keras/preprocessing/image/__init__.py
@@ -27,6 +27,7 @@ from tensorflow.python.keras._impl.keras.preprocessing.image import img_to_array
 from tensorflow.python.keras._impl.keras.preprocessing.image import Iterator
 from tensorflow.python.keras._impl.keras.preprocessing.image import load_img
 from tensorflow.python.keras._impl.keras.preprocessing.image import NumpyArrayIterator
+from tensorflow.python.keras._impl.keras.preprocessing.image import random_brightness
 from tensorflow.python.keras._impl.keras.preprocessing.image import random_channel_shift
 from tensorflow.python.keras._impl.keras.preprocessing.image import random_rotation
 from tensorflow.python.keras._impl.keras.preprocessing.image import random_shear
diff --git a/tensorflow/python/keras/preprocessing/sequence/__init__.py b/tensorflow/python/keras/preprocessing/sequence/__init__.py
index 112f6af5e588bcb2e85fdbecea86f402742d44e7..b7a7149cc40654c878e3c0db1fc78d8912abf498 100644
--- a/tensorflow/python/keras/preprocessing/sequence/__init__.py
+++ b/tensorflow/python/keras/preprocessing/sequence/__init__.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 from tensorflow.python.keras._impl.keras.preprocessing.sequence import make_sampling_table
 from tensorflow.python.keras._impl.keras.preprocessing.sequence import pad_sequences
 from tensorflow.python.keras._impl.keras.preprocessing.sequence import skipgrams
+from tensorflow.python.keras._impl.keras.preprocessing.sequence import TimeseriesGenerator
 
 del absolute_import
 del division
diff --git a/tensorflow/python/keras/preprocessing/text/__init__.py b/tensorflow/python/keras/preprocessing/text/__init__.py
index 5bf1a2fb21dc27f7aa10cd08b1496e3991c61d2f..000ad68a0c01e9067f8852836ba5d502deb3fcd4 100644
--- a/tensorflow/python/keras/preprocessing/text/__init__.py
+++ b/tensorflow/python/keras/preprocessing/text/__init__.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.keras._impl.keras.preprocessing.text import hashing_trick
 from tensorflow.python.keras._impl.keras.preprocessing.text import one_hot
 from tensorflow.python.keras._impl.keras.preprocessing.text import text_to_word_sequence
 from tensorflow.python.keras._impl.keras.preprocessing.text import Tokenizer
diff --git a/tensorflow/python/keras/utils/__init__.py b/tensorflow/python/keras/utils/__init__.py
index 91cc8607274a80a14dd27a64274da7f8f0aafab1..2f74cf031d0520c8d874b7269c52e3b9e1b9931b 100644
--- a/tensorflow/python/keras/utils/__init__.py
+++ b/tensorflow/python/keras/utils/__init__.py
@@ -30,9 +30,9 @@ from tensorflow.python.keras._impl.keras.utils.generic_utils import Progbar
 from tensorflow.python.keras._impl.keras.utils.generic_utils import serialize_keras_object
 from tensorflow.python.keras._impl.keras.utils.io_utils import HDF5Matrix
 from tensorflow.python.keras._impl.keras.utils.layer_utils import convert_all_kernels_in_model
+from tensorflow.python.keras._impl.keras.utils.multi_gpu_utils import multi_gpu_model
 from tensorflow.python.keras._impl.keras.utils.np_utils import normalize
 from tensorflow.python.keras._impl.keras.utils.np_utils import to_categorical
-from tensorflow.python.keras._impl.keras.utils.training_utils import multi_gpu_model
 from tensorflow.python.keras._impl.keras.utils.vis_utils import plot_model
 
 del absolute_import
diff --git a/tensorflow/python/kernel_tests/BUILD b/tensorflow/python/kernel_tests/BUILD
index d4ceb2e489c8a20d26eaf9d89b12992d2b8673d7..ece1da033274720ae3b6c221004081dce6fb4042 100644
--- a/tensorflow/python/kernel_tests/BUILD
+++ b/tensorflow/python/kernel_tests/BUILD
@@ -393,6 +393,7 @@ tf_py_test(
         "//tensorflow/python:nn_ops",
         "//tensorflow/python:nn_ops_gen",
     ],
+    shard_count = 5,
 )
 
 tf_py_test(
@@ -408,6 +409,7 @@ tf_py_test(
         "//tensorflow/python:nn_ops",
         "//tensorflow/python:nn_ops_gen",
     ],
+    shard_count = 5,
 )
 
 tf_py_test(
@@ -712,6 +714,18 @@ cuda_py_test(
     ],
 )
 
+tf_py_test(
+    name = "regex_replace_op_test",
+    size = "small",
+    srcs = ["regex_replace_op_test.py"],
+    additional_deps = [
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:constant_op",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:string_ops",
+    ],
+)
+
 tf_py_test(
     name = "save_restore_ops_test",
     size = "small",
@@ -1075,6 +1089,7 @@ cuda_py_test(
     tags = [
         "no_windows",
         "noasan",
+        "notap",
     ],
 )
 
@@ -1559,12 +1574,15 @@ cuda_py_test(
         "//third_party/py/numpy",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
+        "//tensorflow/python:layers",
         "//tensorflow/python:framework",
         "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:init_ops",
+        "//tensorflow/python:linalg_ops",
         "//tensorflow/python:math_ops",
         "//tensorflow/python:nn_ops",
         "//tensorflow/python:partitioned_variables",
+        "//tensorflow/python:random_ops",
         "//tensorflow/python:variable_scope",
         "//tensorflow/python:variables",
     ],
@@ -1892,7 +1910,7 @@ cuda_py_test(
 
 cuda_py_test(
     name = "softmax_op_test",
-    size = "small",
+    size = "medium",
     srcs = ["softmax_op_test.py"],
     additional_deps = [
         "//third_party/py/numpy",
@@ -2892,6 +2910,40 @@ tf_py_test(
     ],
 )
 
+tf_py_test(
+    name = "accumulate_n_test",
+    size = "small",
+    srcs = ["accumulate_n_test.py"],
+    additional_deps = [
+        "//third_party/py/numpy",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:gradients",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+        "//tensorflow/python:variables",
+    ],
+)
+
+tf_py_test(
+    name = "accumulate_n_eager_test",
+    size = "small",
+    srcs = ["accumulate_n_eager_test.py"],
+    additional_deps = [
+        "//third_party/py/numpy",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:gradients",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python/eager:backprop",
+        "//tensorflow/python/eager:context",
+        "//tensorflow/python/eager:tape",
+    ],
+)
+
 filegroup(
     name = "all_files",
     srcs = glob(
diff --git a/tensorflow/contrib/framework/python/ops/accumulate_n_v2_eager_test.py b/tensorflow/python/kernel_tests/accumulate_n_eager_test.py
similarity index 72%
rename from tensorflow/contrib/framework/python/ops/accumulate_n_v2_eager_test.py
rename to tensorflow/python/kernel_tests/accumulate_n_eager_test.py
index 35974b9e21d2d7423777a95a99f51c9cb4b453b2..dc11b7deceb9040584aca1f629f4d003aef39428 100644
--- a/tensorflow/contrib/framework/python/ops/accumulate_n_v2_eager_test.py
+++ b/tensorflow/python/kernel_tests/accumulate_n_eager_test.py
@@ -12,48 +12,41 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for new version of accumulate_n op that will eventually go into
-`ops.math_ops`.
-
-These test cases spefically exercise the `eager` APIs. They need to be in a
-separate file from the remaining tests because eager mode is currently something
-you can turn on but can't turn off for the lifetime of the current process."""
+"""Tests for new version of accumulate_n op."""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 import numpy as np
 
-from tensorflow.contrib.framework.python.ops import accumulate_n_v2 as av2
-
 from tensorflow.python.eager import backprop
 
 
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.platform import test
 
 
-
 class AccumulateNV2EagerTest(test_util.TensorFlowTestCase):
-  """Tests of the new, differentiable version of accumulate_n"""
+  """Tests of the new, differentiable version of accumulate_n."""
 
   def testMinimalEagerMode(self):
     forty = constant_op.constant(40)
     two = constant_op.constant(2)
-    answer = av2.accumulate_n_v2([forty, two])
+    answer = math_ops.accumulate_n([forty, two])
     self.assertEqual(42, answer.numpy())
 
-
   def testFloat(self):
     np.random.seed(12345)
     x = [np.random.random((1, 2, 3, 4, 5)) - 0.5 for _ in range(5)]
     tf_x = ops.convert_n_to_tensor(x)
     with self.test_session(use_gpu=True):
-      self.assertAllClose(sum(x), av2.accumulate_n_v2(tf_x).numpy())
-      self.assertAllClose(x[0] * 5, av2.accumulate_n_v2([tf_x[0]] * 5).numpy())
+      self.assertAllClose(sum(x), math_ops.accumulate_n(tf_x).numpy())
+      self.assertAllClose(x[0] * 5,
+                          math_ops.accumulate_n([tf_x[0]] * 5).numpy())
 
   def testGrad(self):
     np.random.seed(42)
@@ -65,16 +58,14 @@ class AccumulateNV2EagerTest(test_util.TensorFlowTestCase):
     ]
 
     def fn(first, second, third):
-      return av2.accumulate_n_v2([first, second, third])
+      return math_ops.accumulate_n([first, second, third])
 
     grad_fn = backprop.gradients_function(fn)
     grad = grad_fn(input_vars[0], input_vars[1], input_vars[2])
-    self.assertAllEqual(np.repeat(1.0, num_inputs), # d/dx (x + y + ...) = 1
+    self.assertAllEqual(np.repeat(1.0, num_inputs),  # d/dx (x + y + ...) = 1
                         [elem.numpy() for elem in grad])
 
 
-
 if __name__ == "__main__":
   ops.enable_eager_execution()
   test.main()
-
diff --git a/tensorflow/contrib/framework/python/ops/accumulate_n_v2_test.py b/tensorflow/python/kernel_tests/accumulate_n_test.py
similarity index 75%
rename from tensorflow/contrib/framework/python/ops/accumulate_n_v2_test.py
rename to tensorflow/python/kernel_tests/accumulate_n_test.py
index 45962098e93acfac414396ddbeaa847701ff2b4b..b793906fac2cd12a5c0c663dd169000ad6067759 100644
--- a/tensorflow/contrib/framework/python/ops/accumulate_n_v2_test.py
+++ b/tensorflow/python/kernel_tests/accumulate_n_test.py
@@ -12,42 +12,49 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for new version of accumulate_n op that will eventually go into
-`ops.math_ops`."""
+"""Tests for new version of accumulate_n op."""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 import numpy as np
 
-from tensorflow.contrib.framework.python.ops import accumulate_n_v2 as av2
-
 from tensorflow.python.framework import dtypes as dtypes_lib
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gradients
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import googletest
 
 
 class AccumulateNV2Test(test_util.TensorFlowTestCase):
-  """Tests of the new, differentiable version of accumulate_n"""
+  """Tests of the new, differentiable version of accumulate_n."""
 
   def testFloat(self):
     np.random.seed(12345)
     x = [np.random.random((1, 2, 3, 4, 5)) - 0.5 for _ in range(5)]
     tf_x = ops.convert_n_to_tensor(x)
     with self.test_session(use_gpu=True):
-      self.assertAllClose(sum(x), av2.accumulate_n_v2(tf_x).eval())
-      self.assertAllClose(x[0] * 5, av2.accumulate_n_v2([tf_x[0]] * 5).eval())
+      self.assertAllClose(sum(x), math_ops.accumulate_n(tf_x).eval())
+      self.assertAllClose(x[0] * 5,
+                          math_ops.accumulate_n([tf_x[0]] * 5).eval())
 
   def testInt(self):
     np.random.seed(54321)
     x = [np.random.randint(-128, 128, (5, 4, 3, 2, 1)) for _ in range(6)]
     tf_x = ops.convert_n_to_tensor(x)
     with self.test_session(use_gpu=True):
-      self.assertAllEqual(sum(x), av2.accumulate_n_v2(tf_x).eval())
-      self.assertAllEqual(x[0] * 6, av2.accumulate_n_v2([tf_x[0]] * 6).eval())
+      self.assertAllEqual(sum(x), math_ops.accumulate_n(tf_x).eval())
+      self.assertAllEqual(x[0] * 6,
+                          math_ops.accumulate_n([tf_x[0]] * 6).eval())
+
+  def testUnknownShape(self):
+    with self.test_session(use_gpu=True):
+      x0 = array_ops.placeholder(dtype=dtypes_lib.int32, shape=[None])
+      acc = math_ops.accumulate_n([x0, x0], shape=[None])
+      self.assertAllEqual([2, 4], acc.eval(feed_dict={x0: [1, 2]}))
 
   def testGrad(self):
     np.random.seed(42)
@@ -55,9 +62,9 @@ class AccumulateNV2Test(test_util.TensorFlowTestCase):
       with self.test_session(use_gpu=True) as sess:
         input_vars = [
             variables.Variable(10.0 * np.random.random())
-            for i in range(0, num_inputs)
+            for _ in range(0, num_inputs)
         ]
-        accum_n = av2.accumulate_n_v2(input_vars)
+        accum_n = math_ops.accumulate_n(input_vars)
         sess.run(variables.global_variables_initializer())
         accum_n_grad = gradients.gradients(accum_n, input_vars)
         self.assertAllEqual(
@@ -77,7 +84,7 @@ class AccumulateNV2Test(test_util.TensorFlowTestCase):
           ops.convert_to_tensor(x, dtype=dtypes_lib.float32)
           for x in random_arrays
       ]
-      tf_val = av2.accumulate_n_v2(random_tensors)
+      tf_val = math_ops.accumulate_n(random_tensors)
       np_val = random_arrays[0]
       for random_array in random_arrays[1:]:
         np_val += random_array
@@ -86,7 +93,7 @@ class AccumulateNV2Test(test_util.TensorFlowTestCase):
   def testZeroArgs(self):
     with self.test_session():
       with self.assertRaises(ValueError):
-        tf_val = av2.accumulate_n_v2([])
+        tf_val = math_ops.accumulate_n([])
         tf_val.eval()
 
   def testWrongShape(self):
@@ -94,28 +101,28 @@ class AccumulateNV2Test(test_util.TensorFlowTestCase):
       with self.assertRaises(ValueError):
         a = variables.Variable(0.2)
         b = variables.Variable(0.1)
-        tf_val = av2.accumulate_n_v2([a, b], shape=[2, 2])  # Should be shape=[]
+        math_ops.accumulate_n([a, b], shape=[2, 2])  # Should be shape=[]
 
   def testIncompatibleShapes(self):
     with self.test_session():
       with self.assertRaises(ValueError):
         a = variables.Variable(np.array([0.1, 0.2]))
         b = variables.Variable(np.array([[0.3], [0.4]]))
-        tf_val = av2.accumulate_n_v2([a, b])
+        math_ops.accumulate_n([a, b])
 
   def testWrongType(self):
     with self.test_session():
       with self.assertRaises(TypeError):
         a = variables.Variable(0.2, dtype=np.float32)
         b = variables.Variable(0.1, dtype=np.float32)
-        tf_val = av2.accumulate_n_v2([a, b], tensor_dtype=np.int32)
+        math_ops.accumulate_n([a, b], tensor_dtype=np.int32)
 
   def testWrongTypeOneInput(self):
     # Scenario that used to trigger a bug, even when testWrongType() worked
     with self.test_session():
       with self.assertRaises(TypeError):
         a = variables.Variable(0.2, dtype=np.float32)
-        tf_val = av2.accumulate_n_v2([a], tensor_dtype=np.int32)
+        math_ops.accumulate_n([a], tensor_dtype=np.int32)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/python/kernel_tests/array_ops_test.py b/tensorflow/python/kernel_tests/array_ops_test.py
index 365cf72108de5a1e5e1eb47891a6ad64151add22..d0ba8020c1eaee74ded5ad67cae39b51d44097bd 100644
--- a/tensorflow/python/kernel_tests/array_ops_test.py
+++ b/tensorflow/python/kernel_tests/array_ops_test.py
@@ -1024,6 +1024,7 @@ class SequenceMaskTest(test_util.TensorFlowTestCase):
           [[True, False, False, False, False], [True, True, True, False, False],
            [True, True, False, False, False]])
 
+  @test_util.enable_c_shapes
   def testOneDimensionalDtypeWithoutMaxlen(self):
     with self.test_session():
       # test dtype and default maxlen:
@@ -1037,6 +1038,7 @@ class SequenceMaskTest(test_util.TensorFlowTestCase):
           res.eval(),
           [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
 
+  @test_util.enable_c_shapes
   def testOneDimensionalWithoutMaxlen(self):
     with self.test_session():
       res = array_ops.sequence_mask(
@@ -1051,6 +1053,7 @@ class SequenceMaskTest(test_util.TensorFlowTestCase):
            [True, False, False, False],
            [True, True, True, True]])
 
+  @test_util.enable_c_shapes
   def testTwoDimensional(self):
     with self.test_session():
       res = array_ops.sequence_mask(constant_op.constant([[1, 3, 2]]), 5)
@@ -1223,7 +1226,7 @@ class SnapshotOpTest(test_util.TensorFlowTestCase):
     for dtype in [dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float64]:
       with self.test_session(use_gpu=True):
         x = constant_op.constant([0, 1, 2, 3], dtype=dtype)
-        y = gen_array_ops._snapshot(x)
+        y = gen_array_ops.snapshot(x)
         self.assertAllEqual(y.eval(), [0, 1, 2, 3])
 
 
diff --git a/tensorflow/python/kernel_tests/atrous_convolution_test.py b/tensorflow/python/kernel_tests/atrous_convolution_test.py
index 2d1b3d9b7e836591646a2d0e59742bf6139446d1..0ef08581c9f931b991ef0c1218dc503345e248c2 100644
--- a/tensorflow/python/kernel_tests/atrous_convolution_test.py
+++ b/tensorflow/python/kernel_tests/atrous_convolution_test.py
@@ -83,14 +83,14 @@ class AtrousConvolutionTest(test.TestCase):
     checks = []
 
     def add_check(check, *args, **kwargs):
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         args_val, kwargs_val = self.evaluate([args, kwargs])
         check(*args_val, **kwargs_val)
       else:
         checks.append((check, args, kwargs))
 
     yield add_check
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       all_values = self.evaluate([[args, kwargs] for _, args, kwargs in checks])
       for (check, _, _), (args, kwargs) in zip(checks, all_values):
         check(*args, **kwargs)
diff --git a/tensorflow/python/kernel_tests/basic_gpu_test.py b/tensorflow/python/kernel_tests/basic_gpu_test.py
index 405651e8ae97fbc5eefd4aba0a95a99ff8fd8c26..987a6ffcd4b18eb5857ff9e82206de7f6ebe8a27 100644
--- a/tensorflow/python/kernel_tests/basic_gpu_test.py
+++ b/tensorflow/python/kernel_tests/basic_gpu_test.py
@@ -33,7 +33,7 @@ from tensorflow.python.ops import gradient_checker
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variables
-from tensorflow.python.ops.gen_array_ops import _broadcast_gradient_args
+from tensorflow.python.ops.gen_array_ops import broadcast_gradient_args
 from tensorflow.python.platform import test
 
 
@@ -157,7 +157,7 @@ class BroadcastSimpleTest(test.TestCase):
 
   def _GetGradientArgs(self, xs, ys):
     with self.test_session(use_gpu=True) as sess:
-      return sess.run(_broadcast_gradient_args(xs, ys))
+      return sess.run(broadcast_gradient_args(xs, ys))
 
   def testBroadcast(self):
     r0, r1 = self._GetGradientArgs([2, 3, 5], [1])
diff --git a/tensorflow/python/kernel_tests/batchtospace_op_test.py b/tensorflow/python/kernel_tests/batchtospace_op_test.py
index 0c802476a0e788aff3de84ab736fa8f1de5daab4..6143cd3baa6317fc512d80f94b494710037d4082 100644
--- a/tensorflow/python/kernel_tests/batchtospace_op_test.py
+++ b/tensorflow/python/kernel_tests/batchtospace_op_test.py
@@ -44,7 +44,7 @@ class CppOpImpl(object):
 
   @staticmethod
   def batch_to_space(*args, **kwargs):
-    return gen_array_ops._batch_to_space(*args, **kwargs)
+    return gen_array_ops.batch_to_space(*args, **kwargs)
 
 
 class BatchToSpaceDepthToSpace(test.TestCase, PythonOpImpl):
diff --git a/tensorflow/python/kernel_tests/bcast_ops_test.py b/tensorflow/python/kernel_tests/bcast_ops_test.py
index 9e512346053a4c3af089170f47313606c4a307c2..3305e55c05bd03d31c46fd333db09dbab9a5d09c 100644
--- a/tensorflow/python/kernel_tests/bcast_ops_test.py
+++ b/tensorflow/python/kernel_tests/bcast_ops_test.py
@@ -20,8 +20,8 @@ from __future__ import print_function
 
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
-from tensorflow.python.ops.gen_array_ops import _broadcast_args
-from tensorflow.python.ops.gen_array_ops import _broadcast_gradient_args
+from tensorflow.python.ops.gen_array_ops import broadcast_args
+from tensorflow.python.ops.gen_array_ops import broadcast_gradient_args
 from tensorflow.python.platform import test
 
 
@@ -29,11 +29,11 @@ class BcastOpsTest(test.TestCase):
 
   def _GetBroadcastShape(self, xs, ys):
     with self.test_session() as sess:
-      return sess.run(_broadcast_args(xs, ys))
+      return sess.run(broadcast_args(xs, ys))
 
   def _GetGradientArgs(self, xs, ys):
     with self.test_session() as sess:
-      return sess.run(_broadcast_gradient_args(xs, ys))
+      return sess.run(broadcast_gradient_args(xs, ys))
 
   def testBasic(self):
     r = self._GetBroadcastShape([2, 3, 5], [1])
diff --git a/tensorflow/python/kernel_tests/check_ops_test.py b/tensorflow/python/kernel_tests/check_ops_test.py
index 2e94603a3f3d4ca9074320cfb4e9bf06b6640e82..5a83ec8d302b4c26aef7abfa7465eb9fd0cca019 100644
--- a/tensorflow/python/kernel_tests/check_ops_test.py
+++ b/tensorflow/python/kernel_tests/check_ops_test.py
@@ -102,17 +102,15 @@ class AssertEqualTest(test.TestCase):
     with self.assertRaisesRegexp(errors.InvalidArgumentError, "fail"):
       check_ops.assert_equal(static_big, static_small, message="fail")
 
-    # Dynamic check
-    if context.in_graph_mode():
-      with self.test_session():
-        small = array_ops.placeholder(dtypes.int32, name="small")
-        big = array_ops.placeholder(dtypes.int32, name="big")
-        with ops.control_dependencies(
-            [check_ops.assert_equal(
-                big, small, message="fail")]):
-          out = array_ops.identity(small)
-        with self.assertRaisesOpError("fail.*big.*small"):
-          out.eval(feed_dict={small: [1, 2], big: [3, 4]})
+  def test_raises_when_greater_dynamic(self):
+    with self.test_session():
+      small = array_ops.placeholder(dtypes.int32, name="small")
+      big = array_ops.placeholder(dtypes.int32, name="big")
+      with ops.control_dependencies(
+          [check_ops.assert_equal(big, small, message="fail")]):
+        out = array_ops.identity(small)
+      with self.assertRaisesOpError("fail.*big.*small"):
+        out.eval(feed_dict={small: [1, 2], big: [3, 4]})
 
   def test_error_message_eager(self):
     expected_error_msg_full = r"""big does not equal small
@@ -182,15 +180,14 @@ First 2 elements of y:
     with self.assertRaisesRegexp(errors.InvalidArgumentError, "fail"):
       check_ops.assert_equal(static_big, static_small, message="fail")
 
-    # Dynamic check
-    if context.in_graph_mode():
-      with self.test_session():
-        small = array_ops.placeholder(dtypes.int32, name="small")
-        big = array_ops.placeholder(dtypes.int32, name="big")
-        with ops.control_dependencies([check_ops.assert_equal(small, big)]):
-          out = array_ops.identity(small)
-        with self.assertRaisesOpError("small.*big"):
-          out.eval(feed_dict={small: [3, 1], big: [4, 2]})
+  def test_raises_when_less_dynamic(self):
+    with self.test_session():
+      small = array_ops.placeholder(dtypes.int32, name="small")
+      big = array_ops.placeholder(dtypes.int32, name="big")
+      with ops.control_dependencies([check_ops.assert_equal(small, big)]):
+        out = array_ops.identity(small)
+      with self.assertRaisesOpError("small.*big"):
+        out.eval(feed_dict={small: [3, 1], big: [4, 2]})
 
   @test_util.run_in_graph_and_eager_modes()
   def test_doesnt_raise_when_equal_and_broadcastable_shapes(self):
@@ -215,6 +212,12 @@ First 2 elements of y:
         out = array_ops.identity(small)
       self.evaluate(out)
 
+  @test_util.run_in_graph_and_eager_modes()
+  def test_raises_when_not_equal_and_broadcastable_shapes(self):
+    cond = constant_op.constant([True, False], name="small")
+    with self.assertRaisesRegexp(errors.InvalidArgumentError, "fail"):
+      check_ops.assert_equal(cond, False, message="fail")
+
   @test_util.run_in_graph_and_eager_modes()
   def test_doesnt_raise_when_both_empty(self):
     larry = constant_op.constant([])
diff --git a/tensorflow/python/kernel_tests/checkpoint_ops_test.py b/tensorflow/python/kernel_tests/checkpoint_ops_test.py
index a786d0a47e569f71812086fb93c21dc12660a2a5..7f147ba53a71539962f424158731e359724f664f 100644
--- a/tensorflow/python/kernel_tests/checkpoint_ops_test.py
+++ b/tensorflow/python/kernel_tests/checkpoint_ops_test.py
@@ -50,7 +50,7 @@ class GenerateVocabRemappingTest(test.TestCase):
 
   def test_generate_remapping_with_no_vocab_changes(self):
     """Tests where vocab does not change at all."""
-    remapping, num_present = gen_checkpoint_ops._generate_vocab_remapping(
+    remapping, num_present = gen_checkpoint_ops.generate_vocab_remapping(
         new_vocab_file=self.old_vocab_file,
         old_vocab_file=self.old_vocab_file,
         num_new_vocab=3,
@@ -63,7 +63,7 @@ class GenerateVocabRemappingTest(test.TestCase):
 
   def test_generate_remapping_with_shifted_vocab(self):
     """Tests where vocab is the same, but shifted / ordered differently."""
-    remapping, num_present = gen_checkpoint_ops._generate_vocab_remapping(
+    remapping, num_present = gen_checkpoint_ops.generate_vocab_remapping(
         new_vocab_file=self.new_vocab_file,
         old_vocab_file=self.old_vocab_file,
         num_new_vocab=3,
@@ -76,7 +76,7 @@ class GenerateVocabRemappingTest(test.TestCase):
 
   def test_generate_remapping_with_offset(self):
     """Tests offset and num_new_vocab logic."""
-    remapping, num_present = gen_checkpoint_ops._generate_vocab_remapping(
+    remapping, num_present = gen_checkpoint_ops.generate_vocab_remapping(
         new_vocab_file=self.new_vocab_file,
         old_vocab_file=self.old_vocab_file,
         num_new_vocab=1,
@@ -89,7 +89,7 @@ class GenerateVocabRemappingTest(test.TestCase):
 
   def test_generate_remapping_with_old_vocab_size(self):
     """Tests where old_vocab_size is specified."""
-    remapping, num_present = gen_checkpoint_ops._generate_vocab_remapping(
+    remapping, num_present = gen_checkpoint_ops.generate_vocab_remapping(
         new_vocab_file=self.new_vocab_file,
         old_vocab_file=self.old_vocab_file,
         num_new_vocab=3,
@@ -132,7 +132,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
 
     # No column remapping, new weight matrix has second row, then first row.
     row_remapping = [1, 0]
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=row_remapping,
@@ -147,7 +147,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     # No row remapping, new weight matrix has third col, then first col.
     row_remapping = list(range(self.old_num_rows))
     col_remapping = [2, 0]
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=row_remapping,
@@ -162,7 +162,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     # Both row and column remappings.
     row_remapping = [1, 0, 4]
     col_remapping = [1, 15]
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=row_remapping,
@@ -177,7 +177,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
   def test_load_and_remap_with_init(self):
     """Tests the op's load and remap where there are missing entries."""
     init_val = 42
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=[2, -1, 0],
@@ -196,7 +196,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     """Tests when all the rows are missing and need to be initialized."""
     num_rows = 7
     initializing_values = [42] * num_rows * self.old_num_cols
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=[-1] * num_rows,
@@ -214,7 +214,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     num_rows = 7
     num_cols = 4
     initializing_values = [42] * num_rows * num_cols
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=[-1] * num_rows,
@@ -235,7 +235,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     invalid_remapping = [1, 0, 0, 0, 1, 2]
 
     # Invalid row remapping.
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=invalid_remapping,
@@ -247,7 +247,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
       remapped_matrix.eval()
 
     # Invalid column remapping.
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=list(range(self.old_num_rows)),
@@ -260,7 +260,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
 
   def test_load_and_remap_incorrect_initializing_values(self):
     """Tests that errors are raised with incorrect number of init values."""
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=[2, -1, 0],
@@ -275,7 +275,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     with self.test_session(), self.assertRaises(errors.InvalidArgumentError):
       remapped_matrix.eval()
 
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=[2, -1, 0],
@@ -314,7 +314,7 @@ class LoadAndRemapMatrixWithMaxRowsTest(test.TestCase):
       num_rows, num_cols = np_value.shape
 
       # Tests loading the entire tensor (except reversed).
-      remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+      remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
           ckpt_path=ckpt_path,
           old_tensor_name=old_tensor_name,
           # Simply reverses the rows of the matrix.
@@ -332,7 +332,7 @@ class LoadAndRemapMatrixWithMaxRowsTest(test.TestCase):
       self.assertGreater(num_rows, 2)
       prefix_rows = 2
       suffix_rows = 3
-      remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+      remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
           ckpt_path=ckpt_path,
           old_tensor_name=old_tensor_name,
           # Reverses the rows of the matrix, then prepends and appends
@@ -353,7 +353,7 @@ class LoadAndRemapMatrixWithMaxRowsTest(test.TestCase):
       # Tests when everything is taken from initializing_values.
       new_rows = 7
       initializing_values = [42] * new_rows * num_cols
-      remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+      remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
           ckpt_path=ckpt_path,
           old_tensor_name=old_tensor_name,
           # Nothing is loaded from the old tensor.
diff --git a/tensorflow/python/kernel_tests/concat_op_test.py b/tensorflow/python/kernel_tests/concat_op_test.py
index 127bc6bb20ae6b415da94672de68cc4b8ceaa287..c22934ce47543ab11b6a5b9acde2e2ec3aec9da7 100644
--- a/tensorflow/python/kernel_tests/concat_op_test.py
+++ b/tensorflow/python/kernel_tests/concat_op_test.py
@@ -526,7 +526,7 @@ class ConcatOpTest(test.TestCase):
     with self.test_session(use_gpu=True):
       t1 = []
       t2 = []
-      output = gen_array_ops._concat_v2([t1, t2], 0).eval()
+      output = gen_array_ops.concat_v2([t1, t2], 0).eval()
       self.assertFalse(output)  # Checks that output is empty
 
   def testConcatInvalidAxis(self):
@@ -534,20 +534,20 @@ class ConcatOpTest(test.TestCase):
       with self.test_session(use_gpu=True):
         t1 = [1]
         t2 = [2]
-        gen_array_ops._concat_v2([t1, t2], 1).eval()
+        gen_array_ops.concat_v2([t1, t2], 1).eval()
 
   def testConcatNegativeAxis(self):
     with self.test_session(use_gpu=True):
       t1 = [[1, 2, 3], [4, 5, 6]]
       t2 = [[7, 8, 9], [10, 11, 12]]
 
-      c = gen_array_ops._concat_v2([t1, t2], -2)
+      c = gen_array_ops.concat_v2([t1, t2], -2)
       self.assertEqual([4, 3], c.get_shape().as_list())
       output = c.eval()
       self.assertAllEqual([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                           output)
 
-      c = gen_array_ops._concat_v2([t1, t2], -1)
+      c = gen_array_ops.concat_v2([t1, t2], -1)
       self.assertEqual([2, 6], c.get_shape().as_list())
       output = c.eval()
       self.assertAllEqual([[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]], output)
@@ -606,6 +606,17 @@ class ConcatOpTest(test.TestCase):
           inp_tensors_placeholders, -2, output_shape=[2, 3],
           gather_indexes=[2, 0], feed_dict=feed_dict)
 
+  def testConcatAxisType(self):
+    for dtype in [dtypes.int32, dtypes.int64]:
+      with self.test_session(use_gpu=True):
+        t1 = [[1, 2, 3], [4, 5, 6]]
+        t2 = [[7, 8, 9], [10, 11, 12]]
+
+        c = gen_array_ops.concat_v2([t1, t2],
+                                    constant_op.constant(1, dtype=dtype))
+        self.assertEqual([2, 6], c.get_shape().as_list())
+        output = c.eval()
+        self.assertAllEqual([[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]], output)
 
 class ConcatOffsetTest(test.TestCase):
 
@@ -615,7 +626,7 @@ class ConcatOffsetTest(test.TestCase):
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([2, 7, 5], dtypes.int32)
       s2 = constant_op.constant([2, 20, 5], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1, s2])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1, s2])
       ans = sess.run(off)
       self.assertAllEqual(ans, [[0, 0, 0], [0, 3, 0], [0, 10, 0]])
 
@@ -624,7 +635,7 @@ class ConcatOffsetTest(test.TestCase):
       cdim = constant_op.constant(1, dtypes.int32)
       s0 = constant_op.constant([[2, 3, 5]], dtypes.int32)
       s1 = constant_op.constant([[2, 7, 5]], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1])
       with self.assertRaisesRegexp(errors_impl.InvalidArgumentError,
                                    r"should be a vector"):
         sess.run(off)
@@ -634,7 +645,7 @@ class ConcatOffsetTest(test.TestCase):
       cdim = constant_op.constant(4, dtypes.int32)
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([2, 7, 5], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1])
       with self.assertRaisesRegexp(errors_impl.InvalidArgumentError,
                                    r"Concat dim is out of range: 4 vs. 3"):
         sess.run(off)
@@ -644,7 +655,7 @@ class ConcatOffsetTest(test.TestCase):
       cdim = constant_op.constant(1, dtypes.int32)
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([2, 7, 5, 10], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1])
       with self.assertRaisesRegexp(errors_impl.InvalidArgumentError,
                                    r"should contain 3 elem"):
         sess.run(off)
@@ -654,7 +665,7 @@ class ConcatOffsetTest(test.TestCase):
       cdim = constant_op.constant(1, dtypes.int32)
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([2, 7, 10], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1])
       with self.assertRaisesRegexp(
           errors_impl.InvalidArgumentError,
           r"All dimensions except 1 must match. Input 1 has shape \[2 7 10\] "
@@ -667,7 +678,7 @@ class ConcatOffsetTest(test.TestCase):
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([2, 7, 5], dtypes.int32)
       s2 = constant_op.constant([2, 20, 5], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1, s2])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1, s2])
       ans = sess.run(off)
       self.assertAllEqual(ans, [[0, 0, 0], [0, 3, 0], [0, 10, 0]])
 
@@ -675,7 +686,7 @@ class ConcatOffsetTest(test.TestCase):
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([1, 3, 5], dtypes.int32)
       s2 = constant_op.constant([3, 3, 5], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1, s2])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1, s2])
       ans = sess.run(off)
       self.assertAllEqual(ans, [[0, 0, 0], [2, 0, 0], [3, 0, 0]])
 
diff --git a/tensorflow/python/kernel_tests/constant_op_test.py b/tensorflow/python/kernel_tests/constant_op_test.py
index 16e56349c45dd56a335f6f881826d975e24bd110..18796f709566f022258806ce46cc706e8fe34354 100644
--- a/tensorflow/python/kernel_tests/constant_op_test.py
+++ b/tensorflow/python/kernel_tests/constant_op_test.py
@@ -30,6 +30,7 @@ from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import importer
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
+from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gradient_checker
 from tensorflow.python.ops import logging_ops
@@ -180,6 +181,11 @@ class ConstantTest(test.TestCase):
           shape=[2, 3, 5])
     self.assertEqual(c.get_shape(), [2, 3, 5])
 
+  @test_util.assert_no_new_pyobjects_executing_eagerly
+  def testEagerMemory(self):
+    """Tests PyObject refs are managed correctly when executing eagerly."""
+    constant_op.constant([[1.]])
+
   def testImplicitShapeNumPy(self):
     with ops.Graph().as_default():
       c = constant_op.constant(
@@ -875,7 +881,7 @@ versions {
 class PlaceholderWithDefaultTest(test.TestCase):
 
   def testFullShape(self):
-    with self.test_session():
+    with self.test_session(force_gpu=test_util.is_gpu_available()):
       p = array_ops.placeholder_with_default([[2, 2], [2, 2]], shape=[2, 2])
       a = array_ops.identity(p)
       self.assertAllEqual([[2, 2], [2, 2]], a.eval())
@@ -886,7 +892,7 @@ class PlaceholderWithDefaultTest(test.TestCase):
         a.eval(feed_dict={p: [[6, 6, 6], [6, 6, 6]]})
 
   def testPartialShape(self):
-    with self.test_session():
+    with self.test_session(force_gpu=test_util.is_gpu_available()):
       p = array_ops.placeholder_with_default([1, 2, 3], shape=[None])
       a = array_ops.identity(p)
       self.assertAllEqual([1, 2, 3], a.eval())
@@ -896,7 +902,7 @@ class PlaceholderWithDefaultTest(test.TestCase):
         a.eval(feed_dict={p: [[2, 2], [2, 2]]})
 
   def testNoShape(self):
-    with self.test_session():
+    with self.test_session(force_gpu=test_util.is_gpu_available()):
       p = array_ops.placeholder_with_default([17], shape=None)
       a = array_ops.identity(p)
       self.assertAllEqual([17], a.eval())
@@ -905,11 +911,12 @@ class PlaceholderWithDefaultTest(test.TestCase):
           [[3, 3], [3, 3]], a.eval(feed_dict={p: [[3, 3], [3, 3]]}))
 
   def testGradient(self):
-    with self.test_session():
+    with self.test_session(force_gpu=test_util.is_gpu_available()):
       x = array_ops.placeholder(dtypes_lib.float32, [5, 7])
       y = array_ops.placeholder_with_default(x, None)
       err = gradient_checker.compute_gradient_error(x, [5, 7], y, [5, 7])
       self.assertLess(err, 1e-3)
 
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/control_flow_ops_py_test.py b/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
index 58f38650eb526e98edf35b2425e0e9e1296ab353..75f8644f694c4cebb7dbdac4599244dda427bc05 100644
--- a/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
+++ b/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
@@ -144,7 +144,7 @@ class ControlFlowTest(test.TestCase):
 
       enter_v = control_flow_ops._Enter(v, "foo_1", is_constant=True)
       nine = constant_op.constant(9)
-      enter_nine = gen_control_flow_ops._enter(nine, "foo_1")
+      enter_nine = gen_control_flow_ops.enter(nine, "foo_1")
       op = state_ops.assign(enter_v, enter_nine)
       v2 = control_flow_ops.with_dependencies([op], enter_v)
       v3 = control_flow_ops.exit(v2)
@@ -164,9 +164,9 @@ class ControlFlowTest(test.TestCase):
   def testEnterMulExit(self):
     with self.test_session():
       data = constant_op.constant([1, 2, 3, 4, 5, 6], name="data")
-      enter_data = gen_control_flow_ops._enter(data, "foo_1", False)
+      enter_data = gen_control_flow_ops.enter(data, "foo_1", False)
       five = constant_op.constant(5)
-      enter_five = gen_control_flow_ops._enter(five, "foo_1", False)
+      enter_five = gen_control_flow_ops.enter(five, "foo_1", False)
       mul_op = math_ops.multiply(enter_data, enter_five)
       exit_op = control_flow_ops.exit(mul_op)
 
@@ -178,12 +178,12 @@ class ControlFlowTest(test.TestCase):
       v = variables.Variable([0.0, 0.0], dtype=dtypes.float32)
 
       # If is_constant=True, the shape information should be propagated.
-      enter_v_constant = gen_control_flow_ops._enter(
+      enter_v_constant = gen_control_flow_ops.enter(
           v, "frame1", is_constant=True)
       self.assertEqual(enter_v_constant.shape, [2])
 
       # Otherwise, the shape should be unknown.
-      enter_v_non_constant = gen_control_flow_ops._enter(
+      enter_v_non_constant = gen_control_flow_ops.enter(
           v, "frame2", is_constant=False)
       self.assertEqual(enter_v_non_constant.shape, None)
 
@@ -257,8 +257,8 @@ class ControlFlowTest(test.TestCase):
       false = ops.convert_to_tensor(False)
       n = constant_op.constant(10)
 
-      enter_false = gen_control_flow_ops._enter(false, "foo_1", False)
-      enter_n = gen_control_flow_ops._enter(n, "foo_1", False)
+      enter_false = gen_control_flow_ops.enter(false, "foo_1", False)
+      enter_n = gen_control_flow_ops.enter(n, "foo_1", False)
 
       merge_n = control_flow_ops.merge([enter_n, enter_n], name="merge_n")[0]
       switch_n = control_flow_ops.switch(merge_n, enter_false)
@@ -275,9 +275,9 @@ class ControlFlowTest(test.TestCase):
       one = constant_op.constant(1)
       n = constant_op.constant(10)
 
-      enter_i = gen_control_flow_ops._enter(zero, "foo", False)
-      enter_one = gen_control_flow_ops._enter(one, "foo", True)
-      enter_n = gen_control_flow_ops._enter(n, "foo", True)
+      enter_i = gen_control_flow_ops.enter(zero, "foo", False)
+      enter_one = gen_control_flow_ops.enter(one, "foo", True)
+      enter_n = gen_control_flow_ops.enter(n, "foo", True)
 
       with ops.device(test.gpu_device_name()):
         merge_i = control_flow_ops.merge([enter_i, enter_i])[0]
@@ -301,9 +301,9 @@ class ControlFlowTest(test.TestCase):
       one = constant_op.constant(1)
       n = constant_op.constant(10)
 
-      enter_i = gen_control_flow_ops._enter(zero, "foo", False)
-      enter_one = gen_control_flow_ops._enter(one, "foo", True)
-      enter_n = gen_control_flow_ops._enter(n, "foo", True)
+      enter_i = gen_control_flow_ops.enter(zero, "foo", False)
+      enter_one = gen_control_flow_ops.enter(one, "foo", True)
+      enter_n = gen_control_flow_ops.enter(n, "foo", True)
 
       merge_i = control_flow_ops.merge([enter_i, enter_i])[0]
 
@@ -324,8 +324,8 @@ class ControlFlowTest(test.TestCase):
   def testDifferentFrame(self):
     with self.test_session():
       data = array_ops.placeholder(dtypes.float32, shape=[])
-      enter_1 = gen_control_flow_ops._enter(data, "foo_1", False)
-      enter_2 = gen_control_flow_ops._enter(data, "foo_2", False)
+      enter_1 = gen_control_flow_ops.enter(data, "foo_1", False)
+      enter_2 = gen_control_flow_ops.enter(data, "foo_2", False)
       res = math_ops.add(enter_1, enter_2)
       with self.assertRaisesOpError("has inputs from different frames"):
         res.eval(feed_dict={data: 1.0})
@@ -552,7 +552,7 @@ class ControlFlowTest(test.TestCase):
 
   def testCondRef(self):
     with self.test_session():
-      x = gen_state_ops._variable(
+      x = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="x",
@@ -580,7 +580,7 @@ class ControlFlowTest(test.TestCase):
 
   def testUninitializedRefIdentity(self):
     with self.test_session() as sess:
-      v = gen_state_ops._variable(
+      v = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="v",
@@ -591,10 +591,10 @@ class ControlFlowTest(test.TestCase):
       # Both v_f and v_t are uninitialized references. However, an actual use
       # of the reference in the 'true' branch in the 'tf.identity' op will
       # not 'fire' when v is uninitialized, so this is a valid construction.
-      # This test tests that _ref_identity allows uninitialized ref as input
+      # This test verifies that ref_identity allows uninitialized ref as input
       # so that this construction is allowed.
-      v_f_op = gen_array_ops._ref_identity(v_f)
-      v_t_op = gen_array_ops._ref_identity(v_t)
+      v_f_op = gen_array_ops.ref_identity(v_f)
+      v_t_op = gen_array_ops.ref_identity(v_t)
       with ops.control_dependencies([v_f_op]):
         assign_v = state_ops.assign(v, [1.0])
       with ops.control_dependencies([v_t_op]):
@@ -633,7 +633,8 @@ class ControlFlowTest(test.TestCase):
       sess.run(r)
 
   def testCondGrad_1(self):
-    with self.test_session():
+    graph = ops.Graph()
+    with graph.as_default():
       x = constant_op.constant(10.0, name="x")
       pred = math_ops.less(1, 2)
       fn1 = lambda: array_ops.identity(x)
@@ -641,8 +642,14 @@ class ControlFlowTest(test.TestCase):
       r = control_flow_ops.cond(pred, fn1, fn2)
 
       grad = gradients_impl.gradients(r, [x])[0]
-      result = grad.eval()
-    self.assertAllEqual(1.0, result)
+      with self.test_session():
+        self.assertAllEqual(1.0, grad.eval())
+    # The gradients computation creates a tensor with zeros by broadcasting a
+    # zeros constant to the required shape. Verify that the zero constant
+    # feeding into the fill is dominated by a Switch.
+    zero = graph.get_operation_by_name("gradients/zeros/Const")
+    self.assertEqual(len(zero.control_inputs), 1)
+    self.assertEqual(zero.control_inputs[0].type, "Switch")
 
   def testCondGrad_2(self):
     with self.test_session():
@@ -744,7 +751,7 @@ class ControlFlowTest(test.TestCase):
 
       def b(i, x):
         self.assertEqual(x.dtype, dtypes.int32_ref)
-        return (i + 1, gen_array_ops._ref_identity(x))
+        return (i + 1, gen_array_ops.ref_identity(x))
 
       r = control_flow_ops.while_loop(c, b, [i, x], parallel_iterations=5)
 
@@ -1620,7 +1627,7 @@ class ControlFlowTest(test.TestCase):
 
   def testWhileStack_1(self):
     with self.test_session():
-      s = gen_data_flow_ops._stack_v2(-1, dtypes.int32, stack_name="foo")
+      s = gen_data_flow_ops.stack_v2(-1, dtypes.int32, stack_name="foo")
       i = constant_op.constant(0)
 
       def c(i):
@@ -1629,7 +1636,7 @@ class ControlFlowTest(test.TestCase):
       def b(i):
         ni = math_ops.add(i, 1)
         ni = control_flow_ops.with_dependencies(
-            [gen_data_flow_ops._stack_push_v2(s, i)], ni)
+            [gen_data_flow_ops.stack_push_v2(s, i)], ni)
         return ni
 
       r = control_flow_ops.while_loop(c, b, [i], parallel_iterations=1)
@@ -1641,7 +1648,7 @@ class ControlFlowTest(test.TestCase):
 
       def b1(i, x):
         ni = math_ops.subtract(i, 1)
-        nx = x + gen_data_flow_ops._stack_pop_v2(s, dtypes.int32)
+        nx = x + gen_data_flow_ops.stack_pop_v2(s, dtypes.int32)
         return [ni, nx]
 
       _, rx = control_flow_ops.while_loop(
@@ -2205,12 +2212,9 @@ class ControlFlowTest(test.TestCase):
 
       self.assertEqual(x.dtype, dtypes.int32_ref)
 
-      # pylint: disable=protected-access
       def body(i, x):
         self.assertEqual(x.dtype, dtypes.int32_ref)
-        return [i + 1, gen_array_ops._ref_identity(x)]
-
-      # pylint: enable=protected-access
+        return [i + 1, gen_array_ops.ref_identity(x)]
 
       r = control_flow_ops.while_loop(c, body, [i, x], parallel_iterations=5)
 
diff --git a/tensorflow/python/kernel_tests/control_flow_util_test.py b/tensorflow/python/kernel_tests/control_flow_util_test.py
index 23185eaeece0d56fd83ecdf9e02c778712420465..39e96f74b0461da0cf499e303b30a4a41aae4899 100644
--- a/tensorflow/python/kernel_tests/control_flow_util_test.py
+++ b/tensorflow/python/kernel_tests/control_flow_util_test.py
@@ -41,17 +41,17 @@ class ControlFlowUtilTest(test.TestCase):
     self.assertFalse(control_flow_util.IsSwitch(test_ops.int_output().op))
 
   def testIsLoopEnter(self):
-    enter = gen_control_flow_ops._enter(1, frame_name="name").op
+    enter = gen_control_flow_ops.enter(1, frame_name="name").op
     self.assertTrue(control_flow_util.IsLoopEnter(enter))
     self.assertFalse(control_flow_util.IsLoopConstantEnter(enter))
 
-    ref_enter = gen_control_flow_ops._ref_enter(test_ops.ref_output(),
-                                                frame_name="name").op
+    ref_enter = gen_control_flow_ops.ref_enter(test_ops.ref_output(),
+                                               frame_name="name").op
     self.assertTrue(control_flow_util.IsLoopEnter(ref_enter))
     self.assertFalse(control_flow_util.IsLoopConstantEnter(ref_enter))
 
-    const_enter = gen_control_flow_ops._enter(1, frame_name="name",
-                                              is_constant=True).op
+    const_enter = gen_control_flow_ops.enter(1, frame_name="name",
+                                             is_constant=True).op
     self.assertTrue(control_flow_util.IsLoopEnter(const_enter))
     self.assertTrue(control_flow_util.IsLoopConstantEnter(const_enter))
 
diff --git a/tensorflow/python/kernel_tests/conv_ops_test.py b/tensorflow/python/kernel_tests/conv_ops_test.py
index f4fe01f868da25660171c614bbf84390aead3ade..a291bef0ad6f16184ff29f665457a53b77447d54 100644
--- a/tensorflow/python/kernel_tests/conv_ops_test.py
+++ b/tensorflow/python/kernel_tests/conv_ops_test.py
@@ -159,11 +159,11 @@ class Conv2DTest(test.TestCase):
 
   def _DtypesToTest(self, use_gpu):
     if use_gpu and not test_util.CudaSupportsHalfMatMulAndConv():
-      return [dtypes.float32]
+      return [dtypes.float32, dtypes.float64]
     else:
       # It is important that float32 comes before float16 here,
       # as we will be using its gradients as reference for fp16 gradients.
-      return [dtypes.float32, dtypes.float16]
+      return [dtypes.float32, dtypes.float16, dtypes.float64]
 
   def _SetupValuesForDevice(self, tensor_in_sizes, filter_in_sizes, dilations,
                             strides, padding, data_format, dtype, use_gpu):
@@ -970,7 +970,7 @@ class Conv2DTest(test.TestCase):
       self.assertArrayNear(value_2.flatten(), value.flatten(), err)
 
   def testConv2D2x2Depth3ValidBackpropFilterStride1x1Dilation2x1(self):
-    if test.is_gpu_available(cuda_only=True):
+    if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
       for (data_format, use_gpu) in GetTestConfigs():
         self._RunAndVerifyBackpropFilterDilation(
             input_sizes=[1, 3, 6, 1],
@@ -984,7 +984,7 @@ class Conv2DTest(test.TestCase):
             err=1e-5)
 
   def testConv2D2x2Depth1ValidBackpropFilterDilation1x2(self):
-    if test.is_gpu_available(cuda_only=True):
+    if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
       for (data_format, use_gpu) in GetTestConfigs():
         self._RunAndVerifyBackpropFilterDilation(
             input_sizes=[1, 2, 3, 1],
@@ -998,7 +998,7 @@ class Conv2DTest(test.TestCase):
             err=1e-5)
 
   def testConv2DEmptyBackpropFilterDilation1x2(self):
-    if test.is_gpu_available(cuda_only=True):
+    if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
       for (data_format, use_gpu) in GetTestConfigs():
         self._RunAndVerifyBackpropFilterDilation(
             input_sizes=[1, 2, 3, 1],
@@ -1012,7 +1012,7 @@ class Conv2DTest(test.TestCase):
             err=1e-5)
 
   def testConv2D2x2Depth3ValidBackpropFilterDilation2x2(self):
-    if test.is_gpu_available(cuda_only=True):
+    if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
       for (data_format, use_gpu) in GetTestConfigs():
         self._RunAndVerifyBackpropFilterDilation(
             input_sizes=[1, 3, 4, 3],
@@ -1026,7 +1026,7 @@ class Conv2DTest(test.TestCase):
             err=1e-5)
 
   def testConv2DKernelSizeMatchesInputSizeBackpropFilterDilation2x2(self):
-    if test.is_gpu_available(cuda_only=True):
+    if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
       for (data_format, use_gpu) in GetTestConfigs():
         self._RunAndVerifyBackpropFilterDilation(
             input_sizes=[1, 3, 3, 1],
@@ -1040,7 +1040,7 @@ class Conv2DTest(test.TestCase):
             err=1e-5)
 
   def testConv2D2x2Depth3ValidBackpropInputStride1x1Dilation2x1(self):
-    if test.is_gpu_available(cuda_only=True):
+    if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
       for (data_format, use_gpu) in GetTestConfigs():
         self._RunAndVerifyBackpropInputDilation(
             input_sizes=[1, 3, 6, 1],
@@ -1054,7 +1054,7 @@ class Conv2DTest(test.TestCase):
             err=1e-5)
 
   def testConv2D2x2Depth1ValidBackpropInputDilation1x2(self):
-    if test.is_gpu_available(cuda_only=True):
+    if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
       for (data_format, use_gpu) in GetTestConfigs():
         self._RunAndVerifyBackpropInputDilation(
             input_sizes=[1, 2, 3, 1],
@@ -1068,7 +1068,7 @@ class Conv2DTest(test.TestCase):
             err=1e-5)
 
   def testConv2DEmptyBackpropInputDilation1x2(self):
-    if test.is_gpu_available(cuda_only=True):
+    if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
       for (data_format, use_gpu) in GetTestConfigs():
         self._RunAndVerifyBackpropInputDilation(
             input_sizes=[0, 2, 3, 1],
@@ -1082,7 +1082,7 @@ class Conv2DTest(test.TestCase):
             err=1e-5)
 
   def testConv2D2x2Depth3ValidBackpropInputDilation2x1(self):
-    if test.is_gpu_available(cuda_only=True):
+    if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
       for (data_format, use_gpu) in GetTestConfigs():
         # The GPU version of this test is not very stable. So adjusting the
         # error threshold to 1e-4.
@@ -1098,7 +1098,7 @@ class Conv2DTest(test.TestCase):
             err=1e-4)
 
   def testConv2DKernelSizeMatchesInputSizeBackpropInputDilation2x2(self):
-    if test.is_gpu_available(cuda_only=True):
+    if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
       for (data_format, use_gpu) in GetTestConfigs():
         self._RunAndVerifyBackpropInputDilation(
             input_sizes=[1, 3, 3, 1],
diff --git a/tensorflow/python/kernel_tests/cwise_ops_test.py b/tensorflow/python/kernel_tests/cwise_ops_test.py
index 0d9b46c30dbbed20dd940e0427fbf6f6d5415106..8db0bb6f0dc495e7be2cd717787acf87156f42af 100644
--- a/tensorflow/python/kernel_tests/cwise_ops_test.py
+++ b/tensorflow/python/kernel_tests/cwise_ops_test.py
@@ -495,11 +495,11 @@ class UnaryOpTest(test.TestCase):
     dtype_tols = [(np.float32, 5e-4), (np.float64, 1e-6), (np.complex64, 5e-4),
                   (np.complex128, 1e-6)]
     op_range = [
-        (gen_math_ops._reciprocal_grad, [-2, 2]),
-        (gen_math_ops._rsqrt_grad, [0.1, 3]),
-        (gen_math_ops._sigmoid_grad, [-2, 2]),
-        (gen_math_ops._sqrt_grad, [0.1, 3]),
-        (gen_math_ops._tanh_grad, [-2, 2]),
+        (gen_math_ops.reciprocal_grad, [-2, 2]),
+        (gen_math_ops.rsqrt_grad, [0.1, 3]),
+        (gen_math_ops.sigmoid_grad, [-2, 2]),
+        (gen_math_ops.sqrt_grad, [0.1, 3]),
+        (gen_math_ops.tanh_grad, [-2, 2]),
     ]
 
     def rand(dtype):
diff --git a/tensorflow/python/kernel_tests/depthtospace_op_test.py b/tensorflow/python/kernel_tests/depthtospace_op_test.py
index 7df2366954f3a6f3f37aef447479ba67c263025f..f0beabb4e20e4ec0a2fc7a487bf2541d19568927 100644
--- a/tensorflow/python/kernel_tests/depthtospace_op_test.py
+++ b/tensorflow/python/kernel_tests/depthtospace_op_test.py
@@ -35,8 +35,8 @@ from tensorflow.python.platform import tf_logging
 
 class DepthToSpaceTest(test.TestCase):
 
-  def _testOne(self, inputs, block_size, outputs):
-    input_nhwc = math_ops.to_float(inputs)
+  def _testOne(self, inputs, block_size, outputs, dtype=dtypes.float32):
+    input_nhwc = math_ops.cast(inputs, dtype)
     with self.test_session(use_gpu=False):
       # test NHWC (default) on CPU
       x_tf = array_ops.depth_to_space(input_nhwc, block_size)
@@ -59,6 +59,12 @@ class DepthToSpaceTest(test.TestCase):
     x_out = [[[[1], [2]], [[3], [4]]]]
     self._testOne(x_np, block_size, x_out)
 
+  def testBasicFloat16(self):
+    x_np = [[[[1, 2, 3, 4]]]]
+    block_size = 2
+    x_out = [[[[1], [2]], [[3], [4]]]]
+    self._testOne(x_np, block_size, x_out, dtype=dtypes.float16)
+
   # Tests for larger input dimensions. To make sure elements are
   # correctly ordered spatially.
   def testBlockSize2(self):
@@ -90,6 +96,24 @@ class DepthToSpaceTest(test.TestCase):
     x_out = [batch_output_elt(i) for i in range(batch_size)]
     self._testOne(x_np, block_size, x_out)
 
+  def testBatchSize0(self):
+    block_size = 2
+    batch_size = 0
+    input_nhwc = array_ops.ones([batch_size, 2, 3, 12])
+    x_out = array_ops.ones([batch_size, 4, 6, 3])
+
+    with self.test_session(use_gpu=False):
+      # test NHWC (default) on CPU
+      x_tf = array_ops.depth_to_space(input_nhwc, block_size)
+      self.assertAllEqual(x_tf.shape, x_out.shape)
+      x_tf.eval()
+    if test.is_gpu_available():
+      with self.test_session(use_gpu=True):
+        # test NHWC (default) on GPU
+        x_tf = array_ops.depth_to_space(input_nhwc, block_size)
+        self.assertAllEqual(x_tf.shape, x_out.shape)
+        x_tf.eval()
+
   # Tests for different width and height.
   def testNonSquare(self):
     x_np = [[[[1, 10, 2, 20, 3, 30, 4, 40]],
diff --git a/tensorflow/python/kernel_tests/determinant_op_test.py b/tensorflow/python/kernel_tests/determinant_op_test.py
index 222038b22ef3c766efd14fd9b1c9044a0b6e9125..a52b2c0dc32c26ecd5ef08aa3f8678f0006cd4fe 100644
--- a/tensorflow/python/kernel_tests/determinant_op_test.py
+++ b/tensorflow/python/kernel_tests/determinant_op_test.py
@@ -65,7 +65,7 @@ class DeterminantOpTest(test.TestCase):
       self._compareDeterminantBase(matrix_x,
                                    linalg_ops.matrix_determinant(matrix_x))
       self._compareLogDeterminantBase(
-          matrix_x, gen_linalg_ops._log_matrix_determinant(matrix_x))
+          matrix_x, gen_linalg_ops.log_matrix_determinant(matrix_x))
 
   def testBasic(self):
     # 2x2 matrices
diff --git a/tensorflow/python/kernel_tests/fractional_avg_pool_op_test.py b/tensorflow/python/kernel_tests/fractional_avg_pool_op_test.py
index feec9934e459590bb1dd0bc5c7cf40013d3d8b88..faac7d8365dfaa1b6b32f8fe66a76c3694aa0d5b 100644
--- a/tensorflow/python/kernel_tests/fractional_avg_pool_op_test.py
+++ b/tensorflow/python/kernel_tests/fractional_avg_pool_op_test.py
@@ -347,7 +347,7 @@ class FractionalAvgPoolGradTest(test.TestCase):
 
   Two types of tests for FractionalAvgPoolGrad.
   1) Test fractional_avg_pool_grad() directly.
-    This type of test relies on gen_nn_ops._avg_pool_grad() returns the
+    This type of test relies on gen_nn_ops.avg_pool_grad() returning the
   correct result. For example:
     * input_tensor_shape = (1, 10, 10, 1)
     * window_size = (1, 2, 2, 1)
@@ -404,13 +404,13 @@ class FractionalAvgPoolGradTest(test.TestCase):
                 num_elements *= dim_size
               output_backprop = (self._PRNG.rand(num_elements) *
                                  1000).reshape(output_data.shape)
-              input_backprop_tensor = gen_nn_ops._avg_pool_grad(
+              input_backprop_tensor = gen_nn_ops.avg_pool_grad(
                   input_tensor.get_shape(), output_backprop, window_size,
                   stride_size, padding)
               input_backprop = input_backprop_tensor.eval()
               row_seq = list(range(0, num_rows + 1, row_window_size))
               col_seq = list(range(0, num_cols + 1, col_window_size))
-              fap_input_backprop_tensor = gen_nn_ops._fractional_avg_pool_grad(
+              fap_input_backprop_tensor = gen_nn_ops.fractional_avg_pool_grad(
                   input_tensor.get_shape(),
                   output_backprop,
                   row_seq,
@@ -443,7 +443,7 @@ class FractionalAvgPoolGradTest(test.TestCase):
                 num_elements *= dim_size
               output_backprop = (self._PRNG.rand(num_elements) *
                                  1000).reshape(output_data.shape)
-              input_backprop_tensor = gen_nn_ops._avg_pool_grad(
+              input_backprop_tensor = gen_nn_ops.avg_pool_grad(
                   input_tensor.get_shape(), output_backprop, window_size,
                   stride_size, padding)
               input_backprop = input_backprop_tensor.eval()
@@ -451,7 +451,7 @@ class FractionalAvgPoolGradTest(test.TestCase):
               col_seq = list(range(0, num_cols, col_window_size - 1))
               row_seq[-1] += 1
               col_seq[-1] += 1
-              fap_input_backprop_tensor = gen_nn_ops._fractional_avg_pool_grad(
+              fap_input_backprop_tensor = gen_nn_ops.fractional_avg_pool_grad(
                   input_tensor.get_shape(),
                   output_backprop,
                   row_seq,
diff --git a/tensorflow/python/kernel_tests/fractional_max_pool_op_test.py b/tensorflow/python/kernel_tests/fractional_max_pool_op_test.py
index 5983ae7759dbf3eb2db9867def829ce8dbeb4b73..6477c9ebc4c35fcc5963b27a0f5c50624a73fa09 100644
--- a/tensorflow/python/kernel_tests/fractional_max_pool_op_test.py
+++ b/tensorflow/python/kernel_tests/fractional_max_pool_op_test.py
@@ -318,7 +318,7 @@ class FractionalMaxPoolGradTest(test.TestCase):
 
   Two types of tests for FractionalMaxPoolGrad.
   1) Test fractional_max_pool_grad() directly.
-    This type of test relies on gen_nn_ops._max_pool_grad() returns the correct
+    This type of test relies on gen_nn_ops.max_pool_grad() returning the correct
   result. For example:
     * input_tensor_shape = (1, 10, 10, 1)
     * window_size = (1, 2, 2, 1)
@@ -384,16 +384,13 @@ class FractionalMaxPoolGradTest(test.TestCase):
                                               stride_size, padding)
               output_data = output_tensor.eval()
               output_backprop = self._PRNG.randint(100, size=output_data.shape)
-              input_backprop_tensor = gen_nn_ops._max_pool_grad(input_tensor,
-                                                                output_tensor,
-                                                                output_backprop,
-                                                                window_size,
-                                                                stride_size,
-                                                                padding)
+              input_backprop_tensor = gen_nn_ops.max_pool_grad(
+                  input_tensor, output_tensor, output_backprop, window_size,
+                  stride_size, padding)
               input_backprop = input_backprop_tensor.eval()
               row_seq = list(range(0, num_rows + 1, row_window_size))
               col_seq = list(range(0, num_cols + 1, col_window_size))
-              fmp_input_backprop_tensor = gen_nn_ops._fractional_max_pool_grad(
+              fmp_input_backprop_tensor = gen_nn_ops.fractional_max_pool_grad(
                   input_tensor,
                   output_tensor,
                   output_backprop,
@@ -422,18 +419,15 @@ class FractionalMaxPoolGradTest(test.TestCase):
                                               stride_size, padding)
               output_data = output_tensor.eval()
               output_backprop = self._PRNG.randint(100, size=output_data.shape)
-              input_backprop_tensor = gen_nn_ops._max_pool_grad(input_tensor,
-                                                                output_tensor,
-                                                                output_backprop,
-                                                                window_size,
-                                                                stride_size,
-                                                                padding)
+              input_backprop_tensor = gen_nn_ops.max_pool_grad(
+                  input_tensor, output_tensor, output_backprop, window_size,
+                  stride_size, padding)
               input_backprop = input_backprop_tensor.eval()
               row_seq = list(range(0, num_rows, row_window_size - 1))
               col_seq = list(range(0, num_cols, col_window_size - 1))
               row_seq[-1] += 1
               col_seq[-1] += 1
-              fmp_input_backprop_tensor = gen_nn_ops._fractional_max_pool_grad(
+              fmp_input_backprop_tensor = gen_nn_ops.fractional_max_pool_grad(
                   input_tensor,
                   output_tensor,
                   output_backprop,
@@ -591,7 +585,7 @@ class FractionalMaxPoolGradTest(test.TestCase):
       output_tensor = constant_op.constant(
           output_data_not_overlapping, shape=output_size)
       grad = constant_op.constant(output_backprop, shape=output_size)
-      r = gen_nn_ops._fractional_max_pool_grad(
+      r = gen_nn_ops.fractional_max_pool_grad(
           input_tensor,
           output_tensor,
           grad,
@@ -606,7 +600,7 @@ class FractionalMaxPoolGradTest(test.TestCase):
       # Test when overlapping is True
       output_tensor = constant_op.constant(
           output_data_overlapping, shape=output_size)
-      r = gen_nn_ops._fractional_max_pool_grad(
+      r = gen_nn_ops.fractional_max_pool_grad(
           input_tensor, output_tensor, grad, row_seq, col_seq, overlapping=True)
       input_backprop_overlapping = r.eval()
       self.assertShapeEqual(
diff --git a/tensorflow/python/kernel_tests/identity_op_py_test.py b/tensorflow/python/kernel_tests/identity_op_py_test.py
index 2cfe420bd49ec44815d1386bd873b234d8710e9d..49fb76d5b41de18ed3ba2187e85cb288e7344c38 100644
--- a/tensorflow/python/kernel_tests/identity_op_py_test.py
+++ b/tensorflow/python/kernel_tests/identity_op_py_test.py
@@ -65,7 +65,7 @@ class IdentityOpTest(test.TestCase):
           constant_op.constant(
               [[1, 2, 3], [6, 5, 4]], dtype=dtypes.int32))
       self.assertEquals(shape, tensor.get_shape())
-      self.assertEquals(shape, gen_array_ops._ref_identity(tensor).get_shape())
+      self.assertEquals(shape, gen_array_ops.ref_identity(tensor).get_shape())
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/python/kernel_tests/init_ops_test.py b/tensorflow/python/kernel_tests/init_ops_test.py
index 19a7d2f9d51fff46ee817ad03ef62383f6727791..c1755985ee85c62005c8d3d5fb916859193aa5f3 100644
--- a/tensorflow/python/kernel_tests/init_ops_test.py
+++ b/tensorflow/python/kernel_tests/init_ops_test.py
@@ -25,10 +25,13 @@ from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import random_seed
+from tensorflow.python.layers import convolutional
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import init_ops
+from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import partitioned_variables
+from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
@@ -571,6 +574,82 @@ class OrthogonalInitializerTest(test.TestCase):
                 np.dot(t, t.T), np.eye(t.shape[0]), rtol=tol, atol=tol)
 
 
+class ConvolutionDeltaOrthogonalInitializerTest(test.TestCase):
+
+  def testInitializerIdentical(self):
+    for dtype in [dtypes.float32, dtypes.float64]:
+      init1 = init_ops.convolutional_delta_orthogonal(seed=1, dtype=dtype)
+      init2 = init_ops.convolutional_delta_orthogonal(seed=1, dtype=dtype)
+      self.assertTrue(identicaltest(self, init1, init2, (3, 3, 10, 10)))
+
+  def testInitializerDifferent(self):
+    for dtype in [dtypes.float32, dtypes.float64]:
+      init1 = init_ops.convolutional_delta_orthogonal(seed=1, dtype=dtype)
+      init2 = init_ops.convolutional_delta_orthogonal(seed=2, dtype=dtype)
+      self.assertFalse(identicaltest(self, init1, init2, (3, 3, 10, 10)))
+
+  def testDuplicatedInitializer(self):
+    init = init_ops.convolutional_delta_orthogonal()
+    self.assertFalse(duplicated_initializer(self, init, 1, (3, 3, 10, 10)))
+
+  def testInvalidDataType(self):
+    self.assertRaises(
+        ValueError, init_ops.convolutional_delta_orthogonal,
+        dtype=dtypes.string)
+
+  def testInvalidShape(self):
+    init1 = init_ops.convolutional_delta_orthogonal()
+    with self.test_session(graph=ops.Graph(), use_gpu=True):
+      self.assertRaises(ValueError, init1, shape=[3, 3, 6, 5])
+
+  def testGain(self):
+    shape = (3, 3, 10, 10)
+    for dtype in [dtypes.float32, dtypes.float64]:
+      init1 = init_ops.convolutional_delta_orthogonal(seed=1, dtype=dtype)
+      init2 = init_ops.convolutional_delta_orthogonal(gain=3.14,
+                                                      seed=1, dtype=dtype)
+      with self.test_session(graph=ops.Graph(), use_gpu=True):
+        t1 = init1(shape).eval()
+      with self.test_session(graph=ops.Graph(), use_gpu=True):
+        t2 = init2(shape).eval()
+      return np.allclose(t1, t2 / 3.14, rtol=1e-15, atol=1e-15)
+
+  def testShapesValues(self):
+    for dtype in [dtypes.float32]:
+      for kernel_size in [[3], [8], [3, 5], [2, 4], [3, 3, 3], [2, 2, 2]]:
+        tol = 1e-2
+        # Check orthogonality by computing the 2-norms of the inputs and outputs.
+        if len(kernel_size) == 1:
+          shape = [4, 32, 64]
+          convolution = convolutional.conv1d
+        elif len(kernel_size) == 2:
+          convolution = convolutional.conv2d
+          shape = [4, 32, 32, 64]
+        else:
+          shape = [4, 16, 16, 16, 64]
+          convolution = convolutional.conv3d
+        inputs = random_ops.random_normal(shape, dtype=dtype)
+        inputs_2norm = linalg_ops.norm(inputs)
+        outputs = convolution(
+            inputs, padding="same", filters=128,
+            kernel_size=kernel_size, use_bias=False,
+            kernel_initializer=init_ops.convolutional_delta_orthogonal(
+                gain=3.14))
+        outputs_shape = shape[0:-1] + [128]
+        outputs_2norm = linalg_ops.norm(outputs)
+        my_ops = variables.global_variables_initializer()
+        with self.test_session(use_gpu=True) as sess:
+          sess.run(my_ops)
+          # Check the shape of the outputs
+          t = outputs.eval()
+          self.assertAllEqual(t.shape, outputs_shape)
+          # Check isometry of the delta-orthogonal kernel.
+          self.assertAllClose(
+              sess.run(inputs_2norm)/np.sqrt(np.prod(shape)),
+              sess.run(outputs_2norm)/(np.sqrt(np.prod(shape))*np.sqrt(3.14)),
+              rtol=tol, atol=tol)
+
+
 class IdentityInitializerTest(test.TestCase):
 
   def testInvalidDataType(self):
diff --git a/tensorflow/python/kernel_tests/linalg/linear_operator_composition_test.py b/tensorflow/python/kernel_tests/linalg/linear_operator_composition_test.py
index 4d79365dbefc74fe8412b65ec089fb2af4255aea..f96b9ccdaacae7d8e0552ed3d74ce53808fed963 100644
--- a/tensorflow/python/kernel_tests/linalg/linear_operator_composition_test.py
+++ b/tensorflow/python/kernel_tests/linalg/linear_operator_composition_test.py
@@ -44,9 +44,9 @@ class SquareLinearOperatorCompositionTest(
     self._rtol[dtypes.float32] = 1e-4
     self._rtol[dtypes.complex64] = 1e-4
 
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
     sess = ops.get_default_session()
-    shape = list(shape)
+    shape = list(build_info.shape)
 
     # Either 1 or 2 matrices, depending.
     num_operators = rng.randint(low=1, high=3)
@@ -148,9 +148,9 @@ class NonSquareLinearOperatorCompositionTest(
     self._rtol[dtypes.float32] = 1e-4
     self._rtol[dtypes.complex64] = 1e-4
 
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
     sess = ops.get_default_session()
-    shape = list(shape)
+    shape = list(build_info.shape)
 
     # Test only the case of 2 matrices.
     # The Square test uses either 1 or 2, so we have tested the case of 1 matrix
diff --git a/tensorflow/python/kernel_tests/linalg/linear_operator_diag_test.py b/tensorflow/python/kernel_tests/linalg/linear_operator_diag_test.py
index 8cb9f9e6213cda8daae7b629fc31d4721fd48fa7..0a0e31c716ecfa10ed93cff92fa908a240f8495e 100644
--- a/tensorflow/python/kernel_tests/linalg/linear_operator_diag_test.py
+++ b/tensorflow/python/kernel_tests/linalg/linear_operator_diag_test.py
@@ -34,7 +34,8 @@ class LinearOperatorDiagTest(
     linear_operator_test_util.SquareLinearOperatorDerivedClassTest):
   """Most tests done in the base class LinearOperatorDerivedClassTest."""
 
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
+    shape = list(build_info.shape)
     diag = linear_operator_test_util.random_sign_uniform(
         shape[:-1], minval=1., maxval=2., dtype=dtype)
     if use_placeholder:
diff --git a/tensorflow/python/kernel_tests/linalg/linear_operator_full_matrix_test.py b/tensorflow/python/kernel_tests/linalg/linear_operator_full_matrix_test.py
index 50d6f524e9ad75715d7a57348638fdfeee667f40..b3da623b5e8d8c99c6777e75e2d49f24dab1c96b 100644
--- a/tensorflow/python/kernel_tests/linalg/linear_operator_full_matrix_test.py
+++ b/tensorflow/python/kernel_tests/linalg/linear_operator_full_matrix_test.py
@@ -36,11 +36,11 @@ class SquareLinearOperatorFullMatrixTest(
     linear_operator_test_util.SquareLinearOperatorDerivedClassTest):
   """Most tests done in the base class LinearOperatorDerivedClassTest."""
 
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
-    shape = list(shape)
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
+    shape = list(build_info.shape)
 
-    matrix = linear_operator_test_util.random_positive_definite_matrix(shape,
-                                                                       dtype)
+    matrix = linear_operator_test_util.random_positive_definite_matrix(
+        shape, dtype)
 
     if use_placeholder:
       matrix_ph = array_ops.placeholder(dtype=dtype)
@@ -136,8 +136,8 @@ class SquareLinearOperatorFullMatrixSymmetricPositiveDefiniteTest(
   def _dtypes_to_test(self):
     return [dtypes.float32, dtypes.float64]
 
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
-    shape = list(shape)
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
+    shape = list(build_info.shape)
 
     matrix = linear_operator_test_util.random_positive_definite_matrix(
         shape, dtype, force_well_conditioned=True)
@@ -210,7 +210,8 @@ class NonSquareLinearOperatorFullMatrixTest(
     linear_operator_test_util.NonSquareLinearOperatorDerivedClassTest):
   """Most tests done in the base class LinearOperatorDerivedClassTest."""
 
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
+    shape = list(build_info.shape)
     matrix = linear_operator_test_util.random_normal(shape, dtype=dtype)
     if use_placeholder:
       matrix_ph = array_ops.placeholder(dtype=dtype)
diff --git a/tensorflow/python/kernel_tests/linalg/linear_operator_identity_test.py b/tensorflow/python/kernel_tests/linalg/linear_operator_identity_test.py
index 6d635707683f4500919073a4f43c320a44b65018..59f63f949e96991193412d3574603e58a75cb6e5 100644
--- a/tensorflow/python/kernel_tests/linalg/linear_operator_identity_test.py
+++ b/tensorflow/python/kernel_tests/linalg/linear_operator_identity_test.py
@@ -43,8 +43,8 @@ class LinearOperatorIdentityTest(
     # 16bit.
     return [dtypes.float32, dtypes.float64, dtypes.complex64, dtypes.complex128]
 
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
-    shape = list(shape)
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
+    shape = list(build_info.shape)
     assert shape[-1] == shape[-2]
 
     batch_shape = shape[:-2]
@@ -261,8 +261,8 @@ class LinearOperatorScaledIdentityTest(
     # 16bit.
     return [dtypes.float32, dtypes.float64, dtypes.complex64, dtypes.complex128]
 
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
-    shape = list(shape)
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
+    shape = list(build_info.shape)
     assert shape[-1] == shape[-2]
 
     batch_shape = shape[:-2]
diff --git a/tensorflow/python/kernel_tests/linalg/linear_operator_low_rank_update_test.py b/tensorflow/python/kernel_tests/linalg/linear_operator_low_rank_update_test.py
index d3a47da946b12277c4c390a4a320d7c91ed81b32..8095f6419ef0d9543339cf1f4ee9cd4783f852b9 100644
--- a/tensorflow/python/kernel_tests/linalg/linear_operator_low_rank_update_test.py
+++ b/tensorflow/python/kernel_tests/linalg/linear_operator_low_rank_update_test.py
@@ -55,16 +55,22 @@ class BaseLinearOperatorLowRankUpdatetest(object):
     return [dtypes.float32, dtypes.float64]
 
   @property
-  def _shapes_to_test(self):
+  def _operator_build_infos(self):
+    build_info = linear_operator_test_util.OperatorBuildInfo
     # Previously we had a (2, 10, 10) shape at the end.  We did this to test the
     # inversion and determinant lemmas on not-tiny matrices, since these are
     # known to have stability issues.  This resulted in test timeouts, so this
     # shape has been removed, but rest assured, the tests did pass.
-    return [(0, 0), (1, 1), (1, 3, 3), (3, 4, 4), (2, 1, 4, 4)]
-
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
+    return [
+        build_info((0, 0)),
+        build_info((1, 1)),
+        build_info((1, 3, 3)),
+        build_info((3, 4, 4)),
+        build_info((2, 1, 4, 4))]
+
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
     # Recall A = L + UDV^H
-    shape = list(shape)
+    shape = list(build_info.shape)
     diag_shape = shape[:-1]
     k = shape[-2] // 2 + 1
     u_perturbation_shape = shape[:-1] + [k]
diff --git a/tensorflow/python/kernel_tests/linalg/linear_operator_lower_triangular_test.py b/tensorflow/python/kernel_tests/linalg/linear_operator_lower_triangular_test.py
index db3918f9983c5b7d05fa4ba3bc85b26a485f2f00..a57d2f085e089fb913f09fdd9b07cf13aa7f3c35 100644
--- a/tensorflow/python/kernel_tests/linalg/linear_operator_lower_triangular_test.py
+++ b/tensorflow/python/kernel_tests/linalg/linear_operator_lower_triangular_test.py
@@ -38,7 +38,8 @@ class LinearOperatorLowerTriangularTest(
     # matrix_triangular_solve.
     return [dtypes.float32, dtypes.float64]
 
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
+    shape = list(build_info.shape)
     # Upper triangle will be nonzero, but ignored.
     # Use a diagonal that ensures this matrix is well conditioned.
     tril = linear_operator_test_util.random_tril_matrix(
diff --git a/tensorflow/python/kernel_tests/list_ops_test.py b/tensorflow/python/kernel_tests/list_ops_test.py
index 1577b7bc8021a326eb720bdf059b8d1c568c0cc1..dbbed39c727f01ed1fae271375575c690958c7d8 100644
--- a/tensorflow/python/kernel_tests/list_ops_test.py
+++ b/tensorflow/python/kernel_tests/list_ops_test.py
@@ -30,7 +30,9 @@ from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import list_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.platform import test
 from tensorflow.python.training import server_lib
 
@@ -123,6 +125,78 @@ class ListOpsTest(test_util.TensorFlowTestCase):
             l_cpu, element_dtype=dtypes.float32)[1],
         2.0)
 
+  def testGraphStack(self):
+    with context.graph_mode(), self.test_session():
+      tl = list_ops.empty_tensor_list(
+          element_shape=constant_op.constant([1], dtype=dtypes.int32),
+          element_dtype=dtypes.int32)
+      tl = list_ops.tensor_list_push_back(tl, [1])
+      self.assertAllEqual(
+          list_ops.tensor_list_stack(tl, element_dtype=dtypes.int32).eval(),
+          [[1]])
+
+  def testGraphStackInLoop(self):
+    with context.graph_mode(), self.test_session():
+      t1 = list_ops.empty_tensor_list(
+          element_shape=constant_op.constant([], dtype=dtypes.int32),
+          element_dtype=dtypes.int32)
+      i = constant_op.constant(0, dtype=dtypes.int32)
+
+      def body(i, t1):
+        t1 = list_ops.tensor_list_push_back(t1, i)
+        i += 1
+        return i, t1
+
+      i, t1 = control_flow_ops.while_loop(lambda i, t1: math_ops.less(i, 4),
+                                          body, [i, t1])
+      s1 = list_ops.tensor_list_stack(t1, element_dtype=dtypes.int32).eval()
+      self.assertAllEqual(s1, [0, 1, 2, 3])
+
+  def testGraphStackSwitchDtype(self):
+    with context.graph_mode(), self.test_session():
+      list_ = list_ops.empty_tensor_list(
+          element_shape=constant_op.constant([], dtype=dtypes.int32),
+          element_dtype=dtypes.int32)
+      m = constant_op.constant([1, 2, 3], dtype=dtypes.float32)
+
+      def body(list_, m):
+        list_ = control_flow_ops.cond(
+            math_ops.equal(list_ops.tensor_list_length(list_), 0),
+            lambda: list_ops.empty_tensor_list(m.shape, m.dtype), lambda: list_)
+        list_ = list_ops.tensor_list_push_back(list_, m)
+        return list_, m
+
+      for _ in range(2):
+        list_, m = body(list_, m)
+
+      s1 = list_ops.tensor_list_stack(
+          list_, element_dtype=dtypes.float32).eval()
+      np_s1 = np.array([[1, 2, 3], [1, 2, 3]], dtype=np.float32)
+      self.assertAllEqual(s1, np_s1)
+
+  def testGraphStackInLoopSwitchDtype(self):
+    with context.graph_mode(), self.test_session():
+      t1 = list_ops.empty_tensor_list(
+          element_shape=constant_op.constant([], dtype=dtypes.int32),
+          element_dtype=dtypes.int32)
+      i = constant_op.constant(0, dtype=dtypes.float32)
+      m = constant_op.constant([1, 2, 3], dtype=dtypes.float32)
+
+      def body(i, m, t1):
+        t1 = control_flow_ops.cond(
+            math_ops.equal(list_ops.tensor_list_length(t1), 0),
+            lambda: list_ops.empty_tensor_list(m.shape, m.dtype), lambda: t1)
+
+        t1 = list_ops.tensor_list_push_back(t1, m * i)
+        i += 1.0
+        return i, m, t1
+
+      i, m, t1 = control_flow_ops.while_loop(
+          lambda i, m, t1: math_ops.less(i, 4), body, [i, m, t1])
+      s1 = list_ops.tensor_list_stack(t1, element_dtype=dtypes.float32).eval()
+      np_s1 = np.vstack([np.arange(1, 4) * i for i in range(4)])
+      self.assertAllEqual(s1, np_s1)
+
   def testSerialize(self):
     # pylint: disable=g-import-not-at-top
     try:
diff --git a/tensorflow/python/kernel_tests/matrix_exponential_op_test.py b/tensorflow/python/kernel_tests/matrix_exponential_op_test.py
index 6203a412d7faec4fe9f6179141301579b5900291..a0c66c77d8850d3144678870983730537a253556 100644
--- a/tensorflow/python/kernel_tests/matrix_exponential_op_test.py
+++ b/tensorflow/python/kernel_tests/matrix_exponential_op_test.py
@@ -48,7 +48,7 @@ class ExponentialOpTest(test.TestCase):
   def _verifyExponential(self, x, np_type):
     inp = x.astype(np_type)
     with self.test_session(use_gpu=True):
-      tf_ans = gen_linalg_ops._matrix_exponential(inp)
+      tf_ans = gen_linalg_ops.matrix_exponential(inp)
       if x.size == 0:
         np_ans = np.empty(x.shape, dtype=np_type)
       else:
@@ -116,13 +116,13 @@ class ExponentialOpTest(test.TestCase):
     # When the exponential of a non-square matrix is attempted we should return
     # an error
     with self.assertRaises(ValueError):
-      gen_linalg_ops._matrix_exponential(np.array([[1., 2., 3.], [3., 4., 5.]]))
+      gen_linalg_ops.matrix_exponential(np.array([[1., 2., 3.], [3., 4., 5.]]))
 
   def testWrongDimensions(self):
     # The input to the exponential should be at least a 2-dimensional tensor.
     tensor3 = constant_op.constant([1., 2.])
     with self.assertRaises(ValueError):
-      gen_linalg_ops._matrix_exponential(tensor3)
+      gen_linalg_ops.matrix_exponential(tensor3)
 
   def testEmpty(self):
     self._verifyExponentialReal(np.empty([0, 2, 2]))
@@ -143,8 +143,8 @@ class ExponentialOpTest(test.TestCase):
     with self.test_session(use_gpu=True) as sess:
       matrix1 = random_ops.random_normal([5, 5], seed=42)
       matrix2 = random_ops.random_normal([5, 5], seed=42)
-      expm1 = gen_linalg_ops._matrix_exponential(matrix1)
-      expm2 = gen_linalg_ops._matrix_exponential(matrix2)
+      expm1 = gen_linalg_ops.matrix_exponential(matrix1)
+      expm2 = gen_linalg_ops.matrix_exponential(matrix2)
       expm = sess.run([expm1, expm2])
       self.assertAllEqual(expm[0], expm[1])
 
@@ -180,7 +180,7 @@ class MatrixExponentialBenchmark(test.Benchmark):
           session.Session() as sess, \
           ops.device("/cpu:0"):
         matrix = self._GenerateMatrix(shape)
-        expm = gen_linalg_ops._matrix_exponential(matrix)
+        expm = gen_linalg_ops.matrix_exponential(matrix)
         variables.global_variables_initializer().run()
         self.run_op_benchmark(
             sess,
diff --git a/tensorflow/python/kernel_tests/matrix_logarithm_op_test.py b/tensorflow/python/kernel_tests/matrix_logarithm_op_test.py
index 18ed59828c15f5ad21fe054cd6e40991c02bb356..24edc4f59fe6dd84da6732036eb53e2ad367bd06 100644
--- a/tensorflow/python/kernel_tests/matrix_logarithm_op_test.py
+++ b/tensorflow/python/kernel_tests/matrix_logarithm_op_test.py
@@ -39,8 +39,8 @@ class LogarithmOpTest(test.TestCase):
     inp = x.astype(np_type)
     with self.test_session(use_gpu=True):
       # Verify that expm(logm(A)) == A.
-      tf_ans = gen_linalg_ops._matrix_exponential(
-          gen_linalg_ops._matrix_logarithm(inp))
+      tf_ans = gen_linalg_ops.matrix_exponential(
+          gen_linalg_ops.matrix_logarithm(inp))
       out = tf_ans.eval()
       self.assertAllClose(inp, out, rtol=1e-4, atol=1e-3)
 
@@ -85,14 +85,14 @@ class LogarithmOpTest(test.TestCase):
     # When the logarithm of a non-square matrix is attempted we should return
     # an error
     with self.assertRaises(ValueError):
-      gen_linalg_ops._matrix_logarithm(
+      gen_linalg_ops.matrix_logarithm(
           np.array([[1., 2., 3.], [3., 4., 5.]], dtype=np.complex64))
 
   def testWrongDimensions(self):
     # The input to the logarithm should be at least a 2-dimensional tensor.
     tensor3 = constant_op.constant([1., 2.], dtype=dtypes.complex64)
     with self.assertRaises(ValueError):
-      gen_linalg_ops._matrix_logarithm(tensor3)
+      gen_linalg_ops.matrix_logarithm(tensor3)
 
   def testEmpty(self):
     self._verifyLogarithmComplex(np.empty([0, 2, 2], dtype=np.complex64))
@@ -115,8 +115,8 @@ class LogarithmOpTest(test.TestCase):
           random_ops.random_normal([5, 5], seed=42), dtypes.complex64)
       matrix2 = math_ops.cast(
           random_ops.random_normal([5, 5], seed=42), dtypes.complex64)
-      logm1 = gen_linalg_ops._matrix_logarithm(matrix1)
-      logm2 = gen_linalg_ops._matrix_logarithm(matrix2)
+      logm1 = gen_linalg_ops.matrix_logarithm(matrix1)
+      logm2 = gen_linalg_ops.matrix_logarithm(matrix2)
       logm = sess.run([logm1, logm2])
       self.assertAllEqual(logm[0], logm[1])
 
@@ -152,7 +152,7 @@ class MatrixLogarithmBenchmark(test.Benchmark):
           session.Session() as sess, \
           ops.device("/cpu:0"):
         matrix = self._GenerateMatrix(shape)
-        logm = gen_linalg_ops._matrix_logarithm(matrix)
+        logm = gen_linalg_ops.matrix_logarithm(matrix)
         variables.global_variables_initializer().run()
         self.run_op_benchmark(
             sess,
diff --git a/tensorflow/python/kernel_tests/metrics_test.py b/tensorflow/python/kernel_tests/metrics_test.py
index 59e7afa2dcb1e02ed9c66e5cf75753f96552b4e0..ad802f7e1f72f6cbc3dda1ca98e46e6da4e5110a 100644
--- a/tensorflow/python/kernel_tests/metrics_test.py
+++ b/tensorflow/python/kernel_tests/metrics_test.py
@@ -1132,9 +1132,9 @@ class AUCTest(test.TestCase):
       auc, update_op = metrics.auc(labels, predictions, curve='PR')
 
       sess.run(variables.local_variables_initializer())
-      self.assertAlmostEqual(0.54166, sess.run(update_op), delta=1e-3)
+      self.assertAlmostEqual(0.79166, sess.run(update_op), delta=1e-3)
 
-      self.assertAlmostEqual(0.54166, auc.eval(), delta=1e-3)
+      self.assertAlmostEqual(0.79166, auc.eval(), delta=1e-3)
 
   def testAnotherAUCPRSpecialCase(self):
     with self.test_session() as sess:
@@ -1146,9 +1146,9 @@ class AUCTest(test.TestCase):
       auc, update_op = metrics.auc(labels, predictions, curve='PR')
 
       sess.run(variables.local_variables_initializer())
-      self.assertAlmostEqual(0.44365042, sess.run(update_op), delta=1e-3)
+      self.assertAlmostEqual(0.610317, sess.run(update_op), delta=1e-3)
 
-      self.assertAlmostEqual(0.44365042, auc.eval(), delta=1e-3)
+      self.assertAlmostEqual(0.610317, auc.eval(), delta=1e-3)
 
   def testThirdAUCPRSpecialCase(self):
     with self.test_session() as sess:
@@ -1160,26 +1160,9 @@ class AUCTest(test.TestCase):
       auc, update_op = metrics.auc(labels, predictions, curve='PR')
 
       sess.run(variables.local_variables_initializer())
-      self.assertAlmostEqual(0.73611039, sess.run(update_op), delta=1e-3)
+      self.assertAlmostEqual(0.90277, sess.run(update_op), delta=1e-3)
 
-      self.assertAlmostEqual(0.73611039, auc.eval(), delta=1e-3)
-
-  def testFourthAUCPRSpecialCase(self):
-    # Create the labels and data.
-    labels = np.array([
-        0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
-    predictions = np.array([
-        0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35])
-
-    with self.test_session() as sess:
-      auc, _ = metrics.auc(
-          labels, predictions, curve='PR', num_thresholds=11)
-
-      sess.run(variables.local_variables_initializer())
-      # Since this is only approximate, we can't expect a 6 digits match.
-      # Although with higher number of samples/thresholds we should see the
-      # accuracy improving
-      self.assertAlmostEqual(0.0, auc.eval(), delta=0.001)
+      self.assertAlmostEqual(0.90277, auc.eval(), delta=1e-3)
 
   def testAllIncorrect(self):
     inputs = np.random.randint(0, 2, size=(100, 1))
@@ -1205,16 +1188,16 @@ class AUCTest(test.TestCase):
 
       self.assertAlmostEqual(1, auc.eval(), 6)
 
-  def testRecallOneAndPrecisionOne(self):
+  def testRecallOneAndPrecisionOneGivesOnePRAUC(self):
     with self.test_session() as sess:
       predictions = array_ops.ones([4], dtype=dtypes_lib.float32)
       labels = array_ops.ones([4])
       auc, update_op = metrics.auc(labels, predictions, curve='PR')
 
       sess.run(variables.local_variables_initializer())
-      self.assertAlmostEqual(0.5, sess.run(update_op), 6)
+      self.assertAlmostEqual(1, sess.run(update_op), 6)
 
-      self.assertAlmostEqual(0.5, auc.eval(), 6)
+      self.assertAlmostEqual(1, auc.eval(), 6)
 
   def np_auc(self, predictions, labels, weights):
     """Computes the AUC explicitly using Numpy.
diff --git a/tensorflow/python/kernel_tests/pad_op_test.py b/tensorflow/python/kernel_tests/pad_op_test.py
index 2c766e364073fc8c92156f19d08753367982e7fc..361853448ce2c8477af6920257c58c1eba0fa952 100644
--- a/tensorflow/python/kernel_tests/pad_op_test.py
+++ b/tensorflow/python/kernel_tests/pad_op_test.py
@@ -215,13 +215,13 @@ class PadOpTest(test.TestCase):
   def testIntTypes(self):
     # TODO(touts): Figure out why the padding tests do not work on GPU
     # for int types and rank > 2.
-    for t in [np.int32, np.int64]:
+    for t in [np.int8, np.int32, np.int64]:
       self._testAll(
           np.random.randint(-100, 100, (4, 4, 3)).astype(t),
           [[1, 0], [2, 3], [0, 2]], 0)
       self._testAll(
           np.random.randint(-100, 100, (4, 2, 1, 3)).astype(t),
-          [[0, 0], [0, 0], [0, 0], [0, 0]], -1234)
+          [[0, 0], [0, 0], [0, 0], [0, 0]], -123)
 
   def testFloatTypes(self):
     for t in [np.float32, np.float64]:
@@ -238,6 +238,29 @@ class PadOpTest(test.TestCase):
       x = np.random.rand(3, 2, 1, 1).astype(t)
       self._testAll(x + 1j * x, [[0, 0], [0, 0], [0, 0], [0, 0]], 0 + 0j)
 
+  def testString(self):
+    # Numpy does not support padding strings so we compare padding manually.
+    x = ops.convert_to_tensor([["Hello", "World"],
+                               ["Goodnight", "Moon"]])
+
+    constant = array_ops.pad(x, [[1, 0], [0, 1]], mode="CONSTANT",
+                             constant_values="PAD")
+    reflect = array_ops.pad(x, [[1, 0], [0, 1]], mode="REFLECT",
+                            constant_values="PAD")
+    symmetric = array_ops.pad(x, [[1, 0], [0, 1]], mode="SYMMETRIC",
+                              constant_values="PAD")
+    with self.test_session(use_gpu=True):
+      self.assertAllEqual([[b"PAD", b"PAD", b"PAD"],
+                           [b"Hello", b"World", b"PAD"],
+                           [b"Goodnight", b"Moon", b"PAD"]], constant.eval())
+      self.assertAllEqual([[b"Goodnight", b"Moon", b"Goodnight"],
+                           [b"Hello", b"World", b"Hello"],
+                           [b"Goodnight", b"Moon", b"Goodnight"]],
+                          reflect.eval())
+      self.assertAllEqual([[b"Hello", b"World", b"World"],
+                           [b"Hello", b"World", b"World"],
+                           [b"Goodnight", b"Moon", b"Moon"]], symmetric.eval())
+
   def testShapeFunctionEdgeCases(self):
     # Unknown paddings shape.
     inp = constant_op.constant(0.0, shape=[4, 4, 4, 4])
@@ -313,5 +336,32 @@ class PadOpTest(test.TestCase):
       self.assertAllEqual(inp, out)
       self.assertShapeEqual(inp, tf_val)
 
+  def testCollapseAdjacentNonPaddedDimensions(self):
+    # pyformat: disable
+    paddings_values = [[[0, 0], [0, 0], [0, 0], [0, 1]],
+                       [[0, 0], [2, 3], [0, 0], [0, 0]],
+                       [[0, 0], [0, 0], [0, 0], [0, 0]]]
+    # pyformat: enable
+    for paddings_value in paddings_values:
+      for dtype in [dtypes.float32, dtypes.int32]:
+        inp = constant_op.constant(1, shape=[8, 28, 28, 3], dtype=dtype)
+        paddings = constant_op.constant(paddings_value, dtype=dtypes.int32)
+        padded = array_ops.pad(inp, paddings)
+        middle = array_ops.slice(padded, [row[0] for row in paddings_value],
+                                 [dim.value for dim in inp.shape.dims])
+        left = array_ops.slice(padded, [0, 0, 0, 0],
+                               [row[0] for row in paddings_value])
+        right = array_ops.slice(
+            padded,
+            [paddings_value[i][0] + inp.shape.dims[i].value for i in range(4)],
+            [-1, -1, -1, -1])
+        with self.test_session(use_gpu=True):
+          self.assertAllEqual(inp.eval(), middle.eval())
+          self.assertAllEqual(
+              np.zeros([row[0] for row in paddings_value]), left.eval())
+          self.assertAllEqual(
+              np.zeros([row[1] for row in paddings_value]), right.eval())
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/pooling_ops_test.py b/tensorflow/python/kernel_tests/pooling_ops_test.py
index 4466beeec96509b3761e34d885276e1510c62d10..ed44a1a4d16a94d3aa75a50bf059e33326757c4d 100644
--- a/tensorflow/python/kernel_tests/pooling_ops_test.py
+++ b/tensorflow/python/kernel_tests/pooling_ops_test.py
@@ -31,6 +31,7 @@ from tensorflow.python.ops import gen_nn_ops
 from tensorflow.python.ops import gradient_checker
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import variables
 import tensorflow.python.ops.nn_grad  # pylint: disable=unused-import
 from tensorflow.python.platform import test
 from tensorflow.python.platform import tf_logging
@@ -122,8 +123,9 @@ class PoolingTest(test.TestCase):
       if input_sizes[-1] % 4 != 0:
         tf_logging.info("Skipping test for depth %d", input_sizes[-1])
         return
-    tf_logging.info("Running %s test. %r %r %d %r %r %r", data_format, v2,
-                    input_sizes, total_size, pool_func, ksize, strides)
+    tf_logging.info("Running %s test. %r %r %d %r %r %r %s", data_format, v2,
+                    input_sizes, total_size, pool_func, ksize, strides,
+                    data_type)
     # Initializes the input tensor with array containing incrementing
     # numbers from 1, wrapping round to -127 after 127 to support int8.
     x = [((f + 128) % 255) - 127 for f in range(total_size)]
@@ -192,6 +194,8 @@ class PoolingTest(test.TestCase):
 
     self._VerifyOneType(pool_func, input_sizes, ksize, strides, padding,
                         data_format, dtypes.float32, expected, use_gpu, v2)
+    self._VerifyOneType(pool_func, input_sizes, ksize, strides, padding,
+                        data_format, dtypes.float64, expected, use_gpu, v2)
 
     if not use_gpu or test_util.CudaSupportsHalfMatMulAndConv():
       self._VerifyOneType(pool_func, input_sizes, ksize, strides, padding,
@@ -405,7 +409,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 3, 3, 3],
           ksize=[1, 2, 2, 1],
           strides=[1, 2, 2, 1],
@@ -427,7 +431,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 2, 3, 3],
           ksize=[1, 2, 2, 1],
           strides=[1, 2, 2, 1],
@@ -456,7 +460,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 2, 2, 1],
           ksize=[1, 1, 2, 1],
           strides=[1, 1, 1, 1],
@@ -485,7 +489,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 4, 4, 1],
           ksize=[1, 2, 2, 1],
           strides=[1, 1, 2, 1],
@@ -494,7 +498,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu,
           v2=v2)
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 4, 4, 1],
           ksize=[1, 2, 2, 1],
           strides=[1, 2, 1, 1],
@@ -519,7 +523,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 4, 4, 4],
           ksize=[1, 2, 2, 1],
           strides=[1, 2, 2, 1],
@@ -554,7 +558,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 8, 8, 8],
           ksize=[1, 3, 3, 1],
           strides=[1, 2, 2, 1],
@@ -565,7 +569,7 @@ class PoolingTest(test.TestCase):
 
   def _testMaxPoolEmptyInput(self, use_gpu):
     self._VerifyValues(
-        gen_nn_ops._max_pool_v2,
+        gen_nn_ops.max_pool_v2,
         input_sizes=[0, 8, 8, 8],
         ksize=[1, 3, 3, 1],
         strides=[1, 2, 2, 1],
@@ -600,7 +604,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 1, 1, 10],
           ksize=[1, 1, 1, 2],
           strides=[1, 1, 1, 2],
@@ -626,7 +630,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 2, 2, 6],
           ksize=[1, 1, 1, 3],
           strides=[1, 1, 1, 3],
@@ -648,7 +652,7 @@ class PoolingTest(test.TestCase):
 
       for v2 in [True, False]:
         self._VerifyValues(
-            gen_nn_ops._max_pool_v2,
+            gen_nn_ops.max_pool_v2,
             input_sizes=[1, 7, 7, 1],
             ksize=[1, 2, 2, 1],
             strides=[1, 3, 3, 1],
@@ -689,7 +693,7 @@ class PoolingTest(test.TestCase):
 
       for v2 in [True, False]:
         self._VerifyValues(
-            gen_nn_ops._max_pool_v2,
+            gen_nn_ops.max_pool_v2,
             input_sizes=[1, 3, 3, 1],
             ksize=[1, 1, 1, 1],
             strides=[1, 2, 2, 1],
@@ -699,7 +703,7 @@ class PoolingTest(test.TestCase):
             v2=v2)
 
         self._VerifyValues(
-            gen_nn_ops._max_pool_v2,
+            gen_nn_ops.max_pool_v2,
             input_sizes=[1, 4, 4, 1],
             ksize=[1, 1, 1, 1],
             strides=[1, 2, 2, 1],
@@ -731,7 +735,8 @@ class PoolingTest(test.TestCase):
                                             [1, 1, 1, 3], "evenly divide")
     if test.is_gpu_available():
       with self.test_session(use_gpu=True):
-        t = constant_op.constant(1.0, shape=[1, 2, 2, 4])
+        t = variables.Variable(np.ones([1, 2, 2, 4]))
+        variables.global_variables_initializer().run()
         with self.assertRaisesOpError("for CPU devices"):
           nn_ops.max_pool(
               t, ksize=[1, 1, 1, 2], strides=[1, 1, 1, 2],
@@ -764,8 +769,8 @@ class PoolingTest(test.TestCase):
         _, argmax_op = nn_ops.max_pool_with_argmax(t, ksize, strides, padding)
         argmax = argmax_op.eval()
         grad_in = constant_op.constant(tensor_output, shape=output_shape)
-        out_op = gen_nn_ops._max_pool_grad_with_argmax(t, grad_in, argmax,
-                                                       ksize, strides, padding)
+        out_op = gen_nn_ops.max_pool_grad_with_argmax(t, grad_in, argmax, ksize,
+                                                      strides, padding)
         gpu_val = out_op.eval()
         self.assertShapeEqual(gpu_val, out_op)
       with self.test_session(use_gpu=False):
@@ -773,8 +778,8 @@ class PoolingTest(test.TestCase):
         out_op = nn_ops.max_pool(t, ksize, strides, padding)
         orig_out = out_op.eval()
         grad_in = constant_op.constant(tensor_output, shape=output_shape)
-        out_op = gen_nn_ops._max_pool_grad(t, orig_out, grad_in, ksize, strides,
-                                           padding)
+        out_op = gen_nn_ops.max_pool_grad(t, orig_out, grad_in, ksize, strides,
+                                          padding)
         cpu_val = out_op.eval()
         self.assertShapeEqual(cpu_val, out_op)
       # The CPU version accumulates its gradient on fp16, so it's less
@@ -793,7 +798,7 @@ class PoolingTest(test.TestCase):
         _, argmax_op = nn_ops.max_pool_with_argmax(t, ksize, strides, padding)
         argmax = argmax_op.eval()
         grad_in = constant_op.constant(tensor_input, shape=input_shape)
-        out_op = gen_nn_ops._max_pool_grad_grad_with_argmax(
+        out_op = gen_nn_ops.max_pool_grad_grad_with_argmax(
             t, grad_in, argmax, ksize, strides, padding)
         gpu_val = out_op.eval()
         self.assertShapeEqual(gpu_val, out_op)
@@ -802,8 +807,8 @@ class PoolingTest(test.TestCase):
         out_op = nn_ops.max_pool(t, ksize, strides, padding)
         orig_out = out_op.eval()
         grad_in = constant_op.constant(tensor_input, shape=input_shape)
-        out_op = gen_nn_ops._max_pool_grad_grad(t, orig_out, grad_in, ksize,
-                                                strides, padding)
+        out_op = gen_nn_ops.max_pool_grad_grad(t, orig_out, grad_in, ksize,
+                                               strides, padding)
         cpu_val = out_op.eval()
         self.assertShapeEqual(cpu_val, out_op)
       # The CPU version accumulates its gradient on fp16, so it's less
@@ -842,7 +847,7 @@ class PoolingTest(test.TestCase):
       t = constant_op.constant(tensor_input, shape=[1, 2, 2, 1])
       argmax = constant_op.constant(
           tensor_argmax, shape=[1, 2, 2, 1], dtype=dtypes.int64)
-      out_op = gen_nn_ops._max_pool_grad_with_argmax(
+      out_op = gen_nn_ops.max_pool_grad_with_argmax(
           orig_in,
           t,
           argmax,
@@ -865,7 +870,7 @@ class PoolingTest(test.TestCase):
       t = constant_op.constant(tensor_input, shape=[1, 3, 3, 1])
       argmax = constant_op.constant(
           tensor_argmax, shape=[1, 2, 2, 1], dtype=dtypes.int64)
-      out_op = gen_nn_ops._max_pool_grad_grad_with_argmax(
+      out_op = gen_nn_ops.max_pool_grad_grad_with_argmax(
           orig_in,
           t,
           argmax,
@@ -1029,7 +1034,7 @@ class PoolingTest(test.TestCase):
     self.assertLess(err, err_tolerance)
 
   def _testMaxPoolGradValidPadding1_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[1, 3, 3, 1],
@@ -1043,7 +1048,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradValidPadding2_1_6(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 6, 6, 3],
@@ -1057,7 +1062,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradValidPadding2_1_7(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 7, 7, 3],
@@ -1071,7 +1076,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradValidPadding1_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[1, 3, 3, 1],
@@ -1085,7 +1090,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradValidPadding2_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 2, 2, 3],
@@ -1099,7 +1104,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradSamePadding1_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1113,7 +1118,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradSamePadding1_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1127,7 +1132,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradSamePadding2_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1141,7 +1146,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradSamePadding2_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1155,7 +1160,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradSamePadding3_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[1, 7, 7, 1],
@@ -1199,7 +1204,7 @@ class PoolingTest(test.TestCase):
     Returns:
       A Tensor.
     """
-    pool_func = gen_nn_ops.max_pool_grad_v2 if v2 else gen_nn_ops._max_pool_grad
+    pool_func = gen_nn_ops.max_pool_grad_v2 if v2 else gen_nn_ops.max_pool_grad
     return pool_func(orig_input, orig_output, grad,
                      [1, window_rows, window_cols, 1],
                      [1, row_stride, col_stride, 1], padding)
@@ -1208,9 +1213,11 @@ class PoolingTest(test.TestCase):
                              expected_input_backprop, input_sizes, output_sizes,
                              window_rows, window_cols, row_stride, col_stride,
                              padding, use_gpu, v2):
-    pool_func = gen_nn_ops._max_pool_v2 if v2 else nn_ops.max_pool
+    pool_func = gen_nn_ops.max_pool_v2 if v2 else nn_ops.max_pool
     with self.test_session(use_gpu=use_gpu):
-      input_tensor = constant_op.constant(input_data, shape=input_sizes)
+      input_tensor = variables.Variable(
+          np.array(input_data, dtype=np.float32).reshape(input_sizes))
+      variables.global_variables_initializer().run()
       output_tensor = pool_func(input_tensor, [1, window_rows, window_cols, 1],
                                 [1, row_stride, col_stride, 1], padding)
       output_backprop_tensor = constant_op.constant(
@@ -1504,7 +1511,7 @@ class PoolingTest(test.TestCase):
     self._testMaxPoolGradDirectWithNans2_2()
 
   def _testMaxPoolGradGradValidPadding1_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[1, 3, 3, 1],
@@ -1518,7 +1525,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradValidPadding2_1_6(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 6, 6, 3],
@@ -1532,7 +1539,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradValidPadding2_1_7(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 7, 7, 3],
@@ -1546,7 +1553,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradValidPadding2_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 2, 2, 3],
@@ -1560,7 +1567,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradSamePadding1_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1574,7 +1581,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradSamePadding2_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1588,7 +1595,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradSamePadding2_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1602,7 +1609,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradSamePadding3_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[1, 7, 7, 1],
@@ -1644,7 +1651,7 @@ class PoolingTest(test.TestCase):
     Returns:
       A Tensor.
     """
-    return gen_nn_ops._max_pool_grad_grad(
+    return gen_nn_ops.max_pool_grad_grad(
         orig_input, orig_output, grad, [1, window_rows, window_cols, 1],
         [1, row_stride, col_stride, 1], padding)
 
diff --git a/tensorflow/python/kernel_tests/py_func_test.py b/tensorflow/python/kernel_tests/py_func_test.py
index 61fb3f12e45ea5ae3bc4f0a26c2116b54c003624..5b508b7c0e72180194fa1a4c95bc4282d4694605 100644
--- a/tensorflow/python/kernel_tests/py_func_test.py
+++ b/tensorflow/python/kernel_tests/py_func_test.py
@@ -19,6 +19,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import re
+
 import numpy as np
 from six.moves import queue
 from six.moves import xrange  # pylint: disable=redefined-builtin
@@ -33,6 +35,7 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import script_ops
 from tensorflow.python.platform import test
 
@@ -356,12 +359,22 @@ class PyFuncTest(test.TestCase):
 
   def _testExceptionHandling(self, py_exp, tf_exp, eager=False):
 
-    def raise_exception():
+    def inner_exception():
       raise py_exp("blah")  # pylint: disable=not-callable
 
+    def raise_exception():
+      inner_exception()
+
+    expected_regexp = r": blah.*"               # Error at the top
+    expected_regexp += r"in raise_exception.*"  # Stacktrace outer
+    expected_regexp += r"in inner_exception.*"  # Stacktrace inner
+    expected_regexp += r": blah"                # Stacktrace of raise
+    def expected_error_check(exception):
+      return re.search(expected_regexp, str(exception), re.DOTALL)
+
     if eager:
-      if context.in_eager_mode():
-        with self.assertRaisesRegexp(tf_exp, "blah"):
+      if context.executing_eagerly():
+        with self.assertRaisesWithPredicateMatch(tf_exp, expected_error_check):
           f = script_ops.eager_py_func(raise_exception, [], [])
         return
       else:
@@ -370,7 +383,7 @@ class PyFuncTest(test.TestCase):
       f = script_ops.py_func(raise_exception, [], [])
 
     with self.test_session():
-      with self.assertRaisesRegexp(tf_exp, "blah"):
+      with self.assertRaisesWithPredicateMatch(tf_exp, expected_error_check):
         self.evaluate(f)
 
   def testExceptionHandling(self):
@@ -432,7 +445,7 @@ class PyFuncTest(test.TestCase):
 
       output = script_ops.eager_py_func(no_return_value, inp=[], Tout=[])
       ret = self.evaluate(output)
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         self.assertEquals(len(ret), 0)
       else:
         self.assertIsNone(ret)
@@ -468,6 +481,18 @@ class PyFuncTest(test.TestCase):
 
       self._testExceptionHandling(WeirdError, errors.UnknownError, eager=True)
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testEagerReturningVariableRaisesError(self):
+    def return_variable():
+      variable = resource_variable_ops.ResourceVariable(0.0)
+      return variable
+
+    with self.assertRaisesRegexp(errors.UnknownError,
+                                 "Attempting to return a variable"):
+      output = script_ops.eager_py_func(
+          return_variable, inp=[], Tout=dtypes.float32)
+      self.evaluate(output)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/reduction_ops_test.py b/tensorflow/python/kernel_tests/reduction_ops_test.py
index d306d1b8d64f292dc299deee2e3c36981b933d1e..589ea54973c10902c461f552d5c54b6fad6ecf67 100644
--- a/tensorflow/python/kernel_tests/reduction_ops_test.py
+++ b/tensorflow/python/kernel_tests/reduction_ops_test.py
@@ -30,6 +30,7 @@ from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gradient_checker
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 
 # The maximum input rank to test.
@@ -212,7 +213,8 @@ class SumReductionTest(BaseReductionTest):
     arr = np.ones([68000], dtype=np.float16)
 
     with self.test_session(graph=ops.Graph(), use_gpu=True) as sess:
-      tf_arr = array_ops.constant(arr)
+      tf_arr = variables.Variable(arr)
+      variables.global_variables_initializer().run()
       tf_mean = math_ops.reduce_mean(tf_arr, 0, False)
       tf_out_mean = sess.run(tf_mean)
     self.assertAllClose(tf_out_mean, 1.)
diff --git a/tensorflow/python/kernel_tests/regex_replace_op_test.py b/tensorflow/python/kernel_tests/regex_replace_op_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..6739ac32245668e98d37673fe9e9fe9d55cc0c5f
--- /dev/null
+++ b/tensorflow/python/kernel_tests/regex_replace_op_test.py
@@ -0,0 +1,71 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for RegexReplace op from string_ops."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import string_ops
+from tensorflow.python.platform import test
+
+
+class RegexReplaceOpTest(test.TestCase):
+
+  def testRemovePrefix(self):
+    values = ["a:foo", "a:bar", "a:foo", "b:baz", "b:qux", "ca:b"]
+    with self.test_session():
+      input_vector = constant_op.constant(values, dtypes.string)
+      stripped = string_ops.regex_replace(
+          input_vector, "^(a:|b:)", "", replace_global=False).eval()
+      self.assertAllEqual([b"foo", b"bar", b"foo", b"baz", b"qux", b"ca:b"],
+                          stripped)
+
+  def testRegexReplace(self):
+    values = ["aba\naba", "abcdabcde"]
+    with self.test_session():
+      input_vector = constant_op.constant(values, dtypes.string)
+      stripped = string_ops.regex_replace(input_vector, "a.*a", "(\\0)").eval()
+      self.assertAllEqual([b"(aba)\n(aba)", b"(abcda)bcde"], stripped)
+
+  def testEmptyMatch(self):
+    values = ["abc", "1"]
+    with self.test_session():
+      input_vector = constant_op.constant(values, dtypes.string)
+      stripped = string_ops.regex_replace(input_vector, "", "x").eval()
+      self.assertAllEqual([b"xaxbxcx", b"x1x"], stripped)
+
+  def testInvalidPattern(self):
+    values = ["abc", "1"]
+    with self.test_session():
+      input_vector = constant_op.constant(values, dtypes.string)
+      invalid_pattern = "A["
+      replace = string_ops.regex_replace(input_vector, invalid_pattern, "x")
+      with self.assertRaisesOpError("Invalid pattern"):
+        replace.eval()
+
+  def testGlobal(self):
+    values = ["ababababab", "abcabcabc", ""]
+    with self.test_session():
+      input_vector = constant_op.constant(values, dtypes.string)
+      stripped = string_ops.regex_replace(input_vector, "ab", "abc",
+                                          True).eval()
+      self.assertAllEqual([b"abcabcabcabcabc", b"abccabccabcc", b""], stripped)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/python/kernel_tests/relu_op_test.py b/tensorflow/python/kernel_tests/relu_op_test.py
index 6b4091ae5d3c6e469a9cd5237b978eae4c75485f..25e947f09e137b37ea129ba6015a060aa01f02e4 100644
--- a/tensorflow/python/kernel_tests/relu_op_test.py
+++ b/tensorflow/python/kernel_tests/relu_op_test.py
@@ -19,12 +19,14 @@ from __future__ import division
 from __future__ import print_function
 
 import numpy as np
+from six.moves import xrange  # pylint: disable=redefined-builtin
 
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gradient_checker
 from tensorflow.python.ops import gradients_impl
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variables
@@ -87,6 +89,35 @@ class ReluTest(test.TestCase):
     print("relu (float32) gradient err = ", err)
     self.assertLess(err, 1e-4)
 
+  # The gradient for fp16 is inaccurate due to its low precision.
+  # Instead of relying on compute_gradient_error, we compare the fp16 analytical
+  # gradient against its fp32 counterpart.
+  def testGradientFloat16(self):
+    with self.test_session(use_gpu=True) as sess:
+      # Randomly construct a 1D shape whose single dimension is in [1, 40).
+      shape = random_ops.random_uniform(
+          [1], minval=1, maxval=40, dtype=dtypes.int32)
+
+      # Construct the fp32 graph and its gradient.
+      x = random_ops.random_uniform(shape, minval=-1, maxval=1, name="x")
+      y1 = nn_ops.relu(x, name="relu_fp32")
+      l1 = nn_ops.l2_loss(y1)
+      dx_f32 = gradients_impl.gradients(l1, x)
+
+      # Construct the fp16 graph and its gradient.
+      # It starts with the same x, in fp32. But before it reaches Relu, it is
+      # cast into fp16. So during backprop, the gradient computation is in fp16.
+      x2 = math_ops.cast(x, dtype=dtypes.float16, name="cast")
+      y2 = nn_ops.relu(x2, name="relu_fp16")
+      l2 = nn_ops.l2_loss(y2)
+      dx_f16 = gradients_impl.gradients(l2, x)
+
+      # Repeat the experiment 100 times. All tensor shapes and tensor values
+      # are randomly generated for each run.
+      for _ in xrange(100):
+        dx_f32_v, dx_f16_v = sess.run([dx_f32, dx_f16])
+        self.assertAllClose(dx_f32_v, dx_f16_v, atol=3e-4)
+
   def testGradientFloat64(self):
     with self.test_session():
       x = constant_op.constant(
diff --git a/tensorflow/python/kernel_tests/resource_variable_ops_test.py b/tensorflow/python/kernel_tests/resource_variable_ops_test.py
index 8503f3e0310125bb714942b32bbbf46596f9bddb..2dc993f8117a41de8f15663ce763ec1d5b7ecdb4 100644
--- a/tensorflow/python/kernel_tests/resource_variable_ops_test.py
+++ b/tensorflow/python/kernel_tests/resource_variable_ops_test.py
@@ -277,6 +277,17 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
     self.evaluate(v.assign(2.0))
     self.assertEqual(2.0, self.evaluate(v.value()))
 
+    # Tests for the 'read_value' argument:
+    assign_with_read = v.assign(3.0, read_value=True)
+    self.assertEqual(3.0, self.evaluate(assign_with_read))
+    assign_without_read = v.assign(4.0, read_value=False)
+    if context.executing_eagerly():
+      self.assertIsNone(assign_without_read)
+    else:
+      self.assertIsInstance(assign_without_read, ops.Operation)
+    self.evaluate(assign_without_read)
+    self.assertEqual(4.0, self.evaluate(v.value()))
+
   @test_util.run_in_graph_and_eager_modes()
   def testLoad(self):
     v = resource_variable_ops.ResourceVariable(1.0, name="var0")
@@ -329,6 +340,9 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
       w = resource_variable_ops.ResourceVariable.from_proto(v.to_proto())
       self.assertEquals(2, math_ops.add(w, 1).eval())
 
+      self.assertEquals(v._handle, w._handle)
+      self.assertEquals(v._graph_element, w._graph_element)
+
   @test_util.run_in_graph_and_eager_modes()
   def testAssignAddMethod(self):
     v = resource_variable_ops.ResourceVariable(1.0, name="var0")
@@ -336,6 +350,17 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
     self.evaluate(v.assign_add(1.0))
     self.assertEqual(2.0, self.evaluate(v.value()))
 
+    # Tests for the 'read_value' argument:
+    assign_with_read = v.assign_add(1.0, read_value=True)
+    self.assertEqual(3.0, self.evaluate(assign_with_read))
+    assign_without_read = v.assign_add(1.0, read_value=False)
+    if context.executing_eagerly():
+      self.assertIsNone(assign_without_read)
+    else:
+      self.assertIsInstance(assign_without_read, ops.Operation)
+    self.evaluate(assign_without_read)
+    self.assertEqual(4.0, self.evaluate(v.value()))
+
   @test_util.run_in_graph_and_eager_modes()
   def testAssignSubMethod(self):
     v = resource_variable_ops.ResourceVariable(3.0, name="var0")
@@ -343,6 +368,17 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
     self.evaluate(v.assign_sub(1.0))
     self.assertEqual(2.0, self.evaluate(v.value()))
 
+    # Tests for the 'read_value' argument:
+    assign_with_read = v.assign_sub(1.0, read_value=True)
+    self.assertEqual(1.0, self.evaluate(assign_with_read))
+    assign_without_read = v.assign_sub(1.0, read_value=False)
+    if context.executing_eagerly():
+      self.assertIsNone(assign_without_read)
+    else:
+      self.assertIsInstance(assign_without_read, ops.Operation)
+    self.evaluate(assign_without_read)
+    self.assertEqual(0.0, self.evaluate(v.value()))
+
   @test_util.run_in_graph_and_eager_modes()
   def testDestroyResource(self):
     v = resource_variable_ops.ResourceVariable(3.0, name="var0")
@@ -440,7 +476,7 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
     self.assertEqual("(10, 20, 35)", str(v.get_shape()))
     self.assertEqual("(10, 20, 35)", str(v.value().shape))
     self.assertEqual("(3, 20, 35)", str(v.sparse_read([0, 1, 2]).shape))
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.assertEqual(
           "<unknown>",
           str(v.sparse_read(array_ops.placeholder(dtypes.int32)).shape))
@@ -481,7 +517,6 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
       self.assertEqual(dtypes.int32, v.dtype)
       self.assertEqual("foo/var7:0", v.name)
       self.assertAllEqual([10, 20, 35], v.shape.as_list())
-      self.assertEqual(context.get_default_context().device_name, v.device)
       self.assertTrue(isinstance(v.handle, ops.EagerTensor))
       self.assertEqual(constraint, v.constraint)
       self.assertAllEqual(init.numpy(), v.read_value().numpy())
@@ -551,6 +586,15 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
       state_ops.scatter_update(v, [1], [3])
       self.assertAllEqual([1.0, 3.0], v.numpy())
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testScatterUpdateInvalidArgs(self):
+    v = resource_variable_ops.ResourceVariable([0, 1, 2, 3], name="update")
+    # The exact error and message differ between graph construction (where the
+    # error is realized during shape inference at graph construction time) and
+    # eager execution (where the error is realized during kernel execution).
+    with self.assertRaisesRegexp(Exception, r"shape.*2.*3"):
+      state_ops.scatter_update(v, [0, 1], [0, 1, 2])
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/rnn_test.py b/tensorflow/python/kernel_tests/rnn_test.py
index daa42938e6af205425d7e423ce162294b9002be4..9a0409c796ab60da3d47cf7d46ef6fbd5bd82394 100644
--- a/tensorflow/python/kernel_tests/rnn_test.py
+++ b/tensorflow/python/kernel_tests/rnn_test.py
@@ -111,10 +111,10 @@ class RNNTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes()
   def testInvalidSequenceLengthShape(self):
     cell = Plus1RNNCell()
-    if context.in_graph_mode():
-      inputs = [array_ops.placeholder(dtypes.float32, shape=(3, 4))]
-    else:
+    if context.executing_eagerly():
       inputs = [constant_op.constant(np.ones((3, 4)))]
+    else:
+      inputs = [array_ops.placeholder(dtypes.float32, shape=(3, 4))]
     with self.assertRaisesRegexp(ValueError, "must be a vector"):
       rnn.dynamic_rnn(
           cell,
@@ -125,38 +125,30 @@ class RNNTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes()
   def testBatchSizeFromInput(self):
     cell = Plus1RNNCell()
-    in_graph_mode = context.in_graph_mode()
+    in_eager_mode = context.executing_eagerly()
     # With static batch size
-    if in_graph_mode:
-      inputs = array_ops.placeholder(dtypes.float32, shape=(3, 4, 5))
-      initial_state = array_ops.placeholder(dtypes.float32, shape=(3, 5))
-    else:
+    if in_eager_mode:
       inputs = np.zeros((3, 4, 5), dtype=np.float32)
       initial_state = np.zeros((3, 5), dtype=np.float32)
+    else:
+      inputs = array_ops.placeholder(dtypes.float32, shape=(3, 4, 5))
+      initial_state = array_ops.placeholder(dtypes.float32, shape=(3, 5))
 
     # - Without initial_state
     outputs, state = rnn.dynamic_rnn(cell, inputs, dtype=dtypes.float32)
-    if in_graph_mode:
-      self.assertEqual(3, outputs.shape[0].value)
-      self.assertEqual(3, state.shape[0].value)
-    else:
-      self.assertEqual(3, outputs.shape[0])
-      self.assertEqual(3, state.shape[0])
+    self.assertEqual(3, outputs.shape[0])
+    self.assertEqual(3, state.shape[0])
 
     # - With initial_state
     outputs, state = rnn.dynamic_rnn(
         cell, inputs, initial_state=initial_state)
-    if in_graph_mode:
-      self.assertEqual(3, outputs.shape[0].value)
-      self.assertEqual(3, state.shape[0].value)
-    else:
-      self.assertEqual(3, outputs.shape[0])
-      self.assertEqual(3, state.shape[0])
+    self.assertEqual(3, outputs.shape[0])
+    self.assertEqual(3, state.shape[0])
 
     # Without static batch size
-    # Tensor shapes are fully determined in Eager mode, so only run this
-    # test in graph mode.
-    if in_graph_mode:
+    # Tensor shapes are fully determined with eager execution enabled,
+    # so only run this test for graph construction.
+    if not in_eager_mode:
       inputs = array_ops.placeholder(dtypes.float32, shape=(None, 4, 5))
       # - Without initial_state
       outputs, state = rnn.dynamic_rnn(cell, inputs, dtype=dtypes.float32)
@@ -173,56 +165,46 @@ class RNNTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes()
   def testScalarStateIsAccepted(self):
     cell = ScalarStateRNNCell()
-    in_graph_mode = context.in_graph_mode()
+    in_eager_mode = context.executing_eagerly()
 
-    if in_graph_mode:
-      inputs = array_ops.placeholder(dtypes.float32, shape=(1, 4, 1))
-    else:
+    if in_eager_mode:
       inputs = np.array([[[1], [2], [3], [4]]], dtype=np.float32)
+    else:
+      inputs = array_ops.placeholder(dtypes.float32, shape=(1, 4, 1))
 
     with self.test_session() as sess:
       outputs, state = rnn.dynamic_rnn(
           cell, inputs, dtype=dtypes.float32, sequence_length=[4])
-      if in_graph_mode:
+      if not in_eager_mode:
         outputs, state = sess.run(
             [outputs, state], feed_dict={inputs: [[[1], [2], [3], [4]]]})
 
-    if in_graph_mode:
-      self.assertAllEqual(outputs, np.array([[[1], [2], [3], [4]]]))
-      self.assertEqual(state, 4)
-    else:
-      self.assertAllEqual(outputs.numpy(), np.array([[[1], [2], [3], [4]]]))
-      self.assertEqual(state.numpy(), 4)
+    self.assertAllEqual([[[1], [2], [3], [4]]], outputs)
+    self.assertAllEqual(4, state)
 
   @test_util.run_in_graph_and_eager_modes()
   def testTensorArrayStateIsAccepted(self):
     cell = TensorArrayStateRNNCell()
-    in_graph_mode = context.in_graph_mode()
+    in_eager_mode = context.executing_eagerly()
 
-    if in_graph_mode:
-      inputs = array_ops.placeholder(dtypes.float32, shape=(1, 4, 1))
-    else:
+    if in_eager_mode:
       inputs = np.array([[[1], [2], [3], [4]]], dtype=np.float32)
+    else:
+      inputs = array_ops.placeholder(dtypes.float32, shape=(1, 4, 1))
 
     with self.test_session() as sess:
       outputs, state = rnn.dynamic_rnn(
           cell, inputs, dtype=dtypes.float32, sequence_length=[4])
       state = (state[0], state[1].stack())
-      if in_graph_mode:
+      if not in_eager_mode:
         outputs, state = sess.run(
             [outputs, state], feed_dict={
                 inputs: [[[1], [2], [3], [4]]]
             })
 
-    if in_graph_mode:
-      self.assertAllEqual(outputs, np.array([[[1], [2], [3], [4]]]))
-      self.assertEqual(state[0], 4)
-      self.assertAllEqual(state[1], np.array([[[1]], [[2]], [[3]], [[4]]]))
-    else:
-      self.assertAllEqual(outputs.numpy(), np.array([[[1], [2], [3], [4]]]))
-      self.assertEqual(state[0].numpy(), 4)
-      self.assertAllEqual(state[1].numpy(),
-                          np.array([[[1]], [[2]], [[3]], [[4]]]))
+    self.assertAllEqual([[[1], [2], [3], [4]]], outputs)
+    self.assertAllEqual(4, state[0])
+    self.assertAllEqual([[[1]], [[2]], [[3]], [[4]]], state[1])
 
 
 ######### Benchmarking RNN code
diff --git a/tensorflow/python/kernel_tests/save_restore_ops_test.py b/tensorflow/python/kernel_tests/save_restore_ops_test.py
index 1bdfa9ebd8e1a4495e67004f59adfb51bf3a6602..cb9aa1e34d6eb82efa94e60e7b56c26b181cef04 100644
--- a/tensorflow/python/kernel_tests/save_restore_ops_test.py
+++ b/tensorflow/python/kernel_tests/save_restore_ops_test.py
@@ -31,11 +31,10 @@ class ShardedFileOpsTest(test.TestCase):
     with session.Session(
         target="", config=config_pb2.ConfigProto(device_count={"CPU": 2})):
       self.assertEqual(
-          gen_io_ops._sharded_filename("foo", 4, 100).eval(),
+          gen_io_ops.sharded_filename("foo", 4, 100).eval(),
           b"foo-00004-of-00100")
       self.assertEqual(
-          gen_io_ops._sharded_filespec("foo", 100).eval(),
-          b"foo-?????-of-00100")
+          gen_io_ops.sharded_filespec("foo", 100).eval(), b"foo-?????-of-00100")
 
 
 class ShapeInferenceTest(test.TestCase):
@@ -53,7 +52,7 @@ class ShapeInferenceTest(test.TestCase):
                         [dtypes.float32, dtypes.float32])
 
   def testRestoreSlice(self):
-    op = gen_io_ops._restore_slice("model", "var", "3 4 0,1:-", dtypes.float32)
+    op = gen_io_ops.restore_slice("model", "var", "3 4 0,1:-", dtypes.float32)
     self.assertEqual([1, 4], op.get_shape())
 
 
diff --git a/tensorflow/python/kernel_tests/scalar_test.py b/tensorflow/python/kernel_tests/scalar_test.py
index e65241981eac2d42207c1de7a261f7936e588f2a..0d8fd232946883ac1d95c4c2d9744af69175ab90 100644
--- a/tensorflow/python/kernel_tests/scalar_test.py
+++ b/tensorflow/python/kernel_tests/scalar_test.py
@@ -92,11 +92,11 @@ class ScalarTest(test.TestCase):
     self.check(array_ops.reshape, (7, 1), 'sizes input must be 1-D', [7])
 
   def testShardedFilename(self):
-    self.check(gen_io_ops._sharded_filename, ('foo', 4, [100]),
+    self.check(gen_io_ops.sharded_filename, ('foo', 4, [100]),
                'must be a scalar', b'foo-00004-of-00100')
 
   def testShardedFilespec(self):
-    self.check(gen_io_ops._sharded_filespec, ('foo', [100]), 'must be a scalar',
+    self.check(gen_io_ops.sharded_filespec, ('foo', [100]), 'must be a scalar',
                b'foo-?????-of-00100')
 
   def testUnsortedSegmentSum(self):
diff --git a/tensorflow/python/kernel_tests/segment_reduction_ops_test.py b/tensorflow/python/kernel_tests/segment_reduction_ops_test.py
index bbce6b7d47325b8209815230426672ec6894147f..3bca5fadc42693f514911c7ffa8f078de8ef9bcd 100644
--- a/tensorflow/python/kernel_tests/segment_reduction_ops_test.py
+++ b/tensorflow/python/kernel_tests/segment_reduction_ops_test.py
@@ -542,6 +542,25 @@ class SparseSegmentReductionOpTest(SparseSegmentReductionHelper):
         tf_ans = s.eval()
         self.assertAllClose(np_ans, tf_ans)
 
+  def testWithEmptySegments(self):
+    tf_x = constant_op.constant([], shape=[0, 4], dtype=dtypes_lib.float32)
+    ops_list = [
+        math_ops.sparse_segment_sum_with_num_segments,
+        math_ops.sparse_segment_mean_with_num_segments
+    ]
+    segment_indices = []
+    tf_indices = []
+    num_segments = 5
+    with self.test_session(use_gpu=False):
+      for tf_op in ops_list:
+        s = tf_op(
+            data=tf_x,
+            indices=tf_indices,
+            segment_ids=segment_indices,
+            num_segments=num_segments)
+        tf_ans = s.eval()
+        self.assertAllClose(np.zeros([5, 4]), tf_ans)
+
   def testSegmentIdsGreaterThanZero(self):
     tf_x, np_x = self._input([10, 4], dtype=dtypes_lib.float32)
     ops_list = [(np.add, None, math_ops.sparse_segment_sum), (
diff --git a/tensorflow/python/kernel_tests/slice_op_test.py b/tensorflow/python/kernel_tests/slice_op_test.py
index 051a25080b826de05ee3e24a82fbcd1f47995544..5fc9bef21816e3a12f0d274bab1fc82a83546422 100644
--- a/tensorflow/python/kernel_tests/slice_op_test.py
+++ b/tensorflow/python/kernel_tests/slice_op_test.py
@@ -283,7 +283,7 @@ class SliceTest(test.TestCase):
     # unintended behavior is prevented.
     c = constant_op.constant(5.0)
     with self.assertRaisesWithPredicateMatch(
-        TypeError, lambda e: "`Tensor` objects are not iterable" in str(e)):
+        TypeError, lambda e: "Tensor objects are not iterable" in str(e)):
       for _ in c:
         pass
 
diff --git a/tensorflow/python/kernel_tests/softmax_op_test.py b/tensorflow/python/kernel_tests/softmax_op_test.py
index 4d89831aae9a5e95210a8defb180e09c9d38f4d6..981f96b74d3058aa79a1ea10e1254e572d0e8b85 100644
--- a/tensorflow/python/kernel_tests/softmax_op_test.py
+++ b/tensorflow/python/kernel_tests/softmax_op_test.py
@@ -18,15 +18,17 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import unittest
 import numpy as np
 
-from tensorflow.python.framework import constant_op
+
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.platform import test
+from tensorflow.python.platform import tf_logging as logging
 
 
 @test_util.with_c_api
@@ -42,9 +44,10 @@ class SoftmaxTest(test.TestCase):
             features, axis=dim), one_only_on_dim))
     softmax = e / np.reshape(np.sum(e, axis=dim), one_only_on_dim)
     if log:
-      return np.log(softmax)
+      res = np.log(softmax)
     else:
-      return softmax
+      res = softmax
+    return res
 
   def _testSoftmax(self, np_features, dim=-1, log=False, use_gpu=False):
     # A previous version of the code checked the op name rather than the op type
@@ -54,9 +57,9 @@ class SoftmaxTest(test.TestCase):
     np_softmax = self._npSoftmax(np_features, dim=dim, log=log)
     with self.test_session(use_gpu=use_gpu):
       if log:
-        tf_softmax = nn_ops.log_softmax(np_features, dim=dim, name=name)
+        tf_softmax = nn_ops.log_softmax(np_features, axis=dim, name=name)
       else:
-        tf_softmax = nn_ops.softmax(np_features, dim=dim, name=name)
+        tf_softmax = nn_ops.softmax(np_features, axis=dim, name=name)
       out = tf_softmax.eval()
     self.assertAllCloseAccordingToType(np_softmax, out)
     self.assertShapeEqual(np_softmax, tf_softmax)
@@ -118,10 +121,32 @@ class SoftmaxTest(test.TestCase):
     self._testAll(
         np.array([[1., 1., 1., 1.], [1., 2., 3., 4.]]).astype(np.float32))
 
+  @unittest.skipUnless(test.is_built_with_cuda(),
+                       "Test only applicable when running on GPUs")
+  def testFloatGPU(self):
+    if test.is_gpu_available(cuda_only=True):
+      rows = [2**x + np.random.randint(0, 1024) for x in range(1, 10)]
+      cols = [2**x + np.random.randint(0, 1024) for x in range(1, 10)]
+      for row, col in zip(rows, cols):
+        logging.info("Testing softmax float dtype in shape [%d, %d]", row, col)
+        data = np.random.rand(row, col)
+        self._testAll(data.astype(np.float32))
+
   def testHalf(self):
     self._testAll(
         np.array([[1., 1., 1., 1.], [1., 2., 3., 4.]]).astype(np.float16))
 
+  @unittest.skipUnless(test.is_built_with_cuda(),
+                       "Test only applicable when running on GPUs")
+  def testHalfGPU(self):
+    if test.is_gpu_available(cuda_only=True):
+      rows = [2**x + np.random.randint(0, 1024) for x in range(1, 8)]
+      cols = [2**x + np.random.randint(0, 1024) for x in range(1, 8)]
+      for row, col in zip(rows, cols):
+        logging.info("Testing softmax half dtype in shape [%d, %d]", row, col)
+        data = np.random.rand(row, col)
+        self._testAll(data.astype(np.float16))
+
   def testDouble(self):
     self._testSoftmax(
         np.array([[1., 1., 1., 1.], [1., 2., 3., 4.]]).astype(np.float64))
@@ -166,11 +191,11 @@ class SoftmaxTest(test.TestCase):
 
   def testEmptyInput(self):
     with self.test_session():
-      x = constant_op.constant([[]], shape=[0, 3])
+      x = array_ops.placeholder(dtypes.float32, shape=[0, 3])
       self.assertEqual(0, array_ops.size(x).eval())
       # reshape would raise if logits is empty
       with self.assertRaises(errors_impl.InvalidArgumentError):
-        nn_ops.softmax(x, dim=0).eval()
+        nn_ops.softmax(x, axis=0).eval()
 
   def testDimTooLarge(self):
     with self.test_session():
@@ -178,7 +203,7 @@ class SoftmaxTest(test.TestCase):
       # inference error.
       dim = array_ops.placeholder_with_default(100, shape=[])
       with self.assertRaises(errors_impl.InvalidArgumentError):
-        nn_ops.softmax([1., 2., 3., 4.], dim=dim).eval()
+        nn_ops.softmax([1., 2., 3., 4.], axis=dim).eval()
 
   def testLargeDims(self):
     # Make sure that we properly handle large inputs. See
diff --git a/tensorflow/python/kernel_tests/spacetobatch_op_test.py b/tensorflow/python/kernel_tests/spacetobatch_op_test.py
index b943dfa4e5f2a06eddcb3af03764e5e046b715f4..2a9232b6aecb66328f10a62f2251246c4fcec6e6 100644
--- a/tensorflow/python/kernel_tests/spacetobatch_op_test.py
+++ b/tensorflow/python/kernel_tests/spacetobatch_op_test.py
@@ -86,11 +86,11 @@ class CppOpImpl(object):
 
   @staticmethod
   def space_to_batch(*args, **kwargs):
-    return gen_array_ops._space_to_batch(*args, **kwargs)
+    return gen_array_ops.space_to_batch(*args, **kwargs)
 
   @staticmethod
   def batch_to_space(*args, **kwargs):
-    return gen_array_ops._batch_to_space(*args, **kwargs)
+    return gen_array_ops.batch_to_space(*args, **kwargs)
 
 
 class SpaceToBatchTest(test.TestCase, PythonOpImpl):
diff --git a/tensorflow/python/kernel_tests/spacetodepth_op_test.py b/tensorflow/python/kernel_tests/spacetodepth_op_test.py
index 3c98a685e07a1f2d55c3c1035a99ffaa593d35b3..cd90d16aacb4325ed426b0466266d9616b574401 100644
--- a/tensorflow/python/kernel_tests/spacetodepth_op_test.py
+++ b/tensorflow/python/kernel_tests/spacetodepth_op_test.py
@@ -34,8 +34,8 @@ from tensorflow.python.platform import tf_logging
 
 class SpaceToDepthTest(test.TestCase):
 
-  def _testOne(self, inputs, block_size, outputs):
-    input_nhwc = math_ops.to_float(inputs)
+  def _testOne(self, inputs, block_size, outputs, dtype=dtypes.float32):
+    input_nhwc = math_ops.cast(inputs, dtype)
     with self.test_session(use_gpu=False):
       # test NHWC (default) on CPU
       x_tf = array_ops.space_to_depth(input_nhwc, block_size)
@@ -58,6 +58,12 @@ class SpaceToDepthTest(test.TestCase):
     x_out = [[[[1, 2, 3, 4]]]]
     self._testOne(x_np, block_size, x_out)
 
+  def testBasicFloat16(self):
+    x_np = [[[[1], [2]], [[3], [4]]]]
+    block_size = 2
+    x_out = [[[[1, 2, 3, 4]]]]
+    self._testOne(x_np, block_size, x_out, dtype=dtypes.float16)
+
   # Tests for larger input dimensions. To make sure elements are
   # correctly ordered spatially.
   def testLargerInput2x2(self):
@@ -126,6 +132,24 @@ class SpaceToDepthTest(test.TestCase):
     x_out = [batch_output_elt(i) for i in range(batch_size)]
     self._testOne(x_np, block_size, x_out)
 
+  def testBatchSize0(self):
+    block_size = 2
+    batch_size = 0
+    input_nhwc = array_ops.ones([batch_size, 4, 6, 3])
+    x_out = array_ops.ones([batch_size, 2, 3, 12])
+
+    with self.test_session(use_gpu=False):
+      # test NHWC (default) on CPU
+      x_tf = array_ops.space_to_depth(input_nhwc, block_size)
+      self.assertAllEqual(x_tf.shape, x_out.shape)
+      x_tf.eval()
+    if test.is_gpu_available():
+      with self.test_session(use_gpu=True):
+        # test NHWC (default) on GPU
+        x_tf = array_ops.space_to_depth(input_nhwc, block_size)
+        self.assertAllEqual(x_tf.shape, x_out.shape)
+        x_tf.eval()
+
   # Tests for different width and height.
   def testNonSquare(self):
     x_np = [[[[1, 10], [2, 20]], [[3, 30], [4, 40]], [[5, 50], [6, 60]],
diff --git a/tensorflow/python/kernel_tests/sparse_xent_op_test.py b/tensorflow/python/kernel_tests/sparse_xent_op_test.py
index cd5b711a0ed18aabff543aa4b6ecb1a885618caf..a841fe83a7f585a69ef33c437570359797484a4a 100644
--- a/tensorflow/python/kernel_tests/sparse_xent_op_test.py
+++ b/tensorflow/python/kernel_tests/sparse_xent_op_test.py
@@ -64,7 +64,7 @@ class SparseXentTest(test.TestCase):
   def _testXent(self, np_features, np_labels):
     np_loss, np_backprop = self._npXent(np_features, np_labels)
     with self.test_session(use_gpu=True) as sess:
-      loss, backprop = gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
+      loss, backprop = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
           np_features, np_labels)
       tf_loss, tf_backprop = sess.run([loss, backprop])
     self.assertAllCloseAccordingToType(np_loss, tf_loss)
@@ -73,7 +73,7 @@ class SparseXentTest(test.TestCase):
   def testSingleClass(self):
     for label_dtype in np.int32, np.int64:
       with self.test_session(use_gpu=True) as sess:
-        loss, backprop = gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
+        loss, backprop = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
             np.array([[1.], [-1.], [0.]]).astype(np.float32),
             np.array([0, 0, 0]).astype(label_dtype))
         tf_loss, tf_backprop = sess.run([loss, backprop])
@@ -87,8 +87,9 @@ class SparseXentTest(test.TestCase):
 
     if test.is_built_with_cuda() and test.is_gpu_available():
       with self.test_session(use_gpu=True) as sess:
-        loss, backprop = (gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
-            features, labels))
+        loss, backprop = (
+            gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
+                features, labels))
         tf_loss, tf_backprop = sess.run([loss, backprop])
         self.assertAllClose(
             [[np.nan] * 4, [0.25, 0.25, 0.25, -0.75],
@@ -100,8 +101,8 @@ class SparseXentTest(test.TestCase):
             [np.nan, 1.3862, 3.4420, np.nan], tf_loss, rtol=1e-3, atol=1e-3)
 
     with self.test_session(use_gpu=False) as sess:
-      loss, backprop = (gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
-          features, labels))
+      loss, backprop = (
+          gen_nn_ops.sparse_softmax_cross_entropy_with_logits(features, labels))
       with self.assertRaisesOpError("Received a label value of"):
         sess.run([loss, backprop])
 
diff --git a/tensorflow/python/kernel_tests/split_op_test.py b/tensorflow/python/kernel_tests/split_op_test.py
index 6171793b148f8d8f195b9548a13df89d29c5e96e..8cfee3eb933afcea7a58d5632948b87b0c4c10df 100644
--- a/tensorflow/python/kernel_tests/split_op_test.py
+++ b/tensorflow/python/kernel_tests/split_op_test.py
@@ -336,6 +336,20 @@ class SplitOpTest(test.TestCase):
     for s in splits:
       self.assertEqual(None, s.get_shape().ndims)
 
+  def testNonexistentDimTensor(self):
+    x = array_ops.placeholder(dtypes.int32)
+    values = np.zeros([5, 30])
+    splits = array_ops.placeholder(dtypes.int32)
+    with self.assertRaisesRegexp(ValueError, "Cannot infer"):
+      y = array_ops.split(values, splits, axis=x)
+
+    splits = array_ops.placeholder(dtypes.int32, [3])
+    y = array_ops.split(values, splits, axis=x)
+    with self.test_session(use_gpu=True) as sess:
+      with self.assertRaisesRegexp(errors_impl.InvalidArgumentError,
+                                   "must have exactly one element"):
+        sess.run(y, {x: np.array([], dtype=np.int32), splits: [4, 11, 15]})
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/stack_ops_test.py b/tensorflow/python/kernel_tests/stack_ops_test.py
index aa409336f5c50178e4d0ca946190119fb0e4188e..afd2eaffab992bca4b3ae7b4f65e0370f325b548 100644
--- a/tensorflow/python/kernel_tests/stack_ops_test.py
+++ b/tensorflow/python/kernel_tests/stack_ops_test.py
@@ -34,11 +34,11 @@ class StackOpTest(test.TestCase):
 
   def _testStackPushPop(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
-      h = gen_data_flow_ops._stack_v2(
+      h = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, [[4.0, 5.0]])
+      c = gen_data_flow_ops.stack_push_v2(h, [[4.0, 5.0]])
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop_v2(h, dtypes.float32)
       self.assertAllClose([[4.0, 5.0]], c1.eval())
 
   def testStackPushPop(self):
@@ -49,11 +49,11 @@ class StackOpTest(test.TestCase):
     with self.test_session(use_gpu=use_gpu):
       a = np.arange(2000)
       x = constant_op.constant(a, dtype=dtypes.float32)
-      h = gen_data_flow_ops._stack_v2(
+      h = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, x, swap_memory=True)
+      c = gen_data_flow_ops.stack_push_v2(h, x, swap_memory=True)
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop_v2(h, dtypes.float32)
       self.assertAllClose(a, c1.eval())
 
   def testStackPushPopSwap(self):
@@ -63,7 +63,7 @@ class StackOpTest(test.TestCase):
   def _testStackWhileSwap(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
       n = constant_op.constant(0)
-      h = gen_data_flow_ops._stack_v2(
+      h = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
 
       def c(x):
@@ -72,7 +72,7 @@ class StackOpTest(test.TestCase):
       def b(x):
         with ops.control_dependencies([x]):
           a = constant_op.constant(np.ones(2000), dtype=dtypes.float32)
-          v = gen_data_flow_ops._stack_push_v2(h, a, swap_memory=True)
+          v = gen_data_flow_ops.stack_push_v2(h, a, swap_memory=True)
         with ops.control_dependencies([v]):
           return math_ops.add(x, 1)
 
@@ -86,7 +86,7 @@ class StackOpTest(test.TestCase):
 
       def b1(x, y):
         nx = math_ops.subtract(x, 1)
-        ny = y + gen_data_flow_ops._stack_pop_v2(h, dtypes.float32)
+        ny = y + gen_data_flow_ops.stack_pop_v2(h, dtypes.float32)
         return [nx, ny]
 
       _, ry = control_flow_ops.while_loop(
@@ -99,16 +99,16 @@ class StackOpTest(test.TestCase):
 
   def _testMultiStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
-      h1 = gen_data_flow_ops._stack_v2(
+      h1 = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_push_v2(h1, 4.0)
+      c1 = gen_data_flow_ops.stack_push_v2(h1, 4.0)
       with ops.control_dependencies([c1]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h1, dtypes.float32)
-      h2 = gen_data_flow_ops._stack_v2(
+        c1 = gen_data_flow_ops.stack_pop_v2(h1, dtypes.float32)
+      h2 = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="bar")
-      c2 = gen_data_flow_ops._stack_push_v2(h2, 5.0)
+      c2 = gen_data_flow_ops.stack_push_v2(h2, 5.0)
       with ops.control_dependencies([c2]):
-        c2 = gen_data_flow_ops._stack_pop_v2(h2, dtypes.float32)
+        c2 = gen_data_flow_ops.stack_pop_v2(h2, dtypes.float32)
       r = c1 + c2
       self.assertAllClose(9.0, r.eval())
 
@@ -119,17 +119,17 @@ class StackOpTest(test.TestCase):
   def _testSameNameStacks(self, use_gpu):
     """Different stacks with the same name do not interfere."""
     with self.test_session(use_gpu=use_gpu) as sess:
-      h1 = gen_data_flow_ops._stack_v2(
+      h1 = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      h2 = gen_data_flow_ops._stack_v2(
+      h2 = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
 
-      c1 = gen_data_flow_ops._stack_push_v2(h1, 4.0)
+      c1 = gen_data_flow_ops.stack_push_v2(h1, 4.0)
       with ops.control_dependencies([c1]):
-        c2 = gen_data_flow_ops._stack_push_v2(h2, 5.0)
+        c2 = gen_data_flow_ops.stack_push_v2(h2, 5.0)
       with ops.control_dependencies([c2]):
-        pop1 = gen_data_flow_ops._stack_pop_v2(h1, dtypes.float32)
-        pop2 = gen_data_flow_ops._stack_pop_v2(h2, dtypes.float32)
+        pop1 = gen_data_flow_ops.stack_pop_v2(h1, dtypes.float32)
+        pop2 = gen_data_flow_ops.stack_pop_v2(h2, dtypes.float32)
 
       out1, out2 = sess.run([pop1, pop2])
       self.assertAllClose(out1, 4.0)
@@ -141,9 +141,9 @@ class StackOpTest(test.TestCase):
 
   def _testCloseStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu) as sess:
-      h = gen_data_flow_ops._stack_v2(
+      h = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_close_v2(h)
+      c1 = gen_data_flow_ops.stack_close_v2(h)
       sess.run(c1)
 
   def testCloseStack(self):
@@ -152,11 +152,11 @@ class StackOpTest(test.TestCase):
 
   def _testPushCloseStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu) as sess:
-      h = gen_data_flow_ops._stack_v2(
+      h = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, [[4.0, 5.0]])
+      c = gen_data_flow_ops.stack_push_v2(h, [[4.0, 5.0]])
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_close_v2(h)
+        c1 = gen_data_flow_ops.stack_close_v2(h)
       sess.run(c1)
 
   def testPushCloseStack(self):
@@ -170,9 +170,9 @@ class StackOpRefTest(test.TestCase):
   def _testStackPushPop(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
       h = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push(h, [[4.0, 5.0]])
+      c = gen_data_flow_ops.stack_push(h, [[4.0, 5.0]])
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop(h, dtypes.float32)
       self.assertAllClose([[4.0, 5.0]], c1.eval())
 
   def testStackPushPop(self):
@@ -184,9 +184,9 @@ class StackOpRefTest(test.TestCase):
       a = np.arange(2000)
       x = constant_op.constant(a, dtype=dtypes.float32)
       h = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push(h, x, swap_memory=True)
+      c = gen_data_flow_ops.stack_push(h, x, swap_memory=True)
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop(h, dtypes.float32)
       self.assertAllClose(a, c1.eval())
 
   def testStackPushPopSwap(self):
@@ -196,13 +196,13 @@ class StackOpRefTest(test.TestCase):
   def _testMultiStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
       h1 = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_push(h1, 4.0)
+      c1 = gen_data_flow_ops.stack_push(h1, 4.0)
       with ops.control_dependencies([c1]):
-        c1 = gen_data_flow_ops._stack_pop(h1, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop(h1, dtypes.float32)
       h2 = gen_data_flow_ops._stack(dtypes.float32, stack_name="bar")
-      c2 = gen_data_flow_ops._stack_push(h2, 5.0)
+      c2 = gen_data_flow_ops.stack_push(h2, 5.0)
       with ops.control_dependencies([c2]):
-        c2 = gen_data_flow_ops._stack_pop(h2, dtypes.float32)
+        c2 = gen_data_flow_ops.stack_pop(h2, dtypes.float32)
       r = c1 + c2
       self.assertAllClose(9.0, r.eval())
 
@@ -217,7 +217,7 @@ class StackOpRefTest(test.TestCase):
       def b(x):
         with ops.control_dependencies([x]):
           a = constant_op.constant(np.ones(2000), dtype=dtypes.float32)
-          v = gen_data_flow_ops._stack_push(h, a, swap_memory=True)
+          v = gen_data_flow_ops.stack_push(h, a, swap_memory=True)
         with ops.control_dependencies([v]):
           return math_ops.add(x, 1)
 
@@ -231,7 +231,7 @@ class StackOpRefTest(test.TestCase):
 
       def b1(x, y):
         nx = math_ops.subtract(x, 1)
-        ny = y + gen_data_flow_ops._stack_pop(h, dtypes.float32)
+        ny = y + gen_data_flow_ops.stack_pop(h, dtypes.float32)
         return [nx, ny]
 
       _, ry = control_flow_ops.while_loop(
@@ -249,9 +249,9 @@ class StackOpRefTest(test.TestCase):
   def _testSameNameStacks(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
       h1 = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_push(h1, 4.0)
+      c1 = gen_data_flow_ops.stack_push(h1, 4.0)
       h2 = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c2 = gen_data_flow_ops._stack_push(h2, 5.0)
+      c2 = gen_data_flow_ops.stack_push(h2, 5.0)
       _ = c1 + c2
       self.assertNotEqual(h1.eval()[1], h2.eval()[1])
 
@@ -262,7 +262,7 @@ class StackOpRefTest(test.TestCase):
   def _testCloseStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu) as sess:
       h = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_close(h)
+      c1 = gen_data_flow_ops.stack_close(h)
       sess.run(c1)
 
   def testCloseStack(self):
@@ -272,9 +272,9 @@ class StackOpRefTest(test.TestCase):
   def _testPushCloseStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu) as sess:
       h = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push(h, [[4.0, 5.0]])
+      c = gen_data_flow_ops.stack_push(h, [[4.0, 5.0]])
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_close(h)
+        c1 = gen_data_flow_ops.stack_close(h)
       sess.run(c1)
 
   def testPushCloseStack(self):
diff --git a/tensorflow/python/kernel_tests/template_test.py b/tensorflow/python/kernel_tests/template_test.py
index a519b69b22cf51ab4f4173b215c21a71d83e9f99..1b935d5286729e9e802c56e90e2ae7ab72a6e080 100644
--- a/tensorflow/python/kernel_tests/template_test.py
+++ b/tensorflow/python/kernel_tests/template_test.py
@@ -356,6 +356,10 @@ class TemplateTest(test.TestCase):
     self.assertEqual("s1_1/nested/dummy:0", v5.name)
     self.assertEqual("s1_1/nested_1/dummy:0", v6.name)
 
+    self.assertEqual(2, len(tmpl1._checkpoint_dependencies))
+    self.assertEqual("nested", tmpl1._checkpoint_dependencies[0].name)
+    self.assertEqual("nested_1", tmpl1._checkpoint_dependencies[1].name)
+
   @test_util.run_in_graph_and_eager_modes()
   def test_nested_templates_with_defun(self):
 
@@ -558,7 +562,7 @@ class TemplateTest(test.TestCase):
     outputs_b, _ = linear1(inputs)
     self.assertEquals("foo", linear1.variable_scope.name)
     self.assertEquals("foo/w:0", w1.name)
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.assertEquals("foo/add:0", outputs_a.name,
                         "First application of template should get "
                         "same name scope as variables.")
@@ -573,7 +577,7 @@ class TemplateTest(test.TestCase):
                       "New template gets a freshly uniquified variable scope "
                       "because 'foo' is already taken.")
     self.assertEquals("foo_1/w:0", w2.name)
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.assertEquals("foo_1_1/add:0", outputs_c.name,
                         "First application of template would get "
                         "same name scope as variables, but 'foo_1' is already "
@@ -588,7 +592,7 @@ class TemplateTest(test.TestCase):
     with variable_scope.variable_scope("foo"):
       # Create two templates with the same name, ensure scopes are made unique.
       ta = template.make_template("bar", variable_scoped_function, True)
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         tb = template.make_template("s", function_with_side_create,
                                     trainable=False)
       else:
diff --git a/tensorflow/python/kernel_tests/tensor_array_ops_test.py b/tensorflow/python/kernel_tests/tensor_array_ops_test.py
index aad2443eea7ad87faf481973e91ca3df32ccfb44..a834675828b67aed4057d1857c546a586cee69c9 100644
--- a/tensorflow/python/kernel_tests/tensor_array_ops_test.py
+++ b/tensorflow/python/kernel_tests/tensor_array_ops_test.py
@@ -399,28 +399,14 @@ class TensorArrayTest(test.TestCase):
   def testTensorArrayWriteWrongIndexOrDataTypeFails(self):
     with self.test_session(use_gpu=True):
       ta = _make_ta(3, "foo", dtype=dtypes.float32)
-      in_graph_mode = context.in_graph_mode()
       # Test writing the wrong datatype
-      if in_graph_mode:
-        with self.assertRaisesOpError(
-            "TensorArray dtype is float but Op is trying to write "
-            "dtype string"):
-          self.evaluate(ta.write(0, "wrong_type_scalar").flow)
-      else:
-        with self.assertRaisesOpError(
-            "TensorArray dtype is float32 but Op is trying to write "
-            "dtype string"):
-          self.evaluate(ta.write(0, "wrong_type_scalar").flow)
+      with self.assertRaisesOpError(
+          "TensorArray dtype is (float|float32) but Op is trying to write "
+          "dtype string"):
+        self.evaluate(ta.write(0, "wrong_type_scalar").flow)
 
-      if context.in_graph_mode():
-        with self.assertRaisesOpError(
-            "Tried to write to index -1 but array is not "
-            "resizeable and size is: 3"):
-          self.evaluate(ta.write(-1, 3.0).flow)
-      else:
-        with self.assertRaisesOpError(
-            r"Writing to negative indices \(index -1\) is not allowed."):
-          self.evaluate(ta.write(-1, 3.0).flow)
+      with self.assertRaisesOpError("index -1"):
+        self.evaluate(ta.write(-1, 3.0).flow)
 
       # Test reading from too large an index
       with self.assertRaisesOpError(
@@ -435,23 +421,17 @@ class TensorArrayTest(test.TestCase):
 
       w0 = ta.write(0, [[4.0, 5.0]])
 
-      # Test reading wrong datatype, which is only possible in graph mode
-      if context.in_graph_mode():
-        r0_bad = gen_data_flow_ops._tensor_array_read_v3(
+      # Test reading wrong datatype (only possible when constructing graphs).
+      if not context.executing_eagerly():
+        r0_bad = gen_data_flow_ops.tensor_array_read_v3(
             handle=w0.handle, index=0, dtype=dtypes.float64, flow_in=w0.flow)
         with self.assertRaisesOpError(
             "TensorArray dtype is float but Op requested dtype double."):
           r0_bad.eval()
 
       # Test reading from a negative index, which is not allowed
-      if context.in_graph_mode():
-        with self.assertRaisesOpError(
-            r"Tried to read from index -1 but array size is: 3"):
-          self.evaluate(ta.read(-1))
-      else:
-        with self.assertRaisesOpError(
-            r"Reading from negative indices \(index -1\) is not allowed."):
-          self.evaluate(ta.read(-1))
+      with self.assertRaisesOpError("index -1"):
+        self.evaluate(ta.read(-1))
 
       # Test reading from too large an index
       with self.assertRaisesOpError(
@@ -467,10 +447,7 @@ class TensorArrayTest(test.TestCase):
       with self.assertRaisesOpError(
           "Could not write to TensorArray index 2 because "
           "it has already been written to."):
-        if context.in_graph_mode():
-          self.evaluate(ta.write(2, 3.0).write(2, 3.0).flow)
-        else:
-          self.evaluate(ta.write(2, 3.0).write(2, 3.0))
+        self.evaluate(ta.write(2, 3.0).write(2, 3.0).flow)
 
   @test_util.run_in_graph_and_eager_modes()
   def testTensorArrayConcatIncompatibleShapesFails(self):
@@ -499,58 +476,40 @@ class TensorArrayTest(test.TestCase):
       w2 = w1.write(1, [4.0])
       w3 = w2.write(2, [[3.0]])
 
-      # The eager-mode implementation just passes up array_op.concat's error
-      # message.
-      if context.in_graph_mode():
-        with self.assertRaisesOpError(
-            r"TensorArray has inconsistent shapes.  Index 0 has "
-            r"\(excepting dimension 0\) shape: \[\] but index 2 has "
-            r"\(excepting dimension 0\) shape: \[1\]"):
-          self.evaluate(w3.concat())
-      else:
-        with self.assertRaisesOpError(
-            r".*Ranks of all input tensors should match: shape\[0\] "
-            r"= \[1\] vs\. shape\[2\] = \[1,1\].*"):
-          self.evaluate(w3.concat())
+      # The exact error messages differ between eager execution and graph
+      # construction as the former bubbles up the error from array_op.concat.
+      with self.assertRaisesOpError("shape"):
+        self.evaluate(w3.concat())
 
   @test_util.run_in_graph_and_eager_modes()
   def testTensorArraySplitIncompatibleShapesFails(self):
     with self.test_session(use_gpu=True):
-      in_graph_mode = context.in_graph_mode()
+      in_eager_mode = context.executing_eagerly()
       ta = _make_ta(3, "foo")
       with self.assertRaisesOpError(
           r"Expected lengths to be a vector, received shape: \[\]"):
-        if in_graph_mode:
+        if in_eager_mode:
+          self.evaluate(ta.split([1.0, 2.0, 3.0], 1))
+        else:
           lengths = array_ops.placeholder(dtypes.int64)
           ta.split([1.0, 2.0, 3.0], lengths).flow.eval(feed_dict={lengths: 1})
-        else:
-          self.evaluate(ta.split([1.0, 2.0, 3.0], 1))
 
       with self.assertRaisesOpError(
           r"Expected sum of lengths to be equal to values.shape\[0\], "
           r"but sum of lengths is 1 and value's shape is: \[3\]"):
-        if in_graph_mode:
-          self.evaluate(ta.split([1.0, 2.0, 3.0], [1]).flow)
-        else:
-          self.evaluate(ta.split([1.0, 2.0, 3.0], [1]))
+        self.evaluate(ta.split([1.0, 2.0, 3.0], [1]).flow)
 
       ta = _make_ta(1, "baz")
       with self.assertRaisesOpError(
           r"Expected value to be at least a vector, but received shape: \[\]"):
-        if in_graph_mode:
-          self.evaluate(ta.split(1.0, [1]).flow)
-        else:
-          self.evaluate(ta.split(1.0, [1]))
+        self.evaluate(ta.split(1.0, [1]).flow)
 
       ta = _make_ta(2, "buz")
       with self.assertRaisesOpError(
           r"TensorArray's size is not equal to the size of lengths "
           r"\(2 vs. 1\), and the TensorArray is not marked as "
           r"dynamically resizeable"):
-        if in_graph_mode:
-          self.evaluate(ta.split([1.0], [1]).flow)
-        else:
-          self.evaluate(ta.split([1.0], [1]))
+        self.evaluate(ta.split([1.0], [1]).flow)
 
   def _testTensorArrayWriteGradientAddMultipleAdds(self, dtype):
     with self.test_session(use_gpu=True):
@@ -868,14 +827,14 @@ class TensorArrayTest(test.TestCase):
 
       vout = func(v0, state0, var)
       grad_val = -np.arange(3 * 5, dtype=np_dtype).reshape(3, 5)
-      if context.in_graph_mode():
+      if context.executing_eagerly():
+        grad_fn = backprop.gradients_function(func)
+        v0_grad, state0_grad, var_grad = grad_fn(v0, state0, var, dy=grad_val)
+      else:
         v0_grad = gradients_impl.gradients([vout], [v0], [grad_val])[0]
         state0_grad = gradients_impl.gradients([vout], [state0], [grad_val])[0]
         var_grad = gradients_impl.gradients([vout], [var], [grad_val])[0]
         variables.global_variables_initializer().run()
-      else:
-        grad_fn = backprop.gradients_function(func)
-        v0_grad, state0_grad, var_grad = grad_fn(v0, state0, var, dy=grad_val)
 
       state0_t, var_t, v0_t, vout_t, v0_grad_t, var_grad_t, state0_grad_t = (
           self.evaluate(
@@ -959,10 +918,10 @@ class TensorArrayTest(test.TestCase):
         return r
 
       x = constant_op.constant(2.0, name="x")
-      if context.in_graph_mode():
-        grad = gradients_impl.gradients(loop(x), [x])[0]
-      else:
+      if context.executing_eagerly():
         grad = backprop.gradients_function(loop)(x)[0]
+      else:
+        grad = gradients_impl.gradients(loop(x), [x])[0]
       self.assertAllClose(31.0, self.evaluate(grad))
 
   def testSumOfTwoReadVariablesWithoutRepeatGrad(self):
@@ -1158,14 +1117,14 @@ class TensorArrayTest(test.TestCase):
           infer_shape=True)
       w0 = ta1.split(value, [1, 2])
       r0 = w0.read(0)
-      if context.in_graph_mode():
+      if context.executing_eagerly():
+        self.assertEqual((1, 2), r0.get_shape())
+        self.assertEqual((2, 2), w0.read(1).get_shape())
+      else:
         self.assertEqual(r0.get_shape().ndims, None)
         self.assertEqual(
             tensor_shape.TensorShape(
                 ta1.handle.op.get_attr("element_shape")).ndims, None)
-      else:
-        self.assertEqual((1, 2), r0.get_shape())
-        self.assertEqual((2, 2), w0.read(1).get_shape())
 
   def testWriteUnknownShape(self):
     with self.test_session(use_gpu=True):
@@ -1297,13 +1256,13 @@ class TensorArrayTest(test.TestCase):
       g = func(values)
       grad_ys = [[[2.0, 3.0], [4.0, 5.0]]]
       # Test combined gradients + aggregation of read(0)
-      if context.in_graph_mode():
-        grad = gradients_impl.gradients(ys=[g], xs=[values], grad_ys=grad_ys)
-        g_vals, grad_vals = session.run([[g], grad])
-      else:
+      if context.executing_eagerly():
         g_vals = [g]
         grad_vals = backprop.gradients_function(func)(
             values, dy=constant_op.constant(grad_ys[0], dtype=dtypes.float32))
+      else:
+        grad = gradients_impl.gradients(ys=[g], xs=[values], grad_ys=grad_ys)
+        g_vals, grad_vals = session.run([[g], grad])
 
       # Gradients for 8 of the 10 unread components are zero.
       expected_grad = np.zeros((10, 2))
@@ -1453,13 +1412,13 @@ class TensorArrayTest(test.TestCase):
       # Tests correct properties on new TensorArrays.
       self.assertEqual(dtypes.float32, ta0.dtype)
       self.assertEqual(dtypes.int32, ta1.dtype)
-      if context.in_graph_mode():
-        self.assertEqual(tensor_shape.unknown_shape(), read0.get_shape())
+      if context.executing_eagerly():
+        self.assertEqual(tensor_shape.scalar(), read0.get_shape())
       else:
-        self.assertEqual(tensor_shape.scalar(), read1.get_shape())
+        self.assertEqual(tensor_shape.unknown_shape(), read0.get_shape())
       self.assertEqual(tensor_shape.scalar(), read1.get_shape())
 
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         variables.global_variables_initializer().run()
 
       read0_v, read1_v, size0_v, size1_v = self.evaluate((read0, read1, size0,
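Note on the relaxed assertions in tensor_array_ops_test.py above: the per-mode branches are replaced by a single pattern per check, such as `(float|float32)` or the bare substring `index -1`. The sketch below checks those patterns against the message strings quoted in the removed lines. It assumes `assertRaisesOpError` applies the pattern with search semantics (as `re.search` does); the `matches` helper is hypothetical and not part of TensorFlow.

```python
import re

# Messages copied from the removed graph-mode and eager-mode branches.
GRAPH_DTYPE_MSG = (
    "TensorArray dtype is float but Op is trying to write dtype string")
EAGER_DTYPE_MSG = (
    "TensorArray dtype is float32 but Op is trying to write dtype string")
GRAPH_INDEX_MSG = (
    "Tried to write to index -1 but array is not resizeable and size is: 3")
EAGER_INDEX_MSG = "Writing to negative indices (index -1) is not allowed."

# Patterns copied from the added lines.
DTYPE_PATTERN = ("TensorArray dtype is (float|float32) but Op is trying to "
                 "write dtype string")
INDEX_PATTERN = "index -1"


def matches(pattern, message):
    """Hypothetical helper: True if `pattern` is found anywhere in `message`."""
    return re.search(pattern, message) is not None


for msg in (GRAPH_DTYPE_MSG, EAGER_DTYPE_MSG):
    print(matches(DTYPE_PATTERN, msg))
for msg in (GRAPH_INDEX_MSG, EAGER_INDEX_MSG):
    print(matches(INDEX_PATTERN, msg))
```

The trade-off is that a short pattern like `index -1` or `shape` is less specific than the original messages, so it would also accept an unrelated error that happens to contain the same substring.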
diff --git a/tensorflow/python/kernel_tests/unique_op_test.py b/tensorflow/python/kernel_tests/unique_op_test.py
index 4498fd9fe9986c134b92aed192a6de6f06109bd9..bbc040dc13fc151b970f130eeb76fa1639245416 100644
--- a/tensorflow/python/kernel_tests/unique_op_test.py
+++ b/tensorflow/python/kernel_tests/unique_op_test.py
@@ -66,9 +66,9 @@ class UniqueTest(test.TestCase):
     for dtype in [np.int32, np.int64]:
       x = np.array([[1, 0, 0], [1, 0, 0], [2, 0, 0]])
       with self.test_session() as sess:
-        y0, idx0 = gen_array_ops._unique_v2(x, axis=np.array([0], dtype))
+        y0, idx0 = gen_array_ops.unique_v2(x, axis=np.array([0], dtype))
         tf_y0, tf_idx0 = sess.run([y0, idx0])
-        y1, idx1 = gen_array_ops._unique_v2(x, axis=np.array([1], dtype))
+        y1, idx1 = gen_array_ops.unique_v2(x, axis=np.array([1], dtype))
         tf_y1, tf_idx1 = sess.run([y1, idx1])
       self.assertAllEqual(tf_y0, np.array([[1, 0, 0], [2, 0, 0]]))
       self.assertAllEqual(tf_idx0, np.array([0, 0, 1]))
@@ -80,7 +80,7 @@ class UniqueTest(test.TestCase):
     # by default, the axis will be wrapped to allow `axis=None`.
     x = np.random.randint(2, high=10, size=7000)
     with self.test_session() as sess:
-      y, idx = gen_array_ops._unique_v2(x, axis=np.array([], np.int32))
+      y, idx = gen_array_ops.unique_v2(x, axis=np.array([], np.int32))
       tf_y, tf_idx = sess.run([y, idx])
 
     self.assertEqual(len(x), len(tf_idx))
@@ -137,10 +137,10 @@ class UniqueWithCountsTest(test.TestCase):
     for dtype in [np.int32, np.int64]:
       x = np.array([[1, 0, 0], [1, 0, 0], [2, 0, 0]])
       with self.test_session() as sess:
-        y0, idx0, count0 = gen_array_ops._unique_with_counts_v2(
+        y0, idx0, count0 = gen_array_ops.unique_with_counts_v2(
             x, axis=np.array([0], dtype))
         tf_y0, tf_idx0, tf_count0 = sess.run([y0, idx0, count0])
-        y1, idx1, count1 = gen_array_ops._unique_with_counts_v2(
+        y1, idx1, count1 = gen_array_ops.unique_with_counts_v2(
             x, axis=np.array([1], dtype))
         tf_y1, tf_idx1, tf_count1 = sess.run([y1, idx1, count1])
       self.assertAllEqual(tf_y0, np.array([[1, 0, 0], [2, 0, 0]]))
@@ -155,7 +155,7 @@ class UniqueWithCountsTest(test.TestCase):
     # by default, the axis will be wrapped to allow `axis=None`.
     x = np.random.randint(2, high=10, size=7000)
     with self.test_session() as sess:
-      y, idx, count = gen_array_ops._unique_with_counts_v2(
+      y, idx, count = gen_array_ops.unique_with_counts_v2(
           x, axis=np.array([], np.int32))
       tf_y, tf_idx, tf_count = sess.run([y, idx, count])
 
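Note on the `unique_v2` expectations in unique_op_test.py above: the asserted result for `axis=0` is `[[1, 0, 0], [2, 0, 0]]` with indices `[0, 0, 1]`. A minimal pure-Python reference for that behavior, assuming unique slices are reported in order of first occurrence, might look like the following. The function name is hypothetical and this is not the kernel's implementation.

```python
def unique_rows_first_seen(rows):
    """Return (unique_rows, idx) with unique rows in first-occurrence order.

    idx[i] is the position in unique_rows of the row equal to rows[i].
    """
    seen = {}      # maps a row (as a tuple) to its position in `unique`
    unique = []
    idx = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen[key] = len(unique)
            unique.append(list(row))
        idx.append(seen[key])
    return unique, idx


# Same input as the test above.
x = [[1, 0, 0], [1, 0, 0], [2, 0, 0]]
y0, idx0 = unique_rows_first_seen(x)
print(y0, idx0)
```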
diff --git a/tensorflow/python/kernel_tests/variable_ops_test.py b/tensorflow/python/kernel_tests/variable_ops_test.py
index 79071029fd42374964d12f513e9c510bdc7400eb..cf369c071813120fef685b7220292d50b966cf11 100644
--- a/tensorflow/python/kernel_tests/variable_ops_test.py
+++ b/tensorflow/python/kernel_tests/variable_ops_test.py
@@ -165,26 +165,26 @@ class VariableOpTest(test.TestCase):
 
   def testTemporaryVariable(self):
     with self.test_session(use_gpu=True):
-      var = gen_state_ops._temporary_variable(
+      var = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="foo")
       var = state_ops.assign(var, [[4.0, 5.0]])
       var = state_ops.assign_add(var, [[6.0, 7.0]])
-      final = gen_state_ops._destroy_temporary_variable(var, var_name="foo")
+      final = gen_state_ops.destroy_temporary_variable(var, var_name="foo")
       self.assertAllClose([[10.0, 12.0]], final.eval())
 
   def testDestroyNonexistentTemporaryVariable(self):
     with self.test_session(use_gpu=True):
-      var = gen_state_ops._temporary_variable([1, 2], dtypes.float32)
-      final = gen_state_ops._destroy_temporary_variable(var, var_name="bad")
+      var = gen_state_ops.temporary_variable([1, 2], dtypes.float32)
+      final = gen_state_ops.destroy_temporary_variable(var, var_name="bad")
       with self.assertRaises(errors.NotFoundError):
         final.eval()
 
   def testDuplicateTemporaryVariable(self):
     with self.test_session(use_gpu=True):
-      var1 = gen_state_ops._temporary_variable(
+      var1 = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="dup")
       var1 = state_ops.assign(var1, [[1.0, 2.0]])
-      var2 = gen_state_ops._temporary_variable(
+      var2 = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="dup")
       var2 = state_ops.assign(var2, [[3.0, 4.0]])
       final = var1 + var2
@@ -193,25 +193,25 @@ class VariableOpTest(test.TestCase):
 
   def testDestroyTemporaryVariableTwice(self):
     with self.test_session(use_gpu=True):
-      var = gen_state_ops._temporary_variable([1, 2], dtypes.float32)
-      val1 = gen_state_ops._destroy_temporary_variable(var, var_name="dup")
-      val2 = gen_state_ops._destroy_temporary_variable(var, var_name="dup")
+      var = gen_state_ops.temporary_variable([1, 2], dtypes.float32)
+      val1 = gen_state_ops.destroy_temporary_variable(var, var_name="dup")
+      val2 = gen_state_ops.destroy_temporary_variable(var, var_name="dup")
       final = val1 + val2
       with self.assertRaises(errors.NotFoundError):
         final.eval()
 
   def testTemporaryVariableNoLeak(self):
     with self.test_session(use_gpu=True):
-      var = gen_state_ops._temporary_variable(
+      var = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="bar")
       final = array_ops.identity(var)
       final.eval()
 
   def testTwoTemporaryVariablesNoLeaks(self):
     with self.test_session(use_gpu=True):
-      var1 = gen_state_ops._temporary_variable(
+      var1 = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="var1")
-      var2 = gen_state_ops._temporary_variable(
+      var2 = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="var2")
       final = var1 + var2
       final.eval()
diff --git a/tensorflow/python/kernel_tests/variable_scope_test.py b/tensorflow/python/kernel_tests/variable_scope_test.py
index 8527f116f9541942e52ba2ab635ca1212ea38583..86ab9fbb70b5efcf06cc064617df14deb18c1f98 100644
--- a/tensorflow/python/kernel_tests/variable_scope_test.py
+++ b/tensorflow/python/kernel_tests/variable_scope_test.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 import gc
+import threading
 
 import numpy
 
@@ -166,12 +167,10 @@ class VariableScopeTest(test.TestCase):
     self.evaluate(variables_lib.variables_initializer([w]))
     self.assertAllClose(self.evaluate(w.value()), [1, 2, 3])
 
-    if context.in_graph_mode():
-      with self.assertRaises(TypeError):
-        variable_scope.get_variable("x4", initializer={})
-    else:
-      with self.assertRaises(ValueError):
-        variable_scope.get_variable("x4", initializer={})
+    # Quirk to revisit: eager raises ValueError, graph mode raises TypeError.
+    error = ValueError if context.executing_eagerly() else TypeError
+    with self.assertRaises(error):
+      variable_scope.get_variable("x4", initializer={})
 
   @test_util.run_in_graph_and_eager_modes()
   def testInitFromNonInitializer(self):
@@ -267,7 +266,7 @@ class VariableScopeTest(test.TestCase):
         self.assertAllClose(self.evaluate(losses[2]), 0.5)
       with variable_scope.variable_scope("foo", reuse=True):
         # reuse=True is for now only supported when eager execution is disabled.
-        if context.in_graph_mode():
+        if not context.executing_eagerly():
           v = variable_scope.get_variable("v",
                                           [])  # "v" is alredy there, reused
           losses = ops.get_collection(ops.GraphKeys.REGULARIZATION_LOSSES)
@@ -374,7 +373,7 @@ class VariableScopeTest(test.TestCase):
       v = variable_scope.get_variable("v", [])
       self.evaluate(variables_lib.variables_initializer([v]))
       self.assertAllClose(self.evaluate(v.value()), 0.3)
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         # Check that we can set reuse.
         variable_scope.get_variable_scope().reuse_variables()
         with self.assertRaises(ValueError):  # Fail, w does not exist yet.
@@ -408,7 +407,7 @@ class VariableScopeTest(test.TestCase):
       with variable_scope.variable_scope("tower") as tower:
         with ops.name_scope("scope2") as sc2:
           self.assertEqual(sc2, "testVarScopeNameScope1/tower/scope2/")
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         with variable_scope.variable_scope(
             tower):  # Re-entering acts like another "tower".
           with ops.name_scope("scope2") as sc2:
@@ -422,7 +421,7 @@ class VariableScopeTest(test.TestCase):
       with variable_scope.variable_scope("tower"):
         with ops.name_scope("scope2") as sc2:
           self.assertEqual(sc2, "testVarScopeNameScope2/tower/scope2/")
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         with variable_scope.variable_scope(tower):
           with ops.name_scope("scope2") as sc2:
             self.assertEqual(sc2, "testVarScopeNameScope2/tower_1/scope2/")
@@ -903,17 +902,15 @@ class VariableScopeTest(test.TestCase):
             "w", [], collections=["foo"])
         self.assertEqual(local_var.name, "outer/w:0")
 
-    # Since variable is local, it should be in the local variable collection
-    # but not the trainable collection.
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
+      # Since variable is local, it should be in the local variable collection
+      # but not the trainable collection.
       self.assertIn(local_var,
                     ops.get_collection(ops.GraphKeys.LOCAL_VARIABLES))
       self.assertIn(local_var, ops.get_collection("foo"))
       self.assertNotIn(local_var,
                        ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES))
-
-    # Check that local variable respects `reuse`.
-    if context.in_graph_mode():
+      # Check that local variable respects `reuse`.
       with variable_scope.variable_scope(outer, "default", reuse=True):
         self.assertEqual(
             variable_scope.get_local_variable("w", []).name, "outer/w:0")
@@ -1353,5 +1350,91 @@ class PartitionInfoTest(test.TestCase):
     self.assertEqual(0, partition_info.single_slice_dim([2, 3]))
 
 
+class VariableScopeMultithreadedTest(test.TestCase):
+
+  def testTwoThreadsDisjointScopeEntry(self):
+
+    def thread_fn(i, graph):
+      with graph.as_default():
+        with variable_scope.variable_scope("foo"):
+          if i == 0:
+            v = variable_scope.get_variable("v", [])
+            self.assertEquals("foo/v:0", v.name)
+          else:
+            # Any thread after the first one should fail to create variable
+            # with the same name.
+            with self.assertRaises(ValueError):
+              variable_scope.get_variable("v", [])
+
+    graph = ops.get_default_graph()
+    threads = [
+        threading.Thread(target=thread_fn, args=(i, graph,)) for i in range(2)]
+
+    threads[0].start()
+    # Allow thread 0 to finish before starting thread 1.
+    threads[0].join()
+    threads[1].start()
+    threads[1].join()
+
+  def testTwoThreadsNestedScopeEntry(self):
+
+    def thread_fn(i, graph, run_event, pause_event):
+      with graph.as_default():
+        with variable_scope.variable_scope("foo"):
+          if i == 0:
+            v = variable_scope.get_variable("v", [])
+            self.assertEquals("foo/v:0", v.name)
+          else:
+            # Any thread after the first one should fail to create variable
+            # with the same name.
+            with self.assertRaises(ValueError):
+              variable_scope.get_variable("v", [])
+          pause_event.set()
+          run_event.wait()
+
+    graph = ops.get_default_graph()
+    run_events = [threading.Event() for _ in range(2)]
+    pause_events = [threading.Event() for _ in range(2)]
+    threads = [
+        threading.Thread(
+            target=thread_fn, args=(i, graph, run_events[i], pause_events[i]))
+        for i in range(2)
+    ]
+
+    # Start first thread.
+    threads[0].start()
+    pause_events[0].wait()
+    # Start next thread once the first thread has paused.
+    threads[1].start()
+    pause_events[1].wait()
+    # Resume both threads.
+    run_events[0].set()
+    run_events[1].set()
+    threads[0].join()
+    threads[1].join()
+
+  def testReenterMainScope(self):
+
+    def thread_fn(graph, main_thread_scope):
+      with graph.as_default():
+        # Variable created with main scope will have prefix "main".
+        with variable_scope.variable_scope(main_thread_scope):
+          with variable_scope.variable_scope("foo"):
+            v = variable_scope.get_variable("v", [])
+            self.assertEquals("main/foo/v:0", v.name)
+
+        # Variable created outside main scope will not have prefix "main".
+        with variable_scope.variable_scope("bar"):
+          v = variable_scope.get_variable("v", [])
+          self.assertEquals("bar/v:0", v.name)
+
+    graph = ops.get_default_graph()
+    with variable_scope.variable_scope("main") as main_thread_scope:
+      thread = threading.Thread(
+          target=thread_fn, args=(graph, main_thread_scope))
+      thread.start()
+      thread.join()
+
+
 if __name__ == "__main__":
   test.main()
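Note on `testTwoThreadsNestedScopeEntry` above: the test keeps both threads inside their scopes at the same time by pairing a "pause" event (worker to main) with a "run" event (main to worker). The sketch below isolates that handshake using only the standard library. The `run_handshake` function and the log strings are illustrative assumptions, and no TensorFlow state is involved.

```python
import threading


def run_handshake(num_threads=2):
    """Start workers one at a time, hold them all paused, then release them."""
    log = []
    log_lock = threading.Lock()

    def worker(i, run_event, pause_event):
        with log_lock:
            log.append("enter-%d" % i)
        pause_event.set()  # Tell the main thread this worker has paused.
        run_event.wait()   # Block until the main thread releases the worker.
        with log_lock:
            log.append("exit-%d" % i)

    run_events = [threading.Event() for _ in range(num_threads)]
    pause_events = [threading.Event() for _ in range(num_threads)]
    threads = [
        threading.Thread(
            target=worker, args=(i, run_events[i], pause_events[i]))
        for i in range(num_threads)
    ]

    # Start each worker only after the previous one has reached its pause.
    for thread, pause_event in zip(threads, pause_events):
        thread.start()
        pause_event.wait()

    # All workers are now paused concurrently; resume and join them.
    for run_event in run_events:
        run_event.set()
    for thread in threads:
        thread.join()
    return log


log = run_handshake()
print(log[:2])
```

The entry order is deterministic because each worker starts only after the previous one has signalled its pause. The exit order is not, since all workers are released together.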
diff --git a/tensorflow/python/kernel_tests/variables_test.py b/tensorflow/python/kernel_tests/variables_test.py
index b16c8c002c98a0351d1fc55fce061695327a18c9..27599868b74be323189b872c2147c6a33f84d170 100644
--- a/tensorflow/python/kernel_tests/variables_test.py
+++ b/tensorflow/python/kernel_tests/variables_test.py
@@ -687,7 +687,7 @@ class VariableContainerTest(test.TestCase):
         v1 = variables.Variable([1])
         with ops.container("l2"):
           v2 = variables.Variable([2])
-          special_v = gen_state_ops._variable(
+          special_v = gen_state_ops.variable(
               shape=[1],
               dtype=dtypes.float32,
               name="VariableInL3",
diff --git a/tensorflow/python/kernel_tests/xent_op_test.py b/tensorflow/python/kernel_tests/xent_op_test.py
index e152f02d8e983364603053dc5c8d14b5dfaf3605..e3e120a4eb01885ac5ac5e41f82ad3e480a83a77 100644
--- a/tensorflow/python/kernel_tests/xent_op_test.py
+++ b/tensorflow/python/kernel_tests/xent_op_test.py
@@ -48,7 +48,7 @@ class XentTest(test.TestCase):
   def _testXent(self, np_features, np_labels, use_gpu=False):
     np_loss, np_backprop = self._npXent(np_features, np_labels)
     with self.test_session(use_gpu=use_gpu) as sess:
-      loss, backprop = gen_nn_ops._softmax_cross_entropy_with_logits(
+      loss, backprop = gen_nn_ops.softmax_cross_entropy_with_logits(
           np_features, np_labels)
       tf_loss, tf_backprop = sess.run([loss, backprop])
     self.assertAllCloseAccordingToType(np_loss, tf_loss)
@@ -71,7 +71,7 @@ class XentTest(test.TestCase):
   def _testSingleClass(self, use_gpu=False):
     for dtype in np.float16, np.float32:
       with self.test_session(use_gpu=use_gpu) as sess:
-        loss, backprop = gen_nn_ops._softmax_cross_entropy_with_logits(
+        loss, backprop = gen_nn_ops.softmax_cross_entropy_with_logits(
             np.array([[1.], [-1.], [0.]]).astype(dtype),
             np.array([[-1.], [0.], [1.]]).astype(dtype))
         tf_loss, tf_backprop = sess.run([loss, backprop])
@@ -89,7 +89,7 @@ class XentTest(test.TestCase):
       np_labels = np.array([[[0., 0., 0., 1.]], [[0., .5, .5,
                                                   0.]]]).astype(dtype)
       self.assertRaisesRegexp(ValueError, "must be rank 2",
-                              gen_nn_ops._softmax_cross_entropy_with_logits,
+                              gen_nn_ops.softmax_cross_entropy_with_logits,
                               np_features, np_labels)
 
   def testNpXent(self):
@@ -131,14 +131,14 @@ class XentTest(test.TestCase):
   def testShapeMismatch(self):
     with self.test_session():
       with self.assertRaises(ValueError):
-        gen_nn_ops._softmax_cross_entropy_with_logits(
+        gen_nn_ops.softmax_cross_entropy_with_logits(
             [[0., 1.], [2., 3.]], [[0., 1., 0.], [1., 0., 0.]])
 
   def testNotMatrix(self):
     with self.test_session():
       with self.assertRaises(ValueError):
-        gen_nn_ops._softmax_cross_entropy_with_logits([0., 1., 2., 3.],
-                                                      [0., 1., 0., 1.])
+        gen_nn_ops.softmax_cross_entropy_with_logits([0., 1., 2., 3.],
+                                                     [0., 1., 0., 1.])
 
   def testHalf(self):
     self._testAll(
diff --git a/tensorflow/python/layers/base.py b/tensorflow/python/layers/base.py
index 8314c4aa87a5b54effc44c371703267517ffa07d..e4395bea92961d348ed3841a31cacb91aaa282ec 100644
--- a/tensorflow/python/layers/base.py
+++ b/tensorflow/python/layers/base.py
@@ -36,12 +36,13 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops import variables as tf_variables
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpointable
 from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
 
 
 @tf_export('layers.Layer')
-class Layer(object):
+class Layer(checkpointable.CheckpointableBase):
   """Base layer class.
 
   This is the class from which all layers inherit, implementing common
@@ -114,7 +115,7 @@ class Layer(object):
     # Provides information about which inputs are compatible with the layer.
     self.input_spec = None
 
-    if activity_regularizer and context.in_eager_mode():
+    if activity_regularizer and context.executing_eagerly():
       raise ValueError(
           ('Activity regularization is not supported when executing eagerly. '
            'Got activity_regularizer=%s') % (activity_regularizer,))
@@ -126,12 +127,12 @@ class Layer(object):
     # return tensors. When using graph execution, _losses is a list of ops.
     self._losses = []
     self._reuse = kwargs.get('_reuse')
-    self._graph = ops.get_default_graph()
+    self._graph = None  # Will be set at build time.
     self._dtype = None if dtype is None else dtypes.as_dtype(dtype).name
-    call_fn_args = estimator_util.fn_args(self.call)
-    self._compute_previous_mask = ('mask' in call_fn_args or
+    self._call_fn_args = estimator_util.fn_args(self.call)
+    self._compute_previous_mask = ('mask' in self._call_fn_args or
                                    hasattr(self, 'compute_mask'))
-    self._call_has_scope_arg = 'scope' in call_fn_args
+    self._call_has_scope_arg = 'scope' in self._call_fn_args
 
     # These lists will be filled via successive calls
     # to self._add_inbound_node().
@@ -227,7 +228,7 @@ class Layer(object):
 
   @property
   def updates(self):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError('Layer.updates not supported in Eager mode.')
     if not self.trainable and not self.stateful:
       return []
@@ -259,7 +260,7 @@ class Layer(object):
         have is available at runtime.
         A step counter might fall into this category.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return  # Updates already applied when in eager mode.
 
     updates = _to_list(updates)
@@ -285,7 +286,7 @@ class Layer(object):
     Raises:
       RuntimeError: If called in Eager mode.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError('`get_updates_for()` not supported in Eager mode.')
 
     # Updates disabled if layer is not trainable and not explicitly stateful.
@@ -316,7 +317,7 @@ class Layer(object):
     Returns:
       A list of tensors.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       # _losses may only contain variable regularization losses when executing
       # eagerly, and they have been saved as lambdas to be executed when
       # requested.
@@ -354,7 +355,7 @@ class Layer(object):
     Raises:
       RuntimeError: If called in Eager mode.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       # TODO(fchollet): it should be possible (and highly desirable) to support
       # `add_loss` in eager mode. This allows great convenience and flexibility
       # in defining custom losses on the fly (e.g. in VAEs).
@@ -388,7 +389,7 @@ class Layer(object):
     Raises:
       RuntimeError: If called in Eager mode.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError('Layer.get_losses_for not supported in Eager mode.')
 
     if inputs is None:
@@ -508,7 +509,7 @@ class Layer(object):
     # will occur; it should be None if and only if initialization will take
     # place in the eager context.
     init_graph = None
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       default_graph = ops.get_default_graph()
       if default_graph.building_function:
         with ops.init_scope():
@@ -516,7 +517,7 @@ class Layer(object):
           # will be lifted; if initialization ops will be lifted into
           # the eager context, then there is nothing to retrieve, since variable
           # collections are not supported when eager execution is enabled.
-          if context.in_graph_mode():
+          if not context.executing_eagerly():
             init_graph = ops.get_default_graph()
             existing_variables = set(tf_variables.global_variables())
       else:
@@ -532,13 +533,17 @@ class Layer(object):
     with vs.variable_scope(
         self._scope, reuse=reuse, auxiliary_name_scope=False) as scope:
       with ops.name_scope(self._name_scope_name(scope)):
-        variable = vs.get_variable(name,
-                                   shape=shape,
-                                   initializer=initializer,
-                                   dtype=dtypes.as_dtype(dtype),
-                                   constraint=constraint,
-                                   trainable=trainable and self.trainable,
-                                   partitioner=partitioner)
+        variable = self._add_variable_with_custom_getter(
+            name=name,
+            shape=shape,
+            getter=vs.get_variable,
+            # Manage errors in Layer rather than Checkpointable.
+            overwrite=True,
+            initializer=initializer,
+            dtype=dtypes.as_dtype(dtype),
+            constraint=constraint,
+            trainable=trainable and self.trainable,
+            partitioner=partitioner)
 
         if init_graph is not None:  # pylint: disable=protected-access
           # The variable was created and initialized in a graph.
@@ -573,7 +578,7 @@ class Layer(object):
           if isinstance(variable, tf_variables.PartitionedVariable):
             raise RuntimeError(
                 'Partitioned variable regularization is not yet '
-                'supported when executing eagerly. File a feature request'
+                'supported when executing eagerly. File a feature request '
                 'if this is important to you.')
           # Save a zero-argument lambda which runs the regularizer on the
           # variable, to be executed when `Layer.losses` is requested.
@@ -619,16 +624,17 @@ class Layer(object):
     self._set_scope(kwargs.pop('scope', None))
     input_list = nest.flatten(inputs)
 
-    in_graph_mode = context.in_graph_mode()
+    build_graph = not context.executing_eagerly()
     in_deferred_mode = isinstance(input_list[0], _DeferredTensor)
     # Ensure the Layer, if being reused, is working with inputs from
     # the same graph as where it was created.
-    if in_graph_mode:
+    if build_graph:
       try:
-        ops._get_graph_from_inputs(input_list, graph=self.graph)  # pylint: disable=protected-access
+        # Set the layer's graph from its inputs the first time it is called.
+        self._graph = ops._get_graph_from_inputs(input_list, graph=self._graph)  # pylint: disable=protected-access
       except ValueError as e:
         raise ValueError('Input graph and Layer graph are not the same: %s' % e)
-    if in_graph_mode or in_deferred_mode:
+    if build_graph or in_deferred_mode:
       user_kwargs = copy.copy(kwargs)
 
     # Handle Keras mask propagation from previous layer to current layer.
@@ -636,8 +642,9 @@ class Layer(object):
     if (not hasattr(self, '_compute_previous_mask') or
         self._compute_previous_mask):
       previous_mask = _collect_previous_mask(inputs)
-      if ('mask' in estimator_util.fn_args(self.call) and
-          'mask' not in kwargs and
+      if not hasattr(self, '_call_fn_args'):
+        self._call_fn_args = estimator_util.fn_args(self.call)
+      if ('mask' in self._call_fn_args and 'mask' not in kwargs and
           not _is_all_none(previous_mask)):
         # The previous layer generated a mask, and mask was not explicitly pass
         # to __call__, hence we set previous_mask as the default value.
@@ -662,13 +669,14 @@ class Layer(object):
     with scope_context_manager as scope:
       with ops.name_scope(self._name_scope_name(scope)):
         if not self.built:
-          if not in_graph_mode:
+          if not build_graph:
             # Activity regularization is currently unsupported in Eager mode.
             if self._activity_regularizer:
-              raise ValueError('activity_regularizer currently unsupported in '
-                               'Eager mode. Found an activity_regularizer in '
-                               '%s(%s).' % (self.__class__.__name__, self))
-          if not in_graph_mode and not in_deferred_mode:
+              raise ValueError(
+                  'activity_regularizer currently unsupported with '
+                  'eager execution enabled. Found an activity_regularizer in '
+                  '%s(%s).' % (self.__class__.__name__, self))
+          if not build_graph and not in_deferred_mode:
             # TODO(agarwal): support _keras_history in Eager mode.
             for x in input_list:
               if hasattr(x, '_keras_history'):
@@ -693,11 +701,13 @@ class Layer(object):
           # TODO(agarwal): Fix the sub-classes and avoid this complexity.
           call_has_scope_arg = self._call_has_scope_arg
         except AttributeError:
-          call_has_scope_arg = 'scope' in estimator_util.fn_args(self.call)
+          self._call_fn_args = estimator_util.fn_args(self.call)
+          self._call_has_scope_arg = 'scope' in self._call_fn_args
+          call_has_scope_arg = self._call_has_scope_arg
         if call_has_scope_arg:
           kwargs['scope'] = scope
         # Check input assumptions set after layer building, e.g. input shape.
-        if in_graph_mode or in_deferred_mode:
+        if build_graph or in_deferred_mode:
           self._assert_input_compatibility(inputs)
 
         if not in_deferred_mode:
@@ -721,7 +731,7 @@ class Layer(object):
           if len(outputs) == 1:
             outputs = outputs[0]
 
-        if in_graph_mode:
+        if build_graph:
           # Apply activity regularization.
           # Note that it should be applied every time the layer creates a new
           # output, since it is output-specific.
@@ -743,7 +753,7 @@ class Layer(object):
             else:
               outputs._keras_mask = output_mask  # pylint: disable=protected-access
 
-    if in_graph_mode:
+    if build_graph:
       # If all input tensors have history metadata,
       # we update the output tensors
       # with corresponding history metadata, thus eventually allowing to use
@@ -766,7 +776,7 @@ class Layer(object):
       # Update global default collections.
       _add_elements_to_collection(self.updates, ops.GraphKeys.UPDATE_OPS)
 
-    if in_deferred_mode or in_graph_mode:
+    if in_deferred_mode or build_graph:
       if _have_all_keras_metadata(inputs):
         # Add an inbound node to the layer, so it can keep track of this call.
         # This updates the layer history of the output tensor(s).
@@ -778,7 +788,7 @@ class Layer(object):
 
   @property
   def graph(self):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError('Layer.graph not supported in Eager mode.')
     return self._graph
 
@@ -882,7 +892,7 @@ class Layer(object):
         mode.
         ValueError: If the index provided does not match any node.
     """
-    assert context.in_graph_mode()
+    assert not context.executing_eagerly()
     if not self._inbound_nodes:
       raise RuntimeError('The layer has never been called '
                          'and thus has no defined ' + attr_name + '.')
@@ -912,7 +922,7 @@ class Layer(object):
     Raises:
       RuntimeError: If called in Eager mode.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError(
           'Layer.get_input_shape_at not supported in Eager mode.')
     return self._get_node_attribute_at_index(node_index, 'input_shapes',
@@ -934,7 +944,7 @@ class Layer(object):
     Raises:
       RuntimeError: If called in Eager mode.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError(
           'Layer.get_output_shape_at not supported in Eager mode.')
     return self._get_node_attribute_at_index(node_index, 'output_shapes',
@@ -955,7 +965,7 @@ class Layer(object):
     Raises:
       RuntimeError: If called in Eager mode.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError('Layer.get_input_at not supported in Eager mode.')
     return self._get_node_attribute_at_index(node_index, 'input_tensors',
                                              'input')
@@ -975,7 +985,7 @@ class Layer(object):
     Raises:
       RuntimeError: If called in Eager mode.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError('Layer.get_output_at not supported in Eager mode.')
     return self._get_node_attribute_at_index(node_index, 'output_tensors',
                                              'output')
@@ -998,7 +1008,7 @@ class Layer(object):
       RuntimeError: If called in Eager mode.
       AttributeError: If no inbound nodes are found.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError('Layer.input not supported in Eager mode.')
     if not self._inbound_nodes:
       raise AttributeError('Layer ' + self.name +
@@ -1020,7 +1030,7 @@ class Layer(object):
         layers.
       RuntimeError: if called in Eager mode.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError('Layer.output not supported in Eager mode.')
     if not self._inbound_nodes:
       raise AttributeError('Layer ' + self.name + ' has no inbound nodes.')
@@ -1042,7 +1052,7 @@ class Layer(object):
         AttributeError: if the layer has no defined input_shape.
         RuntimeError: if called in Eager mode.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError('Layer.input_shape not supported in Eager mode.')
     if not self._inbound_nodes:
       raise AttributeError('The layer has never been called '
@@ -1103,7 +1113,7 @@ class Layer(object):
         AttributeError: if the layer has no defined output shape.
         RuntimeError: if called in Eager mode.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError('Layer.output_shape not supported in Eager mode.')
     if not self._inbound_nodes:
       raise AttributeError('The layer has never been called '
@@ -1461,7 +1471,7 @@ def _to_list(x):
 
 
 def _add_elements_to_collection(elements, collection_list):
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('Using collections from Layers not supported in Eager '
                        'mode. Tried to add %s to %s' % (elements,
                                                         collection_list))
diff --git a/tensorflow/python/layers/base_test.py b/tensorflow/python/layers/base_test.py
index 91b8988d31c1f04be8134733e5e919c738ccb74f..9ed4afeaba931c47d2a1e65f08489773f0b9eb1b 100644
--- a/tensorflow/python/layers/base_test.py
+++ b/tensorflow/python/layers/base_test.py
@@ -44,7 +44,7 @@ class BaseLayerTest(test.TestCase):
     self.assertEqual(layer.variables, [])
     self.assertEqual(layer.trainable_variables, [])
     self.assertEqual(layer.non_trainable_variables, [])
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       # updates, losses only supported in GRAPH mode
       self.assertEqual(layer.updates, [])
       self.assertEqual(layer.losses, [])
@@ -63,7 +63,7 @@ class BaseLayerTest(test.TestCase):
     self.assertEqual(layer.variables, [variable])
     self.assertEqual(layer.trainable_variables, [variable])
     self.assertEqual(layer.non_trainable_variables, [])
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.assertEqual(
           layer.variables,
           ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES))
@@ -77,7 +77,7 @@ class BaseLayerTest(test.TestCase):
     self.assertEqual(layer.variables, [variable, variable_2])
     self.assertEqual(layer.trainable_variables, [variable])
     self.assertEqual(layer.non_trainable_variables, [variable_2])
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.assertEqual(
           len(ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)), 1)
 
@@ -161,7 +161,7 @@ class BaseLayerTest(test.TestCase):
     inputs = random_ops.random_uniform((5,), seed=1)
     outputs = layer.apply(inputs)
     self.assertEqual(layer.built, True)
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       # op is only supported in GRAPH mode
       self.assertEqual(outputs.op.name, 'my_layer/Square')
 
@@ -210,7 +210,7 @@ class BaseLayerTest(test.TestCase):
     inputs = random_ops.random_uniform((5,), seed=1)
     outputs = layer.apply(inputs)
     self.assertEqual(layer.built, True)
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       # op only supported in GRAPH mode.
       self.assertEqual(outputs.op.name, 'my_layer/Square')
 
@@ -280,7 +280,7 @@ class BaseLayerTest(test.TestCase):
       def call(self, inputs):
         return inputs
 
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       layer = CustomerLayer()
       with self.assertRaisesRegexp(ValueError, r'requires a defined rank'):
         layer.apply(array_ops.placeholder('int32'))
@@ -307,7 +307,7 @@ class BaseLayerTest(test.TestCase):
       def call(self, inputs):
         return inputs
 
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       layer = CustomerLayer()
       with self.assertRaisesRegexp(ValueError, r'requires a defined rank'):
         layer.apply(array_ops.placeholder('int32'))
@@ -335,7 +335,7 @@ class BaseLayerTest(test.TestCase):
       def call(self, inputs):
         return inputs
 
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       layer = CustomerLayer()
       with self.assertRaisesRegexp(ValueError, r'requires a defined rank'):
         layer.apply(array_ops.placeholder('int32'))
@@ -430,7 +430,7 @@ class BaseLayerTest(test.TestCase):
     layer.apply(constant_op.constant(1))
 
     # Works
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       layer.apply(array_ops.placeholder('int32'))
       layer.apply(array_ops.placeholder('int32', shape=(2, 3)))
 
@@ -453,13 +453,7 @@ class BaseLayerTest(test.TestCase):
         return {'l' + key: inputs[key] for key in inputs}
 
     layer = DictLayer()
-    if context.in_graph_mode():
-      i1 = array_ops.placeholder('int32')
-      i2 = array_ops.placeholder('float32')
-      result = layer.apply({'abel': i1, 'ogits': i2})
-      self.assertTrue(isinstance(result, dict))
-      self.assertEqual(set(['label', 'logits']), set(result.keys()))
-    else:
+    if context.executing_eagerly():
       i1 = constant_op.constant(3)
       i2 = constant_op.constant(4.0)
       result = layer.apply({'abel': i1, 'ogits': i2})
@@ -467,6 +461,12 @@ class BaseLayerTest(test.TestCase):
       self.assertEqual(set(['label', 'logits']), set(result.keys()))
       self.assertEqual(3, result['label'].numpy())
       self.assertEqual(4.0, result['logits'].numpy())
+    else:
+      i1 = array_ops.placeholder('int32')
+      i2 = array_ops.placeholder('float32')
+      result = layer.apply({'abel': i1, 'ogits': i2})
+      self.assertTrue(isinstance(result, dict))
+      self.assertEqual(set(['label', 'logits']), set(result.keys()))
 
   def testActivityRegularizer(self):
     regularizer = math_ops.reduce_sum
@@ -643,6 +643,16 @@ class BaseLayerTest(test.TestCase):
     self.assertEqual(len(layer.get_losses_for([intermediate_inputs])), 1)
     self.assertEqual(len(layer.get_losses_for([outputs])), 0)
 
+  def testLayerGraphSetInFirstApply(self):
+    with ops.Graph().as_default():
+      layer = core_layers.Dense(1)  # Graph at construction time is ignored
+    with ops.Graph().as_default():
+      layer.apply(constant_op.constant([[1]]))
+      # layer is now bound to second Graph
+    with ops.Graph().as_default(), self.assertRaisesRegexp(
+        ValueError, 'Input graph and Layer graph are not the same'):
+      layer.apply(constant_op.constant([[1]]))
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/layers/convolutional.py b/tensorflow/python/layers/convolutional.py
index bb10fe5e8bfd26e4877fb6aef73980a30f62bb5d..74e7c63fb364d9c4475af5efe7d5db95cccf8166 100644
--- a/tensorflow/python/layers/convolutional.py
+++ b/tensorflow/python/layers/convolutional.py
@@ -1664,7 +1664,7 @@ class Conv2DTranspose(Conv2D):
         padding=self.padding.upper(),
         data_format=utils.convert_data_format(self.data_format, ndim=4))
 
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       # Infer the static output shape:
       out_shape = inputs.get_shape().as_list()
       out_shape[c_axis] = self.filters
@@ -1969,7 +1969,7 @@ class Conv3DTranspose(Conv3D):
         data_format=utils.convert_data_format(self.data_format, ndim=5),
         padding=self.padding.upper())
 
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       # Infer the static output shape:
       out_shape = inputs.get_shape().as_list()
       out_shape[c_axis] = self.filters
diff --git a/tensorflow/python/layers/core.py b/tensorflow/python/layers/core.py
index 6970bf9234f5a31ee8093069ac1c933bcdb6f103..e598d9f83ab21f2dd5fabb3dd37fa0bfb5f003a4 100644
--- a/tensorflow/python/layers/core.py
+++ b/tensorflow/python/layers/core.py
@@ -35,6 +35,7 @@ from tensorflow.python.layers import utils
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import gen_math_ops
 from tensorflow.python.ops import nn
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import standard_ops
@@ -155,11 +156,11 @@ class Dense(base.Layer):
       outputs = standard_ops.tensordot(inputs, self.kernel, [[len(shape) - 1],
                                                              [0]])
       # Reshape the output back to the original ndim of the input.
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         output_shape = shape[:-1] + [self.units]
         outputs.set_shape(output_shape)
     else:
-      outputs = standard_ops.matmul(inputs, self.kernel)
+      outputs = gen_math_ops.mat_mul(inputs, self.kernel)
     if self.use_bias:
       outputs = nn.bias_add(outputs, self.bias)
     if self.activation is not None:
@@ -373,7 +374,7 @@ class Flatten(base.Layer):
 
   def call(self, inputs):
     outputs = array_ops.reshape(inputs, (array_ops.shape(inputs)[0], -1))
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       outputs.set_shape(self.compute_output_shape(inputs.get_shape()))
     return outputs
 
diff --git a/tensorflow/python/layers/core_test.py b/tensorflow/python/layers/core_test.py
index 15ce6cba21fcc78126f7db58ab18934db69c15fd..cf45b07637108422f1c612390bb01efdad6d5bcf 100644
--- a/tensorflow/python/layers/core_test.py
+++ b/tensorflow/python/layers/core_test.py
@@ -67,7 +67,7 @@ class DenseTest(test.TestCase):
       variables.global_variables_initializer().run()
       self.assertAllEqual(x.eval(), [[0.0]])
 
-  @test_util.run_in_graph_and_eager_modes()
+  @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
   def testCall(self):
     dense = core_layers.Dense(2, activation=nn_ops.relu, name='my_dense')
     inputs = random_ops.random_uniform((5, 4), seed=1)
@@ -77,12 +77,20 @@ class DenseTest(test.TestCase):
     self.assertListEqual(dense.trainable_variables,
                          [dense.kernel, dense.bias])
     self.assertListEqual(dense.non_trainable_variables, [])
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.assertEqual(
           len(ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)), 2)
     self.assertEqual(dense.kernel.name, 'my_dense/kernel:0')
     self.assertEqual(dense.bias.name, 'my_dense/bias:0')
 
+  @test_util.assert_no_new_pyobjects_executing_eagerly
+  def testNoEagerLeak(self):
+    # Tests that repeatedly constructing and building a Layer does not leak
+    # Python objects.
+    inputs = random_ops.random_uniform((5, 4), seed=1)
+    core_layers.Dense(5)(inputs)
+    core_layers.Dense(2, activation=nn_ops.relu, name='my_dense')(inputs)
+
   @test_util.run_in_graph_and_eager_modes()
   def testCallTensorDot(self):
     dense = core_layers.Dense(2, activation=nn_ops.relu, name='my_dense')
@@ -98,7 +106,7 @@ class DenseTest(test.TestCase):
     self.assertListEqual(dense.variables, [dense.kernel])
     self.assertListEqual(dense.trainable_variables, [dense.kernel])
     self.assertListEqual(dense.non_trainable_variables, [])
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.assertEqual(
           len(ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)), 1)
     self.assertEqual(dense.kernel.name, 'my_dense/kernel:0')
@@ -113,7 +121,7 @@ class DenseTest(test.TestCase):
     self.assertListEqual(dense.non_trainable_variables,
                          [dense.kernel, dense.bias])
     self.assertListEqual(dense.trainable_variables, [])
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.assertEqual(
           len(ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)), 0)
 
@@ -162,13 +170,13 @@ class DenseTest(test.TestCase):
     dense = core_layers.Dense(2, activation=nn_ops.relu, name='dense1')
     inputs = random_ops.random_uniform((5, 3), seed=1)
     outputs = dense(inputs)
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.assertEqual(outputs.op.name, 'dense1/Relu')
 
     dense = core_layers.Dense(2, name='dense2')
     inputs = random_ops.random_uniform((5, 3), seed=1)
     outputs = dense(inputs)
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.assertEqual(outputs.op.name, 'dense2/BiasAdd')
 
   def testActivityRegularizer(self):
@@ -374,7 +382,7 @@ class DropoutTest(test.TestCase):
     dp = core_layers.Dropout(0.5)
     inputs = array_ops.ones((5, 3))
     dropped = dp.apply(inputs, training=True)
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self.evaluate(variables.global_variables_initializer())
     np_output = self.evaluate(dropped)
     self.assertAlmostEqual(0., np_output.min())
diff --git a/tensorflow/python/layers/normalization.py b/tensorflow/python/layers/normalization.py
index d83292b80963d942023b5d086a089af53008efe0..29fb92ccb59aef83448cff8fd1bd759c4fda5abf 100644
--- a/tensorflow/python/layers/normalization.py
+++ b/tensorflow/python/layers/normalization.py
@@ -319,7 +319,6 @@ class BatchNormalization(base.Layer):
           initializer=self.moving_variance_initializer,
           trainable=False)
 
-      self._one_minus_decay = 1.0 - self.momentum
       if self.renorm:
         # Create variables to maintain the moving mean and standard deviation.
         # These are used in training and thus are different from the moving
@@ -338,8 +337,9 @@ class BatchNormalization(base.Layer):
           return var
 
         with ops.device(None):
-          device = ((lambda _: self.moving_mean.device)
-                    if context.in_graph_mode() else self.moving_mean.device)
+          device = (
+              self.moving_mean.device if context.executing_eagerly() else
+              (lambda _: self.moving_mean.device))
           with ops.device(device):
             self.renorm_mean = _renorm_variable('renorm_mean', param_shape)
             self.renorm_mean_weight = _renorm_variable('renorm_mean_weight', ())
@@ -347,8 +347,9 @@ class BatchNormalization(base.Layer):
           # renorm_stddev_weight. This allows us to (1) mix the average
           # stddev with the minibatch stddev early in training, and (2) compute
           # the unbiased average stddev by dividing renorm_stddev by the weight.
-          device = ((lambda _: self.moving_variance.device)
-                    if context.in_graph_mode() else self.moving_variance.device)
+          device = (
+              self.moving_variance.device if context.executing_eagerly() else
+              (lambda _: self.moving_variance.device))
           with ops.device(device):
             self.renorm_stddev = _renorm_variable('renorm_stddev', param_shape)
             self.renorm_stddev_weight = _renorm_variable(
@@ -358,20 +359,15 @@ class BatchNormalization(base.Layer):
         self._scope.set_partitioner(partitioner)
     self.built = True
 
-  def _assign_moving_average(self, variable, value, one_minus_decay):
+  def _assign_moving_average(self, variable, value, momentum):
     with ops.name_scope(None, 'AssignMovingAvg',
-                        [variable, value, one_minus_decay]) as scope:
+                        [variable, value, momentum]) as scope:
       with ops.colocate_with(variable):
-        update_delta = math_ops.multiply(
-            math_ops.subtract(variable.read_value(), value),
-            one_minus_decay)
-        if isinstance(variable, resource_variable_ops.ResourceVariable):
-          # state_ops.assign_sub does an extra read_variable_op after the
-          # assign. We avoid that here.
-          return gen_resource_variable_ops.assign_sub_variable_op(
-              variable.handle, update_delta, name=scope)
-        else:
-          return state_ops.assign_sub(variable, update_delta, name=scope)
+        decay = ops.convert_to_tensor(1.0 - momentum, name='decay')
+        if decay.dtype != variable.dtype.base_dtype:
+          decay = math_ops.cast(decay, variable.dtype.base_dtype)
+        update_delta = (variable - value) * decay
+        return state_ops.assign_sub(variable, update_delta, name=scope)
 
   def _fused_batch_norm(self, inputs, training):
     """Returns the output of fused batch norm."""
@@ -410,22 +406,16 @@ class BatchNormalization(base.Layer):
 
     training_value = utils.constant_value(training)
     if training_value is None:
-      one_minus_decay = utils.smart_cond(training,
-                                         lambda: self._one_minus_decay,
-                                         lambda: 0.)
+      momentum = utils.smart_cond(training, lambda: self.momentum, lambda: 1.0)
     else:
-      one_minus_decay = ops.convert_to_tensor(self._one_minus_decay)
+      momentum = ops.convert_to_tensor(self.momentum)
     if training_value or training_value is None:
       mean_update = self._assign_moving_average(self.moving_mean, mean,
-                                                one_minus_decay)
+                                                momentum)
       variance_update = self._assign_moving_average(self.moving_variance,
-                                                    variance, one_minus_decay)
-      if context.in_graph_mode():
-        # Note that in Eager mode, the updates are already executed when running
-        # assign_moving_averages. So we do not need to put them into
-        # collections.
-        self.add_update(mean_update, inputs=inputs)
-        self.add_update(variance_update, inputs=inputs)
+                                                    variance, momentum)
+      self.add_update(mean_update, inputs=inputs)
+      self.add_update(variance_update, inputs=inputs)
 
     return output
 
@@ -462,6 +452,7 @@ class BatchNormalization(base.Layer):
       """Updates a moving average and weight, returns the unbiased value."""
       value = array_ops.identity(value)
       def _do_update():
+        """Updates the var and weight, returns their updated ratio."""
         # Update the variables without zero debiasing. The debiasing will be
         # accomplished by dividing the exponential moving average by the weight.
         # For example, after a single update, the moving average would be
@@ -470,11 +461,14 @@ class BatchNormalization(base.Layer):
         # Make sure the weight is not updated until before r and d computation.
         with ops.control_dependencies([value]):
           weight_value = array_ops.constant(1., dtype=weight.dtype)
-        new_var = moving_averages.assign_moving_average(
-            var, value, self.renorm_momentum, zero_debias=False)
-        new_weight = moving_averages.assign_moving_average(
-            weight, weight_value, self.renorm_momentum, zero_debias=False)
+        new_var = self._assign_moving_average(var, value, self.renorm_momentum)
+        new_weight = self._assign_moving_average(weight, weight_value,
+                                                 self.renorm_momentum)
+        # TODO(yuefengz): the updates to var and weight cannot be batched
+        # together if we fetch their updated values here. Consider calculating
+        # new values and delaying the updates.
         return new_var / new_weight
+
       def _fake_update():
         return array_ops.identity(var)
       return utils.smart_cond(training, _do_update, _fake_update)
@@ -493,7 +487,7 @@ class BatchNormalization(base.Layer):
     return (r, d, new_mean, new_variance)
 
   def call(self, inputs, training=False):
-    in_eager_mode = context.in_eager_mode()
+    in_eager_mode = context.executing_eagerly()
     if self.virtual_batch_size is not None:
       # Virtual batches (aka ghost batches) can be simulated by reshaping the
       # Tensor and reusing the existing batch norm implementation
@@ -599,8 +593,7 @@ class BatchNormalization(base.Layer):
         if in_eager_mode and not self.trainable:
           return
 
-        return moving_averages.assign_moving_average(
-            var, value, self.momentum, zero_debias=False)
+        return self._assign_moving_average(var, value, self.momentum)
 
       mean_update = utils.smart_cond(
           training,
@@ -610,7 +603,7 @@ class BatchNormalization(base.Layer):
           training,
           lambda: _do_update(self.moving_variance, new_variance),
           lambda: self.moving_variance)
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         self.add_update(mean_update, inputs=inputs)
         self.add_update(variance_update, inputs=inputs)
 
@@ -671,9 +664,16 @@ def batch_normalization(inputs,
 
   Note: when training, the moving_mean and moving_variance need to be updated.
   By default the update ops are placed in `tf.GraphKeys.UPDATE_OPS`, so they
-  need to be added as a dependency to the `train_op`. For example:
+  need to be added as a dependency to the `train_op`. Also, be sure to add
+  any batch_normalization ops before getting the update_ops collection.
+  Otherwise, update_ops will be empty, and training/inference will not work
+  properly. For example:
 
   ```python
+    x_norm = tf.layers.batch_normalization(x, training=training)
+
+    # ...
+
     update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
     with tf.control_dependencies(update_ops):
       train_op = optimizer.minimize(loss)
diff --git a/tensorflow/python/layers/utils.py b/tensorflow/python/layers/utils.py
index 484c6fc466558dc274740955594cc279a175d638..3b156c36a2ff35fb9e05af1406d7b3f6cf883394 100644
--- a/tensorflow/python/layers/utils.py
+++ b/tensorflow/python/layers/utils.py
@@ -24,6 +24,7 @@ from tensorflow.python.eager import context
 from tensorflow.python.ops import variables
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import smart_cond as smart_module
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.util import nest
 
@@ -201,7 +202,7 @@ def smart_cond(pred, true_fn=None, false_fn=None, name=None):
   if isinstance(pred, variables.Variable):
     return control_flow_ops.cond(
         pred, true_fn=true_fn, false_fn=false_fn, name=name)
-  return control_flow_ops.smart_cond(
+  return smart_module.smart_cond(
       pred, true_fn=true_fn, false_fn=false_fn, name=name)
 
 
@@ -228,7 +229,7 @@ def constant_value(pred):
 
   if isinstance(pred, variables.Variable):
     return None
-  return control_flow_ops.smart_constant_value(pred)
+  return smart_module.smart_constant_value(pred)
 
 
 def object_list_uid(object_list):
diff --git a/tensorflow/python/lib/core/py_func.cc b/tensorflow/python/lib/core/py_func.cc
index e0422ef80add42307268be2743e668eb8c8acb68..02eafd42b35231195a6405c8c3cc11871ed55772 100644
--- a/tensorflow/python/lib/core/py_func.cc
+++ b/tensorflow/python/lib/core/py_func.cc
@@ -79,10 +79,11 @@ Status MakeArgTuple(const PyCall* call, PyObject** tuple) {
     const Tensor& t = call->ins[i];
     if (call->eager) {
       if (call->gpu) {
-        arg = EagerTensorFromHandle(new TFE_TensorHandle(t, call->device));
+        arg = EagerTensorFromHandle(
+            new TFE_TensorHandle(t, call->device, call->device));
       } else {
         // TFE_TensorHandle assumes that CPU is identified by `nullptr`.
-        arg = EagerTensorFromHandle(new TFE_TensorHandle(t, nullptr));
+        arg = EagerTensorFromHandle(new TFE_TensorHandle(t, nullptr, nullptr));
       }
       if (arg == nullptr) {
         return errors::Internal("Unable to procure EagerTensor from Tensor.");
@@ -163,9 +164,9 @@ bool IsSingleNone(PyObject* obj) {
 }
 
 // Retrieves a Tensor from `eager_tensor` and stores it in `output_tensor`.
-void ExtractTensorFromEagerTensor(const PyObject* eager_tensor,
-                                  Tensor* output_tensor) {
-  *output_tensor = EagerTensor_Handle(eager_tensor)->t;
+tensorflow::Status ExtractTensorFromEagerTensor(const PyObject* eager_tensor,
+                                                const Tensor** output_tensor) {
+  return EagerTensor_Handle(eager_tensor)->Tensor(output_tensor);
 }
 
 // Calls the registered py function through the trampoline.
@@ -219,7 +220,9 @@ Status DoCallPyFunc(PyCall* call, bool* out_log_on_error) {
       if (call->eager) {
         const PyObject* item = PyList_GetItem(result, i);
         if (EagerTensor_CheckExact(item)) {
-          ExtractTensorFromEagerTensor(item, &t);
+          const Tensor* tensor = nullptr;
+          s = ExtractTensorFromEagerTensor(item, &tensor);
+          if (s.ok()) t = *tensor;
         } else {
           s = errors::FailedPrecondition(
               "Expected EagerTensor, found PyObject of type: ",
@@ -237,10 +240,10 @@ Status DoCallPyFunc(PyCall* call, bool* out_log_on_error) {
   } else if (EagerTensor_CheckExact(result) || result == Py_None) {
     // result is an `EagerTensor` or `None`.
     DCHECK(call->eager);
-    Tensor t;
     if (result != Py_None) {
-      ExtractTensorFromEagerTensor(result, &t);
-      call->out.push_back(t);
+      const Tensor* t = nullptr;
+      s = ExtractTensorFromEagerTensor(result, &t);
+      if (s.ok()) call->out.push_back(*t);
     }
   } else if (PyArray_Check(result)) {
     // `result` is a NumPy array.
diff --git a/tensorflow/python/lib/core/py_seq_tensor.cc b/tensorflow/python/lib/core/py_seq_tensor.cc
index 317bdc2e14747583f372808f48a5928273f5570a..8247d354db62532c10c5acc9875cc08289cd31bf 100644
--- a/tensorflow/python/lib/core/py_seq_tensor.cc
+++ b/tensorflow/python/lib/core/py_seq_tensor.cc
@@ -84,6 +84,7 @@ bool IsPyDimension(PyObject* obj) {
 }
 
 Status InferShapeAndType(PyObject* obj, TensorShape* shape, DataType* dtype) {
+  std::vector<Safe_PyObjectPtr> refs_to_clean;
   while (true) {
     // We test strings first, in case a string is considered a sequence.
     if (IsPyString(obj)) {
@@ -93,6 +94,7 @@ Status InferShapeAndType(PyObject* obj, TensorShape* shape, DataType* dtype) {
       if (length > 0) {
         shape->AddDim(length);
         obj = PySequence_GetItem(obj, 0);
+        refs_to_clean.push_back(make_safe(obj));
         continue;
       } else if (length == 0) {
         shape->AddDim(length);
@@ -167,14 +169,15 @@ const char ErrorFoundFloat[] =
     if (shape.dims() > 1) {                                               \
       /* Iterate over outer dim, and recursively convert each element. */ \
       const int64 s = shape.dim_size(0);                                  \
-      if (TF_PREDICT_FALSE(s != PySequence_Length(obj))) {                \
+      Safe_PyObjectPtr seq = make_safe(PySequence_Fast(obj, ""));         \
+      if (TF_PREDICT_FALSE(s != PySequence_Fast_GET_SIZE(seq.get()))) {   \
         return ErrorRectangular;                                          \
       }                                                                   \
       TensorShape rest = shape;                                           \
       rest.RemoveDim(0);                                                  \
       for (int64 i = 0; i < s; ++i) {                                     \
-        const char* error =                                               \
-            FUNCTION##Helper(PySequence_GetItem(obj, i), rest, buf);      \
+        const char* error = FUNCTION##Helper(                             \
+            PySequence_Fast_GET_ITEM(seq.get(), i), rest, buf);           \
         if (TF_PREDICT_FALSE(error != nullptr)) return error;             \
       }                                                                   \
     } else {                                                              \
diff --git a/tensorflow/python/lib/core/py_util.cc b/tensorflow/python/lib/core/py_util.cc
index 2635694e23c07dd8e75d4bb0cfb9e83a2042d921..00cbf0c532cf80d3bb27afe168ecde963ba3591d 100644
--- a/tensorflow/python/lib/core/py_util.cc
+++ b/tensorflow/python/lib/core/py_util.cc
@@ -41,6 +41,55 @@ const char* ClassName(PyObject* py) {
 
 }  // end namespace
 
+// Appends the formatted Python traceback to `out`, if it can be obtained.
+void TryAppendTraceback(PyObject* ptype, PyObject* pvalue, PyObject* ptraceback,
+                        string* out) {
+  // The "traceback" module is assumed to be imported already by script_ops.py.
+  PyObject* tb_module = PyImport_AddModule("traceback");
+
+  if (!tb_module) {
+    return;
+  }
+
+  PyObject* format_exception =
+      PyObject_GetAttrString(tb_module, "format_exception");
+
+  if (!format_exception) {
+    return;
+  }
+
+  if (!PyCallable_Check(format_exception)) {
+    Py_DECREF(format_exception);
+    return;
+  }
+
+  PyObject* ret_val = PyObject_CallFunctionObjArgs(format_exception, ptype,
+                                                   pvalue, ptraceback, nullptr);
+  Py_DECREF(format_exception);
+
+  if (!ret_val) {
+    return;
+  }
+
+  if (!PyList_Check(ret_val)) {
+    Py_DECREF(ret_val);
+    return;
+  }
+
+  Py_ssize_t n = PyList_GET_SIZE(ret_val);
+  for (Py_ssize_t i = 0; i < n; ++i) {
+    PyObject* v = PyList_GET_ITEM(ret_val, i);
+#if PY_MAJOR_VERSION < 3
+    strings::StrAppend(out, PyString_AS_STRING(v), "\n");
+#else
+    strings::StrAppend(out, PyUnicode_AsUTF8(v), "\n");
+#endif
+  }
+
+  // Release the list returned by format_exception.
+  Py_DECREF(ret_val);
+}
+
 string PyExceptionFetch() {
   CHECK(PyErr_Occurred())
       << "Must only call PyExceptionFetch after an exception.";
@@ -52,14 +101,20 @@ string PyExceptionFetch() {
   string err = ClassName(ptype);
   if (pvalue) {
     PyObject* str = PyObject_Str(pvalue);
+
     if (str) {
 #if PY_MAJOR_VERSION < 3
-      strings::StrAppend(&err, ": ", PyString_AS_STRING(str));
+      strings::StrAppend(&err, ": ", PyString_AS_STRING(str), "\n");
 #else
-      strings::StrAppend(&err, ": ", PyUnicode_AsUTF8(str));
+      strings::StrAppend(&err, ": ", PyUnicode_AsUTF8(str), "\n");
 #endif
       Py_DECREF(str);
+    } else {
+      strings::StrAppend(&err, "(unknown error message)\n");
     }
+
+    TryAppendTraceback(ptype, pvalue, ptraceback, &err);
+
     Py_DECREF(pvalue);
   }
   Py_DECREF(ptype);
diff --git a/tensorflow/python/lib/io/file_io_test.py b/tensorflow/python/lib/io/file_io_test.py
index a751607aaa1f47ca7c08674eca2b27ee0cafa3d2..223858edfa84eaa1c7879a9774dcc836de4f4672 100644
--- a/tensorflow/python/lib/io/file_io_test.py
+++ b/tensorflow/python/lib/io/file_io_test.py
@@ -485,6 +485,11 @@ class FileIoTest(test.TestCase):
     f.flush()
     self.assertEqual(content, f.read(len(content) + 1))
 
+  def testUTF8StringPathExists(self):
+    file_path = os.path.join(self._base_dir, "UTF8测试_file_exist")
+    file_io.write_string_to_file(file_path, "testing")
+    v = file_io.file_exists(file_path)
+    self.assertEqual(v, True)
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/lib/io/tf_record.py b/tensorflow/python/lib/io/tf_record.py
index 48ea107a146c2714f7b59f53abbcd8b60dbf2fd4..6fcf9c91d831e3a89552b522040e8e8647114a2f 100644
--- a/tensorflow/python/lib/io/tf_record.py
+++ b/tensorflow/python/lib/io/tf_record.py
@@ -75,14 +75,16 @@ def tf_record_iterator(path, options=None):
 
   if reader is None:
     raise IOError("Could not open %s." % path)
-  while True:
-    try:
-      with errors.raise_exception_on_not_ok_status() as status:
-        reader.GetNext(status)
-    except errors.OutOfRangeError:
-      break
-    yield reader.record()
-  reader.Close()
+  try:
+    while True:
+      try:
+        with errors.raise_exception_on_not_ok_status() as status:
+          reader.GetNext(status)
+      except errors.OutOfRangeError:
+        break
+      yield reader.record()
+  finally:
+    reader.Close()
 
 
 @tf_export("python_io.TFRecordWriter")
diff --git a/tensorflow/python/ops/accumulate_n_benchmark.py b/tensorflow/python/ops/accumulate_n_benchmark.py
index c58d36f39705ecf0f24214ce4ba4574e70a93e77..a709066cae4da2811b3e98d2e93bf44ec12dcee6 100644
--- a/tensorflow/python/ops/accumulate_n_benchmark.py
+++ b/tensorflow/python/ops/accumulate_n_benchmark.py
@@ -39,7 +39,7 @@ from tensorflow.python.platform import test
 class AccumulateNBenchmark(test.Benchmark):
 
   def _AccumulateNTemplate(self, inputs, init, shape, validate_shape):
-    var = gen_state_ops._temporary_variable(
+    var = gen_state_ops.temporary_variable(
         shape=shape, dtype=inputs[0].dtype.base_dtype)
     ref = state_ops.assign(var, init, validate_shape=validate_shape)
     update_ops = [
@@ -47,8 +47,7 @@ class AccumulateNBenchmark(test.Benchmark):
             ref, tensor, use_locking=True).op for tensor in inputs
     ]
     with ops.control_dependencies(update_ops):
-      return gen_state_ops._destroy_temporary_variable(
-          ref, var_name=var.op.name)
+      return gen_state_ops.destroy_temporary_variable(ref, var_name=var.op.name)
 
   def _AccumulateNInitializedWithFirst(self, inputs):
     return self._AccumulateNTemplate(
@@ -60,7 +59,7 @@ class AccumulateNBenchmark(test.Benchmark):
   def _AccumulateNInitializedWithMerge(self, inputs):
     return self._AccumulateNTemplate(
         inputs,
-        init=array_ops.zeros_like(gen_control_flow_ops._merge(inputs)[0]),
+        init=array_ops.zeros_like(gen_control_flow_ops.merge(inputs)[0]),
         shape=tensor_shape.vector(0),
         validate_shape=False)
 
diff --git a/tensorflow/python/ops/array_grad.py b/tensorflow/python/ops/array_grad.py
index 9745d38dc23dba806a2d0dd2ef588a5a950aa05c..3c6a5c9e562ff9765c2ef47555871c94cd6feb1e 100644
--- a/tensorflow/python/ops/array_grad.py
+++ b/tensorflow/python/ops/array_grad.py
@@ -80,7 +80,7 @@ def _ConcatGradHelper(op, grad, start_value_index, end_value_index, dim_index):
 
   def _ExtractInputShapes(inputs):
     """Extract the shapes of a set of input tensors."""
-    if not context.in_graph_mode():
+    if context.executing_eagerly():
       return array_ops.shape_n(inputs)
     sizes = []
     fully_known = True
@@ -106,7 +106,7 @@ def _ConcatGradHelper(op, grad, start_value_index, end_value_index, dim_index):
 
   out_grads = []
   if isinstance(grad, ops.Tensor):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       # Using mod here for convenience since concat_dim is already verified
       # in concat implementation to be within the allowed [-rank, rank) range.
       non_neg_concat_dim = (
@@ -139,7 +139,6 @@ def _ConcatGradHelper(op, grad, start_value_index, end_value_index, dim_index):
       # on CPUs and a Maxwell TitanX.  A speedup was seen in a large majority of
       # cases when switching implementations at N=16, but it is possible that
       # there will be a small number of performance regressions.
-      # pylint: disable=protected-access
       if len(sizes) > 16:
         # extract the size of each input along the concat dimension
         sizes = array_ops.squeeze(
@@ -148,10 +147,9 @@ def _ConcatGradHelper(op, grad, start_value_index, end_value_index, dim_index):
                 [1, -1]))
         out_grads = array_ops.split(grad, sizes, non_neg_concat_dim)
       else:
-        offset = gen_array_ops._concat_offset(non_neg_concat_dim, sizes)
+        offset = gen_array_ops.concat_offset(non_neg_concat_dim, sizes)
         for (begin, size) in zip(offset, sizes):
           out_grads.append(array_ops.slice(grad, begin, size))
-      # pylint: enable=protected-access
   elif isinstance(grad, ops.IndexedSlices):
     # Using mod here for convenience since concat_dim is already verified
     # in concat implementation to be within the allowed [-rank, rank) range.
@@ -430,7 +428,7 @@ def _GatherV2Grad(op, grad):
 
   # For axis 0 gathers, build an appropriately shaped IndexedSlices.
   if axis_static == 0:
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       params_tail_shape = params_shape.cpu()[1:]
     else:
       params_tail_shape = params_shape[1:]
@@ -580,7 +578,7 @@ def _TileGrad(op, grad):
   axes = math_ops.range(0, array_ops.size(split_shape), 2)
   input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
   # Fix shape inference
-  if context.in_graph_mode():
+  if not context.executing_eagerly():
     input_grad.set_shape(op.inputs[0].get_shape())
   return [input_grad, None]
 
@@ -627,9 +625,7 @@ def _ReverseSequenceGrad(op, grad):
 @ops.RegisterGradient("Reverse")
 def _ReverseGrad(op, grad):
   reverse_dims = op.inputs[1]
-  # pylint: disable=protected-access
-  return gen_array_ops._reverse(grad, reverse_dims), None
-  # pylint: enable=protected-access
+  return gen_array_ops.reverse(grad, reverse_dims), None
 
 
 @ops.RegisterGradient("ReverseV2")
@@ -700,17 +696,13 @@ ops.NotDifferentiable("OneHot")
 @ops.RegisterGradient("MirrorPad")
 def _MirrorPadGrad(op, grad):
   mode = op.get_attr("mode")
-  # pylint: disable=protected-access
-  return [gen_array_ops._mirror_pad_grad(grad, op.inputs[1], mode=mode), None]
-  # pylint: enable=protected-access
+  return [gen_array_ops.mirror_pad_grad(grad, op.inputs[1], mode=mode), None]
 
 
 @ops.RegisterGradient("MirrorPadGrad")
 def _MirrorPadGradGrad(op, grad):
   mode = op.get_attr("mode")
-  # pylint: disable=protected-access
-  return [gen_array_ops._mirror_pad(grad, op.inputs[1], mode=mode), None]
-  # pylint: enable=protected-access
+  return [gen_array_ops.mirror_pad(grad, op.inputs[1], mode=mode), None]
 
 
 @ops.RegisterGradient("QuantizeAndDequantize")
diff --git a/tensorflow/python/ops/array_ops.py b/tensorflow/python/ops/array_ops.py
index 96f5f81c1f04b7d64ddbd6fd461348c6986d9ff6..ec7c14f7d8697e61d2acb25a82c0ac9b2fcf28f4 100644
--- a/tensorflow/python/ops/array_ops.py
+++ b/tensorflow/python/ops/array_ops.py
@@ -128,9 +128,7 @@ def identity(input, name=None):  # pylint: disable=redefined-builtin
   Returns:
     A `Tensor`. Has the same type as `input`.
   """
-  if context.in_graph_mode():
-    return gen_array_ops.identity(input, name=name)
-  else:
+  if context.executing_eagerly():
     input = ops.convert_to_tensor(input)
     in_device = input.device
     # TODO(ashankar): Does 'identity' need to invoke execution callbacks?
@@ -140,6 +138,8 @@ def identity(input, name=None):  # pylint: disable=redefined-builtin
     if context_device != in_device:
       return input._copy()  # pylint: disable=protected-access
     return input
+  else:
+    return gen_array_ops.identity(input, name=name)
 
 
 # pylint: disable=redefined-builtin,protected-access
@@ -198,7 +198,7 @@ def expand_dims(input, axis=None, name=None, dim=None):
     if axis is not None:
       raise ValueError("can't specify both 'dim' and 'axis'")
     axis = dim
-  return gen_array_ops._expand_dims(input, axis, name)
+  return gen_array_ops.expand_dims(input, axis, name)
 
 
 # pylint: enable=redefined-builtin,protected-access
@@ -211,28 +211,25 @@ def expand_dims(input, axis=None, name=None, dim=None):
     "This op will be removed after the deprecation date. "
     "Please switch to tf.setdiff1d().")
 def listdiff(x, y, out_idx=None, name=None):
-  return gen_array_ops._list_diff(x, y, out_idx, name)
+  return gen_array_ops.list_diff(x, y, out_idx, name)
 
 
-listdiff.__doc__ = gen_array_ops._list_diff.__doc__ + "\n" + listdiff.__doc__
+listdiff.__doc__ = gen_array_ops.list_diff.__doc__ + "\n" + listdiff.__doc__
 
 # pylint: enable=protected-access
 
 
-# pylint: disable=undefined-variable,protected-access
+# pylint: disable=undefined-variable
 @tf_export("setdiff1d")
 def setdiff1d(x, y, index_dtype=dtypes.int32, name=None):
-  return gen_array_ops._list_diff(x, y, index_dtype, name)
+  return gen_array_ops.list_diff(x, y, index_dtype, name)
 
 
-setdiff1d.__doc__ = gen_array_ops._list_diff.__doc__
-
-# pylint: enable=protected-access
+setdiff1d.__doc__ = gen_array_ops.list_diff.__doc__
 
 
 @tf_export("broadcast_dynamic_shape")
 def broadcast_dynamic_shape(shape_x, shape_y):
-  # pylint: disable=protected-access
   """Returns the broadcasted dynamic shape between `shape_x` and `shape_y`.
 
   Args:
@@ -242,8 +239,7 @@ def broadcast_dynamic_shape(shape_x, shape_y):
   Returns:
     A rank 1 integer `Tensor` representing the broadcasted shape.
   """
-  return gen_array_ops._broadcast_args(shape_x, shape_y)
-  # pylint: enable=protected-access
+  return gen_array_ops.broadcast_args(shape_x, shape_y)
 
 
 @tf_export("broadcast_static_shape")
@@ -309,7 +305,7 @@ def shape_internal(input, name=None, optimize=True, out_type=dtypes.int32):
                           sparse_tensor.SparseTensorValue)):
       return gen_math_ops.cast(input.dense_shape, out_type)
     else:
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         input_tensor = ops.convert_to_tensor(input)
         input_shape = input_tensor.get_shape()
         if optimize and input_shape.is_fully_defined():
@@ -334,7 +330,7 @@ def shape_n(input, out_type=dtypes.int32, name=None):
   """
 
   output = gen_array_ops.shape_n(input, out_type=out_type, name=name)
-  if context.in_graph_mode():
+  if not context.executing_eagerly():
     for i, input_tensor in enumerate(input):
       input_tensor = ops.convert_to_tensor(input_tensor)
       input_shape = input_tensor.get_shape()
@@ -389,17 +385,13 @@ def size_internal(input, name=None, optimize=True, out_type=dtypes.int32):
   Returns:
     A `Tensor` of type `out_type`. Defaults to `tf.int32`.
   """
-  if context.in_eager_mode() and not isinstance(
-      input, (sparse_tensor.SparseTensor,
-              sparse_tensor.SparseTensorValue)):
-    size_ = 1
-    for dim in ops.convert_to_tensor(input)._shape_tuple():  # pylint: disable=protected-access
-      size_ *= dim
-    return size_
+  if context.executing_eagerly() and not isinstance(
+      input, (sparse_tensor.SparseTensor, sparse_tensor.SparseTensorValue)):
+    return np.prod(ops.convert_to_tensor(input)._shape_tuple())  # pylint: disable=protected-access
   with ops.name_scope(name, "Size", [input]) as name:
     if isinstance(input, (sparse_tensor.SparseTensor,
                           sparse_tensor.SparseTensorValue)):
-      return gen_math_ops._prod(
+      return gen_math_ops.prod(
           gen_math_ops.cast(input.dense_shape, out_type), 0, name=name)
     else:
       input_tensor = ops.convert_to_tensor(input)
@@ -790,7 +782,7 @@ def strided_slice(input_,
         new_axis_mask=new_axis_mask,
         shrink_axis_mask=shrink_axis_mask)
 
-  if context.in_graph_mode():
+  if not context.executing_eagerly():
     # TODO(apassos) In eager mode assignment will be done by overriding
     # __setitem__ instead.
     op.assign = assign
@@ -801,8 +793,8 @@ def _SliceHelperVar(var, slice_spec):
   """Creates a slice helper object given a variable.
 
   This allows creating a sub-tensor from part of the current contents
-  of a variable.  See ${tf.Tensor$`Tensor.__getitem__`}
-  for detailed examples of slicing.
+  of a variable. See @{tf.Tensor.__getitem__} for detailed examples
+  of slicing.
 
   This function in addition also allows assignment to a sliced range.
   This is similar to `__setitem__` functionality in Python. However,
@@ -892,7 +884,7 @@ def parallel_stack(values, name="parallel_stack"):
     output_shape = tensor_shape.TensorShape([len(values)])
     output_shape = output_shape.concatenate(value_shape)
     # expand_dims converts concat to stack.
-    return gen_array_ops._parallel_concat(
+    return gen_array_ops.parallel_concat(
         [expand_dims(value, 0) for value in values], shape=output_shape)
 
 
@@ -950,7 +942,7 @@ def stack(values, axis=0, name="stack"):
       raise ValueError("axis = %d not in [%d, %d)" % (axis, -expanded_num_dims,
                                                       expanded_num_dims))
 
-  return gen_array_ops._pack(values, axis=axis, name=name)
+  return gen_array_ops.pack(values, axis=axis, name=name)
 
 
 # pylint: disable=invalid-name
@@ -994,7 +986,7 @@ def _autopacking_helper(list_or_tuple, dtype, name):
           # convertible-to-tensor types, such as numpy arrays.
           elems_as_tensors.append(
               constant_op.constant(elem, dtype=dtype, name=str(i)))
-      return gen_array_ops._pack(elems_as_tensors, name=scope)
+      return gen_array_ops.pack(elems_as_tensors, name=scope)
     else:
       return converted_elems
 
@@ -1089,7 +1081,7 @@ def unstack(value, num=None, axis=0, name="unstack"):
       num = value_shape[axis].value
   if num is None:
     raise ValueError("Cannot infer num from shape %s" % value_shape)
-  return gen_array_ops._unpack(value, num=num, axis=axis, name=name)
+  return gen_array_ops.unpack(value, num=num, axis=axis, name=name)
 
 
 @tf_export("concat")
@@ -1186,7 +1178,7 @@ def concat(values, axis, name="concat"):
           dtype=dtypes.int32).get_shape().assert_is_compatible_with(
               tensor_shape.scalar())
       return identity(values[0], name=scope)
-  return gen_array_ops._concat_v2(values=values, axis=axis, name=name)
+  return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
 
 
 @tf_export("boolean_mask")
@@ -1254,8 +1246,7 @@ def boolean_mask(tensor, mask, name="boolean_mask", axis=None):
     axis = 0 if axis is None else axis
     shape_tensor[axis:axis + ndims_mask].assert_is_compatible_with(shape_mask)
 
-    leading_size = gen_math_ops._prod(
-        shape(tensor)[axis:axis + ndims_mask], [0])
+    leading_size = gen_math_ops.prod(shape(tensor)[axis:axis + ndims_mask], [0])
     tensor = reshape(tensor,
                      concat([
                          shape(tensor)[:axis], [leading_size],
@@ -1319,10 +1310,10 @@ def unique(x, out_idx=dtypes.int32, name=None):
   # period (3 weeks) pass.
   # TODO(yongtang): The documentation should also
   # be updated when switch  to v2.
-  return gen_array_ops._unique(x, out_idx, name)
+  return gen_array_ops.unique(x, out_idx, name)
 
 
-unique.__doc__ = gen_array_ops._unique.__doc__
+unique.__doc__ = gen_array_ops.unique.__doc__
 
 
 @tf_export("unique_with_counts")
@@ -1331,10 +1322,10 @@ def unique_with_counts(x, out_idx=dtypes.int32, name=None):
   # period (3 weeks) pass.
   # TODO(yongtang): The documentation should also
   # be updated when switch  to v2.
-  return gen_array_ops._unique_with_counts(x, out_idx, name)
+  return gen_array_ops.unique_with_counts(x, out_idx, name)
 
 
-unique_with_counts.__doc__ = gen_array_ops._unique_with_counts.__doc__
+unique_with_counts.__doc__ = gen_array_ops.unique_with_counts.__doc__
 
 
 @tf_export("split")
@@ -1388,20 +1379,18 @@ def split(value, num_or_size_splits, axis=0, num=None, name="split"):
   """
   size_splits = ops.convert_to_tensor(num_or_size_splits)
   if size_splits._rank() == 0 and size_splits.dtype.is_integer:
-    return gen_array_ops._split(
+    return gen_array_ops.split(
         axis=axis, num_split=num_or_size_splits, value=value, name=name)
 
   if num is None:
-    num = size_splits._shape_tuple()[0]
+    size_splits_shape = size_splits._shape_tuple()
+    if size_splits_shape:
+      num = size_splits_shape[0]
     if num is None:
       raise ValueError("Cannot infer num from shape %s" % num_or_size_splits)
 
-  return gen_array_ops._split_v(
-      value=value,
-      size_splits=size_splits,
-      axis=axis,
-      num_split=num,
-      name=name)
+  return gen_array_ops.split_v(
+      value=value, size_splits=size_splits, axis=axis, num_split=num, name=name)
 
 
 @tf_export("transpose")
@@ -1471,7 +1460,7 @@ def transpose(a, perm=None, name="transpose", conjugate=False):
   """
   with ops.name_scope(name, "transpose", [a]) as name:
     transpose_fn = (
-        gen_array_ops._conjugate_transpose
+        gen_array_ops.conjugate_transpose
         if (conjugate and a.dtype.is_complex) else gen_array_ops.transpose)
     if perm is None:
       rank = gen_array_ops.rank(a)
@@ -1479,7 +1468,7 @@ def transpose(a, perm=None, name="transpose", conjugate=False):
       ret = transpose_fn(a, perm, name=name)
       # NOTE(mrry): Setting the shape explicitly because
       #   reverse is not handled by the shape function.
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         input_shape = ret.op.inputs[0].get_shape().dims
         if input_shape is not None:
           ret.set_shape(input_shape[::-1])
@@ -1644,12 +1633,12 @@ def zeros_like(tensor, dtype=None, name=None, optimize=True):
   with ops.name_scope(name, "zeros_like", [tensor]) as name:
     tensor = ops.convert_to_tensor(tensor, name="tensor")
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       if dtype is not None and dtype != tensor.dtype:
         return zeros(
             shape_internal(tensor, optimize=optimize), dtype=dtype, name=name)
       with ops.device(tensor.device):
-        return gen_array_ops._zeros_like(tensor, name=name)
+        return gen_array_ops.zeros_like(tensor, name=name)
 
     # For now, variant types must be created via zeros_like; as we need to
     # pass the input variant object to the proper zeros callback.
@@ -1664,7 +1653,7 @@ def zeros_like(tensor, dtype=None, name=None, optimize=True):
       return zeros(
           shape_internal(tensor, optimize=optimize), dtype=dtype, name=name)
     else:
-      return gen_array_ops._zeros_like(tensor, name=name)
+      return gen_array_ops.zeros_like(tensor, name=name)
 
 
 @tf_export("ones_like")
@@ -1700,7 +1689,7 @@ def ones_like(tensor, dtype=None, name=None, optimize=True):
     if dtype is None:
       dtype = tensor.dtype
     ret = ones(ones_shape, dtype=dtype, name=name)
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       ret.set_shape(tensor.get_shape())
     return ret
 
@@ -1781,11 +1770,11 @@ def placeholder(dtype, shape=None, name=None):
   Raises:
     RuntimeError: if eager execution is enabled
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError("tf.placeholder() is not compatible with "
                        "eager execution.")
 
-  return gen_array_ops._placeholder(dtype=dtype, shape=shape, name=name)
+  return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
 
 
 # pylint: disable=redefined-outer-name
@@ -1844,7 +1833,7 @@ def sparse_placeholder(dtype, shape=None, name=None):
   Raises:
     RuntimeError: if eager execution is enabled
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError("tf.placeholder() is not compatible with "
                        "eager execution.")
 
@@ -1929,21 +1918,21 @@ def pad(tensor, paddings, mode="CONSTANT", name=None, constant_values=0):  # pyl
     # TODO(rjryan): Once the forward compatibility period (3 weeks) have passed
     # remove the "Pad" fallback here.
     if constant_values != 0:
-      result = gen_array_ops._pad_v2(
+      result = gen_array_ops.pad_v2(
           tensor, paddings, constant_values, name=name)
     else:
-      result = gen_array_ops._pad(tensor, paddings, name=name)
+      result = gen_array_ops.pad(tensor, paddings, name=name)
   elif mode == "REFLECT":
-    result = gen_array_ops._mirror_pad(
+    result = gen_array_ops.mirror_pad(
         tensor, paddings, mode="REFLECT", name=name)
   elif mode == "SYMMETRIC":
-    result = gen_array_ops._mirror_pad(
+    result = gen_array_ops.mirror_pad(
         tensor, paddings, mode="SYMMETRIC", name=name)
   else:
     raise ValueError("Unknown padding mode: %s" % mode)
 
   # Restore shape information where possible.
-  if context.in_graph_mode():
+  if not context.executing_eagerly():
     paddings_constant = tensor_util.constant_value(
         result.op.inputs[1], partial=True)
     input_shape = result.op.inputs[0].shape
@@ -2167,7 +2156,7 @@ def edit_distance(hypothesis, truth, normalize=True, name="edit_distance"):
                             sparse_tensor.SparseTensorValue)):
     raise TypeError("Truth must be a SparseTensor.")
 
-  return gen_array_ops._edit_distance(
+  return gen_array_ops.edit_distance(
       hypothesis.indices,
       hypothesis.values,
       hypothesis.dense_shape,
@@ -2304,7 +2293,7 @@ def space_to_batch(input, paddings, block_size, name=None):  # pylint: disable=r
   return result
 
 
-space_to_batch.__doc__ = gen_array_ops._space_to_batch.__doc__
+space_to_batch.__doc__ = gen_array_ops.space_to_batch.__doc__
 
 
 @tf_export("space_to_depth")
@@ -2334,7 +2323,7 @@ def batch_to_space(input, crops, block_size, name=None):  # pylint: disable=rede
   return result
 
 
-batch_to_space.__doc__ = gen_array_ops._batch_to_space.__doc__
+batch_to_space.__doc__ = gen_array_ops.batch_to_space.__doc__
 
 
 @tf_export("one_hot")
@@ -2478,8 +2467,8 @@ def one_hot(indices,
       raise TypeError("dtype {0} of on_value does not match "
                       "dtype {1} of off_value".format(on_dtype, off_dtype))
 
-    return gen_array_ops._one_hot(indices, depth, on_value, off_value, axis,
-                                  name)
+    return gen_array_ops.one_hot(indices, depth, on_value, off_value, axis,
+                                 name)
 
 
 def _all_dimensions(x):
@@ -2607,7 +2596,7 @@ def squeeze(input, axis=None, name=None, squeeze_dims=None):
     axis = squeeze_dims
   if np.isscalar(axis):
     axis = [axis]
-  return gen_array_ops._squeeze(input, axis, name)
+  return gen_array_ops.squeeze(input, axis, name)
 
 
 @tf_export("where")
@@ -2658,7 +2647,7 @@ def where(condition, x=None, y=None, name=None):
           condition, preferred_dtype=dtypes.bool, name="condition")
       return gen_array_ops.where(condition=condition, name=name)
   elif x is not None and y is not None:
-    return gen_math_ops._select(condition=condition, x=x, y=y, name=name)
+    return gen_math_ops.select(condition=condition, x=x, y=y, name=name)
   else:
     raise ValueError("x and y must both be non-None or both be None.")
 
diff --git a/tensorflow/python/ops/batch_norm_benchmark.py b/tensorflow/python/ops/batch_norm_benchmark.py
index c2ee2b383231333239c6e2d4e874a0ad1cdf493e..5d68b47aeaef3a90973387ecd5b265eef1e96a5f 100644
--- a/tensorflow/python/ops/batch_norm_benchmark.py
+++ b/tensorflow/python/ops/batch_norm_benchmark.py
@@ -41,9 +41,8 @@ def batch_norm_op(tensor, mean, variance, beta, gamma, scale):
   # _batch_norm_with_global_normalization is deprecated in v9
   ops.get_default_graph().graph_def_versions.producer = 8
   # pylint: disable=protected-access
-  return gen_nn_ops._batch_norm_with_global_normalization(tensor, mean,
-                                                          variance, beta, gamma,
-                                                          0.001, scale)
+  return gen_nn_ops._batch_norm_with_global_normalization(
+      tensor, mean, variance, beta, gamma, 0.001, scale)
   # pylint: enable=protected-access
 
 
diff --git a/tensorflow/python/ops/candidate_sampling_ops.py b/tensorflow/python/ops/candidate_sampling_ops.py
index 220ef1754d2e1a2d54a8962148b47806df48e98f..9ea1ea9c92c9b016a3f9126c89ee4dc1e73c9f27 100644
--- a/tensorflow/python/ops/candidate_sampling_ops.py
+++ b/tensorflow/python/ops/candidate_sampling_ops.py
@@ -77,7 +77,7 @@ def uniform_candidate_sampler(true_classes, num_true, num_sampled, unique,
       of each of `sampled_candidates`.
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._uniform_candidate_sampler(
+  return gen_candidate_sampling_ops.uniform_candidate_sampler(
       true_classes, num_true, num_sampled, unique, range_max, seed=seed1,
       seed2=seed2, name=name)
 
@@ -136,7 +136,7 @@ def log_uniform_candidate_sampler(true_classes, num_true, num_sampled, unique,
       of each of `sampled_candidates`.
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._log_uniform_candidate_sampler(
+  return gen_candidate_sampling_ops.log_uniform_candidate_sampler(
       true_classes, num_true, num_sampled, unique, range_max, seed=seed1,
       seed2=seed2, name=name)
 
@@ -193,7 +193,7 @@ def learned_unigram_candidate_sampler(true_classes, num_true, num_sampled,
 
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._learned_unigram_candidate_sampler(
+  return gen_candidate_sampling_ops.learned_unigram_candidate_sampler(
       true_classes, num_true, num_sampled, unique, range_max, seed=seed1,
       seed2=seed2, name=name)
 
@@ -283,7 +283,7 @@ def fixed_unigram_candidate_sampler(true_classes,
 
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._fixed_unigram_candidate_sampler(
+  return gen_candidate_sampling_ops.fixed_unigram_candidate_sampler(
       true_classes, num_true, num_sampled, unique, range_max,
       vocab_file=vocab_file, distortion=distortion,
       num_reserved_ids=num_reserved_ids, num_shards=num_shards, shard=shard,
@@ -321,7 +321,7 @@ def all_candidate_sampler(true_classes, num_true, num_sampled, unique,
       of each of `sampled_candidates`. All returned values are 1.0.
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._all_candidate_sampler(
+  return gen_candidate_sampling_ops.all_candidate_sampler(
       true_classes, num_true, num_sampled, unique, seed=seed1, seed2=seed2,
       name=name)
 
@@ -370,6 +370,6 @@ def compute_accidental_hits(true_classes, sampled_candidates, num_true,
 
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._compute_accidental_hits(
+  return gen_candidate_sampling_ops.compute_accidental_hits(
       true_classes, sampled_candidates, num_true, seed=seed1, seed2=seed2,
       name=name)
diff --git a/tensorflow/python/ops/check_ops.py b/tensorflow/python/ops/check_ops.py
index 64567ac54ae43acf6f8b674c46525db7a6c4fab7..9cea3e91f7760034d2ab7649709e62dbf1987701 100644
--- a/tensorflow/python/ops/check_ops.py
+++ b/tensorflow/python/ops/check_ops.py
@@ -169,7 +169,7 @@ def assert_negative(x, data=None, summarize=None, message=None, name=None):
   with ops.name_scope(name, 'assert_negative', [x, data]):
     x = ops.convert_to_tensor(x, name='x')
     if data is None:
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         name = _shape_and_dtype_str(x)
       else:
         name = x.name
@@ -210,7 +210,7 @@ def assert_positive(x, data=None, summarize=None, message=None, name=None):
   with ops.name_scope(name, 'assert_positive', [x, data]):
     x = ops.convert_to_tensor(x, name='x')
     if data is None:
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         name = _shape_and_dtype_str(x)
       else:
         name = x.name
@@ -251,7 +251,7 @@ def assert_non_negative(x, data=None, summarize=None, message=None, name=None):
   with ops.name_scope(name, 'assert_non_negative', [x, data]):
     x = ops.convert_to_tensor(x, name='x')
     if data is None:
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         name = _shape_and_dtype_str(x)
       else:
         name = x.name
@@ -293,7 +293,7 @@ def assert_non_positive(x, data=None, summarize=None, message=None, name=None):
   with ops.name_scope(name, 'assert_non_positive', [x, data]):
     x = ops.convert_to_tensor(x, name='x')
     if data is None:
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         name = _shape_and_dtype_str(x)
       else:
         name = x.name
@@ -343,7 +343,7 @@ def assert_equal(x, y, data=None, summarize=None, message=None, name=None):
     x = ops.convert_to_tensor(x, name='x')
     y = ops.convert_to_tensor(y, name='y')
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       eq = math_ops.equal(x, y)
       condition = math_ops.reduce_all(eq)
       if not condition:
@@ -363,27 +363,30 @@ def assert_equal(x, y, data=None, summarize=None, message=None, name=None):
                          (x_sum, x_np[:x_sum],
                           y_sum, y_np[:y_sum]))
 
-        # Get the values that actually differed and their indices.
-        mask = math_ops.logical_not(eq)
-        indices = array_ops.where(mask)
-        indices_np = indices.numpy()
-        x_vals = array_ops.boolean_mask(x, mask)
-        y_vals = array_ops.boolean_mask(y, mask)
-        summarize = min(summarize, indices_np.shape[0])
+        index_and_values_str = ''
+        if x.shape == y.shape:
+          # If the shapes of x and y are the same,
+          # get the values that actually differed and their indices.
+          # If the shapes are different, this information is more confusing
+          # than useful.
+          mask = math_ops.logical_not(eq)
+          indices = array_ops.where(mask)
+          indices_np = indices.numpy()
+          x_vals = array_ops.boolean_mask(x, mask)
+          y_vals = array_ops.boolean_mask(y, mask)
+          summarize = min(summarize, indices_np.shape[0])
+          index_and_values_str = (
+              'Indices of first %s different values:\n%s\n'
+              'Corresponding x values:\n%s\n'
+              'Corresponding y values:\n%s\n' %
+              (summarize, indices_np[:summarize],
+               x_vals.numpy().reshape((-1,))[:summarize],
+               y_vals.numpy().reshape((-1,))[:summarize]))
 
         raise errors.InvalidArgumentError(
             node_def=None, op=None,
-            message=('%s\nCondition x == y did not hold.\n'
-                     'Indices of first %s different values:\n%s\n'
-                     'Corresponding x values:\n%s\n'
-                     'Corresponding y values:\n%s\n'
-                     '%s'
-                     %
-                     (message or '',
-                      summarize, indices_np[:summarize],
-                      x_vals.numpy().reshape((-1,))[:summarize],
-                      y_vals.numpy().reshape((-1,))[:summarize],
-                      summary_msg)))
+            message=('%s\nCondition x == y did not hold.\n%s%s' %
+                     (message or '', index_and_values_str, summary_msg)))
       return
 
     if data is None:
@@ -435,7 +438,7 @@ def assert_none_equal(
   with ops.name_scope(name, 'assert_none_equal', [x, y, data]):
     x = ops.convert_to_tensor(x, name='x')
     y = ops.convert_to_tensor(y, name='y')
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       x_name = _shape_and_dtype_str(x)
       y_name = _shape_and_dtype_str(y)
     else:
@@ -512,7 +515,7 @@ def assert_near(
     rtol = ops.convert_to_tensor(rtol, name='rtol', dtype=x.dtype)
     atol = ops.convert_to_tensor(atol, name='atol', dtype=x.dtype)
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       x_name = _shape_and_dtype_str(x)
       y_name = _shape_and_dtype_str(y)
     else:
@@ -562,7 +565,7 @@ def assert_less(x, y, data=None, summarize=None, message=None, name=None):
   with ops.name_scope(name, 'assert_less', [x, y, data]):
     x = ops.convert_to_tensor(x, name='x')
     y = ops.convert_to_tensor(y, name='y')
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       x_name = _shape_and_dtype_str(x)
       y_name = _shape_and_dtype_str(y)
     else:
@@ -610,7 +613,7 @@ def assert_less_equal(x, y, data=None, summarize=None, message=None, name=None):
   with ops.name_scope(name, 'assert_less_equal', [x, y, data]):
     x = ops.convert_to_tensor(x, name='x')
     y = ops.convert_to_tensor(y, name='y')
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       x_name = _shape_and_dtype_str(x)
       y_name = _shape_and_dtype_str(y)
     else:
@@ -658,7 +661,7 @@ def assert_greater(x, y, data=None, summarize=None, message=None, name=None):
   with ops.name_scope(name, 'assert_greater', [x, y, data]):
     x = ops.convert_to_tensor(x, name='x')
     y = ops.convert_to_tensor(y, name='y')
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       x_name = _shape_and_dtype_str(x)
       y_name = _shape_and_dtype_str(y)
     else:
@@ -708,7 +711,7 @@ def assert_greater_equal(x, y, data=None, summarize=None, message=None,
   with ops.name_scope(name, 'assert_greater_equal', [x, y, data]):
     x = ops.convert_to_tensor(x, name='x')
     y = ops.convert_to_tensor(y, name='y')
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       x_name = _shape_and_dtype_str(x)
       y_name = _shape_and_dtype_str(y)
     else:
@@ -808,7 +811,7 @@ def assert_rank(x, rank, data=None, summarize=None, message=None, name=None):
     static_condition = lambda actual_rank, given_rank: actual_rank == given_rank
     dynamic_condition = math_ops.equal
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       name = ''
     else:
       name = x.name
@@ -873,7 +876,7 @@ def assert_rank_at_least(
     static_condition = lambda actual_rank, given_rank: actual_rank >= given_rank
     dynamic_condition = math_ops.greater_equal
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       name = ''
     else:
       name = x.name
@@ -1001,7 +1004,7 @@ def assert_rank_in(
     ranks = tuple([ops.convert_to_tensor(rank, name='rank') for rank in ranks])
     message = message or ''
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       name = ''
     else:
       name = x.name
@@ -1054,7 +1057,7 @@ def assert_integer(x, message=None, name=None):
   with ops.name_scope(name, 'assert_integer', [x]):
     x = ops.convert_to_tensor(x, name='x')
     if not x.dtype.is_integer:
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         name = 'tensor'
       else:
         name = x.name
@@ -1087,12 +1090,11 @@ def assert_type(tensor, tf_type, message=None, name=None):
   with ops.name_scope(name, 'assert_type', [tensor]):
     tensor = ops.convert_to_tensor(tensor, name='tensor')
     if tensor.dtype != tf_type:
-      if context.in_graph_mode():
-        raise TypeError(
-            '%s  %s must be of type %s' % (message, tensor.name, tf_type))
+      if context.executing_eagerly():
+        raise TypeError('%s tensor must be of type %s' % (message, tf_type))
       else:
-        raise TypeError(
-            '%s tensor must be of type %s' % (message, tf_type))
+        raise TypeError('%s  %s must be of type %s' % (message, tensor.name,
+                                                       tf_type))
 
     return control_flow_ops.no_op('statically_determined_correct_type')
 
@@ -1240,7 +1242,7 @@ def assert_scalar(tensor, name=None):
     tensor = ops.convert_to_tensor(tensor, name=name_scope)
     shape = tensor.get_shape()
     if shape.ndims != 0:
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         raise ValueError('Expected scalar shape, saw shape: %s.'
                          % (shape,))
       else:
diff --git a/tensorflow/python/ops/control_flow_grad.py b/tensorflow/python/ops/control_flow_grad.py
index 97b57177b29986a006df992f4c0c2b79e11467aa..21354b5ae8ff1724bbb2539aff370b3df6da2598 100644
--- a/tensorflow/python/ops/control_flow_grad.py
+++ b/tensorflow/python/ops/control_flow_grad.py
@@ -28,7 +28,6 @@ from tensorflow.python.ops import math_ops
 # go/tf-wildcard-import
 # pylint: disable=wildcard-import,undefined-variable
 from tensorflow.python.ops.control_flow_ops import *
-from tensorflow.python.ops.gen_control_flow_ops import *
 # pylint: enable=wildcard-import
 
 
diff --git a/tensorflow/python/ops/control_flow_ops.py b/tensorflow/python/ops/control_flow_ops.py
index 8218e60b53450a500df71719e533fb0c1cbeb5b5..1278768d8bdc9f039f19cf032f8ee09442ea34a9 100644
--- a/tensorflow/python/ops/control_flow_ops.py
+++ b/tensorflow/python/ops/control_flow_ops.py
@@ -23,7 +23,6 @@ See the @{$python/control_flow_ops} guide.
 @@no_op
 @@count_up_to
 @@cond
-@@smart_cond
 @@case
 @@while_loop
 @@logical_and
@@ -153,7 +152,7 @@ def Assert(condition, data, summarize=None, name=None):
     @compatibility{eager} `tf.errors.InvalidArgumentError` if `condition`
     is not true
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     if not condition:
       xs = ops.convert_n_to_tensor(data)
       data_str = [_summarize_eager(x, summarize) for x in xs]
@@ -179,6 +178,8 @@ def Assert(condition, data, summarize=None, name=None):
             condition, data, summarize, name="Assert")
 
       guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
+      if context.executing_eagerly():
+        return
       return guarded_assert.op
 
 
@@ -195,7 +196,7 @@ def _Identity(data, name=None):
   data = ops.internal_convert_to_tensor_or_indexed_slices(data, as_ref=True)
   if isinstance(data, ops.Tensor):
     if data.dtype._is_ref_dtype:  # pylint: disable=protected-access
-      return gen_array_ops._ref_identity(data, name=name)
+      return gen_array_ops.ref_identity(data, name=name)
     else:
       return array_ops.identity(data, name=name)
   else:
@@ -263,10 +264,10 @@ def _Enter(data,
   data = ops.internal_convert_to_tensor_or_indexed_slices(data, as_ref=True)
   if isinstance(data, ops.Tensor):
     if data.dtype._is_ref_dtype and use_ref:  # pylint: disable=protected-access
-      result = gen_control_flow_ops._ref_enter(
+      result = gen_control_flow_ops.ref_enter(
           data, frame_name, is_constant, parallel_iterations, name=name)
     else:
-      result = gen_control_flow_ops._enter(
+      result = gen_control_flow_ops.enter(
           data, frame_name, is_constant, parallel_iterations, name=name)
     if use_input_shape:
       result.set_shape(data.get_shape())
@@ -281,7 +282,7 @@ def _Enter(data,
         parallel_iterations=parallel_iterations,
         use_input_shape=use_input_shape,
         name=name)
-    indices = gen_control_flow_ops._enter(
+    indices = gen_control_flow_ops.enter(
         data.indices,
         frame_name,
         is_constant,
@@ -292,7 +293,7 @@ def _Enter(data,
     if isinstance(data, ops.IndexedSlices):
       dense_shape = data.dense_shape
       if dense_shape is not None:
-        dense_shape = gen_control_flow_ops._enter(
+        dense_shape = gen_control_flow_ops.enter(
             dense_shape,
             frame_name,
             is_constant,
@@ -302,7 +303,7 @@ def _Enter(data,
           dense_shape.set_shape(data.dense_shape.get_shape())
       return ops.IndexedSlices(values, indices, dense_shape)
     else:
-      dense_shape = gen_control_flow_ops._enter(
+      dense_shape = gen_control_flow_ops.enter(
           data.dense_shape,
           frame_name,
           is_constant,
@@ -328,7 +329,7 @@ def exit(data, name=None):  # pylint: disable=redefined-builtin
   data = ops.internal_convert_to_tensor_or_indexed_slices(data, as_ref=True)
   if isinstance(data, ops.Tensor):
     if data.dtype._is_ref_dtype:  # pylint: disable=protected-access
-      return gen_control_flow_ops._ref_exit(data, name)
+      return gen_control_flow_ops.ref_exit(data, name)
     else:
       return gen_control_flow_ops._exit(data, name)
   else:
@@ -370,17 +371,17 @@ def switch(data, pred, dtype=None, name=None):
         data, dtype=dtype, name="data", as_ref=True)
     pred = ops.convert_to_tensor(pred, name="pred")
     if isinstance(data, ops.Tensor):
-      return gen_control_flow_ops._switch(data, pred, name=name)
+      return gen_control_flow_ops.switch(data, pred, name=name)
     else:
       if not isinstance(data, (ops.IndexedSlices, sparse_tensor.SparseTensor)):
         raise TypeError("Type %s not supported" % type(data))
       val, ind = data.values, data.indices
-      val_f, val_t = gen_control_flow_ops._switch(val, pred, name=name)
-      ind_f, ind_t = gen_control_flow_ops._switch(ind, pred, name="indices")
+      val_f, val_t = gen_control_flow_ops.switch(val, pred, name=name)
+      ind_f, ind_t = gen_control_flow_ops.switch(ind, pred, name="indices")
       if isinstance(data, ops.IndexedSlices):
         dense_shape = data.dense_shape
         if dense_shape is not None:
-          dense_shape_f, dense_shape_t = gen_control_flow_ops._switch(
+          dense_shape_f, dense_shape_t = gen_control_flow_ops.switch(
               dense_shape, pred, name="dense_shape")
         else:
           dense_shape_f, dense_shape_t = None, None
@@ -388,7 +389,7 @@ def switch(data, pred, dtype=None, name=None):
                 ops.IndexedSlices(val_t, ind_t, dense_shape_t))
       else:
         dense_shape = data.dense_shape
-        dense_shape_f, dense_shape_t = gen_control_flow_ops._switch(
+        dense_shape_f, dense_shape_t = gen_control_flow_ops.switch(
             data.dense_shape, pred, name="dense_shape")
         return (sparse_tensor.SparseTensor(ind_f, val_f, dense_shape_f),
                 sparse_tensor.SparseTensor(ind_t, val_t, dense_shape_t))
@@ -472,15 +473,15 @@ def merge(inputs, name=None):
     ]
     if all([isinstance(v, ops.Tensor) for v in inputs]):
       if all([v.dtype._is_ref_dtype for v in inputs]):  # pylint: disable=protected-access
-        return gen_control_flow_ops._ref_merge(inputs, name)
+        return gen_control_flow_ops.ref_merge(inputs, name)
       else:
-        return gen_control_flow_ops._merge(inputs, name)
+        return gen_control_flow_ops.merge(inputs, name)
     elif all([isinstance(v, sparse_tensor.SparseTensor) for v in inputs]):
       # Only handle the case when all inputs are SparseTensor.
       values, _ = merge([inp.values for inp in inputs], name=name)
-      indices, chosen_index = gen_control_flow_ops._merge(
+      indices, chosen_index = gen_control_flow_ops.merge(
           [inp.indices for inp in inputs], name="indices")
-      dense_shape, _ = gen_control_flow_ops._merge(
+      dense_shape, _ = gen_control_flow_ops.merge(
           [inp.dense_shape for inp in inputs], name="dense_shape")
       return (sparse_tensor.SparseTensor(indices, values, dense_shape),
               chosen_index)
@@ -488,13 +489,13 @@ def merge(inputs, name=None):
       # For now convert all the inputs as IndexedSlices.
       inputs = math_ops._as_indexed_slices_list(inputs, optimize=False)
       values, _ = merge([inp.values for inp in inputs], name=name)
-      indices, chosen_index = gen_control_flow_ops._merge(
+      indices, chosen_index = gen_control_flow_ops.merge(
           [inp.indices for inp in inputs], name="indices")
       if any(inp.dense_shape is not None for inp in inputs):
         if any(inp.dense_shape is None for inp in inputs):
           raise ValueError("Either all merged IndexedSlices must have a "
                            "dense_shape, or none must have a dense_shape.")
-        dense_shape, _ = gen_control_flow_ops._merge(
+        dense_shape, _ = gen_control_flow_ops.merge(
             [inp.dense_shape for inp in inputs], name="dense_shape")
       else:
         dense_shape = None
@@ -1014,10 +1015,8 @@ class GradLoopState(object):
         else:
           max_size = GetMaxSizeFromNestedMaximumIterations(
               value, self.forward_context)
-        # pylint: disable=protected-access
-        acc = gen_data_flow_ops._stack_v2(
+        acc = gen_data_flow_ops.stack_v2(
             max_size=max_size, elem_type=value.dtype.base_dtype, name="f_acc")
-        # pylint: enable=protected-access
       if curr_ctxt:
         curr_ctxt.Exit()
 
@@ -1030,10 +1029,8 @@ class GradLoopState(object):
       if value_ctxt == self.forward_context:
         # value is not nested in the forward context.
         self.forward_context.Enter()
-        # pylint: disable=protected-access
-        push = gen_data_flow_ops._stack_push_v2(
+        push = gen_data_flow_ops.stack_push_v2(
             enter_acc, value, swap_memory=swap_enabled)
-        # pylint: enable=protected-access
         self.forward_context.Exit()
         # Protect stack push and order it before forward_index.
         self.forward_index.op._add_control_input(push.op)
@@ -1045,18 +1042,14 @@ class GradLoopState(object):
           # The special case for creating a zero tensor for a dead
           # branch of a switch. See ControlFlowState.ZerosLike().
           value_ctxt.outer_context.Enter()
-          # pylint: disable=protected-access
-          push = gen_data_flow_ops._stack_push_v2(
+          push = gen_data_flow_ops.stack_push_v2(
               enter_acc, value, swap_memory=swap_enabled)
-          # pylint: enable=protected-access
           value_ctxt.outer_context.Exit()
           push.op._set_control_flow_context(value_ctxt)
         else:
           value_ctxt.Enter()
-          # pylint: disable=protected-access
-          push = gen_data_flow_ops._stack_push_v2(
+          push = gen_data_flow_ops.stack_push_v2(
               enter_acc, value, swap_memory=swap_enabled)
-          # pylint: enable=protected-access
           value_ctxt.Exit()
         # Protect stack push and order it before forward_sync.
         self.forward_sync._add_control_input(push.op)
@@ -1103,10 +1096,8 @@ class GradLoopState(object):
           pred = cond_ctxt.pred
         branch = (1 - cond_ctxt.branch) if dead_branch else cond_ctxt.branch
         history_value = _SwitchRefOrTensor(history_value, pred)[branch]
-      # pylint: disable=protected-access
-      pop = gen_data_flow_ops._stack_pop_v2(history_value,
-                                            value.dtype.base_dtype)
-      # pylint: enable=protected-access
+      pop = gen_data_flow_ops.stack_pop_v2(history_value,
+                                           value.dtype.base_dtype)
       pop.set_shape(value.get_shape())
       self.grad_context.Exit()
     parallel_iterations = self.grad_context.parallel_iterations
@@ -1476,7 +1467,10 @@ def ZerosLikeOutsideLoop(op, index):
       branch = op_ctxt.branch
       switch_val = switch(op.inputs[0], pred)[1 - branch]
       zeros_shape = array_ops.shape_internal(switch_val, optimize=False)
-      return array_ops.zeros(zeros_shape, dtype=val.dtype)
+      # Ensure ops created within array_ops.zeros are dominated by switch in
+      # cond context.
+      with ops.control_dependencies([switch_val]):
+        return array_ops.zeros(zeros_shape, dtype=val.dtype)
     else:
       return array_ops.zeros_like(val, optimize=False)
 
@@ -1508,9 +1502,11 @@ class ControlFlowContext(object):
     if values_def:
       self._init_values_from_proto(values_def, import_scope=import_scope)
     else:
-      # Values that have been already seen in this context.
+      # The names of tensors that have been already seen in this context.
       self._values = set()
-      # Values referenced by but external to this context.
+      # The keys are the names of tensors referenced by but external to this
+      # context. Each value is the Tensor that should be used by this context to
+      # access the key value (e.g. a switch output guarding a cond input value).
       self._external_values = {}
 
   def _init_values_from_proto(self, values_def, import_scope=None):
@@ -1697,9 +1693,12 @@ class CondContext(ControlFlowContext):
       self._pivot = pivot  # The predicate tensor in this branch
       self._branch = branch  # 0 or 1 representing this branch
 
-      # Values considered to have been already seen in this context.
+      # Values considered to have been already seen in this context. They are
+      # not included in this context.
       self._values.add(pred.name)
+      self._external_values[pred.name] = pred
       self._values.add(pivot.name)
+      self._external_values[pivot.name] = pivot
 
   def _init_from_proto(self, context_def, import_scope=None):
     """Creates a new `CondContext` from protocol buffer.
@@ -1717,8 +1716,8 @@ class CondContext(ControlFlowContext):
     self._pivot = g.as_graph_element(
         ops.prepend_name_scope(context_def.pivot_name, import_scope))
     self._branch = context_def.branch
-    super(CondContext, self).__init__(
-        values_def=context_def.values_def, import_scope=import_scope)
+    super(CondContext, self).__init__(values_def=context_def.values_def,
+                                      import_scope=import_scope)
 
   @property
   def pred(self):
@@ -1766,13 +1765,9 @@ class CondContext(ControlFlowContext):
       context_def.branch = self._branch
       context_def.values_def.MergeFrom(super(CondContext, self)._to_values_def(
           export_scope))
-      # TODO(b/72868227): enable this once the corresponding control_flow.proto
-      # changes have been checked in (they aren't checked in and this is
-      # disabled for now to ensure forwards compatibility).
-      if False:  # pylint: disable=using-constant-test
-        for nested in self._nested_contexts:
-          nested_def = context_def.nested_contexts.add()
-          nested.to_control_flow_context_def(nested_def)
+      for nested in self._nested_contexts:
+        nested_def = context_def.nested_contexts.add()
+        nested.to_control_flow_context_def(nested_def)
 
       return context_def
     else:
@@ -1784,14 +1779,10 @@ class CondContext(ControlFlowContext):
     ret = CondContext(context_def=context_def,
                       import_scope=import_scope)
 
-    # TODO(b/72868227): remove "if hasattr(...)" once the corresponding
-    # control_flow.proto changes have been checked in (they aren't checked in
-    # and this is here for now to ensure forwards compatibility).
-    if hasattr(context_def, "nested_contexts"):
-      ret.Enter()
-      for nested_def in context_def.nested_contexts:
-        from_control_flow_context_def(nested_def)
-      ret.Exit()
+    ret.Enter()
+    for nested_def in context_def.nested_contexts:
+      from_control_flow_context_def(nested_def, import_scope=import_scope)
+    ret.Exit()
     return ret
 
   def to_control_flow_context_def(self, context_def, export_scope=None):
@@ -1810,6 +1801,7 @@ class CondContext(ControlFlowContext):
       if self._outer_context:
         result = self._outer_context.AddValue(val)
         self._values.add(result.name)
+        self._external_values[result.name] = result
       with ops.control_dependencies(None):
         result = _SwitchRefOrTensor(result, self._pred)[self._branch]
         if self._outer_context:
@@ -1874,6 +1866,7 @@ class CondContext(ControlFlowContext):
       if self._outer_context:
         real_val = self._outer_context.AddValue(val)
         self._values.add(real_val.name)
+        self._external_values[real_val.name] = real_val
       real_val = _SwitchRefOrTensor(real_val, self._pred)[self._branch]
       self._external_values[val.name] = real_val
     else:
@@ -2035,7 +2028,7 @@ def cond(pred,
     raise TypeError("false_fn must be callable.")
 
   with ops.name_scope(name, "cond", [pred]):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       if pred:
         return _UnpackIfSingleton(true_fn())
       return _UnpackIfSingleton(false_fn())
@@ -2109,10 +2102,7 @@ def cond(pred,
     # Only add non-nested conds to the collection. Any nested control flow will
     # be encapsulated in the root context.
     assert context_t.outer_context == context_f.outer_context
-    # TODO(b/72868227): remove "if True..." once the corresponding
-    # control_flow.proto changes have been checked in (they aren't checked in
-    # and this is disabled for now to ensure forwards compatibility).
-    if True or context_t.outer_context is None:
+    if context_t.outer_context is None:
       ops.add_to_collection(ops.GraphKeys.COND_CONTEXT, context_t)
       ops.add_to_collection(ops.GraphKeys.COND_CONTEXT, context_f)
 
@@ -2128,61 +2118,6 @@ def cond(pred,
 # pylint: enable=redefined-outer-name
 
 
-def smart_cond(pred, true_fn=None, false_fn=None, name=None):
-  """Return either `true_fn()` if predicate `pred` is true else `false_fn()`.
-
-  If `pred` is a bool or has a constant value, we return either `true_fn()`
-  or `false_fn()`, otherwise we use `tf.cond` to dynamically route to both.
-
-  Arguments:
-    pred: A scalar determining whether to return the result of `true_fn` or
-      `false_fn`.
-    true_fn: The callable to be performed if pred is true.
-    false_fn: The callable to be performed if pred is false.
-    name: Optional name prefix when using `tf.cond`.
-
-  Returns:
-    Tensors returned by the call to either `true_fn` or `false_fn`.
-
-  Raises:
-    TypeError: If `true_fn` or `false_fn` is not callable.
-  """
-  if not callable(true_fn):
-    raise TypeError("`true_fn` must be callable.")
-  if not callable(false_fn):
-    raise TypeError("`false_fn` must be callable.")
-
-  pred_value = smart_constant_value(pred)
-  if pred_value is not None:
-    if pred_value:
-      return true_fn()
-    else:
-      return false_fn()
-  else:
-    return cond(pred, true_fn=true_fn, false_fn=false_fn, name=name)
-
-
-def smart_constant_value(pred):
-  """Return the bool value for `pred`, or None if `pred` had a dynamic value.
-
-  Arguments:
-    pred: A scalar, either a Python bool or tensor.
-
-  Returns:
-    True or False if `pred` has a constant boolean value, None otherwise.
-
-  Raises:
-    TypeError: If `pred` is not a Tensor or bool.
-  """
-  if isinstance(pred, bool):
-    pred_value = pred
-  elif isinstance(pred, ops.Tensor):
-    pred_value = tensor_util.constant_value(pred)
-  else:
-    raise TypeError("`pred` must be a Tensor or a Python bool.")
-  return pred_value
-
-
 def _resource_safe_shape(t):
   """Returns the shape of t or the variable it points to."""
   if t.dtype == dtypes.resource:
@@ -2390,13 +2325,9 @@ class WhileContext(ControlFlowContext):
       context_def.values_def.MergeFrom(
           super(WhileContext, self)._to_values_def(
               export_scope=export_scope))
-      # TODO(b/72868227): remove "if True..." once the corresponding
-      # control_flow.proto changes have been checked in (they aren't checked in
-      # and this is disabled for now to ensure forwards compatibility).
-      if False:  # pylint: disable=using-constant-test
-        for nested in self._nested_contexts:
-          nested_def = context_def.nested_contexts.add()
-          nested.to_control_flow_context_def(nested_def)
+      for nested in self._nested_contexts:
+        nested_def = context_def.nested_contexts.add()
+        nested.to_control_flow_context_def(nested_def)
 
       return context_def
     else:
@@ -2418,14 +2349,10 @@ class WhileContext(ControlFlowContext):
     """
     ret = WhileContext(context_def=context_def,
                        import_scope=import_scope)
-    # TODO(b/72868227): remove "if hasattr(...)" once the corresponding
-    # control_flow.proto changes have been checked in (they aren't checked in
-    # and this is disabled for now to ensure forwards compatibility).
-    if hasattr(context_def, "nested_contexts"):
-      ret.Enter()
-      for nested_def in context_def.nested_contexts:
-        from_control_flow_context_def(nested_def, import_scope=import_scope)
-      ret.Exit()
+    ret.Enter()
+    for nested_def in context_def.nested_contexts:
+      from_control_flow_context_def(nested_def, import_scope=import_scope)
+    ret.Exit()
     return ret
 
   def GetWhileContext(self):
@@ -3009,8 +2936,11 @@ class WhileContext(ControlFlowContext):
     loop_vars = ops.convert_n_to_tensor_or_indexed_slices(loop_vars)
     try:
       self.Enter()
-      original_body_result, exit_vars = self._BuildLoop(
-          pred, body, original_loop_vars, loop_vars, shape_invariants)
+      # _BuildLoop calls _update_input in several places. _lock ensures a
+      # Session.run call cannot occur between creating and mutating new ops.
+      with ops.get_default_graph()._lock:  # pylint: disable=protected-access
+        original_body_result, exit_vars = self._BuildLoop(
+            pred, body, original_loop_vars, loop_vars, shape_invariants)
     finally:
       self.Exit()
 
@@ -3250,7 +3180,7 @@ def while_loop(cond,
             math_ops.logical_and(i < maximum_iterations, orig_cond(*lv)))
         body = lambda i, lv: (i + 1, orig_body(*lv))
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       while cond(*loop_vars):
         loop_vars = body(*loop_vars)
       if maximum_iterations is not None:
@@ -3270,10 +3200,7 @@ def while_loop(cond,
         swap_memory=swap_memory)
     # Only add non-nested loops to the collection. Any nested control flow will
     # be encapsulated in the root context.
-    # TODO(b/72868227): enable condition once the corresponding
-    # control_flow.proto changes have been checked in (they aren't checked in
-    # and this is disabled for now to ensure forwards compatibility).
-    if True or loop_context.outer_context is None:
+    if loop_context.outer_context is None:
       ops.add_to_collection(ops.GraphKeys.WHILE_CONTEXT, loop_context)
     result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
     if maximum_iterations is not None:
@@ -3347,7 +3274,7 @@ def with_dependencies(dependencies, output_tensor, name=None):
   Raises:
     TypeError: if `output_tensor` is not a `Tensor` or `IndexedSlices`.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return output_tensor
   with ops.name_scope(name, "control_dependency",
                       list(dependencies) + [output_tensor]) as name:
@@ -3392,7 +3319,7 @@ def group(*inputs, **kwargs):
   Raises:
     ValueError: If an unknown keyword argument is provided.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return None
   name = kwargs.pop("name", None)
   if kwargs:
@@ -3472,7 +3399,7 @@ def tuple(tensors, name=None, control_inputs=None):  # pylint: disable=redefined
       objects.
 
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return tensors
   with ops.name_scope(name, "tuple", tensors) as name:
     tensors = [t if (isinstance(t, ops.Operation)
@@ -3560,15 +3487,17 @@ def _case_create_default_action(predicates, actions):
   return default_action, other_predicates, other_actions
 
 
-def _case_verify_and_canonicalize_args(pred_fn_pairs, exclusive, name):
+def _case_verify_and_canonicalize_args(pred_fn_pairs, exclusive, name,
+                                       allow_python_preds):
   """Verifies input arguments for the case function.
 
   Args:
-    pred_fn_pairs: Dict or list of pairs of a boolean scalar tensor and a
-                   callable which returns a list of tensors.
+    pred_fn_pairs: Dict or list of pairs of a boolean scalar tensor,
+                   and a callable which returns a list of tensors.
     exclusive: True iff at most one predicate is allowed to evaluate to `True`.
     name: A name for the case operation.
-
+    allow_python_preds: if true, pred_fn_pairs may contain Python bools in
+                        addition to boolean Tensors.
   Raises:
     TypeError: If `pred_fn_pairs` is not a list/dictionary.
     TypeError: If `pred_fn_pairs` is a list but does not contain 2-tuples.
@@ -3593,14 +3522,69 @@ def _case_verify_and_canonicalize_args(pred_fn_pairs, exclusive, name):
     if not isinstance(pred_fn_pair, _basetuple) or len(pred_fn_pair) != 2:
       raise TypeError("Each entry in pred_fn_pairs must be a 2-tuple")
     pred, fn = pred_fn_pair
-    if pred.dtype != dtypes.bool:
-      raise TypeError("pred must be of type bool: %s", pred.name)
+
+    if isinstance(pred, ops.Tensor):
+      if pred.dtype != dtypes.bool:
+        raise TypeError("pred must be Tensor of type bool: %s" % pred.name)
+    elif not allow_python_preds:
+      raise TypeError("pred must be a Tensor, got: %s" % pred)
+    elif not isinstance(pred, bool):
+      raise TypeError("pred must be a Tensor or bool, got: %s" % pred)
+
     if not callable(fn):
       raise TypeError("fn for pred %s must be callable." % pred.name)
+
   predicates, actions = zip(*pred_fn_pairs)
   return predicates, actions
 
 
+def _case_helper(cond_fn, pred_fn_pairs, default,
+                 exclusive, name, allow_python_preds=False, **cond_kwargs):
+  """Implementation of case that allows for different cond functions.
+
+  Args:
+    cond_fn: method that has signature and semantics of `cond` above.
+    pred_fn_pairs: Dict or list of pairs of a boolean scalar tensor, and a
+                   callable which returns a list of tensors.
+    default: Optional callable that returns a list of tensors.
+    exclusive: True iff at most one predicate is allowed to evaluate to `True`.
+    name: A name for this operation (optional).
+    allow_python_preds: if true, pred_fn_pairs may contain Python bools in
+                        addition to boolean Tensors.
+    **cond_kwargs: keyword arguments that will be passed to `cond_fn`.
+
+  Returns:
+    The tensors returned by the first pair whose predicate evaluated to True, or
+    those returned by `default` if none does.
+
+  Raises:
+    TypeError: If `pred_fn_pairs` is not a list/dictionary.
+    TypeError: If `pred_fn_pairs` is a list but does not contain 2-tuples.
+    TypeError: If `fns[i]` is not callable for any i, or `default` is not
+               callable.
+  """
+  predicates, actions = _case_verify_and_canonicalize_args(
+      pred_fn_pairs, exclusive, name, allow_python_preds)
+  with ops.name_scope(name, "case", [predicates]):
+    if default is None:
+      default, predicates, actions = _case_create_default_action(
+          predicates, actions)
+    fn = default
+    # To eval conditions in direct order we create nested conditions in reverse:
+    #   cond_fn(c[0], true_fn=.., false_fn=cond_fn(c[1], ...))
+    for predicate, action in reversed(list(zip(predicates, actions))):
+      fn = functools.partial(
+          cond_fn, predicate, true_fn=action, false_fn=fn, **cond_kwargs)
+    if exclusive:
+      with ops.control_dependencies([
+          _assert_at_most_n_true(
+              predicates, n=1, msg="Input error: exclusive=True")
+      ]):
+        return fn()
+    else:
+      return fn()
+
+
 @tf_export("case")
 def case(pred_fn_pairs,
          default=None,
@@ -3691,26 +3675,8 @@ def case(pred_fn_pairs,
     TypeError: If `fns[i]` is not callable for any i, or `default` is not
                callable.
   """
-  predicates, actions = _case_verify_and_canonicalize_args(
-      pred_fn_pairs, exclusive, name)
-  with ops.name_scope(name, "case", [predicates]):
-    if default is None:
-      default, predicates, actions = _case_create_default_action(
-          predicates, actions)
-    fn = default
-    # To eval conditions in direct order we create nested conditions in reverse:
-    #   cond(c[0], true_fn=.., false_fn=cond(c[1], ...))
-    for predicate, action in reversed(list(zip(predicates, actions))):
-      fn = functools.partial(
-          cond, predicate, true_fn=action, false_fn=fn, strict=strict)
-    if exclusive:
-      with ops.control_dependencies([
-          _assert_at_most_n_true(
-              predicates, n=1, msg="Input error: exclusive=True")
-      ]):
-        return fn()
-    else:
-      return fn()
+  return _case_helper(cond, pred_fn_pairs, default, exclusive, name,
+                      allow_python_preds=False, strict=strict)
 
 
 class XLAControlFlowContext(ControlFlowContext):
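
Note on `_case_helper` above: the reverse-order nesting it builds with `functools.partial` can be sketched in plain Python. This is only an illustration under stated assumptions, not TensorFlow code: `py_cond` and `case_sketch` are hypothetical stand-ins, predicates are plain Python bools, and the `exclusive` check and argument validation are left out.

```python
import functools


def py_cond(pred, true_fn, false_fn):
  """Hypothetical stand-in for `cond`: calls exactly one of the callables."""
  return true_fn() if pred else false_fn()


def case_sketch(pred_fn_pairs, default):
  """Returns the result of the first action whose predicate is true."""
  fn = default
  # Wrap from the last pair to the first, so the first predicate ends up
  # outermost and is therefore tested first when fn() is finally called:
  #   py_cond(c[0], true_fn=a[0], false_fn=py_cond(c[1], ...))
  for pred, action in reversed(list(pred_fn_pairs)):
    fn = functools.partial(py_cond, pred, true_fn=action, false_fn=fn)
  return fn()


pairs = [(False, lambda: "a"), (True, lambda: "b"), (True, lambda: "c")]
first_match = case_sketch(pairs, default=lambda: "d")
no_match = case_sketch([(False, lambda: "a")], default=lambda: "d")
```

Nothing runs while the chain is being built; each `partial` only captures its arguments, so the branches are evaluated lazily, from the outermost predicate inwards.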
diff --git a/tensorflow/python/ops/control_flow_ops_test.py b/tensorflow/python/ops/control_flow_ops_test.py
index adc8c51e11191c4dbf29ac8294f3390bff37bc6c..f22f3059d139d1bb7c7db57a2939184f1089f397 100644
--- a/tensorflow/python/ops/control_flow_ops_test.py
+++ b/tensorflow/python/ops/control_flow_ops_test.py
@@ -349,42 +349,6 @@ class SwitchTestCase(test_util.TensorFlowTestCase):
       self.assertEquals(grad_x_false.eval(), 0.)
 
 
-@test_util.with_c_api
-class SmartCondTest(test_util.TensorFlowTestCase):
-
-  def testSmartCondTrue(self):
-    with ops.Graph().as_default():
-      with session.Session():
-        x = constant_op.constant(2)
-        y = constant_op.constant(5)
-        z = control_flow_ops.smart_cond(True, lambda: math_ops.multiply(x, 16),
-                                        lambda: math_ops.multiply(y, 5))
-        self.assertEqual(z.eval(), 32)
-
-  def testSmartCondFalse(self):
-    with ops.Graph().as_default():
-      with session.Session():
-        x = constant_op.constant(4)
-        y = constant_op.constant(3)
-        z = control_flow_ops.smart_cond(False, lambda: math_ops.multiply(x, 16),
-                                        lambda: math_ops.multiply(y, 3))
-        self.assertEqual(z.eval(), 9)
-
-  def testSmartCondMissingArg1(self):
-    with ops.Graph().as_default():
-      with session.Session():
-        x = constant_op.constant(1)
-        with self.assertRaises(TypeError):
-          control_flow_ops.smart_cond(True, false_fn=lambda: x)
-
-  def testSmartCondMissingArg2(self):
-    with ops.Graph().as_default():
-      with session.Session():
-        x = constant_op.constant(1)
-        with self.assertRaises(TypeError):
-          control_flow_ops.smart_cond(True, lambda: x)
-
-
 @test_util.with_c_api
 class CondTest(test_util.TensorFlowTestCase):
 
diff --git a/tensorflow/python/ops/ctc_ops.py b/tensorflow/python/ops/ctc_ops.py
index 83da6739db673644f59fda3044769b18b2138fbc..4b57e2de790af13499bc73cfcfa98e999eab1603 100644
--- a/tensorflow/python/ops/ctc_ops.py
+++ b/tensorflow/python/ops/ctc_ops.py
@@ -148,7 +148,7 @@ def ctc_loss(labels, inputs, sequence_length,
   if not time_major:
     inputs = array_ops.transpose(inputs, [1, 0, 2])  # (B,T,N) => (T,B,N)
 
-  loss, _ = gen_ctc_ops._ctc_loss(
+  loss, _ = gen_ctc_ops.ctc_loss(
       inputs,
       labels.indices,
       labels.values,
@@ -224,7 +224,7 @@ def ctc_greedy_decoder(inputs, sequence_length, merge_repeated=True):
         sequence found, the negative of the sum of the greatest logit at each
         timeframe.
   """
-  outputs = gen_ctc_ops._ctc_greedy_decoder(
+  outputs = gen_ctc_ops.ctc_greedy_decoder(
       inputs, sequence_length, merge_repeated=merge_repeated)
   (decoded_ix, decoded_val, decoded_shape, log_probabilities) = outputs
   return ([sparse_tensor.SparseTensor(decoded_ix, decoded_val, decoded_shape)],
@@ -272,7 +272,7 @@ def ctc_beam_search_decoder(inputs, sequence_length, beam_width=100,
   """
 
   decoded_ixs, decoded_vals, decoded_shapes, log_probabilities = (
-      gen_ctc_ops._ctc_beam_search_decoder(
+      gen_ctc_ops.ctc_beam_search_decoder(
           inputs, sequence_length, beam_width=beam_width, top_paths=top_paths,
           merge_repeated=merge_repeated))
 
diff --git a/tensorflow/python/ops/custom_gradient.py b/tensorflow/python/ops/custom_gradient.py
new file mode 100644
index 0000000000000000000000000000000000000000..9eacac1b3704c43cbeb5ecd0cbe827cac3a7cc8b
--- /dev/null
+++ b/tensorflow/python/ops/custom_gradient.py
@@ -0,0 +1,134 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Decorator to override the gradient for a function."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.eager import context
+from tensorflow.python.eager import tape
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_array_ops
+from tensorflow.python.util import nest
+from tensorflow.python.util import tf_decorator
+from tensorflow.python.util.tf_export import tf_export
+
+
+@tf_export("custom_gradient")
+def custom_gradient(f):
+  """Decorator to define a function with a custom gradient.
+
+  This decorator allows fine grained control over the gradients of a sequence
+  of operations.  This may be useful for multiple reasons, including providing
+  a more efficient or numerically stable gradient for a sequence of operations.
+
+  For example, consider the following function that commonly occurs in the
+  computation of cross entropy and log likelihoods:
+
+  ```python
+  def log1pexp(x):
+    return tf.log(1 + tf.exp(x))
+  ```
+
+  Due to numerical instability, the gradient of this function evaluated at
+  x=100 is NaN.  For example:
+
+  ```python
+  x = tf.constant(100.)
+  y = log1pexp(x)
+  dy = tf.gradients(y, x) # Will be NaN when evaluated.
+  ```
+
+  The gradient expression can be analytically simplified to provide numerical
+  stability:
+
+  ```python
+  @tf.custom_gradient
+  def log1pexp(x):
+    e = tf.exp(x)
+    def grad(dy):
+      return dy * (1 - 1 / (1 + e))
+    return tf.log(1 + e), grad
+  ```
+
+  With this definition, the gradient at x=100 will be correctly evaluated as
+  1.0.
+
+  See also @{tf.RegisterGradient} which registers a gradient function for a
+  primitive TensorFlow operation. `tf.custom_gradient` on the other hand allows
+  for fine grained control over the gradient computation of a sequence of
+  operations.
+
+  Args:
+    f: function `f(x)` that returns a tuple `(y, grad_fn)` where:
+       - `x` is a `Tensor` or sequence of `Tensor` inputs to the function.
+       - `y` is a `Tensor` or sequence of `Tensor` outputs of applying
+         TensorFlow
+         operations in `f` to `x`.
+       - `grad_fn` is a function with the signature `g(grad_ys)` which returns
+         a list of `Tensor`s - the derivatives of `Tensor`s in `y` with respect
+         to the `Tensor`s in `x`.  `grad_ys` is a `Tensor` or sequence of
+         `Tensor`s the same size as `y` holding the initial value gradients for
+         each `Tensor` in `y`.
+
+  Returns:
+    A function `h(x)` which returns the same value as `f(x)[0]` and whose
+    gradient (as calculated by @{tf.gradients}) is determined by `f(x)[1]`.
+  """
+
+  def decorated(*args, **kwargs):
+    """Decorated function with custom gradient."""
+    if not context.executing_eagerly():
+      if kwargs:
+        raise ValueError(
+            "The custom_gradient decorator currently supports keyword "
+            "arguments only when eager execution is enabled.")
+      name = "CustomGradient-%s" % ops.uid()
+      args = [ops.convert_to_tensor(x) for x in args]
+      result, grad_fn = f(*args)
+      flat_result = nest.flatten(result)
+      all_tensors = flat_result + args
+
+      @ops.RegisterGradient(name)
+      def internal_grad_fn(unused_op, *result_grads):  # pylint: disable=unused-variable
+        gradients = nest.flatten(grad_fn(*result_grads[:len(flat_result)]))
+        # Need to return one value per input to the IdentityN, so pad the
+        # gradients of the inputs of the custom_gradient function with the
+        # gradients of the outputs as well.
+        return ([None] * len(flat_result)) + gradients
+
+      with ops.get_default_graph().gradient_override_map({"IdentityN": name}):
+        all_tensors = array_ops.identity_n(all_tensors)
+      return nest.pack_sequence_as(
+          structure=result, flat_sequence=all_tensors[:len(flat_result)])
+
+    input_tensors = [ops.convert_to_tensor(x) for x in args]
+
+    result, grad_fn = f(*args, **kwargs)
+    flat_result = nest.flatten(result)
+    # TODO(apassos) consider removing the identity below.
+    flat_result = [gen_array_ops.identity(x) for x in flat_result]
+
+    def actual_grad_fn(*outputs):
+      return nest.flatten(grad_fn(*outputs))
+
+    tape.record_operation(f.__name__, flat_result, input_tensors,
+                          actual_grad_fn)
+    flat_result = list(flat_result)
+    return nest.pack_sequence_as(result, flat_result)
+
+  return tf_decorator.make_decorator(f, decorated)
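
Note on the `log1pexp` docstring example above: the NaN can be reproduced without TensorFlow. The sketch below is a plain-Python illustration under stated assumptions. `exp_or_inf`, `naive_grad` and `stable_grad` are hypothetical helpers, and because Python floats are 64-bit the overflow is triggered at x=1000 rather than at the x=100 that overflows float32.

```python
import math


def exp_or_inf(x):
  """exp(x) that saturates to +inf on overflow, like IEEE tensor arithmetic."""
  try:
    return math.exp(x)
  except OverflowError:
    return math.inf


def naive_grad(x):
  # Chain rule applied to log(1 + exp(x)): (1 / (1 + e)) * e.
  # Once e overflows to inf this becomes 0.0 * inf, which is NaN.
  e = exp_or_inf(x)
  return (1.0 / (1.0 + e)) * e


def stable_grad(x):
  # Simplified form from the docstring: 1 - 1 / (1 + e).
  # Once e overflows to inf this becomes 1.0 - 0.0, which is 1.0.
  e = exp_or_inf(x)
  return 1.0 - 1.0 / (1.0 + e)


naive_at_large_x = naive_grad(1000.0)
stable_at_large_x = stable_grad(1000.0)
```

The two forms agree wherever `exp(x)` is finite; they differ only in how they behave once the intermediate value overflows.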
diff --git a/tensorflow/python/ops/data_flow_ops.py b/tensorflow/python/ops/data_flow_ops.py
index 03ed537cfcf27151a0200d7a17f63b1a2bc7ba1a..d2cc87555f6321432261b32f08431c23ce707eff 100644
--- a/tensorflow/python/ops/data_flow_ops.py
+++ b/tensorflow/python/ops/data_flow_ops.py
@@ -159,7 +159,7 @@ class QueueBase(object):
       ValueError: If one of the arguments is invalid.
       RuntimeError: If eager execution is enabled.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError(
           "Queues are not supported when eager execution is enabled. "
           "Instead, please use tf.data to get data into your model.")
@@ -177,10 +177,10 @@ class QueueBase(object):
     else:
       self._names = None
     self._queue_ref = queue_ref
-    if context.in_graph_mode():
-      self._name = self._queue_ref.op.name.split("/")[-1]
-    else:
+    if context.executing_eagerly():
       self._name = context.context().scope_name
+    else:
+      self._name = self._queue_ref.op.name.split("/")[-1]
 
   @staticmethod
   def from_list(index, queues):
@@ -231,9 +231,9 @@ class QueueBase(object):
   @property
   def name(self):
     """The name of the underlying queue."""
-    if context.in_graph_mode():
-      return self._queue_ref.op.name
-    return self._name
+    if context.executing_eagerly():
+      return self._name
+    return self._queue_ref.op.name
 
   @property
   def dtypes(self):
@@ -342,10 +342,10 @@ class QueueBase(object):
         val.get_shape().assert_is_compatible_with(shape)
 
       if self._queue_ref.dtype == _dtypes.resource:
-        return gen_data_flow_ops._queue_enqueue_v2(
+        return gen_data_flow_ops.queue_enqueue_v2(
             self._queue_ref, vals, name=scope)
       else:
-        return gen_data_flow_ops._queue_enqueue(
+        return gen_data_flow_ops.queue_enqueue(
             self._queue_ref, vals, name=scope)
 
   def enqueue_many(self, vals, name=None):
@@ -387,7 +387,7 @@ class QueueBase(object):
             val.get_shape().with_rank_at_least(1)[0])
         val.get_shape()[1:].assert_is_compatible_with(shape)
 
-      return gen_data_flow_ops._queue_enqueue_many_v2(
+      return gen_data_flow_ops.queue_enqueue_many_v2(
           self._queue_ref, vals, name=scope)
 
   def _dequeue_return_value(self, tensors):
@@ -436,15 +436,15 @@ class QueueBase(object):
     if name is None:
       name = "%s_Dequeue" % self._name
     if self._queue_ref.dtype == _dtypes.resource:
-      ret = gen_data_flow_ops._queue_dequeue_v2(
+      ret = gen_data_flow_ops.queue_dequeue_v2(
           self._queue_ref, self._dtypes, name=name)
     else:
-      ret = gen_data_flow_ops._queue_dequeue(
+      ret = gen_data_flow_ops.queue_dequeue(
           self._queue_ref, self._dtypes, name=name)
 
     # NOTE(mrry): Not using a shape function because we need access to
     # the `QueueBase` object.
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       op = ret[0].op
       for output, shape in zip(op.values(), self._shapes):
         output.set_shape(shape)
@@ -479,12 +479,12 @@ class QueueBase(object):
     if name is None:
       name = "%s_DequeueMany" % self._name
 
-    ret = gen_data_flow_ops._queue_dequeue_many_v2(
+    ret = gen_data_flow_ops.queue_dequeue_many_v2(
         self._queue_ref, n=n, component_types=self._dtypes, name=name)
 
     # NOTE(mrry): Not using a shape function because we need access to
     # the Queue object.
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       op = ret[0].op
       batch_dim = tensor_shape.Dimension(
           tensor_util.constant_value(op.inputs[1]))
@@ -523,12 +523,12 @@ class QueueBase(object):
     if name is None:
       name = "%s_DequeueUpTo" % self._name
 
-    ret = gen_data_flow_ops._queue_dequeue_up_to_v2(
+    ret = gen_data_flow_ops.queue_dequeue_up_to_v2(
         self._queue_ref, n=n, component_types=self._dtypes, name=name)
 
     # NOTE(mrry): Not using a shape function because we need access to
     # the Queue object.
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       op = ret[0].op
       for output, shape in zip(op.values(), self._shapes):
         output.set_shape(tensor_shape.TensorShape([None]).concatenate(shape))
@@ -560,12 +560,12 @@ class QueueBase(object):
     if name is None:
       name = "%s_Close" % self._name
     if self._queue_ref.dtype == _dtypes.resource:
-      return gen_data_flow_ops._queue_close_v2(
+      return gen_data_flow_ops.queue_close_v2(
           self._queue_ref,
           cancel_pending_enqueues=cancel_pending_enqueues,
           name=name)
     else:
-      return gen_data_flow_ops._queue_close(
+      return gen_data_flow_ops.queue_close(
           self._queue_ref,
           cancel_pending_enqueues=cancel_pending_enqueues,
           name=name)
@@ -601,9 +601,9 @@ class QueueBase(object):
     if name is None:
       name = "%s_Size" % self._name
     if self._queue_ref.dtype == _dtypes.resource:
-      return gen_data_flow_ops._queue_size_v2(self._queue_ref, name=name)
+      return gen_data_flow_ops.queue_size_v2(self._queue_ref, name=name)
     else:
-      return gen_data_flow_ops._queue_size(self._queue_ref, name=name)
+      return gen_data_flow_ops.queue_size(self._queue_ref, name=name)
 
 
 @tf_export("RandomShuffleQueue")
@@ -683,7 +683,7 @@ class RandomShuffleQueue(QueueBase):
       # the id of the last op created.)
       string = (str(seed1) + shared_name).encode("utf-8")
       seed2 = int(hashlib.md5(string).hexdigest()[:8], 16) & 0x7FFFFFFF
-    queue_ref = gen_data_flow_ops._random_shuffle_queue_v2(
+    queue_ref = gen_data_flow_ops.random_shuffle_queue_v2(
         component_types=dtypes,
         shapes=shapes,
         capacity=capacity,
@@ -748,7 +748,7 @@ class FIFOQueue(QueueBase):
     dtypes = _as_type_list(dtypes)
     shapes = _as_shape_list(shapes, dtypes)
     names = _as_name_list(names, dtypes)
-    queue_ref = gen_data_flow_ops._fifo_queue_v2(
+    queue_ref = gen_data_flow_ops.fifo_queue_v2(
         component_types=dtypes,
         shapes=shapes,
         capacity=capacity,
@@ -827,7 +827,7 @@ class PaddingFIFOQueue(QueueBase):
                        "but received %d dtypes and %d shapes." % (len(dtypes),
                                                                   len(shapes)))
 
-    queue_ref = gen_data_flow_ops._padding_fifo_queue_v2(
+    queue_ref = gen_data_flow_ops.padding_fifo_queue_v2(
         component_types=dtypes,
         shapes=shapes,
         capacity=capacity,
@@ -895,7 +895,7 @@ class PriorityQueue(QueueBase):
     types = _as_type_list(types)
     shapes = _as_shape_list(shapes, types)
 
-    queue_ref = gen_data_flow_ops._priority_queue_v2(
+    queue_ref = gen_data_flow_ops.priority_queue_v2(
         component_types=types,
         shapes=shapes,
         capacity=capacity,
@@ -985,15 +985,15 @@ class Barrier(object):
     else:
       self._shapes = [tensor_shape.unknown_shape() for _ in self._types]
 
-    self._barrier_ref = gen_data_flow_ops._barrier(
+    self._barrier_ref = gen_data_flow_ops.barrier(
         component_types=self._types,
         shapes=self._shapes,
         shared_name=shared_name,
         name=name)
-    if context.in_graph_mode():
-      self._name = self._barrier_ref.op.name.split("/")[-1]
-    else:
+    if context.executing_eagerly():
       self._name = context.context().scope_name
+    else:
+      self._name = self._barrier_ref.op.name.split("/")[-1]
 
   @property
   def barrier_ref(self):
@@ -1003,9 +1003,9 @@ class Barrier(object):
   @property
   def name(self):
     """The name of the underlying barrier."""
-    if context.in_graph_mode():
-      return self._barrier_ref.op.name
-    return self._name
+    if context.executing_eagerly():
+      return self._name
+    return self._barrier_ref.op.name
 
   def insert_many(self, component_index, keys, values, name=None):
     """For each key, assigns the respective value to the specified component.
@@ -1026,7 +1026,7 @@ class Barrier(object):
     """
     if name is None:
       name = "%s_BarrierInsertMany" % self._name
-    return gen_data_flow_ops._barrier_insert_many(
+    return gen_data_flow_ops.barrier_insert_many(
         self._barrier_ref, keys, values, component_index, name=name)
 
   def take_many(self,
@@ -1073,7 +1073,7 @@ class Barrier(object):
     """
     if name is None:
       name = "%s_BarrierTakeMany" % self._name
-    ret = gen_data_flow_ops._barrier_take_many(
+    ret = gen_data_flow_ops.barrier_take_many(
         self._barrier_ref,
         num_elements,
         self._types,
@@ -1083,7 +1083,7 @@ class Barrier(object):
 
     # NOTE(mrry): Not using a shape function because we need access to
     # the Barrier object.
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       op = ret[0].op
       if allow_small_batch:
         batch_dim = None
@@ -1122,7 +1122,7 @@ class Barrier(object):
     """
     if name is None:
       name = "%s_BarrierClose" % self._name
-    return gen_data_flow_ops._barrier_close(
+    return gen_data_flow_ops.barrier_close(
         self._barrier_ref,
         cancel_pending_enqueues=cancel_pending_enqueues,
         name=name)
@@ -1139,7 +1139,7 @@ class Barrier(object):
     """
     if name is None:
       name = "%s_BarrierReadySize" % self._name
-    return gen_data_flow_ops._barrier_ready_size(self._barrier_ref, name=name)
+    return gen_data_flow_ops.barrier_ready_size(self._barrier_ref, name=name)
 
   def incomplete_size(self, name=None):
     """Compute the number of incomplete elements in the given barrier.
@@ -1153,7 +1153,7 @@ class Barrier(object):
     """
     if name is None:
       name = "%s_BarrierIncompleteSize" % self._name
-    return gen_data_flow_ops._barrier_incomplete_size(
+    return gen_data_flow_ops.barrier_incomplete_size(
         self._barrier_ref, name=name)
 
 
@@ -1183,10 +1183,10 @@ class ConditionalAccumulatorBase(object):
     else:
       self._shape = tensor_shape.unknown_shape()
     self._accumulator_ref = accumulator_ref
-    if context.in_graph_mode():
-      self._name = self._accumulator_ref.op.name.split("/")[-1]
-    else:
+    if context.executing_eagerly():
       self._name = context.context().scope_name
+    else:
+      self._name = self._accumulator_ref.op.name.split("/")[-1]
 
   @property
   def accumulator_ref(self):
diff --git a/tensorflow/python/ops/distributions/gamma.py b/tensorflow/python/ops/distributions/gamma.py
index 8fb218be3ac7e17e18d85b8e1c100ccd58aa1034..adb1f4f9a879e44cf8cb4cafd22b92554f487712 100644
--- a/tensorflow/python/ops/distributions/gamma.py
+++ b/tensorflow/python/ops/distributions/gamma.py
@@ -193,12 +193,6 @@ class Gamma(distribution.Distribution):
   def _log_prob(self, x):
     return self._log_unnormalized_prob(x) - self._log_normalization()
 
-  def _prob(self, x):
-    return math_ops.exp(self._log_prob(x))
-
-  def _log_cdf(self, x):
-    return math_ops.log(self._cdf(x))
-
   def _cdf(self, x):
     x = self._maybe_assert_valid_sample(x)
     # Note that igamma returns the regularized incomplete gamma function,
diff --git a/tensorflow/python/ops/distributions/normal.py b/tensorflow/python/ops/distributions/normal.py
index e7f120ea2da525e20a1ae42e6418cf2ac83686af..32e8a49c81bc4b23d8897639998dd33942b41a80 100644
--- a/tensorflow/python/ops/distributions/normal.py
+++ b/tensorflow/python/ops/distributions/normal.py
@@ -188,9 +188,6 @@ class Normal(distribution.Distribution):
   def _log_prob(self, x):
     return self._log_unnormalized_prob(x) - self._log_normalization()
 
-  def _prob(self, x):
-    return math_ops.exp(self._log_prob(x))
-
   def _log_cdf(self, x):
     return special_math.log_ndtr(self._z(x))
 
diff --git a/tensorflow/python/ops/distributions/student_t.py b/tensorflow/python/ops/distributions/student_t.py
index 778fefb8c2991153b7e7a1f20df61680153dab2a..9d9e65b4e8d6d2e40bf9c263339f899439c842c3 100644
--- a/tensorflow/python/ops/distributions/student_t.py
+++ b/tensorflow/python/ops/distributions/student_t.py
@@ -248,9 +248,6 @@ class StudentT(distribution.Distribution):
             math_ops.lgamma(0.5 * self.df) -
             math_ops.lgamma(0.5 * (self.df + 1.)))
 
-  def _prob(self, x):
-    return math_ops.exp(self._log_prob(x))
-
   def _cdf(self, x):
     # Take Abs(scale) to make subsequent where work correctly.
     y = (x - self.loc) / math_ops.abs(self.scale)
diff --git a/tensorflow/python/ops/distributions/uniform.py b/tensorflow/python/ops/distributions/uniform.py
index 3580af18f241d777c81340f1c565074914838029..ec623b55eb0067e16599c18c9c504635da863907 100644
--- a/tensorflow/python/ops/distributions/uniform.py
+++ b/tensorflow/python/ops/distributions/uniform.py
@@ -45,11 +45,12 @@ class Uniform(distribution.Distribution):
   Z = b - a
   ```
 
-  where:
-  * `low = a`,
-  * `high = b`,
-  * `Z` is the normalizing constant, and,
-  * `I[predicate]` is the [indicator function](
+  where
+
+  - `low = a`,
+  - `high = b`,
+  - `Z` is the normalizing constant, and
+  - `I[predicate]` is the [indicator function](
     https://en.wikipedia.org/wiki/Indicator_function) for `predicate`.
 
   The parameters `low` and `high` must be shaped in a way that supports
@@ -164,9 +165,6 @@ class Uniform(distribution.Distribution):
                                         seed=seed)
     return self.low + self.range() * samples
 
-  def _log_prob(self, x):
-    return math_ops.log(self._prob(x))
-
   def _prob(self, x):
     broadcasted_x = x * array_ops.ones(self.batch_shape_tensor())
     return array_ops.where(
@@ -178,9 +176,6 @@ class Uniform(distribution.Distribution):
             array_ops.zeros_like(broadcasted_x),
             array_ops.ones_like(broadcasted_x) / self.range()))
 
-  def _log_cdf(self, x):
-    return math_ops.log(self.cdf(x))
-
   def _cdf(self, x):
     broadcast_shape = array_ops.broadcast_dynamic_shape(
         array_ops.shape(x), self.batch_shape_tensor())
diff --git a/tensorflow/python/ops/embedding_ops.py b/tensorflow/python/ops/embedding_ops.py
index 3826585f59c31133b12c365816729e090c9ab561..20e4a28b9c0f5427e331a69cba52503492a8420a 100644
--- a/tensorflow/python/ops/embedding_ops.py
+++ b/tensorflow/python/ops/embedding_ops.py
@@ -396,8 +396,8 @@ def embedding_lookup_sparse(params,
     with `combiner`="mean", then the output will be a 3x20 matrix where
 
       output[0, :] = (params[1, :] * 2.0 + params[3, :] * 0.5) / (2.0 + 0.5)
-      output[1, :] = params[0, :] * 1.0
-      output[2, :] = params[1, :] * 3.0
+      output[1, :] = (params[0, :] * 1.0) / 1.0
+      output[2, :] = (params[1, :] * 3.0) / 3.0
 
   Raises:
     TypeError: If sp_ids is not a SparseTensor, or if sp_weights is neither
diff --git a/tensorflow/python/ops/functional_ops.py b/tensorflow/python/ops/functional_ops.py
index ac03d30fcd2e65f032937d9259bc8fff18626619..a840b1eddfc6922dc310490e8166efd73480c437 100644
--- a/tensorflow/python/ops/functional_ops.py
+++ b/tensorflow/python/ops/functional_ops.py
@@ -41,7 +41,7 @@ from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops.gen_functional_ops import *
 # pylint: enable=wildcard-import
 # pylint: disable=unused-import
-from tensorflow.python.ops.gen_functional_ops import _symbolic_gradient
+from tensorflow.python.ops.gen_functional_ops import symbolic_gradient
 # pylint: enable=unused-import
 from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
@@ -90,7 +90,7 @@ def foldl(fn, elems, initializer=None, parallel_iterations=10, back_prop=True,
   if not callable(fn):
     raise TypeError("fn must be callable.")
 
-  in_graph_mode = context.in_graph_mode()
+  in_graph_mode = not context.executing_eagerly()
   with ops.name_scope(name, "foldl", [elems]):
     # TODO(akshayka): Remove the in_graph_mode check once caching devices are
     # supported in Eager
@@ -178,7 +178,7 @@ def foldr(fn, elems, initializer=None, parallel_iterations=10, back_prop=True,
   if not callable(fn):
     raise TypeError("fn must be callable.")
 
-  in_graph_mode = context.in_graph_mode()
+  in_graph_mode = not context.executing_eagerly()
   with ops.name_scope(name, "foldr", [elems]):
     # TODO(akshayka): Remove the in_graph_mode check once caching devices are
     # supported in Eager
@@ -343,7 +343,7 @@ def map_fn(fn, elems, dtype=None, parallel_iterations=10, back_prop=True,
 
   elems_flat = input_flatten(elems)
 
-  in_graph_mode = context.in_graph_mode()
+  in_graph_mode = not context.executing_eagerly()
   with ops.name_scope(name, "map", elems_flat):
     # TODO(akshayka): Remove the in_graph_mode check once caching devices are
     # supported in Eager
@@ -364,8 +364,8 @@ def map_fn(fn, elems, dtype=None, parallel_iterations=10, back_prop=True,
     dtype = dtype or input_pack([elem.dtype for elem in elems_flat])
     dtype_flat = output_flatten(dtype)
 
-    # Convert elems to tensor array.
-    n = array_ops.shape(elems_flat[0])[0]
+    # Convert elems to tensor array. n may be known statically.
+    n = elems_flat[0].shape[0].value or array_ops.shape(elems_flat[0])[0]
 
     # TensorArrays are always flat
     elems_ta = [
@@ -536,7 +536,7 @@ def scan(fn, elems, initializer=None, parallel_iterations=10, back_prop=True,
 
   elems_flat = input_flatten(elems)
 
-  in_graph_mode = context.in_graph_mode()
+  in_graph_mode = not context.executing_eagerly()
   with ops.name_scope(name, "scan", elems_flat):
     # TODO(akshayka): Remove the in_graph_mode check once caching devices are
     # supported in Eager
@@ -555,7 +555,8 @@ def scan(fn, elems, initializer=None, parallel_iterations=10, back_prop=True,
     elems_flat = [
         ops.convert_to_tensor(elem, name="elem") for elem in elems_flat]
 
-    n = array_ops.shape(elems_flat[0])[0]
+    # Convert elems to tensor array. n may be known statically.
+    n = elems_flat[0].shape[0].value or array_ops.shape(elems_flat[0])[0]
 
     # TensorArrays are always flat
     elems_ta = [
@@ -615,7 +616,8 @@ def scan(fn, elems, initializer=None, parallel_iterations=10, back_prop=True,
     _, _, r_a = control_flow_ops.while_loop(
         lambda i, _1, _2: i < n, compute, (i, a_flat, accs_ta),
         parallel_iterations=parallel_iterations,
-        back_prop=back_prop, swap_memory=swap_memory)
+        back_prop=back_prop, swap_memory=swap_memory,
+        maximum_iterations=n)
 
     results_flat = [r.stack() for r in r_a]
 
diff --git a/tensorflow/python/ops/gradients.py b/tensorflow/python/ops/gradients.py
index 921fd50aa9fdd1a1e493708f4bc8c66996e26e2c..2668e8f60cd2864fd59ffa3fb539380d34a34004 100644
--- a/tensorflow/python/ops/gradients.py
+++ b/tensorflow/python/ops/gradients.py
@@ -19,6 +19,8 @@ from __future__ import division
 from __future__ import print_function
 
 # pylint: disable=unused-import
+from tensorflow.python.eager.backprop import GradientTape
+from tensorflow.python.ops.custom_gradient import custom_gradient
 from tensorflow.python.ops.gradients_impl import AggregationMethod
 from tensorflow.python.ops.gradients_impl import gradients
 from tensorflow.python.ops.gradients_impl import hessians
@@ -28,6 +30,8 @@ from tensorflow.python.util.all_util import remove_undocumented
 _allowed_symbols = [
     # TODO(drpng): find a good place to reference this.
     "AggregationMethod",
+    "GradientTape",
+    "custom_gradient",
     "gradients",  # tf.gradients.gradients.
     "hessians",  # tf.gradients.hessians
 ]
diff --git a/tensorflow/python/ops/gradients_impl.py b/tensorflow/python/ops/gradients_impl.py
index 1418c0b10fb60601e7c3024891b89aadb53e6873..44473ec69c8ac6cf565f635621eebff7bc403225 100644
--- a/tensorflow/python/ops/gradients_impl.py
+++ b/tensorflow/python/ops/gradients_impl.py
@@ -86,17 +86,19 @@ def _IndexedSlicesToTensor(value, dtype=None, name=None, as_ref=False):
         % str(value))
   # TODO(mrry): Consider adding static shape information to
   # IndexedSlices, to avoid using numpy here.
-  dense_shape_value = tensor_util.constant_value(value.dense_shape)
-  if dense_shape_value is not None:
-    num_elements = np.prod(dense_shape_value)
-    if num_elements >= _LARGE_SPARSE_NUM_ELEMENTS:
+  if not context.executing_eagerly():
+    dense_shape_value = tensor_util.constant_value(value.dense_shape)
+    if dense_shape_value is not None:
+      num_elements = np.prod(dense_shape_value)
+      if num_elements >= _LARGE_SPARSE_NUM_ELEMENTS:
+        warnings.warn(
+            "Converting sparse IndexedSlices to a dense Tensor with %d "
+            "elements. This may consume a large amount of memory." %
+            num_elements)
+    else:
       warnings.warn(
-          "Converting sparse IndexedSlices to a dense Tensor with %d elements. "
-          "This may consume a large amount of memory." % num_elements)
-  else:
-    warnings.warn(
-        "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
-        "This may consume a large amount of memory.")
+          "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
+          "This may consume a large amount of memory.")
   return math_ops.unsorted_segment_sum(
       value.values, value.indices, value.dense_shape[0], name=name)
 
@@ -354,7 +356,7 @@ def _SymGrad(op, out_grads):
   for k in op.node_def.attr:
     f.attr[k].CopyFrom(op.node_def.attr[k])
   # pylint: disable=protected-access
-  in_grads = functional_ops._symbolic_gradient(input=f_in, Tout=f_types, f=f)
+  in_grads = functional_ops.symbolic_gradient(input=f_in, Tout=f_types, f=f)
   # pylint: enable=protected-access
   return in_grads
 
@@ -478,9 +480,21 @@ def gradients(ys,
     RuntimeError: if called in Eager mode.
 
   """
-  if context.in_eager_mode():
-    raise RuntimeError("tf.gradients not supported in EAGER mode. Use "
-                       "functions in tf.contrib.eager.backprop instead.")
+  # Creating the gradient graph for control flow mutates Operations. _lock
+  # ensures a Session.run call cannot occur between creating and mutating new
+  # ops.
+  with ops.get_default_graph()._lock:  # pylint: disable=protected-access
+    return _GradientsHelper(ys, xs, grad_ys, name, colocate_gradients_with_ops,
+                            gate_gradients, aggregation_method, stop_gradients)
+
+
+def _GradientsHelper(ys, xs, grad_ys, name, colocate_gradients_with_ops,
+                     gate_gradients, aggregation_method, stop_gradients):
+  """Implementation of gradients()."""
+  if context.executing_eagerly():
+    raise RuntimeError("tf.gradients not supported when eager execution "
+                       "is enabled. Use tf.contrib.eager.GradientTape "
+                       "instead.")
   ys = _AsList(ys)
   xs = _AsList(xs)
   stop_gradients = [] if stop_gradients is None else _AsList(stop_gradients)
diff --git a/tensorflow/python/ops/gradients_test.py b/tensorflow/python/ops/gradients_test.py
index d39b934819177e3c15af95a0777ba96869c5e9cf..c94f1396b28e2124c6e5123cf711ac86abf174ab 100644
--- a/tensorflow/python/ops/gradients_test.py
+++ b/tensorflow/python/ops/gradients_test.py
@@ -35,6 +35,7 @@ from tensorflow.python.ops import array_grad  # pylint: disable=unused-import
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_grad  # pylint: disable=unused-import
 from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import custom_gradient
 from tensorflow.python.ops import data_flow_grad  # pylint: disable=unused-import
 from tensorflow.python.ops import data_flow_ops  # pylint: disable=unused-import
 from tensorflow.python.ops import functional_ops  # pylint: disable=unused-import
@@ -661,6 +662,7 @@ class HessianTest(test_util.TensorFlowTestCase):
     self.assertAllEqual((m, n, m, n), hess_actual.shape)
     self.assertAllClose(hess_value, hess_actual.reshape((m * n, m * n)))
 
+
 @test_util.with_c_api
 class IndexedSlicesToTensorTest(test_util.TensorFlowTestCase):
 
@@ -741,6 +743,59 @@ class IndexedSlicesToTensorTest(test_util.TensorFlowTestCase):
         "of unknown shape. This may consume a large amount of memory." in
         str(w[0].message))
 
+  def testCustomGradientTrivial(self):
+
+    @custom_gradient.custom_gradient
+    def MyIdentity(x):
+
+      def Grad(dy):
+        return [3 * dy]
+
+      return x, Grad
+
+    with ops.Graph().as_default():
+      x = constant(3.)
+      y = MyIdentity(MyIdentity(x))
+      dy = gradients.gradients(y, x)[0]
+      with session.Session():
+        self.assertEqual(9., dy.eval())
+
+  def testCustomGradient(self):
+
+    @custom_gradient.custom_gradient
+    def MyMultiply(x1, x2):
+      result = x1 * x2
+
+      def Grad(dy):
+        # Switched the ordering here.
+        return [dy * x1, dy * x2]
+
+      return result, Grad
+
+    with ops.Graph().as_default():
+      x1 = constant(3.)
+      x2 = constant(5.)
+      y = MyMultiply(x1, x2)
+      dy = gradients.gradients(y, [x1, x2])
+      with session.Session() as sess:
+        self.assertAllEqual([3., 5.], sess.run(dy))
+
+  def testCustomGradientErrors(self):
+
+    @custom_gradient.custom_gradient
+    def F(x):
+
+      def Grad(_):
+        raise RuntimeError("x")
+
+      return x, Grad
+
+    with ops.Graph().as_default():
+      x = constant(1.0)
+      y = F(x)
+      with self.assertRaises(RuntimeError):
+        gradients.gradients(y, x)
+
 
 @test_util.with_c_api
 class OnlyRealGradientsTest(test_util.TensorFlowTestCase):
diff --git a/tensorflow/python/ops/hidden_ops.txt b/tensorflow/python/ops/hidden_ops.txt
deleted file mode 100644
index 9b8172bf2639cca0efb663ff4075b36d6f4f2245..0000000000000000000000000000000000000000
--- a/tensorflow/python/ops/hidden_ops.txt
+++ /dev/null
@@ -1,394 +0,0 @@
-# array_ops
-BatchToSpace
-BroadcastArgs
-BroadcastGradientArgs
-ConcatOffset
-Concat
-ConcatV2
-ConjugateTranspose
-Const
-DebugGradientIdentity
-DebugGradientRefIdentity
-EditDistance
-ExpandDims
-ListDiff
-MirrorPad
-MirrorPadGrad
-OneHot
-Pack
-Pad
-PadV2
-ParallelConcat
-Placeholder
-RefIdentity
-Reverse
-Snapshot
-SpaceToBatch
-Split
-SplitV
-Squeeze
-Slice
-TileGrad  # Exported through array_grad instead of array_ops.
-ZerosLike  # TODO(josh11b): Use this instead of the Python version.
-Unique
-UniqueV2
-UniqueWithCounts
-UniqueWithCountsV2
-Unpack
-
-# candidate_sampling_ops
-AllCandidateSampler
-ComputeAccidentalHits
-FixedUnigramCandidateSampler
-LearnedUnigramCandidateSampler
-LogUniformCandidateSampler
-ThreadUnsafeUnigramCandidateSampler
-UniformCandidateSampler
-
-# checkpoint_ops
-GenerateVocabRemapping
-LoadAndRemapMatrix
-
-
-# control_flow_ops
-Switch
-Merge
-RefMerge
-Exit
-RefExit
-
-# ctc_ops
-CTCLoss
-CTCGreedyDecoder
-CTCBeamSearchDecoder
-
-# data_flow_ops
-Barrier
-BarrierClose
-BarrierIncompleteSize
-BarrierInsertMany
-BarrierReadySize
-BarrierTakeMany
-DeleteSessionTensor
-FakeQueue
-FIFOQueue
-FIFOQueueV2
-GetSessionHandle
-GetSessionHandleV2
-GetSessionTensor
-HashTable
-HashTableV2
-InitializeTable
-InitializeTableV2
-InitializeTableFromTextFile
-InitializeTableFromTextFileV2
-LookupTableExport
-LookupTableExportV2
-LookupTableFind
-LookupTableFindV2
-LookupTableImport
-LookupTableImportV2
-LookupTableInsert
-LookupTableInsertV2
-LookupTableSize
-LookupTableSizeV2
-MutableDenseHashTable
-MutableDenseHashTableV2
-MutableHashTable
-MutableHashTableV2
-MutableHashTableOfTensors
-MutableHashTableOfTensorsV2
-Mutex
-MutexAcquire
-MutexRelease
-PaddingFIFOQueue
-PaddingFIFOQueueV2
-PriorityQueue
-PriorityQueueV2
-QueueClose
-QueueCloseV2
-QueueDequeue
-QueueDequeueV2
-QueueDequeueMany
-QueueDequeueManyV2
-QueueDequeueUpTo
-QueueDequeueUpToV2
-QueueEnqueue
-QueueEnqueueV2
-QueueEnqueueMany
-QueueEnqueueManyV2
-QueueSize
-QueueSizeV2
-RandomShuffleQueue
-RandomShuffleQueueV2
-Stack
-StackClose
-StackPop
-StackPush
-StackV2
-StackCloseV2
-StackPopV2
-StackPushV2
-TensorArray
-TensorArrayClose
-TensorArrayCloseV2
-TensorArrayConcat
-TensorArrayConcatV2
-TensorArrayGather
-TensorArrayGatherV2
-TensorArrayGrad
-TensorArrayGradV2
-TensorArrayPack
-TensorArrayPackV2
-TensorArrayRead
-TensorArrayReadV2
-TensorArrayScatter
-TensorArrayScatterV2
-TensorArraySize
-TensorArraySizeV2
-TensorArraySplit
-TensorArraySplitV2
-TensorArrayUnpack
-TensorArrayUnpackV2
-TensorArrayV2
-TensorArrayWrite
-TensorArrayWriteV2
-TensorArrayV3
-TensorArrayCloseV3
-TensorArrayConcatV3
-TensorArrayGatherV3
-TensorArrayGradV3
-TensorArrayReadV3
-TensorArrayPackV3
-TensorArrayScatterV3
-TensorArraySizeV3
-TensorArraySplitV3
-TensorArrayUnpackV3
-TensorArrayWriteV3
-
-# functional_ops
-SymbolicGradient
-
-# image_ops
-AdjustContrastv2
-NonMaxSuppression
-NonMaxSuppressionV2
-RandomCrop
-ResizeBilinearGrad
-ResizeBicubicGrad
-ResizeNearestNeighborGrad
-SampleDistortedBoundingBox
-SampleDistortedBoundingBoxV2
-ScaleImageGrad
-
-# io_ops
-FixedLengthRecordReader
-IdentityReader
-ReaderNumRecordsProduced
-ReaderNumWorkUnitsCompleted
-ReaderRead
-ReaderReadUpTo
-ReaderReset
-ReaderRestoreState
-ReaderSerializeState
-ReaderWorkQueueLength
-FixedLengthRecordReaderV2
-IdentityReaderV2
-ReaderNumRecordsProducedV2
-ReaderNumWorkUnitsCompletedV2
-ReaderReadV2
-ReaderReadUpToV2
-ReaderResetV2
-ReaderRestoreStateV2
-ReaderSerializeStateV2
-ReaderWorkQueueLengthV2
-Restore
-RestoreSlice
-Save
-SaveSlices
-ShardedFilename
-ShardedFilespec
-TextLineReader
-TFRecordReader
-WholeFileReader
-TextLineReaderV2
-TFRecordReaderV2
-WholeFileReaderV2
-LMDBReader
-DecodeCSV
-
-# linalg_ops
-BatchCholesky
-BatchCholeskyGrad
-BatchMatrixDeterminant
-BatchMatrixInverse
-BatchMatrixSolve
-BatchMatrixSolveLs
-BatchMatrixTriangularSolve
-BatchSelfAdjointEig
-BatchSelfAdjointEigV2
-BatchSvd
-LogMatrixDeterminant
-MatrixExponential
-MatrixLogarithm
-MatrixSolveLs
-SelfAdjointEig
-SelfAdjointEigV2
-Svd
-
-# logging_ops
-Assert
-AudioSummary
-AudioSummaryV2
-HistogramSummary
-ImageSummary
-MergeSummary
-Print
-ScalarSummary
-TensorSummary
-TensorSummaryV2
-
-# math_ops
-Abs
-AccumulateNV2
-AddN
-AddV2
-All
-Any
-BatchMatMul
-BatchFFT
-BatchFFT2D
-BatchFFT3D
-BatchIFFT
-BatchIFFT2D
-BatchIFFT3D
-Bucketize
-Complex
-ComplexAbs
-Conj
-FloorDiv
-FloorMod
-HistogramFixedWidth
-Max
-Mean
-Min
-Mul
-Neg
-Pow
-Prod
-Range
-RealDiv
-Select
-SparseMatMul
-Sub
-Sum
-MatMul
-Sigmoid
-Tanh
-SigmoidGrad
-TanhGrad
-InvGrad
-ReciprocalGrad
-SqrtGrad
-RsqrtGrad
-TruncateDiv
-TruncateMod
-
-# nn_ops
-AvgPoolGrad  # "*Grad" accessible through nn_grad instead of nn_ops.
-AvgPool3DGrad
-BatchNormWithGlobalNormalization
-BatchNormWithGlobalNormalizationGrad
-FusedBatchNorm
-FusedBatchNormV2
-SoftmaxCrossEntropyWithLogits
-SparseSoftmaxCrossEntropyWithLogits
-LRNGrad
-MaxPoolGrad
-MaxPoolGradWithArgmax
-MaxPoolGradGrad
-MaxPoolGradGradWithArgmax
-MaxPool3DGrad
-MaxPool3DGradGrad
-ReluGrad
-Relu6Grad
-EluGrad
-SeluGrad
-SoftplusGrad
-SoftsignGrad
-TopK
-TopKV2
-BiasAdd
-BiasAddV1
-Relu6
-AvgPool
-MaxPool
-MaxPoolV2
-Softmax
-LogSoftmax
-FractionalAvgPoolGrad
-FractionalMaxPoolGrad
-InTopK
-InTopKV2
-
-# parsing_ops
-ParseExample
-ParseSingleSequenceExample
-
-# random_ops
-RandomGamma
-RandomPoisson
-RandomUniform
-RandomUniformInt
-RandomShuffle
-RandomStandardNormal
-ParameterizedTruncatedNormal
-TruncatedNormal
-
-# script_ops
-PyFunc
-PyFuncStateless
-EagerPyFunc
-
-# sdca_ops
-
-# state_ops
-Variable
-VariableV2
-TemporaryVariable
-DestroyTemporaryVariable
-
-# sparse_ops
-AddSparseToTensorsMap
-AddManySparseToTensorsMap
-TakeManySparseFromTensorsMap
-DeserializeManySparse
-DeserializeSparse
-SerializeManySparse
-SerializeSparse
-SparseAdd
-SparseAddGrad
-SparseConcat
-SparseCross
-SparseFillEmptyRows
-SparseFillEmptyRowsGrad
-SparseSplit
-SparseSelectLastK
-SparseReorder
-SparseReshape
-SparseToDense
-SparseTensorDenseAdd
-SparseTensorDenseMatMul
-
-# string_ops
-StringSplit
-
-# user_ops
-Fact
-
-# training_ops
-# (None)
-
-# word2vec deprecated ops
-NegTrain
-Skipgram
diff --git a/tensorflow/python/ops/histogram_ops.py b/tensorflow/python/ops/histogram_ops.py
index 6a975160b0698270dfc9ce9140e8b3ff633cdb9e..4a1ef54fb50013881aa832f83674ac66ecccd9bc 100644
--- a/tensorflow/python/ops/histogram_ops.py
+++ b/tensorflow/python/ops/histogram_ops.py
@@ -141,5 +141,7 @@ def histogram_fixed_width(values,
   """
   with ops.name_scope(name, 'histogram_fixed_width',
                       [values, value_range, nbins]) as name:
-    return gen_math_ops._histogram_fixed_width(  # pylint: disable=protected-access
+    # pylint: disable=protected-access
+    return gen_math_ops._histogram_fixed_width(
         values, value_range, nbins, dtype=dtype, name=name)
+    # pylint: enable=protected-access
diff --git a/tensorflow/python/ops/image_grad.py b/tensorflow/python/ops/image_grad.py
index 093843cd5bc0b7c2281a0c9ddf52d93ea3faede3..9f43e3f1466d900ae6d39f3b9ef48043421cb777 100644
--- a/tensorflow/python/ops/image_grad.py
+++ b/tensorflow/python/ops/image_grad.py
@@ -41,12 +41,10 @@ def _ResizeNearestNeighborGrad(op, grad):
   else:
     image_shape = array_ops.shape(image)[1:3]
 
-  # pylint: disable=protected-access
-  grads = gen_image_ops._resize_nearest_neighbor_grad(
+  grads = gen_image_ops.resize_nearest_neighbor_grad(
       grad,
       image_shape,
       align_corners=op.get_attr("align_corners"))
-  # pylint: enable=protected-access
   return [grads, None]
 
 
@@ -61,10 +59,8 @@ def _ResizeBilinearGrad(op, grad):
   Returns:
     The gradients w.r.t. the input.
   """
-  # pylint: disable=protected-access
-  grad0 = gen_image_ops._resize_bilinear_grad(
+  grad0 = gen_image_ops.resize_bilinear_grad(
       grad, op.inputs[0], align_corners=op.get_attr("align_corners"))
-  # pylint: enable=protected-access
   return [grad0, None]
 
 
@@ -82,10 +78,8 @@ def _ResizeBicubicGrad(op, grad):
   allowed_types = [dtypes.float32, dtypes.float64]
   grad0 = None
   if op.inputs[0].dtype in allowed_types:
-    # pylint: disable=protected-access
-    grad0 = gen_image_ops._resize_bicubic_grad(
+    grad0 = gen_image_ops.resize_bicubic_grad(
         grad, op.inputs[0], align_corners=op.get_attr("align_corners"))
-    # pylint: enable=protected-access
   return [grad0, None]
 
 
diff --git a/tensorflow/python/ops/image_ops.py b/tensorflow/python/ops/image_ops.py
index ae52d32fea1c872e588c4122f5e73198e4dfe9ad..68be9ccdd642823e7a9c2294f209accd16f45be5 100644
--- a/tensorflow/python/ops/image_ops.py
+++ b/tensorflow/python/ops/image_ops.py
@@ -69,6 +69,11 @@ See the @{$python/image} guide.
 @@non_max_suppression
 @@sample_distorted_bounding_box
 @@total_variation
+@@psnr
+@@ssim
+@@ssim_multiscale
+@@image_gradients
+@@sobel_edges
 """
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/python/ops/image_ops_impl.py b/tensorflow/python/ops/image_ops_impl.py
index 58c18c6696d64ccca4ebfaa07242d3c7789116e4..3369fe3c9b37ca05311c5548dbfa3228ba04ee80 100644
--- a/tensorflow/python/ops/image_ops_impl.py
+++ b/tensorflow/python/ops/image_ops_impl.py
@@ -18,6 +18,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
@@ -29,6 +31,8 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gen_image_ops
 from tensorflow.python.ops import gen_nn_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn
+from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import string_ops
 from tensorflow.python.ops import variables
@@ -1113,10 +1117,8 @@ def adjust_contrast(images, contrast_factor):
     orig_dtype = images.dtype
     flt_images = convert_image_dtype(images, dtypes.float32)
 
-    # pylint: disable=protected-access
-    adjusted = gen_image_ops._adjust_contrastv2(
+    adjusted = gen_image_ops.adjust_contrastv2(
         flt_images, contrast_factor=contrast_factor, name=name)
-    # pylint: enable=protected-access
 
     return convert_image_dtype(adjusted, orig_dtype, saturate=True)
 
@@ -1730,7 +1732,7 @@ def sample_distorted_bounding_box(image_size,
       Provide as input to `tf.image.draw_bounding_boxes`.
   """
   with ops.name_scope(name, 'sample_distorted_bounding_box'):
-    return gen_image_ops._sample_distorted_bounding_box_v2(  # pylint: disable=protected-access
+    return gen_image_ops.sample_distorted_bounding_box_v2(
         image_size,
         bounding_boxes,
         seed=seed,
@@ -1784,10 +1786,8 @@ def non_max_suppression(boxes,
   """
   with ops.name_scope(name, 'non_max_suppression'):
     iou_threshold = ops.convert_to_tensor(iou_threshold, name='iou_threshold')
-    # pylint: disable=protected-access
-    return gen_image_ops._non_max_suppression_v2(boxes, scores, max_output_size,
-                                                 iou_threshold)
-    # pylint: enable=protected-access
+    return gen_image_ops.non_max_suppression_v2(boxes, scores, max_output_size,
+                                                iou_threshold)
 
 
 _rgb_to_yiq_kernel = [[0.299, 0.59590059,
@@ -1795,6 +1795,7 @@ _rgb_to_yiq_kernel = [[0.299, 0.59590059,
                       [0.114, -0.32134392, 0.31119955]]
 
 
+@tf_export('image.rgb_to_yiq')
 def rgb_to_yiq(images):
   """Converts one or more images from RGB to YIQ.
 
@@ -1820,6 +1821,7 @@ _yiq_to_rgb_kernel = [[1, 1, 1], [0.95598634, -0.27201283, -1.10674021],
                       [0.6208248, -0.64720424, 1.70423049]]
 
 
+@tf_export('image.yiq_to_rgb')
 def yiq_to_rgb(images):
   """Converts one or more images from YIQ to RGB.
 
@@ -1847,6 +1849,7 @@ _rgb_to_yuv_kernel = [[0.299, -0.14714119,
                       [0.114, 0.43601035, -0.10001026]]
 
 
+@tf_export('image.rgb_to_yuv')
 def rgb_to_yuv(images):
   """Converts one or more images from RGB to YUV.
 
@@ -1872,6 +1875,7 @@ _yuv_to_rgb_kernel = [[1, 1, 1], [0, -0.394642334, 2.03206185],
                       [1.13988303, -0.58062185, 0]]
 
 
+@tf_export('image.yuv_to_rgb')
 def yuv_to_rgb(images):
   """Converts one or more images from YUV to RGB.
 
@@ -1892,3 +1896,489 @@ def yuv_to_rgb(images):
       _yuv_to_rgb_kernel, dtype=images.dtype, name='kernel')
   ndims = images.get_shape().ndims
   return math_ops.tensordot(images, kernel, axes=[[ndims - 1], [0]])
+
+
+def _verify_compatible_image_shapes(img1, img2):
+  """Checks if two image tensors are compatible for applying SSIM or PSNR.
+
+  This function checks if two sets of images have ranks at least 3, and if the
+  last three dimensions match.
+
+  Args:
+    img1: Tensor containing the first image batch.
+    img2: Tensor containing the second image batch.
+
+  Returns:
+    A tuple containing: the first tensor shape, the second tensor shape, and a
+    list of control_flow_ops.Assert() ops implementing the checks.
+
+  Raises:
+    ValueError: When static shape check fails.
+  """
+  shape1 = img1.get_shape().with_rank_at_least(3)
+  shape2 = img2.get_shape().with_rank_at_least(3)
+  shape1[-3:].assert_is_compatible_with(shape2[-3:])
+
+  if shape1.ndims is not None and shape2.ndims is not None:
+    for dim1, dim2 in zip(reversed(shape1[:-3]), reversed(shape2[:-3])):
+      if not (dim1 == 1 or dim2 == 1 or dim1.is_compatible_with(dim2)):
+        raise ValueError(
+            'Two images are not compatible: %s and %s' % (shape1, shape2))
+
+  # Now assign shape tensors.
+  shape1, shape2 = array_ops.shape_n([img1, img2])
+
+  # TODO(sjhwang): Check if shape1[:-3] and shape2[:-3] are broadcastable.
+  checks = []
+  checks.append(control_flow_ops.Assert(
+      math_ops.greater_equal(array_ops.size(shape1), 3),
+      [shape1, shape2], summarize=10))
+  checks.append(control_flow_ops.Assert(
+      math_ops.reduce_all(math_ops.equal(shape1[-3:], shape2[-3:])),
+      [shape1, shape2], summarize=10))
+  return shape1, shape2, checks
+
+
+@tf_export('image.psnr')
+def psnr(a, b, max_val, name=None):
+  """Returns the Peak Signal-to-Noise Ratio between a and b.
+
+  This is intended to be used on signals (or images). Produces a PSNR value for
+  each image in batch.
+
+  The last three dimensions of input are expected to be [height, width, depth].
+
+  Example:
+
+  ```python
+      # Read images from file.
+      im1 = tf.image.decode_png(tf.read_file('path/to/im1.png'))
+      im2 = tf.image.decode_png(tf.read_file('path/to/im2.png'))
+      # Compute PSNR over tf.uint8 Tensors.
+      psnr1 = tf.image.psnr(im1, im2, max_val=255)
+
+      # Compute PSNR over tf.float32 Tensors.
+      im1 = tf.image.convert_image_dtype(im1, tf.float32)
+      im2 = tf.image.convert_image_dtype(im2, tf.float32)
+      psnr2 = tf.image.psnr(im1, im2, max_val=1.0)
+      # psnr1 and psnr2 both have type tf.float32 and are almost equal.
+  ```
+
+  Arguments:
+    a: First set of images.
+    b: Second set of images.
+    max_val: The dynamic range of the images (i.e., the difference between the
+      maximum and the minimum allowed values).
+    name: Namespace to embed the computation in.
+
+  Returns:
+    The PSNR between a and b, one value per image. The returned tensor has
+    type `tf.float32` and shape [batch_size].
+  """
+  with ops.name_scope(name, 'PSNR', [a, b]):
+    # Need to convert the images to float32.  Scale max_val accordingly so that
+    # PSNR is computed correctly.
+    max_val = math_ops.cast(max_val, a.dtype)
+    max_val = convert_image_dtype(max_val, dtypes.float32)
+    a = convert_image_dtype(a, dtypes.float32)
+    b = convert_image_dtype(b, dtypes.float32)
+    mse = math_ops.reduce_mean(math_ops.squared_difference(a, b), [-3, -2, -1])
+    psnr_val = math_ops.subtract(
+        20 * math_ops.log(max_val) / math_ops.log(10.0),
+        np.float32(10 / np.log(10)) * math_ops.log(mse),
+        name='psnr')
+
+    _, _, checks = _verify_compatible_image_shapes(a, b)
+    with ops.control_dependencies(checks):
+      return array_ops.identity(psnr_val)
+
+_SSIM_K1 = 0.01
+_SSIM_K2 = 0.03
+
+
+def _ssim_helper(x, y, reducer, max_val, compensation=1.0):
+  r"""Helper function for computing SSIM.
+
+  SSIM estimates covariances with weighted sums.  The default parameters
+  use a biased estimate of the covariance:
+  Suppose `reducer` is a weighted sum, then the mean estimators are
+    \mu_x = \sum_i w_i x_i,
+    \mu_y = \sum_i w_i y_i,
+  where w_i's are the weighted-sum weights, and covariance estimator is
+    cov_{xy} = \sum_i w_i (x_i - \mu_x) (y_i - \mu_y)
+  with assumption \sum_i w_i = 1. This covariance estimator is biased, since
+    E[cov_{xy}] = (1 - \sum_i w_i ^ 2) Cov(X, Y).
+  For SSIM measure with unbiased covariance estimators, pass as `compensation`
+  argument (1 - \sum_i w_i ^ 2).
+
+  Arguments:
+    x: First set of images.
+    y: Second set of images.
+    reducer: Function that computes 'local' averages from set of images.
+      For non-convolutional version, this is usually tf.reduce_mean(x, [1, 2]),
+      and for convolutional version, this is usually tf.nn.avg_pool or
+      tf.nn.conv2d with weighted-sum kernel.
+    max_val: The dynamic range (i.e., the difference between the maximum
+      possible allowed value and the minimum allowed value).
+    compensation: Compensation factor. See above.
+
+  Returns:
+    A pair containing the luminance measure, and the contrast-structure measure.
+  """
+  c1 = (_SSIM_K1 * max_val) ** 2
+  c2 = (_SSIM_K2 * max_val) ** 2
+
+  # SSIM luminance measure is
+  # (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1).
+  mean0 = reducer(x)
+  mean1 = reducer(y)
+  num0 = mean0 * mean1 * 2.0
+  den0 = math_ops.square(mean0) + math_ops.square(mean1)
+  luminance = (num0 + c1) / (den0 + c1)
+
+  # SSIM contrast-structure measure is
+  #   (2 * cov_{xy} + c2) / (cov_{xx} + cov_{yy} + c2).
+  # If `reducer` is a weighted sum with weights w_i and \sum_i w_i = 1, then
+  #   cov_{xy} = \sum_i w_i (x_i - \mu_x) (y_i - \mu_y)
+  #          = \sum_i w_i x_i y_i - (\sum_i w_i x_i) (\sum_j w_j y_j).
+  num1 = reducer(x * y) * 2.0
+  den1 = reducer(math_ops.square(x) + math_ops.square(y))
+  c2 *= compensation
+  cs = (num1 - num0 + c2) / (den1 - den0 + c2)
+
+  # SSIM score is the product of the luminance and contrast-structure measures.
+  return luminance, cs
+
+
+def _fspecial_gauss(size, sigma):
+  """Function to mimic the 'fspecial' gaussian MATLAB function."""
+  size = ops.convert_to_tensor(size, dtypes.int32)
+  sigma = ops.convert_to_tensor(sigma)
+
+  coords = math_ops.cast(math_ops.range(size), sigma.dtype)
+  coords -= math_ops.cast(size - 1, sigma.dtype) / 2.0
+
+  g = math_ops.square(coords)
+  g *= -0.5 / math_ops.square(sigma)
+
+  g = array_ops.reshape(g, shape=[1, -1]) + array_ops.reshape(g, shape=[-1, 1])
+  g = array_ops.reshape(g, shape=[1, -1])  # For tf.nn.softmax().
+  g = nn_ops.softmax(g)
+  return array_ops.reshape(g, shape=[size, size, 1, 1])
+
+
+def _ssim_per_channel(img1, img2, max_val=1.0):
+  """Computes SSIM index between img1 and img2 per color channel.
+
+  This function matches the standard SSIM implementation from:
+  Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image
+  quality assessment: from error visibility to structural similarity. IEEE
+  transactions on image processing.
+
+  Details:
+    - 11x11 Gaussian filter of width 1.5 is used.
+    - k1 = 0.01, k2 = 0.03 as in the original paper.
+
+  Args:
+    img1: First image batch.
+    img2: Second image batch.
+    max_val: The dynamic range of the images (i.e., the difference between the
+      maximum and the minimum allowed values).
+
+  Returns:
+    A pair of tensors containing channel-wise SSIM and contrast-structure
+    values. The shape is [..., channels].
+  """
+  filter_size = constant_op.constant(11, dtype=dtypes.int32)
+  filter_sigma = constant_op.constant(1.5, dtype=img1.dtype)
+
+  shape1, shape2 = array_ops.shape_n([img1, img2])
+  checks = [
+      control_flow_ops.Assert(math_ops.reduce_all(math_ops.greater_equal(
+          shape1[-3:-1], filter_size)), [shape1, filter_size], summarize=8),
+      control_flow_ops.Assert(math_ops.reduce_all(math_ops.greater_equal(
+          shape2[-3:-1], filter_size)), [shape2, filter_size], summarize=8)]
+
+  # Enforce the check to run before computation.
+  with ops.control_dependencies(checks):
+    img1 = array_ops.identity(img1)
+
+  # TODO(sjhwang): Try to cache kernels and compensation factor.
+  kernel = _fspecial_gauss(filter_size, filter_sigma)
+  kernel = array_ops.tile(kernel, multiples=[1, 1, shape1[-1], 1])
+
+  # The correct compensation factor is `1.0 - tf.reduce_sum(tf.square(kernel))`,
+  # but to match MATLAB implementation of MS-SSIM, we use 1.0 instead.
+  compensation = 1.0
+
+  # TODO(sjhwang): Try FFT.
+  # TODO(sjhwang): Gaussian kernel is separable in space. Consider applying
+  #   1-by-n and n-by-1 Gaussian filters instead of an n-by-n filter.
+  def reducer(x):
+    shape = array_ops.shape(x)
+    x = array_ops.reshape(x, shape=array_ops.concat([[-1], shape[-3:]], 0))
+    y = nn.depthwise_conv2d(x, kernel, strides=[1, 1, 1, 1], padding='VALID')
+    return array_ops.reshape(y, array_ops.concat([shape[:-3],
+                                                  array_ops.shape(y)[1:]], 0))
+
+  luminance, cs = _ssim_helper(img1, img2, reducer, max_val, compensation)
+
+  # Average over the second and the third from the last: height, width.
+  axes = constant_op.constant([-3, -2], dtype=dtypes.int32)
+  ssim_val = math_ops.reduce_mean(luminance * cs, axes)
+  cs = math_ops.reduce_mean(cs, axes)
+  return ssim_val, cs
+
+
+@tf_export('image.ssim')
+def ssim(img1, img2, max_val):
+  """Computes SSIM index between img1 and img2.
+
+  This function is based on the standard SSIM implementation from:
+  Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image
+  quality assessment: from error visibility to structural similarity. IEEE
+  transactions on image processing.
+
+  Note: The true SSIM is only defined on grayscale.  This function does not
+  perform any colorspace transform.  (If input is already YUV, then it will
+  compute YUV SSIM average.)
+
+  Details:
+    - 11x11 Gaussian filter of width 1.5 is used.
+    - k1 = 0.01, k2 = 0.03 as in the original paper.
+
+  The image sizes must be at least 11x11 because of the filter size.
+
+  Example:
+
+  ```python
+      # Read images from file.
+      im1 = tf.decode_png('path/to/im1.png')
+      im2 = tf.decode_png('path/to/im2.png')
+      # Compute SSIM over tf.uint8 Tensors.
+      ssim1 = tf.image.ssim(im1, im2, max_val=255)
+
+      # Compute SSIM over tf.float32 Tensors.
+      im1 = tf.image.convert_image_dtype(im1, tf.float32)
+      im2 = tf.image.convert_image_dtype(im2, tf.float32)
+      ssim2 = tf.image.ssim(im1, im2, max_val=1.0)
+      # ssim1 and ssim2 both have type tf.float32 and are almost equal.
+  ```
+
+  Args:
+    img1: First image batch.
+    img2: Second image batch.
+    max_val: The dynamic range of the images (i.e., the difference between the
+      maximum and the minimum allowed values).
+
+  Returns:
+    A tensor containing an SSIM value for each image in batch.  Returned SSIM
+    values are in range (-1, 1], when pixel values are non-negative. Returns
+    a tensor with shape: broadcast(img1.shape[:-3], img2.shape[:-3]).
+  """
+  _, _, checks = _verify_compatible_image_shapes(img1, img2)
+  with ops.control_dependencies(checks):
+    img1 = array_ops.identity(img1)
+
+  # Need to convert the images to float32.  Scale max_val accordingly so that
+  # SSIM is computed correctly.
+  max_val = math_ops.cast(max_val, img1.dtype)
+  max_val = convert_image_dtype(max_val, dtypes.float32)
+  img1 = convert_image_dtype(img1, dtypes.float32)
+  img2 = convert_image_dtype(img2, dtypes.float32)
+  ssim_per_channel, _ = _ssim_per_channel(img1, img2, max_val)
+  # Compute average over color channels.
+  return math_ops.reduce_mean(ssim_per_channel, [-1])
+
+
+# Default values obtained by Wang et al.
+_MSSSIM_WEIGHTS = (0.0448, 0.2856, 0.3001, 0.2363, 0.1333)
+
+
+@tf_export('image.ssim_multiscale')
+def ssim_multiscale(img1, img2, max_val, power_factors=_MSSSIM_WEIGHTS):
+  """Computes the MS-SSIM between img1 and img2.
+
+  This function assumes that `img1` and `img2` are image batches, i.e. the last
+  three dimensions are [height, width, channels].
+
+  Note: The true SSIM is only defined on grayscale.  This function does not
+  perform any colorspace transform.  (If input is already YUV, then it will
+  compute YUV SSIM average.)
+
+  Original paper: Wang, Zhou, Eero P. Simoncelli, and Alan C. Bovik. "Multiscale
+  structural similarity for image quality assessment." Signals, Systems and
+  Computers, 2004.
+
+  Arguments:
+    img1: First image batch.
+    img2: Second image batch. Must have the same rank as img1.
+    max_val: The dynamic range of the images (i.e., the difference between the
+      maximum and the minimum allowed values).
+    power_factors: Iterable of weights for each of the scales. The number of
+      scales used is the length of the list. Index 0 is the unscaled
+      resolution's weight and each increasing scale corresponds to the image
+      being downsampled by 2.  Defaults to (0.0448, 0.2856, 0.3001, 0.2363,
+      0.1333), which are the values obtained in the original paper.
+
+  Returns:
+    A tensor containing an MS-SSIM value for each image in batch.  The values
+    are in range [0, 1].  Returns a tensor with shape:
+    broadcast(img1.shape[:-3], img2.shape[:-3]).
+  """
+  # Shape checking.
+  shape1 = img1.get_shape().with_rank_at_least(3)
+  shape2 = img2.get_shape().with_rank_at_least(3)
+  shape1[-3:].merge_with(shape2[-3:])
+
+  with ops.name_scope(None, 'MS-SSIM', [img1, img2]):
+    shape1, shape2, checks = _verify_compatible_image_shapes(img1, img2)
+    with ops.control_dependencies(checks):
+      img1 = array_ops.identity(img1)
+
+    # Need to convert the images to float32.  Scale max_val accordingly so that
+    # SSIM is computed correctly.
+    max_val = math_ops.cast(max_val, img1.dtype)
+    max_val = convert_image_dtype(max_val, dtypes.float32)
+    img1 = convert_image_dtype(img1, dtypes.float32)
+    img2 = convert_image_dtype(img2, dtypes.float32)
+
+    imgs = [img1, img2]
+    shapes = [shape1, shape2]
+
+    # img1 and img2 are assumed to be a (multi-dimensional) batch of
+    # 3-dimensional images (height, width, channels). `heads` contain the batch
+    # dimensions, and `tails` contain the image dimensions.
+    heads = [s[:-3] for s in shapes]
+    tails = [s[-3:] for s in shapes]
+
+    divisor = [1, 2, 2, 1]
+    divisor_tensor = constant_op.constant(divisor[1:], dtype=dtypes.int32)
+
+    def do_pad(images, remainder):
+      padding = array_ops.expand_dims(remainder, -1)
+      padding = array_ops.pad(padding, [[1, 0], [1, 0]])
+      return [array_ops.pad(x, padding, mode='SYMMETRIC') for x in images]
+
+    mcs = []
+    for k in range(len(power_factors)):
+      with ops.name_scope(None, 'Scale%d' % k, imgs):
+        if k > 0:
+          # Avg pool takes rank 4 tensors. Flatten leading dimensions.
+          flat_imgs = [
+              array_ops.reshape(x, array_ops.concat([[-1], t], 0))
+              for x, t in zip(imgs, tails)
+          ]
+
+          remainder = tails[0] % divisor_tensor
+          need_padding = math_ops.reduce_any(math_ops.not_equal(remainder, 0))
+          # pylint: disable=cell-var-from-loop
+          padded = control_flow_ops.cond(need_padding,
+                                         lambda: do_pad(flat_imgs, remainder),
+                                         lambda: flat_imgs)
+          # pylint: enable=cell-var-from-loop
+
+          downscaled = [nn_ops.avg_pool(x, ksize=divisor, strides=divisor,
+                                        padding='VALID')
+                        for x in padded]
+          tails = [x[1:] for x in array_ops.shape_n(downscaled)]
+          imgs = [
+              array_ops.reshape(x, array_ops.concat([h, t], 0))
+              for x, h, t in zip(downscaled, heads, tails)
+          ]
+
+        # Overwrite previous ssim value since we only need the last one.
+        ssim_per_channel, cs = _ssim_per_channel(*imgs, max_val=max_val)
+        mcs.append(nn_ops.relu(cs))
+
+    # In the MS-SSIM calculation, luminance l(p) is used only at the last
+    # (coarsest) scale, where l(p) * cs(p) is ssim(p).
+    mcs.pop()  # Remove the cs score for the last scale.
+    mcs_and_ssim = array_ops.stack(mcs + [nn_ops.relu(ssim_per_channel)],
+                                   axis=-1)
+    # Take weighted geometric mean across the scale axis.
+    ms_ssim = math_ops.reduce_prod(math_ops.pow(mcs_and_ssim, power_factors),
+                                   [-1])
+
+    return math_ops.reduce_mean(ms_ssim, [-1])  # Avg over color channels.
+
+
+@tf_export('image.image_gradients')
+def image_gradients(image):
+  """Returns image gradients (dy, dx) for each color channel.
+
+  Both output tensors have the same shape as the input: [batch_size, h, w,
+  d]. The gradient values are organized so that [I(x+1, y) - I(x, y)] is in
+  location (x, y). That means that dy will always have zeros in the last row,
+  and dx will always have zeros in the last column.
+
+  Arguments:
+    image: Tensor with shape [batch_size, h, w, d].
+
+  Returns:
+    Pair of tensors (dy, dx) holding the vertical and horizontal image
+    gradients (1-step finite difference).
+
+  Raises:
+    ValueError: If `image` is not a 4D tensor.
+  """
+  if image.get_shape().ndims != 4:
+    raise ValueError('image_gradients expects a 4D tensor '
+                     '[batch_size, h, w, d], not %s.' % image.get_shape())
+  image_shape = array_ops.shape(image)
+  batch_size, height, width, depth = array_ops.unstack(image_shape)
+  dy = image[:, 1:, :, :] - image[:, :-1, :, :]
+  dx = image[:, :, 1:, :] - image[:, :, :-1, :]
+
+  # Return tensors with same size as original image by concatenating
+  # zeros. Place the gradient [I(x+1,y) - I(x,y)] on the base pixel (x, y).
+  shape = array_ops.stack([batch_size, 1, width, depth])
+  dy = array_ops.concat([dy, array_ops.zeros(shape, image.dtype)], 1)
+  dy = array_ops.reshape(dy, image_shape)
+
+  shape = array_ops.stack([batch_size, height, 1, depth])
+  dx = array_ops.concat([dx, array_ops.zeros(shape, image.dtype)], 2)
+  dx = array_ops.reshape(dx, image_shape)
+
+  return dy, dx
+
+
+@tf_export('image.sobel_edges')
+def sobel_edges(image):
+  """Returns a tensor holding Sobel edge maps.
+
+  Arguments:
+    image: Image tensor with shape [batch_size, h, w, d] and type float32 or
+      float64.  The image(s) must be 2x2 or larger.
+
+  Returns:
+    Tensor holding edge maps for each channel. Returns a tensor with shape
+    [batch_size, h, w, d, 2] where the last two dimensions hold [[dy[0], dx[0]],
+    [dy[1], dx[1]], ..., [dy[d-1], dx[d-1]]] calculated using the Sobel filter.
+  """
+  # Define vertical and horizontal Sobel filters.
+  static_image_shape = image.get_shape()
+  image_shape = array_ops.shape(image)
+  kernels = [[[-1, -2, -1], [0, 0, 0], [1, 2, 1]],
+             [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]]
+  num_kernels = len(kernels)
+  kernels = np.transpose(np.asarray(kernels), (1, 2, 0))
+  kernels = np.expand_dims(kernels, -2)
+  kernels_tf = constant_op.constant(kernels, dtype=image.dtype)
+
+  kernels_tf = array_ops.tile(kernels_tf, [1, 1, image_shape[-1], 1],
+                              name='sobel_filters')
+
+  # Use depth-wise convolution to calculate edge maps per channel.
+  pad_sizes = [[0, 0], [1, 1], [1, 1], [0, 0]]
+  padded = array_ops.pad(image, pad_sizes, mode='REFLECT')
+
+  # Output tensor has shape [batch_size, h, w, d * num_kernels].
+  strides = [1, 1, 1, 1]
+  output = nn.depthwise_conv2d(padded, kernels_tf, strides, 'VALID')
+
+  # Reshape to [batch_size, h, w, d, num_kernels].
+  shape = array_ops.concat([image_shape, [num_kernels]], 0)
+  output = array_ops.reshape(output, shape=shape)
+  output.set_shape(static_image_shape.concatenate([num_kernels]))
+  return output
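Note on `_fspecial_gauss` in the hunk above: applying softmax to the summed log-weights `-0.5 * (x^2 + y^2) / sigma^2` is the same as exponentiating them and dividing by their total, so the result is a Gaussian window whose entries sum to one. The following is a minimal pure-Python sketch of that equivalence, not the TensorFlow implementation. The names `gaussian_kernel` and `compensation` are illustrative assumptions, and the sketch returns a plain nested list of `size` rows rather than a `[size, size, 1, 1]` tensor.

```python
import math


def gaussian_kernel(size, sigma):
    """Returns a size-by-size Gaussian window normalized to sum to one.

    Hypothetical helper for illustration only; it mirrors the softmax
    formulation used by `_fspecial_gauss` with plain Python lists.
    """
    center = (size - 1) / 2.0
    # 1-D log-weights: -0.5 * d^2 / sigma^2 for each offset d from the center.
    log_w = [-0.5 * (i - center) ** 2 / sigma ** 2 for i in range(size)]
    # The outer sum of the 1-D log-weights gives the 2-D log-weights.
    logits = [[a + b for b in log_w] for a in log_w]
    # Softmax over all entries: shift by the peak, exponentiate, normalize.
    peak = max(max(row) for row in logits)
    weights = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


# Same filter size and sigma as `_ssim_per_channel` in the patch.
kernel = gaussian_kernel(11, 1.5)

# The factor the source comment calls the correct compensation,
# 1 - sum_i w_i^2. The patch itself uses 1.0 to match MATLAB.
compensation = 1.0 - sum(w * w for row in kernel for w in row)
```

Because the weights sum to one, the `reducer` built from this kernel satisfies the `\sum_i w_i = 1` assumption stated in the `_ssim_helper` docstring.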
diff --git a/tensorflow/python/ops/image_ops_test.py b/tensorflow/python/ops/image_ops_test.py
index b8c4b27c162acdd86d88da641ff8afffaa5a9e6a..c437c12c2744792eaee197bf7d2a5f2b75d280bf 100644
--- a/tensorflow/python/ops/image_ops_test.py
+++ b/tensorflow/python/ops/image_ops_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 import colorsys
 import functools
+import itertools
 import math
 import os
 import time
@@ -37,7 +38,9 @@ from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gen_image_ops
+from tensorflow.python.ops import gradients
 from tensorflow.python.ops import image_ops
+from tensorflow.python.ops import image_ops_impl
 from tensorflow.python.ops import io_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
@@ -3328,5 +3331,420 @@ class NonMaxSuppressionTest(test_util.TensorFlowTestCase):
       image_ops.non_max_suppression(boxes, scores, 3, [[0.5]])
 
 
+class VerifyCompatibleImageShapesTest(test_util.TensorFlowTestCase):
+  """Tests utility function used by ssim() and psnr()."""
+
+  def testWrongDims(self):
+    img = array_ops.placeholder(dtype=dtypes.float32)
+    img_np = np.array((2, 2))
+
+    with self.test_session(use_gpu=True) as sess:
+      _, _, checks = image_ops_impl._verify_compatible_image_shapes(img, img)
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(checks, {img: img_np})
+
+  def testShapeMismatch(self):
+    img1 = array_ops.placeholder(dtype=dtypes.float32)
+    img2 = array_ops.placeholder(dtype=dtypes.float32)
+
+    img1_np = np.array([1, 2, 2, 1])
+    img2_np = np.array([1, 3, 3, 1])
+
+    with self.test_session(use_gpu=True) as sess:
+      _, _, checks = image_ops_impl._verify_compatible_image_shapes(img1, img2)
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(checks, {img1: img1_np, img2: img2_np})
+
+
+class PSNRTest(test_util.TensorFlowTestCase):
+  """Tests for PSNR."""
+
+  def _LoadTestImage(self, sess, filename):
+    content = io_ops.read_file(os.path.join(
+        "tensorflow/core/lib/psnr/testdata", filename))
+    im = image_ops.decode_jpeg(content, dct_method="INTEGER_ACCURATE")
+    im = image_ops.convert_image_dtype(im, dtypes.float32)
+    im, = sess.run([im])
+    return np.expand_dims(im, axis=0)
+
+  def _LoadTestImages(self):
+    with self.test_session(use_gpu=True) as sess:
+      q20 = self._LoadTestImage(sess, "cat_q20.jpg")
+      q72 = self._LoadTestImage(sess, "cat_q72.jpg")
+      q95 = self._LoadTestImage(sess, "cat_q95.jpg")
+      return q20, q72, q95
+
+  def _PSNR_NumPy(self, orig, target, max_value):
+    """Numpy implementation of PSNR."""
+    mse = ((orig - target) ** 2).mean(axis=(-3, -2, -1))
+    return 20 * np.log10(max_value) - 10 * np.log10(mse)
+
+  def _RandomImage(self, shape, max_val):
+    """Returns an image or image batch with given shape."""
+    return np.random.rand(*shape).astype(np.float32) * max_val
+
+  def testPSNRSingleImage(self):
+    image1 = self._RandomImage((8, 8, 1), 1)
+    image2 = self._RandomImage((8, 8, 1), 1)
+    psnr = self._PSNR_NumPy(image1, image2, 1)
+
+    with self.test_session(use_gpu=True):
+      tf_image1 = constant_op.constant(image1, shape=image1.shape,
+                                       dtype=dtypes.float32)
+      tf_image2 = constant_op.constant(image2, shape=image2.shape,
+                                       dtype=dtypes.float32)
+      tf_psnr = image_ops.psnr(tf_image1, tf_image2, 1.0, "psnr").eval()
+      self.assertAllClose(psnr, tf_psnr, atol=0.001)
+
+  def testPSNRMultiImage(self):
+    image1 = self._RandomImage((10, 8, 8, 1), 1)
+    image2 = self._RandomImage((10, 8, 8, 1), 1)
+    psnr = self._PSNR_NumPy(image1, image2, 1)
+
+    with self.test_session(use_gpu=True):
+      tf_image1 = constant_op.constant(image1, shape=image1.shape,
+                                       dtype=dtypes.float32)
+      tf_image2 = constant_op.constant(image2, shape=image2.shape,
+                                       dtype=dtypes.float32)
+      tf_psnr = image_ops.psnr(tf_image1, tf_image2, 1, "psnr").eval()
+      self.assertAllClose(psnr, tf_psnr, atol=0.001)
+
+  def testGoldenPSNR(self):
+    q20, q72, q95 = self._LoadTestImages()
+
+    # Verify NumPy implementation first.
+    # Golden values are generated using GNU Octave's psnr() function.
+    psnr1 = self._PSNR_NumPy(q20, q72, 1)
+    self.assertNear(30.321, psnr1, 0.001, msg="q20.dtype=" + str(q20.dtype))
+    psnr2 = self._PSNR_NumPy(q20, q95, 1)
+    self.assertNear(29.994, psnr2, 0.001)
+    psnr3 = self._PSNR_NumPy(q72, q95, 1)
+    self.assertNear(35.302, psnr3, 0.001)
+
+    # Test TensorFlow implementation.
+    with self.test_session(use_gpu=True):
+      tf_q20 = constant_op.constant(q20, shape=q20.shape, dtype=dtypes.float32)
+      tf_q72 = constant_op.constant(q72, shape=q72.shape, dtype=dtypes.float32)
+      tf_q95 = constant_op.constant(q95, shape=q95.shape, dtype=dtypes.float32)
+      tf_psnr1 = image_ops.psnr(tf_q20, tf_q72, 1, "psnr1").eval()
+      tf_psnr2 = image_ops.psnr(tf_q20, tf_q95, 1, "psnr2").eval()
+      tf_psnr3 = image_ops.psnr(tf_q72, tf_q95, 1, "psnr3").eval()
+      self.assertAllClose(psnr1, tf_psnr1, atol=0.001)
+      self.assertAllClose(psnr2, tf_psnr2, atol=0.001)
+      self.assertAllClose(psnr3, tf_psnr3, atol=0.001)
+
+  def testInfinity(self):
+    q20, _, _ = self._LoadTestImages()
+    psnr = self._PSNR_NumPy(q20, q20, 1)
+    with self.test_session(use_gpu=True):
+      tf_q20 = constant_op.constant(q20, shape=q20.shape, dtype=dtypes.float32)
+      tf_psnr = image_ops.psnr(tf_q20, tf_q20, 1, "psnr").eval()
+      self.assertAllClose(psnr, tf_psnr, atol=0.001)
+
+  def testInt(self):
+    img1 = self._RandomImage((10, 8, 8, 1), 255)
+    img2 = self._RandomImage((10, 8, 8, 1), 255)
+    img1 = constant_op.constant(img1, dtypes.uint8)
+    img2 = constant_op.constant(img2, dtypes.uint8)
+    psnr_uint8 = image_ops.psnr(img1, img2, 255)
+    img1 = image_ops.convert_image_dtype(img1, dtypes.float32)
+    img2 = image_ops.convert_image_dtype(img2, dtypes.float32)
+    psnr_float32 = image_ops.psnr(img1, img2, 1.0)
+    with self.test_session(use_gpu=True):
+      self.assertAllClose(psnr_uint8.eval(), psnr_float32.eval(), atol=0.001)
+
+
+class SSIMTest(test_util.TensorFlowTestCase):
+  """Tests for SSIM."""
+
+  _filenames = ["checkerboard1.png",
+                "checkerboard2.png",
+                "checkerboard3.png",]
+
+  _ssim = np.asarray([[1.000000, 0.230880, 0.231153],
+                      [0.230880, 1.000000, 0.996828],
+                      [0.231153, 0.996828, 1.000000]])
+
+  def _LoadTestImage(self, sess, filename):
+    content = io_ops.read_file(os.path.join(
+        "tensorflow/core/lib/ssim/testdata", filename))
+    im = image_ops.decode_png(content)
+    im = image_ops.convert_image_dtype(im, dtypes.float32)
+    im, = sess.run([im])
+    return np.expand_dims(im, axis=0)
+
+  def _LoadTestImages(self):
+    with self.test_session(use_gpu=True) as sess:
+      return [self._LoadTestImage(sess, f) for f in self._filenames]
+
+  def _RandomImage(self, shape, max_val):
+    """Returns an image or image batch with given shape."""
+    return np.random.rand(*shape).astype(np.float32) * max_val
+
+  def testAgainstMatlab(self):
+    """Tests against values produced by Matlab."""
+    img = self._LoadTestImages()
+    expected = self._ssim[np.triu_indices(3)]
+
+    ph = [array_ops.placeholder(dtype=dtypes.float32) for _ in range(2)]
+    ssim = image_ops.ssim(*ph, max_val=1.0)
+    with self.test_session(use_gpu=True):
+      scores = [ssim.eval(dict(zip(ph, t)))
+                for t in itertools.combinations_with_replacement(img, 2)]
+    self.assertAllClose(expected, np.squeeze(scores), atol=1e-4)
+
+  def testBatch(self):
+    img = self._LoadTestImages()
+    expected = self._ssim[np.triu_indices(3, k=1)]
+
+    img1, img2 = zip(*itertools.combinations(img, 2))
+    img1 = np.concatenate(img1)
+    img2 = np.concatenate(img2)
+
+    ssim = image_ops.ssim(constant_op.constant(img1),
+                          constant_op.constant(img2), 1.0)
+    with self.test_session(use_gpu=True):
+      self.assertAllClose(expected, ssim.eval(), atol=1e-4)
+
+  def testBroadcast(self):
+    img = self._LoadTestImages()[:2]
+    expected = self._ssim[:2, :2]
+
+    img = constant_op.constant(np.concatenate(img))
+    img1 = array_ops.expand_dims(img, axis=0)  # batch dims: 1, 2.
+    img2 = array_ops.expand_dims(img, axis=1)  # batch dims: 2, 1.
+
+    ssim = image_ops.ssim(img1, img2, 1.0)
+    with self.test_session(use_gpu=True):
+      self.assertAllClose(expected, ssim.eval(), atol=1e-4)
+
+  def testNegative(self):
+    """Tests against negative SSIM index."""
+    step = np.expand_dims(np.arange(0, 256, 16, dtype=np.uint8), axis=0)
+    img1 = np.tile(step, (16, 1))
+    img2 = np.fliplr(img1)
+
+    img1 = img1.reshape((1, 16, 16, 1))
+    img2 = img2.reshape((1, 16, 16, 1))
+
+    ssim = image_ops.ssim(constant_op.constant(img1),
+                          constant_op.constant(img2), 255)
+    with self.test_session(use_gpu=True):
+      self.assertLess(ssim.eval(), 0)
+
+  def testInt(self):
+    img1 = self._RandomImage((1, 16, 16, 3), 255)
+    img2 = self._RandomImage((1, 16, 16, 3), 255)
+    img1 = constant_op.constant(img1, dtypes.uint8)
+    img2 = constant_op.constant(img2, dtypes.uint8)
+    ssim_uint8 = image_ops.ssim(img1, img2, 255)
+    img1 = image_ops.convert_image_dtype(img1, dtypes.float32)
+    img2 = image_ops.convert_image_dtype(img2, dtypes.float32)
+    ssim_float32 = image_ops.ssim(img1, img2, 1.0)
+    with self.test_session(use_gpu=True):
+      self.assertAllClose(ssim_uint8.eval(), ssim_float32.eval(), atol=0.001)
+
+
+class MultiscaleSSIMTest(test_util.TensorFlowTestCase):
+  """Tests for MS-SSIM."""
+
+  _filenames = ["checkerboard1.png",
+                "checkerboard2.png",
+                "checkerboard3.png",]
+
+  _msssim = np.asarray([[1.000000, 0.091016, 0.091025],
+                        [0.091016, 1.000000, 0.999567],
+                        [0.091025, 0.999567, 1.000000]])
+
+  def _LoadTestImage(self, sess, filename):
+    content = io_ops.read_file(os.path.join(
+        "tensorflow/core/lib/ssim/testdata", filename))
+    im = image_ops.decode_png(content)
+    im = image_ops.convert_image_dtype(im, dtypes.float32)
+    im, = sess.run([im])
+    return np.expand_dims(im, axis=0)
+
+  def _LoadTestImages(self):
+    with self.test_session(use_gpu=True) as sess:
+      return [self._LoadTestImage(sess, f) for f in self._filenames]
+
+  def _RandomImage(self, shape, max_val):
+    """Returns an image or image batch with given shape."""
+    return np.random.rand(*shape).astype(np.float32) * max_val
+
+  def testAgainstMatlab(self):
+    """Tests against MS-SSIM computed with Matlab implementation.
+
+    For color images, MS-SSIM scores are averaged over color channels.
+    """
+    img = self._LoadTestImages()
+    expected = self._msssim[np.triu_indices(3)]
+
+    ph = [array_ops.placeholder(dtype=dtypes.float32) for _ in range(2)]
+    msssim = image_ops.ssim_multiscale(*ph, max_val=1.0)
+    with self.test_session(use_gpu=True):
+      scores = [msssim.eval(dict(zip(ph, t)))
+                for t in itertools.combinations_with_replacement(img, 2)]
+
+    self.assertAllClose(expected, np.squeeze(scores), atol=1e-4)
+
+  def testUnweightedIsDifferentiable(self):
+    img = self._LoadTestImages()
+    ph = [array_ops.placeholder(dtype=dtypes.float32) for _ in range(2)]
+    scalar = constant_op.constant(1.0, dtype=dtypes.float32)
+    scaled_ph = [x * scalar for x in ph]
+    msssim = image_ops.ssim_multiscale(*scaled_ph, max_val=1.0,
+                                       power_factors=(1, 1, 1, 1, 1))
+    grads = gradients.gradients(msssim, scalar)
+    with self.test_session(use_gpu=True) as sess:
+      np_grads = sess.run(grads, feed_dict={ph[0]: img[0], ph[1]: img[1]})
+    self.assertTrue(np.isfinite(np_grads).all())
+
+  def testBatch(self):
+    """Tests MS-SSIM computed in batch."""
+    img = self._LoadTestImages()
+    expected = self._msssim[np.triu_indices(3, k=1)]
+
+    img1, img2 = zip(*itertools.combinations(img, 2))
+    img1 = np.concatenate(img1)
+    img2 = np.concatenate(img2)
+
+    msssim = image_ops.ssim_multiscale(constant_op.constant(img1),
+                                       constant_op.constant(img2), 1.0)
+    with self.test_session(use_gpu=True):
+      self.assertAllClose(expected, msssim.eval(), 1e-4)
+
+  def testBroadcast(self):
+    """Tests MS-SSIM broadcasting."""
+    img = self._LoadTestImages()[:2]
+    expected = self._msssim[:2, :2]
+
+    img = constant_op.constant(np.concatenate(img))
+    img1 = array_ops.expand_dims(img, axis=0)  # batch dims: 1, 2.
+    img2 = array_ops.expand_dims(img, axis=1)  # batch dims: 2, 1.
+
+    score_tensor = image_ops.ssim_multiscale(img1, img2, 1.0)
+    with self.test_session(use_gpu=True):
+      self.assertAllClose(expected, score_tensor.eval(), 1e-4)
+
+  def testRange(self):
+    """Tests against low MS-SSIM score.
+
+    MS-SSIM is a geometric mean of SSIM and CS scores of various scales.
+    If any of these values is negative, the geometric mean is not well
+    defined, and the MS-SSIM score is treated as zero.
+    """
+    with self.test_session(use_gpu=True) as sess:
+      img1 = self._LoadTestImage(sess, "checkerboard1.png")
+      img2 = self._LoadTestImage(sess, "checkerboard3.png")
+      images = [img1, img2, np.zeros_like(img1),
+                np.full_like(img1, fill_value=255)]
+
+      images = [ops.convert_to_tensor(x, dtype=dtypes.float32) for x in images]
+      msssim_ops = [image_ops.ssim_multiscale(x, y, 1.0)
+                    for x, y in itertools.combinations(images, 2)]
+      msssim = sess.run(msssim_ops)
+      msssim = np.squeeze(msssim)
+
+    self.assertTrue(np.all(msssim >= 0.0))
+    self.assertTrue(np.all(msssim <= 1.0))
+
+  def testInt(self):
+    img1 = self._RandomImage((1, 180, 240, 3), 255)
+    img2 = self._RandomImage((1, 180, 240, 3), 255)
+    img1 = constant_op.constant(img1, dtypes.uint8)
+    img2 = constant_op.constant(img2, dtypes.uint8)
+    ssim_uint8 = image_ops.ssim_multiscale(img1, img2, 255)
+    img1 = image_ops.convert_image_dtype(img1, dtypes.float32)
+    img2 = image_ops.convert_image_dtype(img2, dtypes.float32)
+    ssim_float32 = image_ops.ssim_multiscale(img1, img2, 1.0)
+    with self.test_session(use_gpu=True):
+      self.assertAllClose(ssim_uint8.eval(), ssim_float32.eval(), atol=0.001)
+
+
+class ImageGradientsTest(test_util.TensorFlowTestCase):
+
+  def testImageGradients(self):
+    shape = [1, 2, 4, 1]
+    img = constant_op.constant([[1, 3, 4, 2], [8, 7, 5, 6]])
+    img = array_ops.reshape(img, shape)
+
+    expected_dy = np.reshape([[7, 4, 1, 4], [0, 0, 0, 0]], shape)
+    expected_dx = np.reshape([[2, 1, -2, 0], [-1, -2, 1, 0]], shape)
+
+    dy, dx = image_ops.image_gradients(img)
+    with self.test_session():
+      actual_dy = dy.eval()
+      actual_dx = dx.eval()
+      self.assertAllClose(expected_dy, actual_dy)
+      self.assertAllClose(expected_dx, actual_dx)
+
+  def testImageGradientsMultiChannelBatch(self):
+    batch = [[[[1, 2], [2, 5], [3, 3]],
+              [[8, 4], [5, 1], [9, 8]]],
+             [[[5, 3], [7, 9], [1, 6]],
+              [[1, 2], [6, 3], [6, 3]]]]
+
+    expected_dy = [[[[7, 2], [3, -4], [6, 5]],
+                    [[0, 0], [0, 0], [0, 0]]],
+                   [[[-4, -1], [-1, -6], [5, -3]],
+                    [[0, 0], [0, 0], [0, 0]]]]
+
+    expected_dx = [[[[1, 3], [1, -2], [0, 0]],
+                    [[-3, -3], [4, 7], [0, 0]]],
+                   [[[2, 6], [-6, -3], [0, 0]],
+                    [[5, 1], [0, 0], [0, 0]]]]
+
+    batch = constant_op.constant(batch)
+    assert batch.get_shape().as_list() == [2, 2, 3, 2]
+    dy, dx = image_ops.image_gradients(batch)
+    with self.test_session(use_gpu=True):
+      actual_dy = dy.eval()
+      actual_dx = dx.eval()
+      self.assertAllClose(expected_dy, actual_dy)
+      self.assertAllClose(expected_dx, actual_dx)
+
+  def testImageGradientsBadShape(self):
+    # [2 x 4] image but missing batch and depth dimensions.
+    img = constant_op.constant([[1, 3, 4, 2], [8, 7, 5, 6]])
+    with self.assertRaises(ValueError):
+      image_ops.image_gradients(img)
+
+
+class SobelEdgesTest(test_util.TensorFlowTestCase):
+
+  def testSobelEdges1x2x3x1(self):
+    img = constant_op.constant([[1, 3, 6], [4, 1, 5]],
+                               dtype=dtypes.float32, shape=[1, 2, 3, 1])
+    expected = np.reshape([[[0, 0], [0, 12], [0, 0]],
+                           [[0, 0], [0, 12], [0, 0]]], [1, 2, 3, 1, 2])
+    sobel = image_ops.sobel_edges(img)
+    with self.test_session(use_gpu=True):
+      actual_sobel = sobel.eval()
+      self.assertAllClose(expected, actual_sobel)
+
+  def testSobelEdges5x3x4x2(self):
+    batch_size = 5
+    plane = np.reshape([[1, 3, 6, 2], [4, 1, 5, 7], [2, 5, 1, 4]],
+                       [1, 3, 4, 1])
+    two_channel = np.concatenate([plane, plane], axis=3)
+    batch = np.concatenate([two_channel] * batch_size, axis=0)
+    img = constant_op.constant(batch, dtype=dtypes.float32,
+                               shape=[batch_size, 3, 4, 2])
+
+    expected_plane = np.reshape([[[0, 0], [0, 12], [0, 10], [0, 0]],
+                                 [[6, 0], [0, 6], [-6, 10], [-6, 0]],
+                                 [[0, 0], [0, 0], [0, 10], [0, 0]]],
+                                [1, 3, 4, 1, 2])
+    expected_two_channel = np.concatenate(
+        [expected_plane, expected_plane], axis=3)
+    expected_batch = np.concatenate([expected_two_channel] * batch_size, axis=0)
+
+    sobel = image_ops.sobel_edges(img)
+    with self.test_session(use_gpu=True):
+      actual_sobel = sobel.eval()
+      self.assertAllClose(expected_batch, actual_sobel)
+
+
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/python/ops/init_ops.py b/tensorflow/python/ops/init_ops.py
index c7502d0fda5c38079362d30877a917e3965e6ca0..40ab22951b1aa04a61e09aac155b6449ae358d7b 100644
--- a/tensorflow/python/ops/init_ops.py
+++ b/tensorflow/python/ops/init_ops.py
@@ -542,6 +542,62 @@ class Orthogonal(Initializer):
     return {"gain": self.gain, "seed": self.seed, "dtype": self.dtype.name}
 
 
+class ConvolutionDeltaOrthogonal(Initializer):
+  """Initializer that generates a delta orthogonal kernel for ConvNets.
+
+  The shape of the tensor must have length 3, 4 or 5. The number of input
+  filters must not exceed the number of output filters. The center pixels of the
+  tensor form an orthogonal matrix. All other pixels are set to zero.
+
+  Args:
+    gain: multiplicative factor to apply to the orthogonal matrix. Default is 1.
+      The 2-norm of an input is multiplied by a factor of `sqrt(gain)` after
+      applying this convolution.
+    dtype: The type of the output.
+    seed: A Python integer. Used to create random seeds. See
+      @{tf.set_random_seed}
+      for behavior.
+  """
+
+  def __init__(self, gain=1.0, seed=None, dtype=dtypes.float32):
+    self.gain = gain
+    self.dtype = _assert_float_dtype(dtypes.as_dtype(dtype))
+    self.seed = seed
+
+  def __call__(self, shape, dtype=None, partition_info=None):
+    if dtype is None:
+      dtype = self.dtype
+    # Check the shape
+    if len(shape) < 3 or len(shape) > 5:
+      raise ValueError("The tensor to initialize must be at least "
+                       "three-dimensional and at most five-dimensional")
+
+    if shape[-2] > shape[-1]:
+      raise ValueError("In_filters cannot be greater than out_filters.")
+
+    # Generate a random matrix
+    a = random_ops.random_normal([shape[-1], shape[-1]],
+                                 dtype=dtype, seed=self.seed)
+    # Compute the qr factorization
+    q, _ = linalg_ops.qr(a, full_matrices=False)
+    q = q[:shape[-2], :]
+    q *= math_ops.sqrt(math_ops.cast(self.gain, dtype=dtype))
+    if len(shape) == 3:
+      weight = array_ops.scatter_nd([[(shape[0]-1)//2]],
+                                    array_ops.expand_dims(q, 0), shape)
+    elif len(shape) == 4:
+      weight = array_ops.scatter_nd([[(shape[0]-1)//2, (shape[1]-1)//2]],
+                                    array_ops.expand_dims(q, 0), shape)
+    else:
+      weight = array_ops.scatter_nd([[(shape[0]-1)//2, (shape[1]-1)//2,
+                                      (shape[2]-1)//2]],
+                                    array_ops.expand_dims(q, 0), shape)
+    return weight
+
+  def get_config(self):
+    return {"gain": self.gain, "seed": self.seed, "dtype": self.dtype.name}
+
+
 @tf_export("keras.initializers.Identity", "initializers.identity")
 class Identity(Initializer):
   """Initializer that generates the identity matrix.
@@ -586,7 +642,7 @@ uniform_unit_scaling_initializer = UniformUnitScaling
 variance_scaling_initializer = VarianceScaling
 orthogonal_initializer = Orthogonal
 identity_initializer = Identity
-
+convolutional_delta_orthogonal = ConvolutionDeltaOrthogonal
 # pylint: enable=invalid-name
 
 
diff --git a/tensorflow/python/ops/initializers_ns.py b/tensorflow/python/ops/initializers_ns.py
index c21079f2971a4bdd76b4be1a803055c12b243903..e7996efe93eb2f33306a52ded91c273009192789 100644
--- a/tensorflow/python/ops/initializers_ns.py
+++ b/tensorflow/python/ops/initializers_ns.py
@@ -39,5 +39,8 @@ global_variables = _variables.global_variables_initializer
 local_variables = _variables.local_variables_initializer
 
 # Seal API.
+del absolute_import
+del division
+del print_function
 del init_ops
 del _variables
diff --git a/tensorflow/python/ops/io_ops.py b/tensorflow/python/ops/io_ops.py
index 5e70b3186f382a0c795b1795b2db27bb2058ee41..f6a25610c5a2ee8b76d06e286365cb957ab643cd 100644
--- a/tensorflow/python/ops/io_ops.py
+++ b/tensorflow/python/ops/io_ops.py
@@ -111,10 +111,10 @@ def _save(filename, tensor_names, tensors, tensor_slices=None, name="save"):
     An Operation that saves the tensors.
   """
   if tensor_slices is None:
-    return gen_io_ops._save(filename, tensor_names, tensors, name=name)
+    return gen_io_ops.save(filename, tensor_names, tensors, name=name)
   else:
-    return gen_io_ops._save_slices(filename, tensor_names, tensor_slices,
-                                   tensors, name=name)
+    return gen_io_ops.save_slices(filename, tensor_names, tensor_slices,
+                                  tensors, name=name)
 
 
 def _restore_slice(file_pattern, tensor_name, shape_and_slice, tensor_type,
@@ -136,7 +136,7 @@ def _restore_slice(file_pattern, tensor_name, shape_and_slice, tensor_type,
     A tensor of type "tensor_type".
   """
   base_type = dtypes.as_dtype(tensor_type).base_dtype
-  return gen_io_ops._restore_slice(
+  return gen_io_ops.restore_slice(
       file_pattern, tensor_name, shape_and_slice, base_type,
       preferred_shard, name=name)
 
@@ -173,7 +173,7 @@ class ReaderBase(object):
     Raises:
       RuntimeError: If eager execution is enabled.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError(
           "Readers are not supported when eager execution is enabled. "
           "Instead, please use tf.data to get data into your model.")
@@ -208,12 +208,12 @@ class ReaderBase(object):
     else:
       queue_ref = queue.queue_ref
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_read_v2(self._reader_ref, queue_ref, name=name)
+      return gen_io_ops.reader_read_v2(self._reader_ref, queue_ref, name=name)
     else:
       # For compatibility with pre-resource queues, create a ref(string) tensor
       # which can be looked up as the same queue by a resource manager.
-      old_queue_op = gen_data_flow_ops._fake_queue(queue_ref)
-      return gen_io_ops._reader_read(self._reader_ref, old_queue_op, name=name)
+      old_queue_op = gen_data_flow_ops.fake_queue(queue_ref)
+      return gen_io_ops.reader_read(self._reader_ref, old_queue_op, name=name)
 
   def read_up_to(self, queue, num_records,  # pylint: disable=invalid-name
                  name=None):
@@ -240,18 +240,18 @@ class ReaderBase(object):
     else:
       queue_ref = queue.queue_ref
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_read_up_to_v2(self._reader_ref,
-                                              queue_ref,
-                                              num_records,
-                                              name=name)
+      return gen_io_ops.reader_read_up_to_v2(self._reader_ref,
+                                             queue_ref,
+                                             num_records,
+                                             name=name)
     else:
       # For compatibility with pre-resource queues, create a ref(string) tensor
       # which can be looked up as the same queue by a resource manager.
-      old_queue_op = gen_data_flow_ops._fake_queue(queue_ref)
-      return gen_io_ops._reader_read_up_to(self._reader_ref,
-                                           old_queue_op,
-                                           num_records,
-                                           name=name)
+      old_queue_op = gen_data_flow_ops.fake_queue(queue_ref)
+      return gen_io_ops.reader_read_up_to(self._reader_ref,
+                                          old_queue_op,
+                                          num_records,
+                                          name=name)
 
   def num_records_produced(self, name=None):
     """Returns the number of records this reader has produced.
@@ -267,11 +267,11 @@ class ReaderBase(object):
 
     """
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_num_records_produced_v2(self._reader_ref,
-                                                        name=name)
+      return gen_io_ops.reader_num_records_produced_v2(self._reader_ref,
+                                                       name=name)
     else:
-      return gen_io_ops._reader_num_records_produced(self._reader_ref,
-                                                     name=name)
+      return gen_io_ops.reader_num_records_produced(self._reader_ref,
+                                                    name=name)
 
   def num_work_units_completed(self, name=None):
     """Returns the number of work units this reader has finished processing.
@@ -283,11 +283,11 @@ class ReaderBase(object):
       An int64 Tensor.
     """
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_num_work_units_completed_v2(self._reader_ref,
-                                                            name=name)
+      return gen_io_ops.reader_num_work_units_completed_v2(self._reader_ref,
+                                                           name=name)
     else:
-      return gen_io_ops._reader_num_work_units_completed(self._reader_ref,
-                                                         name=name)
+      return gen_io_ops.reader_num_work_units_completed(self._reader_ref,
+                                                        name=name)
 
   def serialize_state(self, name=None):
     """Produce a string tensor that encodes the state of a reader.
@@ -302,9 +302,9 @@ class ReaderBase(object):
       A string Tensor.
     """
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_serialize_state_v2(self._reader_ref, name=name)
+      return gen_io_ops.reader_serialize_state_v2(self._reader_ref, name=name)
     else:
-      return gen_io_ops._reader_serialize_state(self._reader_ref, name=name)
+      return gen_io_ops.reader_serialize_state(self._reader_ref, name=name)
 
   def restore_state(self, state, name=None):
     """Restore a reader to a previously saved state.
@@ -321,11 +321,10 @@ class ReaderBase(object):
       The created Operation.
     """
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_restore_state_v2(
+      return gen_io_ops.reader_restore_state_v2(
           self._reader_ref, state, name=name)
     else:
-      return gen_io_ops._reader_restore_state(
-          self._reader_ref, state, name=name)
+      return gen_io_ops.reader_restore_state(self._reader_ref, state, name=name)
 
   @property
   def supports_serialize(self):
@@ -342,9 +341,9 @@ class ReaderBase(object):
       The created Operation.
     """
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_reset_v2(self._reader_ref, name=name)
+      return gen_io_ops.reader_reset_v2(self._reader_ref, name=name)
     else:
-      return gen_io_ops._reader_reset(self._reader_ref, name=name)
+      return gen_io_ops.reader_reset(self._reader_ref, name=name)
 
 
 ops.NotDifferentiable("ReaderRead")
@@ -377,7 +376,7 @@ class WholeFileReader(ReaderBase):
     Args:
       name: A name for the operation (optional).
     """
-    rr = gen_io_ops._whole_file_reader_v2(name=name)
+    rr = gen_io_ops.whole_file_reader_v2(name=name)
     super(WholeFileReader, self).__init__(rr, supports_serialize=True)
 
 
@@ -406,8 +405,8 @@ class TextLineReader(ReaderBase):
         to skip from the beginning of every file.
       name: A name for the operation (optional).
     """
-    rr = gen_io_ops._text_line_reader_v2(skip_header_lines=skip_header_lines,
-                                         name=name)
+    rr = gen_io_ops.text_line_reader_v2(skip_header_lines=skip_header_lines,
+                                        name=name)
     super(TextLineReader, self).__init__(rr)
 
 
@@ -444,7 +443,7 @@ class FixedLengthRecordReader(ReaderBase):
       name: A name for the operation (optional).
       encoding: The type of encoding for the file. Defaults to none.
     """
-    rr = gen_io_ops._fixed_length_record_reader_v2(
+    rr = gen_io_ops.fixed_length_record_reader_v2(
         record_bytes=record_bytes,
         header_bytes=header_bytes,
         footer_bytes=footer_bytes,
@@ -480,7 +479,7 @@ class TFRecordReader(ReaderBase):
     compression_type = python_io.TFRecordOptions.get_compression_type_string(
         options)
 
-    rr = gen_io_ops._tf_record_reader_v2(
+    rr = gen_io_ops.tf_record_reader_v2(
         name=name, compression_type=compression_type)
     super(TFRecordReader, self).__init__(rr)
 
@@ -506,7 +505,7 @@ class LMDBReader(ReaderBase):
       name: A name for the operation (optional).
       options: A LMDBRecordOptions object (optional).
     """
-    rr = gen_io_ops._lmdb_reader(name=name)
+    rr = gen_io_ops.lmdb_reader(name=name)
     super(LMDBReader, self).__init__(rr)
 
 
@@ -534,7 +533,7 @@ class IdentityReader(ReaderBase):
     Args:
       name: A name for the operation (optional).
     """
-    rr = gen_io_ops._identity_reader_v2(name=name)
+    rr = gen_io_ops.identity_reader_v2(name=name)
     super(IdentityReader, self).__init__(rr, supports_serialize=True)
 
 
diff --git a/tensorflow/python/ops/linalg/linalg_impl.py b/tensorflow/python/ops/linalg/linalg_impl.py
index d5bd916f80d8a03e5423c43d1ca039bc4dceff5e..8343c62816c6aeadc77dae701ae9917a86e68954 100644
--- a/tensorflow/python/ops/linalg/linalg_impl.py
+++ b/tensorflow/python/ops/linalg/linalg_impl.py
@@ -31,18 +31,19 @@ band_part = array_ops.matrix_band_part
 cholesky = linalg_ops.cholesky
 cholesky_solve = linalg_ops.cholesky_solve
 det = linalg_ops.matrix_determinant
-# pylint: disable=protected-access
-slogdet = gen_linalg_ops._log_matrix_determinant
-# pylint: disable=protected-access
+slogdet = gen_linalg_ops.log_matrix_determinant
+tf_export('linalg.slogdet')(slogdet)
 diag = array_ops.matrix_diag
 diag_part = array_ops.matrix_diag_part
 eigh = linalg_ops.self_adjoint_eig
 eigvalsh = linalg_ops.self_adjoint_eigvals
 einsum = special_math_ops.einsum
-expm = gen_linalg_ops._matrix_exponential
+expm = gen_linalg_ops.matrix_exponential
+tf_export('linalg.expm')(expm)
 eye = linalg_ops.eye
 inv = linalg_ops.matrix_inverse
-logm = gen_linalg_ops._matrix_logarithm
+logm = gen_linalg_ops.matrix_logarithm
+tf_export('linalg.logm')(logm)
 lstsq = linalg_ops.matrix_solve_ls
 norm = linalg_ops.norm
 qr = linalg_ops.qr
diff --git a/tensorflow/python/ops/linalg/linear_operator.py b/tensorflow/python/ops/linalg/linear_operator.py
index 957a7959181efe3bbc319e62582053329b763dc3..c7513d5b40c5a4bb11501c90e08a9dc3a38c2e09 100644
--- a/tensorflow/python/ops/linalg/linear_operator.py
+++ b/tensorflow/python/ops/linalg/linear_operator.py
@@ -204,16 +204,6 @@ class LinearOperator(object):
     self._is_positive_definite = is_positive_definite
     self._name = name or type(self).__name__
 
-    # We will cache some tensors to avoid repeatedly adding shape
-    # manipulation ops to the graph.
-    # Naming convention:
-    #   self._cached_X_tensor is the cached version of self._X_tensor.
-    self._cached_shape_tensor = None
-    self._cached_batch_shape_tensor = None
-    self._cached_domain_dimension_tensor = None
-    self._cached_range_dimension_tensor = None
-    self._cached_tensor_rank_tensor = None
-
   @contextlib.contextmanager
   def _name_scope(self, name=None, values=None):
     """Helper function to standardize op scope."""
@@ -299,15 +289,11 @@ class LinearOperator(object):
       `int32` `Tensor`
     """
     with self._name_scope(name):
-      # Be clean by avoiding adding shape Ops to the graph too many times.
-      if self._cached_shape_tensor is None:
-        # Prefer to use statically defined shape if available.
-        if self.shape.is_fully_defined():
-          self._cached_shape_tensor = linear_operator_util.shape_tensor(
-              self.shape.as_list())
-        else:
-          self._cached_shape_tensor = self._shape_tensor()
-      return self._cached_shape_tensor
+      # Prefer to use statically defined shape if available.
+      if self.shape.is_fully_defined():
+        return linear_operator_util.shape_tensor(self.shape.as_list())
+      else:
+        return self._shape_tensor()
 
   @property
   def batch_shape(self):
@@ -338,14 +324,12 @@ class LinearOperator(object):
     """
     # Derived classes get this "for free" once .shape() is implemented.
     with self._name_scope(name):
-      if self._cached_batch_shape_tensor is None:
-        # Prefer to use statically defined shape if available.
-        if self.batch_shape.is_fully_defined():
-          self._cached_batch_shape_tensor = linear_operator_util.shape_tensor(
-              self.batch_shape.as_list(), name="batch_shape")
-        else:
-          self._cached_batch_shape_tensor = self.shape_tensor()[:-2]
-      return self._cached_batch_shape_tensor
+      # Prefer to use statically defined shape if available.
+      if self.batch_shape.is_fully_defined():
+        return linear_operator_util.shape_tensor(
+            self.batch_shape.as_list(), name="batch_shape")
+      else:
+        return self.shape_tensor()[:-2]
 
   @property
   def tensor_rank(self, name="tensor_rank"):
@@ -378,14 +362,11 @@ class LinearOperator(object):
     """
     # Derived classes get this "for free" once .shape() is implemented.
     with self._name_scope(name):
-      if self._cached_tensor_rank_tensor is None:
-        # Prefer to use statically defined shape if available.
-        if self.tensor_rank is not None:
-          self._cached_tensor_rank_tensor = ops.convert_to_tensor(
-              self.tensor_rank)
-        else:
-          self._cached_tensor_rank_tensor = array_ops.size(self.shape_tensor())
-      return self._cached_tensor_rank_tensor
+      # Prefer to use statically defined shape if available.
+      if self.tensor_rank is not None:
+        return ops.convert_to_tensor(self.tensor_rank)
+      else:
+        return array_ops.size(self.shape_tensor())
 
   @property
   def domain_dimension(self):
@@ -416,14 +397,11 @@ class LinearOperator(object):
     """
     # Derived classes get this "for free" once .shape() is implemented.
     with self._name_scope(name):
-      if self._cached_domain_dimension_tensor is None:
-        # Prefer to use statically defined shape if available.
-        if self.domain_dimension.value is not None:
-          self._cached_domain_dimension_tensor = ops.convert_to_tensor(
-              self.domain_dimension.value)
-        else:
-          self._cached_domain_dimension_tensor = self.shape_tensor()[-1]
-      return self._cached_domain_dimension_tensor
+      # Prefer to use statically defined shape if available.
+      if self.domain_dimension.value is not None:
+        return ops.convert_to_tensor(self.domain_dimension.value)
+      else:
+        return self.shape_tensor()[-1]
 
   @property
   def range_dimension(self):
@@ -454,14 +432,11 @@ class LinearOperator(object):
     """
     # Derived classes get this "for free" once .shape() is implemented.
     with self._name_scope(name):
-      if self._cached_range_dimension_tensor is None:
-        # Prefer to use statically defined shape if available.
-        if self.range_dimension.value is not None:
-          self._cached_range_dimension_tensor = ops.convert_to_tensor(
-              self.range_dimension.value)
-        else:
-          self._cached_range_dimension_tensor = self.shape_tensor()[-2]
-      return self._cached_range_dimension_tensor
+      # Prefer to use statically defined shape if available.
+      if self.range_dimension.value is not None:
+        return ops.convert_to_tensor(self.range_dimension.value)
+      else:
+        return self.shape_tensor()[-2]
 
   def _assert_non_singular(self):
     """Private default implementation of _assert_non_singular."""
@@ -471,8 +446,7 @@ class LinearOperator(object):
     if self._can_use_cholesky():
       return self.assert_positive_definite()
     else:
-      singular_values = linalg_ops.svd(
-          self._get_cached_dense_matrix(), compute_uv=False)
+      singular_values = linalg_ops.svd(self.to_dense(), compute_uv=False)
       # TODO(langmore) Add .eig and .cond as methods.
       cond = (math_ops.reduce_max(singular_values, axis=-1) /
               math_ops.reduce_min(singular_values, axis=-1))
@@ -524,7 +498,7 @@ class LinearOperator(object):
     # and sufficient.
     if self.is_self_adjoint:
       return check_ops.assert_positive(
-          array_ops.matrix_diag_part(self._get_cached_chol()),
+          array_ops.matrix_diag_part(linalg_ops.cholesky(self.to_dense())),
           message="Matrix was not positive definite.")
     # We have no generic check for positive definite.
     raise NotImplementedError("assert_positive_definite is not implemented.")
@@ -547,7 +521,7 @@ class LinearOperator(object):
       return self._assert_positive_definite()
 
   def _assert_self_adjoint(self):
-    dense = self._get_cached_dense_matrix()
+    dense = self.to_dense()
     logging.warn(
         "Using (possibly slow) default implementation of assert_self_adjoint."
         "  Requires conversion to a dense matrix.")
@@ -692,7 +666,7 @@ class LinearOperator(object):
         "Using (possibly slow) default implementation of determinant."
         "  Requires conversion to a dense matrix and O(N^3) operations.")
     if self._can_use_cholesky():
-      diag = array_ops.matrix_diag_part(self._get_cached_chol())
+      diag = array_ops.matrix_diag_part(linalg_ops.cholesky(self.to_dense()))
       return 2 * math_ops.reduce_sum(math_ops.log(diag), reduction_indices=[-1])
     _, log_abs_det = linalg.slogdet(self._matrix)
     return log_abs_det
@@ -726,9 +700,9 @@ class LinearOperator(object):
         "  Requires conversion to a dense matrix and O(N^3) operations.")
     rhs = linalg.adjoint(rhs) if adjoint_arg else rhs
     if self._can_use_cholesky():
-      return linalg_ops.cholesky_solve(self._get_cached_chol(), rhs)
-    return linalg_ops.matrix_solve(
-        self._get_cached_dense_matrix(), rhs, adjoint=adjoint)
+      return linalg_ops.cholesky_solve(
+          linalg_ops.cholesky(self.to_dense()), rhs)
+    return linalg_ops.matrix_solve(self.to_dense(), rhs, adjoint=adjoint)
 
   def solve(self, rhs, adjoint=False, adjoint_arg=False, name="solve"):
     """Solve (exact or approx) `R` (batch) systems of equations: `A X = rhs`.
@@ -866,7 +840,7 @@ class LinearOperator(object):
 
   def _diag_part(self):
     """Generic and often inefficient implementation.  Override often."""
-    return array_ops.matrix_diag_part(self._get_cached_dense_matrix())
+    return array_ops.matrix_diag_part(self.to_dense())
 
   def diag_part(self, name="diag_part"):
     """Efficiently get the [batch] diagonal part of this operator.
@@ -915,7 +889,7 @@ class LinearOperator(object):
 
   def _add_to_tensor(self, x):
     # Override if a more efficient implementation is available.
-    return self._get_cached_dense_matrix() + x
+    return self.to_dense() + x
 
   def add_to_tensor(self, x, name="add_to_tensor"):
     """Add matrix represented by this operator to `x`.  Equivalent to `A + x`.
@@ -936,13 +910,3 @@ class LinearOperator(object):
     # TODO(langmore) Add complex types when tf.cholesky can use them.
     return (not self.dtype.is_complex and self.is_self_adjoint and
             self.is_positive_definite)
-
-  def _get_cached_dense_matrix(self):
-    if not hasattr(self, "_cached_dense_matrix"):
-      self._cached_dense_matrix = self.to_dense()
-    return self._cached_dense_matrix
-
-  def _get_cached_chol(self):
-    if not hasattr(self, "_cached_chol"):
-      self._cached_chol = linalg_ops.cholesky(self._get_cached_dense_matrix())
-    return self._cached_chol
diff --git a/tensorflow/python/ops/linalg/linear_operator_test_util.py b/tensorflow/python/ops/linalg/linear_operator_test_util.py
index 2c11f90e6d9de280e6020edfaa4d8ef237126705..ce1a112ad584a14298be6e471578858ef31573d5 100644
--- a/tensorflow/python/ops/linalg/linear_operator_test_util.py
+++ b/tensorflow/python/ops/linalg/linear_operator_test_util.py
@@ -35,6 +35,18 @@ from tensorflow.python.ops.linalg import linalg_impl as linalg
 from tensorflow.python.platform import test
 
 
+class OperatorBuildInfo(object):
+  """Object encoding expected shape for a test.
+
+  Encodes the expected shape of a matrix for a test. Also
+  allows additional metadata for the test harness.
+  """
+
+  def __init__(self, shape, **kwargs):
+    self.shape = shape
+    self.__dict__.update(kwargs)
+
+
 @six.add_metaclass(abc.ABCMeta)  # pylint: disable=no-init
 class LinearOperatorDerivedClassTest(test.TestCase):
   """Tests for derived classes.
@@ -84,19 +96,20 @@ class LinearOperatorDerivedClassTest(test.TestCase):
     return [False, True]
 
   @abc.abstractproperty
-  def _shapes_to_test(self):
-    """Returns list of tuples, each is one shape that will be tested."""
-    raise NotImplementedError("shapes_to_test has not been implemented.")
+  def _operator_build_infos(self):
+    """Returns list of OperatorBuildInfo, encapsulating the shape to test."""
+    raise NotImplementedError("operator_build_infos has not been implemented.")
 
   @abc.abstractmethod
-  def _operator_and_mat_and_feed_dict(self, shape, dtype, use_placeholder):
+  def _operator_and_mat_and_feed_dict(self, build_info, dtype, use_placeholder):
     """Build a batch matrix and an Operator that should have similar behavior.
 
     Every operator acts like a (batch) matrix.  This method returns both
     together, and is used by tests.
 
     Args:
-      shape:  List-like of Python integers giving full shape of operator.
+      build_info: `OperatorBuildInfo`, encoding shape information about the
+        operator.
       dtype:  Numpy dtype.  Data type of returned array/operator.
       use_placeholder:  Python bool.  If True, initialize the operator with a
         placeholder of undefined shape and correct dtype.
@@ -164,30 +177,30 @@ class LinearOperatorDerivedClassTest(test.TestCase):
   def test_to_dense(self):
     self._skip_if_tests_to_skip_contains("to_dense")
     for use_placeholder in self._use_placeholder_options:
-      for shape in self._shapes_to_test:
+      for build_info in self._operator_build_infos:
         for dtype in self._dtypes_to_test:
           with self.test_session(graph=ops.Graph()) as sess:
             sess.graph.seed = random_seed.DEFAULT_GRAPH_SEED
             operator, mat, feed_dict = self._operator_and_mat_and_feed_dict(
-                shape, dtype, use_placeholder=use_placeholder)
+                build_info, dtype, use_placeholder=use_placeholder)
             op_dense = operator.to_dense()
             if not use_placeholder:
-              self.assertAllEqual(shape, op_dense.get_shape())
+              self.assertAllEqual(build_info.shape, op_dense.get_shape())
             op_dense_v, mat_v = sess.run([op_dense, mat], feed_dict=feed_dict)
             self.assertAC(op_dense_v, mat_v)
 
   def test_det(self):
     self._skip_if_tests_to_skip_contains("det")
     for use_placeholder in self._use_placeholder_options:
-      for shape in self._shapes_to_test:
+      for build_info in self._operator_build_infos:
         for dtype in self._dtypes_to_test:
           with self.test_session(graph=ops.Graph()) as sess:
             sess.graph.seed = random_seed.DEFAULT_GRAPH_SEED
             operator, mat, feed_dict = self._operator_and_mat_and_feed_dict(
-                shape, dtype, use_placeholder=use_placeholder)
+                build_info, dtype, use_placeholder=use_placeholder)
             op_det = operator.determinant()
             if not use_placeholder:
-              self.assertAllEqual(shape[:-2], op_det.get_shape())
+              self.assertAllEqual(build_info.shape[:-2], op_det.get_shape())
             op_det_v, mat_det_v = sess.run(
                 [op_det, linalg_ops.matrix_determinant(mat)],
                 feed_dict=feed_dict)
@@ -196,16 +209,17 @@ class LinearOperatorDerivedClassTest(test.TestCase):
   def test_log_abs_det(self):
     self._skip_if_tests_to_skip_contains("log_abs_det")
     for use_placeholder in self._use_placeholder_options:
-      for shape in self._shapes_to_test:
+      for build_info in self._operator_build_infos:
         for dtype in self._dtypes_to_test:
           with self.test_session(graph=ops.Graph()) as sess:
             sess.graph.seed = random_seed.DEFAULT_GRAPH_SEED
             operator, mat, feed_dict = self._operator_and_mat_and_feed_dict(
-                shape, dtype, use_placeholder=use_placeholder)
+                build_info, dtype, use_placeholder=use_placeholder)
             op_log_abs_det = operator.log_abs_determinant()
             _, mat_log_abs_det = linalg.slogdet(mat)
             if not use_placeholder:
-              self.assertAllEqual(shape[:-2], op_log_abs_det.get_shape())
+              self.assertAllEqual(
+                  build_info.shape[:-2], op_log_abs_det.get_shape())
             op_log_abs_det_v, mat_log_abs_det_v = sess.run(
                 [op_log_abs_det, mat_log_abs_det], feed_dict=feed_dict)
             self.assertAC(op_log_abs_det_v, mat_log_abs_det_v)
@@ -213,14 +227,14 @@ class LinearOperatorDerivedClassTest(test.TestCase):
   def test_matmul(self):
     self._skip_if_tests_to_skip_contains("matmul")
     for use_placeholder in self._use_placeholder_options:
-      for shape in self._shapes_to_test:
+      for build_info in self._operator_build_infos:
         for dtype in self._dtypes_to_test:
           for adjoint in self._adjoint_options:
             for adjoint_arg in self._adjoint_arg_options:
               with self.test_session(graph=ops.Graph()) as sess:
                 sess.graph.seed = random_seed.DEFAULT_GRAPH_SEED
                 operator, mat, feed_dict = self._operator_and_mat_and_feed_dict(
-                    shape, dtype, use_placeholder=use_placeholder)
+                    build_info, dtype, use_placeholder=use_placeholder)
                 x = self._make_x(operator, adjoint=adjoint)
                 # If adjoint_arg, compute A X^H^H = A X.
                 if adjoint_arg:
@@ -241,14 +255,14 @@ class LinearOperatorDerivedClassTest(test.TestCase):
   def test_solve(self):
     self._skip_if_tests_to_skip_contains("solve")
     for use_placeholder in self._use_placeholder_options:
-      for shape in self._shapes_to_test:
+      for build_info in self._operator_build_infos:
         for dtype in self._dtypes_to_test:
           for adjoint in self._adjoint_options:
             for adjoint_arg in self._adjoint_arg_options:
               with self.test_session(graph=ops.Graph()) as sess:
                 sess.graph.seed = random_seed.DEFAULT_GRAPH_SEED
                 operator, mat, feed_dict = self._operator_and_mat_and_feed_dict(
-                    shape, dtype, use_placeholder=use_placeholder)
+                    build_info, dtype, use_placeholder=use_placeholder)
                 rhs = self._make_rhs(operator, adjoint=adjoint)
                 # If adjoint_arg, solve A X = (rhs^H)^H = rhs.
                 if adjoint_arg:
@@ -270,12 +284,12 @@ class LinearOperatorDerivedClassTest(test.TestCase):
   def test_trace(self):
     self._skip_if_tests_to_skip_contains("trace")
     for use_placeholder in self._use_placeholder_options:
-      for shape in self._shapes_to_test:
+      for build_info in self._operator_build_infos:
         for dtype in self._dtypes_to_test:
           with self.test_session(graph=ops.Graph()) as sess:
             sess.graph.seed = random_seed.DEFAULT_GRAPH_SEED
             operator, mat, feed_dict = self._operator_and_mat_and_feed_dict(
-                shape, dtype, use_placeholder=use_placeholder)
+                build_info, dtype, use_placeholder=use_placeholder)
             op_trace = operator.trace()
             mat_trace = math_ops.trace(mat)
             if not use_placeholder:
@@ -287,16 +301,16 @@ class LinearOperatorDerivedClassTest(test.TestCase):
   def test_add_to_tensor(self):
     self._skip_if_tests_to_skip_contains("add_to_tensor")
     for use_placeholder in self._use_placeholder_options:
-      for shape in self._shapes_to_test:
+      for build_info in self._operator_build_infos:
         for dtype in self._dtypes_to_test:
           with self.test_session(graph=ops.Graph()) as sess:
             sess.graph.seed = random_seed.DEFAULT_GRAPH_SEED
             operator, mat, feed_dict = self._operator_and_mat_and_feed_dict(
-                shape, dtype, use_placeholder=use_placeholder)
+                build_info, dtype, use_placeholder=use_placeholder)
             op_plus_2mat = operator.add_to_tensor(2 * mat)
 
             if not use_placeholder:
-              self.assertAllEqual(shape, op_plus_2mat.get_shape())
+              self.assertAllEqual(build_info.shape, op_plus_2mat.get_shape())
 
             op_plus_2mat_v, mat_v = sess.run(
                 [op_plus_2mat, mat], feed_dict=feed_dict)
@@ -306,12 +320,12 @@ class LinearOperatorDerivedClassTest(test.TestCase):
   def test_diag_part(self):
     self._skip_if_tests_to_skip_contains("diag_part")
     for use_placeholder in self._use_placeholder_options:
-      for shape in self._shapes_to_test:
+      for build_info in self._operator_build_infos:
         for dtype in self._dtypes_to_test:
           with self.test_session(graph=ops.Graph()) as sess:
             sess.graph.seed = random_seed.DEFAULT_GRAPH_SEED
             operator, mat, feed_dict = self._operator_and_mat_and_feed_dict(
-                shape, dtype, use_placeholder=use_placeholder)
+                build_info, dtype, use_placeholder=use_placeholder)
             op_diag_part = operator.diag_part()
             mat_diag_part = array_ops.matrix_diag_part(mat)
 
@@ -334,9 +348,15 @@ class SquareLinearOperatorDerivedClassTest(LinearOperatorDerivedClassTest):
   """
 
   @property
-  def _shapes_to_test(self):
+  def _operator_build_infos(self):
+    build_info = OperatorBuildInfo
     # non-batch operators (n, n) and batch operators.
-    return [(0, 0), (1, 1), (1, 3, 3), (3, 4, 4), (2, 1, 4, 4)]
+    return [
+        build_info((0, 0)),
+        build_info((1, 1)),
+        build_info((1, 3, 3)),
+        build_info((3, 4, 4)),
+        build_info((2, 1, 4, 4))]
 
   def _make_rhs(self, operator, adjoint):
     # This operator is square, so rhs and x will have same shape.
@@ -387,9 +407,15 @@ class NonSquareLinearOperatorDerivedClassTest(LinearOperatorDerivedClassTest):
     return ["solve", "det", "log_abs_det"]
 
   @property
-  def _shapes_to_test(self):
+  def _operator_build_infos(self):
+    build_info = OperatorBuildInfo
     # non-batch operators (n, n) and batch operators.
-    return [(2, 1), (1, 2), (1, 3, 2), (3, 3, 4), (2, 1, 2, 4)]
+    return [
+        build_info((2, 1)),
+        build_info((1, 2)),
+        build_info((1, 3, 2)),
+        build_info((3, 3, 4)),
+        build_info((2, 1, 2, 4))]
 
   def _make_rhs(self, operator, adjoint):
     # TODO(langmore) Add once we're testing solve_ls.
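
Note on the test-util change above: the hunks replace bare shape tuples with `OperatorBuildInfo` objects, so a derived test can attach extra construction details to each shape. The following is a minimal standalone sketch of that pattern, not the TensorFlow test harness itself. The `blocks` attribute and the `square_build_infos` helper are hypothetical names used only for illustration.

```python
class OperatorBuildInfo(object):
  """Holds a shape plus any extra keyword arguments for building an operator."""

  def __init__(self, shape, **kwargs):
    self.shape = shape
    # Every extra keyword argument becomes an attribute on the instance.
    self.__dict__.update(kwargs)


def square_build_infos():
  """Hypothetical helper mirroring the square shapes listed in the patch."""
  build_info = OperatorBuildInfo
  return [
      build_info((0, 0)),
      build_info((1, 1)),
      build_info((1, 3, 3)),
      build_info((3, 4, 4)),
      build_info((2, 1, 4, 4))]


# `blocks` is an assumed extra attribute, shown only to illustrate **kwargs.
info = OperatorBuildInfo((2, 3, 3), blocks=[(1, 1), (2, 2)])
# Tests such as test_det compare against the batch part of the shape.
batch_shape = info.shape[:-2]
```

Tests that previously used `shape` directly now read `build_info.shape`, which is why the assertions in the hunks change from `shape[:-2]` to `build_info.shape[:-2]`.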
diff --git a/tensorflow/python/ops/linalg_ops.py b/tensorflow/python/ops/linalg_ops.py
index 9803eed6aefe072cbe0841dff2de3f640a440dd5..5b4fb4f7c8d56e49bd4a7d2f5078ee978e7a8765 100644
--- a/tensorflow/python/ops/linalg_ops.py
+++ b/tensorflow/python/ops/linalg_ops.py
@@ -248,7 +248,7 @@ def matrix_solve_ls(matrix, rhs, l2_regularizer=0.0, fast=True, name=None):
     and l2_regularizer != 0 due to poor accuracy.
   """
 
-  # pylint: disable=protected-access,long-lambda
+  # pylint: disable=long-lambda
   def _use_composite_impl(fast, tensor_shape):
     """Determines whether to use the composite or specialized CPU kernel.
 
@@ -323,9 +323,8 @@ def matrix_solve_ls(matrix, rhs, l2_regularizer=0.0, fast=True, name=None):
   if _use_composite_impl(fast, tensor_shape):
     return _composite_impl(matrix, rhs, l2_regularizer)
   else:
-    return gen_linalg_ops._matrix_solve_ls(
+    return gen_linalg_ops.matrix_solve_ls(
         matrix, rhs, l2_regularizer, fast=fast, name=name)
-  # pylint: enable=protected-access
 
 
 @tf_export('self_adjoint_eig', 'linalg.eigh')
@@ -342,12 +341,11 @@ def self_adjoint_eig(tensor, name=None):
     name: string, optional name of the operation.
 
   Returns:
-    e: Eigenvalues. Shape is `[..., N]`.
+    e: Eigenvalues. Shape is `[..., N]`. Sorted in non-decreasing order.
     v: Eigenvectors. Shape is `[..., N, N]`. The columns of the inner most
       matrices contain eigenvectors of the corresponding matrices in `tensor`
   """
-  # pylint: disable=protected-access
-  e, v = gen_linalg_ops._self_adjoint_eig_v2(tensor, compute_v=True, name=name)
+  e, v = gen_linalg_ops.self_adjoint_eig_v2(tensor, compute_v=True, name=name)
   return e, v
 
 
@@ -369,8 +367,7 @@ def self_adjoint_eigvals(tensor, name=None):
     e: Eigenvalues. Shape is `[..., N]`. The vector `e[..., :]` contains the `N`
       eigenvalues of `tensor[..., :, :]`.
   """
-  # pylint: disable=protected-access
-  e, _ = gen_linalg_ops._self_adjoint_eig_v2(tensor, compute_v=False, name=name)
+  e, _ = gen_linalg_ops.self_adjoint_eig_v2(tensor, compute_v=False, name=name)
   return e
 
 
@@ -435,10 +432,8 @@ def svd(tensor, full_matrices=False, compute_uv=True, name=None):
   ````
   @end_compatibility
   """
-  # pylint: disable=protected-access
-  s, u, v = gen_linalg_ops._svd(
+  s, u, v = gen_linalg_ops.svd(
       tensor, compute_uv=compute_uv, full_matrices=full_matrices, name=name)
-  # pylint: enable=protected-access
   if compute_uv:
     return math_ops.real(s), u, v
   else:
diff --git a/tensorflow/python/ops/logging_ops.py b/tensorflow/python/ops/logging_ops.py
index 3757109c956dfedc64ac4cda4ad13a4cfa601418..222b8ebc9da6b076f012f8febbd50cc3c4c86c08 100644
--- a/tensorflow/python/ops/logging_ops.py
+++ b/tensorflow/python/ops/logging_ops.py
@@ -109,7 +109,7 @@ def histogram_summary(tag, values, collections=None, name=None):
     buffer.
   """
   with ops.name_scope(name, "HistogramSummary", [tag, values]) as scope:
-    val = gen_logging_ops._histogram_summary(
+    val = gen_logging_ops.histogram_summary(
         tag=tag, values=values, name=scope)
     _Collect(val, collections, [ops.GraphKeys.SUMMARIES])
   return val
@@ -170,7 +170,7 @@ def image_summary(tag, tensor, max_images=3, collections=None, name=None):
     buffer.
   """
   with ops.name_scope(name, "ImageSummary", [tag, tensor]) as scope:
-    val = gen_logging_ops._image_summary(
+    val = gen_logging_ops.image_summary(
         tag=tag, tensor=tensor, max_images=max_images, name=scope)
     _Collect(val, collections, [ops.GraphKeys.SUMMARIES])
   return val
@@ -226,11 +226,12 @@ def audio_summary(tag,
   with ops.name_scope(name, "AudioSummary", [tag, tensor]) as scope:
     sample_rate = ops.convert_to_tensor(sample_rate, dtype=dtypes.float32,
                                         name="sample_rate")
-    val = gen_logging_ops._audio_summary_v2(tag=tag,
-                                            tensor=tensor,
-                                            max_outputs=max_outputs,
-                                            sample_rate=sample_rate,
-                                            name=scope)
+    val = gen_logging_ops.audio_summary_v2(
+        tag=tag,
+        tensor=tensor,
+        max_outputs=max_outputs,
+        sample_rate=sample_rate,
+        name=scope)
     _Collect(val, collections, [ops.GraphKeys.SUMMARIES])
   return val
 
@@ -263,7 +264,7 @@ def merge_summary(inputs, collections=None, name=None):
     buffer resulting from the merging.
   """
   with ops.name_scope(name, "MergeSummary", inputs):
-    val = gen_logging_ops._merge_summary(inputs=inputs, name=name)
+    val = gen_logging_ops.merge_summary(inputs=inputs, name=name)
     _Collect(val, collections, [])
   return val
 
@@ -345,7 +346,7 @@ def scalar_summary(tags, values, collections=None, name=None):
     buffer.
   """
   with ops.name_scope(name, "ScalarSummary", [tags, values]) as scope:
-    val = gen_logging_ops._scalar_summary(tags=tags, values=values, name=scope)
+    val = gen_logging_ops.scalar_summary(tags=tags, values=values, name=scope)
     _Collect(val, collections, [ops.GraphKeys.SUMMARIES])
   return val
 
diff --git a/tensorflow/python/ops/lookup_ops.py b/tensorflow/python/ops/lookup_ops.py
index f539a7bb68da57e31746bc80fb25339a03a4fafe..6f043f60e677eac560004619464905cd616256b2 100644
--- a/tensorflow/python/ops/lookup_ops.py
+++ b/tensorflow/python/ops/lookup_ops.py
@@ -157,10 +157,10 @@ class InitializableLookupTableBase(LookupInterface):
       default_value: The value to use if a key is missing in the table.
       initializer: The table initializer to use.
     """
-    if context.in_graph_mode():
-      name = table_ref.op.name.split("/")[-1]
-    else:
+    if context.executing_eagerly():
       name = context.context().scope_name
+    else:
+      name = table_ref.op.name.split("/")[-1]
     super(InitializableLookupTableBase,
           self).__init__(initializer.key_dtype, initializer.value_dtype,
                          name)
@@ -196,9 +196,7 @@ class InitializableLookupTableBase(LookupInterface):
     """
     with ops.name_scope(name, "%s_Size" % self._name,
                         [self._table_ref]) as scope:
-      # pylint: disable=protected-access
-      return gen_lookup_ops._lookup_table_size_v2(self._table_ref, name=scope)
-      # pylint: enable=protected-access
+      return gen_lookup_ops.lookup_table_size_v2(self._table_ref, name=scope)
 
   def lookup(self, keys, name=None):
     """Looks up `keys` in a table, outputs the corresponding values.
@@ -227,10 +225,8 @@ class InitializableLookupTableBase(LookupInterface):
     with ops.name_scope(name, "%s_Lookup" % self._name,
                         (self._table_ref, key_tensor,
                          self._default_value)) as scope:
-      # pylint: disable=protected-access
-      values = gen_lookup_ops._lookup_table_find_v2(
+      values = gen_lookup_ops.lookup_table_find_v2(
           self._table_ref, key_tensor, self._default_value, name=scope)
-      # pylint: enable=protected-access
 
     values.set_shape(key_tensor.get_shape())
     if isinstance(keys, sparse_tensor.SparseTensor):
@@ -274,13 +270,11 @@ class HashTable(InitializableLookupTableBase):
     """
     with ops.name_scope(name, "hash_table", (initializer,
                                              default_value)) as scope:
-      # pylint: disable=protected-access
-      table_ref = gen_lookup_ops._hash_table_v2(
+      table_ref = gen_lookup_ops.hash_table_v2(
           shared_name=shared_name,
           key_dtype=initializer.key_dtype,
           value_dtype=initializer.value_dtype,
           name=scope)
-      # pylint: enable=protected-access
 
       super(HashTable, self).__init__(table_ref, default_value, initializer)
 
@@ -352,10 +346,8 @@ class KeyValueTensorInitializer(TableInitializerBase):
     with ops.name_scope(
         self._name, values=(table.table_ref, self._keys,
                             self._values)) as scope:
-      # pylint: disable=protected-access
-      init_op = gen_lookup_ops._initialize_table_v2(
+      init_op = gen_lookup_ops.initialize_table_v2(
           table.table_ref, self._keys, self._values, name=scope)
-      # pylint: enable=protected-access
     ops.add_to_collection(ops.GraphKeys.TABLE_INITIALIZERS, init_op)
     return init_op
 
@@ -518,8 +510,7 @@ class TextFileInitializer(TableInitializerBase):
                         (table.table_ref,)) as scope:
       filename = ops.convert_to_tensor(
           self._filename, dtypes.string, name="asset_filepath")
-      # pylint: disable=protected-access
-      init_op = gen_lookup_ops._initialize_table_from_text_file_v2(
+      init_op = gen_lookup_ops.initialize_table_from_text_file_v2(
           table.table_ref,
           filename,
           self._key_index,
@@ -527,11 +518,10 @@ class TextFileInitializer(TableInitializerBase):
           -1 if self._vocab_size is None else self._vocab_size,
           self._delimiter,
           name=scope)
-      # pylint: enable=protected-access
     ops.add_to_collection(ops.GraphKeys.TABLE_INITIALIZERS, init_op)
     # If the filename tensor is anything other than a string constant (e.g., if
     # it is a placeholder) then it does not make sense to track it as an asset.
-    if context.in_graph_mode() and constant_op.is_constant(filename):
+    if not context.executing_eagerly() and constant_op.is_constant(filename):
       ops.add_to_collection(ops.GraphKeys.ASSET_FILEPATHS, filename)
     return init_op
 
diff --git a/tensorflow/python/ops/losses/losses_impl.py b/tensorflow/python/ops/losses/losses_impl.py
index 7386976e93fbb82f38550f50429af878fadda813..0840760810c86a6393ea6b4ab0b9410233275f11 100644
--- a/tensorflow/python/ops/losses/losses_impl.py
+++ b/tensorflow/python/ops/losses/losses_impl.py
@@ -89,14 +89,6 @@ def _safe_div(numerator, denominator, name="value"):
   Returns:
     The element-wise value of the numerator divided by the denominator.
   """
-  if isinstance(denominator, float):
-    if math_ops.equal(denominator, 0.0):
-      return ops.convert_to_tensor(0.0, dtype=numerator.dtype)
-    return math_ops.div(numerator, denominator)
-  if context.in_eager_mode() and denominator._rank() == 0:  # pylint: disable=protected-access
-    if math_ops.equal(denominator, 0.0):
-      return ops.convert_to_tensor(0.0, dtype=numerator.dtype)
-    return math_ops.div(numerator, denominator)
   return array_ops.where(
       math_ops.greater(denominator, 0),
       math_ops.div(numerator, array_ops.where(
@@ -144,7 +136,7 @@ def _num_present(losses, weights, per_batch=False):
       `[batch_size]`. Otherwise, a single scalar tensor is returned.
   """
   if ((isinstance(weights, float) and weights != 0.0) or
-      (context.in_eager_mode() and weights._rank() == 0  # pylint: disable=protected-access
+      (context.executing_eagerly() and weights._rank() == 0  # pylint: disable=protected-access
        and not math_ops.equal(weights, 0.0))):
     return _num_elements(losses)
   with ops.name_scope(None, "num_present", (losses, weights)) as scope:
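
Note on the `_safe_div` hunk above: with the float and eager-scalar special cases removed, only the element-wise `where`-based path remains. The sketch below imitates that behavior in plain Python on lists, under two assumptions that the truncated hunk does not confirm: that the inner `where` replaces zero denominators with one before dividing, and that the false branch of the outer `where` is zero. The `safe_div` function here is a hypothetical stand-in, not the TensorFlow implementation.

```python
def safe_div(numerators, denominators):
  """Element-wise division that yields 0.0 wherever the denominator is <= 0."""
  results = []
  for num, den in zip(numerators, denominators):
    # Assumed inner guard: swap a zero denominator for 1 so the division
    # itself never fails.
    guarded = 1.0 if den == 0 else den
    quotient = num / guarded
    # Outer selection: keep the quotient only for strictly positive
    # denominators, mirroring math_ops.greater(denominator, 0).
    results.append(quotient if den > 0 else 0.0)
  return results


example = safe_div([1.0, 2.0, 3.0], [2.0, 0.0, -1.0])
```

Because the selection tests `denominator > 0` rather than `!= 0`, negative denominators also map to zero in this sketch.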
diff --git a/tensorflow/python/ops/manip_ops.py b/tensorflow/python/ops/manip_ops.py
index 91e15b47b9400f29425af2f186c7c44ee6a5a622..6d335cdc212f368e7667a030791c7b634113a9c6 100644
--- a/tensorflow/python/ops/manip_ops.py
+++ b/tensorflow/python/ops/manip_ops.py
@@ -23,9 +23,11 @@ from __future__ import print_function
 
 from tensorflow.python.ops import gen_manip_ops as _gen_manip_ops
 from tensorflow.python.util.all_util import remove_undocumented
+from tensorflow.python.util.tf_export import tf_export
 
 
 # pylint: disable=protected-access
+@tf_export('manip.roll')
 def roll(input, shift, axis):  # pylint: disable=redefined-builtin
   return _gen_manip_ops.roll(input, shift, axis)
 
diff --git a/tensorflow/python/ops/math_grad.py b/tensorflow/python/ops/math_grad.py
index 69afa618e2fae146f75fd70dee4b04d447c843d3..02e07dc7b1f5fe6a671da967f6d07cef123d3d1e 100644
--- a/tensorflow/python/ops/math_grad.py
+++ b/tensorflow/python/ops/math_grad.py
@@ -41,6 +41,12 @@ def _ArgMaxGrad(op, grad):
   return [None, None]
 
 
+@ops.RegisterGradient("ArgMin")
+def _ArgMinGrad(op, grad):
+  del op, grad
+  return [None, None]
+
+
 @ops.RegisterGradient("Sum")
 def _SumGrad(op, grad):
   """Gradient for Sum."""
@@ -52,10 +58,18 @@ def _SumGrad(op, grad):
     if axes is not None:
       rank = len(input_0_shape)
       if np.array_equal(axes, np.arange(rank)):  # Reduce all dims.
-        grad = array_ops.reshape(grad, [1] * rank)
+        if context.executing_eagerly():
+          ctx = context.context()
+          new_shape = ctx.ones_rank_cache().get(rank)
+          if new_shape is None:
+            new_shape = constant_op.constant([1] * rank, dtype=dtypes.int32)
+            ctx.ones_rank_cache().put(rank, new_shape)
+        else:
+          new_shape = [1] * rank
+        grad = array_ops.reshape(grad, new_shape)
         # If shape is not fully defined (but rank is), we use Shape.
         if None not in input_0_shape:
-          input_shape = input_0_shape
+          input_shape = constant_op.constant(input_0_shape, dtype=dtypes.int32)
         else:
           input_shape = array_ops.shape(op.inputs[0])
         return [array_ops.tile(grad, input_shape), None]
@@ -388,16 +402,14 @@ def _NegGrad(_, grad):
 def _InvGrad(op, grad):
   """Returns -grad * (1 / x^2)."""
   y = op.outputs[0]  # y = 1 / x
-  # pylint: disable=protected-access
-  return gen_math_ops._reciprocal_grad(y, grad)
+  return gen_math_ops.reciprocal_grad(y, grad)
 
 
 @ops.RegisterGradient("Reciprocal")
 def _ReciprocalGrad(op, grad):
   """Returns -grad * (1 / x^2)."""
   y = op.outputs[0]  # y = 1 / x
-  # pylint: disable=protected-access
-  return gen_math_ops._reciprocal_grad(y, grad)
+  return gen_math_ops.reciprocal_grad(y, grad)
 
 
 @ops.RegisterGradient("InvGrad")
@@ -407,8 +419,7 @@ def _InvGradGrad(op, grad):
   with ops.control_dependencies([grad]):
     ca = math_ops.conj(op.inputs[0])
     cg = math_ops.conj(grad)
-    # pylint: disable=protected-access
-    return cg * -2.0 * b * ca, gen_math_ops._reciprocal_grad(ca, grad)
+    return cg * -2.0 * b * ca, gen_math_ops.reciprocal_grad(ca, grad)
 
 
 @ops.RegisterGradient("ReciprocalGrad")
@@ -418,8 +429,7 @@ def _ReciprocalGradGrad(op, grad):
   with ops.control_dependencies([grad]):
     ca = math_ops.conj(op.inputs[0])
     cg = math_ops.conj(grad)
-    # pylint: disable=protected-access
-    return cg * -2.0 * b * ca, gen_math_ops._reciprocal_grad(ca, grad)
+    return cg * -2.0 * b * ca, gen_math_ops.reciprocal_grad(ca, grad)
 
 
 @ops.RegisterGradient("Square")
@@ -428,15 +438,14 @@ def _SquareGrad(op, grad):
   # Added control dependencies to prevent 2*x from being computed too early.
   with ops.control_dependencies([grad]):
     x = math_ops.conj(x)
-    return math_ops.multiply(grad, math_ops.multiply(x, 2.0))
+    y = constant_op.constant(2.0, dtype=x.dtype)
+    return math_ops.multiply(grad, math_ops.multiply(x, y))
 
 
 @ops.RegisterGradient("Sqrt")
 def _SqrtGrad(op, grad):
   y = op.outputs[0]  # y = x^(1/2)
-  # pylint: disable=protected-access
-  return gen_math_ops._sqrt_grad(y, grad)
-  # pylint: enable=protected-access
+  return gen_math_ops.sqrt_grad(y, grad)
 
 
 @ops.RegisterGradient("SqrtGrad")
@@ -452,9 +461,7 @@ def _SqrtGradGrad(op, grad):
 def _RsqrtGrad(op, grad):
   """Returns -0.5 * grad * conj(y)^3."""
   y = op.outputs[0]  # y = x^(-1/2)
-  # pylint: disable=protected-access
-  return gen_math_ops._rsqrt_grad(y, grad)
-  # pylint: enable=protected-access
+  return gen_math_ops.rsqrt_grad(y, grad)
 
 
 @ops.RegisterGradient("RsqrtGrad")
@@ -466,8 +473,7 @@ def _RsqrtGradGrad(op, grad):
     ca = math_ops.conj(a)
     cg = math_ops.conj(grad)
     grad_a = -1.5 * cg * b * math_ops.square(ca)
-    # pylint: disable=protected-access
-    grad_b = gen_math_ops._rsqrt_grad(ca, grad)
+    grad_b = gen_math_ops.rsqrt_grad(ca, grad)
     return grad_a, grad_b
 
 
@@ -532,8 +538,7 @@ def _TanhGrad(op, grad):
   y = op.outputs[0]  # y = tanh(x)
   with ops.control_dependencies([grad]):
     y = math_ops.conj(y)
-    # pylint: disable=protected-access
-    return gen_math_ops._tanh_grad(y, grad)
+    return gen_math_ops.tanh_grad(y, grad)
 
 
 @ops.RegisterGradient("Asinh")
@@ -571,8 +576,7 @@ def _TanhGradGrad(op, grad):
   with ops.control_dependencies([grad]):
     a = math_ops.conj(op.inputs[0])
     b = math_ops.conj(op.inputs[1])
-    # pylint: disable=protected-access
-    return grad * -2.0 * b * a, gen_math_ops._tanh_grad(a, grad)
+    return grad * -2.0 * b * a, gen_math_ops.tanh_grad(a, grad)
 
 
 @ops.RegisterGradient("Erf")
@@ -622,9 +626,7 @@ def _IgammaGrad(op, grad):
   x = op.inputs[1]
   sa = array_ops.shape(a)
   sx = array_ops.shape(x)
-  # pylint: disable=protected-access
-  unused_ra, rx = gen_array_ops._broadcast_gradient_args(sa, sx)
-  # pylint: enable=protected-access
+  unused_ra, rx = gen_array_ops.broadcast_gradient_args(sa, sx)
 
   # Perform operations in log space before summing, because Gamma(a)
   # and Gamma'(a) can grow large.
@@ -651,9 +653,7 @@ def _BetaincGrad(op, grad):
   # versa; so its sufficient to check against shape(a).
   sa = array_ops.shape(a)
   sx = array_ops.shape(x)
-  # pylint: disable=protected-access
-  _, rx = gen_array_ops._broadcast_gradient_args(sa, sx)
-  # pylint: enable=protected-access
+  _, rx = gen_array_ops.broadcast_gradient_args(sa, sx)
 
   # Perform operations in log space before summing, because terms
   # can grow large.
@@ -679,9 +679,7 @@ def _ZetaGrad(op, grad):
   # Broadcast gradients
   sx = array_ops.shape(x)
   sq = array_ops.shape(q)
-  # pylint: disable=protected-access
-  unused_rx, rq = gen_array_ops._broadcast_gradient_args(sx, sq)
-  # pylint: enable=protected-access
+  unused_rx, rq = gen_array_ops.broadcast_gradient_args(sx, sq)
   # Evaluate gradient
   with ops.control_dependencies([grad]):
     x = math_ops.conj(x)
@@ -701,9 +699,7 @@ def _PolygammaGrad(op, grad):
   # Broadcast gradients
   sn = array_ops.shape(n)
   sx = array_ops.shape(x)
-  # pylint: disable=protected-access
-  unused_rn, rx = gen_array_ops._broadcast_gradient_args(sn, sx)
-  # pylint: enable=protected-access
+  unused_rn, rx = gen_array_ops.broadcast_gradient_args(sn, sx)
   # Evaluate gradient
   with ops.control_dependencies([grad]):
     n = math_ops.conj(n)
@@ -720,8 +716,7 @@ def _SigmoidGrad(op, grad):
   y = op.outputs[0]  # y = sigmoid(x)
   with ops.control_dependencies([grad]):
     y = math_ops.conj(y)
-    # pylint: disable=protected-access
-    return gen_math_ops._sigmoid_grad(y, grad)
+    return gen_math_ops.sigmoid_grad(y, grad)
 
 
 @ops.RegisterGradient("SigmoidGrad")
@@ -730,8 +725,7 @@ def _SigmoidGradGrad(op, grad):
     a = math_ops.conj(op.inputs[0])
     b = math_ops.conj(op.inputs[1])
     gb = grad * b
-    # pylint: disable=protected-access
-    return gb - 2.0 * gb * a, gen_math_ops._sigmoid_grad(a, grad)
+    return gb - 2.0 * gb * a, gen_math_ops.sigmoid_grad(a, grad)
 
 
 @ops.RegisterGradient("Sign")
@@ -845,9 +839,7 @@ def _AddGrad(op, grad):
     return grad, grad
   sx = array_ops.shape(x)
   sy = array_ops.shape(y)
-  # pylint: disable=protected-access
-  rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
-  # pylint: enable=protected-access
+  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
   return (array_ops.reshape(math_ops.reduce_sum(grad, rx), sx),
           array_ops.reshape(math_ops.reduce_sum(grad, ry), sy))
 
@@ -862,9 +854,7 @@ def _SubGrad(op, grad):
     return grad, -grad
   sx = array_ops.shape(x)
   sy = array_ops.shape(y)
-  # pylint: disable=protected-access
-  rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
-  # pylint: enable=protected-access
+  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
   return (array_ops.reshape(math_ops.reduce_sum(grad, rx), sx),
           array_ops.reshape(-math_ops.reduce_sum(grad, ry), sy))
 
@@ -874,22 +864,20 @@ def _MulGrad(op, grad):
   """The gradient of scalar multiplication."""
   x = op.inputs[0]
   y = op.inputs[1]
-  # pylint: disable=protected-access
   if (isinstance(grad, ops.Tensor) and
       _ShapesFullySpecifiedAndEqual(x, y, grad) and
       grad.dtype in (dtypes.int32, dtypes.float32)):
-    return gen_math_ops._mul(grad, y), gen_math_ops._mul(grad, x)
+    return gen_math_ops.mul(grad, y), gen_math_ops.mul(grad, x)
   assert x.dtype.base_dtype == y.dtype.base_dtype, (x.dtype, " vs. ", y.dtype)
   sx = array_ops.shape(x)
   sy = array_ops.shape(y)
-  rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
+  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
   x = math_ops.conj(x)
   y = math_ops.conj(y)
   return (array_ops.reshape(
-      math_ops.reduce_sum(gen_math_ops._mul(grad, y), rx), sx),
+      math_ops.reduce_sum(gen_math_ops.mul(grad, y), rx), sx),
           array_ops.reshape(
-              math_ops.reduce_sum(gen_math_ops._mul(x, grad), ry), sy))
-  # pylint: enable=protected-access
+              math_ops.reduce_sum(gen_math_ops.mul(x, grad), ry), sy))
 
 
 @ops.RegisterGradient("Div")
@@ -899,9 +887,7 @@ def _DivGrad(op, grad):
   y = op.inputs[1]
   sx = array_ops.shape(x)
   sy = array_ops.shape(y)
-  # pylint: disable=protected-access
-  rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
-  # pylint: enable=protected-access
+  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
   x = math_ops.conj(x)
   y = math_ops.conj(y)
   return (array_ops.reshape(math_ops.reduce_sum(math_ops.div(grad, y), rx), sx),
@@ -924,9 +910,7 @@ def _FloorModGrad(op, grad):
 
   sx = array_ops.shape(x)
   sy = array_ops.shape(y)
-  # pylint: disable=protected-access
-  rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
-  # pylint: enable=protected-access
+  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
   floor_xy = math_ops.floor_div(x, y)
   gx = array_ops.reshape(math_ops.reduce_sum(grad, rx), sx)
   gy = array_ops.reshape(
@@ -946,9 +930,7 @@ def _RealDivGrad(op, grad):
   y = op.inputs[1]
   sx = array_ops.shape(x)
   sy = array_ops.shape(y)
-  # pylint: disable=protected-access
-  rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
-  # pylint: enable=protected-access
+  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
   x = math_ops.conj(x)
   y = math_ops.conj(y)
   return (array_ops.reshape(
@@ -966,7 +948,7 @@ def _PowGrad(op, grad):
   z = op.outputs[0]
   sx = array_ops.shape(x)
   sy = array_ops.shape(y)
-  rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
+  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
   x = math_ops.conj(x)
   y = math_ops.conj(y)
   z = math_ops.conj(z)
@@ -994,7 +976,7 @@ def _MaximumMinimumGrad(op, grad, selector_op):
   gradshape = array_ops.shape(grad)
   zeros = array_ops.zeros(gradshape, gdtype)
   xmask = selector_op(x, y)
-  rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
+  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
   xgrad = array_ops.where(xmask, grad, zeros)
   ygrad = array_ops.where(xmask, zeros, grad)
   gx = array_ops.reshape(math_ops.reduce_sum(xgrad, rx), sx)
@@ -1021,9 +1003,7 @@ def _SquaredDifferenceGrad(op, grad):
   y = op.inputs[1]
   sx = array_ops.shape(x)
   sy = array_ops.shape(y)
-  # pylint: disable=protected-access
-  rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
-  # pylint: enable=protected-access
+  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
   with ops.control_dependencies([grad]):
     # The parens ensure that if grad is IndexedSlices, it'll get multiplied by
     # Tensor (not a number like 2.0) which causes it to convert to Tensor.
@@ -1062,20 +1042,18 @@ def _MatMulGrad(op, grad):
   t_b = op.get_attr("transpose_b")
   a = math_ops.conj(op.inputs[0])
   b = math_ops.conj(op.inputs[1])
-  # pylint: disable=protected-access
   if not t_a and not t_b:
-    grad_a = gen_math_ops._mat_mul(grad, b, transpose_b=True)
-    grad_b = gen_math_ops._mat_mul(a, grad, transpose_a=True)
+    grad_a = gen_math_ops.mat_mul(grad, b, transpose_b=True)
+    grad_b = gen_math_ops.mat_mul(a, grad, transpose_a=True)
   elif not t_a and t_b:
-    grad_a = gen_math_ops._mat_mul(grad, b)
-    grad_b = gen_math_ops._mat_mul(grad, a, transpose_a=True)
+    grad_a = gen_math_ops.mat_mul(grad, b)
+    grad_b = gen_math_ops.mat_mul(grad, a, transpose_a=True)
   elif t_a and not t_b:
-    grad_a = gen_math_ops._mat_mul(b, grad, transpose_b=True)
-    grad_b = gen_math_ops._mat_mul(a, grad)
+    grad_a = gen_math_ops.mat_mul(b, grad, transpose_b=True)
+    grad_b = gen_math_ops.mat_mul(a, grad)
   elif t_a and t_b:
-    grad_a = gen_math_ops._mat_mul(b, grad, transpose_a=True, transpose_b=True)
-    grad_b = gen_math_ops._mat_mul(grad, a, transpose_a=True, transpose_b=True)
-  # pylint: enable=protected-access
+    grad_a = gen_math_ops.mat_mul(b, grad, transpose_a=True, transpose_b=True)
+    grad_b = gen_math_ops.mat_mul(grad, a, transpose_a=True, transpose_b=True)
   return grad_a, grad_b
 
 
@@ -1089,7 +1067,7 @@ def _SparseMatMulGrad(op, grad):
       op.inputs[0]: op.get_attr("a_is_sparse"),
       op.inputs[1]: op.get_attr("b_is_sparse"),
       # Use heuristic to figure out if grad might be sparse
-      grad: context.in_graph_mode() and (grad.op.type == "ReluGrad")
+      grad: not context.executing_eagerly() and (grad.op.type == "ReluGrad")
   }
 
   def _SparseMatMul(t1, t2, out_dtype, transpose_a=False, transpose_b=False):
@@ -1189,7 +1167,7 @@ def _ComplexGrad(op, grad):
   y = op.inputs[1]
   sx = array_ops.shape(x)
   sy = array_ops.shape(y)
-  rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
+  rx, ry = gen_array_ops.broadcast_gradient_args(sx, sy)
   return (array_ops.reshape(math_ops.reduce_sum(math_ops.real(grad), rx), sx),
           array_ops.reshape(math_ops.reduce_sum(math_ops.imag(grad), ry), sy))
 
diff --git a/tensorflow/python/ops/math_ops.py b/tensorflow/python/ops/math_ops.py
index 2ae8b610da04d9762284d51b9c9f28a8c07e24f7..276897ab99e5e8770b72cb1eb27d07fb8dbc08bb 100644
--- a/tensorflow/python/ops/math_ops.py
+++ b/tensorflow/python/ops/math_ops.py
@@ -89,8 +89,6 @@ See the @{$python/math_ops} guide.
 @@matrix_inverse
 @@cholesky
 @@cholesky_solve
-@@matrix_exponential
-@@matrix_logarithm
 @@matrix_solve
 @@matrix_triangular_solve
 @@matrix_solve_ls
@@ -129,8 +127,11 @@ See the @{$python/math_ops} guide.
 @@segment_min
 @@segment_max
 @@segment_mean
+@@to_complex128
+@@to_complex64
 @@unsorted_segment_sum
 @@unsorted_segment_max
+@@unsorted_segment_mean
 @@unsorted_segment_min
 @@unsorted_segment_prod
 @@unsorted_segment_sqrt_n
@@ -161,14 +162,12 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import gen_control_flow_ops
 from tensorflow.python.ops import gen_data_flow_ops
 from tensorflow.python.ops import gen_math_ops
 from tensorflow.python.ops import gen_nn_ops
 from tensorflow.python.ops import gen_sparse_ops
 from tensorflow.python.ops import gen_spectral_ops
-from tensorflow.python.ops import gen_state_ops
-from tensorflow.python.ops import state_ops
+from tensorflow.python.platform import tf_logging as logging
 # go/tf-wildcard-import
 # pylint: disable=wildcard-import
 from tensorflow.python.ops.gen_math_ops import *
@@ -182,6 +181,13 @@ linspace = gen_math_ops.lin_space
 
 arg_max = deprecation.deprecated(None, "Use `argmax` instead")(arg_max)  # pylint: disable=used-before-assignment
 arg_min = deprecation.deprecated(None, "Use `argmin` instead")(arg_min)  # pylint: disable=used-before-assignment
+tf_export("arg_max")(arg_max)
+tf_export("arg_min")(arg_min)
+
+
+# This is set by resource_variable_ops.py. It is included in this way since
+# there is a circular dependency between math_ops and resource_variable_ops
+_resource_variable_type = None
 
 
 def _set_doc(doc):
@@ -266,7 +272,7 @@ def abs(x, name=None):  # pylint: disable=redefined-builtin
   with ops.name_scope(name, "Abs", [x]) as name:
     if isinstance(x, sparse_tensor.SparseTensor):
       if x.values.dtype.is_complex:
-        x_abs = gen_math_ops._complex_abs(
+        x_abs = gen_math_ops.complex_abs(
             x.values, Tout=x.values.dtype.real_dtype, name=name)
         return sparse_tensor.SparseTensor(
             indices=x.indices, values=x_abs, dense_shape=x.dense_shape)
@@ -276,7 +282,7 @@ def abs(x, name=None):  # pylint: disable=redefined-builtin
     else:
       x = ops.convert_to_tensor(x, name="x")
       if x.dtype.is_complex:
-        return gen_math_ops._complex_abs(x, Tout=x.dtype.real_dtype, name=name)
+        return gen_math_ops.complex_abs(x, Tout=x.dtype.real_dtype, name=name)
       return gen_math_ops._abs(x, name=name)
 
 
@@ -285,7 +291,7 @@ def abs(x, name=None):  # pylint: disable=redefined-builtin
 
 # pylint: disable=redefined-builtin
 def _bucketize(input, boundaries, name=None):
-  return gen_math_ops._bucketize(input=input, boundaries=boundaries, name=name)
+  return gen_math_ops.bucketize(input=input, boundaries=boundaries, name=name)
 
 
 # pylint: enable=redefined-builtin
@@ -328,10 +334,10 @@ def divide(x, y, name=None):
 
 @tf_export("multiply")
 def multiply(x, y, name=None):
-  return gen_math_ops._mul(x, y, name)
+  return gen_math_ops.mul(x, y, name)
 
 
-multiply.__doc__ = gen_math_ops._mul.__doc__.replace("Mul", "`tf.multiply`")
+multiply.__doc__ = gen_math_ops.mul.__doc__.replace("Multiply", "`tf.multiply`")
 
 
 # TODO(aselle): put deprecation in after another round of global code changes
@@ -339,19 +345,19 @@ multiply.__doc__ = gen_math_ops._mul.__doc__.replace("Mul", "`tf.multiply`")
     "2016-12-30",
     "`tf.mul(x, y)` is deprecated, please use `tf.multiply(x, y)` or `x * y`")
 def _mul(x, y, name=None):
-  return gen_math_ops._mul(x, y, name)
+  return gen_math_ops.mul(x, y, name)
 
 
 _mul.__doc__ = (
-    gen_math_ops._mul.__doc__ + ("" if _mul.__doc__ is None else _mul.__doc__))
+    gen_math_ops.mul.__doc__ + ("" if _mul.__doc__ is None else _mul.__doc__))
 
 
 @tf_export("subtract")
 def subtract(x, y, name=None):
-  return gen_math_ops._sub(x, y, name)
+  return gen_math_ops.sub(x, y, name)
 
 
-subtract.__doc__ = gen_math_ops._sub.__doc__.replace("`Sub`", "`tf.subtract`")
+subtract.__doc__ = gen_math_ops.sub.__doc__.replace("`Sub`", "`tf.subtract`")
 
 
 # TODO(aselle): put deprecation in after another round of global code changes
@@ -359,11 +365,11 @@ subtract.__doc__ = gen_math_ops._sub.__doc__.replace("`Sub`", "`tf.subtract`")
     "2016-12-30",
     "`tf.sub(x, y)` is deprecated, please use `tf.subtract(x, y)` or `x - y`")
 def _sub(x, y, name=None):
-  return gen_math_ops._sub(x, y, name)
+  return gen_math_ops.sub(x, y, name)
 
 
 _sub.__doc__ = (
-    gen_math_ops._sub.__doc__ + ("" if _sub.__doc__ is None else _sub.__doc__))
+    gen_math_ops.sub.__doc__ + ("" if _sub.__doc__ is None else _sub.__doc__))
 
 
 # pylint: disable=g-docstring-has-escape
@@ -383,11 +389,11 @@ def negative(x, name=None):
   """
   with ops.name_scope(name, "Neg", [x]) as name:
     if isinstance(x, sparse_tensor.SparseTensor):
-      x_neg = gen_math_ops._neg(x.values, name=name)
+      x_neg = gen_math_ops.neg(x.values, name=name)
       return sparse_tensor.SparseTensor(
           indices=x.indices, values=x_neg, dense_shape=x.dense_shape)
     else:
-      return gen_math_ops._neg(x, name=name)
+      return gen_math_ops.neg(x, name=name)
 
 
 # pylint: enable=g-docstring-has-escape
@@ -770,16 +776,18 @@ def cast(x, dtype, name=None):
   with ops.name_scope(name, "Cast", [x]) as name:
     if isinstance(x, sparse_tensor.SparseTensor):
       values_cast = cast(x.values, base_type, name=name)
-      return sparse_tensor.SparseTensor(x.indices, values_cast, x.dense_shape)
+      x = sparse_tensor.SparseTensor(x.indices, values_cast, x.dense_shape)
     else:
       # TODO(josh11b): If x is not already a Tensor, we could return
       # ops.convert_to_tensor(x, dtype=dtype, ...)  here, but that
       # allows some conversions that cast() can't do, e.g. casting numbers to
       # strings.
       x = ops.convert_to_tensor(x, name="x")
-      if x.dtype.base_dtype == base_type:
-        return x
-      return gen_math_ops.cast(x, base_type, name=name)
+      if x.dtype.base_dtype != base_type:
+        if x.dtype.is_complex and base_type.is_floating:
+          logging.warn("Casting complex to real discards imaginary part.")
+        x = gen_math_ops.cast(x, base_type, name=name)
+    return x
 
 
 @tf_export("saturate_cast")
@@ -935,7 +943,7 @@ def to_complex128(x, name="ToComplex128"):
   return cast(x, dtypes.complex128, name=name)
 
 
-ops.Tensor._override_operator("__neg__", gen_math_ops._neg)
+ops.Tensor._override_operator("__neg__", gen_math_ops.neg)
 ops.Tensor._override_operator("__abs__", abs)
 # __invert__ corresponds to the ~ operator.  Here we follow the numpy convention
 # ~ marks an elementwise bit-wise inverse.  This is only implemented for boolean
@@ -1064,7 +1072,7 @@ def _truediv_python3(x, y, name=None):
     if dtype is not None:
       x = cast(x, dtype)
       y = cast(y, dtype)
-    return gen_math_ops._real_div(x, y, name=name)
+    return gen_math_ops.real_div(x, y, name=name)
 
 
 def _div_python2(x, y, name=None):
@@ -1087,9 +1095,9 @@ def _div_python2(x, y, name=None):
       raise TypeError("x and y must have the same dtype, got %r != %r" %
                       (x_dtype, y_dtype))
     if x_dtype.is_floating or x_dtype.is_complex:
-      return gen_math_ops._real_div(x, y, name=name)
+      return gen_math_ops.real_div(x, y, name=name)
     else:
-      return gen_math_ops._floor_div(x, y, name=name)
+      return gen_math_ops.floor_div(x, y, name=name)
 
 
 @tf_export("truediv")
@@ -1147,7 +1155,7 @@ def div(x, y, name=None):
 
 
 # TODO(aselle): This should be removed
-mod = gen_math_ops._floor_mod
+mod = gen_math_ops.floor_mod
 
 
 # TODO(aselle): Deprecate this once all internal functionality uses
@@ -1180,22 +1188,27 @@ def floordiv(x, y, name=None):
     TypeError: If the inputs are complex.
   """
   with ops.name_scope(name, "floordiv", [x, y]) as name:
-    return gen_math_ops._floor_div(x, y, name=name)
+    return gen_math_ops.floor_div(x, y, name=name)
 
 
-realdiv = gen_math_ops._real_div
-truncatediv = gen_math_ops._truncate_div
+realdiv = gen_math_ops.real_div
+tf_export("realdiv")(realdiv)
+truncatediv = gen_math_ops.truncate_div
+tf_export("truncatediv")(truncatediv)
 # TODO(aselle): Rename this to floordiv when we can.
-floor_div = gen_math_ops._floor_div
-truncatemod = gen_math_ops._truncate_mod
-floormod = gen_math_ops._floor_mod
+floor_div = gen_math_ops.floor_div
+tf_export("floor_div")(floor_div)
+truncatemod = gen_math_ops.truncate_mod
+tf_export("truncatemod")(truncatemod)
+floormod = gen_math_ops.floor_mod
+tf_export("floormod", "mod")(floormod)
 
 
 def _mul_dispatch(x, y, name=None):
   """Dispatches cwise mul for "Dense*Dense" and "Dense*Sparse"."""
   is_tensor_y = isinstance(y, ops.Tensor)
   if is_tensor_y:
-    return gen_math_ops._mul(x, y, name=name)
+    return gen_math_ops.mul(x, y, name=name)
   else:
     assert isinstance(y, sparse_tensor.SparseTensor)  # Case: Dense * Sparse.
     new_vals = gen_sparse_ops.sparse_dense_cwise_mul(y.indices, y.values,
@@ -1214,12 +1227,12 @@ _OverrideBinaryOperatorHelper(gen_sparse_ops.sparse_dense_cwise_mul, "mul",
                               sparse_tensor.SparseTensor)
 
 _OverrideBinaryOperatorHelper(gen_math_ops.add, "add")
-_OverrideBinaryOperatorHelper(gen_math_ops._sub, "sub")
+_OverrideBinaryOperatorHelper(gen_math_ops.sub, "sub")
 _OverrideBinaryOperatorHelper(_mul_dispatch, "mul")
 _OverrideBinaryOperatorHelper(_div_python2, "div")
 _OverrideBinaryOperatorHelper(_truediv_python3, "truediv")
 _OverrideBinaryOperatorHelper(floordiv, "floordiv")
-_OverrideBinaryOperatorHelper(gen_math_ops._floor_mod, "mod")
+_OverrideBinaryOperatorHelper(gen_math_ops.floor_mod, "mod")
 _OverrideBinaryOperatorHelper(pow, "pow")
 
 
@@ -1541,7 +1554,7 @@ def reduce_mean(input_tensor,
   if keepdims is None:
     keepdims = False
   return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
-                               gen_math_ops._mean(
+                               gen_math_ops.mean(
                                    input_tensor,
                                    _ReductionDims(input_tensor, axis,
                                                   reduction_indices),
@@ -1591,7 +1604,7 @@ def reduce_prod(input_tensor,
   if keepdims is None:
     keepdims = False
   return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
-                               gen_math_ops._prod(
+                               gen_math_ops.prod(
                                    input_tensor,
                                    _ReductionDims(input_tensor, axis,
                                                   reduction_indices),
@@ -2044,8 +2057,15 @@ def matmul(a,
     if transpose_b and adjoint_b:
       raise ValueError("Only one of transpose_b and adjoint_b can be True.")
 
-    a = ops.convert_to_tensor(a, name="a")
-    b = ops.convert_to_tensor(b, name="b")
+    if context.executing_eagerly():
+      if not isinstance(a, (ops.EagerTensor, _resource_variable_type)):
+        a = ops.convert_to_tensor(a, name="a")
+      if not isinstance(b, (ops.EagerTensor, _resource_variable_type)):
+        b = ops.convert_to_tensor(b, name="b")
+    else:
+      a = ops.convert_to_tensor(a, name="a")
+      b = ops.convert_to_tensor(b, name="b")
+
     # TODO(apassos) remove _shape_tuple here when it is not needed.
     a_shape = a._shape_tuple()  # pylint: disable=protected-access
     b_shape = b._shape_tuple()  # pylint: disable=protected-access
@@ -2060,7 +2080,7 @@ def matmul(a,
       if transpose_b:
         b = conj(b)
         adjoint_b = True
-      return gen_math_ops._batch_mat_mul(
+      return gen_math_ops.batch_mat_mul(
           a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
 
     # Neither matmul nor sparse_matmul support adjoint, so we conjugate
@@ -2078,8 +2098,9 @@ def matmul(a,
       sparse_matmul_types = [dtypes.bfloat16, dtypes.float32]
       use_sparse_matmul = (
           a.dtype in sparse_matmul_types and b.dtype in sparse_matmul_types)
-    if a.dtype == dtypes.bfloat16 or b.dtype == dtypes.bfloat16:
-      # matmul currently doesn't handle bfloat16 inputs.
+    if ((a.dtype == dtypes.bfloat16 or b.dtype == dtypes.bfloat16) and
+        a.dtype != b.dtype):
+      # matmul currently doesn't handle mixed-precision inputs.
       use_sparse_matmul = True
     if use_sparse_matmul:
       ret = sparse_matmul(
@@ -2097,13 +2118,14 @@ def matmul(a,
         ret = cast(ret, dtypes.bfloat16)
       return ret
     else:
-      return gen_math_ops._mat_mul(
+      return gen_math_ops.mat_mul(
           a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
 
 
 _OverrideBinaryOperatorHelper(matmul, "matmul")
 
-sparse_matmul = gen_math_ops._sparse_mat_mul
+sparse_matmul = gen_math_ops.sparse_mat_mul
+tf_export("sparse_matmul")(sparse_matmul)
 
 
 @ops.RegisterStatistics("MatMul", "flops")
@@ -2208,7 +2230,7 @@ def add_n(inputs, name=None):
     if name:
       return array_ops.identity(inputs[0], name=name)
     return inputs[0]
-  return gen_math_ops._add_n(inputs, name=name)
+  return gen_math_ops.add_n(inputs, name=name)
 
 
 @tf_export("accumulate_n")
@@ -2218,14 +2240,12 @@ def accumulate_n(inputs, shape=None, tensor_dtype=None, name=None):
   Optionally, pass `shape` and `tensor_dtype` for shape and type checking,
   otherwise, these are inferred.
 
-  NOTE: This operation is not differentiable and cannot be used if inputs depend
-  on trainable variables. Please use `tf.add_n` for such cases.
+  `tf.accumulate_n` performs the same operation as `tf.add_n`, but does not
+  wait for all of its inputs to be ready before beginning to sum. This can
+  save memory if inputs are ready at different times, since minimum temporary
+  storage is proportional to the output size rather than the input size.
 
-  Aside from differentiability, `tf.accumulate_n` performs the same operation as
-  `tf.add_n`, but does not wait for all of its inputs to be ready before
-  beginning to sum. This can save memory if inputs are ready at different times,
-  since minimum temporary storage is proportional to the output size rather than
-  the inputs size.
+  `accumulate_n` is differentiable (it was not prior to TensorFlow 1.7).
 
   For example:
 
@@ -2235,8 +2255,9 @@ def accumulate_n(inputs, shape=None, tensor_dtype=None, name=None):
   tf.accumulate_n([a, b, a])  # [[7, 4], [6, 14]]
 
   # Explicitly pass shape and type
-  tf.accumulate_n([a, b, a], shape=[2, 2], tensor_dtype=tf.int32)  # [[7,  4],
-                                                                   #  [6, 14]]
+  tf.accumulate_n([a, b, a], shape=[2, 2], tensor_dtype=tf.int32)
+                                                                 # [[7,  4],
+                                                                 #  [6, 14]]
   ```
 
   Args:
@@ -2252,20 +2273,17 @@ def accumulate_n(inputs, shape=None, tensor_dtype=None, name=None):
     ValueError: If `inputs` don't all have same shape and dtype or the shape
     cannot be inferred.
   """
-  if context.in_eager_mode():
-    # TODO(apassos) remove this once the lifetime of eager variables gets
-    # addressed.
-    raise ValueError("accumulate_n not supported in eager mode")
+  def _input_error():
+    return ValueError(
+        "inputs must be a list of at least one Tensor with the "
+        "same dtype and shape")
   if not inputs or not isinstance(inputs, (list, tuple)):
-    raise ValueError("inputs must be a list of at least one Tensor with the "
-                     "same dtype and shape")
+    raise _input_error()
   inputs = ops.convert_n_to_tensor_or_indexed_slices(inputs)
   if not all(isinstance(x, ops.Tensor) for x in inputs):
-    raise ValueError("inputs must be a list of at least one Tensor with the "
-                     "same dtype and shape")
+    raise _input_error()
   if not all(x.dtype == inputs[0].dtype for x in inputs):
-    raise ValueError("inputs must be a list of at least one Tensor with the "
-                     "same dtype and shape")
+    raise _input_error()
   if shape is not None:
     shape = tensor_shape.as_shape(shape)
   else:
@@ -2273,27 +2291,31 @@ def accumulate_n(inputs, shape=None, tensor_dtype=None, name=None):
   for input_tensor in inputs:
     if isinstance(input_tensor, ops.Tensor):
       shape = shape.merge_with(input_tensor.get_shape())
-  if tensor_dtype is None:
-    tensor_dtype = inputs[0].dtype
-  if tensor_dtype != inputs[0].dtype:
-    raise TypeError("tensor_dtype is {}, but input is of type {}".format(
-        tensor_dtype, inputs[0].dtype))
-  if len(inputs) == 1:
+
+  # tensor_dtype is for safety only; operator's output type computed in C++
+  if tensor_dtype is not None and tensor_dtype != inputs[0].dtype:
+    raise TypeError("tensor_dtype is {}, but input is of type {}"
+                    .format(tensor_dtype, inputs[0].dtype))
+
+  if len(inputs) == 1 and name is None:
     return inputs[0]
-  with ops.name_scope(name, "AccumulateN", inputs) as name:
-    var = gen_state_ops._temporary_variable(
-        shape=tensor_shape.vector(0), dtype=tensor_dtype)
-    with ops.colocate_with(var):
-      zeros = array_ops.zeros_like(gen_control_flow_ops._merge(inputs)[0])
-      zeros.set_shape(shape)
-      ref = state_ops.assign(var, zeros, validate_shape=False)
-      update_ops = [
-          state_ops.assign_add(ref, input_tensor, use_locking=True)
-          for input_tensor in inputs
-      ]
-      with ops.control_dependencies(update_ops):
-        return gen_state_ops._destroy_temporary_variable(
-            ref, var_name=var.op.name, name=name)
+  elif len(inputs) == 1 and name is not None:
+    return array_ops.identity(inputs[0], name=name)
+  elif context.executing_eagerly():
+    # TemporaryVariable not currently supported in eager mode; fall back
+    # onto AddN for now.
+    # TODO(frreiss) remove this once the lifetime of eager variables gets
+    # addressed
+    return add_n(inputs, name=name)
+  else:
+    return gen_math_ops.accumulate_nv2(inputs, name=name, shape=shape)
+
+
+@ops.RegisterGradient("AccumulateNV2")
+def _accumulate_n_grad(op, grad):
+  """Same as gradient for AddN. Copies the gradient to all inputs."""
+  # Not broadcasting.
+  return [grad] * len(op.inputs)
 
 
 @tf_export("nn.sigmoid", "sigmoid")
@@ -2316,7 +2338,7 @@ def sigmoid(x, name=None):
   """
   with ops.name_scope(name, "Sigmoid", [x]) as name:
     x = ops.convert_to_tensor(x, name="x")
-    return gen_math_ops._sigmoid(x, name=name)
+    return gen_math_ops.sigmoid(x, name=name)
 
 
 @tf_export("log_sigmoid")
@@ -2335,7 +2357,7 @@ def log_sigmoid(x, name=None):
   """
   with ops.name_scope(name, "LogSigmoid", [x]) as name:
     x = ops.convert_to_tensor(x, name="x")
-    return gen_math_ops._neg(gen_nn_ops.softplus(-x), name=name)
+    return gen_math_ops.neg(gen_nn_ops.softplus(-x), name=name)
 
 
 @tf_export("nn.tanh", "tanh")
@@ -2352,11 +2374,11 @@ def tanh(x, name=None):
   """
   with ops.name_scope(name, "Tanh", [x]) as name:
     if isinstance(x, sparse_tensor.SparseTensor):
-      x_tanh = gen_math_ops._tanh(x.values, name=name)
+      x_tanh = gen_math_ops.tanh(x.values, name=name)
       return sparse_tensor.SparseTensor(
           indices=x.indices, values=x_tanh, dense_shape=x.dense_shape)
     else:
-      return gen_math_ops._tanh(x, name=name)
+      return gen_math_ops.tanh(x, name=name)
 
 
 @tf_export("bincount")
@@ -2545,7 +2567,7 @@ def conj(x, name=None):
   with ops.name_scope(name, "Conj", [x]) as name:
     x = ops.convert_to_tensor(x, name="x")
     if x.dtype.is_complex or x.dtype == dtypes.variant:
-      return gen_math_ops._conj(x, name=name)
+      return gen_math_ops.conj(x, name=name)
     elif x.dtype.is_floating or x.dtype.is_integer:
       return x
     else:
diff --git a/tensorflow/python/ops/math_ops_test.py b/tensorflow/python/ops/math_ops_test.py
index d314124ccd9bc8b7676e6926830a8eb1e0315f5f..9f85188b3513563a7444f7a0e908f11af985498b 100644
--- a/tensorflow/python/ops/math_ops_test.py
+++ b/tensorflow/python/ops/math_ops_test.py
@@ -60,7 +60,7 @@ class ReduceTest(test_util.TensorFlowTestCase):
 
   @test_util.run_in_graph_and_eager_modes()
   def testReduceInvalidAxis(self):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       # The shape check is in run a graph construction time. In eager mode,
       # it misses the check, magically return result given wrong shape.
       return
@@ -249,7 +249,7 @@ class ScalarMulTest(test_util.TensorFlowTestCase):
 
   @test_util.run_in_graph_and_eager_modes()
   def testAcceptsRefs(self):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       var = resource_variable_ops.ResourceVariable(10, name="var")
     else:
       var = variables.Variable(10)
diff --git a/tensorflow/python/ops/metrics_impl.py b/tensorflow/python/ops/metrics_impl.py
index 043c0e30cd8476b1a91e136df60edfbedf85ab24..9ec49545796cfa7a603b31c23bfd0d495639898d 100644
--- a/tensorflow/python/ops/metrics_impl.py
+++ b/tensorflow/python/ops/metrics_impl.py
@@ -308,7 +308,7 @@ def mean(values,
       or tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.mean is not supported when eager execution '
                        'is enabled.')
 
@@ -394,7 +394,7 @@ def accuracy(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.accuracy is not supported when eager '
                        'execution is enabled.')
 
@@ -644,7 +644,7 @@ def auc(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.auc is not supported when eager execution '
                        'is enabled.')
 
@@ -672,7 +672,7 @@ def auc(labels,
         x = fp_rate
         y = rec
       else:  # curve == 'PR'.
-        prec = math_ops.div(tp, tp + fp + epsilon)
+        prec = math_ops.div(tp + epsilon, tp + fp + epsilon)
         x = rec
         y = prec
       if summation_method == 'trapezoidal':
@@ -758,7 +758,7 @@ def mean_absolute_error(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.mean_absolute_error is not supported '
                        'when eager execution is enabled.')
 
@@ -818,7 +818,7 @@ def mean_cosine_distance(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.mean_cosine_distance is not supported when '
                        'eager execution is enabled.')
 
@@ -891,7 +891,7 @@ def mean_per_class_accuracy(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.mean_per_class_accuracy is not supported '
                        'when eager execution is enabled.')
 
@@ -923,8 +923,8 @@ def mean_per_class_accuracy(labels,
         weights = array_ops.reshape(weights, [-1])
       weights = math_ops.to_float(weights)
 
-      is_correct *= weights
-      ones *= weights
+      is_correct = is_correct * weights
+      ones = ones * weights
 
     update_total_op = state_ops.scatter_add(total, labels, ones)
     update_count_op = state_ops.scatter_add(count, labels, is_correct)
@@ -996,7 +996,7 @@ def mean_iou(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.mean_iou is not supported when '
                        'eager execution is enabled.')
 
@@ -1098,7 +1098,7 @@ def mean_relative_error(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.mean_relative_error is not supported when '
                        'eager execution is enabled.')
 
@@ -1165,7 +1165,7 @@ def mean_squared_error(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.mean_squared_error is not supported when '
                        'eager execution is enabled.')
 
@@ -1223,7 +1223,7 @@ def mean_tensor(values,
       or tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.mean_tensor is not supported when '
                        'eager execution is enabled.')
 
@@ -1304,7 +1304,7 @@ def percentage_below(values,
       or tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.percentage_below is not supported when '
                        'eager execution is enabled.')
 
@@ -1397,7 +1397,7 @@ def false_negatives(labels,
       or tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.false_negatives is not supported when '
                        'eager execution is enabled.')
 
@@ -1453,7 +1453,7 @@ def false_negatives_at_thresholds(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.false_negatives_at_thresholds is not '
                        'supported when eager execution is enabled.')
 
@@ -1507,7 +1507,7 @@ def false_positives(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.false_positives is not supported when '
                        'eager execution is enabled.')
 
@@ -1563,7 +1563,7 @@ def false_positives_at_thresholds(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.false_positives_at_thresholds is not '
                        'supported when eager execution is enabled.')
 
@@ -1617,7 +1617,7 @@ def true_negatives(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.true_negatives is not '
                        'supported when eager execution is enabled.')
 
@@ -1673,7 +1673,7 @@ def true_negatives_at_thresholds(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.true_negatives_at_thresholds is not '
                        'supported when eager execution is enabled.')
 
@@ -1727,7 +1727,7 @@ def true_positives(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.true_positives is not '
                        'supported when eager execution is enabled.')
 
@@ -1783,7 +1783,7 @@ def true_positives_at_thresholds(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.true_positives_at_thresholds is not '
                        'supported when eager execution is enabled.')
 
@@ -1851,7 +1851,7 @@ def precision(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.precision is not '
                        'supported when eager execution is enabled.')
 
@@ -1947,7 +1947,7 @@ def precision_at_thresholds(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.precision_at_thresholds is not '
                        'supported when eager execution is enabled.')
 
@@ -2023,7 +2023,7 @@ def recall(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.recall is not supported is not '
                        'supported when eager execution is enabled.')
 
@@ -2400,7 +2400,7 @@ def recall_at_k(labels,
     are not a list or tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.recall_at_k is not '
                        'supported when eager execution is enabled.')
 
@@ -2549,7 +2549,7 @@ def recall_at_thresholds(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.recall_at_thresholds is not '
                        'supported when eager execution is enabled.')
 
@@ -2626,7 +2626,7 @@ def root_mean_squared_error(labels,
       tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.root_mean_squared_error is not '
                        'supported when eager execution is enabled.')
 
@@ -2707,7 +2707,7 @@ def sensitivity_at_specificity(labels,
       or `updates_collections` are not a list or tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.sensitivity_at_specificity is not '
                        'supported when eager execution is enabled.')
 
@@ -3098,7 +3098,7 @@ def average_precision_at_k(labels,
     ValueError: if k is invalid.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.sparse_average_precision_at_k is not '
                        'supported when eager execution is enabled.')
 
@@ -3267,7 +3267,7 @@ def precision_at_top_k(labels,
       are not a list or tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.precision_at_top_k is not '
                        'supported when eager execution is enabled.')
 
@@ -3396,7 +3396,7 @@ def precision_at_k(labels,
       are not a list or tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.sparse_precision_at_k is not '
                        'supported when eager execution is enabled.')
 
@@ -3473,7 +3473,7 @@ def specificity_at_sensitivity(labels,
       or `updates_collections` are not a list or tuple.
     RuntimeError: If eager execution is enabled.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError('tf.metrics.specificity_at_sensitivity is not '
                        'supported when eager execution is enabled.')
 
diff --git a/tensorflow/python/ops/nn_batchnorm_test.py b/tensorflow/python/ops/nn_batchnorm_test.py
index eebfb17085a568f48769f6df7dddd3ae2f799efc..3ac2c8eb17ef31b46638ce50e0e9f9705adce189 100644
--- a/tensorflow/python/ops/nn_batchnorm_test.py
+++ b/tensorflow/python/ops/nn_batchnorm_test.py
@@ -57,7 +57,6 @@ class BatchNormalizationTest(test.TestCase):
     test_util.set_producer_version(ops.get_default_graph(), 8)
     return gen_nn_ops._batch_norm_with_global_normalization(
         x, m, v, beta, gamma, epsilon, scale_after_normalization)
-    # pylint: enable=protected-access
 
   def _tfBatchNormV1BW(self, x, m, v, beta, gamma, epsilon,
                        scale_after_normalization):
@@ -223,7 +222,7 @@ class BatchNormalizationTest(test.TestCase):
         for scale_after_normalization in [True, False]:
           # _batch_norm_with_global_normalization_grad is deprecated in v9
           test_util.set_producer_version(ops.get_default_graph(), 8)
-          grad = gen_nn_ops._batch_norm_with_global_normalization_grad(
+          grad = gen_nn_ops.batch_norm_with_global_normalization_grad(
               x, m, v, gamma, backprop, epsilon, scale_after_normalization)
           dx, dm, dv, db, dg = grad
           self.assertEqual(grad.dx, dx)
diff --git a/tensorflow/python/ops/nn_grad.py b/tensorflow/python/ops/nn_grad.py
index dc24b821a5580e3581f153f3cbf63ad2868b8a18..4af5bd26dd80b984b1c898411c2a23827bed1b4b 100644
--- a/tensorflow/python/ops/nn_grad.py
+++ b/tensorflow/python/ops/nn_grad.py
@@ -150,7 +150,7 @@ def _Conv3DBackpropFilterGrad(op, grad):
 
 @ops.RegisterGradient("AvgPool3D")
 def _AvgPool3DGrad(op, grad):
-  return gen_nn_ops._avg_pool3d_grad(
+  return gen_nn_ops.avg_pool3d_grad(
       array_ops.shape(op.inputs[0]),
       grad,
       ksize=op.get_attr("ksize"),
@@ -172,7 +172,7 @@ def _AvgPool3DGradGrad(op, grad):
 
 @ops.RegisterGradient("MaxPool3D")
 def _MaxPool3DGrad(op, grad):
-  return gen_nn_ops._max_pool3d_grad(
+  return gen_nn_ops.max_pool3d_grad(
       op.inputs[0],
       op.outputs[0],
       grad,
@@ -188,7 +188,7 @@ def _MaxPool3DGradGrad(op, grad):
       shape=array_ops.shape(op.inputs[0]), dtype=op.inputs[0].dtype),
           array_ops.zeros(
               shape=array_ops.shape(op.inputs[1]), dtype=op.inputs[1].dtype),
-          gen_nn_ops._max_pool3d_grad_grad(
+          gen_nn_ops.max_pool3d_grad_grad(
               op.inputs[0],
               op.inputs[1],
               grad,
@@ -204,7 +204,7 @@ def _MaxPool3DGradGradGrad(op, grad):
       shape=array_ops.shape(op.inputs[0]), dtype=op.inputs[0].dtype),
           array_ops.zeros(
               shape=array_ops.shape(op.inputs[1]), dtype=op.inputs[1].dtype),
-          gen_nn_ops._max_pool3d_grad(
+          gen_nn_ops.max_pool3d_grad(
               op.inputs[0],
               op.inputs[1],
               grad,
@@ -352,13 +352,13 @@ def _BiasAddGradV1(unused_bias_op, received_grad):
 
 @ops.RegisterGradient("Relu")
 def _ReluGrad(op, grad):
-  return gen_nn_ops._relu_grad(grad, op.outputs[0])
+  return gen_nn_ops.relu_grad(grad, op.outputs[0])
 
 
 @ops.RegisterGradient("EluGrad")
 def _EluGradGrad(op, grad):
   elu_x = op.inputs[1]
-  return (gen_nn_ops._elu_grad(grad, op.outputs[0]),
+  return (gen_nn_ops.elu_grad(grad, op.outputs[0]),
           array_ops.where(elu_x < 0, grad * op.inputs[0],
                           array_ops.zeros(
                               shape=array_ops.shape(elu_x), dtype=elu_x.dtype)))
@@ -368,63 +368,63 @@ def _EluGradGrad(op, grad):
 def _SeluGradGrad(op, grad):
   x = op.inputs[1]
   scale_alpha = 1.7580993408473768599402175208123
-  return (gen_nn_ops._elu_grad(grad, op.outputs[0]),
+  return (gen_nn_ops.elu_grad(grad, op.outputs[0]),
           array_ops.where(x < 0.,
-                          gen_nn_ops._elu_grad(grad,
-                                               op.outputs[0] + scale_alpha),
+                          gen_nn_ops.elu_grad(grad,
+                                              op.outputs[0] + scale_alpha),
                           array_ops.zeros(
                               shape=array_ops.shape(x), dtype=x.dtype)))
 
 
 @ops.RegisterGradient("Relu6")
 def _Relu6Grad(op, grad):
-  return gen_nn_ops._relu6_grad(grad, op.outputs[0])  # pylint: disable=protected-access
+  return gen_nn_ops.relu6_grad(grad, op.outputs[0])
 
 
 @ops.RegisterGradient("Relu6Grad")
 def _Relu6GradGrad(op, grad):
   x = op.inputs[1]
-  return (gen_nn_ops._relu6_grad(grad, x),
+  return (gen_nn_ops.relu6_grad(grad, x),
           array_ops.zeros(shape=array_ops.shape(x), dtype=x.dtype))
 
 
 @ops.RegisterGradient("Elu")
 def _EluGrad(op, grad):
-  return gen_nn_ops._elu_grad(grad, op.outputs[0])
+  return gen_nn_ops.elu_grad(grad, op.outputs[0])
 
 
 @ops.RegisterGradient("Selu")
 def _SeluGrad(op, grad):
-  return gen_nn_ops._selu_grad(grad, op.outputs[0])
+  return gen_nn_ops.selu_grad(grad, op.outputs[0])
 
 
 @ops.RegisterGradient("Softplus")
 def _SoftplusGrad(op, grad):
-  return gen_nn_ops._softplus_grad(grad, op.inputs[0])
+  return gen_nn_ops.softplus_grad(grad, op.inputs[0])
 
 
 @ops.RegisterGradient("SoftplusGrad")
 def _SoftplusGradGrad(op, grad):
   # Let:
   #   y = tf.nn.softplus(x)
-  #   dx = gen_nn_ops._softplus_grad(dy, x) = dy / (1 + exp(-x))
+  #   dx = gen_nn_ops.softplus_grad(dy, x) = dy / (1 + exp(-x))
   # This op computes (ddy, d2x) from op.inputs == [dy, x] and grad == ddx.
   dy, x = op.inputs
   with ops.control_dependencies([grad]):
-    ddy = gen_nn_ops._softplus_grad(grad, x)  # pylint: disable=protected-access
+    ddy = gen_nn_ops.softplus_grad(grad, x)
     d2x = grad * dy / (math_ops.exp(-x) + 2.0 + math_ops.exp(x))
     return (ddy, d2x)
 
 
 @ops.RegisterGradient("Softsign")
 def _SoftsignGrad(op, grad):
-  return gen_nn_ops._softsign_grad(grad, op.inputs[0])
+  return gen_nn_ops.softsign_grad(grad, op.inputs[0])
 
 
 @ops.RegisterGradient("ReluGrad")
 def _ReluGradGrad(op, grad):
   x = op.inputs[1]
-  return (gen_nn_ops._relu_grad(grad, x),
+  return (gen_nn_ops.relu_grad(grad, x),
           array_ops.zeros(shape=array_ops.shape(x), dtype=x.dtype))
 
 
@@ -456,7 +456,7 @@ def _SoftmaxCrossEntropyWithLogitsGrad(op, grad_loss, grad_grad):
 
   def IsZero(g):
     # Some introspection to check if the gradient is feeding zeros
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       # TODO(apassos) add an efficient way to detect eager zeros here.
       return False
     if g.op.type in ("ZerosLike", "Zeros"):
@@ -565,14 +565,14 @@ def _LRNGrad(op, grad):
   alpha = op.get_attr("alpha")
   beta = op.get_attr("beta")
   return [
-      gen_nn_ops._lrn_grad(grad, op.inputs[0], op.outputs[0], depth_radius,
-                           bias, alpha, beta)
+      gen_nn_ops.lrn_grad(grad, op.inputs[0], op.outputs[0], depth_radius, bias,
+                          alpha, beta)
   ]
 
 
 @ops.RegisterGradient("AvgPool")
 def _AvgPoolGrad(op, grad):
-  return gen_nn_ops._avg_pool_grad(
+  return gen_nn_ops.avg_pool_grad(
       array_ops.shape(op.inputs[0]),
       grad,
       op.get_attr("ksize"),
@@ -584,7 +584,7 @@ def _AvgPoolGrad(op, grad):
 @ops.RegisterGradient("AvgPoolGrad")
 def _AvgPoolGradGrad(op, grad):
   return (array_ops.stop_gradient(op.inputs[0]),
-          gen_nn_ops._avg_pool(
+          gen_nn_ops.avg_pool(
               grad,
               op.get_attr("ksize"),
               op.get_attr("strides"),
@@ -594,7 +594,7 @@ def _AvgPoolGradGrad(op, grad):
 
 @ops.RegisterGradient("MaxPool")
 def _MaxPoolGrad(op, grad):
-  return gen_nn_ops._max_pool_grad(
+  return gen_nn_ops.max_pool_grad(
       op.inputs[0],
       op.outputs[0],
       grad,
@@ -620,7 +620,7 @@ def _MaxPoolGradV2(op, grad):
 
 @ops.RegisterGradient("MaxPoolWithArgmax")
 def _MaxPoolGradWithArgmax(op, grad, unused_argmax_grad):
-  return gen_nn_ops._max_pool_grad_with_argmax(
+  return gen_nn_ops.max_pool_grad_with_argmax(
       op.inputs[0],
       grad,
       op.outputs[1],
@@ -635,7 +635,7 @@ def _MaxPoolGradGrad(op, grad):
       shape=array_ops.shape(op.inputs[0]), dtype=op.inputs[0].dtype),
           array_ops.zeros(
               shape=array_ops.shape(op.inputs[1]), dtype=op.inputs[1].dtype),
-          gen_nn_ops._max_pool_grad_grad(
+          gen_nn_ops.max_pool_grad_grad(
               op.inputs[0],
               op.inputs[1],
               grad,
@@ -669,7 +669,7 @@ def _MaxPoolGradGradGrad(op, grad):
       shape=array_ops.shape(op.inputs[0]), dtype=op.inputs[0].dtype),
           array_ops.zeros(
               shape=array_ops.shape(op.inputs[1]), dtype=op.inputs[1].dtype),
-          gen_nn_ops._max_pool_grad(
+          gen_nn_ops.max_pool_grad(
               op.inputs[0],
               op.inputs[1],
               grad,
@@ -696,8 +696,7 @@ def _FractionalMaxPoolGrad(op, grad_0, unused_grad_1, unused_grad_2):
   Returns:
     Input backprop for FractionalMaxPool op.
   """
-  # pylint: disable=protected-access
-  return gen_nn_ops._fractional_max_pool_grad(
+  return gen_nn_ops.fractional_max_pool_grad(
       op.inputs[0], op.outputs[0], grad_0, op.outputs[1], op.outputs[2],
       op.get_attr("overlapping"))
 
@@ -719,10 +718,9 @@ def _FractionalAvgPoolGrad(op, grad_0, unused_grad_1, unused_grad_2):
   Returns:
     Input backprop for FractionalAvgPool op.
   """
-  # pylint: disable=protected-access
-  return gen_nn_ops._fractional_avg_pool_grad(op.inputs[0].get_shape(), grad_0,
-                                              op.outputs[1], op.outputs[2],
-                                              op.get_attr("overlapping"))
+  return gen_nn_ops.fractional_avg_pool_grad(op.inputs[0].get_shape(), grad_0,
+                                             op.outputs[1], op.outputs[2],
+                                             op.get_attr("overlapping"))
 
 
 @ops.RegisterGradient("BatchNormWithGlobalNormalization")
@@ -746,7 +744,7 @@ def _BatchNormWithGlobalNormalizationGrad(op, grad):
         last dimension.
     dg: Backprop for gamma, which is (grad * ((x - m) * rsqrt(v + epsilon)))
   """
-  dx, dm, dv, db, dg = gen_nn_ops._batch_norm_with_global_normalization_grad(
+  dx, dm, dv, db, dg = gen_nn_ops.batch_norm_with_global_normalization_grad(
       op.inputs[0], op.inputs[1], op.inputs[2], op.inputs[4], grad,
       op.get_attr("variance_epsilon"), op.get_attr("scale_after_normalization"))
   return dx, dm, dv, db, dg
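
Review note on `_SoftplusGradGrad` in the nn_grad.py changes above: the comment there states `dx = dy / (1 + exp(-x))`, and the code returns `d2x = grad * dy / (exp(-x) + 2 + exp(x))`. The sketch below re-derives both outputs in plain Python and compares `d2x` against a finite difference. It does not use TensorFlow; the helper names `softplus_grad` and `softplus_grad_grad` are stand-ins chosen for illustration, not the generated op wrappers.

```python
import math


def sigmoid(x):
    # 1 / (1 + exp(-x)), the derivative of softplus(x) = log(1 + exp(x)).
    return 1.0 / (1.0 + math.exp(-x))


def softplus_grad(dy, x):
    # Stand-in for the first-order gradient: dx = dy / (1 + exp(-x)).
    return dy * sigmoid(x)


def softplus_grad_grad(ddx, dy, x):
    # Backprop of ddx through dx = dy * sigmoid(x):
    #   ddy = ddx * d(dx)/d(dy) = ddx * sigmoid(x)
    #   d2x = ddx * d(dx)/d(x)  = ddx * dy * sigmoid(x) * (1 - sigmoid(x))
    # and sigmoid(x) * (1 - sigmoid(x)) == 1 / (exp(-x) + 2 + exp(x)).
    ddy = softplus_grad(ddx, x)
    d2x = ddx * dy / (math.exp(-x) + 2.0 + math.exp(x))
    return ddy, d2x


def finite_difference_d2x(ddx, dy, x, h=1e-6):
    # Central difference of dx with respect to x, scaled by the incoming ddx.
    return ddx * (softplus_grad(dy, x + h) - softplus_grad(dy, x - h)) / (2 * h)
```

At `x = 0` the two outputs reduce to `ddx / 2` and `ddx * dy / 4`, which is a quick sanity check on the constant `2.0` in the denominator.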
diff --git a/tensorflow/python/ops/nn_impl.py b/tensorflow/python/ops/nn_impl.py
index 5fa5708114fd5cda6afbca78fa0debf68f0252cc..47cc4da7f2abd1f5b00e193a76c8391be94ca27d 100644
--- a/tensorflow/python/ops/nn_impl.py
+++ b/tensorflow/python/ops/nn_impl.py
@@ -303,12 +303,12 @@ def _swish_grad(features, grad):
 # @Defun decorator with noinline=True so that sigmoid(features) is re-computed
 # during backprop, and we can free the sigmoid(features) expression immediately
 # after use during the forward pass.
+@tf_export("nn.swish")
 @function.Defun(
     grad_func=_swish_grad,
     shape_func=_swish_shape,
     func_name="swish",
     noinline=True)
-@tf_export("nn.swish")
 def swish(features):
   # pylint: disable=g-doc-args
   """Computes the Swish activation function: `x * sigmoid(x)`.
@@ -888,12 +888,10 @@ def fused_batch_norm(
   # TODO(reedwm): In a few weeks, switch to using the V2 version exclusively. We
   # currently only use the V2 version for float16 inputs, which is not supported
   # by the V1 version.
-  # pylint: disable=protected-access
   if x.dtype == dtypes.float16 or x.dtype == dtypes.bfloat16:
-    fused_batch_norm_func = gen_nn_ops._fused_batch_norm_v2
+    fused_batch_norm_func = gen_nn_ops.fused_batch_norm_v2
   else:
-    fused_batch_norm_func = gen_nn_ops._fused_batch_norm
-  # pylint: enable=protected-access
+    fused_batch_norm_func = gen_nn_ops._fused_batch_norm  # pylint: disable=protected-access
   y, batch_mean, batch_var, _, _ = fused_batch_norm_func(
       x,
       scale,
diff --git a/tensorflow/python/ops/nn_ops.py b/tensorflow/python/ops/nn_ops.py
index 8fbe698914e5f2fa8f821feed82c33fc77e35e21..a74de39eab34a1a27df90f70adf0f4c68ec29465 100644
--- a/tensorflow/python/ops/nn_ops.py
+++ b/tensorflow/python/ops/nn_ops.py
@@ -29,6 +29,7 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import gen_nn_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
@@ -149,14 +150,12 @@ class _NonAtrousConvolution(object):
                                                               conv_dims))
     if conv_dims == 1:
       # conv1d uses the 2-d data format names
-      if data_format is None or data_format == "NWC":
-        data_format_2d = "NHWC"
-      elif data_format == "NCW":
-        data_format_2d = "NCHW"
-      else:
+      if data_format is None:
+        data_format = "NWC"
+      elif data_format not in {"NCW", "NWC", "NCHW", "NHWC"}:
         raise ValueError("data_format must be \"NWC\" or \"NCW\".")
       self.strides = strides[0]
-      self.data_format = data_format_2d
+      self.data_format = data_format
       self.conv_op = self._conv1d
     elif conv_dims == 2:
       if data_format is None or data_format == "NHWC":
@@ -698,7 +697,7 @@ def convolution(
   `padded_input` is obtained by zero padding the input using an effective
   spatial filter shape of `(spatial_filter_shape-1) * dilation_rate + 1` and
   output striding `strides` as described in the
-  @{tf.nn.convolution$comment here}.
+  @{$python/nn#Convolution$comment here}.
 
   In the case that `data_format` does start with `"NC"`, the `input` and output
   (but not the `filter`) are simply transposed as follows:
@@ -1042,9 +1041,7 @@ def pool(
 
 @tf_export("nn.atrous_conv2d")
 def atrous_conv2d(value, filters, rate, padding, name=None):
-  """Atrous convolution (a.k.a.
-
-  convolution with holes or dilated convolution).
+  """Atrous convolution (a.k.a. convolution with holes or dilated convolution).
 
   This function is a simpler wrapper around the more general
   @{tf.nn.convolution}, and exists only for backwards compatibility. You can
@@ -1481,7 +1478,6 @@ def conv3d_transpose(
         name=name)
 
 
-# pylint: disable=protected-access
 @tf_export("nn.bias_add")
 def bias_add(value, bias, data_format=None, name=None):
   """Adds `bias` to `value`.
@@ -1504,12 +1500,12 @@ def bias_add(value, bias, data_format=None, name=None):
     A `Tensor` with the same type as `value`.
   """
   with ops.name_scope(name, "BiasAdd", [value, bias]) as name:
-    value = ops.convert_to_tensor(value, name="input")
-    bias = ops.convert_to_tensor(bias, dtype=value.dtype, name="bias")
-    return gen_nn_ops._bias_add(value, bias, data_format=data_format, name=name)
+    if not context.executing_eagerly():
+      value = ops.convert_to_tensor(value, name="input")
+      bias = ops.convert_to_tensor(bias, dtype=value.dtype, name="bias")
+    return gen_nn_ops.bias_add(value, bias, data_format=data_format, name=name)
 
 
-# pylint: disable=protected-access
 def bias_add_v1(value, bias, name=None):
   """Adds `bias` to `value`.
 
@@ -1534,7 +1530,7 @@ def bias_add_v1(value, bias, name=None):
   with ops.name_scope(name, "BiasAddV1", [value, bias]) as name:
     value = ops.convert_to_tensor(value, name="input")
     bias = ops.convert_to_tensor(bias, dtype=value.dtype, name="bias")
-    return gen_nn_ops._bias_add_v1(value, bias, name=name)
+    return gen_nn_ops.bias_add_v1(value, bias, name=name)
 
 
 @tf_export("nn.crelu")
@@ -1580,7 +1576,7 @@ def relu6(features, name=None):
   """
   with ops.name_scope(name, "Relu6", [features]) as name:
     features = ops.convert_to_tensor(features, name="features")
-    return gen_nn_ops._relu6(features, name=name)
+    return gen_nn_ops.relu6(features, name=name)
 
 
 @tf_export("nn.leaky_relu")
@@ -1616,7 +1612,7 @@ def _flatten_outer_dims(logits):
   output = array_ops.reshape(logits, array_ops.concat([[-1], last_dim_size], 0))
 
   # Set output shape if known.
-  if context.in_graph_mode():
+  if not context.executing_eagerly():
     shape = logits.get_shape()
     if shape is not None and shape.dims is not None:
       shape = shape.as_list()
@@ -1645,7 +1641,7 @@ def _softmax(logits, compute_op, dim=-1, name=None):
   Args:
     logits: A non-empty `Tensor`. Must be one of the following types: `half`,
       `float32`, `float64`.
-    compute_op: Either gen_nn_ops._softmax or gen_nn_ops._log_softmax
+    compute_op: Either gen_nn_ops.softmax or gen_nn_ops.log_softmax
     dim: The dimension softmax would be performed on. The default is -1 which
       indicates the last dimension.
     name: A name for the operation (optional).
@@ -1739,7 +1735,7 @@ def softmax(logits, axis=None, name=None, dim=None):
   axis = deprecation.deprecated_argument_lookup("axis", axis, "dim", dim)
   if axis is None:
     axis = -1
-  return _softmax(logits, gen_nn_ops._softmax, axis, name)
+  return _softmax(logits, gen_nn_ops.softmax, axis, name)
 
 
 @tf_export("nn.log_softmax")
@@ -1769,7 +1765,7 @@ def log_softmax(logits, axis=None, name=None, dim=None):
   axis = deprecation.deprecated_argument_lookup("axis", axis, "dim", dim)
   if axis is None:
     axis = -1
-  return _softmax(logits, gen_nn_ops._log_softmax, axis, name)
+  return _softmax(logits, gen_nn_ops.log_softmax, axis, name)
 
 
 def _ensure_xent_args(name, sentinel, labels, logits):
@@ -1871,7 +1867,7 @@ def softmax_cross_entropy_with_logits_v2(
     # Do the actual op computation.
     # The second output tensor contains the gradients.  We use it in
     # _CrossEntropyGrad() in nn_grad but not here.
-    cost, unused_backprop = gen_nn_ops._softmax_cross_entropy_with_logits(
+    cost, unused_backprop = gen_nn_ops.softmax_cross_entropy_with_logits(
         precise_logits, labels, name=name)
 
     # The output cost shape should be the input minus dim.
@@ -1881,7 +1877,8 @@ def softmax_cross_entropy_with_logits_v2(
 
     # Make shape inference work since reshape and transpose may erase its static
     # shape.
-    if context.in_graph_mode() and shape is not None and shape.dims is not None:
+    if (not context.executing_eagerly() and shape is not None and
+        shape.dims is not None):
       shape = shape.as_list()
       del shape[dim]
       cost.set_shape(shape)
@@ -2027,6 +2024,9 @@ def sparse_softmax_cross_entropy_with_logits(
     # Store label shape for result later.
     labels_static_shape = labels.get_shape()
     labels_shape = array_ops.shape(labels)
+    static_shapes_fully_defined = (
+        labels_static_shape.is_fully_defined() and
+        logits.get_shape()[:-1].is_fully_defined())
     if logits.get_shape().ndims is not None and logits.get_shape().ndims == 0:
       raise ValueError(
           "Logits cannot be scalars - received shape %s." % logits.get_shape())
@@ -2036,29 +2036,44 @@ def sparse_softmax_cross_entropy_with_logits(
       raise ValueError("Rank mismatch: Rank of labels (received %s) should "
                        "equal rank of logits minus 1 (received %s)." %
                        (labels_static_shape.ndims, logits.get_shape().ndims))
+    if (static_shapes_fully_defined and
+        labels_static_shape != logits.get_shape()[:-1]):
+      raise ValueError("Shape mismatch: The shape of labels (received %s) "
+                       "should equal the shape of logits except for the last "
+                       "dimension (received %s)." % (labels_static_shape,
+                                                     logits.get_shape()))
     # Check if no reshapes are required.
     if logits.get_shape().ndims == 2:
-      cost, _ = gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
+      cost, _ = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
           precise_logits, labels, name=name)
       if logits.dtype == dtypes.float16:
         return math_ops.cast(cost, dtypes.float16)
       else:
         return cost
 
-    # Reshape logits to 2 dim, labels to 1 dim.
-    num_classes = array_ops.shape(logits)[array_ops.rank(logits) - 1]
-    precise_logits = array_ops.reshape(precise_logits, [-1, num_classes])
-    labels = array_ops.reshape(labels, [-1])
-    # The second output tensor contains the gradients.  We use it in
-    # _CrossEntropyGrad() in nn_grad but not here.
-    cost, _ = gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
-        precise_logits, labels, name=name)
-    cost = array_ops.reshape(cost, labels_shape)
-    cost.set_shape(labels_static_shape)
-    if logits.dtype == dtypes.float16:
-      return math_ops.cast(cost, dtypes.float16)
-    else:
-      return cost
+    # Perform a check of the dynamic shapes if the static shapes are not fully
+    # defined.
+    shape_checks = []
+    if not static_shapes_fully_defined:
+      shape_checks.append(
+          check_ops.assert_equal(
+              array_ops.shape(labels),
+              array_ops.shape(logits)[:-1]))
+    with ops.control_dependencies(shape_checks):
+      # Reshape logits to 2 dim, labels to 1 dim.
+      num_classes = array_ops.shape(logits)[array_ops.rank(logits) - 1]
+      precise_logits = array_ops.reshape(precise_logits, [-1, num_classes])
+      labels = array_ops.reshape(labels, [-1])
+      # The second output tensor contains the gradients.  We use it in
+      # _CrossEntropyGrad() in nn_grad but not here.
+      cost, _ = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
+          precise_logits, labels, name=name)
+      cost = array_ops.reshape(cost, labels_shape)
+      cost.set_shape(labels_static_shape)
+      if logits.dtype == dtypes.float16:
+        return math_ops.cast(cost, dtypes.float16)
+      else:
+        return cost
 
 
 @tf_export("nn.avg_pool")
@@ -2086,7 +2101,7 @@ def avg_pool(value, ksize, strides, padding, data_format="NHWC", name=None):
   """
   with ops.name_scope(name, "AvgPool", [value]) as name:
     value = ops.convert_to_tensor(value, name="input")
-    return gen_nn_ops._avg_pool(
+    return gen_nn_ops.avg_pool(
         value,
         ksize=ksize,
         strides=strides,
@@ -2116,12 +2131,13 @@ def max_pool(value, ksize, strides, padding, data_format="NHWC", name=None):
   """
   with ops.name_scope(name, "MaxPool", [value]) as name:
     value = ops.convert_to_tensor(value, name="input")
-    return gen_nn_ops._max_pool(value,
-                                ksize=ksize,
-                                strides=strides,
-                                padding=padding,
-                                data_format=data_format,
-                                name=name)
+    return gen_nn_ops.max_pool(
+        value,
+        ksize=ksize,
+        strides=strides,
+        padding=padding,
+        data_format=data_format,
+        name=name)
 
 
 @ops.RegisterStatistics("Conv2D", "flops")
@@ -2299,7 +2315,7 @@ def dropout(x, keep_prob, noise_shape=None, seed=None, name=None):  # pylint: di
     # 0. if [keep_prob, 1.0) and 1. if [1.0, 1.0 + keep_prob)
     binary_tensor = math_ops.floor(random_tensor)
     ret = math_ops.div(x, keep_prob) * binary_tensor
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       ret.set_shape(x.get_shape())
     return ret
 
@@ -2331,7 +2347,7 @@ def top_k(input, k=1, sorted=True, name=None):  # pylint: disable=redefined-buil
     values: The `k` largest elements along each last dimensional slice.
     indices: The indices of `values` within the last dimension of `input`.
   """
-  return gen_nn_ops._top_kv2(input, k=k, sorted=sorted, name=name)
+  return gen_nn_ops.top_kv2(input, k=k, sorted=sorted, name=name)
 
 
 def nth_element(input, n, reverse=False, name=None):  # pylint: disable=redefined-builtin
@@ -2650,4 +2666,4 @@ def in_top_k(predictions, targets, k, name=None):
     A `Tensor` of type `bool`. Computed Precision at `k` as a `bool Tensor`.
   """
   with ops.name_scope(name, "in_top_k"):
-    return gen_nn_ops._in_top_kv2(predictions, targets, k, name=name)
+    return gen_nn_ops.in_top_kv2(predictions, targets, k, name=name)
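
Review note on the shape check added to `sparse_softmax_cross_entropy_with_logits` in the nn_ops.py changes above: the labels shape is compared with the logits shape minus its last dimension, statically when both are fully defined and otherwise through a run-time `check_ops.assert_equal`. The sketch below models that two-stage decision in plain Python. It assumes shapes are tuples with `None` for unknown dimensions, and it raises `RuntimeError` as a stand-in for the run-time assertion failure; the function names are hypothetical.

```python
def is_fully_defined(shape):
    # A static shape is fully defined when no dimension is unknown (None).
    return all(dim is not None for dim in shape)


def check_sparse_xent_shapes(labels_static, logits_static,
                             labels_runtime, logits_runtime):
    """Returns which stage validated the shapes: "static" or "dynamic".

    labels_static / logits_static: tuples that may contain None.
    labels_runtime / logits_runtime: the concrete shapes seen at run time.
    """
    static_shapes_fully_defined = (
        is_fully_defined(labels_static) and
        is_fully_defined(logits_static[:-1]))

    if static_shapes_fully_defined:
        # Mirrors the new ValueError raised while the graph is being built.
        if tuple(labels_static) != tuple(logits_static[:-1]):
            raise ValueError(
                "Shape mismatch: The shape of labels (received %s) should "
                "equal the shape of logits except for the last dimension "
                "(received %s)." % (labels_static, logits_static))
        return "static"

    # Stand-in for the assert_equal op that guards the reshape at run time.
    if tuple(labels_runtime) != tuple(logits_runtime[:-1]):
        raise RuntimeError("labels and logits[:-1] differ at run time")
    return "dynamic"
```

Only the batch-like dimensions need to be known for the static path; the last dimension of the logits (the number of classes) may stay unknown.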
diff --git a/tensorflow/python/ops/nn_test.py b/tensorflow/python/ops/nn_test.py
index 21eea3db25af0d1bcfbc7496665f5535c3f660ea..af9dae2aa64f0994f403ac81dcba800699d3c960 100644
--- a/tensorflow/python/ops/nn_test.py
+++ b/tensorflow/python/ops/nn_test.py
@@ -1049,6 +1049,22 @@ class DataFormatVectorPermuteTest(test_lib.TestCase):
       y_val = sess.run(y)
       self.assertAllEqual(y_val, [7, 9, 3, 4])
 
+  def testNHWCToHWNC(self):
+    x_val = [7, 4, 9, 3]
+    x = constant_op.constant(x_val)
+    y = nn_ops.data_format_vec_permute(x, src_format="NHWC", dst_format="HWNC")
+    with self.test_session(use_gpu=test_lib.is_gpu_available()) as sess:
+      y_val = sess.run(y)
+      self.assertAllEqual(y_val, [4, 9, 7, 3])
+
+  def testHWNCToNHWC(self):
+    x_val = [7, 4, 9, 3]
+    x = constant_op.constant(x_val)
+    y = nn_ops.data_format_vec_permute(x, src_format="HWNC", dst_format="NHWC")
+    with self.test_session(use_gpu=test_lib.is_gpu_available()) as sess:
+      y_val = sess.run(y)
+      self.assertAllEqual(y_val, [9, 7, 4, 3])
+
   def testNHWCToNCHW2D(self):
     x_val = [[7, 4], [9, 3], [4, 5], [5, 1]]
     x = constant_op.constant(x_val)
@@ -1057,6 +1073,22 @@ class DataFormatVectorPermuteTest(test_lib.TestCase):
       y_val = sess.run(y)
       self.assertAllEqual(y_val, [[7, 4], [5, 1], [9, 3], [4, 5]])
 
+  def testNHWCToHWNC2D(self):
+    x_val = [[7, 4], [9, 3], [4, 5], [5, 1]]
+    x = constant_op.constant(x_val)
+    y = nn_ops.data_format_vec_permute(x, src_format="NHWC", dst_format="HWNC")
+    with self.test_session(use_gpu=test_lib.is_gpu_available()) as sess:
+      y_val = sess.run(y)
+      self.assertAllEqual(y_val, [[9, 3], [4, 5], [7, 4], [5, 1]])
+
+  def testHWNCToNHWC2D(self):
+    x_val = [[7, 4], [9, 3], [4, 5], [5, 1]]
+    x = constant_op.constant(x_val)
+    y = nn_ops.data_format_vec_permute(x, src_format="HWNC", dst_format="NHWC")
+    with self.test_session(use_gpu=test_lib.is_gpu_available()) as sess:
+      y_val = sess.run(y)
+      self.assertAllEqual(y_val, [[4, 5], [7, 4], [9, 3], [5, 1]])
+
   def testNCHWToNHWC2D(self):
     x_val = [[7, 4], [9, 3], [4, 5], [5, 1]]
     x = constant_op.constant(x_val)
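
Review note on the new `HWNC` cases in the nn_test.py changes above: the expected values follow from treating each entry of the input as belonging to the axis letter at the same position in `src_format`, then emitting entries in `dst_format` order. The plain-Python sketch below reproduces that rule without TensorFlow. `permute_by_format` is a hypothetical helper for illustration, not the `data_format_vec_permute` op itself.

```python
def permute_by_format(values, src_format, dst_format):
    # Both formats must name the same set of axes, each exactly once.
    if sorted(src_format) != sorted(dst_format):
        raise ValueError("src_format and dst_format must use the same axes")
    if len(values) != len(src_format):
        raise ValueError("values must have one entry per axis")
    # Map each axis letter to its position in the source layout.
    index_of = {axis: i for i, axis in enumerate(src_format)}
    # Emit the entries in destination order; rows of a 2-D input move whole.
    return [values[index_of[axis]] for axis in dst_format]


vector = [7, 4, 9, 3]
matrix = [[7, 4], [9, 3], [4, 5], [5, 1]]
print(permute_by_format(vector, "NHWC", "HWNC"))  # [4, 9, 7, 3]
print(permute_by_format(vector, "HWNC", "NHWC"))  # [9, 7, 4, 3]
print(permute_by_format(matrix, "NHWC", "HWNC"))  # [[9, 3], [4, 5], [7, 4], [5, 1]]
```

The printed values match the constants asserted in `testNHWCToHWNC`, `testHWNCToNHWC`, and `testNHWCToHWNC2D`.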
diff --git a/tensorflow/python/ops/numerics.py b/tensorflow/python/ops/numerics.py
index b4ce1cbf25346412e2781a520b7e2cdcf720bcd5..d348e47f57b703138aabfc3463e750b795113335 100644
--- a/tensorflow/python/ops/numerics.py
+++ b/tensorflow/python/ops/numerics.py
@@ -74,7 +74,7 @@ def add_check_numerics_ops():
   the checked operations.
   @enc_compatibility
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError(
         "add_check_numerics_ops() is not compatible with eager execution. "
         "To check for Inf's and NaN's under eager execution, call "
diff --git a/tensorflow/python/ops/parsing_ops.py b/tensorflow/python/ops/parsing_ops.py
index b0315ceee268be8ac1813dae5a262a7d9496e154..075b38d743d13329e646c0b268e938b5c5704e47 100644
--- a/tensorflow/python/ops/parsing_ops.py
+++ b/tensorflow/python/ops/parsing_ops.py
@@ -700,8 +700,7 @@ def _parse_example_raw(serialized,
     # Finally, convert dense_shapes to TensorShapeProto
     dense_shapes = [shape.as_proto() for shape in dense_shapes]
 
-    # pylint: disable=protected-access
-    outputs = gen_parsing_ops._parse_example(
+    outputs = gen_parsing_ops.parse_example(
         serialized=serialized,
         names=names,
         dense_defaults=dense_defaults_vec,
@@ -710,7 +709,6 @@ def _parse_example_raw(serialized,
         dense_keys=dense_keys,
         dense_shapes=dense_shapes,
         name=name)
-    # pylint: enable=protected-access
 
     (sparse_indices, sparse_values, sparse_shapes, dense_values) = outputs
 
@@ -1132,8 +1130,7 @@ def _parse_single_sequence_example_raw(serialized,
     feature_list_dense_shapes = [tensor_shape.as_shape(shape).as_proto()
                                  for shape in feature_list_dense_shapes]
 
-    # pylint: disable=protected-access
-    outputs = gen_parsing_ops._parse_single_sequence_example(
+    outputs = gen_parsing_ops.parse_single_sequence_example(
         serialized=serialized,
         debug_name=debug_name,
         context_dense_defaults=context_dense_defaults_vec,
@@ -1149,7 +1146,6 @@ def _parse_single_sequence_example_raw(serialized,
         feature_list_dense_missing_assumed_empty=(
             feature_list_dense_missing_assumed_empty),
         name=name)
-    # pylint: enable=protected-access
 
     (context_sparse_indices, context_sparse_values,
      context_sparse_shapes, context_dense_values,
@@ -1182,7 +1178,6 @@ def _parse_single_sequence_example_raw(serialized,
 @tf_export("decode_csv")
 def decode_csv(records, record_defaults, field_delim=",",
                use_quote_delim=True, name=None, na_value=""):
-  # pylint: disable=protected-access
   """Convert CSV records to tensors. Each column maps to one tensor.
 
   RFC 4180 format is expected for the CSV records.
@@ -1211,11 +1206,13 @@ def decode_csv(records, record_defaults, field_delim=",",
     Each tensor will have the same shape as records.
   """
   # TODO(martinwicke), remove the wrapper when new Python API generator is done.
-  return gen_parsing_ops._decode_csv(
-      records=records, record_defaults=record_defaults,
-      field_delim=field_delim, use_quote_delim=use_quote_delim,
-      na_value=na_value, name=name)
-  # pylint: enable=protected-access
+  return gen_parsing_ops.decode_csv(
+      records=records,
+      record_defaults=record_defaults,
+      field_delim=field_delim,
+      use_quote_delim=use_quote_delim,
+      na_value=na_value,
+      name=name)
 
 
 # TODO(b/70890287): Combine the implementation of this op and
@@ -1391,7 +1388,6 @@ def _parse_single_example_v2_raw(serialized, sparse_keys, sparse_types,
     # Finally, convert dense_shapes to TensorShapeProto
     dense_shapes = [shape.as_proto() for shape in dense_shapes]
 
-    # pylint: disable=protected-access
     outputs = gen_parsing_ops.parse_single_example(
         serialized=serialized,
         dense_defaults=dense_defaults_vec,
@@ -1401,7 +1397,6 @@ def _parse_single_example_v2_raw(serialized, sparse_keys, sparse_types,
         dense_keys=dense_keys,
         dense_shapes=dense_shapes,
         name=name)
-    # pylint: enable=protected-access
 
     (sparse_indices, sparse_values, sparse_shapes, dense_values) = outputs
 
diff --git a/tensorflow/python/ops/random_ops.py b/tensorflow/python/ops/random_ops.py
index 2c86358d21b1c280b8d7ade625fd4b7a44c5de26..6a2dd3f1cd55eea1d3b652a31cd2784c411c2ce0 100644
--- a/tensorflow/python/ops/random_ops.py
+++ b/tensorflow/python/ops/random_ops.py
@@ -43,7 +43,6 @@ def _ShapeTensor(shape):
   return ops.convert_to_tensor(shape, dtype=dtype, name="shape")
 
 
-# pylint: disable=protected-access
 @tf_export("random_normal")
 def random_normal(shape,
                   mean=0.0,
@@ -74,7 +73,7 @@ def random_normal(shape,
     mean_tensor = ops.convert_to_tensor(mean, dtype=dtype, name="mean")
     stddev_tensor = ops.convert_to_tensor(stddev, dtype=dtype, name="stddev")
     seed1, seed2 = random_seed.get_seed(seed)
-    rnd = gen_random_ops._random_standard_normal(
+    rnd = gen_random_ops.random_standard_normal(
         shape_tensor, dtype, seed=seed1, seed2=seed2)
     mul = rnd * stddev_tensor
     value = math_ops.add(mul, mean_tensor, name=name)
@@ -126,7 +125,7 @@ def parameterized_truncated_normal(shape,
     minvals_tensor = ops.convert_to_tensor(minvals, dtype=dtype, name="minvals")
     maxvals_tensor = ops.convert_to_tensor(maxvals, dtype=dtype, name="maxvals")
     seed1, seed2 = random_seed.get_seed(seed)
-    rnd = gen_random_ops._parameterized_truncated_normal(
+    rnd = gen_random_ops.parameterized_truncated_normal(
         shape_tensor,
         means_tensor,
         stddevs_tensor,
@@ -171,7 +170,7 @@ def truncated_normal(shape,
     mean_tensor = ops.convert_to_tensor(mean, dtype=dtype, name="mean")
     stddev_tensor = ops.convert_to_tensor(stddev, dtype=dtype, name="stddev")
     seed1, seed2 = random_seed.get_seed(seed)
-    rnd = gen_random_ops._truncated_normal(
+    rnd = gen_random_ops.truncated_normal(
         shape_tensor, dtype, seed=seed1, seed2=seed2)
     mul = rnd * stddev_tensor
     value = math_ops.add(mul, mean_tensor, name=name)
@@ -210,7 +209,7 @@ def random_uniform(shape,
     maxval: A 0-D Tensor or Python value of type `dtype`. The upper bound on
       the range of random values to generate.  Defaults to 1 if `dtype` is
       floating point.
-    dtype: The type of the output: 'float16`, `float32`, `float64`, `int32`,
+    dtype: The type of the output: `float16`, `float32`, `float64`, `int32`,
       or `int64`.
     seed: A Python integer. Used to create a random seed for the distribution.
       See @{tf.set_random_seed}
@@ -237,11 +236,10 @@ def random_uniform(shape,
     maxval = ops.convert_to_tensor(maxval, dtype=dtype, name="max")
     seed1, seed2 = random_seed.get_seed(seed)
     if dtype.is_integer:
-      return gen_random_ops._random_uniform_int(
+      return gen_random_ops.random_uniform_int(
           shape, minval, maxval, seed=seed1, seed2=seed2, name=name)
     else:
-      rnd = gen_random_ops._random_uniform(
-          shape, dtype, seed=seed1, seed2=seed2)
+      rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
       return math_ops.add(rnd * (maxval - minval), minval, name=name)
 
 
@@ -275,7 +273,7 @@ def random_shuffle(value, seed=None, name=None):
     dimension.
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_random_ops._random_shuffle(
+  return gen_random_ops.random_shuffle(
       value, seed=seed1, seed2=seed2, name=name)
 
 
@@ -420,7 +418,7 @@ def random_gamma(shape,
     seed1, seed2 = random_seed.get_seed(seed)
     return math_ops.maximum(
         np.finfo(dtype.as_numpy_dtype).tiny,
-        gen_random_ops._random_gamma(
+        gen_random_ops.random_gamma(
             shape, alpha_broadcast, seed=seed1, seed2=seed2) / beta)
 
 ops.NotDifferentiable("RandomGamma")
diff --git a/tensorflow/python/ops/resource_variable_ops.py b/tensorflow/python/ops/resource_variable_ops.py
index 2d6d0672e03d9435175b0accd7c20dfddae16bcc..df873da98e7fac7accc99a229ffb53a60a74c9bb 100644
--- a/tensorflow/python/ops/resource_variable_ops.py
+++ b/tensorflow/python/ops/resource_variable_ops.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 
 from tensorflow.core.framework import attr_value_pb2
 from tensorflow.core.framework import variable_pb2
+from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.eager import context
 from tensorflow.python.eager import tape
 from tensorflow.python.framework import dtypes
@@ -30,6 +31,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gen_array_ops
 from tensorflow.python.ops import gen_resource_variable_ops
 from tensorflow.python.ops import gen_state_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import variables
 # go/tf-wildcard-import
 # pylint: disable=wildcard-import
@@ -44,10 +46,6 @@ def _eager_safe_variable_handle(shape, dtype, shared_name, name, graph_mode):
   container = ops.get_default_graph()._container  # pylint: disable=protected-access
   if container is None:
     container = ""
-  if not graph_mode:
-    # When in eager mode use a uid for the shared_name, to prevent accidental
-    # sharing.
-    shared_name = str(ops.uid())
   handle = gen_resource_variable_ops.var_handle_op(shape=shape, dtype=dtype,
                                                    shared_name=shared_name,
                                                    name=name,
@@ -133,10 +131,10 @@ class EagerResourceDeleter(object):
       # valid, and so on. Printing warnings in these cases is silly
       # (exceptions raised from __del__ are printed as warnings to stderr).
       pass  # 'NoneType' object is not callable when the handle has been
-            # partially unloaded.
+      # partially unloaded.
     except AttributeError:
       pass  # 'NoneType' object has no attribute 'eager_mode' when context has
-            # been unloaded. Will catch other module unloads as well.
+      # been unloaded. Will catch other module unloads as well.
 
 
 def shape_safe_assign_variable_handle(handle, shape, value, name=None):
@@ -151,7 +149,7 @@ def shape_safe_assign_variable_handle(handle, shape, value, name=None):
 class ResourceVariable(variables.Variable):
   """Variable based on resource handles.
 
-  See the ${variables} documentation for more details.
+  See the @{$variables$Variables How To} for a high-level overview.
 
   A `ResourceVariable` allows you to maintain state across subsequent calls to
   session.run.
@@ -181,24 +179,20 @@ class ResourceVariable(variables.Variable):
   by edges in the graph. Consider the following example, in which two writes
   can cause tf.Variable and tf.ResourceVariable to behave differently:
 
-   ```python
-    a = tf.ResourceVariable(1.0)
-    a.initializer.run()
-
-    assign = a.assign(2.0)
-    with tf.control_dependencies([assign]):
-      b = a.read_value()
-    with tf.control_dependencies([b]):
-      other_assign = a.assign(3.0)
-    with tf.control_dependencies([other_assign]):
-      # Will print 2.0 because the value was read before other_assign ran. If
-      # `a` was a tf.Variable instead, 2.0 or 3.0 could be printed.
-      tf.Print(b, [b]).eval()
+  ```python
+  a = tf.ResourceVariable(1.0)
+  a.initializer.run()
+
+  assign = a.assign(2.0)
+  with tf.control_dependencies([assign]):
+    b = a.read_value()
+  with tf.control_dependencies([b]):
+    other_assign = a.assign(3.0)
+  with tf.control_dependencies([other_assign]):
+    # Will print 2.0 because the value was read before other_assign ran. If
+    # `a` was a tf.Variable instead, 2.0 or 3.0 could be printed.
+    tf.Print(b, [b]).eval()
   ```
-
-  To enforce these consistency properties tf.ResourceVariable might make more
-  copies than an equivalent tf.Variable under the hood, so tf.Variable is still
-  not deprecated.
   """
 
   def __init__(self,
@@ -265,9 +259,9 @@ class ResourceVariable(variables.Variable):
       if initial_value is not None:
         raise ValueError("variable_def and initial_value are mutually "
                          "exclusive.")
-      if not context.in_graph_mode():
-        raise ValueError("Creating ResourceVariable from variable_def"
-                         " only supported in GRAPH mode.")
+      if context.executing_eagerly():
+        raise ValueError("Creating ResourceVariable from variable_def is "
+                         "not supported when eager execution is enabled.")
       self._init_from_proto(variable_def, import_scope=import_scope)
     else:
       self._init_from_args(
@@ -361,11 +355,17 @@ class ResourceVariable(variables.Variable):
     # this graph.
     self._graph_key = ops.get_default_graph()._graph_key  # pylint: disable=protected-access
     with ops.init_scope():
-      self._in_graph_mode = context.in_graph_mode()
+      self._in_graph_mode = not context.executing_eagerly()
       with ops.name_scope(name, "Variable", []
                           if init_from_fn else [initial_value]) as name:
         # pylint: disable=protected-access
         handle_name = ops._name_from_scope_name(name)
+        if self._in_graph_mode:
+          shared_name = handle_name
+        else:
+          # When in eager mode, append a uid to the handle name to form the
+          # shared_name, to prevent accidental sharing.
+          shared_name = "%s_%d" % (handle_name, ops.uid())
         if init_from_fn:
           # Use attr_scope and device(None) to simulate the behavior of
           # colocate_with when the variable we want to colocate with doesn't
@@ -381,12 +381,9 @@ class ResourceVariable(variables.Variable):
               self._handle = _eager_safe_variable_handle(
                   shape=initial_value.get_shape(),
                   dtype=initial_value.dtype.base_dtype,
-                  shared_name=handle_name,
+                  shared_name=shared_name,
                   name=name,
                   graph_mode=self._in_graph_mode)
-              self._handle_device = (
-                  self._handle.device if self._in_graph_mode else
-                  context.get_default_context().device_name)
               self._shape = initial_value.get_shape()
           else:
             initial_value = initial_value()
@@ -396,12 +393,9 @@ class ResourceVariable(variables.Variable):
             self._handle = _eager_safe_variable_handle(
                 shape=initial_value.get_shape(),
                 dtype=initial_value.dtype.base_dtype,
-                shared_name=handle_name,
+                shared_name=shared_name,
                 name=name,
                 graph_mode=False)
-            self._handle_device = (
-                self._handle.device if self._in_graph_mode else
-                context.get_default_context().device_name)
             self._shape = initial_value.get_shape()
         # pylint: enable=protected-access
 
@@ -422,13 +416,12 @@ class ResourceVariable(variables.Variable):
           self._handle = _eager_safe_variable_handle(
               shape=initial_value.get_shape(),
               dtype=initial_value.dtype.base_dtype,
-              shared_name=handle_name,
+              shared_name=shared_name,
               name=name,
               graph_mode=self._in_graph_mode)
-          self._handle_device = (self._handle.device if self._in_graph_mode else
-                                 context.get_default_context().device_name)
           self._shape = initial_value.get_shape()
 
+        self._unique_id = shared_name
         self._initial_value = initial_value if self._in_graph_mode else None
         self._handle_name = handle_name + ":0"
         self._dtype = initial_value.dtype.base_dtype
@@ -449,7 +442,7 @@ class ResourceVariable(variables.Variable):
           with ops.name_scope("Read"), ops.colocate_with(self._handle):
             # Manually assign reads to the handle's device to avoid log
             # messages.
-            with ops.device(self._handle_device):
+            with ops.device(self._handle.device):
               value = self._read_variable_op()
             self._graph_element = value
             if caching_device is not None:
@@ -476,7 +469,7 @@ class ResourceVariable(variables.Variable):
               self._cached_value = self._read_variable_op()
           else:
             self._cached_value = None
-        if context.in_graph_mode():
+        if not context.executing_eagerly():
           ops.add_to_collections(collections, self)
         elif ops.GraphKeys.GLOBAL_STEP in collections:
           ops.add_to_collections(ops.GraphKeys.GLOBAL_STEP, self)
@@ -489,12 +482,13 @@ class ResourceVariable(variables.Variable):
       # cycles being uncollectable, and means that no __del__ will be defined at
       # all in graph mode.
       self._handle_deleter = EagerResourceDeleter(
-          handle=self._handle, handle_device=self._handle_device)
+          handle=self._handle, handle_device=self._handle.device)
+    self._cached_shape_as_list = None
 
   def _init_from_proto(self, variable_def, import_scope=None):
     """Initializes from `VariableDef` proto."""
     # Note that init_from_proto is currently not supported in Eager mode.
-    assert context.in_graph_mode()
+    assert not context.executing_eagerly()
     self._in_graph_mode = True
     assert isinstance(variable_def, variable_pb2.VariableDef)
     if not variable_def.is_resource:
@@ -507,8 +501,8 @@ class ResourceVariable(variables.Variable):
             variable_def.variable_name, import_scope=import_scope))
     self._shape = tensor_shape.TensorShape(
         self._handle.op.get_attr("shape"))
-    self._handle_device = self._handle.device
     self._handle_name = self._handle.name
+    self._unique_id = self._handle_name
     self._initializer_op = g.as_graph_element(
         ops.prepend_name_scope(
             variable_def.initializer_name, import_scope=import_scope))
@@ -534,8 +528,10 @@ class ResourceVariable(variables.Variable):
       self._save_slice_info = None
     self._caching_device = None
     self._dtype = dtypes.as_dtype(self._handle.op.get_attr("dtype"))
-    self._graph_element = self.value()
+    self._graph_element = g.get_tensor_by_name(
+        self._handle.op.name + "/Read/ReadVariableOp:0")
     self._constraint = None
+    self._cached_shape_as_list = None
 
   def __nonzero__(self):
     return self.__bool__()
@@ -551,7 +547,7 @@ class ResourceVariable(variables.Variable):
   @property
   def device(self):
     """The device this variable is on."""
-    return self._handle_device
+    return self._handle.device
 
   @property
   def graph(self):
@@ -568,11 +564,26 @@ class ResourceVariable(variables.Variable):
     """The shape of this variable."""
     return self._shape
 
+  def _shape_as_list(self):
+    if self._cached_shape_as_list is not None:
+      return self._cached_shape_as_list
+    if self.shape.ndims is None:
+      return None
+    self._cached_shape_as_list = [dim.value for dim in self.shape.dims]
+    return self._cached_shape_as_list
+
+  def _shape_tuple(self):
+    shape = self._shape_as_list()
+    if shape is None:
+      return None
+    return tuple(shape)
+
   @property
   def create(self):
     """The op responsible for initializing this variable."""
     if not self._in_graph_mode:
-      raise RuntimeError("Calling create in EAGER mode not supported.")
+      raise RuntimeError("Calling create is not supported when eager execution"
+                         " is enabled.")
     return self._initializer_op
 
   @property
@@ -585,7 +596,7 @@ class ResourceVariable(variables.Variable):
     if self._cached_value is not None:
       return self._cached_value
     with ops.colocate_with(None, ignore_existing=True):
-      with ops.device(self._handle_device):
+      with ops.device(self._handle.device):
         return self._read_variable_op()
 
   def _as_graph_element(self):
@@ -600,7 +611,7 @@ class ResourceVariable(variables.Variable):
   @property
   def initial_value(self):
     """Returns the Tensor used as the initial value for the variable."""
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError("initial_value not supported in EAGER mode.")
     return self._initial_value
 
@@ -621,15 +632,15 @@ class ResourceVariable(variables.Variable):
 
   def eval(self, session=None):
     """Evaluates and returns the value of this variable."""
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError("Trying to eval in EAGER mode")
     return self._graph_element.eval(session=session)
 
   def numpy(self):
-    if context.in_graph_mode():
-      raise NotImplementedError(
-          "numpy() is only available when eager execution is enabled.")
-    return self.read_value().numpy()
+    if context.executing_eagerly():
+      return self.read_value().numpy()
+    raise NotImplementedError(
+        "numpy() is only available when eager execution is enabled.")
 
   def count_up_to(self, limit):
     """Increments this variable until it reaches `limit`.
@@ -682,7 +693,7 @@ class ResourceVariable(variables.Variable):
     """
     with ops.name_scope("Read"):
       # Ensure we read the variable in the same device as the handle.
-      with ops.device(self._handle_device):
+      with ops.device(self._handle.device):
         value = self._read_variable_op()
     # Return an identity so it can get placed on whatever device the context
     # specifies instead of the device where the variable is.
@@ -710,7 +721,7 @@ class ResourceVariable(variables.Variable):
       A `VariableDef` protocol buffer, or `None` if the `Variable` is not
       in the specified name scope.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError("to_proto not supported in EAGER mode.")
     if export_scope is None or self.handle.name.startswith(export_scope):
       var_def = variable_pb2.VariableDef()
@@ -737,7 +748,7 @@ class ResourceVariable(variables.Variable):
 
   @staticmethod
   def from_proto(variable_def, import_scope=None):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError("from_proto not supported in EAGER mode.")
     return ResourceVariable(
         variable_def=variable_def, import_scope=import_scope)
@@ -788,37 +799,84 @@ class ResourceVariable(variables.Variable):
 
   __array_priority__ = 100
 
-  def assign_sub(self, delta, use_locking=None, name=None):
+  def assign_sub(self, delta, use_locking=None, name=None, read_value=True):
+    """Subtracts a value from this variable.
+
+    Args:
+      delta: A `Tensor`. The value to subtract from this variable.
+      use_locking: If `True`, use locking during the operation.
+      name: The name to use for the operation.
+      read_value: A `bool`. Whether to read and return the new value of the
+          variable or not.
+
+    Returns:
+      If `read_value` is `True`, this method will return the new value of the
+      variable after the assignment has completed. Otherwise, when in graph mode
+      it will return the `Operation` that does the assignment, and when in eager
+      mode it will return `None`.
+    """
     # TODO(apassos): this here and below is not atomic. Consider making it
     # atomic if there's a way to do so without a performance cost for those who
     # don't need it.
-    return self._lazy_read(gen_resource_variable_ops.assign_sub_variable_op(
-        self.handle,
-        ops.convert_to_tensor(delta, dtype=self.dtype),
-        name=name))
+    assign_sub_op = gen_resource_variable_ops.assign_sub_variable_op(
+        self.handle, ops.convert_to_tensor(delta, dtype=self.dtype), name=name)
+    if read_value:
+      return self._lazy_read(assign_sub_op)
+    return assign_sub_op
+
+  def assign_add(self, delta, use_locking=None, name=None, read_value=True):
+    """Adds a value to this variable.
+
+    Args:
+      delta: A `Tensor`. The value to add to this variable.
+      use_locking: If `True`, use locking during the operation.
+      name: The name to use for the operation.
+      read_value: A `bool`. Whether to read and return the new value of the
+          variable or not.
 
-  def assign_add(self, delta, use_locking=None, name=None):
-    return self._lazy_read(gen_resource_variable_ops.assign_add_variable_op(
-        self.handle,
-        ops.convert_to_tensor(delta, dtype=self.dtype),
-        name=name))
+    Returns:
+      If `read_value` is `True`, this method will return the new value of the
+      variable after the assignment has completed. Otherwise, when in graph mode
+      it will return the `Operation` that does the assignment, and when in eager
+      mode it will return `None`.
+    """
+    assign_add_op = gen_resource_variable_ops.assign_add_variable_op(
+        self.handle, ops.convert_to_tensor(delta, dtype=self.dtype), name=name)
+    if read_value:
+      return self._lazy_read(assign_add_op)
+    return assign_add_op
 
   def _lazy_read(self, op):
     if hasattr(self, "_trainable") and self._trainable:
       tape.watch_variable(self)
     return _UnreadVariable(
-        self._handle, self.dtype, self._handle_device, self._shape,
-        self._in_graph_mode,
-        self._handle_deleter if not self._in_graph_mode else None, op)
+        self._handle, self.dtype, self._shape, self._in_graph_mode,
+        self._handle_deleter if not self._in_graph_mode else None, op,
+        self._unique_id)
+
+  def assign(self, value, use_locking=None, name=None, read_value=True):
+    """Assigns a new value to this variable.
+
+    Args:
+      value: A `Tensor`. The new value for this variable.
+      use_locking: If `True`, use locking during the assignment.
+      name: The name to use for the assignment.
+      read_value: A `bool`. Whether to read and return the new value of the
+          variable or not.
 
-  def assign(self, value, use_locking=None, name=None):
+    Returns:
+      If `read_value` is `True`, this method will return the new value of the
+      variable after the assignment has completed. Otherwise, when in graph mode
+      it will return the `Operation` that does the assignment, and when in eager
+      mode it will return `None`.
+    """
     value_tensor = ops.convert_to_tensor(value, dtype=self.dtype)
     self._shape.assert_is_compatible_with(value_tensor.shape)
-    return self._lazy_read(
-        gen_resource_variable_ops.assign_variable_op(
-            self.handle,
-            value_tensor,
-            name=name))
+    assign_op = gen_resource_variable_ops.assign_variable_op(
+        self.handle, value_tensor, name=name)
+    if read_value:
+      return self._lazy_read(assign_op)
+    return assign_op
 
   def _strided_slice_assign(self, begin, end, strides, value, name, begin_mask,
                             end_mask, ellipsis_mask, new_axis_mask,
@@ -894,6 +952,10 @@ class ResourceVariable(variables.Variable):
                        "Tensor object.")
 
 
+pywrap_tensorflow.TFE_Py_RegisterResourceVariableType(ResourceVariable)
+math_ops._resource_variable_type = ResourceVariable  # pylint: disable=protected-access
+
+
 def _dense_var_to_tensor(var, dtype=None, name=None, as_ref=False):
   return var._dense_var_to_tensor(dtype=dtype, name=name, as_ref=as_ref)  # pylint: disable=protected-access
 
@@ -904,31 +966,31 @@ class _UnreadVariable(ResourceVariable):
   Pretends to be the tensor if anyone looks.
   """
 
-  def __init__(self, handle, dtype, handle_device,  # pylint: disable=super-init-not-called
-               shape, in_graph_mode, deleter, parent_op):
+  def __init__(self, handle, dtype,  # pylint: disable=super-init-not-called
+               shape, in_graph_mode, deleter, parent_op, unique_id):
     # We do not call super init on purpose.
     self._trainable = False
     self._save_slice_info = None
     self._graph_key = ops.get_default_graph()._graph_key  # pylint: disable=protected-access
     self._in_graph_mode = in_graph_mode
     self._handle = handle
-    self._handle_device = handle_device
     self._shape = shape
     self._initial_value = None
     if isinstance(self._handle, ops.EagerTensor):
       self._handle_name = ""
     else:
       self._handle_name = self._handle.name
+    self._unique_id = unique_id
     self._dtype = dtype
     self._constraint = None
     self._cached_value = None
     self._is_initialized_op = None
     self._initializer_op = None
     self._parent_op = parent_op
-    if context.in_graph_mode():
-      self._graph_element = self.read_value()
-    else:
+    if context.executing_eagerly():
       self._graph_element = None
+    else:
+      self._graph_element = self.read_value()
     self._handle_deleter = deleter
 
   def value(self):
@@ -944,6 +1006,7 @@ class _UnreadVariable(ResourceVariable):
 
   def set_shape(self, shape):
     self._shape = shape
+    self._cached_shape_as_list = None
 
   @property
   def op(self):
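Note on the `read_value` argument added to `assign`, `assign_add` and `assign_sub` above: the sketch below illustrates only the calling convention described in the new docstrings. `CounterVariable` is a hypothetical stand-in, not a TensorFlow class, and it reads the value immediately rather than returning a lazy `_UnreadVariable` as the real code does.

```python
class CounterVariable(object):
  """Hypothetical stand-in for a variable; not part of TensorFlow."""

  def __init__(self, value):
    self._value = value
    self.reads = 0  # Number of read-backs performed so far.

  def read_value(self):
    self.reads += 1
    return self._value

  def assign_add(self, delta, read_value=True):
    # Apply the update unconditionally.
    self._value += delta
    if read_value:
      # Caller wants the new value, so pay for the read-back.
      return self.read_value()
    # Assumption: mirrors the documented eager-mode result of returning None.
    return None


v = CounterVariable(1.0)
result = v.assign_add(2.0)                     # Reads the new value back.
skipped = v.assign_add(0.5, read_value=False)  # Updates without reading.
```

Passing `read_value=False` is useful when only the side effect matters, since the caller avoids creating a read that nothing consumes.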
diff --git a/tensorflow/python/ops/rnn.py b/tensorflow/python/ops/rnn.py
index aa8d4327d2f0e93768728744d5cce3fed385393f..42af7f8b274c555e6375ab8e937a8cc06ffbaa8e 100644
--- a/tensorflow/python/ops/rnn.py
+++ b/tensorflow/python/ops/rnn.py
@@ -45,7 +45,6 @@ from tensorflow.python.util.tf_export import tf_export
 
 # pylint: disable=protected-access
 _concat = rnn_cell_impl._concat
-_like_rnncell = rnn_cell_impl._like_rnncell
 # pylint: enable=protected-access
 
 
@@ -403,11 +402,8 @@ def bidirectional_dynamic_rnn(cell_fw, cell_bw, inputs, sequence_length=None,
   Raises:
     TypeError: If `cell_fw` or `cell_bw` is not an instance of `RNNCell`.
   """
-
-  if not _like_rnncell(cell_fw):
-    raise TypeError("cell_fw must be an instance of RNNCell")
-  if not _like_rnncell(cell_bw):
-    raise TypeError("cell_bw must be an instance of RNNCell")
+  rnn_cell_impl.assert_like_rnncell("cell_fw", cell_fw)
+  rnn_cell_impl.assert_like_rnncell("cell_bw", cell_bw)
 
   with vs.variable_scope(scope or "bidirectional_rnn"):
     # Forward direction
@@ -568,14 +564,13 @@ def dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None,
     TypeError: If `cell` is not an instance of RNNCell.
     ValueError: If inputs is None or an empty list.
   """
-  if not _like_rnncell(cell):
-    raise TypeError("cell must be an instance of RNNCell")
+  rnn_cell_impl.assert_like_rnncell("cell", cell)
 
   with vs.variable_scope(scope or "rnn") as varscope:
     # Create a new scope in which the caching device is either
     # determined by the parent scope, or is set to place the cached
     # Variable using the same placement as for the rest of the RNN.
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       if varscope.caching_device is None:
         varscope.set_caching_device(lambda op: op.device)
 
@@ -616,7 +611,7 @@ def dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None,
           ["Expected shape for Tensor %s is " % x.name,
            packed_shape, " but saw shape: ", x_shape])
 
-    if context.in_graph_mode() and sequence_length is not None:
+    if not context.executing_eagerly() and sequence_length is not None:
       # Perform some shape validation
       with ops.control_dependencies(
           [_assert_has_shape(sequence_length, [batch_size])]):
@@ -742,7 +737,7 @@ def _dynamic_rnn_loop(cell,
                                         element_shape=element_shape,
                                         tensor_array_name=base_name + name)
 
-  in_graph_mode = context.in_graph_mode()
+  in_graph_mode = not context.executing_eagerly()
   if in_graph_mode:
     output_ta = tuple(
         _create_ta(
@@ -872,7 +867,7 @@ def raw_rnn(cell, loop_fn,
 
   ```python
   time = tf.constant(0, dtype=tf.int32)
-  (finished, next_input, initial_state, _, loop_state) = loop_fn(
+  (finished, next_input, initial_state, emit_structure, loop_state) = loop_fn(
       time=time, cell_output=None, cell_state=None, loop_state=None)
   emit_ta = TensorArray(dynamic_size=True, dtype=initial_state.dtype)
   state = initial_state
@@ -883,7 +878,7 @@ def raw_rnn(cell, loop_fn,
         loop_state=loop_state)
     # Emit zeros and copy forward state for minibatch entries that are finished.
     state = tf.where(finished, state, next_state)
-    emit = tf.where(finished, tf.zeros_like(emit), emit)
+    emit = tf.where(finished, tf.zeros_like(emit_structure), emit)
     emit_ta = emit_ta.write(time, emit)
     # If any new minibatch entries are marked as finished, mark these.
     finished = tf.logical_or(finished, next_finished)
@@ -943,10 +938,15 @@ def raw_rnn(cell, loop_fn,
       and `emit_output`: the output to store for this iteration.
 
       Note that `emit_output` should be a `Tensor` or (possibly nested)
-      tuple of tensors with shapes and structure matching `cell.output_size`
-      and `cell_output` above.  The parameter `cell_state` and output
-      `next_cell_state` may be either a single or (possibly nested) tuple
-      of tensors.  The parameter `loop_state` and
+      tuple of tensors which is aggregated in the `emit_ta` inside the
+      `while_loop`. For the first call to `loop_fn`, the `emit_output`
+      corresponds to the `emit_structure` which is then used to determine the
+      size of the `zero_tensor` for the `emit_ta` (defaults to
+      `cell.output_size`). For the subsequent calls to the `loop_fn`, the
+      `emit_output` corresponds to the actual output tensor
+      that is to be aggregated in the `emit_ta`. The parameter `cell_state`
+      and output `next_cell_state` may be either a single or (possibly nested)
+      tuple of tensors.  The parameter `loop_state` and
       output `next_loop_state` may be either a single or (possibly nested) tuple
       of `Tensor` and `TensorArray` objects.  This last parameter
       may be ignored by `loop_fn` and the return value may be `None`.  If it
@@ -1015,9 +1015,8 @@ def raw_rnn(cell, loop_fn,
     TypeError: If `cell` is not an instance of RNNCell, or `loop_fn` is not
       a `callable`.
   """
+  rnn_cell_impl.assert_like_rnncell("cell", cell)
 
-  if not _like_rnncell(cell):
-    raise TypeError("cell must be an instance of RNNCell")
   if not callable(loop_fn):
     raise TypeError("loop_fn must be a callable")
 
@@ -1027,7 +1026,7 @@ def raw_rnn(cell, loop_fn,
   # determined by the parent scope, or is set to place the cached
   # Variable using the same placement as for the rest of the RNN.
   with vs.variable_scope(scope or "rnn") as varscope:
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       if varscope.caching_device is None:
         varscope.set_caching_device(lambda op: op.device)
 
@@ -1229,9 +1228,7 @@ def static_rnn(cell,
     ValueError: If `inputs` is `None` or an empty list, or if the input depth
       (column size) cannot be inferred from inputs via shape inference.
   """
-
-  if not _like_rnncell(cell):
-    raise TypeError("cell must be an instance of RNNCell")
+  rnn_cell_impl.assert_like_rnncell("cell", cell)
   if not nest.is_sequence(inputs):
     raise TypeError("inputs must be a sequence")
   if not inputs:
@@ -1242,7 +1239,7 @@ def static_rnn(cell,
   # determined by the parent scope, or is set to place the cached
   # Variable using the same placement as for the rest of the RNN.
   with vs.variable_scope(scope or "rnn") as varscope:
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       if varscope.caching_device is None:
         varscope.set_caching_device(lambda op: op.device)
 
@@ -1469,11 +1466,8 @@ def static_bidirectional_rnn(cell_fw,
     TypeError: If `cell_fw` or `cell_bw` is not an instance of `RNNCell`.
     ValueError: If inputs is None or an empty list.
   """
-
-  if not _like_rnncell(cell_fw):
-    raise TypeError("cell_fw must be an instance of RNNCell")
-  if not _like_rnncell(cell_bw):
-    raise TypeError("cell_bw must be an instance of RNNCell")
+  rnn_cell_impl.assert_like_rnncell("cell_fw", cell_fw)
+  rnn_cell_impl.assert_like_rnncell("cell_bw", cell_bw)
   if not nest.is_sequence(inputs):
     raise TypeError("inputs must be a sequence")
   if not inputs:
diff --git a/tensorflow/python/ops/rnn_cell_impl.py b/tensorflow/python/ops/rnn_cell_impl.py
index 923348ea44e18a87e09fe1c0424f0323eb967e3d..fe380c44dafdad6dc25d50102bacba610132674d 100644
--- a/tensorflow/python/ops/rnn_cell_impl.py
+++ b/tensorflow/python/ops/rnn_cell_impl.py
@@ -46,6 +46,7 @@ from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops import variables as tf_variables
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpointable
 from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
 
@@ -54,6 +55,8 @@ _BIAS_VARIABLE_NAME = "bias"
 _WEIGHTS_VARIABLE_NAME = "kernel"
 
 
+# TODO(jblespiau): Remove this function once we are sure it is no longer used
+# (although it is protected, it is still in use). Prefer assert_like_rnncell.
 def _like_rnncell(cell):
   """Checks that a given object is an RNNCell by using duck typing."""
   conditions = [hasattr(cell, "output_size"), hasattr(cell, "state_size"),
@@ -61,6 +64,45 @@ def _like_rnncell(cell):
   return all(conditions)
 
 
+# This can be used with self.assertRaisesRegexp for assert_like_rnncell.
+ASSERT_LIKE_RNNCELL_ERROR_REGEXP = "is not an RNNCell"
+
+
+def assert_like_rnncell(cell_name, cell):
+  """Raises a TypeError if cell is not like an RNNCell.
+
+  NOTE: Do not rely on the error message (in particular in tests), which is
+  subject to change to improve readability. Use
+  ASSERT_LIKE_RNNCELL_ERROR_REGEXP instead.
+
+  Args:
+    cell_name: A string naming the function argument, used to produce a
+      meaningful error message.
+    cell: The object which should behave like an RNNCell.
+
+  Raises:
+    TypeError: If `cell` is not like an RNNCell; the message lists what fails.
+  """
+  conditions = [
+      hasattr(cell, "output_size"),
+      hasattr(cell, "state_size"),
+      hasattr(cell, "zero_state"),
+      callable(cell),
+  ]
+  errors = [
+      "'output_size' property is missing",
+      "'state_size' property is missing",
+      "'zero_state' method is missing",
+      "is not callable"
+  ]
+
+  if not all(conditions):
+
+    errors = [error for error, cond in zip(errors, conditions) if not cond]
+    raise TypeError("The argument {!r} ({}) is not an RNNCell: {}.".format(
+        cell_name, cell, ", ".join(errors)))
+
+
 def _concat(prefix, suffix, static=False):
   """Concat that enables int, Tensor, or TensorShape values.
 
@@ -127,7 +169,7 @@ def _zero_state_tensors(state_size, batch_size, dtype):
     """Combine s with batch_size to get a proper tensor shape."""
     c = _concat(batch_size, s)
     size = array_ops.zeros(c, dtype=dtype)
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       c_static = _concat(batch_size, s, static=True)
       size.set_shape(c_static)
     return size
@@ -191,12 +233,13 @@ class RNNCell(base_layer.Layer):
 
   def _rnn_get_variable(self, getter, *args, **kwargs):
     variable = getter(*args, **kwargs)
-    if context.in_graph_mode():
-      trainable = (variable in tf_variables.trainable_variables() or
-                   (isinstance(variable, tf_variables.PartitionedVariable) and
-                    list(variable)[0] in tf_variables.trainable_variables()))
-    else:
+    if context.executing_eagerly():
       trainable = variable._trainable  # pylint: disable=protected-access
+    else:
+      trainable = (
+          variable in tf_variables.trainable_variables() or
+          (isinstance(variable, tf_variables.PartitionedVariable) and
+           list(variable)[0] in tf_variables.trainable_variables()))
     if trainable and variable not in self._trainable_weights:
       self._trainable_weights.append(variable)
     elif not trainable and variable not in self._non_trainable_weights:
@@ -240,7 +283,7 @@ class RNNCell(base_layer.Layer):
     # Try to use the last cached zero_state. This is done to avoid recreating
     # zeros, especially when eager execution is enabled.
     state_size = self.state_size
-    is_eager = context.in_eager_mode()
+    is_eager = context.executing_eagerly()
     if is_eager and hasattr(self, "_last_zero_state"):
       (last_state_size, last_batch_size, last_dtype,
        last_output) = getattr(self, "_last_zero_state")
@@ -912,8 +955,8 @@ class DropoutWrapper(RNNCell):
         but not `callable`.
       ValueError: if any of the keep_probs are not between 0 and 1.
     """
-    if not _like_rnncell(cell):
-      raise TypeError("The parameter cell is not a RNNCell.")
+    assert_like_rnncell("cell", cell)
+
     if (dropout_state_filter_visitor is not None
         and not callable(dropout_state_filter_visitor)):
       raise TypeError("dropout_state_filter_visitor must be callable")
@@ -1187,6 +1230,12 @@ class MultiRNNCell(RNNCell):
           "cells must be a list or tuple, but saw: %s." % cells)
 
     self._cells = cells
+    for cell_number, cell in enumerate(self._cells):
+      # Add Checkpointable dependencies on these cells so their variables get
+      # saved with this object when using object-based saving.
+      if isinstance(cell, checkpointable.CheckpointableBase):
+        # TODO(allenl): Track down non-Checkpointable callers.
+        self._track_checkpointable(cell, name="cell-%d" % (cell_number,))
     self._state_is_tuple = state_is_tuple
     if not state_is_tuple:
       if any(nest.is_sequence(c.state_size) for c in self._cells):
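
The `assert_like_rnncell` helper added in the `rnn_cell_impl.py` hunks above replaces a bare boolean check with one that reports which duck-typing condition failed. The sketch below reproduces only that idea in plain Python, without TensorFlow. The names `describe_missing_rnncell_traits`, `FakeCell`, and `HalfCell` are hypothetical and exist purely for illustration; the four conditions and their messages are copied from the diff.

```python
def describe_missing_rnncell_traits(cell):
    """Return the reasons why `cell` does not look like an RNNCell.

    Hypothetical helper: it pairs each duck-typing condition from the diff
    with its error message and keeps the messages whose condition failed.
    """
    checks = [
        (hasattr(cell, "output_size"), "'output_size' property is missing"),
        (hasattr(cell, "state_size"), "'state_size' property is missing"),
        (hasattr(cell, "zero_state"), "'zero_state' method is missing"),
        (callable(cell), "is not callable"),
    ]
    return [message for ok, message in checks if not ok]


class FakeCell(object):
    """Hypothetical stand-in that satisfies all four conditions."""
    output_size = 4
    state_size = 4

    def zero_state(self, batch_size, dtype):
        return [0.0] * self.state_size

    def __call__(self, inputs, state):
        return inputs, state


class HalfCell(object):
    """Hypothetical object that only defines `output_size`."""
    output_size = 4


good = describe_missing_rnncell_traits(FakeCell())
bad = describe_missing_rnncell_traits(HalfCell())
```

Here `good` is an empty list, while `bad` names the three conditions that `HalfCell` fails, which is the information the new `TypeError` message joins with commas.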
diff --git a/tensorflow/python/ops/script_ops.py b/tensorflow/python/ops/script_ops.py
index 6fe2f61016775b410045fefcc8764907b8ea39f3..1b4111bca630ffa122ed590b0e3d54b796ab6b7a 100644
--- a/tensorflow/python/ops/script_ops.py
+++ b/tensorflow/python/ops/script_ops.py
@@ -25,6 +25,9 @@ from __future__ import print_function
 
 import threading
 
+# Used by py_util.cc to get tracebacks.
+import traceback  # pylint: disable=unused-import
+
 import numpy as np
 import six
 
@@ -33,6 +36,7 @@ from tensorflow.python.eager import context
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import gen_script_ops
+from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
 
@@ -51,6 +55,16 @@ class EagerFunc(object):
     self._func = func
     self._out_dtypes = Tout
 
+  def _convert(self, value, dtype):
+    if isinstance(value, resource_variable_ops.ResourceVariable):
+      raise RuntimeError(
+          "Attempting to return a variable from an eagerly executed py_func. "
+          "Only numeric data structures like Tensors or NumPy arrays should "
+          "be returned; to return the value of a variable, make sure to obtain "
+          "the Tensor backing it by calling `.read_value()` on the variable in "
+          "question: %s" % value)
+    return ops.convert_to_tensor(value, dtype=dtype)
+
   def __call__(self, on_gpu, args):
     """Passes `args` to `self._func`, which is executed eagerly."""
     with context.eager_mode():
@@ -58,14 +72,13 @@ class EagerFunc(object):
       maybe_copy_to_gpu = lambda x: x if not on_gpu else x.gpu()
       if isinstance(ret, (tuple, list)):
         return [
-            maybe_copy_to_gpu(ops.convert_to_tensor(x, dtype=dtype))
+            maybe_copy_to_gpu(self._convert(x, dtype=dtype))
             for (x, dtype) in zip(ret, self._out_dtypes)
         ]
       elif ret is None:
         return ret
       else:
-        return maybe_copy_to_gpu(
-            ops.convert_to_tensor(ret, dtype=self._out_dtypes[0]))
+        return maybe_copy_to_gpu(self._convert(ret, dtype=self._out_dtypes[0]))
 
 
 class FuncRegistry(object):
@@ -219,18 +232,16 @@ def _internal_py_func(func, inp, Tout, stateful=None, eager=False, name=None):
   graph._cleanup_py_funcs_used_in_graph.append(cleanup)
   # pylint: enable=protected-access
 
-  # pylint: disable=protected-access
   if eager:
-    result = gen_script_ops._eager_py_func(
+    result = gen_script_ops.eager_py_func(
         input=inp, token=token, Tout=Tout, name=name)
   else:
     if stateful:
-      result = gen_script_ops._py_func(
+      result = gen_script_ops.py_func(
           input=inp, token=token, Tout=Tout, name=name)
     else:
-      result = gen_script_ops._py_func_stateless(
+      result = gen_script_ops.py_func_stateless(
           input=inp, token=token, Tout=Tout, name=name)
-  # pylint: enable=protected-access
   return result if is_list_or_tuple else result[0]
 
 
@@ -319,7 +330,7 @@ def py_func(func, inp, Tout, stateful=True, name=None):
   Returns:
     A list of `Tensor` or a single `Tensor` which `func` computes.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     result = func(*[x.numpy() for x in inp])
     result = nest.flatten(result)
 
diff --git a/tensorflow/python/ops/session_ops.py b/tensorflow/python/ops/session_ops.py
index cedd36c1deed541adcf601ff9447345e2279e8f9..ad38845153c94e9bb31e6e3ee05ebed0a4313efc 100644
--- a/tensorflow/python/ops/session_ops.py
+++ b/tensorflow/python/ops/session_ops.py
@@ -16,7 +16,6 @@
 """Tensor Handle Operations. See the @{$python/session_ops} guide.
 
 @@get_session_handle
-@@get_session_handle_v2
 @@get_session_tensor
 @@delete_session_tensor
 """
@@ -182,7 +181,7 @@ def get_session_handle(data, name=None):
 
   # Colocate this operation with data.
   with ops.colocate_with(data):
-    return gen_data_flow_ops._get_session_handle(data, name=name)  # pylint: disable=protected-access
+    return gen_data_flow_ops.get_session_handle(data, name=name)
 
 
 @tf_export("get_session_tensor")
@@ -222,7 +221,7 @@ def get_session_tensor(handle, dtype, name=None):
   with ops.device(handle_device):
     holder = array_ops.placeholder(dtypes.string)
     _register_handle_feeder(holder.graph, holder, dtype)
-    tensor = gen_data_flow_ops._get_session_tensor(holder, dtype, name=name)
+    tensor = gen_data_flow_ops.get_session_tensor(holder, dtype, name=name)
   return (holder, tensor)
 
 
@@ -246,7 +245,7 @@ def delete_session_tensor(handle, name=None):
   handle_device = TensorHandle._get_device_name(handle)
   with ops.device(handle_device):
     holder = array_ops.placeholder(dtypes.string)
-    deleter = gen_data_flow_ops._delete_session_tensor(holder, name=name)
+    deleter = gen_data_flow_ops.delete_session_tensor(holder, name=name)
   return (holder, deleter)
 
 
@@ -268,7 +267,7 @@ def _get_handle_reader(graph, handle, dtype):
     with graph.as_default(), graph.device(handle_device):
       holder = array_ops.placeholder(dtypes.string)
       _register_handle_feeder(holder.graph, holder, dtype)
-      reader = gen_data_flow_ops._get_session_tensor(holder, dtype)
+      reader = gen_data_flow_ops.get_session_tensor(holder, dtype)
     result = (holder, reader)
     graph._handle_readers[graph_key] = result
   return result
@@ -289,7 +288,7 @@ def _get_handle_mover(graph, feeder, handle):
     # Create mover if we haven't done it.
     holder, reader = _get_handle_reader(graph, handle, dtype)
     with graph.as_default(), graph.device(feeder.op.device):
-      mover = gen_data_flow_ops._get_session_handle(reader)  # pylint: disable=protected-access
+      mover = gen_data_flow_ops.get_session_handle(reader)
     result = (holder, mover)
     graph._handle_movers[graph_key] = result
   return result
@@ -303,7 +302,7 @@ def _get_handle_deleter(graph, deleter_key, handle):
     handle_device = TensorHandle._get_device_name(handle)
     with graph.as_default(), graph.device(handle_device):
       holder = array_ops.placeholder(dtypes.string)
-      deleter = gen_data_flow_ops._delete_session_tensor(holder)
+      deleter = gen_data_flow_ops.delete_session_tensor(holder)
     result = (holder, deleter)
     graph._handle_deleters[deleter_key] = result
   return result
diff --git a/tensorflow/python/ops/sets_impl.py b/tensorflow/python/ops/sets_impl.py
index b0eecd8a1e812857de8f47e1370e4fc5f1004bc0..21e08d03d213c173d12dfc6676fe7f009811e93f 100644
--- a/tensorflow/python/ops/sets_impl.py
+++ b/tensorflow/python/ops/sets_impl.py
@@ -247,7 +247,7 @@ def set_difference(a, b, aminusb=True, validate_indices=True):
     #
     # collections.OrderedDict([
     #     ((0, 0, 0), 2),
-    #     ((0, 0, 1), 3),
+    #     ((0, 1, 0), 3),
     # ])
   ```
 
diff --git a/tensorflow/python/ops/sparse_grad.py b/tensorflow/python/ops/sparse_grad.py
index 5295e7d21c2b5810422ec36f5aced63c9039feca..97353d6c747cb7e4d3c1fa92ad61af24fb17de91 100644
--- a/tensorflow/python/ops/sparse_grad.py
+++ b/tensorflow/python/ops/sparse_grad.py
@@ -88,10 +88,8 @@ def _SparseAddGrad(op, *grads):
   # the non-zero elements of the sum, and we will peek into `sum_indices` in the
   # gradient op.
 
-  # pylint: disable=protected-access
-  a_val_grad, b_val_grad = gen_sparse_ops._sparse_add_grad(val_grad, a_indices,
-                                                           b_indices,
-                                                           sum_indices)
+  a_val_grad, b_val_grad = gen_sparse_ops.sparse_add_grad(
+      val_grad, a_indices, b_indices, sum_indices)
   a_val_grad.set_shape(op.inputs[1].get_shape())
   b_val_grad.set_shape(op.inputs[4].get_shape())
   # (a_indices, a_values, a_shape, b_indices, b_values, b_shape, thresh)
@@ -151,7 +149,7 @@ def _SparseTensorDenseMatMulGrad(op, grad):
                               "complex gradients.")
 
   # gradient w.r.t. dense
-  b_grad = gen_sparse_ops._sparse_tensor_dense_mat_mul(  # pylint: disable=protected-access
+  b_grad = gen_sparse_ops.sparse_tensor_dense_mat_mul(
       a_indices, a_values, a_shape, grad, adjoint_a=not adj_a)
   if adj_b:
     b_grad = array_ops.transpose(b_grad)
@@ -278,8 +276,7 @@ def _SparseFillEmptyRowsGrad(op, unused_grad_output_indices, output_grad_values,
   """Gradients for SparseFillEmptyRows."""
   reverse_index_map = op.outputs[3]
 
-  # pylint: disable=protected-access
-  d_values, d_default_value = gen_sparse_ops._sparse_fill_empty_rows_grad(
+  d_values, d_default_value = gen_sparse_ops.sparse_fill_empty_rows_grad(
       reverse_index_map=reverse_index_map, grad_values=output_grad_values)
 
   # d_indices, d_values, d_dense_shape, d_default_value.
diff --git a/tensorflow/python/ops/sparse_ops.py b/tensorflow/python/ops/sparse_ops.py
index 0fbbf5a805f1439d85ad53f02bdb665c04248606..c580052c32c8b61467b857af3d237be41718c1a1 100644
--- a/tensorflow/python/ops/sparse_ops.py
+++ b/tensorflow/python/ops/sparse_ops.py
@@ -234,7 +234,7 @@ def sparse_concat(axis,
     ]
 
   output_ind, output_val, output_shape = (
-      gen_sparse_ops._sparse_concat(inds, vals, shapes, axis, name=name))
+      gen_sparse_ops.sparse_concat(inds, vals, shapes, axis, name=name))
 
   return sparse_tensor.SparseTensor(output_ind, output_val, output_shape)
 
@@ -302,8 +302,8 @@ def sparse_add(a, b, thresh=0):
     thresh = ops.convert_to_tensor(
         thresh, dtype=a.values.dtype.real_dtype.base_dtype, name="thresh")
     output_ind, output_val, output_shape = (
-        gen_sparse_ops._sparse_add(a.indices, a.values, a.dense_shape,
-                                   b.indices, b.values, b.dense_shape, thresh))
+        gen_sparse_ops.sparse_add(a.indices, a.values, a.dense_shape,
+                                  b.indices, b.values, b.dense_shape, thresh))
 
     # Attempt to get output_shape statically.
     a.get_shape().assert_is_compatible_with(b.get_shape())
@@ -317,8 +317,8 @@ def sparse_add(a, b, thresh=0):
     # swap to make `a` the SparseTensor.
     if isinstance(b, sparse_classes):
       a, b = b, a
-    return gen_sparse_ops._sparse_tensor_dense_add(a.indices, a.values,
-                                                   a.dense_shape, b)
+    return gen_sparse_ops.sparse_tensor_dense_add(a.indices, a.values,
+                                                  a.dense_shape, b)
 
 
 def _sparse_cross(inputs, name=None):
@@ -402,7 +402,7 @@ def _sparse_cross_internal(inputs,
                            num_buckets=0,
                            hash_key=None,
                            name=None):
-  """See gen_sparse_ops._sparse_cross."""
+  """See gen_sparse_ops.sparse_cross."""
   if not isinstance(inputs, list):
     raise TypeError("Inputs must be a list")
   if not all(
@@ -432,7 +432,7 @@ def _sparse_cross_internal(inputs,
       dense_inputs[i] = math_ops.to_int64(dense_inputs[i])
       internal_type = dtypes.int64
 
-  indices_out, values_out, shape_out = gen_sparse_ops._sparse_cross(
+  indices_out, values_out, shape_out = gen_sparse_ops.sparse_cross(
       indices=indices,
       values=values,
       shapes=shapes,
@@ -511,7 +511,7 @@ def sparse_reorder(sp_input, name=None):
   sp_input = _convert_to_sparse_tensor(sp_input)
 
   reordered_ind, reordered_val = (
-      gen_sparse_ops._sparse_reorder(
+      gen_sparse_ops.sparse_reorder(
           sp_input.indices, sp_input.values, sp_input.dense_shape, name=name))
 
   if sp_input.get_shape().is_fully_defined():
@@ -575,7 +575,7 @@ def sparse_reshape(sp_input, shape, name=None):
   shape = math_ops.cast(shape, dtype=dtypes.int64)
 
   with ops.name_scope(name, "SparseReshape", [sp_input]) as name:
-    reshaped_ind, reshaped_shape = gen_sparse_ops._sparse_reshape(
+    reshaped_ind, reshaped_shape = gen_sparse_ops.sparse_reshape(
         sp_input.indices, sp_input.dense_shape, shape, name=name)
 
     reshaped_shape_const = tensor_util.constant_value(shape)
@@ -671,7 +671,7 @@ def sparse_split(keyword_required=KeywordRequired(),
   sp_input = _convert_to_sparse_tensor(sp_input)
 
   output_inds, output_vals, output_shapes = (
-      gen_sparse_ops._sparse_split(
+      gen_sparse_ops.sparse_split(
           axis,
           sp_input.indices,
           sp_input.values,
@@ -782,7 +782,7 @@ def sparse_to_dense(sparse_indices,
     Dense `Tensor` of shape `output_shape`.  Has the same type as
     `sparse_values`.
   """
-  return gen_sparse_ops._sparse_to_dense(
+  return gen_sparse_ops.sparse_to_dense(
       sparse_indices,
       output_shape,
       sparse_values,
@@ -1412,7 +1412,7 @@ def sparse_fill_empty_rows(sp_input, default_value, name=None):
     default_value = ops.convert_to_tensor(
         default_value, dtype=sp_input.values.dtype)
     (output_indices, output_values, empty_row_indicator,
-     unused_reverse_index_map) = gen_sparse_ops._sparse_fill_empty_rows(
+     unused_reverse_index_map) = gen_sparse_ops.sparse_fill_empty_rows(
          indices=sp_input.indices,
          values=sp_input.values,
          dense_shape=sp_input.dense_shape,
@@ -1441,7 +1441,7 @@ def serialize_sparse(sp_input, name=None, out_type=dtypes.string):
   """
   sp_input = _convert_to_sparse_tensor(sp_input)
 
-  return gen_sparse_ops._serialize_sparse(
+  return gen_sparse_ops.serialize_sparse(
       sp_input.indices,
       sp_input.values,
       sp_input.dense_shape,
@@ -1476,7 +1476,7 @@ def serialize_many_sparse(sp_input, name=None, out_type=dtypes.string):
   """
   sp_input = _convert_to_sparse_tensor(sp_input)
 
-  return gen_sparse_ops._serialize_many_sparse(
+  return gen_sparse_ops.serialize_many_sparse(
       sp_input.indices,
       sp_input.values,
       sp_input.dense_shape,
@@ -1541,7 +1541,7 @@ def deserialize_sparse(serialized_sparse, dtype, rank=None, name=None):
 
   """
   output_indices, output_values, output_shape = (
-      gen_sparse_ops._deserialize_sparse(serialized_sparse, dtype, name=name))
+      gen_sparse_ops.deserialize_sparse(serialized_sparse, dtype, name=name))
 
   # Feed rank data back in, if available
   output_indices.set_shape([None, rank])
@@ -1610,7 +1610,7 @@ def deserialize_many_sparse(serialized_sparse, dtype, rank=None, name=None):
     All of the serialized `SparseTensor`s must have had the same rank and type.
   """
   output_indices, output_values, output_shape = (
-      gen_sparse_ops._deserialize_many_sparse(
+      gen_sparse_ops.deserialize_many_sparse(
           serialized_sparse, dtype, name=name))
 
   # Feed rank data back in, if available
@@ -1828,7 +1828,7 @@ def sparse_tensor_dense_matmul(sp_a,
   with ops.name_scope(name, "SparseTensorDenseMatMul",
                       [sp_a.indices, sp_a.values, b]) as name:
     b = ops.convert_to_tensor(b, name="b")
-    return gen_sparse_ops._sparse_tensor_dense_mat_mul(
+    return gen_sparse_ops.sparse_tensor_dense_mat_mul(
         a_indices=sp_a.indices,
         a_values=sp_a.values,
         a_shape=sp_a.dense_shape,
@@ -2046,7 +2046,7 @@ def _add_sparse_to_tensors_map(sp_input,
   """
   sp_input = _convert_to_sparse_tensor(sp_input)
 
-  return gen_sparse_ops._add_sparse_to_tensors_map(
+  return gen_sparse_ops.add_sparse_to_tensors_map(
       sp_input.indices,
       sp_input.values,
       sp_input.dense_shape,
@@ -2086,7 +2086,7 @@ def _add_many_sparse_to_tensors_map(sp_input,
   """
   sp_input = _convert_to_sparse_tensor(sp_input)
 
-  return gen_sparse_ops._add_many_sparse_to_tensors_map(
+  return gen_sparse_ops.add_many_sparse_to_tensors_map(
       sp_input.indices,
       sp_input.values,
       sp_input.dense_shape,
@@ -2167,7 +2167,7 @@ def _take_many_sparse_from_tensors_map(sparse_map_op,
   with ops.colocate_with(sparse_map_op):
     shared_name = sparse_map_op.get_attr("shared_name") or sparse_map_op.name
     output_indices, output_values, output_shape = (
-        gen_sparse_ops._take_many_sparse_from_tensors_map(
+        gen_sparse_ops.take_many_sparse_from_tensors_map(
             sparse_handles,
             dtype=sparse_map_op.get_attr("T"),
             container=sparse_map_op.get_attr("container"),
diff --git a/tensorflow/python/ops/special_math_ops.py b/tensorflow/python/ops/special_math_ops.py
index 6d7eaababcd94d687ff20dddc35c68a98320a19b..5e2146b79f08e6671c429f388b05634b1727b4ed 100644
--- a/tensorflow/python/ops/special_math_ops.py
+++ b/tensorflow/python/ops/special_math_ops.py
@@ -163,7 +163,7 @@ def einsum(equation, *inputs, **kwargs):
     if '...' in equation:
       raise ValueError('Subscripts with ellipses are not yet supported.')
 
-    match = re.match('([a-z,]+)(->[a-z]*)?', equation)
+    match = re.match('^([a-zA-Z,]+)(->[a-zA-Z]*)?$', equation)
     if not match:
       raise ValueError('Indices have incorrect format: %s' % equation)
 
@@ -402,7 +402,7 @@ def _exponential_space_einsum(equation, *inputs):
   if '...' in equation:
     raise ValueError('Subscripts with ellipses are not yet supported.')
 
-  match = re.match('([a-z,]+)(->[a-z]*)?', equation)
+  match = re.match('^([a-zA-Z,]+)(->[a-zA-Z]*)?$', equation)
   if not match:
     raise ValueError('Indices have incorrect format: %s' % equation)
 
diff --git a/tensorflow/python/ops/special_math_ops_test.py b/tensorflow/python/ops/special_math_ops_test.py
index 2c212f45483eacfd3fd27eecb8d7b2c846b5fe96..d7c3a7e8dc7c2ad611cf47718dddcf38700ce304 100644
--- a/tensorflow/python/ops/special_math_ops_test.py
+++ b/tensorflow/python/ops/special_math_ops_test.py
@@ -192,6 +192,9 @@ class EinsumTest(test.TestCase):
       'abc,cba',
       'dba,ead,cad->bce',
       'aef,fbc,dca->bde',
+      'iJ,Jk->ik',
+      'iJ,Ki->JK',
+      'iJk,Jklm->Jk'
   ]
 
   long_cases = [
@@ -208,6 +211,8 @@ class EinsumTest(test.TestCase):
       'ijk ijk',
       'ij.jk->ik',
       'ij...,jk...->ik...',
+      'ij,k ->kji',
+      'ij,k-> kji',
 
       # axis in output that does not exist
       'ij,jk->im',
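
The effect of the anchored pattern in `special_math_ops.py` can be checked with the standard `re` module alone. The sketch below compares the old and new patterns from the diff on one valid lowercase equation, one of the newly added uppercase cases, and the two whitespace cases added to the invalid list. The helper `accepted` and the variable names are hypothetical, and the sketch exercises only the format check, not the rest of `einsum`.

```python
import re

# Patterns copied from the diff: the old one is unanchored and lowercase-only,
# the new one is anchored at both ends and also allows uppercase subscripts.
OLD_PATTERN = '([a-z,]+)(->[a-z]*)?'
NEW_PATTERN = '^([a-zA-Z,]+)(->[a-zA-Z]*)?$'


def accepted(pattern, equation):
    """Return True if `re.match` finds a match for `pattern` in `equation`."""
    return re.match(pattern, equation) is not None


equations = ['ij,jk->ik', 'iJ,Jk->ik', 'ij,k ->kji', 'ij,k-> kji']
old_results = [accepted(OLD_PATTERN, e) for e in equations]
new_results = [accepted(NEW_PATTERN, e) for e in equations]
```

Because `re.match` only anchors at the start, the old pattern matches a leading prefix of every equation above, so none of them is rejected by the format check. With the trailing `$`, the new pattern must consume the whole string, so the two equations containing a space no longer match.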
diff --git a/tensorflow/python/ops/standard_ops.py b/tensorflow/python/ops/standard_ops.py
index b62e556967753dac4418add2864ce4e641dc6b58..230b7ef937cf1043ce872b112a11689b56b45520 100644
--- a/tensorflow/python/ops/standard_ops.py
+++ b/tensorflow/python/ops/standard_ops.py
@@ -186,7 +186,6 @@ _allowed_symbols_array_ops = [
     "quantize_and_dequantize",  # to-doc
 
     # TODO(drpng): legacy symbols to be removed.
-    "list_diff",  # Use tf.listdiff instead.
     "batch_matrix_diag",
     "batch_matrix_band_part",
     "batch_matrix_diag_part",
@@ -219,6 +218,8 @@ _allowed_symbols_gradients = [
     # Documented in training.py:
     # Not importing training.py to avoid complex graph dependencies.
     "AggregationMethod",
+    "GradientTape",
+    "custom_gradient",
     "gradients",  # tf.gradients = gradients.gradients
     "hessians",
 ]
diff --git a/tensorflow/python/ops/state_ops.py b/tensorflow/python/ops/state_ops.py
index 6c0a090d16bb328de40f02edf9865a0e0a62d385..c3ad5831b4dfa2160a198429c60c7a6ac00f6357 100644
--- a/tensorflow/python/ops/state_ops.py
+++ b/tensorflow/python/ops/state_ops.py
@@ -99,8 +99,8 @@ def variable_op(shape, dtype, name="Variable", set_shape=True, container="",
   """Deprecated. Used variable_op_v2 instead."""
   if not set_shape:
     shape = tensor_shape.unknown_shape()
-  ret = gen_state_ops._variable(shape=shape, dtype=dtype, name=name,
-                                container=container, shared_name=shared_name)
+  ret = gen_state_ops.variable(shape=shape, dtype=dtype, name=name,
+                               container=container, shared_name=shared_name)
   # TODO(mrry): Move this to where it is used, so we can get rid of this op
   #   wrapper?
   if set_shape:
@@ -127,11 +127,12 @@ def variable_op_v2(shape, dtype, name="Variable", container="", shared_name=""):
   Returns:
     A variable tensor.
   """
-  return gen_state_ops._variable_v2(shape=shape,
-                                    dtype=dtype,
-                                    name=name,
-                                    container=container,
-                                    shared_name=shared_name)
+  return gen_state_ops.variable_v2(
+      shape=shape,
+      dtype=dtype,
+      name=name,
+      container=container,
+      shared_name=shared_name)
 
 
 def init_variable(v, init, name="init"):
@@ -185,7 +186,7 @@ def is_variable_initialized(ref, name=None):
   if ref.dtype._is_ref_dtype:
     return gen_state_ops.is_variable_initialized(ref=ref, name=name)
   # Handle resource variables.
-  if context.in_eager_mode() or ref.op.type == "VarHandleOp":
+  if context.executing_eagerly() or ref.op.type == "VarHandleOp":
     return gen_resource_variable_ops.var_is_initialized_op(ref.handle,
                                                            name=name)
 
diff --git a/tensorflow/python/ops/string_ops.py b/tensorflow/python/ops/string_ops.py
index b8c39d91b41790c6441594b175e8eaa03620e1ec..5bd75b9215fdbccd5882ea39c2b35ccbbe29d5b0 100644
--- a/tensorflow/python/ops/string_ops.py
+++ b/tensorflow/python/ops/string_ops.py
@@ -17,6 +17,7 @@
 
 See the @{$python/string_ops} guide.
 
+@@regex_replace
 @@string_to_hash_bucket_fast
 @@string_to_hash_bucket_strong
 @@string_to_hash_bucket
@@ -93,10 +94,8 @@ def string_split(source, delimiter=" ", skip_empty=True):  # pylint: disable=inv
   delimiter = ops.convert_to_tensor(delimiter, dtype=dtypes.string)
   source = ops.convert_to_tensor(source, dtype=dtypes.string)
 
-  # pylint: disable=protected-access
-  indices, values, shape = gen_string_ops._string_split(
+  indices, values, shape = gen_string_ops.string_split(
       source, delimiter=delimiter, skip_empty=skip_empty)
-  # pylint: enable=protected-access
   indices.set_shape([None, 2])
   values.set_shape([None])
   shape.set_shape([2])
@@ -141,6 +140,7 @@ def reduce_join(inputs, axis=None,
 reduce_join.__doc__ = deprecation.rewrite_argument_docstring(
     gen_string_ops.reduce_join.__doc__, "reduction_indices", "axis")
 
+ops.NotDifferentiable("RegexReplace")
 ops.NotDifferentiable("StringToHashBucket")
 ops.NotDifferentiable("StringToHashBucketFast")
 ops.NotDifferentiable("StringToHashBucketStrong")
diff --git a/tensorflow/python/ops/summary_ops.py b/tensorflow/python/ops/summary_ops.py
index 7f4f4ce5ab4ee2bd309932cb81f05775996371d6..037bc9845a3f734f65b73b0c4b4ca19fb653731d 100644
--- a/tensorflow/python/ops/summary_ops.py
+++ b/tensorflow/python/ops/summary_ops.py
@@ -13,7 +13,6 @@
 # limitations under the License.
 # ==============================================================================
 """Summary Operations."""
-# pylint: disable=protected-access
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
@@ -74,7 +73,7 @@ def tensor_summary(name,
 
   with summary_op_util.summary_scope(
       name, family, values=[tensor]) as (tag, scope):
-    val = gen_logging_ops._tensor_summary_v2(
+    val = gen_logging_ops.tensor_summary_v2(
         tensor=tensor,
         tag=tag,
         name=scope,
diff --git a/tensorflow/python/ops/template.py b/tensorflow/python/ops/template.py
index 424582b348d87d8a5b043ec9b771d8f2768a5994..0294ecee548d1e7f507a5e4195e4ee320a0b9918 100644
--- a/tensorflow/python/ops/template.py
+++ b/tensorflow/python/ops/template.py
@@ -26,6 +26,7 @@ from tensorflow.python.eager import function
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpointable
 from tensorflow.python.util import tf_contextlib
 from tensorflow.python.util import tf_decorator
 from tensorflow.python.util.deprecation import deprecated
@@ -203,7 +204,7 @@ def make_template_internal(name_,
   if kwargs:
     func_ = tf_decorator.make_decorator(func_, functools.partial(
         func_, **kwargs))
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     if unique_name_ is not None:
       raise ValueError(
           "unique_name_ cannot be used when eager exeuction is enabled.")
@@ -230,7 +231,7 @@ def _skip_common_stack_elements(stacktrace, base_case):
   return stacktrace[-1:]
 
 
-class Template(object):
+class Template(checkpointable.CheckpointableBase):
   """Wrap a function to aid in variable sharing.
 
   Templates are functions that create variables the first time they are called
@@ -294,12 +295,115 @@ class Template(object):
     # which is not the same as whether the scope has been created.
     self._variables_created = False
 
+  @property
+  def _checkpoint_dependencies(self):
+    """Sanity checking for object-based saving.
+
+    Does not override Checkpointable dependency tracking, but checks that
+    variables accessible through Checkpointable dependencies on other `Template`
+    objects include all of the variable_scope-filtered `Template.variables`.
+
+    Returns:
+      A list of checkpointable.CheckpointableReference objects.
+    Raises:
+      ValueError: If this object is not compatible with object-based saving.
+    """
+    dependencies = super(Template, self)._checkpoint_dependencies
+    dependency_variables = []
+    for _, dependency in dependencies:
+      if isinstance(dependency, Template):
+        dependency_variables.extend(dependency.variables)
+      else:
+        dependency_variables.append(dependency)
+    dependency_variables = set(dependency_variables)
+    not_included_variables = []
+    for expected_variable in sorted(self.variables, key=lambda v: v.name):
+      if expected_variable not in dependency_variables:
+        not_included_variables.append(expected_variable)
+    if not_included_variables:
+      # Trying to save a Template which improperly tracks its variables.
+      raise ValueError(
+          ("The Template '%s' references variables which are not included via "
+           "object-based dependency tracking. Most likely a custom "
+           "getter/creator was registered which does not call Template's "
+           "custom variable creator (which is responsible for tracking "
+           "dependencies).\n\nExpected these variables to be dependencies: %s")
+          % (self, not_included_variables))
+    return dependencies
+
+  def _checkpointable_custom_creator(self, next_creator, name, initial_value,
+                                     checkpointable_parent=None, **kwargs):
+    """A variable creation hook which adds Checkpointable dependencies.
+
+    Set during the `Template`'s first wrapped function execution. Ensures that
+    (a) `Template` objects depend on `Template`s created inside them which
+    create variables, and (b) that any variables not in a more deeply nested
+    `Template` are added as dependencies directly.
+
+    The `checkpointable_parent` argument is passed between `Template` custom
+    creators but ignored when the variable object itself is created. This
+    argument indicates (if not `None`) that a more deeply nested `Template` has
+    already added the variable as a dependency, and that parent `Template`s
+    should add a dependency on that `Template` rather than on the variable
+    directly.
+
+    Args:
+      next_creator: See `variable_scope.variable_creator_scope`; the next
+        creator in the chain.
+      name: The (full, scope-influenced) name of the variable. The scope name
+        for the Template itself is stripped for the purposes of object-based
+        dependency tracking, but scopes within Templates are respected.
+      initial_value: See `variable_scope.variable_creator_scope`. Taken
+        explicitly so the argument can be re-named and used with
+        `Checkpointable._add_variable_with_custom_getter`.
+      checkpointable_parent: If not None, a more deeply nested Template object
+        to add a dependency on (rather than depending on the variable directly).
+      **kwargs: Passed through to the next creator.
+    Returns:
+      The output of `next_creator`: the fetched/created variable object.
+    """
+    def _call_next_creator_renaming_initializer(initializer, **inner_kwargs):
+      inner_kwargs.pop("name")  # Ignored; this is the scope-stripped name which
+      # we don't want to propagate.
+      return next_creator(
+          initial_value=initializer,
+          name=name,
+          **inner_kwargs)
+    if name.startswith(self._variable_scope.name):
+      scope_stripped_name = name[len(self._variable_scope.name) + 1:]
+      if not checkpointable_parent:
+        return self._add_variable_with_custom_getter(
+            initializer=initial_value,
+            name=scope_stripped_name,
+            getter=_call_next_creator_renaming_initializer,
+            # Disable error checking for Checkpointable. Exceptions are instead
+            # raised if necessary when the object-based saver tries to
+            # save/restore the object.
+            overwrite=True,
+            checkpointable_parent=self,
+            **kwargs)
+      else:
+        self._track_checkpointable(
+            checkpointable_parent,
+            name=checkpointable_parent._variable_scope.name[  # pylint: disable=protected-access
+                len(self._variable_scope.name) + 1:],
+            overwrite=True)
+    return next_creator(name=name, initial_value=initial_value,
+                        checkpointable_parent=self, **kwargs)
+
   def _call_func(self, args, kwargs):
     try:
       vars_at_start = len(ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES))
       trainable_at_start = len(
           ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES))
-      result = self._func(*args, **kwargs)
+      if self._variables_created:
+        result = self._func(*args, **kwargs)
+      else:
+        # The first time we run, restore variables if necessary (via
+        # Checkpointable).
+        with variable_scope.variable_creator_scope(
+            self._checkpointable_custom_creator):
+          result = self._func(*args, **kwargs)
 
       if self._variables_created:
         # Variables were previously created, implying this is not the first
@@ -479,7 +583,7 @@ class _EagerTemplateVariableStore(object):
       if self._variable_scope_name is None:
         raise RuntimeError("A variable scope must be set before an "
                            "_EagerTemplateVariableStore object exits.")
-      self._eager_variable_store._store.close_variable_subscopes(  # pylint: disable=protected-access
+      variable_scope.get_variable_scope_store().close_variable_subscopes(
           self._variable_scope_name)
 
   def _variables_in_scope(self, variable_list):
@@ -543,7 +647,7 @@ class EagerTemplate(Template):
     Raises:
       RuntimeError: if eager execution is not enabled.
     """
-    if not context.in_eager_mode():
+    if not context.executing_eagerly():
       raise RuntimeError(
           "{} objects can only be used when eager execution is enabled, use "
           "tf.Template for graph construction".
@@ -563,7 +667,14 @@ class EagerTemplate(Template):
     try:
       vars_at_start = self._template_store.variables()
       trainable_at_start = self._template_store.trainable_variables()
-      result = self._func(*args, **kwargs)
+      if self._variables_created:
+        result = self._func(*args, **kwargs)
+      else:
+        # The first time we run, restore variables if necessary (via
+        # Checkpointable).
+        with variable_scope.variable_creator_scope(
+            self._checkpointable_custom_creator):
+          result = self._func(*args, **kwargs)
 
       if self._variables_created:
         # Variables were previously created, implying this is not the first
diff --git a/tensorflow/python/ops/tensor_array_ops.py b/tensorflow/python/ops/tensor_array_ops.py
index 3c08870146e447d84d4a5f620cbead633d94751f..2f6badcb532c0ef9d82b211d0c7b11a67e8e3010 100644
--- a/tensorflow/python/ops/tensor_array_ops.py
+++ b/tensorflow/python/ops/tensor_array_ops.py
@@ -148,7 +148,7 @@ class _GraphTensorArray(object):
         # will retroactively set the device value of this op.
         def create():
           """Create the TensorArray op."""
-          return gen_data_flow_ops._tensor_array_v3(
+          return gen_data_flow_ops.tensor_array_v3(
               dtype=dtype,
               size=size,
               element_shape=element_shape,
@@ -237,7 +237,7 @@ class _GraphTensorArray(object):
       flow = self.flow
     with ops.name_scope(name, "TensorArrayGrad", [self._handle]):
       with ops.colocate_with(self._handle):
-        g_handle, unused_flow = gen_data_flow_ops._tensor_array_grad_v3(
+        g_handle, unused_flow = gen_data_flow_ops.tensor_array_grad_v3(
             handle=self._handle, source=source, flow_in=flow, name=name)
         with ops.control_dependencies([g_handle]):
           flow = array_ops.identity(flow, name="gradient_flow")
@@ -252,7 +252,7 @@ class _GraphTensorArray(object):
 
   def read(self, index, name=None):
     """See TensorArray."""
-    value = gen_data_flow_ops._tensor_array_read_v3(
+    value = gen_data_flow_ops.tensor_array_read_v3(
         handle=self._handle,
         index=index,
         flow_in=self._flow,
@@ -270,7 +270,7 @@ class _GraphTensorArray(object):
       if self._infer_shape:
         self._merge_element_shape(value.shape)
       with self._maybe_colocate_with(value):
-        flow_out = gen_data_flow_ops._tensor_array_write_v3(
+        flow_out = gen_data_flow_ops.tensor_array_write_v3(
             handle=self._handle,
             index=index,
             value=value,
@@ -296,7 +296,7 @@ class _GraphTensorArray(object):
       element_shape = self._element_shape[0]
     else:
       element_shape = tensor_shape.TensorShape(None)
-    value = gen_data_flow_ops._tensor_array_gather_v3(
+    value = gen_data_flow_ops.tensor_array_gather_v3(
         handle=self._handle,
         indices=indices,
         flow_in=self._flow,
@@ -314,7 +314,7 @@ class _GraphTensorArray(object):
           tensor_shape.TensorShape(self._element_shape[0].dims[1:]))
     else:
       element_shape_except0 = tensor_shape.TensorShape(None)
-    value, _ = gen_data_flow_ops._tensor_array_concat_v3(
+    value, _ = gen_data_flow_ops.tensor_array_concat_v3(
         handle=self._handle,
         flow_in=self._flow,
         dtype=self._dtype,
@@ -338,10 +338,10 @@ class _GraphTensorArray(object):
     with ops.name_scope(name, "TensorArrayScatter",
                         [self._handle, value, indices]):
       value = ops.convert_to_tensor(value, name="value")
-      if self._infer_shape and context.in_graph_mode():
+      if self._infer_shape and not context.executing_eagerly():
         self._merge_element_shape(value.shape[1:])
       with self._maybe_colocate_with(value):
-        flow_out = gen_data_flow_ops._tensor_array_scatter_v3(
+        flow_out = gen_data_flow_ops.tensor_array_scatter_v3(
             handle=self._handle,
             indices=indices,
             value=value,
@@ -363,14 +363,14 @@ class _GraphTensorArray(object):
       value = ops.convert_to_tensor(value, name="value")
       with self._maybe_colocate_with(value):
         lengths_64 = math_ops.to_int64(lengths)
-        if self._infer_shape and context.in_graph_mode():
+        if self._infer_shape and not context.executing_eagerly():
           clengths = tensor_util.constant_value(lengths_64)
           if value.shape.dims is not None:
             if clengths is not None and clengths.max() == clengths.min():
               self._merge_element_shape(
                   tensor_shape.TensorShape([clengths[0]]).concatenate(
                       value.shape[1:]))
-        flow_out = gen_data_flow_ops._tensor_array_split_v3(
+        flow_out = gen_data_flow_ops.tensor_array_split_v3(
             handle=self._handle,
             value=value,
             lengths=lengths_64,
@@ -386,13 +386,13 @@ class _GraphTensorArray(object):
 
   def size(self, name=None):
     """See TensorArray."""
-    return gen_data_flow_ops._tensor_array_size_v3(
+    return gen_data_flow_ops.tensor_array_size_v3(
         handle=self._handle, flow_in=self.flow, name=name)
 
   @tf_should_use.should_use_result
   def close(self, name=None):
     """See TensorArray."""
-    return gen_data_flow_ops._tensor_array_close_v3(
+    return gen_data_flow_ops.tensor_array_close_v3(
         handle=self._handle, name=name)
 
 # pylint: enable=protected-access
@@ -774,10 +774,10 @@ class TensorArray(object):
       ValueError: if both handle and tensor_array_name are provided.
       TypeError: if handle is provided but is not a Tensor.
     """
-    if context.in_graph_mode():
-      implementation = _GraphTensorArray
-    else:
+    if context.executing_eagerly():
       implementation = _EagerTensorArray
+    else:
+      implementation = _GraphTensorArray
 
     self._implementation = implementation(
         dtype,
diff --git a/tensorflow/python/ops/variable_scope.py b/tensorflow/python/ops/variable_scope.py
index 81565a63774da49628d100ef071b02f6311f6af2..c35735ca656b21d43f758830e68e5777d654f271 100644
--- a/tensorflow/python/ops/variable_scope.py
+++ b/tensorflow/python/ops/variable_scope.py
@@ -24,6 +24,7 @@ import copy
 import enum  # pylint: disable=g-bad-import-order
 import functools
 import sys
+import threading
 import traceback
 
 import six
@@ -211,23 +212,8 @@ class _VariableStore(object):
     """Create a variable store."""
     self._vars = {}  # A dictionary of the stored TensorFlow variables.
     self._partitioned_vars = {}  # A dict of the stored PartitionedVariables.
-    self.variable_scopes_count = {}  # Count re-used variable scopes.
     self._store_eager_variables = False
 
-  def open_variable_scope(self, scope_name):
-    if scope_name in self.variable_scopes_count:
-      self.variable_scopes_count[scope_name] += 1
-    else:
-      self.variable_scopes_count[scope_name] = 1
-
-  def close_variable_subscopes(self, scope_name):
-    for k in self.variable_scopes_count:
-      if not scope_name or k.startswith(scope_name + "/"):
-        self.variable_scopes_count[k] = 0
-
-  def variable_scope_count(self, scope_name):
-    return self.variable_scopes_count.get(scope_name, 0)
-
   def get_variable(self, name, shape=None, dtype=dtypes.float32,
                    initializer=None, regularizer=None, reuse=None,
                    trainable=True, collections=None, caching_device=None,
@@ -321,7 +307,7 @@ class _VariableStore(object):
       raise ValueError(
           "Passed a custom_getter which is not callable: %s" % custom_getter)
 
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       if not self._store_eager_variables and reuse:
         raise RuntimeError(
             "When eager execution is enabled variable reuse is only supported"
@@ -518,7 +504,7 @@ class _VariableStore(object):
         when violating reuse during variable creation, or if an existing
         sharded variable exists for the given name but with different sharding.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise NotImplementedError("Partitioned variables are not yet supported "
                                 "when eager execution is enabled.")
 
@@ -798,7 +784,7 @@ class _VariableStore(object):
         validate_shape=validate_shape,
         constraint=constraint,
         use_resource=use_resource)
-    if context.in_graph_mode() or self._store_eager_variables:
+    if not context.executing_eagerly() or self._store_eager_variables:
       # In eager mode we do not want to keep default references to Variable
       # objects as this will prevent their memory from being released.
       self._vars[name] = v
@@ -811,12 +797,12 @@ class _VariableStore(object):
         with ops.name_scope(name + "/Regularizer/"):
           loss = regularizer(v)
         if loss is not None:
-          if context.in_graph_mode():
-            v_name = v.name
-            loss_name = loss.name
-          else:
+          if context.executing_eagerly():
             v_name = "v_%s" % type(v)
             loss_name = "loss_%s" % type(loss)
+          else:
+            v_name = v.name
+            loss_name = loss.name
           logging.vlog(1, "Applied regularizer to %s and added the result %s "
                        "to REGULARIZATION_LOSSES.", v_name, loss_name)
           ops.add_to_collection(ops.GraphKeys.REGULARIZATION_LOSSES, loss)
@@ -920,7 +906,7 @@ class VariableScope(object):
     self._dtype = dtype
     self._use_resource = use_resource
     self._constraint = constraint
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       if self._caching_device is not None:
         raise NotImplementedError("Caching devices is not yet supported "
                                   "when eager execution is enabled.")
@@ -988,7 +974,7 @@ class VariableScope(object):
 
   def set_use_resource(self, use_resource):
     """Sets whether to use ResourceVariables for this scope."""
-    if context.in_eager_mode() and not use_resource:
+    if context.executing_eagerly() and not use_resource:
       raise ValueError("When eager execution is enabled, "
                        "use_resource cannot be set to false.")
     self._use_resource = use_resource
@@ -999,14 +985,14 @@ class VariableScope(object):
 
   def set_caching_device(self, caching_device):
     """Set caching_device for this scope."""
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise NotImplementedError("Caching devices are not yet supported "
                                 "when eager execution is enabled.")
     self._caching_device = caching_device
 
   def set_partitioner(self, partitioner):
     """Set partitioner for this scope."""
-    if partitioner and context.in_eager_mode():
+    if partitioner and context.executing_eagerly():
       raise NotImplementedError("Partitioned variables are not yet supported "
                                 "when eager execution is enabled.")
     self._partitioner = partitioner
@@ -1057,14 +1043,14 @@ class VariableScope(object):
       partitioner = self._partitioner
     if custom_getter is None:
       custom_getter = self._custom_getter
-    if context.in_graph_mode():
+    if context.executing_eagerly():
+      reuse = False
+      use_resource = True
+    else:
       if reuse is None:
         reuse = self._reuse
       if use_resource is None:
         use_resource = self._use_resource
-    else:
-      reuse = False
-      use_resource = True
 
     full_name = self.name + "/" + name if self.name else name
     # Variable names only depend on variable_scope (full_name here),
@@ -1107,7 +1093,7 @@ class VariableScope(object):
                                 use_resource=None,
                                 constraint=None):
     """Gets an existing variable with this name or create a new one."""
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise NotImplementedError("Partitioned variables are not yet supported "
                                 "when eager execution is enabled.")
     if initializer is None:
@@ -1160,18 +1146,49 @@ class VariableScope(object):
 
 
 _VARSTORE_KEY = ("__variable_store",)
-_VARSCOPE_KEY = ("__varscope",)
+_VARSCOPESTORE_KEY = ("__varscope",)
+
+
+class _VariableScopeStore(threading.local):
+  """A thread local store for the current variable scope and scope counts."""
+
+  def __init__(self):
+    super(_VariableScopeStore, self).__init__()
+    self.current_scope = VariableScope(False)
+    self.variable_scopes_count = {}
+
+  def open_variable_scope(self, scope_name):
+    if scope_name in self.variable_scopes_count:
+      self.variable_scopes_count[scope_name] += 1
+    else:
+      self.variable_scopes_count[scope_name] = 1
+
+  def close_variable_subscopes(self, scope_name):
+    for k in self.variable_scopes_count:
+      if not scope_name or k.startswith(scope_name + "/"):
+        self.variable_scopes_count[k] = 0
+
+  def variable_scope_count(self, scope_name):
+    return self.variable_scopes_count.get(scope_name, 0)
+
+
+def get_variable_scope_store():
+  """Returns the variable scope store for the current thread."""
+  scope_store = ops.get_collection(_VARSCOPESTORE_KEY)
+
+  if not scope_store:
+    scope_store = _VariableScopeStore()
+    ops.add_to_collection(_VARSCOPESTORE_KEY, scope_store)
+  else:
+    scope_store = scope_store[0]
+
+  return scope_store
 
 
 @tf_export("get_variable_scope")
 def get_variable_scope():
   """Returns the current variable scope."""
-  scope = ops.get_collection(_VARSCOPE_KEY)
-  if scope:  # This collection has at most 1 element, the default scope at [0].
-    return scope[0]
-  scope = VariableScope(False)
-  ops.add_to_collection(_VARSCOPE_KEY, scope)
-  return scope
+  return get_variable_scope_store().current_scope
 
 
 def _get_default_variable_store():
@@ -1274,6 +1291,9 @@ class EagerVariableStore(object):
     # pylint: enable=protected-access
 
 
+# The argument list for get_variable must match arguments to get_local_variable.
+# So, if you are updating the arguments, also update arguments to
+# get_local_variable below.
 @tf_export("get_variable")
 def get_variable(name,
                  shape=None,
@@ -1385,15 +1405,32 @@ get_variable.__doc__ = get_variable_or_local_docstring % (
     "GraphKeys.GLOBAL_VARIABLES")
 
 
-@functools.wraps(get_variable)
+# The argument list for get_local_variable must match arguments to get_variable.
+# So, if you are updating the arguments, also update arguments to get_variable.
 @tf_export("get_local_variable")
-def get_local_variable(*args, **kwargs):
-  kwargs["trainable"] = False
-  if "collections" in kwargs:
-    kwargs["collections"] += [ops.GraphKeys.LOCAL_VARIABLES]
+def get_local_variable(name,
+                       shape=None,
+                       dtype=None,
+                       initializer=None,
+                       regularizer=None,
+                       trainable=False,  # pylint: disable=unused-argument
+                       collections=None,
+                       caching_device=None,
+                       partitioner=None,
+                       validate_shape=True,
+                       use_resource=None,
+                       custom_getter=None,
+                       constraint=None):
+  if collections:
+    collections += [ops.GraphKeys.LOCAL_VARIABLES]
   else:
-    kwargs["collections"] = [ops.GraphKeys.LOCAL_VARIABLES]
-  return get_variable(*args, **kwargs)
+    collections = [ops.GraphKeys.LOCAL_VARIABLES]
+  return get_variable(
+      name, shape=shape, dtype=dtype, initializer=initializer,
+      regularizer=regularizer, trainable=False, collections=collections,
+      caching_device=caching_device, partitioner=partitioner,
+      validate_shape=validate_shape, use_resource=use_resource,
+      custom_getter=custom_getter, constraint=constraint)
 get_local_variable.__doc__ = get_variable_or_local_docstring % (
     "Gets an existing *local* variable or creates a new one.",
     "Behavior is the same as in `get_variable`, except that variables are\n"
@@ -1555,10 +1592,8 @@ class _pure_variable_scope(object):  # pylint: disable=invalid-name
     self._dtype = dtype
     self._use_resource = use_resource
     self._constraint = constraint
-    get_variable_scope()  # Ensure that a default exists, then get a pointer.
-    # Get the reference to the collection as we want to modify it in place.
-    self._default_varscope = ops.get_collection_ref(_VARSCOPE_KEY)
     self._var_store = _get_default_variable_store()
+    self._var_scope_store = get_variable_scope_store()
     if isinstance(self._name_or_scope, VariableScope):
       self._new_name = self._name_or_scope.name
       name_scope = self._name_or_scope._name_scope  # pylint: disable=protected-access
@@ -1606,10 +1641,11 @@ class _pure_variable_scope(object):  # pylint: disable=invalid-name
         a reuse scope, or if reuse is not `None` or `True`.
       TypeError: when the types of some arguments are not appropriate.
     """
-    self._old = self._default_varscope[0]
+    self._old = self._var_scope_store.current_scope
     if isinstance(self._name_or_scope, VariableScope):
-      self._var_store.open_variable_scope(self._new_name)
-      self._old_subscopes = copy.copy(self._var_store.variable_scopes_count)
+      self._var_scope_store.open_variable_scope(self._new_name)
+      self._old_subscopes = copy.copy(
+          self._var_scope_store.variable_scopes_count)
       variable_scope_object = self._cached_variable_scope_object
     else:
       # Handler for the case when we just prolong current variable scope.
@@ -1652,17 +1688,17 @@ class _pure_variable_scope(object):  # pylint: disable=invalid-name
         variable_scope_object.set_dtype(self._dtype)
       if self._use_resource is not None:
         variable_scope_object.set_use_resource(self._use_resource)
-      self._var_store.open_variable_scope(self._new_name)
-    self._default_varscope[0] = variable_scope_object
+      self._var_scope_store.open_variable_scope(self._new_name)
+    self._var_scope_store.current_scope = variable_scope_object
     return variable_scope_object
 
   def __exit__(self, type_arg, value_arg, traceback_arg):
     # If jumping out from a non-prolonged scope, restore counts.
     if isinstance(self._name_or_scope, VariableScope):
-      self._var_store.variable_scopes_count = self._old_subscopes
+      self._var_scope_store.variable_scopes_count = self._old_subscopes
     else:
-      self._var_store.close_variable_subscopes(self._new_name)
-    self._default_varscope[0] = self._old
+      self._var_scope_store.close_variable_subscopes(self._new_name)
+    self._var_scope_store.current_scope = self._old
 
 
 def _maybe_wrap_custom_getter(custom_getter, old_getter):
@@ -1687,13 +1723,13 @@ def _maybe_wrap_custom_getter(custom_getter, old_getter):
 
 def _get_unique_variable_scope(prefix):
   """Get a name with the given prefix unique in the current variable scope."""
-  var_store = _get_default_variable_store()
+  var_scope_store = get_variable_scope_store()
   current_scope = get_variable_scope()
   name = current_scope.name + "/" + prefix if current_scope.name else prefix
-  if var_store.variable_scope_count(name) == 0:
+  if var_scope_store.variable_scope_count(name) == 0:
     return prefix
   idx = 1
-  while var_store.variable_scope_count(name + ("_%d" % idx)) > 0:
+  while var_scope_store.variable_scope_count(name + ("_%d" % idx)) > 0:
     idx += 1
   return prefix + ("_%d" % idx)
 
@@ -1709,9 +1745,10 @@ class variable_scope(object):
   graph, ensures that graph is the default graph, and pushes a name scope and a
   variable scope.
 
-  If `name_or_scope` is not None, it is used as is. If `scope` is None, then
-  `default_name` is used.  In that case, if the same name has been previously
-  used in the same scope, it will be made unique by appending `_N` to it.
+  If `name_or_scope` is not None, it is used as is. If `name_or_scope` is None,
+  then `default_name` is used.  In that case, if the same name has been
+  previously used in the same scope, it will be made unique by appending `_N`
+  to it.
 
   Variable scope allows you to create new variables and to share already created
   ones while providing checks to not create or share by accident. For details,
@@ -1790,6 +1827,32 @@ class variable_scope(object):
   discouraged) to pass False to the reuse argument, yielding undocumented
   behaviour slightly different from None. Starting at 1.1.0 passing None and
   False as reuse has exactly the same effect.
+
+  A note about using variable scopes in a multi-threaded environment: Variable
+  scopes are thread-local, so one thread will not see another thread's current
+  scope. Also, when using `default_name`, unique scope names are generated
+  only on a per-thread basis. If the same name was used within a different
+  thread, that doesn't prevent a new thread from creating the same scope.
+  However, the underlying variable store is shared across threads (within the
+  same graph). As such, if another thread tries to create a new variable with
+  the same name as a variable created by a previous thread, it will fail unless
+  reuse is True.
+
+  Further, each thread starts with an empty variable scope. So if you wish to
+  preserve name prefixes from a scope from the main thread, you should capture
+  the main thread's scope and re-enter it in each thread. For example:
+
+  ```
+  main_thread_scope = variable_scope.get_variable_scope()
+
+  # Thread's target function:
+  def thread_target_fn(captured_scope):
+    with variable_scope.variable_scope(captured_scope):
+      pass  # ... regular code for this thread
+
+
+  thread = threading.Thread(target=thread_target_fn, args=(main_thread_scope,))
+  ```
   """
 
   def __init__(self,
@@ -1871,7 +1934,7 @@ class variable_scope(object):
       raise ValueError("The reuse parameter must be True or False or None.")
     if self._values is None:
       self._values = []
-    self._in_graph_mode = not context.in_eager_mode()
+    self._in_graph_mode = not context.executing_eagerly()
     if self._in_graph_mode:
       self._graph = ops._get_graph_from_inputs(self._values)  # pylint: disable=protected-access
     self._cached_pure_variable_scope = None
@@ -2111,13 +2174,13 @@ def default_variable_creator(next_creator=None, **kwargs):
   use_resource = kwargs.get("use_resource", None)
   if use_resource is None:
     use_resource = get_variable_scope().use_resource
-  if use_resource or (use_resource is None and context.in_eager_mode()):
+  if use_resource or (use_resource is None and context.executing_eagerly()):
     return resource_variable_ops.ResourceVariable(
         initial_value=initial_value, trainable=trainable,
         collections=collections, validate_shape=validate_shape,
         caching_device=caching_device, name=name, dtype=dtype,
         constraint=constraint)
-  elif not use_resource and context.in_eager_mode():
+  elif not use_resource and context.executing_eagerly():
     raise RuntimeError(
         "VariableScope should use resource variable when eager execution is"
         " enabled, but use_resource is False."
@@ -2145,7 +2208,7 @@ def variable(initial_value=None,
              constraint=None,
              use_resource=None):
   previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
-  for getter in ops.get_default_graph()._get_variable_creator_stack():  # pylint: disable=protected-access
+  for getter in ops.get_default_graph()._variable_creator_stack:  # pylint: disable=protected-access
     previous_getter = _make_getter(getter, previous_getter)
   return previous_getter(initial_value=initial_value,
                          trainable=trainable,
diff --git a/tensorflow/python/ops/variables.py b/tensorflow/python/ops/variables.py
index d382683858be5d91755ef1a15ebbc6ae2287f8a7..c646f795896f0abfce3eb9a57cadc27299714023 100644
--- a/tensorflow/python/ops/variables.py
+++ b/tensorflow/python/ops/variables.py
@@ -125,8 +125,8 @@ class Variable(checkpointable.CheckpointableBase):
 
   @compatibility(eager)
   `tf.Variable` is not compatible with eager execution.  Use
-  `tfe.Variable` instead which is compatible with both eager execution
-  and graph construction.  See [the TensorFlow Eager Execution
+  `tf.contrib.eager.Variable` instead which is compatible with both eager
+  execution and graph construction.  See [the TensorFlow Eager Execution
   guide](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/g3doc/guide.md#variables-and-optimizers)
   for details on how variables work in eager execution.
   @end_compatibility
@@ -210,10 +210,11 @@ class Variable(checkpointable.CheckpointableBase):
     for details on how variables work in eager execution.
     @end_compatibility
     """
-    if not context.in_graph_mode():
-      raise RuntimeError("tf.Variable not supported in Eager mode. "
-                         "Please use tfe.Variable instead")
-    self._in_graph_mode = context.in_graph_mode()
+    if context.executing_eagerly():
+      raise RuntimeError(
+          "tf.Variable not supported when eager execution is enabled. "
+          "Please use tf.contrib.eager.Variable instead")
+    self._in_graph_mode = True
     if variable_def:
       # If variable_def is provided, recreates the variable from its fields.
       if initial_value:
@@ -234,7 +235,7 @@ class Variable(checkpointable.CheckpointableBase):
           constraint=constraint)
 
   def __repr__(self):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       return "<tf.Variable '%s' shape=%s dtype=%s, numpy=%s>" % (
           self.name, self.get_shape(), self.dtype.name,
           ops.numpy_text(self.read_value(), is_repr=True))
@@ -292,6 +293,7 @@ class Variable(checkpointable.CheckpointableBase):
     Raises:
       ValueError: If the initial value is not specified, or does not have a
         shape and `validate_shape` is `True`.
+      RuntimeError: If lifted into the eager context.
     """
     _ = expected_shape
     if initial_value is None:
@@ -307,6 +309,9 @@ class Variable(checkpointable.CheckpointableBase):
     if constraint is not None and not callable(constraint):
       raise ValueError("The `constraint` argument must be a callable.")
 
+    # Store the graph key so optimizers know how to only retrieve variables from
+    # this graph.
+    self._graph_key = ops.get_default_graph()._graph_key  # pylint: disable=protected-access
     if isinstance(initial_value, checkpointable.CheckpointInitialValue):
       self._maybe_initialize_checkpointable()
       self._update_uid = initial_value.checkpoint_position.restore_uid
@@ -315,6 +320,11 @@ class Variable(checkpointable.CheckpointableBase):
     if trainable and ops.GraphKeys.TRAINABLE_VARIABLES not in collections:
       collections = list(collections) + [ops.GraphKeys.TRAINABLE_VARIABLES]
     with ops.init_scope():
+      # Ensure that we weren't lifted into the eager context.
+      if context.executing_eagerly():
+        raise RuntimeError(
+            "tf.Variable not supported when eager execution is enabled. "
+            "Please use tf.contrib.eager.Variable instead")
       with ops.name_scope(name, "Variable", [] if init_from_fn else
                           [initial_value]) as name:
 
@@ -737,15 +747,15 @@ class Variable(checkpointable.CheckpointableBase):
     Raises:
         ValueError: Session is not passed and no default session
     """
-    if context.in_graph_mode():
+    if context.executing_eagerly():
+      self.assign(value)
+    else:
       session = session or ops.get_default_session()
       if session is None:
         raise ValueError(
             "Either session argument should be provided or default session "
             "should be established")
       session.run(self._initializer_op, {self._initializer_op.inputs[1]: value})
-    else:
-      self.assign(value)
 
   # Conversion to tensor.
   @staticmethod
@@ -1245,9 +1255,9 @@ class PartitionedVariable(object):
         information does not match `shape`, or `partitions` has invalid values.
       RuntimeError: If eager execution is enabled
     """
-    if not context.in_graph_mode():
-      raise RuntimeError("tf.PartitionedVariable not supported in "
-                         "eager mode. Please use tfe.Variable instead")
+    if context.executing_eagerly():
+      raise RuntimeError(
+          "tf.PartitionedVariable not supported with eager execution enabled.")
     if not isinstance(variable_list, (list, tuple)):
       raise TypeError(
           "variable_list is not a list or tuple: %s" % variable_list)
@@ -1538,7 +1548,7 @@ def variables_initializer(var_list, name="init"):
   Returns:
     An Op that runs the initializers of all the specified variables.
   """
-  if var_list and context.in_graph_mode():
+  if var_list and not context.executing_eagerly():
     return control_flow_ops.group(*[v.initializer for v in var_list], name=name)
   return control_flow_ops.no_op(name=name)
 
@@ -1560,7 +1570,7 @@ def global_variables_initializer():
   Returns:
     An Op that initializes global variables in the graph.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return control_flow_ops.no_op(name="global_variables_initializer")
   return variables_initializer(global_variables())
 
@@ -1582,7 +1592,7 @@ def local_variables_initializer():
   Returns:
     An Op that initializes all local variables in the graph.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return control_flow_ops.no_op(name="local_variables_initializer")
   return variables_initializer(local_variables())
 
diff --git a/tensorflow/python/platform/googletest.py b/tensorflow/python/platform/googletest.py
index 96219faab719e28a5fa8a9a21c83f81a6f8478e6..8141cf92c568f257a5e9810318182d71f445dfa1 100644
--- a/tensorflow/python/platform/googletest.py
+++ b/tensorflow/python/platform/googletest.py
@@ -36,6 +36,7 @@ from tensorflow.python.platform import benchmark
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import tf_decorator
 from tensorflow.python.util import tf_inspect
+from tensorflow.python.util.tf_export import tf_export
 
 
 Benchmark = benchmark.TensorFlowBenchmark  # pylint: disable=invalid-name
@@ -138,6 +139,7 @@ def StatefulSessionAvailable():
   return False
 
 
+@tf_export('test.StubOutForTesting')
 class StubOutForTesting(object):
   """Support class for stubbing methods out for unit testing.
 
diff --git a/tensorflow/python/platform/sysconfig.py b/tensorflow/python/platform/sysconfig.py
index 5c50fa023dc3b216838390d9356a39e70e2362d2..fdd2b903fc79c40a26392714328f74756f3fff92 100644
--- a/tensorflow/python/platform/sysconfig.py
+++ b/tensorflow/python/platform/sysconfig.py
@@ -68,7 +68,6 @@ def get_compile_flags():
   """
   flags = []
   flags.append('-I%s' % get_include())
-  flags.append('-I%s/external/nsync/public' % get_include())
   flags.append('-D_GLIBCXX_USE_CXX11_ABI=%d' % _CXX11_ABI_FLAG)
   return flags
 
diff --git a/tensorflow/python/platform/test.py b/tensorflow/python/platform/test.py
index 9b7655722ac5a917f2753617f8e99bf2bd2f8d11..1660791febc9da93f3a3a977a17ca876e772a9a5 100644
--- a/tensorflow/python/platform/test.py
+++ b/tensorflow/python/platform/test.py
@@ -62,6 +62,8 @@ if sys.version_info.major == 2:
 else:
   from unittest import mock  # pylint: disable=g-import-not-at-top
 
+tf_export('test.mock')(mock)
+
 # Import Benchmark class
 Benchmark = _googletest.Benchmark  # pylint: disable=invalid-name
 
diff --git a/tensorflow/python/profiler/model_analyzer.py b/tensorflow/python/profiler/model_analyzer.py
index 0e20ca35bba606079ed5b0f225dd3029772b5af3..acf02096fffe8b38e68824878fa698ed69d3895c 100644
--- a/tensorflow/python/profiler/model_analyzer.py
+++ b/tensorflow/python/profiler/model_analyzer.py
@@ -172,7 +172,7 @@ class Profiler(object):
       op_log: optional. tensorflow::tfprof::OpLogProto proto. Used to define
           extra op types.
     """
-    if not graph and context.in_graph_mode():
+    if not graph and not context.executing_eagerly():
       graph = ops.get_default_graph()
     self._coverage = 0.0
     self._graph = graph
@@ -336,7 +336,7 @@ def profile(graph=None,
     If cmd is 'op' or 'code', returns MultiGraphNodeProto proto.
     Side effect: stdout/file/timeline.json depending on options['output']
   """
-  if not graph and context.in_graph_mode():
+  if not graph and not context.executing_eagerly():
     graph = ops.get_default_graph()
 
   if options == _DEFAULT_PROFILE_OPTIONS:
diff --git a/tensorflow/python/profiler/tfprof_logger.py b/tensorflow/python/profiler/tfprof_logger.py
index 8d121064967f2f87cd0aefaa361bfd6f387a3e6e..e651de32ea3bce32a965bfbeefc76ff08a79ac38 100644
--- a/tensorflow/python/profiler/tfprof_logger.py
+++ b/tensorflow/python/profiler/tfprof_logger.py
@@ -156,7 +156,7 @@ def merge_default_with_oplog(graph, op_log=None, run_meta=None,
   Returns:
     tmp_op_log: Merged OpLogProto proto.
   """
-  if not graph and context.in_graph_mode():
+  if not graph and not context.executing_eagerly():
     graph = ops.get_default_graph()
 
   tmp_op_log = tfprof_log_pb2.OpLogProto()
@@ -210,7 +210,7 @@ def write_op_log(graph, log_dir, op_log=None, run_meta=None, add_trace=True):
     add_trace: Whether to add python code trace information.
         Used to support "code" view.
   """
-  if not graph and context.in_graph_mode():
+  if not graph and not context.executing_eagerly():
     graph = ops.get_default_graph()
   op_log = merge_default_with_oplog(graph, op_log, run_meta, add_trace)
 
diff --git a/tensorflow/python/pywrap_tfe.i b/tensorflow/python/pywrap_tfe.i
index 7ab0db526881109765adf83749bd01e4d543e5b2..39fabb9c1bc646a09557293c1f645a8b97f5bbdd 100644
--- a/tensorflow/python/pywrap_tfe.i
+++ b/tensorflow/python/pywrap_tfe.i
@@ -26,11 +26,15 @@ limitations under the License.
 %rename("%s") TFE_ContextClearCaches;
 %rename("%s") TFE_ContextGetDevicePlacementPolicy;
 %rename("%s") TFE_ContextSetThreadLocalDevicePlacementPolicy;
+%rename("%s") TFE_ContextSetAsyncForThread;
+%rename("%s") TFE_ContextAsyncWait;
+%rename("%s") TFE_ContextAsyncClearError;
 %rename("%s") TFE_OpNameGetAttrType;
 %rename("%s") TFE_Py_InitEagerTensor;
 %rename("%s") TFE_Py_RegisterExceptionClass;
 %rename("%s") TFE_Py_RegisterBackwardFunctionGetter;
 %rename("%s") TFE_Py_RegisterFallbackExceptionClass;
+%rename("%s") TFE_Py_RegisterResourceVariableType;
 %rename("%s") TFE_Py_Execute;
 %rename("%s") TFE_Py_FastPathExecute;
 %rename("%s") TFE_Py_RecordGradient;
@@ -50,6 +54,7 @@ limitations under the License.
 %rename("%s") TFE_NewContextOptions;
 %rename("%s") TFE_ContextOptionsSetConfig;
 %rename("%s") TFE_ContextOptionsSetDevicePlacementPolicy;
+%rename("%s") TFE_ContextOptionsSetAsync;
 %rename("%s") TFE_DeleteContextOptions;
 %rename("%s") TFE_Py_TensorShapeSlice;
 
diff --git a/tensorflow/python/saved_model/builder_impl.py b/tensorflow/python/saved_model/builder_impl.py
index 7347da75364818b95d3f2ad7dfa74a8c3614b161..3447d917e9bf2dace3de784106dadb1fcc3a9647 100644
--- a/tensorflow/python/saved_model/builder_impl.py
+++ b/tensorflow/python/saved_model/builder_impl.py
@@ -193,7 +193,8 @@ class SavedModelBuilder(object):
   def _validate_tensor_info(self, tensor_info):
     """Validates the `TensorInfo` proto.
 
-    Checks if the `name` and `dtype` fields exist and are non-empty.
+    Checks if the `encoding` (`name` or `coo_sparse`) and `dtype` fields exist
+    and are non-empty.
 
     Args:
       tensor_info: `TensorInfo` protocol buffer to validate.
@@ -206,10 +207,12 @@ class SavedModelBuilder(object):
       raise AssertionError(
           "All TensorInfo protos used in the SignatureDefs must have the name "
           "and dtype fields set.")
-    if not tensor_info.name:
+    if tensor_info.WhichOneof("encoding") is None:
+      # TODO(soergel) validate each of the fields of coo_sparse
       raise AssertionError(
-          "All TensorInfo protos used in the SignatureDefs must have the name "
-          "field set: %s" % tensor_info)
+          "All TensorInfo protos used in the SignatureDefs must have one of "
+          "the 'encoding' fields (e.g., name or coo_sparse) set: %s"
+          % tensor_info)
     if tensor_info.dtype is types_pb2.DT_INVALID:
       raise AssertionError(
           "All TensorInfo protos used in the SignatureDefs must have the dtype "
diff --git a/tensorflow/python/saved_model/saved_model_test.py b/tensorflow/python/saved_model/saved_model_test.py
index d9d316882584470769c14cf0c5f265b58e37ab43..804255375e7c5215597a5dcca02f3b32f2c0a497 100644
--- a/tensorflow/python/saved_model/saved_model_test.py
+++ b/tensorflow/python/saved_model/saved_model_test.py
@@ -94,7 +94,7 @@ class SavedModelTest(test.TestCase):
     self.assertEqual(expected_asset_file_name, asset.filename)
     self.assertEqual(expected_asset_tensor_name, asset.tensor_info.name)
 
-  def _validate_inputs_tensor_info(self, builder, tensor_info):
+  def _validate_inputs_tensor_info_fail(self, builder, tensor_info):
     with self.test_session(graph=ops.Graph()) as sess:
       self._init_and_validate_variable(sess, "v", 42)
 
@@ -107,7 +107,18 @@ class SavedModelTest(test.TestCase):
           sess, ["foo"],
           signature_def_map={"foo_key": foo_signature})
 
-  def _validate_outputs_tensor_info(self, builder, tensor_info):
+  def _validate_inputs_tensor_info_accept(self, builder, tensor_info):
+    with self.test_session(graph=ops.Graph()) as sess:
+      self._init_and_validate_variable(sess, "v", 42)
+
+      foo_signature = signature_def_utils.build_signature_def({
+          "foo_inputs": tensor_info
+      }, dict(), "foo")
+      builder.add_meta_graph_and_variables(
+          sess, ["foo"],
+          signature_def_map={"foo_key": foo_signature})
+
+  def _validate_outputs_tensor_info_fail(self, builder, tensor_info):
     with self.test_session(graph=ops.Graph()) as sess:
       self._init_and_validate_variable(sess, "v", 42)
 
@@ -119,6 +130,16 @@ class SavedModelTest(test.TestCase):
           sess, ["foo"],
           signature_def_map={"foo_key": foo_signature})
 
+  def _validate_outputs_tensor_info_accept(self, builder, tensor_info):
+    with self.test_session(graph=ops.Graph()) as sess:
+      self._init_and_validate_variable(sess, "v", 42)
+
+      foo_signature = signature_def_utils.build_signature_def(
+          dict(), {"foo_outputs": tensor_info}, "foo")
+      builder.add_meta_graph_and_variables(
+          sess, ["foo"],
+          signature_def_map={"foo_key": foo_signature})
+
   def testMaybeSavedModelDir(self):
     base_path = test.test_src_dir_path("/python/saved_model")
     self.assertFalse(loader.maybe_saved_model_directory(base_path))
@@ -538,23 +559,50 @@ class SavedModelTest(test.TestCase):
       self.assertEqual("bar", bar_signature["bar_key"].method_name)
       self.assertEqual("foo_new", bar_signature["foo_key"].method_name)
 
-  def testSignatureDefValidation(self):
-    export_dir = self._get_export_dir("test_signature_def_validation")
+  def testSignatureDefValidationFails(self):
+    export_dir = self._get_export_dir("test_signature_def_validation_fail")
     builder = saved_model_builder.SavedModelBuilder(export_dir)
 
-    tensor_without_name = meta_graph_pb2.TensorInfo()
-    tensor_without_name.dtype = types_pb2.DT_FLOAT
-    self._validate_inputs_tensor_info(builder, tensor_without_name)
-    self._validate_outputs_tensor_info(builder, tensor_without_name)
+    tensor_without_encoding = meta_graph_pb2.TensorInfo()
+    tensor_without_encoding.dtype = types_pb2.DT_FLOAT
+    self._validate_inputs_tensor_info_fail(builder, tensor_without_encoding)
+    self._validate_outputs_tensor_info_fail(builder, tensor_without_encoding)
 
     tensor_without_dtype = meta_graph_pb2.TensorInfo()
     tensor_without_dtype.name = "x"
-    self._validate_inputs_tensor_info(builder, tensor_without_dtype)
-    self._validate_outputs_tensor_info(builder, tensor_without_dtype)
+    self._validate_inputs_tensor_info_fail(builder, tensor_without_dtype)
+    self._validate_outputs_tensor_info_fail(builder, tensor_without_dtype)
 
     tensor_empty = meta_graph_pb2.TensorInfo()
-    self._validate_inputs_tensor_info(builder, tensor_empty)
-    self._validate_outputs_tensor_info(builder, tensor_empty)
+    self._validate_inputs_tensor_info_fail(builder, tensor_empty)
+    self._validate_outputs_tensor_info_fail(builder, tensor_empty)
+
+  def testSignatureDefValidationSucceedsWithName(self):
+    tensor_with_name = meta_graph_pb2.TensorInfo()
+    tensor_with_name.name = "foo"
+    tensor_with_name.dtype = types_pb2.DT_FLOAT
+
+    export_dir = self._get_export_dir("test_signature_def_validation_name_1")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+    self._validate_inputs_tensor_info_accept(builder, tensor_with_name)
+
+    export_dir = self._get_export_dir("test_signature_def_validation_name_2")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+    self._validate_outputs_tensor_info_accept(builder, tensor_with_name)
+
+  def testSignatureDefValidationSucceedsWithCoo(self):
+    tensor_with_coo = meta_graph_pb2.TensorInfo()
+    # TODO(soergel) test validation of each of the fields of coo_sparse
+    tensor_with_coo.coo_sparse.values_tensor_name = "foo"
+    tensor_with_coo.dtype = types_pb2.DT_FLOAT
+
+    export_dir = self._get_export_dir("test_signature_def_validation_coo_1")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+    self._validate_inputs_tensor_info_accept(builder, tensor_with_coo)
+
+    export_dir = self._get_export_dir("test_signature_def_validation_coo_2")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+    self._validate_outputs_tensor_info_accept(builder, tensor_with_coo)
 
   def testAssets(self):
     export_dir = self._get_export_dir("test_assets")
diff --git a/tensorflow/python/summary/summary.py b/tensorflow/python/summary/summary.py
index b80ad79074e85bdeae70148b2822c319c29468bc..97f2ddfdfc49e415bdcff428d6bd3f5b61cc3f20 100644
--- a/tensorflow/python/summary/summary.py
+++ b/tensorflow/python/summary/summary.py
@@ -48,10 +48,12 @@ from tensorflow.core.util.event_pb2 import SessionLog
 from tensorflow.core.util.event_pb2 import TaggedRunMetadata
 # pylint: enable=unused-import
 
+
 from tensorflow.python.eager import context as _context
 from tensorflow.python.framework import dtypes as _dtypes
 from tensorflow.python.framework import ops as _ops
 from tensorflow.python.ops import gen_logging_ops as _gen_logging_ops
+from tensorflow.python.ops import gen_summary_ops as _gen_summary_ops  # pylint: disable=unused-import
 from tensorflow.python.ops import summary_op_util as _summary_op_util
 
 # exports tensor-related summaries
@@ -98,8 +100,7 @@ def scalar(name, tensor, collections=None, family=None):
   """
   with _summary_op_util.summary_scope(
       name, family, values=[tensor]) as (tag, scope):
-    # pylint: disable=protected-access
-    val = _gen_logging_ops._scalar_summary(tags=tag, values=tensor, name=scope)
+    val = _gen_logging_ops.scalar_summary(tags=tag, values=tensor, name=scope)
     _summary_op_util.collect(val, collections, [_ops.GraphKeys.SUMMARIES])
   return val
 
@@ -152,8 +153,7 @@ def image(name, tensor, max_outputs=3, collections=None, family=None):
   """
   with _summary_op_util.summary_scope(
       name, family, values=[tensor]) as (tag, scope):
-    # pylint: disable=protected-access
-    val = _gen_logging_ops._image_summary(
+    val = _gen_logging_ops.image_summary(
         tag=tag, tensor=tensor, max_images=max_outputs, name=scope)
     _summary_op_util.collect(val, collections, [_ops.GraphKeys.SUMMARIES])
   return val
@@ -192,8 +192,7 @@ def histogram(name, values, collections=None, family=None):
   with _summary_op_util.summary_scope(
       name, family, values=[values],
       default_name='HistogramSummary') as (tag, scope):
-    # pylint: disable=protected-access
-    val = _gen_logging_ops._histogram_summary(
+    val = _gen_logging_ops.histogram_summary(
         tag=tag, values=values, name=scope)
     _summary_op_util.collect(val, collections, [_ops.GraphKeys.SUMMARIES])
   return val
@@ -237,10 +236,9 @@ def audio(name, tensor, sample_rate, max_outputs=3, collections=None,
   """
   with _summary_op_util.summary_scope(
       name, family=family, values=[tensor]) as (tag, scope):
-    # pylint: disable=protected-access
     sample_rate = _ops.convert_to_tensor(
         sample_rate, dtype=_dtypes.float32, name='sample_rate')
-    val = _gen_logging_ops._audio_summary_v2(
+    val = _gen_logging_ops.audio_summary_v2(
         tag=tag, tensor=tensor, max_outputs=max_outputs,
         sample_rate=sample_rate, name=scope)
     _summary_op_util.collect(val, collections, [_ops.GraphKeys.SUMMARIES])
@@ -280,14 +278,13 @@ def merge(inputs, collections=None, name=None):
   @end_compatbility
   """
   # pylint: enable=line-too-long
-  if _context.in_eager_mode():
+  if _context.executing_eagerly():
     raise RuntimeError(
         'Merging tf.summary.* ops is not compatible with eager execution. '
         'Use tf.contrib.summary instead.')
   name = _summary_op_util.clean_tag(name)
   with _ops.name_scope(name, 'Merge', inputs):
-    # pylint: disable=protected-access
-    val = _gen_logging_ops._merge_summary(inputs=inputs, name=name)
+    val = _gen_logging_ops.merge_summary(inputs=inputs, name=name)
     _summary_op_util.collect(val, collections, [])
   return val
 
@@ -314,7 +311,7 @@ def merge_all(key=_ops.GraphKeys.SUMMARIES, scope=None):
   summaries under eager execution, use `tf.contrib.summary` instead.
   @end_compatbility
   """
-  if _context.in_eager_mode():
+  if _context.executing_eagerly():
     raise RuntimeError(
         'Merging tf.summary.* ops is not compatible with eager execution. '
         'Use tf.contrib.summary instead.')
diff --git a/tensorflow/python/summary/writer/writer.py b/tensorflow/python/summary/writer/writer.py
index 1f3f2287043c021d636113b5a8807c9f4adf77aa..57f78c156b1334a5486b29f2ddec957e49156e73 100644
--- a/tensorflow/python/summary/writer/writer.py
+++ b/tensorflow/python/summary/writer/writer.py
@@ -343,7 +343,7 @@ class FileWriter(SummaryToEventTransformer):
     summaries under eager execution, use `tf.contrib.summary` instead.
     @end_compatbility
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError(
           "tf.summary.FileWriter is not compatible with eager execution. "
           "Use tf.contrib.summary instead.")
diff --git a/tensorflow/python/tools/BUILD b/tensorflow/python/tools/BUILD
index 63f16c53a29fd65c32077dd29e3b1823c11d457b..1de1adcfbc35e2b760f362cb9784dd415b9a4dc4 100644
--- a/tensorflow/python/tools/BUILD
+++ b/tensorflow/python/tools/BUILD
@@ -14,6 +14,7 @@ py_library(
     name = "tools_pip",
     deps = [
         ":freeze_graph",
+        ":import_pb_to_tensorboard",
         ":inspect_checkpoint",
         ":optimize_for_inference",
         ":print_selective_registration_header",
@@ -248,7 +249,10 @@ py_test(
         "//tensorflow/cc/saved_model:saved_model_half_plus_two",
     ],
     srcs_version = "PY2AND3",
-    tags = ["manual"],
+    tags = [
+        "manual",
+        "no-internal-py3",
+    ],
     deps = [
         ":saved_model_cli",
         "//tensorflow/core:protos_all_py",
diff --git a/tensorflow/python/tools/freeze_graph.py b/tensorflow/python/tools/freeze_graph.py
index a52f325ddbcd90ad011c1c056965912b96f27aaa..e9f1def48c462dcd8a5acf0e3d29d562cd1b3d58 100644
--- a/tensorflow/python/tools/freeze_graph.py
+++ b/tensorflow/python/tools/freeze_graph.py
@@ -56,8 +56,6 @@ from tensorflow.python.saved_model import tag_constants
 from tensorflow.python.tools import saved_model_utils
 from tensorflow.python.training import saver as saver_lib
 
-FLAGS = None
-
 
 def freeze_graph_with_def_protos(input_graph_def,
                                  input_saver_def,
@@ -256,25 +254,24 @@ def freeze_graph(input_graph,
       checkpoint_version=checkpoint_version)
 
 
-def main(unused_args):
-  if FLAGS.checkpoint_version == 1:
+def main(unused_args, flags):
+  if flags.checkpoint_version == 1:
     checkpoint_version = saver_pb2.SaverDef.V1
-  elif FLAGS.checkpoint_version == 2:
+  elif flags.checkpoint_version == 2:
     checkpoint_version = saver_pb2.SaverDef.V2
   else:
     print("Invalid checkpoint version (must be '1' or '2'): %d" %
-          FLAGS.checkpoint_version)
+          flags.checkpoint_version)
     return -1
-  freeze_graph(FLAGS.input_graph, FLAGS.input_saver, FLAGS.input_binary,
-               FLAGS.input_checkpoint, FLAGS.output_node_names,
-               FLAGS.restore_op_name, FLAGS.filename_tensor_name,
-               FLAGS.output_graph, FLAGS.clear_devices, FLAGS.initializer_nodes,
-               FLAGS.variable_names_whitelist, FLAGS.variable_names_blacklist,
-               FLAGS.input_meta_graph, FLAGS.input_saved_model_dir,
-               FLAGS.saved_model_tags, checkpoint_version)
-
+  freeze_graph(flags.input_graph, flags.input_saver, flags.input_binary,
+               flags.input_checkpoint, flags.output_node_names,
+               flags.restore_op_name, flags.filename_tensor_name,
+               flags.output_graph, flags.clear_devices, flags.initializer_nodes,
+               flags.variable_names_whitelist, flags.variable_names_blacklist,
+               flags.input_meta_graph, flags.input_saved_model_dir,
+               flags.saved_model_tags, checkpoint_version)
 
-if __name__ == "__main__":
+def run_main():
   parser = argparse.ArgumentParser()
   parser.register("type", "bool", lambda v: v.lower() == "true")
   parser.add_argument(
@@ -376,5 +373,10 @@ if __name__ == "__main__":
       separated by \',\'. For tag-set contains multiple tags, all tags \
       must be passed in.\
       """)
-  FLAGS, unparsed = parser.parse_known_args()
-  app.run(main=main, argv=[sys.argv[0]] + unparsed)
+  flags, unparsed = parser.parse_known_args()
+
+  my_main = lambda unused_args: main(unused_args, flags)
+  app.run(main=my_main, argv=[sys.argv[0]] + unparsed)
+
+if __name__ == '__main__':
+  run_main()
diff --git a/tensorflow/python/tools/inspect_checkpoint.py b/tensorflow/python/tools/inspect_checkpoint.py
index dd876cbe7fcd64a8de70eb28f67996df9de1dd7d..6504fbc10755c5c543016b8d56d6d53f3311b249 100644
--- a/tensorflow/python/tools/inspect_checkpoint.py
+++ b/tensorflow/python/tools/inspect_checkpoint.py
@@ -30,7 +30,7 @@ FLAGS = None
 
 
 def print_tensors_in_checkpoint_file(file_name, tensor_name, all_tensors,
-                                     all_tensor_names):
+                                     all_tensor_names=False):
   """Prints tensors in a checkpoint file.
 
   If no `tensor_name` is provided, prints the tensor names and shapes
@@ -139,7 +139,7 @@ if __name__ == "__main__":
       const=True,
       type="bool",
       default=False,
-      help="If True, print the values of all the tensors.")
+      help="If True, print the names and values of all the tensors.")
   parser.add_argument(
       "--all_tensor_names",
       nargs="?",
diff --git a/tensorflow/python/tools/print_selective_registration_header_test.py b/tensorflow/python/tools/print_selective_registration_header_test.py
index 36978b0860a423569149cd0572629f9f1f280637..4b3d98242caf683693430f08bd8cb74483f4bc74 100644
--- a/tensorflow/python/tools/print_selective_registration_header_test.py
+++ b/tensorflow/python/tools/print_selective_registration_header_test.py
@@ -24,6 +24,7 @@ import sys
 from google.protobuf import text_format
 
 from tensorflow.core.framework import graph_pb2
+from tensorflow.python.framework import test_util
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
 from tensorflow.python.tools import selective_registration_header_lib
@@ -93,11 +94,16 @@ class PrintOpFilegroupTest(test.TestCase):
 
     ops_and_kernels = selective_registration_header_lib.get_ops_and_kernels(
         'rawproto', self.WriteGraphFiles(graphs), default_ops)
+    matmul_prefix = ''
+    if test_util.IsMklEnabled():
+      matmul_prefix = 'Mkl'
+
     self.assertListEqual(
         [
             ('BiasAdd', 'BiasOp<CPUDevice, float>'),  #
-            ('MatMul', 'MatMulOp<CPUDevice, double, false >'),  #
-            ('MatMul', 'MatMulOp<CPUDevice, float, false >'),  #
+            ('MatMul',
+             matmul_prefix + 'MatMulOp<CPUDevice, double, false >'),  #
+            ('MatMul', matmul_prefix + 'MatMulOp<CPUDevice, float, false >'),  #
             ('NoOp', 'NoOp'),  #
             ('Reshape', 'ReshapeOp'),  #
             ('_Recv', 'RecvOp'),  #
@@ -112,8 +118,9 @@ class PrintOpFilegroupTest(test.TestCase):
     self.assertListEqual(
         [
             ('BiasAdd', 'BiasOp<CPUDevice, float>'),  #
-            ('MatMul', 'MatMulOp<CPUDevice, double, false >'),  #
-            ('MatMul', 'MatMulOp<CPUDevice, float, false >'),  #
+            ('MatMul',
+             matmul_prefix + 'MatMulOp<CPUDevice, double, false >'),  #
+            ('MatMul', matmul_prefix + 'MatMulOp<CPUDevice, float, false >'),  #
             ('NoOp', 'NoOp'),  #
             ('Reshape', 'ReshapeOp'),  #
             ('_Recv', 'RecvOp'),  #
diff --git a/tensorflow/python/tools/saved_model_cli.py b/tensorflow/python/tools/saved_model_cli.py
index 33f6debbcbecb652774c776be54323bbaa824822..b88be4ae04d5dc7a7641fb8dbd7e56e61035869f 100644
--- a/tensorflow/python/tools/saved_model_cli.py
+++ b/tensorflow/python/tools/saved_model_cli.py
@@ -38,11 +38,15 @@ from tensorflow.core.example import example_pb2
 from tensorflow.core.framework import types_pb2
 from tensorflow.python.client import session
 from tensorflow.python.debug.wrappers import local_cli_wrapper
+from tensorflow.python.framework import meta_graph as meta_graph_lib
 from tensorflow.python.framework import ops as ops_lib
 from tensorflow.python.platform import app  # pylint: disable=unused-import
 from tensorflow.python.saved_model import loader
 from tensorflow.python.tools import saved_model_utils
 
+# Set of ops to blacklist.
+_OP_BLACKLIST = set(['WriteFile', 'ReadFile'])
+
 
 def _show_tag_sets(saved_model_dir):
   """Prints the tag-sets stored in SavedModel directory.
@@ -115,7 +119,7 @@ def _get_outputs_tensor_info_from_meta_graph_def(meta_graph_def,
                                                       signature_def_key).outputs
 
 
-def _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key):
+def _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key, indent=0):
   """Prints input and output TensorInfos.
 
   Prints the details of input and output TensorInfos for the SignatureDef mapped
@@ -126,6 +130,7 @@ def _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key):
     tag_set: Group of tag(s) of the MetaGraphDef, in string format, separated by
         ','. For tag-set contains multiple tags, all tags must be passed in.
     signature_def_key: A SignatureDef key string.
+    indent: How far (in increments of 2 spaces) to indent each line of output.
   """
   meta_graph_def = saved_model_utils.get_meta_graph_def(saved_model_dir,
                                                         tag_set)
@@ -134,29 +139,39 @@ def _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key):
   outputs_tensor_info = _get_outputs_tensor_info_from_meta_graph_def(
       meta_graph_def, signature_def_key)
 
-  print('The given SavedModel SignatureDef contains the following input(s):')
+  indent_str = "  " * indent
+  def in_print(s):
+    print(indent_str + s)
+
+  in_print('The given SavedModel SignatureDef contains the following input(s):')
   for input_key, input_tensor in sorted(inputs_tensor_info.items()):
-    print('inputs[\'%s\'] tensor_info:' % input_key)
-    _print_tensor_info(input_tensor)
+    in_print('  inputs[\'%s\'] tensor_info:' % input_key)
+    _print_tensor_info(input_tensor, indent+1)
 
-  print('The given SavedModel SignatureDef contains the following output(s):')
+  in_print('The given SavedModel SignatureDef contains the following '
+           'output(s):')
   for output_key, output_tensor in sorted(outputs_tensor_info.items()):
-    print('outputs[\'%s\'] tensor_info:' % output_key)
-    _print_tensor_info(output_tensor)
+    in_print('  outputs[\'%s\'] tensor_info:' % output_key)
+    _print_tensor_info(output_tensor, indent+1)
 
-  print('Method name is: %s' %
-        meta_graph_def.signature_def[signature_def_key].method_name)
+  in_print('Method name is: %s' %
+           meta_graph_def.signature_def[signature_def_key].method_name)
 
 
-def _print_tensor_info(tensor_info):
+def _print_tensor_info(tensor_info, indent=0):
   """Prints details of the given tensor_info.
 
   Args:
     tensor_info: TensorInfo object to be printed.
+    indent: How far (in increments of 2 spaces) to indent each line of output.
   """
-  print('    dtype: ' +
-        {value: key
-         for (key, value) in types_pb2.DataType.items()}[tensor_info.dtype])
+  indent_str = "  " * indent
+  def in_print(s):
+    print(indent_str + s)
+
+  in_print('    dtype: ' +
+           {value: key
+            for (key, value) in types_pb2.DataType.items()}[tensor_info.dtype])
   # Display shape as tuple.
   if tensor_info.tensor_shape.unknown_rank:
     shape = 'unknown_rank'
@@ -164,8 +179,8 @@ def _print_tensor_info(tensor_info):
     dims = [str(dim.size) for dim in tensor_info.tensor_shape.dim]
     shape = ', '.join(dims)
     shape = '(' + shape + ')'
-  print('    shape: ' + shape)
-  print('    name: ' + tensor_info.name)
+  in_print('    shape: ' + shape)
+  in_print('    name: ' + tensor_info.name)
 
 
 def _show_all(saved_model_dir):
@@ -186,7 +201,8 @@ def _show_all(saved_model_dir):
     signature_def_map = get_signature_def_map(saved_model_dir, tag_set)
     for signature_def_key in sorted(signature_def_map.keys()):
       print('\nsignature_def[\'' + signature_def_key + '\']:')
-      _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key)
+      _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key,
+                           indent=1)
 
 
 def get_meta_graph_def(saved_model_dir, tag_set):
@@ -230,6 +246,27 @@ def get_signature_def_map(saved_model_dir, tag_set):
   return meta_graph.signature_def
 
 
+def scan_meta_graph_def(meta_graph_def):
+  """Scans meta_graph_def and reports whether it contains blacklisted ops.
+
+  Prints any ops that are on the blacklist, or a success message if no
+  blacklisted ops are found.
+
+  Args:
+    meta_graph_def: MetaGraphDef protocol buffer.
+  """
+  all_ops_set = set(
+      meta_graph_lib.ops_used_by_graph_def(meta_graph_def.graph_def))
+  blacklisted_ops = _OP_BLACKLIST & all_ops_set
+  if blacklisted_ops:
+    # TODO(yifeif): print more warnings
+    print('MetaGraph with tag set %s contains the following blacklisted ops:' %
+          meta_graph_def.meta_info_def.tags, blacklisted_ops)
+  else:
+    print('MetaGraph with tag set %s does not contain blacklisted ops.' %
+          meta_graph_def.meta_info_def.tags)
+
+
 def run_saved_model_with_feed_dict(saved_model_dir, tag_set, signature_def_key,
                                    input_tensor_key_feed_dict, outdir,
                                    overwrite_flag, tf_debug=False):
@@ -597,6 +634,21 @@ def run(args):
                                  args.overwrite, tf_debug=args.tf_debug)
 
 
+def scan(args):
+  """Function triggered by scan command.
+
+  Args:
+    args: A namespace parsed from command line.
+  """
+  if args.tag_set:
+    scan_meta_graph_def(
+        saved_model_utils.get_meta_graph_def(args.dir, args.tag_set))
+  else:
+    saved_model = reader.read_saved_model(args.dir)
+    for meta_graph_def in saved_model.meta_graphs:
+      scan_meta_graph_def(meta_graph_def)
+
+
 def create_parser():
   """Creates a parser that parse the command line arguments.
 
@@ -614,19 +666,19 @@ def create_parser():
   show_msg = (
       'Usage examples:\n'
       'To show all tag-sets in a SavedModel:\n'
-      '$saved_model_cli show --dir /tmp/saved_model\n'
+      '$saved_model_cli show --dir /tmp/saved_model\n\n'
       'To show all available SignatureDef keys in a '
       'MetaGraphDef specified by its tag-set:\n'
-      '$saved_model_cli show --dir /tmp/saved_model --tag_set serve\n'
+      '$saved_model_cli show --dir /tmp/saved_model --tag_set serve\n\n'
       'For a MetaGraphDef with multiple tags in the tag-set, all tags must be '
       'passed in, separated by \';\':\n'
       '$saved_model_cli show --dir /tmp/saved_model --tag_set serve,gpu\n\n'
       'To show all inputs and outputs TensorInfo for a specific'
       ' SignatureDef specified by the SignatureDef key in a'
       ' MetaGraph.\n'
-      '$saved_model_cli show --dir /tmp/saved_model --tag_set serve '
-      '--signature_def serving_default\n\n'
-      'To show all available information in the SavedModel\n:'
+      '$saved_model_cli show --dir /tmp/saved_model --tag_set serve'
+      ' --signature_def serving_default\n\n'
+      'To show all available information in the SavedModel:\n'
       '$saved_model_cli show --dir /tmp/saved_model --all')
   parser_show = subparsers.add_parser(
       'show',
@@ -658,12 +710,14 @@ def create_parser():
   run_msg = ('Usage example:\n'
              'To run input tensors from files through a MetaGraphDef and save'
              ' the output tensors to files:\n'
-             '$saved_model_cli show --dir /tmp/saved_model --tag_set serve '
-             '--signature_def serving_default '
-             '--inputs input1_key=/tmp/124.npz[x],input2_key=/tmp/123.npy '
-             '--input_exprs \'input3_key=np.ones(2)\' --input_examples '
-             '\'input4_key=[{"id":[26],"weights":[0.5, 0.5]}]\' '
-             '--outdir=/out\n\n'
+             '$saved_model_cli run --dir /tmp/saved_model --tag_set serve \\\n'
+             '   --signature_def serving_default \\\n'
+             '   --inputs input1_key=/tmp/124.npz[x],input2_key=/tmp/123.npy '
+             '\\\n'
+             '   --input_exprs \'input3_key=np.ones(2)\' \\\n'
+             '   --input_examples '
+             '\'input4_key=[{"id":[26],"weights":[0.5, 0.5]}]\' \\\n'
+             '   --outdir=/out\n\n'
              'For more information about input file format, please see:\n'
              'https://www.tensorflow.org/programmers_guide/saved_model_cli\n')
   parser_run = subparsers.add_parser(
@@ -716,6 +770,26 @@ def create_parser():
            'SavedModel.')
   parser_run.set_defaults(func=run)
 
+  # scan command
+  scan_msg = ('Usage example:\n'
+              'To scan for blacklisted ops in SavedModel:\n'
+              '$saved_model_cli scan --dir /tmp/saved_model\n'
+              'To scan a specific MetaGraph, pass in --tag_set\n')
+  parser_scan = subparsers.add_parser(
+      'scan',
+      description=scan_msg,
+      formatter_class=argparse.RawTextHelpFormatter)
+  parser_scan.add_argument(
+      '--dir',
+      type=str,
+      required=True,
+      help='directory containing the SavedModel to scan')
+  parser_scan.add_argument(
+      '--tag_set',
+      type=str,
+      help='tag-set of graph in SavedModel to scan, separated by \',\'')
+  parser_scan.set_defaults(func=scan)
+
   return parser
 
 
diff --git a/tensorflow/python/tools/saved_model_cli_test.py b/tensorflow/python/tools/saved_model_cli_test.py
index d6cbc49ba1e08a6b808b228fb8d69fc14f36e3d2..eedc893a38d3d0857dd49c7ce03f3921da48fdbd 100644
--- a/tensorflow/python/tools/saved_model_cli_test.py
+++ b/tensorflow/python/tools/saved_model_cli_test.py
@@ -61,83 +61,84 @@ class SavedModelCLITestCase(test.TestCase):
     exp_out = """MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
 
 signature_def['classify_x2_to_y3']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x2:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['scores'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y3:0
-Method name is: tensorflow/serving/classify
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: x2:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['scores'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y3:0
+  Method name is: tensorflow/serving/classify
 
 signature_def['classify_x_to_y']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_STRING
-    shape: unknown_rank
-    name: tf_example:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['scores'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y:0
-Method name is: tensorflow/serving/classify
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_STRING
+        shape: unknown_rank
+        name: tf_example:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['scores'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y:0
+  Method name is: tensorflow/serving/classify
 
 signature_def['regress_x2_to_y3']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x2:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['outputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y3:0
-Method name is: tensorflow/serving/regress
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: x2:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['outputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y3:0
+  Method name is: tensorflow/serving/regress
 
 signature_def['regress_x_to_y']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_STRING
-    shape: unknown_rank
-    name: tf_example:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['outputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y:0
-Method name is: tensorflow/serving/regress
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_STRING
+        shape: unknown_rank
+        name: tf_example:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['outputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y:0
+  Method name is: tensorflow/serving/regress
 
 signature_def['regress_x_to_y2']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_STRING
-    shape: unknown_rank
-    name: tf_example:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['outputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y2:0
-Method name is: tensorflow/serving/regress
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_STRING
+        shape: unknown_rank
+        name: tf_example:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['outputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y2:0
+  Method name is: tensorflow/serving/regress
 
 signature_def['serving_default']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['x'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['y'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y:0
-Method name is: tensorflow/serving/predict"""
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['x'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: x:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['y'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y:0
+  Method name is: tensorflow/serving/predict"""
     # pylint: enable=line-too-long
+    self.maxDiff = None  # Produce a useful error msg if the comparison fails
     self.assertMultiLineEqual(output, exp_out)
     self.assertEqual(err.getvalue().strip(), '')
 
@@ -193,11 +194,11 @@ Method name is: tensorflow/serving/predict"""
     output = out.getvalue().strip()
     expected_output = (
         'The given SavedModel SignatureDef contains the following input(s):\n'
-        'inputs[\'x\'] tensor_info:\n'
-        '    dtype: DT_FLOAT\n    shape: (-1, 1)\n    name: x:0\n'
+        '  inputs[\'x\'] tensor_info:\n'
+        '      dtype: DT_FLOAT\n      shape: (-1, 1)\n      name: x:0\n'
         'The given SavedModel SignatureDef contains the following output(s):\n'
-        'outputs[\'y\'] tensor_info:\n'
-        '    dtype: DT_FLOAT\n    shape: (-1, 1)\n    name: y:0\n'
+        '  outputs[\'y\'] tensor_info:\n'
+        '      dtype: DT_FLOAT\n      shape: (-1, 1)\n      name: y:0\n'
         'Method name is: tensorflow/serving/predict')
     self.assertEqual(output, expected_output)
     self.assertEqual(err.getvalue().strip(), '')
@@ -524,6 +525,28 @@ Method name is: tensorflow/serving/predict"""
     y_expected = np.array([[2.5], [3.0]])
     self.assertAllClose(y_expected, y_actual)
 
+  def testScanCommand(self):
+    self.parser = saved_model_cli.create_parser()
+    base_path = test.test_src_dir_path(SAVED_MODEL_PATH)
+    args = self.parser.parse_args(['scan', '--dir', base_path])
+    with captured_output() as (out, _):
+      saved_model_cli.scan(args)
+    output = out.getvalue().strip()
+    self.assertTrue('does not contain blacklisted ops' in output)
+
+  def testScanCommandFoundBlacklistedOp(self):
+    self.parser = saved_model_cli.create_parser()
+    base_path = test.test_src_dir_path(SAVED_MODEL_PATH)
+    args = self.parser.parse_args(
+        ['scan', '--dir', base_path, '--tag_set', 'serve'])
+    op_blacklist = saved_model_cli._OP_BLACKLIST
+    saved_model_cli._OP_BLACKLIST = set(['VariableV2'])
+    with captured_output() as (out, _):
+      saved_model_cli.scan(args)
+    saved_model_cli._OP_BLACKLIST = op_blacklist
+    output = out.getvalue().strip()
+    self.assertTrue('\'VariableV2\'' in output)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/training/adam.py b/tensorflow/python/training/adam.py
index c92f6fc3015960a2b821651231bb94713e0d53dd..006e360389b404a8edd97c9a8bf4b8876c828004 100644
--- a/tensorflow/python/training/adam.py
+++ b/tensorflow/python/training/adam.py
@@ -106,10 +106,10 @@ class AdamOptimizer(optimizer.Optimizer):
     self._updated_lr = None
 
   def _get_beta_accumulators(self):
-    if context.in_graph_mode():
-      graph = ops.get_default_graph()
-    else:
+    if context.executing_eagerly():
       graph = None
+    else:
+      graph = ops.get_default_graph()
     return (self._get_non_slot_variable("beta1_power", graph=graph),
             self._get_non_slot_variable("beta2_power", graph=graph))
 
diff --git a/tensorflow/python/training/adam_test.py b/tensorflow/python/training/adam_test.py
index a521f1299e035424d1c3897a469655db732b0dcd..9be8b6aafefa33977511cde24dd2e87dd6c3b81a 100644
--- a/tensorflow/python/training/adam_test.py
+++ b/tensorflow/python/training/adam_test.py
@@ -184,7 +184,7 @@ class AdamOptimizerTest(test.TestCase):
           # Shouldn't return non-slot variables from other graphs.
           self.assertEqual(0, len(opt.variables()))
 
-        if context.in_graph_mode():
+        if not context.executing_eagerly():
           self.evaluate(variables.global_variables_initializer())
           # Fetch params to validate initial values
           self.assertAllClose([1.0, 2.0], self.evaluate(var0))
@@ -194,7 +194,7 @@ class AdamOptimizerTest(test.TestCase):
 
         # Run 3 steps of Adam
         for t in range(1, 4):
-          if context.in_graph_mode():
+          if not context.executing_eagerly():
             self.evaluate(update)
           elif t > 1:
             opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
@@ -319,6 +319,15 @@ class AdamOptimizerTest(test.TestCase):
         # fails.
         optimizer.apply_gradients([(grads0, var0)])
 
+  def testSlotsUniqueEager(self):
+    with context.eager_mode():
+      v1 = resource_variable_ops.ResourceVariable(1.)
+      v2 = resource_variable_ops.ResourceVariable(1.)
+      opt = adam.AdamOptimizer(1.)
+      opt.minimize(lambda: v1 + v2)
+      # There should be two non-slot variables, plus two unique slot variables
+      # each for v1 and v2 (six variables in total).
+      self.assertEqual(6, len(set(opt.variables())))
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/training/checkpoint_ops.py b/tensorflow/python/training/checkpoint_ops.py
index 7f92d94d2be369709608d36c109863b0ebfb7bbe..a6e9662b7305a00f1fcf03245685e93b756942d3 100644
--- a/tensorflow/python/training/checkpoint_ops.py
+++ b/tensorflow/python/training/checkpoint_ops.py
@@ -149,7 +149,7 @@ def _load_and_remap_matrix(ckpt_path,
   num_rows_present = num_rows_to_load
   if remap_rows:
     row_remapping, num_rows_present = (
-        gen_checkpoint_ops._generate_vocab_remapping(  # pylint: disable=protected-access
+        gen_checkpoint_ops.generate_vocab_remapping(
             new_vocab_file=new_row_vocab_file,
             old_vocab_file=old_row_vocab_file,
             new_vocab_offset=new_row_vocab_offset,
@@ -168,7 +168,7 @@ def _load_and_remap_matrix(ckpt_path,
   num_cols_present = new_col_vocab_size
   if remap_cols:
     col_remapping, num_cols_present = (
-        gen_checkpoint_ops._generate_vocab_remapping(  # pylint: disable=protected-access
+        gen_checkpoint_ops.generate_vocab_remapping(
             new_vocab_file=new_col_vocab_file,
             old_vocab_file=old_col_vocab_file,
             new_vocab_offset=0,  # Offset is unused for cols (no partitioning).
@@ -178,7 +178,7 @@ def _load_and_remap_matrix(ckpt_path,
       num_rows_to_load * new_col_vocab_size -
       num_rows_present * num_cols_present, 1
   ])
-  return_tensor = gen_checkpoint_ops._load_and_remap_matrix(  # pylint: disable=protected-access
+  return_tensor = gen_checkpoint_ops.load_and_remap_matrix(
       ckpt_path=ckpt_path,
       old_tensor_name=old_tensor_name,
       row_remapping=row_remapping,
diff --git a/tensorflow/python/training/checkpoint_utils.py b/tensorflow/python/training/checkpoint_utils.py
index 0af1cdecfa280f757b253abb2f3408bc5c9416f1..e7f88de1d2290a49f3b7bdf47417016d7e7c9cea 100644
--- a/tensorflow/python/training/checkpoint_utils.py
+++ b/tensorflow/python/training/checkpoint_utils.py
@@ -23,6 +23,7 @@ import six
 from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import io_ops
+from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops import variables
@@ -289,10 +290,18 @@ def _set_checkpoint_initializer(variable,
     name: Name of the operation.
   """
   base_type = variable.dtype.base_dtype
-  with ops.colocate_with(variable):
+  # Do not colocate with variable since RestoreV2 op only runs on CPU and
+  # colocation will force variable (and other ops that colocate with variable)
+  # to be on CPU as well. It is okay to place the variable's initializer op on
+  # CPU since it will only be run once at the start.
+  with ops.device(variable.device), ops.device("/cpu:0"):
     restore_op = io_ops.restore_v2(
         ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0]
-    variable._initializer_op = state_ops.assign(variable, restore_op)  # pylint:disable=protected-access
+    if isinstance(variable, resource_variable_ops.ResourceVariable):
+      init_op = variable.assign(restore_op, read_value=False)
+    else:
+      init_op = state_ops.assign(variable, restore_op)
+    variable._initializer_op = init_op  # pylint:disable=protected-access
     restore_op.set_shape(variable.shape)
     variable._initial_value = restore_op  # pylint:disable=protected-access
 
diff --git a/tensorflow/python/training/checkpoint_utils_test.py b/tensorflow/python/training/checkpoint_utils_test.py
index a461b24cbb1acafe60937f64d1cc0d35eb1bfc55..4e08a1c859fbaac75e7cd09ad498d9fea14c6338 100644
--- a/tensorflow/python/training/checkpoint_utils_test.py
+++ b/tensorflow/python/training/checkpoint_utils_test.py
@@ -26,6 +26,7 @@ from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import partitioned_variables
+from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
@@ -157,23 +158,23 @@ class CheckpointsTest(test.TestCase):
             "some_scope", initializer=init_ops.zeros_initializer()):
           my1 = variable_scope.get_variable("my1", [1, 10])
 
-        # At this point, my1.initialized_value() will add ops that reference
-        # the zeros initializer of my1.
-        before = variables.Variable(my1.initialized_value(), name="before")
+        before = my1.initialized_value()
 
         checkpoint_utils.init_from_checkpoint(checkpoint_dir, {"var1": my1})
 
-        # At this point, my1.initialized_value() will add ops that reference
-        # the newly set initializer of my1.
-        after = variables.Variable(my1.initialized_value(), name="after")
+        after = my1.initialized_value()
+
+        self.assertAllEqual(session.run(before), [[0.0] * 10])
+        self.assertAllEqual(session.run(after), v1)
 
         session.run(variables.global_variables_initializer())
+
         self.assertAllEqual(session.run(my1), v1)
         self.assertAllEqual(session.run(my1.initialized_value()), v1)
-        self.assertAllClose(session.run(before), [[0.0] * 10])
+        self.assertAllClose(session.run(before), v1)
         self.assertAllClose(session.run(after), v1)
         with self.assertRaises(AssertionError):
-          self.assertAllClose(session.run(before), session.run(after))
+          self.assertAllClose(v1, [[0.0] * 10])
 
   def testInitWithScopeDoesNotCaptureSuffixes(self):
     checkpoint_dir = self.get_temp_dir()
@@ -206,7 +207,9 @@ class CheckpointsTest(test.TestCase):
 
       checkpoint_utils.init_from_checkpoint(checkpoint_dir,
                                             {"useful_scope/": "useful_scope/"})
-      self.assertEqual(my4._initializer_op.op.inputs[1].device, "/job:ps")
+      # The initializer runs on the same task, but always on CPU.
+      self.assertEqual(my4._initializer_op.op.inputs[1].device,
+                       "/job:ps/device:CPU:0")
 
   def testInitFromRootCheckpoint(self):
     checkpoint_dir = self.get_temp_dir()
@@ -362,6 +365,31 @@ class CheckpointsTest(test.TestCase):
           checkpoint_utils.init_from_checkpoint(checkpoint_dir,
                                                 {"useful_scope": "some_scope/"})
 
+  def testNoAdditionalReadOpsForResourceVariables(self):
+    checkpoint_dir = self.get_temp_dir()
+    with self.test_session() as session:
+      v1, _, _, _ = _create_checkpoints(session, checkpoint_dir)
+
+    # New graph and session.
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as session:
+        my1 = resource_variable_ops.ResourceVariable([[0.0] * 10], name="my1")
+
+        with ops.name_scope("init_from_checkpoint"):
+          checkpoint_utils.init_from_checkpoint(checkpoint_dir, {"var1": my1})
+
+        # Basic sanity checks:
+        session.run(variables.global_variables_initializer())
+        self.assertAllEqual(session.run(my1), v1)
+
+    ops_in_init_from_checkpoint_scope = [
+        op for op in g.get_operations()
+        if (op.name.startswith("init_from_checkpoint/") and
+            not op.name.startswith("init_from_checkpoint/checkpoint_initializer"
+                                  ) and op.type != "AssignVariableOp")
+    ]
+    self.assertEqual(ops_in_init_from_checkpoint_scope, [])
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/training/checkpointable.py b/tensorflow/python/training/checkpointable.py
index 11caa761aec5d631d87a91ec876e0b5032ffdc5b..96e3c4821cd319df0300edce2c722eb72ec752a6 100644
--- a/tensorflow/python/training/checkpointable.py
+++ b/tensorflow/python/training/checkpointable.py
@@ -22,6 +22,7 @@ import collections
 from tensorflow.python.eager import context
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gen_io_ops as io_ops
 from tensorflow.python.util import nest
 
@@ -31,8 +32,8 @@ from tensorflow.python.util import nest
 # creation (avoiding double assignment when executing eagerly).
 VARIABLE_VALUE_KEY = "VARIABLE_VALUE"
 
-_CheckpointableReference = collections.namedtuple(
-    "_CheckpointableReference",
+CheckpointableReference = collections.namedtuple(
+    "CheckpointableReference",
     [
         # The local name for this dependency.
         "name",
@@ -181,13 +182,16 @@ class _CheckpointPosition(object):
       dtype = self._checkpoint.dtype_map[checkpoint_key]
       base_type = dtype.base_dtype
       with ops.init_scope():
-        value, = io_ops.restore_v2(
-            prefix=self._checkpoint.save_path,
-            tensor_names=[checkpoint_key],
-            shape_and_slices=[""],
-            dtypes=[base_type],
-            name="%s_checkpoint_read" % (serialized_tensor.name,))
-        value_tensors[serialized_tensor.name] = value
+        with ops.device("/cpu:0"):
+          # Run the restore itself on the CPU.
+          value, = io_ops.restore_v2(
+              prefix=self._checkpoint.save_path,
+              tensor_names=[checkpoint_key],
+              shape_and_slices=[""],
+              dtypes=[base_type],
+              name="%s_checkpoint_read" % (serialized_tensor.name,))
+        # Copy the value to the current device if necessary.
+        value_tensors[serialized_tensor.name] = array_ops.identity(value)
       return value_tensors
 
   def restore_ops(self):
@@ -204,10 +208,10 @@ class _CheckpointPosition(object):
     # Name saveables based on the name this object had when it was checkpointed.
     named_saveables = {}
     restore_ops = []
-    in_graph_mode = context.in_graph_mode()
+    building_graph = not context.executing_eagerly()
     for serialized_tensor in self.object_proto.attributes:
-      saveable_object = saveables.get(serialized_tensor.name, None)
-      if saveable_object is None:
+      saveable_factory = saveables.get(serialized_tensor.name, None)
+      if saveable_factory is None:
         # Purposefully does not throw an exception if attributes have been added
         # or deleted. Stores unused attributes so an exception can be raised if
         # the user decides to check that everything in the checkpoint was
@@ -215,13 +219,17 @@ class _CheckpointPosition(object):
         self._checkpoint.unused_attributes.setdefault(
             self.checkpointable, []).append(serialized_tensor.name)
         continue
-      if in_graph_mode:
+      if building_graph:
         existing_ops = self._checkpoint.restore_ops_by_name.get(
             serialized_tensor.name, None)
       else:
         existing_ops = None
       if existing_ops is None:
-        named_saveables[serialized_tensor.checkpoint_key] = saveable_object
+        if callable(saveable_factory):
+          saveable = saveable_factory(name=serialized_tensor.checkpoint_key)
+        else:
+          saveable = saveable_factory
+        named_saveables[serialized_tensor.checkpoint_key] = saveable
     if named_saveables:
       validated_saveables = (
           self._checkpoint.builder._ValidateAndSliceInputs(named_saveables))  # pylint: disable=protected-access
@@ -241,7 +249,7 @@ class _CheckpointPosition(object):
             saveable_index:saveable_index + num_specs]
         saveable_index += num_specs
         restore_op = saveable.restore(saveable_tensors, restored_shapes=None)
-        if in_graph_mode:
+        if building_graph:
           assert saveable.name not in self._checkpoint.restore_ops_by_name
           self._checkpoint.restore_ops_by_name[saveable.name] = restore_op
           restore_ops.append(restore_op)
@@ -301,14 +309,17 @@ class CheckpointableBase(object):
 
     Not __init__, since most objects will forget to call it.
     """
-    if hasattr(self, "_checkpoint_dependencies"):
+    if hasattr(self, "_unconditional_checkpoint_dependencies"):
       # __init__ already called. This check means that we don't need
       # Checkpointable.__init__() in the constructor of every TensorFlow object.
       return
-    # A list of _CheckpointableReference objects.
-    self._checkpoint_dependencies = []
+    # A list of CheckpointableReference objects. Some classes implementing
+    # `Checkpointable`, notably `Optimizer`s, may override the
+    # _checkpoint_dependencies property with conditional dependencies
+    # (e.g. based on the current graph when saving).
+    self._unconditional_checkpoint_dependencies = []
     # Maps names -> Checkpointable objects
-    self._dependency_names = {}
+    self._unconditional_dependency_names = {}
     # Restorations for other Checkpointable objects on which this object may
     # eventually depend.
     self._deferred_dependencies = {}  # local name -> _CheckpointPosition list
@@ -320,9 +331,36 @@ class CheckpointableBase(object):
           "initialization code was run.")
     self._update_uid = -1
 
+  @property
+  def _checkpoint_dependencies(self):
+    """All dependencies of this object.
+
+    May be overridden to include conditional dependencies.
+
+    Returns:
+      A list of `CheckpointableReference` objects indicating named
+      `Checkpointable` dependencies which should be saved along with this
+      object.
+    """
+    return self._unconditional_checkpoint_dependencies
+
+  def _lookup_dependency(self, name):
+    """Look up a dependency by name.
+
+    May be overridden to include conditional dependencies.
+
+    Args:
+      name: The local name of the dependency.
+    Returns:
+      A `Checkpointable` object, or `None` if no dependency by this name was
+      found.
+    """
+    return self._unconditional_dependency_names.get(name, None)
+
   def _add_variable_with_custom_getter(
       self, name, shape=None, dtype=dtypes.float32,
-      initializer=None, getter=None, **kwargs_for_getter):
+      initializer=None, getter=None, overwrite=False,
+      **kwargs_for_getter):
     """Restore-on-create for a variable be saved with this `Checkpointable`.
 
     If the user has requested that this object or another `Checkpointable` which
@@ -334,12 +372,11 @@ class CheckpointableBase(object):
       name: A name for the variable. Must be unique within this object.
       shape: The shape of the variable.
       dtype: The data type of the variable.
-
       initializer: The initializer to use. Ignored if there is a deferred
         restoration left over from a call to
         `_restore_from_checkpoint_position`.
-
       getter: The getter to wrap which actually fetches the variable.
+      overwrite: If True, disables unique name and type checks.
       **kwargs_for_getter: Passed to the getter.
 
     Returns:
@@ -349,13 +386,13 @@ class CheckpointableBase(object):
       ValueError: If the variable name is not unique.
     """
     self._maybe_initialize_checkpointable()
-    if name in self._dependency_names:
+    if not overwrite and self._lookup_dependency(name) is not None:
       raise ValueError(
           ("A variable named '%s' already exists in this Checkpointable, but "
            "Checkpointable._add_variable called to create another with "
            "that name. Variable names must be unique within a Checkpointable "
            "object.") % (name,))
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       # If this is a variable with a single Tensor stored in the checkpoint, we
       # can set that value as an initializer rather than initializing and then
       # assigning (when executing eagerly). This call returns None if there is
@@ -385,7 +422,13 @@ class CheckpointableBase(object):
     # assign again. It will add this variable to our dependencies, and if there
     # is a non-trivial restoration queued, it will handle that. This also
     # handles slot variables.
-    return self._track_checkpointable(new_variable, name=name)
+    if not overwrite or isinstance(new_variable, CheckpointableBase):
+      return self._track_checkpointable(new_variable, name=name,
+                                        overwrite=overwrite)
+    else:
+      # TODO(allenl): Some variable types are not yet supported. Remove this
+      # fallback once all get_variable() return types are Checkpointable.
+      return new_variable
 
   def _preload_simple_restoration(self, name, shape):
     """Return a dependency's value for restore-on-create.
@@ -455,9 +498,10 @@ class CheckpointableBase(object):
       raise TypeError(
           ("Checkpointable._track_checkpointable() passed type %s, not a "
            "Checkpointable.") % (type(checkpointable),))
-    new_reference = _CheckpointableReference(name=name, ref=checkpointable)
-    if (name in self._dependency_names
-        and self._dependency_names[name] is not checkpointable):
+    new_reference = CheckpointableReference(name=name, ref=checkpointable)
+    current_object = self._lookup_dependency(name)
+    if (current_object is not None
+        and current_object is not checkpointable):
       if not overwrite:
         raise ValueError(
             ("Called Checkpointable._track_checkpointable() with name='%s', "
@@ -465,19 +509,47 @@ class CheckpointableBase(object):
              "dependency. Names must be unique (or overwrite=True).") % (name,))
       # This is a weird thing to do, but we're not going to stop people from
       # using __setattr__.
-      for index, (old_name, _) in enumerate(self._checkpoint_dependencies):
+      for index, (old_name, _) in enumerate(
+          self._unconditional_checkpoint_dependencies):
         if name == old_name:
-          self._checkpoint_dependencies[index] = new_reference
+          self._unconditional_checkpoint_dependencies[index] = new_reference
     else:
-      self._checkpoint_dependencies.append(new_reference)
+      self._unconditional_checkpoint_dependencies.append(new_reference)
 
-    self._dependency_names[name] = checkpointable
-    deferred_dependency_list = self._deferred_dependencies.pop(name, None)
-    if deferred_dependency_list is not None:
-      for checkpoint_position in deferred_dependency_list:
-        checkpoint_position.restore(checkpointable=checkpointable)
+    self._unconditional_dependency_names[name] = checkpointable
+    self._handle_deferred_dependencies(name=name, checkpointable=checkpointable)
     return checkpointable
 
+  def _handle_deferred_dependencies(self, name, checkpointable):
+    """Pop and load any deferred checkpoint restores into `checkpointable`.
+
+    This method does not add a new dependency on `checkpointable`, but it does
+    check if any outstanding/deferred dependencies have been queued waiting for
+    this dependency to be added (matched based on `name`). If so,
+    `checkpointable` and its dependencies are restored. The restorations are
+    considered fulfilled and so are deleted.
+
+    `_track_checkpointable` is more appropriate for adding a
+    normal/unconditional dependency, and includes handling for deferred
+    restorations. This method allows objects such as `Optimizer` to use the same
+    restoration logic while managing conditional dependencies themselves, by
+    overriding `_checkpoint_dependencies` and `_lookup_dependency` to change the
+    object's dependencies based on the context it is saved/restored in (a single
+    optimizer instance can have state associated with multiple graphs).
+
+    Args:
+      name: The name of the dependency within this object (`self`), used to
+        match `checkpointable` with values saved in a checkpoint.
+      checkpointable: The Checkpointable object to restore (inheriting from
+        `CheckpointableBase`).
+    """
+    deferred_dependencies_list = self._deferred_dependencies.pop(name, ())
+    for checkpoint_position in sorted(
+        deferred_dependencies_list,
+        key=lambda restore: restore.checkpoint.restore_uid,
+        reverse=True):
+      checkpoint_position.restore(checkpointable)
+
   def _restore_from_checkpoint_position(self, checkpoint_position):
     """Restore this object and its dependencies (may be deferred)."""
     # Attempt a breadth-first traversal, since presumably the user has more
@@ -513,7 +585,7 @@ class CheckpointableBase(object):
       child_position = _CheckpointPosition(
           checkpoint=checkpoint,
           proto_id=child.node_id)
-      local_object = self._dependency_names.get(child.local_name, None)
+      local_object = self._lookup_dependency(child.local_name)
       if local_object is None:
         # We don't yet have a dependency registered with this name. Save it
         # in case we do.
@@ -532,14 +604,30 @@ class CheckpointableBase(object):
     """Returns a dictionary of values to checkpoint with this object.
 
     Keys in the returned dictionary are local to this object and in a separate
-    namespace from dependencies. Values may either be `SaveableObject`s or
-    variables easily converted to `SaveableObject`s (as in `tf.train.Saver`'s
+    namespace from dependencies. Values may either be `SaveableObject` factories
+    or variables easily converted to `SaveableObject`s (as in `tf.train.Saver`'s
     `var_list` constructor argument).
 
+    `SaveableObjects` have a name set, which Checkpointable needs to generate
+    itself. So rather than returning `SaveableObjects` directly, this method
+    should return a dictionary of callables which take `name` arguments and
+    return `SaveableObjects` with that name.
+
+    If this object may also be passed to the global-name-based `tf.train.Saver`,
+    the returned callables should have a default value for their name argument
+    (i.e. be callable with no arguments).
+
     Returned values must be saved only by this object; if any value may be
     shared, it should instead be a dependency. For example, variable objects
     save their own values with the key `VARIABLE_VALUE_KEY`, but objects which
     reference variables simply add a dependency.
+
+    Returns:
+      The dictionary mapping attribute names to `SaveableObject` factories
+      described above. For example:
+      {VARIABLE_VALUE_KEY:
+       lambda name="global_name_for_this_object":
+       SaveableObject(name=name, ...)}
     """
     return {}
 
diff --git a/tensorflow/python/training/device_setter.py b/tensorflow/python/training/device_setter.py
index 689088bb41edfd94a1d483ed2b5f7447e9e060e7..d31c375b4ce48dcb9bc2918514707636a647c675 100644
--- a/tensorflow/python/training/device_setter.py
+++ b/tensorflow/python/training/device_setter.py
@@ -25,6 +25,15 @@ from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import server_lib
 from tensorflow.python.util.tf_export import tf_export
 
+# This is a tuple of PS ops used by tf.estimator.Estimator which should work in
+# almost all cases.
+STANDARD_PS_OPS = (
+    "Variable", "VariableV2", "AutoReloadVariable", "MutableHashTable",
+    "MutableHashTableV2", "MutableHashTableOfTensors",
+    "MutableHashTableOfTensorsV2", "MutableDenseHashTable",
+    "MutableDenseHashTableV2", "VarHandleOp"
+)
+
 
 class _RoundRobinStrategy(object):
   """Returns the next ps task index for placement in round-robin order.
@@ -170,8 +179,7 @@ def replica_device_setter(ps_tasks=0, ps_device="/job:ps",
       than overriding them.
     cluster: `ClusterDef` proto or `ClusterSpec`.
     ps_ops: List of strings representing `Operation` types that need to be
-      placed on `ps` devices.  If `None`, defaults to
-      `["Variable", "VariableV2", "VarHandleOp"]`.
+      placed on `ps` devices.  If `None`, defaults to `STANDARD_PS_OPS`.
     ps_strategy: A callable invoked for every ps `Operation` (i.e. matched by
       `ps_ops`), that takes the `Operation` and returns the ps task index to
       use.  If `None`, defaults to a round-robin strategy across all `ps`
@@ -201,7 +209,7 @@ def replica_device_setter(ps_tasks=0, ps_device="/job:ps",
   if ps_ops is None:
     # TODO(sherrym): Variables in the LOCAL_VARIABLES collection should not be
     # placed in the parameter server.
-    ps_ops = ["Variable", "VariableV2", "VarHandleOp"]
+    ps_ops = list(STANDARD_PS_OPS)
 
   if not merge_devices:
     logging.warning(
diff --git a/tensorflow/python/training/ftrl.py b/tensorflow/python/training/ftrl.py
index 9d02e694db15637126f37ee5575638908b351def..4fa081fab72df62107cf4957d4ff68240ced9ee0 100644
--- a/tensorflow/python/training/ftrl.py
+++ b/tensorflow/python/training/ftrl.py
@@ -53,7 +53,7 @@ class FtrlOptimizer(optimizer.Optimizer):
       learning_rate: A float value or a constant float `Tensor`.
       learning_rate_power: A float value, must be less or equal to zero.
       initial_accumulator_value: The starting value for accumulators.
-        Only positive values are allowed.
+        Only zero or positive values are allowed.
       l1_regularization_strength: A float value, must be greater than or
         equal to zero.
       l2_regularization_strength: A float value, must be greater than or
@@ -84,9 +84,10 @@ class FtrlOptimizer(optimizer.Optimizer):
     """
     super(FtrlOptimizer, self).__init__(use_locking, name)
 
-    if initial_accumulator_value <= 0.0:
-      raise ValueError("initial_accumulator_value %f needs to be positive" %
-                       initial_accumulator_value)
+    if initial_accumulator_value < 0.0:
+      raise ValueError(
+          "initial_accumulator_value %f needs to be positive or zero" %
+          initial_accumulator_value)
     if learning_rate_power > 0.0:
       raise ValueError("learning_rate_power %f needs to be negative or zero" %
                        learning_rate_power)
diff --git a/tensorflow/python/training/gradient_descent.py b/tensorflow/python/training/gradient_descent.py
index 380e14e02497fbe3681d6bae03fe9c636c5d13aa..6caf29d83af546f821314179e17f7bf1a693ff1a 100644
--- a/tensorflow/python/training/gradient_descent.py
+++ b/tensorflow/python/training/gradient_descent.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.eager import context
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
@@ -43,6 +44,7 @@ class GradientDescentOptimizer(optimizer.Optimizer):
     """
     super(GradientDescentOptimizer, self).__init__(use_locking, name)
     self._learning_rate = learning_rate
+    self._learning_rate_tensor = None
 
   def _apply_dense(self, grad, var):
     return training_ops.apply_gradient_descent(
@@ -69,5 +71,6 @@ class GradientDescentOptimizer(optimizer.Optimizer):
     return var.scatter_sub(delta, use_locking=self._use_locking)
 
   def _prepare(self):
-    self._learning_rate_tensor = ops.convert_to_tensor(self._learning_rate,
-                                                       name="learning_rate")
+    if not context.executing_eagerly() or self._learning_rate_tensor is None:
+      self._learning_rate_tensor = ops.convert_to_tensor(self._learning_rate,
+                                                         name="learning_rate")
diff --git a/tensorflow/python/training/input.py b/tensorflow/python/training/input.py
index bd9985a7c5c181c0431e0c0a91186bc36b11c787..44f00a96deff64012705c4c81b185a9c4fac2295 100644
--- a/tensorflow/python/training/input.py
+++ b/tensorflow/python/training/input.py
@@ -159,7 +159,7 @@ def input_producer(input_tensor,
   enabled. Please use the `tf.data` API to ingest data under eager execution.
   @end_compatibility
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError(
         "Input pipelines based on Queues are not supported when eager execution"
         " is enabled. Please use tf.data to ingest data into your model"
@@ -737,7 +737,7 @@ def _batch(tensors, batch_size, keep_input, num_threads=1, capacity=32,
            allow_smaller_final_batch=False, shared_name=None,
            name=None):
   """Helper function for `batch` and `maybe_batch`."""
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise ValueError(
         "Input pipelines based on Queues are not supported when eager execution"
         " is enabled. Please use tf.data to ingest data into your model"
@@ -775,7 +775,7 @@ def _batch_join(tensors_list, batch_size, keep_input, capacity=32,
                 enqueue_many=False, shapes=None, dynamic_pad=False,
                 allow_smaller_final_batch=False, shared_name=None, name=None):
   """Helper function for `batch_join` and `maybe_batch_join`."""
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise ValueError(
         "Input pipelines based on Queues are not supported when eager execution"
         " is enabled. Please use tf.data to ingest data into your model"
@@ -810,7 +810,7 @@ def _shuffle_batch(tensors, batch_size, capacity, min_after_dequeue,
                    shapes=None, allow_smaller_final_batch=False,
                    shared_name=None, name=None):
   """Helper function for `shuffle_batch` and `maybe_shuffle_batch`."""
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise ValueError(
         "Input pipelines based on Queues are not supported when eager execution"
         " is enabled. Please use tf.data to ingest data into your model"
@@ -855,7 +855,7 @@ def _shuffle_batch_join(tensors_list, batch_size, capacity,
                         allow_smaller_final_batch=False, shared_name=None,
                         name=None):
   """Helper function for `shuffle_batch_join` and `maybe_shuffle_batch_join`."""
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise ValueError(
         "Input pipelines based on Queues are not supported when eager execution"
         " is enabled. Please use tf.data to ingest data into your model"
diff --git a/tensorflow/python/training/learning_rate_decay_test.py b/tensorflow/python/training/learning_rate_decay_test.py
index 1ce8c156a0b126f680bad62267f90e31a23febed..60306e4f1239a759ea1f68492a1211d5f0858997 100644
--- a/tensorflow/python/training/learning_rate_decay_test.py
+++ b/tensorflow/python/training/learning_rate_decay_test.py
@@ -43,8 +43,8 @@ class LRDecayTest(test_util.TensorFlowTestCase):
 
   def testStaircase(self):
     with self.test_session():
-      step = gen_state_ops._variable(shape=[], dtype=dtypes.int32,
-                                     name="step", container="", shared_name="")
+      step = gen_state_ops.variable(shape=[], dtype=dtypes.int32,
+                                    name="step", container="", shared_name="")
       assign_100 = state_ops.assign(step, 100)
       assign_1 = state_ops.assign(step, 1)
       assign_2 = state_ops.assign(step, 2)
@@ -113,7 +113,7 @@ class LRDecayTest(test_util.TensorFlowTestCase):
       learning_rate_decay.piecewise_constant(x, boundaries, values)
 
     # Test that ref types are valid.
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       x = variables.Variable(0.0)
       x_ref = x.op.outputs[0]   # float32_ref tensor should be accepted
       boundaries, values = [1.0, 2.0], [1, 2, 3]
@@ -264,8 +264,8 @@ class ExponentialDecayTest(test_util.TensorFlowTestCase):
     initial_lr = 0.1
     k = 10
     decay_rate = 0.96
-    step = gen_state_ops._variable(shape=[], dtype=dtypes.int32,
-                                   name="step", container="", shared_name="")
+    step = gen_state_ops.variable(
+        shape=[], dtype=dtypes.int32, name="step", container="", shared_name="")
     assign_step = state_ops.assign(step, 0)
     increment_step = state_ops.assign_add(step, 1)
     decayed_lr = learning_rate_decay.natural_exp_decay(initial_lr, step,
@@ -281,8 +281,8 @@ class ExponentialDecayTest(test_util.TensorFlowTestCase):
     initial_lr = 0.1
     k = 10
     decay_rate = 0.96
-    step = gen_state_ops._variable(shape=[], dtype=dtypes.int32,
-                                   name="step", container="", shared_name="")
+    step = gen_state_ops.variable(
+        shape=[], dtype=dtypes.int32, name="step", container="", shared_name="")
     assign_step = state_ops.assign(step, 0)
     increment_step = state_ops.assign_add(step, 1)
     decayed_lr = learning_rate_decay.natural_exp_decay(initial_lr,
@@ -304,8 +304,8 @@ class InverseDecayTest(test_util.TensorFlowTestCase):
     initial_lr = 0.1
     k = 10
     decay_rate = 0.96
-    step = gen_state_ops._variable(shape=[], dtype=dtypes.int32,
-                                   name="step", container="", shared_name="")
+    step = gen_state_ops.variable(
+        shape=[], dtype=dtypes.int32, name="step", container="", shared_name="")
     assign_step = state_ops.assign(step, 0)
     increment_step = state_ops.assign_add(step, 1)
     decayed_lr = learning_rate_decay.inverse_time_decay(initial_lr,
@@ -323,8 +323,8 @@ class InverseDecayTest(test_util.TensorFlowTestCase):
     initial_lr = 0.1
     k = 10
     decay_rate = 0.96
-    step = gen_state_ops._variable(shape=[], dtype=dtypes.int32,
-                                   name="step", container="", shared_name="")
+    step = gen_state_ops.variable(
+        shape=[], dtype=dtypes.int32, name="step", container="", shared_name="")
     assign_step = state_ops.assign(step, 0)
     increment_step = state_ops.assign_add(step, 1)
     decayed_lr = learning_rate_decay.inverse_time_decay(initial_lr,
diff --git a/tensorflow/python/training/momentum_test.py b/tensorflow/python/training/momentum_test.py
index cda421cef837fa6ab25898208a8dc94d70561048..297a8bbde5447cff9465be36c0bb71f2490c60fc 100644
--- a/tensorflow/python/training/momentum_test.py
+++ b/tensorflow/python/training/momentum_test.py
@@ -66,7 +66,7 @@ class MomentumOptimizerTest(test.TestCase):
       mom_update = mom_opt.apply_gradients(
           zip([grads0, grads1], [var0, var1]))
 
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         self.evaluate(variables.global_variables_initializer())
         # Fetch params to validate initial values
         self.assertAllClose([1.0, 2.0], self.evaluate(var0))
@@ -78,13 +78,13 @@ class MomentumOptimizerTest(test.TestCase):
       self.assertEquals(slot0.get_shape(), var0.get_shape())
       slot1 = mom_opt.get_slot(var1, "momentum")
       self.assertEquals(slot1.get_shape(), var1.get_shape())
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         self.assertFalse(slot0 in variables.trainable_variables())
         self.assertFalse(slot1 in variables.trainable_variables())
 
       # Step 1: the momentum accumulators where 0. So we should see a normal
       # update: v -= grad * learning_rate
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         self.evaluate(mom_update)
       # Check that the momentum accumulators have been updated.
       self.assertAllCloseAccordingToType(np.array([0.1, 0.1]),
@@ -99,10 +99,10 @@ class MomentumOptimizerTest(test.TestCase):
           np.array([3.0 - (0.01 * 2.0), 4.0 - (0.01 * 2.0)]),
           self.evaluate(var1))
       # Step 2: the momentum accumulators contain the previous update.
-      if context.in_graph_mode():
-        self.evaluate(mom_update)
-      else:
+      if context.executing_eagerly():
         mom_opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      else:
+        self.evaluate(mom_update)
       # Check that the momentum accumulators have been updated.
       self.assertAllCloseAccordingToType(
           np.array([(0.9 * 0.1 + 0.1), (0.9 * 0.1 + 0.1)]),
@@ -142,7 +142,7 @@ class MomentumOptimizerTest(test.TestCase):
           [1.0, 2.0], dtype=dtypes.float32, name="var0")
       var1 = resource_variable_ops.ResourceVariable(
           [3.0, 4.0], dtype=dtypes.float32, name="var1")
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         loss = lambda: math_ops.reduce_sum(var0 + var1)
       else:
         loss = math_ops.reduce_sum(var0 + var1)
@@ -157,7 +157,7 @@ class MomentumOptimizerTest(test.TestCase):
           [1.0, 2.0], dtype=dtypes.float32, name="var2")
       var3 = resource_variable_ops.ResourceVariable(
           [3.0, 4.0], dtype=dtypes.float32, name="var3")
-      if context.in_eager_mode():
+      if context.executing_eagerly():
         loss = lambda: math_ops.reduce_sum(var2 + var3)
       else:
         loss = math_ops.reduce_sum(var2 + var3)
diff --git a/tensorflow/python/training/moving_averages_test.py b/tensorflow/python/training/moving_averages_test.py
index 6efdeb286657e761a4c46634b9408121765a447b..6717811bbb0f05723a5ad0fbcbfba75249d0d43b 100644
--- a/tensorflow/python/training/moving_averages_test.py
+++ b/tensorflow/python/training/moving_averages_test.py
@@ -376,7 +376,7 @@ class ExponentialMovingAverageTest(test.TestCase):
     with ops.device("/job:dev_v0"):
       v0 = variables.Variable(10.0, name="v0")
     with ops.device("/job:dev_v1"):
-      v1 = gen_state_ops._variable(
+      v1 = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="v1",
diff --git a/tensorflow/python/training/optimizer.py b/tensorflow/python/training/optimizer.py
index 454cc3add5c8a5b39385a4a2b48ebe3c5ef2336f..bf79714f9682e60b97788b8b470821cfe9290886 100644
--- a/tensorflow/python/training/optimizer.py
+++ b/tensorflow/python/training/optimizer.py
@@ -40,19 +40,6 @@ from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
 
 
-def _get_variable_for(v):
-  """Returns the ResourceVariable responsible for v, or v if not necessary."""
-  if context.in_eager_mode():
-    return v
-  if v.op.type == "VarHandleOp":
-    for var in variables.trainable_variables():
-      if (isinstance(var, resource_variable_ops.ResourceVariable)
-          and var.handle.op is v.op):
-        return var
-    raise ValueError("Got %s but could not locate source variable." % (str(v)))
-  return v
-
-
 def _deduplicate_indexed_slices(values, indices):
   """Sums `values` associated with any non-unique `indices`.
 
@@ -73,8 +60,8 @@ def _deduplicate_indexed_slices(values, indices):
 
 
 def _var_key(var):
-  if context.in_eager_mode():
-    return var._shared_name  # pylint: disable=protected-access
+  if context.executing_eagerly():
+    return var._unique_id  # pylint: disable=protected-access
   return (var.op.graph, var.op.name)
 
 
@@ -199,11 +186,15 @@ class _TensorProcessor(_OptimizableVariable):
 
 def _get_processor(v):
   """The processor of v."""
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     if isinstance(v, ops.Tensor):
       return _TensorProcessor(v)
     else:
       return _DenseResourceVariableProcessor(v)
+  if isinstance(
+      v, resource_variable_ops.ResourceVariable) and not v._in_graph_mode:  # pylint: disable=protected-access
+    # True if and only if `v` was initialized eagerly.
+    return _DenseResourceVariableProcessor(v)
   if v.op.type == "VarHandleOp":
     return _DenseResourceVariableProcessor(v)
   if isinstance(v, variables.Variable):
@@ -216,7 +207,11 @@ def _get_processor(v):
 
 
 @tf_export("train.Optimizer")
-class Optimizer(checkpointable.Checkpointable):
+class Optimizer(
+    # Optimizers inherit from CheckpointableBase rather than Checkpointable
+    # since they do most of their dependency management themselves (slot
+    # variables are special-cased, and non-slot variables are keyed to graphs).
+    checkpointable.CheckpointableBase):
   """Base class for optimizers.
 
   This class defines the API to add Ops to train a model.  You never use this
@@ -456,7 +451,7 @@ class Optimizer(checkpointable.Checkpointable):
         var_list = tape.watched_variables()
       grads = tape.gradient(loss_value, var_list, grad_loss)
       return list(zip(grads, var_list))
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError(
           "`loss` passed to Optimizer.compute_gradients should "
           "be a function when eager execution is enabled.")
@@ -545,7 +540,7 @@ class Optimizer(checkpointable.Checkpointable):
       raise ValueError("No gradients provided for any variable: %s." %
                        ([str(v) for _, _, v in converted_grads_and_vars],))
     with ops.init_scope():
-      self._create_slots([_get_variable_for(v) for v in var_list])
+      self._create_slots(var_list)
     update_ops = []
     with ops.name_scope(name, self._name) as name:
       self._prepare()
@@ -555,7 +550,12 @@ class Optimizer(checkpointable.Checkpointable):
         # We colocate all ops created in _apply_dense or _apply_sparse
         # on the same device as the variable.
         # TODO(apassos): figure out how to get the variable name here.
-        scope_name = var.op.name if context.in_graph_mode() else ""
+        if context.executing_eagerly() or isinstance(
+            var,
+            resource_variable_ops.ResourceVariable) and not var._in_graph_mode:  # pylint: disable=protected-access
+          scope_name = ""
+        else:
+          scope_name = var.op.name
         with ops.name_scope("update_" + scope_name), ops.colocate_with(var):
           update_ops.append(processor.update_op(self, grad))
       if global_step is None:
@@ -573,7 +573,7 @@ class Optimizer(checkpointable.Checkpointable):
             else:
               apply_updates = state_ops.assign_add(global_step, 1, name=name)
 
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         if isinstance(apply_updates, ops.Tensor):
           apply_updates = apply_updates.op
         train_op = ops.get_collection_ref(ops.GraphKeys.TRAIN_OP)
@@ -623,7 +623,7 @@ class Optimizer(checkpointable.Checkpointable):
     Returns:
       A list of variables.
     """
-    executing_eagerly = context.in_eager_mode()
+    executing_eagerly = context.executing_eagerly()
     current_graph = ops.get_default_graph()
 
     def _from_current_graph(variable):
@@ -645,20 +645,54 @@ class Optimizer(checkpointable.Checkpointable):
 
   def _create_non_slot_variable(self, initial_value, name, colocate_with):
     """Add an extra variable, not associated with a slot."""
-    if context.in_graph_mode():
-      graph = colocate_with.graph
-    else:
-      graph = None
+    eager = context.executing_eagerly()
+    graph = None if eager else colocate_with.graph
 
     key = (name, graph)
     v = self._non_slot_dict.get(key, None)
     if v is None:
+      self._maybe_initialize_checkpointable()
       with ops.colocate_with(colocate_with):
+        if eager:
+          restored_initial_value = self._preload_simple_restoration(
+              name=name, shape=None)
+          if restored_initial_value is not None:
+            initial_value = restored_initial_value
         v = variable_scope.variable(initial_value, name=name, trainable=False)
+        # Restore this variable by name if necessary, but don't add a
+        # Checkpointable dependency. Optimizers return the current graph's
+        # non-slot variables from _checkpoint_dependencies explicitly rather
+        # than unconditionally adding dependencies (since there may be multiple
+        # non-slot variables with the same name in different graphs, trying to
+        # save all of them would result in errors).
+        self._handle_deferred_dependencies(name=name, checkpointable=v)
       self._non_slot_dict[key] = v
 
     return v
 
+  @property
+  def _checkpoint_dependencies(self):
+    """From Checkpointable. Gather graph-specific non-slot variables to save."""
+    current_graph_non_slot_variables = []
+    current_graph_key = ops.get_default_graph()._graph_key  # pylint: disable=protected-access
+    for (name, _), variable_object in sorted(self._non_slot_dict.items(),
+                                             # Avoid comparing graphs
+                                             key=lambda item: item[0][0]):
+      if variable_object._graph_key == current_graph_key:  # pylint: disable=protected-access
+        current_graph_non_slot_variables.append(
+            checkpointable.CheckpointableReference(
+                name=name, ref=variable_object))
+    return (super(Optimizer, self)._checkpoint_dependencies
+            + current_graph_non_slot_variables)
+
+  def _lookup_dependency(self, name):
+    """From Checkpointable. Find a non-slot variable in the current graph."""
+    unconditional = super(Optimizer, self)._lookup_dependency(name)
+    if unconditional is not None:
+      return unconditional
+    graph = None if context.executing_eagerly() else ops.get_default_graph()
+    return self._get_non_slot_variable(name, graph=graph)
+
   def _get_non_slot_variable(self, name, graph=None):
     return self._non_slot_dict.get((name, graph), None)
 
@@ -990,9 +1024,8 @@ class Optimizer(checkpointable.Checkpointable):
     named_slots = self._slot_dict(slot_name)
     variable_key = _var_key(variable)
     slot_variable = named_slots.get(variable_key, None)
-    if (slot_variable is None
-        and context.in_eager_mode()
-        and slot_variable_position.is_simple_variable()):
+    if (slot_variable is None and context.executing_eagerly() and
+        slot_variable_position.is_simple_variable()):
       initializer = checkpointable.CheckpointInitialValue(
           checkpoint_position=slot_variable_position)
       slot_variable = self._get_or_make_slot(
diff --git a/tensorflow/python/training/quantize_training.i b/tensorflow/python/training/quantize_training.i
index 17ffcd6e0758c9c1bc8bab864b6b7a2a18bc9cbf..fb5e47efa0259d02df3ccf2e9b1430e027f8fcfb 100644
--- a/tensorflow/python/training/quantize_training.i
+++ b/tensorflow/python/training/quantize_training.i
@@ -56,6 +56,11 @@ PyObject* DoQuantizeTrainingOnGraphDefHelper(
 
 %insert("python") %{
 def do_quantize_training_on_graphdef(input_graph, num_bits):
+  """A general quantization scheme is being developed in @{tf.contrib.quantize}.
+
+  Consider using that instead, though since it is in the tf.contrib namespace,
+  it is not subject to backward compatibility guarantees.
+  """
   from tensorflow.core.framework.graph_pb2 import GraphDef
   from tensorflow.python.framework import errors
   with errors.raise_exception_on_not_ok_status() as status:
diff --git a/tensorflow/python/training/queue_runner_impl.py b/tensorflow/python/training/queue_runner_impl.py
index 07afba79abf4d636c9ec2d53bcf2641594a35733..d38c5499c73e1217effbc907077236cb6c8e0ae8 100644
--- a/tensorflow/python/training/queue_runner_impl.py
+++ b/tensorflow/python/training/queue_runner_impl.py
@@ -89,7 +89,7 @@ class QueueRunner(object):
         restoring from `queue_runner_def`.
       RuntimeError: If eager execution is enabled.
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError(
           "QueueRunners are not supported when eager execution is enabled. "
           "Instead, please use tf.data to get data into your model.")
@@ -441,7 +441,7 @@ def start_queue_runners(sess=None, coord=None, daemon=True, start=True,
   use the `tf.data` API instead.
   @end_compatibility
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError("Queues are not compatible with eager execution.")
   if sess is None:
     sess = ops.get_default_session()
diff --git a/tensorflow/python/training/saver.py b/tensorflow/python/training/saver.py
index 9afd1e6643f7443bc9bdc5dc2b77ef4402772c38..ba0d0384758f25cc2cc6264b9b73e47f15359721 100644
--- a/tensorflow/python/training/saver.py
+++ b/tensorflow/python/training/saver.py
@@ -311,8 +311,7 @@ class BaseSaverBuilder(object):
     Returns:
       A string tensor.
     """
-    # pylint: disable=protected-access
-    return gen_io_ops._sharded_filename(filename_tensor, shard, num_shards)
+    return gen_io_ops.sharded_filename(filename_tensor, shard, num_shards)
 
   def _AddSaveOps(self, filename_tensor, saveables):
     """Add ops to save variables that are on the same shard.
@@ -421,8 +420,7 @@ class BaseSaverBuilder(object):
         sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
     # Return the sharded name for the save path.
     with ops.control_dependencies([x.op for x in sharded_saves]):
-      # pylint: disable=protected-access
-      return gen_io_ops._sharded_filespec(filename_tensor, num_shards_tensor)
+      return gen_io_ops.sharded_filespec(filename_tensor, num_shards_tensor)
 
   def _AddRestoreOps(self,
                      filename_tensor,
@@ -580,13 +578,31 @@ class BaseSaverBuilder(object):
           names_to_saveables[name] = [var]
       elif (isinstance(var, checkpointable.CheckpointableBase)
             and not isinstance(var, variables.Variable)):
+        checkpointable_saveables = [
+            (factory() if callable(factory) else factory)
+            for factory in var._gather_saveables_for_checkpoint().values()]
         names_to_saveables.update(
-            BaseSaverBuilder.OpListToDict(
-                list(var._gather_saveables_for_checkpoint().values())))
+            BaseSaverBuilder.OpListToDict(checkpointable_saveables))
       else:
-        if context.in_graph_mode():
+        if context.executing_eagerly():
+          if not isinstance(var, resource_variable_ops.ResourceVariable):
+            raise ValueError(
+                "Can only save/restore ResourceVariables when eager execution "
+                "is enabled, type: %s." % type(var))
+          set_var = names_to_saveables.setdefault(var._shared_name, var)
+          if set_var is not var:
+            raise ValueError(
+                ("Two different ResourceVariable objects with the same "
+                 "shared_name '%s' were passed to the Saver. This likely means "
+                 "that they were created in different Graphs or isolation "
+                 "contexts, and may not be checkpointed together.") %
+                (var._shared_name,))
+        else:
           if convert_variable_to_tensor:
-            var = ops.internal_convert_to_tensor(var, as_ref=True)
+            if isinstance(var, resource_variable_ops.ResourceVariable):
+              var = var._graph_element  # pylint: disable=protected-access
+            else:
+              var = ops.internal_convert_to_tensor(var, as_ref=True)
             if not BaseSaverBuilder._IsVariable(var):
               raise TypeError("Variable to save is not a Variable: %s" % var)
           if var.op.type == "ReadVariableOp":
@@ -597,18 +613,6 @@ class BaseSaverBuilder(object):
             raise ValueError("At least two variables have the same name: %s" %
                              name)
           names_to_saveables[name] = var
-        else:
-          if not isinstance(var, resource_variable_ops.ResourceVariable):
-            raise ValueError("Can only save/restore ResourceVariable eager "
-                             "mode is enabled, type: %s." % type(var))
-          set_var = names_to_saveables.setdefault(var._shared_name, var)
-          if set_var is not var:
-            raise ValueError(
-                ("Two different ResourceVariable objects with the same "
-                 "shared_name '%s' were passed to the Saver. This likely means "
-                 "that they were created in different Graphs or isolation "
-                 "contexts, and may not be checkpointed together.") % (
-                     var._shared_name,))
 
       # pylint: enable=protected-access
     return names_to_saveables
@@ -670,13 +674,16 @@ class BaseSaverBuilder(object):
         # pylint: enable=protected-access
       else:
         # A variable or tensor.
-        if context.in_eager_mode():
+        if context.executing_eagerly():
           if not isinstance(op, resource_variable_ops.ResourceVariable):
             raise ValueError("Can only save/restore ResourceVariable eager "
                              "mode is enabled, type: %s." % type(op))
           saveable = BaseSaverBuilder.ResourceVariableSaveable(op, "", name)
         else:
-          variable = ops.internal_convert_to_tensor(op, as_ref=True)
+          if isinstance(op, resource_variable_ops.ResourceVariable):
+            variable = op._graph_element  # pylint: disable=protected-access
+          else:
+            variable = ops.internal_convert_to_tensor(op, as_ref=True)
           if not BaseSaverBuilder._IsVariable(variable):
             raise TypeError("names_to_saveables must be a dict mapping string "
                             "names to Tensors/Variables. Not a variable: %s" %
@@ -774,8 +781,10 @@ class BaseSaverBuilder(object):
                       build_save=True,
                       build_restore=True):
     """build() with option to only perform save and restore."""
-    if context.in_graph_mode() and (not build_save or not build_restore):
-      raise ValueError("Graph mode needs to build save and restore together.")
+    if not context.executing_eagerly() and (not build_save or
+                                            not build_restore):
+      raise ValueError("save and restore operations need to be built together "
+                       "when eager execution is not enabled.")
 
     saveables = self._ValidateAndSliceInputs(names_to_saveables)
     if max_to_keep is None:
@@ -812,22 +821,22 @@ class BaseSaverBuilder(object):
     # such usage model makes sense.
     #
     # assert restore_op.name.endswith("restore_all"), restore_op.name
-    if context.in_graph_mode():
+    if context.executing_eagerly():
+      # Store the tensor values to the tensor_names.
+      save_tensor_name = save_tensor.numpy() if build_save else ""
       return saver_pb2.SaverDef(
-          filename_tensor_name=filename_tensor.name,
-          save_tensor_name=save_tensor.name,
-          restore_op_name=restore_op.name,
+          filename_tensor_name=filename_tensor.numpy(),
+          save_tensor_name=save_tensor_name,
+          restore_op_name="",
           max_to_keep=max_to_keep,
           sharded=sharded,
           keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
           version=self._write_version)
     else:
-      # Store the tensor values to the tensor_names.
-      save_tensor_name = save_tensor.numpy() if build_save else ""
       return saver_pb2.SaverDef(
-          filename_tensor_name=filename_tensor.numpy(),
-          save_tensor_name=save_tensor_name,
-          restore_op_name="",
+          filename_tensor_name=filename_tensor.name,
+          save_tensor_name=save_tensor.name,
+          restore_op_name=restore_op.name,
           max_to_keep=max_to_keep,
           sharded=sharded,
           keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
@@ -1126,8 +1135,9 @@ class Saver(object):
   the proliferation of checkpoint files on disk:
 
   * `max_to_keep` indicates the maximum number of recent checkpoint files to
-    keep.  As new files are created, older files are deleted.  If None or 0,
-    all checkpoint files are kept.  Defaults to 5 (that is, the 5 most recent
+    keep.  As new files are created, older files are deleted.  If None or 0,
+    no checkpoints are deleted from the filesystem but only the last one is
+    kept in the `checkpoint` file.  Defaults to 5 (that is, the 5 most recent
     checkpoint files are kept.)
 
   * `keep_checkpoint_every_n_hours`: In addition to keeping the most recent
@@ -1276,7 +1286,7 @@ class Saver(object):
       raise ValueError(
           "If `var_list` is provided then build cannot be deferred. "
           "Either set defer_build=False or var_list=None.")
-    if context.in_eager_mode() and var_list is None:
+    if context.executing_eagerly() and var_list is None:
       raise RuntimeError(
           "When eager execution is enabled, `var_list` must specify a list or "
           "dict of variables to save")
@@ -1295,7 +1305,12 @@ class Saver(object):
     self._write_version = write_version
     self._pad_step_number = pad_step_number
     self._filename = filename
-    if not defer_build and context.in_graph_mode():
+    self._last_checkpoints = []
+    self._checkpoints_to_be_deleted = []
+    if context.executing_eagerly():
+      self._next_checkpoint_time = (
+          time.time() + self._keep_checkpoint_every_n_hours * 3600)
+    elif not defer_build:
       self.build()
     if self.saver_def:
       self._check_saver_def()
@@ -1303,7 +1318,7 @@ class Saver(object):
     self._save_relative_paths = save_relative_paths
 
   def build(self):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError("Use save/restore instead of build in eager mode.")
     self._build(self._filename, build_save=True, build_restore=True)
 
@@ -1313,12 +1328,12 @@ class Saver(object):
 
   def _build(self, checkpoint_path, build_save, build_restore):
     """Builds saver_def."""
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       if self._is_built:
         return
       self._is_built = True
 
-    if not self.saver_def or context.in_eager_mode():
+    if not self.saver_def or context.executing_eagerly():
       if self._builder is None:
         self._builder = BulkSaverBuilder(self._write_version)
 
@@ -1355,17 +1370,17 @@ class Saver(object):
           self.saver_def.restore_op_name, self._name)
 
     self._check_saver_def()
-    # Updates next checkpoint time.
-    self._next_checkpoint_time = (
-        time.time() + self.saver_def.keep_checkpoint_every_n_hours * 3600)
-    self._last_checkpoints = []
-    self._checkpoints_to_be_deleted = []
+    if not context.executing_eagerly():
+      # Updates next checkpoint time.
+      # Set in __init__ when executing eagerly.
+      self._next_checkpoint_time = (
+          time.time() + self.saver_def.keep_checkpoint_every_n_hours * 3600)
 
   def _check_saver_def(self):
     if not isinstance(self.saver_def, saver_pb2.SaverDef):
       raise ValueError("saver_def must be a saver_pb2.SaverDef: %s" %
                        self.saver_def)
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       if not self.saver_def.save_tensor_name:
         raise ValueError("saver_def must specify the save_tensor_name: %s" %
                          str(self.saver_def))
@@ -1615,7 +1630,7 @@ class Saver(object):
       RuntimeError: If save and restore ops weren't built.
     """
     # pylint: enable=line-too-long
-    if not self._is_built and context.in_graph_mode():
+    if not self._is_built and not context.executing_eagerly():
       raise RuntimeError(
           "`build()` should be called before save if defer_build==True")
     if latest_filename is None:
@@ -1647,21 +1662,21 @@ class Saver(object):
             "'latest_filename' collides with 'save_path': '%s' and '%s'" %
             (latest_filename, save_path))
 
-    if (context.in_graph_mode() and
+    if (not context.executing_eagerly() and
         not isinstance(sess, session.SessionInterface)):
       raise TypeError("'sess' must be a Session; %s" % sess)
 
     save_path_parent = os.path.dirname(save_path)
     if not self._is_empty:
       try:
-        if context.in_graph_mode():
-          model_checkpoint_path = sess.run(
-              self.saver_def.save_tensor_name,
-              {self.saver_def.filename_tensor_name: checkpoint_file})
-        else:
+        if context.executing_eagerly():
           self._build_eager(
               checkpoint_file, build_save=True, build_restore=False)
           model_checkpoint_path = self.saver_def.save_tensor_name
+        else:
+          model_checkpoint_path = sess.run(
+              self.saver_def.save_tensor_name,
+              {self.saver_def.filename_tensor_name: checkpoint_file})
 
         model_checkpoint_path = compat.as_str(model_checkpoint_path)
         if write_state:
@@ -1683,7 +1698,7 @@ class Saver(object):
     if write_meta_graph:
       meta_graph_filename = self._MetaGraphFilename(
           checkpoint_file, meta_graph_suffix=meta_graph_suffix)
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         with sess.graph.as_default():
           self.export_meta_graph(
               meta_graph_filename, strip_default_attrs=strip_default_attrs)
@@ -1756,11 +1771,11 @@ class Saver(object):
     if save_path is None:
       raise ValueError("Can't load save_path when it is None.")
     logging.info("Restoring parameters from %s", save_path)
-    if context.in_graph_mode():
+    if context.executing_eagerly():
+      self._build_eager(save_path, build_save=False, build_restore=True)
+    else:
       sess.run(self.saver_def.restore_op_name,
                {self.saver_def.filename_tensor_name: save_path})
-    else:
-      self._build_eager(save_path, build_save=False, build_restore=True)
 
   @staticmethod
   def _add_collection_def(meta_graph_def, key, export_scope=None):
@@ -1900,7 +1915,7 @@ def import_meta_graph(meta_graph_or_file, clear_devices=False,
   execution is enabled.
   @end_compatibility
   """  # pylint: disable=g-doc-exception
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError("Exporting/importing meta graphs is not supported when "
                        "eager execution is enabled. No graph exists when eager "
                        "execution is enabled.")
@@ -1955,7 +1970,7 @@ def export_meta_graph(filename=None,
     saver_def: `SaverDef` protocol buffer.
     collection_list: List of string keys to collect.
     as_text: If `True`, writes the `MetaGraphDef` as an ASCII proto.
-    graph: The `Graph` to import into. If `None`, use the default graph.
+    graph: The `Graph` to export. If `None`, use the default graph.
     export_scope: Optional `string`. Name scope under which to extract
       the subgraph. The scope name will be striped from the node definitions
       for easy import later into new name scopes. If `None`, the whole graph
@@ -1983,7 +1998,7 @@ def export_meta_graph(filename=None,
   @end_compatibility
   """
   # pylint: enable=line-too-long
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     raise RuntimeError("Exporting/importing meta graphs is not supported when "
                        "eager execution is enabled. No graph exists when eager "
                        "execution is enabled.")
diff --git a/tensorflow/python/training/saver_test.py b/tensorflow/python/training/saver_test.py
index f00f98db00065cd475dd5708c91ea4a0205f24c7..7de778f298e0fb0d62d45abdd280b673f1068213 100644
--- a/tensorflow/python/training/saver_test.py
+++ b/tensorflow/python/training/saver_test.py
@@ -35,6 +35,7 @@ from google.protobuf import text_format
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.core.protobuf import meta_graph_pb2
 from tensorflow.core.protobuf import queue_runner_pb2
+from tensorflow.core.protobuf import rewriter_config_pb2
 from tensorflow.core.protobuf import saver_pb2
 from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.client import session
@@ -53,6 +54,7 @@ from tensorflow.python.lib.io import file_io
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import data_flow_ops
+from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import partitioned_variables
@@ -90,7 +92,7 @@ class SaverTest(test.TestCase):
       v2_init = v2.insert("k1", 30.0)
 
       # Initialize all variables
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         self.evaluate([variables.global_variables_initializer(), v2_init])
 
         # Check that the parameter nodes have been initialized.
@@ -118,7 +120,7 @@ class SaverTest(test.TestCase):
       v2 = saver_test_utils.CheckpointedOp(name="v2")
 
       # Assert that the variables are not initialized.
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         self.assertEqual(
             len(variables.report_uninitialized_variables().eval()), 2)
         self.assertEqual(0, len(v2.keys().eval()))
@@ -141,7 +143,7 @@ class SaverTest(test.TestCase):
       v2_init = v2_2.insert("k1000", 3000.0)
 
       # Check that the parameter nodes have been initialized.
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         init_all_op = [variables.global_variables_initializer(), v2_init]
         self.evaluate(init_all_op)
         # TODO(xpan): Why _mutable_hash_table_v2 doesn't create empty
@@ -250,10 +252,10 @@ class SaverTest(test.TestCase):
     with self.test_session(graph=ops_lib.Graph()) as sess:
       v = resource_variable_ops.ResourceVariable([1], caching_device="/cpu:0",
                                                  name="v")
-      if context.in_graph_mode():
-        self.evaluate(variables.global_variables_initializer())
-      else:
+      if context.executing_eagerly():
         sess = None
+      else:
+        self.evaluate(variables.global_variables_initializer())
       save = saver_module.Saver([v])
       save.save(sess, save_path)
 
@@ -261,6 +263,24 @@ class SaverTest(test.TestCase):
       save2.restore(sess, save_path)
       self.assertEquals(self.evaluate(v), [1])
 
+  def testNoAdditionalOpsAddedBySaverForResourceVariablesOutsideSaveScope(self):
+    with ops_lib.Graph().as_default() as g:
+      v = resource_variable_ops.ResourceVariable(1.0, name="v")
+      with ops_lib.name_scope("saver1"):
+        saver_module.Saver()
+      with ops_lib.name_scope("saver2"):
+        saver_module.Saver({"name": v})
+    ops_in_saver1_scope_but_not_save_scope = [
+        op for op in g.get_operations()
+        if (op.name.startswith("saver1/") and
+            not op.name.startswith("saver1/save/"))]
+    self.assertEqual(ops_in_saver1_scope_but_not_save_scope, [])
+    ops_in_saver2_scope_but_not_save_scope = [
+        op for op in g.get_operations()
+        if (op.name.startswith("saver2/") and
+            not op.name.startswith("saver2/save/"))]
+    self.assertEqual(ops_in_saver2_scope_but_not_save_scope, [])
+
   def testSaveCopyRestoreWithSaveRelativePaths(self):
     """Save, copy checkpoint dir and restore from copied dir.
 
@@ -498,7 +518,7 @@ class SaverTest(test.TestCase):
     with self.test_session(graph=ops_lib.Graph()) as sess:
       var = resource_variable_ops.ResourceVariable(var_value, name=var_name)
       save = saver_module.Saver({var_name: var})
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         self.evaluate(var.initializer)
       val = save.save(sess, save_path)
       self.assertEqual(save_path, val)
@@ -658,11 +678,11 @@ class SaverTest(test.TestCase):
             {
                 var._shared_name: var
             }, pad_step_number=pad_step_number)
-        if context.in_graph_mode():
+        if context.executing_eagerly():
+          sess = None
+        else:
           self.evaluate(var.initializer)
           sess = ops_lib.get_default_session()
-        else:
-          sess = None
         if use_tensor:
           global_step = constant_op.constant(global_step_int)
           val = save.save(sess, save_path, global_step=global_step)
@@ -1040,6 +1060,77 @@ class MaxToKeepTest(test.TestCase):
     self.assertEqual(checkpoint_state.all_model_checkpoint_paths,
                      all_model_checkpoint_paths)
 
+  def testMaxToKeepEager(self):
+    with context.eager_mode():
+      save_dir = self._get_test_dir("max_to_keep_eager")
+
+      v = variable_scope.variable(10.0, name="v")
+      save = saver_module.Saver({"v": v}, max_to_keep=2)
+      self.evaluate(variables.global_variables_initializer())
+      if not context.executing_eagerly():
+        self.assertEqual([], save.last_checkpoints)
+
+      s1 = save.save(None, os.path.join(save_dir, "s1"))
+      self.assertEqual([s1], save.last_checkpoints)
+      self.assertTrue(saver_module.checkpoint_exists(s1))
+      self.assertCheckpointState(
+          model_checkpoint_path=s1,
+          all_model_checkpoint_paths=[s1],
+          save_dir=save_dir)
+
+      s2 = save.save(None, os.path.join(save_dir, "s2"))
+      self.assertEqual([s1, s2], save.last_checkpoints)
+      self.assertTrue(saver_module.checkpoint_exists(s1))
+      self.assertTrue(saver_module.checkpoint_exists(s2))
+      self.assertCheckpointState(
+          model_checkpoint_path=s2,
+          all_model_checkpoint_paths=[s1, s2],
+          save_dir=save_dir)
+
+      s3 = save.save(None, os.path.join(save_dir, "s3"))
+      self.assertEqual([s2, s3], save.last_checkpoints)
+      self.assertFalse(saver_module.checkpoint_exists(s1))
+      self.assertTrue(saver_module.checkpoint_exists(s2))
+      self.assertTrue(saver_module.checkpoint_exists(s3))
+      self.assertCheckpointState(
+          model_checkpoint_path=s3,
+          all_model_checkpoint_paths=[s2, s3],
+          save_dir=save_dir)
+
+      # Create a second helper, identical to the first.
+      save2 = saver_module.Saver({"v": v}, max_to_keep=2)
+      save2.set_last_checkpoints(save.last_checkpoints)
+
+      # Exercise the first helper.
+
+      # Adding s2 again (old s2 is removed first, then new s2 appended)
+      s2 = save.save(None, os.path.join(save_dir, "s2"))
+      self.assertEqual([s3, s2], save.last_checkpoints)
+      self.assertFalse(saver_module.checkpoint_exists(s1))
+      self.assertTrue(saver_module.checkpoint_exists(s3))
+      self.assertTrue(saver_module.checkpoint_exists(s2))
+      self.assertCheckpointState(
+          model_checkpoint_path=s2,
+          all_model_checkpoint_paths=[s3, s2],
+          save_dir=save_dir)
+
+      # Adding s1 (s3 should now be deleted as oldest in list)
+      s1 = save.save(None, os.path.join(save_dir, "s1"))
+      self.assertEqual([s2, s1], save.last_checkpoints)
+      self.assertFalse(saver_module.checkpoint_exists(s3))
+      self.assertTrue(saver_module.checkpoint_exists(s2))
+      self.assertCheckpointState(
+          model_checkpoint_path=s1,
+          all_model_checkpoint_paths=[s2, s1],
+          save_dir=save_dir)
+
+      s2 = save2.save(None, os.path.join(save_dir, "s2"))
+      self.assertEqual([s3, s2], save2.last_checkpoints)
+      # Created by the first helper.
+      self.assertTrue(saver_module.checkpoint_exists(s1))
+      # Deleted by the first helper.
+      self.assertFalse(saver_module.checkpoint_exists(s3))
+
   def testNonSharded(self):
     save_dir = self._get_test_dir("max_to_keep_non_sharded")
 
@@ -1302,15 +1393,16 @@ class KeepCheckpointEveryNHoursTest(test.TestCase):
     gfile.MakeDirs(test_dir)
     return test_dir
 
+  @test_util.run_in_graph_and_eager_modes()
   @test.mock.patch.object(saver_module, "time")
   def testNonSharded(self, mock_time):
     save_dir = self._get_test_dir("keep_checkpoint_every_n_hours")
 
     with self.test_session() as sess:
-      v = variables.Variable([10.0], name="v")
+      v = variable_scope.variable([10.0], name="v")
       # Run the initializer NOW to avoid the 0.5s overhead of the first Run()
       # call, which throws the test timing off in fastbuild mode.
-      variables.global_variables_initializer().run()
+      self.evaluate(variables.global_variables_initializer())
       # Create a saver that will keep the last 2 checkpoints plus one every 0.7
       # seconds.
       start_time = time.time()
@@ -1388,7 +1480,7 @@ class SaveRestoreWithVariableNameMap(test.TestCase):
       v0 = variable_op(-1.0, name="v0")
       v1 = variable_op(-1.0, name="v1")
 
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         with self.assertRaisesOpError("uninitialized"):
           self.evaluate(v0)
         with self.assertRaisesOpError("uninitialized"):
@@ -1398,7 +1490,7 @@ class SaveRestoreWithVariableNameMap(test.TestCase):
       save.restore(sess, save_path)
 
       # Check that the parameter nodes have been restored.
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         self.assertEqual(10.0, self.evaluate(v0))
         self.assertEqual(20.0, self.evaluate(v1))
 
@@ -1408,7 +1500,7 @@ class SaveRestoreWithVariableNameMap(test.TestCase):
       v0 = variable_op(-1.0, name="restore_prefix/v0")
       v1 = variable_op(-1.0, name="restore_prefix/v1")
 
-      if context.in_graph_mode():
+      if not context.executing_eagerly():
         with self.assertRaisesOpError("uninitialized"):
           self.evaluate(v0)
         with self.assertRaisesOpError("uninitialized"):
@@ -2040,6 +2132,113 @@ class MetaGraphTest(test.TestCase):
     self._testGraphExtensionRestore(test_dir)
     self._testRestoreFromTrainGraphWithControlContext(test_dir)
 
+  def _testGradientSerDes(self, graph_fn):
+    """Tests that gradients can be computed after exporting and importing.
+
+    Builds a graph, exports it, and verifies that it can be imported and the
+    gradient can be built and run correctly.
+
+    Args:
+      graph_fn: takes a single float Tensor argument as input, outputs a single
+        Tensor
+    """
+    test_dir = self._get_test_dir("nested_control_flow")
+    filename = os.path.join(test_dir, "metafile")
+    saver_ckpt = os.path.join(test_dir, "saver.ckpt")
+
+    # Build the graph under test using `graph_fn`.
+    with ops_lib.Graph().as_default():
+      var = variables.Variable(0.0)
+      var_name = var.name
+      output = graph_fn(var)
+      output_name = output.name
+      init_op = variables.global_variables_initializer()
+
+      # Generate a MetaGraphDef containing the graph built by `graph_fn`.
+      with session.Session() as sess:
+        sess.run(init_op)
+        sess.run(output)
+        saver = saver_module.Saver()
+        saver.save(sess, saver_ckpt)
+        saver.export_meta_graph(filename)
+
+      # Build and run the gradients of the output. We use this below to
+      # verify that the gradients are correct with an imported MetaGraphDef.
+      grad = gradients_impl.gradients([output], [var])
+      # Turn off constant folding to avoid breaking testNestedControlFlowSerDes.
+      # It appears that a missing control dependency in the gradient graph
+      # causes the fetch node to not be triggered.
+      no_constfold_config = config_pb2.ConfigProto()
+      no_constfold_config.graph_options.rewrite_options.constant_folding = (
+          rewriter_config_pb2.RewriterConfig.OFF)
+      with session.Session(config=no_constfold_config) as sess:
+        sess.run(init_op)
+        expected_grad_value = sess.run(grad)
+
+    # Restore the MetaGraphDef into a new Graph.
+    with ops_lib.Graph().as_default():
+      with session.Session() as sess:
+        saver = saver_module.import_meta_graph(filename)
+        saver.restore(sess, saver_ckpt)
+
+      # Make sure we can still build gradients and get the same result.
+      var = ops_lib.get_default_graph().get_tensor_by_name(var_name)
+      output = ops_lib.get_default_graph().get_tensor_by_name(output_name)
+      grad = gradients_impl.gradients([output], [var])
+
+      init_op = variables.global_variables_initializer()
+
+      with session.Session(config=no_constfold_config) as sess:
+        sess.run(init_op)
+        actual_grad_value = sess.run(grad)
+        self.assertEqual(expected_grad_value, actual_grad_value)
+
+  def _testWhileLoopAndGradientSerDes(self, outer_body_fn):
+    # Build a while loop with `outer_body_fn`, export it, and verify that it can
+    # be imported and the gradient can be built and run correctly.
+    # pylint: disable=g-long-lambda
+    return self._testGradientSerDes(
+        lambda x: control_flow_ops.while_loop(
+            lambda i, y: i < 5, outer_body_fn, [0, x])[1])
+    # pylint: enable=g-long-lambda
+
+  def testNestedWhileLoopsSerDes(self):
+    # Test two simple nested while loops.
+    def body(i, x):
+      _, r = control_flow_ops.while_loop(lambda j, y: j < 3,
+                                         lambda j, y: (j + 1, y + x),
+                                         [0, 0.0])
+      return i + 1, x + r
+    self._testWhileLoopAndGradientSerDes(body)
+
+  def testNestedControlFlowSerDes(self):
+    # Test while loop in a cond in a while loop.
+    # pylint: disable=g-long-lambda
+    def body(i, x):
+      cond_result = control_flow_ops.cond(
+          i > 0,
+          lambda: control_flow_ops.while_loop(
+              lambda j, y: j < 3,
+              lambda j, y: (j + 1, y + x),
+              [0, 0.0])[1],
+          lambda: x)
+      return i + 1, cond_result
+    # pylint: enable=g-long-lambda
+    self._testWhileLoopAndGradientSerDes(body)
+
+  def testNestedCondsSerDes(self):
+    # Test conds in a cond.
+    # pylint: disable=g-long-lambda
+    self._testGradientSerDes(lambda x: control_flow_ops.cond(
+        x > 0,
+        lambda: control_flow_ops.cond(x > 3,
+                                      lambda: array_ops.identity(x),
+                                      lambda: math_ops.multiply(x, 2.0)),
+        lambda: control_flow_ops.cond(x < -3,
+                                      lambda: constant_op.constant(1.0),
+                                      lambda: math_ops.multiply(x, -1.0))))
+    # pylint: enable=g-long-lambda
+
   def testStrippedOpListDef(self):
     with self.test_session():
       # Creates a graph.
@@ -2680,11 +2879,11 @@ class _OwnsAVariableSimple(checkpointable.CheckpointableBase):
 class _MirroringSaveable(
     saver_module.BaseSaverBuilder.ResourceVariableSaveable):
 
-  def __init__(self, primary_variable, mirrored_variable):
+  def __init__(self, primary_variable, mirrored_variable, name):
     self._primary_variable = primary_variable
     self._mirrored_variable = mirrored_variable
     super(_MirroringSaveable, self).__init__(
-        self._primary_variable, "", self._primary_variable.name)
+        self._primary_variable, "", name)
 
   def restore(self, restored_tensors, restored_shapes):
     """Restore the same value into both variables."""
@@ -2704,10 +2903,12 @@ class _OwnsMirroredVariables(checkpointable.CheckpointableBase):
         name="mirrored", initializer=15., use_resource=True)
 
   def _gather_saveables_for_checkpoint(self):
-    saveable = _MirroringSaveable(
-        primary_variable=self.non_dep_variable,
-        mirrored_variable=self.mirrored)
-    return {checkpointable.VARIABLE_VALUE_KEY: saveable}
+    def _saveable_factory(name=self.non_dep_variable.name):
+      return _MirroringSaveable(
+          primary_variable=self.non_dep_variable,
+          mirrored_variable=self.mirrored,
+          name=name)
+    return {checkpointable.VARIABLE_VALUE_KEY: _saveable_factory}
 
   # The Saver sorts by name before parsing, so we need a name property.
   @property
diff --git a/tensorflow/python/training/saver_test_utils.py b/tensorflow/python/training/saver_test_utils.py
index 44b06b357ecbe4c8e330a2ccc49e83ddd4bf8c7d..2bbe5b6d845c304c4dc79fb3619c57211ca0489e 100644
--- a/tensorflow/python/training/saver_test_utils.py
+++ b/tensorflow/python/training/saver_test_utils.py
@@ -35,12 +35,12 @@ class CheckpointedOp(object):
   # pylint: disable=protected-access
   def __init__(self, name, table_ref=None):
     if table_ref is None:
-      self.table_ref = gen_lookup_ops._mutable_hash_table_v2(
+      self.table_ref = gen_lookup_ops.mutable_hash_table_v2(
           key_dtype=dtypes.string, value_dtype=dtypes.float32, name=name)
     else:
       self.table_ref = table_ref
     self._name = name
-    if context.in_graph_mode():
+    if not context.executing_eagerly():
       self._saveable = CheckpointedOp.CustomSaveable(self, name)
       ops_lib.add_to_collection(ops_lib.GraphKeys.SAVEABLE_OBJECTS,
                                 self._saveable)
@@ -51,16 +51,16 @@ class CheckpointedOp(object):
 
   @property
   def saveable(self):
-    if context.in_graph_mode():
-      return self._saveable
-    else:
+    if context.executing_eagerly():
       return CheckpointedOp.CustomSaveable(self, self.name)
+    else:
+      return self._saveable
 
   def insert(self, keys, values):
-    return gen_lookup_ops._lookup_table_insert_v2(self.table_ref, keys, values)
+    return gen_lookup_ops.lookup_table_insert_v2(self.table_ref, keys, values)
 
   def lookup(self, keys, default):
-    return gen_lookup_ops._lookup_table_find_v2(self.table_ref, keys, default)
+    return gen_lookup_ops.lookup_table_find_v2(self.table_ref, keys, default)
 
   def keys(self):
     return self._export()[0]
@@ -69,8 +69,8 @@ class CheckpointedOp(object):
     return self._export()[1]
 
   def _export(self):
-    return gen_lookup_ops._lookup_table_export_v2(self.table_ref, dtypes.string,
-                                                  dtypes.float32)
+    return gen_lookup_ops.lookup_table_export_v2(self.table_ref, dtypes.string,
+                                                 dtypes.float32)
 
   class CustomSaveable(saver_module.BaseSaverBuilder.SaveableObject):
     """A custom saveable for CheckpointedOp."""
@@ -86,6 +86,6 @@ class CheckpointedOp(object):
       super(CheckpointedOp.CustomSaveable, self).__init__(table, specs, name)
 
     def restore(self, restore_tensors, shapes):
-      return gen_lookup_ops._lookup_table_import_v2(
+      return gen_lookup_ops.lookup_table_import_v2(
           self.op.table_ref, restore_tensors[0], restore_tensors[1])
   # pylint: enable=protected-access
diff --git a/tensorflow/python/training/slot_creator.py b/tensorflow/python/training/slot_creator.py
index 75ef3d5976aba9f0cbe849d9f6984646d71a29ef..9ac52dd0715d7ed15e2e57ed286be973614b01e5 100644
--- a/tensorflow/python/training/slot_creator.py
+++ b/tensorflow/python/training/slot_creator.py
@@ -106,7 +106,10 @@ def create_slot(primary, val, name, colocate_with_primary=True):
   # and the same name has been previously used, the scope name will add '_N'
   # as suffix for unique identifications.
   validate_shape = val.get_shape().is_fully_defined()
-  prefix = primary.op.name if context.in_graph_mode() else primary._shared_name  # pylint: disable=protected-access
+  if context.executing_eagerly():
+    prefix = primary._shared_name  # pylint: disable=protected-access
+  else:
+    prefix = primary.op.name
   with variable_scope.variable_scope(None, prefix + "/" + name):
     if colocate_with_primary:
       with ops.colocate_with(primary):
@@ -139,7 +142,10 @@ def create_slot_with_initializer(primary, initializer, shape, dtype, name,
   # and the same name has been previously used, the scope name will add '_N'
   # as suffix for unique identifications.
   validate_shape = shape.is_fully_defined()
-  prefix = primary.op.name if context.in_graph_mode() else primary._shared_name  # pylint: disable=protected-access
+  if context.executing_eagerly():
+    prefix = primary._shared_name  # pylint: disable=protected-access
+  else:
+    prefix = primary.op.name
   with variable_scope.variable_scope(None, prefix + "/" + name):
     if colocate_with_primary:
       with ops.colocate_with(primary):
diff --git a/tensorflow/python/training/supervisor.py b/tensorflow/python/training/supervisor.py
index d2ad34773e0615256c340826dcc312cc8a00dc23..7389e344c7d8eef8e26c4d24c0985ff66276deea 100644
--- a/tensorflow/python/training/supervisor.py
+++ b/tensorflow/python/training/supervisor.py
@@ -45,7 +45,7 @@ class Supervisor(object):
   """A training helper that checkpoints models and computes summaries.
 
   This class is deprecated. Please use
-  ${tf.train.MonitoredTrainingSession} instead.
+  @{tf.train.MonitoredTrainingSession} instead.
 
   The Supervisor is a small wrapper around a `Coordinator`, a `Saver`,
   and a `SessionManager` that takes care of common needs of TensorFlow
@@ -305,7 +305,7 @@ class Supervisor(object):
     `Supervisor`s are not supported when eager execution is enabled.
     @end_compatibility
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError("Supervisors are compatible with eager execution.")
     # Set default values of arguments.
     if graph is None:
@@ -762,7 +762,7 @@ class Supervisor(object):
     execution is enabled, use the `tf.data` API.
     @end_compatibility
     """
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       raise RuntimeError("Queues are not compatible with eager execution.")
     if queue_runners is None:
       queue_runners = self._graph.get_collection(ops.GraphKeys.QUEUE_RUNNERS)
diff --git a/tensorflow/python/training/training.py b/tensorflow/python/training/training.py
index 78c8ce9208efc2f2fa8b5c671d3379e7ca8c70f5..b759b156d78cf8d869b49375058cc7ed42e82b34 100644
--- a/tensorflow/python/training/training.py
+++ b/tensorflow/python/training/training.py
@@ -28,8 +28,10 @@ See the @{$python/train} guide.
 @@ProximalGradientDescentOptimizer
 @@ProximalAdagradOptimizer
 @@RMSPropOptimizer
+@@custom_gradient
 @@gradients
 @@AggregationMethod
+@@GradientTape
 @@stop_gradient
 @@hessians
 @@clip_by_value
@@ -94,6 +96,8 @@ See the @{$python/train} guide.
 @@load_variable
 @@list_variables
 @@init_from_checkpoint
+@@warm_start
+@@VocabInfo
 """
 
 # Optimizers.
@@ -187,6 +191,8 @@ from tensorflow.python.training.training_util import get_global_step
 from tensorflow.python.training.training_util import assert_global_step
 from tensorflow.python.training.training_util import create_global_step
 from tensorflow.python.training.training_util import get_or_create_global_step
+from tensorflow.python.training.warm_starting_util import VocabInfo
+from tensorflow.python.training.warm_starting_util import warm_start
 from tensorflow.python.pywrap_tensorflow import do_quantize_training_on_graphdef
 from tensorflow.python.pywrap_tensorflow import NewCheckpointReader
 from tensorflow.python.util.tf_export import tf_export
diff --git a/tensorflow/python/training/training_util.py b/tensorflow/python/training/training_util.py
index 499f1feb2dbf8aee26314a43b0a000fb91a1c686..4f1abccc96ff95b6fcf9fe82c7a5d45fc2fd1c0c 100644
--- a/tensorflow/python/training/training_util.py
+++ b/tensorflow/python/training/training_util.py
@@ -64,7 +64,7 @@ def global_step(sess, global_step_tensor):
   Returns:
     The global step value.
   """
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     return int(global_step_tensor.numpy())
   return int(sess.run(global_step_tensor))
 
@@ -123,7 +123,7 @@ def create_global_step(graph=None):
     raise ValueError('"global_step" already exists.')
   # Create in proper graph and base name_scope.
   with graph.as_default() as g, g.name_scope(None):
-    if context.in_eager_mode():
+    if context.executing_eagerly():
       with ops.device('cpu:0'):
         return variable_scope.get_variable(
             ops.GraphKeys.GLOBAL_STEP,
diff --git a/tensorflow/python/estimator/warm_starting_util.py b/tensorflow/python/training/warm_starting_util.py
similarity index 67%
rename from tensorflow/python/estimator/warm_starting_util.py
rename to tensorflow/python/training/warm_starting_util.py
index adb013f5c653c4967a743047fef4e805946e0f59..4d4fb394c1272d2bf510bb594d70b9aa2edb3df2 100644
--- a/tensorflow/python/estimator/warm_starting_util.py
+++ b/tensorflow/python/training/warm_starting_util.py
@@ -33,7 +33,7 @@ from tensorflow.python.training import saver
 from tensorflow.python.util.tf_export import tf_export
 
 
-@tf_export("estimator.VocabInfo")
+@tf_export("train.VocabInfo", "estimator.VocabInfo")
 class VocabInfo(
     collections.namedtuple("VocabInfo", [
         "new_vocab",
@@ -43,7 +43,7 @@ class VocabInfo(
         "old_vocab_size",
         "backup_initializer",
     ])):
-  """Vocabulary information for WarmStartSettings.
+  """Vocabulary information for warm-starting.
 
   See @{tf.estimator.WarmStartSettings$WarmStartSettings} for examples of using
   VocabInfo to warm-start.
@@ -83,164 +83,6 @@ class VocabInfo(
     )
 
 
-@tf_export("estimator.WarmStartSettings")
-class WarmStartSettings(
-    collections.namedtuple("WarmStartSettings", [
-        "ckpt_to_initialize_from",
-        "vars_to_warm_start",
-        "var_name_to_vocab_info",
-        "var_name_to_prev_var_name",
-    ])):
-  """Settings for warm-starting in Estimators.
-
-  Example Use with canned `DNNEstimator`:
-
-  ```
-  emb_vocab_file = tf.feature_column.embedding_column(
-      tf.feature_column.categorical_column_with_vocabulary_file(
-          "sc_vocab_file", "new_vocab.txt", vocab_size=100),
-      dimension=8)
-  emb_vocab_list = tf.feature_column.embedding_column(
-      tf.feature_column.categorical_column_with_vocabulary_list(
-          "sc_vocab_list", vocabulary_list=["a", "b"]),
-      dimension=8)
-  estimator = tf.estimator.DNNClassifier(
-    hidden_units=[128, 64], feature_columns=[emb_vocab_file, emb_vocab_list],
-    warm_start_from=ws)
-  ```
-
-  where `ws` could be defined as:
-
-  Warm-start all weights in the model (input layer and hidden weights).
-  Either the directory or a specific checkpoint can be provided (in the case
-  of the former, the latest checkpoint will be used):
-
-  ```
-  ws = WarmStartSettings(ckpt_to_initialize_from="/tmp")
-  ws = WarmStartSettings(ckpt_to_initialize_from="/tmp/model-1000")
-  ```
-
-  Warm-start only the embeddings (input layer):
-
-  ```
-  ws = WarmStartSettings(ckpt_to_initialize_from="/tmp",
-                         vars_to_warm_start=".*input_layer.*")
-  ```
-
-  Warm-start all weights but the embedding parameters corresponding to
-  `sc_vocab_file` have a different vocab from the one used in the current
-  model:
-
-  ```
-  vocab_info = ws_util.VocabInfo(
-      new_vocab=sc_vocab_file.vocabulary_file,
-      new_vocab_size=sc_vocab_file.vocabulary_size,
-      num_oov_buckets=sc_vocab_file.num_oov_buckets,
-      old_vocab="old_vocab.txt"
-  )
-  ws = WarmStartSettings(
-      ckpt_to_initialize_from="/tmp",
-      var_name_to_vocab_info={
-          "input_layer/sc_vocab_file_embedding/embedding_weights": vocab_info
-      })
-  ```
-
-  Warm-start only `sc_vocab_file` embeddings (and no other variables), which
-  have a different vocab from the one used in the current model:
-
-  ```
-  vocab_info = ws_util.VocabInfo(
-      new_vocab=sc_vocab_file.vocabulary_file,
-      new_vocab_size=sc_vocab_file.vocabulary_size,
-      num_oov_buckets=sc_vocab_file.num_oov_buckets,
-      old_vocab="old_vocab.txt"
-  )
-  ws = WarmStartSettings(
-      ckpt_to_initialize_from="/tmp",
-      vars_to_warm_start=None,
-      var_name_to_vocab_info={
-          "input_layer/sc_vocab_file_embedding/embedding_weights": vocab_info
-      })
-  ```
-
-  Warm-start all weights but the parameters corresponding to `sc_vocab_file`
-  have a different vocab from the one used in current checkpoint, and only
-  100 of those entries were used:
-
-  ```
-  vocab_info = ws_util.VocabInfo(
-      new_vocab=sc_vocab_file.vocabulary_file,
-      new_vocab_size=sc_vocab_file.vocabulary_size,
-      num_oov_buckets=sc_vocab_file.num_oov_buckets,
-      old_vocab="old_vocab.txt",
-      old_vocab_size=100
-  )
-  ws = WarmStartSettings(
-      ckpt_to_initialize_from="/tmp",
-      var_name_to_vocab_info={
-          "input_layer/sc_vocab_file_embedding/embedding_weights": vocab_info
-      })
-  ```
-
-  Warm-start all weights but the parameters corresponding to `sc_vocab_file`
-  have a different vocab from the one used in current checkpoint and the
-  parameters corresponding to `sc_vocab_list` have a different name from the
-  current checkpoint:
-
-  ```
-  vocab_info = ws_util.VocabInfo(
-      new_vocab=sc_vocab_file.vocabulary_file,
-      new_vocab_size=sc_vocab_file.vocabulary_size,
-      num_oov_buckets=sc_vocab_file.num_oov_buckets,
-      old_vocab="old_vocab.txt",
-      old_vocab_size=100
-  )
-  ws = WarmStartSettings(
-      ckpt_to_initialize_from="/tmp",
-      var_name_to_vocab_info={
-          "input_layer/sc_vocab_file_embedding/embedding_weights": vocab_info
-      },
-      var_name_to_prev_var_name={
-          "input_layer/sc_vocab_list_embedding/embedding_weights":
-              "old_tensor_name"
-      })
-  ```
-
-  Attributes:
-    ckpt_to_initialize_from: [Required] A string specifying the directory with
-      checkpoint file(s) or path to checkpoint from which to warm-start the
-      model parameters.
-    vars_to_warm_start: [Optional] A regular expression that captures which
-      variables to warm-start (see tf.get_collection).  Defaults to `'.*'`,
-      which warm-starts all variables.  If `None` is explicitly given, only
-      variables specified in `var_name_to_vocab_info` will be warm-started.
-    var_name_to_vocab_info: [Optional] Dict of variable names (strings) to
-      VocabInfo. The variable names should be "full" variables, not the names
-      of the partitions.  If not explicitly provided, the variable is assumed to
-      have no vocabulary.
-    var_name_to_prev_var_name: [Optional] Dict of variable names (strings) to
-      name of the previously-trained variable in `ckpt_to_initialize_from`. If
-      not explicitly provided, the name of the variable is assumed to be same
-      between previous checkpoint and current model.
-  """
-
-  def __new__(cls,
-              ckpt_to_initialize_from,
-              vars_to_warm_start=".*",
-              var_name_to_vocab_info=None,
-              var_name_to_prev_var_name=None):
-    if not ckpt_to_initialize_from:
-      raise ValueError(
-          "`ckpt_to_initialize_from` MUST be set in WarmStartSettings")
-    return super(WarmStartSettings, cls).__new__(
-        cls,
-        ckpt_to_initialize_from,
-        vars_to_warm_start,
-        var_name_to_vocab_info or {},
-        var_name_to_prev_var_name or {},
-    )
-
-
 def _is_variable(x):
   return (isinstance(x, variables_lib.Variable) or
           isinstance(x, resource_variable_ops.ResourceVariable))
@@ -375,8 +217,7 @@ def _warm_start_var_with_vocab(var,
           full_shape=slice_info.full_shape,
           var_offset=slice_info.var_offset)
 
-    # TODO(eddz): Support WarmStartSettings where class vocabularies need
-    # remapping too.
+    # TODO(eddz): Support cases where class vocabularies need remapping too.
     init = checkpoint_ops._load_and_remap_matrix_initializer(
         ckpt_path=checkpoint_utils._get_checkpoint_filename(prev_ckpt),
         old_tensor_name=prev_tensor_name,
@@ -396,32 +237,53 @@ def _warm_start_var_with_vocab(var,
 # pylint: enable=protected-access
 
 
-def _warm_start(warm_start_settings):
+@tf_export("train.warm_start")
+def warm_start(ckpt_to_initialize_from,
+               vars_to_warm_start=".*",
+               var_name_to_vocab_info=None,
+               var_name_to_prev_var_name=None):
   """Warm-starts a model using the given settings.
 
   If you are using a tf.estimator.Estimator, this will automatically be called
   during training.
 
   Args:
-    warm_start_settings: An object of `WarmStartSettings`.
+    ckpt_to_initialize_from: [Required] A string specifying the directory with
+      checkpoint file(s) or path to checkpoint from which to warm-start the
+      model parameters.
+    vars_to_warm_start: [Optional] A regular expression that captures which
+      variables to warm-start (see tf.get_collection).  Defaults to `'.*'`,
+      which warm-starts all variables.  If `None` is explicitly given, only
+      variables specified in `var_name_to_vocab_info` will be warm-started.
+    var_name_to_vocab_info: [Optional] Dict of variable names (strings) to
+      VocabInfo. The variable names should be "full" variables, not the names
+      of the partitions.  If not explicitly provided, the variable is assumed to
+      have no vocabulary.
+    var_name_to_prev_var_name: [Optional] Dict of variable names (strings) to
+      name of the previously-trained variable in `ckpt_to_initialize_from`. If
+      not explicitly provided, the name of the variable is assumed to be the
+      same in the previous checkpoint and the current model.
   Raises:
     ValueError: If the WarmStartSettings contains prev_var_name or VocabInfo
       configuration for variable names that are not used.  This is to ensure
       a stronger check for variable configuration than relying on users to
       examine the logs.
   """
-  logging.info("Warm-starting from: %s",
-               (warm_start_settings.ckpt_to_initialize_from,))
+  if var_name_to_vocab_info is None:
+    var_name_to_vocab_info = {}
+  if var_name_to_prev_var_name is None:
+    var_name_to_prev_var_name = {}
+  logging.info("Warm-starting from: %s", (ckpt_to_initialize_from,))
   # We have to deal with partitioned variables, since get_collection flattens
   # out the list.
   grouped_variables = {}
-  # Both warm_start_settings.vars_to_warm_start = '.*' and
-  # warm_start_settings.vars_to_warm_start = None will match everything here.
+  # Both vars_to_warm_start = '.*' and vars_to_warm_start = None will match
+  # everything here.
   for v in ops.get_collection(
       # TODO(eddz): Allow for different collections here (to support
       # warm-starting accumulators).
       ops.GraphKeys.TRAINABLE_VARIABLES,
-      scope=warm_start_settings.vars_to_warm_start):
+      scope=vars_to_warm_start):
     if not isinstance(v, list):
       var_name = _infer_var_name([v])
     else:
@@ -437,10 +299,10 @@ def _warm_start(warm_start_settings):
   vocab_info_used = set()
 
   for var_name, variable in six.iteritems(grouped_variables):
-    prev_var_name = warm_start_settings.var_name_to_prev_var_name.get(var_name)
+    prev_var_name = var_name_to_prev_var_name.get(var_name)
     if prev_var_name:
       prev_var_name_used.add(var_name)
-    vocab_info = warm_start_settings.var_name_to_vocab_info.get(var_name)
+    vocab_info = var_name_to_vocab_info.get(var_name)
     if vocab_info:
       vocab_info_used.add(var_name)
       logging.info(
@@ -460,16 +322,16 @@ def _warm_start(warm_start_settings):
           variable,
           current_vocab_path=vocab_info.new_vocab,
           current_vocab_size=vocab_info.new_vocab_size,
-          prev_ckpt=warm_start_settings.ckpt_to_initialize_from,
+          prev_ckpt=ckpt_to_initialize_from,
           prev_vocab_path=vocab_info.old_vocab,
           previous_vocab_size=vocab_info.old_vocab_size,
           current_oov_buckets=vocab_info.num_oov_buckets,
           prev_tensor_name=prev_var_name,
           initializer=vocab_info.backup_initializer)
     else:
-      # For the special value of warm_start_settings.vars_to_warm_start = None,
+      # For the special value of vars_to_warm_start = None,
       # we only warm-start variables with explicitly specified vocabularies.
-      if warm_start_settings.vars_to_warm_start:
+      if vars_to_warm_start:
         logging.info("Warm-starting variable: {}; prev_var_name: {}".format(
             var_name, prev_var_name or "Unchanged"))
         # Because we use a default empty list in grouped_variables, single
@@ -477,48 +339,22 @@ def _warm_start(warm_start_settings):
         # for init_from_checkpoint logic to work correctly.
         if len(variable) == 1:
           variable = variable[0]
-        _warm_start_var(variable, warm_start_settings.ckpt_to_initialize_from,
-                        prev_var_name)
+        _warm_start_var(variable, ckpt_to_initialize_from, prev_var_name)
 
   prev_var_name_not_used = set(
-      warm_start_settings.var_name_to_prev_var_name.keys()) - prev_var_name_used
-  vocab_info_not_used = set(
-      warm_start_settings.var_name_to_vocab_info.keys()) - vocab_info_used
+      var_name_to_prev_var_name.keys()) - prev_var_name_used
+  vocab_info_not_used = set(var_name_to_vocab_info.keys()) - vocab_info_used
 
   if prev_var_name_not_used:
     raise ValueError(
         "You provided the following variables in "
-        "warm_start_settings.var_name_to_prev_var_name that were not used: "
+        "var_name_to_prev_var_name that were not used: "
         "{0}.  Perhaps you misspelled them?  Here is the list of viable "
         "variable names: {1}".format(prev_var_name_not_used,
                                      grouped_variables.keys()))
   if vocab_info_not_used:
     raise ValueError(
         "You provided the following variables in "
-        "warm_start_settings.var_name_to_vocab_info that were not used: {0}. "
+        "var_name_to_vocab_info that were not used: {0}. "
         " Perhaps you misspelled them?  Here is the list of viable variable "
         "names: {1}".format(vocab_info_not_used, grouped_variables.keys()))
-
-
-def _get_default_warm_start_settings(warm_start_from):
-  """Returns default WarmStartSettings.
-
-  Args:
-    warm_start_from: Either a string representing the filepath of a checkpoint
-      to initialize from, or an instance of WarmStartSettings.
-
-  Returns:
-    Either None or an instance of WarmStartSettings.
-
-  Raises:
-    ValueError: If warm_start_from is not None but is neither a string nor an
-      instance of WarmStartSettings.
-  """
-  if warm_start_from is None:
-    return None
-  if isinstance(warm_start_from, six.string_types):
-    return WarmStartSettings(ckpt_to_initialize_from=warm_start_from)
-  elif isinstance(warm_start_from, WarmStartSettings):
-    return warm_start_from
-  else:
-    raise ValueError("warm_start_from must be a string or a WarmStartSettings")
diff --git a/tensorflow/python/estimator/warm_starting_util_test.py b/tensorflow/python/training/warm_starting_util_test.py
similarity index 94%
rename from tensorflow/python/estimator/warm_starting_util_test.py
rename to tensorflow/python/training/warm_starting_util_test.py
index 3985d9ebd04e6963339fcf9999f6367fe4dadc1a..6e445d8bd14cc13010541c1ab0f737f96a4b1e03 100644
--- a/tensorflow/python/estimator/warm_starting_util_test.py
+++ b/tensorflow/python/training/warm_starting_util_test.py
@@ -22,7 +22,6 @@ import os
 import numpy as np
 import six
 
-from tensorflow.python.estimator import warm_starting_util as ws_util
 from tensorflow.python.feature_column import feature_column as fc
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
@@ -32,6 +31,7 @@ from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 from tensorflow.python.training import saver as saver_lib
+from tensorflow.python.training import warm_starting_util as ws_util
 
 ones = init_ops.ones_initializer
 norms = init_ops.truncated_normal_initializer
@@ -330,9 +330,7 @@ class WarmStartingUtilTest(test.TestCase):
     with ops.Graph().as_default() as g:
       with self.test_session(graph=g) as sess:
         cols_to_vars = self._create_linear_model([sc_int], partitioner)
-        ws_util._warm_start(
-            ws_util.WarmStartSettings(
-                self.get_temp_dir(), vars_to_warm_start=".*sc_int.*"))
+        ws_util.warm_start(self.get_temp_dir(), vars_to_warm_start=".*sc_int.*")
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started.
         self._assert_cols_to_vars(cols_to_vars, {sc_int: [prev_int_val]}, sess)
@@ -361,9 +359,8 @@ class WarmStartingUtilTest(test.TestCase):
     with ops.Graph().as_default() as g:
       with self.test_session(graph=g) as sess:
         cols_to_vars = self._create_linear_model([sc_hash], partitioner)
-        ws_util._warm_start(
-            ws_util.WarmStartSettings(
-                self.get_temp_dir(), vars_to_warm_start=".*sc_hash.*"))
+        ws_util.warm_start(
+            self.get_temp_dir(), vars_to_warm_start=".*sc_hash.*")
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started.
         self._assert_cols_to_vars(cols_to_vars, {sc_hash: [prev_hash_val]},
@@ -398,9 +395,8 @@ class WarmStartingUtilTest(test.TestCase):
         cols_to_vars = self._create_linear_model([sc_vocab], partitioner)
         # Since old vocab is not explicitly set in WarmStartSettings, the old
         # vocab is assumed to be same as new vocab.
-        ws_util._warm_start(
-            ws_util.WarmStartSettings(
-                self.get_temp_dir(), vars_to_warm_start=".*sc_vocab.*"))
+        ws_util.warm_start(
+            self.get_temp_dir(), vars_to_warm_start=".*sc_vocab.*")
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started.
         self._assert_cols_to_vars(cols_to_vars, {sc_vocab: [prev_vocab_val]},
@@ -435,11 +431,10 @@ class WarmStartingUtilTest(test.TestCase):
         cols_to_vars = self._create_linear_model([sc_vocab], partitioner)
         # Since old vocab is not explicitly set in WarmStartSettings, the old
         # vocab is assumed to be same as new vocab.
-        ws_util._warm_start(
-            ws_util.WarmStartSettings(
-                # Explicitly provide the file prefix instead of just the dir.
-                os.path.join(self.get_temp_dir(), "model-0"),
-                vars_to_warm_start=".*sc_vocab.*"))
+        ws_util.warm_start(
+            # Explicitly provide the file prefix instead of just the dir.
+            os.path.join(self.get_temp_dir(), "model-0"),
+            vars_to_warm_start=".*sc_vocab.*")
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started.
         self._assert_cols_to_vars(cols_to_vars, {sc_vocab: [prev_vocab_val]},
@@ -485,13 +480,12 @@ class WarmStartingUtilTest(test.TestCase):
             num_oov_buckets=sc_vocab.num_oov_buckets,
             old_vocab=old_vocab_path,
             old_vocab_size=old_vocab_size)
-        warm_start_settings = ws_util.WarmStartSettings(
+        ws_util.warm_start(
             ckpt_to_initialize_from=self.get_temp_dir(),
             vars_to_warm_start=".*sc_vocab.*",
             var_name_to_vocab_info={
                 "linear_model/sc_vocab/weights": vocab_info
             })
-        ws_util._warm_start(warm_start_settings)
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started.  'banana' isn't in the
         # first two entries of the old vocabulary, so it's newly initialized.
@@ -523,9 +517,8 @@ class WarmStartingUtilTest(test.TestCase):
     with ops.Graph().as_default() as g:
       with self.test_session(graph=g) as sess:
         cols_to_vars = self._create_linear_model([real_bucket], partitioner)
-        ws_util._warm_start(
-            ws_util.WarmStartSettings(
-                self.get_temp_dir(), vars_to_warm_start=".*real_bucketized.*"))
+        ws_util.warm_start(
+            self.get_temp_dir(), vars_to_warm_start=".*real_bucketized.*")
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started.
         self._assert_cols_to_vars(cols_to_vars,
@@ -606,12 +599,11 @@ class WarmStartingUtilTest(test.TestCase):
             new_vocab_size=sc_vocab.vocabulary_size,
             num_oov_buckets=sc_vocab.num_oov_buckets,
             old_vocab=vocab_path)
-        ws_util._warm_start(
-            ws_util.WarmStartSettings(
-                self.get_temp_dir(),
-                var_name_to_vocab_info={
-                    "linear_model/sc_vocab/weights": vocab_info
-                }))
+        ws_util.warm_start(
+            self.get_temp_dir(),
+            var_name_to_vocab_info={
+                "linear_model/sc_vocab/weights": vocab_info
+            })
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started.
         self._assert_cols_to_vars(cols_to_vars, {
@@ -668,7 +660,7 @@ class WarmStartingUtilTest(test.TestCase):
             new_vocab_size=sc_vocab.vocabulary_size,
             num_oov_buckets=sc_vocab.num_oov_buckets,
             old_vocab=prev_vocab_path)
-        ws_settings = ws_util.WarmStartSettings(
+        ws_util.warm_start(
             self.get_temp_dir(),
             vars_to_warm_start=".*(sc_keys|sc_vocab).*",
             var_name_to_vocab_info={
@@ -678,7 +670,6 @@ class WarmStartingUtilTest(test.TestCase):
                 ws_util._infer_var_name(cols_to_vars[sc_keys]):
                     "some_other_name"
             })
-        ws_util._warm_start(ws_settings)
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started.  Var corresponding to
         # sc_hash should not be warm-started.  Var corresponding to sc_vocab
@@ -732,7 +723,7 @@ class WarmStartingUtilTest(test.TestCase):
             new_vocab_size=sc_vocab.vocabulary_size,
             num_oov_buckets=sc_vocab.num_oov_buckets,
             old_vocab=prev_vocab_path)
-        ws_settings = ws_util.WarmStartSettings(
+        ws_util.warm_start(
             self.get_temp_dir(),
             vars_to_warm_start=".*(sc_keys|sc_vocab).*",
             var_name_to_vocab_info={
@@ -742,7 +733,6 @@ class WarmStartingUtilTest(test.TestCase):
                 ws_util._infer_var_name(cols_to_vars[sc_keys]):
                     "some_other_name"
             })
-        ws_util._warm_start(ws_settings)
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started.  Var corresponding to
         # sc_hash should not be warm-started.  Var corresponding to sc_vocab
@@ -796,7 +786,7 @@ class WarmStartingUtilTest(test.TestCase):
             new_vocab_size=sc_vocab.vocabulary_size,
             num_oov_buckets=sc_vocab.num_oov_buckets,
             old_vocab=prev_vocab_path)
-        ws_settings = ws_util.WarmStartSettings(
+        ws_util.warm_start(
             self.get_temp_dir(),
             # The special value of None here will ensure that only the variable
             # specified in var_name_to_vocab_info (sc_vocab embedding) is
@@ -812,7 +802,6 @@ class WarmStartingUtilTest(test.TestCase):
                 ws_util._infer_var_name(cols_to_vars[sc_keys]):
                     "some_other_name"
             })
-        ws_util._warm_start(ws_settings)
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started.  Var corresponding to
         # sc_vocab should be correctly warm-started after vocab remapping,
@@ -874,13 +863,12 @@ class WarmStartingUtilTest(test.TestCase):
             # use a truncated normal initializer.
             backup_initializer=init_ops.random_uniform_initializer(
                 minval=0.42, maxval=0.42))
-        ws_settings = ws_util.WarmStartSettings(
+        ws_util.warm_start(
             self.get_temp_dir(),
             var_name_to_vocab_info={
                 ws_util._infer_var_name(cols_to_vars[emb_vocab_column]):
                     vocab_info
             })
-        ws_util._warm_start(ws_settings)
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started. Var corresponding to
         # emb_vocab_column should be correctly warm-started after vocab
@@ -947,13 +935,12 @@ class WarmStartingUtilTest(test.TestCase):
             # use a truncated normal initializer.
             backup_initializer=init_ops.random_uniform_initializer(
                 minval=0.42, maxval=0.42))
-        ws_settings = ws_util.WarmStartSettings(
+        ws_util.warm_start(
             self.get_temp_dir(),
             vars_to_warm_start=".*sc_vocab.*",
             var_name_to_vocab_info={
                 "linear_model/sc_vocab_embedding/embedding_weights": vocab_info
             })
-        ws_util._warm_start(ws_settings)
         sess.run(variables.global_variables_initializer())
         # Verify weights were correctly warm-started. Var corresponding to
         # emb_vocab should be correctly warm-started after vocab remapping.
@@ -973,7 +960,6 @@ class WarmStartingUtilTest(test.TestCase):
             }, sess)
 
   def testErrorConditions(self):
-    self.assertRaises(ValueError, ws_util.WarmStartSettings, None)
     x = variable_scope.get_variable(
         "x",
         shape=[4, 1],
@@ -983,9 +969,6 @@ class WarmStartingUtilTest(test.TestCase):
     # List of PartitionedVariable is invalid type when warm-starting with vocab.
     self.assertRaises(TypeError, ws_util._warm_start_var_with_vocab, [x],
                       "/tmp", 5, "/tmp", "/tmp")
-    # Keys of type other than FeatureColumn.
-    self.assertRaises(TypeError, ws_util._warm_start, {"StringType": x},
-                      ws_util.WarmStartSettings("/tmp"))
 
     # Unused variable names raises ValueError.
     with ops.Graph().as_default():
@@ -997,18 +980,16 @@ class WarmStartingUtilTest(test.TestCase):
             partitioner=lambda shape, dtype: [2, 1])
         self._write_checkpoint(sess)
 
-    self.assertRaises(ValueError, ws_util._warm_start,
-                      ws_util.WarmStartSettings(
-                          self.get_temp_dir(),
-                          var_name_to_vocab_info={
-                              "y": ws_util.VocabInfo("", 1, 0, "")
-                          }))
-    self.assertRaises(ValueError, ws_util._warm_start,
-                      ws_util.WarmStartSettings(
-                          self.get_temp_dir(),
-                          var_name_to_prev_var_name={
-                              "y": "y2"
-                          }))
+    self.assertRaises(
+        ValueError,
+        ws_util.warm_start,
+        self.get_temp_dir(),
+        var_name_to_vocab_info={"y": ws_util.VocabInfo("", 1, 0, "")})
+    self.assertRaises(
+        ValueError,
+        ws_util.warm_start,
+        self.get_temp_dir(),
+        var_name_to_prev_var_name={"y": "y2"})
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/python/user_ops/user_ops.py b/tensorflow/python/user_ops/user_ops.py
index 17dbab706c9243c5f119dc82cc4428f03b90a18d..20ea3b0f621dc74bd3778d565f8897e47a881d42 100644
--- a/tensorflow/python/user_ops/user_ops.py
+++ b/tensorflow/python/user_ops/user_ops.py
@@ -23,8 +23,10 @@ from tensorflow.python.ops import gen_user_ops as _gen_user_ops
 
 # go/tf-wildcard-import
 from tensorflow.python.ops.gen_user_ops import *  # pylint: disable=wildcard-import
+from tensorflow.python.util.tf_export import tf_export
 
 
+@tf_export('user_ops.my_fact')
 def my_fact():
   """Example of overriding the generated code for an Op."""
-  return _gen_user_ops._fact()  # pylint: disable=protected-access
+  return _gen_user_ops.fact()
diff --git a/tensorflow/python/util/decorator_utils.py b/tensorflow/python/util/decorator_utils.py
index df259c7f7c29f9a4b674d3e980b33d6dcf323769..7b4363c0e40802779cf47c75c5a5e5a901da37e2 100644
--- a/tensorflow/python/util/decorator_utils.py
+++ b/tensorflow/python/util/decorator_utils.py
@@ -82,7 +82,7 @@ def add_notice_to_docstring(
     lines = _normalize_docstring(doc).splitlines()
     lines[0] += ' ' + suffix_str
 
-  notice = [''] + notice + [instructions]
+  notice = [''] + notice + ([instructions] if instructions else [])
 
   if len(lines) > 1:
     # Make sure that we keep our distance from the main body
diff --git a/tensorflow/python/util/port.i b/tensorflow/python/util/port.i
index cea4d8468afe8816d71da6581635b8a7ab0c2388..2f730732bee373a6e6ead97fe3320645f37ac220 100644
--- a/tensorflow/python/util/port.i
+++ b/tensorflow/python/util/port.i
@@ -23,5 +23,6 @@ limitations under the License.
 %unignore tensorflow;
 %unignore tensorflow::IsGoogleCudaEnabled;
 %unignore tensorflow::CudaSupportsHalfMatMulAndConv;
+%unignore tensorflow::IsMklEnabled;
 %include "tensorflow/core/util/port.h"
 %unignoreall
diff --git a/tensorflow/python/util/tf_inspect.py b/tensorflow/python/util/tf_inspect.py
index c2fe6fc4494428693605a5a7463a9f590a2da39e..4ab8a72a83b466c38c50b1c76004e7a6fe942a04 100644
--- a/tensorflow/python/util/tf_inspect.py
+++ b/tensorflow/python/util/tf_inspect.py
@@ -46,8 +46,10 @@ def getargspec(object):  # pylint: disable=redefined-builtin
 
 
 def getfullargspec(obj):  # pylint: disable=redefined-builtin
-  """TFDecorator-aware replacement for inspect.getfullargspec and fallback to
-  inspect.getargspec in Python 2.
+  """TFDecorator-aware replacement for `inspect.getfullargspec`/`getargspec`.
+
+  This wrapper uses `inspect.getfullargspec` if available and falls back to
+  `inspect.getargspec` in Python 2.
 
   Args:
     obj: A callable, possibly decorated.
@@ -149,6 +151,11 @@ def getsource(object):  # pylint: disable=redefined-builtin
   return _inspect.getsource(tf_decorator.unwrap(object)[1])
 
 
+def isbuiltin(object):  # pylint: disable=redefined-builtin
+  """TFDecorator-aware replacement for inspect.isbuiltin."""
+  return _inspect.isbuiltin(tf_decorator.unwrap(object)[1])
+
+
 def isclass(object):  # pylint: disable=redefined-builtin
   """TFDecorator-aware replacement for inspect.isclass."""
   return _inspect.isclass(tf_decorator.unwrap(object)[1])
diff --git a/tensorflow/python/util/tf_inspect_test.py b/tensorflow/python/util/tf_inspect_test.py
index 8903e1156b27b3a28543eb5ecfcc6eeb1a04f6ae..129408449ebb45ac3a322f163a13b705cbb31f0c 100644
--- a/tensorflow/python/util/tf_inspect_test.py
+++ b/tensorflow/python/util/tf_inspect_test.py
@@ -144,6 +144,19 @@ def test_decorated_function_with_defaults(a, b=2, c='Hello'):
     self.assertEqual(
         expected, tf_inspect.getsource(test_decorated_function_with_defaults))
 
+  def testIsBuiltin(self):
+    self.assertEqual(
+        tf_inspect.isbuiltin(TestDecoratedClass),
+        inspect.isbuiltin(TestDecoratedClass))
+    self.assertEqual(
+        tf_inspect.isbuiltin(test_decorated_function),
+        inspect.isbuiltin(test_decorated_function))
+    self.assertEqual(
+        tf_inspect.isbuiltin(test_undecorated_function),
+        inspect.isbuiltin(test_undecorated_function))
+    self.assertEqual(tf_inspect.isbuiltin(range), inspect.isbuiltin(range))
+    self.assertEqual(tf_inspect.isbuiltin(max), inspect.isbuiltin(max))
+
   def testIsClass(self):
     self.assertTrue(tf_inspect.isclass(TestDecoratedClass))
     self.assertFalse(tf_inspect.isclass(test_decorated_function))
diff --git a/tensorflow/python/util/tf_should_use.py b/tensorflow/python/util/tf_should_use.py
index 37733152e8ec6d7b026bf74e69e33bfe8f9f4e89..28e49afa023904abed076373685bb38f2537b7d4 100644
--- a/tensorflow/python/util/tf_should_use.py
+++ b/tensorflow/python/util/tf_should_use.py
@@ -47,7 +47,7 @@ def _add_should_use_warning(x, fatal_error=False):
   if x is None or x == []:  # pylint: disable=g-explicit-bool-comparison
     return x
 
-  if context.in_eager_mode():
+  if context.executing_eagerly():
     # Typically not needed when executing eagerly (the main use case is for ops
     # which need to be incorporated into the graph), and even the no-op wrapper
     # creates reference cycles which require garbage collection.
diff --git a/tensorflow/stream_executor/blas.cc b/tensorflow/stream_executor/blas.cc
index da09d84921e2dd94942b3a62fe7366211c60aed1..31724cf6c9b97e45975b9e053459f7b8f5918dfa 100644
--- a/tensorflow/stream_executor/blas.cc
+++ b/tensorflow/stream_executor/blas.cc
@@ -79,6 +79,8 @@ string ComputationTypeString(ComputationType ty) {
       return "f32";
     case ComputationType::kF64:
       return "f64";
+    case ComputationType::kI32:
+      return "i32";
     case ComputationType::kComplexF32:
       return "complex f32";
     case ComputationType::kComplexF64:
@@ -88,6 +90,10 @@ string ComputationTypeString(ComputationType ty) {
   }
 }
 
+std::ostream& operator<<(std::ostream& os, ComputationType ty) {
+  return os << ComputationTypeString(ty);
+}
+
 }  // namespace blas
 }  // namespace gputools
 }  // namespace perftools
diff --git a/tensorflow/stream_executor/blas.h b/tensorflow/stream_executor/blas.h
index 072f08554688276a05d9be85718de8750bd874c2..c5f778a5c74519c0f35cea5d59aac3d0d4564c56 100644
--- a/tensorflow/stream_executor/blas.h
+++ b/tensorflow/stream_executor/blas.h
@@ -104,6 +104,8 @@ enum class ComputationType {
 // Converts a ComputationType to a string.
 string ComputationTypeString(ComputationType ty);
 
+std::ostream &operator<<(std::ostream &os, ComputationType ty);
+
 // Opaque identifier for an "algorithm" used by a blas routine.  This functions
 // as a hint to the blas library.
 typedef int64 AlgorithmType;
diff --git a/tensorflow/stream_executor/cuda/cuda_blas.cc b/tensorflow/stream_executor/cuda/cuda_blas.cc
index 44a3a745ad86dc24f632e4a36691fba06171c9fb..c563f8f931b0a5689268329386d1252f2a45bdd1 100644
--- a/tensorflow/stream_executor/cuda/cuda_blas.cc
+++ b/tensorflow/stream_executor/cuda/cuda_blas.cc
@@ -13,17 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-// Include cuBLAS headers early, and then set EIGEN_HAS_CUDA_FP16
-// if we have new enough CUDA (which we will only know after including
-// cuda.h). This ensures that Eigen's Half.h does not attempt to make its own
-// __half typedef if CUDA has already defined one (and conversely, that we do
-// not include <cuda_fp16.h> after Half.h has made its typedef).
-#include "cuda/include/cuda.h"
 #include "cuda/include/cublas_v2.h"
-
-#if CUDA_VERSION >= 7050
-#define EIGEN_HAS_CUDA_FP16
-#endif
+#include "cuda/include/cuda.h"
 
 #if CUDA_VERSION >= 8000
 #define SE_CUDA_DATA_HALF CUDA_R_16F
@@ -33,6 +24,34 @@ limitations under the License.
 
 #include "tensorflow/stream_executor/cuda/cuda_blas.h"
 
+// Both Eigen's Half.h and CUDA's cuda_fp16.h provide a typedef for __half, so
+// there are two ways to obtain it:
+//
+// (1) Include cuda_fp16.h and define EIGEN_HAS_CUDA_FP16.
+// (2) Neither include cuda_fp16.h nor define EIGEN_HAS_CUDA_FP16.
+//
+// Due to issue b/73793421, when the first approach is used and this file is
+// compiled with NVCC, NVCC complains about a duplicate definition of
+// EIGEN_HAS_CUDA_FP16. On the other hand, when the second approach is used and
+// this file is compiled with clang, clang does not recognize __half because
+// both its definition and the EIGEN_HAS_CUDA_FP16 macro are missing.
+//
+// Because this file may be compiled with clang but will never be compiled with
+// NVCC, we choose the first approach for CUDA < 9.0. For CUDA >= 9.0, we have
+// to use the second approach because the data member in the __half defined
+// by CUDA >= 9.0 is `__x` while Eigen expects it to be `x`.
+//
+// TODO(b/73793421): Remove the following code block to switch to the second
+// approach when the issue is fixed.
+#if CUDA_VERSION < 9000
+#include "cuda/include/cuda_fp16.h"
+#if CUDA_VERSION >= 7050
+#define EIGEN_HAS_CUDA_FP16
+#endif
+#endif
+
+#include "third_party/eigen3/Eigen/Core"
+
 #include <assert.h>
 #include <complex>
 
@@ -2256,6 +2275,14 @@ bool CUDABlas::DoBlasGemmWithAlgorithm(
     DeviceMemory<Eigen::half> *c, int ldc,
     blas::ComputationType computation_type, blas::AlgorithmType algorithm,
     blas::ProfileResult *output_profile_result) {
+  if (computation_type == blas::ComputationType::kF32) {
+    return DoBlasGemmWithAlgorithmImpl(
+        stream, transa, transb, m, n, k, static_cast<float>(alpha), a, lda, b,
+        ldb, static_cast<float>(beta), c, ldc, computation_type, algorithm,
+        output_profile_result);
+  }
+
+  CHECK_EQ(computation_type, blas::ComputationType::kF16);
   return DoBlasGemmWithAlgorithmImpl(
       stream, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc,
       computation_type, algorithm, output_profile_result);
diff --git a/tensorflow/stream_executor/cuda/cuda_dnn.cc b/tensorflow/stream_executor/cuda/cuda_dnn.cc
index 61cf4ba7eac1f9482e3c1b179f35434a2a65d955..03e3e0857f9f70dcd3062f0766b50f6f75b2fa5e 100644
--- a/tensorflow/stream_executor/cuda/cuda_dnn.cc
+++ b/tensorflow/stream_executor/cuda/cuda_dnn.cc
@@ -274,7 +274,8 @@ CUDNN_DNN_ROUTINE_EACH_R6(PERFTOOLS_GPUTOOLS_CUDNN_WRAP)
 // clang-format off
 #if CUDNN_VERSION >= 7000
 #define CUDNN_DNN_ROUTINE_EACH_R7(__macro)                    \
-  __macro(cudnnSetConvolutionMathType)
+  __macro(cudnnSetConvolutionMathType)                        \
+  __macro(cudnnSetRNNMatrixMathType)
 
 // clang-format on
 CUDNN_DNN_ROUTINE_EACH_R7(PERFTOOLS_GPUTOOLS_CUDNN_WRAP)
@@ -586,6 +587,19 @@ static bool TensorOpMathEnabled() {
   return is_enabled;
 }
 
+// A helper function to decide whether to enable the TENSOR_OP_MATH math type
+// for RNNs.
+static bool RnnTensorOpMathEnabled() {
+  static bool is_enabled = [] {
+    bool is_disabled = false;
+    TF_CHECK_OK(
+        tensorflow::ReadBoolFromEnvVar("TF_DISABLE_CUDNN_RNN_TENSOR_OP_MATH",
+                                       /*default_val=*/false, &is_disabled));
+    return !is_disabled;
+  }();
+  return is_enabled;
+}
+
 // A helper function to decide whether to use CUDNN_BATCHNORM_SPATIAL_PERSISTENT
 // in batchnorm. This mode can be faster in some tasks because an optimized path
 // may be selected for CUDNN_DATA_FLOAT and CUDNN_DATA_HALF data types, compute
@@ -1124,6 +1138,9 @@ class CudnnRnnDescriptor : public CudnnDescriptorCommon<dnn::RnnDescriptor> {
       SetFailure(cudnn_params_desc_->Status());
       return;
     }
+    if (data_type == CUDNN_DATA_HALF) {
+      set_use_tensor_op_math(true);
+    }
   }
   ~CudnnRnnDescriptor() override {
     if (rnn_desc_) {
@@ -1132,6 +1149,20 @@ class CudnnRnnDescriptor : public CudnnDescriptorCommon<dnn::RnnDescriptor> {
       CUDNN_RETURN_IF_FAIL(status, "Unable to destroy RNN descriptor");
     }
   }
+  void set_use_tensor_op_math(bool use_tensor_op_math) {
+#if CUDNN_VERSION >= 7000
+    cudnnMathType_t math_type =
+        (use_tensor_op_math ? CUDNN_TENSOR_OP_MATH : CUDNN_DEFAULT_MATH);
+    if (RnnTensorOpMathEnabled()) {
+      cudnnStatus_t status =
+          wrap::cudnnSetRNNMatrixMathType(parent_, rnn_desc_, math_type);
+      if (status != CUDNN_STATUS_SUCCESS) {
+        LOG(FATAL) << "could not set cudnn RNN math type: "
+                   << ToString(status);
+      }
+    }
+#endif
+  }
   cudnnRNNDescriptor_t handle() const {
     if (!ok()) return nullptr;
     return rnn_desc_;
@@ -2281,7 +2312,6 @@ struct ConvDoFP32ComputationFP16Input {
 
 // A group of helper functions to return the internal compute type for
 // convolutions in cudnn.
-// TODO(yangzihao): Add support for float64.
 template <typename T>
 cudnnDataType_t GetConvComputeType() {
   return CUDNN_DATA_FLOAT;
@@ -2296,6 +2326,11 @@ cudnnDataType_t GetConvComputeType<Eigen::half>() {
   }
 }
 
+template <>
+cudnnDataType_t GetConvComputeType<double>() {
+  return CUDNN_DATA_DOUBLE;
+}
+
 }  // namespace
 
 template <class T>
@@ -2324,9 +2359,15 @@ bool CudnnSupport::DoConvolveImpl(
     LOG(FATAL) << "failed to set stream for cudnn handle: " << ToString(status);
   }
   // Alpha is the scaling factor for input.
-  float alpha = 1.0;
+  float falpha = 1.0;
+  double dalpha = 1.0;
+  void* alpha = cudnn_type == CUDNN_DATA_DOUBLE ? static_cast<void*>(&dalpha)
+                                                : static_cast<void*>(&falpha);
   // Beta is the scaling factor for output.
-  float beta = 0.0;
+  float fbeta = 0.0;
+  double dbeta = 0.0;
+  void* beta = cudnn_type == CUDNN_DATA_DOUBLE ? static_cast<void*>(&dbeta)
+                                               : static_cast<void*>(&fbeta);
 
   const bool is_profiling = output_profile_result != nullptr;
   cudnnConvolutionFwdAlgo_t algo;
@@ -2464,11 +2505,11 @@ bool CudnnSupport::DoConvolveImpl(
   }
   status = wrap::cudnnConvolutionForward(
       parent_, ToHandle(dnn_handle_),
-      /*alpha=*/&alpha, /*srcDesc=*/input_nd.handle(),
+      /*alpha=*/alpha, /*srcDesc=*/input_nd.handle(),
       /*srcData=*/input_data.opaque(), /*filterDesc=*/filter.handle(),
       /*filterData=*/filter_data.opaque(), /*convDesc=*/conv.handle(),
       /*algo=*/algo, /*workSpace=*/scratch.opaque(),
-      /*workSpaceSizeInBytes=*/scratch.size(), /*beta=*/&beta,
+      /*workSpaceSizeInBytes=*/scratch.size(), /*beta=*/beta,
       /*destDesc=*/output_nd.handle(), /*destData=*/output_data->opaque());
 
   if (is_profiling) {
@@ -2943,10 +2984,14 @@ bool CudnnSupport::DoConvolve(
     const FilterDescriptor& filter_descriptor,
     const DeviceMemory<double>& filter_data,
     const ConvolutionDescriptor& convolution_descriptor,
-    const BatchDescriptor& output_descriptor,
-    DeviceMemory<double>* output_data) {
-  LOG(ERROR) << "double-based DNN not yet implemented";
-  return false;
+    const BatchDescriptor& output_descriptor, DeviceMemory<double>* output_data,
+    ScratchAllocator* scratch_allocator,
+    const dnn::AlgorithmConfig& algorithm_config,
+    dnn::ProfileResult* output_profile_result) {
+  return DoConvolveImpl<double>(
+      stream, batch_descriptor, input_data, filter_descriptor, filter_data,
+      convolution_descriptor, output_descriptor, output_data, scratch_allocator,
+      algorithm_config, output_profile_result);
 }
 
 bool CudnnSupport::DoConvolve(
@@ -3151,10 +3196,17 @@ bool CudnnSupport::DoConvolveBackwardDataImpl(
     LOG(FATAL) << "failed to set stream for cudnn handle: " << ToString(status);
   }
 
+  cudnnDataType_t cudnn_type = GetCudnnDataType<T>();
   // Alpha is the scaling factor for input.
-  float alpha = 1.0;
+  float falpha = 1.0;
+  double dalpha = 1.0;
+  void* alpha = cudnn_type == CUDNN_DATA_DOUBLE ? static_cast<void*>(&dalpha)
+                                                : static_cast<void*>(&falpha);
   // Beta is the scaling factor for output.
-  float beta = 0.0;
+  float fbeta = 0.0;
+  double dbeta = 0.0;
+  void* beta = cudnn_type == CUDNN_DATA_DOUBLE ? static_cast<void*>(&dbeta)
+                                               : static_cast<void*>(&fbeta);
 
   // TBD(keveman): remove once cuDNN supports kBatchYXDepth for backward pass.
   BatchDescriptor output_descriptor;
@@ -3163,7 +3215,6 @@ bool CudnnSupport::DoConvolveBackwardDataImpl(
   backward_output_data = MaybeTransformLayout(
       stream, &output_descriptor, backward_output_data, &transform_scratch);
 
-  cudnnDataType_t cudnn_type = GetCudnnDataType<T>();
   ScopedTensorDescriptor out_back_nd{parent_, output_descriptor, cudnn_type};
   ScopedTensorDescriptor in_back_nd{parent_, input_descriptor, cudnn_type};
   ScopedFilterDescriptor filter{parent_, filter_descriptor, input_descriptor,
@@ -3310,7 +3361,7 @@ bool CudnnSupport::DoConvolveBackwardDataImpl(
   status = wrap::cudnnConvolutionBackwardData_v3(
 #endif
       parent_, ToHandle(dnn_handle_),
-      /*alpha=*/&alpha,
+      /*alpha=*/alpha,
       /*filterDesc=*/filter.handle(),
       /*filterData=*/filter_data.opaque(),
       /*diffDesc=*/out_back_nd.handle(),
@@ -3319,7 +3370,7 @@ bool CudnnSupport::DoConvolveBackwardDataImpl(
       /*algo=*/algo,
       /*workSpace=*/scratch.opaque(),
       /*workSpaceSizeInBytes=*/scratch.size(),
-      /*beta=*/&beta,
+      /*beta=*/beta,
       /*gradDesc=*/in_back_nd.handle(),
       /*gradData=*/backward_input_data->opaque());
   if (is_profiling) {
@@ -3344,10 +3395,28 @@ bool CudnnSupport::DoConvolveBackwardDataImpl(
   return true;
 }
 
+bool CudnnSupport::DoConvolveBackwardData(
+    Stream* stream, const FilterDescriptor& filter_descriptor,
+    const DeviceMemory<double>& filter_data,
+    const BatchDescriptor& output_descriptor,
+    DeviceMemory<double> backward_output_data,
+    const ConvolutionDescriptor& convolution_descriptor,
+    const BatchDescriptor& input_descriptor,
+    DeviceMemory<double>* backward_input_data,
+    ScratchAllocator* scratch_allocator,
+    const dnn::AlgorithmConfig& algorithm_config,
+    dnn::ProfileResult* output_profile_result) {
+  return DoConvolveBackwardDataImpl(stream, filter_descriptor, filter_data,
+                                    output_descriptor, backward_output_data,
+                                    convolution_descriptor, input_descriptor,
+                                    backward_input_data, scratch_allocator,
+                                    algorithm_config, output_profile_result);
+}
+
 bool CudnnSupport::DoConvolveBackwardData(
     Stream* stream, const FilterDescriptor& filter_descriptor,
     const DeviceMemory<float>& filter_data,
-    const BatchDescriptor& output_descriptor_in,
+    const BatchDescriptor& output_descriptor,
     DeviceMemory<float> backward_output_data,
     const ConvolutionDescriptor& convolution_descriptor,
     const BatchDescriptor& input_descriptor,
@@ -3356,7 +3425,7 @@ bool CudnnSupport::DoConvolveBackwardData(
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
   return DoConvolveBackwardDataImpl(stream, filter_descriptor, filter_data,
-                                    output_descriptor_in, backward_output_data,
+                                    output_descriptor, backward_output_data,
                                     convolution_descriptor, input_descriptor,
                                     backward_input_data, scratch_allocator,
                                     algorithm_config, output_profile_result);
@@ -3365,7 +3434,7 @@ bool CudnnSupport::DoConvolveBackwardData(
 bool CudnnSupport::DoConvolveBackwardData(
     Stream* stream, const FilterDescriptor& filter_descriptor,
     const DeviceMemory<Eigen::half>& filter_data,
-    const BatchDescriptor& output_descriptor_in,
+    const BatchDescriptor& output_descriptor,
     DeviceMemory<Eigen::half> backward_output_data,
     const ConvolutionDescriptor& convolution_descriptor,
     const BatchDescriptor& input_descriptor,
@@ -3374,7 +3443,7 @@ bool CudnnSupport::DoConvolveBackwardData(
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
   return DoConvolveBackwardDataImpl(stream, filter_descriptor, filter_data,
-                                    output_descriptor_in, backward_output_data,
+                                    output_descriptor, backward_output_data,
                                     convolution_descriptor, input_descriptor,
                                     backward_input_data, scratch_allocator,
                                     algorithm_config, output_profile_result);
@@ -3398,10 +3467,17 @@ bool CudnnSupport::DoConvolveBackwardFilterImpl(
     LOG(FATAL) << "failed to set stream for cudnn handle: " << ToString(status);
   }
 
+  cudnnDataType_t cudnn_type = GetCudnnDataType<T>();
   // Alpha is the scaling factor for input.
-  float alpha = 1.0;
+  float falpha = 1.0;
+  double dalpha = 1.0;
+  void* alpha = cudnn_type == CUDNN_DATA_DOUBLE ? static_cast<void*>(&dalpha)
+                                                : static_cast<void*>(&falpha);
   // Beta is the scaling factor for output.
-  float beta = 0.0;
+  float fbeta = 0.0;
+  double dbeta = 0.0;
+  void* beta = cudnn_type == CUDNN_DATA_DOUBLE ? static_cast<void*>(&dbeta)
+                                               : static_cast<void*>(&fbeta);
 
   // TBD(keveman): remove once cuDNN supports kBatchYXDepth for backward pass.
   BatchDescriptor output_descriptor;
@@ -3410,7 +3486,6 @@ bool CudnnSupport::DoConvolveBackwardFilterImpl(
   backward_output_data = MaybeTransformLayout(
       stream, &output_descriptor, backward_output_data, &transform_scratch);
 
-  cudnnDataType_t cudnn_type = GetCudnnDataType<T>();
   ScopedTensorDescriptor out_back_nd{parent_, output_descriptor, cudnn_type};
   ScopedTensorDescriptor input_nd{parent_, input_descriptor, cudnn_type};
   ScopedFilterDescriptor filter{parent_, filter_descriptor, input_descriptor,
@@ -3557,7 +3632,7 @@ bool CudnnSupport::DoConvolveBackwardFilterImpl(
 #else
   status = wrap::cudnnConvolutionBackwardFilter_v3(
 #endif
-      parent_, ToHandle(dnn_handle_), /*alpha=*/&alpha,
+      parent_, ToHandle(dnn_handle_), /*alpha=*/alpha,
       /*srcDesc=*/input_nd.handle(),
       /*srcData=*/input_data.opaque(),
       /*diffDesc=*/out_back_nd.handle(),
@@ -3566,7 +3641,7 @@ bool CudnnSupport::DoConvolveBackwardFilterImpl(
       /*algo=*/algo,
       /*workSpace=*/scratch.opaque(),
       /*workSpaceSizeInBytes=*/scratch.size(),
-      /*beta=*/&beta,
+      /*beta=*/beta,
       /*gradDesc=*/filter.handle(),
       /*gradData=*/backward_filter_data->opaque());
 
@@ -3592,10 +3667,28 @@ bool CudnnSupport::DoConvolveBackwardFilterImpl(
   return true;
 }
 
+bool CudnnSupport::DoConvolveBackwardFilter(
+    Stream* stream, const dnn::BatchDescriptor& input_descriptor,
+    const DeviceMemory<double>& input_data,
+    const dnn::BatchDescriptor& output_descriptor,
+    DeviceMemory<double> backward_output_data,
+    const dnn::ConvolutionDescriptor& convolution_descriptor,
+    const dnn::FilterDescriptor& filter_descriptor,
+    DeviceMemory<double>* backward_filter_data,
+    ScratchAllocator* scratch_allocator,
+    const dnn::AlgorithmConfig& algorithm_config,
+    dnn::ProfileResult* output_profile_result) {
+  return DoConvolveBackwardFilterImpl(stream, input_descriptor, input_data,
+                                      output_descriptor, backward_output_data,
+                                      convolution_descriptor, filter_descriptor,
+                                      backward_filter_data, scratch_allocator,
+                                      algorithm_config, output_profile_result);
+}
+
 bool CudnnSupport::DoConvolveBackwardFilter(
     Stream* stream, const dnn::BatchDescriptor& input_descriptor,
     const DeviceMemory<float>& input_data,
-    const dnn::BatchDescriptor& output_descriptor_in,
+    const dnn::BatchDescriptor& output_descriptor,
     DeviceMemory<float> backward_output_data,
     const dnn::ConvolutionDescriptor& convolution_descriptor,
     const dnn::FilterDescriptor& filter_descriptor,
@@ -3603,17 +3696,17 @@ bool CudnnSupport::DoConvolveBackwardFilter(
     ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveBackwardFilterImpl(
-      stream, input_descriptor, input_data, output_descriptor_in,
-      backward_output_data, convolution_descriptor, filter_descriptor,
-      backward_filter_data, scratch_allocator, algorithm_config,
-      output_profile_result);
+  return DoConvolveBackwardFilterImpl(stream, input_descriptor, input_data,
+                                      output_descriptor, backward_output_data,
+                                      convolution_descriptor, filter_descriptor,
+                                      backward_filter_data, scratch_allocator,
+                                      algorithm_config, output_profile_result);
 }
 
 bool CudnnSupport::DoConvolveBackwardFilter(
     Stream* stream, const dnn::BatchDescriptor& input_descriptor,
     const DeviceMemory<Eigen::half>& input_data,
-    const dnn::BatchDescriptor& output_descriptor_in,
+    const dnn::BatchDescriptor& output_descriptor,
     DeviceMemory<Eigen::half> backward_output_data,
     const dnn::ConvolutionDescriptor& convolution_descriptor,
     const dnn::FilterDescriptor& filter_descriptor,
@@ -3621,11 +3714,11 @@ bool CudnnSupport::DoConvolveBackwardFilter(
     ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveBackwardFilterImpl(
-      stream, input_descriptor, input_data, output_descriptor_in,
-      backward_output_data, convolution_descriptor, filter_descriptor,
-      backward_filter_data, scratch_allocator, algorithm_config,
-      output_profile_result);
+  return DoConvolveBackwardFilterImpl(stream, input_descriptor, input_data,
+                                      output_descriptor, backward_output_data,
+                                      convolution_descriptor, filter_descriptor,
+                                      backward_filter_data, scratch_allocator,
+                                      algorithm_config, output_profile_result);
 }
 
 template <class T>
diff --git a/tensorflow/stream_executor/cuda/cuda_dnn.h b/tensorflow/stream_executor/cuda/cuda_dnn.h
index 40aa974dd967df50075da6f2bb34439cd238a113..48d56f71e3195a897b6216ab9f5709326d1b86d3 100644
--- a/tensorflow/stream_executor/cuda/cuda_dnn.h
+++ b/tensorflow/stream_executor/cuda/cuda_dnn.h
@@ -259,7 +259,10 @@ class CudnnSupport : public dnn::DnnSupport {
                   const DeviceMemory<double>& filter_data,
                   const dnn::ConvolutionDescriptor& convolution_descriptor,
                   const dnn::BatchDescriptor& output_descriptor,
-                  DeviceMemory<double>* output_data) override;
+                  DeviceMemory<double>* output_data,
+                  ScratchAllocator* scratch_allocator,
+                  const dnn::AlgorithmConfig& algorithm_config,
+                  dnn::ProfileResult* output_profile_result) override;
 
   bool DoConvolve(Stream* stream, const dnn::BatchDescriptor& batch_descriptor,
                   const DeviceMemory<Eigen::half>& input_data,
@@ -371,6 +374,18 @@ class CudnnSupport : public dnn::DnnSupport {
     return false;
   }
 
+  bool DoConvolveBackwardData(
+      Stream* stream, const dnn::FilterDescriptor& filter_descriptor,
+      const DeviceMemory<double>& filter_data,
+      const dnn::BatchDescriptor& output_descriptor,
+      DeviceMemory<double> backward_output_data,
+      const dnn::ConvolutionDescriptor& convolution_descriptor,
+      const dnn::BatchDescriptor& input_descriptor,
+      DeviceMemory<double>* backward_input_data,
+      ScratchAllocator* scratch_allocator,
+      const dnn::AlgorithmConfig& algorithm_config,
+      dnn::ProfileResult* output_profile_result) override;
+
   bool DoConvolveBackwardData(
       Stream* stream, const dnn::FilterDescriptor& filter_descriptor,
       const DeviceMemory<float>& filter_data,
@@ -395,6 +410,18 @@ class CudnnSupport : public dnn::DnnSupport {
       const dnn::AlgorithmConfig& algorithm_config,
       dnn::ProfileResult* output_profile_result) override;
 
+  bool DoConvolveBackwardFilter(
+      Stream* stream, const dnn::BatchDescriptor& input_descriptor,
+      const DeviceMemory<double>& input_data,
+      const dnn::BatchDescriptor& output_descriptor,
+      DeviceMemory<double> backward_output_data,
+      const dnn::ConvolutionDescriptor& convolution_descriptor,
+      const dnn::FilterDescriptor& filter_descriptor,
+      DeviceMemory<double>* backward_filter_data,
+      ScratchAllocator* scratch_allocator,
+      const dnn::AlgorithmConfig& algorithm_config,
+      dnn::ProfileResult* output_profile_result) override;
+
   bool DoConvolveBackwardFilter(
       Stream* stream, const dnn::BatchDescriptor& input_descriptor,
       const DeviceMemory<float>& input_data,
diff --git a/tensorflow/stream_executor/cuda/cuda_driver.cc b/tensorflow/stream_executor/cuda/cuda_driver.cc
index a017ff64d4c69b6952b442464877dc26a800ad37..58e1e58c593a3d938d97baff2356bce2c215a7a1 100644
--- a/tensorflow/stream_executor/cuda/cuda_driver.cc
+++ b/tensorflow/stream_executor/cuda/cuda_driver.cc
@@ -1503,6 +1503,19 @@ static port::StatusOr<T> GetSimpleAttribute(CUdevice device,
   return true;
 }
 
+/* static */ port::StatusOr<int> CUDADriver::GetDeviceAttribute(
+    CUdevice_attribute attribute, CUdevice device) {
+  int val;
+  CUresult res = cuDeviceGetAttribute(&val, attribute, device);
+  if (res != CUDA_SUCCESS) {
+    return port::Status{
+        port::error::INTERNAL,
+        port::Printf("failed to get device attribute %d for device %d: %s",
+                     attribute, device, ToString(res).c_str())};
+  }
+  return val;
+}
+
 /* static */ bool CUDADriver::IsEccEnabled(CUdevice device, bool *result) {
   int value = -1;
   CUresult res =
diff --git a/tensorflow/stream_executor/cuda/cuda_driver.h b/tensorflow/stream_executor/cuda/cuda_driver.h
index 4002ba2021d1a2e2c36bd1786a3084ee8c08bb78..fa9172b3f008d3083309126bbfa4a1ab961030e1 100644
--- a/tensorflow/stream_executor/cuda/cuda_driver.h
+++ b/tensorflow/stream_executor/cuda/cuda_driver.h
@@ -400,12 +400,20 @@ class CUDADriver {
 
   // Returns a grab-bag of device properties in a caller-owned device_properties
   // structure for device_ordinal via cuDeviceGetProperties.
-  // This call is deprecated in the NVIDIA driver API.
+  //
+  // This call is deprecated in the NVIDIA driver API; its replacement is
+  // GetDeviceAttribute.
   //
   // http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE__DEPRECATED.html#group__CUDA__DEVICE__DEPRECATED_1g65a5b4e25186bd257df80b98c98cffe6
   static bool GetDeviceProperties(CUdevprop *device_properties,
                                   int device_ordinal);
 
+  // Gets a specific integer-valued property of the given device.
+  //
+  // http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE.html#group__CUDA__DEVICE_1g9c3e1414f0ad901d3278a4d6645fc266
+  static port::StatusOr<int> GetDeviceAttribute(CUdevice_attribute attribute,
+                                                CUdevice device);
+
   // Returns whether ECC is enabled for the given CUdevice via
   // cuDeviceGetattribute with CU_DEVICE_ATTRIBUTE_ECC_ENABLED.
   // http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE.html#group__CUDA__DEVICE_1g9c3e1414f0ad901d3278a4d6645fc266
diff --git a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
index 4bbd531e14f18fc24d87b4fa655fe72e9f56b129..5ecaf46b8cae3c1e1f312816e7e5aec8ff8ce306 100644
--- a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
+++ b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
@@ -1103,6 +1103,18 @@ DeviceDescription *CUDAExecutor::PopulateDeviceDescription() const {
     builder.set_device_memory_size(device_memory_size);
   }
 
+  port::StatusOr<int> mem_clock_khz = CUDADriver::GetDeviceAttribute(
+      CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE, device_ordinal_);
+  port::StatusOr<int> mem_bus_width_bits = CUDADriver::GetDeviceAttribute(
+      CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH, device_ordinal_);
+  if (mem_clock_khz.ok() && mem_bus_width_bits.ok()) {
+    // Times 2 because HBM is DDR memory; it transfers two data bits per clock
+    // cycle on each data lane.
+    builder.set_memory_bandwidth(2 * int64_t{mem_clock_khz.ValueOrDie()} *
+                                 1000 *
+                                 int64_t{mem_bus_width_bits.ValueOrDie()} / 8);
+  }
+
   {
     BlockDim block_dim_limit;
     FillBlockDimLimit(&block_dim_limit);
diff --git a/tensorflow/stream_executor/device_description.cc b/tensorflow/stream_executor/device_description.cc
index a98143e34bbb42c3aee76c27e1648c49397a0e44..52f5319a3b16c771ce89843a963841b25df5467e 100644
--- a/tensorflow/stream_executor/device_description.cc
+++ b/tensorflow/stream_executor/device_description.cc
@@ -50,6 +50,7 @@ DeviceDescription::DeviceDescription()
       shared_memory_alloc_granularity_(1),
       device_address_bits_(kUninitializedUint64),
       device_memory_size_(kUninitializedUint64),
+      memory_bandwidth_(kUninitializedUint64),
       shared_memory_per_core_(kUninitializedUint64),
       shared_memory_per_block_(kUninitializedUint64),
       clock_rate_ghz_(-1.0),
@@ -85,6 +86,8 @@ std::unique_ptr<std::map<string, string>> DeviceDescription::ToMap() const {
   result["Device Address Bits"] = port::StrCat(device_address_bits());
   result["Device Memory Size"] =
       port::HumanReadableNumBytes::ToString(device_memory_size());
+  result["Memory Bandwidth"] = port::StrCat(
+      port::HumanReadableNumBytes::ToString(memory_bandwidth_), "/s");
 
   result["Shared Memory Per Core"] =
       port::HumanReadableNumBytes::ToString(shared_memory_per_core_);
diff --git a/tensorflow/stream_executor/device_description.h b/tensorflow/stream_executor/device_description.h
index f2b35bcb4345a37f72541979564cbbb7944595c2..fcf0928096ed1f1bdf0499efb92af2bc9cb0eaa2 100644
--- a/tensorflow/stream_executor/device_description.h
+++ b/tensorflow/stream_executor/device_description.h
@@ -140,6 +140,11 @@ class DeviceDescription {
   // Returns the device memory size in bytes.
   uint64 device_memory_size() const { return device_memory_size_; }
 
+  // Returns the device's memory bandwidth in bytes/sec.  (This is for
+  // reads/writes to/from the device's own memory, not for transfers between the
+  // host and device.)
+  uint64 memory_bandwidth() const { return memory_bandwidth_; }
+
   // Returns the device's core clock rate in GHz.
   float clock_rate_ghz() const { return clock_rate_ghz_; }
 
@@ -212,6 +217,7 @@ class DeviceDescription {
 
   uint64 device_address_bits_;
   uint64 device_memory_size_;
+  uint64 memory_bandwidth_;
 
   // Shared memory limits on a given device.
   uint64 shared_memory_per_core_;
@@ -305,6 +311,9 @@ class DeviceDescriptionBuilder {
   void set_device_memory_size(uint64 value) {
     device_description_->device_memory_size_ = value;
   }
+  void set_memory_bandwidth(uint64 value) {
+    device_description_->memory_bandwidth_ = value;
+  }
 
   void set_shared_memory_per_core(int64 value) {
     device_description_->shared_memory_per_core_ = value;
diff --git a/tensorflow/stream_executor/dnn.h b/tensorflow/stream_executor/dnn.h
index aa88fe770f3596e5da5e12705c3b706365382134..b41536e638873412a31a0cdbbd3ba3a818dd9cf2 100644
--- a/tensorflow/stream_executor/dnn.h
+++ b/tensorflow/stream_executor/dnn.h
@@ -1172,7 +1172,9 @@ class DnnSupport {
       const DeviceMemory<double>& filter_data,
       const dnn::ConvolutionDescriptor& convolution_descriptor,
       const dnn::BatchDescriptor& output_descriptor,
-      DeviceMemory<double>* output_data) = 0;
+      DeviceMemory<double>* output_data, ScratchAllocator* scratch_allocator,
+      const dnn::AlgorithmConfig& algorithm_config,
+      dnn::ProfileResult* output_profile_result) = 0;
 
   // Enqueues a half-precision convolution operation onto the stream.
   // See DoConvolve above for argument details.
@@ -1273,6 +1275,18 @@ class DnnSupport {
       bool with_winograd_nonfused, int cc_major, int cc_minor,
       std::vector<AlgorithmDesc>* out_algorithms);
 
+  virtual bool DoConvolveBackwardData(
+      Stream* stream, const FilterDescriptor& filter_descriptor,
+      const DeviceMemory<double>& filter_data,
+      const BatchDescriptor& output_descriptor,
+      DeviceMemory<double> backward_output_data,
+      const ConvolutionDescriptor& convolution_descriptor,
+      const BatchDescriptor& input_descriptor,
+      DeviceMemory<double>* backward_input_data,
+      ScratchAllocator* scratch_allocator,
+      const dnn::AlgorithmConfig& algorithm_config,
+      ProfileResult* output_profile_result) = 0;
+
   virtual bool DoConvolveBackwardData(
       Stream* stream, const FilterDescriptor& filter_descriptor,
       const DeviceMemory<Eigen::half>& filter_data,
@@ -1322,6 +1336,18 @@ class DnnSupport {
       bool with_winograd_nonfused, int cc_major, int cc_minor,
       std::vector<AlgorithmDesc>* out_algorithms);
 
+  virtual bool DoConvolveBackwardFilter(
+      Stream* stream, const BatchDescriptor& input_descriptor,
+      const DeviceMemory<double>& input_data,
+      const BatchDescriptor& output_descriptor,
+      DeviceMemory<double> backward_output_data,
+      const ConvolutionDescriptor& convolution_descriptor,
+      const FilterDescriptor& filter_descriptor,
+      DeviceMemory<double>* backward_filter_data,
+      ScratchAllocator* scratch_allocator,
+      const dnn::AlgorithmConfig& algorithm_config,
+      ProfileResult* output_profile_result) = 0;
+
   virtual bool DoConvolveBackwardFilter(
       Stream* stream, const BatchDescriptor& input_descriptor,
       const DeviceMemory<Eigen::half>& input_data,
diff --git a/tensorflow/stream_executor/multi_platform_manager.cc b/tensorflow/stream_executor/multi_platform_manager.cc
index f23224ae772b9c5915426feaef1155fc9711f075..f9f3737a06dad3f146ef9fc8e2ec50160b3a01b5 100644
--- a/tensorflow/stream_executor/multi_platform_manager.cc
+++ b/tensorflow/stream_executor/multi_platform_manager.cc
@@ -23,11 +23,37 @@ limitations under the License.
 namespace perftools {
 namespace gputools {
 
+/* static */ mutex MultiPlatformManager::platforms_mutex_{LINKER_INITIALIZED};
+
+/* static */ port::StatusOr<Platform*> MultiPlatformManager::LookupByNameLocked(
+    const string& target) {
+  PlatformMap* platform_map = GetPlatformMap();
+  auto it = platform_map->find(port::Lowercase(target));
+  if (it == platform_map->end()) {
+    return port::Status(
+        port::error::NOT_FOUND,
+        "could not find registered platform with name: \"" + target + "\"");
+  }
+  return it->second;
+}
+
+/* static */ port::StatusOr<Platform*> MultiPlatformManager::LookupByIdLocked(
+    const Platform::Id& id) {
+  PlatformIdMap* platform_map = GetPlatformByIdMap();
+  auto it = platform_map->find(id);
+  if (it == platform_map->end()) {
+    return port::Status(
+        port::error::NOT_FOUND,
+        port::Printf("could not find registered platform with id: 0x%p", id));
+  }
+  return it->second;
+}
+
 /* static */ port::Status MultiPlatformManager::RegisterPlatform(
     std::unique_ptr<Platform> platform) {
   CHECK(platform != nullptr);
   string key = port::Lowercase(platform->Name());
-  mutex_lock lock(GetPlatformsMutex());
+  mutex_lock lock(platforms_mutex_);
   if (GetPlatformMap()->find(key) != GetPlatformMap()->end()) {
     return port::Status(port::error::INTERNAL,
                         "platform is already registered with name: \"" +
@@ -45,33 +71,63 @@ namespace gputools {
 
 /* static */ port::StatusOr<Platform*> MultiPlatformManager::PlatformWithName(
     const string& target) {
-  tf_shared_lock lock(GetPlatformsMutex());
-  auto it = GetPlatformMap()->find(port::Lowercase(target));
+  mutex_lock lock(platforms_mutex_);
 
-  if (it == GetPlatformMap()->end()) {
-    return port::Status(
-        port::error::NOT_FOUND,
-        "could not find registered platform with name: \"" + target + "\"");
+  SE_ASSIGN_OR_RETURN(Platform * platform, LookupByNameLocked(target));
+  if (!platform->Initialized()) {
+    SE_RETURN_IF_ERROR(platform->Initialize({}));
   }
 
-  return it->second;
+  return platform;
 }
 
 /* static */ port::StatusOr<Platform*> MultiPlatformManager::PlatformWithId(
     const Platform::Id& id) {
-  tf_shared_lock lock(GetPlatformsMutex());
-  auto it = GetPlatformByIdMap()->find(id);
-  if (it == GetPlatformByIdMap()->end()) {
+  mutex_lock lock(platforms_mutex_);
+
+  SE_ASSIGN_OR_RETURN(Platform * platform, LookupByIdLocked(id));
+  if (!platform->Initialized()) {
+    SE_RETURN_IF_ERROR(platform->Initialize({}));
+  }
+
+  return platform;
+}
+
+/* static */ port::StatusOr<Platform*>
+MultiPlatformManager::InitializePlatformWithName(
+    const string& target, const std::map<string, string>& options) {
+  mutex_lock lock(platforms_mutex_);
+
+  SE_ASSIGN_OR_RETURN(Platform * platform, LookupByNameLocked(target));
+  if (platform->Initialized()) {
+    return port::Status(port::error::FAILED_PRECONDITION,
+                        "platform \"" + target + "\" is already initialized");
+  }
+
+  SE_RETURN_IF_ERROR(platform->Initialize(options));
+
+  return platform;
+}
+
+/* static */ port::StatusOr<Platform*>
+MultiPlatformManager::InitializePlatformWithId(
+    const Platform::Id& id, const std::map<string, string>& options) {
+  mutex_lock lock(platforms_mutex_);
+
+  SE_ASSIGN_OR_RETURN(Platform * platform, LookupByIdLocked(id));
+  if (platform->Initialized()) {
     return port::Status(
-        port::error::NOT_FOUND,
-        port::Printf("could not find registered platform with id: 0x%p", id));
+        port::error::FAILED_PRECONDITION,
+        port::Printf("platform with id 0x%p is already initialized", id));
   }
 
-  return it->second;
+  SE_RETURN_IF_ERROR(platform->Initialize(options));
+
+  return platform;
 }
 
 /* static */ void MultiPlatformManager::ClearPlatformRegistry() {
-  mutex_lock lock(GetPlatformsMutex());
+  mutex_lock lock(platforms_mutex_);
   GetPlatformMap()->clear();
   GetPlatformByIdMap()->clear();
 }
diff --git a/tensorflow/stream_executor/multi_platform_manager.h b/tensorflow/stream_executor/multi_platform_manager.h
index ea6155b4826439b98262530e70e6463eb1fda237..438653ee20bdb1fd83cd9e75c4bcd35af277cc28 100644
--- a/tensorflow/stream_executor/multi_platform_manager.h
+++ b/tensorflow/stream_executor/multi_platform_manager.h
@@ -67,13 +67,13 @@ limitations under the License.
 #include <functional>
 #include <map>
 #include <memory>
-#include "tensorflow/stream_executor/platform/port.h"
 
 #include "tensorflow/stream_executor/lib/status.h"
 #include "tensorflow/stream_executor/lib/statusor.h"
 #include "tensorflow/stream_executor/platform.h"
 #include "tensorflow/stream_executor/platform/mutex.h"
 #include "tensorflow/stream_executor/platform/port.h"
+#include "tensorflow/stream_executor/platform/thread_annotations.h"
 
 namespace perftools {
 namespace gputools {
@@ -85,26 +85,43 @@ class MultiPlatformManager {
   // already registered. The associated listener, if not null, will be used to
   // trace events for ALL executors for that platform.
   // Takes ownership of listener.
-  static port::Status RegisterPlatform(std::unique_ptr<Platform> platform);
+  static port::Status RegisterPlatform(std::unique_ptr<Platform> platform)
+      LOCKS_EXCLUDED(platforms_mutex_);
 
-  // Retrieves the platform registered with the given platform name; e.g.
-  // "CUDA", "OpenCL", ...
+  // Retrieves the platform registered with the given platform name (e.g.
+  // "CUDA", "OpenCL", ...) or id (an opaque, comparable value provided by the
+  // Platform's Id() method).
+  //
+  // If the platform has not already been initialized, it will be initialized
+  // with a default set of parameters.
   //
   // If the requested platform is not registered, an error status is returned.
   // Ownership of the platform is NOT transferred to the caller --
   // the MultiPlatformManager owns the platforms in a singleton-like fashion.
-  static port::StatusOr<Platform*> PlatformWithName(const string& target);
-
-  // Retrieves the platform registered with the given platform ID, which
-  // is an opaque (but comparable) value.
+  static port::StatusOr<Platform*> PlatformWithName(const string& target)
+      LOCKS_EXCLUDED(platforms_mutex_);
+  static port::StatusOr<Platform*> PlatformWithId(const Platform::Id& id)
+      LOCKS_EXCLUDED(platforms_mutex_);
+
+  // Retrieves the platform registered with the given platform name (e.g.
+  // "CUDA", "OpenCL", ...) or id (an opaque, comparable value provided by the
+  // Platform's Id() method).
+  //
+  // The platform will be initialized with the given options. If the platform
+  // was already initialized, an error will be returned.
   //
   // If the requested platform is not registered, an error status is returned.
   // Ownership of the platform is NOT transferred to the caller --
   // the MultiPlatformManager owns the platforms in a singleton-like fashion.
-  static port::StatusOr<Platform*> PlatformWithId(const Platform::Id& id);
+  static port::StatusOr<Platform*> InitializePlatformWithName(
+      const string& target, const std::map<string, string>& options)
+      LOCKS_EXCLUDED(platforms_mutex_);
+  static port::StatusOr<Platform*> InitializePlatformWithId(
+      const Platform::Id& id, const std::map<string, string>& options)
+      LOCKS_EXCLUDED(platforms_mutex_);
 
   // Clears the set of registered platforms, primarily used for testing.
-  static void ClearPlatformRegistry();
+  static void ClearPlatformRegistry() LOCKS_EXCLUDED(platforms_mutex_);
 
   // Although the MultiPlatformManager "owns" its platforms, it holds them as
   // undecorated pointers to prevent races during program exit (between this
@@ -122,17 +139,16 @@ class MultiPlatformManager {
 
   // Provides access to the available set of platforms under a lock.
   static port::Status WithPlatforms(
-      std::function<port::Status(PlatformMap*)> callback) {
-    mutex_lock lock(GetPlatformsMutex());
+      std::function<port::Status(PlatformMap*)> callback)
+      LOCKS_EXCLUDED(platforms_mutex_) {
+    mutex_lock lock(platforms_mutex_);
     return callback(GetPlatformMap());
   }
 
  private:
-  // mutex that guards the platform map.
-  static mutex& GetPlatformsMutex() {
-    static mutex* platforms_mutex = new mutex;
-    return *platforms_mutex;
-  }
+  using PlatformIdMap = std::map<Platform::Id, Platform*>;
+
+  static mutex platforms_mutex_;
 
   // TODO(b/22689637): Clean up these two maps; make sure they coexist nicely.
   // TODO(b/22689637): Move this (whatever the final/"official" map is) to
@@ -147,12 +163,21 @@ class MultiPlatformManager {
 
   // Holds a Platform::Id-to-object mapping.
   // Unlike platforms_ above, this map does not own its contents.
-  static std::map<Platform::Id, Platform*>* GetPlatformByIdMap() {
-    using PlatformIdMap = std::map<Platform::Id, Platform*>;
+  static PlatformIdMap* GetPlatformByIdMap() {
     static PlatformIdMap* instance = new PlatformIdMap;
     return instance;
   }
 
+  // Looks up the platform object with the given name.  The caller must hold
+  // platforms_mutex_.
+  static port::StatusOr<Platform*> LookupByNameLocked(const string& target)
+      EXCLUSIVE_LOCKS_REQUIRED(platforms_mutex_);
+
+  // Looks up the platform object with the given id.  The caller must hold
+  // platforms_mutex_.
+  static port::StatusOr<Platform*> LookupByIdLocked(const Platform::Id& id)
+      EXCLUSIVE_LOCKS_REQUIRED(platforms_mutex_);
+
   SE_DISALLOW_COPY_AND_ASSIGN(MultiPlatformManager);
 };
 
diff --git a/tensorflow/stream_executor/platform.cc b/tensorflow/stream_executor/platform.cc
index 93f08d06dae862f24b5b533395af63139f344f77..4cdc22bd16a3ea66037696f6a9d70bcb86ef5ebb 100644
--- a/tensorflow/stream_executor/platform.cc
+++ b/tensorflow/stream_executor/platform.cc
@@ -85,6 +85,17 @@ StreamExecutorConfig::StreamExecutorConfig(int ordinal_in)
 
 Platform::~Platform() {}
 
+bool Platform::Initialized() const { return true; }
+
+port::Status Platform::Initialize(
+    const std::map<string, string> &platform_options) {
+  if (!platform_options.empty()) {
+    return port::Status(port::error::UNIMPLEMENTED,
+                        "this platform does not support custom initialization");
+  }
+  return port::Status::OK();
+}
+
 port::Status Platform::ForceExecutorShutdown() {
   return port::Status(port::error::UNIMPLEMENTED,
                       "executor shutdown is not supported on this platform");
diff --git a/tensorflow/stream_executor/platform.h b/tensorflow/stream_executor/platform.h
index f0a0e60e02f951018b39ef831cd2f7dd3256f87d..54f8aa86c269ff0d32648e1d4629179cafd5be76 100644
--- a/tensorflow/stream_executor/platform.h
+++ b/tensorflow/stream_executor/platform.h
@@ -111,6 +111,9 @@ class Platform {
   // Returns a key uniquely identifying this platform.
   virtual Id id() const = 0;
 
+  // Name of this platform.
+  virtual const string& Name() const = 0;
+
   // Returns the number of devices accessible on this platform.
   //
   // Note that, though these devices are visible, if there is only one userspace
@@ -118,8 +121,17 @@ class Platform {
   // device, a call to ExecutorForDevice may return an error status.
   virtual int VisibleDeviceCount() const = 0;
 
-  // Name of this platform.
-  virtual const string& Name() const = 0;
+  // Returns true iff the platform has been initialized.
+  virtual bool Initialized() const;
+
+  // Initializes the platform with a custom set of options. The platform must be
+  // initialized before obtaining StreamExecutor objects.  The interpretation of
+  // the platform_options argument is implementation specific.  This method may
+  // return an error if unrecognized options are provided.  If using
+  // MultiPlatformManager, this method will be called automatically by
+  // InitializePlatformWithId/InitializePlatformWithName.
+  virtual port::Status Initialize(
+      const std::map<string, string>& platform_options);
 
   // Returns a device with the given ordinal on this platform with a default
   // plugin configuration or, if none can be found with the given ordinal or
@@ -156,6 +168,8 @@ class Platform {
   // This is only useful on platforms which bind a device to a single process
   // that has obtained the device context. May return UNIMPLEMENTED on platforms
   // that have no reason to destroy device contexts.
+  //
+  // The platform must be reinitialized after this is called.
   virtual port::Status ForceExecutorShutdown();
 
   // Registers a TraceListener to listen to all StreamExecutors for this
diff --git a/tensorflow/stream_executor/stream.cc b/tensorflow/stream_executor/stream.cc
index ba5001e273632c893b05eea64542f1b156e28c47..1e3afde2687657e417e9e2cb3f5e2aaf0600da7a 100644
--- a/tensorflow/stream_executor/stream.cc
+++ b/tensorflow/stream_executor/stream.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include "tensorflow/stream_executor/platform/port.h"
 
+#include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/stream_executor/blas.h"
 #include "tensorflow/stream_executor/host_buffer.h"
 #include "tensorflow/stream_executor/lib/stacktrace.h"
@@ -117,7 +118,9 @@ string ToVlogString(const DeviceMemoryBase *memory) {
   return ToVlogString(*memory);
 }
 
-string ToVlogString(const Eigen::half &h) { return port::StrCat(h); }
+string ToVlogString(const Eigen::half &h) {
+  return port::StrCat(static_cast<float>(h));
+}
 
 string ToVlogString(int i) { return port::StrCat(i); }
 
@@ -681,6 +684,37 @@ Stream &Stream::ThenFusedConvolveWithAlgorithm(
   return *this;
 }
 
+Stream &Stream::ThenConvolveWithAlgorithm(
+    const dnn::BatchDescriptor &input_descriptor,
+    const DeviceMemory<double> &input_data,
+    const dnn::FilterDescriptor &filter_descriptor,
+    const DeviceMemory<double> &filter_data,
+    const dnn::ConvolutionDescriptor &convolution_descriptor,
+    const dnn::BatchDescriptor &output_descriptor, DeviceMemory<double> *output,
+    ScratchAllocator *scratch_allocator,
+    const dnn::AlgorithmConfig &algorithm_config,
+    dnn::ProfileResult *output_profile_result) {
+  VLOG_CALL(PARAM(input_descriptor), PARAM(input_data),
+            PARAM(filter_descriptor), PARAM(filter_data),
+            PARAM(convolution_descriptor), PARAM(output_descriptor),
+            PARAM(output), PARAM(algorithm_config));
+
+  if (ok()) {
+    if (dnn::DnnSupport *dnn = parent_->AsDnn()) {
+      auto status = dnn->DoConvolve(
+          this, input_descriptor, input_data, filter_descriptor, filter_data,
+          convolution_descriptor, output_descriptor, output, scratch_allocator,
+          algorithm_config, output_profile_result);
+      if (!status && !output_profile_result) {
+        SetError();
+      }
+    } else {
+      SetErrorAndLogNoDnnSupport();
+    }
+  }
+  return *this;
+}
+
 Stream &Stream::ThenConvolveWithAlgorithm(
     const dnn::BatchDescriptor &input_descriptor,
     const DeviceMemory<float> &input_data,
@@ -890,6 +924,39 @@ Stream &Stream::ThenConvolveBackwardDataWithScratch(
   return *this;
 }
 
+Stream &Stream::ThenConvolveBackwardDataWithAlgorithm(
+    const dnn::FilterDescriptor &filter_descriptor,
+    const DeviceMemory<double> &filter_data,
+    const dnn::BatchDescriptor &output_descriptor,
+    DeviceMemory<double> backward_output_data,
+    const dnn::ConvolutionDescriptor &convolution_descriptor,
+    const dnn::BatchDescriptor &input_descriptor,
+    DeviceMemory<double> *backward_input_data,
+    ScratchAllocator *scratch_allocator,
+    const dnn::AlgorithmConfig &algorithm_config,
+    dnn::ProfileResult *output_profile_result) {
+  VLOG_CALL(PARAM(filter_descriptor), PARAM(filter_data),
+            PARAM(output_descriptor), PARAM(backward_output_data),
+            PARAM(convolution_descriptor), PARAM(input_descriptor),
+            PARAM(backward_input_data));
+
+  if (ok()) {
+    if (dnn::DnnSupport *dnn = parent_->AsDnn()) {
+      auto status = dnn->DoConvolveBackwardData(
+          this, filter_descriptor, filter_data, output_descriptor,
+          backward_output_data, convolution_descriptor, input_descriptor,
+          backward_input_data, scratch_allocator, algorithm_config,
+          output_profile_result);
+      if (!status && !output_profile_result) {
+        SetError();
+      }
+    } else {
+      SetErrorAndLogNoDnnSupport();
+    }
+  }
+  return *this;
+}
+
 Stream &Stream::ThenConvolveBackwardDataWithAlgorithm(
     const dnn::FilterDescriptor &filter_descriptor,
     const DeviceMemory<float> &filter_data,
@@ -1026,6 +1093,39 @@ Stream &Stream::ThenConvolveBackwardFilterWithScratch(
   return *this;
 }
 
+Stream &Stream::ThenConvolveBackwardFilterWithAlgorithm(
+    const dnn::BatchDescriptor &input_descriptor,
+    const DeviceMemory<double> &input_data,
+    const dnn::BatchDescriptor &output_descriptor,
+    DeviceMemory<double> backward_output_data,
+    const dnn::ConvolutionDescriptor &convolution_descriptor,
+    const dnn::FilterDescriptor &filter_descriptor,
+    DeviceMemory<double> *backward_filter_data,
+    ScratchAllocator *scratch_allocator,
+    const dnn::AlgorithmConfig &algorithm_config,
+    dnn::ProfileResult *output_profile_result) {
+  VLOG_CALL(PARAM(input_descriptor), PARAM(input_data),
+            PARAM(output_descriptor), PARAM(backward_output_data),
+            PARAM(convolution_descriptor), PARAM(filter_descriptor),
+            PARAM(backward_filter_data));
+
+  if (ok()) {
+    if (dnn::DnnSupport *dnn = parent_->AsDnn()) {
+      auto status = dnn->DoConvolveBackwardFilter(
+          this, input_descriptor, input_data, output_descriptor,
+          backward_output_data, convolution_descriptor, filter_descriptor,
+          backward_filter_data, scratch_allocator, algorithm_config,
+          output_profile_result);
+      if (!status && !output_profile_result) {
+        SetError();
+      }
+    } else {
+      SetErrorAndLogNoDnnSupport();
+    }
+  }
+  return *this;
+}
+
 Stream &Stream::ThenConvolveBackwardFilterWithAlgorithm(
     const dnn::BatchDescriptor &input_descriptor,
     const DeviceMemory<float> &input_data,
@@ -4923,12 +5023,6 @@ Stream &Stream::ThenTransformTensor(const dnn::BatchDescriptor &input_desc,
   return *this;
 }
 
-Stream &Stream::ThenDoHostCallbackForTest(std::function<void()> callback) {
-  VLOG_CALL(PARAM(callback));
-
-  return ThenDoHostCallback(callback);
-}
-
 Stream &Stream::ThenDoHostCallback(std::function<void()> callback) {
   VLOG_CALL(PARAM(callback));
 
diff --git a/tensorflow/stream_executor/stream.h b/tensorflow/stream_executor/stream.h
index a2fb2ea2375d0f245ae3bf3ccb04803d01663def..d7d11315699b85cae4d479b79bc8fc2717b2d8fb 100644
--- a/tensorflow/stream_executor/stream.h
+++ b/tensorflow/stream_executor/stream.h
@@ -358,6 +358,17 @@ class Stream {
       const dnn::BatchDescriptor &output_descriptor,
       DeviceMemory<float> *output, ScratchAllocator *scratch_allocator);
 
+  Stream &ThenConvolveWithAlgorithm(
+      const dnn::BatchDescriptor &input_descriptor,
+      const DeviceMemory<double> &input_data,
+      const dnn::FilterDescriptor &filter_descriptor,
+      const DeviceMemory<double> &filter_data,
+      const dnn::ConvolutionDescriptor &convolution_descriptor,
+      const dnn::BatchDescriptor &output_descriptor,
+      DeviceMemory<double> *output, ScratchAllocator *scratch_allocator,
+      const dnn::AlgorithmConfig &algorithm_config,
+      dnn::ProfileResult *output_profile_result);
+
   Stream &ThenConvolveWithAlgorithm(
       const dnn::BatchDescriptor &input_descriptor,
       const DeviceMemory<float> &input_data,
@@ -476,6 +487,18 @@ class Stream {
       DeviceMemory<Eigen::half> *backward_input_data,
       ScratchAllocator *scratch_allocator);
 
+  Stream &ThenConvolveBackwardDataWithAlgorithm(
+      const dnn::FilterDescriptor &filter_descriptor,
+      const DeviceMemory<double> &filter_data,
+      const dnn::BatchDescriptor &output_descriptor,
+      DeviceMemory<double> backward_output_data,
+      const dnn::ConvolutionDescriptor &convolution_descriptor,
+      const dnn::BatchDescriptor &input_descriptor,
+      DeviceMemory<double> *backward_input_data,
+      ScratchAllocator *scratch_allocator,
+      const dnn::AlgorithmConfig &algorithm_config,
+      dnn::ProfileResult *output_profile_result);
+
   Stream &ThenConvolveBackwardDataWithAlgorithm(
       const dnn::FilterDescriptor &filter_descriptor,
       const DeviceMemory<float> &filter_data,
@@ -529,6 +552,18 @@ class Stream {
       DeviceMemory<Eigen::half> *backward_filter_data,
       ScratchAllocator *scratch_allocator);
 
+  Stream &ThenConvolveBackwardFilterWithAlgorithm(
+      const dnn::BatchDescriptor &input_descriptor,
+      const DeviceMemory<double> &input_data,
+      const dnn::BatchDescriptor &output_descriptor,
+      DeviceMemory<double> backward_output_data,
+      const dnn::ConvolutionDescriptor &convolution_descriptor,
+      const dnn::FilterDescriptor &filter_descriptor,
+      DeviceMemory<double> *backward_filter_data,
+      ScratchAllocator *scratch_allocator,
+      const dnn::AlgorithmConfig &algorithm_config,
+      dnn::ProfileResult *output_profile_result);
+
   Stream &ThenConvolveBackwardFilterWithAlgorithm(
       const dnn::BatchDescriptor &input_descriptor,
       const DeviceMemory<float> &input_data,
@@ -1933,16 +1968,15 @@ class Stream {
   // Entrains onto the stream a callback to the host (from the device).
   // Host callbacks block/occupy the stream just as device functions
   // (execute one at a time, block later stream operations).
+  //
   // Behavior is undefined when synchronizing using OpenCL user events.
   // Behavior is undefined if host callbacks call device routines or insert
   // them into any stream.
+  //
   // On certain platforms, ThenDoHostCallback is expected to have significant
   // negative effects on performance.
   Stream &ThenDoHostCallback(std::function<void()> callback);
 
-  // Identical to ThenDoHostCallback; only exposed for testing purposes.
-  Stream &ThenDoHostCallbackForTest(std::function<void()> callback);
-
   // Returns the StreamExecutor (parent object) associated with this stream.
   StreamExecutor *parent() const {
     CHECK(parent_ != nullptr);
diff --git a/tensorflow/tensorflow.bzl b/tensorflow/tensorflow.bzl
index 818d67f7b5be1e8f2db66b24976a529b361a4990..fcc57d506e38205d8da605653ed67fb645102c35 100644
--- a/tensorflow/tensorflow.bzl
+++ b/tensorflow/tensorflow.bzl
@@ -22,6 +22,7 @@ load(
 load(
     "//third_party/mkl:build_defs.bzl",
     "if_mkl",
+    "if_mkl_lnx_x64"
 )
 
 def register_extension_info(**kwargs):
@@ -34,7 +35,7 @@ def src_to_test_name(src):
   return src.replace("/", "_").split(".")[0]
 
 def full_path(relative_paths):
-  return [PACKAGE_NAME + "/" + relative for relative in relative_paths]
+  return [native.package_name() + "/" + relative for relative in relative_paths]
 
 # List of proto files for android builds
 def tf_android_core_proto_sources(core_proto_sources_relative):
@@ -202,7 +203,8 @@ def tf_copts(android_optimization_level_override="-O2", is_external=False):
           "-ftemplate-depth=900"])
       + if_cuda(["-DGOOGLE_CUDA=1"])
       + if_tensorrt(["-DGOOGLE_TENSORRT=1"])
-      + if_mkl(["-DINTEL_MKL=1", "-DEIGEN_USE_VML", "-fopenmp",])
+      + if_mkl(["-DINTEL_MKL=1", "-DEIGEN_USE_VML"])
+      + if_mkl_lnx_x64(["-fopenmp"])
       + if_android_arm(["-mfpu=neon"])
       + if_linux_x86_64(["-msse3"])
       + if_ios_x86_64(["-msse4.1"])
@@ -265,7 +267,7 @@ def _rpath_linkopts(name):
   # deployed. Other shared object dependencies (e.g. shared between contrib/
   # ops) are picked up as long as they are in either the same or a parent
   # directory in the tensorflow/ tree.
-  levels_to_root = PACKAGE_NAME.count("/") + name.count("/")
+  levels_to_root = native.package_name().count("/") + name.count("/")
   return select({
       clean_dep("//tensorflow:darwin"): [
           "-Wl,%s" % (_make_search_paths("@loader_path", levels_to_root),),
@@ -905,6 +907,14 @@ def tf_cuda_library(deps=None, cuda_deps=None, copts=tf_copts(), **kwargs):
   if not cuda_deps:
     cuda_deps = []
 
+  if 'linkstatic' not in kwargs or kwargs['linkstatic'] != 1:
+    enable_text_relocation_linkopt = select({
+          clean_dep("//tensorflow:darwin"): [],
+          "//conditions:default": ['-Wl,-z,notext'],})
+    if 'linkopts' in kwargs:
+      kwargs['linkopts'] += enable_text_relocation_linkopt
+    else:
+      kwargs['linkopts'] = enable_text_relocation_linkopt
   native.cc_library(
       deps=deps + if_cuda(cuda_deps + [
           clean_dep("//tensorflow/core:cuda"),
@@ -1158,22 +1168,6 @@ def transitive_hdrs(name, deps=[], **kwargs):
 # the libraries in deps.
 def cc_header_only_library(name, deps=[], includes=[], **kwargs):
   _transitive_hdrs(name=name + "_gather", deps=deps)
-
-  # We could generalize the following, but rather than complicate things
-  # here, we'll do the minimal use case for now, and hope bazel comes up
-  # with a better solution before too long.  We'd expect it to compute
-  # the right include path by itself, but it doesn't, possibly because
-  # _transitive_hdrs lost some information about the include path.
-  if "@nsync//:nsync_headers" in deps:
-    # Buiding tensorflow from @org_tensorflow finds this two up.
-    nsynch = "../../external/nsync/public"
-    # Building tensorflow from elsewhere finds it four up.
-    # Note that native.repository_name() is not yet available in TF's Kokoro.
-    if REPOSITORY_NAME != "@":
-      nsynch = "../../" + nsynch
-    includes = includes[:]
-    includes.append(nsynch)
-
   native.cc_library(name=name,
                     hdrs=[":" + name + "_gather"],
                     includes=includes,
@@ -1182,7 +1176,6 @@ def cc_header_only_library(name, deps=[], includes=[], **kwargs):
 def tf_custom_op_library_additional_deps():
   return [
       "@protobuf_archive//:protobuf_headers",
-      "@nsync//:nsync_headers",
       clean_dep("//third_party/eigen3"),
       clean_dep("//tensorflow/core:framework_headers_lib"),
   ]
diff --git a/tensorflow/tools/api/generator/BUILD b/tensorflow/tools/api/generator/BUILD
index e731127a63d792825e15a4b95379517117edebb0..d9b0260c9f254f0b609ecc9094789085bb6586d4 100644
--- a/tensorflow/tools/api/generator/BUILD
+++ b/tensorflow/tools/api/generator/BUILD
@@ -1,5 +1,6 @@
 # Description:
 # Scripts used to generate TensorFlow Python API.
+
 licenses(["notice"])  # Apache 2.0
 
 exports_files(["LICENSE"])
@@ -21,7 +22,7 @@ py_binary(
     srcs = ["create_python_api.py"],
     srcs_version = "PY2AND3",
     deps = [
-        "//tensorflow:tensorflow_py",
+        "//tensorflow/python",
     ],
 )
 
@@ -80,6 +81,7 @@ genrule(
         "api/keras/datasets/boston_housing/__init__.py",
         "api/keras/datasets/cifar10/__init__.py",
         "api/keras/datasets/cifar100/__init__.py",
+        "api/keras/datasets/fashion_mnist/__init__.py",
         "api/keras/datasets/imdb/__init__.py",
         "api/keras/datasets/mnist/__init__.py",
         "api/keras/datasets/reuters/__init__.py",
@@ -102,6 +104,7 @@ genrule(
         "api/linalg/__init__.py",
         "api/logging/__init__.py",
         "api/losses/__init__.py",
+        "api/manip/__init__.py",
         "api/metrics/__init__.py",
         "api/nn/__init__.py",
         "api/nn/rnn_cell/__init__.py",
@@ -124,6 +127,7 @@ genrule(
         "api/test/__init__.py",
         "api/train/__init__.py",
         "api/train/queue_runner/__init__.py",
+        "api/user_ops/__init__.py",
     ],
     cmd = "$(location create_python_api) $(OUTS)",
     tools = ["create_python_api"],
@@ -133,7 +137,9 @@ py_library(
     name = "python_api",
     srcs = [":python_api_gen"],
     srcs_version = "PY2AND3",
+    visibility = ["//tensorflow:__subpackages__"],
     deps = [
         "//tensorflow/contrib:contrib_py",  # keep
+        "//tensorflow/python",  # keep
     ],
 )
diff --git a/tensorflow/tools/api/generator/create_python_api.py b/tensorflow/tools/api/generator/create_python_api.py
index 1557314939bd85c0467426216f90aa3891ca0ac0..183c4731b8176ece16a70bac421291fd76d748cb 100644
--- a/tensorflow/tools/api/generator/create_python_api.py
+++ b/tensorflow/tools/api/generator/create_python_api.py
@@ -23,15 +23,13 @@ import collections
 import os
 import sys
 
-# This import is needed so that we can traverse over TensorFlow modules.
-import tensorflow as tf  # pylint: disable=unused-import
 from tensorflow.python.util import tf_decorator
 
 
 _API_CONSTANTS_ATTR = '_tf_api_constants'
 _API_NAMES_ATTR = '_tf_api_names'
 _API_DIR = '/api/'
-_CONTRIB_IMPORT = 'from tensorflow import contrib'
+_OUTPUT_MODULE = 'tensorflow.tools.api.generator.api'
 _GENERATED_FILE_HEADER = """\"\"\"Imports for Python API.
 
 This file is MACHINE GENERATED! Do not edit.
@@ -40,6 +38,11 @@ Generated by: tensorflow/tools/api/generator/create_python_api.py script.
 """
 
 
+class SymbolExposedTwiceError(Exception):
+  """Raised when different symbols are exported with the same name."""
+  pass
+
+
 def format_import(source_module_name, source_name, dest_name):
   """Formats import statement.
 
@@ -64,6 +67,44 @@ def format_import(source_module_name, source_name, dest_name):
       return 'import %s as %s' % (source_name, dest_name)
 
 
+class _ModuleImportsBuilder(object):
+  """Builds a map from module name to imports included in that module."""
+
+  def __init__(self):
+    self.module_imports = collections.defaultdict(list)
+    self._seen_api_names = set()
+
+  def add_import(
+      self, dest_module_name, source_module_name, source_name, dest_name):
+    """Adds this import to module_imports.
+
+    Args:
+      dest_module_name: (string) Module name to add import to.
+      source_module_name: (string) Module to import from.
+      source_name: (string) Name of the symbol to import.
+      dest_name: (string) Import the symbol using this name.
+
+    Raises:
+      SymbolExposedTwiceError: Raised when an import with the same
+        dest_name has already been added to dest_module_name.
+    """
+    import_str = format_import(source_module_name, source_name, dest_name)
+    if import_str in self.module_imports[dest_module_name]:
+      return
+
+    # Check if we are trying to expose two different symbols with same name.
+    full_api_name = dest_name
+    if dest_module_name:
+      full_api_name = dest_module_name + '.' + full_api_name
+    if full_api_name in self._seen_api_names:
+      raise SymbolExposedTwiceError(
+          'Trying to export multiple symbols with same name: %s.' %
+          full_api_name)
+    self._seen_api_names.add(full_api_name)
+
+    self.module_imports[dest_module_name].append(import_str)
+
+
 def get_api_imports():
   """Get a map from destination module to formatted imports.
 
@@ -74,7 +115,9 @@ def get_api_imports():
           (for e.g. 'from foo import bar') and constant
           assignments (for e.g. 'FOO = 123').
   """
-  module_imports = collections.defaultdict(list)
+  module_imports_builder = _ModuleImportsBuilder()
+  visited_symbols = set()
+
   # Traverse over everything imported above. Specifically,
   # we want to traverse over TensorFlow Python modules.
   for module in sys.modules.values():
@@ -87,48 +130,56 @@ def get_api_imports():
 
     for module_contents_name in dir(module):
       attr = getattr(module, module_contents_name)
+      if id(attr) in visited_symbols:
+        continue
 
       # If attr is _tf_api_constants attribute, then add the constants.
       if module_contents_name == _API_CONSTANTS_ATTR:
         for exports, value in attr:
           for export in exports:
-            names = ['tf'] + export.split('.')
+            names = export.split('.')
             dest_module = '.'.join(names[:-1])
-            import_str = format_import(module.__name__, value, names[-1])
-            module_imports[dest_module].append(import_str)
+            module_imports_builder.add_import(
+                dest_module, module.__name__, value, names[-1])
         continue
 
       _, attr = tf_decorator.unwrap(attr)
       # If attr is a symbol with _tf_api_names attribute, then
       # add import for it.
       if hasattr(attr, '__dict__') and _API_NAMES_ATTR in attr.__dict__:
-        # The same op might be accessible from multiple modules.
-        # We only want to consider location where function was defined.
-        if attr.__module__ != module.__name__:
+        # If the same symbol is available using multiple names, only create
+        # imports for it once.
+        if id(attr) in visited_symbols:
           continue
+        visited_symbols.add(id(attr))
 
         for export in attr._tf_api_names:  # pylint: disable=protected-access
-          names = ['tf'] + export.split('.')
+          names = export.split('.')
           dest_module = '.'.join(names[:-1])
-          import_str = format_import(
-              module.__name__, module_contents_name, names[-1])
-          module_imports[dest_module].append(import_str)
+          module_imports_builder.add_import(
+              dest_module, module.__name__, module_contents_name, names[-1])
 
   # Import all required modules in their parent modules.
-  # For e.g. if we import 'tf.foo.bar.Value'. Then, we also
-  # import 'bar' in 'tf.foo'.
-  dest_modules = set(module_imports.keys())
-  for dest_module in dest_modules:
-    dest_module_split = dest_module.split('.')
-    for dest_submodule_index in range(1, len(dest_module_split)):
-      dest_submodule = '.'.join(dest_module_split[:dest_submodule_index])
-      submodule_import = format_import(
-          '', dest_module_split[dest_submodule_index],
-          dest_module_split[dest_submodule_index])
-      if submodule_import not in module_imports[dest_submodule]:
-        module_imports[dest_submodule].append(submodule_import)
-
-  return module_imports
+  # For example, if we import 'foo.bar.Value', then we also
+  # import 'bar' in 'foo'.
+  imported_modules = set(module_imports_builder.module_imports.keys())
+  for module in imported_modules:
+    if not module:
+      continue
+    module_split = module.split('.')
+    parent_module = ''  # we import submodules in their parent_module
+
+    for submodule_index in range(len(module_split)):
+      import_from = _OUTPUT_MODULE
+      if submodule_index > 0:
+        parent_module += ('.' + module_split[submodule_index-1] if parent_module
+                          else module_split[submodule_index-1])
+        import_from += '.' + parent_module
+      module_imports_builder.add_import(
+          parent_module, import_from, module_split[submodule_index],
+          module_split[submodule_index])
+
+  return module_imports_builder.module_imports
 
 
 def create_api_files(output_files):
@@ -151,8 +202,8 @@ def create_api_files(output_files):
     # First get module directory under _API_DIR.
     module_dir = os.path.dirname(
         output_file[output_file.rfind(_API_DIR)+len(_API_DIR):])
-    # Convert / to . and prefix with tf.
-    module_name = '.'.join(['tf', module_dir.replace('/', '.')]).strip('.')
+    # Convert / to .
+    module_name = module_dir.replace('/', '.').strip('.')
     module_name_to_file_path[module_name] = output_file
 
   # Create file for each expected output in genrule.
@@ -162,16 +213,14 @@ def create_api_files(output_files):
     open(file_path, 'a').close()
 
   module_imports = get_api_imports()
-  module_imports['tf'].append(_CONTRIB_IMPORT)  # Include all of contrib.
 
   # Add imports to output files.
   missing_output_files = []
   for module, exports in module_imports.items():
     # Make sure genrule output file list is in sync with API exports.
     if module not in module_name_to_file_path:
-      module_without_tf = module[len('tf.'):]
       module_file_path = '"api/%s/__init__.py"' %  (
-          module_without_tf.replace('.', '/'))
+          module.replace('.', '/'))
       missing_output_files.append(module_file_path)
       continue
     with open(module_name_to_file_path[module], 'w') as fp:
diff --git a/tensorflow/tools/api/golden/tensorflow.-gradient-tape.pbtxt b/tensorflow/tools/api/golden/tensorflow.-gradient-tape.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7405202b892bba67a36d86cd43fb7a67ab3be947
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.-gradient-tape.pbtxt
@@ -0,0 +1,21 @@
+path: "tensorflow.GradientTape"
+tf_class {
+  is_instance: "<class \'tensorflow.python.eager.backprop.GradientTape\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'persistent\'], varargs=None, keywords=None, defaults=[\'False\'], "
+  }
+  member_method {
+    name: "gradient"
+    argspec: "args=[\'self\', \'target\', \'sources\', \'output_gradients\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "watch"
+    argspec: "args=[\'self\', \'tensor\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "watched_variables"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt
index 42de5c0c80023ad5bd7f33a564780060998307c1..0900adaf762df1415c8db63c3879ca2fabc28d9f 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt
@@ -64,7 +64,7 @@ tf_class {
   }
   member_method {
     name: "list_files"
-    argspec: "args=[\'file_pattern\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'file_pattern\', \'shuffle\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
   member_method {
     name: "make_initializable_iterator"
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt
index e2fc8d6cb1d318cc50828f22e8e575cc28c7aaad..7b16ac90c925beb25e065d26e73ee2a54b06d9dc 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt
@@ -65,7 +65,7 @@ tf_class {
   }
   member_method {
     name: "list_files"
-    argspec: "args=[\'file_pattern\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'file_pattern\', \'shuffle\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
   member_method {
     name: "make_initializable_iterator"
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt
index 9770389e5ef1e29a80ae1da2725d9862f6521ff9..9cf5f2ae2057ab4a16131527cf2ef2fa6ada28e5 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt
@@ -17,7 +17,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'filenames\', \'compression_type\', \'buffer_size\'], varargs=None, keywords=None, defaults=[\'None\', \'None\'], "
+    argspec: "args=[\'self\', \'filenames\', \'compression_type\', \'buffer_size\', \'num_parallel_reads\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
   }
   member_method {
     name: "apply"
@@ -65,7 +65,7 @@ tf_class {
   }
   member_method {
     name: "list_files"
-    argspec: "args=[\'file_pattern\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'file_pattern\', \'shuffle\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
   member_method {
     name: "make_initializable_iterator"
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt
index 7263230c1c7182bb812cb2e433aedd415bcd16c7..8c3d6691439e619c906996a3ddaea4317c4a9597 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt
@@ -65,7 +65,7 @@ tf_class {
   }
   member_method {
     name: "list_files"
-    argspec: "args=[\'file_pattern\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'file_pattern\', \'shuffle\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
   member_method {
     name: "make_initializable_iterator"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.-vocab-info.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.-vocab-info.pbtxt
index a16e3aedae96e7289e73c49ac7890550dd5ddb08..5301b94eb361251a1cb4d02a5d8168f7c8191045 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.-vocab-info.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.-vocab-info.pbtxt
@@ -1,7 +1,7 @@
 path: "tensorflow.estimator.VocabInfo"
 tf_class {
-  is_instance: "<class \'tensorflow.python.estimator.warm_starting_util.VocabInfo\'>"
-  is_instance: "<class \'tensorflow.python.estimator.warm_starting_util.VocabInfo\'>"
+  is_instance: "<class \'tensorflow.python.training.warm_starting_util.VocabInfo\'>"
+  is_instance: "<class \'tensorflow.python.training.warm_starting_util.VocabInfo\'>"
   is_instance: "<type \'tuple\'>"
   member {
     name: "backup_initializer"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.-warm-start-settings.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.-warm-start-settings.pbtxt
index afdd6bb058353594415cd1abe726070f84ae46b6..43f5343359aff3b856a2b3708e4cda7cec29e146 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.-warm-start-settings.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.-warm-start-settings.pbtxt
@@ -1,7 +1,7 @@
 path: "tensorflow.estimator.WarmStartSettings"
 tf_class {
-  is_instance: "<class \'tensorflow.python.estimator.warm_starting_util.WarmStartSettings\'>"
-  is_instance: "<class \'tensorflow.python.estimator.warm_starting_util.WarmStartSettings\'>"
+  is_instance: "<class \'tensorflow.python.estimator.estimator.WarmStartSettings\'>"
+  is_instance: "<class \'tensorflow.python.estimator.estimator.WarmStartSettings\'>"
   is_instance: "<type \'tuple\'>"
   member {
     name: "ckpt_to_initialize_from"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.export.-tensor-serving-input-receiver.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.export.-tensor-serving-input-receiver.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4fe92643bf9867765499d7bf475b9cdd1686aec5
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.export.-tensor-serving-input-receiver.pbtxt
@@ -0,0 +1,27 @@
+path: "tensorflow.estimator.export.TensorServingInputReceiver"
+tf_class {
+  is_instance: "<class \'tensorflow.python.estimator.export.export.TensorServingInputReceiver\'>"
+  is_instance: "<class \'tensorflow.python.estimator.export.export.TensorServingInputReceiver\'>"
+  is_instance: "<type \'tuple\'>"
+  member {
+    name: "features"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "receiver_tensors"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "receiver_tensors_alternatives"
+    mtype: "<type \'property\'>"
+  }
+  member_method {
+    name: "__init__"
+  }
+  member_method {
+    name: "count"
+  }
+  member_method {
+    name: "index"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.export.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.export.pbtxt
index 4d0dddb3bc0305a28fab0c95c31e4869f5db0aa8..bd72f6cd79f7dffb9f0a7f8ae43751c4ecba939d 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.export.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.export.pbtxt
@@ -20,6 +20,10 @@ tf_module {
     name: "ServingInputReceiver"
     mtype: "<type \'type\'>"
   }
+  member {
+    name: "TensorServingInputReceiver"
+    mtype: "<type \'type\'>"
+  }
   member_method {
     name: "build_parsing_serving_input_receiver_fn"
     argspec: "args=[\'feature_spec\', \'default_batch_size\'], varargs=None, keywords=None, defaults=[\'None\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.image.pbtxt b/tensorflow/tools/api/golden/tensorflow.image.pbtxt
index bda1c2bf85977e69b0969bc8b6056710d88ca910..3fc64dae888012169af3ea7695154b73f24d90c8 100644
--- a/tensorflow/tools/api/golden/tensorflow.image.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.image.pbtxt
@@ -100,6 +100,10 @@ tf_module {
     name: "hsv_to_rgb"
     argspec: "args=[\'images\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
+  member_method {
+    name: "image_gradients"
+    argspec: "args=[\'image\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "is_jpeg"
     argspec: "args=[\'contents\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
@@ -116,6 +120,10 @@ tf_module {
     name: "per_image_standardization"
     argspec: "args=[\'image\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "psnr"
+    argspec: "args=[\'a\', \'b\', \'max_val\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
   member_method {
     name: "random_brightness"
     argspec: "args=[\'image\', \'max_delta\', \'seed\'], varargs=None, keywords=None, defaults=[\'None\'], "
@@ -188,6 +196,18 @@ tf_module {
     name: "sample_distorted_bounding_box"
     argspec: "args=[\'image_size\', \'bounding_boxes\', \'seed\', \'seed2\', \'min_object_covered\', \'aspect_ratio_range\', \'area_range\', \'max_attempts\', \'use_image_if_no_bounding_boxes\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'0.1\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
   }
+  member_method {
+    name: "sobel_edges"
+    argspec: "args=[\'image\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "ssim"
+    argspec: "args=[\'img1\', \'img2\', \'max_val\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "ssim_multiscale"
+    argspec: "args=[\'img1\', \'img2\', \'max_val\', \'power_factors\'], varargs=None, keywords=None, defaults=[\'(0.0448, 0.2856, 0.3001, 0.2363, 0.1333)\'], "
+  }
   member_method {
     name: "total_variation"
     argspec: "args=[\'images\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.initializers.pbtxt b/tensorflow/tools/api/golden/tensorflow.initializers.pbtxt
index 21a0f84d22fc2d06e551c9a709f3963e812333b8..eaf0036cacfadce335a84bcf61f47f9d360be7e2 100644
--- a/tensorflow/tools/api/golden/tensorflow.initializers.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.initializers.pbtxt
@@ -1,17 +1,9 @@
 path: "tensorflow.initializers"
 tf_module {
-  member {
-    name: "absolute_import"
-    mtype: "<type \'instance\'>"
-  }
   member {
     name: "constant"
     mtype: "<type \'type\'>"
   }
-  member {
-    name: "division"
-    mtype: "<type \'instance\'>"
-  }
   member {
     name: "identity"
     mtype: "<type \'type\'>"
@@ -24,10 +16,6 @@ tf_module {
     name: "orthogonal"
     mtype: "<type \'type\'>"
   }
-  member {
-    name: "print_function"
-    mtype: "<type \'instance\'>"
-  }
   member {
     name: "random_normal"
     mtype: "<type \'type\'>"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt
index 241db8956a5bc01a058048d3b21b2e1cbe56c92f..7be2f4f61f6b9637f372591e49efc0c93c7a8c0a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.network.Network\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt
index 9673a508d610778029013b9388ddafd34713f301..0f2428d77a537959cf2c46dfa350208abea8cb36 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.network.Network\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.datasets.fashion_mnist.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.datasets.fashion_mnist.pbtxt
index 791cfda23345fea7df1cfb107ae5dec06354bd48..a0e14356fa5e91bc81bd89f6eb8c07087956c392 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.datasets.fashion_mnist.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.datasets.fashion_mnist.pbtxt
@@ -1,3 +1,7 @@
 path: "tensorflow.keras.datasets.fashion_mnist"
 tf_module {
+  member_method {
+    name: "load_data"
+    argspec: "args=[], varargs=None, keywords=None, defaults=None"
+  }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt
index 041acf29ff76d7271913204c817cc8c3d47429a5..db8f626b98b70fd99f38e696aa16c72e74e86e25 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Activation\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt
index 48143b2cd66b4cec6f8833be71f27645be9dc898..809b3a5430449176a0d7423ec7f4499ceb620890 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.ActivityRegularization\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt
index 11f78fed9733d8a17072f73a78799bcae823d469..68d41bb6cc258ca87d4664ac0fb9d5649f89ebaf 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt
index 84eb8256325974619a9252d11da139032cdddfa3..970b777e514194db4ac49fe58bea737b35436217 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.noise.AlphaDropout\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt
index ab377a248f093530396cce7bc5baacfeba237e2b..529c64ab293d596012aefd42e0695bd1eb7e44d1 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt
index c2edd79f5263b52f0a1ab9df9cb29517b33da7de..7e7c330d74fe3b71ecd0eb87e34719e47ae70784 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt
index f3f37eed9946132a91c8c872411f164df7d1691d..ada8466d7473072b1878861ab36ec40b07fa1914 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt
index 31d1d1c049c3009b37d80839832ddf44ca1cd6d6..2a5c1cd530a7a532f6cdd3c184f4ee7eb88d23d3 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt
index 6582e1b18eb1982719cbac6b679ec830ce5938b3..9a2cb29815d59f3761ea25e9ea36ff6489c85b88 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt
index 12f66095d2de014e4e7dfc02f5a7a2341db428f5..f5e991ea42e5ee2723b64574d4598dc8463f1c8c 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt
index 3a45fa180ee90a921fe4bbde0924cb8364ddb9e5..31732214a62524017e39776cdfb9ab629746e8ae 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt
index a0f272c1788a3fd197bae6f5583e009f97dd3c56..422eddf10db6763e10405dba5537ca161d1b8994 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.normalization.BatchNormalization\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt
index 9c7d3154ad43cb75e310e9c8dd3a2f7a46153fce..9053a37916314198842bc21b0608a9b69a64c264 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.wrappers.Wrapper\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt
index 949b225e545132cf2464fd01b3607e0ef2c44b7f..3d536d2182fc4480a2ee5fba177543ca21fbd5ac 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt
index a736c84a102e4780ca04ff1fad92f9310c841814..6a7da1aef8db64ad11bb5a5ba357f33eeb99170b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.Recurrent\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt
index 95f9afed28961e88c8329117ed6714b91e72b10d..801a0339720919f8b3f6beee0f045d58b2c0a371 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt
index 38ba15400a49088fd5337a43f7d6ed3b4067a9d3..13352e264a5305190717bb973a3f2bce4d7f4fff 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt
@@ -6,6 +6,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt
index bc84e2a97e549bed7a6a4e255b3c3d3fe6cd5250..f400e4a15c362037e85ac375cee98bb5f6358669 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt
index 0802578c227b2d5ed56e5180075f604680d94397..b3a9f573b8ba652d2544b21f36f65fe81a6ebb50 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt
@@ -6,6 +6,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt
index 8ad4646c749b22d3c8bc79ada9d3d875bdb34201..a9be09c0abd19aeb4df30116ef2befc3948bfbf4 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt
index 110e267b75c59793af194982a404a318d81ded7c..be1ef5eb928d16cc6bf78c289aa20d815c728b23 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt
index 24cfc83af61f3742229ee3a7bfcb3b399db53291..30034f7eaf6d9073695353e5c8d9ead0cc8de7cc 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt
@@ -6,6 +6,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt
index c56e89187f26fee8447889465244398c600c6e18..189b38054c004facfeeff8ad2ae87848b89040f2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt
index 3674f2746caf04f56618bbd2824c9ae54ade21b8..a76d85c629c1fe620dafd62a0f0e05e9009109e2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt
@@ -6,6 +6,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt
index 5a8f9d770280b5ca1f9c91d46042dcca061f31c9..782195d4ad5883d8c0ea6a657cc10258f2080a55 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt
index caa748be8150be045ccaede0c30f7d6eee66c30e..2cb7a39ea595e1ff699b96554cb135377d20a488 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Cropping1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt
index 97bd4a265a9b0139207e795252679d915ba1bde0..80803306992bba3b601824a93cb3086ef3947369 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Cropping2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt
index 20c43eeed1c06d5970684d792c193913eed8b5d4..678f40bbc23db15ff7c1138169478fb4412a449d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Cropping3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt
index 256f0e4bdf40a4bf10f65f873d50a5d328091740..fac826109b6a32305ece86c4990f08afe2236ce8 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Dense\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt
index d1e53f900c2dd6630499486083d33e4d193b30fd..285d544af2d69d564afdec748598b39b6b95670f 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt
index b010ff6805f3d032ba8841fecb8a98aabd604a88..b77976974cccb96fc2373c093d2bdf279560c46f 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Dropout\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt
index fffd3854bbbf486c67d32f259e605b14b2c42ede..b07714d3f2d158496e0482f8611e55ea0fb0fd51 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.advanced_activations.ELU\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt
index 1155fe03fc53d5ea046fd3c2d04353116ab98273..e67d4ddfc47077d62319ab097e5333a373cbfc80 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.embeddings.Embedding\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt
index 5e4bebb15b54742ce1d6a4bf31482046fe6f5be5..b2a668e5a88d312656f48ddd0e9f7aa9f6306991 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Flatten\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt
index cb9bb3d82151eb6f9f4e4c90133519a1d6fbde63..1fd3febad26df16576dedca1df7560bf230c08ec 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.GRUCell\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt
index 9a36e806498f3fc8b16c1b2587244bdad515ad5f..f5f41d879dcb840551c00a7272bbcfbe51dbee89 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.RNN\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activation"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt
index eb32238e151a3665f852df3b693b716e6bf04fa2..f4f1a5d51c5d5689918af4facf907f79d9ca71ec 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.noise.GaussianDropout\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt
index 37fc8e29aefda3cc1c47b4934d7d421b3f438eb5..e502df5e177d422403d0643c18a9588afb9d9713 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.noise.GaussianNoise\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt
index 490816458b08a0663d07d62a2d9dfad7f1c95dd8..9c8d5bfcd8966384230e7d5cdcc1cac53a0eab9a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt
index ab49f67f336d256e46282c5e5a18e498fa56b9c7..8dd65f1f248daaf120780f19050c45d297b7902e 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt
index 3d7cb3ba491c91341b37270a69e758f08161ec53..5e30571cc730ee23767a044036b590460deec00b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt
index c99ddab4f3277d43e663b9af6e8ce7dfe6307f05..ba90fa454696d1cb4e77d80a2dc77ff65def4714 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt
index 290d2eaebe8ef9f5bab2eebe224e261561c9a86d..8823857758307c208527b144c0cc73b566f2f115 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt
index cf63069641a61c029e6851be35c2fa8e8f433b22..500ced852ba6b19502769ba9052f2e364af7e283 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt
index 2dadc67c0939df6a806bd4b1a0a91b81365aea3c..cf2717ed46b56e639fb774c1e922648e1653ec0d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt
index 1a1a1dcf64b726092f93c00d46c009d0b49b7baf..a86ff1a46997f19b11e6ef03be432b45687a2df2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt
index 44898e23ad8faff5c806de20f18cf62bfd6708f2..e01cc7c1b09ad6a40380613d54b771c6a1c89c1c 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt
index 941d867d24c42684d78b61cf06017ecde7c6eefc..259c1fb37c787f5318570b7aca6935d2f0ed997f 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt
index 9a5a6325f89b2b4253622e918c17e9a86701b8e7..0c41bf97f763f1e40e8fac714709ccac1483a00b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt
index 7a0c1932f6239897bdbde24f3a7269027cb008ca..bec8817aa393ba2d8a6410408938402366cbb01d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt
index f679c1d00692e9e31eba4752e4b9ed7cb2494d8e..17be86222901c0f5a9a18c0e5f1c5bcac6c06a17 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.input_layer.InputLayer\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt
index ad1e7f2cad74a72e5489bfeb857a90511ececb03..6d2a8c56196d9b3c80f570c7f1d3ac803253fff6 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.LSTMCell\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt
index 6dad4b48979b11afbc50e33e4b3766429b1a9541..490b5b618c65e28f1ae2e01e8d35e7f3973cc180 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.RNN\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activation"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt
index fa45d8c9028639c4c1f9dbf930cda107ea8819d8..21a65b838af35e2f540eacab823513e7bf54b434 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Lambda\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt
index 023d6c0d69790ab2e6203d3331f63fdb44c85b2e..127b04738e70c11b2dc1071cf174cf5de23c5133 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.keras.layers.Layer"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt
index e429fced77bd163c381466df4ba18e2672e31b0d..87e49f2ed5b5d73aee5e9aa2511485b1f3f4bcd9 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.advanced_activations.LeakyReLU\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt
index 462568124f72b334a7a83bdbd0adc4ac1b14218e..1aa3aad3246b83931a47e69a4aa76fdf2b5aee22 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.local.LocallyConnected1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt
index 11bf6a2b426d7f76ab67b0450048d56e02211702..5e9dc7d4774c651a186a4e320d0cfd088e87b6b3 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.local.LocallyConnected2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt
index a9324488911909395929d8e08c7429fab0bb5058..0d101e5b68cdb2cdf24ed472c724cfc885e3d95d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Masking\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt
index 6ff2adddac4bf679242856f7ed21a6ed05f949b8..c85cd49ac8ce2c1fc0759671865b7174cd1c1480 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt
index 2957673d4d461b9111a2a40bac0f2489fa798ff0..4f59e330c92f96101c65a9a24f66196e84587ccb 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt
index 2191c10b7399feb8cf488cb71e8047b746daf523..c0ea0eb0505d20e70d641f2a646a060d7dbfabda 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt
index af750ac1b61e23509902aae3ec9c91bbb063509d..ca37ae51314516ae67c7725eb2ccd3d25154e2ac 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt
index 9046061510828af39e71a2eee6e29cc8ca7c92d1..3ede2378347f5eddb0e8fae775a0200ea484d3f8 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt
index a40666807be0b50e53809f254dc6cbc16a403209..d87e25a7ba8e7cce615431723b53a0106c2b5279 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt
index 65378cef42215ddbc471e4014424fde3010d23b6..e4df7b48ae6b41400375920a48ef8577bb69376e 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt
index b037559e02a8200f4251f2d8604bbd7c2595cb15..6bf7c77743c31b6d74df35d827e9d5bc9a25d303 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt
index b3a7f47fa59af9dccdc74af3b138e0ed66234bae..c14be132b7e406c99841576be8d8fa9ab99aa816 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.advanced_activations.PReLU\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt
index b2f22f7da3f5f33e7c750d22690c5732c4fe7643..72ffbceae01da900778dba1ec14e646aa17b39e5 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Permute\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt
index 792eacf90dc96d6c4f85d014713a35eeb623c4ec..d3e780c8b22ed580f61ffc3d9b2bad7278391402 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.RNN\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt
index 5b79a021caa0a52776cf87207a434e5044caa8a9..a27980a9d17397e558a4b732e3dc332a0c1e8432 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.RepeatVector\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt
index 99c64505ee6e92f09d9ee25372ed173583988885..67f991276c6908ff54fd516e84533542a5f60528 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Reshape\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt
index d5873ccf7659d162c02e5855df5919046c2e3554..fccea5e8af5ab81e712669ff1b2567d8bde8607e 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt
@@ -6,6 +6,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt
index 76b4c10a46161d598b2eaa9c871e665a47149a4d..d20663bdb0bc2eea323d35b1e3d4d27122f50472 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt
@@ -6,6 +6,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt
index 40cd87de5fa9c05cfb4a2978d847e573e122bccf..889fa0a1b58bbd3babd293b7b1b45915a9ee3ca4 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt
@@ -6,6 +6,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt
index c44c0da1485821ec6b69f20efa20fc782bd9edbc..c850f3fedc814b20f0f95cc3cf4fd5c973446b5b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt
@@ -6,6 +6,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt
index bd70c31c38032da5ca6953a208d58a4cd5d04d3f..526d88ccba60eb25c68432e5baa03fd3a878f718 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.SimpleRNNCell\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt
index de717976cf285827ccedbbcc91f88d5d95df58fa..7fddae34472411f49d42b4d65d12034d056ec818 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.RNN\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activation"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt
index a93b7b8f6e116ce955b816a849ff68926f7adec2..5b9b62fc970238e49e6d4849285606d0a7908b23 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.advanced_activations.Softmax\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt
index 4dc24b195e3ae0fe896e9f844968b2c0f174ca77..769da30999993fad05ae0f7c04e256e6cf01a774 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Dropout\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt
index a3bb1cc414e9fdcdda9f2b4588ab44005b0c51fd..fca2e42a1519fcf3a9f0ec996c50b148b2df05fd 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Dropout\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt
index f9a78106fa406992d4672c6411a65d9e33f709a4..36e8de09a967c5940bf8078234f5980a78ec8009 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Dropout\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt
index 5aa21f402228794ae9d1053e0958fd82e65c8a51..a96f16fae99af9c30959d228202055e9aebfaf58 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.StackedRNNCells\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt
index 88e8a465725998997d2637cc833d87024aaf8a21..e1cbd0e150ed890ae57c1725249d1340fc2cb663 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.advanced_activations.ThresholdedReLU\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt
index f2a7673998d597370b4565c52b26399b14038f8f..f0d35728fb1c42d563ff0598dd84da51a766a764 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.wrappers.Wrapper\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt
index 4db82ddfa931b7dc4c0588184261f27239688a56..74efaea6ddb22ec2fe9d41558978c183b0e06671 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.UpSampling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt
index 61e65ad56df0a380417ec8ef70af3d200c90f72c..dc5bd5fd5319f9bbd601a3c4083ae566b47e1aaa 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.UpSampling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt
index 3d9402db4e2a52c8b746b256aea00392f26f17d1..e01ccfb74aead591f1018cdcbb1c888767ecdb20 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.UpSampling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt
index 0223799ed4cce948fdc7b90a3957a72f16d9eed4..7e6f90f7623677244865ac285c134dc79f7b9b69 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.wrappers.Wrapper\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt
index 2e4429833a91bede84e9a19b9c06c7a9146edee4..4d0d402dad442ccf52267f5ce40b05400afbfbc7 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.ZeroPadding1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt
index 26cf7b9e49b124108a428801350661f610527e99..b353a529bcf8e543d334fee57fca26ebc83036a4 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.ZeroPadding2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt
index 64d35d944795378d3f49060a6b50aac387a1c8a0..9fe1256e616dbca4f35101df160dc55bc68bfa8a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.ZeroPadding3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt
index 18be9c97014b526118b44544db08c8c86e3dbc2c..8ccf15f9ab0fcfa59907ff05a962a84d3d86ccb4 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.network.Network\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt
index b9346329225280b6f412d7cc892a13a9b8b33cff..102eb3220334516e0051f952353920f229f4ff20 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt
@@ -5,6 +5,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.network.Network\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-directory-iterator.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-directory-iterator.pbtxt
index 04174bff5f04fead68af68afeec80316867009a4..ec0f3d892d9d03a738d34a40afe701e788908a8e 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-directory-iterator.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-directory-iterator.pbtxt
@@ -6,7 +6,7 @@ tf_class {
   is_instance: "<type \'object\'>"
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'directory\', \'image_data_generator\', \'target_size\', \'color_mode\', \'classes\', \'class_mode\', \'batch_size\', \'shuffle\', \'seed\', \'data_format\', \'save_to_dir\', \'save_prefix\', \'save_format\', \'follow_links\', \'interpolation\'], varargs=None, keywords=None, defaults=[\'(256, 256)\', \'rgb\', \'None\', \'categorical\', \'32\', \'True\', \'None\', \'None\', \'None\', \'\', \'png\', \'False\', \'nearest\'], "
+    argspec: "args=[\'self\', \'directory\', \'image_data_generator\', \'target_size\', \'color_mode\', \'classes\', \'class_mode\', \'batch_size\', \'shuffle\', \'seed\', \'data_format\', \'save_to_dir\', \'save_prefix\', \'save_format\', \'follow_links\', \'subset\', \'interpolation\'], varargs=None, keywords=None, defaults=[\'(256, 256)\', \'rgb\', \'None\', \'categorical\', \'32\', \'True\', \'None\', \'None\', \'None\', \'\', \'png\', \'False\', \'None\', \'nearest\'], "
   }
   member_method {
     name: "next"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-image-data-generator.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-image-data-generator.pbtxt
index 41f27d1f740457f4b7c4f74cb089a448a0fed845..f5bc04e44c198e5bc60f8361dd32e4ae00250468 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-image-data-generator.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-image-data-generator.pbtxt
@@ -4,7 +4,7 @@ tf_class {
   is_instance: "<type \'object\'>"
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'featurewise_center\', \'samplewise_center\', \'featurewise_std_normalization\', \'samplewise_std_normalization\', \'zca_whitening\', \'zca_epsilon\', \'rotation_range\', \'width_shift_range\', \'height_shift_range\', \'shear_range\', \'zoom_range\', \'channel_shift_range\', \'fill_mode\', \'cval\', \'horizontal_flip\', \'vertical_flip\', \'rescale\', \'preprocessing_function\', \'data_format\'], varargs=None, keywords=None, defaults=[\'False\', \'False\', \'False\', \'False\', \'False\', \'1e-06\', \'0.0\', \'0.0\', \'0.0\', \'0.0\', \'0.0\', \'0.0\', \'nearest\', \'0.0\', \'False\', \'False\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'featurewise_center\', \'samplewise_center\', \'featurewise_std_normalization\', \'samplewise_std_normalization\', \'zca_whitening\', \'zca_epsilon\', \'rotation_range\', \'width_shift_range\', \'height_shift_range\', \'brightness_range\', \'shear_range\', \'zoom_range\', \'channel_shift_range\', \'fill_mode\', \'cval\', \'horizontal_flip\', \'vertical_flip\', \'rescale\', \'preprocessing_function\', \'data_format\', \'validation_split\'], varargs=None, keywords=None, defaults=[\'False\', \'False\', \'False\', \'False\', \'False\', \'1e-06\', \'0.0\', \'0.0\', \'0.0\', \'None\', \'0.0\', \'0.0\', \'0.0\', \'nearest\', \'0.0\', \'False\', \'False\', \'None\', \'None\', \'None\', \'0.0\'], "
   }
   member_method {
     name: "fit"
@@ -12,11 +12,11 @@ tf_class {
   }
   member_method {
     name: "flow"
-    argspec: "args=[\'self\', \'x\', \'y\', \'batch_size\', \'shuffle\', \'seed\', \'save_to_dir\', \'save_prefix\', \'save_format\'], varargs=None, keywords=None, defaults=[\'None\', \'32\', \'True\', \'None\', \'None\', \'\', \'png\'], "
+    argspec: "args=[\'self\', \'x\', \'y\', \'batch_size\', \'shuffle\', \'seed\', \'save_to_dir\', \'save_prefix\', \'save_format\', \'subset\'], varargs=None, keywords=None, defaults=[\'None\', \'32\', \'True\', \'None\', \'None\', \'\', \'png\', \'None\'], "
   }
   member_method {
     name: "flow_from_directory"
-    argspec: "args=[\'self\', \'directory\', \'target_size\', \'color_mode\', \'classes\', \'class_mode\', \'batch_size\', \'shuffle\', \'seed\', \'save_to_dir\', \'save_prefix\', \'save_format\', \'follow_links\', \'interpolation\'], varargs=None, keywords=None, defaults=[\'(256, 256)\', \'rgb\', \'None\', \'categorical\', \'32\', \'True\', \'None\', \'None\', \'\', \'png\', \'False\', \'nearest\'], "
+    argspec: "args=[\'self\', \'directory\', \'target_size\', \'color_mode\', \'classes\', \'class_mode\', \'batch_size\', \'shuffle\', \'seed\', \'save_to_dir\', \'save_prefix\', \'save_format\', \'follow_links\', \'subset\', \'interpolation\'], varargs=None, keywords=None, defaults=[\'(256, 256)\', \'rgb\', \'None\', \'categorical\', \'32\', \'True\', \'None\', \'None\', \'\', \'png\', \'False\', \'None\', \'nearest\'], "
   }
   member_method {
     name: "random_transform"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-numpy-array-iterator.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-numpy-array-iterator.pbtxt
index 4ef6e6e99e3b71d4a6e497cc577ef8b42cebab79..42196ddeee7aab144537eef250c07060923fa6a9 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-numpy-array-iterator.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.-numpy-array-iterator.pbtxt
@@ -6,7 +6,7 @@ tf_class {
   is_instance: "<type \'object\'>"
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'x\', \'y\', \'image_data_generator\', \'batch_size\', \'shuffle\', \'seed\', \'data_format\', \'save_to_dir\', \'save_prefix\', \'save_format\'], varargs=None, keywords=None, defaults=[\'32\', \'False\', \'None\', \'None\', \'None\', \'\', \'png\'], "
+    argspec: "args=[\'self\', \'x\', \'y\', \'image_data_generator\', \'batch_size\', \'shuffle\', \'seed\', \'data_format\', \'save_to_dir\', \'save_prefix\', \'save_format\', \'subset\'], varargs=None, keywords=None, defaults=[\'32\', \'False\', \'None\', \'None\', \'None\', \'\', \'png\', \'None\'], "
   }
   member_method {
     name: "next"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.pbtxt
index d28fef696515e09990d63581de6127fd52c0a4ee..6b850dd6b784412d623f44200b4acc169bf25968 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.image.pbtxt
@@ -36,6 +36,10 @@ tf_module {
     name: "load_img"
     argspec: "args=[\'path\', \'grayscale\', \'target_size\', \'interpolation\'], varargs=None, keywords=None, defaults=[\'False\', \'None\', \'nearest\'], "
   }
+  member_method {
+    name: "random_brightness"
+    argspec: "args=[\'x\', \'brightness_range\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "random_channel_shift"
     argspec: "args=[\'x\', \'intensity\', \'channel_axis\'], varargs=None, keywords=None, defaults=[\'0\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.sequence.-timeseries-generator.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.sequence.-timeseries-generator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d9c3215b555c19bc5cf4b32b0d227a9e1b63ce1e
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.sequence.-timeseries-generator.pbtxt
@@ -0,0 +1,14 @@
+path: "tensorflow.keras.preprocessing.sequence.TimeseriesGenerator"
+tf_class {
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.preprocessing.sequence.TimeseriesGenerator\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.utils.data_utils.Sequence\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'data\', \'targets\', \'length\', \'sampling_rate\', \'stride\', \'start_index\', \'end_index\', \'shuffle\', \'reverse\', \'batch_size\'], varargs=None, keywords=None, defaults=[\'1\', \'1\', \'0\', \'None\', \'False\', \'False\', \'128\'], "
+  }
+  member_method {
+    name: "on_epoch_end"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.sequence.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.sequence.pbtxt
index 1b01935cc53b450c3e7009f945f86c8e1c10bf8e..cf59f8a27269c1161919f7ca2a44c5717a836dd7 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.sequence.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.sequence.pbtxt
@@ -1,5 +1,9 @@
 path: "tensorflow.keras.preprocessing.sequence"
 tf_module {
+  member {
+    name: "TimeseriesGenerator"
+    mtype: "<type \'type\'>"
+  }
   member_method {
     name: "make_sampling_table"
     argspec: "args=[\'size\', \'sampling_factor\'], varargs=None, keywords=None, defaults=[\'1e-05\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.text.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.text.pbtxt
index d106429df0273929472aa58909f554bcffde9bca..50b54fc7e179bdfb8641d8de12934caa3fc44300 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.text.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.preprocessing.text.pbtxt
@@ -4,6 +4,10 @@ tf_module {
     name: "Tokenizer"
     mtype: "<type \'type\'>"
   }
+  member_method {
+    name: "hashing_trick"
+    argspec: "args=[\'text\', \'n\', \'hash_function\', \'filters\', \'lower\', \'split\'], varargs=None, keywords=None, defaults=[\'None\', \'!\"#$%&()*+,-./:;<=>?@[\\\\]^_`{|}~\\t\\n\', \'True\', \' \'], "
+  }
   member_method {
     name: "one_hot"
     argspec: "args=[\'text\', \'n\', \'filters\', \'lower\', \'split\'], varargs=None, keywords=None, defaults=[\'!\"#$%&()*+,-./:;<=>?@[\\\\]^_`{|}~\\t\\n\', \'True\', \' \'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt
index de81206bc8b25046cd48c79ff8f154041c0e0cb0..1c4f550d7f05b8be33326cb39d7a5f3bf663f5e6 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt
index 72d5496464210efd9e423996dfb274dd9564f761..d2db0952693f2989e6a9e8748a254eb4db483206 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt
index 595e77ff9f8b64b6606fb075f3edf2281b4c3c1f..34d9a9df281c09a2e2030daf74a2ceb8066085bb 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt
index 0c4aa2ff2612269727026141574726ad6df5cdbd..21ad0efecf88c42a3a679910ddfe095585a7933a 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.layers.BatchNormalization"
 tf_class {
   is_instance: "<class \'tensorflow.python.layers.normalization.BatchNormalization\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt
index 5f576d0189309442dc4cea3d3617ab3144420165..ed38747c7671a267bb640ecb96a4c5fcc46c5edf 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv1D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt
index 675a7c76e569d3163ecd2c547841b4c36078b21d..ff453c6059477c20528fc768d93c65d208cdfc4a 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt
index eaabbf6aab172aea5c51f8071076890bb6b5bcf7..5583bd22dce18b0a0593b73bde509818b63b3f29 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt
index 838e070d79d2d7cfbd631f1a5e9960412cfdae5a..63f0c32a7c8f7e530c76c64fa619102bc12f9ad9 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv3D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt
index 4bd8cfc1a48cd839e2ffa54d0d0ca863060406d8..b77726252ccca30a7c6555fb569eb65b69e34998 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv3D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt
index 57eccb03ffeb90652b019b5ce8a519797e4a3a3d..92db9f6dcd2f77c4253eb77df4a26fb632b2a766 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.layers.Dense"
 tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Dense\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt
index a1ec00eeeaa98a6199e29b187b0760ddc92db09d..80fa846a24c9162d8521bdb4f098b9cd8e34aedb 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.layers.Dropout"
 tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Dropout\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt
index a06943d51a52f1951056136445b0d5786d801b5b..f63213b3dde40aa54b165c1c269c26fd2cd9e3b4 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.layers.Flatten"
 tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Flatten\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt
index 24fda0c87ed0aeabd0fd4a16bb2efab444f8cd8a..4e45b2d513bb72bb47433d72c310d6a34fbc0c01 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt
@@ -1,6 +1,7 @@
 path: "tensorflow.layers.Layer"
 tf_class {
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt
index 4c3d00e0e1ddfe95c56f9ebc7c5d609c79dd44d4..19ec33fce775caa634e71e2295ac945a6f70ade9 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt
index f7e2017b0c9438130f1cfb2431eb73ca4d3103c5..76180c333a21c592a3b53bb445df9b12d3596552 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt
index 84780926a38ff811a5ab35fadfac690a6dbbbbe2..ded75c8ff09efc6746ddd2284f53d2c021cc473c 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt
index 05799ecfc9fdb9ff44620a67dcdbdc4426fddced..3dbfa5453f8e0ebb02429df9c4cbdf98de6b8ced 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._SeparableConv\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt
index c2aeb35c4648bcce22ca73c838a85803a6b9cedf..ab171df1d1650e19836018f3316e6919f6d36def 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._SeparableConv\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt
index 44536787f09fc98bba8a4eb0bc562427cfe48b8b..9c71a24d0500e2091e0ae94cc4dd7ed6b788a54f 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.LayerRNNCell\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt
index 768565d3cacbd1313ee5a64c9b15f9ab70683772..9e19f96b7452616956fb7fd3ca62d8f4b25a2122 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.LayerRNNCell\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt
index 0d253e5dd233d6d2b6ad0070a463c283a8769dab..7540aa62861895a7c41840476d4edb79785a77a9 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.DeviceWrapper\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt
index 97edf245f6fbed393a6fb8dbf1e83649e9ac4b4e..fc1ff386690f9c7acb11d4cc0770e394f78350ad 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.DropoutWrapper\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt
index 6ecc134d4df866ab5d59e238a8157064421579bd..751122cfff3bf9c55dd9fa264fdf2e1960940724 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.LayerRNNCell\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt
index 4b3ca1578ba52f30e3405ff198fb716496a462c6..4b6313f395fd8fd4ec2af78365117620263e7a55 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.LayerRNNCell\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt
index 9a6c73a079884b8ab92be1c9e89b2a9f34aad851..00e8c71140596ecea237ce05a09feff1fbb49001 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.MultiRNNCell\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt
index 27488f8e73f20456fae911511ecd2e41a60da351..3852f90dd6c4a254e20e789bdeb7796d61cef6bc 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.nn.rnn_cell.RNNCell"
 tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt
index 3310836ed26387718115c2454300b9edfe930451..8f3f0f7506ef49014b31cd4bc04f1cb1e0d696fc 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.ResidualWrapper\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.pbtxt b/tensorflow/tools/api/golden/tensorflow.pbtxt
index f8d08f1d39a8bfa7d78be106e59d88de75a57823..55b82dd76553cf9bb4ebb667595898c16ca3fe25 100644
--- a/tensorflow/tools/api/golden/tensorflow.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.pbtxt
@@ -84,6 +84,10 @@ tf_module {
     name: "GRAPH_DEF_VERSION_MIN_PRODUCER"
     mtype: "<type \'int\'>"
   }
+  member {
+    name: "GradientTape"
+    mtype: "<type \'type\'>"
+  }
   member {
     name: "Graph"
     mtype: "<type \'type\'>"
@@ -596,6 +600,10 @@ tf_module {
     name: "add_to_collection"
     argspec: "args=[\'name\', \'value\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "add_to_collections"
+    argspec: "args=[\'names\', \'value\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "all_variables"
     argspec: "args=[], varargs=None, keywords=None, defaults=None"
@@ -892,6 +900,10 @@ tf_module {
     name: "cumsum"
     argspec: "args=[\'x\', \'axis\', \'exclusive\', \'reverse\', \'name\'], varargs=None, keywords=None, defaults=[\'0\', \'False\', \'False\', \'None\'], "
   }
+  member_method {
+    name: "custom_gradient"
+    argspec: "args=[\'f\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "decode_base64"
     argspec: "args=[\'input\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
@@ -964,6 +976,10 @@ tf_module {
     name: "einsum"
     argspec: "args=[\'equation\'], varargs=inputs, keywords=kwargs, defaults=None"
   }
+  member_method {
+    name: "enable_eager_execution"
+    argspec: "args=[\'config\', \'device_policy\', \'execution_mode\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
+  }
   member_method {
     name: "encode_base64"
     argspec: "args=[\'input\', \'pad\', \'name\'], varargs=None, keywords=None, defaults=[\'False\', \'None\'], "
@@ -980,6 +996,10 @@ tf_module {
     name: "erfc"
     argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
+  member_method {
+    name: "executing_eagerly"
+    argspec: "args=[], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "exp"
     argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
@@ -1094,7 +1114,7 @@ tf_module {
   }
   member_method {
     name: "get_local_variable"
-    argspec: "args=[], varargs=args, keywords=kwargs, defaults=None"
+    argspec: "args=[\'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'collections\', \'caching_device\', \'partitioner\', \'validate_shape\', \'use_resource\', \'custom_getter\', \'constraint\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'False\', \'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
   }
   member_method {
     name: "get_seed"
@@ -1600,6 +1620,10 @@ tf_module {
     name: "reduce_sum"
     argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\'], "
   }
+  member_method {
+    name: "regex_replace"
+    argspec: "args=[\'input\', \'pattern\', \'rewrite\', \'replace_global\', \'name\'], varargs=None, keywords=None, defaults=[\'True\', \'None\'], "
+  }
   member_method {
     name: "register_tensor_conversion_function"
     argspec: "args=[\'base_type\', \'conversion_func\', \'priority\'], varargs=None, keywords=None, defaults=[\'100\'], "
@@ -1996,6 +2020,14 @@ tf_module {
     name: "to_bfloat16"
     argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'ToBFloat16\'], "
   }
+  member_method {
+    name: "to_complex128"
+    argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'ToComplex128\'], "
+  }
+  member_method {
+    name: "to_complex64"
+    argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'ToComplex64\'], "
+  }
   member_method {
     name: "to_double"
     argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'ToDouble\'], "
@@ -2060,6 +2092,10 @@ tf_module {
     name: "unsorted_segment_max"
     argspec: "args=[\'data\', \'segment_ids\', \'num_segments\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
+  member_method {
+    name: "unsorted_segment_mean"
+    argspec: "args=[\'data\', \'segment_ids\', \'num_segments\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
   member_method {
     name: "unsorted_segment_min"
     argspec: "args=[\'data\', \'segment_ids\', \'num_segments\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-adadelta-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-adadelta-optimizer.pbtxt
index c02e54adfbd9f33e661453767b517a5f0de90d57..16bfbf20d5227d6308248bebcb62f32a2df8ef41 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-adadelta-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-adadelta-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.AdadeltaOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.adadelta.AdadeltaOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-adagrad-d-a-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-adagrad-d-a-optimizer.pbtxt
index 2b619908fc6aea3f4b8e6a57d0dcf85a9854d466..61cde9181c2367153b7b289b41bd932482bb92fd 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-adagrad-d-a-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-adagrad-d-a-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.AdagradDAOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.adagrad_da.AdagradDAOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-adagrad-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-adagrad-optimizer.pbtxt
index 2005cf4677c06cf1f8b4207a444690fdd0c2306e..0a998c1afe4fff6e215360bc1cf8fc135754223c 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-adagrad-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-adagrad-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.AdagradOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.adagrad.AdagradOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-adam-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-adam-optimizer.pbtxt
index 0a2bae1d9021b20707e03ae5786e71f388266c14..cc5954152577796ee7a5a6e1cedc873647d64f7c 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-adam-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-adam-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.AdamOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.adam.AdamOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-ftrl-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-ftrl-optimizer.pbtxt
index 847f9ad75998f1bdda8858650091c70fd0b4015b..1add3a902122341a706c38b19ea6ff5882c26445 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-ftrl-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-ftrl-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.FtrlOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.ftrl.FtrlOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-gradient-descent-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-gradient-descent-optimizer.pbtxt
index 13a58e0608ed269415ba78d84a03f1bae128e80c..ef5bbd6ace29abb5c73516176fcc7594a58d493a 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-gradient-descent-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-gradient-descent-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.GradientDescentOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.gradient_descent.GradientDescentOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-momentum-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-momentum-optimizer.pbtxt
index bfbc2357a346c7bfef0242a735ab14c5f4005b22..3d6e87f5eb44de9d6ce1bdd25a54b8df9020cc03 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-momentum-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-momentum-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.MomentumOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.momentum.MomentumOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-optimizer.pbtxt
index 437efa0a2bd04c308db6186e714a5d8785541fa5..e73861ff7cb2d90d8efac72cdd7de3b27395f29e 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-optimizer.pbtxt
@@ -1,7 +1,6 @@
 path: "tensorflow.train.Optimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-proximal-adagrad-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-proximal-adagrad-optimizer.pbtxt
index 72f224605f67e72dd78699b5f1a703cc3edd566b..301b35b199c87890a0aef4139eb06253592ce0c4 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-proximal-adagrad-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-proximal-adagrad-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.ProximalAdagradOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.proximal_adagrad.ProximalAdagradOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-proximal-gradient-descent-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-proximal-gradient-descent-optimizer.pbtxt
index 316275b1fb1abd384e193994e35115a1c463f07d..8815befa936a85522011111a4a6270d22cbc25ae 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-proximal-gradient-descent-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-proximal-gradient-descent-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.ProximalGradientDescentOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.proximal_gradient_descent.ProximalGradientDescentOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-r-m-s-prop-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-r-m-s-prop-optimizer.pbtxt
index af50a1986100d830f0809a3f4a0f01faa8821b3b..e9819683ba5ec1bcacb3cdbcb2d787e866a77b6f 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-r-m-s-prop-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-r-m-s-prop-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.RMSPropOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.rmsprop.RMSPropOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-sync-replicas-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-sync-replicas-optimizer.pbtxt
index 6edc516c9392fa14f23ffc2a6481ec21216f06cf..3db96aff876b88b80b647570cf68b1ebc0b2da3b 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-sync-replicas-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-sync-replicas-optimizer.pbtxt
@@ -2,7 +2,6 @@ path: "tensorflow.train.SyncReplicasOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.sync_replicas_optimizer.SyncReplicasOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
   is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-vocab-info.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-vocab-info.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4ce7cb111163e103a1cebe30d5c6f3eeb4234693
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.train.-vocab-info.pbtxt
@@ -0,0 +1,39 @@
+path: "tensorflow.train.VocabInfo"
+tf_class {
+  is_instance: "<class \'tensorflow.python.training.warm_starting_util.VocabInfo\'>"
+  is_instance: "<class \'tensorflow.python.training.warm_starting_util.VocabInfo\'>"
+  is_instance: "<type \'tuple\'>"
+  member {
+    name: "backup_initializer"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "new_vocab"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "new_vocab_size"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "num_oov_buckets"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "old_vocab"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "old_vocab_size"
+    mtype: "<type \'property\'>"
+  }
+  member_method {
+    name: "__init__"
+  }
+  member_method {
+    name: "count"
+  }
+  member_method {
+    name: "index"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.train.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.pbtxt
index e49c719a334455d1f8f39fa67332be8bb81f2bc2..c75ee474aa471524e4b5c8a7e2dd4a9da4b08eae 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.pbtxt
@@ -224,6 +224,10 @@ tf_module {
     name: "SyncReplicasOptimizer"
     mtype: "<type \'type\'>"
   }
+  member {
+    name: "VocabInfo"
+    mtype: "<type \'type\'>"
+  }
   member {
     name: "WorkerSessionCreator"
     mtype: "<type \'type\'>"
@@ -402,7 +406,7 @@ tf_module {
   }
   member_method {
     name: "sdca_optimizer"
-    argspec: "args=[\'sparse_example_indices\', \'sparse_feature_indices\', \'sparse_feature_values\', \'dense_features\', \'example_weights\', \'example_labels\', \'sparse_indices\', \'sparse_weights\', \'dense_weights\', \'example_state_data\', \'loss_type\', \'l1\', \'l2\', \'num_loss_partitions\', \'num_inner_iterations\', \'adaptative\', \'name\'], varargs=None, keywords=None, defaults=[\'False\', \'None\'], "
+    argspec: "args=[\'sparse_example_indices\', \'sparse_feature_indices\', \'sparse_feature_values\', \'dense_features\', \'example_weights\', \'example_labels\', \'sparse_indices\', \'sparse_weights\', \'dense_weights\', \'example_state_data\', \'loss_type\', \'l1\', \'l2\', \'num_loss_partitions\', \'num_inner_iterations\', \'adaptative\', \'name\'], varargs=None, keywords=None, defaults=[\'True\', \'None\'], "
   }
   member_method {
     name: "sdca_shrink_l1"
@@ -436,6 +440,10 @@ tf_module {
     name: "update_checkpoint_state"
     argspec: "args=[\'save_dir\', \'model_checkpoint_path\', \'all_model_checkpoint_paths\', \'latest_filename\'], varargs=None, keywords=None, defaults=[\'None\', \'None\'], "
   }
+  member_method {
+    name: "warm_start"
+    argspec: "args=[\'ckpt_to_initialize_from\', \'vars_to_warm_start\', \'var_name_to_vocab_info\', \'var_name_to_prev_var_name\'], varargs=None, keywords=None, defaults=[\'.*\', \'None\', \'None\'], "
+  }
   member_method {
     name: "write_graph"
     argspec: "args=[\'graph_or_graph_def\', \'logdir\', \'name\', \'as_text\'], varargs=None, keywords=None, defaults=[\'True\'], "
diff --git a/tensorflow/tools/api/tests/BUILD b/tensorflow/tools/api/tests/BUILD
index 608a34ab7b32bdc26cebbe43b383155406fb51b2..15bf1abb5f8f541c435be77b1a3c2f13382f2438 100644
--- a/tensorflow/tools/api/tests/BUILD
+++ b/tensorflow/tools/api/tests/BUILD
@@ -23,6 +23,7 @@ py_test(
     ],
     srcs_version = "PY2AND3",
     deps = [
+        "//tensorflow:experimental_tensorflow_py",
         "//tensorflow:tensorflow_py",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:lib",
diff --git a/tensorflow/tools/api/tests/api_compatibility_test.py b/tensorflow/tools/api/tests/api_compatibility_test.py
index c1e09cc531ed8e8995e3e73b87e96b72fba6c038..603b2a4327b94873b9908d5e0e114dcc4f7542dc 100644
--- a/tensorflow/tools/api/tests/api_compatibility_test.py
+++ b/tensorflow/tools/api/tests/api_compatibility_test.py
@@ -34,6 +34,7 @@ import sys
 import unittest
 
 import tensorflow as tf
+from tensorflow import experimental_api as api
 
 from google.protobuf import text_format
 
@@ -46,6 +47,9 @@ from tensorflow.tools.api.lib import python_object_to_proto_visitor
 from tensorflow.tools.common import public_api
 from tensorflow.tools.common import traverse
 
+if hasattr(tf, 'experimental_api'):
+  del tf.experimental_api
+
 # FLAGS defined at the bottom:
 FLAGS = None
 # DEFINE_boolean, update_goldens, default False:
@@ -54,7 +58,7 @@ _UPDATE_GOLDENS_HELP = """
      have to be authorized by TensorFlow leads.
 """
 
-# DEFINE_boolean, verbose_diffs, default False:
+# DEFINE_boolean, verbose_diffs, default True:
 _VERBOSE_DIFFS_HELP = """
      If set to true, print line by line diffs on all libraries. If set to
      false, only print which libraries have differences.
@@ -109,7 +113,8 @@ class ApiCompatibilityTest(test.TestCase):
                              expected_dict,
                              actual_dict,
                              verbose=False,
-                             update_goldens=False):
+                             update_goldens=False,
+                             additional_missing_object_message=''):
     """Diff given dicts of protobufs and report differences a readable way.
 
     Args:
@@ -120,6 +125,8 @@ class ApiCompatibilityTest(test.TestCase):
       verbose: Whether to log the full diffs, or simply report which files were
           different.
       update_goldens: Whether to update goldens when there are diffs found.
+      additional_missing_object_message: Message to print when a symbol is
+          missing.
     """
     diffs = []
     verbose_diffs = []
@@ -138,7 +145,8 @@ class ApiCompatibilityTest(test.TestCase):
       verbose_diff_message = ''
       # First check if the key is not found in one or the other.
       if key in only_in_expected:
-        diff_message = 'Object %s expected but not found (removed).' % key
+        diff_message = 'Object %s expected but not found (removed). %s' % (
+            key, additional_missing_object_message)
         verbose_diff_message = diff_message
       elif key in only_in_actual:
         diff_message = 'New object %s found (added).' % key
@@ -165,7 +173,7 @@ class ApiCompatibilityTest(test.TestCase):
       logging.error('%d differences found between API and golden.', diff_count)
       messages = verbose_diffs if verbose else diffs
       for i in range(diff_count):
-        logging.error('Issue %d\t: %s', i + 1, messages[i])
+        print('Issue %d\t: %s' % (i + 1, messages[i]), file=sys.stderr)
 
       if update_goldens:
         # Write files if requested.
@@ -229,13 +237,56 @@ class ApiCompatibilityTest(test.TestCase):
         verbose=FLAGS.verbose_diffs,
         update_goldens=FLAGS.update_goldens)
 
+  @unittest.skipUnless(
+      sys.version_info.major == 2,
+      'API compatibility test goldens are generated using python2.')
+  def testNewAPIBackwardsCompatibility(self):
+    # Extract all API stuff.
+    visitor = python_object_to_proto_visitor.PythonObjectToProtoVisitor()
+
+    public_api_visitor = public_api.PublicAPIVisitor(visitor)
+    public_api_visitor.do_not_descend_map['tf'].append('contrib')
+    public_api_visitor.do_not_descend_map['tf.GPUOptions'] = ['Experimental']
+    # TODO(annarev): Make slide_dataset available in API.
+    public_api_visitor.private_map['tf'] = ['slide_dataset']
+    traverse.traverse(api, public_api_visitor)
+
+    proto_dict = visitor.GetProtos()
+
+    # Read all golden files.
+    expression = os.path.join(
+        resource_loader.get_root_dir_with_all_resources(),
+        _KeyToFilePath('*'))
+    golden_file_list = file_io.get_matching_files(expression)
+
+    def _ReadFileToProto(filename):
+      """Read a filename, create a protobuf from its contents."""
+      ret_val = api_objects_pb2.TFAPIObject()
+      text_format.Merge(file_io.read_file_to_string(filename), ret_val)
+      return ret_val
+
+    golden_proto_dict = {
+        _FileNameToKey(filename): _ReadFileToProto(filename)
+        for filename in golden_file_list
+    }
+
+    # Diff them against the goldens. This test never rewrites the goldens:
+    # update_goldens is hard-coded to False below.
+    self._AssertProtoDictEquals(
+        golden_proto_dict,
+        proto_dict,
+        verbose=FLAGS.verbose_diffs,
+        update_goldens=False,
+        additional_missing_object_message=
+        'Check if tf_export decorator/call is missing for this symbol.')
+
 
 if __name__ == '__main__':
   parser = argparse.ArgumentParser()
   parser.add_argument(
       '--update_goldens', type=bool, default=False, help=_UPDATE_GOLDENS_HELP)
   parser.add_argument(
-      '--verbose_diffs', type=bool, default=False, help=_VERBOSE_DIFFS_HELP)
+      '--verbose_diffs', type=bool, default=True, help=_VERBOSE_DIFFS_HELP)
   FLAGS, unparsed = parser.parse_known_args()
 
   # Now update argv, so that unittest library does not get confused.
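Review note: the key-level comparison that `_AssertProtoDictEquals` performs, including the new hint appended for missing symbols, can be sketched roughly as below. This is a minimal illustration rather than the test's actual implementation: `diff_keys`, `missing_hint`, and the "Change detected" wording are hypothetical names chosen for the example, and plain Python values stand in for the protobufs.

```python
def diff_keys(expected, actual, missing_hint=''):
    """Return one message per key that differs between two dicts.

    `expected` plays the role of the golden dict and `actual` the freshly
    extracted API; `missing_hint` mirrors additional_missing_object_message.
    """
    messages = []
    for key in sorted(set(expected) | set(actual)):
        if key not in actual:
            # Only in the goldens: the symbol was removed (or never exported).
            message = 'Object %s expected but not found (removed). %s' % (
                key, missing_hint)
            messages.append(message.rstrip())
        elif key not in expected:
            # Only in the live API: a new symbol was added.
            messages.append('New object %s found (added).' % key)
        elif expected[key] != actual[key]:
            # Present on both sides but with different contents.
            messages.append('Change detected in object %s.' % key)
    return messages


if __name__ == '__main__':
    golden = {'tf.a': 1, 'tf.b': 2}
    live = {'tf.b': 3, 'tf.c': 4}
    for issue in diff_keys(golden, live, 'Check if tf_export is missing.'):
        print(issue)
```

Passing the hint as a parameter keeps the shared diff helper generic, so only the caller that compares against the new generated API needs to mention `tf_export`.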
diff --git a/tensorflow/tools/benchmark/benchmark_model.cc b/tensorflow/tools/benchmark/benchmark_model.cc
index ecab6f8769ae2d0126f63580030ed6ff756015d0..15523028c726fefa13641a1369cf4274bcfb9973 100644
--- a/tensorflow/tools/benchmark/benchmark_model.cc
+++ b/tensorflow/tools/benchmark/benchmark_model.cc
@@ -48,33 +48,14 @@ limitations under the License.
 namespace tensorflow {
 namespace benchmark_model {
 
-Status InitializeSession(int num_threads, const string& graph,
-                         std::unique_ptr<Session>* session,
-                         std::unique_ptr<GraphDef>* graph_def) {
-  LOG(INFO) << "Loading TensorFlow.";
+namespace {
 
-  tensorflow::SessionOptions options;
-  tensorflow::ConfigProto& config = options.config;
-  if (num_threads > 0) {
-    config.set_intra_op_parallelism_threads(num_threads);
+Status InitializeVariables(Session* session,
+                           const std::vector<string>& init_ops) {
+  LOG(INFO) << "Initializing graph variables";
+  for (const string& init_op : init_ops) {
+    TF_RETURN_IF_ERROR(session->Run({}, {}, {init_op}, nullptr));
   }
-  LOG(INFO) << "Got config, " << config.device_count_size() << " devices";
-
-  session->reset(tensorflow::NewSession(options));
-  graph_def->reset(new GraphDef());
-  tensorflow::GraphDef tensorflow_graph;
-  Status s = ReadBinaryProto(Env::Default(), graph, graph_def->get());
-  if (!s.ok()) {
-    LOG(ERROR) << "Could not create TensorFlow Graph: " << s;
-    return s;
-  }
-
-  s = (*session)->Create(*(graph_def->get()));
-  if (!s.ok()) {
-    LOG(ERROR) << "Could not create TensorFlow Session: " << s;
-    return s;
-  }
-
   return Status::OK();
 }
 
@@ -247,8 +228,56 @@ void RecordBenchmarkEntry(const string& output_prefix,
   TF_QCHECK_OK(node_reporter.Close());
 }
 
+void SleepSeconds(double sleep_seconds) {
+  if (sleep_seconds <= 0.0) {
+    return;
+  }
+#ifdef PLATFORM_WINDOWS
+  Sleep(sleep_seconds * 1000);
+#else
+  // Convert sleep_seconds into a timespec.
+  timespec req;
+  req.tv_sec = static_cast<time_t>(sleep_seconds);
+  req.tv_nsec = (sleep_seconds - req.tv_sec) * 1000000000;
+  nanosleep(&req, nullptr);
+#endif
+}
+
+}  // namespace
+
+Status InitializeSession(int num_threads, const string& graph,
+                         std::unique_ptr<Session>* session,
+                         std::unique_ptr<GraphDef>* graph_def) {
+  LOG(INFO) << "Loading TensorFlow.";
+
+  tensorflow::SessionOptions options;
+  tensorflow::ConfigProto& config = options.config;
+  if (num_threads > 0) {
+    config.set_intra_op_parallelism_threads(num_threads);
+  }
+  LOG(INFO) << "Got config, " << config.device_count_size() << " devices";
+
+  session->reset(tensorflow::NewSession(options));
+  graph_def->reset(new GraphDef());
+  tensorflow::GraphDef tensorflow_graph;
+  Status s = ReadBinaryProto(Env::Default(), graph, graph_def->get());
+  if (!s.ok()) {
+    LOG(ERROR) << "Could not create TensorFlow Graph: " << s;
+    return s;
+  }
+
+  s = (*session)->Create(*(graph_def->get()));
+  if (!s.ok()) {
+    LOG(ERROR) << "Could not create TensorFlow Session: " << s;
+    return s;
+  }
+
+  return Status::OK();
+}
+
 Status RunBenchmark(const std::vector<InputLayerInfo>& inputs,
-                    const std::vector<string>& outputs, Session* session,
+                    const std::vector<string>& outputs,
+                    const std::vector<string>& targets, Session* session,
                     StatSummarizer* stats, int64* inference_time_us) {
   std::vector<std::pair<string, tensorflow::Tensor> > input_tensors;
   CreateTensorsFromInputInfo(inputs, &input_tensors);
@@ -264,8 +293,8 @@ Status RunBenchmark(const std::vector<InputLayerInfo>& inputs,
 
   RunMetadata run_metadata;
   const int64 start_time = Env::Default()->NowMicros();
-  s = session->Run(run_options, input_tensors, outputs, {}, &output_tensors,
-                   &run_metadata);
+  s = session->Run(run_options, input_tensors, outputs, targets,
+                   &output_tensors, &run_metadata);
   const int64 end_time = Env::Default()->NowMicros();
   *inference_time_us = end_time - start_time;
 
@@ -283,24 +312,10 @@ Status RunBenchmark(const std::vector<InputLayerInfo>& inputs,
   return s;
 }
 
-void SleepSeconds(double sleep_seconds) {
-  if (sleep_seconds <= 0.0) {
-    return;
-  }
-#ifdef PLATFORM_WINDOWS
-  Sleep(sleep_seconds * 1000);
-#else
-  // Convert the inference_delay string into a timespec.
-  timespec req;
-  req.tv_sec = static_cast<time_t>(sleep_seconds);
-  req.tv_nsec = (sleep_seconds - req.tv_sec) * 1000000000;
-  nanosleep(&req, nullptr);
-#endif
-}
-
 Status TimeMultipleRuns(double sleep_seconds, int num_runs, double max_time_s,
                         const std::vector<InputLayerInfo>& inputs,
-                        const std::vector<string>& outputs, Session* session,
+                        const std::vector<string>& outputs,
+                        const std::vector<string>& targets, Session* session,
                         StatSummarizer* stats, int64* total_time_us,
                         int64* actual_num_runs) {
   *total_time_us = 0;
@@ -315,7 +330,8 @@ Status TimeMultipleRuns(double sleep_seconds, int num_runs, double max_time_s,
   const bool until_max_time = num_runs <= 0;
   for (int i = 0; until_max_time || i < num_runs; ++i) {
     int64 time;
-    Status run_status = RunBenchmark(inputs, outputs, session, stats, &time);
+    Status run_status =
+        RunBenchmark(inputs, outputs, targets, session, stats, &time);
     stat.UpdateStat(time);
     (*total_time_us) += time;
     ++(*actual_num_runs);
@@ -345,11 +361,13 @@ Status TimeMultipleRuns(double sleep_seconds, int num_runs, double max_time_s,
 
 int Main(int argc, char** argv) {
   string graph = "/data/local/tmp/tensorflow_inception_graph.pb";
+  string init_ops_string = "";
   string input_layer_string = "input:0";
   string input_layer_shape_string = "1,224,224,3";
   string input_layer_type_string = "float";
   string input_layer_values_string = "";
   string output_layer_string = "output:0";
+  string target_layer_string = "";
   int max_num_runs = 1000;
   string max_time = "10.0";
   string inference_delay = "-1.0";
@@ -371,12 +389,14 @@ int Main(int argc, char** argv) {
 
   std::vector<Flag> flag_list = {
       Flag("graph", &graph, "graph file name"),
+      Flag("init_ops", &init_ops_string, "init ops"),
       Flag("input_layer", &input_layer_string, "input layer names"),
       Flag("input_layer_shape", &input_layer_shape_string, "input layer shape"),
       Flag("input_layer_type", &input_layer_type_string, "input layer type"),
       Flag("input_layer_values", &input_layer_values_string,
            "values to initialize the inputs with"),
       Flag("output_layer", &output_layer_string, "output layer name"),
+      Flag("target_layer", &target_layer_string, "target layer name"),
       Flag("max_num_runs", &max_num_runs, "number of runs max"),
       Flag("max_time", &max_time, "length to run max"),
       Flag("inference_delay", &inference_delay,
@@ -410,6 +430,7 @@ int Main(int argc, char** argv) {
     return -1;
   }
 
+  std::vector<string> init_ops = str_util::Split(init_ops_string, ',');
   std::vector<string> input_layers = str_util::Split(input_layer_string, ',');
   std::vector<string> input_layer_shapes =
       str_util::Split(input_layer_shape_string, ':');
@@ -418,6 +439,7 @@ int Main(int argc, char** argv) {
   std::vector<string> input_layer_values =
       str_util::Split(input_layer_values_string, ':');
   std::vector<string> output_layers = str_util::Split(output_layer_string, ',');
+  std::vector<string> target_layers = str_util::Split(target_layer_string, ',');
   if ((input_layers.size() != input_layer_shapes.size()) ||
       (input_layers.size() != input_layer_types.size())) {
     LOG(ERROR) << "There must be the same number of items in --input_layer,"
@@ -441,10 +463,12 @@ int Main(int argc, char** argv) {
   }
 
   LOG(INFO) << "Graph: [" << graph << "]";
+  LOG(INFO) << "Init ops: [" << init_ops_string << "]";
   LOG(INFO) << "Input layers: [" << input_layer_string << "]";
   LOG(INFO) << "Input shapes: [" << input_layer_shape_string << "]";
   LOG(INFO) << "Input types: [" << input_layer_type_string << "]";
   LOG(INFO) << "Output layers: [" << output_layer_string << "]";
+  LOG(INFO) << "Target layers: [" << target_layer_string << "]";
   LOG(INFO) << "Num runs: [" << max_num_runs << "]";
   LOG(INFO) << "Inter-inference delay (seconds): [" << inference_delay << "]";
   LOG(INFO) << "Inter-benchmark delay (seconds): [" << inter_benchmark_delay
@@ -470,6 +494,16 @@ int Main(int argc, char** argv) {
     return -1;
   }
 
+  if (!init_ops.empty()) {
+    Status initialize_variables_status =
+        InitializeVariables(session.get(), init_ops);
+    if (!initialize_variables_status.ok()) {
+      LOG(ERROR) << "Graph variables initialization failed with "
+                 << initialize_variables_status;
+      return -1;
+    }
+  }
+
   StatSummarizerOptions stats_options;
   stats_options.show_run_order = show_run_order;
   stats_options.run_order_limit = run_order_limit;
@@ -520,9 +554,10 @@ int Main(int argc, char** argv) {
   int64 warmup_time_us = 0;
   int64 num_warmup_runs = 0;
   if (warmup_runs > 0) {
-    Status warmup_time_status = TimeMultipleRuns(
-        inter_inference_sleep_seconds, warmup_runs, -1.0, inputs, output_layers,
-        session.get(), nullptr, &warmup_time_us, &num_warmup_runs);
+    Status warmup_time_status =
+        TimeMultipleRuns(inter_inference_sleep_seconds, warmup_runs, -1.0,
+                         inputs, output_layers, target_layers, session.get(),
+                         nullptr, &warmup_time_us, &num_warmup_runs);
     if (!warmup_time_status.ok()) {
       LOG(ERROR) << "Timing failed with " << warmup_time_status;
       return -1;
@@ -536,8 +571,8 @@ int Main(int argc, char** argv) {
   int64 no_stat_num_runs = 0;
   Status no_stat_time_status = TimeMultipleRuns(
       inter_inference_sleep_seconds, max_num_runs, max_benchmark_time_seconds,
-      inputs, output_layers, session.get(), nullptr, &no_stat_time_us,
-      &no_stat_num_runs);
+      inputs, output_layers, target_layers, session.get(), nullptr,
+      &no_stat_time_us, &no_stat_num_runs);
   const double no_stat_wall_time = no_stat_time_us / 1000000.0;
   if (!no_stat_time_status.ok()) {
     LOG(ERROR) << "Timing failed with " << no_stat_time_status;
@@ -551,8 +586,8 @@ int Main(int argc, char** argv) {
   int64 stat_num_runs = 0;
   Status stat_time_status = TimeMultipleRuns(
       inter_inference_sleep_seconds, max_num_runs, max_benchmark_time_seconds,
-      inputs, output_layers, session.get(), stats.get(), &stat_time_us,
-      &stat_num_runs);
+      inputs, output_layers, target_layers, session.get(), stats.get(),
+      &stat_time_us, &stat_num_runs);
   if (!stat_time_status.ok()) {
     LOG(ERROR) << "Timing failed with " << stat_time_status;
     return -1;
diff --git a/tensorflow/tools/benchmark/benchmark_model.h b/tensorflow/tools/benchmark/benchmark_model.h
index dff62c5b5d518da8f9034295626e46db783f343d..dc5f0080374e70edad52965cc0a95f99751baa48 100644
--- a/tensorflow/tools/benchmark/benchmark_model.h
+++ b/tensorflow/tools/benchmark/benchmark_model.h
@@ -37,13 +37,15 @@ Status InitializeSession(int num_threads, const string& graph,
 
 // Does a single run of the model that's been loaded into the given session.
 Status RunBenchmark(const std::vector<InputLayerInfo>& inputs,
-                    const std::vector<string>& outputs, Session* session,
+                    const std::vector<string>& outputs,
+                    const std::vector<string>& targets, Session* session,
                     StatSummarizer* stats, int64* inference_time_us);
 
 // Runs the model multiple time, keeping track of timing information.
 Status TimeMultipleRuns(double sleep_seconds, int num_runs, double max_time_s,
                         const std::vector<InputLayerInfo>& inputs,
-                        const std::vector<string>& outputs, Session* session,
+                        const std::vector<string>& outputs,
+                        const std::vector<string>& targets, Session* session,
                         StatSummarizer* stats, int64* total_time_us,
                         int64* actual_num_runs);
 
diff --git a/tensorflow/tools/benchmark/benchmark_model_test.cc b/tensorflow/tools/benchmark/benchmark_model_test.cc
index bb4eb5352039b01a6692621906eff005187cfa36..16ab2ff66e763a0ca5130f075f988bade9c8abd1 100644
--- a/tensorflow/tools/benchmark/benchmark_model_test.cc
+++ b/tensorflow/tools/benchmark/benchmark_model_test.cc
@@ -64,8 +64,8 @@ TEST(BenchmarkModelTest, InitializeAndRun) {
   int64 time;
   int64 num_runs = 0;
   TF_ASSERT_OK(benchmark_model::TimeMultipleRuns(
-      0.0, 10, 0.0, {input}, {output_name}, session.get(), stats.get(), &time,
-      &num_runs));
+      0.0, 10, 0.0, {input}, {output_name}, {}, session.get(), stats.get(),
+      &time, &num_runs));
   ASSERT_EQ(num_runs, 10);
 }
 
diff --git a/tensorflow/tools/ci_build/Dockerfile.cmake b/tensorflow/tools/ci_build/Dockerfile.cmake
index ec90c83aacd068e8f9c16e5be8eb6e1cef098ea6..d5dea4f3e41841aed5aeac02fcca850dbfdfaeb3 100644
--- a/tensorflow/tools/ci_build/Dockerfile.cmake
+++ b/tensorflow/tools/ci_build/Dockerfile.cmake
@@ -23,11 +23,12 @@ RUN /install/install_deb_packages.sh
 
 RUN apt-get update
 RUN apt-get install -y --no-install-recommends python-pip
+RUN pip install --upgrade wheel
 RUN pip install --upgrade astor
 RUN pip install --upgrade gast
 RUN pip install --upgrade numpy
 RUN pip install --upgrade termcolor
 
 # Install golang
-RUN add-apt-repository -y ppa:ubuntu-lxc/lxd-stable
-RUN apt-get install -y golang
+RUN apt-get install -t xenial-backports -y golang-1.9
+ENV PATH=${PATH}:/usr/lib/go-1.9/bin
diff --git a/tensorflow/tools/ci_build/Dockerfile.rbe.cpu b/tensorflow/tools/ci_build/Dockerfile.rbe.cpu
new file mode 100644
index 0000000000000000000000000000000000000000..6f0798b1afc34bc08df6f3f8f467a329fcf0fe9b
--- /dev/null
+++ b/tensorflow/tools/ci_build/Dockerfile.rbe.cpu
@@ -0,0 +1,14 @@
+FROM launcher.gcr.io/google/rbe-debian8:r322167
+LABEL maintainer="Yu Yi <yiyu@google.com>"
+
+# Copy install scripts
+COPY install/*.sh /install/
+
+# Setup envvars
+ENV CC /usr/local/bin/clang
+ENV CXX /usr/local/bin/clang++
+ENV AR /usr/bin/ar
+
+# Run pip install script for RBE Debian8 container.
+RUN /install/install_pip_packages_remote.sh
+RUN /install/install_pip_packages.sh
diff --git a/tensorflow/tools/ci_build/Dockerfile.rbe.gpu b/tensorflow/tools/ci_build/Dockerfile.rbe.gpu
new file mode 100644
index 0000000000000000000000000000000000000000..24ff4765a619701cd614414d2b06f7fa4ce7d8c0
--- /dev/null
+++ b/tensorflow/tools/ci_build/Dockerfile.rbe.gpu
@@ -0,0 +1,26 @@
+FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
+
+LABEL maintainer="Nick Lopez <ngiraldo@google.com>"
+
+# In the Ubuntu 16.04 images, cudnn is placed in system paths. Move them to
+# /usr/local/cuda
+RUN cp -P /usr/include/cudnn.h /usr/local/cuda/include
+RUN cp -P /usr/lib/x86_64-linux-gnu/libcudnn* /usr/local/cuda/lib64
+
+# Copy and run the install scripts.
+COPY install/*.sh /install/
+ARG DEBIAN_FRONTEND=noninteractive
+RUN /install/install_bootstrap_deb_packages.sh
+RUN add-apt-repository -y ppa:openjdk-r/ppa && \
+    add-apt-repository -y ppa:george-edison55/cmake-3.x
+RUN /install/install_deb_packages.sh
+RUN /install/install_pip_packages.sh
+RUN /install/install_golang.sh
+
+# Install clang from pre-built package
+RUN cd /tmp && \
+    wget https://storage.googleapis.com/clang-builds-stable/clang-ubuntu16_04/clang_r323528.tar.gz && \
+    echo "26752d9f5785df07193fac8316ba5d5ba3bec36d970c29a1577360848818ac74  clang_r323528.tar.gz" | sha256sum -c && \
+    tar -C /usr/local -xf clang_r323528.tar.gz && \
+    rm clang_r323528.tar.gz
+
diff --git a/tensorflow/tools/ci_build/builds/test_tutorials.sh b/tensorflow/tools/ci_build/builds/test_tutorials.sh
index 67e5af556405a5c659000a07a79a6bd9a1d1e542..db335f14ca4f88ade7a540ffab7ed9de67f1248e 100755
--- a/tensorflow/tools/ci_build/builds/test_tutorials.sh
+++ b/tensorflow/tools/ci_build/builds/test_tutorials.sh
@@ -277,17 +277,6 @@ test_ptb_word_lm() {
   fi
 }
 
-
-# -----------------------------------------------------------
-# translate_test
-test_translate_test() {
-  LOG_FILE=$1
-
-  run_in_directory "${TEST_DIR}" "${LOG_FILE}" \
-    "${TF_MODELS_DIR}/tutorials/rnn/translate/translate.py" --self_test=True
-}
-
-
 # Run the tutorial tests
 test_runner "tutorial test-on-install" \
     "${TUT_TESTS}" "${TF_BUILD_TUT_TEST_BLACKLIST}" "${LOGS_DIR}"
diff --git a/tensorflow/tools/ci_build/remote/remote_docker_build.sh b/tensorflow/tools/ci_build/ci_rbe_docker_build.sh
similarity index 58%
rename from tensorflow/tools/ci_build/remote/remote_docker_build.sh
rename to tensorflow/tools/ci_build/ci_rbe_docker_build.sh
index e00a66aabaf1068c772aabce2391616518be44d4..cd811de6bdf9275b799a608381c76713a6c7a65b 100755
--- a/tensorflow/tools/ci_build/remote/remote_docker_build.sh
+++ b/tensorflow/tools/ci_build/ci_rbe_docker_build.sh
@@ -16,25 +16,19 @@
 # Build TensorFlow Docker images for remote build
 #
 # Usage:
-#   remote_docker_build.sh -c # docker image for cpu build
-#   remote_docker_build.sh -g # docker image for gpu build
-
+#   ci_rbe_docker_build.sh -c # docker image for cpu build
+#   ci_rbe_docker_build.sh -g # docker image for gpu build
 
 function main {
-  publish=true
   cpu_build=false
   gpu_build=false
-  publish=true
+  publish=false
 
   script_dir=$(dirname "$(readlink -f "$0")")
   cd $script_dir
 
-  trap cleanup_on_finish EXIT
-
   set_script_flags $@
 
-  build_base_image
-
   build_tf_image
 
   if [ "$publish" = true ] ; then
@@ -50,17 +44,14 @@ function set_script_flags {
       c)
         cpu_build=true
         ;;
-      f)
-        base_image_build_script=$OPTARG
-        ;;
       g)
         gpu_build=true
         ;;
       h)
         print_usage
         ;;
-      n)
-        publish=false
+      p)
+        publish=true
         ;;
       *)
         print_usage "ERROR: unknown option"
@@ -76,7 +67,6 @@ function print_usage {
   echo "Usage: $(basename $0) -c | -g [options]"
   echo "  -c build image for CPU build (base image debian8-clang)"
   echo "  -g build image for GPU build (base image nvidia-clang)"
-  echo "  -f the script which build the {debian8,nvidia}-clang base image"
   echo "[option] is one of"
   echo "  -n not publish the locally-built image to GCR;"
   echo "     the build process will publish image to GCR by default"
@@ -87,54 +77,22 @@ function print_usage {
   exit 1
 }
 
-
-# Build nvidia-cuba-clang base image for GPU image.
-# For CPU the `clang-debian8` from Cloud Launcher will be used directly:
-# https://console.cloud.google.com/launcher/details/google/clang-debian8?filter=category:developer-tools&q=clang
-function build_base_image {
-  if [ "$gpu_build" = true ] ; then
-    base_image="nvidia-cuda"
-    # Run a 2-stage build for clang base image, see
-    # https://github.com/llvm-mirror/llvm/blob/master/docs/Docker.rst
-    $base_image_build_script \
-      --source $base_image \
-      --branch branches/google/stable \
-      --docker-repository ${base_image}-clang --docker-tag "latest" \
-      -p clang -i stage2-install-clang -i stage2-install-clang-headers \
-      -- \
-      -DLLVM_TARGETS_TO_BUILD=Native -DCMAKE_BUILD_TYPE=Release \
-      -DBOOTSTRAP_CMAKE_BUILD_TYPE=Release \
-      -DCLANG_ENABLE_BOOTSTRAP=ON \
-      -DCLANG_BOOTSTRAP_TARGETS="install-clang;install-clang-headers"
-  fi
-}
-
-
 function build_tf_image {
   if [ "$cpu_build" = true ] ; then
-    dockerfile="Dockerfile.cpu"
-    tf_image="tensorflow-remote"
+    dockerfile="Dockerfile.rbe.cpu"
+    tf_image="tensorflow-rbe-cpu"
   else
-    dockerfile="Dockerfile.gpu"
-    tf_image="tensorflow-remote-gpu"
+    dockerfile="Dockerfile.rbe.gpu"
+    tf_image="tensorflow-rbe-gpu"
   fi
 
   docker build -f $dockerfile -t $tf_image .
 }
 
-
 function publish_tf_image {
   gcr_tf_image="gcr.io/tensorflow/${tf_image}"
   docker tag $tf_image $gcr_tf_image
   gcloud docker -- push $gcr_tf_image
 }
 
-
-function cleanup_on_finish {
-  cd $script_dir
-  rm -rf $llvm_docker_src
-  docker rmi -f ${base_image}-clang ${base_image}-clang-build
-}
-
-
 main $@
diff --git a/tensorflow/tools/ci_build/copy_binary.py b/tensorflow/tools/ci_build/copy_binary.py
index 90fd6a6e71f19649406234bc93025c15e4a5063c..420d390d2b9dc1ec25461b3502c63467a7eda16b 100755
--- a/tensorflow/tools/ci_build/copy_binary.py
+++ b/tensorflow/tools/ci_build/copy_binary.py
@@ -29,13 +29,9 @@ import argparse
 import os
 import re
 import shutil
-import subprocess
+import tempfile
 import zipfile
 
-UNZIP_CMD = "/usr/bin/unzip"
-ZIP_CMD = "/usr/bin/zip"
-SED_CMD = "/bin/sed"
-
 TF_NIGHTLY_REGEX = r"(.+)tf_nightly(|_gpu)-(\d\.\d\.\d.dev[\d]{0,8})-(.+)\.whl"
 BINARY_STRING_TEMPLATE = "%s-%s-%s.whl"
 
@@ -43,7 +39,7 @@ BINARY_STRING_TEMPLATE = "%s-%s-%s.whl"
 def check_existence(filename):
   """Check the existence of file or dir."""
   if not os.path.exists(filename):
-    raise RuntimeError("%s not found.")
+    raise RuntimeError("%s not found." % filename)
 
 
 def copy_binary(directory, origin_tag, new_tag, version, gpu=False):
@@ -64,27 +60,36 @@ def copy_binary(directory, origin_tag, new_tag, version, gpu=False):
     package = "tf_nightly"
   origin_binary = BINARY_STRING_TEMPLATE % (package, version, origin_tag)
   new_binary = BINARY_STRING_TEMPLATE % (package, version, new_tag)
-  zip_ref = zipfile.ZipFile(directory + origin_binary, "r")
-  zip_ref.extractall()
-  zip_ref.close()
-  old_py_ver = re.search(r"(cp\d\d-cp\d\d)", origin_tag).group(1)
-  new_py_ver = re.search(r"(cp\d\d-cp\d\d)", new_tag).group(1)
-  subprocess.check_call(
-      "%s -i s/%s/%s/g %s-%s.dist-info/WHEEL" % (SED_CMD, old_py_ver,
-                                                 new_py_ver, package, version),
-      shell=True)
-  zout = zipfile.ZipFile(directory + new_binary, "w", zipfile.ZIP_DEFLATED)
-  zip_these_files = [
-      "%s-%s.dist-info" % (package, version),
-      "%s-%s.data" % (package, version)
-  ]
-  for dirname in zip_these_files:
-    for root, _, files in os.walk(dirname):
-      for filename in files:
-        zout.write(os.path.join(root, filename))
-  zout.close()
-  for dirname in zip_these_files:
-    shutil.rmtree(dirname)
+  zip_ref = zipfile.ZipFile(os.path.join(directory, origin_binary), "r")
+
+  try:
+    tmpdir = tempfile.mkdtemp()
+    os.chdir(tmpdir)
+
+    zip_ref.extractall()
+    zip_ref.close()
+    old_py_ver = re.search(r"(cp\d\d-cp\d\d)", origin_tag).group(1)
+    new_py_ver = re.search(r"(cp\d\d-cp\d\d)", new_tag).group(1)
+
+    wheel_file = os.path.join(
+        tmpdir, "%s-%s.dist-info" % (package, version), "WHEEL")
+    with open(wheel_file, "r") as f:
+      content = f.read()
+    with open(wheel_file, "w") as f:
+      f.write(content.replace(old_py_ver, new_py_ver))
+
+    zout = zipfile.ZipFile(
+        os.path.join(directory, new_binary), "w", zipfile.ZIP_DEFLATED)
+    zip_these_files = [
+        "%s-%s.dist-info" % (package, version),
+        "%s-%s.data" % (package, version),
+    ]
+    for dirname in zip_these_files:
+      for root, _, files in os.walk(dirname):
+        for filename in files:
+          zout.write(os.path.join(root, filename))
+    zout.close()
+  finally:
+    shutil.rmtree(tmpdir)
 
 
 def main():
@@ -110,6 +115,7 @@ def main():
   args = parser.parse_args()
 
   # Argument checking
+  args.filename = os.path.abspath(args.filename)
   check_existence(args.filename)
   regex_groups = re.search(TF_NIGHTLY_REGEX, args.filename)
   directory = regex_groups.group(1)
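Review note: the rewritten `copy_binary` extracts the wheel into a temporary directory, patches the Python tag in the `WHEEL` metadata file, and zips the result under the new name. A minimal sketch of that flow is shown below, under a few assumptions that differ from the patch: `retag_wheel` is a hypothetical helper, it avoids `os.chdir` by passing explicit paths, and it re-zips every extracted file rather than only the `.dist-info` and `.data` directories.

```python
import os
import shutil
import tempfile
import zipfile


def retag_wheel(src_whl, dst_whl, old_tag, new_tag):
    """Copy src_whl to dst_whl, replacing old_tag with new_tag in WHEEL."""
    # Create the scratch directory before entering `try`, so that `finally`
    # never sees an undefined name if mkdtemp itself fails.
    tmpdir = tempfile.mkdtemp()
    try:
        with zipfile.ZipFile(src_whl, 'r') as zin:
            zin.extractall(tmpdir)
        with zipfile.ZipFile(dst_whl, 'w', zipfile.ZIP_DEFLATED) as zout:
            for root, _, files in os.walk(tmpdir):
                for name in files:
                    path = os.path.join(root, name)
                    if name == 'WHEEL' and root.endswith('.dist-info'):
                        # Patch the tag in place before archiving the file.
                        with open(path, 'r') as f:
                            content = f.read()
                        with open(path, 'w') as f:
                            f.write(content.replace(old_tag, new_tag))
                    # Store paths relative to the extraction root.
                    zout.write(path, os.path.relpath(path, tmpdir))
    finally:
        shutil.rmtree(tmpdir)
```

Working with absolute paths instead of changing the process working directory means the caller's relative paths stay valid after the function returns.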
diff --git a/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh b/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh
index cfeaebdbf57c01fef7cd81dae76217429336d0ff..d0816c92b7308a1079579e605ee9af491a0533fb 100755
--- a/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh
+++ b/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh
@@ -54,3 +54,6 @@ for i in `seq 0 $((TF_GPU_COUNT-1))`; do
   fi
 done
 
+echo "Cannot find a free GPU to run the test $* on, exiting with failure..."
+exit 1
+
diff --git a/tensorflow/tools/ci_build/install/install_bazel.sh b/tensorflow/tools/ci_build/install/install_bazel.sh
index 1df6a84d7c6f86abfb965063625ac43a3f1a57fb..3e27a94cf2bf3110ac181d6ef5a57366be17255f 100755
--- a/tensorflow/tools/ci_build/install/install_bazel.sh
+++ b/tensorflow/tools/ci_build/install/install_bazel.sh
@@ -15,7 +15,7 @@
 # ==============================================================================
 
 # Select bazel version.
-BAZEL_VERSION="0.10.0"
+BAZEL_VERSION="0.11.0"
 
 set +e
 local_bazel_ver=$(bazel version 2>&1 | grep -i label | awk '{print $3}')
diff --git a/tensorflow/tools/integration_tests/gcs_smoke_test/teardown.sh b/tensorflow/tools/ci_build/install/install_pip_packages_remote.sh
similarity index 62%
rename from tensorflow/tools/integration_tests/gcs_smoke_test/teardown.sh
rename to tensorflow/tools/ci_build/install/install_pip_packages_remote.sh
index 852486d1677ec597fe56111ffb0e470c333c1cd7..39a6d557d185d8564a79315fc738a054325aa0bc 100755
--- a/tensorflow/tools/integration_tests/gcs_smoke_test/teardown.sh
+++ b/tensorflow/tools/ci_build/install/install_pip_packages_remote.sh
@@ -1,5 +1,5 @@
-#!/bin/bash
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#!/usr/bin/env bash
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -13,14 +13,17 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-GSUTIL_BIN="/var/gcloud/google-cloud-sdk/bin/gsutil"
 
-echo "Got teardown argument $1"
+set -e
 
-if "${GSUTIL_BIN}" rm "$1"
-then
-  echo "Cleaned up new tfrecord file in GCS: '$1'"
-else
-  echo "FAIL: Unable to clean up new tfrecord file in GCS: '$1'"
-  exit 1
+if [ ! -f /usr/bin/x86_64-linux-gnu-gcc ]; then
+  ln -s /usr/local/bin/clang /usr/bin/x86_64-linux-gnu-gcc
 fi
+
+pip2 install -U pip
+pip3 install -U pip
+pip2 install -U setuptools
+pip3 install -U setuptools
+
+# The rest of the pip packages will be installed in
+# `install_pip_packages.sh`
diff --git a/tensorflow/tools/ci_build/osx/cpu/run_contrib.sh b/tensorflow/tools/ci_build/osx/cpu/run_contrib.sh
index 509ee38ec4fd584037f8e43726c01391430c1817..5c5a36139f50e85e70ce4bff5ca8054f7570b0f5 100755
--- a/tensorflow/tools/ci_build/osx/cpu/run_contrib.sh
+++ b/tensorflow/tools/ci_build/osx/cpu/run_contrib.sh
@@ -31,7 +31,7 @@ export CC_OPT_FLAGS='-mavx'
 export PYTHON_BIN_PATH=$(which python2)
 yes "" | $PYTHON_BIN_PATH configure.py
 which bazel
-bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test,-nomac \
+bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test,-nomac,-no_mac \
     --test_timeout 300,450,1200,3600 \
     --test_size_filters=small,medium --config=opt \
     --jobs=${N_JOBS} --build_tests_only --test_output=errors -k -- \
diff --git a/tensorflow/tools/ci_build/osx/cpu/run_py2_cc_core.sh b/tensorflow/tools/ci_build/osx/cpu/run_py2_cc_core.sh
index 05547136704394ed9262f566a2bfb4160b73c7fd..338066131b5d4511ae9f0646a1269b182cf8e1fa 100755
--- a/tensorflow/tools/ci_build/osx/cpu/run_py2_cc_core.sh
+++ b/tensorflow/tools/ci_build/osx/cpu/run_py2_cc_core.sh
@@ -31,7 +31,7 @@ export CC_OPT_FLAGS='-mavx'
 export PYTHON_BIN_PATH=$(which python2)
 yes "" | $PYTHON_BIN_PATH configure.py
 which bazel
-bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test,-nomac \
+bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test,-nomac,-no_mac \
     --test_timeout 300,450,1200,3600 --config=opt \
     --test_size_filters=small,medium \
     --jobs=${N_JOBS} --build_tests_only --test_output=errors -k -- \
diff --git a/tensorflow/tools/ci_build/osx/cpu/run_py3_cc_core.sh b/tensorflow/tools/ci_build/osx/cpu/run_py3_cc_core.sh
index 8f839ca110e5bbeba6fb7f0baaeab2fe6f126319..920a261ae3c8d68ec0b0d311fd361e3843eebd86 100755
--- a/tensorflow/tools/ci_build/osx/cpu/run_py3_cc_core.sh
+++ b/tensorflow/tools/ci_build/osx/cpu/run_py3_cc_core.sh
@@ -30,7 +30,7 @@ export TF_NEED_CUDA=0
 export PYTHON_BIN_PATH=$(which python3)
 yes "" | $PYTHON_BIN_PATH configure.py
 which bazel
-bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test,-nomac \
+bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test,-nomac,-no_mac \
     --test_timeout 300,450,1200,3600 \
     --test_size_filters=small,medium \
     --jobs=${N_JOBS} --build_tests_only --test_output=errors -k -- \
diff --git a/tensorflow/tools/ci_build/osx/libtensorflow_cpu.sh b/tensorflow/tools/ci_build/osx/libtensorflow_cpu.sh
index e1b56b9a25f663737ffe0991882f6e5e753265ed..7d471b47034f04ea4c2d31d9cdd7cea48fb32745 100755
--- a/tensorflow/tools/ci_build/osx/libtensorflow_cpu.sh
+++ b/tensorflow/tools/ci_build/osx/libtensorflow_cpu.sh
@@ -31,5 +31,5 @@ export TF_NEED_OPENCL_SYCL=0
 export TF_NEED_MKL=0
 export COMPUTECPP_PATH="/usr/local"
 
-export PATH="/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
+export PATH="$PATH:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
 build_libtensorflow_tarball "-cpu-darwin-$(uname -m)"
diff --git a/tensorflow/tools/ci_build/remote/Dockerfile.cpu b/tensorflow/tools/ci_build/remote/Dockerfile.cpu
deleted file mode 100644
index 7b01d8320d26f38c92ad8f404da3188809a6d400..0000000000000000000000000000000000000000
--- a/tensorflow/tools/ci_build/remote/Dockerfile.cpu
+++ /dev/null
@@ -1,27 +0,0 @@
-FROM launcher.gcr.io/google/clang-debian8:latest
-
-RUN apt-get update && apt-get --no-install-recommends install -y \
-    binutils \
-    binutils-gold \
-    curl \
-    libstdc++-4.9-dev \
-    python \
-    python-dev \
-    python-numpy \
-    python-pip \
-    unzip \
-    zip && \
-    rm -rf /var/lib/apt/lists/*
-
-RUN curl -fSsL -O https://bootstrap.pypa.io/get-pip.py && \
-    python get-pip.py && \
-    rm get-pip.py
-
-# Set up grpc
-RUN pip install  --upgrade enum34 futures mock numpy six backports.weakref portpicker && \
-    pip install --pre 'protobuf>=3.0.0a3' && \
-    pip install 'grpcio>=1.1.3'
-
-# TODO: Set up golang which is compatible with clang
-
-WORKDIR /botexec
diff --git a/tensorflow/tools/ci_build/remote/Dockerfile.gpu b/tensorflow/tools/ci_build/remote/Dockerfile.gpu
deleted file mode 100644
index 47ffd44163dd3e4b99f06689e1aa6f19f84cc2ca..0000000000000000000000000000000000000000
--- a/tensorflow/tools/ci_build/remote/Dockerfile.gpu
+++ /dev/null
@@ -1,27 +0,0 @@
-FROM nvidia-cuda-clang:latest
-
-RUN apt-get update && apt-get --no-install-recommends install -y \
-    binutils \
-    binutils-gold \
-    curl \
-    libstdc++-4.9-dev \
-    python \
-    python-dev \
-    python-numpy \
-    python-pip \
-    unzip \
-    zip && \
-    rm -rf /var/lib/apt/lists/*
-
-RUN curl -fSsL -O https://bootstrap.pypa.io/get-pip.py && \
-    python get-pip.py && \
-    rm get-pip.py
-
-# Set up grpc
-RUN pip install --upgrade \
-        enum34 futures astor gast mock numpy six \
-        backports.weakref termcolor && \
-    pip install --pre 'protobuf>=3.0.0a3' && \
-    pip install 'grpcio>=1.1.3'
-
-WORKDIR /botexec
diff --git a/tensorflow/tools/ci_build/windows/bazel/common_env.sh b/tensorflow/tools/ci_build/windows/bazel/common_env.sh
index 1c35d74af72ad0a72b0016356888c8cf77e20e56..7d4cc7ac3005f7ff9a79d18228e86d6b74e1e8b0 100644
--- a/tensorflow/tools/ci_build/windows/bazel/common_env.sh
+++ b/tensorflow/tools/ci_build/windows/bazel/common_env.sh
@@ -34,6 +34,9 @@ export BAZEL_SH=${BAZEL_SH:-"C:/tools/msys64/usr/bin/bash"}
 
 export PYTHON_BASE_PATH="${PYTHON_DIRECTORY:-Program Files/Anaconda3}"
 
+# Set the path to find bazel.
+export PATH="/c/tools/bazel/:$PATH"
+
 # Set Python path for ./configure
 export PYTHON_BIN_PATH="C:/${PYTHON_BASE_PATH}/python.exe"
 export PYTHON_LIB_PATH="C:/${PYTHON_BASE_PATH}/lib/site-packages"
diff --git a/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat b/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat
index b87e4a9bec41264827d415a11dfa6f23aeda725d..4656afe0256d03540fed6912677c8e93f9cf9eb6 100644
--- a/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat
+++ b/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat
@@ -37,7 +37,7 @@ SET CMAKE_DIR=%REPO_ROOT%\tensorflow\contrib\cmake
 SET MSBUILD_EXE="C:\Program Files (x86)\MSBuild\14.0\Bin\msbuild.exe"
 
 :: Run cmake to create Visual Studio Project files.
-%CMAKE_EXE% %CMAKE_DIR% -A x64 -DSWIG_EXECUTABLE=%SWIG_EXE% -DPYTHON_EXECUTABLE=%PY_EXE% -DCMAKE_BUILD_TYPE=Release -DPYTHON_LIBRARIES=%PY_LIB% -Dtensorflow_BUILD_PYTHON_TESTS=%BUILD_PYTHON_TESTS% -Dtensorflow_BUILD_CC_TESTS=%BUILD_CC_TESTS% -Dtensorflow_ENABLE_GPU=ON -DCUDNN_HOME=%CUDNN_HOME% -Dtensorflow_TF_NIGHTLY=%TF_NIGHTLY% -Dtensorflow_DISABLE_EIGEN_FORCEINLINE=%DISABLE_FORCEINLINE% -Dtensorflow_WIN_CPU_SIMD_OPTIONS=/arch:AVX
+%CMAKE_EXE% %CMAKE_DIR% -A x64 -DSWIG_EXECUTABLE=%SWIG_EXE% -DPYTHON_EXECUTABLE=%PY_EXE% -DCMAKE_BUILD_TYPE=Release -DPYTHON_LIBRARIES=%PY_LIB% -Dtensorflow_BUILD_PYTHON_TESTS=%BUILD_PYTHON_TESTS% -Dtensorflow_BUILD_CC_TESTS=%BUILD_CC_TESTS% -Dtensorflow_ENABLE_GPU=ON -DCUDNN_HOME=%CUDNN_HOME% -Dtensorflow_TF_NIGHTLY=%TF_NIGHTLY% -Dtensorflow_DISABLE_EIGEN_FORCEINLINE=%DISABLE_FORCEINLINE% -Dtensorflow_WIN_CPU_SIMD_OPTIONS=/arch:AVX -G"Visual Studio 14"
 
 :: Run msbuild in the resulting VS project files to build a pip package.
 %MSBUILD_EXE% /p:Configuration=Release /maxcpucount:32 tf_python_build_pip_package.vcxproj
diff --git a/tensorflow/tools/ci_build/windows/gpu/cmake/run_py.bat b/tensorflow/tools/ci_build/windows/gpu/cmake/run_py.bat
index b537192a945b2a2d8c2df940b947c6c0f7d6fc06..97829892b10059f9d9663e103534891d1481abec 100644
--- a/tensorflow/tools/ci_build/windows/gpu/cmake/run_py.bat
+++ b/tensorflow/tools/ci_build/windows/gpu/cmake/run_py.bat
@@ -28,6 +28,9 @@ IF DEFINED TF_NIGHTLY (ECHO TF_NIGHTLY is set to %TF_NIGHTLY%) ELSE (SET TF_NIGH
 :: Set pip binary location. Do not override if it is set already.
 IF DEFINED PIP_EXE (ECHO PIP_EXE is set to %PIP_EXE%) ELSE (SET PIP_EXE="C:\Program Files\Anaconda3\Scripts\pip.exe")
 
+:: Set ctest binary location.
+IF DEFINED CTEST_EXE (ECHO CTEST_EXE is set to %CTEST_EXE%) ELSE (SET CTEST_EXE="C:\Program Files\cmake\bin\ctest.exe")
+
 :: Run the CMAKE build to build the pip package.
 CALL %REPO_ROOT%\tensorflow\tools\ci_build\windows\gpu\cmake\run_build.bat
 if %errorlevel% neq 0 exit /b %errorlevel%
@@ -47,4 +50,4 @@ if %errorlevel% neq 0 exit /b %errorlevel%
 
 :: Run all python tests if the installation succeeded.
 echo Running tests...
-ctest -C Release --output-on-failure --jobs 1
+%CTEST_EXE% -C Release --output-on-failure --jobs 1
diff --git a/tensorflow/tools/compatibility/tf_upgrade.py b/tensorflow/tools/compatibility/tf_upgrade.py
index 6e90b286c99f894ddd25268afc69043759571c36..1f8833582af4c922115e637117e775e619439786 100644
--- a/tensorflow/tools/compatibility/tf_upgrade.py
+++ b/tensorflow/tools/compatibility/tf_upgrade.py
@@ -662,9 +662,9 @@ class TFAPIChangeSpec(APIChangeSpec):
   def _reverse_handler(file_edit_recorder, node):
     # TODO(aselle): Could check for a literal list of bools and try to convert
     # them to indices.
-    comment = ("ERROR: tf.reverse has had its argument semantics changed\n"
-               "significantly the converter cannot detect this reliably, so you"
-               "need to inspect this usage manually.\n")
+    comment = ("ERROR: tf.reverse has had its argument semantics changed "
+               "significantly; the converter cannot detect this reliably, so "
+               "you need to inspect this usage manually.\n")
     file_edit_recorder.add(
         comment,
         node.lineno,
diff --git a/tensorflow/tools/dist_test/README.md b/tensorflow/tools/dist_test/README.md
index c1b1f79bbd4b657768b9bbcab93efa3354774915..228d5ee35d1839c60b51a85bd606c1ba86e46886 100644
--- a/tensorflow/tools/dist_test/README.md
+++ b/tensorflow/tools/dist_test/README.md
@@ -17,6 +17,14 @@ cesnsu model:
 
     ./local_test.sh --model_name CENSUS_WIDENDEEP
 
+You can test a specific version of TensorFlow:
+
+```shell
+./local_test.sh ${whl_file_url}
+```
+
+For example, the TensorFlow Python package URLs for Ubuntu are listed [here](https://www.tensorflow.org/install/install_linux#the_url_of_the_tensorflow_python_package).
+
 **2) Launch a remote k8s cluster on Google Kubernetes Engine (GKE) and run the
 test suite on it**
 
diff --git a/tensorflow/tools/dist_test/local_test.sh b/tensorflow/tools/dist_test/local_test.sh
index 435f9d0dc9c55a3dcfc45e7e46f279b4679a9086..caae7fd5305af9846628eaf00348dd08df4e827f 100755
--- a/tensorflow/tools/dist_test/local_test.sh
+++ b/tensorflow/tools/dist_test/local_test.sh
@@ -16,12 +16,11 @@
 #
 # Tests distributed TensorFlow on a locally running TF GRPC cluster.
 #
-# This script peforms the following steps:
-# 1) Build the docker-in-docker (dind) image capable of running docker and
-#    Kubernetes (k8s) cluster inside.
+# This script performs the following steps:
+# 1) Build the docker image capable of running distributed TensorFlow in docker.
 # 2) Run a container from the aforementioned image and start docker service
 #    in it
-# 3) Call a script to launch a k8s TensorFlow GRPC cluster inside the container
+# 3) Call a script to launch a distributed TensorFlow GRPC cluster inside the container
 #    and run the distributed test suite.
 #
 # Usage: local_test.sh <whl_file_location>
@@ -64,15 +63,9 @@ die() {
 
 # Configurations
 DOCKER_IMG_NAME="tensorflow/tf-dist-test-local-cluster"
-LOCAL_K8S_CACHE=${HOME}/kubernetes
 
-# Helper function
-get_container_id_by_image_name() {
-    # Get the id of a container by image name
-    # Usage: get_docker_container_id_by_image_name <img_name>
-
-    docker ps | grep $1 | awk '{print $1}'
-}
+# Use the CPU-only TensorFlow v1.5.0 wheel for Python 2.7, since num_gpus is set to 0 below.
+DEFAULT_WHL_FILE_LOCATION="https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0-cp27-none-linux_x86_64.whl"
 
 # Parse input arguments
 LEAVE_CONTAINER_RUNNING=0
@@ -84,7 +77,8 @@ SYNC_REPLICAS_FLAG=""
 
 WHL_FILE_LOCATION=${1}
 if [[ -z "${WHL_FILE_LOCATION}" ]]; then
-  die "whl file location is not specified"
+  WHL_FILE_LOCATION=${DEFAULT_WHL_FILE_LOCATION}
+  echo "Using default whl file location: ${WHL_FILE_LOCATION}"
 fi
 
 while true; do
@@ -121,7 +115,7 @@ DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # Get utility functions
 source ${DIR}/scripts/utils.sh
 
-# Build docker-in-docker image for local k8s cluster.
+# Build docker image for local distributed TensorFlow cluster.
 NO_CACHE_FLAG=""
 if [[ ! -z "${TF_DIST_DOCKER_NO_CACHE}" ]] &&
    [[ "${TF_DIST_DOCKER_NO_CACHE}" != "0" ]]; then
diff --git a/tensorflow/tools/dist_test/python/mnist_replica.py b/tensorflow/tools/dist_test/python/mnist_replica.py
index a2d12442c44553a287637029843021b7541fa3fa..d6e7f317dd0b52203e354676425dbbbcd53e1973 100644
--- a/tensorflow/tools/dist_test/python/mnist_replica.py
+++ b/tensorflow/tools/dist_test/python/mnist_replica.py
@@ -56,7 +56,7 @@ flags.DEFINE_integer("task_index", None,
 flags.DEFINE_integer("num_gpus", 1, "Total number of gpus for each machine."
                      "If you don't use GPU, please set it to '0'")
 flags.DEFINE_integer("replicas_to_aggregate", None,
-                     "Number of replicas to aggregate before parameter update"
+                     "Number of replicas to aggregate before parameter update "
                      "is applied (For sync_replicas mode only; default: "
                      "num_workers)")
 flags.DEFINE_integer("hidden_units", 100,
diff --git a/tensorflow/tools/docker/Dockerfile.devel b/tensorflow/tools/docker/Dockerfile.devel
index d16761c3675942838fd2be0ea6e0b7463a3bf249..11f476d12c086f70335d9a69d7f3b86b525b5623 100644
--- a/tensorflow/tools/docker/Dockerfile.devel
+++ b/tensorflow/tools/docker/Dockerfile.devel
@@ -57,7 +57,7 @@ RUN echo "startup --batch" >>/etc/bazel.bazelrc
 RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
     >>/etc/bazel.bazelrc
 # Install the most recent bazel release.
-ENV BAZEL_VERSION 0.8.0
+ENV BAZEL_VERSION 0.11.0
 WORKDIR /
 RUN mkdir /bazel && \
     cd /bazel && \
@@ -70,7 +70,7 @@ RUN mkdir /bazel && \
 
 # Download and build TensorFlow.
 WORKDIR /tensorflow
-RUN git clone --branch=r1.6 --depth=1 https://github.com/tensorflow/tensorflow.git .
+RUN git clone --branch=r1.7 --depth=1 https://github.com/tensorflow/tensorflow.git .
 
 # TODO(craigcitro): Don't install the pip package, since it makes it
 # more difficult to experiment with local changes. Instead, just add
diff --git a/tensorflow/tools/docker/Dockerfile.devel-cpu-mkl b/tensorflow/tools/docker/Dockerfile.devel-cpu-mkl
index 3690e7dfe57a4682276a90b10cb84c9a329b3f5e..037d13116efc5ddf76c31eb87d7f81d31c3591f5 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-cpu-mkl
+++ b/tensorflow/tools/docker/Dockerfile.devel-cpu-mkl
@@ -3,7 +3,7 @@ FROM tensorflow/tensorflow:latest-devel
 LABEL maintainer="Clayne Robison<clayne.b.robison@intel.com>"
 
 # These arguments are parameterized. Use --build-args to override.
-ARG TF_BRANCH=r1.6
+ARG TF_BRANCH=r1.7
 ARG WHL_DIR=/whl
 
 RUN apt-get update && apt-get install -y --no-install-recommends \
diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu b/tensorflow/tools/docker/Dockerfile.devel-gpu
index 4ef37881bc91aaa58bab031c69b4a96c2a9d8ec1..1fcb6428b21b4ca495bef2b3249b6463e9ef0a10 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-gpu
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu
@@ -66,7 +66,7 @@ RUN echo "startup --batch" >>/etc/bazel.bazelrc
 RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
     >>/etc/bazel.bazelrc
 # Install the most recent bazel release.
-ENV BAZEL_VERSION 0.8.0
+ENV BAZEL_VERSION 0.11.0
 WORKDIR /
 RUN mkdir /bazel && \
     cd /bazel && \
@@ -79,7 +79,7 @@ RUN mkdir /bazel && \
 
 # Download and build TensorFlow.
 WORKDIR /tensorflow
-RUN git clone --branch=r1.6 --depth=1 https://github.com/tensorflow/tensorflow.git .
+RUN git clone --branch=r1.7 --depth=1 https://github.com/tensorflow/tensorflow.git .
 
 # Configure the build for our CUDA configuration.
 ENV CI_BUILD_PYTHON python
diff --git a/tensorflow/tools/docker/Dockerfile.gpu b/tensorflow/tools/docker/Dockerfile.gpu
index b6682cd68163ec870ed815b45ac4fdd9233f88c6..625321e1235202f78a2d5e1a5b2d9d05e1e3f9ba 100644
--- a/tensorflow/tools/docker/Dockerfile.gpu
+++ b/tensorflow/tools/docker/Dockerfile.gpu
@@ -1,11 +1,18 @@
-FROM nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04
+FROM nvidia/cuda:9.0-base-ubuntu16.04
 
 LABEL maintainer="Craig Citro <craigcitro@google.com>"
 
 # Pick up some TF dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
         build-essential \
+        cuda-command-line-tools-9-0 \
+        cuda-cublas-9-0 \
+        cuda-cufft-9-0 \
+        cuda-curand-9-0 \
+        cuda-cusolver-9-0 \
+        cuda-cusparse-9-0 \
         curl \
+        libcudnn7=7.0.5.15-1+cuda9.0 \
         libfreetype6-dev \
         libpng12-dev \
         libzmq3-dev \
diff --git a/tensorflow/tools/docs/parser.py b/tensorflow/tools/docs/parser.py
index e758229535e7b10994a39cbafb37e116fd2a465c..d2a63ecc4960117eb64fcc4f94bf882d4a3f91dd 100644
--- a/tensorflow/tools/docs/parser.py
+++ b/tensorflow/tools/docs/parser.py
@@ -34,7 +34,11 @@ from tensorflow.python.util import tf_inspect
 
 
 # A regular expression capturing a python identifier.
-IDENTIFIER_RE = '[a-zA-Z_][a-zA-Z0-9_]*'
+IDENTIFIER_RE = r'[a-zA-Z_]\w*'
+
+
+class TFDocsError(Exception):
+  pass
 
 
 class _Errors(object):
@@ -118,6 +122,8 @@ SYMBOL_REFERENCE_RE = re.compile(
     """,
     flags=re.VERBOSE)
 
+AUTO_REFERENCE_RE = re.compile(r'`([a-zA-Z0-9_.]+?)`')
+
 
 class ReferenceResolver(object):
   """Class for replacing @{...} references with Markdown links.
@@ -240,10 +246,25 @@ class ReferenceResolver(object):
     Returns:
       `string`, with "@{symbol}" references replaced by Markdown links.
     """
-    def one_ref(match):
-      return self._one_ref(match, relative_path_to_root)
 
-    return re.sub(SYMBOL_REFERENCE_RE, one_ref, string)
+    def strict_one_ref(match):
+      try:
+        return self._one_ref(match, relative_path_to_root)
+      except TFDocsError as e:
+        self.add_error(str(e))
+        return 'BAD_LINK'
+
+    string = re.sub(SYMBOL_REFERENCE_RE, strict_one_ref, string)
+
+    def sloppy_one_ref(match):
+      try:
+        return self._one_ref(match, relative_path_to_root)
+      except TFDocsError:
+        return match.group(0)
+
+    string = re.sub(AUTO_REFERENCE_RE, sloppy_one_ref, string)
+
+    return string
 
   def python_link(self, link_text, ref_full_name, relative_path_to_root,
                   code_ref=True):
@@ -307,14 +328,14 @@ class ReferenceResolver(object):
 
     Raises:
       RuntimeError: If `ref_full_name` is not documented.
+      TFDocsError: If the @{} syntax cannot be decoded.
     """
     master_name = self._duplicate_of.get(ref_full_name, ref_full_name)
 
     # Check whether this link exists
     if master_name not in self._all_names:
-      message = 'Cannot make link to "%s": Not in index.' % master_name
-      self.add_error(message)
-      return 'BROKEN_LINK'
+      raise TFDocsError(
+          'Cannot make link to "%s": Not in index.' % master_name)
 
     # If this is a member of a class, link to the class page with an anchor.
     ref_path = None
@@ -369,8 +390,8 @@ class ReferenceResolver(object):
             code_ref=not manual_link_text)
 
     # Error!
-    self.add_error('Did not understand "%s"' % match.group(0))
-    return 'BROKEN_LINK'
+    raise TFDocsError(
+        'Did not understand "%s"' % match.group(0))
 
   def _doc_link(self, string, link_text, manual_link_text,
                 relative_path_to_root):
@@ -395,11 +416,10 @@ class ReferenceResolver(object):
     return self._doc_missing(string, hash_tag, link_text, manual_link_text,
                              relative_path_to_root)
 
-  def _doc_missing(self, string, unused_hash_tag, link_text,
+  def _doc_missing(self, string, unused_hash_tag, unused_link_text,
                    unused_manual_link_text, unused_relative_path_to_root):
     """Generate an error for unrecognized @{$...} references."""
-    self.add_error('Unknown Document "%s"' % string)
-    return link_text
+    raise TFDocsError('Unknown Document "%s"' % string)
 
   def _cc_link(self, string, link_text, unused_manual_link_text,
                relative_path_to_root):
@@ -416,8 +436,8 @@ class ReferenceResolver(object):
     elif string == 'tensorflow::ops::Const':
       ret = 'namespace/tensorflow/ops.md#const'
     else:
-      self.add_error('C++ reference not understood: "%s"' % string)
-      return 'TODO_C++:%s' % string
+      raise TFDocsError('C++ reference not understood: "%s"' % string)
+
     # relative_path_to_root gets you to api_docs/python, we go from there
     # to api_docs/cc, and then add ret.
     cc_relative_path = os.path.normpath(os.path.join(
diff --git a/tensorflow/tools/git/gen_git_source.py b/tensorflow/tools/git/gen_git_source.py
index 3630dbd740e981971bdc9ff45b756b45095d437d..cbcdbf5b807a585865e2e3f19291e55388d55cb1 100755
--- a/tensorflow/tools/git/gen_git_source.py
+++ b/tensorflow/tools/git/gen_git_source.py
@@ -114,6 +114,13 @@ def configure(src_base_path, gen_path, debug=False):
   for target, src in link_map.items():
     if src is None:
       open(os.path.join(gen_path, target), "w").write("")
+    elif not os.path.exists(src):
+      # Git repo is configured in a way we don't support, such as having
+      # packed refs. Even though we are in a git repo, tf.__git_version__
+      # will not be accurate.
+      # TODO(mikecase): Support grabbing git info when using packed refs.
+      open(os.path.join(gen_path, target), "w").write("")
+      spec["git"] = False
     else:
       try:
         # In python 3.5, symlink function exists even on Windows. But requires
diff --git a/tensorflow/tools/graph_transforms/BUILD b/tensorflow/tools/graph_transforms/BUILD
index ad3668fa02e102607c9a03ac312451a147affdda..6e21aa28461819fb9f65642716536e37ada8f9bf 100644
--- a/tensorflow/tools/graph_transforms/BUILD
+++ b/tensorflow/tools/graph_transforms/BUILD
@@ -91,7 +91,6 @@ cc_library(
     srcs = [
         "add_default_attributes.cc",
         "backports.cc",
-        "fake_quantize_training.cc",
         "flatten_atrous.cc",
         "fold_batch_norms.cc",
         "fold_constants_lib.cc",
@@ -105,7 +104,6 @@ cc_library(
         "remove_attribute.cc",
         "remove_control_dependencies.cc",
         "remove_device.cc",
-        "remove_ema.cc",
         "remove_nodes.cc",
         "rename_attribute.cc",
         "rename_op.cc",
@@ -134,8 +132,8 @@ cc_library(
         "//tensorflow/core:tensorflow",
         "//tensorflow/contrib/rnn:gru_ops_op_lib",
         "//tensorflow/contrib/rnn:lstm_ops_op_lib",
+        "//tensorflow/core/kernels:quantization_utils",
     ] + if_not_windows([
-        "//tensorflow/core/kernels:quantized_ops",
         "//tensorflow/core/kernels:remote_fused_graph_rewriter_transform",
         "//tensorflow/core/kernels/hexagon:hexagon_rewriter_transform",
     ]),
@@ -148,7 +146,6 @@ tf_cc_test(
     srcs = [
         "add_default_attributes_test.cc",
         "backports_test.cc",
-        "fake_quantize_training_test.cc",
         "flatten_atrous_test.cc",
         "fold_batch_norms_test.cc",
         "fold_constants_test.cc",
@@ -161,7 +158,6 @@ tf_cc_test(
         "quantize_weights_test.cc",
         "remove_attribute_test.cc",
         "remove_device_test.cc",
-        "remove_ema_test.cc",
         "remove_nodes_test.cc",
         "rename_attribute_test.cc",
         "rename_op_test.cc",
@@ -182,6 +178,7 @@ tf_cc_test(
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
         "//tensorflow/core:testlib",
+        "//tensorflow/core/kernels:quantization_utils",
         "//tensorflow/core/kernels:quantized_ops",
         "//tensorflow/core/util/tensor_bundle",
     ],
diff --git a/tensorflow/tools/graph_transforms/fake_quantize_training.cc b/tensorflow/tools/graph_transforms/fake_quantize_training.cc
deleted file mode 100644
index 61aecc6e16d817d245421f18fa39c70aa45b2bef..0000000000000000000000000000000000000000
--- a/tensorflow/tools/graph_transforms/fake_quantize_training.cc
+++ /dev/null
@@ -1,51 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#define EIGEN_USE_THREADS
-
-#include "tensorflow/core/graph/quantize_training.h"
-#include "tensorflow/tools/graph_transforms/transform_utils.h"
-
-namespace tensorflow {
-namespace graph_transforms {
-
-// EXPERIMENTAL: This can change without warning.
-// Rewrites the GraphDef for quantized training.
-// Rewrites the forward pass to include the precision loss with quantization so
-// the model can learn to deal with such loss and achieve better accuracy when
-// it is quantized later for inference.
-// Quantization range information is collected in FakeQuantizeWithMinMaxVars
-// ops.
-//
-// TODO(suharshs): Provide instructions on converting the resulting graph for
-// inference.
-// TODO(suharshs): Implement this using the GTT rather than calling the old
-// prototype function.
-Status FakeQuantizeTraining(const GraphDef& input_graph_def,
-                            const TransformFuncContext& context,
-                            GraphDef* output_graph_def) {
-  // TODO(suharshs): Make num_bits a parameter.
-  const int32 num_bits = 8;
-  // TODO(suharshs): Make quantization op a parameter?
-  const string quant_op_type = "FakeQuantWithMinMaxVars";
-
-  return DoQuantizeTrainingOnGraphDef(input_graph_def, num_bits, quant_op_type,
-                                      output_graph_def);
-}
-
-REGISTER_GRAPH_TRANSFORM("fake_quantize_training", FakeQuantizeTraining);
-
-}  // namespace graph_transforms
-}  // namespace tensorflow
diff --git a/tensorflow/tools/graph_transforms/fake_quantize_training_test.cc b/tensorflow/tools/graph_transforms/fake_quantize_training_test.cc
deleted file mode 100644
index 5e4ab209e97808c3f42ecf73fb763ef9d7ab1cfe..0000000000000000000000000000000000000000
--- a/tensorflow/tools/graph_transforms/fake_quantize_training_test.cc
+++ /dev/null
@@ -1,63 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/cc/ops/const_op.h"
-#include "tensorflow/cc/ops/math_ops.h"
-#include "tensorflow/core/framework/tensor_testutil.h"
-#include "tensorflow/core/lib/core/status_test_util.h"
-#include "tensorflow/core/platform/test.h"
-#include "tensorflow/tools/graph_transforms/transform_utils.h"
-
-namespace tensorflow {
-namespace graph_transforms {
-
-// Declare here, so we don't need a public header.
-Status FakeQuantizeTraining(const GraphDef& input_graph_def,
-                            const TransformFuncContext& context,
-                            GraphDef* output_graph_def);
-
-class FakeQuantizeTrainingTest : public ::testing::Test {};
-
-// For now, since the fake_quantize_training transform just calls the
-// quantize_training rewrite from tensorflow/core/graph/quantize_training.h,
-// we just test that the graph has been changed by the transform.
-// TODO(suharshs): Once we implement the fake_quantize_training transform
-// using the GTT, write proper tests of the transform here.
-TEST_F(FakeQuantizeTrainingTest, TransformOccurred) {
-  auto root = tensorflow::Scope::DisabledShapeInferenceScope();
-  using namespace ::tensorflow::ops;  // NOLINT(build/namespaces)
-
-  Tensor a_data(DT_FLOAT, TensorShape());
-  test::FillIota<float>(&a_data, 1.0f);
-  Output a_const = Const(root.WithOpName("a"), Input::Initializer(a_data));
-
-  Tensor b_data(DT_FLOAT, TensorShape());
-  test::FillIota<float>(&b_data, 1.0f);
-  Output b_const = Const(root.WithOpName("b"), Input::Initializer(b_data));
-
-  Output matmul = MatMul(root.WithOpName("matmul"), a_const, b_const);
-  GraphDef graph_def;
-  TF_ASSERT_OK(root.ToGraphDef(&graph_def));
-
-  GraphDef result;
-  TransformFuncContext context;
-  TF_ASSERT_OK(FakeQuantizeTraining(graph_def, context, &result));
-
-  // Test that the transformation resulted in a graph with more nodes.
-  EXPECT_GT(result.node_size(), graph_def.node_size());
-}
-
-}  // namespace graph_transforms
-}  // namespace tensorflow
diff --git a/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc b/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc
index d89afe85c72883323cec3c14342fd60adebd024d..d86f65325be1c3f5151ab8d0a0c3c64afa3dcf0f 100644
--- a/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc
+++ b/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc
@@ -182,6 +182,36 @@ Status FuseBatchNormWithConv(const NodeMatch& match,
   return Status::OK();
 }
 
+Status FuseBatchNormWithBatchToSpace(const NodeMatch& match,
+                                     std::vector<NodeDef>* new_nodes) {
+  // Calculate the scale and offset values to apply.
+  std::vector<float> scale_values;
+  std::vector<float> offset_values;
+  TF_RETURN_IF_ERROR(
+      GetScaleAndOffsetValues(match, &scale_values, &offset_values));
+
+  // Fuse conv weights, and set the final output node name as batch_norm_node.
+  const NodeDef& batch_norm_node = match.node;
+  const NodeMatch& batch_to_space_node_match = match.inputs[0];
+  const NodeMatch& conv_node_match = batch_to_space_node_match.inputs[0];
+  const NodeDef& batch_to_space_node = batch_to_space_node_match.node;
+  const NodeDef& conv_node = conv_node_match.node;
+
+  string biasadd_name = conv_node.name() + "/biasadd";
+  TF_RETURN_IF_ERROR(
+      FuseScaleOffsetToConvWeights(scale_values, offset_values, conv_node_match,
+                                   biasadd_name, new_nodes));
+
+  NodeDef new_batch_to_space_node = batch_to_space_node;
+  // reuse batch_norm node name
+  new_batch_to_space_node.set_name(batch_norm_node.name());
+  new_batch_to_space_node.set_input(0, biasadd_name);
+  new_nodes->push_back(batch_to_space_node_match.inputs[1].node);
+  new_nodes->push_back(batch_to_space_node_match.inputs[2].node);
+  new_nodes->push_back(new_batch_to_space_node);
+  return Status::OK();
+}
+
 Status FuseBatchNormWithConvConcat(const NodeMatch& match,
                                    std::vector<NodeDef>* new_nodes) {
   // Calculate the scale and offset values to apply.
@@ -284,6 +314,43 @@ Status FoldOldBatchNorms(const GraphDef& input_graph_def,
     current_graph_def = replaced_graph_def;
   } while (did_graph_change);
 
+  do {
+    did_graph_change = false;
+    GraphDef replaced_graph_def;
+    TF_RETURN_IF_ERROR(ReplaceMatchingOpTypes(
+        current_graph_def,  // clang-format off
+        {"BatchNormWithGlobalNormalization|FusedBatchNorm",    // batch_norm_node
+         {
+             {"BatchToSpaceND",                  // batch_to_space_node
+              {
+                  {"Conv2D",                     // conv_node
+                   {
+                       {"*"},                    // input_node
+                       {"Const"},                // weights_node
+                   }
+                  },
+                  {"Const"},                     // block_shape
+                  {"Const"},                     // crops
+              }
+             },
+             {"Const"},                          // mean_node
+             {"Const"},                          // variance_node
+             {"Const"},                          // beta_node
+             {"Const"},                          // gamma_node
+         }
+        },  // clang-format on
+        [&did_graph_change](const NodeMatch& match,
+                            const std::set<string>& input_nodes,
+                            const std::set<string>& output_nodes,
+                            std::vector<NodeDef>* new_nodes) {
+          TF_RETURN_IF_ERROR(FuseBatchNormWithBatchToSpace(match, new_nodes));
+          did_graph_change = true;
+          return Status::OK();
+        },
+        {}, &replaced_graph_def));
+    current_graph_def = replaced_graph_def;
+  } while (did_graph_change);
+
   do {
     did_graph_change = false;
     GraphDef replaced_graph_def;
diff --git a/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc b/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc
index b30ba9ac8b92db68eb3374c51a7f31b69cd1e3cf..7651a03fe51012678d6d6fc495fd82e497aa512b 100644
--- a/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc
+++ b/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/cc/ops/const_op.h"
 #include "tensorflow/cc/ops/image_ops.h"
 #include "tensorflow/cc/ops/nn_ops.h"
+#include "tensorflow/cc/ops/array_ops.h"
 #include "tensorflow/cc/ops/sendrecv_ops.h"
 #include "tensorflow/cc/ops/standard_ops.h"
 #include "tensorflow/core/framework/tensor_testutil.h"
@@ -298,6 +299,96 @@ class FoldOldBatchNormsTest : public ::testing::Test {
   }
 };
 
+void TestFoldFusedBatchNormsWithBatchToSpace() {
+  auto root = tensorflow::Scope::NewRootScope();
+  using namespace ::tensorflow::ops;  // NOLINT(build/namespaces)
+
+  Tensor input_data(DT_FLOAT, TensorShape({2, 1, 3, 2}));
+  test::FillValues<float>(
+      &input_data, {1.0f, 4.0f, 2.0f, 5.0f, 3.0f, 6.0f, -1.0f, -4.0f, -2.0f,
+                    -5.0f, -3.0f, -6.0f});
+  Output input_op =
+      Const(root.WithOpName("input_op"), Input::Initializer(input_data));
+
+  Tensor weights_data(DT_FLOAT, TensorShape({1, 2, 2, 2}));
+  test::FillValues<float>(&weights_data,
+                          {1.0f, 2.0f, 3.0f, 4.0f, 0.1f, 0.2f, 0.3f, 0.4f});
+  Output weights_op =
+      Const(root.WithOpName("weights_op"), Input::Initializer(weights_data));
+
+  Output conv_op = Conv2D(root.WithOpName("conv_op"), input_op, weights_op,
+                          {1, 1, 1, 1}, "VALID");
+
+  Tensor block_shape_data(DT_INT32, TensorShape({2}));
+  test::FillValues<int32>(&block_shape_data, {1, 2});
+  Output block_shape_op =
+      Const(root.WithOpName("block_shape_op"), Input::Initializer(block_shape_data));
+
+  Tensor crops_data(DT_INT32, TensorShape({2, 2}));
+  test::FillValues<int32>(&crops_data, {0, 0, 0, 1});
+  Output crops_op =
+      Const(root.WithOpName("crops_op"), Input::Initializer(crops_data));
+
+  Output batch_to_space_op = BatchToSpaceND(root.WithOpName("batch_to_space_op"),
+                                            conv_op, block_shape_op, crops_op);
+
+  Tensor mean_data(DT_FLOAT, TensorShape({2}));
+  test::FillValues<float>(&mean_data, {10.0f, 20.0f});
+  Output mean_op =
+      Const(root.WithOpName("mean_op"), Input::Initializer(mean_data));
+
+  Tensor variance_data(DT_FLOAT, TensorShape({2}));
+  test::FillValues<float>(&variance_data, {0.25f, 0.5f});
+  Output variance_op = Const(root.WithOpName("variance_op"),
+                             Input::Initializer(variance_data));
+
+  Tensor beta_data(DT_FLOAT, TensorShape({2}));
+  test::FillValues<float>(&beta_data, {0.1f, 0.6f});
+  Output beta_op =
+      Const(root.WithOpName("beta_op"), Input::Initializer(beta_data));
+
+  Tensor gamma_data(DT_FLOAT, TensorShape({2}));
+  test::FillValues<float>(&gamma_data, {1.0f, 2.0f});
+  Output gamma_op =
+      Const(root.WithOpName("gamma_op"), Input::Initializer(gamma_data));
+
+  GraphDef original_graph_def;
+  TF_ASSERT_OK(root.ToGraphDef(&original_graph_def));
+
+  NodeDef batch_norm_node;
+  batch_norm_node.set_op("FusedBatchNorm");
+  batch_norm_node.set_name("output");
+  AddNodeInput("batch_to_space_op", &batch_norm_node);
+  AddNodeInput("gamma_op", &batch_norm_node);
+  AddNodeInput("beta_op", &batch_norm_node);
+  AddNodeInput("mean_op", &batch_norm_node);
+  AddNodeInput("variance_op", &batch_norm_node);
+  SetNodeAttr("T", DT_FLOAT, &batch_norm_node);
+  SetNodeAttr("epsilon", 0.00001f, &batch_norm_node);
+  SetNodeAttr("is_training", false, &batch_norm_node);
+  *(original_graph_def.mutable_node()->Add()) = batch_norm_node;
+
+  std::unique_ptr<Session> original_session(NewSession(SessionOptions()));
+  TF_ASSERT_OK(original_session->Create(original_graph_def));
+  std::vector<Tensor> original_outputs;
+  TF_ASSERT_OK(original_session->Run({}, {"output"}, {}, &original_outputs));
+
+  GraphDef fused_graph_def;
+  TF_ASSERT_OK(FoldOldBatchNorms(original_graph_def, {{}, {"output"}},
+                                 &fused_graph_def));
+
+  std::unique_ptr<Session> fused_session(NewSession(SessionOptions()));
+  TF_ASSERT_OK(fused_session->Create(fused_graph_def));
+  std::vector<Tensor> fused_outputs;
+  TF_ASSERT_OK(fused_session->Run({}, {"output"}, {}, &fused_outputs));
+
+  test::ExpectTensorNear<float>(original_outputs[0], fused_outputs[0], 1e-5);
+
+  for (const NodeDef& node : fused_graph_def.node()) {
+    EXPECT_NE("FusedBatchNorm", node.op());
+  }
+}
+
 TEST_F(FoldOldBatchNormsTest, TestFoldOldBatchNorms) {
   TestFoldOldBatchNorms();
 }
@@ -307,7 +398,7 @@ TEST_F(FoldOldBatchNormsTest, TestFoldFusedBatchNorms) {
 }
 
 TEST_F(FoldOldBatchNormsTest, TestFoldFusedBatchNormsWithConcat) {
-  // Test axis is not 3, so all weigths and offsets are fused to each of inputs
+  // Test axis is not 3, so all weights and offsets are fused to each of inputs
   // of conv2d.
   TestFoldFusedBatchNormsWithConcat(/*split=*/true);
   // Test axis = 3, BatchNorm weights and offsets will be split before fused
@@ -315,5 +406,9 @@ TEST_F(FoldOldBatchNormsTest, TestFoldFusedBatchNormsWithConcat) {
   TestFoldFusedBatchNormsWithConcat(/*split=*/false);
 }
 
+TEST_F(FoldOldBatchNormsTest, TestFoldFusedBatchNormsWithBatchToSpace) {
+  TestFoldFusedBatchNormsWithBatchToSpace();
+}
+
 }  // namespace graph_transforms
 }  // namespace tensorflow
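Review note on the test added above: it checks that folding an inference-mode FusedBatchNorm into the preceding Conv2D leaves the output unchanged within 1e-5. The arithmetic behind that equivalence can be sketched in plain Python. This is a minimal illustration, not the transform's actual implementation: the batch-norm parameters are copied from the test, while `patch`, `filters`, and the helper names are hypothetical. Because the scale and offset are per-channel, an op that only rearranges batch and spatial positions (such as BatchToSpaceND) should not affect the result.

```python
import math

# Batch-norm parameters copied from the test (two output channels).
mean = [10.0, 20.0]
variance = [0.25, 0.5]
beta = [0.1, 0.6]
gamma = [1.0, 2.0]
epsilon = 0.00001


def batch_norm(y, c):
    """Inference-mode batch norm applied to one value y in channel c."""
    return gamma[c] * (y - mean[c]) / math.sqrt(variance[c] + epsilon) + beta[c]


def fold(weights, c):
    """Returns (scaled weights, bias) equivalent to conv followed by batch norm."""
    scale = gamma[c] / math.sqrt(variance[c] + epsilon)
    return [w * scale for w in weights], beta[c] - mean[c] * scale


def dot(xs, ws):
    """One convolution output: the dot product of an input patch and filter taps."""
    return sum(x * w for x, w in zip(xs, ws))


# Hypothetical input patch and per-channel filter taps (illustrative only).
patch = [1.0, 4.0, 2.0, 5.0]
filters = [[1.0, 3.0, 0.1, 0.3], [2.0, 4.0, 0.2, 0.4]]

for c, taps in enumerate(filters):
    unfused = batch_norm(dot(patch, taps), c)
    folded_taps, bias = fold(taps, c)
    fused = dot(patch, folded_taps) + bias
    # Same tolerance the C++ test passes to ExpectTensorNear.
    assert abs(unfused - fused) < 1e-5
```

Scaling the weights once at transform time replaces the per-element normalization with a single bias add at inference time, which is why the fused graph is expected to contain no batch-norm node.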
diff --git a/tensorflow/tools/graph_transforms/remove_ema.cc b/tensorflow/tools/graph_transforms/remove_ema.cc
deleted file mode 100644
index 22e26267025c3fed4f44ffbc09d55d8d355cc448..0000000000000000000000000000000000000000
--- a/tensorflow/tools/graph_transforms/remove_ema.cc
+++ /dev/null
@@ -1,146 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#define EIGEN_USE_THREADS
-
-#include "tensorflow/tools/graph_transforms/transform_utils.h"
-
-namespace tensorflow {
-namespace graph_transforms {
-
-// EXPERIMENTAL: This can change without warning.
-// Given a graph that has gone through the FakeQuantizeTraining transform and
-// has been frozen afterwards, RemoveEMA simplifies the FakeQuantize estimated
-// moving average subgraphs to make it compatible with the QuantizeNodes
-// transform.
-Status RemoveEMA(const GraphDef& input_graph_def,
-                 const TransformFuncContext& context,
-                 GraphDef* output_graph_def) {
-  TF_RETURN_IF_ERROR(ReplaceMatchingOpTypes(
-      input_graph_def,  // clang-format off
-      {"FakeQuantWithMinMaxVars",
-       {
-         {"*"},
-         {"Assign",
-          {
-            {"Const"},
-            {"Merge",
-             {
-               {"Switch",
-                {
-                  {"Min",
-                   {
-                     {"*"},
-                     {"Range",
-                      {
-                        {"*"},
-                        {"*"},
-                        {"*"},
-                      }
-                     }
-                   }
-                  },
-                  {"IsVariableInitialized"}
-                }
-               },
-               {"Sub",
-                {
-                  {"Const"},
-                  {"Mul",
-                   {
-                     {"Sub"},
-                     {"Sub",
-                      {
-                        {"Const"},
-                        {"Const"}
-                      }
-                     }
-                   }
-                  }
-                }
-               }
-             }
-            }
-          }
-         },
-         {"Assign",
-          {
-            {"Const"},
-            {"Merge",
-             {
-               {"Switch",
-                {
-                  {"Max"},
-                  {"IsVariableInitialized"}
-                }
-               },
-               {"Sub",
-                {
-                  {"Const"},
-                  {"Mul",
-                   {
-                     {"Sub"},
-                     {"Sub",
-                      {
-                        {"Const"},
-                        {"Const"}
-                      }
-                     }
-                   }
-                  }
-                }
-               }
-             }
-            }
-          }
-         },
-       }
-      },  // clang-format on
-      [](const NodeMatch& match, const std::set<string>& input_nodes,
-         const std::set<string>& output_nodes,
-         std::vector<NodeDef>* new_nodes) {
-        const NodeDef& fake_quant_node = match.node;
-        const NodeDef& input_node = match.inputs[0].node;
-        const NodeDef& min_var_node = match.inputs[1].inputs[0].node;
-        const NodeDef& max_var_node = match.inputs[2].inputs[0].node;
-
-        // Make a new FakeQuantizeWithMinMaxVars operation that uses constants
-        // for its min/max arguments rather than an entire EMA subgraph.
-        NodeDef new_fake_quant_node;
-        new_fake_quant_node.set_op(fake_quant_node.op());
-        new_fake_quant_node.set_name(fake_quant_node.name());
-        AddNodeInput(input_node.name(), &new_fake_quant_node);
-        AddNodeInput(min_var_node.name(), &new_fake_quant_node);
-        AddNodeInput(max_var_node.name(), &new_fake_quant_node);
-        CopyNodeAttr(fake_quant_node, "narrow_range", "narrow_range",
-                     &new_fake_quant_node);
-        CopyNodeAttr(fake_quant_node, "num_bits", "num_bits",
-                     &new_fake_quant_node);
-
-        new_nodes->push_back(new_fake_quant_node);
-        new_nodes->push_back(input_node);
-        new_nodes->push_back(min_var_node);
-        new_nodes->push_back(max_var_node);
-
-        return Status::OK();
-      },
-      {}, output_graph_def));
-  return Status::OK();
-}
-
-REGISTER_GRAPH_TRANSFORM("remove_ema", RemoveEMA);
-
-}  // namespace graph_transforms
-}  // namespace tensorflow
diff --git a/tensorflow/tools/graph_transforms/remove_ema_test.cc b/tensorflow/tools/graph_transforms/remove_ema_test.cc
deleted file mode 100644
index 27db90e2729487f89324622f7a63aca1c5a58fe7..0000000000000000000000000000000000000000
--- a/tensorflow/tools/graph_transforms/remove_ema_test.cc
+++ /dev/null
@@ -1,121 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/cc/ops/const_op.h"
-#include "tensorflow/cc/ops/math_ops.h"
-#include "tensorflow/core/framework/tensor_testutil.h"
-#include "tensorflow/core/lib/core/status_test_util.h"
-#include "tensorflow/core/platform/test.h"
-#include "tensorflow/core/public/session.h"
-#include "tensorflow/tools/graph_transforms/transform_utils.h"
-
-namespace tensorflow {
-namespace graph_transforms {
-
-// Declare transformations here, so we don't need a public header.
-Status FakeQuantizeTraining(const GraphDef& input_graph_def,
-                            const TransformFuncContext& context,
-                            GraphDef* output_graph_def);
-
-Status RemoveEMA(const GraphDef& input_graph_def,
-                 const TransformFuncContext& context,
-                 GraphDef* output_graph_def);
-
-Status QuantizeNodes(const GraphDef& input_graph_def,
-                     const TransformFuncContext& context,
-                     GraphDef* output_graph_def);
-
-class RemoveEMATest : public ::testing::Test {};
-
-TEST_F(RemoveEMATest, FakeQuant_RemoveEMA_QuantizeTraining) {
-  // Build a small graph.
-  auto root = tensorflow::Scope::NewRootScope();
-  using namespace ::tensorflow::ops;  // NOLINT(build/namespaces)
-
-  Tensor a_data(DT_FLOAT, TensorShape({1, 1}));
-  test::FillIota<float>(&a_data, 1.0f);
-  Output a_const = Const(root.WithOpName("a"), Input::Initializer(a_data));
-
-  Tensor b_data(DT_FLOAT, TensorShape({1, 1}));
-  test::FillIota<float>(&b_data, 1.0f);
-  Output b_const = Const(root.WithOpName("b"), Input::Initializer(b_data));
-
-  Output matmul = MatMul(root.WithOpName("matmul"), a_const, b_const);
-  GraphDef graph_def;
-  TF_ASSERT_OK(root.ToGraphDef(&graph_def));
-
-  // (1) FakeQuantize the graph.
-  GraphDef fake_quantized_graph_def;
-  TransformFuncContext context;
-  TF_ASSERT_OK(
-      FakeQuantizeTraining(graph_def, context, &fake_quantized_graph_def));
-
-  // Test that the transformation resulted in a graph with more nodes.
-  EXPECT_GT(fake_quantized_graph_def.node_size(), graph_def.node_size());
-
-  // (2) Run the graph to initialize the newly added variables.
-  std::unique_ptr<Session> session(NewSession(SessionOptions()));
-  TF_ASSERT_OK(session->Create(fake_quantized_graph_def));
-  std::vector<Tensor> outputs;
-  TF_ASSERT_OK(session->Run({}, {"matmul"}, {}, &outputs));
-
-  // (3) Freeze the graph. Create a "frozen graph" that matches what we would
-  // expect if we actually froze the above graph.
-  // TODO(suharshs): Use a c++ freeze graph alternative, when one is available.
-  GraphDef frozen_graph_def;
-  for (const NodeDef& node : fake_quantized_graph_def.node()) {
-    if (node.op() == "Variable" || node.op() == "VariableV2") {
-      NodeDef const_node;
-      const_node.set_op("Const");
-      const_node.set_name(node.name());
-      SetNodeAttr("dtype", DT_FLOAT, &const_node);
-      Tensor tensor(DT_FLOAT, {});
-      tensor.flat<float>()(0) = 1.0f;
-      SetNodeTensorAttr<float>("value", tensor, &const_node);
-      *(frozen_graph_def.mutable_node()->Add()) = const_node;
-    } else {
-      *(frozen_graph_def.mutable_node()->Add()) = node;
-    }
-  }
-
-  // Test that freezing the graph resulted in a graph with the same number of
-  // nodes.
-  EXPECT_EQ(frozen_graph_def.node_size(), fake_quantized_graph_def.node_size());
-
-  // (4) RemoveEMA on the graph to make it compatible with QuantizeNodes.
-  GraphDef removed_ema_graph_def;
-  TF_ASSERT_OK(RemoveEMA(frozen_graph_def, context, &removed_ema_graph_def));
-
-  // Test that the transformation resulted in a graph with less nodes.
-  EXPECT_LT(removed_ema_graph_def.node_size(), frozen_graph_def.node_size());
-
-  // (5) QuantizeNodes and inspect the final graph.
-  // TODO(suharshs): Add a more thorough inspection of the structure of
-  // the output graph.
-  GraphDef quantized_graph_def;
-  TF_ASSERT_OK(
-      QuantizeNodes(removed_ema_graph_def, context, &quantized_graph_def));
-
-  // Test that the transformation resulted in a graph with more nodes.
-  EXPECT_GT(quantized_graph_def.node_size(), removed_ema_graph_def.node_size());
-
-  // Make sure that the FakeQuantizeWithMinMaxVars op has been removed.
-  for (const NodeDef& node : quantized_graph_def.node()) {
-    EXPECT_NE(node.op(), "FakeQuantWithMinMaxVars");
-  }
-}
-
-}  // namespace graph_transforms
-}  // namespace tensorflow
diff --git a/tensorflow/tools/integration_tests/gcs_smoke_test/BUILD.bazel b/tensorflow/tools/integration_tests/gcs_smoke_test/BUILD.bazel
deleted file mode 100755
index 439d86c5d2c10d15f68247c0df42ce488c10d6be..0000000000000000000000000000000000000000
--- a/tensorflow/tools/integration_tests/gcs_smoke_test/BUILD.bazel
+++ /dev/null
@@ -1,56 +0,0 @@
-package(default_visibility = ["//visibility:public"])
-
-load("@rbe_integration_test//skylark:integration_tests.bzl", "sut_component", "integration_test")
-load("@rbe_integration_test//skylark:toolchains.bzl", "toolchain_container_images")
-
-sut_component(
-    name = "gcs",
-    docker_image = toolchain_container_images()["tensorflow"],
-    setups = [{
-        "program": "setup.sh",
-        "args": [
-            "gs://tensorflow-test-bucket/tf-gcs-test",
-        ],
-        "output_properties": ["gcs_path"],
-        "timeout_seconds": 100,
-    }],
-    teardowns = [{
-        "program": "teardown.sh",
-        "args": ["{gcs_path}"],
-        "timeout_seconds": 100,
-    }],
-)
-
-py_binary(
-    name = "gcs_smoke",
-    srcs = ["gcs_smoke.py"],
-)
-
-sh_binary(
-    name = "test_wrapper",
-    srcs = ["test_wrapper.sh"],
-    data = [
-        "gcs_smoke",
-    ],
-)
-
-integration_test(
-    name = "gcs_smoke_test",
-    sut_deps = {
-        ":gcs": "gcs",
-    },
-    tags = [
-        "manual",
-        "notap",
-    ],
-    test = {
-        "program": ":test_wrapper",
-        "args": [
-            "--gcs_bucket_url={gcs#gcs_path}",
-            "--num_examples=20",
-        ],
-        "timeout_seconds": 250,
-    },
-    test_docker_image = toolchain_container_images()["tensorflow"],
-    test_type = "MultiMachine",
-)
diff --git a/tensorflow/tools/integration_tests/gcs_smoke_test/gcs_smoke.py b/tensorflow/tools/integration_tests/gcs_smoke_test/gcs_smoke.py
deleted file mode 100755
index 8438c2156cb09b4d8c9442d9a5f4de67e59272f2..0000000000000000000000000000000000000000
--- a/tensorflow/tools/integration_tests/gcs_smoke_test/gcs_smoke.py
+++ /dev/null
@@ -1,253 +0,0 @@
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Smoke test for reading records from GCS to TensorFlow."""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import sys
-import time
-
-import numpy as np
-import tensorflow as tf
-from tensorflow.core.example import example_pb2
-from tensorflow.python.lib.io import file_io
-
-flags = tf.app.flags
-flags.DEFINE_string("gcs_bucket_url", "",
-                    "The URL to the GCS bucket in which the temporary "
-                    "tfrecord file is to be written and read, e.g., "
-                    "gs://my-gcs-bucket/test-directory")
-flags.DEFINE_integer("num_examples", 10, "Number of examples to generate")
-
-FLAGS = flags.FLAGS
-
-
-def create_examples(num_examples, input_mean):
-  """Create ExampleProto's containing data."""
-  ids = np.arange(num_examples).reshape([num_examples, 1])
-  inputs = np.random.randn(num_examples, 1) + input_mean
-  target = inputs - input_mean
-  examples = []
-  for row in range(num_examples):
-    ex = example_pb2.Example()
-    ex.features.feature["id"].bytes_list.value.append(str(ids[row, 0]))
-    ex.features.feature["target"].float_list.value.append(target[row, 0])
-    ex.features.feature["inputs"].float_list.value.append(inputs[row, 0])
-    examples.append(ex)
-  return examples
-
-
-def create_dir_test():
-  """Verifies file_io directory handling methods."""
-
-  # Test directory creation.
-  starttime_ms = int(round(time.time() * 1000))
-  dir_name = "%s/tf_gcs_test_%s" % (FLAGS.gcs_bucket_url, starttime_ms)
-  print("Creating dir %s" % dir_name)
-  file_io.create_dir(dir_name)
-  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
-  print("Created directory in: %d milliseconds" % elapsed_ms)
-
-  # Check that the directory exists.
-  dir_exists = file_io.is_directory(dir_name)
-  assert dir_exists
-  print("%s directory exists: %s" % (dir_name, dir_exists))
-
-  # Test recursive directory creation.
-  starttime_ms = int(round(time.time() * 1000))
-  recursive_dir_name = "%s/%s/%s" % (dir_name,
-                                     "nested_dir1",
-                                     "nested_dir2")
-  print("Creating recursive dir %s" % recursive_dir_name)
-  file_io.recursive_create_dir(recursive_dir_name)
-  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
-  print("Created directory recursively in: %d milliseconds" % elapsed_ms)
-
-  # Check that the directory exists.
-  recursive_dir_exists = file_io.is_directory(recursive_dir_name)
-  assert recursive_dir_exists
-  print("%s directory exists: %s" % (recursive_dir_name, recursive_dir_exists))
-
-  # Create some contents in the just created directory and list the contents.
-  num_files = 10
-  files_to_create = ["file_%d.txt" % n for n in range(num_files)]
-  for file_num in files_to_create:
-    file_name = "%s/%s" % (dir_name, file_num)
-    print("Creating file %s." % file_name)
-    file_io.write_string_to_file(file_name, "test file.")
-
-  print("Listing directory %s." % dir_name)
-  starttime_ms = int(round(time.time() * 1000))
-  directory_contents = file_io.list_directory(dir_name)
-  print(directory_contents)
-  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
-  print("Listed directory %s in %s milliseconds" % (dir_name, elapsed_ms))
-  assert set(directory_contents) == set(files_to_create + ["nested_dir1/"])
-
-  # Test directory renaming.
-  dir_to_rename = "%s/old_dir" % dir_name
-  new_dir_name = "%s/new_dir" % dir_name
-  file_io.create_dir(dir_to_rename)
-  assert file_io.is_directory(dir_to_rename)
-  assert not file_io.is_directory(new_dir_name)
-
-  starttime_ms = int(round(time.time() * 1000))
-  print("Will try renaming directory %s to %s" % (dir_to_rename, new_dir_name))
-  file_io.rename(dir_to_rename, new_dir_name)
-  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
-  print("Renamed directory %s to %s in %s milliseconds" % (
-      dir_to_rename, new_dir_name, elapsed_ms))
-  assert not file_io.is_directory(dir_to_rename)
-  assert file_io.is_directory(new_dir_name)
-
-  # Test Delete directory recursively.
-  print("Deleting directory recursively %s." % dir_name)
-  starttime_ms = int(round(time.time() * 1000))
-  file_io.delete_recursively(dir_name)
-  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
-  dir_exists = file_io.is_directory(dir_name)
-  assert not dir_exists
-  print("Deleted directory recursively %s in %s milliseconds" % (
-      dir_name, elapsed_ms))
-
-
-def create_object_test():
-  """Verifies file_io's object manipulation methods ."""
-  starttime_ms = int(round(time.time() * 1000))
-  dir_name = "%s/tf_gcs_test_%s" % (FLAGS.gcs_bucket_url, starttime_ms)
-  print("Creating dir %s." % dir_name)
-  file_io.create_dir(dir_name)
-
-  num_files = 5
-  # Create files of 2 different patterns in this directory.
-  files_pattern_1 = ["%s/test_file_%d.txt" % (dir_name, n)
-                     for n in range(num_files)]
-  files_pattern_2 = ["%s/testfile%d.txt" % (dir_name, n)
-                     for n in range(num_files)]
-
-  starttime_ms = int(round(time.time() * 1000))
-  files_to_create = files_pattern_1 + files_pattern_2
-  for file_name in files_to_create:
-    print("Creating file %s." % file_name)
-    file_io.write_string_to_file(file_name, "test file creation.")
-  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
-  print("Created %d files in %s milliseconds" %
-        (len(files_to_create), elapsed_ms))
-
-  # Listing files of pattern1.
-  list_files_pattern = "%s/test_file*.txt" % dir_name
-  print("Getting files matching pattern %s." % list_files_pattern)
-  starttime_ms = int(round(time.time() * 1000))
-  files_list = file_io.get_matching_files(list_files_pattern)
-  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
-  print("Listed files in %s milliseconds" % elapsed_ms)
-  print(files_list)
-  assert set(files_list) == set(files_pattern_1)
-
-  # Listing files of pattern2.
-  list_files_pattern = "%s/testfile*.txt" % dir_name
-  print("Getting files matching pattern %s." % list_files_pattern)
-  starttime_ms = int(round(time.time() * 1000))
-  files_list = file_io.get_matching_files(list_files_pattern)
-  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
-  print("Listed files in %s milliseconds" % elapsed_ms)
-  print(files_list)
-  assert set(files_list) == set(files_pattern_2)
-
-  # Test renaming file.
-  file_to_rename = "%s/oldname.txt" % dir_name
-  file_new_name = "%s/newname.txt" % dir_name
-  file_io.write_string_to_file(file_to_rename, "test file.")
-  assert file_io.file_exists(file_to_rename)
-  assert not file_io.file_exists(file_new_name)
-
-  print("Will try renaming file %s to %s" % (file_to_rename, file_new_name))
-  starttime_ms = int(round(time.time() * 1000))
-  file_io.rename(file_to_rename, file_new_name)
-  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
-  print("File %s renamed to %s in %s milliseconds" % (
-      file_to_rename, file_new_name, elapsed_ms))
-  assert not file_io.file_exists(file_to_rename)
-  assert file_io.file_exists(file_new_name)
-
-  # Delete directory.
-  print("Deleting directory %s." % dir_name)
-  file_io.delete_recursively(dir_name)
-
-
-def main(argv):
-  del argv  # Unused.
-  # Sanity check on the GCS bucket URL.
-  if not FLAGS.gcs_bucket_url or not FLAGS.gcs_bucket_url.startswith("gs://"):
-    print("ERROR: Invalid GCS bucket URL: \"%s\"" % FLAGS.gcs_bucket_url)
-    sys.exit(1)
-
-  # Verify that writing to the records file in GCS works.
-  print("\n=== Testing writing and reading of GCS record file... ===")
-  example_data = create_examples(FLAGS.num_examples, 5)
-  with tf.python_io.TFRecordWriter(FLAGS.gcs_bucket_url) as hf:
-    for e in example_data:
-      hf.write(e.SerializeToString())
-
-    print("Data written to: %s" % FLAGS.gcs_bucket_url)
-
-  # Verify that reading from the tfrecord file works and that
-  # tf_record_iterator works.
-  record_iter = tf.python_io.tf_record_iterator(FLAGS.gcs_bucket_url)
-  read_count = 0
-  for _ in record_iter:
-    read_count += 1
-  print("Read %d records using tf_record_iterator" % read_count)
-
-  if read_count != FLAGS.num_examples:
-    print("FAIL: The number of records read from tf_record_iterator (%d) "
-          "differs from the expected number (%d)" % (read_count,
-                                                     FLAGS.num_examples))
-    sys.exit(1)
-
-  # Verify that running the read op in a session works.
-  print("\n=== Testing TFRecordReader.read op in a session... ===")
-  with tf.Graph().as_default() as _:
-    filename_queue = tf.train.string_input_producer([FLAGS.gcs_bucket_url],
-                                                    num_epochs=1)
-    reader = tf.TFRecordReader()
-    _, serialized_example = reader.read(filename_queue)
-
-    with tf.Session() as sess:
-      sess.run(tf.global_variables_initializer())
-      sess.run(tf.local_variables_initializer())
-      tf.train.start_queue_runners()
-      index = 0
-      for _ in range(FLAGS.num_examples):
-        print("Read record: %d" % index)
-        sess.run(serialized_example)
-        index += 1
-
-      # Reading one more record should trigger an exception.
-      try:
-        sess.run(serialized_example)
-        print("FAIL: Failed to catch the expected OutOfRangeError while "
-              "reading one more record than is available")
-        sys.exit(1)
-      except tf.errors.OutOfRangeError:
-        print("Successfully caught the expected OutOfRangeError while "
-              "reading one more record than is available")
-
-  create_dir_test()
-  create_object_test()
-
-if __name__ == "__main__":
-  tf.app.run(main)
diff --git a/tensorflow/tools/integration_tests/gcs_smoke_test/test_wrapper.sh b/tensorflow/tools/integration_tests/gcs_smoke_test/test_wrapper.sh
deleted file mode 100755
index ef29dee3462c21d6318a6fb7e7e658961f0d88dd..0000000000000000000000000000000000000000
--- a/tensorflow/tools/integration_tests/gcs_smoke_test/test_wrapper.sh
+++ /dev/null
@@ -1,21 +0,0 @@
-# This is a python2 only test.
-#!/bin/bash
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-# Test Tensorflow package installation.
-/usr/local/bin/pip install --user tf-nightly
-
-# Test Tensorflow interaction with GCS.
-python tensorflow/tools/integration_test/gcs_smoke_test/gcs_smoke.py "$@"
diff --git a/tensorflow/tools/lib_package/BUILD b/tensorflow/tools/lib_package/BUILD
index 3fbdb5cacd1fd0039deaae5ac330b6c2ca006a68..0ede8c63704ac4a474eb0d19e17cf5f365abca77 100644
--- a/tensorflow/tools/lib_package/BUILD
+++ b/tensorflow/tools/lib_package/BUILD
@@ -138,7 +138,6 @@ genrule(
         "@zlib_archive//:zlib.h",
     ] + if_mkl([
         "//third_party/mkl:LICENSE",
-        "@mkl//:LICENSE",
     ]),
     outs = ["include/tensorflow/c/LICENSE"],
     cmd = "$(location :concat_licenses.sh) $(SRCS) >$@",
@@ -176,7 +175,6 @@ genrule(
         "@zlib_archive//:zlib.h",
     ] + if_mkl([
         "//third_party/mkl:LICENSE",
-        "@mkl//:LICENSE",
     ]),
     outs = ["include/tensorflow/jni/LICENSE"],
     cmd = "$(location :concat_licenses.sh) $(SRCS) >$@",
diff --git a/tensorflow/tools/pip_package/BUILD b/tensorflow/tools/pip_package/BUILD
index fb6eaa4faa28b4f6b17e1774907c0c9ff58d6ada..2c06fa60dbd2bda7fce4409a1a625fd33550826a 100644
--- a/tensorflow/tools/pip_package/BUILD
+++ b/tensorflow/tools/pip_package/BUILD
@@ -108,6 +108,7 @@ filegroup(
         "@highwayhash//:LICENSE",
         "@jemalloc//:COPYING",
         "@jpeg//:LICENSE.md",
+        "@kafka//:LICENSE",
         "@libxsmm_archive//:LICENSE",
         "@lmdb//:LICENSE",
         "@local_config_sycl//sycl:LICENSE.text",
@@ -125,7 +126,6 @@ filegroup(
         "@org_python_pypi_backports_weakref//:LICENSE",
     ] + if_mkl([
         "//third_party/mkl:LICENSE",
-        "@mkl//:LICENSE",
     ]) + if_not_windows([
         "@nccl_archive//:LICENSE.txt",
     ]) + tf_additional_license_deps(),
@@ -156,6 +156,7 @@ sh_binary(
             "//tensorflow/contrib/graph_editor:graph_editor_pip",
             "//tensorflow/contrib/keras:keras",
             "//tensorflow/contrib/labeled_tensor:labeled_tensor_pip",
+            "//tensorflow/contrib/lite/python:interpreter_test_data",
             "//tensorflow/contrib/lite/toco:toco",
             "//tensorflow/contrib/lite/toco/python:toco_wrapper",
             "//tensorflow/contrib/lite/toco/python:toco_from_protos",
diff --git a/tensorflow/tools/pip_package/pip_smoke_test.py b/tensorflow/tools/pip_package/pip_smoke_test.py
index 73d759eb130633094b402c821cc32eb76c076a44..e2518f6cbf0beb0943e5b7289796459d14992bfc 100644
--- a/tensorflow/tools/pip_package/pip_smoke_test.py
+++ b/tensorflow/tools/pip_package/pip_smoke_test.py
@@ -58,6 +58,10 @@ BLACKLIST = [
     # contrib
     "//tensorflow/contrib/session_bundle:session_bundle_half_plus_two",
     "//tensorflow/contrib/keras:testing_utils",
+    "//tensorflow/contrib/lite/python:interpreter",
+    "//tensorflow/contrib/lite/python:interpreter_test",
+    "//tensorflow/contrib/lite/python:interpreter.py",
+    "//tensorflow/contrib/lite/python:interpreter_test.py",
     "//tensorflow/contrib/ffmpeg:test_data",
     "//tensorflow/contrib/factorization/examples:mnist",
     "//tensorflow/contrib/factorization/examples:mnist.py",
@@ -71,6 +75,7 @@ BLACKLIST = [
     "//tensorflow/contrib/timeseries/examples:data/period_trend.csv",  # pylint:disable=line-too-long
     "//tensorflow/contrib/timeseries/python/timeseries:test_utils",
     "//tensorflow/contrib/timeseries/python/timeseries/state_space_models:test_utils",  # pylint:disable=line-too-long
+    "//tensorflow/contrib/image:sparse_image_warp_test_data",
 ]
 
 
diff --git a/tensorflow/tools/pip_package/setup.py b/tensorflow/tools/pip_package/setup.py
index 4b6f123daa7b528173234a2bffd30ead2aa9fc0e..ff30016cc2f648afa7331e01b21193cb918e4209 100644
--- a/tensorflow/tools/pip_package/setup.py
+++ b/tensorflow/tools/pip_package/setup.py
@@ -29,7 +29,7 @@ from setuptools.dist import Distribution
 # This version string is semver compatible, but incompatible with pip.
 # For pip, we will remove all '-' characters from this string, and use the
 # result for pip.
-_VERSION = '1.6.0-rc1'
+_VERSION = '1.7.0-rc1'
 
 REQUIRED_PACKAGES = [
     'absl-py >= 0.1.6',
@@ -72,7 +72,7 @@ if sys.version_info < (3, 4):
 
 # pylint: disable=line-too-long
 CONSOLE_SCRIPTS = [
-    'freeze_graph = tensorflow.python.tools.freeze_graph:main',
+    'freeze_graph = tensorflow.python.tools.freeze_graph:run_main',
     'toco_from_protos = tensorflow.contrib.lite.toco.python.toco_from_protos:main',
     'toco = tensorflow.contrib.lite.toco.python.toco_wrapper:main',
     'saved_model_cli = tensorflow.python.tools.saved_model_cli:main',
@@ -200,8 +200,7 @@ headers = (list(find_files('*.h', 'tensorflow/core')) +
            list(find_files('*.h', 'tensorflow/stream_executor')) +
            list(find_files('*.h', 'google/protobuf_archive/src')) +
            list(find_files('*', 'third_party/eigen3')) +
-           list(find_files('*', 'external/eigen_archive')) +
-           list(find_files('*.h', 'external/nsync/public')))
+           list(find_files('*', 'external/eigen_archive')))
 
 setup(
     name=project_name,
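The comment above `_VERSION` says the pip version is obtained by removing every '-' from the semver-style string. A minimal Python sketch of that normalization follows; the helper name `pip_version` is an assumption for illustration and does not appear in setup.py.

```python
def pip_version(semver_version):
    """Return the version with every '-' removed, as the setup.py comment describes."""
    return semver_version.replace('-', '')


# The release-candidate string introduced by this change.
print(pip_version('1.7.0-rc1'))  # prints 1.7.0rc1
```

A string without a pre-release suffix, such as `1.7.0`, passes through unchanged.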
diff --git a/tensorflow/tools/test/performance.bzl b/tensorflow/tools/test/performance.bzl
index cee53dd5b61e50126948e3652865a32f45eab092..3486871080c78dc7a1cc201ea2a4d45ebc342758 100644
--- a/tensorflow/tools/test/performance.bzl
+++ b/tensorflow/tools/test/performance.bzl
@@ -31,7 +31,7 @@ def tf_cc_logged_benchmark(
       size = "large",
       srcs = ["//tensorflow/tools/test:run_and_gather_logs"],
       args = [
-          "--name=//%s:%s" % (PACKAGE_NAME, name),
+          "--name=//%s:%s" % (native.package_name(), name),
           "--test_name=" + target,
           "--test_args=--benchmarks=%s" % benchmarks,
           "--benchmark_type=%s" % benchmark_type,
diff --git a/tensorflow/tools/test/upload_test_benchmarks.py b/tensorflow/tools/test/upload_test_benchmarks.py
index 77cc9f75f7725438918f681833d58e9ecb4a2f70..9c45359ee1b037ffb01820f874b88b6cabc6d14b 100644
--- a/tensorflow/tools/test/upload_test_benchmarks.py
+++ b/tensorflow/tools/test/upload_test_benchmarks.py
@@ -87,7 +87,8 @@ import json
 import os
 import shutil
 
 from google.cloud import datastore
+from six import text_type
 
 
 def is_real_file(dirpath, fname):
@@ -150,7 +151,7 @@ def upload_benchmark_data(client, data):
   """
   test_result = json.loads(data)
 
-  test_name = unicode(test_result["name"])
+  test_name = text_type(test_result["name"])
   start_time = datetime.datetime.utcfromtimestamp(
       float(test_result["startTime"]))
   batch = []
@@ -162,7 +163,7 @@ def upload_benchmark_data(client, data):
   t_val.update({
       "test": test_name,
       "start": start_time,
-      "info": unicode(data)
+      "info": text_type(data)
   })
   batch.append(t_val)
 
@@ -170,7 +171,7 @@ def upload_benchmark_data(client, data):
   # the attribute to be fetched and displayed.  The full entry information is
   # also stored as a non-indexed JSON blob.
   for ent in test_result["entries"].get("entry", []):
-    ent_name = unicode(ent["name"])
+    ent_name = text_type(ent["name"])
     e_key = client.key("Entry")
     e_val = datastore.Entity(e_key, exclude_from_indexes=["info"])
     e_val.update({
@@ -178,7 +179,7 @@ def upload_benchmark_data(client, data):
         "start": start_time,
         "entry": ent_name,
         "timing": ent["wallTime"],
-        "info": unicode(json.dumps(ent))
+        "info": text_type(json.dumps(ent))
     })
     batch.append(e_val)
 
diff --git a/tensorflow/version_check.bzl b/tensorflow/version_check.bzl
new file mode 100644
index 0000000000000000000000000000000000000000..79e721dab422c1449214acbe5fc1643edc3a9db0
--- /dev/null
+++ b/tensorflow/version_check.bzl
@@ -0,0 +1,48 @@
+"""Helpers to check the minimum version of Bazel."""
+
+def _extract_version_number(bazel_version):
+  """Extracts the semantic version number from a version string.
+
+  Args:
+    bazel_version: the version string that begins with the semantic version
+      e.g. "1.2.3rc1 abc1234" where "abc1234" is a commit hash.
+
+  Returns:
+    The semantic version string, like "1.2.3".
+  """
+  for i in range(len(bazel_version)):
+    c = bazel_version[i]
+    if not (c.isdigit() or c == "."):
+      return bazel_version[:i]
+  return bazel_version
+
+# Parse the bazel version string from `native.bazel_version`.
+# e.g.
+# "0.10.0rc1 abc123d" => (0, 10, 0)
+# "0.3.0" => (0, 3, 0)
+def _parse_bazel_version(bazel_version):
+  """Parses a version string into a 3-tuple of ints.
+
+  int tuples can be compared directly using binary operators (<, >).
+
+  Args:
+    bazel_version: the Bazel version string
+
+  Returns:
+    An int 3-tuple of a (major, minor, patch) version.
+  """
+
+  version = _extract_version_number(bazel_version)
+  return tuple([int(n) for n in version.split(".")])
+
+def check_bazel_version_at_least(minimum_bazel_version):
+  if "bazel_version" not in dir(native):
+    fail("\nCurrent Bazel version is lower than 0.2.1, expected at least %s\n" % minimum_bazel_version)
+  elif not native.bazel_version:
+    print("\nCurrent Bazel is not a release version, cannot check for compatibility.")
+    print("Make sure that you are running at least Bazel %s.\n" % minimum_bazel_version)
+    return
+
+  if _parse_bazel_version(native.bazel_version) < _parse_bazel_version(minimum_bazel_version):
+    fail("\nCurrent Bazel version is {}, expected at least {}\n".format(
+        native.bazel_version, minimum_bazel_version))
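The two helpers in version_check.bzl can be mirrored in plain Python to show why the version is parsed into an int tuple before it is compared. This is an illustrative sketch, not the Starlark code itself; `is_at_least` is a hypothetical wrapper standing in for the comparison made inside `check_bazel_version_at_least`.

```python
def extract_version_number(bazel_version):
    """Return the leading run of digits and dots, e.g. "1.2.3rc1 abc1234" -> "1.2.3"."""
    for i, c in enumerate(bazel_version):
        if not (c.isdigit() or c == "."):
            return bazel_version[:i]
    return bazel_version


def parse_bazel_version(bazel_version):
    """Parse a version string into a tuple of ints that compares numerically."""
    version = extract_version_number(bazel_version)
    return tuple(int(n) for n in version.split("."))


def is_at_least(current, minimum):
    """Hypothetical wrapper: True when `current` is not older than `minimum`."""
    return parse_bazel_version(current) >= parse_bazel_version(minimum)


print(parse_bazel_version("0.10.0rc1 abc123d"))  # prints (0, 10, 0)
print(is_at_least("0.10.0", "0.5.4"))            # prints True
print("0.10.0" >= "0.5.4")                       # prints False
```

The last line shows the failure that the tuple conversion avoids: compared as plain strings, "0.10.0" sorts before "0.5.4".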
diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 85f423f236b6c97f3d94e3d98357f5ad1ce912a2..075c2c849abc4c605e99455e0ef587a2093424f8 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -10,65 +10,23 @@ load("//third_party/sycl:sycl_configure.bzl", "sycl_configure")
 load("//third_party/toolchains/clang6:repo.bzl", "clang6_configure")
 load("//third_party/toolchains/cpus/arm:arm_compiler_configure.bzl", "arm_compiler_configure")
 load("//third_party:repo.bzl", "tf_http_archive")
+load("//third_party/clang_toolchain:cc_configure_clang.bzl", "cc_download_clang_toolchain")
 load("@io_bazel_rules_closure//closure/private:java_import_external.bzl", "java_import_external")
 load("@io_bazel_rules_closure//closure:defs.bzl", "filegroup_external")
 
-def _extract_version_number(bazel_version):
-  """Extracts the semantic version number from a version string
-
-  Args:
-    bazel_version: the version string that begins with the semantic version
-      e.g. "1.2.3rc1 abc1234" where "abc1234" is a commit hash.
-
-  Returns:
-    The semantic version string, like "1.2.3".
-  """
-  for i in range(len(bazel_version)):
-    c = bazel_version[i]
-    if not (c.isdigit() or c == "."):
-      return bazel_version[:i]
-  return bazel_version
-
-# Parse the bazel version string from `native.bazel_version`.
-# e.g.
-# "0.10.0rc1 abc123d" => (0, 10, 0)
-# "0.3.0" => (0, 3, 0)
-def _parse_bazel_version(bazel_version):
-  """Parses a version string into a 3-tuple of ints
-
-  int tuples can be compared directly using binary operators (<, >).
-
-  Args:
-    bazel_version: the Bazel version string
-
-  Returns:
-    An int 3-tuple of a (major, minor, patch) version.
-  """
-
-  version = _extract_version_number(bazel_version)
-  return tuple([int(n) for n in version.split(".")])
-
-def check_bazel_version_at_least(minimum_bazel_version):
-  if "bazel_version" not in dir(native):
-    fail("\nCurrent Bazel version is lower than 0.2.1, expected at least %s\n" % minimum_bazel_version)
-  elif not native.bazel_version:
-    print("\nCurrent Bazel is not a release version, cannot check for compatibility.")
-    print("Make sure that you are running at least Bazel %s.\n" % minimum_bazel_version)
-    return
-
-  if _parse_bazel_version(native.bazel_version) < _parse_bazel_version(minimum_bazel_version):
-    fail("\nCurrent Bazel version is {}, expected at least {}\n".format(
-        native.bazel_version, minimum_bazel_version))
+
+# Sanitize a dependency so that it works correctly from code that includes
+# TensorFlow as a submodule.
+def clean_dep(dep):
+  return str(Label(dep))
 
 # If TensorFlow is linked as a submodule.
 # path_prefix is no longer used.
 # tf_repo_name is thought to be under consideration.
 def tf_workspace(path_prefix="", tf_repo_name=""):
-  # We must check the bazel version before trying to parse any other BUILD
-  # files, in case the parsing of those build files depends on the bazel
-  # version we require here.
-  check_bazel_version_at_least("0.5.4")
+  # Note that we check the minimum bazel version in WORKSPACE.
   clang6_configure(name="local_config_clang6")
+  cc_download_clang_toolchain(name="local_config_download_clang")
   cuda_configure(name="local_config_cuda")
   tensorrt_configure(name="local_config_tensorrt")
   git_configure(name="local_config_git")
@@ -79,17 +37,37 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   arm_compiler_configure(
       name="local_config_arm_compiler",
       remote_config_repo="../arm_compiler",
-      build_file = str(Label("//third_party/toolchains/cpus/arm:BUILD")))
+      build_file = clean_dep("//third_party/toolchains/cpus/arm:BUILD"))
 
   mkl_repository(
-      name = "mkl",
+      name = "mkl_linux",
       urls = [
-          "https://mirror.bazel.build/github.com/01org/mkl-dnn/releases/download/v0.11/mklml_lnx_2018.0.1.20171007.tgz",
-          "https://github.com/01org/mkl-dnn/releases/download/v0.11/mklml_lnx_2018.0.1.20171007.tgz",
+          "https://mirror.bazel.build/github.com/intel/mkl-dnn/releases/download/v0.12/mklml_lnx_2018.0.1.20171227.tgz",
+          "https://github.com/intel/mkl-dnn/releases/download/v0.12/mklml_lnx_2018.0.1.20171227.tgz",
       ],
-      sha256 = "6b07cb7e5451db67c2e31e785ae458b18f7f363c60a61685488f69e9ae7199d4",
-      strip_prefix = "mklml_lnx_2018.0.1.20171007",
-      build_file = str(Label("//third_party/mkl:mkl.BUILD")),
+      sha256 = "feacc3d82565c1231470359b42c696236fae873704e0b013436afba5fd4fd30f",
+      strip_prefix = "mklml_lnx_2018.0.1.20171227",
+      build_file = clean_dep("//third_party/mkl:mkl.BUILD")
+  )
+  mkl_repository(
+      name = "mkl_windows",
+      urls = [
+          "https://mirror.bazel.build/github.com/intel/mkl-dnn/releases/download/v0.12/mklml_win_2018.0.1.20171227.zip",
+          "https://github.com/intel/mkl-dnn/releases/download/v0.12/mklml_win_2018.0.1.20171227.zip"
+      ],
+      sha256 = "24bae8d7b22b431a654acadea43f2243c46ae6b1e5a73a4a936825f31d284ee4",
+      strip_prefix = "mklml_win_2018.0.1.20171227",
+      build_file = clean_dep("//third_party/mkl:mkl.BUILD")
+  )
+  mkl_repository(
+      name = "mkl_darwin",
+      urls = [
+          "https://mirror.bazel.build/github.com/intel/mkl-dnn/releases/download/v0.12/mklml_mac_2018.0.1.20171227.tgz",
+          "https://github.com/intel/mkl-dnn/releases/download/v0.12/mklml_mac_2018.0.1.20171227.tgz"
+      ],
+      sha256 = "0e954ec6fd3dc5e37f64c4043f6b5613dd687558da3df1028b3b7c29ff5cf77f",
+      strip_prefix = "mklml_mac_2018.0.1.20171227",
+      build_file = clean_dep("//third_party/mkl:mkl.BUILD")
   )
 
   if path_prefix:
@@ -99,12 +77,12 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "mkl_dnn",
       urls = [
-          "https://mirror.bazel.build/github.com/01org/mkl-dnn/archive/e0bfcaa7fcb2b1e1558f5f0676933c1db807a729.tar.gz",
-          "https://github.com/01org/mkl-dnn/archive/e0bfcaa7fcb2b1e1558f5f0676933c1db807a729.tar.gz",
+          "https://mirror.bazel.build/github.com/intel/mkl-dnn/archive/v0.12.tar.gz",
+          "https://github.com/intel/mkl-dnn/archive/v0.12.tar.gz",
       ],
-      sha256 = "02e244f63dd95402691a361392504c143eede9a89043426f174836638a9cbf09",
-      strip_prefix = "mkl-dnn-e0bfcaa7fcb2b1e1558f5f0676933c1db807a729",
-      build_file = str(Label("//third_party/mkl_dnn:mkldnn.BUILD")),
+      sha256 = "86fa2a8c12a56e3b725945acedeaa82492746be02545aba6d710f097e013e19e",
+      strip_prefix = "mkl-dnn-0.12",
+      build_file = clean_dep("//third_party/mkl_dnn:mkldnn.BUILD"),
   )
 
   tf_http_archive(
@@ -115,7 +93,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
      sha256 = "5996380e3e8b981f55d1c8d58e709c00dbb4806ba367be75d0925a68cc2f6478",
      strip_prefix = "abseil-cpp-720c017e30339fd1786ce4aac68bc8559736e53f",
-     build_file = str(Label("//third_party:com_google_absl.BUILD")),
+     build_file = clean_dep("//third_party:com_google_absl.BUILD"),
   )
 
   tf_http_archive(
@@ -126,8 +104,8 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "0cadb31a35b514bf2dfd6b5d38205da94ef326ec6908fc3fd7c269948467214f",
       strip_prefix = "eigen-eigen-2355b229ea4c",
-      build_file = str(Label("//third_party:eigen.BUILD")),
-      patch_file = str(Label("//third_party:eigen_fix_cuda_compilation.patch"))
+      build_file = clean_dep("//third_party:eigen.BUILD"),
+      patch_file = clean_dep("//third_party:eigen_fix_cuda_compilation.patch")
   )
 
   tf_http_archive(
@@ -140,7 +118,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           # remove the whitelist entry in third_party/repo.bzl.
           # "https://github.com/raspberrypi/tools/archive/0e906ebc527eab1cdbf7adabff5b474da9562e9f.tar.gz",
       ],
-      build_file = str(Label("//:arm_compiler.BUILD")),
+      build_file = clean_dep("//:arm_compiler.BUILD"),
   )
 
   tf_http_archive(
@@ -151,7 +129,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "2ade869c3f42f23b5263c7d594aa3c7e5e61ac6a3afcaf5d6e42899d2a7986ce",
       strip_prefix = "libxsmm-1.8.1",
-      build_file = str(Label("//third_party:libxsmm.BUILD")),
+      build_file = clean_dep("//third_party:libxsmm.BUILD"),
   )
 
   tf_http_archive(
@@ -164,7 +142,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "932075525642b04ac6f1b50589f1df5cd72ec2f448b721fd32234cf183f0e755",
       strip_prefix = "or-tools-253f7955c6a1fd805408fba2e42ac6d45b312d15/src",
-      build_file = str(Label("//third_party:ortools.BUILD")),
+      build_file = clean_dep("//third_party:ortools.BUILD"),
   )
 
   tf_http_archive(
@@ -196,7 +174,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "6560547c63e4af82b0f202cb710ceabb3f21347a4b996db565a411da5b17aba0",
       strip_prefix = "farmhash-816a4ae622e964763ca0862d9dbd19324a1eaf45",
-      build_file = str(Label("//third_party:farmhash.BUILD")),
+      build_file = clean_dep("//third_party:farmhash.BUILD"),
   )
 
   tf_http_archive(
@@ -207,7 +185,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "0f30a15b1566d93f146c8d149878a06e91d9bb7ec2cfd76906df62a82be4aac9",
       strip_prefix = "highwayhash-dfcb97ca4fe9277bf9dc1802dd979b071896453b",
-      build_file = str(Label("//third_party:highwayhash.BUILD")),
+      build_file = clean_dep("//third_party:highwayhash.BUILD"),
   )
 
   tf_http_archive(
@@ -218,7 +196,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "00b0891c678c065446ca59bcee64719d0096d54d6886e6e472aeee2e170ae324",
       strip_prefix = "nasm-2.12.02",
-      build_file = str(Label("//third_party:nasm.BUILD")),
+      build_file = clean_dep("//third_party:nasm.BUILD"),
   )
 
   tf_http_archive(
@@ -226,11 +204,10 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       urls = [
           "https://mirror.bazel.build/github.com/libjpeg-turbo/libjpeg-turbo/archive/1.5.1.tar.gz",
           "https://github.com/libjpeg-turbo/libjpeg-turbo/archive/1.5.1.tar.gz",
-          "http://www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",
       ],
       sha256 = "c15a9607892113946379ccea3ca8b85018301b200754f209453ab21674268e77",
       strip_prefix = "libjpeg-turbo-1.5.1",
-      build_file = str(Label("//third_party/jpeg:jpeg.BUILD")),
+      build_file = clean_dep("//third_party/jpeg:jpeg.BUILD"),
   )
 
   tf_http_archive(
@@ -241,7 +218,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "716c59c7dfc808a4c368f8ada526932be72b2fcea11dd85dc9d88b1df1dfe9c2",
       strip_prefix = "libpng-1.2.53",
-      build_file = str(Label("//third_party:png.BUILD")),
+      build_file = clean_dep("//third_party:png.BUILD"),
   )
 
   tf_http_archive(
@@ -252,7 +229,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "208780b3616f9de0aeb50822b7a8f5482f6515193859e91ed61637be6ad74fd4",
       strip_prefix = "sqlite-amalgamation-3200000",
-      build_file = str(Label("//third_party:sqlite.BUILD")),
+      build_file = clean_dep("//third_party:sqlite.BUILD"),
   )
 
   tf_http_archive(
@@ -263,7 +240,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "34a7377ba834397db019e8eb122e551a49c98f49df75ec3fcc92b9a794a4f6d1",
       strip_prefix = "giflib-5.1.4",
-      build_file = str(Label("//third_party:gif.BUILD")),
+      build_file = clean_dep("//third_party:gif.BUILD"),
   )
 
   tf_http_archive(
@@ -274,7 +251,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "105f8d68616f8248e24bf0e9372ef04d3cc10104f1980f54d57b2ce73a5ad56a",
       strip_prefix = "six-1.10.0",
-      build_file = str(Label("//third_party:six.BUILD")),
+      build_file = clean_dep("//third_party:six.BUILD"),
   )
 
   tf_http_archive(
@@ -285,7 +262,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "ff6d2e2962d834acb125cc4dcc80c54a8c17c253f4cc9d9c43b5102a560bb75d",
       strip_prefix = "astor-0.6.2",
-      build_file = str(Label("//third_party:astor.BUILD")),
+      build_file = clean_dep("//third_party:astor.BUILD"),
   )
 
   tf_http_archive(
@@ -296,7 +273,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "7068908321ecd2774f145193c4b34a11305bd104b4551b09273dfd1d6a374930",
       strip_prefix = "gast-0.2.0",
-      build_file = str(Label("//third_party:gast.BUILD")),
+      build_file = clean_dep("//third_party:gast.BUILD"),
   )
 
   tf_http_archive(
@@ -307,7 +284,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b",
       strip_prefix = "termcolor-1.1.0",
-      build_file = str(Label("//third_party:termcolor.BUILD")),
+      build_file = clean_dep("//third_party:termcolor.BUILD"),
   )
 
   tf_http_archive(
@@ -328,7 +305,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "8813bf712a66b3d8b85dc289e1104ed220f1878cf981e2fe756dfaabe9a82892",
       strip_prefix = "backports.weakref-1.0rc1/src",
-      build_file = str(Label("//third_party:backports_weakref.BUILD")),
+      build_file = clean_dep("//third_party:backports_weakref.BUILD"),
   )
 
   tf_http_archive(
@@ -339,7 +316,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "2dadd04a2802de27e0fe5a19b76538f6da9d39ff244036afa00c1bba754de5ee",
       strip_prefix = "codegen-1.0",
-      build_file = str(Label("//third_party:codegen.BUILD")),
+      build_file = clean_dep("//third_party:codegen.BUILD"),
   )
 
   filegroup_external(
@@ -389,11 +366,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "nsync",
       urls = [
-          "https://mirror.bazel.build/github.com/google/nsync/archive/8502189abfa44c249c01c2cad64e6ed660a9a668.tar.gz",
-          "https://github.com/google/nsync/archive/8502189abfa44c249c01c2cad64e6ed660a9a668.tar.gz",
+          "https://mirror.bazel.build/github.com/google/nsync/archive/0559ce013feac8db639ee1bf776aca0325d28777.tar.gz",
+          "https://github.com/google/nsync/archive/0559ce013feac8db639ee1bf776aca0325d28777.tar.gz",
       ],
-      sha256 = "51f81ff4202bbb820cdbedc061bd2eb6765f2b5c06489e7a8694bedac329e8f8",
-      strip_prefix = "nsync-8502189abfa44c249c01c2cad64e6ed660a9a668",
+      sha256 = "6284454c5cd8b1dae2eeb8cf5eb63004de930b5427ed5f6b1aa793513df6b361",
+      strip_prefix = "nsync-0559ce013feac8db639ee1bf776aca0325d28777",
   )
 
   tf_http_archive(
@@ -424,7 +401,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           "http://ftp.exim.org/pub/pcre/pcre-8.39.tar.gz",
       ],
       strip_prefix = "pcre-8.39",
-      build_file = str(Label("//third_party:pcre.BUILD")),
+      build_file = clean_dep("//third_party:pcre.BUILD"),
   )
 
   tf_http_archive(
@@ -436,7 +413,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           "http://pilotfiber.dl.sourceforge.net/project/swig/swig/swig-3.0.8/swig-3.0.8.tar.gz",
       ],
       strip_prefix = "swig-3.0.8",
-      build_file = str(Label("//third_party:swig.BUILD")),
+      build_file = clean_dep("//third_party:swig.BUILD"),
   )
 
   tf_http_archive(
@@ -447,17 +424,17 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           "https://curl.haxx.se/download/curl-7.49.1.tar.gz",
       ],
       strip_prefix = "curl-7.49.1",
-      build_file = str(Label("//third_party:curl.BUILD")),
+      build_file = clean_dep("//third_party:curl.BUILD"),
   )
 
   tf_http_archive(
       name = "grpc",
       urls = [
-          "https://mirror.bazel.build/github.com/grpc/grpc/archive/730b778632e79cc3c96ad237f282d687ee325ce7.tar.gz",
-          "https://github.com/grpc/grpc/archive/730b778632e79cc3c96ad237f282d687ee325ce7.tar.gz",
+          "https://mirror.bazel.build/github.com/grpc/grpc/archive/575bda39755b98d1f7099406bb57a6e3b2074874.tar.gz",
+          "https://github.com/grpc/grpc/archive/575bda39755b98d1f7099406bb57a6e3b2074874.tar.gz",
       ],
-      sha256 = "8c91a8d12e1e868cf51f7340b75507a8aa017a7e1b56f46ed6816aeb803dc9bd",
-      strip_prefix = "grpc-730b778632e79cc3c96ad237f282d687ee325ce7",
+      sha256 = "f08a5c8e265191b39cc74915b1bc1fd380d86cd0176c92b7cce30b6ac50514ad",
+      strip_prefix = "grpc-575bda39755b98d1f7099406bb57a6e3b2074874",
   )
 
   tf_http_archive(
@@ -468,7 +445,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           "https://github.com/antirez/linenoise/archive/c894b9e59f02203dbe4e2be657572cf88c4230c3.tar.gz",
       ],
       strip_prefix = "linenoise-c894b9e59f02203dbe4e2be657572cf88c4230c3",
-      build_file = str(Label("//third_party:linenoise.BUILD")),
+      build_file = clean_dep("//third_party:linenoise.BUILD"),
   )
 
   # TODO(phawkins): currently, this rule uses an unofficial LLVM mirror.
@@ -476,12 +453,12 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "llvm",
       urls = [
-          "https://mirror.bazel.build/github.com/llvm-mirror/llvm/archive/fc8ba497cd1a1af4ecae19a5b64bdbd71e065e14.tar.gz",
-          "https://github.com/llvm-mirror/llvm/archive/fc8ba497cd1a1af4ecae19a5b64bdbd71e065e14.tar.gz",
+          "https://mirror.bazel.build/github.com/llvm-mirror/llvm/archive/1c3cdea2f181d8e14ee184466c5fb237f1b4cda8.tar.gz",
+          "https://github.com/llvm-mirror/llvm/archive/1c3cdea2f181d8e14ee184466c5fb237f1b4cda8.tar.gz",
       ],
-      sha256 = "f5721d9cc18a9109c9e9f847f48e69b710b961cee83e6691227e310cb3b5da58",
-      strip_prefix = "llvm-fc8ba497cd1a1af4ecae19a5b64bdbd71e065e14",
-      build_file = str(Label("//third_party/llvm:llvm.BUILD")),
+      sha256 = "1efbb9b05af88368be984d2f6526061d4a857181ef10f8841889a3a46869bb01",
+      strip_prefix = "llvm-1c3cdea2f181d8e14ee184466c5fb237f1b4cda8",
+      build_file = clean_dep("//third_party/llvm:llvm.BUILD"),
   )
 
   tf_http_archive(
@@ -492,7 +469,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "108532fb94c6f227558d45be3f3347b52539f0f58290a7bb31ec06c462d05326",
       strip_prefix = "lmdb-LMDB_0.9.19/libraries/liblmdb",
-      build_file = str(Label("//third_party:lmdb.BUILD")),
+      build_file = clean_dep("//third_party:lmdb.BUILD"),
   )
 
   tf_http_archive(
@@ -503,7 +480,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "07d34db40593d257324ec5fb9debc4dc33f29f8fb44e33a2eeb35503e61d0fe2",
       strip_prefix = "jsoncpp-11086dd6a7eba04289944367ca82cea71299ed70",
-      build_file = str(Label("//third_party:jsoncpp.BUILD")),
+      build_file = clean_dep("//third_party:jsoncpp.BUILD"),
   )
 
   tf_http_archive(
@@ -524,7 +501,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "36658cb768a54c1d4dec43c3116c27ed893e88b02ecfcb44f2166f9c0b7f2a0d",
       strip_prefix = "zlib-1.2.8",
-      build_file = str(Label("//third_party:zlib.BUILD")),
+      build_file = clean_dep("//third_party:zlib.BUILD"),
   )
 
   tf_http_archive(
@@ -534,7 +511,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           "http://www.kurims.kyoto-u.ac.jp/~ooura/fft.tgz",
       ],
       sha256 = "52bb637c70b971958ec79c9c8752b1df5ff0218a4db4510e60826e0cb79b5296",
-      build_file = str(Label("//third_party/fft2d:fft2d.BUILD")),
+      build_file = clean_dep("//third_party/fft2d:fft2d.BUILD"),
   )
 
   tf_http_archive(
@@ -545,7 +522,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "2f7504c73d85bac842e893340333be8cb8561710642fc9562fccdd9d2c3fcc94",
       strip_prefix = "snappy-1.1.4",
-      build_file = str(Label("//third_party:snappy.BUILD")),
+      build_file = clean_dep("//third_party:snappy.BUILD"),
   )
 
   tf_http_archive(
@@ -556,7 +533,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "2ca86fb6179ecbff789cc67c836139c1bbc0324ed8c04643405a30bf26325176",
       strip_prefix = "nccl-03d856977ecbaac87e598c0c4bafca96761b9ac7",
-      build_file = str(Label("//third_party:nccl.BUILD")),
+      build_file = clean_dep("//third_party:nccl.BUILD"),
   )
 
   tf_http_archive(
@@ -567,8 +544,8 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "dd035d57c8f19b0b612dd6eefe6e5eebad76f506e302cccb7c2066f25a83585e",
       strip_prefix = "librdkafka-0.11.1",
-      build_file = str(Label("//third_party:kafka/BUILD")),
-      patch_file = str(Label("//third_party/kafka:config.patch")),
+      build_file = clean_dep("//third_party:kafka/BUILD"),
+      patch_file = clean_dep("//third_party/kafka:config.patch"),
   )
 
   tf_http_archive(
@@ -579,7 +556,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "b888d8ce5fc10254c3dd6c9020c7764dd53cf39cf011249d0b4deda895de1b7c",
       strip_prefix = "aws-sdk-cpp-1.3.15",
-      build_file = str(Label("//third_party:aws.BUILD")),
+      build_file = clean_dep("//third_party:aws.BUILD"),
   )
 
   java_import_external(
@@ -615,7 +592,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "3c8f25c02e806c3ce0ab5fb7da1817f89fc9732709024e2a81b6b82f7cc792a8",
       strip_prefix = "jemalloc-4.4.0",
-      build_file = str(Label("//third_party:jemalloc.BUILD")),
+      build_file = clean_dep("//third_party:jemalloc.BUILD"),
   )
 
   java_import_external(
@@ -644,11 +621,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   )
 
   java_import_external(
-      name = "javax_validation",
-      jar_sha256 = "e459f313ebc6db2483f8ceaad39af07086361b474fa92e40f442e8de5d9895dc",
+      name = "org_checkerframework_qual",
+      jar_sha256 = "a17501717ef7c8dda4dba73ded50c0d7cde440fd721acfeacbf19786ceac1ed6",
       jar_urls = [
-          "http://mirror.bazel.build/repo1.maven.org/maven2/javax/validation/validation-api/1.0.0.GA/validation-api-1.0.0.GA.jar",
-          "http://repo1.maven.org/maven2/javax/validation/validation-api/1.0.0.GA/validation-api-1.0.0.GA.jar",
+          "http://mirror.bazel.build/repo1.maven.org/maven2/org/checkerframework/checker-qual/2.4.0/checker-qual-2.4.0.jar",
+          "http://repo1.maven.org/maven2/org/checkerframework/checker-qual/2.4.0/checker-qual-2.4.0.jar",
       ],
       licenses = ["notice"],  # Apache 2.0
   )
@@ -661,7 +638,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "e0928ca4aa10ea1e0551e2d7ce4d1d7ea2d84b2abbdef082b0da84268791d0c4",
       strip_prefix = "pprof-c0fb62ec88c411cc91194465e54db2632845b650",
-      build_file = str(Label("//third_party:pprof.BUILD")),
+      build_file = clean_dep("//third_party:pprof.BUILD"),
   )
 
   tf_http_archive(
@@ -672,7 +649,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       ],
       sha256 = "6bfa06ab52a650ae7ee6963143a0bbc667d6504822cbd9670369b598f18c58c3",
       strip_prefix = "cub-1.8.0",
-      build_file = str(Label("//third_party:cub.BUILD")),
+      build_file = clean_dep("//third_party:cub.BUILD"),
   )
 
   tf_http_archive(
@@ -683,7 +660,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           "https://github.com/cython/cython/archive/3732784c45cfb040a5b0936951d196f83a12ea17.tar.gz",
       ],
       strip_prefix = "cython-3732784c45cfb040a5b0936951d196f83a12ea17",
-      build_file = str(Label("//third_party:cython.BUILD")),
+      build_file = clean_dep("//third_party:cython.BUILD"),
       delete = ["BUILD.bazel"],
   )
 
@@ -697,16 +674,6 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       sha256 = "699b55a6916c687f4b7dc092dbbf5f64672cde0dc965f79717735ec4e5416556",
   )
 
-  tf_http_archive(
-      name = "rbe_integration_test",
-      urls = [
-          "http://mirror.bazel.build/github.com/google/rbe-integration-test/archive/78a6194c7dda200b9522cf07707e3bc695804d1e.tar.gz",
-          "https://github.com/google/rbe-integration-test/archive/78a6194c7dda200b9522cf07707e3bc695804d1e.tar.gz",
-      ],
-      sha256 = "66d93b3919a165d486c31f5290d312abe9fda2685242f812c110653c124e1db4",
-      strip_prefix = "rbe-integration-test-78a6194c7dda200b9522cf07707e3bc695804d1e",
-   )
-
   tf_http_archive(
       name = "arm_neon_2_x86_sse",
       sha256 = "c8d90aa4357f8079d427e87a6f4c493da1fa4140aee926c05902d7ec1533d9a5",
@@ -715,7 +682,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           "https://mirror.bazel.build/github.com/intel/ARM_NEON_2_x86_SSE/archive/0f77d9d182265259b135dad949230ecbf1a2633d.tar.gz",
           "https://github.com/intel/ARM_NEON_2_x86_SSE/archive/0f77d9d182265259b135dad949230ecbf1a2633d.tar.gz",
       ],
-      build_file = str(Label("//third_party:arm_neon_2_x86_sse.BUILD")),
+      build_file = clean_dep("//third_party:arm_neon_2_x86_sse.BUILD"),
   )
 
   tf_http_archive(
@@ -726,7 +693,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           "https://mirror.bazel.build/github.com/google/flatbuffers/archive/971a68110e4fc1bace10fcb6deeb189e7e1a34ce.tar.gz",
           "https://github.com/google/flatbuffers/archive/971a68110e4fc1bace10fcb6deeb189e7e1a34ce.tar.gz",
       ],
-      build_file = str(Label("//third_party/flatbuffers:flatbuffers.BUILD")),
+      build_file = clean_dep("//third_party/flatbuffers:flatbuffers.BUILD"),
   )
 
   tf_http_archive(
@@ -736,7 +703,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip",
           "https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip",
       ],
-      build_file = str(Label("//third_party:tflite_mobilenet.BUILD")),
+      build_file = clean_dep("//third_party:tflite_mobilenet.BUILD"),
   )
 
   tf_http_archive(
@@ -746,7 +713,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
           "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/smartreply_1.0_2017_11_01.zip",
           "https://storage.googleapis.com/download.tensorflow.org/models/tflite/smartreply_1.0_2017_11_01.zip"
       ],
-      build_file = str(Label("//third_party:tflite_smartreply.BUILD")),
+      build_file = clean_dep("//third_party:tflite_smartreply.BUILD"),
   )
 
   ##############################################################################
@@ -810,7 +777,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   # Needed by Protobuf
   native.bind(
       name = "python_headers",
-      actual = str(Label("//util/python:python_headers")),
+      actual = clean_dep("//util/python:python_headers"),
   )
 
   # Needed by Protobuf
diff --git a/third_party/clang_toolchain/BUILD b/third_party/clang_toolchain/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/third_party/clang_toolchain/cc_configure_clang.bzl b/third_party/clang_toolchain/cc_configure_clang.bzl
new file mode 100644
index 0000000000000000000000000000000000000000..1181110ea9674e56264509fe5bb043a587888200
--- /dev/null
+++ b/third_party/clang_toolchain/cc_configure_clang.bzl
@@ -0,0 +1,27 @@
+""" Downloads clang and configures the crosstool using bazel's autoconf."""
+
+load("@bazel_tools//tools/cpp:cc_configure.bzl", "cc_autoconf_impl")
+load(":download_clang.bzl", "download_clang")
+
+_TF_DOWNLOAD_CLANG = "TF_DOWNLOAD_CLANG"
+_TF_NEED_CUDA = "TF_NEED_CUDA"
+
+def _cc_clang_autoconf(repo_ctx):
+  if repo_ctx.os.environ.get(_TF_DOWNLOAD_CLANG) != "1":
+    return
+  if repo_ctx.os.environ.get(_TF_NEED_CUDA) == "1":
+    # Clang is handled separately for CUDA configs.
+    # See cuda_configure.bzl for more details.
+    return
+
+  download_clang(repo_ctx, out_folder='extra_tools')
+  overridden_tools = {'gcc': 'extra_tools/bin/clang'}
+  cc_autoconf_impl(repo_ctx, overridden_tools)
+
+cc_download_clang_toolchain = repository_rule(
+    environ = [
+        _TF_DOWNLOAD_CLANG,
+        _TF_NEED_CUDA,
+    ],
+    implementation = _cc_clang_autoconf,
+)
diff --git a/third_party/gpus/download_clang.bzl b/third_party/clang_toolchain/download_clang.bzl
similarity index 100%
rename from third_party/gpus/download_clang.bzl
rename to third_party/clang_toolchain/download_clang.bzl
diff --git a/third_party/gpus/crosstool/CROSSTOOL_clang.tpl b/third_party/gpus/crosstool/CROSSTOOL_clang.tpl
index e4363d604577de09241d635b6990c9dd6429efe0..2f09473ee2ddf9a38ca0c7aa11094690607b532f 100644
--- a/third_party/gpus/crosstool/CROSSTOOL_clang.tpl
+++ b/third_party/gpus/crosstool/CROSSTOOL_clang.tpl
@@ -49,6 +49,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-lstdc++"
       }
@@ -75,6 +76,7 @@ toolchain {
     name: "alwayslink"
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       action: "c++-link-executable"
       flag_group {
         flag: "-Wl,-no-as-needed"
@@ -116,6 +118,7 @@ toolchain {
     }
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-Wl,-z,relro,-z,now"
       }
@@ -161,6 +164,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         # Stamp the binary with a unique identifier.
         flag: "-Wl,--build-id=md5"
@@ -176,6 +180,7 @@ toolchain {
       action: "c++-compile"
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag:"-no-canonical-prefixes"
       }
@@ -199,6 +204,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-B/usr/bin/"
       }
@@ -246,6 +252,7 @@ toolchain {
     }
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       action: "c++-link-executable"
       flag_group {
         flag: "-Wl,--gc-sections"
diff --git a/third_party/gpus/cuda/remote.BUILD.tpl b/third_party/gpus/cuda/remote.BUILD.tpl
index d88d512b90c352e6a301ed6efe8266d8dd6bf744..f774def5e6cec25e4920ecce0076340a31c70386 100644
--- a/third_party/gpus/cuda/remote.BUILD.tpl
+++ b/third_party/gpus/cuda/remote.BUILD.tpl
@@ -41,65 +41,65 @@ config_setting(
 
 alias(
     name = "cuda_headers",
-    actual = "%{remote_cuda_repo}cuda:cuda_headers",
+    actual = "%{remote_cuda_repo}/cuda:cuda_headers",
 )
 
 alias(
     name = "cudart_static",
-    actual = "%{remote_cuda_repo}cuda:cudart_static",
+    actual = "%{remote_cuda_repo}/cuda:cudart_static",
 )
 
 alias(
     name = "cuda_driver",
-    actual = "%{remote_cuda_repo}cuda:cuda_driver",
+    actual = "%{remote_cuda_repo}/cuda:cuda_driver",
 )
 
 alias(
     name = "cudart",
-    actual = "%{remote_cuda_repo}cuda:cudart",
+    actual = "%{remote_cuda_repo}/cuda:cudart",
 )
 
 alias(
     name = "cublas",
-    actual = "%{remote_cuda_repo}cuda:cublas",
+    actual = "%{remote_cuda_repo}/cuda:cublas",
 )
 
 alias(
     name = "cusolver",
-    actual = "%{remote_cuda_repo}cuda:cusolver",
+    actual = "%{remote_cuda_repo}/cuda:cusolver",
 )
 
 alias(
     name = "cudnn",
-    actual = "%{remote_cuda_repo}cuda:cudnn",
+    actual = "%{remote_cuda_repo}/cuda:cudnn",
 )
 
 alias(
     name = "cufft",
-    actual = "%{remote_cuda_repo}cuda:cufft",
+    actual = "%{remote_cuda_repo}/cuda:cufft",
 )
 
 alias(
     name = "curand",
-    actual = "%{remote_cuda_repo}cuda:curand",
+    actual = "%{remote_cuda_repo}/cuda:curand",
 )
 
 alias(
     name = "cuda",
-    actual = "%{remote_cuda_repo}cuda:cuda",
+    actual = "%{remote_cuda_repo}/cuda:cuda",
 )
 
 alias(
     name = "cupti_headers",
-    actual = "%{remote_cuda_repo}cuda:cupti_headers",
+    actual = "%{remote_cuda_repo}/cuda:cupti_headers",
 )
 
 alias(
     name = "cupti_dsos",
-    actual = "%{remote_cuda_repo}cuda:cupti_dsos",
+    actual = "%{remote_cuda_repo}/cuda:cupti_dsos",
 )
 
 alias(
     name = "libdevice_root",
-    actual = "%{remote_cuda_repo}cuda:libdevice_root",
+    actual = "%{remote_cuda_repo}/cuda:libdevice_root",
 )
diff --git a/third_party/gpus/cuda_configure.bzl b/third_party/gpus/cuda_configure.bzl
index b7c47a19ddcfc69dbee54bf6ca4080489b292c01..ede7e318976527eb4fe6489083dc45896733f7bf 100644
--- a/third_party/gpus/cuda_configure.bzl
+++ b/third_party/gpus/cuda_configure.bzl
@@ -38,7 +38,65 @@ _DEFAULT_CUDA_TOOLKIT_PATH = "/usr/local/cuda"
 _DEFAULT_CUDNN_INSTALL_PATH = "/usr/local/cuda"
 _DEFAULT_CUDA_COMPUTE_CAPABILITIES = ["3.5", "5.2"]
 
-load(":download_clang.bzl", "download_clang")
+# Lookup paths for CUDA / cuDNN libraries, relative to the install directories.
+#
+# Paths will be tried out in the order listed below. The first successful path
+# will be used. For example, when looking for the cudart libraries, the first
+# attempt will be lib64/cudart inside the CUDA toolkit.
+CUDA_LIB_PATHS = [
+  "lib64/",
+  "lib64/stubs/",
+  "lib/x86_64-linux-gnu/",
+  "lib/x64/",
+  "lib/",
+  "",
+]
+
+# Lookup paths for cupti.h, relative to the CUDA toolkit directory.
+#
+# On most systems, the cupti library is not installed in the same directory as
+# the other CUDA libraries but rather in a special extras/CUPTI directory.
+CUPTI_HEADER_PATHS = [
+  "extras/CUPTI/include/",
+  "include/cuda/CUPTI/",
+]
+
+# Lookup paths for the cupti library, relative to the CUDA toolkit directory.
+#
+# On most systems, the cupti library is not installed in the same directory as
+# the other CUDA libraries but rather in a special extras/CUPTI directory.
+CUPTI_LIB_PATHS = [
+  "extras/CUPTI/lib64/",
+  "lib/x86_64-linux-gnu",
+  "lib64/",
+  "extras/CUPTI/libx64/",
+  "extras/CUPTI/lib/",
+  "lib/",
+]
+
+# Lookup paths for CUDA headers (cuda.h) relative to the CUDA toolkit directory.
+CUDA_INCLUDE_PATHS = [
+  "include/",
+  "include/cuda/"
+]
+
+# Lookup paths for cudnn.h relative to the CUDNN install directory.
+CUDNN_INCLUDE_PATHS = [
+  "",
+  "include/",
+  "include/cuda/",
+]
+
+# Lookup paths for NVVM libdevice relative to the CUDA toolkit directory.
+#
+# libdevice implements mathematical functions for GPU kernels, and is provided
+# in NVVM bitcode (a subset of LLVM bitcode).
+NVVM_LIBDEVICE_PATHS = [
+  "nvvm/libdevice/",
+  "share/cuda/",
+]
+
+load("//third_party/clang_toolchain:download_clang.bzl", "download_clang")
 
 # TODO(dzc): Once these functions have been factored out of Bazel's
 # cc_configure.bzl, load them from @bazel_tools instead.
@@ -522,31 +580,31 @@ def _find_cuda_lib(lib, repository_ctx, cpu_value, basedir, version="",
       path: The full path to the library.
   """
   file_name = _lib_name(lib, cpu_value, version, static)
-  if cpu_value == "Linux":
-    path = repository_ctx.path("%s/lib64/%s" % (basedir, file_name))
-    if path.exists:
-      return struct(file_name=file_name, path=str(path.realpath))
-    path = repository_ctx.path("%s/lib64/stubs/%s" % (basedir, file_name))
-    if path.exists:
-      return struct(file_name=file_name, path=str(path.realpath))
-    path = repository_ctx.path(
-        "%s/lib/x86_64-linux-gnu/%s" % (basedir, file_name))
+  for relative_path in CUDA_LIB_PATHS:
+    path = repository_ctx.path("%s/%s%s" % (basedir, relative_path, file_name))
     if path.exists:
       return struct(file_name=file_name, path=str(path.realpath))
+  auto_configure_fail("Cannot find cuda library %s" % file_name)
 
-  elif cpu_value == "Windows":
-    path = repository_ctx.path("%s/lib/x64/%s" % (basedir, file_name))
-    if path.exists:
-      return struct(file_name=file_name, path=str(path.realpath))
 
-  path = repository_ctx.path("%s/lib/%s" % (basedir, file_name))
-  if path.exists:
-    return struct(file_name=file_name, path=str(path.realpath))
-  path = repository_ctx.path("%s/%s" % (basedir, file_name))
-  if path.exists:
-    return struct(file_name=file_name, path=str(path.realpath))
+def _find_cupti_header_dir(repository_ctx, cuda_config):
+  """Returns the path to the directory containing cupti.h
 
-  auto_configure_fail("Cannot find cuda library %s" % file_name)
+  On most systems, the cupti library is not installed in the same directory as
+  the other CUDA libraries but rather in a special extras/CUPTI directory.
+
+  Args:
+    repository_ctx: The repository context.
+    cuda_config: The CUDA config as returned by _get_cuda_config
+
+  Returns:
+    The path of the directory containing the cupti header.
+  """
+  cuda_toolkit_path = cuda_config.cuda_toolkit_path
+  for relative_path in CUPTI_HEADER_PATHS:
+    if repository_ctx.path("%s/%scupti.h" % (cuda_toolkit_path, relative_path)).exists:
+        return ("%s/%s" % (cuda_toolkit_path, relative_path))[:-1]
+  auto_configure_fail("Cannot find cupti.h under %s" % cuda_toolkit_path)
 
 
 def _find_cupti_lib(repository_ctx, cuda_config):
@@ -566,35 +624,13 @@ def _find_cupti_lib(repository_ctx, cuda_config):
   """
   file_name = _lib_name("cupti", cuda_config.cpu_value,
                         cuda_config.cuda_version)
-  if cuda_config.cpu_value == "Linux":
-    path = repository_ctx.path(
-        "%s/extras/CUPTI/lib64/%s" % (cuda_config.cuda_toolkit_path, file_name))
-    if path.exists:
-      return struct(file_name=file_name, path=str(path.realpath))
-
-    path = repository_ctx.path(
-        "%s/lib/x86_64-linux-gnu/%s" % (cuda_config.cuda_toolkit_path,
-                                        file_name))
-    if path.exists:
-      return struct(file_name=file_name, path=str(path.realpath))
-
-  elif cuda_config.cpu_value == "Windows":
+  cuda_toolkit_path = cuda_config.cuda_toolkit_path
+  for relative_path in CUPTI_LIB_PATHS:
     path = repository_ctx.path(
-        "%s/extras/CUPTI/libx64/%s" %
-        (cuda_config.cuda_toolkit_path, file_name))
+        "%s/%s%s" % (cuda_toolkit_path, relative_path, file_name))
     if path.exists:
       return struct(file_name=file_name, path=str(path.realpath))
 
-  path = repository_ctx.path(
-      "%s/extras/CUPTI/lib/%s" % (cuda_config.cuda_toolkit_path, file_name))
-  if path.exists:
-    return struct(file_name=file_name, path=str(path.realpath))
-
-  path = repository_ctx.path(
-      "%s/lib/%s" % (cuda_config.cuda_toolkit_path, file_name))
-  if path.exists:
-    return struct(file_name=file_name, path=str(path.realpath))
-
   auto_configure_fail("Cannot find cupti library %s" % file_name)
 
 def _find_libs(repository_ctx, cuda_config):
@@ -635,6 +671,23 @@ def _find_libs(repository_ctx, cuda_config):
   }
 
 
+def _find_cuda_include_path(repository_ctx, cuda_config):
+  """Returns the path to the directory containing cuda.h
+
+  Args:
+    repository_ctx: The repository context.
+    cuda_config: The CUDA config as returned by _get_cuda_config
+
+  Returns:
+    The path of the directory containing the CUDA headers.
+  """
+  cuda_toolkit_path = cuda_config.cuda_toolkit_path
+  for relative_path in CUDA_INCLUDE_PATHS:
+    if repository_ctx.path("%s/%scuda.h" % (cuda_toolkit_path, relative_path)).exists:
+        return ("%s/%s" % (cuda_toolkit_path, relative_path))[:-1]
+  auto_configure_fail("Cannot find cuda.h under %s" % cuda_toolkit_path)
+
+
 def _find_cudnn_header_dir(repository_ctx, cudnn_install_basedir):
   """Returns the path to the directory containing cudnn.h
 
@@ -646,15 +699,31 @@ def _find_cudnn_header_dir(repository_ctx, cudnn_install_basedir):
   Returns:
     The path of the directory containing the cudnn header.
   """
-  if repository_ctx.path(cudnn_install_basedir + "/cudnn.h").exists:
-    return cudnn_install_basedir
-  if repository_ctx.path(cudnn_install_basedir + "/include/cudnn.h").exists:
-    return cudnn_install_basedir + "/include"
+  for relative_path in CUDNN_INCLUDE_PATHS:
+    if repository_ctx.path("%s/%scudnn.h" % (cudnn_install_basedir, relative_path)).exists:
+        return ("%s/%s" % (cudnn_install_basedir, relative_path))[:-1]
   if repository_ctx.path("/usr/include/cudnn.h").exists:
     return "/usr/include"
   auto_configure_fail("Cannot find cudnn.h under %s" % cudnn_install_basedir)
 
 
+def _find_nvvm_libdevice_dir(repository_ctx, cuda_config):
+  """Returns the path to the directory containing libdevice in bitcode format.
+
+  Args:
+    repository_ctx: The repository context.
+    cuda_config: The CUDA config as returned by _get_cuda_config
+
+  Returns:
+    The path of the directory containing the libdevice bitcode file.
+  """
+  cuda_toolkit_path = cuda_config.cuda_toolkit_path
+  for relative_path in NVVM_LIBDEVICE_PATHS:
+    if repository_ctx.path("%s/%slibdevice.10.bc" % (cuda_toolkit_path, relative_path)).exists:
+      return ("%s/%s" % (cuda_toolkit_path, relative_path))[:-1]
+  auto_configure_fail("Cannot find libdevice.10.bc under %s" % cuda_toolkit_path)
+
+
 def _cudart_static_linkopt(cpu_value):
   """Returns additional platform-specific linkopts for cudart."""
   return "" if cpu_value == "Darwin" else "\"-lrt\","
@@ -925,21 +994,22 @@ def _create_local_cuda_repository(repository_ctx):
   """Creates the repository containing files set up to build with CUDA."""
   cuda_config = _get_cuda_config(repository_ctx)
 
+  cuda_include_path = _find_cuda_include_path(repository_ctx, cuda_config)
   cudnn_header_dir = _find_cudnn_header_dir(repository_ctx,
                                             cuda_config.cudnn_install_basedir)
+  cupti_header_dir = _find_cupti_header_dir(repository_ctx, cuda_config)
+  nvvm_libdevice_dir = _find_nvvm_libdevice_dir(repository_ctx, cuda_config)
 
   # Set up symbolic links for the cuda toolkit by creating genrules to do
   # symlinking. We create one genrule for each directory we want to track under
   # cuda_toolkit_path
   cuda_toolkit_path = cuda_config.cuda_toolkit_path
-  cuda_include_path = cuda_toolkit_path + "/include"
   genrules = [symlink_genrule_for_dir(repository_ctx,
       cuda_include_path, "cuda/include", "cuda-include")]
   genrules.append(symlink_genrule_for_dir(repository_ctx,
-      cuda_toolkit_path + "/nvvm", "cuda/nvvm", "cuda-nvvm"))
+      nvvm_libdevice_dir, "cuda/nvvm/libdevice", "cuda-nvvm"))
   genrules.append(symlink_genrule_for_dir(repository_ctx,
-      cuda_toolkit_path + "/extras/CUPTI/include",
-      "cuda/extras/CUPTI/include", "cuda-extras"))
+      cupti_header_dir, "cuda/extras/CUPTI/include", "cuda-extras"))
 
   cuda_libs = _find_libs(repository_ctx, cuda_config)
   cuda_lib_src = []
@@ -1086,6 +1156,7 @@ cuda_configure = repository_rule(
         _TF_CUDNN_VERSION,
         _TF_CUDA_COMPUTE_CAPABILITIES,
         _TF_CUDA_CONFIG_REPO,
+        "NVVMIR_LIBRARY_DIR",
     ],
 )
 
diff --git a/third_party/jpeg/jpeg.BUILD b/third_party/jpeg/jpeg.BUILD
index 87a23925c4316c3ee107af77272300e34b1bb257..4418ac32fc4b08713ff1d1f0d78042803153c886 100644
--- a/third_party/jpeg/jpeg.BUILD
+++ b/third_party/jpeg/jpeg.BUILD
@@ -526,12 +526,12 @@ config_setting(
 
 config_setting(
     name = "armeabi-v7a",
-    values = {"android_cpu": "armeabi-v7a"},
+    values = {"cpu": "armeabi-v7a"},
 )
 
 config_setting(
     name = "arm64-v8a",
-    values = {"android_cpu": "arm64-v8a"},
+    values = {"cpu": "arm64-v8a"},
 )
 
 config_setting(
diff --git a/third_party/kafka/BUILD b/third_party/kafka/BUILD
index a61a9e1f6c2b29ad3b992e810c0cab463dfd7feb..a839ca717e695f35fac684b510f0a022010e0710 100644
--- a/third_party/kafka/BUILD
+++ b/third_party/kafka/BUILD
@@ -130,12 +130,16 @@ cc_library(
     ],
     hdrs = [
         "config.h",
+        "src-cpp/rdkafkacpp.h",
+        "src-cpp/rdkafkacpp_int.h",
+        "src/lz4.c",
+        "src/snappy_compat.h",
     ],
-    defines = [
+    copts = [
+        "-Iexternal/kafka/src",
+        "-Iexternal/kafka/src-cpp",
     ],
-    includes = [
-        "src",
-        "src-cpp",
+    defines = [
     ],
     linkopts = [
         "-lpthread",
@@ -143,5 +147,6 @@ cc_library(
     visibility = ["//visibility:public"],
     deps = [
         "@boringssl//:ssl",
+        "@zlib_archive//:zlib",
     ],
 )
diff --git a/third_party/mkl/BUILD b/third_party/mkl/BUILD
index b27d341404c4ee1ca1e87ff3b9f427ec52eba739..3262562bccca4f2a8b3da860cb38928f144994a9 100644
--- a/third_party/mkl/BUILD
+++ b/third_party/mkl/BUILD
@@ -1,7 +1,5 @@
 licenses(["notice"])  # 3-Clause BSD
 
-exports_files(["LICENSE"])
-
 config_setting(
     name = "using_mkl",
     values = {
@@ -10,17 +8,52 @@ config_setting(
     visibility = ["//visibility:public"],
 )
 
+config_setting(
+    name = "using_mkl_lnx_x64",
+    values = {
+        "cpu": "k8",
+        "define": "using_mkl=true",
+    },
+    visibility = ["//visibility:public"],
+)
+
 load(
     "//third_party/mkl:build_defs.bzl",
     "if_mkl",
 )
 
+filegroup(
+    name = "LICENSE",
+    visibility = ["//visibility:public"],
+    srcs = ["MKL_LICENSE"] + select({
+        "@org_tensorflow//tensorflow:linux_x86_64": [
+            "@mkl_linux//:LICENSE",
+        ],
+        "@org_tensorflow//tensorflow:darwin": [
+            "@mkl_darwin//:LICENSE",
+        ],
+        "@org_tensorflow//tensorflow:windows": [
+            "@mkl_windows//:LICENSE",
+        ]
+    })
+)
+
 cc_library(
     name = "intel_binary_blob",
-    srcs = if_mkl([
-        "@mkl//:libmklml_intel.so",
-        "@mkl//:libiomp5.so",
-    ]),
+
     visibility = ["//visibility:public"],
-    deps = ["@mkl//:mkl_headers"],
+    deps = select({
+        "@org_tensorflow//tensorflow:linux_x86_64": [
+            "@mkl_linux//:mkl_headers",
+            "@mkl_linux//:mkl_libs_linux",
+        ],
+        "@org_tensorflow//tensorflow:darwin": [
+            "@mkl_darwin//:mkl_headers",
+            "@mkl_darwin//:mkl_libs_darwin",
+        ],
+        "@org_tensorflow//tensorflow:windows": [
+            "@mkl_windows//:mkl_headers",
+            "@mkl_windows//:mkl_libs_windows",
+        ]
+    })
 )
diff --git a/third_party/mkl/LICENSE b/third_party/mkl/MKL_LICENSE
similarity index 100%
rename from third_party/mkl/LICENSE
rename to third_party/mkl/MKL_LICENSE
diff --git a/third_party/mkl/build_defs.bzl b/third_party/mkl/build_defs.bzl
index 8b73ddabdd7ff5de7374ffbbb76e7bf954c27765..53e02769dad5dd74348dec2dcec88010e543f01c 100644
--- a/third_party/mkl/build_defs.bzl
+++ b/third_party/mkl/build_defs.bzl
@@ -24,6 +24,18 @@ def if_mkl(if_true, if_false = []):
         "//conditions:default": if_false
     })
 
+def if_mkl_lnx_x64(if_true, if_false = []):
+    """Shorthand for select()'ing on building with MKL on Linux x86_64.
+
+    Returns a select statement which evaluates to if_true if we're building
+    with MKL enabled for cpu k8.  Otherwise, it evaluates to if_false.
+
+    """
+    return select({
+        str(Label("//third_party/mkl:using_mkl_lnx_x64")): if_true,
+        "//conditions:default": if_false
+    })
+
 
 def _enable_local_mkl(repository_ctx):
   return _TF_MKL_ROOT in repository_ctx.os.environ
diff --git a/third_party/mkl/mkl.BUILD b/third_party/mkl/mkl.BUILD
index 8db97232e156b46091b379b0771239f55d6ea5ad..892221ec00295a694ab40868cd886e820768f78f 100644
--- a/third_party/mkl/mkl.BUILD
+++ b/third_party/mkl/mkl.BUILD
@@ -17,14 +17,29 @@ cc_library(
     visibility = ["//visibility:public"],
 )
 
-filegroup(
-    name = "libmklml_intel.so",
-    srcs = ["lib/libmklml_intel.so"],
+cc_library(
+    name = "mkl_libs_linux",
+    srcs = [
+        "lib/libiomp5.so",
+        "lib/libmklml_intel.so"
+    ],
     visibility = ["//visibility:public"],
 )
 
-filegroup(
-    name = "libiomp5.so",
-    srcs = ["lib/libiomp5.so"],
+cc_library(
+    name = "mkl_libs_darwin",
+    srcs = [
+        "lib/libiomp5.dylib",
+        "lib/libmklml.dylib"
+    ],
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "mkl_libs_windows",
+    srcs = [
+        "lib/libiomp5md.lib",
+        "lib/mklml.lib"
+    ],
     visibility = ["//visibility:public"],
 )
diff --git a/third_party/mkl_dnn/mkldnn.BUILD b/third_party/mkl_dnn/mkldnn.BUILD
index 58bb7a6a5d0494301aa5b0bd29f858e7d06e69d3..68f24aabaee6ed33fe5b92a3996f7d175b924ea0 100644
--- a/third_party/mkl_dnn/mkldnn.BUILD
+++ b/third_party/mkl_dnn/mkldnn.BUILD
@@ -1,5 +1,13 @@
 exports_files(["LICENSE"])
 
+config_setting(
+    name = "clang_linux_x86_64",
+    values = {
+        "cpu": "k8",
+        "define": "using_clang=true",
+    },
+)
+
 cc_library(
     name = "mkl_dnn",
     srcs = glob([
@@ -9,8 +17,11 @@ cc_library(
     hdrs = glob(["include/*"]),
     copts = ["-fexceptions"] + select({
         "@org_tensorflow//tensorflow:linux_x86_64": [
-            "-fopenmp",
+            "-fopenmp",  # only works with gcc
         ],
+        # TODO(ibiryukov): enable openmp with clang by including libomp as a
+        # dependency.
+        ":clang_linux_x86_64": [],
         "//conditions:default": [],
     }),
     includes = [
diff --git a/third_party/py/BUILD.tpl b/third_party/py/BUILD.tpl
index de06ad5f27e7c08aade4a8f51ab60ba52d012b7b..1dd8ab433a37a127b98ae7069bffcbfd4f6d8bd1 100644
--- a/third_party/py/BUILD.tpl
+++ b/third_party/py/BUILD.tpl
@@ -2,20 +2,26 @@ licenses(["restricted"])
 
 package(default_visibility = ["//visibility:public"])
 
+# To build Python C/C++ extensions on Windows, we need to link to the Python import library pythonXY.lib.
+# See https://docs.python.org/3/extending/windows.html
+cc_import(
+    name = "python_lib",
+    interface_library = select({
+        ":windows": ":python_import_lib",
+        # A placeholder for Unix platforms which makes --no_build happy.
+        "//conditions:default": "not-existing.lib",
+    }),
+    system_provided = 1,
+)
+
 cc_library(
     name = "python_headers",
     hdrs = [":python_include"],
-    data = select({
-        ":windows": [":python_import_lib"],
+    deps = select({
+        ":windows": [":python_lib"],
         "//conditions:default": [],
     }),
     includes = ["python_include"],
-    linkopts = select({
-        # TODO(pcloudy): Ideally, this should just go into deps after resolving
-        # https://github.com/bazelbuild/bazel/issues/3237,
-        ":windows": ["$(locations :python_import_lib)"],
-        "//conditions:default": [],
-    }),
 )
 
 cc_library(
diff --git a/third_party/sycl/sycl/BUILD.tpl b/third_party/sycl/sycl/BUILD.tpl
index 21b1a2bbf7d320327d8f6e35124e6ef47019130b..b7e9aa8edb4dd1ecc36595ea0a11f442d05cefee 100755
--- a/third_party/sycl/sycl/BUILD.tpl
+++ b/third_party/sycl/sycl/BUILD.tpl
@@ -21,7 +21,7 @@ config_setting(
     name = "using_sycl_trisycl",
     define_values = {
         "using_sycl": "true",
-        "using_trisycl": "false",
+        "using_trisycl": "true",
     },
 )
 
diff --git a/third_party/tensorrt/tensorrt_configure.bzl b/third_party/tensorrt/tensorrt_configure.bzl
index 8e76e5d02aeddab66dacaa495a6c493f18a95a69..9b946505a615372aa7de317c8ee390a2cd4b60e9 100644
--- a/third_party/tensorrt/tensorrt_configure.bzl
+++ b/third_party/tensorrt/tensorrt_configure.bzl
@@ -57,6 +57,10 @@ def _find_trt_header_dir(repository_ctx, trt_install_path):
     path = "/usr/include/x86_64-linux-gnu"
     if _headers_exist(repository_ctx, path):
       return path
+  if trt_install_path == "/usr/lib/aarch64-linux-gnu":
+    path = "/usr/include/aarch64-linux-gnu"
+    if _headers_exist(repository_ctx, path):
+      return path
   path = str(repository_ctx.path("%s/../include" % trt_install_path).realpath)
   if _headers_exist(repository_ctx, path):
     return path
diff --git a/third_party/toolchains/gpus/crosstool/BUILD b/third_party/toolchains/gpus/crosstool/BUILD
index a8c6b0f0291363f3a7576a70e78b3428fb984957..1f9065007ca884a46bfa391d1ee8a8f0333da235 100644
--- a/third_party/toolchains/gpus/crosstool/BUILD
+++ b/third_party/toolchains/gpus/crosstool/BUILD
@@ -50,3 +50,8 @@ filegroup(
     name = "empty",
     srcs = [],
 )
+
+filegroup(
+    name = "crosstool_wrapper_driver_is_not_gcc",
+    srcs = ["clang/bin/crosstool_wrapper_driver_is_not_gcc"],
+)
diff --git a/third_party/toolchains/gpus/crosstool/CROSSTOOL b/third_party/toolchains/gpus/crosstool/CROSSTOOL
index a47e0c7cd74edcea777d76854c2d7e97d69897fa..d6ee7e38c414dd59b76c7b2b4c95c55831bb30a8 100644
--- a/third_party/toolchains/gpus/crosstool/CROSSTOOL
+++ b/third_party/toolchains/gpus/crosstool/CROSSTOOL
@@ -53,6 +53,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-lstdc++"
       }
@@ -79,6 +80,7 @@ toolchain {
     name: "alwayslink"
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       action: "c++-link-executable"
       flag_group {
         flag: "-Wl,-no-as-needed"
@@ -120,6 +122,7 @@ toolchain {
     }
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-Wl,-z,relro,-z,now"
       }
@@ -141,8 +144,8 @@ toolchain {
       flag_group {
         # All warnings are enabled. Maybe enable -Werror as well?
         flag: "-Wall"
-        # TODO(ngiraldo): Some parts of the codebase set -Werror and hit this 
-        # warning, so switch it off for now.
+        # Some parts of the codebase set -Werror and hit this warning, so
+        # switch it off for now.
         flag: "-Wno-invalid-partial-specialization"
       }
     }
@@ -165,6 +168,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         # Stamp the binary with a unique identifier.
         flag: "-Wl,--build-id=md5"
@@ -180,6 +184,7 @@ toolchain {
       action: "c++-compile"
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag:"-no-canonical-prefixes"
       }
@@ -203,6 +208,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-B/usr/bin/"
       }
@@ -250,6 +256,7 @@ toolchain {
     }
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       action: "c++-link-executable"
       flag_group {
         flag: "-Wl,--gc-sections"
@@ -296,7 +303,7 @@ toolchain {
   cxx_builtin_include_directory: "/usr/include/x86_64-linux-gnu/c++/5.4.0"
   cxx_builtin_include_directory: "/usr/include/c++/5.4.0/backward"
   cxx_builtin_include_directory: "/usr/local/include"
-  cxx_builtin_include_directory: "/usr/local/lib/clang/6.0.0/include"
+  cxx_builtin_include_directory: "/usr/local/lib/clang/7.0.0/include"
   cxx_builtin_include_directory: "/usr/include/x86_64-linux-gnu"
   cxx_builtin_include_directory: "/usr/include"
 }
diff --git a/third_party/toolchains/gpus/cuda/BUILD b/third_party/toolchains/gpus/cuda/BUILD
index 39136de99c901d6d6a9dafefe3163972511ec122..4cb83809383afa52d5a1d98777f8e5bb2d266286 100644
--- a/third_party/toolchains/gpus/cuda/BUILD
+++ b/third_party/toolchains/gpus/cuda/BUILD
@@ -51,6 +51,7 @@ cc_library(
     includes = [
         ".",
         "cuda/include",
+        "cuda/include/crt",
     ],
     visibility = ["//visibility:public"],
 )
@@ -84,8 +85,8 @@ cc_library(
 
 cc_library(
     name = "cudart",
-    srcs = ["cuda/lib/libcudart.so.8.0"],
-    data = ["cuda/lib/libcudart.so.8.0"],
+    srcs = ["cuda/lib/libcudart.so.9.0"],
+    data = ["cuda/lib/libcudart.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -96,8 +97,8 @@ cc_library(
 
 cc_library(
     name = "cublas",
-    srcs = ["cuda/lib/libcublas.so.8.0"],
-    data = ["cuda/lib/libcublas.so.8.0"],
+    srcs = ["cuda/lib/libcublas.so.9.0"],
+    data = ["cuda/lib/libcublas.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -108,8 +109,8 @@ cc_library(
 
 cc_library(
     name = "cusolver",
-    srcs = ["cuda/lib/libcusolver.so.8.0"],
-    data = ["cuda/lib/libcusolver.so.8.0"],
+    srcs = ["cuda/lib/libcusolver.so.9.0"],
+    data = ["cuda/lib/libcusolver.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -121,8 +122,8 @@ cc_library(
 
 cc_library(
     name = "cudnn",
-    srcs = ["cuda/lib/libcudnn.so.6"],
-    data = ["cuda/lib/libcudnn.so.6"],
+    srcs = ["cuda/lib/libcudnn.so.7"],
+    data = ["cuda/lib/libcudnn.so.7"],
     includes = [
         ".",
         "cuda/include",
@@ -133,8 +134,8 @@ cc_library(
 
 cc_library(
     name = "cufft",
-    srcs = ["cuda/lib/libcufft.so.8.0"],
-    data = ["cuda/lib/libcufft.so.8.0"],
+    srcs = ["cuda/lib/libcufft.so.9.0"],
+    data = ["cuda/lib/libcufft.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -145,8 +146,8 @@ cc_library(
 
 cc_library(
     name = "curand",
-    srcs = ["cuda/lib/libcurand.so.8.0"],
-    data = ["cuda/lib/libcurand.so.8.0"],
+    srcs = ["cuda/lib/libcurand.so.9.0"],
+    data = ["cuda/lib/libcurand.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -183,7 +184,7 @@ cc_library(
 
 cc_library(
     name = "cupti_dsos",
-    data = ["cuda/lib/libcupti.so.8.0"],
+    data = ["cuda/lib/libcupti.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -200,1063 +201,990 @@ cc_library(
 genrule(
     name = "cuda-include",
     outs = [
-        "cuda/include/math_functions.hpp",
-        "cuda/include/cufft.h",
-        "cuda/include/nvgraph.h",
-        "cuda/include/curand_normal.h",
-        "cuda/include/curand_uniform.h",
-        "cuda/include/nppi_data_exchange_and_initialization.h",
-        "cuda/include/cuda_gl_interop.h",
-        "cuda/include/nppi_compression_functions.h",
-        "cuda/include/npp.h",
+        "cuda/include/CL/cl.h",
+        "cuda/include/CL/cl.hpp",
+        "cuda/include/CL/cl_egl.h",
+        "cuda/include/CL/cl_ext.h",
+        "cuda/include/CL/cl_gl.h",
+        "cuda/include/CL/cl_gl_ext.h",
+        "cuda/include/CL/cl_platform.h",
+        "cuda/include/CL/opencl.h",
+        "cuda/include/builtin_types.h",
+        "cuda/include/channel_descriptor.h",
+        "cuda/include/common_functions.h",
+        "cuda/include/cooperative_groups.h",
+        "cuda/include/cooperative_groups_helpers.h",
+        "cuda/include/crt/common_functions.h",
+        "cuda/include/crt/device_double_functions.h",
+        "cuda/include/crt/device_double_functions.hpp",
+        "cuda/include/crt/device_functions.h",
+        "cuda/include/crt/device_functions.hpp",
+        "cuda/include/crt/func_macro.h",
+        "cuda/include/crt/host_config.h",
+        "cuda/include/crt/host_defines.h",
+        "cuda/include/crt/host_runtime.h",
+        "cuda/include/crt/math_functions.h",
+        "cuda/include/crt/math_functions.hpp",
+        "cuda/include/crt/mma.h",
+        "cuda/include/crt/mma.hpp",
+        "cuda/include/crt/nvfunctional",
+        "cuda/include/crt/sm_70_rt.h",
+        "cuda/include/crt/sm_70_rt.hpp",
+        "cuda/include/crt/storage_class.h",
+        "cuda/include/cuComplex.h",
+        "cuda/include/cublas.h",
+        "cuda/include/cublasXt.h",
+        "cuda/include/cublas_api.h",
+        "cuda/include/cublas_v2.h",
         "cuda/include/cuda.h",
-        "cuda/include/nppi_statistics_functions.h",
-        "cuda/include/vector_functions.hpp",
-        "cuda/include/sm_32_intrinsics.hpp",
-        "cuda/include/sm_32_intrinsics.h",
-        "cuda/include/curand_discrete.h",
+        "cuda/include/cudaEGL.h",
+        "cuda/include/cudaGL.h",
+        "cuda/include/cudaProfiler.h",
+        "cuda/include/cudaVDPAU.h",
+        "cuda/include/cuda_device_runtime_api.h",
+        "cuda/include/cuda_fp16.h",
+        "cuda/include/cuda_fp16.hpp",
+        "cuda/include/cuda_gl_interop.h",
+        "cuda/include/cuda_occupancy.h",
+        "cuda/include/cuda_profiler_api.h",
         "cuda/include/cuda_runtime.h",
+        "cuda/include/cuda_runtime_api.h",
+        "cuda/include/cuda_surface_types.h",
+        "cuda/include/cuda_texture_types.h",
+        "cuda/include/cuda_vdpau_interop.h",
+        "cuda/include/cudalibxt.h",
+        "cuda/include/cudnn.h",
+        "cuda/include/cufft.h",
         "cuda/include/cufftXt.h",
-        "cuda/include/sm_61_intrinsics.h",
-        "cuda/include/texture_fetch_functions.h",
+        "cuda/include/cufftw.h",
+        "cuda/include/curand.h",
+        "cuda/include/curand_discrete.h",
+        "cuda/include/curand_discrete2.h",
+        "cuda/include/curand_globals.h",
+        "cuda/include/curand_kernel.h",
+        "cuda/include/curand_lognormal.h",
         "cuda/include/curand_mrg32k3a.h",
-        "cuda/include/host_defines.h",
-        "cuda/include/common_functions.h",
-        "cuda/include/nppi_support_functions.h",
-        "cuda/include/nppi_linear_transforms.h",
-        "cuda/include/device_double_functions.hpp",
-        "cuda/include/math_constants.h",
-        "cuda/include/nvToolsExtSync.h",
-        "cuda/include/npps_initialization.h",
+        "cuda/include/curand_mtgp32.h",
+        "cuda/include/curand_mtgp32_host.h",
+        "cuda/include/curand_mtgp32_kernel.h",
+        "cuda/include/curand_mtgp32dc_p_11213.h",
+        "cuda/include/curand_normal.h",
+        "cuda/include/curand_normal_static.h",
+        "cuda/include/curand_philox4x32_x.h",
+        "cuda/include/curand_poisson.h",
+        "cuda/include/curand_precalc.h",
+        "cuda/include/curand_uniform.h",
+        "cuda/include/cusolverDn.h",
+        "cuda/include/cusolverRf.h",
+        "cuda/include/cusolverSp.h",
         "cuda/include/cusolverSp_LOWLEVEL_PREVIEW.h",
-        "cuda/include/texture_indirect_functions.hpp",
-        "cuda/include/cudaProfiler.h",
-        "cuda/include/npps_filtering_functions.h",
+        "cuda/include/cusolver_common.h",
+        "cuda/include/cusparse.h",
         "cuda/include/cusparse_v2.h",
-        "cuda/include/nppi.h",
-        "cuda/include/surface_indirect_functions.h",
-        "cuda/include/sm_30_intrinsics.h",
+        "cuda/include/device_atomic_functions.h",
+        "cuda/include/device_atomic_functions.hpp",
         "cuda/include/device_double_functions.h",
-        "cuda/include/sm_35_intrinsics.h",
-        "cuda/include/cusolverSp.h",
-        "cuda/include/library_types.h",
-        "cuda/include/surface_indirect_functions.hpp",
-        "cuda/include/cudalibxt.h",
-        "cuda/include/channel_descriptor.h",
+        "cuda/include/device_double_functions.hpp",
+        "cuda/include/device_functions.h",
+        "cuda/include/device_functions.hpp",
         "cuda/include/device_functions_decls.h",
-        "cuda/include/curand_kernel.h",
-        "cuda/include/curand_mtgp32_host.h",
-        "cuda/include/nvToolsExtCuda.h",
-        "cuda/include/nvToolsExt.h",
-        "cuda/include/cuComplex.h",
-        "cuda/include/sm_32_atomic_functions.h",
-        "cuda/include/texture_indirect_functions.h",
-        "cuda/include/sm_32_atomic_functions.hpp",
-        "cuda/include/sm_20_intrinsics.hpp",
         "cuda/include/device_launch_parameters.h",
-        "cuda/include/curand_mtgp32.h",
-        "cuda/include/texture_fetch_functions.hpp",
-        "cuda/include/cuda_occupancy.h",
-        "cuda/include/CL/opencl.h",
-        "cuda/include/CL/cl_platform.h",
-        "cuda/include/CL/cl_egl.h",
-        "cuda/include/CL/cl_gl.h",
-        "cuda/include/CL/cl.h",
-        "cuda/include/CL/cl_gl_ext.h",
-        "cuda/include/CL/cl_ext.h",
-        "cuda/include/CL/cl.hpp",
+        "cuda/include/device_types.h",
+        "cuda/include/driver_functions.h",
+        "cuda/include/driver_types.h",
+        "cuda/include/dynlink_cuda.h",
+        "cuda/include/dynlink_cuda_cuda.h",
+        "cuda/include/dynlink_cuviddec.h",
+        "cuda/include/dynlink_nvcuvid.h",
+        "cuda/include/fatBinaryCtl.h",
+        "cuda/include/fatbinary.h",
         "cuda/include/host_config.h",
-        "cuda/include/cuda_surface_types.h",
+        "cuda/include/host_defines.h",
+        "cuda/include/library_types.h",
+        "cuda/include/math_constants.h",
         "cuda/include/math_functions.h",
+        "cuda/include/math_functions.hpp",
+        "cuda/include/math_functions_dbl_ptx3.h",
+        "cuda/include/math_functions_dbl_ptx3.hpp",
+        "cuda/include/mma.h",
+        "cuda/include/npp.h",
+        "cuda/include/nppcore.h",
+        "cuda/include/nppdefs.h",
+        "cuda/include/nppi.h",
+        "cuda/include/nppi_arithmetic_and_logical_operations.h",
+        "cuda/include/nppi_color_conversion.h",
+        "cuda/include/nppi_compression_functions.h",
+        "cuda/include/nppi_computer_vision.h",
+        "cuda/include/nppi_data_exchange_and_initialization.h",
+        "cuda/include/nppi_filtering_functions.h",
+        "cuda/include/nppi_geometry_transforms.h",
+        "cuda/include/nppi_linear_transforms.h",
+        "cuda/include/nppi_morphological_operations.h",
+        "cuda/include/nppi_statistics_functions.h",
+        "cuda/include/nppi_support_functions.h",
+        "cuda/include/nppi_threshold_and_compare_operations.h",
+        "cuda/include/npps.h",
+        "cuda/include/npps_arithmetic_and_logical_operations.h",
+        "cuda/include/npps_conversion_functions.h",
+        "cuda/include/npps_filtering_functions.h",
+        "cuda/include/npps_initialization.h",
+        "cuda/include/npps_statistics_functions.h",
+        "cuda/include/npps_support_functions.h",
+        "cuda/include/nppversion.h",
+        "cuda/include/nvToolsExt.h",
+        "cuda/include/nvToolsExtCuda.h",
+        "cuda/include/nvToolsExtCudaRt.h",
         "cuda/include/nvToolsExtMeta.h",
+        "cuda/include/nvToolsExtSync.h",
+        "cuda/include/nvblas.h",
+        "cuda/include/nvfunctional",
+        "cuda/include/nvgraph.h",
+        "cuda/include/nvml.h",
+        "cuda/include/nvrtc.h",
+        "cuda/include/sm_20_atomic_functions.h",
         "cuda/include/sm_20_atomic_functions.hpp",
-        "cuda/include/device_functions.h",
-        "cuda/include/device_types.h",
-        "cuda/include/npps_conversion_functions.h",
-        "cuda/include/curand_precalc.h",
-        "cuda/include/cusolverRf.h",
+        "cuda/include/sm_20_intrinsics.h",
+        "cuda/include/sm_20_intrinsics.hpp",
+        "cuda/include/sm_30_intrinsics.h",
+        "cuda/include/sm_30_intrinsics.hpp",
+        "cuda/include/sm_32_atomic_functions.h",
+        "cuda/include/sm_32_atomic_functions.hpp",
+        "cuda/include/sm_32_intrinsics.h",
+        "cuda/include/sm_32_intrinsics.hpp",
+        "cuda/include/sm_35_atomic_functions.h",
+        "cuda/include/sm_35_intrinsics.h",
+        "cuda/include/sm_60_atomic_functions.h",
         "cuda/include/sm_60_atomic_functions.hpp",
-        "cuda/include/cuviddec.h",
-        "cuda/include/curand_discrete2.h",
-        "cuda/include/device_functions.hpp",
-        "cuda/include/thrust/transform_scan.h",
-        "cuda/include/thrust/system_error.h",
-        "cuda/include/thrust/device_malloc.h",
-        "cuda/include/thrust/partition.h",
-        "cuda/include/thrust/unique.h",
-        "cuda/include/thrust/device_delete.h",
-        "cuda/include/thrust/execution_policy.h",
+        "cuda/include/sm_61_intrinsics.h",
+        "cuda/include/sm_61_intrinsics.hpp",
+        "cuda/include/sobol_direction_vectors.h",
+        "cuda/include/surface_functions.h",
+        "cuda/include/surface_functions.hpp",
+        "cuda/include/surface_indirect_functions.h",
+        "cuda/include/surface_indirect_functions.hpp",
+        "cuda/include/surface_types.h",
+        "cuda/include/texture_fetch_functions.h",
+        "cuda/include/texture_fetch_functions.hpp",
+        "cuda/include/texture_indirect_functions.h",
+        "cuda/include/texture_indirect_functions.hpp",
+        "cuda/include/texture_types.h",
         "cuda/include/thrust/adjacent_difference.h",
-        "cuda/include/thrust/sequence.h",
-        "cuda/include/thrust/merge.h",
-        "cuda/include/thrust/device_new.h",
-        "cuda/include/thrust/transform_reduce.h",
-        "cuda/include/thrust/device_vector.h",
-        "cuda/include/thrust/gather.h",
-        "cuda/include/thrust/sort.h",
-        "cuda/include/thrust/scan.h",
-        "cuda/include/thrust/detail/temporary_array.h",
-        "cuda/include/thrust/detail/util/align.h",
-        "cuda/include/thrust/detail/util/blocking.h",
-        "cuda/include/thrust/detail/transform.inl",
-        "cuda/include/thrust/detail/device_vector.inl",
+        "cuda/include/thrust/advance.h",
+        "cuda/include/thrust/binary_search.h",
+        "cuda/include/thrust/complex.h",
+        "cuda/include/thrust/copy.h",
+        "cuda/include/thrust/count.h",
+        "cuda/include/thrust/detail/adjacent_difference.inl",
+        "cuda/include/thrust/detail/advance.inl",
+        "cuda/include/thrust/detail/allocator/allocator_traits.h",
+        "cuda/include/thrust/detail/allocator/allocator_traits.inl",
+        "cuda/include/thrust/detail/allocator/copy_construct_range.h",
+        "cuda/include/thrust/detail/allocator/copy_construct_range.inl",
+        "cuda/include/thrust/detail/allocator/default_construct_range.h",
+        "cuda/include/thrust/detail/allocator/default_construct_range.inl",
+        "cuda/include/thrust/detail/allocator/destroy_range.h",
+        "cuda/include/thrust/detail/allocator/destroy_range.inl",
+        "cuda/include/thrust/detail/allocator/fill_construct_range.h",
+        "cuda/include/thrust/detail/allocator/fill_construct_range.inl",
+        "cuda/include/thrust/detail/allocator/malloc_allocator.h",
+        "cuda/include/thrust/detail/allocator/malloc_allocator.inl",
+        "cuda/include/thrust/detail/allocator/no_throw_allocator.h",
+        "cuda/include/thrust/detail/allocator/tagged_allocator.h",
+        "cuda/include/thrust/detail/allocator/tagged_allocator.inl",
+        "cuda/include/thrust/detail/allocator/temporary_allocator.h",
+        "cuda/include/thrust/detail/allocator/temporary_allocator.inl",
         "cuda/include/thrust/detail/binary_search.inl",
-        "cuda/include/thrust/detail/overlapped_copy.h",
-        "cuda/include/thrust/detail/vector_base.inl",
-        "cuda/include/thrust/detail/device_reference.inl",
-        "cuda/include/thrust/detail/functional/actor.h",
-        "cuda/include/thrust/detail/functional/value.h",
-        "cuda/include/thrust/detail/functional/operators.h",
-        "cuda/include/thrust/detail/functional/operators/logical_operators.h",
-        "cuda/include/thrust/detail/functional/operators/relational_operators.h",
-        "cuda/include/thrust/detail/functional/operators/assignment_operator.h",
-        "cuda/include/thrust/detail/functional/operators/bitwise_operators.h",
-        "cuda/include/thrust/detail/functional/operators/operator_adaptors.h",
-        "cuda/include/thrust/detail/functional/operators/arithmetic_operators.h",
-        "cuda/include/thrust/detail/functional/operators/compound_assignment_operators.h",
-        "cuda/include/thrust/detail/functional/argument.h",
-        "cuda/include/thrust/detail/functional/placeholder.h",
-        "cuda/include/thrust/detail/functional/actor.inl",
-        "cuda/include/thrust/detail/functional/composite.h",
-        "cuda/include/thrust/detail/static_map.h",
-        "cuda/include/thrust/detail/type_traits/has_nested_type.h",
-        "cuda/include/thrust/detail/type_traits/is_call_possible.h",
-        "cuda/include/thrust/detail/type_traits/function_traits.h",
-        "cuda/include/thrust/detail/type_traits/pointer_traits.h",
-        "cuda/include/thrust/detail/type_traits/has_member_function.h",
-        "cuda/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h",
-        "cuda/include/thrust/detail/type_traits/minimum_type.h",
-        "cuda/include/thrust/detail/type_traits/has_trivial_assign.h",
-        "cuda/include/thrust/detail/type_traits/is_metafunction_defined.h",
-        "cuda/include/thrust/detail/type_traits/iterator/is_discard_iterator.h",
-        "cuda/include/thrust/detail/type_traits/iterator/is_output_iterator.h",
-        "cuda/include/thrust/detail/type_traits/result_of_adaptable_function.h",
-        "cuda/include/thrust/detail/reference.h",
-        "cuda/include/thrust/detail/inner_product.inl",
-        "cuda/include/thrust/detail/use_default.h",
-        "cuda/include/thrust/detail/sequence.inl",
-        "cuda/include/thrust/detail/sort.inl",
-        "cuda/include/thrust/detail/equal.inl",
-        "cuda/include/thrust/detail/execution_policy.h",
-        "cuda/include/thrust/detail/integer_traits.h",
-        "cuda/include/thrust/detail/type_traits.h",
-        "cuda/include/thrust/detail/reverse.inl",
-        "cuda/include/thrust/detail/tabulate.inl",
-        "cuda/include/thrust/detail/unique.inl",
-        "cuda/include/thrust/detail/scatter.inl",
-        "cuda/include/thrust/detail/set_operations.inl",
-        "cuda/include/thrust/detail/device_malloc.inl",
-        "cuda/include/thrust/detail/copy_if.inl",
-        "cuda/include/thrust/detail/fill.inl",
-        "cuda/include/thrust/detail/temporary_array.inl",
-        "cuda/include/thrust/detail/transform_scan.inl",
-        "cuda/include/thrust/detail/minmax.h",
-        "cuda/include/thrust/detail/swap.inl",
-        "cuda/include/thrust/detail/pointer.inl",
-        "cuda/include/thrust/detail/transform_reduce.inl",
-        "cuda/include/thrust/detail/config.h",
-        "cuda/include/thrust/detail/distance.inl",
-        "cuda/include/thrust/detail/pair.inl",
-        "cuda/include/thrust/detail/allocator/temporary_allocator.h",
-        "cuda/include/thrust/detail/allocator/tagged_allocator.h",
-        "cuda/include/thrust/detail/allocator/destroy_range.inl",
-        "cuda/include/thrust/detail/allocator/destroy_range.h",
-        "cuda/include/thrust/detail/allocator/no_throw_allocator.h",
-        "cuda/include/thrust/detail/allocator/default_construct_range.inl",
-        "cuda/include/thrust/detail/allocator/fill_construct_range.inl",
-        "cuda/include/thrust/detail/allocator/tagged_allocator.inl",
-        "cuda/include/thrust/detail/allocator/malloc_allocator.h",
-        "cuda/include/thrust/detail/allocator/allocator_traits.h",
-        "cuda/include/thrust/detail/allocator/copy_construct_range.h",
-        "cuda/include/thrust/detail/allocator/allocator_traits.inl",
-        "cuda/include/thrust/detail/allocator/default_construct_range.h",
-        "cuda/include/thrust/detail/allocator/copy_construct_range.inl",
-        "cuda/include/thrust/detail/allocator/malloc_allocator.inl",
-        "cuda/include/thrust/detail/allocator/temporary_allocator.inl",
-        "cuda/include/thrust/detail/allocator/fill_construct_range.h",
-        "cuda/include/thrust/detail/temporary_buffer.h",
-        "cuda/include/thrust/detail/reduce.inl",
-        "cuda/include/thrust/detail/device_new.inl",
-        "cuda/include/thrust/detail/pointer.h",
-        "cuda/include/thrust/detail/for_each.inl",
-        "cuda/include/thrust/detail/generate.inl",
-        "cuda/include/thrust/detail/dispatch/is_trivial_copy.h",
-        "cuda/include/thrust/detail/adjacent_difference.inl",
-        "cuda/include/thrust/detail/tuple_meta_transform.h",
-        "cuda/include/thrust/detail/functional.inl",
-        "cuda/include/thrust/detail/remove.inl",
-        "cuda/include/thrust/detail/tuple_transform.h",
-        "cuda/include/thrust/detail/merge.inl",
-        "cuda/include/thrust/detail/extrema.inl",
-        "cuda/include/thrust/detail/trivial_sequence.h",
-        "cuda/include/thrust/detail/vector_base.h",
-        "cuda/include/thrust/detail/count.inl",
-        "cuda/include/thrust/detail/uninitialized_copy.inl",
-        "cuda/include/thrust/detail/function.h",
-        "cuda/include/thrust/detail/swap_ranges.inl",
-        "cuda/include/thrust/detail/device_delete.inl",
-        "cuda/include/thrust/detail/static_assert.h",
-        "cuda/include/thrust/detail/logical.inl",
-        "cuda/include/thrust/detail/seq.h",
-        "cuda/include/thrust/detail/mpl/math.h",
-        "cuda/include/thrust/detail/mismatch.inl",
-        "cuda/include/thrust/detail/internal_functional.h",
-        "cuda/include/thrust/detail/get_iterator_value.h",
-        "cuda/include/thrust/detail/copy.inl",
-        "cuda/include/thrust/detail/copy.h",
+        "cuda/include/thrust/detail/complex/arithmetic.h",
+        "cuda/include/thrust/detail/complex/c99math.h",
+        "cuda/include/thrust/detail/complex/catrig.h",
         "cuda/include/thrust/detail/complex/catrigf.h",
-        "cuda/include/thrust/detail/complex/cpowf.h",
-        "cuda/include/thrust/detail/complex/csqrtf.h",
+        "cuda/include/thrust/detail/complex/ccosh.h",
         "cuda/include/thrust/detail/complex/ccoshf.h",
-        "cuda/include/thrust/detail/complex/csinhf.h",
+        "cuda/include/thrust/detail/complex/cexp.h",
+        "cuda/include/thrust/detail/complex/cexpf.h",
+        "cuda/include/thrust/detail/complex/clog.h",
         "cuda/include/thrust/detail/complex/clogf.h",
-        "cuda/include/thrust/detail/complex/ccosh.h",
-        "cuda/include/thrust/detail/complex/arithmetic.h",
-        "cuda/include/thrust/detail/complex/csqrt.h",
-        "cuda/include/thrust/detail/complex/cpow.h",
         "cuda/include/thrust/detail/complex/complex.inl",
-        "cuda/include/thrust/detail/complex/math_private.h",
-        "cuda/include/thrust/detail/complex/c99math.h",
+        "cuda/include/thrust/detail/complex/cpow.h",
+        "cuda/include/thrust/detail/complex/cpowf.h",
         "cuda/include/thrust/detail/complex/cproj.h",
-        "cuda/include/thrust/detail/complex/catrig.h",
-        "cuda/include/thrust/detail/complex/ctanhf.h",
-        "cuda/include/thrust/detail/complex/cexpf.h",
         "cuda/include/thrust/detail/complex/csinh.h",
-        "cuda/include/thrust/detail/complex/stream.h",
+        "cuda/include/thrust/detail/complex/csinhf.h",
+        "cuda/include/thrust/detail/complex/csqrt.h",
+        "cuda/include/thrust/detail/complex/csqrtf.h",
         "cuda/include/thrust/detail/complex/ctanh.h",
-        "cuda/include/thrust/detail/complex/cexp.h",
-        "cuda/include/thrust/detail/complex/clog.h",
-        "cuda/include/thrust/detail/range/head_flags.h",
-        "cuda/include/thrust/detail/range/tail_flags.h",
-        "cuda/include/thrust/detail/execute_with_allocator.h",
-        "cuda/include/thrust/detail/integer_math.h",
-        "cuda/include/thrust/detail/swap.h",
-        "cuda/include/thrust/detail/uninitialized_fill.inl",
-        "cuda/include/thrust/detail/scan.inl",
-        "cuda/include/thrust/detail/gather.inl",
-        "cuda/include/thrust/detail/reference_forward_declaration.h",
-        "cuda/include/thrust/detail/numeric_traits.h",
-        "cuda/include/thrust/detail/reference.inl",
-        "cuda/include/thrust/detail/cstdint.h",
-        "cuda/include/thrust/detail/device_free.inl",
-        "cuda/include/thrust/detail/copy_if.h",
-        "cuda/include/thrust/detail/partition.inl",
-        "cuda/include/thrust/detail/find.inl",
-        "cuda/include/thrust/detail/config/forceinline.h",
-        "cuda/include/thrust/detail/config/debug.h",
-        "cuda/include/thrust/detail/config/config.h",
-        "cuda/include/thrust/detail/config/host_device.h",
-        "cuda/include/thrust/detail/config/host_system.h",
+        "cuda/include/thrust/detail/complex/ctanhf.h",
+        "cuda/include/thrust/detail/complex/math_private.h",
+        "cuda/include/thrust/detail/complex/stream.h",
+        "cuda/include/thrust/detail/config.h",
         "cuda/include/thrust/detail/config/compiler.h",
-        "cuda/include/thrust/detail/config/device_system.h",
         "cuda/include/thrust/detail/config/compiler_fence.h",
+        "cuda/include/thrust/detail/config/config.h",
+        "cuda/include/thrust/detail/config/debug.h",
+        "cuda/include/thrust/detail/config/device_system.h",
         "cuda/include/thrust/detail/config/exec_check_disable.h",
-        "cuda/include/thrust/detail/config/simple_defines.h",
+        "cuda/include/thrust/detail/config/forceinline.h",
         "cuda/include/thrust/detail/config/global_workarounds.h",
-        "cuda/include/thrust/detail/replace.inl",
+        "cuda/include/thrust/detail/config/host_device.h",
+        "cuda/include/thrust/detail/config/host_system.h",
+        "cuda/include/thrust/detail/config/simple_defines.h",
+        "cuda/include/thrust/detail/contiguous_storage.h",
+        "cuda/include/thrust/detail/contiguous_storage.inl",
+        "cuda/include/thrust/detail/copy.h",
+        "cuda/include/thrust/detail/copy.inl",
+        "cuda/include/thrust/detail/copy_if.h",
+        "cuda/include/thrust/detail/copy_if.inl",
+        "cuda/include/thrust/detail/count.inl",
+        "cuda/include/thrust/detail/cstdint.h",
+        "cuda/include/thrust/detail/device_delete.inl",
+        "cuda/include/thrust/detail/device_free.inl",
+        "cuda/include/thrust/detail/device_malloc.inl",
+        "cuda/include/thrust/detail/device_new.inl",
         "cuda/include/thrust/detail/device_ptr.inl",
-        "cuda/include/thrust/detail/tuple.inl",
-        "cuda/include/thrust/detail/malloc_and_free.h",
+        "cuda/include/thrust/detail/device_reference.inl",
+        "cuda/include/thrust/detail/device_vector.inl",
+        "cuda/include/thrust/detail/dispatch/is_trivial_copy.h",
+        "cuda/include/thrust/detail/distance.inl",
+        "cuda/include/thrust/detail/equal.inl",
+        "cuda/include/thrust/detail/execute_with_allocator.h",
+        "cuda/include/thrust/detail/execution_policy.h",
+        "cuda/include/thrust/detail/extrema.inl",
+        "cuda/include/thrust/detail/fill.inl",
+        "cuda/include/thrust/detail/find.inl",
+        "cuda/include/thrust/detail/for_each.inl",
+        "cuda/include/thrust/detail/function.h",
+        "cuda/include/thrust/detail/functional.inl",
+        "cuda/include/thrust/detail/functional/actor.h",
+        "cuda/include/thrust/detail/functional/actor.inl",
+        "cuda/include/thrust/detail/functional/argument.h",
+        "cuda/include/thrust/detail/functional/composite.h",
+        "cuda/include/thrust/detail/functional/operators.h",
+        "cuda/include/thrust/detail/functional/operators/arithmetic_operators.h",
+        "cuda/include/thrust/detail/functional/operators/assignment_operator.h",
+        "cuda/include/thrust/detail/functional/operators/bitwise_operators.h",
+        "cuda/include/thrust/detail/functional/operators/compound_assignment_operators.h",
+        "cuda/include/thrust/detail/functional/operators/logical_operators.h",
+        "cuda/include/thrust/detail/functional/operators/operator_adaptors.h",
+        "cuda/include/thrust/detail/functional/operators/relational_operators.h",
+        "cuda/include/thrust/detail/functional/placeholder.h",
+        "cuda/include/thrust/detail/functional/value.h",
+        "cuda/include/thrust/detail/gather.inl",
+        "cuda/include/thrust/detail/generate.inl",
+        "cuda/include/thrust/detail/get_iterator_value.h",
         "cuda/include/thrust/detail/host_vector.inl",
+        "cuda/include/thrust/detail/inner_product.inl",
+        "cuda/include/thrust/detail/integer_math.h",
+        "cuda/include/thrust/detail/integer_traits.h",
+        "cuda/include/thrust/detail/internal_functional.h",
+        "cuda/include/thrust/detail/logical.inl",
+        "cuda/include/thrust/detail/malloc_and_free.h",
+        "cuda/include/thrust/detail/merge.inl",
+        "cuda/include/thrust/detail/minmax.h",
+        "cuda/include/thrust/detail/mismatch.inl",
+        "cuda/include/thrust/detail/mpl/math.h",
+        "cuda/include/thrust/detail/numeric_traits.h",
+        "cuda/include/thrust/detail/overlapped_copy.h",
+        "cuda/include/thrust/detail/pair.inl",
+        "cuda/include/thrust/detail/partition.inl",
+        "cuda/include/thrust/detail/pointer.h",
+        "cuda/include/thrust/detail/pointer.inl",
+        "cuda/include/thrust/detail/range/head_flags.h",
+        "cuda/include/thrust/detail/range/tail_flags.h",
         "cuda/include/thrust/detail/raw_pointer_cast.h",
-        "cuda/include/thrust/detail/advance.inl",
-        "cuda/include/thrust/detail/contiguous_storage.h",
         "cuda/include/thrust/detail/raw_reference_cast.h",
-        "cuda/include/thrust/detail/contiguous_storage.inl",
-        "cuda/include/thrust/reverse.h",
-        "cuda/include/thrust/device_malloc_allocator.h",
-        "cuda/include/thrust/scatter.h",
-        "cuda/include/thrust/pair.h",
-        "cuda/include/thrust/advance.h",
-        "cuda/include/thrust/find.h",
-        "cuda/include/thrust/device_ptr.h",
-        "cuda/include/thrust/generate.h",
-        "cuda/include/thrust/uninitialized_fill.h",
-        "cuda/include/thrust/system/system_error.h",
-        "cuda/include/thrust/system/detail/bad_alloc.h",
-        "cuda/include/thrust/system/detail/adl/transform_scan.h",
-        "cuda/include/thrust/system/detail/adl/unique_by_key.h",
-        "cuda/include/thrust/system/detail/adl/partition.h",
-        "cuda/include/thrust/system/detail/adl/unique.h",
-        "cuda/include/thrust/system/detail/adl/adjacent_difference.h",
-        "cuda/include/thrust/system/detail/adl/sequence.h",
-        "cuda/include/thrust/system/detail/adl/merge.h",
-        "cuda/include/thrust/system/detail/adl/transform_reduce.h",
-        "cuda/include/thrust/system/detail/adl/gather.h",
-        "cuda/include/thrust/system/detail/adl/sort.h",
-        "cuda/include/thrust/system/detail/adl/scan.h",
-        "cuda/include/thrust/system/detail/adl/temporary_buffer.h",
-        "cuda/include/thrust/system/detail/adl/scan_by_key.h",
-        "cuda/include/thrust/system/detail/adl/reverse.h",
-        "cuda/include/thrust/system/detail/adl/assign_value.h",
-        "cuda/include/thrust/system/detail/adl/scatter.h",
-        "cuda/include/thrust/system/detail/adl/find.h",
-        "cuda/include/thrust/system/detail/adl/generate.h",
-        "cuda/include/thrust/system/detail/adl/uninitialized_fill.h",
-        "cuda/include/thrust/system/detail/adl/remove.h",
-        "cuda/include/thrust/system/detail/adl/tabulate.h",
-        "cuda/include/thrust/system/detail/adl/for_each.h",
-        "cuda/include/thrust/system/detail/adl/reduce_by_key.h",
-        "cuda/include/thrust/system/detail/adl/reduce.h",
-        "cuda/include/thrust/system/detail/adl/equal.h",
-        "cuda/include/thrust/system/detail/adl/copy.h",
-        "cuda/include/thrust/system/detail/adl/swap_ranges.h",
-        "cuda/include/thrust/system/detail/adl/uninitialized_copy.h",
-        "cuda/include/thrust/system/detail/adl/binary_search.h",
-        "cuda/include/thrust/system/detail/adl/set_operations.h",
-        "cuda/include/thrust/system/detail/adl/mismatch.h",
-        "cuda/include/thrust/system/detail/adl/extrema.h",
-        "cuda/include/thrust/system/detail/adl/count.h",
-        "cuda/include/thrust/system/detail/adl/replace.h",
+        "cuda/include/thrust/detail/reduce.inl",
+        "cuda/include/thrust/detail/reference.h",
+        "cuda/include/thrust/detail/reference.inl",
+        "cuda/include/thrust/detail/reference_forward_declaration.h",
+        "cuda/include/thrust/detail/remove.inl",
+        "cuda/include/thrust/detail/replace.inl",
+        "cuda/include/thrust/detail/reverse.inl",
+        "cuda/include/thrust/detail/scan.inl",
+        "cuda/include/thrust/detail/scatter.inl",
+        "cuda/include/thrust/detail/seq.h",
+        "cuda/include/thrust/detail/sequence.inl",
+        "cuda/include/thrust/detail/set_operations.inl",
+        "cuda/include/thrust/detail/sort.inl",
+        "cuda/include/thrust/detail/static_assert.h",
+        "cuda/include/thrust/detail/static_map.h",
+        "cuda/include/thrust/detail/swap.h",
+        "cuda/include/thrust/detail/swap.inl",
+        "cuda/include/thrust/detail/swap_ranges.inl",
+        "cuda/include/thrust/detail/tabulate.inl",
+        "cuda/include/thrust/detail/temporary_array.h",
+        "cuda/include/thrust/detail/temporary_array.inl",
+        "cuda/include/thrust/detail/temporary_buffer.h",
+        "cuda/include/thrust/detail/transform.inl",
+        "cuda/include/thrust/detail/transform_reduce.inl",
+        "cuda/include/thrust/detail/transform_scan.inl",
+        "cuda/include/thrust/detail/trivial_sequence.h",
+        "cuda/include/thrust/detail/tuple.inl",
+        "cuda/include/thrust/detail/tuple_meta_transform.h",
+        "cuda/include/thrust/detail/tuple_transform.h",
+        "cuda/include/thrust/detail/type_traits.h",
+        "cuda/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h",
+        "cuda/include/thrust/detail/type_traits/function_traits.h",
+        "cuda/include/thrust/detail/type_traits/has_member_function.h",
+        "cuda/include/thrust/detail/type_traits/has_nested_type.h",
+        "cuda/include/thrust/detail/type_traits/has_trivial_assign.h",
+        "cuda/include/thrust/detail/type_traits/is_call_possible.h",
+        "cuda/include/thrust/detail/type_traits/is_metafunction_defined.h",
+        "cuda/include/thrust/detail/type_traits/iterator/is_discard_iterator.h",
+        "cuda/include/thrust/detail/type_traits/iterator/is_output_iterator.h",
+        "cuda/include/thrust/detail/type_traits/minimum_type.h",
+        "cuda/include/thrust/detail/type_traits/pointer_traits.h",
+        "cuda/include/thrust/detail/type_traits/result_of_adaptable_function.h",
+        "cuda/include/thrust/detail/uninitialized_copy.inl",
+        "cuda/include/thrust/detail/uninitialized_fill.inl",
+        "cuda/include/thrust/detail/unique.inl",
+        "cuda/include/thrust/detail/use_default.h",
+        "cuda/include/thrust/detail/util/align.h",
+        "cuda/include/thrust/detail/util/blocking.h",
+        "cuda/include/thrust/detail/vector_base.h",
+        "cuda/include/thrust/detail/vector_base.inl",
+        "cuda/include/thrust/device_allocator.h",
+        "cuda/include/thrust/device_delete.h",
+        "cuda/include/thrust/device_free.h",
+        "cuda/include/thrust/device_malloc.h",
+        "cuda/include/thrust/device_malloc_allocator.h",
+        "cuda/include/thrust/device_new.h",
+        "cuda/include/thrust/device_new_allocator.h",
+        "cuda/include/thrust/device_ptr.h",
+        "cuda/include/thrust/device_reference.h",
+        "cuda/include/thrust/device_vector.h",
+        "cuda/include/thrust/distance.h",
+        "cuda/include/thrust/equal.h",
+        "cuda/include/thrust/execution_policy.h",
+        "cuda/include/thrust/extrema.h",
+        "cuda/include/thrust/fill.h",
+        "cuda/include/thrust/find.h",
+        "cuda/include/thrust/for_each.h",
+        "cuda/include/thrust/functional.h",
+        "cuda/include/thrust/gather.h",
+        "cuda/include/thrust/generate.h",
+        "cuda/include/thrust/host_vector.h",
+        "cuda/include/thrust/inner_product.h",
+        "cuda/include/thrust/iterator/constant_iterator.h",
+        "cuda/include/thrust/iterator/counting_iterator.h",
+        "cuda/include/thrust/iterator/detail/any_assign.h",
+        "cuda/include/thrust/iterator/detail/any_system_tag.h",
+        "cuda/include/thrust/iterator/detail/constant_iterator_base.h",
+        "cuda/include/thrust/iterator/detail/counting_iterator.inl",
+        "cuda/include/thrust/iterator/detail/device_system_tag.h",
+        "cuda/include/thrust/iterator/detail/discard_iterator_base.h",
+        "cuda/include/thrust/iterator/detail/distance_from_result.h",
+        "cuda/include/thrust/iterator/detail/host_system_tag.h",
+        "cuda/include/thrust/iterator/detail/is_iterator_category.h",
+        "cuda/include/thrust/iterator/detail/is_trivial_iterator.h",
+        "cuda/include/thrust/iterator/detail/iterator_adaptor_base.h",
+        "cuda/include/thrust/iterator/detail/iterator_category_to_system.h",
+        "cuda/include/thrust/iterator/detail/iterator_category_to_traversal.h",
+        "cuda/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h",
+        "cuda/include/thrust/iterator/detail/iterator_facade_category.h",
+        "cuda/include/thrust/iterator/detail/iterator_traits.inl",
+        "cuda/include/thrust/iterator/detail/iterator_traversal_tags.h",
+        "cuda/include/thrust/iterator/detail/join_iterator.h",
+        "cuda/include/thrust/iterator/detail/minimum_category.h",
+        "cuda/include/thrust/iterator/detail/minimum_system.h",
+        "cuda/include/thrust/iterator/detail/normal_iterator.h",
+        "cuda/include/thrust/iterator/detail/permutation_iterator_base.h",
+        "cuda/include/thrust/iterator/detail/retag.h",
+        "cuda/include/thrust/iterator/detail/reverse_iterator.inl",
+        "cuda/include/thrust/iterator/detail/reverse_iterator_base.h",
+        "cuda/include/thrust/iterator/detail/tagged_iterator.h",
+        "cuda/include/thrust/iterator/detail/transform_iterator.inl",
+        "cuda/include/thrust/iterator/detail/transform_output_iterator.inl",
+        "cuda/include/thrust/iterator/detail/tuple_of_iterator_references.h",
+        "cuda/include/thrust/iterator/detail/universal_categories.h",
+        "cuda/include/thrust/iterator/detail/zip_iterator.inl",
+        "cuda/include/thrust/iterator/detail/zip_iterator_base.h",
+        "cuda/include/thrust/iterator/discard_iterator.h",
+        "cuda/include/thrust/iterator/iterator_adaptor.h",
+        "cuda/include/thrust/iterator/iterator_categories.h",
+        "cuda/include/thrust/iterator/iterator_facade.h",
+        "cuda/include/thrust/iterator/iterator_traits.h",
+        "cuda/include/thrust/iterator/permutation_iterator.h",
+        "cuda/include/thrust/iterator/retag.h",
+        "cuda/include/thrust/iterator/reverse_iterator.h",
+        "cuda/include/thrust/iterator/transform_iterator.h",
+        "cuda/include/thrust/iterator/transform_output_iterator.h",
+        "cuda/include/thrust/iterator/zip_iterator.h",
+        "cuda/include/thrust/logical.h",
+        "cuda/include/thrust/memory.h",
+        "cuda/include/thrust/merge.h",
+        "cuda/include/thrust/mismatch.h",
+        "cuda/include/thrust/pair.h",
+        "cuda/include/thrust/partition.h",
+        "cuda/include/thrust/random.h",
+        "cuda/include/thrust/random/detail/discard_block_engine.inl",
+        "cuda/include/thrust/random/detail/linear_congruential_engine.inl",
+        "cuda/include/thrust/random/detail/linear_congruential_engine_discard.h",
+        "cuda/include/thrust/random/detail/linear_feedback_shift_engine.inl",
+        "cuda/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h",
+        "cuda/include/thrust/random/detail/mod.h",
+        "cuda/include/thrust/random/detail/normal_distribution.inl",
+        "cuda/include/thrust/random/detail/normal_distribution_base.h",
+        "cuda/include/thrust/random/detail/random_core_access.h",
+        "cuda/include/thrust/random/detail/subtract_with_carry_engine.inl",
+        "cuda/include/thrust/random/detail/uniform_int_distribution.inl",
+        "cuda/include/thrust/random/detail/uniform_real_distribution.inl",
+        "cuda/include/thrust/random/detail/xor_combine_engine.inl",
+        "cuda/include/thrust/random/detail/xor_combine_engine_max.h",
+        "cuda/include/thrust/random/discard_block_engine.h",
+        "cuda/include/thrust/random/linear_congruential_engine.h",
+        "cuda/include/thrust/random/linear_feedback_shift_engine.h",
+        "cuda/include/thrust/random/normal_distribution.h",
+        "cuda/include/thrust/random/subtract_with_carry_engine.h",
+        "cuda/include/thrust/random/uniform_int_distribution.h",
+        "cuda/include/thrust/random/uniform_real_distribution.h",
+        "cuda/include/thrust/random/xor_combine_engine.h",
+        "cuda/include/thrust/reduce.h",
+        "cuda/include/thrust/remove.h",
+        "cuda/include/thrust/replace.h",
+        "cuda/include/thrust/reverse.h",
+        "cuda/include/thrust/scan.h",
+        "cuda/include/thrust/scatter.h",
+        "cuda/include/thrust/sequence.h",
+        "cuda/include/thrust/set_operations.h",
+        "cuda/include/thrust/sort.h",
+        "cuda/include/thrust/swap.h",
+        "cuda/include/thrust/system/cpp/detail/adjacent_difference.h",
+        "cuda/include/thrust/system/cpp/detail/assign_value.h",
+        "cuda/include/thrust/system/cpp/detail/binary_search.h",
+        "cuda/include/thrust/system/cpp/detail/copy.h",
+        "cuda/include/thrust/system/cpp/detail/copy_if.h",
+        "cuda/include/thrust/system/cpp/detail/count.h",
+        "cuda/include/thrust/system/cpp/detail/equal.h",
+        "cuda/include/thrust/system/cpp/detail/execution_policy.h",
+        "cuda/include/thrust/system/cpp/detail/extrema.h",
+        "cuda/include/thrust/system/cpp/detail/fill.h",
+        "cuda/include/thrust/system/cpp/detail/find.h",
+        "cuda/include/thrust/system/cpp/detail/for_each.h",
+        "cuda/include/thrust/system/cpp/detail/gather.h",
+        "cuda/include/thrust/system/cpp/detail/generate.h",
+        "cuda/include/thrust/system/cpp/detail/get_value.h",
+        "cuda/include/thrust/system/cpp/detail/inner_product.h",
+        "cuda/include/thrust/system/cpp/detail/iter_swap.h",
+        "cuda/include/thrust/system/cpp/detail/logical.h",
+        "cuda/include/thrust/system/cpp/detail/malloc_and_free.h",
+        "cuda/include/thrust/system/cpp/detail/memory.inl",
+        "cuda/include/thrust/system/cpp/detail/merge.h",
+        "cuda/include/thrust/system/cpp/detail/mismatch.h",
+        "cuda/include/thrust/system/cpp/detail/par.h",
+        "cuda/include/thrust/system/cpp/detail/partition.h",
+        "cuda/include/thrust/system/cpp/detail/reduce.h",
+        "cuda/include/thrust/system/cpp/detail/reduce_by_key.h",
+        "cuda/include/thrust/system/cpp/detail/remove.h",
+        "cuda/include/thrust/system/cpp/detail/replace.h",
+        "cuda/include/thrust/system/cpp/detail/reverse.h",
+        "cuda/include/thrust/system/cpp/detail/scan.h",
+        "cuda/include/thrust/system/cpp/detail/scan_by_key.h",
+        "cuda/include/thrust/system/cpp/detail/scatter.h",
+        "cuda/include/thrust/system/cpp/detail/sequence.h",
+        "cuda/include/thrust/system/cpp/detail/set_operations.h",
+        "cuda/include/thrust/system/cpp/detail/sort.h",
+        "cuda/include/thrust/system/cpp/detail/swap_ranges.h",
+        "cuda/include/thrust/system/cpp/detail/tabulate.h",
+        "cuda/include/thrust/system/cpp/detail/temporary_buffer.h",
+        "cuda/include/thrust/system/cpp/detail/transform.h",
+        "cuda/include/thrust/system/cpp/detail/transform_reduce.h",
+        "cuda/include/thrust/system/cpp/detail/transform_scan.h",
+        "cuda/include/thrust/system/cpp/detail/uninitialized_copy.h",
+        "cuda/include/thrust/system/cpp/detail/uninitialized_fill.h",
+        "cuda/include/thrust/system/cpp/detail/unique.h",
+        "cuda/include/thrust/system/cpp/detail/unique_by_key.h",
+        "cuda/include/thrust/system/cpp/detail/vector.inl",
+        "cuda/include/thrust/system/cpp/execution_policy.h",
+        "cuda/include/thrust/system/cpp/memory.h",
+        "cuda/include/thrust/system/cpp/vector.h",
+        "cuda/include/thrust/system/cuda/config.h",
+        "cuda/include/thrust/system/cuda/detail/adjacent_difference.h",
+        "cuda/include/thrust/system/cuda/detail/assign_value.h",
+        "cuda/include/thrust/system/cuda/detail/binary_search.h",
+        "cuda/include/thrust/system/cuda/detail/copy.h",
+        "cuda/include/thrust/system/cuda/detail/copy_if.h",
+        "cuda/include/thrust/system/cuda/detail/core/agent_launcher.h",
+        "cuda/include/thrust/system/cuda/detail/core/alignment.h",
+        "cuda/include/thrust/system/cuda/detail/core/triple_chevron_launch.h",
+        "cuda/include/thrust/system/cuda/detail/core/util.h",
+        "cuda/include/thrust/system/cuda/detail/count.h",
+        "cuda/include/thrust/system/cuda/detail/cross_system.h",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_histogram.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_downsweep.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_upsweep.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_reduce_by_key.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_rle.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_segment_fixup.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_select_if.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_csrt.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_orig.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_row_based.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/single_pass_scan_operators.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_adjacent_difference.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_load.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_shuffle.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_store.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans2.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans3.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/cub.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_partition.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_segmented_radix_sort.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_segmented_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_select.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_spmv.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_histogram.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_radix_sort.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce_by_key.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_rle.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_select_if.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_csrt.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_orig.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_row_based.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/host/mutex.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/discard_output_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_search.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_allocator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_arch.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_debug.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_device.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_macro.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_namespace.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_ptx.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_type.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/equal.h",
+        "cuda/include/thrust/system/cuda/detail/error.inl",
+        "cuda/include/thrust/system/cuda/detail/execution_policy.h",
+        "cuda/include/thrust/system/cuda/detail/extrema.h",
+        "cuda/include/thrust/system/cuda/detail/fill.h",
+        "cuda/include/thrust/system/cuda/detail/find.h",
+        "cuda/include/thrust/system/cuda/detail/for_each.h",
+        "cuda/include/thrust/system/cuda/detail/gather.h",
+        "cuda/include/thrust/system/cuda/detail/generate.h",
+        "cuda/include/thrust/system/cuda/detail/get_value.h",
+        "cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h",
+        "cuda/include/thrust/system/cuda/detail/guarded_driver_types.h",
+        "cuda/include/thrust/system/cuda/detail/inner_product.h",
+        "cuda/include/thrust/system/cuda/detail/internal/copy_cross_system.h",
+        "cuda/include/thrust/system/cuda/detail/internal/copy_device_to_device.h",
+        "cuda/include/thrust/system/cuda/detail/iter_swap.h",
+        "cuda/include/thrust/system/cuda/detail/logical.h",
+        "cuda/include/thrust/system/cuda/detail/malloc_and_free.h",
+        "cuda/include/thrust/system/cuda/detail/memory.inl",
+        "cuda/include/thrust/system/cuda/detail/memory_buffer.h",
+        "cuda/include/thrust/system/cuda/detail/merge.h",
+        "cuda/include/thrust/system/cuda/detail/mismatch.h",
+        "cuda/include/thrust/system/cuda/detail/par.h",
+        "cuda/include/thrust/system/cuda/detail/par_to_seq.h",
+        "cuda/include/thrust/system/cuda/detail/parallel_for.h",
+        "cuda/include/thrust/system/cuda/detail/partition.h",
+        "cuda/include/thrust/system/cuda/detail/reduce.h",
+        "cuda/include/thrust/system/cuda/detail/reduce_by_key.h",
+        "cuda/include/thrust/system/cuda/detail/remove.h",
+        "cuda/include/thrust/system/cuda/detail/replace.h",
+        "cuda/include/thrust/system/cuda/detail/reverse.h",
+        "cuda/include/thrust/system/cuda/detail/scan.h",
+        "cuda/include/thrust/system/cuda/detail/scan_by_key.h",
+        "cuda/include/thrust/system/cuda/detail/scatter.h",
+        "cuda/include/thrust/system/cuda/detail/sequence.h",
+        "cuda/include/thrust/system/cuda/detail/set_operations.h",
+        "cuda/include/thrust/system/cuda/detail/sort.h",
+        "cuda/include/thrust/system/cuda/detail/swap_ranges.h",
+        "cuda/include/thrust/system/cuda/detail/tabulate.h",
+        "cuda/include/thrust/system/cuda/detail/temporary_buffer.h",
+        "cuda/include/thrust/system/cuda/detail/terminate.h",
+        "cuda/include/thrust/system/cuda/detail/transform.h",
+        "cuda/include/thrust/system/cuda/detail/transform_reduce.h",
+        "cuda/include/thrust/system/cuda/detail/transform_scan.h",
+        "cuda/include/thrust/system/cuda/detail/uninitialized_copy.h",
+        "cuda/include/thrust/system/cuda/detail/uninitialized_fill.h",
+        "cuda/include/thrust/system/cuda/detail/unique.h",
+        "cuda/include/thrust/system/cuda/detail/unique_by_key.h",
+        "cuda/include/thrust/system/cuda/detail/util.h",
+        "cuda/include/thrust/system/cuda/detail/vector.inl",
+        "cuda/include/thrust/system/cuda/error.h",
+        "cuda/include/thrust/system/cuda/execution_policy.h",
+        "cuda/include/thrust/system/cuda/experimental/pinned_allocator.h",
+        "cuda/include/thrust/system/cuda/memory.h",
+        "cuda/include/thrust/system/cuda/vector.h",
+        "cuda/include/thrust/system/detail/adl/adjacent_difference.h",
+        "cuda/include/thrust/system/detail/adl/assign_value.h",
+        "cuda/include/thrust/system/detail/adl/binary_search.h",
+        "cuda/include/thrust/system/detail/adl/copy.h",
+        "cuda/include/thrust/system/detail/adl/copy_if.h",
+        "cuda/include/thrust/system/detail/adl/count.h",
+        "cuda/include/thrust/system/detail/adl/equal.h",
+        "cuda/include/thrust/system/detail/adl/extrema.h",
+        "cuda/include/thrust/system/detail/adl/fill.h",
+        "cuda/include/thrust/system/detail/adl/find.h",
+        "cuda/include/thrust/system/detail/adl/for_each.h",
+        "cuda/include/thrust/system/detail/adl/gather.h",
+        "cuda/include/thrust/system/detail/adl/generate.h",
         "cuda/include/thrust/system/detail/adl/get_value.h",
         "cuda/include/thrust/system/detail/adl/inner_product.h",
-        "cuda/include/thrust/system/detail/adl/copy_if.h",
-        "cuda/include/thrust/system/detail/adl/logical.h",
         "cuda/include/thrust/system/detail/adl/iter_swap.h",
+        "cuda/include/thrust/system/detail/adl/logical.h",
         "cuda/include/thrust/system/detail/adl/malloc_and_free.h",
-        "cuda/include/thrust/system/detail/adl/fill.h",
+        "cuda/include/thrust/system/detail/adl/merge.h",
+        "cuda/include/thrust/system/detail/adl/mismatch.h",
+        "cuda/include/thrust/system/detail/adl/partition.h",
+        "cuda/include/thrust/system/detail/adl/reduce.h",
+        "cuda/include/thrust/system/detail/adl/reduce_by_key.h",
+        "cuda/include/thrust/system/detail/adl/remove.h",
+        "cuda/include/thrust/system/detail/adl/replace.h",
+        "cuda/include/thrust/system/detail/adl/reverse.h",
+        "cuda/include/thrust/system/detail/adl/scan.h",
+        "cuda/include/thrust/system/detail/adl/scan_by_key.h",
+        "cuda/include/thrust/system/detail/adl/scatter.h",
+        "cuda/include/thrust/system/detail/adl/sequence.h",
+        "cuda/include/thrust/system/detail/adl/set_operations.h",
+        "cuda/include/thrust/system/detail/adl/sort.h",
+        "cuda/include/thrust/system/detail/adl/swap_ranges.h",
+        "cuda/include/thrust/system/detail/adl/tabulate.h",
+        "cuda/include/thrust/system/detail/adl/temporary_buffer.h",
         "cuda/include/thrust/system/detail/adl/transform.h",
+        "cuda/include/thrust/system/detail/adl/transform_reduce.h",
+        "cuda/include/thrust/system/detail/adl/transform_scan.h",
+        "cuda/include/thrust/system/detail/adl/uninitialized_copy.h",
+        "cuda/include/thrust/system/detail/adl/uninitialized_fill.h",
+        "cuda/include/thrust/system/detail/adl/unique.h",
+        "cuda/include/thrust/system/detail/adl/unique_by_key.h",
+        "cuda/include/thrust/system/detail/bad_alloc.h",
         "cuda/include/thrust/system/detail/errno.h",
         "cuda/include/thrust/system/detail/error_category.inl",
-        "cuda/include/thrust/system/detail/sequential/transform_scan.h",
-        "cuda/include/thrust/system/detail/sequential/unique_by_key.h",
-        "cuda/include/thrust/system/detail/sequential/stable_primitive_sort.h",
-        "cuda/include/thrust/system/detail/sequential/stable_primitive_sort.inl",
-        "cuda/include/thrust/system/detail/sequential/stable_merge_sort.h",
-        "cuda/include/thrust/system/detail/sequential/sort.inl",
-        "cuda/include/thrust/system/detail/sequential/partition.h",
-        "cuda/include/thrust/system/detail/sequential/unique.h",
-        "cuda/include/thrust/system/detail/sequential/execution_policy.h",
-        "cuda/include/thrust/system/detail/sequential/adjacent_difference.h",
-        "cuda/include/thrust/system/detail/sequential/sequence.h",
-        "cuda/include/thrust/system/detail/sequential/merge.h",
-        "cuda/include/thrust/system/detail/sequential/transform_reduce.h",
-        "cuda/include/thrust/system/detail/sequential/gather.h",
-        "cuda/include/thrust/system/detail/sequential/sort.h",
-        "cuda/include/thrust/system/detail/sequential/copy_backward.h",
-        "cuda/include/thrust/system/detail/sequential/stable_radix_sort.inl",
-        "cuda/include/thrust/system/detail/sequential/scan.h",
-        "cuda/include/thrust/system/detail/sequential/temporary_buffer.h",
-        "cuda/include/thrust/system/detail/sequential/scan_by_key.h",
-        "cuda/include/thrust/system/detail/sequential/reverse.h",
-        "cuda/include/thrust/system/detail/sequential/assign_value.h",
-        "cuda/include/thrust/system/detail/sequential/scatter.h",
-        "cuda/include/thrust/system/detail/sequential/find.h",
-        "cuda/include/thrust/system/detail/sequential/stable_merge_sort.inl",
-        "cuda/include/thrust/system/detail/sequential/merge.inl",
-        "cuda/include/thrust/system/detail/sequential/generate.h",
-        "cuda/include/thrust/system/detail/sequential/uninitialized_fill.h",
-        "cuda/include/thrust/system/detail/sequential/general_copy.h",
-        "cuda/include/thrust/system/detail/sequential/insertion_sort.h",
-        "cuda/include/thrust/system/detail/sequential/remove.h",
-        "cuda/include/thrust/system/detail/sequential/tabulate.h",
-        "cuda/include/thrust/system/detail/sequential/for_each.h",
-        "cuda/include/thrust/system/detail/sequential/reduce_by_key.h",
-        "cuda/include/thrust/system/detail/sequential/reduce.h",
-        "cuda/include/thrust/system/detail/sequential/equal.h",
-        "cuda/include/thrust/system/detail/sequential/stable_radix_sort.h",
-        "cuda/include/thrust/system/detail/sequential/copy.inl",
-        "cuda/include/thrust/system/detail/sequential/copy.h",
-        "cuda/include/thrust/system/detail/sequential/swap_ranges.h",
-        "cuda/include/thrust/system/detail/sequential/uninitialized_copy.h",
-        "cuda/include/thrust/system/detail/sequential/binary_search.h",
-        "cuda/include/thrust/system/detail/sequential/set_operations.h",
-        "cuda/include/thrust/system/detail/sequential/mismatch.h",
-        "cuda/include/thrust/system/detail/sequential/extrema.h",
-        "cuda/include/thrust/system/detail/sequential/count.h",
-        "cuda/include/thrust/system/detail/sequential/trivial_copy.h",
-        "cuda/include/thrust/system/detail/sequential/replace.h",
-        "cuda/include/thrust/system/detail/sequential/get_value.h",
-        "cuda/include/thrust/system/detail/sequential/inner_product.h",
-        "cuda/include/thrust/system/detail/sequential/copy_if.h",
-        "cuda/include/thrust/system/detail/sequential/logical.h",
-        "cuda/include/thrust/system/detail/sequential/iter_swap.h",
-        "cuda/include/thrust/system/detail/sequential/malloc_and_free.h",
-        "cuda/include/thrust/system/detail/sequential/fill.h",
-        "cuda/include/thrust/system/detail/sequential/transform.h",
-        "cuda/include/thrust/system/detail/error_condition.inl",
-        "cuda/include/thrust/system/detail/internal/decompose.h",
         "cuda/include/thrust/system/detail/error_code.inl",
-        "cuda/include/thrust/system/detail/generic/transform_scan.h",
-        "cuda/include/thrust/system/detail/generic/memory.inl",
-        "cuda/include/thrust/system/detail/generic/transform.inl",
-        "cuda/include/thrust/system/detail/generic/binary_search.inl",
-        "cuda/include/thrust/system/detail/generic/scan_by_key.inl",
-        "cuda/include/thrust/system/detail/generic/unique_by_key.h",
-        "cuda/include/thrust/system/detail/generic/inner_product.inl",
-        "cuda/include/thrust/system/detail/generic/select_system.h",
-        "cuda/include/thrust/system/detail/generic/sequence.inl",
-        "cuda/include/thrust/system/detail/generic/sort.inl",
-        "cuda/include/thrust/system/detail/generic/equal.inl",
-        "cuda/include/thrust/system/detail/generic/partition.h",
-        "cuda/include/thrust/system/detail/generic/unique.h",
+        "cuda/include/thrust/system/detail/error_condition.inl",
         "cuda/include/thrust/system/detail/generic/adjacent_difference.h",
-        "cuda/include/thrust/system/detail/generic/tag.h",
-        "cuda/include/thrust/system/detail/generic/unique_by_key.inl",
-        "cuda/include/thrust/system/detail/generic/sequence.h",
-        "cuda/include/thrust/system/detail/generic/type_traits.h",
-        "cuda/include/thrust/system/detail/generic/merge.h",
-        "cuda/include/thrust/system/detail/generic/reverse.inl",
-        "cuda/include/thrust/system/detail/generic/tabulate.inl",
-        "cuda/include/thrust/system/detail/generic/unique.inl",
-        "cuda/include/thrust/system/detail/generic/scatter.inl",
-        "cuda/include/thrust/system/detail/generic/set_operations.inl",
-        "cuda/include/thrust/system/detail/generic/copy_if.inl",
-        "cuda/include/thrust/system/detail/generic/transform_reduce.h",
-        "cuda/include/thrust/system/detail/generic/transform_scan.inl",
-        "cuda/include/thrust/system/detail/generic/gather.h",
-        "cuda/include/thrust/system/detail/generic/reduce_by_key.inl",
-        "cuda/include/thrust/system/detail/generic/transform_reduce.inl",
-        "cuda/include/thrust/system/detail/generic/sort.h",
-        "cuda/include/thrust/system/detail/generic/distance.inl",
-        "cuda/include/thrust/system/detail/generic/scan.h",
-        "cuda/include/thrust/system/detail/generic/temporary_buffer.h",
-        "cuda/include/thrust/system/detail/generic/reduce.inl",
-        "cuda/include/thrust/system/detail/generic/scan_by_key.h",
-        "cuda/include/thrust/system/detail/generic/reverse.h",
-        "cuda/include/thrust/system/detail/generic/temporary_buffer.inl",
-        "cuda/include/thrust/system/detail/generic/scatter.h",
-        "cuda/include/thrust/system/detail/generic/generate.inl",
         "cuda/include/thrust/system/detail/generic/adjacent_difference.inl",
-        "cuda/include/thrust/system/detail/generic/remove.inl",
         "cuda/include/thrust/system/detail/generic/advance.h",
-        "cuda/include/thrust/system/detail/generic/find.h",
-        "cuda/include/thrust/system/detail/generic/merge.inl",
-        "cuda/include/thrust/system/detail/generic/scalar/binary_search.inl",
-        "cuda/include/thrust/system/detail/generic/scalar/binary_search.h",
-        "cuda/include/thrust/system/detail/generic/extrema.inl",
-        "cuda/include/thrust/system/detail/generic/generate.h",
-        "cuda/include/thrust/system/detail/generic/uninitialized_fill.h",
+        "cuda/include/thrust/system/detail/generic/advance.inl",
+        "cuda/include/thrust/system/detail/generic/binary_search.h",
+        "cuda/include/thrust/system/detail/generic/binary_search.inl",
+        "cuda/include/thrust/system/detail/generic/copy.h",
+        "cuda/include/thrust/system/detail/generic/copy.inl",
+        "cuda/include/thrust/system/detail/generic/copy_if.h",
+        "cuda/include/thrust/system/detail/generic/copy_if.inl",
+        "cuda/include/thrust/system/detail/generic/count.h",
         "cuda/include/thrust/system/detail/generic/count.inl",
-        "cuda/include/thrust/system/detail/generic/remove.h",
-        "cuda/include/thrust/system/detail/generic/uninitialized_copy.inl",
-        "cuda/include/thrust/system/detail/generic/tabulate.h",
-        "cuda/include/thrust/system/detail/generic/for_each.h",
         "cuda/include/thrust/system/detail/generic/distance.h",
-        "cuda/include/thrust/system/detail/generic/swap_ranges.inl",
-        "cuda/include/thrust/system/detail/generic/reduce_by_key.h",
-        "cuda/include/thrust/system/detail/generic/reduce.h",
+        "cuda/include/thrust/system/detail/generic/distance.inl",
         "cuda/include/thrust/system/detail/generic/equal.h",
-        "cuda/include/thrust/system/detail/generic/mismatch.inl",
-        "cuda/include/thrust/system/detail/generic/copy.inl",
-        "cuda/include/thrust/system/detail/generic/copy.h",
-        "cuda/include/thrust/system/detail/generic/swap_ranges.h",
-        "cuda/include/thrust/system/detail/generic/uninitialized_copy.h",
-        "cuda/include/thrust/system/detail/generic/binary_search.h",
-        "cuda/include/thrust/system/detail/generic/set_operations.h",
-        "cuda/include/thrust/system/detail/generic/uninitialized_fill.inl",
-        "cuda/include/thrust/system/detail/generic/mismatch.h",
-        "cuda/include/thrust/system/detail/generic/scan.inl",
-        "cuda/include/thrust/system/detail/generic/gather.inl",
+        "cuda/include/thrust/system/detail/generic/equal.inl",
         "cuda/include/thrust/system/detail/generic/extrema.h",
-        "cuda/include/thrust/system/detail/generic/count.h",
-        "cuda/include/thrust/system/detail/generic/replace.h",
+        "cuda/include/thrust/system/detail/generic/extrema.inl",
+        "cuda/include/thrust/system/detail/generic/fill.h",
+        "cuda/include/thrust/system/detail/generic/find.h",
+        "cuda/include/thrust/system/detail/generic/find.inl",
+        "cuda/include/thrust/system/detail/generic/for_each.h",
+        "cuda/include/thrust/system/detail/generic/gather.h",
+        "cuda/include/thrust/system/detail/generic/gather.inl",
+        "cuda/include/thrust/system/detail/generic/generate.h",
+        "cuda/include/thrust/system/detail/generic/generate.inl",
         "cuda/include/thrust/system/detail/generic/inner_product.h",
-        "cuda/include/thrust/system/detail/generic/copy_if.h",
+        "cuda/include/thrust/system/detail/generic/inner_product.inl",
         "cuda/include/thrust/system/detail/generic/logical.h",
-        "cuda/include/thrust/system/detail/generic/partition.inl",
         "cuda/include/thrust/system/detail/generic/memory.h",
-        "cuda/include/thrust/system/detail/generic/find.inl",
+        "cuda/include/thrust/system/detail/generic/memory.inl",
+        "cuda/include/thrust/system/detail/generic/merge.h",
+        "cuda/include/thrust/system/detail/generic/merge.inl",
+        "cuda/include/thrust/system/detail/generic/mismatch.h",
+        "cuda/include/thrust/system/detail/generic/mismatch.inl",
+        "cuda/include/thrust/system/detail/generic/partition.h",
+        "cuda/include/thrust/system/detail/generic/partition.inl",
+        "cuda/include/thrust/system/detail/generic/reduce.h",
+        "cuda/include/thrust/system/detail/generic/reduce.inl",
+        "cuda/include/thrust/system/detail/generic/reduce_by_key.h",
+        "cuda/include/thrust/system/detail/generic/reduce_by_key.inl",
+        "cuda/include/thrust/system/detail/generic/remove.h",
+        "cuda/include/thrust/system/detail/generic/remove.inl",
+        "cuda/include/thrust/system/detail/generic/replace.h",
         "cuda/include/thrust/system/detail/generic/replace.inl",
-        "cuda/include/thrust/system/detail/generic/advance.inl",
-        "cuda/include/thrust/system/detail/generic/fill.h",
+        "cuda/include/thrust/system/detail/generic/reverse.h",
+        "cuda/include/thrust/system/detail/generic/reverse.inl",
+        "cuda/include/thrust/system/detail/generic/scalar/binary_search.h",
+        "cuda/include/thrust/system/detail/generic/scalar/binary_search.inl",
+        "cuda/include/thrust/system/detail/generic/scan.h",
+        "cuda/include/thrust/system/detail/generic/scan.inl",
+        "cuda/include/thrust/system/detail/generic/scan_by_key.h",
+        "cuda/include/thrust/system/detail/generic/scan_by_key.inl",
+        "cuda/include/thrust/system/detail/generic/scatter.h",
+        "cuda/include/thrust/system/detail/generic/scatter.inl",
+        "cuda/include/thrust/system/detail/generic/select_system.h",
+        "cuda/include/thrust/system/detail/generic/sequence.h",
+        "cuda/include/thrust/system/detail/generic/sequence.inl",
+        "cuda/include/thrust/system/detail/generic/set_operations.h",
+        "cuda/include/thrust/system/detail/generic/set_operations.inl",
+        "cuda/include/thrust/system/detail/generic/sort.h",
+        "cuda/include/thrust/system/detail/generic/sort.inl",
+        "cuda/include/thrust/system/detail/generic/swap_ranges.h",
+        "cuda/include/thrust/system/detail/generic/swap_ranges.inl",
+        "cuda/include/thrust/system/detail/generic/tabulate.h",
+        "cuda/include/thrust/system/detail/generic/tabulate.inl",
+        "cuda/include/thrust/system/detail/generic/tag.h",
+        "cuda/include/thrust/system/detail/generic/temporary_buffer.h",
+        "cuda/include/thrust/system/detail/generic/temporary_buffer.inl",
         "cuda/include/thrust/system/detail/generic/transform.h",
+        "cuda/include/thrust/system/detail/generic/transform.inl",
+        "cuda/include/thrust/system/detail/generic/transform_reduce.h",
+        "cuda/include/thrust/system/detail/generic/transform_reduce.inl",
+        "cuda/include/thrust/system/detail/generic/transform_scan.h",
+        "cuda/include/thrust/system/detail/generic/transform_scan.inl",
+        "cuda/include/thrust/system/detail/generic/type_traits.h",
+        "cuda/include/thrust/system/detail/generic/uninitialized_copy.h",
+        "cuda/include/thrust/system/detail/generic/uninitialized_copy.inl",
+        "cuda/include/thrust/system/detail/generic/uninitialized_fill.h",
+        "cuda/include/thrust/system/detail/generic/uninitialized_fill.inl",
+        "cuda/include/thrust/system/detail/generic/unique.h",
+        "cuda/include/thrust/system/detail/generic/unique.inl",
+        "cuda/include/thrust/system/detail/generic/unique_by_key.h",
+        "cuda/include/thrust/system/detail/generic/unique_by_key.inl",
+        "cuda/include/thrust/system/detail/internal/decompose.h",
+        "cuda/include/thrust/system/detail/sequential/adjacent_difference.h",
+        "cuda/include/thrust/system/detail/sequential/assign_value.h",
+        "cuda/include/thrust/system/detail/sequential/binary_search.h",
+        "cuda/include/thrust/system/detail/sequential/copy.h",
+        "cuda/include/thrust/system/detail/sequential/copy.inl",
+        "cuda/include/thrust/system/detail/sequential/copy_backward.h",
+        "cuda/include/thrust/system/detail/sequential/copy_if.h",
+        "cuda/include/thrust/system/detail/sequential/count.h",
+        "cuda/include/thrust/system/detail/sequential/equal.h",
+        "cuda/include/thrust/system/detail/sequential/execution_policy.h",
+        "cuda/include/thrust/system/detail/sequential/extrema.h",
+        "cuda/include/thrust/system/detail/sequential/fill.h",
+        "cuda/include/thrust/system/detail/sequential/find.h",
+        "cuda/include/thrust/system/detail/sequential/for_each.h",
+        "cuda/include/thrust/system/detail/sequential/gather.h",
+        "cuda/include/thrust/system/detail/sequential/general_copy.h",
+        "cuda/include/thrust/system/detail/sequential/generate.h",
+        "cuda/include/thrust/system/detail/sequential/get_value.h",
+        "cuda/include/thrust/system/detail/sequential/inner_product.h",
+        "cuda/include/thrust/system/detail/sequential/insertion_sort.h",
+        "cuda/include/thrust/system/detail/sequential/iter_swap.h",
+        "cuda/include/thrust/system/detail/sequential/logical.h",
+        "cuda/include/thrust/system/detail/sequential/malloc_and_free.h",
+        "cuda/include/thrust/system/detail/sequential/merge.h",
+        "cuda/include/thrust/system/detail/sequential/merge.inl",
+        "cuda/include/thrust/system/detail/sequential/mismatch.h",
+        "cuda/include/thrust/system/detail/sequential/partition.h",
+        "cuda/include/thrust/system/detail/sequential/reduce.h",
+        "cuda/include/thrust/system/detail/sequential/reduce_by_key.h",
+        "cuda/include/thrust/system/detail/sequential/remove.h",
+        "cuda/include/thrust/system/detail/sequential/replace.h",
+        "cuda/include/thrust/system/detail/sequential/reverse.h",
+        "cuda/include/thrust/system/detail/sequential/scan.h",
+        "cuda/include/thrust/system/detail/sequential/scan_by_key.h",
+        "cuda/include/thrust/system/detail/sequential/scatter.h",
+        "cuda/include/thrust/system/detail/sequential/sequence.h",
+        "cuda/include/thrust/system/detail/sequential/set_operations.h",
+        "cuda/include/thrust/system/detail/sequential/sort.h",
+        "cuda/include/thrust/system/detail/sequential/sort.inl",
+        "cuda/include/thrust/system/detail/sequential/stable_merge_sort.h",
+        "cuda/include/thrust/system/detail/sequential/stable_merge_sort.inl",
+        "cuda/include/thrust/system/detail/sequential/stable_primitive_sort.h",
+        "cuda/include/thrust/system/detail/sequential/stable_primitive_sort.inl",
+        "cuda/include/thrust/system/detail/sequential/stable_radix_sort.h",
+        "cuda/include/thrust/system/detail/sequential/stable_radix_sort.inl",
+        "cuda/include/thrust/system/detail/sequential/swap_ranges.h",
+        "cuda/include/thrust/system/detail/sequential/tabulate.h",
+        "cuda/include/thrust/system/detail/sequential/temporary_buffer.h",
+        "cuda/include/thrust/system/detail/sequential/transform.h",
+        "cuda/include/thrust/system/detail/sequential/transform_reduce.h",
+        "cuda/include/thrust/system/detail/sequential/transform_scan.h",
+        "cuda/include/thrust/system/detail/sequential/trivial_copy.h",
+        "cuda/include/thrust/system/detail/sequential/uninitialized_copy.h",
+        "cuda/include/thrust/system/detail/sequential/uninitialized_fill.h",
+        "cuda/include/thrust/system/detail/sequential/unique.h",
+        "cuda/include/thrust/system/detail/sequential/unique_by_key.h",
         "cuda/include/thrust/system/detail/system_error.inl",
-        "cuda/include/thrust/system/omp/execution_policy.h",
-        "cuda/include/thrust/system/omp/vector.h",
-        "cuda/include/thrust/system/omp/detail/transform_scan.h",
-        "cuda/include/thrust/system/omp/detail/memory.inl",
-        "cuda/include/thrust/system/omp/detail/reduce_intervals.inl",
-        "cuda/include/thrust/system/omp/detail/unique_by_key.h",
-        "cuda/include/thrust/system/omp/detail/sort.inl",
-        "cuda/include/thrust/system/omp/detail/partition.h",
-        "cuda/include/thrust/system/omp/detail/unique.h",
-        "cuda/include/thrust/system/omp/detail/execution_policy.h",
+        "cuda/include/thrust/system/error_code.h",
         "cuda/include/thrust/system/omp/detail/adjacent_difference.h",
-        "cuda/include/thrust/system/omp/detail/unique_by_key.inl",
-        "cuda/include/thrust/system/omp/detail/sequence.h",
-        "cuda/include/thrust/system/omp/detail/merge.h",
-        "cuda/include/thrust/system/omp/detail/unique.inl",
+        "cuda/include/thrust/system/omp/detail/assign_value.h",
+        "cuda/include/thrust/system/omp/detail/binary_search.h",
+        "cuda/include/thrust/system/omp/detail/copy.h",
+        "cuda/include/thrust/system/omp/detail/copy.inl",
+        "cuda/include/thrust/system/omp/detail/copy_if.h",
         "cuda/include/thrust/system/omp/detail/copy_if.inl",
-        "cuda/include/thrust/system/omp/detail/transform_reduce.h",
-        "cuda/include/thrust/system/omp/detail/gather.h",
-        "cuda/include/thrust/system/omp/detail/reduce_by_key.inl",
-        "cuda/include/thrust/system/omp/detail/sort.h",
-        "cuda/include/thrust/system/omp/detail/scan.h",
-        "cuda/include/thrust/system/omp/detail/temporary_buffer.h",
+        "cuda/include/thrust/system/omp/detail/count.h",
         "cuda/include/thrust/system/omp/detail/default_decomposition.h",
-        "cuda/include/thrust/system/omp/detail/reduce.inl",
-        "cuda/include/thrust/system/omp/detail/scan_by_key.h",
-        "cuda/include/thrust/system/omp/detail/reverse.h",
-        "cuda/include/thrust/system/omp/detail/assign_value.h",
-        "cuda/include/thrust/system/omp/detail/scatter.h",
-        "cuda/include/thrust/system/omp/detail/for_each.inl",
         "cuda/include/thrust/system/omp/detail/default_decomposition.inl",
-        "cuda/include/thrust/system/omp/detail/remove.inl",
-        "cuda/include/thrust/system/omp/detail/vector.inl",
-        "cuda/include/thrust/system/omp/detail/find.h",
-        "cuda/include/thrust/system/omp/detail/generate.h",
-        "cuda/include/thrust/system/omp/detail/uninitialized_fill.h",
-        "cuda/include/thrust/system/omp/detail/remove.h",
-        "cuda/include/thrust/system/omp/detail/tabulate.h",
-        "cuda/include/thrust/system/omp/detail/for_each.h",
-        "cuda/include/thrust/system/omp/detail/reduce_by_key.h",
-        "cuda/include/thrust/system/omp/detail/reduce.h",
         "cuda/include/thrust/system/omp/detail/equal.h",
-        "cuda/include/thrust/system/omp/detail/copy.inl",
-        "cuda/include/thrust/system/omp/detail/copy.h",
-        "cuda/include/thrust/system/omp/detail/swap_ranges.h",
-        "cuda/include/thrust/system/omp/detail/uninitialized_copy.h",
-        "cuda/include/thrust/system/omp/detail/binary_search.h",
-        "cuda/include/thrust/system/omp/detail/set_operations.h",
-        "cuda/include/thrust/system/omp/detail/mismatch.h",
+        "cuda/include/thrust/system/omp/detail/execution_policy.h",
         "cuda/include/thrust/system/omp/detail/extrema.h",
-        "cuda/include/thrust/system/omp/detail/count.h",
-        "cuda/include/thrust/system/omp/detail/replace.h",
+        "cuda/include/thrust/system/omp/detail/fill.h",
+        "cuda/include/thrust/system/omp/detail/find.h",
+        "cuda/include/thrust/system/omp/detail/for_each.h",
+        "cuda/include/thrust/system/omp/detail/for_each.inl",
+        "cuda/include/thrust/system/omp/detail/gather.h",
+        "cuda/include/thrust/system/omp/detail/generate.h",
         "cuda/include/thrust/system/omp/detail/get_value.h",
         "cuda/include/thrust/system/omp/detail/inner_product.h",
-        "cuda/include/thrust/system/omp/detail/copy_if.h",
-        "cuda/include/thrust/system/omp/detail/logical.h",
-        "cuda/include/thrust/system/omp/detail/partition.inl",
         "cuda/include/thrust/system/omp/detail/iter_swap.h",
+        "cuda/include/thrust/system/omp/detail/logical.h",
+        "cuda/include/thrust/system/omp/detail/malloc_and_free.h",
+        "cuda/include/thrust/system/omp/detail/memory.inl",
+        "cuda/include/thrust/system/omp/detail/merge.h",
+        "cuda/include/thrust/system/omp/detail/mismatch.h",
         "cuda/include/thrust/system/omp/detail/par.h",
+        "cuda/include/thrust/system/omp/detail/partition.h",
+        "cuda/include/thrust/system/omp/detail/partition.inl",
+        "cuda/include/thrust/system/omp/detail/reduce.h",
+        "cuda/include/thrust/system/omp/detail/reduce.inl",
+        "cuda/include/thrust/system/omp/detail/reduce_by_key.h",
+        "cuda/include/thrust/system/omp/detail/reduce_by_key.inl",
         "cuda/include/thrust/system/omp/detail/reduce_intervals.h",
-        "cuda/include/thrust/system/omp/detail/malloc_and_free.h",
-        "cuda/include/thrust/system/omp/detail/fill.h",
+        "cuda/include/thrust/system/omp/detail/reduce_intervals.inl",
+        "cuda/include/thrust/system/omp/detail/remove.h",
+        "cuda/include/thrust/system/omp/detail/remove.inl",
+        "cuda/include/thrust/system/omp/detail/replace.h",
+        "cuda/include/thrust/system/omp/detail/reverse.h",
+        "cuda/include/thrust/system/omp/detail/scan.h",
+        "cuda/include/thrust/system/omp/detail/scan_by_key.h",
+        "cuda/include/thrust/system/omp/detail/scatter.h",
+        "cuda/include/thrust/system/omp/detail/sequence.h",
+        "cuda/include/thrust/system/omp/detail/set_operations.h",
+        "cuda/include/thrust/system/omp/detail/sort.h",
+        "cuda/include/thrust/system/omp/detail/sort.inl",
+        "cuda/include/thrust/system/omp/detail/swap_ranges.h",
+        "cuda/include/thrust/system/omp/detail/tabulate.h",
+        "cuda/include/thrust/system/omp/detail/temporary_buffer.h",
         "cuda/include/thrust/system/omp/detail/transform.h",
-        "cuda/include/thrust/system/omp/memory.h",
-        "cuda/include/thrust/system/tbb/execution_policy.h",
-        "cuda/include/thrust/system/tbb/vector.h",
-        "cuda/include/thrust/system/tbb/detail/transform_scan.h",
-        "cuda/include/thrust/system/tbb/detail/memory.inl",
-        "cuda/include/thrust/system/tbb/detail/unique_by_key.h",
-        "cuda/include/thrust/system/tbb/detail/sort.inl",
-        "cuda/include/thrust/system/tbb/detail/partition.h",
-        "cuda/include/thrust/system/tbb/detail/unique.h",
-        "cuda/include/thrust/system/tbb/detail/execution_policy.h",
+        "cuda/include/thrust/system/omp/detail/transform_reduce.h",
+        "cuda/include/thrust/system/omp/detail/transform_scan.h",
+        "cuda/include/thrust/system/omp/detail/uninitialized_copy.h",
+        "cuda/include/thrust/system/omp/detail/uninitialized_fill.h",
+        "cuda/include/thrust/system/omp/detail/unique.h",
+        "cuda/include/thrust/system/omp/detail/unique.inl",
+        "cuda/include/thrust/system/omp/detail/unique_by_key.h",
+        "cuda/include/thrust/system/omp/detail/unique_by_key.inl",
+        "cuda/include/thrust/system/omp/detail/vector.inl",
+        "cuda/include/thrust/system/omp/execution_policy.h",
+        "cuda/include/thrust/system/omp/memory.h",
+        "cuda/include/thrust/system/omp/vector.h",
+        "cuda/include/thrust/system/system_error.h",
         "cuda/include/thrust/system/tbb/detail/adjacent_difference.h",
-        "cuda/include/thrust/system/tbb/detail/unique_by_key.inl",
-        "cuda/include/thrust/system/tbb/detail/sequence.h",
-        "cuda/include/thrust/system/tbb/detail/merge.h",
-        "cuda/include/thrust/system/tbb/detail/unique.inl",
-        "cuda/include/thrust/system/tbb/detail/copy_if.inl",
-        "cuda/include/thrust/system/tbb/detail/transform_reduce.h",
-        "cuda/include/thrust/system/tbb/detail/gather.h",
-        "cuda/include/thrust/system/tbb/detail/reduce_by_key.inl",
-        "cuda/include/thrust/system/tbb/detail/sort.h",
-        "cuda/include/thrust/system/tbb/detail/scan.h",
-        "cuda/include/thrust/system/tbb/detail/temporary_buffer.h",
-        "cuda/include/thrust/system/tbb/detail/reduce.inl",
-        "cuda/include/thrust/system/tbb/detail/scan_by_key.h",
-        "cuda/include/thrust/system/tbb/detail/reverse.h",
         "cuda/include/thrust/system/tbb/detail/assign_value.h",
-        "cuda/include/thrust/system/tbb/detail/scatter.h",
-        "cuda/include/thrust/system/tbb/detail/for_each.inl",
-        "cuda/include/thrust/system/tbb/detail/remove.inl",
-        "cuda/include/thrust/system/tbb/detail/vector.inl",
-        "cuda/include/thrust/system/tbb/detail/find.h",
-        "cuda/include/thrust/system/tbb/detail/merge.inl",
-        "cuda/include/thrust/system/tbb/detail/generate.h",
-        "cuda/include/thrust/system/tbb/detail/uninitialized_fill.h",
-        "cuda/include/thrust/system/tbb/detail/remove.h",
-        "cuda/include/thrust/system/tbb/detail/tabulate.h",
-        "cuda/include/thrust/system/tbb/detail/for_each.h",
-        "cuda/include/thrust/system/tbb/detail/reduce_by_key.h",
-        "cuda/include/thrust/system/tbb/detail/reduce.h",
-        "cuda/include/thrust/system/tbb/detail/equal.h",
-        "cuda/include/thrust/system/tbb/detail/copy.inl",
-        "cuda/include/thrust/system/tbb/detail/copy.h",
-        "cuda/include/thrust/system/tbb/detail/swap_ranges.h",
-        "cuda/include/thrust/system/tbb/detail/uninitialized_copy.h",
         "cuda/include/thrust/system/tbb/detail/binary_search.h",
-        "cuda/include/thrust/system/tbb/detail/set_operations.h",
-        "cuda/include/thrust/system/tbb/detail/mismatch.h",
-        "cuda/include/thrust/system/tbb/detail/scan.inl",
-        "cuda/include/thrust/system/tbb/detail/extrema.h",
+        "cuda/include/thrust/system/tbb/detail/copy.h",
+        "cuda/include/thrust/system/tbb/detail/copy.inl",
+        "cuda/include/thrust/system/tbb/detail/copy_if.h",
+        "cuda/include/thrust/system/tbb/detail/copy_if.inl",
         "cuda/include/thrust/system/tbb/detail/count.h",
-        "cuda/include/thrust/system/tbb/detail/replace.h",
+        "cuda/include/thrust/system/tbb/detail/equal.h",
+        "cuda/include/thrust/system/tbb/detail/execution_policy.h",
+        "cuda/include/thrust/system/tbb/detail/extrema.h",
+        "cuda/include/thrust/system/tbb/detail/fill.h",
+        "cuda/include/thrust/system/tbb/detail/find.h",
+        "cuda/include/thrust/system/tbb/detail/for_each.h",
+        "cuda/include/thrust/system/tbb/detail/for_each.inl",
+        "cuda/include/thrust/system/tbb/detail/gather.h",
+        "cuda/include/thrust/system/tbb/detail/generate.h",
         "cuda/include/thrust/system/tbb/detail/get_value.h",
         "cuda/include/thrust/system/tbb/detail/inner_product.h",
-        "cuda/include/thrust/system/tbb/detail/copy_if.h",
-        "cuda/include/thrust/system/tbb/detail/logical.h",
-        "cuda/include/thrust/system/tbb/detail/partition.inl",
         "cuda/include/thrust/system/tbb/detail/iter_swap.h",
+        "cuda/include/thrust/system/tbb/detail/logical.h",
+        "cuda/include/thrust/system/tbb/detail/malloc_and_free.h",
+        "cuda/include/thrust/system/tbb/detail/memory.inl",
+        "cuda/include/thrust/system/tbb/detail/merge.h",
+        "cuda/include/thrust/system/tbb/detail/merge.inl",
+        "cuda/include/thrust/system/tbb/detail/mismatch.h",
         "cuda/include/thrust/system/tbb/detail/par.h",
+        "cuda/include/thrust/system/tbb/detail/partition.h",
+        "cuda/include/thrust/system/tbb/detail/partition.inl",
+        "cuda/include/thrust/system/tbb/detail/reduce.h",
+        "cuda/include/thrust/system/tbb/detail/reduce.inl",
+        "cuda/include/thrust/system/tbb/detail/reduce_by_key.h",
+        "cuda/include/thrust/system/tbb/detail/reduce_by_key.inl",
         "cuda/include/thrust/system/tbb/detail/reduce_intervals.h",
-        "cuda/include/thrust/system/tbb/detail/malloc_and_free.h",
-        "cuda/include/thrust/system/tbb/detail/fill.h",
+        "cuda/include/thrust/system/tbb/detail/remove.h",
+        "cuda/include/thrust/system/tbb/detail/remove.inl",
+        "cuda/include/thrust/system/tbb/detail/replace.h",
+        "cuda/include/thrust/system/tbb/detail/reverse.h",
+        "cuda/include/thrust/system/tbb/detail/scan.h",
+        "cuda/include/thrust/system/tbb/detail/scan.inl",
+        "cuda/include/thrust/system/tbb/detail/scan_by_key.h",
+        "cuda/include/thrust/system/tbb/detail/scatter.h",
+        "cuda/include/thrust/system/tbb/detail/sequence.h",
+        "cuda/include/thrust/system/tbb/detail/set_operations.h",
+        "cuda/include/thrust/system/tbb/detail/sort.h",
+        "cuda/include/thrust/system/tbb/detail/sort.inl",
+        "cuda/include/thrust/system/tbb/detail/swap_ranges.h",
+        "cuda/include/thrust/system/tbb/detail/tabulate.h",
+        "cuda/include/thrust/system/tbb/detail/temporary_buffer.h",
         "cuda/include/thrust/system/tbb/detail/transform.h",
-        "cuda/include/thrust/system/tbb/memory.h",
-        "cuda/include/thrust/system/error_code.h",
-        "cuda/include/thrust/system/cpp/execution_policy.h",
-        "cuda/include/thrust/system/cpp/vector.h",
-        "cuda/include/thrust/system/cpp/detail/transform_scan.h",
-        "cuda/include/thrust/system/cpp/detail/memory.inl",
-        "cuda/include/thrust/system/cpp/detail/unique_by_key.h",
-        "cuda/include/thrust/system/cpp/detail/partition.h",
-        "cuda/include/thrust/system/cpp/detail/unique.h",
-        "cuda/include/thrust/system/cpp/detail/execution_policy.h",
-        "cuda/include/thrust/system/cpp/detail/adjacent_difference.h",
-        "cuda/include/thrust/system/cpp/detail/sequence.h",
-        "cuda/include/thrust/system/cpp/detail/merge.h",
-        "cuda/include/thrust/system/cpp/detail/transform_reduce.h",
-        "cuda/include/thrust/system/cpp/detail/gather.h",
-        "cuda/include/thrust/system/cpp/detail/sort.h",
-        "cuda/include/thrust/system/cpp/detail/scan.h",
-        "cuda/include/thrust/system/cpp/detail/temporary_buffer.h",
-        "cuda/include/thrust/system/cpp/detail/scan_by_key.h",
-        "cuda/include/thrust/system/cpp/detail/reverse.h",
-        "cuda/include/thrust/system/cpp/detail/assign_value.h",
-        "cuda/include/thrust/system/cpp/detail/scatter.h",
-        "cuda/include/thrust/system/cpp/detail/vector.inl",
-        "cuda/include/thrust/system/cpp/detail/find.h",
-        "cuda/include/thrust/system/cpp/detail/generate.h",
-        "cuda/include/thrust/system/cpp/detail/uninitialized_fill.h",
-        "cuda/include/thrust/system/cpp/detail/remove.h",
-        "cuda/include/thrust/system/cpp/detail/tabulate.h",
-        "cuda/include/thrust/system/cpp/detail/for_each.h",
-        "cuda/include/thrust/system/cpp/detail/reduce_by_key.h",
-        "cuda/include/thrust/system/cpp/detail/reduce.h",
-        "cuda/include/thrust/system/cpp/detail/equal.h",
-        "cuda/include/thrust/system/cpp/detail/copy.h",
-        "cuda/include/thrust/system/cpp/detail/swap_ranges.h",
-        "cuda/include/thrust/system/cpp/detail/uninitialized_copy.h",
-        "cuda/include/thrust/system/cpp/detail/binary_search.h",
-        "cuda/include/thrust/system/cpp/detail/set_operations.h",
-        "cuda/include/thrust/system/cpp/detail/mismatch.h",
-        "cuda/include/thrust/system/cpp/detail/extrema.h",
-        "cuda/include/thrust/system/cpp/detail/count.h",
-        "cuda/include/thrust/system/cpp/detail/replace.h",
-        "cuda/include/thrust/system/cpp/detail/get_value.h",
-        "cuda/include/thrust/system/cpp/detail/inner_product.h",
-        "cuda/include/thrust/system/cpp/detail/copy_if.h",
-        "cuda/include/thrust/system/cpp/detail/logical.h",
-        "cuda/include/thrust/system/cpp/detail/iter_swap.h",
-        "cuda/include/thrust/system/cpp/detail/par.h",
-        "cuda/include/thrust/system/cpp/detail/malloc_and_free.h",
-        "cuda/include/thrust/system/cpp/detail/fill.h",
-        "cuda/include/thrust/system/cpp/detail/transform.h",
-        "cuda/include/thrust/system/cpp/memory.h",
-        "cuda/include/thrust/system/cuda/execution_policy.h",
-        "cuda/include/thrust/system/cuda/vector.h",
-        "cuda/include/thrust/system/cuda/error.h",
-        "cuda/include/thrust/system/cuda/detail/copy_device_to_device.h",
-        "cuda/include/thrust/system/cuda/detail/transform_scan.h",
-        "cuda/include/thrust/system/cuda/detail/memory.inl",
-        "cuda/include/thrust/system/cuda/detail/cub/util_allocator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_device.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_partition.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_rle_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_histogram_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_by_key_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_scan_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_select_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_radix_sort_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_scan.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_select.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_histo.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_scan.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_downsweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_upsweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_satomic.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_sort.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_gatomic.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_select.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_scan_prefix_operators.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce_by_key.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_macro.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_namespace.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_upsweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_histogram_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_rle_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_select_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_satomic_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_sort_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_gatomic_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_downsweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_by_key_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_prefix_operators.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_type.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/host/spinlock.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_ptx.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_debug.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/cub.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_scan.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_load.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_shift.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_store.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_arch.cuh",
-        "cuda/include/thrust/system/cuda/detail/reduce_intervals.inl",
-        "cuda/include/thrust/system/cuda/detail/copy_cross_system.inl",
-        "cuda/include/thrust/system/cuda/detail/unique_by_key.h",
-        "cuda/include/thrust/system/cuda/detail/bulk.h",
-        "cuda/include/thrust/system/cuda/detail/sort.inl",
-        "cuda/include/thrust/system/cuda/detail/partition.h",
-        "cuda/include/thrust/system/cuda/detail/unique.h",
-        "cuda/include/thrust/system/cuda/detail/execution_policy.h",
-        "cuda/include/thrust/system/cuda/detail/cuda_launch_config.h",
-        "cuda/include/thrust/system/cuda/detail/cub.h",
-        "cuda/include/thrust/system/cuda/detail/adjacent_difference.h",
-        "cuda/include/thrust/system/cuda/detail/sequence.h",
-        "cuda/include/thrust/system/cuda/detail/merge.h",
-        "cuda/include/thrust/system/cuda/detail/set_symmetric_difference.inl",
-        "cuda/include/thrust/system/cuda/detail/copy_if.inl",
-        "cuda/include/thrust/system/cuda/detail/transform_reduce.h",
-        "cuda/include/thrust/system/cuda/detail/error.inl",
-        "cuda/include/thrust/system/cuda/detail/gather.h",
-        "cuda/include/thrust/system/cuda/detail/reduce_by_key.inl",
-        "cuda/include/thrust/system/cuda/detail/sort.h",
-        "cuda/include/thrust/system/cuda/detail/synchronize.h",
-        "cuda/include/thrust/system/cuda/detail/scan.h",
-        "cuda/include/thrust/system/cuda/detail/temporary_indirect_permutation.h",
-        "cuda/include/thrust/system/cuda/detail/extern_shared_ptr.h",
-        "cuda/include/thrust/system/cuda/detail/detail/set_operation.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/balanced_path.h",
-        "cuda/include/thrust/system/cuda/detail/detail/virtualized_smem_closure.h",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_primitive_sort.h",
-        "cuda/include/thrust/system/cuda/detail/detail/set_operation.h",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_primitive_sort.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_merge_sort.h",
-        "cuda/include/thrust/system/cuda/detail/detail/launch_closure.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/merge.h",
-        "cuda/include/thrust/system/cuda/detail/detail/alignment.h",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_radix_sort.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_sort_each.h",
-        "cuda/include/thrust/system/cuda/detail/detail/launch_calculator.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_merge_sort.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/launch_closure.h",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_radix_sort.h",
-        "cuda/include/thrust/system/cuda/detail/detail/uninitialized.h",
-        "cuda/include/thrust/system/cuda/detail/detail/cached_temporary_allocator.h",
-        "cuda/include/thrust/system/cuda/detail/detail/launch_calculator.h",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_sort_each.inl",
-        "cuda/include/thrust/system/cuda/detail/temporary_buffer.h",
-        "cuda/include/thrust/system/cuda/detail/default_decomposition.h",
-        "cuda/include/thrust/system/cuda/detail/reduce.inl",
-        "cuda/include/thrust/system/cuda/detail/scan_by_key.h",
-        "cuda/include/thrust/system/cuda/detail/reverse.h",
-        "cuda/include/thrust/system/cuda/detail/assign_value.h",
-        "cuda/include/thrust/system/cuda/detail/scatter.h",
-        "cuda/include/thrust/system/cuda/detail/reduce_intervals.hpp",
-        "cuda/include/thrust/system/cuda/detail/for_each.inl",
-        "cuda/include/thrust/system/cuda/detail/default_decomposition.inl",
-        "cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h",
-        "cuda/include/thrust/system/cuda/detail/adjacent_difference.inl",
-        "cuda/include/thrust/system/cuda/detail/vector.inl",
-        "cuda/include/thrust/system/cuda/detail/throw_on_error.h",
-        "cuda/include/thrust/system/cuda/detail/find.h",
-        "cuda/include/thrust/system/cuda/detail/terminate.h",
-        "cuda/include/thrust/system/cuda/detail/merge.inl",
-        "cuda/include/thrust/system/cuda/detail/trivial_copy.inl",
-        "cuda/include/thrust/system/cuda/detail/generate.h",
-        "cuda/include/thrust/system/cuda/detail/execute_on_stream.h",
-        "cuda/include/thrust/system/cuda/detail/uninitialized_fill.h",
-        "cuda/include/thrust/system/cuda/detail/remove.h",
-        "cuda/include/thrust/system/cuda/detail/tabulate.h",
-        "cuda/include/thrust/system/cuda/detail/for_each.h",
-        "cuda/include/thrust/system/cuda/detail/reduce_by_key.h",
-        "cuda/include/thrust/system/cuda/detail/decomposition.h",
-        "cuda/include/thrust/system/cuda/detail/reduce.h",
-        "cuda/include/thrust/system/cuda/detail/equal.h",
-        "cuda/include/thrust/system/cuda/detail/runtime_introspection.h",
-        "cuda/include/thrust/system/cuda/detail/copy.inl",
-        "cuda/include/thrust/system/cuda/detail/copy.h",
-        "cuda/include/thrust/system/cuda/detail/swap_ranges.h",
-        "cuda/include/thrust/system/cuda/detail/uninitialized_copy.h",
-        "cuda/include/thrust/system/cuda/detail/binary_search.h",
-        "cuda/include/thrust/system/cuda/detail/runtime_introspection.inl",
-        "cuda/include/thrust/system/cuda/detail/set_operations.h",
-        "cuda/include/thrust/system/cuda/detail/mismatch.h",
-        "cuda/include/thrust/system/cuda/detail/scan.inl",
-        "cuda/include/thrust/system/cuda/detail/synchronize.inl",
-        "cuda/include/thrust/system/cuda/detail/extrema.h",
-        "cuda/include/thrust/system/cuda/detail/set_union.inl",
-        "cuda/include/thrust/system/cuda/detail/set_intersection.inl",
-        "cuda/include/thrust/system/cuda/detail/count.h",
-        "cuda/include/thrust/system/cuda/detail/trivial_copy.h",
-        "cuda/include/thrust/system/cuda/detail/copy_device_to_device.inl",
-        "cuda/include/thrust/system/cuda/detail/replace.h",
-        "cuda/include/thrust/system/cuda/detail/bulk/malloc.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/config.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/closure.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/tail_flags.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/terminate.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/alignment.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/guarded_cuda_runtime_api.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/choose_sizes.inl",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/tuple_meta_transform.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_task.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/head_flags.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/synchronize.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/throw_on_error.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/parameter_ptr.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launcher.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/triple_chevron_launcher.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.inl",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launch_config.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/async.inl",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/tuple_transform.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/pointer_traits.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/apply_from_tuple.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/is_contiguous_iterator.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/iterator.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/choose_sizes.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/copy.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/merge.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/accumulate.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/scan.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/detail/stable_merge_sort.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/gather.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/sort.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/reduce.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/scatter.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/adjacent_difference.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/reduce_by_key.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/for_each.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/bulk.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/execution_policy.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/iterator/strided_iterator.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/uninitialized.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/async.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/future.hpp",
-        "cuda/include/thrust/system/cuda/detail/guarded_driver_types.h",
-        "cuda/include/thrust/system/cuda/detail/get_value.h",
-        "cuda/include/thrust/system/cuda/detail/inner_product.h",
-        "cuda/include/thrust/system/cuda/detail/copy_if.h",
-        "cuda/include/thrust/system/cuda/detail/logical.h",
-        "cuda/include/thrust/system/cuda/detail/iter_swap.h",
-        "cuda/include/thrust/system/cuda/detail/block/merge.h",
-        "cuda/include/thrust/system/cuda/detail/block/inclusive_scan.h",
-        "cuda/include/thrust/system/cuda/detail/block/merge.inl",
-        "cuda/include/thrust/system/cuda/detail/block/merging_sort.h",
-        "cuda/include/thrust/system/cuda/detail/block/exclusive_scan.h",
-        "cuda/include/thrust/system/cuda/detail/block/reduce.h",
-        "cuda/include/thrust/system/cuda/detail/block/copy.h",
-        "cuda/include/thrust/system/cuda/detail/block/odd_even_sort.h",
-        "cuda/include/thrust/system/cuda/detail/par.h",
-        "cuda/include/thrust/system/cuda/detail/copy_cross_system.h",
-        "cuda/include/thrust/system/cuda/detail/reduce_intervals.h",
-        "cuda/include/thrust/system/cuda/detail/malloc_and_free.h",
-        "cuda/include/thrust/system/cuda/detail/fill.h",
-        "cuda/include/thrust/system/cuda/detail/set_difference.inl",
-        "cuda/include/thrust/system/cuda/detail/transform.h",
-        "cuda/include/thrust/system/cuda/experimental/pinned_allocator.h",
-        "cuda/include/thrust/system/cuda/memory.h",
-        "cuda/include/thrust/remove.h",
+        "cuda/include/thrust/system/tbb/detail/transform_reduce.h",
+        "cuda/include/thrust/system/tbb/detail/transform_scan.h",
+        "cuda/include/thrust/system/tbb/detail/uninitialized_copy.h",
+        "cuda/include/thrust/system/tbb/detail/uninitialized_fill.h",
+        "cuda/include/thrust/system/tbb/detail/unique.h",
+        "cuda/include/thrust/system/tbb/detail/unique.inl",
+        "cuda/include/thrust/system/tbb/detail/unique_by_key.h",
+        "cuda/include/thrust/system/tbb/detail/unique_by_key.inl",
+        "cuda/include/thrust/system/tbb/detail/vector.inl",
+        "cuda/include/thrust/system/tbb/execution_policy.h",
+        "cuda/include/thrust/system/tbb/memory.h",
+        "cuda/include/thrust/system/tbb/vector.h",
+        "cuda/include/thrust/system_error.h",
         "cuda/include/thrust/tabulate.h",
-        "cuda/include/thrust/for_each.h",
-        "cuda/include/thrust/distance.h",
-        "cuda/include/thrust/reduce.h",
-        "cuda/include/thrust/equal.h",
-        "cuda/include/thrust/complex.h",
-        "cuda/include/thrust/device_allocator.h",
-        "cuda/include/thrust/copy.h",
+        "cuda/include/thrust/transform.h",
+        "cuda/include/thrust/transform_reduce.h",
+        "cuda/include/thrust/transform_scan.h",
+        "cuda/include/thrust/tuple.h",
         "cuda/include/thrust/uninitialized_copy.h",
-        "cuda/include/thrust/device_reference.h",
-        "cuda/include/thrust/binary_search.h",
-        "cuda/include/thrust/set_operations.h",
-        "cuda/include/thrust/swap.h",
-        "cuda/include/thrust/mismatch.h",
-        "cuda/include/thrust/extrema.h",
-        "cuda/include/thrust/count.h",
-        "cuda/include/thrust/device_free.h",
-        "cuda/include/thrust/random/discard_block_engine.h",
-        "cuda/include/thrust/random/normal_distribution.h",
-        "cuda/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h",
-        "cuda/include/thrust/random/detail/subtract_with_carry_engine.inl",
-        "cuda/include/thrust/random/detail/xor_combine_engine_max.h",
-        "cuda/include/thrust/random/detail/linear_congruential_engine_discard.h",
-        "cuda/include/thrust/random/detail/uniform_int_distribution.inl",
-        "cuda/include/thrust/random/detail/discard_block_engine.inl",
-        "cuda/include/thrust/random/detail/uniform_real_distribution.inl",
-        "cuda/include/thrust/random/detail/random_core_access.h",
-        "cuda/include/thrust/random/detail/mod.h",
-        "cuda/include/thrust/random/detail/linear_feedback_shift_engine.inl",
-        "cuda/include/thrust/random/detail/linear_congruential_engine.inl",
-        "cuda/include/thrust/random/detail/xor_combine_engine.inl",
-        "cuda/include/thrust/random/detail/normal_distribution.inl",
-        "cuda/include/thrust/random/detail/normal_distribution_base.h",
-        "cuda/include/thrust/random/uniform_int_distribution.h",
-        "cuda/include/thrust/random/linear_feedback_shift_engine.h",
-        "cuda/include/thrust/random/xor_combine_engine.h",
-        "cuda/include/thrust/random/subtract_with_carry_engine.h",
-        "cuda/include/thrust/random/linear_congruential_engine.h",
-        "cuda/include/thrust/random/uniform_real_distribution.h",
-        "cuda/include/thrust/functional.h",
-        "cuda/include/thrust/replace.h",
-        "cuda/include/thrust/device_new_allocator.h",
-        "cuda/include/thrust/host_vector.h",
+        "cuda/include/thrust/uninitialized_fill.h",
+        "cuda/include/thrust/unique.h",
         "cuda/include/thrust/version.h",
-        "cuda/include/thrust/inner_product.h",
-        "cuda/include/thrust/iterator/iterator_traits.h",
-        "cuda/include/thrust/iterator/discard_iterator.h",
-        "cuda/include/thrust/iterator/retag.h",
-        "cuda/include/thrust/iterator/permutation_iterator.h",
-        "cuda/include/thrust/iterator/transform_iterator.h",
-        "cuda/include/thrust/iterator/detail/reverse_iterator.inl",
-        "cuda/include/thrust/iterator/detail/zip_iterator.inl",
-        "cuda/include/thrust/iterator/detail/counting_iterator.inl",
-        "cuda/include/thrust/iterator/detail/distance_from_result.h",
-        "cuda/include/thrust/iterator/detail/host_system_tag.h",
-        "cuda/include/thrust/iterator/detail/iterator_traversal_tags.h",
-        "cuda/include/thrust/iterator/detail/retag.h",
-        "cuda/include/thrust/iterator/detail/tagged_iterator.h",
-        "cuda/include/thrust/iterator/detail/iterator_traits.inl",
-        "cuda/include/thrust/iterator/detail/minimum_category.h",
-        "cuda/include/thrust/iterator/detail/discard_iterator_base.h",
-        "cuda/include/thrust/iterator/detail/iterator_category_to_traversal.h",
-        "cuda/include/thrust/iterator/detail/zip_iterator_base.h",
-        "cuda/include/thrust/iterator/detail/normal_iterator.h",
-        "cuda/include/thrust/iterator/detail/join_iterator.h",
-        "cuda/include/thrust/iterator/detail/device_system_tag.h",
-        "cuda/include/thrust/iterator/detail/universal_categories.h",
-        "cuda/include/thrust/iterator/detail/reverse_iterator_base.h",
-        "cuda/include/thrust/iterator/detail/minimum_system.h",
-        "cuda/include/thrust/iterator/detail/tuple_of_iterator_references.h",
-        "cuda/include/thrust/iterator/detail/is_iterator_category.h",
-        "cuda/include/thrust/iterator/detail/permutation_iterator_base.h",
-        "cuda/include/thrust/iterator/detail/any_assign.h",
-        "cuda/include/thrust/iterator/detail/any_system_tag.h",
-        "cuda/include/thrust/iterator/detail/is_trivial_iterator.h",
-        "cuda/include/thrust/iterator/detail/iterator_category_to_system.h",
-        "cuda/include/thrust/iterator/detail/iterator_adaptor_base.h",
-        "cuda/include/thrust/iterator/detail/constant_iterator_base.h",
-        "cuda/include/thrust/iterator/detail/transform_iterator.inl",
-        "cuda/include/thrust/iterator/detail/iterator_facade_category.h",
-        "cuda/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h",
-        "cuda/include/thrust/iterator/constant_iterator.h",
-        "cuda/include/thrust/iterator/counting_iterator.h",
-        "cuda/include/thrust/iterator/iterator_adaptor.h",
-        "cuda/include/thrust/iterator/iterator_facade.h",
-        "cuda/include/thrust/iterator/iterator_categories.h",
-        "cuda/include/thrust/iterator/reverse_iterator.h",
-        "cuda/include/thrust/iterator/zip_iterator.h",
-        "cuda/include/thrust/logical.h",
-        "cuda/include/thrust/tuple.h",
-        "cuda/include/thrust/memory.h",
-        "cuda/include/thrust/random.h",
-        "cuda/include/thrust/fill.h",
-        "cuda/include/thrust/transform.h",
-        "cuda/include/texture_types.h",
-        "cuda/include/nppversion.h",
-        "cuda/include/cuda_texture_types.h",
-        "cuda/include/fatbinary.h",
-        "cuda/include/cublasXt.h",
-        "cuda/include/cuda_fp16.h",
         "cuda/include/vector_functions.h",
-        "cuda/include/cusparse.h",
-        "cuda/include/nppi_filtering_functions.h",
-        "cuda/include/nppi_morphological_operations.h",
-        "cuda/include/sobol_direction_vectors.h",
-        "cuda/include/nvblas.h",
-        "cuda/include/curand_mtgp32dc_p_11213.h",
-        "cuda/include/nvcuvid.h",
-        "cuda/include/cuda_runtime_api.h",
-        "cuda/include/curand_mtgp32_kernel.h",
-        "cuda/include/cublas_v2.h",
-        "cuda/include/builtin_types.h",
-        "cuda/include/nppi_geometry_transforms.h",
-        "cuda/include/npps_support_functions.h",
-        "cuda/include/cufftw.h",
-        "cuda/include/cuda_device_runtime_api.h",
-        "cuda/include/sm_30_intrinsics.hpp",
+        "cuda/include/vector_functions.hpp",
         "cuda/include/vector_types.h",
-        "cuda/include/sm_35_atomic_functions.h",
-        "cuda/include/sm_20_intrinsics.h",
-        "cuda/include/driver_types.h",
-        "cuda/include/nvToolsExtCudaRt.h",
-        "cuda/include/curand_globals.h",
-        "cuda/include/device_atomic_functions.h",
-        "cuda/include/surface_types.h",
-        "cuda/include/nvrtc.h",
-        "cuda/include/nppdefs.h",
-        "cuda/include/sm_60_atomic_functions.h",
-        "cuda/include/driver_functions.h",
-        "cuda/include/cusolver_common.h",
-        "cuda/include/cublas.h",
-        "cuda/include/curand_lognormal.h",
-        "cuda/include/device_atomic_functions.hpp",
-        "cuda/include/crt/device_runtime.h",
-        "cuda/include/crt/storage_class.h",
-        "cuda/include/crt/func_macro.h",
-        "cuda/include/crt/host_runtime.h",
-        "cuda/include/nppi_arithmetic_and_logical_operations.h",
-        "cuda/include/npps_arithmetic_and_logical_operations.h",
-        "cuda/include/nppi_computer_vision.h",
-        "cuda/include/surface_functions.hpp",
-        "cuda/include/surface_functions.h",
-        "cuda/include/curand_normal_static.h",
-        "cuda/include/curand.h",
-        "cuda/include/math_functions_dbl_ptx3.h",
-        "cuda/include/curand_philox4x32_x.h",
-        "cuda/include/nppi_threshold_and_compare_operations.h",
-        "cuda/include/nvml.h",
-        "cuda/include/npps.h",
-        "cuda/include/cuda_vdpau_interop.h",
-        "cuda/include/sm_61_intrinsics.hpp",
-        "cuda/include/cublas_api.h",
-        "cuda/include/nppi_color_conversion.h",
-        "cuda/include/math_functions_dbl_ptx3.hpp",
-        "cuda/include/nppcore.h",
-        "cuda/include/cudaGL.h",
-        "cuda/include/fatBinaryCtl.h",
-        "cuda/include/npps_statistics_functions.h",
-        "cuda/include/cudaVDPAU.h",
-        "cuda/include/curand_poisson.h",
-        "cuda/include/cusolverDn.h",
-        "cuda/include/cuda_profiler_api.h",
-        "cuda/include/sm_20_atomic_functions.h",
-        "cuda/include/nvfunctional",
     ],
     cmd = """
-if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-8.0/include/math_functions.hpp" "$(@D)/cuda/include/math_functions.hpp" && cp "/usr/local/cuda-8.0/include/cufft.h" "$(@D)/cuda/include/cufft.h" && cp "/usr/local/cuda-8.0/include/nvgraph.h" "$(@D)/cuda/include/nvgraph.h" && cp "/usr/local/cuda-8.0/include/curand_normal.h" "$(@D)/cuda/include/curand_normal.h" && cp "/usr/local/cuda-8.0/include/curand_uniform.h" "$(@D)/cuda/include/curand_uniform.h" && cp "/usr/local/cuda-8.0/include/nppi_data_exchange_and_initialization.h" "$(@D)/cuda/include/nppi_data_exchange_and_initialization.h" && cp "/usr/local/cuda-8.0/include/cuda_gl_interop.h" "$(@D)/cuda/include/cuda_gl_interop.h" && cp "/usr/local/cuda-8.0/include/nppi_compression_functions.h" "$(@D)/cuda/include/nppi_compression_functions.h" && cp "/usr/local/cuda-8.0/include/npp.h" "$(@D)/cuda/include/npp.h" && cp "/usr/local/cuda-8.0/include/cuda.h" "$(@D)/cuda/include/cuda.h" && cp "/usr/local/cuda-8.0/include/nppi_statistics_functions.h" "$(@D)/cuda/include/nppi_statistics_functions.h" && cp "/usr/local/cuda-8.0/include/vector_functions.hpp" "$(@D)/cuda/include/vector_functions.hpp" && cp "/usr/local/cuda-8.0/include/sm_32_intrinsics.hpp" "$(@D)/cuda/include/sm_32_intrinsics.hpp" && cp "/usr/local/cuda-8.0/include/sm_32_intrinsics.h" "$(@D)/cuda/include/sm_32_intrinsics.h" && cp "/usr/local/cuda-8.0/include/curand_discrete.h" "$(@D)/cuda/include/curand_discrete.h" && cp "/usr/local/cuda-8.0/include/cuda_runtime.h" "$(@D)/cuda/include/cuda_runtime.h" && cp "/usr/local/cuda-8.0/include/cufftXt.h" "$(@D)/cuda/include/cufftXt.h" && cp "/usr/local/cuda-8.0/include/sm_61_intrinsics.h" "$(@D)/cuda/include/sm_61_intrinsics.h" && cp "/usr/local/cuda-8.0/include/texture_fetch_functions.h" 
"$(@D)/cuda/include/texture_fetch_functions.h" && cp "/usr/local/cuda-8.0/include/curand_mrg32k3a.h" "$(@D)/cuda/include/curand_mrg32k3a.h" && cp "/usr/local/cuda-8.0/include/host_defines.h" "$(@D)/cuda/include/host_defines.h" && cp "/usr/local/cuda-8.0/include/common_functions.h" "$(@D)/cuda/include/common_functions.h" && cp "/usr/local/cuda-8.0/include/nppi_support_functions.h" "$(@D)/cuda/include/nppi_support_functions.h" && cp "/usr/local/cuda-8.0/include/nppi_linear_transforms.h" "$(@D)/cuda/include/nppi_linear_transforms.h" && cp "/usr/local/cuda-8.0/include/device_double_functions.hpp" "$(@D)/cuda/include/device_double_functions.hpp" && cp "/usr/local/cuda-8.0/include/math_constants.h" "$(@D)/cuda/include/math_constants.h" && cp "/usr/local/cuda-8.0/include/nvToolsExtSync.h" "$(@D)/cuda/include/nvToolsExtSync.h" && cp "/usr/local/cuda-8.0/include/npps_initialization.h" "$(@D)/cuda/include/npps_initialization.h" && cp "/usr/local/cuda-8.0/include/cusolverSp_LOWLEVEL_PREVIEW.h" "$(@D)/cuda/include/cusolverSp_LOWLEVEL_PREVIEW.h" && cp "/usr/local/cuda-8.0/include/texture_indirect_functions.hpp" "$(@D)/cuda/include/texture_indirect_functions.hpp" && cp "/usr/local/cuda-8.0/include/cudaProfiler.h" "$(@D)/cuda/include/cudaProfiler.h" && cp "/usr/local/cuda-8.0/include/npps_filtering_functions.h" "$(@D)/cuda/include/npps_filtering_functions.h" && cp "/usr/local/cuda-8.0/include/cusparse_v2.h" "$(@D)/cuda/include/cusparse_v2.h" && cp "/usr/local/cuda-8.0/include/nppi.h" "$(@D)/cuda/include/nppi.h" && cp "/usr/local/cuda-8.0/include/surface_indirect_functions.h" "$(@D)/cuda/include/surface_indirect_functions.h" && cp "/usr/local/cuda-8.0/include/sm_30_intrinsics.h" "$(@D)/cuda/include/sm_30_intrinsics.h" && cp "/usr/local/cuda-8.0/include/device_double_functions.h" "$(@D)/cuda/include/device_double_functions.h" && cp "/usr/local/cuda-8.0/include/sm_35_intrinsics.h" "$(@D)/cuda/include/sm_35_intrinsics.h" && cp "/usr/local/cuda-8.0/include/cusolverSp.h" 
"$(@D)/cuda/include/cusolverSp.h" && cp "/usr/local/cuda-8.0/include/library_types.h" "$(@D)/cuda/include/library_types.h" && cp "/usr/local/cuda-8.0/include/surface_indirect_functions.hpp" "$(@D)/cuda/include/surface_indirect_functions.hpp" && cp "/usr/local/cuda-8.0/include/cudalibxt.h" "$(@D)/cuda/include/cudalibxt.h" && cp "/usr/local/cuda-8.0/include/channel_descriptor.h" "$(@D)/cuda/include/channel_descriptor.h" && cp "/usr/local/cuda-8.0/include/device_functions_decls.h" "$(@D)/cuda/include/device_functions_decls.h" && cp "/usr/local/cuda-8.0/include/curand_kernel.h" "$(@D)/cuda/include/curand_kernel.h" && cp "/usr/local/cuda-8.0/include/curand_mtgp32_host.h" "$(@D)/cuda/include/curand_mtgp32_host.h" && cp "/usr/local/cuda-8.0/include/nvToolsExtCuda.h" "$(@D)/cuda/include/nvToolsExtCuda.h" && cp "/usr/local/cuda-8.0/include/nvToolsExt.h" "$(@D)/cuda/include/nvToolsExt.h" && cp "/usr/local/cuda-8.0/include/cuComplex.h" "$(@D)/cuda/include/cuComplex.h" && cp "/usr/local/cuda-8.0/include/sm_32_atomic_functions.h" "$(@D)/cuda/include/sm_32_atomic_functions.h" && cp "/usr/local/cuda-8.0/include/texture_indirect_functions.h" "$(@D)/cuda/include/texture_indirect_functions.h" && cp "/usr/local/cuda-8.0/include/sm_32_atomic_functions.hpp" "$(@D)/cuda/include/sm_32_atomic_functions.hpp" && cp "/usr/local/cuda-8.0/include/sm_20_intrinsics.hpp" "$(@D)/cuda/include/sm_20_intrinsics.hpp" && cp "/usr/local/cuda-8.0/include/device_launch_parameters.h" "$(@D)/cuda/include/device_launch_parameters.h" && cp "/usr/local/cuda-8.0/include/curand_mtgp32.h" "$(@D)/cuda/include/curand_mtgp32.h" && cp "/usr/local/cuda-8.0/include/texture_fetch_functions.hpp" "$(@D)/cuda/include/texture_fetch_functions.hpp" && cp "/usr/local/cuda-8.0/include/cuda_occupancy.h" "$(@D)/cuda/include/cuda_occupancy.h" && cp "/usr/local/cuda-8.0/include/CL/opencl.h" "$(@D)/cuda/include/CL/opencl.h" && cp "/usr/local/cuda-8.0/include/CL/cl_platform.h" "$(@D)/cuda/include/CL/cl_platform.h" && cp 
"/usr/local/cuda-8.0/include/CL/cl_egl.h" "$(@D)/cuda/include/CL/cl_egl.h" && cp "/usr/local/cuda-8.0/include/CL/cl_gl.h" "$(@D)/cuda/include/CL/cl_gl.h" && cp "/usr/local/cuda-8.0/include/CL/cl.h" "$(@D)/cuda/include/CL/cl.h" && cp "/usr/local/cuda-8.0/include/CL/cl_gl_ext.h" "$(@D)/cuda/include/CL/cl_gl_ext.h" && cp "/usr/local/cuda-8.0/include/CL/cl_ext.h" "$(@D)/cuda/include/CL/cl_ext.h" && cp "/usr/local/cuda-8.0/include/CL/cl.hpp" "$(@D)/cuda/include/CL/cl.hpp" && cp "/usr/local/cuda-8.0/include/host_config.h" "$(@D)/cuda/include/host_config.h" && cp "/usr/local/cuda-8.0/include/cuda_surface_types.h" "$(@D)/cuda/include/cuda_surface_types.h" && cp "/usr/local/cuda-8.0/include/math_functions.h" "$(@D)/cuda/include/math_functions.h" && cp "/usr/local/cuda-8.0/include/nvToolsExtMeta.h" "$(@D)/cuda/include/nvToolsExtMeta.h" && cp "/usr/local/cuda-8.0/include/sm_20_atomic_functions.hpp" "$(@D)/cuda/include/sm_20_atomic_functions.hpp" && cp "/usr/local/cuda-8.0/include/device_functions.h" "$(@D)/cuda/include/device_functions.h" && cp "/usr/local/cuda-8.0/include/device_types.h" "$(@D)/cuda/include/device_types.h" && cp "/usr/local/cuda-8.0/include/npps_conversion_functions.h" "$(@D)/cuda/include/npps_conversion_functions.h" && cp "/usr/local/cuda-8.0/include/curand_precalc.h" "$(@D)/cuda/include/curand_precalc.h" && cp "/usr/local/cuda-8.0/include/cusolverRf.h" "$(@D)/cuda/include/cusolverRf.h" && cp "/usr/local/cuda-8.0/include/sm_60_atomic_functions.hpp" "$(@D)/cuda/include/sm_60_atomic_functions.hpp" && cp "/usr/local/cuda-8.0/include/cuviddec.h" "$(@D)/cuda/include/cuviddec.h" && cp "/usr/local/cuda-8.0/include/curand_discrete2.h" "$(@D)/cuda/include/curand_discrete2.h" && cp "/usr/local/cuda-8.0/include/device_functions.hpp" "$(@D)/cuda/include/device_functions.hpp" && cp "/usr/local/cuda-8.0/include/thrust/transform_scan.h" "$(@D)/cuda/include/thrust/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system_error.h" 
"$(@D)/cuda/include/thrust/system_error.h" && cp "/usr/local/cuda-8.0/include/thrust/device_malloc.h" "$(@D)/cuda/include/thrust/device_malloc.h" && cp "/usr/local/cuda-8.0/include/thrust/partition.h" "$(@D)/cuda/include/thrust/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/unique.h" "$(@D)/cuda/include/thrust/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/device_delete.h" "$(@D)/cuda/include/thrust/device_delete.h" && cp "/usr/local/cuda-8.0/include/thrust/execution_policy.h" "$(@D)/cuda/include/thrust/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/adjacent_difference.h" "$(@D)/cuda/include/thrust/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/sequence.h" "$(@D)/cuda/include/thrust/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/merge.h" "$(@D)/cuda/include/thrust/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/device_new.h" "$(@D)/cuda/include/thrust/device_new.h" && cp "/usr/local/cuda-8.0/include/thrust/transform_reduce.h" "$(@D)/cuda/include/thrust/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/device_vector.h" "$(@D)/cuda/include/thrust/device_vector.h" && cp "/usr/local/cuda-8.0/include/thrust/gather.h" "$(@D)/cuda/include/thrust/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/sort.h" "$(@D)/cuda/include/thrust/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/scan.h" "$(@D)/cuda/include/thrust/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/temporary_array.h" "$(@D)/cuda/include/thrust/detail/temporary_array.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/util/align.h" "$(@D)/cuda/include/thrust/detail/util/align.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/util/blocking.h" "$(@D)/cuda/include/thrust/detail/util/blocking.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/transform.inl" "$(@D)/cuda/include/thrust/detail/transform.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_vector.inl" "$(@D)/cuda/include/thrust/detail/device_vector.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/binary_search.inl" "$(@D)/cuda/include/thrust/detail/binary_search.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/overlapped_copy.h" "$(@D)/cuda/include/thrust/detail/overlapped_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/vector_base.inl" "$(@D)/cuda/include/thrust/detail/vector_base.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_reference.inl" "$(@D)/cuda/include/thrust/detail/device_reference.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/actor.h" "$(@D)/cuda/include/thrust/detail/functional/actor.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/value.h" "$(@D)/cuda/include/thrust/detail/functional/value.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/logical_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/logical_operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/relational_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/relational_operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/assignment_operator.h" "$(@D)/cuda/include/thrust/detail/functional/operators/assignment_operator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/bitwise_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/bitwise_operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/operator_adaptors.h" "$(@D)/cuda/include/thrust/detail/functional/operators/operator_adaptors.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/arithmetic_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/arithmetic_operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/compound_assignment_operators.h" 
"$(@D)/cuda/include/thrust/detail/functional/operators/compound_assignment_operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/argument.h" "$(@D)/cuda/include/thrust/detail/functional/argument.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/placeholder.h" "$(@D)/cuda/include/thrust/detail/functional/placeholder.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/actor.inl" "$(@D)/cuda/include/thrust/detail/functional/actor.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/composite.h" "$(@D)/cuda/include/thrust/detail/functional/composite.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/static_map.h" "$(@D)/cuda/include/thrust/detail/static_map.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/has_nested_type.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_nested_type.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/is_call_possible.h" "$(@D)/cuda/include/thrust/detail/type_traits/is_call_possible.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/function_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits/function_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/pointer_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits/pointer_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/has_member_function.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_member_function.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h" "$(@D)/cuda/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/minimum_type.h" "$(@D)/cuda/include/thrust/detail/type_traits/minimum_type.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/has_trivial_assign.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_trivial_assign.h" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/type_traits/is_metafunction_defined.h" "$(@D)/cuda/include/thrust/detail/type_traits/is_metafunction_defined.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/iterator/is_discard_iterator.h" "$(@D)/cuda/include/thrust/detail/type_traits/iterator/is_discard_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/iterator/is_output_iterator.h" "$(@D)/cuda/include/thrust/detail/type_traits/iterator/is_output_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/result_of_adaptable_function.h" "$(@D)/cuda/include/thrust/detail/type_traits/result_of_adaptable_function.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/reference.h" "$(@D)/cuda/include/thrust/detail/reference.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/inner_product.inl" "$(@D)/cuda/include/thrust/detail/inner_product.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/use_default.h" "$(@D)/cuda/include/thrust/detail/use_default.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/sequence.inl" "$(@D)/cuda/include/thrust/detail/sequence.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/sort.inl" "$(@D)/cuda/include/thrust/detail/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/equal.inl" "$(@D)/cuda/include/thrust/detail/equal.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/execution_policy.h" "$(@D)/cuda/include/thrust/detail/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/integer_traits.h" "$(@D)/cuda/include/thrust/detail/integer_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/reverse.inl" "$(@D)/cuda/include/thrust/detail/reverse.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/tabulate.inl" "$(@D)/cuda/include/thrust/detail/tabulate.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/unique.inl" 
"$(@D)/cuda/include/thrust/detail/unique.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/scatter.inl" "$(@D)/cuda/include/thrust/detail/scatter.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/set_operations.inl" "$(@D)/cuda/include/thrust/detail/set_operations.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_malloc.inl" "$(@D)/cuda/include/thrust/detail/device_malloc.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/copy_if.inl" "$(@D)/cuda/include/thrust/detail/copy_if.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/fill.inl" "$(@D)/cuda/include/thrust/detail/fill.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/temporary_array.inl" "$(@D)/cuda/include/thrust/detail/temporary_array.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/transform_scan.inl" "$(@D)/cuda/include/thrust/detail/transform_scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/minmax.h" "$(@D)/cuda/include/thrust/detail/minmax.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/swap.inl" "$(@D)/cuda/include/thrust/detail/swap.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/pointer.inl" "$(@D)/cuda/include/thrust/detail/pointer.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/transform_reduce.inl" "$(@D)/cuda/include/thrust/detail/transform_reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/config.h" "$(@D)/cuda/include/thrust/detail/config.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/distance.inl" "$(@D)/cuda/include/thrust/detail/distance.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/pair.inl" "$(@D)/cuda/include/thrust/detail/pair.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/temporary_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/temporary_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/tagged_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/tagged_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/destroy_range.inl" 
"$(@D)/cuda/include/thrust/detail/allocator/destroy_range.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/destroy_range.h" "$(@D)/cuda/include/thrust/detail/allocator/destroy_range.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/no_throw_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/no_throw_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/default_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/default_construct_range.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/fill_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/fill_construct_range.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/tagged_allocator.inl" "$(@D)/cuda/include/thrust/detail/allocator/tagged_allocator.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/malloc_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/malloc_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/allocator_traits.h" "$(@D)/cuda/include/thrust/detail/allocator/allocator_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/copy_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/copy_construct_range.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/allocator_traits.inl" "$(@D)/cuda/include/thrust/detail/allocator/allocator_traits.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/default_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/default_construct_range.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/copy_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/copy_construct_range.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/malloc_allocator.inl" "$(@D)/cuda/include/thrust/detail/allocator/malloc_allocator.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/temporary_allocator.inl" 
"$(@D)/cuda/include/thrust/detail/allocator/temporary_allocator.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/fill_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/fill_construct_range.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/detail/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/reduce.inl" "$(@D)/cuda/include/thrust/detail/reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_new.inl" "$(@D)/cuda/include/thrust/detail/device_new.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/pointer.h" "$(@D)/cuda/include/thrust/detail/pointer.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/for_each.inl" "$(@D)/cuda/include/thrust/detail/for_each.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/generate.inl" "$(@D)/cuda/include/thrust/detail/generate.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/dispatch/is_trivial_copy.h" "$(@D)/cuda/include/thrust/detail/dispatch/is_trivial_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/adjacent_difference.inl" "$(@D)/cuda/include/thrust/detail/adjacent_difference.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/tuple_meta_transform.h" "$(@D)/cuda/include/thrust/detail/tuple_meta_transform.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional.inl" "$(@D)/cuda/include/thrust/detail/functional.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/remove.inl" "$(@D)/cuda/include/thrust/detail/remove.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/tuple_transform.h" "$(@D)/cuda/include/thrust/detail/tuple_transform.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/merge.inl" "$(@D)/cuda/include/thrust/detail/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/extrema.inl" "$(@D)/cuda/include/thrust/detail/extrema.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/trivial_sequence.h" "$(@D)/cuda/include/thrust/detail/trivial_sequence.h" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/vector_base.h" "$(@D)/cuda/include/thrust/detail/vector_base.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/count.inl" "$(@D)/cuda/include/thrust/detail/count.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/uninitialized_copy.inl" "$(@D)/cuda/include/thrust/detail/uninitialized_copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/function.h" "$(@D)/cuda/include/thrust/detail/function.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/swap_ranges.inl" "$(@D)/cuda/include/thrust/detail/swap_ranges.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_delete.inl" "$(@D)/cuda/include/thrust/detail/device_delete.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/static_assert.h" "$(@D)/cuda/include/thrust/detail/static_assert.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/logical.inl" "$(@D)/cuda/include/thrust/detail/logical.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/seq.h" "$(@D)/cuda/include/thrust/detail/seq.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/mpl/math.h" "$(@D)/cuda/include/thrust/detail/mpl/math.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/mismatch.inl" "$(@D)/cuda/include/thrust/detail/mismatch.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/internal_functional.h" "$(@D)/cuda/include/thrust/detail/internal_functional.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/get_iterator_value.h" "$(@D)/cuda/include/thrust/detail/get_iterator_value.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/copy.inl" "$(@D)/cuda/include/thrust/detail/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/copy.h" "$(@D)/cuda/include/thrust/detail/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/catrigf.h" "$(@D)/cuda/include/thrust/detail/complex/catrigf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/cpowf.h" "$(@D)/cuda/include/thrust/detail/complex/cpowf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/csqrtf.h" 
"$(@D)/cuda/include/thrust/detail/complex/csqrtf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/ccoshf.h" "$(@D)/cuda/include/thrust/detail/complex/ccoshf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/csinhf.h" "$(@D)/cuda/include/thrust/detail/complex/csinhf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/clogf.h" "$(@D)/cuda/include/thrust/detail/complex/clogf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/ccosh.h" "$(@D)/cuda/include/thrust/detail/complex/ccosh.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/arithmetic.h" "$(@D)/cuda/include/thrust/detail/complex/arithmetic.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/csqrt.h" "$(@D)/cuda/include/thrust/detail/complex/csqrt.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/cpow.h" "$(@D)/cuda/include/thrust/detail/complex/cpow.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/complex.inl" "$(@D)/cuda/include/thrust/detail/complex/complex.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/math_private.h" "$(@D)/cuda/include/thrust/detail/complex/math_private.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/c99math.h" "$(@D)/cuda/include/thrust/detail/complex/c99math.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/cproj.h" "$(@D)/cuda/include/thrust/detail/complex/cproj.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/catrig.h" "$(@D)/cuda/include/thrust/detail/complex/catrig.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/ctanhf.h" "$(@D)/cuda/include/thrust/detail/complex/ctanhf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/cexpf.h" "$(@D)/cuda/include/thrust/detail/complex/cexpf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/csinh.h" "$(@D)/cuda/include/thrust/detail/complex/csinh.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/stream.h" "$(@D)/cuda/include/thrust/detail/complex/stream.h" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/complex/ctanh.h" "$(@D)/cuda/include/thrust/detail/complex/ctanh.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/cexp.h" "$(@D)/cuda/include/thrust/detail/complex/cexp.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/clog.h" "$(@D)/cuda/include/thrust/detail/complex/clog.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/range/head_flags.h" "$(@D)/cuda/include/thrust/detail/range/head_flags.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/range/tail_flags.h" "$(@D)/cuda/include/thrust/detail/range/tail_flags.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/execute_with_allocator.h" "$(@D)/cuda/include/thrust/detail/execute_with_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/integer_math.h" "$(@D)/cuda/include/thrust/detail/integer_math.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/swap.h" "$(@D)/cuda/include/thrust/detail/swap.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/uninitialized_fill.inl" "$(@D)/cuda/include/thrust/detail/uninitialized_fill.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/scan.inl" "$(@D)/cuda/include/thrust/detail/scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/gather.inl" "$(@D)/cuda/include/thrust/detail/gather.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/reference_forward_declaration.h" "$(@D)/cuda/include/thrust/detail/reference_forward_declaration.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/numeric_traits.h" "$(@D)/cuda/include/thrust/detail/numeric_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/reference.inl" "$(@D)/cuda/include/thrust/detail/reference.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/cstdint.h" "$(@D)/cuda/include/thrust/detail/cstdint.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_free.inl" "$(@D)/cuda/include/thrust/detail/device_free.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/copy_if.h" "$(@D)/cuda/include/thrust/detail/copy_if.h" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/partition.inl" "$(@D)/cuda/include/thrust/detail/partition.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/find.inl" "$(@D)/cuda/include/thrust/detail/find.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/forceinline.h" "$(@D)/cuda/include/thrust/detail/config/forceinline.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/debug.h" "$(@D)/cuda/include/thrust/detail/config/debug.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/config.h" "$(@D)/cuda/include/thrust/detail/config/config.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/host_device.h" "$(@D)/cuda/include/thrust/detail/config/host_device.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/host_system.h" "$(@D)/cuda/include/thrust/detail/config/host_system.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/compiler.h" "$(@D)/cuda/include/thrust/detail/config/compiler.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/device_system.h" "$(@D)/cuda/include/thrust/detail/config/device_system.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/compiler_fence.h" "$(@D)/cuda/include/thrust/detail/config/compiler_fence.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/exec_check_disable.h" "$(@D)/cuda/include/thrust/detail/config/exec_check_disable.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/simple_defines.h" "$(@D)/cuda/include/thrust/detail/config/simple_defines.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/global_workarounds.h" "$(@D)/cuda/include/thrust/detail/config/global_workarounds.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/replace.inl" "$(@D)/cuda/include/thrust/detail/replace.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_ptr.inl" "$(@D)/cuda/include/thrust/detail/device_ptr.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/tuple.inl" "$(@D)/cuda/include/thrust/detail/tuple.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/detail/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/host_vector.inl" "$(@D)/cuda/include/thrust/detail/host_vector.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/raw_pointer_cast.h" "$(@D)/cuda/include/thrust/detail/raw_pointer_cast.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/advance.inl" "$(@D)/cuda/include/thrust/detail/advance.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/contiguous_storage.h" "$(@D)/cuda/include/thrust/detail/contiguous_storage.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/raw_reference_cast.h" "$(@D)/cuda/include/thrust/detail/raw_reference_cast.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/contiguous_storage.inl" "$(@D)/cuda/include/thrust/detail/contiguous_storage.inl" && cp "/usr/local/cuda-8.0/include/thrust/reverse.h" "$(@D)/cuda/include/thrust/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/device_malloc_allocator.h" "$(@D)/cuda/include/thrust/device_malloc_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/scatter.h" "$(@D)/cuda/include/thrust/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/pair.h" "$(@D)/cuda/include/thrust/pair.h" && cp "/usr/local/cuda-8.0/include/thrust/advance.h" "$(@D)/cuda/include/thrust/advance.h" && cp "/usr/local/cuda-8.0/include/thrust/find.h" "$(@D)/cuda/include/thrust/find.h" && cp "/usr/local/cuda-8.0/include/thrust/device_ptr.h" "$(@D)/cuda/include/thrust/device_ptr.h" && cp "/usr/local/cuda-8.0/include/thrust/generate.h" "$(@D)/cuda/include/thrust/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/uninitialized_fill.h" "$(@D)/cuda/include/thrust/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/system_error.h" "$(@D)/cuda/include/thrust/system/system_error.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/bad_alloc.h" "$(@D)/cuda/include/thrust/system/detail/bad_alloc.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/adl/transform_scan.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/partition.h" "$(@D)/cuda/include/thrust/system/detail/adl/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/unique.h" "$(@D)/cuda/include/thrust/system/detail/adl/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/detail/adl/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/sequence.h" "$(@D)/cuda/include/thrust/system/detail/adl/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/merge.h" "$(@D)/cuda/include/thrust/system/detail/adl/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/gather.h" "$(@D)/cuda/include/thrust/system/detail/adl/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/sort.h" "$(@D)/cuda/include/thrust/system/detail/adl/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/scan.h" "$(@D)/cuda/include/thrust/system/detail/adl/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/adl/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/reverse.h" "$(@D)/cuda/include/thrust/system/detail/adl/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/assign_value.h" "$(@D)/cuda/include/thrust/system/detail/adl/assign_value.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/adl/scatter.h" "$(@D)/cuda/include/thrust/system/detail/adl/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/find.h" "$(@D)/cuda/include/thrust/system/detail/adl/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/generate.h" "$(@D)/cuda/include/thrust/system/detail/adl/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/detail/adl/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/remove.h" "$(@D)/cuda/include/thrust/system/detail/adl/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/adl/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/for_each.h" "$(@D)/cuda/include/thrust/system/detail/adl/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/reduce.h" "$(@D)/cuda/include/thrust/system/detail/adl/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/equal.h" "$(@D)/cuda/include/thrust/system/detail/adl/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/copy.h" "$(@D)/cuda/include/thrust/system/detail/adl/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/adl/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/detail/adl/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/adl/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/adl/set_operations.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/adl/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/adl/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/extrema.h" "$(@D)/cuda/include/thrust/system/detail/adl/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/count.h" "$(@D)/cuda/include/thrust/system/detail/adl/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/replace.h" "$(@D)/cuda/include/thrust/system/detail/adl/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/get_value.h" "$(@D)/cuda/include/thrust/system/detail/adl/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/adl/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/adl/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/logical.h" "$(@D)/cuda/include/thrust/system/detail/adl/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/iter_swap.h" "$(@D)/cuda/include/thrust/system/detail/adl/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/detail/adl/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/fill.h" "$(@D)/cuda/include/thrust/system/detail/adl/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/transform.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/errno.h" "$(@D)/cuda/include/thrust/system/detail/errno.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/error_category.inl" "$(@D)/cuda/include/thrust/system/detail/error_category.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/transform_scan.h" "$(@D)/cuda/include/thrust/system/detail/sequential/transform_scan.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/sequential/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_primitive_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_primitive_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_primitive_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_primitive_sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_merge_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_merge_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/partition.h" "$(@D)/cuda/include/thrust/system/detail/sequential/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/unique.h" "$(@D)/cuda/include/thrust/system/detail/sequential/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/execution_policy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/detail/sequential/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/sequence.h" "$(@D)/cuda/include/thrust/system/detail/sequential/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/merge.h" "$(@D)/cuda/include/thrust/system/detail/sequential/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/sequential/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/gather.h" "$(@D)/cuda/include/thrust/system/detail/sequential/gather.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/sequential/sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/copy_backward.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy_backward.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_radix_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_radix_sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/scan.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/sequential/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/reverse.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/assign_value.h" "$(@D)/cuda/include/thrust/system/detail/sequential/assign_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/scatter.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/find.h" "$(@D)/cuda/include/thrust/system/detail/sequential/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_merge_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_merge_sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/merge.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/generate.h" "$(@D)/cuda/include/thrust/system/detail/sequential/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/uninitialized_fill.h" 
"$(@D)/cuda/include/thrust/system/detail/sequential/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/general_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/general_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/insertion_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/insertion_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/remove.h" "$(@D)/cuda/include/thrust/system/detail/sequential/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/sequential/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/for_each.h" "$(@D)/cuda/include/thrust/system/detail/sequential/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/reduce.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/equal.h" "$(@D)/cuda/include/thrust/system/detail/sequential/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_radix_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_radix_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/copy.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/sequential/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/uninitialized_copy.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/sequential/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/sequential/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/sequential/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/sequential/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/extrema.h" "$(@D)/cuda/include/thrust/system/detail/sequential/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/count.h" "$(@D)/cuda/include/thrust/system/detail/sequential/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/trivial_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/trivial_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/replace.h" "$(@D)/cuda/include/thrust/system/detail/sequential/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/get_value.h" "$(@D)/cuda/include/thrust/system/detail/sequential/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/sequential/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/logical.h" "$(@D)/cuda/include/thrust/system/detail/sequential/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/iter_swap.h" "$(@D)/cuda/include/thrust/system/detail/sequential/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/detail/sequential/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/fill.h" 
"$(@D)/cuda/include/thrust/system/detail/sequential/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/transform.h" "$(@D)/cuda/include/thrust/system/detail/sequential/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/error_condition.inl" "$(@D)/cuda/include/thrust/system/detail/error_condition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/internal/decompose.h" "$(@D)/cuda/include/thrust/system/detail/internal/decompose.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/error_code.inl" "$(@D)/cuda/include/thrust/system/detail/error_code.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform_scan.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/memory.inl" "$(@D)/cuda/include/thrust/system/detail/generic/memory.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/binary_search.inl" "$(@D)/cuda/include/thrust/system/detail/generic/binary_search.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scan_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scan_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/inner_product.inl" "$(@D)/cuda/include/thrust/system/detail/generic/inner_product.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/select_system.h" "$(@D)/cuda/include/thrust/system/detail/generic/select_system.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/sequence.inl" "$(@D)/cuda/include/thrust/system/detail/generic/sequence.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/generic/sort.inl" "$(@D)/cuda/include/thrust/system/detail/generic/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/equal.inl" "$(@D)/cuda/include/thrust/system/detail/generic/equal.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/partition.h" "$(@D)/cuda/include/thrust/system/detail/generic/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/unique.h" "$(@D)/cuda/include/thrust/system/detail/generic/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/detail/generic/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/tag.h" "$(@D)/cuda/include/thrust/system/detail/generic/tag.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/unique_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/sequence.h" "$(@D)/cuda/include/thrust/system/detail/generic/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/type_traits.h" "$(@D)/cuda/include/thrust/system/detail/generic/type_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/merge.h" "$(@D)/cuda/include/thrust/system/detail/generic/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reverse.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reverse.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/tabulate.inl" "$(@D)/cuda/include/thrust/system/detail/generic/tabulate.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/unique.inl" "$(@D)/cuda/include/thrust/system/detail/generic/unique.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scatter.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scatter.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/generic/set_operations.inl" "$(@D)/cuda/include/thrust/system/detail/generic/set_operations.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/copy_if.inl" "$(@D)/cuda/include/thrust/system/detail/generic/copy_if.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform_scan.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform_scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/gather.h" "$(@D)/cuda/include/thrust/system/detail/generic/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reduce_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform_reduce.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform_reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/sort.h" "$(@D)/cuda/include/thrust/system/detail/generic/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/distance.inl" "$(@D)/cuda/include/thrust/system/detail/generic/distance.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scan.h" "$(@D)/cuda/include/thrust/system/detail/generic/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/generic/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reduce.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reverse.h" "$(@D)/cuda/include/thrust/system/detail/generic/reverse.h" 
&& cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/temporary_buffer.inl" "$(@D)/cuda/include/thrust/system/detail/generic/temporary_buffer.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scatter.h" "$(@D)/cuda/include/thrust/system/detail/generic/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/generate.inl" "$(@D)/cuda/include/thrust/system/detail/generic/generate.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/adjacent_difference.inl" "$(@D)/cuda/include/thrust/system/detail/generic/adjacent_difference.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/remove.inl" "$(@D)/cuda/include/thrust/system/detail/generic/remove.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/advance.h" "$(@D)/cuda/include/thrust/system/detail/generic/advance.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/find.h" "$(@D)/cuda/include/thrust/system/detail/generic/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/merge.inl" "$(@D)/cuda/include/thrust/system/detail/generic/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scalar/binary_search.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scalar/binary_search.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scalar/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/generic/scalar/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/extrema.inl" "$(@D)/cuda/include/thrust/system/detail/generic/extrema.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/generate.h" "$(@D)/cuda/include/thrust/system/detail/generic/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/count.inl" 
"$(@D)/cuda/include/thrust/system/detail/generic/count.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/remove.h" "$(@D)/cuda/include/thrust/system/detail/generic/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/uninitialized_copy.inl" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/generic/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/for_each.h" "$(@D)/cuda/include/thrust/system/detail/generic/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/distance.h" "$(@D)/cuda/include/thrust/system/detail/generic/distance.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/swap_ranges.inl" "$(@D)/cuda/include/thrust/system/detail/generic/swap_ranges.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reduce.h" "$(@D)/cuda/include/thrust/system/detail/generic/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/equal.h" "$(@D)/cuda/include/thrust/system/detail/generic/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/mismatch.inl" "$(@D)/cuda/include/thrust/system/detail/generic/mismatch.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/copy.inl" "$(@D)/cuda/include/thrust/system/detail/generic/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/copy.h" "$(@D)/cuda/include/thrust/system/detail/generic/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/generic/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/uninitialized_copy.h" 
"$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/generic/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/generic/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/uninitialized_fill.inl" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_fill.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/generic/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scan.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/gather.inl" "$(@D)/cuda/include/thrust/system/detail/generic/gather.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/extrema.h" "$(@D)/cuda/include/thrust/system/detail/generic/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/count.h" "$(@D)/cuda/include/thrust/system/detail/generic/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/replace.h" "$(@D)/cuda/include/thrust/system/detail/generic/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/generic/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/generic/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/logical.h" "$(@D)/cuda/include/thrust/system/detail/generic/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/partition.inl" "$(@D)/cuda/include/thrust/system/detail/generic/partition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/memory.h" 
"$(@D)/cuda/include/thrust/system/detail/generic/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/find.inl" "$(@D)/cuda/include/thrust/system/detail/generic/find.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/replace.inl" "$(@D)/cuda/include/thrust/system/detail/generic/replace.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/advance.inl" "$(@D)/cuda/include/thrust/system/detail/generic/advance.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/fill.h" "$(@D)/cuda/include/thrust/system/detail/generic/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/system_error.inl" "$(@D)/cuda/include/thrust/system/detail/system_error.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/execution_policy.h" "$(@D)/cuda/include/thrust/system/omp/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/vector.h" "$(@D)/cuda/include/thrust/system/omp/vector.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/omp/detail/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/memory.inl" "$(@D)/cuda/include/thrust/system/omp/detail/memory.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce_intervals.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_intervals.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/sort.inl" "$(@D)/cuda/include/thrust/system/omp/detail/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/partition.h" "$(@D)/cuda/include/thrust/system/omp/detail/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/unique.h" 
"$(@D)/cuda/include/thrust/system/omp/detail/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/omp/detail/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/omp/detail/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/omp/detail/unique_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/sequence.h" "$(@D)/cuda/include/thrust/system/omp/detail/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/merge.h" "$(@D)/cuda/include/thrust/system/omp/detail/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/unique.inl" "$(@D)/cuda/include/thrust/system/omp/detail/unique.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/copy_if.inl" "$(@D)/cuda/include/thrust/system/omp/detail/copy_if.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/omp/detail/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/gather.h" "$(@D)/cuda/include/thrust/system/omp/detail/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/sort.h" "$(@D)/cuda/include/thrust/system/omp/detail/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/scan.h" "$(@D)/cuda/include/thrust/system/omp/detail/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/omp/detail/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/default_decomposition.h" "$(@D)/cuda/include/thrust/system/omp/detail/default_decomposition.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/reverse.h" "$(@D)/cuda/include/thrust/system/omp/detail/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/omp/detail/assign_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/scatter.h" "$(@D)/cuda/include/thrust/system/omp/detail/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/for_each.inl" "$(@D)/cuda/include/thrust/system/omp/detail/for_each.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/default_decomposition.inl" "$(@D)/cuda/include/thrust/system/omp/detail/default_decomposition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/remove.inl" "$(@D)/cuda/include/thrust/system/omp/detail/remove.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/vector.inl" "$(@D)/cuda/include/thrust/system/omp/detail/vector.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/find.h" "$(@D)/cuda/include/thrust/system/omp/detail/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/generate.h" "$(@D)/cuda/include/thrust/system/omp/detail/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/omp/detail/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/remove.h" "$(@D)/cuda/include/thrust/system/omp/detail/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/omp/detail/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/for_each.h" "$(@D)/cuda/include/thrust/system/omp/detail/for_each.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/equal.h" "$(@D)/cuda/include/thrust/system/omp/detail/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/copy.inl" "$(@D)/cuda/include/thrust/system/omp/detail/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/copy.h" "$(@D)/cuda/include/thrust/system/omp/detail/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/omp/detail/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/omp/detail/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/omp/detail/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/omp/detail/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/omp/detail/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/extrema.h" "$(@D)/cuda/include/thrust/system/omp/detail/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/count.h" "$(@D)/cuda/include/thrust/system/omp/detail/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/replace.h" "$(@D)/cuda/include/thrust/system/omp/detail/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/get_value.h" "$(@D)/cuda/include/thrust/system/omp/detail/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/omp/detail/inner_product.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/omp/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/omp/detail/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/logical.h" "$(@D)/cuda/include/thrust/system/omp/detail/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/partition.inl" "$(@D)/cuda/include/thrust/system/omp/detail/partition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/omp/detail/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/par.h" "$(@D)/cuda/include/thrust/system/omp/detail/par.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce_intervals.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_intervals.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/omp/detail/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/fill.h" "$(@D)/cuda/include/thrust/system/omp/detail/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/transform.h" "$(@D)/cuda/include/thrust/system/omp/detail/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/memory.h" "$(@D)/cuda/include/thrust/system/omp/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/execution_policy.h" "$(@D)/cuda/include/thrust/system/tbb/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/vector.h" "$(@D)/cuda/include/thrust/system/tbb/vector.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/memory.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/memory.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/unique_by_key.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/tbb/detail/sort.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/partition.h" "$(@D)/cuda/include/thrust/system/tbb/detail/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/unique.h" "$(@D)/cuda/include/thrust/system/tbb/detail/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/tbb/detail/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/unique_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/sequence.h" "$(@D)/cuda/include/thrust/system/tbb/detail/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/merge.h" "$(@D)/cuda/include/thrust/system/tbb/detail/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/unique.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/unique.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/copy_if.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/copy_if.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/gather.h" "$(@D)/cuda/include/thrust/system/tbb/detail/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/sort.h" "$(@D)/cuda/include/thrust/system/tbb/detail/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/scan.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scan.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/tbb/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/tbb/detail/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reduce.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reverse.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/tbb/detail/assign_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/scatter.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/for_each.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/for_each.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/remove.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/remove.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/vector.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/vector.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/find.h" "$(@D)/cuda/include/thrust/system/tbb/detail/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/merge.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/generate.h" "$(@D)/cuda/include/thrust/system/tbb/detail/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/tbb/detail/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/remove.h" "$(@D)/cuda/include/thrust/system/tbb/detail/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/tbb/detail/tabulate.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/tbb/detail/for_each.h" "$(@D)/cuda/include/thrust/system/tbb/detail/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reduce.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/equal.h" "$(@D)/cuda/include/thrust/system/tbb/detail/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/copy.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/copy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/tbb/detail/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/tbb/detail/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/tbb/detail/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/tbb/detail/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/scan.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/extrema.h" "$(@D)/cuda/include/thrust/system/tbb/detail/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/count.h" "$(@D)/cuda/include/thrust/system/tbb/detail/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/replace.h" "$(@D)/cuda/include/thrust/system/tbb/detail/replace.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/tbb/detail/get_value.h" "$(@D)/cuda/include/thrust/system/tbb/detail/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/tbb/detail/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/tbb/detail/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/logical.h" "$(@D)/cuda/include/thrust/system/tbb/detail/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/partition.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/partition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/tbb/detail/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/par.h" "$(@D)/cuda/include/thrust/system/tbb/detail/par.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reduce_intervals.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce_intervals.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/tbb/detail/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/fill.h" "$(@D)/cuda/include/thrust/system/tbb/detail/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/transform.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/memory.h" "$(@D)/cuda/include/thrust/system/tbb/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/system/error_code.h" "$(@D)/cuda/include/thrust/system/error_code.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/execution_policy.h" "$(@D)/cuda/include/thrust/system/cpp/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/vector.h" "$(@D)/cuda/include/thrust/system/cpp/vector.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/transform_scan.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/memory.inl" "$(@D)/cuda/include/thrust/system/cpp/detail/memory.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/partition.h" "$(@D)/cuda/include/thrust/system/cpp/detail/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/unique.h" "$(@D)/cuda/include/thrust/system/cpp/detail/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/cpp/detail/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/cpp/detail/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/sequence.h" "$(@D)/cuda/include/thrust/system/cpp/detail/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/merge.h" "$(@D)/cuda/include/thrust/system/cpp/detail/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/cpp/detail/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/gather.h" "$(@D)/cuda/include/thrust/system/cpp/detail/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/sort.h" "$(@D)/cuda/include/thrust/system/cpp/detail/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/scan.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/cpp/detail/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/reverse.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/cpp/detail/assign_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/scatter.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/vector.inl" "$(@D)/cuda/include/thrust/system/cpp/detail/vector.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/find.h" "$(@D)/cuda/include/thrust/system/cpp/detail/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/generate.h" "$(@D)/cuda/include/thrust/system/cpp/detail/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/cpp/detail/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/remove.h" "$(@D)/cuda/include/thrust/system/cpp/detail/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/cpp/detail/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/for_each.h" "$(@D)/cuda/include/thrust/system/cpp/detail/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/reduce.h" "$(@D)/cuda/include/thrust/system/cpp/detail/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/equal.h" "$(@D)/cuda/include/thrust/system/cpp/detail/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/copy.h" "$(@D)/cuda/include/thrust/system/cpp/detail/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/cpp/detail/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/uninitialized_copy.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/cpp/detail/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/cpp/detail/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/cpp/detail/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/extrema.h" "$(@D)/cuda/include/thrust/system/cpp/detail/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/count.h" "$(@D)/cuda/include/thrust/system/cpp/detail/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/replace.h" "$(@D)/cuda/include/thrust/system/cpp/detail/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/get_value.h" "$(@D)/cuda/include/thrust/system/cpp/detail/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/cpp/detail/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/cpp/detail/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/logical.h" "$(@D)/cuda/include/thrust/system/cpp/detail/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/cpp/detail/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/par.h" "$(@D)/cuda/include/thrust/system/cpp/detail/par.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/cpp/detail/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/fill.h" "$(@D)/cuda/include/thrust/system/cpp/detail/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/transform.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/memory.h" "$(@D)/cuda/include/thrust/system/cpp/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/execution_policy.h" "$(@D)/cuda/include/thrust/system/cuda/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/vector.h" "$(@D)/cuda/include/thrust/system/cuda/vector.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/error.h" "$(@D)/cuda/include/thrust/system/cuda/error.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_device_to_device.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_device_to_device.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/memory.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/memory.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_allocator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_allocator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_device.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_device.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_partition.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_partition.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_rle_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_rle_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_histogram_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_histogram_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_by_key_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_by_key_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_scan_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_scan_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_select_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_select_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_radix_sort_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_radix_sort_dispatch.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_scan.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_select.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_select.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_histo.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_histo.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_scan.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_downsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_downsweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_upsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_upsweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_satomic.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_satomic.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_sort.cuh" 
"$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_sort.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_gatomic.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_gatomic.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_select.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_select.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_scan_prefix_operators.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_scan_prefix_operators.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce_by_key.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce_by_key.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_macro.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_macro.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_namespace.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_namespace.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_upsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_upsweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_histogram_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_histogram_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_rle_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_rle_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_select_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_select_sweep.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_satomic_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_satomic_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_sort_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_sort_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_gatomic_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_gatomic_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_downsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_downsweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_by_key_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_by_key_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_prefix_operators.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_prefix_operators.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_type.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_type.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/host/spinlock.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/host/spinlock.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_ptx.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_ptx.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_debug.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_debug.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/cub.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/cub.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_scan.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_load.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_load.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_shift.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_shift.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_store.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_store.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_arch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_arch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce_intervals.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_intervals.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_cross_system.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_cross_system.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/unique_by_key.h" 
"$(@D)/cuda/include/thrust/system/cuda/detail/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk.h" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/sort.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/partition.h" "$(@D)/cuda/include/thrust/system/cuda/detail/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/unique.h" "$(@D)/cuda/include/thrust/system/cuda/detail/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cuda_launch_config.h" "$(@D)/cuda/include/thrust/system/cuda/detail/cuda_launch_config.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub.h" "$(@D)/cuda/include/thrust/system/cuda/detail/cub.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/cuda/detail/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/sequence.h" "$(@D)/cuda/include/thrust/system/cuda/detail/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/merge.h" "$(@D)/cuda/include/thrust/system/cuda/detail/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/set_symmetric_difference.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/set_symmetric_difference.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_if.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_if.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/error.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/error.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/gather.h" "$(@D)/cuda/include/thrust/system/cuda/detail/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/synchronize.h" "$(@D)/cuda/include/thrust/system/cuda/detail/synchronize.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/temporary_indirect_permutation.h" "$(@D)/cuda/include/thrust/system/cuda/detail/temporary_indirect_permutation.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/extern_shared_ptr.h" "$(@D)/cuda/include/thrust/system/cuda/detail/extern_shared_ptr.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/set_operation.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/set_operation.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/balanced_path.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/balanced_path.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/virtualized_smem_closure.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/virtualized_smem_closure.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_primitive_sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_primitive_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/set_operation.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/set_operation.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_primitive_sort.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_primitive_sort.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_merge_sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_merge_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/launch_closure.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/launch_closure.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/merge.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/alignment.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/alignment.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_radix_sort.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_radix_sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_sort_each.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_sort_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/launch_calculator.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/launch_calculator.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_merge_sort.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_merge_sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/launch_closure.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/launch_closure.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_radix_sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_radix_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/uninitialized.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/uninitialized.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/cached_temporary_allocator.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/cached_temporary_allocator.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/launch_calculator.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/launch_calculator.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_sort_each.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_sort_each.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/cuda/detail/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/default_decomposition.h" "$(@D)/cuda/include/thrust/system/cuda/detail/default_decomposition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reverse.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/cuda/detail/assign_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/scatter.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce_intervals.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_intervals.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/for_each.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/for_each.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/default_decomposition.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/default_decomposition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h" "$(@D)/cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/adjacent_difference.inl" 
"$(@D)/cuda/include/thrust/system/cuda/detail/adjacent_difference.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/vector.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/vector.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/throw_on_error.h" "$(@D)/cuda/include/thrust/system/cuda/detail/throw_on_error.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/find.h" "$(@D)/cuda/include/thrust/system/cuda/detail/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/terminate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/terminate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/merge.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/trivial_copy.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/trivial_copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/generate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/execute_on_stream.h" "$(@D)/cuda/include/thrust/system/cuda/detail/execute_on_stream.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/cuda/detail/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/remove.h" "$(@D)/cuda/include/thrust/system/cuda/detail/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/for_each.h" "$(@D)/cuda/include/thrust/system/cuda/detail/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/decomposition.h" "$(@D)/cuda/include/thrust/system/cuda/detail/decomposition.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/equal.h" "$(@D)/cuda/include/thrust/system/cuda/detail/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/runtime_introspection.h" "$(@D)/cuda/include/thrust/system/cuda/detail/runtime_introspection.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/cuda/detail/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/cuda/detail/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/runtime_introspection.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/runtime_introspection.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/cuda/detail/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/cuda/detail/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/scan.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/synchronize.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/synchronize.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/extrema.h" "$(@D)/cuda/include/thrust/system/cuda/detail/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/set_union.inl" 
"$(@D)/cuda/include/thrust/system/cuda/detail/set_union.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/set_intersection.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/set_intersection.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/count.h" "$(@D)/cuda/include/thrust/system/cuda/detail/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/trivial_copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/trivial_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_device_to_device.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_device_to_device.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/replace.h" "$(@D)/cuda/include/thrust/system/cuda/detail/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/malloc.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/malloc.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/config.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/config.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/closure.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/closure.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/tail_flags.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/tail_flags.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/terminate.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/terminate.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/alignment.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/alignment.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/guarded_cuda_runtime_api.hpp" 
"$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/guarded_cuda_runtime_api.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/choose_sizes.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/choose_sizes.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/tuple_meta_transform.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/tuple_meta_transform.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_task.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_task.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/head_flags.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/head_flags.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/synchronize.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/synchronize.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/throw_on_error.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/throw_on_error.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/parameter_ptr.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/parameter_ptr.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launcher.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launcher.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/triple_chevron_launcher.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/triple_chevron_launcher.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launch_config.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launch_config.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/async.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/async.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/tuple_transform.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/tuple_transform.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/pointer_traits.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/pointer_traits.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/apply_from_tuple.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/apply_from_tuple.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/is_contiguous_iterator.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/is_contiguous_iterator.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/iterator.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/iterator.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/choose_sizes.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/choose_sizes.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/copy.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/copy.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/merge.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/merge.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/accumulate.hpp" 
"$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/accumulate.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/scan.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/scan.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/detail/stable_merge_sort.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/detail/stable_merge_sort.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/gather.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/gather.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/sort.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/sort.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/reduce.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/reduce.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/scatter.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/scatter.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/adjacent_difference.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/adjacent_difference.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/reduce_by_key.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/reduce_by_key.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/for_each.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/for_each.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/bulk.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/bulk.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/execution_policy.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/execution_policy.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/iterator/strided_iterator.hpp" 
"$(@D)/cuda/include/thrust/system/cuda/detail/bulk/iterator/strided_iterator.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/uninitialized.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/uninitialized.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/async.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/async.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/future.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/future.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/guarded_driver_types.h" "$(@D)/cuda/include/thrust/system/cuda/detail/guarded_driver_types.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/get_value.h" "$(@D)/cuda/include/thrust/system/cuda/detail/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/cuda/detail/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/logical.h" "$(@D)/cuda/include/thrust/system/cuda/detail/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/cuda/detail/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/merge.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/inclusive_scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/inclusive_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/merge.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/block/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/merging_sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/merging_sort.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/exclusive_scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/exclusive_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/reduce.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/odd_even_sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/odd_even_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/par.h" "$(@D)/cuda/include/thrust/system/cuda/detail/par.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_cross_system.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_cross_system.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce_intervals.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_intervals.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/cuda/detail/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/fill.h" "$(@D)/cuda/include/thrust/system/cuda/detail/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/set_difference.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/set_difference.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/transform.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/experimental/pinned_allocator.h" "$(@D)/cuda/include/thrust/system/cuda/experimental/pinned_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/memory.h" "$(@D)/cuda/include/thrust/system/cuda/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/remove.h" "$(@D)/cuda/include/thrust/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/tabulate.h" "$(@D)/cuda/include/thrust/tabulate.h" && 
cp "/usr/local/cuda-8.0/include/thrust/for_each.h" "$(@D)/cuda/include/thrust/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/distance.h" "$(@D)/cuda/include/thrust/distance.h" && cp "/usr/local/cuda-8.0/include/thrust/reduce.h" "$(@D)/cuda/include/thrust/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/equal.h" "$(@D)/cuda/include/thrust/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/complex.h" "$(@D)/cuda/include/thrust/complex.h" && cp "/usr/local/cuda-8.0/include/thrust/device_allocator.h" "$(@D)/cuda/include/thrust/device_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/copy.h" "$(@D)/cuda/include/thrust/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/uninitialized_copy.h" "$(@D)/cuda/include/thrust/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/device_reference.h" "$(@D)/cuda/include/thrust/device_reference.h" && cp "/usr/local/cuda-8.0/include/thrust/binary_search.h" "$(@D)/cuda/include/thrust/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/set_operations.h" "$(@D)/cuda/include/thrust/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/swap.h" "$(@D)/cuda/include/thrust/swap.h" && cp "/usr/local/cuda-8.0/include/thrust/mismatch.h" "$(@D)/cuda/include/thrust/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/extrema.h" "$(@D)/cuda/include/thrust/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/count.h" "$(@D)/cuda/include/thrust/count.h" && cp "/usr/local/cuda-8.0/include/thrust/device_free.h" "$(@D)/cuda/include/thrust/device_free.h" && cp "/usr/local/cuda-8.0/include/thrust/random/discard_block_engine.h" "$(@D)/cuda/include/thrust/random/discard_block_engine.h" && cp "/usr/local/cuda-8.0/include/thrust/random/normal_distribution.h" "$(@D)/cuda/include/thrust/random/normal_distribution.h" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h" "$(@D)/cuda/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h" && cp 
"/usr/local/cuda-8.0/include/thrust/random/detail/subtract_with_carry_engine.inl" "$(@D)/cuda/include/thrust/random/detail/subtract_with_carry_engine.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/xor_combine_engine_max.h" "$(@D)/cuda/include/thrust/random/detail/xor_combine_engine_max.h" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/linear_congruential_engine_discard.h" "$(@D)/cuda/include/thrust/random/detail/linear_congruential_engine_discard.h" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/uniform_int_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/uniform_int_distribution.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/discard_block_engine.inl" "$(@D)/cuda/include/thrust/random/detail/discard_block_engine.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/uniform_real_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/uniform_real_distribution.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/random_core_access.h" "$(@D)/cuda/include/thrust/random/detail/random_core_access.h" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/mod.h" "$(@D)/cuda/include/thrust/random/detail/mod.h" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/linear_feedback_shift_engine.inl" "$(@D)/cuda/include/thrust/random/detail/linear_feedback_shift_engine.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/linear_congruential_engine.inl" "$(@D)/cuda/include/thrust/random/detail/linear_congruential_engine.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/xor_combine_engine.inl" "$(@D)/cuda/include/thrust/random/detail/xor_combine_engine.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/normal_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/normal_distribution.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/normal_distribution_base.h" "$(@D)/cuda/include/thrust/random/detail/normal_distribution_base.h" && cp 
"/usr/local/cuda-8.0/include/thrust/random/uniform_int_distribution.h" "$(@D)/cuda/include/thrust/random/uniform_int_distribution.h" && cp "/usr/local/cuda-8.0/include/thrust/random/linear_feedback_shift_engine.h" "$(@D)/cuda/include/thrust/random/linear_feedback_shift_engine.h" && cp "/usr/local/cuda-8.0/include/thrust/random/xor_combine_engine.h" "$(@D)/cuda/include/thrust/random/xor_combine_engine.h" && cp "/usr/local/cuda-8.0/include/thrust/random/subtract_with_carry_engine.h" "$(@D)/cuda/include/thrust/random/subtract_with_carry_engine.h" && cp "/usr/local/cuda-8.0/include/thrust/random/linear_congruential_engine.h" "$(@D)/cuda/include/thrust/random/linear_congruential_engine.h" && cp "/usr/local/cuda-8.0/include/thrust/random/uniform_real_distribution.h" "$(@D)/cuda/include/thrust/random/uniform_real_distribution.h" && cp "/usr/local/cuda-8.0/include/thrust/functional.h" "$(@D)/cuda/include/thrust/functional.h" && cp "/usr/local/cuda-8.0/include/thrust/replace.h" "$(@D)/cuda/include/thrust/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/device_new_allocator.h" "$(@D)/cuda/include/thrust/device_new_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/host_vector.h" "$(@D)/cuda/include/thrust/host_vector.h" && cp "/usr/local/cuda-8.0/include/thrust/version.h" "$(@D)/cuda/include/thrust/version.h" && cp "/usr/local/cuda-8.0/include/thrust/inner_product.h" "$(@D)/cuda/include/thrust/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/iterator_traits.h" "$(@D)/cuda/include/thrust/iterator/iterator_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/discard_iterator.h" "$(@D)/cuda/include/thrust/iterator/discard_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/retag.h" "$(@D)/cuda/include/thrust/iterator/retag.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/permutation_iterator.h" "$(@D)/cuda/include/thrust/iterator/permutation_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/transform_iterator.h" 
"$(@D)/cuda/include/thrust/iterator/transform_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/reverse_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/reverse_iterator.inl" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/zip_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/zip_iterator.inl" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/counting_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/counting_iterator.inl" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/distance_from_result.h" "$(@D)/cuda/include/thrust/iterator/detail/distance_from_result.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/host_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/host_system_tag.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_traversal_tags.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_traversal_tags.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/retag.h" "$(@D)/cuda/include/thrust/iterator/detail/retag.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/tagged_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/tagged_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_traits.inl" "$(@D)/cuda/include/thrust/iterator/detail/iterator_traits.inl" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/minimum_category.h" "$(@D)/cuda/include/thrust/iterator/detail/minimum_category.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/discard_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/discard_iterator_base.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_category_to_traversal.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_to_traversal.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/zip_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/zip_iterator_base.h" && cp 
"/usr/local/cuda-8.0/include/thrust/iterator/detail/normal_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/normal_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/join_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/join_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/device_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/device_system_tag.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/universal_categories.h" "$(@D)/cuda/include/thrust/iterator/detail/universal_categories.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/reverse_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/reverse_iterator_base.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/minimum_system.h" "$(@D)/cuda/include/thrust/iterator/detail/minimum_system.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/tuple_of_iterator_references.h" "$(@D)/cuda/include/thrust/iterator/detail/tuple_of_iterator_references.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/is_iterator_category.h" "$(@D)/cuda/include/thrust/iterator/detail/is_iterator_category.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/permutation_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/permutation_iterator_base.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/any_assign.h" "$(@D)/cuda/include/thrust/iterator/detail/any_assign.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/any_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/any_system_tag.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/is_trivial_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/is_trivial_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_category_to_system.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_to_system.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_adaptor_base.h" 
"$(@D)/cuda/include/thrust/iterator/detail/iterator_adaptor_base.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/constant_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/constant_iterator_base.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/transform_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/transform_iterator.inl" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_facade_category.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_facade_category.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/constant_iterator.h" "$(@D)/cuda/include/thrust/iterator/constant_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/counting_iterator.h" "$(@D)/cuda/include/thrust/iterator/counting_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/iterator_adaptor.h" "$(@D)/cuda/include/thrust/iterator/iterator_adaptor.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/iterator_facade.h" "$(@D)/cuda/include/thrust/iterator/iterator_facade.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/iterator_categories.h" "$(@D)/cuda/include/thrust/iterator/iterator_categories.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/reverse_iterator.h" "$(@D)/cuda/include/thrust/iterator/reverse_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/zip_iterator.h" "$(@D)/cuda/include/thrust/iterator/zip_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/logical.h" "$(@D)/cuda/include/thrust/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/tuple.h" "$(@D)/cuda/include/thrust/tuple.h" && cp "/usr/local/cuda-8.0/include/thrust/memory.h" "$(@D)/cuda/include/thrust/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/random.h" "$(@D)/cuda/include/thrust/random.h" && cp 
"/usr/local/cuda-8.0/include/thrust/fill.h" "$(@D)/cuda/include/thrust/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/transform.h" "$(@D)/cuda/include/thrust/transform.h" && cp "/usr/local/cuda-8.0/include/texture_types.h" "$(@D)/cuda/include/texture_types.h" && cp "/usr/local/cuda-8.0/include/nppversion.h" "$(@D)/cuda/include/nppversion.h" && cp "/usr/local/cuda-8.0/include/cuda_texture_types.h" "$(@D)/cuda/include/cuda_texture_types.h" && cp "/usr/local/cuda-8.0/include/fatbinary.h" "$(@D)/cuda/include/fatbinary.h" && cp "/usr/local/cuda-8.0/include/cublasXt.h" "$(@D)/cuda/include/cublasXt.h" && cp "/usr/local/cuda-8.0/include/cuda_fp16.h" "$(@D)/cuda/include/cuda_fp16.h" && cp "/usr/local/cuda-8.0/include/vector_functions.h" "$(@D)/cuda/include/vector_functions.h" && cp "/usr/local/cuda-8.0/include/cusparse.h" "$(@D)/cuda/include/cusparse.h" && cp "/usr/local/cuda-8.0/include/nppi_filtering_functions.h" "$(@D)/cuda/include/nppi_filtering_functions.h" && cp "/usr/local/cuda-8.0/include/nppi_morphological_operations.h" "$(@D)/cuda/include/nppi_morphological_operations.h" && cp "/usr/local/cuda-8.0/include/sobol_direction_vectors.h" "$(@D)/cuda/include/sobol_direction_vectors.h" && cp "/usr/local/cuda-8.0/include/nvblas.h" "$(@D)/cuda/include/nvblas.h" && cp "/usr/local/cuda-8.0/include/curand_mtgp32dc_p_11213.h" "$(@D)/cuda/include/curand_mtgp32dc_p_11213.h" && cp "/usr/local/cuda-8.0/include/nvcuvid.h" "$(@D)/cuda/include/nvcuvid.h" && cp "/usr/local/cuda-8.0/include/cuda_runtime_api.h" "$(@D)/cuda/include/cuda_runtime_api.h" && cp "/usr/local/cuda-8.0/include/curand_mtgp32_kernel.h" "$(@D)/cuda/include/curand_mtgp32_kernel.h" && cp "/usr/local/cuda-8.0/include/cublas_v2.h" "$(@D)/cuda/include/cublas_v2.h" && cp "/usr/local/cuda-8.0/include/builtin_types.h" "$(@D)/cuda/include/builtin_types.h" && cp "/usr/local/cuda-8.0/include/nppi_geometry_transforms.h" "$(@D)/cuda/include/nppi_geometry_transforms.h" && cp 
"/usr/local/cuda-8.0/include/npps_support_functions.h" "$(@D)/cuda/include/npps_support_functions.h" && cp "/usr/local/cuda-8.0/include/cufftw.h" "$(@D)/cuda/include/cufftw.h" && cp "/usr/local/cuda-8.0/include/cuda_device_runtime_api.h" "$(@D)/cuda/include/cuda_device_runtime_api.h" && cp "/usr/local/cuda-8.0/include/sm_30_intrinsics.hpp" "$(@D)/cuda/include/sm_30_intrinsics.hpp" && cp "/usr/local/cuda-8.0/include/vector_types.h" "$(@D)/cuda/include/vector_types.h" && cp "/usr/local/cuda-8.0/include/sm_35_atomic_functions.h" "$(@D)/cuda/include/sm_35_atomic_functions.h" && cp "/usr/local/cuda-8.0/include/sm_20_intrinsics.h" "$(@D)/cuda/include/sm_20_intrinsics.h" && cp "/usr/local/cuda-8.0/include/driver_types.h" "$(@D)/cuda/include/driver_types.h" && cp "/usr/local/cuda-8.0/include/nvToolsExtCudaRt.h" "$(@D)/cuda/include/nvToolsExtCudaRt.h" && cp "/usr/local/cuda-8.0/include/curand_globals.h" "$(@D)/cuda/include/curand_globals.h" && cp "/usr/local/cuda-8.0/include/device_atomic_functions.h" "$(@D)/cuda/include/device_atomic_functions.h" && cp "/usr/local/cuda-8.0/include/surface_types.h" "$(@D)/cuda/include/surface_types.h" && cp "/usr/local/cuda-8.0/include/nvrtc.h" "$(@D)/cuda/include/nvrtc.h" && cp "/usr/local/cuda-8.0/include/nppdefs.h" "$(@D)/cuda/include/nppdefs.h" && cp "/usr/local/cuda-8.0/include/sm_60_atomic_functions.h" "$(@D)/cuda/include/sm_60_atomic_functions.h" && cp "/usr/local/cuda-8.0/include/driver_functions.h" "$(@D)/cuda/include/driver_functions.h" && cp "/usr/local/cuda-8.0/include/cusolver_common.h" "$(@D)/cuda/include/cusolver_common.h" && cp "/usr/local/cuda-8.0/include/cublas.h" "$(@D)/cuda/include/cublas.h" && cp "/usr/local/cuda-8.0/include/curand_lognormal.h" "$(@D)/cuda/include/curand_lognormal.h" && cp "/usr/local/cuda-8.0/include/device_atomic_functions.hpp" "$(@D)/cuda/include/device_atomic_functions.hpp" && cp "/usr/local/cuda-8.0/include/crt/device_runtime.h" "$(@D)/cuda/include/crt/device_runtime.h" && cp 
"/usr/local/cuda-8.0/include/crt/storage_class.h" "$(@D)/cuda/include/crt/storage_class.h" && cp "/usr/local/cuda-8.0/include/crt/func_macro.h" "$(@D)/cuda/include/crt/func_macro.h" && cp "/usr/local/cuda-8.0/include/crt/host_runtime.h" "$(@D)/cuda/include/crt/host_runtime.h" && cp "/usr/local/cuda-8.0/include/nppi_arithmetic_and_logical_operations.h" "$(@D)/cuda/include/nppi_arithmetic_and_logical_operations.h" && cp "/usr/local/cuda-8.0/include/npps_arithmetic_and_logical_operations.h" "$(@D)/cuda/include/npps_arithmetic_and_logical_operations.h" && cp "/usr/local/cuda-8.0/include/nppi_computer_vision.h" "$(@D)/cuda/include/nppi_computer_vision.h" && cp "/usr/local/cuda-8.0/include/surface_functions.hpp" "$(@D)/cuda/include/surface_functions.hpp" && cp "/usr/local/cuda-8.0/include/surface_functions.h" "$(@D)/cuda/include/surface_functions.h" && cp "/usr/local/cuda-8.0/include/curand_normal_static.h" "$(@D)/cuda/include/curand_normal_static.h" && cp "/usr/local/cuda-8.0/include/curand.h" "$(@D)/cuda/include/curand.h" && cp "/usr/local/cuda-8.0/include/math_functions_dbl_ptx3.h" "$(@D)/cuda/include/math_functions_dbl_ptx3.h" && cp "/usr/local/cuda-8.0/include/curand_philox4x32_x.h" "$(@D)/cuda/include/curand_philox4x32_x.h" && cp "/usr/local/cuda-8.0/include/nppi_threshold_and_compare_operations.h" "$(@D)/cuda/include/nppi_threshold_and_compare_operations.h" && cp "/usr/local/cuda-8.0/include/nvml.h" "$(@D)/cuda/include/nvml.h" && cp "/usr/local/cuda-8.0/include/npps.h" "$(@D)/cuda/include/npps.h" && cp "/usr/local/cuda-8.0/include/cuda_vdpau_interop.h" "$(@D)/cuda/include/cuda_vdpau_interop.h" && cp "/usr/local/cuda-8.0/include/sm_61_intrinsics.hpp" "$(@D)/cuda/include/sm_61_intrinsics.hpp" && cp "/usr/local/cuda-8.0/include/cublas_api.h" "$(@D)/cuda/include/cublas_api.h" && cp "/usr/local/cuda-8.0/include/nppi_color_conversion.h" "$(@D)/cuda/include/nppi_color_conversion.h" && cp "/usr/local/cuda-8.0/include/math_functions_dbl_ptx3.hpp" 
"$(@D)/cuda/include/math_functions_dbl_ptx3.hpp" && cp "/usr/local/cuda-8.0/include/nppcore.h" "$(@D)/cuda/include/nppcore.h" && cp "/usr/local/cuda-8.0/include/cudaGL.h" "$(@D)/cuda/include/cudaGL.h" && cp "/usr/local/cuda-8.0/include/fatBinaryCtl.h" "$(@D)/cuda/include/fatBinaryCtl.h" && cp "/usr/local/cuda-8.0/include/npps_statistics_functions.h" "$(@D)/cuda/include/npps_statistics_functions.h" && cp "/usr/local/cuda-8.0/include/cudaVDPAU.h" "$(@D)/cuda/include/cudaVDPAU.h" && cp "/usr/local/cuda-8.0/include/curand_poisson.h" "$(@D)/cuda/include/curand_poisson.h" && cp "/usr/local/cuda-8.0/include/cusolverDn.h" "$(@D)/cuda/include/cusolverDn.h" && cp "/usr/local/cuda-8.0/include/cuda_profiler_api.h" "$(@D)/cuda/include/cuda_profiler_api.h" && cp "/usr/local/cuda-8.0/include/sm_20_atomic_functions.h" "$(@D)/cuda/include/sm_20_atomic_functions.h" && cp "/usr/local/cuda-8.0/include/nvfunctional" "$(@D)/cuda/include/nvfunctional"
+if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-9.0/include/CL/cl.h" "$(@D)/cuda/include/CL/cl.h" && cp "/usr/local/cuda-9.0/include/CL/cl.hpp" "$(@D)/cuda/include/CL/cl.hpp" && cp "/usr/local/cuda-9.0/include/CL/cl_egl.h" "$(@D)/cuda/include/CL/cl_egl.h" && cp "/usr/local/cuda-9.0/include/CL/cl_ext.h" "$(@D)/cuda/include/CL/cl_ext.h" && cp "/usr/local/cuda-9.0/include/CL/cl_gl.h" "$(@D)/cuda/include/CL/cl_gl.h" && cp "/usr/local/cuda-9.0/include/CL/cl_gl_ext.h" "$(@D)/cuda/include/CL/cl_gl_ext.h" && cp "/usr/local/cuda-9.0/include/CL/cl_platform.h" "$(@D)/cuda/include/CL/cl_platform.h" && cp "/usr/local/cuda-9.0/include/CL/opencl.h" "$(@D)/cuda/include/CL/opencl.h" && cp "/usr/local/cuda-9.0/include/builtin_types.h" "$(@D)/cuda/include/builtin_types.h" && cp "/usr/local/cuda-9.0/include/channel_descriptor.h" "$(@D)/cuda/include/channel_descriptor.h" && cp "/usr/local/cuda-9.0/include/common_functions.h" "$(@D)/cuda/include/common_functions.h" && cp "/usr/local/cuda-9.0/include/cooperative_groups.h" "$(@D)/cuda/include/cooperative_groups.h" && cp "/usr/local/cuda-9.0/include/cooperative_groups_helpers.h" "$(@D)/cuda/include/cooperative_groups_helpers.h" && cp "/usr/local/cuda-9.0/include/crt/common_functions.h" "$(@D)/cuda/include/crt/common_functions.h" && cp "/usr/local/cuda-9.0/include/crt/device_double_functions.h" "$(@D)/cuda/include/crt/device_double_functions.h" && cp "/usr/local/cuda-9.0/include/crt/device_double_functions.hpp" "$(@D)/cuda/include/crt/device_double_functions.hpp" && cp "/usr/local/cuda-9.0/include/crt/device_functions.h" "$(@D)/cuda/include/crt/device_functions.h" && cp "/usr/local/cuda-9.0/include/crt/device_functions.hpp" "$(@D)/cuda/include/crt/device_functions.hpp" && cp "/usr/local/cuda-9.0/include/crt/func_macro.h" 
"$(@D)/cuda/include/crt/func_macro.h" && cp "/usr/local/cuda-9.0/include/crt/host_config.h" "$(@D)/cuda/include/crt/host_config.h" && cp "/usr/local/cuda-9.0/include/crt/host_defines.h" "$(@D)/cuda/include/crt/host_defines.h" && cp "/usr/local/cuda-9.0/include/crt/host_runtime.h" "$(@D)/cuda/include/crt/host_runtime.h" && cp "/usr/local/cuda-9.0/include/crt/math_functions.h" "$(@D)/cuda/include/crt/math_functions.h" && cp "/usr/local/cuda-9.0/include/crt/math_functions.hpp" "$(@D)/cuda/include/crt/math_functions.hpp" && cp "/usr/local/cuda-9.0/include/crt/mma.h" "$(@D)/cuda/include/crt/mma.h" && cp "/usr/local/cuda-9.0/include/crt/mma.hpp" "$(@D)/cuda/include/crt/mma.hpp" && cp "/usr/local/cuda-9.0/include/crt/nvfunctional" "$(@D)/cuda/include/crt/nvfunctional" && cp "/usr/local/cuda-9.0/include/crt/sm_70_rt.h" "$(@D)/cuda/include/crt/sm_70_rt.h" && cp "/usr/local/cuda-9.0/include/crt/sm_70_rt.hpp" "$(@D)/cuda/include/crt/sm_70_rt.hpp" && cp "/usr/local/cuda-9.0/include/crt/storage_class.h" "$(@D)/cuda/include/crt/storage_class.h" && cp "/usr/local/cuda-9.0/include/cuComplex.h" "$(@D)/cuda/include/cuComplex.h" && cp "/usr/local/cuda-9.0/include/cublas.h" "$(@D)/cuda/include/cublas.h" && cp "/usr/local/cuda-9.0/include/cublasXt.h" "$(@D)/cuda/include/cublasXt.h" && cp "/usr/local/cuda-9.0/include/cublas_api.h" "$(@D)/cuda/include/cublas_api.h" && cp "/usr/local/cuda-9.0/include/cublas_v2.h" "$(@D)/cuda/include/cublas_v2.h" && cp "/usr/local/cuda-9.0/include/cuda.h" "$(@D)/cuda/include/cuda.h" && cp "/usr/local/cuda-9.0/include/cudaEGL.h" "$(@D)/cuda/include/cudaEGL.h" && cp "/usr/local/cuda-9.0/include/cudaGL.h" "$(@D)/cuda/include/cudaGL.h" && cp "/usr/local/cuda-9.0/include/cudaProfiler.h" "$(@D)/cuda/include/cudaProfiler.h" && cp "/usr/local/cuda-9.0/include/cudaVDPAU.h" "$(@D)/cuda/include/cudaVDPAU.h" && cp "/usr/local/cuda-9.0/include/cuda_device_runtime_api.h" "$(@D)/cuda/include/cuda_device_runtime_api.h" && cp "/usr/local/cuda-9.0/include/cuda_fp16.h" 
"$(@D)/cuda/include/cuda_fp16.h" && cp "/usr/local/cuda-9.0/include/cuda_fp16.hpp" "$(@D)/cuda/include/cuda_fp16.hpp" && cp "/usr/local/cuda-9.0/include/cuda_gl_interop.h" "$(@D)/cuda/include/cuda_gl_interop.h" && cp "/usr/local/cuda-9.0/include/cuda_occupancy.h" "$(@D)/cuda/include/cuda_occupancy.h" && cp "/usr/local/cuda-9.0/include/cuda_profiler_api.h" "$(@D)/cuda/include/cuda_profiler_api.h" && cp "/usr/local/cuda-9.0/include/cuda_runtime.h" "$(@D)/cuda/include/cuda_runtime.h" && cp "/usr/local/cuda-9.0/include/cuda_runtime_api.h" "$(@D)/cuda/include/cuda_runtime_api.h" && cp "/usr/local/cuda-9.0/include/cuda_surface_types.h" "$(@D)/cuda/include/cuda_surface_types.h" && cp "/usr/local/cuda-9.0/include/cuda_texture_types.h" "$(@D)/cuda/include/cuda_texture_types.h" && cp "/usr/local/cuda-9.0/include/cuda_vdpau_interop.h" "$(@D)/cuda/include/cuda_vdpau_interop.h" && cp "/usr/local/cuda-9.0/include/cudalibxt.h" "$(@D)/cuda/include/cudalibxt.h" && cp "/usr/local/cuda-9.0/include/cudnn.h" "$(@D)/cuda/include/cudnn.h" && cp "/usr/local/cuda-9.0/include/cufft.h" "$(@D)/cuda/include/cufft.h" && cp "/usr/local/cuda-9.0/include/cufftXt.h" "$(@D)/cuda/include/cufftXt.h" && cp "/usr/local/cuda-9.0/include/cufftw.h" "$(@D)/cuda/include/cufftw.h" && cp "/usr/local/cuda-9.0/include/curand.h" "$(@D)/cuda/include/curand.h" && cp "/usr/local/cuda-9.0/include/curand_discrete.h" "$(@D)/cuda/include/curand_discrete.h" && cp "/usr/local/cuda-9.0/include/curand_discrete2.h" "$(@D)/cuda/include/curand_discrete2.h" && cp "/usr/local/cuda-9.0/include/curand_globals.h" "$(@D)/cuda/include/curand_globals.h" && cp "/usr/local/cuda-9.0/include/curand_kernel.h" "$(@D)/cuda/include/curand_kernel.h" && cp "/usr/local/cuda-9.0/include/curand_lognormal.h" "$(@D)/cuda/include/curand_lognormal.h" && cp "/usr/local/cuda-9.0/include/curand_mrg32k3a.h" "$(@D)/cuda/include/curand_mrg32k3a.h" && cp "/usr/local/cuda-9.0/include/curand_mtgp32.h" "$(@D)/cuda/include/curand_mtgp32.h" && cp 
"/usr/local/cuda-9.0/include/curand_mtgp32_host.h" "$(@D)/cuda/include/curand_mtgp32_host.h" && cp "/usr/local/cuda-9.0/include/curand_mtgp32_kernel.h" "$(@D)/cuda/include/curand_mtgp32_kernel.h" && cp "/usr/local/cuda-9.0/include/curand_mtgp32dc_p_11213.h" "$(@D)/cuda/include/curand_mtgp32dc_p_11213.h" && cp "/usr/local/cuda-9.0/include/curand_normal.h" "$(@D)/cuda/include/curand_normal.h" && cp "/usr/local/cuda-9.0/include/curand_normal_static.h" "$(@D)/cuda/include/curand_normal_static.h" && cp "/usr/local/cuda-9.0/include/curand_philox4x32_x.h" "$(@D)/cuda/include/curand_philox4x32_x.h" && cp "/usr/local/cuda-9.0/include/curand_poisson.h" "$(@D)/cuda/include/curand_poisson.h" && cp "/usr/local/cuda-9.0/include/curand_precalc.h" "$(@D)/cuda/include/curand_precalc.h" && cp "/usr/local/cuda-9.0/include/curand_uniform.h" "$(@D)/cuda/include/curand_uniform.h" && cp "/usr/local/cuda-9.0/include/cusolverDn.h" "$(@D)/cuda/include/cusolverDn.h" && cp "/usr/local/cuda-9.0/include/cusolverRf.h" "$(@D)/cuda/include/cusolverRf.h" && cp "/usr/local/cuda-9.0/include/cusolverSp.h" "$(@D)/cuda/include/cusolverSp.h" && cp "/usr/local/cuda-9.0/include/cusolverSp_LOWLEVEL_PREVIEW.h" "$(@D)/cuda/include/cusolverSp_LOWLEVEL_PREVIEW.h" && cp "/usr/local/cuda-9.0/include/cusolver_common.h" "$(@D)/cuda/include/cusolver_common.h" && cp "/usr/local/cuda-9.0/include/cusparse.h" "$(@D)/cuda/include/cusparse.h" && cp "/usr/local/cuda-9.0/include/cusparse_v2.h" "$(@D)/cuda/include/cusparse_v2.h" && cp "/usr/local/cuda-9.0/include/device_atomic_functions.h" "$(@D)/cuda/include/device_atomic_functions.h" && cp "/usr/local/cuda-9.0/include/device_atomic_functions.hpp" "$(@D)/cuda/include/device_atomic_functions.hpp" && cp "/usr/local/cuda-9.0/include/device_double_functions.h" "$(@D)/cuda/include/device_double_functions.h" && cp "/usr/local/cuda-9.0/include/device_double_functions.hpp" "$(@D)/cuda/include/device_double_functions.hpp" && cp "/usr/local/cuda-9.0/include/device_functions.h" 
"$(@D)/cuda/include/device_functions.h" && cp "/usr/local/cuda-9.0/include/device_functions.hpp" "$(@D)/cuda/include/device_functions.hpp" && cp "/usr/local/cuda-9.0/include/device_functions_decls.h" "$(@D)/cuda/include/device_functions_decls.h" && cp "/usr/local/cuda-9.0/include/device_launch_parameters.h" "$(@D)/cuda/include/device_launch_parameters.h" && cp "/usr/local/cuda-9.0/include/device_types.h" "$(@D)/cuda/include/device_types.h" && cp "/usr/local/cuda-9.0/include/driver_functions.h" "$(@D)/cuda/include/driver_functions.h" && cp "/usr/local/cuda-9.0/include/driver_types.h" "$(@D)/cuda/include/driver_types.h" && cp "/usr/local/cuda-9.0/include/dynlink_cuda.h" "$(@D)/cuda/include/dynlink_cuda.h" && cp "/usr/local/cuda-9.0/include/dynlink_cuda_cuda.h" "$(@D)/cuda/include/dynlink_cuda_cuda.h" && cp "/usr/local/cuda-9.0/include/dynlink_cuviddec.h" "$(@D)/cuda/include/dynlink_cuviddec.h" && cp "/usr/local/cuda-9.0/include/dynlink_nvcuvid.h" "$(@D)/cuda/include/dynlink_nvcuvid.h" && cp "/usr/local/cuda-9.0/include/fatBinaryCtl.h" "$(@D)/cuda/include/fatBinaryCtl.h" && cp "/usr/local/cuda-9.0/include/fatbinary.h" "$(@D)/cuda/include/fatbinary.h" && cp "/usr/local/cuda-9.0/include/host_config.h" "$(@D)/cuda/include/host_config.h" && cp "/usr/local/cuda-9.0/include/host_defines.h" "$(@D)/cuda/include/host_defines.h" && cp "/usr/local/cuda-9.0/include/library_types.h" "$(@D)/cuda/include/library_types.h" && cp "/usr/local/cuda-9.0/include/math_constants.h" "$(@D)/cuda/include/math_constants.h" && cp "/usr/local/cuda-9.0/include/math_functions.h" "$(@D)/cuda/include/math_functions.h" && cp "/usr/local/cuda-9.0/include/math_functions.hpp" "$(@D)/cuda/include/math_functions.hpp" && cp "/usr/local/cuda-9.0/include/math_functions_dbl_ptx3.h" "$(@D)/cuda/include/math_functions_dbl_ptx3.h" && cp "/usr/local/cuda-9.0/include/math_functions_dbl_ptx3.hpp" "$(@D)/cuda/include/math_functions_dbl_ptx3.hpp" && cp "/usr/local/cuda-9.0/include/mma.h" "$(@D)/cuda/include/mma.h" && 
cp "/usr/local/cuda-9.0/include/npp.h" "$(@D)/cuda/include/npp.h" && cp "/usr/local/cuda-9.0/include/nppcore.h" "$(@D)/cuda/include/nppcore.h" && cp "/usr/local/cuda-9.0/include/nppdefs.h" "$(@D)/cuda/include/nppdefs.h" && cp "/usr/local/cuda-9.0/include/nppi.h" "$(@D)/cuda/include/nppi.h" && cp "/usr/local/cuda-9.0/include/nppi_arithmetic_and_logical_operations.h" "$(@D)/cuda/include/nppi_arithmetic_and_logical_operations.h" && cp "/usr/local/cuda-9.0/include/nppi_color_conversion.h" "$(@D)/cuda/include/nppi_color_conversion.h" && cp "/usr/local/cuda-9.0/include/nppi_compression_functions.h" "$(@D)/cuda/include/nppi_compression_functions.h" && cp "/usr/local/cuda-9.0/include/nppi_computer_vision.h" "$(@D)/cuda/include/nppi_computer_vision.h" && cp "/usr/local/cuda-9.0/include/nppi_data_exchange_and_initialization.h" "$(@D)/cuda/include/nppi_data_exchange_and_initialization.h" && cp "/usr/local/cuda-9.0/include/nppi_filtering_functions.h" "$(@D)/cuda/include/nppi_filtering_functions.h" && cp "/usr/local/cuda-9.0/include/nppi_geometry_transforms.h" "$(@D)/cuda/include/nppi_geometry_transforms.h" && cp "/usr/local/cuda-9.0/include/nppi_linear_transforms.h" "$(@D)/cuda/include/nppi_linear_transforms.h" && cp "/usr/local/cuda-9.0/include/nppi_morphological_operations.h" "$(@D)/cuda/include/nppi_morphological_operations.h" && cp "/usr/local/cuda-9.0/include/nppi_statistics_functions.h" "$(@D)/cuda/include/nppi_statistics_functions.h" && cp "/usr/local/cuda-9.0/include/nppi_support_functions.h" "$(@D)/cuda/include/nppi_support_functions.h" && cp "/usr/local/cuda-9.0/include/nppi_threshold_and_compare_operations.h" "$(@D)/cuda/include/nppi_threshold_and_compare_operations.h" && cp "/usr/local/cuda-9.0/include/npps.h" "$(@D)/cuda/include/npps.h" && cp "/usr/local/cuda-9.0/include/npps_arithmetic_and_logical_operations.h" "$(@D)/cuda/include/npps_arithmetic_and_logical_operations.h" && cp "/usr/local/cuda-9.0/include/npps_conversion_functions.h" 
"$(@D)/cuda/include/npps_conversion_functions.h" && cp "/usr/local/cuda-9.0/include/npps_filtering_functions.h" "$(@D)/cuda/include/npps_filtering_functions.h" && cp "/usr/local/cuda-9.0/include/npps_initialization.h" "$(@D)/cuda/include/npps_initialization.h" && cp "/usr/local/cuda-9.0/include/npps_statistics_functions.h" "$(@D)/cuda/include/npps_statistics_functions.h" && cp "/usr/local/cuda-9.0/include/npps_support_functions.h" "$(@D)/cuda/include/npps_support_functions.h" && cp "/usr/local/cuda-9.0/include/nppversion.h" "$(@D)/cuda/include/nppversion.h" && cp "/usr/local/cuda-9.0/include/nvToolsExt.h" "$(@D)/cuda/include/nvToolsExt.h" && cp "/usr/local/cuda-9.0/include/nvToolsExtCuda.h" "$(@D)/cuda/include/nvToolsExtCuda.h" && cp "/usr/local/cuda-9.0/include/nvToolsExtCudaRt.h" "$(@D)/cuda/include/nvToolsExtCudaRt.h" && cp "/usr/local/cuda-9.0/include/nvToolsExtMeta.h" "$(@D)/cuda/include/nvToolsExtMeta.h" && cp "/usr/local/cuda-9.0/include/nvToolsExtSync.h" "$(@D)/cuda/include/nvToolsExtSync.h" && cp "/usr/local/cuda-9.0/include/nvblas.h" "$(@D)/cuda/include/nvblas.h" && cp "/usr/local/cuda-9.0/include/nvfunctional" "$(@D)/cuda/include/nvfunctional" && cp "/usr/local/cuda-9.0/include/nvgraph.h" "$(@D)/cuda/include/nvgraph.h" && cp "/usr/local/cuda-9.0/include/nvml.h" "$(@D)/cuda/include/nvml.h" && cp "/usr/local/cuda-9.0/include/nvrtc.h" "$(@D)/cuda/include/nvrtc.h" && cp "/usr/local/cuda-9.0/include/sm_20_atomic_functions.h" "$(@D)/cuda/include/sm_20_atomic_functions.h" && cp "/usr/local/cuda-9.0/include/sm_20_atomic_functions.hpp" "$(@D)/cuda/include/sm_20_atomic_functions.hpp" && cp "/usr/local/cuda-9.0/include/sm_20_intrinsics.h" "$(@D)/cuda/include/sm_20_intrinsics.h" && cp "/usr/local/cuda-9.0/include/sm_20_intrinsics.hpp" "$(@D)/cuda/include/sm_20_intrinsics.hpp" && cp "/usr/local/cuda-9.0/include/sm_30_intrinsics.h" "$(@D)/cuda/include/sm_30_intrinsics.h" && cp "/usr/local/cuda-9.0/include/sm_30_intrinsics.hpp" "$(@D)/cuda/include/sm_30_intrinsics.hpp" 
&& cp "/usr/local/cuda-9.0/include/sm_32_atomic_functions.h" "$(@D)/cuda/include/sm_32_atomic_functions.h" && cp "/usr/local/cuda-9.0/include/sm_32_atomic_functions.hpp" "$(@D)/cuda/include/sm_32_atomic_functions.hpp" && cp "/usr/local/cuda-9.0/include/sm_32_intrinsics.h" "$(@D)/cuda/include/sm_32_intrinsics.h" && cp "/usr/local/cuda-9.0/include/sm_32_intrinsics.hpp" "$(@D)/cuda/include/sm_32_intrinsics.hpp" && cp "/usr/local/cuda-9.0/include/sm_35_atomic_functions.h" "$(@D)/cuda/include/sm_35_atomic_functions.h" && cp "/usr/local/cuda-9.0/include/sm_35_intrinsics.h" "$(@D)/cuda/include/sm_35_intrinsics.h" && cp "/usr/local/cuda-9.0/include/sm_60_atomic_functions.h" "$(@D)/cuda/include/sm_60_atomic_functions.h" && cp "/usr/local/cuda-9.0/include/sm_60_atomic_functions.hpp" "$(@D)/cuda/include/sm_60_atomic_functions.hpp" && cp "/usr/local/cuda-9.0/include/sm_61_intrinsics.h" "$(@D)/cuda/include/sm_61_intrinsics.h" && cp "/usr/local/cuda-9.0/include/sm_61_intrinsics.hpp" "$(@D)/cuda/include/sm_61_intrinsics.hpp" && cp "/usr/local/cuda-9.0/include/sobol_direction_vectors.h" "$(@D)/cuda/include/sobol_direction_vectors.h" && cp "/usr/local/cuda-9.0/include/surface_functions.h" "$(@D)/cuda/include/surface_functions.h" && cp "/usr/local/cuda-9.0/include/surface_functions.hpp" "$(@D)/cuda/include/surface_functions.hpp" && cp "/usr/local/cuda-9.0/include/surface_indirect_functions.h" "$(@D)/cuda/include/surface_indirect_functions.h" && cp "/usr/local/cuda-9.0/include/surface_indirect_functions.hpp" "$(@D)/cuda/include/surface_indirect_functions.hpp" && cp "/usr/local/cuda-9.0/include/surface_types.h" "$(@D)/cuda/include/surface_types.h" && cp "/usr/local/cuda-9.0/include/texture_fetch_functions.h" "$(@D)/cuda/include/texture_fetch_functions.h" && cp "/usr/local/cuda-9.0/include/texture_fetch_functions.hpp" "$(@D)/cuda/include/texture_fetch_functions.hpp" && cp "/usr/local/cuda-9.0/include/texture_indirect_functions.h" "$(@D)/cuda/include/texture_indirect_functions.h" && cp 
"/usr/local/cuda-9.0/include/texture_indirect_functions.hpp" "$(@D)/cuda/include/texture_indirect_functions.hpp" && cp "/usr/local/cuda-9.0/include/texture_types.h" "$(@D)/cuda/include/texture_types.h" && cp "/usr/local/cuda-9.0/include/thrust/adjacent_difference.h" "$(@D)/cuda/include/thrust/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/advance.h" "$(@D)/cuda/include/thrust/advance.h" && cp "/usr/local/cuda-9.0/include/thrust/binary_search.h" "$(@D)/cuda/include/thrust/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/complex.h" "$(@D)/cuda/include/thrust/complex.h" && cp "/usr/local/cuda-9.0/include/thrust/copy.h" "$(@D)/cuda/include/thrust/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/count.h" "$(@D)/cuda/include/thrust/count.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/adjacent_difference.inl" "$(@D)/cuda/include/thrust/detail/adjacent_difference.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/advance.inl" "$(@D)/cuda/include/thrust/detail/advance.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/allocator_traits.h" "$(@D)/cuda/include/thrust/detail/allocator/allocator_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/allocator_traits.inl" "$(@D)/cuda/include/thrust/detail/allocator/allocator_traits.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/copy_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/copy_construct_range.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/copy_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/copy_construct_range.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/default_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/default_construct_range.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/default_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/default_construct_range.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/allocator/destroy_range.h" "$(@D)/cuda/include/thrust/detail/allocator/destroy_range.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/destroy_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/destroy_range.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/fill_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/fill_construct_range.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/fill_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/fill_construct_range.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/malloc_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/malloc_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/malloc_allocator.inl" "$(@D)/cuda/include/thrust/detail/allocator/malloc_allocator.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/no_throw_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/no_throw_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/tagged_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/tagged_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/tagged_allocator.inl" "$(@D)/cuda/include/thrust/detail/allocator/tagged_allocator.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/temporary_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/temporary_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/temporary_allocator.inl" "$(@D)/cuda/include/thrust/detail/allocator/temporary_allocator.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/binary_search.inl" "$(@D)/cuda/include/thrust/detail/binary_search.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/arithmetic.h" "$(@D)/cuda/include/thrust/detail/complex/arithmetic.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/c99math.h" "$(@D)/cuda/include/thrust/detail/complex/c99math.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/complex/catrig.h" "$(@D)/cuda/include/thrust/detail/complex/catrig.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/catrigf.h" "$(@D)/cuda/include/thrust/detail/complex/catrigf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/ccosh.h" "$(@D)/cuda/include/thrust/detail/complex/ccosh.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/ccoshf.h" "$(@D)/cuda/include/thrust/detail/complex/ccoshf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/cexp.h" "$(@D)/cuda/include/thrust/detail/complex/cexp.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/cexpf.h" "$(@D)/cuda/include/thrust/detail/complex/cexpf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/clog.h" "$(@D)/cuda/include/thrust/detail/complex/clog.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/clogf.h" "$(@D)/cuda/include/thrust/detail/complex/clogf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/complex.inl" "$(@D)/cuda/include/thrust/detail/complex/complex.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/cpow.h" "$(@D)/cuda/include/thrust/detail/complex/cpow.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/cpowf.h" "$(@D)/cuda/include/thrust/detail/complex/cpowf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/cproj.h" "$(@D)/cuda/include/thrust/detail/complex/cproj.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/csinh.h" "$(@D)/cuda/include/thrust/detail/complex/csinh.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/csinhf.h" "$(@D)/cuda/include/thrust/detail/complex/csinhf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/csqrt.h" "$(@D)/cuda/include/thrust/detail/complex/csqrt.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/csqrtf.h" "$(@D)/cuda/include/thrust/detail/complex/csqrtf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/ctanh.h" "$(@D)/cuda/include/thrust/detail/complex/ctanh.h" && 
cp "/usr/local/cuda-9.0/include/thrust/detail/complex/ctanhf.h" "$(@D)/cuda/include/thrust/detail/complex/ctanhf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/math_private.h" "$(@D)/cuda/include/thrust/detail/complex/math_private.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/stream.h" "$(@D)/cuda/include/thrust/detail/complex/stream.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config.h" "$(@D)/cuda/include/thrust/detail/config.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/compiler.h" "$(@D)/cuda/include/thrust/detail/config/compiler.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/compiler_fence.h" "$(@D)/cuda/include/thrust/detail/config/compiler_fence.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/config.h" "$(@D)/cuda/include/thrust/detail/config/config.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/debug.h" "$(@D)/cuda/include/thrust/detail/config/debug.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/device_system.h" "$(@D)/cuda/include/thrust/detail/config/device_system.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/exec_check_disable.h" "$(@D)/cuda/include/thrust/detail/config/exec_check_disable.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/forceinline.h" "$(@D)/cuda/include/thrust/detail/config/forceinline.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/global_workarounds.h" "$(@D)/cuda/include/thrust/detail/config/global_workarounds.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/host_device.h" "$(@D)/cuda/include/thrust/detail/config/host_device.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/host_system.h" "$(@D)/cuda/include/thrust/detail/config/host_system.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/simple_defines.h" "$(@D)/cuda/include/thrust/detail/config/simple_defines.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/contiguous_storage.h" 
"$(@D)/cuda/include/thrust/detail/contiguous_storage.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/contiguous_storage.inl" "$(@D)/cuda/include/thrust/detail/contiguous_storage.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/copy.h" "$(@D)/cuda/include/thrust/detail/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/copy.inl" "$(@D)/cuda/include/thrust/detail/copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/copy_if.h" "$(@D)/cuda/include/thrust/detail/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/copy_if.inl" "$(@D)/cuda/include/thrust/detail/copy_if.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/count.inl" "$(@D)/cuda/include/thrust/detail/count.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/cstdint.h" "$(@D)/cuda/include/thrust/detail/cstdint.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_delete.inl" "$(@D)/cuda/include/thrust/detail/device_delete.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_free.inl" "$(@D)/cuda/include/thrust/detail/device_free.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_malloc.inl" "$(@D)/cuda/include/thrust/detail/device_malloc.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_new.inl" "$(@D)/cuda/include/thrust/detail/device_new.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_ptr.inl" "$(@D)/cuda/include/thrust/detail/device_ptr.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_reference.inl" "$(@D)/cuda/include/thrust/detail/device_reference.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_vector.inl" "$(@D)/cuda/include/thrust/detail/device_vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/dispatch/is_trivial_copy.h" "$(@D)/cuda/include/thrust/detail/dispatch/is_trivial_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/distance.inl" "$(@D)/cuda/include/thrust/detail/distance.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/equal.inl" 
"$(@D)/cuda/include/thrust/detail/equal.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/execute_with_allocator.h" "$(@D)/cuda/include/thrust/detail/execute_with_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/execution_policy.h" "$(@D)/cuda/include/thrust/detail/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/extrema.inl" "$(@D)/cuda/include/thrust/detail/extrema.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/fill.inl" "$(@D)/cuda/include/thrust/detail/fill.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/find.inl" "$(@D)/cuda/include/thrust/detail/find.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/for_each.inl" "$(@D)/cuda/include/thrust/detail/for_each.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/function.h" "$(@D)/cuda/include/thrust/detail/function.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional.inl" "$(@D)/cuda/include/thrust/detail/functional.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/actor.h" "$(@D)/cuda/include/thrust/detail/functional/actor.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/actor.inl" "$(@D)/cuda/include/thrust/detail/functional/actor.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/argument.h" "$(@D)/cuda/include/thrust/detail/functional/argument.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/composite.h" "$(@D)/cuda/include/thrust/detail/functional/composite.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/arithmetic_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/arithmetic_operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/assignment_operator.h" "$(@D)/cuda/include/thrust/detail/functional/operators/assignment_operator.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/functional/operators/bitwise_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/bitwise_operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/compound_assignment_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/compound_assignment_operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/logical_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/logical_operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/operator_adaptors.h" "$(@D)/cuda/include/thrust/detail/functional/operators/operator_adaptors.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/relational_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/relational_operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/placeholder.h" "$(@D)/cuda/include/thrust/detail/functional/placeholder.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/value.h" "$(@D)/cuda/include/thrust/detail/functional/value.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/gather.inl" "$(@D)/cuda/include/thrust/detail/gather.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/generate.inl" "$(@D)/cuda/include/thrust/detail/generate.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/get_iterator_value.h" "$(@D)/cuda/include/thrust/detail/get_iterator_value.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/host_vector.inl" "$(@D)/cuda/include/thrust/detail/host_vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/inner_product.inl" "$(@D)/cuda/include/thrust/detail/inner_product.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/integer_math.h" "$(@D)/cuda/include/thrust/detail/integer_math.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/integer_traits.h" "$(@D)/cuda/include/thrust/detail/integer_traits.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/internal_functional.h" "$(@D)/cuda/include/thrust/detail/internal_functional.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/logical.inl" "$(@D)/cuda/include/thrust/detail/logical.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/detail/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/merge.inl" "$(@D)/cuda/include/thrust/detail/merge.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/minmax.h" "$(@D)/cuda/include/thrust/detail/minmax.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/mismatch.inl" "$(@D)/cuda/include/thrust/detail/mismatch.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/mpl/math.h" "$(@D)/cuda/include/thrust/detail/mpl/math.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/numeric_traits.h" "$(@D)/cuda/include/thrust/detail/numeric_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/overlapped_copy.h" "$(@D)/cuda/include/thrust/detail/overlapped_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/pair.inl" "$(@D)/cuda/include/thrust/detail/pair.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/partition.inl" "$(@D)/cuda/include/thrust/detail/partition.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/pointer.h" "$(@D)/cuda/include/thrust/detail/pointer.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/pointer.inl" "$(@D)/cuda/include/thrust/detail/pointer.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/range/head_flags.h" "$(@D)/cuda/include/thrust/detail/range/head_flags.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/range/tail_flags.h" "$(@D)/cuda/include/thrust/detail/range/tail_flags.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/raw_pointer_cast.h" "$(@D)/cuda/include/thrust/detail/raw_pointer_cast.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/raw_reference_cast.h" "$(@D)/cuda/include/thrust/detail/raw_reference_cast.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/reduce.inl" "$(@D)/cuda/include/thrust/detail/reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/reference.h" "$(@D)/cuda/include/thrust/detail/reference.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/reference.inl" "$(@D)/cuda/include/thrust/detail/reference.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/reference_forward_declaration.h" "$(@D)/cuda/include/thrust/detail/reference_forward_declaration.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/remove.inl" "$(@D)/cuda/include/thrust/detail/remove.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/replace.inl" "$(@D)/cuda/include/thrust/detail/replace.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/reverse.inl" "$(@D)/cuda/include/thrust/detail/reverse.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/scan.inl" "$(@D)/cuda/include/thrust/detail/scan.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/scatter.inl" "$(@D)/cuda/include/thrust/detail/scatter.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/seq.h" "$(@D)/cuda/include/thrust/detail/seq.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/sequence.inl" "$(@D)/cuda/include/thrust/detail/sequence.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/set_operations.inl" "$(@D)/cuda/include/thrust/detail/set_operations.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/sort.inl" "$(@D)/cuda/include/thrust/detail/sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/static_assert.h" "$(@D)/cuda/include/thrust/detail/static_assert.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/static_map.h" "$(@D)/cuda/include/thrust/detail/static_map.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/swap.h" "$(@D)/cuda/include/thrust/detail/swap.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/swap.inl" "$(@D)/cuda/include/thrust/detail/swap.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/swap_ranges.inl" "$(@D)/cuda/include/thrust/detail/swap_ranges.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/tabulate.inl" "$(@D)/cuda/include/thrust/detail/tabulate.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/temporary_array.h" "$(@D)/cuda/include/thrust/detail/temporary_array.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/temporary_array.inl" "$(@D)/cuda/include/thrust/detail/temporary_array.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/detail/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/transform.inl" "$(@D)/cuda/include/thrust/detail/transform.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/transform_reduce.inl" "$(@D)/cuda/include/thrust/detail/transform_reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/transform_scan.inl" "$(@D)/cuda/include/thrust/detail/transform_scan.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/trivial_sequence.h" "$(@D)/cuda/include/thrust/detail/trivial_sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/tuple.inl" "$(@D)/cuda/include/thrust/detail/tuple.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/tuple_meta_transform.h" "$(@D)/cuda/include/thrust/detail/tuple_meta_transform.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/tuple_transform.h" "$(@D)/cuda/include/thrust/detail/tuple_transform.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h" "$(@D)/cuda/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/function_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits/function_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/has_member_function.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_member_function.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/type_traits/has_nested_type.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_nested_type.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/has_trivial_assign.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_trivial_assign.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/is_call_possible.h" "$(@D)/cuda/include/thrust/detail/type_traits/is_call_possible.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/is_metafunction_defined.h" "$(@D)/cuda/include/thrust/detail/type_traits/is_metafunction_defined.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/iterator/is_discard_iterator.h" "$(@D)/cuda/include/thrust/detail/type_traits/iterator/is_discard_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/iterator/is_output_iterator.h" "$(@D)/cuda/include/thrust/detail/type_traits/iterator/is_output_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/minimum_type.h" "$(@D)/cuda/include/thrust/detail/type_traits/minimum_type.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/pointer_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits/pointer_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/result_of_adaptable_function.h" "$(@D)/cuda/include/thrust/detail/type_traits/result_of_adaptable_function.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/uninitialized_copy.inl" "$(@D)/cuda/include/thrust/detail/uninitialized_copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/uninitialized_fill.inl" "$(@D)/cuda/include/thrust/detail/uninitialized_fill.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/unique.inl" "$(@D)/cuda/include/thrust/detail/unique.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/use_default.h" "$(@D)/cuda/include/thrust/detail/use_default.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/util/align.h" "$(@D)/cuda/include/thrust/detail/util/align.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/util/blocking.h" "$(@D)/cuda/include/thrust/detail/util/blocking.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/vector_base.h" "$(@D)/cuda/include/thrust/detail/vector_base.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/vector_base.inl" "$(@D)/cuda/include/thrust/detail/vector_base.inl" && cp "/usr/local/cuda-9.0/include/thrust/device_allocator.h" "$(@D)/cuda/include/thrust/device_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/device_delete.h" "$(@D)/cuda/include/thrust/device_delete.h" && cp "/usr/local/cuda-9.0/include/thrust/device_free.h" "$(@D)/cuda/include/thrust/device_free.h" && cp "/usr/local/cuda-9.0/include/thrust/device_malloc.h" "$(@D)/cuda/include/thrust/device_malloc.h" && cp "/usr/local/cuda-9.0/include/thrust/device_malloc_allocator.h" "$(@D)/cuda/include/thrust/device_malloc_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/device_new.h" "$(@D)/cuda/include/thrust/device_new.h" && cp "/usr/local/cuda-9.0/include/thrust/device_new_allocator.h" "$(@D)/cuda/include/thrust/device_new_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/device_ptr.h" "$(@D)/cuda/include/thrust/device_ptr.h" && cp "/usr/local/cuda-9.0/include/thrust/device_reference.h" "$(@D)/cuda/include/thrust/device_reference.h" && cp "/usr/local/cuda-9.0/include/thrust/device_vector.h" "$(@D)/cuda/include/thrust/device_vector.h" && cp "/usr/local/cuda-9.0/include/thrust/distance.h" "$(@D)/cuda/include/thrust/distance.h" && cp "/usr/local/cuda-9.0/include/thrust/equal.h" "$(@D)/cuda/include/thrust/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/execution_policy.h" "$(@D)/cuda/include/thrust/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/extrema.h" "$(@D)/cuda/include/thrust/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/fill.h" "$(@D)/cuda/include/thrust/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/find.h" "$(@D)/cuda/include/thrust/find.h" && cp 
"/usr/local/cuda-9.0/include/thrust/for_each.h" "$(@D)/cuda/include/thrust/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/functional.h" "$(@D)/cuda/include/thrust/functional.h" && cp "/usr/local/cuda-9.0/include/thrust/gather.h" "$(@D)/cuda/include/thrust/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/generate.h" "$(@D)/cuda/include/thrust/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/host_vector.h" "$(@D)/cuda/include/thrust/host_vector.h" && cp "/usr/local/cuda-9.0/include/thrust/inner_product.h" "$(@D)/cuda/include/thrust/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/constant_iterator.h" "$(@D)/cuda/include/thrust/iterator/constant_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/counting_iterator.h" "$(@D)/cuda/include/thrust/iterator/counting_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/any_assign.h" "$(@D)/cuda/include/thrust/iterator/detail/any_assign.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/any_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/any_system_tag.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/constant_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/constant_iterator_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/counting_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/counting_iterator.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/device_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/device_system_tag.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/discard_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/discard_iterator_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/distance_from_result.h" "$(@D)/cuda/include/thrust/iterator/detail/distance_from_result.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/host_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/host_system_tag.h" && cp 
"/usr/local/cuda-9.0/include/thrust/iterator/detail/is_iterator_category.h" "$(@D)/cuda/include/thrust/iterator/detail/is_iterator_category.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/is_trivial_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/is_trivial_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_adaptor_base.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_adaptor_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_category_to_system.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_to_system.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_category_to_traversal.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_to_traversal.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_facade_category.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_facade_category.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_traits.inl" "$(@D)/cuda/include/thrust/iterator/detail/iterator_traits.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_traversal_tags.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_traversal_tags.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/join_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/join_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/minimum_category.h" "$(@D)/cuda/include/thrust/iterator/detail/minimum_category.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/minimum_system.h" "$(@D)/cuda/include/thrust/iterator/detail/minimum_system.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/normal_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/normal_iterator.h" && cp 
"/usr/local/cuda-9.0/include/thrust/iterator/detail/permutation_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/permutation_iterator_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/retag.h" "$(@D)/cuda/include/thrust/iterator/detail/retag.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/reverse_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/reverse_iterator.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/reverse_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/reverse_iterator_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/tagged_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/tagged_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/transform_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/transform_iterator.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/transform_output_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/transform_output_iterator.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/tuple_of_iterator_references.h" "$(@D)/cuda/include/thrust/iterator/detail/tuple_of_iterator_references.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/universal_categories.h" "$(@D)/cuda/include/thrust/iterator/detail/universal_categories.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/zip_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/zip_iterator.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/zip_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/zip_iterator_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/discard_iterator.h" "$(@D)/cuda/include/thrust/iterator/discard_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/iterator_adaptor.h" "$(@D)/cuda/include/thrust/iterator/iterator_adaptor.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/iterator_categories.h" "$(@D)/cuda/include/thrust/iterator/iterator_categories.h" 
&& cp "/usr/local/cuda-9.0/include/thrust/iterator/iterator_facade.h" "$(@D)/cuda/include/thrust/iterator/iterator_facade.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/iterator_traits.h" "$(@D)/cuda/include/thrust/iterator/iterator_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/permutation_iterator.h" "$(@D)/cuda/include/thrust/iterator/permutation_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/retag.h" "$(@D)/cuda/include/thrust/iterator/retag.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/reverse_iterator.h" "$(@D)/cuda/include/thrust/iterator/reverse_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/transform_iterator.h" "$(@D)/cuda/include/thrust/iterator/transform_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/transform_output_iterator.h" "$(@D)/cuda/include/thrust/iterator/transform_output_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/zip_iterator.h" "$(@D)/cuda/include/thrust/iterator/zip_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/logical.h" "$(@D)/cuda/include/thrust/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/memory.h" "$(@D)/cuda/include/thrust/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/merge.h" "$(@D)/cuda/include/thrust/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/mismatch.h" "$(@D)/cuda/include/thrust/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/pair.h" "$(@D)/cuda/include/thrust/pair.h" && cp "/usr/local/cuda-9.0/include/thrust/partition.h" "$(@D)/cuda/include/thrust/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/random.h" "$(@D)/cuda/include/thrust/random.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/discard_block_engine.inl" "$(@D)/cuda/include/thrust/random/detail/discard_block_engine.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/linear_congruential_engine.inl" "$(@D)/cuda/include/thrust/random/detail/linear_congruential_engine.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/random/detail/linear_congruential_engine_discard.h" "$(@D)/cuda/include/thrust/random/detail/linear_congruential_engine_discard.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/linear_feedback_shift_engine.inl" "$(@D)/cuda/include/thrust/random/detail/linear_feedback_shift_engine.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h" "$(@D)/cuda/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/mod.h" "$(@D)/cuda/include/thrust/random/detail/mod.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/normal_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/normal_distribution.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/normal_distribution_base.h" "$(@D)/cuda/include/thrust/random/detail/normal_distribution_base.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/random_core_access.h" "$(@D)/cuda/include/thrust/random/detail/random_core_access.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/subtract_with_carry_engine.inl" "$(@D)/cuda/include/thrust/random/detail/subtract_with_carry_engine.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/uniform_int_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/uniform_int_distribution.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/uniform_real_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/uniform_real_distribution.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/xor_combine_engine.inl" "$(@D)/cuda/include/thrust/random/detail/xor_combine_engine.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/xor_combine_engine_max.h" "$(@D)/cuda/include/thrust/random/detail/xor_combine_engine_max.h" && cp "/usr/local/cuda-9.0/include/thrust/random/discard_block_engine.h" "$(@D)/cuda/include/thrust/random/discard_block_engine.h" && cp 
"/usr/local/cuda-9.0/include/thrust/random/linear_congruential_engine.h" "$(@D)/cuda/include/thrust/random/linear_congruential_engine.h" && cp "/usr/local/cuda-9.0/include/thrust/random/linear_feedback_shift_engine.h" "$(@D)/cuda/include/thrust/random/linear_feedback_shift_engine.h" && cp "/usr/local/cuda-9.0/include/thrust/random/normal_distribution.h" "$(@D)/cuda/include/thrust/random/normal_distribution.h" && cp "/usr/local/cuda-9.0/include/thrust/random/subtract_with_carry_engine.h" "$(@D)/cuda/include/thrust/random/subtract_with_carry_engine.h" && cp "/usr/local/cuda-9.0/include/thrust/random/uniform_int_distribution.h" "$(@D)/cuda/include/thrust/random/uniform_int_distribution.h" && cp "/usr/local/cuda-9.0/include/thrust/random/uniform_real_distribution.h" "$(@D)/cuda/include/thrust/random/uniform_real_distribution.h" && cp "/usr/local/cuda-9.0/include/thrust/random/xor_combine_engine.h" "$(@D)/cuda/include/thrust/random/xor_combine_engine.h" && cp "/usr/local/cuda-9.0/include/thrust/reduce.h" "$(@D)/cuda/include/thrust/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/remove.h" "$(@D)/cuda/include/thrust/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/replace.h" "$(@D)/cuda/include/thrust/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/reverse.h" "$(@D)/cuda/include/thrust/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/scan.h" "$(@D)/cuda/include/thrust/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/scatter.h" "$(@D)/cuda/include/thrust/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/sequence.h" "$(@D)/cuda/include/thrust/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/set_operations.h" "$(@D)/cuda/include/thrust/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/sort.h" "$(@D)/cuda/include/thrust/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/swap.h" "$(@D)/cuda/include/thrust/swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/adjacent_difference.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/cpp/detail/assign_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/cpp/detail/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/copy.h" "$(@D)/cuda/include/thrust/system/cpp/detail/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/cpp/detail/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/count.h" "$(@D)/cuda/include/thrust/system/cpp/detail/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/equal.h" "$(@D)/cuda/include/thrust/system/cpp/detail/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/cpp/detail/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/extrema.h" "$(@D)/cuda/include/thrust/system/cpp/detail/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/fill.h" "$(@D)/cuda/include/thrust/system/cpp/detail/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/find.h" "$(@D)/cuda/include/thrust/system/cpp/detail/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/for_each.h" "$(@D)/cuda/include/thrust/system/cpp/detail/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/gather.h" "$(@D)/cuda/include/thrust/system/cpp/detail/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/generate.h" "$(@D)/cuda/include/thrust/system/cpp/detail/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/get_value.h" "$(@D)/cuda/include/thrust/system/cpp/detail/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/inner_product.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/cpp/detail/iter_swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/logical.h" "$(@D)/cuda/include/thrust/system/cpp/detail/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/cpp/detail/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/memory.inl" "$(@D)/cuda/include/thrust/system/cpp/detail/memory.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/merge.h" "$(@D)/cuda/include/thrust/system/cpp/detail/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/cpp/detail/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/par.h" "$(@D)/cuda/include/thrust/system/cpp/detail/par.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/partition.h" "$(@D)/cuda/include/thrust/system/cpp/detail/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/reduce.h" "$(@D)/cuda/include/thrust/system/cpp/detail/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/remove.h" "$(@D)/cuda/include/thrust/system/cpp/detail/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/replace.h" "$(@D)/cuda/include/thrust/system/cpp/detail/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/reverse.h" "$(@D)/cuda/include/thrust/system/cpp/detail/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/scan.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scan_by_key.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cpp/detail/scatter.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/sequence.h" "$(@D)/cuda/include/thrust/system/cpp/detail/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/cpp/detail/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/sort.h" "$(@D)/cuda/include/thrust/system/cpp/detail/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/cpp/detail/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/cpp/detail/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/cpp/detail/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/transform.h" "$(@D)/cuda/include/thrust/system/cpp/detail/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/cpp/detail/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/cpp/detail/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/cpp/detail/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/cpp/detail/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/unique.h" "$(@D)/cuda/include/thrust/system/cpp/detail/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/vector.inl" 
"$(@D)/cuda/include/thrust/system/cpp/detail/vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/execution_policy.h" "$(@D)/cuda/include/thrust/system/cpp/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/memory.h" "$(@D)/cuda/include/thrust/system/cpp/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/vector.h" "$(@D)/cuda/include/thrust/system/cpp/vector.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/config.h" "$(@D)/cuda/include/thrust/system/cuda/config.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/cuda/detail/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/cuda/detail/assign_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/cuda/detail/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/core/agent_launcher.h" "$(@D)/cuda/include/thrust/system/cuda/detail/core/agent_launcher.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/core/alignment.h" "$(@D)/cuda/include/thrust/system/cuda/detail/core/alignment.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/core/triple_chevron_launch.h" "$(@D)/cuda/include/thrust/system/cuda/detail/core/triple_chevron_launch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/core/util.h" "$(@D)/cuda/include/thrust/system/cuda/detail/core/util.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/count.h" "$(@D)/cuda/include/thrust/system/cuda/detail/count.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cross_system.h" "$(@D)/cuda/include/thrust/system/cuda/detail/cross_system.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_histogram.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_downsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_downsweep.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_upsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_upsweep.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_reduce_by_key.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_reduce_by_key.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_rle.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_rle.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_segment_fixup.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_segment_fixup.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_select_if.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_select_if.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_spmv_csrt.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_csrt.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_spmv_orig.cuh" 
"$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_orig.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_spmv_row_based.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_row_based.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/single_pass_scan_operators.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/single_pass_scan_operators.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_adjacent_difference.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_adjacent_difference.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_load.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_load.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_shuffle.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_shuffle.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_store.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_store.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh" 
"$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans2.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans2.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans3.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans3.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/cub.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/cub.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_partition.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_partition.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_segmented_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_segmented_radix_sort.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_segmented_reduce.cuh" 
"$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_segmented_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_select.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_select.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_spmv.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_spmv.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_histogram.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_radix_sort.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce_by_key.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce_by_key.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_rle.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_rle.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_select_if.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_select_if.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_csrt.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_csrt.cuh" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_orig.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_orig.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_row_based.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_row_based.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/host/mutex.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/host/mutex.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/discard_output_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/discard_output_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_search.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_search.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_allocator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_allocator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_arch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_arch.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_debug.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_debug.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_device.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_device.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_macro.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_macro.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_namespace.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_namespace.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_ptx.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_ptx.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_type.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_type.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/equal.h" "$(@D)/cuda/include/thrust/system/cuda/detail/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/error.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/error.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/extrema.h" "$(@D)/cuda/include/thrust/system/cuda/detail/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/fill.h" "$(@D)/cuda/include/thrust/system/cuda/detail/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/find.h" "$(@D)/cuda/include/thrust/system/cuda/detail/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/for_each.h" "$(@D)/cuda/include/thrust/system/cuda/detail/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/gather.h" "$(@D)/cuda/include/thrust/system/cuda/detail/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/generate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/get_value.h" "$(@D)/cuda/include/thrust/system/cuda/detail/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h" "$(@D)/cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/guarded_driver_types.h" "$(@D)/cuda/include/thrust/system/cuda/detail/guarded_driver_types.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/cuda/detail/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/internal/copy_cross_system.h" "$(@D)/cuda/include/thrust/system/cuda/detail/internal/copy_cross_system.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/internal/copy_device_to_device.h" "$(@D)/cuda/include/thrust/system/cuda/detail/internal/copy_device_to_device.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/cuda/detail/iter_swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/logical.h" "$(@D)/cuda/include/thrust/system/cuda/detail/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/cuda/detail/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/memory.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/memory.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/memory_buffer.h" "$(@D)/cuda/include/thrust/system/cuda/detail/memory_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/merge.h" "$(@D)/cuda/include/thrust/system/cuda/detail/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/cuda/detail/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/par.h" "$(@D)/cuda/include/thrust/system/cuda/detail/par.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/par_to_seq.h" "$(@D)/cuda/include/thrust/system/cuda/detail/par_to_seq.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/parallel_for.h" "$(@D)/cuda/include/thrust/system/cuda/detail/parallel_for.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/partition.h" "$(@D)/cuda/include/thrust/system/cuda/detail/partition.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/reduce.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/remove.h" "$(@D)/cuda/include/thrust/system/cuda/detail/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/replace.h" "$(@D)/cuda/include/thrust/system/cuda/detail/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/reverse.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/scatter.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/sequence.h" "$(@D)/cuda/include/thrust/system/cuda/detail/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/cuda/detail/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/cuda/detail/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/cuda/detail/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/terminate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/terminate.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/transform.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/cuda/detail/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/unique.h" "$(@D)/cuda/include/thrust/system/cuda/detail/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/cuda/detail/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/util.h" "$(@D)/cuda/include/thrust/system/cuda/detail/util.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/vector.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/error.h" "$(@D)/cuda/include/thrust/system/cuda/error.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/execution_policy.h" "$(@D)/cuda/include/thrust/system/cuda/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/experimental/pinned_allocator.h" "$(@D)/cuda/include/thrust/system/cuda/experimental/pinned_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/memory.h" "$(@D)/cuda/include/thrust/system/cuda/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/vector.h" "$(@D)/cuda/include/thrust/system/cuda/vector.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/adjacent_difference.h" 
"$(@D)/cuda/include/thrust/system/detail/adl/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/assign_value.h" "$(@D)/cuda/include/thrust/system/detail/adl/assign_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/adl/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/copy.h" "$(@D)/cuda/include/thrust/system/detail/adl/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/adl/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/count.h" "$(@D)/cuda/include/thrust/system/detail/adl/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/equal.h" "$(@D)/cuda/include/thrust/system/detail/adl/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/extrema.h" "$(@D)/cuda/include/thrust/system/detail/adl/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/fill.h" "$(@D)/cuda/include/thrust/system/detail/adl/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/find.h" "$(@D)/cuda/include/thrust/system/detail/adl/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/for_each.h" "$(@D)/cuda/include/thrust/system/detail/adl/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/gather.h" "$(@D)/cuda/include/thrust/system/detail/adl/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/generate.h" "$(@D)/cuda/include/thrust/system/detail/adl/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/get_value.h" "$(@D)/cuda/include/thrust/system/detail/adl/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/adl/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/iter_swap.h" "$(@D)/cuda/include/thrust/system/detail/adl/iter_swap.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/adl/logical.h" "$(@D)/cuda/include/thrust/system/detail/adl/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/detail/adl/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/merge.h" "$(@D)/cuda/include/thrust/system/detail/adl/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/adl/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/partition.h" "$(@D)/cuda/include/thrust/system/detail/adl/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/reduce.h" "$(@D)/cuda/include/thrust/system/detail/adl/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/remove.h" "$(@D)/cuda/include/thrust/system/detail/adl/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/replace.h" "$(@D)/cuda/include/thrust/system/detail/adl/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/reverse.h" "$(@D)/cuda/include/thrust/system/detail/adl/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/scan.h" "$(@D)/cuda/include/thrust/system/detail/adl/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/scatter.h" "$(@D)/cuda/include/thrust/system/detail/adl/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/sequence.h" "$(@D)/cuda/include/thrust/system/detail/adl/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/adl/set_operations.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/adl/sort.h" "$(@D)/cuda/include/thrust/system/detail/adl/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/adl/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/adl/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/adl/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/transform.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/transform_scan.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/detail/adl/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/detail/adl/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/unique.h" "$(@D)/cuda/include/thrust/system/detail/adl/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/bad_alloc.h" "$(@D)/cuda/include/thrust/system/detail/bad_alloc.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/errno.h" "$(@D)/cuda/include/thrust/system/detail/errno.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/error_category.inl" "$(@D)/cuda/include/thrust/system/detail/error_category.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/error_code.inl" 
"$(@D)/cuda/include/thrust/system/detail/error_code.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/error_condition.inl" "$(@D)/cuda/include/thrust/system/detail/error_condition.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/detail/generic/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/adjacent_difference.inl" "$(@D)/cuda/include/thrust/system/detail/generic/adjacent_difference.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/advance.h" "$(@D)/cuda/include/thrust/system/detail/generic/advance.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/advance.inl" "$(@D)/cuda/include/thrust/system/detail/generic/advance.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/generic/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/binary_search.inl" "$(@D)/cuda/include/thrust/system/detail/generic/binary_search.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/copy.h" "$(@D)/cuda/include/thrust/system/detail/generic/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/copy.inl" "$(@D)/cuda/include/thrust/system/detail/generic/copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/generic/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/copy_if.inl" "$(@D)/cuda/include/thrust/system/detail/generic/copy_if.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/count.h" "$(@D)/cuda/include/thrust/system/detail/generic/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/count.inl" "$(@D)/cuda/include/thrust/system/detail/generic/count.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/distance.h" 
"$(@D)/cuda/include/thrust/system/detail/generic/distance.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/distance.inl" "$(@D)/cuda/include/thrust/system/detail/generic/distance.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/equal.h" "$(@D)/cuda/include/thrust/system/detail/generic/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/equal.inl" "$(@D)/cuda/include/thrust/system/detail/generic/equal.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/extrema.h" "$(@D)/cuda/include/thrust/system/detail/generic/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/extrema.inl" "$(@D)/cuda/include/thrust/system/detail/generic/extrema.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/fill.h" "$(@D)/cuda/include/thrust/system/detail/generic/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/find.h" "$(@D)/cuda/include/thrust/system/detail/generic/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/find.inl" "$(@D)/cuda/include/thrust/system/detail/generic/find.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/for_each.h" "$(@D)/cuda/include/thrust/system/detail/generic/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/gather.h" "$(@D)/cuda/include/thrust/system/detail/generic/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/gather.inl" "$(@D)/cuda/include/thrust/system/detail/generic/gather.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/generate.h" "$(@D)/cuda/include/thrust/system/detail/generic/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/generate.inl" "$(@D)/cuda/include/thrust/system/detail/generic/generate.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/generic/inner_product.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/generic/inner_product.inl" "$(@D)/cuda/include/thrust/system/detail/generic/inner_product.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/logical.h" "$(@D)/cuda/include/thrust/system/detail/generic/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/memory.h" "$(@D)/cuda/include/thrust/system/detail/generic/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/memory.inl" "$(@D)/cuda/include/thrust/system/detail/generic/memory.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/merge.h" "$(@D)/cuda/include/thrust/system/detail/generic/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/merge.inl" "$(@D)/cuda/include/thrust/system/detail/generic/merge.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/generic/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/mismatch.inl" "$(@D)/cuda/include/thrust/system/detail/generic/mismatch.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/partition.h" "$(@D)/cuda/include/thrust/system/detail/generic/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/partition.inl" "$(@D)/cuda/include/thrust/system/detail/generic/partition.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reduce.h" "$(@D)/cuda/include/thrust/system/detail/generic/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reduce.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reduce_by_key.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/generic/remove.h" "$(@D)/cuda/include/thrust/system/detail/generic/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/remove.inl" "$(@D)/cuda/include/thrust/system/detail/generic/remove.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/replace.h" "$(@D)/cuda/include/thrust/system/detail/generic/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/replace.inl" "$(@D)/cuda/include/thrust/system/detail/generic/replace.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reverse.h" "$(@D)/cuda/include/thrust/system/detail/generic/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reverse.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reverse.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scalar/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/generic/scalar/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scalar/binary_search.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scalar/binary_search.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scan.h" "$(@D)/cuda/include/thrust/system/detail/generic/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scan.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scan.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scan_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scan_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scatter.h" "$(@D)/cuda/include/thrust/system/detail/generic/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scatter.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scatter.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/generic/select_system.h" "$(@D)/cuda/include/thrust/system/detail/generic/select_system.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/sequence.h" "$(@D)/cuda/include/thrust/system/detail/generic/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/sequence.inl" "$(@D)/cuda/include/thrust/system/detail/generic/sequence.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/generic/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/set_operations.inl" "$(@D)/cuda/include/thrust/system/detail/generic/set_operations.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/sort.h" "$(@D)/cuda/include/thrust/system/detail/generic/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/sort.inl" "$(@D)/cuda/include/thrust/system/detail/generic/sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/generic/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/swap_ranges.inl" "$(@D)/cuda/include/thrust/system/detail/generic/swap_ranges.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/generic/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/tabulate.inl" "$(@D)/cuda/include/thrust/system/detail/generic/tabulate.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/tag.h" "$(@D)/cuda/include/thrust/system/detail/generic/tag.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/generic/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/temporary_buffer.inl" "$(@D)/cuda/include/thrust/system/detail/generic/temporary_buffer.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform_reduce.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform_reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform_scan.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform_scan.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform_scan.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/type_traits.h" "$(@D)/cuda/include/thrust/system/detail/generic/type_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/uninitialized_copy.inl" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/uninitialized_fill.inl" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_fill.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/unique.h" "$(@D)/cuda/include/thrust/system/detail/generic/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/unique.inl" "$(@D)/cuda/include/thrust/system/detail/generic/unique.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/generic/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/unique_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/internal/decompose.h" "$(@D)/cuda/include/thrust/system/detail/internal/decompose.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/detail/sequential/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/assign_value.h" "$(@D)/cuda/include/thrust/system/detail/sequential/assign_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/sequential/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/copy.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/copy_backward.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy_backward.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/count.h" "$(@D)/cuda/include/thrust/system/detail/sequential/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/equal.h" "$(@D)/cuda/include/thrust/system/detail/sequential/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/execution_policy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/extrema.h" 
"$(@D)/cuda/include/thrust/system/detail/sequential/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/fill.h" "$(@D)/cuda/include/thrust/system/detail/sequential/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/find.h" "$(@D)/cuda/include/thrust/system/detail/sequential/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/for_each.h" "$(@D)/cuda/include/thrust/system/detail/sequential/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/gather.h" "$(@D)/cuda/include/thrust/system/detail/sequential/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/general_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/general_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/generate.h" "$(@D)/cuda/include/thrust/system/detail/sequential/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/get_value.h" "$(@D)/cuda/include/thrust/system/detail/sequential/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/sequential/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/insertion_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/insertion_sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/iter_swap.h" "$(@D)/cuda/include/thrust/system/detail/sequential/iter_swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/logical.h" "$(@D)/cuda/include/thrust/system/detail/sequential/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/detail/sequential/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/merge.h" "$(@D)/cuda/include/thrust/system/detail/sequential/merge.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/sequential/merge.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/merge.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/sequential/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/partition.h" "$(@D)/cuda/include/thrust/system/detail/sequential/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/reduce.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/remove.h" "$(@D)/cuda/include/thrust/system/detail/sequential/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/replace.h" "$(@D)/cuda/include/thrust/system/detail/sequential/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/reverse.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/scan.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/scatter.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/sequence.h" "$(@D)/cuda/include/thrust/system/detail/sequential/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/sequential/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/sort.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/sequential/sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_merge_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_merge_sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_merge_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_merge_sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_primitive_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_primitive_sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_primitive_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_primitive_sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_radix_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_radix_sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_radix_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_radix_sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/sequential/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/sequential/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/sequential/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/transform.h" "$(@D)/cuda/include/thrust/system/detail/sequential/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/sequential/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/transform_scan.h" 
"$(@D)/cuda/include/thrust/system/detail/sequential/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/trivial_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/trivial_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/detail/sequential/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/unique.h" "$(@D)/cuda/include/thrust/system/detail/sequential/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/system_error.inl" "$(@D)/cuda/include/thrust/system/detail/system_error.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/error_code.h" "$(@D)/cuda/include/thrust/system/error_code.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/omp/detail/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/omp/detail/assign_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/omp/detail/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/copy.h" "$(@D)/cuda/include/thrust/system/omp/detail/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/copy.inl" "$(@D)/cuda/include/thrust/system/omp/detail/copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/omp/detail/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/copy_if.inl" 
"$(@D)/cuda/include/thrust/system/omp/detail/copy_if.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/count.h" "$(@D)/cuda/include/thrust/system/omp/detail/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/default_decomposition.h" "$(@D)/cuda/include/thrust/system/omp/detail/default_decomposition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/default_decomposition.inl" "$(@D)/cuda/include/thrust/system/omp/detail/default_decomposition.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/equal.h" "$(@D)/cuda/include/thrust/system/omp/detail/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/omp/detail/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/extrema.h" "$(@D)/cuda/include/thrust/system/omp/detail/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/fill.h" "$(@D)/cuda/include/thrust/system/omp/detail/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/find.h" "$(@D)/cuda/include/thrust/system/omp/detail/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/for_each.h" "$(@D)/cuda/include/thrust/system/omp/detail/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/for_each.inl" "$(@D)/cuda/include/thrust/system/omp/detail/for_each.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/gather.h" "$(@D)/cuda/include/thrust/system/omp/detail/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/generate.h" "$(@D)/cuda/include/thrust/system/omp/detail/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/get_value.h" "$(@D)/cuda/include/thrust/system/omp/detail/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/omp/detail/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/iter_swap.h" 
"$(@D)/cuda/include/thrust/system/omp/detail/iter_swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/logical.h" "$(@D)/cuda/include/thrust/system/omp/detail/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/omp/detail/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/memory.inl" "$(@D)/cuda/include/thrust/system/omp/detail/memory.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/merge.h" "$(@D)/cuda/include/thrust/system/omp/detail/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/omp/detail/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/par.h" "$(@D)/cuda/include/thrust/system/omp/detail/par.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/partition.h" "$(@D)/cuda/include/thrust/system/omp/detail/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/partition.inl" "$(@D)/cuda/include/thrust/system/omp/detail/partition.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce_intervals.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_intervals.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce_intervals.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_intervals.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/remove.h" 
"$(@D)/cuda/include/thrust/system/omp/detail/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/remove.inl" "$(@D)/cuda/include/thrust/system/omp/detail/remove.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/replace.h" "$(@D)/cuda/include/thrust/system/omp/detail/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reverse.h" "$(@D)/cuda/include/thrust/system/omp/detail/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/scan.h" "$(@D)/cuda/include/thrust/system/omp/detail/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/scatter.h" "$(@D)/cuda/include/thrust/system/omp/detail/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/sequence.h" "$(@D)/cuda/include/thrust/system/omp/detail/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/omp/detail/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/sort.h" "$(@D)/cuda/include/thrust/system/omp/detail/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/sort.inl" "$(@D)/cuda/include/thrust/system/omp/detail/sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/omp/detail/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/omp/detail/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/omp/detail/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/transform.h" "$(@D)/cuda/include/thrust/system/omp/detail/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/transform_reduce.h" 
"$(@D)/cuda/include/thrust/system/omp/detail/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/omp/detail/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/omp/detail/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/omp/detail/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/unique.h" "$(@D)/cuda/include/thrust/system/omp/detail/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/unique.inl" "$(@D)/cuda/include/thrust/system/omp/detail/unique.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/omp/detail/unique_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/vector.inl" "$(@D)/cuda/include/thrust/system/omp/detail/vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/execution_policy.h" "$(@D)/cuda/include/thrust/system/omp/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/memory.h" "$(@D)/cuda/include/thrust/system/omp/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/vector.h" "$(@D)/cuda/include/thrust/system/omp/vector.h" && cp "/usr/local/cuda-9.0/include/thrust/system/system_error.h" "$(@D)/cuda/include/thrust/system/system_error.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/tbb/detail/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/tbb/detail/assign_value.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/tbb/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/tbb/detail/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/copy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/copy.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/tbb/detail/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/copy_if.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/copy_if.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/count.h" "$(@D)/cuda/include/thrust/system/tbb/detail/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/equal.h" "$(@D)/cuda/include/thrust/system/tbb/detail/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/extrema.h" "$(@D)/cuda/include/thrust/system/tbb/detail/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/fill.h" "$(@D)/cuda/include/thrust/system/tbb/detail/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/find.h" "$(@D)/cuda/include/thrust/system/tbb/detail/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/for_each.h" "$(@D)/cuda/include/thrust/system/tbb/detail/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/for_each.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/for_each.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/gather.h" "$(@D)/cuda/include/thrust/system/tbb/detail/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/generate.h" "$(@D)/cuda/include/thrust/system/tbb/detail/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/get_value.h" 
"$(@D)/cuda/include/thrust/system/tbb/detail/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/tbb/detail/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/tbb/detail/iter_swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/logical.h" "$(@D)/cuda/include/thrust/system/tbb/detail/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/tbb/detail/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/memory.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/memory.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/merge.h" "$(@D)/cuda/include/thrust/system/tbb/detail/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/merge.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/merge.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/tbb/detail/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/par.h" "$(@D)/cuda/include/thrust/system/tbb/detail/par.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/partition.h" "$(@D)/cuda/include/thrust/system/tbb/detail/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/partition.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/partition.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reduce.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reduce.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reduce_by_key.inl" 
"$(@D)/cuda/include/thrust/system/tbb/detail/reduce_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reduce_intervals.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce_intervals.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/remove.h" "$(@D)/cuda/include/thrust/system/tbb/detail/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/remove.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/remove.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/replace.h" "$(@D)/cuda/include/thrust/system/tbb/detail/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reverse.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/scan.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/scan.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/scan.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/scatter.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/sequence.h" "$(@D)/cuda/include/thrust/system/tbb/detail/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/tbb/detail/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/sort.h" "$(@D)/cuda/include/thrust/system/tbb/detail/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/sort.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/tbb/detail/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/tabulate.h" 
"$(@D)/cuda/include/thrust/system/tbb/detail/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/tbb/detail/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/transform.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/tbb/detail/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/unique.h" "$(@D)/cuda/include/thrust/system/tbb/detail/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/unique.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/unique.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/unique_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/vector.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/execution_policy.h" "$(@D)/cuda/include/thrust/system/tbb/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/memory.h" "$(@D)/cuda/include/thrust/system/tbb/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/vector.h" "$(@D)/cuda/include/thrust/system/tbb/vector.h" && cp "/usr/local/cuda-9.0/include/thrust/system_error.h" 
"$(@D)/cuda/include/thrust/system_error.h" && cp "/usr/local/cuda-9.0/include/thrust/tabulate.h" "$(@D)/cuda/include/thrust/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/transform.h" "$(@D)/cuda/include/thrust/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/transform_reduce.h" "$(@D)/cuda/include/thrust/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/transform_scan.h" "$(@D)/cuda/include/thrust/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/tuple.h" "$(@D)/cuda/include/thrust/tuple.h" && cp "/usr/local/cuda-9.0/include/thrust/uninitialized_copy.h" "$(@D)/cuda/include/thrust/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/uninitialized_fill.h" "$(@D)/cuda/include/thrust/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/unique.h" "$(@D)/cuda/include/thrust/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/version.h" "$(@D)/cuda/include/thrust/version.h" && cp "/usr/local/cuda-9.0/include/vector_functions.h" "$(@D)/cuda/include/vector_functions.h" && cp "/usr/local/cuda-9.0/include/vector_functions.hpp" "$(@D)/cuda/include/vector_functions.hpp" && cp "/usr/local/cuda-9.0/include/vector_types.h" "$(@D)/cuda/include/vector_types.h"
    """,
 )
 
@@ -1264,72 +1192,69 @@ genrule(
     name = "cuda-nvvm",
     outs = [
         "cuda/nvvm/bin/cicc",
-        "cuda/nvvm/libdevice/libdevice.compute_50.10.bc",
-        "cuda/nvvm/libdevice/libdevice.compute_30.10.bc",
-        "cuda/nvvm/libdevice/libdevice.compute_20.10.bc",
-        "cuda/nvvm/libdevice/libdevice.compute_35.10.bc",
-        "cuda/nvvm/lib64/libnvvm.so.3",
-        "cuda/nvvm/lib64/libnvvm.so",
-        "cuda/nvvm/lib64/libnvvm.so.3.1.0",
         "cuda/nvvm/include/nvvm.h",
-        "cuda/nvvm/libnvvm-samples/ptxgen/README.txt",
-        "cuda/nvvm/libnvvm-samples/ptxgen/ptxgen.c",
-        "cuda/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt",
+        "cuda/nvvm/lib64/libnvvm.so",
+        "cuda/nvvm/lib64/libnvvm.so.3",
+        "cuda/nvvm/lib64/libnvvm.so.3.2.0",
+        "cuda/nvvm/libdevice/libdevice.10.bc",
+        "cuda/nvvm/libnvvm-samples/CMakeLists.txt",
+        "cuda/nvvm/libnvvm-samples/README.txt",
         "cuda/nvvm/libnvvm-samples/build.bat",
-        "cuda/nvvm/libnvvm-samples/cuda-c-linking/README.txt",
-        "cuda/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu",
+        "cuda/nvvm/libnvvm-samples/build.sh",
+        "cuda/nvvm/libnvvm-samples/common/include/DDSWriter.h",
+        "cuda/nvvm/libnvvm-samples/common/include/drvapi_error_string.h",
         "cuda/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt",
+        "cuda/nvvm/libnvvm-samples/cuda-c-linking/README.txt",
         "cuda/nvvm/libnvvm-samples/cuda-c-linking/cuda-c-linking.cpp",
-        "cuda/nvvm/libnvvm-samples/README.txt",
-        "cuda/nvvm/libnvvm-samples/simple/simple.c",
-        "cuda/nvvm/libnvvm-samples/simple/simple-gpu.ll",
+        "cuda/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu",
+        "cuda/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt",
+        "cuda/nvvm/libnvvm-samples/ptxgen/README.txt",
+        "cuda/nvvm/libnvvm-samples/ptxgen/ptxgen.c",
+        "cuda/nvvm/libnvvm-samples/simple/CMakeLists.txt",
         "cuda/nvvm/libnvvm-samples/simple/README.txt",
+        "cuda/nvvm/libnvvm-samples/simple/simple-gpu.ll",
         "cuda/nvvm/libnvvm-samples/simple/simple-gpu64.ll",
-        "cuda/nvvm/libnvvm-samples/simple/CMakeLists.txt",
-        "cuda/nvvm/libnvvm-samples/common/include/DDSWriter.h",
-        "cuda/nvvm/libnvvm-samples/common/include/drvapi_error_string.h",
-        "cuda/nvvm/libnvvm-samples/build.sh",
-        "cuda/nvvm/libnvvm-samples/CMakeLists.txt",
+        "cuda/nvvm/libnvvm-samples/simple/simple.c",
     ],
     cmd = """
-if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-8.0/nvvm/bin/cicc" "$(@D)/cuda/nvvm/bin/cicc" && cp "/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_50.10.bc" "$(@D)/cuda/nvvm/libdevice/libdevice.compute_50.10.bc" && cp "/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_30.10.bc" "$(@D)/cuda/nvvm/libdevice/libdevice.compute_30.10.bc" && cp "/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_20.10.bc" "$(@D)/cuda/nvvm/libdevice/libdevice.compute_20.10.bc" && cp "/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_35.10.bc" "$(@D)/cuda/nvvm/libdevice/libdevice.compute_35.10.bc" && cp "/usr/local/cuda-8.0/nvvm/lib64/libnvvm.so.3" "$(@D)/cuda/nvvm/lib64/libnvvm.so.3" && cp "/usr/local/cuda-8.0/nvvm/lib64/libnvvm.so" "$(@D)/cuda/nvvm/lib64/libnvvm.so" && cp "/usr/local/cuda-8.0/nvvm/lib64/libnvvm.so.3.1.0" "$(@D)/cuda/nvvm/lib64/libnvvm.so.3.1.0" && cp "/usr/local/cuda-8.0/nvvm/include/nvvm.h" "$(@D)/cuda/nvvm/include/nvvm.h" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/ptxgen/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/README.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/ptxgen/ptxgen.c" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/ptxgen.c" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/build.bat" "$(@D)/cuda/nvvm/libnvvm-samples/build.bat" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/cuda-c-linking/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/README.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt" 
"$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/cuda-c-linking/cuda-c-linking.cpp" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/cuda-c-linking.cpp" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/README.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/simple/simple.c" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple.c" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/simple/simple-gpu.ll" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple-gpu.ll" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/simple/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/simple/README.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/simple/simple-gpu64.ll" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple-gpu64.ll" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/simple/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/simple/CMakeLists.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/common/include/DDSWriter.h" "$(@D)/cuda/nvvm/libnvvm-samples/common/include/DDSWriter.h" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/common/include/drvapi_error_string.h" "$(@D)/cuda/nvvm/libnvvm-samples/common/include/drvapi_error_string.h" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/build.sh" "$(@D)/cuda/nvvm/libnvvm-samples/build.sh" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/CMakeLists.txt"
+if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-9.0/nvvm/bin/cicc" "$(@D)/cuda/nvvm/bin/cicc" && cp "/usr/local/cuda-9.0/nvvm/include/nvvm.h" "$(@D)/cuda/nvvm/include/nvvm.h" && cp "/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so" "$(@D)/cuda/nvvm/lib64/libnvvm.so" && cp "/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so.3" "$(@D)/cuda/nvvm/lib64/libnvvm.so.3" && cp "/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so.3.2.0" "$(@D)/cuda/nvvm/lib64/libnvvm.so.3.2.0" && cp "/usr/local/cuda-9.0/nvvm/libdevice/libdevice.10.bc" "$(@D)/cuda/nvvm/libdevice/libdevice.10.bc" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/CMakeLists.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/README.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/build.bat" "$(@D)/cuda/nvvm/libnvvm-samples/build.bat" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/build.sh" "$(@D)/cuda/nvvm/libnvvm-samples/build.sh" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/common/include/DDSWriter.h" "$(@D)/cuda/nvvm/libnvvm-samples/common/include/DDSWriter.h" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/common/include/drvapi_error_string.h" "$(@D)/cuda/nvvm/libnvvm-samples/common/include/drvapi_error_string.h" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/cuda-c-linking/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/README.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/cuda-c-linking/cuda-c-linking.cpp" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/cuda-c-linking.cpp" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu" 
"$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/ptxgen/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/README.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/ptxgen/ptxgen.c" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/ptxgen.c" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/simple/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/simple/CMakeLists.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/simple/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/simple/README.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/simple/simple-gpu.ll" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple-gpu.ll" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/simple/simple-gpu64.ll" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple-gpu64.ll" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/simple/simple.c" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple.c"
    """,
 )
 
 genrule(
     name = "cuda-extras",
     outs = [
-        "cuda/extras/CUPTI/include/cupti_result.h",
+        "cuda/extras/CUPTI/include/GL/gl.h",
+        "cuda/extras/CUPTI/include/GL/glew.h",
+        "cuda/extras/CUPTI/include/GL/glext.h",
+        "cuda/extras/CUPTI/include/GL/glu.h",
+        "cuda/extras/CUPTI/include/GL/glut.h",
+        "cuda/extras/CUPTI/include/GL/glx.h",
+        "cuda/extras/CUPTI/include/GL/glxext.h",
+        "cuda/extras/CUPTI/include/GL/wglew.h",
+        "cuda/extras/CUPTI/include/GL/wglext.h",
+        "cuda/extras/CUPTI/include/cuda_stdint.h",
+        "cuda/extras/CUPTI/include/cupti.h",
+        "cuda/extras/CUPTI/include/cupti_activity.h",
+        "cuda/extras/CUPTI/include/cupti_callbacks.h",
+        "cuda/extras/CUPTI/include/cupti_driver_cbid.h",
         "cuda/extras/CUPTI/include/cupti_events.h",
-        "cuda/extras/CUPTI/include/openacc/cupti_openacc.h",
+        "cuda/extras/CUPTI/include/cupti_metrics.h",
+        "cuda/extras/CUPTI/include/cupti_nvtx_cbid.h",
+        "cuda/extras/CUPTI/include/cupti_result.h",
+        "cuda/extras/CUPTI/include/cupti_runtime_cbid.h",
         "cuda/extras/CUPTI/include/cupti_version.h",
-        "cuda/extras/CUPTI/include/generated_cuda_gl_interop_meta.h",
+        "cuda/extras/CUPTI/include/generated_cudaGL_meta.h",
         "cuda/extras/CUPTI/include/generated_cudaVDPAU_meta.h",
-        "cuda/extras/CUPTI/include/cupti_activity.h",
-        "cuda/extras/CUPTI/include/generated_cuda_runtime_api_meta.h",
+        "cuda/extras/CUPTI/include/generated_cuda_gl_interop_meta.h",
         "cuda/extras/CUPTI/include/generated_cuda_meta.h",
-        "cuda/extras/CUPTI/include/cupti_nvtx_cbid.h",
-        "cuda/extras/CUPTI/include/cuda_stdint.h",
-        "cuda/extras/CUPTI/include/generated_cudaGL_meta.h",
+        "cuda/extras/CUPTI/include/generated_cuda_runtime_api_meta.h",
         "cuda/extras/CUPTI/include/generated_cuda_vdpau_interop_meta.h",
-        "cuda/extras/CUPTI/include/cupti_metrics.h",
-        "cuda/extras/CUPTI/include/cupti_callbacks.h",
-        "cuda/extras/CUPTI/include/cupti_runtime_cbid.h",
-        "cuda/extras/CUPTI/include/cupti.h",
-        "cuda/extras/CUPTI/include/GL/glut.h",
-        "cuda/extras/CUPTI/include/GL/glu.h",
-        "cuda/extras/CUPTI/include/GL/glxext.h",
-        "cuda/extras/CUPTI/include/GL/wglext.h",
-        "cuda/extras/CUPTI/include/GL/glx.h",
-        "cuda/extras/CUPTI/include/GL/glext.h",
-        "cuda/extras/CUPTI/include/GL/wglew.h",
-        "cuda/extras/CUPTI/include/GL/gl.h",
-        "cuda/extras/CUPTI/include/GL/glew.h",
-        "cuda/extras/CUPTI/include/cupti_driver_cbid.h",
         "cuda/extras/CUPTI/include/generated_nvtx_meta.h",
+        "cuda/extras/CUPTI/include/openacc/cupti_openacc.h",
     ],
     cmd = """
-if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_result.h" "$(@D)/cuda/extras/CUPTI/include/cupti_result.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_events.h" "$(@D)/cuda/extras/CUPTI/include/cupti_events.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/openacc/cupti_openacc.h" "$(@D)/cuda/extras/CUPTI/include/openacc/cupti_openacc.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_version.h" "$(@D)/cuda/extras/CUPTI/include/cupti_version.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cuda_gl_interop_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_gl_interop_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cudaVDPAU_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cudaVDPAU_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_activity.h" "$(@D)/cuda/extras/CUPTI/include/cupti_activity.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cuda_runtime_api_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_runtime_api_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cuda_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_nvtx_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_nvtx_cbid.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cuda_stdint.h" "$(@D)/cuda/extras/CUPTI/include/cuda_stdint.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cudaGL_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cudaGL_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cuda_vdpau_interop_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_vdpau_interop_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_metrics.h" 
"$(@D)/cuda/extras/CUPTI/include/cupti_metrics.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_callbacks.h" "$(@D)/cuda/extras/CUPTI/include/cupti_callbacks.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_runtime_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_runtime_cbid.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti.h" "$(@D)/cuda/extras/CUPTI/include/cupti.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glut.h" "$(@D)/cuda/extras/CUPTI/include/GL/glut.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glu.h" "$(@D)/cuda/extras/CUPTI/include/GL/glu.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glxext.h" "$(@D)/cuda/extras/CUPTI/include/GL/glxext.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/wglext.h" "$(@D)/cuda/extras/CUPTI/include/GL/wglext.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glx.h" "$(@D)/cuda/extras/CUPTI/include/GL/glx.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glext.h" "$(@D)/cuda/extras/CUPTI/include/GL/glext.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/wglew.h" "$(@D)/cuda/extras/CUPTI/include/GL/wglew.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/gl.h" "$(@D)/cuda/extras/CUPTI/include/GL/gl.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glew.h" "$(@D)/cuda/extras/CUPTI/include/GL/glew.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_driver_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_driver_cbid.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_nvtx_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_nvtx_meta.h"
+if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/gl.h" "$(@D)/cuda/extras/CUPTI/include/GL/gl.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glew.h" "$(@D)/cuda/extras/CUPTI/include/GL/glew.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glext.h" "$(@D)/cuda/extras/CUPTI/include/GL/glext.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glu.h" "$(@D)/cuda/extras/CUPTI/include/GL/glu.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glut.h" "$(@D)/cuda/extras/CUPTI/include/GL/glut.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glx.h" "$(@D)/cuda/extras/CUPTI/include/GL/glx.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glxext.h" "$(@D)/cuda/extras/CUPTI/include/GL/glxext.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/wglew.h" "$(@D)/cuda/extras/CUPTI/include/GL/wglew.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/wglext.h" "$(@D)/cuda/extras/CUPTI/include/GL/wglext.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cuda_stdint.h" "$(@D)/cuda/extras/CUPTI/include/cuda_stdint.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti.h" "$(@D)/cuda/extras/CUPTI/include/cupti.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_activity.h" "$(@D)/cuda/extras/CUPTI/include/cupti_activity.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_callbacks.h" "$(@D)/cuda/extras/CUPTI/include/cupti_callbacks.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_driver_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_driver_cbid.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_events.h" "$(@D)/cuda/extras/CUPTI/include/cupti_events.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_metrics.h" "$(@D)/cuda/extras/CUPTI/include/cupti_metrics.h" && cp 
"/usr/local/cuda-9.0/extras/CUPTI/include/cupti_nvtx_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_nvtx_cbid.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_result.h" "$(@D)/cuda/extras/CUPTI/include/cupti_result.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_runtime_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_runtime_cbid.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_version.h" "$(@D)/cuda/extras/CUPTI/include/cupti_version.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cudaGL_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cudaGL_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cudaVDPAU_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cudaVDPAU_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cuda_gl_interop_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_gl_interop_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cuda_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cuda_runtime_api_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_runtime_api_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cuda_vdpau_interop_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_vdpau_interop_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_nvtx_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_nvtx_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/openacc/cupti_openacc.h" "$(@D)/cuda/extras/CUPTI/include/openacc/cupti_openacc.h"
    """,
 )
 
@@ -1337,26 +1262,21 @@ genrule(
     name = "cuda-lib",
     outs = [
         "cuda/lib/libcuda.so",
-        "cuda/lib/libcudart.so.8.0",
+        "cuda/lib/libcudart.so.9.0",
         "cuda/lib/libcudart_static.a",
-        "cuda/lib/libcublas.so.8.0",
-        "cuda/lib/libcusolver.so.8.0",
-        "cuda/lib/libcurand.so.8.0",
-        "cuda/lib/libcufft.so.8.0",
-        "cuda/lib/libcudnn.so.6",
-        "cuda/lib/libcupti.so.8.0",
+        "cuda/lib/libcublas.so.9.0",
+        "cuda/lib/libcusolver.so.9.0",
+        "cuda/lib/libcurand.so.9.0",
+        "cuda/lib/libcufft.so.9.0",
+        "cuda/lib/libcudnn.so.7",
+        "cuda/lib/libcupti.so.9.0",
     ],
     cmd = """
-if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/stubs/libcuda.so" "$(@D)/cuda/lib/libcuda.so" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.61" "$(@D)/cuda/lib/libcudart.so.8.0" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart_static.a" "$(@D)/cuda/lib/libcudart_static.a" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcublas.so.8.0.88" "$(@D)/cuda/lib/libcublas.so.8.0" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcusolver.so.8.0.61" "$(@D)/cuda/lib/libcusolver.so.8.0" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcurand.so.8.0.61" "$(@D)/cuda/lib/libcurand.so.8.0" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcufft.so.8.0.61" "$(@D)/cuda/lib/libcufft.so.8.0" && cp "/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21" "$(@D)/cuda/lib/libcudnn.so.6" && cp "/usr/local/cuda-8.0/extras/CUPTI/lib64/libcupti.so.8.0.61" "$(@D)/cuda/lib/libcupti.so.8.0"
+if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcuda.so" "$(@D)/cuda/lib/libcuda.so" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart.so.9.0" "$(@D)/cuda/lib/libcudart.so.9.0" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart_static.a" "$(@D)/cuda/lib/libcudart_static.a" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcublas.so.9.0" "$(@D)/cuda/lib/libcublas.so.9.0" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcusolver.so.9.0" "$(@D)/cuda/lib/libcusolver.so.9.0" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcurand.so.9.0" "$(@D)/cuda/lib/libcurand.so.9.0" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcufft.so.9.0" "$(@D)/cuda/lib/libcufft.so.9.0" && cp "/usr/lib/x86_64-linux-gnu/libcudnn.so.7" "$(@D)/cuda/lib/libcudnn.so.7" && cp "/usr/local/cuda-9.0/extras/CUPTI/lib64/libcupti.so.9.0" "$(@D)/cuda/lib/libcupti.so.9.0"
    """,
 )
 
-genrule(
+filegroup(
     name = "cudnn-include",
-    outs = [
-        "cuda/include/cudnn.h",
-    ],
-    cmd = """
-if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/include/cudnn.h" "$(@D)/cudnn.h"
-   """,
+    srcs = [],
 )
diff --git a/third_party/toolchains/gpus/py/BUILD b/third_party/toolchains/gpus/py/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..2d5ace93ff5054927cda61b0302af4edd8fe56c1
--- /dev/null
+++ b/third_party/toolchains/gpus/py/BUILD
@@ -0,0 +1,171 @@
+# A build file to configure python remote repository used with Bazel remote
+# execution service
+# DO NOT EDIT: automatically generated BUILD file
+
+licenses(["restricted"])
+
+package(default_visibility = ["//visibility:public"])
+
+cc_library(
+    name = "python_headers",
+    hdrs = [":python_include"],
+    data = select({
+        ":windows": [":python_import_lib"],
+        "//conditions:default": [],
+    }),
+    includes = ["python_include"],
+    linkopts = select({
+        # TODO(pcloudy): Ideally, this should just go into deps after resolving
+        # https://github.com/bazelbuild/bazel/issues/3237,
+        ":windows": ["$(locations :python_import_lib)"],
+        "//conditions:default": [],
+    }),
+)
+
+cc_library(
+    name = "numpy_headers",
+    hdrs = [":numpy_include"],
+    includes = ["numpy_include"],
+)
+
+config_setting(
+    name = "windows",
+    values = {"cpu": "x64_windows"},
+    visibility = ["//visibility:public"],
+)
+
+genrule(
+    name = "python_include",
+    outs = [
+        "python_include/Python-ast.h",
+        "python_include/Python.h",
+        "python_include/abstract.h",
+        "python_include/asdl.h",
+        "python_include/ast.h",
+        "python_include/bitset.h",
+        "python_include/boolobject.h",
+        "python_include/bufferobject.h",
+        "python_include/bytearrayobject.h",
+        "python_include/bytes_methods.h",
+        "python_include/bytesobject.h",
+        "python_include/cStringIO.h",
+        "python_include/cellobject.h",
+        "python_include/ceval.h",
+        "python_include/classobject.h",
+        "python_include/cobject.h",
+        "python_include/code.h",
+        "python_include/codecs.h",
+        "python_include/compile.h",
+        "python_include/complexobject.h",
+        "python_include/datetime.h",
+        "python_include/descrobject.h",
+        "python_include/dictobject.h",
+        "python_include/dtoa.h",
+        "python_include/enumobject.h",
+        "python_include/errcode.h",
+        "python_include/eval.h",
+        "python_include/fileobject.h",
+        "python_include/floatobject.h",
+        "python_include/frameobject.h",
+        "python_include/funcobject.h",
+        "python_include/genobject.h",
+        "python_include/graminit.h",
+        "python_include/grammar.h",
+        "python_include/import.h",
+        "python_include/intobject.h",
+        "python_include/intrcheck.h",
+        "python_include/iterobject.h",
+        "python_include/listobject.h",
+        "python_include/longintrepr.h",
+        "python_include/longobject.h",
+        "python_include/marshal.h",
+        "python_include/memoryobject.h",
+        "python_include/metagrammar.h",
+        "python_include/methodobject.h",
+        "python_include/modsupport.h",
+        "python_include/moduleobject.h",
+        "python_include/node.h",
+        "python_include/object.h",
+        "python_include/objimpl.h",
+        "python_include/opcode.h",
+        "python_include/osdefs.h",
+        "python_include/parsetok.h",
+        "python_include/patchlevel.h",
+        "python_include/pgen.h",
+        "python_include/pgenheaders.h",
+        "python_include/py_curses.h",
+        "python_include/pyarena.h",
+        "python_include/pycapsule.h",
+        "python_include/pyconfig.h",
+        "python_include/pyctype.h",
+        "python_include/pydebug.h",
+        "python_include/pyerrors.h",
+        "python_include/pyexpat.h",
+        "python_include/pyfpe.h",
+        "python_include/pygetopt.h",
+        "python_include/pymacconfig.h",
+        "python_include/pymactoolbox.h",
+        "python_include/pymath.h",
+        "python_include/pymem.h",
+        "python_include/pyport.h",
+        "python_include/pystate.h",
+        "python_include/pystrcmp.h",
+        "python_include/pystrtod.h",
+        "python_include/pythonrun.h",
+        "python_include/pythread.h",
+        "python_include/rangeobject.h",
+        "python_include/setobject.h",
+        "python_include/sliceobject.h",
+        "python_include/stringobject.h",
+        "python_include/structmember.h",
+        "python_include/structseq.h",
+        "python_include/symtable.h",
+        "python_include/sysmodule.h",
+        "python_include/timefuncs.h",
+        "python_include/token.h",
+        "python_include/traceback.h",
+        "python_include/tupleobject.h",
+        "python_include/ucnhash.h",
+        "python_include/unicodeobject.h",
+        "python_include/warnings.h",
+        "python_include/weakrefobject.h",
+    ],
+    cmd = """
+cp "/usr/include/python2.7/Python-ast.h" "$(@D)/python_include/Python-ast.h" && cp "/usr/include/python2.7/Python.h" "$(@D)/python_include/Python.h" && cp "/usr/include/python2.7/abstract.h" "$(@D)/python_include/abstract.h" && cp "/usr/include/python2.7/asdl.h" "$(@D)/python_include/asdl.h" && cp "/usr/include/python2.7/ast.h" "$(@D)/python_include/ast.h" && cp "/usr/include/python2.7/bitset.h" "$(@D)/python_include/bitset.h" && cp "/usr/include/python2.7/boolobject.h" "$(@D)/python_include/boolobject.h" && cp "/usr/include/python2.7/bufferobject.h" "$(@D)/python_include/bufferobject.h" && cp "/usr/include/python2.7/bytearrayobject.h" "$(@D)/python_include/bytearrayobject.h" && cp "/usr/include/python2.7/bytes_methods.h" "$(@D)/python_include/bytes_methods.h" && cp "/usr/include/python2.7/bytesobject.h" "$(@D)/python_include/bytesobject.h" && cp "/usr/include/python2.7/cStringIO.h" "$(@D)/python_include/cStringIO.h" && cp "/usr/include/python2.7/cellobject.h" "$(@D)/python_include/cellobject.h" && cp "/usr/include/python2.7/ceval.h" "$(@D)/python_include/ceval.h" && cp "/usr/include/python2.7/classobject.h" "$(@D)/python_include/classobject.h" && cp "/usr/include/python2.7/cobject.h" "$(@D)/python_include/cobject.h" && cp "/usr/include/python2.7/code.h" "$(@D)/python_include/code.h" && cp "/usr/include/python2.7/codecs.h" "$(@D)/python_include/codecs.h" && cp "/usr/include/python2.7/compile.h" "$(@D)/python_include/compile.h" && cp "/usr/include/python2.7/complexobject.h" "$(@D)/python_include/complexobject.h" && cp "/usr/include/python2.7/datetime.h" "$(@D)/python_include/datetime.h" && cp "/usr/include/python2.7/descrobject.h" "$(@D)/python_include/descrobject.h" && cp "/usr/include/python2.7/dictobject.h" "$(@D)/python_include/dictobject.h" && cp "/usr/include/python2.7/dtoa.h" "$(@D)/python_include/dtoa.h" && cp "/usr/include/python2.7/enumobject.h" "$(@D)/python_include/enumobject.h" && cp "/usr/include/python2.7/errcode.h" "$(@D)/python_include/errcode.h" 
&& cp "/usr/include/python2.7/eval.h" "$(@D)/python_include/eval.h" && cp "/usr/include/python2.7/fileobject.h" "$(@D)/python_include/fileobject.h" && cp "/usr/include/python2.7/floatobject.h" "$(@D)/python_include/floatobject.h" && cp "/usr/include/python2.7/frameobject.h" "$(@D)/python_include/frameobject.h" && cp "/usr/include/python2.7/funcobject.h" "$(@D)/python_include/funcobject.h" && cp "/usr/include/python2.7/genobject.h" "$(@D)/python_include/genobject.h" && cp "/usr/include/python2.7/graminit.h" "$(@D)/python_include/graminit.h" && cp "/usr/include/python2.7/grammar.h" "$(@D)/python_include/grammar.h" && cp "/usr/include/python2.7/import.h" "$(@D)/python_include/import.h" && cp "/usr/include/python2.7/intobject.h" "$(@D)/python_include/intobject.h" && cp "/usr/include/python2.7/intrcheck.h" "$(@D)/python_include/intrcheck.h" && cp "/usr/include/python2.7/iterobject.h" "$(@D)/python_include/iterobject.h" && cp "/usr/include/python2.7/listobject.h" "$(@D)/python_include/listobject.h" && cp "/usr/include/python2.7/longintrepr.h" "$(@D)/python_include/longintrepr.h" && cp "/usr/include/python2.7/longobject.h" "$(@D)/python_include/longobject.h" && cp "/usr/include/python2.7/marshal.h" "$(@D)/python_include/marshal.h" && cp "/usr/include/python2.7/memoryobject.h" "$(@D)/python_include/memoryobject.h" && cp "/usr/include/python2.7/metagrammar.h" "$(@D)/python_include/metagrammar.h" && cp "/usr/include/python2.7/methodobject.h" "$(@D)/python_include/methodobject.h" && cp "/usr/include/python2.7/modsupport.h" "$(@D)/python_include/modsupport.h" && cp "/usr/include/python2.7/moduleobject.h" "$(@D)/python_include/moduleobject.h" && cp "/usr/include/python2.7/node.h" "$(@D)/python_include/node.h" && cp "/usr/include/python2.7/object.h" "$(@D)/python_include/object.h" && cp "/usr/include/python2.7/objimpl.h" "$(@D)/python_include/objimpl.h" && cp "/usr/include/python2.7/opcode.h" "$(@D)/python_include/opcode.h" && cp "/usr/include/python2.7/osdefs.h" 
"$(@D)/python_include/osdefs.h" && cp "/usr/include/python2.7/parsetok.h" "$(@D)/python_include/parsetok.h" && cp "/usr/include/python2.7/patchlevel.h" "$(@D)/python_include/patchlevel.h" && cp "/usr/include/python2.7/pgen.h" "$(@D)/python_include/pgen.h" && cp "/usr/include/python2.7/pgenheaders.h" "$(@D)/python_include/pgenheaders.h" && cp "/usr/include/python2.7/py_curses.h" "$(@D)/python_include/py_curses.h" && cp "/usr/include/python2.7/pyarena.h" "$(@D)/python_include/pyarena.h" && cp "/usr/include/python2.7/pycapsule.h" "$(@D)/python_include/pycapsule.h" && cp "/usr/include/python2.7/pyconfig.h" "$(@D)/python_include/pyconfig.h" && cp "/usr/include/python2.7/pyctype.h" "$(@D)/python_include/pyctype.h" && cp "/usr/include/python2.7/pydebug.h" "$(@D)/python_include/pydebug.h" && cp "/usr/include/python2.7/pyerrors.h" "$(@D)/python_include/pyerrors.h" && cp "/usr/include/python2.7/pyexpat.h" "$(@D)/python_include/pyexpat.h" && cp "/usr/include/python2.7/pyfpe.h" "$(@D)/python_include/pyfpe.h" && cp "/usr/include/python2.7/pygetopt.h" "$(@D)/python_include/pygetopt.h" && cp "/usr/include/python2.7/pymacconfig.h" "$(@D)/python_include/pymacconfig.h" && cp "/usr/include/python2.7/pymactoolbox.h" "$(@D)/python_include/pymactoolbox.h" && cp "/usr/include/python2.7/pymath.h" "$(@D)/python_include/pymath.h" && cp "/usr/include/python2.7/pymem.h" "$(@D)/python_include/pymem.h" && cp "/usr/include/python2.7/pyport.h" "$(@D)/python_include/pyport.h" && cp "/usr/include/python2.7/pystate.h" "$(@D)/python_include/pystate.h" && cp "/usr/include/python2.7/pystrcmp.h" "$(@D)/python_include/pystrcmp.h" && cp "/usr/include/python2.7/pystrtod.h" "$(@D)/python_include/pystrtod.h" && cp "/usr/include/python2.7/pythonrun.h" "$(@D)/python_include/pythonrun.h" && cp "/usr/include/python2.7/pythread.h" "$(@D)/python_include/pythread.h" && cp "/usr/include/python2.7/rangeobject.h" "$(@D)/python_include/rangeobject.h" && cp "/usr/include/python2.7/setobject.h" 
"$(@D)/python_include/setobject.h" && cp "/usr/include/python2.7/sliceobject.h" "$(@D)/python_include/sliceobject.h" && cp "/usr/include/python2.7/stringobject.h" "$(@D)/python_include/stringobject.h" && cp "/usr/include/python2.7/structmember.h" "$(@D)/python_include/structmember.h" && cp "/usr/include/python2.7/structseq.h" "$(@D)/python_include/structseq.h" && cp "/usr/include/python2.7/symtable.h" "$(@D)/python_include/symtable.h" && cp "/usr/include/python2.7/sysmodule.h" "$(@D)/python_include/sysmodule.h" && cp "/usr/include/python2.7/timefuncs.h" "$(@D)/python_include/timefuncs.h" && cp "/usr/include/python2.7/token.h" "$(@D)/python_include/token.h" && cp "/usr/include/python2.7/traceback.h" "$(@D)/python_include/traceback.h" && cp "/usr/include/python2.7/tupleobject.h" "$(@D)/python_include/tupleobject.h" && cp "/usr/include/python2.7/ucnhash.h" "$(@D)/python_include/ucnhash.h" && cp "/usr/include/python2.7/unicodeobject.h" "$(@D)/python_include/unicodeobject.h" && cp "/usr/include/python2.7/warnings.h" "$(@D)/python_include/warnings.h" && cp "/usr/include/python2.7/weakrefobject.h" "$(@D)/python_include/weakrefobject.h"
+   """,
+)
+
+genrule(
+    name = "numpy_include",
+    outs = [
+        "numpy_include/numpy/__multiarray_api.h",
+        "numpy_include/numpy/__ufunc_api.h",
+        "numpy_include/numpy/_neighborhood_iterator_imp.h",
+        "numpy_include/numpy/_numpyconfig.h",
+        "numpy_include/numpy/arrayobject.h",
+        "numpy_include/numpy/arrayscalars.h",
+        "numpy_include/numpy/halffloat.h",
+        "numpy_include/numpy/multiarray_api.txt",
+        "numpy_include/numpy/ndarrayobject.h",
+        "numpy_include/numpy/ndarraytypes.h",
+        "numpy_include/numpy/noprefix.h",
+        "numpy_include/numpy/npy_1_7_deprecated_api.h",
+        "numpy_include/numpy/npy_3kcompat.h",
+        "numpy_include/numpy/npy_common.h",
+        "numpy_include/numpy/npy_cpu.h",
+        "numpy_include/numpy/npy_endian.h",
+        "numpy_include/numpy/npy_interrupt.h",
+        "numpy_include/numpy/npy_math.h",
+        "numpy_include/numpy/npy_no_deprecated_api.h",
+        "numpy_include/numpy/npy_os.h",
+        "numpy_include/numpy/numpyconfig.h",
+        "numpy_include/numpy/old_defines.h",
+        "numpy_include/numpy/oldnumeric.h",
+        "numpy_include/numpy/ufunc_api.txt",
+        "numpy_include/numpy/ufuncobject.h",
+        "numpy_include/numpy/utils.h",
+    ],
+    cmd = """
+cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/__multiarray_api.h" "$(@D)/numpy_include/numpy/__multiarray_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/__ufunc_api.h" "$(@D)/numpy_include/numpy/__ufunc_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/_neighborhood_iterator_imp.h" "$(@D)/numpy_include/numpy/_neighborhood_iterator_imp.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/_numpyconfig.h" "$(@D)/numpy_include/numpy/_numpyconfig.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h" "$(@D)/numpy_include/numpy/arrayobject.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayscalars.h" "$(@D)/numpy_include/numpy/arrayscalars.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/halffloat.h" "$(@D)/numpy_include/numpy/halffloat.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/multiarray_api.txt" "$(@D)/numpy_include/numpy/multiarray_api.txt" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h" "$(@D)/numpy_include/numpy/ndarrayobject.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h" "$(@D)/numpy_include/numpy/ndarraytypes.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/noprefix.h" "$(@D)/numpy_include/numpy/noprefix.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_1_7_deprecated_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_3kcompat.h" "$(@D)/numpy_include/numpy/npy_3kcompat.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_common.h" "$(@D)/numpy_include/numpy/npy_common.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_cpu.h" "$(@D)/numpy_include/numpy/npy_cpu.h" && cp 
"/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_endian.h" "$(@D)/numpy_include/numpy/npy_endian.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_interrupt.h" "$(@D)/numpy_include/numpy/npy_interrupt.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_math.h" "$(@D)/numpy_include/numpy/npy_math.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_no_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_no_deprecated_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_os.h" "$(@D)/numpy_include/numpy/npy_os.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/numpyconfig.h" "$(@D)/numpy_include/numpy/numpyconfig.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/old_defines.h" "$(@D)/numpy_include/numpy/old_defines.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/oldnumeric.h" "$(@D)/numpy_include/numpy/oldnumeric.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ufunc_api.txt" "$(@D)/numpy_include/numpy/ufunc_api.txt" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ufuncobject.h" "$(@D)/numpy_include/numpy/ufuncobject.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/utils.h" "$(@D)/numpy_include/numpy/utils.h"
+   """,
+)
diff --git a/tools/bazel.rc b/tools/bazel.rc
index 8b8c71756171387b7a4b834ea94015a00313492e..1c1e6afb65ab8da5b689d58ecaec6ac6c8a69bb8 100644
--- a/tools/bazel.rc
+++ b/tools/bazel.rc
@@ -27,11 +27,14 @@ build --define framework_shared_object=true
 build:mkl --define=using_mkl=true
 build:mkl -c opt
 
+build:download_clang --crosstool_top=@local_config_download_clang//:toolchain
+build:download_clang --define=using_clang=true
+
 build:cuda --crosstool_top=@local_config_cuda//crosstool:toolchain
 build:cuda --define=using_cuda=true --define=using_cuda_nvcc=true
 
 build:cuda_clang --crosstool_top=@local_config_cuda//crosstool:toolchain
-build:cuda_clang --define=using_cuda=true --define=using_cuda_clang=true
+build:cuda_clang --define=using_cuda=true --define=using_cuda_clang=true --define=using_clang=true
 
 build:win-cuda --define=using_cuda=true --define=using_cuda_nvcc=true